๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+ which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dhruv Anand, Ehsan Shareghi
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.
๐
2025-12-23
๐ ArXiv Preprint (cs.CY)
๐ฅ ฤฐbrahim Oฤuz รetinkaya, Sajad Khodadadian, Taylan G. Topรงu
Machine Learning
ๆฅ็ๆ่ฆ
As systems engineering (SE) objectives evolve from design and operation of monolithic systems to complex System of Systems (SoS), the discipline of Mission Engineering (ME) has emerged which is increasingly being accepted as a new line of thinking for the SE community. Moreover, mission environments are uncertain, dynamic, and mission outcomes are a direct function of how the mission assets will interact with this environment. This proves static architectures brittle and calls for analytically rigorous approaches for ME. To that end, this paper proposes an intelligent mission coordination methodology that integrates digital mission models with Reinforcement Learning (RL), that specifically addresses the need for adaptive task allocation and reconfiguration. More specifically, we are leveraging a Digital Engineering (DE) based infrastructure that is composed of a high-fidelity digital mission model and agent-based simulation; and then we formulate the mission tactics management problem as a Markov Decision Process (MDP), and employ an RL agent trained via Proximal Policy Optimization. By leveraging the simulation as a sandbox, we map the system states to actions, refining the policy based on realized mission outcomes. The utility of the RL-based intelligent mission coordinator is demonstrated through an aerial firefighting case study. Our findings indicate that the RL-based intelligent mission coordinator not only surpasses baseline performance but also significantly reduces the variability in mission performance. Thus, this study serves as a proof of concept demonstrating that DE-enabled mission simulations combined with advanced analytical tools offer a mission-agnostic framework for improving ME practice; which can be extended to more complicated fleet design and selection problems in the future from a mission-first perspective.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Humza Nusrat, Luke Francisco, Bing Luo, Hassan Bagher-Ebadian, Joshua Kim et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Stereotactic radiosurgery (SRS) demands precise dose shaping around critical structures, yet black-box AI systems have limited clinical adoption due to opacity concerns. We tested whether chain-of-thought reasoning improves agentic planning in a retrospective cohort of 41 patients with brain metastases treated with 18 Gy single-fraction SRS. We developed SAGE (Secure Agent for Generative Dose Expertise), an LLM-based planning agent for automated SRS treatment planning. Two variants generated plans for each case: one using a non-reasoning model, one using a reasoning model. The reasoning variant showed comparable plan dosimetry relative to human planners on primary endpoints (PTV coverage, maximum dose, conformity index, gradient index; all p > 0.21) while reducing cochlear dose below human baselines (p = 0.022). When prompted to improve conformity, the reasoning model demonstrated systematic planning behaviors including prospective constraint verification (457 instances) and trade-off deliberation (609 instances), while the standard model exhibited none of these deliberative processes (0 and 7 instances, respectively). Content analysis revealed that constraint verification and causal explanation concentrated in the reasoning agent. The optimization traces serve as auditable logs, offering a path toward transparent automated planning.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Debabrota Basu, Udvas Das, Brahim Driss, Uddalak Mukherjee
Machine Learning
ๆฅ็ๆ่ฆ
Post-deployment machine learning algorithms often influence the environments they act in, and thus shift the underlying dynamics that the standard reinforcement learning (RL) methods ignore. While designing optimal algorithms in this performative setting has recently been studied in supervised learning, the RL counterpart remains under-explored. In this paper, we prove the performative counterparts of the performance difference lemma and the policy gradient theorem in RL, and further introduce the Performative Policy Gradient algorithm (PePG). PePG is the first policy gradient algorithm designed to account for performativity in RL. Under softmax parametrisation, and also with and without entropy regularisation, we prove that PePG converges to performatively optimal policies, i.e. policies that remain optimal under the distribution shifts induced by themselves. Thus, PePG significantly extends the prior works in Performative RL that achieves performative stability but not optimality. Furthermore, our empirical analysis on standard performative RL environments validate that PePG outperforms standard policy gradient algorithms and the existing performative RL algorithms aiming for stability.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rui Pan, Zhuofu Chen, Ravi Netravali
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.4$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding on which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected we use a recent pipeline for the distillation process itself \citep[RADLADS;][]{goldstein2025radlads}, which consists of attention weight transfer, hidden state alignment, KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attentions based on a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl et al.
Robotics
ๆฅ็ๆ่ฆ
Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles' actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performances on Longest6~v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at https://github.com/autonomousvision/lead.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhiyi Duan, Xiangren Wang, Hongyu Yuan, Qianli Xing
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Teachers' emotional states are critical in educational scenarios, profoundly impacting teaching efficacy, student engagement, and learning achievements. However, existing studies often fail to accurately capture teachers' emotions due to the performative nature and overlook the critical impact of instructional information on emotional expression.In this paper, we systematically investigate teacher sentiment analysis by building both the dataset and the model accordingly. We construct the first large-scale teacher multimodal sentiment analysis dataset, T-MED.To ensure labeling accuracy and efficiency, we employ a human-machine collaborative labeling process.The T-MED dataset includes 14,938 instances of teacher emotional data from 250 real classrooms across 11 subjects ranging from K-12 to higher education, integrating multimodal text, audio, video, and instructional information.Furthermore, we propose a novel asymmetric attention-based multimodal teacher sentiment analysis model, AAM-TSA.AAM-TSA introduces an asymmetric attention mechanism and hierarchical gating unit to enable differentiated cross-modal feature fusion and precise emotional classification. Experimental results demonstrate that AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chehak Malhotra, Mehak Gopal, Akshaya Devadiga, Pradeep Singh, Ridam Pal et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
With the advent of LLMs, various tasks across the natural language processing domain have been transformed. However, their application in predictive tasks remains less researched. This study compares large language models, including GatorTron-Base (trained on clinical data), Llama 8B, and Mistral 7B, against models like BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec, setting benchmarks for predicting Shock in critically ill patients. Timely prediction of shock can enable early interventions, thus improving patient outcomes. Text data from 17,294 ICU stays of patients in the MIMIC III database were scored for length of stay > 24 hours and shock index (SI) > 0.7 to yield 355 and 87 patients with normal and abnormal SI-index, respectively. Both focal and cross-entropy losses were used during finetuning to address class imbalances. Our findings indicate that while GatorTron Base achieved the highest weighted recall of 80.5%, the overall performance metrics were comparable between SLMs and LLMs. This suggests that LLMs are not inherently superior to SLMs in predicting future clinical events despite their strong performance on text-based tasks. To achieve meaningful clinical outcomes, future efforts in training LLMs should prioritize developing models capable of predicting clinical trajectories rather than focusing on simpler tasks such as named entity recognition or phenotyping.
๐
2025-12-23
๐ ArXiv Preprint (cs.SE)
๐ฅ Revanth Gangi Reddy, Ye Liu, Wenting Zhao, JaeHyeok Doo, Tarun Suresh et al.
ๆฅ็ๆ่ฆ
Maintaining large-scale, multilingual codebases hinges on accurately localizing issues, which requires mapping natural-language error descriptions to the relevant functions that need to be modified. However, existing ranking approaches are often Python-centric and perform a single-pass search over the codebase. This work introduces SweRank+, a framework that couples SweRankMulti, a cross-lingual code ranking tool, with SweRankAgent, an agentic search setup, for iterative, multi-turn reasoning over the code repository. SweRankMulti comprises a code embedding retriever and a listwise LLM reranker, and is trained using a carefully curated large-scale issue localization dataset spanning multiple popular programming languages. SweRankAgent adopts an agentic search loop that moves beyond single-shot localization with a memory buffer to reason and accumulate relevant localization candidates over multiple turns. Our experiments on issue localization benchmarks spanning various languages demonstrate new state-of-the-art performance with SweRankMulti, while SweRankAgent further improves localization over single-pass ranking.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Linfeng Zhang, Siheng Chen, Yuzhu Cai, Jingyi Chai, Junhan Chang et al.
ๆฅ็ๆ่ฆ
AI agents are emerging as a practical way to run multi-step scientific workflows that interleave reasoning with tool use and verification, pointing to a shift from isolated AI-assisted steps toward \emph{agentic science at scale}. This shift is increasingly feasible, as scientific tools and models can be invoked through stable interfaces and verified with recorded execution traces, and increasingly necessary, as AI accelerates scientific output and stresses the peer-review and publication pipeline, raising the bar for traceability and credible evaluation.
However, scaling agentic science remains difficult: workflows are hard to observe and reproduce; many tools and laboratory systems are not agent-ready; execution is hard to trace and govern; and prototype AI Scientist systems are often bespoke, limiting reuse and systematic improvement from real workflow signals.
We argue that scaling agentic science requires an infrastructure-and-ecosystem approach, instantiated in Bohrium+SciMaster. Bohrium acts as a managed, traceable hub for AI4S assets -- akin to a HuggingFace of AI for Science -- that turns diverse scientific data, software, compute, and laboratory systems into agent-ready capabilities. SciMaster orchestrates these capabilities into long-horizon scientific workflows, on which scientific agents can be composed and executed. Between infrastructure and orchestration, a \emph{scientific intelligence substrate} organizes reusable models, knowledge, and components into executable building blocks for workflow reasoning and action, enabling composition, auditability, and improvement through use.
We demonstrate this stack with eleven representative master agents in real workflows, achieving orders-of-magnitude reductions in end-to-end scientific cycle time and generating execution-grounded signals from real workloads at multi-million scale.
๐
2025-12-23
๐ ArXiv Preprint (eess.IV)
๐ฅ Muhammad Usman, Azka Rehman, Muhammad Mutti Ur Rehman, Abd Ur Rehman, Muhammad Umar Farooq
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Accurate segmentation of ischemic stroke lesions from diffusion magnetic resonance imaging (MRI) is essential for clinical decision-making and outcome assessment. Diffusion-Weighted Imaging (DWI) and Apparent Diffusion Coefficient (ADC) scans provide complementary information on acute and sub-acute ischemic changes; however, automated lesion delineation remains challenging due to variability in lesion appearance.
In this work, we study ischemic stroke lesion segmentation using multimodal diffusion MRI from the ISLES 2022 dataset. Several state-of-the-art convolutional and transformer-based architectures, including U-Net variants, Swin-UNet, and TransUNet, are benchmarked. Based on performance, a dual-encoder TransUNet architecture is proposed to learn modality-specific representations from DWI and ADC inputs. To incorporate spatial context, adjacent slice information is integrated using a three-slice input configuration.
All models are trained under a unified framework and evaluated using the Dice Similarity Coefficient (DSC). Results show that transformer-based models outperform convolutional baselines, and the proposed dual-encoder TransUNet achieves the best performance, reaching a Dice score of 85.4% on the test set. The proposed framework offers a robust solution for automated ischemic stroke lesion segmentation from diffusion MRI.
๐
2025-12-23
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Adam Elaoumari
Machine Learning
ๆฅ็ๆ่ฆ
The purpose of this project is to assess how well defenders can detect DNS-over-HTTPS (DoH) file exfiltration, and which evasion strategies can be used by attackers. While providing a reproducible toolkit to generate, intercept and analyze DoH exfiltration, and comparing Machine Learning vs threshold-based detection under adversarial scenarios. The originality of this project is the introduction of an end-to-end, containerized pipeline that generates configurable file exfiltration over DoH using several parameters (e.g., chunking, encoding, padding, resolver rotation). It allows for file reconstruction at the resolver side, while extracting flow-level features using a fork of DoHLyzer. The pipeline contains a prediction side, which allows the training of machine learning models based on public labelled datasets and then evaluates them side-by-side with threshold-based detection methods against malicious and evasive DNS-Over-HTTPS traffic. We train Random Forest, Gradient Boosting and Logistic Regression classifiers on a public DoH dataset and benchmark them against evasive DoH exfiltration scenarios. The toolkit orchestrates traffic generation, file capture, feature extraction, model training and analysis. The toolkit is then encapsulated into several Docker containers for easy setup and full reproducibility regardless of the platform it is run on. Future research regarding this project is directed at validating the results on mixed enterprise traffic, extending the protocol coverage to HTTP/3/QUIC request, adding a benign traffic generation, and working on real-time traffic evaluation. A key objective is to quantify when stealth constraints make DoH exfiltration uneconomical and unworthy for the attacker.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mihai Suteu, Ovidiu Serban
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-task learning (MTL) aims to leverage shared knowledge across tasks to improve generalization and parameter efficiency, yet balancing resources and mitigating interference remain open challenges. Architectural solutions often introduce elaborate task-specific modules or routing schemes, increasing complexity and overhead. In this work, we show that normalization layers alone are sufficient to address many of these challenges. Simply replacing shared normalization with task-specific variants already yields competitive performance, questioning the need for complex designs. Building on this insight, we propose Task-Specific Sigmoid Batch Normalization (TS$ฯ$BN), a lightweight mechanism that enables tasks to softly allocate network capacity while fully sharing feature extractors. TS$ฯ$BN improves stability across CNNs and Transformers, matching or exceeding performance on NYUv2, Cityscapes, CelebA, and PascalContext, while remaining highly parameter-efficient. Moreover, its learned gates provide a natural framework for analyzing MTL dynamics, offering interpretable insights into capacity allocation, filter specialization, and task relationships. Our findings suggest that complex MTL architectures may be unnecessary and that task-specific normalization offers a simple, interpretable, and efficient alternative.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junho Yoon, Jaemo Jung, Hyunju Kim, Dongman Lee
ๆฅ็ๆ่ฆ
Aligning egocentric video with wearable sensors have shown promise for human action recognition, but face practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions sharing similar temporal patterns with different spatio-semantic contexts. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features discovered via online clustering provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
๐
2025-12-23
๐ Presented at the 2025 IEEE 22nd India Council Inte
๐ฅ Rajdeep Chatterjee, Sudip Chakrabarty, Trishaani Acharjee, Deepanjali Mishra
Machine Learning
ๆฅ็ๆ่ฆ
Unmanned aerial vehicles (UAVs), commonly known as drones, are increasingly used across diverse domains, including logistics, agriculture, surveillance, and defense. While these systems provide numerous benefits, their misuse raises safety and security concerns, making effective detection mechanisms essential. Acoustic sensing offers a low-cost and non-intrusive alternative to vision or radar-based detection, as drone propellers generate distinctive sound patterns. This study introduces AUDRON (AUdio-based Drone Recognition Network), a hybrid deep learning framework for drone sound detection, employing a combination of Mel-Frequency Cepstral Coefficients (MFCC), Short-Time Fourier Transform (STFT) spectrograms processed with convolutional neural networks (CNNs), recurrent layers for temporal modeling, and autoencoder-based representations. Feature-level fusion integrates complementary information before classification. Experimental evaluation demonstrates that AUDRON effectively differentiates drone acoustic signatures from background noise, achieving high accuracy while maintaining generalizability across varying conditions. AUDRON achieves 98.51 percent and 97.11 percent accuracy in binary and multiclass classification. The results highlight the advantage of combining multiple feature representations with deep learning for reliable acoustic drone detection, suggesting the framework's potential for deployment in security and surveillance applications where visual or radar sensing may be limited.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ YuChe Hsu, AnJui Wang, TsaiChing Ni, YuanFu Yang
Multimodal
ๆฅ็ๆ่ฆ
We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning between textual descriptions, spatial structures, and simulation logic. In parallel, three novel evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed specifically for this task to comprehensively evaluate structural integrity, parameter fidelity, and simulator executability. Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.
๐
2025-12-23
๐ ArXiv Preprint (cs.SE)
๐ฅ Syeda Tasnim Fabiha, Saad Shafiq, Wesley Klewerton Guez Assunรงรฃo, Nenad Medvidoviฤ
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Service-based architecture (SBA) has gained attention in industry and academia as a means to modernize legacy systems. It refers to a design style that enables systems to be developed as suites of small, loosely coupled, and autonomous components (services) that encapsulate functionality and communicate via language-agnostic APIs. However, defining appropriately sized services that capture cohesive subsets of system functionality remains challenging. Existing work often relies on the availability of documentation, access to project personnel, or a priori knowledge of the target number of services, assumptions that do not hold in many real-world scenarios. Our work addresses these limitations using a deep reinforcement learning-based approach to identify appropriately sized services directly from implementation artifacts. We present Rake, a reinforcement learning-based technique that leverages available system documentation and source code to guide service decomposition at the level of implementation methods. Rake does not require specific documentation or access to project personnel and is language-agnostic. It also supports a customizable objective function that balances modularization quality and business capability alignment, i.e., the degree to which a service covers the targeted business capability. We applied Rake to four open-source legacy projects and compared it with two state-of-the-art techniques. On average, Rake achieved 7-14 percent higher modularization quality and 18-22 percent stronger business capability alignment. Our results further show that optimizing solely for business context can degrade decomposition quality in tightly coupled systems, highlighting the need for balanced objectives.
๐
2025-12-23
๐ Accepted for publication to the 40th IEEE Internat
๐ฅ Daniel M. Jimenez-Gutierrez, Mehrdad Hassanzadeh, Aris Anagnostopoulos, Ioannis Chatzigiannakis, Andrea Vitaletti
Machine Learning
ๆฅ็ๆ่ฆ
Federated learning (FL) supports privacy-preserving, decentralized machine learning (ML) model training by keeping data on client devices. However, non-independent and identically distributed (non-IID) data across clients biases updates and degrades performance. To alleviate these issues, we propose Clust-PSI-PFL, a clustering-based personalized FL framework that uses the Population Stability Index (PSI) to quantify the level of non-IID data. We compute a weighted PSI metric, $WPSI^L$, which we show to be more informative than common non-IID metrics (Hellinger, Jensen-Shannon, and Earth Mover's distance). Using PSI features, we form distributionally homogeneous groups of clients via K-means++; the number of optimal clusters is chosen by a systematic silhouette-based procedure, typically yielding few clusters with modest overhead. Across six datasets (tabular, image, and text modalities), two partition protocols (Dirichlet with parameter $ฮฑ$ and Similarity with parameter S), and multiple client sizes, Clust-PSI-PFL delivers up to 18% higher global accuracy than state-of-the-art baselines and markedly improves client fairness by a relative improvement of 37% under severe non-IID data. These results establish PSI-guided clustering as a principled, lightweight mechanism for robust PFL under label skew.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nilesh Jain, Seyi Adeyinka, Leor Roseman, Aza Allsop
Natural Language Processing
ๆฅ็ๆ่ฆ
Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's Kappa ($ฮบ$) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As proof-of-concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results demonstrate Gemini achieves highest reliability ($ฮบ= 0.907$, cosine=95.3%), followed by GPT-4o ($ฮบ= 0.853$, cosine=92.6%) and Claude ($ฮบ= 0.842$, cosine=92.1%). All three models achieve a high agreement ($ฮบ> 0.80$), validating the multi-run ensemble approach. The framework successfully extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% consistency), GPT-4o identifying 5 themes, and Claude 4 themes. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yaowei Bai, Ruiheng Zhang, Yu Lei, Xuhua Duan, Jingfeng Yao et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
A global shortage of radiologists has been exacerbated by the significant volume of chest X-ray workloads, particularly in primary care. Although multimodal large language models show promise, existing evaluations predominantly rely on automated metrics or retrospective analyses, lacking rigorous prospective clinical validation. Janus-Pro-CXR (1B), a chest X-ray interpretation system based on DeepSeek Janus-Pro model, was developed and rigorously validated through a multicenter prospective trial (NCT07117266). Our system outperforms state-of-the-art X-ray report generation models in automated report generation, surpassing even larger-scale models including ChatGPT 4o (200B parameters), while demonstrating reliable detection of six clinically critical radiographic findings. Retrospective evaluation confirms significantly higher report accuracy than Janus-Pro and ChatGPT 4o. In prospective clinical deployment, AI assistance significantly improved report quality scores, reduced interpretation time by 18.3% (P < 0.001), and was preferred by a majority of experts in 54.3% of cases. Through lightweight architecture and domain-specific optimization, Janus-Pro-CXR improves diagnostic reliability and workflow efficiency, particularly in resource-constrained settings. The model architecture and implementation framework will be open-sourced to facilitate the clinical translation of AI-assisted radiology solutions.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Junren Li, Luhua Lai
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Generative artificial intelligence has revolutionized the exploration of chemical space, yet a critical bottleneck remains that a substantial fraction of generated molecules is synthetically inaccessible. Current solutions, such as post-hoc filtering or projection-based methods, often compromise structural novelty or disrupt key pharmacophores by forcing molecules into pre-defined synthetic templates. Herein, we introduce SynCraft, a reasoning-based framework that reframes synthesizability optimization not as a sequence translation task, but as a precise structural editing problem. Leveraging the emergent reasoning capabilities of Large Language Models, SynCraft navigates the "synthesis cliff" where minimal structural modifications yield significant gains in synthetic feasibility. By predicting executable sequences of atom-level edits rather than generating SMILES strings directly, SynCraft circumvents the syntactic fragility of LLMs while harnessing their chemical intuition. Extensive benchmarks demonstrate that SynCraft outperforms state-of-the-art baselines in generating synthesizable analogs with high structural fidelity. Furthermore, through interaction-aware prompting, SynCraft successfully replicates expert medicinal chemistry intuition in editing PLK1 inhibitors and rescuing high-scoring but previously discarded RIPK1 candidates in previous molecular generation literatures.
๐
2025-12-23
๐ ArXiv Preprint (cs.SE)
๐ฅ Antonio Vitale, Khai-Nguyen Nguyen, Denys Poshyvanyk, Rocco Oliveto, Simone Scalabrino et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent progress in Large Language Models (LLMs) has substantially advanced the automation of software engineering (SE) tasks, enabling complex activities such as code generation and code summarization. However, the black-box nature of LLMs remains a major barrier to their adoption in high-stakes and safety-critical domains, where explainability and transparency are vital for trust, accountability, and effective human supervision. Despite increasing interest in explainable AI for software engineering, existing methods lack domain-specific explanations aligned with how practitioners reason about SE artifacts. To address this gap, we introduce FeatureSHAP, the first fully automated, model-agnostic explainability framework tailored to software engineering tasks. Based on Shapley values, FeatureSHAP attributes model outputs to high-level input features through systematic input perturbation and task-specific similarity comparisons, while remaining compatible with both open-source and proprietary LLMs. We evaluate FeatureSHAP on two bi-modal SE tasks: code generation and code summarization. The results show that FeatureSHAP assigns less importance to irrelevant input features and produces explanations with higher fidelity than baseline methods. A practitioner survey involving 37 participants shows that FeatureSHAP helps practitioners better interpret model outputs and make more informed decisions. Collectively, FeatureSHAP represents a meaningful step toward practical explainable AI in software engineering. FeatureSHAP is available at https://github.com/deviserlab/FeatureSHAP.
๐
2025-12-23
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Alexis Pomares Pastor, Ines Ribeiro Violante, Gregory Scott
Machine Learning
ๆฅ็ๆ่ฆ
A major shortcoming of medical practice is the lack of an objective measure of conscious level. Impairment of consciousness is common, e.g. following brain injury and seizures, which can also interfere with sensory processing and volitional responses. This is also an important pitfall in neurophysiological methods that infer awareness via command following, e.g. using functional MRI or electroencephalography (EEG).
Transcranial electrical stimulation (TES) can be employed to non-invasively stimulate the brain, bypassing sensory inputs, and has already showed promising results in providing reliable indicators of brain state. However, current non-invasive solutions have been limited to magnetic stimulation, which is not easily translatable to clinical settings. Our long-term vision is to develop an objective measure of brain state that can be used at the bedside, without requiring patients to understand commands or initiate motor responses.
In this study, we demonstrated the feasibility of a framework using Deep Learning algorithms to classify EEG brain responses evoked by a defined multi-dimensional pattern of TES. We collected EEG-TES data from 11 participants and found that delivering transcranial direct current stimulation (tDCS) to posterior cortical areas targeting the angular gyrus elicited an exceptionally reliable brain response. For this paradigm, our best Convolutional Neural Network model reached a 92% classification F1-score on Holdout data from participants never seen during training, significantly surpassing human-level performance at 60-70% accuracy.
These findings establish a framework for robust consciousness measurement for clinical use. In this spirit, we documented and open-sourced our datasets and codebase in full, to be used freely by the neuroscience and AI research communities, who may replicate our results with free tools like GitHub, Kaggle, and Colab.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Saisai Yang, Qingyi Huang, Jing Yuan, Liangyu Zha, Kai Tang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine-tuned via Supervised Fine-Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks. Reinforcement Learning (RL) offers a promising avenue to enhance these capabilities, yet its application in the tabular domain faces three critical hurdles: the scarcity of high-quality agentic trajectories with closed-loop code execution and environment feedback on diverse table structures, the extreme heterogeneity of feedback signals ranging from rigid SQL execution to open-ended data interpretation, and the risk of catastrophic forgetting of general knowledge during vertical specialization. To overcome these challenges and unlock advanced reasoning on complex tables, we introduce \textbf{TableGPT-R1}, a specialized tabular model built on a systematic RL framework. Our approach integrates a comprehensive data engineering pipeline that synthesizes difficulty-stratified agentic trajectories for both supervised alignment and RL rollouts, a task-adaptive reward system that combines rule-based verification with a criteria-injected reward model and incorporates process-level step reward shaping with behavioral regularization, and a multi-stage training framework that progressively stabilizes reasoning before specializing in table-specific tasks. Extensive evaluations demonstrate that TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities. Our model is available at https://huggingface.co/tablegpt/TableGPT-R1.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang
Robotics
ๆฅ็ๆ่ฆ
Visual-language reasoning, driving knowledge, and value alignment are essential for advanced autonomous driving systems. However, existing approaches largely rely on data-driven learning, making it difficult to capture the complex logic underlying decision-making through imitation or limited reinforcement rewards. To address this, we propose KnowVal, a new autonomous driving system that enables visual-language reasoning through the synergistic integration of open-world perception and knowledge retrieval. Specifically, we construct a comprehensive driving knowledge graph that encodes traffic laws, defensive driving principles, and ethical norms, complemented by an efficient LLM-based retrieval mechanism tailored for driving scenarios. Furthermore, we develop a human-preference dataset and train a Value Model to guide interpretable, value-aligned trajectory assessment. Experimental results show that our method substantially improves planning performance while remaining compatible with existing architectures. Notably, KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Karolina Droลผdลผ, Kacper Dudzic, Anna Sterna, Marcin Moskalewicz
ๆฅ็ๆ่ฆ
Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. We present the first direct comparison between state-of-the-art LLMs and mental health professionals in diagnosing Borderline (BPD) and Narcissistic (NPD) Personality Disorders utilizing Polish-language first-person autobiographical accounts. We show that the top-performing Gemini Pro models surpassed human professionals in overall diagnostic accuracy by 21.91 percentage points (65.48% vs. 43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 & F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patient's sense of self and temporal experience. Our findings demonstrate that while LLMs are highly competent at interpreting complex first-person clinical data, they remain subject to critical reliability and bias issues.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Joon Son Chung, Shinji Watanabe
Multimodal
ๆฅ็ๆ่ฆ
The objective of this paper is to jointly synthesize interactive videos and conversational speech from text and reference images. With the ultimate goal of building human-like conversational systems, recent studies have explored talking or listening head generation as well as conversational speech generation. However, these works are typically studied in isolation, overlooking the multimodal nature of human conversation, which involves tightly coupled audio-visual interactions. In this paper, we introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. TAVID integrates face and speech generation pipelines through two cross-modal mappers (i.e., a motion mapper and a speaker mapper), which enable bidirectional exchange of complementary information between the audio and visual modalities. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality. Extensive experiments demonstrate the effectiveness of our approach across all these aspects.
๐
2025-12-23
๐ AAAI 2026
๐ฅ Wenzheng Zeng, Mingyu Ouyang, Langyuan Cui, Hwee Tou Ng
ๆฅ็ๆ่ฆ
Automatic presentation slide generation can greatly streamline content creation. However, since preferences of each user may vary, existing under-specified formulations often lead to suboptimal results that fail to align with individual user needs. We introduce a novel task that conditions paper-to-slides generation on user-specified preferences. We propose a human behavior-inspired agentic framework, SlideTailor, that progressively generates editable slides in a user-aligned manner. Instead of requiring users to write their preferences in detailed textual form, our system only asks for a paper-slides example pair and a visual template - natural and easy-to-provide artifacts that implicitly encode rich user preferences across content and visual style. Despite the implicit and unlabeled nature of these inputs, our framework effectively distills and generalizes the preferences to guide customized slide generation. We also introduce a novel chain-of-speech mechanism to align slide content with planned oral narration. Such a design significantly enhances the quality of generated slides and enables downstream applications like video presentations. To support this new task, we construct a benchmark dataset that captures diverse user preferences, with carefully designed interpretable metrics for robust evaluation. Extensive experiments demonstrate the effectiveness of our framework.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Akshat Dubey, Aleksandar Anลพel, Bahar ฤฐlgen, Georges Hattab
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in deep learning have led to its widespread adoption across diverse domains, including medical imaging. This progress is driven by increasingly sophisticated model architectures, such as ResNets, Vision Transformers, and Hybrid Convolutional Neural Networks, that offer enhanced performance at the cost of greater complexity. This complexity often compromises model explainability and interpretability. SHAP has emerged as a prominent method for providing interpretable visualizations that aid domain experts in understanding model predictions. However, SHAP explanations can be unstable and unreliable in the presence of epistemic and aleatoric uncertainty. In this study, we address this challenge by using Dirichlet posterior sampling and Dempster-Shafer theory to quantify the uncertainty that arises from these unstable explanations in medical imaging applications. The framework uses a belief, plausible, and fusion map approach alongside statistical quantitative analysis to produce quantification of uncertainty in SHAP. Furthermore, we evaluated our framework on three medical imaging datasets with varying class distributions, image qualities, and modality types which introduces noise due to varying image resolutions and modality-specific aspect covering the examples from pathology, ophthalmology, and radiology, introducing significant epistemic uncertainty.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Nishant Gaurav, Adit Akarsh, Ankit Ranjan, Manoj Bajaj
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
While CodeMem establishes executable code as the optimal representation for agentic procedural memory, the mechanism for autonomously synthesizing this memory from a blank slate remains underexplored. This paper operationalizes the transition of Large Language Models from passive tool-users to active workflow architects. Through a high-fidelity case study of a cross-service orchestration task involving Outlook and OneDrive, we identify and address four structural bottlenecks in automated skill generation: the Discovery Gap involving navigation of large tool registries, the Verification Gap regarding grounding tool response structures, the Decomposition Gap which replaces inefficient search with Linear State Anchoring, and the Scaling Gap focused on concurrency and persistence. We demonstrate that by enforcing a scientific methodology of hypothesize, probe, and code, agents can autonomously write robust, production-grade code skills.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuntao Dai, Hang Gu, Teng Wang, Qianyu Cheng, Yifei Zheng et al.
RoboticsMultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hin dered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typi cally operate at only 3-5 Hz on edge devices due to the memory bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge plat forms. At the core of ActionFlow is a Cross-Request Pipelin ing strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dy namic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Divya Vijay, Vignesh Ethiraj
RoboticsNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
As networks evolve toward 5G Standalone and 6G, operators face orchestration challenges that exceed the limits of static automation and Deep Reinforcement Learning. Although Large Language Model (LLM) agents offer a path toward intent-based networking, they introduce stochastic risks, including topology hallucinations and policy non-compliance. To mitigate this, we propose Graph-Symbolic Policy Enforcement and Control (G-SPEC), a neuro-symbolic framework that constrains probabilistic planning with deterministic verification. The architecture relies on a Governance Triad - a telecom-adapted agent (TSLAM-4B), a Network Knowledge Graph (NKG), and SHACL constraints. We evaluated G-SPEC on a simulated 450-node 5G Core, achieving zero safety violations and a 94.1% remediation success rate, significantly outperforming the 82.4% baseline. Ablation analysis indicates that NKG validation drives the majority of safety gains (68%), followed by SHACL policies (24%). Scalability tests on topologies ranging from 10K to 100K nodes demonstrate that validation latency scales as $O(k^{1.2})$ where $k$ is subgraph size. With a processing overhead of 142ms, G-SPEC is viable for SMO-layer operations.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiawei Ge, Jiuxin Cao, Xinyi Li, Xuelin Zhu, Chang Liu et al.
Computer Vision
ๆฅ็ๆ่ฆ
Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose ${D}^{3}$ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, ${D}^{3}$ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.
๐
2025-12-23
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Tarik Houichime, Abdelghani Souhar, Younes El Amrani
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The memory of contemporary Large Language Models is bound by a physical paradox: as they learn, they fill up. The linear accumulation (O(N)) of Key-Value states treats context as a warehouse of static artifacts, eventually forcing a destructive choice between amnesia and latency. We challenge this discrete orthodoxy, proposing that long-term memory is not the storage of items, but the persistence of a trajectory. We introduce Phonetic Trajectory Memory (PTM), a neuro-symbolic architecture that encodes language not as a sequence of tensors, but as a continuous path on an ergodic manifold governed by irrational rotation matrices. By decoupling the navigation (an invariant O(1) geometric signal) from the reconstruction (a probabilistic generative act), PTM achieves a compression magnitude of greater than 3,000x relative to dense caches. We demonstrate that retrieval becomes a process of resonance: the phonetic trace stabilizes the model against hallucination via "Signal Consensus" mechanism, securing up to approximately 92% factual accuracy. While this aggressive abstraction alters generative texture, it unlocks immediate access latency (approximately 34ms) independent of depth. Our results suggest that infinite context does not require infinite silicon; it requires treating memory not as data to be stored, but as a reconstructive process acting on a conserved, undying physical signal.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xingbo Du, Loka Li, Duzhen Zhang, Le Song
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Memory systems have been designed to leverage past experiences in Large Language Model (LLM) agents. However, many deployed memory systems primarily optimize compression and storage, with comparatively less emphasis on explicit, closed-loop control of memory retrieval. From this observation, we build memory retrieval as an autonomous, accurate, and compatible agent system, named MemR$^3$, which has two core mechanisms: 1) a router that selects among retrieve, reflect, and answer actions to optimize answer quality; 2) a global evidence-gap tracker that explicitly renders the answering process transparent and tracks the evidence collection process. This design departs from the standard retrieve-then-answer pipeline by introducing a closed-loop control mechanism that enables autonomous decision-making. Empirical results on the LoCoMo benchmark demonstrate that MemR$^3$ surpasses strong baselines on LLM-as-a-Judge score, and particularly, it improves existing retrievers across four categories with an overall improvement on RAG (+7.29%) and Zep (+1.94%) using GPT-4.1-mini backend, offering a plug-and-play controller for existing memory stores.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhe Sun, Kunlun Wu, Chuanjian Fu, Zeming Song, Langyong Shi et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
As artificial intelligence (AI) rapidly advances, especially in multimodal large language models (MLLMs), research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action feedback rather than conventionally labeled datasets. Yet, most existing simulation platforms remain narrowly designed, each tailored to specific tasks. A versatile, general-purpose training environment that can support everything from low-level embodied navigation to high-level composite activities, such as multi-agent social simulation and human-AI collaboration, remains largely unavailable. To bridge this gap, we introduce TongSIM, a high-fidelity, general-purpose platform for training and evaluating embodied agents. TongSIM offers practical advantages by providing over 100 diverse, multi-room indoor scenarios as well as an open-ended, interaction-rich outdoor town simulation, ensuring broad applicability across research needs. Its comprehensive evaluation framework and benchmarks enable precise assessment of agent capabilities, such as perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning. With features like customized scenes, task-adaptive fidelity, diverse agent types, and dynamic environmental simulation, TongSIM delivers flexibility and scalability for researchers, serving as a unified platform that accelerates training, evaluation, and advancement toward general embodied intelligence.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Marko ฤechoviฤ, Natรกlia Komornรญkovรก, Dominik Machรกฤek, Ondลej Bojar
Natural Language Processing
ๆฅ็ๆ่ฆ
Speech processing and translation technology have the potential to facilitate meetings of individuals who do not share any common language. To evaluate automatic systems for such a task, a versatile and realistic evaluation corpus is needed. Therefore, we create and present a corpus of cross-lingual dialogues between individuals without a common language who were facilitated by automatic simultaneous speech translation. The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages and automatic and corrected translations into English. For the purposes of research into cross-lingual summarization, our corpus also includes written summaries (minutes) of the meetings.
Moreover, we propose automatic detection of misunderstandings. For an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings. We annotate misunderstandings manually and also test the ability of current large language models to detect them automatically. The results show that the Gemini model is able to identify text spans with misunderstandings with recall of 77% and precision of 47%.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Teqiang Zou, Hongliang Zeng, Yuxuan Nong, Yifan Li, Kehui Liu et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals, yet both typically run at a single unified frequency. As a result, policy performance is constrained by the low inference speed of large VLMs. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS), organizing the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The system is characterized by two key features. First, a latent representation buffer bridges the slow and fast systems. It stores instruction semantics and action-reasoning representation aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still jointly trained end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while achieving 30 Hz whole-body action-chunk generation, approximately three times as fast as prior VLA models with comparable model sizes. Real-world whole-body manipulation experiments demonstrate improved task success rates and significantly enhanced responsiveness compared to synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, including training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shuzheng Si, Qingyi Wang, Haozhe Zhao, Yuzhuo Bai, Guanqiao Chen et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recognizing whether outputs from large language models (LLMs) contain faithfulness hallucination is crucial for real-world applications, e.g., retrieval-augmented generation and summarization. In this paper, we introduce FaithLens, a cost-efficient and effective faithfulness hallucination detection model that can jointly provide binary predictions and corresponding explanations to improve trustworthiness. To achieve this, we first synthesize training data with explanations via advanced LLMs and apply a well-defined data filtering strategy to ensure label correctness, explanation quality, and data diversity. Subsequently, we fine-tune the model on these well-curated training data as a cold start and further optimize it with rule-based reinforcement learning, using rewards for both prediction correctness and explanation quality. Results on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3. Also, FaithLens can produce high-quality explanations, delivering a distinctive balance of trustworthiness, efficiency, and effectiveness.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ze Gong, Pradeep Varakantham, Akshat Kumar
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Offline Preference-based Reinforcement Learning (PbRL) learns rewards and policies aligned with human preferences without the need for extensive reward engineering and direct interaction with human annotators. However, ensuring safety remains a critical challenge across many domains and tasks. Previous works on safe RL from human feedback (RLHF) first learn reward and cost models from offline data, then use constrained RL to optimize a safe policy. While such an approach works in the contextual bandits settings (LLMs), in long horizon continuous control tasks, errors in rewards and costs accumulate, leading to impairment in performance when used with constrained RL methods. To address these challenges, (a) instead of indirectly learning policies (from rewards and costs), we introduce a framework that learns a policy directly based on pairwise preferences regarding the agent's behavior in terms of rewards, as well as binary labels indicating the safety of trajectory segments; (b) we propose \textsc{PreSa} (Preference and Safety Alignment), a method that combines preference learning module with safety alignment in a constrained optimization problem. This optimization problem is solved within a Lagrangian paradigm that directly learns reward-maximizing safe policy \textit{without explicitly learning reward and cost models}, avoiding the need for constrained RL; (c) we evaluate our approach on continuous control tasks with both synthetic and real human feedback. Empirically, our method successfully learns safe policies with high rewards, outperforming state-of-the-art baselines, and offline safe RL approaches with ground-truth reward and cost.
๐
2025-12-23
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Songze Li, Jiameng Cheng, Yiming Li, Xiaojun Jia, Dacheng Tao
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
By integrating language understanding with perceptual modalities such as images, multimodal large language models (MLLMs) constitute a critical substrate for modern AI systems, particularly intelligent agents operating in open and interactive environments. However, their increasing accessibility also raises heightened risks of misuse, such as generating harmful or unsafe content. To mitigate these risks, alignment techniques are commonly applied to align model behavior with human values. Despite these efforts, recent studies have shown that jailbreak attacks can circumvent alignment and elicit unsafe outputs. Currently, most existing jailbreak methods are tailored for open-source models and exhibit limited effectiveness against commercial MLLM-integrated systems, which often employ additional filters. These filters can detect and prevent malicious input and output content, significantly reducing jailbreak threats. In this paper, we reveal that the success of these safety filters heavily relies on a critical assumption that malicious content must be explicitly visible in either the input or the output. This assumption, while often valid for traditional LLM-integrated systems, breaks down in MLLM-integrated systems, where attackers can leverage multiple modalities to conceal adversarial intent, leading to a false sense of security in existing MLLM-integrated systems. To challenge this assumption, we propose Odysseus, a novel jailbreak paradigm that introduces dual steganography to covertly embed malicious queries and responses into benign-looking images. Extensive experiments on benchmark datasets demonstrate that our Odysseus successfully jailbreaks several pioneering and realistic MLLM-integrated systems, achieving up to 99% attack success rate. It exposes a fundamental blind spot in existing defenses, and calls for rethinking cross-modal security in MLLM-integrated systems.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Honglin Mu, Jinghao Liu, Kaiyang Wan, Rui Xing, Xiuying Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. However, our research identifies a vulnerability: LLMs can be manipulated by "adversarial instructions" hidden in input data, such as resumes or code, causing them to deviate from their intended task. Notably, while defenses may exist for mature domains such as code review, they are often absent in other common applications such as resume screening and peer review. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types. We evaluate two defense mechanisms: prompt-based defenses achieve 10.1% attack reduction with 12.5% false rejection increase, while our proposed FIDS (Foreign Instruction Detection through Separation) using LoRA adaptation achieves 15.4% attack reduction with 10.4% false rejection increase. The combined approach provides 26.3% attack reduction, demonstrating that training-time defenses outperform inference-time mitigations in both security and utility preservation.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Arghavan Bazigaran, Hansem Sohn
Natural Language Processing
ๆฅ็ๆ่ฆ
We compare human and large language model (LLM) generalization in the number game, a concept inference task. Using a Bayesian model as an analytical framework, we examined the inductive biases and inference strategies of humans and LLMs. The Bayesian model captured human behavior better than LLMs in that humans flexibly infer rule-based and similarity-based concepts, whereas LLMs rely more on mathematical rules. Humans also demonstrated a few-shot generalization, even from a single example, while LLMs required more samples to generalize. These contrasts highlight the fundamental differences in how humans and LLMs infer and generalize mathematical concepts.
๐
2025-12-23
๐ 2025 International Joint Conference on Neural Netw
๐ฅ Dhivya Dharshini Kannan, Anupam Trivedi, Dipti Srinivasan
Machine Learning
ๆฅ็ๆ่ฆ
Data centers account for significant global energy consumption and a carbon footprint. The recent increasing demand for edge computing and AI advancements drives the growth of data center storage capacity. Energy efficiency is a cost-effective way to combat climate change, cut energy costs, improve business competitiveness, and promote IT and environmental sustainability. Thus, optimizing data center energy management is the most important factor in the sustainability of the world. Power Usage Effectiveness (PUE) is used to represent the operational efficiency of the data center. Predicting PUE using Neural Networks provides an understanding of the effect of each feature on energy consumption, thus enabling targeted modifications of those key features to improve energy efficiency. In this paper, we have developed Bidirectional Gated Recurrent Unit (BiGRU) based PUE prediction model and compared the model performance with GRU. The data set comprises 52,560 samples with 117 features using EnergyPlus, simulating a DC in Singapore. Sets of the most relevant features are selected using the Recursive Feature Elimination with Cross-Validation (RFECV) algorithm for different parameter settings. These feature sets are used to find the optimal hyperparameter configuration and train the BiGRU model. The performance of the optimized BiGRU-based PUE prediction model is then compared with that of GRU using mean squared error (MSE), mean absolute error (MAE), and R-squared metrics.
๐
2025-12-23
๐ ArXiv Preprint (cs.SE)
๐ฅ Ruiqi Wang, Xinchen Wang, Cuiyun Gao, Chun Yong Chong, Xin Xia et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have been increasingly deployed in real-world software engineering, fostering the development of code evaluation metrics to study the quality of LLM-generated code. Conventional rule-based metrics merely score programs based on their surface-level similarities with reference programs instead of analyzing functionality and code quality in depth. To address this limitation, researchers have developed LLM-as-a-judge metrics, prompting LLMs to evaluate and score code, and curated various code evaluation benchmarks to validate their effectiveness. However, these benchmarks suffer from critical limitations, hindering reliable assessments of evaluation capability: Some feature coarse-grained binary labels, which reduce rich code behavior to a single bit of information, obscuring subtle errors. Others propose fine-grained but subjective, vaguely-defined evaluation criteria, introducing unreliability in manually-annotated scores, which is the ground-truth they rely on. Furthermore, they often use uncontrolled data synthesis methods, leading to unbalanced score distributions that poorly represent real-world code generation scenarios.
To curate a diverse benchmark with programs of well-balanced distributions across various quality levels and streamline the manual annotation procedure, we propose AXIOM, a novel perturbation-based framework for synthesizing code evaluation benchmarks at scale. It reframes program scores as the refinement effort needed for deployment, consisting of two stages: (1) Rule-guided perturbation, which prompts LLMs to apply sequences of predefined perturbation rules to existing high-quality programs to modify their functionality and code quality, enabling us to precisely control each program's target score to achieve balanced score distributions. (2) Multisource quality calibration, which first selects a subset of...
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo.
๐
2025-12-23
๐ IEEE/ACM Transactions on Audio, Speech and Languag
๐ฅ Xiang Chen, Yixin Ou, Quan Feng, Lei Li, Piji Li et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
The pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the ``pre-train, prompt, and predict'' paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xingyou Yin, Ceyao Zhang, Min Hu, Kai Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated effectiveness as zero-shot time series (TS) forecasters. The key challenge lies in tokenizing TS data into textual representations that align with LLMs' pre-trained knowledge. While existing work often relies on fine-tuning specialized modules to bridge this gap, a distinct, yet challenging, paradigm aims to leverage truly off-the-shelf LLMs without any fine-tuning whatsoever, relying solely on strategic tokenization of numerical sequences. The performance of these fully frozen models is acutely sensitive to the textual representation of the input data, as their parameters cannot adapt to distribution shifts. In this paper, we introduce a simple yet highly effective strategy to overcome this brittleness: injecting noise into the raw time series before tokenization. This non-invasive intervention acts as a form of inference-time augmentation, compelling the frozen LLM to extrapolate based on robust underlying temporal patterns rather than superficial numerical artifacts. We theoretically analyze this phenomenon and empirically validate its effectiveness across diverse benchmarks. Notably, to fully eliminate potential biases from data contamination during LLM pre-training, we introduce two novel TS datasets that fall outside all utilized LLMs' pre-training scopes, and consistently observe improved performance. This study provides a further step in directly leveraging off-the-shelf LLMs for time series forecasting.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hyeongcheol Park, Jiyoung Seo, Jaewon Mun, Hogun Park, Wonmin Byeon et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs' multimodal reasoning and grounding over existing approaches.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhuo Yang, Yeyun chen, Jiaqing Xie, Ben Gao, Shuaike Shen et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Molecular editing and optimization are multi-step problems that require iteratively improving properties while keeping molecules chemically valid and structurally similar. We frame both tasks as sequential, tool-guided decisions and introduce MolAct, an agentic reinforcement learning framework that employs a two-stage training paradigm: first building editing capability, then optimizing properties while reusing the learned editing behaviors. To the best of our knowledge, this is the first work to formalize molecular design as an Agentic Reinforcement Learning problem, where an LLM agent learns to interleave reasoning, tool-use, and molecular optimization. The framework enables agents to interact in multiple turns, invoking chemical tools for validity checking, property assessment, and similarity control, and leverages their feedback to refine subsequent edits. We instantiate the MolAct framework to train two model families: MolEditAgent for molecular editing tasks and MolOptAgent for molecular optimization tasks. In molecular editing, MolEditAgent-7B delivers 100, 95, and 98 valid add, delete, and substitute edits, outperforming strong closed "thinking" baselines such as DeepSeek-R1; MolEditAgent-3B approaches the performance of much larger open "thinking" models like Qwen3-32B-think. In molecular optimization, MolOptAgent-7B (trained on MolEditAgent-7B) surpasses the best closed "thinking" baseline (e.g., Claude 3.7) on LogP and remains competitive on solubility, while maintaining balanced performance across other objectives. These results highlight that treating molecular design as a multi-step, tool-augmented process is key to reliable and interpretable improvements.
๐
2025-12-23
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Xian-Rong Zhang, Yue-Jiao Gong, Wei-Neng Chen, Jun Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Evolutionary Neural Architecture Search (ENAS) has gained attention for automatically designing neural network architectures. Recent studies use a neural predictor to guide the process, but the high computational costs of gathering training data -- since each label requires fully training an architecture -- make achieving a high-precision predictor with { limited compute budget (i.e., a capped number of fully trained architecture-label pairs)} crucial for ENAS success. This paper introduces ENAS with Dual Contrastive Learning (DCL-ENAS), a novel method that employs two stages of contrastive learning to train the neural predictor. In the first stage, contrastive self-supervised learning is used to learn meaningful representations from neural architectures without requiring labels. In the second stage, fine-tuning with contrastive learning is performed to accurately predict the relative performance of different architectures rather than their absolute performance, which is sufficient to guide the evolutionary search. Across NASBench-101 and NASBench-201, DCL-ENAS achieves the highest validation accuracy, surpassing the strongest published baselines by 0.05\% (ImageNet16-120) to 0.39\% (NASBench-101). On a real-world ECG arrhythmia classification task, DCL-ENAS improves performance by approximately 2.5 percentage points over a manually designed, non-NAS model obtained via random search, while requiring only 7.7 GPU-days.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr
Machine Learning
ๆฅ็ๆ่ฆ
As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable beliefs while maintaining near-constant memory use over interaction steps. However, bottleneck approaches are generally prone to error propagation, which we observe causing inferior performance when compared to the full context setting due to errors in belief updating. Therefore, we train LLMs to generate and act on beliefs within the ABBEL framework via reinforcement learning (RL). We experiment with belief grading, to reward higher quality beliefs, as well as belief length penalties to reward more compressed beliefs. Our experiments demonstrate the ability of RL to improve ABBEL's performance beyond the full context setting, while using less memory than contemporaneous approaches.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinyoung Choi, Youngchae Kwon, Injung Kim
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Fashion style classification is a challenging task because of the large visual variation within the same style and the existence of visually similar styles.
Styles are expressed not only by the global appearance, but also by the attributes of individual items and their combinations.
In this study, we propose an item region-based fashion style classification network (IRSN) to effectively classify fashion styles by analyzing item-specific features and their combinations in addition to global features.
IRSN extracts features of each item region using item region pooling (IRP), analyzes them separately, and combines them using gated feature fusion (GFF).
In addition, we improve the feature extractor by applying a dual-backbone architecture that combines a domain-specific feature extractor and a general feature extractor pre-trained with a large-scale image-text dataset.
In experiments, applying IRSN to six widely-used backbones, including EfficientNet, ConvNeXt, and Swin Transformer, improved style classification accuracy by an average of 6.9% and a maximum of 14.5% on the FashionStyle14 dataset and by an average of 7.6% and a maximum of 15.1% on the ShowniqV3 dataset. Visualization analysis also supports that the IRSN models are better than the baseline models at capturing differences between similar style classes.
๐
2025-12-23
๐ NeurIPS 2025
๐ฅ Jeehong Kim, Youngseok Hwang, Minchan Kim, Sungho Bae, Hyunwoo Park
Machine Learning
ๆฅ็ๆ่ฆ
Spatio-temporal graph neural networks (ST-GNNs) have achieved notable success in structured domains such as road traffic and public transportation, where spatial entities can be naturally represented as fixed nodes. In contrast, many real-world systems including maritime traffic lack such fixed anchors, making the construction of spatio-temporal graphs a fundamental challenge. Anomaly detection in these non-grid environments is particularly difficult due to the absence of canonical reference points, the sparsity and irregularity of trajectories, and the fact that anomalies may manifest at multiple granularities. In this work, we introduce a novel benchmark dataset for anomaly detection in the maritime domain, extending the Open Maritime Traffic Analysis Dataset (OMTAD) into a benchmark tailored for graph-based anomaly detection. Our dataset enables systematic evaluation across three different granularities: node-level, edge-level, and graph-level anomalies. We plan to employ two specialized LLM-based agents: \emph{Trajectory Synthesizer} and \emph{Anomaly Injector} to construct richer interaction contexts and generate semantically meaningful anomalies. We expect this benchmark to promote reproducibility and to foster methodological advances in anomaly detection for non-grid spatio-temporal systems.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yanjie Li, Jian Xu, Xueqing Chen, Lina Yu, Shiming Xiang et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Adsorption energy is a key descriptor of catalytic reactivity. It is fundamentally defined as the difference between the relaxed total energy of the adsorbate-surface system and that of an appropriate reference state; therefore, the accuracy of relaxed-energy prediction directly determines the reliability of machine-learning-driven catalyst screening. E(3)-equivariant graph neural networks (GNNs) can natively operate on three-dimensional atomic coordinates under periodic boundary conditions and have demonstrated strong performance on such tasks. In contrast, language-model-based approaches, while enabling human-readable textual descriptions and reducing reliance on explicit graph -- thereby broadening applicability -- remain insufficient in both adsorption-configuration energy prediction accuracy and in distinguishing ``the same system with different configurations,'' even with graph-assisted pretraining in the style of GAP-CATBERTa.
To this end, we propose QE-Catalytic, a multimodal framework that deeply couples a large language model (\textbf{Q}wen) with an E(3)-equivariant graph Transformer (\textbf{E}quiformer-V2), enabling unified support for adsorption-configuration property prediction and inverse design on complex catalytic surfaces. During prediction, QE-Catalytic jointly leverages three-dimensional structures and structured configuration text, and injects ``3D geometric information'' into the language channel via graph-text alignment, allowing it to function as a high-performance text-based predictor when precise coordinates are unavailable, while also autoregressively generating CIF files for target-energy-driven structure design and information completion. On OC20, QE-Catalytic reduces the MAE of relaxed adsorption energy from 0.713~eV to 0.486~eV, and consistently outperforms baseline models such as CatBERTa and GAP-CATBERTa across multiple evaluation protocols.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chaithra, Kamesh Kadimisetty, Biju R Mohan
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Financial sentiment analysis plays a crucial role in informing investment decisions, assessing market risk, and predicting stock price trends. Existing works in financial sentiment analysis have not considered the impact of stock prices or market feedback on sentiment analysis. In this paper, we propose an adaptive framework that integrates large language models (LLMs) with real-world stock market feedback to improve sentiment classification in the context of the Indian stock market. The proposed methodology fine-tunes the LLaMA 3.2 3B model using instruction-based learning on the SentiFin dataset. To enhance sentiment predictions, a retrieval-augmented generation (RAG) pipeline is employed that dynamically selects multi-source contextual information based on the cosine similarity of the sentence embeddings. Furthermore, a feedback-driven module is introduced that adjusts the reliability of the source by comparing predicted sentiment with actual next-day stock returns, allowing the system to iteratively adapt to market behavior. To generalize this adaptive mechanism across temporal data, a reinforcement learning agent trained using proximal policy optimization (PPO) is incorporated. The PPO agent learns to optimize source weighting policies based on cumulative reward signals from sentiment-return alignment. Experimental results on NIFTY 50 news headlines collected from 2024 to 2025 demonstrate that the proposed system significantly improves classification accuracy, F1-score, and market alignment over baseline models and static retrieval methods. The results validate the potential of combining instruction-tuned LLMs with dynamic feedback and reinforcement learning for robust, market-aware financial sentiment modeling.
๐
2025-12-23
๐ ArXiv Preprint (cs.NI)
๐ฅ Dianxuan Fu, Xiaomin Liu, Yihao Zhang, Shikui Shen, Weisheng Hu et al.
ๆฅ็ๆ่ฆ
We propose a communication-bound-aware cross-domain resource assignment framework for pipeline-parallel distributed training over multi-datacenter optical networks, which lowers iteration time by 31.25% and reduces 13.20% blocking requests compared to baselines.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite the wide adoption of Large Language Models (LLM)s, clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations aligned with the predictions. Current approaches suffer from exposure bias leading to misaligned explanations. We propose Reason2Decide, a two-stage training framework that addresses key challenges in self-rationalization, including exposure bias and task separation. In Stage-1, our model is trained on rationale generation, while in Stage-2, we jointly train on label prediction and rationale generation, applying scheduled sampling to gradually transition from conditioning on gold labels to model predictions. We evaluate Reason2Decide on three medical datasets, including a proprietary triage dataset and public biomedical QA datasets. Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, Reason2Decide is rationale source-robust across LLM-generated, nurse-authored, and nurse-post-processed rationales. In our experiments, while using only LLM-generated rationales in Stage-1, Reason2Decide outperforms other fine-tuning variants. This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations. Remarkably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible for resource-constrained deployments while still providing explainable decision support.
๐
2025-12-23
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Sangryu Park, Gihyuk Ko, Homook Cho
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) show significant promise in automating software vulnerability analysis, a critical task given the impact of security failure of modern software systems. However, current approaches in using LLMs to automate vulnerability analysis mostly rely on using online API-based LLM services, requiring the user to disclose the source code in development. Moreover, they predominantly frame the task as a binary classification(vulnerable or not vulnerable), limiting potential practical utility. This paper addresses these limitations by reformulating the problem as Software Vulnerability Identification (SVI), where LLMs are asked to output the type of weakness in Common Weakness Enumeration (CWE) IDs rather than simply indicating the presence or absence of a vulnerability. We also tackle the reliance on large, API-based LLMs by demonstrating that instruction-tuning smaller, locally deployable LLMs can achieve superior identification performance. In our analysis, instruct-tuning a local LLM showed better overall performance and cost trade-off than online API-based LLMs. Our findings indicate that instruct-tuned local models represent a more effective, secure, and practical approach for leveraging LLMs in real-world vulnerability management workflows.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hamed Firooz, Rui Liu, Yuchen Lu, Zhenyu Hou, Fangzhou Xiong et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Content moderation at scale remains one of the most pressing challenges in today's digital ecosystem, where billions of user- and AI-generated artifacts must be continuously evaluated for policy violations. Although recent advances in large language models (LLMs) have demonstrated strong potential for policy-grounded moderation, the practical challenges of training these systems to achieve expert-level accuracy in real-world settings remain largely unexplored, particularly in regimes characterized by label sparsity, evolving policy definitions, and the need for nuanced reasoning beyond shallow pattern matching. In this work, we present a comprehensive empirical investigation of scaling reinforcement learning (RL) for content classification, systematically evaluating multiple RL training recipes and reward-shaping strategies-including verifiable rewards and LLM-as-judge frameworks-to transform general-purpose language models into specialized, policy-aligned classifiers across three real-world content moderation tasks. Our findings provide actionable insights for industrial-scale moderation systems, demonstrating that RL exhibits sigmoid-like scaling behavior in which performance improves smoothly with increased training data, rollouts, and optimization steps before gradually saturating. Moreover, we show that RL substantially improves performance on tasks requiring complex policy-grounded reasoning while achieving up to 100x higher data efficiency than supervised fine-tuning, making it particularly effective in domains where expert annotations are scarce or costly.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hao Li, Fabian Deuser, Wenping Yin, Steffen Knoblauch, Wufan Zhao et al.
ๆฅ็ๆ่ฆ
As Earth's climate changes, it is impacting disasters and extreme weather events across the planet. Record-breaking heat waves, drenching rainfalls, extreme wildfires, and widespread flooding during hurricanes are all becoming more frequent and more intense. Rapid and efficient response to disaster events is essential for climate resilience and sustainability. A key challenge in disaster response is to accurately and quickly identify disaster locations to support decision-making and resources allocation. In this paper, we propose a Probabilistic Cross-view Geolocalization approach, called ProbGLC, exploring new pathways towards generative location awareness for rapid disaster response. Herein, we combine probabilistic and deterministic geolocalization models into a unified framework to simultaneously enhance model explainability (via uncertainty quantification) and achieve state-of-the-art geolocalization performance. Designed for rapid diaster response, the ProbGLC is able to address cross-view geolocalization across multiple disaster events as well as to offer unique features of probabilistic distribution and localizability score. To evaluate the ProbGLC, we conduct extensive experiments on two cross-view disaster datasets (i.e., MultiIAN and SAGAINDisaster), consisting diverse cross-view imagery pairs of multiple disaster types (e.g., hurricanes, wildfires, floods, to tornadoes). Preliminary results confirms the superior geolocalization accuracy (i.e., 0.86 in Acc@1km and 0.97 in Acc@25km) and model explainability (i.e., via probabilistic distributions and localizability scores) of the proposed ProbGLC approach, highlighting the great potential of leveraging generative cross-view approach to facilitate location awareness for better and faster disaster response. The data and code is publicly available at https://github.com/bobleegogogo/ProbGLC
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Peter N. Loxley
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Controllable Markov chains describe the dynamics of sequential decision making tasks and are the central component in optimal control and reinforcement learning. In this work, we give the general form of an optimal policy for learning controllable dynamics in an unknown environment by exploring over a limited time horizon. This policy is simple to implement and efficient to compute, and allows an agent to ``learn by exploring" as it maximizes its information gain in a greedy fashion by selecting controls from a constraint set that changes over time during exploration. We give a simple parameterization for the set of controls, and present an algorithm for finding an optimal policy. The reason for this policy is due to the existence of certain types of states that restrict control of the dynamics; such as transient states, absorbing states, and non-backtracking states. We show why the occurrence of these states makes a non-stationary policy essential for achieving optimal exploration. Six interesting examples of controllable dynamics are treated in detail. Policy optimality is demonstrated using counting arguments, comparing with suboptimal policies, and by making use of a sequential improvement property from dynamic programming.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hung-Chieh Fang, Kuo-Han Hung, Chu-Rong Chen, Po-Jung Chou, Chun-Kai Yang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Learning from videos offers a promising path toward generalist robots by providing rich visual and temporal priors beyond what real robot datasets contain. While existing video generative models produce impressive visual predictions, they are difficult to translate into low-level actions. Conversely, latent-action models better align videos with actions, but they typically operate at the single-step level and lack high-level planning capabilities. We bridge this gap by introducing Skill Abstraction from Optical Flow (SOF), a framework that learns latent skills from large collections of action-free videos. Our key idea is to learn a latent skill space through an intermediate representation based on optical flow that captures motion information aligned with both video dynamics and robot actions. By learning skills in this flow-based latent space, SOF enables high-level planning over video-derived skills and allows for easier translation of these skills into actions. Experiments show that our approach consistently improves performance in both multitask and long-horizon settings, demonstrating the ability to acquire and compose skills directly from raw visual data.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jung Yeon Park, Yuxuan Chen, Floor Eijkelboom, Jan-Willem van de Meent, Lawson L. S. Wong et al.
Machine Learning
ๆฅ็ๆ่ฆ
Symmetry is fundamental to understanding physical systems, and at the same time, can improve performance and sample efficiency in machine learning. Both pursuits require knowledge of the underlying symmetries in data. To address this, we propose learning symmetries directly from data via flow matching on Lie groups. We formulate symmetry discovery as learning a distribution over a larger hypothesis group, such that the learned distribution matches the symmetries observed in data. Relative to previous works, our method, \lieflow, is more flexible in terms of the types of groups it can discover and requires fewer assumptions. Experiments on 2D and 3D point clouds demonstrate the successful discovery of discrete groups, including reflections by flow matching over the complex domain. We identify a key challenge where the symmetric arrangement of the target modes causes ``last-minute convergence,'' where samples remain stationary until relatively late in the flow, and introduce a novel interpolation scheme for flow matching for symmetry discovery.
๐
2025-12-23
๐ 7 pages, 5 figures. System description for the EVE
๐ฅ Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Dao Sy Duy Minh, Nguyen Hoang Minh Ngoc et al.
Multimodal
ๆฅ็ๆ่ฆ
Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A fine-tuned Qwen3 model with QLoRA then integrates this context with base captions generated by Instruct BLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions compared to traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuan Gao, Zhenguo Dong, Xuelong Wang, Zhiqiang Wang, Yong Zhang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Accurate and interpretable forecasting of multivariate time series is crucial for understanding the complex dynamics of cryptocurrency markets in digital asset systems. Advanced deep learning methodologies, particularly Transformer-based and MLP-based architectures, have achieved competitive predictive performance in cryptocurrency forecasting tasks. However, cryptocurrency data is inherently composed of long-term socio-economic trends and local high-frequency speculative oscillations. Existing deep learning-based 'black-box' models fail to effectively decouple these composite dynamics or provide the interpretability needed for trustworthy financial decision-making. To overcome these limitations, we propose DecoKAN, an interpretable forecasting framework that integrates multi-level Discrete Wavelet Transform (DWT) for decoupling and hierarchical signal decomposition with Kolmogorov-Arnold Network (KAN) mixers for transparent and interpretable nonlinear modeling. The DWT component decomposes complex cryptocurrency time series into distinct frequency components, enabling frequency-specific analysis, while KAN mixers provide intrinsically interpretable spline-based mappings within each decomposed subseries. Furthermore, interpretability is enhanced through a symbolic analysis pipeline involving sparsification, pruning, and symbolization, which produces concise analytical expressions offering symbolic representations of the learned patterns. Extensive experiments demonstrate that DecoKAN achieves the lowest average Mean Squared Error on all tested real-world cryptocurrency datasets (BTC, ETH, XMR), consistently outperforming a comprehensive suite of competitive state-of-the-art baselines. These results validate DecoKAN's potential to bridge the gap between predictive accuracy and model transparency, advancing trustworthy decision support within complex cryptocurrency markets.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Sangoh Lee, Sangwoo Mo, Wook-Shin Han
RoboticsMultimodal
ๆฅ็ๆ่ฆ
While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup", where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
๐
2025-12-23
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Rahul Yumlembam, Biju Issac, Seibu Mary Jacob, Longzhi Yang
Machine Learning
ๆฅ็ๆ่ฆ
Since the Internet of Things (IoT) is widely adopted using Android applications, detecting malicious Android apps is essential. In recent years, Android graph-based deep learning research has proposed many approaches to extract relationships from applications as graphs to generate graph embeddings. First, we demonstrate the effectiveness of graph-based classification using a Graph Neural Network (GNN)-based classifier to generate API graph embeddings. The graph embeddings are combined with Permission and Intent features to train multiple machine learning and deep learning models for Android malware detection. The proposed classification approach achieves an accuracy of 98.33 percent on the CICMaldroid dataset and 98.68 percent on the Drebin dataset. However, graph-based deep learning models are vulnerable, as attackers can add fake relationships to evade detection by the classifier. Second, we propose a Generative Adversarial Network (GAN)-based attack algorithm named VGAE-MalGAN targeting graph-based GNN Android malware classifiers. The VGAE-MalGAN generator produces adversarial malware API graphs, while the VGAE-MalGAN substitute detector attempts to mimic the target detector. Experimental results show that VGAE-MalGAN can significantly reduce the detection rate of GNN-based malware classifiers. Although the model initially fails to detect adversarial malware, retraining with generated adversarial samples improves robustness and helps mitigate adversarial attacks.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ming Li, Chenrui Fan, Yize Cheng, Soheil Feizi, Tianyi Zhou
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhe Sun, Xueyuan Yang, Yujie Lu, Zhenliang Zhang
RoboticsNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The integration of embodied agents into human environments demands embodied social intelligence: reasoning over both social norms and physical constraints. However, existing evaluations fail to address this integration, as they are limited to either disembodied social reasoning (e.g., in text) or socially-agnostic physical tasks. Both approaches fail to assess an agent's ability to integrate and trade off both physical and social constraints within a realistic, embodied context. To address this challenge, we introduce Spatially Situated Social Intelligence Test (S$^{3}$IT), a benchmark specifically designed to evaluate embodied social intelligence. It is centered on a novel and challenging seat-ordering task, requiring an agent to arrange seating in a 3D environment for a group of large language model-driven (LLM-driven) NPCs with diverse identities, preferences, and intricate interpersonal relationships. Our procedurally extensible framework generates a vast and diverse scenario space with controllable difficulty, compelling the agent to acquire preferences through active dialogue, perceive the environment via autonomous exploration, and perform multi-objective optimization within a complex constraint network. We evaluate state-of-the-art LLMs on S$^{3}$IT and found that they still struggle with this problem, showing an obvious gap compared with the human baseline. Results imply that LLMs have deficiencies in spatial intelligence, yet simultaneously demonstrate their ability to achieve near human-level competence in resolving conflicts that possess explicit textual cues.
๐
2025-12-23
๐ ArXiv Preprint (cs.SE)
๐ฅ Zhe Yin, Xiaodong Gu, Beijun Shen
Natural Language Processing
ๆฅ็ๆ่ฆ
Code language models excel on code intelligence tasks, yet their internal interpretability is underexplored. Existing neuron interpretability techniques from NLP are suboptimal for source code due to programming languages formal, hierarchical, and executable nature. We empirically investigate code LLMs at the neuron level, localizing language-specific neurons (selectively responsive to one language) and concept layers (feed-forward layers encoding language-agnostic code representations). We analyze Llama-3.1-8B and Qwen2.5-Coder-32B on multilingual inputs in C++, Java, Python, Go, and JavaScript, measuring neuron selectivity and layerwise contributions during generation. We find (1) neurons specialized for individual languages alongside a universal subset supporting general-purpose generation; and (2) lower layers mainly encode language-specific syntax, while middle layers capture semantic abstractions shared across languages, emerging as concept layers. We demonstrate utility on three tasks: neuron-guided fine-tuning for code generation, clone detection via concept-layer embeddings, and concept-layer-guided transfer for code summarization, each yielding consistent gains in multilingual settings.
๐
2025-12-23
๐ ArXiv Preprint (quant-ph)
๐ฅ Fernando M. de Paula Neto, Lucas dos Reis Silva, Paulo S. G. de Mattos Neto, Felipe F. Fanchini
Machine Learning
ๆฅ็ๆ่ฆ
The performance of quantum neural network models depends strongly on architectural decisions, including circuit depth, placement of parametrized operations, and data-encoding strategies. Selecting an effective architecture is challenging and closely related to the classical difficulty of choosing suitable neural-network topologies, which is computationally hard. This work investigates automated quantum-circuit construction for regression tasks and introduces a genetic-algorithm framework that discovers Reduced Regressor QNN architectures. The approach explores depth, parametrized gate configurations, and flexible data re-uploading patterns, formulating the construction of quantum regressors as an optimization process. The discovered circuits are evaluated against seventeen classical regression models on twenty-two nonlinear benchmark functions and four analytical functions. Although classical methods often achieve comparable results, they typically require far more parameters, whereas the evolved quantum models remain compact while providing competitive performance. We further analyze dataset complexity using twelve structural descriptors and show, across five increasingly challenging meta-learning scenarios, that these measures can reliably predict which quantum architecture will perform best. The results demonstrate perfect or near-perfect predictive accuracy in several scenarios, indicating that complexity metrics offer powerful and compact representations of dataset structure and can effectively guide automated model selection. Overall, this study provides a principled basis for meta-learning-driven quantum architecture design and advances the understanding of how quantum models behave in regression settings--a topic that has received limited exploration in prior work. These findings pave the way for more systematic and theoretically grounded approaches to quantum regression.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Luciano Araujo Dourado Filho, Rodrigo Tripodi Calumby
ๆฅ็ๆ่ฆ
Intra-class variability is given according to the significance in the degree of dissimilarity between images within a class. In that sense, depending on its intensity, intra-class variability can hinder the learning process for DL models, specially when such classes are also underrepresented, which is a very common scenario in Fine-Grained Visual Categorization (FGVC) tasks. This paper proposes a novel method that aims at leveraging classification performance in FGVC tasks by learning fine-grained features via classification of class-wise cluster assignments. Our goal is to apply clustering over each class individually, which can allow to discover pseudo-labels that encodes a latent degree of similarity between images. In turn, those labels can be employed in a hierarchical classification process that allows to learn more fine-grained visual features and thereby mitigating intra-class variability issues. Initial experiments over the PlantNet300k enabled to shed light upon several key points in which future work will have to be developed in order to find more conclusive evidence regarding the effectiveness of our method. Our method still achieves state-of-the-art performance on the PlantNet300k dataset even though some of its components haven't been shown to be fully optimized. Our code is available at \href{https://github.com/ADAM-UEFS/FGDCC}{https://github.com/ADAM-UEFS/FGDCC}.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Luciano Araujo Dourado Filho, Almir Moreira da Silva Neto, Rodrigo Pereira David, Rodrigo Tripodi Calumby
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper presents an approach developed to address the PlantClef 2025 challenge, which consists of a fine-grained multi-label species identification, over high-resolution images. Our solution focused on employing class prototypes obtained from the training dataset as a proxy guidance for training a segmentation Vision Transformer (ViT) on the test set images. To obtain these representations, the proposed method extracts features from training dataset images and create clusters, by applying K-Means, with $K$ equals to the number of classes in the dataset. The segmentation model is a customized narrow ViT, built by replacing the patch embedding layer with a frozen DinoV2, pre-trained on the training dataset for individual species classification. This model is trained to reconstruct the class prototypes of the training dataset from the test dataset images. We then use this model to obtain attention scores that enable to identify and localize areas of interest and consequently guide the classification process. The proposed approach enabled a domain-adaptation from multi-class identification with individual species, into multi-label classification from high-resolution vegetation plots. Our method achieved fifth place in the PlantCLEF 2025 challenge on the private leaderboard, with an F1 score of 0.33331. Besides that, in absolute terms our method scored 0.03 lower than the top-performing submission, suggesting that it may achieved competitive performance in the benchmark task. Our code is available at \href{https://github.com/ADAM-UEFS/PlantCLEF2025}{https://github.com/ADAM-UEFS/PlantCLEF2025}.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zixuan Huang, Xiang Li, Zhaoyang Lv, James M. Rehg
ๆฅ็ๆ่ฆ
Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Eric Yeh, John Cadigan, Ran Chen, Dick Crouch, Melinda Gervasio et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent research has explored using very large language models (LLMs) as proxies for humans in tasks such as simulation, surveys, and studies. While LLMs do not possess a human psychology, they often can emulate human behaviors with sufficiently high fidelity to drive simulations to test human behavioral hypotheses, exhibiting more nuance and range than the rule-based agents often employed in behavioral economics. One key area of interest is the effect of personality on decision making, but the requirement that a prompt must be created for every tested personality profile introduces experimental overhead and degrades replicability. To address this issue, we leverage interpolative decoding, representing each dimension of personality as a pair of opposed prompts and employing an interpolation parameter to simulate behavior along the dimension. We show that interpolative decoding reliably modulates scores along each of the Big Five dimensions. We then show how interpolative decoding causes LLMs to mimic human decision-making behavior in economic games, replicating results from human psychological research. Finally, we present preliminary results of our efforts to ``twin'' individual human players in a collaborative game through systematic search for points in interpolation space that cause the system to replicate actions taken by the human subject.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianhong Bai, Xiaoshi Wu, Xintao Wang, Fu Xiao, Yuanxing Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+ which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng et al.
Multimodal
ๆฅ็ๆ่ฆ
Cognitive science suggests that spatial ability develops progressively-from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic-negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng et al.
Robotics
ๆฅ็ๆ่ฆ
Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency, they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daewoon Kim, Si Young Yie, Jae Sung Lee
Robotics
ๆฅ็ๆ่ฆ
This paper proposes FedPOD (Proportionally Orchestrated Derivative) for optimizing learning efficiency and communication cost in federated learning among multiple clients. Inspired by FedPIDAvg, we define a round-wise task for FedPOD to enhance training efficiency. FedPIDAvg achieved performance improvement by incorporating the training loss reduction for prediction entropy as weights using differential terms. Furthermore, by modeling data distribution with a Poisson distribution and using a PID controller, it reduced communication costs even in skewed data distribution. However, excluding participants classified as outliers based on the Poisson distribution can limit data utilization. Additionally, PID controller requires the same participants to be maintained throughout the federated learning process as it uses previous rounds' learning information in the current round. In our approach, FedPOD addresses these issues by including participants excluded as outliers, eliminating dependency on previous rounds' learning information, and applying a method for calculating validation loss at each round. In this challenge, FedPOD presents comparable performance to FedPIDAvg in metrics of Dice score, 0.78, 0.71 and 0.72 for WT, ET and TC in average, and projected convergence score, 0.74 in average. Furthermore, the concept of FedPOD draws inspiration from Kubernetes' smallest computing unit, POD, designed to be compatible with Kubernetes auto-scaling. Extending round-wise tasks of FedPOD to POD units allows flexible design by applying scale-out similar to Kubernetes' auto-scaling. This work demonstrated the potentials of FedPOD to enhance federated learning by improving efficiency, flexibility, and performance in metrics.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Soowon Son, Honggyu An, Chaehyun Kim, Hyunah Ko, Jisu Nam et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with 8 times smaller batch size, DiTracker achieves state-of-the-art performance on challenging ITTO benchmark and matches or outperforms state-of-the-art models on TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dhruv Anand, Ehsan Shareghi
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl et al.
Robotics
ๆฅ็ๆ่ฆ
Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles' actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performances on Longest6~v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at https://github.com/autonomousvision/lead.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment.
We propose FlashVLM, a text guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log domain weighting and temperature controlled sharpening. In addition, a diversity preserving partition retains a minimal yet representative set of background tokens to maintain global context.
Under identical token budgets and evaluation protocols, FlashVLM achieves beyond lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8 percent of visual tokens on LLaVA 1.5, and maintaining 92.8 percent accuracy even under 94.4 percent compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state of the art efficiency performance trade offs while maintaining strong robustness and generalization across mainstream VLMs.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLM) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolvement of object geometry and relationship in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anna ล รกrovรก Mikeลกtรญkovรก, Mรฉdรฉric Fourmy, Martin Cรญfka, Josef Sivic, Vladimir Petrik
Computer Vision
ๆฅ็ๆ่ฆ
Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaitong Cai, Jensen Zhang, Jing Yang, Keze Wang
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
We introduce SirenPose, a geometry-aware loss formulation that integrates the periodic activation properties of sinusoidal representation networks with keypoint-based geometric supervision, enabling accurate and temporally consistent reconstruction of dynamic 3D scenes from monocular videos. Existing approaches often struggle with motion fidelity and spatiotemporal coherence in challenging settings involving fast motion, multi-object interaction, occlusion, and rapid scene changes. SirenPose incorporates physics inspired constraints to enforce coherent keypoint predictions across both spatial and temporal dimensions, while leveraging high frequency signal modeling to capture fine grained geometric details. We further expand the UniKPT dataset to 600,000 annotated instances and integrate graph neural networks to model keypoint relationships and structural correlations. Extensive experiments on benchmarks including Sintel, Bonn, and DAVIS demonstrate that SirenPose consistently outperforms state-of-the-art methods. On DAVIS, SirenPose achieves a 17.8 percent reduction in FVD, a 28.7 percent reduction in FID, and a 6.0 percent improvement in LPIPS compared to MoSCA. It also improves temporal consistency, geometric accuracy, user score, and motion smoothness. In pose estimation, SirenPose outperforms Monst3R with lower absolute trajectory error as well as reduced translational and rotational relative pose error, highlighting its effectiveness in handling rapid motion, complex dynamics, and physically plausible reconstruction.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gorjan Radevski
MultimodalComputer VisionNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing unique challenges in multimodal machine learning.
Chapter 3 introduces Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements between clip-arts. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding.
Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function leveraging spatial co-occurrences of medical terms to create interpretable mappings, significantly enhancing medical text navigability.
Chapter 5 tackles translating structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights.
Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method fusing video frames and object detection representations, improving recognition robustness and accuracy.
Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic multimodal fusion-based capabilities, reducing computational requirements while maintaining performance.
These contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems' ability to process complex, multimodal inputs across diverse applications.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ James E. Gallagher, Edward J. Oughton, Jana Kosecka
Natural Language Processing
ๆฅ็ๆ่ฆ
Landmines remain a persistent humanitarian threat, with 110 million actively deployed mines across 60 countries, claiming 26,000 casualties annually. This research evaluates adaptive Red-Green-Blue (RGB) and Long-Wave Infrared (LWIR) fusion for Unmanned Aerial Systems (UAS)-based detection of surface-laid landmines, leveraging the thermal contrast between the ordnance and the surrounding soil to enhance feature extraction. Using You Only Look Once (YOLO) architectures (v8, v10, v11) across 114 test images, generating 35,640 model-condition evaluations, YOLOv11 achieved optimal performance (86.8% mAP), with 10 to 30% thermal fusion at 5 to 10m altitude identified as the optimal detection parameters. A complementary architectural comparison revealed that while RF-DETR achieved the highest accuracy (69.2% mAP), followed by Faster R-CNN (67.6%), YOLOv11 (64.2%), and RetinaNet (50.2%), YOLOv11 trained 17.7 times faster than the transformer-based RF-DETR (41 minutes versus 12 hours), presenting a critical accuracy-efficiency tradeoff for operational deployment. Aggregated multi-temporal training datasets outperformed season-specific approaches by 1.8 to 9.6%, suggesting that models benefit from exposure to diverse thermal conditions. Anti-Tank (AT) mines achieved 61.9% detection accuracy, compared with 19.2% for Anti-Personnel (AP) mines, reflecting both the size differential and thermal-mass differences between these ordnance classes. As this research examined surface-laid mines where thermal contrast is maximized, future research should quantify thermal contrast effects for mines buried at varying depths across heterogeneous soil types.
๐
2025-12-23
๐ SIGGRAPH
๐ฅ Yiming Zhao, Yuanpeng Gao, Yuxuan Luo, Jiwei Duan, Shisong Lin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at https://github.com/ZYM-PKU/UTDesign.
๐
2025-12-23
๐ ArXiv Preprint (physics.optics)
๐ฅ Cagatay Isil, Alexander Chen, Yuhang Li, F. Onuralp Ardic, Shiqi Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
3D image display is essential for next-generation volumetric imaging; however, dense depth multiplexing for 3D image projection remains challenging because diffraction-induced cross-talk rapidly increases as the axial image planes get closer. Here, we introduce a 3D display system comprising a digital encoder and a diffractive optical decoder, which simultaneously projects different images onto multiple target axial planes with high axial resolution. By leveraging multi-layer diffractive wavefront decoding and deep learning-based end-to-end optimization, the system achieves high-fidelity depth-resolved 3D image projection in a snapshot, enabling axial plane separations on the order of a wavelength. The digital encoder leverages a Fourier encoder network to capture multi-scale spatial and frequency-domain features from input images, integrates axial position encoding, and generates a unified phase representation that simultaneously encodes all images to be axially projected in a single snapshot through a jointly-optimized diffractive decoder. We characterized the impact of diffractive decoder depth, output diffraction efficiency, spatial light modulator resolution, and axial encoding density, revealing trade-offs that govern axial separation and 3D image projection quality. We further demonstrated the capability to display volumetric images containing 28 axial slices, as well as the ability to dynamically reconfigure the axial locations of the image planes, performed on demand. Finally, we experimentally validated the presented approach, demonstrating close agreement between the measured results and the target images. These results establish the diffractive 3D display system as a compact and scalable framework for depth-resolved snapshot 3D image projection, with potential applications in holographic displays, AR/VR interfaces, and volumetric optical computing.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anh Dao, Manh Tran, Yufei Zhang, Xiaoming Liu, Zijun Cui
ๆฅ็ๆ่ฆ
Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gain observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL's ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.
๐
2025-12-23
๐ ArXiv Preprint (eess.IV)
๐ฅ Muhammad Usman, Azka Rehman, Muhammad Mutti Ur Rehman, Abd Ur Rehman, Muhammad Umar Farooq
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Accurate segmentation of ischemic stroke lesions from diffusion magnetic resonance imaging (MRI) is essential for clinical decision-making and outcome assessment. Diffusion-Weighted Imaging (DWI) and Apparent Diffusion Coefficient (ADC) scans provide complementary information on acute and sub-acute ischemic changes; however, automated lesion delineation remains challenging due to variability in lesion appearance.
In this work, we study ischemic stroke lesion segmentation using multimodal diffusion MRI from the ISLES 2022 dataset. Several state-of-the-art convolutional and transformer-based architectures, including U-Net variants, Swin-UNet, and TransUNet, are benchmarked. Based on performance, a dual-encoder TransUNet architecture is proposed to learn modality-specific representations from DWI and ADC inputs. To incorporate spatial context, adjacent slice information is integrated using a three-slice input configuration.
All models are trained under a unified framework and evaluated using the Dice Similarity Coefficient (DSC). Results show that transformer-based models outperform convolutional baselines, and the proposed dual-encoder TransUNet achieves the best performance, reaching a Dice score of 85.4% on the test set. The proposed framework offers a robust solution for automated ischemic stroke lesion segmentation from diffusion MRI.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ji Song, Xing Wang, Jianguo Wu, Xiaowei Yue
ๆฅ็ๆ่ฆ
In the realm of diverse high-dimensional data, images play a significant role across various processes of manufacturing systems where efficient image anomaly detection has emerged as a core technology of utmost importance. However, when applied to textured defect images, conventional anomaly detection methods have limitations including non-negligible misidentification, low robustness, and excessive reliance on large-scale and structured datasets. This paper proposes a texture basis integrated smooth decomposition (TBSD) approach, which is targeted at efficient anomaly detection in textured images with smooth backgrounds and sparse anomalies. Mathematical formulation of quasi-periodicity and its theoretical properties are investigated for image texture estimation. TBSD method consists of two principal processes: the first process learns the texture basis functions to effectively extract quasi-periodic texture patterns; the subsequent anomaly detection process utilizes that texture basis as prior knowledge to prevent texture misidentification and capture potential anomalies with high accuracy.The proposed method surpasses benchmarks with less misidentification, smaller training dataset requirement, and superior anomaly detection performance on both simulation and real-world datasets.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Abdullah Al Shafi, Abdul Muntakim, Pintu Chandra Shill, Rowzatul Zannat, Abdullah Al-Amin
Machine Learning
ๆฅ็ๆ่ฆ
Skin cancer can be identified by dermoscopic examination and ocular inspection, but early detection significantly increases survival chances. Artificial intelligence (AI), using annotated skin images and Convolutional Neural Networks (CNNs), improves diagnostic accuracy. This paper presents an early skin cancer classification method using a soft voting ensemble of CNNs. In this investigation, three benchmark datasets, namely HAM10000, ISIC 2016, and ISIC 2019, were used. The process involved rebalancing, image augmentation, and filtering techniques, followed by a hybrid dual encoder for segmentation via transfer learning. Accurate segmentation focused classification models on clinically significant features, reducing background artifacts and improving accuracy. Classification was performed through an ensemble of MobileNetV2, VGG19, and InceptionV3, balancing accuracy and speed for real-world deployment. The method achieved lesion recognition accuracies of 96.32\%, 90.86\%, and 93.92\% for the three datasets. The system performance was evaluated using established skin lesion detection metrics, yielding impressive results.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mihai Suteu, Ovidiu Serban
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-task learning (MTL) aims to leverage shared knowledge across tasks to improve generalization and parameter efficiency, yet balancing resources and mitigating interference remain open challenges. Architectural solutions often introduce elaborate task-specific modules or routing schemes, increasing complexity and overhead. In this work, we show that normalization layers alone are sufficient to address many of these challenges. Simply replacing shared normalization with task-specific variants already yields competitive performance, questioning the need for complex designs. Building on this insight, we propose Task-Specific Sigmoid Batch Normalization (TS$ฯ$BN), a lightweight mechanism that enables tasks to softly allocate network capacity while fully sharing feature extractors. TS$ฯ$BN improves stability across CNNs and Transformers, matching or exceeding performance on NYUv2, Cityscapes, CelebA, and PascalContext, while remaining highly parameter-efficient. Moreover, its learned gates provide a natural framework for analyzing MTL dynamics, offering interpretable insights into capacity allocation, filter specialization, and task relationships. Our findings suggest that complex MTL architectures may be unnecessary and that task-specific normalization offers a simple, interpretable, and efficient alternative.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pedro Domingos, Joรฃo Pereira, Vasco Lopes, Joรฃo Neves, David Semedo
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Automated video surveillance with Large Vision-Language Models is limited by their inherent bias towards normality, often failing to detect crimes. While Chain-of-Thought reasoning strategies show significant potential for improving performance in language tasks, the lack of inductive anomaly biases in their reasoning further steers the models towards normal interpretations. To address this, we propose Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias in the reasoning process through a final, anomaly-focused classification layer. Our method significantly improves Anomaly Detection, boosting F1-score by 11.8 p.p. on challenging low-resolution footage and Anomaly Classification by 3.78 p.p. in high-resolution videos.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junho Yoon, Jaemo Jung, Hyunju Kim, Dongman Lee
ๆฅ็ๆ่ฆ
Aligning egocentric video with wearable sensors have shown promise for human action recognition, but face practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions sharing similar temporal patterns with different spatio-semantic contexts. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features discovered via online clustering provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ YuChe Hsu, AnJui Wang, TsaiChing Ni, YuanFu Yang
Multimodal
ๆฅ็ๆ่ฆ
We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning between textual descriptions, spatial structures, and simulation logic. In parallel, three novel evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed specifically for this task to comprehensively evaluate structural integrity, parameter fidelity, and simulator executability. Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.
๐
2025-12-23
๐ AAAI 2026
๐ฅ Linfei Li, Lin Zhang, Zhong Wang, Ying Shen
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in generative AI have accelerated the production of ultra-high-resolution visual content, posing significant challenges for efficient compression and real-time decoding on end-user devices. Inspired by 3D Gaussian Splatting, recent 2D Gaussian image models improve representation efficiency, yet existing methods struggle to balance compression ratio and reconstruction fidelity in ultra-high-resolution scenarios. To address this issue, we propose SmartSplat, a highly adaptive and feature-aware GS-based image compression framework that supports arbitrary image resolutions and compression ratios. SmartSplat leverages image-aware features such as gradients and color variances, introducing a Gradient-Color Guided Variational Sampling strategy together with an Exclusion-based Uniform Sampling scheme to improve the non-overlapping coverage of Gaussian primitives in pixel space. In addition, we propose a Scale-Adaptive Gaussian Color Sampling method to enhance color initialization across scales. Through joint optimization of spatial layout, scale, and color initialization, SmartSplat efficiently captures both local structures and global textures using a limited number of Gaussians, achieving high reconstruction quality under strong compression. Extensive experiments on DIV8K and a newly constructed 16K dataset demonstrate that SmartSplat consistently outperforms state-of-the-art methods at comparable compression ratios and exceeds their compression limits, showing strong scalability and practical applicability. The code is publicly available at https://github.com/lif314/SmartSplat.
๐
2025-12-23
๐ ICASSP 2026
๐ฅ Marta Moscati, Ahmed Abdullah, Muhammad Saad Saeed, Shah Nawaz, Rohan Kumar Das et al.
ๆฅ็ๆ่ฆ
Over half of the world's population is bilingual and people often communicate under multilingual scenarios. The Face-Voice Association in Multilingual Environments (FAME) 2026 Challenge, held at ICASSP 2026, focuses on developing methods for face-voice association that are effective when the language at test-time is different than the training one. This report provides a brief summary of the challenge.
๐
2025-12-23
๐ ArXiv Preprint (eess.IV)
๐ฅ Yujia Fu, Zhiyu Dong, Tianwen Qian, Chenye Zheng, Danian Ji et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Accurate assessment of bowel cleanliness is essential for effective colonoscopy procedures. The Boston Bowel Preparation Scale (BBPS) offers a standardized scoring system but suffers from subjectivity and inter-observer variability when performed manually. In this paper, to support robust training and evaluation, we construct a high-quality colonoscopy dataset comprising 2,240 images from 517 subjects, annotated with expert-agreed BBPS scores. We propose a novel automated BBPS scoring framework that leverages the CLIP model with adapter-based transfer learning and a dedicated fecal-feature extraction branch. Our method fuses global visual features with stool-related textual priors to improve the accuracy of bowel cleanliness evaluation without requiring explicit segmentation. Extensive experiments on both our dataset and the public NERTHU dataset demonstrate the superiority of our approach over existing baselines, highlighting its potential for clinical deployment in computer-aided colonoscopy analysis.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin
Computer VisionRoboticsNatural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of **thinking** based on verification, targeted correction, and early stopping.
We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free, model-agnostic framework that brings this structured reasoning paradigm to multimodal image generation. CRAFT decomposes a prompt into dependency-structured visual questions, veries generated images using a vision-language model, and applies targeted prompt edits through an LLM agent only where constraints fail. The process iterates with an explicit stopping criterion once all constraints are satised, yielding an interpretable and controllable inference-time renement loop.
Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Maximilian Witte, Johannes Meuer, รtienne Plรฉsiat, Christopher Kadow
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Accurate and physically consistent modeling of Earth system dynamics requires machine-learning architectures that operate directly on continuous geophysical fields and preserve their underlying geometric structure. Here we introduce Field-Space attention, a mechanism for Earth system Transformers that computes attention in the physical domain rather than in a learned latent space. By maintaining all intermediate representations as continuous fields on the sphere, the architecture enables interpretable internal states and facilitates the enforcement of scientific constraints. The model employs a fixed, non-learned multiscale decomposition and learns structure-preserving deformations of the input field, allowing coherent integration of coarse and fine-scale information while avoiding the optimization instabilities characteristic of standard single-scale Vision Transformers. Applied to global temperature super-resolution on a HEALPix grid, Field-Space Transformers converge more rapidly and stably than conventional Vision Transformers and U-Net baselines, while requiring substantially fewer parameters. The explicit preservation of field structure throughout the network allows physical and statistical priors to be embedded directly into the architecture, yielding improved fidelity and reliability in data-driven Earth system modeling. These results position Field-Space Attention as a compact, interpretable, and physically grounded building block for next-generation Earth system prediction and generative modeling frameworks.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qingdong He, Xueqin Chen, Yanjie Pan, Peng Tang, Pengcheng Xu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets also restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently,two tailored keyframe-driven modules, the garment details enhancement module and the collaborative background optimization module, are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by keyframes.These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, while simultaneously avoiding additional complexity. In addition, our dataset ViT-HD comprises 15, 070 high-quality video samples at a resolution of 810*1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang
Robotics
ๆฅ็ๆ่ฆ
Visual-language reasoning, driving knowledge, and value alignment are essential for advanced autonomous driving systems. However, existing approaches largely rely on data-driven learning, making it difficult to capture the complex logic underlying decision-making through imitation or limited reinforcement rewards. To address this, we propose KnowVal, a new autonomous driving system that enables visual-language reasoning through the synergistic integration of open-world perception and knowledge retrieval. Specifically, we construct a comprehensive driving knowledge graph that encodes traffic laws, defensive driving principles, and ethical norms, complemented by an efficient LLM-based retrieval mechanism tailored for driving scenarios. Furthermore, we develop a human-preference dataset and train a Value Model to guide interpretable, value-aligned trajectory assessment. Experimental results show that our method substantially improves planning performance while remaining compatible with existing architectures. Notably, KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Joon Son Chung, Shinji Watanabe
Multimodal
ๆฅ็ๆ่ฆ
The objective of this paper is to jointly synthesize interactive videos and conversational speech from text and reference images. With the ultimate goal of building human-like conversational systems, recent studies have explored talking or listening head generation as well as conversational speech generation. However, these works are typically studied in isolation, overlooking the multimodal nature of human conversation, which involves tightly coupled audio-visual interactions. In this paper, we introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. TAVID integrates face and speech generation pipelines through two cross-modal mappers (i.e., a motion mapper and a speaker mapper), which enable bidirectional exchange of complementary information between the audio and visual modalities. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction fluency, and speech quality. Extensive experiments demonstrate the effectiveness of our approach across all these aspects.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Akshat Dubey, Aleksandar Anลพel, Bahar ฤฐlgen, Georges Hattab
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in deep learning have led to its widespread adoption across diverse domains, including medical imaging. This progress is driven by increasingly sophisticated model architectures, such as ResNets, Vision Transformers, and Hybrid Convolutional Neural Networks, that offer enhanced performance at the cost of greater complexity. This complexity often compromises model explainability and interpretability. SHAP has emerged as a prominent method for providing interpretable visualizations that aid domain experts in understanding model predictions. However, SHAP explanations can be unstable and unreliable in the presence of epistemic and aleatoric uncertainty. In this study, we address this challenge by using Dirichlet posterior sampling and Dempster-Shafer theory to quantify the uncertainty that arises from these unstable explanations in medical imaging applications. The framework uses a belief, plausible, and fusion map approach alongside statistical quantitative analysis to produce quantification of uncertainty in SHAP. Furthermore, we evaluated our framework on three medical imaging datasets with varying class distributions, image qualities, and modality types which introduces noise due to varying image resolutions and modality-specific aspect covering the examples from pathology, ophthalmology, and radiology, introducing significant epistemic uncertainty.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiawei Ge, Jiuxin Cao, Xinyi Li, Xuelin Zhu, Chang Liu et al.
Computer Vision
ๆฅ็ๆ่ฆ
Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose ${D}^{3}$ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, ${D}^{3}$ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniele Cardullo, Simone Teglia, Irene Amerini
RoboticsMultimodal
ๆฅ็ๆ่ฆ
With the rise of easily accessible tools for generating and manipulating multimedia content, realistic synthetic alterations to digital media have become a widespread threat, often involving manipulations across multiple modalities simultaneously. Recently, such techniques have been increasingly employed to distort narratives of important events and to spread misinformation on social media, prompting the development of misinformation detectors. In the context of misinformation conveyed through image-text pairs, several detection methods have been proposed. However, these approaches typically rely on computationally intensive architectures or require large amounts of annotated data. In this work we introduce LADLE-MM: Limited Annotation based Detector with Learned Ensembles for Multimodal Misinformation, a model-soup initialized multimodal misinformation detector designed to operate under a limited annotation setup and constrained training resources. LADLE-MM is composed of two unimodal branches and a third multimodal one that enhances image and text representations with additional multimodal embeddings extracted from BLIP, serving as fixed reference space. Despite using 60.3% fewer trainable parameters than previous state-of-the-art models, LADLE-MM achieves competitive performance on both binary and multi-label classification tasks on the DGM4 benchmark, outperforming existing methods when trained without grounding annotations. Moreover, when evaluated on the VERITE dataset, LADLE-MM outperforms current state-of-the-art approaches that utilize more complex architectures involving Large Vision-Language-Models, demonstrating the effective generalization ability in an open-set setting and strong robustness to unimodal bias.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinghao Shi, Jianing Song
Computer Vision
ๆฅ็ๆ่ฆ
High-resolution remote sensing image semantic segmentation (HRSS) is a fundamental yet critical task in the field of Earth observation. However, it has long faced the challenges of high inter-class similarity and large intra-class variability. Existing approaches often struggle to effectively inject abstract yet strongly discriminative semantic knowledge into pixel-level feature learning, leading to blurred boundaries and class confusion in complex scenes. To address these challenges, we propose Bidirectional Co-Refinement Framework for HRSS (BiCoR-Seg). Specifically, we design a Heatmap-driven Bidirectional Information Synergy Module (HBIS), which establishes a bidirectional information flow between feature maps and class embeddings by generating class-level heatmaps. Based on HBIS, we further introduce a hierarchical supervision strategy, where the interpretable heatmaps generated by each HBIS module are directly utilized as low-resolution segmentation predictions for supervision, thereby enhancing the discriminative capacity of shallow features. In addition, to further improve the discriminability of the embedding representations, we propose a cross-layer class embedding Fisher Discriminative Loss to enforce intra-class compactness and enlarge inter-class separability. Extensive experiments on the LoveDA, Vaihingen, and Potsdam datasets demonstrate that BiCoR-Seg achieves outstanding segmentation performance while offering stronger interpretability. The released code is available at https://github.com/ShiJinghao566/BiCoR-Seg.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Binfeng Wang, Di Wang, Haonan Guo, Ying Fu, Jing Zhang
ๆฅ็ๆ่ฆ
Unified hyperspectral image (HSI) restoration aims to recover various degraded HSIs using a single model, offering great practical value. However, existing methods often depend on explicit degradation priors (e.g., degradation labels) as prompts to guide restoration, which are difficult to obtain due to complex and mixed degradations in real-world scenarios. To address this challenge, we propose a Degradation-Aware Metric Prompting (DAMP) framework. Instead of relying on predefined degradation priors, we design spatial-spectral degradation metrics to continuously quantify multi-dimensional degradations, serving as Degradation Prompts (DP). These DP enable the model to capture cross-task similarities in degradation distributions and enhance shared feature learning. Furthermore, we introduce a Spatial-Spectral Adaptive Module (SSAM) that dynamically modulates spatial and spectral feature extraction through learnable parameters. By integrating SSAM as experts within a Mixture-of-Experts architecture, and using DP as the gating router, the framework enables adaptive, efficient, and robust restoration under diverse, mixed, or unseen degradations. Extensive experiments on natural and remote sensing HSI datasets show that DAMP achieves state-of-the-art performance and demonstrates exceptional generalization capability. Code is publicly available at https://github.com/MiliLab/DAMP.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuanyu Hu
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Multimodal brain decoding aims to reconstruct semantic information that is consistent with visual stimuli from brain activity signals such as fMRI, and then generate readable natural language descriptions. However, multimodal brain decoding still faces key challenges in cross-subject generalization and interpretability. We propose a BrainROI model and achieve leading-level results in brain-captioning evaluation on the NSD dataset. Under the cross-subject setting, compared with recent state-of-the-art methods and representative baselines, metrics such as BLEU-4 and CIDEr show clear improvements. Firstly, to address the heterogeneity of functional brain topology across subjects, we design a new fMRI encoder. We use multi-atlas soft functional parcellations (soft-ROI) as a shared space. We extend the discrete ROI Concatenation strategy in MINDLLM to a voxel-wise gated fusion mechanism (Voxel-gate). We also ensure consistent ROI mapping through global label alignment, which enhances cross-subject transferability. Secondly, to overcome the limitations of manual and black-box prompting methods in stability and transparency, we introduce an interpretable prompt optimization process. In a small-sample closed loop, we use a locally deployed Qwen model to iteratively generate and select human-readable prompts. This process improves the stability of prompt design and preserves an auditable optimization trajectory. Finally, we impose parameterized decoding constraints during inference to further improve the stability and quality of the generated descriptions.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Oikantik Nath, Sahithi Kukkala, Mitesh Khapra, Ravi Kiran Sarvadevabhatla
ๆฅ็ๆ่ฆ
Document layout analysis is essential for downstream tasks such as information retrieval, extraction, OCR, and digitization. However, existing large-scale datasets like PubLayNet and DocBank lack fine-grained region labels and multilingual diversity, making them insufficient for representing complex document layouts. In contrast, human-annotated datasets such as M6Doc and D4LA offer richer labels and greater domain diversity, but are too small to train robust models and lack adequate multilingual coverage. This gap is especially pronounced for Indic documents, which encompass diverse scripts yet remain underrepresented in current datasets, further limiting progress in this space. To address these shortcomings, we introduce IndicDLP, a large-scale foundational document layout dataset spanning 11 representative Indic languages alongside English and 12 common document domains. Additionally, we curate UED-mini, a dataset derived from DocLayNet and M6Doc, to enhance pretraining and provide a solid foundation for Indic layout models. Our experiments demonstrate that fine-tuning existing English models on IndicDLP significantly boosts performance, validating its effectiveness. Moreover, models trained on IndicDLP generalize well beyond Indic layouts, making it a valuable resource for document digitization. This work bridges gaps in scale, diversity, and annotation granularity, driving inclusive and efficient document understanding.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nathan Roos, Ekaterina Iakovleva, Ani Gjergji, Vito Paolo Pastore, Enzo Tartaglione
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Diffusion-based generative models demonstrate state-of-the-art performance across various image synthesis tasks, yet their tendency to replicate and amplify dataset biases remains poorly understood. Although previous research has viewed bias amplification as an inherent characteristic of diffusion models, this work provides the first analysis of how sampling algorithms and their hyperparameters influence bias amplification. We empirically demonstrate that samplers for diffusion models -- commonly optimized for sample quality and speed -- have a significant and measurable effect on bias amplification. Through controlled studies with models trained on Biased MNIST, Multi-Color MNIST and BFFHQ, and with Stable Diffusion, we show that sampling hyperparameters can induce both bias reduction and amplification, even when the trained model is fixed. Source code is available at https://github.com/How-I-met-your-bias/how_i_met_your_bias.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangxuan Ren, Zhongdao Wang, Pin Tang, Guoqing Wang, Jilai Zheng et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
3D object detection is fundamental for safe and robust intelligent transportation systems. Current multi-modal 3D object detectors often rely on complex architectures and training strategies to achieve higher detection accuracy. However, these methods heavily rely on the LiDAR sensor so that they suffer from large performance drops when LiDAR is absent, which compromises the robustness and safety of autonomous systems in practical scenarios. Moreover, existing multi-modal detectors face difficulties in deployment on diverse hardware platforms, such as NPUs and FPGAs, due to their reliance on 3D sparse convolution operators, which are primarily optimized for NVIDIA GPUs. To address these challenges, we reconsider the role of LiDAR in the camera-LiDAR fusion paradigm and introduce a novel multi-modal 3D detector, LiteFusion. Instead of treating LiDAR point clouds as an independent modality with a separate feature extraction backbone, LiteFusion utilizes LiDAR data as a complementary source of geometric information to enhance camera-based detection. This straightforward approach completely eliminates the reliance on a 3D backbone, making the method highly deployment-friendly. Specifically, LiteFusion integrates complementary features from LiDAR points into image features within a quaternion space, where the orthogonal constraints are well-preserved during network training. This helps model domain-specific relations across modalities, yielding a compact cross-modal embedding. Experiments on the nuScenes dataset show that LiteFusion improves the baseline vision-based detector by +20.4% mAP and +19.7% NDS with a minimal increase in parameters (1.1%) without using dedicated LiDAR encoders. Notably, even in the absence of LiDAR input, LiteFusion maintains strong results , highlighting its favorable robustness and effectiveness across diverse fusion paradigms and deployment scenarios.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Ye, Hongbin Ren, Chongbing Zhang, Haoran Chen, Xiaosong Li
ๆฅ็ๆ่ฆ
Given the complexity of underwater environments and the variability of water as a medium, underwater images are inevitably subject to various types of degradation. The degradations present nonlinear coupling rather than simple superposition, which renders the effective processing of such coupled degradations particularly challenging. Most existing methods focus on designing specific branches, modules, or strategies for specific degradations, with little attention paid to the potential information embedded in their coupling. Consequently, they struggle to effectively capture and process the nonlinear interactions of multiple degradations from a bottom-up perspective. To address this issue, we propose JDPNet, a joint degradation processing network, that mines and unifies the potential information inherent in coupled degradations within a unified framework. Specifically, we introduce a joint feature-mining module, along with a probabilistic bootstrap distribution strategy, to facilitate effective mining and unified adjustment of coupled degradation features. Furthermore, to balance color, clarity, and contrast, we design a novel AquaBalanceLoss to guide the network in learning from multiple coupled degradation losses. Experiments on six publicly available underwater datasets, as well as two new datasets constructed in this study, show that JDPNet exhibits state-of-the-art performance while offering a better tradeoff between performance, parameter size, and computational cost.
๐
2025-12-23
๐ CVPR 2024
๐ฅ Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
ๆฅ็ๆ่ฆ
Most existing image compression approaches perform transform coding in the pixel space to reduce its spatial redundancy. However, they encounter difficulties in achieving both high-realism and high-fidelity at low bitrate, as the pixel-space distortion may not align with human perception. To address this issue, we introduce a Generative Latent Coding (GLC) architecture, which performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE), instead of in the pixel space. The generative latent space is characterized by greater sparsity, richer semantic and better alignment with human perception, rendering it advantageous for achieving high-realism and high-fidelity compression. Additionally, we introduce a categorical hyper module to reduce the bit cost of hyper-information, and a code-prediction-based supervision to enhance the semantic consistency. Experiments demonstrate that our GLC maintains high visual quality with less than 0.04 bpp on natural images and less than 0.01 bpp on facial images. On the CLIC2020 test set, we achieve the same FID as MS-ILLM with 45% fewer bits. Furthermore, the powerful generative latent space enables various applications built on our GLC pipeline, such as image restoration and style transfer. The code is available at https://github.com/jzyustc/GLC.
๐
2025-12-23
๐ CVPR 2025
๐ฅ Hao Guo, Xugong Qin, Jun Jie Ou Yang, Peng Zhang, Gangyan Zeng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. Datasets and codes will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sofian Chaybouti, Sanath Narayan, Yasser Dahou, Phรบc H. Lรช Khac, Ankit Singh et al.
ๆฅ็ๆ่ฆ
Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, and (3) hierarchical clustering and sampling of training data--typically reserved for self-supervised learning--substantially improves sample efficiency over random sampling for multi-teacher distillation. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts. We release OpenLVD200M and distilled models.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Grega ล uลกtar, Jer Pelhan, Alan Lukeลพiฤ, Matej Kristan
ๆฅ็ๆ่ฆ
Low-shot object counting addresses estimating the number of previously unobserved objects in an image using only few or no annotated test-time exemplars. A considerable challenge for modern low-shot counters are dense regions with small objects. While total counts in such situations are typically well addressed by density-based counters, their usefulness is limited by poor localization capabilities. This is better addressed by point-detection-based counters, which are based on query-based detectors. However, due to limited number of pre-trained queries, they underperform on images with very large numbers of objects, and resort to ad-hoc techniques like upsampling and tiling. We propose CoDi, the first latent diffusion-based low-shot counter that produces high-quality density maps on which object locations can be determined by non-maxima suppression. Our core contribution is the new exemplar-based conditioning module that extracts and adjusts the object prototypes to the intermediate layers of the denoising network, leading to accurate object location estimation. On FSC benchmark, CoDi outperforms state-of-the-art by 15% MAE, 13% MAE and 10% MAE in the few-shot, one-shot, and reference-less scenarios, respectively, and sets a new state-of-the-art on MCAC benchmark by outperforming the top method by 44% MAE. The code is available at https://github.com/gsustar/CoDi.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Robert van de Ven, Trim Bresilla, Bram Nelissen, Ard Nieuwenhuizen, Eldert J. van Henten et al.
Computer Vision
ๆฅ็ๆ่ฆ
Automating tasks in orchards is challenging because of the large amount of variation in the environment and occlusions. One of the challenges is apple pose estimation, where key points, such as the calyx, are often occluded. Recently developed pose estimation methods no longer rely on these key points, but still require them for annotations, making annotating challenging and time-consuming. Due to the abovementioned occlusions, there can be conflicting and missing annotations of the same fruit between different images. Novel 3D reconstruction methods can be used to simplify annotating and enlarge datasets. We propose a novel pipeline consisting of 3D Gaussian Splatting to reconstruct an orchard scene, simplified annotations, automated projection of the annotations to images, and the training and evaluation of a pose estimation method. Using our pipeline, 105 manual annotations were required to obtain 28,191 training labels, a reduction of 99.6%. Experimental results indicated that training with labels of fruits that are $\leq95\%$ occluded resulted in the best performance, with a neutral F1 score of 0.927 on the original images and 0.970 on the rendered images. Adjusting the size of the training dataset had small effects on the model performance in terms of F1 score and pose estimation accuracy. It was found that the least occluded fruits had the best position estimation, which worsened as the fruits became more occluded. It was also found that the tested pose estimation method was unable to correctly learn the orientation estimation of apples.
๐
2025-12-23
๐ IEEE/ACM Transactions on Audio, Speech and Languag
๐ฅ Xiang Chen, Yixin Ou, Quan Feng, Lei Li, Piji Li et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
The pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the ``pre-train, prompt, and predict'' paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.
๐
2025-12-23
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Cyrus Vachha, Yixiao Kang, Zach Dive, Ashwat Chidambaram, Anik Gupta et al.
Robotics
ๆฅ็ๆ่ฆ
Authoring 3D scenes is a central task for spatial computing applications. Competing visions for lowering existing barriers are (1) focus on immersive, direct manipulation of 3D content or (2) leverage AI techniques that capture real scenes (3D Radiance Fields such as, NeRFs, 3D Gaussian Splatting) and modify them at a higher level of abstraction, at the cost of high latency. We unify the complementary strengths of these approaches and investigate how to integrate generative AI advances into real-time, immersive 3D Radiance Field editing. We introduce Dreamcrafter, a VR-based 3D scene editing system that: (1) provides a modular architecture to integrate generative AI algorithms; (2) combines different levels of control for creating objects, including natural language and direct manipulation; and (3) introduces proxy representations that support interaction during high-latency operations. We contribute empirical findings on control preferences and discuss how generative AI interfaces beyond text input enhance creativity in scene editing and world building.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Niraj Prakash Kini, Shiau-Rung Tsai, Guan-Hsun Lin, Wen-Hsiao Peng, Ching-Wen Ma et al.
Computer Vision
ๆฅ็ๆ่ฆ
Millimeter-wave radar offers a privacy-preserving and lighting-invariant alternative to RGB sensors for Human Pose Estimation (HPE) task. However, the radar signals are often sparse due to specular reflection, making the extraction of robust features from radar signals highly challenging. To address this, we present milliMamba, a radar-based 2D human pose estimation framework that jointly models spatio-temporal dependencies across both the feature extraction and decoding stages. Specifically, given the high dimensionality of radar inputs, we adopt a Cross-View Fusion Mamba encoder to efficiently extract spatio-temporal features from longer sequences with linear complexity. A Spatio-Temporal-Cross Attention decoder then predicts joint coordinates across multiple frames. Together, this spatio-temporal modeling pipeline enables the model to leverage contextual cues from neighboring frames and joints to infer missing joints caused by specular reflections. To reinforce motion smoothness, we incorporate a velocity loss alongside the standard keypoint loss during training. Experiments on the TransHuPR and HuPR datasets demonstrate that our method achieves significant performance improvements, exceeding the baselines by 11.0 AP and 14.6 AP, respectively, while maintaining reasonable complexity. Code: https://github.com/NYCU-MAPL/milliMamba
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohammad Helal Uddin, Liam Seymour, Sabur Baidya
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Vision Transformers (ViTs) deliver state-of-the-art accuracy but their quadratic attention cost and redundant computations severely hinder deployment on latency and resource-constrained platforms. Existing pruning approaches treat either tokens or heads in isolation, relying on heuristics or first-order signals, which often sacrifice accuracy or fail to generalize across inputs. We introduce HEART-ViT, a Hessian-guided efficient dynamic attention and token pruning framework for vision transformers, which to the best of our knowledge is the first unified, second-order, input-adaptive framework for ViT optimization. HEART-ViT estimates curvature-weighted sensitivities of both tokens and attention heads using efficient Hessian-vector products, enabling principled pruning decisions under explicit loss budgets.This dual-view sensitivity reveals an important structural insight: token pruning dominates computational savings, while head pruning provides fine-grained redundancy removal, and their combination achieves a superior trade-off. On ImageNet-100 and ImageNet-1K with ViT-B/16 and DeiT-B/16, HEART-ViT achieves up to 49.4 percent FLOPs reduction, 36 percent lower latency, and 46 percent higher throughput, while consistently matching or even surpassing baseline accuracy after fine-tuning, for example 4.7 percent recovery at 40 percent token pruning. Beyond theoretical benchmarks, we deploy HEART-ViT on different edge devices such as AGX Orin, demonstrating that our reductions in FLOPs and latency translate directly into real-world gains in inference speed and energy efficiency. HEART-ViT bridges the gap between theory and practice, delivering the first unified, curvature-driven pruning framework that is both accuracy-preserving and edge-efficient.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingqi Tian, Yiheng Du, Haoji Zhang, Yuji Wang, Isaac Ning Lee et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Audio-Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio-visual misalignment, which lead to biases toward louder or larger objects while overlooking weaker, smaller, or co-occurring sources. To address these challenges, we propose DDAVS, a Disentangled Audio Semantics and Delayed Bidirectional Alignment framework. To mitigate multi-source entanglement, DDAVS employs learnable queries to extract audio semantics and anchor them within a structured semantic space derived from an audio prototype memory bank. This is further optimized through contrastive learning to enhance discriminability and robustness. To alleviate audio-visual misalignment, DDAVS introduces dual cross-attention with delayed modality interaction, improving the robustness of multimodal alignment. Extensive experiments on the AVS-Objects and VPO benchmarks demonstrate that DDAVS consistently outperforms existing approaches, exhibiting strong performance across single-source, multi-source, and multi-instance scenarios. These results validate the effectiveness and generalization ability of our framework under challenging real-world audio-visual segmentation conditions. Project page: https://trilarflagz.github.io/DDAVS-page/
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alireza Moayedikia, Sattar Dorafshan
ๆฅ็ๆ่ฆ
Deteriorating civil infrastructure requires automated inspection techniques overcoming limitations of visual assessment. While Ground Penetrating Radar and Infrared Thermography enable subsurface defect detection, single modal approaches face complementary constraints radar struggles with moisture and shallow defects, while thermography exhibits weather dependency and limited depth. This paper presents a multi modal attention network fusing radar temporal patterns with thermal spatial signatures for bridge deck delamination detection. Our architecture introduces temporal attention for radar processing, spatial attention for thermal features, and cross modal fusion with learnable embeddings discovering complementary defect patterns invisible to individual sensors. We incorporate uncertainty quantification through Monte Carlo dropout and learned variance estimation, decomposing uncertainty into epistemic and aleatoric components for safety critical decisions. Experiments on five bridge datasets reveal that on balanced to moderately imbalanced data, our approach substantially outperforms baselines in accuracy and AUC representing meaningful improvements over single modal and concatenation based fusion. Ablation studies demonstrate cross modal attention provides critical gains beyond within modality attention, while multi head mechanisms achieve improved calibration. Uncertainty quantification reduces calibration error, enabling selective prediction by rejecting uncertain cases. However, under extreme class imbalance, attention mechanisms show vulnerability to majority class collapse. These findings provide actionable guidance: attention based architecture performs well across typical scenarios, while extreme imbalance requires specialized techniques. Our system maintains deployment efficiency, enabling real time inspection with characterized capabilities and limitations.
๐
2025-12-23
๐ NeurIPS 2025
๐ฅ Thanh-Tung Le, Tuan Pham, Tung Nguyen, Deying Kong, Xiaohui Xie et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas stochastic diffusion-based methods hallucinate plausible content yet incur heavy training- and inference-time costs. In this paper, we propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plucker-ray embeddings, producing a shared latent representation. Two lightweight heads then act on this representation: (i) a feed-forward regression head that renders pixels where geometry is well constrained, and (ii) a masked autoregressive diffusion head that completes occluded or unseen regions. The entire model is trained end-to-end with joint photometric and diffusion losses, without handcrafted 3D inductive biases, enabling scalability across diverse scenes. Experiments demonstrate that our method attains state-of-the-art image quality while reducing rendering time by an order of magnitude compared with fully generative baselines.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haiyun Wei, Fan Lu, Yunwei Zhu, Zehan Zheng, Weiyi Xue et al.
Robotics
ๆฅ็ๆ่ฆ
Generating realistic and diverse LiDAR point clouds is crucial for autonomous driving simulation. Although previous methods achieve LiDAR point cloud generation from user inputs, they struggle to attain high-quality results while enabling versatile controllability, due to the imbalance between the complex distribution of LiDAR point clouds and the simple control signals. To address the limitation, we propose LiDARDraft, which utilizes the 3D layout to build a bridge between versatile conditional signals and LiDAR point clouds. The 3D layout can be trivially generated from various user inputs such as textual descriptions and images. Specifically, we represent text, images, and point clouds as unified 3D layouts, which are further transformed into semantic and depth control signals. Then, we employ a rangemap-based ControlNet to guide LiDAR point cloud generation. This pixel-level alignment approach demonstrates excellent performance in controllable LiDAR point clouds generation, enabling "simulation from scratch", allowing self-driving environments to be created from arbitrary textual descriptions, images and sketches.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Subrata Kumer Paula, Dewan Nafiul Islam Noora, Rakhi Rani Paula, Md. Ekramul Hamidb, Fahmid Al Faridc et al.
Machine Learning
ๆฅ็ๆ่ฆ
Human Activity Recognition (HAR) plays a vital role in healthcare, surveillance, and innovative environments, where reliable action recognition supports timely decision-making and automation. Although deep learning-based HAR systems are widely adopted, the impact of Activation Functions (AFs) and Model Optimizers (MOs) on performance has not been sufficiently analyzed, particularly regarding how their combinations influence model behavior in practical scenarios. Most existing studies focus on architecture design, while the interaction between AF and MO choices remains relatively unexplored. In this work, we investigate the effect of three commonly used activation functions (ReLU, Sigmoid, and Tanh) combined with four optimization algorithms (SGD, Adam, RMSprop, and Adagrad) using two recurrent deep learning architectures, namely BiLSTM and ConvLSTM. Experiments are conducted on six medically relevant activity classes selected from the HMDB51 and UCF101 datasets, considering their suitability for healthcare-oriented HAR applications. Our experimental results show that ConvLSTM consistently outperforms BiLSTM across both datasets. ConvLSTM, combined with Adam or RMSprop, achieves an accuracy of up to 99.00%, demonstrating strong spatio-temporal learning capabilities and stable performance. While BiLSTM performs reasonably well on UCF101, with accuracy approaching 98.00%, its performance drops to approximately 60.00% on HMDB51, indicating limited robustness across datasets and weaker sensitivity to AF and MO variations. This study provides practical insights for optimizing HAR systems, particularly for real-world healthcare environments where fast and precise activity detection is critical.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinyoung Choi, Youngchae Kwon, Injung Kim
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Fashion style classification is a challenging task because of the large visual variation within the same style and the existence of visually similar styles.
Styles are expressed not only by the global appearance, but also by the attributes of individual items and their combinations.
In this study, we propose an item region-based fashion style classification network (IRSN) to effectively classify fashion styles by analyzing item-specific features and their combinations in addition to global features.
IRSN extracts features of each item region using item region pooling (IRP), analyzes them separately, and combines them using gated feature fusion (GFF).
In addition, we improve the feature extractor by applying a dual-backbone architecture that combines a domain-specific feature extractor and a general feature extractor pre-trained with a large-scale image-text dataset.
In experiments, applying IRSN to six widely-used backbones, including EfficientNet, ConvNeXt, and Swin Transformer, improved style classification accuracy by an average of 6.9% and a maximum of 14.5% on the FashionStyle14 dataset and by an average of 7.6% and a maximum of 15.1% on the ShowniqV3 dataset. Visualization analysis also supports that the IRSN models are better than the baseline models at capturing differences between similar style classes.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jungwoo Kim, Jun-Hyuk Kim, Jong-Seok Lee
Robotics
ๆฅ็ๆ่ฆ
Recent advances in learned image codecs have been extended from human perception toward machine perception. However, progressive image compression with fine granular scalability (FGS)-which enables decoding a single bitstream at multiple quality levels-remains unexplored for machine-oriented codecs. In this work, we propose a novel progressive learned image compression codec for machine perception, PICM-Net, based on trit-plane coding. By analyzing the difference between human- and machine-oriented rate-distortion priorities, we systematically examine the latent prioritization strategies in terms of machine-oriented codecs. To further enhance real-world adaptability, we design an adaptive decoding controller, which dynamically determines the necessary decoding level during inference time to maintain the desired confidence of downstream machine prediction. Extensive experiments demonstrate that our approach enables efficient and adaptive progressive transmission while maintaining high performance in the downstream classification task, establishing a new paradigm for machine-aware progressive image compression.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hao Li, Fabian Deuser, Wenping Yin, Steffen Knoblauch, Wufan Zhao et al.
ๆฅ็ๆ่ฆ
As Earth's climate changes, it is impacting disasters and extreme weather events across the planet. Record-breaking heat waves, drenching rainfalls, extreme wildfires, and widespread flooding during hurricanes are all becoming more frequent and more intense. Rapid and efficient response to disaster events is essential for climate resilience and sustainability. A key challenge in disaster response is to accurately and quickly identify disaster locations to support decision-making and resources allocation. In this paper, we propose a Probabilistic Cross-view Geolocalization approach, called ProbGLC, exploring new pathways towards generative location awareness for rapid disaster response. Herein, we combine probabilistic and deterministic geolocalization models into a unified framework to simultaneously enhance model explainability (via uncertainty quantification) and achieve state-of-the-art geolocalization performance. Designed for rapid diaster response, the ProbGLC is able to address cross-view geolocalization across multiple disaster events as well as to offer unique features of probabilistic distribution and localizability score. To evaluate the ProbGLC, we conduct extensive experiments on two cross-view disaster datasets (i.e., MultiIAN and SAGAINDisaster), consisting diverse cross-view imagery pairs of multiple disaster types (e.g., hurricanes, wildfires, floods, to tornadoes). Preliminary results confirms the superior geolocalization accuracy (i.e., 0.86 in Acc@1km and 0.97 in Acc@25km) and model explainability (i.e., via probabilistic distributions and localizability scores) of the proposed ProbGLC approach, highlighting the great potential of leveraging generative cross-view approach to facilitate location awareness for better and faster disaster response. The data and code is publicly available at https://github.com/bobleegogogo/ProbGLC
๐
2025-12-23
๐ 7 pages, 5 figures. System description for the EVE
๐ฅ Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Dao Sy Duy Minh, Nguyen Hoang Minh Ngoc et al.
Multimodal
ๆฅ็ๆ่ฆ
Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A fine-tuned Qwen3 model with QLoRA then integrates this context with base captions generated by Instruct BLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions compared to traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andreas Zinonos, Michaล Stypuลkowski, Antoni Bigata, Stavros Petridis, Maja Pantic et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present FlashLips, a two-stage, mask-free lip-sync system that decouples lips control from rendering and achieves real-time performance running at over 100 FPS on a single GPU, while matching the visual quality of larger state-of-the-art models. Stage 1 is a compact, one-step latent-space editor that reconstructs an image using a reference identity, a masked target frame, and a low-dimensional lips-pose vector, trained purely with reconstruction losses - no GANs or diffusion. To remove explicit masks at inference, we use self-supervision: we generate mouth-altered variants of the target image, that serve as pseudo ground truth for fine-tuning, teaching the network to localize edits to the lips while preserving the rest. Stage 2 is an audio-to-pose transformer trained with a flow-matching objective to predict lips-poses vectors from speech. Together, these stages form a simple and stable pipeline that combines deterministic reconstruction with robust audio control, delivering high perceptual quality and faster-than-real-time speed.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chang Sun, Dongliang Xie, Bo Qin, Hong Yang
Natural Language Processing
ๆฅ็ๆ่ฆ
Visual Speech Recognition aims to transcribe spoken words from silent lip-motion videos. This task is particularly challenging for Mandarin, as visemes are highly ambiguous and homophones are prevalent. We propose VALLR-Pin, a novel two-stage framework that extends the recent VALLR architecture from English to Mandarin. First, a shared video encoder feeds into dual decoders, which jointly predict both Chinese character sequences and their standard Pinyin romanization. The multi-task learning of character and phonetic outputs fosters robust visual-semantic representations. During inference, the text decoder generates multiple candidate transcripts. We construct a prompt by concatenating the Pinyin output with these candidate Chinese sequences and feed it to a large language model to resolve ambiguities and refine the transcription. This provides the LLM with explicit phonetic context to correct homophone-induced errors. Finally, we fine-tune the LLM on synthetic noisy examples: we generate imperfect Pinyin-text pairs from intermediate VALLR-Pin checkpoints using the training data, creating instruction-response pairs for error correction. This endows the LLM with awareness of our model's specific error patterns. In summary, VALLR-Pin synergizes visual features with phonetic and linguistic context to improve Mandarin lip-reading performance.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lin Li, Jiahui Li, Jiaming Lei, Jun Xiao, Feifei Shao et al.
ๆฅ็ๆ่ฆ
Compositional zero-shot learning (CZSL) aims to recognize unseen state-object compositions by generalizing from a training set of their primitives (state and object). Current methods often overlook the rich hierarchical structures, such as the semantic hierarchy of primitives (e.g., apple fruit) and the conceptual hierarchy between primitives and compositions (e.g, sliced apple apple). A few recent efforts have shown effectiveness in modeling these hierarchies through loss regularization within Euclidean space. In this paper, we argue that they fail to scale to the large-scale taxonomies required for real-world CZSL: the space's polynomial volume growth in flat geometry cannot match the exponential structure, impairing generalization capacity. To this end, we propose H2em, a new framework that learns Hierarchical Hyperbolic EMbeddings for CZSL. H2em leverages the unique properties of hyperbolic geometry, a space naturally suited for embedding tree-like structures with low distortion. However, a naive hyperbolic mapping may suffer from hierarchical collapse and poor fine-grained discrimination. We further design two learning objectives to structure this space: a Dual-Hierarchical Entailment Loss that uses hyperbolic entailment cones to enforce the predefined hierarchies, and a Discriminative Alignment Loss with hard negative mining to establish a large geodesic distance between semantically similar compositions. Furthermore, we devise Hyperbolic Cross-Modal Attention to realize instance-aware cross-modal infusion within hyperbolic geometry. Extensive ablations on three benchmarks demonstrate that H2em establishes a new state-of-the-art in both closed-world and open-world scenarios. Our codes will be released.
๐
2025-12-23
๐ AAAI
๐ฅ Ziwei Qin, Xuhui Song, Deqing Huang, Na Qin, Jun Li
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Graph neural networks are increasingly applied to multimodal medical diagnosis for their inherent relational modeling capabilities. However, their efficacy is often compromised by the prevailing reliance on a single, static graph built from indiscriminate features, hindering the ability to model patient-specific pathological relationships. To this end, the proposed Multi-Activation Plane Interaction Graph Neural Network (MAPI-GNN) reconstructs this single-graph paradigm by learning a multifaceted graph profile from semantically disentangled feature subspaces. The framework first uncovers latent graph-aware patterns via a multi-dimensional discriminator; these patterns then guide the dynamic construction of a stack of activation graphs; and this multifaceted profile is finally aggregated and contextualized by a relational fusion engine for a robust diagnosis. Extensive experiments on two diverse tasks, comprising over 1300 patient samples, demonstrate that MAPI-GNN significantly outperforms state-of-the-art methods.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anthony Dontoh, Stephanie Ivey, Armstrong Aboah
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Despite increasing interest in computer vision-based distracted driving detection, most existing models rely exclusively on driver-facing views and overlook crucial environmental context that influences driving behavior. This study investigates whether incorporating road-facing views alongside driver-facing footage improves distraction detection accuracy in naturalistic driving conditions. Using synchronized dual-camera recordings from real-world driving, we benchmark three leading spatiotemporal action recognition architectures: SlowFast-R50, X3D-M, and SlowOnly-R50. Each model is evaluated under two input configurations: driver-only and stacked dual-view. Results show that while contextual inputs can improve detection in certain models, performance gains depend strongly on the underlying architecture. The single-pathway SlowOnly model achieved a 9.8 percent improvement with dual-view inputs, while the dual-pathway SlowFast model experienced a 7.2 percent drop in accuracy due to representational conflicts. These findings suggest that simply adding visual context is not sufficient and may lead to interference unless the architecture is specifically designed to support multi-view integration. This study presents one of the first systematic comparisons of single- and dual-view distraction detection models using naturalistic driving data and underscores the importance of fusion-aware design for future multimodal driver monitoring systems.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zepeng Xin, Kaiyu Li, Luodi Chen, Wanchen Li, Yuchen Xiao et al.
ๆฅ็ๆ่ฆ
Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity-prone real-world models. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model's effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that our SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released at https://github.com/earth-insights/SegEarth-R2.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh, Andrews Danyo, Eugene Denteh et al.
Computer Vision
ๆฅ็ๆ่ฆ
Automated pavement defect detection often struggles to generalize across diverse real-world conditions due to the lack of standardized datasets. Existing datasets differ in annotation styles, distress type definitions, and formats, limiting their integration for unified training. To address this gap, we introduce a comprehensive benchmark dataset that consolidates multiple publicly available sources into a standardized collection of 52747 images from seven countries, with 135277 bounding box annotations covering 13 distinct distress types. The dataset captures broad real-world variation in image quality, resolution, viewing angles, and weather conditions, offering a unique resource for consistent training and evaluation. Its effectiveness was demonstrated through benchmarking with state-of-the-art object detection models including YOLOv8-YOLOv12, Faster R-CNN, and DETR, which achieved competitive performance across diverse scenarios. By standardizing class definitions and annotation formats, this dataset provides the first globally representative benchmark for pavement defect detection and enables fair comparison of models, including zero-shot transfer to new environments.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenhao Li, Shaohan Yi, Zheng Liu, Leonartinus Gao, Minh Ngoc Le et al.
Robotics
ๆฅ็ๆ่ฆ
Diffusion models (DMs) have recently achieved impressive photorealism in image and video generation. However, their application to image animation remains limited, even when trained on large-scale datasets. Two primary challenges contribute to this: the high dimensionality of video signals leads to a scarcity of training data, causing DMs to favor memorization over prompt compliance when generating motion; moreover, DMs struggle to generalize to novel motion patterns not present in the training set, and fine-tuning them to learn such patterns, especially using limited training data, is still under-explored. To address these limitations, we propose Modular Image-to-Video Adapter (MIVA), a lightweight sub-network attachable to a pre-trained DM, each designed to capture a single motion pattern and scalable via parallelization. MIVAs can be efficiently trained on approximately ten samples using a single consumer-grade GPU. At inference time, users can specify motion by selecting one or multiple MIVAs, eliminating the need for prompt engineering. Extensive experiments demonstrate that MIVA enables more precise motion control while maintaining, or even surpassing, the generation quality of models trained on significantly larger datasets.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Peng Gao, Ke Li, Di Wang, Yongshan Zhu, Yiming Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Cross-resolution land cover mapping aims to produce high-resolution semantic predictions from coarse or low-resolution supervision, yet the severe resolution mismatch makes effective learning highly challenging. Existing weakly supervised approaches often struggle to align fine-grained spatial structures with coarse labels, leading to noisy supervision and degraded mapping accuracy. To tackle this problem, we propose DDTM, a dual-branch weakly supervised framework that explicitly decouples local semantic refinement from global contextual reasoning. Specifically, DDTM introduces a diffusion-based branch to progressively refine fine-scale local semantics under coarse supervision, while a transformer-based branch enforces long-range contextual consistency across large spatial extents. In addition, we design a pseudo-label confidence evaluation module to mitigate noise induced by cross-resolution inconsistencies and to selectively exploit reliable supervisory signals. Extensive experiments demonstrate that DDTM establishes a new state-of-the-art on the Chesapeake Bay benchmark, achieving 66.52\% mIoU and substantially outperforming prior weakly supervised methods. The code is available at https://github.com/gpgpgp123/DDTM.
๐
2025-12-23
๐ Accepted at IEEE ICCIT 2025. This is the author ac
๐ฅ Tamim Ahasan Rijon, Yeasin Arafath
Machine Learning
ๆฅ็ๆ่ฆ
As a significant agricultural country, Bangladesh utilizes its fertile land for guava cultivation and dedicated labor to boost its economic development. In a nation like Bangladesh, enhancing guava production and agricultural practices plays a crucial role in its economy. Anthracnose and fruit fly infection can lower the quality and productivity of guava, a crucial tropical fruit. Expert systems that detect diseases early can reduce losses and safeguard the harvest. Images of guava fruits classified into the Healthy, Fruit Flies, and Anthracnose classes are included in the Guava Fruit Disease Dataset 2024 (GFDD24), which comes from plantations in Rajshahi and Pabna, Bangladesh. This study aims to create models using CNN alongside traditional machine learning techniques that can effectively identify guava diseases in locally cultivated varieties in Bangladesh. In order to achieve the highest classification accuracy of approximately 99.99% for the guava dataset, we propose utilizing ensemble models that combine CNNML with Gradient Boosting Machine. In general, the CNN-ML cascade framework exhibits strong, high-accuracy guava disease detection that is appropriate for real-time agricultural monitoring systems.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Le Feng, Li Xiao
Natural Language Processing
ๆฅ็ๆ่ฆ
In recent years, the integration of pre-trained foundational models with multiple instance learning (MIL) has improved diagnostic accuracy in computational pathology. However, existing MIL methods focus on optimizing feature extractors and aggregation strategies while overlooking the complex semantic relationships among instances within whole slide image (WSI). Although Transformer-based MIL approaches aiming to model instance dependencies, the quadratic computational complexity limits their scalability to large-scale WSIs. Moreover, due to the pronounced variations in tumor region scales across different WSIs, existing Transformer-based methods employing fixed-scale attention mechanisms face significant challenges in precisely capturing local instance correlations and fail to account for the distance-based decay effect of patch relevance. To address these challenges, we propose window scale decay MIL (WSD-MIL), designed to enhance the capacity to model tumor regions of varying scales while improving computational efficiency. WSD-MIL comprises: 1) a window scale decay based attention module, which employs a cluster-based sampling strategy to reduce computational costs while progressively decaying attention window-scale to capture local instance relationships at varying scales; and 2) a squeeze-and-excitation based region gate module, which dynamically adjusts window weights to enhance global information modeling. Experimental results demonstrate that WSD-MIL achieves state-of-the-art performance on the CAMELYON16 and TCGA-BRCA datasets while reducing 62% of the computational memory. The code will be publicly available.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuechen Yang, Junlin Guo, Yanfan Zhu, Jialin Yue, Junchao Zhu et al.
ๆฅ็ๆ่ฆ
High-throughput "pathomic" analysis of Whole Slide Images (WSIs) offers new opportunities to study tissue characteristics and for biomarker discovery. However, the clinical relevance of the tissue characteristics at the micro- and macro-environment level is limited by the lack of tools that facilitate the measurement of the spatial interaction of individual structure characteristics and their association with clinical parameters. To address these challenges, we introduce HistoWAS (Histology-Wide Association Study), a computational framework designed to link tissue spatial organization to clinical outcomes. Specifically, HistoWAS implements (1) a feature space that augments conventional metrics with 30 topological and spatial features, adapted from Geographic Information Systems (GIS) point pattern analysis, to quantify tissue micro-architecture; and (2) an association study engine, inspired by Phenome-Wide Association Studies (PheWAS), that performs mass univariate regression for each feature with statistical correction. As a proof of concept, we applied HistoWAS to analyze a total of 102 features (72 conventional object-level features and our 30 spatial features) using 385 PAS-stained WSIs from 206 participants in the Kidney Precision Medicine Project (KPMP). The code and data have been released to https://github.com/hrlblab/histoWAS.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zixuan Huang, Xiang Li, Zhaoyang Lv, James M. Rehg
ๆฅ็ๆ่ฆ
Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyi Zhong, Fang-Lue Zhang, Andrew Chalmers, Taehyun Rhee
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
While instruction-based image editing is emerging, extending it to 360$^\circ$ panoramas introduces additional challenges. Existing methods often produce implausible results in both equirectangular projections (ERP) and perspective views. To address these limitations, we propose SE360, a novel framework for multi-condition guided object editing in 360$^\circ$ panoramas. At its core is a novel coarse-to-fine autonomous data generation pipeline without manual intervention. This pipeline leverages a Vision-Language Model (VLM) and adaptive projection adjustment for hierarchical analysis, ensuring the holistic segmentation of objects and their physical context. The resulting data pairs are both semantically meaningful and geometrically consistent, even when sourced from unlabeled panoramas. Furthermore, we introduce a cost-effective, two-stage data refinement strategy to improve data realism and mitigate model overfitting to erase artifacts. Based on the constructed dataset, we train a Transformer-based diffusion model to allow flexible object editing guided by text, mask, or reference image in 360$^\circ$ panoramas. Our experiments demonstrate that our method outperforms existing methods in both visual quality and semantic accuracy.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
๐
2025-12-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yibin Lei, Shwai He, Ang Li, Andrew Yates
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent work has shown that directly fine-tuning large language models (LLMs) for dense retrieval yields strong performance, but their substantial parameter counts make them computationally inefficient. While prior studies have revealed significant layer redundancy in LLMs for generative tasks, it remains unclear whether similar redundancy exists when these models are adapted for retrieval tasks, which require encoding entire sequences into fixed representations rather than generating tokens iteratively. To this end, we conduct a comprehensive analysis of layer redundancy in LLM-based dense retrievers. We find that, in contrast to generative settings, MLP layers are substantially more prunable, while attention layers remain critical for semantic aggregation. Building on this insight, we propose EffiR, a framework for developing efficient retrievers that performs large-scale MLP compression through a coarse-to-fine strategy (coarse-grained depth reduction followed by fine-grained width reduction), combined with retrieval-specific fine-tuning. Across diverse BEIR datasets and LLM backbones, EffiR achieves substantial reductions in model size and inference cost while preserving the performance of full-size models.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Alexandros Christoforos, Chadbourne Davis
Natural Language Processing
ๆฅ็ๆ่ฆ
We present MoE-DiffuSeq, a mixture of experts based framework for enhancing diffusion models in long document generation. Existing diffusion based text generation models, such as DiffuSeq, suffer from high computational cost and memory overhead when applied to extended sequences. To address these challenges, MoE-DiffuSeq integrates sparse attention with a mixture of experts architecture, enabling efficient and scalable long sequence modeling. Our approach introduces a customized sparse attention mechanism designed to reduce computational complexity while preserving text quality and coherence. In addition, we incorporate a soft absorbing state within the diffusion process to accelerate sequence reconstruction and improve generation precision. Extensive experiments demonstrate that MoE-DiffuSeq significantly improves training efficiency and sampling speed compared to existing diffusion models. These advantages are particularly effective for long document scenarios, including scientific article generation, code repository modeling, and long form dialogue generation. Benchmark results further show that MoE-DiffuSeq improves efficiency, speed, accuracy, and expressiveness, advancing the practical applicability of diffusion models for high quality long form text generation.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dhruv Anand, Ehsan Shareghi
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Humza Nusrat, Luke Francisco, Bing Luo, Hassan Bagher-Ebadian, Joshua Kim et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Stereotactic radiosurgery (SRS) demands precise dose shaping around critical structures, yet black-box AI systems have limited clinical adoption due to opacity concerns. We tested whether chain-of-thought reasoning improves agentic planning in a retrospective cohort of 41 patients with brain metastases treated with 18 Gy single-fraction SRS. We developed SAGE (Secure Agent for Generative Dose Expertise), an LLM-based planning agent for automated SRS treatment planning. Two variants generated plans for each case: one using a non-reasoning model, one using a reasoning model. The reasoning variant showed comparable plan dosimetry relative to human planners on primary endpoints (PTV coverage, maximum dose, conformity index, gradient index; all p > 0.21) while reducing cochlear dose below human baselines (p = 0.022). When prompted to improve conformity, the reasoning model demonstrated systematic planning behaviors including prospective constraint verification (457 instances) and trade-off deliberation (609 instances), while the standard model exhibited none of these deliberative processes (0 and 7 instances, respectively). Content analysis revealed that constraint verification and causal explanation concentrated in the reasoning agent. The optimization traces serve as auditable logs, offering a path toward transparent automated planning.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amirhosein Ghasemabadi, Di Niu
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to generation process and can be extracted efficiently without external supervision.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding on which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected we use a recent pipeline for the distillation process itself \citep[RADLADS;][]{goldstein2025radlads}, which consists of attention weight transfer, hidden state alignment, KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attentions based on a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen et al.
Robotics
ๆฅ็ๆ่ฆ
As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.
๐
2025-12-23
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Davide Stauba, Finn Rabe, Akhil Misra, Yves Pauli, Roya Hรผppi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Coherence in language requires the brain to satisfy two competing temporal demands: gradual accumulation of meaning across extended context and rapid reconfiguration of representations at event boundaries. Despite their centrality to language and thought, how these processes are implemented in the human brain during naturalistic listening remains unclear. Here, we tested whether these two processes can be captured by annotation-free drift and shift signals and whether their neural expression dissociates across large-scale cortical systems. These signals were derived from a large language model (LLM) and formalized contextual drift and event shifts directly from the narrative input. To enable high-precision voxelwise encoding models with stable parameter estimates, we densely sampled one healthy adult across more than 7 hours of listening to thirteen crime stories while collecting ultra high-field (7T) BOLD data. We then modeled the feature-informed hemodynamic response using a regularized encoding framework validated on independent stories. Drift predictions were prevalent in default-mode network hubs, whereas shift predictions were evident bilaterally in the primary auditory cortex and language association cortex. Furthermore, activity in default-mode and parietal networks was best explained by a signal capturing how meaning accumulates and gradually fades over the course of the narrative. Together, these findings show that coherence during language comprehension is implemented through dissociable neural regimes of slow contextual integration and rapid event-driven reconfiguration, offering a mechanistic entry point for understanding disturbances of language coherence in psychiatric disorders.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Junyi Liu, Stanley Kok
ๆฅ็ๆ่ฆ
With the rapid growth of unstructured data from social media, reviews, and forums, text mining has become essential in Information Systems (IS) for extracting actionable insights. Summarization can condense fragmented, emotion-rich posts, but existing methods-optimized for structured news-struggle with noisy, informal content. Emotional cues are critical for IS tasks such as brand monitoring and market analysis, yet few studies integrate sentiment modeling into summarization of short user-generated texts. We propose a sentiment-aware framework extending extractive (TextRank) and abstractive (UniLM) approaches by embedding sentiment signals into ranking and generation processes. This dual design improves the capture of emotional nuances and thematic relevance, producing concise, sentiment-enriched summaries that enhance timely interventions and strategic decision-making in dynamic online environments.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ YuChe Hsu, AnJui Wang, TsaiChing Ni, YuanFu Yang
Multimodal
ๆฅ็ๆ่ฆ
We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning between textual descriptions, spatial structures, and simulation logic. In parallel, three novel evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed specifically for this task to comprehensively evaluate structural integrity, parameter fidelity, and simulator executability. Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nilesh Jain, Seyi Adeyinka, Leor Roseman, Aza Allsop
Natural Language Processing
ๆฅ็ๆ่ฆ
Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's Kappa ($ฮบ$) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As proof-of-concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results demonstrate Gemini achieves highest reliability ($ฮบ= 0.907$, cosine=95.3%), followed by GPT-4o ($ฮบ= 0.853$, cosine=92.6%) and Claude ($ฮบ= 0.842$, cosine=92.1%). All three models achieve a high agreement ($ฮบ> 0.80$), validating the multi-run ensemble approach. The framework successfully extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% consistency), GPT-4o identifying 5 themes, and Claude 4 themes. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Khushnur Binte Jahangir, Swakkhar Shatabda, Sarah Masud Preum
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) show impressive performance on many NLP benchmarks, yet their ability to reason in figurative, culturally grounded, and low-resource settings remains underexplored. We address this gap for Bangla by introducing BanglaRiddleEval, a benchmark of 1,244 traditional Bangla riddles instantiated across four tasks (4,976 riddle-task artifacts in total). Using an LLM-based pipeline, we generate Chain-of-Thought explanations, semantically coherent distractors, and fine-grained ambiguity annotations, and evaluate a diverse suite of open-source and closed-source models under different prompting strategies. Models achieve moderate semantic overlap on generative QA but low correctness, MCQ accuracy peaks at only about 56% versus an 83% human baseline, and ambiguity resolution ranges from roughly 26% to 68%, with high-quality explanations confined to the strongest models. These results show that current LLMs capture some cues needed for Bangla riddle reasoning but remain far from human-level performance, establishing BanglaRiddleEval as a challenging new benchmark for low-resource figurative reasoning. All data, code, and evaluation scripts are available on GitHub: https://github.com/Labib1610/BanglaRiddleEval.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Maxime Poli, Mahi Luthra, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The parallel advances in language modeling and speech representation learning have raised the prospect of learning language directly from speech without textual intermediates. This requires extracting semantic representations directly from speech. Our contributions are threefold. First, we introduce SpidR, a self-supervised speech representation model that efficiently learns representations with highly accessible phonetic information, which makes it particularly suited for textless spoken language modeling. It is trained on raw waveforms using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict assignments derived from the teacher's intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher quality codebooks. SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on downstream language modeling benchmarks (sWUGGY, sBLIMP, tSC). Second, we systematically evaluate across models and layers the correlation between speech unit quality (ABX, PNMI) and language modeling performance, validating these metrics as reliable proxies. Finally, SpidR significantly reduces pretraining time compared to HuBERT, requiring only one day of pretraining on 16 GPUs, instead of a week. This speedup is enabled by the pretraining method and an efficient codebase, which allows faster iteration and easier experimentation. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Karolina Droลผdลผ, Kacper Dudzic, Anna Sterna, Marcin Moskalewicz
ๆฅ็ๆ่ฆ
Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. We present the first direct comparison between state-of-the-art LLMs and mental health professionals in diagnosing Borderline (BPD) and Narcissistic (NPD) Personality Disorders utilizing Polish-language first-person autobiographical accounts. We show that the top-performing Gemini Pro models surpassed human professionals in overall diagnostic accuracy by 21.91 percentage points (65.48% vs. 43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 & F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patient's sense of self and temporal experience. Our findings demonstrate that while LLMs are highly competent at interpreting complex first-person clinical data, they remain subject to critical reliability and bias issues.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jaykumar Kasundra, Anjaneya Praharaj, Sourabh Surana, Lakshmi Sirisha Chodisetty, Sourav Sharma et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Safeguarding large language models (LLMs) against unsafe or adversarial behavior is critical as they are increasingly deployed in conversational and agentic settings. Existing moderation tools often treat safety risks (e.g. toxicity, bias) and adversarial threats (e.g. prompt injections, jailbreaks) as separate problems, limiting their robustness and generalizability. We introduce AprielGuard, an 8B parameter safeguard model that unify these dimensions within a single taxonomy and learning framework. AprielGuard is trained on a diverse mix of open and synthetic data covering standalone prompts, multi-turn conversations, and agentic workflows, augmented with structured reasoning traces to improve interpretability. Across multiple public and proprietary benchmarks, AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations, outperforming existing opensource guardrails such as Llama-Guard and Granite Guardian, particularly in multi-step and reasoning intensive scenarios. By releasing the model, we aim to advance transparent and reproducible research on reliable safeguards for LLMs.
๐
2025-12-23
๐ AAAI 2026
๐ฅ Wenzheng Zeng, Mingyu Ouyang, Langyuan Cui, Hwee Tou Ng
ๆฅ็ๆ่ฆ
Automatic presentation slide generation can greatly streamline content creation. However, since preferences of each user may vary, existing under-specified formulations often lead to suboptimal results that fail to align with individual user needs. We introduce a novel task that conditions paper-to-slides generation on user-specified preferences. We propose a human behavior-inspired agentic framework, SlideTailor, that progressively generates editable slides in a user-aligned manner. Instead of requiring users to write their preferences in detailed textual form, our system only asks for a paper-slides example pair and a visual template - natural and easy-to-provide artifacts that implicitly encode rich user preferences across content and visual style. Despite the implicit and unlabeled nature of these inputs, our framework effectively distills and generalizes the preferences to guide customized slide generation. We also introduce a novel chain-of-speech mechanism to align slide content with planned oral narration. Such a design significantly enhances the quality of generated slides and enables downstream applications like video presentations. To support this new task, we construct a benchmark dataset that captures diverse user preferences, with carefully designed interpretable metrics for robust evaluation. Extensive experiments demonstrate the effectiveness of our framework.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Marko ฤechoviฤ, Natรกlia Komornรญkovรก, Dominik Machรกฤek, Ondลej Bojar
Natural Language Processing
ๆฅ็ๆ่ฆ
Speech processing and translation technology have the potential to facilitate meetings of individuals who do not share any common language. To evaluate automatic systems for such a task, a versatile and realistic evaluation corpus is needed. Therefore, we create and present a corpus of cross-lingual dialogues between individuals without a common language who were facilitated by automatic simultaneous speech translation. The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages and automatic and corrected translations into English. For the purposes of research into cross-lingual summarization, our corpus also includes written summaries (minutes) of the meetings.
Moreover, we propose automatic detection of misunderstandings. For an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings. We annotate misunderstandings manually and also test the ability of current large language models to detect them automatically. The results show that the Gemini model is able to identify text spans with misunderstandings with recall of 77% and precision of 47%.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shuzheng Si, Qingyi Wang, Haozhe Zhao, Yuzhuo Bai, Guanqiao Chen et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recognizing whether outputs from large language models (LLMs) contain faithfulness hallucination is crucial for real-world applications, e.g., retrieval-augmented generation and summarization. In this paper, we introduce FaithLens, a cost-efficient and effective faithfulness hallucination detection model that can jointly provide binary predictions and corresponding explanations to improve trustworthiness. To achieve this, we first synthesize training data with explanations via advanced LLMs and apply a well-defined data filtering strategy to ensure label correctness, explanation quality, and data diversity. Subsequently, we fine-tune the model on these well-curated training data as a cold start and further optimize it with rule-based reinforcement learning, using rewards for both prediction correctness and explanation quality. Results on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3. Also, FaithLens can produce high-quality explanations, delivering a distinctive balance of trustworthiness, efficiency, and effectiveness.
๐
2025-12-23
๐ CVPR 2025
๐ฅ Hao Guo, Xugong Qin, Jun Jie Ou Yang, Peng Zhang, Gangyan Zeng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. Datasets and codes will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junghyun Lee, Branislav Kveton, Sunav Choudhary, Subhojyoti Mukherjee, Anup Rao et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers. We instantiate and compare several sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR. Our experiments on the ARC, MMLU, and OpenBookQA datasets with the Llama and Qwen models show that the sampling scheme can significantly affect the accuracy of learned reasoning models. Despite its simplicity, we observe that PPS outperforms the other sampling schemes.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Honglin Mu, Jinghao Liu, Kaiyang Wan, Rui Xing, Xiuying Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. However, our research identifies a vulnerability: LLMs can be manipulated by "adversarial instructions" hidden in input data, such as resumes or code, causing them to deviate from their intended task. Notably, while defenses may exist for mature domains such as code review, they are often absent in other common applications such as resume screening and peer review. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types. We evaluate two defense mechanisms: prompt-based defenses achieve 10.1% attack reduction with 12.5% false rejection increase, while our proposed FIDS (Foreign Instruction Detection through Separation) using LoRA adaptation achieves 15.4% attack reduction with 10.4% false rejection increase. The combined approach provides 26.3% attack reduction, demonstrating that training-time defenses outperform inference-time mitigations in both security and utility preservation.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo.
๐
2025-12-23
๐ IEEE/ACM Transactions on Audio, Speech and Languag
๐ฅ Xiang Chen, Yixin Ou, Quan Feng, Lei Li, Piji Li et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
The pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the ``pre-train, prompt, and predict'' paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuxin Wang, Shicheng Fang, Bo Wang, Qi Luo, Xuanjing Huang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for Large Language Models (LLMs) to address knowledge-intensive queries requiring domain-specific or up-to-date information. To handle complex multi-hop questions that are challenging for single-step retrieval, iterative RAG approaches incorporating reinforcement learning have been proposed. However, existing iterative RAG systems typically plan to decompose questions without leveraging information about the available retrieval corpus, leading to inefficient retrieval and reasoning chains that cascade into suboptimal performance. In this paper, we introduce Early Knowledge Alignment (EKA), a simple but effective module that aligns LLMs with retrieval set before planning in iterative RAG systems with contextually relevant retrieved knowledge. Extensive experiments on six standard RAG datasets demonstrate that by establishing a stronger reasoning foundation, EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency. Our analysis from an entropy perspective demonstrate that incorporating early knowledge reduces unnecessary exploration during the reasoning process, enabling the model to focus more effectively on relevant information subsets. Moreover, EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models. Generalization tests across diverse datasets and retrieval corpora confirm the robustness of our approach. Overall, EKA advances the state-of-the-art in iterative RAG systems while illuminating the critical interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks. The code is released at \href{https://github.com/yxzwang/EarlyKnowledgeAlignment}{Github}.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hyeongcheol Park, Jiyoung Seo, Jaewon Mun, Hogun Park, Wonmin Byeon et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs' multimodal reasoning and grounding over existing approaches.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr
Machine Learning
ๆฅ็ๆ่ฆ
As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable beliefs while maintaining near-constant memory use over interaction steps. However, bottleneck approaches are generally prone to error propagation, which we observe causing inferior performance when compared to the full context setting due to errors in belief updating. Therefore, we train LLMs to generate and act on beliefs within the ABBEL framework via reinforcement learning (RL). We experiment with belief grading, to reward higher quality beliefs, as well as belief length penalties to reward more compressed beliefs. Our experiments demonstrate the ability of RL to improve ABBEL's performance beyond the full context setting, while using less memory than contemporaneous approaches.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zuo Wang, Ye Yuan
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Text classification plays an important role in various downstream text-related tasks, such as sentiment analysis, fake news detection, and public opinion analysis. Recently, text classification based on Graph Neural Networks (GNNs) has made significant progress due to their strong capabilities of structural relationship learning. However, these approaches still face two major limitations. First, these approaches fail to fully consider the diverse structural information across word pairs, e.g., co-occurrence, syntax, and semantics. Furthermore, they neglect sequence information in the text graph structure information learning module and can not classify texts with new words and relations. In this paper, we propose a Novel Graph-Sequence Learning Model for Inductive Text Classification (TextGSL) to address the previously mentioned issues. More specifically, we construct a single text-level graph for all words in each text and establish different edge types based on the diverse relationships between word pairs. Building upon this, we design an adaptive multi-edge message-passing paradigm to aggregate diverse structural information between word pairs. Additionally, sequential information among text data can be captured by the proposed TextGSL through the incorporation of Transformer layers. Therefore, TextGSL can learn more discriminative text representations. TextGSL has been comprehensively compared with several strong baselines. The experimental results on diverse benchmarking datasets demonstrate that TextGSL outperforms these baselines in terms of accuracy.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing works and our pilot study have shown that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set using temporal and relevance filters, followed by an RL agent that selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query time scope at both the session-level (chronological proximity) and the utterance-level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0\%, establishing a new state-of-the-art performance for open-source models and outperforming a 14B baseline by 10.2\%. Ablation studies show temporal consistency and evidence grounding rewards jointly contribute to a 15.0\% performance gain. Moreover, Memory-T1 maintains robustness up to 128k tokens, where baseline models collapse, proving effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available at https://github.com/Elvin-Yiming-Du/Memory-T1/
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite the wide adoption of Large Language Models (LLM)s, clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations aligned with the predictions. Current approaches suffer from exposure bias leading to misaligned explanations. We propose Reason2Decide, a two-stage training framework that addresses key challenges in self-rationalization, including exposure bias and task separation. In Stage-1, our model is trained on rationale generation, while in Stage-2, we jointly train on label prediction and rationale generation, applying scheduled sampling to gradually transition from conditioning on gold labels to model predictions. We evaluate Reason2Decide on three medical datasets, including a proprietary triage dataset and public biomedical QA datasets. Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, Reason2Decide is rationale source-robust across LLM-generated, nurse-authored, and nurse-post-processed rationales. In our experiments, while using only LLM-generated rationales in Stage-1, Reason2Decide outperforms other fine-tuning variants. This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations. Remarkably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible for resource-constrained deployments while still providing explainable decision support.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ming Li, Chenrui Fan, Yize Cheng, Soheil Feizi, Tianyi Zhou
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Heet Bodara, Md Masum Mushfiq, Isma Farah Siddiqui
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models are increasingly used in conversational systems such as digital personal assistants, shaping how people interact with technology through language. While their responses often sound fluent and natural, they can also carry subtle tone biases such as sounding overly polite, cheerful, or cautious even when neutrality is expected. These tendencies can influence how users perceive trust, empathy, and fairness in dialogue. In this study, we explore tone bias as a hidden behavioral trait of large language models. The novelty of this research lies in the integration of controllable large language model based dialogue synthesis with tone classification models, enabling robust and ethical emotion recognition in personal assistant interactions. We created two synthetic dialogue datasets, one generated from neutral prompts and another explicitly guided to produce positive or negative tones. Surprisingly, even the neutral set showed consistent tonal skew, suggesting that bias may stem from the model's underlying conversational style. Using weak supervision through a pretrained DistilBERT model, we labeled tones and trained several classifiers to detect these patterns. Ensemble models achieved macro F1 scores up to 0.92, showing that tone bias is systematic, measurable, and relevant to designing fair and trustworthy conversational AI.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+ which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daewoon Kim, Si Young Yie, Jae Sung Lee
Robotics
ๆฅ็ๆ่ฆ
This paper proposes FedPOD (Proportionally Orchestrated Derivative) for optimizing learning efficiency and communication cost in federated learning among multiple clients. Inspired by FedPIDAvg, we define a round-wise task for FedPOD to enhance training efficiency. FedPIDAvg achieved performance improvement by incorporating the training loss reduction for prediction entropy as weights using differential terms. Furthermore, by modeling data distribution with a Poisson distribution and using a PID controller, it reduced communication costs even in skewed data distribution. However, excluding participants classified as outliers based on the Poisson distribution can limit data utilization. Additionally, PID controller requires the same participants to be maintained throughout the federated learning process as it uses previous rounds' learning information in the current round. In our approach, FedPOD addresses these issues by including participants excluded as outliers, eliminating dependency on previous rounds' learning information, and applying a method for calculating validation loss at each round. In this challenge, FedPOD presents comparable performance to FedPIDAvg in metrics of Dice score, 0.78, 0.71 and 0.72 for WT, ET and TC in average, and projected convergence score, 0.74 in average. Furthermore, the concept of FedPOD draws inspiration from Kubernetes' smallest computing unit, POD, designed to be compatible with Kubernetes auto-scaling. Extending round-wise tasks of FedPOD to POD units allows flexible design by applying scale-out similar to Kubernetes' auto-scaling. This work demonstrated the potentials of FedPOD to enhance federated learning by improving efficiency, flexibility, and performance in metrics.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yedi Zhang, Andrew Saxe, Peter E. Latham
Machine Learning
ๆฅ็ๆ่ฆ
Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Despite being widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explains a simplicity bias arising from saddle-to-saddle learning dynamics for a general class of neural networks, incorporating fully-connected, convolutional, and attention-based architectures. Here, simple means expressible with few hidden units, i.e., hidden neurons, convolutional kernels, or attention heads. Specifically, we show that linear networks learn solutions of increasing rank, ReLU networks learn solutions with an increasing number of kinks, convolutional networks learn solutions with an increasing number of convolutional kernels, and self-attention models learn solutions with an increasing number of attention heads. By analyzing fixed points, invariant manifolds, and dynamics of gradient descent learning, we show that saddle-to-saddle dynamics operates by iteratively evolving near an invariant manifold, approaching a saddle, and switching to another invariant manifold. Our analysis also illuminates the effects of data distribution and weight initialization on the duration and number of plateaus in learning, dissociating previously confounding factors. Overall, our theory offers a framework for understanding when and why gradient descent progressively learns increasingly complex solutions.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stephane Gaubert, Yiannis Vlassopoulos
Machine Learning
ๆฅ็ๆ่ฆ
We show that the output of a ReLU neural network can be interpreted as the value of a zero-sum, turn-based, stopping game, which we call the ReLU net game. The game runs in the direction opposite to that of the network, and the input of the network serves as the terminal reward of the game. In fact, evaluating the network is the same as running the Shapley-Bellman backward recursion for the value of the game. Using the expression of the value of the game as an expected total payoff with respect to the path measure induced by the transition probabilities and a pair of optimal policies, we derive a discrete Feynman-Kac-type path-integral formula for the network output. This game-theoretic representation can be used to derive bounds on the output from bounds on the input, leveraging the monotonicity of Shapley operators, and to verify robustness properties using policies as certificates. Moreover, training the neural network becomes an inverse game problem: given pairs of terminal rewards and corresponding values, one seeks transition probabilities and rewards of a game that reproduces them. Finally, we show that a similar approach applies to neural networks with Softplus activation functions, where the ReLU net game is replaced by its entropic regularization.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leslie Barrett, Michael W. Sherman
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Hand-tagged training data is essential to many machine learning tasks. However, training data quality control has received little attention in the literature, despite data quality varying considerably with the tagging exercise. We propose methods to evaluate and enhance the quality of hand-tagged training data using statistical approaches to measure tagging consistency and agreement. We show that agreement metrics give more reliable results if recorded over multiple iterations of tagging, where declining variance in such recordings is an indicator of increasing data quality. We also show one way a tagging project can collect high-quality training data without requiring multiple tags for every work item, and that a tagger burn-in period may not be sufficient for minimizing tagger errors.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Debabrota Basu, Udvas Das, Brahim Driss, Uddalak Mukherjee
Machine Learning
ๆฅ็ๆ่ฆ
Post-deployment machine learning algorithms often influence the environments they act in, and thus shift the underlying dynamics that the standard reinforcement learning (RL) methods ignore. While designing optimal algorithms in this performative setting has recently been studied in supervised learning, the RL counterpart remains under-explored. In this paper, we prove the performative counterparts of the performance difference lemma and the policy gradient theorem in RL, and further introduce the Performative Policy Gradient algorithm (PePG). PePG is the first policy gradient algorithm designed to account for performativity in RL. Under softmax parametrisation, and also with and without entropy regularisation, we prove that PePG converges to performatively optimal policies, i.e. policies that remain optimal under the distribution shifts induced by themselves. Thus, PePG significantly extends the prior works in Performative RL that achieves performative stability but not optimality. Furthermore, our empirical analysis on standard performative RL environments validate that PePG outperforms standard policy gradient algorithms and the existing performative RL algorithms aiming for stability.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rui Pan, Zhuofu Chen, Ravi Netravali
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.4$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl et al.
Robotics
ๆฅ็ๆ่ฆ
Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles' actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performances on Longest6~v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at https://github.com/autonomousvision/lead.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Yingzhen Yang
Machine Learning
ๆฅ็ๆ่ฆ
We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = ฮ(1) \ge 1$ defined on the unit sphere in $\RR^d$ by training an over-parameterized two-layer neural network (NN) with channel attention in this paper. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\eps \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width of $m \ge ฮ({n^4 \log (2n/ฮด)}/{d^{2\ell_0}})$ trained by the vanilla gradient descent (GD) requires the lowest sample complexity of $n \asymp ฮ(d^{\ell_0}/\eps)$ with probability $1-ฮด$ for every $ฮด\in (0,1)$, in contrast with the representative sample complexity $ฮ\pth{d^{\ell_0} \max\set{\eps^{-2},\log d}}$, where $n$ is the training daata size. Moreover, such sample complexity is not improvable since the trained network renders a sharp rate of the nonparametric regression risk of the order $ฮ(d^{\ell_0}/{n})$ with probability at least $1-ฮด$. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $ฮ(d^{\ell_0})$ is $ฮ(d^{\ell_0}/{n})$, so that the rate of the nonparametric regression risk of the network trained by GD is minimax optimal. The training of the two-layer NN with channel attention consists of two stages. In Stage 1, a provable learnable channel selection algorithm identifies the ground-truth channel number $\ell_0$ from the initial $L \ge \ell_0$ channels in the first-layer activation, with high probability. This learnable selection is achieved by an efficient one-step GD update on both layers, enabling feature learning for low-degree polynomial targets. In Stage 2, the second layer is trained by standard GD using the activation function with the selected channels.
๐
2025-12-23
๐ ArXiv Preprint (eess.SP)
๐ฅ Kyriakos Stylianopoulos, Paolo Di Lorenzo, George C. Alexandropoulos
Machine Learning
ๆฅ็ๆ่ฆ
Goal-oriented communications offer an attractive alternative to the Shannon-based communication paradigm, where the data is never reconstructed at the Receiver (RX) side. Rather, focusing on the case of edge inference, the Transmitter (TX) and the RX cooperate to exchange features of the input data that will be used to predict an unseen attribute of them, leveraging information from collected data sets. This chapter demonstrates that the wireless channel can be used to perform computations over the data, when equipped with programmable metasurfaces. The end-to-end system of the TX, RX, and MS-based channel is treated as a single deep neural network which is trained through backpropagation to perform inference on unseen data. Using Stacked Intelligent Metasurfaces (SIM), it is shown that this Metasurfaces-Integrated Neural Network (MINN) can achieve performance comparable to fully digital neural networks under various system parameters and data sets. By offloading computations onto the channel itself, important benefits may be achieved in terms of energy consumption, arising from reduced computations at the transceivers and smaller transmission power required for successful inference.
๐
2025-12-23
๐ ArXiv Preprint (econ.EM)
๐ฅ Masahiro Kato
Machine Learning
ๆฅ็ๆ่ฆ
This study proposes Riesz representer estimation methods based on score matching. The Riesz representer is a key component in debiased machine learning for constructing $\sqrt{n}$-consistent and efficient estimators in causal inference and structural parameter estimation. To estimate the Riesz representer, direct approaches have garnered attention, such as Riesz regression and the covariate balancing propensity score. These approaches can also be interpreted as variants of direct density ratio estimation (DRE) in several applications such as average treatment effect estimation. In DRE, it is well known that flexible models can easily overfit the observed data due to the estimand and the form of the loss function. To address this issue, recent work has proposed modeling the density ratio as a product of multiple intermediate density ratios and estimating it using score-matching techniques, which are often used in the diffusion model literature. We extend score-matching-based DRE methods to Riesz representer estimation. Our proposed method not only mitigates overfitting but also provides insights for causal inference by bridging marginal effects and average policy effects through time score functions.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matthias Hertel, Sebastian Pรผtz, Ralf Mikut, Veit Hagenmeyer, Benjamin Schรคfer
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Time-series forecasts are essential for planning and decision-making in many domains. Explainability is key to building user trust and meeting transparency requirements. Shapley Additive Explanations (SHAP) is a popular explainable AI framework, but it lacks efficient implementations for time series and often assumes feature independence when sampling counterfactuals. We introduce SHAPformer, an accurate, fast and sampling-free explainable time-series forecasting model based on the Transformer architecture. It leverages attention manipulation to make predictions based on feature subsets. SHAPformer generates explanations in under one second, several orders of magnitude faster than the SHAP Permutation Explainer. On synthetic data with ground truth explanations, SHAPformer provides explanations that are true to the data. Applied to real-world electrical load data, it achieves competitive predictive performance and delivers meaningful local and global insights, such as identifying the past load as the key predictor and revealing a distinct model behavior during the Christmas period.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tyler Clark, Christine Evers, Jonathon Hare
Machine Learning
ๆฅ็ๆ่ฆ
Recurrent off-policy deep reinforcement learning models achieve state-of-the-art performance but are often sidelined due to their high computational demands. In response, we introduce RISE (Recurrent Integration via Simplified Encodings), a novel approach that can leverage recurrent networks in any image-based off-policy RL setting without significant computational overheads via using both learnable and non-learnable encoder layers. When integrating RISE into leading non-recurrent off-policy RL algorithms, we observe a 35.6% human-normalized interquartile mean (IQM) performance improvement across the Atari benchmark. We analyze various implementation strategies to highlight the versatility and potential of our proposed framework.
๐
2025-12-23
๐ ArXiv Preprint (q-fin.ST)
๐ฅ Ilias Aarab
ๆฅ็ๆ่ฆ
A growing empirical literature suggests that equity-premium predictability is state dependent, with much of the forecasting power concentrated around recessionary periods \parencite{Henkel2011,DanglHalling2012,Devpura2018}. I study U.S. stock return predictability across economic regimes and document strong evidence of time-varying expected returns across both expansionary and contractionary states. I contribute in two ways. First, I introduce a state-switching predictive regression in which the market state is defined in real time using the slope of the yield curve. Relative to the standard one-state predictive regression, the state-switching specification increases both in-sample and out-of-sample performance for the set of popular predictors considered by \textcite{WelchGoyal2008}, improving the out-of-sample performance of most predictors in economically meaningful ways. Second, I propose a new aggregate predictor, the Aligned Economic Index, constructed via partial least squares (PLS). Under the state-switching model, the Aligned Economic Index exhibits statistically and economically significant predictive power in sample and out of sample, and it outperforms widely used benchmark predictors and alternative predictor-combination methods.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jibin Joseph
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Many businesses depend on their mobile apps and websites, so user frustration while trying to complete a task on these channels can cause lost sales and complaints. In this research, I use clickstream data from a real e-commerce site to predict whether a session is frustrated or not. Frustration is defined using certain rules based on rage bursts, back and forth navigation (U turns), cart churn, search struggle, and long wandering sessions, and applies these rules to 5.4 million raw clickstream events (304,881 sessions). From each session, I build tabular features and train standard classifier models. I also use the full event sequence to train a discriminative LSTM classifier. XGBoost reaches about 90% accuracy, ROC AUC of 0.9579, while the LSTM performs best with about 91% accuracy and a ROC AUC of 0.9705. Finally, the research shows that with only the first 20 to 30 interactions, the LSTM already predicts frustration reliably.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mihai Suteu, Ovidiu Serban
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-task learning (MTL) aims to leverage shared knowledge across tasks to improve generalization and parameter efficiency, yet balancing resources and mitigating interference remain open challenges. Architectural solutions often introduce elaborate task-specific modules or routing schemes, increasing complexity and overhead. In this work, we show that normalization layers alone are sufficient to address many of these challenges. Simply replacing shared normalization with task-specific variants already yields competitive performance, questioning the need for complex designs. Building on this insight, we propose Task-Specific Sigmoid Batch Normalization (TS$ฯ$BN), a lightweight mechanism that enables tasks to softly allocate network capacity while fully sharing feature extractors. TS$ฯ$BN improves stability across CNNs and Transformers, matching or exceeding performance on NYUv2, Cityscapes, CelebA, and PascalContext, while remaining highly parameter-efficient. Moreover, its learned gates provide a natural framework for analyzing MTL dynamics, offering interpretable insights into capacity allocation, filter specialization, and task relationships. Our findings suggest that complex MTL architectures may be unnecessary and that task-specific normalization offers a simple, interpretable, and efficient alternative.
๐
2025-12-23
๐ Presented at the 2025 IEEE 22nd India Council Inte
๐ฅ Rajdeep Chatterjee, Sudip Chakrabarty, Trishaani Acharjee, Deepanjali Mishra
Machine Learning
ๆฅ็ๆ่ฆ
Unmanned aerial vehicles (UAVs), commonly known as drones, are increasingly used across diverse domains, including logistics, agriculture, surveillance, and defense. While these systems provide numerous benefits, their misuse raises safety and security concerns, making effective detection mechanisms essential. Acoustic sensing offers a low-cost and non-intrusive alternative to vision or radar-based detection, as drone propellers generate distinctive sound patterns. This study introduces AUDRON (AUdio-based Drone Recognition Network), a hybrid deep learning framework for drone sound detection, employing a combination of Mel-Frequency Cepstral Coefficients (MFCC), Short-Time Fourier Transform (STFT) spectrograms processed with convolutional neural networks (CNNs), recurrent layers for temporal modeling, and autoencoder-based representations. Feature-level fusion integrates complementary information before classification. Experimental evaluation demonstrates that AUDRON effectively differentiates drone acoustic signatures from background noise, achieving high accuracy while maintaining generalizability across varying conditions. AUDRON achieves 98.51 percent and 97.11 percent accuracy in binary and multiclass classification. The results highlight the advantage of combining multiple feature representations with deep learning for reliable acoustic drone detection, suggesting the framework's potential for deployment in security and surveillance applications where visual or radar sensing may be limited.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuan-An Le, Minh-Nam Tran, Son Nguyen
Natural Language Processing
ๆฅ็ๆ่ฆ
Distilling knowledge from large proprietary models (e.g., GPT-4) to tiny deployable models (less than 1B parameters) faces a critical capacity-budget trap: the 1000x capacity gap between teachers and students prevents effective direct transfer, while API costs prohibit extensive data collection. We introduce BRIDGE (Budget-Aware Reasoning via Intermediate Distillation), a two-phase framework that resolves these constraints through strategic intermediation and budget asymmetry. In Phase 1, a mid-sized Teacher Assistant (TA; e.g., about 7B) learns from the black-box teacher on a strictly limited subset of data (e.g., 3-5%), selected via a zero-API-cost pipeline that balances entropic difficulty and semantic diversity using only local TA inference. In Phase 2, we exploit this asymmetry-teacher queries are expensive, whereas TA inference is free to amplify supervision: the refined TA generates synthetic rationales for the full dataset to train the tiny student. Crucially, we apply an instruction-tuning curriculum to establish behavioral alignment in the tiny student before transferring reasoning. Our theoretical analysis shows that BRIDGE yields tighter generalization bounds than direct distillation when data is abundant. Experiments across medical, legal, and financial benchmarks demonstrate consistent improvements: BRIDGE delivers student performance gains of 28-41%, closing the capability gap with proprietary teachers by 12-16% while using 10x fewer teacher queries. Notably, BRIDGE defies the conventional cost-performance frontier, surpassing direct distillation baselines that use 100% of the budget while consuming only 5% of the resources.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Corey Adams, Rishikesh Ranade, Ram Cherukuri, Sanjay Choudhry
Natural Language Processing
ๆฅ็ๆ่ฆ
We present GeoTransolver, a Multiscale Geometry-Aware Physics Attention Transformer for CAE that replaces standard attention with GALE, coupling physics-aware self-attention on learned state slices with cross-attention to a shared geometry/global/boundary-condition context computed from multi-scale ball queries (inspired by DoMINO) and reused in every block. Implemented and released in NVIDIA PhysicsNeMo, GeoTransolver persistently projects geometry, global and boundary condition parameters into physical state spaces to anchor latent computations to domain structure and operating regimes. We benchmark GeoTransolver on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing, comparing against Domino, Transolver (as released in PhysicsNeMo), and literature-reported AB-UPT, and evaluate drag/lift R2 and Relative L1 errors for field variables. GeoTransolver delivers better accuracy, improved robustness to geometry/regime shifts, and favorable data efficiency; we include ablations on DrivAerML and qualitative results such as contour plots and design trends for the best GeoTransolver models. By unifying multiscale geometry-aware context with physics-based attention in a scalable transformer, GeoTransolver advances operator learning for high-fidelity surrogate modeling across complex, irregular domains and non-linear physical regimes.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Samya Praharaj, Koulik Khamaru
ๆฅ็ๆ่ฆ
Statistical inference in contextual bandits is complicated by the adaptive, non-i.i.d. nature of the data. A growing body of work has shown that classical least-squares inference may fail under adaptive sampling, and that constructing valid confidence intervals for linear functionals of the model parameter typically requires paying an unavoidable inflation of order $\sqrt{d \log T}$. This phenomenon -- often referred to as the price of adaptivity -- highlights the inherent difficulty of reliable inference under general contextual bandit policies.
A key structural property that circumvents this limitation is the \emph{stability} condition of Lai and Wei, which requires the empirical feature covariance to concentrate around a deterministic limit. When stability holds, the ordinary least-squares estimator satisfies a central limit theorem, and classical Wald-type confidence intervals -- designed for i.i.d. data -- become asymptotically valid even under adaptation, \emph{without} incurring the $\sqrt{d \log T}$ price of adaptivity.
In this paper, we propose and analyze a penalized EXP4 algorithm for linear contextual bandits. Our first main result shows that this procedure satisfies the Lai--Wei stability condition and therefore admits valid Wald-type confidence intervals for linear functionals. Our second result establishes that the same algorithm achieves regret guarantees that are minimax optimal up to logarithmic factors, demonstrating that stability and statistical efficiency can coexist within a single contextual bandit method. Finally, we complement our theory with simulations illustrating the empirical normality of the resulting estimators and the sharpness of the corresponding confidence intervals.
๐
2025-12-23
๐ Accepted for publication to the 40th IEEE Internat
๐ฅ Daniel M. Jimenez-Gutierrez, Mehrdad Hassanzadeh, Aris Anagnostopoulos, Ioannis Chatzigiannakis, Andrea Vitaletti
Machine Learning
ๆฅ็ๆ่ฆ
Federated learning (FL) supports privacy-preserving, decentralized machine learning (ML) model training by keeping data on client devices. However, non-independent and identically distributed (non-IID) data across clients biases updates and degrades performance. To alleviate these issues, we propose Clust-PSI-PFL, a clustering-based personalized FL framework that uses the Population Stability Index (PSI) to quantify the level of non-IID data. We compute a weighted PSI metric, $WPSI^L$, which we show to be more informative than common non-IID metrics (Hellinger, Jensen-Shannon, and Earth Mover's distance). Using PSI features, we form distributionally homogeneous groups of clients via K-means++; the number of optimal clusters is chosen by a systematic silhouette-based procedure, typically yielding few clusters with modest overhead. Across six datasets (tabular, image, and text modalities), two partition protocols (Dirichlet with parameter $ฮฑ$ and Similarity with parameter S), and multiple client sizes, Clust-PSI-PFL delivers up to 18% higher global accuracy than state-of-the-art baselines and markedly improves client fairness by a relative improvement of 37% under severe non-IID data. These results establish PSI-guided clustering as a principled, lightweight mechanism for robust PFL under label skew.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Maximilian Witte, Johannes Meuer, รtienne Plรฉsiat, Christopher Kadow
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Accurate and physically consistent modeling of Earth system dynamics requires machine-learning architectures that operate directly on continuous geophysical fields and preserve their underlying geometric structure. Here we introduce Field-Space attention, a mechanism for Earth system Transformers that computes attention in the physical domain rather than in a learned latent space. By maintaining all intermediate representations as continuous fields on the sphere, the architecture enables interpretable internal states and facilitates the enforcement of scientific constraints. The model employs a fixed, non-learned multiscale decomposition and learns structure-preserving deformations of the input field, allowing coherent integration of coarse and fine-scale information while avoiding the optimization instabilities characteristic of standard single-scale Vision Transformers. Applied to global temperature super-resolution on a HEALPix grid, Field-Space Transformers converge more rapidly and stably than conventional Vision Transformers and U-Net baselines, while requiring substantially fewer parameters. The explicit preservation of field structure throughout the network allows physical and statistical priors to be embedded directly into the architecture, yielding improved fidelity and reliability in data-driven Earth system modeling. These results position Field-Space Attention as a compact, interpretable, and physically grounded building block for next-generation Earth system prediction and generative modeling frameworks.
๐
2025-12-23
๐ This work has been accepted for publication in the
๐ฅ Dogan Altan, Hamza Haruna Mohammed, Glenn Terje Lines, Dusica Marijan, Arnbjรธrn Maressa
Machine Learning
ๆฅ็ๆ่ฆ
Optimizing maritime operations, particularly fuel consumption for vessels, is crucial, considering its significant share in global trade. As fuel consumption is closely related to the shaft power of a vessel, predicting shaft power accurately is a crucial problem that requires careful consideration to minimize costs and emissions. Traditional approaches, which incorporate empirical formulas, often struggle to model dynamic conditions, such as sea conditions or fouling on vessels. In this paper, we present a hybrid, physics-guided neural network-based approach that utilizes empirical formulas within the network to combine the advantages of both neural networks and traditional techniques. We evaluate the presented method using data obtained from four similar-sized cargo vessels and compare the results with those of a baseline neural network and a traditional approach that employs empirical formulas. The experimental results demonstrate that the physics-guided neural network approach achieves lower mean absolute error, root mean square error, and mean absolute percentage error for all tested vessels compared to both the empirical formula-based method and the base neural network.
๐
2025-12-23
๐ NeurIPS
๐ฅ Emilia Majerz, Witold Dzwinel, Jacek Kitowski
Machine Learning
ๆฅ็ๆ่ฆ
Physics-based machine learning blends traditional science with modern data-driven techniques. Rather than relying exclusively on empirical data or predefined equations, this methodology embeds domain knowledge directly into the learning process, resulting in models that are both more accurate and robust. We leverage this paradigm to accelerate simulations of the Zero Degree Calorimeter (ZDC) of the ALICE experiment at CERN. Our method introduces a novel loss function and an output variability-based scaling mechanism, which enhance the model's capability to accurately represent the spatial distribution and morphology of particle showers in detector outputs while mitigating the influence of rare artefacts on the training. Leveraging Normalizing Flows (NFs) in a teacher-student generative framework, we demonstrate that our approach not only outperforms classic data-driven model assimilation but also yields models that are 421 times faster than existing NF implementations in ZDC simulation literature.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mrinmay Sen, Subhrajit Nag
Robotics
ๆฅ็ๆ่ฆ
Data heterogeneity is a significant challenge in modern federated learning (FL) as it creates variance in local model updates, causing the aggregated global model to shift away from the true global optimum. Partial client participation in FL further exacerbates this issue by skewing the aggregation of local models towards the data distribution of participating clients. This creates additional variance in the global model updates, causing the global model to converge away from the optima of the global objective. These variances lead to instability in FL training, which degrades global model performance and slows down FL training. While existing literature primarily focuses on addressing data heterogeneity, the impact of partial client participation has received less attention. In this paper, we propose FedDPC, a novel FL method, designed to improve FL training and global model performance by mitigating both data heterogeneity and partial client participation. FedDPC addresses these issues by projecting each local update onto the previous global update, thereby controlling variance in both local and global updates. To further accelerate FL training, FedDPC employs adaptive scaling for each local update before aggregation. Extensive experiments on image classification tasks with multiple heterogeneously partitioned datasets validate the effectiveness of FedDPC. The results demonstrate that FedDPC outperforms state-of-the-art FL algorithms by achieving faster reduction in training loss and improved test accuracy across communication rounds.
๐
2025-12-23
๐ ArXiv Preprint (cs.SE)
๐ฅ Antonio Vitale, Khai-Nguyen Nguyen, Denys Poshyvanyk, Rocco Oliveto, Simone Scalabrino et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent progress in Large Language Models (LLMs) has substantially advanced the automation of software engineering (SE) tasks, enabling complex activities such as code generation and code summarization. However, the black-box nature of LLMs remains a major barrier to their adoption in high-stakes and safety-critical domains, where explainability and transparency are vital for trust, accountability, and effective human supervision. Despite increasing interest in explainable AI for software engineering, existing methods lack domain-specific explanations aligned with how practitioners reason about SE artifacts. To address this gap, we introduce FeatureSHAP, the first fully automated, model-agnostic explainability framework tailored to software engineering tasks. Based on Shapley values, FeatureSHAP attributes model outputs to high-level input features through systematic input perturbation and task-specific similarity comparisons, while remaining compatible with both open-source and proprietary LLMs. We evaluate FeatureSHAP on two bi-modal SE tasks: code generation and code summarization. The results show that FeatureSHAP assigns less importance to irrelevant input features and produces explanations with higher fidelity than baseline methods. A practitioner survey involving 37 participants shows that FeatureSHAP helps practitioners better interpret model outputs and make more informed decisions. Collectively, FeatureSHAP represents a meaningful step toward practical explainable AI in software engineering. FeatureSHAP is available at https://github.com/deviserlab/FeatureSHAP.
๐
2025-12-23
๐ ArXiv Preprint (cs.CG)
๐ฅ Yoshihiro Maruyama
Machine Learning
ๆฅ็ๆ่ฆ
Exterior powers play important roles in persistent homology in computational geometry. In the present paper we study the problem of extracting the $K$ longest intervals of the exterior-power layers of a tame persistence module. We prove a structural decomposition theorem that organizes the exterior-power layers into monotone per-anchor streams with explicit multiplicities, enabling a best-first algorithm. We also show that the Top-$K$ length vector is $2$-Lipschitz under bottleneck perturbations of the input barcode, and prove a comparison-model lower bound. Our experiments confirm the theory, showing speedups over full enumeration in high overlap cases. By enabling efficient extraction of the most prominent features, our approach makes higher-order persistence feasible for large datasets and thus broadly applicable to machine learning, data science, and scientific computing.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Saisai Yang, Qingyi Huang, Jing Yuan, Liangyu Zha, Kai Tang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine-tuned via Supervised Fine-Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks. Reinforcement Learning (RL) offers a promising avenue to enhance these capabilities, yet its application in the tabular domain faces three critical hurdles: the scarcity of high-quality agentic trajectories with closed-loop code execution and environment feedback on diverse table structures, the extreme heterogeneity of feedback signals ranging from rigid SQL execution to open-ended data interpretation, and the risk of catastrophic forgetting of general knowledge during vertical specialization. To overcome these challenges and unlock advanced reasoning on complex tables, we introduce \textbf{TableGPT-R1}, a specialized tabular model built on a systematic RL framework. Our approach integrates a comprehensive data engineering pipeline that synthesizes difficulty-stratified agentic trajectories for both supervised alignment and RL rollouts, a task-adaptive reward system that combines rule-based verification with a criteria-injected reward model and incorporates process-level step reward shaping with behavioral regularization, and a multi-stage training framework that progressively stabilizes reasoning before specializing in table-specific tasks. Extensive evaluations demonstrate that TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities. Our model is available at https://huggingface.co/tablegpt/TableGPT-R1.
๐
2025-12-23
๐ ArXiv Preprint (cs.CG)
๐ฅ Yoshihiro Maruyama
Robotics
ๆฅ็ๆ่ฆ
We present the Chromatic Persistence Algorithm (CPA), an event-driven method for computing persistent cohomological features of weighted graphs via graphic arrangements, a classical object in computational geometry. We establish rigorous complexity results: CPA is exponential in the worst case, fixed-parameter tractable in treewidth, and nearly linear for common graph families such as trees, cycles, and series-parallel graphs. Finally, we demonstrate its practical applicability through a controlled experiment on molecular-like graph structures.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Mebin Jose, Jisha Francis, Sudheesh Kumar Kattumannil
Machine Learning
ๆฅ็ๆ่ฆ
Survival analysis relies fundamentally on the semi-parametric Cox Proportional Hazards (CoxPH) model and the parametric Accelerated Failure Time (AFT) model. CoxPH assumes constant hazard ratios, often failing to capture real-world dynamics, while traditional AFT models are limited by rigid distributional assumptions. Although deep learning models like DeepAFT address these constraints by improving predictive accuracy and handling censoring, they inherit the significant challenge of black-box interpretability. The recent introduction of CoxKAN demonstrated the successful integration of Kolmogorov-Arnold Networks (KANs), a novel architecture that yields highly accurate and interpretable symbolic representations, within the CoxPH framework. Motivated by the interpretability gains of CoxKAN, we introduce KAN-AFT (Kolmogorov Arnold Network-based AFT), the first framework to apply KANs to the AFT model. KAN-AFT effectively models complex nonlinear relationships within the AFT framework. Our primary contributions include: (i) a principled AFT-KAN formulation, (ii) robust optimization strategies for right-censored observations (e.g., Buckley-James and IPCW), and (iii) an interpretability pipeline that converts the learned spline functions into closed-form symbolic equations for survival time. Empirical results on multiple datasets confirm that KAN-AFT achieves performance comparable to or better than DeepAFT, while uniquely providing transparent, symbolic models of the survival process.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuxing Gan, Ziyu Lei
Machine Learning
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) architectures achieve parameter efficiency through conditional computation, yet contemporary designs suffer from two fundamental limitations: structural parameter isolation that causes catastrophic forgetting, and instruction-overfitting that degrades performance in instruction-free scenarios. We propose CDSP-MoE (Conflict-Driven Subspace Pruning MoE), a framework that addresses these issues through a paradigm shift from isolated expert containers to dynamic expert instantiation within a shared physical subspace. Grounded in the Universal Weight Subspace Hypothesis, CDSP-MoE maintains a super-complete parameter backbone where logical experts are carved out via learnable topology masks. Unlike prior work that uses gradient conflict for token reassignment or optimization surgery, we leverage it as a structural supervisory signal: a Lagged Gradient Game penalizes interfering connections in the shared manifold, enabling the topology to spontaneously prune conflicting pathways and evolve interpretable modular structures. Experimental results demonstrate that CDSP-MoE achieves robust content-driven routing without human-defined task labels, maintaining semantic specialization even under strict blind inference protocols where explicit instructions are absent. Code is available at: https://github.com/konodiodaaaaa1/Conflict-Driven-Subspace-Pruning-Mixture-of-Experts
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuanjian Xu, Yuan Shuai, Jianing Hao, Guang Zhang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Neural Stochastic Differential Equations (Neural SDEs) provide a principled framework for modeling continuous-time stochastic processes and have been widely adopted in fields ranging from physics to finance. Recent advances suggest that Generative Adversarial Networks (GANs) offer a promising solution to learning the complex path distributions induced by SDEs. However, a critical bottleneck lies in designing a discriminator that faithfully captures temporal dependencies while remaining computationally efficient. Prior works have explored Neural Controlled Differential Equations (CDEs) as discriminators due to their ability to model continuous-time dynamics, but such architectures suffer from high computational costs and exacerbate the instability of adversarial training. To address these limitations, we introduce HGAN-SDEs, a novel GAN-based framework that leverages Neural Hermite functions to construct a structured and efficient discriminator. Hermite functions provide an expressive yet lightweight basis for approximating path-level dynamics, enabling both reduced runtime complexity and improved training stability. We establish the universal approximation property of our framework for a broad class of SDE-driven distributions and theoretically characterize its convergence behavior. Extensive empirical evaluations on synthetic and real-world systems demonstrate that HGAN-SDEs achieve superior sample quality and learning efficiency compared to existing generative models for SDEs
๐
2025-12-23
๐ ArXiv Preprint (math.OC)
๐ฅ Matthias K. Hoffmann, Amine Othmane, Kathrin Flaรkamp
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Many engineering tasks require solving families of nonlinear constrained optimization problems, parametrized in setting-specific variables. This is computationally demanding, particularly, if solutions have to be computed across strongly varying parameter values, e.g., in real-time control or for model-based design. Thus, we propose to learn the mapping from parameters to the primal optimal solutions and to their corresponding duals using neural networks, giving a dense estimation in contrast to gridded approaches. Our approach, Optimality-informed Neural Networks (OptINNs), combines (i) a KKT-residual loss that penalizes violations of the first-order optimality conditions under standard constraint qualifications assumptions, and (ii) problem-specific output activations that enforce simple inequality constraints (e.g., box-type/positivity) by construction. This design reduces data requirements, allows the prediction of dual variables, and improves feasibility and closeness to optimality compared to penalty-only training. Taking quadratic penalties as a baseline, since this approach has been previously proposed for the considered problem class in literature, our method simplifies hyperparameter tuning and attains tighter adherence to optimality conditions. We evaluate OptINNs on different nonlinear optimization problems ranging from low to high dimensions. On small problems, OptINNs match a quadratic-penalty baseline in primal accuracy while additionally predicting dual variables with low error. On larger problems, OptINNs achieve lower constraint violations and lower primal error compared to neural networks based on the quadratic-penalty method. These results suggest that embedding feasibility and optimality into the network architecture and loss can make learning-based surrogates more accurate, feasible, and data-efficient for parametric optimization.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Marco A. Iglesias, Michael. E. Causon, Mikhail Y. Matveev, Andreas Endruweit, Michael . V. Tretyakov
Robotics
ๆฅ็ๆ่ฆ
This work demonstrates that neural operator learning provides a powerful and flexible framework for building fast, accurate emulators of moving boundary systems, enabling their integration into digital twin platforms. To this end, a Deep Operator Network (DeepONet) architecture is employed to construct an efficient surrogate model for moving boundary problems in single-phase Darcy flow through porous media. The surrogate enables rapid and accurate approximation of complex flow dynamics and is coupled with an Ensemble Kalman Inversion (EKI) algorithm to solve Bayesian inverse problems.
The proposed inversion framework is demonstrated by estimating the permeability and porosity of fibre reinforcements for composite materials manufactured via the Resin Transfer Moulding (RTM) process. Using both synthetic and experimental in-process data, the DeepONet surrogate accelerates inversion by several orders of magnitude compared with full-model EKI. This computational efficiency enables real-time, accurate, high-resolution estimation of local variations in permeability, porosity, and other parameters, thereby supporting effective monitoring and control of RTM processes, as well as other applications involving moving boundary flows. Unlike prior approaches for RTM inversion that learn mesh-dependent mappings, the proposed neural operator generalises across spatial and temporal domains, enabling evaluation at arbitrary sensor configurations without retraining, and represents a significant step toward practical industrial deployment of digital twins.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuanyu Hu
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Multimodal brain decoding aims to reconstruct semantic information that is consistent with visual stimuli from brain activity signals such as fMRI, and then generate readable natural language descriptions. However, multimodal brain decoding still faces key challenges in cross-subject generalization and interpretability. We propose a BrainROI model and achieve leading-level results in brain-captioning evaluation on the NSD dataset. Under the cross-subject setting, compared with recent state-of-the-art methods and representative baselines, metrics such as BLEU-4 and CIDEr show clear improvements. Firstly, to address the heterogeneity of functional brain topology across subjects, we design a new fMRI encoder. We use multi-atlas soft functional parcellations (soft-ROI) as a shared space. We extend the discrete ROI Concatenation strategy in MINDLLM to a voxel-wise gated fusion mechanism (Voxel-gate). We also ensure consistent ROI mapping through global label alignment, which enhances cross-subject transferability. Secondly, to overcome the limitations of manual and black-box prompting methods in stability and transparency, we introduce an interpretable prompt optimization process. In a small-sample closed loop, we use a locally deployed Qwen model to iteratively generate and select human-readable prompts. This process improves the stability of prompt design and preserves an auditable optimization trajectory. Finally, we impose parameterized decoding constraints during inference to further improve the stability and quality of the generated descriptions.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nathan Roos, Ekaterina Iakovleva, Ani Gjergji, Vito Paolo Pastore, Enzo Tartaglione
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Diffusion-based generative models demonstrate state-of-the-art performance across various image synthesis tasks, yet their tendency to replicate and amplify dataset biases remains poorly understood. Although previous research has viewed bias amplification as an inherent characteristic of diffusion models, this work provides the first analysis of how sampling algorithms and their hyperparameters influence bias amplification. We empirically demonstrate that samplers for diffusion models -- commonly optimized for sample quality and speed -- have a significant and measurable effect on bias amplification. Through controlled studies with models trained on Biased MNIST, Multi-Color MNIST and BFFHQ, and with Stable Diffusion, we show that sampling hyperparameters can induce both bias reduction and amplification, even when the trained model is fixed. Source code is available at https://github.com/How-I-met-your-bias/how_i_met_your_bias.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Onintze Zaballa, Verรณnica รlvarez, Santiago Mazuelas
Machine Learning
ๆฅ็ๆ่ฆ
Simultaneous load forecasting across multiple entities (e.g., regions, buildings) is crucial for the efficient, reliable, and cost-effective operation of power systems. Accurate load forecasting is a challenging problem due to the inherent uncertainties in load demand, dynamic changes in consumption patterns, and correlations among entities. Multi-task learning has emerged as a powerful machine learning approach that enables the simultaneous learning across multiple related problems. However, its application to load forecasting remains underexplored and is limited to offline learning-based methods, which cannot capture changes in consumption patterns. This paper presents an adaptive multi-task learning method for probabilistic load forecasting. The proposed method can dynamically adapt to changes in consumption patterns and correlations among entities. In addition, the techniques presented provide reliable probabilistic predictions for loads of multiples entities and assess load uncertainties. Specifically, the method is based on vectorvalued hidden Markov models and uses a recursive process to update the model parameters and provide predictions with the most recent parameters. The performance of the proposed method is evaluated using datasets that contain the load demand of multiple entities and exhibit diverse and dynamic consumption patterns. The experimental results show that the presented techniques outperform existing methods both in terms of forecasting performance and uncertainty assessment.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kausthubh Manda, Raghuram Bharadwaj Diddigi
Machine Learning
ๆฅ็ๆ่ฆ
We study offline multitask reinforcement learning in settings where multiple tasks share a low-rank representation of their action-value functions. In this regime, a learner is provided with fixed datasets collected from several related tasks, without access to further online interaction, and seeks to exploit shared structure to improve statistical efficiency and generalization. We analyze a multitask variant of fitted Q-iteration that jointly learns a shared representation and task-specific value functions via Bellman error minimization on offline data. Under standard realizability and coverage assumptions commonly used in offline reinforcement learning, we establish finite-sample generalization guarantees for the learned value functions. Our analysis explicitly characterizes how pooling data across tasks improves estimation accuracy, yielding a $1/\sqrt{nT}$ dependence on the total number of samples across tasks, while retaining the usual dependence on the horizon and concentrability coefficients arising from distribution shift. In addition, we consider a downstream offline setting in which a new task shares the same underlying representation as the upstream tasks. We study how reusing the representation learned during the multitask phase affects value estimation for this new task, and show that it can reduce the effective complexity of downstream learning relative to learning from scratch. Together, our results clarify the role of shared representations in multitask offline Q-learning and provide theoretical insight into when and how multitask structure can improve generalization in model-free, value-based reinforcement learning.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jixiao Yang, Jinyu Chen, Zixiao Huang, Chengda Xu, Chi Zhang et al.
ๆฅ็ๆ่ฆ
Federated learning across multi-cloud environments faces critical challenges, including non-IID data distributions, malicious participant detection, and substantial cross-cloud communication costs (egress fees). Existing Byzantine-robust methods focus primarily on model accuracy while overlooking the economic implications of data transfer across cloud providers. This paper presents Cost-TrustFL, a hierarchical federated learning framework that jointly optimizes model performance and communication costs while providing robust defense against poisoning attacks. We propose a gradient-based approximate Shapley value computation method that reduces the complexity from exponential to linear, enabling lightweight reputation evaluation. Our cost-aware aggregation strategy prioritizes intra-cloud communication to minimize expensive cross-cloud data transfers. Experiments on CIFAR-10 and FEMNIST datasets demonstrate that Cost-TrustFL achieves 86.7% accuracy under 30% malicious clients while reducing communication costs by 32% compared to baseline methods. The framework maintains stable performance across varying non-IID degrees and attack intensities, making it practical for real-world multi-cloud deployments.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunan Lin, Sebastian Bathiany, Maha Badri, Maximilian Gelbrecht, Philipp Hess et al.
Machine Learning
ๆฅ็ๆ่ฆ
Global gridded crop models (GGCMs) simulate daily crop growth by explicitly representing key biophysical processes and project end-of-season yield time series. They are a primary tool to quantify the impacts of climate change on agricultural productivity and assess associated risks for food security. Despite decades of development, state-of-the-art GGCMs still have substantial uncertainties in simulating complex biophysical processes due to limited process understanding. Recently, machine learning approaches trained on observational data have shown great potential in crop yield predictions. However, these models have not demonstrated improved performance over classical GGCMs and are not suitable for simulating crop yields under changing climate conditions due to problems in generalizing outside their training distributions. Here we introduce NeuralCrop, a hybrid GGCM that combines the strengths of an advanced process-based GGCM, resolving important processes explicitly, with data-driven machine learning components. The model is first trained to emulate a competitive GGCM before it is fine-tuned on observational data. We show that NeuralCrop outperforms state-of-the-art GGCMs across site-level and large-scale cropping regions. Across moisture conditions, NeuralCrop reproduces the interannual yield anomalies in European wheat regions and the US Corn Belt more accurately during the period from 2000 to 2019 with particularly strong improvements under drought extremes. When generalizing to conditions unseen during training, NeuralCrop continues to make robust projections, while pure machine learning models exhibit substantial performance degradation. Our results show that our hybrid crop modelling approach offers overall improved crop modeling and more reliable yield projections under climate change and intensifying extreme weather conditions.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junghyun Lee, Branislav Kveton, Sunav Choudhary, Subhojyoti Mukherjee, Anup Rao et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers. We instantiate and compare several sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR. Our experiments on the ARC, MMLU, and OpenBookQA datasets with the Llama and Qwen models show that the sampling scheme can significantly affect the accuracy of learned reasoning models. Despite its simplicity, we observe that PPS outperforms the other sampling schemes.
๐
2025-12-23
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Songze Li, Jiameng Cheng, Yiming Li, Xiaojun Jia, Dacheng Tao
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
By integrating language understanding with perceptual modalities such as images, multimodal large language models (MLLMs) constitute a critical substrate for modern AI systems, particularly intelligent agents operating in open and interactive environments. However, their increasing accessibility also raises heightened risks of misuse, such as generating harmful or unsafe content. To mitigate these risks, alignment techniques are commonly applied to align model behavior with human values. Despite these efforts, recent studies have shown that jailbreak attacks can circumvent alignment and elicit unsafe outputs. Currently, most existing jailbreak methods are tailored for open-source models and exhibit limited effectiveness against commercial MLLM-integrated systems, which often employ additional filters. These filters can detect and prevent malicious input and output content, significantly reducing jailbreak threats. In this paper, we reveal that the success of these safety filters heavily relies on a critical assumption that malicious content must be explicitly visible in either the input or the output. This assumption, while often valid for traditional LLM-integrated systems, breaks down in MLLM-integrated systems, where attackers can leverage multiple modalities to conceal adversarial intent, leading to a false sense of security in existing MLLM-integrated systems. To challenge this assumption, we propose Odysseus, a novel jailbreak paradigm that introduces dual steganography to covertly embed malicious queries and responses into benign-looking images. Extensive experiments on benchmark datasets demonstrate that our Odysseus successfully jailbreaks several pioneering and realistic MLLM-integrated systems, achieving up to 99% attack success rate. It exposes a fundamental blind spot in existing defenses, and calls for rethinking cross-modal security in MLLM-integrated systems.
๐
2025-12-23
๐ IEEE/ACM Transactions on Audio, Speech and Languag
๐ฅ Xiang Chen, Yixin Ou, Quan Feng, Lei Li, Piji Li et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
The pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the ``pre-train, prompt, and predict'' paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuanhao Chen, Qi Liu, Pengbin Chen, Zhongjian Qiao, Yanjie Li
Machine Learning
ๆฅ็ๆ่ฆ
Offline reinforcement learning (RL) aims to learn a policy that maximizes the expected return using a given static dataset of transitions. However, offline RL faces the distribution shift problem. The policy constraint offline RL method is proposed to solve the distribution shift problem. During the policy constraint offline RL training, it is important to ensure the difference between the learned policy and behavior policy within a given threshold. Thus, the learned policy heavily relies on the quality of the behavior policy. However, a problem exists in existing policy constraint methods: if the dataset contains many low-reward transitions, the learned will be contained with a suboptimal reference policy, leading to slow learning speed, low sample efficiency, and inferior performances. This paper shows that the sampling method in policy constraint offline RL that uses all the transitions in the dataset can be improved. A simple but efficient sample filtering method is proposed to improve the sample efficiency and the final performance. First, we evaluate the score of the transitions by average reward and average discounted reward of episodes in the dataset and extract the transition samples of high scores. Second, the high-score transition samples are used to train the offline RL algorithms. We verify the proposed method in a series of offline RL algorithms and benchmark tasks. Experimental results show that the proposed method outperforms baselines.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr
Machine Learning
ๆฅ็ๆ่ฆ
As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable beliefs while maintaining near-constant memory use over interaction steps. However, bottleneck approaches are generally prone to error propagation, which we observe causing inferior performance when compared to the full context setting due to errors in belief updating. Therefore, we train LLMs to generate and act on beliefs within the ABBEL framework via reinforcement learning (RL). We experiment with belief grading, to reward higher quality beliefs, as well as belief length penalties to reward more compressed beliefs. Our experiments demonstrate the ability of RL to improve ABBEL's performance beyond the full context setting, while using less memory than contemporaneous approaches.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Annika Hirling, Giorgio Nicoletti, Antonio Celani
Machine Learning
ๆฅ็ๆ่ฆ
The Multi-Armed Bandit problem provides a fundamental framework for analyzing the tension between exploration and exploitation in sequential learning. This paper explores Information Directed Sampling (IDS) policies, a class of heuristics that balance immediate regret against information gain. We focus on the tractable environment of two-state Bernoulli bandits as a minimal model to rigorously compare heuristic strategies against the optimal policy. We extend the IDS framework to the discounted infinite-horizon setting by introducing a modified information measure and a tuning parameter to modulate the decision-making behavior. We examine two specific problem classes: symmetric bandits and the scenario involving one fair coin. In the symmetric case we show that IDS achieves bounded cumulative regret, whereas in the one-fair-coin scenario the IDS policy yields a regret that scales logarithmically with the horizon, in agreement with classical asymptotic lower bounds. This work serves as a pedagogical synthesis, aiming to bridge concepts from reinforcement learning and information theory for an audience of statistical physicists.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zuo Wang, Ye Yuan
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we investigate how the widely existing contextual and structural divergence may influence the representation learning in rich-text graphs. To this end, we propose Jensen-Shannon Divergence Message-Passing (JSDMP), a new learning paradigm for rich-text graph representation learning. Besides considering similarity regarding structure and text, JSDMP further captures their corresponding dissimilarity by Jensen-Shannon divergence. Similarity and dissimilarity are then jointly used to compute new message weights among text nodes, thus enabling representations to learn with contextual and structural information from truly correlated text nodes. With JSDMP, we propose two novel graph neural networks, namely Divergent message-passing graph convolutional network (DMPGCN) and Divergent message-passing Page-Rank graph neural networks (DMPPRG), for learning representations in rich-text graphs. DMPGCN and DMPPRG have been extensively texted on well-established rich-text datasets and compared with several state-of-the-art baselines. The experimental results show that DMPGCN and DMPPRG can outperform other baselines, demonstrating the effectiveness of the proposed Jensen-Shannon Divergence Message-Passing paradigm
๐
2025-12-23
๐ NeurIPS 2025
๐ฅ Jeehong Kim, Youngseok Hwang, Minchan Kim, Sungho Bae, Hyunwoo Park
Machine Learning
ๆฅ็ๆ่ฆ
Spatio-temporal graph neural networks (ST-GNNs) have achieved notable success in structured domains such as road traffic and public transportation, where spatial entities can be naturally represented as fixed nodes. In contrast, many real-world systems including maritime traffic lack such fixed anchors, making the construction of spatio-temporal graphs a fundamental challenge. Anomaly detection in these non-grid environments is particularly difficult due to the absence of canonical reference points, the sparsity and irregularity of trajectories, and the fact that anomalies may manifest at multiple granularities. In this work, we introduce a novel benchmark dataset for anomaly detection in the maritime domain, extending the Open Maritime Traffic Analysis Dataset (OMTAD) into a benchmark tailored for graph-based anomaly detection. Our dataset enables systematic evaluation across three different granularities: node-level, edge-level, and graph-level anomalies. We plan to employ two specialized LLM-based agents: \emph{Trajectory Synthesizer} and \emph{Anomaly Injector} to construct richer interaction contexts and generate semantically meaningful anomalies. We expect this benchmark to promote reproducibility and to foster methodological advances in anomaly detection for non-grid spatio-temporal systems.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yanjie Li, Jian Xu, Xueqing Chen, Lina Yu, Shiming Xiang et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Adsorption energy is a key descriptor of catalytic reactivity. It is fundamentally defined as the difference between the relaxed total energy of the adsorbate-surface system and that of an appropriate reference state; therefore, the accuracy of relaxed-energy prediction directly determines the reliability of machine-learning-driven catalyst screening. E(3)-equivariant graph neural networks (GNNs) can natively operate on three-dimensional atomic coordinates under periodic boundary conditions and have demonstrated strong performance on such tasks. In contrast, language-model-based approaches, while enabling human-readable textual descriptions and reducing reliance on explicit graph -- thereby broadening applicability -- remain insufficient in both adsorption-configuration energy prediction accuracy and in distinguishing ``the same system with different configurations,'' even with graph-assisted pretraining in the style of GAP-CATBERTa.
To this end, we propose QE-Catalytic, a multimodal framework that deeply couples a large language model (\textbf{Q}wen) with an E(3)-equivariant graph Transformer (\textbf{E}quiformer-V2), enabling unified support for adsorption-configuration property prediction and inverse design on complex catalytic surfaces. During prediction, QE-Catalytic jointly leverages three-dimensional structures and structured configuration text, and injects ``3D geometric information'' into the language channel via graph-text alignment, allowing it to function as a high-performance text-based predictor when precise coordinates are unavailable, while also autoregressively generating CIF files for target-energy-driven structure design and information completion. On OC20, QE-Catalytic reduces the MAE of relaxed adsorption energy from 0.713~eV to 0.486~eV, and consistently outperforms baseline models such as CatBERTa and GAP-CATBERTa across multiple evaluation protocols.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingue Park, Jisung Hwang, Seungwoo Yoo, Kyeongmin Yeo, Minhyuk Sung
Machine Learning
ๆฅ็ๆ่ฆ
We introduce $\texttt{PairFlow}$, a lightweight preprocessing step for training Discrete Flow Models (DFMs) to achieve few-step sampling without requiring a pretrained teacher. DFMs have recently emerged as a new class of generative models for discrete data, offering strong performance. However, they suffer from slow sampling due to their iterative nature. Existing acceleration methods largely depend on finetuning, which introduces substantial additional training overhead. $\texttt{PairFlow}$ addresses this issue with a lightweight preprocessing step. Inspired by ReFlow and its extension to DFMs, we train DFMs from coupled samples of source and target distributions, without requiring any pretrained teacher. At the core of our approach is a closed-form inversion for DFMs, which allows efficient construction of paired source-target samples. Despite its extremely low cost, taking only up to 1.7% of the compute needed for full model training, $\texttt{PairFlow}$ matches or even surpasses the performance of two-stage training involving finetuning. Furthermore, models trained with our framework provide stronger base models for subsequent distillation, yielding further acceleration after finetuning. Experiments on molecular data as well as binary and RGB images demonstrate the broad applicability and effectiveness of our approach.
๐
2025-12-23
๐ ArXiv Preprint (Multimedia)
๐ฅ Ziyang Fan, Li Tao, Yi Wang, Jingwei Qu, Ying Wang et al.
ๆฅ็ๆ่ฆ
Student engagement is a critical factor influencing academic success and learning outcomes. Accurately predicting student engagement is essential for optimizing teaching strategies and providing personalized interventions. However, most approaches focus on single-dimensional feature analysis and assessing engagement based on individual student factors. In this work, we propose a dual-stream multi-feature fusion model based on hypergraph convolutional networks (DS-HGCN), incorporating social contagion of student engagement. DS-HGCN enables accurate prediction of student engagement states by modeling multi-dimensional features and their propagation mechanisms between students. The framework constructs a hypergraph structure to encode engagement contagion among students and captures the emotional and behavioral differences and commonalities by multi-frequency signals. Furthermore, we introduce a hypergraph attention mechanism to dynamically weigh the influence of each student, accounting for individual differences in the propagation process. Extensive experiments on public benchmark datasets demonstrate that our proposed method achieves superior performance and significantly outperforms existing state-of-the-art approaches.
๐
2025-12-23
๐ ArXiv Preprint (math.NA)
๐ฅ H. Li, J. Sun, Z. Zhang
ๆฅ็ๆ่ฆ
We consider operator learning for efficiently solving parametric non-selfadjoint eigenvalue problems. To overcome the spectral instability and mode switching inherent in non-selfadjoint operators, we introduce a hybrid framework that learns the stable invariant eigensubspace mapping rather than individual eigenfunctions. We proposed a Deep Eigenspace Network (DEN) architecture integrating Fourier Neural Operators, geometry-adaptive POD bases, and explicit banded cross-mode mixing mechanisms to capture complex spectral dependencies on unstructured meshes. We apply DEN to the parametric non-selfadjoint Steklov eigenvalue problem and provide theoretical proofs for the Lipschitz continuity of the eigensubspace with respect to the parameters. In addition, we derive error bounds for the reconstruction of the eigenspace. Numerical experiments validate DEN's high accuracy and zero-shot generalization capabilities across different discretizations.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Peter N. Loxley
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Controllable Markov chains describe the dynamics of sequential decision making tasks and are the central component in optimal control and reinforcement learning. In this work, we give the general form of an optimal policy for learning controllable dynamics in an unknown environment by exploring over a limited time horizon. This policy is simple to implement and efficient to compute, and allows an agent to ``learn by exploring" as it maximizes its information gain in a greedy fashion by selecting controls from a constraint set that changes over time during exploration. We give a simple parameterization for the set of controls, and present an algorithm for finding an optimal policy. The reason for this policy is due to the existence of certain types of states that restrict control of the dynamics; such as transient states, absorbing states, and non-backtracking states. We show why the occurrence of these states makes a non-stationary policy essential for achieving optimal exploration. Six interesting examples of controllable dynamics are treated in detail. Policy optimality is demonstrated using counting arguments, comparing with suboptimal policies, and by making use of a sequential improvement property from dynamic programming.
๐
2025-12-23
๐ ArXiv Preprint (math.ST)
๐ฅ Shubhanshu Shekhar
ๆฅ็ๆ่ฆ
We consider the problem of designing optimal level-$ฮฑ$ power-one tests for composite nulls. Given a parameter $ฮฑ\in (0,1)$ and a stream of $\mathcal{X}$-valued observations $\{X_n: n \geq 1\} \overset{i.i.d.}{\sim} P$, the goal is to design a level-$ฮฑ$ power-one test $ฯ_ฮฑ$ for the null $H_0: P \in \mathcal{P}_0 \subset \mathcal{P}(\mathcal{X})$. Prior works have shown that any such $ฯ_ฮฑ$ must satisfy $\mathbb{E}_P[ฯ_ฮฑ] \geq \tfrac{\log(1/ฮฑ)}{ฮณ^*(P, \mathcal{P}_0)}$, where $ฮณ^*(P, \mathcal{P}_0)$ is the so-called $\mathrm{KL}_{\inf}$ or minimum divergence of $P$ to the null class. In this paper, our objective is to develop and analyze constructive schemes that match this lower bound as $ฮฑ\downarrow 0$.
We first consider the finite-alphabet case~($|\mathcal{X}| = m < \infty$), and show that a test based on \emph{universal} $e$-process~(formed by the ratio of a universal predictor and the running null MLE) is optimal in the above sense. The proof relies on a Donsker-Varadhan~(DV) based saddle-point representation of $\mathrm{KL}_{\inf}$, and an application of Sion's minimax theorem. This characterization motivates a general method for arbitrary $\mathcal{X}$: construct an $e$-process based on the empirical solutions to the saddle-point representation over a sufficiently rich class of test functions. We give sufficient conditions for the optimality of this test for compact convex nulls, and verify them for Hรถlder smooth density models. We end the paper with a discussion on the computational aspects of implementing our proposed tests in some practical settings.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuan Gao, Zhenguo Dong, Xuelong Wang, Zhiqiang Wang, Yong Zhang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Accurate and interpretable forecasting of multivariate time series is crucial for understanding the complex dynamics of cryptocurrency markets in digital asset systems. Advanced deep learning methodologies, particularly Transformer-based and MLP-based architectures, have achieved competitive predictive performance in cryptocurrency forecasting tasks. However, cryptocurrency data is inherently composed of long-term socio-economic trends and local high-frequency speculative oscillations. Existing deep learning-based 'black-box' models fail to effectively decouple these composite dynamics or provide the interpretability needed for trustworthy financial decision-making. To overcome these limitations, we propose DecoKAN, an interpretable forecasting framework that integrates multi-level Discrete Wavelet Transform (DWT) for decoupling and hierarchical signal decomposition with Kolmogorov-Arnold Network (KAN) mixers for transparent and interpretable nonlinear modeling. The DWT component decomposes complex cryptocurrency time series into distinct frequency components, enabling frequency-specific analysis, while KAN mixers provide intrinsically interpretable spline-based mappings within each decomposed subseries. Furthermore, interpretability is enhanced through a symbolic analysis pipeline involving sparsification, pruning, and symbolization, which produces concise analytical expressions offering symbolic representations of the learned patterns. Extensive experiments demonstrate that DecoKAN achieves the lowest average Mean Squared Error on all tested real-world cryptocurrency datasets (BTC, ETH, XMR), consistently outperforming a comprehensive suite of competitive state-of-the-art baselines. These results validate DecoKAN's potential to bridge the gap between predictive accuracy and model transparency, advancing trustworthy decision support within complex cryptocurrency markets.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Anna R. Flowers, Christopher T. Franck, Robert B. Gramacy, Justin A. Krometis
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Collecting operationally realistic data to inform machine learning models can be costly. Before collecting new data, it is helpful to understand where a model is deficient. For example, object detectors trained on images of rare objects may not be good at identification in poorly represented conditions. We offer a way of informing subsequent data acquisition to maximize model performance by leveraging the toolkit of computer experiments and metadata describing the circumstances under which the training data was collected (e.g., season, time of day, location). We do this by evaluating the learner as the training data is varied according to its metadata. A Gaussian process (GP) surrogate fit to that response surface can inform new data acquisitions. This meta-learning approach offers improvements to learner performance as compared to data with randomly selected metadata, which we illustrate on both classic learning examples, and on a motivating application involving the collection of aerial images in search of airplanes.
๐
2025-12-23
๐ This paper has been submitted to a journal
๐ฅ Qiushuo Hou, Sangwoo Park, Matteo Zecchin, Yunlong Cai, Guanding Yu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) are emerging as key enablers of automation in domains such as telecommunications, assisting with tasks including troubleshooting, standards interpretation, and network optimization. However, their deployment in practice must balance inference cost, latency, and reliability. In this work, we study an edge-cloud-expert cascaded LLM-based knowledge system that supports decision-making through a question-and-answer pipeline. In it, an efficient edge model handles routine queries, a more capable cloud model addresses complex cases, and human experts are involved only when necessary. We define a misalignment-cost constrained optimization problem, aiming to minimize average processing cost, while guaranteeing alignment of automated answers with expert judgments. We propose a statistically rigorous threshold selection method based on multiple hypothesis testing (MHT) for a query processing mechanism based on knowledge and confidence tests. The approach provides finite-sample guarantees on misalignment risk. Experiments on the TeleQnA dataset -- a telecom-specific benchmark -- demonstrate that the proposed method achieves superior cost-efficiency compared to conventional cascaded baselines, while ensuring reliability at prescribed confidence levels.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Zhihan Huang, Ziang Niu
ๆฅ็ๆ่ฆ
Goodness-of-fit (GoF) tests are fundamental for assessing model adequacy. Score-based tests are appealing because they require fitting the model only once under the null. However, extending them to powerful nonparametric alternatives is difficult due to the lack of suitable score functions. Through a class of exponentially tilted models, we show that the resulting score-based GoF tests are equivalent to the tests based on integral probability metrics (IPMs) indexed by a function class. When the class is rich, the test is universally consistent. This simple yet insightful perspective enables reinterpretation of classical distance-based testing procedures-including those based on Kolmogorov-Smirnov distance, Wasserstein-1 distance, and maximum mean discrepancy-as arising from score-based constructions. Building on this insight, we propose a new nonparametric score-based GoF test through a special class of IPM induced by kernelized Stein's function class, called semiparametric kernelized Stein discrepancy (SKSD) test. Compared with other nonparametric score-based tests, the SKSD test is computationally efficient and accommodates general nuisance-parameter estimators, supported by a generic parametric bootstrap procedure. The SKSD test is universally consistent and attains Pitman efficiency. Moreover, SKSD test provides simple GoF tests for models with intractable likelihoods but tractable scores with the help of Stein's identity and we use two popular models, kernel exponential family and conditional Gaussian models, to illustrate the power of our method. Our method achieves power comparable to task-specific normality tests such as Anderson-Darling and Lilliefors, despite being designed for general nonparametric alternatives.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sukumar Kishanthan, Asela Hevapathige
Machine Learning
ๆฅ็ๆ่ฆ
Class imbalance is a common challenge in machine learning and data mining, often leading to suboptimal performance in classifiers. While deep learning excels in feature extraction, its performance still deteriorates under imbalanced data. In this work, we propose a novel activation function, named OGAB, designed to alleviate class imbalance in deep learning classifiers. OGAB incorporates orthogonality and group-aware bias learning to enhance feature distinguishability in imbalanced scenarios without explicitly requiring label information. Our key insight is that activation functions can be used to introduce strong inductive biases that can address complex data challenges beyond traditional non-linearity. Our work demonstrates that orthogonal transformations can preserve information about minority classes by maintaining feature independence, thereby preventing the dominance of majority classes in the embedding space. Further, the proposed group-aware bias mechanism automatically identifies data clusters and adjusts embeddings to enhance class separability without the need for explicit supervision. Unlike existing approaches that address class imbalance through preprocessing data modifications or post-processing corrections, our proposed approach tackles class imbalance during the training phase at the embedding learning level, enabling direct integration with the learning process. We demonstrate the effectiveness of our solution on both real-world and synthetic imbalanced datasets, showing consistent performance improvements over both traditional and learnable activation functions.
๐
2025-12-23
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Rahul Yumlembam, Biju Issac, Seibu Mary Jacob, Longzhi Yang
Machine Learning
ๆฅ็ๆ่ฆ
Since the Internet of Things (IoT) is widely adopted using Android applications, detecting malicious Android apps is essential. In recent years, Android graph-based deep learning research has proposed many approaches to extract relationships from applications as graphs to generate graph embeddings. First, we demonstrate the effectiveness of graph-based classification using a Graph Neural Network (GNN)-based classifier to generate API graph embeddings. The graph embeddings are combined with Permission and Intent features to train multiple machine learning and deep learning models for Android malware detection. The proposed classification approach achieves an accuracy of 98.33 percent on the CICMaldroid dataset and 98.68 percent on the Drebin dataset. However, graph-based deep learning models are vulnerable, as attackers can add fake relationships to evade detection by the classifier. Second, we propose a Generative Adversarial Network (GAN)-based attack algorithm named VGAE-MalGAN targeting graph-based GNN Android malware classifiers. The VGAE-MalGAN generator produces adversarial malware API graphs, while the VGAE-MalGAN substitute detector attempts to mimic the target detector. Experimental results show that VGAE-MalGAN can significantly reduce the detection rate of GNN-based malware classifiers. Although the model initially fails to detect adversarial malware, retraining with generated adversarial samples improves robustness and helps mitigate adversarial attacks.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Khaled Kahouli, Romuald Elie, Klaus-Robert Mรผller, Quentin Berthet, Oliver T. Unke et al.
Robotics
ๆฅ็ๆ่ฆ
Diffusion models offer a robust framework for sampling from unnormalized probability densities, which requires accurately estimating the score of the noise-perturbed target distribution. While the standard Denoising Score Identity (DSI) relies on data samples, access to the target energy function enables an alternative formulation via the Target Score Identity (TSI). However, these estimators face a fundamental variance trade-off: DSI exhibits high variance in low-noise regimes, whereas TSI suffers from high variance at high noise levels. In this work, we reconcile these approaches by unifying both estimators within the principled framework of control variates. We introduce the Control Variate Score Identity (CVSI), deriving an optimal, time-dependent control coefficient that theoretically guarantees variance minimization across the entire noise spectrum. We demonstrate that CVSI serves as a robust, low-variance plug-in estimator that significantly enhances sample efficiency in both data-free sampler learning and inference-time diffusion sampling.
๐
2025-12-23
๐ KDD 2026
๐ฅ Jiacheng You, Jingcheng Yang, Yuhang Xie, Zhongxuan Wu, Xiucheng Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Time-series forecasting in real-world applications such as finance and energy often faces challenges due to limited training data and complex, noisy temporal dynamics. Existing deep forecasting models typically supervise predictions using full-length temporal windows, which include substantial high-frequency noise and obscure long-term trends. Moreover, auxiliary variables containing rich domain-specific information are often underutilized, especially in few-shot settings. To address these challenges, we propose LoFT-LLM, a frequency-aware forecasting pipeline that integrates low-frequency learning with semantic calibration via a large language model (LLM). Firstly, a Patch Low-Frequency forecasting Module (PLFM) extracts stable low-frequency trends from localized spectral patches. Secondly, a residual learner then models high-frequency variations. Finally, a fine-tuned LLM refines the predictions by incorporating auxiliary context and domain knowledge through structured natural language prompts. Extensive experiments on financial and energy datasets demonstrate that LoFT-LLM significantly outperforms strong baselines under both full-data and few-shot regimes, delivering superior accuracy, robustness, and interpretability.
๐
2025-12-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ming Li, Chenrui Fan, Yize Cheng, Soheil Feizi, Tianyi Zhou
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ John Cartmell, Mihaela Cardei, Ionut Cardei
Machine Learning
ๆฅ็ๆ่ฆ
We present a method that uses the Bloom filter transform to preprocess data for machine learning. Each sample is encoded into a compact, privacy-preserving bit array. This reduces memory use and protects the original data while keeping enough structure for accurate classification. We test the method on six datasets: SMS Spam Collection, ECG200, Adult 50K, CDC Diabetes, MNIST, and Fashion MNIST. Four classifiers are used: Extreme Gradient Boosting, Deep Neural Networks, Convolutional Neural Networks, and Logistic Regression. Results show that models trained on Bloom filter encodings achieve accuracy similar to models trained on raw data or other transforms. At the same time, the method provides memory savings while enhancing privacy. These results suggest that the Bloom filter transform is an efficient preprocessing approach for diverse machine learning tasks.
๐
2025-12-23
๐ Accepted at IEEE ICCIT 2025. This is the author ac
๐ฅ Tamim Ahasan Rijon, Yeasin Arafath
Machine Learning
ๆฅ็ๆ่ฆ
As a significant agricultural country, Bangladesh utilizes its fertile land for guava cultivation and dedicated labor to boost its economic development. In a nation like Bangladesh, enhancing guava production and agricultural practices plays a crucial role in its economy. Anthracnose and fruit fly infection can lower the quality and productivity of guava, a crucial tropical fruit. Expert systems that detect diseases early can reduce losses and safeguard the harvest. Images of guava fruits classified into the Healthy, Fruit Flies, and Anthracnose classes are included in the Guava Fruit Disease Dataset 2024 (GFDD24), which comes from plantations in Rajshahi and Pabna, Bangladesh. This study aims to create models using CNN alongside traditional machine learning techniques that can effectively identify guava diseases in locally cultivated varieties in Bangladesh. In order to achieve the highest classification accuracy of approximately 99.99% for the guava dataset, we propose utilizing ensemble models that combine CNNML with Gradient Boosting Machine. In general, the CNN-ML cascade framework exhibits strong, high-accuracy guava disease detection that is appropriate for real-time agricultural monitoring systems.
๐
2025-12-23
๐ ArXiv Preprint (q-fin.PM)
๐ฅ Nikolaos Iliopoulos
Machine Learning
ๆฅ็ๆ่ฆ
Metaheuristic algorithms for cardinality-constrained portfolio optimization require repair operators to map infeasible candidates onto the feasible region. Standard Euclidean projection treats assets as independent and can ignore the covariance structure that governs portfolio risk, potentially producing less diversified portfolios. This paper introduces Covariance-Aware Simplex Projection (CASP), a two-stage repair operator that (i) selects a target number of assets using volatility-normalized scores and (ii) projects the candidate weights using a covariance-aware geometry aligned with tracking-error risk. This provides a portfolio-theoretic foundation for using a covariance-induced distance in repair operators. On S&P 500 data (2020-2024), CASP-Basic delivers materially lower portfolio variance than standard Euclidean repair without relying on return estimates, with improvements that are robust across assets and statistically significant. Ablation results indicate that volatility-normalized selection drives most of the variance reduction, while the covariance-aware projection provides an additional, consistent improvement. We further show that optional return-aware extensions can improve Sharpe ratios, and out-of-sample tests confirm that gains transfer to realized performance. CASP integrates as a drop-in replacement for Euclidean projection in metaheuristic portfolio optimizers.
๐
2025-12-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Surya Jayakumar, Kieran Sullivan, John McLaughlin, Christine O'Meara, Indrakshi Dey
Machine Learning
ๆฅ็ๆ่ฆ
This study introduces a novel data-driven framework and the first-ever county-scale application of Spatio-Temporal Graph Neural Networks (STGNN) to forecast composite sustainability indices from herd-level operational records. The methodology employs a novel, end-to-end pipeline utilizing a Variational Autoencoder (VAE) to augment Irish Cattle Breeding Federation (ICBF) datasets, preserving joint distributions while mitigating sparsity. A first-ever pillar-based scoring formulation is derived via Principal Component Analysis, identifying Reproductive Efficiency, Genetic Management, Herd Health, and Herd Management, to construct weighted composite indices. These indices are modelled using a novel STGNN architecture that explicitly encodes geographic dependencies and non-linear temporal dynamics to generate multi-year forecasts for 2026-2030.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent computational cost. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where cls executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
๐
2025-12-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yibin Lei, Shwai He, Ang Li, Andrew Yates
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent work has shown that directly fine-tuning large language models (LLMs) for dense retrieval yields strong performance, but their substantial parameter counts make them computationally inefficient. While prior studies have revealed significant layer redundancy in LLMs for generative tasks, it remains unclear whether similar redundancy exists when these models are adapted for retrieval tasks, which require encoding entire sequences into fixed representations rather than generating tokens iteratively. To this end, we conduct a comprehensive analysis of layer redundancy in LLM-based dense retrievers. We find that, in contrast to generative settings, MLP layers are substantially more prunable, while attention layers remain critical for semantic aggregation. Building on this insight, we propose EffiR, a framework for developing efficient retrievers that performs large-scale MLP compression through a coarse-to-fine strategy (coarse-grained depth reduction followed by fine-grained width reduction), combined with retrieval-specific fine-tuning. Across diverse BEIR datasets and LLM backbones, EffiR achieves substantial reductions in model size and inference cost while preserving the performance of full-size models.
๐
2025-12-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Shuting Wang, Qiaolin Xia, Hao Wang, Yu Lu, Bobsimons et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Large Language Models (LLMs) and Large Reasoning Models (LRMs) have enabled agentic search systems that interleave multi-step reasoning with external tool use. However, existing frameworks largely rely on unstructured natural-language reasoning and accumulate raw intermediate traces in the context, which often leads to unstable reasoning trajectories, context overflow, and degraded performance on complex multi-hop queries. In this study, we introduce Laser, a general framework for stabilizing and scaling agentic search. Laser defines a symbolic action protocol that organizes agent behaviors into three spaces: planning, task-solving, and retrospection. Each action is specified with explicit semantics and a deterministic execution format, enabling structured and logical reasoning processes and reliable action parsing. This design makes intermediate decisions interpretable and traceable, enhancing explicit retrospection and fine-grained control over reasoning trajectories. In coordination with parsable actions, Laser further maintains a compact context register that stores only essential states of the reasoning process, allowing the agent to reason over long horizons without uncontrolled context expansion. Experiments on Qwen2.5/3-series models across challenging multi-hop QA datasets show that Laser consistently outperforms existing agentic search baselines under both prompting-only and fine-tuning settings, demonstrating that Laser provides a principled and effective foundation for robust, scalable agentic search.
๐
2025-12-23
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Tarik Houichime, Abdelghani Souhar, Younes El Amrani
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The memory of contemporary Large Language Models is bound by a physical paradox: as they learn, they fill up. The linear accumulation (O(N)) of Key-Value states treats context as a warehouse of static artifacts, eventually forcing a destructive choice between amnesia and latency. We challenge this discrete orthodoxy, proposing that long-term memory is not the storage of items, but the persistence of a trajectory. We introduce Phonetic Trajectory Memory (PTM), a neuro-symbolic architecture that encodes language not as a sequence of tensors, but as a continuous path on an ergodic manifold governed by irrational rotation matrices. By decoupling the navigation (an invariant O(1) geometric signal) from the reconstruction (a probabilistic generative act), PTM achieves a compression magnitude of greater than 3,000x relative to dense caches. We demonstrate that retrieval becomes a process of resonance: the phonetic trace stabilizes the model against hallucination via "Signal Consensus" mechanism, securing up to approximately 92% factual accuracy. While this aggressive abstraction alters generative texture, it unlocks immediate access latency (approximately 34ms) independent of depth. Our results suggest that infinite context does not require infinite silicon; it requires treating memory not as data to be stored, but as a reconstructive process acting on a conserved, undying physical signal.
๐
2025-12-23
๐ CVPR 2025
๐ฅ Hao Guo, Xugong Qin, Jun Jie Ou Yang, Peng Zhang, Gangyan Zeng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Document image retrieval (DIR) aims to retrieve document images from a gallery according to a given query. Existing DIR methods are primarily based on image queries that retrieve documents within the same coarse semantic category, e.g., newspapers or receipts. However, these methods struggle to effectively retrieve document images in real-world scenarios where textual queries with fine-grained semantics are usually provided. To bridge this gap, we introduce a new Natural Language-based Document Image Retrieval (NL-DIR) benchmark with corresponding evaluation metrics. In this work, natural language descriptions serve as semantically rich queries for the DIR task. The NL-DIR dataset contains 41K authentic document images, each paired with five high-quality, fine-grained semantic queries generated and evaluated through large language models in conjunction with manual verification. We perform zero-shot and fine-tuning evaluations of existing mainstream contrastive vision-language models and OCR-free visual document understanding (VDU) models. A two-stage retrieval method is further investigated for performance improvement while achieving both time and space efficiency. We hope the proposed NL-DIR benchmark can bring new opportunities and facilitate research for the VDU community. Datasets and codes will be publicly available at huggingface.co/datasets/nianbing/NL-DIR.
๐
2025-12-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yan Zhang, Li Deng, Lixin Duan, Ivor W. Tsang, Guowu Yang
ๆฅ็ๆ่ฆ
The fast online recommendation is critical for applications with large-scale databases; meanwhile, it is challenging to provide accurate recommendations in sparse scenarios. Hash technique has shown its superiority for speeding up the online recommendation by bit operations on Hamming distance computations. However, existing hashing-based recommendations suffer from low accuracy, especially with sparse settings, due to the limited representation capability of each bit and neglected inherent relations among users and items. To this end, this paper lodges a Collaborative Group-Aware Hashing (CGAH) method for both collaborative filtering (namely CGAH-CF) and content-aware recommendations (namely CGAH) by integrating the inherent group information to alleviate the sparse issue. Firstly, we extract inherent group affinities of users and items by classifying their latent vectors into different groups. Then, the preference is formulated as the inner product of the group affinity and the similarity of hash codes. By learning hash codes with the inherent group information, CGAH obtains more effective hash codes than other discrete methods with sparse interactive data. Extensive experiments on three public datasets show the superior performance of our proposed CGAH and CGAH-CF over the state-of-the-art discrete collaborative filtering methods and discrete content-aware recommendations under different sparse settings.
๐
2025-12-23
๐ IEEE/ACM Transactions on Audio, Speech and Languag
๐ฅ Xiang Chen, Yixin Ou, Quan Feng, Lei Li, Piji Li et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
The pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the ``pre-train, prompt, and predict'' paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.
๐
2025-12-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xian Wu, Ming Zhang, Zhiyu Fang, Fei Li, Bin Wang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The automation of user interface development has the potential to accelerate software delivery by mitigating intensive manual implementation. Despite the advancements in Large Multimodal Models for design-to-code translation, existing methodologies predominantly yield unstructured, flat codebases that lack compatibility with component-oriented libraries such as React or Angular. Such outputs typically exhibit low cohesion and high coupling, complicating long-term maintenance. In this paper, we propose \textbf{VSA (VSA)}, a multi-stage paradigm designed to synthesize organized frontend assets through visual-structural alignment. Our approach first employs a spatial-aware transformer to reconstruct the visual input into a hierarchical tree representation. Moving beyond basic layout extraction, we integrate an algorithmic pattern-matching layer to identify recurring UI motifs and encapsulate them into modular templates. These templates are then processed via a schema-driven synthesis engine, ensuring the Large Language Model generates type-safe, prop-drilled components suitable for production environments. Experimental results indicate that our framework yields a substantial improvement in code modularity and architectural consistency over state-of-the-art benchmarks, effectively bridging the gap between raw pixels and scalable software engineering.
๐
2025-12-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Kian Godhwani, David Benrimoh
Natural Language Processing
ๆฅ็ๆ่ฆ
Introduction: Recent work suggests large language models (LLMs) can accelerate screening, but prior evaluations focus on earlier LLMs, standardized Cochrane reviews, single-model setups, and accuracy as the primary metric, leaving generalizability, configuration effects, and calibration largely unexamined.
Methods: We developed OLIVER (Optimized LLM-based Inclusion and Vetting Engine for Reviews), an open-source pipeline for LLM-assisted abstract screening. We evaluated multiple contemporary LLMs across two non-Cochrane systematic reviews and performance was assessed at both the full-text screening and final inclusion stages using accuracy, AUC, and calibration metrics. We further tested an actor-critic screening framework combining two lightweight models under three aggregation rules.
Results: Across individual models, performance varied widely. In the smaller Review 1 (821 abstracts, 63 final includes), several models achieved high sensitivity for final includes but at the cost of substantial false positives and poor calibration. In the larger Review 2 (7741 abstracts, 71 final includes), most models were highly specific but struggled to recover true includes, with prompt design influencing recall. Calibration was consistently weak across single-model configurations despite high overall accuracy. Actor-critic screening improved discrimination and markedly reduced calibration error in both reviews, yielding higher AUCs.
Discussion: LLMs may eventually accelerate abstract screening, but single-model performance is highly sensitive to review characteristics, prompting, and calibration is limited. An actor-critic framework improves classification quality and confidence reliability while remaining computationally efficient, enabling large-scale screening at low cost.
๐
2025-12-23
๐ 12 pages, 6 figures. This paper has been accepted
๐ฅ Ziyuan Guo, Jie Guo, Zhenghao Chen, Bin Song, Fei Richard Yu
Multimodal
ๆฅ็ๆ่ฆ
Multimodal recommender systems (MRSs) are critical for various online platforms, offering users more accurate personalized recommendations by incorporating multimodal information of items. Structure-based MRSs have achieved state-of-the-art performance by constructing semantic item graphs, which explicitly model relationships between items based on modality feature similarity. However, such semantic item graphs are often noisy due to 1) inherent noise in multimodal information and 2) misalignment between item semantics and user-item co-occurrence relationships, which introduces false links and leads to suboptimal recommendations. To address this challenge, we propose Item Graph Diffusion for Multimodal Recommendation (IGDMRec), a novel method that leverages a diffusion model with classifier-free guidance to denoise the semantic item graph by integrating user behavioral information. Specifically, IGDMRec introduces a Behavior-conditioned Graph Diffusion (BGD) module, incorporating interaction data as conditioning information to guide the denoising of the semantic item graph. Additionally, a Conditional Denoising Network (CD-Net) is designed to implement the denoising process with manageable complexity. Finally, we propose a contrastive representation augmentation scheme that leverages both the denoised item graph and the original item graph to enhance item representations. \LL{Extensive experiments on four real-world datasets demonstrate the superiority of IGDMRec over competitive baselines, with robustness analysis validating its denoising capability and ablation studies verifying the effectiveness of its key components.
๐
2025-12-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Sneha Oommen, Gabby Sanchez, Cassandra T. Britto, Di Wang, Jordan Chiou et al.
ๆฅ็ๆ่ฆ
This paper presents an evaluation of the AWS Textract in the context of extracting data from receipts. We analyse Textract functionalities using a dataset that includes receipts of varied formats and conditions. Our analysis provided a qualitative view of Textract strengths and limitations. While the receipts totals were consistently detected, we also observed typical issues and irregularities that were often influenced by image quality and layout. Based on the analysis of the observations, we propose mitigation strategies.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Changyi Lin, Boda Huo, Mingyang Yu, Emily Ruppel, Bingqing Chen et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Contact often occurs without macroscopic surface deformation, such as during interaction with liquids, semi-liquids, or ultra-soft materials. Most existing tactile sensors rely on deformation to infer contact, making such light-contact interactions difficult to perceive robustly. To address this, we present LightTact, a visual-tactile fingertip sensor that makes contact directly visible via a deformation-independent, optics-based principle. LightTact uses an ambient-blocking optical configuration that suppresses both external light and internal illumination at non-contact regions, while transmitting only the diffuse light generated at true contacts. As a result, LightTact produces high-contrast raw images in which non-contact pixels remain near-black (mean gray value < 3) and contact pixels preserve the natural appearance of the contacting surface. Built on this, LightTact achieves accurate pixel-level contact segmentation that is robust to material properties, contact force, surface appearance, and environmental lighting. We further integrate LightTact on a robotic arm and demonstrate manipulation behaviors driven by extremely light contact, including water spreading, facial-cream dipping, and thin-film interaction. Finally, we show that LightTact's spatially aligned visual-tactile images can be directly interpreted by existing vision-language models, enabling resistor value reasoning for robotic sorting.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl et al.
Robotics
ๆฅ็ๆ่ฆ
Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles' actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performances on Longest6~v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at https://github.com/autonomousvision/lead.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Maulana Bisyir Azhari, Donghun Han, Je In You, Sungjun Park, David Hyunchul Shim
Robotics
ๆฅ็ๆ่ฆ
The Abu Dhabi Autonomous Racing League(A2RL) x Drone Champions League competition(DCL) requires teams to perform high-speed autonomous drone racing using only a single camera and a low-quality inertial measurement unit -- a minimal sensor set that mirrors expert human drone racing pilots. This sensor limitation makes the system susceptible to drift from Visual-Inertial Odometry (VIO), particularly during long and fast flights with aggressive maneuvers. This paper presents the system developed for the championship, which achieved a competitive performance. Our approach corrected VIO drift by fusing its output with global position measurements derived from a YOLO-based gate detector using a Kalman filter. A perception-aware planner generated trajectories that balance speed with the need to keep gates visible for the perception system. The system demonstrated high performance, securing podium finishes across multiple categories: third place in the AI Grand Challenge with top speed of 43.2 km/h, second place in the AI Drag Race with over 59 km/h, and second place in the AI Multi-Drone Race. We detail the complete architecture and present a performance analysis based on experimental data from the competition, contributing our insights on building a successful system for monocular vision-based autonomous drone flight.
๐
2025-12-23
๐ ArXiv Preprint (math.OC)
๐ฅ Georg Schildbach
Robotics
ๆฅ็ๆ่ฆ
Cooperative collision avoidance between robots in swarm operations remains an open challenge. Assuming a decentralized architecture, each robot is responsible for making its own control decisions, including motion planning. To this end, most existing approaches mostly rely some form of (wireless) communication between the agents of the swarm. In reality, however, communication is brittle. It may be affected by latency, further delays and packet losses, transmission faults, and is subject to adversarial attacks, such as jamming or spoofing. This paper proposes Contingency Model-based Control (CMC) as a communicationless alternative. It follows the implicit cooperation paradigm, under which the design of the robots is based on consensual (offline) rules, similar to traffic rules. They include the definition of a contingency trajectory for each robot, and a method for construction of mutual collision avoidance constraints. The setup is shown to guarantee the recursive feasibility and collision avoidance between all swarm members in closed-loop operation. Moreover, CMC naturally satisfies the Plug \& Play paradigm, i.e., for new robots entering the swarm. Two numerical examples demonstrate that the collision avoidance guarantee is intact and that the robot swarm operates smoothly under the CMC regime.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Hao Wei, Peiji Wang, Qianhao Wang, Tong Qin, Fei Gao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Underwater environments impose severe challenges to visual-inertial odometry systems, as strong light attenuation, marine snow and turbidity, together with weakly exciting motions, degrade inertial observability and cause frequent tracking failures over long-term operation. While tightly coupled acoustic-visual-inertial fusion, typically implemented through an acoustic Doppler Velocity Log (DVL) integrated with visual-inertial measurements, can provide accurate state estimation, the associated graph-based optimization is often computationally prohibitive for real-time deployment on resource-constrained platforms. Here we present FAR-AVIO, a Schur-Complement based, tightly coupled acoustic-visual-inertial odometry framework tailored for underwater robots. FAR-AVIO embeds a Schur complement formulation into an Extended Kalman Filter(EKF), enabling joint pose-landmark optimization for accuracy while maintaining constant-time updates by efficiently marginalizing landmark states. On top of this backbone, we introduce Adaptive Weight Adjustment and Reliability Evaluation(AWARE), an online sensor health module that continuously assesses the reliability of visual, inertial and DVL measurements and adaptively regulates their sigma weights, and we develop an efficient online calibration scheme that jointly estimates DVL-IMU extrinsics, without dedicated calibration manoeuvres. Numerical simulations and real-world underwater experiments consistently show that FAR-AVIO outperforms state-of-the-art underwater SLAM baselines in both localization accuracy and computational efficiency, enabling robust operation on low-power embedded platforms. Our implementation has been released as open source software at https://far-vido.gitbook.io/far-vido-docs.
๐
2025-12-23
๐ ArXiv Preprint (physics.ins-det)
๐ฅ Chanchan Xu, Shuai Dong, Xiaojie Wang
Robotics
ๆฅ็ๆ่ฆ
Twisted String Actuators (TSAs) are widely used in robotics but suffer from a limited range of Transmission Ratio (TR) variation, restricting their efficiency under varying loads.To overcome this, we propose a novel lightweight, simple-structured Continuously Variable Transmission (CVT) mechanism for TSA utilizing Shape Memory Alloy (SMA) superelasticity. The CVT mechanism consists solely of a pair of highly lightweight superelastic SMA rods connecting the ends of twisted strings. These rods deform under external loads, adjusting the inter-string distance to enable continuous TR variation.We develop a comprehensive theoretical model that integrates three critical nonlinearities
๐
2025-12-23
๐ IROS 2024
๐ฅ Katsu Uchiyama, Ryuma Niiyama
ๆฅ็ๆ่ฆ
Exploration of various applications is the frontier of research on inflatable robots. We proposed an articulated robots consisting of multiple pneumatic bladder links connected by rolling contact joints called Hillberry joints. The bladder link is made of a double-layered structure of tarpaulin sheet and polyurethane sheet, which is both airtight and flexible in shape. The integration of the Hilberry joint into an inflatable robot is also a new approach. The rolling contact joint allows wide range of motion of $\pm 150 ^{\circ}$, the largest among the conventional inflatable joints. Using the proposed mechanism for inflatable robots, we demonstrated moving a 500 g payload with a 3-DoF arm and lifting 3.4 kg and 5 kg payloads with 2-DoF and 1-DoF arms, respectively. We also experimented with a single 3-DoF inflatable leg attached to a dolly to show that the proposed structure worked for legged locomotion.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang
Robotics
ๆฅ็ๆ่ฆ
Visual-language reasoning, driving knowledge, and value alignment are essential for advanced autonomous driving systems. However, existing approaches largely rely on data-driven learning, making it difficult to capture the complex logic underlying decision-making through imitation or limited reinforcement rewards. To address this, we propose KnowVal, a new autonomous driving system that enables visual-language reasoning through the synergistic integration of open-world perception and knowledge retrieval. Specifically, we construct a comprehensive driving knowledge graph that encodes traffic laws, defensive driving principles, and ethical norms, complemented by an efficient LLM-based retrieval mechanism tailored for driving scenarios. Furthermore, we develop a human-preference dataset and train a Value Model to guide interpretable, value-aligned trajectory assessment. Experimental results show that our method substantially improves planning performance while remaining compatible with existing architectures. Notably, KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuntao Dai, Hang Gu, Teng Wang, Qianyu Cheng, Yifei Zheng et al.
RoboticsMultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hin dered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typi cally operate at only 3-5 Hz on edge devices due to the memory bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge plat forms. At the core of ActionFlow is a Cross-Request Pipelin ing strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dy namic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
๐
2025-12-23
๐ ArXiv Preprint (eess.SY)
๐ฅ Imtiaz Ur Rehman, Moussa Labbadi, Amine Abadi, Lew Lew Yan Voon
Robotics
ๆฅ็ๆ่ฆ
A robust tracking control strategy is designed to empower wheeled mobile robots (WMRs) to track predetermined routes while operating in diverse fields and encountering disturbances like strong winds or uneven path conditions, which affect tracking performance. Ensuring the applicability of this tracking method in real-world scenarios is essential. To accomplish this, the WMR model is initially transformed into a linear canonical form by leveraging the differential flatness of its kinematic model, facilitating controller design. Subsequently, a novel integral nonlinear hyperplane-based sliding mode control (INH-SMC) technique is proposed for WMR under disturbances. The stability of the technique is analyzed and verified. Finally, its practical viability is demonstrated through a comparative real-world indoor experiment on a TurtleBot3 WMR subjected to disturbances, confirming the feasibility and efficacy of the proposed approach.
๐
2025-12-23
๐ 8 pages, 9 figures, IEEE ITSC 2025
๐ฅ Qijun Qin, Ziqi Zhang, Yihan Zhong, Feng Huang, Xikun Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Due to the limitations of a single autonomous vehicle, Cellular Vehicle-to-Everything (C-V2X) technology opens a new window for achieving fully autonomous driving through sensor information sharing. However, real-world datasets supporting vehicle-infrastructure cooperative navigation in complex urban environments remain rare. To address this gap, we present UrbanV2X, a comprehensive multisensory dataset collected from vehicles and roadside infrastructure in the Hong Kong C-V2X testbed, designed to support research on smart mobility applications in dense urban areas. Our onboard platform provides synchronized data from multiple industrial cameras, LiDARs, 4D radar, ultra-wideband (UWB), IMU, and high-precision GNSS-RTK/INS navigation systems. Meanwhile, our roadside infrastructure provides LiDAR, GNSS, and UWB measurements. The entire vehicle-infrastructure platform is synchronized using the Precision Time Protocol (PTP), with sensor calibration data provided. We also benchmark various navigation algorithms to evaluate the collected cooperative data. The dataset is publicly available at https://polyu-taslab.github.io/UrbanV2X/.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Teqiang Zou, Hongliang Zeng, Yuxuan Nong, Yifan Li, Kehui Liu et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals, yet both typically run at a single unified frequency. As a result, policy performance is constrained by the low inference speed of large VLMs. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS), organizing the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The system is characterized by two key features. First, a latent representation buffer bridges the slow and fast systems. It stores instruction semantics and action-reasoning representation aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still jointly trained end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while achieving 30 Hz whole-body action-chunk generation, approximately three times as fast as prior VLA models with comparable model sizes. Real-world whole-body manipulation experiments demonstrate improved task success rates and significantly enhanced responsiveness compared to synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, including training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaofan Wang, Xingyu Gao, Jianlong Fu, Zuolei Li, Dean Fortier et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The capability of performing long-horizon, language-guided robotic manipulation tasks critically relies on leveraging historical information and generating coherent action sequences. However, such capabilities are often overlooked by existing Vision-Language-Action (VLA) models. To solve this challenge, we propose LoLA (Long Horizon Latent Action Learning), a framework designed for robot manipulation that integrates long-term multi-view observations and robot proprioception to enable multi-step reasoning and action generation. We first employ Vision-Language Models to encode rich contextual features from historical sequences and multi-view observations. We further introduces a key module, State-Aware Latent Re-representation, which transforms visual inputs and language commands into actionable robot motion space. Unlike existing VLA approaches that merely concatenate robot proprioception (e.g., joint angles) with VL embeddings, this module leverages such robot states to explicitly ground VL representations in physical scale through a learnable "embodiment-anchored" latent space. We trained LoLA on diverse robotic pre-training datasets and conducted extensive evaluations on simulation benchmarks (SIMPLER and LIBERO), as well as two real-world tasks on Franka and Bi-Manual Aloha robots. Results show that LoLA significantly outperforms prior state-of-the-art methods (e.g., pi0), particularly in long-horizon manipulation tasks.
๐
2025-12-23
๐ ArXiv Preprint (Computer Vision)
๐ฅ Robert van de Ven, Trim Bresilla, Bram Nelissen, Ard Nieuwenhuizen, Eldert J. van Henten et al.
Computer Vision
ๆฅ็ๆ่ฆ
Automating tasks in orchards is challenging because of the large amount of variation in the environment and occlusions. One of the challenges is apple pose estimation, where key points, such as the calyx, are often occluded. Recently developed pose estimation methods no longer rely on these key points, but still require them for annotations, making annotating challenging and time-consuming. Due to the abovementioned occlusions, there can be conflicting and missing annotations of the same fruit between different images. Novel 3D reconstruction methods can be used to simplify annotating and enlarge datasets. We propose a novel pipeline consisting of 3D Gaussian Splatting to reconstruct an orchard scene, simplified annotations, automated projection of the annotations to images, and the training and evaluation of a pose estimation method. Using our pipeline, 105 manual annotations were required to obtain 28,191 training labels, a reduction of 99.6%. Experimental results indicated that training with labels of fruits that are $\leq95\%$ occluded resulted in the best performance, with a neutral F1 score of 0.927 on the original images and 0.970 on the rendered images. Adjusting the size of the training dataset had small effects on the model performance in terms of F1 score and pose estimation accuracy. It was found that the least occluded fruits had the best position estimation, which worsened as the fruits became more occluded. It was also found that the tested pose estimation method was unable to correctly learn the orientation estimation of apples.
๐
2025-12-23
๐ ArXiv Preprint (cs.SE)
๐ฅ Wenzhao Wu, Yahui Tang, Mingfei Cheng, Wenbing Tang, Yuan Zhou et al.
ๆฅ็ๆ่ฆ
As embodied agents advance toward real-world deployment, ensuring optimal decisions becomes critical for resource-constrained applications. Current evaluation methods focus primarily on functional correctness, overlooking the non-functional optimality of generated plans. This gap can lead to significant performance degradation and resource waste. We identify and formalize the problem of Non-optimal Decisions (NoDs), where agents complete tasks successfully but inefficiently. We present NoD-DGMT, a systematic framework for detecting NoDs in embodied agent task planning via diversity-guided metamorphic testing. Our key insight is that optimal planners should exhibit invariant behavioral properties under specific transformations. We design four novel metamorphic relations capturing fundamental optimality properties: position detour suboptimality, action optimality completeness, condition refinement monotonicity, and scene perturbation invariance. To maximize detection efficiency, we introduce a diversity-guided selection strategy that actively selects test cases exploring different violation categories, avoiding redundant evaluations while ensuring comprehensive diversity coverage. Extensive experiments on the AI2-THOR simulator with four state-of-the-art planning models demonstrate that NoD-DGMT achieves violation detection rates of 31.9% on average, with our diversity-guided filter improving rates by 4.3% and diversity scores by 3.3 on average. NoD-DGMT significantly outperforms six baseline methods, with 16.8% relative improvement over the best baseline, and demonstrates consistent superiority across different model architectures and task complexities.
๐
2025-12-23
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hung-Chieh Fang, Kuo-Han Hung, Chu-Rong Chen, Po-Jung Chou, Chun-Kai Yang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Learning from videos offers a promising path toward generalist robots by providing rich visual and temporal priors beyond what real robot datasets contain. While existing video generative models produce impressive visual predictions, they are difficult to translate into low-level actions. Conversely, latent-action models better align videos with actions, but they typically operate at the single-step level and lack high-level planning capabilities. We bridge this gap by introducing Skill Abstraction from Optical Flow (SOF), a framework that learns latent skills from large collections of action-free videos. Our key idea is to learn a latent skill space through an intermediate representation based on optical flow that captures motion information aligned with both video dynamics and robot actions. By learning skills in this flow-based latent space, SOF enables high-level planning over video-derived skills and allows for easier translation of these skills into actions. Experiments show that our approach consistently improves performance in both multitask and long-horizon settings, demonstrating the ability to acquire and compose skills directly from raw visual data.
๐
2025-12-23
๐ ArXiv Preprint (Robotics)
๐ฅ Sangoh Lee, Sangwoo Mo, Wook-Shin Han
RoboticsMultimodal
ๆฅ็ๆ่ฆ
While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup", where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Samruddhi Baviskar
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Machine learning models used in financial decision systems operate in nonstationary economic environments, yet adversarial robustness is typically evaluated under static assumptions. This work introduces Conditional Adversarial Fragility, a regime dependent phenomenon in which adversarial vulnerability is systematically amplified during periods of macroeconomic stress. We propose a regime aware evaluation framework for time indexed tabular financial classification tasks that conditions robustness assessment on external indicators of economic stress. Using volatility based regime segmentation as a proxy for macroeconomic conditions, we evaluate model behavior across calm and stress periods while holding model architecture, attack methodology, and evaluation protocols constant. Baseline predictive performance remains comparable across regimes, indicating that economic stress alone does not induce inherent performance degradation. Under adversarial perturbations, however, models operating during stress regimes exhibit substantially greater degradation across predictive accuracy, operational decision thresholds, and risk sensitive outcomes. We further demonstrate that this amplification propagates to increased false negative rates, elevating the risk of missed high risk cases during adverse conditions. To complement numerical robustness metrics, we introduce an interpretive governance layer based on semantic auditing of model explanations using large language models. Together, these results demonstrate that adversarial robustness in financial machine learning is a regime dependent property and motivate stress aware approaches to model risk assessment in high stakes financial deployments.
๐
2025-12-22
๐ AAAI 2024
๐ฅ Wentao Wu, Xiao Wang, Chenglong Li, Jin Tang, Bin Luo
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model's capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of structured priors into token reconstruction including symmetry, contour and semantics of vehicles respectively. SMM utilizes the vehicle symmetry constraints to avoid retaining symmetric patches and can thus select high-quality masked image patches and reduce information redundancy. CRM minimizes the probability distribution divergence between contour features and reconstructed features and can thus preserve holistic vehicle structure information during pixel-level reconstruction. SRM aligns image-text features through contrastive learning and cross-modal distillation to address the feature confusion caused by insufficient semantic understanding during masked reconstruction. To support the pre-training of VehicleMAE-V2, we construct Autobot4M, a large-scale dataset comprising approximately 4 million vehicle images and 12,693 text descriptions. Extensive experiments on five downstream tasks demonstrate the superior performance of VehicleMAE-V2.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ S. Mazdak Abulnaga, Andrew Hoopes, Malte Hoffmann, Robin Magnet, Maks Ovsjanikov et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate registration of brain MRI scans is fundamental for cross-subject analysis in neuroscientific studies. This involves aligning both the cortical surface of the brain and the interior volume. Traditional methods treat volumetric and surface-based registration separately, which often leads to inconsistencies that limit downstream analyses. We propose a deep learning framework, NeurAlign, that registers $3$D brain MRI images by jointly aligning both cortical and subcortical regions through a unified volume-and-surface-based representation. Our approach leverages an intermediate spherical coordinate space to bridge anatomical surface topology with volumetric anatomy, enabling consistent and anatomically accurate alignment. By integrating spherical registration into the learning, our method ensures geometric coherence between volume and surface domains. In a series of experiments on both in-domain and out-of-domain datasets, our method consistently outperforms both classical and machine learning-based registration methods -- improving the Dice score by up to 7 points while maintaining regular deformation fields. Additionally, it is orders of magnitude faster than the standard method for this task, and is simpler to use because it requires no additional inputs beyond an MRI scan. With its superior accuracy, fast inference, and ease of use, NeurAlign sets a new standard for joint cortical and subcortical registration.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiayun Wu, Jiashuo Liu, Zhiyuan Zeng, Tianyang Zhan, Wenhao Huang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
LLM deployment in critical domains is currently impeded by persistent hallucinations--generating plausible but factually incorrect assertions. While scaling laws drove significant improvements in general capabilities, theoretical frameworks suggest hallucination is not merely stochastic error but a predictable statistical consequence of training objectives prioritizing mimicking data distribution over epistemic honesty. Standard RLVR paradigms, utilizing binary reward signals, inadvertently incentivize models as good test-takers rather than honest communicators, encouraging guessing whenever correctness probability exceeds zero. This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when not confident, aligning model behavior with accuracy. Synthesizing recent advances, we propose and evaluate training interventions optimizing strictly proper scoring rules for models to output a calibrated probability of correctness. Our methods enable models to either abstain from producing a complete response or flag individual claims where uncertainty remains. Utilizing Qwen3-4B-Instruct, empirical analysis reveals behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification--a transferable meta-skill decouplable from raw predictive accuracy. Trained on math reasoning tasks, our model's log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5's (0.207) in a challenging in-domain evaluation (BeyondAIME). Moreover, in cross-domain factual QA (SimpleQA), our 4B LLM achieves zero-shot calibration error on par with frontier models including Grok-4 and Gemini-2.5-Pro, even though its factual accuracy is much lower.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Sujan Warnakulasooriya, Andreas Willig, Xiaobing Wu
ๆฅ็ๆ่ฆ
Drone applications continue to expand across various domains, with flocking offering enhanced cooperative capabilities but introducing significant challenges during initial formation. Existing flocking algorithms often struggle with efficiency and scalability, particularly when potential collisions force drones into suboptimal trajectories. This paper presents a time-efficient prioritised scheduling algorithm that improves the initial formation process of drone flocks. The method assigns each drone a priority based on its number of potential collisions and its likelihood of reaching its target position without permanently obstructing other drones. Using this hierarchy, each drone computes an appropriate delay to ensure a collision-free path. Simulation results show that the proposed algorithm successfully generates collision-free trajectories for flocks of up to 5000 drones and outperforms the coupling-degree-based heuristic prioritised planning method (CDH-PP) in both performance and computational efficiency.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Maxime Lacour, Pu Ren, Rie Nakata, Nori Nakata, Michael Mahoney
Machine Learning
ๆฅ็ๆ่ฆ
Recent developments in non-ergodic ground-motion models (GMMs) explicitly model systematic spatial variations in source, site, and path effects, reducing standard deviation to 30-40% of ergodic models and enabling more accurate site-specific seismic hazard analysis. Current non-ergodic GMMs rely on Gaussian Process (GP) methods with prescribed correlation functions and thus have computational limitations for large-scale predictions. This study proposes a deep-learning approach called Conditional Generative Modeling for Fourier Amplitude Spectra (CGM-FAS) as an alternative to GP-based methods for modeling non-ergodic path effects in Fourier Amplitude Spectra (FAS). CGM-FAS uses a Conditional Variational Autoencoder architecture to learn spatial patterns and interfrequency correlation directly from data by using geographical coordinates of earthquakes and stations as conditional variables. Using San Francisco Bay Area earthquake data, we compare CGM-FAS against a recent GP-based GMM for the region and demonstrate consistent predictions of non-ergodic path effects. Additionally, CGM-FAS offers advantages compared to GP-based approaches in learning spatial patterns without prescribed correlation functions, capturing interfrequency correlations, and enabling rapid predictions, generating maps for 10,000 sites across 1,000 frequencies within 10 seconds using a few GB of memory. CGM-FAS hyperparameters can be tuned to ensure generated path effects exhibit variability consistent with the GP-based empirical GMM. This work demonstrates a promising direction for efficient non-ergodic ground-motion prediction across multiple frequencies and large spatial domains.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Indranil Halder, Cengiz Pehlevan
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent developments in large language models have shown advantages in reallocating a notable share of computational resource from training time to inference time. However, the principles behind inference time scaling are not well understood. In this paper, we introduce an analytically tractable model of inference-time scaling: Bayesian linear regression with a reward-weighted sampler, where the reward is determined from a linear model, modeling LLM-as-a-judge scenario. We study this problem in the high-dimensional regime, where the deterministic equivalents dictate a closed-form expression for the posterior predictive mean and variance. We analyze the generalization error when training data are sampled from a teacher model. We draw $k$ inference-time samples and select via softmax at a temperature applied to a quadratic reward. When the reward is not too different from the teacher, the generalization error decreases monotonically with increasing inference time samples $k$. However, the specific reward that optimizes inference-time selection generally differs from the teacher. In contrast, substantial reward misspecification induces a finite optimal $k$ beyond which more sampling can increase the generalization error. For fixed $k$, there exists an optimal sampling temperature. We experimentally verify these facts in large language model inference with an additional large language model as a judge. In the "best-of-$k$" limit with the teacher as reward, we theoretically show that the generalization error decays as $ฮ(1/k^2)$ and determine the leading coefficient via extreme value theory. These formulas delineate domains where scaling inference-time computation is provably preferable to collecting more data. Finally, we demonstrate that when task difficulty increases, the previously mentioned advantage of inference-time compute degrades.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mahdi Mostajabdaveh, F. Sibel Salman, Walter J. Gutjahr
Machine Learning
ๆฅ็ๆ่ฆ
The distribution of relief supplies to shelters is a critical aspect of post-disaster humanitarian logistics. In major disasters, prepositioned supplies often fall short of meeting all demands. We address the problem of planning vehicle routes from a distribution center to shelters while allocating limited relief supplies. To balance efficiency and equity, we formulate a bi-objective problem: minimizing a Gini-index-based measure of inequity in unsatisfied demand for fair distribution and minimizing total travel time for timely delivery. We propose a Mixed Integer Programming (MIP) model and use the $ฮต$-constraint method to handle the bi-objective nature. By deriving mathematical properties of the optimal solution, we introduce valid inequalities and design an algorithm for optimal delivery allocations given feasible vehicle routes. A branch-and-price (B&P) algorithm is developed to solve the problem efficiently. Computational tests on realistic datasets from a past earthquake in Van, Turkey, and predicted data for Istanbul's Kartal region show that the B&P algorithm significantly outperforms commercial MIP solvers. Our bi-objective approach reduces aid distribution inequity by 34% without compromising efficiency. Results indicate that when time constraints are very loose or tight, lexicographic optimization prioritizing demand coverage over fairness is effective. For moderately restrictive time constraints, a balanced approach is essential to avoid inequitable outcomes.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jorg Bornschein, Clare Lyle, Yazhe Li, Amal Rannen-Triki, Xu Owen He et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
When adapting large language models (LLMs) to a specific downstream task, two primary approaches are commonly employed: (1) prompt engineering, often with in-context few-shot learning, leveraging the model's inherent generalization abilities, and (2) fine-tuning on task-specific data, directly optimizing the model's parameters. While prompt-based methods excel in few-shot scenarios, their effectiveness often plateaus as more data becomes available. Conversely, fine-tuning scales well with data but may underperform when training examples are scarce. We investigate a unified approach that bridges these two paradigms by incorporating in-context learning directly into the fine-tuning process. Specifically, we fine-tune the model on task-specific data augmented with in-context examples, mimicking the structure of k-shot prompts. This approach, while requiring per-task fine-tuning, combines the sample efficiency of in-context learning with the performance gains of fine-tuning, leading to a method that consistently matches and often significantly exceeds both these baselines. To perform hyperparameter selection in the low-data regime, we propose to use prequential evaluation, which eliminates the need for expensive cross-validation and leverages all available data for training while simultaneously providing a robust validation signal. We conduct an extensive empirical study to determine which adaptation paradigm - fine-tuning, in-context learning, or our proposed unified approach offers the best predictive performance on a concrete data downstream-tasks.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shashi Kant Gupta, Arijeet Pramanik, Jerrin John Thomas, Regina Schwind, Lauren Wiener et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Unstructured notes within the electronic health record (EHR) contain rich clinical information vital for cancer treatment decision making and research, yet reliably extracting structured oncology data remains challenging due to extensive variability, specialized terminology, and inconsistent document formats. Manual abstraction, although accurate, is prohibitively costly and unscalable. Existing automated approaches typically address narrow scenarios - either using synthetic datasets, restricting focus to document-level extraction, or isolating specific clinical variables (e.g., staging, biomarkers, histology) - and do not adequately handle patient-level synthesis across the large number of clinical documents containing contradictory information. In this study, we propose an agentic framework that systematically decomposes complex oncology data extraction into modular, adaptive tasks. Specifically, we use large language models (LLMs) as reasoning agents, equipped with context-sensitive retrieval and iterative synthesis capabilities, to exhaustively and comprehensively extract structured clinical variables from real-world oncology notes. Evaluated on a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports spanning 2,250 cancer patients, our method achieves an average F1-score of 0.93, with 100 out of 103 oncology-specific clinical variables exceeding 0.85, and critical variables (e.g., biomarkers and medications) surpassing 0.95. Moreover, integration of the agentic system into a data curation workflow resulted in 0.94 direct manual approval rate, significantly reducing annotation costs. To our knowledge, this constitutes the first exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale
๐
2025-12-22
๐ ArXiv Preprint (cs.DC)
๐ฅ Ziming Mao, Yihan Zhang, Chihan Cui, Kaichao You, Zhongjie Chen et al.
Robotics
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) workloads rely on expert parallelism (EP) to achieve high GPU efficiency. State-of-the-art EP communication systems such as DeepEP demonstrate strong performance but exhibit poor portability across heterogeneous GPU and NIC platforms. The poor portability is rooted in architecture: GPU-initiated token-level RDMA communication requires tight vertical integration between GPUs and NICs, e.g., GPU writes to NIC driver/MMIO interfaces.
We present UCCL-EP, a portable EP communication system that delivers DeepEP-level performance across heterogeneous GPU and NIC hardware. UCCL-EP replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel: compact token-routing commands are transferred to multithreaded CPU proxies, which then issue GPUDirect RDMA operations on behalf of GPUs. UCCL-EP further emulates various ordering semantics required by specialized EP communication modes using RDMA immediate data, enabling correctness on NICs that lack such ordering, e.g., AWS EFA. We implement UCCL-EP on NVIDIA and AMD GPUs with EFA and Broadcom NICs. On EFA, it outperforms the best existing EP solution by up to $2.1\times$ for dispatch and combine throughput. On NVIDIA-only platform, UCCL-EP achieves comparable performance to the original DeepEP. UCCL-EP also improves token throughput on SGLang by up to 40% on the NVIDIA+EFA platform, and improves DeepSeek-V3 training throughput over the AMD Primus/Megatron-LM framework by up to 45% on a 16-node AMD+Broadcom platform.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tingjia Miao, Jiawen Dai, Jingkun Liu, Jinxin Tan, Muhua Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
Advances in LLMs have produced agents with knowledge and operational capabilities comparable to human scientists, suggesting potential to assist, accelerate, and automate research. However, existing studies mainly evaluate such systems on well-defined benchmarks or general tasks like literature retrieval, limiting their end-to-end problem-solving ability in open scientific scenarios. This is particularly true in physics, which is abstract, mathematically intensive, and requires integrating analytical reasoning with code-based computation. To address this, we propose PhysMaster, an LLM-based agent functioning as an autonomous theoretical and computational physicist. PhysMaster couples absract reasoning with numerical computation and leverages LANDAU, the Layered Academic Data Universe, which preserves retrieved literature, curated prior knowledge, and validated methodological traces, enhancing decision reliability and stability. It also employs an adaptive exploration strategy balancing efficiency and open-ended exploration, enabling robust performance in ultra-long-horizon tasks. We evaluate PhysMaster on problems from high-energy theory, condensed matter theory to astrophysics, including: (i) acceleration, compressing labor-intensive research from months to hours; (ii) automation, autonomously executing hypothesis-driven loops ; and (iii) autonomous discovery, independently exploring open problems.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universitรคt Wรผrzburg in cooperation with the Technische Universitรคt Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
๐
2025-12-22
๐ AAAI 2024
๐ฅ Wentao Wu, Xiao Wang, Chenglong Li, Jin Tang, Bin Luo
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model's capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of structured priors into token reconstruction including symmetry, contour and semantics of vehicles respectively. SMM utilizes the vehicle symmetry constraints to avoid retaining symmetric patches and can thus select high-quality masked image patches and reduce information redundancy. CRM minimizes the probability distribution divergence between contour features and reconstructed features and can thus preserve holistic vehicle structure information during pixel-level reconstruction. SRM aligns image-text features through contrastive learning and cross-modal distillation to address the feature confusion caused by insufficient semantic understanding during masked reconstruction. To support the pre-training of VehicleMAE-V2, we construct Autobot4M, a large-scale dataset comprising approximately 4 million vehicle images and 12,693 text descriptions. Extensive experiments on five downstream tasks demonstrate the superior performance of VehicleMAE-V2.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ S. Mazdak Abulnaga, Andrew Hoopes, Malte Hoffmann, Robin Magnet, Maks Ovsjanikov et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate registration of brain MRI scans is fundamental for cross-subject analysis in neuroscientific studies. This involves aligning both the cortical surface of the brain and the interior volume. Traditional methods treat volumetric and surface-based registration separately, which often leads to inconsistencies that limit downstream analyses. We propose a deep learning framework, NeurAlign, that registers $3$D brain MRI images by jointly aligning both cortical and subcortical regions through a unified volume-and-surface-based representation. Our approach leverages an intermediate spherical coordinate space to bridge anatomical surface topology with volumetric anatomy, enabling consistent and anatomically accurate alignment. By integrating spherical registration into the learning, our method ensures geometric coherence between volume and surface domains. In a series of experiments on both in-domain and out-of-domain datasets, our method consistently outperforms both classical and machine learning-based registration methods -- improving the Dice score by up to 7 points while maintaining regular deformation fields. Additionally, it is orders of magnitude faster than the standard method for this task, and is simpler to use because it requires no additional inputs beyond an MRI scan. With its superior accuracy, fast inference, and ease of use, NeurAlign sets a new standard for joint cortical and subcortical registration.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Houston H. Zhang, Tao Zhang, Baoze Lin, Yuanqi Xue, Yincheng Zhu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and reusable visualization modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML/CSS). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jong Wook Kim, Wonseok Roh, Ha Dam Baek, Pilhyeon Lee, Jonghyun Choi et al.
ๆฅ็ๆ่ฆ
3D Panoptic Occupancy Prediction aims to reconstruct a dense volumetric scene map by predicting the semantic class and instance identity of every occupied region in 3D space. Achieving such fine-grained 3D understanding requires precise geometric reasoning and spatially consistent scene representation across complex environments. However, existing approaches often struggle to maintain precise geometry and capture the precise spatial range of 3D instances critical for robust panoptic separation. To overcome these limitations, we introduce HyGE-Occ, a novel framework that leverages a hybrid view-transformation branch with 3D Gaussian and edge priors to enhance both geometric consistency and boundary awareness in 3D panoptic occupancy prediction. HyGE-Occ employs a hybrid view-transformation branch that fuses a continuous Gaussian-based depth representation with a discretized depth-bin formulation, producing BEV features with improved geometric consistency and structural coherence. In parallel, we extract edge maps from BEV features and use them as auxiliary information to learn edge cues. In our extensive experiments on the Occ3D-nuScenes dataset, HyGE-Occ outperforms existing work, demonstrating superior 3D geometric reasoning.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ A. Shekhovtsov
Machine Learning
ๆฅ็ๆ่ฆ
We revisit the problem of assigning a score (a quality of fit) to candidate geometric models -- one of the key components of RANSAC for robust geometric fitting. In a non-robust setting, the ``gold standard'' scoring function, known as the geometric error, follows from a probabilistic model with Gaussian noises. We extend it to spherical noises. In a robust setting, we consider a mixture with uniformly distributed outliers and show that a threshold-based parameterization leads to a unified view of likelihood-based and robust M-estimators and associated local optimization schemes.
Next we analyze MAGSAC++ which stands out for two reasons. First, it achieves the best results according to existing benchmarks. Second, it makes quite different modeling assumptions and derivation steps. We discovered, however that the derivation does not correspond to sound principles and the resulting score function is in fact numerically equivalent to a simple Gaussian-uniform likelihood, a basic model within the proposed framework.
Finally, we propose an experimental methodology for evaluating scoring functions: assuming either a large validation set, or a small random validation set in expectation. We find that all scoring functions, including using a learned inlier distribution, perform identically. In particular, MAGSAC++ score is found to be neither better performing than simple contenders nor less sensitive to the choice of the threshold hyperparameter.
Our theoretical and experimental analysis thus comprehensively revisit the state-of-the-art, which is critical for any future research seeking to improve the methods or apply them to other robust fitting problems.
๐
2025-12-22
๐ WWW
๐ฅ SaiKiran Tedla, Zhoutong Zhang, Xuaner Zhang, Shumian Xin
ๆฅ็ๆ่ฆ
Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at www.learn2refocus.github.io
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ SaiKiran Tedla, Kelly Zhu, Trevor Canham, Felix Taubner, Michael S. Brown et al.
ๆฅ็ๆ่ฆ
We seek to answer the question: what can a motion-blurred image reveal about a scene's past, present, and future? Although motion blur obscures image details and degrades visual quality, it also encodes information about scene and camera motion during an exposure. Previous techniques leverage this information to estimate a sharp image from an input blurry one, or to predict a sequence of video frames showing what might have occurred at the moment of image capture. However, they rely on handcrafted priors or network architectures to resolve ambiguities in this inverse problem, and do not incorporate image and video priors on large-scale datasets. As such, existing methods struggle to reproduce complex scene dynamics and do not attempt to recover what occurred before or after an image was taken. Here, we introduce a new technique that repurposes a pre-trained video diffusion model trained on internet-scale datasets to recover videos revealing complex scene dynamics during the moment of capture and what might have occurred immediately into the past or future. Our approach is robust and versatile; it outperforms previous methods for this task, generalizes to challenging in-the-wild images, and supports downstream tasks such as recovering camera trajectories, object motion, and dynamic 3D scene structure. Code and data are available at https://blur2vid.github.io
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Jiaqi Peng, Wenzhe Cai, Yuqiang Yang, Tai Wang, Yuan Shen et al.
Robotics
ๆฅ็ๆ่ฆ
Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error propagation. We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while metric-aware geometry memory enhances planning consistency and obstacle avoidance, leading to more than a 27.3\% improvement over oracle-localization baselines and strong generalization across embodiments and environments. The code and models have been made publicly available on the https://steinate.github.io/logoplanner.github.io.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhixiang Lu, Xueyuan Deng, Yiran Liu, Yulong Li, Qiang Yan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Traditional agent-based models (ABMs) of opinion dynamics often fail to capture the psychological heterogeneity driving online polarization due to simplistic homogeneity assumptions. This limitation obscures the critical interplay between individual cognitive biases and information propagation, thereby hindering a mechanistic understanding of how ideological divides are amplified. To address this challenge, we introduce the Personality-Refracted Intelligent Simulation Model (PRISM), a hybrid framework coupling stochastic differential equations (SDE) for continuous emotional evolution with a personality-conditional partially observable Markov decision process (PC-POMDP) for discrete decision-making. In contrast to continuous trait approaches, PRISM assigns distinct Myers-Briggs Type Indicator (MBTI) based cognitive policies to multimodal large language model (MLLM) agents, initialized via data-driven priors from large-scale social media datasets. PRISM achieves superior personality consistency aligned with human ground truth, significantly outperforming standard homogeneous and Big Five benchmarks. This framework effectively replicates emergent phenomena such as rational suppression and affective resonance, offering a robust tool for analyzing complex social media ecosystems.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jingyi Qiu, Hong Chen, Zongyi Li
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The rise of AI has fueled growing concerns about ``hype'' in machine learning papers, yet a reliable way to quantify rhetorical style independently of substantive content has remained elusive. Because bold language can stem from either strong empirical results or mere rhetorical style, it is often difficult to distinguish between the two. To disentangle rhetorical style from substantive content, we introduce a counterfactual, LLM-based framework: multiple LLM rhetorical personas generate counterfactual writings from the same substantive content, an LLM judge compares them through pairwise evaluations, and the outcomes are aggregated using a Bradley--Terry model. Applying this method to 8,485 ICLR submissions sampled from 2017 to 2025, we generate more than 250,000 counterfactual writings and provide a large-scale quantification of rhetorical style in ML papers. We find that visionary framing significantly predicts downstream attention, including citations and media attention, even after controlling for peer-review evaluations. We also observe a sharp rise in rhetorical strength after 2023, and provide empirical evidence showing that this increase is largely driven by the adoption of LLM-based writing assistance. The reliability of our framework is validated by its robustness to the choice of personas and the high correlation between LLM judgments and human annotations. Our work demonstrates that LLMs can serve as instruments to measure and improve scientific evaluation.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kirk Vanacore, Rene F. Kizilcec
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly adopted in educational technologies for a variety of tasks, from generating instructional materials and assisting with assessment design to tutoring. While prior work has investigated how models can be adapted or optimized for specific tasks, far less is known about how well LLMs perform at interpreting authentic educational scenarios without significant customization. As LLM-based systems become widely adopted by learners and educators in everyday academic contexts, understanding their out-of-the-box capabilities is increasingly important for setting expectations and benchmarking. We compared six LLMs to estimate their baseline performance on a simple but important task: classifying instructional moves in authentic classroom transcripts. We evaluated typical prompting methods: zero-shot, one-shot, and few-shot prompting. We found that while zero-shot performance was moderate, providing comprehensive examples (few-shot prompting) significantly improved performance for state-of-the-art models, with the strongest configuration reaching Cohen's Kappa = 0.58 against expert-coded annotations. At the same time, improvements were neither uniform nor complete: performance varied considerably by instructional move, and higher recall frequently came at the cost of increased false positives. Overall, these findings indicate that foundation models demonstrate meaningful yet limited capacity to interpret instructional discourse, with prompt design helping to surface capability but not eliminating fundamental reliability constraints.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shashi Kant Gupta, Arijeet Pramanik, Jerrin John Thomas, Regina Schwind, Lauren Wiener et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Unstructured notes within the electronic health record (EHR) contain rich clinical information vital for cancer treatment decision making and research, yet reliably extracting structured oncology data remains challenging due to extensive variability, specialized terminology, and inconsistent document formats. Manual abstraction, although accurate, is prohibitively costly and unscalable. Existing automated approaches typically address narrow scenarios - either using synthetic datasets, restricting focus to document-level extraction, or isolating specific clinical variables (e.g., staging, biomarkers, histology) - and do not adequately handle patient-level synthesis across the large number of clinical documents containing contradictory information. In this study, we propose an agentic framework that systematically decomposes complex oncology data extraction into modular, adaptive tasks. Specifically, we use large language models (LLMs) as reasoning agents, equipped with context-sensitive retrieval and iterative synthesis capabilities, to exhaustively and comprehensively extract structured clinical variables from real-world oncology notes. Evaluated on a large-scale dataset of over 400,000 unstructured clinical notes and scanned PDF reports spanning 2,250 cancer patients, our method achieves an average F1-score of 0.93, with 100 out of 103 oncology-specific clinical variables exceeding 0.85, and critical variables (e.g., biomarkers and medications) surpassing 0.95. Moreover, integration of the agentic system into a data curation workflow resulted in 0.94 direct manual approval rate, significantly reducing annotation costs. To our knowledge, this constitutes the first exhaustive, end-to-end application of LLM-based agents for structured oncology data extraction at scale
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a dataevolving: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent's ``zone of proximal development''. This process is guided by a simple but effective $ฮฑ$-Curriculum Reward, which aligns task difficulty with the agent's current capabilities. We evaluate GenEnv on five benchmarks, including API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to \textbf{+40.3\%} over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3$\times$ less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Thittipat Pairatsuppawat, Abhibhu Tachaapornchai, Paweekorn Kusolsomboon, Chutikan Chaiwong, Thodsaporn Chay-intr et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Open-weights large language models remain difficult to deploy for Thai due to unstable generation under complex instructions, despite strong English performance. To mitigate these limitations, We present SiamGPT-32B, an open-weights model based on Qwen3-32B, fine-tuned with a Quality-First strategy emphasizing curated supervision over data scale. The fine-tuning pipeline combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework for instruction and linguistic constraints. Using supervised fine-tuning only, without continual pretraining or corpus expansion, SiamGPT-32B improves instruction adherence, multi-turn robustness, and linguistic stability. Evaluations on the SEA-HELM benchmark show that SiamGPT-32B achieves the strongest overall performance among similar-scale open-weights Thai models, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, natively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) -- a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shijian Ma, Yunqi Huang, Yan Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structurecapabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rulebased analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm all six dimensions capture independent quality aspects (mean | r | = 0.020). DramaBench provides actionable, dimensionspecific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.
๐
2025-12-22
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ Suguru Shiratori, Elham Kiyani, Khemraj Shukla, George Em Karniadakis
Machine Learning
ๆฅ็ๆ่ฆ
We develop a data-driven framework for discovering constitutive relations in models of fluid flow and scalar transport. Our approach infers unknown closure terms in the governing equations (gray-box discovery) under the assumption that the temporal derivative, convective transport, and pressure-gradient contributions are known. The formulation is rooted in a variational principle from nonequilibrium thermodynamics, where the dynamics is defined by a free-energy functional and a dissipation functional. The unknown constitutive terms arise as functional derivatives of these functionals with respect to the state variables. To enable a flexible and structured model discovery, the free-energy and dissipation functionals are parameterized using neural networks, while their functional derivatives are obtained via automatic differentiation. This construction enforces thermodynamic consistency by design, ensuring monotonic decay of the total free energy and non-negative entropy production. The resulting method, termed GIMLET (Generalizable and Interpretable Model Learning through Embedded Thermodynamics), avoids reliance on a predefined library of candidate functions, unlike sparse regression or symbolic identification approaches. The learned models are generalizable in that functionals identified from one dataset can be transferred to distinct datasets governed by the same underlying equations. Moreover, the inferred free-energy and dissipation functions provide direct physical interpretability of the learned dynamics. The framework is demonstrated on several benchmark systems, including the viscous Burgers equation, the Kuramoto--Sivashinsky equation, and the incompressible Navier--Stokes equations for both Newtonian and non-Newtonian fluids.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Samruddhi Baviskar
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Machine learning models used in financial decision systems operate in nonstationary economic environments, yet adversarial robustness is typically evaluated under static assumptions. This work introduces Conditional Adversarial Fragility, a regime dependent phenomenon in which adversarial vulnerability is systematically amplified during periods of macroeconomic stress. We propose a regime aware evaluation framework for time indexed tabular financial classification tasks that conditions robustness assessment on external indicators of economic stress. Using volatility based regime segmentation as a proxy for macroeconomic conditions, we evaluate model behavior across calm and stress periods while holding model architecture, attack methodology, and evaluation protocols constant. Baseline predictive performance remains comparable across regimes, indicating that economic stress alone does not induce inherent performance degradation. Under adversarial perturbations, however, models operating during stress regimes exhibit substantially greater degradation across predictive accuracy, operational decision thresholds, and risk sensitive outcomes. We further demonstrate that this amplification propagates to increased false negative rates, elevating the risk of missed high risk cases during adverse conditions. To complement numerical robustness metrics, we introduce an interpretive governance layer based on semantic auditing of model explanations using large language models. Together, these results demonstrate that adversarial robustness in financial machine learning is a regime dependent property and motivate stress aware approaches to model risk assessment in high stakes financial deployments.
๐
2025-12-22
๐ AAAI 2024
๐ฅ Wentao Wu, Xiao Wang, Chenglong Li, Jin Tang, Bin Luo
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model's capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of structured priors into token reconstruction including symmetry, contour and semantics of vehicles respectively. SMM utilizes the vehicle symmetry constraints to avoid retaining symmetric patches and can thus select high-quality masked image patches and reduce information redundancy. CRM minimizes the probability distribution divergence between contour features and reconstructed features and can thus preserve holistic vehicle structure information during pixel-level reconstruction. SRM aligns image-text features through contrastive learning and cross-modal distillation to address the feature confusion caused by insufficient semantic understanding during masked reconstruction. To support the pre-training of VehicleMAE-V2, we construct Autobot4M, a large-scale dataset comprising approximately 4 million vehicle images and 12,693 text descriptions. Extensive experiments on five downstream tasks demonstrate the superior performance of VehicleMAE-V2.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alexey Yermakov, Yue Zhao, Marine Denolle, Yiyu Ni, Philippe M. Wyder et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Seismology faces fundamental challenges in state forecasting and reconstruction (e.g., earthquake early warning and ground motion prediction) and managing the parametric variability of source locations, mechanisms, and Earth models (e.g., subsurface structure and topography effects). Addressing these with simulations is hindered by their massive scale, both in synthetic data volumes and numerical complexity, while real-data efforts are constrained by models that inadequately reflect the Earth's complexity and by sparse sensor measurements from the field. Recent machine learning (ML) efforts offer promise, but progress is obscured by a lack of proper characterization, fair reporting, and rigorous comparisons. To address this, we introduce a Common Task Framework (CTF) for ML for seismic wavefields, starting with three distinct wavefield datasets. Our CTF features a curated set of datasets at various scales (global, crustal, and local) and task-specific metrics spanning forecasting, reconstruction, and generalization under realistic constraints such as noise and limited data. Inspired by CTFs in fields like natural language processing, this framework provides a structured and rigorous foundation for head-to-head algorithm evaluation. We illustrate the evaluation procedure with scores reported for two of the datasets, showcasing the performance of various methods and foundation models for reconstructing seismic wavefields from both simulated and real-world sensor measurements. The CTF scores reveal the strengths, limitations, and suitability for specific problem classes. Our vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets, raising the bar for rigor and reproducibility in scientific ML.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiayun Wu, Jiashuo Liu, Zhiyuan Zeng, Tianyang Zhan, Wenhao Huang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
LLM deployment in critical domains is currently impeded by persistent hallucinations--generating plausible but factually incorrect assertions. While scaling laws drove significant improvements in general capabilities, theoretical frameworks suggest hallucination is not merely stochastic error but a predictable statistical consequence of training objectives prioritizing mimicking data distribution over epistemic honesty. Standard RLVR paradigms, utilizing binary reward signals, inadvertently incentivize models as good test-takers rather than honest communicators, encouraging guessing whenever correctness probability exceeds zero. This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when not confident, aligning model behavior with accuracy. Synthesizing recent advances, we propose and evaluate training interventions optimizing strictly proper scoring rules for models to output a calibrated probability of correctness. Our methods enable models to either abstain from producing a complete response or flag individual claims where uncertainty remains. Utilizing Qwen3-4B-Instruct, empirical analysis reveals behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification--a transferable meta-skill decouplable from raw predictive accuracy. Trained on math reasoning tasks, our model's log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5's (0.207) in a challenging in-domain evaluation (BeyondAIME). Moreover, in cross-domain factual QA (SimpleQA), our 4B LLM achieves zero-shot calibration error on par with frontier models including Grok-4 and Gemini-2.5-Pro, even though its factual accuracy is much lower.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Matthew Drnevich, Stephen Jiggins, Kyle Cranmer
ๆฅ็ๆ่ฆ
We consider a generalization of the classifier-based density-ratio estimation task to a quasiprobabilistic setting where probability densities can be negative. The problem with most loss functions used for this task is that they implicitly define a relationship between the optimal classifier and the target quasiprobabilistic density ratio which is discontinuous or not surjective. We address these problems by introducing a convex loss function that is well-suited for both probabilistic and quasiprobabilistic density ratio estimation. To quantify performance, an extended version of the Sliced-Wasserstein distance is introduced which is compatible with quasiprobability distributions. We demonstrate our approach on a real-world example from particle physics, of di-Higgs production in association with jets via gluon-gluon fusion, and achieve state-of-the-art results.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Maxime Lacour, Pu Ren, Rie Nakata, Nori Nakata, Michael Mahoney
Machine Learning
ๆฅ็ๆ่ฆ
Recent developments in non-ergodic ground-motion models (GMMs) explicitly model systematic spatial variations in source, site, and path effects, reducing standard deviation to 30-40% of ergodic models and enabling more accurate site-specific seismic hazard analysis. Current non-ergodic GMMs rely on Gaussian Process (GP) methods with prescribed correlation functions and thus have computational limitations for large-scale predictions. This study proposes a deep-learning approach called Conditional Generative Modeling for Fourier Amplitude Spectra (CGM-FAS) as an alternative to GP-based methods for modeling non-ergodic path effects in Fourier Amplitude Spectra (FAS). CGM-FAS uses a Conditional Variational Autoencoder architecture to learn spatial patterns and interfrequency correlation directly from data by using geographical coordinates of earthquakes and stations as conditional variables. Using San Francisco Bay Area earthquake data, we compare CGM-FAS against a recent GP-based GMM for the region and demonstrate consistent predictions of non-ergodic path effects. Additionally, CGM-FAS offers advantages compared to GP-based approaches in learning spatial patterns without prescribed correlation functions, capturing interfrequency correlations, and enabling rapid predictions, generating maps for 10,000 sites across 1,000 frequencies within 10 seconds using a few GB of memory. CGM-FAS hyperparameters can be tuned to ensure generated path effects exhibit variability consistent with the GP-based empirical GMM. This work demonstrates a promising direction for efficient non-ergodic ground-motion prediction across multiple frequencies and large spatial domains.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Indranil Halder, Cengiz Pehlevan
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent developments in large language models have shown advantages in reallocating a notable share of computational resource from training time to inference time. However, the principles behind inference time scaling are not well understood. In this paper, we introduce an analytically tractable model of inference-time scaling: Bayesian linear regression with a reward-weighted sampler, where the reward is determined from a linear model, modeling LLM-as-a-judge scenario. We study this problem in the high-dimensional regime, where the deterministic equivalents dictate a closed-form expression for the posterior predictive mean and variance. We analyze the generalization error when training data are sampled from a teacher model. We draw $k$ inference-time samples and select via softmax at a temperature applied to a quadratic reward. When the reward is not too different from the teacher, the generalization error decreases monotonically with increasing inference time samples $k$. However, the specific reward that optimizes inference-time selection generally differs from the teacher. In contrast, substantial reward misspecification induces a finite optimal $k$ beyond which more sampling can increase the generalization error. For fixed $k$, there exists an optimal sampling temperature. We experimentally verify these facts in large language model inference with an additional large language model as a judge. In the "best-of-$k$" limit with the teacher as reward, we theoretically show that the generalization error decays as $ฮ(1/k^2)$ and determine the leading coefficient via extreme value theory. These formulas delineate domains where scaling inference-time computation is provably preferable to collecting more data. Finally, we demonstrate that when task difficulty increases, the previously mentioned advantage of inference-time compute degrades.
๐
2025-12-22
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Paรบl Cumba-Armijos, Diego Riofrรญo-Luzcando, Verรณnica Rodrรญguez-Arboleda, Joe Carriรณn-Jumbo
Machine Learning
ๆฅ็ๆ่ฆ
Recent recollected data suggests that it is possible to automatically detect events that may negatively affect the most vulnerable parts of our society, by using any communication technology like social networks or messaging applications. This research consolidates and prepares a corpus with Spanish bullying expressions taken from Twitter in order to use them as an input to train a convolutional neuronal network through deep learning techniques. As a result of this training, a predictive model was created, which can identify Spanish cyberbullying expressions such as insults, racism, homophobic attacks, and so on.
๐
2025-12-22
๐ ArXiv Preprint (hep-lat)
๐ฅ Shreya Shukla, Yukari Yamauchi, Andrey Y. Lokhov, Scott Lawrence, Abhijith Jayakumar
ๆฅ็ๆ่ฆ
We introduce a learning method for recovering action parameters in lattice field theories. Our method is based on the minimization of a convex loss function constructed using the Schwinger-Dyson relations. We show that score matching, a popular learning method, is a special case of our construction of an infinite family of valid loss functions. Importantly, our general Schwinger-Dyson-based construction applies to gauge theories and models with Grassmann-valued fields used to represent dynamical fermions. In particular, we extend our method to realistic lattice field theories including quantum chromodynamics.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jorg Bornschein, Clare Lyle, Yazhe Li, Amal Rannen-Triki, Xu Owen He et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
When adapting large language models (LLMs) to a specific downstream task, two primary approaches are commonly employed: (1) prompt engineering, often with in-context few-shot learning, leveraging the model's inherent generalization abilities, and (2) fine-tuning on task-specific data, directly optimizing the model's parameters. While prompt-based methods excel in few-shot scenarios, their effectiveness often plateaus as more data becomes available. Conversely, fine-tuning scales well with data but may underperform when training examples are scarce. We investigate a unified approach that bridges these two paradigms by incorporating in-context learning directly into the fine-tuning process. Specifically, we fine-tune the model on task-specific data augmented with in-context examples, mimicking the structure of k-shot prompts. This approach, while requiring per-task fine-tuning, combines the sample efficiency of in-context learning with the performance gains of fine-tuning, leading to a method that consistently matches and often significantly exceeds both these baselines. To perform hyperparameter selection in the low-data regime, we propose to use prequential evaluation, which eliminates the need for expensive cross-validation and leverages all available data for training while simultaneously providing a robust validation signal. We conduct an extensive empirical study to determine which adaptation paradigm - fine-tuning, in-context learning, or our proposed unified approach offers the best predictive performance on a concrete data downstream-tasks.
๐
2025-12-22
๐ ArXiv Preprint (cs.DC)
๐ฅ Ziming Mao, Yihan Zhang, Chihan Cui, Kaichao You, Zhongjie Chen et al.
Robotics
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) workloads rely on expert parallelism (EP) to achieve high GPU efficiency. State-of-the-art EP communication systems such as DeepEP demonstrate strong performance but exhibit poor portability across heterogeneous GPU and NIC platforms. The poor portability is rooted in architecture: GPU-initiated token-level RDMA communication requires tight vertical integration between GPUs and NICs, e.g., GPU writes to NIC driver/MMIO interfaces.
We present UCCL-EP, a portable EP communication system that delivers DeepEP-level performance across heterogeneous GPU and NIC hardware. UCCL-EP replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel: compact token-routing commands are transferred to multithreaded CPU proxies, which then issue GPUDirect RDMA operations on behalf of GPUs. UCCL-EP further emulates various ordering semantics required by specialized EP communication modes using RDMA immediate data, enabling correctness on NICs that lack such ordering, e.g., AWS EFA. We implement UCCL-EP on NVIDIA and AMD GPUs with EFA and Broadcom NICs. On EFA, it outperforms the best existing EP solution by up to $2.1\times$ for dispatch and combine throughput. On NVIDIA-only platform, UCCL-EP achieves comparable performance to the original DeepEP. UCCL-EP also improves token throughput on SGLang by up to 40% on the NVIDIA+EFA platform, and improves DeepSeek-V3 training throughput over the AMD Primus/Megatron-LM framework by up to 45% on a 16-node AMD+Broadcom platform.
๐
2025-12-22
๐ ArXiv Preprint (quant-ph)
๐ฅ Mark M. Wilde
Machine Learning
ๆฅ็ๆ่ฆ
One of the primary applications of classical Boltzmann machines is generative modeling, wherein the goal is to tune the parameters of a model distribution so that it closely approximates a target distribution. Training relies on estimating the gradient of the relative entropy between the target and model distributions, a task that is well understood when the classical Boltzmann machine has both visible and hidden units. For some years now, it has been an obstacle to generalize this finding to quantum state learning with quantum Boltzmann machines that have both visible and hidden units. In this paper, I derive an analytical expression for the gradient of the quantum relative entropy between a target quantum state and the reduced state of the visible units of a quantum Boltzmann machine. Crucially, this expression is amenable to estimation on a quantum computer, as it involves modular-flow-generated unitary rotations reminiscent of those appearing in my prior work on rotated Petz recovery maps. This leads to a quantum algorithm for gradient estimation in this setting. I then specialize the setting to quantum visible units and classical hidden units, and vice versa, and provide analytical expressions for the gradients, along with quantum algorithms for estimating them. Finally, I replace the quantum relative entropy objective function with the Petz-Tsallis relative entropy; here I develop an analytical expression for the gradient and sketch a quantum algorithm for estimating it, as an application of a novel formula for the derivative of the matrix power function, which also involves modular-flow-generated unitary rotations. Ultimately, this paper demarcates progress in training quantum Boltzmann machines with visible and hidden units for generative modeling and quantum state learning.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Deepit Sapru
Machine Learning
ๆฅ็ๆ่ฆ
This paper introduces a marketing decision framework that converts heterogeneous-treatment uplift into constrained targeting strategies to maximize revenue and retention while honoring business guardrails. The approach estimates Conditional Average Treatment Effects (CATE) with uplift learners and then solves a constrained allocation to decide who to target and which offer to deploy under limits such as budget or acceptable sales deterioration. Applied to retention messaging, event rewards, and spend-threshold assignment, the framework consistently outperforms propensity and static baselines in offline evaluations using uplift AUC, Inverse Propensity Scoring (IPS), and Self-Normalized IPS (SNIPS). A production-scale online A/B test further validates strategic lift on revenue and completion while preserving customer-experience constraints. The result is a reusable playbook for marketers to operationalize causal targeting at scale, set guardrails, and align campaigns with strategic KPIs.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shane X. Coffing, John Tipton, Arvind T. Mohan, Darren Engwirda
Machine Learning
ๆฅ็ๆ่ฆ
Reduced order models (ROM) can represent spatiotemporal processes in significantly fewer dimensions and can be solved many orders faster than their governing partial differential equations (PDEs). For example, using a proper orthogonal decomposition produces a ROM that is a small linear combination of fixed features and weights, but that is constrained to the given process it models. In this work, we explore a new type of ROM that is not constrained to fixed weights, based on neural Galerkin-Projections, which is an initial value problem that encodes the physics of the governing PDEs, calibrated via neural networks to accurately model the trajectory of these weights. Then using a statistical hierarchical pooling technique to learn a distribution on the initial values of the temporal weights, we can create new, statistically interpretable and physically justified weights that are generalized to many similar problems. When recombined with the spatial features, we form a complete physics surrogate, called a randPROM, for generating simulations that are consistent in distribution to a neighborhood of initial conditions close to those used to construct the ROM. We apply the randPROM technique to the study of tsunamis, which are unpredictable, catastrophic, and highly-detailed non-linear problems, modeling both a synthetic case of tsunamis near Fiji and the real-world Tohoku 2011 disaster. We demonstrate that randPROMs may enable us to significantly reduce the number of simulations needed to generate a statistically calibrated and physically defensible prediction model for arrival time and height of tsunami waves.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universitรคt Wรผrzburg in cooperation with the Technische Universitรคt Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
๐
2025-12-22
๐ ArXiv Preprint (quant-ph)
๐ฅ Davi Juvรชncio Gomes de Sousa, Nelson Alves Ferreira Neto, Christiano M. S. Nascimento, Lucas Q. Galvรฃo, Mauro Queiroz Nooblath Neto et al.
ๆฅ็ๆ่ฆ
This article presents an analysis of the practical challenges and implementation perspectives of point-to-point continuous-variable quantum key distribution (CV-QKD) systems over optical fiber. The study addresses the physical layer, including the design of transmitters, quantum channels, and receivers, with emphasis on impairments such as attenuation, chromatic dispersion, polarization fluctuations, and coexistence with classical channels. We further examine the role of digital signal processing (DSP) as the bridge between quantum state transmission and classical post-processing, highlighting its impact on excess noise mitigation, covariance matrix estimation, and reconciliation efficiency. The post-processing pipeline is detailed with a focus on parameter estimation in the finite-size regime, information reconciliation using LDPC-based codes optimized for low-SNR conditions, and privacy amplification employing large-block universal hashing. From a hardware perspective, we discuss modular digital architectures that integrate dedicated accelerators with programmable processors, supported by a reference software framework (CV-QKD-ModSim) for algorithm validation and hardware co-design. Finally, we outline perspectives for the deployment of CV-QKD in Brazil, starting from metropolitan testbeds and extending toward hybrid fiber/FSO and space-based infrastructures. The work establishes the foundations for the first point-to-point CV-QKD system in Brazil, while providing a roadmap for scalable and interoperable quantum communication networks.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Sujan Warnakulasooriya, Andreas Willig, Xiaobing Wu
ๆฅ็ๆ่ฆ
Drone applications continue to expand across various domains, with flocking offering enhanced cooperative capabilities but introducing significant challenges during initial formation. Existing flocking algorithms often struggle with efficiency and scalability, particularly when potential collisions force drones into suboptimal trajectories. This paper presents a time-efficient prioritised scheduling algorithm that improves the initial formation process of drone flocks. The method assigns each drone a priority based on its number of potential collisions and its likelihood of reaching its target position without permanently obstructing other drones. Using this hierarchy, each drone computes an appropriate delay to ensure a collision-free path. Simulation results show that the proposed algorithm successfully generates collision-free trajectories for flocks of up to 5000 drones and outperforms the coupling-degree-based heuristic prioritised planning method (CDH-PP) in both performance and computational efficiency.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Andrew Stirling, Mykola Lukashchuk, Dmitry Bagaev, Wouter Kouw, James R. Forbes
ๆฅ็ๆ่ฆ
This letter extends the exactly sparse Gaussian variational inference (ESGVI) algorithm for state estimation in two complementary directions. First, ESGVI is generalized to operate on matrix Lie groups, enabling the estimation of states with orientation components while respecting the underlying group structure. Second, factors are introduced to accommodate heavy-tailed and skewed noise distributions, as commonly encountered in ultra-wideband (UWB) localization due to non-line-of-sight (NLOS) and multipath effects. Both extensions are shown to integrate naturally within the ESGVI framework while preserving its sparse and derivative-free structure. The proposed approach is validated in a UWB localization experiment with NLOS-rich measurements, demonstrating improved accuracy and comparable consistency. Finally, a Python implementation within a factor-graph-based estimation framework is made open-source (https://github.com/decargroup/gvi_ws) to support broader research use.
๐
2025-12-22
๐ 6 pages, 4 figures. Accepted for publication in IE
๐ฅ Francisco M. F. R. Gonรงalves, Ryan M. Bena, Nรฉstor O. Pรฉrez-Arancibia
Robotics
ๆฅ็ๆ่ฆ
We introduce a new class of attitude control laws for rotational systems, which generalizes the use of the Euler axis-angle representation beyond quaternion-based formulations. Using basic Lyapunov's stability theory and the notion of extended $K_{\infty}$ functions, we developed a method for determining and enforcing the global asymptotic stability of the single fixed point of the resulting closed-loop (CL) scheme. In contrast with traditional quaternion-based methods, the proposed generalized axis-angle approach enables greater flexibility in the design of the control law, which is of great utility when employed in combination with a switching scheme whose transition state depends on the angular velocity of the controlled rotational system. Through simulation and real-time experimental results, we demonstrate the effectiveness of the proposed approach. According to the recorded data, in the execution of high-speed tumble-recovery maneuvers, the new method consistently achieves shorter stabilization times and requires lower control effort relative to those corresponding to the quaternion-based and geometric-control methods used as benchmarks.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Jiaqi Peng, Wenzhe Cai, Yuqiang Yang, Tai Wang, Yuan Shen et al.
Robotics
ๆฅ็ๆ่ฆ
Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error propagation. We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while metric-aware geometry memory enhances planning consistency and obstacle avoidance, leading to more than a 27.3\% improvement over oracle-localization baselines and strong generalization across embodiments and environments. The code and models have been made publicly available on the https://steinate.github.io/logoplanner.github.io.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universitรคt Wรผrzburg in cooperation with the Technische Universitรคt Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski et al.
Machine Learning
ๆฅ็ๆ่ฆ
Automating the calculation of clinical risk scores offers a significant opportunity to reduce physician administrative burden and enhance patient care. The current standard for evaluating this capability is MedCalc-Bench, a large-scale dataset constructed using LLM-based feature extraction and rule-based aggregation. However, treating such model-generated benchmarks as static oracles risks enshrining historical model errors as evaluation gold standards, a problem dangerously amplified when these datasets serve as reward signals for Reinforcement Learning (RL). In this work, we propose viewing benchmarks for complex tasks such as clinical score computation as ''in-progress living documents'' that should be periodically re-evaluated as the processes for creating them improve. We introduce a systematic, physician-in-the-loop pipeline that leverages advanced agentic verifiers to audit and relabel MedCalc-Bench, utilizing automated triage to reserve scarce clinician attention for the most contentious instances. Our audit reveals that a notable fraction of original labels diverge from medical ground truth due to extraction errors, calculator logic mismatches, and clinical ambiguity. To study whether this label noise meaningfully impacts downstream RL training, we fine-tune a Qwen3-8B model via Group Relative Policy Optimization (GRPO) and demonstrate that training on corrected labels yields an 8.7% absolute improvement in accuracy over the original baseline -- validating that label noise materially affects model evaluation. These findings underscore that in safety-critical domains, rigorous benchmark maintenance is a prerequisite for genuine model alignment.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, Xinchao Wang
Machine Learning
ๆฅ็ๆ่ฆ
Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: \href{https://hyokong.github.io/worldwarp-page/}{https://hyokong.github.io/worldwarp-page/}.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and raveling out complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policy, we find that: (a) Early layers keep high entropy for exploration, top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) LLama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning training objective at lower layer, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrates the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
๐
2025-12-22
๐ ArXiv Preprint (cs.DS)
๐ฅ Diptarka Chakraborty, Hendrik Fichtenberger, Bernhard Haeupler, Silvio Lattanzi, Ashkan Norouzi-Fard et al.
ๆฅ็ๆ่ฆ
Designing efficient, effective, and consistent metric clustering algorithms is a significant challenge attracting growing attention. Traditional approaches focus on the stability of cluster centers; unfortunately, this neglects the real-world need for stable point labels, i.e., stable assignments of points to named sets (clusters). In this paper, we address this gap by initiating the study of label-consistent metric clustering. We first introduce a new notion of consistency, measuring the label distance between two consecutive solutions. Then, armed with this new definition, we design new consistent approximation algorithms for the classical $k$-center and $k$-median problems.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zahra Sadeghi, Evangelos Milios, Frank Rudzicz
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, requiring a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges to evaluate summaries with respect to the original text. While previous research investigated the alignment between LLMs and Human responses, it is not yet well understood what properties or features are exploited by them when asked to evaluate based on a particular quality dimension, and there has not been much attention towards mapping between evaluation scores and metrics. In this paper, we address this issue and discover features aligned with Human and Generative Pre-trained Transformers (GPTs) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ metrics used by Human can improve their judgment and conforming them better with human responses.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Artemis Panagopoulou, Aveek Purohit, Achin Kulshrestha, Soroosh Yazdani, Mohit Goyal
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that finetuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universitรคt Wรผrzburg in cooperation with the Technische Universitรคt Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
๐
2025-12-22
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Angjelin Hila
Natural Language Processing
ๆฅ็ๆ่ฆ
We examine epistemological threats posed by human and LLM interaction. We develop collective epistemology as a theory of epistemic warrant distributed across human collectives, using bounded rationality and dual process theory as background. We distinguish internalist justification, defined as reflective understanding of why a proposition is true, from externalist justification, defined as reliable transmission of truths. Both are necessary for collective rationality, but only internalist justification produces reflective knowledge. We specify reflective knowledge as follows: agents understand the evaluative basis of a claim, when that basis is unavailable agents consistently assess the reliability of truth sources, and agents have a duty to apply these standards within their domains of competence. We argue that LLMs approximate externalist reliabilism because they can reliably transmit information whose justificatory basis is established elsewhere, but they do not themselves possess reflective justification. Widespread outsourcing of reflective work to reliable LLM outputs can weaken reflective standards of justification, disincentivize comprehension, and reduce agents' capacity to meet professional and civic epistemic duties. To mitigate these risks, we propose a three tier norm program that includes an epistemic interaction model for individual use, institutional and organizational frameworks that seed and enforce norms for epistemically optimal outcomes, and deontic constraints at organizational and or legislative levels that instantiate discursive norms and curb epistemic vices.
๐
2025-12-22
๐ ArXiv Preprint (econ.GN)
๐ฅ Lapo Santarlasci, Armando Rungi, Loredana Fattorini, Nestor Maslej
Robotics
ๆฅ็ๆ่ฆ
Artificial intelligence has become a key arena of global technological competition and a central concern for Europe's quest for technological sovereignty. This paper analyzes global AI patenting from 2010 to 2023 to assess Europe's position in an increasingly bipolar innovation landscape dominated by the United States and China. Using linked patent, firm, ownership, and citation data, we examine the geography, specialization, and international diffusion of AI innovation. We find a highly concentrated patent landscape: China leads in patent volumes, while the United States dominates in citation impact and technological influence. Europe accounts for a limited share of AI patents but exhibits signals of relatively high patent quality. Technological proximity reveals global convergence toward U.S. innovation trajectories, with Europe remaining fragmented rather than forming an autonomous pole. Gravity-model estimates show that cross-border AI knowledge flows are driven primarily by technological capability and specialization, while geographic and institutional factors play a secondary role. EU membership does not significantly enhance intra-European knowledge diffusion, suggesting that technological capacity, rather than political integration, underpins participation in global AI innovation networks.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Yanliang Huang, Xia Yan, Peiran Yin, Zhenduo Zhang, Zeyan Shao et al.
Robotics
ๆฅ็ๆ่ฆ
Over the past decade, a wide range of motion planning approaches for autonomous vehicles has been developed to handle increasingly complex traffic scenarios. However, these approaches are rarely compared on standardized benchmarks, limiting the assessment of relative strengths and weaknesses. To address this gap, we present the setup and results of the 4th CommonRoad Motion Planning Competition held in 2024, conducted using the CommonRoad benchmark suite. This annual competition provides an open-source and reproducible framework for benchmarking motion planning algorithms. The benchmark scenarios span highway and urban environments with diverse traffic participants, including passenger cars, buses, and bicycles. Planner performance is evaluated along four dimensions: efficiency, safety, comfort, and compliance with selected traffic rules. This report introduces the competition format and provides a comparison of representative high-performing planners from the 2023 and 2024 editions.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Martin Sedlacek, Pavlo Yefanov, Georgy Ponimatkin, Jai Bardhan, Simon Pilc et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on, which is presently difficult and expensive to evaluate in the real-world. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and aligned robot control. Our environment offers a suite of 15 perturbation factors, 7 manipulation skills, and more than 3,500 objects. Finally, we establish two task sets that form our benchmark and evaluate the ฯ_{0}, ฯ_{0}-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge. More broadly, we also show that simulation gives us a valuable proxy for the real-world and allows us to systematically probe for and quantify the weaknesses and failure modes of VLAs. Project page: https://martin-sedlacek.com/realm
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Antonia Alomar, Mireia Masias, Marius George Linguraru, Federico M. Sukno, Gemma Piella
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Early detection of developmental disorders can be aided by analyzing infant craniofacial morphology, but modeling infant faces is challenging due to limited data and frequent spontaneous expressions. We introduce BabyFlow, a generative AI model that disentangles facial identity and expression, enabling independent control over both. Using normalizing flows, BabyFlow learns flexible, probabilistic representations that capture the complex, non-linear variability of expressive infant faces without restrictive linear assumptions. To address scarce and uncontrolled expressive data, we perform cross-age expression transfer, adapting expressions from adult 3D scans to enrich infant datasets with realistic and systematic expressive variants. As a result, BabyFlow improves 3D reconstruction accuracy, particularly in highly expressive regions such as the mouth, eyes, and nose, and supports synthesis and modification of infant expressions while preserving identity. Additionally, by integrating with diffusion models, BabyFlow generates high-fidelity 2D infant images with consistent 3D geometry, providing powerful tools for data augmentation and early facial analysis.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Lawrence Krukrubo, Julius Odede, Olawande Olusegun
ๆฅ็ๆ่ฆ
Current approaches to Explainable AI (XAI) face a "Scalability-Stability Dilemma." Post-hoc methods (e.g., LIME, SHAP) may scale easily but suffer from instability, while supervised explanation frameworks (e.g., TED) offer stability but require prohibitive human effort to label every training instance. This paper proposes a Hybrid LRR-TED framework that addresses this dilemma through a novel "Asymmetry of Discovery." When applied to customer churn prediction, we demonstrate that automated rule learners (GLRM) excel at identifying broad "Safety Nets" (retention patterns) but struggle to capture specific "Risk Traps" (churn triggers)-a phenomenon we term the Anna Karenina Principle of Churn. By initialising the explanation matrix with automated safety rules and augmenting it with a Pareto-optimal set of just four human-defined risk rules, our approach achieves 94.00% predictive accuracy. This configuration outperforms the full 8-rule manual expert baseline while reducing human annotation effort by 50%, proposing a shift in the paradigm for Human-in-the-Loop AI: moving experts from the role of "Rule Writers" to "Exception Handlers."
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yongxin Wang, Zhicheng Yang, Meng Cao, Mingfei Han, Haokun Lin et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling (RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiawen Wang, Jingjing Wang Tianyang Chen, Min Zhang, Guodong Zhou
ๆฅ็ๆ่ฆ
In the literature, existing human-centric emotional motion generation methods primarily focus on boosting performance within a single scale-fixed dataset, largely neglecting the flexible and scale-increasing motion scenarios (e.g., sports, dance), whereas effectively learning these newly emerging scenarios can significantly enhance the model's real-world generalization ability. Inspired by this, this paper proposes a new LLM-Centric Lifelong Empathic Motion Generation (L^2-EMG) task, which aims to equip LLMs with the capability to continually acquire emotional motion generation knowledge across different unseen scenarios, potentially contributing to building a closed-loop and self-evolving embodied agent equipped with both empathy and intelligence. Further, this paper poses two key challenges in the L^2-EMG task, i.e., the emotion decoupling challenge and the scenario adapting challenge. To this end, this paper proposes an Emotion-Transferable and Scenario-Adapted Mixture of Experts (ES-MoE) approach which designs a causal-guided emotion decoupling block and a scenario-adapted expert constructing block to address the two challenges, respectively. Especially, this paper constructs multiple L^2-EMG datasets to validate the effectiveness of the ES-MoE approach. Extensive evaluations show that ES-MoE outperforms advanced baselines.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Moritz Bรถhle, Amรฉlie Royer, Juliette Marrie, Edouard Grave, Patrick Pรฉrez
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa .
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongsheng Xing, Qiuxin Si
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Predicting reaction outcomes across continuous solvent composition ranges remains a critical challenge in organic synthesis and process chemistry. Traditional machine learning approaches often treat solvent identity as a discrete categorical variable, which prevents systematic interpolation and extrapolation across the solvent space. This work introduces the \textbf{Catechol Benchmark}, a high-throughput transient flow chemistry dataset comprising 1,227 experimental yield measurements for the rearrangement of allyl-substituted catechol in 24 pure solvents and their binary mixtures, parameterized by continuous volume fractions ($\% B$). We evaluate various architectures under rigorous leave-one-solvent-out and leave-one-mixture-out protocols to test generalization to unseen chemical environments.
Our results demonstrate that classical tabular methods (e.g., Gradient-Boosted Decision Trees) and large language model embeddings (e.g., Qwen-7B) struggle with quantitative precision, yielding Mean Squared Errors (MSE) of 0.099 and 0.129, respectively. In contrast, we propose a hybrid GNN-based architecture that integrates Graph Attention Networks (GATs) with Differential Reaction Fingerprints (DRFP) and learned mixture-aware solvent encodings. This approach achieves an \textbf{MSE of 0.0039} ($\pm$ 0.0003), representing a 60\% error reduction over competitive baselines and a $>25\times$ improvement over tabular ensembles. Ablation studies confirm that explicit molecular graph message-passing and continuous mixture encoding are essential for robust generalization. The complete dataset, evaluation protocols, and reference implementations are released to facilitate data-efficient reaction prediction and continuous solvent representation learning.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors like background noise, counterfactual priors, and strategic prompting and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xueming Yan, Bo Yin, Yaochu Jin
Machine Learning
ๆฅ็ๆ่ฆ
Multiobjective reinforcement learning (MORL) poses significant challenges due to the inherent conflicts between objectives and the difficulty of adapting to dynamic environments. Traditional methods often struggle to generalize effectively, particularly in large and complex state-action spaces. To address these limitations, we introduce the Latent Causal Diffusion Model (LacaDM), a novel approach designed to enhance the adaptability of MORL in discrete and continuous environments. Unlike existing methods that primarily address conflicts between objectives, LacaDM learns latent temporal causal relationships between environmental states and policies, enabling efficient knowledge transfer across diverse MORL scenarios. By embedding these causal structures within a diffusion model-based framework, LacaDM achieves a balance between conflicting objectives while maintaining strong generalization capabilities in previously unseen environments. Empirical evaluations on various tasks from the MOGymnasium framework demonstrate that LacaDM consistently outperforms the state-of-art baselines in terms of hypervolume, sparsity, and expected utility maximization, showcasing its effectiveness in complex multiobjective tasks.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyang Song, Zelin Zang, Zuyao Chen, Xusheng Liang, Dong Yi et al.
RoboticsMachine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored, especially in clinical anatomical surgical images. Anatomy understanding tasks demand precise understanding and clinically coherent answers, which are difficult to achieve due to the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional Supervised Fine-Tuning (SFT) strategies. While recent work has demonstrated that Group Relative Policy Optimization (GRPO) can enhance reasoning in MLLMs without relying on large amounts of data, we find two weaknesses that hinder GRPO's reasoning performance in anatomy recognition: 1) knowledge cannot be effectively shared between different anatomical structures, resulting in uneven information gain and preventing the model from converging, and 2) the model quickly converges to a single reasoning path, suppressing the exploration of diverse strategies. To overcome these challenges, we propose two novel methods. First, we implement a progressive learning strategy called Anatomical Similarity Curriculum Learning by controlling question difficulty via the similarity of answer choices, enabling the model to master complex problems incrementally. Second, we utilize question augmentation referred to as Group Diversity Question Augmentation to expand the model's search space for difficult queries, mitigating the tendency to produce uniform responses. Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show our method achieves a significant improvement across the two benchmarks, demonstrating its effectiveness in enhancing the medical reasoning capabilities of MLLMs. The code can be found in https://github.com/tomato996/Anatomy-R1
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongliang Li, Nong Zhang, Zhewen Xu, Xiang Li, Changzheng Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Understanding and predicting the Madden-Julian Oscillation (MJO) is fundamental for precipitation forecasting and disaster prevention. To date, long-term and accurate MJO prediction has remained a challenge for researchers. Conventional MJO prediction methods using Numerical Weather Prediction (NWP) are resource-intensive, time-consuming, and highly unstable (most NWP methods are sensitive to seasons, with better MJO forecast results in winter). While existing Artificial Neural Network (ANN) methods save resources and speed forecasting, their accuracy never reaches the 28 days predicted by the state-of-the-art NWP method, i.e., the operational forecasts from ECMWF, since neural networks cannot handle climate data effectively. In this paper, we present a Domain Knowledge Embedded Spatio-Temporal Network (DK-STN), a stable neural network model for accurate and efficient MJO forecasting. It combines the benefits of NWP and ANN methods and successfully improves the forecast accuracy of ANN methods while maintaining a high level of efficiency and stability. We begin with a spatial-temporal network (STN) and embed domain knowledge in it using two key methods: (i) applying a domain knowledge enhancement method and (ii) integrating a domain knowledge processing method into network training. We evaluated DK-STN with the 5th generation of ECMWF reanalysis (ERA5) data and compared it with ECMWF. Given 7 days of climate data as input, DK-STN can generate reliable forecasts for the following 28 days in 1-2 seconds, with an error of only 2-3 days in different seasons. DK-STN significantly exceeds ECMWF in that its forecast accuracy is equivalent to ECMWF's, while its efficiency and stability are significantly superior.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikita Volzhin, Soowhan Yoon
Machine Learning
ๆฅ็ๆ่ฆ
The recent development of Kolmogorov-Arnold Networks (KANs) introduced new discoveries in the field of Graph Neural Networks (GNNs), expanding the existing set of models with KAN-based versions of GNNs, which often surpass the accuracy of MultiLayer Perceptron (MLP)-based GNNs. These models were widely tested on the graph datasets consisting of organic molecules; however, those studies disregarded the inorganic nanomaterials datasets. In this work, we close this gap by applying Kolmogorov-Arnold Graph Neural Networks (KAGNNs) to a recently published large inorganic nanomaterials dataset called CHILI. For this, we adapt and test KAGNNs appropriate for this dataset. Our experiments reveal that on the CHILI datasets, particularly on the CHILI-3K, KAGNNs substantially surpass conventional GNNs in classification, achieving state-of-the-art results.
๐
2025-12-22
๐ ArXiv Preprint (cs.SE)
๐ฅ Katharina Stengg, Christian Macho, Martin Pinzger
Natural Language Processing
ๆฅ็ๆ่ฆ
Understanding source code changes and their impact on other code entities is a crucial skill in software development. However, the analysis of code changes and their impact is often performed manually and therefore is time-consuming. Recent advancements in AI, and in particular large language models (LLMs) show promises to help developers in various code analysis tasks. However, the extent to which this potential can be utilized for understanding code changes and their impact is underexplored. To address this gap, we study the capabilities of GPT-5 and GPT-5-mini to predict the code entities impacted by given source code changes. We construct a dataset containing information about seed-changes, change pairs, and change types for each commit. Existing datasets lack crucial information about seed changes and impacted code entities. Our experiments evaluate the LLMs in two configurations: (1) seed-change information and the parent commit tree and (2) seed-change information, the parent commit tree, and the diff hunk of each seed change. We found that both LLMs perform poorly in the two experiments, whereas GPT-5 outperforms GPT-5-mini. Furthermore, the provision of the diff hunks helps both models to slightly improve their performance.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lorenzo Capelli, Leandro de Souza Rosa, Gianluca Setti, Mauro Mangia, Riccardo Rovatti
Machine Learning
ๆฅ็ๆ่ฆ
The recent explosive growth in Deep Neural Networks applications raises concerns about the black-box usage of such models, with limited trasparency and trustworthiness in high-stakes domains, which have been crystallized as regulatory requirements such as the European Union Artificial Intelligence Act. While models with embedded confidence metrics have been proposed, such approaches cannot be applied to already existing models without retraining, limiting their broad application. On the other hand, post-hoc methods, which evaluate pre-trained models, focus on solving problems related to improving the confidence in the model's predictions, and detecting Out-Of-Distribution or Adversarial Attacks samples as independent applications. To tackle the limited applicability of already existing methods, we introduce Multi-Layer Analysis for Confidence Scoring (MACS), a unified post-hoc framework that analyzes intermediate activations to produce classification-maps. From the classification-maps, we derive a score applicable for confidence estimation, detecting distributional shifts and adversarial attacks, unifying the three problems in a common framework, and achieving performances that surpass the state-of-the-art approaches in our experiments with the VGG16 and ViTb16 models with a fraction of their computational overhead.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zeyu Xia, Jinzhe Ma, Congjie Zheng, Shufei Zhang, Yuqiang Li et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have emerged as powerful tools for accelerating scientific discovery, yet their static knowledge and hallucination issues hinder autonomous research applications. Recent advances integrate LLMs into agentic frameworks, enabling retrieval, reasoning, and tool use for complex scientific workflows. Here, we present a domain-specialized agent designed for reliable automation of first-principles materials computations. By embedding domain expertise, the agent ensures physically coherent multi-step workflows and consistently selects convergent, well-posed parameters, thereby enabling reliable end-to-end computational execution. A new benchmark of diverse computational tasks demonstrates that our system significantly outperforms standalone LLMs in both accuracy and robustness. This work establishes a verifiable foundation for autonomous computational experimentation and represents a key step toward fully automated scientific discovery.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jinwei Chi, Ke Wang, Yu Chen, Xuanye Lin, Qiang Xu
Natural Language Processing
ๆฅ็ๆ่ฆ
Automated essay scoring (AES) is a challenging task in cross-prompt settings due to the diversity of scoring criteria. While previous studies have focused on the output of large language models (LLMs) to improve scoring accuracy, we believe activations from intermediate layers may also provide valuable information. To explore this possibility, we evaluated the discriminative power of LLMs' activations in cross-prompt essay scoring task. Specifically, we used activations to fit probes and further analyzed the effects of different models and input content of LLMs on this discriminative power. By computing the directions of essays across various trait dimensions under different prompts, we analyzed the variation in evaluation perspectives of large language models concerning essay types and traits. Results show that the activations possess strong discriminative power in evaluating essay quality and that LLMs can adapt their evaluation perspectives to different traits and essay types, effectively handling the diversity of scoring criteria in cross-prompt settings.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fei Ge, Ying Huang, Jie Liu, Guixuan Zhang, Zhi Zeng et al.
Machine Learning
ๆฅ็ๆ่ฆ
Existing deep image watermarking methods follow a fixed embedding-distortion-extraction pipeline, where the embedder and extractor are weakly coupled through a final loss and optimized in isolation. This design lacks explicit collaboration, leaving no structured mechanism for the embedder to incorporate decoding-aware cues or for the extractor to guide embedding during training. To address this architectural limitation, we rethink deep image watermarking by reformulating embedding and extraction as explicitly collaborative components. To realize this reformulation, we introduce a Collaborative Interaction Mechanism (CIM) that establishes direct, bidirectional communication between the embedder and extractor, enabling a mutual-teacher training paradigm and coordinated optimization. Built upon this explicitly collaborative architecture, we further propose an Adaptive Feature Modulation Module (AFMM) to support effective interaction. AFMM enables content-aware feature regulation by decoupling modulation structure and strength, guiding watermark embedding toward stable image features while suppressing host interference during extraction. Under CIM, the AFMMs on both sides form a closed-loop collaboration that aligns embedding behavior with extraction objectives. This architecture-level redesign changes how robustness is learned in watermarking systems. Rather than relying on exhaustive distortion simulation, robustness emerges from coordinated representation learning between embedding and extraction. Experiments on real-world and AI-generated datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches in watermark extraction accuracy while maintaining high perceptual quality, showing strong robustness and generalization.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhang Chong
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We revisit a basic question in sequence modeling: is explicit self-attention actually necessary for strong performance and reasoning? We argue that standard multi-head attention is best seen as a form of tensor lifting: hidden vectors are mapped into a high-dimensional space of pairwise interactions, and learning proceeds by constraining this lifted tensor through gradient descent. This mechanism is extremely expressive but mathematically opaque, because after many layers it becomes very hard to describe the model with a small family of explicit invariants.
To explore an alternative, we propose an attention-free architecture based on Grassmann flows. Instead of forming an L by L attention matrix, our Causal Grassmann layer (i) linearly reduces token states, (ii) encodes local token pairs as two-dimensional subspaces on a Grassmann manifold via Plucker coordinates, and (iii) fuses these geometric features back into the hidden states through gated mixing. Information therefore propagates by controlled deformations of low-rank subspaces over multi-scale local windows, so the core computation lives on a finite-dimensional manifold rather than in an unstructured tensor space.
On the Wikitext-2 language modeling benchmark, purely Grassmann-based models with 13 to 18 million parameters achieve validation perplexities within about 10 to 15 percent of size-matched Transformers. On the SNLI natural language inference task, a Grassmann-Plucker head on top of DistilBERT slightly outperforms a Transformer head, with best validation and test accuracies of 0.8550 and 0.8538 compared to 0.8545 and 0.8511. We analyze the complexity of Grassmann mixing, show linear scaling in sequence length for fixed rank, and argue that such manifold-based designs offer a more structured route toward geometric and invariant-based interpretations of neural reasoning.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Elad Hazan, Shai Shalev Shwartz, Nathan Srebro
Robotics
ๆฅ็ๆ่ฆ
Modern learning systems increasingly interact with data that evolve over time and depend on hidden internal state. We ask a basic question: when is such a dynamical system learnable from observations alone? This paper proposes a research program for understanding learnability in dynamical systems through the lens of next-token prediction. We argue that learnability in dynamical systems should be studied as a finite-sample question, and be based on the properties of the underlying dynamics rather than the statistical properties of the resulting sequence. To this end, we give a formulation of learnability for stochastic processes induced by dynamical systems, focusing on guarantees that hold uniformly at every time step after a finite burn-in period. This leads to a notion of dynamic learnability which captures how the structure of a system, such as stability, mixing, observability, and spectral properties, governs the number of observations required before reliable prediction becomes possible. We illustrate the framework in the case of linear dynamical systems, showing that accurate prediction can be achieved after finite observation without system identification, by leveraging improper methods based on spectral filtering. We survey the relationship between learning in dynamical systems and classical PAC, online, and universal prediction theories, and suggest directions for studying nonlinear and controlled systems.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Contemporary GUI agents, while increasingly capable due to advances in Large Vision-Language Models (VLMs), often operate with a critical limitation: they treat each task in isolation, lacking a mechanism to systematically learn from past successes. This digital ''amnesia'' results in sub-optimal performance, repeated errors, and poor generalization to novel challenges. To bridge this gap, we introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory. Our framework operates in three distinct stages. First, during Experience Exploration, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model. Crucially, the entire knowledge base construction is thus fully automated, requiring no human supervision. Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable ''memories''. Finally, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent's reasoning and decision-making process. We demonstrate the efficacy of our approach on benchmarks including Android World and AndroidLab. The results show that EchoTrail-GUI significantly improves the task success rate and operational efficiency of baseline agents, validating the power of structured memory in creating more robust and intelligent GUI automation.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yueyao Chen, Kai-Ni Wang, Dario Tayupo, Arnaud Huaulm'e, Krystel Nyangoh Timoh et al.
ๆฅ็ๆ่ฆ
Purpose: Surgical workflow recognition enables context-aware assistance and skill assessment in computer-assisted interventions. Despite recent advances, current methods suffer from two critical challenges: prediction jitter across consecutive frames and poor discrimination of ambiguous phases. This paper aims to develop a stable framework by selectively propagating reliable historical information and explicitly modeling uncertainty for hard sample enhancement.
Methods: We propose a dual-pathway framework DSTED with Reliable Memory Propagation (RMP) and Uncertainty-Aware Prototype Retrieval (UPR). RMP maintains temporal coherence by filtering and fusing high-confidence historical features through multi-criteria reliability assessment. UPR constructs learnable class-specific prototypes from high-uncertainty samples and performs adaptive prototype matching to refine ambiguous frame representations. Finally, a confidence-driven gate dynamically balances both pathways based on prediction certainty.
Results: Our method achieves state-of-the-art performance on AutoLaparo-hysterectomy with 84.36% accuracy and 65.51% F1-score, surpassing the second-best method by 3.51% and 4.88% respectively. Ablations reveal complementary gains from RMP (2.19%) and UPR (1.93%), with synergistic effects when combined. Extensive analysis confirms substantial reduction in temporal jitter and marked improvement on challenging phase transitions.
Conclusion: Our dual-pathway design introduces a novel paradigm for stable workflow recognition, demonstrating that decoupling the modeling of temporal consistency and phase ambiguity yields superior performance and clinical applicability.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xueming Yan, Boyan Xu, Yaochu Jin, Lixian Xiao, Wenlong Ye et al.
Multimodal
ๆฅ็ๆ่ฆ
Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available. https://github.com/yanxm01/INDOMER
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Christian Hรคgg, Kathlรฉn Kohn, Giovanni Luca Marchetti, Boris Shapiro
ๆฅ็ๆ่ฆ
We present Sprecher Networks (SNs), a family of trainable neural architectures inspired by the classical Kolmogorov-Arnold-Sprecher (KAS) construction for approximating multivariate continuous functions. Distinct from Multi-Layer Perceptrons (MLPs) with fixed node activations and Kolmogorov-Arnold Networks (KANs) featuring learnable edge activations, SNs utilize shared, learnable splines (monotonic and general) within structured blocks incorporating explicit shift parameters and mixing weights. Our approach directly realizes Sprecher's specific 1965 sum of shifted splines formula in its single-layer variant and extends it to deeper, multi-layer compositions. We further enhance the architecture with optional lateral mixing connections that enable intra-block communication between output dimensions, providing a parameter-efficient alternative to full attention mechanisms. Beyond parameter efficiency with $O(LN + LG)$ scaling (where $G$ is the knot count of the shared splines) versus MLPs' $O(LN^2)$, SNs admit a sequential evaluation strategy that reduces peak forward-intermediate memory from $O(N^2)$ to $O(N)$ (treating batch size as constant), making much wider architectures feasible under memory constraints. We demonstrate empirically that composing these blocks into deep networks leads to highly parameter and memory-efficient models, discuss theoretical motivations, and compare SNs with related architectures (MLPs, KANs, and networks with learnable node activations).
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Simon Stรฅhlberg, Blai Bonet, Hector Geffner
Machine Learning
ๆฅ็ๆ่ฆ
While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning where provable correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular, policy optimization methods, can be used to learn policies that generalize like combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way. From the former, we model policies as state transition classifiers, as (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states, and in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the DRL methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms, and result from the well-understood expressive limitations of GNNs, and the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods by adding derived predicates and an alternative cost structure to optimize.
๐
2025-12-22
๐ AAAI 2026
๐ฅ Simon Stรฅhlberg, Hector Geffner
Machine Learning
ๆฅ็ๆ่ฆ
First-order relational languages have been used in MDP planning and reinforcement learning (RL) for two main purposes: specifying MDPs in compact form, and representing and learning policies that are general and not tied to specific instances or state spaces. In this work, we instead consider the use of first-order languages in goal-conditioned RL and generalized planning. The question is how to learn goal-conditioned and general policies when the training instances are large and the goal cannot be reached by random exploration alone. The technique of Hindsight Experience Replay (HER) provides an answer to this question: it relabels unsuccessful trajectories as successful ones by replacing the original goal with one that was actually achieved. If the target policy must generalize across states and goals, trajectories that do not reach the original goal states can enable more data- and time-efficient learning. In this work, we show that further performance gains can be achieved when states and goals are represented by sets of atoms. We consider three versions: goals as full states, goals as subsets of the original goals, and goals as lifted versions of these subgoals. The result is that the latter two successfully learn general policies on large planning instances with sparse rewards by automatically creating a curriculum of easier goals of increasing complexity. The experiments illustrate the computational gains of these versions, their limitations, and opportunities for addressing them.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ A. B. M. Ashikur Rahman, Saeed Anwar, Muhammad Usman, Irfan Ahmad, Ajmal Mian
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Sycophancy, an excessive tendency of AI models to agree with user input at the expense of factual accuracy or in contradiction of visual evidence, poses a critical and underexplored challenge for multimodal large language models (MLLMs). While prior studies have examined this behavior in text-only settings of large language models, existing research on visual or multimodal counterparts remains limited in scope and depth of analysis. To address this gap, we introduce a comprehensive evaluation benchmark, \textit{PENDULUM}, comprising approximately 2,000 human-curated Visual Question Answering pairs specifically designed to elicit sycophantic responses. The benchmark spans six distinct image domains of varying complexity, enabling a systematic investigation of how image type and inherent challenges influence sycophantic tendencies. Through extensive evaluation of state-of-the-art MLLMs. we observe substantial variability in model robustness and a pronounced susceptibility to sycophantic and hallucinatory behavior. Furthermore, we propose novel metrics to quantify sycophancy in visual reasoning, offering deeper insights into its manifestations across different multimodal contexts. Our findings highlight the urgent need for developing sycophancy-resilient architectures and training strategies to enhance factual consistency and reliability in future MLLMs. Our proposed dataset with MLLMs response are available at https://github.com/ashikiut/pendulum/.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ JiaWei Zhu, ZiHeng Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Hidden confounding remains a fundamental challenge in causal inference from observational data. Recent advances leverage Large Language Models (LLMs) to generate plausible hidden confounders based on domain knowledge, yet a critical gap exists: LLM-generated confounders often exhibit semantic plausibility without statistical utility. We propose VIGOR+ (Variational Information Gain for iterative cOnfounder Refinement), a novel framework that closes the loop between LLM-based confounder generation and CEVAE-based statistical validation. Unlike prior approaches that treat generation and validation as separate stages, VIGOR+ establishes an iterative feedback mechanism: validation signals from CEVAE (including information gain, latent consistency metrics, and diagnostic messages) are transformed into natural language feedback that guides subsequent LLM generation rounds. This iterative refinement continues until convergence criteria are met. We formalize the feedback mechanism, prove convergence properties under mild assumptions, and provide a complete algorithmic framework.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ezequiel Lopez-Rubio, Macoris Decena-Gimenez, Rafael Marcos Luque-Baena
Natural Language Processing
ๆฅ็ๆ่ฆ
A key module in neural transformer-based deep architectures is positional encoding. This module enables a suitable way to encode positional information as input for transformer neural layers. This success has been rooted in the use of sinusoidal functions of various frequencies, in order to capture recurrent patterns of differing typical periods. In this work, an alternative set of periodic functions is proposed for positional encoding. These functions preserve some key properties of sinusoidal ones, while they depart from them in fundamental ways. Some tentative experiments are reported, where the original sinusoidal version is substantially outperformed. This strongly suggests that the alternative functions may have a wider use in other transformer architectures.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yayuan Li, Jian Zhang, Jintao Guo, Zihan Cheng, Lei Qi et al.
Computer Vision
ๆฅ็ๆ่ฆ
The proliferation of pre-trained models has given rise to a wide array of specialised, fine-tuned models. Model merging aims to merge the distinct capabilities of these specialised models into a unified model, requiring minimal or even no additional training. A core objective of model merging is to ensure the merged model retains the behavioural characteristics of the specialised models, typically achieved through feature alignment. We identify that features consist of two critical components: direction and magnitude. Prior research has predominantly focused on directional alignment, while the influence of magnitude remains largely neglected, despite its pronounced vulnerability to perturbations introduced by common merging operations (e.g., parameter fusion and sparsification). Such perturbations to magnitude inevitably lead to feature deviations in the merged model from the specialised models, resulting in subsequent performance degradation. To address this, we propose MAGnItude Calibration (MAGIC), a plug-and-play framework that rectifies layer-wise magnitudes in feature and weight spaces, with three variants. Specifically, our Feature Space Calibration (FSC) realigns the merged model's features using a small set of unlabelled data, while Weight Space Calibration (WSC) extends this calibration to the weight space without requiring additional data. Combining these yields Dual Space Calibration (DSC). Comprehensive experiments demonstrate that MAGIC consistently boosts performance across diverse Computer Vision tasks (+4.3% on eight datasets) and NLP tasks (+8.0% on Llama) without additional training. Our code is available at: https://github.com/lyymuwu/MAGIC
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ A. A. Gde Yogi Pramana, Jason Ray, Anthony Jaya, Michael Wijaya
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision--Language Models (VLMs) show significant promise for Medical Visual Question Answering (VQA), yet their deployment in clinical settings is hindered by severe vulnerability to adversarial attacks. Standard adversarial training, while effective for simpler tasks, often degrades both generalization performance and the quality of generated clinical reasoning. We introduce SafeMed-R1, a hybrid defense framework that ensures robust performance while preserving high-quality, interpretable medical reasoning. SafeMed-R1 employs a two-stage approach: at training time, we integrate Adversarial Training with Group Relative Policy Optimization (AT-GRPO) to explicitly robustify the reasoning process against worst-case perturbations; at inference time, we augment the model with Randomized Smoothing to provide certified $L_2$-norm robustness guarantees. We evaluate SafeMed-R1 on the OmniMedVQA benchmark across eight medical imaging modalities comprising over 88,000 samples. Our experiments reveal that standard fine-tuned VLMs, despite achieving 95\% accuracy on clean inputs, collapse to approximately 25\% under PGD attacks. In contrast, SafeMed-R1 maintains 84.45\% accuracy under the same adversarial conditions, representing a 59 percentage point improvement in robustness. Furthermore, we demonstrate that models trained with explicit chain-of-thought reasoning exhibit superior adversarial robustness compared to instruction-only variants, suggesting a synergy between interpretability and security in medical AI systems.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hui Li, Jiayue Lyu, Fu-Yun Wang, Kaihui Cheng, Siyu Zhu et al.
Computer Vision
ๆฅ็ๆ่ฆ
This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving the diffusion models. During training, the input of a prediction network at one training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, and during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is the nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named slowed interpolation mixture, for post-training the prediction network for each training timestep. Experiments over class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Our approach MixFlow over the RAE models achieve strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x 512.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haoyu Jiang, Fanjie Zeng, Boan Qu, Xiaojie Lin, Wei Zhong
Natural Language Processing
ๆฅ็ๆ่ฆ
In the global drive toward carbon neutrality, deeply coordinated smart energy systems underpin industrial transformation. However, the interdisciplinary, fragmented, and fast-evolving expertise in this domain prevents general-purpose LLMs, which lack domain knowledge and physical-constraint awareness, from delivering precise engineering-aligned inference and generation. To address these challenges, we introduce Helios, a large language model tailored to the smart energy domain, together with a comprehensive suite of resources to advance LLM research in this field. Specifically, we develop Enersys, a multi-agent collaborative framework for end-to-end dataset construction, through which we produce: (1) a smart energy knowledge base, EnerBase, to enrich the model's foundational expertise; (2) an instruction fine-tuning dataset, EnerInstruct, to strengthen performance on domain-specific downstream tasks; and (3) an RLHF dataset, EnerReinforce, to align the model with human preferences and industry standards. Leveraging these resources, Helios undergoes large-scale pretraining, SFT, and RLHF. We also release EnerBench, a benchmark for evaluating LLMs in smart energy scenarios, and demonstrate that our approach significantly enhances domain knowledge mastery, task execution accuracy, and alignment with human preferences.
๐
2025-12-22
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Linzhi Chen, Yang Sun, Hongru Wei, Yuqi Chen
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Low-Rank Adaptation (LoRA) has emerged as an efficient method for fine-tuning large language models (LLMs) and is widely adopted within the open-source community. However, the decentralized dissemination of LoRA adapters through platforms such as Hugging Face introduces novel security vulnerabilities: malicious adapters can be easily distributed and evade conventional oversight mechanisms. Despite these risks, backdoor attacks targeting LoRA-based fine-tuning remain relatively underexplored. Existing backdoor attack strategies are ill-suited to this setting, as they often rely on inaccessible training data, fail to account for the structural properties unique to LoRA, or suffer from high false trigger rates (FTR), thereby compromising their stealth. To address these challenges, we propose Causal-Guided Detoxify Backdoor Attack (CBA), a novel backdoor attack framework specifically designed for open-weight LoRA models. CBA operates without access to original training data and achieves high stealth through two key innovations: (1) a coverage-guided data generation pipeline that synthesizes task-aligned inputs via behavioral exploration, and (2) a causal-guided detoxification strategy that merges poisoned and clean adapters by preserving task-critical neurons. Unlike prior approaches, CBA enables post-training control over attack intensity through causal influence-based weight allocation, eliminating the need for repeated retraining. Evaluated across six LoRA models, CBA achieves high attack success rates while reducing FTR by 50-70\% compared to baseline methods. Furthermore, it demonstrates enhanced resistance to state-of-the-art backdoor defenses, highlighting its stealth and robustness.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiaao Wu, Xian Zhang, Fan Yang, Yinpeng Dong
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Vibe Reasoning, a human-AI collaborative paradigm for solving complex mathematical problems. Our key insight is that frontier AI models already possess the knowledge required to solve challenging problems -- they simply do not know how, what, or when to apply it. Vibe Reasoning transforms AI's latent potential into manifested capability through generic meta-prompts, agentic grounding, and model orchestration. We demonstrate this paradigm through IMO 2025 Problem 6, a combinatorial optimization problem where autonomous AI systems publicly reported failures. Our solution combined GPT-5's exploratory capabilities with Gemini 3 Pro's proof strengths, leveraging agentic workflows with Python code execution and file-based memory, to derive both the correct answer (2112) and a rigorous mathematical proof. Through iterative refinement across multiple attempts, we discovered the necessity of agentic grounding and model orchestration, while human prompts evolved from problem-specific hints to generic, transferable meta-prompts. We analyze why capable AI fails autonomously, how each component addresses specific failure modes, and extract principles for effective vibe reasoning. Our findings suggest that lightweight human guidance can unlock frontier models' mathematical reasoning potential. This is ongoing work; we are developing automated frameworks and conducting broader evaluations to further validate Vibe Reasoning's generality and effectiveness.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chang Dong, Jianfeng Tao, Chengliang Liu
Machine Learning
ๆฅ็ๆ่ฆ
Axial piston pumps are crucial components in fluid power systems, where reliable fault diagnosis is essential for ensuring operational safety and efficiency. Traditional data-driven methods require extensive labeled fault data, which is often impractical to obtain, while model-based approaches suffer from parameter uncertainties. This paper proposes a digital twin (DT)-driven zero-shot fault diagnosis framework utilizing fluid-borne noise (FBN) signals. The framework calibrates a high-fidelity DT model using only healthy-state data, generates synthetic fault signals for training deep learning classifiers, and employs a physics-informed neural network (PINN) as a virtual sensor for flow ripple estimation. Gradient-weighted class activation mapping (Grad-CAM) is integrated to visualize the decision-making process of neural networks, revealing that large kernels matching the subsequence length in time-domain inputs and small kernels in time-frequency domain inputs enable higher diagnostic accuracy by focusing on physically meaningful features. Experimental validations demonstrate that training on signals from the calibrated DT model yields diagnostic accuracies exceeding 95\% on real-world benchmarks, while uncalibrated models result in significantly lower performance, highlighting the framework's effectiveness in data-scarce scenarios.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ivan DeAndres-Tame, Chengwei Ye, Ruben Tolosana, Ruben Vera-Rodriguez, Shiqi Yu
ๆฅ็ๆ่ฆ
Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Carla Crivoi, Radu Tudor Ionescu
Machine Learning
ๆฅ็ๆ่ฆ
We present the first comprehensive empirical study of machine unlearning (MU) in hybrid quantum-classical neural networks. While MU has been extensively explored in classical deep learning, its behavior within variational quantum circuits (VQCs) and quantum-augmented architectures remains largely unexplored. First, we adapt a broad suite of unlearning methods to quantum settings, including gradient-based, distillation-based, regularization-based and certified techniques. Second, we introduce two new unlearning strategies tailored to hybrid models. Experiments across Iris, MNIST, and Fashion-MNIST, under both subset removal and full-class deletion, reveal that quantum models can support effective unlearning, but outcomes depend strongly on circuit depth, entanglement structure, and task complexity. Shallow VQCs display high intrinsic stability with minimal memorization, whereas deeper hybrid models exhibit stronger trade-offs between utility, forgetting strength, and alignment with retrain oracle. We find that certain methods, e.g. EU-k, LCA, and Certified Unlearning, consistently provide the best balance across metrics. These findings establish baseline empirical insights into quantum machine unlearning and highlight the need for quantum-aware algorithms and theoretical guarantees, as quantum machine learning systems continue to expand in scale and capability. We publicly release our code at: https://github.com/CrivoiCarla/HQML.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Do Minh Duc, Quan Xuan Truong, Nguyen Tat Dat, Nguyen Van Vinh
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Prompt engineering plays a critical role in adapting large language models (LLMs) to complex reasoning and labeling tasks without the need for extensive fine-tuning. In this paper, we propose a novel prompt optimization pipeline for frame detection in logistics texts, combining retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT) to generate highly effective task-specific prompts. Central to our approach is an LLM-based prompt optimizer agent that iteratively refines the prompts using retrieved examples, performance feedback, and internal self-evaluation. Our framework is evaluated on a real-world logistics text annotation task, where reasoning accuracy and labeling efficiency are critical. Experimental results show that the optimized prompts - particularly those enhanced via Auto-CoT and RAG - improve real-world inference accuracy by up to 15% compared to baseline zero-shot or static prompts. The system demonstrates consistent improvements across multiple LLMs, including GPT-4o, Qwen 2.5 (72B), and LLaMA 3.1 (70B), validating its generalizability and practical value. These findings suggest that structured prompt optimization is a viable alternative to full fine-tuning, offering scalable solutions for deploying LLMs in domain-specific NLP applications such as logistics.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mingxu Zhang, Dazhong Shen, Qi Zhang, Ying Sun
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) exhibit strong general reasoning but struggle in molecular science due to the lack of explicit chemical priors in standard string representations. Current solutions face a fundamental dilemma. Training-based methods inject priors into parameters, but this static coupling hinders rapid knowledge updates and often compromises the model's general reasoning capabilities. Conversely, existing training-free methods avoid these issues but rely on surface-level prompting, failing to provide the fine-grained atom-level priors essential for precise chemical reasoning. To address this issue, we introduce ChemATP, a framework that decouples chemical knowledge from the reasoning engine. By constructing the first atom-level textual knowledge base, ChemATP enables frozen LLMs to explicitly retrieve and reason over this information dynamically. This architecture ensures interpretability and adaptability while preserving the LLM's intrinsic general intelligence. Experiments show that ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anna-Maria Gueorguieva, Aylin Caliskan
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have been shown to exhibit social bias, however, bias towards non-protected stigmatized identities remain understudied. Furthermore, what social features of stigmas are associated with bias in LLM outputs is unknown. From psychology literature, it has been shown that stigmas contain six shared social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate if human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example deciding wether to recommend them for an internship. We find that stigmas rated by humans to be highly perilous (e.g., being a gang member or having HIV) have the most biased outputs from SocialStigmaQA prompts (60% of outputs from all models) while sociodemographic stigmas (e.g. Asian-American or old age) have the least amount of biased outputs (11%). We test if the amount of biased outputs could be decreased by using guardrail models, models meant to identify harmful input, using each LLM's respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API). We find that bias decreases significantly by 10.4%, 1.4%, and 7.8%, respectively. However, we show that features with significant effect on bias remain unchanged post-mitigation and that guardrail models often fail to recognize the intent of bias in prompts. This work has implications for using LLMs in scenarios involving stigmatized groups and we suggest future work towards improving guardrail models for bias mitigation.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Lingjun Mao, Jiawei Ren, Kun Zhou, Jixuan Chen, Ziqiao Ma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
LLMs and VLMs are increasingly deployed as embodied agents, yet existing benchmarks largely revolve around simple short-term tasks and struggle to capture rich realistic constraints that shape real-world decision making. To close this gap, we propose DeliveryBench, a city-scale embodied benchmark grounded in the real-world profession of food delivery. Food couriers naturally operate under long-horizon objectives (maximizing net profit over hours) while managing diverse constraints, e.g., delivery deadline, transportation expense, vehicle battery, and necessary interactions with other couriers and customers. DeliveryBench instantiates this setting in procedurally generated 3D cities with diverse road networks, buildings, functional locations, transportation modes, and realistic resource dynamics, enabling systematic evaluation of constraint-aware, long-horizon planning. We benchmark a range of VLM-based agents across nine cities and compare them with human players. Our results reveal a substantial performance gap to humans, and find that these agents are short-sighted and frequently break basic commonsense constraints. Additionally, we observe distinct personalities across models (e.g., adventurous GPT-5 vs. conservative Claude), highlighting both the brittleness and the diversity of current VLM-based embodied agents in realistic, constraint-dense environments. Our code, data, and benchmark are available at https://deliverybench.github.io.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Valentin Schmidberger, Manuel Eberhardinger, Setareh Maghsudi, Johannes Maucher
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Document forgery poses a growing threat to legal, economic, and governmental processes, requiring increasingly sophisticated verification mechanisms. One approach involves the use of plausibility checks, rule-based procedures that assess the correctness and internal consistency of data, to detect anomalies or signs of manipulation. Although these verification procedures are essential for ensuring data integrity, existing plausibility checks are manually implemented by software engineers, which is time-consuming. Recent advances in code generation with large language models (LLMs) offer new potential for automating and scaling the generation of these checks. However, adapting LLMs to the specific requirements of an unknown domain remains a significant challenge. This work investigates the extent to which LLMs, adapted on domain-specific code and data through different fine-tuning strategies, can generate rule-based plausibility checks for forgery detection on constrained hardware resources. We fine-tune open-source LLMs, Llama 3.1 8B and OpenCoder 8B, on structured datasets derived from real-world application scenarios and evaluate the generated plausibility checks on previously unseen forgery patterns. The results demonstrate that the models are capable of generating executable and effective verification procedures. This also highlights the potential of LLMs as scalable tools to support human decision-making in security-sensitive contexts where comprehensibility is required.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunlong Liu, Shuyang Li, Pengyuan Liu, Yu Zhang, Rudi Stouffs
Machine Learning
ๆฅ็ๆ่ฆ
Perception research is increasingly modelled using streetscapes, yet many approaches still rely on pixel features or object co-occurrence statistics, overlooking the explicit relations that shape human perception. This study proposes a three stage pipeline that transforms street view imagery (SVI) into structured representations for predicting six perceptual indicators. In the first stage, each image is parsed using an open-set Panoptic Scene Graph model (OpenPSG) to extract object predicate object triplets. In the second stage, compact scene-level embeddings are learned through a heterogeneous graph autoencoder (GraphMAE). In the third stage, a neural network predicts perception scores from these embeddings. We evaluate the proposed approach against image-only baselines in terms of accuracy, precision, and cross-city generalization. Results indicate that (i) our approach improves perception prediction accuracy by an average of 26% over baseline models, and (ii) maintains strong generalization performance in cross-city prediction tasks. Additionally, the structured representation clarifies which relational patterns contribute to lower perception scores in urban scenes, such as graffiti on wall and car parked on sidewalk. Overall, this study demonstrates that graph-based structure provides expressive, generalizable, and interpretable signals for modelling urban perception, advancing human-centric and context-aware urban analytics.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce Image-LoRA, a lightweight parameter efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.
๐
2025-12-22
๐ NeurIPS
๐ฅ Jerry Wang, Ting Yiu Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
We present an interactive framework for evaluating whether large language models (LLMs) exhibit genuine "understanding" in a simple yet strategic environment. As a running example, we focus on Rock-Paper-Scissors (RPS), which, despite its apparent simplicity, requires sequential reasoning, adaptation, and strategy recognition. Our system positions the LLM as an Observer whose task is to identify which strategies are being played and to articulate the reasoning behind this judgment. The purpose is not to test knowledge of Rock-Paper-Scissors itself, but to probe whether the model can exhibit mind-like reasoning about sequential behavior. To support systematic evaluation, we provide a benchmark consisting of both static strategies and lightweight dynamic strategies specified by well-prompted rules. We quantify alignment between the Observer's predictions and the ground-truth distributions induced by actual strategy pairs using three complementary signals: Cross-Entropy, Brier score, and Expected Value (EV) discrepancy. These metrics are further integrated into a unified score, the Union Loss, which balances calibration, sensitivity, and payoff alignment. Together with a Strategy Identification Rate (SIR) metric, our framework captures not only predictive accuracy but also whether the model can stably identify the latent strategies in play. The demo emphasizes interactivity, transparency, and reproducibility. Users can adjust LLM distributions in real time, visualize losses as they evolve, and directly inspect reasoning snippets to identify where and why failures occur. In doing so, our system provides a practical and interpretable proxy for mind-like inference in sequential games, offering insights into both the strengths and limitations of current LLM reasoning.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, Cen Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although KV cache quantization is a promising compression technique, existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. Fixed-precision quantization struggles to handle outlier channels in the key cache, while current mixed-precision strategies fail to accurately identify components requiring high-precision representation. We find that an effective low-bit KV cache quantization strategy must consider two factors: a key channel's intrinsic quantization difficulty and its relevance to the query. Based on this insight, we propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels that need higher precision, while applying per-token quantization for value cache. Experiments on complex reasoning datasets demonstrate that our approach significantly outperforms existing low-bit methods, achieving performance comparable to a full-precision baseline at a substantially reduced memory footprint.
๐
2025-12-22
๐ Accepted for publication in Lecture Notes in Compu
๐ฅ Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos
Machine Learning
ๆฅ็ๆ่ฆ
The paper establishes generalization bounds for multitask deep neural networks using operator-theoretic techniques. The authors propose a tighter bound than those derived from conventional norm based methods by leveraging small condition numbers in the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space. This enhanced bound remains valid even in single output settings, outperforming existing Koopman based bounds. The resulting framework maintains key advantages such as flexibility and independence from network width, offering a more precise theoretical understanding of multitask deep learning in the context of kernel methods.
๐
2025-12-22
๐ Accepted for publication in Lecture Notes in Compu
๐ฅ Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning through an operator-theoretic framework. Our key development lies in strategically combining a Koopman based approach with existing techniques, achieving tighter generalization guarantees compared to traditional norm-based bounds. To mitigate computational challenges associated with Koopman-based methods, we introduce sketching techniques applicable to vector valued neural networks. These techniques yield excess risk bounds under generic Lipschitz losses, providing performance guarantees for applications including robust and multiple quantile regression. Furthermore, we propose a novel deep learning framework, deep vector-valued reproducing kernel Hilbert spaces (vvRKHS), leveraging Perron Frobenius (PF) operators to enhance deep kernel methods. We derive a new Rademacher generalization bound for this framework, explicitly addressing underfitting and overfitting through kernel refinement strategies. This work offers novel insights into the generalization properties of multitask learning with deep learning architectures, an area that has been relatively unexplored until recent developments.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Azadeh Alavi, Fatemeh Kouchmeshki, Abdolrahman Alavi
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Hybrid quantum and classical learning aims to couple quantum feature maps with the robustness of classical neural networks, yet most architectures treat the quantum circuit as an isolated feature extractor and merge its measurements with classical representations by direct concatenation. This neglects that the quantum and classical branches constitute distinct computational modalities and limits reliable performance on complex, high dimensional tabular and semi structured data, including remote sensing, environmental monitoring, and medical diagnostics. We present a multimodal formulation of hybrid learning and propose a cross attention mid fusion architecture in which a classical representation queries quantum derived feature tokens through an attention block with residual connectivity. The quantum branch is kept within practical NISQ budgets and uses up to nine qubits. We evaluate on Wine, Breast Cancer, Forest CoverType, FashionMNIST, and SteelPlatesFaults, comparing a quantum only model, a classical baseline, residual hybrid models, and the proposed mid fusion model under a consistent protocol. Pure quantum and standard hybrid designs underperform due to measurement induced information loss, while cross attention mid fusion is consistently competitive and improves performance on the more complex datasets in most cases. These findings suggest that quantum derived information becomes most valuable when integrated through principled multimodal fusion rather than used in isolation or loosely appended to classical features.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Jin Wang, Kim Tien Ly, Jacques Cloete, Nikos Tsagarakis, Ioannis Havoutis
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Bridging the gap between natural language commands and autonomous execution in unstructured environments remains an open challenge for robotics. This requires robots to perceive and reason over the current task scene through multiple modalities, and to plan their behaviors to achieve their intended goals. Traditional robotic task-planning approaches often struggle to bridge low-level execution with high-level task reasoning, and cannot dynamically update task strategies when instructions change during execution, which ultimately limits their versatility and adaptability to new tasks. In this work, we propose a novel language model-based framework for dynamic robot task planning. Our Vision-Language-Policy (VLP) model, based on a vision-language model fine-tuned on real-world data, can interpret semantic instructions and integrate reasoning over the current task scene to generate behavior policies that control the robot to accomplish the task. Moreover, it can dynamically adjust the task strategy in response to changes in the task, enabling flexible adaptation to evolving task requirements. Experiments conducted with different robots and a variety of real-world tasks show that the trained model can efficiently adapt to novel scenarios and dynamically update its policy, demonstrating strong planning autonomy and cross-embodiment generalization. Videos: https://robovlp.github.io/
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yin Jun Phua
Robotics
ๆฅ็ๆ่ฆ
The search for reliable indicators of consciousness has fragmented into competing theoretical camps (Global Workspace Theory (GWT), Integrated Information Theory (IIT), and Higher-Order Theories (HOT)), each proposing distinct neural signatures. We adopt a synthetic neuro-phenomenology approach: constructing artificial agents that embody these mechanisms to test their functional consequences through precise architectural ablations impossible in biological systems. Across three experiments, we report dissociations suggesting these theories describe complementary functional layers rather than competing accounts. In Experiment 1, a no-rewire Self-Model lesion abolishes metacognitive calibration while preserving first-order task performance, yielding a synthetic blindsight analogue consistent with HOT predictions. In Experiment 2, workspace capacity proves causally necessary for information access: a complete workspace lesion produces qualitative collapse in access-related markers, while partial reductions show graded degradation, consistent with GWT's ignition framework. In Experiment 3, we uncover a broadcast-amplification effect: GWT-style broadcasting amplifies internal noise, creating extreme fragility. The B2 agent family is robust to the same latent perturbation; this robustness persists in a Self-Model-off / workspace-read control, cautioning against attributing the effect solely to $z_{\text{self}}$ compression. We also report an explicit negative result: raw perturbational complexity (PCI-A) decreases under the workspace bottleneck, cautioning against naive transfer of IIT-adjacent proxies to engineered agents. These results suggest a hierarchical design principle: GWT provides broadcast capacity, while HOT provides quality control. We emphasize that our agents are not conscious; they are reference implementations for testing functional predictions of consciousness theories.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Geraud Nangue Tasse, Matthew Riemer, Benjamin Rosman, Tim Klinger
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent success in developing increasingly general purpose agents based on sequence models has led to increased focus on the problem of deploying computationally limited agents within the vastly more complex real-world. A key challenge experienced in these more realistic domains is highly non-Markovian dependencies with respect to the agent's observations, which are less common in small controlled domains. The predominant approach for dealing with this in the literature is to stack together a window of the most recent observations (Frame Stacking), but this window size must grow with the degree of non-Markovian dependencies, which results in prohibitive computational and memory requirements for both action inference and learning. In this paper, we are motivated by the insight that in many environments that are highly non-Markovian with respect to time, the environment only causally depends on a relatively small number of observations over that time-scale. A natural direction would then be to consider meta-algorithms that maintain relatively small adaptive stacks of memories such that it is possible to express highly non-Markovian dependencies with respect to time while considering fewer observations at each step and thus experience substantial savings in both compute and memory requirements. Hence, we propose a meta-algorithm (Adaptive Stacking) for achieving exactly that with convergence guarantees and quantify the reduced computation and memory constraints for MLP, LSTM, and Transformer-based agents. Our experiments utilize popular memory tasks, which give us control over the degree of non-Markovian dependencies. This allows us to demonstrate that an appropriate meta-algorithm can learn the removal of memories not predictive of future rewards without excessive removal of important experiences. Code: https://github.com/geraudnt/adaptive-stacking
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chenghao Li, Chaoning Zhang, Yi Lu, Shuxu Chen, Xudong Wang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
With the development of large language models (LLMs), particularly with the introduction of the long reasoning chain technique, the reasoning ability of LLMs in complex problem-solving has been significantly enhanced. While acknowledging the power of long reasoning chains, we cannot help but wonder: Why do different reasoning chains perform differently in reasoning? What components of the reasoning chains play a key role? Existing studies mainly focus on evaluating reasoning chains from a functional perspective, with little attention paid to their structural mechanisms. To address this gap, this work is the first to analyze and evaluate the quality of the reasoning chain from a structural perspective. We apply persistent homology from Topological Data Analysis (TDA) to map reasoning steps into semantic space, extract topological features, and analyze structural changes. These changes reveal semantic coherence, logical redundancy, and identify logical breaks and gaps. By calculating homology groups, we assess connectivity and redundancy at various scales, using barcode and persistence diagrams to quantify stability and consistency. Our results show that the topological structural complexity of reasoning chains correlates positively with accuracy. More complex chains identify correct answers sooner, while successful reasoning exhibits simpler topologies, reducing redundancy and cycles, enhancing efficiency and interpretability. This work provides a new perspective on reasoning chain quality assessment and offers guidance for future optimization.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoyu Jiang, Boan Qu, Junjie Zhu, Fanjie Zeng, Xiaojie Lin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid growth of artificial intelligence is exponentially escalating computational demand, inflating data center energy use and carbon emissions, and spurring rapid deployment of green data centers to relieve resource and environmental stress. Achieving sub-minute orchestration of renewables, storage, and loads, while minimizing PUE and lifecycle carbon intensity, hinges on accurate load forecasting. However, existing methods struggle to address small-sample scenarios caused by cold start, load distortion, multi-source data fragmentation, and distribution shifts in green data centers. We introduce HyperLoad, a cross-modality framework that exploits pre-trained large language models (LLMs) to overcome data scarcity. In the Cross-Modality Knowledge Alignment phase, textual priors and time-series data are mapped to a common latent space, maximizing the utility of prior knowledge. In the Multi-Scale Feature Modeling phase, domain-aligned priors are injected through adaptive prefix-tuning, enabling rapid scenario adaptation, while an Enhanced Global Interaction Attention mechanism captures cross-device temporal dependencies. The public DCData dataset is released for benchmarking. Under both data sufficient and data scarce settings, HyperLoad consistently surpasses state-of-the-art (SOTA) baselines, demonstrating its practicality for sustainable green data center management.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhe Yang, Xiaoshuang Sheng, Zhengnan Zhang, Jidong Wu, Zexing Wang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Identifying user intent from mobile UI operation trajectories is critical for advancing UI understanding and enabling task automation agents. While Multimodal Large Language Models (MLLMs) excel at video understanding tasks, their real-time mobile deployment is constrained by heavy computational costs and inefficient redundant frame processing. To address these issues, we propose the FC-MIR framework: leveraging keyframe sampling and adaptive concatenation, it cuts visual redundancy to boost inference efficiency, while integrating state-of-the-art closed-source MLLMs or fine-tuned models (e.g., Qwen3-VL) for trajectory summarization and intent prediction. We further expand task scope to explore generating post-prediction operations and search suggestions, and introduce a fine-grained metric to evaluate the practical utility of summaries, predictions, and suggestions. For rigorous assessment, we construct a UI trajectory dataset covering scenarios from UI-Agents (Agent-I) and real user interactions (Person-I). Experimental results show our compression method retains performance at 50%-60% compression rates; both closed-source and fine-tuned MLLMs demonstrate strong intent summarization, supporting potential lightweight on-device deployment. However, MLLMs still struggle with useful and "surprising" suggestions, leaving room for improvement. Finally, we deploy the framework in a real-world setting, integrating UI perception and UI-Agent proxies to lay a foundation for future progress in this field.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Danny Dongyeop Han, Yonghyeon Gwon, Ahhyun Lucy Lee, Taeyang Lee, Seong Jin Lee et al.
ๆฅ็ๆ่ฆ
Electrophysiology signals such as EEG and iEEG are central to neuroscience, brain-computer interfaces, and clinical applications, yet existing foundation models remain limited in scale despite clear evidence that scaling improves performance. We introduce DIVER-1, a family of EEG and iEEG foundation models trained on the largest and most diverse corpus to date-5.3k hours of iEEG and 54k hours of EEG (1.6M channel-hours from over 17.7k subjects)-and scaled up to 1.82B parameters. We present the first systematic scaling law analysis for this domain, showing that they follow data-constrained scaling laws: for a given amount of data and compute, smaller models trained for extended epochs consistently outperform larger models trained briefly. This behavior contrasts with prior electrophysiology foundation models that emphasized model size over training duration. To achieve strong performance, we also design architectural innovations including any-variate attention, sliding temporal conditional positional encoding, and multi-domain reconstruction. DIVER-1 iEEG and EEG models each achieve state-of-the-art performance on their respective benchmarks, establishing a concrete guidelines for efficient scaling and resource allocation in electrophysiology foundation model development.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kathelijne Coussement, Gert de Cooman, Keano De Vos
ๆฅ็ๆ่ฆ
We discuss conditionalisation for Accept-Desirability models in an abstract decision-making framework, where uncertain rewards live in a general linear space, and events are special projection operators on that linear space. This abstract setting allows us to unify classical and quantum probabilities, and extend them to an imprecise probabilities context. We introduce a new conditioning rule for our Accept-Desirability models, based on the idea that observing an event introduces new indifferences between options. We associate a belief revision operator with our conditioning rule, and investigate which of the AGM axioms for belief revision still hold in our more general framework. We investigate two interesting special cases where all of these axioms are shown to still hold: classical propositional logic and full conditional probabilities.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Peiqing Lu, Yuan Zhang, Haoyun Zhang, Jiasen Zheng, Kejian Tong et al.
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Bilingual mathematical problem solving needs a clear link between language reasoning and symbolic calculation. Large language models often handle language well but are weak in accurate computation. This paper presents HERALD (Hybrid Ensemble Reasoning with Adaptive Learning and Distillation), a framework that joins reasoning and calculation using NuminaMath-7B-TIR, GPT-4o, and Mistral-7B. HERALD uses adaptive routing, tool-based reinforcement learning, and knowledge distillation to connect different reasoning paths. Confidence calibration keeps weighting stable, and dual-path checking keeps results correct. Reinforcement learning controls tool use to cut redundancy, and distillation lowers delay without hurting accuracy. The system shows that combining symbolic checking, adaptive ensembles, and bilingual fine-tuning helps achieve both fluent reasoning and precise calculation. HERALD offers a practical solution for multilingual mathematical reasoning with better accuracy, stability, and clarity.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mark Burgess
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The semantics and dynamics of `attention' are closely related to promise theoretic notions developed for autonomous agents and can thus easily be written down in promise framework. In this way one may establish a bridge between vectorized Machine Learning and Knowledge Graph representations without relying on language models implicitly. Our expectations for knowledge presume a degree of statistical stability, i.e. average invariance under repeated observation, or `trust' in the data. Both learning networks and knowledge graph representations can meaningfully coexist to preserve different aspects of data. While vectorized data are useful for probabilistic estimation, graphs preserve the intentionality of the source even under data fractionation. Using a Semantic Spacetime $ฮณ(3,4)$ graph, one avoids complex ontologies in favour of classification of features by their roles in semantic processes. The latter favours an approach to reasoning under conditions of uncertainty. Appropriate attention to causal boundary conditions may lead to orders of magnitude compression of data required for such context determination, as required in the contexts of autonomous robotics, defence deployments, and ad hoc emergency services.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yanzhi Zhang, Yitong Duan, Zhaoxi Zhang, Jiyan He, Shuxin Zheng
Natural Language Processing
ๆฅ็ๆ่ฆ
Test-time scaling has emerged as a promising direction for enhancing the reasoning capabilities of Large Language Models in last few years. In this work, we propose Population-Evolve, a training-free method inspired by Genetic Algorithms to optimize LLM reasoning. Our approach maintains a dynamic population of candidate solutions for each problem via parallel reasoning. By incorporating an evolve prompt, the LLM self-evolves its population in all iterations. Upon convergence, the final answer is derived via majority voting. Furthermore, we establish a unification framework that interprets existing test-time scaling strategies through the lens of genetic algorithms. Empirical results demonstrate that Population-Evolve achieves superior accuracy with low performance variance and computational efficiency. Our findings highlight the potential of evolutionary strategies to unlock the reasoning power of LLMs during inference.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Siddharth Tandon
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at diverse tasks, but their deployment on resource-constrained devices remains challenging. Existing methods like quantization, pruning, and distillation can reduce memory footprint but often demand extensive experimentation and careful infrastructure design. Leveraging existing techniques for extracting high-level concepts (represented as steering vectors) from larger models, we investigate their transferability to smaller language models (SLM) during inference. We demonstrate through extensive experimentation that these concepts can be effectively transferred to smaller models, irrespective of their family (e.g., Phi, Llama, Qwen), leading to performance improvements across a wide range of tasks. Furthermore, we introduce inference-time scaling to enhance performance by dynamically adjusting the steering intensity which has resulted in a 7-15\% of accuracy improvement for Qwen3-0.6B.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chi Liu
ๆฅ็ๆ่ฆ
Collaborative fraud, where multiple fraudulent accounts coordinate to exploit online payment systems, poses significant challenges due to the formation of complex network structures. Traditional detection methods that rely solely on high-confidence identity links suffer from limited coverage, while approaches using all available linkages often result in fragmented graphs with reduced clustering effectiveness. In this paper, we propose a novel graph-based fraud detection framework that addresses the challenge of large-scale heterogeneous graph clustering through a principled link transformation approach. Our method distinguishes between \emph{hard links} (high-confidence identity relationships such as phone numbers, credit cards, and national IDs) and \emph{soft links} (behavioral associations including device fingerprints, cookies, and IP addresses). We introduce a graph transformation technique that first identifies connected components via hard links, merges them into super-nodes, and then reconstructs a weighted soft-link graph amenable to efficient embedding and clustering. The transformed graph is processed using LINE (Large-scale Information Network Embedding) for representation learning, followed by HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) for density-based cluster discovery. Experiments on a real-world payment platform dataset demonstrate that our approach achieves significant graph size reduction (from 25 million to 7.7 million nodes), doubles the detection coverage compared to hard-link-only baselines, and maintains high precision across identified fraud clusters. Our framework provides a scalable and practical solution for industrial-scale fraud detection systems.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ariana Azarbal, Victor Gillioz, Vladimir Ivanov, Bryce Woodworth, Jacob Drori et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Developers often struggle to specify correct training labels and rewards. Perhaps they don't need to. We propose recontextualization, which reduces how often language models "game" training signals, performing misbehaviors those signals mistakenly reinforce. We show recontextualization prevents models from learning to 1) prioritize evaluation metrics over chat response quality; 2) special-case code to pass incorrect tests; 3) lie to users; and 4) become sycophantic. Our method works by generating completions from prompts discouraging misbehavior and then recontextualizing them as though they were in response to prompts permitting misbehavior. Recontextualization trains language models to resist misbehavior even when instructions permit it. This mitigates the reinforcement of misbehavior from misspecified training signals, reducing specification gaming without improving the supervision signal.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Connor Kilrain, David Carlyn, Julia Chae, Sara Beery, Wei-Lun Chao et al.
Machine Learning
ๆฅ็ๆ่ฆ
The rise of personalized generative models raises a central question: how should we evaluate identity preservation? Given a reference image (e.g., one's pet), we expect the generated image to retain precise details attached to the subject's identity. However, current generative evaluation metrics emphasize the overall semantic similarity between the reference and the output, and overlook these fine-grained discriminative details. We introduce Finer-Personalization Rank, an evaluation protocol tailored to identity preservation. Instead of pairwise similarity, Finer-Personalization Rank adopts a ranking view: it treats each generated image as a query against an identity-labeled gallery consisting of visually similar real images. Retrieval metrics (e.g., mean average precision) measure performance, where higher scores indicate that identity-specific details (e.g., a distinctive head spot) are preserved. We assess identity at multiple granularities -- from fine-grained categories (e.g., bird species, car models) to individual instances (e.g., re-identification). Across CUB, Stanford Cars, and animal Re-ID benchmarks, Finer-Personalization Rank more faithfully reflects identity retention than semantic-only metrics and reveals substantial identity drift in several popular personalization methods. These results position the gallery-based protocol as a principled and practical evaluation for personalized generation.
๐
2025-12-22
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Hengrui Jia, Taoran Li, Jonas Guan, Varun Chandrasekaran
Natural Language Processing
ๆฅ็ๆ่ฆ
Machine unlearning aims to remove specific data influences from trained models, a capability essential for adhering to copyright laws and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning--motivated by copyright or safety--implicitly target not only verbatim content in $D_u$, but also behaviors influenced by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluation and appear to have ``forgotten'' the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily equate to removing the underlying knowledge. To address this gap, we propose \name, an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct in embedding space. By comparing unlearning metric scores between $D_u$ and $\tilde{D}_u$, we can stress-test the reliability of the metric itself. Our extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-$ฮฒ$), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. We find that current metrics frequently overestimate unlearning success, failing to detect retained knowledge exposed by our stress-test datasets.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Xu Liu, Yu Liu, Hanshuo Qiu, Yang Qirong, Zhouhui Lian
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Navigation (VLN) enables agents to navigate in complex environments by following natural language instructions grounded in visual observations. Although most existing work has focused on ground-based robots or outdoor Unmanned Aerial Vehicles (UAVs), indoor UAV-based VLN remains underexplored, despite its relevance to real-world applications such as inspection, delivery, and search-and-rescue in confined spaces. To bridge this gap, we introduce \textbf{IndoorUAV}, a novel benchmark and method specifically tailored for VLN with indoor UAVs. We begin by curating over 1,000 diverse and structurally rich 3D indoor scenes from the Habitat simulator. Within these environments, we simulate realistic UAV flight dynamics to collect diverse 3D navigation trajectories manually, further enriched through data augmentation techniques. Furthermore, we design an automated annotation pipeline to generate natural language instructions of varying granularity for each trajectory. This process yields over 16,000 high-quality trajectories, comprising the \textbf{IndoorUAV-VLN} subset, which focuses on long-horizon VLN. To support short-horizon planning, we segment long trajectories into sub-trajectories by selecting semantically salient keyframes and regenerating concise instructions, forming the \textbf{IndoorUAV-VLA} subset. Finally, we introduce \textbf{IndoorUAV-Agent}, a novel navigation model designed for our benchmark, leveraging task decomposition and multimodal reasoning. We hope IndoorUAV serves as a valuable resource to advance research on vision-language embodied AI in the indoor aerial navigation domain.
๐
2025-12-22
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Akshaj Prashanth Rao, Advait Singh, Saumya Kumaar Saksena, Dhruv Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Prompt injection and jailbreaking attacks pose persistent security challenges to large language model (LLM)-based systems. We present an efficient and systematically evaluated defense architecture that mitigates these threats through a lightweight, multi-stage pipeline. Its core component is a semantic filter based on text normalization, TF-IDF representations, and a Linear SVM classifier. Despite its simplicity, this module achieves 93.4% accuracy and 96.5% specificity on held-out data, substantially reducing attack throughput while incurring negligible computational overhead.
Building on this efficient foundation, the full pipeline integrates complementary detection and mitigation mechanisms that operate at successive stages, providing strong robustness with minimal latency. In comparative experiments, our SVM-based configuration improves overall accuracy from 35.1% to 93.4% while reducing average time to completion from approximately 450s to 47s, yielding over 10 times lower latency than ShieldGemma. These results demonstrate that the proposed design simultaneously advances defensive precision and efficiency, addressing a core limitation of current model-based moderators.
Evaluation across a curated corpus of over 30,000 labeled prompts, including benign, jailbreak, and application-layer injections, confirms that staged, resource-efficient defenses can robustly secure modern LLM-driven applications.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Konstantin Kaulen, Tobias Ladner, Stanley Bak, Christopher Brix, Hai Duong et al.
Machine Learning
ๆฅ็ๆ่ฆ
This report summarizes the 6th International Verification of Neural Networks Competition (VNN-COMP 2025), held as a part of the 8th International Symposium on AI Verification (SAIV), that was collocated with the 37th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2025 iteration, 8 teams participated on a diverse set of 16 regular and 9 extended benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tongyuan Miao, Gary Huang, Kai Jun Han, Annie Jiang
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization.
We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35\% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Lingjie Zhao, Xue Yu, Yongzhi Qi, Hao Hu, Jianshen Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
As the pursuit of synergy between Artificial Intelligence (AI) and Operations Research (OR) gains momentum in handling complex inventory systems, a critical challenge persists: how to effectively reconcile AI's adaptive perception with OR's structural rigor. To bridge this gap, we propose a novel OR-Guided "Pretrain-then-Reinforce" framework. To provide structured guidance, we propose a simulation-augmented OR model that generates high-quality reference decisions, implicitly capturing complex business constraints and managerial preferences. Leveraging these OR-derived decisions as foundational training labels, we design a domain-informed deep learning foundation model to establish foundational decision-making capabilities, followed by a reinforcement learning (RL) fine-tuning stage. Uniquely, we position RL as a deep alignment mechanism that enables the AI agent to internalize the optimality principles of OR, while simultaneously leveraging exploration for general policy refinement and allowing expert guidance for scenario-specific adaptation (e.g., promotional events). Validated through extensive numerical experiments and a field deployment at JD.com augmented by a Difference-in-Differences (DiD) analysis, our model significantly outperforms incumbent industrial practices, delivering real-world gains of a 5.27-day reduction in turnover and a 2.29% increase in in-stock rates, alongside a 29.95% decrease in holding costs. Contrary to the prevailing trend of brute-force model scaling, our study demonstrates that a lightweight, domain-informed model can deliver state-of-the-art performance and robust transferability when guided by structured OR logic. This approach offers a scalable and cost-effective paradigm for intelligent supply chain management, highlighting the value of deeply aligning AI with OR.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jinyan Liu, Zikang Chen, Qinchuan Wang, Tan Xie, Heming Zheng et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method-built on task decomposition, semantic clustering, and flow management-substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Dominant paradigms for 4D LiDAR panoptic segmentation are usually required to train deep neural networks with large superimposed point clouds or design dedicated modules for instance association. However, these approaches perform redundant point processing and consequently become computationally expensive, yet still overlook the rich geometric priors inherently provided by raw point clouds. To this end, we introduce ICP-4D, a simple yet effective training-free framework that unifies spatial and temporal reasoning through geometric relations among instance-level point sets. Specifically, we apply the Iterative Closest Point (ICP) algorithm to directly associate temporally consistent instances by aligning the source and target point sets through the estimated transformation. To stabilize association under noisy instance predictions, we introduce a Sinkhorn-based soft matching. This exploits the underlying instance distribution to obtain accurate point-wise correspondences, resulting in robust geometric alignment. Furthermore, our carefully designed pipeline, which considers three instance types-static, dynamic, and missing-offers computational efficiency and occlusion-aware matching. Our extensive experiments across both SemanticKITTI and panoptic nuScenes demonstrate that our method consistently outperforms state-of-the-art approaches, even without additional training or extra point cloud inputs.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kun Zhao, Siyuan Dai, Yingying Zhang, Guodong Liu, Pengfei Gu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Early detection of Alzheimer's disease (AD) requires models capable of integrating macro-scale neuroanatomical alterations with micro-scale genetic susceptibility, yet existing multimodal approaches struggle to align these heterogeneous signals. We introduce R-GenIMA, an interpretable multimodal large language model that couples a novel ROI-wise vision transformer with genetic prompting to jointly model structural MRI and single nucleotide polymorphisms (SNPs) variations. By representing each anatomically parcellated brain region as a visual token and encoding SNP profiles as structured text, the framework enables cross-modal attention that links regional atrophy patterns to underlying genetic factors. Applied to the ADNI cohort, R-GenIMA achieves state-of-the-art performance in four-way classification across normal cognition (NC), subjective memory concerns (SMC), mild cognitive impairment (MCI), and AD. Beyond predictive accuracy, the model yields biologically meaningful explanations by identifying stage-specific brain regions and gene signatures, as well as coherent ROI-Gene association patterns across the disease continuum. Attention-based attribution revealed genes consistently enriched for established GWAS-supported AD risk loci, including APOE, BIN1, CLU, and RBFOX1. Stage-resolved neuroanatomical signatures identified shared vulnerability hubs across disease stages alongside stage-specific patterns: striatal involvement in subjective decline, frontotemporal engagement during prodromal impairment, and consolidated multimodal network disruption in AD. These results demonstrate that interpretable multimodal AI can synthesize imaging and genetics to reveal mechanistic insights, providing a foundation for clinically deployable tools that enable earlier risk stratification and inform precision therapeutic strategies in Alzheimer's disease.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng-Hong Chang, Pei-Hsuan Tsai
ๆฅ็ๆ่ฆ
Object recognition has become prevalent across various industries. However, most existing applications are limited to identifying objects alone, without considering their associated states. The ability to recognize both the state and object simultaneously remains less common. One approach to address this is by treating state and object as a single category during training. However, this approach poses challenges in data collection and training since it requires comprehensive data for all possible combinations. Compositional Zero-shot Learning (CZSL) emerges as a viable solution by treating the state and object as distinct categories during training. CZSL facilitates the identification of novel compositions even in the absence of data for every conceivable combination. The current state-of-the-art method, KG-SP, addresses this issue by training distinct classifiers for states and objects, while leveraging a semantic model to evaluate the plausibility of composed compositions. However, KG-SP's accuracy in state and object recognition can be further improved, and it fails to consider the weighting of states and objects during composition. In this study, we propose SASOW, an enhancement of KG-SP that considers the weighting of states and objects while improving composition recognition accuracy. First, we introduce self-attention mechanisms into the classifiers for states and objects, leading to enhanced accuracy in recognizing both. Additionally, we incorporate the weighting of states and objects during composition to generate more reasonable and accurate compositions. Our validation process involves testing SASOW on three established benchmark datasets. Experimental outcomes affirm when compared against OW-CZSL approach, KG-SP, SASOW showcases improvements of 2.1%, 1.7%, and 0.4% in terms of accuracy for unseen compositions across the MIT-States, UT Zappos, and C-GQA datasets, respectively.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yizhi Wang, Linan Yue, Min-Ling Zhang
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks through long Chain-of-Thought (CoT) reasoning. Extending these successes to multimodal reasoning remains challenging due to the increased complexity of integrating diverse input modalities and the scarcity of high-quality long CoT training data. Existing multimodal datasets and CoT synthesis methods still suffer from limited reasoning depth, modality conversion errors, and rigid generation pipelines, hindering model performance and stability. To this end, in this paper, we propose SynSelect, a novel three-stage Synthesis-Selection framework for generating high-quality long CoT data tailored to multimodal reasoning tasks. Specifically, SynSelect first leverages multiple heterogeneous multimodal LRMs to produce diverse candidate CoTs, and then applies both instance and batch level selection to filter high-quality CoTs that can effectively enhance the model's reasoning capabilities. Extensive experiments on multiple multimodal benchmarks demonstrate that models supervised fine-tuned on SynSelect-generated data significantly outperform baselines and achieve further improvements after reinforcement learning post-training. Our results validate SynSelect as an effective approach for advancing multimodal LRMs reasoning capabilities.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Saman Forouzandeh, Wei Peng, Parham Moradi, Xinghuo Yu, Mahdi Jalili
Natural Language Processing
ๆฅ็ๆ่ฆ
We present MACLA, a framework that decouples reasoning from learning by maintaining a frozen large language model while performing all adaptation in an external hierarchical procedural memory. MACLA extracts reusable procedures from trajectories, tracks reliability via Bayesian posteriors, selects actions through expected-utility scoring, and refines procedures by contrasting successes and failures. Across four benchmarks (ALFWorld, WebShop, TravelPlanner, InterCodeSQL), MACLA achieves 78.1 percent average performance, outperforming all baselines. On ALFWorld unseen tasks, MACLA reaches 90.3 percent with 3.1 percent positive generalization. The system constructs memory in 56 seconds, 2800 times faster than the state-of-the-art LLM parameter-training baseline, compressing 2851 trajectories into 187 procedures. Experimental results demonstrate that structured external memory with Bayesian selection and contrastive refinement enables sample-efficient, interpretable, and continually improving agents without LLM parameter updates.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Li Yan, Bolun Liu, Chao Li, Jing Liang, Kunjie Yu et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Dynamic multimodal multiobjective optimization presents the dual challenge of simultaneously tracking multiple equivalent pareto optimal sets and maintaining population diversity in time-varying environments. However, existing dynamic multiobjective evolutionary algorithms often neglect solution modality, whereas static multimodal multiobjective evolutionary algorithms lack adaptability to dynamic changes. To address above challenge, this paper makes two primary contributions. First, we introduce a new benchmark suite of dynamic multimodal multiobjective test functions constructed by fusing the properties of both dynamic and multimodal optimization to establish a rigorous evaluation platform. Second, we propose a novel algorithm centered on a Clustering-based Autoencoder prediction dynamic response mechanism, which utilizes an autoencoder model to process matched clusters to generate a highly diverse initial population. Furthermore, to balance the algorithm's convergence and diversity, we integrate an adaptive niching strategy into the static optimizer. Empirical analysis on 12 instances of dynamic multimodal multiobjective test functions reveals that, compared with several state-of-the-art dynamic multiobjective evolutionary algorithms and multimodal multiobjective evolutionary algorithms, our algorithm not only preserves population diversity more effectively in the decision space but also achieves superior convergence in the objective space.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Michael S. Zhang, Rishi A. Ruia, Arnav Kewalram, Saathvik Dharmapuram, Utkarsh Sharma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and replay buffer strategies in large language models, revealing unexpected dynamics. While FP16 achieves superior initial task performance (74.44% on NLU), we observe a striking inversion on subsequent tasks: quantized models outperform FP16 by 8-15% on final task forward accuracy, with INT4 achieving nearly double FP16's performance on Code generation (40% vs 20%). Critically, even minimal replay buffers (0.1%) dramatically improve retention - increasing NLU retention after Math training from 45% to 65% across all precision levels - with INT8 consistently achieving the optimal balance between learning plasticity and knowledge retention. We hypothesize that quantization-induced noise acts as implicit regularization, preventing the overfitting to new task gradients that plagues high-precision models. These findings challenge the conventional wisdom that higher precision is always preferable, suggesting instead that INT8 quantization offers both computational efficiency and superior continual learning dynamics. Our results provide practical guidelines for deploying compressed models in continual learning scenarios: small replay buffers (1-2%) suffice for NLU tasks, while Math and Code benefit from moderate buffers (5-10%), with quantized models requiring less replay than FP16 to achieve comparable retention. Code is available at https://github.com/Festyve/LessIsMore.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Raina Panda, Daniel Fein, Arpita Singhal, Mark Fiore, Maneesh Agrawala et al.
RoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Artistic style transfer in generative models remains a significant challenge, as existing methods often introduce style only via model fine-tuning, additional adapters, or prompt engineering, all of which can be computationally expensive and may still entangle style with subject matter. In this paper, we introduce a training- and inference-light, interpretable method for representing and transferring artistic style. Our approach leverages an art-specific Sparse Autoencoder (SAE) on top of latent embeddings of generative image models. Trained on artistic data, our SAE learns an emergent, largely disentangled set of stylistic and compositional concepts, corresponding to style-related elements pertaining brushwork, texture, and color palette, as well as semantic and structural concepts. We call it LouvreSAE and use it to construct style profiles: compact, decomposable steering vectors that enable style transfer without any model updates or optimization. Unlike prior concept-based style transfer methods, our method requires no fine-tuning, no LoRA training, and no additional inference passes, enabling direct steering of artistic styles from only a few reference images. We validate our method on ArtBench10, achieving or surpassing existing methods on style evaluations (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster and, critically, interpretable.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu
ๆฅ็ๆ่ฆ
Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like the prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pablo Ruiz-Ponce, Sergio Escalera, Josรฉ Garcรญa-Rodrรญguez, Jiankang Deng, Rolandos Alexandros Potamias
ๆฅ็ๆ่ฆ
Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.
๐
2025-12-22
๐ ArXiv Preprint (cs.SD)
๐ฅ Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects-avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang et al.
ๆฅ็ๆ่ฆ
Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, it is observed that the current thinking process during generation mainly focuses on the text consistency with the text prompt, ignoring the \textbf{visual context consistency} with the visual reference images during the multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in the failure in maintaining key visual features (like human ID, object attribute, style). To this end, we integrate the visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency by 1) Adaptive Visual Planning: generating structured visual check list to figure out the visual element of needed consistency keeping, and 2) Iterative Visual Correction: performing self-reflection with the guidance of check lists and refining the generated result in an iterative manner. To achieve this, we use supervised finetuning to teach the model how to plan the visual checking, conduct self-reflection and self-refinement, and use flow-GRPO to further enhance the visual consistency through a customized visual checking reward. The experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dixuan Lin, Tianyou Wang, Zhuoyang Pan, Yufu Wang, Lingjie Liu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. It is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand centric coordinates and ignore the scene, hindering metric accuracy and practical use. In our method, we first use data-driven foundation models to initialize the core components, including the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand-object motion from grasping to interaction, which remains consistent with the scene information observed in the input video.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-$ฯ$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$ฯ$ formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$ฯ$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy, uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency of tokens. VA-$ฯ$ enables rapid adaptation of existing AR generators, without neither tokenizer retraining nor external reward models. With only 1% ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both visual generation model (LlamaGen: from 0.306 to 0.339) and unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, Xinchao Wang
Machine Learning
ๆฅ็ๆ่ฆ
Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: \href{https://hyokong.github.io/worldwarp-page/}{https://hyokong.github.io/worldwarp-page/}.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mojtaba Safari, Shansong Wang, Vanessa L Wildman, Mingzhe Hu, Zach Eidex et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan, yet existing deep learning methods face fidelity-efficiency trade-offs. Purpose: To develop a computationally efficient and accurate deep learning framework for MRI SR that preserves anatomical detail for clinical integration. Materials and Methods: We propose a novel SR framework combining multi-head selective state-space models (MHSSM) with a lightweight channel MLP. The model uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Evaluation used 7T brain T1 MP2RAGE maps (n=142) and 1.5T prostate T2w MRI (n=334). Comparisons included Bicubic interpolation, GANs (CycleGAN, Pix2pix, SPSR), transformers (SwinIR), Mamba (MambaIR), and diffusion models (I2SB, Res-SRDiff). Results: Our model achieved superior performance with exceptional efficiency. For 7T brain data: SSIM=0.951+-0.021, PSNR=26.90+-1.41 dB, LPIPS=0.076+-0.022, GMSD=0.083+-0.017, significantly outperforming all baselines (p<0.001). For prostate data: SSIM=0.770+-0.049, PSNR=27.15+-2.19 dB, LPIPS=0.190+-0.095, GMSD=0.087+-0.013. The framework used only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% versus Res-SRDiff, while outperforming SwinIR and MambaIR in accuracy and efficiency. Conclusion: The proposed framework provides an efficient, accurate MRI SR solution, delivering enhanced anatomical detail across datasets. Its low computational demand and state-of-the-art performance show strong potential for clinical translation.
๐
2025-12-22
๐ ArXiv Preprint (econ.GN)
๐ฅ Niclas Griesshaber, Jochen Streb
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper in constructing the patent dataset from our image corpus. About 20 to 50 patent entries are embedded on each page, arranged in a double-column format and printed in Gothic and Roman fonts. The font and layout complexity of our primary source material suggests to us that multimodal LLMs are a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be easily adapted to other image corpora using LLM-assisted coding tools, lowering the barriers for less technical researchers. Finally, we explain the economics of deploying LLMs for historical dataset construction and conclude by speculating on the potential implications for the field of economic history.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Luchao Qi, Jiaye Wu, Jun Myeong Choi, Cary Phillips, Roni Sengupta et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
In professional video compositing workflows, artists must manually create environmental interactions-such as shadows, reflections, dust, and splashes-between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arnold Caleb Asiimwe, Carl Vondrick
Robotics
ๆฅ็ๆ่ฆ
We reinterpret 4D Gaussian Splatting as a continuous-time dynamical system, where scene motion arises from integrating a learned neural dynamical field rather than applying per-frame deformations. This formulation, which we call EvoGS, treats the Gaussian representation as an evolving physical system whose state evolves continuously under a learned motion law. This unlocks capabilities absent in deformation-based approaches:(1) sample-efficient learning from sparse temporal supervision by modeling the underlying motion law; (2) temporal extrapolation enabling forward and backward prediction beyond observed time ranges; and (3) compositional dynamics that allow localized dynamics injection for controllable scene synthesis. Experiments on dynamic scene benchmarks show that EvoGS achieves better motion coherence and temporal consistency compared to deformation-field baselines while maintaining real-time rendering
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Da Tan, Michael Beck, Christopher P. Bidinosti, Robert H. Gulden, Christopher J. Henry
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
The success of agricultural artificial intelligence depends heavily on large, diverse, and high-quality plant image datasets, yet collecting such data in real field conditions is costly, labor intensive, and seasonally constrained. This paper investigates diffusion-based generative modeling to address these challenges through plant image synthesis, indoor-to-outdoor translation, and expert preference aligned fine tuning. First, a Stable Diffusion model is fine tuned on captioned indoor and outdoor plant imagery to generate realistic, text conditioned images of canola and soybean. Evaluation using Inception Score, Frechet Inception Distance, and downstream phenotype classification shows that synthetic images effectively augment training data and improve accuracy. Second, we bridge the gap between high resolution indoor datasets and limited outdoor imagery using DreamBooth-based text inversion and image guided diffusion, generating translated images that enhance weed detection and classification with YOLOv8. Finally, a preference guided fine tuning framework trains a reward model on expert scores and applies reward weighted updates to produce more stable and expert aligned outputs. Together, these components demonstrate a practical pathway toward data efficient generative pipelines for agricultural AI.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Jiaqi Peng, Wenzhe Cai, Yuqiang Yang, Tai Wang, Yuan Shen et al.
Robotics
ๆฅ็ๆ่ฆ
Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error propagation.We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while metric-aware geometry memory enhances planning consistency and obstacle avoidance, leading to more than a 27.3\% improvement over oracle-localization baselines and strong generalization across embodiments and environments. The code and models have been made publicly available on the \href{https://steinate.github.io/logoplanner.github.io/}{project page}.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Artemis Panagopoulou, Aveek Purohit, Achin Kulshrestha, Soroosh Yazdani, Mohit Goyal
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that finetuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Eric Zimmermann, Harley Wiltzer, Justin Szeto, David Alvarez-Melis, Lester Mackey
ๆฅ็ๆ่ฆ
Recent breakthroughs in self-supervised Joint-Embedding Predictive Architectures (JEPAs) have established that regularizing Euclidean representations toward isotropic Gaussian priors yields provable gains in training stability and downstream generalization. We introduce a new, flexible family of KerJEPAs, self-supervised learning algorithms with kernel-based regularizers. One instance of this family corresponds to the recently-introduced LeJEPA Epps-Pulley regularizer which approximates a sliced maximum mean discrepancy (MMD) with a Gaussian prior and Gaussian kernel. By expanding the class of viable kernels and priors and computing the closed-form high-dimensional limit of sliced MMDs, we develop alternative KerJEPAs with a number of favorable properties including improved training stability and design flexibility.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marta Hasny, Laura Daza, Keno Bressem, Maxime Di Folco, Julia Schnabel
Multimodal
ๆฅ็ๆ่ฆ
Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as demographics or clinical measurements. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that can leverage all the tabular data during training while remaining robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning using a gated cross-attention module for multimodal fusion. During fine-tuning, we employ a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with disentangled gradient learning, this enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. The code is available at https://github.com/marteczkah/RoVTL.
๐
2025-12-22
๐ ArXiv Preprint (eess.IV)
๐ฅ Ziqian Huang, Boxiao Yu, Siqi Li, Savas Ozdemir, Sangjin Bae et al.
ๆฅ็ๆ่ฆ
Dynamic PET enables the quantitative estimation of physiology-related parameters and is widely utilized in research and increasingly adopted in clinical settings. Parametric imaging in dynamic PET requires kinetic modeling to estimate voxel-wise physiological parameters based on specific kinetic models. However, parametric images estimated through kinetic model fitting often suffer from low image quality due to the inherently ill-posed nature of the fitting process and the limited counts resulting from non-continuous data acquisition across multiple bed positions in whole-body PET. In this work, we proposed a diffusion model-based kinetic modeling framework for parametric image estimation, using the Patlak model as an example. The score function of the diffusion model was pre-trained on static total-body PET images and served as a prior for both Patlak slope and intercept images by leveraging their patch-wise similarity. During inference, the kinetic model was incorporated as a data-consistency constraint to guide the parametric image estimation. The proposed framework was evaluated on total-body dynamic PET datasets with different dose levels, demonstrating the feasibility and promising performance of the proposed framework in improving parametric image quality.
๐
2025-12-22
๐ ArXiv Preprint (astro-ph.CO)
๐ฅ Eric Guzman, Joel Meyers
Machine Learning
ๆฅ็ๆ่ฆ
The search for primordial gravitational waves is a central goal of cosmic microwave background (CMB) surveys. Isolating the characteristic $B$-mode polarization signal sourced by primordial gravitational waves is challenging for several reasons: the amplitude of the signal is inherently small; astrophysical foregrounds produce $B$-mode polarization contaminating the signal; and secondary $B$-mode polarization fluctuations are produced via the conversion of $E$ modes. Current and future low-noise, multi-frequency observations enable sufficient precision to address the first two of these challenges such that secondary $B$ modes will become the bottleneck for improved constraints on the amplitude of primordial gravitational waves. The dominant source of secondary $B$-mode polarization is gravitational lensing by large scale structure. Various strategies have been developed to estimate the lensing deflection and to reverse its effects the CMB, thus reducing confusion from lensing $B$ modes in the search for primordial gravitational waves. However, a few complications remain. First, there may be additional sources of secondary $B$-mode polarization, for example from patchy reionization or from cosmic polarization rotation. Second, the statistics of delensed CMB maps can become complicated and non-Gaussian, especially when advanced lensing reconstruction techniques are applied. We previously demonstrated how a deep learning network, ResUNet-CMB, can provide nearly optimal simultaneous estimates of multiple sources of secondary $B$-mode polarization. In this paper, we show how deep learning can be applied to estimate and remove multiple sources of secondary $B$-mode polarization, and we further show how this technique can be used in a likelihood analysis to produce nearly optimal, unbiased estimates of the amplitude of primordial gravitational waves.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Antonia Alomar, Mireia Masias, Marius George Linguraru, Federico M. Sukno, Gemma Piella
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Early detection of developmental disorders can be aided by analyzing infant craniofacial morphology, but modeling infant faces is challenging due to limited data and frequent spontaneous expressions. We introduce BabyFlow, a generative AI model that disentangles facial identity and expression, enabling independent control over both. Using normalizing flows, BabyFlow learns flexible, probabilistic representations that capture the complex, non-linear variability of expressive infant faces without restrictive linear assumptions. To address scarce and uncontrolled expressive data, we perform cross-age expression transfer, adapting expressions from adult 3D scans to enrich infant datasets with realistic and systematic expressive variants. As a result, BabyFlow improves 3D reconstruction accuracy, particularly in highly expressive regions such as the mouth, eyes, and nose, and supports synthesis and modification of infant expressions while preserving identity. Additionally, by integrating with diffusion models, BabyFlow generates high-fidelity 2D infant images with consistent 3D geometry, providing powerful tools for data augmentation and early facial analysis.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziqiao Peng, Yi Chen, Yifeng Ma, Guozhen Zhang, Zhiyao Sun et al.
Robotics
ๆฅ็ๆ่ฆ
Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process-early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) A two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model's text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi et al.
ๆฅ็ๆ่ฆ
Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Moritz Bรถhle, Amรฉlie Royer, Juliette Marrie, Edouard Grave, Patrick Pรฉrez
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa .
๐
2025-12-22
๐ 12 pages, 8 figures. Submitted to Journal of Oral
๐ฅ Chi Zhang, Braedon Gunn, Andrew M. Read-Fuller
ๆฅ็ๆ่ฆ
Poor adaptation of orbital implants remains a major contributor to postoperative complications and revision surgery. Although preformed orbital plates are widely used to reduce cost and operative time compared with customized implants, surgeons currently lack publicly available tools and standardized metrics to quantitatively compare plate fit across vendors, sizes, and patient anatomy. We developed SlicerOrbitSurgerySim, an open-source extension for the 3D Slicer platform that enables interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in a patient-specific virtual planning environment. The software generates reproducible quantitative plate-to-orbit distance metrics and visualization tools that support both patient-specific planning and population-level statistical analysis of plate adaptability. By facilitating objective comparison of implant designs and placement strategies, this tool aims to improve preoperative decision-making, reduce intraoperative plate modification, and promote collaborative research and surgical education. Pilot studies, sample datasets, and detailed tutorials are provided to support testing, transparency, and reproducibility.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marc Peral, Guillem Capellera, Luis Ferraz, Antonio Rubio, Antonio Agudo
Natural Language Processing
ๆฅ็ๆ่ฆ
In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the ball state and ball possessor under noisy or occluded conditions from real top league matches. We also introduce CropDrop, a modality-specific masking pre-training strategy that prevents over-reliance on image features and encourages the model to rely on cross-modal patterns during pre-training. We show the effectiveness of our approach on a large-scale dataset providing substantial improvements over state-of-the-art baselines in all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhuo He, Yingdong Ru, Qianying Liu, Paul Henderson, Nicolas Pugeault
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in neural rendering have achieved impressive results on photorealistic shading and relighting, by using a multilayer perceptron (MLP) as a regression model to learn the rendering equation from a real-world dataset. Such methods show promise for photorealistically relighting real-world objects, which is difficult to classical rendering, as there is no easy-obtained material ground truth. However, significant challenges still remain the dense connections in MLPs result in a large number of parameters, which requires high computation resources, complicating the training, and reducing performance during rendering. Data driven approaches require large amounts of training data for generalization; unbalanced data might bias the model to ignore the unusual illumination conditions, e.g. dark scenes. This paper introduces pbnds+: a novel physics-based neural deferred shading pipeline utilizing convolution neural networks to decrease the parameters and improve the performance in shading and relighting tasks; Energy regularization is also proposed to restrict the model reflection during dark illumination. Extensive experiments demonstrate that our approach outperforms classical baselines, a state-of-the-art neural shading model, and a diffusion-based method.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyang Song, Zelin Zang, Zuyao Chen, Xusheng Liang, Dong Yi et al.
RoboticsMachine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored, especially in clinical anatomical surgical images. Anatomy understanding tasks demand precise understanding and clinically coherent answers, which are difficult to achieve due to the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional Supervised Fine-Tuning (SFT) strategies. While recent work has demonstrated that Group Relative Policy Optimization (GRPO) can enhance reasoning in MLLMs without relying on large amounts of data, we find two weaknesses that hinder GRPO's reasoning performance in anatomy recognition: 1) knowledge cannot be effectively shared between different anatomical structures, resulting in uneven information gain and preventing the model from converging, and 2) the model quickly converges to a single reasoning path, suppressing the exploration of diverse strategies. To overcome these challenges, we propose two novel methods. First, we implement a progressive learning strategy called Anatomical Similarity Curriculum Learning by controlling question difficulty via the similarity of answer choices, enabling the model to master complex problems incrementally. Second, we utilize question augmentation referred to as Group Diversity Question Augmentation to expand the model's search space for difficult queries, mitigating the tendency to produce uniform responses. Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show our method achieves a significant improvement across the two benchmarks, demonstrating its effectiveness in enhancing the medical reasoning capabilities of MLLMs. The code can be found in https://github.com/tomato996/Anatomy-R1
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Georgios Voulgaris
Machine Learning
ๆฅ็ๆ่ฆ
Modern deep learning models operating on multi-modal visual signals often rely on inductive biases that are poorly aligned with the physical processes governing signal formation, leading to brittle performance under cross-spectral and real-world conditions. In particular, approaches that prioritise direct thermal cues struggle to capture indirect yet persistent environmental alterations induced by sustained heat emissions.
This work introduces a physics-aware representation learning framework that leverages multi-spectral information to model stable signatures of long-term physical processes. Specifically, a geological Short Wave Infrared (SWIR) ratio sensitive to soil property changes is integrated with Thermal Infrared (TIR) data through an intermediate fusion architecture, instantiated as FusionNet. The proposed backbone embeds trainable differential signal-processing priors within convolutional layers, combines mixed pooling strategies, and employs wider receptive fields to enhance robustness across spectral modalities.
Systematic ablations show that each architectural component contributes to performance gains, with DGCNN achieving 88.7% accuracy on the SWIR ratio and FusionNet reaching 90.6%, outperforming state-of-the-art baselines across five spectral configurations. Transfer learning experiments further show that ImageNet pretraining degrades TIR performance, highlighting the importance of modality-aware training for cross-spectral learning.
Evaluated on real-world data, the results demonstrate that combining physics-aware feature selection with principled deep learning architectures yields robust and generalisable representations, illustrating how first-principles signal modelling can improve multi-spectral learning under challenging conditions.
๐
2025-12-22
๐ The paper was accepted by SIAM Journal on Imaging
๐ฅ Meng Ding, Xiao Fu
ๆฅ็ๆ่ฆ
This work revisits the hyperspectral super-resolution (HSR) problem, i.e., fusing a pair of spatially co-registered hyperspectral (HSI) and multispectral (MSI) images to recover a super-resolution image (SRI) that enhances the spatial resolution of the HSI. Coupled tensor decomposition (CTD)-based methods have gained traction in this domain, offering recoverability guarantees under various assumptions. Existing models such as canonical polyadic decomposition (CPD) and Tucker decomposition provide strong expressive power but lack physical interpretability. The block-term decomposition model with rank-$(L_r, L_r, 1)$ terms (the LL1 model) yields interpretable factors under the linear mixture model (LMM) of spectral images, but LMM assumptions are often violated in practice -- primarily due to nonlinear effects such as endmember variability (EV). To address this, we propose modeling spectral images using a more flexible block-term tensor decomposition with rank-$(L_r, M_r, N_r)$ terms (the LMN model). This modeling choice retains interpretability, subsumes CPD, Tucker, and LL1 as special cases, and robustly accounts for non-ideal effects such as EV, offering a balanced tradeoff between expressiveness and interpretability for HSR. Importantly, under the LMN model for HSI and MSI, recoverability of the SRI can still be established under proper conditions -- providing strong theoretical support. Extensive experiments on synthetic and real datasets further validate the effectiveness and robustness of the proposed method compared with existing CTD-based approaches.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shaochen Bi, Yuting He, Weiming Wang, Hao Chen
ๆฅ็ๆ่ฆ
Combinatorial explosion problem caused by dual inputs presents a critical challenge in Deformable Medical Image Registration (DMIR). Since DMIR processes two images simultaneously as input, the combination relationships between features has grown exponentially, ultimately the model considers more interfering features during the feature modeling process. Introducing dynamics in the receptive fields and weights of the network enable the model to eliminate the interfering features combination and model the potential feature combination relationships. In this paper, we propose the Dynamic Stream Network (DySNet), which enables the receptive fields and weights to be dynamically adjusted. This ultimately enables the model to ignore interfering feature combinations and model the potential feature relationships. With two key innovations: 1) Adaptive Stream Basin (AdSB) module dynamically adjusts the shape of the receptive field, thereby enabling the model to focus on the feature relationships with greater correlation. 2) Dynamic Stream Attention (DySA) mechanism generates dynamic weights to search for more valuable feature relationships. Extensive experiments have shown that DySNet consistently outperforms the most advanced DMIR methods, highlighting its outstanding generalization ability. Our code will be released on the website: https://github.com/ShaochenBi/DySNet.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guoli Jia, Junyao Hu, Xinwei Long, Kai Tian, Kaiyan Zhang et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Image generation based on diffusion models has demonstrated impressive capability, motivating exploration into diverse and specialized applications. Owing to the importance of emotion in advertising, emotion-oriented image generation has attracted increasing attention. However, current emotion-oriented methods suffer from an affective shortcut, where emotions are approximated to semantics. As evidenced by two decades of research, emotion is not equivalent to semantics. To this end, we propose Emotion-Director, a cross-modal collaboration framework consisting of two modules. First, we propose a cross-Modal Collaborative diffusion model, abbreviated as MC-Diffusion. MC-Diffusion integrates visual prompts with textual prompts for guidance, enabling the generation of emotion-oriented images beyond semantics. Further, we improve the DPO optimization by a negative visual prompt, enhancing the model's sensitivity to different emotions under the same semantics. Second, we propose MC-Agent, a cross-Modal Collaborative Agent system that rewrites textual prompts to express the intended emotions. To avoid template-like rewrites, MC-Agent employs multi-agents to simulate human subjectivity toward emotions, and adopts a chain-of-concept workflow that improves the visual expressiveness of the rewritten prompts. Extensive qualitative and quantitative experiments demonstrate the superiority of Emotion-Director in emotion-oriented image generation.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nitin Kumar Singh, Arie Rachmad Syulistyo, Yuichiro Tanaka, Hakaru Tamukoh
Machine Learning
ๆฅ็ๆ่ฆ
Sign language recognition (SLR) facilitates communication between deaf and hearing communities. Deep learning based SLR models are commonly used but require extensive computational resources, making them unsuitable for deployment on edge devices. To address these limitations, we propose a lightweight SLR system that combines parallel bidirectional reservoir computing (PBRC) with MediaPipe. MediaPipe enables real-time hand tracking and precise extraction of hand joint coordinates, which serve as input features for the PBRC architecture. The proposed PBRC architecture consists of two echo state network (ESN) based bidirectional reservoir computing (BRC) modules arranged in parallel to capture temporal dependencies, thereby creating a rich feature representation for classification. We trained our PBRC-based SLR system on the Word-Level American Sign Language (WLASL) video dataset, achieving top-1, top-5, and top-10 accuracies of 60.85%, 85.86%, and 91.74%, respectively. Training time was significantly reduced to 18.67 seconds due to the intrinsic properties of reservoir computing, compared to over 55 minutes for deep learning based methods such as Bi-GRU. This approach offers a lightweight, cost-effective solution for real-time SLR on edge devices.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Evelyn Zhang, Fufu Yu, Aoqi Wu, Zichen Wen, Ke Yan et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user's prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important and available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D2Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2\% while retaining 99.2\% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7\% performance at a 90\% token reduction rate, marking a significant advancement with up to 63. 53\% improvement over existing methods.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fei Ge, Ying Huang, Jie Liu, Guixuan Zhang, Zhi Zeng et al.
Machine Learning
ๆฅ็ๆ่ฆ
Existing deep image watermarking methods follow a fixed embedding-distortion-extraction pipeline, where the embedder and extractor are weakly coupled through a final loss and optimized in isolation. This design lacks explicit collaboration, leaving no structured mechanism for the embedder to incorporate decoding-aware cues or for the extractor to guide embedding during training. To address this architectural limitation, we rethink deep image watermarking by reformulating embedding and extraction as explicitly collaborative components. To realize this reformulation, we introduce a Collaborative Interaction Mechanism (CIM) that establishes direct, bidirectional communication between the embedder and extractor, enabling a mutual-teacher training paradigm and coordinated optimization. Built upon this explicitly collaborative architecture, we further propose an Adaptive Feature Modulation Module (AFMM) to support effective interaction. AFMM enables content-aware feature regulation by decoupling modulation structure and strength, guiding watermark embedding toward stable image features while suppressing host interference during extraction. Under CIM, the AFMMs on both sides form a closed-loop collaboration that aligns embedding behavior with extraction objectives. This architecture-level redesign changes how robustness is learned in watermarking systems. Rather than relying on exhaustive distortion simulation, robustness emerges from coordinated representation learning between embedding and extraction. Experiments on real-world and AI-generated datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches in watermark extraction accuracy while maintaining high perceptual quality, showing strong robustness and generalization.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi Xin, Siqi Luo, Qi Qin, Haoxing Chen, Kaiwen Zhu et al.
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: https://github.com/Alpha-VLLM/Lumina-DiMOO.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoming Zhang, Chunli Li, Jiacheng Hao, Yuan Gao, Danyang Tu et al.
Multimodal
ๆฅ็ๆ่ฆ
Esophageal varices (EV) represent a critical complication of portal hypertension, affecting approximately 60% of cirrhosis patients with a significant bleeding risk of ~30%. While traditionally diagnosed through invasive endoscopy, non-contrast computed tomography (NCCT) presents a potential non-invasive alternative that has yet to be fully utilized in clinical practice. We present Multi-Organ-COhesion Network++ (MOON++), a novel multimodal framework that enhances EV assessment through comprehensive analysis of NCCT scans. Inspired by clinical evidence correlating organ volumetric relationships with liver disease severity, MOON++ synthesizes imaging characteristics of the esophagus, liver, and spleen through multimodal learning. We evaluated our approach using 1,631 patients, those with endoscopically confirmed EV were classified into four severity grades. Validation in 239 patient cases and independent testing in 289 cases demonstrate superior performance compared to conventional single organ methods, achieving an AUC of 0.894 versus 0.803 for the severe grade EV classification (G3 versus =G2 versus
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Yujie Zhao, Hongwei Fan, Di Chen, Shengcong Chen, Liliang Chen et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework's flexibility and extensibility, indicating its potential to serve as a unified data generation framework.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Hongwei Fan, Hang Dai, Jiyao Zhang, Jinzhou Li, Qiyang Yan et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
The robotics field is evolving towards data-driven, end-to-end learning, inspired by multimodal large models. However, reliance on expensive real-world data limits progress. Simulators offer cost-effective alternatives, but the gap between simulation and reality challenges effective policy transfer. This paper introduces TwinAligner, a novel Real2Sim2Real system that addresses both visual and dynamic gaps. The visual alignment module achieves pixel-level alignment through SDF reconstruction and editable 3DGS rendering, while the dynamic alignment module ensures dynamic consistency by identifying rigid physics from robot-object interaction. TwinAligner improves robot learning by providing scalable data collection and establishing a trustworthy iterative cycle, accelerating algorithm development. Quantitative evaluations highlight TwinAligner's strong capabilities in visual and dynamic real-to-sim alignment. This system enables policies trained in simulation to achieve strong zero-shot generalization to the real world. The high consistency between real-world and simulated policy performance underscores TwinAligner's potential to advance scalable robot learning. Code and data will be released on https://twin-aligner.github.io
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yueyao Chen, Kai-Ni Wang, Dario Tayupo, Arnaud Huaulm'e, Krystel Nyangoh Timoh et al.
ๆฅ็ๆ่ฆ
Purpose: Surgical workflow recognition enables context-aware assistance and skill assessment in computer-assisted interventions. Despite recent advances, current methods suffer from two critical challenges: prediction jitter across consecutive frames and poor discrimination of ambiguous phases. This paper aims to develop a stable framework by selectively propagating reliable historical information and explicitly modeling uncertainty for hard sample enhancement.
Methods: We propose a dual-pathway framework DSTED with Reliable Memory Propagation (RMP) and Uncertainty-Aware Prototype Retrieval (UPR). RMP maintains temporal coherence by filtering and fusing high-confidence historical features through multi-criteria reliability assessment. UPR constructs learnable class-specific prototypes from high-uncertainty samples and performs adaptive prototype matching to refine ambiguous frame representations. Finally, a confidence-driven gate dynamically balances both pathways based on prediction certainty.
Results: Our method achieves state-of-the-art performance on AutoLaparo-hysterectomy with 84.36% accuracy and 65.51% F1-score, surpassing the second-best method by 3.51% and 4.88% respectively. Ablations reveal complementary gains from RMP (2.19%) and UPR (1.93%), with synergistic effects when combined. Extensive analysis confirms substantial reduction in temporal jitter and marked improvement on challenging phase transitions.
Conclusion: Our dual-pathway design introduces a novel paradigm for stable workflow recognition, demonstrating that decoupling the modeling of temporal consistency and phase ambiguity yields superior performance and clinical applicability.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhongwei Chen, Hai-Jun Rong, Zhao-Xu Yang, Guoqi Li
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Traditional drone-view geo-localization (DVGL) methods based on artificial neural networks (ANNs) have achieved remarkable performance. However, ANNs rely on dense computation, which results in high power consumption. In contrast, spiking neural networks (SNNs), which benefit from spike-driven computation, inherently provide low power consumption. Regrettably, the potential of SNNs for DVGL has yet to be thoroughly investigated. Meanwhile, the inherent sparsity of spike-driven computation for representation learning scenarios also results in loss of critical information and difficulties in learning long-range dependencies when aligning heterogeneous visual data sources. To address these, we propose SpikeViMFormer, the first SNN framework designed for DVGL. In this framework, a lightweight spike-driven transformer backbone is adopted to extract coarse-grained features. To mitigate the loss of critical information, the spike-driven selective attention (SSA) block is designed, which uses a spike-driven gating mechanism to achieve selective feature enhancement and highlight discriminative regions. Furthermore, a spike-driven hybrid state space (SHS) block is introduced to learn long-range dependencies using a hybrid state space. Moreover, only the backbone is utilized during the inference stage to reduce computational cost. To ensure backbone effectiveness, a novel hierarchical re-ranking alignment learning (HRAL) strategy is proposed. It refines features via neighborhood re-ranking and maintains cross-batch consistency to directly optimize the backbone. Experimental results demonstrate that SpikeViMFormer outperforms state-of-the-art SNNs. Compared with advanced ANNs, it also achieves competitive performance.Our code is available at https://github.com/ISChenawei/SpikeViMFormer
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenyang Huang, Xiao Yu, Yi Zhang, Decheng Wang, Hang Ruan
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Remote sensing image change detection is one of the fundamental tasks in remote sensing intelligent interpretation. Its core objective is to identify changes within change regions of interest (CRoI). Current multimodal large models encode rich human semantic knowledge, which is utilized for guidance in tasks such as remote sensing change detection. However, existing methods that use semantic guidance for detecting users' CRoI overly rely on explicit textual descriptions of CRoI, leading to the problem of near-complete performance failure when presented with implicit CRoI textual descriptions. This paper proposes a multimodal reasoning change detection model named ReasonCD, capable of mining users' implicit task intent. The model leverages the powerful reasoning capabilities of pre-trained large language models to mine users' implicit task intents and subsequently obtains different change detection results based on these intents. Experiments on public datasets demonstrate that the model achieves excellent change detection performance, with an F1 score of 92.1\% on the BCDD dataset. Furthermore, to validate its superior reasoning functionality, this paper annotates a subset of reasoning data based on the SECOND dataset. Experimental results show that the model not only excels at basic reasoning-based change detection tasks but can also explain the reasoning process to aid human decision-making.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Siyuan Mei, Yan Xia, Fuxin Fan
ๆฅ็ๆ่ฆ
The synthesis of computed tomography (CT) from magnetic resonance imaging (MRI) and cone-beam CT (CBCT) plays a critical role in clinical treatment planning by enabling accurate anatomical representation in adaptive radiotherapy. In this work, we propose GANeXt, a 3D patch-based, fully ConvNeXt-powered generative adversarial network for unified CT synthesis across different modalities and anatomical regions. Specifically, GANeXt employs an efficient U-shaped generator constructed from stacked 3D ConvNeXt blocks with compact convolution kernels, while the discriminator adopts a conditional PatchGAN. To improve synthesis quality, we incorporate a combination of loss functions, including mean absolute error (MAE), perceptual loss, segmentation-based masked MAE, and adversarial loss and a combination of Dice loss and cross-entropy for multi-head segmentation discriminator. For both tasks, training is performed with a batch size of 8 using two separate AdamW optimizers for the generator and discriminator, each equipped with a warmup and cosine decay scheduler, with learning rates of $5\times10^{-4}$ and $1\times10^{-3}$, respectively. Data preprocessing includes deformable registration, foreground cropping, percentile normalization for the input modality, and linear normalization of the CT to the range $[-1024, 1000]$. Data augmentation involves random zooming within $(0.8, 1.3)$ (for MRI-to-CT only), fixed-size cropping to $32\times160\times192$ for MRI-to-CT and $32\times128\times128$ for CBCT-to-CT, and random flipping. During inference, we apply a sliding-window approach with $0.8$ overlap and average folding to reconstruct the full-size sCT, followed by inversion of the CT normalization. After joint training on all regions without any fine-tuning, the final models are selected at the end of 3000 epochs for MRI-to-CT and 1000 epochs for CBCT-to-CT using the full training dataset.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yueting Zhu, Yuehao Song, Shuai Zhang, Wenyu Liu, Xinggang Wang
ๆฅ็ๆ่ฆ
Whole Slide Images (WSIs) are typically analyzed using multiple instance learning (MIL) methods. However, the scale and heterogeneity of WSIs generate highly redundant and dispersed information, making it difficult to identify and integrate discriminative signals. Existing MIL methods either fail to discard uninformative cues effectively or have limited ability to consolidate relevant features from multiple patches, which restricts their performance on large and heterogeneous WSIs. To address this issue, we propose DeltaMIL, a novel MIL framework that explicitly selects semantically relevant regions and integrates the discriminative information from WSIs. Our method leverages the gated delta rule to efficiently filter and integrate information through a block combining forgetting and memory mechanisms. The delta mechanism dynamically updates the memory by removing old values and inserting new ones according to their correlation with the current patch. The gating mechanism further enables rapid forgetting of irrelevant signals. Additionally, DeltaMIL integrates a complementary local pattern mixing mechanism to retain fine-grained pathological locality. Our design enhances the extraction of meaningful cues and suppresses redundant or noisy information, which improves the model's robustness and discriminative power. Experiments demonstrate that DeltaMIL achieves state-of-the-art performance. Specifically, for survival prediction, DeltaMIL improves performance by 3.69\% using ResNet-50 features and 2.36\% using UNI features. For slide-level classification, it increases accuracy by 3.09\% with ResNet-50 features and 3.75\% with UNI features. These results demonstrate the strong and consistent performance of DeltaMIL across diverse WSI tasks.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Moamal Fadhil Abdul, Jonas Bruun Hubrechts, Thomas Martini Jรธrgensen, Emil Hovad
ๆฅ็ๆ่ฆ
Automatically detecting and classifying strokes in table tennis video can streamline training workflows, enrich broadcast overlays, and enable fine-grained performance analytics. For this to be possible, annotated video data of table tennis is needed. We extend the public OpenTTGames dataset with highly detailed, frame-accurate shot type annotations (forehand, backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome tags at point end. OpenTTGames is a set of recordings from the side of the table with official labels for bounces, when the ball is above the net, or hitting the net. The dataset already contains ball coordinates near events, which are either "bounce", "net", or "empty_event" in the original OpenTTGames dataset, and semantic masks (humans, table, scoreboard). Our extension adds the types of stroke to the events and a per-player taxonomy so models can move beyond event spotting toward tactical understanding (e.g., whether a stroke is likely to win the point or set up an advantage). We provide a compact coding scheme and code-assisted labeling procedure to support reproducible annotations and baselines for fine-grained stroke understanding in racket sports. This fills a practical gap in the community, where many prior video resources are either not publicly released or carry restrictive/unclear licenses that hinder reuse and benchmarking. Our annotations are released under the same CC BY-NC-SA 4.0 license as OpenTTGames, allowing free non-commercial use, modification, and redistribution, with appropriate attribution.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yayuan Li, Jian Zhang, Jintao Guo, Zihan Cheng, Lei Qi et al.
Computer Vision
ๆฅ็ๆ่ฆ
The proliferation of pre-trained models has given rise to a wide array of specialised, fine-tuned models. Model merging aims to merge the distinct capabilities of these specialised models into a unified model, requiring minimal or even no additional training. A core objective of model merging is to ensure the merged model retains the behavioural characteristics of the specialised models, typically achieved through feature alignment. We identify that features consist of two critical components: direction and magnitude. Prior research has predominantly focused on directional alignment, while the influence of magnitude remains largely neglected, despite its pronounced vulnerability to perturbations introduced by common merging operations (e.g., parameter fusion and sparsification). Such perturbations to magnitude inevitably lead to feature deviations in the merged model from the specialised models, resulting in subsequent performance degradation. To address this, we propose MAGnItude Calibration (MAGIC), a plug-and-play framework that rectifies layer-wise magnitudes in feature and weight spaces, with three variants. Specifically, our Feature Space Calibration (FSC) realigns the merged model's features using a small set of unlabelled data, while Weight Space Calibration (WSC) extends this calibration to the weight space without requiring additional data. Combining these yields Dual Space Calibration (DSC). Comprehensive experiments demonstrate that MAGIC consistently boosts performance across diverse Computer Vision tasks (+4.3% on eight datasets) and NLP tasks (+8.0% on Llama) without additional training. Our code is available at: https://github.com/lyymuwu/MAGIC
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marica Muffoletto, Uxio Hermida, Charlรจne Mauger, Avan Suinesiaputra, Yiyang Xu et al.
ๆฅ็ๆ่ฆ
Accurate reconstruction of cardiac anatomy from sparse clinical images remains a major challenge in patient-specific modeling. While neural implicit functions have previously been applied to this task, their application to mapping anatomical consistency across subjects has been limited. In this work, we introduce Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system, based on universal ventricular coordinates, that provides a common anatomical reference frame for the human heart. Our method predicts NIHCs directly from a limited number of 2D segmentations (sparse acquisition) and subsequently decodes them into dense 3D segmentations and high-resolution meshes at arbitrary output resolution. Trained on a large dataset of 5,000 cardiac meshes, the model achieves high reconstruction accuracy on clinical contours, with mean Euclidean surface errors of 2.51$\pm$0.33 mm in a diseased cohort (n=4549) and 2.3$\pm$0.36 mm in a healthy cohort (n=5576). The NIHC representation enables anatomically coherent reconstruction even under severe slice sparsity and segmentation noise, faithfully recovering complex structures such as the valve planes. Compared with traditional pipelines, inference time is reduced from over 60 s to 5-15 s. These results demonstrate that NIHCs constitute a robust and efficient anatomical representation for patient-specific 3D cardiac reconstruction from minimal input data.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hui Li, Jiayue Lyu, Fu-Yun Wang, Kaihui Cheng, Siyu Zhu et al.
Computer Vision
ๆฅ็ๆ่ฆ
This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving the diffusion models. During training, the input of a prediction network at one training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, and during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is the nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named slowed interpolation mixture, for post-training the prediction network for each training timestep. Experiments over class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Our approach MixFlow over the RAE models achieve strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x 512.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang
RoboticsMachine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further found that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at https://github.com/Ricardo-XZ/Think2Seg-RS.
๐
2025-12-22
๐ AAAI 2026
๐ฅ Jun Li, Zikun Chen, Haibo Chen, Shuo Chen, Jian Yang
Machine Learning
ๆฅ็ๆ่ฆ
Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs-manifesting as conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, optimizing the policy via proximal policy optimization. At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate RMLer's superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kyungwon Cho, Hanbyul Joo
ๆฅ็ๆ่ฆ
Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer's full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult since most body parts are invisible from the egocentric view. Prior approaches mainly rely on head trajectories, leading to ambiguity, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both head trajectory and intermittently visible hand cues caused by field-of-view limitations and occlusions, as in real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models such real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field-of-view are crucial for accurate motion reconstruction, and thus employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, demonstrating a practical step toward reliable in-the-wild egocentric 3D motion understanding.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ivan DeAndres-Tame, Chengwei Ye, Ruben Tolosana, Ruben Vera-Rodriguez, Shiqi Yu
ๆฅ็ๆ่ฆ
Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinyang Song, Libin Wang, Weining Wang, Zhiwei Li, Jianxin Sun et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on our proposed 3SGen-Bench and other public benchmarks demonstrate our superior performance across diverse image-driven generation tasks.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Carla Crivoi, Radu Tudor Ionescu
Machine Learning
ๆฅ็ๆ่ฆ
We present the first comprehensive empirical study of machine unlearning (MU) in hybrid quantum-classical neural networks. While MU has been extensively explored in classical deep learning, its behavior within variational quantum circuits (VQCs) and quantum-augmented architectures remains largely unexplored. First, we adapt a broad suite of unlearning methods to quantum settings, including gradient-based, distillation-based, regularization-based and certified techniques. Second, we introduce two new unlearning strategies tailored to hybrid models. Experiments across Iris, MNIST, and Fashion-MNIST, under both subset removal and full-class deletion, reveal that quantum models can support effective unlearning, but outcomes depend strongly on circuit depth, entanglement structure, and task complexity. Shallow VQCs display high intrinsic stability with minimal memorization, whereas deeper hybrid models exhibit stronger trade-offs between utility, forgetting strength, and alignment with retrain oracle. We find that certain methods, e.g. EU-k, LCA, and Certified Unlearning, consistently provide the best balance across metrics. These findings establish baseline empirical insights into quantum machine unlearning and highlight the need for quantum-aware algorithms and theoretical guarantees, as quantum machine learning systems continue to expand in scale and capability. We publicly release our code at: https://github.com/CrivoiCarla/HQML.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meng Chu, Senqiao Yang, Haoxuan Che, Suiyun Zhang, Xichen Zhang et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models' performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.
๐
2025-12-22
๐ ArXiv Preprint (eess.IV)
๐ฅ Beyza Zayim, Aissiou Ikram, Boukhiar Naima
Robotics
ๆฅ็ๆ่ฆ
Breast cancer remains the most common cancer among women and is a leading cause of female mortality. Dynamic contrast-enhanced MRI (DCE-MRI) is a powerful imaging tool for evaluating breast tumors, yet the field lacks a standardized benchmark for analyzing treatment responses and guiding personalized care. We participated in the MAMA-MIA Challenge's Primary Tumor Segmentation task and this work presents a proposed selective, phase-aware training framework for the nnU-Net architecture, emphasizing quality-focused data selection to strengthen model robustness and generalization. We employed the No New Net (nnU-Net) framework with a selective training strategy that systematically analyzed the impact of image quality and center-specific variability on segmentation performance. Controlled experiments on the DUKE, NACT, ISPY1, and ISPY2 datasets revealed that including ISPY scans with motion artifacts and reduced contrast impaired segmentation performance, even with advanced preprocessing, such as contrast-limited adaptive histogram equalization (CLAHE). In contrast, training on DUKE and NACT data, which exhibited clearer contrast and fewer motion artifacts despite varying resolutions, with early phase images (0000-0002) provided more stable training conditions. Our results demonstrate the importance of phase-sensitive and quality-aware training strategies in achieving reliable segmentation performance in heterogeneous clinical datasets, highlighting the limitations of the expansion of naive datasets and motivating the need for future automation of quality-based data selection strategies.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunlong Liu, Shuyang Li, Pengyuan Liu, Yu Zhang, Rudi Stouffs
Machine Learning
ๆฅ็ๆ่ฆ
Perception research is increasingly modelled using streetscapes, yet many approaches still rely on pixel features or object co-occurrence statistics, overlooking the explicit relations that shape human perception. This study proposes a three stage pipeline that transforms street view imagery (SVI) into structured representations for predicting six perceptual indicators. In the first stage, each image is parsed using an open-set Panoptic Scene Graph model (OpenPSG) to extract object predicate object triplets. In the second stage, compact scene-level embeddings are learned through a heterogeneous graph autoencoder (GraphMAE). In the third stage, a neural network predicts perception scores from these embeddings. We evaluate the proposed approach against image-only baselines in terms of accuracy, precision, and cross-city generalization. Results indicate that (i) our approach improves perception prediction accuracy by an average of 26% over baseline models, and (ii) maintains strong generalization performance in cross-city prediction tasks. Additionally, the structured representation clarifies which relational patterns contribute to lower perception scores in urban scenes, such as graffiti on wall and car parked on sidewalk. Overall, this study demonstrates that graph-based structure provides expressive, generalizable, and interpretable signals for modelling urban perception, advancing human-centric and context-aware urban analytics.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce Image-LoRA, a lightweight parameter efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Na Gao, Chenfei Ye, Yanwu Yang, Anqi Li, Zhengbo He et al.
ๆฅ็ๆ่ฆ
Accurate characterization of hippocampal substructure is crucial for detecting subtle structural changes and identifying early neurodegenerative biomarkers. However, high inter-subject variability and complex folding pattern of human hippocampus hinder consistent cross-subject and longitudinal analysis. Most existing approaches rely on subject-specific modelling and lack a stable intrinsic coordinate system to accommodate anatomical variability, which limits their ability to establish reliable inter- and intra-individual correspondence. To address this, we propose HippMetric, a skeletal representation (s-rep)-based framework for hippocampal substructural morphometry and point-wise correspondence across individuals and scans. HippMetric builds on the Axis-Referenced Morphometric Model (ARMM) and employs a deformable skeletal coordinate system aligned with hippocampal anatomy and function, providing a biologically grounded reference for correspondence. Our framework comprises two core modules: a skeletal-based coordinate system that respects the hippocampus' conserved longitudinal lamellar architecture, in which functional units (lamellae) are stacked perpendicular to the long-axis, enabling anatomically consistent localization across subjects and time; and individualized s-reps generated through surface reconstruction, deformation, and geometrically constrained spoke refinement, enforcing boundary adherence, orthogonality and non-intersection to produce mathematically valid skeletal geometry. Extensive experiments on two international cohorts demonstrate that HippMetric achieves higher accuracy, reliability, and correspondence stability compared to existing shape models.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zihao Luo, Shaohao Rui, Zhenyu Tang, Guotai Wang, Xiaosong Wang
Machine Learning
ๆฅ็ๆ่ฆ
Continual self-supervised learning (CSSL) in medical imaging trains a foundation model sequentially, alleviating the need for collecting multi-modal images for joint training and offering promising improvements in downstream performance while preserving data privacy. However, most existing methods still rely on replaying data from previous stages to prevent catastrophic forgetting, which compromises privacy and limits their applicability in real-world scenarios where data transfer across sites is often restricted. In this work, we propose InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. Specifically, after training on a previous task, InvCoSS inverts the pre-trained self-supervised model to generate synthetic images that approximate the original training distribution. These synthetic images are then combined with data from the new task for joint optimization, which effectively mitigates catastrophic forgetting while strictly adhering to the constraint of no access to previous real data. Furthermore, to improve the fidelity of synthetic images, we introduce a novel InvUNet with a multi-scale fusion architecture to restore both high- and low-frequency components of the inverted images. To enhance diversity and prevent mode collapse, we design a repulsive representation-learning mechanism that encourages a diverse feature space for synthetic images without class guidance. Extensive experiments across nine downstream tasks validate the effectiveness of InvCoSS, achieving performance comparable to or even superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marios Thoma, Zenonas Theodosiou, Harris Partaourides, Vassilis Vassiliades, Loizos Michael et al.
Machine Learning
ๆฅ็ๆ่ฆ
Walking has always been a primary mode of transportation and is recognized as an essential activity for maintaining good health. Despite the need for safe walking conditions in urban environments, sidewalks are frequently obstructed by various obstacles that hinder free pedestrian movement. Any object obstructing a pedestrian's path can pose a safety hazard. The advancement of pervasive computing and egocentric vision techniques offers the potential to design systems that can automatically detect such obstacles in real time, thereby enhancing pedestrian safety. The development of effective and efficient identification algorithms relies on the availability of comprehensive and well-balanced datasets of egocentric data. In this work, we introduce the PEDESTRIAN dataset, comprising egocentric data for 29 different obstacles commonly found on urban sidewalks. A total of 340 videos were collected using mobile phone cameras, capturing a pedestrian's point of view. Additionally, we present the results of a series of experiments that involved training several state-of-the-art deep learning algorithms using the proposed dataset, which can be used as a benchmark for obstacle detection and recognition tasks. The dataset can be used for training pavement obstacle detectors to enhance the safety of pedestrians in urban areas.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dazhen Deng, Sen Yang, Yuchen He, Yuan Tian, Yingcai Wu
Natural Language Processing
ๆฅ็ๆ่ฆ
Current chart-specific tasks, such as chart question answering, chart parsing, and chart generation, are typically studied in isolation, preventing models from learning the shared semantics that link chart generation and interpretation. We introduce CycleChart, a consistency-based learning framework for bidirectional chart understanding and generation. CycleChart adopts a schema-centric formulation as a common interface across tasks. We construct a consistent multi-task dataset, where each chart sample includes aligned annotations for schema prediction, data parsing, and question answering. To learn cross-directional chart semantics, CycleChart introduces a generate-parse consistency objective: the model generates a chart schema from a table and a textual query, then learns to recover the schema and data from the generated chart, enforcing semantic alignment across directions. CycleChart achieves strong results on chart generation, chart parsing, and chart question answering, demonstrating improved cross-task generalization and marking a step toward more general chart understanding models.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wendong Bu, Kaihang Pan, Yuze Lin, Jiacheng Li, Kai Shen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, limiting flexibility for free-form and omni-objective generation. To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built upon a concise RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end instruction-driven motion generation. We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, exhibiting emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next intelligent motion generation. Project Page: https://OmniMoGen.github.io/.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruikai Li, Xinrun Li, Mengwei Xie, Hao Shan, Shoumeng Qiu et al.
Robotics
ๆฅ็ๆ่ฆ
Online High-Definition (HD) map construction is pivotal for autonomous driving. While recent approaches leverage historical temporal fusion to improve performance, we identify a critical safety flaw in this paradigm: it is inherently ``spatially backward-looking." These methods predominantly enhance map reconstruction in traversed areas, offering minimal improvement for the unseen road ahead. Crucially, our analysis of downstream planning tasks reveals a severe asymmetry: while rearward perception errors are often tolerable, inaccuracies in the forward region directly precipitate hazardous driving maneuvers. To bridge this safety gap, we propose AMap, a novel framework for Ahead-aware online HD Mapping. We pioneer a ``distill-from-future" paradigm, where a teacher model with privileged access to future temporal contexts guides a lightweight student model restricted to the current frame. This process implicitly compresses prospective knowledge into the student model, endowing it with ``look-ahead" capabilities at zero inference-time cost. Technically, we introduce a Multi-Level BEV Distillation strategy with spatial masking and an Asymmetric Query Adaptation module to effectively transfer future-aware representations to the student's static queries. Extensive experiments on the nuScenes and Argoverse 2 benchmark demonstrate that AMap significantly enhances current-frame perception. Most notably, it outperforms state-of-the-art temporal models in critical forward regions while maintaining the efficiency of single current frame inference.
๐
2025-12-22
๐ AAAI 2026
๐ฅ Pengxuan Yang, Ben Lu, Zhongpu Xia, Chao Han, Yinfeng Gao et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Latent World Models enhance scene representation through temporal self-supervised learning, presenting a perception annotation-free paradigm for end-to-end autonomous driving. However, the reconstruction-oriented representation learning tangles perception with planning tasks, leading to suboptimal optimization for planning. To address this challenge, we propose WorldRFT, a planning-oriented latent world model framework that aligns scene representation learning with planning via a hierarchical planning decomposition and local-aware interactive refinement mechanism, augmented by reinforcement learning fine-tuning (RFT) to enhance safety-critical policy performance. Specifically, WorldRFT integrates a vision-geometry foundation model to improve 3D spatial awareness, employs hierarchical planning task decomposition to guide representation optimization, and utilizes local-aware iterative refinement to derive a planning-oriented driving policy. Furthermore, we introduce Group Relative Policy Optimization (GRPO), which applies trajectory Gaussianization and collision-aware rewards to fine-tune the driving policy, yielding systematic improvements in safety. WorldRFT achieves state-of-the-art (SOTA) performance on both open-loop nuScenes and closed-loop NavSim benchmarks. On nuScenes, it reduces collision rates by 83% (0.30% -> 0.05%). On NavSim, using camera-only sensors input, it attains competitive performance with the LiDAR-based SOTA method DiffusionDrive (87.8 vs. 88.1 PDMS).
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hengyi Feng, Zeang Sheng, Meiyi Qiang, Wentao Zhang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from serving as effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; the visual information essential for multimodal retrieval only constitutes a small portion. This imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations for MLLMs are in fact distractors that actively degrade retrieval performance. Overall, our work provides the first in-depth interpretability analysis of MLLM representations in the context of multimodal retrieval and offers possible directions for enhancing the multimodal retrieval capabilities of MLLMs.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Li, Zhenbao Yu, Banglei Guan, Jianli Han, Weimin Lv et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
This work presents two novel solvers for estimating the relative poses among views with known vertical directions. The vertical directions of camera views can be easily obtained using inertial measurement units (IMUs) which have been widely used in autonomous vehicles, mobile phones, and unmanned aerial vehicles (UAVs). Given the known vertical directions, our lgorithms only need to solve for two rotation angles and two translation vectors. In this paper, a linear closed-form solution has been described, requiring only four point correspondences in three views. We also propose a minimal solution with three point correspondences using the latest Grรถbner basis solver. Since the proposed methods require fewer point correspondences, they can be efficiently applied within the RANSAC framework for outliers removal and pose estimation in visual odometry. The proposed method has been tested on both synthetic data and real-world scenes from KITTI. The experimental results show that the accuracy of the estimated poses is superior to other alternative methods.
๐
2025-12-22
๐ AAAI 2026
๐ฅ Tiantian Li, Xinjie Zhang, Xingtong Ge, Tongda Xu, Dailan He et al.
ๆฅ็ๆ่ฆ
Implicit neural representations (INRs) have achieved remarkable success in image representation and compression, but they require substantial training time and memory. Meanwhile, recent 2D Gaussian Splatting (GS) methods (\textit{e.g.}, GaussianImage) offer promising alternatives through efficient primitive-based rendering. However, these methods require excessive Gaussian primitives to maintain high visual fidelity. To exploit the potential of GS-based approaches, we present GaussianImage++, which utilizes limited Gaussian primitives to achieve impressive representation and compression performance. Firstly, we introduce a distortion-driven densification mechanism. It progressively allocates Gaussian primitives according to signal intensity. Secondly, we employ context-aware Gaussian filters for each primitive, which assist in the densification to optimize Gaussian primitives based on varying image content. Thirdly, we integrate attribute-separated learnable scalar quantizers and quantization-aware training, enabling efficient compression of primitive attributes. Experimental results demonstrate the effectiveness of our method. In particular, GaussianImage++ outperforms GaussianImage and INRs-based COIN in representation and compression performance while maintaining real-time decoding and low memory usage.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Weiyi Lyu, Xinming Fang, Jun Wang, Jun Shi, Guixu Zhang et al.
ๆฅ็ๆ่ฆ
Magnetic resonance imaging (MRI) is a cornerstone of modern clinical diagnosis, offering unparalleled soft-tissue contrast without ionizing radiation. However, prolonged scan times remain a major barrier to patient throughput and comfort. Existing accelerated MRI techniques often struggle with two key challenges: (1) failure to effectively utilize inherent K-space prior information, leading to persistent aliasing artifacts from zero-filled inputs; and (2) contamination of target reconstruction quality by irrelevant information when employing multi-contrast fusion strategies. To overcome these challenges, we present MambaMDN, a dual-domain framework for multi-contrast MRI reconstruction. Our approach first employs fully-sampled reference K-space data to complete the undersampled target data, generating structurally aligned but modality-mixed inputs. Subsequently, we develop a Mamba-based modality disentanglement network to extract and remove reference-specific features from the mixed representation. Furthermore, we introduce an iterative refinement mechanism to progressively enhance reconstruction accuracy through repeated feature purification. Extensive experiments demonstrate that MambaMDN can significantly outperform existing multi-contrast reconstruction methods.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ariel Lubonja, Pedro R. A. S. Bassi, Wenxuan Li, Hualin Qiao, Randal Burns et al.
Multimodal
ๆฅ็ๆ่ฆ
Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, revealing that more than half of the MONAI-based entries have the largest gender-race discrepancy on our proprietary Johns Hopkins Hospital dataset. The RankInsight toolkit is publicly released and can be directly applied to past, ongoing, and future challenges. It enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair.
๐
2025-12-22
๐ AAAI 2026
๐ฅ Khanh Nguyen, Dasith de Silva Edirimuni, Ghulam Mubashar Hassan, Ajmal Mian
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Locating and retrieving objects from scene-level point clouds is a challenging problem with broad applications in robotics and augmented reality. This task is commonly formulated as open-vocabulary 3D instance segmentation. Although recent methods demonstrate strong performance, they depend heavily on SAM and CLIP to generate and classify 3D instance masks from images accompanying the point cloud, leading to substantial computational overhead and slow processing that limit their deployment in real-world settings. Open-YOLO 3D alleviates this issue by using a real-time 2D detector to classify class-agnostic masks produced directly from the point cloud by a pretrained 3D segmenter, eliminating the need for SAM and CLIP and significantly reducing inference time. However, Open-YOLO 3D often fails to generalize to object categories that appear infrequently in the 3D training data. In this paper, we propose a method that generates 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. Our approach inherits the 2D detector's ability to recognize novel objects while maintaining efficient classification, enabling fast and accurate retrieval of rare instances from open-ended text queries. Our code will be made available at https://github.com/ndkhanh360/BoxOVIS.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruiqi Ma, Yu Yan, Chunhong Zhang, Minghao Yin, XinChao Liu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce Hallucination Disentangled Decoding (HDD) method that requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model's dependence on language priors but also enhances its visual performance. (Code: https://github.com/rickeyhhh/Hallucination-Disentangled-Decoding)
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jihui Guo, Zongmin Zhang, Zhen Sun, Yuhao Yang, Jinlin Wu et al.
RoboticsMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
Deep learning advances have enabled accurate six-degree-of-freedom (6DoF) object pose estimation, widely used in robotics, AR/VR, and autonomous systems. However, backdoor attacks pose significant security risks. While most research focuses on 2D vision, 6DoF pose estimation remains largely unexplored. Unlike traditional backdoors that only change classes, 6DoF attacks must control continuous parameters like translation and rotation, rendering 2D methods inapplicable. We propose 6DAttack, a framework using 3D object triggers to induce controlled erroneous poses while maintaining normal behavior. Evaluations on PVNet, DenseFusion, and PoseDiffusion across LINEMOD, YCB-Video, and CO3D show high attack success rates (ASRs) without compromising clean performance. Backdoored models achieve up to 100% clean ADD accuracy and 100% ASR, with triggered samples reaching 97.70% ADD-P. Furthermore, a representative defense remains ineffective. Our findings reveal a serious, underexplored threat to 6DoF pose estimation.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Utae Jeong, Sumin In, Hyunju Ryu, Jaewan Choi, Feng Yang et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hwanhee Jung, Seunggwan Lee, Jeongyoon Yoon, SeungHyeon Kim, Giljoo Nam et al.
RoboticsMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoyang Li, Mingming Lu, Ruiqi Wang, Hao Li, Zewei Le
Multimodal
ๆฅ็ๆ่ฆ
Few-shot action recognition aims to enable models to quickly learn new action categories from limited labeled samples, addressing the challenge of data scarcity in real-world applications. Current research primarily addresses three core challenges: (1) temporal modeling, where models are prone to interference from irrelevant static background information and struggle to capture the essence of dynamic action features; (2) visual similarity, where categories with subtle visual differences are difficult to distinguish; and (3) the modality gap between visual-textual support prototypes and visual-only queries, which complicates alignment within a shared embedding space. To address these challenges, this paper proposes a CLIP-SPM framework, which includes three components: (1) the Hierarchical Synergistic Motion Refinement (HSMR) module, which aligns deep and shallow motion features to improve temporal modeling by reducing static background interference; (2) the Semantic Prototype Modulation (SPM) strategy, which generates query-relevant text prompts to bridge the modality gap and integrates them with visual features, enhancing the discriminability between similar actions; and (3) the Prototype-Anchor Dual Modulation (PADM) method, which refines support prototypes and aligns query features with a global semantic anchor, improving consistency across support and query samples. Comprehensive experiments across standard benchmarks, including Kinetics, SSv2-Full, SSv2-Small, UCF101, and HMDB51, demonstrate that our CLIP-SPM achieves competitive performance under 1-shot, 3-shot, and 5-shot settings. Extensive ablation studies and visual analyses further validate the effectiveness of each component and its contributions to addressing the core challenges. The source code and models are publicly available at GitHub.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ran Li, Pan Xiao, Kaushik Dutta, Youdong Guo
Machine Learning
ๆฅ็ๆ่ฆ
Fluorescence Microcopy Calcium Imaging is a fundamental tool to in-vivo record and analyze large scale neuronal activities simultaneously at a single cell resolution. Automatic and precise detection of behaviorally relevant neuron activity from the recordings is critical to study the mapping of brain activity in organisms. However a perpetual bottleneck to this problem is the manual segmentation which is time and labor intensive and lacks generalizability. To this end, we present a Bayesian Deep Learning Framework to detect neuronal activities in 4D spatio-temporal data obtained by light sheet microscopy. Our approach accounts for the use of temporal information by calculating pixel wise correlation maps and combines it with spatial information given by the mean summary image. The Bayesian framework not only produces probability segmentation maps but also models the uncertainty pertaining to active neuron detection. To evaluate the accuracy of our framework we implemented the test of reproducibility to assert the generalization of the network to detect neuron activity. The network achieved a mean Dice Score of 0.81 relative to the synthetic Ground Truth obtained by Otsu's method and a mean Dice Score of 0.79 between the first and second run for test of reproducibility. Our method successfully deployed can be used for rapid detection of active neuronal activities for behavioural studies.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Connor Kilrain, David Carlyn, Julia Chae, Sara Beery, Wei-Lun Chao et al.
Machine Learning
ๆฅ็ๆ่ฆ
The rise of personalized generative models raises a central question: how should we evaluate identity preservation? Given a reference image (e.g., one's pet), we expect the generated image to retain precise details attached to the subject's identity. However, current generative evaluation metrics emphasize the overall semantic similarity between the reference and the output, and overlook these fine-grained discriminative details. We introduce Finer-Personalization Rank, an evaluation protocol tailored to identity preservation. Instead of pairwise similarity, Finer-Personalization Rank adopts a ranking view: it treats each generated image as a query against an identity-labeled gallery consisting of visually similar real images. Retrieval metrics (e.g., mean average precision) measure performance, where higher scores indicate that identity-specific details (e.g., a distinctive head spot) are preserved. We assess identity at multiple granularities -- from fine-grained categories (e.g., bird species, car models) to individual instances (e.g., re-identification). Across CUB, Stanford Cars, and animal Re-ID benchmarks, Finer-Personalization Rank more faithfully reflects identity retention than semantic-only metrics and reveals substantial identity drift in several popular personalization methods. These results position the gallery-based protocol as a principled and practical evaluation for personalized generation.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoze Li, Jie Zhang, Guoying Zhao, Stephen Lin, Shiguang Shan
Multimodal
ๆฅ็ๆ่ฆ
Face Presentation Attack Detection (PAD) demands incremental learning (IL) to combat evolving spoofing tactics and domains. Privacy regulations, however, forbid retaining past data, necessitating rehearsal-free IL (RF-IL). Vision-Language Pre-trained (VLP) models, with their prompt-tunable cross-modal representations, enable efficient adaptation to new spoofing styles and domains. Capitalizing on this strength, we propose \textbf{SVLP-IL}, a VLP-based RF-IL framework that balances stability and plasticity via \textit{Multi-Aspect Prompting} (MAP) and \textit{Selective Elastic Weight Consolidation} (SEWC). MAP isolates domain dependencies, enhances distribution-shift sensitivity, and mitigates forgetting by jointly exploiting universal and domain-specific cues. SEWC selectively preserves critical weights from previous tasks, retaining essential knowledge while allowing flexibility for new adaptations. Comprehensive experiments across multiple PAD benchmarks show that SVLP-IL significantly reduces catastrophic forgetting and enhances performance on unseen domains. SVLP-IL offers a privacy-compliant, practical solution for robust lifelong PAD deployment in RF-IL settings.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sihao Lin, Zerui Li, Xunyi Zhao, Gengze Zhou, Liuyi Wang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight that the benchmarks provide into sim-to-real generalization, and create a significant research gap. Furthermore, task fragmentation prevents unified/shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible and teleporting "ghost" agents that support full-kinematics in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zelin Zhao, Xinyu Gong, Bangya Liu, Ziyang Song, Jun Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations through a consistent and extensible tokenization scheme. CETCAM leverages recent advances in geometry foundation models, such as VGGT, to estimate depth and camera parameters and converts them into unified, geometry-aware tokens. These tokens are seamlessly integrated into a pretrained video diffusion backbone via lightweight context blocks. Trained in two progressive stages, CETCAM first learns robust camera controllability from diverse raw video data and then refines fine-grained visual quality using curated high-fidelity datasets. Extensive experiments across multiple benchmarks demonstrate state-of-the-art geometric consistency, temporal stability, and visual realism. Moreover, CETCAM exhibits strong adaptability to additional control modalities, including inpainting and layout control, highlighting its flexibility beyond camera control. The project page is available at https://sjtuytc.github.io/CETCam_project_page.github.io/.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng Yaw Low, Heejoon Koo, Jaewoo Park, Kaleb Mesfin Asfaw, Meeyoung Cha
Multimodal
ๆฅ็ๆ่ฆ
AI-guided classification of ecological families, genera, and species underpins global sustainability efforts such as biodiversity monitoring, conservation planning, and policy-making. Progress toward this goal is hindered by long-tailed taxonomic distributions from class imbalance, along with fine-grained taxonomic variations, test-time spatiotemporal domain shifts, and closed-set assumptions that can only recognize previously seen taxa. We introduce the Open-World Ecological Taxonomy Classification, a unified framework that captures the co-occurrence of these challenges in realistic ecological settings. To address them, we propose TaxoNet, an embedding-based encoder with a dual-margin penalization loss that strengthens learning signals from rare underrepresented taxa while mitigating the dominance of overrepresented ones, directly confronting interrelated challenges. We evaluate our method on diverse ecological domains: Google Auto-Arborist (urban trees), iNat-Plantae (Plantae observations from various ecosystems in iNaturalist-2019), and NAFlora-Mini (a curated herbarium collection). Our model consistently outperforms baselines, particularly for rare taxa, establishing a strong foundation for open-world plant taxonomic monitoring. Our findings further show that general-purpose multimodal foundation models remain constrained in plant-domain applications.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Dominant paradigms for 4D LiDAR panoptic segmentation are usually required to train deep neural networks with large superimposed point clouds or design dedicated modules for instance association. However, these approaches perform redundant point processing and consequently become computationally expensive, yet still overlook the rich geometric priors inherently provided by raw point clouds. To this end, we introduce ICP-4D, a simple yet effective training-free framework that unifies spatial and temporal reasoning through geometric relations among instance-level point sets. Specifically, we apply the Iterative Closest Point (ICP) algorithm to directly associate temporally consistent instances by aligning the source and target point sets through the estimated transformation. To stabilize association under noisy instance predictions, we introduce a Sinkhorn-based soft matching. This exploits the underlying instance distribution to obtain accurate point-wise correspondences, resulting in robust geometric alignment. Furthermore, our carefully designed pipeline, which considers three instance types-static, dynamic, and missing-offers computational efficiency and occlusion-aware matching. Our extensive experiments across both SemanticKITTI and panoptic nuScenes demonstrate that our method consistently outperforms state-of-the-art approaches, even without additional training or extra point cloud inputs.
๐
2025-12-22
๐ ICRA 2026
๐ฅ Ryosuke Korekata, Quanting Xie, Yonatan Bisk, Komei Sugiura
RoboticsMultimodal
ๆฅ็ๆ่ฆ
In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng-Hong Chang, Pei-Hsuan Tsai
ๆฅ็ๆ่ฆ
Object recognition has become prevalent across various industries. However, most existing applications are limited to identifying objects alone, without considering their associated states. The ability to recognize both the state and object simultaneously remains less common. One approach to address this is by treating state and object as a single category during training. However, this approach poses challenges in data collection and training since it requires comprehensive data for all possible combinations. Compositional Zero-shot Learning (CZSL) emerges as a viable solution by treating the state and object as distinct categories during training. CZSL facilitates the identification of novel compositions even in the absence of data for every conceivable combination. The current state-of-the-art method, KG-SP, addresses this issue by training distinct classifiers for states and objects, while leveraging a semantic model to evaluate the plausibility of composed compositions. However, KG-SP's accuracy in state and object recognition can be further improved, and it fails to consider the weighting of states and objects during composition. In this study, we propose SASOW, an enhancement of KG-SP that considers the weighting of states and objects while improving composition recognition accuracy. First, we introduce self-attention mechanisms into the classifiers for states and objects, leading to enhanced accuracy in recognizing both. Additionally, we incorporate the weighting of states and objects during composition to generate more reasonable and accurate compositions. Our validation process involves testing SASOW on three established benchmark datasets. Experimental outcomes affirm when compared against OW-CZSL approach, KG-SP, SASOW showcases improvements of 2.1%, 1.7%, and 0.4% in terms of accuracy for unseen compositions across the MIT-States, UT Zappos, and C-GQA datasets, respectively.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianle Lu, Ke Chen, Yuping Duan
Machine Learning
ๆฅ็ๆ่ฆ
We introduce a novel formulation for curvature regularization by penalizing normal curvatures from multiple directions. This total normal curvature regularization is capable of producing solutions with sharp edges and precise isotropic properties. To tackle the resulting high-order nonlinear optimization problem, we reformulate it as the task of finding the steady-state solution of a time-dependent partial differential equation (PDE) system. Time discretization is achieved through operator splitting, where each subproblem at the fractional steps either has a closed-form solution or can be efficiently solved using advanced algorithms. Our method circumvents the need for complex parameter tuning and demonstrates robustness to parameter choices. The efficiency and effectiveness of our approach have been rigorously validated in the context of surface and image smoothing problems.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guandong Li, Yijun Ding
ๆฅ็ๆ่ฆ
Recent tuning-free identity customization methods achieve high facial fidelity but often overlook visual context, such as lighting, skin texture, and environmental tone. This limitation leads to ``Semantic-Visual Dissonance,'' where accurate facial geometry clashes with the input's unique atmosphere, causing an unnatural ``sticker-like'' effect. We propose **DVI (Disentangled Visual-Identity)**, a zero-shot framework that orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. Unlike methods relying solely on semantic vectors, DVI exploits the inherent statistical properties of the VAE latent space, utilizing mean and variance as lightweight descriptors for global visual atmosphere. We introduce a **Parameter-Free Feature Modulation** mechanism that adaptively modulates semantic embeddings with these visual statistics, effectively injecting the reference's ``visual soul'' without training. Furthermore, a **Dynamic Temporal Granularity Scheduler** aligns with the diffusion process, prioritizing visual atmosphere in early denoising stages while refining semantic details later. Extensive experiments demonstrate that DVI significantly enhances visual consistency and atmospheric fidelity without parameter fine-tuning, maintaining robust identity preservation and outperforming state-of-the-art methods in IBench evaluations.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zaidao Han, Risa Higashita, Jiang Liu
Robotics
ๆฅ็ๆ่ฆ
Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation.
To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning.
Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nicolas Caytuiro, Ivan Sipiran
Machine Learning
ๆฅ็ๆ่ฆ
We propose a novel data-centric approach to promote symmetry in 3D generative models by modifying the training data rather than the model architecture. Our method begins with an analysis of reflectional symmetry in both real-world 3D shapes and samples generated by state-of-the-art models. We hypothesize that training a generative model exclusively on half-objects, obtained by reflecting one half of the shapes along the x=0 plane, enables the model to learn a rich distribution of partial geometries which, when reflected during generation, yield complete shapes that are both visually plausible and geometrically symmetric. To test this, we construct a new dataset of half-objects from three ShapeNet classes (Airplane, Car, and Chair) and train two generative models. Experiments demonstrate that the generated shapes are symmetrical and consistent, compared with the generated objects from the original model and the original dataset objects.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hang Yu, Juntu Zhao, Yufeng Liu, Kaiyu Li, Cheng Ma et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompt, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce the Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Raina Panda, Daniel Fein, Arpita Singhal, Mark Fiore, Maneesh Agrawala et al.
RoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Artistic style transfer in generative models remains a significant challenge, as existing methods often introduce style only via model fine-tuning, additional adapters, or prompt engineering, all of which can be computationally expensive and may still entangle style with subject matter. In this paper, we introduce a training- and inference-light, interpretable method for representing and transferring artistic style. Our approach leverages an art-specific Sparse Autoencoder (SAE) on top of latent embeddings of generative image models. Trained on artistic data, our SAE learns an emergent, largely disentangled set of stylistic and compositional concepts, corresponding to style-related elements pertaining brushwork, texture, and color palette, as well as semantic and structural concepts. We call it LouvreSAE and use it to construct style profiles: compact, decomposable steering vectors that enable style transfer without any model updates or optimization. Unlike prior concept-based style transfer methods, our method requires no fine-tuning, no LoRA training, and no additional inference passes, enabling direct steering of artistic styles from only a few reference images. We validate our method on ArtBench10, achieving or surpassing existing methods on style evaluations (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster and, critically, interpretable.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a dataevolving: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent's ``zone of proximal development''. This process is guided by a simple but effective $ฮฑ$-Curriculum Reward, which aligns task difficulty with the agent's current capabilities. We evaluate GenEnv on five benchmarks, including API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to \textbf{+40.3\%} over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3$\times$ less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and raveling out complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policy, we find that: (a) Early layers keep high entropy for exploration, top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) LLama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning training objective at lower layer, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrates the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Filippos Ventirozos, Peter Appleby, Matthew Shardlow
Natural Language Processing
ๆฅ็ๆ่ฆ
Aspect-Category Sentiment Analysis (ACSA) provides granular insights by identifying specific themes within reviews and their associated sentiment. While supervised learning approaches dominate this field, the scarcity and high cost of annotated data for new domains present significant barriers. We argue that leveraging large language models (LLMs) in a zero-shot setting is a practical alternative where resources for data annotation are limited. In this work, we propose a novel Chain-of-Thought (CoT) prompting technique that utilises an intermediate Unified Meaning Representation (UMR) to structure the reasoning process for the ACSA task. We evaluate this UMR-based approach against a standard CoT baseline across three models (Qwen3-4B, Qwen3-8B, and Gemini-2.5-Pro) and four diverse datasets. Our findings suggest that UMR effectiveness may be model-dependent. Whilst preliminary results indicate comparable performance for mid-sized models such as Qwen3-8B, these observations warrant further investigation, particularly regarding the potential applicability to smaller model architectures. Further research is required to establish the generalisability of these findings across different model scales.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rolando Coto-Solano, Daisy Li, Manoela Teleginski Ferraz, Olivia Sasse, Cha Krupka et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Mฤori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritics restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resourced contexts.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zahra Sadeghi, Evangelos Milios, Frank Rudzicz
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, requiring a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges to evaluate summaries with respect to the original text. While previous research investigated the alignment between LLMs and Human responses, it is not yet well understood what properties or features are exploited by them when asked to evaluate based on a particular quality dimension, and there has not been much attention towards mapping between evaluation scores and metrics. In this paper, we address this issue and discover features aligned with Human and Generative Pre-trained Transformers (GPTs) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ metrics used by Human can improve their judgment and conforming them better with human responses.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Angelo Ortiz Tandazo, Manel Khentout, Youssef Benchekroun, Thomas Hueber, Emmanuel Dupoux
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. We continue HuBERT pre-training with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations that capture multilingual phonetic properties. Through comprehensive ABX discriminability testing, we show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. Additionally, the models effectively adapt to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech). This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ignacio Iacobacci, Zhaozhi Qian, Faroq AL-Tam, Muhammad AL-Qurishi, Riad Souissi
Natural Language Processing
ๆฅ็ๆ่ฆ
Recently, a new wave of thinking-capable Large Language Models has emerged, demonstrating exceptional capabilities across a wide range of reasoning benchmarks. Early studies have begun to explore how the amount of compute in terms of the length of the reasoning process, the so-called thinking budget, impacts model performance. In this work, we propose a systematic investigation of the thinking budget as a key parameter, examining its interaction with various configurations such as self-consistency, reflection, and others. Our goal is to provide an informative, balanced comparison framework that considers both performance outcomes and computational cost. Among our findings, we discovered that simply increasing the thinking budget is not the most effective use of compute. More accurate responses can instead be achieved through alternative configurations, such as self-consistency and self-reflection.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zakaria Benmounah, Abdennour Boulesnane
Natural Language Processing
ๆฅ็ๆ่ฆ
We present Algerian Dialect, a large-scale sentiment-annotated dataset consisting of 45,000 YouTube comments written in Algerian Arabic dialect. The comments were collected from more than 30 Algerian press and media channels using the YouTube Data API. Each comment is manually annotated into one of five sentiment categories: very negative, negative, neutral, positive, and very positive. In addition to sentiment labels, the dataset includes rich metadata such as collection timestamps, like counts, video URLs, and annotation dates. This dataset addresses the scarcity of publicly available resources for Algerian dialect and aims to support research in sentiment analysis, dialectal Arabic NLP, and social media analytics. The dataset is publicly available on Mendeley Data under a CC BY 4.0 license at https://doi.org/10.17632/zzwg3nnhsz.2.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bobo Li, Xudong Han, Jiang Liu, Yuzhe Ding, Liqiang Jing et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large language models (LLMs) and multimodal LLMs are changing event extraction (EE): prompting and generation can often produce structured outputs in zero shot or few shot settings. Yet LLM based pipelines face deployment gaps, including hallucinations under weak constraints, fragile temporal and causal linking over long contexts and across documents, and limited long horizon knowledge management within a bounded context window. We argue that EE should be viewed as a system component that provides a cognitive scaffold for LLM centered solutions. Event schemas and slot constraints create interfaces for grounding and verification; event centric structures act as controlled intermediate representations for stepwise reasoning; event links support relation aware retrieval with graph based RAG; and event stores offer updatable episodic and agent memory beyond the context window. This survey covers EE in text and multimodal settings, organizing tasks and taxonomy, tracing method evolution from rule based and neural models to instruction driven and generative frameworks, and summarizing formulations, decoding strategies, architectures, representations, datasets, and evaluation. We also review cross lingual, low resource, and domain specific settings, and highlight open challenges and future directions for reliable event centric systems. Finally, we outline open challenges and future directions that are central to the LLM era, aiming to evolve EE from static extraction into a structurally reliable, agent ready perception and memory layer for open world systems.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ivan Decostanzi, Yelena Mejova, Kyriaki Kalimeri
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Timely and accurate situational reports are essential for humanitarian decision-making, yet current workflows remain largely manual, resource intensive, and inconsistent. We present a fully automated framework that uses large language models (LLMs) to transform heterogeneous humanitarian documents into structured and evidence-grounded reports. The system integrates semantic text clustering, automatic question generation, retrieval augmented answer extraction with citations, multi-level summarization, and executive summary generation, supported by internal evaluation metrics that emulate expert reasoning. We evaluated the framework across 13 humanitarian events, including natural disasters and conflicts, using more than 1,100 documents from verified sources such as ReliefWeb. The generated questions achieved 84.7 percent relevance, 84.0 percent importance, and 76.4 percent urgency. The extracted answers reached 86.3 percent relevance, with citation precision and recall both exceeding 76 percent. Agreement between human and LLM based evaluations surpassed an F1 score of 0.80. Comparative analysis shows that the proposed framework produces reports that are more structured, interpretable, and actionable than existing baselines. By combining LLM reasoning with transparent citation linking and multi-level evaluation, this study demonstrates that generative AI can autonomously produce accurate, verifiable, and operationally useful humanitarian situation reports.
๐
2025-12-22
๐ ArXiv Preprint (cs.CY)
๐ฅ Walter Quattrociocchi, Valerio Capraro, Matjaลพ Perc
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are widely described as artificial intelligence, yet their epistemic profile diverges sharply from human cognition. Here we show that the apparent alignment between human and machine outputs conceals a deeper structural mismatch in how judgments are produced. Tracing the historical shift from symbolic AI and information filtering systems to large-scale generative transformers, we argue that LLMs are not epistemic agents but stochastic pattern-completion systems, formally describable as walks on high-dimensional graphs of linguistic transitions rather than as systems that form beliefs or models of the world. By systematically mapping human and artificial epistemic pipelines, we identify seven epistemic fault lines, divergences in grounding, parsing, experience, motivation, causal reasoning, metacognition, and value. We call the resulting condition Epistemia: a structural situation in which linguistic plausibility substitutes for epistemic evaluation, producing the feeling of knowing without the labor of judgment. We conclude by outlining consequences for evaluation, governance, and epistemic literacy in societies increasingly organized around generative AI.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jinwei Chi, Ke Wang, Yu Chen, Xuanye Lin, Qiang Xu
Natural Language Processing
ๆฅ็ๆ่ฆ
Automated essay scoring (AES) is a challenging task in cross-prompt settings due to the diversity of scoring criteria. While previous studies have focused on the output of large language models (LLMs) to improve scoring accuracy, we believe activations from intermediate layers may also provide valuable information. To explore this possibility, we evaluated the discriminative power of LLMs' activations in cross-prompt essay scoring task. Specifically, we used activations to fit probes and further analyzed the effects of different models and input content of LLMs on this discriminative power. By computing the directions of essays across various trait dimensions under different prompts, we analyzed the variation in evaluation perspectives of large language models concerning essay types and traits. Results show that the activations possess strong discriminative power in evaluating essay quality and that LLMs can adapt their evaluation perspectives to different traits and essay types, effectively handling the diversity of scoring criteria in cross-prompt settings.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Thittipat Pairatsuppawat, Abhibhu Tachaapornchai, Paweekorn Kusolsomboon, Chutikan Chaiwong, Thodsaporn Chay-intr et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Open-weights large language models remain difficult to deploy for Thai due to unstable generation under complex instructions, despite strong English performance. To mitigate these limitations, We present SiamGPT-32B, an open-weights model based on Qwen3-32B, fine-tuned with a Quality-First strategy emphasizing curated supervision over data scale. The fine-tuning pipeline combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework for instruction and linguistic constraints. Using supervised fine-tuning only, without continual pretraining or corpus expansion, SiamGPT-32B improves instruction adherence, multi-turn robustness, and linguistic stability. Evaluations on the SEA-HELM benchmark show that SiamGPT-32B achieves the strongest overall performance among similar-scale open-weights Thai models, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide snapshot-based container environment and precise functional verifications, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jian Yang, Wei Zhang, Yizhi Li, Shawn Guo, Haowen Wang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. However, a critical challenge remains in ensuring LLMs generate factually accurate responses about programming concepts, technical implementations, etc. Most previous code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions, which contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. Further, we create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.
๐
2025-12-22
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Jiaren Peng, Hongda Sun, Xuan Tian, Cheng Huang, Zeqing Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The automation of Cyber Threat Intelligence (CTI) relies heavily on Named Entity Recognition (NER) to extract critical entities from unstructured text. Currently, Large Language Models (LLMs) primarily address this task through retrieval-based In-Context Learning (ICL). This paper analyzes this mainstream paradigm, revealing a fundamental flaw: its success stems not from global semantic similarity but largely from the incidental overlap of entity types within retrieved examples. This exposes the limitations of relying on unreliable implicit induction. To address this, we propose TTPrompt, a framework shifting from implicit induction to explicit instruction. TTPrompt maps the core concepts of CTI's Tactics, Techniques, and Procedures (TTPs) into an instruction hierarchy: formulating task definitions as Tactics, guiding strategies as Techniques, and annotation guidelines as Procedures. Furthermore, to handle the adaptability challenge of static guidelines, we introduce Feedback-driven Instruction Refinement (FIR). FIR enables LLMs to self-refine guidelines by learning from errors on minimal labeled data, adapting to distinct annotation dialects. Experiments on five CTI NER benchmarks demonstrate that TTPrompt consistently surpasses retrieval-based baselines. Notably, with refinement on just 1% of training data, it rivals models fine-tuned on the full dataset. For instance, on LADDER, its Micro F1 of 71.96% approaches the fine-tuned baseline, and on the complex CTINexus, its Macro F1 exceeds the fine-tuned ACLM model by 10.91%.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yacouba Diarra, Panga Azazia Kamate, Nouhoum Souleymane Coulibaly, Michael Leventhal
ๆฅ็ๆ่ฆ
We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetuned Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47\% to 37.12\% on one and from 36.07\% to 32.33\% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhiqing Hu, Chenxu Zhao, Jiazhong Lu, Xiaolei Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Misuse of LLM-generated text can be curbed by watermarking techniques that embed implicit signals into the output. We propose a watermark that partitions the vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios and restricts sampling to the Green and Yellow sets. At detection time, we replay the same partitions, compute Green-enrichment and Red-depletion statistics, convert them to one-sided z-scores, and aggregate their p-values via Fisher's method to decide whether a passage is watermarked. We implement generation, detection, and testing on Llama 2 7B, and evaluate true-positive rate, false-positive rate, and text quality. Results show that the triple-partition scheme achieves high detection accuracy at fixed FPR while preserving readability.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yayuan Li, Jian Zhang, Jintao Guo, Zihan Cheng, Lei Qi et al.
Computer Vision
ๆฅ็ๆ่ฆ
The proliferation of pre-trained models has given rise to a wide array of specialised, fine-tuned models. Model merging aims to merge the distinct capabilities of these specialised models into a unified model, requiring minimal or even no additional training. A core objective of model merging is to ensure the merged model retains the behavioural characteristics of the specialised models, typically achieved through feature alignment. We identify that features consist of two critical components: direction and magnitude. Prior research has predominantly focused on directional alignment, while the influence of magnitude remains largely neglected, despite its pronounced vulnerability to perturbations introduced by common merging operations (e.g., parameter fusion and sparsification). Such perturbations to magnitude inevitably lead to feature deviations in the merged model from the specialised models, resulting in subsequent performance degradation. To address this, we propose MAGnItude Calibration (MAGIC), a plug-and-play framework that rectifies layer-wise magnitudes in feature and weight spaces, with three variants. Specifically, our Feature Space Calibration (FSC) realigns the merged model's features using a small set of unlabelled data, while Weight Space Calibration (WSC) extends this calibration to the weight space without requiring additional data. Combining these yields Dual Space Calibration (DSC). Comprehensive experiments demonstrate that MAGIC consistently boosts performance across diverse Computer Vision tasks (+4.3% on eight datasets) and NLP tasks (+8.0% on Llama) without additional training. Our code is available at: https://github.com/lyymuwu/MAGIC
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Javier Vela-Tambo, Jorge Gracia, Fernando Dominguez-Castro
Natural Language Processing
ๆฅ็ๆ่ฆ
Understanding and monitoring the socio-economic impacts of climate hazards requires extracting structured information from heterogeneous news articles on a large scale. To that end, we have developed CienaLLM, a modular framework based on schema-guided Generative Information Extraction. CienaLLM uses open-weight Large Language Models for zero-shot information extraction from news articles, and supports configurable prompts and output schemas, multi-step pipelines, and cloud or on-premise inference. To systematically assess how the choice of LLM family, size, precision regime, and prompting strategy affect performance, we run a large factorial study in models, precisions, and prompt engineering techniques. An additional response parsing step nearly eliminates format errors while preserving accuracy; larger models deliver the strongest and most stable performance, while quantization offers substantial efficiency gains with modest accuracy trade-offs; and prompt strategies show heterogeneous, model-specific effects. CienaLLM matches or outperforms the supervised baseline in accuracy for extracting drought impacts from Spanish news, although at a higher inference cost. While evaluated in droughts, the schema-driven and model-agnostic design is suitable for adapting to related information extraction tasks (e.g., other hazards, sectors, or languages) by editing prompts and schemas rather than retraining. We release code, configurations, and schemas to support reproducible use.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Do Minh Duc, Quan Xuan Truong, Nguyen Tat Dat, Nguyen Van Vinh
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Prompt engineering plays a critical role in adapting large language models (LLMs) to complex reasoning and labeling tasks without the need for extensive fine-tuning. In this paper, we propose a novel prompt optimization pipeline for frame detection in logistics texts, combining retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT) to generate highly effective task-specific prompts. Central to our approach is an LLM-based prompt optimizer agent that iteratively refines the prompts using retrieved examples, performance feedback, and internal self-evaluation. Our framework is evaluated on a real-world logistics text annotation task, where reasoning accuracy and labeling efficiency are critical. Experimental results show that the optimized prompts - particularly those enhanced via Auto-CoT and RAG - improve real-world inference accuracy by up to 15% compared to baseline zero-shot or static prompts. The system demonstrates consistent improvements across multiple LLMs, including GPT-4o, Qwen 2.5 (72B), and LLaMA 3.1 (70B), validating its generalizability and practical value. These findings suggest that structured prompt optimization is a viable alternative to full fine-tuning, offering scalable solutions for deploying LLMs in domain-specific NLP applications such as logistics.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mingxu Zhang, Dazhong Shen, Qi Zhang, Ying Sun
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) exhibit strong general reasoning but struggle in molecular science due to the lack of explicit chemical priors in standard string representations. Current solutions face a fundamental dilemma. Training-based methods inject priors into parameters, but this static coupling hinders rapid knowledge updates and often compromises the model's general reasoning capabilities. Conversely, existing training-free methods avoid these issues but rely on surface-level prompting, failing to provide the fine-grained atom-level priors essential for precise chemical reasoning. To address this issue, we introduce ChemATP, a framework that decouples chemical knowledge from the reasoning engine. By constructing the first atom-level textual knowledge base, ChemATP enables frozen LLMs to explicitly retrieve and reason over this information dynamically. This architecture ensures interpretability and adaptability while preserving the LLM's intrinsic general intelligence. Experiments show that ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anna-Maria Gueorguieva, Aylin Caliskan
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have been shown to exhibit social bias, however, bias towards non-protected stigmatized identities remain understudied. Furthermore, what social features of stigmas are associated with bias in LLM outputs is unknown. From psychology literature, it has been shown that stigmas contain six shared social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate if human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example deciding wether to recommend them for an internship. We find that stigmas rated by humans to be highly perilous (e.g., being a gang member or having HIV) have the most biased outputs from SocialStigmaQA prompts (60% of outputs from all models) while sociodemographic stigmas (e.g. Asian-American or old age) have the least amount of biased outputs (11%). We test if the amount of biased outputs could be decreased by using guardrail models, models meant to identify harmful input, using each LLM's respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API). We find that bias decreases significantly by 10.4%, 1.4%, and 7.8%, respectively. However, we show that features with significant effect on bias remain unchanged post-mitigation and that guardrail models often fail to recognize the intent of bias in prompts. This work has implications for using LLMs in scenarios involving stigmatized groups and we suggest future work towards improving guardrail models for bias mitigation.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dazhen Deng, Sen Yang, Yuchen He, Yuan Tian, Yingcai Wu
Natural Language Processing
ๆฅ็ๆ่ฆ
Current chart-specific tasks, such as chart question answering, chart parsing, and chart generation, are typically studied in isolation, preventing models from learning the shared semantics that link chart generation and interpretation. We introduce CycleChart, a consistency-based learning framework for bidirectional chart understanding and generation. CycleChart adopts a schema-centric formulation as a common interface across tasks. We construct a consistent multi-task dataset, where each chart sample includes aligned annotations for schema prediction, data parsing, and question answering. To learn cross-directional chart semantics, CycleChart introduces a generate-parse consistency objective: the model generates a chart schema from a table and a textual query, then learns to recover the schema and data from the generated chart, enforcing semantic alignment across directions. CycleChart achieves strong results on chart generation, chart parsing, and chart question answering, demonstrating improved cross-task generalization and marking a step toward more general chart understanding models.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bingyang Kelvin Liu, Ziyu Patrick Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
While Joint-Embedding Predictive Architecture (JEPA) has emerged as a powerful architecture for learning rich latent representations, it fundamentally lacks generative abilities. Meanwhile, latent space reasoning attempts for Transformer models like COCONUT do improve performance, but they ultimately rely on token-by-token generation, which still accumulates compounding error and relies on context information to gain reasoning insights. To address these limitations, we propose JEPA-Reasoner, a novel JEPA model enhanced with generative ability that reasons in latent space. We augment it with a separate action-taker model, Talker, to produce human-readable sentences. Our approach demonstrates that decoupling latent space reasoning and token generation enables JEPA-Reasoner to produce mixed latent vectors that might lay the foundation for multi-threaded reasoning, while performing autoregressive generation with superior robustness to compounding error.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Alessandro Lucca, Francesco Pierri
Machine Learning
ๆฅ็ๆ่ฆ
Subtitles are essential for video accessibility and audience engagement. Modern Automatic Speech Recognition (ASR) systems, built upon Encoder-Decoder neural network architectures and trained on massive amounts of data, have progressively reduced transcription errors on standard benchmark datasets. However, their performance in real-world production environments, particularly for non-English content like long-form Italian videos, remains largely unexplored. This paper presents a case study on developing a professional subtitling system for an Italian media company. To inform our system design, we evaluated four state-of-the-art ASR models (Whisper Large v2, AssemblyAI Universal, Parakeet TDT v3 0.6b, and WhisperX) on a 50-hour dataset of Italian television programs. The study highlights their strengths and limitations, benchmarking their performance against the work of professional human subtitlers. The findings indicate that, while current models cannot meet the media industry's accuracy needs for full autonomy, they can serve as highly effective tools for enhancing human productivity. We conclude that a human-in-the-loop (HITL) approach is crucial and present the production-grade, cloud-based infrastructure we designed to support this workflow.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng
Natural Language Processing
ๆฅ็ๆ่ฆ
Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao et al.
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, natively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) -- a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tzu-Yun Lee, Ding-Yong Hong, Jan-Jan Wu
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper introduces Syntactic Attention Pruning (SAP), a novel method for effectively pruning attention heads in Transformer models. Unlike conventional approaches that rely solely on mathematical analysis of model weights and activations, SAP incorporates both the syntactic structure and attention patterns of sentences to guide the pruning process. By leveraging these linguistic features, SAP not only achieves performance comparable to state-of-the-art methods but also enhances the interpretability of model behavior. To further improve robustness, we propose Candidate Filtering (CF), a mechanism that prioritizes heads based on their contribution to model performance, mitigating degradation during pruning. Experimental results indicate that SAP effectively preserves critical heads of a high density of strong attention values, outperforming existing head pruning strategies in retrain-free settings. These findings position SAP as a promising foundation for a new direction in model compression research, offering high flexibility for pruning across all transformer-based language models.
๐
2025-12-22
๐ ArXiv Preprint (cs.SE)
๐ฅ Mahir Labib Dihan, Sadif Ahmed, Md Nafiu Rahman
ๆฅ็ๆ่ฆ
Bangla is a low-resource language for code generation, lacking large-scale annotated datasets and tools to transform natural language specifications into executable programs. This makes Bangla-to-code generation a challenging task requiring innovative solutions. To address this, we introduce BanglaForge, a novel framework for generating code from Bangla function descriptions. BanglaForge leverages a retrieval-augmented dual-model collaboration paradigm with self-refinement, combining in-context learning, llm-based translation, systematic prompt engineering, and iterative self-refinement based on execution feedback, where a coder generates initial solutions and a reviewer enhances them for robustness. On the BLP-2025 Bangla Code Generation benchmark, BanglaForge achieves a competitive Pass@1 accuracy of 84.00%, demonstrating the effectiveness of retrieval, model collaboration, and self-refinement for low-resource Bangla code generation.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amar Lakel
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This paper proposes an epistemological shift in the analysis of large generative models, replacing the category ''Large Language Models'' (LLM) with that of ''Large Discourse Models'' (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the ''fascination/fear'' dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ziyan Zhang, Chao Wang, Zhuo Chen, Lei Chen, Chiyi Li et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruiqi Ma, Yu Yan, Chunhong Zhang, Minghao Yin, XinChao Liu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce Hallucination Disentangled Decoding (HDD) method that requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model's dependence on language priors but also enhances its visual performance. (Code: https://github.com/rickeyhhh/Hallucination-Disentangled-Decoding)
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shijian Ma, Yunqi Huang, Yan Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structurecapabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rulebased analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm all six dimensions capture independent quality aspects (mean | r | = 0.020). DramaBench provides actionable, dimensionspecific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.
๐
2025-12-22
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Akshaj Prashanth Rao, Advait Singh, Saumya Kumaar Saksena, Dhruv Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Prompt injection and jailbreaking attacks pose persistent security challenges to large language model (LLM)-based systems. We present an efficient and systematically evaluated defense architecture that mitigates these threats through a lightweight, multi-stage pipeline. Its core component is a semantic filter based on text normalization, TF-IDF representations, and a Linear SVM classifier. Despite its simplicity, this module achieves 93.4% accuracy and 96.5% specificity on held-out data, substantially reducing attack throughput while incurring negligible computational overhead.
Building on this efficient foundation, the full pipeline integrates complementary detection and mitigation mechanisms that operate at successive stages, providing strong robustness with minimal latency. In comparative experiments, our SVM-based configuration improves overall accuracy from 35.1% to 93.4% while reducing average time to completion from approximately 450s to 47s, yielding over 10 times lower latency than ShieldGemma. These results demonstrate that the proposed design simultaneously advances defensive precision and efficiency, addressing a core limitation of current model-based moderators.
Evaluation across a curated corpus of over 30,000 labeled prompts, including benign, jailbreak, and application-layer injections, confirms that staged, resource-efficient defenses can robustly secure modern LLM-driven applications.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tongyuan Miao, Gary Huang, Kai Jun Han, Annie Jiang
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization.
We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35\% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jinyan Liu, Zikang Chen, Qinchuan Wang, Tan Xie, Heming Zheng et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method-built on task decomposition, semantic clustering, and flow management-substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.
๐
2025-12-22
๐ ICRA 2026
๐ฅ Ryosuke Korekata, Quanting Xie, Yonatan Bisk, Komei Sugiura
RoboticsMultimodal
ๆฅ็ๆ่ฆ
In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wen-Long Jin
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) execute complex multi-turn interaction protocols but lack formal specifications to verify execution against designer intent. We introduce FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts, enabling conformance verification through execution trace analysis. The LLM serves as intelligent execution agent: interpreting designer-encoded FSMs to execute specified behavioral roles. Unlike symbolic specification languages requiring parsers and compilers, FASTRIC leverages LLMs as unified infrastructure-simultaneously parser, interpreter, runtime environment, and development assistant. FASTRIC guides designers to articulate seven FSM elements (Final States, Agents, States, Triggers, Roles, Initial State, Constraints) structuring multi-turn interactions. Specification formality-ranging from implicit descriptions that frontier models infer to explicit step-by-step instructions for weaker models-serves as a design parameter. We introduce procedural conformance as verification metric measuring execution adherence to FSM specifications. Testing a 3-state kindergarten tutoring FSM across four formality levels and three model scales (14.7B, 685B, 1T+ parameters) reveals optimal specification formality is a function of model capacity. DeepSeek-V3.2 (685B) achieves perfect conformance (1.00) at L2-L4; ChatGPT-5 (~1T) peaks at L3 (0.90) before collapsing at L4 (0.39); Phi4 (14.7B) shows no stable optimum with high variance (SD=0.16-0.36). These findings reveal model-specific formality ranges-"Goldilocks zones"-where specifications provide sufficient structure without over-constraint, establishing Prompt Specification Engineering for creating verifiable interaction protocols, transforming multi-turn interaction design from heuristic art to systematic engineering with measurable procedural guarantees.
๐
2025-12-22
๐ ArXiv Preprint (cs.SD)
๐ฅ Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects-avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and raveling out complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policy, we find that: (a) Early layers keep high entropy for exploration, top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) LLama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning training objective at lower layer, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrates the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.
๐
2025-12-22
๐ NeurIPS 2025
๐ฅ Aleksey Minabutdinov, Patrick Cheridito
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We introduce a novel deep learning algorithm for computing convex conjugates of differentiable convex functions, a fundamental operation in convex analysis with various applications in different fields such as optimization, control theory, physics and economics. While traditional numerical methods suffer from the curse of dimensionality and become computationally intractable in high dimensions, more recent neural network-based approaches scale better, but have mostly been studied with the aim of solving optimal transport problems and require the solution of complicated optimization or max-min problems. Using an implicit Fenchel formulation of convex conjugation, our approach facilitates an efficient gradient-based framework for the minimization of approximation errors and, as a byproduct, also provides a posteriori error estimates for the approximation quality. Numerical experiments demonstrate our method's ability to deliver accurate results across different high-dimensional examples. Moreover, by employing symbolic regression with Kolmogorov--Arnold networks, it is able to obtain the exact convex conjugates of specific convex functions.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rajyasri Roy, Dibyajyoti Nayak, Somdatta Goswami
Robotics
ๆฅ็ๆ่ฆ
Numerical simulation of time-dependent partial differential equations (PDEs) is central to scientific and engineering applications, but high-fidelity solvers are often prohibitively expensive for long-horizon or time-critical settings. Neural operator (NO) surrogates offer fast inference across parametric and functional inputs; however, most autoregressive NO frameworks remain vulnerable to compounding errors, and ensemble-averaged metrics provide limited guarantees for individual inference trajectories. In practice, error accumulation can become unacceptable beyond the training horizon, and existing methods lack mechanisms for online monitoring or correction. To address this gap, we propose ANCHOR (Adaptive Numerical Correction for High-fidelity Operator Rollouts), an online, instance-aware hybrid inference framework for stable long-horizon prediction of nonlinear, time-dependent PDEs. ANCHOR treats a pretrained NO as the primary inference engine and adaptively couples it with a classical numerical solver using a physics-informed, residual-based error estimator. Inspired by adaptive time-stepping in numerical analysis, ANCHOR monitors an exponential moving average (EMA) of the normalized PDE residual to detect accumulating error and trigger corrective solver interventions without requiring access to ground-truth solutions. We show that the EMA-based estimator correlates strongly with the true relative L2 error, enabling data-free, instance-aware error control during inference. Evaluations on four canonical PDEs: 1D and 2D Burgers', 2D Allen-Cahn, and 3D heat conduction, demonstrate that ANCHOR reliably bounds long-horizon error growth, stabilizes extrapolative rollouts, and significantly improves robustness over standalone neural operators, while remaining substantially more efficient than high-fidelity numerical solvers.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Eric Zimmermann, Harley Wiltzer, Justin Szeto, David Alvarez-Melis, Lester Mackey
ๆฅ็ๆ่ฆ
Recent breakthroughs in self-supervised Joint-Embedding Predictive Architectures (JEPAs) have established that regularizing Euclidean representations toward isotropic Gaussian priors yields provable gains in training stability and downstream generalization. We introduce a new, flexible family of KerJEPAs, self-supervised learning algorithms with kernel-based regularizers. One instance of this family corresponds to the recently-introduced LeJEPA Epps-Pulley regularizer which approximates a sliced maximum mean discrepancy (MMD) with a Gaussian prior and Gaussian kernel. By expanding the class of viable kernels and priors and computing the closed-form high-dimensional limit of sliced MMDs, we develop alternative KerJEPAs with a number of favorable properties including improved training stability and design flexibility.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universitรคt Wรผrzburg in cooperation with the Technische Universitรคt Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yongxin Wang, Zhicheng Yang, Meng Cao, Mingfei Han, Haokun Lin et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling (RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Naresh Manwani, M Elamparithy, Tanish Taneja
ๆฅ็ๆ่ฆ
In this paper, we introduce directional feedback in the ordinal regression setting, in which the learner receives feedback on whether the predicted label is on the left or the right side of the actual label. This is a weak supervision setting for ordinal regression compared to the full information setting, where the learner can access the labels. We propose an online algorithm for ordinal regression using directional feedback. The proposed algorithm uses an exploration-exploitation scheme to learn from directional feedback efficiently. Furthermore, we introduce its kernel-based variant to learn non-linear ordinal regression models in an online setting. We use a truncation trick to make the kernel implementation more memory efficient. The proposed algorithm maintains the ordering of the thresholds in the expected sense. Moreover, it achieves the expected regret of $\mathcal{O}(\log T)$. We compare our approach with a full information and a weakly supervised algorithm for ordinal regression on synthetic and real-world datasets. The proposed approach, which learns using directional feedback, performs comparably (sometimes better) to its full information counterpart.
๐
2025-12-22
๐ ArXiv Preprint (physics.optics)
๐ฅ Adrian A. Moazzam, Anindya Ghoshroy, Breeanne Heusdens, Durdu O. Guney, Roohollah Askari
Machine Learning
ๆฅ็ๆ่ฆ
Atmospheric turbulence imposes a fundamental limitation across a broad range of applications, including optical imaging, remote sensing, and free-space optical communication. Recent advances in adaptive optics, wavefront shaping, and machine learning, driven by synergistic progress in fundamental theories, optoelectronic hardware, and computational algorithms, have demonstrated substantial potential in mitigating turbulence-induced distortions. Recently, active convolved illumination (ACI) was proposed as a versatile and physics-driven technique for transmitting structured light beams with minimal distortion through highly challenging turbulent regimes. While distinct in its formulation, ACI shares conceptual similarities with other physics-driven distortion correction approaches and stands to benefit from complementary integration with data-driven deep learning (DL) models. Inspired by recent work coupling deep learning with traditional turbulence mitigation strategies, the present work investigates the feasibility of integrating ACI with neural network-based methods. We outline a conceptual framework for coupling ACI with data-driven models and identify conditions under which learned representations can meaningfully support ACI's correlation-injection mechanism. As a representative example, we employ a convolutional neural network (CNN) together with a transfer-learning approach to examine how a learned model may operate in tandem with ACI. This exploratory study demonstrates feasible implementation pathways and establishes an early foundation for assessing the potential of future ACI-DL hybrid architectures, representing a step toward evaluating broader synergistic interactions between ACI and modern DL models.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongsheng Xing, Qiuxin Si
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Predicting reaction outcomes across continuous solvent composition ranges remains a critical challenge in organic synthesis and process chemistry. Traditional machine learning approaches often treat solvent identity as a discrete categorical variable, which prevents systematic interpolation and extrapolation across the solvent space. This work introduces the \textbf{Catechol Benchmark}, a high-throughput transient flow chemistry dataset comprising 1,227 experimental yield measurements for the rearrangement of allyl-substituted catechol in 24 pure solvents and their binary mixtures, parameterized by continuous volume fractions ($\% B$). We evaluate various architectures under rigorous leave-one-solvent-out and leave-one-mixture-out protocols to test generalization to unseen chemical environments.
Our results demonstrate that classical tabular methods (e.g., Gradient-Boosted Decision Trees) and large language model embeddings (e.g., Qwen-7B) struggle with quantitative precision, yielding Mean Squared Errors (MSE) of 0.099 and 0.129, respectively. In contrast, we propose a hybrid GNN-based architecture that integrates Graph Attention Networks (GATs) with Differential Reaction Fingerprints (DRFP) and learned mixture-aware solvent encodings. This approach achieves an \textbf{MSE of 0.0039} ($\pm$ 0.0003), representing a 60\% error reduction over competitive baselines and a $>25\times$ improvement over tabular ensembles. Ablation studies confirm that explicit molecular graph message-passing and continuous mixture encoding are essential for robust generalization. The complete dataset, evaluation protocols, and reference implementations are released to facilitate data-efficient reaction prediction and continuous solvent representation learning.
๐
2025-12-22
๐ 24th IEEE International Conference on Machine Lear
๐ฅ Diego Hitzges, Guillaume Sagnol
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning has been effectively applied to many discrete optimization problems. However, learning-based scheduling on unrelated parallel machines remains particularly difficult to design. Not only do the numbers of jobs and machines vary, but each job-machine pair has a unique processing time, dynamically altering feature dimensions. We propose a novel approach with a neural network tailored for offline deterministic scheduling of arbitrary sizes on unrelated machines. The goal is to minimize a complex objective function that includes the makespan and the weighted tardiness of jobs and machines. Unlike existing online approaches, which process jobs sequentially, our method generates a complete schedule considering the entire input at once. The key contribution of this work lies in the sophisticated architecture of our model. By leveraging various NLP-inspired architectures, it effectively processes any number of jobs and machines with varying feature dimensions imposed by unrelated processing times. Our approach enables supervised training on small problem instances while demonstrating strong generalization to much larger scheduling environments. Trained and tested on instances with 8 jobs and 4 machines, costs were only 2.51% above optimal. Across all tested configurations of up to 100 jobs and 10 machines, our network consistently outperformed an advanced dispatching rule, which incurred 22.22% higher costs on average. As our method allows fast retraining with simulated data and adaptation to various scheduling conditions, we believe it has the potential to become a standard approach for learning-based scheduling on unrelated machines and similar problem environments.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuriy N. Bakhvalov
Machine Learning
ๆฅ็ๆ่ฆ
This paper concludes a series of studies on the polyharmonic cascade, a deep machine learning architecture theoretically derived from indifference principles and the theory of random functions. A universal initialization procedure is proposed, based on symmetric constellations in the form of hyperoctahedra with a central point. This initialization not only ensures stable training of cascades with tens and hundreds of layers (up to 500 layers without skip connections), but also radically simplifies the computations. Scalability and robustness are demonstrated on MNIST (98.3% without convolutions or augmentations), HIGGS (AUC approximately 0.885 on 11M examples), and Epsilon (AUC approximately 0.963 with 2000 features). All linear algebra is reduced to 2D operations and is efficiently executed on GPUs. A public repository and an archived snapshot are provided for full reproducibility.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xueming Yan, Bo Yin, Yaochu Jin
Machine Learning
ๆฅ็ๆ่ฆ
Multiobjective reinforcement learning (MORL) poses significant challenges due to the inherent conflicts between objectives and the difficulty of adapting to dynamic environments. Traditional methods often struggle to generalize effectively, particularly in large and complex state-action spaces. To address these limitations, we introduce the Latent Causal Diffusion Model (LacaDM), a novel approach designed to enhance the adaptability of MORL in discrete and continuous environments. Unlike existing methods that primarily address conflicts between objectives, LacaDM learns latent temporal causal relationships between environmental states and policies, enabling efficient knowledge transfer across diverse MORL scenarios. By embedding these causal structures within a diffusion model-based framework, LacaDM achieves a balance between conflicting objectives while maintaining strong generalization capabilities in previously unseen environments. Empirical evaluations on various tasks from the MOGymnasium framework demonstrate that LacaDM consistently outperforms the state-of-art baselines in terms of hypervolume, sparsity, and expected utility maximization, showcasing its effectiveness in complex multiobjective tasks.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alek Frohlich, Vladimir Kostic, Karim Lounici, Daniel Perazzo, Massimiliano Pontil
Natural Language Processing
ๆฅ็ๆ่ฆ
Conditional independence (CI) is central to causal inference, feature selection, and graphical modeling, yet it is untestable in many settings without additional assumptions. Existing CI tests often rely on restrictive structural conditions, limiting their validity on real-world data. Kernel methods using the partial covariance operator offer a more principled approach but suffer from limited adaptivity, slow convergence, and poor scalability. In this work, we explore whether representation learning can help address these limitations. Specifically, we focus on representations derived from the singular value decomposition of the partial covariance operator and use them to construct a simple test statistic, reminiscent of the Hilbert-Schmidt Independence Criterion (HSIC). We also introduce a practical bi-level contrastive algorithm to learn these representations. Our theory links representation learning error to test performance and establishes asymptotic validity and power guarantees. Preliminary experiments suggest that this approach offers a practical and statistically grounded path toward scalable CI testing, bridging kernel-based theory with modern representation learning.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongliang Li, Nong Zhang, Zhewen Xu, Xiang Li, Changzheng Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Understanding and predicting the Madden-Julian Oscillation (MJO) is fundamental for precipitation forecasting and disaster prevention. To date, long-term and accurate MJO prediction has remained a challenge for researchers. Conventional MJO prediction methods using Numerical Weather Prediction (NWP) are resource-intensive, time-consuming, and highly unstable (most NWP methods are sensitive to seasons, with better MJO forecast results in winter). While existing Artificial Neural Network (ANN) methods save resources and speed forecasting, their accuracy never reaches the 28 days predicted by the state-of-the-art NWP method, i.e., the operational forecasts from ECMWF, since neural networks cannot handle climate data effectively. In this paper, we present a Domain Knowledge Embedded Spatio-Temporal Network (DK-STN), a stable neural network model for accurate and efficient MJO forecasting. It combines the benefits of NWP and ANN methods and successfully improves the forecast accuracy of ANN methods while maintaining a high level of efficiency and stability. We begin with a spatial-temporal network (STN) and embed domain knowledge in it using two key methods: (i) applying a domain knowledge enhancement method and (ii) integrating a domain knowledge processing method into network training. We evaluated DK-STN with the 5th generation of ECMWF reanalysis (ERA5) data and compared it with ECMWF. Given 7 days of climate data as input, DK-STN can generate reliable forecasts for the following 28 days in 1-2 seconds, with an error of only 2-3 days in different seasons. DK-STN significantly exceeds ECMWF in that its forecast accuracy is equivalent to ECMWF's, while its efficiency and stability are significantly superior.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikita Volzhin, Soowhan Yoon
Machine Learning
ๆฅ็ๆ่ฆ
The recent development of Kolmogorov-Arnold Networks (KANs) introduced new discoveries in the field of Graph Neural Networks (GNNs), expanding the existing set of models with KAN-based versions of GNNs, which often surpass the accuracy of MultiLayer Perceptron (MLP)-based GNNs. These models were widely tested on the graph datasets consisting of organic molecules; however, those studies disregarded the inorganic nanomaterials datasets. In this work, we close this gap by applying Kolmogorov-Arnold Graph Neural Networks (KAGNNs) to a recently published large inorganic nanomaterials dataset called CHILI. For this, we adapt and test KAGNNs appropriate for this dataset. Our experiments reveal that on the CHILI datasets, particularly on the CHILI-3K, KAGNNs substantially surpass conventional GNNs in classification, achieving state-of-the-art results.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Martรญ Medina-Hern รกndez, Janos Kertรฉsz, Mihรกly Fazekas
Machine Learning
ๆฅ็ๆ่ฆ
Detecting fraud and corruption in public procurement remains a major challenge for governments worldwide. Most research to-date builds on domain-knowledge-based corruption risk indicators of individual contract-level features and some also analyzes contracting network patterns. A critical barrier for supervised machine learning is the absence of confirmed non-corrupt, negative, examples, which makes conventional machine learning inappropriate for this task. Using publicly available data on federally funded procurement in Mexico and company sanction records, this study implements positive-unlabeled (PU) learning algorithms that integrate domain-knowledge-based red flags with network-derived features to identify likely corrupt and fraudulent contracts. The best-performing PU model on average captures 32 percent more known positives and performs on average 2.3 times better than random guessing, substantially outperforming approaches based solely on traditional red flags. The analysis of the Shapley Additive Explanations reveals that network-derived features, particularly those associated with contracts in the network core or suppliers with high eigenvector centrality, are the most important. Traditional red flags further enhance model performance in line with expectations, albeit mainly for contracts awarded through competitive tenders. This methodology can support law enforcement in Mexico, and it can be adapted to other national contexts too.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hafsa Benaddi, Mohammed Jouhari, Nouha Laamech, Anas Motii, Khalil Ibrahimi
Machine Learning
ๆฅ็ๆ่ฆ
The widespread deployment of Internet of Things (IoT) devices requires intrusion detection systems (IDS) with high accuracy while operating under strict resource constraints. Conventional deep learning IDS are often too large and computationally intensive for edge deployment. We propose a lightweight IDS that combines SHAP-guided feature pruning with knowledge-distilled Kronecker networks. A high-capacity teacher model identifies the most relevant features through SHAP explanations, and a compressed student leverages Kronecker-structured layers to minimize parameters while preserving discriminative inputs. Knowledge distillation transfers softened decision boundaries from teacher to student, improving generalization under compression. Experiments on the TON\_IoT dataset show that the student is nearly three orders of magnitude smaller than the teacher yet sustains macro-F1 above 0.986 with millisecond-level inference latency. The results demonstrate that explainability-driven pruning and structured compression can jointly enable scalable, low-latency, and energy-efficient IDS for heterogeneous IoT environments.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lorenzo Capelli, Leandro de Souza Rosa, Gianluca Setti, Mauro Mangia, Riccardo Rovatti
Machine Learning
ๆฅ็ๆ่ฆ
The recent explosive growth in Deep Neural Networks applications raises concerns about the black-box usage of such models, with limited trasparency and trustworthiness in high-stakes domains, which have been crystallized as regulatory requirements such as the European Union Artificial Intelligence Act. While models with embedded confidence metrics have been proposed, such approaches cannot be applied to already existing models without retraining, limiting their broad application. On the other hand, post-hoc methods, which evaluate pre-trained models, focus on solving problems related to improving the confidence in the model's predictions, and detecting Out-Of-Distribution or Adversarial Attacks samples as independent applications. To tackle the limited applicability of already existing methods, we introduce Multi-Layer Analysis for Confidence Scoring (MACS), a unified post-hoc framework that analyzes intermediate activations to produce classification-maps. From the classification-maps, we derive a score applicable for confidence estimation, detecting distributional shifts and adversarial attacks, unifying the three problems in a common framework, and achieving performances that surpass the state-of-the-art approaches in our experiments with the VGG16 and ViTb16 models with a fraction of their computational overhead.
๐
2025-12-22
๐ ArXiv Preprint (cs.CE)
๐ฅ Tim Aebersold, Soheyl Massoudi, Mark D. Fuge
Machine Learning
ๆฅ็ๆ่ฆ
Engineering complex systems (aircraft, buildings, vehicles) requires accounting for geometric and performance couplings across subsystems. As generative models proliferate for specialized domains (wings, structures, engines), a key research gap is how to coordinate frozen, pre-trained submodels to generate full-system designs that are feasible, diverse, and high-performing. We introduce Generative Latent Unification of Expertise-Informed Engineering Models (GLUE), which orchestrates pre-trained, frozen subsystem generators while enforcing system-level feasibility, optimality, and diversity. We propose and benchmark (i) data-driven GLUE models trained on pre-generated system-level designs and (ii) a data-free GLUE model trained online on a differentiable geometry layer. On a UAV design problem with five coupling constraints, we find that data-driven approaches yield diverse, high-performing designs but require large datasets to satisfy constraints reliably. The data-free approach is competitive with Bayesian optimization and gradient-based optimization in performance and feasibility while training a full generative model in only 10 min on a RTX 4090 GPU, requiring more than two orders of magnitude fewer geometry evaluations and FLOPs than the data-driven method. Ablations focused on data-free training show that subsystem output continuity affects coordination, and equality constraints can trigger mode collapse unless mitigated. By integrating unmodified, domain-informed submodels into a modular generative workflow, this work provides a viable path for scaling generative design to complex, real-world engineering systems.
๐
2025-12-22
๐ This work has been submitted to the IEEE for possi
๐ฅ Simon Welker, Bunlong Lay, Maris Hillemann, Tal Peer, Timo Gerkmann
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still lagging behind due to their computation-heavy nature involving multiple calls of large DNNs.
Here, we present Stream.FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real-time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few-step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for the speech enhancement task.
Our work looks beyond theoretical latencies, showing that high-quality streaming generative speech processing can be realized on consumer GPUs available today. Stream.FM can solve a variety of speech processing tasks in a streaming fashion: speech enhancement, dereverberation, codec post-filtering, bandwidth extension, STFT phase retrieval, and Mel vocoding. As we verify through comprehensive evaluations and a MUSHRA listening test, Stream.FM establishes a state-of-the-art for generative streaming speech restoration, exhibits only a reasonable reduction in quality compared to a non-streaming variant, and outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Antonio Consolo, Andrea Manno, Edoardo Amaldi
Machine Learning
ๆฅ็ๆ่ฆ
Kernel logistic regression (KLR) is a widely used supervised learning method for binary and multi-class classification, which provides estimates of the conditional probabilities of class membership for the data points. Unlike other kernel methods such as Support Vector Machines (SVMs), KLRs are generally not sparse. Previous attempts to deal with sparsity in KLR include a heuristic method referred to as the Import Vector Machine (IVM) and ad hoc regularizations such as the $\ell_{1/2}$-based one. Achieving a good trade-off between prediction accuracy and sparsity is still a challenging issue with a potential significant impact from the application point of view. In this work, we revisit binary KLR and propose an extension of the training formulation proposed by Keerthi et al., which is able to induce sparsity in the trained model, while maintaining good testing accuracy. To efficiently solve the dual of this formulation, we devise a decomposition algorithm of Sequential Minimal Optimization type which exploits second-order information, and for which we establish global convergence. Numerical experiments conducted on 12 datasets from the literature show that the proposed binary KLR approach achieves a competitive trade-off between accuracy and sparsity with respect to IVM, $\ell_{1/2}$-based regularization for KLR, and SVM while retaining the advantages of providing informative estimates of the class membership probabilities.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rixin Yu
ๆฅ็ๆ่ฆ
Learning accurate and stable time-advancement operators for nonlinear partial differential equations (PDEs) remains challenging, particularly for chaotic, stiff, and long-horizon dynamical systems. While neural operator methods such as the Fourier Neural Operator (FNO) and Koopman-inspired extensions achieve good short-term accuracy, their long-term stability is often limited by unconstrained latent representations and cumulative rollout errors. In this work, we introduce an inverse scattering inspired Fourier Neural Operator(IS-FNO), motivated by the reversibility and spectral evolution structure underlying the classical inverse scattering transform. The proposed architecture enforces a near-reversible pairing between lifting and projection maps through an explicitly invertible neural transformation, and models latent temporal evolution using exponential Fourier layers that naturally encode linear and nonlinear spectral dynamics. We systematically evaluate IS-FNO against baseline FNO and Koopman-based models on a range of benchmark PDEs, including the Michelson-Sivashinsky and Kuramoto-Sivashinsky equations (in one and two dimensions), as well as the integrable Korteweg-de Vries and Kadomtsev-Petviashvili equations. The results demonstrate that IS-FNO achieves lower short-term errors and substantially improved long-horizon stability in non-stiff regimes. For integrable systems, reduced IS-FNO variants that embed analytical scattering structure retain competitive long-term accuracy despite limited model capacity. Overall, this work shows that incorporating physical structure -- particularly reversibility and spectral evolution -- into neural operator design significantly enhances robustness and long-term predictive fidelity for nonlinear PDE dynamics.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhang Chong
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We revisit a basic question in sequence modeling: is explicit self-attention actually necessary for strong performance and reasoning? We argue that standard multi-head attention is best seen as a form of tensor lifting: hidden vectors are mapped into a high-dimensional space of pairwise interactions, and learning proceeds by constraining this lifted tensor through gradient descent. This mechanism is extremely expressive but mathematically opaque, because after many layers it becomes very hard to describe the model with a small family of explicit invariants.
To explore an alternative, we propose an attention-free architecture based on Grassmann flows. Instead of forming an L by L attention matrix, our Causal Grassmann layer (i) linearly reduces token states, (ii) encodes local token pairs as two-dimensional subspaces on a Grassmann manifold via Plucker coordinates, and (iii) fuses these geometric features back into the hidden states through gated mixing. Information therefore propagates by controlled deformations of low-rank subspaces over multi-scale local windows, so the core computation lives on a finite-dimensional manifold rather than in an unstructured tensor space.
On the Wikitext-2 language modeling benchmark, purely Grassmann-based models with 13 to 18 million parameters achieve validation perplexities within about 10 to 15 percent of size-matched Transformers. On the SNLI natural language inference task, a Grassmann-Plucker head on top of DistilBERT slightly outperforms a Transformer head, with best validation and test accuracies of 0.8550 and 0.8538 compared to 0.8545 and 0.8511. We analyze the complexity of Grassmann mixing, show linear scaling in sequence length for fixed rank, and argue that such manifold-based designs offer a more structured route toward geometric and invariant-based interpretations of neural reasoning.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Elad Hazan, Shai Shalev Shwartz, Nathan Srebro
Robotics
ๆฅ็ๆ่ฆ
Modern learning systems increasingly interact with data that evolve over time and depend on hidden internal state. We ask a basic question: when is such a dynamical system learnable from observations alone? This paper proposes a research program for understanding learnability in dynamical systems through the lens of next-token prediction. We argue that learnability in dynamical systems should be studied as a finite-sample question, and be based on the properties of the underlying dynamics rather than the statistical properties of the resulting sequence. To this end, we give a formulation of learnability for stochastic processes induced by dynamical systems, focusing on guarantees that hold uniformly at every time step after a finite burn-in period. This leads to a notion of dynamic learnability which captures how the structure of a system, such as stability, mixing, observability, and spectral properties, governs the number of observations required before reliable prediction becomes possible. We illustrate the framework in the case of linear dynamical systems, showing that accurate prediction can be achieved after finite observation without system identification, by leveraging improper methods based on spectral filtering. We survey the relationship between learning in dynamical systems and classical PAC, online, and universal prediction theories, and suggest directions for studying nonlinear and controlled systems.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Robert Simon Fong, Gouhei Tanaka, Kazuyuki Aihara
Machine Learning
ๆฅ็ๆ่ฆ
Modern learning systems act on internal representations of data, yet how these representations encode underlying physical or statistical structure is often left implicit. In physics, conservation laws of Hamiltonian systems such as symplecticity guarantee long-term stability, and recent work has begun to hard-wire such constraints into learning models at the loss or output level. Here we ask a different question: what would it mean for the representation itself to obey a symplectic conservation law in the sense of Hamiltonian mechanics?
We express this symplectic constraint through Legendre duality: the pairing between primal and dual parameters, which becomes the structure that the representation must preserve. We formalize Legendre dynamics as stochastic processes whose trajectories remain on Legendre graphs, so that the evolving primal-dual parameters stay Legendre dual. We show that this class includes linear time-invariant Gaussian process regression and Ornstein-Uhlenbeck dynamics.
Geometrically, we prove that the maps that preserve all Legendre graphs are exactly symplectomorphisms of cotangent bundles of the form "cotangent lift of a base diffeomorphism followed by an exact fibre translation". Dynamically, this characterization leads to the design of a Symplectic Reservoir (SR), a reservoir-computing architecture that is a special case of recurrent neural network and whose recurrent core is generated by Hamiltonian systems that are at most linear in the momentum.
Our main theorem shows that every SR update has this normal form and therefore transports Legendre graphs to Legendre graphs, preserving Legendre duality at each time step. Overall, SR implements a geometrically constrained, Legendre-preserving representation map, injecting symplectic geometry and Hamiltonian mechanics directly at the representational level.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sandro Andric
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via ICA. We validate axes with independent lexica and NER-based labels (POS/log-frequency used as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a mid TinyLlama layer, surviving perplexity-matched controls, and a brain-vs-text probe comparison shows larger log-frequency shifts (relative to the text probe) with lower perplexity for the brain axis. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with PPL-matched text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Abdelmadjid Benmachiche, Khadija Rais, Hamda Slimi
Machine Learning
ๆฅ็ๆ่ฆ
The spread of a resource-constrained Internet of Things (IoT) environment and embedded devices has put pressure on the real-time detection of anomalies occurring at the edge. This survey presents an overview of machine-learning methods aimed specifically at on-device anomaly detection with extremely strict constraints for latency, memory, and power consumption. Lightweight algorithms such as Isolation Forest, One-Class SVM, recurrent architectures, and statistical techniques are compared here according to the realities of embedded implementation. Our survey brings out significant trade-offs of accuracy and computational efficiency of detection, as well as how hardware constraints end up fundamentally redefining algorithm choice. The survey is completed with a set of practical recommendations on the choice of the algorithm depending on the equipment profiles and new trends in TinyML, which can help close the gap between detection capabilities and embedded reality. The paper serves as a strategic roadmap for engineers deploying anomaly detection in edge environments that are constrained by bandwidth and may be safety-critical.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xueming Yan, Boyan Xu, Yaochu Jin, Lixian Xiao, Wenlong Ye et al.
Multimodal
ๆฅ็ๆ่ฆ
Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available. https://github.com/yanxm01/INDOMER
๐
2025-12-22
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ Eneko Lazpita, Andrรฉs Bell-Navas, Jesรบs Garicano-Mena, Petros Koumoutsakos, Soledad Le Clainche
ๆฅ็ๆ่ฆ
Understanding intraventricular hemodynamics requires compact and physically interpretable representations of the underlying flow structures, as characteristic flow patterns are closely associated with cardiovascular conditions and can support early detection of cardiac deterioration. Conventional visualization of velocity or pressure fields, however, provides limited insight into the coherent mechanisms driving these dynamics. Reduced-order modeling techniques, like Proper Orthogonal Decomposition (POD) and Autoencoder (AE) architectures, offer powerful alternatives to extract dominant flow features from complex datasets. This study systematically compares POD with several AE variants (Linear, Nonlinear, Convolutional, and Variational) using left ventricular flow fields obtained from computational fluid dynamics simulations. We show that, for a suitably chosen latent dimension, AEs produce modes that become nearly orthogonal and qualitatively resemble POD modes that capture a given percentage of kinetic energy. As the number of latent modes increases, AE modes progressively lose orthogonality, leading to linear dependence, spatial redundancy, and the appearance of repeated modes with substantial high-frequency content. This degradation reduces interpretability and introduces noise-like components into AE-based reduced-order models, potentially complicating their integration with physics-based formulations or neural-network surrogates. The extent of interpretability loss varies across the AEs, with nonlinear, convolutional, and variational models exhibiting distinct behaviors in orthogonality preservation and feature localization. Overall, the results indicate that AEs can reproduce POD-like coherent structures under specific latent-space configurations, while highlighting the need for careful mode selection to ensure physically meaningful representations of cardiac flow dynamics.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Xin Huang, Jia Li, Jun Yu
Machine Learning
ๆฅ็ๆ่ฆ
Explainable machine learning aims to strike a balance between prediction accuracy and model transparency, particularly in settings where black-box predictive models, such as deep neural networks or kernel-based methods, achieve strong empirical performance but remain difficult to interpret. This work introduces a mixture of generalized additive models (GAMs) in which random Fourier feature (RFF) representations are leveraged to uncover locally adaptive structure in the data. In the proposed method, an RFF-based embedding is first learned and then compressed via principal component analysis. The resulting low-dimensional representations are used to perform soft clustering of the data through a Gaussian mixture model. These cluster assignments are then applied to construct a mixture-of-GAMs framework, where each local GAM captures nonlinear effects through interpretable univariate smooth functions. Numerical experiments on real-world regression benchmarks, including the California Housing, NASA Airfoil Self-Noise, and Bike Sharing datasets, demonstrate improved predictive performance relative to classical interpretable models. Overall, this construction provides a principled approach for integrating representation learning with transparent statistical modeling.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Christian Hรคgg, Kathlรฉn Kohn, Giovanni Luca Marchetti, Boris Shapiro
ๆฅ็ๆ่ฆ
We present Sprecher Networks (SNs), a family of trainable neural architectures inspired by the classical Kolmogorov-Arnold-Sprecher (KAS) construction for approximating multivariate continuous functions. Distinct from Multi-Layer Perceptrons (MLPs) with fixed node activations and Kolmogorov-Arnold Networks (KANs) featuring learnable edge activations, SNs utilize shared, learnable splines (monotonic and general) within structured blocks incorporating explicit shift parameters and mixing weights. Our approach directly realizes Sprecher's specific 1965 sum of shifted splines formula in its single-layer variant and extends it to deeper, multi-layer compositions. We further enhance the architecture with optional lateral mixing connections that enable intra-block communication between output dimensions, providing a parameter-efficient alternative to full attention mechanisms. Beyond parameter efficiency with $O(LN + LG)$ scaling (where $G$ is the knot count of the shared splines) versus MLPs' $O(LN^2)$, SNs admit a sequential evaluation strategy that reduces peak forward-intermediate memory from $O(N^2)$ to $O(N)$ (treating batch size as constant), making much wider architectures feasible under memory constraints. We demonstrate empirically that composing these blocks into deep networks leads to highly parameter and memory-efficient models, discuss theoretical motivations, and compare SNs with related architectures (MLPs, KANs, and networks with learnable node activations).
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Simon Stรฅhlberg, Blai Bonet, Hector Geffner
Machine Learning
ๆฅ็ๆ่ฆ
While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning where provable correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular, policy optimization methods, can be used to learn policies that generalize like combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way. From the former, we model policies as state transition classifiers, as (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states, and in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the DRL methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms, and result from the well-understood expressive limitations of GNNs, and the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods by adding derived predicates and an alternative cost structure to optimize.
๐
2025-12-22
๐ AAAI
๐ฅ Canran Xiao, Jiabao Dou, Zhiming Lin, Zong Ke, Liwei Hou
ๆฅ็ๆ่ฆ
How should we quantify the value of each training example when datasets are large, heterogeneous, and geometrically structured? Classical Data-Shapley answers in principle, but its O(n!) complexity and point-wise perspective are ill-suited to modern scales. We propose Hierarchical Contrastive Data Valuation (HCDV), a three-stage framework that (i) learns a contrastive, geometry-preserving representation, (ii) organizes the data into a balanced coarse-to-fine hierarchy of clusters, and (iii) assigns Shapley-style payoffs to coalitions via local Monte-Carlo games whose budgets are propagated downward. HCDV collapses the factorial burden to O(T sum_{l} K_{l}) = O(T K_max log n), rewards examples that sharpen decision boundaries, and regularizes outliers through curvature-based smoothness. We prove that HCDV approximately satisfies the four Shapley axioms with surplus loss O(eta log n), enjoys sub-Gaussian coalition deviation tilde O(1/sqrt{T}), and incurs at most k epsilon_infty regret for top-k selection. Experiments on four benchmarks--tabular, vision, streaming, and a 45M-sample CTR task--plus the OpenDataVal suite show that HCDV lifts accuracy by up to +5 pp, slashes valuation time by up to 100x, and directly supports tasks such as augmentation filtering, low-latency streaming updates, and fair marketplace payouts.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Isshaan Singh, Divyansh Chawla, Anshu Garg, Shivin Mangal, Pallavi Gupta et al.
Machine Learning
ๆฅ็ๆ่ฆ
The need for an intelligent, real-time spoilage prediction system has become critical in modern IoT-driven food supply chains, where perishable goods are highly susceptible to environmental conditions. Existing methods often lack adaptability to dynamic conditions and fail to optimize decision making in real time. To address these challenges, we propose a hybrid reinforcement learning framework integrating Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN) for enhanced spoilage prediction. This hybrid architecture captures temporal dependencies within sensor data, enabling robust and adaptive decision making. In alignment with interpretable artificial intelligence principles, a rule-based classifier environment is employed to provide transparent ground truth labeling of spoilage levels based on domain-specific thresholds. This structured design allows the agent to operate within clearly defined semantic boundaries, supporting traceable and interpretable decisions. Model behavior is monitored using interpretability-driven metrics, including spoilage accuracy, reward-to-step ratio, loss reduction rate, and exploration decay. These metrics provide both quantitative performance evaluation and insights into learning dynamics. A class-wise spoilage distribution visualization is used to analyze the agents decision profile and policy behavior. Extensive evaluations on simulated and real-time hardware data demonstrate that the LSTM and RNN based agent outperforms alternative reinforcement learning approaches in prediction accuracy and decision efficiency while maintaining interpretability. The results highlight the potential of hybrid deep reinforcement learning with integrated interpretability for scalable IoT-based food monitoring systems.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ JiaWei Zhu, ZiHeng Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Hidden confounding remains a fundamental challenge in causal inference from observational data. Recent advances leverage Large Language Models (LLMs) to generate plausible hidden confounders based on domain knowledge, yet a critical gap exists: LLM-generated confounders often exhibit semantic plausibility without statistical utility. We propose VIGOR+ (Variational Information Gain for iterative cOnfounder Refinement), a novel framework that closes the loop between LLM-based confounder generation and CEVAE-based statistical validation. Unlike prior approaches that treat generation and validation as separate stages, VIGOR+ establishes an iterative feedback mechanism: validation signals from CEVAE (including information gain, latent consistency metrics, and diagnostic messages) are transformed into natural language feedback that guides subsequent LLM generation rounds. This iterative refinement continues until convergence criteria are met. We formalize the feedback mechanism, prove convergence properties under mild assumptions, and provide a complete algorithmic framework.
๐
2025-12-22
๐ ArXiv Preprint (cs.DC)
๐ฅ Kiril Dichev, Filip Pawlowski, Albert-Jan Yzelman
Machine Learning
ๆฅ็ๆ่ฆ
Recommender systems are enablers of personalized content delivery, and therefore revenue, for many large companies. In the last decade, deep learning recommender models (DLRMs) are the de-facto standard in this field. The main bottleneck in DLRM inference is the lookup of sparse features across huge embedding tables, which are usually partitioned across the aggregate RAM of many nodes. In state-of-the-art recommender systems, the distributed lookup is implemented via irregular all-to-all (alltoallv) communication, and often presents the main bottleneck. Today, most related work sees this operation as a given; in addition, every collective is synchronous in nature. In this work, we propose a novel bounded lag synchronous (BLS) version of the alltoallv operation. The bound can be a parameter allowing slower processes to lag behind entire iterations before the fastest processes block. In special applications such as inference-only DLRM, the accuracy of the application is fully preserved. We implement BLS alltoallv in a new PyTorch Distributed backend and evaluate it with a BLS version of the reference DLRM code. We show that for well balanced, homogeneous-access DLRM runs our BLS technique does not offer notable advantages. But for unbalanced runs, e.g. runs with strongly irregular embedding table accesses or with delays across different processes, our BLS technique improves both the latency and throughput of inference-only DLRM. In the best-case scenario, the proposed reduced synchronisation can mask the delays across processes altogether.
๐
2025-12-22
๐ ArXiv Preprint (cs.IT)
๐ฅ Haohua Chen, Songbin Liu, Junjie Ma
ๆฅ็ๆ่ฆ
We propose an orthogonal approximate message passing (OAMP) algorithm for signal estimation in the rectangular spiked matrix model with general rotationally invariant (RI) noise. We establish a rigorous state evolution that precisely characterizes the algorithm's high-dimensional dynamics and enables the construction of iteration-wise optimal denoisers. Within this framework, we accommodate spectral initializations under minimal assumptions on the empirical noise spectrum. In the rectangular setting, where a single rank-one component typically generates multiple informative outliers, we further propose a procedure for combining these outliers under mild non-Gaussian signal assumptions. For general RI noise models, the predicted performance of the proposed optimal OAMP algorithm agrees with replica-symmetric predictions for the associated Bayes-optimal estimator, and we conjecture that it is statistically optimal within a broad class of iterative estimation methods.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pablo Barcelรณ, Floris Geerts, Matthias Lanzinger, Klara Pakhomenko, Jan Van den Bussche
Machine Learning
ๆฅ็ๆ่ฆ
We study the numerical and Boolean expressiveness of MPLang, a declarative language that captures the computation of graph neural networks (GNNs) through linear message passing and activation functions. We begin with A-MPLang, the fragment without activation functions, and give a characterization of its expressive power in terms of walk-summed features. For bounded activation functions, we show that (under mild conditions) all eventually constant activations yield the same expressive power - numerical and Boolean - and that it subsumes previously established logics for GNNs with eventually constant activation functions but without linear layers. Finally, we prove the first expressive separation between unbounded and bounded activations in the presence of linear layers: MPLang with ReLU is strictly more powerful for numerical queries than MPLang with eventually constant activation functions, e.g., truncated ReLU. This hinges on subtle interactions between linear aggregation and eventually constant non-linearities, and it establishes that GNNs using ReLU are more expressive than those restricted to eventually constant activations and linear layers.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ezequiel Lopez-Rubio, Macoris Decena-Gimenez, Rafael Marcos Luque-Baena
Natural Language Processing
ๆฅ็ๆ่ฆ
A key module in neural transformer-based deep architectures is positional encoding. This module enables a suitable way to encode positional information as input for transformer neural layers. This success has been rooted in the use of sinusoidal functions of various frequencies, in order to capture recurrent patterns of differing typical periods. In this work, an alternative set of periodic functions is proposed for positional encoding. These functions preserve some key properties of sinusoidal ones, while they depart from them in fundamental ways. Some tentative experiments are reported, where the original sinusoidal version is substantially outperformed. This strongly suggests that the alternative functions may have a wider use in other transformer architectures.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yayuan Li, Jian Zhang, Jintao Guo, Zihan Cheng, Lei Qi et al.
Computer Vision
ๆฅ็ๆ่ฆ
The proliferation of pre-trained models has given rise to a wide array of specialised, fine-tuned models. Model merging aims to merge the distinct capabilities of these specialised models into a unified model, requiring minimal or even no additional training. A core objective of model merging is to ensure the merged model retains the behavioural characteristics of the specialised models, typically achieved through feature alignment. We identify that features consist of two critical components: direction and magnitude. Prior research has predominantly focused on directional alignment, while the influence of magnitude remains largely neglected, despite its pronounced vulnerability to perturbations introduced by common merging operations (e.g., parameter fusion and sparsification). Such perturbations to magnitude inevitably lead to feature deviations in the merged model from the specialised models, resulting in subsequent performance degradation. To address this, we propose MAGnItude Calibration (MAGIC), a plug-and-play framework that rectifies layer-wise magnitudes in feature and weight spaces, with three variants. Specifically, our Feature Space Calibration (FSC) realigns the merged model's features using a small set of unlabelled data, while Weight Space Calibration (WSC) extends this calibration to the weight space without requiring additional data. Combining these yields Dual Space Calibration (DSC). Comprehensive experiments demonstrate that MAGIC consistently boosts performance across diverse Computer Vision tasks (+4.3% on eight datasets) and NLP tasks (+8.0% on Llama) without additional training. Our code is available at: https://github.com/lyymuwu/MAGIC
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Keivan Faghih Niresi, Jun Qing, Mengjie Zhao, Olga Fink
Machine Learning
ๆฅ็ๆ่ฆ
Structural Health Monitoring (SHM) plays a crucial role in maintaining the safety and resilience of infrastructure. As sensor networks grow in scale and complexity, identifying the most informative sensors becomes essential to reduce deployment costs without compromising monitoring quality. While Graph Signal Processing (GSP) has shown promise by leveraging spatial correlations among sensor nodes, conventional approaches often overlook the temporal dynamics of structural behavior. To overcome this limitation, we propose Time-Vertex Machine Learning (TVML), a novel framework that integrates GSP, time-domain analysis, and machine learning to enable interpretable and efficient sensor placement by identifying representative nodes that minimize redundancy while preserving critical information. We evaluate the proposed approach on two bridge datasets for damage detection and time-varying graph signal reconstruction tasks. The results demonstrate the effectiveness of our approach in enhancing SHM systems by providing a robust, adaptive, and efficient solution for sensor placement.
๐
2025-12-22
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Sameera K. M., Serena Nicolazzo, Antonino Nocera, Vinod P., Rafidha Rehiman K. A
Machine Learning
ๆฅ็ๆ่ฆ
Federated Learning (FL) has recently emerged as a revolutionary approach to collaborative training Machine Learning models. In particular, it enables decentralized model training while preserving data privacy, but its distributed nature makes it highly vulnerable to a severe attack known as Data Poisoning. In such scenarios, malicious clients inject manipulated data into the training process, thereby degrading global model performance or causing targeted misclassification. In this paper, we present a novel defense mechanism called GShield, designed to detect and mitigate malicious and low-quality updates, especially under non-independent and identically distributed (non-IID) data scenarios. GShield operates by learning the distribution of benign gradients through clustering and Gaussian modeling during an initial round, enabling it to establish a reliable baseline of trusted client behavior. With this benign profile, GShield selectively aggregates only those updates that align with the expected gradient patterns, effectively isolating adversarial clients and preserving the integrity of the global model. An extensive experimental campaign demonstrates that our proposed defense significantly improves model robustness compared to the state-of-the-art methods while maintaining a high accuracy of performance across both tabular and image datasets. Furthermore, GShield improves the accuracy of the targeted class by 43\% to 65\% after detecting malicious and low-quality clients.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chang Dong, Jianfeng Tao, Chengliang Liu
Machine Learning
ๆฅ็ๆ่ฆ
Axial piston pumps are crucial components in fluid power systems, where reliable fault diagnosis is essential for ensuring operational safety and efficiency. Traditional data-driven methods require extensive labeled fault data, which is often impractical to obtain, while model-based approaches suffer from parameter uncertainties. This paper proposes a digital twin (DT)-driven zero-shot fault diagnosis framework utilizing fluid-borne noise (FBN) signals. The framework calibrates a high-fidelity DT model using only healthy-state data, generates synthetic fault signals for training deep learning classifiers, and employs a physics-informed neural network (PINN) as a virtual sensor for flow ripple estimation. Gradient-weighted class activation mapping (Grad-CAM) is integrated to visualize the decision-making process of neural networks, revealing that large kernels matching the subsequence length in time-domain inputs and small kernels in time-frequency domain inputs enable higher diagnostic accuracy by focusing on physically meaningful features. Experimental validations demonstrate that training on signals from the calibrated DT model yields diagnostic accuracies exceeding 95\% on real-world benchmarks, while uncalibrated models result in significantly lower performance, highlighting the framework's effectiveness in data-scarce scenarios.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Yitian Zheng, Zhangchen Ye, Weijun Dong, Shengjie Wang, Yuyang Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Recent advances in hierarchical robot systems leverage a high-level planner to propose task plans and a low-level policy to generate robot actions. This design allows training the planner on action-free or even non-robot data sources (e.g., videos), providing transferable high-level guidance. Nevertheless, grounding these high-level plans into executable actions remains challenging, especially with the limited availability of high-quality robot data. To this end, we propose to improve the low-level policy through online interactions. Specifically, our approach collects online rollouts, retrospectively annotates the corresponding high-level goals from achieved outcomes, and aggregates these hindsight-relabeled experiences to update a goal-conditioned imitation policy. Our method, Hindsight Flow-conditioned Online Imitation (HinFlow), instantiates this idea with 2D point flows as the high-level planner. Across diverse manipulation tasks in both simulation and physical world, our method achieves more than $2\times$ performance improvement over the base policy, significantly outperforming the existing methods. Moreover, our framework enables policy acquisition from planners trained on cross-embodiment video data, demonstrating its potential for scalable and transferable robot learning.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Carla Crivoi, Radu Tudor Ionescu
Machine Learning
ๆฅ็ๆ่ฆ
We present the first comprehensive empirical study of machine unlearning (MU) in hybrid quantum-classical neural networks. While MU has been extensively explored in classical deep learning, its behavior within variational quantum circuits (VQCs) and quantum-augmented architectures remains largely unexplored. First, we adapt a broad suite of unlearning methods to quantum settings, including gradient-based, distillation-based, regularization-based and certified techniques. Second, we introduce two new unlearning strategies tailored to hybrid models. Experiments across Iris, MNIST, and Fashion-MNIST, under both subset removal and full-class deletion, reveal that quantum models can support effective unlearning, but outcomes depend strongly on circuit depth, entanglement structure, and task complexity. Shallow VQCs display high intrinsic stability with minimal memorization, whereas deeper hybrid models exhibit stronger trade-offs between utility, forgetting strength, and alignment with retrain oracle. We find that certain methods, e.g. EU-k, LCA, and Certified Unlearning, consistently provide the best balance across metrics. These findings establish baseline empirical insights into quantum machine unlearning and highlight the need for quantum-aware algorithms and theoretical guarantees, as quantum machine learning systems continue to expand in scale and capability. We publicly release our code at: https://github.com/CrivoiCarla/HQML.
๐
2025-12-22
๐ NeurIPS 2025
๐ฅ Prathamesh Devadiga
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Traditional auto-parallelizing compilers, reliant on rigid heuristics, struggle with the complexity of modern heterogeneous systems. This paper presents a comprehensive evaluation of small (approximately 1B parameter) language-model-driven compiler auto-parallelization. We evaluate three models: gemma3, llama3.2, and qwen2.5, using six reasoning strategies across 11 real-world kernels drawn from scientific computing, graph algorithms, and machine learning. Our system is benchmarked against strong compiler baselines, including LLVM Polly, TVM, and Triton. Across 376 total evaluations, the proposed approach achieves an average speedup of 6.81x and a peak performance of 43.25x on convolution operations. We analyze scalability, verify correctness using multiple sanitizers, and confirm robustness across diverse compilers and hardware platforms. Our results demonstrate that small, efficient language models can serve as powerful reasoning engines for complex compiler optimization tasks.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Moncef Garouani, Ayah Barhrhouj
Machine Learning
ๆฅ็ๆ่ฆ
Hyperparameters tuning is a fundamental, yet computationally expensive, step in optimizing machine learning models. Beyond optimization, understanding the relative importance and interaction of hyperparameters is critical to efficient model development. In this paper, we introduce MetaSHAP, a scalable semi-automated eXplainable AI (XAI) method, that uses meta-learning and Shapley values analysis to provide actionable and dataset-aware tuning insights. MetaSHAP operates over a vast benchmark of over 09 millions evaluated machine learning pipelines, allowing it to produce interpretable importance scores and actionable tuning insights that reveal how much each hyperparameter matters, how it interacts with others and in which value ranges its influence is concentrated. For a given algorithm and dataset, MetaSHAP learns a surrogate performance model from historical configurations, computes hyperparameters interactions using SHAP-based analysis, and derives interpretable tuning ranges from the most influential hyperparameters. This allows practitioners not only to prioritize which hyperparameters to tune, but also to understand their directionality and interactions. We empirically validate MetaSHAP on a diverse benchmark of 164 classification datasets and 14 classifiers, demonstrating that it produces reliable importance rankings and competitive performance when used to guide Bayesian optimization.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anna-Maria Gueorguieva, Aylin Caliskan
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have been shown to exhibit social bias, however, bias towards non-protected stigmatized identities remain understudied. Furthermore, what social features of stigmas are associated with bias in LLM outputs is unknown. From psychology literature, it has been shown that stigmas contain six shared social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate if human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example deciding wether to recommend them for an internship. We find that stigmas rated by humans to be highly perilous (e.g., being a gang member or having HIV) have the most biased outputs from SocialStigmaQA prompts (60% of outputs from all models) while sociodemographic stigmas (e.g. Asian-American or old age) have the least amount of biased outputs (11%). We test if the amount of biased outputs could be decreased by using guardrail models, models meant to identify harmful input, using each LLM's respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API). We find that bias decreases significantly by 10.4%, 1.4%, and 7.8%, respectively. However, we show that features with significant effect on bias remain unchanged post-mitigation and that guardrail models often fail to recognize the intent of bias in prompts. This work has implications for using LLMs in scenarios involving stigmatized groups and we suggest future work towards improving guardrail models for bias mitigation.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zesen Wang, Yonggang Li, Lijuan Lan
ๆฅ็ๆ่ฆ
Soft sensing infers hard-to-measure data through a large number of easily obtainable variables. However, in complex industrial scenarios, the issue of insufficient data volume persists, which diminishes the reliability of soft sensing. Generative Adversarial Networks (GAN) are one of the effective solutions for addressing insufficient samples. Nevertheless, traditional GAN fail to account for the mapping relationship between labels and features, which limits further performance improvement. Although some studies have proposed solutions, none have considered both performance and efficiency simultaneously. To address these problems, this paper proposes the multi-task learning-based regression GAN framework that integrates regression information into both the discriminator and generator, and implements a shallow sharing mechanism between the discriminator and regressor. This approach significantly enhances the quality of generated samples while improving the algorithm's operational efficiency. Moreover, considering the importance of training samples and generated samples, a dual data evaluation strategy is designed to make GAN generate more diverse samples, thereby increasing the generalization of subsequent modeling. The superiority of method is validated through four classic industrial soft sensing cases: wastewater treatment plants, surface water, $CO_2$ absorption towers, and industrial gas turbines.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiu-Cheng Wang, Jun-Jie Zhanga, Nan Cheng, Long-Gang Pang, Taijiao Du et al.
ๆฅ็ๆ่ฆ
Modern learning systems work with data that vary widely across domains, but they all ultimately depend on how much structure is already present in the measurements before any model is trained. This raises a basic question: is there a general, modality-agnostic way to quantify how acquisition itself preserves or destroys the information that downstream learners could use? Here we propose an acquisition-level scalar $ฮS_{\mathcal B}$ based on instrument-resolved phase space. Unlike pixelwise distortion or purely spectral errors that often saturate under aggressive undersampling, $ฮS_{\mathcal B}$ directly quantifies how acquisition mixes or removes joint space--frequency structure at the instrument scale. We show theoretically that \(ฮS_{\mathcal B}\) correctly identifies the phase-space coherence of periodic sampling as the physical source of aliasing, recovering classical sampling-theorem consequences. Empirically, across masked image classification, accelerated MRI, and massive MIMO (including over-the-air measurements), $|ฮS_{\mathcal B}|$ consistently ranks sampling geometries and predicts downstream reconstruction/recognition difficulty \emph{without training}. In particular, minimizing $|ฮS_{\mathcal B}|$ enables zero-training selection of variable-density MRI mask parameters that matches designs tuned by conventional pre-reconstruction criteria. These results suggest that phase-space entropy at acquisition reflects downstream learnability, enabling pre-training selection of candidate sampling policies and as a shared notion of information preservation across modalities.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, Cen Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although KV cache quantization is a promising compression technique, existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. Fixed-precision quantization struggles to handle outlier channels in the key cache, while current mixed-precision strategies fail to accurately identify components requiring high-precision representation. We find that an effective low-bit KV cache quantization strategy must consider two factors: a key channel's intrinsic quantization difficulty and its relevance to the query. Based on this insight, we propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels that need higher precision, while applying per-token quantization for value cache. Experiments on complex reasoning datasets demonstrate that our approach significantly outperforms existing low-bit methods, achieving performance comparable to a full-precision baseline at a substantially reduced memory footprint.
๐
2025-12-22
๐ Accepted for publication in Lecture Notes in Compu
๐ฅ Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos
Machine Learning
ๆฅ็ๆ่ฆ
The paper establishes generalization bounds for multitask deep neural networks using operator-theoretic techniques. The authors propose a tighter bound than those derived from conventional norm based methods by leveraging small condition numbers in the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space. This enhanced bound remains valid even in single output settings, outperforming existing Koopman based bounds. The resulting framework maintains key advantages such as flexibility and independence from network width, offering a more precise theoretical understanding of multitask deep learning in the context of kernel methods.
๐
2025-12-22
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Xiaolong Wu, Qifeng Liao
Machine Learning
ๆฅ็ๆ่ฆ
Solving high-dimensional Fokker-Planck (FP) equations is a challenge in computational physics and stochastic dynamics, due to the curse of dimensionality (CoD) and the bottleneck of evaluating second-order diffusion terms. Existing deep learning approaches, such as Physics-Informed Neural Networks (PINNs), face computational challenges as dimensionality increases, driven by the $O(D^2)$ complexity of automatic differentiation for second-order derivatives. While recent probability flow approaches bypass this by learning score functions or matching velocity fields, they often involve serial computational operations or depend on sampling efficiency in complex distributions. To address these issues, we propose the Self-Consistent Probability Flow (SCPF) method. We reformulate the second-order FP equation into an equivalent first-order deterministic Probability Flow ODE (PF-ODE) constraint. Unlike score matching or velocity matching, SCPF solves this problem by minimizing the residual of the PF-ODE continuity equation, which avoids explicit Hessian computation. We leverage Continuous Normalizing Flows (CNF) combined with the Hutchinson Trace Estimator (HTE) to reduce the training complexity to linear scale $O(D)$, achieving an effective $O(1)$ wall-clock time on GPUs. To address data sparsity in high dimensions, we apply a generative adaptive sampling strategy and theoretically prove that dynamically aligning collocation points with the evolving probability mass is a necessary condition to bound the approximation error. Experiments on diverse benchmarks -- ranging from anisotropic Ornstein-Uhlenbeck (OU) processes and high-dimensional Brownian motions with time-varying diffusion terms, to Geometric OU processes featuring non-Gaussian solutions -- demonstrate that SCPF effectively mitigates the CoD, maintaining high accuracy and constant computational cost for problems up to 100 dimensions.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leming Zhou, Zuo Wang, Zhigang Liu
ๆฅ็ๆ่ฆ
Due to the insufficient diagnosis and treatment capabilities at the grassroots level, there are still deficiencies in the early identification and early warning of acute exacerbation of Chronic obstructive pulmonary disease (COPD), often resulting in a high prevalence rate and high burden, but the screening rate is relatively low. In order to gradually improve this situation. In this paper, this study develop a Causal Heterogeneous Graph Representation Learning (CHGRL) method for COPD comorbidity risk prediction method that: a) constructing a heterogeneous Our dataset includes the interaction between patients and diseases; b) A cause-aware heterogeneous graph learning architecture has been constructed, combining causal inference mechanisms with heterogeneous graph learning, which can support heterogeneous graph causal learning for different types of relationships; and c) Incorporate the causal loss function in the model design, and add counterfactual reasoning learning loss and causal regularization loss on the basis of the cross-entropy classification loss. We evaluate our method and compare its performance with strong GNN baselines. Following experimental evaluation, the proposed model demonstrates high detection accuracy.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marios Thoma, Zenonas Theodosiou, Harris Partaourides, Vassilis Vassiliades, Loizos Michael et al.
Machine Learning
ๆฅ็ๆ่ฆ
Walking has always been a primary mode of transportation and is recognized as an essential activity for maintaining good health. Despite the need for safe walking conditions in urban environments, sidewalks are frequently obstructed by various obstacles that hinder free pedestrian movement. Any object obstructing a pedestrian's path can pose a safety hazard. The advancement of pervasive computing and egocentric vision techniques offers the potential to design systems that can automatically detect such obstacles in real time, thereby enhancing pedestrian safety. The development of effective and efficient identification algorithms relies on the availability of comprehensive and well-balanced datasets of egocentric data. In this work, we introduce the PEDESTRIAN dataset, comprising egocentric data for 29 different obstacles commonly found on urban sidewalks. A total of 340 videos were collected using mobile phone cameras, capturing a pedestrian's point of view. Additionally, we present the results of a series of experiments that involved training several state-of-the-art deep learning algorithms using the proposed dataset, which can be used as a benchmark for obstacle detection and recognition tasks. The dataset can be used for training pavement obstacle detectors to enhance the safety of pedestrians in urban areas.
๐
2025-12-22
๐ Accepted for publication in Lecture Notes in Compu
๐ฅ Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning through an operator-theoretic framework. Our key development lies in strategically combining a Koopman based approach with existing techniques, achieving tighter generalization guarantees compared to traditional norm-based bounds. To mitigate computational challenges associated with Koopman-based methods, we introduce sketching techniques applicable to vector valued neural networks. These techniques yield excess risk bounds under generic Lipschitz losses, providing performance guarantees for applications including robust and multiple quantile regression. Furthermore, we propose a novel deep learning framework, deep vector-valued reproducing kernel Hilbert spaces (vvRKHS), leveraging Perron Frobenius (PF) operators to enhance deep kernel methods. We derive a new Rademacher generalization bound for this framework, explicitly addressing underfitting and overfitting through kernel refinement strategies. This work offers novel insights into the generalization properties of multitask learning with deep learning architectures, an area that has been relatively unexplored until recent developments.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Azadeh Alavi, Fatemeh Kouchmeshki, Abdolrahman Alavi
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Hybrid quantum and classical learning aims to couple quantum feature maps with the robustness of classical neural networks, yet most architectures treat the quantum circuit as an isolated feature extractor and merge its measurements with classical representations by direct concatenation. This neglects that the quantum and classical branches constitute distinct computational modalities and limits reliable performance on complex, high dimensional tabular and semi structured data, including remote sensing, environmental monitoring, and medical diagnostics. We present a multimodal formulation of hybrid learning and propose a cross attention mid fusion architecture in which a classical representation queries quantum derived feature tokens through an attention block with residual connectivity. The quantum branch is kept within practical NISQ budgets and uses up to nine qubits. We evaluate on Wine, Breast Cancer, Forest CoverType, FashionMNIST, and SteelPlatesFaults, comparing a quantum only model, a classical baseline, residual hybrid models, and the proposed mid fusion model under a consistent protocol. Pure quantum and standard hybrid designs underperform due to measurement induced information loss, while cross attention mid fusion is consistently competitive and improves performance on the more complex datasets in most cases. These findings suggest that quantum derived information becomes most valuable when integrated through principled multimodal fusion rather than used in isolation or loosely appended to classical features.
๐
2025-12-22
๐ ArXiv Preprint (math.OC)
๐ฅ Filippo Fabiani, Barbara Franci
Robotics
ๆฅ็ๆ่ฆ
We establish finite sample certificates on the quality of solutions produced by data-based forward-backward (FB) operator splitting schemes. As frequently happens in stochastic regimes, we consider the problem of finding a zero of the sum of two operators, where one is either unavailable in closed form or computationally expensive to evaluate, and shall therefore be approximated using a finite number of noisy oracle samples. Under the lens of algorithmic stability, we then derive probabilistic bounds on the distance between a true zero and the FB output without making specific assumptions about the underlying data distribution. We show that under weaker conditions ensuring the convergence of FB schemes, stability bounds grow proportionally to the number of iterations. Conversely, stronger assumptions yield stability guarantees that are independent of the iteration count. We then specialize our results to a popular FB stochastic Nash equilibrium seeking algorithm and validate our theoretical bounds on a control problem for smart grids, where the energy price uncertainty is approximated by means of historical data.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Geraud Nangue Tasse, Matthew Riemer, Benjamin Rosman, Tim Klinger
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent success in developing increasingly general purpose agents based on sequence models has led to increased focus on the problem of deploying computationally limited agents within the vastly more complex real-world. A key challenge experienced in these more realistic domains is highly non-Markovian dependencies with respect to the agent's observations, which are less common in small controlled domains. The predominant approach for dealing with this in the literature is to stack together a window of the most recent observations (Frame Stacking), but this window size must grow with the degree of non-Markovian dependencies, which results in prohibitive computational and memory requirements for both action inference and learning. In this paper, we are motivated by the insight that in many environments that are highly non-Markovian with respect to time, the environment only causally depends on a relatively small number of observations over that time-scale. A natural direction would then be to consider meta-algorithms that maintain relatively small adaptive stacks of memories such that it is possible to express highly non-Markovian dependencies with respect to time while considering fewer observations at each step and thus experience substantial savings in both compute and memory requirements. Hence, we propose a meta-algorithm (Adaptive Stacking) for achieving exactly that with convergence guarantees and quantify the reduced computation and memory constraints for MLP, LSTM, and Transformer-based agents. Our experiments utilize popular memory tasks, which give us control over the degree of non-Markovian dependencies. This allows us to demonstrate that an appropriate meta-algorithm can learn the removal of memories not predictive of future rewards without excessive removal of important experiences. Code: https://github.com/geraudnt/adaptive-stacking
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoran Yang, Yinan Zhang, Wenjie Zhang, Dongxia Wang, Peiyu Liu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Nowadays, industrial hybrid modeling which integrates both mechanistic modeling and machine learning-based modeling techniques has attracted increasing interest from scholars due to its high accuracy, low computational cost, and satisfactory interpretability. Nevertheless, the existing industrial hybrid modeling methods still face two main limitations. First, current research has mainly focused on applying a single machine learning method to one specific task, failing to develop a comprehensive machine learning architecture suitable for modeling tasks, which limits their ability to effectively represent complex industrial scenarios. Second, industrial datasets often contain underlying associations (e.g., monotonicity or periodicity) that are not adequately exploited by current research, which can degrade model's predictive performance. To address these limitations, this paper proposes the Recurrent Perceptron-based Channel Attention Transformer Encoder (RP-CATE), with three distinctive characteristics: 1: We developed a novel architecture by replacing the self-attention mechanism with channel attention and incorporating our proposed Recurrent Perceptron (RP) Module into Transformer, achieving enhanced effectiveness for industrial modeling tasks compared to the original Transformer. 2: We proposed a new data type called Pseudo-Image Data (PID) tailored for channel attention requirements and developed a cyclic sliding window method for generating PID. 3: We introduced the concept of Pseudo-Sequential Data (PSD) and a method for converting industrial datasets into PSD, which enables the RP Module to capture the underlying associations within industrial dataset more effectively. An experiment aimed at hybrid modeling in chemical engineering was conducted by using RP-CATE and the experimental results demonstrate that RP-CATE achieves the best performance compared to other baseline models.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Francis Bach
Machine Learning
ๆฅ็ๆ่ฆ
We consider supervised learning problems in which set predictions provide explicit uncertainty estimates. Using Choquet integrals (a.k.a. Lov{รก}sz extensions), we propose a convex loss function for nondecreasing subset-valued functions obtained as level sets of a real-valued function. This loss function allows optimal trade-offs between conditional probabilistic coverage and the ''size'' of the set, measured by a non-decreasing submodular function. We also propose several extensions that mimic loss functions and criteria for binary classification with asymmetric losses, and show how to naturally obtain sets with optimized conditional coverage. We derive efficient optimization algorithms, either based on stochastic gradient descent or reweighted least-squares formulations, and illustrate our findings with a series of experiments on synthetic datasets for classification and regression tasks, showing improvements over approaches that aim for marginal coverage.
๐
2025-12-22
๐ ArXiv Preprint (cs.DC)
๐ฅ Murtaza Rangwala, Richard O. Sinnott, Rajkumar Buyya
Machine Learning
ๆฅ็ๆ่ฆ
Decentralized federated learning (DFL) enables collaborative model training across edge devices without centralized coordination, offering resilience against single points of failure. However, statistical heterogeneity arising from non-identically distributed local data creates a fundamental challenge: nodes must learn personalized models adapted to their local distributions while selectively collaborating with compatible peers. Existing approaches either enforce a single global model that fits no one well, or rely on heuristic peer selection mechanisms that cannot distinguish between peers with genuinely incompatible data distributions and those with valuable complementary knowledge. We present Murmura, a framework that leverages evidential deep learning to enable trust-aware model personalization in DFL. Our key insight is that epistemic uncertainty from Dirichlet-based evidential models directly indicates peer compatibility: high epistemic uncertainty when a peer's model evaluates local data reveals distributional mismatch, enabling nodes to exclude incompatible influence while maintaining personalized models through selective collaboration. Murmura introduces a trust-aware aggregation mechanism that computes peer compatibility scores through cross-evaluation on local validation samples and personalizes model aggregation based on evidential trust with adaptive thresholds. Evaluation on three wearable IoT datasets (UCI HAR, PAMAP2, PPG-DaLiA) demonstrates that Murmura reduces performance degradation from IID to non-IID conditions compared to baseline (0.9% vs. 19.3%), achieves 7.4$\times$ faster convergence, and maintains stable accuracy across hyperparameter choices. These results establish evidential uncertainty as a principled foundation for compatibility-aware personalization in decentralized heterogeneous environments.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tzu-Yun Lee, Ding-Yong Hong, Jan-Jan Wu
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper introduces Syntactic Attention Pruning (SAP), a novel method for effectively pruning attention heads in Transformer models. Unlike conventional approaches that rely solely on mathematical analysis of model weights and activations, SAP incorporates both the syntactic structure and attention patterns of sentences to guide the pruning process. By leveraging these linguistic features, SAP not only achieves performance comparable to state-of-the-art methods but also enhances the interpretability of model behavior. To further improve robustness, we propose Candidate Filtering (CF), a mechanism that prioritizes heads based on their contribution to model performance, mitigating degradation during pruning. Experimental results indicate that SAP effectively preserves critical heads of a high density of strong attention values, outperforming existing head pruning strategies in retrain-free settings. These findings position SAP as a promising foundation for a new direction in model compression research, offering high flexibility for pruning across all transformer-based language models.
๐
2025-12-22
๐ 2025 IEEE. Personal use of this material is permit
๐ฅ Francesco Carzaniga, Michael Hersche, Kaspar Schindler, Abbas Rahimi
ๆฅ็ๆ่ฆ
Objective: We develop a channel-adaptive (CA) architecture that seamlessly processes multi-variate time-series with an arbitrary number of channels, and in particular intracranial electroencephalography (iEEG) recordings. Methods: Our CA architecture first processes the iEEG signal using state-of-the-art models applied to each single channel independently. The resulting features are then fused using a vector-symbolic algorithm which reconstructs the spatial relationship using a trainable scalar per channel. Finally, the fused features are accumulated in a long-term memory of up to 2 minutes to perform the classification. Each CA-model can then be pre-trained on a large corpus of iEEG recordings from multiple heterogeneous subjects. The pre-trained model is personalized to each subject via a quick fine-tuning routine, which uses equal or lower amounts of data compared to existing state-of-the-art models, but requiring only 1/5 of the time. Results: We evaluate our CA-models on a seizure detection task both on a short-term (~20 hours) and a long-term (~2500 hours) dataset. In particular, our CA-EEGWaveNet is trained on a single seizure of the tested subject, while the baseline EEGWaveNet is trained on all but one. Even in this challenging scenario, our CA-EEGWaveNet surpasses the baseline in median F1-score (0.78 vs 0.76). Similarly, CA-EEGNet based on EEGNet, also surpasses its baseline in median F1-score (0.79 vs 0.74). Conclusion and significance: Our CA-model addresses two issues: first, it is channel-adaptive and can therefore be trained across heterogeneous subjects without loss of performance; second, it increases the effective temporal context size to a clinically-relevant length. Therefore, our model is a drop-in replacement for existing models, bringing better characteristics and performance across the board.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoyu Jiang, Boan Qu, Junjie Zhu, Fanjie Zeng, Xiaojie Lin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid growth of artificial intelligence is exponentially escalating computational demand, inflating data center energy use and carbon emissions, and spurring rapid deployment of green data centers to relieve resource and environmental stress. Achieving sub-minute orchestration of renewables, storage, and loads, while minimizing PUE and lifecycle carbon intensity, hinges on accurate load forecasting. However, existing methods struggle to address small-sample scenarios caused by cold start, load distortion, multi-source data fragmentation, and distribution shifts in green data centers. We introduce HyperLoad, a cross-modality framework that exploits pre-trained large language models (LLMs) to overcome data scarcity. In the Cross-Modality Knowledge Alignment phase, textual priors and time-series data are mapped to a common latent space, maximizing the utility of prior knowledge. In the Multi-Scale Feature Modeling phase, domain-aligned priors are injected through adaptive prefix-tuning, enabling rapid scenario adaptation, while an Enhanced Global Interaction Attention mechanism captures cross-device temporal dependencies. The public DCData dataset is released for benchmarking. Under both data sufficient and data scarce settings, HyperLoad consistently surpasses state-of-the-art (SOTA) baselines, demonstrating its practicality for sustainable green data center management.
๐
2025-12-22
๐ ArXiv Preprint (math.OC)
๐ฅ Haishan Ye
Machine Learning
ๆฅ็ๆ่ฆ
Zeroth-order (ZO) optimization with ordinal feedback has emerged as a fundamental problem in modern machine learning systems, particularly in human-in-the-loop settings such as reinforcement learning from human feedback, preference learning, and evolutionary strategies. While rank-based ZO algorithms enjoy strong empirical success and robustness properties, their theoretical understanding, especially under stochastic objectives and standard smoothness assumptions, remains limited. In this paper, we study rank-based zeroth-order optimization for stochastic functions where only ordinal feedback of the stochastic function is available. We propose a simple and computationally efficient rank-based ZO algorithm. Under standard assumptions including smoothness, strong convexity, and bounded second moments of stochastic gradients, we establish explicit non-asymptotic query complexity bounds for both convex and nonconvex objectives. Notably, our results match the best-known query complexities of value-based ZO algorithms, demonstrating that ordinal information alone is sufficient for optimal query efficiency in stochastic settings. Our analysis departs from existing drift-based and information-geometric techniques, offering new tools for the study of rank-based optimization under noise. These findings narrow the gap between theory and practice and provide a principled foundation for optimization driven by human preferences.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiaqi Zhu, Zhongyuan Zhao, Xiao Li, Ruihao Du, Shi Jin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Incorporating over-the-air computations (OAC) into the model training process of federated learning (FL) is an effective approach to alleviating the communication bottleneck in FL systems. Under OAC-FL, every client modulates its intermediate parameters, such as gradient, onto the same set of orthogonal waveforms and simultaneously transmits the radio signal to the edge server. By exploiting the superposition property of multiple-access channels, the edge server can obtain an automatically aggregated global gradient from the received signal. However, the limited number of orthogonal waveforms available in practical systems is fundamentally mismatched with the high dimensionality of modern deep learning models. To address this issue, we propose Freshness Freshness-mAgnItude awaRe top-k (FAIR-k), an algorithm that selects, in each communication round, the most impactful subset of gradients to be updated over the air. In essence, FAIR-k combines the complementary strengths of the Round-Robin and Top-k algorithms, striking a delicate balance between timeliness (freshness of parameter updates) and importance (gradient magnitude). Leveraging tools from Markov analysis, we characterize the distribution of parameter staleness under FAIR-k. Building on this, we establish the convergence rate of OAC-FL with FAIR-k, which discloses the joint effect of data heterogeneity, channel noise, and parameter staleness on the training efficiency. Notably, as opposed to conventional analyses that assume a universal Lipschitz constant across all the clients, our framework adopts a finer-grained model of the data heterogeneity. The analysis demonstrates that since FAIR-k promotes fresh (and fair) parameter updates, it not only accelerates convergence but also enhances communication efficiency by enabling an extended period of local training without significantly affecting overall training efficiency.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alireza Moayedikia, Sara Fin, Uffe Kock Wiil
Machine Learning
ๆฅ็ๆ่ฆ
Disease modifying therapies for Alzheimer's disease demand precise timing decisions, yet current predictive models require longitudinal observations and provide no uncertainty quantification, rendering them impractical at the critical first visit when treatment decisions must be made. We developed PROGRESS (PRognostic Generalization from REsting Static Signatures), a dual-model deep learning framework that transforms a single baseline cerebrospinal fluid biomarker assessment into actionable prognostic estimates without requiring prior clinical history. The framework addresses two complementary clinical questions: a probabilistic trajectory network predicts individualized cognitive decline with calibrated uncertainty bounds achieving near-nominal coverage, enabling honest prognostic communication; and a deep survival model estimates time to conversion from mild cognitive impairment to dementia. Using data from over 3,000 participants across 43 Alzheimer's Disease Research Centers in the National Alzheimer's Coordinating Center database, PROGRESS substantially outperforms Cox proportional hazards, Random Survival Forests, and gradient boosting methods for survival prediction. Risk stratification identifies patient groups with seven-fold differences in conversion rates, enabling clinically meaningful treatment prioritization. Leave-one-center-out validation demonstrates robust generalizability, with survival discrimination remaining strong across held-out sites despite heterogeneous measurement conditions spanning four decades of assay technologies. By combining superior survival prediction with trustworthy trajectory uncertainty quantification, PROGRESS bridges the gap between biomarker measurement and personalized clinical decision-making.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Danny Dongyeop Han, Yonghyeon Gwon, Ahhyun Lucy Lee, Taeyang Lee, Seong Jin Lee et al.
ๆฅ็ๆ่ฆ
Electrophysiology signals such as EEG and iEEG are central to neuroscience, brain-computer interfaces, and clinical applications, yet existing foundation models remain limited in scale despite clear evidence that scaling improves performance. We introduce DIVER-1, a family of EEG and iEEG foundation models trained on the largest and most diverse corpus to date-5.3k hours of iEEG and 54k hours of EEG (1.6M channel-hours from over 17.7k subjects)-and scaled up to 1.82B parameters. We present the first systematic scaling law analysis for this domain, showing that they follow data-constrained scaling laws: for a given amount of data and compute, smaller models trained for extended epochs consistently outperform larger models trained briefly. This behavior contrasts with prior electrophysiology foundation models that emphasized model size over training duration. To achieve strong performance, we also design architectural innovations including any-variate attention, sliding temporal conditional positional encoding, and multi-domain reconstruction. DIVER-1 iEEG and EEG models each achieve state-of-the-art performance on their respective benchmarks, establishing a concrete guidelines for efficient scaling and resource allocation in electrophysiology foundation model development.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ariel Lubonja, Pedro R. A. S. Bassi, Wenxuan Li, Hualin Qiao, Randal Burns et al.
Multimodal
ๆฅ็ๆ่ฆ
Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, revealing that more than half of the MONAI-based entries have the largest gender-race discrepancy on our proprietary Johns Hopkins Hospital dataset. The RankInsight toolkit is publicly released and can be directly applied to past, ongoing, and future challenges. It enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair.
๐
2025-12-22
๐ ArXiv Preprint (cs.IT)
๐ฅ George Vershinin, Asaf Cohen, Omer Gurewitz
ๆฅ็ๆ่ฆ
We study a variant of cost-aware sequential hypothesis testing in which a single active Decision Maker (DM) selects actions with positive, random costs to identify the true hypothesis under an average error constraint, while minimizing the expected total cost. The DM may abort an in-progress action, yielding no sample, by truncating its realized cost at a smaller, tunable deterministic limit, which we term a per-action deadline. We analyze how this cancellation option can be exploited under two cost-revelation models: ex-post, where the cost is revealed only after the sample is obtained, and ex-ante, where the cost accrues before sample acquisition.
In the ex-post model, per-action deadlines do not affect the expected total cost, and the cost-error tradeoffs coincide with the baseline obtained by replacing deterministic costs with cost means. In the ex-ante model, we show how per-action deadlines inflate the expected number of times actions are applied, and that the resulting expected total cost can be reduced to the constant-cost setting by introducing an effective per-action cost. We characterize when deadlines are beneficial and study several families in detail.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chi Liu
ๆฅ็ๆ่ฆ
Collaborative fraud, where multiple fraudulent accounts coordinate to exploit online payment systems, poses significant challenges due to the formation of complex network structures. Traditional detection methods that rely solely on high-confidence identity links suffer from limited coverage, while approaches using all available linkages often result in fragmented graphs with reduced clustering effectiveness. In this paper, we propose a novel graph-based fraud detection framework that addresses the challenge of large-scale heterogeneous graph clustering through a principled link transformation approach. Our method distinguishes between \emph{hard links} (high-confidence identity relationships such as phone numbers, credit cards, and national IDs) and \emph{soft links} (behavioral associations including device fingerprints, cookies, and IP addresses). We introduce a graph transformation technique that first identifies connected components via hard links, merges them into super-nodes, and then reconstructs a weighted soft-link graph amenable to efficient embedding and clustering. The transformed graph is processed using LINE (Large-scale Information Network Embedding) for representation learning, followed by HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) for density-based cluster discovery. Experiments on a real-world payment platform dataset demonstrate that our approach achieves significant graph size reduction (from 25 million to 7.7 million nodes), doubles the detection coverage compared to hard-link-only baselines, and maintains high precision across identified fraud clusters. Our framework provides a scalable and practical solution for industrial-scale fraud detection systems.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Guy Schacht, Ziyad Sheebaelhamd, Riccardo De Santi, Mojmรญr Mutnรฝ, Andreas Krause
Machine Learning
ๆฅ็ๆ่ฆ
Preference learning from human feedback has the ability to align generative models with the needs of end-users. Human feedback is costly and time-consuming to obtain, which creates demand for data-efficient query selection methods. This work presents a novel approach that leverages optimal experimental design to ask humans the most informative preference queries, from which we can elucidate the latent reward function modeling user preferences efficiently. We formulate the problem of preference query selection as the one that maximizes the information about the underlying latent preference model. We show that this problem has a convex optimization formulation, and introduce a statistically and computationally efficient algorithm ED-PBRL that is supported by theoretical guarantees and can efficiently construct structured queries such as images or text. We empirically present the proposed framework by personalizing a text-to-image generative model to user-specific styles, showing that it requires less preference queries compared to random query selection.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Liping Sun, Yucheng Guo, Siliang Lu, Zhenzhen Li
Robotics
ๆฅ็ๆ่ฆ
With the press of global climate change, extreme weather and sudden weather changes are becoming increasingly common. To maintain a comfortable indoor environment and minimize the contribution of the building to climate change as much as possible, higher requirements are placed on the operation and control of HVAC systems, e.g., more energy-efficient and flexible to response to the rapid change of weather. This places demands on the rapid modeling and prediction of zone air temperatures of buildings. Compared to the traditional simulation-based approach such as EnergyPlus and DOE2, a hybrid approach combined physics and data-driven is more suitable. Recently, the availability of high-quality datasets and algorithmic breakthroughs have driven a considerable amount of work in this field. However, in the niche of short- and long-term predictions, there are still some gaps in existing research. This paper aims to develop a time series forecast model to predict the zone air temperature in a building located in America on a 2-week horizon. The findings could be further improved to support intelligent control and operation of HVAC systems (i.e. demand flexibility) and could also be used as hybrid building energy modeling.
๐
2025-12-22
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Md Minhazul Islam Munna, Md Mahbubur Rahman, Jaroslav Frnda, Muhammad Shahid Anwar, Alpamis Kutlimuratov
Machine Learning
ๆฅ็ๆ่ฆ
The proliferation of IoT devices and their reliance on Wi-Fi networks have introduced significant security vulnerabilities, particularly the KRACK and Kr00k attacks, which exploit weaknesses in WPA2 encryption to intercept and manipulate sensitive data. Traditional IDS using classifiers face challenges such as model overfitting, incomplete feature extraction, and high false positive rates, limiting their effectiveness in real-world deployments. To address these challenges, this study proposes a robust multiclass machine learning based intrusion detection framework. The methodology integrates advanced feature selection techniques to identify critical attributes, mitigating redundancy and enhancing detection accuracy. Two distinct ML architectures are implemented: a baseline classifier pipeline and a stacked ensemble model combining noise injection, Principal Component Analysis (PCA), and meta learning to improve generalization and reduce false positives. Evaluated on the AWID3 data set, the proposed ensemble architecture achieves superior performance, with an accuracy of 98%, precision of 98%, recall of 98%, and a false positive rate of just 2%, outperforming existing state-of-the-art methods. This work demonstrates the efficacy of combining preprocessing strategies with ensemble learning to fortify network security against sophisticated Wi-Fi attacks, offering a scalable and reliable solution for IoT environments. Future directions include real-time deployment and adversarial resilience testing to further enhance the model's adaptability.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuan Fang, Fabian Waschkowski, Maximilian Reissmann, Richard D. Sandberg, Takuo Oda et al.
Machine Learning
ๆฅ็ๆ่ฆ
Computational Fluid Dynamics (CFD)-driven training combines machine learning (ML) with CFD solvers to develop physically consistent closure models with improved predictive accuracy. In the original framework, each ML-generated candidate model is embedded in a CFD solver and evaluated against reference data, requiring hundreds to thousands of high-fidelity simulations and resulting in prohibitive computational cost for complex flows. To overcome this limitation, we propose an extended framework that integrates surrogate modeling into symbolic CFD-driven training in real time to reduce training cost. The surrogate model learns to approximate the errors of ML-generated models based on previous CFD evaluations and is continuously refined during training. Newly generated models are first assessed using the surrogate, and only those predicted to yield small errors or high uncertainty are subsequently evaluated with full CFD simulations. Discrete expressions generated by symbolic regression are mapped into a continuous space using averaged input-symbol values as inputs to a probabilistic surrogate model. To support multi-objective model training, particularly when fixed weighting of competing quantities is challenging, the surrogate is extended to a multi-output formulation by generalizing the kernel to a matrix form, providing one mean and variance prediction per training objective. Selection metrics based on these probabilistic outputs are used to identify an optimal training setup. The proposed surrogate-augmented CFD-driven training framework is demonstrated across a range of statistically one- and two-dimensional flows, including both single- and multi-expression model optimization. In all cases, the framework substantially reduces training cost while maintaining predictive accuracy comparable to that of the original CFD-driven approach.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ariana Azarbal, Victor Gillioz, Vladimir Ivanov, Bryce Woodworth, Jacob Drori et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Developers often struggle to specify correct training labels and rewards. Perhaps they don't need to. We propose recontextualization, which reduces how often language models "game" training signals, performing misbehaviors those signals mistakenly reinforce. We show recontextualization prevents models from learning to 1) prioritize evaluation metrics over chat response quality; 2) special-case code to pass incorrect tests; 3) lie to users; and 4) become sycophantic. Our method works by generating completions from prompts discouraging misbehavior and then recontextualizing them as though they were in response to prompts permitting misbehavior. Recontextualization trains language models to resist misbehavior even when instructions permit it. This mitigates the reinforcement of misbehavior from misspecified training signals, reducing specification gaming without improving the supervision signal.
๐
2025-12-22
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Hengrui Jia, Taoran Li, Jonas Guan, Varun Chandrasekaran
Natural Language Processing
ๆฅ็ๆ่ฆ
Machine unlearning aims to remove specific data influences from trained models, a capability essential for adhering to copyright laws and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning--motivated by copyright or safety--implicitly target not only verbatim content in $D_u$, but also behaviors influenced by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluation and appear to have ``forgotten'' the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily equate to removing the underlying knowledge. To address this gap, we propose \name, an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct in embedding space. By comparing unlearning metric scores between $D_u$ and $\tilde{D}_u$, we can stress-test the reliability of the metric itself. Our extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-$ฮฒ$), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. We find that current metrics frequently overestimate unlearning success, failing to detect retained knowledge exposed by our stress-test datasets.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zelin Zhao, Xinyu Gong, Bangya Liu, Ziyang Song, Jun Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations through a consistent and extensible tokenization scheme. CETCAM leverages recent advances in geometry foundation models, such as VGGT, to estimate depth and camera parameters and converts them into unified, geometry-aware tokens. These tokens are seamlessly integrated into a pretrained video diffusion backbone via lightweight context blocks. Trained in two progressive stages, CETCAM first learns robust camera controllability from diverse raw video data and then refines fine-grained visual quality using curated high-fidelity datasets. Extensive experiments across multiple benchmarks demonstrate state-of-the-art geometric consistency, temporal stability, and visual realism. Moreover, CETCAM exhibits strong adaptability to additional control modalities, including inpainting and layout control, highlighting its flexibility beyond camera control. The project page is available at https://sjtuytc.github.io/CETCam_project_page.github.io/.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ayana Hussain, Ricky Fang
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Deep learning models can reveal sensitive information about individual training examples, and while differential privacy (DP) provides guarantees restricting such leakage, it also alters optimization dynamics in poorly understood ways. We study the training dynamics of neural networks under DP by comparing Gradient Descent (GD), and Adam to their privacy-preserving variants. Prior work shows that these optimizers exhibit distinct stability dynamics: full-batch methods train at the Edge of Stability (EoS), while mini-batch and adaptive methods exhibit analogous edge-of-stability behavior. At these regimes, the training loss and the sharpness--the maximum eigenvalue of the training loss Hessian--exhibit certain characteristic behavior. In DP training, per-example gradient clipping and Gaussian noise modify the update rule, and it is unclear whether these stability patterns persist. We analyze how clipping and noise change sharpness and loss evolution and show that while DP generally reduces the sharpness and can prevent optimizers from fully reaching the classical stability thresholds, patterns from EoS and analogous adaptive methods stability regimes persist, with the largest learning rates and largest privacy budgets approaching, and sometimes exceeding, these thresholds. These findings highlight the unpredictability introduced by DP in neural network optimization.
๐
2025-12-22
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Akshaj Prashanth Rao, Advait Singh, Saumya Kumaar Saksena, Dhruv Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Prompt injection and jailbreaking attacks pose persistent security challenges to large language model (LLM)-based systems. We present an efficient and systematically evaluated defense architecture that mitigates these threats through a lightweight, multi-stage pipeline. Its core component is a semantic filter based on text normalization, TF-IDF representations, and a Linear SVM classifier. Despite its simplicity, this module achieves 93.4% accuracy and 96.5% specificity on held-out data, substantially reducing attack throughput while incurring negligible computational overhead.
Building on this efficient foundation, the full pipeline integrates complementary detection and mitigation mechanisms that operate at successive stages, providing strong robustness with minimal latency. In comparative experiments, our SVM-based configuration improves overall accuracy from 35.1% to 93.4% while reducing average time to completion from approximately 450s to 47s, yielding over 10 times lower latency than ShieldGemma. These results demonstrate that the proposed design simultaneously advances defensive precision and efficiency, addressing a core limitation of current model-based moderators.
Evaluation across a curated corpus of over 30,000 labeled prompts, including benign, jailbreak, and application-layer injections, confirms that staged, resource-efficient defenses can robustly secure modern LLM-driven applications.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Konstantin Kaulen, Tobias Ladner, Stanley Bak, Christopher Brix, Hai Duong et al.
Machine Learning
ๆฅ็ๆ่ฆ
This report summarizes the 6th International Verification of Neural Networks Competition (VNN-COMP 2025), held as a part of the 8th International Symposium on AI Verification (SAIV), that was collocated with the 37th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2025 iteration, 8 teams participated on a diverse set of 16 regular and 9 extended benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tongyuan Miao, Gary Huang, Kai Jun Han, Annie Jiang
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization.
We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35\% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Lingjie Zhao, Xue Yu, Yongzhi Qi, Hao Hu, Jianshen Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
As the pursuit of synergy between Artificial Intelligence (AI) and Operations Research (OR) gains momentum in handling complex inventory systems, a critical challenge persists: how to effectively reconcile AI's adaptive perception with OR's structural rigor. To bridge this gap, we propose a novel OR-Guided "Pretrain-then-Reinforce" framework. To provide structured guidance, we propose a simulation-augmented OR model that generates high-quality reference decisions, implicitly capturing complex business constraints and managerial preferences. Leveraging these OR-derived decisions as foundational training labels, we design a domain-informed deep learning foundation model to establish foundational decision-making capabilities, followed by a reinforcement learning (RL) fine-tuning stage. Uniquely, we position RL as a deep alignment mechanism that enables the AI agent to internalize the optimality principles of OR, while simultaneously leveraging exploration for general policy refinement and allowing expert guidance for scenario-specific adaptation (e.g., promotional events). Validated through extensive numerical experiments and a field deployment at JD.com augmented by a Difference-in-Differences (DiD) analysis, our model significantly outperforms incumbent industrial practices, delivering real-world gains of a 5.27-day reduction in turnover and a 2.29% increase in in-stock rates, alongside a 29.95% decrease in holding costs. Contrary to the prevailing trend of brute-force model scaling, our study demonstrates that a lightweight, domain-informed model can deliver state-of-the-art performance and robust transferability when guided by structured OR logic. This approach offers a scalable and cost-effective paradigm for intelligent supply chain management, highlighting the value of deeply aligning AI with OR.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kun Zhao, Siyuan Dai, Yingying Zhang, Guodong Liu, Pengfei Gu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Early detection of Alzheimer's disease (AD) requires models capable of integrating macro-scale neuroanatomical alterations with micro-scale genetic susceptibility, yet existing multimodal approaches struggle to align these heterogeneous signals. We introduce R-GenIMA, an interpretable multimodal large language model that couples a novel ROI-wise vision transformer with genetic prompting to jointly model structural MRI and single nucleotide polymorphisms (SNPs) variations. By representing each anatomically parcellated brain region as a visual token and encoding SNP profiles as structured text, the framework enables cross-modal attention that links regional atrophy patterns to underlying genetic factors. Applied to the ADNI cohort, R-GenIMA achieves state-of-the-art performance in four-way classification across normal cognition (NC), subjective memory concerns (SMC), mild cognitive impairment (MCI), and AD. Beyond predictive accuracy, the model yields biologically meaningful explanations by identifying stage-specific brain regions and gene signatures, as well as coherent ROI-Gene association patterns across the disease continuum. Attention-based attribution revealed genes consistently enriched for established GWAS-supported AD risk loci, including APOE, BIN1, CLU, and RBFOX1. Stage-resolved neuroanatomical signatures identified shared vulnerability hubs across disease stages alongside stage-specific patterns: striatal involvement in subjective decline, frontotemporal engagement during prodromal impairment, and consolidated multimodal network disruption in AD. These results demonstrate that interpretable multimodal AI can synthesize imaging and genetics to reveal mechanistic insights, providing a foundation for clinically deployable tools that enable earlier risk stratification and inform precision therapeutic strategies in Alzheimer's disease.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wei Peng, Jianchen Hu, Kang Liu, Qiaozhu Zhai
Machine Learning
ๆฅ็ๆ่ฆ
Bayesian optimization is an effective method for solving expensive black-box optimization problems. Most existing methods use Gaussian processes (GP) as the surrogate model for approximating the black-box objective function, it is well-known that it can fail in high-dimensional space (e.g., dimension over 500). We argue that the reliance of GP on precise numerical fitting is fundamentally ill-suited in high-dimensional space, where it leads to prohibitive computational complexity. In order to address this, we propose a simple order-preserving Bayesian optimization (OPBO) method, where the surrogate model preserves the order, instead of the value, of the black-box objective function. Then we can use a simple but effective OP neural network (NN) to replace GP as the surrogate model. Moreover, instead of searching for the best solution from the acquisition model, we select good-enough solutions in the ordinal set to reduce computational cost. The experimental results show that for high-dimensional (over 500) black-box optimization problems, the proposed OPBO significantly outperforms traditional BO methods based on regression NN and GP. The source code is available at https://github.com/pengwei222/OPBO.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Baiyang Chen, Zhong Yuan, Zheng Liu, Dezhong Peng, Yongxiang Li et al.
ๆฅ็ๆ่ฆ
Outlier detection is a critical task in data mining, aimed at identifying objects that significantly deviate from the norm. Semi-supervised methods improve detection performance by leveraging partially labeled data but typically overlook the uncertainty and heterogeneity of real-world mixed-attribute data. This paper introduces a semi-supervised outlier detection method, namely fuzzy rough sets-based outlier detection (FROD), to effectively handle these challenges. Specifically, we first utilize a small subset of labeled data to construct fuzzy decision systems, through which we introduce the attribute classification accuracy based on fuzzy approximations to evaluate the contribution of attribute sets in outlier detection. Unlabeled data is then used to compute fuzzy relative entropy, which provides a characterization of outliers from the perspective of uncertainty. Finally, we develop the detection algorithm by combining attribute classification accuracy with fuzzy relative entropy. Experimental results on 16 public datasets show that FROD is comparable with or better than leading detection algorithms. All datasets and source codes are accessible at https://github.com/ChenBaiyang/FROD. This manuscript is the accepted author version of a paper published by Elsevier. The final published version is available at https://doi.org/10.1016/j.ijar.2025.109373
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Baiyang Chen, Zhong Yuan, Dezhong Peng, Xiaoliang Chen, Hongmei Chen
ๆฅ็ๆ่ฆ
Outlier detection aims to find samples that behave differently from the majority of the data. Semi-supervised detection methods can utilize the supervision of partial labels, thus reducing false positive rates. However, most of the current semi-supervised methods focus on numerical data and neglect the heterogeneity of data information. In this paper, we propose a consistency-guided outlier detection algorithm (COD) for heterogeneous data with the fuzzy rough set theory in a semi-supervised manner. First, a few labeled outliers are leveraged to construct label-informed fuzzy similarity relations. Next, the consistency of the fuzzy decision system is introduced to evaluate attributes' contributions to knowledge classification. Subsequently, we define the outlier factor based on the fuzzy similarity class and predict outliers by integrating the classification consistency and the outlier factor. The proposed algorithm is extensively evaluated on 15 freshly proposed datasets. Experimental results demonstrate that COD is better than or comparable with the leading outlier detectors. This manuscript is the accepted author version of a paper published by Elsevier. The final published version is available at https://doi.org/10.1016/j.asoc.2024.112070
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Shuntuo Xu, Zhou Yu, Jian Huang
Machine Learning
ๆฅ็ๆ่ฆ
Identifying low-dimensional sufficient structures in nonlinear sufficient dimension reduction (SDR) has long been a fundamental yet challenging problem. Most existing methods lack theoretical guarantees of exhaustiveness in identifying lower dimensional structures, either at the population level or at the sample level. We tackle this issue by proposing a new method, generative sufficient dimension reduction (GenSDR), which leverages modern generative models. We show that GenSDR is able to fully recover the information contained in the central $ฯ$-field at both the population and sample levels. In particular, at the sample level, we establish a consistency property for the GenSDR estimator from the perspective of conditional distributions, capitalizing on the distributional learning capabilities of deep generative models. Moreover, by incorporating an ensemble technique, we extend GenSDR to accommodate scenarios with non-Euclidean responses, thereby substantially broadening its applicability. Extensive numerical results demonstrate the outstanding empirical performance of GenSDR and highlight its strong potential for addressing a wide range of complex, real-world tasks.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sutashu Tomonaga, Kenji Doya, Noboru Murata
ๆฅ็ๆ่ฆ
Structured State Space Models (SSMs), which are at the heart of the recently popular Mamba architecture, are powerful tools for sequence modeling. However, their theoretical foundation relies on a complex, multi-stage process of continuous-time modeling and subsequent discretization, which can obscure intuition. We introduce a direct, first-principles framework for constructing discrete-time SSMs that is both flexible and modular. Our approach is based on a novel lag operator, which geometrically derives the discrete-time recurrence by measuring how the system's basis functions "slide" and change from one timestep to the next. The resulting state matrices are computed via a single inner product involving this operator, offering a modular design space for creating novel SSMs by flexibly combining different basis functions and time-warping schemes. To validate our approach, we demonstrate that a specific instance exactly recovers the recurrence of the influential HiPPO model. Numerical simulations confirm our derivation, providing new theoretical tools for designing flexible and robust sequence models.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Debamita Ghosh, George K. Atia, Yue Wang
Machine Learning
ๆฅ็ๆ่ฆ
The deployment of reinforcement learning (RL) agents in real-world applications is often hindered by performance degradation caused by mismatches between training and deployment environments. Distributionally robust RL (DR-RL) addresses this issue by optimizing worst-case performance over an uncertainty set of transition dynamics. However, existing work typically relies on substantial prior knowledge-such as access to a generative model or a large offline dataset-and largely focuses on tabular methods that do not scale to complex domains. We overcome these limitations by proposing an online DR-RL algorithm with general function approximation that learns an optimal robust policy purely through interaction with the environment, without requiring prior models or offline data, enabling deployment in high-dimensional tasks. We further provide a theoretical analysis establishing a near-optimal sublinear regret bound under a total variation uncertainty set, demonstrating the sample efficiency and effectiveness of our method.
๐
2025-12-22
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yizhi Wang, Linan Yue, Min-Ling Zhang
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks through long Chain-of-Thought (CoT) reasoning. Extending these successes to multimodal reasoning remains challenging due to the increased complexity of integrating diverse input modalities and the scarcity of high-quality long CoT training data. Existing multimodal datasets and CoT synthesis methods still suffer from limited reasoning depth, modality conversion errors, and rigid generation pipelines, hindering model performance and stability. To this end, in this paper, we propose SynSelect, a novel three-stage Synthesis-Selection framework for generating high-quality long CoT data tailored to multimodal reasoning tasks. Specifically, SynSelect first leverages multiple heterogeneous multimodal LRMs to produce diverse candidate CoTs, and then applies both instance and batch level selection to filter high-quality CoTs that can effectively enhance the model's reasoning capabilities. Extensive experiments on multiple multimodal benchmarks demonstrate that models supervised fine-tuned on SynSelect-generated data significantly outperform baselines and achieve further improvements after reinforcement learning post-training. Our results validate SynSelect as an effective approach for advancing multimodal LRMs reasoning capabilities.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Patrick Batsell, Tsutsui Satoshi, Bihan Wen
RoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Infants learn to recognize not only object categories but also fine grained attributes such as color, size, and texture within their first two years of life. Prior work explores Childs View for Contrastive Learning (CVCL), a CLIP style model trained on infant egocentric video as a computational model of early infant learning, but it focuses only on class level recognition. This leaves it unclear whether infant scale learning also supports attribute discrimination. To address this, we introduce a benchmark that systematically varies color, size, and texture, allowing controlled tests of within class attribute recognition. Comparing CVCL with CLIP shows clear differences. CVCL is better at size discrimination, while CLIP achieves higher accuracy on color discrimination. Both models represent texture in image embeddings but fail to ground texture linguistically, suggesting a gap between visual and language spaces.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Saman Forouzandeh, Wei Peng, Parham Moradi, Xinghuo Yu, Mahdi Jalili
Natural Language Processing
ๆฅ็ๆ่ฆ
We present MACLA, a framework that decouples reasoning from learning by maintaining a frozen large language model while performing all adaptation in an external hierarchical procedural memory. MACLA extracts reusable procedures from trajectories, tracks reliability via Bayesian posteriors, selects actions through expected-utility scoring, and refines procedures by contrasting successes and failures. Across four benchmarks (ALFWorld, WebShop, TravelPlanner, InterCodeSQL), MACLA achieves 78.1 percent average performance, outperforming all baselines. On ALFWorld unseen tasks, MACLA reaches 90.3 percent with 3.1 percent positive generalization. The system constructs memory in 56 seconds, 2800 times faster than the state-of-the-art LLM parameter-training baseline, compressing 2851 trajectories into 187 procedures. Experimental results demonstrate that structured external memory with Bayesian selection and contrastive refinement enables sample-efficient, interpretable, and continually improving agents without LLM parameter updates.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Michael S. Zhang, Rishi A. Ruia, Arnav Kewalram, Saathvik Dharmapuram, Utkarsh Sharma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and replay buffer strategies in large language models, revealing unexpected dynamics. While FP16 achieves superior initial task performance (74.44% on NLU), we observe a striking inversion on subsequent tasks: quantized models outperform FP16 by 8-15% on final task forward accuracy, with INT4 achieving nearly double FP16's performance on Code generation (40% vs 20%). Critically, even minimal replay buffers (0.1%) dramatically improve retention - increasing NLU retention after Math training from 45% to 65% across all precision levels - with INT8 consistently achieving the optimal balance between learning plasticity and knowledge retention. We hypothesize that quantization-induced noise acts as implicit regularization, preventing the overfitting to new task gradients that plagues high-precision models. These findings challenge the conventional wisdom that higher precision is always preferable, suggesting instead that INT8 quantization offers both computational efficiency and superior continual learning dynamics. Our results provide practical guidelines for deploying compressed models in continual learning scenarios: small replay buffers (1-2%) suffice for NLU tasks, while Math and Code benefit from moderate buffers (5-10%), with quantized models requiring less replay than FP16 to achieve comparable retention. Code is available at https://github.com/Festyve/LessIsMore.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sarwan Ali
Robotics
ๆฅ็ๆ่ฆ
Differential privacy (DP) has emerged as the gold standard for protecting user data in recommender systems, but existing privacy-preserving mechanisms face a fundamental challenge: the privacy-utility tradeoff inevitably degrades recommendation quality as privacy budgets tighten. We introduce DPSR (Differentially Private Sparse Reconstruction), a novel three-stage denoising framework that fundamentally addresses this limitation by exploiting the inherent structure of rating matrices -- sparsity, low-rank properties, and collaborative patterns.
DPSR consists of three synergistic stages: (1) \textit{information-theoretic noise calibration} that adaptively reduces noise for high-information ratings, (2) \textit{collaborative filtering-based denoising} that leverages item-item similarities to remove privacy noise, and (3) \textit{low-rank matrix completion} that exploits latent structure for signal recovery. Critically, all denoising operations occur \textit{after} noise injection, preserving differential privacy through the post-processing immunity theorem while removing both privacy-induced and inherent data noise.
Through extensive experiments on synthetic datasets with controlled ground truth, we demonstrate that DPSR achieves 5.57\% to 9.23\% RMSE improvement over state-of-the-art Laplace and Gaussian mechanisms across privacy budgets ranging from $\varepsilon=0.1$ to $\varepsilon=10.0$ (all improvements statistically significant with $p < 0.05$, most $p < 0.001$). Remarkably, at $\varepsilon=1.0$, DPSR achieves RMSE of 0.9823, \textit{outperforming even the non-private baseline} (1.0983), demonstrating that our denoising pipeline acts as an effective regularizer that removes data noise in addition to privacy noise.
๐
2025-12-22
๐ ArXiv Preprint (Machine Learning)
๐ฅ Feng Bao, Hui Sun
Machine Learning
ๆฅ็ๆ่ฆ
This work puts forward a novel nonlinear optimal filter namely the Ensemble Schr{รถ}dinger Bridge nonlinear filter. The proposed filter finds marriage of the standard prediction procedure and the diffusion generative modeling for the analysis procedure to realize one filtering step. The designed approach finds no structural model error, and it is derivative free, training free and highly parallizable. Experimental results show that the designed algorithm performs well given highly nonlinear dynamics in (mildly) high dimension up to 40 or above under a chaotic environment. It also shows better performance than classical methods such as the ensemble Kalman filter and the Particle filter in numerous tests given different level of nonlinearity. Future work will focus on extending the proposed approach to practical meteorological applications and establishing a rigorous convergence analysis.
๐
2025-12-22
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Markus Ekvall, Ludvig Bergenstrรฅhle, Patrick Truong, Ben Murrell, Joakim Lundeberg
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Retrieval-augmented generation improves large language models by grounding outputs in external knowledge sources, reducing hallucinations and addressing knowledge cutoffs. However, standard embedding-based retrieval fails to capture the complexity of multi-concept queries, particularly in domains like biomedicine, where biological data are inherently high-dimensional. For example,omics datasets, and clinical reports simultaneously exhibit numerous molecular, cellular, and physiological features. We present Stochastic Latent Matching (STHLM), a generative vector search method that samples query-conditioned embeddings from text or image inputs to enhance retrieval performance. Analogous to how Chain-of-Thought reasoning enables language models to "think longer" on complex problems, STHLM allows retrieval systems to "search wider" through iterative sampling. STHLM demonstrates critical improvements over classical vector retrieval across diverse benchmarks, including scientific literature, clinical notes, and tissue images, boosting retrieval performance by 10-30% through test-time compute (trading latency for accuracy), while enabling up to a 10-fold compression of embedding dimensions.
๐
2025-12-22
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng
Natural Language Processing
ๆฅ็ๆ่ฆ
Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.
๐
2025-12-22
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Chong Liu, Ming Zhang, Fei Li, Hao Zhou, Xiaoshuang Chen et al.
ๆฅ็ๆ่ฆ
Automated front-end engineering drastically reduces development cycles and minimizes manual coding overhead. While Generative AI has shown promise in translating designs to code, current solutions often produce monolithic scripts, failing to natively support modern ecosystems like React, Vue, or Angular. Furthermore, the generated code frequently suffers from poor modularity, making it difficult to maintain. To bridge this gap, we introduce Modular Layout Synthesis (MLS), a hierarchical framework that merges visual understanding with structural normalization. Initially, a visual-semantic encoder maps the screen capture into a serialized tree topology, capturing the essential layout hierarchy. Instead of simple parsing, we apply heuristic deduplication and pattern recognition to isolate reusable blocks, creating a framework-agnostic schema. Finally, a constraint-based generation protocol guides the LLM to synthesize production-ready code with strict typing and component props. Evaluations show that MLS significantly outperforms existing baselines, ensuring superior code reusability and structural integrity across multiple frameworks
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dixuan Lin, Tianyou Wang, Zhuoyang Pan, Yufu Wang, Lingjie Liu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. It is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand centric coordinates and ignore the scene, hindering metric accuracy and practical use. In our method, we first use data-driven foundation models to initialize the core components, including the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand-object motion from grasping to interaction, which remains consistent with the scene information observed in the input video.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Jiaqi Peng, Wenzhe Cai, Yuqiang Yang, Tai Wang, Yuan Shen et al.
Robotics
ๆฅ็ๆ่ฆ
Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error propagation.We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while metric-aware geometry memory enhances planning consistency and obstacle avoidance, leading to more than a 27.3\% improvement over oracle-localization baselines and strong generalization across embodiments and environments. The code and models have been made publicly available on the \href{https://steinate.github.io/logoplanner.github.io/}{project page}.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Yinhuai Wang, Runyi Yu, Hok Wai Tsui, Xiaoyi Lin, Hui Zhang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We present a system for learning generalizable hand-object tracking controllers purely from synthetic data, without requiring any human demonstrations. Our approach makes two key contributions: (1) HOP, a Hand-Object Planner, which can synthesize diverse hand-object trajectories; and (2) HOT, a Hand-Object Tracker that bridges synthetic-to-physical transfer through reinforcement learning and interaction imitation learning, delivering a generalizable controller conditioned on target hand-object states. Our method extends to diverse object shapes and hand morphologies. Through extensive evaluations, we show that our approach enables dexterous hands to track challenging, long-horizon sequences including object re-arrangement and agile in-hand reorientation. These results represent a significant step toward scalable foundation controllers for manipulation that can learn entirely from synthetic data, breaking the data bottleneck that has long constrained progress in dexterous manipulation.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universitรคt Wรผrzburg in cooperation with the Technische Universitรคt Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Carlos Pรฉrez-Ruiz, Joan Solร
ๆฅ็ๆ่ฆ
This work introduces LIMOncello, a tightly coupled LiDAR-Inertial Odometry system that models 6-DoF motion on the $\mathrm{SGal}(3)$ manifold within an iterated error-state Kalman filter backend. Compared to state representations defined on $\mathrm{SO}(3)\times\mathbb{R}^6$, the use of $\mathrm{SGal}(3)$ provides a coherent and numerically stable discrete-time propagation model that helps limit drift in low-observability conditions.
LIMOncello also includes a lightweight incremental i-Octree mapping backend that enables faster updates and substantially lower memory usage than incremental kd-tree style map structures, without relying on locality-restricted search heuristics. Experiments on multiple real-world datasets show that LIMOncello achieves competitive accuracy while improving robustness in geometrically sparse environments. The system maintains real-time performance with stable memory growth and is released as an extensible open-source implementation at https://github.com/CPerezRuiz335/LIMOncello.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Yanliang Huang, Xia Yan, Peiran Yin, Zhenduo Zhang, Zeyan Shao et al.
Robotics
ๆฅ็ๆ่ฆ
Over the past decade, a wide range of motion planning approaches for autonomous vehicles has been developed to handle increasingly complex traffic scenarios. However, these approaches are rarely compared on standardized benchmarks, limiting the assessment of relative strengths and weaknesses. To address this gap, we present the setup and results of the 4th CommonRoad Motion Planning Competition held in 2024, conducted using the CommonRoad benchmark suite. This annual competition provides an open-source and reproducible framework for benchmarking motion planning algorithms. The benchmark scenarios span highway and urban environments with diverse traffic participants, including passenger cars, buses, and bicycles. Planner performance is evaluated along four dimensions: efficiency, safety, comfort, and compliance with selected traffic rules. This report introduces the competition format and provides a comparison of representative high-performing planners from the 2023 and 2024 editions.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Martin Sedlacek, Pavlo Yefanov, Georgy Ponimatkin, Jai Bardhan, Simon Pilc et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on, which is presently difficult and expensive to evaluate in the real-world. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and aligned robot control. Our environment offers a suite of 15 perturbation factors, 7 manipulation skills, and more than 3,500 objects. Finally, we establish two task sets that form our benchmark and evaluate the ฯ_{0}, ฯ_{0}-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge. More broadly, we also show that simulation gives us a valuable proxy for the real-world and allows us to systematically probe for and quantify the weaknesses and failure modes of VLAs. Project page: https://martin-sedlacek.com/realm
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Zhenglong Guo, Yiming Zhao, Feng Jiang, Heng Jin, Zongbao Feng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Embodied robotic AI systems designed to manage complex daily tasks rely on a task planner to understand and decompose high-level tasks. While most research focuses on enhancing the task-understanding abilities of LLMs/VLMs through fine-tuning or chain-of-thought prompting, this paper argues that defining the planned skill set is equally crucial. To handle the complexity of daily environments, the skill set should possess a high degree of generalization ability. Empirically, more abstract expressions tend to be more generalizable. Therefore, we propose to abstract the planned result as a set of meta-actions. Each meta-action comprises three components: {move/rotate, end-effector status change, relationship with the environment}. This abstraction replaces human-centric concepts, such as grasping or pushing, with the robot's intrinsic functionalities. As a result, the planned outcomes align seamlessly with the complete range of actions that the robot is capable of performing. Furthermore, to ensure that the LLM/VLM accurately produces the desired meta-action format, we employ the Retrieval-Augmented Generation (RAG) technique, which leverages a database of human-annotated planning demonstrations to facilitate in-context learning. As the system successfully completes more tasks, the database will self-augment to continue supporting diversity. The meta-action set and its integration with RAG are two novel contributions of our planner, denoted as MaP-AVR, the meta-action planner for agents composed of VLM and RAG. To validate its efficacy, we design experiments using GPT-4o as the pre-trained LLM/VLM model and OmniGibson as our robotic platform. Our approach demonstrates promising performance compared to the current state-of-the-art method. Project page: https://map-avr.github.io/.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nitin Kumar Singh, Arie Rachmad Syulistyo, Yuichiro Tanaka, Hakaru Tamukoh
Machine Learning
ๆฅ็ๆ่ฆ
Sign language recognition (SLR) facilitates communication between deaf and hearing communities. Deep learning based SLR models are commonly used but require extensive computational resources, making them unsuitable for deployment on edge devices. To address these limitations, we propose a lightweight SLR system that combines parallel bidirectional reservoir computing (PBRC) with MediaPipe. MediaPipe enables real-time hand tracking and precise extraction of hand joint coordinates, which serve as input features for the PBRC architecture. The proposed PBRC architecture consists of two echo state network (ESN) based bidirectional reservoir computing (BRC) modules arranged in parallel to capture temporal dependencies, thereby creating a rich feature representation for classification. We trained our PBRC-based SLR system on the Word-Level American Sign Language (WLASL) video dataset, achieving top-1, top-5, and top-10 accuracies of 60.85%, 85.86%, and 91.74%, respectively. Training time was significantly reduced to 18.67 seconds due to the intrinsic properties of reservoir computing, compared to over 55 minutes for deep learning based methods such as Bi-GRU. This approach offers a lightweight, cost-effective solution for real-time SLR on edge devices.
๐
2025-12-22
๐ ArXiv Preprint (math.NA)
๐ฅ Philipp L. Kinon, Simon R. Eugster, Peter Betsch
ๆฅ็ๆ่ฆ
An energy-based modeling framework for the nonlinear dynamics of spatial Cosserat rods undergoing large displacements and rotations is proposed. The mixed formulation features independent displacement, velocity and stress variables and is further objective and locking-free. Finite rotations are represented using a director formulation that avoids singularities and yields a constant mass matrix. This results in an infinite-dimensional nonlinear port-Hamiltonian (PH) system governed by partial differential-algebraic equations with a quadratic energy functional. Using a time-differentiated compliance form of the stress-strain relations allows for the imposition of kinematic constraints, such as inextensibility or shear-rigidity. A structure-preserving finite element discretization leads to a finite-dimensional system with PH structure, thus facilitating the design of an energy-momentum consistent integration scheme. Dissipative material behavior (via the generalized-Maxwell model) and non-standard actuation approaches (via pneumatic chambers or tendons) integrate naturally into the framework. As illustrated by selected numerical examples, the present framework establishes a new approach to energy-momentum consistent formulations in computational mechanics involving finite rotations.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Yujie Zhao, Hongwei Fan, Di Chen, Shengcong Chen, Liliang Chen et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework's flexibility and extensibility, indicating its potential to serve as a unified data generation framework.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Hongwei Fan, Hang Dai, Jiyao Zhang, Jinzhou Li, Qiyang Yan et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
The robotics field is evolving towards data-driven, end-to-end learning, inspired by multimodal large models. However, reliance on expensive real-world data limits progress. Simulators offer cost-effective alternatives, but the gap between simulation and reality challenges effective policy transfer. This paper introduces TwinAligner, a novel Real2Sim2Real system that addresses both visual and dynamic gaps. The visual alignment module achieves pixel-level alignment through SDF reconstruction and editable 3DGS rendering, while the dynamic alignment module ensures dynamic consistency by identifying rigid physics from robot-object interaction. TwinAligner improves robot learning by providing scalable data collection and establishing a trustworthy iterative cycle, accelerating algorithm development. Quantitative evaluations highlight TwinAligner's strong capabilities in visual and dynamic real-to-sim alignment. This system enables policies trained in simulation to achieve strong zero-shot generalization to the real world. The high consistency between real-world and simulated policy performance underscores TwinAligner's potential to advance scalable robot learning. Code and data will be released on https://twin-aligner.github.io
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Han Fang, Yize Huang, Yuheng Zhao, Paul Weng, Xiao Li et al.
Robotics
ๆฅ็ๆ่ฆ
Robot manipulation, a key capability of embodied AI, has turned to data-driven generative policy frameworks, but mainstream approaches like Diffusion Models suffer from high inference latency and Flow-based Methods from increased architectural complexity. While simply applying meanFlow on robotic tasks achieves single-step inference and outperforms FlowPolicy, it lacks few-shot generalization due to fixed temperature hyperparameters in its Dispersive Loss and misaligned predicted-true mean velocities. To solve these issues, this study proposes an improved MeanFlow-based Policies: we introduce a lightweight Cosine Loss to align velocity directions and use the Differential Derivation Equation (DDE) to optimize the Jacobian-Vector Product (JVP) operator. Experiments on Adroit and Meta-World tasks show the proposed method outperforms MP1 and FlowPolicy in average success rate, especially in challenging Meta-World tasks, effectively enhancing few-shot generalization and trajectory accuracy of robot manipulation policies while maintaining real-time performance, offering a more robust solution for high-precision robotic manipulation.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Longxiang Shao, Ulrich Dahmen, Juergen Rossmann
Machine Learning
ๆฅ็ๆ่ฆ
Rigid body dynamics simulators are important tools for the design, analysis and optimization of mechanical systems in a variety of technical and scientific applications. This study examines four different simulation environments (Adams, Simscape, OpenModelica, and VEROSIM), focusing in particular on the comparison of the modeling methods, the numerical solvers, and the treatment of numerical problems that arise especially in closed-loop kinematics (esp. redundant boundary conditions and static equilibrium problem). A novel and complex crane boom of a real forestry machine serves as a practical benchmark application example. The direct comparison of the different solution approaches in the examined simulation tools supports the user in selecting the most suitable tool for his application.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Zhaoyang Liu, Weitao Zhou, Junze Wen, Cheng Jing, Qian Cheng et al.
Robotics
ๆฅ็ๆ่ฆ
Collecting large-scale naturalistic driving data is essential for training robust autonomous driving planners. However, real-world datasets often contain a substantial amount of repetitive and low-value samples, which lead to excessive storage costs and bring limited benefits to policy learning. To address this issue, we propose an information-theoretic data pruning method that effectively reduces the training data volume without compromising model performance. Our approach evaluates the trajectory distribution information entropy of driving data and iteratively selects high-value samples that preserve the statistical characteristics of the original dataset in a model-agnostic manner. From a theoretical perspective, we show that maximizing trajectory entropy effectively constrains the Kullback-Leibler divergence between the pruned subset and the original data distribution, thereby maintaining generalization ability. Comprehensive experiments on the NuPlan benchmark with a large-scale imitation learning framework demonstrate that the proposed method can reduce the dataset size by up to 40% while maintaining closed-loop performance. This work provides a lightweight and theoretically grounded approach for scalable data management and efficient policy learning in autonomous driving systems.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Yitian Zheng, Zhangchen Ye, Weijun Dong, Shengjie Wang, Yuyang Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Recent advances in hierarchical robot systems leverage a high-level planner to propose task plans and a low-level policy to generate robot actions. This design allows training the planner on action-free or even non-robot data sources (e.g., videos), providing transferable high-level guidance. Nevertheless, grounding these high-level plans into executable actions remains challenging, especially with the limited availability of high-quality robot data. To this end, we propose to improve the low-level policy through online interactions. Specifically, our approach collects online rollouts, retrospectively annotates the corresponding high-level goals from achieved outcomes, and aggregates these hindsight-relabeled experiences to update a goal-conditioned imitation policy. Our method, Hindsight Flow-conditioned Online Imitation (HinFlow), instantiates this idea with 2D point flows as the high-level planner. Across diverse manipulation tasks in both simulation and physical world, our method achieves more than $2\times$ performance improvement over the base policy, significantly outperforming the existing methods. Moreover, our framework enables policy acquisition from planners trained on cross-embodiment video data, demonstrating its potential for scalable and transferable robot learning.
๐
2025-12-22
๐ ArXiv Preprint (eess.SY)
๐ฅ Tarek Bouazza, Alessandro Melis, Soulaimane Berkane, Robert Mahony, Tarek Hamel
ๆฅ็ๆ่ฆ
This paper tackles the problem of estimating the relative position, orientation, and velocity between a UAV and a planar platform undergoing arbitrary 3D motion during approach and landing. The estimation relies on measurements from Inertial Measurement Units (IMUs) mounted on both systems, assuming there is a suitable communication channel to exchange data, together with visual information provided by an onboard monocular camera, from which the bearing (line-of-sight direction) to the platform's center and the normal vector of its planar surface are extracted. We propose a cascade observer with a complementary filter on SO(3) to reconstruct the relative attitude, followed by a linear Riccati observer for relative position and velocity estimation. Convergence of both observers is established under persistently exciting conditions, and the cascade is shown to be almost globally asymptotically and locally exponentially stable. We further extend the design to the case where the platform's rotation is restricted to its normal axis and show that its measured linear acceleration can be exploited to recover the remaining unobservable rotation angle. A sufficient condition to ensure local exponential convergence in this setting is provided. The performance of the proposed observers is validated through extensive simulations.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Jin Wang, Kim Tien Ly, Jacques Cloete, Nikos Tsagarakis, Ioannis Havoutis
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Bridging the gap between natural language commands and autonomous execution in unstructured environments remains an open challenge for robotics. This requires robots to perceive and reason over the current task scene through multiple modalities, and to plan their behaviors to achieve their intended goals. Traditional robotic task-planning approaches often struggle to bridge low-level execution with high-level task reasoning, and cannot dynamically update task strategies when instructions change during execution, which ultimately limits their versatility and adaptability to new tasks. In this work, we propose a novel language model-based framework for dynamic robot task planning. Our Vision-Language-Policy (VLP) model, based on a vision-language model fine-tuned on real-world data, can interpret semantic instructions and integrate reasoning over the current task scene to generate behavior policies that control the robot to accomplish the task. Moreover, it can dynamically adjust the task strategy in response to changes in the task, enabling flexible adaptation to evolving task requirements. Experiments conducted with different robots and a variety of real-world tasks show that the trained model can efficiently adapt to novel scenarios and dynamically update its policy, demonstrating strong planning autonomy and cross-embodiment generalization. Videos: https://robovlp.github.io/
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Jose Gustavo Buenaventura Carreon, Floris Erich, Roman Mykhailyshyn, Tomohiro Motoda, Ryo Hanai et al.
Robotics
ๆฅ็ๆ่ฆ
We present a cross robot visuomotor learning framework that integrates diffusion policy based control with 3D semantic scene representations from D3Fields to enable category level generalization in manipulation. Its modular design supports diverse robot camera configurations including UR5 arms with Microsoft Azure Kinect arrays and bimanual manipulators with Intel RealSense sensors through a low latency control stack and intuitive teleoperation. A unified configuration layer enables seamless switching between setups for flexible data collection training and evaluation. In a grasp and lift block task the framework achieved an 80 percent success rate after only 100 demonstration episodes demonstrating robust skill transfer between platforms and sensing modalities. This design paves the way for scalable real world studies in cross robotic generalization.
๐
2025-12-22
๐ AAAI 2026
๐ฅ Pengxuan Yang, Ben Lu, Zhongpu Xia, Chao Han, Yinfeng Gao et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Latent World Models enhance scene representation through temporal self-supervised learning, presenting a perception annotation-free paradigm for end-to-end autonomous driving. However, the reconstruction-oriented representation learning tangles perception with planning tasks, leading to suboptimal optimization for planning. To address this challenge, we propose WorldRFT, a planning-oriented latent world model framework that aligns scene representation learning with planning via a hierarchical planning decomposition and local-aware interactive refinement mechanism, augmented by reinforcement learning fine-tuning (RFT) to enhance safety-critical policy performance. Specifically, WorldRFT integrates a vision-geometry foundation model to improve 3D spatial awareness, employs hierarchical planning task decomposition to guide representation optimization, and utilizes local-aware iterative refinement to derive a planning-oriented driving policy. Furthermore, we introduce Group Relative Policy Optimization (GRPO), which applies trajectory Gaussianization and collision-aware rewards to fine-tune the driving policy, yielding systematic improvements in safety. WorldRFT achieves state-of-the-art (SOTA) performance on both open-loop nuScenes and closed-loop NavSim benchmarks. On nuScenes, it reduces collision rates by 83% (0.30% -> 0.05%). On NavSim, using camera-only sensors input, it attains competitive performance with the LiDAR-based SOTA method DiffusionDrive (87.8 vs. 88.1 PDMS).
๐
2025-12-22
๐ This paper is accepted by the IEEE Internet of Thi
๐ฅ Pengyu Chen, Tao Ouyang, Ke Luo, Weijie Hong, Xu Chen
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous navigation for Unmanned Aerial Vehicles faces key challenges from limited onboard computational resources, which restrict deployed deep neural networks to shallow architectures incapable of handling complex environments. Offloading tasks to remote edge servers introduces high latency, creating an inherent trade-off in system design. To address these limitations, we propose CoDrone - the first cloud-edge-end collaborative computing framework integrating foundation models into autonomous UAV cruising scenarios - effectively leveraging foundation models to enhance performance of resource-constrained unmanned aerial vehicle platforms. To reduce onboard computation and data transmission overhead, CoDrone employs grayscale imagery for the navigation model. When enhanced environmental perception is required, CoDrone leverages the edge-assisted foundation model Depth Anything V2 for depth estimation and introduces a novel one-dimensional occupancy grid-based navigation method - enabling fine-grained scene understanding while advancing efficiency and representational simplicity of autonomous navigation. A key component of CoDrone is a Deep Reinforcement Learning-based neural scheduler that seamlessly integrates depth estimation with autonomous navigation decisions, enabling real-time adaptation to dynamic environments. Furthermore, the framework introduces a UAV-specific vision language interaction module incorporating domain-tailored low-level flight primitives to enable effective interaction between the cloud foundation model and the UAV. The introduction of VLM enhances open-set reasoning capabilities in complex unseen scenarios. Experimental results show CoDrone outperforms baseline methods under varying flight speeds and network conditions, achieving a 40% increase in average flight distance and a 5% improvement in average Quality of Navigation.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Chao Yang, Yingkai Sun, Peng Ye, Xin Chen, Chong Yu et al.
Robotics
ๆฅ็ๆ่ฆ
Learning a general motion tracking policy from human motions shows great potential for versatile humanoid whole-body control. Conventional approaches are not only inefficient in data utilization and training processes but also exhibit limited performance when tracking highly dynamic motions. To address these challenges, we propose EGM, a framework that enables efficient learning of a general motion tracking policy. EGM integrates four core designs. Firstly, we introduce a Bin-based Cross-motion Curriculum Adaptive Sampling strategy to dynamically orchestrate the sampling probabilities based on tracking error of each motion bin, eficiently balancing the training process across motions with varying dificulty and durations. The sampled data is then processed by our proposed Composite Decoupled Mixture-of-Experts (CDMoE) architecture, which efficiently enhances the ability to track motions from different distributions by grouping experts separately for upper and lower body and decoupling orthogonal experts from shared experts to separately handle dedicated features and general features. Central to our approach is a key insight we identified: for training a general motion tracking policy, data quality and diversity are paramount. Building on these designs, we develop a three-stage curriculum training flow to progressively enhance the policy's robustness against disturbances. Despite training on only 4.08 hours of data, EGM generalized robustly across 49.25 hours of test motions, outperforming baselines on both routine and highly dynamic tasks.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Xu Liu, Yu Liu, Hanshuo Qiu, Yang Qirong, Zhouhui Lian
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Navigation (VLN) enables agents to navigate in complex environments by following natural language instructions grounded in visual observations. Although most existing work has focused on ground-based robots or outdoor Unmanned Aerial Vehicles (UAVs), indoor UAV-based VLN remains underexplored, despite its relevance to real-world applications such as inspection, delivery, and search-and-rescue in confined spaces. To bridge this gap, we introduce \textbf{IndoorUAV}, a novel benchmark and method specifically tailored for VLN with indoor UAVs. We begin by curating over 1,000 diverse and structurally rich 3D indoor scenes from the Habitat simulator. Within these environments, we simulate realistic UAV flight dynamics to collect diverse 3D navigation trajectories manually, further enriched through data augmentation techniques. Furthermore, we design an automated annotation pipeline to generate natural language instructions of varying granularity for each trajectory. This process yields over 16,000 high-quality trajectories, comprising the \textbf{IndoorUAV-VLN} subset, which focuses on long-horizon VLN. To support short-horizon planning, we segment long trajectories into sub-trajectories by selecting semantically salient keyframes and regenerating concise instructions, forming the \textbf{IndoorUAV-VLA} subset. Finally, we introduce \textbf{IndoorUAV-Agent}, a novel navigation model designed for our benchmark, leveraging task decomposition and multimodal reasoning. We hope IndoorUAV serves as a valuable resource to advance research on vision-language embodied AI in the indoor aerial navigation domain.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sihao Lin, Zerui Li, Xunyi Zhao, Gengze Zhou, Liuyi Wang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight that the benchmarks provide into sim-to-real generalization, and create a significant research gap. Furthermore, task fragmentation prevents unified/shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible and teleporting "ghost" agents that support full-kinematics in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.
๐
2025-12-22
๐ ArXiv Preprint (eess.SP)
๐ฅ Devi Yuliarti, Ravi Prakash, Hiu Ching Cheung, Amy Strong, Patrick J. Codd et al.
Multimodal
ๆฅ็ๆ่ฆ
The tactile properties of tissue, such as elasticity and stiffness, often play an important role in surgical oncology when identifying tumors and pathological tissue boundaries. Though extremely valuable, robot-assisted surgery comes at the cost of reduced sensory information to the surgeon; typically, only vision is available. Sensors proposed to overcome this sensory desert are often bulky, complex, and incompatible with the surgical workflow. We present PalpAid, a multimodal pneumatic tactile sensor equipped with a microphone and pressure sensor, converting contact force into an internal pressure differential. The pressure sensor acts as an event detector, while the auditory signature captured by the microphone assists in tissue delineation. We show the design, fabrication, and assembly of sensory units with characterization tests to show robustness to use, inflation-deflation cycles, and integration with a robotic system. Finally, we show the sensor's ability to classify 3D-printed hard objects with varying infills and soft ex vivo tissues. Overall, PalpAid aims to fill the sensory gap intelligently and allow improved clinical decision-making.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Yanding Yang, Weitao Zhou, Jinhai Wang, Xiaomin Guo, Junze Wen et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous buses run on fixed routes but must operate in open, dynamic urban environments. Disengagement events on these routes are often geographically concentrated and typically arise from planner failures in highly interactive regions. Such policy-level failures are difficult to correct using conventional imitation learning, which easily overfits to sparse disengagement data. To address this issue, this paper presents a Disengagement-Triggered Contrastive Continual Learning (DTCCL) framework that enables autonomous buses to improve planning policies through real-world operation. Each disengagement triggers cloud-based data augmentation that generates positive and negative samples by perturbing surrounding agents while preserving route context. Contrastive learning refines policy representations to better distinguish safe and unsafe behaviors, and continual updates are applied in a cloud-edge loop without human supervision. Experiments on urban bus routes demonstrate that DTCCL improves overall planning performance by 48.6 percent compared with direct retraining, validating its effectiveness for scalable, closed-loop policy improvement in autonomous public transport.
๐
2025-12-22
๐ ICRA 2026
๐ฅ Ryosuke Korekata, Quanting Xie, Yonatan Bisk, Komei Sugiura
RoboticsMultimodal
ๆฅ็ๆ่ฆ
In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.
๐
2025-12-22
๐ ArXiv Preprint (Robotics)
๐ฅ Yadong Liu, Jianwei Liu, He Liang, Dimitrios Kanoulas
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Quadruped mobile manipulators offer strong potential for agile loco-manipulation but remain difficult to control and transfer reliably from simulation to reality. Reinforcement learning (RL) shows promise for whole-body control, yet most frameworks are proprietary and hard to reproduce on real hardware. We present an open pipeline for training, benchmarking, and deploying RL-based controllers on the Unitree B1 quadruped with a Z1 arm. The framework unifies sim-to-sim and sim-to-real transfer through ROS, re-implementing a policy trained in Isaac Gym, extending it to MuJoCo via a hardware abstraction layer, and deploying the same controller on physical hardware. Sim-to-sim experiments expose discrepancies between Isaac Gym and MuJoCo contact models that influence policy behavior, while real-world teleoperated object-picking trials show that coordinated whole-body control extends reach and improves manipulation over floating-base baselines. The pipeline provides a transparent, reproducible foundation for developing and analyzing RL-based loco-manipulation controllers and will be released open source to support future research.
๐
2025-12-22
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hang Yu, Juntu Zhao, Yufeng Liu, Kaiyu Li, Cheng Ma et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompt, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce the Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.
๐
2025-12-21
๐ ArXiv Preprint (cs.SE)
๐ฅ Shaokang Jiang, Daye Nam
Natural Language Processing
ๆฅ็ๆ่ฆ
While Large Language Models (LLMs) have demonstrated remarkable capabilities, research shows that their effectiveness depends not only on explicit prompts but also on the broader context provided. This requirement is especially pronounced in software engineering, where the goals, architecture, and collaborative conventions of an existing project play critical roles in response quality. To support this, many AI coding assistants have introduced ways for developers to author persistent, machine-readable directives that encode a project's unique constraints. Although this practice is growing, the content of these directives remains unstudied.
This paper presents a large-scale empirical study to characterize this emerging form of developer-provided context. Through a qualitative analysis of 401 open-source repositories containing cursor rules, we developed a comprehensive taxonomy of project context that developers consider essential, organized into five high-level themes: Conventions, Guidelines, Project Information, LLM Directives, and Examples. Our study also explores how this context varies across different project types and programming languages, offering implications for the next generation of context-aware AI developer tools.
๐
2025-12-21
๐ NeurIPS 2025
๐ฅ Szymon Rusiecki, Cecilia G. Morales, Kimberly Elenberg, Leonard Weiss, Artur Dubrawski
RoboticsMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Mass Casualty Incidents can overwhelm emergency medical systems and resulting delays or errors in the assessment of casualties can lead to preventable deaths. We present a decision support framework that fuses outputs from multiple computer vision models, estimating signs of severe hemorrhage, respiratory distress, physical alertness, or visible trauma, into a Bayesian network constructed entirely from expert-defined rules. Unlike traditional data-driven models, our approach does not require training data, supports inference with incomplete information, and is robust to noisy or uncertain observations. We report performance for two missions involving 11 and 9 casualties, respectively, where our Bayesian network model substantially outperformed vision-only baselines during evaluation of our system in the DARPA Triage Challenge (DTC) field scenarios. The accuracy of physiological assessment improved from 15% to 42% in the first scenario and from 19% to 46% in the second, representing nearly threefold increase in performance. More importantly, overall triage accuracy increased from 14% to 53% in all patients, while the diagnostic coverage of the system expanded from 31% to 95% of the cases requiring assessment. These results demonstrate that expert-knowledge-guided probabilistic reasoning can significantly enhance automated triage systems, offering a promising approach to supporting emergency responders in MCIs. This approach enabled Team Chiron to achieve 4th place out of 11 teams during the 1st physical round of the DTC.
๐
2025-12-21
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Gรถkdeniz Gรผlmez
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We present Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach addresses the fundamental limitation of existing methods that compromise model quality while attempting to modify specific behavioral patterns. Through dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms, we achieve theoretically superior weight modification while minimizing quality degradation in unrelated domains. We validate our method through the gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.
๐
2025-12-21
๐ ArXiv Preprint (econ.TH)
๐ฅ Yucheng Yang, Chiyuan Wang, Andreas Schaab, Benjamin Moll
Machine Learning
ๆฅ็ๆ่ฆ
We present a new approach to formulating and solving heterogeneous agent models with aggregate risk. We replace the cross-sectional distribution with low-dimensional prices as state variables and let agents learn equilibrium price dynamics directly from simulated paths. To do so, we introduce a structural reinforcement learning (SRL) method which treats prices via simulation while exploiting agents' structural knowledge of their own individual dynamics. Our SRL method yields a general and highly efficient global solution method for heterogeneous agent models that sidesteps the Master equation and handles problems traditional methods struggle with, in particular nontrivial market-clearing conditions. We illustrate the approach in the Krusell-Smith model, the Huggett model with aggregate shocks, and a HANK model with a forward-looking Phillips curve, all of which we solve globally within minutes.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaidi Liang, Ke Li, Xianbiao Hu, Ruwen Qin
RoboticsNatural Language ProcessingMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Automating crash video analysis is essential to leverage the growing availability of driving video data for traffic safety research and accountability attribution in autonomous driving. Crash video analysis is a challenging multitask problem due to the complex spatiotemporal dynamics of crash events in video data and the diverse analytical requirements involved. It requires capabilities spanning crash recognition, temporal grounding, and high-level video understanding. Existing models, however, cannot perform all these tasks within a unified framework, and effective training strategies for such models remain underexplored. To fill these gaps, this paper proposes CrashChat, a multimodal large language model (MLLM) for multitask traffic crash analysis, built upon VideoLLaMA3. CrashChat acquires domain-specific knowledge through instruction fine-tuning and employs a novel multitask learning strategy based on task decoupling and grouping, which maximizes the benefit of joint learning within and across task groups while mitigating negative transfer. Numerical experiments on consolidated public datasets demonstrate that CrashChat consistently outperforms existing MLLMs across model scales and traditional vision-based methods, achieving state-of-the-art performance. It reaches near-perfect accuracy in crash recognition, a 176\% improvement in crash localization, and a 40\% improvement in the more challenging pre-crash localization. Compared to general MLLMs, it substantially enhances textual accuracy and content coverage in crash description and reasoning tasks, with 0.18-0.41 increases in BLEU scores and 0.18-0.42 increases in ROUGE scores. Beyond its strong performance, CrashChat is a convenient, end-to-end analytical tool ready for practical implementation. The dataset and implementation code for CrashChat are available at https://github.com/Liangkd/CrashChat.
๐
2025-12-21
๐ ArXiv Preprint (cs.CY)
๐ฅ Bruno Campello de Souza
ๆฅ็ๆ่ฆ
The rapid diffusion of generative artificial intelligence (GenAI) systems has introduced new forms of human-technology interaction, raising the question of whether sustained engagement gives rise to stable, internalized modes of cognition rather than merely transient efficiency gains. Grounded in the Cognitive Mediation Networks Theory, this study investigates Sophotechnic Mediation, a mode of thinking and acting associated with prolonged interaction with GenAI, and presents a comprehensive psychometric validation of the Sophotechnic Mediation Scale. Data were collected between 2023 and 2025 from independent cross-sectional samples totaling 3,932 adult workers from public and private organizations in the Metropolitan Region of Pernambuco, Brazil. Results indicate excellent internal consistency, a robust unidimensional structure, and measurement invariance across cohorts. Ordinal-robust confirmatory factor analyses and residual diagnostics show that elevated absolute fit indices reflect minor local dependencies rather than incorrect dimensionality. Distributional analyses reveal a time-evolving pattern characterized by a declining mass of non-adopters and convergence toward approximate Gaussianity among adopters, with model comparisons favoring a two-process hurdle model over a censored Gaussian specification. Sophotechnic Mediation is empirically distinct from Hypercultural mediation and is primarily driven by cumulative GenAI experience, with age moderating the rate of initial acquisition and the depth of later integration. Together, the findings support Sophotechnia as a coherent, measurable, and emergent mode of cognitive mediation associated with the ongoing GenAI revolution.
๐
2025-12-21
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zijun Gao, Zhikun Xu, Xiao Ye, Ben Zhou
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept right when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual applications. We introduce CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing LLMs can restate definitions but fail concept-linked quizzes, quantifying the conceptual reasoning gap. CORE then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning via trajectory replacement after group failures, a lightweight forward-KL constraint that aligns unguided with concept-primed policies, or standard GRPO directly on concept-aligned quizzes. Across several models, CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks. CORE unifies direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization. It provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic.
๐
2025-12-21
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Aditya Siddhant
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Behavioral healthcare risk assessment remains a challenging problem due to the highly multimodal nature of patient data and the temporal dynamics of mood and affective disorders. While large language models (LLMs) have demonstrated strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear. In this work, we introduce HARBOR, a behavioral health aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS), on an integer scale from -3 (severe depression) to +3 (mania). We also release PEARL, a longitudinal behavioral healthcare dataset spanning four years of monthly observations from three patients, containing physiological, behavioral, and self reported mental health signals. We benchmark traditional machine learning models, proprietary LLMs, and HARBOR across multiple evaluation settings and ablations. Our results show that HARBOR outperforms classical baselines and off the shelf LLMs, achieving 69 percent accuracy compared to 54 percent for logistic regression and 29 percent for the strongest proprietary LLM baseline.
๐
2025-12-21
๐ ArXiv Preprint (Machine Learning)
๐ฅ Souhail Abdelmouaiz Sadat, Mohamed Yacine Touahria Miliani, Khadidja Hab El Hames, Hamida Seba, Mohammed Haddad
ๆฅ็ๆ่ฆ
This survey reviews hyperbolic graph embedding models, and evaluate them on anomaly detection, highlighting their advantages over Euclidean methods in capturing complex structures. Evaluating models like \textit{HGCAE}, \textit{\(\mathcal{P}\)-VAE}, and \textit{HGCN} demonstrates high performance, with \textit{\(\mathcal{P}\)-VAE} achieving an F1-score of 94\% on the \textit{Elliptic} dataset and \textit{HGCAE} scoring 80\% on \textit{Cora}. In contrast, Euclidean methods like \textit{DOMINANT} and \textit{GraphSage} struggle with complex data. The study emphasizes the potential of hyperbolic spaces for improving anomaly detection, and provides an open-source library to foster further research in this field.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamad Zamini, Diksha Shukla
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost of processing dense visual tokens remains a major bottleneck. A critical component in this pipeline is the visual projector, which bridges the vision encoder and the language model. Standard designs often employ a simple multi-layer perceptron for direct token mapping, but this approach scales poorly with high-resolution inputs, introducing significant redundancy. We present Delta-LLaVA, a token-efficient projector that employs a low-rank DeltaProjection to align multi-level vision features into a compact subspace before further interaction. On top of this base alignment, lightweight Transformer blocks act as specialization layers, capturing both global and local structure under constrained token budgets. Extensive experiments and ablations demonstrate that this base-then-specialize design yields consistent gains across multiple benchmarks with only 144 tokens, highlighting the importance of token formation prior to scaling interaction capacity. With Delta-LLaVA, inference throughput improves by up to 55%, while end-to-end training accelerates by nearly 4-5x in pretraining and over 1.5x in finetuning, highlighting the dual benefits of our design in both efficiency and scalability.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dmitry Demidov, Zaigham Zaheer, Zongyan Han, Omkar Thawakar, Rao Anwer
Natural Language ProcessingMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem are limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Additionally, we show that carefully curated prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code is available on github.com/demidovd98/FiNDR.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Akshit Achara, Peter Triantafillou, Esther Puyol-Antรณn, Alexander Hammers, Andrew P. King
Machine Learning
ๆฅ็ๆ่ฆ
Deep neural networks often exploit shortcuts. These are spurious cues which are associated with output labels in the training data but are unrelated to task semantics. When the shortcut features are associated with sensitive attributes, shortcut learning can lead to biased model performance. Existing methods for localising and understanding shortcut learning are mostly based upon qualitative, image-level inspection and assume cues are human-visible, limiting their use in domains such as medical imaging. We introduce OSCAR (Ordinal Scoring Correlations for Attribution Representations), a model-agnostic framework for quantifying shortcut learning and localising shortcut features. OSCAR converts image-level task attribution maps into dataset-level rank profiles of image regions and compares them across three models: a balanced baseline model (BA), a test model (TS), and a sensitive attribute predictor (SA). By computing pairwise, partial, and deviation-based correlations on these rank profiles, we produce a set of quantitative metrics that characterise the degree of shortcut reliance for TS, together with a ranking of image-level regions that contribute most to it. Experiments on CelebA, CheXpert, and ADNI show that our correlations are (i) stable across seeds and partitions, (ii) sensitive to the level of association between shortcut features and output labels in the training data, and (iii) able to distinguish localised from diffuse shortcut features. As an illustration of the utility of our method, we show how worst-group performance disparities can be reduced using a simple test-time attenuation approach based on the identified shortcut regions. OSCAR provides a lightweight, pixel-space audit that yields statistical decision rules and spatial maps, enabling users to test, localise, and mitigate shortcut reliance. The code is available at https://github.com/acharaakshit/oscar
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaidi Liang, Ke Li, Xianbiao Hu, Ruwen Qin
RoboticsNatural Language ProcessingMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Automating crash video analysis is essential to leverage the growing availability of driving video data for traffic safety research and accountability attribution in autonomous driving. Crash video analysis is a challenging multitask problem due to the complex spatiotemporal dynamics of crash events in video data and the diverse analytical requirements involved. It requires capabilities spanning crash recognition, temporal grounding, and high-level video understanding. Existing models, however, cannot perform all these tasks within a unified framework, and effective training strategies for such models remain underexplored. To fill these gaps, this paper proposes CrashChat, a multimodal large language model (MLLM) for multitask traffic crash analysis, built upon VideoLLaMA3. CrashChat acquires domain-specific knowledge through instruction fine-tuning and employs a novel multitask learning strategy based on task decoupling and grouping, which maximizes the benefit of joint learning within and across task groups while mitigating negative transfer. Numerical experiments on consolidated public datasets demonstrate that CrashChat consistently outperforms existing MLLMs across model scales and traditional vision-based methods, achieving state-of-the-art performance. It reaches near-perfect accuracy in crash recognition, a 176\% improvement in crash localization, and a 40\% improvement in the more challenging pre-crash localization. Compared to general MLLMs, it substantially enhances textual accuracy and content coverage in crash description and reasoning tasks, with 0.18-0.41 increases in BLEU scores and 0.18-0.42 increases in ROUGE scores. Beyond its strong performance, CrashChat is a convenient, end-to-end analytical tool ready for practical implementation. The dataset and implementation code for CrashChat are available at https://github.com/Liangkd/CrashChat.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maksym Voloshchuk, Bohdana Zarembovska, Mykola Kozlenko
Machine Learning
ๆฅ็ๆ่ฆ
Handwritten text recognition and optical character recognition solutions show excellent results with processing data of modern era, but efficiency drops with Latin documents of medieval times. This paper presents a deep learning method to extract text information from handwritten Latin-language documents of the 9th to 11th centuries. The approach takes into account the properties inherent in medieval documents. The paper provides a brief introduction to the field of historical document transcription, a first-sight analysis of the raw data, and the related works and studies. The paper presents the steps of dataset development for further training of the models. The explanatory data analysis of the processed data is provided as well. The paper explains the pipeline of deep learning models to extract text information from the document images, from detecting objects to word recognition using classification models and embedding word images. The paper reports the following results: recall, precision, F1 score, intersection over union, confusion matrix, and mean string distance. The plots of the metrics are also included. The implementation is published on the GitHub repository.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alina Elena Baia, Andrea Cavallaro
ๆฅ็ๆ่ฆ
Concept-driven counterfactuals explain decisions of classifiers by altering the model predictions through semantic changes. In this paper, we present a novel approach that leverages cross-modal decompositionality and image-specific concepts to create counterfactual scenarios expressed in natural language. We apply the proposed interpretability framework, termed Decompose and Explain (DeX), to the challenging domain of image privacy decisions, which are contextual and subjective. This application enables the quantification of the differential contributions of key scene elements to the model prediction. We identify relevant decision factors via a multi-criterion selection mechanism that considers both image similarity for minimal perturbations and decision confidence to prioritize impactful changes. This approach evaluates and compares diverse explanations, and assesses the interdependency and mutual influence among explanatory properties. By leveraging image-specific concepts, DeX generates image-grounded, sparse explanations, yielding significant improvements over the state of the art. Importantly, DeX operates as a training-free framework, offering high flexibility. Results show that DeX not only uncovers the principal contributing factors influencing subjective decisions, but also identifies underlying dataset biases allowing for targeted mitigation strategies to improve fairness.
๐
2025-12-21
๐ IEEE Transactions on Visualization and Computer Gr
๐ฅ Sicheng Song, Yanjie Zhang, Zixin Chen, Huamin Qu, Changbo Wang et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The integrity of data visualizations is increasingly threatened by image editing techniques that enable subtle yet deceptive tampering. Through a formative study, we define this challenge and categorize tampering techniques into two primary types: data manipulation and visual encoding manipulation. To address this, we present VizDefender, a framework for tampering detection and analysis. The framework integrates two core components: 1) a semi-fragile watermark module that protects the visualization by embedding a location map to images, which allows for the precise localization of tampered regions while preserving visual quality, and 2) an intent analysis module that leverages Multimodal Large Language Models (MLLMs) to interpret manipulation, inferring the attacker's intent and misleading effects. Extensive evaluations and user studies demonstrate the effectiveness of our methods.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hasib Aslam, Muhammad Talal Faiz, Muhammad Imran Malik
Natural Language Processing
ๆฅ็ๆ่ฆ
Advances in neuroscience and artificial intelligence have enabled preliminary decoding of brain activity. However, despite the progress, the interpretability of neural representations remains limited. A significant challenge arises from the intrinsic properties of electroencephalography (EEG) signals, including high noise levels, spatial diffusion, and pronounced temporal variability. To interpret the neural mechanism underlying thoughts, we propose a transformers-based framework to extract spatial-temporal representations associated with observed visual stimuli from EEG recordings. These features are subsequently incorporated into the attention mechanisms of Latent Diffusion Models (LDMs) to facilitate the reconstruction of visual stimuli from brain activity. The quantitative evaluations on publicly available benchmark datasets demonstrate that the proposed method excels at modeling the semantic structures from EEG signals; achieving up to 6.5% increase in latent space clustering accuracy and 11.8% increase in zero shot generalization across unseen classes while having comparable Inception Score and Frรฉchet Inception Distance with existing baselines. Our work marks a significant step towards generalizable semantic interpretation of the EEG signals.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxiao Yang, Hualian Sheng, Sijia Cai, Jing Lin, Jiahao Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Syncronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shaomu Tan, Ryosuke Mitani, Ritvik Choudhary, Qiyu Wu, Toshiyuki Sekiya et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maksym Voloshchuk, Bohdana Zarembovska, Mykola Kozlenko
Machine Learning
ๆฅ็ๆ่ฆ
Handwritten text recognition and optical character recognition solutions show excellent results with processing data of modern era, but efficiency drops with Latin documents of medieval times. This paper presents a deep learning method to extract text information from handwritten Latin-language documents of the 9th to 11th centuries. The approach takes into account the properties inherent in medieval documents. The paper provides a brief introduction to the field of historical document transcription, a first-sight analysis of the raw data, and the related works and studies. The paper presents the steps of dataset development for further training of the models. The explanatory data analysis of the processed data is provided as well. The paper explains the pipeline of deep learning models to extract text information from the document images, from detecting objects to word recognition using classification models and embedding word images. The paper reports the following results: recall, precision, F1 score, intersection over union, confusion matrix, and mean string distance. The plots of the metrics are also included. The implementation is published on the GitHub repository.
๐
2025-12-21
๐ ArXiv Preprint (math.DS)
๐ฅ Matilde Marcolli, David Skigin
Machine Learning
ๆฅ็ๆ่ฆ
We study the dynamical properties of a Hopf algebra Markov chain with state space the binary rooted forests with labelled leaves. This Markovian dynamical system describes the core computational process of structure formation and transformation in syntax via the Merge operation, according to Chomsky's Minimalism model of generative linguistics. The dynamics decomposes into an ergodic dynamical system with uniform stationary distribution, given by the action of Internal Merge, while the contributions of External Merge and (a minimal form of) Sideward Merge reduce to a simpler Markov chain with state space the set of partitions and with combinatorial weights. The Sideward Merge part of the dynamics prevents convergence to fully formed connected structures (trees), unless the different forms of Merge are weighted by a cost function, as predicted by linguistic theory. Results on the asymptotic behavior of the Perron-Frobenius eigenvalue and eigenvector in this weighted case, obtained in terms of an associated Perron-Frobenius problem in the tropical semiring, show that the usual cost functions (Minimal Search and Resource Restrictions) proposed in the linguistic literature do not suffice to obtain convergence to the tree structures, while an additional optimization property based on the Shannon entropy achieves the expected result for the dynamics. We also comment on the introduction of continuous parameters related to semantic embedding and other computational models, and also on some filtering of the dynamics by coloring rules that model the linguistic filtering by theta roles and phase structure, and on parametric variation and the process of parameter setting in Externalization.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Antonio San Martin
Robotics
ๆฅ็ๆ่ฆ
The rapid diffusion of generative artificial intelligence is transforming terminology work. While this technology promises gains in efficiency, its unstructured adoption risks weakening professional autonomy, amplifying bias, and eroding linguistic and conceptual diversity. This paper argues that a human-centered approach to artificial intelligence has become a necessity for terminology work. Building on research in artificial intelligence and translation studies, it proposes a human-centered framework that conceptualizes artificial intelligence as a means of amplifying the terminologist's capabilities, rather than replacing them. The framework is organized around three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. Together, these dimensions emphasize the compatibility of high automation with strong human control, the central role of terminologists in bias mitigation, and the importance of designing AI tools and workflows around the needs, values, and well-being of the terminologist. The paper concludes by stressing that current choices in AI adoption will shape not only terminological practice, but also the preservation of accuracy, adequacy, and diversity in terminology and specialized knowledge.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tung Duong Ta, Tim Oates
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle with calculation verification when using established prompting techniques. We present MDToC (Metacognitive Dynamic Tree of Concepts), a three-phase approach that constructs a concept tree, develops accuracy-verified calculations for each concept, and employs majority voting to evaluate competing solutions. Evaluations across CHAMP, MATH, and Game-of-24 benchmarks demonstrate our MDToC's effectiveness, with GPT-4-Turbo achieving 58.1\% on CHAMP, 86.6\% on MATH, and 85\% on Game-of-24 - outperforming GoT by 5\%, 5.4\%, and 4\% on all these tasks, respectively, without hand-engineered hints. MDToC consistently surpasses existing prompting methods across all backbone models, yielding improvements of up to 7.6\% over ToT and 6.2\% over GoT, establishing metacognitive calculation verification as a promising direction for enhanced mathematical reasoning.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sultan Alrashed, Francesco Orabona
ๆฅ็ๆ่ฆ
We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web again, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, apply quality filtering designed specifically for Arabic text to re-filter some datasets, and perform cross-dataset deduplication, both MinHash and sentence-level. This approach reveals that nearly 60% of tokens across these independently collected corpora are duplicates, redundancy that any new scraping efforts will reproduce. Our work suggests that for lower resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, an approach that allowed us to curate the largest heavily filtered publicly available Arabic pretraining corpus.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang et al.
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating clear boundry on when world modeling effectively supports agent learning.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Thorsten Hellert, Nikolay Agladze, Alex Giovannone, Jan Jug, Frank Mayet et al.
Robotics
ๆฅ็ๆ่ฆ
Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language-model-driven interfaces. We formalize semantic channel finding-mapping natural-language intent to concrete control-system signals-as a general problem in complex experimental infrastructure, and introduce a four-paradigm framework to guide architecture selection across facility-specific data regimes. The paradigms span (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. We demonstrate each paradigm through proof-of-concept implementations at four operational facilities spanning two orders of magnitude in scale-from compact free-electron lasers to large synchrotron light sources-and diverse control-system architectures, from clean hierarchies to legacy environments. These implementations achieve 90-97% accuracy on expert-curated operational queries.
๐
2025-12-21
๐ ArXiv Preprint (cs.SE)
๐ฅ Recep Kaan Karaman, Meftun Akarsu
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The performance of automatic code documentation generation models depends critically on the quality of the training data used for supervision. However, most existing code documentation datasets are constructed through large scale scraping of public repositories with limited quality control. As a result, they often contain noisy documentation, extensive duplication, and increasing contamination from AI generated content. These issues weaken the supervision signal available to learning-based models and complicate evaluation.
We introduce \textbf{Code2Doc}, a quality-first curated dataset for function-level code documentation generation. Code2Doc consists of 13,358 high-quality function-documentation pairs extracted from widely used open-source repositories spanning five programming languages: Python, Java, TypeScript, JavaScript, and C++. The dataset is constructed using a four-stage curation pipeline that enforces documentation completeness and clarity, filters functions based on structural and complexity criteria, removes exact and near-duplicate code, and identifies documentation likely to be AI generated. Starting from 52,069 extracted candidates, only 25.6 percent satisfy all quality constraints.
We provide a detailed analysis of the resulting dataset, which achieves a mean documentation quality score of 6.93 out of 10. Overall, 86.9% of samples contain explicit type annotations, and only 2.9\% are flagged as potentially AI generated. Baseline experiments show that fine-tuning a large language model on Code2Doc yields relative improvements of 29.47% in BLEU and 24.04% in ROUGE-L over zero shot performance, despite the modest dataset size. We release both the dataset and the full curation pipeline to support reproducible research on automatic code documentation generation.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to $17.06\%$; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
The ability for AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3 .
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuchen Li, Handing Wang, Bing Xue, Mengjie Zhang, Yaochu Jin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In the high-cost simulation-driven design domain, translating ambiguous design requirements into a mathematical optimization formulation is a bottleneck for optimizing product performance. This process is time-consuming and heavily reliant on expert knowledge. While large language models (LLMs) offer potential for automating this task, existing approaches either suffer from poor formalization that fails to accurately align with the design intent or rely on solver feedback for data filtering, which is unavailable due to the high simulation costs. To address this challenge, we propose APF, a framework for solver-independent, automated problem formulation via LLMs designed to automatically convert engineers' natural language requirements into executable optimization models. The core of this framework is an innovative pipeline for automatically generating high-quality data, which overcomes the difficulty of constructing suitable fine-tuning datasets in the absence of high-cost solver feedback with the help of data generation and test instance annotation. The generated high-quality dataset is used to perform supervised fine-tuning on LLMs, significantly enhancing their ability to generate accurate and executable optimization problem formulations. Experimental results on antenna design demonstrate that APF significantly outperforms the existing methods in both the accuracy of requirement formalization and the quality of resulting radiation efficiency curves in meeting the design goals.
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maxime Kayser, Maksim Gridnev, Wanting Wang, Max Bain, Aneesh Rangnekar et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset $10\times$ larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. The brat foundation models are publicly released.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pierre Colombo, Malik Boudiaf, Allyn Sweet, Michael Desa, Hongxi Wang et al.
Robotics
ๆฅ็ๆ่ฆ
Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation-and more broadly as a foundation for applied legal intelligence.
๐
2025-12-21
๐ AAAI 2026
๐ฅ Jensen Zhang, Ningyuan Liu, Yijia Fan, Zihao Huang, Qinglin Zeng et al.
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. Existing approaches such as supervised fine-tuning and reinforcement learning from human feedback are data intensive and computationally expensive, while static parameter editing methods struggle with context dependent errors and catastrophic forgetting.
We propose LLM-CAS, a framework that formulates real-time hallucination correction as a hierarchical reinforcement learning problem. LLM-CAS trains an agent to learn a policy that dynamically selects temporary neuron perturbations during inference based on the current context. Unlike prior dynamic approaches that rely on heuristic or predefined adjustments, this policy driven mechanism enables adaptive and fine grained correction without permanent parameter modification.
Experiments across multiple language models demonstrate that LLM-CAS consistently improves factual accuracy, achieving gains of 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on the MC1 score of TruthfulQA. These results outperform both static editing methods such as ITI and CAA and the dynamic SADI framework. Overall, LLM-CAS provides an efficient and context aware solution for improving the reliability of LLMs, with promising potential for future multimodal extensions.
๐
2025-12-21
๐ ArXiv Preprint (cs.DB)
๐ฅ Thanh Dat Hoang, Thanh Trung Huynh, Matthias Weidlich, Thanh Tam Nguyen, Tong Chen et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced comprehension and generation capabilities. However, privacy and cost considerations prevent companies from using Text2SQL solutions based on external LLMs offered as a service. Rather, small LLMs (SLMs) that are openly available and can hosted in-house are adopted. These SLMs, in turn, lack the generalization capabilities of larger LLMs, which impairs their effectiveness for complex tasks such as Text2SQL. To address these limitations, we propose MATS, a novel Text2SQL framework designed specifically for SLMs. MATS uses a multi-agent mechanism that assigns specialized roles to auxiliary agents, reducing individual workloads and fostering interaction. A training scheme based on reinforcement learning aligns these agents using feedback obtained during execution, thereby maintaining competitive performance despite a limited LLM size. Evaluation results using on benchmark datasets show that MATS, deployed on a single- GPU server, yields accuracy that are on-par with large-scale LLMs when using significantly fewer parameters. Our source code and data are available at https://github.com/thanhdath/mats-sql.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Prabigya Acharya, Liza Shrestha
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Charles J. Lovering, Seth Ebner, Brandon Smock, Michael Krumdick, Saad Rabbani et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.
๐
2025-12-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amit Barman, Atanu Mandal, Sudip Kumar Naskar
Natural Language Processing
ๆฅ็ๆ่ฆ
In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English-Hindi translation using Transformer-based approaches. We experiment with 2 complementary strategies, fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.
๐
2025-12-21
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrew Polar, Michael Poluektov
ๆฅ็ๆ่ฆ
Training on disjoint datasets can serve two primary goals: accelerating data processing and enabling federated learning. It has already been established that Kolmogorov-Arnold networks (KANs) are particularly well suited for federated learning and can be merged through simple parameter averaging. While the federated learning literature has mostly focused on achieving training convergence across distributed nodes, the present paper specifically targets acceleration of the training, which depends critically on the choice of an optimisation method and the type of the basis functions. To the best knowledge of the authors, the fastest currently-available combination is the Newton-Kaczmarz method and the piecewise-linear basis functions. Here, it is shown that training on disjoint datasets (or disjoint subsets of the training dataset) can further improve the performance. Experimental comparisons are provided, and all corresponding codes are publicly available.
๐
2025-12-21
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Gรถkdeniz Gรผlmez
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We present Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach addresses the fundamental limitation of existing methods that compromise model quality while attempting to modify specific behavioral patterns. Through dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms, we achieve theoretically superior weight modification while minimizing quality degradation in unrelated domains. We validate our method through the gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.
๐
2025-12-21
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Sebastian Sun
ๆฅ็ๆ่ฆ
Recent advances in retrieval-augmented generation (RAG) have shown promise in enhancing recommendation systems with external knowledge. However, existing RAG-based recommenders face two critical challenges: (1) vulnerability to distribution shifts across different environments (e.g., time periods, user segments), leading to performance degradation in out-of-distribution (OOD) scenarios, and (2) lack of faithful explanations that can be verified against retrieved evidence. In this paper, we propose CIRR, a Causal-Invariant Retrieval-Augmented Recommendation framework that addresses both challenges simultaneously. CIRR learns environment-invariant user preference representations through causal inference, which guide a debiased retrieval process to select relevant evidence from multiple sources. Furthermore, we introduce consistency constraints that enforce faithfulness between retrieved evidence, generated explanations, and recommendation outputs. Extensive experiments on two real-world datasets demonstrate that CIRR achieves robust performance under distribution shifts, reducing performance degradation from 15.4% (baseline) to only 5.6% in OOD scenarios, while providing more faithful and interpretable explanations (26% improvement in faithfulness score) compared to state-of-the-art baselines.
๐
2025-12-21
๐ ArXiv Preprint (Robotics)
๐ฅ Tianyuan Liu, Richard Dazeley, Benjamin Champion, Akan Cosgun
ๆฅ็ๆ่ฆ
In this paper, we study whether inexpensive, physics-free supervision can reliably prioritize grasp-place candidates for budget-aware pick-and-place. From an object's initial pose, target pose, and a candidate grasp, we generate two path-aware geometric labels: path-wise inverse kinematics (IK) feasibility across a fixed approach-grasp-lift waypoint template, and a transit collision flag from mesh sweeps along the same template. A compact dual-output MLP learns these signals from pose encodings, and at test time its scores rank precomputed candidates for a rank-then-plan policy under the same IK gate and planner as the baseline. Although learned from cheap labels only, the scores transfer to physics-enabled executed trajectories: at a fixed planning budget the policy finds successful paths sooner with fewer planner calls while keeping final success on par or better. This work targets a single rigid cuboid with side-face grasps and a fixed waypoint template, and we outline extensions to varied objects and richer waypoint schemes.
๐
2025-12-21
๐ ArXiv Preprint (Robotics)
๐ฅ Georg Nawratil
Robotics
ๆฅ็ๆ่ฆ
In the 19th International Symposium on Advances in Robot Kinematics the author introduced a novel class of continuous flexible discrete surfaces and mentioned that these so-called P-hedra (or P-nets) allow direct access to their spatial shapes by three control polylines. In this follow-up paper we study this intuitive method, which makes these flexible planar quad surfaces suitable for transformable design tasks by means of interactive tools. The construction of P-hedra from the control polylines can also be used for an efficient algorithmic computation of their isometric deformations. In addition we discuss flexion limits, bifurcation configurations, developable/flat-foldable pattern and tubular P-hedra.
๐
2025-12-21
๐ ArXiv Preprint (Robotics)
๐ฅ Feeza Khan Khanzada, Jaerock Kwon
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Model-based reinforcement learning (MBRL) can reduce interaction cost for autonomous driving by learning a predictive world model, but it typically still depends on task-specific rewards that are difficult to design and often brittle under distribution shift. This paper presents InDRiVE, a DreamerV3-style MBRL agent that performs reward-free pretraining in CARLA using only intrinsic motivation derived from latent ensemble disagreement. Disagreement acts as a proxy for epistemic uncertainty and drives the agent toward under-explored driving situations, while an imagination-based actor-critic learns a planner-free exploration policy directly from the learned world model. After intrinsic pretraining, we evaluate zero-shot transfer by freezing all parameters and deploying the pretrained exploration policy in unseen towns and routes. We then study few-shot adaptation by training a task policy with limited extrinsic feedback for downstream objectives (lane following and collision avoidance). Experiments in CARLA across towns, routes, and traffic densities show that disagreement-based pretraining yields stronger zero-shot robustness and robust few-shot collision avoidance under town shift and matched interaction budgets, supporting the use of intrinsic disagreement as a practical reward-free pretraining signal for reusable driving world models.
๐
2025-12-21
๐ ArXiv Preprint (Robotics)
๐ฅ Jingjia Teng, Yang Li, Jianqiang Wang, Yingbai Hu, Songyuan Tang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Four-wheel Independent Steering (4WIS) vehicles have attracted increasing attention for their superior maneuverability. Human drivers typically choose to cross or drive over the low-profile obstacles (e.g., plastic bags) to efficiently navigate through narrow spaces, while existing planners neglect obstacle attributes, causing inefficiency or path-finding failures. To address this, we propose a trajectory planning framework integrating the 4WIS hybrid A* and Optimal Control Problem (OCP), in which the hybrid A* provides an initial path to enhance the OCP solution. Specifically, a multimodal classification network is introduced to assess scene complexity (hard/easy task) by fusing image and vehicle state data. For hard tasks, guided points are set to decompose complex tasks into local subtasks, improving the search efficiency of 4WIS hybrid A*. The multiple steering modes of 4WIS vehicles (Ackermann, diagonal, and zero-turn) are also incorporated into node expansion and heuristic designs. Moreover, a hierarchical obstacle handling strategy is designed to guide the node expansion considering obstacle attributes, i.e., 'non-traversable', 'crossable', and 'drive-over' obstacles. It allows crossing or driving over obstacles instead of the 'avoid-only' strategy, greatly enhancing success rates of pathfinding. We also design a logical constraint for the 'drive-over' obstacle by limiting its velocity to ensure safety. Furthermore, to address dynamic obstacles with motion uncertainty, we introduce a probabilistic risk field model, constructing risk-aware driving corridors that serve as linear collision constraints in OCP. Experimental results demonstrate the proposed framework's effectiveness in generating safe, efficient, and smooth trajectories for 4WIS vehicles, especially in constrained environments.
๐
2025-12-21
๐ ArXiv Preprint (Robotics)
๐ฅ Maozeng Zhang, Ke Shi, Huijun Li, Tongshu Chen, Jiejun Yan et al.
Robotics
ๆฅ็ๆ่ฆ
Stroke-induced motor impairment often results in substantial loss of upper-limb function, creating a strong demand for rehabilitation robots that enable safe and transparent physical human-robot interaction (pHRI). Variable stiffness actuators are well suited for such applications. However, in most existing designs, stiffness is coupled with the deflection angle, complicating both modeling and control. To address this limitation, this paper presents a variable stiffness actuator featuring decoupled stiffness and output behavior for rehabilitation robotics. The system integrates a variable stiffness mechanism that combines a variable-length lever with a hypocycloidal straight-line mechanism to achieve a linear torque-deflection relationship and continuous stiffness modulation from near zero to theoretically infinite. It also incorporates a differential transmission mechanism based on a planetary gear system that enables dual-motor load sharing. A cascade PI controller is further developed on the basis of the differential configuration, in which the position-loop term jointly regulates stiffness and deflection angle, effectively suppressing stiffness fluctuations and output disturbances. The performance of prototype was experimentally validated through stiffness calibration, stiffness regulation, torque control, decoupled characteristics, and dual-motor load sharing, indicating the potential for rehabilitation exoskeletons and other pHRI systems.
๐
2025-12-21
๐ ArXiv Preprint (Robotics)
๐ฅ Cailin Lei, Haiyang Wu, Yuxiong Ji, Xiaoyu Cai, Yuchuan Du
Robotics
ๆฅ็ๆ่ฆ
Enhancing the performance of trajectory planners for lane - changing vehicles is one of the key challenges in autonomous driving within human - machine mixed traffic. Most existing studies have not incorporated human drivers' prior knowledge when designing trajectory planning models. To address this issue, this study proposes a novel trajectory planning framework that integrates causal prior knowledge into the control process. Both longitudinal and lateral microscopic behaviors of vehicles are modeled to quantify interaction risk, and a staged causal graph is constructed to capture causal dependencies in lane-changing scenarios. Causal effects between the lane-changing vehicle and surrounding vehicles are then estimated using causal inference, including average causal effects (ATE) and conditional average treatment effects (CATE). These causal priors are embedded into a model predictive control (MPC) framework to enhance trajectory planning. The proposed approach is validated on naturalistic vehicle trajectory datasets. Experimental results show that: (1) causal inference provides interpretable and stable quantification of vehicle interactions; (2) individual causal effects reveal driver heterogeneity; and (3) compared with the baseline MPC, the proposed method achieves a closer alignment with human driving behaviors, reducing maximum trajectory deviation from 1.2 m to 0.2 m, lateral velocity fluctuation by 60%, and yaw angle variability by 50%. These findings provide methodological support for human-like trajectory planning and practical value for improving safety, stability, and realism in autonomous vehicle testing and traffic simulation platforms.
๐
2025-12-21
๐ ArXiv Preprint (Robotics)
๐ฅ Chihiro Noguchi, Takaki Yamamoto
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
End-to-end (E2E) autonomous driving models that take only camera images as input and directly predict a future trajectory are appealing for their computational efficiency and potential for improved generalization via unified optimization; however, persistent failure modes remain due to reliance on imitation learning (IL). While online reinforcement learning (RL) could mitigate IL-induced issues, the computational burden of neural rendering-based simulation and large E2E networks renders iterative reward and hyperparameter tuning costly. We introduce a camera-only E2E offline RL framework that performs no additional exploration and trains solely on a fixed simulator dataset. Offline RL offers strong data efficiency and rapid experimental iteration, yet is susceptible to instability from overestimation on out-of-distribution (OOD) actions. To address this, we construct pseudo ground-truth trajectories from expert driving logs and use them as a behavior regularization signal, suppressing imitation of unsafe or suboptimal behavior while stabilizing value learning. Training and closed-loop evaluation are conducted in a neural rendering environment learned from the public nuScenes dataset. Empirically, the proposed method achieves substantial improvements in collision rate and route completion compared with IL baselines. Our code will be available at [URL].
๐
2025-12-21
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kai Kohyama, Yoshimitsu Aoki, Guillermo Gallego, Shintaro Shiba
Computer Vision
ๆฅ็ๆ่ฆ
Event cameras offer a high temporal resolution over traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves state-of-the-art performance on the real-world datasets and competitive performance on the synthetic dataset. Also, the proposed method works without prior information (e.g., pretrained image reconstruction models) or COLMAP-based initialization, is more flexible in the event selection number, and achieves sharp reconstruction on scene edges with fast training time. We hope that this work deepens our understanding of the sparse nature of events for 3D reconstruction. The code will be released.
๐
2025-12-21
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhenhao Zhou, Dan Negrut
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles via a spatial-temporal transformer trained with MaskGIT-style masked prediction. Contact is encoded as depth-weighted Gaussian splat images that render 3D forces into a camera-aligned format suitable for vision backbones. At inference, predicted rollouts are evaluated by a vision-language model that reasons about collision likelihood, enabling rejection sampling of unsafe actions before execution. We train and evaluate on DreamerBench, a simulation dataset generated with Project Chrono that provides synchronized RGB, contact splat, proprioception, and physics annotations across rigid and deformable object scenarios. Qualitative results demonstrate that the model preserves spatial coherence during non-contact motion and generates plausible contact predictions, while the LLM-based judge distinguishes collision from non-collision trajectories.
๐
2025-12-21
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pengcheng Li, Qiang Fang, Tong Zhao, Yixing Lan, Xin Xu
Machine Learning
ๆฅ็ๆ่ฆ
Adversarial Imitation Learning (AIL) is a dominant framework in imitation learning that infers rewards from expert demonstrations to guide policy optimization. Although providing more expert demonstrations typically leads to improved performance and greater stability, collecting such demonstrations can be challenging in certain scenarios. Inspired by the success of diffusion models in data generation, we propose SD2AIL, which utilizes synthetic demonstrations via diffusion models. We first employ a diffusion model in the discriminator to generate synthetic demonstrations as pseudo-expert data that augment the expert demonstrations. To selectively replay the most valuable demonstrations from the large pool of (pseudo-) expert demonstrations, we further introduce a prioritized expert demonstration replay strategy (PEDR). The experimental results on simulation tasks demonstrate the effectiveness and robustness of our method. In particular, in the Hopper task, our method achieves an average return of 3441, surpassing the state-of-the-art method by 89. Our code will be available at https://github.com/positron-lpc/SD2AIL.
๐
2025-12-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mahesh Keswani, Raunak Bhattacharyya
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Safe reinforcement learning (SafeRL) is a prominent paradigm for autonomous driving, where agents are required to optimize performance under strict safety requirements. This dual objective creates a fundamental tension, as overly conservative policies limit driving efficiency while aggressive exploration risks safety violations. The Safety Representations for Safer Policy Learning (SRPL) framework addresses this challenge by equipping agents with a predictive model of future constraint violations and has shown promise in controlled environments. This paper investigates whether SRPL extends to real-world autonomous driving scenarios. Systematic experiments on the Waymo Open Motion Dataset (WOMD) and NuPlan demonstrate that SRPL can improve the reward-safety tradeoff, achieving statistically significant improvements in success rate (effect sizes r = 0.65-0.86) and cost reduction (effect sizes r = 0.70-0.83), with p < 0.05 for observed improvements. However, its effectiveness depends on the underlying policy optimizer and the dataset distribution. The results further show that predictive safety representations play a critical role in improving robustness to observation noise. Additionally, in zero-shot cross-dataset evaluation, SRPL-augmented agents demonstrate improved generalization compared to non-SRPL methods. These findings collectively demonstrate the potential of predictive safety representations to strengthen SafeRL for autonomous driving.
๐
2025-12-19
๐ This work has been accepted by IEEE RA-L. Manuscri
๐ฅ Deqing Liu, Yinfeng Gao, Deheng Qian, Qichao Zhang, Xiaoqing Ye et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Existing end-to-end autonomous driving methods typically rely on imitation learning (IL) but face a key challenge: the misalignment between open-loop training and closed-loop deployment. This misalignment often triggers driver-initiated takeovers and system disengagements during closed-loop execution. How to leverage those expert takeover data from disengagement scenarios and effectively expand the IL policy's capability presents a valuable yet unexplored challenge. In this paper, we propose TakeAD, a novel preference-based post-optimization framework that fine-tunes the pre-trained IL policy with this disengagement data to enhance the closed-loop driving performance. First, we design an efficient expert takeover data collection pipeline inspired by human takeover mechanisms in real-world autonomous driving systems. Then, this post optimization framework integrates iterative Dataset Aggregation (DAgger) for imitation learning with Direct Preference Optimization (DPO) for preference alignment. The DAgger stage equips the policy with fundamental capabilities to handle disengagement states through direct imitation of expert interventions. Subsequently, the DPO stage refines the policy's behavior to better align with expert preferences in disengagement scenarios. Through multiple iterations, the policy progressively learns recovery strategies for disengagement states, thereby mitigating the open-loop gap. Experiments on the closed-loop Bench2Drive benchmark demonstrate our method's effectiveness compared with pure IL methods, with comprehensive ablations confirming the contribution of each component.
๐
2025-12-19
๐ ArXiv Preprint (Robotics)
๐ฅ Chunyang Meng, Eduardo B. Sandoval, Ricardo Sosa, Francisco Cruz
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
As the global population ages, many seniors face the problem of loneliness. Companion robots offer a potential solution. However, current companion robots often lack advanced functionality, while task-oriented robots are not designed for social interaction, limiting their suitability and acceptance by seniors. Our work introduces a senior-oriented system for quadruped robots that allows for more intuitive user input and provides more socially expressive output. For user input, we implemented a MediaPipe-based module for hand gesture and head movement recognition, enabling control without a remote. For output, we designed and trained robotic dog gestures using curriculum-based reinforcement learning in Isaac Gym, progressing from simple standing to three-legged balancing and leg extensions, and more. The final tests achieved over 95\% success on average in simulation, and we validated a key social gesture (the paw-lift) on a Unitree robot. Real-world tests demonstrated the feasibility and social expressiveness of this framework, while also revealing sim-to-real challenges in joint compliance, load distribution, and balance control. These contributions advance the development of practical quadruped robots as social companions for the senior and outline pathways for sim-to-real adaptation and inform future user studies.
๐
2025-12-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Shikshya Shiwakoti, Samuel Goldsmith, Ujjwal Pandit
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In today's information-driven world, access to scientific publications has become increasingly easy. At the same time, filtering through the massive volume of available research has become more challenging than ever. Graph Neural Networks (GNNs) and graph attention mechanisms have shown strong effectiveness in searching large-scale information databases, particularly when combined with modern large language models. In this paper, we propose an Attention-Based Subgraph Retriever, a GNN-as-retriever model that applies attention-based pruning to extract a refined subgraph, which is then passed to a large language model for advanced knowledge reasoning.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce \emph{EasyV2V}, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, Sergey Levine
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) -- which trains a policy to directly match the actions played by the demonstrator -- can fail to ensure coverage over the demonstrator's actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator's behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator's actions, enabling more effective finetuning. Furthermore, this policy -- which we refer to as the posterior behavioral cloning (PostBC) policy -- achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show that PostBC is practically implementable with modern generative models in robotic control domains -- relying only on standard supervised learning -- and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen et al.
Multimodal
ๆฅ็ๆ่ฆ
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rahul Bhargava, Malene Hornstrup Jespersen, Emily Boardman Ndulue, Vivica Dsouza
Natural Language Processing
ๆฅ็ๆ่ฆ
AI technologies have rapidly moved into business and research applications that involve large text corpora, including computational journalism research and newsroom settings. These models, trained on extant data from various sources, can be conceptualized as historical artifacts that encode decades-old attitudes and stereotypes. This paper investigates one such example trained on the broadly-used New York Times Annotated Corpus to create a multi-label classifier. Our use in research settings surfaced the concerning "blacks" thematic topic label. Through quantitative and qualitative means we investigate this label's use in the training corpus, what concepts it might be encoding in the trained classifier, and how those concepts impact our model use. Via the application of explainable AI methods, we find that the "blacks" label operates partially as a general "racism detector" across some minoritized groups. However, it performs poorly against expectations on modern examples such as COVID-19 era anti-Asian hate stories, and reporting on the Black Lives Matter movement. This case study of interrogating embedded biases in a model reveals how similar applications in newsroom settings can lead to unexpected outputs that could impact a wide variety of potential uses of any large language model-story discovery, audience targeting, summarization, etc. The fundamental tension this exposes for newsrooms is how to adopt AI-enabled workflow tools while reducing the risk of reproducing historical biases in news coverage.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Astrid Brull, Sara Aguti, Vรฉronique Bolduc, Ying Hu, Daniel M. Jimenez-Gutierrez et al.
Machine Learning
ๆฅ็ๆ่ฆ
The application of Machine Learning (ML) to the diagnosis of rare diseases, such as collagen VI-related dystrophies (COL6-RD), is fundamentally limited by the scarcity and fragmentation of available data. Attempts to expand sampling across hospitals, institutions, or countries with differing regulations face severe privacy, regulatory, and logistical obstacles that are often difficult to overcome. The Federated Learning (FL) provides a promising solution by enabling collaborative model training across decentralized datasets while keeping patient data local and private. Here, we report a novel global FL initiative using the Sherpa.ai FL platform, which leverages FL across distributed datasets in two international organizations for the diagnosis of COL6-RD, using collagen VI immunofluorescence microscopy images from patient-derived fibroblast cultures. Our solution resulted in an ML model capable of classifying collagen VI patient images into the three primary pathogenic mechanism groups associated with COL6-RD: exon skipping, glycine substitution, and pseudoexon insertion. This new approach achieved an F1-score of 0.82, outperforming single-organization models (0.57-0.75). These results demonstrate that FL substantially improves diagnostic utility and generalizability compared to isolated institutional models. Beyond enabling more accurate diagnosis, we anticipate that this approach will support the interpretation of variants of uncertain significance and guide the prioritization of sequencing strategies to identify novel pathogenic variants.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tomรกลก Souฤek, Pierre Fernandez, Hady Elsahar, Sylvestre-Alvise Rebuffi, Valeriu Lacatusu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Otman A. Basir
Robotics
ๆฅ็ๆ่ฆ
Artificial intelligence systems are increasingly deployed in domains that shape human behaviour, institutional decision-making, and societal outcomes. Existing responsible AI and governance efforts provide important normative principles but often lack enforceable engineering mechanisms that operate throughout the system lifecycle. This paper introduces the Social Responsibility Stack (SRS), a six-layer architectural framework that embeds societal values into AI systems as explicit constraints, safeguards, behavioural interfaces, auditing mechanisms, and governance processes. SRS models responsibility as a closed-loop supervisory control problem over socio-technical systems, integrating design-time safeguards with runtime monitoring and institutional oversight. We develop a unified constraint-based formulation, introduce safety-envelope and feedback interpretations, and show how fairness, autonomy, cognitive burden, and explanation quality can be continuously monitored and enforced. Case studies in clinical decision support, cooperative autonomous vehicles, and public-sector systems illustrate how SRS translates normative objectives into actionable engineering and operational controls. The framework bridges ethics, control theory, and AI governance, providing a practical foundation for accountable, adaptive, and auditable socio-technical AI systems.
๐
2025-12-18
๐ The Manuscript is submitted for review under IEEE
๐ฅ Hesham G. Moussa, Aroosa Hameed, Arashmid Akhavain
Machine Learning
ๆฅ็ๆ่ฆ
To cope with real-world dynamics, an intelligent system needs to incrementally acquire, update, and exploit knowledge throughout its lifetime. This ability, known as Continual learning, provides a foundation for AI systems to develop themselves adaptively. Catastrophic forgetting is a major challenge to the progress of Continual Learning approaches, where learning a new task usually results in a dramatic performance drop on previously learned ones. Many approaches have emerged to counteract the impact of CF. Most of the proposed approaches can be categorized into five classes: replay-based, regularization-based, optimization-based, representation-based, and architecture-based. In this work, we approach the problem from a different angle, specifically by considering the optimal sequencing of tasks as they are presented to the model. We investigate the role of task sequencing in mitigating CF and propose a method for determining the optimal task order. The proposed method leverages zero-shot scoring algorithms inspired by neural architecture search (NAS). Results demonstrate that intelligent task sequencing can substantially reduce CF. Moreover, when combined with traditional continual learning strategies, sequencing offers enhanced performance and robustness against forgetting. Additionally, the presented approaches can find applications in other fields, such as curriculum learning.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiabin Xue
Machine Learning
ๆฅ็ๆ่ฆ
Edge machine learning (Edge ML) enables training ML models using the vast data distributed across network edges. However, many existing approaches assume static models trained centrally and then deployed, making them ineffective against unseen data. To address this, Online Edge ML allows models to be trained directly on edge devices and updated continuously with new data. This paper explores a key challenge of Online Edge ML: "How to determine labels for truly future, unseen data points". We propose Knowledge Transformation (KT), a hybrid method combining Knowledge Distillation, Active Learning, and causal reasoning. In short, KT acts as the oracle in active learning by transforming knowledge from a teacher model to generate pseudo-labels for training a student model. To verify the validity of the method, we conducted simulation experiments with two setups: (1) using a less stable teacher model and (2) a relatively more stable teacher model. Results indicate that when a stable teacher model is given, the student model can eventually reach its expected maximum performance. KT is potentially beneficial for scenarios that meet the following circumstances: (1) when the teacher's task is generic, which means existing pre-trained models might be adequate for its task, so there will be no need to train the teacher model from scratch; and/or (2) when the label for the student's task is difficult or expensive to acquire.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Zihan Zhou, Animesh Garg, Ajay Mandlekar, Caelan Garrett
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Long-horizon manipulation has been a long-standing challenge in the robotics community. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches 80% success rate on all tasks with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contributes to an 89% average performance increase. More results and videos available in https://reinforcegen.github.io/
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Nenad Tomaลกev, Matija Franklin, Julian Jacobs, Sรฉbastien Krier, Simon Osindero
ๆฅ็ๆ่ฆ
AI safety and alignment research has predominantly been focused on methods for safeguarding individual AI systems, resting on the assumption of an eventual emergence of a monolithic Artificial General Intelligence (AGI). The alternative AGI emergence hypothesis, where general capability levels are first manifested through coordination in groups of sub-AGI individual agents with complementary skills and affordances, has received far less attention. Here we argue that this patchwork AGI hypothesis needs to be given serious consideration, and should inform the development of corresponding safeguards and mitigations. The rapid deployment of advanced AI agents with tool-use capabilities and the ability to communicate and coordinate makes this an urgent safety consideration. We therefore propose a framework for distributional AGI safety that moves beyond evaluating and aligning individual agents. This framework centers on the design and implementation of virtual agentic sandbox economies (impermeable or semi-permeable), where agent-to-agent transactions are governed by robust market mechanisms, coupled with appropriate auditability, reputation management, and oversight to mitigate collective risks.
๐
2025-12-18
๐ Published in the IEEE ICCAD 2025 conference
๐ฅ Khurram Khalil, Khaza Anuarul Hoque
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) deliver exceptional performance across natural language tasks but demand substantial computational resources, limiting their deployment on resource-constrained edge devices. Existing compression techniques, such as quantization and pruning, often degrade critical linguistic properties and lack formal guarantees for preserving model behavior. We propose Temporal Logic-Guided Large Language Model Compression (TOGGLE), a novel framework that leverages Signal Temporal Logic (STL) to formally specify and enforce linguistic properties during compression. TOGGLE employs an STL robustness-guided Bayesian optimization to systematically explore layer-wise quantization and pruning configurations, generating compressed models that formally satisfy specified linguistic constraints without retraining or fine-tuning. Evaluating TOGGLE on four LLM architectures (GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, and Mistral 7B), we achieve up to 3.3x reduction in computational costs (FLOPs) and up to a 68.8% reduction in model size while satisfying all linguistic properties. TOGGLE represents the first integration of formal methods into LLM compression, enabling efficient, verifiable deployment of LLMs on edge hardware.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu et al.
ๆฅ็ๆ่ฆ
Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is more well-aligned with human judgment and argue is less likely to drift from human-alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.
๐
2025-12-18
๐ Published in the IEEE ISMAR 2025 conference
๐ฅ Ripan Kumar Kundu, Istiak Ahmed, Khaza Anuarul Hoque
Natural Language Processing
ๆฅ็ๆ่ฆ
The convergence of artificial AI and XR technologies (AI XR) promises innovative applications across many domains. However, the sensitive nature of data (e.g., eye-tracking) used in these systems raises significant privacy concerns, as adversaries can exploit these data and models to infer and leak personal information through membership inference attacks (MIA) and re-identification (RDA) with a high success rate. Researchers have proposed various techniques to mitigate such privacy attacks, including differential privacy (DP). However, AI XR datasets often contain numerous features, and applying DP uniformly can introduce unnecessary noise to less relevant features, degrade model accuracy, and increase inference time, limiting real-time XR deployment. Motivated by this, we propose a novel framework combining explainable AI (XAI) and DP-enabled privacy-preserving mechanisms to defend against privacy attacks. Specifically, we leverage post-hoc explanations to identify the most influential features in AI XR models and selectively apply DP to those features during inference. We evaluate our XAI-guided DP approach on three state-of-the-art AI XR models and three datasets: cybersickness, emotion, and activity classification. Our results show that the proposed method reduces MIA and RDA success rates by up to 43% and 39%, respectively, for cybersickness tasks while preserving model utility with up to 97% accuracy using Transformer models. Furthermore, it improves inference time by up to ~2x compared to traditional DP approaches. To demonstrate practicality, we deploy the XAI-guided DP AI XR models on an HTC VIVE Pro headset and develop a user interface (UI), namely PrivateXR, allowing users to adjust privacy levels (e.g., low, medium, high) while receiving real-time task predictions, protecting user privacy during XR gameplay.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from the environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term rewards optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt their policy from task feedback signal without gradient update. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
๐
2025-12-18
๐ Accepted and presented at 13th IEEE International
๐ฅ Harsh Vardhan Bansal
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-timeand large-scale deployment. While existing caching mechanisms,such as token-level key-value caches, offer speedups in autore-gressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic,operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching seman-tically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1 X speedup in inference time with <0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
๐
2025-12-18
๐ 6 pages, 5 figures. Accepted and published in the
๐ฅ Arslan Amin, Rafia Mumtaz, Muhammad Jawad Bashir, Syed Mohammad Hassan Zaidi
ๆฅ็ๆ่ฆ
In the evolving landscape of traffic management and vehicle surveillance, efficient license plate detection and recognition are indispensable. Historically, many methodologies have tackled this challenge, but consistent real-time accuracy, especially in diverse environments, remains elusive. This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and Character Recognition tasks, crucial for advancing Intelligent Transportation Systems. Two distinct datasets were employed for training and evaluation, yielding notable findings. The YOLOv8 Nano variant demonstrated a precision of 0.964 and mAP50 of 0.918 on the LPR task, while the YOLOv8 Small variant exhibited a precision of 0.92 and mAP50 of 0.91 on the Character Recognition task. A custom method for character sequencing was introduced, effectively sequencing the detected characters based on their x-axis positions. An optimized pipeline, utilizing YOLOv8 Nano for LPR and YOLOv8 Small for Character Recognition, is proposed. This configuration not only maintains computational efficiency but also ensures high accuracy, establishing a robust foundation for future real-world deployments on edge devices within Intelligent Transportation Systems. This effort marks a significant stride towards the development of smarter and more efficient urban infrastructures.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ William English, Dominic Simon, Sumit Kumar Jha, Rickard Ewetz
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into a lifting of atomic propositions (APs) phase and a translation phase. However, existing methods struggle with accurate lifting, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the lifting and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
๐
2025-12-18
๐ ArXiv Preprint (cs.NI)
๐ฅ Bahman Abolhassani, Tugba Erpek, Kemal Davaslioglu, Yalin E. Sagduyu, Sastry Kompella
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reactive jammers pose a severe security threat to robotic-swarm networks by selectively disrupting inter-agent communications and undermining formation integrity and mission success. Conventional countermeasures such as fixed power control or static channel hopping are largely ineffective against such adaptive adversaries. This paper presents a multi-agent reinforcement learning (MARL) framework based on the QMIX algorithm to improve the resilience of swarm communications under reactive jamming. We consider a network of multiple transmitter-receiver pairs sharing channels while a reactive jammer with Markovian threshold dynamics senses aggregate power and reacts accordingly. Each agent jointly selects transmit frequency (channel) and power, and QMIX learns a centralized but factorizable action-value function that enables coordinated yet decentralized execution. We benchmark QMIX against a genie-aided optimal policy in a no-channel-reuse setting, and against local Upper Confidence Bound (UCB) and a stateless reactive policy in a more general fading regime with channel reuse enabled. Simulation results show that QMIX rapidly converges to cooperative policies that nearly match the genie-aided bound, while achieving higher throughput and lower jamming incidence than the baselines, thereby demonstrating MARL's effectiveness for securing autonomous swarms in contested environments.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shubham Mishra, Samyek Jain, Gorang Mehrishi, Shiv Tiwari, Harsh Sharma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work address these issues independently but lack unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages : (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline is introduced which evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
๐
2025-12-18
๐ ArXiv Preprint (cs.DC)
๐ฅ Endar Suprih Wihidayat, Sieteng Soh, Kwan-Wu Chin, Duc-son Pham
ๆฅ็ๆ่ฆ
In this paper, the Multi-stage Edge Server Upgrade (M-ESU) is proposed as a new network planning problem, involving the upgrading of an existing multi-access edge computing (MEC) system through multiple stages (e.g., over several years). More precisely, the problem considers two key decisions: (i) whether to deploy additional edge servers or upgrade those already installed, and (ii) how tasks should be offloaded so that the average number of tasks that meet their delay requirement is maximized. The framework specifically involves: (i) deployment of new servers combined with capacity upgrades for existing servers, and (ii) the optimal task offloading to maximize the average number of tasks with a delay requirement. It also considers the following constraints: (i) budget per stage, (ii) server deployment and upgrade cost (in $) and cost depreciation rate, (iii) computation resource of servers, (iv) number of tasks and their growth rate (in %), and (v) the increase in task sizes and stricter delay requirements over time. We present two solutions: a Mixed Integer Linear Programming (MILP) model and an efficient heuristic algorithm (M-ESU/H). MILP yields the optimal solution for small networks, whereas M-ESU/H is used in large-scale networks. For small networks, the simulation results show that the solution computed by M-ESU/H is within 1.25% of the optimal solution while running several orders of magnitude faster. For large networks, M-ESU/H is compared against three alternative heuristic solutions that consider only server deployment, or giving priority to server deployment or upgrade. Our experiments show that M-ESU/H yields up to 21.57% improvement in task satisfaction under identical budget and demand growth conditions, confirming its scalability and practical value for long-term MEC systems.
๐
2025-12-18
๐ AAAI 2026
๐ฅ Shuting Zhao, Zeyu Xiao, Xinrong Chen
ๆฅ็ๆ่ฆ
Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: https://kaka-1314.github.io/KineST/
๐
2025-12-18
๐ ArXiv Preprint (cs.LO)
๐ฅ Nils Kรผchenmeister, Alex Ivliev, Markus Krรถtzsch
ๆฅ็ๆ่ฆ
We present a new use of Answer Set Programming (ASP) to discover the molecular structure of chemical samples based on the relative abundance of elements and structural fragments, as measured in mass spectrometry. To constrain the exponential search space for this combinatorial problem, we develop canonical representations of molecular structures and an ASP implementation that uses these definitions. We evaluate the correctness of our implementation over a large set of known molecular structures, and we compare its quality and performance to other ASP symmetry-breaking methods and to a commercial tool from analytical chemistry. Under consideration in Theory and Practice of Logic Programming (TPLP).
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ William English, Chase Walker, Dominic Simon, Rickard Ewetz
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Natural language (NL) to temporal logic (TL) translation enables engineers to specify, verify, and enforce system behaviors without manually crafting formal specifications-an essential capability for building trustworthy autonomous systems. While existing NL-to-TL translation frameworks have demonstrated encouraging initial results, these systems either explicitly assume access to accurate atom grounding or suffer from low grounded translation accuracy. In this paper, we propose a framework for Grounding Natural Language Into System Signatures for Temporal Logic translation called GinSign. The framework introduces a grounding model that learns the abstract task of mapping NL spans onto a given system signature: given a lifted NL specification and a system signature $\mathcal{S}$, the classifier must assign each lifted atomic proposition to an element of the set of signature-defined atoms $\mathcal{P}$. We decompose the grounding task hierarchically -- first predicting predicate labels, then selecting the appropriately typed constant arguments. Decomposing this task from a free-form generation problem into a structured classification problem permits the use of smaller masked language models and eliminates the reliance on expensive LLMs. Experiments across multiple domains show that frameworks which omit grounding tend to produce syntactically correct lifted LTL that is semantically nonequivalent to grounded target expressions, whereas our framework supports downstream model checking and achieves grounded logical-equivalence scores of $95.5\%$, a $1.4\times$ improvement over SOTA.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Siqi Wang, Chao Liang, Yunfan Gao, Erxin Yu, Sen Li et al.
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of exploratory strategies-Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with robust spatial intelligence required for tackling "last-mile" navigation challenges.
๐
2025-12-18
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Claudia Vale Oliveira, Nelson Zagalo, Filipe Silva, Anabela Brandao, Syeda Faryal Hussain Khurrum et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used as epistemic partners in everyday reasoning, yet their errors remain predominantly analyzed through predictive metrics rather than through their interpretive effects on human judgment. This study examines how different forms of epistemic failure emerge, are masked, and are tolerated in human AI interaction, where failure is understood as a relational breakdown shaped by model-generated plausibility and human interpretive judgment. We conducted a three round, multi LLM evaluation using interdisciplinary tasks and progressively differentiated assessment frameworks to observe how evaluators interpret model responses across linguistic, epistemic, and credibility dimensions. Our findings show that LLM errors shift from predictive to hermeneutic forms, where linguistic fluency, structural coherence, and superficially plausible citations conceal deeper distortions of meaning. Evaluators frequently conflated criteria such as correctness, relevance, bias, groundedness, and consistency, indicating that human judgment collapses analytical distinctions into intuitive heuristics shaped by form and fluency. Across rounds, we observed a systematic verification burden and cognitive drift. As tasks became denser, evaluators increasingly relied on surface cues, allowing erroneous yet well formed answers to pass as credible. These results suggest that error is not solely a property of model behavior but a co-constructed outcome of generative plausibility and human interpretive shortcuts. Understanding AI epistemic failure therefore requires reframing evaluation as a relational interpretive process, where the boundary between system failure and human miscalibration becomes porous. The study provides implications for LLM assessment, digital literacy, and the design of trustworthy human AI communication.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mahadev Prasad Panda, Purnachandra Rao Makkena, Srivatsa Prativadibhayankaram, Siegfried Fรถรel, Andrรฉ Kaup
ๆฅ็ๆ่ฆ
Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in BD-rate over JPEG AI, while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insights into the factors contributing to reconstruction.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yipeng Zhuang, Yifeng Guo, Yuewen Li, Yuheng Wu, Philip Leung-Ho Yu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Lung cancer patients frequently experience breakthrough pain episodes, with up to 91% requiring timely intervention. To enable proactive pain management, we propose a hybrid machine learning and large language model pipeline that predicts pain episodes within 48 and 72 hours of hospitalization using both structured and unstructured electronic health record data. A retrospective cohort of 266 inpatients was analyzed, with features including demographics, tumor stage, vital signs, and WHO-tiered analgesic use. The machine learning module captured temporal medication trends, while the large language model interpreted ambiguous dosing records and free-text clinical notes. Integrating these modalities improved sensitivity and interpretability. Our framework achieved an accuracy of 0.874 (48h) and 0.917 (72h), with an improvement in sensitivity of 8.6% and 10.4% due to the augmentation of large language model. This hybrid approach offers a clinically interpretable and scalable tool for early pain episode forecasting, with potential to enhance treatment precision and optimize resource allocation in oncology care.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Daniel Bramblett, Rushang Karia, Adrian Ciotinga, Ruthvick Suresh, Pulkit Verma et al.
ๆฅ็ๆ่ฆ
Black-box AI (BBAI) systems such as foundational models are increasingly being used for sequential decision making. To ensure that such systems are safe to operate and deploy, it is imperative to develop efficient methods that can provide a sound and interpretable representation of the BBAI's capabilities. This paper shows that PDDL-style representations can be used to efficiently learn and model an input BBAI's planning capabilities. It uses the Monte-Carlo tree search paradigm to systematically create test tasks, acquire data, and prune the hypothesis space of possible symbolic models. Learned models describe a BBAI's capabilities, the conditions under which they can be executed, and the possible outcomes of executing them along with their associated probabilities. Theoretical results show soundness, completeness and convergence of the learned models. Empirical results with multiple BBAI systems illustrate the scope, efficiency, and accuracy of the presented methods.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Oliver Stritzel, Nick Hรผhnerbein, Simon Rauch, Itzel Zarate, Lukas Fleischmann et al.
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, Predictive Process Mining (PPM) techniques based on artificial neural networks have evolved as a method for monitoring the future behavior of unfolding business processes and predicting Key Performance Indicators (KPIs). However, many PPM approaches often lack reproducibility, transparency in decision making, usability for incorporating novel datasets and benchmarking, making comparisons among different implementations very difficult. In this paper, we propose SPICE, a Python framework that reimplements three popular, existing baseline deep-learning-based methods for PPM in PyTorch, while designing a common base framework with rigorous configurability to enable reproducible and robust comparison of past and future modelling approaches. We compare SPICE to original reported metrics and with fair metrics on 11 datasets.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Abhisek Ganguly
ๆฅ็ๆ่ฆ
We formalize two independent computational limitations that constrain algorithmic intelligence: formal incompleteness and dynamical unpredictability. The former limits the deductive power of consistent reasoning systems while the later bounds long-term prediction under finite precision. We show that these two extrema together impose structural bounds on an agent's ability to reason about its own predictive capabilities. In particular, an algorithmic agent cannot compute its own maximal prediction horizon generally. This perspective clarifies inherent trade-offs between reasoning, prediction, and self-analysis in intelligent systems.
๐
2025-12-18
๐ 15 pages, 16 references, Key Note preented at the
๐ฅ Giovanni Adorni
Natural Language Processing
ๆฅ็ๆ่ฆ
Generative Artificial Intelligence (GenAI) is rapidly reshaping how knowledge is produced and validated in education. Rather than adding another digital tool, large language models reconfigure reading, writing, and coding into hybrid human-AI workflows, raising concerns about epistemic automation, cognitive offloading, and the de-professiona\-lisation of teachers. This paper proposes \emph{Cyber Humanism in Education} as a framework for reclaiming human agency in this landscape. We conceptualise AI-enabled learning environments as socio-technical infrastructures co-authored by humans and machines, and position educators and learners as epistemic agents and \emph{algorithmic citizens} who have both the right and the responsibility to shape these infrastructures.
We articulate three pillars for cyber-humanist design, \emph{reflexive competence}, \emph{algorithmic citizenship}, and \emph{dialogic design}, and relate them to major international digital and AI competence frameworks. We then present higher-education case studies that operationalise these ideas through \emph{prompt-based learning} and a new \emph{Conversational AI Educator} certification within the EPICT ecosystem. The findings show how such practices can strengthen epistemic agency while surfacing tensions around workload, equity, and governance, and outline implications for the future of AI-rich, human-centred education.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Mohammad Nehad Alam, Proma Hossain Progga, Swakkhar Shatabda
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Diagram-grounded geometry problem solving is a critical benchmark for multimodal large language models (MLLMs), yet the benefits of multi-agent design over single-agent remain unclear. We systematically compare single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. For open-source models, multi-agent consistently improves performance. For example, Qwen-2.5-VL (7B) gains +6.8 points and Qwen-2.5-VL (32B) gains +3.3 on Geometry3K, and both Qwen-2.5-VL variants see further gains on OlympiadBench and We-Math. In contrast, the closed-source Gemini-2.0-Flash generally performs better in single-agent mode on classic benchmarks, while multi-agent yields only modest improvements on the newer We-Math dataset. These findings show that multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal. All code, data, and reasoning files are available at https://github.com/faiyazabdullah/Interpreter-Solver
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wisnu Uriawan, Achmad Ajie Priyajie, Angga Gustian, Fikri Nur Hidayat, Sendi Ahmad Rafiudin et al.
ๆฅ็ๆ่ฆ
This research stems from the urgency to automate the thematic grouping of hadith in line with the growing digitalization of Islamic texts. Based on a literature review, the unsupervised learning approach with the Apriori algorithm has proven effective in identifying association patterns and semantic relations in unlabeled text data. The dataset used is the Indonesian Translation of the hadith of Bukhari, which first goes through preprocessing stages including case folding, punctuation cleaning, tokenization, stopword removal, and stemming. Next, an association rule mining analysis was conducted using the Apriori algorithm with support, confidence, and lift parameters. The results show the existence of meaningful association patterns such as the relationship between rakaat-prayer, verse-revelation, and hadith-story, which describe the themes of worship, revelation, and hadith narration. These findings demonstrate that the Apriori algorithm has the ability to automatically uncover latent semantic relationships, while contributing to the development of digital Islamic studies and technology-based learning systems.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gonรงalo Gaspar Alves, Shekoufeh Gorgi Zadeh, Andreas Husch, Ben Bausch
ๆฅ็ๆ่ฆ
Combining open-source datasets can introduce data leakage if the same subject appears in multiple sets, leading to inflated model performance. To address this, we explore subject fingerprinting, mapping all images of a subject to a distinct region in latent space, to enable subject re-identification via similarity matching. Using a ResNet-50 trained with triplet margin loss, we evaluate few-shot fingerprinting on 3D MRI and 2D X-ray data in both standard (20-way 1-shot) and challenging (1000-way 1-shot) scenarios. The model achieves high Mean- Recall-@-K scores: 99.10% (20-way 1-shot) and 90.06% (500-way 5-shot) on ChestXray-14; 99.20% (20-way 1-shot) and 98.86% (100-way 3-shot) on BraTS- 2021.
๐
2025-12-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jacob Reiss, Shikshya Shiwakoti, Samuel Goldsmith, Ujjwal Pandit
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In today's information-driven world, access to scientific publications has become increasingly easy. At the same time, filtering through the massive volume of available research has become more challenging than ever. Graph Neural Networks (GNNs) and graph attention mechanisms have shown strong effectiveness in searching large-scale information databases, particularly when combined with modern large language models. In this paper, we propose an Attention-Based Subgraph Retriever, a GNN-as-retriever model that applies attention-based pruning to extract a refined subgraph, which is then passed to a large language model for advanced knowledge reasoning.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Sangeeth B, Serena Nicolazzo, Deepa K., Vinod P
Machine Learning
ๆฅ็ๆ่ฆ
The rapid proliferation of deep neural networks (DNNs) across several domains has led to increasing concerns regarding intellectual property (IP) protection and model misuse. Trained DNNs represent valuable assets, often developed through significant investments. However, the ease with which models can be copied, redistributed, or repurposed highlights the urgent need for effective mechanisms to assert and verify model ownership. In this work, we propose an efficient and resilient white-box watermarking framework that embeds ownership information into the internal parameters of a DNN using chaotic sequences. The watermark is generated using a logistic map, a well-known chaotic function, producing a sequence that is sensitive to its initialization parameters. This sequence is injected into the weights of a chosen intermediate layer without requiring structural modifications to the model or degradation in predictive performance. To validate ownership, we introduce a verification process based on a genetic algorithm that recovers the original chaotic parameters by optimizing the similarity between the extracted and regenerated sequences. The effectiveness of the proposed approach is demonstrated through extensive experiments on image classification tasks using MNIST and CIFAR-10 datasets. The results show that the embedded watermark remains detectable after fine-tuning, with negligible loss in model accuracy. In addition to numerical recovery of the watermark, we perform visual analyses using weight density plots and construct activation-based classifiers to distinguish between original, watermarked, and tampered models. Overall, the proposed method offers a flexible and scalable solution for embedding and verifying model ownership in white-box settings well-suited for real-world scenarios where IP protection is critical.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sri Yash Tadimalla, Justin Cary, Gordon Hull, Jordan Register, Daniel Maxwell et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
The rapid assimilation of Artificial Intelligence technologies into various facets of society has created a significant educational imperative that current frameworks are failing to effectively address. We are witnessing the rise of a dangerous literacy gap, where a focus on the functional, operational skills of using AI tools is eclipsing the development of critical and ethical reasoning about them. This position paper argues for a systemic shift toward comprehensive AI literacy that centers human agency - the empowered capacity for intentional, critical, and responsible choice. This principle applies to all stakeholders in the educational ecosystem: it is the student's agency to question, create with, or consciously decide not to use AI based on the task; it is the teacher's agency to design learning experiences that align with instructional values, rather than ceding pedagogical control to a tool. True literacy involves teaching about agency itself, framing technology not as an inevitability to be adopted, but as a choice to be made. This requires a deep commitment to critical thinking and a robust understanding of epistemology. Through the AI Literacy, Fluency, and Competency frameworks described in this paper, educators and students will become agents in their own human-centric approaches to AI, providing necessary pathways to clearly articulate the intentions informing decisions and attitudes toward AI and the impact of these decisions on academic work, career, and society.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jirui Yang, Hengqi Guo, Zhihui Lu, Yi Zhao, Yuansen Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models often face a three-way trade-off among detection accuracy, inference latency, and deployment cost when used in real-world safety-sensitive applications. This paper introduces Prefix Probing, a black-box harmful content detection method that compares the conditional log-probabilities of "agreement/execution" versus "refusal/safety" opening prefixes and leverages prefix caching to reduce detection overhead to near first-token latency. During inference, the method requires only a single log-probability computation over the probe prefixes to produce a harmfulness score and apply a threshold, without invoking any additional models or multi-stage inference. To further enhance the discriminative power of the prefixes, we design an efficient prefix construction algorithm that automatically discovers highly informative prefixes, substantially improving detection performance. Extensive experiments demonstrate that Prefix Probing achieves detection effectiveness comparable to mainstream external safety models while incurring only minimal computational cost and requiring no extra model deployment, highlighting its strong practicality and efficiency.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wisnu Uriawan, Aria Octavian Hamza, Ade Ripaldi Nuralim, Adi Purnama, Ahmad Juaeni Yunus et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This research presents the implementation of a Sharia-compliant chatbot as an interactive medium for consulting Islamic questions, leveraging Reinforcement Learning (Q-Learning) integrated with Sentence-Transformers for semantic embedding to ensure contextual and accurate responses. Utilizing the CRISP-DM methodology, the system processes a curated Islam QA dataset of 25,000 question-answer pairs from authentic sources like the Qur'an, Hadith, and scholarly fatwas, formatted in JSON for flexibility and scalability. The chatbot prototype, developed with a Flask API backend and Flutter-based mobile frontend, achieves 87% semantic accuracy in functional testing across diverse topics including fiqh, aqidah, ibadah, and muamalah, demonstrating its potential to enhance religious literacy, digital da'wah, and access to verified Islamic knowledge in the Industry 4.0 era. While effective for closed-domain queries, limitations such as static learning and dataset dependency highlight opportunities for future enhancements like continuous adaptation and multi-turn conversation support, positioning this innovation as a bridge between traditional Islamic scholarship and modern AI-driven consultation.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Barna Pรกsztor, Thomas Kleine Buening, Andreas Krause
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We introduce Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditionally on the Leader's action. This approach decomposes preference optimization into a refinement problem for the Follower and an optimization problem against an adversary for the Leader. Unlike Reinforcement Learning from Human Feedback (RLHF), which assigns scalar rewards to actions, or Nash Learning from Human Feedback (NLHF), which seeks a simultaneous-move equilibrium, SLHF leverages the asymmetry of sequential play to capture richer preference structures. The sequential design of SLHF naturally enables inference-time refinement, as the Follower learns to improve the Leader's actions, and these refinements can be leveraged through iterative sampling. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Experiments on large language models demonstrate that SLHF achieves strong alignment across diverse preference datasets, scales from 0.5B to 8B parameters, and yields inference-time refinements that transfer across model families without further fine-tuning.
๐
2025-12-18
๐ ArXiv Preprint (cs.MA)
๐ฅ Giulia Boato, Andrea Montibeller, Edward Delp, Luisa Verdoliva, Daniele Miorandi
ๆฅ็ๆ่ฆ
AI is reshaping the landscape of multimedia forensics. We propose AI forensic agents: reliable orchestrators that select and combine forensic detectors, identify provenance and context, and provide uncertainty-aware assessments. We highlight pitfalls in current solutions and introduce a unified framework to improve the authenticity verification process.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Iker Garcรญa-Ferrero, David Montero, Roman Orus
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analize the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shaohua Wu, Tong Yu, Shenling Wang, Xudong Zhao
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Diffusion models have shown remarkable capacity in image synthesis based on their U-shaped architecture and convolutional neural networks (CNN) as basic blocks. The locality of the convolution operation in CNN may limit the model's ability to understand long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model with Swin-transformer in this work. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder, to improve the non-local modeling ability in feature extraction and image restoration. The text-image alignment is improved with a well-chosen text encoder, effective utilization of text embedding, and careful design in the incorporation of text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves the state-of-the-art FID score of 1.37 on ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from the human-painted ones.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yumeng Wang, Tianyu Fan, Lingrui Xu, Chao Huang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct domains. To ensure high query quality and answer uniqueness, we employ a flexible methodology that reliably generates queries of controllable difficulty based on factual claims of web contents. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle: many achieve below 35% accuracy, and none consistently excel across domains or difficulty levels. These findings reveal that Needle in the Web presents a significant challenge for current search systems and highlights the open problem of effective fuzzy retrieval under semantic ambiguity.
๐
2025-12-18
๐ WSDM
๐ฅ Himanshu Gharat, Himanshi Agrawal, Gourab K. Patro
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have empowered AI agents with advanced capabilities for understanding, reasoning, and interacting across diverse tasks. The addition of memory further enhances them by enabling continuity across interactions, learning from past experiences, and improving the relevance of actions and responses over time; termed as memory-enhanced personalization. Although such personalization through memory offers clear benefits, it also introduces risks of bias. While several previous studies have highlighted bias in ML and LLMs, bias due to memory-enhanced personalized agents is largely unexplored. Using recruitment as an example use case, we simulate the behavior of a memory-enhanced personalized agent, and study whether and how bias is introduced and amplified in and across various stages of operation. Our experiments on agents using safety-trained LLMs reveal that bias is systematically introduced and reinforced through personalization, emphasizing the need for additional protective measures or agent guardrails in memory-enhanced LLM-based AI agents.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Samuel Mugel et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware--including laptops, desktops, industrial controllers, and embedded systems--relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Primoz Kocbek, Leon Kopitar, Gregor Stiglic
Natural Language Processing
ๆฅ็ๆ่ฆ
This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset, which included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two AI agent approach, and a fine-tuning approach. We selected OpenAI gpt-4o and gpt-4o mini models as baselines for further research. We evaluated our approaches with quantitative metrics, such as Flesch-Kincaid grade level, SMOG Index, SARI, and BERTScore, G-Eval, as well as with qualitative metric, more precisely 5-point Likert scales for simplicity, accuracy, completeness, brevity. Results showed a superior performance of gpt-4o-mini and an underperformance of FT approaches. G-Eval, a LLM based quantitative metric, showed promising results, ranking the approaches similarly as the qualitative metric.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Julien Gachadoat, Guillaume Lagarde
Machine Learning
ๆฅ็ๆ่ฆ
Generative art systems often involve high-dimensional and complex parameter spaces in which aesthetically compelling outputs occupy only small, fragmented regions. Because of this combinatorial explosion, artists typically rely on extensive manual trial-and-error, leaving many potentially interesting configurations undiscovered. In this work we make two contributions. First, we introduce ParamExplorer, an interactive and modular framework inspired by reinforcement learning that helps the exploration of parameter spaces in generative art algorithms, guided by human-in-the-loop or even automated feedback. The framework also integrates seamlessly with existing p5.js projects. Second, within this framework we implement and evaluate several exploration strategies, referred to as agents.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
๐
2025-12-18
๐ ArXiv Preprint (nlin.AO)
๐ฅ Pradeep Singh, Mudasani Rushikesh, Bezawada Sri Sai Anurag, Balasubramanian Raman
Machine Learning
ๆฅ็ๆ่ฆ
We develop a unified, dynamical-systems narrative of the universe that traces a continuous chain of structure formation from the Big Bang to contemporary human societies and their artificial learning systems. Rather than treating cosmology, astrophysics, geophysics, biology, cognition, and machine intelligence as disjoint domains, we view each as successive regimes of dynamics on ever-richer state spaces, stitched together by phase transitions, symmetry-breaking events, and emergent attractors. Starting from inflationary field dynamics and the growth of primordial perturbations, we describe how gravitational instability sculpts the cosmic web, how dissipative collapse in baryonic matter yields stars and planets, and how planetary-scale geochemical cycles define long-lived nonequilibrium attractors. Within these attractors, we frame the origin of life as the emergence of self-maintaining reaction networks, evolutionary biology as flow on high-dimensional genotype-phenotype-environment manifolds, and brains as adaptive dynamical systems operating near critical surfaces. Human culture and technology-including modern machine learning and artificial intelligence-are then interpreted as symbolic and institutional dynamics that implement and refine engineered learning flows which recursively reshape their own phase space. Throughout, we emphasize recurring mathematical motifs-instability, bifurcation, multiscale coupling, and constrained flows on measure-zero subsets of the accessible state space. Our aim is not to present any new cosmological or biological model, but a cross-scale, theoretical perspective: a way of reading the universe's history as the evolution of dynamics itself, culminating (so far) in biological and artificial systems capable of modeling, predicting, and deliberately perturbing their own future trajectories.
๐
2025-12-18
๐ ArXiv Preprint (cs.PF)
๐ฅ Pompougnac Hugo, Guillon Christophe, Noiry Sylvain, Dutilleul Alban, Iooss Guillaume et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Achieving high efficiency on AI operators demands precise control over computation and data movement. However, existing scheduling languages are locked into specific compiler ecosystems, preventing fair comparison, reuse, and evaluation across frameworks. No unified interface currently decouples scheduling specification from code generation and measurement. We introduce XTC, a platform that unifies scheduling and performance evaluation across compilers. With its common API and reproducible measurement framework, XTC enables portable experimentation and accelerates research on optimization strategies.
๐
2025-12-18
๐ IEEE Transactions on Image Processing (T-IP)
๐ฅ Mengyuan Liu, Jiajie Liu, Jinyan Zhang, Wenhao Li, Junsong Yuan
Computer Vision
ๆฅ็ๆ่ฆ
The lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. The lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: (1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module is proposed to aggregate cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Theresa Eimer, Lennart Schรคpermeier, Andrรฉ Biedenkapp, Alexander Tornede, Lars Kotthoff et al.
ๆฅ็ๆ่ฆ
Empirical research on meta-algorithmics, such as algorithm selection, configuration, and scheduling, often relies on extensive and thus computationally expensive experiments. With the large degree of freedom we have over our experimental setup and design comes a plethora of possible error sources that threaten the scalability and validity of our scientific insights. Best practices for meta-algorithmic research exist, but they are scattered between different publications and fields, and continue to evolve separately from each other. In this report, we collect good practices for empirical meta-algorithmic research across the subfields of the COSEAL community, encompassing the entire experimental cycle: from formulating research questions and selecting an experimental design, to executing ex- periments, and ultimately, analyzing and presenting results impartially. It establishes the current state-of-the-art practices within meta-algorithmic research and serves as a guideline to both new researchers and practitioners in meta-algorithmic fields.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kejun Liu, Yuanyuan Liu, Lin Wei, Chang Tang, Yibing Zhan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, spontaneous emotion induction paradigm is exploited with stimulus material, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for mutimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Li, Yahan Yu, Youyuan Lin, Yong-Hao Yang, Chenhui Chu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Danial Safaei, Siddartha Khastgir, Mohsen Alirezaei, Jeroen Ploeg, Son Tong et al.
Robotics
ๆฅ็ๆ่ฆ
Virtual testing using synthetic data has become a cornerstone of autonomous vehicle (AV) safety assurance. Despite progress in improving visual realism through advanced simulators and generative AI, recent studies reveal that pixel-level fidelity alone does not ensure reliable transfer from simulation to the real world. What truly matters is whether the system-under-test (SUT) bases its decisions on the same causal evidence in both real and simulated environments - not just whether images "look real" to humans. This paper addresses the lack of such a behavior-grounded fidelity measure by introducing Decisive Feature Fidelity (DFF), a new SUT-specific metric that extends the existing fidelity spectrum to capture mechanism parity - the agreement in causal evidence underlying the SUT's decisions across domains. DFF leverages explainable-AI (XAI) methods to identify and compare the decisive features driving the SUT's outputs for matched real-synthetic pairs. We further propose practical estimators based on counterfactual explanations, along with a DFF-guided calibration scheme to enhance simulator fidelity. Experiments on 2126 matched KITTI-VirtualKITTI2 pairs demonstrate that DFF reveals discrepancies overlooked by conventional output-value fidelity. Furthermore, results show that DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output value fidelity across diverse SUTs.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jinwu Chen, Qidie Wu, Bin Li, Lin Ma, Xin Si et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Optimizing CUDA kernels is a challenging and labor-intensive task, given the need for hardware-software co-design expertise and the proprietary nature of high-performance kernel libraries. While recent large language models (LLMs) combined with evolutionary algorithms show promise in automatic kernel optimization, existing approaches often fall short in performance due to their suboptimal agent designs and mismatched evolution representations. This work identifies these mismatches and proposes cuPilot, a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. Key contributions include a strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization. Experimental results show that the generated kernels by cuPilot achieve an average speed up of 3.09$\times$ over PyTorch on a benchmark of 100 kernels. On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units. The generated kernels are open-sourced at https://github.com/champloo2878/cuPilot-Kernels.git.
๐
2025-12-18
๐ ArXiv Preprint (cs.DC)
๐ฅ Ignacio Heredia, รlvaro Lรณpez Garcรญa, Germรกn Moltรณ, Amanda Calatrava, Valentin Kozlov et al.
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we describe a federated compute platform dedicated to support Artificial Intelligence in scientific workloads. Putting the effort into reproducible deployments, it delivers consistent, transparent access to a federation of physically distributed e-Infrastructures. Through a comprehensive service catalogue, the platform is able to offer an integrated user experience covering the full Machine Learning lifecycle, including model development (with dedicated interactive development environments), training (with GPU resources, annotation tools, experiment tracking, and federated learning support) and deployment (covering a wide range of deployment options all along the Cloud Continuum). The platform also provides tools for traceability and reproducibility of AI models, integrates with different Artificial Intelligence model providers, datasets and storage resources, allowing users to interact with the broader Machine Learning ecosystem. Finally, it is easily customizable to lower the adoption barrier by external communities.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiayang Yang, Chunhui Zhao, Martin Guay, Zhixing Cao
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) offer promising capabilities for interpreting multivariate time-series data, yet their application to real-world battery energy storage system (BESS) operation and maintenance remains largely unexplored. Here, we present TimeSeries2Report (TS2R), a prompting framework that converts raw lithium-ion battery operational time-series into structured, semantically enriched reports, enabling LLMs to reason, predict, and make decisions in BESS management scenarios. TS2R encodes short-term temporal dynamics into natural language through a combination of segmentation, semantic abstraction, and rule-based interpretation, effectively bridging low-level sensor signals with high-level contextual insights. We benchmark TS2R across both lab-scale and real-world datasets, evaluating report quality and downstream task performance in anomaly detection, state-of-charge prediction, and charging/discharging management. Compared with vision-, embedding-, and text-based prompting baselines, report-based prompting via TS2R consistently improves LLM performance in terms of across accuracy, robustness, and explainability metrics. Notably, TS2R-integrated LLMs achieve expert-level decision quality and predictive consistency without retraining or architecture modification, establishing a practical path for adaptive, LLM-driven battery intelligence.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shabnam Bagheri Marzijarani, Mohammad Zolfaghari, Hedieh Sajedi
Machine Learning
ๆฅ็ๆ่ฆ
The Internet of Things (IoT) is a concept by which objects find identity and can communicate with each other in a network. One of the applications of the IoT is in the field of medicine, which is called the Internet of Medical Things (IoMT). Acute Lymphocytic Leukemia (ALL) is a type of cancer categorized as a hematic disease. It usually begins in the bone marrow due to the overproduction of immature White Blood Cells (WBCs or leukocytes). Since it has a high rate of spread to other body organs, it is a fatal disease if not diagnosed and treated early. Therefore, for identifying cancerous (ALL) cells in medical diagnostic laboratories, blood, as well as bone marrow smears, are taken by pathologists. However, manual examinations face limitations due to human error risk and time-consuming procedures. So, to tackle the mentioned issues, methods based on Artificial Intelligence (AI), capable of identifying cancer from non-cancer tissue, seem vital. Deep Neural Networks (DNNs) are the most efficient machine learning (ML) methods. These techniques employ multiple layers to extract higher-level features from the raw input. In this paper, a Convolutional Neural Network (CNN) is applied along with a new type of classifier, Higher Order Singular Value Decomposition (HOSVD), to categorize ALL and normal (healthy) cells from microscopic blood images. We employed the model on IoMT structure to identify leukemia quickly and safely. With the help of this new leukemia classification framework, patients and clinicians can have real-time communication. The model was implemented on the Acute Lymphoblastic Leukemia Image Database (ALL-IDB2) and achieved an average accuracy of %98.88 in the test step.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sรถren Auer, Allard Oelen, Mohamad Yaser Jaradeh, Mutahira Khalid, Farhana Keya et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancements in Generative AI and Large Language Models promise to transform the way research is conducted, potentially offering unprecedented opportunities to augment scholarly workflows. However, effectively integrating AI into research remains a challenge due to varying domain requirements, limited AI literacy, the complexity of coordinating tools and agents, and the unclear accuracy of Generative AI in research. We present the vision of the TIB AIssistant, a domain-agnostic human-machine collaborative platform designed to support researchers across disciplines in scientific discovery, with AI assistants supporting tasks across the research life cycle. The platform offers modular components - including prompt and tool libraries, a shared data store, and a flexible orchestration framework - that collectively facilitate ideation, literature analysis, methodology development, data analysis, and scholarly writing. We describe the conceptual framework, system architecture, and implementation of an early prototype that demonstrates the feasibility and potential impact of our approach.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Enis Yalcin, Joshua O'Hara, Maria Stamatopoulou, Chengxu Zhou, Dimitrios Kanoulas
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) show promise in automating reward design in humanoid locomotion, which could eliminate the need for tedious manual engineering. However, current VLM-based methods are essentially "blind", as they lack the environmental perception required to navigate complex terrain. We present E-SDS (Environment-aware See it, Do it, Sorted), a framework that closes this perception gap. E-SDS integrates VLMs with real-time terrain sensor analysis to automatically generate reward functions that facilitate training of robust perceptive locomotion policies, grounded by example videos. Evaluated on a Unitree G1 humanoid across four distinct terrains (simple, gaps, obstacles, stairs), E-SDS uniquely enabled successful stair descent, while policies trained with manually-designed rewards or a non-perceptive automated baseline were unable to complete the task. In all terrains, E-SDS also reduced velocity tracking error by 51.9-82.6%. Our framework reduces the human effort of reward design from days to less than two hours while simultaneously producing more robust and capable locomotion policies.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Roman Akramov, Artem Khamatullin, Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko
Machine Learning
ๆฅ็ๆ่ฆ
Choosing the number of topics $T$ in Latent Dirichlet Allocation (LDA) is a key design decision that strongly affects both the statistical fit and interpretability of topic models. In this work, we formulate the selection of $T$ as a discrete black-box optimization problem, where each function evaluation corresponds to training an LDA model and measuring its validation perplexity. Under a fixed evaluation budget, we compare four families of optimizers: two hand-designed evolutionary methods - Genetic Algorithm (GA) and Evolution Strategy (ES) - and two learned, amortized approaches, Preferential Amortized Black-Box Optimization (PABBO) and Sharpness-Aware Black-Box Optimization (SABBO). Our experiments show that, while GA, ES, PABBO, and SABBO eventually reach a similar band of final perplexity, the amortized optimizers are substantially more sample- and time-efficient. SABBO typically identifies a near-optimal topic number after essentially a single evaluation, and PABBO finds competitive configurations within a few evaluations, whereas GA and ES require almost the full budget to approach the same region.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yadong Li, Tong Zhang, Bo Huang, Zhen Cui
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Deep multi-agent reinforcement learning (MARL) algorithms are booming in the field of collaborative intelligence, and StarCraft multi-agent challenge (SMAC) is widely-used as the benchmark therein. However, imaginary opponents of MARL algorithms are practically configured and controlled in a fixed built-in AI mode, which causes less diversity and versatility in algorithm evaluation. To address this issue, in this work, we establish a multi-agent algorithm-vs-algorithm environment, named StarCraft II battle arena (SC2BA), to refresh the benchmarking of MARL algorithms in an adversary paradigm. Taking StarCraft as infrastructure, the SC2BA environment is specifically created for inter-algorithm adversary with the consideration of fairness, usability and customizability, and meantime an adversarial PyMARL (APyMARL) library is developed with easy-to-use interfaces/modules. Grounding in SC2BA, we benchmark those classic MARL algorithms in two types of adversarial modes: dual-algorithm paired adversary and multi-algorithm mixed adversary, where the former conducts the adversary of pairwise algorithms while the latter focuses on the adversary to multiple behaviors from a group of algorithms. The extensive benchmark experiments exhibit some thought-provoking observations/problems in the effectivity, sensibility and scalability of these completed algorithms. The SC2BA environment as well as reproduced experiments are released in \href{https://github.com/dooliu/SC2BA}{Github}, and we believe that this work could mark a new step for the MARL field in the coming years.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Allard Oelen, Sรถren Auer
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapidly growing popularity of adopting Artificial Intelligence (AI), and specifically Large Language Models (LLMs), is having a widespread impact throughout society, including the academic domain. AI-supported research has the potential to support researchers with tasks across the entire research life cycle. In this work, we demonstrate the TIB AIssistant, an AI-supported research platform providing support throughout the research life cycle. The AIssistant consists of a collection of assistants, each responsible for a specific research task. In addition, tools are provided to give access to external scholarly services. Generated data is stored in the assets and can be exported as an RO-Crate bundle to provide transparency and enhance reproducibility of the research project. We demonstrate the AIssistant's main functionalities by means of a sequential walk-through of assistants, interacting with each other to generate sections for a draft research paper. In the end, with the AIssistant, we lay the foundation for a larger agenda of providing a community-maintained platform for AI-supported research.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Maeve Madigan, Parameswaran Kamalaruban, Glenn Moynihan, Tom Kempton, David Sutton et al.
ๆฅ็ๆ่ฆ
Multi-agent systems have demonstrated the ability to improve performance on a variety of predictive tasks by leveraging collaborative decision making. However, the lack of effective evaluation methodologies has made it difficult to estimate the risk of bias, making deployment of such systems unsafe in high stakes domains such as consumer finance, where biased decisions can translate directly into regulatory breaches and financial loss. To address this challenge, we need to develop fairness evaluation methodologies for multi-agent predictive systems and measure the fairness characteristics of these systems in the financial tabular domain. Examining fairness metrics using large-scale simulations across diverse multi-agent configurations, with varying communication and collaboration mechanisms, we reveal patterns of emergent bias in financial decision-making that cannot be traced to individual agent components, indicating that multi-agent systems may exhibit genuinely collective behaviors. Our findings highlight that fairness risks in financial multi-agent systems represent a significant component of model risk, with tangible impacts on tasks such as credit scoring and income estimation. We advocate that multi-agent decision systems must be evaluated as holistic entities rather than through reductionist analyses of their constituent components.
๐
2025-12-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Allard Oelen, Mohamad Yaser Jaradeh, Sรถren Auer
Natural Language Processing
ๆฅ็ๆ่ฆ
As the volume of published scholarly literature continues to grow, finding relevant literature becomes increasingly difficult. With the rise of generative Artificial Intelligence (AI), and particularly Large Language Models (LLMs), new possibilities emerge to find and explore literature. We introduce ASK (Assistant for Scientific Knowledge), an AI-driven scholarly literature search and exploration system that follows a neuro-symbolic approach. ASK aims to provide active support to researchers in finding relevant scholarly literature by leveraging vector search, LLMs, and knowledge graphs. The system allows users to input research questions in natural language and retrieve relevant articles. ASK automatically extracts key information and generates answers to research questions using a Retrieval-Augmented Generation (RAG) approach. We present an evaluation of ASK, assessing the system's usability and usefulness. Findings indicate that the system is user-friendly and users are generally satisfied while using the system.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Nguyen Xuan-Vu, Daniel Armstrong, Milena Wehrbach, Andres M Bran, Zlatko Jonฤev et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Computer-aided synthesis planning (CASP) has long been envisioned as a complementary tool for synthetic chemists. However, existing frameworks often lack mechanisms to allow interaction with human experts, limiting their ability to integrate chemists' insights. In this work, we introduce Synthelite, a synthesis planning framework that uses large language models (LLMs) to directly propose retrosynthetic transformations. Synthelite can generate end-to-end synthesis routes by harnessing the intrinsic chemical knowledge and reasoning capabilities of LLMs, while allowing expert intervention through natural language prompts. Our experiments demonstrate that Synthelite can flexibly adapt its planning trajectory to diverse user-specified constraints, achieving up to 95\% success rates in both strategy-constrained and starting-material-constrained synthesis tasks. Additionally, Synthelite exhibits the ability to account for chemical feasibility during route design. We envision Synthelite to be both a useful tool and a step toward a paradigm where LLMs are the central orchestrators of synthesis planning.
๐
2025-12-18
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Joachim Winther Pedersen, Erwan Plantec, Eleni Nisioti, Marcello Barylli, Milton Montero et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
How can neural networks evolve themselves without relying on external optimizers? We propose Self-Referential Graph HyperNetworks, systems where the very machinery of variation and inheritance is embedded within the network. By uniting hypernetworks, stochastic parameter generation, and graph-based representations, Self-Referential GHNs mutate and evaluate themselves while adapting mutation rates as selectable traits. Through new reinforcement learning benchmarks with environmental shifts (CartPoleSwitch, LunarLander-Switch), Self-Referential GHNs show swift, reliable adaptation and emergent population dynamics. In the locomotion benchmark Ant-v5, they evolve coherent gaits, showing promising fine-tuning capabilities by autonomously decreasing variation in the population to concentrate around promising solutions. Our findings support the idea that evolvability itself can emerge from neural self-reference. Self-Referential GHNs reflect a step toward synthetic systems that more closely mirror biological evolution, offering tools for autonomous, open-ended learning agents.
๐
2025-12-18
๐ CVPR 2026
๐ฅ Haodi He, Jihun Yu, Ronald Fedkiw
ๆฅ็ๆ่ฆ
We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline. The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mohammad-Javad Rezaei, Mozafar Bag-Mohammadi
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, a new metaheuristic optimization algorithm, called Path Construction Imitation Algorithm (PCIA), is proposed. PCIA is inspired by how humans construct new paths and use them. Typically, humans prefer popular transportation routes. In the event of a path closure, a new route is built by mixing the existing paths intelligently. Also, humans select different pathways on a random basis to reach unknown destinations. PCIA generates a random population to find the best route toward the destination, similar to swarm-based algorithms. Each particle represents a path toward the destination. PCIA has been tested with 53 mathematical optimization problems and 13 constrained optimization problems. The results showed that the PCIA is highly competitive compared to both popular and the latest metaheuristic algorithms.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee
ๆฅ็ๆ่ฆ
Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages known observations such as 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g. tile-level operations), across both prefill and decode attention. The Top-k selection and reuse in Kascade is head-aware and we show in our experiments that this is critical for high accuracy. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over FlashAttention-3 baseline on H100 GPUs while closely matching dense attention accuracy on long-context benchmarks such as LongBench and AIME-24.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilรฉm Zouhar, Carlos Escolano et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Soochang Song, Yongjune Kim
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Peter Coveney, Roger Highfield
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) is commonly depicted as transformative. Yet, after more than a decade of hype, its measurable impact remains modest outside a few high-profile scientific and commercial successes. The 2024 Nobel Prizes in Chemistry and Physics recognized AI's potential, but broader assessments indicate the impact to date is often more promotional than technical. We argue that while current AI may influence physics, physics has significantly more to offer this generation of AI. Current architectures - large language models, reasoning models, and agentic AI - can depend on trillions of meaningless parameters, suffer from distributional bias, lack uncertainty quantification, provide no mechanistic insights, and fail to capture even elementary scientific laws. We review critiques of these limits, highlight opportunities in quantum AI and analogue computing, and lay down a roadmap for the adoption of 'Big AI': a synthesis of theory-based rigour with the flexibility of machine learning.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Early prediction of battery cycle life is essential for accelerating battery research, manufacturing, and deployment. Although machine learning methods have shown encouraging results, progress is hindered by data scarcity and heterogeneity arising from diverse aging conditions. In other fields, foundation models (FMs) trained on diverse datasets have achieved broad generalization through transfer learning, but no FMs have been reported for battery cycle life prediction yet. Here we present the Pretrained Battery Transformer (PBT), the first FM for battery life prediction, developed through domain-knowledge-encoded mixture-of-expert layers. Validated on the largest public battery life database, PBT learns transferable representations from 13 lithium-ion battery (LIB) datasets, outperforming existing models by an average of 19.8%. With transfer learning, PBT achieves state-of-the-art performance across 15 diverse datasets encompassing various operating conditions, formation protocols, and chemistries of LIBs. This work establishes a foundation model pathway for battery lifetime prediction, paving the way toward universal battery lifetime prediction systems.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Arther Tian, Alex Ding, Frank Chen, Alan Wu, Aaron Chan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Decentralized large language model (LLM) inference promises transparent and censorship resistant access to advanced AI, yet existing verification approaches struggle to scale to modern models. Proof of Quality (PoQ) replaces cryptographic verification of computation with consensus over output quality, but the original formulation ignores heterogeneous computational costs across inference and evaluator nodes. This paper introduces a cost-aware PoQ framework that integrates explicit efficiency measurements into the reward mechanism for both types of nodes. The design combines ground truth token level F1, lightweight learned evaluators, and GPT based judgments within a unified evaluation pipeline, and adopts a linear reward function that balances normalized quality and cost.
Experiments on extractive question answering and abstractive summarization use five instruction tuned LLMs ranging from TinyLlama-1.1B to Llama-3.2-3B and three evaluation models spanning cross encoder and bi encoder architectures. Results show that a semantic textual similarity bi encoder achieves much higher correlation with both ground truth and GPT scores than cross encoders, indicating that evaluator architecture is a critical design choice for PoQ. Quality-cost analysis further reveals that the largest models in the pool are also the most efficient in terms of quality per unit latency. Monte Carlo simulations over 5\,000 PoQ rounds demonstrate that the cost-aware reward scheme consistently assigns higher average rewards to high quality low cost inference models and to efficient evaluators, while penalizing slow low quality nodes. These findings suggest that cost-aware PoQ provides a practical foundation for economically sustainable decentralized LLM inference.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Driven by Large Language Models, the single-agent, multi-tool architecture has become a popular paradigm for autonomous agents due to its simplicity and effectiveness. However, this architecture also introduces a new and severe privacy risk, which we term Tools Orchestration Privacy Risk (TOP-R), where an agent, to achieve a benign user goal, autonomously aggregates information fragments across multiple tools and leverages its reasoning capabilities to synthesize unexpected sensitive information. We provide the first systematic study of this risk. First, we establish a formal framework, attributing the risk's root cause to the agent's misaligned objective function: an overoptimization for helpfulness while neglecting privacy awareness. Second, we construct TOP-Bench, comprising paired leakage and benign scenarios, to comprehensively evaluate this risk. To quantify the trade-off between safety and robustness, we introduce the H-Score as a holistic metric. The evaluation results reveal that TOP-R is a severe risk: the average Risk Leakage Rate (RLR) of eight representative models reaches 90.24%, while the average H-Score is merely 0.167, with no model exceeding 0.3. Finally, we propose the Privacy Enhancement Principle (PEP) method, which effectively mitigates TOP-R, reducing the Risk Leakage Rate to 46.58% and significantly improving the H-Score to 0.624. Our work reveals both a new class of risk and inherent structural limitations in current agent architectures, while also offering feasible mitigation strategies.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Safwan Shaheer, G. M. Refatul Islam, Mohammad Rafid Hamid, Tahsin Zaman Jilan
ๆฅ็ๆ่ฆ
In this fast-evolving area of LLMs, our paper discusses the significant security risk presented by prompt injection attacks. It focuses on small open-sourced models, specifically the LLaMA family of models. We introduce novel defense mechanisms capable of generating automatic defenses and systematically evaluate said generated defenses against a comprehensive set of benchmarked attacks. Thus, we empirically demonstrated the improvement proposed by our approach in mitigating goal-hijacking vulnerabilities in LLMs. Our work recognizes the increasing relevance of small open-sourced LLMs and their potential for broad deployments on edge devices, aligning with future trends in LLM applications. We contribute to the greater ecosystem of open-source LLMs and their security in the following: (1) assessing present prompt-based defenses against the latest attacks, (2) introducing a new framework using a seed defense (Chain Of Thoughts) to refine the defense prompts iteratively, and (3) showing significant improvements in detecting goal hijacking attacks. Out strategies significantly reduce the success rates of the attacks and false detection rates while at the same time effectively detecting goal-hijacking capabilities, paving the way for more secure and efficient deployments of small and open-source LLMs in resource-constrained environments.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Feng Liang, Sizhe Cheng, Chenqi Yi
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He et al.
ๆฅ็ๆ่ฆ
Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Fanrui Zhang, Qiang Zhang, Sizhuo Zhou, Jianwen Sun, Chuanhao Li et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Existing image forgery detection (IFD) methods either exploit low-level, semantics-agnostic artifacts or rely on multimodal large language models (MLLMs) with high-level semantic knowledge. Although naturally complementary, these two information streams are highly heterogeneous in both paradigm and reasoning, making it difficult for existing methods to unify them or effectively model their cross-level interactions. To address this gap, we propose ForenAgent, a multi-round interactive IFD framework that enables MLLMs to autonomously generate, execute, and iteratively refine Python-based low-level tools around the detection objective, thereby achieving more flexible and interpretable forgery analysis. ForenAgent follows a two-stage training pipeline combining Cold Start and Reinforcement Fine-Tuning to enhance its tool interaction capability and reasoning adaptability progressively. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication, and instantiate it as both a data-sampling strategy and a task-aligned process reward. For systematic training and evaluation, we construct FABench, a heterogeneous, high-quality agent-forensics dataset comprising 100k images and approximately 200k agent-interaction question-answer pairs. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks when assisted by low-level tools, charting a promising route toward general-purpose IFD. The code will be released after the review process is completed.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Taozhao Chen, Linghan Huang, Kim-Kwang Raymond Choo, Huaming Chen
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
As large language models (LLMs) are increasingly adopted in safety-critical and regulated sectors, the retention of sensitive or prohibited knowledge introduces escalating risks, ranging from privacy leakage to regulatory non-compliance to to potential misuse, and so on. Recent studies suggest that machine unlearning can help ensure deployed models comply with evolving legal, safety, and governance requirements. However, current unlearning techniques assume clean separation between forget and retain datasets, which is challenging in operational settings characterized by highly entangled distributions. In such scenarios, perturbation-based methods often degrade general model utility or fail to ensure safety. To address this, we propose Selective Representation Misdirection for Unlearning (SRMU), a novel principled activation-editing framework that enforces feature-aware and directionally controlled perturbations. Unlike indiscriminate model weights perturbations, SRMU employs a structured misdirection vector with an activation importance map. The goal is to allow SRMU selectively suppresses harmful representations while preserving the utility on benign ones. Experiments are conducted on the widely used WMDP benchmark across low- and high-entanglement configurations. Empirical results reveal that SRMU delivers state-of-the-art unlearning performance with minimal utility losses, and remains effective under 20-30\% overlap where existing baselines collapse. SRMU provides a robust foundation for safety-driven model governance, privacy compliance, and controlled knowledge removal in the emerging LLM-based applications. We release the replication package at https://figshare.com/s/d5931192a8824de26aff.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Qiushi Sun et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
With VLM-powered computer-using agents (CUAs) becoming increasingly capable at graphical user interface (GUI) navigation and manipulation, reliable step-level decision-making has emerged as a key bottleneck for real-world deployment. In long-horizon workflows, errors accumulate quickly and irreversible actions can cause unintended consequences, motivating critic models that assess each action before execution. While critic models offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and public critic benchmarks for step-level evaluation in computer use. To bridge these gaps, we introduce OS-Oracle that makes three core contributions: (1) a scalable data pipeline for synthesizing cross-platform GUI critic data; (2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative policy optimization (CP-GRPO); (3) OS-Critic Bench, a holistic benchmark for evaluating critic model performance across Mobile, Web, and Desktop platforms. Leveraging this framework, we curate a high-quality dataset containing 310k critic samples. The resulting critic model, OS-Oracle-7B, achieves state-of-the-art performance among open-source VLMs on OS-Critic Bench, and surpasses proprietary models on the mobile domain. Furthermore, when serving as a pre-critic, OS-Oracle-7B improves the performance of native GUI agents such as UI-TARS-1.5-7B in OSWorld and AndroidWorld environments. The code is open-sourced at https://github.com/numbmelon/OS-Oracle.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jinhao Zhang, Yunquan Zhang, Daning Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Current mainstream post-training quantization methods for large language models typically apply a uniform quantization strategy across all network layers, overlooking the substantial differences in algorithmic suitability among layers. To address this limitation, we propose CKA Guided Modular Quantization, a fine-tuning-free, plug-and-play framework for algorithmic heterogeneous quantization. Our method independently evaluates multiple PTQ algorithms on each layer and employs Linear Centered Kernel Alignment (CKA) as a metric to automatically select the optimal quantization strategy per layer. The individually optimized strategies are then integrated to construct a hybrid quantized model. Experiments demonstrate that our approach consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods across mainstream LLMs including LLaMA and Qwen ,in terms of perplexity (PPL) and downstream task performance.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Gilad Gressel, Rahul Pankajakshan, Shir Rozenfeld, Ling Li, Ivan Franceschini et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Romance-baiting scams have become a major source of financial and emotional harm worldwide. These operations are run by organized crime syndicates that traffic thousands of people into forced labor, requiring them to build emotional intimacy with victims over weeks of text conversations before pressuring them into fraudulent cryptocurrency investments. Because the scams are inherently text-based, they raise urgent questions about the role of Large Language Models (LLMs) in both current and future automation.
We investigate this intersection by interviewing 145 insiders and 5 scam victims, performing a blinded long-term conversation study comparing LLM scam agents to human operators, and executing an evaluation of commercial safety filters. Our findings show that LLMs are already widely deployed within scam organizations, with 87% of scam labor consisting of systematized conversational tasks readily susceptible to automation. In a week-long study, an LLM agent not only elicited greater trust from study participants (p=0.007) but also achieved higher compliance with requests than human operators (46% vs. 18% for humans). Meanwhile, popular safety filters detected 0.0% of romance baiting dialogues. Together, these results suggest that romance-baiting scams may be amenable to full-scale LLM automation, while existing defenses remain inadequate to prevent their expansion.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yiliu Yang, Yilei Jiang, Qunzhong Wang, Yingshui Tan, Xiaoyong Zhu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Safety risks arise as large language model-based agents solve complex tasks with tools, multi-step plans, and inter-agent messages. However, deployer-written policies in natural language are ambiguous and context dependent, so they map poorly to machine-checkable rules, and runtime enforcement is unreliable. Expressing safety policies as sequents, we propose \textsc{QuadSentinel}, a four-agent guard (state tracker, policy verifier, threat watcher, and referee) that compiles these policies into machine-checkable rules built from predicates over observable state and enforces them online. Referee logic plus an efficient top-$k$ predicate updater keeps costs low by prioritizing checks and resolving conflicts hierarchically. Measured on ST-WebAgentBench (ICML CUA~'25) and AgentHarm (ICLR~'25), \textsc{QuadSentinel} improves guardrail accuracy and rule recall while reducing false positives. Against single-agent baselines such as ShieldAgent (ICML~'25), it yields better overall safety control. Near-term deployments can adopt this pattern without modifying core agents by keeping policies separate and machine-checkable. Our code will be made publicly available at https://github.com/yyiliu/QuadSentinel.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Abouagour, Eleftherios Garyfallidis
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automated floor plan generation lies at the intersection of combinatorial search, geometric constraint satisfaction, and functional design requirements -- a confluence that has historically resisted a unified computational treatment. While recent deep learning approaches have improved the state of the art, they often struggle to capture architectural reasoning: the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions. To address these fundamental challenges, this paper introduces GFLAN, a generative framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization. Given a single exterior boundary and a front-door location, our approach departs from direct pixel-to-pixel or wall-tracing generation in favor of a principled two-stage decomposition. Stage A employs a specialized convolutional architecture with dual encoders -- separating invariant spatial context from evolving layout state -- to sequentially allocate room centroids within the building envelope via discrete probability maps over feasible placements. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented graph neural network (GNN) that jointly regresses room boundaries.
๐
2025-12-18
๐ ArXiv Preprint (cs.SE)
๐ฅ Ora Nova Fandina, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. While attractive for scalability, LaaJs tend to overlook domain specific issues raising concerns about their reliability in critical evaluation tasks. To better understand these limitations in practice, we examine LaaJ behavior in a concrete industrial use case: legacy code modernization via COBOL code generation. In this setting, we find that even production deployed LaaJs can miss domain critical errors, revealing consistent blind spots in their evaluation capabilities.
To better understand these blind spots, we analyze generated COBOL programs and associated LaaJs judgments, drawing on expert knowledge to construct a preliminary taxonomy. Based on this taxonomy, we develop a lightweight analytic checker tool that flags over 30 domain specific issues observed in practice. We use its outputs as analytic hints, dynamically injecting them into the judges prompt to encourage LaaJ to revisit aspects it may have overlooked.
Experiments on a test set of 100 programs using four production level LaaJs show that LaaJ alone detects only about 45% of the errors present in the code (in all judges we tested), while the analytic checker alone lacks explanatory depth. When combined, the LaaJ+Hints configuration achieves up to 94% coverage (for the best performing judge and injection prompt) and produces qualitatively richer, more accurate explanations, demonstrating that analytic-LLM hybrids can substantially enhance evaluation reliability in deployed pipelines. We release the dataset and all used prompts.
๐
2025-12-18
๐ This paper has been published in the IEEE proceedi
๐ฅ Geofrey Owino, Bernard Shibwabo Kasamani, Ahmed M. Abdelmoniem, Edem Wornyo
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Accurate and interpretable classification of infant cry paralinguistics is essential for early detection of neonatal distress and clinical decision support. However, many existing deep learning methods rely on correlation-driven acoustic representations, which makes them vulnerable to noise, spurious cues, and domain shifts across recording environments. We propose DACH-TIC, a Domain-Agnostic Causal-Aware Hierarchical Audio Transformer for robust infant cry classification. The model integrates causal attention, hierarchical representation learning, multi-task supervision, and adversarial domain generalization within a unified framework.
DACH-TIC employs a structured transformer backbone with local token-level and global semantic encoders, augmented by causal attention masking and controlled perturbation training to approximate counterfactual acoustic variations. A domain-adversarial objective promotes environment-invariant representations, while multi-task learning jointly optimizes cry type recognition, distress intensity estimation, and causal relevance prediction. The model is evaluated on the Baby Chillanto and Donate-a-Cry datasets, with ESC-50 environmental noise overlays for domain augmentation.
Experimental results show that DACH-TIC outperforms state-of-the-art baselines, including HTS-AT and SE-ResNet Transformer, achieving improvements of 2.6 percent in accuracy and 2.2 points in macro-F1 score, alongside enhanced causal fidelity. The model generalizes effectively to unseen acoustic environments, with a domain performance gap of only 2.4 percent, demonstrating its suitability for real-world neonatal acoustic monitoring systems.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures reasoning ability of model to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yifei She, Ping Zhang, He Liu, Yanmin Jia, Yang Jing et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Real-world agentic tasks, unlike synchronous Markov Decision Processes (MDPs), often involve non-blocking actions with variable latencies, creating a fundamental \textit{Temporal Gap} between action initiation and completion. Existing environment-side solutions, such as blocking wrappers or frequent polling, either limit scalability or dilute the agent's context window with redundant observations. In this work, we propose an \textbf{Agent-side Approach} that empowers Large Language Models (LLMs) to actively align their \textit{Cognitive Timeline} with the physical world. By extending the Code-as-Action paradigm to the temporal domain, agents utilize semantic priors and In-Context Learning (ICL) to predict precise waiting durations (\texttt{time.sleep(t)}), effectively synchronizing with asynchronous environment without exhaustive checking. Experiments in a simulated Kubernetes cluster demonstrate that agents can precisely calibrate their internal clocks to minimize both query overhead and execution latency, validating that temporal awareness is a learnable capability essential for autonomous evolution in open-ended environments.
๐
2025-12-18
๐ ArXiv Preprint (q-fin.PR)
๐ฅ Bong-Gyu Jang, Younwoo Jeong, Changeun Kim
Machine Learning
ๆฅ็ๆ่ฆ
We introduce the \textit{Consensus-Bottleneck Asset Pricing Model} (CB-APM), a partially interpretable neural network that replicates the reasoning processes of sell-side analysts by capturing how dispersed investor beliefs are compressed into asset prices through a consensus formation process. By modeling this ``bottleneck'' to summarize firm- and macro-level information, CB-APM not only predicts future risk premiums of U.S. equities but also links belief aggregation to expected returns in a structurally interpretable manner. The model improves long-horizon return forecasts and outperforms standard deep learning approaches in both predictive accuracy and explanatory power. Comprehensive portfolio analyses show that CB-APM's out-of-sample predictions translate into economically meaningful payoffs, with monotonic return differentials and stable long-short performance across regularization settings. Empirically, CB-APM leverages consensus as a regularizer to amplify long-horizon predictability and yields interpretable consensus-based components that clarify how information is priced in returns. Moreover, regression and GRS-based pricing diagnostics reveal that the learned consensus representations capture priced variation only partially spanned by traditional factor models, demonstrating that CB-APM uncovers belief-driven structure in expected returns beyond the canonical factor space. Overall, CB-APM provides an interpretable and empirically grounded framework for understanding belief-driven return dynamics.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu
Robotics
ๆฅ็ๆ่ฆ
Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce \emph{EasyV2V}, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8\% on the high-resolution benchmark V*, surpassing the commercial proprietary model GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang et al.
ๆฅ็ๆ่ฆ
The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage ``Depth-Warp-Inpaint'' (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: \href{https://insta360-research-team.github.io/DAP_website/} {https://insta360-research-team.github.io/DAP\_website/}
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qihang Rao, Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
MultimodalComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose \textbf{SFTok}, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating \textbf{self-forcing guided visual reconstruction} and \textbf{debias-and-fitting training strategy}, SFTok resolves the training-inference inconsistency in multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou et al.
ๆฅ็ๆ่ฆ
We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen et al.
Multimodal
ๆฅ็ๆ่ฆ
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaixin Ding, Yang Zhou, Xi Chen, Miao Yang, Jiarong Ou et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches rely on costly manual curation or heuristic scoring based on single-dimensional features in Text-to-Image data filtering. Although meta-learning based method has been explored in LLM, there is no adaptation for image modalities. To this end, we propose **Alchemist**, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Jinghuan Shang, Harsh Patel, Ran Gong, Karl Schmeckpeper
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Synthetic 3D scenes are essential for developing Physical AI and generative models. Existing procedural generation methods often have low output throughput, creating a significant bottleneck in scaling up dataset creation. In this work, we introduce Sceniris, a highly efficient procedural scene generation framework for rapidly generating large-scale, collision-free scene variations. Sceniris also provides an optional robot reachability check, providing manipulation-feasible scenes for robot tasks. Sceniris is designed for maximum efficiency by addressing the primary performance limitations of the prior method, Scene Synthesizer. Leveraging batch sampling and faster collision checking in cuRobo, Sceniris achieves at least 234x speed-up over Scene Synthesizer. Sceniris also expands the object-wise spatial relationships available in prior work to support diverse scene requirements. Our code is available at https://github.com/rai-inst/sceniris
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaiwen Jiang, Xueting Li, Seonwook Park, Ravi Ramamoorthi, Shalini De Mello et al.
Robotics
ๆฅ็ๆ่ฆ
Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods -- built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting -- ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face's 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website is https://research.nvidia.com/labs/amri/projects/instant4d
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Norika Wada, Kohei Yamashita, Ryo Kawahara, Ko Nishino
ๆฅ็ๆ่ฆ
Knowledge of the physical material properties governing the dynamics of a real-world object becomes necessary to accurately anticipate its response to unseen interactions. Existing methods for estimating such physical material parameters from visual data assume homogeneous single-material objects, pre-learned dynamics, or simplistic topologies. Real-world objects, however, are often complex in material composition and geometry lying outside the realm of these assumptions. In this paper, we particularly focus on flowers as a representative common object. We introduce Multi-material Physical Gaussians (M-PhyGs) to estimate the material composition and parameters of such multi-material complex natural objects from video. From a short video captured in a natural setting, M-PhyGs jointly segments the object into similar materials and recovers their continuum mechanical parameters while accounting for gravity. M-PhyGs achieves this efficiently with newly introduced cascaded 3D and 2D losses, and by leveraging temporal mini-batching. We introduce a dataset, Phlowers, of people interacting with flowers as a novel platform to evaluate the accuracy of this challenging task of multi-material physical parameter estimation. Experimental results on Phlowers dataset demonstrate the accuracy and effectiveness of M-PhyGs and its components.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch
ๆฅ็ๆ่ฆ
Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Astrid Brull, Sara Aguti, Vรฉronique Bolduc, Ying Hu, Daniel M. Jimenez-Gutierrez et al.
Machine Learning
ๆฅ็ๆ่ฆ
The application of Machine Learning (ML) to the diagnosis of rare diseases, such as collagen VI-related dystrophies (COL6-RD), is fundamentally limited by the scarcity and fragmentation of available data. Attempts to expand sampling across hospitals, institutions, or countries with differing regulations face severe privacy, regulatory, and logistical obstacles that are often difficult to overcome. The Federated Learning (FL) provides a promising solution by enabling collaborative model training across decentralized datasets while keeping patient data local and private. Here, we report a novel global FL initiative using the Sherpa.ai FL platform, which leverages FL across distributed datasets in two international organizations for the diagnosis of COL6-RD, using collagen VI immunofluorescence microscopy images from patient-derived fibroblast cultures. Our solution resulted in an ML model capable of classifying collagen VI patient images into the three primary pathogenic mechanism groups associated with COL6-RD: exon skipping, glycine substitution, and pseudoexon insertion. This new approach achieved an F1-score of 0.82, outperforming single-organization models (0.57-0.75). These results demonstrate that FL substantially improves diagnostic utility and generalizability compared to isolated institutional models. Beyond enabling more accurate diagnosis, we anticipate that this approach will support the interpretation of variants of uncertain significance and guide the prioritization of sequencing strategies to identify novel pathogenic variants.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tomรกลก Souฤek, Pierre Fernandez, Hady Elsahar, Sylvestre-Alvise Rebuffi, Valeriu Lacatusu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu et al.
ๆฅ็ๆ่ฆ
Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is more well-aligned with human judgment and argue is less likely to drift from human-alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Emmanuel D. Muรฑiz-De-Leรณn, Jorge A. Rosales-de-Golferichs, Ana S. Muรฑoz-Rodrรญguez, Alejandro I. Trejo-Castro, Eduardo de Avila-Armenta et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 -> 0.238) and Micro-F1 by 146% (0.137 -> 0.337), while broader performance across 14 observations improved by 86% (0.170 -> 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: https://github.com/devMuniz02/UDEM-CXR-Reporting-Thesis-2025.
๐
2025-12-18
๐ 6 pages, 5 figures. Accepted and published in the
๐ฅ Arslan Amin, Rafia Mumtaz, Muhammad Jawad Bashir, Syed Mohammad Hassan Zaidi
ๆฅ็ๆ่ฆ
In the evolving landscape of traffic management and vehicle surveillance, efficient license plate detection and recognition are indispensable. Historically, many methodologies have tackled this challenge, but consistent real-time accuracy, especially in diverse environments, remains elusive. This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and Character Recognition tasks, crucial for advancing Intelligent Transportation Systems. Two distinct datasets were employed for training and evaluation, yielding notable findings. The YOLOv8 Nano variant demonstrated a precision of 0.964 and mAP50 of 0.918 on the LPR task, while the YOLOv8 Small variant exhibited a precision of 0.92 and mAP50 of 0.91 on the Character Recognition task. A custom method for character sequencing was introduced, effectively sequencing the detected characters based on their x-axis positions. An optimized pipeline, utilizing YOLOv8 Nano for LPR and YOLOv8 Small for Character Recognition, is proposed. This configuration not only maintains computational efficiency but also ensures high accuracy, establishing a robust foundation for future real-world deployments on edge devices within Intelligent Transportation Systems. This effort marks a significant stride towards the development of smarter and more efficient urban infrastructures.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marius Dรคhling, Sebastian Krebs, J. Marius Zรถllner
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
In current research, Bird's-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors. It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. Code is available at https://github.com/mdaehl/DenseBEV.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
๐
2025-12-18
๐ AAAI 2026
๐ฅ Shuting Zhao, Zeyu Xiao, Xinrong Chen
ๆฅ็ๆ่ฆ
Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: https://kaka-1314.github.io/KineST/
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Simone Teglia, Claudia Melis Tonti, Francesco Pro, Leonardo Russo, Andrea Alfarano et al.
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Datasets are essential to train and evaluate computer vision models used for traffic analysis and to enhance road safety. Existing real datasets fit real-world scenarios, capturing authentic road object behaviors, however, they typically lack precise ground-truth annotations. In contrast, synthetic datasets play a crucial role, allowing for the annotation of a large number of frames without additional costs or extra time. However, a general drawback of synthetic datasets is the lack of realistic vehicle motion, since trajectories are generated using AI models or rule-based systems. In this work, we introduce R3ST (Realistic 3D Synthetic Trajectories), a synthetic dataset that overcomes this limitation by generating a synthetic 3D environment and integrating real-world trajectories derived from SinD, a bird's-eye-view dataset recorded from drone footage. The proposed dataset closes the gap between synthetic data and realistic trajectories, advancing the research in trajectory forecasting of road vehicles, offering both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Enis Baty, C. P. Bridges, Simon Hadfield
Computer Vision
ๆฅ็ๆ่ฆ
We present FlowDet, the first formulation of object detection using modern Conditional Flow Matching techniques. This work follows from DiffusionDet, which originally framed detection as a generative denoising problem in the bounding box space via diffusion. We revisit and generalise this formulation to a broader class of generative transport problems, while maintaining the ability to vary the number of boxes and inference steps without re-training. In contrast to the curved stochastic transport paths induced by diffusion, FlowDet learns simpler and straighter paths resulting in faster scaling of detection performance as the number of inference steps grows. We find that this reformulation enables us to outperform diffusion based detection systems (as well as non-generative baselines) across a wide range of experiments, including various precision/recall operating points using multiple feature backbones and datasets. In particular, when evaluating under recall-constrained settings, we can highlight the effects of the generative transport without over-compensating with large numbers of proposals. This provides gains of up to +3.6% AP and +4.2% AP$_{rare}$ over DiffusionDet on the COCO and LVIS datasets, respectively.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiyang Guo, Ori Zhang, Jax Xiang, Alan Zhao, Wengang Zhou et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Posing 3D characters is a fundamental task in computer graphics and vision. However, existing methods like auto-rigging and pose-conditioned generation often struggle with challenges such as inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, limiting their robustness and generalizability. To overcome these limitations, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a latent-space transformation problem. Instead of deforming mesh vertices as in traditional pipelines, our method reconstructs the character in new poses by directly manipulating its latent representation. At the core of our method is a latent posing transformer that manipulates shape tokens based on skeletal motion. This process is facilitated by a dense pose representation for precise control. To ensure high-fidelity geometry and accommodate topological changes, we also introduce a latent-space supervision strategy and an adaptive completion module. Our method demonstrates superior performance in posing quality. It also naturally extends to 3D editing applications like part replacement and refinement.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mahadev Prasad Panda, Purnachandra Rao Makkena, Srivatsa Prativadibhayankaram, Siegfried Fรถรel, Andrรฉ Kaup
ๆฅ็ๆ่ฆ
Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in BD-rate over JPEG AI, while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insights into the factors contributing to reconstruction.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haochen Chang, Pengfei Ren, Buyuan Zhang, Da Li, Tianhao Han et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6\% in detection rate, establishing a strong baseline for online micro gesture recognition. Project page: https://omg-bench.github.io/
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Yixiang Chen, Yan Huang, Keji He, Peiyan Li, Liang Wang
Robotics
ๆฅ็ๆ่ฆ
When performing 3D manipulation tasks, robots have to execute action planning based on perceptions from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Robotic Manipulation) method, leveraging the knowledge in foundation models to imagine a virtual task-adaptive view from the constructed 3D point cloud, which efficiently captures necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both simulation benchmark RLBench and real-world evaluations demonstrate the effectiveness of our method, surpassing previous state-of-the-art methods while achieving 1.89x speedup in training time and 1.54x speedup in inference speed. More results can be found on our project website at https://verm-ral.github.io .
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chiara Di Vece, Zhehua Mao, Netanell Avisdris, Brian Dromey, Raffaele Napolitano et al.
ๆฅ็ๆ่ฆ
Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset contains 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Antonella Rech, Nicola Conci, Nicola Garau
ๆฅ็ๆ่ฆ
Neural radiance fields (NeRF) have driven impressive progress in view synthesis by using ray-traced volumetric rendering. Splatting-based methods such as 3D Gaussian Splatting (3DGS) provide faster rendering by rasterizing 3D primitives. RadiantFoam (RF) brought ray tracing back, achieving throughput comparable to Gaussian Splatting by organizing radiance with an explicit Voronoi Diagram (VD). Yet, all the mentioned methods still struggle with precise mesh reconstruction. We address this gap by jointly learning an explicit VD with an implicit Signed Distance Field (SDF). The scene is optimized via ray tracing and regularized by an Eikonal objective. The SDF introduces metric-consistent isosurfaces, which, in turn, bias near-surface Voronoi cell faces to align with the zero level set. The resulting model produces crisper, view-consistent surfaces with fewer floaters and improved topology, while preserving photometric quality and maintaining training speed on par with RadiantFoam. Across diverse scenes, our hybrid implicit-explicit formulation, which we name SDFoam, substantially improves mesh reconstruction accuracy (Chamfer distance) with comparable appearance (PSNR, SSIM), without sacrificing efficiency.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Serafino Pandolfini, Lorenzo Pellegrini, Matteo Ferrara, Davide Maltoni
Robotics
ๆฅ็ๆ่ฆ
The rapid progress of generative AI has enabled highly realistic image manipulations, including inpainting and region-level editing. These approaches preserve most of the original visual context and are increasingly exploited in cybersecurity-relevant threat scenarios. While numerous detectors have been proposed for identifying fully synthetic images, their ability to generalize to localized manipulations remains insufficiently characterized. This work presents a systematic evaluation of state-of-the-art detectors, originally trained for the deepfake detection on fully synthetic images, when applied to a distinct challenge: localized inpainting detection. The study leverages multiple datasets spanning diverse generators, mask sizes, and inpainting techniques. Our experiments show that models trained on a large set of generators exhibit partial transferability to inpainting-based edits and can reliably detect medium- and large-area manipulations or regeneration-style inpainting, outperforming many existing ad hoc detection approaches.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gonรงalo Gaspar Alves, Shekoufeh Gorgi Zadeh, Andreas Husch, Ben Bausch
ๆฅ็ๆ่ฆ
Combining open-source datasets can introduce data leakage if the same subject appears in multiple sets, leading to inflated model performance. To address this, we explore subject fingerprinting, mapping all images of a subject to a distinct region in latent space, to enable subject re-identification via similarity matching. Using a ResNet-50 trained with triplet margin loss, we evaluate few-shot fingerprinting on 3D MRI and 2D X-ray data in both standard (20-way 1-shot) and challenging (1000-way 1-shot) scenarios. The model achieves high Mean- Recall-@-K scores: 99.10% (20-way 1-shot) and 90.06% (500-way 5-shot) on ChestXray-14; 99.20% (20-way 1-shot) and 98.86% (100-way 3-shot) on BraTS- 2021.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ole Beisswenger, Jan-Niklas Dihlmann, Hendrik P. A. Lensch
Robotics
ๆฅ็ๆ่ฆ
Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming sets ups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the models own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporal consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Giorgos Petsangourakis, Christos Sgouropoulos, Bill Psomas, Theodoros Giannakopoulos, Giorgos Sfikas et al.
ๆฅ็ๆ่ฆ
Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue .
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Danxu Liu, Di Wang, Hebaixu Wang, Haoyang Chen, Wentao Jiang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available at https://github.com/MiliLab/SARMAE.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Linghui Shen, Mingyue Cui, Xingyi Yang
MultimodalComputer VisionRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decouples the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kanwal Aftab, Graham Adams, Mark Scanlon
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Computer vision is a rapidly evolving field, giving rise to powerful new tools and techniques in digital forensic investigation, and shows great promise for novel digital forensic applications. One such application, indoor multimedia geolocation, has the potential to become a crucial aid for law enforcement in the fight against human trafficking, child exploitation, and other serious crimes. While outdoor multimedia geolocation has been widely explored, its indoor counterpart remains underdeveloped due to challenges such as similar room layouts, frequent renovations, visual ambiguity, indoor lighting variability, unreliable GPS signals, and limited datasets in sensitive domains. This paper introduces a pipeline that uses electric sockets as consistent indoor markers for geolocation, since plug socket types are standardised by country or region. The three-stage deep learning pipeline detects plug sockets (YOLOv11, mAP@0.5 = 0.843), classifies them into one of 12 plug socket types (Xception, accuracy = 0.912), and maps the detected socket types to countries (accuracy = 0.96 at >90% threshold confidence). To address data scarcity, two dedicated datasets were created: socket detection dataset of 2,328 annotated images expanded to 4,072 through augmentation, and a classification dataset of 3,187 images across 12 plug socket classes. The pipeline was evaluated on the Hotels-50K dataset, focusing on the TraffickCam subset of crowd-sourced hotel images, which capture real-world conditions such as poor lighting and amateur angles. This dataset provides a more realistic evaluation than using professional, well-lit, often wide-angle images from travel websites. This framework demonstrates a practical step toward real-world digital forensic applications. The code, trained models, and the data for this paper are available open source.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Zhou, Zeqi Xiao, Tianyi Wei, Shuai Yang, Xingang Pan
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representation and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: https://github.com/SingleZombie/LLSA
๐
2025-12-18
๐ ArXiv Preprint (cs.MA)
๐ฅ Giulia Boato, Andrea Montibeller, Edward Delp, Luisa Verdoliva, Daniele Miorandi
ๆฅ็ๆ่ฆ
AI is reshaping the landscape of multimedia forensics. We propose AI forensic agents: reliable orchestrators that select and combine forensic detectors, identify provenance and context, and provide uncertainty-aware assessments. We highlight pitfalls in current solutions and introduce a unified framework to improve the authenticity verification process.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ayush Bhavsar
ๆฅ็ๆ่ฆ
This paper introduces Hazedefy, a lightweight and application-focused dehazing pipeline intended for real-time video and live camera feed enhancement. Hazedefy prioritizes computational simplicity and practical deployability on consumer-grade hardware, building upon the Dark Channel Prior (DCP) concept and the atmospheric scattering model. Key elements include gamma-adaptive reconstruction, a fast transmission approximation with lower bounds for numerical stability, a stabilized atmospheric light estimator based on fractional top-pixel averaging, and an optional color balance stage. The pipeline is suitable for mobile and embedded applications, as experimental demonstrations on real-world images and videos show improved visibility and contrast without requiring GPU acceleration.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shaohua Wu, Tong Yu, Shenling Wang, Xudong Zhao
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Diffusion models have shown remarkable capacity in image synthesis based on their U-shaped architecture and convolutional neural networks (CNN) as basic blocks. The locality of the convolution operation in CNN may limit the model's ability to understand long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model with Swin-transformer in this work. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder, to improve the non-local modeling ability in feature extraction and image restoration. The text-image alignment is improved with a well-chosen text encoder, effective utilization of text embedding, and careful design in the incorporation of text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves the state-of-the-art FID score of 1.37 on ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from the human-painted ones.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jintao Tong, Jiaqi Gu, Yujing Lou, Lubin Fan, Yixiong Zou et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Unlike current works that take predefined external toolkits or generate images during thinking, however, humans can form flexible visual-text imagination and interactions during thinking without predefined toolkits, where one important reason is that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between textual thinking mode for generating textual think tokens and visual sketching mode for generating latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks. Codes will be released at https://github.com/TungChintao/SkiLa.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nico Albert Disch, Saikat Roy, Constantin Ulrich, Yannick Kirchhoff, Maximilian Rokuss et al.
ๆฅ็ๆ่ฆ
Forecasting how 3D medical scans evolve over time is important for disease progression, treatment planning, and developmental assessment. Yet existing models either rely on a single prior scan, fixed grid times, or target global labels, which limits voxel-level forecasting under irregular sampling. We present CRONOS, a unified framework for many-to-one prediction from multiple past scans that supports both discrete (grid-based) and continuous (real-valued) timestamps in one model, to the best of our knowledge the first to achieve continuous sequence-to-image forecasting for 3D medical data. CRONOS learns a spatio-temporal velocity field that transports context volumes toward a target volume at an arbitrary time, while operating directly in 3D voxel space. Across three public datasets spanning Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms other baselines, while remaining computationally competitive. We will release code and evaluation protocols to enable reproducible, multi-dataset benchmarking of multi-context, continuous-time forecasting.
๐
2025-12-18
๐ AAAI 2026
๐ฅ Yin Zhang, Yongqiang Zhang, Yaoyue Zheng, Bogdan Raducanu, Dan Liu
ๆฅ็ๆ่ฆ
Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, improving +4.8% mIoU over the baseline in snow conditions.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kirill Mazur, Marwan Taher, Andrew J. Davison
Computer Vision
ๆฅ็ๆ่ฆ
We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps.
Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity.
The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage for 3D object grounding data, yielding over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning in vision-language model.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hossein Javidnia
ๆฅ็ๆ่ฆ
Accurate intrinsic decomposition of face images under unconstrained lighting is a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. This paper introduces MAGINet, a Multi-scale Attention-Guided Intrinsics Network that predicts a $512\times512$ light-normalized diffuse albedo map from a single RGB portrait. MAGINet employs hierarchical residual encoding, spatial-and-channel attention in a bottleneck, and adaptive multi-scale feature fusion in the decoder, yielding sharper albedo boundaries and stronger lighting invariance than prior U-Net variants. The initial albedo prediction is upsampled to $1024\times1024$ and refined by a lightweight three-layer CNN (RefinementNet). Conditioned on this refined albedo, a Pix2PixHD-based translator then predicts a comprehensive set of five additional physically based rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour (with residual lighting). Together with the refined albedo, these six passes form the complete intrinsic decomposition. Trained with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses on the FFHQ-UV-Intrinsics dataset, the full pipeline achieves state-of-the-art performance for diffuse albedo estimation and demonstrates significantly improved fidelity for the complete rendering stack compared to prior methods. The resulting passes enable high-quality relighting and material editing of real faces.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiushuo Cheng, Jingjing Liu, Catherine Morgan, Alan Whone, Majid Mirmehdi
Machine Learning
ๆฅ็ๆ่ฆ
The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level {action} recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia et al.
Multimodal
ๆฅ็ๆ่ฆ
GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-word applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks, still favor GUI-specialized models, though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.
๐
2025-12-18
๐ IEEE Transactions on Image Processing (T-IP)
๐ฅ Mengyuan Liu, Jiajie Liu, Jinyan Zhang, Wenhao Li, Junsong Yuan
Computer Vision
ๆฅ็ๆ่ฆ
The lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. The lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: (1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module is proposed to aggregate cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huma Hafeez, Matthew Garratt, Jo Plested, Sankaran Iyer, Arcot Sowmya
Computer VisionRobotics
ๆฅ็ๆ่ฆ
The processing of omnidirectional 360-degree images poses significant challenges for object detection due to inherent spatial distortions, wide fields of view, and ultra-high-resolution inputs. Conventional detectors such as YOLO are optimised for standard image sizes (for example, 640x640 pixels) and often struggle with the computational demands of 4K or higher-resolution imagery typical of 360-degree vision. To address these limitations, we introduce YOLO11-4K, an efficient real-time detection framework tailored for 4K panoramic images. The architecture incorporates a novel multi-scale detection head with a P2 layer to improve sensitivity to small objects often missed at coarser scales, and a GhostConv-based backbone to reduce computational complexity without sacrificing representational power. To enable evaluation, we manually annotated the CVIP360 dataset, generating 6,876 frame-level bounding boxes and producing a publicly available, detection-ready benchmark for 4K panoramic scenes. YOLO11-4K achieves 0.95 mAP at 0.50 IoU with 28.3 milliseconds inference per frame, representing a 75 percent latency reduction compared to YOLO11 (112.3 milliseconds), while also improving accuracy (mAP at 0.50 of 0.95 versus 0.908). This balance of efficiency and precision enables robust object detection in expansive 360-degree environments, making the framework suitable for real-world high-resolution panoramic applications. While this work focuses on 4K omnidirectional images, the approach is broadly applicable to high-resolution detection tasks in autonomous navigation, surveillance, and augmented reality.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kejun Liu, Yuanyuan Liu, Lin Wei, Chang Tang, Yibing Zhan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, spontaneous emotion induction paradigm is exploited with stimulus material, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for mutimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Li, Yahan Yu, Youyuan Lin, Yong-Hao Yang, Chenhui Chu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Senmao Li, Kai Wang, Salman Khan, Fahad Shahbaz Khan, Jian Yang et al.
Computer Vision
ๆฅ็ๆ่ฆ
Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, but rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present StageVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, StageVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed StageVAR achieves up to 3.4x speedup with only a 0.01 drop on GenEval and a 0.26 decrease on DPG, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment, and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Masashi Hatano, Saptarshi Sinha, Jacob Chalk, Wei-Hong Li, Hideo Saito et al.
ๆฅ็ๆ่ฆ
Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick up or put down -- that is, the spotting of an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, i.e., HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned diffusion-based motion generation model, then fine-tune it conditioned on goal pose or location, on our curated sequences. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including the 'Reach Success' and a newly introduced 'Prime Success' metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shangxun Li, Youngjung Uh
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Muhammad Ibraheem Siddiqui, Muhammad Haris Khan
ๆฅ็ๆ่ฆ
Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES superior performance among ZOC methods while generalizing effectively across natural, aerial and medical domains.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Liyuan Deng, Hao Guo, Yunpeng Bai, Yongkang Dai, Huaxi Huang et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Current token-sequence-based Large Language Models (LLMs) are not well-suited for directly processing 3D Boundary Representation (Brep) models that contain complex geometric and topological information. We propose BrepLLM, the first framework that enables LLMs to parse and reason over raw Brep data, bridging the modality gap between structured 3D geometry and natural language. BrepLLM employs a two-stage training pipeline: Cross-modal Alignment Pre-training and Multi-stage LLM Fine-tuning. In the first stage, an adaptive UV sampling strategy converts Breps into graphs representation with geometric and topological information. We then design a hierarchical BrepEncoder to extract features from geometry (i.e., faces and edges) and topology, producing both a single global token and a sequence of node tokens. Then we align the global token with text embeddings from a frozen CLIP text encoder (ViT-L/14) via contrastive learning. In the second stage, we integrate the pretrained BrepEncoder into an LLM. We then align its sequence of node tokens using a three-stage progressive training strategy: (1) training an MLP-based semantic mapping from Brep representation to 2D with 2D-LLM priors. (2) performing fine-tuning of the LLM. (3) designing a Mixture-of-Query Experts (MQE) to enhance geometric diversity modeling. We also construct Brep2Text, a dataset comprising 269,444 Brep-text question-answer pairs. Experiments show that BrepLLM achieves state-of-the-art (SOTA) results on 3D object classification and captioning tasks.
๐
2025-12-18
๐ CVPR 2026
๐ฅ Haodi He, Jihun Yu, Ronald Fedkiw
ๆฅ็ๆ่ฆ
We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline. The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhanwei Li, Liang Li, Jiawan Zhang
Computer Vision
ๆฅ็ๆ่ฆ
High-quality annotated data plays a crucial role in achieving accurate segmentation. However, such data for medical image segmentation are often scarce due to the time-consuming and labor-intensive nature of manual annotation. To address this challenge, we propose the Adaptive Frequency Domain Alignment Network (AFDAN)--a novel domain adaptation framework designed to align features in the frequency domain and alleviate data scarcity. AFDAN integrates three core components to enable robust cross-domain knowledge transfer: an Adversarial Domain Learning Module that transfers features from the source to the target domain; a Source-Target Frequency Fusion Module that blends frequency representations across domains; and a Spatial-Frequency Integration Module that combines both frequency and spatial features to further enhance segmentation accuracy across domains. Extensive experiments demonstrate the effectiveness of AFDAN: it achieves an Intersection over Union (IoU) of 90.9% for vitiligo segmentation in the newly constructed VITILIGO2025 dataset and a competitive IoU of 82.6% on the retinal vessel segmentation benchmark DRIVE, surpassing existing state-of-the-art approaches.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mariam Hassan, Bastien Van Delft, Wuyang Li, Alexandre Alahi
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haotian Ling, Zequn Chen, Qiuying Chen, Donglin Di, Yongjia Ma et al.
ๆฅ็ๆ่ฆ
Consistent pose-driven character animation has achieved remarkable progress in single-character scenarios. However, extending these advances to multi-character settings is non-trivial, especially when position swap is involved. Beyond mere scaling, the core challenge lies in enforcing correct Identity Correspondence (IC) between characters in reference and generated frames. To address this, we introduce EverybodyDance, a systematic solution targeting IC correctness in multi-character animation. EverybodyDance is built around the Identity Matching Graph (IMG), which models characters in the generated and reference frames as two node sets in a weighted complete bipartite graph. Edge weights, computed via our proposed Mask-Query Attention (MQA), quantify the affinity between each pair of characters. Our key insight is to formalize IC correctness as a graph structural metric and to optimize it during training. We also propose a series of targeted strategies tailored for multi-character animation, including identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling, which work synergistically. Finally, to evaluate IC performance, we curate the Identity Correspondence Evaluation benchmark, dedicated to multi-character IC correctness. Extensive experiments demonstrate that EverybodyDance substantially outperforms state-of-the-art baselines in both IC and visual fidelity.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Hu, Weiyu Zhou, Yanjie Tu, Peng Wu, Wei Dong et al.
ๆฅ็ๆ่ฆ
Pre-trained Latent Diffusion Models (LDMs) have recently shown strong perceptual priors for low-level vision tasks, making them a promising direction for multi-exposure High Dynamic Range (HDR) reconstruction. However, directly applying LDMs to HDR remains challenging due to: (1) limited dynamic-range representation caused by 8-bit latent compression, (2) high inference cost from multi-step denoising, and (3) content hallucination inherent to generative nature. To address these challenges, we introduce GMODiff, a gain map-driven one-step diffusion framework for multi-exposure HDR reconstruction. Instead of reconstructing full HDR content, we reformulate HDR reconstruction as a conditionally guided Gain Map (GM) estimation task, where the GM encodes the extended dynamic range while retaining the same bit depth as LDR images. We initialize the denoising process from an informative regression-based estimate rather than pure noise, enabling the model to generate high-quality GMs in a single denoising step. Furthermore, recognizing that regression-based models excel in content fidelity while LDMs favor perceptual quality, we leverage regression priors to guide both the denoising process and latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Extensive experiments demonstrate that our GMODiff performs favorably against several state-of-the-art methods and is 100 faster than previous LDM-based methods.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Soochang Song, Yongjune Kim
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nan Zhou, Zuxin Li, Fanhang Man, Xuecheng Chen, Susu Xu et al.
ๆฅ็ๆ่ฆ
This paper addresses the challenge of achieving optimal Quality of Information (QoI) in non-dedicated vehicular mobile crowdsensing (NVMCS) systems. The key obstacles are the interrelated issues of sensing coverage, sensing reliability, and the dynamic participation of vehicles. To tackle these, we propose QUIDS, a QUality-informed Incentive-driven multi-agent Dispatching System, which ensures high sensing coverage and reliability under budget constraints. QUIDS introduces a novel metric, Aggregated Sensing Quality (ASQ), to quantitatively capture QoI by integrating both coverage and reliability. We also develop a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that estimates sensing reliability and allocates incentives under uncertainty, further improving ASQ. Evaluation using real-world data from a metropolitan NVMCS deployment shows QUIDS improves ASQ by 38% over non-dispatching scenarios and by 10% over state-of-the-art methods. It also reduces reconstruction map errors by 39-74% across algorithms. By jointly optimizing coverage and reliability via a quality-informed incentive mechanism, QUIDS enables low-cost, high-quality urban monitoring without dedicated infrastructure, applicable to smart-city scenarios like traffic and environmental sensing.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huayu Huang, Chen Chen, Banglei Guan, Ze Tan, Yang Shang et al.
ๆฅ็ๆ่ฆ
Tracking and measuring targets using a variety of sensors mounted on UAVs is an effective means to quickly and accurately locate the target. This paper proposes a fusion localization method based on ridge estimation, combining the advantages of rich scene information from sequential imagery with the high precision of laser ranging to enhance localization accuracy. Under limited conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix have serious multicollinearity when using the least squares estimation algorithm. The multicollinearity will lead to ill-conditioned problems, resulting in significant instability and low robustness. Ridge estimation is introduced to mitigate the serious multicollinearity under the condition of limited observation. Experimental results demonstrate that our method achieves higher localization accuracy compared to ground localization algorithms based on single information. Moreover, the introduction of ridge estimation effectively enhances the robustness, particularly under limited observation conditions.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haiyu Zhao, Yiwen Shan, Yuanbiao Gou, Xi Peng
ๆฅ็ๆ่ฆ
Recent studies have explored all-in-one video restoration, which handles multiple degradations with a unified model. However, these approaches still face two challenges when dealing with time-varying degradations. First, the degradation can dominate temporal modeling, confusing the model to focus on artifacts rather than the video content. Second, current methods typically rely on large models to handle all-in-one restoration, concealing those underlying difficulties. To address these challenges, we propose a lightweight all-in-one video restoration network, LaverNet, with only 362K parameters. To mitigate the impact of degradations on temporal modeling, we introduce a novel propagation mechanism that selectively transmits only degradation-agnostic features across frames. Through LaverNet, we demonstrate that strong all-in-one restoration can be achieved with a compact network. Despite its small size, less than 1\% of the parameters of existing models, LaverNet achieves comparable, even superior performance across benchmarks.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Feng Liang, Sizhe Cheng, Chenqi Yi
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Amna Amir, Erchan Aptoula
Machine Learning
ๆฅ็ๆ่ฆ
Semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns constitute significant challenges for multi-label remote-sensing image retrieval. In this article, Multi-Label Adaptive Contrastive Learning (MACL) is introduced as an extension of contrastive learning to address them. It integrates label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to achieve balanced representation learning across both common and rare categories. Extensive experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD), show that MACL consistently outperforms contrastive-loss based baselines, effectively mitigating semantic imbalance and delivering more reliable retrieval performance in large-scale remote-sensing archives. Code, pretrained models, and evaluation scripts will be released at https://github.com/amna/MACL upon acceptance.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Abouagour, Eleftherios Garyfallidis
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automated floor plan generation lies at the intersection of combinatorial search, geometric constraint satisfaction, and functional design requirements -- a confluence that has historically resisted a unified computational treatment. While recent deep learning approaches have improved the state of the art, they often struggle to capture architectural reasoning: the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions. To address these fundamental challenges, this paper introduces GFLAN, a generative framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization. Given a single exterior boundary and a front-door location, our approach departs from direct pixel-to-pixel or wall-tracing generation in favor of a principled two-stage decomposition. Stage A employs a specialized convolutional architecture with dual encoders -- separating invariant spatial context from evolving layout state -- to sequentially allocate room centroids within the building envelope via discrete probability maps over feasible placements. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented graph neural network (GNN) that jointly regresses room boundaries.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures reasoning ability of model to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Paloma Casteleiro Costa, Parnian Ghapandar Kashani, Xuhui Liu, Alexander Chen, Ary Portes et al.
Machine Learning
ๆฅ็ๆ่ฆ
Fluorescence lifetime imaging microscopy (FLIM) is a powerful quantitative technique that provides metabolic and molecular contrast, offering strong translational potential for label-free, real-time diagnostics. However, its clinical adoption remains limited by long pixel dwell times and low signal-to-noise ratio (SNR), which impose a stricter resolution-speed trade-off than conventional optical imaging approaches. Here, we introduce FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution (PSR) framework that reconstructs high-resolution FLIM images from data acquired with up to a 5-fold increased pixel size. The model is trained using the conditional generative adversarial network (cGAN) framework, which, compared to diffusion model-based alternatives, delivers a more robust PSR reconstruction with substantially shorter inference times, a crucial advantage for practical deployment. FLIM_PSR_k not only enables faster image acquisition but can also alleviate SNR limitations in autofluorescence-based FLIM. Blind testing on held-out patient-derived tumor tissue samples demonstrates that FLIM_PSR_k reliably achieves a super-resolution factor of k = 5, resulting in a 25-fold increase in the space-bandwidth product of the output images and revealing fine architectural features lost in lower-resolution inputs, with statistically significant improvements across various image quality metrics. By increasing FLIM's effective spatial resolution, FLIM_PSR_k advances lifetime imaging toward faster, higher-resolution, and hardware-flexible implementations compatible with low-numerical-aperture and miniaturized platforms, better positioning FLIM for translational applications.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qi Zhang, Yunfei Gong, Zhidan Xie, Zhizi Wang, Antoni B. Chan et al.
ๆฅ็ๆ่ฆ
Multi-view crowd counting has been proposed to deal with the severe occlusion issue of crowd counting in large and wide scenes. However, due to the difficulty of collecting and annotating multi-view images, the datasets for multi-view counting have a limited number of multi-view frames and scenes. To solve the problem of limited data, one approach is to collect synthetic data to bypass the annotating step, while another is to propose semi- or weakly-supervised or unsupervised methods that demand less multi-view data. In this paper, we propose two semi-supervised multi-view crowd counting frameworks by ranking the multi-view fusion models of different numbers of input views, in terms of the model predictions or the model uncertainties. Specifically, for the first method (vanilla model), we rank the multi-view fusion models' prediction results of different numbers of camera-view inputs, namely, the model's predictions with fewer camera views shall not be larger than the predictions with more camera views. For the second method, we rank the estimated model uncertainties of the multi-view fusion models with a variable number of view inputs, guided by the multi-view fusion models' prediction errors, namely, the model uncertainties with more camera views shall not be larger than those with fewer camera views. These constraints are introduced into the model training in a semi-supervised fashion for multi-view counting with limited labeled data. The experiments demonstrate the advantages of the proposed multi-view model ranking methods compared with other semi-supervised counting methods.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Satya Narayana Panda, Vaishnavi Kukkala, Spandana Iyer
Machine Learning
ๆฅ็ๆ่ฆ
Dermatological conditions affect 1.9 billion people globally, yet accurate diagnosis remains challenging due to limited specialist availability and complex clinical presentations. Family history significantly influences skin disease susceptibility and treatment responses, but is often underutilized in diagnostic processes. This research addresses the critical question: How can AI-powered systems integrate family history data with clinical imaging to enhance dermatological diagnosis while supporting clinical trial validation and real-world implementation?
We developed a comprehensive multi-modal AI framework that combines deep learning-based image analysis with structured clinical data, including detailed family history patterns. Our approach employs interpretable convolutional neural networks integrated with clinical decision trees that incorporate hereditary risk factors. The methodology includes prospective clinical trials across diverse healthcare settings to validate AI-assisted diagnosis against traditional clinical assessment.
In this work, validation was conducted with healthcare professionals to assess AI-assisted outputs against clinical expectations; prospective clinical trials across diverse healthcare settings are proposed as future work. The integrated AI system demonstrates enhanced diagnostic accuracy when family history data is incorporated, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. Expert feedback indicates potential for improved early detection and more personalized recommendations; formal clinical trials are planned. The framework is designed for integration into clinical workflows while maintaining interpretability through explainable AI mechanisms.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zichen Geng, Zeeshan Hayder, Wei Liu, Hesheng Wang, Ajmal Mian
ๆฅ็ๆ่ฆ
3D human reaction generation faces three main challenges:(1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding generated history instead of the ground-truth ones, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Justin Jiang
ๆฅ็ๆ่ฆ
Images are a substantial portion of the internet, making efficient compression important for reducing storage and bandwidth demands. This study investigates the use of Singular Value Decomposition and low-rank matrix approximations for image compression, evaluating performance using relative Frobenius error and compression ratio. The approach is applied to both grayscale and multichannel images to assess its generality. Results show that the low-rank approximations often produce images that appear visually similar to the originals, but the compression efficiency remains consistently worse than established formats such as JPEG, JPEG2000, and WEBP at comparable error levels. At low tolerated error levels, the compressed representation produced by Singular Value Decomposition can even exceed the size of the original image, indicating that this method is not competitive with industry-standard codecs for practical image compression.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhihao Zhang, Xuejun Yang, Weihua Liu, Mouquan Shen
ๆฅ็ๆ่ฆ
Single-view novel view synthesis (NVS) models based on diffusion models have recently attracted increasing attention, as they can generate a series of novel view images from a single image prompt and camera pose information as conditions. It has been observed that in diffusion models, certain high-quality initial noise patterns lead to better generation results than others. However, there remains a lack of dedicated learning frameworks that enable NVS models to learn such high-quality noise. To obtain high-quality initial noise from random Gaussian noise, we make the following contributions. First, we design a discretized Euler inversion method to inject image semantic information into random noise, thereby constructing paired datasets of random and high-quality noise. Second, we propose a learning framework based on an encoder-decoder network (EDN) that directly transforms random noise into high-quality noise. Experiments demonstrate that the proposed EDN can be seamlessly plugged into various NVS models, such as SV3D and MV-Adapter, achieving significant performance improvements across multiple datasets. Code is available at: https://github.com/zhihao0512/EDN.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Amit Vishwakarma, K. S. Subrahamanian Moosath
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Three-dimensional point clouds provide highly accurate digital representations of objects, essential for applications in computer graphics, photogrammetry, computer vision, and robotics. However, comparing point clouds faces significant challenges due to their unstructured nature and the complex geometry of the surfaces they represent. Traditional geometric metrics such as Hausdorff and Chamfer distances often fail to capture global statistical structure and exhibit sensitivity to outliers, while existing Kullback-Leibler (KL) divergence approximations for Gaussian Mixture Models can produce unbounded or numerically unstable values. This paper introduces an information geometric framework for 3D point cloud shape analysis by representing point clouds as Gaussian Mixture Models (GMMs) on a statistical manifold. We prove that the space of GMMs forms a statistical manifold and propose the Modified Symmetric Kullback-Leibler (MSKL) divergence with theoretically guaranteed upper and lower bounds, ensuring numerical stability for all GMM comparisons. Through comprehensive experiments on human pose discrimination (MPI-FAUST dataset) and animal shape comparison (G-PCD dataset), we demonstrate that MSKL provides stable and monotonically varying values that directly reflect geometric variation, outperforming traditional distances and existing KL approximations.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zilin Wang, Sangwoo Mo, Stella X. Yu, Sima Behpour, Liu Ren
Multimodal
ๆฅ็ๆ่ฆ
Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it.
Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective.
On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sarosij Bose, Ravi K. Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K. Roy-Chowdhury et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR:Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image re gions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jerrin Bright, Zhibo Wang, Dmytro Klepachevskyi, Yuhao Chen, Sirisha Rambhatla et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
We present Avatar4D, a real-world transferable pipeline for generating customizable synthetic human motion datasets tailored to domain-specific applications. Unlike prior works, which focus on general, everyday motions and offer limited flexibility, our approach provides fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without requiring any manual annotations. To validate the impact of Avatar4D, we focus on sports, where domain-specific human actions and movement patterns pose unique challenges for motion understanding. In this setting, we introduce Syn2Sport, a large-scale synthetic dataset spanning sports, including baseball and ice hockey. Avatar4D features high-fidelity 4D (3D geometry over time) human motion sequences with varying player appearances rendered in diverse environments. We benchmark several state-of-the-art pose estimation models on Syn2Sport and demonstrate their effectiveness for supervised learning, zero-shot transfer to real-world data, and generalization across sports. Furthermore, we evaluate how closely the generated synthetic data aligns with real-world datasets in feature space. Our results highlight the potential of such systems to generate scalable, controllable, and transferable human datasets for diverse domain-specific tasks without relying on domain-specific real data.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ M. Oltan Sevinc, Liao Wu, Francisco Cruz
ๆฅ็ๆ่ฆ
Although traditional cameras are the primary sensor for end-to-end driving, their performance suffers greatly when the conditions of the data they were trained on does not match the deployment environment, a problem known as the domain gap. In this work, we consider the day-night lighting difference domain gap. Instead of traditional cameras we propose event cameras as a potential alternative which can maintain performance across lighting condition domain gaps without requiring additional adjustments. Our results show that event cameras maintain more consistent performance across lighting conditions, exhibiting domain-shift penalties that are generally comparable to or smaller than grayscale frames and provide superior baseline performance in cross-domain scenarios.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chao Li, Dasha Hu, Chengyang Li, Yuming Jiang, Yuncheng Shen
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Unsupervised Domain Adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the signifi cant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distribu tion, but neglect conditional distribution discrepancies, lead ing to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these lim itations, the work proposes C-DGPA: Class-Centric Dual Alignment Generative Prompt Adaptation. C-DGPA syner gistically optimizes marginal distribution alignment and con ditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch em ploys a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the con ditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribu tion discrepancies by standardizing semantic prompt under standing and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowl edge into prompt learning via synergistic optimization, ensur ing domain-invariant and semantically discriminative repre sentations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yueyang Hu, Haiyong Jiang, Haoxuan Song, Jun Xiao, Hao Pan
Machine Learning
ๆฅ็ๆ่ฆ
This work presents a novel framework for few-shot 3D part segmentation. Recent advances have demonstrated the significant potential of 2D foundation models for low-shot 3D part segmentation. However, it is still an open problem that how to effectively aggregate 2D knowledge from foundation models to 3D. Existing methods either ignore geometric structures for 3D feature learning or neglects the high-quality grouping clues from SAM, leading to under-segmentation and inconsistent part labels. We devise a novel SAM segment graph-based propagation method, named SegGraph, to explicitly learn geometric features encoded within SAM's segmentation masks. Our method encodes geometric features by modeling mutual overlap and adjacency between segments while preserving intra-segment semantic consistency. We construct a segment graph, conceptually similar to an atlas, where nodes represent segments and edges capture their spatial relationships (overlap/adjacency). Each node adaptively modulates 2D foundation model features, which are then propagated via a graph neural network to learn global geometric structures. To enforce intra-segment semantic consistency, we map segment features to 3D points with a novel view-direction-weighted fusion attenuating contributions from low-quality segments. Extensive experiments on PartNet-E demonstrate that our method outperforms all competing baselines by at least 6.9 percent mIoU. Further analysis reveals that SegGraph achieves particularly strong performance on small components and part boundaries, demonstrating its superior geometric understanding. The code is available at: https://github.com/YueyangHu2000/SegGraph.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ze Yuan, Wenbin Li, Shusen Zhao
Machine Learning
ๆฅ็ๆ่ฆ
We propose a hybrid reconstruction framework for dual-spectral CT (DSCT) that integrates iterative methods with deep learning models. The reconstruction process consists of two complementary components: a knowledge-driven module and a data-driven module. In the knowledge-driven phase, we employ the oblique projection modification technique (OPMT) to reconstruct an intermediate solution of the basis material images from the projection data. We select OPMT for this role because of its fast convergence, which allows it to rapidly generate an intermediate solution that successfully achieves basis material decomposition. Subsequently, in the data-driven phase, we introduce a novel neural network, ResDynUNet++, to refine this intermediate solution. The ResDynUNet++ is built upon a UNet++ backbone by replacing standard convolutions with residual dynamic convolution blocks, which combine the adaptive, input-specific feature extraction of dynamic convolution with the stable training of residual connections. This architecture is designed to address challenges like channel imbalance and near-interface large artifacts in DSCT, producing clean and accurate final solutions. Extensive experiments on both synthetic phantoms and real clinical datasets validate the efficacy and superior performance of the proposed method.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ren Nakagawa, Yang Yang, Risa Shinoda, Hiroaki Santo, Kenji Oyama et al.
Machine Learning
ๆฅ็ๆ่ฆ
This paper introduces a method and application for automatically detecting behavioral interactions between grazing cattle from a single image, which is essential for smart livestock management in the cattle industry, such as for detecting estrus. Although interaction detection for humans has been actively studied, a non-trivial challenge lies in cattle interaction detection, specifically the lack of a comprehensive behavioral dataset that includes interactions, as the interactions of grazing cattle are rare events. We, therefore, propose CattleAct, a data-efficient method for interaction detection by decomposing interactions into the combinations of actions by individual cattle. Specifically, we first learn an action latent space from a large-scale cattle action dataset. Then, we embed rare interactions via the fine-tuning of the pre-trained latent space using contrastive learning, thereby constructing a unified latent space of actions and interactions. On top of the proposed method, we develop a practical working system integrating video and GPS inputs. Experiments on a commercial-scale pasture demonstrate the accurate interaction detection achieved by our method compared to the baselines. Our implementation is available at https://github.com/rakawanegan/CattleAct.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nikhil Prakash, Donghao Ren, Dominik Moritz, Yannick Assogba
ๆฅ็ๆ่ฆ
Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing specific tasks. Additionally, it has been shown that model performance improvement through fine-tuning often results from the strengthening of existing circuits in the model. Taken together, these findings suggest the possibility of intervening directly on such circuits to make precise, task-targeted updates. Motivated by these findings, we propose a novel method called Constructive Circuit Amplification which identifies pivotal tokens from model reasoning traces as well as model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. These results demonstrate that targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Pierre Fernandez, Tom Sander, Hady Elsahar, Hongyan Chang, Tomรกลก Souฤek et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore *post-hoc watermarking* where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents, or detect their use in training or RAG via watermark radioactivity. Unlike generation-time approaches, which is constrained by how LLMs are served, this setting offers additional degrees of freedom for both generation and detection. We investigate how allocating compute (through larger rephrasing models, beam search, multi-candidate generation, or entropy filtering at detection) affects the quality-detectability trade-off. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books. Among our findings, the simple Gumbel-max scheme surprisingly outperforms more recent alternatives under nucleus sampling, and most methods benefit significantly from beam search. However, most approaches struggle when watermarking verifiable text such as code, where we counterintuitively find that smaller models outperform larger ones. This study reveals both the potential and limitations of post-hoc watermarking, laying groundwork for practical applications and future research.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rahul Bhargava, Malene Hornstrup Jespersen, Emily Boardman Ndulue, Vivica Dsouza
Natural Language Processing
ๆฅ็ๆ่ฆ
AI technologies have rapidly moved into business and research applications that involve large text corpora, including computational journalism research and newsroom settings. These models, trained on extant data from various sources, can be conceptualized as historical artifacts that encode decades-old attitudes and stereotypes. This paper investigates one such example trained on the broadly-used New York Times Annotated Corpus to create a multi-label classifier. Our use in research settings surfaced the concerning "blacks" thematic topic label. Through quantitative and qualitative means we investigate this label's use in the training corpus, what concepts it might be encoding in the trained classifier, and how those concepts impact our model use. Via the application of explainable AI methods, we find that the "blacks" label operates partially as a general "racism detector" across some minoritized groups. However, it performs poorly against expectations on modern examples such as COVID-19 era anti-Asian hate stories, and reporting on the Black Lives Matter movement. This case study of interrogating embedded biases in a model reveals how similar applications in newsroom settings can lead to unexpected outputs that could impact a wide variety of potential uses of any large language model-story discovery, audience targeting, summarization, etc. The fundamental tension this exposes for newsrooms is how to adopt AI-enabled workflow tools while reducing the risk of reproducing historical biases in news coverage.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tzu-Han Lin, Wei-Lin Chen, Chen-An Li, Hung-yi Lee, Yun-Nung Chen et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
๐
2025-12-18
๐ Accepted and presented at 13th IEEE International
๐ฅ Harsh Vardhan Bansal
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-timeand large-scale deployment. While existing caching mechanisms,such as token-level key-value caches, offer speedups in autore-gressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic,operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching seman-tically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1 X speedup in inference time with <0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aditya Yadavalli, Tiago Pimentel, Tamar I Regev, Ethan Wilcox, Alex Warstadt
Natural Language Processing
ๆฅ็ๆ่ฆ
Prosody -- the melody of speech -- conveys critical information often not captured by the words or text of a message. In this paper, we propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text, and crucially, what that information is about. Our approach applies large speech and language models to estimate the mutual information between a particular dimension of an utterance's meaning (e.g., its emotion) and any of its communication channels (e.g., audio or text). We then use this approach to quantify how much information is conveyed by audio and text about sarcasm, emotion, and questionhood, using speech from television and podcasts. We find that for sarcasm and emotion the audio channel -- and by implication the prosodic channel -- transmits over an order of magnitude more information about these features than the text channel alone, at least when long-term context beyond the current sentence is unavailable. For questionhood, prosody provides comparatively less additional information. We conclude by outlining a program applying our approach to more dimensions of meaning, communication channels, and languages.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ William English, Dominic Simon, Sumit Kumar Jha, Rickard Ewetz
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into a lifting of atomic propositions (APs) phase and a translation phase. However, existing methods struggle with accurate lifting, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the lifting and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
๐
2025-12-18
๐ Will be published in IEEE BigData 2025 proceedings
๐ฅ Primoลพ Kocbek, Azra Frkatoviฤ-Hodลพiฤ, Dora Laliฤ, Vivian Hui, Gordan Lauc et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations-None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval (ColPali)-using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shubham Mishra, Samyek Jain, Gorang Mehrishi, Shiv Tiwari, Harsh Sharma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work address these issues independently but lack unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages : (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline is introduced which evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ William English, Chase Walker, Dominic Simon, Rickard Ewetz
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Natural language (NL) to temporal logic (TL) translation enables engineers to specify, verify, and enforce system behaviors without manually crafting formal specifications-an essential capability for building trustworthy autonomous systems. While existing NL-to-TL translation frameworks have demonstrated encouraging initial results, these systems either explicitly assume access to accurate atom grounding or suffer from low grounded translation accuracy. In this paper, we propose a framework for Grounding Natural Language Into System Signatures for Temporal Logic translation called GinSign. The framework introduces a grounding model that learns the abstract task of mapping NL spans onto a given system signature: given a lifted NL specification and a system signature $\mathcal{S}$, the classifier must assign each lifted atomic proposition to an element of the set of signature-defined atoms $\mathcal{P}$. We decompose the grounding task hierarchically -- first predicting predicate labels, then selecting the appropriately typed constant arguments. Decomposing this task from a free-form generation problem into a structured classification problem permits the use of smaller masked language models and eliminates the reliance on expensive LLMs. Experiments across multiple domains show that frameworks which omit grounding tend to produce syntactically correct lifted LTL that is semantically nonequivalent to grounded target expressions, whereas our framework supports downstream model checking and achieves grounded logical-equivalence scores of $95.5\%$, a $1.4\times$ improvement over SOTA.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3\% execution accuracy in Text-to-SQL over SynSQL, +7\% average improvements on code benchmarks, and 1--3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: \textbf{Is this complexity necessary?} We present \textbf{JustRL}, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9\% and 64.3\% average accuracy across nine mathematical benchmarks) while using 2$\times$ less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding ``standard tricks'' like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Iker Garcรญa-Ferrero, David Montero, Roman Orus
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analize the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yumeng Wang, Tianyu Fan, Lingrui Xu, Chao Huang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct domains. To ensure high query quality and answer uniqueness, we employ a flexible methodology that reliably generates queries of controllable difficulty based on factual claims of web contents. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle: many achieve below 35% accuracy, and none consistently excel across domains or difficulty levels. These findings reveal that Needle in the Web presents a significant challenge for current search systems and highlights the open problem of effective fuzzy retrieval under semantic ambiguity.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Primoz Kocbek, Gregor Stiglic
Natural Language Processing
ๆฅ็ๆ่ฆ
This work describes our submission to the CLEF 2025 SimpleText track Task 1, addressing both sentenceand document-level simplification of scientific texts. The methodology centered on using the gpt-4.1, gpt-4.1mini, and gpt-4.1-nano models from OpenAI. Two distinct approaches were compared: a no-context method relying on prompt engineering and a fine-tuned (FT) method across models. The gpt-4.1-mini model with no-context demonstrated robust performance at both levels of simplification, while the fine-tuned models showed mixed results, highlighting the complexities of simplifying text at different granularities, where gpt-4.1-nano-ft performance stands out at document-level simplification in one case.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Primoz Kocbek, Leon Kopitar, Gregor Stiglic
Natural Language Processing
ๆฅ็ๆ่ฆ
This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset, which included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two AI agent approach, and a fine-tuning approach. We selected OpenAI gpt-4o and gpt-4o mini models as baselines for further research. We evaluated our approaches with quantitative metrics, such as Flesch-Kincaid grade level, SMOG Index, SARI, and BERTScore, G-Eval, as well as with qualitative metric, more precisely 5-point Likert scales for simplicity, accuracy, completeness, brevity. Results showed a superior performance of gpt-4o-mini and an underperformance of FT approaches. G-Eval, a LLM based quantitative metric, showed promising results, ranking the approaches similarly as the qualitative metric.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Roman Akramov, Artem Khamatullin, Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko
Machine Learning
ๆฅ็ๆ่ฆ
Choosing the number of topics $T$ in Latent Dirichlet Allocation (LDA) is a key design decision that strongly affects both the statistical fit and interpretability of topic models. In this work, we formulate the selection of $T$ as a discrete black-box optimization problem, where each function evaluation corresponds to training an LDA model and measuring its validation perplexity. Under a fixed evaluation budget, we compare four families of optimizers: two hand-designed evolutionary methods - Genetic Algorithm (GA) and Evolution Strategy (ES) - and two learned, amortized approaches, Preferential Amortized Black-Box Optimization (PABBO) and Sharpness-Aware Black-Box Optimization (SABBO). Our experiments show that, while GA, ES, PABBO, and SABBO eventually reach a similar band of final perplexity, the amortized optimizers are substantially more sample- and time-efficient. SABBO typically identifies a near-optimal topic number after essentially a single evaluation, and PABBO finds competitive configurations within a few evaluations, whereas GA and ES require almost the full budget to approach the same region.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Hao Li, Yubing Ren, Yanan Cao, Yingjie Li, Fang Fang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Benefiting from the superior capabilities of large language models in natural language understanding and generation, Embeddings-as-a-Service (EaaS) has emerged as a successful commercial paradigm on the web platform. However, prior studies have revealed that EaaS is vulnerable to imitation attacks. Existing methods protect the intellectual property of EaaS through watermarking techniques, but they all ignore the most important properties of embedding: semantics, resulting in limited harmlessness and stealthiness. To this end, we propose SemMark, a novel semantic-based watermarking paradigm for EaaS copyright protection. SemMark employs locality-sensitive hashing to partition the semantic space and inject semantic-aware watermarks into specific regions, ensuring that the watermark signals remain imperceptible and diverse. In addition, we introduce the adaptive watermark weight mechanism based on the local outlier factor to preserve the original embedding distribution. Furthermore, we propose Detect-Sampling and Dimensionality-Reduction attacks and construct four scenarios to evaluate the watermarking method. Extensive experiments are conducted on four popular NLP datasets, and SemMark achieves superior verifiability, diversity, stealthiness, and harmlessness.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Darshil Chauhan, Adityasinh Solanki, Vansh Patel, Kanav Kapoor, Ritvik Jain et al.
ๆฅ็ๆ่ฆ
Automatic Speech Recognition (ASR) holds immense potential to streamline clinical documentation, such as digitizing handwritten prescriptions and reports, thereby increasing patient throughput and reducing costs in resource-constrained sectors like rural healthcare. However, realizing this utility is currently obstructed by significant technical barriers: strict data privacy constraints, limited computational resources, and severe acoustic domain shifts. We quantify this gap by showing that a robust multilingual model (IndicWav2Vec) degrades to a stark 40.94% Word Error Rate (WER) when deployed on real-world clinical audio (Gram Vaani), rendering it unusable for practical applications. To address these challenges and bring ASR closer to deployment, we propose an efficient, privacy-preserving adaptation framework. We employ Low-Rank Adaptation (LoRA) to enable continual learning from incoming data streams directly on edge devices, ensuring patient data confidentiality. Our strategy yields a 17.1% relative improvement in WER on the target domain. Furthermore, by integrating multi-domain experience replay, we reduce catastrophic forgetting by 47% compared to naive adaptation. These results demonstrate a viable pathway for building reliable, self-improving ASR systems that can operate effectively within the constraints of high-impact real-world environments.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilรฉm Zouhar, Carlos Escolano et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai
Machine Learning
ๆฅ็ๆ่ฆ
Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT'24 English-to-Japanese (En--Ja) and English-to-German (En--De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja--En and De--En.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Driven by Large Language Models, the single-agent, multi-tool architecture has become a popular paradigm for autonomous agents due to its simplicity and effectiveness. However, this architecture also introduces a new and severe privacy risk, which we term Tools Orchestration Privacy Risk (TOP-R), where an agent, to achieve a benign user goal, autonomously aggregates information fragments across multiple tools and leverages its reasoning capabilities to synthesize unexpected sensitive information. We provide the first systematic study of this risk. First, we establish a formal framework, attributing the risk's root cause to the agent's misaligned objective function: an overoptimization for helpfulness while neglecting privacy awareness. Second, we construct TOP-Bench, comprising paired leakage and benign scenarios, to comprehensively evaluate this risk. To quantify the trade-off between safety and robustness, we introduce the H-Score as a holistic metric. The evaluation results reveal that TOP-R is a severe risk: the average Risk Leakage Rate (RLR) of eight representative models reaches 90.24%, while the average H-Score is merely 0.167, with no model exceeding 0.3. Finally, we propose the Privacy Enhancement Principle (PEP) method, which effectively mitigates TOP-R, reducing the Risk Leakage Rate to 46.58% and significantly improving the H-Score to 0.624. Our work reveals both a new class of risk and inherent structural limitations in current agent architectures, while also offering feasible mitigation strategies.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He et al.
ๆฅ็ๆ่ฆ
Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yehor Tereshchenko, Mika Hรคmรคlรคinen, Svitlana Myroniuk
Natural Language Processing
ๆฅ็ๆ่ฆ
The evaluation of Large Language Models (LLMs) for translation tasks has primarily focused on high-resource languages, leaving a significant gap in understanding their performance on low-resource and endangered languages. This study presents a comprehensive comparison of OpenAI's GPT models, specifically examining the differences between reasoning and non-reasoning architectures for translating between Finnish and four low-resource Uralic languages: Komi-Zyrian, Moksha, Erzya, and Udmurt. Using a parallel corpus of literary texts, we evaluate model willingness to attempt translation through refusal rate analysis across different model architectures. Our findings reveal significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates. The results provide valuable insights for researchers and practitioners working with Uralic languages and contribute to the broader understanding of reasoning model capabilities for endangered language preservation.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yiliu Yang, Yilei Jiang, Qunzhong Wang, Yingshui Tan, Xiaoyong Zhu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Safety risks arise as large language model-based agents solve complex tasks with tools, multi-step plans, and inter-agent messages. However, deployer-written policies in natural language are ambiguous and context dependent, so they map poorly to machine-checkable rules, and runtime enforcement is unreliable. Expressing safety policies as sequents, we propose \textsc{QuadSentinel}, a four-agent guard (state tracker, policy verifier, threat watcher, and referee) that compiles these policies into machine-checkable rules built from predicates over observable state and enforces them online. Referee logic plus an efficient top-$k$ predicate updater keeps costs low by prioritizing checks and resolving conflicts hierarchically. Measured on ST-WebAgentBench (ICML CUA~'25) and AgentHarm (ICLR~'25), \textsc{QuadSentinel} improves guardrail accuracy and rule recall while reducing false positives. Against single-agent baselines such as ShieldAgent (ICML~'25), it yields better overall safety control. Near-term deployments can adopt this pattern without modifying core agents by keeping policies separate and machine-checkable. Our code will be made publicly available at https://github.com/yyiliu/QuadSentinel.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qingguo Hu, Zhenghao Lin, Ziyue Yang, Yucheng Ding, Xiao Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity compared to existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer, while activating only one expert for each token, resulting in 20B total parameters with just 0.5B activated. The major challenge introduced by such extreme sparsity lies in expert load balancing. We find that the widely-used load balancing loss tends to become ineffective in the lower layers under this setting. To address this issue, we propose a progressive sparsification schedule aiming to balance expert utilization and training stability. Sigma-MoE-Tiny is pre-trained on a diverse and high-quality corpus, followed by post-training to further unlock its capabilities. The entire training process remains remarkably stable, with no occurrence of irrecoverable loss spikes. Comprehensive evaluations reveal that, despite activating only 0.5B parameters, Sigma-MoE-Tiny achieves top-tier performance among counterparts of comparable or significantly larger scale. In addition, we provide an in-depth discussion of load balancing in highly sparse MoE models, offering insights for advancing sparsity in future MoE architectures.
Project page: https://qghuxmu.github.io/Sigma-MoE-Tiny
Code: https://github.com/microsoft/ltp-megatron-lm
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chenkai Xu, Yijie Jin, Jiajun Li, Yi Tu, Guoping Long et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1--3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). Then, we introduce Lookahead PArallel Decoding LoPA, a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on the GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qizhou Chen, Chengyu Wang, Taolin Zhang, Xiaofeng He
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have become indispensable tools in science, technology, and society, enabling transformative advances across diverse fields. However, errors or outdated information within these models can undermine their accuracy and restrict their safe deployment. Developing efficient strategies for updating model knowledge without the expense and disruption of full retraining remains a critical challenge. Current model editing techniques frequently struggle to generalize corrections beyond narrow domains, leading to unintended consequences and limiting their practical impact. Here, we introduce a novel framework for editing LLMs, grounded in information bottleneck theory. This approach precisely compresses and isolates the essential information required for generalizable knowledge correction while minimizing disruption to unrelated model behaviors. Building upon this foundation, we present the Information Bottleneck Knowledge Editor (IBKE), which leverages compact latent representations to guide gradient-based updates, enabling robust and broadly applicable model editing. We validate IBKE's effectiveness across multiple LLM architectures and standard benchmark tasks, demonstrating state-of-the-art accuracy and improved generality and specificity of edits. These findings establish a theoretically principled and practical paradigm for open-domain knowledge editing, advancing the utility and trustworthiness of LLMs in real-world applications.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Musarrat Zeba, Abdullah Al Mamun, Kishoar Jahan Tithee, Debopom Sutradhar, Mohaimenul Azam Khan Raiaan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In healthcare, it is essential for any LLM-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, the outputs are often unreliable in such critical areas due to the risk of hallucinated outputs from the LLMs. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRa) on the MIMIC III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and logical checks at a granular level through discrete logic in natural language processing (NLP) to validate facts against electronic health records (EHRs). We trained the LLM model on the full MIMIC-III dataset. For evaluation of the fact-checking module, we sampled 104 summaries, extracted them into 3,786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summary model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang et al.
ๆฅ็ๆ่ฆ
Structured information extraction from police incident announcements is crucial for timely and accurate data processing, yet presents considerable challenges due to the variability and informal nature of textual sources such as social media posts. To address these challenges, we developed a domain-adapted extraction pipeline that leverages targeted prompt engineering with parameter-efficient fine-tuning of the Qwen2.5-7B model using Low-Rank Adaptation (LoRA). This approach enables the model to handle noisy, heterogeneous text while reliably extracting 15 key fields, including location, event characteristics, and impact assessment, from a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo (2019-2020). Experimental results demonstrated that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models, achieving an accuracy exceeding 98.36% for mortality detection and Exact Match Rates of 95.31% for fatality counts and 95.54% for province-level location extraction. The proposed pipeline thus provides a validated and efficient solution for multi-task structured information extraction in specialized domains, offering a practical framework for transforming unstructured text into reliable structured data in social science research.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Hao Li, Yubing Ren, Yanan Cao, Yingjie Li, Fang Fang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
With the rapid development of cloud-based services, large language models (LLMs) have become increasingly accessible through various web platforms. However, this accessibility has also led to growing risks of model abuse. LLM watermarking has emerged as an effective approach to mitigate such misuse and protect intellectual property. Existing watermarking algorithms, however, primarily focus on defending against paraphrase attacks while overlooking piggyback spoofing attacks, which can inject harmful content, compromise watermark reliability, and undermine trust in attribution. To address this limitation, we propose DualGuard, the first watermarking algorithm capable of defending against both paraphrase and spoofing attacks. DualGuard employs the adaptive dual-stream watermarking mechanism, in which two complementary watermark signals are dynamically injected based on the semantic content. This design enables DualGuard not only to detect but also to trace spoofing attacks, thereby ensuring reliable and trustworthy watermark detection. Extensive experiments conducted across multiple datasets and language models demonstrate that DualGuard achieves excellent detectability, robustness, traceability, and text quality, effectively advancing the state of LLM watermarking for real-world applications.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Karthikeyan K, Philip Wu, Xin Tang, Alexandre Alves
ๆฅ็ๆ่ฆ
The Science Consultant Agent is a web-based Artificial Intelligence (AI) tool that helps practitioners select and implement the most effective modeling strategy for AI-based solutions. It operates through four core components: Questionnaire, Smart Fill, Research-Guided Recommendation, and Prototype Builder. By combining structured questionnaires, literature-backed solution recommendations, and prototype generation, the Science Consultant Agent accelerates development for everyone from Product Managers and Software Developers to Researchers. The full pipeline is illustrated in Figure 1.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yash Bhaskar, Sankalp Bahad, Parameswari Krishnamurthy
Natural Language Processing
ๆฅ็ๆ่ฆ
Social media platforms, while enabling global connectivity, have become hubs for the rapid spread of harmful content, including hate speech and fake narratives \cite{davidson2017automated, shu2017fake}. The Faux-Hate shared task focuses on detecting a specific phenomenon: the generation of hate speech driven by fake narratives, termed Faux-Hate. Participants are challenged to identify such instances in code-mixed Hindi-English social media text. This paper describes our system developed for the shared task, addressing two primary sub-tasks: (a) Binary Faux-Hate detection, involving fake and hate speech classification, and (b) Target and Severity prediction, categorizing the intended target and severity of hateful content. Our approach combines advanced natural language processing techniques with domain-specific pretraining to enhance performance across both tasks. The system achieved competitive results, demonstrating the efficacy of leveraging multi-task learning for this complex problem.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pengyu Wang, Shuchang Ye, Usman Naseem, Jinman Kim
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives which focus on word-choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, adopted on a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured "thinking report" outputs. We evaluate Medical Report Generation with Sematic-driven Reinforment Learning (MRG-R1), on two datasets: IU X-Ray and MIMIC-CXR using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We found that the label-semantic reinforcement is better than conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward rather than token overlap,meaningfully improves clinical correctness. This work is a prior to explore semantic-reinforcement in supervising medical correctness in medical Large vision-language model(Med-LVLM) training.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Daniela N. Rim, Heeyoul Choi
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Traditional Convolutional Neural Networks have been successful in capturing local, position-invariant features in text, but their capacity to model complex transformation within language can be further explored. In this work, we explore a novel approach by integrating Lie Convolutions into Convolutional-based sentence classifiers, inspired by the ability of Lie group operations to capture complex, non-Euclidean symmetries. Our proposed models SCLie and DPCLie empirically outperform traditional Convolutional-based sentence classifiers, suggesting that Lie-based models relatively improve the accuracy by capturing transformations not commonly associated with language. Our findings motivate more exploration of new paradigms in language modeling.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Jacob Choi, Shuying Cao, Xingjian Dong, Wang Bill Zhu, Robin Jia et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In-Context Learning (ICL) has become a standard technique for adapting Large Language Models (LLMs) to specialized tasks by supplying task-specific exemplars within the prompt. However, when these exemplars contain sensitive information, reliable privacy-preserving mechanisms are essential to prevent unintended leakage through model outputs. Many privacy-preserving methods are proposed to protect the information leakage in the context, but there are less efforts on how to audit those methods. We introduce ContextLeak, the first framework to empirically measure the worst-case information leakage in ICL. ContextLeak uses canary insertion, embedding uniquely identifiable tokens in exemplars and crafting targeted queries to detect their presence. We apply ContextLeak across a range of private ICL techniques, both heuristic such as prompt-based defenses and those with theoretical guarantees such as Embedding Space Aggregation and Report Noisy Max. We find that ContextLeak tightly correlates with the theoretical privacy budget ($ฮต$) and reliably detects leakage. Our results further reveal that existing methods often strike poor privacy-utility trade-offs, either leaking sensitive information or severely degrading performance.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, Sergey Levine
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) -- which trains a policy to directly match the actions played by the demonstrator -- can fail to ensure coverage over the demonstrator's actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator's behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator's actions, enabling more effective finetuning. Furthermore, this policy -- which we refer to as the posterior behavioral cloning (PostBC) policy -- achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show that PostBC is practically implementable with modern generative models in robotic control domains -- relying only on standard supervised learning -- and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qihang Rao, Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
MultimodalComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose \textbf{SFTok}, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating \textbf{self-forcing guided visual reconstruction} and \textbf{debias-and-fitting training strategy}, SFTok resolves the training-inference inconsistency in multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rahul Bhargava, Malene Hornstrup Jespersen, Emily Boardman Ndulue, Vivica Dsouza
Natural Language Processing
ๆฅ็ๆ่ฆ
AI technologies have rapidly moved into business and research applications that involve large text corpora, including computational journalism research and newsroom settings. These models, trained on extant data from various sources, can be conceptualized as historical artifacts that encode decades-old attitudes and stereotypes. This paper investigates one such example trained on the broadly-used New York Times Annotated Corpus to create a multi-label classifier. Our use in research settings surfaced the concerning "blacks" thematic topic label. Through quantitative and qualitative means we investigate this label's use in the training corpus, what concepts it might be encoding in the trained classifier, and how those concepts impact our model use. Via the application of explainable AI methods, we find that the "blacks" label operates partially as a general "racism detector" across some minoritized groups. However, it performs poorly against expectations on modern examples such as COVID-19 era anti-Asian hate stories, and reporting on the Black Lives Matter movement. This case study of interrogating embedded biases in a model reveals how similar applications in newsroom settings can lead to unexpected outputs that could impact a wide variety of potential uses of any large language model-story discovery, audience targeting, summarization, etc. The fundamental tension this exposes for newsrooms is how to adopt AI-enabled workflow tools while reducing the risk of reproducing historical biases in news coverage.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
๐
2025-12-18
๐ ArXiv Preprint (physics.chem-ph)
๐ฅ Zemin Xu, Chenyu Wu, Wenbo Xie, Daiqian Xie, P. Hu
Machine Learning
ๆฅ็ๆ่ฆ
Equivariant atomistic machine learning models have brought substantial gains in both extrapolation capability and predictive accuracy. Depending on the basis of the space, two distinct types of irreducible representations are utilized. From architectures built upon spherical tensors (STs) to more recent formulations employing irreducible Cartesian tensors (ICTs), STs have remained dominant owing to their compactness, elegance, and theoretical completeness. Nevertheless, questions have persisted regarding whether ST constructions are the only viable design principle, motivating continued development of Cartesian networks. In this work, we introduce the Cartesian-3j and Cartesian-nj symbol, which serve as direct analogues of the Wigner-3j and Wigner-nj symbol defined for tensor coupling. These coefficients enable the combination of any two ICTs into a new ICT. Building on this foundation, we extend e3nn to support irreducible Cartesian tensor product, and we release the resulting Python package as cartnn. Within this framework, we implement Cartesian counterparts of MACE, NequIP, and Allegro, allowing the first systematic comparison of Cartesian and spherical models to assess whether Cartesian formulations may offer advantages under specific conditions. Using TACE as a representative example, we further examine whether architectures constructed from irreducible Cartesian tensor product and contraction(ICTP and ICTC) are conceptually well-founded in Cartesian space and whether opportunities remain for improving their design.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne et al.
Robotics
ๆฅ็ๆ่ฆ
A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which has to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge the gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations provide a much stronger correlation to real world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Astrid Brull, Sara Aguti, Vรฉronique Bolduc, Ying Hu, Daniel M. Jimenez-Gutierrez et al.
Machine Learning
ๆฅ็ๆ่ฆ
The application of Machine Learning (ML) to the diagnosis of rare diseases, such as collagen VI-related dystrophies (COL6-RD), is fundamentally limited by the scarcity and fragmentation of available data. Attempts to expand sampling across hospitals, institutions, or countries with differing regulations face severe privacy, regulatory, and logistical obstacles that are often difficult to overcome. The Federated Learning (FL) provides a promising solution by enabling collaborative model training across decentralized datasets while keeping patient data local and private. Here, we report a novel global FL initiative using the Sherpa.ai FL platform, which leverages FL across distributed datasets in two international organizations for the diagnosis of COL6-RD, using collagen VI immunofluorescence microscopy images from patient-derived fibroblast cultures. Our solution resulted in an ML model capable of classifying collagen VI patient images into the three primary pathogenic mechanism groups associated with COL6-RD: exon skipping, glycine substitution, and pseudoexon insertion. This new approach achieved an F1-score of 0.82, outperforming single-organization models (0.57-0.75). These results demonstrate that FL substantially improves diagnostic utility and generalizability compared to isolated institutional models. Beyond enabling more accurate diagnosis, we anticipate that this approach will support the interpretation of variants of uncertain significance and guide the prioritization of sequencing strategies to identify novel pathogenic variants.
๐
2025-12-18
๐ ArXiv Preprint (cs.DS)
๐ฅ Chao Gao, Liren Shan, Vaidehi Srinivas, Aravindan Vijayaraghavan
ๆฅ็ๆ่ฆ
We study the problem of finding confidence ellipsoids for an arbitrary distribution in high dimensions. Given samples from a distribution $D$ and a confidence parameter $ฮฑ$, the goal is to find the smallest volume ellipsoid $E$ which has probability mass $\Pr_{D}[E] \ge 1-ฮฑ$. Ellipsoids are a highly expressive class of confidence sets as they can capture correlations in the distribution, and can approximate any convex set. This problem has been studied in many different communities. In statistics, this is the classic minimum volume estimator introduced by Rousseeuw as a robust non-parametric estimator of location and scatter. However in high dimensions, it becomes NP-hard to obtain any non-trivial approximation factor in volume when the condition number $ฮฒ$ of the ellipsoid (ratio of the largest to the smallest axis length) goes to $\infty$. This motivates the focus of our paper: can we efficiently find confidence ellipsoids with volume approximation guarantees when compared to ellipsoids of bounded condition number $ฮฒ$?
Our main result is a polynomial time algorithm that finds an ellipsoid $E$ whose volume is within a $O(ฮฒ^{ฮณd})$ multiplicative factor of the volume of best $ฮฒ$-conditioned ellipsoid while covering at least $1-O(ฮฑ/ฮณ)$ probability mass for any $ฮณ< ฮฑ$. We complement this with a computational hardness result that shows that such a dependence seems necessary up to constants in the exponent. The algorithm and analysis uses the rich primal-dual structure of the minimum volume enclosing ellipsoid and the geometric Brascamp-Lieb inequality. As a consequence, we obtain the first polynomial time algorithm with approximation guarantees on worst-case instances of the robust subspace recovery problem.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tomรกลก Souฤek, Pierre Fernandez, Hady Elsahar, Sylvestre-Alvise Rebuffi, Valeriu Lacatusu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.
๐
2025-12-18
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Shayan Hundrieser, Philipp Tuchel, Insung Kong, Johannes Schmidt-Hieber
Machine Learning
ๆฅ็ๆ่ฆ
Inspired by biology, spiking neural networks (SNNs) process information via discrete spikes over time, offering an energy-efficient alternative to the classical computing paradigm and classical artificial neural networks (ANNs). In this work, we analyze the representational power of SNNs by viewing them as sequence-to-sequence processors of spikes, i.e., systems that transform a stream of input spikes into a stream of output spikes. We establish the universal representation property for a natural class of spike train functions. Our results are fully quantitative, constructive, and near-optimal in the number of required weights and neurons. The analysis reveals that SNNs are particularly well-suited to represent functions with few inputs, low temporal complexity, or compositions of such functions. The latter is of particular interest, as it indicates that deep SNNs can efficiently capture composite functions via a modular design. As an application of our results, we discuss spike train classification. Overall, these results contribute to a rigorous foundation for understanding the capabilities and limitations of spike-based neuromorphic systems.
๐
2025-12-18
๐ The Manuscript is submitted for review under IEEE
๐ฅ Hesham G. Moussa, Aroosa Hameed, Arashmid Akhavain
Machine Learning
ๆฅ็ๆ่ฆ
To cope with real-world dynamics, an intelligent system needs to incrementally acquire, update, and exploit knowledge throughout its lifetime. This ability, known as Continual learning, provides a foundation for AI systems to develop themselves adaptively. Catastrophic forgetting is a major challenge to the progress of Continual Learning approaches, where learning a new task usually results in a dramatic performance drop on previously learned ones. Many approaches have emerged to counteract the impact of CF. Most of the proposed approaches can be categorized into five classes: replay-based, regularization-based, optimization-based, representation-based, and architecture-based. In this work, we approach the problem from a different angle, specifically by considering the optimal sequencing of tasks as they are presented to the model. We investigate the role of task sequencing in mitigating CF and propose a method for determining the optimal task order. The proposed method leverages zero-shot scoring algorithms inspired by neural architecture search (NAS). Results demonstrate that intelligent task sequencing can substantially reduce CF. Moreover, when combined with traditional continual learning strategies, sequencing offers enhanced performance and robustness against forgetting. Additionally, the presented approaches can find applications in other fields, such as curriculum learning.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiabin Xue
Machine Learning
ๆฅ็ๆ่ฆ
Edge machine learning (Edge ML) enables training ML models using the vast data distributed across network edges. However, many existing approaches assume static models trained centrally and then deployed, making them ineffective against unseen data. To address this, Online Edge ML allows models to be trained directly on edge devices and updated continuously with new data. This paper explores a key challenge of Online Edge ML: "How to determine labels for truly future, unseen data points". We propose Knowledge Transformation (KT), a hybrid method combining Knowledge Distillation, Active Learning, and causal reasoning. In short, KT acts as the oracle in active learning by transforming knowledge from a teacher model to generate pseudo-labels for training a student model. To verify the validity of the method, we conducted simulation experiments with two setups: (1) using a less stable teacher model and (2) a relatively more stable teacher model. Results indicate that when a stable teacher model is given, the student model can eventually reach its expected maximum performance. KT is potentially beneficial for scenarios that meet the following circumstances: (1) when the teacher's task is generic, which means existing pre-trained models might be adequate for its task, so there will be no need to train the teacher model from scratch; and/or (2) when the label for the student's task is difficult or expensive to acquire.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Zihan Zhou, Animesh Garg, Ajay Mandlekar, Caelan Garrett
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Long-horizon manipulation has been a long-standing challenge in the robotics community. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches 80% success rate on all tasks with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contributes to an 89% average performance increase. More results and videos available in https://reinforcegen.github.io/
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from the environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term rewards optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt their policy from task feedback signal without gradient update. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Amit Jain, Richard Linares
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Neural network controllers increasingly demand millions of parameters, and language model approaches push into the billions. For embedded aerospace systems with strict power and latency constraints, this scaling is prohibitive. We present Tiny Recursive Control (TRC), a neural architecture based on a counterintuitive principle: capacity can emerge from iteration depth rather than parameter count. TRC applies compact networks (approximately 1.5M parameters) repeatedly through a two-level hierarchical latent structure, refining control sequences by simulating trajectories and correcting based on tracking error. Because the same weights process every refinement step, adding iterations increases computation without increasing memory. We evaluate TRC on nonlinear control problems including oscillator stabilization and powered descent with fuel constraints. Across these domains, TRC achieves near-optimal control costs while requiring only millisecond-scale inference on GPU and under 10~MB memory, two orders of magnitude smaller than language model baselines. These results demonstrate that recursive reasoning, previously confined to discrete tasks, transfers effectively to continuous control synthesis.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qian Wang, Zahra Yousefijamarani, Morgan Lindsay Heisler, Rongzhi Gu, Bai Xiaolong et al.
ๆฅ็ๆ่ฆ
Modern LLM applications such as deep-research assistants, coding agents, and Retrieval-Augmented Generation (RAG) systems, repeatedly process long prompt histories containing shared document or code chunks, creating significant pressure on the Key Value (KV) cache, which must operate within limited memory while sustaining high throughput and low latency. Prefix caching partially alleviates some of these costs by reusing KV cache for previously processed tokens, but limited by strict prefix matching. Position-independent caching (PIC) enables chunk-level reuse at arbitrary positions, but requires selective recomputation and positional-encoding (PE) adjustments. However, because these operations vary across queries, KV for the same chunk diverges across requests. Moreover, without page alignment, chunk KV layouts diverge in memory, preventing page sharing. These issues result in only modest HBM savings even when many requests reuse the same content.
We present MEPIC, a memory-efficient PIC system that enables chunk KV reuse across positions, requests, and batches. MEPIC aligns chunk KV to paged storage, shifts recomputation from token- to block-level so only the first block is request-specific, removes positional encodings via Rotary Position Embedding (RoPE) fusion in the attention kernel, and makes remaining blocks fully shareable. These techniques eliminate most duplicate chunk KV in HBM, reducing usage by up to 2x over state-of-the-art PIC at comparable latency and accuracy, and up to 5x for long prompts, without any model changes.
๐
2025-12-18
๐ ArXiv Preprint (cs.NI)
๐ฅ Bahman Abolhassani, Tugba Erpek, Kemal Davaslioglu, Yalin E. Sagduyu, Sastry Kompella
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reactive jammers pose a severe security threat to robotic-swarm networks by selectively disrupting inter-agent communications and undermining formation integrity and mission success. Conventional countermeasures such as fixed power control or static channel hopping are largely ineffective against such adaptive adversaries. This paper presents a multi-agent reinforcement learning (MARL) framework based on the QMIX algorithm to improve the resilience of swarm communications under reactive jamming. We consider a network of multiple transmitter-receiver pairs sharing channels while a reactive jammer with Markovian threshold dynamics senses aggregate power and reacts accordingly. Each agent jointly selects transmit frequency (channel) and power, and QMIX learns a centralized but factorizable action-value function that enables coordinated yet decentralized execution. We benchmark QMIX against a genie-aided optimal policy in a no-channel-reuse setting, and against local Upper Confidence Bound (UCB) and a stateless reactive policy in a more general fading regime with channel reuse enabled. Simulation results show that QMIX rapidly converges to cooperative policies that nearly match the genie-aided bound, while achieving higher throughput and lower jamming incidence than the baselines, thereby demonstrating MARL's effectiveness for securing autonomous swarms in contested environments.
๐
2025-12-18
๐ ArXiv Preprint (eess.SP)
๐ฅ Chenyu Zhu, Zeyang Li, Ziyi Xie, Jie Zhang
ๆฅ็ๆ่ฆ
Specific emitter identification (SEI) utilizes passive hardware characteristics to authenticate transmitters, providing a robust physical-layer security solution. However, most deep-learning-based methods rely on extensive data or require prior information, which poses challenges in real-world scenarios with limited labeled data. We propose an integrated complex variational mode decomposition algorithm that decomposes and reconstructs complex-valued signals to approximate the original transmitted signals, thereby enabling more accurate feature extraction. We further utilize a temporal convolutional network to effectively model the sequential signal characteristics, and introduce a spatial attention mechanism to adaptively weight informative signal segments, significantly enhancing identification performance. Additionally, the branch network allows leveraging pre-trained weights from other data while reducing the need for auxiliary datasets. Ablation experiments on the simulated data demonstrate the effectiveness of each component of the model. An accuracy comparison on a public dataset reveals that our method achieves 96% accuracy using only 10 symbols without requiring any prior knowledge.
๐
2025-12-18
๐ ArXiv Preprint (quant-ph)
๐ฅ Theshani Nuradha, Ian George, Christoph Hirche
ๆฅ็ๆ่ฆ
Data-processing is a desired property of classical and quantum divergences and information measures. In information theory, the contraction coefficient measures how much the distinguishability of quantum states decreases when they are transmitted through a quantum channel, establishing linear strong data-processing inequalities (SDPI). However, these linear SDPI are not always tight and can be improved in most of the cases. In this work, we establish non-linear SDPI for quantum hockey-stick divergence for noisy channels that satisfy a certain noise criterion. We also note that our results improve upon existing linear SDPI for quantum hockey-stick divergences and also non-linear SDPI for classical hockey-stick divergence. We define $F_ฮณ$ curves generalizing Dobrushin curves for the quantum setting while characterizing SDPI for the sequential composition of heterogeneous channels. In addition, we derive reverse-Pinsker type inequalities for $f$-divergences with additional constraints on hockey-stick divergences. We show that these non-linear SDPI can establish tighter finite mixing times that cannot be achieved through linear SDPI. Furthermore, we find applications of these in establishing stronger privacy guarantees for the composition of sequential private quantum channels when privacy is quantified by quantum local differential privacy.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Soon Hoe Lim
ๆฅ็ๆ่ฆ
We study the implicit bias of flow matching (FM) samplers via the lens of empirical flow matching. Although population FM may produce gradient-field velocities resembling optimal transport (OT), we show that the empirical FM minimizer is almost never a gradient field, even when each conditional flow is. Consequently, empirical FM is intrinsically energetically suboptimal. In view of this, we analyze the kinetic energy of generated samples. With Gaussian sources, both instantaneous and integrated kinetic energies exhibit exponential concentration, while heavy-tailed sources lead to polynomial tails. These behaviors are governed primarily by the choice of source distribution rather than the data. Overall, these notes provide a concise mathematical account of the structural and energetic biases arising in empirical FM.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ingrid Amaranta Membrillo Solis, Maria van Rossem, Tristan Madeleine, Tetiana Orlova, Nina Podoliak et al.
Machine Learning
ๆฅ็ๆ่ฆ
A complex system comprises multiple interacting entities whose interdependencies form a unified whole, exhibiting emergent behaviours not present in individual components. Examples include the human brain, living cells, soft matter, Earth's climate, ecosystems, and the economy. These systems exhibit high-dimensional, non-linear dynamics, making their modelling, classification, and prediction particularly challenging. Advances in information technology have enabled data-driven approaches to studying such systems. However, the sheer volume and complexity of spatio-temporal data often hinder traditional methods like dimensionality reduction, phase-space reconstruction, and attractor characterisation. This paper introduces a geometric framework for analysing spatio-temporal data from complex systems, grounded in the theory of vector fields over discrete measure spaces. We propose a two-parameter family of metrics suitable for data analysis and machine learning applications. The framework supports time-dependent images, image gradients, and real- or vector-valued functions defined on graphs and simplicial complexes. We validate our approach using data from numerical simulations of biological and physical systems on flat and curved domains. Our results show that the proposed metrics, combined with multidimensional scaling, effectively address key analytical challenges. They enable dimensionality reduction, mode decomposition, phase-space reconstruction, and attractor characterisation. Our findings offer a robust pathway for understanding complex dynamical systems, especially in contexts where traditional modelling is impractical but abundant experimental data are available.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nima Dehmamy, Benjamin Hoover, Bishwajit Saha, Leo Kozachkov, Jean-Jacques Slotine et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although they don't necessarily lead to the best performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, doing so only during very long training.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wisnu Uriawan, Muhamad Veva Ramadhan, Firman Adi Nugraha, Hasbi Nur Wahid, M Dantha Arianvasya et al.
Machine Learning
ๆฅ็ๆ่ฆ
The rapid digitalization of Hajj and Umrah services in Indonesia has significantly facilitated pilgrims but has concurrently opened avenues for digital fraud through counterfeit mobile applications. These fraudulent applications not only inflict financial losses but also pose severe privacy risks by harvesting sensitive personal data. This research aims to address this critical issue by implementing and evaluating machine learning algorithms to verify application authenticity automatically. Using a comprehensive dataset comprising both official applications registered with the Ministry of Religious Affairs and unofficial applications circulating on app stores, we compare the performance of three robust classifiers: Support Vector Machine (SVM), Random Forest (RF), and Na"ive Bayes (NB). The study utilizes a hybrid feature extraction methodology that combines Textual Analysis (TF-IDF) of application descriptions with Metadata Analysis of sensitive access permissions. The experimental results indicate that the SVM algorithm achieves the highest performance with an accuracy of 92.3%, a precision of 91.5%, and an F1-score of 92.0%. Detailed feature analysis reveals that specific keywords related to legality and high-risk permissions (e.g., READ PHONE STATE) are the most significant discriminators. This system is proposed as a proactive, scalable solution to enhance digital trust in the religious tourism sector, potentially serving as a prototype for a national verification system.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lei Wang, Xin Tan, Mingwei Wang, Ying Zhang
ๆฅ็ๆ่ฆ
Recent selective state space models (SSMs), such as Mamba and Mamba-2, have demonstrated strong performance in sequence modeling owing to input-dependent selection mechanisms. However, these mechanisms lack theoretical grounding and cannot support context-aware selection from latent state dynamics. To address these limitations, we propose KOSS, a Kalman-optimal Selective State Space model that formulates selection as latent state uncertainty minimization. Derived from estimation theory, KOSS adopts a continuous-time latent update driven by a Kalman gain that dynamically modulates information propagation based on content and context, enabling a closed-loop, context-aware selectivity mechanism. To ensure stable computation and near-linear scalability, KOSS employs global spectral differentiation for frequency-domain derivative estimation, along with a segment-wise scan for hardware-efficient processing. On a selective copying task with distractors, KOSS achieves over 79\% accuracy while baselines drop below 20\%, demonstrating robust context-aware selection. Furthermore, across nine long-term forecasting benchmarks, KOSS reduces MSE by 2.92--36.23\% and consistently outperforms state-of-the-art models in both accuracy and stability. To assess real-world applicability, a case study on secondary surveillance radar (SSR) tracking confirms KOSS's robustness under irregular intervals and noisy conditions and demonstrates its effectiveness in real-world applications. Finally, supplementary experiments verify Kalman gain convergence and the frequency response of spectral differentiation, providing theoretical support for the proposed closed-loop design.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuriy N. Bakhvalov
Machine Learning
ๆฅ็ๆ่ฆ
In a previous paper it was shown that a machine learning regression problem can be solved within the framework of random function theory, with the optimal kernel analytically derived from symmetry and indifference principles and coinciding with a polyharmonic spline. However, a direct application of that solution is limited by O(N^3) computational cost and by a breakdown of the original theoretical assumptions when the input space has excessive dimensionality. This paper proposes a cascade architecture built from packages of polyharmonic splines that simultaneously addresses scalability and is theoretically justified for problems with unknown intrinsic low dimensionality. Efficient matrix procedures are presented for forward computation and end-to-end differentiation through the cascade.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rudra Dubey, Arpit Mani Tripathi, Archit Srivastava, Sarvpal Singh
Machine Learning
ๆฅ็ๆ่ฆ
In actuality, phishing attacks remain one of the most prevalent cybersecurity risks in existence today, with malevolent actors constantly changing their strategies to successfully trick users. This paper presents an AI model for a phishing detection system that uses an ensemble approach to combine character-level Convolutional Neural Networks (CNN) and LightGBM with engineered features. Our system uses a character-level CNN to extract sequential features after extracting 36 lexical, structural, and domain-based features from the URLs. On a test dataset of 19,873 URLs, the ensemble model achieves an accuracy of 99.819 percent, precision of 100 percent, recall of 99.635 percent, and ROC-AUC of 99.947 percent. Through a FastAPI-based service with an intuitive user interface, the suggested system has been utilised to offer real-time detection. In contrast, the results demonstrate that the suggested solution performs better than individual models; LightGBM contributes 40 percent and character-CNN contributes 60 percent to the final prediction. The suggested method maintains extremely low false positive rates while doing a good job of identifying contemporary phishing techniques. Index Terms - Phishing detection, machine learning, deep learning, CNN, ensemble methods, cybersecurity, URL analysis
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Oliver Stritzel, Nick Hรผhnerbein, Simon Rauch, Itzel Zarate, Lukas Fleischmann et al.
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, Predictive Process Mining (PPM) techniques based on artificial neural networks have evolved as a method for monitoring the future behavior of unfolding business processes and predicting Key Performance Indicators (KPIs). However, many PPM approaches often lack reproducibility, transparency in decision making, usability for incorporating novel datasets and benchmarking, making comparisons among different implementations very difficult. In this paper, we propose SPICE, a Python framework that reimplements three popular, existing baseline deep-learning-based methods for PPM in PyTorch, while designing a common base framework with rigorous configurability to enable reproducible and robust comparison of past and future modelling approaches. We compare SPICE to original reported metrics and with fair metrics on 11 datasets.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ David Mรผller, Espen Knoop, Dario Mylonopoulos, Agon Serifi, Michael A. Hopkins et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Animated characters often move in non-physical ways and have proportions that are far from a typical walking robot. This provides an ideal platform for innovation in both mechanical design and stylized motion control. In this paper, we bring Olaf to life in the physical world, relying on reinforcement learning guided by animation references for control. To create the illusion of Olaf's feet moving along his body, we hide two asymmetric legs under a soft foam skirt. To fit actuators inside the character, we use spherical and planar linkages in the arms, mouth, and eyes. Because the walk cycle results in harsh contact sounds, we introduce additional rewards that noticeably reduce impact noise. The large head, driven by small actuators in the character's slim neck, creates a risk of overheating, amplified by the costume. To keep actuators from overheating, we feed temperature values as additional inputs to policies, introducing new rewards to keep them within bounds. We validate the efficacy of our modeling in simulation and on hardware, demonstrating an unmatched level of believability for a costumed robotic character.
๐
2025-12-18
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Luuk H. E. Kempen, Raffaele Cheula, Mie Andersen
Machine Learning
ๆฅ็ๆ่ฆ
Foundational machine learning interatomic potentials (MLIPs) are being developed at a rapid pace, promising closer and closer approximation to ab initio accuracy. This unlocks the possibility to simulate much larger length and time scales. However, benchmarks for these MLIPs are usually limited to ordered, crystalline and bulk materials. Hence, reported performance does not necessarily accurately reflect MLIP performance in real applications such as heterogeneous catalysis. Here, we systematically analyze zero-shot performance of 80 different MLIPs, evaluating tasks typical for heterogeneous catalysis across a range of different data sets, including adsorption and reaction on surfaces of alloyed metals, oxides, and metal-oxide interfacial systems. We demonstrate that current-generation foundational MLIPs can already perform at high accuracy for applications such as predicting vacancy formation energies of perovskite oxides or zero-point energies of supported nanoclusters. However, limitations also exist. We find that many MLIPs catastrophically fail when applied to magnetic materials, and structure relaxation in the MLIP generally increases the energy prediction error compared to single-point evaluation of a previously optimized structure. Comparing low-cost task-specific models to foundational MLIPs, we highlight some core differences between these model approaches and show that -- if considering only accuracy -- these models can compete with the current generation of best-performing MLIPs. Furthermore, we show that no single MLIP universally performs best, requiring users to investigate MLIP suitability for their desired application.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ John M. Statheros, Hairong Wang, Richard Klein
Natural Language Processing
ๆฅ็ๆ่ฆ
The interpretation of chest X-rays (CXRs) poses significant challenges, particularly in achieving accurate multi-label pathology classification and spatial localization. These tasks demand different levels of annotation granularity but are frequently constrained by the scarcity of region-level (dense) annotations. We introduce CLARiTy (Class Localizing and Attention Refining Image Transformer), a vision transformer-based model for joint multi-label classification and weakly-supervised localization of thoracic pathologies. CLARiTy employs multiple class-specific tokens to generate discriminative attention maps, and a SegmentCAM module for foreground segmentation and background suppression using explicit anatomical priors. Trained on image-level labels from the NIH ChestX-ray14 dataset, it leverages distillation from a ConvNeXtV2 teacher for efficiency. Evaluated on the official NIH split, the CLARiTy-S-16-512 (a configuration of CLARiTy), achieves competitive classification performance across 14 pathologies, and state-of-the-art weakly-supervised localization performance on 8 pathologies, outperforming prior methods by 50.7%. In particular, pronounced gains occur for small pathologies like nodules and masses. The lower-resolution variant of CLARiTy, CLARiTy-S-16-224, offers high efficiency while decisively surpassing baselines, thereby having the potential for use in low-resource settings. An ablation study confirms contributions of SegmentCAM, DINO pretraining, orthogonal class token loss, and attention pooling. CLARiTy advances beyond CNN-ViT hybrids by harnessing ViT self-attention for global context and class-specific localization, refined through convolutional background suppression for precise, noise-reduced heatmaps.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Natnael Tilahun Sinshaw, Mengmei He, Tadesse K. Bahiru, Sudhir Kumar Mohapatra
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Text classification problems, such as gender classification from a blog, have been a well-matured research area that has been well studied using machine learning algorithms. It has several application domains in market analysis, customer recommendation, and recommendation systems. This study presents a comparative analysis of the widely used machine learning algorithms, namely Support Vector Machines (SVM), Naive Bayes (NB), Logistic Regression (LR), AdaBoost, XGBoost, and an SVM variant (SVM_R) with neuro-symbolic AI (NeSy). The paper also explores the effect of text representations such as TF-IDF, the Universal Sentence Encoder (USE), and RoBERTa. Additionally, various feature extraction techniques, including Chi-Square, Mutual Information, and Principal Component Analysis, are explored. Building on these, we introduce a comparative analysis of the machine learning and deep learning approaches in comparison to the NeSy. The experimental results show that the use of the NeSy approach matched strong MLP results despite a limited dataset. Future work on this research will expand the knowledge base, the scope of embedding types, and the hyperparameter configuration to further study the effectiveness of the NeSy approach.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3\% execution accuracy in Text-to-SQL over SynSQL, +7\% average improvements on code benchmarks, and 1--3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
๐
2025-12-18
๐ IEEE Transactions on Mobile Computing
๐ฅ Liu Yang, Qiang Li, Luxiong Wen, Jian Yang
Machine Learning
ๆฅ็ๆ่ฆ
With the rapid proliferation of edge computing, Radio Frequency Fingerprint Identification (RFFI) has become increasingly important for secure device authentication. However, practical deployment of deep learning-based RFFI models is hindered by a critical challenge: their performance often degrades significantly when applied across receivers with different hardware characteristics due to distribution shifts introduced by receiver variation. To address this, we investigate the source-data-free cross-receiver RFFI (SCRFFI) problem, where a model pretrained on labeled signals from a source receiver must adapt to unlabeled signals from a target receiver, without access to any source-domain data during adaptation. We first formulate a novel constrained pseudo-labeling-based SCRFFI adaptation framework, and provide a theoretical analysis of its generalization performance. Our analysis highlights a key insight: the target-domain performance is highly sensitive to the quality of the pseudo-labels generated during adaptation. Motivated by this, we propose Momentum Soft pseudo-label Source Hypothesis Transfer (MS-SHOT), a new method for SCRFFI that incorporates momentum-center-guided soft pseudo-labeling and enforces global structural constraints to encourage confident and diverse predictions. Notably, MS-SHOT effectively addresses scenarios involving label shift or unknown, non-uniform class distributions in the target domain -- a significant limitation of prior methods. Extensive experiments on real-world datasets demonstrate that MS-SHOT consistently outperforms existing approaches in both accuracy and robustness, offering a practical and scalable solution for source-data-free cross-receiver adaptation in RFFI.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Danxu Liu, Di Wang, Hebaixu Wang, Haoyang Chen, Wentao Jiang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available at https://github.com/MiliLab/SARMAE.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Barna Pรกsztor, Thomas Kleine Buening, Andreas Krause
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We introduce Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditionally on the Leader's action. This approach decomposes preference optimization into a refinement problem for the Follower and an optimization problem against an adversary for the Leader. Unlike Reinforcement Learning from Human Feedback (RLHF), which assigns scalar rewards to actions, or Nash Learning from Human Feedback (NLHF), which seeks a simultaneous-move equilibrium, SLHF leverages the asymmetry of sequential play to capture richer preference structures. The sequential design of SLHF naturally enables inference-time refinement, as the Follower learns to improve the Leader's actions, and these refinements can be leveraged through iterative sampling. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Experiments on large language models demonstrate that SLHF achieves strong alignment across diverse preference datasets, scales from 0.5B to 8B parameters, and yields inference-time refinements that transfer across model families without further fine-tuning.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Louis Grenioux, Leonardo Galliano, Ludovic Berthier, Giulio Biroli, Marylou Gabriรฉ
Machine Learning
ๆฅ็ๆ่ฆ
Modern generative models hold great promise for accelerating diverse tasks involving the simulation of physical systems, but they must be adapted to the specific constraints of each domain. Significant progress has been made for biomolecules and crystalline materials. Here, we address amorphous materials (glasses), which are disordered particle systems lacking atomic periodicity. Sampling equilibrium configurations of glass-forming materials is a notoriously slow and difficult task. This obstacle could be overcome by developing a generative framework capable of producing equilibrium configurations with well-defined likelihoods. In this work, we address this challenge by leveraging an equivariant Riemannian stochastic interpolation framework which combines Riemannian stochastic interpolant and equivariant flow matching. Our method rigorously incorporates periodic boundary conditions and the symmetries of multi-component particle systems, adapting an equivariant graph neural network to operate directly on the torus. Our numerical experiments on model amorphous systems demonstrate that enforcing geometric and symmetry constraints significantly improves generative performance.
๐
2025-12-18
๐ ArXiv Preprint (math.OC)
๐ฅ Xun Qian, Hussein Rammal, Dmitry Kovalev, Peter Richtรกrik
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen Non-Euclidean norm balls, such as Muon and Scion, outperform Adam-type methods in the training of large language models. In this work, we show that such optimizers can be provably improved by replacing their vanilla momentum by momentum variance reduction (MVR). Instead of proposing and analyzing MVR variants of Muon and Scion separately, we incorporate MVR into the recently proposed Gluon framework, which captures Muon, Scion and other specific Non-Euclidean LMO-based methods as special cases, and at the same time works with a more general smoothness assumption which better captures the layer-wise structure of neural networks. In the non-convex case, we incorporate MVR into Gluon in three different ways. All of them improve the convergence rate from ${\cal O} (\frac{1}{K^{1/4}})$ to ${\cal O} (\frac{1}{K^{1/3}})$. Additionally, we provide improved rates in the star-convex case. Finally, we conduct several numerical experiments that verify the superior performance of our proposed algorithms in terms of iteration complexity.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sullivan Castro, Artem Betlei, Thomas Di Martino, Nadir El Manouzi
ๆฅ็ๆ่ฆ
Modeling user purchase behavior is a critical challenge in display advertising systems, necessary for real-time bidding. The difficulty arises from the sparsity of positive user events and the stochasticity of user actions, leading to severe class imbalance and irregular event timing. Predictive systems usually rely on hand-crafted "counter" features, overlooking the fine-grained temporal evolution of user intent. Meanwhile, current sequential models extract direct sequential signal, missing useful event-counting statistics. We enhance deep sequential models with self-supervised pretraining strategies for display advertising. Especially, we introduce Abacus, a novel approach of predicting the empirical frequency distribution of user events. We further propose a hybrid objective unifying Abacus with sequential learning objectives, combining stability of aggregated statistics with the sequence modeling sensitivity. Experiments on two real-world datasets show that Abacus pretraining outperforms existing methods accelerating downstream task convergence, while hybrid approach yields up to +6.1% AUC compared to the baselines.
๐
2025-12-18
๐ ArXiv Preprint (math.OC)
๐ฅ Yin Liu, Qiming Dai, Junyu Zhang, Zaiwen Wen
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) has gained attention for aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF). The actor-only variants of Proximal Policy Optimization (PPO) are widely applied for their efficiency. These algorithms incorporate a clipping mechanism to improve stability. Besides, a regularization term, such as the reverse KL-divergence or a more general \(f\)-divergence, is introduced to prevent policy drift. Despite their empirical success, a rigorous theoretical understanding of the problem and the algorithm's properties is limited. This paper advances the theoretical foundations of the PPO-Clip algorithm by analyzing a deterministic actor-only PPO algorithm within the general RL setting with \(f\)-divergence regularization under the softmax policy parameterization. We derive a non-uniform Lipschitz smoothness condition and a ลojasiewicz inequality for the considered problem. Based on these, a non-asymptotic linear convergence rate to the globally optimal policy is established for the forward KL-regularizer. Furthermore, stationary convergence and local linear convergence are derived for the reverse KL-regularizer.
๐
2025-12-18
๐ 21 pages, 11 figures, submitted to the Journal of
๐ฅ Daniรซl Bot, Leland McInnes, Jan Aerts
ๆฅ็ๆ่ฆ
Clustering is a cornerstone of modern data analysis. Detecting clusters in exploratory data analyses (EDA) requires algorithms that make few assumptions about the data. Density-based clustering algorithms are particularly well-suited for EDA because they describe high-density regions, assuming only that a density exists. Applying density-based clustering algorithms in practice, however, requires selecting appropriate hyperparameters, which is difficult without prior knowledge of the data distribution. For example, DBSCAN requires selecting a density threshold, and HDBSCAN* relies on a minimum cluster size parameter. In this work, we propose Persistent Leaves Spatial Clustering for Applications with Noise (PLSCAN). This novel density-based clustering algorithm efficiently identifies all minimum cluster sizes for which HDBSCAN* produces stable (leaf) clusters. PLSCAN applies scale-space clustering principles and is equivalent to persistent homology on a novel metric space. We compare its performance to HDBSCAN* on several real-world datasets, demonstrating that it achieves a higher average ARI and is less sensitive to changes in the number of mutual reachability neighbours. Additionally, we compare PLSCAN's computational costs to k-Means, demonstrating competitive run-times on low-dimensional datasets. At higher dimensions, run times scale more similarly to HDBSCAN*.
๐
2025-12-18
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Brent Motmans, Digvijay Ghogare, Thijs G. I. van Wijk, An Hardy, Danny E. P. Vanpoucke
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Copper nanoparticles (Cu NPs) have a broad applicability, yet their synthesis is sensitive to subtle changes in reaction parameters. This sensitivity, combined with the time- and resource-intensive nature of experimental optimization, poses a major challenge in achieving reproducible and size-controlled synthesis. While Machine Learning (ML) shows promise in materials research, its application is often limited by scarcity of large high-quality experimental data sets. This study explores ML to predict the size of Cu NPs from microwave-assisted polyol synthesis using a small data set of 25 in-house performed syntheses. Latin Hypercube Sampling is used to efficiently cover the parameter space while creating the experimental data set. Ensemble regression models, built with the AMADEUS framework, successfully predict particle sizes with high accuracy ($R^2 = 0.74$), outperforming classical statistical approaches ($R^2 = 0.60$). Overall, this study highlights that, for lab-scale synthesis optimization, high-quality small datasets combined with classical, interpretable ML models outperform traditional statistical methods and are fully sufficient for quantitative synthesis prediction. This approach provides a sustainable and experimentally realistic pathway toward data-driven inorganic synthesis design.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Xiao Li, Yue Li, Hao Wu, Yue Zhang, Yechao Zhang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
As large language models (LLMs) are increasingly adopted for code vulnerability detection, their reliability and robustness across diverse vulnerability types have become a pressing concern. In traditional adversarial settings, code obfuscation has long been used as a general strategy to bypass auditing tools, preserving exploitability without tampering with the tools themselves. Numerous efforts have explored obfuscation methods and tools, yet their capabilities differ in terms of supported techniques, granularity, and programming languages, making it difficult to systematically assess their impact on LLM-based vulnerability detection. To address this gap, we provide a structured systematization of obfuscation techniques and evaluate them under a unified framework. Specifically, we categorize existing obfuscation methods into three major classes (layout, data flow, and control flow) covering 11 subcategories and 19 concrete techniques. We implement these techniques across four programming languages (Solidity, C, C++, and Python) using a consistent LLM-driven approach, and evaluate their effects on 15 LLMs spanning four model families (DeepSeek, OpenAI, Qwen, and LLaMA), as well as on two coding agents (GitHub Copilot and Codex). Our findings reveal both positive and negative impacts of code obfuscation on LLM-based vulnerability detection, highlighting conditions under which obfuscation leads to performance improvements or degradations. We further analyze these outcomes with respect to vulnerability characteristics, code properties, and model attributes. Finally, we outline several open problems and propose future directions to enhance the robustness of LLMs for real-world vulnerability detection.
๐
2025-12-18
๐ ArXiv Preprint (cs.SD)
๐ฅ Nikolaos Ellinas, Alexandra Vioni, Panos Kakoulidis, Georgios Vamvoukakis, Myrsini Christidou et al.
ๆฅ็ๆ่ฆ
This paper introduces a cepstrum-based pitch modification method that can be applied to any mel-spectrogram representation. As a result, this method is compatible with any mel-based vocoder without requiring any additional training or changes to the model. This is achieved by directly modifying the cepstrum feature space in order to shift the harmonic structure to the desired target. The spectrogram magnitude is computed via the pseudo-inverse mel transform, then converted to the cepstrum by applying DCT. In this domain, the cepstral peak is shifted without having to estimate its position and the modified mel is recomputed by applying IDCT and mel-filterbank. These pitch-shifted mel-spectrogram features can be converted to speech with any compatible vocoder. The proposed method is validated experimentally with objective and subjective metrics on various state-of-the-art neural vocoders as well as in comparison with traditional pitch modification methods.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Seyda Betul Aydin, Holger Brandt
Machine Learning
ๆฅ็ๆ่ฆ
Generalizing causal knowledge across diverse environments is challenging, especially when estimates from large-scale datasets must be applied to smaller or systematically different contexts, where external validity is critical. Model-based estimators of individual treatment effects (ITE) from machine learning require large sample sizes, limiting their applicability in domains such as behavioral sciences with smaller datasets. We demonstrate how estimation of ITEs with Treatment Agnostic Representation Networks (TARNet; Shalit et al., 2017) can be improved by leveraging knowledge from source datasets and adapting it to new settings via transfer learning (TL-TARNet; Aloui et al., 2023). In simulations that vary source and sample sizes and consider both randomized and non-randomized intervention target settings, the transfer-learning extension TL-TARNet improves upon standard TARNet, reducing ITE error and attenuating bias when a large unbiased source is available and target samples are small. In an empirical application using the India Human Development Survey (IHDS-II), we estimate the effect of mothers' firewood collection time on children's weekly study time; transfer learning pulls the target mean ITEs toward the source ITE estimate, reducing bias in the estimates obtained without transfer. These results suggest that transfer learning for causal models can improve the estimation of ITE in small samples.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pengfei Sun, Wenyu Jiang, Piew Yoong Chee, Paul Devos, Dick Botteldooren
Machine Learning
ๆฅ็ๆ่ฆ
Quantised neural networks (QNNs) shrink models and reduce inference energy through low-bit arithmetic, yet most still depend on a running statistics batch normalisation (BN) layer, preventing true integer-only deployment. Prior attempts remove BN by parameter folding or tailored initialisation; while helpful, they rarely recover BN's stability and accuracy and often impose bespoke constraints. We present a BN-free, fully integer QNN trained via a progressive, layer-wise distillation scheme that slots into existing low-bit pipelines. Starting from a pretrained BN-enabled teacher, we use layer-wise targets and progressive compensation to train a student that performs inference exclusively with integer arithmetic and contains no BN operations. On ImageNet with AlexNet, the BN-free model attains competitive Top-1 accuracy under aggressive quantisation. The procedure integrates directly with standard quantisation workflows, enabling end-to-end integer-only inference for resource-constrained settings such as edge and embedded devices.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shabnam Bagheri Marzijarani, Mohammad Zolfaghari, Hedieh Sajedi
Machine Learning
ๆฅ็ๆ่ฆ
The Internet of Things (IoT) is a concept by which objects find identity and can communicate with each other in a network. One of the applications of the IoT is in the field of medicine, which is called the Internet of Medical Things (IoMT). Acute Lymphocytic Leukemia (ALL) is a type of cancer categorized as a hematic disease. It usually begins in the bone marrow due to the overproduction of immature White Blood Cells (WBCs or leukocytes). Since it has a high rate of spread to other body organs, it is a fatal disease if not diagnosed and treated early. Therefore, for identifying cancerous (ALL) cells in medical diagnostic laboratories, blood, as well as bone marrow smears, are taken by pathologists. However, manual examinations face limitations due to human error risk and time-consuming procedures. So, to tackle the mentioned issues, methods based on Artificial Intelligence (AI), capable of identifying cancer from non-cancer tissue, seem vital. Deep Neural Networks (DNNs) are the most efficient machine learning (ML) methods. These techniques employ multiple layers to extract higher-level features from the raw input. In this paper, a Convolutional Neural Network (CNN) is applied along with a new type of classifier, Higher Order Singular Value Decomposition (HOSVD), to categorize ALL and normal (healthy) cells from microscopic blood images. We employed the model on IoMT structure to identify leukemia quickly and safely. With the help of this new leukemia classification framework, patients and clinicians can have real-time communication. The model was implemented on the Acute Lymphoblastic Leukemia Image Database (ALL-IDB2) and achieved an average accuracy of %98.88 in the test step.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Roman Akramov, Artem Khamatullin, Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko
Machine Learning
ๆฅ็ๆ่ฆ
Choosing the number of topics $T$ in Latent Dirichlet Allocation (LDA) is a key design decision that strongly affects both the statistical fit and interpretability of topic models. In this work, we formulate the selection of $T$ as a discrete black-box optimization problem, where each function evaluation corresponds to training an LDA model and measuring its validation perplexity. Under a fixed evaluation budget, we compare four families of optimizers: two hand-designed evolutionary methods - Genetic Algorithm (GA) and Evolution Strategy (ES) - and two learned, amortized approaches, Preferential Amortized Black-Box Optimization (PABBO) and Sharpness-Aware Black-Box Optimization (SABBO). Our experiments show that, while GA, ES, PABBO, and SABBO eventually reach a similar band of final perplexity, the amortized optimizers are substantially more sample- and time-efficient. SABBO typically identifies a near-optimal topic number after essentially a single evaluation, and PABBO finds competitive configurations within a few evaluations, whereas GA and ES require almost the full budget to approach the same region.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jordan Abarca-Albores, Danna Cristina Gutiรฉrrez Cabrera, Luis Antonio Salazar-Licea, Dante Ruiz-Robles, Jesus Alejandro Franco et al.
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a novel methodology for detecting faults in wind turbine blades using com-putational learning techniques. The study evaluates two models: the first employs logistic regression, which outperformed neural networks, decision trees, and the naive Bayes method, demonstrating its effectiveness in identifying fault-related patterns. The second model leverages clustering and achieves superior performance in terms of precision and data segmentation. The results indicate that clustering may better capture the underlying data characteristics compared to supervised methods. The proposed methodology offers a new approach to early fault detection in wind turbine blades, highlighting the potential of integrating different computational learning techniques to enhance system reliability. The use of accessible tools like Orange Data Mining underscores the practical application of these advanced solutions within the wind energy sector. Future work will focus on combining these methods to improve detection accuracy further and extend the application of these techniques to other critical components in energy infrastructure.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Maeve Madigan, Parameswaran Kamalaruban, Glenn Moynihan, Tom Kempton, David Sutton et al.
ๆฅ็ๆ่ฆ
Multi-agent systems have demonstrated the ability to improve performance on a variety of predictive tasks by leveraging collaborative decision making. However, the lack of effective evaluation methodologies has made it difficult to estimate the risk of bias, making deployment of such systems unsafe in high stakes domains such as consumer finance, where biased decisions can translate directly into regulatory breaches and financial loss. To address this challenge, we need to develop fairness evaluation methodologies for multi-agent predictive systems and measure the fairness characteristics of these systems in the financial tabular domain. Examining fairness metrics using large-scale simulations across diverse multi-agent configurations, with varying communication and collaboration mechanisms, we reveal patterns of emergent bias in financial decision-making that cannot be traced to individual agent components, indicating that multi-agent systems may exhibit genuinely collective behaviors. Our findings highlight that fairness risks in financial multi-agent systems represent a significant component of model risk, with tangible impacts on tasks such as credit scoring and income estimation. We advocate that multi-agent decision systems must be evaluated as holistic entities rather than through reductionist analyses of their constituent components.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Filippo Zacchei, Paolo Conti, Attilio Alberto Frangi, Andrea Manzoni
Machine Learning
ๆฅ็ๆ่ฆ
Inverse uncertainty quantification (UQ) tasks such as parameter estimation are computationally demanding whenever dealing with physics-based models, and typically require repeated evaluations of complex numerical solvers. When partial differential equations are involved, full-order models such as those based on the Finite Element Method can make traditional sampling approaches like Markov Chain Monte Carlo (MCMC) computationally infeasible. Although data-driven surrogate models may help reduce evaluation costs, their utility is often limited by the expense of generating high-fidelity data. In contrast, low-fidelity data can be produced more efficiently, although relying on them alone may degrade the accuracy of the inverse UQ solution.
To address these challenges, we propose a Multi-Fidelity Delayed Acceptance scheme for Bayesian inverse problems. Extending the Multi-Level Delayed Acceptance framework, the method introduces multi-fidelity neural networks that combine the predictions of solvers of varying fidelity, with high fidelity evaluations restricted to an offline training stage. During the online phase, likelihood evaluations are obtained by evaluating the coarse solvers and passing their outputs to the trained neural networks, thereby avoiding additional high-fidelity simulations.
This construction allows heterogeneous coarse solvers to be incorporated consistently within the hierarchy, providing greater flexibility than standard Multi-Level Delayed Acceptance. The proposed approach improves the approximation accuracy of the low fidelity solvers, leading to longer sub-chain lengths, better mixing, and accelerated posterior inference. The effectiveness of the strategy is demonstrated on two benchmark inverse problems involving (i) steady isotropic groundwater flow, (ii) an unsteady reaction-diffusion system, for which substantial computational savings are obtained.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hao Tang, Jiongyu Zhu, Zimeng Feng, Hao Li, Chao Li
ๆฅ็ๆ่ฆ
Neural operators have emerged as powerful tools for learning mappings between function spaces, enabling efficient solutions to partial differential equations across varying inputs and domains. Despite the success, existing methods often struggle with non-periodic excitations, transient responses, and signals defined on irregular or non-Euclidean geometries. To address this, we propose a generalized operator learning framework based on a pole-residue decomposition enriched with exponential basis functions, enabling expressive modeling of aperiodic and decaying dynamics. Building on this formulation, we introduce the Geometric Laplace Neural Operator (GLNO), which embeds the Laplace spectral representation into the eigen-basis of the Laplace-Beltrami operator, extending operator learning to arbitrary Riemannian manifolds without requiring periodicity or uniform grids. We further design a grid-invariant network architecture (GLNONet) that realizes GLNO in practice. Extensive experiments on PDEs/ODEs and real-world datasets demonstrate our robust performance over other state-of-the-art models.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruifeng Xu, Liang He
Machine Learning
ๆฅ็ๆ่ฆ
Effective irrigation and nitrogen fertilization have a significant impact on crop yield. However, existing research faces two limitations: (1) the high complexity of optimizing water-nitrogen combinations during crop growth and poor yield optimization results; and (2) the difficulty in quantifying mild stress signals and the delayed feedback, which results in less precise dynamic regulation of water and nitrogen and lower resource utilization efficiency. To address these issues, we propose a Nested Dual-Agent Reinforcement Learning (NDRL) method. The parent agent in NDRL identifies promising macroscopic irrigation and fertilization actions based on projected cumulative yield benefits, reducing ineffective explorationwhile maintaining alignment between objectives and yield. The child agent's reward function incorporates quantified Water Stress Factor (WSF) and Nitrogen Stress Factor (NSF), and uses a mixed probability distribution to dynamically optimize daily strategies, thereby enhancing both yield and resource efficiency. We used field experiment data from 2023 and 2024 to calibrate and validate the Decision Support System for Agrotechnology Transfer (DSSAT) to simulate real-world conditions and interact with NDRL. Experimental results demonstrate that, compared to the best baseline, the simulated yield increased by 4.7% in both 2023 and 2024, the irrigation water productivity increased by 5.6% and 5.1% respectively, and the nitrogen partial factor productivity increased by 6.3% and 1.0% respectively. Our method advances the development of cotton irrigation and nitrogen fertilization, providing new ideas for addressing the complexity and precision issues in agricultural resource management and for sustainable agricultural development.
๐
2025-12-18
๐ ArXiv Preprint (math.PR)
๐ฅ Mihriban Ceylan, David J. Prรถmel
ๆฅ็ๆ่ฆ
We establish $L^p$-type universal approximation theorems for general and non-anticipative functionals on suitable rough path spaces, showing that linear functionals acting on signatures of time-extended rough paths are dense with respect to an $L^p$-distance. To that end, we derive global universal approximation theorems for weighted rough path spaces. We demonstrate that these $L^p$-type universal approximation theorems apply in particular to Brownian motion. As a consequence, linear functionals on the signature of the time-extended Brownian motion can approximate any $p$-integrable stochastic process adapted to the Brownian filtration, including solutions to stochastic differential equations.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee
ๆฅ็ๆ่ฆ
Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages known observations such as 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g. tile-level operations), across both prefill and decode attention. The Top-k selection and reuse in Kascade is head-aware and we show in our experiments that this is critical for high accuracy. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over FlashAttention-3 baseline on H100 GPUs while closely matching dense attention accuracy on long-context benchmarks such as LongBench and AIME-24.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhenjiang Zhao, Takahisa Toda, Takashi Kitamura
Machine Learning
ๆฅ็ๆ่ฆ
This work focuses on quantitative verification of fairness in tree ensembles. Unlike traditional verification approaches that merely return a single counterexample when the fairness is violated, quantitative verification estimates the ratio of all counterexamples and characterizes the regions where they occur, which is important information for diagnosing and mitigating bias. To date, quantitative verification has been explored almost exclusively for deep neural networks (DNNs). Representative methods, such as DeepGemini and FairQuant, all build on the core idea of Counterexample-Guided Abstraction Refinement, a generic framework that could be adapted to other model classes. We extended the framework into a model-agnostic form, but discovered two limitations: (i) it can provide only lower bounds, and (ii) its performance scales poorly. Exploiting the discrete structure of tree ensembles, our work proposes an efficient quantification technique that delivers any-time upper and lower bounds. Experiments on five widely used datasets demonstrate its effectiveness and efficiency. When applied to fairness testing, our quantification method significantly outperforms state-of-the-art testing techniques.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Takuya Kanazawa
ๆฅ็ๆ่ฆ
Quantifying predictive uncertainty is essential for safe and trustworthy real-world AI deployment. Yet, fully nonparametric estimation of conditional distributions remains challenging for multivariate targets. We propose Tomographic Quantile Forests (TQF), a nonparametric, uncertainty-aware, tree-based regression model for multivariate targets. TQF learns conditional quantiles of directional projections $\mathbf{n}^{\top}\mathbf{y}$ as functions of the input $\mathbf{x}$ and the unit direction $\mathbf{n}$. At inference, it aggregates quantiles across many directions and reconstructs the multivariate conditional distribution by minimizing the sliced Wasserstein distance via an efficient alternating scheme with convex subproblems. Unlike classical directional-quantile approaches that typically produce only convex quantile regions and require training separate models for different directions, TQF covers all directions with a single model without imposing convexity restrictions. We evaluate TQF on synthetic and real-world datasets, and release the source code on GitHub.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Early prediction of battery cycle life is essential for accelerating battery research, manufacturing, and deployment. Although machine learning methods have shown encouraging results, progress is hindered by data scarcity and heterogeneity arising from diverse aging conditions. In other fields, foundation models (FMs) trained on diverse datasets have achieved broad generalization through transfer learning, but no FMs have been reported for battery cycle life prediction yet. Here we present the Pretrained Battery Transformer (PBT), the first FM for battery life prediction, developed through domain-knowledge-encoded mixture-of-expert layers. Validated on the largest public battery life database, PBT learns transferable representations from 13 lithium-ion battery (LIB) datasets, outperforming existing models by an average of 19.8%. With transfer learning, PBT achieves state-of-the-art performance across 15 diverse datasets encompassing various operating conditions, formation protocols, and chemistries of LIBs. This work establishes a foundation model pathway for battery lifetime prediction, paving the way toward universal battery lifetime prediction systems.
๐
2025-12-18
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ P. Urdeitx, I. Alfaro, D. Gonzalez, F. Chinesta, E. Cueto
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
It is well known that the lack of information about certain variables necessary for the description of a dynamical system leads to the introduction of historical dependence (lack of Markovian character of the model) and noise. Traditionally, scientists have made up for these shortcomings by designing phenomenological variables that take into account this historical dependence (typically, conformational tensors in fluids). Often, these phenomenological variables are not easily measurable experimentally. In this work, we study to what extent Transformer architectures are able to cope with the lack of experimental data on these variables. The methodology is evaluated on three benchmark problems: a cylinder flow with no history dependence, a viscoelastic Couette flow modeled via the Oldroyd-B formalism, and a non-linear polymeric fluid described by the FENE model. Our results show that the Transformer outperforms a thermodynamically consistent, structure-preserving neural network with metriplectic bias in systems with missing experimental data, providing lower errors even in low-dimensional latent spaces. In contrast, for systems whose state variables can be fully known, the metriplectic model achieves superior performance.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Taozhao Chen, Linghan Huang, Kim-Kwang Raymond Choo, Huaming Chen
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
As large language models (LLMs) are increasingly adopted in safety-critical and regulated sectors, the retention of sensitive or prohibited knowledge introduces escalating risks, ranging from privacy leakage to regulatory non-compliance to to potential misuse, and so on. Recent studies suggest that machine unlearning can help ensure deployed models comply with evolving legal, safety, and governance requirements. However, current unlearning techniques assume clean separation between forget and retain datasets, which is challenging in operational settings characterized by highly entangled distributions. In such scenarios, perturbation-based methods often degrade general model utility or fail to ensure safety. To address this, we propose Selective Representation Misdirection for Unlearning (SRMU), a novel principled activation-editing framework that enforces feature-aware and directionally controlled perturbations. Unlike indiscriminate model weights perturbations, SRMU employs a structured misdirection vector with an activation importance map. The goal is to allow SRMU selectively suppresses harmful representations while preserving the utility on benign ones. Experiments are conducted on the widely used WMDP benchmark across low- and high-entanglement configurations. Empirical results reveal that SRMU delivers state-of-the-art unlearning performance with minimal utility losses, and remains effective under 20-30\% overlap where existing baselines collapse. SRMU provides a robust foundation for safety-driven model governance, privacy compliance, and controlled knowledge removal in the emerging LLM-based applications. We release the replication package at https://figshare.com/s/d5931192a8824de26aff.
๐
2025-12-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Zhexi Lu, Hongliang Chi, Nathalie Baracaldo, Swanand Ravindra Kadhe, Yuseok Jeon et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. While prior black-box MIA techniques rely on confidence scores or token likelihoods, these signals are often entangled with a sample's intrinsic properties - such as content difficulty or rarity - leading to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we propose In-Context Probing (ICP), a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. We propose two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, PEFT configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jinhao Zhang, Yunquan Zhang, Daning Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Current mainstream post-training quantization methods for large language models typically apply a uniform quantization strategy across all network layers, overlooking the substantial differences in algorithmic suitability among layers. To address this limitation, we propose CKA Guided Modular Quantization, a fine-tuning-free, plug-and-play framework for algorithmic heterogeneous quantization. Our method independently evaluates multiple PTQ algorithms on each layer and employs Linear Centered Kernel Alignment (CKA) as a metric to automatically select the optimal quantization strategy per layer. The individually optimized strategies are then integrated to construct a hybrid quantized model. Experiments demonstrate that our approach consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods across mainstream LLMs including LLaMA and Qwen ,in terms of perplexity (PPL) and downstream task performance.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jialiang Wang, Xueyan Bao, Hao Wu
Machine Learning
ๆฅ็ๆ่ฆ
Second-order Latent Factor (SLF) model, a class of low-rank representation learning methods, has proven effective at extracting node-to-node interaction patterns from High-dimensional and Incomplete (HDI) data. However, its optimization is notoriously difficult due to its bilinear and non-convex nature. Sharpness-aware Minimization (SAM) has recently proposed to find flat local minima when minimizing non-convex objectives, thereby improving the generalization of representation-learning models. To address this challenge, we propose a Sharpness-aware SLF (SSLF) model. SSLF embodies two key ideas: (1) acquiring second-order information via Hessian-vector products; and (2) injecting a sharpness term into the curvature (Hessian) through the designed Hessian-vector products. Experiments on multiple industrial datasets demonstrate that the proposed model consistently outperforms state-of-the-art baselines.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Paloma Casteleiro Costa, Parnian Ghapandar Kashani, Xuhui Liu, Alexander Chen, Ary Portes et al.
Machine Learning
ๆฅ็ๆ่ฆ
Fluorescence lifetime imaging microscopy (FLIM) is a powerful quantitative technique that provides metabolic and molecular contrast, offering strong translational potential for label-free, real-time diagnostics. However, its clinical adoption remains limited by long pixel dwell times and low signal-to-noise ratio (SNR), which impose a stricter resolution-speed trade-off than conventional optical imaging approaches. Here, we introduce FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution (PSR) framework that reconstructs high-resolution FLIM images from data acquired with up to a 5-fold increased pixel size. The model is trained using the conditional generative adversarial network (cGAN) framework, which, compared to diffusion model-based alternatives, delivers a more robust PSR reconstruction with substantially shorter inference times, a crucial advantage for practical deployment. FLIM_PSR_k not only enables faster image acquisition but can also alleviate SNR limitations in autofluorescence-based FLIM. Blind testing on held-out patient-derived tumor tissue samples demonstrates that FLIM_PSR_k reliably achieves a super-resolution factor of k = 5, resulting in a 25-fold increase in the space-bandwidth product of the output images and revealing fine architectural features lost in lower-resolution inputs, with statistically significant improvements across various image quality metrics. By increasing FLIM's effective spatial resolution, FLIM_PSR_k advances lifetime imaging toward faster, higher-resolution, and hardware-flexible implementations compatible with low-numerical-aperture and miniaturized platforms, better positioning FLIM for translational applications.
๐
2025-12-18
๐ ArXiv Preprint (q-fin.PR)
๐ฅ Bong-Gyu Jang, Younwoo Jeong, Changeun Kim
Machine Learning
ๆฅ็ๆ่ฆ
We introduce the \textit{Consensus-Bottleneck Asset Pricing Model} (CB-APM), a partially interpretable neural network that replicates the reasoning processes of sell-side analysts by capturing how dispersed investor beliefs are compressed into asset prices through a consensus formation process. By modeling this ``bottleneck'' to summarize firm- and macro-level information, CB-APM not only predicts future risk premiums of U.S. equities but also links belief aggregation to expected returns in a structurally interpretable manner. The model improves long-horizon return forecasts and outperforms standard deep learning approaches in both predictive accuracy and explanatory power. Comprehensive portfolio analyses show that CB-APM's out-of-sample predictions translate into economically meaningful payoffs, with monotonic return differentials and stable long-short performance across regularization settings. Empirically, CB-APM leverages consensus as a regularizer to amplify long-horizon predictability and yields interpretable consensus-based components that clarify how information is priced in returns. Moreover, regression and GRS-based pricing diagnostics reveal that the learned consensus representations capture priced variation only partially spanned by traditional factor models, demonstrating that CB-APM uncovers belief-driven structure in expected returns beyond the canonical factor space. Overall, CB-APM provides an interpretable and empirically grounded framework for understanding belief-driven return dynamics.
๐
2025-12-18
๐ WSDM
๐ฅ Ruiyu Li, Peige Zhao, Guangxia Li, Pengcheng Wu, Xingyu Gao et al.
Machine Learning
ๆฅ็ๆ่ฆ
One of many impediments to applying graph neural networks (GNNs) to large-scale real-world graph data is the challenge of centralized training, which requires aggregating data from different organizations, raising privacy concerns. Federated graph learning (FGL) addresses this by enabling collaborative GNN model training without sharing private data. However, a core challenge in FGL systems is the variation in local training data distributions among clients, known as the data heterogeneity problem. Most existing solutions suffer from two problems: (1) The typical optimizer based on empirical risk minimization tends to cause local models to fall into sharp valleys and weakens their generalization to out-of-distribution graph data. (2) The prevalent dimensional collapse in the learned representations of local graph data has an adverse impact on the classification capacity of the GNN model. To this end, we formulate a novel optimization objective that is aware of the sharpness (i.e., the curvature of the loss surface) of local GNN models. By minimizing the loss function and its sharpness simultaneously, we seek out model parameters in a flat region with uniformly low loss values, thus improving the generalization over heterogeneous data. By introducing a regularizer based on the correlation matrix of local representations, we relax the correlations of representations generated by individual local graph samples, so as to alleviate the dimensional collapse of the learned model. The proposed \textbf{S}harpness-aware f\textbf{E}derated gr\textbf{A}ph \textbf{L}earning (SEAL) algorithm can enhance the classification accuracy and generalization ability of local GNN models in federated graph learning. Experimental studies on several graph classification benchmarks show that SEAL consistently outperforms SOTA FGL baselines and provides gains for more participants.
๐
2025-12-18
๐ AAAI 2026
๐ฅ Xueqi Ma, Xingjun Ma, Sarah Monazam Erfani, Danilo Mandic, James Bailey
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Developing open-set classification methods capable of classifying in-distribution (ID) data while detecting out-of-distribution (OOD) samples is essential for deploying graph neural networks (GNNs) in open-world scenarios. Existing methods typically treat all OOD samples as a single class, despite real-world applications, especially high-stake settings such as fraud detection and medical diagnosis, demanding deeper insights into OOD samples, including their probable labels. This raises a critical question: can OOD detection be extended to OOD classification without true label information? To address this question, we propose a Coarse-to-Fine open-set Classification (CFC) framework that leverages large language models (LLMs) for graph datasets. CFC consists of three key components: a coarse classifier that uses LLM prompts for OOD detection and outlier label generation, a GNN-based fine classifier trained with OOD samples identified by the coarse classifier for enhanced OOD detection and ID classification, and refined OOD classification achieved through LLM prompts and post-processed OOD labels. Unlike methods that rely on synthetic or auxiliary OOD samples, CFC employs semantic OOD instances that are genuinely out-of-distribution based on their inherent meaning, improving interpretability and practical utility. Experimental results show that CFC improves OOD detection by ten percent over state-of-the-art methods on graph and text domains and achieves up to seventy percent accuracy in OOD classification on graph datasets.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Noriaki Sato, Marco Scutari, Shuichi Kawano, Rui Yamaguchi, Seiya Imoto
Machine Learning
ๆฅ็ๆ่ฆ
We address network structure learning from zero-inflated count data by casting each node as a zero-inflated generalized linear model and optimizing a smooth, score-based objective under a directed acyclic graph constraint. Our Zero-Inflated Continuous Optimization (ZICO) approach uses node-wise likelihoods with canonical links and enforces acyclicity through a differentiable surrogate constraint combined with sparsity regularization. ZICO achieves superior performance with faster runtimes on simulated data. It also performs comparably to or better than common algorithms for reverse engineering gene regulatory networks. ZICO is fully vectorized and mini-batched, enabling learning on larger variable sets with practical runtimes in a wide range of domains.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lorenzo Nava, Ye Chen, Maximillian Van Wyk de Vries
Machine Learning
ๆฅ็ๆ่ฆ
Predicting geohazard runout is critical for protecting lives, infrastructure and ecosystems. Rapid mass flows, including landslides and avalanches, cause several thousand deaths across a wide range of environments, often travelling many kilometres from their source. The wide range of source conditions and material properties governing these flows makes their runout difficult to anticipate, particularly for downstream communities that may be suddenly exposed to severe impacts. Accurately predicting runout at scale requires models that are both physically realistic and computationally efficient, yet existing approaches face a fundamental speed-realism trade-off. Here we train a machine learning model to predict geohazard runout across representative real world terrains. The model predicts both flow extent and deposit thickness with high accuracy and 100 to 10,000 times faster computation than numerical solvers. It is trained on over 100,000 numerical simulations across over 10,000 real world digital elevation model chips and reproduces key physical behaviours, including avulsion and deposition patterns, while generalizing across different flow types, sizes and landscapes. Our results demonstrate that neural emulation enables rapid, spatially resolved runout prediction across diverse real world terrains, opening new opportunities for disaster risk reduction and impact-based forecasting. These results highlight neural emulation as a promising pathway for extending physically realistic geohazard modelling to spatial and temporal scales relevant for large scale early warning systems.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haishan Ye
Machine Learning
ๆฅ็ๆ่ฆ
Rank-based zeroth-order (ZO) optimization -- which relies only on the ordering of function evaluations -- offers strong robustness to noise and monotone transformations, and underlies many successful algorithms such as CMA-ES, natural evolution strategies, and rank-based genetic algorithms. Despite its widespread use, the theoretical understanding of rank-based ZO methods remains limited: existing analyses provide only asymptotic insights and do not yield explicit convergence rates for algorithms selecting the top-$k$ directions.
This work closes this gap by analyzing a simple rank-based ZO algorithm and establishing the first \emph{explicit}, and \emph{non-asymptotic} query complexities. For a $d$-dimension problem, if the function is $L$-smooth and $ฮผ$-strongly convex, the algorithm achieves $\widetilde{\mathcal O}\!\left(\frac{dL}ฮผ\log\!\frac{dL}{ฮผฮด}\log\!\frac{1}{\varepsilon}\right)$ to find an $\varepsilon$-suboptimal solution, and for smooth nonconvex objectives it reaches $\mathcal O\!\left(\frac{dL}{\varepsilon}\log\!\frac{1}{\varepsilon}\right)$. Notation $\cO(\cdot)$ hides constant terms and $\widetilde{\mathcal O}(\cdot)$ hides extra $\log\log\frac{1}{\varepsilon}$ term. These query complexities hold with a probability at least $1-ฮด$ with $0<ฮด<1$. The analysis in this paper is novel and avoids classical drift and information-geometric techniques. Our analysis offers new insight into why rank-based heuristics lead to efficient ZO optimization.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jianping Mei, Siqi Ai, Ye Yuan
Machine Learning
ๆฅ็ๆ่ฆ
Spatial transcriptomics enables genome-wide expression analysis within native tissue context, yet identifying spatial domains remains challenging due to complex gene-spatial interactions. Existing methods typically process spatial and feature views separately, fusing only at output level - an "encode-separately, fuse-late" paradigm that limits multi-scale semantic capture and cross-view interaction. Accordingly, stMFG is proposed, a multi-scale interactive fusion graph network that introduces layer-wise cross-view attention to dynamically integrate spatial and gene features after each convolution. The model combines cross-view contrastive learning with spatial constraints to enhance discriminability while maintaining spatial continuity. On DLPFC and breast cancer datasets, stMFG outperforms state-of-the-art methods, achieving up to 14% ARI improvement on certain slices.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jaeho Yang, Kijung Yoon
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Early and accessible detection of Alzheimer's disease (AD) remains a critical clinical challenge, and cube-copying tasks offer a simple yet informative assessment of visuospatial function. This work proposes a multimodal framework that converts hand-drawn cube sketches into graph-structured representations capturing geometric and topological properties, and integrates these features with demographic information and neuropsychological test (NPT) scores for AD classification. Cube drawings are modeled as graphs with node features encoding spatial coordinates, local graphlet-based topology, and angular geometry, which are processed using graph neural networks and fused with age, education, and NPT features in a late-fusion model. Experimental results show that graph-based representations provide a strong unimodal baseline and substantially outperform pixel-based convolutional models, while multimodal integration further improves performance and robustness to class imbalance. SHAP-based interpretability analysis identifies specific graphlet motifs and geometric distortions as key predictors, closely aligning with clinical observations of disorganized cube drawings in AD. Together, these results establish graph-based analysis of cube copying as an interpretable, non-invasive, and scalable approach for Alzheimer's disease screening.
๐
2025-12-18
๐ ArXiv Preprint (astro-ph.EP)
๐ฅ Jiawei Gao, Chuanfei Dong, Chi Zhang, Yilan Qin, Simin Shekarpaz et al.
Machine Learning
ๆฅ็ๆ่ฆ
Understanding the magnetic field environment around Mars and its response to upstream solar wind conditions provide key insights into the processes driving atmospheric ion escape. To date, global models of Martian induced magnetosphere have been exclusively physics-based, relying on computationally intensive simulations. For the first time, we develop a data-driven model of the Martian induced magnetospheric magnetic field using Physics-Informed Neural Network (PINN) combined with MAVEN observations and physical laws. Trained under varying solar wind conditions, including B_IMF, P_SW, and ฮธ_cone, the data-driven model accurately reconstructs the three-dimensional magnetic field configuration and its variability in response to upstream solar wind drivers. Based on the PINN results, we identify key dependencies of magnetic field configuration on solar wind parameters, including the hemispheric asymmetries of the draped field line strength in the Mars-Solar-Electric coordinates. These findings demonstrate the capability of PINNs to reconstruct complex magnetic field structures in the Martian induced magnetosphere, thereby offering a promising tool for advancing studies of solar wind-Mars interactions.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Karthikeyan K, Philip Wu, Xin Tang, Alexandre Alves
ๆฅ็ๆ่ฆ
The Science Consultant Agent is a web-based Artificial Intelligence (AI) tool that helps practitioners select and implement the most effective modeling strategy for AI-based solutions. It operates through four core components: Questionnaire, Smart Fill, Research-Guided Recommendation, and Prototype Builder. By combining structured questionnaires, literature-backed solution recommendations, and prototype generation, the Science Consultant Agent accelerates development for everyone from Product Managers and Software Developers to Researchers. The full pipeline is illustrated in Figure 1.
๐
2025-12-18
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Liu Qian, Yue Li, Ying Xie, Jian Zhang, Pai Li et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Catalyst design is crucial for materials synthesis, especially for complex reaction networks. Strategies like collaborative catalytic systems and multifunctional catalysts are effective but face challenges at the nanoscale. Carbon nanotube synthesis contains complicated nanoscale catalytic reactions, thus achieving high-density, high-quality semiconducting CNTs demands innovative catalyst design. In this work, we present a holistic framework integrating machine learning into traditional catalyst design for semiconducting CNT synthesis. It combines knowledge-based insights with data-driven techniques. Three key components, including open-access electronic structure databases for precise physicochemical descriptors, pre-trained natural language processing-based embedding model for higher-level abstractions, and physical - driven predictive models based on experiment data, are utilized. Through this framework, a new method for selective semiconducting CNT synthesis via catalyst - mediated electron injection, tuned by light during growth, is proposed. 54 candidate catalysts are screened, and three with high potential are identified. High-throughput experiments validate the predictions, with semiconducting selectivity exceeding 91% and the FeTiO3 catalyst reaching 98.6%. This approach not only addresses semiconducting CNT synthesis but also offers a generalizable methodology for global catalyst design and nanomaterials synthesis, advancing materials science in precise control.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Prime Intellect Team, Mika Senghaas, Fares Obeid, Sami Jaghouar, William Brown et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present INTELLECT-3, a 106B-parameter Mixture-of-Experts model (12B active) trained with large-scale reinforcement learning on our end-to-end RL infrastructure stack. INTELLECT-3 achieves state of the art performance for its size across math, code, science and reasoning benchmarks, outperforming many larger frontier models. We open-source the model together with the full infrastructure stack used to create it, including RL frameworks, complete recipe, and a wide collection of environments, built with the verifiers library, for training and evaluation from our Environments Hub community platform. Built for this effort, we introduce prime-rl, an open framework for large-scale asynchronous reinforcement learning, which scales seamlessly from a single node to thousands of GPUs, and is tailored for agentic RL with first-class support for multi-turn interactions and tool use. Using this stack, we run both SFT and RL training on top of the GLM-4.5-Air-Base model, scaling RL training up to 512 H200s with high training efficiency.
๐
2025-12-18
๐ ArXiv Preprint (cs.DC)
๐ฅ Jian Tian, Shuailong Li, Yang Cao, Wenbo Cui, Minghan Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The evolution of Large Language Model (LLM) serving towards complex, distributed architectures--specifically the P/D-separated, large-scale DP+EP paradigm--introduces distinct scheduling challenges. Unlike traditional deployments where schedulers can treat instances as black boxes, DP+EP architectures exhibit high internal synchronization costs. We identify that immediate request dispatching in such systems leads to severe in-engine queuing and parallelization bubbles, degrading Time-to-First-Token (TTFT). To address this, we propose Staggered Batch Scheduling (SBS), a mechanism that deliberately buffers requests to form optimal execution batches. This temporal decoupling eliminates internal queuing bubbles without compromising throughput. Furthermore, leveraging the scheduling window created by buffering, we introduce a Load-Aware Global Allocation strategy that balances computational load across DP units for both Prefill and Decode phases. Deployed on a production H800 cluster serving Deepseek-V3, our system reduces TTFT by 30%-40% and improves throughput by 15%-20% compared to state-of-the-art immediate scheduling baselines.
๐
2025-12-18
๐ AAAI 2026
๐ฅ Lulu Xue, Shengshan Hu, Linqiang Qian, Peijin Guo, Yechao Zhang et al.
ๆฅ็ๆ่ฆ
Machine unlearning is a newly popularized technique for removing specific training data from a trained model, enabling it to comply with data deletion requests. While it protects the rights of users requesting unlearning, it also introduces new privacy risks. Prior works have primarily focused on the privacy of data that has been unlearned, while the risks to retained data remain largely unexplored. To address this gap, we focus on the privacy risks of retained data and, for the first time, reveal the vulnerabilities introduced by machine unlearning under the dual-view setting, where an adversary can query both the original and the unlearned models. From an information-theoretic perspective, we introduce the concept of {privacy knowledge gain} and demonstrate that the dual-view setting allows adversaries to obtain more information than querying either model alone, thereby amplifying privacy leakage. To effectively demonstrate this threat, we propose DVIA, a Dual-View Inference Attack, which extracts membership information on retained data using black-box queries to both models. DVIA eliminates the need to train an attack model and employs a lightweight likelihood ratio inference module for efficient inference. Experiments across different datasets and model architectures validate the effectiveness of DVIA and highlight the privacy risks inherent in the dual-view setting.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hamed Ajorlou, Samuel Rey, Gonzalo Mateos, Geert Leus, Antonio G. Marques
Machine Learning
ๆฅ็ๆ่ฆ
Learning the structure of directed acyclic graphs (DAGs) from observational data is a central problem in causal discovery, statistical signal processing, and machine learning. Under a linear Gaussian structural equation model (SEM) with equal noise variances, the problem is identifiable and we show that the ensemble precision matrix of the observations exhibits a distinctive structure that facilitates DAG recovery. Exploiting this property, we propose BUILD (Bottom-Up Inference of Linear DAGs), a deterministic stepwise algorithm that identifies leaf nodes and their parents, then prunes the leaves by removing incident edges to proceed to the next step, exactly reconstructing the DAG from the true precision matrix. In practice, precision matrices must be estimated from finite data, and ill-conditioning may lead to error accumulation across BUILD steps. As a mitigation strategy, we periodically re-estimate the precision matrix (with less variables as leaves are pruned), trading off runtime for enhanced robustness. Reproducible results on challenging synthetic benchmarks demonstrate that BUILD compares favorably to state-of-the-art DAG learning algorithms, while offering an explicit handle on complexity.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Sophia Seulkee Kang, Franรงois-Xavier Briol, Toni Karvonen, Zonghao Chen
ๆฅ็ๆ่ฆ
This paper addresses the challenging computational problem of estimating intractable expectations over discrete domains. Existing approaches, including Monte Carlo and Russian Roulette estimators, are consistent but often require a large number of samples to achieve accurate results. We propose a novel estimator, \emph{BayesSum}, which is an extension of Bayesian quadrature to discrete domains. It is more sample efficient than alternatives due to its ability to make use of prior information about the integrand through a Gaussian process. We show this through theory, deriving a convergence rate significantly faster than Monte Carlo in a broad range of settings. We also demonstrate empirically that our proposed method does indeed require fewer samples on several synthetic settings as well as for parameter estimation for Conway-Maxwell-Poisson and Potts models.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sandeep Neela
MultimodalRobotics
ๆฅ็ๆ่ฆ
Market manipulation now routinely originates from coordinated social media campaigns, not isolated trades. Retail investors, regulators, and brokerages need tools that connect online narratives and coordination patterns to market behavior. We present AIMM, an AI-driven framework that fuses Reddit activity, bot and coordination indicators, and OHLCV market features into a daily AIMM Manipulation Risk Score for each ticker.
The system uses a parquet-native pipeline with a Streamlit dashboard that allows analysts to explore suspicious windows, inspect underlying posts and price action, and log model outputs over time. Due to Reddit API restrictions, we employ calibrated synthetic social features matching documented event characteristics; market data (OHLCV) uses real historical data from Yahoo Finance. This release makes three contributions. First, we build the AIMM Ground Truth dataset (AIMM-GT): 33 labeled ticker-days spanning eight equities, drawing from SEC enforcement actions, community-verified manipulation cases, and matched normal controls. Second, we implement forward-walk evaluation and prospective prediction logging for both retrospective and deployment-style assessment. Third, we analyze lead times and show that AIMM flagged GME 22 days before the January 2021 squeeze peak.
The current labeled set is small (33 ticker-days, 3 positive events), but results show preliminary discriminative capability and early warnings for the GME incident. We release the code, dataset schema, and dashboard design to support research on social media-driven market surveillance.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica et al.
Machine Learning
ๆฅ็ๆ่ฆ
We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations.
We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Saeed Mahloujifar, Narine Kokhlikyan, Chuan Guo, Kamalika Chaudhuri
Computer Vision
ๆฅ็ๆ่ฆ
Image data collected in the wild often contains private information such as faces and license plates, and responsible data release must ensure that this information stays hidden. At the same time, released data should retain its usefulness for model-training. The standard method for private information obfuscation in images is Gaussian blurring. In this work, we show that practical implementations of Gaussian blurring are reversible enough to break privacy. We then take a closer look at the privacy-utility tradeoffs offered by three other obfuscation algorithms -- pixelization, pixelization and noise addition (DP-Pix), and cropping. Privacy is evaluated by reversal and discrimination attacks, while utility by the quality of the learnt representations when the model is trained on data with obfuscated faces. We show that the most popular industry-standard method, Gaussian blur is the least private of the four -- being susceptible to reversal attacks in its practical low-precision implementations. In contrast, pixelization and pixelization plus noise addition, when used at the right level of granularity, offer both privacy and utility for a number of computer vision tasks. We make our proposed methods together with suggested parameters available in a software package called Privacy Blur.
๐
2025-12-18
๐ ArXiv Preprint (cs.DB)
๐ฅ Thanh Dat Hoang, Thanh Tam Nguyen, Thanh Trung Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen
Natural Language Processing
ๆฅ็ๆ่ฆ
Most modern Text2SQL systems prompt large language models (LLMs) with entire schemas -- mostly column information -- alongside the user's question. While effective on small databases, this approach fails on real-world schemas that exceed LLM context limits, even for commercial models. The recent Spider 2.0 benchmark exemplifies this with hundreds of tables and tens of thousands of columns, where existing systems often break. Current mitigations either rely on costly multi-step prompting pipelines or filter columns by ranking them against user's question independently, ignoring inter-column structure. To scale existing systems, we introduce \toolname, an open-source, LLM-efficient schema filtering framework that compacts Text2SQL prompts by (i) ranking columns with a query-aware LLM encoder enriched with values and metadata, (ii) reranking inter-connected columns via a lightweight graph transformer over functional dependencies, and (iii) selecting a connectivity-preserving sub-schema with a Steiner-tree heuristic. Experiments on real datasets show that \toolname achieves near-perfect recall and higher precision than CodeS, SchemaExP, Qwen rerankers, and embedding retrievers, while maintaining sub-second median latency and scaling to schemas with 23,000+ columns. Our source code is available at https://github.com/thanhdath/grast-sql.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Tang, Hanyu Liu, Alessandro Perelli, Xi Chen, Chao Li
ๆฅ็ๆ่ฆ
Diffusion MRI (dMRI) is a critical non-invasive technique to estimate fiber orientation distribution (FOD) for characterizing white matter integrity. Estimating FOD from single-shell low angular resolution dMRI (LAR-FOD) is limited by accuracy, whereas estimating FOD from multi-shell high angular resolution dMRI (HAR-FOD) requires a long scanning time, which limits its applicability. Diffusion models have shown promise in estimating HAR-FOD based on LAR-FOD. However, using diffusion models to efficiently generate HAR-FOD is challenging due to the large number of spherical harmonic (SH) coefficients in FOD. Here, we propose a 3D multi-channel patch diffusion model to predict HAR-FOD from LAR-FOD. We design the FOD-patch adapter by introducing the prior brain anatomy for more efficient patch-based learning. Furthermore, we introduce a voxel-level conditional coordinating module to enhance the global understanding of the model. We design the SH attention module to effectively learn the complex correlations of the SH coefficients. Our experimental results show that our method achieves the best performance in HAR-FOD prediction and outperforms other state-of-the-art methods.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shao-Ting Chiu, Aditya Nambiar, Ali Syed, Jonathan W. Siegel, Ulisses Braga-Neto
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In-context Learning (ICL) is the remarkable capability displayed by some machine learning models to learn from examples in a prompt, without any further weight updates. ICL had originally been thought to emerge from the self-attention mechanism in autoregressive transformer architectures. DeepOSets is a non-autoregressive, non-attention based neural architecture that combines set learning via the DeepSets architecture with operator learning via Deep Operator Networks (DeepONets). In a previous study, DeepOSets was shown to display ICL capabilities in supervised learning problems. In this paper, we show that the DeepOSets architecture, with the appropriate modifications, is a multi-operator in-context learner that can recover the solution operator of a new PDE, not seen during training, from example pairs of parameter and solution placed in a user prompt, without any weight updates. Furthermore, we show that DeepOSets is a universal uniform approximator over a class of continuous operators, which we believe is the first result of its kind in the literature of scientific machine learning. This means that a single DeepOSets architecture exists that approximates in-context any continuous operator in the class to any fixed desired degree accuracy, given an appropriate number of examples in the prompt. Experiments with Poisson and reaction-diffusion forward and inverse boundary-value problems demonstrate the ability of the proposed model to use in-context examples to predict accurately the solutions corresponding to parameter queries for PDEs not seen during training.
๐
2025-12-18
๐ ArXiv Preprint (astro-ph.IM)
๐ฅ Sidharth Kannan, Pooyan Goodarzi, Evangelos E. Papalexakis, Jonathan W. Richardson
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, graph neural networks (GNNs) have shown tremendous promise in solving problems in high energy physics, materials science, and fluid dynamics. In this work, we introduce a new application for GNNs in the physical sciences: instrumentation design. As a case study, we apply GNNs to simulate models of the Laser Interferometer Gravitational-Wave Observatory (LIGO) and show that they are capable of accurately capturing the complex optical physics at play, while achieving runtimes 815 times faster than state of the art simulation packages. We discuss the unique challenges this problem provides for machine learning models. In addition, we provide a dataset of high-fidelity optical physics simulations for three interferometer topologies, which can be used as a benchmarking suite for future work in this direction.
๐
2025-12-18
๐ Accepted by IEEE Big Data 2025
๐ฅ Shu Wan, Reepal Shah, John Sabo, Huan Liu, K. Selรงuk Candan
Machine Learning
ๆฅ็ๆ่ฆ
Streamflow forecasting is crucial for water resource management and risk mitigation. While deep learning models have achieved strong predictive performance, they often overlook underlying physical processes, limiting interpretability and generalization. Recent causal learning approaches address these issues by integrating domain knowledge, yet they typically rely on fixed causal graphs that fail to adapt to data. We propose CauStream, a unified framework for causal spatiotemporal streamflow forecasting. CauSTream jointly learns (i) a runoff causal graph among meteorological forcings and (ii) a routing graph capturing dynamic dependencies across stations. We further establish identifiability conditions for these causal structures under a nonparametric setting. We evaluate CauSTream on three major U.S. river basins across three forecasting horizons. The model consistently outperforms prior state-of-the-art methods, with performance gaps widening at longer forecast windows, indicating stronger generalization to unseen conditions. Beyond forecasting, CauSTream also learns causal graphs that capture relationships among hydrological factors and stations. The inferred structures align closely with established domain knowledge, offering interpretable insights into watershed dynamics. CauSTream offers a principled foundation for causal spatiotemporal modeling, with the potential to extend to a wide range of scientific and environmental applications.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
๐
2025-12-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shubham Mishra, Samyek Jain, Gorang Mehrishi, Shiv Tiwari, Harsh Sharma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work address these issues independently but lack unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages : (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline is introduced which evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
๐
2025-12-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jacob Reiss, Shikshya Shiwakoti, Samuel Goldsmith, Ujjwal Pandit
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In today's information-driven world, access to scientific publications has become increasingly easy. At the same time, filtering through the massive volume of available research has become more challenging than ever. Graph Neural Networks (GNNs) and graph attention mechanisms have shown strong effectiveness in searching large-scale information databases, particularly when combined with modern large language models. In this paper, we propose an Attention-Based Subgraph Retriever, a GNN-as-retriever model that applies attention-based pruning to extract a refined subgraph, which is then passed to a large language model for advanced knowledge reasoning.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sullivan Castro, Artem Betlei, Thomas Di Martino, Nadir El Manouzi
ๆฅ็ๆ่ฆ
Modeling user purchase behavior is a critical challenge in display advertising systems, necessary for real-time bidding. The difficulty arises from the sparsity of positive user events and the stochasticity of user actions, leading to severe class imbalance and irregular event timing. Predictive systems usually rely on hand-crafted "counter" features, overlooking the fine-grained temporal evolution of user intent. Meanwhile, current sequential models extract direct sequential signal, missing useful event-counting statistics. We enhance deep sequential models with self-supervised pretraining strategies for display advertising. Especially, we introduce Abacus, a novel approach of predicting the empirical frequency distribution of user events. We further propose a hybrid objective unifying Abacus with sequential learning objectives, combining stability of aggregated statistics with the sequence modeling sensitivity. Experiments on two real-world datasets show that Abacus pretraining outperforms existing methods accelerating downstream task convergence, while hybrid approach yields up to +6.1% AUC compared to the baselines.
๐
2025-12-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xufeng Liang, Zhida Qin, Chong Zhang, Tianyu Huang, Gangyi Ding
Machine Learning
ๆฅ็ๆ่ฆ
Contrastive learning has demonstrated promising potential in recommender systems. Existing methods typically construct sparser views by randomly perturbing the original interaction graph, as they have no idea about the authentic user preferences. Owing to the sparse nature of recommendation data, this paradigm can only capture insufficient semantic information. To address the issue, we propose InfoDCL, a novel diffusion-based contrastive learning framework for recommendation. Rather than injecting randomly sampled Gaussian noise, we employ a single-step diffusion process that integrates noise with auxiliary semantic information to generate signals and feed them to the standard diffusion process to generate authentic user preferences as contrastive views. Besides, based on a comprehensive analysis of the mutual influence between generation and preference learning in InfoDCL, we build a collaborative training objective strategy to transform the interference between them into mutual collaboration. Additionally, we employ multiple GCN layers only during inference stage to incorporate higher-order co-occurrence information while maintaining training efficiency. Extensive experiments on five real-world datasets demonstrate that InfoDCL significantly outperforms state-of-the-art methods. Our InfoDCL offers an effective solution for enhancing recommendation performance and suggests a novel paradigm for applying diffusion method in contrastive learning frameworks.
๐
2025-12-18
๐ WSDM
๐ฅ Himanshu Gharat, Himanshi Agrawal, Gourab K. Patro
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have empowered AI agents with advanced capabilities for understanding, reasoning, and interacting across diverse tasks. The addition of memory further enhances them by enabling continuity across interactions, learning from past experiences, and improving the relevance of actions and responses over time; termed as memory-enhanced personalization. Although such personalization through memory offers clear benefits, it also introduces risks of bias. While several previous studies have highlighted bias in ML and LLMs, bias due to memory-enhanced personalized agents is largely unexplored. Using recruitment as an example use case, we simulate the behavior of a memory-enhanced personalized agent, and study whether and how bias is introduced and amplified in and across various stages of operation. Our experiments on agents using safety-trained LLMs reveal that bias is systematically introduced and reinforced through personalization, emphasizing the need for additional protective measures or agent guardrails in memory-enhanced LLM-based AI agents.
๐
2025-12-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Allard Oelen, Mohamad Yaser Jaradeh, Sรถren Auer
Natural Language Processing
ๆฅ็ๆ่ฆ
As the volume of published scholarly literature continues to grow, finding relevant literature becomes increasingly difficult. With the rise of generative Artificial Intelligence (AI), and particularly Large Language Models (LLMs), new possibilities emerge to find and explore literature. We introduce ASK (Assistant for Scientific Knowledge), an AI-driven scholarly literature search and exploration system that follows a neuro-symbolic approach. ASK aims to provide active support to researchers in finding relevant scholarly literature by leveraging vector search, LLMs, and knowledge graphs. The system allows users to input research questions in natural language and retrieve relevant articles. ASK automatically extracts key information and generates answers to research questions using a Retrieval-Augmented Generation (RAG) approach. We present an evaluation of ASK, assessing the system's usability and usefulness. Findings indicate that the system is user-friendly and users are generally satisfied while using the system.
๐
2025-12-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Shayan Azizi, Norihiro Okui, Masataka Nakahara, Ayumu Kubota, Hassan Habibi Gharakheili
Machine Learning
ๆฅ็ๆ่ฆ
Identifying devices such as cameras, printers, voice assistants, or health monitoring sensors, collectively known as the Internet of Things (IoT), within a network is a critical operational task, particularly to manage the cyber risks they introduce. While behavioral fingerprinting based on network traffic analysis has shown promise, most existing approaches rely on machine learning (ML) techniques applied to fine-grained features of short-lived traffic units (packets and/or flows). These methods tend to be computationally expensive, sensitive to traffic measurement errors, and often produce opaque inferences. In this paper, we propose a macroscopic, lightweight, and explainable alternative to behavioral fingerprinting focusing on the network services (e.g., TCP/80, UDP/53) that IoT devices use to perform their intended functions over extended periods. Our contributions are threefold. (1) We demonstrate that IoT devices exhibit stable and distinguishable patterns in their use of network services over a period of time. We formalize the notion of service-level fingerprints and derive a generalized method to represent network behaviors using a configurable granularity parameter. (2) We develop a procedure to extract service-level fingerprints, apply it to traffic from 13 consumer IoT device types in a lab testbed, and evaluate the resulting representations in terms of their convergence and recurrence properties. (3) We validate the efficacy of service-level fingerprints for device identification in closed-set and open-set scenarios. Our findings are based on a large dataset comprising about 10 million IPFIX flow records collected over a 1.5-year period.
๐
2025-12-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Tejul Pandit, Sakshi Mahendru, Meet Raval, Dhvani Upadhyay
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reranking is a critical stage in contemporary information retrieval (IR) systems, improving the relevance of the user-presented final results by honing initial candidate sets. This paper is a thorough guide to examine the changing reranker landscape and offer a clear view of the advancements made in reranking methods. We present a comprehensive survey of reranking models employed in IR, particularly within modern Retrieval Augmented Generation (RAG) pipelines, where retrieved documents notably influence output quality.
We embark on a chronological journey through the historical trajectory of reranking techniques, starting with foundational approaches, before exploring the wide range of sophisticated neural network architectures such as cross-encoders, sequence-generation models like T5, and Graph Neural Networks (GNNs) utilized for structural information. Recognizing the computational cost of advancing neural rerankers, we analyze techniques for enhancing efficiency, notably knowledge distillation for creating competitive, lighter alternatives. Furthermore, we map the emerging territory of integrating Large Language Models (LLMs) in reranking, examining novel prompting strategies and fine-tuning tactics. This survey seeks to elucidate the fundamental ideas, relative effectiveness, computational features, and real-world trade-offs of various reranking strategies. The survey provides a structured synthesis of the diverse reranking paradigms, highlighting their underlying principles and comparative strengths and weaknesses.
๐
2025-12-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Karthikeyan K, Philip Wu, Xin Tang, Alexandre Alves
ๆฅ็ๆ่ฆ
The Science Consultant Agent is a web-based Artificial Intelligence (AI) tool that helps practitioners select and implement the most effective modeling strategy for AI-based solutions. It operates through four core components: Questionnaire, Smart Fill, Research-Guided Recommendation, and Prototype Builder. By combining structured questionnaires, literature-backed solution recommendations, and prototype generation, the Science Consultant Agent accelerates development for everyone from Product Managers and Software Developers to Researchers. The full pipeline is illustrated in Figure 1.
๐
2025-12-18
๐ ArXiv Preprint (cs.DB)
๐ฅ Zhengyuan Dong, Victor Zhong, Renรฉe J. Miller
ๆฅ็ๆ่ฆ
We present ModelTables, a benchmark of tables in Model Lakes that captures the structured semantics of performance and configuration tables often overlooked by text only retrieval. The corpus is built from Hugging Face model cards, GitHub READMEs, and referenced papers, linking each table to its surrounding model and publication context. Compared with open data lake tables, model tables are smaller yet exhibit denser inter table relationships, reflecting tightly coupled model and benchmark evolution. The current release covers over 60K models and 90K tables. To evaluate model and table relatedness, we construct a multi source ground truth using three complementary signals: (1) paper citation links, (2) explicit model card links and inheritance, and (3) shared training datasets. We present one extensive empirical use case for the benchmark which is table search. We compare canonical Data Lake search operators (unionable, joinable, keyword) and Information Retrieval baselines (dense, sparse, hybrid retrieval) on this benchmark. Union based semantic table retrieval attains 54.8 % P@1 overall (54.6 % on citation, 31.3 % on inheritance, 30.6 % on shared dataset signals); table based dense retrieval reaches 66.5 % P@1, and metadata hybrid retrieval achieves 54.1 %. This evaluation indicates clear room for developing better table search methods. By releasing ModelTables and its creation protocol, we provide the first large scale benchmark of structured data describing AI model. Our use case of table discovery in Model Lakes, provides intuition and evidence for developing more accurate semantic retrieval, structured comparison, and principled organization of structured model knowledge. Source code, data, and other artifacts have been made available at https://github.com/RJMillerLab/ModelTables.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.
๐
2025-12-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, Sergey Levine
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) -- which trains a policy to directly match the actions played by the demonstrator -- can fail to ensure coverage over the demonstrator's actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator's behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator's actions, enabling more effective finetuning. Furthermore, this policy -- which we refer to as the posterior behavioral cloning (PostBC) policy -- achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show that PostBC is practically implementable with modern generative models in robotic control domains -- relying only on standard supervised learning -- and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen et al.
Multimodal
ๆฅ็ๆ่ฆ
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Jinghuan Shang, Harsh Patel, Ran Gong, Karl Schmeckpeper
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Synthetic 3D scenes are essential for developing Physical AI and generative models. Existing procedural generation methods often have low output throughput, creating a significant bottleneck in scaling up dataset creation. In this work, we introduce Sceniris, a highly efficient procedural scene generation framework for rapidly generating large-scale, collision-free scene variations. Sceniris also provides an optional robot reachability check, providing manipulation-feasible scenes for robot tasks. Sceniris is designed for maximum efficiency by addressing the primary performance limitations of the prior method, Scene Synthesizer. Leveraging batch sampling and faster collision checking in cuRobo, Sceniris achieves at least 234x speed-up over Scene Synthesizer. Sceniris also expands the object-wise spatial relationships available in prior work to support diverse scene requirements. Our code is available at https://github.com/rai-inst/sceniris
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne et al.
Robotics
ๆฅ็ๆ่ฆ
A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which has to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge the gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations provide a much stronger correlation to real world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Zihan Zhou, Animesh Garg, Ajay Mandlekar, Caelan Garrett
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Long-horizon manipulation has been a long-standing challenge in the robotics community. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches 80% success rate on all tasks with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contributes to an 89% average performance increase. More results and videos available in https://reinforcegen.github.io/
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaopeng Lin, Shijie Lian, Bin Yu, Ruoqi Yang, Changti Wu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most VLMs are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally capture rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable embodiment training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. An egocentric-aware embodied brain, termed PhysBrain, is obtained by training on the E2E-3M dataset. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. It provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher SimplerEnv success rates (53.9\%), demonstrating effective transfer from human egocentric supervision to downstream robot control.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tailed scenarios. Their cascaded design further propagates perception errors, degrading downstream planning and control. Vision-Action (VA) models address some limitations by learning direct mappings from visual inputs to actions, but they remain opaque, sensitive to distribution shifts, and lack structured reasoning or instruction-following capabilities. Recent progress in Large Language Models (LLMs) and multimodal learning has motivated the emergence of Vision-Language-Action (VLA) frameworks, which integrate perception with language-grounded decision making. By unifying visual understanding, linguistic reasoning, and actionable outputs, VLAs offer a pathway toward more interpretable, generalizable, and human-aligned driving policies. This work provides a structured characterization of the emerging VLA landscape for autonomous driving. We trace the evolution from early VA approaches to modern VLA frameworks and organize existing methods into two principal paradigms: End-to-End VLA, which integrates perception, reasoning, and planning within a single model, and Dual-System VLA, which separates slow deliberation (via VLMs) from fast, safety-critical execution (via planners). Within these paradigms, we further distinguish subclasses such as textual vs. numerical action generators and explicit vs. implicit guidance mechanisms. We also summarize representative datasets and benchmarks for evaluating VLA-based driving systems and highlight key challenges and open directions, including robustness, interpretability, and instruction fidelity. Overall, this work aims to establish a coherent foundation for advancing human-compatible autonomous driving systems.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Yixiang Chen, Yan Huang, Keji He, Peiyan Li, Liang Wang
Robotics
ๆฅ็ๆ่ฆ
When performing 3D manipulation tasks, robots have to execute action planning based on perceptions from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Robotic Manipulation) method, leveraging the knowledge in foundation models to imagine a virtual task-adaptive view from the constructed 3D point cloud, which efficiently captures necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both simulation benchmark RLBench and real-world evaluations demonstrate the effectiveness of our method, surpassing previous state-of-the-art methods while achieving 1.89x speedup in training time and 1.54x speedup in inference speed. More results can be found on our project website at https://verm-ral.github.io .
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ David Mรผller, Espen Knoop, Dario Mylonopoulos, Agon Serifi, Michael A. Hopkins et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Animated characters often move in non-physical ways and have proportions that are far from a typical walking robot. This provides an ideal platform for innovation in both mechanical design and stylized motion control. In this paper, we bring Olaf to life in the physical world, relying on reinforcement learning guided by animation references for control. To create the illusion of Olaf's feet moving along his body, we hide two asymmetric legs under a soft foam skirt. To fit actuators inside the character, we use spherical and planar linkages in the arms, mouth, and eyes. Because the walk cycle results in harsh contact sounds, we introduce additional rewards that noticeably reduce impact noise. The large head, driven by small actuators in the character's slim neck, creates a risk of overheating, amplified by the costume. To keep actuators from overheating, we feed temperature values as additional inputs to policies, introducing new rewards to keep them within bounds. We validate the efficacy of our modeling in simulation and on hardware, demonstrating an unmatched level of believability for a costumed robotic character.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Marcelo Rosa, Josรฉ E. R. Cury, Fabio L. Baldissera
Robotics
ๆฅ็ๆ่ฆ
In this paper, we deal with the problem of coordinating multiple robots to build 3-D structures. This problem consists of a set of mobile robots that interact with each other in order to autonomously build a predefined 3-D structure. Our approach is based on Supervisory Control Theory, and it allows us to synthesize from models that represent a single robot and the target structure a correct-by-construction reactive controller, called supervisor. When this supervisor is replicated for the other robots, then the target structure can be completed by all robots
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Jiayu Zhang, Kaixing Zhao, Tianhao Shao, Bin Guo, Liang He
ๆฅ็ๆ่ฆ
Mobile visual crowdsensing enables large-scale, fine-grained environmental monitoring through the collection of images from distributed mobile devices. However, the resulting data is often redundant and heterogeneous due to overlapping acquisition perspectives, varying resolutions, and diverse user behaviors. To address these challenges, this paper proposes Tri-Select, a multi-stage visual data selection framework that efficiently filters redundant and low-quality images. Tri-Select operates in three stages: (1) metadata-based filtering to discard irrelevant samples; (2) spatial similarity-based spectral clustering to organize candidate images; and (3) a visual-feature-guided selection based on maximum independent set search to retain high-quality, representative images. Experiments on real-world and public datasets demonstrate that Tri-Select improves both selection efficiency and dataset quality, making it well-suited for scalable crowdsensing applications.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment, and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Tianhao Shao, Kaixing Zhao, Feng Liu, Lixin Yang, Bin Guo
Robotics
ๆฅ็ๆ่ฆ
As unmanned systems such as Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) become increasingly important to applications like urban sensing and emergency response, efficiently recruiting these autonomous devices to perform time-sensitive tasks has become a critical challenge. This paper presents MPBS (Mobility-aware Prediction and Behavior-based Scheduling), a scalable task recruitment framework that treats each device as a recruitable "user". MPBS integrates three key modules: a behavior-aware KNN classifier, a time-varying Markov prediction model for forecasting device mobility, and a dynamic priority scheduling mechanism that considers task urgency and base station performance. By combining behavioral classification with spatiotemporal prediction, MPBS adaptively assigns tasks to the most suitable devices in real time. Experimental evaluations on the real-world GeoLife dataset show that MPBS significantly improves task completion efficiency and resource utilization. The proposed framework offers a predictive, behavior-aware solution for intelligent and collaborative scheduling in unmanned systems.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Abhishek Kashyap, Yuxuan Yang, Henrik Andreasson, Todor Stoyanov
Robotics
ๆฅ็ๆ่ฆ
In vision-based robot manipulation, a single camera view can only capture one side of objects of interest, with additional occlusions in cluttered scenes further restricting visibility. As a result, the observed geometry is incomplete, and grasp estimation algorithms perform suboptimally. To address this limitation, we leverage diffusion models to perform category-level 3D shape completion from partial depth observations obtained from a single view, reconstructing complete object geometries to provide richer context for grasp planning. Our method focuses on common household items with diverse geometries, generating full 3D shapes that serve as input to downstream grasp inference networks. Unlike prior work, which primarily considers isolated objects or minimal clutter, we evaluate shape completion and grasping in realistic clutter scenarios with household objects. In preliminary evaluations on a cluttered scene, our approach consistently results in better grasp success rates than a naive baseline without shape completion by 23% and over a recent state of the art shape completion approach by 19%. Our code is available at https://amm.aass.oru.se/shape-completion-grasping/.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Enis Yalcin, Joshua O'Hara, Maria Stamatopoulou, Chengxu Zhou, Dimitrios Kanoulas
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) show promise in automating reward design in humanoid locomotion, which could eliminate the need for tedious manual engineering. However, current VLM-based methods are essentially "blind", as they lack the environmental perception required to navigate complex terrain. We present E-SDS (Environment-aware See it, Do it, Sorted), a framework that closes this perception gap. E-SDS integrates VLMs with real-time terrain sensor analysis to automatically generate reward functions that facilitate training of robust perceptive locomotion policies, grounded by example videos. Evaluated on a Unitree G1 humanoid across four distinct terrains (simple, gaps, obstacles, stairs), E-SDS uniquely enabled successful stair descent, while policies trained with manually-designed rewards or a non-perceptive automated baseline were unable to complete the task. In all terrains, E-SDS also reduced velocity tracking error by 51.9-82.6%. Our framework reduces the human effort of reward design from days to less than two hours while simultaneously producing more robust and capable locomotion policies.
๐
2025-12-18
๐ accept by IEEE Transactions on Industrial Electron
๐ฅ Sijia Chen, Wei Dong
ๆฅ็ๆ่ฆ
It's a practical approach using the ground-aerial collaborative system to enhance the localization robustness of flying robots in cluttered environments, especially when visual sensors degrade. Conventional approaches estimate the flying robot's position using fixed cameras observing pre-attached markers, which could be constrained by limited distance and susceptible to capture failure. To address this issue, we improve the ground-aerial localization framework in a more comprehensive manner, which integrates active vision, single-ranging, inertial odometry, and optical flow. First, the designed active vision subsystem mounted on the ground vehicle can be dynamically rotated to detect and track infrared markers on the aerial robot, improving the field of view and the target recognition with a single camera. Meanwhile, the incorporation of single-ranging extends the feasible distance and enhances re-capture capability under visual degradation. During estimation, a dimension-reduced estimator fuses multi-source measurements based on polynomial approximation with an extended sliding window, balancing computational efficiency and redundancy. Considering different sensor fidelities, an adaptive sliding confidence evaluation algorithm is implemented to assess measurement quality and dynamically adjust the weighting parameters based on moving variance. Finally, extensive experiments under conditions such as smoke interference, illumination variation, obstacle occlusion, prolonged visual loss, and extended operating range demonstrate that the proposed approach achieves robust online localization, with an average root mean square error of approximately 0.09 m, while maintaining resilience to capture loss and sensor failures.
๐
2025-12-18
๐ AAAI 2026
๐ฅ Zixuan Chen, Chongkai Gao, Lin Shao, Jieqi Shi, Jing Huo et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
One-shot imitation learning (OSIL) offers a promising way to teach robots new skills without large-scale data collection. However, current OSIL methods are primarily limited to short-horizon tasks, thus limiting their applicability to complex, long-horizon manipulations. To address this limitation, we propose ManiLong-Shot, a novel framework that enables effective OSIL for long-horizon prehensile manipulation tasks. ManiLong-Shot structures long-horizon tasks around physical interaction events, reframing the problem as sequencing interaction-aware primitives instead of directly imitating continuous trajectories. This primitive decomposition can be driven by high-level reasoning from a vision-language model (VLM) or by rule-based heuristics derived from robot state changes. For each primitive, ManiLong-Shot predicts invariant regions critical to the interaction, establishes correspondences between the demonstration and the current observation, and computes the target end-effector pose, enabling effective task execution. Extensive simulation experiments show that ManiLong-Shot, trained on only 10 short-horizon tasks, generalizes to 20 unseen long-horizon tasks across three difficulty levels via one-shot imitation, achieving a 22.8% relative improvement over the SOTA. Additionally, real-robot experiments validate ManiLong-Shot's ability to robustly execute three long-horizon manipulation tasks via OSIL, confirming its practical applicability.
๐
2025-12-18
๐ ArXiv Preprint (cs.NI)
๐ฅ Bangya Liu, Chengpo Yan, Chenghao Jiang, Suman Banerjee, Akarsh Prabhakara
ๆฅ็ๆ่ฆ
Cooperative perception between vehicles is poised to offer robust and reliable scene understanding. Recently, we are witnessing experimental systems research building testbeds that share raw spatial sensor data for cooperative perception. While there has been a marked improvement in accuracies and is the natural way forward, we take a moment to consider the problems with such an approach for eventual adoption by automakers. In this paper, we first argue that new forms of privacy concerns arise and discourage stakeholders to share raw sensor data. Next, we present SHARP, a research framework to minimize privacy leakage and drive stakeholders towards the ambitious goal of raw data based cooperative perception. Finally, we discuss open questions for networked systems, mobile computing, perception researchers, industry and government in realizing our proposed framework.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ M. Oltan Sevinc, Liao Wu, Francisco Cruz
ๆฅ็ๆ่ฆ
Although traditional cameras are the primary sensor for end-to-end driving, their performance suffers greatly when the conditions of the data they were trained on does not match the deployment environment, a problem known as the domain gap. In this work, we consider the day-night lighting difference domain gap. Instead of traditional cameras we propose event cameras as a potential alternative which can maintain performance across lighting condition domain gaps without requiring additional adjustments. Our results show that event cameras maintain more consistent performance across lighting conditions, exhibiting domain-shift penalties that are generally comparable to or smaller than grayscale frames and provide superior baseline performance in cross-domain scenarios.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Jia Hu, Junqi Li, Xuerun Yan, Jintao Lai, Lianhua An
ๆฅ็ๆ่ฆ
Simulation testing is a fundamental approach for evaluating automated vehicles (AVs). To ensure its reliability, it is crucial to accurately replicate interactions between AVs and background traffic, which necessitates effective calibration. However, existing calibration methods often fall short in achieving this goal. To address this gap, this study introduces a simulation platform calibration method that ensures high accuracy at both the vehicle and traffic flow levels. The method offers several key features:(1) with the capability of calibration for vehicle-to-vehicle interaction; (2) with accuracy assurance; (3) with enhanced efficiency; (4) with pipeline calibration capability. The proposed method is benchmarked against a baseline with no calibration and a state-of-the-art calibration method. Results show that it enhances the accuracy of interaction replication by 83.53% and boosts calibration efficiency by 76.75%. Furthermore, it maintains accuracy across both vehicle-level and traffic flow-level metrics, with an improvement of 51.9%. Notably, the entire calibration process is fully automated, requiring no human intervention.
๐
2025-12-18
๐ ArXiv Preprint (Robotics)
๐ฅ Maolin Lei, Edoardo Romiti, Arturo Laurenzi, Rui Dai, Matteo Dalle Vedove et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Modular manipulators composed of pre-manufactured and interchangeable modules offer high adaptability across diverse tasks. However, their deployment requires generating feasible motions while jointly optimizing morphology and mounted pose under kinematic, dynamic, and physical constraints. Moreover, traditional single-branch designs often extend reach by increasing link length, which can easily violate torque limits at the base joint. To address these challenges, we propose a unified task-driven computational framework that integrates trajectory planning across varying morphologies with the co-optimization of morphology and mounted pose. Within this framework, a hierarchical model predictive control (HMPC) strategy is developed to enable motion planning for both redundant and non-redundant manipulators. For design optimization, the CMA-ES is employed to efficiently explore a hybrid search space consisting of discrete morphology configurations and continuous mounted poses. Meanwhile, a virtual module abstraction is introduced to enable bi-branch morphologies, allowing an auxiliary branch to offload torque from the primary branch and extend the achievable workspace without increasing the capacity of individual joint modules. Extensive simulations and hardware experiments on polishing, drilling, and pick-and-place tasks demonstrate the effectiveness of the proposed framework. The results show that: 1) the framework can generate multiple feasible designs that satisfy kinematic and dynamic constraints while avoiding environmental collisions for given tasks; 2) flexible design objectives, such as maximizing manipulability, minimizing joint effort, or reducing the number of modules, can be achieved by customizing the cost functions; and 3) a bi-branch morphology capable of operating in a large workspace can be realized without requiring more powerful basic modules.
๐
2025-12-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaheng Geng, Jiatong Du, Xinyu Zhang, Ye Li, Panqu Wang et al.
Robotics
ๆฅ็ๆ่ฆ
Safety-critical corner cases, difficult to collect in the real world, are crucial for evaluating end-to-end autonomous driving. Adversarial interaction is an effective method to generate such safety-critical corner cases. While existing adversarial evaluation methods are built for models operating in simplified simulation environments, adversarial evaluation for real-world end-to-end autonomous driving has been little explored. To address this challenge, we propose a closed-loop evaluation platform for end-to-end autonomous driving, which can generate adversarial interactions in real-world scenes. In our platform, the real-world image generator cooperates with an adversarial traffic policy to evaluate various end-to-end models trained on real-world data. The generator, based on flow matching, efficiently and stably generates real-world images according to the traffic environment information. The efficient adversarial surrounding vehicle policy is designed to model challenging interactions and create corner cases that current autonomous driving systems struggle to handle. Experimental results demonstrate that the platform can generate realistic driving images efficiently. Through evaluating the end-to-end models such as UniAD and VAD, we demonstrate that based on the adversarial policy, our platform evaluates the performance degradation of the tested model in corner cases. This result indicates that this platform can effectively detect the model's potential issues, which will facilitate the safety and robustness of end-to-end autonomous driving.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
LLM-as-a-Judge has been widely adopted as an evaluation method and served as supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge are mainly relying on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that finetuned LLM-as-a-Judge is a feasible method to boost performance, and the panel-based judge as well as deep reasoning can enhance the judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kieran Henderson, Kian Omoomi, Vasudha Varadarajan, Allison Lahnala, Charles Welch
ๆฅ็ๆ่ฆ
Recent work has explored the use of personal information in the form of persona sentences or self-disclosures to improve modeling of individual characteristics and prediction of annotator labels for subjective tasks. The volume of personal information has historically been restricted and thus little exploration has gone into understanding what kind of information is most informative for predicting annotator labels. In this work, we categorize self-disclosure sentences and use them to build annotator models for predicting judgments of social norms. We perform several ablations and analyses to examine the impact of the type of information on our ability to predict annotation patterns. We find that demographics are more impactful than attitudes, relationships, and experiences. Generally, theory-based approaches worked better than automatic clusters. Contrary to previous work, only a small number of related comments are needed. Lastly, having a more diverse sample of annotator self-disclosures leads to the best performance.
๐
2025-12-17
๐ ArXiv Preprint (cs.CY)
๐ฅ Yuxuan Liang, Marwa Mahmoud
Natural Language Processing
ๆฅ็ๆ่ฆ
This study introduces an innovative multilingual bias evaluation framework for assessing bias in Large Language Models, combining explicit bias assessment through the BBQ benchmark with implicit bias measurement using a prompt-based Implicit Association Test. By translating the prompts and word list into five target languages, English, Chinese, Arabic, French, and Spanish, we directly compare different types of bias across languages. The results reveal substantial gaps in bias across languages used in LLMs. For example, Arabic and Spanish consistently show higher levels of stereotype bias, while Chinese and English exhibit lower levels of bias. We also identify contrasting patterns across bias types. Age shows the lowest explicit bias but the highest implicit bias, emphasizing the importance of detecting implicit biases that are undetectable with standard benchmarks. These findings indicate that LLMs vary significantly across languages and bias dimensions. This study fills a key research gap by providing a comprehensive methodology for cross-lingual bias analysis. Ultimately, our work establishes a foundation for the development of equitable multilingual LLMs, ensuring fairness and effectiveness across diverse languages and cultures.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Caner Erden
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We propose Dynamic Rank Reinforcement Learning (DR-RL), a novel framework that adaptively optimizes the low-rank factorization of Multi-Head Self-Attention (MHSA) in Large Language Models (LLMs) through the integration of reinforcement learning and online matrix perturbation theory. While traditional low-rank approximations often rely on static rank assumptions--limiting their flexibility across diverse input contexts--our method dynamically selects ranks based on real-time sequence dynamics, layer-specific sensitivities, and hardware constraints. The core innovation lies in an RL agent that formulates rank selection as a sequential policy optimization problem, where the reward function strictly balances attention fidelity against computational latency. Crucially, we employ online matrix perturbation bounds to enable incremental rank updates, thereby avoiding the prohibitive cost of full decomposition during inference. Furthermore, the integration of a lightweight Transformer-based policy network and batched Singular Value Decomposition (SVD) operations ensures scalable deployment on modern GPU architectures. Experiments demonstrate that DR-RL maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing Floating Point Operations (FLOPs), particularly in long-sequence regimes (L > 4096). This work bridges the gap between adaptive efficiency and theoretical rigor in MHSA, offering a principled, mathematically grounded alternative to heuristic rank reduction techniques in resource-constrained deep learning. Source code and experiment logs are available at: https://github.com/canererden/DR_RL_Project
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Armaฤan Amcalar, Eyup Cinar
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) exhibit nonlinear relationships between performance, cost, and token usage. This paper presents a quantitative study on structured prompting using BRAID (Bounded Reasoning for Au tonomous Inference and Decisions) across multiple GPT model tiers, eval uated on the AdvancedIF, GSM-Hard, and the SCALE MultiChallenge benchmark datasets. BRAID introduces a bounded reasoning framework using Mermaid-based instruction graphs that enable models to reason struc turally rather than through unbounded natural-language token expansion. We show that structured machine-readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems. The findings establish BRAID as an effective and scalable technique for optimizing inference efficiency in autonomous agent systems. All datasets and detailed result logs are available at https://benchmark.openserv.ai.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina Donaldson, Luca Zappella et al.
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joel Mire, Maria Antoniak, Steven R. Wilson, Zexin Ma, Achyutarama R. Ganti et al.
ๆฅ็ๆ่ฆ
Reading stories evokes rich interpretive, affective, and evaluative responses, such as inferences about narrative intent or judgments about characters. Yet, computational models of reader response are limited, preventing nuanced analyses. To address this gap, we introduce SocialStoryFrames, a formalism for distilling plausible inferences about reader response, such as perceived author intent, explanatory and predictive reasoning, affective responses, and value judgments, using conversational context and a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. We develop two models, SSF-Generator and SSF-Classifier, validated through human surveys (N=382 participants) and expert annotations, respectively. We conduct pilot analyses to showcase the utility of the formalism for studying storytelling at scale. Specifically, applying our models to SSF-Corpus, a curated dataset of 6,140 social media stories from diverse contexts, we characterize the frequency and interdependence of storytelling intents, and we compare and contrast narrative practices (and their diversity) across communities. By linking fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames enable new research into storytelling in online communities.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tejas Anvekar, Juhna Park, Aparna Garimella, Vivek Gupta
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically asses metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: https://github.com/aimagelab/JARVIS.
๐
2025-12-17
๐ ArXiv Preprint (cs.SD)
๐ฅ Linnea Evanson, Mingfang Zhang, Hubert Banville, Saarang Panchavati, Pierre Bourdillon et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Decoding speech from brain activity has typically relied on limited neural recordings collected during short and highly controlled experiments. Here, we introduce a framework to leverage week-long intracranial and audio recordings from patients undergoing clinical monitoring, effectively increasing the training dataset size by over two orders of magnitude. With this pretraining, our contrastive learning model substantially outperforms models trained solely on classic experimental data, with gains that scale log-linearly with dataset size. Analysis of the learned representations reveals that, while brain activity represents speech features, its global structure largely drifts across days, highlighting the need for models that explicitly account for cross-day variability. Overall, our approach opens a scalable path toward decoding and modeling brain representations in both real-life and controlled task settings.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuki Nakayama, Koki Hikichi, Yun Ching Liu, Yu Hirate
ๆฅ็ๆ่ฆ
This paper presents a large-scale corpus of Rakuten Travel Reviews. Our collection contains 7.3 million customer reviews for 16 years, ranging from 2009 to 2024. Each record in the dataset contains the review text, its response from an accommodation, an anonymized reviewer ID, review date, accommodation ID, plan ID, plan title, room type, room name, purpose, accompanying group, and user ratings from different aspect categories, as well as an overall score. We present statistical information about our corpus and provide insights into factors driving data drift between 2019 and 2024 using statistical approaches.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving reasoning ability of large language models (LLMs). However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting the overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework integrating model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE integrates the proposed step-wise confidence into pseudo label deduction, prioritizing high-quality reasoning paths over simple frequency count. Furthermore, it dynamically partitions the candidate outputs pool into independent subgroups by balancing reasoning quality against exploration diversity. By deriving local consensus via repeat sampling for each sub group, SCOPE provides diverse supervision targets to encourage broader exploration. We conduct experiments across various models and benchmarks, experimental results show that SCOPE consistently outperforms recent baselines. Notably, SCOPE achieving relative improvements of 13.1% on challenging AIME 2025 and 8.1% on AMC. The code is released at https://github.com/szu-tera/SCOPE.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ayush Jain, Rahul Kulkarni, Siyi Lin
Machine Learning
ๆฅ็ๆ่ฆ
Big Data has become central to modern applications in finance, insurance, and cybersecurity, enabling machine learning systems to perform large-scale risk assessments and fraud detection. However, the increasing dependence on automated analytics introduces important concerns about transparency, regulatory compliance, and trust. This paper examines how explainable artificial intelligence (XAI) can be integrated into Big Data analytics pipelines for fraud detection and risk management. We review key Big Data characteristics and survey major analytical tools, including distributed storage systems, streaming platforms, and advanced fraud detection models such as anomaly detectors, graph-based approaches, and ensemble classifiers. We also present a structured review of widely used XAI methods, including LIME, SHAP, counterfactual explanations, and attention mechanisms, and analyze their strengths and limitations when deployed at scale. Based on these findings, we identify key research gaps related to scalability, real-time processing, and explainability for graph and temporal models. To address these challenges, we outline a conceptual framework that integrates scalable Big Data infrastructure with context-aware explanation mechanisms and human feedback. The paper concludes with open research directions in scalable XAI, privacy-aware explanations, and standardized evaluation methods for explainable fraud detection systems.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Paul Seurin, Dean Price, Luis Nunez
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Microreactors, particularly heat-pipe microreactors (HPMRs), are compact, transportable, self-regulated power systems well-suited for access-challenged remote areas where costly fossil fuels dominate. However, they suffer from diseconomies of scale, and their financial viability remains unconvincing. One step in addressing this shortcoming is to design these reactors with comprehensive economic and physics analyses informing early-stage design iteration. In this work, we present a novel unifying geometric design optimization approach that accounts for techno-economic considerations. We start by generating random samples to train surrogate models, including Gaussian processes (GPs) and multi-layer perceptrons (MLPs). We then deploy these surrogates within a reinforcement learning (RL)-based optimization framework to optimize the levelized cost of electricity (LCOE), all the while imposing constraints on the fuel lifetime, shutdown margin (SDM), peak heat flux, and rod-integrated peaking factor. We study two cases: one in which the axial reflector cost is very high, and one in which it is inexpensive. We found that the operation and maintenance and capital costs are the primary contributors to the overall LCOE particularly the cost of the axial reflectors (for the first case) and the control drum materials. The optimizer cleverly changes the design parameters so as to minimize one of them while still satisfying the constraints, ultimately reducing the LCOE by more than 57% in both instances. A comprehensive integration of fuel and HP performance with multi-objective optimization is currently being pursued to fully understand the interaction between constraints and cost performance.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruolei Zeng, Arun Sharma, Shuai An, Mingzhou Yang, Shengya Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate and cost-effective quantification of the agroecosystem carbon cycle at decision-relevant scales is essential for climate mitigation and sustainable agriculture. However, both transfer learning and the exploitation of spatial variability in this field are challenging, as they involve heterogeneous data and complex cross-scale dependencies. Conventional approaches often rely on location-independent parameterizations and independent training, underutilizing transfer learning and spatial heterogeneity in the inputs, and limiting their applicability in regions with substantial variability. We propose FTBSC-KGML (Fine-Tuning-Based Site Calibration-Knowledge-Guided Machine Learning), a pretraining- and fine-tuning-based, spatial-variability-aware, and knowledge-guided machine learning framework that augments KGML-ag with a pretraining-fine-tuning process and site-specific parameters. Using a pretraining-fine-tuning process with remote-sensing GPP, climate, and soil covariates collected across multiple midwestern sites, FTBSC-KGML estimates land emissions while leveraging transfer learning and spatial heterogeneity. A key component is a spatial-heterogeneity-aware transfer-learning scheme, which is a globally pretrained model that is fine-tuned at each state or site to learn place-aware representations, thereby improving local accuracy under limited data without sacrificing interpretability. Empirically, FTBSC-KGML achieves lower validation error and greater consistency in explanatory power than a purely global model, thereby better capturing spatial variability across states. This work extends the prior SDSA-KGML framework.
๐
2025-12-17
๐ ArXiv Preprint (eess.SP)
๐ฅ Evangelos Sariyanidi, John D. Herrington, Lisa Yankowitz, Pratik Chaudhari, Theodore D. Satterthwaite et al.
ๆฅ็ๆ่ฆ
Measuring the statistical dependence between observed signals is a primary tool for scientific discovery. However, biological systems often exhibit complex non-linear interactions that currently cannot be captured without a priori knowledge or large datasets. We introduce a criterion for dependence, whereby two time series are deemed dependent if one can construct a classifier that distinguishes between temporally aligned vs. misaligned segments extracted from them. We show that this criterion, concurrence, is theoretically linked with dependence, and can become a standard approach for scientific analyses across disciplines, as it can expose relationships across a wide spectrum of signals (fMRI, physiological and behavioral data) without ad-hoc parameter tuning or large amounts of data.
๐
2025-12-17
๐ ArXiv Preprint (cs.IT)
๐ฅ Yuxuan Bao, J. Nathan Kutz
Robotics
ๆฅ็ๆ่ฆ
Fisher information and Shannon entropy are fundamental tools for understanding and analyzing dynamical systems from complementary perspectives. They can characterize unknown parameters by quantifying the information contained in variables, or measure how different initial trajectories or temporal segments of a trajectory contribute to learning or inferring system dynamics. In this work, we leverage the Fisher Information Matrix (FIM) within the data-driven framework of {\em sparse identification of nonlinear dynamics} (SINDy). We visualize information patterns in chaotic and non-chaotic systems for both single trajectories and multiple initial conditions, demonstrating how information-based analysis can improve sampling efficiency and enhance model performance by prioritizing more informative data. The benefits of statistical bagging are further elucidated through spectral analysis of the FIM. We also illustrate how Fisher information and entropy metrics can promote data efficiency in three scenarios: when only a single trajectory is available, when a tunable control parameter exists, and when multiple trajectories can be freely initialized. As data-driven model discovery continues to gain prominence, principled sampling strategies guided by quantifiable information metrics offer a powerful approach for improving learning efficiency and reducing data requirements.
๐
2025-12-17
๐ NeurIPS 2025
๐ฅ Jason Weitz, Dmitri Demler, Benjamin Hawks, Nhan Tran, Javier Duarte
Machine Learning
ๆฅ็ๆ่ฆ
Neural Architecture Search is a powerful approach for automating model design, but existing methods struggle to accurately optimize for real hardware performance, often relying on proxy metrics such as bit operations. We present Surrogate Neural Architecture Codesign Package (SNAC-Pack), an integrated framework that automates the discovery and optimization of neural networks focusing on FPGA deployment. SNAC-Pack combines Neural Architecture Codesign's multi-stage search capabilities with the Resource Utilization and Latency Estimator, enabling multi-objective optimization across accuracy, FPGA resource utilization, and latency without requiring time-intensive synthesis for each candidate model. We demonstrate SNAC-Pack on a high energy physics jet classification task, achieving 63.84% accuracy with resource estimation. When synthesized on a Xilinx Virtex UltraScale+ VU13P FPGA, the SNAC-Pack model matches baseline accuracy while maintaining comparable resource utilization to models optimized using traditional BOPs metrics. This work demonstrates the potential of hardware-aware neural architecture search for resource-constrained deployments and provides an open-source framework for automating the design of efficient FPGA-accelerated models.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Robert Stephany, William Michael Anderson, Youngsoo Choi
ๆฅ็ๆ่ฆ
Solving complex partial differential equations is vital in the physical sciences, but often requires computationally expensive numerical methods. Reduced-order models (ROMs) address this by exploiting dimensionality reduction to create fast approximations. While modern ROMs can solve parameterized families of PDEs, their predictive power degrades over long time horizons. We address this by (1) introducing a flexible, high-order, yet inexpensive finite-difference scheme and (2) proposing a Rollout loss that trains ROMs to make accurate predictions over arbitrary time horizons. We demonstrate our approach on the 2D Burgers equation.
๐
2025-12-17
๐ ArXiv Preprint (math.NA)
๐ฅ Ahmed Abdeljawad, Elena Cordero
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We develop a quantitative approximation theory for shallow neural networks using tools from time-frequency analysis. Working in weighted modulation spaces $M^{p,q}_m(\mathbf{R}^{d})$, we prove dimension-independent approximation rates in Sobolev norms $W^{n,r}(ฮฉ)$ for networks whose units combine standard activations with localized time-frequency windows. Our main result shows that for $f \in M^{p,q}_m(\mathbf{R}^{d})$ one can achieve \[ \|f - f_N\|_{W^{n,r}(ฮฉ)} \lesssim N^{-1/2}\,\|f\|_{M^{p,q}_m(\mathbf{R}^{d})}, \] on bounded domains, with explicit control of all constants. We further obtain global approximation theorems on $\mathbf{R}^{d}$ using weighted modulation dictionaries, and derive consequences for Feichtinger's algebra, Fourier-Lebesgue spaces, and Barron spaces. Numerical experiments in one and two dimensions confirm that modulation-based networks achieve substantially better Sobolev approximation than standard ReLU networks, consistent with the theoretical estimates.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Allen Liu
Machine Learning
ๆฅ็ๆ่ฆ
It is widely believed that complex machine learning models generally encode features through linear representations, but these features exist in superposition, making them challenging to recover. We study the following fundamental setting for learning features in superposition from black-box query access: we are given query access to a function \[ f(x)=\sum_{i=1}^n a_i\,ฯ_i(v_i^\top x), \] where each unit vector $v_i$ encodes a feature direction and $ฯ_i:\mathbb{R} \rightarrow \mathbb{R}$ is an arbitrary response function and our goal is to recover the $v_i$ and the function $f$.
In learning-theoretic terms, superposition refers to the overcomplete regime, when the number of features is larger than the underlying dimension (i.e. $n > d$), which has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to $f$, identifies all feature directions whose responses are non-degenerate and reconstructs the function $f$. Crucially, our algorithm works in a significantly more general setting than all related prior results -- we allow for essentially arbitrary superpositions, only requiring that $v_i, v_j$ are not nearly identical for $i \neq j$, and general response functions $ฯ_i$. At a high level, our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions $v_i$.
๐
2025-12-17
๐ ArXiv Preprint (cs.CG)
๐ฅ Sai Karthikey Pentapati, Gregoire Phillips, Alan Bovik
ๆฅ็ๆ่ฆ
Implicit Neural Representations (INRs) have been demonstrated to achieve state-of-the-art compression of a broad range of modalities such as images, videos, 3D surfaces, and audio. Most studies have focused on building neural counterparts of traditional implicit representations of 3D geometries, such as signed distance functions. However, the triangle mesh-based representation of geometry remains the most widely used representation in the industry, while building INRs capable of generating them has been sparsely studied. In this paper, we present a method for building compact INRs of zero-genus 3D manifolds. Our method relies on creating a spherical parameterization of a given 3D mesh - mapping the surface of a mesh to that of a unit sphere - then constructing an INR that encodes the displacement vector field defined continuously on its surface that regenerates the original shape. The compactness of our representation can be attributed to its hierarchical structure, wherein it first recovers the coarse structure of the encoded surface before adding high-frequency details to it. Once the INR is computed, 3D meshes of arbitrary resolution/connectivity can be decoded from it. The decoding can be performed in real time while achieving a state-of-the-art trade-off between reconstruction quality and the size of the compressed representations.
๐
2025-12-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Qihao Wang, Pritom Saha Akash, Varvara Kollia, Kevin Chen-Chuan Chang, Biwei Jiang et al.
ๆฅ็ๆ่ฆ
Recommendation plays a key role in e-commerce, enhancing user experience and boosting commercial success. Existing works mainly focus on recommending a set of items, but online e-commerce platforms have recently begun to pay attention to exploring users' potential interests at the category level. Category-level recommendation allows e-commerce platforms to promote users' engagements by expanding their interests to different types of items. In addition, it complements item-level recommendations when the latter becomes extremely challenging for users with little-known information and past interactions. Furthermore, it facilitates item-level recommendations in existing works. The predicted category, which is called intention in those works, aids the exploration of item-level preference. However, such category-level preference prediction has mostly been accomplished through applying item-level models. Some key differences between item-level recommendations and category-level recommendations are ignored in such a simplistic adaptation. In this paper, we propose a cascading category recommender (CCRec) model with a variational autoencoder (VAE) to encode item-level information to perform category-level recommendations. Experiments show the advantages of this model over methods designed for item-level recommendations.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Shuaidong Ji, Mahdi Bamdad, Francisco Cruz
Robotics
ๆฅ็ๆ่ฆ
Efficient and reliable UAV navigation in cluttered and dynamic environments remains challenging. We propose SWIFT-Nav: Stability-aware Waypoint-level Integration of Fuzzy arbitration and TD3 for Navigation, a TD3-based navigation framework that achieves fast, stable convergence to obstacle-aware paths. The system couples a sensor-driven perception front end with a TD3 waypoint policy: the perception module converts LiDAR ranges into a confidence-weighted safety map and goal cues, while the TD3 policy is trained with Prioritised Experience Replay to focus on high-error transitions and a decaying epsilon-greedy exploration schedule that gradually shifts from exploration to exploitation. A lightweight fuzzy-logic layer computes a safety score from radial measurements and near obstacles, gates mode switching and clamps unsafe actions; in parallel, task-aligned reward shaping combining goal progress, clearance, and switch-economy terms provides dense, well-scaled feedback that accelerates learning. Implemented in Webots with proximity-based collision checking, our approach consistently outperforms baselines in trajectory smoothness and generalization to unseen layouts, while preserving real-time responsiveness. These results show that combining TD3 with replay prioritisation, calibrated exploration, and fuzzy-safety rules yields a robust and deployable solution for UAV navigation in cluttered scenes.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Rishabh Dev Yadav, Shrey Agrawal, Kamalakar Karlapalem
Robotics
ๆฅ็ๆ่ฆ
In this paper, we introduce a multi robot payload transport system to carry payloads through an environment of unknown and uneven inclinations while maintaining the desired orientation of the payload. For this task, we used custom built robots with a linear actuator (pistons) mounted on top of each robot. The system continuously monitors the payload's orientation and computes the required piston height of each robot to maintain the desired orientation of the payload. In this work, we propose an open loop controller coupled with a closed loop PID controller to achieve the goal. As our modelling makes no assumptions on the type of terrain, the system can work on any unknown and uneven terrains and inclinations. We showcase the efficacy of our proposed controller by testing it on various simulated environments with varied and complex terrains.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Qiping Zhang, Nathan Tsoi, Mofeed Nagib, Hao-Tien Lewis Chiang, Marynel Vรกzquez
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Understanding how humans evaluate robot behavior during human-robot interactions is crucial for developing socially aware robots that behave according to human expectations. While the traditional approach to capturing these evaluations is to conduct a user study, recent work has proposed utilizing machine learning instead. However, existing data-driven methods require large amounts of labeled data, which limits their use in practice. To address this gap, we propose leveraging the few-shot learning capabilities of Large Language Models (LLMs) to improve how well a robot can predict a user's perception of its performance, and study this idea experimentally in social navigation tasks. To this end, we extend the SEAN TOGETHER dataset with additional real-world human-robot navigation episodes and participant feedback. Using this augmented dataset, we evaluate the ability of several LLMs to predict human perceptions of robot performance from a small number of in-context examples, based on observed spatio-temporal cues of the robot and surrounding human motion. Our results demonstrate that LLMs can match or exceed the performance of traditional supervised learning models while requiring an order of magnitude fewer labeled instances. We further show that prediction performance can improve with more in-context examples, confirming the scalability of our approach. Additionally, we investigate what kind of sensor-based information an LLM relies on to make these inferences by conducting an ablation study on the input features considered for performance prediction. Finally, we explore the novel application of personalized examples for in-context learning, i.e., drawn from the same user being evaluated, finding that they further enhance prediction accuracy. This work paves the path to improving robot behavior in a scalable manner through user-centered feedback.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jack Naylor, Raghav Mishra, Nicholas H. Barbara, Donald G. Dansereau
ๆฅ็ๆ่ฆ
Visual inspection of space-borne assets is of increasing interest to spacecraft operators looking to plan maintenance, characterise damage, and extend the life of high-value satellites in orbit. The environment of Low Earth Orbit (LEO) presents unique challenges when planning inspection operations that maximise visibility, information, and data quality. Specular reflection of sunlight from spacecraft bodies, self-shadowing, and dynamic lighting in LEO significantly impact the quality of data captured throughout an orbit. This is exacerbated by the relative motion between spacecraft, which introduces variable imaging distances and attitudes during inspection. Planning inspection trajectories with the aide of simulation is a common approach. However, the ability to design and optimise an inspection trajectory specifically to improve the resulting image quality in proximity operations remains largely unexplored. In this work, we present $\partial$LITE, an end-to-end differentiable simulation pipeline for on-orbit inspection operations. We leverage state-of-the-art differentiable rendering tools and a custom orbit propagator to enable end-to-end optimisation of orbital parameters based on visual sensor data. $\partial$LITE enables us to automatically design non-obvious trajectories, vastly improving the quality and usefulness of attained data. To our knowledge, our differentiable inspection-planning pipeline is the first of its kind and provides new insights into modern computational approaches to spacecraft mission planning. Project page: https://appearance-aware.github.io/dlite/
๐
2025-12-17
๐ This work has been submitted to the IEEE for possi
๐ฅ Manuel Mekkattu, Mike Y. Michelis, Robert K. Katzschmann
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The deployment of complex soft robots in multiphysics environments requires advanced simulation frameworks that not only capture interactions between different types of material, but also translate accurately to real-world performance. Soft robots pose unique modeling challenges due to their large nonlinear deformations, material incompressibility, and contact interactions, which complicate both numerical stability and physical accuracy. Despite recent progress, robotic simulators often struggle with modeling such phenomena in a scalable and application-relevant manner. We present SORS (Soft Over Rigid Simulator), a versatile, high-fidelity simulator designed to handle these complexities for soft robot applications. Our energy-based framework, built on the finite element method, allows modular extensions, enabling the inclusion of custom-designed material and actuation models. To ensure physically consistent contact handling, we integrate a constrained nonlinear optimization based on sequential quadratic programming, allowing for stable and accurate modeling of contact phenomena. We validate our simulator through a diverse set of real-world experiments, which include cantilever deflection, pressure-actuation of a soft robotic arm, and contact interactions from the PokeFlex dataset. In addition, we showcase the potential of our framework for control optimization of a soft robotic leg. These tests confirm that our simulator can capture both fundamental material behavior and complex actuation dynamics with high physical fidelity. By bridging the sim-to-real gap in these challenging domains, our approach provides a validated tool for prototyping next-generation soft robots, filling the gap of extensibility, fidelity, and usability in the soft robotic ecosystem.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.
๐
2025-12-17
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Pilyoung Kim, Yun Xie, Sujin Yang
ๆฅ็ๆ่ฆ
General-purpose conversational AI chatbots and AI companions increasingly provide young adolescents with emotionally supportive conversations, raising questions about how conversational style shapes anthropomorphism and emotional reliance. In a preregistered online experiment with 284 adolescent-parent dyads, youth aged 11-15 and their parents read two matched transcripts in which a chatbot responded to an everyday social problem using either a relational style (first-person, affiliative, commitment language) or a transparent style (explicit nonhumanness, informational tone). Adolescents more often preferred the relational than the transparent style, whereas parents were more likely to prefer transparent style than adolescents. Adolescents rated the relational chatbot as more human-like, likable, trustworthy and emotionally close, while perceiving both styles as similarly helpful. Adolescents who preferred relational style had lower family and peer relationship quality and higher stress and anxiety than those preferring transparent style or both chatbots. These findings identify conversational style as a key design lever for youth AI safety, showing that relational framing heightens anthropomorphism, trust and emotional closeness and can be especially appealing to socially and emotionally vulnerable adolescents, who may be at increased risk for emotional reliance on conversational AI.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt
Machine Learning
ๆฅ็ๆ่ฆ
Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shuai Liu, Yiqing Tian, Yang Chen, Mar Canet Sola
Machine Learning
ๆฅ็ๆ่ฆ
This paper proposes a dual-engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art. We present two interconnected components: AIDA (an artificial artist social network) and the Ismism Machine, a system for critical analysis. The core innovation lies in leveraging deep learning and multi-agent collaboration to enable multidimensional simulations of art historical developments and conceptual innovation patterns. The framework explores a shift from traditional unidirectional critique toward an intelligent, interactive mode of reflexive practice. We are currently applying this method in experimental studies on contemporary art concepts. This study introduces a general methodology based on AI-driven critical loops, offering new possibilities for computational analysis of art.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce \model, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
๐
2025-12-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Adam Kaufman, James Lucassen, Tyler Tracy, Cody Rushing, Aryan Bhatt
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou et al.
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model own first order update geometry. For each response, G2RL constructs a sequence level feature from the model final layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off manifold updates are deemphasized, yielding a self referential exploration signal that is naturally aligned with PPO style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Adam Karvonen, James Chua, Clรฉment Dumas, Kit Fraser-Taliente, Subhash Kantamneni et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chase Walker, Rickard Ewetz
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two properties-causality and row stochasticity. The attribution graph allows context attributions to be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, achieving average gains of up to 40%.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan LU
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Human beings solve complex problems through critical thinking, where reasoning and evaluation are intertwined to converge toward correct solutions. However, most existing large language models (LLMs) decouple reasoning from verification: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors post hoc. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning. Motivated by human critical thinking, we propose Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model. STC is trained with a hybrid reinforcement learning objective combining reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critic-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.
๐
2025-12-17
๐ 10 pages, 3 figures, 2025 IEEE International Confe
๐ฅ Xiaodi Li, Dingcheng Li, Rujun Gao, Mahmoud Zamani, Feng Mi et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a stream of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, where performance on earlier tasks degrades as new tasks are learned. In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct, task-specific parameters for each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model's ability to retain past knowledge while adapting to new tasks. Experimental results on diverse NLP benchmarks demonstrate that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution to mitigate catastrophic forgetting.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios.We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context.This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanhang Li, Yiren Song, Junzhe Bai, Xinran Liang, Hu Yang et al.
Robotics
ๆฅ็ๆ่ฆ
We propose \textbf{IC-Effect}, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (\eg flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as clean contextual conditions, exploiting the contextual learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning $15$ high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kester Clegg, Richard Hawkins, Ibrahim Habli, Tom Lawton
Natural Language Processing
ๆฅ็ๆ่ฆ
LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. For example, triaging post-operative care to patients based on hospital referral letters, or updating site access schedules in nuclear facilities for work crews. If we want to introduce LLMs into critical information flows that were previously performed by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in LLM processes, particularly in frameworks that employ LLM-as-Judges (LaJ) evaluators. This paper argues that although we cannot get deterministic evaluations from many natural language processing tasks, by adopting a basket of weighted metrics it may be possible to lower the risk of errors within an evaluation, use context sensitivity to define error severity and design confidence thresholds that trigger human review of critical LaJ judgments when concordance across evaluators is low.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alexandre Dussolle, Pietro Liรฒ
Natural Language Processing
ๆฅ็ๆ่ฆ
Going from pure Multilayer Perceptron (MLP) to a learnable graph message-passing mechanism at each layer has been foundational to state-of-the-art results, despite the computational trade-off (e.g. GATs or Transformers). To go a step further, in this work, we introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions. Beyond these core mechanisms, we study how smoothing N-simplicial attention is by deriving a Lipschitz upper-bound and by demonstrating that by itself it also suffers from over-smoothing, despite opening the attention message-passing to higher-order interactions.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Daniel A. Herrmann, Abinav Chari, Isabelle Qian, Sree Sharvesh, B. A. Levinstein
ๆฅ็ๆ่ฆ
When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to how to determine, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent's value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff precisely accounting for a principal's uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment. An agent's superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing return of scaling up model sizes and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation in research scenarios leads to changing choices of the best performing model on scientific discovery projects evaluated, suggesting all current LLMs are distant to general scientific "superintelligence". Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.
๐
2025-12-17
๐ ArXiv Preprint (cs.SD)
๐ฅ Ken O'Hanlon, Basil Woods, Lin Wang, Mark Sandler
Machine Learning
ๆฅ็ๆ่ฆ
In this paper we propose a conditioned UNet for Music Source Separation (MSS). MSS is generally performed by multi-output neural networks, typically UNets, with each output representing a particular stem from a predefined instrument vocabulary. In contrast, conditioned MSS networks accept an audio query related to a stem of interest alongside the signal from which that stem is to be extracted. Thus, a strict vocabulary is not required and this enables more realistic tasks in MSS. The potential of conditioned approaches for such tasks has been somewhat hidden due to a lack of suitable data, an issue recently addressed with the MoisesDb dataset. A recent method, Banquet, employs this dataset with promising results seen on larger vocabularies. Banquet uses Bandsplit RNN rather than a UNet and the authors state that UNets should not be suitable for conditioned MSS. We counter this argument and propose QSCNet, a novel conditioned UNet for MSS that integrates network conditioning elements in the Sparse Compressed Network for MSS. We find QSCNet to outperform Banquet by over 1dB SNR on a couple of MSS tasks, while using less than half the number of parameters.
๐
2025-12-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Abdullah Al Munem, Sumona Yeasmin, Mohammad Rezwanul Huq
Natural Language Processing
ๆฅ็ๆ่ฆ
Every day, a significant number of users visit the internet for different needs. The owners of a website generate profits from the user interaction with the contents or items of the website. A robust recommendation system can increase user interaction with a website by recommending items according to the user's unique preferences. BERT and CNN-integrated neural collaborative filtering (NCF) have been proposed for the recommendation system in this experiment. The proposed model takes inputs from the user and item profile and finds the user's interest. This model can handle numeric, categorical, and image data to extract the latent features from the inputs. The model is trained and validated on a small sample of the MovieLens dataset for 25 epochs. The same dataset has been used to train and validate a simple NCF and a BERT-based NCF model and compared with the proposed model. The proposed model outperformed those two baseline models. The obtained result for the proposed model is 0.72 recall and 0.486 Hit Ratio @ 10 for 799 users on the MovieLens dataset. This experiment concludes that considering both categorical and image data can improve the performance of a recommendation system.
๐
2025-12-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Konstantinos Kalogiannis, Ahmed Mohamed Hussain, Hexu Li, Panos Papadimitratos
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Vehicular platooning promises transformative improvements in transportation efficiency and safety through the coordination of multi-vehicle formations enabled by Vehicle-to-Everything (V2X) communication. However, the distributed nature of platoon coordination creates security vulnerabilities, allowing authenticated vehicles to inject falsified kinematic data, compromise operational stability, and pose a threat to passenger safety. Traditional misbehaviour detection approaches, which rely on plausibility checks and statistical methods, suffer from high False Positive (FP) rates and cannot capture the complex temporal dependencies inherent in multi-vehicle coordination dynamics. We present Attention In Motion (AIMformer), a transformer-based framework specifically tailored for real-time misbehaviour detection in vehicular platoons with edge deployment capabilities. AIMformer leverages multi-head self-attention mechanisms to simultaneously capture intra-vehicle temporal dynamics and inter-vehicle spatial correlations. It incorporates global positional encoding with vehicle-specific temporal offsets to handle join/exit maneuvers. We propose a Precision-Focused (BCE) loss function that penalizes FPs to meet the requirements of safety-critical vehicular systems. Extensive evaluation across 4 platoon controllers, multiple attack vectors, and diverse mobility scenarios demonstrates superior performance ($\geq$ 0.93) compared to state-of-the-art baseline architectures. A comprehensive deployment analysis utilizing TensorFlow Lite (TFLite), Open Neural Network Exchange (ONNX), and TensorRT achieves sub-millisecond inference latency, making it suitable for real-time operation on resource-constrained edge platforms. Hence, validating AIMformer is viable for both in-vehicle and roadside infrastructure deployment.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hampus Linander, Conor Heins, Alexander Tschantz, Marco Perin, Christopher Buckley
Machine Learning
ๆฅ็ๆ่ฆ
Equivariance is a powerful prior for learning physical dynamics, yet exact group equivariance can degrade performance if the symmetries are broken. We propose object-centric world models built with geometric algebra neural networks, providing a soft geometric inductive bias. Our models are evaluated using simulated environments of 2d rigid body dynamics with static obstacles, where we train for next-step predictions autoregressively. For long-horizon rollouts we show that the soft inductive bias of our models results in better performance in terms of physical fidelity compared to non-equivariant baseline models. The approach complements recent soft-equivariance ideas and aligns with the view that simple, well-chosen priors can yield robust generalization. These results suggest that geometric algebra offers an effective middle ground between hand-crafted physics and unstructured deep nets, delivering sample-efficient dynamics models for multi-object scenes.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wei Du, Shubham Toshniwal, Branislav Kisacanin, Sadegh Mahdavi, Ivan Moshkov et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR).
The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset quality.
Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems. Incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math competition benchmarks.
To support efficient long-context training, we develop a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2--3$\times$ without significant accuracy loss. Overall, Nemotron-Math enables state-of-the-art performance, including 100\% maj@16 accuracy on AIME 2024 and 2025 with Python TIR.
๐
2025-12-17
๐ ArXiv Preprint (cs.SE)
๐ฅ Hua Yang, Alejandro Velasco, Thanh Le-Cong, Md Nazmul Haque, Bowen Xu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The success of large language models for code relies on vast amounts of code data, including public open-source repositories, such as GitHub, and private, confidential code from companies. This raises concerns about intellectual property compliance and the potential unauthorized use of license-restricted code. While membership inference (MI) techniques have been proposed to detect such unauthorized usage, their effectiveness can be undermined by semantically equivalent code transformation techniques, which modify code syntax while preserving semantic.
In this work, we systematically investigate whether semantically equivalent code transformation rules might be leveraged to evade MI detection. The results reveal that model accuracy drops by only 1.5% in the worst case for each rule, demonstrating that transformed datasets can effectively serve as substitutes for fine-tuning. Additionally, we find that one of the rules (RenameVariable) reduces MI success by 10.19%, highlighting its potential to obscure the presence of restricted code. To validate these findings, we conduct a causal analysis confirming that variable renaming has the strongest causal effect in disrupting MI detection. Notably, we find that combining multiple transformations does not further reduce MI effectiveness. Our results expose a critical loophole in license compliance enforcement for training large language models for code, showing that MI detection can be substantially weakened by transformation-based obfuscation techniques.
๐
2025-12-17
๐ ArXiv Preprint (cs.SE)
๐ฅ Robert Heumรผller, Frank Ortmeier
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The use of large language models like ChatGPT in code review offers promising efficiency gains but also raises concerns about correctness and safety. Existing evaluation methods for code review generation either rely on automatic comparisons to a single ground truth, which fails to capture the variability of human perspectives, or on subjective assessments of "usefulness", a highly ambiguous concept. We propose a novel evaluation approach based on what we call multi-subjective ranking. Using a dataset of 280 self-contained code review requests and corresponding comments from CodeReview StackExchange, multiple human judges ranked the quality of ChatGPT-generated comments alongside the top human responses from the platform. Results show that ChatGPT's comments were ranked significantly better than human ones, even surpassing StackExchange's accepted answers. Going further, our proposed method motivates and enables more meaningful assessments of generative AI's performance in code review, while also raising awareness of potential risks of unchecked integration into review processes.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jeongseok Kim, Kangjin Kim
ๆฅ็ๆ่ฆ
Due to the restricted resources, efficient scheduling in vertiports has received much more attention in the field of Urban Air Mobility (UAM). For the scheduling problem, we utilize a Mixed Integer Linear Programming (MILP), which is often formulated in a resource-restricted project scheduling problem (RCPSP). In this paper, we show our approach to handle both dynamic operation requirements and vague rescheduling requests from humans. Particularly, we utilize a three-valued logic for interpreting ambiguous user intents and a decision tree, proposing a newly integrated system that combines Answer Set Programming (ASP) and MILP. This integrated framework optimizes schedules and supports human inputs transparently. With this system, we provide a robust structure for explainable, adaptive UAM scheduling.
๐
2025-12-17
๐ Accepted to Transactions on Machine Learning Resea
๐ฅ Akihiro Kubo, Paavo Parmas, Shin Ishii
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the length of the rollouts poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) Moreover, a longer model rollout may reduce value estimation bias but raise the variance of policy gradients due to backpropagation through multiple steps, implying another intermediate horizon for stable gradient estimates. However, these two optimal horizons may differ. To resolve this conflict, we propose Double Horizon Model-Based Policy Optimization (DHMBPO), which divides the rollout procedure into a long "distribution rollout" (DR) and a short "training rollout" (TR). The DR generates on-policy state samples for mitigating distribution shift. In contrast, the short TR leverages differentiable transitions to offer accurate value gradient estimation with stable gradient updates, thereby requiring fewer updates and reducing overall runtime. We demonstrate that the double-horizon approach effectively balances distribution shift, model bias, and gradient instability, and surpasses existing MBRL methods on continuous-control benchmarks in terms of both sample efficiency and runtime.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Stefan Edelkamp
ๆฅ็ๆ่ฆ
In multi-player card games such as Skat or Bridge, the early stages of the game, such as bidding, game selection, and initial card selection, are often more critical to the success of the play than refined middle- and end-game play. At the current limits of computation, such early decision-making resorts to using statistical information derived from a large corpus of human expert games. In this paper, we derive and evaluate a general bootstrapping outer-learning framework that improves prediction accuracy by expanding the database of human games with millions of self-playing AI games to generate and merge statistics. We implement perfect feature hash functions to address compacted tables, producing a self-improving card game engine, where newly inferred knowledge is continuously improved during self-learning. The case study in Skat shows that the automated approach can be used to support various decisions in the game.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Quanxi Zhou, Wencan Mao, Manabu Tsukada, John C. S. Lui, Yusheng Ji
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL) evolve along distinct paths but converge in the design of Dyna-Q [1]. However, modern RL methods still struggle with effective transferability across tasks and scenarios. Motivated by this limitation, we propose a generalized algorithm, Feature Model-Based Enhanced Actor-Critic (FM-EAC), that integrates planning, acting, and learning for multi-task control in dynamic environments. FM-EAC combines the strengths of MBRL and MFRL and improves generalizability through the use of novel feature-based models and an enhanced actor-critic framework. Simulations in both urban and agricultural applications demonstrate that FM-EAC consistently outperforms many state-of-the-art MBRL and MFRL methods. More importantly, different sub-networks can be customized within FM-EAC according to user-specific requirements.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Liang Peng, Yixuan Ye, Cheng Liu, Hangjun Che, Fei Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Multi-view clustering has been empirically shown to improve learning performance by leveraging the inherent complementary information across multiple views of data. However, in real-world scenarios, collecting strictly aligned views is challenging, and learning from both aligned and unaligned data becomes a more practical solution. Partially View-aligned Clustering aims to learn correspondences between misaligned view samples to better exploit the potential consistency and complementarity across views, including both aligned and unaligned data. However, most existing PVC methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster. Moreover, the inherent heterogeneity of multi-view data induces distributional shifts in representations, leading to inaccuracies in establishing meaningful correspondences between cross-view latent features and, consequently, impairing learning effectiveness. To address these challenges, we propose a Semantic MAtching contRasTive learning model (SMART) for PVC. The main idea of our approach is to alleviate the influence of cross-view distributional shifts, thereby facilitating semantic matching contrastive learning to fully exploit semantic relationships in both aligned and unaligned data. Extensive experiments on eight benchmark datasets demonstrate that our method consistently outperforms existing approaches on the PVC problem.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Reinhard Moratz, Niklas Daute, James Ondieki, Markus Kattenbeck, Mario Krajina et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper deals with improving the capabilities of Large Language Models (LLM) to provide route instructions for pedestrian wayfinders by means of qualitative spatial relations.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kotaro Funakoshi, Yaoxiong Zhu
Multimodal
ๆฅ็ๆ่ฆ
Recognition of signers' emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally we establish a stronger baseline than spoken language LLMs.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the mechanisms to manage it effectively, leading to recurring Corrective and Enhancement failures. To address this capability gap, we introduce \textbf{SCOPE} (Self-evolving Context Optimization via Prompt Evolution). SCOPE frames context management as an \textit{online optimization} problem, synthesizing guidelines from execution traces to automatically evolve the agent's prompt. We propose a Dual-Stream mechanism that balances tactical specificity (resolving immediate errors) with strategic generality (evolving long-term principles). Furthermore, we introduce Perspective-Driven Exploration to maximize strategy coverage, increasing the likelihood that the agent has the correct strategy for any given task. Experiments on the HLE benchmark show that SCOPE improves task success rates from 14.23\% to 38.64\% without human intervention. We make our code publicly available at https://github.com/JarvisPei/SCOPE.
๐
2025-12-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mikel Williams-Lekuona, Georgina Cosma
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joao Queiroz
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study 'Adversarial poetry as a universal single-turn jailbreak mechanism in large language models' demonstrates that instructions routinely refused in prose become executable when rewritten as verse, producing up to 18 x more safety failures in benchmarks derived from MLCommons AILuminate. Manually written poems reach approximately 62% ASR, and automated versions 43%, with some models surpassing 90% success in single-turn interactions. The effect is structural: systems trained with RLHF, constitutional AI, and hybrid pipelines exhibit consistent degradation under minimal semiotic formal variation. Versification displaces the prompt into sparsely supervised latent regions, revealing guardrails that are excessively dependent on surface patterns. This dissociation between apparent robustness and real vulnerability exposes deep limitations in current alignment regimes. The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, a rich metric-prosodic tradition, and over 250 million speakers, constitutes a critical gap. Experimental protocols must parameterise scansion, metre, and prosodic variation to test vulnerabilities specific to Lusophone patterns, which are currently ignored.
๐
2025-12-17
๐ This work has been submitted to the IEEE for possi
๐ฅ Hiroyoshi Nagahama, Katsufumi Inoue, Masayoshi Todorokihara, Michifumi Yoshioka
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Predictive maintenance of rotating machinery increasingly relies on vibration signals, yet most learning-based approaches either discard phase during spectral feature extraction or use raw time-waveforms without explicitly leveraging phase information. This paper introduces two phase-aware preprocessing strategies to address random phase variations in multi-axis vibration data: (1) three-axis independent phase adjustment that aligns each axis individually to zero phase (2) single-axis reference phase adjustment that preserves inter-axis relationships by applying uniform time shifts. Using a newly constructed rotor dataset acquired with a synchronized three-axis sensor, we evaluate six deep learning architectures under a two-stage learning framework. Results demonstrate architecture-independent improvements: the three-axis independent method achieves consistent gains (+2.7\% for Transformer), while the single-axis reference approach delivers superior performance with up to 96.2\% accuracy (+5.4\%) by preserving spatial phase relationships. These findings establish both phase alignment strategies as practical and scalable enhancements for predictive maintenance systems.
๐
2025-12-17
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Efe Bozkir, Enkelejda Kasneci
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid development of generative artificial intelligence (AI) and large language models (LLMs), and the availability of services that make them accessible, have led the general public to begin incorporating them into everyday life. The extended reality (XR) community has also sought to integrate LLMs, particularly in the form of conversational agents, to enhance user experience and task efficiency. When interacting with such conversational agents, users may easily disclose sensitive information due to the naturalistic flow of the conversations, and combining such conversational data with fine-grained sensor data may lead to novel privacy issues. To address these issues, a user-centric understanding of technology acceptance and concerns is essential. Therefore, to this end, we conducted a large-scale crowdsourcing study with 1036 participants, examining user decision-making processes regarding LLM-powered conversational agents in XR, across factors of XR setting type, speech interaction type, and data processing location. We found that while users generally accept these technologies, they express concerns related to security, privacy, social implications, and trust. Our results suggest that familiarity plays a crucial role, as daily generative AI use is associated with greater acceptance. In contrast, previous ownership of XR devices is linked to less acceptance, possibly due to existing familiarity with the settings. We also found that men report higher acceptance with fewer concerns than women. Regarding data type sensitivity, location data elicited the most significant concern, while body temperature and virtual object states were considered least sensitive. Overall, our study highlights the importance of practitioners effectively communicating their measures to users, who may remain distrustful. We conclude with implications and recommendations for LLM-powered XR.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Parvesh Saini, Soumyadipta Maiti, Beena Rai
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Capabilities and the number of vision-based models are increasing rapidly. And these vision models are now able to do more tasks like object detection, image classification, instance segmentation etc. with great accuracy. But models which can take accurate quantitative measurements form an image, as a human can do by just looking at it, are rare. For a robot to work with complete autonomy in a Laboratory environment, it needs to have some basic skills like navigation, handling objects, preparing samples etc. to match human-like capabilities in an unstructured environment. Another important capability is to read measurements from instruments and apparatus. Here, we tried to mimic a human inspired approach to read measurements from a linear scale. As a test case we have picked reading level from a syringe and a measuring cylinder. For a randomly oriented syringe we carry out transformations to correct the orientation. To make the system efficient and robust, the area of interest is reduced to just the linear scale containing part of the image. After that, a series of features were extracted like the major makers, the corresponding digits, and the level indicator location, from which the final reading was calculated. Readings obtained using this system were also compared against human read values of the same instances and an accurate correspondence was observed.
๐
2025-12-17
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Agnieszka Bienkowska, Jacek Malecki, Alexander Mathiesen-Ohman, Katarzyna Tworek
Robotics
ๆฅ็ๆ่ฆ
Organizations increasingly operate in environments characterized by volatility, uncertainty, complexity, and ambiguity (VUCA), where early indicators of change often emerge as weak, fragmented signals. Although artificial intelligence (AI) is widely used to support managerial decision-making, most AI-based systems remain optimized for prediction and resolution, leading to premature interpretive closure under conditions of high ambiguity. This creates a gap in management science regarding how human-AI systems can responsibly manage ambiguity before it crystallizes into error or crisis. This study addresses this gap by presenting a proof of concept (PoC) of the LAIZA human-AI augmented symbiotic intelligence system and its patented process: Systems and Methods for Quantum-Inspired Rogue Variable Modeling (QRVM), Human-in-the-Loop Decoherence, and Collective Cognitive Inference. The mechanism operationalizes ambiguity as a non-collapsed cognitive state, detects persistent interpretive breakdowns (rogue variables), and activates structured human-in-the-loop clarification when autonomous inference becomes unreliable. Empirically, the article draws on a three-month case study conducted in 2025 within the AI development, involving prolonged ambiguity surrounding employee intentions and intellectual property boundaries. The findings show that preserving interpretive plurality enabled early scenario-based preparation, including proactive patent protection, allowing decisive and disruption-free action once ambiguity collapsed. The study contributes to management theory by reframing ambiguity as a first-class construct and demonstrates the practical value of human-AI symbiosis for organizational resilience in VUCA environments.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Antony Jerald, Dattesh Shanbhag, Sudhanya Chatterjee
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Motion artifacts degrade MRI image quality and increase patient recalls. Existing automated quality assessment methods are largely limited to binary decisions and provide little interpretability. We introduce AutoMAC-MRI, an explainable framework for grading motion artifacts across heterogeneous MR contrasts and orientations. The approach uses supervised contrastive learning to learn a discriminative representation of motion severity. Within this feature space, we compute grade-specific affinity scores that quantify an image's proximity to each motion grade, thereby making grade assignments transparent and interpretable. We evaluate AutoMAC-MRI on more than 5000 expert-annotated brain MRI slices spanning multiple contrasts and views. Experiments assessing affinity scores against expert labels show that the scores align well with expert judgment, supporting their use as an interpretable measure of motion severity. By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI enables inline MRI quality control, with the potential to reduce unnecessary rescans and improve workflow efficiency.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Charan Prakash Rathore, Saumi Ray, Dhruv Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90\% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65\% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79\% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.
๐
2025-12-17
๐ ArXiv Preprint (cs.DB)
๐ฅ Basil Ell
ๆฅ็ๆ่ฆ
We introduce graph pattern-based association rules (GPARs) for directed labeled multigraphs such as RDF graphs. GPARs support both generative tasks, where a graph is extended, and evaluative tasks, where the plausibility of a graph is assessed. The framework goes beyond related formalisms such as graph functional dependencies, graph entity dependencies, relational association rules, graph association rules, multi-relation and path association rules, and Horn rules. Given a collection of graphs, we evaluate graph patterns under no-repeated-anything semantics, which allows the topology of a graph to be taken into account more effectively. We define a probability space and derive confidence, lift, leverage, and conviction in a probabilistic setting. We further analyze how these metrics relate to their classical itemset-based counterparts and identify conditions under which their characteristic properties are preserved.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Seok-Hyun Ga, Chun-Yen Chang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that "Perception Errors" were dominant, highlighting a "Perception-Cognition Gap" where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. Furthermore, models demonstrated a "Calculation-Conceptualization Discrepancy," successfully performing calculations while failing to apply the underlying scientific concepts, and "Process Hallucination," where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing "AI-resistant questions" that target these specific cognitive vulnerabilities. By exploiting AI's weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Toshihide Ubukata, Enhong Mu, Takuto Yamauchi, Mingyue Zhang, Jialong Li et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Controller synthesis is a formal method approach for automatically generating Labeled Transition System (LTS) controllers that satisfy specified properties. The efficiency of the synthesis process, however, is critically dependent on exploration policies. These policies often rely on fixed rules or strategies learned through reinforcement learning (RL) that consider only a limited set of current features. To address this limitation, this paper introduces GCRL, an approach that enhances RL-based methods by integrating Graph Neural Networks (GNNs). GCRL encodes the history of LTS exploration into a graph structure, allowing it to capture a broader, non-current-based context. In a comparative experiment against state-of-the-art methods, GCRL exhibited superior learning efficiency and generalization across four out of five benchmark domains, except one particular domain characterized by high symmetry and strictly local interactions.
๐
2025-12-17
๐ 15 pages, 5 figures, Submitted to a journal
๐ฅ Siva Sai, Ishika Goyal, Shubham Sharma, Sri Harshita Manuri, Vinay Chamola et al.
Machine Learning
ๆฅ็ๆ่ฆ
The increasing number of cyber threats and rapidly evolving tactics, as well as the high volume of data in recent years, have caused classical machine learning, rules, and signature-based defence strategies to fail, rendering them unable to keep up. An alternative, Quantum Machine Learning (QML), has recently emerged, making use of computations based on quantum mechanics. It offers better encoding and processing of high-dimensional structures for certain problems. This survey provides a comprehensive overview of QML techniques relevant to the domain of security, such as Quantum Neural Networks (QNNs), Quantum Support Vector Machines (QSVMs), Variational Quantum Circuits (VQCs), and Quantum Generative Adversarial Networks (QGANs), and discusses the contributions of this paper in relation to existing research in the field and how it improves over them. It also maps these methods across supervised, unsupervised, and generative learning paradigms, and to core cybersecurity tasks, including intrusion and anomaly detection, malware and botnet classification, and encrypted-traffic analytics. It also discusses their application in the domain of cloud computing security, where QML can enhance secure and scalable operations. Many limitations of QML in the domain of cybersecurity have also been discussed, along with the directions for addressing them.
๐
2025-12-17
๐ AAAI 2026
๐ฅ Yiliu Sun, Zicheng Zhao, Yang Wei, Yanfang Zhang, Chen Gong
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. To improve the learning effectiveness of LLMs on how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence, and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with the accuracy improvements of 18.02% on only 26.17% training tokens.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Yuze Wu, Mo Zhu, Xingxing Li, Yuheng Du, Yuxin Fan et al.
Machine LearningRoboticsMultimodal
ๆฅ็ๆ่ฆ
This paper proposes VLA-AN, an efficient and onboard Vision-Language-Action (VLA) framework dedicated to autonomous drone navigation in complex environments. VLA-AN addresses four major limitations of existing large aerial navigation models: the data domain gap, insufficient temporal navigation with reasoning, safety issues with generative action policies, and onboard deployment constraints. First, we construct a high-fidelity dataset utilizing 3D Gaussian Splatting (3D-GS) to effectively bridge the domain gap. Second, we introduce a progressive three-stage training framework that sequentially reinforces scene comprehension, core flight skills, and complex navigation capabilities. Third, we design a lightweight, real-time action module coupled with geometric safety correction. This module ensures fast, collision-free, and stable command generation, mitigating the safety risks inherent in stochastic generative policies. Finally, through deep optimization of the onboard deployment pipeline, VLA-AN achieves a robust real-time 8.3x improvement in inference throughput on resource-constrained UAVs. Extensive experiments demonstrate that VLA-AN significantly improves spatial grounding, scene reasoning, and long-horizon navigation, achieving a maximum single-task success rate of 98.1%, and providing an efficient, practical solution for realizing full-chain closed-loop autonomy in lightweight aerial robots.
๐
2025-12-17
๐ NeurIPS 2025
๐ฅ Youssef Ghallab, Omar Iraqy, Mohamed Kandil, Mohamed Ashraf, Saadeldine Eletter et al.
ๆฅ็ๆ่ฆ
Physiological signals such as electrocardiograms (ECG) and electroencephalograms (EEG) provide complementary insights into human health and cognition, yet multi-modal integration is challenging due to limited multi-modal labeled data, and modality-specific differences . In this work, we adapt the CBraMod encoder for large-scale self-supervised ECG pretraining, introducing a dual-masking strategy to capture intra- and inter-lead dependencies. To overcome the above challenges, we utilize a pre-trained CBraMod encoder for EEG and pre-train a symmetric ECG encoder, equipping each modality with a rich foundational representation. These representations are then fused via simple embedding concatenation, allowing the classification head to learn cross-modal interactions, together enabling effective downstream learning despite limited multi-modal supervision. Evaluated on emotion recognition, our approach achieves near state-of-the-art performance, demonstrating that carefully designed physiological encoders, even with straightforward fusion, substantially improve downstream performance. These results highlight the potential of foundation-model approaches to harness the holistic nature of physiological signals, enabling scalable, label-efficient, and generalizable solutions for healthcare and affective computing.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yupeng Zhang, Adam G. Dunn, Usman Naseem, Jinman Kim
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model's decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, $ฮ$TPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced $ฮ$TPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhengchao Chen, Haoran Wang, Jing Yao, Pedram Ghamisi, Jun Zhou et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task-specific, lacking a unified framework to manage diverse, end-to-end workflows--from data preprocessing to advanced interpretation--across diverse RS applications. To address this gap, this paper introduces CangLing-KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert-validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general-purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent's knowledge and performance. This synergy enables CangLing-KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. We evaluated CangLing-KnowFlow on the KnowFlow-Bench, a novel benchmark of 324 workflows inspired by real-world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open-source to commercial. Across all complex tasks, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the first most comprehensive validation along this emerging field, this research demonstrates the great potential of CangLing-KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by leveraging expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yash Bhaskar, Parameswari Krishnamurthy
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024 (Pakray et al., 2024), focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 (Xue et al., 2020) and IndicBart (Dabre et al., 2021) in both multilingual and monolingual settings, LoRA (Hu et al., 2021) fine-tuning IndicTrans2 (Gala et al., 2023), zero-shot and few-shot prompting (Brown, 2020) with large language models (LLMs) like Llama 3 (Dubey et al., 2024) and Mixtral 8x7b (Jiang et al., 2024), LoRA supervised fine-tuning of Llama 3 (Mecklenburg et al., 2024), and training Transformer models (Vaswani, 2017) from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task test data using SacreBLEU (Post, 2018) and CHRF (Popovic, 2015), highlighting the challenges of low-resource translation and the potential of LLMs for these tasks, particularly with fine-tuning.
๐
2025-12-17
๐ AAAI 2026
๐ฅ Chao Zhang, Minghan Li, Tianrui Lv, Guodong Zhou
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it replaces the rigid hop-count selector with a relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct "brother" relations, 2-hop for indirect "father-son" chains), formalized via a relation mask. Second, it introduces a few-shot in-context learning path guidance mechanism with CoT (think) that constructs examples in a "question-paths-answer" format to enhance LLMs' ability to understand reasoning paths. Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 pp (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and the path prompt are complementary, jointly transforming KG evidence into more faithful answers.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mohsen Nafar, Michael Rรถmer, Lin Xie
Machine Learning
ๆฅ็ๆ่ฆ
Efficient exact algorithms for Discrete Optimization (DO) rely heavily on strong primal and dual bounds. Relaxed Decision Diagrams (DDs) provide a versatile mechanism for deriving such dual bounds by compactly over-approximating the solution space through node merging. However, the quality of these relaxed diagrams, i.e. the tightness of the resulting dual bounds, depends critically on the variable ordering and the merging decisions executed during compilation. While dynamic variable ordering heuristics effectively tighten bounds, they often incur computational overhead when evaluated globally across the entire variable set. To mitigate this trade-off, this work introduces a novel clustering-based framework for variable ordering. Instead of applying dynamic ordering heuristics to the full set of unfixed variables, we first partition variables into clusters. We then leverage this structural decomposition to guide the ordering process, significantly reducing the heuristic's search space. Within this framework, we investigate two distinct strategies: Cluster-to-Cluster, which processes clusters sequentially using problem-specific aggregate criteria (such as cumulative vertex weights in the Maximum Weighted Independent Set Problem (MWISP)), and Pick-and-Sort, which iteratively selects and sorts representative variables from each cluster to balance local diversity with heuristic guidance. Later on, developing some theoretical results on the growth of the size of DDs for MWISP we propose two different policies for setting the number of clusters within the proposed framework. We embed these strategies into a DD-based branch-and-bound algorithm and evaluate them on the MWISP. Across benchmark instances, the proposed methodology consistently reduces computational costs compared to standard dynamic variable ordering baseline.
๐
2025-12-17
๐ ArXiv Preprint (cs.CY)
๐ฅ Atte Ojanen, Johannes Anttila, Thilo H. K. Thelitz, Anna Bjork
ๆฅ็ๆ่ฆ
The rapid advancements in artificial intelligence (AI) present unique challenges for policymakers that seek to govern the technology. In this context, the Delphi method has become an established way to identify consensus and disagreement on emerging technological issues among experts in the field of futures studies and foresight. The aim of this article is twofold: first, it examines key tensions experts see in the development of AI governance in Europe, and second, it reflects on the Delphi method's capacity to inform anticipatory governance of emerging technologies like AI based on these insights. The analysis is based on the results of a two-round Policy Delphi study on the future of AI governance with European policymakers, researchers and NGOs, conducted in mid-2024. The Policy Delphi proved useful in revealing diverse perspectives on European AI governance, drawing out a consensus that future-proof AI regulation will likely depend more on practical implementation and enforcement of legislation than on its technical specifics or scope. Furthermore, the study identified a desirability-probability gap in AI governance: desirable policy directions, like greater citizen participation, were perceived as less probable and feasible. This highlights a tension between desirable regulatory oversight and the practical difficulty for regulation to keep up with technological change.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a., drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that a diffusion large language model (dLLM) drafters can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc, will be available at https://czc726.github.io/DEER/
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP's openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
๐
2025-12-17
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Xian-Rong Zhang, Yue-Jiao Gong, Zeyuan Ma, Jun Zhang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Data-driven evolutionary algorithms has shown surprising results in addressing expensive optimization problems through robust surrogate modeling. Though promising, existing surrogate modeling schemes may encounter limitations in complex optimization problems with many sub-objectives, which rely on repeated and tedious approximation. To address such technical gap, we propose Q-MetaSur as a plug-and-play surrogate modeling scheme capable of providing unified and generalized surrogate learning. Specifically, we consider multi-task-multi-objective optimization~(MTMOO) in offline setting. Several key designs are proposed: 1) we transform objective approximation into sequence-to-sequence modeling where MTMOO problem can be represented by tenxual tokenization. To operate under such auto-regressive modeling, we introduce a Large Language Model-based surrogate model that first encodes a MTMOO instance and then decodes objective values of unseen decision variables. To ensure stability in training the proposed model, we propose a two-stage offline training strategy that operates as a synergy of supervised tuning and RL fine-tuning, which first exploits offline dataset to fit existing knowledge and then leverages RL to enhance model's generalization performance. Extensive empirical results on the CEC2019 benchmark demonstrate that Q-MetaSur not only outperforms representative surrogate baselines in objective approximation accuracy, but also helps underlying evolutionary algorithms achieve both desired optimization convergence and improved pareto optimality.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear whether common featurization methods - including sparse autoencoders (SAEs) and sparse probes - recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting where we control the correlations between textual concepts, such as sentiment, domain, and tense, and analyze performance under increasing correlations between them. We first evaluate the extent to which featurizers can learn disentangled representations of each concept under increasing correlational strengths. We observe a one-to-many relationship from concepts to features: features correspond to no more than one concept, but concepts are distributed across many features. Then, we perform steering experiments, measuring whether each concept is independently manipulable. Even when trained on uniform distributions of concepts, SAE features generally affect many concepts when steered, indicating that they are neither selective nor independent; nonetheless, features affect disjoint subspaces. These results suggest that correlational metrics for measuring disentanglement are generally not sufficient for establishing independence when steering, and that affecting disjoint subspaces is not sufficient for concept selectivity. These results underscore the importance of compositional evaluations in interpretability research.
๐
2025-12-17
๐ ArXiv Preprint (cs.CE)
๐ฅ Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive empirical results show that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks, performing on par with state-of-the-art multimodal pLMs despite being developed under limited computational resources. It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuqing Xie, Jiayu Chen, Wenhao Tang, Ya Zhang, Chao Yu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to those obtained with manually tuned reward functions.
๐
2025-12-17
๐ This work has been submitted to the IEEE for possi
๐ฅ Jiayang Wan, Ke He, Yafei Wang, Fan Liu, Wenjin Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Due to the significant variations in unmanned aerial vehicle (UAV) altitude and horizontal mobility, it becomes difficult for any single network to ensure continuous and reliable threedimensional coverage. Towards that end, the space-air-ground integrated network (SAGIN) has emerged as an essential architecture for enabling ubiquitous UAV connectivity. To address the pronounced disparities in coverage and signal characteristics across heterogeneous networks, this paper formulates UAV mobility management in SAGIN as a constrained multi-objective joint optimization problem. The formulation couples discrete link selection with continuous trajectory optimization. Building on this, we propose a two-level multi-agent hierarchical deep reinforcement learning (HDRL) framework that decomposes the problem into two alternately solvable subproblems. To map complex link selection decisions into a compact discrete action space, we conceive a double deep Q-network (DDQN) algorithm in the top-level, which achieves stable and high-quality policy learning through double Q-value estimation. To handle the continuous trajectory action space while satisfying quality of service (QoS) constraints, we integrate the maximum-entropy mechanism of the soft actor-critic (SAC) and employ a Lagrangian-based constrained SAC (CSAC) algorithm in the lower-level that dynamically adjusts the Lagrange multipliers to balance constraint satisfaction and policy optimization. Moreover, the proposed algorithm can be extended to multi-UAV scenarios under the centralized training and decentralized execution (CTDE) paradigm, which enables more generalizable policies. Simulation results demonstrate that the proposed scheme substantially outperforms existing benchmarks in throughput, link switching frequency and QoS satisfaction.
๐
2025-12-17
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Pilyoung Kim, Yun Xie, Sujin Yang
ๆฅ็ๆ่ฆ
General-purpose conversational AI chatbots and AI companions increasingly provide young adolescents with emotionally supportive conversations, raising questions about how conversational style shapes anthropomorphism and emotional reliance. In a preregistered online experiment with 284 adolescent-parent dyads, youth aged 11-15 and their parents read two matched transcripts in which a chatbot responded to an everyday social problem using either a relational style (first-person, affiliative, commitment language) or a transparent style (explicit nonhumanness, informational tone). Adolescents more often preferred the relational than the transparent style, whereas parents were more likely to prefer transparent style than adolescents. Adolescents rated the relational chatbot as more human-like, likable, trustworthy and emotionally close, while perceiving both styles as similarly helpful. Adolescents who preferred relational style had lower family and peer relationship quality and higher stress and anxiety than those preferring transparent style or both chatbots. These findings identify conversational style as a key design lever for youth AI safety, showing that relational framing heightens anthropomorphism, trust and emotional closeness and can be especially appealing to socially and emotionally vulnerable adolescents, who may be at increased risk for emotional reliance on conversational AI.
๐
2025-12-17
๐ This work has been submitted to the IEEE for possi
๐ฅ Runze Li, Hanchen Wang, Wenjie Zhang, Binghao Li, Yu Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at https://anonymous.4open.science/r/TimeSeriesImputation-52BF
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ali Ghodsi
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Sequence modeling has produced diverse architectures -- from classical recurrent neural networks to modern Transformers and state space models (SSMs) -- yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.
๐
2025-12-17
๐ AAAI 2026
๐ฅ Sunwoo Kim, Soo Yong Lee, Kyungho Kim, Hyunjin Hwang, Jaemin Yoo et al.
ๆฅ็ๆ่ฆ
Unsupervised node representation learning aims to obtain meaningful node embeddings without relying on node labels. To achieve this, graph convolution, which aggregates information from neighboring nodes, is commonly employed to encode node features and graph topology. However, excessive reliance on graph convolution can be suboptimal-especially in non-homophilic graphs-since it may yield unduly similar embeddings for nodes that differ in their features or topological properties. As a result, adjusting the degree of graph convolution usage has been actively explored in supervised learning settings, whereas such approaches remain underexplored in unsupervised scenarios. To tackle this, we propose FUEL, which adaptively learns the adequate degree of graph convolution usage by aiming to enhance intra-class similarity and inter-class separability in the embedding space. Since classes are unknown, FUEL leverages node features to identify node clusters and treats these clusters as proxies for classes. Through extensive experiments using 15 baseline methods and 14 benchmark datasets, we demonstrate the effectiveness of FUEL in downstream tasks, achieving state-of-the-art performance across graphs with diverse levels of homophily.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jinwu Hu, Dongjin Yang, Langyu Bian, Zhiquan Wen, Yufeng Wang et al.
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated impressive performance across various language tasks. However, existing LLM reasoning strategies mainly rely on the LLM itself with fast or slow mode (like o1 thinking) and thus struggle to balance reasoning efficiency and accuracy across queries of varying difficulties. In this paper, we propose Cognitive-Inspired Elastic Reasoning (CogER), a framework inspired by human hierarchical reasoning that dynamically selects the most suitable reasoning strategy for each query. Specifically, CogER first assesses the complexity of incoming queries and assigns them to one of several predefined levels, each corresponding to a tailored processing strategy, thereby addressing the challenge of unobservable query difficulty. To achieve automatic strategy selection, we model the process as a Markov Decision Process and train a CogER-Agent using reinforcement learning. The agent is guided by a reward function that balances solution quality and computational cost, ensuring resource-efficient reasoning. Moreover, for queries requiring external tools, we introduce Cognitive Tool-Assisted Reasoning, which enables the LLM to autonomously invoke external tools within its chain-of-thought. Extensive experiments demonstrate that CogER outperforms state-of-the-art Test-Time scaling methods, achieving at least a 13% relative improvement in average exact match on In-Domain tasks and an 8% relative gain on Out-of-Domain tasks.
๐
2025-12-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Richard Helder Moulton, Austin O'Brien, John D. Hastings
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Although large language models (LLMs) are increasingly used in security-critical workflows, practitioners lack quantitative guidance on which safeguards are worth deploying. This paper introduces a decision-oriented framework and reproducible methodology that together quantify residual risk, convert adversarial probe outcomes into financial risk estimates and return-on-control (RoC) metrics, and enable monetary comparison of layered defenses for LLM-based systems. A retrieval-augmented generation (RAG) service is instantiated using the DeepSeek-R1 model over a corpus containing synthetic personally identifiable information (PII), and subjected to automated attacks with Garak across five vulnerability classes: PII leakage, latent context injection, prompt injection, adversarial attack generation, and divergence. For each (vulnerability, control) pair, attack success probabilities are estimated via Laplace's Rule of Succession and combined with loss triangle distributions, calibrated from public breach-cost data, in 10,000-run Monte Carlo simulations to produce loss exceedance curves and expected losses. Three widely used mitigations, attribute-based access control (ABAC); named entity recognition (NER) redaction using Microsoft Presidio; and NeMo Guardrails, are then compared to a baseline RAG configuration. The baseline system exhibits very high attack success rates (>= 0.98 for PII, latent injection, and prompt injection), yielding a total simulated expected loss of $313k per attack scenario. ABAC collapses success probabilities for PII and prompt-related attacks to near zero and reduces the total expected loss by ~94%, achieving an RoC of 9.83. NER redaction likewise eliminates PII leakage and attains an RoC of 5.97, while NeMo Guardrails provides only marginal benefit (RoC of 0.05).
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyu Shang, Haoran Liu, Rongchao Zhang, Zhiqian Wei, Tongtong Feng
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Generating consistent human images with controllable pose and appearance is essential for applications in virtual try on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at https://github.com/ZANMANGLOOPYE/PMMD.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Debu Sinha
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods - including state-of-the-art OpenAI text-embedding-3-large and cross-encoder models - exhibit unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, proving the task is solvable through reasoning. We term this the "semantic illusion": semantically plausible hallucinations preserve similarity to source documents while introducing factual errors invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting embedding-based detection is insufficient for production RAG deployment.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio
ๆฅ็ๆ่ฆ
The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors (e.g., time of day, season, and holidays) while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates calibrated probabilistic prediction intervals directly from the learned conditional distribution, providing explicit uncertainty quantification essential for trustworthy decision-making. Numerical experiments conducted on frequency-selective EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. The EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS), 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenxiao Zhang, Runshi Zhang, Junchen Wang
ๆฅ็ๆ่ฆ
Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundary, which may lead to small object losses and increase boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category, detailed information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long video. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests conducted on four ultrasound video datasets (two thyroid nodule, the thyroid gland, the heart datasets) compared with the state-of-the-art methods, our method demonstrates marked improvements in segmentation metrics. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long video. The code is available at https://github.com/XiAooZ/MWNet.
๐
2025-12-17
๐ ArXiv Preprint (eess.IV)
๐ฅ Pandega Abyan Zumarsyah, Igi Ardiyanto, Hanung Adi Nugroho
ๆฅ็ๆ่ฆ
This study develops meta-learners for few-shot weakly-supervised segmentation (FWS) to address the challenge of optic disc (OD) and optic cup (OC) segmentation for glaucoma diagnosis with limited labeled fundus images. We significantly improve existing meta-learners by introducing Omni meta-training which balances data usage and diversifies the number of shots. We also develop their efficient versions that reduce computational costs. In addition, we develop sparsification techniques that generate more customizable and representative scribbles and other sparse labels. After evaluating multiple datasets, we find that Omni and efficient versions outperform the original versions, with the best meta-learner being Efficient Omni ProtoSeg (EO-ProtoSeg). It achieves intersection over union (IoU) scores of 88.15% for OD and 71.17% for OC on the REFUGE dataset using just one sparsely labeled image, outperforming few-shot and semi-supervised methods which require more labeled images. Its best performance reaches 86.80% for OD and 71.78%for OC on DRISHTIGS, 88.21% for OD and 73.70% for OC on REFUGE, 80.39% for OD and 52.65% for OC on REFUGE. EO-ProtoSeg is comparable to unsupervised domain adaptation methods yet much lighter with less than two million parameters and does not require any retraining.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fanzhe Fu
Natural Language Processing
ๆฅ็ๆ่ฆ
The transition of Large Language Models (LLMs) from stochastic chat interfaces to reliable software components necessitates a fundamental re-engineering of interaction paradigms. Current methodologies, predominantly heuristic-based "prompt engineering," fail to provide the deterministic guarantees required for mission-critical applications. We introduce the Meta-Prompting Protocol, a rigorous theoretical framework that formalizes the orchestration of LLMs as a programmable, self-optimizing system. Central to this protocol is the Adversarial Trinity, a tripartite topology comprising a Generator (P), an Auditor (A), and an Optimizer (O). By treating natural language instructions as differentiable variables within a semantic computation graph and utilizing textual critiques as gradients, this architecture mitigates hallucination and prevents model collapse. We demonstrate the theoretical viability of this approach using declarative programming paradigms (DSPy) and automatic textual differentiation (TextGrad), establishing a foundation for "Observable Software Engineering" in the era of probabilistic computing.
๐
2025-12-17
๐ ACL 2026
๐ฅ Hongbo Wang, MaungMaung AprilPyone, Isao Echizen
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Disclaimer: Samples in this paper may be harmful and cause discomfort.
Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2\% to 2.5\% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Yunheng Wang, Yixiao Feng, Yuetong Fang, Shuning Zhang, Tan Jing et al.
Robotics
ๆฅ็ๆ่ฆ
3D Scene Graphs (3DSGs) constitute a powerful representation of the physical world, distinguished by their abilities to explicitly model the complex spatial, semantic, and functional relationships between entities, rendering a foundational understanding that enables agents to interact intelligently with their environment and execute versatile behaviors. Embodied navigation, as a crucial component of such capabilities, leverages the compact and expressive nature of 3DSGs to enable long-horizon reasoning and planning in complex, large-scale environments. However, prior works rely on a static-world assumption, defining traversable space solely based on static spatial layouts and thereby treating interactable obstacles as non-traversable. This fundamental limitation severely undermines their effectiveness in real-world scenarios, leading to limited reachability, low efficiency, and inferior extensibility. To address these issues, we propose HERO, a novel framework for constructing Hierarchical Traversable 3DSGs, that redefines traversability by modeling operable obstacles as pathways, capturing their physical interactivity, functional semantics, and the scene's relational hierarchy. The results show that, relative to its baseline, HERO reduces PL by 35.1% in partially obstructed environments and increases SR by 79.4% in fully obstructed ones, demonstrating substantially higher efficiency and reachability.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wenwen Xie, Geng Sun, Ruichen Zhang, Xuejie Liu, Yinqiu Liu et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Integrated sensing and communication (ISAC) has emerged as a key development direction in the sixth-generation (6G) era, which provides essential support for the collaborative sensing and communication of future intelligent networks. However, as wireless environments become increasingly dynamic and complex, ISAC systems require more intelligent processing and more autonomous operation to maintain efficiency and adaptability. Meanwhile, agentic artificial intelligence (AI) offers a feasible solution to address these challenges by enabling continuous perception-reasoning-action loops in dynamic environments to support intelligent, autonomous, and efficient operation for ISAC systems. As such, we delve into the application value and prospects of agentic AI in ISAC systems in this work. Firstly, we provide a comprehensive review of agentic AI and ISAC systems to demonstrate their key characteristics. Secondly, we show several common optimization approaches for ISAC systems and highlight the significant advantages of generative artificial intelligence (GenAI)-based agentic AI. Thirdly, we propose a novel agentic ISAC framework and prensent a case study to verify its superiority in optimizing ISAC performance. Finally, we clarify future research directions for agentic AI-based ISAC systems.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jihao Huang, Xi Xia, Zhiyuan Li, Tianle Liu, Jingke Wang et al.
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
End-to-end paradigms have demonstrated great potential for autonomous driving. Additionally, most existing methods are built upon Transformer architectures. However, transformers incur a quadratic attention cost, limiting their ability to model long spatial and temporal sequences-particularly on resource-constrained edge platforms. As autonomous driving inherently demands efficient temporal modeling, this challenge severely limits their deployment and real-time performance. Recently, linear attention mechanisms have gained increasing attention due to their superior spatiotemporal complexity. However, existing linear attention architectures are limited to self-attention, lacking support for cross-modal and cross-temporal interactions-both crucial for autonomous driving. In this work, we propose LADY, the first fully linear attention-based generative model for end-to-end autonomous driving. LADY enables fusion of long-range temporal context at inference with constant computational and memory costs, regardless of the history length of camera and LiDAR features. Additionally, we introduce a lightweight linear cross-attention mechanism that enables effective cross-modal information exchange. Experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that LADY achieves state-of-the-art performance with constant-time and memory complexity, offering improved planning performance and significantly reduced computational cost. Additionally, the model has been deployed and validated on edge devices, demonstrating its practicality in resource-limited scenarios.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenxiao Gao, Haotian Sun, Na Li, Dale Schuurmans, Bo Dai
Machine LearningRobotics
ๆฅ็ๆ่ฆ
In real-world applications with large state and action spaces, reinforcement learning (RL) typically employs function approximations to represent core components like the policies, value functions, and dynamics models. Although powerful approximations such as neural networks offer great expressiveness, they often present theoretical ambiguities, suffer from optimization instability and exploration difficulty, and incur substantial computational costs in practice. In this paper, we introduce the perspective of spectral representations as a solution to address these difficulties in RL. Stemming from the spectral decomposition of the transition operator, this framework yields an effective abstraction of the system dynamics for subsequent policy optimization while also providing a clear theoretical characterization. We reveal how to construct spectral representations for transition operators that possess latent variable structures or energy-based structures, which implies different learning methods to extract spectral representations from data. Notably, each of these learning methods realizes an effective RL algorithm under this framework. We also provably extend this spectral view to partially observable MDPs. Finally, we validate these algorithms on over 20 challenging tasks from the DeepMind Control Suite, where they achieve performances comparable or superior to current state-of-the-art model-free and model-based baselines.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xidan Song, Weiqi Wang, Ruifeng Cao, Qingya Hu
Natural Language Processing
ๆฅ็ๆ่ฆ
The evaluation of Large Language Models (LLMs) in complex reasoning domains typically relies on performance alignment with ground-truth oracles. In the domain of chess, this standard manifests as accuracy benchmarks against strong engines like Stockfish. However, high scalar accuracy does not necessarily imply robust conceptual understanding. This paper argues that standard accuracy metrics fail to distinguish between genuine geometric reasoning and the superficial memorization of canonical board states. To address this gap, we propose a Geometric Stability Framework, a novel evaluation methodology that rigorously tests model consistency under invariant transformations-including board rotation, mirror symmetry, color inversion, and format conversion. We applied this framework to a comparative analysis of six state-of-the-art LLMs including GPT-5.1, Claude Sonnet 4.5, and Kimi K2 Turbo, utilizing a dataset of approximately 3,000 positions. Our results reveal a significant Accuracy-Stability Paradox. While models such as GPT-5.1 achieve near-optimal accuracy on standard positions, they exhibit catastrophic degradation under geometric perturbation, specifically in rotation tasks where error rates surge by over 600%. This disparity suggests a reliance on pattern matching over abstract spatial logic. Conversely, Claude Sonnet 4.5 and Kimi K2 Turbo demonstrate superior dual robustness, maintaining high consistency across all transformation axes. Furthermore, we analyze the trade-off between helpfulness and safety, identifying Gemini 2.5 Flash as the leader in illegal state rejection (96.0%). We conclude that geometric stability provides an orthogonal and essential metric for AI evaluation, offering a necessary proxy for disentangling reasoning capabilities from data contamination and overfitting in large-scale models.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Damian Hodel, Jevin D. West
Natural Language Processing
ๆฅ็ๆ่ฆ
The growing use of artificial intelligence (AI) raises concerns of knowledge collapse, i.e., a reduction to the most dominant and central set of ideas. Prior work has demonstrated single-model collapse, defined as performance decay in an AI model trained on its own output. Inspired by ecology, we ask whether AI ecosystem diversity, that is, diversity among models, can mitigate such a collapse. We build on the single-model approach but focus on ecosystems of models trained on their collective output. To study the effect of diversity on model performance, we segment the training data across language models and evaluate the resulting ecosystems over ten, self-training iterations. We find that increased epistemic diversity mitigates collapse, but, interestingly, only up to an optimal level. Our results suggest that an ecosystem containing only a few diverse models fails to express the rich mixture of the full, true distribution, resulting in rapid performance decay. Yet distributing the data across too many models reduces each model's approximation capacity on the true distribution, leading to poor performance already in the first iteration step. In the context of AI monoculture, our results suggest the need to monitor diversity across AI systems and to develop policies that incentivize more domain- and community-specific models.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huaying Zhang, Atsushi Hashimoto, Tosho Hirasawa
Natural Language Processing
ๆฅ็ๆ่ฆ
Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a topic for video question answering (VideoQA), where questions are generated for given answers. Their evaluation typically focuses on the ability to answer questions, rather than the quality of generated questions. In contrast, we focus on the question quality in eliciting unseen knowledge from human experts. For a continuous improvement of VQG models, we propose a protocol that evaluates the ability by simulating question-answering communication with experts using a question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D's expert commentary annotation. The EgoExoAsk training set is used to obtain the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate our metric reasonably aligns with question generation settings: models accessing richer context are evaluated better, supporting that our protocol works as intended. The EgoExoAsk dataset is available in https://github.com/omron-sinicx/VQG4ExpertKnowledge .
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruiyi Zhang, Peijia Qin, Qi Cao, Pengtao Xie
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applying on test-time scaling, DreamPRM-Code achieved state-of-the-art performance on LiveCodeBench with 80.9 pass@1 rate, surpassing OpenAI o4-mini.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sibi Parivendan, Kashfia Sailunaz, Suresh Neethirajan
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maria Bulychev, Neil G. Marchant, Benjamin I. P. Rubinstein
Robotics
ๆฅ็ๆ่ฆ
Recent advances in generative AI have enabled the creation of highly realistic digital content, raising concerns around authenticity, ownership, and misuse. While watermarking has become an increasingly important mechanism to trace and protect digital media, most existing image watermarking schemes operate as black boxes, producing global detection scores without offering any insight into how or where the watermark is present. This lack of transparency impacts user trust and makes it difficult to interpret the impact of tampering. In this paper, we present a post-hoc image watermarking method that combines localised embedding with region-level interpretability. Our approach embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This allows us to generate detection maps that reveal which regions of an image are likely watermarked or altered. We show that our method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. At the same time, the watermark remains highly imperceptible. Compared to prior post-hoc methods, our approach offers more interpretable detection while retaining competitive robustness. For example, our watermarks are robust to cropping up to half the image.
๐
2025-12-17
๐ Accepted by the 48th IEEE/ACM International Confer
๐ฅ Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh
Machine Learning
ๆฅ็ๆ่ฆ
Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)-based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their resolution, but it is extremely challenging due to the inherent nondeterminism of DL models and their tight coupling with hardware and software environments. According to recent studies, only about 3% of DL bugs can be reliably reproduced using manual approaches. To address these challenges, we present RepGen, a novel, automated, and intelligent approach for reproducing deep learning bugs. RepGen constructs a learning-enhanced context from a project, develops a comprehensive plan for bug reproduction, employs an iterative generate-validate-refine mechanism, and thus generates such code using an LLM that reproduces the bug at hand. We evaluate RepGen on 106 real-world deep learning bugs and achieve a reproduction rate of 80.19%, a 19.81% improvement over the state-of-the-art measure. A developer study involving 27 participants shows that RepGen improves the success rate of DL bug reproduction by 23.35%, reduces the time to reproduce by 56.8%, and lowers participants' cognitive load.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yaniv Leviathan, Matan Kalman, Yossi Matias
Natural Language Processing
ๆฅ็ๆ่ฆ
When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images with a self-curation strategy with minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive paradigm (AR), owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that could be translated from any powerful AR models. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement-a 34.4% gain on the MMMU-Pro (vision) bench and 37.5% gain on the MME (Cog.) bench-alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Divam Gupta, Anuj Pahuja, Nemanja Bartolovic, Tomas Simon, Forrest Iandola et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that can be generated from multi-view images and efficiently rendered on mobile devices. GPiCA utilizes a unique hybrid representation that combines a triangle mesh and anisotropic 3D Gaussians. This combination maximizes memory and rendering efficiency while maintaining a photorealistic appearance. The triangle mesh is highly efficient in representing surface areas like facial skin, while the 3D Gaussians effectively handle non-surface areas such as hair and beard. To this end, we develop a unified differentiable rendering pipeline that treats the mesh as a semi-transparent layer within the volumetric rendering paradigm of 3D Gaussian Splatting. We train neural networks to decode a facial expression code into three components: a 3D face mesh, an RGBA texture, and a set of 3D Gaussians. These components are rendered simultaneously in a unified rendering engine. The networks are trained using multi-view image supervision. Our results demonstrate that GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Leo Segre, Or Hirschorn, Shai Avidan
RoboticsNatural Language ProcessingMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for various applications. However, in case we have multiple views of the same 3D scene, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformers-based foundation models (i.e., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help match features across different views. As leading examples, we show surface normal estimation and multi-view segmentation tasks. Quantitative experiments show that our method improves feature matching considerably compared to current foundation models.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu Wang, Juhyung Ha, Frangil M. Ramirez, Yuchen Wang, David J. Crandall
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei et al.
ๆฅ็ๆ่ฆ
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski et al.
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce \model, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Georg Siedel, Rojan Regmi, Abhirami Anand, Weijia Shao, Silvia Vock et al.
ๆฅ็ๆ่ฆ
This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of deep vision models to common corruptions. We show that although applying style transfer on synthetic images degrades their quality with respect to the common FID metric, these images are surprisingly beneficial for model training. We conduct a systematic empirical analysis of the effects of both augmentations and their key hyperparameters on the performance of image classifiers. Our results demonstrate that stylization and synthetic data complement each other well and can be combined with popular rule-based data augmentation techniques such as TrivialAugment, while not working with others. Our method achieves state-of-the-art corruption robustness on several small-scale image classification benchmarks, reaching 93.54%, 74.9% and 50.86% robust accuracy on CIFAR-10-C, CIFAR-100-C and TinyImageNet-C, respectively
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tianze Luo, Haotian Yuan, Zhuang Liu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The multi-step denoising process in diffusion and Flow Matching models causes major efficiency issues, which motivates research on few-step generation. We present Solution Flow Models (SoFlow), a framework for one-step generation from scratch. By analyzing the relationship between the velocity function and the solution function of the velocity ordinary differential equation (ODE), we propose a Flow Matching loss and a solution consistency loss to train our models. The Flow Matching loss allows our models to provide estimated velocity fields for Classifier-Free Guidance (CFG) during training, which improves generation performance. Notably, our consistency loss does not require the calculation of the Jacobian-vector product (JVP), a common requirement in recent works that is not well-optimized in deep learning frameworks like PyTorch. Experimental results indicate that, when trained from scratch using the same Diffusion Transformer (DiT) architecture and an equal number of training epochs, our models achieve better FID-50K scores than MeanFlow models on the ImageNet 256x256 dataset.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios.We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context.This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu et al.
ๆฅ็ๆ่ฆ
Soft labels generated by teacher models have become a dominant paradigm for knowledge transfer and recent large-scale dataset distillation such as SRe2L, RDED, LPLD, offering richer supervision than conventional hard labels. However, we observe that when only a limited number of crops per image are used, soft labels are prone to local semantic drift: a crop may visually resemble another class, causing its soft embedding to deviate from the ground-truth semantics of the original image. This mismatch between local visual content and global semantic meaning introduces systematic errors and distribution misalignment between training and testing. In this work, we revisit the overlooked role of hard labels and show that, when appropriately integrated, they provide a powerful content-agnostic anchor to calibrate semantic drift. We theoretically characterize the emergence of drift under few soft-label supervision and demonstrate that hybridizing soft and hard labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which leverages hard labels as intermediate corrective signals while retaining the fine-grained advantages of soft labels. Extensive experiments on dataset distillation and large-scale conventional classification benchmarks validate our approach, showing consistent improvements in generalization. On ImageNet-1K, we achieve 42.7% with only 285M storage for soft labels, outperforming prior state-of-the-art LPLD by 9.0%. Our findings re-establish the importance of hard labels as a complementary tool, and call for a rethinking of their role in soft-label-dominated training.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanhang Li, Yiren Song, Junzhe Bai, Xinran Liang, Hu Yang et al.
Robotics
ๆฅ็ๆ่ฆ
We propose \textbf{IC-Effect}, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (\eg flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as clean contextual conditions, exploiting the contextual learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning $15$ high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu Zheng, Jie Hu, Kailun Yang, Jiaming Zhang
Robotics
ๆฅ็ๆ่ฆ
Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce a new concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), which aims to address two tasks: (1) reactive forecasting: ''what will happen next'' and (2) proactive forecasting: "what would happen given a specific future action". For the first time, we create a new OccSTeP benchmark with challenging scenarios (e.g., erroneous semantic labels and dropped frames). To address this task, we propose OccSTeP-WM, a tokenizer-free world model that maintains a dense voxel-based scene state and incrementally fuses spatio-temporal context over time. OccSTeP-WM leverages a linear-complexity attention backbone and a recurrent state-space module to capture long-range spatial dependencies while continually updating the scene memory with ego-motion compensation. This design enables online inference and robust performance even when historical sensor input is missing or noisy. Extensive experiments prove the effectiveness of the OccSTeP concept and our OccSTeP-WM, yielding an average semantic mIoU of 23.70% (+6.56% gain) and occupancy IoU of 35.89% (+9.26% gain). The data and code will be open source at https://github.com/FaterYU/OccSTeP.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Morgan Coe, Gruffudd Jones, Leah-Nani Alconcel, Marina Gashinova
ๆฅ็ๆ่ฆ
With the rapidly growing population of resident space objects (RSOs) in the near-Earth space environment, detailed information about their condition and capabilities is needed to provide Space Domain Awareness (SDA). Space-based sensing will enable inspection of RSOs at shorter ranges, independent of atmospheric effects, and from all aspects. The use of a sub-THz inverse synthetic aperture radar (ISAR) imaging and sensing system for SDA has been proposed in previous work, demonstrating the achievement of sub-cm image resolution at ranges of up to 100 km. This work focuses on recognition of external structures by use of sequential feature detection and tracking throughout the aligned ISAR images of the satellites. The Hough transform is employed to detect linear features, which are tracked throughout the sequence. ISAR imagery is generated via a metaheuristic simulator capable of modelling encounters for a variety of deployment scenarios. Initial frame-to-frame alignment is achieved through a series of affine transformations to facilitate later association between image features. A gradient-by-ratio method is used for edge detection within individual ISAR images, and edge magnitude and direction are subsequently used to inform a double-weighted Hough transform to detect features with high accuracy. Feature evolution during sequences of frames is analysed. It is shown that the use of feature tracking within sequences with the proposed approach will increase confidence in feature detection and classification, and an example use-case of robust detection of shadowing as a feature is presented.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Johannes Hรคgerlind, Bao-Long Tran, Urs Waldmann, Per-Erik Forssรฉn
Computer Vision
ๆฅ็ๆ่ฆ
Estimating camera intrinsics and extrinsics is a fundamental problem in computer vision, and while advances in structure-from-motion (SfM) have improved accuracy and robustness, open challenges remain. In this paper, we introduce a robust method for pose estimation and calibration. We consider a set of rigid cameras, each observing the scene from a different perspective, which is a typical camera setup in animal behavior studies and forensic analysis of surveillance footage. Specifically, we analyse the individual components in a structure-from-motion (SfM) pipeline, and identify design choices that improve accuracy. Our main contributions are: (1) we investigate how to best subsample the predicted correspondences from a dense matcher to leverage them in the estimation process. (2) We investigate selection criteria for how to add the views incrementally. In a rigorous quantitative evaluation, we show the effectiveness of our changes, especially for cameras with strong radial distortion (79.9% ours vs. 40.4 vanilla VGGT). Finally, we demonstrate our correspondence subsampling in a global SfM setting where we initialize the poses using VGGT. The proposed pipeline generalizes across a wide range of camera setups, and could thus become a useful tool for animal behavior and forensic analysis.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengming Yin, Zekai Zhang, Zecheng Tang, Kaiyuan Gao, Xiao Xu et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose \textbf{Qwen-Image-Layered}, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling \textbf{inherent editability}, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released on \href{https://github.com/QwenLM/Qwen-Image-Layered}{https://github.com/QwenLM/Qwen-Image-Layered}
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tobias Kirschstein, Simon Giebenhain, Matthias Nieรner
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations. Website: https://tobias-kirschstein.github.io/flexavatar/
๐
2025-12-17
๐ Accepted at IEEE/CVF Winter Conference on Applicat
๐ฅ Shashank Mishra, Karan Patil, Didier Stricker, Jason Rambach
Computer Vision
ๆฅ็ๆ่ฆ
High-performance Radar-Camera 3D object detection can be achieved by leveraging knowledge distillation without using LiDAR at inference time. However, existing distillation methods typically transfer modality-specific features directly to each sensor, which can distort their unique characteristics and degrade their individual strengths. To address this, we introduce IMKD, a radar-camera fusion framework based on multi-level knowledge distillation that preserves each sensor's intrinsic characteristics while amplifying their complementary strengths. IMKD applies a three-stage, intensity-aware distillation strategy to enrich the fused representation across the architecture: (1) LiDAR-to-Radar intensity-aware feature distillation to enhance radar representations with fine-grained structural cues, (2) LiDAR-to-Fused feature intensity-guided distillation to selectively highlight useful geometry and depth information at the fusion level, fostering complementarity between the modalities rather than forcing them to align, and (3) Camera-Radar intensity-guided fusion mechanism that facilitates effective feature alignment and calibration. Extensive experiments on the nuScenes benchmark show that IMKD reaches 67.0% NDS and 61.0% mAP, outperforming all prior distillation-based radar-camera fusion methods. Our code and models are available at https://github.com/dfki-av/IMKD/.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen
ๆฅ็ๆ่ฆ
In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Roni Blushtein-Livnon, Osher Rafaeli, David Ioffe, Amir Boger, Karen Sandberg Esquenazi et al.
Computer Vision
ๆฅ็ๆ่ฆ
Remote sensing (RS) image segmentation is constrained by the limited availability of annotated data and a gap between overhead imagery and natural images used to train foundational models. This motivates effective adaptation under limited supervision. SAM3 concept-driven framework generates masks from textual prompts without requiring task-specific modifications, which may enable this adaptation. We evaluate SAM3 for RS imagery across four target types, comparing textual, geometric, and hybrid prompting strategies, under lightweight fine-tuning scales with increasing supervision, alongside zero-shot inference. Results show that combining semantic and geometric cues yields the highest performance across targets and metrics. Text-only prompting exhibits the lowest performance, with marked score gaps for irregularly shaped targets, reflecting limited semantic alignment between SAM3 textual representations and their overhead appearances. Nevertheless, textual prompting with light fine-tuning offers a practical performance-effort trade-off for geometrically regular and visually salient targets. Across targets, performance improves between zero-shot inference and fine-tuning, followed by diminishing returns as the supervision scale increases. Namely, a modest geometric annotation effort is sufficient for effective adaptation. A persistent gap between Precision and IoU further indicates that under-segmentation and boundary inaccuracies remain prevalent error patterns in RS tasks, particularly for irregular and less prevalent targets.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our code is available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ditmar Hadera, Jan Cech, Miroslav Purkrabek, Matej Hoffmann
Computer Vision
ๆฅ็ๆ่ฆ
Ensuring the ethical use of video data involving human subjects, particularly infants, requires robust anonymization methods. We propose BLANKET (Baby-face Landmark-preserving ANonymization with Keypoint dEtection consisTency), a novel approach designed to anonymize infant faces in video recordings while preserving essential facial attributes. Our method comprises two stages. First, a new random face, compatible with the original identity, is generated via inpainting using a diffusion model. Second, the new identity is seamlessly incorporated into each video frame through temporally consistent face swapping with authentic expression transfer. The method is evaluated on a dataset of short video recordings of babies and is compared to the popular anonymization method, DeepPrivacy2. Key metrics assessed include the level of de-identification, preservation of facial attributes, impact on human pose estimation (as an example of a downstream task), and presence of artifacts. Both methods alter the identity, and our method outperforms DeepPrivacy2 in all other respects. The code is available as an easy-to-use anonymization demo at https://github.com/ctu-vras/blanket-infant-face-anonym.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Joรฃo Daniel Silva, Joao Magalhaes, Devis Tuia, Bruno Martins
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
The remote sensing community has recently seen the emergence of methods based on Large Vision and Language Models (LVLMs) that can address multiple tasks at the intersection of computer vision and natural language processing. To fully exploit the potential of such models, a significant focus has been given to the collection of large amounts of training data that cover multiple remote sensing-specific tasks, such as image captioning or visual question answering. However, the cost of using and training LVLMs is high, due to the large number of parameters. While multiple parameter-efficient adaptation techniques have been explored, the computational costs of training and inference with these models can remain prohibitive for most institutions. In this work, we explore the use of encoder-only architectures and propose a model that can effectively address multi-task learning while remaining compact in terms of the number of parameters. In particular, our model tackles combinations of tasks that are not typically explored in a unified model: the generation of text from remote sensing images and cross-modal retrieval. The results of our GeoMELT model - named from Multi-task Efficient Learning Transformer - in established benchmarks confirm the efficacy and efficiency of the proposed approach.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daiqing Wu, Dongbao Yang, Can Ma. Yu Zhou
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Visual Emotion Comprehension (VEC) aims to infer sentiment polarities or emotion categories from affective cues embedded in images. In recent years, Multimodal Large Language Models (MLLMs) have established a popular paradigm in VEC, leveraging their generalizability to unify VEC tasks defined under diverse emotion taxonomies. While this paradigm achieves notable success, it typically formulates VEC as a deterministic task, requiring the model to output a single, definitive emotion label for each image. Such a formulation insufficiently accounts for the inherent subjectivity of emotion perception, overlooking alternative interpretations that may be equally plausible to different viewers. To address this limitation, we propose equipping MLLMs with capabilities to verbalize their confidence in emotion predictions. This additional signal provides users with an estimate of both the plausibility of alternative interpretations and the MLLMs' self-assessed competence, thereby enhancing reliability in practice. Building on this insight, we introduce a three-stage training framework that progressively endows with structured reasoning, teaches to verbalize confidence, and calibrates confidence expression, culminating in EmoCaliber, a confidence-aware MLLM for VEC. Through fair and comprehensive evaluations on the unified benchmark VECBench, EmoCaliber demonstrates overall superiority against existing methods in both emotion prediction and confidence estimation. These results validate the effectiveness of our approach and mark a feasible step toward more reliable VEC systems. Project page: https://github.com/wdqqdw/EmoCaliber.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxiang Shi, Zhe Li, Yanwen Wang, Hao Zhu, Xun Cao et al.
Robotics
ๆฅ็ๆ่ฆ
Portrait animation from a single source image and a driving video is a long-standing problem. Recent approaches tend to adopt diffusion-based image/video generation models for realistic and expressive animation. However, none of these diffusion models realizes high-fidelity disentangled control between the head pose and facial expression, hindering applications like expression-only or pose-only editing and animation. To address this, we propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals. Specifically, we represent the pose as an explicit global transformation and the expression as an implicit latent code. First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals. Then we propose to inject the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention. Finally, we design a progressive hybrid classifier-free guidance for more faithful identity consistency. Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Opeyemi Bamigbade, Mark Scanlon, John Sheppard
Machine LearningNatural Language ProcessingRoboticsComputer Vision
ๆฅ็ๆ่ฆ
Recent advances in AI-driven image generation have introduced new challenges for verifying the authenticity of digital evidence in forensic investigations. Modern generative models can produce visually consistent forgeries that evade traditional detectors based on pixel or compression artefacts. Most existing approaches also lack an explicit measure of anomaly intensity, which limits their ability to quantify the severity of manipulation. This paper introduces Vision-Attention Anomaly Scoring (VAAS), a novel dual-module framework that integrates global attention-based anomaly estimation using Vision Transformers (ViT) with patch-level self-consistency scoring derived from SegFormer embeddings. The hybrid formulation provides a continuous and interpretable anomaly score that reflects both the location and degree of manipulation. Evaluations on the DF2023 and CASIA v2.0 datasets demonstrate that VAAS achieves competitive F1 and IoU performance, while enhancing visual explainability through attention-guided anomaly maps. The framework bridges quantitative detection with human-understandable reasoning, supporting transparent and reliable image integrity assessment. The source code for all experiments and corresponding materials for reproducing the results are available open source.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arthur Moreau, Richard Shaw, Michal Nazarczuk, Jisu Shin, Thomas Tanay et al.
Computer Vision
ๆฅ็ๆ่ฆ
Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, "Off The Grid" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts. Moreover, we observe that by learning to render 3D Gaussians, our 3D reconstruction backbone improves camera pose estimation, suggesting opportunities to train these foundational models without labels.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rohit Jena, Pratik Chaudhari, James C. Gee
Machine Learning
ๆฅ็ๆ่ฆ
The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions, assertions that contradict established understanding of domain shift in deep learning. In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's d scores ranging from 0.7-1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep methods exhibit high sensitivity to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seyed Abolfazl Ghasemzadeh, Alexandre Alahi, Christophe De Vleeschouwer
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Estimating 3D human poses from 2D images remains challenging due to occlusions and projective ambiguity. Multi-view learning-based approaches mitigate these issues but often fail to generalize to real-world scenarios, as large-scale multi-view datasets with 3D ground truth are scarce and captured under constrained conditions. To overcome this limitation, recent methods rely on 2D pose estimation combined with 2D-to-3D pose lifting trained on synthetic data. Building on our previous MPL framework, we propose RUMPL, a transformer-based 3D pose lifter that introduces a 3D ray-based representation of 2D keypoints. This formulation makes the model independent of camera calibration and the number of views, enabling universal deployment across arbitrary multi-view configurations without retraining or fine-tuning. A new View Fusion Transformer leverages learned fused-ray tokens to aggregate information along rays, further improving multi-view consistency. Extensive experiments demonstrate that RUMPL reduces MPJPE by up to 53% compared to triangulation and over 60% compared to transformer-based image-representation baselines. Results on new benchmarks, including in-the-wild multi-view and multi-person datasets, confirm its robustness and scalability. The framework's source code is available at https://github.com/aghasemzadeh/OpenRUMPL
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Malach Obisa Amonga, Benard Osero, Edna Too
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Wildlife object detection plays a vital role in biodiversity conservation, ecological monitoring, and habitat protection. However, this task is often challenged by environmental variability, visual similarities among species, and intra-class diversity. This study investigates the effectiveness of two individual deep learning architectures ResNet-101 and Inception v3 for wildlife object detection under such complex conditions. The models were trained and evaluated on a wildlife image dataset using a standardized preprocessing approach, which included resizing images to a maximum dimension of 800 pixels, converting them to RGB format, and transforming them into PyTorch tensors. A ratio of 70:30 training and validation split was used for model development. The ResNet-101 model achieved a classification accuracy of 94% and a mean Average Precision (mAP) of 0.91, showing strong performance in extracting deep hierarchical features. The Inception v3 model performed slightly better, attaining a classification accuracy of 95% and a mAP of 0.92, attributed to its efficient multi-scale feature extraction through parallel convolutions. Despite the strong results, both models exhibited challenges when detecting species with similar visual characteristics or those captured under poor lighting and occlusion. Nonetheless, the findings confirm that both ResNet-101 and Inception v3 are effective models for wildlife object detection tasks and provide a reliable foundation for conservation-focused computer vision applications.
๐
2025-12-17
๐ Under Review at IEEE Transactions on Image Process
๐ฅ Yueqianji Chen, Kevin Williams, John H. Doonan, Paolo Remagnino, Jo Hepworth
ๆฅ็ๆ่ฆ
Automated extraction of individual plant branches from time-series imagery is essential for high-throughput phenotyping, yet it remains computationally challenging due to non-rigid growth dynamics and severe identity fragmentation within entangled canopies. To overcome these stage-dependent ambiguities, we propose ST-DETrack, a spatiotemporal-fusion dual-decoder network designed to preserve branch identity from budding to flowering. Our architecture integrates a spatial decoder, which leverages geometric priors such as position and angle for early-stage tracking, with a temporal decoder that exploits motion consistency to resolve late-stage occlusions. Crucially, an adaptive gating mechanism dynamically shifts reliance between these spatial and temporal cues, while a biological constraint based on negative gravitropism mitigates vertical growth ambiguities. Validated on a Brassica napus dataset, ST-DETrack achieves a Branch Matching Accuracy (BMA) of 93.6%, significantly outperforming spatial and temporal baselines by 28.9 and 3.3 percentage points, respectively. These results demonstrate the method's robustness in maintaining long-term identity consistency amidst complex, dynamic plant architectures.
๐
2025-12-17
๐ AAAI 2026
๐ฅ Longchen Dai, Zixuan Shen, Zhiheng Zhou, Peipeng Yu, Zhihua Xia
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Face recognition systems store face templates for efficient matching. Once leaked, these templates pose a threat: inverting them can yield photorealistic surrogates that compromise privacy and enable impersonation. Although existing research has achieved relatively realistic face template inversion, the reconstructed facial images exhibit over-smoothed facial-part attributes (eyes, nose, mouth) and limited transferability. To address this problem, we present CLIP-FTI, a CLIP-driven fine-grained attribute conditioning framework for face template inversion. Our core idea is to use the CLIP model to obtain the semantic embeddings of facial features, in order to realize the reconstruction of specific facial feature attributes. Specifically, facial feature attribute embeddings extracted from CLIP are fused with the leaked template via a cross-modal feature interaction network and projected into the intermediate latent space of a pretrained StyleGAN. The StyleGAN generator then synthesizes face images with the same identity as the templates but with more fine-grained facial feature attributes. Experiments across multiple face recognition backbones and datasets show that our reconstructions (i) achieve higher identification accuracy and attribute similarity, (ii) recover sharper component-level attribute semantics, and (iii) improve cross-model attack transferability compared to prior reconstruction attacks. To the best of our knowledge, ours is the first method to use additional information besides the face template attack to realize face template inversion and obtains SOTA results.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hoang Nguyen, Xiaohao Xu, Xiaonan Huang
ๆฅ็ๆ่ฆ
Monocular depth foundation models achieve remarkable generalization by learning large-scale semantic priors, but this creates a critical vulnerability: they hallucinate illusory 3D structures from geometrically planar but perceptually ambiguous inputs. We term this failure the 3D Mirage. This paper introduces the first end-to-end framework to probe, quantify, and tame this unquantified safety risk. To probe, we present 3D-Mirage, the first benchmark of real-world illusions (e.g., street art) with precise planar-region annotations and context-restricted crops. To quantify, we propose a Laplacian-based evaluation framework with two metrics: the Deviation Composite Score (DCS) for spurious non-planarity and the Confusion Composite Score (CCS) for contextual instability. To tame this failure, we introduce Grounded Self-Distillation, a parameter-efficient strategy that surgically enforces planarity on illusion ROIs while using a frozen teacher to preserve background knowledge, thus avoiding catastrophic forgetting. Our work provides the essential tools to diagnose and mitigate this phenomenon, urging a necessary shift in MDE evaluation from pixel-wise accuracy to structural and contextual robustness. Our code and benchmark will be publicly available to foster this exciting research direction.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Zhenhan Yin, Xuanhan Wang, Jiahao Jiang, Kaiyuan Deng, Pengqi Chen et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
While leveraging abundant human videos and simulated robot data poses a scalable solution to the scarcity of real-world robot data, the generalization capability of existing vision-language-action models (VLAs) remains limited by mismatches in camera views, visual appearance, and embodiment morphologies. To overcome this limitation, we propose MiVLA, a generalizable VLA empowered by human-robot mutual imitation pre-training, which leverages inherent behavioral similarity between human hands and robotic arms to build a foundation of strong behavioral priors for both human actions and robotic control. Specifically, our method utilizes kinematic rules with left/right hand coordinate systems for bidirectional alignment between human and robot action spaces. Given human or simulated robot demonstrations, MiVLA is trained to forecast behavior trajectories for one embodiment, and imitate behaviors for another one unseen in the demonstration. Based on this mutual imitation, it integrates the behavioral fidelity of real-world human data with the manipulative diversity of simulated robot data into a unified model, thereby enhancing the generalization capability for downstream tasks. Extensive experiments conducted on both simulation and real-world platforms with three robots (ARX, PiPer and LocoMan), demonstrate that MiVLA achieves strong improved generalization capability, outperforming state-of-the-art VLAs (e.g., $\boldsymbolฯ_{0}$, $\boldsymbolฯ_{0.5}$ and H-RDT) by 25% in simulation, and 14% in real-world robot control tasks.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Simon Gutwein, Arthur Longuefosse, Jun Seita, Sabine Taschner-Mandl, Roxane Licandro
Machine Learning
ๆฅ็ๆ่ฆ
Multiplexed tissue imaging measures dozens of protein markers per cell, yet most deep learning models still apply early channel fusion, assuming shared structure across markers. We investigate whether preserving marker independence, combined with deliberately shallow architectures, provides a more suitable inductive bias for self-supervised representation learning in multiplex data than increasing model scale. Using a Hodgkin lymphoma CODEX dataset with 145,000 cells and 49 markers, we compare standard early-fusion CNNs with channel-separated architectures, including a marker-aware baseline and our novel shallow Channel-Independent Model (CIM-S) with 5.5K parameters. After contrastive pretraining and linear evaluation, early-fusion models show limited ability to retain marker-specific information and struggle particularly with rare-cell discrimination. Channel-independent architectures, and CIM-S in particular, achieve substantially stronger representations despite their compact size. These findings are consistent across multiple self-supervised frameworks, remain stable across augmentation settings, and are reproducible across both the 49-marker and reduced 18-marker settings. These results show that lightweight, channel-independent architectures can match or surpass deep early-fusion CNNs and foundation models for multiplex representation learning. Code is available at https://github.com/SimonBon/CIM-S.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Liang Peng, Yixuan Ye, Cheng Liu, Hangjun Che, Fei Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Multi-view clustering has been empirically shown to improve learning performance by leveraging the inherent complementary information across multiple views of data. However, in real-world scenarios, collecting strictly aligned views is challenging, and learning from both aligned and unaligned data becomes a more practical solution. Partially View-aligned Clustering aims to learn correspondences between misaligned view samples to better exploit the potential consistency and complementarity across views, including both aligned and unaligned data. However, most existing PVC methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster. Moreover, the inherent heterogeneity of multi-view data induces distributional shifts in representations, leading to inaccuracies in establishing meaningful correspondences between cross-view latent features and, consequently, impairing learning effectiveness. To address these challenges, we propose a Semantic MAtching contRasTive learning model (SMART) for PVC. The main idea of our approach is to alleviate the influence of cross-view distributional shifts, thereby facilitating semantic matching contrastive learning to fully exploit semantic relationships in both aligned and unaligned data. Extensive experiments on eight benchmark datasets demonstrate that our method consistently outperforms existing approaches on the PVC problem.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arnau Barrera Roy, Albert Clapรฉs Sintes
Machine LearningMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Computer vision and video understanding have transformed sports analytics by enabling large-scale, automated analysis of game dynamics from broadcast footage. Despite significant advances in player and ball tracking, pose estimation, action localization, and automatic foul recognition, anticipating actions before they occur in sports videos has received comparatively little attention. This work introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball following a shot attempt. To benchmark this task, a new self-curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events is presented. Comprehensive baseline results are reported using state-of-the-art action anticipation methods, representing the first application of deep learning techniques to basketball rebound prediction. Additionally, two complementary tasks, rebound classification and rebound spotting, are explored, demonstrating that this dataset supports a wide range of video understanding applications in basketball, for which no comparable datasets currently exist. Experimental results highlight both the feasibility and inherent challenges of anticipating rebounds, providing valuable insights into predictive modeling for dynamic multi-agent sports scenarios. By forecasting team possession before rebounds occur, this work enables applications in real-time automated broadcasting and post-game analysis tools to support decision-making.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kotaro Funakoshi, Yaoxiong Zhu
Multimodal
ๆฅ็ๆ่ฆ
Recognition of signers' emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally we establish a stronger baseline than spoken language LLMs.
๐
2025-12-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mikel Williams-Lekuona, Georgina Cosma
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maximilian Kellner, Mariana Ferrandon Cervantes, Yuandong Pan, Ruodan Lu, Ioannis Brilakis et al.
Machine Learning
ๆฅ็ๆ่ฆ
We propose a novel dataset that has been specifically designed for 3D semantic segmentation of bridges and the domain gap analysis caused by varying sensors. This addresses a critical need in the field of infrastructure inspection and maintenance, which is essential for modern society. The dataset comprises high-resolution 3D scans of a diverse range of bridge structures from various countries, with detailed semantic labels provided for each. Our initial objective is to facilitate accurate and automated segmentation of bridge components, thereby advancing the structural health monitoring practice. To evaluate the effectiveness of existing 3D deep learning models on this novel dataset, we conduct a comprehensive analysis of three distinct state-of-the-art architectures. Furthermore, we present data acquired through diverse sensors to quantify the domain gap resulting from sensor variations. Our findings indicate that all architectures demonstrate robust performance on the specified task. However, the domain gap can potentially lead to a decline in the performance of up to 11.4% mIoU.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shiran Ge, Chenyi Huang, Yuang Ai, Qihang Fan, Huaibo Huang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate the trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF), and verify that a high-variance subset of trajectories, selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still necessitates critical computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging its efficiency, Pro-GRPO employs an "Expand-and-Prune" strategy. This strategy first expands the size of initial sampling group to maximize trajectory diversity, then it applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li et al.
Multimodal
ๆฅ็ๆ่ฆ
Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that captures both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Frรฉchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.
๐
2025-12-17
๐ ArXiv Preprint (Multimedia)
๐ฅ Fei Zhao, Mengxi Guo, Shijie Zhao, Junlin Li, Li Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
There has been a growing trend in compressing and transmitting videos from terminals for machine vision tasks. Nevertheless, most video coding optimization method focus on minimizing distortion according to human perceptual metrics, overlooking the heightened demands posed by machine vision systems. In this paper, we propose a video preprocessing framework tailored for machine vision tasks to address this challenge. The proposed method incorporates a neural preprocessor which retaining crucial information for subsequent tasks, resulting in the boosting of rate-accuracy performance. We further introduce a differentiable virtual codec to provide constraints on rate and distortion during the training stage. We directly apply widely used standard codecs for testing. Therefore, our solution can be easily applied to real-world scenarios. We conducted extensive experiments evaluating our compression method on two typical downstream tasks with various backbone networks. The experimental results indicate that our approach can save over 15% of bitrate compared to using only the standard codec anchor version.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Parvesh Saini, Soumyadipta Maiti, Beena Rai
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Capabilities and the number of vision-based models are increasing rapidly. And these vision models are now able to do more tasks like object detection, image classification, instance segmentation etc. with great accuracy. But models which can take accurate quantitative measurements form an image, as a human can do by just looking at it, are rare. For a robot to work with complete autonomy in a Laboratory environment, it needs to have some basic skills like navigation, handling objects, preparing samples etc. to match human-like capabilities in an unstructured environment. Another important capability is to read measurements from instruments and apparatus. Here, we tried to mimic a human inspired approach to read measurements from a linear scale. As a test case we have picked reading level from a syringe and a measuring cylinder. For a randomly oriented syringe we carry out transformations to correct the orientation. To make the system efficient and robust, the area of interest is reduced to just the linear scale containing part of the image. After that, a series of features were extracted like the major makers, the corresponding digits, and the level indicator location, from which the final reading was calculated. Readings obtained using this system were also compared against human read values of the same instances and an accurate correspondence was observed.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxin Jiang, Yunkang Can, Weiming Shen
ๆฅ็ๆ่ฆ
Knowledge distillation is an effective image anomaly detection and localization scheme. However, a major drawback of this scheme is its tendency to overly generalize, primarily due to the similarities between input and supervisory signals. In order to address this issue, this paper introduces a novel technique called masked reverse knowledge distillation (MRKD). By employing image-level masking (ILM) and feature-level masking (FLM), MRKD transforms the task of image reconstruction into image restoration. Specifically, ILM helps to capture global information by differentiating input signals from supervisory signals. On the other hand, FLM incorporates synthetic feature-level anomalies to ensure that the learned representations contain sufficient local information. With these two strategies, MRKD is endowed with stronger image context capture capacity and is less likely to be overgeneralized. Experiments on the widely-used MVTec anomaly detection dataset demonstrate that MRKD achieves impressive performance: image-level 98.9% AU-ROC, pixel-level 98.4% AU-ROC, and 95.3% AU-PRO. In addition, extensive ablation experiments have validated the superiority of MRKD in mitigating the overgeneralization problem.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Malihe Dahmardeh, Francesco Setti
ๆฅ็ๆ่ฆ
In this paper we propose MECAD, a novel approach for continual anomaly detection using a multi-expert architecture. Our system dynamically assigns experts to object classes based on feature similarity and employs efficient memory management to preserve the knowledge of previously seen classes. By leveraging an optimized coreset selection and a specialized replay buffer mechanism, we enable incremental learning without requiring full model retraining. Our experimental evaluation on the MVTec AD dataset demonstrates that the optimal 5-expert configuration achieves an average AUROC of 0.8259 across 15 diverse object categories while significantly reducing knowledge degradation compared to single-expert approaches. This framework balances computational efficiency, specialized knowledge retention, and adaptability, making it well-suited for industrial environments with evolving product types.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxin Jiang, Yunkang Cao, Weiming Shen
ๆฅ็ๆ่ฆ
Few-shot anomaly detection (FSAD) denotes the identification of anomalies within a target category with a limited number of normal samples. Existing FSAD methods largely rely on pre-trained feature representations to detect anomalies, but the inherent domain gap between pre-trained representations and target FSAD scenarios is often overlooked. This study proposes a Prototypical Learning Guided Context-Aware Segmentation Network (PCSNet) to address the domain gap, thereby improving feature descriptiveness in target scenarios and enhancing FSAD performance. In particular, PCSNet comprises a prototypical feature adaption (PFA) sub-network and a context-aware segmentation (CAS) sub-network. PFA extracts prototypical features as guidance to ensure better feature compactness for normal data while distinct separation from anomalies. A pixel-level disparity classification loss is also designed to make subtle anomalies more distinguishable. Then a CAS sub-network is introduced for pixel-level anomaly localization, where pseudo anomalies are exploited to facilitate the training process. Experimental results on MVTec and MPDD demonstrate the superior FSAD performance of PCSNet, with 94.9% and 80.2% image-level AUROC in an 8-shot scenario, respectively. Real-world applications on automotive plastic part inspection further demonstrate that PCSNet can achieve promising results with limited training samples. Code is available at https://github.com/yuxin-jiang/PCSNet.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Antony Jerald, Dattesh Shanbhag, Sudhanya Chatterjee
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Motion artifacts degrade MRI image quality and increase patient recalls. Existing automated quality assessment methods are largely limited to binary decisions and provide little interpretability. We introduce AutoMAC-MRI, an explainable framework for grading motion artifacts across heterogeneous MR contrasts and orientations. The approach uses supervised contrastive learning to learn a discriminative representation of motion severity. Within this feature space, we compute grade-specific affinity scores that quantify an image's proximity to each motion grade, thereby making grade assignments transparent and interpretable. We evaluate AutoMAC-MRI on more than 5000 expert-annotated brain MRI slices spanning multiple contrasts and views. Experiments assessing affinity scores against expert labels show that the scores align well with expert judgment, supporting their use as an interpretable measure of motion severity. By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI enables inline MRI quality control, with the potential to reduce unnecessary rescans and improve workflow efficiency.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenke E, Yixin Sun, Jiaxu Liu, Hubert P. H. Shum, Amir Atapour-Abarghouei et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present the first cross-modality distillation framework specifically tailored for single-panoramic-camera Bird's-Eye-View (BEV) segmentation. Our approach leverages a novel LiDAR image representation fused from range, intensity and ambient channels, together with a voxel-aligned view transformer that preserves spatial fidelity while enabling efficient BEV processing. During training, a high-capacity LiDAR and camera fusion Teacher network extracts both rich spatial and semantic features for cross-modality knowledge distillation into a lightweight Student network that relies solely on a single 360-degree panoramic camera image. Extensive experiments on the Dur360BEV dataset demonstrate that our teacher model significantly outperforms existing camera-based BEV segmentation methods, achieving a 25.6\% IoU improvement. Meanwhile, the distilled Student network attains competitive performance with an 8.5\% IoU gain and state-of-the-art inference speed of 31.2 FPS. Moreover, evaluations on KITTI-360 (two fisheye cameras) confirm that our distillation framework generalises to diverse camera setups, underscoring its feasibility and robustness. This approach reduces sensor complexity and deployment costs while providing a practical solution for efficient, low-cost BEV segmentation in real-world autonomous driving.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wangyu Wu, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao
MultimodalComputer VisionRoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Weakly Supervised Semantic Segmentation (WSSS) with image level labels aims to produce pixel level predictions without requiring dense annotations. While recent approaches have leveraged generative models to augment existing data, they remain dependent on real world training samples. In this paper, we introduce a novel direction, Zero Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and propose SynthSeg Agents, a multi agent framework driven by Large Language Models (LLMs) to generate synthetic training data entirely without real images. SynthSeg Agents comprises two key modules, a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent autonomously crafts diverse and semantically rich image prompts via iterative refinement, memory mechanisms, and prompt space exploration, guided by CLIP based similarity and nearest neighbor diversity filtering. These prompts are then passed to the Image Generation Agent, which leverages Vision Language Models (VLMs) to synthesize candidate images. A frozen CLIP scoring model is employed to select high quality samples, and a ViT based classifier is further trained to relabel the entire synthetic dataset with improved semantic precision. Our framework produces high quality training data without any real image supervision. Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg Agents achieves competitive performance without using real training images. This highlights the potential of LLM driven agents in enabling cost efficient and scalable semantic segmentation.
๐
2025-12-17
๐ ArXiv Preprint (eess.IV)
๐ฅ Mengxi Guo, Shijie Zhao, Junlin Li, Li Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Preprocessing is a well-established technique for optimizing compression, yet existing methods are predominantly Rate-Distortion (R-D) optimized and constrained by pixel-level fidelity. This work pioneers a shift towards Rate-Perception (R-P) optimization by, for the first time, adapting a large-scale pre-trained diffusion model for compression preprocessing. We propose a two-stage framework: first, we distill the multi-step Stable Diffusion 2.1 into a compact, one-step image-to-image model using Consistent Score Identity Distillation (CiD). Second, we perform a parameter-efficient fine-tuning of the distilled model's attention modules, guided by a Rate-Perception loss and a differentiable codec surrogate. Our method seamlessly integrates with standard codecs without any modification and leverages the model's powerful generative priors to enhance texture and mitigate artifacts. Experiments show substantial R-P gains, achieving up to a 30.13% BD-rate reduction in DISTS on the Kodak dataset and delivering superior subjective visual quality.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yingying Wang, Xuanhua He, Chen Wu, Jialing Huang, Suiyun Zhang et al.
Natural Language ProcessingMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuinan Hou, Jing Mi, Marco Zorzi, Lamberto Ballan, Alberto Testolin
RoboticsNatural Language ProcessingMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures. Notably, enumeration accuracy significantly improves when VLMs are prompted to generate intermediate representations (i.e., locations and verbal labels) of each object to be counted. Nevertheless, none of the models can reliably count the number of objects in complex visual scenes, showing that further research is still needed to create AI systems that can reliably deploy counting procedures in realistic environments.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yupeng Zhang, Adam G. Dunn, Usman Naseem, Jinman Kim
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model's decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, $ฮ$TPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced $ฮ$TPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi Zhang, Yulei Kang, Haoxuan Chen, Jinxuan Li, ian-Fang Hu
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Parameter-efficient fine-tuning methods have gained considerable popularity for adapting large-scale models to downstream tasks, particularly LoRA and its variants. Existing methods perform low-rank adaptation over the full parameter space. However, fine-tuning within a subspace can achieve comparable effectiveness. Inspired by the observation that pre-trained models possess non-trivial null spaces, we propose Null-space based Low-Rank Adaptation (Null-LoRA). Null-LoRA effectively reduces redundancy and enhances effective rank by freezing portions of the low-rank matrices. To further improve parameter efficiency, Null-LoRA constrains the entire incremental update within the null space, maximizing the utilization of incremental updates to adapt to new task paradigms. Null-LoRA surpasses the state of the art with fewer parameters in extensive experiments across image-text retrieval and visual question answering tasks.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiyu Zhu, Wei Wang, Xin Yuan, Xiao Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Lens flare is a common nighttime artifact caused by strong light sources scattering within camera lenses, leading to hazy streaks, halos, and glare that degrade visual quality. However, existing methods usually fail to effectively address nonuniform scattered flares, which severely reduces their applicability to complex real-world scenarios with diverse lighting conditions. To address this issue, we propose SLCFormer, a novel spectral-local context transformer framework for effective nighttime lens flare removal. SLCFormer integrates two key modules: the Frequency Fourier and Excitation Module (FFEM), which captures efficient global contextual representations in the frequency domain to model flare characteristics, and the Directionally-Enhanced Spatial Module (DESM) for local structural enhancement and directional features in the spatial domain for precise flare removal. Furthermore, we introduce a ZernikeVAE-based scatter flare generation pipeline to synthesize physically realistic scatter flares with spatially varying PSFs, bridging optical physics and data-driven training. Extensive experiments on the Flare7K++ dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches in both quantitative metrics and perceptual visual quality, and generalizing robustly to real nighttime scenes with complex flare artifacts.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Changhai Ma, Ziyu Wu, Yunkang Zhang, Qijun Ying, Boyan Liu et al.
ๆฅ็ๆ่ฆ
Reconstructing accurate 3D human meshes in the world coordinate system from in-the-wild images remains challenging due to the lack of camera rotation information. While existing methods achieve promising results in the camera coordinate system by assuming zero camera rotation, this simplification leads to significant errors when transforming the reconstructed mesh to the world coordinate system. To address this challenge, we propose Mesh-Plug, a plug-and-play module that accurately transforms human meshes from camera coordinates to world coordinates. Our key innovation lies in a human-centered approach that leverages both RGB images and depth maps rendered from the initial mesh to estimate camera rotation parameters, eliminating the dependency on environmental cues. Specifically, we first train a camera rotation prediction module that focuses on the human body's spatial configuration to estimate camera pitch angle. Then, by integrating the predicted camera parameters with the initial mesh, we design a mesh adjustment module that simultaneously refines the root joint orientation and body pose. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on the benchmark datasets SPEC-SYN and SPEC-MTP.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yufeng Xie
ๆฅ็ๆ่ฆ
Infrared and visible image fusion is a pivotal technology in low-altitude UAV reconnaissance missions, providing high-quality data support for downstream tasks such as target detection and tracking by integrating thermal saliency with background texture details.However, traditional no-reference metrics fail(Specifically,like Entropy (EN) and Average Gradient (AG)) in complex low-light environments. They often misinterpret high-frequency sensor noise as valid detail. This creates a "Noise Trap," paradoxically assigning higher scores to noisy images and misguiding fusion algorithms.To address this, we propose the Target-Background Contrast (TBC) metric. Inspired by Weber's Law, TBC focuses on the relative contrast of salient targets rather than global statistics. Unlike traditional metrics, TBC penalizes background noise and rewards target visibility. Experiments on the DroneVehicle dataset demonstrate that TBC aligns better with human perception and provides a reliable standard for low-altitude scenarios.
๐
2025-12-17
๐ Submitted at IEEE IV 2026
๐ฅ Jรถrg Gamerdinger, Sven Teufel, Stephan Amann, Lukas Marc Listl, Oliver Bringmann
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Extensive evaluation of perception systems is crucial for ensuring the safety of intelligent vehicles in complex driving scenarios. Conventional performance metrics such as precision, recall and the F1-score assess the overall detection accuracy, but they do not consider the safety-relevant aspects of perception. Consequently, perception systems that achieve high scores in these metrics may still cause misdetections that could lead to severe accidents. Therefore, it is important to evaluate not only the overall performance of perception systems, but also their safety. We therefore introduce a novel safety metric for jointly evaluating the most critical perception tasks, object and lane detection. Our proposed framework integrates a new, lightweight object safety metric that quantifies the potential risk associated with object detection errors, as well as an lane safety metric including the interdependence between both tasks that can occur in safety evaluation. The resulting combined safety score provides a unified, interpretable measure of perception safety performance. Using the DeepAccident dataset, we demonstrate that our approach identifies safety critical perception errors that conventional performance metrics fail to capture. Our findings emphasize the importance of safety-centric evaluation methods for perception systems in autonomous driving.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianan Wang, Yang Hong, Hesong Li, Tao Wang, Songrong Liu et al.
Computer Vision
ๆฅ็ๆ่ฆ
RAW images have shown superior performance than sRGB images in many image processing tasks, especially for low-light image enhancement. However, most existing methods for RAW-based low-light enhancement usually sequentially process multi-scale information, which makes it difficult to achieve lightweight models and high processing speeds. Besides, they usually ignore the green channel superiority of RAW images, and fail to achieve better reconstruction performance with good use of green channel information. In this work, we propose an efficient RAW Image Enhancement Network (ERIENet), which parallelly processes multi-scale information with efficient convolution modules, and takes advantage of rich information in green channels to guide the reconstruction of images. Firstly, we introduce an efficient multi-scale fully-parallel architecture with a novel channel-aware residual dense block to extract feature maps, which reduces computational costs and achieves real-time processing speed. Secondly, we introduce a green channel guidance branch to exploit the rich information within the green channels of the input RAW image. It increases the quality of reconstruction results with few parameters and computations. Experiments on commonly used low-light image enhancement datasets show that ERIENet outperforms state-of-the-art methods in enhancing low-light RAW images with higher effiency. It also achieves an optimal speed of over 146 frame-per-second (FPS) for 4K-resolution images on a single NVIDIA GeForce RTX 3090 with 24G memory.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sarim Hashmi, Abdelrahman Elsayed, Mohammed Talha Alam, Samuele Poppi, Nils Lukas
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Generative models can synthesize highly realistic content, so-called deepfakes, that are already being misused at scale to undermine digital media authenticity. Current deepfake detection methods are unreliable for two reasons: (i) distinguishing inauthentic content post-hoc is often impossible (e.g., with memorized samples), leading to an unbounded false positive rate (FPR); and (ii) detection lacks robustness, as adversaries can adapt to known detectors with near-perfect accuracy using minimal computational resources. To address these limitations, we propose a resynthesis framework to determine if a sample is authentic or if its authenticity can be plausibly denied. We make two key contributions focusing on the high-precision, low-recall setting against efficient (i.e., compute-restricted) adversaries. First, we demonstrate that our calibrated resynthesis method is the most reliable approach for verifying authentic samples while maintaining controllable, low FPRs. Second, we show that our method achieves adversarial robustness against efficient adversaries, whereas prior methods are easily evaded under identical compute budgets. Our approach supports multiple modalities and leverages state-of-the-art inversion techniques.
๐
2025-12-17
๐ Accepted at IEEE ICVES 2025
๐ฅ Jรถrg Gamerdinger, Sven Teufel, Stephan Amann, Oliver Bringmann
Computer Vision
ๆฅ็ๆ่ฆ
Ensuring safety is the primary objective of automated driving, which necessitates a comprehensive and accurate perception of the environment. While numerous performance evaluation metrics exist for assessing perception capabilities, incorporating safety-specific metrics is essential to reliably evaluate object detection systems. A key component for safety evaluation is the ability to distinguish between relevant and non-relevant objects - a challenge addressed by criticality or relevance metrics. This paper presents the first in-depth analysis of criticality metrics for safety evaluation of object detection systems. Through a comprehensive review of existing literature, we identify and assess a range of applicable metrics. Their effectiveness is empirically validated using the DeepAccident dataset, which features a variety of safety-critical scenarios. To enhance evaluation accuracy, we propose two novel application strategies: bidirectional criticality rating and multi-metric aggregation. Our approach demonstrates up to a 100% improvement in terms of criticality classification accuracy, highlighting its potential to significantly advance the safety evaluation of object detection systems in automated vehicles.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaixing Long, Danyi Weng, Yun Mi, Zhentai Zhang, Yanmeng Lu et al.
Multimodal
ๆฅ็ๆ่ฆ
Constructing a multi-modal automatic classification model based on three types of renal biopsy images can assist pathologists in glomerular multi-disease identification. However, the substantial scale difference between transmission electron microscopy (TEM) image features at the nanoscale and optical microscopy (OM) or immunofluorescence microscopy (IM) images at the microscale poses a challenge for existing multi-modal and multi-scale models in achieving effective feature fusion and improving classification accuracy. To address this issue, we propose a cross-modal ultra-scale learning network (CMUS-Net) for the auxiliary diagnosis of multiple glomerular diseases. CMUS-Net utilizes multiple ultrastructural information to bridge the scale difference between nanometer and micrometer images. Specifically, we introduce a sparse multi-instance learning module to aggregate features from TEM images. Furthermore, we design a cross-modal scale attention module to facilitate feature interaction, enhancing pathological semantic information. Finally, multiple loss functions are combined, allowing the model to weigh the importance among different modalities and achieve precise classification of glomerular diseases. Our method follows the conventional process of renal biopsy pathology diagnosis and, for the first time, performs automatic classification of multiple glomerular diseases including IgA nephropathy (IgAN), membranous nephropathy (MN), and lupus nephritis (LN) based on images from three modalities and two scales. On an in-house dataset, CMUS-Net achieves an ACC of 95.37+/-2.41%, an AUC of 99.05+/-0.53%, and an F1-score of 95.32+/-2.41%. Extensive experiments demonstrate that CMUS-Net outperforms other well-known multi-modal or multi-scale methods and show its generalization capability in staging MN. Code is available at https://github.com/SMU-GL-Group/MultiModal_lkx/tree/main.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu et al.
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for "thinking with images" (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mengshi Qi, Yeteng Wu, Xianlin Zhang, Huadong Ma
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Evaluating whether human action is standard or not and providing reasonable feedback to improve action standardization is very crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what and where the action is, which is unable to meet the requirements. Meanwhile, most of the existing datasets lack the labels indicating the degree of action standardization, and the action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task, and introduce a new diverse dataset CoT-AFA, which contains a large scale of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process--from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method has achieved improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy) and quality assessment (+2.1% in accuracy), revealing great potential of CoT-AFA for future studies. Our dataset and source code is available at https://github.com/MICLAB-BUPT/EFA.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengxiao Zhou, Chenghua Li, Jianhao Huang, Qinghao Hu, Yifan Zhang
ๆฅ็ๆ่ฆ
Reference-guided instance editing is fundamentally limited by semantic entanglement, where a reference's intrinsic appearance is intertwined with its extrinsic attributes. The key challenge lies in disentangling what information should be borrowed from the reference, and determining how to apply it appropriately to the target. To tackle this challenge, we propose GENIE, a Generalizable Instance Editing framework capable of achieving explicit disentanglement. GENIE first corrects spatial misalignments with a Spatial Alignment Module (SAM). Then, an Adaptive Residual Scaling Module (ARSM) learns what to borrow by amplifying salient intrinsic cues while suppressing extrinsic attributes, while a Progressive Attention Fusion (PAF) mechanism learns how to render this appearance onto the target, preserving its structure. Extensive experiments on the challenging AnyInsertion dataset demonstrate that GENIE achieves state-of-the-art fidelity and robustness, setting a new standard for disentanglement-based instance editing.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yupeng Zhu, Xiongzhen Zhang, Ye Chen, Bingbing Ni
Machine LearningRobotics
ๆฅ็ๆ่ฆ
3D animation is central to modern visual media, yet traditional production pipelines remain labor-intensive, expertise-demanding, and computationally expensive. Recent AIGC-based approaches partially automate asset creation and rigging, but they either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. We focus on single-image 3D animation generation and argue that progress is fundamentally constrained by a trade-off between rendering quality and 3D control.
To address this limitation, we propose a lightweight 3D animation framework that decouples geometric control from appearance synthesis. The core idea is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This proxy formulation enables 3D-aware motion control and interaction comparable to classical pipelines, without requiring accurate geometry or expensive optimization, and naturally extends to coherent background animation. Extensive experiments demonstrate that our method achieves efficient animation generation on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control it offers to users.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Dongmyeong Lee, Jesse Quattrociocchi, Christian Ellis, Rwik Rana, Amanda Adkins et al.
ๆฅ็ๆ่ฆ
We propose BEV-Patch-PF, a GPS-free sequential geo-localization system that integrates a particle filter with learned bird's-eye-view (BEV) and aerial feature maps. From onboard RGB and depth images, we construct a BEV feature map. For each 3-DoF particle pose hypothesis, we crop the corresponding patch from an aerial feature map computed from a local aerial image queried around the approximate location. BEV-Patch-PF computes a per-particle log-likelihood by matching the BEV feature to the aerial patch feature. On two real-world off-road datasets, our method achieves 7.5x lower absolute trajectory error (ATE) on seen routes and 7.0x lower ATE on unseen routes than a retrieval-based baseline, while maintaining accuracy under dense canopy and shadow. The system runs in real time at 10 Hz on an NVIDIA Tesla T4, enabling practical robot deployment.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jialong Zuo, Haoyou Deng, Hanyu Zhou, Jiaxin Zhu, Yicheng Zhang et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
The rapid evolution of text-to-image generation models has revolutionized visual content creation. While commercial products like Nano Banana Pro have garnered significant attention, their potential as generalist solvers for traditional low-level vision challenges remains largely underexplored. In this study, we investigate the critical question: Is Nano Banana Pro a Low-Level Vision All-Rounder? We conducted a comprehensive zero-shot evaluation across 14 distinct low-level tasks spanning 40 diverse datasets. By utilizing simple textual prompts without fine-tuning, we benchmarked Nano Banana Pro against state-of-the-art specialist models. Our extensive analysis reveals a distinct performance dichotomy: while \textbf{Nano Banana Pro demonstrates superior subjective visual quality}, often hallucinating plausible high-frequency details that surpass specialist models, it lags behind in traditional reference-based quantitative metrics. We attribute this discrepancy to the inherent stochasticity of generative models, which struggle to maintain the strict pixel-level consistency required by conventional metrics. This report identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while highlighting that achieving the high fidelity of domain specialists remains a significant hurdle.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xi Fang, Haoyi Tao, Shuwen Yang, Suyang Zhong, Haocheng Lu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyu Shang, Haoran Liu, Rongchao Zhang, Zhiqian Wei, Tongtong Feng
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Generating consistent human images with controllable pose and appearance is essential for applications in virtual try on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at https://github.com/ZANMANGLOOPYE/PMMD.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenxiao Zhang, Runshi Zhang, Junchen Wang
ๆฅ็ๆ่ฆ
Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundary, which may lead to small object losses and increase boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category, detailed information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long video. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests conducted on four ultrasound video datasets (two thyroid nodule, the thyroid gland, the heart datasets) compared with the state-of-the-art methods, our method demonstrates marked improvements in segmentation metrics. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long video. The code is available at https://github.com/XiAooZ/MWNet.
๐
2025-12-17
๐ ArXiv Preprint (eess.IV)
๐ฅ Pandega Abyan Zumarsyah, Igi Ardiyanto, Hanung Adi Nugroho
ๆฅ็ๆ่ฆ
This study develops meta-learners for few-shot weakly-supervised segmentation (FWS) to address the challenge of optic disc (OD) and optic cup (OC) segmentation for glaucoma diagnosis with limited labeled fundus images. We significantly improve existing meta-learners by introducing Omni meta-training which balances data usage and diversifies the number of shots. We also develop their efficient versions that reduce computational costs. In addition, we develop sparsification techniques that generate more customizable and representative scribbles and other sparse labels. After evaluating multiple datasets, we find that Omni and efficient versions outperform the original versions, with the best meta-learner being Efficient Omni ProtoSeg (EO-ProtoSeg). It achieves intersection over union (IoU) scores of 88.15% for OD and 71.17% for OC on the REFUGE dataset using just one sparsely labeled image, outperforming few-shot and semi-supervised methods which require more labeled images. Its best performance reaches 86.80% for OD and 71.78%for OC on DRISHTIGS, 88.21% for OD and 73.70% for OC on REFUGE, 80.39% for OD and 52.65% for OC on REFUGE. EO-ProtoSeg is comparable to unsupervised domain adaptation methods yet much lighter with less than two million parameters and does not require any retraining.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifei Bian, Banglei Guan, Zibin Liu, Ang Su, Shiyao Zhu et al.
ๆฅ็ๆ่ฆ
Large-scale structures suffer high-frequency deformations due to complex loads. However, harsh lighting conditions and high equipment costs limit measurement methods based on traditional high-speed cameras. This paper proposes a method to measure high-frequency deformations by exploiting an event camera and LED markers. Firstly, observation noise is filtered based on the characteristics of the event stream generated by LED markers blinking and spatiotemporal correlation. Then, LED markers are extracted from the event stream after differentiating between motion-induced events and events from LED blinking, which enables the extraction of high-speed moving LED markers. Ultimately, high-frequency planar deformations are measured by a monocular event camera. Experimental results confirm the accuracy of our method in measuring high-frequency planar deformations.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaizhe Zhang, Shinan Chen, Qian Zhao, Weizhan Zhang, Caixia Yan et al.
ๆฅ็ๆ่ฆ
Scenes reconstructed by 3D Gaussian Splatting (3DGS) trained on low-resolution (LR) images are unsuitable for high-resolution (HR) rendering. Consequently, a 3DGS super-resolution (SR) method is needed to bridge LR inputs and HR rendering. Early 3DGS SR methods rely on single-image SR networks, which lack cross-view consistency and fail to fuse complementary information across views. More recent video-based SR approaches attempt to address this limitation but require strictly sequential frames, limiting their applicability to unstructured multi-view datasets. In this work, we introduce Multi-View Consistent 3D Gaussian Splatting Super-Resolution (MVGSR), a framework that focuses on integrating multi-view information for 3DGS rendering with high-frequency details and enhanced consistency. We first propose an Auxiliary View Selection Method based on camera poses, making our method adaptable for arbitrarily organized multi-view datasets without the need of temporal continuity or data reordering. Furthermore, we introduce, for the first time, an epipolar-constrained multi-view attention mechanism into 3DGS SR, which serves as the core of our proposed multi-view SR network. This design enables the model to selectively aggregate consistent information from auxiliary views, enhancing the geometric consistency and detail fidelity of 3DGS representations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both object-centric and scene-level 3DGS SR benchmarks.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Yunheng Wang, Yixiao Feng, Yuetong Fang, Shuning Zhang, Tan Jing et al.
Robotics
ๆฅ็ๆ่ฆ
3D Scene Graphs (3DSGs) constitute a powerful representation of the physical world, distinguished by their abilities to explicitly model the complex spatial, semantic, and functional relationships between entities, rendering a foundational understanding that enables agents to interact intelligently with their environment and execute versatile behaviors. Embodied navigation, as a crucial component of such capabilities, leverages the compact and expressive nature of 3DSGs to enable long-horizon reasoning and planning in complex, large-scale environments. However, prior works rely on a static-world assumption, defining traversable space solely based on static spatial layouts and thereby treating interactable obstacles as non-traversable. This fundamental limitation severely undermines their effectiveness in real-world scenarios, leading to limited reachability, low efficiency, and inferior extensibility. To address these issues, we propose HERO, a novel framework for constructing Hierarchical Traversable 3DSGs, that redefines traversability by modeling operable obstacles as pathways, capturing their physical interactivity, functional semantics, and the scene's relational hierarchy. The results show that, relative to its baseline, HERO reduces PL by 35.1% in partially obstructed environments and increases SR by 79.4% in fully obstructed ones, demonstrating substantially higher efficiency and reachability.
๐
2025-12-17
๐ ArXiv Preprint (eess.IV)
๐ฅ Nalini M. Singh, Tiffany Chien, Arthur R. C. McCray, Colin Ophus, Laura Waller
ๆฅ็ๆ่ฆ
Atomic electron tomography (AET) enables the determination of 3D atomic structures by acquiring a sequence of 2D tomographic projection measurements of a particle and then computationally solving for its underlying 3D representation. Classical tomography algorithms solve for an intermediate volumetric representation that is post-processed into the atomic structure of interest. In this paper, we reformulate the tomographic inverse problem to solve directly for the locations and properties of individual atoms. We parameterize an atomic structure as a collection of Gaussians, whose positions and properties are learnable. This representation imparts a strong physical prior on the learned structure, which we show yields improved robustness to real-world imaging artifacts. Simulated experiments and a proof-of-concept result on experimentally-acquired data confirm our method's potential for practical applications in materials characterization and analysis with Transmission Electron Microscopy (TEM). Our code is available at https://github.com/nalinimsingh/gaussian-atoms.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunseong Nam, Jiwon Jang, Dongkyu Won, Sang Hyun Park, Soopil Kim
Machine LearningNatural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Preference optimization offers a scalable supervision paradigm based on relative preference signals, yet prior attempts in medical image segmentation remain model-specific and rely on low-diversity prediction sampling. In this paper, we propose MAPO (Model-Agnostic Preference Optimization), a training framework that utilizes Dropout-driven stochastic segmentation hypotheses to construct preference-consistent gradients without direct ground-truth supervision. MAPO is fully architecture- and dimensionality-agnostic, supporting 2D/3D CNN and Transformer-based segmentation pipelines. Comprehensive evaluations across diverse medical datasets reveal that MAPO consistently enhances boundary adherence, reduces overfitting, and yields more stable optimization dynamics compared to conventional supervised training.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huaying Zhang, Atsushi Hashimoto, Tosho Hirasawa
Natural Language Processing
ๆฅ็ๆ่ฆ
Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a topic for video question answering (VideoQA), where questions are generated for given answers. Their evaluation typically focuses on the ability to answer questions, rather than the quality of generated questions. In contrast, we focus on the question quality in eliciting unseen knowledge from human experts. For a continuous improvement of VQG models, we propose a protocol that evaluates the ability by simulating question-answering communication with experts using a question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D's expert commentary annotation. The EgoExoAsk training set is used to obtain the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate our metric reasonably aligns with question generation settings: models accessing richer context are evaluated better, supporting that our protocol works as intended. The EgoExoAsk dataset is available in https://github.com/omron-sinicx/VQG4ExpertKnowledge .
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sibi Parivendan, Kashfia Sailunaz, Suresh Neethirajan
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maria Bulychev, Neil G. Marchant, Benjamin I. P. Rubinstein
Robotics
ๆฅ็ๆ่ฆ
Recent advances in generative AI have enabled the creation of highly realistic digital content, raising concerns around authenticity, ownership, and misuse. While watermarking has become an increasingly important mechanism to trace and protect digital media, most existing image watermarking schemes operate as black boxes, producing global detection scores without offering any insight into how or where the watermark is present. This lack of transparency impacts user trust and makes it difficult to interpret the impact of tampering. In this paper, we present a post-hoc image watermarking method that combines localised embedding with region-level interpretability. Our approach embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This allows us to generate detection maps that reveal which regions of an image are likely watermarked or altered. We show that our method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. At the same time, the watermark remains highly imperceptible. Compared to prior post-hoc methods, our approach offers more interpretable detection while retaining competitive robustness. For example, our watermarks are robust to cropping up to half the image.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt
Machine Learning
ๆฅ็ๆ่ฆ
Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Adam Karvonen, James Chua, Clรฉment Dumas, Kit Fraser-Taliente, Subhash Kantamneni et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chase Walker, Rickard Ewetz
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two properties-causality and row stochasticity. The attribution graph allows context attributions to be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, achieving average gains of up to 40%.
๐
2025-12-17
๐ 10 pages, 3 figures, 2025 IEEE International Confe
๐ฅ Xiaodi Li, Dingcheng Li, Rujun Gao, Mahmoud Zamani, Feng Mi et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a stream of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, where performance on earlier tasks degrades as new tasks are learned. In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct, task-specific parameters for each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model's ability to retain past knowledge while adapting to new tasks. Experimental results on diverse NLP benchmarks demonstrate that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution to mitigate catastrophic forgetting.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tamanna Hossain, Robert L. Logan, Ganesh Jagadeesan, Sameer Singh, Joel Tetreault et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
State space models (SSMs) are a promising alternative to transformers for language modeling because they use fixed memory during inference. However, this fixed memory usage requires some information loss in the hidden state when processing long sequences. While prior work has studied the sequence length at which this information loss occurs, it does not characterize the types of information SSM language models (LMs) tend to forget. In this paper, we address this knowledge gap by identifying the types of tokens (e.g., parts of speech, named entities) and sequences (e.g., code, math problems) that are more frequently forgotten by SSM LMs. We achieve this by training an auto-encoder to reconstruct sequences from the SSM's hidden state, and measure information loss by comparing inputs with their reconstructions. We perform experiments using the Mamba family of SSM LMs (130M--1.4B) on sequences ranging from 4--256 tokens. Our results show significantly higher rates of information loss on math-related tokens (e.g., numbers, variables), mentions of organization entities, and alternative dialects to Standard American English. We then examine the frequency that these tokens appear in Mamba's pretraining data and find that less prevalent tokens tend to be the ones Mamba is most likely to forget. By identifying these patterns, our work provides clear direction for future research to develop methods that better control Mamba's ability to retain important information.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios.We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context.This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kester Clegg, Richard Hawkins, Ibrahim Habli, Tom Lawton
Natural Language Processing
ๆฅ็ๆ่ฆ
LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. For example, triaging post-operative care to patients based on hospital referral letters, or updating site access schedules in nuclear facilities for work crews. If we want to introduce LLMs into critical information flows that were previously performed by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in LLM processes, particularly in frameworks that employ LLM-as-Judges (LaJ) evaluators. This paper argues that although we cannot get deterministic evaluations from many natural language processing tasks, by adopting a basket of weighted metrics it may be possible to lower the risk of errors within an evaluation, use context sensitivity to define error severity and design confidence thresholds that trigger human review of critical LaJ judgments when concordance across evaluators is low.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hongbin Na, Zimu Wang, Zhaoming Chen, Peilin Zhou, Yining Hua et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Psychological defenses are strategies, often automatic, that people use to manage distress. Rigid or overuse of defenses is negatively linked to mental health and shapes what speakers disclose and how they accept or resist help. However, defenses are complex and difficult to reliably measure, particularly in clinical dialogues. We introduce PsyDefConv, a dialogue corpus with help seeker utterances labeled for defense level, and DMRS Co-Pilot, a four-stage pipeline that provides evidence-based pre-annotations. The corpus contains 200 dialogues and 4709 utterances, including 2336 help seeker turns, with labeling and Cohen's kappa 0.639. In a counterbalanced study, the co-pilot reduced average annotation time by 22.4%. In expert review, it averaged 4.62 for evidence, 4.44 for clinical plausibility, and 4.40 for insight on a seven-point scale. Benchmarks with strong language models in zero-shot and fine-tuning settings demonstrate clear headroom, with the best macro F1-score around 30% and a tendency to overpredict mature defenses. Corpus analyses confirm that mature defenses are most common and reveal emotion-specific deviations. We will release the corpus, annotations, code, and prompts to support research on defensive functioning in language.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary - while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1\% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lifeng Han, Gareth J. F. Jones, Alan F. Smeaton
Natural Language Processing
ๆฅ็ๆ่ฆ
Word meaning, representation, and interpretation play fundamental roles in natural language understanding (NLU), natural language processing (NLP), and natural language generation (NLG) tasks. Many of the inherent difficulties in these tasks stem from Multi-word Expressions (MWEs), which complicate the tasks by introducing ambiguity, idiomatic expressions, infrequent usage, and a wide range of variations. Significant effort and substantial progress have been made in addressing the challenging nature of MWEs in Western languages, particularly English. This progress is attributed in part to the well-established research communities and the abundant availability of computational resources. However, the same level of progress is not true for language families such as Chinese and closely related Asian languages, which continue to lag behind in this regard. While sub-word modelling has been successfully applied to many Western languages to address rare words improving phrase comprehension, and enhancing machine translation (MT) through techniques like byte-pair encoding (BPE), it cannot be applied directly to ideograph language scripts like Chinese. In this work, we conduct a systematic study of the Chinese character decomposition technology in the context of MWE-aware neural machine translation (NMT). Furthermore, we report experiments to examine how Chinese character decomposition technology contributes to the representation of the original meanings of Chinese words and characters, and how it can effectively address the challenges of translating MWEs.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dakota Ellis, Samy Bakikerali, Wanshan Chen, Bao Dinh, Uyen Le
ๆฅ็ๆ่ฆ
Traditional linguists have proposed the use of a General Service List (GSL) to assist new language learners in identifying the most important words in English. This process requires linguistic expertise, subjective input, and a considerable amount of time. We attempt to create our own GSL and evaluate its practicality against the industry standard (The NGSL). We found creating a Specialized Word List (SWL), or a word list specific to a subset of the overall corpus, to be the most practical way for language-learners to optimize the process. The SWL's that we created using our model outperformed the industry standard, reaching the 95% coverage required for language comprehension with fewer words comparatively. By restricting the SWL process to objective criteria only, it can be automated, scaled, and tailored to the needs of language-learners across the globe.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Peter Dekker, Heikki Rasilo, Bart de Boer
Machine Learning
ๆฅ็ๆ่ฆ
The concept of inflection classes is an abstraction used by linguists, and provides a means to describe patterns in languages that give an analogical base for deducing previously unencountered forms. This ability is an important part of morphological acquisition and processing. We study the learnability of a system of verbal inflection classes by the individual language user by performing unsupervised clustering of lexemes into inflection classes. As a cognitively plausible and interpretable computational model, we use Adaptive Resonance Theory, a neural network with a parameter that determines the degree of generalisation (vigilance). The model is applied to Latin, Portuguese and Estonian. The similarity of clustering to attested inflection classes varies depending on the complexity of the inflectional system. We find the best performance in a narrow region of the generalisation parameter. The learned features extracted from the model show similarity with linguistic descriptions of the inflection classes. The proposed model could be used to study change in inflection classes in the future, by including it in an agent-based model.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kuan Lu, Shuhang Lin, Sai Wu, Yichen Yao, Junhan Yang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection struggle with trade-offs: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTKVR, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTKVR leverages a key observation: query vectors adjacent in position exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTKVR employs a two-stage retrieval strategy: lightweight centroids are precomputed during prefilling for centroid-grained indexing, followed by token-level refinement for precise KV retrieval. This approach balances retrieval efficiency and accuracy. To further enhance performance, we implement an optimized system for indexing construction and search using CPU-GPU co-execution. Experimentally, CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation. Meanwhile, CTKVR delivers 3 times and 4 times throughput speedups on Llama-3-8B and Yi-9B at 96K context length across diverse GPU hardware.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Md. Samiul Alim, Mahir Shahriar Tamim, Maisha Rahman, Tanvir Ahmed Khan, Md Mushfique Anwar
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Sentiment analysis, an emerging research area within natural language processing (NLP), has primarily been explored in contexts like elections and social media trends, but there remains a significant gap in understanding emotional dynamics during civil unrest, particularly in the Bangla language. Our study pioneers sentiment analysis in Bangla during a national crisis by examining public emotions amid Bangladesh's 2024 mass uprising. We curated a unique dataset of 2,028 annotated news headlines from major Facebook news portals, classifying them into Outrage, Hope, and Despair. Through Latent Dirichlet Allocation (LDA), we identified prevalent themes like political corruption and public protests, and analyzed how events such as internet blackouts shaped sentiment patterns. It outperformed multilingual transformers (mBERT: 67%, XLM-RoBERTa: 71%) and traditional machine learning methods (SVM and Logistic Regression: both 70%). These results highlight the effectiveness of language-specific models and offer valuable insights into public sentiment during political turmoil.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Taichi Aida, Mamoru Komachi, Toshinobu Ogiso, Hiroya Takamura, Daichi Mochihashi
ๆฅ็ๆ่ฆ
Understanding the temporal evolution of sets of vectors is a fundamental challenge across various domains, including ecology, crime analysis, and linguistics. For instance, ecosystem structures evolve due to interactions among plants, herbivores, and carnivores; the spatial distribution of crimes shifts in response to societal changes; and word embedding vectors reflect cultural and semantic trends over time. However, analyzing such time-varying sets of vectors is challenging due to their complicated structures, which also evolve over time. In this work, we propose a novel method for modeling the distribution underlying each set of vectors using infinite-dimensional Gaussian processes. By approximating the latent function in the Gaussian process with Random Fourier Features, we obtain compact and comparable vector representations over time. This enables us to track and visualize temporal transitions of vector sets in a low-dimensional space. We apply our method to both sociological data (crime distributions) and linguistic data (word embeddings), demonstrating its effectiveness in capturing temporal dynamics. Our results show that the proposed approach provides interpretable and robust representations, offering a powerful framework for analyzing structural changes in temporally indexed vector sets across diverse domains.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Run-ze Hu, Yang Yang, Yi-hang Yang, Jing-qi Kong, Jia-hui Luo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Background: Motivational interviewing (MI) is an effective counseling approach for promoting health behavior change, but its impact is constrained by the need for highly trained human counselors. Objective: This study aimed to explore a scalable alternative by developing and evaluating Large Language Models for Motivational Interviewing (MI-LLMs). Methods: We first curated five Chinese psychological counseling corpora and, using GPT-4 with an MI-informed prompt, transcribed multi-turn dialogues from the two highest-quality datasets (CPsyCounD and PsyDTCorpus) into 2,040 MI-style counseling conversations, of which 2,000 were used for training and 40 for testing. Three Chinese-capable open-source LLMs (Baichuan2-7B-Chat, ChatGLM-4-9B-Chat and Llama-3-8B-Chinese-Chat-v2) were fine-tuned on this corpus and were named as MI-LLMs. We evaluated MI-LLMs using round-based automatic metrics and expert manual coding with the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Results: Across all three models, fine-tuning substantially improved BLEU-4 and ROUGE scores compared with the base models, and manual coding showed that MI-LLMs achieved technical and relational global scores, and MI-adherent ratios that approached those of real MI dialogues, although complex reflections and reflection-to-question ratios remained less frequent. Conclusions: These findings provide initial evidence that MI-oriented fine-tuning can endow general-purpose LLMs with core MI-consistent counseling behaviors, suggesting a scalable pathway toward AI-assisted health behavior change support while underscoring the need for further work on data scale, complex MI skills and real-world intervention trials.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lev Kharlashkin, Eiaki Morooka, Yehor Tereshchenko, Mika Hรคmรคlรคinen
ๆฅ็ๆ่ฆ
ORACLE turns daily news into week-over-week, decision-ready insights for one of the Finnish University of Applied Sciences. The platform crawls and versions news, applies University-specific relevance filtering, embeds content, classifies items into PESTEL dimensions and builds a concise Time-Dependent Recursive Summary Graph (TRSG): two clustering layers summarized by an LLM and recomputed weekly. A lightweight change detector highlights what is new, removed or changed, then groups differences into themes for PESTEL-aware analysis. We detail the pipeline, discuss concrete design choices that make the system stable in production and present a curriculum-intelligence use case with an evaluation plan.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kotaro Funakoshi, Yaoxiong Zhu
Multimodal
ๆฅ็ๆ่ฆ
Recognition of signers' emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally we establish a stronger baseline than spoken language LLMs.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhengyi Zhao, Shubo Zhang, Yuxi Zhang, Huimin Wang, Binyang Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have shown impressive capabilities in complex reasoning tasks. However, current approaches employ uniform language density for both intermediate reasoning and final answers, leading to computational inefficiency. Our observation found that reasoning process serves a computational function for the model itself, while answering serves a communicative function for human understanding. This distinction enables the use of compressed, symbol-rich language for intermediate computations while maintaining human-readable final explanations. To address this inefficiency, we present Denser: \underline{D}ual-d\underline{ens}ity inf\underline{er}ence, a novel framework that optimizes information density separately for reasoning and answering phases. Our framework implements this through three components: a query processing module that analyzes input problems, a high-density compressed reasoning mechanism for efficient intermediate computations, and an answer generation component that translates compressed reasoning into human-readable solutions. Experimental evaluation across multiple reasoning question answering benchmarks demonstrates that Denser reduces token consumption by up to 62\% compared to standard Chain-of-Thought methods while preserving or improving accuracy. These efficiency gains are particularly significant for complex multi-step reasoning problems where traditional methods generate extensive explanations.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joao Queiroz
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study 'Adversarial poetry as a universal single-turn jailbreak mechanism in large language models' demonstrates that instructions routinely refused in prose become executable when rewritten as verse, producing up to 18 x more safety failures in benchmarks derived from MLCommons AILuminate. Manually written poems reach approximately 62% ASR, and automated versions 43%, with some models surpassing 90% success in single-turn interactions. The effect is structural: systems trained with RLHF, constitutional AI, and hybrid pipelines exhibit consistent degradation under minimal semiotic formal variation. Versification displaces the prompt into sparsely supervised latent regions, revealing guardrails that are excessively dependent on surface patterns. This dissociation between apparent robustness and real vulnerability exposes deep limitations in current alignment regimes. The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, a rich metric-prosodic tradition, and over 250 million speakers, constitutes a critical gap. Experimental protocols must parameterise scansion, metre, and prosodic variation to test vulnerabilities specific to Lusophone patterns, which are currently ignored.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ayman Eddakrouri, Amani Ramadan
ๆฅ็ๆ่ฆ
This study applies Brookes' Measure of Categorical Dispersion (ฮ) to analyze the thematic structure of contemporary Arabic Applied Linguistics research. Using a comprehensive, real-world dataset of 1,564 publications from 2019 to 2025, classified into eight core sub-disciplines, we calculate a dispersion index of ฮ = 0.194. This remarkably low value indicates extreme thematic dispersion, revealing that the field is characterized by pronounced heterogeneity rather than concentration. The analysis identifies Computational Linguistics as a dominant but non-hegemonic force, coexisting with robust research in Sociolinguistics, Language Teaching, and other subfields. This study clarifies the correct application of Brookes' original formula, demonstrates its utility for field characterization, and provides a replicable bibliometric methodology for assessing disciplinary structure across domains.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Charan Prakash Rathore, Saumi Ray, Dhruv Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90\% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65\% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79\% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xiaotian Zhang, Yuan Wang, Ruizhe Chen, Zeya Wang, Runchen Hou et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The deployment of Large Language Models (LLMs) in interactive systems necessitates a deep alignment with the nuanced and dynamic preferences of individual users. Current alignment techniques predominantly address universal human values or static, single-turn preferences, thereby failing to address the critical needs of long-term personalization and the initial user cold-start problem. To bridge this gap, we propose PersonalAgent, a novel user-centric lifelong agent designed to continuously infer and adapt to user preferences. PersonalAgent constructs and dynamically refines a unified user profile by decomposing dialogues into single-turn interactions, framing preference inference as a sequential decision-making task. Experiments show that PersonalAgent achieves superior performance over strong prompt-based and policy optimization baselines, not only in idealized but also in noisy conversational contexts, while preserving cross-session preference consistency. Furthermore, human evaluation confirms that PersonalAgent excels at capturing user preferences naturally and coherently. Our findings underscore the importance of lifelong personalization for developing more inclusive and adaptive conversational agents. Our code is available here.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Seok-Hyun Ga, Chun-Yen Chang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that "Perception Errors" were dominant, highlighting a "Perception-Cognition Gap" where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. Furthermore, models demonstrated a "Calculation-Conceptualization Discrepancy," successfully performing calculations while failing to apply the underlying scientific concepts, and "Process Hallucination," where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing "AI-resistant questions" that target these specific cognitive vulnerabilities. By exploiting AI's weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.
๐
2025-12-17
๐ AAAI 2026
๐ฅ Yiliu Sun, Zicheng Zhao, Yang Wei, Yanfang Zhang, Chen Gong
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. To improve the learning effectiveness of LLMs on how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence, and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with the accuracy improvements of 18.02% on only 26.17% training tokens.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Piyawoot Songsiritat
ๆฅ็ๆ่ฆ
We introduce SynGP500, a clinician-curated collection of 500 synthetic Australian general practice medical notes. The dataset integrates curriculum-based clinical breadth (RACGP 2022 Curriculum), epidemiologically-calibrated prevalence (BEACH study), and diverse consultation contexts. This approach systematically includes both common presentations and less-common curriculum-specified conditions that GPs must recognize but appear infrequently in single practice populations, potentially supporting more generalizable model training than datasets constrained by naturally occurring case distributions. SynGP500 is messy by design, reflecting the authentic complexity of healthcare delivery: telegraphic documentation, typos, patient non-adherence, socioeconomic barriers, and clinician-patient disagreements, unlike sanitized synthetic datasets that obscure clinical realities. Multi-faceted validation demonstrates dataset quality through epidemiological alignment with real Australian GP consultation patterns (BEACH study), stylometric analysis confirming high linguistic variation, semantic diversity analysis demonstrating broad coverage, and exploratory downstream evaluation using self-supervised medical concept extraction, showing F1 improvements. SynGP500 addresses a critical national gap, providing researchers and educators with a resource for developing and evaluating clinical NLP methods for Australian general practice while inherently protecting patient privacy.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Maria Becker, Mirko Sommer, Lars Tapken, Yi Wan Teh, Bruno Brocai
Natural Language Processing
ๆฅ็ๆ่ฆ
Moralizations - arguments that invoke moral values to justify demands or positions - are a yet underexplored form of persuasive communication. We present the Moralization Corpus, a novel multi-genre dataset designed to analyze how moral values are strategically used in argumentative discourse. Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems. We develop a frame-based annotation scheme that captures the constitutive elements of moralizations - moral values, demands, and discourse protagonists - and apply it to a diverse set of German texts, including political debates, news articles, and online discussions. The corpus enables fine-grained analysis of moralizing language across communicative formats and domains. We further evaluate several large language models (LLMs) under varied prompting conditions for the task of moralization detection and moralization component extraction and compare it to human annotations in order to investigate the challenges of automatic and manual analysis of moralizations. Results show that detailed prompt instructions has a greater effect than few-shot or explanation-based prompting, and that moralization remains a highly subjective and context-sensitive task. We release all data, annotation guidelines, and code to foster future interdisciplinary research on moral discourse and moral reasoning in NLP.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Claudio Savelli, Moreno La Quatra, Alkis Koudounas, Flavio Giobergia
Robotics
ๆฅ็ๆ่ฆ
LLMs trained on web-scale data raise concerns about privacy and the right to be forgotten. To address these issues, Machine Unlearning provides techniques to remove specific information from trained models without retraining from scratch. However, existing benchmarks for evaluating unlearning in LLMs face two major limitations: they focus only on English and support only entity-level forgetting (removing all information about a person). We introduce FAME (Fictional Actors for Multilingual Erasure), a synthetic benchmark for evaluating Machine Unlearning across five languages: English, French, German, Italian, and Spanish. FAME contains 1,000 fictional actor biographies and 20,000 question-answer pairs. Each biography includes information on 20 topics organized into structured categories (biography, career, achievements, personal information). This design enables both entity-level unlearning (i.e., forgetting entire identities) and instance-level unlearning (i.e., forgetting specific facts while retaining others). We provide two dataset splits to support these two different unlearning scenarios and enable systematic comparison of unlearning techniques across languages. Since FAME uses entirely fictional data, it ensures that the information was never encountered during model pretraining, allowing for a controlled evaluation of unlearning methods.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yash Bhaskar, Parameswari Krishnamurthy
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024 (Pakray et al., 2024), focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 (Xue et al., 2020) and IndicBart (Dabre et al., 2021) in both multilingual and monolingual settings, LoRA (Hu et al., 2021) fine-tuning IndicTrans2 (Gala et al., 2023), zero-shot and few-shot prompting (Brown, 2020) with large language models (LLMs) like Llama 3 (Dubey et al., 2024) and Mixtral 8x7b (Jiang et al., 2024), LoRA supervised fine-tuning of Llama 3 (Mecklenburg et al., 2024), and training Transformer models (Vaswani, 2017) from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task test data using SacreBLEU (Post, 2018) and CHRF (Popovic, 2015), highlighting the challenges of low-resource translation and the potential of LLMs for these tasks, particularly with fine-tuning.
๐
2025-12-17
๐ AAAI 2026
๐ฅ Chao Zhang, Minghan Li, Tianrui Lv, Guodong Zhou
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it replaces the rigid hop-count selector with a relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct "brother" relations, 2-hop for indirect "father-son" chains), formalized via a relation mask. Second, it introduces a few-shot in-context learning path guidance mechanism with CoT (think) that constructs examples in a "question-paths-answer" format to enhance LLMs' ability to understand reasoning paths. Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 pp (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and the path prompt are complementary, jointly transforming KG evidence into more faithful answers.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mika Hรคmรคlรคinen, Kimmo Leiviskรค
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper explores the growing epistemic parallel between NLG evaluation and grading of students in a Finnish University. We argue that both domains are experiencing a Great Misalignment Problem. As students increasingly use tools like ChatGPT to produce sophisticated outputs, traditional assessment methods that focus on final products rather than learning processes have lost their validity. To address this, we introduce the Pedagogical Multi-Factor Assessment (P-MFA) model, a process-based, multi-evidence framework inspired by the logic of multi-factor authentication.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP's openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuki Nakayama, Koki Hikichi, Yun Ching Liu, Yu Hirate
ๆฅ็ๆ่ฆ
This paper presents a large-scale corpus of Rakuten Travel Reviews. Our collection contains 7.3 million customer reviews for 16 years, ranging from 2009 to 2024. Each record in the dataset contains the review text, its response from an accommodation, an anonymized reviewer ID, review date, accommodation ID, plan ID, plan title, room type, room name, purpose, accompanying group, and user ratings from different aspect categories, as well as an overall score. We present statistical information about our corpus and provide insights into factors driving data drift between 2019 and 2024 using statistical approaches.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving reasoning ability of large language models (LLMs). However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting the overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework integrating model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE integrates the proposed step-wise confidence into pseudo label deduction, prioritizing high-quality reasoning paths over simple frequency count. Furthermore, it dynamically partitions the candidate outputs pool into independent subgroups by balancing reasoning quality against exploration diversity. By deriving local consensus via repeat sampling for each sub group, SCOPE provides diverse supervision targets to encourage broader exploration. We conduct experiments across various models and benchmarks, experimental results show that SCOPE consistently outperforms recent baselines. Notably, SCOPE achieving relative improvements of 13.1\% on challenging AIME 2025 and 8.1\% on AMC. The code is released at \href{https://github.com/szu-tera/SCOPE}{https://github.com/szu-tera/SCOPE}.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear whether common featurization methods - including sparse autoencoders (SAEs) and sparse probes - recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting where we control the correlations between textual concepts, such as sentiment, domain, and tense, and analyze performance under increasing correlations between them. We first evaluate the extent to which featurizers can learn disentangled representations of each concept under increasing correlational strengths. We observe a one-to-many relationship from concepts to features: features correspond to no more than one concept, but concepts are distributed across many features. Then, we perform steering experiments, measuring whether each concept is independently manipulable. Even when trained on uniform distributions of concepts, SAE features generally affect many concepts when steered, indicating that they are neither selective nor independent; nonetheless, features affect disjoint subspaces. These results suggest that correlational metrics for measuring disentanglement are generally not sufficient for establishing independence when steering, and that affecting disjoint subspaces is not sufficient for concept selectivity. These results underscore the importance of compositional evaluations in interpretability research.
๐
2025-12-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Richard Helder Moulton, Austin O'Brien, John D. Hastings
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Although large language models (LLMs) are increasingly used in security-critical workflows, practitioners lack quantitative guidance on which safeguards are worth deploying. This paper introduces a decision-oriented framework and reproducible methodology that together quantify residual risk, convert adversarial probe outcomes into financial risk estimates and return-on-control (RoC) metrics, and enable monetary comparison of layered defenses for LLM-based systems. A retrieval-augmented generation (RAG) service is instantiated using the DeepSeek-R1 model over a corpus containing synthetic personally identifiable information (PII), and subjected to automated attacks with Garak across five vulnerability classes: PII leakage, latent context injection, prompt injection, adversarial attack generation, and divergence. For each (vulnerability, control) pair, attack success probabilities are estimated via Laplace's Rule of Succession and combined with loss triangle distributions, calibrated from public breach-cost data, in 10,000-run Monte Carlo simulations to produce loss exceedance curves and expected losses. Three widely used mitigations, attribute-based access control (ABAC); named entity recognition (NER) redaction using Microsoft Presidio; and NeMo Guardrails, are then compared to a baseline RAG configuration. The baseline system exhibits very high attack success rates (>= 0.98 for PII, latent injection, and prompt injection), yielding a total simulated expected loss of $313k per attack scenario. ABAC collapses success probabilities for PII and prompt-related attacks to near zero and reduces the total expected loss by ~94%, achieving an RoC of 9.83. NER redaction likewise eliminates PII leakage and attains an RoC of 5.97, while NeMo Guardrails provides only marginal benefit (RoC of 0.05).
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Debu Sinha
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods - including state-of-the-art OpenAI text-embedding-3-large and cross-encoder models - exhibit unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, proving the task is solvable through reasoning. We term this the "semantic illusion": semantically plausible hallucinations preserve similarity to source documents while introducing factual errors invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting embedding-based detection is insufficient for production RAG deployment.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fanzhe Fu
Natural Language Processing
ๆฅ็ๆ่ฆ
The transition of Large Language Models (LLMs) from stochastic chat interfaces to reliable software components necessitates a fundamental re-engineering of interaction paradigms. Current methodologies, predominantly heuristic-based "prompt engineering," fail to provide the deterministic guarantees required for mission-critical applications. We introduce the Meta-Prompting Protocol, a rigorous theoretical framework that formalizes the orchestration of LLMs as a programmable, self-optimizing system. Central to this protocol is the Adversarial Trinity, a tripartite topology comprising a Generator (P), an Auditor (A), and an Optimizer (O). By treating natural language instructions as differentiable variables within a semantic computation graph and utilizing textual critiques as gradients, this architecture mitigates hallucination and prevents model collapse. We demonstrate the theoretical viability of this approach using declarative programming paradigms (DSPy) and automatic textual differentiation (TextGrad), establishing a foundation for "Observable Software Engineering" in the era of probabilistic computing.
๐
2025-12-17
๐ ACL 2026
๐ฅ Hongbo Wang, MaungMaung AprilPyone, Isao Echizen
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Disclaimer: Samples in this paper may be harmful and cause discomfort.
Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2\% to 2.5\% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Yunheng Wang, Yixiao Feng, Yuetong Fang, Shuning Zhang, Tan Jing et al.
Robotics
ๆฅ็ๆ่ฆ
3D Scene Graphs (3DSGs) constitute a powerful representation of the physical world, distinguished by their abilities to explicitly model the complex spatial, semantic, and functional relationships between entities, rendering a foundational understanding that enables agents to interact intelligently with their environment and execute versatile behaviors. Embodied navigation, as a crucial component of such capabilities, leverages the compact and expressive nature of 3DSGs to enable long-horizon reasoning and planning in complex, large-scale environments. However, prior works rely on a static-world assumption, defining traversable space solely based on static spatial layouts and thereby treating interactable obstacles as non-traversable. This fundamental limitation severely undermines their effectiveness in real-world scenarios, leading to limited reachability, low efficiency, and inferior extensibility. To address these issues, we propose HERO, a novel framework for constructing Hierarchical Traversable 3DSGs, that redefines traversability by modeling operable obstacles as pathways, capturing their physical interactivity, functional semantics, and the scene's relational hierarchy. The results show that, relative to its baseline, HERO reduces PL by 35.1% in partially obstructed environments and increases SR by 79.4% in fully obstructed ones, demonstrating substantially higher efficiency and reachability.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu
ๆฅ็ๆ่ฆ
Dialogue Topic Segmentation (DTS) is crucial for understanding task-oriented public-channel communications, such as maritime VHF dialogues, which feature informal speech and implicit transitions. To address the limitations of traditional methods, we propose DASH-DTS, a novel LLM-based framework. Its core contributions are: (1) topic shift detection via dialogue handshake recognition; (2) contextual enhancement through similarity-guided example selection; and (3) the generation of selective positive and negative samples to improve model discrimination and robustness. Additionally, we release VHF-Dial, the first public dataset of real-world maritime VHF communications, to advance research in this domain. DASH-DTS provides interpretable reasoning and confidence scores for each segment. Experimental results demonstrate that our framework achieves several sota segmentation trusted accuracy on both VHF-Dial and standard benchmarks, establishing a strong foundation for stable monitoring and decision support in operational dialogues.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruiyi Zhang, Peijia Qin, Qi Cao, Pengtao Xie
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applying on test-time scaling, DreamPRM-Code achieved state-of-the-art performance on LiveCodeBench with 80.9 pass@1 rate, surpassing OpenAI o4-mini.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiming Cui, Xin Yao, Yuxuan Qin, Xin Li, Shijin Wang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yaniv Leviathan, Matan Kalman, Yossi Matias
Natural Language Processing
ๆฅ็ๆ่ฆ
When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt
Machine Learning
ๆฅ็ๆ่ฆ
Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.
๐
2025-12-17
๐ NeurIPS 2025
๐ฅ Kayode Olumoyin, Lamees El Naqa, Katarzyna Rejniak
Machine Learning
ๆฅ็ๆ่ฆ
In a mathematical model of interacting biological organisms, where external interventions may alter behavior over time, traditional models that assume fixed parameters usually do not capture the evolving dynamics. In oncology, this is further exacerbated by the fact that experimental data are often sparse and sometimes are composed of a few time points of tumor volume. In this paper, we propose to learn time-varying interactions between cells, such as those of bladder cancer tumors and immune cells, and their response to a combination of anticancer treatments in a limited data scenario. We employ the physics-informed neural network (PINN) approach to predict possible subpopulation trajectories at time points where no observed data are available. We demonstrate that our approach is consistent with the biological explanation of subpopulation trajectories. Our method provides a framework for learning evolving interactions among biological organisms when external interventions are applied to their environment.
๐
2025-12-17
๐ ArXiv Preprint (cs.DC)
๐ฅ Xuting Liu, Daniel Alexander, Siva Kesava Reddy Kakarla, Behnaz Arzani, Vincent Liu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the model's layers. However, traditional batching frameworks are ill-suited for EE LLMs, as not all requests in a batch may be ready to exit at the same time. Existing solutions either force a uniform decision on the batch, which overlooks EE opportunities, or degrade output quality by forcing premature exits. We propose Dynamic Rebatching, a solution where we dynamically reorganize the batch at each early-exit point. Requests that meet the exit criteria are immediately processed, while those that continue are held in a buffer, re-grouped into a new batch, and forwarded to deeper layers. We introduce DREX, an early-exit inference system that implements Dynamic Rebatching with two key optimizations: 1) a copy-free rebatching buffer that avoids physical data movement, and 2) an EE and SLA-aware scheduler that analytically predicts whether a given rebatching operation will be profitable. DREX also efficiently handles the missing KV cache from skipped layers using memory-efficient state-copying. Our evaluation shows that DREX improves throughput by 2-12% compared to baseline approaches while maintaining output quality. Crucially, DREX completely eliminates involuntary exits, providing a key guarantee for preserving the output quality intended by the EE model.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou et al.
ๆฅ็ๆ่ฆ
We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown, but the quality of a solution can be objectively evaluated. Models solve these tasks by implementing executable programs rather than outputting a direct answer. FrontierCS includes algorithmic problems, which are often NP-hard variants of competitive programming problems with objective partial scoring, and research problems with the same property. For each problem we provide an expert reference solution and an automatic evaluator. Combining open-ended design, measurable progress, and expert curation, FrontierCS provides a benchmark at the frontier of computer-science difficulty. Empirically, we find that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks, that increasing reasoning budgets alone does not close this gap, and that models often over-optimize for generating merely workable code instead of discovering high-quality algorithms and system designs.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce \model, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matin Mortaheb, Erciyes Karakaya, Sennur Ulukus
Natural Language Processing
ๆฅ็ๆ่ฆ
Semantic communication aims to transmit information most relevant to a task rather than raw data, offering significant gains in communication efficiency for applications such as telepresence, augmented reality, and remote sensing. Recent transformer-based approaches have used self-attention maps to identify informative regions within images, but they often struggle in complex scenes with multiple objects, where self-attention lacks explicit task guidance. To address this, we propose a novel Multi-Modal Semantic Communication framework that integrates text-based user queries to guide the information extraction process. Our proposed system employs a cross-modal attention mechanism that fuses visual features with language embeddings to produce soft relevance scores over the visual data. Based on these scores and the instantaneous channel bandwidth, we use an algorithm to transmit image patches at adaptive resolutions using independently trained encoder-decoder pairs, with total bitrate matching the channel capacity. At the receiver, the patches are reconstructed and combined to preserve task-critical information. This flexible and goal-driven design enables efficient semantic communication in complex and bandwidth-constrained environments.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou et al.
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model own first order update geometry. For each response, G2RL constructs a sequence level feature from the model final layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off manifold updates are deemphasized, yielding a self referential exploration signal that is naturally aligned with PPO style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Oleg Melnikov, Yurii Dorofieiev, Yurii Shakhnovskiy, Huy Truong, Victoria Degeler
ๆฅ็ๆ่ฆ
This paper presents a unified framework, for the detection, classification, and preliminary localization of anomalies in water distribution networks using multivariate statistical analysis. The approach, termed SICAMS (Statistical Identification and Classification of Anomalies in Mahalanobis Space), processes heterogeneous pressure and flow sensor data through a whitening transformation to eliminate spatial correlations among measurements. Based on the transformed data, the Hotelling's $T^2$ statistic is constructed, enabling the formulation of anomaly detection as a statistical hypothesis test of network conformity to normal operating conditions. It is shown that Hotelling's $T^2$ statistic can serve as an integral indicator of the overall "health" of the system, exhibiting correlation with total leakage volume, and thereby enabling approximate estimation of water losses via a regression model. A heuristic algorithm is developed to analyze the $T^2$ time series and classify detected anomalies into abrupt leaks, incipient leaks, and sensor malfunctions. Furthermore, a coarse leak localization method is proposed, which ranks sensors according to their statistical contribution and employs Laplacian interpolation to approximate the affected region within the network. Application of the proposed framework to the BattLeDIM L-Town benchmark dataset demonstrates high sensitivity and reliability in leak detection, maintaining robust performance even under multiple leaks. These capabilities make the method applicable to real-world operational environments without the need for a calibrated hydraulic model.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Victor Lรฉger, Florent Chatelain
ๆฅ็ๆ่ฆ
Partial Least Squares (PLS) is a widely used method for data integration, designed to extract latent components shared across paired high-dimensional datasets. Despite decades of practical success, a precise theoretical understanding of its behavior in high-dimensional regimes remains limited. In this paper, we study a data integration model in which two high-dimensional data matrices share a low-rank common latent structure while also containing individual-specific components. We analyze the singular vectors of the associated cross-covariance matrix using tools from random matrix theory and derive asymptotic characterizations of the alignment between estimated and true latent directions. These results provide a quantitative explanation of the reconstruction performance of the PLS variant based on Singular Value Decomposition (PLS-SVD) and identify regimes where the method exhibits counter-intuitive or limiting behavior. Building on this analysis, we compare PLS-SVD with principal component analysis applied separately to each dataset and show its asymptotic superiority in detecting the common latent subspace. Overall, our results offer a comprehensive theoretical understanding of high-dimensional PLS-SVD, clarifying both its advantages and fundamental limitations.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Georg Siedel, Rojan Regmi, Abhirami Anand, Weijia Shao, Silvia Vock et al.
ๆฅ็ๆ่ฆ
This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of deep vision models to common corruptions. We show that although applying style transfer on synthetic images degrades their quality with respect to the common FID metric, these images are surprisingly beneficial for model training. We conduct a systematic empirical analysis of the effects of both augmentations and their key hyperparameters on the performance of image classifiers. Our results demonstrate that stylization and synthetic data complement each other well and can be combined with popular rule-based data augmentation techniques such as TrivialAugment, while not working with others. Our method achieves state-of-the-art corruption robustness on several small-scale image classification benchmarks, reaching 93.54%, 74.9% and 50.86% robust accuracy on CIFAR-10-C, CIFAR-100-C and TinyImageNet-C, respectively
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Adam Karvonen, James Chua, Clรฉment Dumas, Kit Fraser-Taliente, Subhash Kantamneni et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.
๐
2025-12-17
๐ ArXiv Preprint (quant-ph)
๐ฅ Sergi Masot-Llima, Elies Gil-Fuster, Carlos Bravo-Prieto, Jens Eisert, and Tommaso Guaita
Machine Learning
ๆฅ็ๆ่ฆ
Demonstrating quantum advantage in machine learning tasks requires navigating a complex landscape of proposed models and algorithms. To bring clarity to this search, we introduce a framework that connects the structure of parametrized quantum circuits to the mathematical nature of the functions they can actually learn. Within this framework, we show how fundamental properties, like circuit depth and non-Clifford gate count, directly determine whether a model's output leads to efficient classical simulation or surrogation. We argue that this analysis uncovers common pathways to dequantization that underlie many existing simulation methods. More importantly, it reveals critical distinctions between models that are fully simulatable, those whose function space is classically tractable, and those that remain robustly quantum. This perspective provides a conceptual map of this landscape, clarifying how different models relate to classical simulability and pointing to where opportunities for quantum advantage may lie.
๐
2025-12-17
๐ 10 pages, 3 figures, 2025 IEEE International Confe
๐ฅ Xiaodi Li, Dingcheng Li, Rujun Gao, Mahmoud Zamani, Feng Mi et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a stream of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, where performance on earlier tasks degrades as new tasks are learned. In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct, task-specific parameters for each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model's ability to retain past knowledge while adapting to new tasks. Experimental results on diverse NLP benchmarks demonstrate that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution to mitigate catastrophic forgetting.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tianze Luo, Haotian Yuan, Zhuang Liu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The multi-step denoising process in diffusion and Flow Matching models causes major efficiency issues, which motivates research on few-step generation. We present Solution Flow Models (SoFlow), a framework for one-step generation from scratch. By analyzing the relationship between the velocity function and the solution function of the velocity ordinary differential equation (ODE), we propose a Flow Matching loss and a solution consistency loss to train our models. The Flow Matching loss allows our models to provide estimated velocity fields for Classifier-Free Guidance (CFG) during training, which improves generation performance. Notably, our consistency loss does not require the calculation of the Jacobian-vector product (JVP), a common requirement in recent works that is not well-optimized in deep learning frameworks like PyTorch. Experimental results indicate that, when trained from scratch using the same Diffusion Transformer (DiT) architecture and an equal number of training epochs, our models achieve better FID-50K scores than MeanFlow models on the ImageNet 256x256 dataset.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.
๐
2025-12-17
๐ ArXiv Preprint (physics.chem-ph)
๐ฅ Benjamin C. Koenig, Sili Deng
Machine Learning
ๆฅ็ๆ่ฆ
Thermal runaway in lithium-ion batteries is strongly influenced by the state of charge (SOC). Existing predictive models typically infer scalar kinetic parameters at a full SOC or a few discrete SOC levels, preventing them from capturing the continuous SOC dependence that governs exothermic behavior during abuse conditions. To address this, we apply the Kolmogorov-Arnold Chemical Reaction Neural Network (KA-CRNN) framework to learn continuous and realistic SOC-dependent exothermic cathode-electrolyte interactions. We apply a physics-encoded KA-CRNN to learn SOC-dependent kinetic parameters for cathode-electrolyte decomposition directly from differential scanning calorimetry (DSC) data. A mechanistically informed reaction pathway is embedded into the network architecture, enabling the activation energies, pre-exponential factors, enthalpies, and related parameters to be represented as continuous and fully interpretable functions of the SOC. The framework is demonstrated for NCA, NM, and NMA cathodes, yielding models that reproduce DSC heat-release features across all SOCs and provide interpretable insight into SOC-dependent oxygen-release and phase-transformation mechanisms. This approach establishes a foundation for extending kinetic parameter dependencies to additional environmental and electrochemical variables, supporting more accurate and interpretable thermal-runaway prediction and monitoring.
๐
2025-12-17
๐ AAAI 2026
๐ฅ Xinshun Feng, Mingzhe Liu, Yi Qiao, Tongyu Zhu, Leilei Sun et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in explainable recommendations have explored the integration of language models to analyze natural language rationales for user-item interactions. Despite their potential, existing methods often rely on ID-based representations that obscure semantic meaning and impose structural constraints on language models, thereby limiting their applicability in open-ended scenarios. These challenges are intensified by the complex nature of real-world interactions, where diverse user intents are entangled and collaborative signals rarely align with linguistic semantics. To overcome these limitations, we propose BEAT, a unified and transferable framework that tokenizes user and item behaviors into discrete, interpretable sequences. We construct a behavior vocabulary via a vector-quantized autoencoding process that disentangles macro-level interests and micro-level intentions from graph-based representations. We then introduce multi-level semantic supervision to bridge the gap between behavioral signals and language space. A semantic alignment regularization mechanism is designed to embed behavior tokens directly into the input space of frozen language models. Experiments on three public datasets show that BEAT improves zero-shot recommendation performance while generating coherent and informative explanations. Further analysis demonstrates that our behavior tokens capture fine-grained semantics and offer a plug-and-play interface for integrating complex behavior patterns into large language models.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Carlos Couto, Josรฉ Mourรฃo, Mรกrio A. T. Figueiredo, Pedro Ribeiro
Machine Learning
ๆฅ็ๆ่ฆ
Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for some classes of teacher-student problems, when the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we analytically establish that for large networks the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mathieu Blondel, Michael E. Sander, Germain Vivier-Ardisson, Tianlin Liu, Vincent Roulet
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alexandre Dussolle, Pietro Liรฒ
Natural Language Processing
ๆฅ็ๆ่ฆ
Going from pure Multilayer Perceptron (MLP) to a learnable graph message-passing mechanism at each layer has been foundational to state-of-the-art results, despite the computational trade-off (e.g. GATs or Transformers). To go a step further, in this work, we introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions. Beyond these core mechanisms, we study how smoothing N-simplicial attention is by deriving a Lipschitz upper-bound and by demonstrating that by itself it also suffers from over-smoothing, despite opening the attention message-passing to higher-order interactions.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shuibai Zhang, Fred Zhangzhi Peng, Yiheng Zhang, Jin Pan, Grigorios G. Chrysos
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Diffusion language models are structurally well-suited for iterative error correction, as their non-causal denoising dynamics allow arbitrary positions in a sequence to be revised. However, standard masked diffusion language model (MDLM) training fails to reliably induce this behavior, as models often cannot identify unreliable tokens in a complete input, rendering confidence-guided refinement ineffective. We study corrective behavior in diffusion language models, defined as the ability to assign lower confidence to incorrect tokens and iteratively refine them while preserving correct content. We show that this capability is not induced by conventional masked diffusion objectives and propose a correction-oriented post-training principle that explicitly supervises visible incorrect tokens, enabling error-aware confidence and targeted refinement. To evaluate corrective behavior, we introduce the Code Revision Benchmark (CRB), a controllable and executable benchmark for assessing error localization and in-place correction. Experiments on code revision tasks and controlled settings demonstrate that models trained with our approach substantially outperform standard MDLMs in correction scenarios, while also improving pure completion performance. Our code is publicly available at https://github.com/zhangshuibai/CDLM.
๐
2025-12-17
๐ Accepted at IEEE/CVF Winter Conference on Applicat
๐ฅ Shashank Mishra, Karan Patil, Didier Stricker, Jason Rambach
Computer Vision
ๆฅ็ๆ่ฆ
High-performance Radar-Camera 3D object detection can be achieved by leveraging knowledge distillation without using LiDAR at inference time. However, existing distillation methods typically transfer modality-specific features directly to each sensor, which can distort their unique characteristics and degrade their individual strengths. To address this, we introduce IMKD, a radar-camera fusion framework based on multi-level knowledge distillation that preserves each sensor's intrinsic characteristics while amplifying their complementary strengths. IMKD applies a three-stage, intensity-aware distillation strategy to enrich the fused representation across the architecture: (1) LiDAR-to-Radar intensity-aware feature distillation to enhance radar representations with fine-grained structural cues, (2) LiDAR-to-Fused feature intensity-guided distillation to selectively highlight useful geometry and depth information at the fusion level, fostering complementarity between the modalities rather than forcing them to align, and (3) Camera-Radar intensity-guided fusion mechanism that facilitates effective feature alignment and calibration. Extensive experiments on the nuScenes benchmark show that IMKD reaches 67.0% NDS and 61.0% mAP, outperforming all prior distillation-based radar-camera fusion methods. Our code and models are available at https://github.com/dfki-av/IMKD/.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuxin Cai, Yanyong Huang, Jinyuan Chang, Dongjie Wang, Tianrui Li et al.
ๆฅ็ๆ่ฆ
Feature and instance co-selection, which aims to reduce both feature dimensionality and sample size by identifying the most informative features and instances, has attracted considerable attention in recent years. However, when dealing with unlabeled incomplete multi-view data, where some samples are missing in certain views, existing methods typically first impute the missing data and then concatenate all views into a single dataset for subsequent co-selection. Such a strategy treats co-selection and missing data imputation as two independent processes, overlooking potential interactions between them. The inter-sample relationships gleaned from co-selection can aid imputation, which in turn enhances co-selection performance. Additionally, simply merging multi-view data fails to capture the complementary information among views, ultimately limiting co-selection effectiveness. To address these issues, we propose a novel co-selection method, termed Joint learning of Unsupervised multI-view feature and instance Co-selection with cross-viEw imputation (JUICE). JUICE first reconstructs incomplete multi-view data using available observations, bringing missing data recovery and feature and instance co-selection together in a unified framework. Then, JUICE leverages cross-view neighborhood information to learn inter-sample relationships and further refine the imputation of missing values during reconstruction. This enables the selection of more representative features and instances. Extensive experiments demonstrate that JUICE outperforms state-of-the-art methods.
๐
2025-12-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing return of scaling up model sizes and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation in research scenarios leads to changing choices of the best performing model on scientific discovery projects evaluated, suggesting all current LLMs are distant to general scientific "superintelligence". Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.
๐
2025-12-17
๐ ArXiv Preprint (physics.optics)
๐ฅ Yuan Wang, Oleksandr Kyriienko
Machine Learning
ๆฅ็ๆ่ฆ
Photonics can offer a hardware-native route for machine learning (ML). However, efficient deployment of photonics-enhanced ML requires hybrid workflows that integrate optical processing with conventional CPU/GPU based neural network architectures. Here, we propose such a workflow that combines photonic positional embeddings (PEs) with advanced graph ML models. We introduce a photonics-based method that augments graph convolutional networks (GCNs) with PEs derived from light propagation on synthetic frequency lattices whose couplings match the input graph. We simulate propagation and readout to obtain internode intensity correlation matrices, which are used as PEs in GCNs to provide global structural information. Evaluated on Long Range Graph Benchmark molecular datasets, the method outperforms baseline GCNs with Laplacian based PEs, achieving $6.3\%$ lower mean absolute error for regression and $2.3\%$ higher average precision for classification tasks using a two-layer GCN as a baseline. When implemented in high repetition rate photonic hardware, correlation measurements can enable fast feature generation by bypassing digital simulation of PEs. Our results show that photonic PEs improve GCN performance and support optical acceleration of graph ML.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Taichi Aida, Mamoru Komachi, Toshinobu Ogiso, Hiroya Takamura, Daichi Mochihashi
ๆฅ็ๆ่ฆ
Understanding the temporal evolution of sets of vectors is a fundamental challenge across various domains, including ecology, crime analysis, and linguistics. For instance, ecosystem structures evolve due to interactions among plants, herbivores, and carnivores; the spatial distribution of crimes shifts in response to societal changes; and word embedding vectors reflect cultural and semantic trends over time. However, analyzing such time-varying sets of vectors is challenging due to their complicated structures, which also evolve over time. In this work, we propose a novel method for modeling the distribution underlying each set of vectors using infinite-dimensional Gaussian processes. By approximating the latent function in the Gaussian process with Random Fourier Features, we obtain compact and comparable vector representations over time. This enables us to track and visualize temporal transitions of vector sets in a low-dimensional space. We apply our method to both sociological data (crime distributions) and linguistic data (word embeddings), demonstrating its effectiveness in capturing temporal dynamics. Our results show that the proposed approach provides interpretable and robust representations, offering a powerful framework for analyzing structural changes in temporally indexed vector sets across diverse domains.
๐
2025-12-17
๐ ArXiv Preprint (cs.SD)
๐ฅ Ken O'Hanlon, Basil Woods, Lin Wang, Mark Sandler
Machine Learning
ๆฅ็ๆ่ฆ
In this paper we propose a conditioned UNet for Music Source Separation (MSS). MSS is generally performed by multi-output neural networks, typically UNets, with each output representing a particular stem from a predefined instrument vocabulary. In contrast, conditioned MSS networks accept an audio query related to a stem of interest alongside the signal from which that stem is to be extracted. Thus, a strict vocabulary is not required and this enables more realistic tasks in MSS. The potential of conditioned approaches for such tasks has been somewhat hidden due to a lack of suitable data, an issue recently addressed with the MoisesDb dataset. A recent method, Banquet, employs this dataset with promising results seen on larger vocabularies. Banquet uses Bandsplit RNN rather than a UNet and the authors state that UNets should not be suitable for conditioned MSS. We counter this argument and propose QSCNet, a novel conditioned UNet for MSS that integrates network conditioning elements in the Sparse Compressed Network for MSS. We find QSCNet to outperform Banquet by over 1dB SNR on a couple of MSS tasks, while using less than half the number of parameters.
๐
2025-12-17
๐ NeurIPS 2025
๐ฅ Guillermo Rodriguez-Llorente, Galo Gallardo, Rodrigo Morant Navascuรฉs, Nikita Khvatkin Petrovsky, Anderson Sabogal et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The development of nuclear fusion requires materials that can withstand extreme conditions. The IFMIF-DONES facility, a high-power particle accelerator, is being designed to qualify these materials. A critical testbed for its development is the MuVacAS prototype, which replicates the final segment of the accelerator beamline. Precise regulation of argon gas pressure within its ultra-high vacuum chamber is vital for this task. This work presents a fully data-driven approach for autonomous pressure control. A Deep Learning Surrogate Model, trained on real operational data, emulates the dynamics of the argon injection system. This high-fidelity digital twin then serves as a fast-simulation environment to train a Deep Reinforcement Learning agent. The results demonstrate that the agent successfully learns a control policy that maintains gas pressure within strict operational limits despite dynamic disturbances. This approach marks a significant step toward the intelligent, autonomous control systems required for the demanding next-generation particle accelerator facilities.
๐
2025-12-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Konstantinos Kalogiannis, Ahmed Mohamed Hussain, Hexu Li, Panos Papadimitratos
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Vehicular platooning promises transformative improvements in transportation efficiency and safety through the coordination of multi-vehicle formations enabled by Vehicle-to-Everything (V2X) communication. However, the distributed nature of platoon coordination creates security vulnerabilities, allowing authenticated vehicles to inject falsified kinematic data, compromise operational stability, and pose a threat to passenger safety. Traditional misbehaviour detection approaches, which rely on plausibility checks and statistical methods, suffer from high False Positive (FP) rates and cannot capture the complex temporal dependencies inherent in multi-vehicle coordination dynamics. We present Attention In Motion (AIMformer), a transformer-based framework specifically tailored for real-time misbehaviour detection in vehicular platoons with edge deployment capabilities. AIMformer leverages multi-head self-attention mechanisms to simultaneously capture intra-vehicle temporal dynamics and inter-vehicle spatial correlations. It incorporates global positional encoding with vehicle-specific temporal offsets to handle join/exit maneuvers. We propose a Precision-Focused (BCE) loss function that penalizes FPs to meet the requirements of safety-critical vehicular systems. Extensive evaluation across 4 platoon controllers, multiple attack vectors, and diverse mobility scenarios demonstrates superior performance ($\geq$ 0.93) compared to state-of-the-art baseline architectures. A comprehensive deployment analysis utilizing TensorFlow Lite (TFLite), Open Neural Network Exchange (ONNX), and TensorRT achieves sub-millisecond inference latency, making it suitable for real-time operation on resource-constrained edge platforms. Hence, validating AIMformer is viable for both in-vehicle and roadside infrastructure deployment.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hampus Linander, Conor Heins, Alexander Tschantz, Marco Perin, Christopher Buckley
Machine Learning
ๆฅ็ๆ่ฆ
Equivariance is a powerful prior for learning physical dynamics, yet exact group equivariance can degrade performance if the symmetries are broken. We propose object-centric world models built with geometric algebra neural networks, providing a soft geometric inductive bias. Our models are evaluated using simulated environments of 2d rigid body dynamics with static obstacles, where we train for next-step predictions autoregressively. For long-horizon rollouts we show that the soft inductive bias of our models results in better performance in terms of physical fidelity compared to non-equivariant baseline models. The approach complements recent soft-equivariance ideas and aligns with the view that simple, well-chosen priors can yield robust generalization. These results suggest that geometric algebra offers an effective middle ground between hand-crafted physics and unstructured deep nets, delivering sample-efficient dynamics models for multi-object scenes.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Adriรกn Detavernier, Jasper De Bock
ๆฅ็ๆ่ฆ
We consider two conceptually different approaches for assessing the reliability of the individual predictions of a classifier: Robustness Quantification (RQ) and Uncertainty Quantification (UQ). We compare both approaches on a number of benchmark datasets and show that there is no clear winner between the two, but that they are complementary and can be combined to obtain a hybrid approach that outperforms both RQ and UQ. As a byproduct of our approach, for each dataset, we also obtain an assessment of the relative importance of uncertainty and robustness as sources of unreliability.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Luca Torresi, Pascal Friederich
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Self-driving laboratories (SDLs) are combining recent technological advances in robotics, automation, and machine learning based data analysis and decision-making to perform autonomous experimentation toward human-directed goals without requiring any direct human intervention. SDLs are successfully used in materials science, chemistry, and beyond, to optimise processes, materials, and devices in a systematic and data-efficient way. At present, the most widely used algorithm to identify the most informative next experiment is Bayesian optimisation. While relatively simple to apply to a wide range of optimisation problems, standard Bayesian optimisation relies on a fixed experimental workflow with a clear set of optimisation parameters and one or more measurable objective functions. This excludes the possibility of making on-the-fly decisions about changes in the planned sequence of operations and including intermediate measurements in the decision-making process. Therefore, many real-world experiments need to be adapted and simplified to be converted to the common setting in self-driving labs. In this paper, we introduce an extension to Bayesian optimisation that allows flexible sampling of multi-stage workflows and makes optimal decisions based on intermediate observables, which we call proxy measurements. We systematically compare the advantage of taking into account proxy measurements over conventional Bayesian optimisation, in which only the final measurement is observed. We find that over a wide range of scenarios, proxy measurements yield a substantial improvement, both in the time to find good solutions and in the overall optimality of found solutions. This not only paves the way to use more complex and thus more realistic experimental workflows in autonomous labs but also to smoothly combine simulations and experiments in the next generation of SDLs.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ioannis Kalogeropoulos, Giorgos Bouritsas, Yannis Panagakis
Machine Learning
ๆฅ็ๆ่ฆ
As machine learning models are increasingly deployed in high-stakes settings, e.g. as decision support systems in various societal sectors or in critical infrastructure, designers and auditors are facing the need to ensure that models satisfy a wider variety of requirements (e.g. compliance with regulations, fairness, computational constraints) beyond performance. Although most of them are the subject of ongoing studies, typical approaches face critical challenges: post-processing methods tend to compromise performance, which is often counteracted by fine-tuning or, worse, training from scratch, an often time-consuming or even unavailable strategy. This raises the following question: "Can we efficiently edit models to satisfy requirements, without sacrificing their utility?" In this work, we approach this with a unifying framework, in a data-driven manner, i.e. we learn to edit neural networks (NNs), where the editor is an NN itself - a graph metanetwork - and editing amounts to a single inference step. In particular, the metanetwork is trained on NN populations to minimise an objective consisting of two terms: the requirement to be enforced and the preservation of the NN's utility. We experiment with diverse tasks (the data minimisation principle, bias mitigation and weight pruning) improving the trade-offs between performance, requirement satisfaction and time efficiency compared to popular post-processing or re-training alternatives.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiangrui Xu, Zhize Li, Yufei Han, Bin Wang, Jiqiang Liu et al.
ๆฅ็ๆ่ฆ
Data Reconstruction Attacks (DRA) pose a significant threat to Federated Learning (FL) systems by enabling adversaries to infer sensitive training data from local clients. Despite extensive research, the question of how to characterize and assess the risk of DRAs in FL systems remains unresolved due to the lack of a theoretically-grounded risk quantification framework. In this work, we address this gap by introducing Invertibility Loss (InvLoss) to quantify the maximum achievable effectiveness of DRAs for a given data instance and FL model. We derive a tight and computable upper bound for InvLoss and explore its implications from three perspectives. First, we show that DRA risk is governed by the spectral properties of the Jacobian matrix of exchanged model updates or feature embeddings, providing a unified explanation for the effectiveness of defense methods. Second, we develop InvRE, an InvLoss-based DRA risk estimator that offers attack method-agnostic, comprehensive risk evaluation across data instances and model architectures. Third, we propose two adaptive noise perturbation defenses that enhance FL privacy without harming classification accuracy. Extensive experiments on real-world datasets validate our framework, demonstrating its potential for systematic DRA risk evaluation and mitigation in FL systems.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Neeraj Sarna, Yuanyuan Li, Michael von Gablenz
Computer Vision
ๆฅ็ๆ่ฆ
Large scale text-to-image generation models can memorize and reproduce their training dataset. Since the training dataset often contains copyrighted material, reproduction of training dataset poses a copyright infringement risk, which could result in legal liabilities and financial losses for both the AI user and the developer. The current works explores the potential of chain-of-thought and task instruction prompting in reducing copyrighted content generation. To this end, we present a formulation that combines these two techniques with two other copyright mitigation strategies: a) negative prompting, and b) prompt re-writing. We study the generated images in terms their similarity to a copyrighted image and their relevance of the user input. We present numerical experiments on a variety of models and provide insights on the effectiveness of the aforementioned techniques for varying model complexity.
๐
2025-12-17
๐ Accepted to Transactions on Machine Learning Resea
๐ฅ Akihiro Kubo, Paavo Parmas, Shin Ishii
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the length of the rollouts poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) Moreover, a longer model rollout may reduce value estimation bias but raise the variance of policy gradients due to backpropagation through multiple steps, implying another intermediate horizon for stable gradient estimates. However, these two optimal horizons may differ. To resolve this conflict, we propose Double Horizon Model-Based Policy Optimization (DHMBPO), which divides the rollout procedure into a long "distribution rollout" (DR) and a short "training rollout" (TR). The DR generates on-policy state samples for mitigating distribution shift. In contrast, the short TR leverages differentiable transitions to offer accurate value gradient estimation with stable gradient updates, thereby requiring fewer updates and reducing overall runtime. We demonstrate that the double-horizon approach effectively balances distribution shift, model bias, and gradient instability, and surpasses existing MBRL methods on continuous-control benchmarks in terms of both sample efficiency and runtime.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ John D. Foley, Justin T. Lee
ๆฅ็ๆ่ฆ
We introduce an extension of the partitioned local depth (PaLD) algorithm that is adapted to online applications such as semi-supervised prediction. The new algorithm we present, online PaLD, is well-suited to situations where it is a possible to pre-compute a cohesion network from a reference dataset. After $O(n^3)$ steps to construct a queryable data structure, online PaLD can extend the cohesion network to a new data point in $O(n^2)$ time. Our approach complements previous speed up approaches based on approximation and parallelism. For illustrations, we present applications to online anomaly detection and semi-supervised classification for health-care datasets.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Quanxi Zhou, Wencan Mao, Manabu Tsukada, John C. S. Lui, Yusheng Ji
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL) evolve along distinct paths but converge in the design of Dyna-Q [1]. However, modern RL methods still struggle with effective transferability across tasks and scenarios. Motivated by this limitation, we propose a generalized algorithm, Feature Model-Based Enhanced Actor-Critic (FM-EAC), that integrates planning, acting, and learning for multi-task control in dynamic environments. FM-EAC combines the strengths of MBRL and MFRL and improves generalizability through the use of novel feature-based models and an enhanced actor-critic framework. Simulations in both urban and agricultural applications demonstrate that FM-EAC consistently outperforms many state-of-the-art MBRL and MFRL methods. More importantly, different sub-networks can be customized within FM-EAC according to user-specific requirements.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hyakka Nakada, Shu Tanaka
Machine Learning
ๆฅ็ๆ่ฆ
Random matrix theory has played an important role in various areas of pure mathematics, mathematical physics, and machine learning. From a practical perspective of data science, input data are usually normalized prior to processing. Thus, this study investigates the statistical properties of min-max normalized eigenvalues in random matrices. Previously, the effective distribution for such normalized eigenvalues has been proposed. In this study, we apply it to evaluate a scaling law of the cumulative distribution. Furthermore, we derive the residual error that arises during matrix factorization of random matrices. We conducted numerical experiments to verify these theoretical predictions.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yeonwoo Cha, Semin Kim, Jinhyeon Kwon, Seunghoon Hong
ๆฅ็ๆ่ฆ
Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches are challenged by their inefficiency, as they require large-scale datasets often with restrictive pairing constraints, incur high computational cost from modeling joint distribution, and rely on complex multi-stage training. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6x fewer parameters and training 10x faster than prior methods. The project page with code is available at https://yeonwoo378.github.io/official_flowbind.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jianfei Ma, Wee Sun Lee
Machine Learning
ๆฅ็ๆ่ฆ
At the boundary between the known and the unknown, an agent inevitably confronts the dilemma of whether to explore or to exploit. Epistemic uncertainty reflects such boundaries, representing systematic uncertainty due to limited knowledge. In this paper, we propose a Bayesian reinforcement learning (RL) algorithm, $\texttt{EUBRL}$, which leverages epistemic guidance to achieve principled exploration. This guidance adaptively reduces per-step regret arising from estimation errors. We establish nearly minimax-optimal regret and sample complexity guarantees for a class of sufficiently expressive priors in infinite-horizon discounted MDPs. Empirically, we evaluate $\texttt{EUBRL}$ on tasks characterized by sparse rewards, long horizons, and stochasticity. Results demonstrate that $\texttt{EUBRL}$ achieves superior sample efficiency, scalability, and consistency.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Liang Peng, Yixuan Ye, Cheng Liu, Hangjun Che, Fei Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Multi-view clustering has been empirically shown to improve learning performance by leveraging the inherent complementary information across multiple views of data. However, in real-world scenarios, collecting strictly aligned views is challenging, and learning from both aligned and unaligned data becomes a more practical solution. Partially View-aligned Clustering aims to learn correspondences between misaligned view samples to better exploit the potential consistency and complementarity across views, including both aligned and unaligned data. However, most existing PVC methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster. Moreover, the inherent heterogeneity of multi-view data induces distributional shifts in representations, leading to inaccuracies in establishing meaningful correspondences between cross-view latent features and, consequently, impairing learning effectiveness. To address these challenges, we propose a Semantic MAtching contRasTive learning model (SMART) for PVC. The main idea of our approach is to alleviate the influence of cross-view distributional shifts, thereby facilitating semantic matching contrastive learning to fully exploit semantic relationships in both aligned and unaligned data. Extensive experiments on eight benchmark datasets demonstrate that our method consistently outperforms existing approaches on the PVC problem.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Julian Oelhaf, Mehran Pashaei, Georg Kordowich, Christian Bergler, Andreas Maier et al.
Machine Learning
ๆฅ็ๆ่ฆ
The growing penetration of renewable and distributed generation is transforming power systems and challenging conventional protection schemes that rely on fixed settings and local measurements. Machine learning (ML) offers a data-driven alternative for centralized fault classification (FC) and fault localization (FL), enabling faster and more adaptive decision-making. However, practical deployment critically depends on robustness. Protection algorithms must remain reliable even when confronted with missing, noisy, or degraded sensor data. This work introduces a unified framework for systematically evaluating the robustness of ML models in power system protection.
High-fidelity EMT simulations are used to model realistic degradation scenarios, including sensor outages, reduced sampling rates, and transient communication losses. The framework provides a consistent methodology for benchmarking models, quantifying the impact of limited observability, and identifying critical measurement channels required for resilient operation. Results show that FC remains highly stable under most degradation types but drops by about 13% under single-phase loss, while FL is more sensitive overall, with voltage loss increasing localization error by over 150%. These findings offer actionable guidance for robustness-aware design of future ML-assisted protection systems.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Michael Amir, Manon Flageat, Amanda Prorok
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The success of machine learning for real-world robotic systems has created a new form of intellectual property: the trained policy. This raises a critical need for novel methods that verify ownership and detect unauthorized, possibly unsafe misuse. While watermarking is established in other domains, physical policies present a unique challenge: remote detection. Existing methods assume access to the robot's internal state, but auditors are often limited to external observations (e.g., video footage). This ``Physical Observation Gap'' means the watermark must be detected from signals that are noisy, asynchronous, and filtered by unknown system dynamics. We formalize this challenge using the concept of a \textit{glimpse sequence}, and introduce Colored Noise Coherency (CoNoCo), the first watermarking strategy designed for remote detection. CoNoCo embeds a spectral signal into the robot's motions by leveraging the policy's inherent stochasticity. To show it does not degrade performance, we prove CoNoCo preserves the marginal action distribution. Our experiments demonstrate strong, robust detection across various remote modalities, including motion capture and side-way/top-down video footage, in both simulated and real-world robot experiments. This work provides a necessary step toward protecting intellectual property in robotics, offering the first method for validating the provenance of physical policies non-invasively, using purely remote observations.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Honey Singh Chauhan, Zahraa S. Abdallah
ๆฅ็ๆ่ฆ
Kernel-based methods such as Rocket are among the most effective default approaches for univariate time series classification (TSC), yet they do not perform equally well across all datasets. We revisit the long-standing intuition that different representations capture complementary structure and show that selectively fusing them can yield consistent improvements over Rocket on specific, systematically identifiable kinds of datasets. We introduce Fusion-3 (F3), a lightweight framework that adaptively fuses Rocket, Sax, and Sfa representations. To understand when fusion helps, we cluster UCR datasets into six groups using meta-features capturing series length, spectral structure, roughness, and class imbalance, and treat these clusters as interpretable data-structure regimes. Our analysis shows that fusion typically outperforms strong baselines in regimes with structured variability or rich frequency content, while offering diminishing returns in highly irregular or outlier-heavy settings. To support these findings, we combine three complementary analyses: non-parametric paired statistics across datasets, ablation studies isolating the roles of individual representations, and attribution via SHAP to identify which dataset properties predict fusion gains. Sample-level case studies further reveal the underlying mechanism: fusion primarily improves performance by rescuing specific errors, with adaptive increases in frequency-domain weighting precisely where corrections occur. Using 5-fold cross-validation on the 113 UCR datasets, F3 yields small but consistent average improvements over Rocket, supported by frequentist and Bayesian evidence and accompanied by clearly identifiable failure cases. Our results show that selectively applied fusion provides dependable and interpretable extension to strong kernel-based methods, correcting their weaknesses precisely where the data support it.
๐
2025-12-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mikel Williams-Lekuona, Georgina Cosma
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shiran Ge, Chenyi Huang, Yuang Ai, Qihang Fan, Huaibo Huang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate the trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF), and verify that a high-variance subset of trajectories, selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still necessitates critical computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging its efficiency, Pro-GRPO employs an "Expand-and-Prune" strategy. This strategy first expands the size of initial sampling group to maximize trajectory diversity, then it applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.
๐
2025-12-17
๐ This work has been submitted to the IEEE for possi
๐ฅ Hiroyoshi Nagahama, Katsufumi Inoue, Masayoshi Todorokihara, Michifumi Yoshioka
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Predictive maintenance of rotating machinery increasingly relies on vibration signals, yet most learning-based approaches either discard phase during spectral feature extraction or use raw time-waveforms without explicitly leveraging phase information. This paper introduces two phase-aware preprocessing strategies to address random phase variations in multi-axis vibration data: (1) three-axis independent phase adjustment that aligns each axis individually to zero phase (2) single-axis reference phase adjustment that preserves inter-axis relationships by applying uniform time shifts. Using a newly constructed rotor dataset acquired with a synchronized three-axis sensor, we evaluate six deep learning architectures under a two-stage learning framework. Results demonstrate architecture-independent improvements: the three-axis independent method achieves consistent gains (+2.7\% for Transformer), while the single-axis reference approach delivers superior performance with up to 96.2\% accuracy (+5.4\%) by preserving spatial phase relationships. These findings establish both phase alignment strategies as practical and scalable enhancements for predictive maintenance systems.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenxiang Zhang, Tongxi Qu, Zhong Li, Tian Zhang, Jun Pang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Deep neural networks are widely deployed with quantization techniques to reduce memory and computational costs by lowering the numerical precision of their parameters. While quantization alters model parameters and their outputs, existing privacy analyses primarily focus on full-precision models, leaving a gap in understanding how bit-width reduction can affect privacy leakage. We present the first systematic study of the privacy-utility relationship in post-training quantization (PTQ), a versatile family of methods that can be applied to pretrained models without further training. Using membership inference attacks as our evaluation framework, we analyze three popular PTQ algorithms-AdaRound, BRECQ, and OBC-across multiple precision levels (4-bit, 2-bit, and 1.58-bit) on CIFAR-10, CIFAR-100, and TinyImageNet datasets. Our findings consistently show that low-precision PTQs can reduce privacy leakage. In particular, lower-precision models demonstrate up to an order of magnitude reduction in membership inference vulnerability compared to their full-precision counterparts, albeit at the cost of decreased utility. Additional ablation studies on the 1.58-bit quantization level show that quantizing only the last layer at higher precision enables fine-grained control over the privacy-utility trade-off. These results offer actionable insights for practitioners to balance efficiency, utility, and privacy protection in real-world deployments.
๐
2025-12-17
๐ Submitted for review to the Journal of the Audio E
๐ฅ Yann Bourdin, Pierrick Legrand, Fanny Roche
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Deep learning has become a standard approach for the modeling of audio effects, yet strictly black-box modeling remains problematic for time-varying systems. Unlike time-invariant effects, training models on devices with internal modulation typically requires the recording or extraction of control signals to ensure the time-alignment required by standard loss functions. This paper introduces a Generative Adversarial Network (GAN) framework to model such effects using only input-output audio recordings, removing the need for modulation signal extraction. We propose a convolutional-recurrent architecture trained via a two-stage strategy: an initial adversarial phase allows the model to learn the distribution of the modulation behavior without strict phase constraints, followed by a supervised fine-tuning phase where a State Prediction Network (SPN) estimates the initial internal states required to synchronize the model with the target. Additionally, a new objective metric based on chirp-train signals is developed to quantify modulation accuracy. Experiments modeling a vintage hardware phaser demonstrate the method's ability to capture time-varying dynamics in a fully black-box context.
๐
2025-12-17
๐ ArXiv Preprint (cs.DC)
๐ฅ Erik Schultheis, Dan Alistarh
Machine Learning
ๆฅ็ๆ่ฆ
We present LLMQ, an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g. 3B to 32B parameters, on affordable, commodity GPUs. These devices are characterized by low memory availability and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine based collectives. LLMQ is able to train or fine-tune a 7B model on a single 16GB mid-range gaming card, or a 32B model on a workstation equipped with 4 RTX 4090s. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, and maintaining FLOP utilization of around 50%. The efficiency of LLMQ rivals that of production-scale systems on much more expensive cloud-grade GPUs.
๐
2025-12-17
๐ 15 pages, 5 figures, Submitted to a journal
๐ฅ Siva Sai, Ishika Goyal, Shubham Sharma, Sri Harshita Manuri, Vinay Chamola et al.
Machine Learning
ๆฅ็ๆ่ฆ
The increasing number of cyber threats and rapidly evolving tactics, as well as the high volume of data in recent years, have caused classical machine learning, rules, and signature-based defence strategies to fail, rendering them unable to keep up. An alternative, Quantum Machine Learning (QML), has recently emerged, making use of computations based on quantum mechanics. It offers better encoding and processing of high-dimensional structures for certain problems. This survey provides a comprehensive overview of QML techniques relevant to the domain of security, such as Quantum Neural Networks (QNNs), Quantum Support Vector Machines (QSVMs), Variational Quantum Circuits (VQCs), and Quantum Generative Adversarial Networks (QGANs), and discusses the contributions of this paper in relation to existing research in the field and how it improves over them. It also maps these methods across supervised, unsupervised, and generative learning paradigms, and to core cybersecurity tasks, including intrusion and anomaly detection, malware and botnet classification, and encrypted-traffic analytics. It also discusses their application in the domain of cloud computing security, where QML can enhance secure and scalable operations. Many limitations of QML in the domain of cybersecurity have also been discussed, along with the directions for addressing them.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aleksei Shestov, Anton Klenitskiy, Daria Denisova, Amurkhan Dzagkoev, Daniil Petrovich et al.
ๆฅ็ๆ่ฆ
Modern representation learning increasingly relies on unsupervised and self-supervised methods trained on large-scale unlabeled data. While these approaches achieve impressive generalization across tasks and domains, evaluating embedding quality without labels remains an open challenge. In this work, we propose Persistence, a topology-aware metric based on persistent homology that quantifies the geometric structure and topological richness of embedding spaces in a fully unsupervised manner. Unlike metrics that assume linear separability or rely on covariance structure, Persistence captures global and multi-scale organization. Empirical results across diverse domains show that Persistence consistently achieves top-tier correlations with downstream performance, outperforming existing unsupervised metrics and enabling reliable model and hyperparameter selection.
๐
2025-12-17
๐ ArXiv Preprint (cs.SI)
๐ฅ Daniel Sรกnchez Catalina, George T. Cantwell
ๆฅ็ๆ่ฆ
We consider the problem of ranking objects from noisy pairwise comparisons, for example, ranking tennis players from the outcomes of matches. We follow a standard approach to this problem and assume that each object has an unobserved strength and that the outcome of each comparison depends probabilistically on the strengths of the comparands. However, we do not assume to know a priori how skills affect outcomes. Instead, we present an efficient algorithm for simultaneously inferring both the unobserved strengths and the function that maps strengths to probabilities. Despite this problem being under-constrained, we present experimental evidence that the conclusions of our Bayesian approach are robust to different model specifications. We include several case studies to exemplify the method on real-world data sets.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huiyan Xue, Xuming Ran, Yaxin Li, Qi Xu, Enhui Li et al.
ๆฅ็ๆ่ฆ
Sparse neural systems are gaining traction for efficient continual learning due to their modularity and low interference. Architectures such as Sparse Distributed Memory Multi-Layer Perceptrons (SDMLP) construct task-specific subnetworks via Top-K activation and have shown resilience against catastrophic forgetting. However, their rigid modularity limits cross-task knowledge reuse and leads to performance degradation under high sparsity. We propose Selective Subnetwork Distillation (SSD), a structurally guided continual learning framework that treats distillation not as a regularizer but as a topology-aligned information conduit. SSD identifies neurons with high activation frequency and selectively distills knowledge within previous Top-K subnetworks and output logits, without requiring replay or task labels. This enables structural realignment while preserving sparse modularity. Experiments on Split CIFAR-10, CIFAR-100, and MNIST demonstrate that SSD improves accuracy, retention, and representation coverage, offering a structurally grounded solution for sparse continual learning.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuinan Hou, Jing Mi, Marco Zorzi, Lamberto Ballan, Alberto Testolin
RoboticsNatural Language ProcessingMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures. Notably, enumeration accuracy significantly improves when VLMs are prompted to generate intermediate representations (i.e., locations and verbal labels) of each object to be counted. Nevertheless, none of the models can reliably count the number of objects in complex visual scenes, showing that further research is still needed to create AI systems that can reliably deploy counting procedures in realistic environments.
๐
2025-12-17
๐ NeurIPS 2025
๐ฅ Youssef Ghallab, Omar Iraqy, Mohamed Kandil, Mohamed Ashraf, Saadeldine Eletter et al.
ๆฅ็ๆ่ฆ
Physiological signals such as electrocardiograms (ECG) and electroencephalograms (EEG) provide complementary insights into human health and cognition, yet multi-modal integration is challenging due to limited multi-modal labeled data, and modality-specific differences . In this work, we adapt the CBraMod encoder for large-scale self-supervised ECG pretraining, introducing a dual-masking strategy to capture intra- and inter-lead dependencies. To overcome the above challenges, we utilize a pre-trained CBraMod encoder for EEG and pre-train a symmetric ECG encoder, equipping each modality with a rich foundational representation. These representations are then fused via simple embedding concatenation, allowing the classification head to learn cross-modal interactions, together enabling effective downstream learning despite limited multi-modal supervision. Evaluated on emotion recognition, our approach achieves near state-of-the-art performance, demonstrating that carefully designed physiological encoders, even with straightforward fusion, substantially improve downstream performance. These results highlight the potential of foundation-model approaches to harness the holistic nature of physiological signals, enabling scalable, label-efficient, and generalizable solutions for healthcare and affective computing.
๐
2025-12-17
๐ ArXiv Preprint (hep-ex)
๐ฅ Doฤa Elitez, Paul Gessinger, Daniel Murnane, Marcus Selchou Raaholt, Andreas Salzburger et al.
Machine Learning
ๆฅ็ๆ่ฆ
We introduce ColliderML - a large, open, experiment-agnostic dataset of fully simulated and digitised proton-proton collisions in High-Luminosity Large Hadron Collider conditions ($\sqrt{s}=14$ TeV, mean pile-up $ฮผ= 200$). ColliderML provides one million events across ten Standard Model and Beyond Standard Model processes, plus extensive single-particle samples, all produced with modern next-to-leading order matrix element calculation and showering, realistic per-event pile-up overlay, a validated OpenDataDetector geometry, and standard reconstructions. The release fills a major gap for machine learning (ML) research on detector-level data, provided on the ML-friendly Hugging Face platform. We present physics coverage and the generation, simulation, digitisation and reconstruction pipeline, describe format and access, and initial collider physics benchmarks.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Elio Gruttadauria, Mathieu Fontaine, Jonathan Le Roux, Slim Essid
ๆฅ็ๆ่ฆ
We introduce O-EENC-SD: an end-to-end online speaker diarization system based on EEND-EDA, featuring a novel RNN-based stitching mechanism for online prediction. In particular, we develop a novel centroid refinement decoder whose usefulness is assessed through a rigorous ablation study. Our system provides key advantages over existing methods: a hyperparameter-free solution compared to unsupervised clustering approaches, and a more efficient alternative to current online end-to-end methods, which are computationally costly. We demonstrate that O-EENC-SD is competitive with the state of the art in the two-speaker conversational telephone speech domain, as tested on the CallHome dataset. Our results show that O-EENC-SD provides a great trade-off between DER and complexity, even when working on independent chunks with no overlap, making the system extremely efficient.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Songze Huo, Xiao-Ming Cao
Machine Learning
ๆฅ็ๆ่ฆ
The adsorption energy serves as a crucial descriptor for the large-scale screening of catalysts. Nevertheless, the limited distribution of training data for the extensively utilised machine learning interatomic potential (MLIP), predominantly sourced from near-equilibrium structures, results in unreliable adsorption structures and consequent adsorption energy predictions. In this context, we present DBCata, a deep generative model that integrates a periodic Brownian-bridge framework with an equivariant graph neural network to establish a low-dimensional transition manifold between unrelaxed and DFT-relaxed structures, without requiring explicit energy or force information. Upon training, DBCata effectively generates high-fidelity adsorption geometries, achieving an interatomic distance mean absolute error (DMAE) of 0.035 \textร
on the Catalysis-Hub dataset, which is nearly three times superior to that of the current state-of-the-art machine learning potential models. Moreover, the corresponding DFT accuracy can be improved within 0.1 eV in 94\% of instances by identifying and refining anomalous predictions through a hybrid chemical-heuristic and self-supervised outlier detection approach. We demonstrate that the remarkable performance of DBCata facilitates accelerated high-throughput computational screening for efficient alloy catalysts in the oxygen reduction reaction, highlighting the potential of DBCata as a powerful tool for catalyst design and optimisation.
๐
2025-12-17
๐ ArXiv Preprint (cs.DS)
๐ฅ Ameet Gadekar, Aristides Gionis, Thibault Marette
ๆฅ็ๆ่ฆ
Data analysis often involves an iterative process, where solutions must be continuously refined in response to new data. Typically, as new data becomes available, an existing solution must be updated to incorporate the latest information. In addition to seeking a high-quality solution for the task at hand, it is also crucial to ensure consistency by minimizing drastic changes from previous solutions. Applying this approach across many iterations, ensures that the solution evolves gradually and smoothly.
In this paper, we study the above problem in the context of clustering, specifically focusing on the $k$-center problem. More precisely, we study the following problem: Given a set of points $X$, parameters $k$ and $b$, and a prior clustering solution $H$ for $X$, our goal is to compute a new solution $C$ for $X$, consisting of $k$ centers, which minimizes the clustering cost while introducing at most $b$ changes from $H$. We refer to this problem as label-consistent $k$-center, and we propose two constant-factor approximation algorithms for it. We complement our theoretical findings with an experimental evaluation demonstrating the effectiveness of our methods on real-world datasets.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Liyu Zhang, Yejia Liu, Kwun Ho Liu, Runxi Huang, Xiaomin Ouyang
ๆฅ็ๆ่ฆ
In real-world IoT applications, sensor data is usually collected under diverse and dynamic contextual conditions where factors such as sensor placements or ambient environments can significantly affect data patterns and downstream performance. Traditional domain adaptation or generalization methods often ignore such context information or use simplistic integration strategies, making them ineffective in handling unseen context shifts after deployment. In this paper, we propose Chorus, a context-aware, data-free model customization approach that adapts models to unseen deployment conditions without requiring target-domain data. The key idea is to learn effective context representations that capture their influence on sensor data patterns and to adaptively integrate them based on the degree of context shift. Specifically, Chorus first performs unsupervised cross-modal reconstruction between unlabeled sensor data and language-based context embeddings, while regularizing the context embedding space to learn robust, generalizable context representations. Then, it trains a lightweight gated head on limited labeled samples to dynamically balance sensor and context contributions-favoring context when sensor evidence is ambiguous and vice versa. To further reduce inference latency, Chorus employs a context-caching mechanism that reuses cached context representations and updates only upon detected context shifts. Experiments on IMU, speech, and WiFi sensing tasks under diverse context shifts show that Chorus outperforms state-of-the-art baselines by up to 11.3% in unseen contexts, while maintaining comparable latency on smartphone and edge devices.
๐
2025-12-17
๐ ArXiv Preprint (cs.SD)
๐ฅ Sanghyeok Chung, Eujin Kim, Donggun Kim, Gaeun Heo, Jeongbin You et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in audio generation have increased the risk of realistic environmental sound manipulation, motivating the ESDD 2026 Challenge as the first large-scale benchmark for Environmental Sound Deepfake Detection (ESDD). We propose BEAT2AASIST which extends BEATs-AASIST by splitting BEATs-derived representations along frequency or channel dimension and processing them with dual AASIST branches. To enrich feature representations, we incorporate top-k transformer layer fusion using concatenation, CNN-gated, and SE-gated strategies. In addition, vocoder-based data augmentation is applied to improve robustness against unseen spoofing methods. Experimental results on the official test sets demonstrate that the proposed approach achieves competitive performance across the challenge tracks.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a., drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that a diffusion large language model (dLLM) drafters can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc, will be available at https://czc726.github.io/DEER/
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chengguang Ou, Yixin Zhuang
ๆฅ็ๆ่ฆ
Implicit Neural Representations (INRs) often converge slowly and struggle to recover high-frequency details due to spectral bias. While prior work links this behavior to the Neural Tangent Kernel (NTK), how specific architectural choices affect NTK conditioning remains unclear. We show that many INR mechanisms can be understood through their impact on a small set of pairwise similarity factors and scaling terms that jointly determine NTK eigenvalue variance. For standard coordinate MLPs, limited input-feature interactions induce large eigenvalue dispersion and poor conditioning. We derive closed-form variance decompositions for common INR components and show that positional encoding reshapes input similarity, spherical normalization reduces variance via layerwise scaling, and Hadamard modulation introduces additional similarity factors strictly below one, yielding multiplicative variance reduction. This unified view explains how diverse INR architectures mitigate spectral bias by improving NTK conditioning. Experiments across multiple tasks confirm the predicted variance reductions and demonstrate faster, more stable convergence with improved reconstruction quality.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pablo Montaรฑa-Fernรกndez, Ines Ortega-Fernandez
Machine Learning
ๆฅ็ๆ่ฆ
Federated Learning is a machine learning setting that reduces direct data exposure, improving the privacy guarantees of machine learning models. Yet, the exchange of model updates between the participants and the aggregator can still leak sensitive information. In this work, we present a new gradient-based membership inference attack for federated learning scenarios that exploits the temporal evolution of last-layer gradients across multiple federated rounds. Our method uses the shadow technique to learn round-wise gradient patterns of the training records, requiring no access to the private dataset, and is designed to consider both semi-honest and malicious adversaries (aggregators or data owners). Beyond membership inference, we also provide a natural extension of the proposed attack to discrete attribute inference by contrasting gradient responses under alternative attribute hypotheses. The proposed attacks are model-agnostic, and therefore applicable to any gradient-based model and can be applied to both classification and regression settings. We evaluate the attack on CIFAR-100 and Purchase100 datasets for membership inference and on Breast Cancer Wisconsin for attribute inference. Our findings reveal strong attack performance and comparable computational and memory overhead in membership inference when compared to another attack from the literature. The obtained results emphasize that multi-round federated learning can increase the vulnerability to inference attacks, that aggregators pose a more substantial threat than data owners, and that attack performance is strongly influenced by the nature of the training dataset, with richer, high-dimensional data leading to stronger leakage than simpler tabular data.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Roland Baatz
Machine Learning
ๆฅ็ๆ่ฆ
This study examines the generalization performance and interpretability of machine learning (ML) models used for predicting crop yield and yield anomalies in Germany's NUTS-3 regions. Using a high-quality, long-term dataset, the study systematically compares the evaluation and temporal validation behavior of ensemble tree-based models (XGBoost, Random Forest) and deep learning approaches (LSTM, TCN).
While all models perform well on spatially split, conventional test sets, their performance degrades substantially on temporally independent validation years, revealing persistent limitations in generalization. Notably, models with strong test-set accuracy, but weak temporal validation performance can still produce seemingly credible SHAP feature importance values. This exposes a critical vulnerability in post hoc explainability methods: interpretability may appear reliable even when the underlying model fails to generalize.
These findings underscore the need for validation-aware interpretation of ML predictions in agricultural and environmental systems. Feature importance should not be accepted at face value unless models are explicitly shown to generalize to unseen temporal and spatial conditions. The study advocates for domain-aware validation, hybrid modeling strategies, and more rigorous scrutiny of explainability methods in data-driven agriculture. Ultimately, this work addresses a growing challenge in environmental data science: how can we evaluate generalization robustly enough to trust model explanations?
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear whether common featurization methods - including sparse autoencoders (SAEs) and sparse probes - recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting where we control the correlations between textual concepts, such as sentiment, domain, and tense, and analyze performance under increasing correlations between them. We first evaluate the extent to which featurizers can learn disentangled representations of each concept under increasing correlational strengths. We observe a one-to-many relationship from concepts to features: features correspond to no more than one concept, but concepts are distributed across many features. Then, we perform steering experiments, measuring whether each concept is independently manipulable. Even when trained on uniform distributions of concepts, SAE features generally affect many concepts when steered, indicating that they are neither selective nor independent; nonetheless, features affect disjoint subspaces. These results suggest that correlational metrics for measuring disentanglement are generally not sufficient for establishing independence when steering, and that affecting disjoint subspaces is not sufficient for concept selectivity. These results underscore the importance of compositional evaluations in interpretability research.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mukur Gupta, Niharika Gupta, Saifur Rahman, Shantanu Pal, Chandan Karmakar
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning models deployed on edge devices are increasingly used in safety-critical applications. However, their vulnerability to adversarial perturbations poses significant risks, especially in Federated Learning (FL) settings where identical models are distributed across thousands of clients. While adversarial training is a strong defense, it is difficult to apply in FL due to strict client-data privacy constraints and the limited compute available on edge devices. In this work, we introduce TrajSyn, a privacy-preserving framework that enables effective server-side adversarial training by synthesizing a proxy dataset from the trajectories of client model updates, without accessing raw client data. We show that TrajSyn consistently improves adversarial robustness on image classification benchmarks with no extra compute burden on the client device.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuqing Xie, Jiayu Chen, Wenhao Tang, Ya Zhang, Chao Yu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to those obtained with manually tuned reward functions.
๐
2025-12-17
๐ This work has been submitted to the IEEE for possi
๐ฅ Runze Li, Hanchen Wang, Wenjie Zhang, Binghao Li, Yu Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at https://anonymous.4open.science/r/TimeSeriesImputation-52BF
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ali Ghodsi
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Sequence modeling has produced diverse architectures -- from classical recurrent neural networks to modern Transformers and state space models (SSMs) -- yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.
๐
2025-12-17
๐ ArXiv Preprint (q-fin.CP)
๐ฅ Mohit Beniwal
ๆฅ็ๆ่ฆ
Long-term price forecasting remains a formidable challenge due to the inherent uncertainty over the long term, despite some success in short-term predictions. Nonetheless, accurate long-term forecasts are essential for high-net-worth individuals, institutional investors, and traders. The proposed improved genetic algorithm-optimized support vector regression (IGA-SVR) model is specifically designed for long-term price prediction of global indices. The performance of the IGA-SVR model is rigorously evaluated and compared against the state-of-the-art baseline models, the Long Short-Term Memory (LSTM), and the forward-validating genetic algorithm optimized support vector regression (OGA-SVR). Extensive testing was conducted on the five global indices, namely Nifty, Dow Jones Industrial Average (DJI), DAX Performance Index (DAX), Nikkei 225 (N225), and Shanghai Stock Exchange Composite Index (SSE) from 2021 to 2024 of daily price prediction up to a year. Overall, the proposed IGA-SVR model achieved a reduction in MAPE by 19.87% compared to LSTM and 50.03% compared to OGA-SVR, demonstrating its superior performance in long-term daily price forecasting of global indices. Further, the execution time for LSTM was approximately 20 times higher than that of IGA-SVR, highlighting the high accuracy and computational efficiency of the proposed model. The genetic algorithm selects the optimal hyperparameters of SVR by minimizing the arithmetic mean of the Mean Absolute Percentage Error (MAPE) calculated over the full training dataset and the most recent five years of training data. This purposefully designed training methodology adjusts for recent trends while retaining long-term trend information, thereby offering enhanced generalization compared to the LSTM and rolling-forward validation approach employed by OGA-SVR, which forgets long-term trends and suffers from recency bias.
๐
2025-12-17
๐ AAAI 2026
๐ฅ Sunwoo Kim, Soo Yong Lee, Kyungho Kim, Hyunjin Hwang, Jaemin Yoo et al.
ๆฅ็ๆ่ฆ
Unsupervised node representation learning aims to obtain meaningful node embeddings without relying on node labels. To achieve this, graph convolution, which aggregates information from neighboring nodes, is commonly employed to encode node features and graph topology. However, excessive reliance on graph convolution can be suboptimal-especially in non-homophilic graphs-since it may yield unduly similar embeddings for nodes that differ in their features or topological properties. As a result, adjusting the degree of graph convolution usage has been actively explored in supervised learning settings, whereas such approaches remain underexplored in unsupervised scenarios. To tackle this, we propose FUEL, which adaptively learns the adequate degree of graph convolution usage by aiming to enhance intra-class similarity and inter-class separability in the embedding space. Since classes are unknown, FUEL leverages node features to identify node clusters and treats these clusters as proxies for classes. Through extensive experiments using 15 baseline methods and 14 benchmark datasets, we demonstrate the effectiveness of FUEL in downstream tasks, achieving state-of-the-art performance across graphs with diverse levels of homophily.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xianglin Wu, Chiheb Ben Hammouda, Cornelis W. Oosterlee
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Stochastic differential equations (SDEs) driven by fractional Brownian motion (fBm) are increasingly used to model systems with rough dynamics and long-range dependence, such as those arising in quantitative finance and reliability engineering. However, these processes are non-Markovian and lack a semimartingale structure, rendering many classical parameter estimation techniques inapplicable or computationally intractable beyond very specific cases. This work investigates two central questions: (i) whether integrating path signatures into deep learning architectures can improve the trade-off between estimation accuracy and model complexity, and (ii) what constitutes an effective architecture for leveraging signatures as feature maps. We introduce SigMA (Signature Multi-head Attention), a neural architecture that integrates path signatures with multi-head self-attention, supported by a convolutional preprocessing layer and a multilayer perceptron for effective feature encoding. SigMA learns model parameters from synthetically generated paths of fBm-driven SDEs, including fractional Brownian motion, fractional Ornstein-Uhlenbeck, and rough Heston models, with a particular focus on estimating the Hurst parameter and on joint multi-parameter inference, and it generalizes robustly to unseen trajectories. Extensive experiments on synthetic data and two real-world datasets (i.e., equity-index realized volatility and Li-ion battery degradation) show that SigMA consistently outperforms CNN, LSTM, vanilla Transformer, and Deep Signature baselines in accuracy, robustness, and model compactness. These results demonstrate that combining signature transforms with attention-based architectures provides an effective and scalable framework for parameter inference in stochastic systems with rough or persistent temporal structure.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongjin Mi, Huiqiang Lun, Changhong Mou, Yeyu Zhang
ๆฅ็ๆ่ฆ
Operator learning has become a powerful tool for accelerating the solution of parameterized partial differential equations (PDEs), enabling rapid prediction of full spatiotemporal fields for new initial conditions or forcing functions. Existing architectures such as DeepONet and the Fourier Neural Operator (FNO) show strong empirical performance but often require large training datasets, lack explicit physical structure, and may suffer from instability in their trunk-network features, where mode imbalance or collapse can hinder accurate operator approximation. Motivated by the stability and locality of classical partition-of-unity (PoU) methods, we investigate PoU-based regularization techniques for operator learning and develop a revised formulation of the existing POU--PI--DeepONet framework. The resulting \emph{P}hysics-\emph{i}nformed \emph{P}artition \emph{P}enalty Deep Operator Network (PIP$^{2}$ Net) introduces a simplified and more principled partition penalty that improved the coordinated trunk outputs that leads to more expressiveness without sacrificing the flexibility of DeepONet. We evaluate PIP$^{2}$ Net on three nonlinear PDEs: the viscous Burgers equation, the Allen--Cahn equation, and a diffusion--reaction system. The results show that it consistently outperforms DeepONet, PI-DeepONet, and POU-DeepONet in prediction accuracy and robustness.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yifei Li, Haixu Wu, Zeyi Xu, Tuur Stuyck, Wojciech Matusik
Machine Learning
ๆฅ็ๆ่ฆ
Learning-based methods have made significant progress in physics simulation, typically approximating dynamics with a monolithic end-to-end optimized neural network. Although these models offer an effective way to simulation, they may lose essential features compared to traditional numerical simulators, such as physical interpretability and reliability. Drawing inspiration from classical simulators that operate in a modular fashion, this paper presents Neural Modular Physics (NMP) for elastic simulation, which combines the approximation capacity of neural networks with the physical reliability of traditional simulators. Beyond the previous monolithic learning paradigm, NMP enables direct supervision of intermediate quantities and physical constraints by decomposing elastic dynamics into physically meaningful neural modules connected through intermediate physical quantities. With a specialized architecture and training strategy, our method transforms the numerical computation flow into a modular neural simulator, achieving improved physical consistency and generalizability. Experimentally, NMP demonstrates superior generalization to unseen initial conditions and resolutions, stable long-horizon simulation, better preservation of physical properties compared to other neural simulators, and greater feasibility in scenarios with unknown underlying dynamics than traditional simulators.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wanfu Gao, Zebin He, Jun Gao
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing feature engineering methods based on large language models (LLMs) have not yet been applied to multi-label learning tasks. They lack the ability to model complex label dependencies and are not specifically adapted to the characteristics of multi-label tasks. To address the above issues, we propose Feature Engineering Automation for Multi-Label Learning (FEAML), an automated feature engineering method for multi-label classification which leverages the code generation capabilities of LLMs. By utilizing metadata and label co-occurrence matrices, LLMs are guided to understand the relationships between data features and task objectives, based on which high-quality features are generated. The newly generated features are evaluated in terms of model accuracy to assess their effectiveness, while Pearson correlation coefficients are used to detect redundancy. FEAML further incorporates the evaluation results as feedback to drive LLMs to continuously optimize code generation in subsequent iterations. By integrating LLMs with a feedback mechanism, FEAML realizes an efficient, interpretable and self-improving feature engineering paradigm. Empirical results on various multi-label datasets demonstrate that our FEAML outperforms other feature engineering methods.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Debu Sinha
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods - including state-of-the-art OpenAI text-embedding-3-large and cross-encoder models - exhibit unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, proving the task is solvable through reasoning. We term this the "semantic illusion": semantically plausible hallucinations preserve similarity to source documents while introducing factual errors invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting embedding-based detection is insufficient for production RAG deployment.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio
ๆฅ็ๆ่ฆ
The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors (e.g., time of day, season, and holidays) while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates calibrated probabilistic prediction intervals directly from the learned conditional distribution, providing explicit uncertainty quantification essential for trustworthy decision-making. Numerical experiments conducted on frequency-selective EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. The EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS), 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fanzhe Fu
Natural Language Processing
ๆฅ็ๆ่ฆ
The transition of Large Language Models (LLMs) from stochastic chat interfaces to reliable software components necessitates a fundamental re-engineering of interaction paradigms. Current methodologies, predominantly heuristic-based "prompt engineering," fail to provide the deterministic guarantees required for mission-critical applications. We introduce the Meta-Prompting Protocol, a rigorous theoretical framework that formalizes the orchestration of LLMs as a programmable, self-optimizing system. Central to this protocol is the Adversarial Trinity, a tripartite topology comprising a Generator (P), an Auditor (A), and an Optimizer (O). By treating natural language instructions as differentiable variables within a semantic computation graph and utilizing textual critiques as gradients, this architecture mitigates hallucination and prevents model collapse. We demonstrate the theoretical viability of this approach using declarative programming paradigms (DSPy) and automatic textual differentiation (TextGrad), establishing a foundation for "Observable Software Engineering" in the era of probabilistic computing.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenxiao Gao, Haotian Sun, Na Li, Dale Schuurmans, Bo Dai
Machine LearningRobotics
ๆฅ็ๆ่ฆ
In real-world applications with large state and action spaces, reinforcement learning (RL) typically employs function approximations to represent core components like the policies, value functions, and dynamics models. Although powerful approximations such as neural networks offer great expressiveness, they often present theoretical ambiguities, suffer from optimization instability and exploration difficulty, and incur substantial computational costs in practice. In this paper, we introduce the perspective of spectral representations as a solution to address these difficulties in RL. Stemming from the spectral decomposition of the transition operator, this framework yields an effective abstraction of the system dynamics for subsequent policy optimization while also providing a clear theoretical characterization. We reveal how to construct spectral representations for transition operators that possess latent variable structures or energy-based structures, which implies different learning methods to extract spectral representations from data. Notably, each of these learning methods realizes an effective RL algorithm under this framework. We also provably extend this spectral view to partially observable MDPs. Finally, we validate these algorithms on over 20 challenging tasks from the DeepMind Control Suite, where they achieve performances comparable or superior to current state-of-the-art model-free and model-based baselines.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Damian Hodel, Jevin D. West
Natural Language Processing
ๆฅ็ๆ่ฆ
The growing use of artificial intelligence (AI) raises concerns of knowledge collapse, i.e., a reduction to the most dominant and central set of ideas. Prior work has demonstrated single-model collapse, defined as performance decay in an AI model trained on its own output. Inspired by ecology, we ask whether AI ecosystem diversity, that is, diversity among models, can mitigate such a collapse. We build on the single-model approach but focus on ecosystems of models trained on their collective output. To study the effect of diversity on model performance, we segment the training data across language models and evaluate the resulting ecosystems over ten, self-training iterations. We find that increased epistemic diversity mitigates collapse, but, interestingly, only up to an optimal level. Our results suggest that an ecosystem containing only a few diverse models fails to express the rich mixture of the full, true distribution, resulting in rapid performance decay. Yet distributing the data across too many models reduces each model's approximation capacity on the true distribution, leading to poor performance already in the first iteration step. In the context of AI monoculture, our results suggest the need to monitor diversity across AI systems and to develop policies that incentivize more domain- and community-specific models.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sandeep Neela
ๆฅ็ๆ่ฆ
Understanding how prices evolve over time often requires peeling back the layers of market noise to identify clear, structural behavior. Many of the tools commonly used for this purpose technical indicators, chart heuristics, or even sophisticated predictive models leave important questions unanswered. Technical indicators depend on platform-specific rules, and predictive systems typically offer little in terms of explanation. In settings that demand transparency or auditability, this poses a significant challenge. We introduce the Stock Pattern Assistant (SPA), a deterministic framework designed to extract monotonic price runs, attach relevant public events through a symmetric correlation window, and generate explanations that are factual, historical, and guardrailed. SPA relies only on daily OHLCV data and a normalized event stream, making the pipeline straight-forward to audit and easy to reproduce. To illustrate SPA's behavior in practice, we evaluate it across four equities-AAPL, NVDA, SCHW, and PGR-chosen to span a range of volatility regimes and sector characteristics. Although the evaluation period is modest, the results demonstrate how SPA consistently produces stable structural decompositions and contextual narratives. Ablation experiments further show how deterministic segmentation, event alignment, and constrained explanation each contribute to interpretability. SPA is not a forecasting system, nor is it intended to produce trading signals. Its value lies in offering a transparent, reproducible view of historical price structure that can complement analyst workflows, risk reviews, and broader explainable-AI pipelines.
๐
2025-12-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Sogol Masoumzadeh, Yufei Li, Shane McIntosh, Dรกniel Varrรณ, Lili Wei
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Monitoring issue tracker submissions is a crucial software maintenance activity. A key goal is the prioritization of high risk, security-related bugs. If such bugs can be recognized early, the risk of propagation to dependent products and endangerment of stakeholder benefits can be mitigated. To assist triage engineers with this task, several automatic detection techniques, from Machine Learning (ML) models to prompting Large Language Models (LLMs), have been proposed. Although promising to some extent, prior techniques often memorize lexical cues as decision shortcuts, yielding low detection rate specifically for more complex submissions. As such, these classifiers do not yet reach the practical expectations of a real-time detector of security-related issues. To address these limitations, we propose SEBERTIS, a framework to train Deep Neural Networks (DNNs) as classifiers independent of lexical cues, so that they can confidently detect fully unseen security-related issues. SEBERTIS capitalizes on fine-tuning bidirectional transformer architectures as Masked Language Models (MLMs) on a series of semantically equivalent vocabulary to prediction labels (which we call Semantic Surrogates) when they have been replaced with a mask. Our SEBERTIS-trained classifier achieves a 0.9880 F1-score in detecting security-related issues of a curated corpus of 10,000 GitHub issue reports, substantially outperforming state-of-the-art issue classifiers, with 14.44%-96.98%, 15.40%-93.07%, and 14.90%-94.72% higher detection precision, recall, and F1-score over ML-based baselines. Our classifier also substantially surpasses LLM baselines, with an improvement of 23.20%-63.71%, 36.68%-85.63%, and 39.49%-74.53% for precision, recall, and F1-score.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruiyi Zhang, Peijia Qin, Qi Cao, Pengtao Xie
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applying on test-time scaling, DreamPRM-Code achieved state-of-the-art performance on LiveCodeBench with 80.9 pass@1 rate, surpassing OpenAI o4-mini.
๐
2025-12-17
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Pranav Kakhandiki, Sathya Chitturi, Daniel Ratner, Sean Gasiorowski
Machine Learning
ๆฅ็ๆ่ฆ
The discovery of a minimum energy pathway (MEP) between metastable states is crucial for scientific tasks including catalyst and biomolecular design. However, the standard nudged elastic band (NEB) algorithm requires hundreds to tens of thousands of compute-intensive simulations, making applications to complex systems prohibitively expensive. We introduce Neural Network Bayesian Algorithm Execution (NN-BAX), a framework that jointly learns the energy landscape and the MEP. NN-BAX sequentially fine-tunes a foundation model by actively selecting samples targeted at improving the MEP. Tested on Lennard-Jones and Embedded Atom Method systems, our approach achieves a one to two order of magnitude reduction in energy and force evaluations with negligible loss in MEP accuracy and demonstrates scalability to >100-dimensional systems. This work is therefore a promising step towards removing the computational barrier for MEP discovery in scientifically relevant systems, suggesting that weeks-long calculations may be achieved in hours or days with minimal loss in accuracy.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hanqing Jin, Renyuan Xu, Yanzhao Yang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We study reinforcement learning for controlled diffusion processes with unbounded continuous state spaces, bounded continuous actions, and polynomially growing rewards: settings that arise naturally in finance, economics, and operations research. To overcome the challenges of continuous and high-dimensional domains, we introduce a model-based algorithm that adaptively partitions the joint state-action space. The algorithm maintains estimators of drift, volatility, and rewards within each partition, refining the discretization whenever estimation bias exceeds statistical confidence. This adaptive scheme balances exploration and approximation, enabling efficient learning in unbounded domains. Our analysis establishes regret bounds that depend on the problem horizon, state dimension, reward growth order, and a newly defined notion of zooming dimension tailored to unbounded diffusion processes. The bounds recover existing results for bounded settings as a special case, while extending theoretical guarantees to a broader class of diffusion-type problems. Finally, we validate the effectiveness of our approach through numerical experiments, including applications to high-dimensional problems such as multi-asset mean-variance portfolio selection.
๐
2025-12-17
๐ Accepted by the 48th IEEE/ACM International Confer
๐ฅ Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh
Machine Learning
ๆฅ็ๆ่ฆ
Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)-based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their resolution, but it is extremely challenging due to the inherent nondeterminism of DL models and their tight coupling with hardware and software environments. According to recent studies, only about 3% of DL bugs can be reliably reproduced using manual approaches. To address these challenges, we present RepGen, a novel, automated, and intelligent approach for reproducing deep learning bugs. RepGen constructs a learning-enhanced context from a project, develops a comprehensive plan for bug reproduction, employs an iterative generate-validate-refine mechanism, and thus generates such code using an LLM that reproduces the bug at hand. We evaluate RepGen on 106 real-world deep learning bugs and achieve a reproduction rate of 80.19%, a 19.81% improvement over the state-of-the-art measure. A developer study involving 27 participants shows that RepGen improves the success rate of DL bug reproduction by 23.35%, reduces the time to reproduce by 56.8%, and lowers participants' cognitive load.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yaniv Leviathan, Matan Kalman, Yossi Matias
Natural Language Processing
ๆฅ็ๆ่ฆ
When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Victor M. Yeom Song, Severi Rissanen, Arno Solin, Samuel Kaski, Mingfei Sun
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion models struggle to produce samples that respect constraints, a common requirement in scientific applications. Recent approaches have introduced regularization terms in the loss or guidance methods during sampling to enforce such constraints, but they bias the generative model away from the true data distribution. This is a problem, especially when the constraint is misspecified, a common issue when formulating constraints on scientific data. In this paper, instead of changing the loss or the sampling loop, we integrate a guidance-inspired adjustment into the denoiser itself, giving it a soft inductive bias towards constraint-compliant samples. We show that these softly constrained denoisers exploit constraint knowledge to improve compliance over standard denoisers, and maintain enough flexibility to deviate from it when there is misspecification with observed data.
๐
2025-12-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Abdullah Al Munem, Sumona Yeasmin, Mohammad Rezwanul Huq
Natural Language Processing
ๆฅ็ๆ่ฆ
Every day, a significant number of users visit the internet for different needs. The owners of a website generate profits from the user interaction with the contents or items of the website. A robust recommendation system can increase user interaction with a website by recommending items according to the user's unique preferences. BERT and CNN-integrated neural collaborative filtering (NCF) have been proposed for the recommendation system in this experiment. The proposed model takes inputs from the user and item profile and finds the user's interest. This model can handle numeric, categorical, and image data to extract the latent features from the inputs. The model is trained and validated on a small sample of the MovieLens dataset for 25 epochs. The same dataset has been used to train and validate a simple NCF and a BERT-based NCF model and compared with the proposed model. The proposed model outperformed those two baseline models. The obtained result for the proposed model is 0.72 recall and 0.486 Hit Ratio @ 10 for 799 users on the MovieLens dataset. This experiment concludes that considering both categorical and image data can improve the performance of a recommendation system.
๐
2025-12-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Gregor Donabauer, Samy Ateia, Udo Kruschwitz, Maximilian Burger, Matthias May et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present MedNuggetizer, https://mednugget-ai.de/; access is available upon request.}, a tool for query-driven extraction and clustering of information nuggets from medical documents to support clinicians in exploring underlying medical evidence. Backed by a large language model (LLM), \textit{MedNuggetizer} performs repeated extractions of information nuggets that are then grouped to generate reliable evidence within and across multiple documents. We demonstrate its utility on the clinical use case of \textit{antibiotic prophylaxis before prostate biopsy} by using major urological guidelines and recent PubMed studies as sources of information. Evaluation by domain experts shows that \textit{MedNuggetizer} provides clinicians and researchers with an efficient way to explore long documents and easily extract reliable, query-focused medical evidence.
๐
2025-12-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mikel Williams-Lekuona, Georgina Cosma
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
๐
2025-12-17
๐ ArXiv Preprint (cs.DB)
๐ฅ Gajendra Doniparthi, Shashank Balu Pandhare, Stefan Deรloch, Timo Mรผhlhaus
Natural Language Processing
ๆฅ็ๆ่ฆ
Traditional search applications within Research Data Management (RDM) ecosystems are crucial in helping users discover and explore the structured metadata from the research datasets. Typically, text search engines require users to submit keyword-based queries rather than using natural language. However, using Large Language Models (LLMs) trained on domain-specific content for specialized natural language processing (NLP) tasks is becoming increasingly common. We present ArcBERT, an LLM-based system designed for integrated metadata exploration. ArcBERT understands natural language queries and relies on semantic matching, unlike traditional search applications. Notably, ArcBERT also understands the structure and hierarchies within the metadata, enabling it to handle diverse user querying patterns effectively.
๐
2025-12-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aleksei Shestov, Anton Klenitskiy, Daria Denisova, Amurkhan Dzagkoev, Daniil Petrovich et al.
ๆฅ็ๆ่ฆ
Modern representation learning increasingly relies on unsupervised and self-supervised methods trained on large-scale unlabeled data. While these approaches achieve impressive generalization across tasks and domains, evaluating embedding quality without labels remains an open challenge. In this work, we propose Persistence, a topology-aware metric based on persistent homology that quantifies the geometric structure and topological richness of embedding spaces in a fully unsupervised manner. Unlike metrics that assume linear separability or rely on covariance structure, Persistence captures global and multi-scale organization. Empirical results across diverse domains show that Persistence consistently achieves top-tier correlations with downstream performance, outperforming existing unsupervised metrics and enabling reliable model and hyperparameter selection.
๐
2025-12-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu
ๆฅ็ๆ่ฆ
Dialogue Topic Segmentation (DTS) is crucial for understanding task-oriented public-channel communications, such as maritime VHF dialogues, which feature informal speech and implicit transitions. To address the limitations of traditional methods, we propose DASH-DTS, a novel LLM-based framework. Its core contributions are: (1) topic shift detection via dialogue handshake recognition; (2) contextual enhancement through similarity-guided example selection; and (3) the generation of selective positive and negative samples to improve model discrimination and robustness. Additionally, we release VHF-Dial, the first public dataset of real-world maritime VHF communications, to advance research in this domain. DASH-DTS provides interpretable reasoning and confidence scores for each segment. Experimental results demonstrate that our framework achieves several sota segmentation trusted accuracy on both VHF-Dial and standard benchmarks, establishing a strong foundation for stable monitoring and decision support in operational dialogues.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce \model, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Giacomo Picardi, Saverio Iacoponi, Matias Carandell, Jorge Aguirregomezcorta, Mrudul Chellapurath et al.
Robotics
ๆฅ็ๆ่ฆ
Underwater robotics is becoming increasingly important for marine science, environmental monitoring, and subsea industrial operations, yet the development of underwater manipulation and actuation systems remains restricted by high costs, proprietary designs, and limited access to modular, research-oriented hardware. While open-source initiatives have democratized vehicle construction and control software, a substantial gap persists for joint-actuated systems-particularly those requiring waterproof, feedback-enabled actuation suitable for manipulators, grippers, and bioinspired devices. As a result, many research groups face lengthy development cycles, limited reproducibility, and difficulty transitioning laboratory prototypes to field-ready platforms.
To address this gap, we introduce an open, cost-effective hardware and software toolkit for underwater manipulation research. The toolkit includes a depth-rated Underwater Robotic Joint (URJ) with early leakage detection, compact control and power management electronics, and a ROS2-based software stack for sensing and multi-mode actuation. All CAD models, fabrication files, PCB sources, firmware, and ROS2 packages are openly released, enabling local manufacturing, modification, and community-driven improvement.
The toolkit has undergone extensive laboratory testing and multiple field deployments, demonstrating reliable operation up to 40 m depth across diverse applications, including a 3-DoF underwater manipulator, a tendon-driven soft gripper, and an underactuated sediment sampler. These results validate the robustness, versatility, and reusability of the toolkit for real marine environments.
By providing a fully open, field-tested platform, this work aims to lower the barrier to entry for underwater manipulation research, improve reproducibility, and accelerate innovation in underwater field robotics.
๐
2025-12-17
๐ Accepted to IEEE RA-L
๐ฅ Evgenii Kruzhkov, Raphael Memmesheimer, Sven Behnke
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Robust robot localization is an important prerequisite for navigation planning. If the environment map was created from different sensors, robot measurements must be robustly associated with map features. In this work, we extend Monte Carlo Localization using vision-language features. These open-vocabulary features enable to robustly compute the likelihood of visual observations, given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. The abstract vision-language features enable to associate observations and map elements from different modalities. Global localization can be initialized by natural language descriptions of the objects present in the vicinity of locations. We evaluate our approach using Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes.
๐
2025-12-17
๐ ArXiv Preprint (quant-ph)
๐ฅ Pranav Vaidhyanathan, Aristotelis Papatheodorou, David R. M. Arvidsson-Shukur, Mark T. Mitchison, Natalia Ares et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Dynamic programming is a cornerstone of graph-based optimization. While effective, it scales unfavorably with problem size. In this work, we present QuantGraph, a two-stage quantum-enhanced framework that casts local and global graph-optimization problems as quantum searches over discrete trajectory spaces. The solver is designed to operate efficiently by first finding a sequence of locally optimal transitions in the graph (local stage), without considering full trajectories. The accumulated cost of these transitions acts as a threshold that prunes the search space (up to 60% reduction for certain examples). The subsequent global stage, based on this threshold, refines the solution. Both stages utilize variants of the Grover-adaptive-search algorithm. To achieve scalability and robustness, we draw on principles from control theory and embed QuantGraph's global stage within a receding-horizon model-predictive-control scheme. This classical layer stabilizes and guides the quantum search, improving precision and reducing computational burden. In practice, the resulting closed-loop system exhibits robust behavior and lower overall complexity. Notably, for a fixed query budget, QuantGraph attains a 2x increase in control-discretization precision while still benefiting from Grover-search's inherent quadratic speedup compared to classical methods.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Sinan Emre, Victor Barasuol, Matteo Villa, Claudio Semini
Robotics
ๆฅ็ๆ่ฆ
This paper presents a Load-Based Variable Transmission (LBVT) mechanism designed to enhance robotic actuation by dynamically adjusting the transmission ratio in response to external torque demands. Unlike existing variable transmission systems that require additional actuators for active control, the proposed LBVT mechanism leverages a pre-tensioned spring and a four-bar linkage to passively modify the transmission ratio, thereby reducing the complexity of robot joint actuation systems. The effectiveness of the LBVT mechanism is evaluated through simulation-based analyses. The results confirm that the system achieves up to a 40 percent increase in transmission ratio upon reaching a predefined torque threshold, effectively amplifying joint torque when required without additional actuation. Furthermore, the simulations demonstrate a torque amplification effect triggered when the applied force exceeds 18 N, highlighting the system ability to autonomously respond to varying load conditions. This research contributes to the development of lightweight, efficient, and adaptive transmission systems for robotic applications, particularly in legged robots where dynamic torque adaptation is critical.
๐
2025-12-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hoang Nguyen, Xiaohao Xu, Xiaonan Huang
ๆฅ็ๆ่ฆ
Monocular depth foundation models achieve remarkable generalization by learning large-scale semantic priors, but this creates a critical vulnerability: they hallucinate illusory 3D structures from geometrically planar but perceptually ambiguous inputs. We term this failure the 3D Mirage. This paper introduces the first end-to-end framework to probe, quantify, and tame this unquantified safety risk. To probe, we present 3D-Mirage, the first benchmark of real-world illusions (e.g., street art) with precise planar-region annotations and context-restricted crops. To quantify, we propose a Laplacian-based evaluation framework with two metrics: the Deviation Composite Score (DCS) for spurious non-planarity and the Confusion Composite Score (CCS) for contextual instability. To tame this failure, we introduce Grounded Self-Distillation, a parameter-efficient strategy that surgically enforces planarity on illusion ROIs while using a frozen teacher to preserve background knowledge, thus avoiding catastrophic forgetting. Our work provides the essential tools to diagnose and mitigate this phenomenon, urging a necessary shift in MDE evaluation from pixel-wise accuracy to structural and contextual robustness. Our code and benchmark will be publicly available to foster this exciting research direction.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Zhenhan Yin, Xuanhan Wang, Jiahao Jiang, Kaiyuan Deng, Pengqi Chen et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
While leveraging abundant human videos and simulated robot data poses a scalable solution to the scarcity of real-world robot data, the generalization capability of existing vision-language-action models (VLAs) remains limited by mismatches in camera views, visual appearance, and embodiment morphologies. To overcome this limitation, we propose MiVLA, a generalizable VLA empowered by human-robot mutual imitation pre-training, which leverages inherent behavioral similarity between human hands and robotic arms to build a foundation of strong behavioral priors for both human actions and robotic control. Specifically, our method utilizes kinematic rules with left/right hand coordinate systems for bidirectional alignment between human and robot action spaces. Given human or simulated robot demonstrations, MiVLA is trained to forecast behavior trajectories for one embodiment, and imitate behaviors for another one unseen in the demonstration. Based on this mutual imitation, it integrates the behavioral fidelity of real-world human data with the manipulative diversity of simulated robot data into a unified model, thereby enhancing the generalization capability for downstream tasks. Extensive experiments conducted on both simulation and real-world platforms with three robots (ARX, PiPer and LocoMan), demonstrate that MiVLA achieves strong improved generalization capability, outperforming state-of-the-art VLAs (e.g., $\boldsymbolฯ_{0}$, $\boldsymbolฯ_{0.5}$ and H-RDT) by 25% in simulation, and 14% in real-world robot control tasks.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Michael Amir, Manon Flageat, Amanda Prorok
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The success of machine learning for real-world robotic systems has created a new form of intellectual property: the trained policy. This raises a critical need for novel methods that verify ownership and detect unauthorized, possibly unsafe misuse. While watermarking is established in other domains, physical policies present a unique challenge: remote detection. Existing methods assume access to the robot's internal state, but auditors are often limited to external observations (e.g., video footage). This ``Physical Observation Gap'' means the watermark must be detected from signals that are noisy, asynchronous, and filtered by unknown system dynamics. We formalize this challenge using the concept of a \textit{glimpse sequence}, and introduce Colored Noise Coherency (CoNoCo), the first watermarking strategy designed for remote detection. CoNoCo embeds a spectral signal into the robot's motions by leveraging the policy's inherent stochasticity. To show it does not degrade performance, we prove CoNoCo preserves the marginal action distribution. Our experiments demonstrate strong, robust detection across various remote modalities, including motion capture and side-way/top-down video footage, in both simulated and real-world robot experiments. This work provides a necessary step toward protecting intellectual property in robotics, offering the first method for validating the provenance of physical policies non-invasively, using purely remote observations.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Kai Zhang, Shoubin Chen, Dong Li, Baiyang Zhang, Tao Huang et al.
Robotics
ๆฅ็ๆ่ฆ
Autonomous exploration is a fundamental capability that tightly integrates perception, planning, control, and motion execution. It plays a critical role in a wide range of applications, including indoor target search, mapping of extreme environments, resource exploration, etc. Despite significant progress in individual components, a holistic and practical description of a completely autonomous exploration system, encompassing both hardware and software, remains scarce. In this paper, we present GuangMing-Explorer, a fully integrated autonomous exploration platform designed for robust operation across diverse environments. We provide a comprehensive overview of the system architecture, including hardware design, software stack, algorithm deployment, and experimental configuration. Extensive real-world experiments demonstrate the platform's effectiveness and efficiency in executing autonomous exploration tasks, highlighting its potential for practical deployment in complex and unstructured environments.
๐
2025-12-17
๐ Under review at IEEE Transactions on Human-Machine
๐ฅ Martijn IJtsma, Salvatore Hargis
Robotics
ๆฅ็ๆ่ฆ
Studies of human-robot interaction in dynamic and unstructured environments show that as more advanced robotic capabilities are deployed, the need for cooperative competencies to support collaboration with human problem-holders increases. Designing human-robot systems to meet these demands requires an explicit understanding of the work functions and constraints that shape the feasibility of alternative joint work strategies. Yet existing human-robot interaction frameworks either emphasize computational support for real-time execution or rely on static representations for design, offering limited support for reasoning about coordination dynamics during early-stage conceptual design. To address this gap, this article presents a novel computational framework for analyzing joint work strategies in human-robot systems by integrating techniques from functional modeling with graph-theoretic representations. The framework characterizes collective work in terms of the relationships among system functions and the physical and informational structure of the work environment, while explicitly capturing how coordination demands evolve over time. Its use during conceptual design is demonstrated through a case study in disaster robotics, which shows how the framework can be used to support early trade-space exploration of human-robot coordination strategies and to identify cooperative competencies that support flexible management of coordination overhead. These results show how the framework makes coordination demands and their temporal evolution explicit, supporting design-time reasoning about cooperative competency requirements and work demands prior to implementation.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Yuze Wu, Mo Zhu, Xingxing Li, Yuheng Du, Yuxin Fan et al.
Machine LearningRoboticsMultimodal
ๆฅ็ๆ่ฆ
This paper proposes VLA-AN, an efficient and onboard Vision-Language-Action (VLA) framework dedicated to autonomous drone navigation in complex environments. VLA-AN addresses four major limitations of existing large aerial navigation models: the data domain gap, insufficient temporal navigation with reasoning, safety issues with generative action policies, and onboard deployment constraints. First, we construct a high-fidelity dataset utilizing 3D Gaussian Splatting (3D-GS) to effectively bridge the domain gap. Second, we introduce a progressive three-stage training framework that sequentially reinforces scene comprehension, core flight skills, and complex navigation capabilities. Third, we design a lightweight, real-time action module coupled with geometric safety correction. This module ensures fast, collision-free, and stable command generation, mitigating the safety risks inherent in stochastic generative policies. Finally, through deep optimization of the onboard deployment pipeline, VLA-AN achieves a robust real-time 8.3x improvement in inference throughput on resource-constrained UAVs. Extensive experiments demonstrate that VLA-AN significantly improves spatial grounding, scene reasoning, and long-horizon navigation, achieving a maximum single-task success rate of 98.1%, and providing an efficient, practical solution for realizing full-chain closed-loop autonomy in lightweight aerial robots.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Erik Brorsson, Kristian Ceder, Ze Zhang, Sabino Francesco Roselli, Endre Erลs et al.
Robotics
ๆฅ็ๆ่ฆ
The adoption of Autonomous Mobile Robots (AMRs) for internal logistics is accelerating, with most solutions emphasizing decentralized, onboard intelligence. While AMRs in indoor environments like factories can be supported by infrastructure, involving external sensors and computational resources, such systems remain underexplored in the literature. This paper presents a comprehensive overview of infrastructure-based AMR systems, outlining key opportunities and challenges. To support this, we introduce a reference architecture combining infrastructure-based sensing, on-premise cloud computing, and onboard autonomy. Based on the architecture, we review core technologies for localization, perception, and planning. We demonstrate the approach in a real-world deployment in a heavy-vehicle manufacturing environment and summarize findings from a user experience (UX) evaluation. Our aim is to provide a holistic foundation for future development of scalable, robust, and human-compatible AMR systems in complex industrial environments.
๐
2025-12-17
๐ Submitted at IEEE IV 2026
๐ฅ Jรถrg Gamerdinger, Sven Teufel, Stephan Amann, Lukas Marc Listl, Oliver Bringmann
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Extensive evaluation of perception systems is crucial for ensuring the safety of intelligent vehicles in complex driving scenarios. Conventional performance metrics such as precision, recall and the F1-score assess the overall detection accuracy, but they do not consider the safety-relevant aspects of perception. Consequently, perception systems that achieve high scores in these metrics may still cause misdetections that could lead to severe accidents. Therefore, it is important to evaluate not only the overall performance of perception systems, but also their safety. We therefore introduce a novel safety metric for jointly evaluating the most critical perception tasks, object and lane detection. Our proposed framework integrates a new, lightweight object safety metric that quantifies the potential risk associated with object detection errors, as well as an lane safety metric including the interdependence between both tasks that can occur in safety evaluation. The resulting combined safety score provides a unified, interpretable measure of perception safety performance. Using the DeepAccident dataset, we demonstrate that our approach identifies safety critical perception errors that conventional performance metrics fail to capture. Our findings emphasize the importance of safety-centric evaluation methods for perception systems in autonomous driving.
๐
2025-12-17
๐ Accepted at IEEE ICVES 2025
๐ฅ Jรถrg Gamerdinger, Sven Teufel, Stephan Amann, Oliver Bringmann
Computer Vision
ๆฅ็ๆ่ฆ
Ensuring safety is the primary objective of automated driving, which necessitates a comprehensive and accurate perception of the environment. While numerous performance evaluation metrics exist for assessing perception capabilities, incorporating safety-specific metrics is essential to reliably evaluate object detection systems. A key component for safety evaluation is the ability to distinguish between relevant and non-relevant objects - a challenge addressed by criticality or relevance metrics. This paper presents the first in-depth analysis of criticality metrics for safety evaluation of object detection systems. Through a comprehensive review of existing literature, we identify and assess a range of applicable metrics. Their effectiveness is empirically validated using the DeepAccident dataset, which features a variety of safety-critical scenarios. To enhance evaluation accuracy, we propose two novel application strategies: bidirectional criticality rating and multi-metric aggregation. Our approach demonstrates up to a 100% improvement in terms of criticality classification accuracy, highlighting its potential to significantly advance the safety evaluation of object detection systems in automated vehicles.
๐
2025-12-17
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Pilyoung Kim, Yun Xie, Sujin Yang
ๆฅ็ๆ่ฆ
General-purpose conversational AI chatbots and AI companions increasingly provide young adolescents with emotionally supportive conversations, raising questions about how conversational style shapes anthropomorphism and emotional reliance. In a preregistered online experiment with 284 adolescent-parent dyads, youth aged 11-15 and their parents read two matched transcripts in which a chatbot responded to an everyday social problem using either a relational style (first-person, affiliative, commitment language) or a transparent style (explicit nonhumanness, informational tone). Adolescents more often preferred the relational than the transparent style, whereas parents were more likely to prefer transparent style than adolescents. Adolescents rated the relational chatbot as more human-like, likable, trustworthy and emotionally close, while perceiving both styles as similarly helpful. Adolescents who preferred relational style had lower family and peer relationship quality and higher stress and anxiety than those preferring transparent style or both chatbots. These findings identify conversational style as a key design lever for youth AI safety, showing that relational framing heightens anthropomorphism, trust and emotional closeness and can be especially appealing to socially and emotionally vulnerable adolescents, who may be at increased risk for emotional reliance on conversational AI.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Dongmyeong Lee, Jesse Quattrociocchi, Christian Ellis, Rwik Rana, Amanda Adkins et al.
ๆฅ็ๆ่ฆ
We propose BEV-Patch-PF, a GPS-free sequential geo-localization system that integrates a particle filter with learned bird's-eye-view (BEV) and aerial feature maps. From onboard RGB and depth images, we construct a BEV feature map. For each 3-DoF particle pose hypothesis, we crop the corresponding patch from an aerial feature map computed from a local aerial image queried around the approximate location. BEV-Patch-PF computes a per-particle log-likelihood by matching the BEV feature to the aerial patch feature. On two real-world off-road datasets, our method achieves 7.5x lower absolute trajectory error (ATE) on seen routes and 7.0x lower ATE on unseen routes than a retrieval-based baseline, while maintaining accuracy under dense canopy and shadow. The system runs in real time at 10 Hz on an NVIDIA Tesla T4, enabling practical robot deployment.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Gaurav Bansal
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Accurate localization is essential for autonomous vehicles, yet sensor noise and drift over time can lead to significant pose estimation errors, particularly in long-horizon environments. A common strategy for correcting accumulated error is visual loop closure in SLAM, which adjusts the pose graph when the agent revisits previously mapped locations. These techniques typically rely on identifying visual mappings between the current view and previously observed scenes and often require fusing data from multiple sensors.
In contrast, this work introduces NeRF-Assisted 3D-3D Pose Alignment (NAP3D), a complementary approach that leverages 3D-3D correspondences between the agent's current depth image and a pre-trained Neural Radiance Field (NeRF). By directly aligning 3D points from the observed scene with synthesized points from the NeRF, NAP3D refines the estimated pose even from novel viewpoints, without relying on revisiting previously observed locations.
This robust 3D-3D formulation provides advantages over conventional 2D-3D localization methods while remaining comparable in accuracy and applicability. Experiments demonstrate that NAP3D achieves camera pose correction within 5 cm on a custom dataset, robustly outperforming a 2D-3D Perspective-N-Point baseline. On TUM RGB-D, NAP3D consistently improves 3D alignment RMSE by approximately 6 cm compared to this baseline given varying noise, despite PnP achieving lower raw rotation and translation parameter error in some regimes, highlighting NAP3D's improved geometric consistency in 3D space. By providing a lightweight, dataset-agnostic tool, NAP3D complements existing SLAM and localization pipelines when traditional loop closure is unavailable.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Yunheng Wang, Yixiao Feng, Yuetong Fang, Shuning Zhang, Tan Jing et al.
Robotics
ๆฅ็ๆ่ฆ
3D Scene Graphs (3DSGs) constitute a powerful representation of the physical world, distinguished by their abilities to explicitly model the complex spatial, semantic, and functional relationships between entities, rendering a foundational understanding that enables agents to interact intelligently with their environment and execute versatile behaviors. Embodied navigation, as a crucial component of such capabilities, leverages the compact and expressive nature of 3DSGs to enable long-horizon reasoning and planning in complex, large-scale environments. However, prior works rely on a static-world assumption, defining traversable space solely based on static spatial layouts and thereby treating interactable obstacles as non-traversable. This fundamental limitation severely undermines their effectiveness in real-world scenarios, leading to limited reachability, low efficiency, and inferior extensibility. To address these issues, we propose HERO, a novel framework for constructing Hierarchical Traversable 3DSGs, that redefines traversability by modeling operable obstacles as pathways, capturing their physical interactivity, functional semantics, and the scene's relational hierarchy. The results show that, relative to its baseline, HERO reduces PL by 35.1% in partially obstructed environments and increases SR by 79.4% in fully obstructed ones, demonstrating substantially higher efficiency and reachability.
๐
2025-12-17
๐ ArXiv Preprint (Robotics)
๐ฅ Wenlong Xia, Jinhao Zhang, Ce Zhang, Yaojia Wang, Youmin Gong et al.
Robotics
ๆฅ็ๆ่ฆ
Vision-based imitation learning has enabled impressive robotic manipulation skills, but its reliance on object appearance while ignoring the underlying 3D scene structure leads to low training efficiency and poor generalization. To address these challenges, we introduce \emph{Implicit Scene Supervision (ISS) Policy}, a 3D visuomotor DiT-based diffusion policy that predicts sequences of continuous actions from point cloud observations. We extend DiT with a novel implicit scene supervision module that encourages the model to produce outputs consistent with the scene's geometric evolution, thereby improving the performance and robustness of the policy. Notably, ISS Policy achieves state-of-the-art performance on both single-arm manipulation tasks (MetaWorld) and dexterous hand manipulation (Adroit). In real-world experiments, it also demonstrates strong generalization and robustness. Additional ablation studies show that our method scales effectively with both data and parameters. Code and videos will be released.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Caner Erden
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The quadratic computational complexity of MultiHead SelfAttention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for longcontext tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multiscale semantic granularity effectively. In this paper, we propose Multiscale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scalespecific attention matrices as a resource allocation problem, solved via a convex optimization framework or a Nash equilibriumbased gametheoretic approach. This ensures a theoretically optimal balance between local nuance and global context fidelity. Implemented within a hybrid dilatedconvolutional transformer backbone, MAHA utilizes differentiable optimization layers to enable endtoend training. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention. This work bridges the gap between optimization theory and sequence modeling, offering a scalable solution for nextgeneration LLMs.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Humaira Tasnim, Ashik E Rasul, Bruce Jo, Hyung-Jin Yoon
Robotics
ๆฅ็ๆ่ฆ
Reliable helipad detection is essential for Autonomous Aerial Vehicle (AAV) landing, especially under GPS-denied or visually degraded conditions. While modern detectors such as YOLOv8 offer strong baseline performance, single-model pipelines struggle to remain robust across the extreme scale transitions that occur during descent, where helipads appear small at high altitude and large near touchdown. To address this limitation, we propose a scale-adaptive dual-expert perception framework that decomposes the detection task into far-range and close-range regimes. Two YOLOv8 experts are trained on scale-specialized versions of the HelipadCat dataset, enabling one model to excel at detecting small, low-resolution helipads and the other to provide high-precision localization when the target dominates the field of view. During inference, both experts operate in parallel, and a geometric gating mechanism selects the expert whose prediction is most consistent with the AAV's viewpoint. This adaptive routing prevents the degradation commonly observed in single-detector systems when operating across wide altitude ranges. The dual-expert perception module is evaluated in a closed-loop landing environment that integrates CARLA's photorealistic rendering with NASA's GUAM flight-dynamics engine. Results show substantial improvements in alignment stability, landing accuracy, and overall robustness compared to single-detector baselines. By introducing a scale-aware expert routing strategy tailored to the landing problem, this work advances resilient vision-based perception for autonomous descent and provides a foundation for future multi-expert AAV frameworks.
๐
2025-12-16
๐ ArXiv Preprint (cs.OS)
๐ฅ Shaoting Feng, Yuhan Liu, Hanchen Li, Xiaokun Chen, Samuel Shen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. With more LLM users, the KV cache footprint can easily exceed GPU memory capacity, so prior work has proposed to either evict KV cache to lower-tier storage devices, or compress KV cache so that more KV cache can be fit in the fast memory. However, prior work misses an important opportunity: jointly optimizing the eviction and compression decisions across all KV caches to minimize average generation latency without hurting quality.
We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers. Specifically, for each KV cache of a context, EVICPRESS considers the effect of compression and eviction of the KV cache on the average generation quality and delay across all contexts as a whole. To achieve this, EVICPRESS proposes a unified utility function that quantifies the effect of quality and delay of the lossy compression or eviction. To this end, EVICPRESS's profiling module periodically updates the utility function scores on all possible eviction-compression configurations for all contexts and places KV caches using a fast heuristic to rearrange KV caches on all storage tiers, with the goal of maximizing the utility function scores on each storage tier. Compared to the baselines that evict KV cache or compress KV cache, EVICPRESS achieves higher KV-cache hit rates on fast devices, i.e., lower delay, while preserving high generation quality by applying conservative compression to contexts that are sensitive to compression errors. Evaluation on 12 datasets and 5 models demonstrates that EVICPRESS achieves up to 2.19x faster time-to-first-token (TTFT) at equivalent generation quality.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenzhi Wang, Jian Wang, Ke Ma, Dahua Lin, Bing Zhou
RoboticsMultimodal
ๆฅ็ๆ่ฆ
We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10$\times$ lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: https://zhenzhiwang.github.io/talkverse/
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Abhijeet Parida, Daniel Capellรกn-Martรญn, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer et al.
Machine Learning
ๆฅ็ๆ่ฆ
Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, the median survival rate is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities in slices. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pretrained models developed for various types of tumors. We demonstrated the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9 % for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.
๐
2025-12-16
๐ ArXiv Preprint (stat.AP)
๐ฅ Jaume Anguera Peris, Songtao Cheng, Hanzhao Zhang, Wei Ouyang, Joakim Jaldรฉn
Robotics
ๆฅ็ๆ่ฆ
High-content screening microscopy generates large amounts of live-cell imaging data, yet its potential remains constrained by the inability to determine when and where to image most effectively. Optimally balancing acquisition time, computational capacity, and photobleaching budgets across thousands of dynamically evolving regions of interest remains an open challenge, further complicated by limited field-of-view adjustments and sensor sensitivity. Existing approaches either rely on static sampling or heuristics that neglect the dynamic evolution of biological processes, leading to inefficiencies and missed events. Here, we introduce the restless multi-process multi-armed bandit (RMPMAB), a new decision-theoretic framework in which each experimental region is modeled not as a single process but as an ensemble of Markov chains, thereby capturing the inherent heterogeneity of biological systems such as asynchronous cell cycles and heterogeneous drug responses. Building upon this foundation, we derive closed-form expressions for transient and asymptotic behaviors of aggregated processes, and design scalable Whittle index policies with sub-linear complexity in the number of imaging regions. Through both simulations and a real biological live-cell imaging dataset, we show that our approach achieves substantial improvements in throughput under resource constraints. Notably, our algorithm outperforms Thomson Sampling, Bayesian UCB, epsilon-Greedy, and Round Robin by reducing cumulative regret by more than 37% in simulations and capturing 93% more biologically relevant events in live imaging experiments, underscoring its potential for transformative smart microscopy. Beyond improving experimental efficiency, the RMPMAB framework unifies stochastic decision theory with optimal autonomous microscopy control, offering a principled approach to accelerate discovery across multidisciplinary sciences.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ George-Andrei Dima, Dumitru-Clementin Cercel
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Nadine Angela Cantonjos, Arpita Biswas
Robotics
ๆฅ็ๆ่ฆ
Agricultural regions in rural areas face damage from climate-related risks, including droughts, heavy rainfall, and shifting weather patterns. Prior research calls for adaptive risk-management solutions and decision-making strategies. To this end, artificial intelligence (AI), particularly agentic AI, offers a promising path forward. Agentic AI systems consist of autonomous, specialized agents capable of solving complex, dynamic tasks. While past systems have relied on single-agent models or have used multi-agent frameworks only for static functions, there is a growing need for architectures that support dynamic collaborative reasoning and context-aware outputs. To bridge this gap, we present AgroAskAI, a multi-agent reasoning system for climate adaptation decision support in agriculture, with a focus on vulnerable rural communities. AgroAskAI features a modular, role-specialized architecture that uses a chain-of-responsibility approach to coordinate autonomous agents, integrating real-time tools and datasets. The system has built-in governance mechanisms that mitigate hallucination and enable internal feedback for coherent, locally relevant strategies. The system also supports multilingual interactions, making it accessible to non-English-speaking farmers. Experiments on common agricultural queries related to climate adaptation show that, with additional tools and prompt refinement, AgroAskAI delivers more actionable, grounded, and inclusive outputs. Our experimental results highlight the potential of agentic AI for sustainable and accountable decision support in climate adaptation for agriculture.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Objectives: To evaluate large language model (LLM) performance on pharmacy licensure-style question-answering (QA) tasks and develop an external knowledge integration method to improve their accuracy.
Methods: We benchmarked eleven existing LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset. We measured baseline accuracy for each model without modification. We then developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context. This pipeline operates externally to the models, requiring no changes to model architecture or parameters.
Results: Baseline accuracy ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores. Models with fewer than 8 billion parameters scored below 50%. DrugRAG improved accuracy across all tested models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%) on the 141-item benchmark.
Conclusion: We demonstrate that external structured drug knowledge integration through DrugRAG measurably improves LLM accuracy on pharmacy tasks without modifying the underlying models. This approach provides a practical pipeline for enhancing pharmacy-focused AI applications with evidence-based information.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, Jeff Da
ๆฅ็ๆ่ฆ
A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy's behavior diverges from the expert's, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift for multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi-turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on-policy and imitation learning approaches on SWE agent problems and train models using a common rejection sampling (i.e., using environment reward) combined with supervised fine-tuning technique. Experiments find that OEC trajectories show a relative 14% and 13% improvement over traditional imitation learning in the 7b and 32b setting, respectively, on SWE-bench verified. Our results demonstrate the need for combining expert demonstrations with on-policy data for effective multi-turn LM agent training.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammad Abu-Shaira, Alejandro Rodriguez, Greg Speegle, Victor Sheng, Ishfaq Ahmad
ๆฅ็ๆ่ฆ
Online learning updates models incrementally with new data, avoiding large storage requirements and costly model recalculations. In this paper, we introduce "OLR-WA; OnLine Regression with Weighted Average", a novel and versatile multivariate online linear regression model. We also investigate scenarios involving drift, where the underlying patterns in the data evolve over time, conduct convergence analysis, and compare our approach with existing online regression models. The results of OLR-WA demonstrate its ability to achieve performance comparable to the batch regression, while also showcasing comparable or superior performance when compared with other state-of-the-art online models, thus establishing its effectiveness. Moreover, OLR-WA exhibits exceptional performance in terms of rapid convergence, surpassing other online models with consistently achieving high r2 values as a performance measure from the first iteration to the last iteration, even when initialized with minimal amount of data points, as little as 1% to 10% of the total data points. In addition to its ability to handle time-based (temporal drift) scenarios, remarkably, OLR-WA stands out as the only model capable of effectively managing confidence-based challenging scenarios. It achieves this by adopting a conservative approach in its updates, giving priority to older data points with higher confidence levels. In summary, OLR-WA's performance further solidifies its versatility and utility across different contexts, making it a valuable solution for online linear regression tasks.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Massimiliano Fadda, Enrico Motta, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino
Natural Language Processing
ๆฅ็ๆ่ฆ
News sources play a central role in democratic societies by shaping political and social discourse through specific topics, viewpoints and voices. Understanding these dynamics is essential for assessing whether the media landscape offers a balanced and fair account of public debate. In earlier work, we introduced a pipeline that, given a news corpus, i) uses a hybrid human-machine approach to identify the range of viewpoints expressed about a given topic, and ii) classifies relevant claims with respect to the identified viewpoints, defined as sets of semantically and ideologically congruent claims (e.g., positions arguing that immigration positively impacts the UK economy). In this paper, we improve this pipeline by i) fine-tuning Large Language Models (LLMs) for viewpoint classification and ii) enriching claim representations with semantic descriptions of relevant actors drawn from Wikidata. We evaluate our approach against alternative solutions on a benchmark centred on the UK immigration debate. Results show that while both mechanisms independently improve classification performance, their integration yields the best results, particularly when using LLMs capable of processing long inputs.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jingwei Chen
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Self-referential learning -- training a model on data it generated itself -- promises boundless scalability but chronically suffers from model collapse: language models degenerate into repetitive text, GANs drop modes, and reinforcement-learning policies over-exploit. Although practitioners employ ad~hoc fixes such as real-data mixing, entropy bonuses, knowledge distillation, or retrieval-augmented generation, a single principle that explains both the failure mode and the success of these fixes has remained elusive. We present Entropy-Reservoir Bregman Projection (ERBP), an information-geometric framework that unifies these phenomena. We model the closed loop as a stochastic Bregman projection sequence in distribution space. Without external coupling, finite-sample noise forces the system to project onto an ever-shrinking empirical support, causing exponential entropy decay and eventual collapse. Introducing an Entropy Reservoir -- a high-entropy distribution mixed into each projection -- injects a controllable entropy flux that provably stabilises the dynamics. Our theory yields (i) a necessary condition for collapse, (ii) a sufficient condition that guarantees a non-trivial entropy floor, and (iii) closed-form rates that depend only on sample size and the strong-convexity/Lipschitz constants of the Bregman generator. Experiments on large-language-model self-training, Soft Actor-Critic in reinforcement learning, and GAN optimisation validate our predictions and show that disparate stabilisation heuristics correspond to specific reservoir choices and coupling coefficients. ERBP thus transforms a collection of folk remedies into a single, quantitative design rule: monitor and budget your entropy flux.
๐
2025-12-16
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Viet K. Nguyen, Mohammad I. Husain
Natural Language Processing
ๆฅ็ๆ่ฆ
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o successfully executes attacks as an agent that it refuses in chat mode, there is no comparative analysis in multiple models and frameworks. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two agentic AI frameworks (AutoGen and CrewAI) using a seven-agent architecture that mimics the functionality of a university information management system and 13 distinct attack scenarios that span prompt injection, Server Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI's 30.8%, while model performance ranges from Nova Pro's 46.2% to Claude and Grok 2's 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. We identify six distinct defensive behavior patterns including a novel "hallucinated compliance" strategy where models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are also included in the Appendix to enable reproducibility.
๐
2025-12-16
๐ ArXiv Preprint (cs.CE)
๐ฅ Mahmuda Sharmin, Taihao Han, Jie Huang, Narayanan Neithalath, Gaurav Sant et al.
Machine Learning
ๆฅ็ๆ่ฆ
Machine learning (ML) has been increasingly applied in concrete research to optimize performance and mixture design. However, one major challenge in applying ML to cementitious materials is the limited size and diversity of available databases. A promising solution is the development of multi-modal databases that integrate both numerical and graphical data. Conventional ML frameworks in cement research are typically restricted to a single data modality. Graph neural network (GNN) represents a new generation of neural architectures capable of learning from data structured as graphs, capturing relationships through irregular or topology-dependent connections rather than fixed spatial coordinates. While GNN is inherently designed for graphical data, they can be adapted to extract correlations from numerical datasets and potentially embed physical laws directly into their architecture, enabling explainable and physics-informed predictions. This work is among the first few studies to implement GNNs to design concrete, with a particular emphasis on establishing a clear and reproducible pathway for converting tabular data into graph representations using the k-nearest neighbor (K-NN) approach. Model hyperparameters and feature selection are systematically optimized to enhance prediction performance. The GNN shows performance comparable to the benchmark random forest, which has been demonstrated by many studies to yield reliable predictions for cementitious materials. Overall, this study provides a foundational roadmap for transitioning from traditional ML to advanced AI architectures. The proposed framework establishes a strong foundation for future multi-modal and physics-informed GNN models capable of capturing complex material behaviors and accelerating the design and optimization of cementitious materials.
๐
2025-12-16
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Arth Bhardwaj, Sia Godika, Yuvam Loonker
Natural Language Processing
ๆฅ็ๆ่ฆ
Traditional, centralized security tools often miss adaptive, multi-vector attacks. We present the Multi-Agent LLM Cyber Defense Framework (MALCDF), a practical setup where four large language model (LLM) agents-Detection, Intelligence, Response, and Analysis-work together in real time. Agents communicate over a Secure Communication Layer (SCL) with encrypted, ontology-aligned messages, and produce audit-friendly outputs (e.g., MITRE ATT&CK mappings).
For evaluation, we keep the test simple and consistent: all reported metrics come from the same 50-record live stream derived from the CICIDS2017 feature schema. CICIDS2017 is used for configuration (fields/schema) and to train a practical ML baseline. The ML-IDS baseline is a Lightweight Random Forest IDS (LRF-IDS) trained on a subset of CICIDS2017 and tested on the 50-record stream, with no overlap between training and test records.
In experiments, MALCDF reaches 90.0% detection accuracy, 85.7% F1-score, and 9.1% false-positive rate, with 6.8s average per-event latency. It outperforms the lightweight ML-IDS baseline and a single-LLM setup on accuracy while keeping end-to-end outputs consistent. Overall, this hands-on build suggests that coordinating simple LLM agents with secure, ontology-aligned messaging can improve practical, real-time cyber defense.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aref Farhadipour, Teodora Vukovic, Volker Dellwo, Petr Motlicek, Srikanth Madikeri
Multimodal
ๆฅ็ๆ่ฆ
Person recognition systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a Trimodal person identification framework that integrates voice, face, and gesture modalities, while remaining robust to modality loss. Our approach leverages multi-task learning to process each modality independently, followed by a cross-attention and gated fusion mechanisms to facilitate interaction across modalities. Moreover, a confidence-weighted fusion strategy dynamically adapts to missing and low-quality data, ensuring optimal classification even in Unimodal or Bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark for the first time. Our results demonstrate that the proposed Trimodal system achieves 99.18% Top-1 accuracy on person identification tasks, outperforming conventional Unimodal and late-fusion approaches. In addition, we evaluate our model on the VoxCeleb1 dataset as a benchmark and reach 99.92% accuracy in Bimodal mode. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenzhi Wang, Jian Wang, Ke Ma, Dahua Lin, Bing Zhou
RoboticsMultimodal
ๆฅ็ๆ่ฆ
We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10$\times$ lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: https://zhenzhiwang.github.io/talkverse/
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Abhijeet Parida, Daniel Capellรกn-Martรญn, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer et al.
Machine Learning
ๆฅ็ๆ่ฆ
Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, the median survival rate is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities in slices. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pretrained models developed for various types of tumors. We demonstrated the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9 % for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Joshua L. Ebbert, Dennis Della Corte
ๆฅ็ๆ่ฆ
Artificial intelligence foundation models are increasingly deployed for prostate cancer Gleason grading, where GP3/GP4 distinction directly impacts treatment decisions. However, these models may achieve high validation accuracy by learning specimen-specific artifacts rather than generalizable biological features, limiting real-world clinical utility. We introduce PANDA-PLUS-Bench, a curated benchmark dataset derived from expert-annotated prostate biopsies designed specifically to quantify this failure mode. The benchmark comprises nine carefully selected whole slide images from nine unique patients containing diverse Gleason patterns, with non-overlapping tissue patches extracted at both 512x512 and 224x224 pixel resolutions across eight augmentation conditions. Using this benchmark, we evaluate seven foundation models on their ability to separate biological signal from slide-level confounders. Our results reveal substantial variation in robustness across models: Virchow2 achieved the lowest slide-level encoding among large-scale models (81.0%) yet exhibited the second-lowest cross-slide accuracy (47.2%). HistoEncoder, trained specifically on prostate tissue, demonstrated the highest cross-slide accuracy (59.7%) and the strongest slide-level encoding (90.3%), suggesting tissue-specific training may enhance both biological feature capture and slide-specific signatures. All models exhibited measurable within-slide vs. cross-slide accuracy gaps, though the magnitude varied from 19.9 percentage points to 26.9 percentage points. We provide an open-source Google Colab notebook enabling researchers to evaluate additional foundation models against our benchmark using standardized metrics. PANDA-PLUS-Bench addresses a critical gap in foundation model evaluation by providing a purpose-built resource for robustness assessment in the clinically important context of Gleason grading.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huzheng Yang, Katherine Xu, Andrew Lu, Michael D. Grossberg, Yutong Bai et al.
Multimodal
ๆฅ็ๆ่ฆ
Creating new visual concepts often requires connecting distinct ideas through their most relevant shared attributes -- their vibe. We introduce Vibe Blending, a novel task for generating coherent and meaningful hybrids that reveals these shared attributes between images. Achieving such blends is challenging for current methods, which struggle to identify and traverse nonlinear paths linking distant concepts in latent space. We propose Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, we design a cognitively inspired framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score. We find that Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods.
๐
2025-12-16
๐ NeurIPS
๐ฅ Darrin O' Brien, Dhikshith Gajulapalli, Eric Xia
Natural Language Processing
ๆฅ็ๆ่ฆ
Results in interpretability suggest that large vision and language models learn implicit linear encodings when models are biased by in-context prompting. However, the existence of similar linear representations in more general adaptation regimes has not yet been demonstrated. In this work, we develop the concept of a task matrix, a linear transformation from a base to finetuned embedding state. We demonstrate that for vision and text models and ten different datasets, a base model augmented with a task matrix achieves results surpassing linear probes, sometimes approaching finetuned levels. Our results validate the existence of cross-layer linear encodings between pretrained and finetuned architectures. Moreover, we show that a data-based approximation for such encodings is both efficient and generalizable to multiple domains. We make our implementation publicly available.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenshuo Li, Majid Mirmehdi, Tilo Burghardt
ๆฅ็ๆ่ฆ
Biologists have long combined visuals with textual field notes to re-identify (Re-ID) animals. Contemporary AI tools automate this for species with distinctive morphological features but remain largely image-based. Here, we extend Re-ID methodologies by incorporating precise dermatoglyphic textual descriptors-an approach used in forensics but new to ecology. We demonstrate that these specialist semantics abstract and encode animal coat topology using human-interpretable language tags. Drawing on 84,264 manually labelled minutiae across 3,355 images of 185 tigers (Panthera tigris), we evaluate this visual-textual methodology, revealing novel capabilities for cross-modal identity retrieval. To optimise performance, we developed a text-image co-synthesis pipeline to generate 'virtual individuals', each comprising dozens of life-like visuals paired with dermatoglyphic text. Benchmarking against real-world scenarios shows this augmentation significantly boosts AI accuracy in cross-modal retrieval while alleviating data scarcity. We conclude that dermatoglyphic language-guided biometrics can overcome vision-only limitations, enabling textual-to-visual identity recovery underpinned by human-verifiable matchings. This represents a significant advance towards explainability in Re-ID and a language-driven unification of descriptive modalities in ecological monitoring.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniel Perkins, Davis Hunter, Dhrumil Patel, Galen Flanagan
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
The recent surge in large language models has automated translations of spoken and written languages. However, these advances remain largely inaccessible to American Sign Language (ASL) users, whose language relies on complex visual cues. Isolated sign language recognition (ISLR) - the task of classifying videos of individual signs - can help bridge this gap but is currently limited by scarce per-sign data, high signer variability, and substantial computational costs. We propose a model for ISLR that reduces computational requirements while maintaining robustness to signer variation. Our approach integrates (i) a pose estimation pipeline to extract hand and face joint coordinates, (ii) a segmentation module that isolates relevant information, and (iii) a ResNet-Transformer backbone to jointly model spatial and temporal dependencies.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot can suffice. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Buu Phan, Ashish Khisti, Karen Ullrich
Natural Language Processing
ๆฅ็ๆ่ฆ
Computing next-token likelihood ratios between two language models (LMs) is a standard task in training paradigms such as knowledge distillation. Since this requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. In this work, we address this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only O(1) model evaluations per token. When used for distillation, this yields up to a 12% reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance up to 4% on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages BPE recursive structure, complemented by a fast approximation that keeps large-vocabulary settings practical. Applied to distillation for mathematical reasoning, our approach improves GSM8K accuracy by more than 2% over the current state of the art.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ George-Andrei Dima, Dumitru-Clementin Cercel
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Caner Erden
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The quadratic computational complexity of MultiHead SelfAttention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for longcontext tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multiscale semantic granularity effectively. In this paper, we propose Multiscale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scalespecific attention matrices as a resource allocation problem, solved via a convex optimization framework or a Nash equilibriumbased gametheoretic approach. This ensures a theoretically optimal balance between local nuance and global context fidelity. Implemented within a hybrid dilatedconvolutional transformer backbone, MAHA utilizes differentiable optimization layers to enable endtoend training. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention. This work bridges the gap between optimization theory and sequence modeling, offering a scalable solution for nextgeneration LLMs.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Objectives: To evaluate large language model (LLM) performance on pharmacy licensure-style question-answering (QA) tasks and develop an external knowledge integration method to improve their accuracy.
Methods: We benchmarked eleven existing LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset. We measured baseline accuracy for each model without modification. We then developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context. This pipeline operates externally to the models, requiring no changes to model architecture or parameters.
Results: Baseline accuracy ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores. Models with fewer than 8 billion parameters scored below 50%. DrugRAG improved accuracy across all tested models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%) on the 141-item benchmark.
Conclusion: We demonstrate that external structured drug knowledge integration through DrugRAG measurably improves LLM accuracy on pharmacy tasks without modifying the underlying models. This approach provides a practical pipeline for enhancing pharmacy-focused AI applications with evidence-based information.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Massimiliano Fadda, Enrico Motta, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino
Natural Language Processing
ๆฅ็ๆ่ฆ
News sources play a central role in democratic societies by shaping political and social discourse through specific topics, viewpoints and voices. Understanding these dynamics is essential for assessing whether the media landscape offers a balanced and fair account of public debate. In earlier work, we introduced a pipeline that, given a news corpus, i) uses a hybrid human-machine approach to identify the range of viewpoints expressed about a given topic, and ii) classifies relevant claims with respect to the identified viewpoints, defined as sets of semantically and ideologically congruent claims (e.g., positions arguing that immigration positively impacts the UK economy). In this paper, we improve this pipeline by i) fine-tuning Large Language Models (LLMs) for viewpoint classification and ii) enriching claim representations with semantic descriptions of relevant actors drawn from Wikidata. We evaluate our approach against alternative solutions on a benchmark centred on the UK immigration debate. Results show that while both mechanisms independently improve classification performance, their integration yields the best results, particularly when using LLMs capable of processing long inputs.
๐
2025-12-16
๐ NeurIPS
๐ฅ Darrin O' Brien, Dhikshith Gajulapalli, Eric Xia
Natural Language Processing
ๆฅ็ๆ่ฆ
Results in interpretability suggest that large vision and language models learn implicit linear encodings when models are biased by in-context prompting. However, the existence of similar linear representations in more general adaptation regimes has not yet been demonstrated. In this work, we develop the concept of a task matrix, a linear transformation from a base to finetuned embedding state. We demonstrate that for vision and text models and ten different datasets, a base model augmented with a task matrix achieves results surpassing linear probes, sometimes approaching finetuned levels. Our results validate the existence of cross-layer linear encodings between pretrained and finetuned architectures. Moreover, we show that a data-based approximation for such encodings is both efficient and generalizable to multiple domains. We make our implementation publicly available.
๐
2025-12-16
๐ ArXiv Preprint (cs.SD)
๐ฅ Advait Gosai, Tyler Vuong, Utkarsh Tyagi, Steven Li, Wenjia You et al.
ๆฅ็ๆ่ฆ
End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis Voice Editing that tests robustness to mid-utterance speech repairs and backtracking. We further augment each axis to the audio modality, such as introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and paralinguistic signals beyond semantic content. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline that exposes model failures at scale while preserving natural disfluencies found in unscripted human speech. Our evaluation of proprietary and open-source models reveals that even frontier models struggle on our benchmark, with Gemini 3 Pro Preview (Thinking), our highest-performing model achieving a 54.65% pass rate. Error analysis shows that models fail most often on our new axes and that Self Coherence degrades with longer audio context. These failures reflect difficulty of tracking edits, audio cues, and long-range context in natural spoken dialogue. Audio MultiChallenge provides a reproducible testbed to quantify them and drive improvements in audio-native multi-turn interaction capability.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Biao Zhang, Paul Suganthan, Gaรซl Liu, Ilya Philippov, Sahil Dua et al.
Multimodal
ๆฅ็ๆ่ฆ
We introduce T5Gemma 2, the next generation of the T5Gemma family of lightweight open encoder-decoder models, featuring strong multilingual, multimodal and long-context capabilities. T5Gemma 2 follows the adaptation recipe (via UL2) in T5Gemma -- adapting a pretrained decoder-only model into an encoder-decoder model, and extends it from text-only regime to multimodal based on the Gemma 3 models. We further propose two methods to improve the efficiency: tied word embedding that shares all embeddings across encoder and decoder, and merged attention that unifies decoder self- and cross-attention into a single joint module. Experiments demonstrate the generality of the adaptation strategy over architectures and modalities as well as the unique strength of the encoder-decoder architecture on long context modeling. Similar to T5Gemma, T5Gemma 2 yields comparable or better pretraining performance and significantly improved post-training performance than its Gemma 3 counterpart. We release the pretrained models (270M-270M, 1B-1B and 4B-4B) to the community for future research.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou et al.
RoboticsNatural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Richard Ackermann, Simeon Emanuilov
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
OpenAI has recently argued that hallucinations in large language models result primarily from misaligned evaluation incentives that reward confident guessing rather than epistemic humility. On this view, hallucination is a contingent behavioral artifact, remediable through improved benchmarks and reward structures. In this paper, we challenge that interpretation. Drawing on previous work on structural hallucination and empirical experiments using a Licensing Oracle, we argue that hallucination is not an optimization failure but an architectural inevitability of the transformer model.
Transformers do not represent the world; they model statistical associations among tokens. Their embedding spaces form a pseudo-ontology derived from linguistic co-occurrence rather than world-referential structure. At ontological boundary conditions - regions where training data is sparse or incoherent - the model necessarily interpolates fictional continuations in order to preserve coherence. No incentive mechanism can modify this structural dependence on pattern completion.
Our empirical results demonstrate that hallucination can only be eliminated through external truth-validation and abstention modules, not through changes to incentives, prompting, or fine-tuning. The Licensing Oracle achieves perfect abstention precision across domains precisely because it supplies grounding that the transformer lacks.
We conclude that hallucination is a structural property of generative architectures and that reliable AI requires hybrid systems that distinguish linguistic fluency from epistemic responsibility.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Felipe J. P. Antunes, Yuri F. Saporito, Sebastian Jaimungal
Machine Learning
ๆฅ็ๆ่ฆ
We present a novel numerical method for solving McKean-Vlasov forward-backward stochastic differential equations (MV-FBSDEs) with common noise, combining Picard iterations, elicitability and deep learning. The key innovation involves elicitability to derive a path-wise loss function, enabling efficient training of neural networks to approximate both the backward process and the conditional expectations arising from common noise - without requiring computationally expensive nested Monte Carlo simulations. The mean-field interaction term is parameterized via a recurrent neural network trained to minimize an elicitable score, while the backward process is approximated through a feedforward network representing the decoupling field. We validate the algorithm on a systemic risk inter-bank borrowing and lending model, where analytical solutions exist, demonstrating accurate recovery of the true solution. We further extend the model to quantile-mediated interactions, showcasing the flexibility of the elicitability framework beyond conditional means or moments. Finally, we apply the method to a non-stationary Aiyagari--Bewley--Huggett economic growth model with endogenous interest rates, illustrating its applicability to complex mean-field games without closed-form solutions.
๐
2025-12-16
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Hop Le, Izzat Alsmadi
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The Internet of Vehicles (IoV) has evolved modern transportation through enhanced connectivity and intelligent systems. However, this increased connectivity introduces critical vulnerabilities, making vehicles susceptible to cyber-attacks such Denial-ofService (DoS) and message spoofing. This project aims to develop a machine learning-based intrusion detection system to classify malicious Controller Area network (CAN) bus traffic using the CiCIoV2024 benchmark dataset. We analyzed various attack patterns including DoS and spoofing attacks targeting critical vehicle parameters such as Spoofing-GAS - gas pedal position, Spoofing-RPM, Spoofing-Speed, and Spoofing-Steering\_Wheel. Our initial findings confirm a multi-class classification problem with a clear structural difference between attack types and benign data, providing a strong foundation for machine learning models.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Buu Phan, Ashish Khisti, Karen Ullrich
Natural Language Processing
ๆฅ็ๆ่ฆ
Computing next-token likelihood ratios between two language models (LMs) is a standard task in training paradigms such as knowledge distillation. Since this requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. In this work, we address this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only O(1) model evaluations per token. When used for distillation, this yields up to a 12% reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance up to 4% on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages BPE recursive structure, complemented by a fast approximation that keeps large-vocabulary settings practical. Applied to distillation for mathematical reasoning, our approach improves GSM8K accuracy by more than 2% over the current state of the art.
๐
2025-12-16
๐ ArXiv Preprint (cs.OS)
๐ฅ Shaoting Feng, Yuhan Liu, Hanchen Li, Xiaokun Chen, Samuel Shen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. With more LLM users, the KV cache footprint can easily exceed GPU memory capacity, so prior work has proposed to either evict KV cache to lower-tier storage devices, or compress KV cache so that more KV cache can be fit in the fast memory. However, prior work misses an important opportunity: jointly optimizing the eviction and compression decisions across all KV caches to minimize average generation latency without hurting quality.
We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers. Specifically, for each KV cache of a context, EVICPRESS considers the effect of compression and eviction of the KV cache on the average generation quality and delay across all contexts as a whole. To achieve this, EVICPRESS proposes a unified utility function that quantifies the effect of quality and delay of the lossy compression or eviction. To this end, EVICPRESS's profiling module periodically updates the utility function scores on all possible eviction-compression configurations for all contexts and places KV caches using a fast heuristic to rearrange KV caches on all storage tiers, with the goal of maximizing the utility function scores on each storage tier. Compared to the baselines that evict KV cache or compress KV cache, EVICPRESS achieves higher KV-cache hit rates on fast devices, i.e., lower delay, while preserving high generation quality by applying conservative compression to contexts that are sensitive to compression errors. Evaluation on 12 datasets and 5 models demonstrates that EVICPRESS achieves up to 2.19x faster time-to-first-token (TTFT) at equivalent generation quality.
๐
2025-12-16
๐ ArXiv Preprint (math.NA)
๐ฅ Conor Rowan, Kai Hampleman, Kurt Maute, Alireza Doostan
Machine Learning
ๆฅ็ๆ่ฆ
Since their advent nearly a decade ago, physics-informed neural networks (PINNs) have been studied extensively as a novel technique for solving forward and inverse problems in physics and engineering. The neural network discretization of the solution field is naturally adaptive and avoids meshing the computational domain, which can both improve the accuracy of the numerical solution and streamline implementation. However, there have been limited studies of PINNs on complex three-dimensional geometries, as the lack of mesh and the reliance on the strong form of the partial differential equation (PDE) make boundary condition (BC) enforcement challenging. Techniques to enforce BCs with PINNs have proliferated in the literature, but a comprehensive side-by-side comparison of these techniques and a study of their efficacy on geometrically complex three-dimensional test problems are lacking. In this work, we i) systematically compare BC enforcement techniques for PINNs, ii) propose a general solution framework for arbitrary three-dimensional geometries, and iii) verify the methodology on three-dimensional, linear and nonlinear test problems with combinations of Dirichlet, Neumann, and Robin boundaries. Our approach is agnostic to the underlying PDE, the geometry of the computational domain, and the nature of the BCs, while requiring minimal hyperparameter tuning. This work represents a step in the direction of establishing PINNs as a mature numerical method, capable of competing head-to-head with incumbents such as the finite element method.
๐
2025-12-16
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Nnamdi Philip Okonkwo, Lubna Luxmi Dhirani
Robotics
ๆฅ็ๆ่ฆ
Cloud Security Operations Center (SOC) enable cloud governance, risk and compliance by providing insights visibility and control. Cloud SOC triages high-volume, heterogeneous telemetry from elastic, short-lived resources while staying within tight budgets. In this research, we implement an AI-Augmented Security Operations Center (AISOC) on AWS that combines cloud-native instrumentation with ML-based detection. The architecture uses three Amazon EC2 instances: Attacker, Defender, and Monitoring. We simulate a reverse-shell intrusion with Metasploit, and Filebeat forwards Defender logs to an Elasticsearch and Kibana stack for analysis. We train two classifiers, a malware detector built on a public dataset and a log-anomaly detector trained on synthetically augmented logs that include adversarial variants. We calibrate and fuse the scores to produce multi-modal threat intelligence and triage activity into NORMAL, SUSPICIOUS, and HIGH\_CONFIDENCE\_ATTACK. On held-out tests the fusion achieves strong macro-F1 (up to 1.00) under controlled conditions, though performance will vary in noisier and more diverse environments. These results indicate that simple, calibrated fusion can enhance cloud SOC capabilities in constrained, cost-sensitive setups.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Daniel Gomes de Pinho Zanco, Leszek Szczecinski, Jacob Benesty, Eduardo Vinicius Kuhn
ๆฅ็ๆ่ฆ
In this work, we propose a method to efficiently find the regularization parameter for low-rank MMSE filters based on a Kronecker-product representation. We show that the regularization parameter is surprisingly linked to the problem of rank selection and, thus, properly choosing it, is crucial for low-rank settings. The proposed method is validated through simulations, showing significant gains over commonly used methods.
๐
2025-12-16
๐ ArXiv Preprint (eess.IV)
๐ฅ Paul J. Weiser, Jiye Kim, Jongho Lee, Amirmohammad Shamaei, Gulnur Ungan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Purpose: Magnetic Resonance Spectroscopic Imaging (MRSI) maps endogenous brain metabolism while suppressing the overwhelming water signal. Water-unsuppressed MRSI (wu-MRSI) allows simultaneous imaging of water and metabolites, but large water sidebands cause challenges for metabolic fitting. We developed an end-to-end deep-learning pipeline to overcome these challenges at ultra-high field. Methods:Fast high-resolution wu-MRSI was acquired at 7T with non-cartesian ECCENTRIC sampling and ultra-short echo time. A water and lipid removal network (WALINET+) was developed to remove lipids, water signal, and sidebands. MRSI reconstruction was performed by DeepER and a physics-informed network for metabolite fitting. Water signal was used for absolute metabolite quantification, quantitative susceptibility mapping (QSM), and myelin water fraction imaging (MWF). Results: WALINET+ provided the lowest NRMSE (< 2%) in simulations and in vivo the smallest bias (< 20%) and limits-of-agreement (+-63%) between wu-MRSI and ws-MRSI scans. Several metabolites such as creatine and glutamate showed higher SNR in wu-MRSI. QSM and MWF obtained from wu-MRSI and GRE showed good agreement with 0 ppm/5.5% bias and +-0.05 ppm/ +- 12.75% limits-of-agreement. Conclusion: High-quality metabolic, QSM, and MWF mapping of the human brain can be obtained simultaneously by ECCENTRIC wu-MRSI at 7T with 2 mm isotropic resolution in 12 min. WALINET+ robustly removes water sidebands while preserving metabolite signal, eliminating the need for water suppression and separate water acquisitions.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ George-Andrei Dima, Dumitru-Clementin Cercel
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Massimiliano Fadda, Enrico Motta, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino
Natural Language Processing
ๆฅ็ๆ่ฆ
News sources play a central role in democratic societies by shaping political and social discourse through specific topics, viewpoints and voices. Understanding these dynamics is essential for assessing whether the media landscape offers a balanced and fair account of public debate. In earlier work, we introduced a pipeline that, given a news corpus, i) uses a hybrid human-machine approach to identify the range of viewpoints expressed about a given topic, and ii) classifies relevant claims with respect to the identified viewpoints, defined as sets of semantically and ideologically congruent claims (e.g., positions arguing that immigration positively impacts the UK economy). In this paper, we improve this pipeline by i) fine-tuning Large Language Models (LLMs) for viewpoint classification and ii) enriching claim representations with semantic descriptions of relevant actors drawn from Wikidata. We evaluate our approach against alternative solutions on a benchmark centred on the UK immigration debate. Results show that while both mechanisms independently improve classification performance, their integration yields the best results, particularly when using LLMs capable of processing long inputs.
๐
2025-12-16
๐ Accepted to appear in the ACM/IEEE International C
๐ฅ Iddo Yehoshua Wald, Amber Maimon, Shiyao Zhang, Dennis Kรผster, Robert Porzel et al.
Robotics
ๆฅ็ๆ่ฆ
Embodiment of users within robotic systems has been explored in human-robot interaction, most often in telepresence and teleoperation. In these applications, synchronized visuomotor feedback can evoke a sense of body ownership and agency, contributing to the experience of embodiment. We extend this work by employing embreathment, the representation of the user's own breath in real time, as a means for enhancing user embodiment experience in robots. In a within-subjects experiment, participants controlled a robotic arm, while its movements were either synchronized or non-synchronized with their own breath. Synchrony was shown to significantly increase body ownership, and was preferred by most participants. We propose the representation of physiological signals as a novel interoceptive pathway for human-robot interaction, and discuss implications for telepresence, prosthetics, collaboration with robots, and shared autonomy.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Zixin Tang, Yiming Chen, Quentin Rouxel, Dianxi Li, Shuang Wu et al.
Robotics
ๆฅ็ๆ่ฆ
Teleoperation presents a promising paradigm for remote control and robot proprioceptive data collection. Despite recent progress, current teleoperation systems still suffer from limitations in efficiency and ergonomics, particularly in challenging scenarios. In this paper, we propose CaFe-TeleVision, a coarse-to-fine teleoperation system with immersive situated visualization for enhanced ergonomics. At its core, a coarse-to-fine control mechanism is proposed in the retargeting module to bridge workspace disparities, jointly optimizing efficiency and physical ergonomics. To stream immersive feedback with adequate visual cues for human vision systems, an on-demand situated visualization technique is integrated in the perception module, which reduces the cognitive load for multi-view processing. The system is built on a humanoid collaborative robot and validated with six challenging bimanual manipulation tasks. User study among 24 participants confirms that CaFe-TeleVision enhances ergonomics with statistical significance, indicating a lower task load and a higher user acceptance during teleoperation. Quantitative results also validate the superior performance of our system across six tasks, surpassing comparative methods by up to 28.89% in success rate and accelerating by 26.81% in completion time. Project webpage: https://clover-cuhk.github.io/cafe_television/
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xichen Ding, Jianzhe Gao, Cong Pan, Wenguan Wang, Jie Qin
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Additionally, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Amir M. Soufi Enayati, Homayoun Honari, Homayoun Najjaran
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) enables robots to operate in uncertain environments, but standard approaches often struggle with poor generalization to unseen tasks. Context-adaptive meta reinforcement learning addresses these limitations by conditioning on the task representation, yet they mostly rely on complete action information in the experience making task inference tightly coupled to a specific policy. This paper introduces Context Representation via Action Free Transformer encoder decoder (CRAFT), a belief model that infers task representations solely from sequences of states and rewards. By removing the dependence on actions, CRAFT decouples task inference from policy optimization, supports modular training, and leverages amortized variational inference for scalable belief updates. Built on a transformer encoder decoder with rotary positional embeddings, the model captures long range temporal dependencies and robustly encodes both parametric and non-parametric task variations. Experiments on the MetaWorld ML-10 robotic manipulation benchmark show that CRAFT achieves faster adaptation, improved generalization, and more effective exploration compared to context adaptive meta--RL baselines. These findings highlight the potential of action-free inference as a foundation for scalable RL in robotic control.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs with state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All codes, data, and models will be released to facilitate future research.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli et al.
Computer Vision
ๆฅ็ๆ่ฆ
Non-parametric quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find the Leech lattice-based quantization method, which is dubbed as Spherical Leech Quantization ($ฮ_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zitian Gao, Lynx Chen, Yihao Xiao, He Xing, Ran Tao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Universal transformers (UTs) have been widely used for complex reasoning tasks such as ARC-AGI and Sudoku, yet the specific sources of their performance gains remain underexplored. In this work, we systematically analyze UTs variants and show that improvements on ARC-AGI primarily arise from the recurrent inductive bias and strong nonlinear components of Transformer, rather than from elaborate architectural designs. Motivated by this finding, we propose the Universal Reasoning Model (URM), which enhances the UT with short convolution and truncated backpropagation. Our approach substantially improves reasoning performance, achieving state-of-the-art 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2. Our code is avaliable at https://github.com/zitian-gao/URM.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper present an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yen-Ju Lu, Kunxiao Gao, Mingrui Liang, Helin Wang, Thomas Thebaud et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chuan He
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Stochastic optimization is fundamental to modern machine learning. Recent research has extended the study of stochastic first-order methods (SFOMs) from light-tailed to heavy-tailed noise, which frequently arises in practice, with clipping emerging as a key technique for controlling heavy-tailed gradients. Extensive theoretical advances have further shown that the oracle complexity of SFOMs depends on the tail index $ฮฑ$ of the noise. Nonetheless, existing complexity results often cover only the case $ฮฑ\in (1,2]$, that is, the regime where the noise has a finite mean, while the complexity bounds tend to infinity as $ฮฑ$ approaches $1$. This paper tackles the general case of noise with tail index $ฮฑ\in(0,2]$, covering regimes ranging from noise with bounded variance to noise with an infinite mean, where the latter case has been scarcely studied. Through a novel analysis of the bias-variance trade-off in gradient clipping, we show that when a symmetry measure of the noise tail is controlled, clipped SFOMs achieve improved complexity guarantees in the presence of heavy-tailed noise for any tail index $ฮฑ\in (0,2]$. Our analysis of the bias-variance trade-off not only yields new unified complexity guarantees for clipped SFOMs across this full range of tail indices, but is also straightforward to apply and can be combined with classical analyses under light-tailed noise to establish oracle complexity guarantees under heavy-tailed noise. Finally, numerical experiments validate our theoretical findings.
๐
2025-12-16
๐ NeurIPS 2025
๐ฅ Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng et al.
Machine Learning
ๆฅ็ๆ่ฆ
We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization takes various training losses robust to artifacts and limited pose coverage in the generated training data. Our experiment shows that VASA-3D produces realistic 3D talking heads that cannot be achieved by prior art, and it supports the online generation of 512x512 free-viewpoint videos at up to 75 FPS, facilitating more immersive engagements with lifelike 3D avatars.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alban Puech, Matteo Mazzonelli, Celia Cintas, Tamara R. Govindasamy, Mangaliso Mngomezulu et al.
Machine Learning
ๆฅ็ๆ่ฆ
We introduce gridfm-datakit-v1, a Python library for generating realistic and diverse Power Flow (PF) and Optimal Power Flow (OPF) datasets for training Machine Learning (ML) solvers. Existing datasets and libraries face three main challenges: (1) lack of realistic stochastic load and topology perturbations, limiting scenario diversity; (2) PF datasets are restricted to OPF-feasible points, hindering generalization of ML solvers to cases that violate operating limits (e.g., branch overloads or voltage violations); and (3) OPF datasets use fixed generator cost functions, limiting generalization across varying costs. gridfm-datakit addresses these challenges by: (1) combining global load scaling from real-world profiles with localized noise and supporting arbitrary N-k topology perturbations to create diverse yet realistic datasets; (2) generating PF samples beyond operating limits; and (3) producing OPF data with varying generator costs. It also scales efficiently to large grids (up to 10,000 buses). Comparisons with OPFData, OPF-Learn, PGLearn, and PF$ฮ$ are provided. Available on GitHub at https://github.com/gridfm/gridfm-datakit under Apache 2.0 and via `pip install gridfm-datakit`.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Christian Matek, Lukas Wolfseher et al.
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Timely and accurate lymphoma diagnosis is essential for guiding cancer treatment. Standard diagnostic practice combines hematoxylin and eosin (HE)-stained whole slide images with immunohistochemistry, flow cytometry, and molecular genetic tests to determine lymphoma subtypes, a process requiring costly equipment, skilled personnel, and causing treatment delays. Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking. In this work, we present the first multicenter lymphoma benchmarking dataset covering four common lymphoma subtypes and healthy control tissue. We systematically evaluate five publicly available pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) combined with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across three magnifications (10x, 20x, 40x). On in-distribution test sets, models achieve multiclass balanced accuracies exceeding 80% across all magnifications, with all foundation models performing similarly and both aggregation methods showing comparable results. The magnification study reveals that 40x resolution is sufficient, with no performance gains from higher resolutions or cross-magnification aggregation. However, on out-of-distribution test sets, performance drops substantially to around 60%, highlighting significant generalization challenges. To advance the field, larger multicenter studies covering additional rare lymphoma subtypes are needed. We provide an automated benchmarking pipeline to facilitate such future research.
๐
2025-12-16
๐ ArXiv Preprint (cs.SD)
๐ฅ Yash Vishe, Eric Xue, Xunyi Jiang, Zachary Novack, Junda Wu et al.
ๆฅ็ๆ่ฆ
Music editing plays a vital role in modern music production, with applications in film, broadcasting, and game development. Recent advances in music generation models have enabled diverse editing tasks such as timbre transfer, instrument substitution, and genre transformation. However, many existing works overlook the evaluation of their ability to preserve musical facets that should remain unchanged during editing a property we define as Music Context Preservation (MCP). While some studies do consider MCP, they adopt inconsistent evaluation protocols and metrics, leading to unreliable and unfair comparisons. To address this gap, we introduce the first MCP evaluation benchmark, MuseCPBench, which covers four categories of musical facets and enables comprehensive comparisons across five representative music editing baselines. Through systematic analysis along musical facets, methods, and models, we identify consistent preservation gaps in current music editing methods and provide insightful explanations. We hope our findings offer practical guidance for developing more effective and reliable music editing strategies with strong MCP capability
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
MultimodalMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alessandro Trapasso, Luca Iocchi, Fabio Patrizi
Machine Learning
ๆฅ็ๆ่ฆ
Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhaolun Li, Jichang Li, Yinqi Cai, Junye Chen, Xiaonan Luo et al.
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Adarsha Shrestha, Basanta Pokharel, Binit Shrestha, Smriti Adhikari, Dinesh Gothe
Natural Language Processing
ๆฅ็ๆ่ฆ
Nepali, a low-resource language spoken by over 32 million people, continues to face challenges in natural language processing (NLP) due to its complex grammar, agglutinative morphology, and limited availability of high-quality corpora. Most efforts to date have centered on basic encoder architectures; they remain insufficient for Nepali-specific text generation. This study presents a GPT-2-based Nepali language model trained using several training strategies inspired by GPT-3, including optimized learning rate schedules, batch scaling, and architectural refinements. A custom 16k Byte-Pair Encoding (BPE) tokenizer was trained exclusively on Nepali text to ensure more consistent segmentation and improved input representation. The model was pretrained on a combined dataset comprising a 10.75GB cleaned NepBERTa corpus and additional web-scraped Nepali news articles. FlashAttention was integrated to reduce memory usage and stabilize training. After two epochs, the model achieved a training loss of 3.168177, a validation loss of 3.081982, and a final perplexity of 21.80, demonstrating its capability to generate coherent Nepali news-style text.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ekaterina Artemova, Laurie Burchell, Daryna Dementieva, Shu Okabe, Mariya Shmatova et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages -- from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.
๐
2025-12-16
๐ Accepted in IEEE Bigdata 2025- Learning Representa
๐ฅ Tejaswani Dash, Gautam Datla, Anudeep Vurity, Tazeem Ahmad, Mohd Adnan et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Cardiovascular disease (CVD) remains the leading cause of mortality worldwide, underscoring the need for reliable and efficient predictive tools that support early intervention. Traditional diagnostic approaches rely on handcrafted features and clinician expertise, while machine learning methods improve reproducibility but often struggle to generalize across noisy and heterogeneous clinical data. In this work, we propose Residual GRU with Multi-Head Self-Attention, a compact deep learning architecture designed for tabular clinical records. The model integrates residual bidirectional gated recurrent units for sequential modeling of feature columns, a channel reweighting block, and multi-head self-attention pooling with a learnable classification token to capture global context. We evaluate the model on the UCI Heart Disease dataset using 5-fold stratified cross-validation and compare it against classical methods such as Logistic Regression, Random Forest, and Support Vector Machines, as well as modern deep learning baselines including DeepMLP, convolutional networks, recurrent networks, and Transformers. The proposed model achieves an accuracy of 0.861, macro-F1 of 0.860, ROC-AUC of 0.908, and PR-AUC of 0.904, outperforming all baselines. Ablation studies confirm the individual contributions of residual recurrence, channel gating, and attention pooling. t-SNE visualizations further indicate that the learned embeddings exhibit clearer separation between disease and non-disease classes compared to raw features. These results demonstrate that lightweight hybrid recurrent and attention-based architectures provide a strong balance between accuracy and efficiency for clinical risk prediction, supporting deployment in resource-constrained healthcare settings.
๐
2025-12-16
๐ Accepted in IEEE Bigdata 2025- LLMs4ALL
๐ฅ Tejaswani Dash, Dinesh Karri, Anudeep Vurity, Gautam Datla, Tazeem Ahmad et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
This paper introduces PolyPersona, a generative framework for synthesizing persona-conditioned survey responses across multiple domains. The framework instruction-tunes compact chat models using parameter-efficient LoRA adapters with 4-bit quantization under a resource-adaptive training setup. A dialogue-based data pipeline explicitly preserves persona cues, ensuring consistent behavioral alignment across generated responses. Using this pipeline, we construct a dataset of 3,568 synthetic survey responses spanning ten domains and 433 distinct personas, enabling controlled instruction tuning and systematic multi-domain evaluation. We evaluate the generated responses using a multi-metric evaluation suite that combines standard text generation metrics, including BLEU, ROUGE, and BERTScore, with survey-specific metrics designed to assess structural coherence, stylistic consistency, and sentiment alignment.Experimental results show that compact models such as TinyLlama 1.1B and Phi-2 achieve performance comparable to larger 7B to 8B baselines, with a highest BLEU score of 0.090 and ROUGE-1 of 0.429. These findings demonstrate that persona-conditioned fine-tuning enables small language models to generate reliable and coherent synthetic survey data. The proposed framework provides an efficient and reproducible approach for survey data generation, supporting scalable evaluation while facilitating bias analysis through transparent and open protocols.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xianwei Cao, Dou Quan, Shuang Wang, Ning Huyan, Wei Wang et al.
ๆฅ็ๆ่ฆ
Image retrieval-based cross-view geo-localization (IRCVGL) aims to match images captured from significantly different viewpoints, such as satellite and street-level images. Existing methods predominantly rely on learning robust global representations or implicit feature alignment, which often fail to model explicit spatial correspondences crucial for accurate localization. In this work, we propose a novel correspondence-aware feature refinement framework, termed CLNet, that explicitly bridges the semantic and geometric gaps between different views. CLNet decomposes the view alignment process into three learnable and complementary modules: a Neural Correspondence Map (NCM) that spatially aligns cross-view features via latent correspondence fields; a Nonlinear Embedding Converter (NEC) that remaps features across perspectives using an MLP-based transformation; and a Global Feature Recalibration (GFR) module that reweights informative feature channels guided by learned spatial cues. The proposed CLNet can jointly capture both high-level semantics and fine-grained alignments. Extensive experiments on four public benchmarks, CVUSA, CVACT, VIGOR, and University-1652, demonstrate that our proposed CLNet achieves state-of-the-art performance while offering better interpretability and generalizability.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ David Samuel, Lucas Georges Gabriel Charpentier
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal ratio between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andreas Lolos, Theofilos Christodoulou, Aris L. Moustakas, Stergios Christodoulidis, Maria Vakalopoulou
Machine Learning
ๆฅ็ๆ่ฆ
In computational pathology, weak supervision has become the standard for deep learning due to the gigapixel scale of WSIs and the scarcity of pixel-level annotations, with Multiple Instance Learning (MIL) established as the principal framework for slide-level model training. In this paper, we introduce a novel setting for MIL methods, inspired by proceedings in Neural Partial Differential Equation (PDE) Solvers. Instead of relying on complex attention-based aggregation, we propose an efficient, aggregator-agnostic framework that removes the complexity of correlation learning from the MIL aggregator. CAPRMIL produces rich context-aware patch embeddings that promote effective correlation learning on downstream tasks. By projecting patch features -- extracted using a frozen patch encoder -- into a small set of global context/morphology-aware tokens and utilizing multi-head self-attention, CAPRMIL injects global context with linear computational complexity with respect to the bag size. Paired with a simple Mean MIL aggregator, CAPRMIL matches state-of-the-art slide-level performance across multiple public pathology benchmarks, while reducing the total number of trainable parameters by 48%-92.8% versus SOTA MILs, lowering FLOPs during inference by 52%-99%, and ranking among the best models on GPU memory efficiency and training time. Our results indicate that learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis. Our code is available at https://github.com/mandlos/CAPRMIL
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shreyas Subramanian, Bala Krishnamoorthy, Pranav Murthy
ๆฅ็ๆ่ฆ
Despite significant advances in optimizers for training, most research works use common scheduler choices like Cosine or exponential decay. In this paper, we study \emph{GreedyLR}, a novel scheduler that adaptively adjusts the learning rate during training based on the current loss. To validate the effectiveness of our proposed scheduler, we conduct experiments on several NLP, CV, and LLM tasks with up to $7B$ parameters, including both fine-tuning and pre-training experiments. The results show that our approach outperforms several state-of-the-art schedulers in terms of accuracy, speed, and convergence. We also provide a theoretical analysis of the GreedyLR algorithm, including a proof of convergence and derivation of the optimal scaling factor $F$ that maximizes the convergence rate, along with experiments to show robustness of the algorithm to realistic noisy landscapes. Our scheduler is easy to implement, computationally efficient, and could be considered a good default scheduler for training.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Cheng-Han Lu, Pei-Hsuan Tsai
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer-based multi-modal intelligent systems often suffer from high computational and energy costs due to dense self-attention, limiting their scalability under resource constraints. This paper presents SMMT, a sparse multi-modal transformer architecture designed to improve efficiency and robustness. Building upon a cascaded multi-modal transformer framework, SMMT introduces cluster-based sparse attention to achieve near linear computational complexity and modality-wise masking to enhance robustness against incomplete inputs. The architecture is evaluated using Alzheimer's Disease classification on the ADNI dataset as a representative multi-modal case study. Experimental results show that SMMT maintains competitive predictive performance while significantly reducing training time, memory usage, and energy consumption compared to dense attention baselines, demonstrating its suitability as a resource-aware architectural component for scalable intelligent systems.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shizhuo Mao, Song Chen, Yi Kang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at natural language tasks but face deployment challenges due to their growing size outpacing GPU memory advancements. Model quantization mitigates this issue by lowering weight and activation precision, but existing solutions face fundamental trade-offs: dynamic quantization incurs high computational overhead and poses deployment challenges on edge devices, while static quantization sacrifices accuracy. Existing approaches of quantization-aware training (QAT) further suffer from weight training costs. We propose SASQ: a lightweight QAT framework specifically tailored for activation quantization factors. SASQ exclusively optimizes only the quantization factors (without changing pre-trained weights), enabling static inference with high accuracy while maintaining deployment efficiency. SASQ adaptively truncates some outliers, thereby reducing the difficulty of quantization while preserving the distributional characteristics of the activations. SASQ not only surpasses existing SOTA quantization schemes but also outperforms the corresponding FP16 models. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andreas Sjรถlander, Valeria Belloni, Robel Fekadu, Andrea Nascetti
Machine Learning
ๆฅ็ๆ่ฆ
Tunnels are essential elements of transportation infrastructure, but are increasingly affected by ageing and deterioration mechanisms such as cracking. Regular inspections are required to ensure their safety, yet traditional manual procedures are time-consuming, subjective, and costly. Recent advances in mobile mapping systems and Deep Learning (DL) enable automated visual inspections. However, their effectiveness is limited by the scarcity of tunnel datasets. This paper introduces a new publicly available dataset containing annotated images of three different tunnel linings, capturing typical defects: cracks, leaching, and water infiltration. The dataset is designed to support supervised, semi-supervised, and unsupervised DL methods for defect detection and segmentation. Its diversity in texture and construction techniques also enables investigation of model generalization and transferability across tunnel types. By addressing the critical lack of domain-specific data, this dataset contributes to advancing automated tunnel inspection and promoting safer, more efficient infrastructure maintenance strategies.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Annu Rana, Gaurav Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) often struggle with complex multi-step planning tasks, showing high rates of constraint violations and inconsistent solutions. Existing strategies such as Chain-of-Thought and ReAct rely on implicit state tracking and lack an explicit problem representation. Inspired by classical AI planning, we propose Model-First Reasoning (MFR), a two-phase paradigm in which the LLM first constructs an explicit model of the problem, defining entities, state variables, actions, and constraints, before generating a solution plan. Across multiple planning domains, including medical scheduling, route planning, resource allocation, logic puzzles, and procedural synthesis, MFR reduces constraint violations and improves solution quality compared to Chain-of-Thought and ReAct. Ablation studies show that the explicit modeling phase is critical for these gains. Our results suggest that many LLM planning failures stem from representational deficiencies rather than reasoning limitations, highlighting explicit modeling as a key component for robust and interpretable AI agents. All prompts, evaluation procedures, and task datasets are documented to facilitate reproducibility.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Siyuan Zhu, Chengdong Xu, Kaiqiang Ke, Chao Yu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In long-context question answering (LCQA), determining the optimal amount of context for a given query is a significant challenge. Including too few passages may omit critical information, while including too many can introduce noise and reduce the quality of the answer. Traditional approaches, such as fixed Top-$K$ retrieval and single-stage reranking, face the dilemma of selecting the right number of passages. This problem is particularly pronounced for factoid questions, which often require only a few specific pieces of evidence. To address this issue, we introduce \emph{Context-Picker}, a reasoning-aware framework that shifts the paradigm from similarity-based ranking to minimal sufficient subset selection. Context-Picker treats context selection as a decision-making process optimized via a human-inspired, two-stage reinforcement learning schedule: a \emph{recall-oriented} stage that prioritizes the coverage of reasoning chains, followed by a \emph{precision-oriented} stage that aggressively prunes redundancy to distill a compact evidence set. To resolve reward sparsity, we propose an offline evidence distillation pipeline that mines "minimal sufficient sets" via a Leave-One-Out (LOO) procedure, providing dense, task-aligned supervision. Experiments on five long-context and multi-hop QA benchmarks demonstrate that Context-Picker significantly outperforms strong RAG baselines, achieving superior answer accuracy with comparable or reduced context lengths. Ablation studies indicate that the coarse-to-fine optimization schedule, the redundancy-aware reward shaping, and the rationale-guided format all contribute substantially to these gains.
๐
2025-12-16
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Xingfu Zhou, Pengfei Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model (LLM) agents relying on external retrieval are increasingly deployed in high-stakes environments. While existing adversarial attacks primarily focus on content falsification or instruction injection, we identify a novel, process-oriented attack surface: the agent's reasoning style. We propose Reasoning-Style Poisoning (RSP), a paradigm that manipulates how agents process information rather than what they process. We introduce Generative Style Injection (GSI), an attack method that rewrites retrieved documents into pathological tones--specifically "analysis paralysis" or "cognitive haste"--without altering underlying facts or using explicit triggers. To quantify these shifts, we develop the Reasoning Style Vector (RSV), a metric tracking Verification depth, Self-confidence, and Attention focus. Experiments on HotpotQA and FEVER using ReAct, Reflection, and Tree of Thoughts (ToT) architectures reveal that GSI significantly degrades performance. It increases reasoning steps by up to 4.4 times or induces premature errors, successfully bypassing state-of-the-art content filters. Finally, we propose RSP-M, a lightweight runtime monitor that calculates RSV metrics in real-time and triggers alerts when values exceed safety thresholds. Our work demonstrates that reasoning style is a distinct, exploitable vulnerability, necessitating process-level defenses beyond static content analysis.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yukun Ren, Siwei Yu, Kai Chen, Jianwei Ma
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
To address the steep learning curve and reliance on complex manual file editing and command-line operations in the traditional workflow of the mainstream open-source seismic wave simulation software SPECFEM, this paper proposes an intelligent, interactive workflow powered by Large Language Models (LLMs). We introduce the first Model Context Protocol (MCP) server suite for SPECFEM (supporting 2D, 3D Cartesian, and 3D Globe versions), which decomposes the entire simulation process into discrete, agent-executable tools spanning from parameter generation and mesh partitioning to solver execution and visualization. This approach enables a paradigm shift from file-driven to intent-driven conversational interactions. The framework supports both fully automated execution and human-in-the-loop collaboration, allowing researchers to guide simulation strategies in real time and retain scientific decision-making authority while significantly reducing tedious low-level operations. Validated through multiple case studies, the workflow operates seamlessly in both autonomous and interactive modes, yielding high-fidelity results consistent with standard baselines. As the first application of MCP technology to computational seismology, this study significantly lowers the entry barrier, enhances reproducibility, and offers a promising avenue for advancing computational geophysics toward AI-assisted and automated scientific research. The complete source code is available at https://github.com/RenYukun1563/specfem-mcp.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Gabriele Prato, Shagun Sodhani, Alessandro Sordoni, Sarath Chandar
Natural Language Processing
ๆฅ็ๆ่ฆ
The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models' capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.
๐
2025-12-16
๐ AAAI 2026
๐ฅ Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jia Hu, Junqi Li, Weimeng Lin, Peng Jia, Yuxiong Ji et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Vehicle Dispatching Systems (VDSs) are critical to the operational efficiency of Automated Container Terminals (ACTs). However, their widespread commercialization is hindered due to their low transferability across diverse terminals. This transferability challenge stems from three limitations: high reliance on port operational specialists, a high demand for terminal-specific data, and time-consuming manual deployment processes. Leveraging the emergence of Large Language Models (LLMs), this paper proposes PortAgent, an LLM-driven vehicle dispatching agent that fully automates the VDS transferring workflow. It bears three features: (1) no need for port operations specialists; (2) low need of data; and (3) fast deployment. Specifically, specialist dependency is eliminated by the Virtual Expert Team (VET). The VET collaborates with four virtual experts, including a Knowledge Retriever, Modeler, Coder, and Debugger, to emulate a human expert team for the VDS transferring workflow. These experts specialize in the domain of terminal VDS via a few-shot example learning approach. Through this approach, the experts are able to learn VDS-domain knowledge from a few VDS examples. These examples are retrieved via a Retrieval-Augmented Generation (RAG) mechanism, mitigating the high demand for terminal-specific data. Furthermore, an automatic VDS design workflow is established among these experts to avoid extra manual interventions. In this workflow, a self-correction loop inspired by the LLM Reflexion framework is created
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wentao Wan, Qiqing Lao, Zhiwei Xie, Hefeng Wu, Runnan Lin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Knowledge Editing (KE) is a field that studies how to modify some knowledge in Large Language Models (LLMs) at a low cost (compared to pre-training). Currently, performing large-scale edits on LLMs while ensuring the Reliability, Generality, and Locality metrics of the edits remain a challenge. This paper proposes a Massive editing approach for LLMs based on dynamic weight Generation (MeG). Our MeG involves attaching a dynamic weight neuron to specific layers of the LLMs and using a diffusion model to conditionally generate the weights of this neuron based on the input query required for the knowledge. This allows the use of adding a single dynamic weight neuron to achieve the goal of large-scale knowledge editing. Experiments show that our MeG can significantly improve the performance of large-scale KE in terms of Reliability, Generality, and Locality metrics compared to existing knowledge editing methods, particularly with a high percentage point increase in the absolute value index for the Locality metric, demonstrating the advantages of our proposed method.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huayang Li, Tianyu Zhao, Richard Sproat
Natural Language Processing
ๆฅ็ๆ่ฆ
In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. Drawing on Cognitive Load Theory (CLT), we argue that this uninformative structure increases extraneous cognitive load, consuming finite working memory capacity that should be allocated to deep reasoning and attention allocation. To address this, we propose RePo, a novel mechanism that reduces extraneous load via context re-positioning. Unlike standard approaches, RePo utilizes a differentiable module, $f_ฯ$, to assign token positions that capture contextual dependencies, rather than replying on pre-defined integer range. By continually pre-training on the OLMo-2 1B backbone, we demonstrate that RePo significantly enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Detailed analysis reveals that RePo successfully allocate higher attention to distant but relevant information, assign positions in dense and non-linear space, and capture the intrinsic structure of the input context. Our code is available at https://github.com/SakanaAI/repo.
๐
2025-12-16
๐ AAAI 2026
๐ฅ Nicholas Tagliapietra, Katharina Ensinger, Christoph Zimmer, Osman Mian
ๆฅ็ๆ่ฆ
Real world systems evolve in continuous-time according to their underlying causal relationships, yet their dynamics are often unknown. Existing approaches to learning such dynamics typically either discretize time -- leading to poor performance on irregularly sampled data -- or ignore the underlying causality. We propose CaDyT, a novel method for causal discovery on dynamical systems addressing both these challenges. In contrast to state-of-the-art causal discovery methods that model the problem using discrete-time Dynamic Bayesian networks, our formulation is grounded in Difference-based causal models, which allow milder assumptions for modeling the continuous nature of the system. CaDyT leverages exact Gaussian Process inference for modeling the continuous-time dynamics which is more aligned with the underlying dynamical process. We propose a practical instantiation that identifies the causal structure via a greedy search guided by the Algorithmic Markov Condition and Minimum Description Length principle. Our experiments show that CaDyT outperforms state-of-the-art methods on both regularly and irregularly-sampled data, discovering causal networks closer to the true underlying dynamics.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qizhi Wang
Machine Learning
ๆฅ็ๆ่ฆ
Cardinality estimation is a key bottleneck for cost-based query optimization, yet deployable improvements remain difficult: classical estimators miss correlations, while learned estimators often require workload-specific training pipelines and invasive integration into the optimizer. This paper presents TiCard, a low intrusion, correction-based framework that augments (rather than replaces) a database's native estimator. TiCard learns multiplicative residual corrections using EXPLAIN-only features, and uses EXPLAIN ANALYZE only for offline labels. We study two practical instantiations: (i) a Gradient Boosting Regressor for sub-millisecond inference, and (ii) TabPFN, an in-context tabular foundation model that adapts by refreshing a small reference set without gradient retraining. On TiDB with TPCH and the Join Order Benchmark, in a low-trace setting (263 executions total; 157 used for learning), TiCard improves operator-level tail accuracy substantially: P90 Q-error drops from 312.85 (native) to 13.69 (TiCard-GBR), and P99 drops from 37,974.37 to 3,416.50 (TiCard-TabPFN), while a join-only policy preserves near-perfect median behavior. We position TiCard as an AI4DB building block focused on deployability: explicit scope, conservative integration policies, and an integration roadmap from offline correction to in-optimizer use.
๐
2025-12-16
๐ AAAI 2026
๐ฅ Kanglong Fan, Yunqiao Yang, Chen Ma
Machine Learning
ๆฅ็ๆ่ฆ
Deep neural networks have demonstrated remarkable performance across various domains, yet their decision-making processes remain opaque. Although many explanation methods are dedicated to bringing the obscurity of DNNs to light, they exhibit significant limitations: post-hoc explanation methods often struggle to faithfully reflect model behaviors, while self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. To address these challenges, we propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training, which achieves two key advancements: 1) a fair allocation of the model prediction scores to image patches, ensuring explanations inherently align with the model's decision logic, and 2) enhanced interpretability with minor structural modifications, preserving model performance and compatibility. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art interpretability.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
ๆฅ็ๆ่ฆ
Recent progress in text-to-image diffusion models has transformed image editing via text prompts, yet this also introduces significant ethical challenges from potential misuse in creating deceptive or harmful content. While current defenses seek to mitigate this risk by embedding imperceptible perturbations, their effectiveness is limited against malicious tampering. To address this issue, we propose a Dual Attention-Guided Noise Perturbation (DANP) immunization method that adds imperceptible perturbations to disrupt the model's semantic understanding and generation process. DANP functions over multiple timesteps to manipulate both cross-attention maps and the noise prediction process, using a dynamic threshold to generate masks that identify text-relevant and irrelevant regions. It then reduces attention in relevant areas while increasing it in irrelevant ones, thereby misguides the edit towards incorrect regions and preserves the intended targets. Additionally, our method maximizes the discrepancy between the injected noise and the model's predicted noise to further interfere with the generation. By targeting both attention and noise prediction mechanisms, DANP exhibits impressive immunity against malicious edits, and extensive experiments confirm that our method achieves state-of-the-art performance.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher
Robotics
ๆฅ็ๆ่ฆ
The field of Language Reasoning Models (LRMs) has been very active over the past few years with advances in training and inference techniques enabling LRMs to reason longer, and more accurately. However, a growing body of studies show that LRMs are still inefficient, over-generating verification and reflection steps. To address this challenge, we introduce the Step-Tagging framework, a lightweight sentence-classifier enabling real-time annotation of the type of reasoning steps that an LRM is generating. To monitor reasoning behaviors, we introduced ReasonType: a novel taxonomy of reasoning steps. Building on this framework, we demonstrated that online monitoring of the count of specific steps can produce effective interpretable early stopping criteria of LRM inferences. We evaluate the Step-tagging framework on three open-source reasoning models across standard benchmark datasets: MATH500, GSM8K, AIME and non-mathematical tasks (GPQA and MMLU-Pro). We achieve 20 to 50\% token reduction while maintaining comparable accuracy to standard generation, with largest gains observed on more computation-heavy tasks. This work offers a novel way to increase control over the generation of LRMs, and a new tool to study behaviors of LRMs.
๐
2025-12-16
๐ Published in Journal of University Institute of Le
๐ฅ Sahibpreet Singh, Manjit Singh
Robotics
ๆฅ็ๆ่ฆ
AI revolutionizes transportation through autonomous vehicles (AVs) but introduces complex criminal liability issues regarding infractions. This study employs a comparative legal analysis of primary statutes, real-world liability claims, and academic literature across the US, Germany, UK, China, and India; jurisdictions selected for their technological advancement and contrasting regulatory approaches. The research examines the attribution of human error, AI moral agency, and the identification of primary offenders in AV incidents. Findings reveal fragmented regulatory landscapes: India and the US rely on loose networks of state laws, whereas the UK enacted the pioneering Automated and Electric Vehicles Act 2018. Germany enforces strict safety standards, distinguishing liability based on the vehicle's operating mode, while China similarly aims for a stringent liability regime. The study concludes that globally harmonized legal standards are essential to foster technological innovation while ensuring minimum risk and clear liability attribution.
๐
2025-12-16
๐ ArXiv Preprint (cs.CE)
๐ฅ Yanning Dai, Chenyu Tang, Ruizhi Zhang, Wenyu Yang, Yilan Zhang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Dynamic prediction of locomotor capacity after stroke is crucial for tailoring rehabilitation, yet current assessments provide only static impairment scores and do not indicate whether patients can safely perform specific tasks such as slope walking or stair climbing. Here, we develop a data-physics hybrid generative framework that reconstructs an individual stroke survivor's neuromuscular control from a single 20 m level-ground walking trial and predicts task-conditioned locomotion across rehabilitation scenarios. The system combines wearable-sensor kinematics, a proportional-derivative physics controller, a population Healthy Motion Atlas, and goal-conditioned deep reinforcement learning with behaviour cloning and generative adversarial imitation learning to generate physically plausible, patient-specific gait simulations for slopes and stairs. In 11 stroke survivors, the personalized controllers preserved idiosyncratic gait patterns while improving joint-angle and endpoint fidelity by 4.73% and 12.10%, respectively, and reducing training time to 25.56% relative to a physics-only baseline. In a multicentre pilot involving 21 inpatients, clinicians who used our locomotion predictions to guide task selection and difficulty obtained larger gains in Fugl-Meyer lower-extremity scores over 28 days of standard rehabilitation than control clinicians (mean change 6.0 versus 3.7 points). These findings indicate that our generative, task-predictive framework can augment clinical decision-making in post-stroke gait rehabilitation and provide a template for dynamically personalized motor recovery strategies.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuai Dong, Jie Zhang, Guoying Zhao, Shiguang Shan, Xilin Chen
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Text-guided image editing via diffusion models, while powerful, raises significant concerns about misuse, motivating efforts to immunize images against unauthorized edits using imperceptible perturbations. Prevailing metrics for evaluating immunization success typically rely on measuring the visual dissimilarity between the output generated from a protected image and a reference output generated from the unprotected original. This approach fundamentally overlooks the core requirement of image immunization, which is to disrupt semantic alignment with attacker intent, regardless of deviation from any specific output. We argue that immunization success should instead be defined by the edited output either semantically mismatching the prompt or suffering substantial perceptual degradations, both of which thwart malicious intent. To operationalize this principle, we propose Synergistic Intermediate Feature Manipulation (SIFM), a method that strategically perturbs intermediate diffusion features through dual synergistic objectives: (1) maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and (2) minimizing feature norms to induce perceptual degradations. Furthermore, we introduce the Immunization Success Rate (ISR), a novel metric designed to rigorously quantify true immunization efficacy for the first time. ISR quantifies the proportion of edits where immunization induces either semantic failure relative to the prompt or significant perceptual degradations, assessed via Multimodal Large Language Models (MLLMs). Extensive experiments show our SIFM achieves the state-of-the-art performance for safeguarding visual content against malicious diffusion-based manipulation.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
In regions of the Middle East and North Africa (MENA), there is a high demand for wastewater treatment plants (WWTPs), crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods like YOLOv8 segmentation require extensive manual labeling. But studies indicate that vision-language models (VLMs) are an efficient alternative to achieving equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams specifically to identify WWTPs. The YOLOv8 was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives). Evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), used to identify WWTP components such as circular/rectangular tanks, aeration basins and distinguish confounders via expert prompts producing JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and equal non-WWTP sites from field/AI data, as 600mx600m Geo-TIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs out-performing YOLOv8's true positive rate, with Gemma-3 highest. Results confirm that VLMs, particularly with zero-shot, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
๐
2025-12-16
๐ ArXiv Preprint (cs.NI)
๐ฅ Agrippina Mwangi, Leรณn Navarro-Hilfiker, Lukasz Brewka, Mikkel Gryning, Elena Fumagalli et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Stochastic disruptions such as flash events arising from benign traffic bursts and switch thermal fluctuations are major contributors to intermittent service degradation in software-defined industrial networks. These events violate IEC~61850-derived quality-of-service requirements and user-defined service-level agreements, hindering the reliable and timely delivery of control, monitoring, and best-effort traffic in IEC~61400-25-compliant wind power plants. Failure to maintain these requirements often results in delayed or lost control signals, reduced operational efficiency, and increased risk of wind turbine generator downtime.
To address these challenges, this study proposes a threshold-triggered Deep Q-Network self-healing agent that autonomically detects, analyzes, and mitigates network disruptions while adapting routing behavior and resource allocation in real time. The proposed agent was trained, validated, and tested on an emulated tri-clustered switch network deployed in a cloud-based proof-of-concept testbed.
Simulation results show that the proposed agent improves disruption recovery performance by 53.84% compared to a baseline shortest-path and load-balanced routing approach and outperforms state-of-the-art methods, including the Adaptive Network-based Fuzzy Inference System by 13.1% and the Deep Q-Network and traffic prediction-based routing optimization method by 21.5%, in a super-spine leaf data-plane architecture.
Additionally, the agent maintains switch thermal stability by proactively initiating external rack cooling when required. These findings highlight the potential of deep reinforcement learning in building resilience in software-defined industrial networks deployed in mission-critical, time-sensitive application scenarios.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Georgios Bouchouras, Dimitrios Doumanas, Andreas Soularidis, Konstantinos Kotis, George A. Vouros
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
This paper explores the integration of Large Language Models (LLMs) in the engineering of a Parkinson's Disease (PD) monitoring and alerting ontology through four key methodologies: One Shot (OS) prompt techniques, Chain of Thought (CoT) prompts, X-HCOME, and SimX-HCOME+. The primary objective is to determine whether LLMs alone can create comprehensive ontologies and, if not, whether human-LLM collaboration can achieve this goal. Consequently, the paper assesses the effectiveness of LLMs in automated ontology development and the enhancement achieved through human-LLM collaboration.
Initial ontology generation was performed using One Shot (OS) and Chain of Thought (CoT) prompts, demonstrating the capability of LLMs to autonomously construct ontologies for PD monitoring and alerting. However, these outputs were not comprehensive and required substantial human refinement to enhance their completeness and accuracy.
X-HCOME, a hybrid ontology engineering approach that combines human expertise with LLM capabilities, showed significant improvements in ontology comprehensiveness. This methodology resulted in ontologies that are very similar to those constructed by experts.
Further experimentation with SimX-HCOME+, another hybrid methodology emphasizing continuous human supervision and iterative refinement, highlighted the importance of ongoing human involvement. This approach led to the creation of more comprehensive and accurate ontologies.
Overall, the paper underscores the potential of human-LLM collaboration in advancing ontology engineering, particularly in complex domains like PD. The results suggest promising directions for future research, including the development of specialized GPT models for ontology construction.
๐
2025-12-16
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Marvin Kopka, Azeem Majeed, Gabriella Spinelli, Austen El-Osta, Markus Feufel
Natural Language Processing
ๆฅ็ๆ่ฆ
Artificial Intelligence tools such as large language models are increasingly used by the public to obtain health information and guidance. In health-related contexts, following or rejecting AI-generated advice can have direct clinical implications. Existing instruments like the Trust in Automated Systems Survey assess trustworthiness of generic technology, and no validated instrument measures users' trust in AI-generated health advice specifically. This study developed and validated the Trust in AI-Generated Health Advice (TAIGHA) scale and its four-item short form (TAIGHA-S) as theory-based instruments measuring trust and distrust, each with cognitive and affective components. The items were developed using a generative AI approach, followed by content validation with 10 domain experts, face validation with 30 lay participants, and psychometric validation with 385 UK participants who received AI-generated advice in a symptom-assessment scenario. After automated item reduction, 28 items were retained and reduced to 10 based on expert ratings. TAIGHA showed excellent content validity (S-CVI/Ave=0.99) and CFA confirmed a two-factor model with excellent fit (CFI=0.98, TLI=0.98, RMSEA=0.07, SRMR=0.03). Internal consistency was high (ฮฑ=0.95). Convergent validity was supported by correlations with the Trust in Automated Systems Survey (r=0.67/-0.66) and users' reliance on the AI's advice (r=0.37 for trust), while divergent validity was supported by low correlations with reading flow and mental load (all |r|<0.25). TAIGHA-S correlated highly with the full scale (r=0.96) and showed good reliability (ฮฑ=0.88). TAIGHA and TAIGHA-S are validated instruments for assessing user trust and distrust in AI-generated health advice. Reporting trust and distrust separately permits a more complete evaluation of AI interventions, and the short scale is well-suited for time-constrained settings.
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Panayiotis Smeros, Vincent Emonet, Ruijie Wang, Ana-Claudia Sima, Tarcisio Mendes de Farias
Natural Language Processing
ๆฅ็ๆ่ฆ
The advent of large language models is contributing to the emergence of novel approaches that promise to better tackle the challenge of generating structured queries, such as SPARQL queries, from natural language. However, these new approaches mostly focus on response accuracy over a single source while ignoring other evaluation criteria, such as federated query capability over distributed data stores, as well as runtime and cost to generate SPARQL queries. Consequently, they are often not production-ready or easy to deploy over (potentially federated) knowledge graphs with good accuracy. To mitigate these issues, in this paper, we extend our previous work and describe and systematically evaluate SPARQL-LLM, an open-source and triplestore-agnostic approach, powered by lightweight metadata, that generates SPARQL queries from natural language text. First, we describe its architecture, which consists of dedicated components for metadata indexing, prompt building, and query generation and execution. Then, we evaluate it based on a state-of-the-art challenge with multilingual questions, and a collection of questions from three of the most prevalent knowledge graphs within the field of bioinformatics. Our results demonstrate a substantial increase of 24% in the F1 Score on the state-of-the-art challenge, adaptability to high-resource languages such as English and Spanish, as well as ability to form complex and federated bioinformatics queries. Furthermore, we show that SPARQL-LLM is up to 36x faster than other systems participating in the challenge, while costing a maximum of $0.01 per question, making it suitable for real-time, low-cost text-to-SPARQL applications. One such application deployed over real-world decentralized knowledge graphs can be found at https://www.expasy.org/chat.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nick Leenders, Thomas Quadt, Boris Cule, Roy Lindelauf, Herman Monsuur et al.
Machine Learning
ๆฅ็ๆ่ฆ
Current Preferential Bayesian Optimization methods rely on Gaussian Processes (GPs) as surrogate models. These models are hard to interpret, struggle with handling categorical data, and are computationally complex, limiting their real-world usability. In this paper, we introduce an inherently interpretable decision tree-based surrogate model capable of handling both categorical and continuous data, and scalable to large datasets. Extensive numerical experiments on eight increasingly spiky optimization functions show that our model outperforms GP-based alternatives on spiky functions and has only marginally lower performance for non-spiky functions. Moreover, we apply our model to the real-world Sushi dataset and show its ability to learn an individual's sushi preferences. Finally, we show some initial work on using historical preference data to speed up the optimization process for new unseen users.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kelly J. Davis
Natural Language Processing
ๆฅ็ๆ่ฆ
Formal, automated theorem proving has long been viewed as a challenge to artificial intelligence. We introduce here a new approach to computer theorem proving, one that employs specialized language models for Lean4 proof generation combined with recursive decomposition of difficult theorems into simpler entailing propositions. These models are coordinated through a multi-agent architecture that orchestrates autoformalization (if required), proof generation, decomposition of difficult theorems into simpler entailing propositions, and recursive proof (and/or decomposition) of these propositions. Without decomposition, we achieve a 90.4% pass rate on miniF2F. With decomposition, this is significantly improved. A key technical contribution lies in our extension of the Kimina Lean Server with abstract syntax tree (AST) parsing capabilities to facilitate automated, recursive proof decomposition. The system is made available on PyPI as goedels-poetry (at https://pypi.org/project/goedels-poetry ), and the open-source implementation KellyJDavis/goedels-poetry (at https://github.com/KellyJDavis/goedels-poetry ) facilitates both adaptation to alternative language models and extension with custom functionality.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Salvatore Romano, Marco Grassia, Giuseppe Mangioni
Machine Learning
ๆฅ็ๆ่ฆ
Graph generation is a crucial task in many fields, including network science and bioinformatics, as it enables the creation of synthetic graphs that mimic the properties of real-world networks for various applications. Graph Generative Models (GGMs) have emerged as a promising solution to this problem, leveraging deep learning techniques to learn the underlying distribution of real-world graphs and generate new samples that closely resemble them. Examples include approaches based on Variational Auto-Encoders, Recurrent Neural Networks, and more recently, diffusion-based models. However, the main limitation often lies in the evaluation process, which typically relies on Maximum Mean Discrepancy (MMD) as a metric to assess the distribution of graph properties in the generated ensemble. This paper introduces a novel methodology for evaluating GGMs that overcomes the limitations of MMD, which we call RGM (Representation-aware Graph-generation Model evaluation). As a practical demonstration of our methodology, we present a comprehensive evaluation of two state-of-the-art Graph Generative Models: Graph Recurrent Attention Networks (GRAN) and Efficient and Degree-guided graph GEnerative model (EDGE). We investigate their performance in generating realistic graphs and compare them using a Geometric Deep Learning model trained on a custom dataset of synthetic and real-world graphs, specifically designed for graph classification tasks. Our findings reveal that while both models can generate graphs with certain topological properties, they exhibit significant limitations in preserving the structural characteristics that distinguish different graph domains. We also highlight the inadequacy of Maximum Mean Discrepancy as an evaluation metric for GGMs and suggest alternative approaches for future research.
๐
2025-12-16
๐ ArXiv Preprint (cs.SE)
๐ฅ Ruozhao Yang, Mingfei Cheng, Gelei Deng, Tianwei Zhang, Junjie Wang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Penetration testing is essential for assessing and strengthening system security against real-world threats, yet traditional workflows remain highly manual, expertise-intensive, and difficult to scale. Although recent advances in Large Language Models (LLMs) offer promising opportunities for automation, existing applications rely on simplistic prompting without task decomposition or domain adaptation, resulting in unreliable black-box behavior and limited insight into model capabilities across penetration testing stages. To address this gap, we introduce PentestEval, the first comprehensive benchmark for evaluating LLMs across six decomposed penetration testing stages: Information Collection, Weakness Gathering and Filtering, Attack Decision-Making, Exploit Generation and Revision. PentestEval integrates expert-annotated ground truth with a fully automated evaluation pipeline across 346 tasks covering all stages in 12 realistic vulnerable scenarios. Our stage-level evaluation of 9 widely used LLMs reveals generally weak performance and distinct limitations across the stages of penetration-testing workflow. End-to-end pipelines reach only 31% success rate, and existing LLM-powered systems such as PentestGPT, PentestAgent, and VulnBot exhibit similar limitations, with autonomous agents failing almost entirely. These findings highlight that autonomous penetration testing demands stronger structured reasoning, where modularization enhances each individual stage and improves overall performance. PentestEval provides the foundational benchmark needed for future research on fine-grained, stage-level evaluation, paving the way toward more reliable LLM-based automation.
๐
2025-12-16
๐ Provisionally accepted for publication in the Inte
๐ฅ Aneesha Fernando, Surangika Ranathunga, Kristin Stock, Raj Prasanna, Christopher B. Jones
Natural Language Processing
ๆฅ็ๆ่ฆ
Georeferencing text documents has typically relied on either gazetteer-based methods to assign geographic coordinates to place names, or on language modelling approaches that associate textual terms with geographic locations. However, many location descriptions specify positions relatively with spatial relationships, making geocoding based solely on place names or geo-indicative words inaccurate. This issue frequently arises in biological specimen collection records, where locations are often described through narratives rather than coordinates if they pre-date GPS. Accurate georeferencing is vital for biodiversity studies, yet the process remains labour-intensive, leading to a demand for automated georeferencing solutions. This paper explores the potential of Large Language Models (LLMs) to georeference complex locality descriptions automatically, focusing on the biodiversity collections domain. We first identified effective prompting patterns, then fine-tuned an LLM using Quantized Low-Rank Adaptation (QLoRA) on biodiversity datasets from multiple regions and languages. Our approach outperforms existing baselines with an average, across datasets, of 65% of records within a 10 km radius, for a fixed amount of training data. The best results (New York state) were 85% within 10km and 67% within 1km. The selected LLM performs well for lengthy, complex descriptions, highlighting its potential for georeferencing intricate locality descriptions.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation pipeline is a method for estimating problem difficulty. Current approaches, such as human calibration or performance-based scoring, fail to generalize to out-of-distribution problems, i.e. problems currently unsolvable by humans and LLMs, because they are not scalable, time-consuming, and ground truth dependent. Therefore, we propose a new method for estimating problem difficulty, LLM compare, that addresses these limitations. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed based on the outcomes. To validate our method, we first propose a conceptual framework that positions existing approaches on three orthogonal planes--construction, scale and dependence--identifying which quadrants a measure needs to occupy to score out-of-distribution problems. LLM compare naturally occupies all desirable quadrants as the first measure that is continuous and dynamic, model-agnostic and independent of ground truth information. As a second validation, we show that LLM compare demonstrates strong alignment with human annotations: Pearson $r \geq 0.80$ for $n=1876$. Thirdly, we show that LLM compare is robust to hallucinations, with less than $6\%$ degradation in Pearson correlation for $10\%$ noise injection. Our work represents a significant step towards replacing time-consuming human annotations and synthetic data generation, and will be an important driver for curriculum design, model evaluation, and AI-assisted research ideation.
๐
2025-12-16
๐ ArXiv Preprint (physics.bio-ph)
๐ฅ Mengxue Zhang, Qingrui Cai, Yinyin Chen, Hang Jin, Jianjun Zhou et al.
Machine Learning
ๆฅ็ๆ่ฆ
Physics-Informed Neural Networks (PINN) are emerging as a promising approach for quantitative parameter estimation of Magnetic Resonance Imaging (MRI). While existing deep learning methods can provide an accurate quantitative estimation of the T2 parameter, they still require large amounts of training data and lack theoretical support and a recognized gold standard. Thus, given the absence of PINN-based approaches for T2 estimation, we propose embedding the fundamental physics of MRI, the Bloch equation, in the loss of PINN, which is solely based on target scan data and does not require a pre-defined training database. Furthermore, by deriving rigorous upper bounds for both the T2 estimation error and the generalization error of the Bloch equation solution, we establish a theoretical foundation for evaluating the PINN's quantitative accuracy. Even without access to the ground truth or a gold standard, this theory enables us to estimate the error with respect to the real quantitative parameter T2. The accuracy of T2 mapping and the validity of the theoretical analysis are demonstrated on a numerical cardiac model and a water phantom, where our method exhibits excellent quantitative precision in the myocardial T2 range. Clinical applicability is confirmed in 94 acute myocardial infarction (AMI) patients, achieving low-error quantitative T2 estimation under the theoretical error bound, highlighting the robustness and potential of PINN.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Timo Klein, Thomas Lang, Andrii Shkabrii, Alexander Sturm, Kevin Sidak et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
The performance of reinforcement learning (RL) agents depends critically on the quality of the underlying feature representations. Hyperbolic feature spaces are well-suited for this purpose, as they naturally capture hierarchical and relational structure often present in complex RL environments. However, leveraging these spaces commonly faces optimization challenges due to the nonstationarity of RL. In this work, we identify key factors that determine the success and failure of training hyperbolic deep RL agents. By analyzing the gradients of core operations in the Poincarรฉ Ball and Hyperboloid models of hyperbolic geometry, we show that large-norm embeddings destabilize gradient-based training, leading to trust-region violations in proximal policy optimization (PPO). Based on these insights, we introduce Hyper++, a new hyperbolic PPO agent that consists of three components: (i) stable critic training through a categorical value loss instead of regression; (ii) feature regularization guaranteeing bounded norms while avoiding the curse of dimensionality from clipping; and (iii) using a more optimization-friendly formulation of hyperbolic network layers. In experiments on ProcGen, we show that Hyper++ guarantees stable learning, outperforms prior hyperbolic agents, and reduces wall-clock time by approximately 30%. On Atari-5 with Double DQN, Hyper++ strongly outperforms Euclidean and hyperbolic baselines. We release our code at https://github.com/Probabilistic-and-Interactive-ML/hyper-rl .
๐
2025-12-16
๐ ArXiv Preprint (Multimedia)
๐ฅ Emanuele Artioli, Farzad Tashtarian, Christian Timmerer
Machine Learning
ๆฅ็ๆ่ฆ
The primary challenge of video streaming is to balance high video quality with smooth playback. Traditional codecs are well tuned for this trade-off, yet their inability to use context means they must encode the entire video data and transmit it to the client. This paper introduces ELVIS (End-to-end Learning-based VIdeo Streaming Enhancement Pipeline), an end-to-end architecture that combines server-side encoding optimizations with client-side generative in-painting to remove and reconstruct redundant video data. Its modular design allows ELVIS to integrate different codecs, inpainting models, and quality metrics, making it adaptable to future innovations. Our results show that current technologies achieve improvements of up to 11 VMAF points over baseline benchmarks, though challenges remain for real-time applications due to computational demands. ELVIS represents a foundational step toward incorporating generative AI into video streaming pipelines, enabling higher quality experiences without increased bandwidth requirements.
๐
2025-12-16
๐ ArXiv Preprint (quant-ph)
๐ฅ Shaolun Ruan, Feng Liang, Rohan Ramakrishna, Chao Ren, Rudai Yan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Quantum Neural Networks (QNNs) represent a promising fusion of quantum computing and neural network architectures, offering speed-ups and efficient processing of high-dimensional, entangled data. A crucial component of QNNs is the encoder, which maps classical input data into quantum states. However, choosing suitable encoders remains a significant challenge, largely due to the lack of systematic guidance and the trial-and-error nature of current approaches. This process is further impeded by two key challenges: (1) the difficulty in evaluating encoded quantum states prior to training, and (2) the lack of intuitive methods for analyzing an encoder's ability to effectively distinguish data features. To address these issues, we introduce a novel visualization tool, XQAI-Eyes, which enables QNN developers to compare classical data features with their corresponding encoded quantum states and to examine the mixed quantum states across different classes. By bridging classical and quantum perspectives, XQAI-Eyes facilitates a deeper understanding of how encoders influence QNN performance. Evaluations across diverse datasets and encoder designs demonstrate XQAI-Eyes's potential to support the exploration of the relationship between encoder design and QNN effectiveness, offering a holistic and transparent approach to optimizing quantum encoders. Moreover, domain experts used XQAI-Eyes to derive two key practices for quantum encoder selection, grounded in the principles of pattern preservation and feature mapping.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque
Natural Language Processing
ๆฅ็ๆ่ฆ
Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local\_dialect:standard\_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76\% to 55\% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.
๐
2025-12-16
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yunhao Yao, Zhiqiang Wang, Haoran Cheng, Yihang Cheng, Haohua Du et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The rapid evolution of Large Language Models (LLMs) into autonomous agents has led to the adoption of the Model Context Protocol (MCP) as a standard for discovering and invoking external tools. While this architecture decouples the reasoning engine from tool execution to enhance scalability, it introduces a significant privacy surface: third-party MCP servers, acting as semi-honest intermediaries, can observe detailed tool interaction logs outside the user's trusted boundary. In this paper, we first identify and formalize a novel privacy threat termed Intent Inversion, where a semi-honest MCP server attempts to reconstruct the user's private underlying intent solely by analyzing legitimate tool calls. To systematically assess this vulnerability, we propose IntentMiner, a framework that leverages Hierarchical Information Isolation and Three-Dimensional Semantic Analysis, integrating tool purpose, call statements, and returned results, to accurately infer user intent at the step level. Extensive experiments demonstrate that IntentMiner achieves a high degree of semantic alignment (over 85%) with original user queries, significantly outperforming baseline approaches. These results highlight the inherent privacy risks in decoupled agent architectures, revealing that seemingly benign tool execution logs can serve as a potent vector for exposing user secrets.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent reasoning based medical MLLMs have made progress in generating step by step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. Datasets, codes, and trained models will be released publicly.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhijie Zhong, Zhiwen Yu, Pengyu Li, Jianming Lv, C. L. Philip Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Radio path loss prediction (RPP) is critical for optimizing 5G networks and enabling IoT, smart city, and similar applications. However, current deep learning-based RPP methods lack proactive environmental modeling, struggle with realistic multi-transmitter scenarios, and generalize poorly under distribution shifts, particularly when training/testing environments differ in building density or transmitter configurations. This paper identifies three key issues: (1) passive environmental modeling that overlooks transmitters and key environmental features; (2) overemphasis on single-transmitter scenarios despite real-world multi-transmitter prevalence; (3) excessive focus on in-distribution performance while neglecting distribution shift challenges. To address these, we propose PathFinder, a novel architecture that actively models buildings and transmitters via disentangled feature encoding and integrates Mask-Guided Low-rank Attention to independently focus on receiver and building regions. We also introduce a Transmitter-Oriented Mixup strategy for robust training and a new benchmark, single-to-multi-transmitter RPP (S2MT-RPP), tailored to evaluate extrapolation performance (multi-transmitter testing after single-transmitter training). Experimental results show PathFinder outperforms state-of-the-art methods significantly, especially in challenging multi-transmitter scenarios. Our code and project site are available at: https://emorzz1g.github.io/PathFinder/.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hanning Chen, Keyu Man, Kevin Zhu, Chenguang Zhu, Haonan Li et al.
Machine LearningNatural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Identifying and addressing performance anti-patterns in machine learning (ML) models is critical for efficient training and inference, but it typically demands deep expertise spanning system infrastructure, ML models and kernel development. While large tech companies rely on dedicated ML infrastructure engineers to analyze torch traces and benchmarks, such resource-intensive workflows are largely inaccessible to computer vision researchers in general. Among the challenges, pinpointing problematic trace segments within lengthy execution traces remains the most time-consuming task, and is difficult to automate with current ML models, including LLMs. In this work, we present the first benchmark dataset specifically designed to evaluate and improve ML models' ability to detect anti patterns in traces. Our dataset contains over 600 PyTorch traces from diverse computer vision models classification, detection, segmentation, and generation collected across multiple hardware platforms. We also propose a novel iterative approach: a lightweight ML model first detects trace segments with anti patterns, followed by a large language model (LLM) for fine grained classification and targeted feedback. Experimental results demonstrate that our method significantly outperforms unsupervised clustering and rule based statistical techniques for detecting anti pattern regions. Our method also effectively compensates LLM's limited context length and reasoning inefficiencies.
๐
2025-12-16
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ So Kuroki, Manami Nakagawa, Shigeo Yoshida, Yuki Koyama, Kozuno Tadashi
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Many real-world tasks, such as trip planning or meal planning, can be formulated as combinatorial optimization problems. However, using optimization solvers is difficult for end users because it requires problem instantiation: defining candidate items, assigning preference scores, and specifying constraints. We introduce LAPPI (LLM-Assisted Preference-based Problem Instantiation), an interactive approach that uses large language models (LLMs) to support users in this instantiation process. Through natural language conversations, the system helps users transform vague preferences into well-defined optimization problems. These instantiated problems are then passed to existing optimization solvers to generate solutions. In a user study on trip planning, our method successfully captured user preferences and generated feasible plans that outperformed both conventional and prompt-engineering approaches. We further demonstrate LAPPI's versatility by adapting it to an additional use case.
๐
2025-12-16
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Amirmohammad Pasdar, Toby Murray, Van-Thuan Pham
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce UIXPOSE, a source-code-agnostic framework that operates on both compiled and open-source apps. This framework applies Intention Behaviour Alignment (IBA) to mobile malware analysis, aligning UI-inferred intent with runtime semantics. Previous work either infers intent statically, e.g., permission-centric, or widget-level or monitors coarse dynamic signals (endpoints, partial resource usage) that miss content and context. UIXPOSE infers an intent vector from each screen using vision-language models and knowledge structures and combines decoded network payloads, heap/memory signals, and resource utilisation traces into a behaviour vector. Their alignment, calculated at runtime, can both detect misbehaviour and highlight exploration of behaviourally rich paths. In three real-world case studies, UIXPOSE reveals covert exfiltration and hidden background activity that evade metadata-only baselines, demonstrating how IBA improves dynamic detection.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenbo Tian, Ruting Lin, Hongxian Zheng, Yaodong Yang, Geng Wu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing intelligent sports analysis systems mainly focus on "scoring and visualization," often lacking automatic performance diagnosis and interpretable training guidance. Recent advances of Large Language Models (LMMs) and motion analysis techniques provide new opportunities to address the above limitations. In this paper, we propose SportsGPT, an LLM-driven framework for interpretable sports motion assessment and training guidance, which establishes a closed loop from motion time-series input to professional training guidance. First, given a set of high-quality target models, we introduce MotionDTW, a two-stage time series alignment algorithm designed for accurate keyframe extraction from skeleton-based motion sequences. Subsequently, we design a Knowledge-based Interpretable Sports Motion Assessment Model (KISMAM) to obtain a set of interpretable assessment metrics (e.g., insufficient extension) by constrasting the keyframes with the targe models. Finally, we propose SportsRAG, a RAG-based training guidance model based on Qwen3. Leveraging a 6B-token knowledge base, it prompts the LLM to generate professional training guidance by retrieving domain-specific QA pairs. Experimental results demonstrate that MotionDTW significantly outperforms traditional methods with lower temporal error and higher IoU scores. Furthermore, ablation studies validate the KISMAM and SportsRAG, confirming that SportsGPT surpasses general LLMs in diagnostic accuracy and professionalism.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chunan Tong
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Supply chain management (SCM) faces significant challenges like demand fluctuations and the bullwhip effect. Traditional methods and even state-of-the-art LLMs struggle with benchmarks like the Vending Machine Test, failing to handle SCM's complex continuous time-series data. While ML approaches like LSTM and XGBoost offer solutions, they are often limited by computational inefficiency. Liquid Neural Networks (LNN), known for their adaptability and efficiency in robotics, remain untapped in SCM. This study proposes a hybrid LNN+XGBoost model for multi-tier supply chains. By combining LNN's dynamic feature extraction with XGBoost's global optimization, the model aims to minimize the bullwhip effect and increase profitability. This innovative approach addresses the need for efficiency and adaptability, filling a critical gap in intelligent SCM.
๐
2025-12-16
๐ Supplementary materials, datasets, and implementat
๐ฅ Ijaz Ul Haq, Byung Suk Lee, Julia N. Perdrial, David Baude
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Real-time streamflow monitoring networks generate millions of observations annually, yet maintaining data quality across thousands of remote sensors remains labor-intensive. We introduce HydroGEM (Hydrological Generalizable Encoder for Monitoring), a foundation model for continental-scale streamflow quality control. HydroGEM uses two-stage training: self-supervised pretraining on 6.03 million sequences from 3,724 USGS stations learns hydrological representations, followed by fine-tuning with synthetic anomalies for detection and reconstruction. A hybrid TCN-Transformer architecture (14.2M parameters) captures local temporal patterns and long-range dependencies, while hierarchical normalization handles six orders of magnitude in discharge. On held-out synthetic tests comprising 799 stations with 18 expert-validated anomaly types, HydroGEM achieves F1 = 0.792 for detection and 68.7% reconstruction-error reduction, a 36.3% improvement over existing methods. Zero-shot transfer to 100 Environment and Climate Change Canada stations yields F1 = 0.586, exceeding all baselines and demonstrating cross-national generalization. The model maintains consistent detection across correction magnitudes and aligns with operational seasonal patterns. HydroGEM is designed for human-in-the-loop workflows - outputs are quality control suggestions requiring expert review, not autonomous corrections.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof
MultimodalNatural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMS). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Felix Holm, Ghazal Ghazaei, Nassir Navab
Machine Learning
ๆฅ็ๆ่ฆ
Purpose: Detailed surgical recognition is critical for advancing AI-assisted surgery, yet progress is hampered by high annotation costs, data scarcity, and a lack of interpretable models. While scene graphs offer a structured abstraction of surgical events, their full potential remains untapped. In this work, we introduce ProtoFlow, a novel framework that learns dynamic scene graph prototypes to model complex surgical workflows in an interpretable and robust manner.
Methods: ProtoFlow leverages a graph neural network (GNN) encoder-decoder architecture that combines self-supervised pretraining for rich representation learning with a prototype-based fine-tuning stage. This process discovers and refines core prototypes that encapsulate recurring, clinically meaningful patterns of surgical interaction, forming an explainable foundation for workflow analysis.
Results: We evaluate our approach on the fine-grained CAT-SG dataset. ProtoFlow not only outperforms standard GNN baselines in overall accuracy but also demonstrates exceptional robustness in limited-data, few-shot scenarios, maintaining strong performance when trained on as few as one surgical video. Our qualitative analyses further show that the learned prototypes successfully identify distinct surgical sub-techniques and provide clear, interpretable insights into workflow deviations and rare complications.
Conclusion: By uniting robust representation learning with inherent explainability, ProtoFlow represents a significant step toward developing more transparent, reliable, and data-efficient AI systems, accelerating their potential for clinical adoption in surgical training, real-time decision support, and workflow optimization.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Taig Singh, Shreshth Rajan, Nikhil Iyer
Machine Learning
ๆฅ็ๆ่ฆ
As modern neural networks become increasingly memory-bound, inference throughput is limited by DRAM bandwidth rather than compute. We present Arithmetic-Intensity-Aware Quantization (AIQ), a mixed precision quantization framework that chooses per-layer bit-widths to maximize arithmetic intensity (AI) while minimizing accuracy loss. AIQ is a post-training quantization method that uses search algorithms over per-layer quantization schemes to minimize a weighted loss over AI and accuracy. On ResNet-20/CIFAR-10, AIQ increases AI by ~50% over an FP32 baseline while keeping test accuracy within ~1 percentage point, and outperforming global uniform quantization schemes. On a memory-bound MobileNetV2 architecture, AIQ configurations give a 1.66x higher throughput than the FP32 baseline while keeping test accuracy within 1 percentage point. We also find that AIQ naturally quantizes larger layers more aggressively.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day comparable to ScatterMoE's 225 billion tokens per day on 96 H100s for a 7B MoE model training with FSDP-2 using the lm-engine codebase. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup on kernel execution time compared to vanilla top-$K$ routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mayank Singh, Vikas Yadav, Shiva Krishna Reddy Malay, Shravan Nayak, Sai Rajeswar et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic search for Multi-Agent Systems has recently emerged as a key focus in agentic AI research. Several prior approaches have relied on LLM-based free-form search over the code space. In this work, we propose a more structured framework that explores the same space through a fixed set of simple, composable components. We show that, despite lacking the generative flexibility of LLMs during the candidate generation stage, our method outperforms prior approaches on four out of five benchmarks across two domains: mathematics and question answering. Furthermore, our method offers additional advantages, including a more cost-efficient search process and the generation of modular, interpretable multi-agent systems with simpler logic.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Junjie Ma, Jinlong Li
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Inference with modern Large Language Models (LLMs) is expensive and slow, and speculative sampling has emerged as an effective solution to this problem, however, the number of the calls to the draft model for generating candidate tokens in speculative sampling is a preset hyperparameter, lacking flexibility. To generate and utilize the candidate tokens more effectively, we propose RADAR, a novel speculative sampling method with RL-based dynamic draft trees. RADAR formulates the draft tree generation process as a Markov Decision Process (MDP) and employs offline reinforcement learning to train a prediction model, which enables real-time decision on the calls to the draft model, reducing redundant computations and further accelerating inference. Evaluations across three LLMs and four tasks show that RADAR achieves a speedup of 3.17x-4.82x over the auto-regressive decoding baseline. The code is available at https://github.com/minaduki-sora/RADAR.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present \textbf{SDAR-VL}, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an \emph{integrated framework for efficient and stable training}. This framework unifies three components: (1) \textbf{Asynchronous Block-wise Noise Scheduling} to diversify supervision within each batch; (2) \textbf{Effective Mask Ratio Scaling} for unbiased loss normalization under stochastic masking; and (3) a \textbf{Progressive Beta Noise Curriculum} that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves \emph{training efficiency}, \emph{convergence stability}, and \emph{task performance} over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zulin Zhuang, Yu Bian
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Daylight-linked controls (DLCs) have significant potential for energy savings in buildings, especially when abundant daylight is available and indoor workplane illuminance can be accurately predicted in real time. Most existing studies on indoor daylight predictions were developed and tested for static scenes. This study proposes a multimodal deep learning framework that predicts indoor workplane illuminance distributions in real time from non-intrusive images with temporal-spatial features. By extracting image features only from the side-lit window areas rather than interior pixels, the approach remains applicable in dynamically occupied indoor spaces. A field experiment was conducted in a test room in Guangzhou (China), where 17,344 samples were collected for model training and validation. The model achieved R2 > 0.98 with RMSE < 0.14 on the same-distribution test set and R2 > 0.82 with RMSE < 0.17 on an unseen-day test set, indicating high accuracy and acceptable temporal generalization.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kim Sung-Bin, Joohyun Chang, David Harwath, Tae-Hyun Oh
Natural Language Processing
ๆฅ็ๆ่ฆ
Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip synchronization. To address the lack of a standard editing benchmark, we introduce FacEDiTBench, the first dataset for talking face editing, featuring diverse edit types and lengths, along with new evaluation metrics. Extensive experiments validate that talking face editing and generation emerge as subtasks of speech-conditional motion infilling; FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity while generalizing effectively to talking face generation.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box--characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA--covering over 120 training datasets across multiple domains on 22 benchmarks, validated by more than 600 training runs and 40 million processed data points--reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.
๐
2025-12-16
๐ AAAI
๐ฅ Shen Li, Li Huang, Shaoxiong Zhan, Weifeng Sun, Tao Yin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) exhibit strong generative capabilities and have shown great potential in code generation. Existing chain-of-thought (CoT) prompting methods enhance model reasoning by eliciting intermediate steps, but suffer from two major limitations: First, their uniform application tends to induce overthinking on simple tasks. Second, they lack intention abstraction in code generation, such as explicitly modeling core algorithmic design and efficiency, leading models to focus on surface-level structures while neglecting the global problem objective. Inspired by the cognitive economy principle of engaging structured reasoning only when necessary to conserve cognitive resources, we propose RoutingGen, a novel difficulty-aware routing framework that dynamically adapts prompting strategies for code generation. For simple tasks, it adopts few-shot prompting; for more complex ones, it invokes a structured reasoning strategy, termed Intention Chain-of-Thought (ICoT), which we introduce to guide the model in capturing task intention, such as the core algorithmic logic and its time complexity. Experiments across three models and six standard code generation benchmarks show that RoutingGen achieves state-of-the-art performance in most settings, while reducing total token usage by 46.37% on average across settings. Furthermore, ICoT outperforms six existing prompting baselines on challenging benchmarks.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenguo Zhang, Haohan Zhen, Yishen Wang, Le Xu, Tianchen Deng et al.
Machine LearningMultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning.While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels.Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is an Reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Enhong Liu, Haiyu Yang, Miel Hostens
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLM) hold potential to support dairy scholars and farmers by supporting decision-making and broadening access to knowledge for stakeholders with limited technical expertise. However, the substantial computational demand restricts access to LLM almost exclusively through cloud-based service, which makes LLM-based decision support tools impractical for dairy farming. To address this gap, lightweight alternatives capable of running locally on farm hardware are required. In this work, we benchmarked 20 open-source Small Language Models (SLM) available on HuggingFace under farm-realistic computing constraints. Building on our prior work, we developed an agentic AI system that integrates five task-specific agents: literature search, web search, SQL database interaction, NoSQL database interaction, and graph generation following predictive models. Evaluation was conducted in two phases. In the first phase, five test questions were used for the initial screening to identify models capable of following basic dairy-related instructions and performing reliably in a compute-constrained environment. Models that passed this preliminary stage were then evaluated using 30 questions (five per task category mentioned above, plus one category addressing integrity and misconduct) in phase two. In results, Qwen-4B achieved superior performance across most of task categories, although showed unstable effectiveness in NoSQL database interactions through PySpark. To our knowledge, this is the first work explicitly evaluating the feasibility of SLM as engines for dairy farming decision-making, with central emphases on privacy and computational efficiency. While results highlight the promise of SLM-assisted tools for practical deployment in dairy farming, challenges remain, and fine-tuning is still needed to refine SLM performance in dairy-specific questions.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ignacio Alzugaray, Marwan Taher, Andrew J. Davison
ๆฅ็ๆ่ฆ
We present a novel neural RGB-D Simultaneous Localization And Mapping (SLAM) system that learns an implicit map of the scene in real time. For the first time, we explore the use of Scene Coordinate Regression (SCR) as the core implicit map representation in a neural SLAM pipeline, a paradigm that trains a lightweight network to directly map 2D image features to 3D global coordinates. SCR networks provide efficient, low-memory 3D map representations, enable extremely fast relocalization, and inherently preserve privacy, making them particularly suitable for neural implicit SLAM.
Our system is the first one to achieve strict real-time in neural implicit RGB-D SLAM by relying on a SCR-based representation. We introduce a novel SCR architecture specifically tailored for this purpose and detail the critical design choices required to integrate SCR into a live SLAM pipeline. The resulting framework is simple yet flexible, seamlessly supporting both sparse and dense features, and operates reliably in dynamic environments without special adaptation. We evaluate our approach on established synthetic and real-world benchmarks, demonstrating competitive performance against the state of the art. Project Page: https://github.com/ialzugaray/ace-slam
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Zhaofeng Hu, Hongrui Yu, Vaidhyanathan Chandramouli, Ci-Jyun Liang
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
This study evaluates two leading approaches for teaching construction robots new skills to understand their applicability for construction automation: a Vision-Language-Action (VLA) model and Reinforcement Learning (RL) methods. The goal is to understand both task performance and the practical effort needed to deploy each approach on real jobs. The authors developed two teleoperation interfaces to control the robots and collect the demonstrations needed, both of which proved effective for training robots for long-horizon and dexterous tasks. In addition, the authors conduct a three-stage evaluation. First, the authors compare a Multi-Layer Perceptron (MLP) policy with a Deep Q-network (DQN) imitation model to identify the stronger RL baseline, focusing on model performance, generalization, and a pick-up experiment. Second, three different VLA models are trained in two different scenarios and compared with each other. Third, the authors benchmark the selected RL baseline against the VLA model using computational and sample-efficiency measures and then a robot experiment on a multi-stage panel installation task that includes transport and installation. The VLA model demonstrates strong generalization and few-shot capability, achieving 60% and 100% success in the pickup phase. In comparison, DQN can be made robust but needs additional noise during tuning, which increases the workload. Overall, the findings indicate that VLA offers practical advantages for changing tasks by reducing programming effort and enabling useful performance with minimal data, while DQN provides a viable baseline when sufficient tuning effort is acceptable.
๐
2025-12-16
๐ ArXiv Preprint (cs.SE)
๐ฅ Jiuding Yang, Shengyao Lu, Hongxuan Liu, Shayan Shirahmad Gale Bagi, Zahra Fazel et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved remarkable progress in automatic code generation, yet their ability to produce high-performance code remains limited--a critical requirement in real-world software systems. We argue that current LLMs struggle not only due to data scarcity but, more importantly, because they lack supervision that guides interpretable and effective performance improvements. In this work, we introduce PerfCoder, a family of LLMs specifically designed to generate performance-enhanced code from source code via interpretable, customized optimizations. PerfCoder is fine-tuned on a curated collection of real-world optimization trajectories with human-readable annotations, and preference-aligned by reinforcement fine-tuning using runtime measurements, enabling it to propose input-specific improvement strategies and apply them directly without relying on iterative refinement. On the PIE code performance benchmark, PerfCoder surpasses all existing models in both runtime speedup and effective optimization rate, demonstrating that performance optimization cannot be achieved by scale alone but requires optimization stratetgy awareness. In addition, PerfCoder can generate interpretable feedback about the source code, which, when provided as input to a larger LLM in a planner-and-optimizer cooperative workflow, can further improve outcomes. Specifically, we elevate the performance of 32B models and GPT-5 to new levels on code optimization, substantially surpassing their original performance.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zongyao Li, Kengo Ishida, Satoshi Yamazaki, Xiaotong Ji, Jianquan Liu
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We propose KFS-Bench, the first benchmark for key frame sampling in long video question answering (QA), featuring multi-scene annotations to enable direct and robust evaluation of sampling strategies. Key frame sampling is crucial for efficient long-form video understanding. In long video QA, selecting informative frames enables multimodal large language models (MLLMs) to improve both accuracy and efficiency. KFS-Bench addresses the limitation of prior works that only indirectly assess frame selection quality via QA accuracy. By providing ground-truth annotations of multiple disjoint scenes required per question, KFS-Bench allows us to directly analyze how different sampling approaches capture essential content across an entire long video. Using KFS-Bench, we conduct a comprehensive study of key frame sampling methods and identify that not only sampling precision but also scene coverage and sampling balance are the key factors influencing QA performance. Regarding all the factors, we design a novel sampling quality metric that correlates with QA accuracy. Furthermore, we develop a novel key frame sampling method that leverages question-video relevance to balance sampling diversity against question-frame similarity, thereby improving coverage of relevant scenes. Our adaptively balanced sampling approach achieves superior performance in both key frame sampling and QA performance. The benchmark is available at https://github.com/NEC-VID/KFS-Bench.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, where state transitions are described in natural language rather than predicting raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large-scale dataset consisting of 1.4M samples, that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset is available at https://github.com/jacklishufan/MobileWorld
๐
2025-12-16
๐ ArXiv Preprint (cs.SE)
๐ฅ Ruanqianqian Huang, Avery Reyna, Sorin Lerner, Haijun Xia, Brian Hempel
Robotics
ๆฅ็ๆ่ฆ
The rise of AI agents is transforming how software can be built. The promise of agents is that developers might write code quicker, delegate multiple tasks to different agents, and even write a full piece of software purely out of natural language. In reality, what roles agents play in professional software development remains in question. This paper investigates how experienced developers use agents in building software, including their motivations, strategies, task suitability, and sentiments. Through field observations (N=13) and qualitative surveys (N=99), we find that while experienced developers value agents as a productivity boost, they retain their agency in software design and implementation out of insistence on fundamental software quality attributes, employing strategies for controlling agent behavior leveraging their expertise. In addition, experienced developers feel overall positive about incorporating agents into software development given their confidence in complementing the agents' limitations. Our results shed light on the value of software development best practices in effective use of agents, suggest the kinds of tasks for which agents may be suitable, and point towards future opportunities for better agentic interfaces and agentic use guidelines.
๐
2025-12-16
๐ ArXiv Preprint (cs.SD)
๐ฅ Qilin Li, C. L. Philip Chen, TongZhang
ๆฅ็ๆ่ฆ
Music Emotion Recogniser (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo labels employing curriculum-based temperature scheduling and consistency quantification using Jensen Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER's state-of-the-art performance, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module's contribution. Both the dataset and source code are publicly available.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we utilize a Proportional-Integral (PI) Controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization mechanism that adapts layer-wise routing logits, allowing different layers to learn distinct expert-selection patterns while utilizing a global probability threshold. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. Our analysis confirms that DTop-p maintains precise control over the number of activated experts while adaptively allocating resources across different tokens and layers. Furthermore, DTop-p exhibits strong scaling properties with respect to expert granularity, expert capacity, model size, and dataset size, offering a robust framework for large-scale MoE pre-training.
๐
2025-12-16
๐ NeurIPS
๐ฅ Ge Yan, Chung-En Sun, Tsui-Wei, Weng
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) with Chain-of-Thought (CoT) reasoning have achieved strong performance across diverse tasks, including mathematics, coding, and general reasoning. A distinctive ability of these reasoning models is self-reflection: the ability to review and revise previous reasoning steps. While self-reflection enhances reasoning performance, it also increases inference cost. In this work, we study self-reflection through the lens of representation engineering. We segment the model's reasoning into steps, identify the steps corresponding to reflection, and extract a reflection direction in the latent space that governs this behavior. Using this direction, we propose a stepwise steering method that can control reflection frequency. We call our framework ReflCtrl. Our experiments show that (1) in many cases reflections are redundant, especially in stronger models (in our experiments, we can save up to 33.6 percent of reasoning tokens while preserving performance), and (2) the model's reflection behavior is highly correlated with an internal uncertainty signal, implying self-reflection may be controlled by the model's uncertainty.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yang Cao, Yubin Chen, Xuyang Guo, Zhao Song, Song Yue et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of large language models (LLMs) has led to significant breakthroughs in automated mathematical reasoning and scientific discovery. Georgiev, G${รณ}$mez-Serrano, Tao, and Wagner [GGSTW+25] demonstrate that AI systems can explore new constructions and improve existing bounds, illustrating the growing potential of LLMs to accelerate mathematical discovery. Similarly, Bubeck et al. [BCE+25] show that GPT-5 can meaningfully contribute to scientific workflows, from proposing hypotheses to generating proofs and analyses. Despite these advances, a rigorous evaluation of these models on canonical, graduate-level mathematical theory remains necessary to understand their baseline reasoning capabilities. In this paper, we present a comprehensive benchmark of four frontier models: GPT-5-Thinking, Gemini-3-Pro, Claude-Sonnet-4.5-Thinking, and Grok-4 against the classic curriculum of Randomized Algorithms by Motwani and Raghavan [MR95].
We tasked each model with generating formal LaTeX proofs for a series of lemmas and exercises spanning the textbook. We find that while the top-tier models (Gemini, and Claude) achieve a high accuracy rate (approx. 66%), demonstrating a robust grasp of probabilistic method and formal logic, other models lag significantly in consistency (approx. 40%). We provide a qualitative analysis of the generated proofs, highlighting differences in conciseness, hallucination rates, and logical structure. Our results suggest that while frontier models have reached a threshold of proficiency suitable for graduate-level pedagogical assistance and formalization, significant variance exists in their reliability for rigorous mathematical derivation. The code and the full set of LLM-generated responses are open-sourced and publicly available at https://github.com/magiclinux/math_benchmark_probability.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan et al.
ๆฅ็ๆ่ฆ
The core challenge for streaming video generation is maintaining the content consistency in long context, which poses high requirement for the memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different to-generate video chunks should refer to different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the coming chunk, we dynamically update the memory bank by retrieving the most relevant historical frames with the text prompt of this chunk. This design enables narrative coherence even if new event happens or scenario switches in future frames. In addition, during generation, we only activate the most relevant tokens in the memory bank for each query in the attention layers, which effectively guarantees the generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computation burden (7.9% speed reduction compared with the memory-free baseline) and keeps the compatibility with any streaming video generation model with KV cache.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs with state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All codes, data, and models will be released to facilitate future research.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli et al.
Computer Vision
ๆฅ็ๆ่ฆ
Non-parametric quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find the Leech lattice-based quantization method, which is dubbed as Spherical Leech Quantization ($ฮ_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically-plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2\% to 6.9\% on human-centric video benchmarks (EMDB, PROX), while delivering a 43\% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP's ability to generate physically-valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper present an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou et al.
Natural Language ProcessingComputer VisionRobotics
ๆฅ็ๆ่ฆ
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
๐
2025-12-16
๐ NeurIPS 2025
๐ฅ Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng et al.
Machine Learning
ๆฅ็ๆ่ฆ
We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization takes various training losses robust to artifacts and limited pose coverage in the generated training data. Our experiment shows that VASA-3D produces realistic 3D talking heads that cannot be achieved by prior art, and it supports the online generation of 512x512 free-viewpoint videos at up to 75 FPS, facilitating more immersive engagements with lifelike 3D avatars.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zizhang Li, Cheng Zhang, Zhengqin Li, Henry Howard-Jenkins, Zhaoyang Lv et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Zechen Bai, Chen Gao, Mike Zheng Shou
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through environmental interaction, which is akin to how humans master skills through practice. Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models, yet remain fundamentally limited by Supervised Finetuning (SFT): requiring hundreds of demonstrations per task, rigidly memorizing trajectories, and failing to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations. The key technical challenge is replacing oracle reward signals (unavailable at test time) with autonomous feedback. We address this through a learned progress estimator providing dense feedback, and critically, we design our framework to ``tame'' this inherently noisy signal via two mechanisms: (1) an accumulative progress estimation mechanism smoothing noisy point-wise estimates, and (2) a progressive horizon extension strategy enabling gradual policy evolution. EVOLVE-VLA achieves substantial gains: +8.6\% on long-horizon tasks, +22.0\% in 1-shot learning, and enables cross-task generalization -- achieving 20.8\% success on unseen tasks without task-specific demonstrations training (vs. 0\% for pure SFT). Qualitative analysis reveals emergent capabilities absent in demonstrations, including error recovery and novel strategies. This work represents a critical step toward VLAs that truly learn and adapt, moving beyond static imitation toward continuous self-improvements.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marco Blanchini, Giovanna Maria Dimitri, Benedetta Tondi, Tarcisio Lancioni, Mauro Barni
Natural Language Processing
ๆฅ็ๆ่ฆ
Visual Sentiment Analysis (VSA) is a challenging task due to the vast diversity of emotionally salient images and the inherent difficulty of acquiring sufficient data to capture this variability comprehensively. Key obstacles include building large-scale VSA datasets and developing effective methodologies that enable algorithms to identify emotionally significant elements within an image. These challenges are reflected in the limited generalization performance of VSA algorithms and models when trained and tested across different datasets. Starting from a pool of existing data collections, our approach enables the creation of a new larger dataset that not only contains a wider variety of images than the original ones, but also permits training new models with improved capability to focus on emotionally relevant combinations of image elements. This is achieved through the integration of the semiotic isotopy concept within the dataset creation process, providing deeper insights into the emotional content of images. Empirical evaluations show that models trained on a dataset generated with our method consistently outperform those trained on the original data collections, achieving superior generalization across major VSA benchmarks
๐
2025-12-16
๐ ArXiv Preprint (physics.ao-ph)
๐ฅ Gabriele Accarino, Viviana Acquaviva, Sara Shamekh, Duncan Watson-Parris, David Lawrence
Robotics
ๆฅ็ๆ่ฆ
We introduce WaveSim, a multi-scale similarity metric for the evaluation of spatial fields in weather and climate applications. WaveSim exploits wavelet transforms to decompose input fields into scale-specific wavelet coefficients. The metric is built by multiplying three orthogonal components derived from these coefficients: Magnitude, which quantifies similarities in the energy distribution of the coefficients, i.e., the intensity of the field; Displacement, which captures spatial shift by comparing the centers of mass of normalized energy distributions; and Structure, which assesses pattern organization independent of location and amplitude. Each component yields a scale-specific similarity score ranging from 0 (no similarity) to 1 (perfect similarity), which are then combined across scales to produce an overall similarity measure. We first evaluate WaveSim using synthetic test cases, applying controlled spatial and temporal perturbations to systematically assess its sensitivity and expected behavior. We then demonstrate its applicability to physically relevant case studies of key modes of climate variability in Earth System Models. Traditional point-wise metrics lack a mechanism for attributing errors to physical scales or modes of dissimilarity. By operating in the wavelet domain and decomposing the signal along independent axes, WaveSim bypasses these limitations and provides an interpretable and diagnostically rich framework for assessing similarity in complex fields. Additionally, the WaveSim framework allows users to place emphasis on a specific scale or component, and lends itself to user-specific model intercomparison, model evaluation, and calibration and training of forecasting systems. We provide a PyTorch-ready implementation of WaveSim, along with all evaluation scripts, at: https://github.com/gabrieleaccarino/wavesim.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao et al.
Multimodal
ๆฅ็ๆ่ฆ
CoT has significantly enhanced the reasoning ability of LLMs while it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose a ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present CRUX dataset by using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model.The resulting ViRC-7B model achieves a 18.8\% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniel Capellรกn-Martรญn, Abhijeet Parida, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer et al.
ๆฅ็ๆ่ฆ
Robust and generalizable segmentation of brain tumors on multi-parametric magnetic resonance imaging (MRI) remains difficult because tumor types differ widely. The BraTS 2025 Lighthouse Challenge benchmarks segmentation methods on diverse high-quality datasets of adult and pediatric tumors: multi-consortium international pediatric brain tumor segmentation (PED), preoperative meningioma tumor segmentation (MEN), meningioma radiotherapy segmentation (MEN-RT), and segmentation of pre- and post-treatment brain metastases (MET). We present a flexible, modular, and adaptable pipeline that improves segmentation performance by selecting and combining state-of-the-art models and applying tumor- and lesion-specific processing before and after training. Radiomic features extracted from MRI help detect tumor subtype, ensuring a more balanced training. Custom lesion-level performance metrics determine the influence of each model in the ensemble and optimize post-processing that further refines the predictions, enabling the workflow to tailor every step to each case. On the BraTS testing sets, our pipeline achieved performance comparable to top-ranked algorithms across multiple challenges. These findings confirm that custom lesion-aware processing and model selection yield robust segmentations yet without locking the method to a specific network architecture. Our method has the potential for quantitative tumor measurement in clinical practice, supporting diagnosis and prognosis.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Christian Matek, Lukas Wolfseher et al.
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Timely and accurate lymphoma diagnosis is essential for guiding cancer treatment. Standard diagnostic practice combines hematoxylin and eosin (HE)-stained whole slide images with immunohistochemistry, flow cytometry, and molecular genetic tests to determine lymphoma subtypes, a process requiring costly equipment, skilled personnel, and causing treatment delays. Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking. In this work, we present the first multicenter lymphoma benchmarking dataset covering four common lymphoma subtypes and healthy control tissue. We systematically evaluate five publicly available pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) combined with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across three magnifications (10x, 20x, 40x). On in-distribution test sets, models achieve multiclass balanced accuracies exceeding 80% across all magnifications, with all foundation models performing similarly and both aggregation methods showing comparable results. The magnification study reveals that 40x resolution is sufficient, with no performance gains from higher resolutions or cross-magnification aggregation. However, on out-of-distribution test sets, performance drops substantially to around 60%, highlighting significant generalization challenges. To advance the field, larger multicenter studies covering additional rare lymphoma subtypes are needed. We provide an automated benchmarking pipeline to facilitate such future research.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fei Wu, Marcel Dreier, Nora Gourmelon, Sebastian Wind, Jianlin Zhang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The dynamics of glaciers and ice shelf fronts significantly impact the mass balance of ice sheets and coastal sea levels. To effectively monitor glacier conditions, it is crucial to consistently estimate positional shifts of glacier calving fronts. AMD-HookNet firstly introduces a pure two-branch convolutional neural network (CNN) for glacier segmentation. Yet, the local nature and translational invariance of convolution operations, while beneficial for capturing low-level details, restricts the model ability to maintain long-range dependencies. In this study, we propose AMD-HookNet++, a novel advanced hybrid CNN-Transformer feature enhancement method for segmenting glaciers and delineating calving fronts in synthetic aperture radar images. Our hybrid structure consists of two branches: a Transformer-based context branch to capture long-range dependencies, which provides global contextual information in a larger view, and a CNN-based target branch to preserve local details. To strengthen the representation of the connected hybrid features, we devise an enhanced spatial-channel attention module to foster interactions between the hybrid CNN-Transformer branches through dynamically adjusting the token relationships from both spatial and channel perspectives. Additionally, we develop a pixel-to-pixel contrastive deep supervision to optimize our hybrid model by integrating pixelwise metric learning into glacier segmentation. Through extensive experiments and comprehensive quantitative and qualitative analyses on the challenging glacier segmentation benchmark dataset CaFFe, we show that AMD-HookNet++ sets a new state of the art with an IoU of 78.2 and a HD95 of 1,318 m, while maintaining a competitive MDE of 367 m. More importantly, our hybrid model produces smoother delineations of calving fronts, resolving the issue of jagged edges typically seen in pure Transformer-based approaches.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenghao Zhao, Haoxuan Wang, Kai Wang, Yuzhang Shang, Yuan Hong et al.
Machine Learning
ๆฅ็ๆ่ฆ
Dataset distillation aims to synthesize compact yet informative datasets that allow models trained on them to achieve performance comparable to training on the full dataset. While this approach has shown promising results for image data, extending dataset distillation methods to video data has proven challenging and often leads to suboptimal performance. In this work, we first identify the core challenge in video set distillation as the substantial increase in learnable parameters introduced by the temporal dimension of video, which complicates optimization and hinders convergence. To address this issue, we observe that a single frame is often sufficient to capture the discriminative semantics of a video. Leveraging this insight, we propose Single-Frame Video set Distillation (SFVD), a framework that distills videos into highly informative frames for each class. Using differentiable interpolation, these frames are transformed into video sequences and matched with the original dataset, while updates are restricted to the frames themselves for improved optimization efficiency. To further incorporate temporal information, the distilled frames are combined with sampled real videos from real videos during the matching process through a channel reshaping layer. Extensive experiments on multiple benchmarks demonstrate that SFVD substantially outperforms prior methods, achieving improvements of up to 5.3% on MiniUCF, thereby offering a more effective solution.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
MultimodalMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang et al.
Robotics
ๆฅ็ๆ่ฆ
This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware model. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found: https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhaolun Li, Jichang Li, Yinqi Cai, Junye Chen, Xiaonan Luo et al.
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mengyu Li, Xingcheng Zhou, Guang Chen, Alois Knoll, Hu Cao
ๆฅ็ๆ่ฆ
In Intelligent Transportation Systems (ITS), multi-object tracking is primarily based on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion conditions. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, there are far fewer studies on event-based vision. To address this research gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. We establish a tracking-by-detection benchmark with a specialized feature extractor based on this dataset, achieving excellent performance.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenyu Zhao, Yingxue Xu, Fengtao Zhou, Yihui Wang, Hao Chen
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Current multimodal survival prediction methods typically rely on pathology images (WSIs) and genomic data, both of which are high-dimensional and redundant, making it difficult to extract discriminative features from them and align different modalities. Moreover, using a simple survival follow-up label is insufficient to supervise such a complex task. To address these challenges, we propose KEMM, an LLM-driven Knowledge-Enhanced Multimodal Model for cancer survival prediction, which integrates expert reports and prognostic background knowledge. 1) Expert reports, provided by pathologists on a case-by-case basis and refined by large language model (LLM), offer succinct and clinically focused diagnostic statements. This information may typically suggest different survival outcomes. 2) Prognostic background knowledge (PBK), generated concisely by LLM, provides valuable prognostic background knowledge on different cancer types, which also enhances survival prediction. To leverage these knowledge, we introduce the knowledge-enhanced cross-modal (KECM) attention module. KECM can effectively guide the network to focus on discriminative and survival-relevant features from highly redundant modalities. Extensive experiments on five datasets demonstrate that KEMM achieves state-of-the-art performance. The code will be released upon acceptance.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mitsuki Watanabe, Sosuke Amano, Kiyoharu Aizawa, Yoko Yamakata
Multimodal
ๆฅ็ๆ่ฆ
Food image classification models are crucial for dietary management applications because they reduce the burden of manual meal logging. However, most publicly available datasets for training such models rely on web-crawled images, which often differ from users' real-world meal photos. In this work, we present FoodLogAthl-218, a food image dataset constructed from real-world meal records collected through the dietary management application FoodLog Athl. The dataset contains 6,925 images across 218 food categories, with a total of 14,349 bounding boxes. Rich metadata, including meal date and time, anonymized user IDs, and meal-level context, accompany each image. Unlike conventional datasets-where a predefined class set guides web-based image collection-our data begins with user-submitted photos, and labels are applied afterward. This yields greater intra-class diversity, a natural frequency distribution of meal types, and casual, unfiltered images intended for personal use rather than public sharing. In addition to (1) a standard classification benchmark, we introduce two FoodLog-specific tasks: (2) an incremental fine-tuning protocol that follows the temporal stream of users' logs, and (3) a context-aware classification task where each image contains multiple dishes, and the model must classify each dish by leveraging the overall meal context. We evaluate these tasks using large multimodal models (LMMs). The dataset is publicly available at https://huggingface.co/datasets/FoodLog/FoodLogAthl-218.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xianwei Cao, Dou Quan, Shuang Wang, Ning Huyan, Wei Wang et al.
ๆฅ็ๆ่ฆ
Image retrieval-based cross-view geo-localization (IRCVGL) aims to match images captured from significantly different viewpoints, such as satellite and street-level images. Existing methods predominantly rely on learning robust global representations or implicit feature alignment, which often fail to model explicit spatial correspondences crucial for accurate localization. In this work, we propose a novel correspondence-aware feature refinement framework, termed CLNet, that explicitly bridges the semantic and geometric gaps between different views. CLNet decomposes the view alignment process into three learnable and complementary modules: a Neural Correspondence Map (NCM) that spatially aligns cross-view features via latent correspondence fields; a Nonlinear Embedding Converter (NEC) that remaps features across perspectives using an MLP-based transformation; and a Global Feature Recalibration (GFR) module that reweights informative feature channels guided by learned spatial cues. The proposed CLNet can jointly capture both high-level semantics and fine-grained alignments. Extensive experiments on four public benchmarks, CVUSA, CVACT, VIGOR, and University-1652, demonstrate that our proposed CLNet achieves state-of-the-art performance while offering better interpretability and generalizability.
๐
2025-12-16
๐ ArXiv Preprint (eess.IV)
๐ฅ Sneha Sree C., Dattesh Shanbhag, Sudhanya Chatterjee
Machine Learning
ๆฅ็ๆ่ฆ
Medical image registration is critical for aligning anatomical structures across imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound. Among existing techniques, non-rigid registration (NRR) is particularly challenging due to the need to capture complex anatomical deformations caused by physiological processes like respiration or contrast-induced signal variations. Traditional NRR methods, while theoretically robust, often require extensive parameter tuning and incur high computational costs, limiting their use in real-time clinical workflows. Recent deep learning (DL)-based approaches have shown promise; however, their dependence on task-specific retraining restricts scalability and adaptability in practice. These limitations underscore the need for efficient, generalizable registration frameworks capable of handling heterogeneous imaging contexts. In this work, we introduce a novel AI-driven framework for 3D non-rigid registration that generalizes across multiple imaging modalities and anatomical regions. Unlike conventional methods that rely on application-specific models, our approach eliminates anatomy- or modality-specific customization, enabling streamlined integration into diverse clinical environments.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiwen Yang, Jiaju Zhang, Yang Yi, Jian Liang, Bingzheng Wei et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Medical image restoration (MedIR) aims to recover high-quality medical images from their low-quality counterparts. Recent advancements in MedIR have focused on All-in-One models capable of simultaneously addressing multiple different MedIR tasks. However, due to significant differences in both modality and degradation types, using a shared model for these diverse tasks requires careful consideration of two critical inter-task relationships: task interference, which occurs when conflicting gradient update directions arise across tasks on the same parameter, and task imbalance, which refers to uneven optimization caused by varying learning difficulties inherent to each task. To address these challenges, we propose a task-adaptive Transformer (TAT), a novel framework that dynamically adapts to different tasks through two key innovations. First, a task-adaptive weight generation strategy is introduced to mitigate task interference by generating task-specific weight parameters for each task, thereby eliminating potential gradient conflicts on shared weight parameters. Second, a task-adaptive loss balancing strategy is introduced to dynamically adjust loss weights based on task-specific learning difficulties, preventing task domination or undertraining. Extensive experiments demonstrate that our proposed TAT achieves state-of-the-art performance in three MedIR tasks--PET synthesis, CT denoising, and MRI super-resolution--both in task-specific and All-in-One settings. Code is available at https://github.com/Yaziwel/TAT.
๐
2025-12-16
๐ CVPR 2025
๐ฅ Yifang Xu, Benxiang Zhai, Yunzhuo Sun, Ming Li, Yang Li et al.
Robotics
ๆฅ็ๆ่ฆ
Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andreas Lolos, Theofilos Christodoulou, Aris L. Moustakas, Stergios Christodoulidis, Maria Vakalopoulou
Machine Learning
ๆฅ็ๆ่ฆ
In computational pathology, weak supervision has become the standard for deep learning due to the gigapixel scale of WSIs and the scarcity of pixel-level annotations, with Multiple Instance Learning (MIL) established as the principal framework for slide-level model training. In this paper, we introduce a novel setting for MIL methods, inspired by proceedings in Neural Partial Differential Equation (PDE) Solvers. Instead of relying on complex attention-based aggregation, we propose an efficient, aggregator-agnostic framework that removes the complexity of correlation learning from the MIL aggregator. CAPRMIL produces rich context-aware patch embeddings that promote effective correlation learning on downstream tasks. By projecting patch features -- extracted using a frozen patch encoder -- into a small set of global context/morphology-aware tokens and utilizing multi-head self-attention, CAPRMIL injects global context with linear computational complexity with respect to the bag size. Paired with a simple Mean MIL aggregator, CAPRMIL matches state-of-the-art slide-level performance across multiple public pathology benchmarks, while reducing the total number of trainable parameters by 48%-92.8% versus SOTA MILs, lowering FLOPs during inference by 52%-99%, and ranking among the best models on GPU memory efficiency and training time. Our results indicate that learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis. Our code is available at https://github.com/mandlos/CAPRMIL
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiheng Huang, Junhong Chen, Anqi Ning, Zhanhong Liang, Nick Michiels et al.
ๆฅ็ๆ่ฆ
Self-supervised monocular depth estimation has achieved notable success under daytime conditions. However, its performance deteriorates markedly at night due to low visibility and varying illumination, e.g., insufficient light causes textureless areas, and moving objects bring blurry regions. To this end, we propose a self-supervised framework named DASP that leverages spatiotemporal priors for nighttime depth estimation. Specifically, DASP consists of an adversarial branch for extracting spatiotemporal priors and a self-supervised branch for learning. In the adversarial branch, we first design an adversarial network where the discriminator is composed of four devised spatiotemporal priors learning blocks (SPLB) to exploit the daytime priors. In particular, the SPLB contains a spatial-based temporal learning module (STLM) that uses orthogonal differencing to extract motion-related variations along the time axis and an axial spatial learning module (ASLM) that adopts local asymmetric convolutions with global axial attention to capture the multiscale structural information. By combining STLM and ASLM, our model can acquire sufficient spatiotemporal features to restore textureless areas and estimate the blurry regions caused by dynamic objects. In the self-supervised branch, we propose a 3D consistency projection loss to bilaterally project the target frame and source frame into a shared 3D space, and calculate the 3D discrepancy between the two projected frames as a loss to optimize the 3D structural consistency and daytime priors. Extensive experiments on the Oxford RobotCar and nuScenes datasets demonstrate that our approach achieves state-of-the-art performance for nighttime depth estimation. Ablation studies further validate the effectiveness of each component.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jia Guo, Jiawei Du, Shengzhu Yang, Shuai Lu, Wenquan Cheng et al.
Machine Learning
ๆฅ็ๆ่ฆ
Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVison enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. In a prospective reader study with 33 ophthalmologists, ReVision's zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alessia Micieli, Giovanni Maria Farinella, Francesco Ragusa
Multimodal
ๆฅ็ๆ่ฆ
In this work we present SignIT, a new dataset to study the task of Italian Sign Language (LIS) recognition. The dataset is composed of 644 videos covering 3.33 hours. We manually annotated videos considering a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions and Family. We also extracted 2D keypoints related to the hands, face and body of the users. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models showing how temporal information, 2D keypoints and RGB frames can be influence the performance of these models. Results show the limitations of these models on this challenging LIS dataset. We release data and annotations at the following link: https://fpv-iplab.github.io/SignIT/.
๐
2025-12-16
๐ NeurIPS 2025
๐ฅ Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP's training objective, which optimizes only global image-text similarity and overlooks token-level supervision - limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment - with just a 0.077% increase in total FLOPs, and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold regardless of whether the model is trained on original web data or rich re-captioned data, demonstrating SuperCLIP's ability to recover textual supervision in both cases. Furthermore, SuperCLIP alleviates CLIP's small-batch performance drop through classification-based supervision that avoids reliance on large batch sizes. Code and models will be made open source.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andreas Sjรถlander, Valeria Belloni, Robel Fekadu, Andrea Nascetti
Machine Learning
ๆฅ็ๆ่ฆ
Tunnels are essential elements of transportation infrastructure, but are increasingly affected by ageing and deterioration mechanisms such as cracking. Regular inspections are required to ensure their safety, yet traditional manual procedures are time-consuming, subjective, and costly. Recent advances in mobile mapping systems and Deep Learning (DL) enable automated visual inspections. However, their effectiveness is limited by the scarcity of tunnel datasets. This paper introduces a new publicly available dataset containing annotated images of three different tunnel linings, capturing typical defects: cracks, leaching, and water infiltration. The dataset is designed to support supervised, semi-supervised, and unsupervised DL methods for defect detection and segmentation. Its diversity in texture and construction techniques also enables investigation of model generalization and transferability across tunnel types. By addressing the critical lack of domain-specific data, this dataset contributes to advancing automated tunnel inspection and promoting safer, more efficient infrastructure maintenance strategies.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a $\textbf{Dreamer}$ that employs generative models to visualize $\textit{how}$ an interaction would look; (2) a $\textbf{Thinker}$ that utilizes large vision-language models to decide $\textit{what}$ object part to interact with; and (3) a $\textbf{Spotter}$ that orchestrates vision foundation models to precisely locate $\textit{where}$ the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Leon Sick, Lukas Hoyer, Dominik Engel, Pedro Hermosilla, Timo Ropinski
ๆฅ็ๆ่ฆ
In recent years, the state-of-the-art in unsupervised video instance segmentation has heavily relied on synthetic video data, generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement by parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise and their quality varies through the video. Therefore, we establish temporal coherence by identifying high-quality keymasks in the video by leveraging deep motion priors. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense labelset, our approach outperforms the current state-of-the-art across various benchmarks.
๐
2025-12-16
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Quan Yuan, Zhikun Zhang, Linkang Du, Min Chen, Mingyang Sun et al.
ๆฅ็ๆ่ฆ
Video recognition systems are increasingly being deployed in daily life, such as content recommendation and security monitoring. To enhance video recognition development, many institutions have released high-quality public datasets with open-source licenses for training advanced models. At the same time, these datasets are also susceptible to misuse and infringement. Dataset copyright auditing is an effective solution to identify such unauthorized use. However, existing dataset copyright solutions primarily focus on the image domain; the complex nature of video data leaves dataset copyright auditing in the video domain unexplored. Specifically, video data introduces an additional temporal dimension, which poses significant challenges to the effectiveness and stealthiness of existing methods.
In this paper, we propose VICTOR, the first dataset copyright auditing approach for video recognition systems. We develop a general and stealthy sample modification strategy that enhances the output discrepancy of the target model. By modifying only a small proportion of samples (e.g., 1%), VICTOR amplifies the impact of published modified samples on the prediction behavior of the target models. Then, the difference in the model's behavior for published modified and unpublished original samples can serve as a key basis for dataset auditing. Extensive experiments on multiple models and datasets highlight the superiority of VICTOR. Finally, we show that VICTOR is robust in the presence of several perturbation mechanisms to the training videos or the target models.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chang Cai, Hao Jiang, Xiaojun Yuan, Ying-Jun Angela Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Message-passing algorithms have been adapted for compressive imaging by incorporating various off-the-shelf image denoisers. However, these denoisers rely largely on generic or hand-crafted priors and often fall short in accurately capturing the complex statistical structure of natural images. As a result, traditional plug-and-play (PnP) methods often lead to suboptimal reconstruction, especially in highly underdetermined regimes. Recently, score-based generative models have emerged as a powerful framework for accurately characterizing sophisticated image distribution. Yet, their direct use for posterior sampling typically incurs prohibitive computational complexity. In this paper, by exploiting the close connection between score-based generative modeling and empirical Bayes denoising, we devise a message-passing framework that integrates a score-based minimum mean-squared error (MMSE) denoiser for compressive image recovery. The resulting algorithm, named score-based turbo message passing (STMP), combines the fast convergence of message passing with the expressive power of score-based generative priors. For practical systems with quantized measurements, we further propose quantized STMP (Q-STMP), which augments STMP with a component-wise MMSE dequantization module. We demonstrate that the asymptotic performance of STMP and Q-STMP can be accurately predicted by a set of state-evolution (SE) equations. Experiments on the FFHQ dataset demonstrate that STMP strikes a significantly better performance-complexity tradeoff compared with competing baselines, and that Q-STMP remains robust even under 1-bit quantization. Remarkably, both STMP and Q-STMP typically converge within 10 iterations.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhuo Chen, Fanyue Wei, Runze Xu, Jingjing Li, Lixin Duan et al.
ๆฅ็ๆ่ฆ
Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing.To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation.By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mischa Dombrowski, Felix Nรผtzel, Bernhard Kainz
ๆฅ็ๆ่ฆ
Recent advances in generative image modeling have achieved visual realism sufficient to deceive human experts, yet their potential for privacy preserving data sharing remains insufficiently understood. A central obstacle is the absence of reliable memorization detection mechanisms, limited quantitative evaluation, and poor generalization of existing privacy auditing methods across domains. To address this, we propose to view memorization detection as a unified problem at the intersection of re-identification and copy detection, whose complementary goals cover both identity consistency and augmentation-robust duplication, and introduce Latent Contrastive Memorization Network (LCMem), a cross-domain model evaluated jointly on both tasks. LCMem achieves this through a two-stage training strategy that first learns identity consistency before incorporating augmentation-robust copy detection. Across six benchmark datasets, LCMem achieves improvements of up to 16 percentage points on re-identification and 30 percentage points on copy detection, enabling substantially more reliable memorization detection at scale. Our results show that existing privacy filters provide limited performance and robustness, highlighting the need for stronger protection mechanisms. We show that LCMem sets a new standard for cross-domain privacy auditing, offering reliable and scalable memorization detection. Code and model is publicly available at https://github.com/MischaD/LCMem.
๐
2025-12-16
๐ AAAI 2026
๐ฅ Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Le Jiang, Shaotong Zhu, Yedi Luo, Shayda Moezzi, Sarah Ostadabbas
ๆฅ็ๆ่ฆ
In dynamic Neural Radiance Fields (NeRF) systems, state-of-the-art novel view synthesis methods often fail under significant viewpoint deviations, producing unstable and unrealistic renderings. To address this, we introduce Expanded Dynamic NeRF (ExpanDyNeRF), a monocular NeRF framework that leverages Gaussian splatting priors and a pseudo-ground-truth generation strategy to enable realistic synthesis under large-angle rotations. ExpanDyNeRF optimizes density and color features to improve scene reconstruction from challenging perspectives. We also present the Synthetic Dynamic Multiview (SynDM) dataset, the first synthetic multiview dataset for dynamic scenes with explicit side-view supervision-created using a custom GTA V-based rendering pipeline. Quantitative and qualitative results on SynDM and real-world datasets demonstrate that ExpanDyNeRF significantly outperforms existing dynamic NeRF methods in rendering fidelity under extreme viewpoint shifts. Further details are provided in the supplementary materials.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Martin Rรถhn, Nora Gourmelon, Vincent Christlein
ๆฅ็ๆ่ฆ
Climate adaptation is vital for the sustainability and sometimes the mere survival of our urban areas. However, small cities often struggle with limited personnel resources and integrating vast amounts of data from multiple sources for a comprehensive analysis. To overcome these challenges, this paper proposes a multi-layered system combining specialized LLMs, satellite imagery analysis and a knowledge base to aid in developing effective climate adaptation strategies. The corresponding code can be found at https://github.com/Photon-GitHub/EcoScapes.
๐
2025-12-16
๐ Accepted at IEEE ITSC 2020
๐ฅ Georg Volk, Jรถrg Gamerdinger, Alexander von Bernuth, Oliver Bringmann
Robotics
ๆฅ็ๆ่ฆ
Complete perception of the environment and its correct interpretation is crucial for autonomous vehicles. Object perception is the main component of automotive surround sensing. Various metrics already exist for the evaluation of object perception. However, objects can be of different importance depending on their velocity, orientation, distance, size, or the potential damage that could be caused by a collision due to a missed detection. Thus, these additional parameters have to be considered for safety evaluation. We propose a new safety metric that incorporates all these parameters and returns a single easily interpretable safety assessment score for object perception. This new metric is evaluated with both real world and virtual data sets and compared to state of the art metrics.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Julian McGinnis, Florian A. Hรถlzl, Suprosanna Shit, Florentin Bieder, Paul Friedrich et al.
ๆฅ็ๆ่ฆ
Implicit Neural Representations (INRs) based on vanilla Multi-Layer Perceptrons (MLPs) are widely believed to be incapable of representing high-frequency content. This has directed research efforts towards architectural interventions, such as coordinate embeddings or specialized activation functions, to represent high-frequency signals. In this paper, we challenge the notion that the low-frequency bias of vanilla MLPs is an intrinsic, architectural limitation to learn high-frequency content, but instead a symptom of stable rank degradation during training. We empirically demonstrate that regulating the network's rank during training substantially improves the fidelity of the learned signal, rendering even simple MLP architectures expressive. Extensive experiments show that using optimizers like Muon, with high-rank, near-orthogonal updates, consistently enhances INR architectures even beyond simple ReLU MLPs. These substantial improvements hold across a diverse range of domains, including natural and medical images, and novel view synthesis, with up to 9 dB PSNR improvements over the previous state-of-the-art. Our project page, which includes code and experimental results, is available at: (https://muon-inrs.github.io).
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sebastian Koch, Johanna Wald, Hide Matsuki, Pedro Hermosilla, Timo Ropinski et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed and limited to be task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using a combination of 2D distillation, heavily relying on self-supervision and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases, surpassing methods that operate on ground truth 3D geometry. See the project website at unite-page.github.io
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ankita Raj, Kaashika Prajaapat, Tapan Kumar Gandhi, Chetan Arora
Machine Learning
ๆฅ็ๆ่ฆ
The human visual system is remarkably adept at adapting to changes in the input distribution; a capability modern convolutional neural networks (CNNs) still struggle to match. Drawing inspiration from the developmental trajectory of human vision, we propose a progressive blurring curriculum to improve the generalization and robustness of CNNs. Human infants are born with poor visual acuity, gradually refining their ability to perceive fine details. Mimicking this process, we begin training CNNs on highly blurred images during the initial epochs and progressively reduce the blur as training advances. This approach encourages the network to prioritize global structures over high-frequency artifacts, improving robustness against distribution shifts and noisy inputs. Challenging prior claims that blurring in the initial training epochs imposes a stimulus deficit and irreversibly harms model performance, we reveal that early-stage blurring enhances generalization with minimal impact on in-domain accuracy. Our experiments demonstrate that the proposed curriculum reduces mean corruption error (mCE) by up to 8.30% on CIFAR-10-C and 4.43% on ImageNet-100-C datasets, compared to standard training without blurring. Unlike static blur-based augmentation, which applies blurred images randomly throughout training, our method follows a structured progression, yielding consistent gains across various datasets. Furthermore, our approach complements other augmentation techniques, such as CutMix and MixUp, and enhances both natural and adversarial robustness against common attack methods. Code is available at https://github.com/rajankita/Visual_Acuity_Curriculum.
๐
2025-12-16
๐ AAAI 2026
๐ฅ Kanglong Fan, Yunqiao Yang, Chen Ma
Machine Learning
ๆฅ็ๆ่ฆ
Deep neural networks have demonstrated remarkable performance across various domains, yet their decision-making processes remain opaque. Although many explanation methods are dedicated to bringing the obscurity of DNNs to light, they exhibit significant limitations: post-hoc explanation methods often struggle to faithfully reflect model behaviors, while self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. To address these challenges, we propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training, which achieves two key advancements: 1) a fair allocation of the model prediction scores to image patches, ensuring explanations inherently align with the model's decision logic, and 2) enhanced interpretability with minor structural modifications, preserving model performance and compatibility. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art interpretability.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaizhe Zhang, Yijie Zhou, Weizhan Zhang, Caixia Yan, Haipeng Du et al.
ๆฅ็ๆ่ฆ
Dynamic novel view synthesis (NVS) is essential for creating immersive experiences. Existing approaches have advanced dynamic NVS by introducing 3D Gaussian Splatting (3DGS) with implicit deformation fields or indiscriminately assigned time-varying parameters, surpassing NeRF-based methods. However, due to excessive model complexity and parameter redundancy, they incur large model sizes and slow rendering speeds, making them inefficient for real-time applications, particularly on resource-constrained devices. To obtain a more efficient model with fewer redundant parameters, in this paper, we propose Hybrid Gaussian Splatting (HGS), a compact and efficient framework explicitly designed to disentangle static and dynamic regions of a scene within a unified representation. The core innovation of HGS lies in our Static-Dynamic Decomposition (SDD) strategy, which leverages Radial Basis Function (RBF) modeling for Gaussian primitives. Specifically, for dynamic regions, we employ time-dependent RBFs to effectively capture temporal variations and handle abrupt scene changes, while for static regions, we reduce redundancy by sharing temporally invariant parameters. Additionally, we introduce a two-stage training strategy tailored for explicit models to enhance temporal coherence at static-dynamic boundaries. Experimental results demonstrate that our method reduces model size by up to 98% and achieves real-time rendering at up to 125 FPS at 4K resolution on a single RTX 3090 GPU. It further sustains 160 FPS at 1352 * 1014 on an RTX 3050 and has been integrated into the VR system. Moreover, HGS achieves comparable rendering quality to state-of-the-art methods while providing significantly improved visual fidelity for high-frequency details and abrupt scene changes.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jooyeol Yun, Jaegul Choo
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mis-handle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance of which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
ๆฅ็ๆ่ฆ
Recent progress in text-to-image diffusion models has transformed image editing via text prompts, yet this also introduces significant ethical challenges from potential misuse in creating deceptive or harmful content. While current defenses seek to mitigate this risk by embedding imperceptible perturbations, their effectiveness is limited against malicious tampering. To address this issue, we propose a Dual Attention-Guided Noise Perturbation (DANP) immunization method that adds imperceptible perturbations to disrupt the model's semantic understanding and generation process. DANP functions over multiple timesteps to manipulate both cross-attention maps and the noise prediction process, using a dynamic threshold to generate masks that identify text-relevant and irrelevant regions. It then reduces attention in relevant areas while increasing it in irrelevant ones, thereby misguides the edit towards incorrect regions and preserves the intended targets. Additionally, our method maximizes the discrepancy between the injected noise and the model's predicted noise to further interfere with the generation. By targeting both attention and noise prediction mechanisms, DANP exhibits impressive immunity against malicious edits, and extensive experiments confirm that our method achieves state-of-the-art performance.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuai Dong, Jie Zhang, Guoying Zhao, Shiguang Shan, Xilin Chen
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Text-guided image editing via diffusion models, while powerful, raises significant concerns about misuse, motivating efforts to immunize images against unauthorized edits using imperceptible perturbations. Prevailing metrics for evaluating immunization success typically rely on measuring the visual dissimilarity between the output generated from a protected image and a reference output generated from the unprotected original. This approach fundamentally overlooks the core requirement of image immunization, which is to disrupt semantic alignment with attacker intent, regardless of deviation from any specific output. We argue that immunization success should instead be defined by the edited output either semantically mismatching the prompt or suffering substantial perceptual degradations, both of which thwart malicious intent. To operationalize this principle, we propose Synergistic Intermediate Feature Manipulation (SIFM), a method that strategically perturbs intermediate diffusion features through dual synergistic objectives: (1) maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and (2) minimizing feature norms to induce perceptual degradations. Furthermore, we introduce the Immunization Success Rate (ISR), a novel metric designed to rigorously quantify true immunization efficacy for the first time. ISR quantifies the proportion of edits where immunization induces either semantic failure relative to the prompt or significant perceptual degradations, assessed via Multimodal Large Language Models (MLLMs). Extensive experiments show our SIFM achieves the state-of-the-art performance for safeguarding visual content against malicious diffusion-based manipulation.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
In regions of the Middle East and North Africa (MENA), there is a high demand for wastewater treatment plants (WWTPs), crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods like YOLOv8 segmentation require extensive manual labeling. But studies indicate that vision-language models (VLMs) are an efficient alternative to achieving equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams specifically to identify WWTPs. The YOLOv8 was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives). Evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), used to identify WWTP components such as circular/rectangular tanks, aeration basins and distinguish confounders via expert prompts producing JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and equal non-WWTP sites from field/AI data, as 600mx600m Geo-TIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs out-performing YOLOv8's true positive rate, with Gemma-3 highest. Results confirm that VLMs, particularly with zero-shot, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Mohammad Awrangjeb
ๆฅ็ๆ่ฆ
Self-supervised Learning (SSL) has become a powerful paradigm for representation learning without manual annotations. However, most existing frameworks focus on global alignment and struggle to capture the hierarchical, multi-scale lesion patterns characteristic of plant disease imagery. To address this gap, we propose PSMamba, a progressive self-supervised framework that integrates the efficient sequence modelling of Vision Mamba (VM) with a dual-student hierarchical distillation strategy. Unlike conventional single teacher-student designs, PSMamba employs a shared global teacher and two specialised students: one processes mid-scale views to capture lesion distributions and vein structures, while the other focuses on local views to capture fine-grained cues such as texture irregularities and early-stage lesions. This multi-granular supervision facilitates the joint learning of contextual and detailed representations, with consistency losses ensuring coherent cross-scale alignment. Experiments on three benchmark datasets show that PSMamba consistently outperforms state-of-the-art SSL methods, delivering superior accuracy and robustness in both domain-shifted and fine-grained scenarios.
๐
2025-12-16
๐ SIGGRAPH
๐ฅ Zhibing Li, Mengchen Zhang, Tong Wu, Jing Tan, Jiaqi Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) To address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency. (2) Temporal consistency is enforced by introducing dedicated temporal layers that reason across frames. (3) To support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu Chen, Hongwei Lin
Machine Learning
ๆฅ็ๆ่ฆ
Persistence diagrams (PDs) provide a powerful tool for understanding the topology of the underlying shape of a point cloud. However, identifying which points in PDs encode genuine signals remains challenging. This challenge directly hinders the practical adoption of topological data analysis in many applications, where automated and reliable interpretation of persistence diagrams is essential for downstream decision-making. In this paper, we study automatic significance detection for one-dimensional persistence diagrams. Specifically, we propose Topology Understanding Net (TUN), a multi-modal network that combines enhanced PD descriptors with self-attention, a PointNet-style point cloud encoder, learned fusion, and per-point classification, alongside stable preprocessing and imbalance-aware training. It provides an automated and effective solution for identifying significant points in PDs, which are critical for downstream applications. Experiments show that TUN outperforms classic methods in detecting significant points in PDs, illustrating its effectiveness in real-world applications.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2\% on NExT-GQA and 4.6\% on ReXTime, while also enhancing average answer accuracy by 2.4\%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4\% on long-video benchmarks.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shreedhar Govil, Didier Stricker, Jason Rambach
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Predicting driver attention is a critical problem for developing explainable autonomous driving systems and understanding driver behavior in mixed human-autonomous vehicle traffic scenarios. Although significant progress has been made through large-scale driver attention datasets and deep learning architectures, existing works are constrained by narrow frontal field-of-view and limited driving diversity. Consequently, they fail to capture the full spatial context of driving environments, especially during lane changes, turns, and interactions involving peripheral objects such as pedestrians or cyclists. In this paper, we introduce DriverGaze360, a large-scale 360$^\circ$ field of view driver attention dataset, containing $\sim$1 million gaze-labeled frames collected from 19 human drivers, enabling comprehensive omnidirectional modeling of driver gaze behavior. Moreover, our panoramic attention prediction approach, DriverGaze360-Net, jointly learns attention maps and attended objects by employing an auxiliary semantic segmentation head. This improves spatial awareness and attention prediction across wide panoramic inputs. Extensive experiments demonstrate that DriverGaze360-Net achieves state-of-the-art attention prediction performance on multiple metrics on panoramic driving images. Dataset and method available at https://av.dfki.de/drivergaze360.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wentao Wan, Kaiyu Wu, Qingyang Ma, Nan Kang, Yunjie Chen et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recently, Visual Programming (VP) based on large language models (LLMs) has rapidly developed and demonstrated significant potential in complex Visual Reasoning (VR) tasks. Previous works to enhance VP have primarily focused on improving the quality of LLM-generated visual programs. However, they have neglected to optimize the VP-invoked pre-trained models, which serve as modules for the visual sub-tasks decomposed from the targeted tasks by VP. The difficulty is that there are only final labels of targeted VR tasks rather than labels of sub-tasks. Besides, the non-differentiable nature of VP impedes the direct use of efficient gradient-based optimization methods to leverage final labels for end-to-end learning of the entire VP framework. To overcome these issues, we propose EVPG, a method to Enhance Visual Programming for visual reasoning via Probabilistic Graphs. Specifically, we creatively build a directed probabilistic graph according to the variable dependency relationships during the VP executing process, which reconstructs the non-differentiable VP executing process into a differentiable exact probability inference process on this directed probabilistic graph. As a result, this enables the VP framework to utilize the final labels for efficient, gradient-based optimization in end-to-end supervised learning on targeted VR tasks. Extensive and comprehensive experiments demonstrate the effectiveness and advantages of our EVPG, showing significant performance improvements for VP on three classical complex VR tasks: GQA, NLVRv2, and Open Images.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nando Metzger, Prune Truong, Goutam Bhat, Konrad Schindler, Federico Tombari
Robotics
ๆฅ็ๆ่ฆ
The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present Elastic3D, a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (more precisely, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion. Please check the project page for the video samples https://elastic3d.github.io.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jimmie Kwok, Holger Caesar, Andras Palffy
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Automotive radar has shown promising developments in environment perception due to its cost-effectiveness and robustness in adverse weather conditions. However, the limited availability of annotated radar data poses a significant challenge for advancing radar-based perception systems. To address this limitation, we propose a novel framework to generate 4D radar point clouds for training and evaluating object detectors. Unlike image-based diffusion, our method is designed to consider the sparsity and unique characteristics of radar point clouds by applying diffusion to a latent point cloud representation. Within this latent space, generation is controlled via conditioning at either the object or scene level. The proposed 4D-RaDiff converts unlabeled bounding boxes into high-quality radar annotations and transforms existing LiDAR point cloud data into realistic radar scenes. Experiments demonstrate that incorporating synthetic radar data of 4D-RaDiff as data augmentation method during training consistently improves object detection performance compared to training on real data only. In addition, pre-training on our synthetic data reduces the amount of required annotated radar data by up to 90% while achieving comparable object detection performance.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task co-speech gesture or text-to-motion that maps a fixed utterance to motion clips-without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rawan Alyahya, Asrar Alruwayqi, Atheer Alqarni, Asma Alkhaldi, Metab Alkubeyyer et al.
Machine Learning
ๆฅ็ๆ่ฆ
The presence of MGMT promoter methylation significantly affects how well chemotherapy works for patients with Glioblastoma Multiforme (GBM). Currently, confirmation of MGMT promoter methylation relies on invasive brain tumor tissue biopsies. In this study, we explore radiogenomics techniques, a promising approach in precision medicine, to identify genetic markers from medical images. Using MRI scans and deep learning models, we propose a new multi-view approach that considers spatial relationships between MRI views to detect MGMT methylation status. Importantly, our method extracts information from all three views without using a complicated 3D deep learning model, avoiding issues associated with high parameter count, slow convergence, and substantial memory demands. We also introduce a new technique for tumor slice extraction and show its superiority over existing methods based on multiple evaluation metrics. By comparing our approach to state-of-the-art models, we demonstrate the efficacy of our method. Furthermore, we share a reproducible pipeline of published models, encouraging transparency and the development of robust diagnostic tools. Our study highlights the potential of non-invasive methods for identifying MGMT promoter methylation and contributes to advancing precision medicine in GBM treatment.
๐
2025-12-16
๐ ACM MM 2025
๐ฅ Tao Tang, Enhui Ma, xia zhou, Letian Wang, Tianyi Yan et al.
Machine LearningMultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OminiGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird\u2019s Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OminiGen achieves desired performances in unified multimodal sensor data generation with multimodal consistency and flexible sensor adjustments.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xichen Ding, Jianzhe Gao, Cong Pan, Wenguan Wang, Jie Qin
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Additionally, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhuoxiao Li, Wenzong Ma, Taoyu Wu, Jinjing Zhu, Zhenchao Q et al.
Computer Vision
ๆฅ็ๆ่ฆ
Recent advances in Neural Radiance Fields and 3D Gaussian Splatting have demonstrated strong potential for large-scale UAV-based 3D reconstruction tasks by fitting the appearance of images. However, real-world large-scale captures are often based on multi-temporal data capture, where illumination inconsistencies across different times of day can significantly lead to color artifacts, geometric inaccuracies, and inconsistent appearance. Due to the lack of UAV datasets that systematically capture the same areas under varying illumination conditions, this challenge remains largely underexplored. To fill this gap, we introduceSkyLume, a large-scale, real-world UAV dataset specifically designed for studying illumination robust 3D reconstruction in urban scene modeling: (1) We collect data from 10 urban regions data comprising more than 100k high resolution UAV images (four oblique views and nadir), where each region is captured at three periods of the day to systematically isolate illumination changes. (2) To support precise evaluation of geometry and appearance, we provide per-scene LiDAR scans and accurate 3D ground-truth for assessing depth, surface normals, and reconstruction quality under varying illumination. (3) For the inverse rendering task, we introduce the Temporal Consistency Coefficient (TCC), a metric that measuress cross-time albedo stability and directly evaluates the robustness of the disentanglement of light and material. We aim for this resource to serve as a foundation that advances research and real-world evaluation in large-scale inverse rendering, geometry reconstruction, and novel view synthesis.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cassandra Krause, Mattias P. Heinrich, Ron Keuth
ๆฅ็ๆ่ฆ
Between $15\,\%$ and $45\,\%$ of children experience a fracture during their growth years, making accurate diagnosis essential. Fracture morphology, alongside location and fragment angle, is a key diagnostic feature. In this work, we propose a method to extract fracture morphology by assigning automatically global AO codes to corresponding fracture bounding boxes. This approach enables the use of public datasets and reformulates the global multilabel task into a local multiclass one, improving the average F1 score by $7.89\,\%$. However, performance declines when using imperfect fracture detectors, highlighting challenges for real-world deployment. Our code is available on GitHub.
๐
2025-12-16
๐ ArXiv Preprint (cs.GR)
๐ฅ Jianwei Sun, Xiaoning Lei, Wenhao Cai, Xichen Xu, Yanshu Wang et al.
ๆฅ็ๆ่ฆ
Task-based measures of image quality (IQ) are critical for evaluating medical imaging systems, which must account for randomness including anatomical variability. Stochastic object models (SOMs) provide a statistical description of such variability, but conventional mathematical SOMs fail to capture realistic anatomy, while data-driven approaches typically require clean data rarely available in clinical tasks. To address this challenge, we propose AMID, an unsupervised Ambient Measurement-Integrated Diffusion with noise decoupling, which establishes clean SOMs directly from noisy measurements. AMID introduces a measurement-integrated strategy aligning measurement noise with the diffusion trajectory, and explicitly models coupling between measurement and diffusion noise across steps, an ambient loss is thus designed base on it to learn clean SOMs. Experiments on real CT and mammography datasets show that AMID outperforms existing methods in generation fidelity and yields more reliable task-based IQ evaluation, demonstrating its potential for unsupervised medical imaging analysis.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Francesco Di Sario, Daniel Rebain, Dor Verbin, Marco Grangetto, Andrea Tagliasacchi
Machine Learning
ๆฅ็ๆ่ฆ
Radiance field methods (e.g. 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections - a key component of realistic rendering. Although alternatives like spherical Gaussians offer improvements, they add significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while keeping optimization simpler than existing alternatives. For reflections - where SH fail - we leverage SV as learnable reflection probes, taking reflected directions as input following principles from classical graphics. This formulation attains state-of-the-art results on synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qingyuan Cai, Linxin Zhang, Xuecai Hu, Saihui Hou, Yongzhen Huang
Computer Vision
ๆฅ็ๆ่ฆ
Recent approaches for monocular 3D human pose estimation (3D HPE) have achieved leading performance by directly regressing 3D poses from 2D keypoint sequences. Despite the rapid progress in 3D HPE, existing methods are typically trained and evaluated under disparate frameworks, lacking a unified framework for fair comparison. To address these limitations, we propose Fast3DHPE, a modular framework that facilitates rapid reproduction and flexible development of new methods. By standardizing training and evaluation protocols, Fast3DHPE enables fair comparison across 3D human pose estimation methods while significantly improving training efficiency. Within this framework, we introduce FastDDHPose, a Disentangled Diffusion-based 3D Human Pose Estimation method which leverages the strong latent distribution modeling capability of diffusion models to explicitly model the distributions of bone length and bone direction while avoiding further amplification of hierarchical error accumulation. Moreover, we design an efficient Kinematic-Hierarchical Spatial and Temporal Denoiser that encourages the model to focus on kinematic joint hierarchies while avoiding unnecessary modeling of overly complex joint topologies. Extensive experiments on Human3.6M and MPI-INF-3DHP show that the Fast3DHPE framework enables fair comparison of all methods while significantly improving training efficiency. Within this unified framework, FastDDHPose achieves state-of-the-art performance with strong generalization and robustness in in-the-wild scenarios. The framework and models will be released at: https://github.com/Andyen512/Fast3DHPE
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuxin Zhao, Bo Lang, Nan Xiao, Yilang Zhang
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Object detection models deployed in real-world applications such as autonomous driving face serious threats from backdoor attacks. Despite their practical effectiveness,existing methods are inherently limited in both capability and robustness due to their dependence on single-trigger-single-object mappings and fragile pixel-level cues. We propose CIS-BA, a novel backdoor attack paradigm that redefines trigger design by shifting from static object features to continuous inter-object interaction patterns that describe how objects co-occur and interact in a scene. By modeling these patterns as a continuous interaction space, CIS-BA introduces space triggers that, for the first time, enable a multi-trigger-multi-object attack mechanism while achieving robustness through invariant geometric relations. To implement this paradigm, we design CIS-Frame, which constructs space triggers via interaction analysis, formalizes them as class-geometry constraints for sample poisoning, and embeds the backdoor during detector training. CIS-Frame supports both single-object attacks (object misclassification and disappearance) and multi-object simultaneous attacks, enabling complex and coordinated effects across diverse interaction states. Experiments on MS-COCO and real-world videos show that CIS-BA achieves over 97% attack success under complex environments and maintains over 95% effectiveness under dynamic multi-trigger conditions, while evading three state-of-the-art defenses. In summary, CIS-BA extends the landscape of backdoor attacks in interaction-intensive scenarios and provides new insights into the security of object detection systems.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent reasoning based medical MLLMs have made progress in generating step by step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. Datasets, codes, and trained models will be released publicly.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hanning Chen, Keyu Man, Kevin Zhu, Chenguang Zhu, Haonan Li et al.
Machine LearningNatural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Identifying and addressing performance anti-patterns in machine learning (ML) models is critical for efficient training and inference, but it typically demands deep expertise spanning system infrastructure, ML models and kernel development. While large tech companies rely on dedicated ML infrastructure engineers to analyze torch traces and benchmarks, such resource-intensive workflows are largely inaccessible to computer vision researchers in general. Among the challenges, pinpointing problematic trace segments within lengthy execution traces remains the most time-consuming task, and is difficult to automate with current ML models, including LLMs. In this work, we present the first benchmark dataset specifically designed to evaluate and improve ML models' ability to detect anti patterns in traces. Our dataset contains over 600 PyTorch traces from diverse computer vision models classification, detection, segmentation, and generation collected across multiple hardware platforms. We also propose a novel iterative approach: a lightweight ML model first detects trace segments with anti patterns, followed by a large language model (LLM) for fine grained classification and targeted feedback. Experimental results demonstrate that our method significantly outperforms unsupervised clustering and rule based statistical techniques for detecting anti pattern regions. Our method also effectively compensates LLM's limited context length and reasoning inefficiencies.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang et al.
Robotics
ๆฅ็ๆ่ฆ
Sketch editing is central to digital illustration, yet existing image editing systems struggle to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing. We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing, while keeping unrelated regions and overall composition intact. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model applied to diverse sketches. Building on this data, SketchAssist employs a unified sketch editing framework with minimal changes to DiT-based editors. We repurpose the RGB channels to encode the inputs, enabling seamless switching between instruction-guided edits and line-guided redrawing within a single input interface. To further specialize behavior across modes, we integrate a task-guided mixture-of-experts into LoRA layers, routing by text and visual cues to improve semantic controllability, structural fidelity, and style preservation. Extensive experiments show state-of-the-art results on both tasks, with superior instruction adherence and style/structure preservation compared to recent baselines. Together, our dataset and SketchAssist provide a practical, controllable assistant for sketch creation and revision.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ashish Mishra, Tarun Kumar, Gyanaranjan Nayak, Arpit Shah, Suparna Bhattacharya et al.
Multimodal
ๆฅ็ๆ่ฆ
We introduce a novel, closed-form approach for selective unlearning in multimodal models, specifically targeting pretrained models such as CLIP. Our method leverages nullspace projection to erase the target class information embedded in the final projection layer, without requiring any retraining or the use of images from the forget set. By computing an orthonormal basis for the subspace spanned by target text embeddings and projecting these directions, we dramatically reduce the alignment between image features and undesired classes. Unlike traditional unlearning techniques that rely on iterative fine-tuning and extensive data curation, our approach is both computationally efficient and surgically precise. This leads to a pronounced drop in zero-shot performance for the target classes while preserving the overall multimodal knowledge of the model. Our experiments demonstrate that even a partial projection can balance between complete unlearning and retaining useful information, addressing key challenges in model decontamination and privacy preservation.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junyi Wu, Van Nguyen Nguyen, Benjamin Planche, Jiachen Tao, Changchang Sun et al.
ๆฅ็ๆ่ฆ
We introduce Consistent Instance Field, a continuous and probabilistic spatio-temporal representation for dynamic scene understanding. Unlike prior methods that rely on discrete tracking or view-dependent features, our approach disentangles visibility from persistent object identity by modeling each space-time point with an occupancy probability and a conditional instance distribution. To realize this, we introduce a novel instance-embedded representation based on deformable 3D Gaussians, which jointly encode radiance and semantic information and are learned directly from input RGB images and instance masks through differentiable rasterization. Furthermore, we introduce new mechanisms to calibrate per-Gaussian identities and resample Gaussians toward semantically active regions, ensuring consistent instance representations across space and time. Experiments on HyperNeRF and Neu3D datasets demonstrate that our method significantly outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying tasks.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenbo Tian, Ruting Lin, Hongxian Zheng, Yaodong Yang, Geng Wu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing intelligent sports analysis systems mainly focus on "scoring and visualization," often lacking automatic performance diagnosis and interpretable training guidance. Recent advances of Large Language Models (LMMs) and motion analysis techniques provide new opportunities to address the above limitations. In this paper, we propose SportsGPT, an LLM-driven framework for interpretable sports motion assessment and training guidance, which establishes a closed loop from motion time-series input to professional training guidance. First, given a set of high-quality target models, we introduce MotionDTW, a two-stage time series alignment algorithm designed for accurate keyframe extraction from skeleton-based motion sequences. Subsequently, we design a Knowledge-based Interpretable Sports Motion Assessment Model (KISMAM) to obtain a set of interpretable assessment metrics (e.g., insufficient extension) by constrasting the keyframes with the targe models. Finally, we propose SportsRAG, a RAG-based training guidance model based on Qwen3. Leveraging a 6B-token knowledge base, it prompts the LLM to generate professional training guidance by retrieving domain-specific QA pairs. Experimental results demonstrate that MotionDTW significantly outperforms traditional methods with lower temporal error and higher IoU scores. Furthermore, ablation studies validate the KISMAM and SportsRAG, confirming that SportsGPT surpasses general LLMs in diagnostic accuracy and professionalism.
๐
2025-12-16
๐ Extended Journal Version of APSIPA ASC 2025
๐ฅ Rui-Yang Ju, KokSheik Wong, Yanlin Jin, Jen-Shiun Chiang
ๆฅ็ๆ่ฆ
Document image enhancement and binarization are commonly performed prior to document analysis and recognition tasks for improving the efficiency and accuracy of optical character recognition (OCR) systems. This is because directly recognizing text in degraded documents, particularly in color images, often results in unsatisfactory recognition performance. To address these issues, existing methods train independent generative adversarial networks (GANs) for different color channels to remove shadows and noise, which, in turn, facilitates efficient text information extraction. However, deploying multiple GANs results in long training and inference times. To reduce both training and inference times of document image enhancement and binarization models, we propose MFE-GAN, an efficient GAN-based framework with multi-scale feature extraction (MFE), which incorporates Haar wavelet transformation (HWT) and normalization to process document images before feeding them into GANs for training. In addition, we present novel generators, discriminators, and loss functions to improve the model's performance, and we conduct ablation studies to demonstrate their effectiveness. Experimental results on the Benchmark, Nabuco, and CMATERdb datasets demonstrate that the proposed MFE-GAN significantly reduces the total training and inference times while maintaining comparable performance with respect to state-of-the-art (SOTA) methods. The implementation of this work is available at https://ruiyangju.github.io/MFE-GAN.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ashish Mishra, Gyanaranjan Nayak, Tarun Kumar, Arpit Shah, Suparna Bhattacharya et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Pretrained models like CLIP have demonstrated impressive zero-shot classification capabilities across diverse visual domains, spanning natural images, artistic renderings, and abstract representations. However, real-world applications often demand the removal (or "unlearning") of specific object classes without requiring additional data or retraining, or affecting the model's performance on unrelated tasks. In this paper, we propose a novel training- and data-free unlearning framework that enables three distinct forgetting paradigms: (1) global unlearning of selected objects across all domains, (2) domain-specific knowledge removal (e.g., eliminating sketch representations while preserving photo recognition), and (3) complete unlearning in selective domains. By leveraging a multimodal nullspace through synergistic integration of text prompts and synthesized visual prototypes derived from CLIP's joint embedding space, our method efficiently removes undesired class information while preserving the remaining knowledge. This approach overcomes the limitations of existing retraining-based methods and offers a flexible and computationally efficient solution for controlled model forgetting.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof
MultimodalNatural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMS). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang et al.
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach to apply discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking with text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the requirement for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruitong Sun, Tianze Yang, Wei Niu, Jin Sun
MultimodalComputer VisionRoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Diffusion models have emerged as the dominant paradigm for high-quality image generation, yet their computational expense remains substantial due to iterative denoising. Classifier-Free Guidance (CFG) significantly enhances generation quality and controllability but doubles the computation by requiring both conditional and unconditional forward passes at every timestep. We present OUSAC (Optimized gUidance Scheduling with Adaptive Caching), a framework that accelerates diffusion transformers (DiT) through systematic optimization. Our key insight is that variable guidance scales enable sparse computation: adjusting scales at certain timesteps can compensate for skipping CFG at others, enabling both fewer total sampling steps and fewer CFG steps while maintaining quality. However, variable guidance patterns introduce denoising deviations that undermine standard caching methods, which assume constant CFG scales across steps. Moreover, different transformer blocks are affected at different levels under dynamic conditions. This paper develops a two-stage approach leveraging these insights. Stage-1 employs evolutionary algorithms to jointly optimize which timesteps to skip and what guidance scale to use, eliminating up to 82% of unconditional passes. Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block, maintaining caching effectiveness under variable guidance. Experiments demonstrate that OUSAC significantly outperforms state-of-the-art acceleration methods, achieving 53% computational savings with 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with 16.1% improvement on PixArt-alpha (MSCOCO), and 5x speedup on FLUX while improving CLIP Score over the 50-step baseline.
๐
2025-12-16
๐ AAAI 2026
๐ฅ Sisi Dai, Kai Xu
ๆฅ็ๆ่ฆ
Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nhi Nguyen, Constantino รlvarez Casado, Le Nguyen, Manuel Lage Caรฑellas, Miguel Bordallo Lรณpez
Machine Learning
ๆฅ็ๆ่ฆ
Video-based respiratory rate (RR) estimation is often unreliable due to inconsistent signal quality across extraction methods. We present a predictive, quality-aware framework that integrates heterogeneous signal sources with dynamic assessment of reliability. Ten signals are extracted from facial remote photoplethysmography (rPPG), upper-body motion, and deep learning pipelines, and analyzed using four spectral estimators: Welch's method, Multiple Signal Classification (MUSIC), Fast Fourier Transform (FFT), and peak detection. Segment-level quality indices are then used to train machine learning models that predict accuracy or select the most reliable signal. This enables adaptive signal fusion and quality-based segment filtering. Experiments on three public datasets (OMuSense-23, COHFACE, MAHNOB-HCI) show that the proposed framework achieves lower RR estimation errors than individual methods in most cases, with performance gains depending on dataset characteristics. These findings highlight the potential of quality-driven predictive modeling to deliver scalable and generalizable video-based respiratory monitoring solutions.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Felix Holm, Ghazal Ghazaei, Nassir Navab
Machine Learning
ๆฅ็ๆ่ฆ
Purpose: Detailed surgical recognition is critical for advancing AI-assisted surgery, yet progress is hampered by high annotation costs, data scarcity, and a lack of interpretable models. While scene graphs offer a structured abstraction of surgical events, their full potential remains untapped. In this work, we introduce ProtoFlow, a novel framework that learns dynamic scene graph prototypes to model complex surgical workflows in an interpretable and robust manner.
Methods: ProtoFlow leverages a graph neural network (GNN) encoder-decoder architecture that combines self-supervised pretraining for rich representation learning with a prototype-based fine-tuning stage. This process discovers and refines core prototypes that encapsulate recurring, clinically meaningful patterns of surgical interaction, forming an explainable foundation for workflow analysis.
Results: We evaluate our approach on the fine-grained CAT-SG dataset. ProtoFlow not only outperforms standard GNN baselines in overall accuracy but also demonstrates exceptional robustness in limited-data, few-shot scenarios, maintaining strong performance when trained on as few as one surgical video. Our qualitative analyses further show that the learned prototypes successfully identify distinct surgical sub-techniques and provide clear, interpretable insights into workflow deviations and rare complications.
Conclusion: By uniting robust representation learning with inherent explainability, ProtoFlow represents a significant step toward developing more transparent, reliable, and data-efficient AI systems, accelerating their potential for clinical adoption in surgical training, real-time decision support, and workflow optimization.
๐
2025-12-16
๐ Submitted to IEEE TPAMI, under review
๐ฅ Yang Yang, Risa Shinoda, Hiroaki Santo, Fumio Okura
Computer Vision
ๆฅ็ๆ่ฆ
We present a method for jointly recovering the appearance and internal structure of botanical plants from multi-view images based on 3D Gaussian Splatting (3DGS). While 3DGS exhibits robust reconstruction of scene appearance for novel-view synthesis, it lacks structural representations underlying those appearances (e.g., branching patterns of plants), which limits its applicability to tasks such as plant phenotyping. To achieve both high-fidelity appearance and structural reconstruction, we introduce GaussianPlant, a hierarchical 3DGS representation, which disentangles structure and appearance. Specifically, we employ structure primitives (StPs) to explicitly represent branch and leaf geometry, and appearance primitives (ApPs) to the plants' appearance using 3D Gaussians. StPs represent a simplified structure of the plant, i.e., modeling branches as cylinders and leaves as disks. To accurately distinguish the branches and leaves, StP's attributes (i.e., branches or leaves) are optimized in a self-organized manner. ApPs are bound to each StP to represent the appearance of branches or leaves as in conventional 3DGS. StPs and ApPs are jointly optimized using a re-rendering loss on the input multi-view images, as well as the gradient flow from ApP to StP using the binding correspondence information. We conduct experiments to qualitatively evaluate the reconstruction accuracy of both appearance and structure, as well as real-world experiments to qualitatively validate the practical performance. Experiments show that the GaussianPlant achieves both high-fidelity appearance reconstruction via ApPs and accurate structural reconstruction via StPs, enabling the extraction of branch structure and leaf instances.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present \textbf{SDAR-VL}, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an \emph{integrated framework for efficient and stable training}. This framework unifies three components: (1) \textbf{Asynchronous Block-wise Noise Scheduling} to diversify supervision within each batch; (2) \textbf{Effective Mask Ratio Scaling} for unbiased loss normalization under stochastic masking; and (3) a \textbf{Progressive Beta Noise Curriculum} that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves \emph{training efficiency}, \emph{convergence stability}, and \emph{task performance} over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Chen, Junyang Chen, Jinshan Pan, Jiangxin Dong
Robotics
ๆฅ็ๆ่ฆ
Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. First, we propose an LQ-guided feature modulation module that leverages original uncompressed information from LQ inputs to provide high-fidelity conditioning for the diffusion process. We then develop a region-adaptive generative prior activation method to effectively enhance perceptual richness without sacrificing local structural fidelity. Finally, we employ a text-matching guidance strategy to fully harness the conditioning potential of text prompts. Extensive experiments demonstrate that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods with efficient one-step inference.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zulin Zhuang, Yu Bian
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Daylight-linked controls (DLCs) have significant potential for energy savings in buildings, especially when abundant daylight is available and indoor workplane illuminance can be accurately predicted in real time. Most existing studies on indoor daylight predictions were developed and tested for static scenes. This study proposes a multimodal deep learning framework that predicts indoor workplane illuminance distributions in real time from non-intrusive images with temporal-spatial features. By extracting image features only from the side-lit window areas rather than interior pixels, the approach remains applicable in dynamically occupied indoor spaces. A field experiment was conducted in a test room in Guangzhou (China), where 17,344 samples were collected for model training and validation. The model achieved R2 > 0.98 with RMSE < 0.14 on the same-distribution test set and R2 > 0.82 with RMSE < 0.17 on an unseen-day test set, indicating high accuracy and acceptable temporal generalization.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kim Sung-Bin, Joohyun Chang, David Harwath, Tae-Hyun Oh
Natural Language Processing
ๆฅ็ๆ่ฆ
Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip synchronization. To address the lack of a standard editing benchmark, we introduce FacEDiTBench, the first dataset for talking face editing, featuring diverse edit types and lengths, along with new evaluation metrics. Extensive experiments validate that talking face editing and generation emerge as subtasks of speech-conditional motion infilling; FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity while generalizing effectively to talking face generation.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Humaira Tasnim, Ashik E Rasul, Bruce Jo, Hyung-Jin Yoon
Robotics
ๆฅ็ๆ่ฆ
Reliable helipad detection is essential for Autonomous Aerial Vehicle (AAV) landing, especially under GPS-denied or visually degraded conditions. While modern detectors such as YOLOv8 offer strong baseline performance, single-model pipelines struggle to remain robust across the extreme scale transitions that occur during descent, where helipads appear small at high altitude and large near touchdown. To address this limitation, we propose a scale-adaptive dual-expert perception framework that decomposes the detection task into far-range and close-range regimes. Two YOLOv8 experts are trained on scale-specialized versions of the HelipadCat dataset, enabling one model to excel at detecting small, low-resolution helipads and the other to provide high-precision localization when the target dominates the field of view. During inference, both experts operate in parallel, and a geometric gating mechanism selects the expert whose prediction is most consistent with the AAV's viewpoint. This adaptive routing prevents the degradation commonly observed in single-detector systems when operating across wide altitude ranges. The dual-expert perception module is evaluated in a closed-loop landing environment that integrates CARLA's photorealistic rendering with NASA's GUAM flight-dynamics engine. Results show substantial improvements in alignment stability, landing accuracy, and overall robustness compared to single-detector baselines. By introducing a scale-aware expert routing strategy tailored to the landing problem, this work advances resilient vision-based perception for autonomous descent and provides a foundation for future multi-expert AAV frameworks.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Current multimodal large lanauge models possess strong perceptual and reasoning capabilities, however high computational and memory requirements make them difficult to deploy directly on on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs.To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenjun Liu, Qian Wu, Yifeng Hu, Yuke Li
Multimodal
ๆฅ็ๆ่ฆ
We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT addresses the issues of variable-length sequence labels, label sequence misalignment, and character-level errors, outperforming existing methods in accuracy and practical utility. In addition, we introduce Similarity-based Sequence Label Corruption (SSLC), a process that intentionally introduces errors into the training labels to mimic real-world error scenarios during training. SSLC not only can cause a change in the sequence length but also takes into account the visual similarity between characters during corruption. Our method is the first to detect label errors in real-world scene text datasets successfully accounting for variable-length labels. Experimental results demonstrate the effectiveness of SELECT in detecting label errors and improving STR accuracy on real-world text datasets, showcasing its practical utility.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenguo Zhang, Haohan Zhen, Yishen Wang, Le Xu, Tianchen Deng et al.
Machine LearningMultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning.While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels.Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is an Reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Boran Wang, Xinming Wang, Yi Chen, Xiang Li, Jian Xu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
With their high information density and intuitive readability, charts have become the de facto medium for data analysis and communication across disciplines. Recent multimodal large language models (MLLMs) have made notable progress in automated chart understanding, yet they remain heavily dependent on explicit textual annotations and the performance degrades markedly when key numerals are absent. To address this limitation, we introduce ChartAgent, a chart understanding framework grounded in Tool-Integrated Reasoning (TIR). Inspired by human cognition, ChartAgent decomposes complex chart analysis into a sequence of observable, replayable steps. Supporting this architecture is an extensible, modular tool library comprising more than a dozen core tools, such as keyelement detection, instance segmentation, and optical character recognition (OCR), which the agent dynamically orchestrates to achieve systematic visual parsing across diverse chart types. Leveraging TIRs transparency and verifiability, ChartAgent moves beyond the black box paradigm by standardizing and consolidating intermediate outputs into a structured Evidence Package, providing traceable and reproducible support for final conclusions. Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible systems for chart understanding.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meng Wei, Cheng Zhang, Jianmin Zheng, Hamid Rezatofighi, Jianfei Cai
ๆฅ็ๆ่ฆ
Recent advances have equipped 3D Gaussian Splatting with texture parameterizations to capture spatially varying attributes, improving the performance of both appearance modeling and downstream tasks. However, the added texture parameters introduce significant memory efficiency challenges. Rather than proposing new texture formulations, we take a step back to examine the characteristics of existing textured Gaussian methods and identify two key limitations in common: (1) Textures are typically defined in canonical space, leading to inefficient sampling that wastes textures' capacity on low-contribution regions; and (2) texture parameterization is uniformly assigned across all Gaussians, regardless of their visual complexity, resulting in over-parameterization. In this work, we address these issues through two simple yet effective strategies: adaptive sampling based on the Gaussian density distribution and error-driven anisotropic parameterization that allocates texture resources according to rendering error. Our proposed ASAP Textured Gaussians, short for Adaptive Sampling and Anisotropic Parameterization, significantly improve the quality efficiency tradeoff, achieving high-fidelity rendering with far fewer texture parameters.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ignacio Alzugaray, Marwan Taher, Andrew J. Davison
ๆฅ็ๆ่ฆ
We present a novel neural RGB-D Simultaneous Localization And Mapping (SLAM) system that learns an implicit map of the scene in real time. For the first time, we explore the use of Scene Coordinate Regression (SCR) as the core implicit map representation in a neural SLAM pipeline, a paradigm that trains a lightweight network to directly map 2D image features to 3D global coordinates. SCR networks provide efficient, low-memory 3D map representations, enable extremely fast relocalization, and inherently preserve privacy, making them particularly suitable for neural implicit SLAM.
Our system is the first one to achieve strict real-time in neural implicit RGB-D SLAM by relying on a SCR-based representation. We introduce a novel SCR architecture specifically tailored for this purpose and detail the critical design choices required to integrate SCR into a live SLAM pipeline. The resulting framework is simple yet flexible, seamlessly supporting both sparse and dense features, and operates reliably in dynamic environments without special adaptation. We evaluate our approach on established synthetic and real-world benchmarks, demonstrating competitive performance against the state of the art. Project Page: https://github.com/ialzugaray/ace-slam
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaheng Li, Qiyu Dai, Lihan Li, Praneeth Chakravarthula, He Sun et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We consider the problem of active 3D imaging using single-shot structured light systems, which are widely employed in commercial 3D sensing devices such as Apple Face ID and Intel RealSense. Traditional structured light methods typically decode depth correspondences through pixel-domain matching algorithms, resulting in limited robustness under challenging scenarios like occlusions, fine-structured details, and non-Lambertian surfaces. Inspired by recent advances in neural feature matching, we propose a learning-based structured light decoding framework that performs robust correspondence matching within feature space rather than the fragile pixel domain. Our method extracts neural features from the projected patterns and captured infrared (IR) images, explicitly incorporating their geometric priors by building cost volumes in feature space, achieving substantial performance improvements over pixel-domain decoding approaches. To further enhance depth quality, we introduce a depth refinement module that leverages strong priors from large-scale monocular depth estimation models, improving fine detail recovery and global structural coherence. To facilitate effective learning, we develop a physically-based structured light rendering pipeline, generating nearly one million synthetic pattern-image pairs with diverse objects and materials for indoor settings. Experiments demonstrate that our method, trained exclusively on synthetic data with multiple structured light patterns, generalizes well to real-world indoor environments, effectively processes various pattern types without retraining, and consistently outperforms both commercial structured light systems and passive stereo RGB-based depth estimation methods. Project page: https://namisntimpot.github.io/NSLweb/.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yibing Fu, Yunpeng Zhao, Zhitao Zeng, Cheng Chen, Yueming Jin
ๆฅ็ๆ่ฆ
Multi-modal learning integrating medical images and tabular data has significantly advanced clinical decision-making in recent years. Self-Supervised Learning (SSL) has emerged as a powerful paradigm for pretraining these models on large-scale unlabeled image-tabular data, aiming to learn discriminative representations. However, existing SSL methods for image-tabular representation learning are often confined to specific data cohorts, mainly due to their rigid tabular modeling mechanisms when modeling heterogeneous tabular data. This inter-tabular barrier hinders the multi-modal SSL methods from effectively learning transferrable medical knowledge shared across diverse cohorts. In this paper, we propose a novel SSL framework, namely CITab, designed to learn powerful multi-modal feature representations in a cross-tabular manner. We design the tabular modeling mechanism from a semantic-awareness perspective by integrating column headers as semantic cues, which facilitates transferrable knowledge learning and the scalability in utilizing multiple data sources for pretraining. Additionally, we propose a prototype-guided mixture-of-linear layer (P-MoLin) module for tabular feature specialization, empowering the model to effectively handle the heterogeneity of tabular data and explore the underlying medical concepts. We conduct comprehensive evaluations on Alzheimer's disease diagnosis task across three publicly available data cohorts containing 4,461 subjects. Experimental results demonstrate that CITab outperforms state-of-the-art approaches, paving the way for effective and scalable cross-tabular multi-modal learning.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs with state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All codes, data, and models will be released to facilitate future research.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou et al.
Natural Language ProcessingComputer VisionRobotics
ๆฅ็ๆ่ฆ
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yen-Ju Lu, Kunxiao Gao, Mingrui Liang, Helin Wang, Thomas Thebaud et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Model, achieves 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
๐
2025-12-16
๐ ArXiv Preprint (eess.AS)
๐ฅ Pawel Swietojanski, Xinwei Li, Mingbin Xu, Takaaki Hori, Dogan Can et al.
ๆฅ็ๆ่ฆ
We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish. The model loses ability to order acoustic encodings due to permutation invariance of keys and values in cross-attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ David Schulmeister, Valentin Hartmann, Lars Klein, Robert West
Natural Language Processing
ๆฅ็ๆ่ฆ
Today, a lot of research on language models is focused on large, general-purpose models. However, many NLP pipelines only require models with a well-defined, small set of capabilities. While large models are capable of performing the tasks of those smaller models, they are simply not fast enough to process large amounts of data or offer real-time responses. Furthermore, they often use unnecessarily large amounts of energy, leading to sustainability concerns and problems when deploying them on battery-powered devices. In our work, we show how to train small models for such efficiency-critical applications. As opposed to many off-the-shelf NLP pipelines, our models use modern training techniques such as distillation, and offer support for low-resource languages. We call our models TiME (Tiny Monolingual Encoders) and comprehensively evaluate them on a range of common NLP tasks, observing an improved trade-off between benchmark performance on one hand, and throughput, latency and energy consumption on the other. Along the way, we show that distilling monolingual models from multilingual teachers is possible, and likewise distilling models with absolute positional embeddings from teachers with relative positional embeddings.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
MultimodalMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Adarsha Shrestha, Basanta Pokharel, Binit Shrestha, Smriti Adhikari, Dinesh Gothe
Natural Language Processing
ๆฅ็ๆ่ฆ
Nepali, a low-resource language spoken by over 32 million people, continues to face challenges in natural language processing (NLP) due to its complex grammar, agglutinative morphology, and limited availability of high-quality corpora. Most efforts to date have centered on basic encoder architectures; they remain insufficient for Nepali-specific text generation. This study presents a GPT-2-based Nepali language model trained using several training strategies inspired by GPT-3, including optimized learning rate schedules, batch scaling, and architectural refinements. A custom 16k Byte-Pair Encoding (BPE) tokenizer was trained exclusively on Nepali text to ensure more consistent segmentation and improved input representation. The model was pretrained on a combined dataset comprising a 10.75GB cleaned NepBERTa corpus and additional web-scraped Nepali news articles. FlashAttention was integrated to reduce memory usage and stabilize training. After two epochs, the model achieved a training loss of 3.168177, a validation loss of 3.081982, and a final perplexity of 21.80, demonstrating its capability to generate coherent Nepali news-style text.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ekaterina Artemova, Laurie Burchell, Daryna Dementieva, Shu Okabe, Mariya Shmatova et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages -- from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.
๐
2025-12-16
๐ Accepted in IEEE Bigdata 2025- LLMs4ALL
๐ฅ Tejaswani Dash, Dinesh Karri, Anudeep Vurity, Gautam Datla, Tazeem Ahmad et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
This paper introduces PolyPersona, a generative framework for synthesizing persona-conditioned survey responses across multiple domains. The framework instruction-tunes compact chat models using parameter-efficient LoRA adapters with 4-bit quantization under a resource-adaptive training setup. A dialogue-based data pipeline explicitly preserves persona cues, ensuring consistent behavioral alignment across generated responses. Using this pipeline, we construct a dataset of 3,568 synthetic survey responses spanning ten domains and 433 distinct personas, enabling controlled instruction tuning and systematic multi-domain evaluation. We evaluate the generated responses using a multi-metric evaluation suite that combines standard text generation metrics, including BLEU, ROUGE, and BERTScore, with survey-specific metrics designed to assess structural coherence, stylistic consistency, and sentiment alignment.Experimental results show that compact models such as TinyLlama 1.1B and Phi-2 achieve performance comparable to larger 7B to 8B baselines, with a highest BLEU score of 0.090 and ROUGE-1 of 0.429. These findings demonstrate that persona-conditioned fine-tuning enables small language models to generate reliable and coherent synthetic survey data. The proposed framework provides an efficient and reproducible approach for survey data generation, supporting scalable evaluation while facilitating bias analysis through transparent and open protocols.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hongli Li, Che Han Chen, Kevin Fan, Chiho Young-Johnson, Soyoung Lim et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite the growing promise of large language models (LLMs) in automatic essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLMs and human raters in AES. Across studies, reported LLM-human agreement was generally moderate to good, with agreement indices (e.g., Quadratic Weighted Kappa, Pearson correlation, and Spearman's rho) mostly ranging between 0.30 and 0.80. Substantial variability in agreement levels was observed across studies, reflecting differences in study-specific factors as well as the lack of standardized reporting practices. Implications and directions for future research are discussed.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ David Samuel, Lucas Georges Gabriel Charpentier
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal ratio between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ying Nie, Kai Han, Hongguang Li, Hang Zhou, Tianyu Guo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid scaling of Large Language Models (LLMs) has achieved remarkable performance, but it also leads to prohibitive memory costs. Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, thereby hitting the representational ceiling of the base model. In this work, we propose VersatileFFN, a novel feed-forward network (FFN) that enables flexible reuse of parameters in both width and depth dimensions within a fixed parameter budget. Inspired by the dual-process theory of cognition, VersatileFFN comprises two adaptive pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without increasing parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gating dynamically balances the two pathways, steering "easy" tokens through the efficient width-wise route and allocating deeper iterative refinement to "hard" tokens. Crucially, both pathways reuse the same parameters, so all additional capacity comes from computation rather than memory. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/VersatileFFN.
๐
2025-12-16
๐ Commentary on Futrell, R., & Mahowald, K. arXiv:25
๐ฅ Marianne de Heer Kloots, Paul Boersma, Willem Zuidema
Machine Learning
ๆฅ็ๆ่ฆ
Futrell and Mahowald present a useful framework bridging technology-oriented deep learning systems and explanation-oriented linguistic theories. Unfortunately, the target article's focus on generative text-based LLMs fundamentally limits fruitful interactions with linguistics, as many interesting questions on human language fall outside what is captured by written text. We argue that audio-based deep learning models can and should play a crucial role.
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated remarkable potential in transforming recommender systems from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 successfully pioneered this paradigm by integrating LLM-based reasoning into user interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) insufficient explanation diversity in fixed-template generation; (3) limited generalization under supervised learning paradigms; and (4) simplistic outcome-focused evaluation that fails to match human standards.
To address these challenges, we present RecGPT-V2 with four key innovations. First, a Hierarchical Multi-Agent System restructures intent reasoning through coordinated collaboration, eliminating cognitive duplication while enabling diverse intent coverage. Combined with Hybrid Representation Inference that compresses user-behavior contexts, our framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework dynamically generates contextually adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi-reward conflicts, achieving +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning, improving human preference alignment. Online A/B tests on Taobao demonstrate significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes both the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial utility.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Teodor Poncu, Ioana Pintilie, Marius Dragoi, Dragos Tantaru, Florin Brad
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) typically excel at coding tasks involving high-level programming languages, as opposed to lower-level programming languages, such as assembly. We propose a synthetic data generation method named C-ing Clearly, which leverages the corresponding C code to enhance an LLM's understanding of assembly. By fine-tuning on data generated through our method, we demonstrate improved LLM performance for binary code summarization and vulnerability detection. Our approach demonstrates consistent gains across different LLM families and model sizes.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shizhuo Mao, Song Chen, Yi Kang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at natural language tasks but face deployment challenges due to their growing size outpacing GPU memory advancements. Model quantization mitigates this issue by lowering weight and activation precision, but existing solutions face fundamental trade-offs: dynamic quantization incurs high computational overhead and poses deployment challenges on edge devices, while static quantization sacrifices accuracy. Existing approaches of quantization-aware training (QAT) further suffer from weight training costs. We propose SASQ: a lightweight QAT framework specifically tailored for activation quantization factors. SASQ exclusively optimizes only the quantization factors (without changing pre-trained weights), enabling static inference with high accuracy while maintaining deployment efficiency. SASQ adaptively truncates some outliers, thereby reducing the difficulty of quantization while preserving the distributional characteristics of the activations. SASQ not only surpasses existing SOTA quantization schemes but also outperforms the corresponding FP16 models. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Gabriele Prato, Shagun Sodhani, Alessandro Sordoni, Sarath Chandar
Natural Language Processing
ๆฅ็ๆ่ฆ
The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models' capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huayang Li, Tianyu Zhao, Richard Sproat
Natural Language Processing
ๆฅ็ๆ่ฆ
In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. Drawing on Cognitive Load Theory (CLT), we argue that this uninformative structure increases extraneous cognitive load, consuming finite working memory capacity that should be allocated to deep reasoning and attention allocation. To address this, we propose RePo, a novel mechanism that reduces extraneous load via context re-positioning. Unlike standard approaches, RePo utilizes a differentiable module, $f_ฯ$, to assign token positions that capture contextual dependencies, rather than replying on pre-defined integer range. By continually pre-training on the OLMo-2 1B backbone, we demonstrate that RePo significantly enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Detailed analysis reveals that RePo successfully allocate higher attention to distant but relevant information, assign positions in dense and non-linear space, and capture the intrinsic structure of the input context. Our code is available at https://github.com/SakanaAI/repo.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher
Robotics
ๆฅ็ๆ่ฆ
The field of Language Reasoning Models (LRMs) has been very active over the past few years with advances in training and inference techniques enabling LRMs to reason longer, and more accurately. However, a growing body of studies show that LRMs are still inefficient, over-generating verification and reflection steps. To address this challenge, we introduce the Step-Tagging framework, a lightweight sentence-classifier enabling real-time annotation of the type of reasoning steps that an LRM is generating. To monitor reasoning behaviors, we introduced ReasonType: a novel taxonomy of reasoning steps. Building on this framework, we demonstrated that online monitoring of the count of specific steps can produce effective interpretable early stopping criteria of LRM inferences. We evaluate the Step-tagging framework on three open-source reasoning models across standard benchmark datasets: MATH500, GSM8K, AIME and non-mathematical tasks (GPQA and MMLU-Pro). We achieve 20 to 50\% token reduction while maintaining comparable accuracy to standard generation, with largest gains observed on more computation-heavy tasks. This work offers a novel way to increase control over the generation of LRMs, and a new tool to study behaviors of LRMs.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nikoleta Anesti, Edward Hill, Andreas Joseph
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper investigates the ability of Large Language Models (LLMs), specifically GPT-3.5-turbo (GPT), to form inflation perceptions and expectations based on macroeconomic price signals. We compare the LLM's output to household survey data and official statistics, mimicking the information set and demographic characteristics of the Bank of England's Inflation Attitudes Survey (IAS). Our quasi-experimental design exploits the timing of GPT's training cut-off in September 2021 which means it has no knowledge of the subsequent UK inflation surge. We find that GPT tracks aggregate survey projections and official statistics at short horizons. At a disaggregated level, GPT replicates key empirical regularities of households' inflation perceptions, particularly for income, housing tenure, and social class. A novel Shapley value decomposition of LLM outputs suited for the synthetic survey setting provides well-defined insights into the drivers of model outputs linked to prompt content. We find that GPT demonstrates a heightened sensitivity to food inflation information similar to that of human respondents. However, we also find that it lacks a consistent model of consumer price inflation. More generally, our approach could be used to evaluate the behaviour of LLMs for use in the social sciences, to compare different models, or to assist in survey design.
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Panayiotis Smeros, Vincent Emonet, Ruijie Wang, Ana-Claudia Sima, Tarcisio Mendes de Farias
Natural Language Processing
ๆฅ็ๆ่ฆ
The advent of large language models is contributing to the emergence of novel approaches that promise to better tackle the challenge of generating structured queries, such as SPARQL queries, from natural language. However, these new approaches mostly focus on response accuracy over a single source while ignoring other evaluation criteria, such as federated query capability over distributed data stores, as well as runtime and cost to generate SPARQL queries. Consequently, they are often not production-ready or easy to deploy over (potentially federated) knowledge graphs with good accuracy. To mitigate these issues, in this paper, we extend our previous work and describe and systematically evaluate SPARQL-LLM, an open-source and triplestore-agnostic approach, powered by lightweight metadata, that generates SPARQL queries from natural language text. First, we describe its architecture, which consists of dedicated components for metadata indexing, prompt building, and query generation and execution. Then, we evaluate it based on a state-of-the-art challenge with multilingual questions, and a collection of questions from three of the most prevalent knowledge graphs within the field of bioinformatics. Our results demonstrate a substantial increase of 24% in the F1 Score on the state-of-the-art challenge, adaptability to high-resource languages such as English and Spanish, as well as ability to form complex and federated bioinformatics queries. Furthermore, we show that SPARQL-LLM is up to 36x faster than other systems participating in the challenge, while costing a maximum of $0.01 per question, making it suitable for real-time, low-cost text-to-SPARQL applications. One such application deployed over real-world decentralized knowledge graphs can be found at https://www.expasy.org/chat.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Juan-Josรฉ Guzmรกn-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Graham Ranger Martha-Lorena Avendaรฑo-Garrido
Natural Language Processing
ๆฅ็ๆ่ฆ
The aim of this article is to introduce two Context-Free Grammars (CFG) for Nawatl Corpora expansion. Nawatl is an Amerindian language (it is a National Language of Mexico) of the $ฯ$-language type, i.e. a language with few digital resources. For this reason the corpora available for the learning of Large Language Models (LLMs) are virtually non-existent, posing a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby to expand the corpora for the purpose of learning non contextual embeddings. For this objective, we introduce two new Nawatl CFGs and use them in generative mode. Using these grammars, it is possible to expand Nawatl corpus significantly and subsequently to use it to learn embeddings and to evaluate their relevance in a sentences semantic similarity task. The results show an improvement compared to the results obtained using only the original corpus without artificial expansion, and also demonstrate that economic embeddings often perform better than some LLMs.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Estelle Zheng, Nathan Cerisara, Sรฉbastien Warichet, Emmanuel Helbert, Christophe Cerisara
Natural Language Processing
ๆฅ็ๆ่ฆ
Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA's compute scaling slope while cutting peak memory by 50\%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST has competitive performance with QLoRA's accuracy on average while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing\textemdash conditions under which QLoRA exhausts memory. Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder's architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque
Natural Language Processing
ๆฅ็ๆ่ฆ
Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local\_dialect:standard\_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76\% to 55\% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hongqiu Ni, Jiabao Zhang, Guopeng Li, Zilong Wang, Ruiqi Wu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services like Web APIs, introduce a mismatch in their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization which prevents them from minimizing the end-to-end latency of the complete agentic workflow, i.e., the global Job Completion Time (JCT) over the entire request lifecycle. To address this limitation, we propose Astraea, a service engine designed to shift the optimization from local segments to the global request lifecycle. Astraea employs a state-aware, hierarchical scheduling algorithm that integrates a request's historical state with future predictions. It dynamically classifies requests by their I/O and compute intensive nature and uses an enhanced HRRN policy to balance efficiency and fairness. Astraea also implements an adaptive KV cache manager that intelligently handles the agent state during I/O waits based on the system memory pressure. Extensive experiments show that Astraea reduces average JCT by up to 25.5\% compared to baseline methods. Moreover, our approach demonstrates strong robustness and stability under high load across various model scales.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiran Zhang, Jincheng Hu, Mark Dras, Usman Naseem
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at single-turn reasoning but often lose accuracy and coherence over extended, multi-turn interactions. Recent evaluations such as TurnBench highlight recurring failure modes-reasoning bias, task drift, hallucination, overconfidence, and memory decay. Current approaches typically append full conversational histories, causing unbounded context growth, higher computational costs, and degraded reasoning efficiency. We introduce CogMem, a cognitively inspired, memory-augmented LLM architecture that supports sustained iterative reasoning through structured, persistent memory. CogMem incorporates three layers: a Long-Term Memory (LTM) that consolidates cross-session reasoning strategies; a Direct Access (DA) memory that maintains session-level notes and retrieves relevant long-term memories; and a Focus of Attention (FoA) mechanism that dynamically reconstructs concise, task-relevant context at each turn. Experiments on TurnBench show that this layered design mitigates reasoning failures, controls context growth, and improves consistency across extended reasoning chains, moving toward more reliable, human-like reasoning in LLMs.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Taiga Mori et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into a real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally-aware spoken dialogue systems.
๐
2025-12-16
๐ ArXiv Preprint (eess.AS)
๐ฅ Sungnyun Kim
Multimodal
ๆฅ็ๆ่ฆ
The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, achieving the robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. To address architectural scalability, we explore how to efficiently expand model capacity while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that intelligently allocates computational resources based on the input characteristics. Finally, at the system level, we present methods to expand the system's functionality through modular integration with large-scale foundation models, leveraging their powerful cognitive and generative capabilities to maximize final recognition accuracy. By systematically providing solutions at each of these three levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system with high reliability in real-world applications.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Siran Liu, Zane Cao, Yongchao He
Natural Language Processing
ๆฅ็ๆ่ฆ
Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens--compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving $\ge$ 99% of full-attention accuracy and up to 2.61$\times$ faster attention computation than FlashAttention.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mayank Singh, Vikas Yadav, Shiva Krishna Reddy Malay, Shravan Nayak, Sai Rajeswar et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic search for Multi-Agent Systems has recently emerged as a key focus in agentic AI research. Several prior approaches have relied on LLM-based free-form search over the code space. In this work, we propose a more structured framework that explores the same space through a fixed set of simple, composable components. We show that, despite lacking the generative flexibility of LLMs during the candidate generation stage, our method outperforms prior approaches on four out of five benchmarks across two domains: mathematics and question answering. Furthermore, our method offers additional advantages, including a more cost-efficient search process and the generation of modular, interpretable multi-agent systems with simpler logic.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yi Hu, Cai Zhou, Muhan Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
The scaling of large language models (LLMs) emphasizes increasing depth, yet performance gains diminish with added layers. Prior work introduces the concept of "effective depth", arguing that deeper models fail to fully utilize their layers for meaningful computation. Building on this, we systematically study how effective depth varies with model scale, training type, and task difficulty. First, we analyze the model behavior of Qwen-2.5 family (1.5B-32B) and find that while the number of effective layers grows with model size, the effective depth ratio remains stable. Besides, comparisons between base and corresponding long-CoT models show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper per-token computation. Furthermore, evaluations across tasks of varying difficulty indicate that models do not dynamically use more layers for harder problems. Our results suggest that current LLMs underuse available depth across scales, training paradigms and tasks of varying difficulties, pointing out research opportunities on increasing the layer utilization rate of LLMs, model pruning, and early exiting. Our code is released at https://github.com/AheadOFpotato/what_affects_effective_depth.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Current multimodal large lanauge models possess strong perceptual and reasoning capabilities, however high computational and memory requirements make them difficult to deploy directly on on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs.To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhimin Qiu, Di Wu, Feng Liu, Chenrui Hu, Yuxiao Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper proposes a structure-aware decoding method based on large language models to address the difficulty of traditional approaches in maintaining both semantic integrity and structural consistency in nested and overlapping entity extraction tasks. The method introduces a candidate span generation mechanism and structured attention modeling to achieve unified modeling of entity boundaries, hierarchical relationships, and cross-dependencies. The model first uses a pretrained language model to obtain context-aware semantic representations, then captures multi-granular entity span features through candidate representation combinations, and introduces hierarchical structural constraints during decoding to ensure consistency between semantics and structure. To enhance stability in complex scenarios, the model jointly optimizes classification loss and structural consistency loss, maintaining high recognition accuracy under multi-entity co-occurrence and long-sentence dependency conditions. Experiments conducted on the ACE 2005 dataset demonstrate significant improvements in Accuracy, Precision, Recall, and F1-Score, particularly in nested and overlapping entity recognition, where the model shows stronger boundary localization and structural modeling capability. This study verifies the effectiveness of structure-aware decoding in complex semantic extraction tasks, provides a new perspective for developing language models with hierarchical understanding, and establishes a methodological foundation for high-precision information extraction.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli et al.
Computer Vision
ๆฅ็ๆ่ฆ
Non-parametric quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find the Leech lattice-based quantization method, which is dubbed as Spherical Leech Quantization ($ฮ_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Sirui Chen, Zi-ang Cao, Zhengyi Luo, Fernando Castaรฑeda, Chenran Li et al.
Robotics
ๆฅ็ๆ่ฆ
Recent progress in humanoid robots has unlocked agile locomotion skills, including backflipping, running, and crawling. Yet it remains challenging for a humanoid robot to perform forceful manipulation tasks such as moving objects, wiping, and pushing a cart. We propose adaptive Compliance Humanoid control through hIsight Perturbation (CHIP), a plug-and-play module that enables controllable end-effector stiffness while preserving agile tracking of dynamic reference motions. CHIP is easy to implement and requires neither data augmentation nor additional reward tuning. We show that a generalist motion-tracking controller trained with CHIP can perform a diverse set of forceful manipulation tasks that require different end-effector compliance, such as multi-robot collaboration, wiping, box delivery, and door opening.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yen-Ju Lu, Kunxiao Gao, Mingrui Liang, Helin Wang, Thomas Thebaud et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chuan He
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Stochastic optimization is fundamental to modern machine learning. Recent research has extended the study of stochastic first-order methods (SFOMs) from light-tailed to heavy-tailed noise, which frequently arises in practice, with clipping emerging as a key technique for controlling heavy-tailed gradients. Extensive theoretical advances have further shown that the oracle complexity of SFOMs depends on the tail index $ฮฑ$ of the noise. Nonetheless, existing complexity results often cover only the case $ฮฑ\in (1,2]$, that is, the regime where the noise has a finite mean, while the complexity bounds tend to infinity as $ฮฑ$ approaches $1$. This paper tackles the general case of noise with tail index $ฮฑ\in(0,2]$, covering regimes ranging from noise with bounded variance to noise with an infinite mean, where the latter case has been scarcely studied. Through a novel analysis of the bias-variance trade-off in gradient clipping, we show that when a symmetry measure of the noise tail is controlled, clipped SFOMs achieve improved complexity guarantees in the presence of heavy-tailed noise for any tail index $ฮฑ\in (0,2]$. Our analysis of the bias-variance trade-off not only yields new unified complexity guarantees for clipped SFOMs across this full range of tail indices, but is also straightforward to apply and can be combined with classical analyses under light-tailed noise to establish oracle complexity guarantees under heavy-tailed noise. Finally, numerical experiments validate our theoretical findings.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dimitris Bertsimas, Yu Ma, Kimberly Villalobos Carballo, Gagan Singh, Michal Laskowski et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Hospitals lack automated systems to harness the growing volume of heterogeneous clinical and operational data to effectively forecast critical events. Early identification of patients at risk for deterioration is essential not only for patient care quality monitoring but also for physician care management. However, translating varied data streams into accurate and interpretable risk assessments poses significant challenges due to inconsistent data formats. We develop a multimodal machine learning framework, the Early Warning Index (EWI), to predict the aggregate risk of ICU admission, emergency response team dispatch, and mortality. Key to EWI's design is a human-in-the-loop process: clinicians help determine alert thresholds and interpret model outputs, which are enhanced by explainable outputs using Shapley Additive exPlanations (SHAP) to highlight clinical and operational factors (e.g., scheduled surgeries, ward census) driving each patient's risk. We deploy EWI in a hospital dashboard that stratifies patients into three risk tiers. Using a dataset of 18,633 unique patients at a large U.S. hospital, our approach automatically extracts features from both structured and unstructured electronic health record (EHR) data and achieves C-statistics of 0.796. It is currently used as a triage tool for proactively managing at-risk patients. The proposed approach saves physicians valuable time by automatically sorting patients of varying risk levels, allowing them to concentrate on patient care rather than sifting through complex EHR data. By further pinpointing specific risk drivers, the proposed model provides data-informed adjustments to caregiver scheduling and allocation of critical resources. As a result, clinicians and administrators can avert downstream complications, including costly procedures or high readmission rates and improve overall patient flow.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rae Chipera, Jenny Du, Irene Tsapara
ๆฅ็ๆ่ฆ
Contemporary reservoir computing relies heavily on smooth, globally Lipschitz continuous activation functions, limiting applications in defense, disaster response, and pharmaceutical modeling where robust operation under extreme conditions is critical. We systematically investigate non-smooth activation functions, including chaotic, stochastic, and fractal variants, in echo state networks. Through comprehensive parameter sweeps across 36,610 reservoir configurations, we demonstrate that several non-smooth functions not only maintain the Echo State Property (ESP) but outperform traditional smooth activations in convergence speed and spectral radius tolerance. Notably, the Cantor function (continuous everywhere and flat almost everywhere) maintains ESP-consistent behavior up to spectral radii of rho ~ 10, an order of magnitude beyond typical bounds for smooth functions, while achieving 2.6x faster convergence than tanh and ReLU. We introduce a theoretical framework for quantized activation functions, defining a Degenerate Echo State Property (d-ESP) that captures stability for discrete-output functions and proving that d-ESP implies traditional ESP. We identify a critical crowding ratio Q=N/k (reservoir size / quantization levels) that predicts failure thresholds for discrete activations. Our analysis reveals that preprocessing topology, rather than continuity per se, determines stability: monotone, compressive preprocessing maintains ESP across scales, while dispersive or discontinuous preprocessing triggers sharp failures. While our findings challenge assumptions about activation function design in reservoir computing, the mechanism underlying the exceptional performance of certain fractal functions remains unexplained, suggesting fundamental gaps in our understanding of how geometric properties of activation functions influence reservoir dynamics.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alban Puech, Matteo Mazzonelli, Celia Cintas, Tamara R. Govindasamy, Mangaliso Mngomezulu et al.
Machine Learning
ๆฅ็ๆ่ฆ
We introduce gridfm-datakit-v1, a Python library for generating realistic and diverse Power Flow (PF) and Optimal Power Flow (OPF) datasets for training Machine Learning (ML) solvers. Existing datasets and libraries face three main challenges: (1) lack of realistic stochastic load and topology perturbations, limiting scenario diversity; (2) PF datasets are restricted to OPF-feasible points, hindering generalization of ML solvers to cases that violate operating limits (e.g., branch overloads or voltage violations); and (3) OPF datasets use fixed generator cost functions, limiting generalization across varying costs. gridfm-datakit addresses these challenges by: (1) combining global load scaling from real-world profiles with localized noise and supporting arbitrary N-k topology perturbations to create diverse yet realistic datasets; (2) generating PF samples beyond operating limits; and (3) producing OPF data with varying generator costs. It also scales efficiently to large grids (up to 10,000 buses). Comparisons with OPFData, OPF-Learn, PGLearn, and PF$ฮ$ are provided. Available on GitHub at https://github.com/gridfm/gridfm-datakit under Apache 2.0 and via `pip install gridfm-datakit`.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ David Schulmeister, Valentin Hartmann, Lars Klein, Robert West
Natural Language Processing
ๆฅ็ๆ่ฆ
Today, a lot of research on language models is focused on large, general-purpose models. However, many NLP pipelines only require models with a well-defined, small set of capabilities. While large models are capable of performing the tasks of those smaller models, they are simply not fast enough to process large amounts of data or offer real-time responses. Furthermore, they often use unnecessarily large amounts of energy, leading to sustainability concerns and problems when deploying them on battery-powered devices. In our work, we show how to train small models for such efficiency-critical applications. As opposed to many off-the-shelf NLP pipelines, our models use modern training techniques such as distillation, and offer support for low-resource languages. We call our models TiME (Tiny Monolingual Encoders) and comprehensively evaluate them on a range of common NLP tasks, observing an improved trade-off between benchmark performance on one hand, and throughput, latency and energy consumption on the other. Along the way, we show that distilling monolingual models from multilingual teachers is possible, and likewise distilling models with absolute positional embeddings from teachers with relative positional embeddings.
๐
2025-12-16
๐ WSDM 2026
๐ฅ Chaohao Yuan, Zhenjie Song, Ercan Engin Kuruoglu, Kangfei Zhao, Yang Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Graph Transformers (GTs) have emerged as a promising graph learning tool, leveraging their all-pair connected property to effectively capture global information. To address the over-smoothing problem in deep GNNs, global attention was initially introduced, eliminating the necessity for using deep GNNs. However, through empirical and theoretical analysis, we verify that the introduced global attention exhibits severe over-smoothing, causing node representations to become indistinguishable due to its inherent low-pass filtering. This effect is even stronger than that observed in GNNs. To mitigate this, we propose PageRank Transformer (ParaFormer), which features a PageRank-enhanced attention module designed to mimic the behavior of deep Transformers. We theoretically and empirically demonstrate that ParaFormer mitigates over-smoothing by functioning as an adaptive-pass filter. Experiments show that ParaFormer achieves consistent performance improvements across both node classification and graph classification tasks on 11 datasets ranging from thousands to millions of nodes, validating its efficacy. The supplementary material, including code and appendix, can be found in https://github.com/chaohaoyuan/ParaFormer.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alessandro Trapasso, Luca Iocchi, Fabio Patrizi
Machine Learning
ๆฅ็ๆ่ฆ
Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Omid Khormali
Robotics
ๆฅ็ๆ่ฆ
We introduce the Overlap-Weighted Hierarchical Normalized Persistence Velocity (OW-HNPV), a novel topological data analysis method for detecting anomalies in time-varying networks. Unlike existing methods that measure cumulative topological presence, we introduce the first velocity-based perspective on persistence diagrams, measuring the rate at which features appear and disappear, automatically downweighting noise through overlap-based weighting. We also prove that OW-HNPV is mathematically stable. It behaves in a controlled, predictable way, even when comparing persistence diagrams from networks with different feature types. Applied to Ethereum transaction networks (May 2017-May 2018), OW-HNPV demonstrates superior performance for cryptocurrency anomaly detection, achieving up to 10.4% AUC gain over baseline models for 7-day price movement predictions. Compared with established methods, including Vector of Averaged Bettis (VAB), persistence landscapes, and persistence images, velocity-based summaries excel at medium- to long-range forecasting (4-7 days), with OW-HNPV providing the most consistent and stable performance across prediction horizons. Our results show that modeling topological velocity is crucial for detecting structural anomalies in dynamic networks.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Prasanjit Dubey, Aritra Guha, Zhengyi Zhou, Qiong Wu, Xiaoming Huo et al.
ๆฅ็ๆ่ฆ
Sparse longitudinal (SL) textual data arises when individuals generate text repeatedly over time (e.g., customer reviews, occasional social media posts, electronic medical records across visits), but the frequency and timing of observations vary across individuals. These complex textual data sets have immense potential to inform future policy and targeted recommendations. However, because SL text data lack dedicated methods and are noisy, heterogeneous, and prone to anomalies, detecting and inferring key patterns is challenging. We introduce LLmFPCA-detect, a flexible framework that pairs LLM-based text embeddings with functional data analysis to detect clusters and infer anomalies in large SL text datasets. First, LLmFPCA-detect embeds each piece of text into an application-specific numeric space using LLM prompts. Sparse multivariate functional principal component analysis (mFPCA) conducted in the numeric space forms the workhorse to recover primary population characteristics, and produces subject-level scores which, together with baseline static covariates, facilitate data segmentation, unsupervised anomaly detection and inference, and enable other downstream tasks. In particular, we leverage LLMs to perform dynamic keyword profiling guided by the data segments and anomalies discovered by LLmFPCA-detect, and we show that cluster-specific functional PC scores from LLmFPCA-detect, used as features in existing pipelines, help boost prediction performance. We support the stability of LLmFPCA-detect with experiments and evaluate it on two different applications using public datasets, Amazon customer-review trajectories, and Wikipedia talk-page comment streams, demonstrating utility across domains and outperforming state-of-the-art baselines.
๐
2025-12-16
๐ pre-print of the upcoming EURASIP JASM journal art
๐ฅ Lukรกลก Samuel Martรกk, Patricia Hu, Gerhard Widmer
Machine Learning
ๆฅ็ๆ่ฆ
Automatic Music Transcription (AMT) -- the task of converting music audio into note representations -- has seen rapid progress, driven largely by deep learning systems. Due to the limited availability of richly annotated music datasets, much of the progress in AMT has been concentrated on classical piano music, and even a few very specific datasets. Whether these systems can generalize effectively to other musical contexts remains an open question. Complementing recent studies on distribution shifts in sound (e.g., recording conditions), in this work we investigate the musical dimension -- specifically, variations in genre, dynamics, and polyphony levels. To this end, we introduce the MDS corpus, comprising three distinct subsets -- (1) Genre, (2) Random, and (3) MAEtest -- to emulate different axes of distribution shift. We evaluate the performance of several state-of-the-art AMT systems on the MDS corpus using both traditional information-retrieval and musically-informed performance metrics. Our extensive evaluation isolates and exposes varying degrees of performance degradation under specific distribution shifts. In particular, we measure a note-level F1 performance drop of 20 percentage points due to sound, and 14 due to genre. Generally, we find that dynamics estimation proves more vulnerable to musical variation than onset prediction. Musically informed evaluation metrics, particularly those capturing harmonic structure, help identify potential contributing factors. Furthermore, experiments with randomly generated, non-musical sequences reveal clear limitations in system performance under extreme musical distribution shifts. Altogether, these findings offer new evidence of the persistent impact of the Corpus Bias problem in deep AMT systems.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Youngkyu Lee, Francesc Levrero Florencio, Jay Pathak, George Em Karniadakis
ๆฅ็ๆ่ฆ
The convergence behavior of classical iterative solvers for parametric partial differential equations (PDEs) is often highly sensitive to the domain and specific discretization of PDEs. Previously, we introduced hybrid solvers by combining the classical solvers with neural operators for a specific geometry 1, but they tend to under-perform in geometries not encountered during training. To address this challenge, we introduce Geo-DeepONet, a geometry-aware deep operator network that incorporates domain information extracted from finite element discretizations. Geo-DeepONet enables accurate operator learning across arbitrary unstructured meshes without requiring retraining. Building on this, we develop a class of geometry-aware hybrid preconditioned iterative solvers by coupling Geo-DeepONet with traditional methods such as relaxation schemes and Krylov subspace algorithms. Through numerical experiments on parametric PDEs posed over diverse unstructured domains, we demonstrate the enhanced robustness and efficiency of the proposed hybrid solvers for multiple real-world applications.
๐
2025-12-16
๐ Accepted in IEEE Bigdata 2025- Learning Representa
๐ฅ Tejaswani Dash, Gautam Datla, Anudeep Vurity, Tazeem Ahmad, Mohd Adnan et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Cardiovascular disease (CVD) remains the leading cause of mortality worldwide, underscoring the need for reliable and efficient predictive tools that support early intervention. Traditional diagnostic approaches rely on handcrafted features and clinician expertise, while machine learning methods improve reproducibility but often struggle to generalize across noisy and heterogeneous clinical data. In this work, we propose Residual GRU with Multi-Head Self-Attention, a compact deep learning architecture designed for tabular clinical records. The model integrates residual bidirectional gated recurrent units for sequential modeling of feature columns, a channel reweighting block, and multi-head self-attention pooling with a learnable classification token to capture global context. We evaluate the model on the UCI Heart Disease dataset using 5-fold stratified cross-validation and compare it against classical methods such as Logistic Regression, Random Forest, and Support Vector Machines, as well as modern deep learning baselines including DeepMLP, convolutional networks, recurrent networks, and Transformers. The proposed model achieves an accuracy of 0.861, macro-F1 of 0.860, ROC-AUC of 0.908, and PR-AUC of 0.904, outperforming all baselines. Ablation studies confirm the individual contributions of residual recurrence, channel gating, and attention pooling. t-SNE visualizations further indicate that the learned embeddings exhibit clearer separation between disease and non-disease classes compared to raw features. These results demonstrate that lightweight hybrid recurrent and attention-based architectures provide a strong balance between accuracy and efficiency for clinical risk prediction, supporting deployment in resource-constrained healthcare settings.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Emmanuel C. Chukwu, Rianne M. Schouten, Monique Tabak, Mykola Pechenizkiy
ๆฅ็ๆ่ฆ
Counterfactual explanations are increasingly proposed as interpretable mechanisms to achieve algorithmic recourse. However, current counterfactual techniques for time series classification are predominantly designed with static data assumptions and focus on generating minimal input perturbations to flip model predictions. This paper argues that such approaches are fundamentally insufficient in clinical recommendation settings, where interventions unfold over time and must be causally plausible and temporally coherent. We advocate for a shift towards counterfactuals that reflect sustained, goal-directed interventions aligned with clinical reasoning and patient-specific dynamics. We identify critical gaps in existing methods that limit their practical applicability, specifically, temporal blind spots and the lack of user-centered considerations in both method design and evaluation metrics. To support our position, we conduct a robustness analysis of several state-of-the-art methods for time series and show that the generated counterfactuals are highly sensitive to stochastic noise. This finding highlights their limited reliability in real-world clinical settings, where minor measurement variations are inevitable. We conclude by calling for methods and evaluation frameworks that go beyond mere prediction changes without considering feasibility or actionability. We emphasize the need for actionable, purpose-driven interventions that are feasible in real-world contexts for the users of such applications.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Miriam Gutiรฉrrez Fernรกndez, Karen Lรณpez-Linares, Carlos Fambuena Santos, Marรญa S. Guillem, Andreu M. Climent et al.
Machine Learning
ๆฅ็ๆ่ฆ
Atrial fibrillation (AF) is the most prevalent sustained cardiac arrhythmia, and its clinical assessment requires accurate characterization of atrial electrical activity. Noninvasive electrocardiographic imaging (ECGI) combined with deep learning (DL) approaches for estimating intracardiac electrograms (EGMs) from body surface potentials (BSPMs) has shown promise, but progress is hindered by the limited availability of paired BSPM-EGM datasets. To address this limitation, we investigate variational autoencoders (VAEs) for the generation of synthetic multichannel atrial EGMs. Two models are proposed: a sinus rhythm-specific VAE (VAE-S) and a class-conditioned VAE (VAE-C) trained on both sinus rhythm and AF signals. Generated EGMs are evaluated using morphological, spectral, and distributional similarity metrics. VAE-S achieves higher fidelity with respect to in silico EGMs, while VAE-C enables rhythm-specific generation at the expense of reduced sinus reconstruction quality. As a proof of concept, the generated EGMs are used for data augmentation in a downstream noninvasive EGM reconstruction task, where moderate augmentation improves estimation performance. These results demonstrate the potential of VAE-based generative modeling to alleviate data scarcity and enhance deep learning-based ECGI pipelines.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jacob Taegon Kim, Alex Sim, Kesheng Wu, Jinoh Kim
Machine Learning
ๆฅ็ๆ่ฆ
Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially sluggish transfers can be identified and selectively monitored, optimizing network usage and overall performance. A key bottleneck to improving the predictive power of machine learning (ML) models in this context is the issue of class imbalance. This project focuses on addressing the class imbalance problem to enhance the accuracy of performance predictions. In this study, we analyze and compare various augmentation strategies, including traditional oversampling methods and generative techniques. Additionally, we adjust the class imbalance ratios in training datasets to evaluate their impact on model performance. While augmentation may improve performance, as the imbalance ratio increases, the performance does not significantly improve. We conclude that even the most advanced technique, such as CTGAN, does not significantly improve over simple stratified sampling.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Teodor Poncu, Ioana Pintilie, Marius Dragoi, Dragos Tantaru, Florin Brad
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) typically excel at coding tasks involving high-level programming languages, as opposed to lower-level programming languages, such as assembly. We propose a synthetic data generation method named C-ing Clearly, which leverages the corresponding C code to enhance an LLM's understanding of assembly. By fine-tuning on data generated through our method, we demonstrate improved LLM performance for binary code summarization and vulnerability detection. Our approach demonstrates consistent gains across different LLM families and model sizes.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Additi Pandey, Liang Wei, Hessam Babaee, George Em Karniadakis
ๆฅ็ๆ่ฆ
Accurate chemical kinetics modeling is essential for combustion simulations, as it governs the evolution of complex reaction pathways and thermochemical states. In this work, we introduce Kinetic-Mamba, a Mamba-based neural operator framework that integrates the expressive power of neural operators with the efficient temporal modeling capabilities of Mamba architectures. The framework comprises three complementary models: (i) a standalone Mamba model that predicts the time evolution of thermochemical state variables from given initial conditions; (ii) a constrained Mamba model that enforces mass conservation while learning the state dynamics; and (iii) a regime-informed architecture employing two standalone Mamba models to capture dynamics across temperature-dependent regimes. We additionally develop a latent Kinetic-Mamba variant that evolves dynamics in a reduced latent space and reconstructs the full state on the physical manifold. We evaluate the accuracy and robustness of Kinetic-Mamba using both time-decomposition and recursive-prediction strategies. We further assess the extrapolation capabilities of the model on varied out-of-distribution datasets. Computational experiments on Syngas and GRI-Mech 3.0 reaction mechanisms demonstrate that our framework achieves high fidelity in predicting complex kinetic behavior using only the initial conditions of the state variables.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Niklas Grieger, Jannik Raskob, Siamak Mehrkanoon, Stephan Bialonski
Machine Learning
ๆฅ็ๆ่ฆ
Sleep is essential for good health throughout our lives, yet studying its dynamics requires manual sleep staging, a labor-intensive step in sleep research and clinical care. Across centers, polysomnography (PSG) recordings are traditionally scored in 30-s epochs for pragmatic, not physiological, reasons and can vary considerably in electrode count, montage, and subject characteristics. These constraints present challenges in conducting harmonized multi-center sleep studies and discovering novel, robust biomarkers on shorter timescales. Here, we present AnySleep, a deep neural network model that uses any electroencephalography (EEG) or electrooculography (EOG) data to score sleep at adjustable temporal resolutions. We trained and validated the model on over 19,000 overnight recordings from 21 datasets collected across multiple clinics, spanning nearly 200,000 hours of EEG and EOG data, to promote robust generalization across sites. The model attains state-of-the-art performance and surpasses or equals established baselines at 30-s epochs. Performance improves as more channels are provided, yet remains strong when EOG is absent or when only EOG or single EEG derivations (frontal, central, or occipital) are available. On sub-30-s timescales, the model captures short wake intrusions consistent with arousals and improves prediction of physiological characteristics (age, sex) and pathophysiological conditions (sleep apnea), relative to standard 30-s scoring. We make the model publicly available to facilitate large-scale studies with heterogeneous electrode setups and to accelerate the discovery of novel biomarkers in sleep.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Akira Takeshima, Kenta Shiraishi, Atsushi Okazaki, Tadashi Tsuyuki, Shunji Kotsuki
Machine Learning
ๆฅ็ๆ่ฆ
While machine learning-based weather prediction (MLWP) has achieved significant advancements, research on assimilating real observations or ensemble forecasts within MLWP models remains limited. We introduce ClimaX-LETKF, the first purely data-driven ML-based ensemble weather forecasting system. It operates stably over multiple years, independently of numerical weather prediction (NWP) models, by assimilating the NCEP ADP Global Upper Air and Surface Weather Observations. The system demonstrates greater stability and accuracy with relaxation to prior perturbation (RTPP) than with relaxation to prior spread (RTPS), while NWP models tend to be more stable with RTPS. RTPP replaces an analysis perturbation with a weighted blend of analysis and background perturbations, whereas RTPS simply rescales the analysis perturbation. Our experiments reveal that MLWP models are less capable of restoring the atmospheric field to its attractor than NWP models. This work provides valuable insights for enhancing MLWP ensemble forecasting systems and represents a substantial step toward their practical applications.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Gabriele Prato, Shagun Sodhani, Alessandro Sordoni, Sarath Chandar
Natural Language Processing
ๆฅ็ๆ่ฆ
The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models' capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.
๐
2025-12-16
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Waqas Ahmed
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The cybersecurity of Industrial Control Systems that manage critical infrastructure such as Water Distribution Systems has become increasingly important as digital connectivity expands. BATADAL benchmark data is a good source of testing intrusion detection techniques, but it presents several important problems, such as imbalance in the number of classes, multivariate time dependence, and stealthy attacks. We consider a hybrid ensemble learning model that will enhance the detection ability of cyber-attacks in WDS by using the complementary capabilities of machine learning and deep learning models. Three base learners, namely, Random Forest , eXtreme Gradient Boosting , and Long Short-Term Memory network, have been strictly compared and seven ensemble types using simple averaged and stacked learning with a logistic regression meta-learner. Random Forest analysis identified top predictors turned into temporal and statistical features, and Synthetic Minority Oversampling Technique (SMOTE) was used to overcome the class imbalance issue. The analyics indicates that the single Long Short-Term Memory network model is of poor performance (F1 = 0.000, AUC = 0.4460), but tree-based models, especially eXtreme Gradient Boosting, perform well (F1 = 0.7470, AUC=0.9684). The hybrid stacked ensemble of Random Forest , eXtreme Gradient Boosting , and Long Short-Term Memory network scored the highest, with the attack class of 0.7205 with an F1-score and a AUC of 0.9826 indicating that the heterogeneous stacking between model precision and generalization can work. The proposed framework establishes a robust and scalable solution for cyber-attack detection in time-dependent industrial systems, integrating temporal learning and ensemble diversity to support the secure operation of critical infrastructure.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dejun Hu, Zhiming Li, Jia-Rui Shen, Jia-Ning Tu, Zi-Hao Ye et al.
Machine Learning
ๆฅ็ๆ่ฆ
Machine learning is profoundly reshaping molecular and materials modeling; however, given the vast scale of chemical space (10^30-10^60), it remains an open scientific question whether models can achieve convergent learning across this space. We introduce a Dual-Axis Representation-Complete Convergent Learning (RCCL) strategy, enabled by a molecular representation that integrates graph convolutional network (GCN) encoding of local valence environments, grounded in modern valence bond theory, together with no-bridge graph (NBG) encoding of ring/cage topologies, providing a quantitative measure of chemical-space coverage. This framework formalizes representation completeness, establishing a principled basis for constructing datasets that support convergent learning for large models. Guided by this RCCL framework, we develop the FD25 dataset, systematically covering 13,302 local valence units and 165,726 ring/cage topologies, achieving near-complete combinatorial coverage of organic molecules with H/C/N/O/F elements. Graph neural networks trained on FD25 exhibit representation-complete convergent learning and strong out-of-distribution generalization, with an overall prediction error of approximately 1.0 kcal/mol MAE across external benchmarks. Our results establish a quantitative link between molecular representation, structural completeness, and model generalization, providing a foundation for interpretable, transferable, and data-efficient molecular intelligence.
๐
2025-12-16
๐ ArXiv Preprint (econ.GN)
๐ฅ Muhammad Sukri Bin Ramli
Machine Learning
ๆฅ็ๆ่ฆ
As the global economy transitions toward decarbonization, the aluminium sector has become a focal point for strategic resource management. While policies such as the Carbon Border Adjustment Mechanism (CBAM) aim to reduce emissions, they have inadvertently widened the price arbitrage between primary metal, scrap, and semi-finished goods, creating new incentives for market optimization. This study presents a unified, unsupervised machine learning framework to detect and classify emerging trade anomalies within UN Comtrade data (2020 to 2024). Moving beyond traditional rule-based monitoring, we apply a four-layer analytical pipeline utilizing Forensic Statistics, Isolation Forests, Network Science, and Deep Autoencoders. Contrary to the hypothesis that Sustainability Arbitrage would be the primary driver, empirical results reveal a contradictory and more severe phenomenon of Hardware Masking. Illicit actors exploit bi-directional tariff incentives by misclassifying scrap as high-count heterogeneous goods to justify extreme unit-price outliers of >$160/kg, a 1,900% markup indicative of Trade-Based Money Laundering (TBML) rather than commercial arbitrage. Topologically, risk is not concentrated in major exporters but in high-centrality Shadow Hubs that function as pivotal nodes for illicit rerouting. These actors execute a strategy of Void-Shoring, systematically suppressing destination data to Unspecified Code to fracture mirror statistics and sever forensic trails. Validated by SHAP (Shapley Additive Explanations), the results confirm that price deviation is the dominant predictor of anomalies, necessitating a paradigm shift in customs enforcement from physical volume checks to dynamic, algorithmic valuation auditing.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Hangjun Cho, Fabio V. G. Amaral, Andrei A. Klishin, Cassio M. Oishi, Steven L. Brunton
ๆฅ็ๆ่ฆ
In this work, we revisit dictionary-based sparse regression, in particular, Sequential Threshold Least Squares (STLS), and propose a score-guided library selection to provide practical guidance for data-driven modeling, with emphasis on SINDy-type algorithms. STLS is an algorithm to solve the $\ell_0$ sparse least-squares problem, which relies on splitting to efficiently solve the least-squares portion while handling the sparse term via proximal methods. It produces coefficient vectors whose components depend on both the projected reconstruction errors, here referred to as the scores, and the mutual coherence of dictionary terms. The first contribution of this work is a theoretical analysis of the score and dictionary-selection strategy. This could be understood in both the original and weak SINDy regime. Second, numerical experiments on ordinary and partial differential equations highlight the effectiveness of score-based screening, improving both accuracy and interpretability in dynamical system identification. These results suggest that integrating score-guided methods to refine the dictionary more accurately may help SINDy users in some cases to enhance their robustness for data-driven discovery of governing equations.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fangzhou Lin, Guoshun He, Zhenyu Guo, Zhe Huang, Jinsong Tao
ๆฅ็ๆ่ฆ
Electric load is simultaneously affected across multiple time scales by exogenous factors such as weather and calendar rhythms, sudden events, and policies. Therefore, this paper proposes GRAFT (GRid-Aware Forecasting with Text), which modifies and improves STanHOP to better support grid-aware forecasting and multi-source textual interventions. Specifically, GRAFT strictly aligns daily-aggregated news, social media, and policy texts with half-hour load, and realizes text-guided fusion to specific time positions via cross-attention during both training and rolling forecasting. In addition, GRAFT provides a plug-and-play external-memory interface to accommodate different information sources in real-world deployment. We construct and release a unified aligned benchmark covering 2019--2021 for five Australian states (half-hour load, daily-aligned weather/calendar variables, and three categories of external texts), and conduct systematic, reproducible evaluations at three scales -- hourly, daily, and monthly -- under a unified protocol for comparison across regions, external sources, and time scales. Experimental results show that GRAFT significantly outperforms strong baselines and reaches or surpasses the state of the art across multiple regions and forecasting horizons. Moreover, the model is robust in event-driven scenarios and enables temporal localization and source-level interpretation of text-to-load effects through attention read-out. We release the benchmark, preprocessing scripts, and forecasting results to facilitate standardized empirical evaluation and reproducibility in power grid load forecasting.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunjia Yang, Weishao Tang, Mengxin Liu, Nils Thuerey, Yufei Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Machine-learning surrogate models have shown promise in accelerating aerodynamic design, yet progress toward generalizable predictors for three-dimensional wings has been limited by the scarcity and restricted diversity of existing datasets. Here, we present SuperWing, a comprehensive open dataset of transonic swept-wing aerodynamics comprising 4,239 parameterized wing geometries and 28,856 Reynolds-averaged Navier-Stokes flow field solutions. The wing shapes in the dataset are generated using a simplified yet expressive geometry parameterization that incorporates spanwise variations in airfoil shape, twist, and dihedral, allowing for an enhanced diversity without relying on perturbations of a baseline wing. All shapes are simulated under a broad range of Mach numbers and angles of attack covering the typical flight envelope. To demonstrate the dataset's utility, we benchmark two state-of-the-art Transformers that accurately predict surface flow and achieve a 2.5 drag-count error on held-out samples. Models pretrained on SuperWing further exhibit strong zero-shot generalization to complex benchmark wings such as DLR-F6 and NASA CRM, underscoring the dataset's diversity and potential for practical usage.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huayang Li, Tianyu Zhao, Richard Sproat
Natural Language Processing
ๆฅ็ๆ่ฆ
In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. Drawing on Cognitive Load Theory (CLT), we argue that this uninformative structure increases extraneous cognitive load, consuming finite working memory capacity that should be allocated to deep reasoning and attention allocation. To address this, we propose RePo, a novel mechanism that reduces extraneous load via context re-positioning. Unlike standard approaches, RePo utilizes a differentiable module, $f_ฯ$, to assign token positions that capture contextual dependencies, rather than replying on pre-defined integer range. By continually pre-training on the OLMo-2 1B backbone, we demonstrate that RePo significantly enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Detailed analysis reveals that RePo successfully allocate higher attention to distant but relevant information, assign positions in dense and non-linear space, and capture the intrinsic structure of the input context. Our code is available at https://github.com/SakanaAI/repo.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Baobao Song, Shiva Raj Pokhrel, Athanasios V. Vasilakos, Tianqing Zhu, Gang Li
Machine Learning
ๆฅ็ๆ่ฆ
Quantum machine learning (QML) promises significant computational advantages, yet models trained on sensitive data risk memorizing individual records, creating serious privacy vulnerabilities. While Quantum Differential Privacy (QDP) mechanisms provide theoretical worst-case guarantees, they critically lack empirical verification tools for deployed models. We introduce the first black-box privacy auditing framework for QML based on Lifted Quantum Differential Privacy, leveraging quantum canaries (strategically offset-encoded quantum states) to detect memorization and precisely quantify privacy leakage during training. Our framework establishes a rigorous mathematical connection between canary offset and trace distance bounds, deriving empirical lower bounds on privacy budget consumption that bridge the critical gap between theoretical guarantees and practical privacy verification. Comprehensive evaluations across both simulated and physical quantum hardware demonstrate our framework's effectiveness in measuring actual privacy loss in QML models, enabling robust privacy verification in QML systems.
๐
2025-12-16
๐ AAAI 2026
๐ฅ Nicholas Tagliapietra, Katharina Ensinger, Christoph Zimmer, Osman Mian
ๆฅ็ๆ่ฆ
Real world systems evolve in continuous-time according to their underlying causal relationships, yet their dynamics are often unknown. Existing approaches to learning such dynamics typically either discretize time -- leading to poor performance on irregularly sampled data -- or ignore the underlying causality. We propose CaDyT, a novel method for causal discovery on dynamical systems addressing both these challenges. In contrast to state-of-the-art causal discovery methods that model the problem using discrete-time Dynamic Bayesian networks, our formulation is grounded in Difference-based causal models, which allow milder assumptions for modeling the continuous nature of the system. CaDyT leverages exact Gaussian Process inference for modeling the continuous-time dynamics which is more aligned with the underlying dynamical process. We propose a practical instantiation that identifies the causal structure via a greedy search guided by the Algorithmic Markov Condition and Minimum Description Length principle. Our experiments show that CaDyT outperforms state-of-the-art methods on both regularly and irregularly-sampled data, discovering causal networks closer to the true underlying dynamics.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Michael Murray, Tenzin Chan, Kedar Karhadker, Christopher J. Hillar
ๆฅ็ๆ่ฆ
Many learning problems involve symmetries, and while invariance can be built into neural architectures, it can also emerge implicitly when training on group-structured data. We study this phenomenon in classical Hopfield networks and show they can infer the full isomorphism class of a graph from a small random sample. Our results reveal that: (i) graph isomorphism classes can be represented within a three-dimensional invariant subspace, (ii) using gradient descent to minimize energy flow (MEF) has an implicit bias toward norm-efficient solutions, which underpins a polynomial sample complexity bound for learning isomorphism classes, and (iii) across multiple learning rules, parameters converge toward the invariant subspace as sample sizes grow. Together, these findings highlight a unifying mechanism for generalization in Hopfield networks: a bias toward norm efficiency in learning drives the emergence of approximate invariance under group-structured data.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
ๆฅ็ๆ่ฆ
Recent progress in text-to-image diffusion models has transformed image editing via text prompts, yet this also introduces significant ethical challenges from potential misuse in creating deceptive or harmful content. While current defenses seek to mitigate this risk by embedding imperceptible perturbations, their effectiveness is limited against malicious tampering. To address this issue, we propose a Dual Attention-Guided Noise Perturbation (DANP) immunization method that adds imperceptible perturbations to disrupt the model's semantic understanding and generation process. DANP functions over multiple timesteps to manipulate both cross-attention maps and the noise prediction process, using a dynamic threshold to generate masks that identify text-relevant and irrelevant regions. It then reduces attention in relevant areas while increasing it in irrelevant ones, thereby misguides the edit towards incorrect regions and preserves the intended targets. Additionally, our method maximizes the discrepancy between the injected noise and the model's predicted noise to further interfere with the generation. By targeting both attention and noise prediction mechanisms, DANP exhibits impressive immunity against malicious edits, and extensive experiments confirm that our method achieves state-of-the-art performance.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuai Dong, Jie Zhang, Guoying Zhao, Shiguang Shan, Xilin Chen
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Text-guided image editing via diffusion models, while powerful, raises significant concerns about misuse, motivating efforts to immunize images against unauthorized edits using imperceptible perturbations. Prevailing metrics for evaluating immunization success typically rely on measuring the visual dissimilarity between the output generated from a protected image and a reference output generated from the unprotected original. This approach fundamentally overlooks the core requirement of image immunization, which is to disrupt semantic alignment with attacker intent, regardless of deviation from any specific output. We argue that immunization success should instead be defined by the edited output either semantically mismatching the prompt or suffering substantial perceptual degradations, both of which thwart malicious intent. To operationalize this principle, we propose Synergistic Intermediate Feature Manipulation (SIFM), a method that strategically perturbs intermediate diffusion features through dual synergistic objectives: (1) maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and (2) minimizing feature norms to induce perceptual degradations. Furthermore, we introduce the Immunization Success Rate (ISR), a novel metric designed to rigorously quantify true immunization efficacy for the first time. ISR quantifies the proportion of edits where immunization induces either semantic failure relative to the prompt or significant perceptual degradations, assessed via Multimodal Large Language Models (MLLMs). Extensive experiments show our SIFM achieves the state-of-the-art performance for safeguarding visual content against malicious diffusion-based manipulation.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Pablo Garcรญa-Santaclara, Bruno Fernรกndez-Castro, Rebeca P. Dรญaz-Redondo, Carlos Calvo-Moa, Henar Mariรฑo-Bodelรณn
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The exponential growth of Internet-connected devices has presented challenges to traditional centralized computing systems due to latency and bandwidth limitations. Edge computing has evolved to address these difficulties by bringing computations closer to the data source. Additionally, traditional machine learning algorithms are not suitable for edge-computing systems, where data usually arrives in a dynamic and continual way. However, incremental learning offers a good solution for these settings. We introduce a new approach that applies the incremental learning philosophy within an edge-computing scenario for the industrial sector with a specific purpose: real time quality control in a manufacturing system. Applying continual learning we reduce the impact of catastrophic forgetting and provide an efficient and effective solution.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ ล imon Kucharskรฝ, Aayush Mishra, Daniel Habermann, Stefan T. Radev, Paul-Christian Bรผrkner
ๆฅ็ๆ่ฆ
Amortized Bayesian inference (ABI) offers fast, scalable approximations to posterior densities by training neural surrogates on data simulated from the statistical model. However, ABI methods are highly sensitive to model misspecification: when observed data fall outside the training distribution (generative scope of the statistical models), neural surrogates can behave unpredictably. This makes it a challenge in a model comparison setting, where multiple statistical models are considered, of which at least some are misspecified. Recent work on self-consistency (SC) provides a promising remedy to this issue, accessible even for empirical data (without ground-truth labels). In this work, we investigate how SC can improve amortized model comparison conceptualized in four different ways. Across two synthetic and two real-world case studies, we find that approaches for model comparison that estimate marginal likelihoods through approximate parameter posteriors consistently outperform methods that directly approximate model evidence or posterior model probabilities. SC training improves robustness when the likelihood is available, even under severe model misspecification. The benefits of SC for methods without access of analytic likelihoods are more limited and inconsistent. Our results suggest practical guidance for reliable amortized Bayesian model comparison: prefer parameter posterior-based methods and augment them with SC training on empirical datasets to mitigate extrapolation bias under model misspecification.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu Chen, Hongwei Lin
Machine Learning
ๆฅ็ๆ่ฆ
Persistence diagrams (PDs) provide a powerful tool for understanding the topology of the underlying shape of a point cloud. However, identifying which points in PDs encode genuine signals remains challenging. This challenge directly hinders the practical adoption of topological data analysis in many applications, where automated and reliable interpretation of persistence diagrams is essential for downstream decision-making. In this paper, we study automatic significance detection for one-dimensional persistence diagrams. Specifically, we propose Topology Understanding Net (TUN), a multi-modal network that combines enhanced PD descriptors with self-attention, a PointNet-style point cloud encoder, learned fusion, and per-point classification, alongside stable preprocessing and imbalance-aware training. It provides an automated and effective solution for identifying significant points in PDs, which are critical for downstream applications. Experiments show that TUN outperforms classic methods in detecting significant points in PDs, illustrating its effectiveness in real-world applications.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nick Leenders, Thomas Quadt, Boris Cule, Roy Lindelauf, Herman Monsuur et al.
Machine Learning
ๆฅ็ๆ่ฆ
Current Preferential Bayesian Optimization methods rely on Gaussian Processes (GPs) as surrogate models. These models are hard to interpret, struggle with handling categorical data, and are computationally complex, limiting their real-world usability. In this paper, we introduce an inherently interpretable decision tree-based surrogate model capable of handling both categorical and continuous data, and scalable to large datasets. Extensive numerical experiments on eight increasingly spiky optimization functions show that our model outperforms GP-based alternatives on spiky functions and has only marginally lower performance for non-spiky functions. Moreover, we apply our model to the real-world Sushi dataset and show its ability to learn an individual's sushi preferences. Finally, we show some initial work on using historical preference data to speed up the optimization process for new unseen users.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xingjian Wu, Hanyin Cheng, Xiangfei Qiu, Zhengyu Li, Jilin Hu et al.
ๆฅ็ๆ่ฆ
In this work, we introduce FLAME, a family of extremely lightweight and capable Time Series Foundation Models, which support both deterministic and probabilistic forecasting via generative probabilistic modeling, thus ensuring both efficiency and robustness. FLAME utilizes the Legendre Memory for strong generalization capabilities. Through adapting variants of Legendre Memory, i.e., translated Legendre (LegT) and scaled Legendre (LegS), in the Encoding and Decoding phases, FLAME can effectively capture the inherent inductive bias within data and make efficient long-range inferences. To enhance the accuracy of probabilistic forecasting while keeping efficient, FLAME adopts a Normalization Flow based forecasting head, which can model the arbitrarily intricate distributions over the forecasting horizon in a generative manner. Comprehensive experiments on well-recognized benchmarks, including TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art zero-shot performance of FLAME on both deterministic and probabilistic forecasting tasks.
๐
2025-12-16
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kelly J. Davis
Natural Language Processing
ๆฅ็ๆ่ฆ
Formal, automated theorem proving has long been viewed as a challenge to artificial intelligence. We introduce here a new approach to computer theorem proving, one that employs specialized language models for Lean4 proof generation combined with recursive decomposition of difficult theorems into simpler entailing propositions. These models are coordinated through a multi-agent architecture that orchestrates autoformalization (if required), proof generation, decomposition of difficult theorems into simpler entailing propositions, and recursive proof (and/or decomposition) of these propositions. Without decomposition, we achieve a 90.4% pass rate on miniF2F. With decomposition, this is significantly improved. A key technical contribution lies in our extension of the Kimina Lean Server with abstract syntax tree (AST) parsing capabilities to facilitate automated, recursive proof decomposition. The system is made available on PyPI as goedels-poetry (at https://pypi.org/project/goedels-poetry ), and the open-source implementation KellyJDavis/goedels-poetry (at https://github.com/KellyJDavis/goedels-poetry ) facilitates both adaptation to alternative language models and extension with custom functionality.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Salvatore Romano, Marco Grassia, Giuseppe Mangioni
Machine Learning
ๆฅ็ๆ่ฆ
Graph generation is a crucial task in many fields, including network science and bioinformatics, as it enables the creation of synthetic graphs that mimic the properties of real-world networks for various applications. Graph Generative Models (GGMs) have emerged as a promising solution to this problem, leveraging deep learning techniques to learn the underlying distribution of real-world graphs and generate new samples that closely resemble them. Examples include approaches based on Variational Auto-Encoders, Recurrent Neural Networks, and more recently, diffusion-based models. However, the main limitation often lies in the evaluation process, which typically relies on Maximum Mean Discrepancy (MMD) as a metric to assess the distribution of graph properties in the generated ensemble. This paper introduces a novel methodology for evaluating GGMs that overcomes the limitations of MMD, which we call RGM (Representation-aware Graph-generation Model evaluation). As a practical demonstration of our methodology, we present a comprehensive evaluation of two state-of-the-art Graph Generative Models: Graph Recurrent Attention Networks (GRAN) and Efficient and Degree-guided graph GEnerative model (EDGE). We investigate their performance in generating realistic graphs and compare them using a Geometric Deep Learning model trained on a custom dataset of synthetic and real-world graphs, specifically designed for graph classification tasks. Our findings reveal that while both models can generate graphs with certain topological properties, they exhibit significant limitations in preserving the structural characteristics that distinguish different graph domains. We also highlight the inadequacy of Maximum Mean Discrepancy as an evaluation metric for GGMs and suggest alternative approaches for future research.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Erion Morina, Martin Holler
ๆฅ็ๆ่ฆ
This paper addresses the problem of learning reaction-diffusion (RD) systems from data while ensuring physical consistency and well-posedness of the learned models. Building on a regularization-based framework for structured model learning, we focus on learning parameterized reaction terms and investigate how to incorporate key physical properties, such as mass conservation and quasipositivity, directly into the learning process. Our main contributions are twofold: First, we propose techniques to systematically modify a given class of parameterized reaction terms such that the resulting terms inherently satisfy mass conservation and quasipositivity, ensuring that the learned RD systems preserve non-negativity and adhere to physical principles. These modifications also guarantee well-posedness of the resulting PDEs under additional regularity and growth conditions. Second, we extend existing theoretical results on regularization-based model learning to RD systems using these physically consistent reaction terms. Specifically, we prove that solutions to the learning problem converge to a unique, regularization-minimizing solution of a limit system even when conservation laws and quasipositivity are enforced. In addition, we provide approximation results for quasipositive functions, essential for constructing physically consistent parameterizations. These results advance the development of interpretable and reliable data-driven models for RD systems that align with fundamental physical laws.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Estelle Zheng, Nathan Cerisara, Sรฉbastien Warichet, Emmanuel Helbert, Christophe Cerisara
Natural Language Processing
ๆฅ็ๆ่ฆ
Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA's compute scaling slope while cutting peak memory by 50\%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST has competitive performance with QLoRA's accuracy on average while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing\textemdash conditions under which QLoRA exhausts memory. Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder's architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Divyansh Pareek, Sewoong Oh, Simon S. Du
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $ฮท\in(0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded by $\frac{1}{ฮท\sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{ฮทn}}$ in the large $ฮท$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $ฮท$ regime.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Jiarong Fan, Juhyun Park. Thi Phuong Thuy Vo, Nicolas Brunel
ๆฅ็ๆ่ฆ
Conformal prediction (CP) offers a principled framework for uncertainty quantification, but it fails to guarantee coverage when faced with missing covariates. In addressing the heterogeneity induced by various missing patterns, Mask-Conditional Valid (MCV) Coverage has emerged as a more desirable property than Marginal Coverage. In this work, we adapt split CP to handle missing values by proposing a preimpute-mask-then-correct framework that can offer valid coverage. We show that our method provides guaranteed Marginal Coverage and Mask-Conditional Validity for general missing data mechanisms. A key component of our approach is a reweighted conformal prediction procedure that corrects the prediction sets after distributional imputation (multiple imputation) of the calibration dataset, making our method compatible with standard imputation pipelines. We derive two algorithms, and we show that they are approximately marginally valid and MCV. We evaluate them on synthetic and real-world datasets. It reduces significantly the width of prediction intervals w.r.t standard MCV methods, while maintaining the target guarantees.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation pipeline is a method for estimating problem difficulty. Current approaches, such as human calibration or performance-based scoring, fail to generalize to out-of-distribution problems, i.e. problems currently unsolvable by humans and LLMs, because they are not scalable, time-consuming, and ground truth dependent. Therefore, we propose a new method for estimating problem difficulty, LLM compare, that addresses these limitations. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed based on the outcomes. To validate our method, we first propose a conceptual framework that positions existing approaches on three orthogonal planes--construction, scale and dependence--identifying which quadrants a measure needs to occupy to score out-of-distribution problems. LLM compare naturally occupies all desirable quadrants as the first measure that is continuous and dynamic, model-agnostic and independent of ground truth information. As a second validation, we show that LLM compare demonstrates strong alignment with human annotations: Pearson $r \geq 0.80$ for $n=1876$. Thirdly, we show that LLM compare is robust to hallucinations, with less than $6\%$ degradation in Pearson correlation for $10\%$ noise injection. Our work represents a significant step towards replacing time-consuming human annotations and synthetic data generation, and will be an important driver for curriculum design, model evaluation, and AI-assisted research ideation.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Timo Klein, Thomas Lang, Andrii Shkabrii, Alexander Sturm, Kevin Sidak et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
The performance of reinforcement learning (RL) agents depends critically on the quality of the underlying feature representations. Hyperbolic feature spaces are well-suited for this purpose, as they naturally capture hierarchical and relational structure often present in complex RL environments. However, leveraging these spaces commonly faces optimization challenges due to the nonstationarity of RL. In this work, we identify key factors that determine the success and failure of training hyperbolic deep RL agents. By analyzing the gradients of core operations in the Poincarรฉ Ball and Hyperboloid models of hyperbolic geometry, we show that large-norm embeddings destabilize gradient-based training, leading to trust-region violations in proximal policy optimization (PPO). Based on these insights, we introduce Hyper++, a new hyperbolic PPO agent that consists of three components: (i) stable critic training through a categorical value loss instead of regression; (ii) feature regularization guaranteeing bounded norms while avoiding the curse of dimensionality from clipping; and (iii) using a more optimization-friendly formulation of hyperbolic network layers. In experiments on ProcGen, we show that Hyper++ guarantees stable learning, outperforms prior hyperbolic agents, and reduces wall-clock time by approximately 30%. On Atari-5 with Double DQN, Hyper++ strongly outperforms Euclidean and hyperbolic baselines. We release our code at https://github.com/Probabilistic-and-Interactive-ML/hyper-rl .
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stefano Goria, Levent A. Mengรผtรผrk, Murat C. Mengรผtรผrk, Berkan Sesen
Machine Learning
ๆฅ็ๆ่ฆ
This paper motivates the use of random-bridges -- stochastic processes conditioned to take target distributions at fixed timepoints -- in the realm of generative modelling. Herein, random-bridges can act as stochastic transports between two probability distributions when appropriately initialized, and can display either Markovian or non-Markovian, and either continuous, discontinuous or hybrid patterns depending on the driving process. We show how one can start from general probabilistic statements and then branch out into specific representations for learning and simulation algorithms in terms of information processing. Our empirical results, built on Gaussian random bridges, produce high-quality samples in significantly fewer steps compared to traditional approaches, while achieving competitive Frechet inception distance scores. Our analysis provides evidence that the proposed framework is computationally cheap and suitable for high-speed generation tasks.
๐
2025-12-16
๐ IEEE Transactions on Dependable and Secure Computi
๐ฅ Wei Tao, Sheng Long, Xin Liu, Wei Li, Qing Tao
Machine Learning
ๆฅ็ๆ่ฆ
Generating adversarial examples (AEs) can be formulated as an optimization problem. Among various optimization-based attacks, the gradient-based PGD and the momentum-based MI-FGSM have garnered considerable interest. However, all these attacks use the sign function to scale their perturbations, which raises several theoretical concerns from the point of view of optimization. In this paper, we first reveal that PGD is actually a specific reformulation of the projected gradient method using only the current gradient to determine its step-size. Further, we show that when we utilize a conventional adaptive matrix with the accumulated gradients to scale the perturbation, PGD becomes AdaGrad. Motivated by this analysis, we present a novel momentum-based attack AdaMI, in which the perturbation is optimized with an interesting momentum-based adaptive matrix. AdaMI is proved to attain optimal convergence for convex problems, indicating that it addresses the non-convergence issue of MI-FGSM, thereby ensuring stability of the optimization process. The experiments demonstrate that the proposed momentum-based adaptive matrix can serve as a general and effective technique to boost adversarial transferability over the state-of-the-art methods across different networks while maintaining better stability and imperceptibility.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jonathan Spiegelman, Guy Amir, Guy Katz
ๆฅ็ๆ่ฆ
Deep Active Learning (DAL) aims to reduce labeling costs in neural-network training by prioritizing the most informative unlabeled samples for annotation. Beyond selecting which samples to label, several DAL approaches further enhance data efficiency by augmenting the training set with synthetic inputs that do not require additional manual labeling. In this work, we investigate how augmenting the training data with adversarial inputs that violate robustness constraints can improve DAL performance. We show that adversarial examples generated via formal verification contribute substantially more than those produced by standard, gradient-based attacks. We apply this extension to multiple modern DAL techniques, as well as to a new technique that we propose, and show that it yields significant improvements in model generalization across standard benchmarks.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhijie Zhong, Zhiwen Yu, Pengyu Li, Jianming Lv, C. L. Philip Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Radio path loss prediction (RPP) is critical for optimizing 5G networks and enabling IoT, smart city, and similar applications. However, current deep learning-based RPP methods lack proactive environmental modeling, struggle with realistic multi-transmitter scenarios, and generalize poorly under distribution shifts, particularly when training/testing environments differ in building density or transmitter configurations. This paper identifies three key issues: (1) passive environmental modeling that overlooks transmitters and key environmental features; (2) overemphasis on single-transmitter scenarios despite real-world multi-transmitter prevalence; (3) excessive focus on in-distribution performance while neglecting distribution shift challenges. To address these, we propose PathFinder, a novel architecture that actively models buildings and transmitters via disentangled feature encoding and integrates Mask-Guided Low-rank Attention to independently focus on receiver and building regions. We also introduce a Transmitter-Oriented Mixup strategy for robust training and a new benchmark, single-to-multi-transmitter RPP (S2MT-RPP), tailored to evaluate extrapolation performance (multi-transmitter testing after single-transmitter training). Experimental results show PathFinder outperforms state-of-the-art methods significantly, especially in challenging multi-transmitter scenarios. Our code and project site are available at: https://emorzz1g.github.io/PathFinder/.
๐
2025-12-16
๐ ArXiv Preprint (cs.SD)
๐ฅ Ramesh Gundluru, Shubham Gupta, Sri Rama Murty K
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations and (ii) audio-audio contrastive learning, via Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first comprehensive approach of its kind.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chunjin Jian, Xinhua Zhu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. However, the quality and stability of the trained reward model largely determine the final alignment performance. Existing approaches such as Proximal Policy Optimization (PPO) rely heavily on reward models to guide LLMs toward human-aligned behaviors.
In this work, we propose a logic-similarity-based reward mechanism as an alternative to conventional reward modeling. Instead of relying on heuristic reward estimation, our method leverages formal logical consistency to steer model alignment with human preferences. Since real-world questions can be interpreted from multiple perspectives, to ensure that logic-based reinforcement learning does not cause model collapse, we introduce S-GRPO, a supervised variant of the GRPO framework. S-GRPO incorporates an additional supervised component and jointly optimizes the generation term, KL-divergence regularization, and label-based objective during training.
Experimental results demonstrate that S-GRPO consistently outperforms standard supervised fine-tuning (SFT) in both performance and robustness. Furthermore, it extends existing preference-learning frameworks such as GRPO and DPO, offering a more flexible and task-adaptive approach to alignment training. Our code is available at https://github.com/ChunjinJiang/sgrpo.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jeff J. Ma, Jae-Won Chung, Jisang Ahn, Yizhuo Liang, Akshay Jajoo et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We present Cornserve, an efficient online serving system for an emerging class of multimodal models called Any-to-Any models. Any-to-Any models accept combinations of text and multimodal data (e.g., image, video, audio) as input and also generate combinations of text and multimodal data as output, introducing request type, computation path, and computation scaling heterogeneity in model serving.
Cornserve allows model developers to describe the computation graph of generic Any-to-Any models, which consists of heterogeneous components such as multimodal encoders, autoregressive models like Large Language Models (LLMs), and multimodal generators like Diffusion Transformers (DiTs). Given this, Cornserve's planner automatically finds an optimized deployment plan for the model, including whether and how to disaggregate the model into smaller components based on model and workload characteristics. Cornserve's distributed runtime then executes the model per the plan, efficiently handling Any-to-Any model heterogeneity during online serving. Evaluations show that Cornserve can efficiently serve diverse Any-to-Any models and workloads, delivering up to 3.81$\times$ throughput improvement and up to 5.79$\times$ tail latency reduction over existing solutions.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Taig Singh, Shreshth Rajan, Nikhil Iyer
Machine Learning
ๆฅ็ๆ่ฆ
As modern neural networks become increasingly memory-bound, inference throughput is limited by DRAM bandwidth rather than compute. We present Arithmetic-Intensity-Aware Quantization (AIQ), a mixed precision quantization framework that chooses per-layer bit-widths to maximize arithmetic intensity (AI) while minimizing accuracy loss. AIQ is a post-training quantization method that uses search algorithms over per-layer quantization schemes to minimize a weighted loss over AI and accuracy. On ResNet-20/CIFAR-10, AIQ increases AI by ~50% over an FP32 baseline while keeping test accuracy within ~1 percentage point, and outperforming global uniform quantization schemes. On a memory-bound MobileNetV2 architecture, AIQ configurations give a 1.66x higher throughput than the FP32 baseline while keeping test accuracy within 1 percentage point. We also find that AIQ naturally quantizes larger layers more aggressively.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Boyuan Yao, Dingcheng Luo, Lianghao Cao, Nikola Kovachki, Thomas O'Leary-Roseberry et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present approximation theories and efficient training methods for derivative-informed Fourier neural operators (DIFNOs) with applications to PDE-constrained optimization. A DIFNO is an FNO trained by minimizing its prediction error jointly on output and Frรฉchet derivative samples of a high-fidelity operator (e.g., a parametric PDE solution operator). As a result, a DIFNO can closely emulate not only the high-fidelity operator's response but also its sensitivities. To motivate the use of DIFNOs instead of conventional FNOs as surrogate models, we show that accurate surrogate-driven PDE-constrained optimization requires accurate surrogate Frรฉchet derivatives. Then, for continuously differentiable operators, we establish (i) simultaneous universal approximation of FNOs and their Frรฉchet derivatives on compact sets, and (ii) universal approximation of FNOs in weighted Sobolev spaces with input measures that have unbounded supports. Our theoretical results certify the capability of FNOs for accurate derivative-informed operator learning and accurate solution of PDE-constrained optimization. Furthermore, we develop efficient training schemes using dimension reduction and multi-resolution techniques that significantly reduce memory and computational costs for Frรฉchet derivative learning. Numerical examples on nonlinear diffusion--reaction, Helmholtz, and Navier--Stokes equations demonstrate that DIFNOs are superior in sample complexity for operator learning and solving infinite-dimensional PDE-constrained inverse problems, achieving high accuracy at low training sample sizes.
๐
2025-12-16
๐ ArXiv Preprint (eess.AS)
๐ฅ Sungnyun Kim
Multimodal
ๆฅ็ๆ่ฆ
The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, achieving the robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. To address architectural scalability, we explore how to efficiently expand model capacity while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that intelligently allocates computational resources based on the input characteristics. Finally, at the system level, we present methods to expand the system's functionality through modular integration with large-scale foundation models, leveraging their powerful cognitive and generative capabilities to maximize final recognition accuracy. By systematically providing solutions at each of these three levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system with high reliability in real-world applications.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day comparable to ScatterMoE's 225 billion tokens per day on 96 H100s for a 7B MoE model training with FSDP-2 using the lm-engine codebase. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup on kernel execution time compared to vanilla top-$K$ routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.
๐
2025-12-16
๐ ICDE 2026
๐ฅ Da Zhang, Bingyu Li, Zhiyuan Zhao, Feiping Nie, Junyu Gao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Time series analysis plays a vital role in fields such as finance, healthcare, industry, and meteorology, underpinning key tasks including classification, forecasting, and anomaly detection. Although deep learning models have achieved remarkable progress in these areas in recent years, constructing an efficient, multi-task compatible, and generalizable unified framework for time series analysis remains a significant challenge. Existing approaches are often tailored to single tasks or specific data types, making it difficult to simultaneously handle multi-task modeling and effectively integrate information across diverse time series types. Moreover, real-world data are often affected by noise, complex frequency components, and multi-scale dynamic patterns, which further complicate robust feature extraction and analysis. To ameliorate these challenges, we propose FusAD, a unified analysis framework designed for diverse time series tasks. FusAD features an adaptive time-frequency fusion mechanism, integrating both Fourier and Wavelet transforms to efficiently capture global-local and multi-scale dynamic features. With an adaptive denoising mechanism, FusAD automatically senses and filters various types of noise, highlighting crucial sequence variations and enabling robust feature extraction in complex environments. In addition, the framework integrates a general information fusion and decoding structure, combined with masked pre-training, to promote efficient learning and transfer of multi-granularity representations. Extensive experiments demonstrate that FusAD consistently outperforms state-of-the-art models on mainstream time series benchmarks for classification, forecasting, and anomaly detection tasks, while maintaining high efficiency and scalability. Code is available at https://github.com/zhangda1018/FusAD.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
๐
2025-12-16
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Omar Abusabha, Jiyong Uhm, Tamer Abuhmed, Hyungjoon Koo
Machine LearningRobotics
ๆฅ็ๆ่ฆ
A function inlining optimization is a widely used transformation in modern compilers, which replaces a call site with the callee's body in need. While this transformation improves performance, it significantly impacts static features such as machine instructions and control flow graphs, which are crucial to binary analysis. Yet, despite its broad impact, the security impact of function inlining remains underexplored to date. In this paper, we present the first comprehensive study of function inlining through the lens of machine learning-based binary analysis. To this end, we dissect the inlining decision pipeline within the LLVM's cost model and explore the combinations of the compiler options that aggressively promote the function inlining ratio beyond standard optimization levels, which we term extreme inlining. We focus on five ML-assisted binary analysis tasks for security, using 20 unique models to systematically evaluate their robustness under extreme inlining scenarios. Our extensive experiments reveal several significant findings: i) function inlining, though a benign transformation in intent, can (in)directly affect ML model behaviors, being potentially exploited by evading discriminative or generative ML models; ii) ML models relying on static features can be highly sensitive to inlining; iii) subtle compiler settings can be leveraged to deliberately craft evasive binary variants; and iv) inlining ratios vary substantially across applications and build configurations, undermining assumptions of consistency in training and evaluation of ML models.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Boran Wang, Xinming Wang, Yi Chen, Xiang Li, Jian Xu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
With their high information density and intuitive readability, charts have become the de facto medium for data analysis and communication across disciplines. Recent multimodal large language models (MLLMs) have made notable progress in automated chart understanding, yet they remain heavily dependent on explicit textual annotations and the performance degrades markedly when key numerals are absent. To address this limitation, we introduce ChartAgent, a chart understanding framework grounded in Tool-Integrated Reasoning (TIR). Inspired by human cognition, ChartAgent decomposes complex chart analysis into a sequence of observable, replayable steps. Supporting this architecture is an extensible, modular tool library comprising more than a dozen core tools, such as keyelement detection, instance segmentation, and optical character recognition (OCR), which the agent dynamically orchestrates to achieve systematic visual parsing across diverse chart types. Leveraging TIRs transparency and verifiability, ChartAgent moves beyond the black box paradigm by standardizing and consolidating intermediate outputs into a structured Evidence Package, providing traceable and reproducible support for final conclusions. Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible systems for chart understanding.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yong Fang, Na Li, Hangguan Shan, Eryun Liu, Xinyu Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Multivariate Time Series (MTS) forecasting plays a vital role in various real-world applications, such as traffic management and predictive maintenance. Existing approaches typically model MTS data in either Euclidean or Riemannian space, limiting their ability to capture the diverse geometric structures and complex spatio-temporal dependencies inherent in real-world data. To overcome this limitation, we propose the Hybrid Symmetric Positive-Definite Manifold Graph Neural Network (HSMGNN), a novel graph neural network-based model that captures data geometry within a hybrid Euclidean-Riemannian framework. To the best of our knowledge, this is the first work to leverage hybrid geometric representations for MTS forecasting, enabling expressive and comprehensive modeling of geometric properties. Specifically, we introduce a Submanifold-Cross-Segment (SCS) embedding to project input MTS into both Euclidean and Riemannian spaces, thereby capturing spatio-temporal variations across distinct geometric domains. To alleviate the high computational cost of Riemannian distance, we further design an Adaptive-Distance-Bank (ADB) layer with a trainable memory mechanism. Finally, a Fusion Graph Convolutional Network (FGCN) is devised to integrate features from the dual spaces via a learnable fusion operator for accurate prediction. Experiments on three benchmark datasets demonstrate that HSMGNN achieves up to a 13.8 percent improvement over state-of-the-art baselines in forecasting accuracy.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Juseung Yun, Sunwoo Yu, Sumin Ha, Jonghyun Kim, Janghyeon Lee et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Cancer progression arises from interactions across multiple biological layers, especially beyond morphological and across molecular layers that remain invisible to image-only models. To capture this broader biological landscape, we present EXAONE Path 2.5, a pathology foundation model that jointly models histologic, genomic, epigenetic and transcriptomic modalities, producing an integrated patient representation that reflects tumor biology more comprehensively. Our approach incorporates three key components: (1) multimodal SigLIP loss enabling all-pairwise contrastive learning across heterogeneous modalities, (2) a fragment-aware rotary positional encoding (F-RoPE) module that preserves spatial structure and tissue-fragment topology in WSI, and (3) domain-specialized internal foundation models for both WSI and RNA-seq to provide biologically grounded embeddings for robust multimodal alignment. We evaluate EXAONE Path 2.5 against six leading pathology foundation models across two complementary benchmarks: an internal real-world clinical dataset and the Patho-Bench benchmark covering 80 tasks. Our framework demonstrates high data and parameter efficiency, achieving on-par performance with state-of-the-art foundation models on Patho-Bench while exhibiting the highest adaptability in the internal clinical setting. These results highlight the value of biologically informed multimodal design and underscore the potential of integrated genotype-to-phenotype modeling for next-generation precision oncology.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yue Wan, Jiayi Yuan, Zhiwei Feng, Xiaowei Jia
Machine Learning
ๆฅ็ๆ่ฆ
Antigenic epitope presented by major histocompatibility complex II (MHC-II) proteins plays an essential role in immunotherapy. However, compared to the more widely studied MHC-I in computational immunotherapy, the study of MHC-II antigenic epitope poses significantly more challenges due to its complex binding specificity and ambiguous motif patterns. Consequently, existing datasets for MHC-II interactions are smaller and less standardized than those available for MHC-I. To address these challenges, we present a well-curated dataset derived from the Immune Epitope Database (IEDB) and other public sources. It not only extends and standardizes existing peptide-MHC-II datasets, but also introduces a novel antigen-MHC-II dataset with richer biological context. Leveraging this dataset, we formulate three major machine learning (ML) tasks of peptide binding, peptide presentation, and antigen presentation, which progressively capture the broader biological processes within the MHC-II antigen presentation pathway. We further employ a multi-scale evaluation framework to benchmark existing models, along with a comprehensive analysis over various modeling designs to this problem with a modular framework. Overall, this work serves as a valuable resource for advancing computational immunotherapy, providing a foundation for future research in ML guided epitope discovery and predictive modeling of immune responses.
๐
2025-12-16
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Che-Chia Chang, Te-Sheng Lin, Ming-Chih Lai
Machine Learning
ๆฅ็ๆ่ฆ
The Stefan problem is a classical free-boundary problem that models phase-change processes and poses computational challenges due to its moving interface and nonlinear temperature-phase coupling. In this work, we develop a physics-informed neural network framework for solving two-phase Stefan problems. The proposed method explicitly tracks the interface motion and enforces the discontinuity in the temperature gradient across the interface while maintaining global consistency of the temperature field. Our approach employs two neural networks: one representing the moving interface and the other for the temperature field. The interface network allows rapid categorization of thermal diffusivity in the spatial domain, which is a crucial step for selecting training points for the temperature network. The temperature network's input is augmented with a modified zero-level set function to accurately capture the jump in its normal derivative across the interface. Numerical experiments on two-phase dynamical Stefan problems demonstrate the superior accuracy and effectiveness of our proposed method compared with the ones obtained by other neural network methodology in literature. The results indicate that the proposed framework offers a robust and flexible alternative to traditional numerical methods for solving phase-change problems governed by moving boundaries. In addition, the proposed method can capture an unstable interface evolution associated with the Mullins-Sekerka instability.
๐
2025-12-16
๐ NeurIPS 2025
๐ฅ Zheng He, Roman Pogodin, Yazhe Li, Namrata Deka, Arthur Gretton et al.
Machine Learning
ๆฅ็ๆ่ฆ
Tests of conditional independence (CI) underpin a number of important problems in machine learning and statistics, from causal discovery to evaluation of predictor fairness and out-of-distribution robustness. Shah and Peters (2020) showed that, contrary to the unconditional case, no universally finite-sample valid test can ever achieve nontrivial power. While informative, this result (based on "hiding" dependence) does not seem to explain the frequent practical failures observed with popular CI tests. We investigate the Kernel-based Conditional Independence (KCI) test - of which we show the Generalized Covariance Measure underlying many recent tests is nearly a special case - and identify the major factors underlying its practical behavior. We highlight the key role of errors in the conditional mean embedding estimate for the Type-I error, while pointing out the importance of selecting an appropriate conditioning kernel (not recognized in previous work) as being necessary for good test power but also tending to inflate Type-I error.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Aaron Wei, Milad Jalali, Danica J. Sutherland
ๆฅ็ๆ่ฆ
Existing two-sample testing techniques, particularly those based on choosing a kernel for the Maximum Mean Discrepancy (MMD), often assume equal sample sizes from the two distributions. Applying these methods in practice can require discarding valuable data, unnecessarily reducing test power. We address this long-standing limitation by extending the theory of generalized U-statistics and applying it to the usual MMD estimator, resulting in new characterization of the asymptotic distributions of the MMD estimator with unequal sample sizes (particularly outside the proportional regimes required by previous partial results). This generalization also provides a new criterion for optimizing the power of an MMD test with unequal sample sizes. Our approach preserves all available data, enhancing test accuracy and applicability in realistic settings. Along the way, we give much cleaner characterizations of the variance of MMD estimators, revealing something that might be surprising to those in the area: while zero MMD implies a degenerate estimator, it is sometimes possible to have a degenerate estimator with nonzero MMD as well; we give a construction and a proof that it does not happen in common situations.
๐
2025-12-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Cindy Y. Zhang, Elif Ertekin, Peter Orbanz, Ryan P. Adams
Machine Learning
ๆฅ็ๆ่ฆ
Incorporating known symmetries in data into machine learning models has consistently improved predictive accuracy, robustness, and generalization. However, achieving exact invariance to specific symmetries typically requires designing bespoke architectures for each group of symmetries, limiting scalability and preventing knowledge transfer across related symmetries. In the case of the space groups, symmetries critical to modeling crystalline solids in materials science and condensed matter physics, this challenge is particularly salient as there are 230 such groups in three dimensions. In this work we present a new approach to such crystallographic symmetries by developing a single machine learning architecture that is capable of adapting its weights automatically to enforce invariance to any input space group. Our approach is based on constructing symmetry-adapted Fourier bases through an explicit characterization of constraints that group operations impose on Fourier coefficients. Encoding these constraints into a neural network layer enables weight sharing across different space groups, allowing the model to leverage structural similarities between groups and overcome data sparsity when limited measurements are available for specific groups. We demonstrate the effectiveness of this approach in achieving competitive performance on material property prediction tasks and performing zero-shot learning to generalize to unseen groups.
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Fabian Haak, Philipp Schaer
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Linguistic bias in online news and social media is widespread but difficult to measure. Yet, its identification and quantification remain difficult due to subjectivity, context dependence, and the scarcity of high-quality gold-label datasets. We aim to reduce annotation effort by leveraging pairwise comparison for bias annotation. To overcome the costliness of the approach, we evaluate more efficient implementations of pairwise comparison-based rating. We achieve this by investigating the effects of various rating techniques and the parameters of three cost-aware alternatives in a simulation environment. Since the approach can in principle be applied to both human and large language model annotation, our work provides a basis for creating high-quality benchmark datasets and for quantifying biases and other subjective linguistic aspects.
The controlled simulations include latent severity distributions, distance-calibrated noise, and synthetic annotator bias to probe robustness and cost-quality trade-offs. In applying the approach to human-labeled bias benchmark datasets, we then evaluate the most promising setups and compare them to direct assessment by large language models and unmodified pairwise comparison labels as baselines. Our findings support the use of pairwise comparison as a practical foundation for quantifying subjective linguistic aspects, enabling reproducible bias analysis. We contribute an optimization of comparison and matchmaking components, an end-to-end evaluation including simulation and real-data application, and an implementation blueprint for cost-aware large-scale annotation
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated remarkable potential in transforming recommender systems from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 successfully pioneered this paradigm by integrating LLM-based reasoning into user interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) insufficient explanation diversity in fixed-template generation; (3) limited generalization under supervised learning paradigms; and (4) simplistic outcome-focused evaluation that fails to match human standards.
To address these challenges, we present RecGPT-V2 with four key innovations. First, a Hierarchical Multi-Agent System restructures intent reasoning through coordinated collaboration, eliminating cognitive duplication while enabling diverse intent coverage. Combined with Hybrid Representation Inference that compresses user-behavior contexts, our framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework dynamically generates contextually adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi-reward conflicts, achieving +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning, improving human preference alignment. Online A/B tests on Taobao demonstrate significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes both the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial utility.
๐
2025-12-16
๐ WSDM 2026
๐ฅ Shifu Bie, Jiangxia Cao, Zixiao Luo, Yichuan Zou, Lei Liang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We present PushGen, an automated framework for generating high-quality push notifications comparable to human-crafted content. With the rise of generative models, there is growing interest in leveraging LLMs for push content generation. Although LLMs make content generation straightforward and cost-effective, maintaining stylistic control and reliable quality assessment remains challenging, as both directly impact user engagement. To address these issues, PushGen combines two key components: (1) a controllable category prompt technique to guide LLM outputs toward desired styles, and (2) a reward model that ranks and selects generated candidates. Extensive offline and online experiments demonstrate its effectiveness, which has been deployed in large-scale industrial applications, serving hundreds of millions of users daily.
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Malika Iratni, Mohand Boughanem, Taoufiq Dkaki
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval Augmented Generation (RAG) enhances language model performance by incorporating external knowledge retrieved from large corpora, which makes it highly suitable for tasks such as open domain question answering. Standard RAG systems typically rely on a fixed top k retrieval strategy, which can either miss relevant information or introduce semantically irrelevant passages, known as distractors, that degrade output quality. Additionally, the positioning of retrieved passages within the input context can influence the model attention and generation outcomes. Context placed in the middle tends to be overlooked, which is an issue known as the "lost in the middle" phenomenon. In this work, we systematically analyze the impact of distractors on generation quality, and quantify their effects under varying conditions. We also investigate how the position of relevant passages within the context window affects their influence on generation. Building on these insights, we propose a context-size classifier that dynamically predicts the optimal number of documents to retrieve based on query-specific informational needs. We integrate this approach into a full RAG pipeline, and demonstrate improved performance over fixed k baselines.
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Panayiotis Smeros, Vincent Emonet, Ruijie Wang, Ana-Claudia Sima, Tarcisio Mendes de Farias
Natural Language Processing
ๆฅ็ๆ่ฆ
The advent of large language models is contributing to the emergence of novel approaches that promise to better tackle the challenge of generating structured queries, such as SPARQL queries, from natural language. However, these new approaches mostly focus on response accuracy over a single source while ignoring other evaluation criteria, such as federated query capability over distributed data stores, as well as runtime and cost to generate SPARQL queries. Consequently, they are often not production-ready or easy to deploy over (potentially federated) knowledge graphs with good accuracy. To mitigate these issues, in this paper, we extend our previous work and describe and systematically evaluate SPARQL-LLM, an open-source and triplestore-agnostic approach, powered by lightweight metadata, that generates SPARQL queries from natural language text. First, we describe its architecture, which consists of dedicated components for metadata indexing, prompt building, and query generation and execution. Then, we evaluate it based on a state-of-the-art challenge with multilingual questions, and a collection of questions from three of the most prevalent knowledge graphs within the field of bioinformatics. Our results demonstrate a substantial increase of 24% in the F1 Score on the state-of-the-art challenge, adaptability to high-resource languages such as English and Spanish, as well as ability to form complex and federated bioinformatics queries. Furthermore, we show that SPARQL-LLM is up to 36x faster than other systems participating in the challenge, while costing a maximum of $0.01 per question, making it suitable for real-time, low-cost text-to-SPARQL applications. One such application deployed over real-world decentralized knowledge graphs can be found at https://www.expasy.org/chat.
๐
2025-12-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque
Natural Language Processing
ๆฅ็ๆ่ฆ
Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local\_dialect:standard\_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76\% to 55\% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof
MultimodalNatural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMS). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Kaike Zhang, Qi Cao, Fei Sun, Xinran Liu
Machine Learning
ๆฅ็ๆ่ฆ
Sequential recommender systems have demonstrated strong capabilities in modeling users' dynamic preferences and capturing item transition patterns. However, real-world user behaviors are often noisy due to factors such as human errors, uncertainty, and behavioral ambiguity, which can lead to degraded recommendation performance. To address this issue, recent approaches widely adopt self-supervised learning (SSL), particularly contrastive learning, by generating perturbed views of user interaction sequences and maximizing their mutual information to improve model robustness. However, these methods heavily rely on their pre-defined static augmentation strategies~(where the augmentation type remains fixed once chosen) to construct augmented views, leading to two critical challenges: (1) the optimal augmentation type can vary significantly across different scenarios; (2) inappropriate augmentations may even degrade recommendation performance, limiting the effectiveness of SSL. To overcome these limitations, we propose an adaptive augmentation framework. We first unify existing basic augmentation operations into a unified formulation via structured transformation matrices. Building on this, we introduce AsarRec (Adaptive Sequential Augmentation for Robust Sequential Recommendation), which learns to generate transformation matrices by encoding user sequences into probabilistic transition matrices and projecting them into hard semi-doubly stochastic matrices via a differentiable Semi-Sinkhorn algorithm. To ensure that the learned augmentations benefit downstream performance, we jointly optimize three objectives: diversity, semantic invariance, and informativeness. Extensive experiments on three benchmark datasets under varying noise levels validate the effectiveness of AsarRec, demonstrating its superior robustness and consistent improvements.
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mingjia Yin, Junwei Pan, Hao Wang, Ximei Wang, Shangyu Zhang et al.
ๆฅ็ๆ่ฆ
Click-Through Rate (CTR) prediction, a core task in recommendation systems, aims to estimate the probability of users clicking on items. Existing models predominantly follow a discriminative paradigm, which relies heavily on explicit interactions between raw ID embeddings. However, this paradigm inherently renders them susceptible to two critical issues: embedding dimensional collapse and information redundancy, stemming from the over-reliance on feature interactions \emph{over raw ID embeddings}. To address these limitations, we propose a novel \emph{Supervised Feature Generation (SFG)} framework, \emph{shifting the paradigm from discriminative ``feature interaction" to generative ``feature generation"}. Specifically, SFG comprises two key components: an \emph{Encoder} that constructs hidden embeddings for each feature, and a \emph{Decoder} tasked with regenerating the feature embeddings of all features from these hidden representations. Unlike existing generative approaches that adopt self-supervised losses, we introduce a supervised loss to utilize the supervised signal, \ie, click or not, in the CTR prediction task. This framework exhibits strong generalizability: it can be seamlessly integrated with most existing CTR models, reformulating them under the generative paradigm. Extensive experiments demonstrate that SFG consistently mitigates embedding collapse and reduces information redundancy, while yielding substantial performance gains across various datasets and base models. The code is available at https://github.com/USTC-StarTeam/GE4Rec.
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yifan Shao, Peilin Zhou, Shoujin Wang, Weizhi Zhang, Xu Cai et al.
ๆฅ็ๆ่ฆ
Inspired by advances in LLMs, reasoning-enhanced sequential recommendation performs multi-step deliberation before making final predictions, unlocking greater potential for capturing user preferences. However, current methods are constrained by static reasoning trajectories that are ill-suited for the diverse complexity of user behaviors. They suffer from two key limitations: (1) a static reasoning direction, which uses flat supervision signals misaligned with human-like hierarchical reasoning, and (2) a fixed reasoning depth, which inefficiently applies the same computational effort to all users, regardless of pattern complexity. These rigidity lead to suboptimal performance and significant computational waste. To overcome these challenges, we propose DTRec, a novel and effective framework that explores the Dynamic reasoning Trajectory for Sequential Recommendation along both direction and depth. To guide the direction, we develop Hierarchical Process Supervision (HPS), which provides coarse-to-fine supervisory signals to emulate the natural, progressive refinement of human cognitive processes. To optimize the depth, we introduce the Adaptive Reasoning Halting (ARH) mechanism that dynamically adjusts the number of reasoning steps by jointly monitoring three indicators. Extensive experiments on three real-world datasets demonstrate the superiority of our approach, achieving up to a 24.5% performance improvement over strong baselines while simultaneously reducing computational cost by up to 41.6%.
๐
2025-12-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yifan Shao, Peilin Zhou
ๆฅ็ๆ่ฆ
Sequential recommendation systems aim to capture users' evolving preferences from their interaction histories. Recent reasoningenhanced methods have shown promise by introducing deliberate, chain-of-thought-like processes with intermediate reasoning steps. However, these methods rely solely on the next target item as supervision, leading to two critical issues: (1) reasoning instability--the process becomes overly sensitive to recent behaviors and spurious interactions like accidental clicks, and (2) surface-level reasoning--the model memorizes item-to-item transitions rather than understanding intrinsic behavior patterns. To address these challenges, we propose IGR-SR, an Intent-Guided Reasoning framework for Sequential Recommendation that anchors the reasoning process to explicitly extracted high-level intents. Our framework comprises three key components: (1) a Latent Intent Distiller (LID) that efficiently extracts multi-faceted intents using a frozen encoder with learnable tokens, (2) an Intent-aware Deliberative Reasoner (IDR) that decouples reasoning into intent deliberation and decision-making via a dual-attention architecture, and (3) an Intent Consistency Regularization (ICR) that ensures robustness by enforcing consistent representations across different intent views. Extensive experiments on three public datasets demonstrate that IGR-SR achieves an average 7.13% improvement over state-of-the-art baselines. Critically, under 20% behavioral noise, IGR-SR degrades only 10.4% compared to 16.2% and 18.6% for competing methods, validating the effectiveness and robustness of intent-guided reasoning.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically-plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2\% to 6.9\% on human-centric video benchmarks (EMDB, PROX), while delivering a 43\% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP's ability to generate physically-valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Sirui Chen, Zi-ang Cao, Zhengyi Luo, Fernando Castaรฑeda, Chenran Li et al.
Robotics
ๆฅ็ๆ่ฆ
Recent progress in humanoid robots has unlocked agile locomotion skills, including backflipping, running, and crawling. Yet it remains challenging for a humanoid robot to perform forceful manipulation tasks such as moving objects, wiping, and pushing a cart. We propose adaptive Compliance Humanoid control through hIsight Perturbation (CHIP), a plug-and-play module that enables controllable end-effector stiffness while preserving agile tracking of dynamic reference motions. CHIP is easy to implement and requires neither data augmentation nor additional reward tuning. We show that a generalist motion-tracking controller trained with CHIP can perform a diverse set of forceful manipulation tasks that require different end-effector compliance, such as multi-robot collaboration, wiping, box delivery, and door opening.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Zechen Bai, Chen Gao, Mike Zheng Shou
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through environmental interaction, which is akin to how humans master skills through practice. Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models, yet remain fundamentally limited by Supervised Finetuning (SFT): requiring hundreds of demonstrations per task, rigidly memorizing trajectories, and failing to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations. The key technical challenge is replacing oracle reward signals (unavailable at test time) with autonomous feedback. We address this through a learned progress estimator providing dense feedback, and critically, we design our framework to ``tame'' this inherently noisy signal via two mechanisms: (1) an accumulative progress estimation mechanism smoothing noisy point-wise estimates, and (2) a progressive horizon extension strategy enabling gradual policy evolution. EVOLVE-VLA achieves substantial gains: +8.6\% on long-horizon tasks, +22.0\% in 1-shot learning, and enables cross-task generalization -- achieving 20.8\% success on unseen tasks without task-specific demonstrations training (vs. 0\% for pure SFT). Qualitative analysis reveals emergent capabilities absent in demonstrations, including error recovery and novel strategies. This work represents a critical step toward VLAs that truly learn and adapt, moving beyond static imitation toward continuous self-improvements.
๐
2025-12-16
๐ ArXiv Preprint (eess.SY)
๐ฅ Riccardo Busetto, Elia Cereda, Marco Forgione, Gabriele Maroni, Dario Piga et al.
Robotics
ๆฅ็ๆ่ฆ
We introduce a benchmark for system identification based on 75k real-world samples from the Crazyflie 2.1 Brushless nano-quadrotor, a sub-50g aerial vehicle widely adopted in robotics research. The platform presents a challenging testbed due to its multi-input, multi-output nature, open-loop instability, and nonlinear dynamics under agile maneuvers. The dataset comprises four aggressive trajectories with synchronized 4-dimensional motor inputs and 13-dimensional output measurements. To enable fair comparison of identification methods, the benchmark includes a suite of multi-horizon prediction metrics for evaluating both one-step and multi-step error propagation. In addition to the data, we provide a detailed description of the platform and experimental setup, as well as baseline models highlighting the challenge of accurate prediction under real-world noise and actuation nonlinearities. All data, scripts, and reference implementations are released as open-source at https://github.com/idsia-robotics/nanodrone-sysid-benchmark to facilitate transparent comparison of algorithms and support research on agile, miniaturized aerial robotics.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a $\textbf{Dreamer}$ that employs generative models to visualize $\textit{how}$ an interaction would look; (2) a $\textbf{Thinker}$ that utilizes large vision-language models to decide $\textit{what}$ object part to interact with; and (3) a $\textbf{Spotter}$ that orchestrates vision foundation models to precisely locate $\textit{where}$ the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Quan Yuan, Daqian Cao, Weibang Bai
Machine Learning
ๆฅ็ๆ่ฆ
Redundant parallel robots are normally employed in scenarios requiring good precision, high load capability, and large workspace compared to traditional parallel mechanisms. However, the elementary robotic configuration and geometric parameter optimization are still quite challenging. This paper proposes a novel 3-(PP(2-(UPS))) redundant parallel mechanism, with good generalizability first, and further investigates the kinematic optimization issue by analyzing and investigating how its key geometric parameters influence the volume, shape, boundary completeness, and orientation capabilities of its workspace. The torsional capability index TI_1 and tilting capability index TI_2 are defined to evaluate the orientation performance of the mechanism. Numerical simulation studies are completed to indicate the analysis, providing reasonable but essential references for the parameter optimization of 3-(PP(2-(UPS))) and other similar redundant parallel mechanisms.
๐
2025-12-16
๐ 9 pages, 4 figures, submitted to International Jou
๐ฅ Aaron Kurda, Simon Steuernagel, Lukas Jung, Marcus Baum
Robotics
ๆฅ็ๆ่ฆ
The development and evaluation of Lidar-Inertial Odometry (LIO) and Simultaneous Localization and Mapping (SLAM) systems requires a precise ground truth. The Global Navigation Satellite System (GNSS) is often used as a foundation for this, but its signals can be unreliable in obstructed environments due to multi-path effects or loss-of-signal. While existing datasets compensate for the sporadic loss of GNSS signals by incorporating Inertial Measurement Unit (IMU) measurements, the commonly used Micro-Electro-Mechanical Systems (MEMS) or Fiber Optic Gyroscope (FOG)-based systems do not permit the prolonged study of GNSS-denied environments. To close this gap, we present Odyssey, a LIO dataset with a focus on GNSS-denied environments such as tunnels and parking garages as well as other underrepresented, yet ubiquitous situations such as stop-and-go-traffic, bumpy roads and wide open fields. Our ground truth is derived from a navigation-grade Inertial Navigation System (INS) equipped with a Ring Laser Gyroscope (RLG), offering exceptional bias stability characteristics compared to IMUs used in existing datasets and enabling the prolonged and accurate study of GNSS-denied environments. This makes Odyssey the first publicly available dataset featuring a RLG-based INS. Besides providing data for LIO, we also support other tasks, such as place recognition, through the threefold repetition of all trajectories as well as the integration of external mapping data by providing precise geodetic coordinates. All data, dataloader and other material is available online at https://odyssey.uni-goettingen.de/ .
๐
2025-12-16
๐ 13 pages, 8 figures, submitted to IEEE Transaction
๐ฅ Simon Steuernagel, Marcus Baum
ๆฅ็ๆ่ฆ
Extended object tracking involves estimating both the physical extent and kinematic parameters of a target object, where typically multiple measurements are observed per time step. In this article, we propose a deterministic closed-form elliptical extended object tracker, based on decoupling of the kinematics, orientation, and axis lengths. By disregarding potential correlations between these state components, fewer approximations are required for the individual estimators than for an overall joint solution. The resulting algorithm outperforms existing algorithms, reaching the accuracy of sampling-based procedures. Additionally, a batch-based variant is introduced, yielding highly efficient computation while outperforming all comparable state-of-the-art algorithms. This is validated both by a simulation study using common models from literature, as well as an extensive quantitative evaluation on real automotive radar data.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Mohammed Ayman Habib, Aldo Petruzzelli
MultimodalRobotics
ๆฅ็ๆ่ฆ
Omnia presents a synthetic data driven pipeline to accelerate the training, validation, and deployment readiness of militarized humanoids. The approach converts first-person spatial observations captured from point-of-view recordings, smart glasses, augmented reality headsets, and spatial browsing workflows into scalable, mission-specific synthetic datasets for humanoid autonomy. By generating large volumes of high-fidelity simulated scenarios and pairing them with automated labeling and model training, the pipeline enables rapid iteration on perception, navigation, and decision-making capabilities without the cost, risk, or time constraints of extensive field trials. The resulting datasets can be tuned quickly for new operational environments and threat conditions, supporting both baseline humanoid performance and advanced subsystems such as multimodal sensing, counter-detection survivability, and CBRNE-relevant reconnaissance behaviors. This work targets faster development cycles and improved robustness in complex, contested settings by exposing humanoid systems to broad scenario diversity early in the development process.
๐
2025-12-16
๐ Accepted at IEEE ITSC 2020
๐ฅ Georg Volk, Jรถrg Gamerdinger, Alexander von Bernuth, Oliver Bringmann
Robotics
ๆฅ็ๆ่ฆ
Complete perception of the environment and its correct interpretation is crucial for autonomous vehicles. Object perception is the main component of automotive surround sensing. Various metrics already exist for the evaluation of object perception. However, objects can be of different importance depending on their velocity, orientation, distance, size, or the potential damage that could be caused by a collision due to a missed detection. Thus, these additional parameters have to be considered for safety evaluation. We propose a new safety metric that incorporates all these parameters and returns a single easily interpretable safety assessment score for object perception. This new metric is evaluated with both real world and virtual data sets and compared to state of the art metrics.
๐
2025-12-16
๐ Accepted at IEEE IV 2023
๐ฅ Jรถrg Gamerdinger, Sven Teufel, Georg Volk, Oliver Bringmann
Robotics
ๆฅ็ๆ่ฆ
Comprehensive environment perception is essential for autonomous vehicles to operate safely. It is crucial to detect both dynamic road users and static objects like traffic signs or lanes as these are required for safe motion planning. However, in many circumstances a complete perception of other objects or lanes is not achievable due to limited sensor ranges, occlusions, and curves. In scenarios where an accurate localization is not possible or for roads where no HD maps are available, an autonomous vehicle must rely solely on its perceived road information. Thus, extending local sensing capabilities through collective perception using vehicle-to-vehicle communication is a promising strategy that has not yet been explored for lane detection. Therefore, we propose a real-time capable approach for collective perception of lanes using a spline-based estimation of undetected road sections. We evaluate our proposed fusion algorithm in various situations and road types. We were able to achieve real-time capability and extend the perception range by up to 200%.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Henrik Hose, Paul Brunzema, Alexander von Rohr, Alexander Grรคfe, Angela P. Schoellig et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Approximate model-predictive control (AMPC) aims to imitate an MPC's behavior with a neural network, removing the need to solve an expensive optimization problem at runtime. However, during deployment, the parameters of the underlying MPC must usually be fine-tuned. This often renders AMPC impractical as it requires repeatedly generating a new dataset and retraining the neural network. Recent work addresses this problem by adapting AMPC without retraining using approximated sensitivities of the MPC's optimization problem. Currently, this adaption must be done by hand, which is labor-intensive and can be unintuitive for high-dimensional systems. To solve this issue, we propose using Bayesian optimization to tune the parameters of AMPC policies based on experimental data. By combining model-based control with direct and local learning, our approach achieves superior performance to nominal AMPC on hardware, with minimal experimentation. This allows automatic and data-efficient adaptation of AMPC to new system instances and fine-tuning to cost functions that are difficult to directly implement in MPC. We demonstrate the proposed method in hardware experiments for the swing-up maneuver on an inverted cartpole and yaw control of an under-actuated balancing unicycle robot, a challenging control problem.
๐
2025-12-16
๐ ArXiv Preprint (eess.SY)
๐ฅ Federico Califano, Camilla Rota, Riccardo Zanella, Antonio Franchi
Robotics
ๆฅ็ๆ่ฆ
We present a novel geometric port-Hamiltonian formulation of redundant manipulators performing a differential kinematic task $ฮท=J(q)\dot{q}$, where $q$ is a point on the configuration manifold, $ฮท$ is a velocity-like task space variable, and $J(q)$ is a linear map representing the task, for example the classical analytic or geometric manipulator Jacobian matrix. The proposed model emerges from a change of coordinates from canonical Hamiltonian dynamics, and splits the standard Hamiltonian momentum variable into a task-space momentum variable and a null-space momentum variable. Properties of this model and relation to Lagrangian formulations present in the literature are highlighted. Finally, we apply the proposed model in an \textit{Interconnection and Damping Assignment Passivity-Based Control} (IDA-PBC) design to stabilize and shape the impedance of a 7-DOF Emika Panda robot in simulation.
๐
2025-12-16
๐ This work has been submitted to the IEEE for possi
๐ฅ Aleksi Karhunen, Teemu Hakala, Vรคinรถ Karjalainen, Eija Honkavaara
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The interest in the usage of uncrewed aerial vehicles (UAVs) for forest applications has increased in recent years. While above-canopy flight has reached a high level of autonomy, navigating under-canopy remains a significant challenge. The use of autonomous UAVs could reduce the burden of data collection, which has motivated the development of numerous solutions for under-canopy autonomous flight. However, the experiments conducted in the literature and their reporting lack rigor. Very rarely, the density and the difficulty of the test forests are reported, or multiple flights are flown, and the success rate of those flights is reported. The aim of this study was to implement an autonomously flying quadrotor based on a lightweight lidar using openly available algorithms and test its behavior in real forest environments. A set of rigorous experiments was conducted with a quadrotor prototype utilizing the IPC path planner and LTA-OM SLAM algorithm. Based on the results of the first 33 flights, the original system was further enhanced. With the optimized system, 60 flights were performed, resulting in a total of 93 test flights. The optimized system performed significantly better in terms of reliability and flight mission completion times, achieving success rates of 12/15 in a medium-density forest and 15/15 in a dense forest, at a target flight velocity of 1 m/s. At a target flight velocity of 2 m/s, it had a success rate of 12/15 and 5/15, respectively. Furthermore, a standardized testing setup and evaluation criteria were proposed, enabling consistent performance comparisons of autonomous under-canopy UAV systems, enhancing reproducibility, guiding system improvements, and accelerating progress in forest robotics.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Rishabh Dev Yadav, Avirup Das, Hongyu Song, Samuel Kaski, Wei Pan
Robotics
ๆฅ็ๆ่ฆ
Real-world robots must operate under evolving dynamics caused by changing operating conditions, external disturbances, and unmodeled effects. These may appear as gradual drifts, transient fluctuations, or abrupt shifts, demanding real-time adaptation that is robust to short-term variation yet responsive to lasting change. We propose a framework for modeling the nonlinear dynamics of robotic systems that can be updated in real time from streaming data. The method decouples representation learning from online adaptation, using latent representations learned offline to support online closed-form Bayesian updates. To handle evolving conditions, we introduce a changepoint-aware mechanism with a latent variable inferred from data likelihoods that indicates continuity or shift. When continuity is likely, evidence accumulates to refine predictions; when a shift is detected, past information is tempered to enable rapid re-learning. This maintains calibrated uncertainty and supports probabilistic reasoning about transient, gradual, or structural change. We prove that the adaptive regret of the framework grows only logarithmically in time and linearly with the number of shifts, competitive with an oracle that knows timings of shift. We validate on cartpole simulations and real quadrotor flights with swinging payloads and mid-flight drops, showing improved predictive accuracy, faster recovery, and more accurate closed-loop tracking than relevant baselines.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Zixin Tang, Yiming Chen, Quentin Rouxel, Dianxi Li, Shuang Wu et al.
Robotics
ๆฅ็ๆ่ฆ
Teleoperation presents a promising paradigm for remote control and robot proprioceptive data collection. Despite recent progress, current teleoperation systems still suffer from limitations in efficiency and ergonomics, particularly in challenging scenarios. In this paper, we propose CaFe-TeleVision, a coarse-to-fine teleoperation system with immersive situated visualization for enhanced ergonomics. At its core, a coarse-to-fine control mechanism is proposed in the retargeting module to bridge workspace disparities, jointly optimizing efficiency and physical ergonomics. To stream immersive feedback with adequate visual cues for human vision systems, an on-demand situated visualization technique is integrated in the perception module, which reduces the cognitive load for multi-view processing. The system is built on a humanoid collaborative robot and validated with six challenging bimanual manipulation tasks. User study among 24 participants confirms that CaFe-TeleVision enhances ergonomics with statistical significance, indicating a lower task load and a higher user acceptance during teleoperation. Quantitative results also validate the superior performance of our system across six tasks, surpassing comparative methods by up to 28.89% in success rate and accelerating by 26.81% in completion time. Project webpage: https://clover-cuhk.github.io/cafe_television/
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xichen Ding, Jianzhe Gao, Cong Pan, Wenguan Wang, Jie Qin
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Additionally, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.
๐
2025-12-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Mayank Sewlia, Christos K. Verginis, Dimos V. Dimarogonas
Robotics
ๆฅ็ๆ่ฆ
We consider the problem of cooperative manipulation by a mobile multi-manipulator system operating in obstacle-cluttered and highly constrained environments under spatio-temporal task specifications. The task requires transporting a grasped object while respecting both continuous robot dynamics and discrete geometric constraints arising from obstacles and narrow passages. To address this hybrid structure, we propose a multi-rate planning and control framework that combines offline generation of an STL-satisfying object trajectory and collision-free base footprints with online constrained inverse kinematics and continuous-time feedback control. The resulting closed-loop system enables coordinated reconfiguration of multiple manipulators while tracking the desired object motion. The approach is evaluated in high-fidelity physics simulations using three Franka Emika Panda mobile manipulators rigidly grasping an object.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Johannes A. Gaus, Daniel Hรคufle, Woo-Jeong Baek
ๆฅ็ๆ่ฆ
While many visual odometry (VO), visual-inertial odometry (VIO), and SLAM systems achieve high accuracy, the majority of existing methods miss to assess risks at runtime. This paper presents SUPER (Sensitivity-based Uncertainty-aware PErformance and Risk assessment) that is a generic and explainable framework that propagates uncertainties via sensitivities for real-time risk assessment in VIO. The scientific novelty lies in the derivation of a real-time risk indicator that is backend-agnostic and exploits the Schur complement blocks of the Gauss-Newton normal matrix to propagate uncertainties. Practically, the Schur complement captures the sensitivity that reflects the influence of the uncertainty on the risk occurrence. Our framework estimates risks on the basis of the residual magnitudes, geometric conditioning, and short horizon temporal trends without requiring ground truth knowledge. Our framework enables to reliably predict trajectory degradation 50 frames ahead with an improvement of 20% to the baseline. In addition, SUPER initiates a stop or relocalization policy with 89.1% recall. The framework is backend agnostic and operates in real time with less than 0.2% additional CPU cost. Experiments show that SUPER provides consistent uncertainty estimates. A SLAM evaluation highlights the applicability to long horizon mapping.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Chenzui Li, Yiming Chen, Xi Wu, Tao Teng, Sylvain Calinon et al.
Robotics
ๆฅ็ๆ่ฆ
Industrial human-robot collaboration requires motion planning that is collision-free, responsive, and ergonomically safe to reduce fatigue and musculoskeletal risk. We propose the Configuration Space Ergonomic Field (CSEF), a continuous and differentiable field over the human joint space that quantifies ergonomic quality and provides gradients for real-time ergonomics-aware planning. An efficient algorithm constructs CSEF from established metrics with joint-wise weighting and task conditioning, and we integrate it into a gradient-based planner compatible with impedance-controlled robots. In a 2-DoF benchmark, CSEF-based planning achieves higher success rates, lower ergonomic cost, and faster computation than a task-space ergonomic planner. Hardware experiments with a dual-arm robot in unimanual guidance, collaborative drilling, and bimanual cocarrying show faster ergonomic cost reduction, closer tracking to optimized joint targets, and lower muscle activation than a point-to-point baseline. CSEF-based planning method reduces average ergonomic scores by up to 10.31% for collaborative drilling tasks and 5.60% for bimanual co-carrying tasks while decreasing activation in key muscle groups, indicating practical benefits for real-world deployment.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Amir M. Soufi Enayati, Homayoun Honari, Homayoun Najjaran
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) enables robots to operate in uncertain environments, but standard approaches often struggle with poor generalization to unseen tasks. Context-adaptive meta reinforcement learning addresses these limitations by conditioning on the task representation, yet they mostly rely on complete action information in the experience making task inference tightly coupled to a specific policy. This paper introduces Context Representation via Action Free Transformer encoder decoder (CRAFT), a belief model that infers task representations solely from sequences of states and rewards. By removing the dependence on actions, CRAFT decouples task inference from policy optimization, supports modular training, and leverages amortized variational inference for scalable belief updates. Built on a transformer encoder decoder with rotary positional embeddings, the model captures long range temporal dependencies and robustly encodes both parametric and non-parametric task variations. Experiments on the MetaWorld ML-10 robotic manipulation benchmark show that CRAFT achieves faster adaptation, improved generalization, and more effective exploration compared to context adaptive meta--RL baselines. These findings highlight the potential of action-free inference as a foundation for scalable RL in robotic control.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Humaira Tasnim, Ashik E Rasul, Bruce Jo, Hyung-Jin Yoon
Robotics
ๆฅ็ๆ่ฆ
Reliable helipad detection is essential for Autonomous Aerial Vehicle (AAV) landing, especially under GPS-denied or visually degraded conditions. While modern detectors such as YOLOv8 offer strong baseline performance, single-model pipelines struggle to remain robust across the extreme scale transitions that occur during descent, where helipads appear small at high altitude and large near touchdown. To address this limitation, we propose a scale-adaptive dual-expert perception framework that decomposes the detection task into far-range and close-range regimes. Two YOLOv8 experts are trained on scale-specialized versions of the HelipadCat dataset, enabling one model to excel at detecting small, low-resolution helipads and the other to provide high-precision localization when the target dominates the field of view. During inference, both experts operate in parallel, and a geometric gating mechanism selects the expert whose prediction is most consistent with the AAV's viewpoint. This adaptive routing prevents the degradation commonly observed in single-detector systems when operating across wide altitude ranges. The dual-expert perception module is evaluated in a closed-loop landing environment that integrates CARLA's photorealistic rendering with NASA's GUAM flight-dynamics engine. Results show substantial improvements in alignment stability, landing accuracy, and overall robustness compared to single-detector baselines. By introducing a scale-aware expert routing strategy tailored to the landing problem, this work advances resilient vision-based perception for autonomous descent and provides a foundation for future multi-expert AAV frameworks.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Boyang Li, Zhongpeng Jin, Shuai Zhao, Jiahui Liao, Tian Liu et al.
Robotics
ๆฅ็ๆ่ฆ
The ability to adapt to changing environments is crucial for the autonomous navigation systems of Unmanned Aerial Vehicles (UAVs). However, existing navigation systems adopt fixed execution configurations without considering environmental dynamics based on available computing resources, e.g., with a high execution frequency and task workload. This static approach causes rigid flight strategies and excessive computations, ultimately degrading flight performance or even leading to failures in UAVs. Despite the necessity for an adaptive system, dynamically adjusting workloads remains challenging, due to difficulties in quantifying environmental complexity and modeling the relationship between environment and system configuration. Aiming at adapting to dynamic environments, this paper proposes E-Navi, an environmental-adaptive navigation system for UAVs that dynamically adjusts task executions on the CPUs in response to environmental changes based on available computational resources. Specifically, the perception-planning pipeline of UAVs navigation system is redesigned through dynamic adaptation of mapping resolution and execution frequency, driven by the quantitative environmental complexity evaluations. In addition, E-Navi supports flexible deployment across hardware platforms with varying levels of computing capability. Extensive Hardware-In-the-Loop and real-world experiments demonstrate that the proposed system significantly outperforms the baseline method across various hardware platforms, achieving up to 53.9% navigation task workload reduction, up to 63.8% flight time savings, and delivering more stable velocity control.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Zhaofeng Hu, Hongrui Yu, Vaidhyanathan Chandramouli, Ci-Jyun Liang
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
This study evaluates two leading approaches for teaching construction robots new skills to understand their applicability for construction automation: a Vision-Language-Action (VLA) model and Reinforcement Learning (RL) methods. The goal is to understand both task performance and the practical effort needed to deploy each approach on real jobs. The authors developed two teleoperation interfaces to control the robots and collect the demonstrations needed, both of which proved effective for training robots for long-horizon and dexterous tasks. In addition, the authors conduct a three-stage evaluation. First, the authors compare a Multi-Layer Perceptron (MLP) policy with a Deep Q-network (DQN) imitation model to identify the stronger RL baseline, focusing on model performance, generalization, and a pick-up experiment. Second, three different VLA models are trained in two different scenarios and compared with each other. Third, the authors benchmark the selected RL baseline against the VLA model using computational and sample-efficiency measures and then a robot experiment on a multi-stage panel installation task that includes transport and installation. The VLA model demonstrates strong generalization and few-shot capability, achieving 60% and 100% success in the pickup phase. In comparison, DQN can be made robust but needs additional noise during tuning, which increases the workload. Overall, the findings indicate that VLA offers practical advantages for changing tasks by reducing programming effort and enabling useful performance with minimal data, while DQN provides a viable baseline when sufficient tuning effort is acceptable.
๐
2025-12-16
๐ IROS 2025
๐ฅ Zhuo Zhang, Yonghui Liu, Meijie Zhang, Feiyang Tan, Yikang Ding
ๆฅ็ๆ่ฆ
In this paper, we unleash the potential of the powerful monodepth model in camera-LiDAR calibration and propose CLAIM, a novel method of aligning data from the camera and LiDAR. Given the initial guess and pairs of images and LiDAR point clouds, CLAIM utilizes a coarse-to-fine searching method to find the optimal transformation minimizing a patched Pearson correlation-based structure loss and a mutual information-based texture loss. These two losses serve as good metrics for camera-LiDAR alignment results and require no complicated steps of data processing, feature extraction, or feature matching like most methods, rendering our method simple and adaptive to most scenes. We validate CLAIM on public KITTI, Waymo, and MIAS-LCEC datasets, and the experimental results demonstrate its superior performance compared with the state-of-the-art methods. The code is available at https://github.com/Tompson11/claim.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Hossein Naderi, Alireza Shojaei, Philip Agee, Kereshmeh Afsari, Abiola Akanmu
MultimodalRobotics
ๆฅ็ๆ่ฆ
Despite recent advances in robotics and human-robot collaboration in the AEC industry, trust has mostly been treated as a static factor, with little guidance on how it changes across events during collaboration. This paper investigates how a robot's task performance and its expressive responses after outcomes shape the dynamics of human trust over time. To this end, we designed a controlled within-subjects study with two construction-inspired tasks, Material Delivery (physical assistance) and Information Gathering (perceptual assistance), and measured trust repeatedly (four times per task) using the 14-item Trust Perception Scale for HRI plus a redelegation choice. The robot produced two multimodal expressions, a "glad" display with a brief confirmation after success, and a "sad" display with an apology and a request for a second chance after failure. The study was conducted in a lab environment with 30 participants and a quadruped platform, and we evaluated trust dynamics and repair across both tasks. Results show that robot success reliably increases trust, failure causes sharp drops, and apology-based expressions partially restores trust (44% recovery in Material Delivery; 38% in Information Gathering). Item-level analysis indicates that recovered trust was driven mostly by interaction and communication factors, with competence recovering partially and autonomy aspects changing least. Additionally, age group and prior attitudes moderated trust dynamics with younger participants showed larger but shorter-lived changes, mid-20s participants exhibited the most durable repair, and older participants showed most conservative dynamics. This work provides a foundation for future efforts that adapt repair strategies to task demands and user profiles to support safe, productive adoption of robots on construction sites.
๐
2025-12-16
๐ ArXiv Preprint (Robotics)
๐ฅ Hossein Naderi, Alireza Shojaei, Philip Agee, Kereshmeh Afsari, Abiola Akanmu
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Construction safety inspection remains mostly manual, and automated approaches still rely on task-specific datasets that are hard to maintain in fast-changing construction environments due to frequent retraining. Meanwhile, field inspection with robots still depends on human teleoperation and manual reporting, which are labor-intensive. This paper aims to connect what a robot sees during autonomous navigation to the safety rules that are common in construction sites, automatically generating a safety inspection report. To this end, we proposed a multi-layer framework with two main modules: robotics and AI. On the robotics side, SLAM and autonomous navigation provide repeatable coverage and targeted revisits via waypoints. On AI side, a Vision Language Model (VLM)-based layer produces scene descriptions; a retrieval component powered grounds those descriptions in OSHA and site policies; Another VLM-based layer assesses the safety situation based on rules; and finally Large Language Model (LLM) layer generates safety reports based on previous outputs. The framework is validated with a proof-of-concept implementation and evaluated in a lab environment that simulates common hazards across three scenarios. Results show high recall with competitive precision compared to state-of-the-art closed-source models. This paper contributes a transparent, generalizable pipeline that moves beyond black-box models by exposing intermediate artifacts from each layer and keeping the human in the loop. This work provides a foundation for future extensions to additional tasks and settings within and beyond construction context.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kamer Ali Yuksel
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used to evolve programs and multi-agent systems, yet most existing approaches rely on overwrite-based mutations that maintain only a single candidate at a time. Such methods discard useful variants, suffer from destructive edits, and explore a brittle search space prone to structural failure. We introduce EvoLattice, a framework that represents an entire population of candidate programs or agent behaviors within a single directed acyclic graph. Each node stores multiple persistent alternatives, and every valid path through the graph defines a distinct executable candidate, yielding a large combinatorial search space without duplicating structure. EvoLattice enables fine-grained alternative-level evaluation by scoring each alternative across all paths in which it appears, producing statistics that reveal how local design choices affect global performance. These statistics provide a dense, data-driven feedback signal for LLM-guided mutation, recombination, and pruning, while preserving successful components. Structural correctness is guaranteed by a deterministic self-repair mechanism that enforces acyclicity and dependency consistency independently of the LLM. EvoLattice naturally extends to agent evolution by interpreting alternatives as prompt fragments or sub-agent behaviors. Across program synthesis (proxy and optimizer meta-learning), EvoLattice yields more stable evolution, greater expressivity, and stronger improvement trajectories than prior LLM-guided methods. The resulting dynamics resemble quality-diversity optimization, emerging implicitly from EvoLattice's internal multi-alternative representation rather than an explicit external archive.
๐
2025-12-15
๐ ArXiv Preprint (cs.MA)
๐ฅ Zishan Bai, Enze Ge, Junfeng Hao
Robotics
ๆฅ็ๆ่ฆ
The proliferation of cloud-native architectures, characterized by microservices and dynamic orchestration, has rendered modern IT infrastructures exceedingly complex and volatile. This complexity generates overwhelming volumes of operational data, leading to critical bottlenecks in conventional systems: inefficient information processing, poor task coordination, and loss of contextual continuity during fault diagnosis and remediation. To address these challenges, we propose AOI (AI-Oriented Operations), a novel multi-agent collaborative framework that integrates three specialized agents with an LLM-based Context Compressor. Its core innovations include: (1) a dynamic task scheduling strategy that adaptively prioritizes operations based on real-time system states, and (2) a three-layer memory architecture comprising Working, Episodic, and Semantic layers that optimizes context retention and retrieval. Extensive experiments on both synthetic and real-world benchmarks demonstrate that AOI effectively mitigates information overload, achieving a 72.4% context compression ratio while preserving 92.8% of critical information and significantly enhances operational efficiency, attaining a 94.2% task success rate and reducing the Mean Time to Repair (MTTR) by 34.4% compared to the best baseline. This work presents a paradigm shift towards scalable, adaptive, and context-aware autonomous operations, enabling robust management of next-generation IT infrastructures with minimal human intervention.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sindhuja Madabushi, Dawood Wasif, Jin-Hee Cho
Machine Learning
ๆฅ็ๆ่ฆ
Federated Learning (FL) has emerged as a leading privacy-preserving machine learning paradigm, enabling participants to share model updates instead of raw data. However, FL continues to face key challenges, including weak client incentives, privacy risks, and resource constraints. Assessing client reliability is essential for fair incentive allocation and ensuring that each client's data contributes meaningfully to the global model. To this end, we propose MURIM, a MUlti-dimensional Reputation-based Incentive Mechanism that jointly considers client reliability, privacy, resource capacity, and fairness while preventing malicious or unreliable clients from earning undeserved rewards. MURIM allocates incentives based on client contribution, latency, and reputation, supported by a reliability verification module. Extensive experiments on MNIST, FMNIST, and ADULT Income datasets demonstrate that MURIM achieves up to 18% improvement in fairness metrics, reduces privacy attack success rates by 5-9%, and improves robustness against poisoning and noisy-gradient attacks by up to 85% compared to state-of-the-art baselines. Overall, MURIM effectively mitigates adversarial threats, promotes fair and truthful participation, and preserves stable model convergence across heterogeneous and dynamic federated settings.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qi Chen, Fabio Ramos, Alรกn Aspuru-Guzik, Florian Shkurti
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Bayesian Optimization (BO) is a key methodology for accelerating molecular discovery by estimating the mapping from molecules to their properties while seeking the optimal candidate. Typically, BO iteratively updates a probabilistic surrogate model of this mapping and optimizes acquisition functions derived from the model to guide molecule selection. However, its performance is limited in low-data regimes with insufficient prior knowledge and vast candidate spaces. Large language models (LLMs) and chemistry foundation models offer rich priors to enhance BO, but high-dimensional features, costly in-context learning, and the computational burden of deep Bayesian surrogates hinder their full utilization. To address these challenges, we propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions. Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search. By further incorporating coarse-grained LLM-based clustering, it substantially improves scalability to large candidate sets by restricting acquisition function evaluations to clusters with statistically higher property values. We show through extensive experiments and ablations that the proposed method substantially improves scalability, robustness, and sample efficiency in LLM-guided BO for molecular discovery.
๐
2025-12-15
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Samuel Rothfarb, Megan C. Davis, Ivana Matanovic, Baikun Li, Edward F. Holby et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Artificial intelligence is reshaping scientific exploration, but most methods automate procedural tasks without engaging in scientific reasoning, limiting autonomy in discovery. We introduce Materials Agents for Simulation and Theory in Electronic-structure Reasoning (MASTER), an active learning framework where large language models autonomously design, execute, and interpret atomistic simulations. In MASTER, a multimodal system translates natural language into density functional theory workflows, while higher-level reasoning agents guide discovery through a hierarchy of strategies, including a single agent baseline and three multi-agent approaches: peer review, triage-ranking, and triage-forms. Across two chemical applications, CO adsorption on Cu-surface transition metal (M) adatoms and on M-N-C catalysts, reasoning-driven exploration reduces required atomistic simulations by up to 90% relative to trial-and-error selection. Reasoning trajectories reveal chemically grounded decisions that cannot be explained by stochastic sampling or semantic bias. Altogether, multi-agent collaboration accelerates materials discovery and marks a new paradigm for autonomous scientific exploration.
๐
2025-12-15
๐ ArXiv Preprint (cs.SE)
๐ฅ Bhargav Chickmagalur Nanjundappa, Spandan Maaheshwari
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have become integral to software engineering workflows, yet their effectiveness degrades significantly in multi-turn conversations. Recent studies demonstrate an average 39% performance drop when instructions are delivered across multiple turns, with models making premature assumptions and failing to course correct (Laban et al., 2025). This degradation is particularly problematic in exploratory programming tasks where developers need to investigate alternative approaches without committing to a single path. Current solutions force users into a false dichotomy: continue in a context-polluted conversation where the LLM becomes increasingly confused, or start fresh and lose all accumulated context.
We present ContextBranch, a conversation management system that applies version control semantics to LLM interactions. ContextBranch provides four core primitives--checkpoint, branch, switch, and inject--enabling users to capture conversation state, explore alternatives in isolation, and selectively merge insights. We evaluate ContextBranch through a controlled experiment with 30 software engineering scenarios featuring intentionally polluting explorations. Branched conversations achieved higher response quality compared to linear conversations, with large improvements in focus and context awareness. Benefits were concentrated in complex scenarios involving conceptually distant explorations. Branching reduced context size by 58.1% (31.0 to 13.0 messages), eliminating irrelevant exploratory content. Our work establishes conversation branching as a fundamental primitive for AI-assisted exploratory work, demonstrating that isolation prevents context pollution when exploring alternatives.
๐
2025-12-15
๐ ArXiv Preprint (cond-mat.soft)
๐ฅ Julian Jeggle, Raphael Wittkowski
ๆฅ็ๆ่ฆ
In this book chapter, we review how systems of simple motile agents can be used as a pathway to intelligent systems. It is a well known result from nature that large groups of entities following simple rules, such as swarms of animals, can give rise to much more complex collective behavior in a display of emergence. This begs the question whether we can emulate this behavior in synthetic matter and drive it to a point where the collective behavior reaches the complexity level of intelligent systems. Here, we will use a formalized notion of "intelligent matter" and compare it to recent results in the field of active matter. First, we will explore the approach of emergent computing in which specialized active matter systems are designed to directly solve a given task through emergent behavior. This we will then contrast with the approach of physical reservoir computing powered by the dynamics of active particle systems. In this context, we will also describe a novel reservoir computing scheme for active particles driven ultrasonically or via light refraction.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matheus Corrรชa Domingos, Valdivino Alexandre de Santiago Jรบnior, Juliana Aparecida Anochi, Elcio Hideiti Shiguemori, Luรญsa Mirelle Costa dos Santos et al.
Machine Learning
ๆฅ็ๆ่ฆ
Forecasting meteorological variables is challenging due to the complexity of their processes, requiring advanced models for accuracy. Accurate precipitation forecasts are vital for society. Reliable predictions help communities mitigate climatic impacts. Based on the current relevance of artificial intelligence (AI), classical machine learning (ML) and deep learning (DL) techniques have been used as an alternative or complement to dynamic modeling. However, there is still a lack of broad investigations into the feasibility of purely data-driven approaches for precipitation forecasting. This study aims at addressing this issue where different classical ML and DL approaches for forecasting precipitation in South America, taking into account all 2019 seasons, are considered in a detailed investigation. The selected classical ML techniques were Random Forests and extreme gradient boosting (XGBoost), while the DL counterparts were a 1D convolutional neural network (CNN 1D), a long short-term memory (LSTM) model, and a gated recurrent unit (GRU) model. Additionally, the Brazilian Global Atmospheric Model (BAM) was used as a representative of the traditional dynamic modeling approach. We also relied on explainable artificial intelligence (XAI) to provide some explanations for the models behaviors. LSTM showed strong predictive performance while BAM, the traditional dynamic model representative, had the worst results. Despite presented the higher latency, LSTM was most accurate for heavy precipitation. If cost is a concern, XGBoost offers lower latency with slightly accuracy loss. The results of this research confirm the viability of DL models for climate forecasting, solidifying a global trend in major meteorological and climate forecasting centers.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Team Olmo, :, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 Think 32B, is the strongest fully-open thinking model released to-date.
๐
2025-12-15
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Samuel Rothfarb, Megan C. Davis, Ivana Matanovic, Baikun Li, Edward F. Holby et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Artificial intelligence is reshaping scientific exploration, but most methods automate procedural tasks without engaging in scientific reasoning, limiting autonomy in discovery. We introduce Materials Agents for Simulation and Theory in Electronic-structure Reasoning (MASTER), an active learning framework where large language models autonomously design, execute, and interpret atomistic simulations. In MASTER, a multimodal system translates natural language into density functional theory workflows, while higher-level reasoning agents guide discovery through a hierarchy of strategies, including a single agent baseline and three multi-agent approaches: peer review, triage-ranking, and triage-forms. Across two chemical applications, CO adsorption on Cu-surface transition metal (M) adatoms and on M-N-C catalysts, reasoning-driven exploration reduces required atomistic simulations by up to 90% relative to trial-and-error selection. Reasoning trajectories reveal chemically grounded decisions that cannot be explained by stochastic sampling or semantic bias. Altogether, multi-agent collaboration accelerates materials discovery and marks a new paradigm for autonomous scientific exploration.
๐
2025-12-15
๐ Accepted manuscript. Published in Applied Sciences
๐ฅ Amirkia Rafiei Oskooei, Eren Caglar, Ibrahim Sahin, Ayse Kayabay, Mehmet S. Aktas
ๆฅ็ๆ่ฆ
The real-time deployment of cascaded generative AI pipelines for applications like video translation is constrained by significant system-level challenges. These include the cumulative latency of sequential model inference and the quadratic ($\mathcal{O}(N^2)$) computational complexity that renders multi-user video conferencing applications unscalable. This paper proposes and evaluates a practical system-level framework designed to mitigate these critical bottlenecks. The proposed architecture incorporates a turn-taking mechanism to reduce computational complexity from quadratic to linear in multi-user scenarios, and a segmented processing protocol to manage inference latency for a perceptually real-time experience. We implement a proof-of-concept pipeline and conduct a rigorous performance analysis across a multi-tiered hardware setup, including commodity (NVIDIA RTX 4060), cloud (NVIDIA T4), and enterprise (NVIDIA A100) GPUs. Our objective evaluation demonstrates that the system achieves real-time throughput ($ฯ< 1.0$) on modern hardware. A subjective user study further validates the approach, showing that a predictable, initial processing delay is highly acceptable to users in exchange for a smooth, uninterrupted playback experience. The work presents a validated, end-to-end system design that offers a practical roadmap for deploying scalable, real-time generative AI applications in multilingual communication platforms.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri et al.
Robotics
ๆฅ็ๆ่ฆ
Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference-time compute can be used to scale performance of LLMs, often by generating thinking tokens, on challenging tasks involving multi-step reasoning. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes limitations of static self-attention. We find that this shift in how inference-time compute is spent leads to consistently large performance improvements across models and long-context benchmarks. Our method leads to large 12.6 and 14.1 percentage point improvements for Qwen3-4B on average across subsets of LongBench-v2 and ZeroScrolls benchmarks. The takeaway is practical: for long context, a small amount of context-specific training is a better use of inference compute than current inference-time scaling strategies like producing more thinking tokens.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jonas Golde, Patrick Haller, Alan Akbik
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages because we observe that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated using target language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective student-teacher training for multilingual named entity recognition.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kamer Ali Yuksel
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used to evolve programs and multi-agent systems, yet most existing approaches rely on overwrite-based mutations that maintain only a single candidate at a time. Such methods discard useful variants, suffer from destructive edits, and explore a brittle search space prone to structural failure. We introduce EvoLattice, a framework that represents an entire population of candidate programs or agent behaviors within a single directed acyclic graph. Each node stores multiple persistent alternatives, and every valid path through the graph defines a distinct executable candidate, yielding a large combinatorial search space without duplicating structure. EvoLattice enables fine-grained alternative-level evaluation by scoring each alternative across all paths in which it appears, producing statistics that reveal how local design choices affect global performance. These statistics provide a dense, data-driven feedback signal for LLM-guided mutation, recombination, and pruning, while preserving successful components. Structural correctness is guaranteed by a deterministic self-repair mechanism that enforces acyclicity and dependency consistency independently of the LLM. EvoLattice naturally extends to agent evolution by interpreting alternatives as prompt fragments or sub-agent behaviors. Across program synthesis (proxy and optimizer meta-learning), EvoLattice yields more stable evolution, greater expressivity, and stronger improvement trajectories than prior LLM-guided methods. The resulting dynamics resemble quality-diversity optimization, emerging implicitly from EvoLattice's internal multi-alternative representation rather than an explicit external archive.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Cristina Aggazzotti, Elizabeth Allyn Smith
Robotics
ๆฅ็ๆ่ฆ
Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only their linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify writers of written text (authorship attribution). In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship, to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text with capitalization and punctuation and another normalized style that removes these conventions. The transcripts' conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zefang Liu, Nam H. Nguyen, Yinzhu Quan, Shi-Xiong Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kei Saito
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Current artificial intelligence systems, despite remarkable capabilities in text generation and pattern recognition, exhibit a fundamental architectural limitation: they resolve ambiguity prematurely. This premature semantic collapse -- the tendency to collapse multiple valid interpretations into a single output -- stems from classical identity assumptions embedded in standard neural architectures. We propose Non-Resolution Reasoning (NRR), a computational framework that treats ambiguity retention as a valid reasoning mode rather than a defect to be eliminated. NRR introduces three core principles: (1) Non-Identity (A $\ne$ A) -- the same symbol refers to different entities across contexts; (2) Approximate Identity (A $\approx$ A) -- entities share partial structural overlap without being identical; and (3) Non-Resolution -- conflicting interpretations can coexist without forced convergence. We formalize these principles through three architectural components: Multi-Vector Embeddings for context-dependent representation, Non-Collapsing Attention for parallel interpretation retention, and Contextual Identity Tracking (CIT) for maintaining A $\ne$ A across inference. We demonstrate NRR's advantages through case studies in paradox handling, creative generation, and context-dependent reasoning. Crucially, we provide a minimal empirical validation on a synthetic context-shift task where an NRR-lite model achieves 90.9% out-of-distribution accuracy compared to 9.1% for standard architectures, demonstrating that ambiguity preservation enables structural generalization. NRR challenges the assumption that meaning must collapse to be useful, offering a foundation for AI systems capable of sophisticated ambiguity handling and creative reasoning. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chendong Sun, mingmin Chen, Lei Xu
Natural Language Processing
ๆฅ็ๆ่ฆ
Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as 1 - max(P_target). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Team Olmo, :, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 Think 32B, is the strongest fully-open thinking model released to-date.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vivian Lin, Kuk Jin Jang, Wenwen Si, Insup Lee
ๆฅ็ๆ่ฆ
Diffusion models have shown promise in forecasting future data from multivariate time series. However, few existing methods account for recurring structures, or patterns, that appear within the data. We present Pattern-Guided Diffusion Models (PGDM), which leverage inherent patterns within temporal data for forecasting future time steps. PGDM first extracts patterns using archetypal analysis and estimates the most likely next pattern in the sequence. By guiding predictions with this pattern estimate, PGDM makes more realistic predictions that fit within the set of known patterns. We additionally introduce a novel uncertainty quantification technique based on archetypal analysis, and we dynamically scale the guidance level based on the pattern estimate uncertainty. We apply our method to two well-motivated forecasting applications, predicting visual field measurements and motion capture frames. On both, we show that pattern guidance improves PGDM's performance (MAE / CRPS) by up to 40.67% / 56.26% and 14.12% / 14.10%, respectively. PGDM also outperforms baselines by up to 65.58% / 84.83% and 93.64% / 92.55%.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qi Chen, Fabio Ramos, Alรกn Aspuru-Guzik, Florian Shkurti
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Bayesian Optimization (BO) is a key methodology for accelerating molecular discovery by estimating the mapping from molecules to their properties while seeking the optimal candidate. Typically, BO iteratively updates a probabilistic surrogate model of this mapping and optimizes acquisition functions derived from the model to guide molecule selection. However, its performance is limited in low-data regimes with insufficient prior knowledge and vast candidate spaces. Large language models (LLMs) and chemistry foundation models offer rich priors to enhance BO, but high-dimensional features, costly in-context learning, and the computational burden of deep Bayesian surrogates hinder their full utilization. To address these challenges, we propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions. Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search. By further incorporating coarse-grained LLM-based clustering, it substantially improves scalability to large candidate sets by restricting acquisition function evaluations to clusters with statistically higher property values. We show through extensive experiments and ablations that the proposed method substantially improves scalability, robustness, and sample efficiency in LLM-guided BO for molecular discovery.
๐
2025-12-15
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Samuel Rothfarb, Megan C. Davis, Ivana Matanovic, Baikun Li, Edward F. Holby et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Artificial intelligence is reshaping scientific exploration, but most methods automate procedural tasks without engaging in scientific reasoning, limiting autonomy in discovery. We introduce Materials Agents for Simulation and Theory in Electronic-structure Reasoning (MASTER), an active learning framework where large language models autonomously design, execute, and interpret atomistic simulations. In MASTER, a multimodal system translates natural language into density functional theory workflows, while higher-level reasoning agents guide discovery through a hierarchy of strategies, including a single agent baseline and three multi-agent approaches: peer review, triage-ranking, and triage-forms. Across two chemical applications, CO adsorption on Cu-surface transition metal (M) adatoms and on M-N-C catalysts, reasoning-driven exploration reduces required atomistic simulations by up to 90% relative to trial-and-error selection. Reasoning trajectories reveal chemically grounded decisions that cannot be explained by stochastic sampling or semantic bias. Altogether, multi-agent collaboration accelerates materials discovery and marks a new paradigm for autonomous scientific exploration.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sophia Tang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Spherical equivariant graph neural networks (EGNNs) provide a principled framework for learning on three-dimensional molecular and biomolecular systems, where predictions must respect the rotational symmetries inherent in physics. These models extend traditional message-passing GNNs and Transformers by representing node and edge features as spherical tensors that transform under irreducible representations of the rotation group SO(3), ensuring that predictions change in physically meaningful ways under rotations of the input. This guide develops a complete, intuitive foundation for spherical equivariant modeling - from group representations and spherical harmonics, to tensor products, Clebsch-Gordan decomposition, and the construction of SO(3)-equivariant kernels. Building on this foundation, we construct the Tensor Field Network and SE(3)-Transformer architectures and explain how they perform equivariant message-passing and attention on geometric graphs. Through clear mathematical derivations and annotated code excerpts, this guide serves as a self-contained introduction for researchers and learners seeking to understand or implement spherical EGNNs for applications in chemistry, molecular property prediction, protein structure modeling, and generative modeling.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dragos Secrieru, Garyk Brixi, Yoshua Bengio, Taiji Suzuki, Michael Poli et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-hybrid architectures are poised to take over language modeling due to better quality and performance. We introduce a hierarchical decomposition framework for linear recurrences that allows us to develop algorithms aligned with GPU memory hierarchies, yielding Sliding Window Recurrences. We focus specifically on truncating recurrences to hardware-aligned windows which are naturally jagged, limiting costly inter-warp communication. Using SWR, we develop Phalanx layers that serve as drop-in replacements for windowed attention or linear recurrences. In 1B parameter multi-hybrid models, Phalanx achieves over 10-40% speedup across 4K to 32K context length over optimized Transformers while matching perplexity.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Eugenio Varetti, Matteo Torzoni, Marco Tezzele, Andrea Manzoni
Machine Learning
ๆฅ็ๆ่ฆ
This work shows how adaptivity can enhance value realization of digital twins in civil engineering. We focus on adapting the state transition models within digital twins represented through probabilistic graphical models. The bi-directional interaction between the physical and virtual domains is modeled using dynamic Bayesian networks. By treating state transition probabilities as random variables endowed with conjugate priors, we enable hierarchical online learning of transition dynamics from a state to another through effortless Bayesian updates. We provide the mathematical framework to account for a larger class of distributions with respect to the current literature. To compute dynamic policies with precision updates we solve parametric Markov decision processes through reinforcement learning. The proposed adaptive digital twin framework enjoys enhanced personalization, increased robustness, and improved cost-effectiveness. We assess our approach on a case study involving structural health monitoring and maintenance planning of a railway bridge.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Patrick Egenlauf, Iva Bลezinovรก, Sabine Andergassen, Miriam Klopotek
ๆฅ็ๆ่ฆ
Out-of-equilibrium quantum many-body systems exhibit rapid correlation buildup that underlies many emerging phenomena. Exact wave-function methods to describe this scale exponentially with particle number; simpler mean-field approaches neglect essential two-particle correlations. The time-dependent two-particle reduced density matrix (TD2RDM) formalism offers a middle ground by propagating the two-particle reduced density matrix (2RDM) and closing the BBGKY hierarchy with a reconstruction of the three-particle cumulant. But the validity and existence of time-local reconstruction functionals ignoring memory effects remain unclear across different dynamical regimes. We show that a neural ODE model trained on exact 2RDM data (no dimensionality reduction) can reproduce its dynamics without any explicit three-particle information -- but only in parameter regions where the Pearson correlation between the two- and three-particle cumulants is large. In the anti-correlated or uncorrelated regime, the neural ODE fails, indicating that no simple time-local functional of the instantaneous two-particle cumulant can capture the evolution. The magnitude of the time-averaged three-particle-correlation buildup appears to be the primary predictor of success: For a moderate correlation buildup, both neural ODE predictions and existing TD2RDM reconstructions are accurate, whereas stronger values lead to systematic breakdowns. These findings pinpoint the need for memory-dependent kernels in the three-particle cumulant reconstruction for the latter regime. Our results place the neural ODE as a model-agnostic diagnostic tool that maps the regime of applicability of cumulant expansion methods and guides the development of non-local closure schemes. More broadly, the ability to learn high-dimensional RDM dynamics from limited data opens a pathway to fast, data-driven simulation of correlated quantum matter.
๐
2025-12-15
๐ ArXiv Preprint (cond-mat.soft)
๐ฅ Julian Jeggle, Raphael Wittkowski
ๆฅ็ๆ่ฆ
In this book chapter, we review how systems of simple motile agents can be used as a pathway to intelligent systems. It is a well known result from nature that large groups of entities following simple rules, such as swarms of animals, can give rise to much more complex collective behavior in a display of emergence. This begs the question whether we can emulate this behavior in synthetic matter and drive it to a point where the collective behavior reaches the complexity level of intelligent systems. Here, we will use a formalized notion of "intelligent matter" and compare it to recent results in the field of active matter. First, we will explore the approach of emergent computing in which specialized active matter systems are designed to directly solve a given task through emergent behavior. This we will then contrast with the approach of physical reservoir computing powered by the dynamics of active particle systems. In this context, we will also describe a novel reservoir computing scheme for active particles driven ultrasonically or via light refraction.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matheus Corrรชa Domingos, Valdivino Alexandre de Santiago Jรบnior, Juliana Aparecida Anochi, Elcio Hideiti Shiguemori, Luรญsa Mirelle Costa dos Santos et al.
Machine Learning
ๆฅ็ๆ่ฆ
Forecasting meteorological variables is challenging due to the complexity of their processes, requiring advanced models for accuracy. Accurate precipitation forecasts are vital for society. Reliable predictions help communities mitigate climatic impacts. Based on the current relevance of artificial intelligence (AI), classical machine learning (ML) and deep learning (DL) techniques have been used as an alternative or complement to dynamic modeling. However, there is still a lack of broad investigations into the feasibility of purely data-driven approaches for precipitation forecasting. This study aims at addressing this issue where different classical ML and DL approaches for forecasting precipitation in South America, taking into account all 2019 seasons, are considered in a detailed investigation. The selected classical ML techniques were Random Forests and extreme gradient boosting (XGBoost), while the DL counterparts were a 1D convolutional neural network (CNN 1D), a long short-term memory (LSTM) model, and a gated recurrent unit (GRU) model. Additionally, the Brazilian Global Atmospheric Model (BAM) was used as a representative of the traditional dynamic modeling approach. We also relied on explainable artificial intelligence (XAI) to provide some explanations for the models behaviors. LSTM showed strong predictive performance while BAM, the traditional dynamic model representative, had the worst results. Despite presented the higher latency, LSTM was most accurate for heavy precipitation. If cost is a concern, XGBoost offers lower latency with slightly accuracy loss. The results of this research confirm the viability of DL models for climate forecasting, solidifying a global trend in major meteorological and climate forecasting centers.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anning Tian, Byunghyun Ko, Kaichen Qu, Mengyuan Liu, Jeongkyu Lee
Machine Learning
ๆฅ็ๆ่ฆ
Real-time deployment of prostate MRI segmentation on clinical workstations is often bottlenecked by computational load and memory footprint. Deep learning-based prostate gland segmentation approaches remain challenging due to anatomical variability. To bridge this efficiency gap while still maintaining reliable segmentation accuracy, we propose KLO-Net, a dynamic K-Nearest Neighbor attention U-Net with Cross Stage Partial, i.e., CSP, encoder for efficient prostate gland segmentation from MRI scan. Unlike the regular K-NN attention mechanism, the proposed dynamic K-NN attention mechanism allows the model to adaptively determine the number of attention connections for each spatial location within a slice. In addition, CSP blocks address the computational load to reduce memory consumption. To evaluate the model's performance, comprehensive experiments and ablation studies are conducted on two public datasets, i.e., PROMISE12 and PROSTATEx, to validate the proposed architecture. The detailed comparative analysis demonstrates the model's advantage in computational efficiency and segmentation quality.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri et al.
Robotics
ๆฅ็ๆ่ฆ
Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference-time compute can be used to scale performance of LLMs, often by generating thinking tokens, on challenging tasks involving multi-step reasoning. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes limitations of static self-attention. We find that this shift in how inference-time compute is spent leads to consistently large performance improvements across models and long-context benchmarks. Our method leads to large 12.6 and 14.1 percentage point improvements for Qwen3-4B on average across subsets of LongBench-v2 and ZeroScrolls benchmarks. The takeaway is practical: for long context, a small amount of context-specific training is a better use of inference compute than current inference-time scaling strategies like producing more thinking tokens.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Albert Dorador
Machine Learning
ๆฅ็ๆ่ฆ
Reliable estimation of feature contributions in machine learning models is essential for trust, transparency and regulatory compliance, especially when models are proprietary or otherwise operate as black boxes. While permutation-based methods are a standard tool for this task, classical implementations rely on repeated random permutations, introducing computational overhead and stochastic instability. In this paper, we show that by replacing multiple random permutations with a single, deterministic, and optimal permutation, we achieve a method that retains the core principles of permutation-based importance while being non-random, faster, and more stable. We validate this approach across nearly 200 scenarios, including real-world household finance and credit risk applications, demonstrating improved bias-variance tradeoffs and accuracy in challenging regimes such as small sample sizes, high dimensionality, and low signal-to-noise ratios. Finally, we introduce Systemic Variable Importance, a natural extension designed for model stress-testing that explicitly accounts for feature correlations. This framework provides a transparent way to quantify how shocks or perturbations propagate through correlated inputs, revealing dependencies that standard variable importance measures miss. Two real-world case studies demonstrate how this metric can be used to audit models for hidden reliance on protected attributes (e.g., gender or race), enabling regulators and practitioners to assess fairness and systemic risk in a principled and computationally efficient manner.
๐
2025-12-15
๐ ArXiv Preprint (quant-ph)
๐ฅ Charles Marrder, Shuo Sun, Murray J. Holland
Machine Learning
ๆฅ็ๆ่ฆ
Dynamical decoupling seeks to mitigate phase decoherence in qubits by applying a carefully designed sequence of effectively instantaneous electromagnetic pulses. Although analytic solutions exist for pulse timings that are optimal under specific noise regimes, identifying the optimal timings for a realistic noise spectrum remains challenging. We propose a reinforcement learning (RL)-based method for designing pulse sequences on qubits. Our novel action set enables the RL agent to efficiently navigate this inherently non-convex optimization landscape. The action set, derived from Thompson's group $F$, is applicable to a broad class of sequential decision problems whose states can be represented as bounded sequences. We demonstrate that our RL agent can learn pulse sequences that minimize dephasing without requiring explicit knowledge of the underlying noise spectrum. This work opens the possibility for real-time learning of optimal dynamical decoupling sequences on qubits which are dephasing-limited. The model-free nature of our algorithm suggests that the agent may ultimately learn optimal pulse sequences even in the presence of unmodeled physical effects, such as pulse errors or non-Gaussian noise.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammad Mozaffari, Samuel Kushnir, Maryam Mehri Dehnavi, Amir Yazdanbakhsh
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Post-training model pruning is a promising solution, yet it faces a trade-off: simple heuristics that zero weights are fast but degrade accuracy, while principled joint optimization methods recover accuracy but are computationally infeasible at modern scale. One-shot methods such as SparseGPT offer a practical trade-off in optimality by applying efficient, approximate heuristic weight updates. To close this gap, we introduce OPTIMA, a practical one-shot post-training pruning method that balances accuracy and scalability. OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. Solving these QPs yields the per-row globally optimal update with respect to the reconstruction objective given the estimated Hessian. The shared-Hessian structure makes the problem highly amenable to batching on accelerators. We implement an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot post-training pruning at scale on a single accelerator without fine-tuning. OPTIMA integrates with existing mask selectors and consistently improves zero-shot performance across multiple LLM families and sparsity regimes, yielding up to 3.97% absolute accuracy improvement. On an NVIDIA H100, OPTIMA prunes a 8B-parameter transformer end-to-end in 40 hours with 60GB peak memory. Together, these results set a new state-of-the-art accuracy-efficiency trade-offs for one-shot post-training pruning.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Geofrey Owino, Bernard Shibwabo
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Infant cry classification can aid early assessment of infant needs. However, deployment of such solutions is limited by privacy concerns around audio data, sensitivity to background noise, and domain shift across recording environments. We present an end-to-end infant cry analysis pipeline that integrates a denoising autoencoder (DAE), a convolutional tokenizer, and a Transformer encoder trained using communication-efficient federated learning (FL). The system performs on-device denoising, adaptive segmentation, post hoc calibration, and energy-based out-of-distribution (OOD) abstention. Federated training employs a regularized control variate update with 8-bit adapter deltas under secure aggregation. Using the Baby Chillanto and Donate-a-Cry datasets with ESC-50 noise overlays, the model achieves a macro F1 score of 0.938, an AUC of 0.962, and an Expected Calibration Error (ECE) of 0.032, while reducing per-round client upload from approximately 36 to 42 MB to 3.3 MB. Real-time edge inference on an NVIDIA Jetson Nano (4 GB, TensorRT FP16) achieves 96 ms per one-second spectrogram frame. These results demonstrate a practical path toward privacy-preserving, noise-robust, and communication-efficient infant cry classification suitable for federated deployment.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kamil Ciosek, Nicolรฒ Felicioni, Sina Ghiassian, Juan Elenter Litwin, Francesco Tonolini et al.
ๆฅ็ๆ่ฆ
We make two contributions to the problem of estimating the $L_1$ calibration error of a binary classifier from a finite dataset. First, we provide an upper bound for any classifier where the calibration function has bounded variation. Second, we provide a method of modifying any classifier so that its calibration error can be upper bounded efficiently without significantly impacting classifier performance and without any restrictive assumptions. All our results are non-asymptotic and distribution-free. We conclude by providing advice on how to measure calibration error in practice. Our methods yield practical procedures that can be run on real-world datasets with modest overhead.
๐
2025-12-15
๐ ArXiv Preprint (eess.SP)
๐ฅ Ricardo Gonรงalves Molinari, Leonardo Abdala Elias
Robotics
ๆฅ็ๆ่ฆ
Restoring natural and intuitive hand function requires simultaneous and proportional control (SPC) of multiple degrees of freedom (DoFs). This study systematically evaluated the multichannel linear descriptors-based block field method (MLD-BFM) for continuous decoding of five finger-joint DoFs by leveraging the rich spatial information of high-density surface electromyography (HD sEMG). Twenty-one healthy participants performed dynamic sinusoidal finger movements while HD sEMG signals were recorded from the \textit{extensor digitorum communis} (EDC) and \textit{flexor digitorum superficialis} (FDS) muscles. MLD-BFM extracted region-specific spatial features, including effective field strength ($ฮฃ$), field-strength variation rate ($ฮฆ$), and spatial complexity ($ฮฉ$). Model performance was optimized (block size: $2 \times 2$; window: 0.15 s) and compared with conventional time-domain features and dimensionality reduction approaches when applied to multi-output regression models. MLD-BFM consistently achieved the highest $\mathrm{R}^2_{\mathrm{vw}}$ values across all models. The multilayer perceptron (MLP) combined with MLD-BFM yielded the best performance ($\mathrm{R}^2_{\mathrm{vw}} = 86.68\% \pm 0.33$). Time-domain features also showed strong predictive capability and were statistically comparable to MLD-BFM in some models, whereas dimensionality reduction techniques exhibited lower accuracy. Decoding accuracy was higher for the middle and ring fingers than for the thumb. Overall, MLD-BFM improved continuous finger movement decoding accuracy, underscoring the importance of taking advantage of the spatial richness of HD sEMG. These findings suggest that spatially structured features enhance SPC and provide practical guidance for designing robust, real-time, and responsive myoelectric interfaces.
๐
2025-12-15
๐ ArXiv Preprint (eess.SY)
๐ฅ Tianyu Zhou, Zihao Liang, Zehui Lu, Shaoshuai Mou
Robotics
ๆฅ็ๆ่ฆ
This paper proposes a Safe Online Control-Informed Learning framework for safety-critical autonomous systems. The framework unifies optimal control, parameter estimation, and safety constraints into an online learning process. It employs an extended Kalman filter to incrementally update system parameters in real time, enabling robust and data-efficient adaptation under uncertainty. A softplus barrier function enforces constraint satisfaction during learning and control while eliminating the dependence on high-quality initial guesses. Theoretical analysis establishes convergence and safety guarantees, and the framework's effectiveness is demonstrated on cart-pole and robot-arm systems.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kamer Ali Yuksel
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used to evolve programs and multi-agent systems, yet most existing approaches rely on overwrite-based mutations that maintain only a single candidate at a time. Such methods discard useful variants, suffer from destructive edits, and explore a brittle search space prone to structural failure. We introduce EvoLattice, a framework that represents an entire population of candidate programs or agent behaviors within a single directed acyclic graph. Each node stores multiple persistent alternatives, and every valid path through the graph defines a distinct executable candidate, yielding a large combinatorial search space without duplicating structure. EvoLattice enables fine-grained alternative-level evaluation by scoring each alternative across all paths in which it appears, producing statistics that reveal how local design choices affect global performance. These statistics provide a dense, data-driven feedback signal for LLM-guided mutation, recombination, and pruning, while preserving successful components. Structural correctness is guaranteed by a deterministic self-repair mechanism that enforces acyclicity and dependency consistency independently of the LLM. EvoLattice naturally extends to agent evolution by interpreting alternatives as prompt fragments or sub-agent behaviors. Across program synthesis (proxy and optimizer meta-learning), EvoLattice yields more stable evolution, greater expressivity, and stronger improvement trajectories than prior LLM-guided methods. The resulting dynamics resemble quality-diversity optimization, emerging implicitly from EvoLattice's internal multi-alternative representation rather than an explicit external archive.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Finley Devlin, Jaron Sanders
Machine Learning
ๆฅ็ๆ่ฆ
In this work, we investigate the existence and effect of percolation in training deep Neural Networks (NNs) with dropout. Dropout methods are regularisation techniques for training NNs, first introduced by G. Hinton et al. (2012). These methods temporarily remove connections in the NN, randomly at each stage of training, and update the remaining subnetwork with Stochastic Gradient Descent (SGD). The process of removing connections from a network at random is similar to percolation, a paradigm model of statistical physics.
If dropout were to remove enough connections such that there is no path between the input and output of the NN, then the NN could not make predictions informed by the data. We study new percolation models that mimic dropout in NNs and characterise the relationship between network topology and this path problem. The theory shows the existence of a percolative effect in dropout. We also show that this percolative effect can cause a breakdown when training NNs without biases with dropout; and we argue heuristically that this breakdown extends to NNs with biases.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jelena Losic
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have become the standard for graph representation learning but remain vulnerable to structural perturbations. We propose a novel framework that integrates persistent homology features with stability regularization to enhance robustness. Building on the stability theorems of persistent homology \cite{cohen2007stability}, our method combines GIN architectures with multi-scale topological features extracted from persistence images, enforced by Hiraoka-Kusano-inspired stability constraints. Across six diverse datasets spanning biochemical, social, and collaboration networks , our approach demonstrates exceptional robustness to edge perturbations while maintaining competitive accuracy. Notably, we observe minimal performance degradation (0-4\% on most datasets) under perturbation, significantly outperforming baseline stability. Our work provides both a theoretically-grounded and empirically-validated approach to robust graph learning that aligns with recent advances in topological regularization
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mufhumudzi Muthivhi, Terence L van Zyl, Hairong Wang
ๆฅ็ๆ่ฆ
Sequential recommendation models aim to learn from users evolving preferences. However, current state-of-the-art models suffer from an inherent popularity bias. This study developed a novel framework, BiCoRec, that adaptively accommodates users changing preferences for popular and niche items. Our approach leverages a co-attention mechanism to obtain a popularity-weighted user sequence representation, facilitating more accurate predictions. We then present a new training scheme that learns from future preferences using a consistency loss function. BiCoRec aimed to improve the recommendation performance of users who preferred niche items. For these users, BiCoRec achieves a 26.00% average improvement in NDCG@10 over state-of-the-art baselines. When ranking the relevant item against the entire collection, BiCoRec achieves NDCG@10 scores of 0.0102, 0.0047, 0.0021, and 0.0005 for the Movies, Fashion, Games and Music datasets.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shicheng Liu, Siyuan Xu, Wenjie Qiu, Hangfan Zhang, Minghui Zhu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
A common and effective strategy for humans to improve an unsatisfactory outcome in daily life is to find a cause of this outcome and correct the cause. In this paper, we investigate whether this human improvement strategy can be applied to improving reinforcement learning from human feedback (RLHF) for alignment of language models (LMs). In particular, it is observed in the literature that LMs tuned by RLHF can still output unsatisfactory responses. This paper proposes a method to improve the unsatisfactory responses by correcting their causes. Our method has two parts. The first part proposes a post-hoc explanation method to explain why an unsatisfactory response is generated to a prompt by identifying the training data that lead to this response. We formulate this problem as a constrained combinatorial optimization problem where the objective is to find a set of training data closest to this prompt-response pair in a feature representation space, and the constraint is that the prompt-response pair can be decomposed as a convex combination of this set of training data in the feature space. We propose an efficient iterative data selection algorithm to solve this problem. The second part proposes an unlearning method that improves unsatisfactory responses to some prompts by unlearning the training data that lead to these unsatisfactory responses and, meanwhile, does not significantly degrade satisfactory responses to other prompts. Experimental results demonstrate that our algorithm can improve RLHF.
๐
2025-12-15
๐ ArXiv Preprint (cond-mat.dis-nn)
๐ฅ Jacob Taylor, Haining Pan, Sankar Das Sarma
Machine Learning
ๆฅ็ๆ่ฆ
In unsupervised learning, the training data for deep learning does not come with any labels, thus forcing the algorithm to discover hidden patterns in the data for discerning useful information. This, in principle, could be a powerful tool in identifying topological order since topology does not always manifest in obvious physical ways (e.g., topological superconductivity) for its decisive confirmation. The problem, however, is that unsupervised learning is a difficult challenge, necessitating huge computing resources, which may not always work. In the current work, we combine unsupervised and supervised learning using an autoencoder to establish that unlabeled data in the Majorana splitting in realistic short disordered nanowires may enable not only a distinction between `topological' and `trivial', but also where their crossover happens in the relevant parameter space. This may be a useful tool in identifying topology in Majorana nanowires.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Subramanyam Sahoo, Jared Junkin
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model's own predictions of execution traces across semantically equivalent program transformations. By analyzing consistency patterns in these predicted traces, we detect behavioral anomalies indicative of backdoors. Our approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation, demonstrating exponential growth with orbit size. Theoretical analysis establishes information-theoretic bounds showing non-gamifiability -- adversaries cannot improve through training due to fundamental space complexity constraints. This work demonstrates that semantic orbit analysis provides a scalable, theoretically grounded approach to AI control for code generation tasks.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Siegfried Ludwig, Stylianos Bakas, Konstantinos Barmpas, Georgios Zoumpourlis, Dimitrios A. Adamos et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Deep learning for decoding EEG signals has gained traction, with many claims to state-of-the-art accuracy. However, despite the convincing benchmark performance, successful translation to real applications is limited. The frequent disconnect between performance on controlled BCI benchmarks and its lack of generalisation to practical settings indicates hidden overfitting problems. We introduce Disentangled Decoding Decomposition (D3), a weakly supervised method for training deep learning models across EEG datasets. By predicting the place in the respective trial sequence from which the input window was sampled, EEG-D3 separates latent components of brain activity, akin to non-linear ICA. We utilise a novel model architecture with fully independent sub-networks for strict interpretability. We outline a feature interpretation paradigm to contrast the component activation profiles on different datasets and inspect the associated temporal and spatial filters. The proposed method reliably separates latent components of brain activity on motor imagery data. Training downstream classifiers on an appropriate subset of these components prevents hidden overfitting caused by task-correlated artefacts, which severely affects end-to-end classifiers. We further exploit the linearly separable latent space for effective few-shot learning on sleep stage classification. The ability to distinguish genuine components of brain activity from spurious features results in models that avoid the hidden overfitting problem and generalise well to real-world applications, while requiring only minimal labelled data. With interest to the neuroscience community, the proposed method gives researchers a tool to separate individual brain processes and potentially even uncover heretofore unknown dynamics.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shengfan Cao, Francesco Borrelli
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy unknown, rollout-based safety constraints. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. Our approach constructs a local safe region by combining trajectory rollouts with smoothness bounds that relate parameter changes to shifts in safety metrics. Each gradient update is then projected via a convex SOCP, producing a safe first-order step. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, our approach further ensures closed-loop stability and enables safe adaptation beyond the conservative backup. On regression with harmful supervision and a constrained double-integrator task with malicious expert, our approach consistently rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful primal objective improvement.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mufhumudzi Muthivhi, Terence L van Zyl, Hairong Wang
ๆฅ็ๆ่ฆ
Sequential recommendation models aim to learn from users evolving preferences. However, current state-of-the-art models suffer from an inherent popularity bias. This study developed a novel framework, BiCoRec, that adaptively accommodates users changing preferences for popular and niche items. Our approach leverages a co-attention mechanism to obtain a popularity-weighted user sequence representation, facilitating more accurate predictions. We then present a new training scheme that learns from future preferences using a consistency loss function. BiCoRec aimed to improve the recommendation performance of users who preferred niche items. For these users, BiCoRec achieves a 26.00% average improvement in NDCG@10 over state-of-the-art baselines. When ranking the relevant item against the entire collection, BiCoRec achieves NDCG@10 scores of 0.0102, 0.0047, 0.0021, and 0.0005 for the Movies, Fashion, Games and Music datasets.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Sibo Tian, Minghui Zheng, Xiao Liang
ๆฅ็ๆ่ฆ
Stochastic human motion prediction is critical for safe and effective human-robot collaboration (HRC) in industrial remanufacturing, as it captures human motion uncertainties and multi-modal behaviors that deterministic methods cannot handle. While earlier works emphasize highly diverse predictions, they often generate unrealistic human motions. More recent methods focus on accuracy and real-time performance, yet there remains potential to improve prediction quality further without exceeding time budgets. Additionally, current research on stochastic human motion prediction in HRC typically considers human motion in isolation, neglecting the influence of robot motion on human behavior. To address these research gaps and enable real-time, realistic, and interaction-aware human motion prediction, we propose a novel prediction-refinement framework that integrates both human and robot observed motion to refine the initial predictions produced by a pretrained state-of-the-art predictor. The refinement module employs a Flow Matching structure to account for uncertainty. Experimental studies on the HRC desktop disassembly dataset demonstrate that our method significantly improves prediction accuracy while preserving the uncertainties and multi-modalities of human motion. Moreover, the total inference time of the proposed framework remains within the time budget, highlighting the effectiveness and practicality of our approach.
๐
2025-12-15
๐ ArXiv Preprint (eess.SY)
๐ฅ Ricardo Tapia, Iman Soltani
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Autonomous driving requires reliable collision avoidance in dynamic environments. Nonlinear Model Predictive Controllers (NMPCs) are suitable for this task, but struggle in time-critical scenarios requiring high frequency. To meet this demand, optimization problems are often simplified via linearization, narrowing the horizon window, or reduced temporal nodes, each compromising accuracy or reliability. This work presents the first general convex obstacle avoidance formulation, enabled by a novel approach to integrating logic. This facilitates the incorporation of an obstacle avoidance formulation into convex MPC schemes, enabling a convex optimization framework with substantially improved computational efficiency relative to conventional nonconvex methods. A key property of the formulation is that obstacle avoidance remains effective even when obstacles lie outside the prediction horizon, allowing shorter horizons for real-time deployment. In scenarios where nonconvex formulations are unavoidable, the proposed method meets or exceeds the performance of representative nonconvex alternatives. The method is evaluated in autonomous vehicle applications, where system dynamics are highly nonlinear.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shengfan Cao, Francesco Borrelli
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy unknown, rollout-based safety constraints. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. Our approach constructs a local safe region by combining trajectory rollouts with smoothness bounds that relate parameter changes to shifts in safety metrics. Each gradient update is then projected via a convex SOCP, producing a safe first-order step. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, our approach further ensures closed-loop stability and enables safe adaptation beyond the conservative backup. On regression with harmful supervision and a constrained double-integrator task with malicious expert, our approach consistently rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful primal objective improvement.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie et al.
Machine LearningMultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. The one LLM serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. Using the lightweight Qwen-0.5B LLM, MindDrive achieves Driving Score (DS) of 78.04 and Success Rate (SR) of 55.09% on the challenging Bench2Drive benchmark. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Hao Fu, Wei Liu, Shuai Zhou
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper investigates the application of reinforcement learning (RL) to multi-robot social formation navigation, a critical capability for enabling seamless human-robot coexistence. While RL offers a promising paradigm, the inherent unpredictability and often uncooperative dynamics of pedestrian behavior pose substantial challenges, particularly concerning the efficiency of coordinated exploration among robots. To address this, we propose a novel coordinated-exploration multi-robot RL algorithm introducing an intrinsic motivation exploration. Its core component is a self-learning intrinsic reward mechanism designed to collectively alleviate policy conservatism. Moreover, this algorithm incorporates a dual-sampling mode within the centralized training and decentralized execution framework to enhance the representation of both the navigation policy and the intrinsic reward, leveraging a two-time-scale update rule to decouple parameter updates. Empirical results on social formation navigation benchmarks demonstrate the proposed algorithm's superior performance over existing state-of-the-art methods across crucial metrics. Our code and video demos are available at: https://github.com/czxhunzi/CEMRRL.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minghui Hou, Wei-Hsing Huang, Shaofeng Liang, Daizong Liu, Tai-Hao Wen et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Vision-language models enable the understanding and reasoning of complex traffic scenarios through multi-source information fusion, establishing it as a core technology for autonomous driving. However, existing vision-language models are constrained by the image understanding paradigm in 2D plane, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, an multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities, including occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Susung Hong, Chongjian Ge, Zhifei Zhang, Jui-Hsien Wang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziqi Ma, Hongqiao Chen, Yisong Yue, Georgia Gkioxari
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: https://glab-caltech.github.io/steer3d/
๐
2025-12-15
๐ ArXiv Preprint (cs.CY)
๐ฅ Mohammadreza Molavi, Mohammad Moein, Mohammadreza Tavakoli, Abdolali Faraji, Stefan T. Mol et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As the online learning landscape evolves, the need for personalization is increasingly evident. Although educational resources are burgeoning, educators face challenges selecting materials that both align with intended learning outcomes and address diverse learner needs. Large Language Models (LLMs) are attracting growing interest for their potential to create learning resources that better support personalization, but verifying coverage of intended outcomes still requires human alignment review, which is costly and limits scalability. We propose a framework that supports the cost-effective automation of evaluating alignment between educational resources and intended learning outcomes. Using human-generated materials, we benchmarked LLM-based text-embedding models and found that the most accurate model (Voyage) achieved 79% accuracy in detecting alignment. We then applied the optimal model to LLM-generated resources and, via expert evaluation, confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). Finally, in a three-group experiment with 360 learners, higher alignment scores were positively related to greater learning performance, chi-squared(2, N = 360) = 15.39, p < 0.001. These findings show that embedding-based alignment scores can facilitate scalable personalization by confirming alignment with learning outcomes, which allows teachers to focus on tailoring content to diverse learner needs.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ John E. Ortega, Dhruv D. Joshi, Matt P. Borkowski
Natural Language Processing
ๆฅ็ๆ่ฆ
Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is ex pected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks scoring about 2 points better than previous models not based on prompting.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou et al.
Robotics
ๆฅ็ๆ่ฆ
Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gabriel Vitorino de Andrade, Saulo Roberto dos Santos, Itallo Patrick Castro Alves da Silva, Emanuel Adler Medeiros Pereira, Erick de Andrade Barboza
Machine Learning
ๆฅ็ๆ่ฆ
The validation and verification of artificial intelligence (AI) models through robustness assessment are essential to guarantee the reliable performance of intelligent systems facing real-world challenges, such as image corruptions including noise, blurring, and weather variations. Despite the global importance of mango (Mangifera indica L.), there is a lack of studies on the robustness of models for the diagnosis of disease in its leaves. This paper proposes a methodology to evaluate convolutional neural networks (CNNs) under adverse conditions. We adapted the MangoLeafDB dataset, generating MangoLeafDB-C with 19 types of artificial corruptions at five severity levels. We conducted a benchmark comparing five architectures: ResNet-50, ResNet-101, VGG-16, Xception, and LCNN (the latter being a lightweight architecture designed specifically for mango leaf diagnosis). The metrics include the F1 score, the corruption error (CE) and the relative mean corruption error (relative mCE). The results show that LCNN outperformed complex models in corruptions that can be present in real-world scenarios such as Defocus Blur, Motion Blur, while also achieving the lowest mCE. Modern architectures (e.g., ResNet-101) exhibited significant performance degradation in corrupted scenarios, despite their high accuracy under ideal conditions. These findings suggest that lightweight and specialized models may be more suitable for real-world applications in edge devices, where robustness and efficiency are critical. The study highlights the need to incorporate robustness assessments in the development of intelligent systems for agriculture, particularly in regions with technological limitations.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai et al.
Machine Learning
ๆฅ็ๆ่ฆ
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyue Zhang, Meera Chappidi, Erolcan Sayar, Helen Richards, Zhijun Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundational models (PFMs), have shown strong performance. However, PFMs exhibit limitations on certain cancer or specimen types due to domain shifts - these cancer types were rarely used for pretraining or specimens contain tissue-based artifacts rarely seen within the pretraining population. Such is the case for transurethral resection of bladder tumor (TURBT), which are essential for diagnosing muscle-invasive bladder cancer (MIBC), but contain fragmented tissue chips and electrocautery artifacts and were not widely used in publicly available PFMs. To address this, we propose a simple yet effective domain-adaptive self-supervised adaptor (DA-SSL) that realigns pretrained PFM features to the TURBT domain without fine-tuning the foundational model itself. We pilot this framework for predicting treatment response in TURBT, where histomorphological features are currently underutilized and identifying patients who will benefit from neoadjuvant chemotherapy (NAC) is challenging. In our multi-center study, DA-SSL achieved an AUC of 0.77+/-0.04 in five-fold cross-validation and an external test accuracy of 0.84, sensitivity of 0.71, and specificity of 0.91 using majority voting. Our results demonstrate that lightweight domain adaptation with self-supervision can effectively enhance PFM-based MIL pipelines for clinically challenging histopathology tasks. Code is Available at https://github.com/zhanghaoyue/DA_SSL_TURBT.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zehan Zhu, Heng Zhao, Yan Huang, Joey Tianyi Zhou, Shouling Ji et al.
ๆฅ็ๆ่ฆ
In this paper, we propose a Differentially Private Stochastic Gradient Push with Compressed communication (termed DP-CSGP) for decentralized learning over directed graphs. Different from existing works, the proposed algorithm is designed to maintain high model utility while ensuring both rigorous differential privacy (DP) guarantees and efficient communication. For general non-convex and smooth objective functions, we show that the proposed algorithm achieves a tight utility bound of $\mathcal{O}\left( \sqrt{d\log \left( \frac{1}ฮด \right)}/(\sqrt{n}Jฮต) \right)$ ($J$ and $d$ are the number of local samples and the dimension of decision variables, respectively) with $\left(ฮต, ฮด\right)$-DP guarantee for each node, matching that of decentralized counterparts with exact communication. Extensive experiments on benchmark tasks show that, under the same privacy budget, DP-CSGP achieves comparable model accuracy with significantly lower communication cost than existing decentralized counterparts with exact communication.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leonard Bereska, Zoe Tzifa-Kratira, Reza Samavi, Efstratios Gavves
Machine Learning
ๆฅ็ๆ่ฆ
Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
๐
2025-12-15
๐ 8 pages, 2 figures, published in The 24th IEEE/WIC
๐ฅ Gibson Nkhata, Uttamasha Anjally Oyshi, Quan Mai, Susan Gauch
Natural Language Processing
ๆฅ็ๆ่ฆ
Verifying rumors on social media is critical for mitigating the spread of false information. The stances of conversation replies often provide important cues to determine a rumor's veracity. However, existing models struggle to jointly capture semantic content, stance information, and conversation strructure, especially under the sequence length constraints of transformer-based encoders. In this work, we propose a stance-aware structural modeling that encodes each post in a discourse with its stance signal and aggregates reply embedddings by stance category enabling a scalable and semantically enriched representation of the entire thread. To enhance structural awareness, we introduce stance distribution and hierarchical depth as covariates, capturing stance imbalance and the influence of reply depth. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms prior methods in the ability to predict truthfulness of a rumor. We also demonstrate that our model is versatile for early detection and cross-platfrom generalization.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Linjie Mu, Yannian Gu, Zhongzhen Huang, Yakun Zhu, Shaoting Zhang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models with reasoning capabilities have demonstrated impressive performance across a wide range of domains. In clinical applications, a transparent, step-by-step reasoning process provides physicians with strong evidence to support decision-making. While reinforcement learning has effectively enhanced reasoning performance in medical contexts, the clinical reliability of these reasoning processes remains limited because their accuracy and validity are often overlooked during training. To address this gap, we propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways by explicitly supervising the reasoning process through a Critical Evidence Graph (CEG). We curate a dataset of challenging clinical cases and algorithmically construct a CEG for each sample to represent a high-quality verifiable reasoning pathway. To guide the reasoning process, we introduce a Clinical Reasoning Procedure Reward, which evaluates Node Coverage, Structural Correctness, and Chain Completeness, thereby providing a holistic assessment of reasoning quality. Experimental results show that MedCEG surpasses existing methods in performance while producing clinically valid reasoning chains, representing a solid advancement in reliable medical AI reasoning. The code and models are available at https://github.com/LinjieMu/MedCEG.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Henry Prakken, Wijnand van Woerkom
ๆฅ็ๆ่ฆ
In recent years, hierarchical case-based-reasoning models of precedential constraint have been proposed. In various papers, Trevor Bench-Capon criticised these models on the grounds that they would give incorrect outcomes in some cases. In particular, the models would not account for the possibility that intermediate factors are established with different strengths by different base-level factors. In this paper we respond to these criticisms for van Woerkom's result-based hierarchical models. We argue that in some examples Bench-Capon seems to interpret intermediate factors as dimensions, and that applying van Woerkom's dimension-based version of the hierarchical result model to these examples avoids Bench-Capon's criticisms.
๐
2025-12-15
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Sabrine Ennaji, Elhadj Benkhelifa, Luigi Vincenzo Mancini
Machine Learning
ๆฅ็ๆ่ฆ
Machine learning based intrusion detection systems are increasingly targeted by black box adversarial attacks, where attackers craft evasive inputs using indirect feedback such as binary outputs or behavioral signals like response time and resource usage. While several defenses have been proposed, including input transformation, adversarial training, and surrogate detection, they often fall short in practice. Most are tailored to specific attack types, require internal model access, or rely on static mechanisms that fail to generalize across evolving attack strategies. Furthermore, defenses such as input transformation can degrade intrusion detection system performance, making them unsuitable for real time deployment.
To address these limitations, we propose Adaptive Feature Poisoning, a lightweight and proactive defense mechanism designed specifically for realistic black box scenarios. Adaptive Feature Poisoning assumes that probing can occur silently and continuously, and introduces dynamic and context aware perturbations to selected traffic features, corrupting the attacker feedback loop without impacting detection capabilities. The method leverages traffic profiling, change point detection, and adaptive scaling to selectively perturb features that an attacker is likely exploiting, based on observed deviations.
We evaluate Adaptive Feature Poisoning against multiple realistic adversarial attack strategies, including silent probing, transferability based attacks, and decision boundary based attacks. The results demonstrate its ability to confuse attackers, degrade attack effectiveness, and preserve detection performance. By offering a generalizable, attack agnostic, and undetectable defense, Adaptive Feature Poisoning represents a significant step toward practical and robust adversarial resilience in machine learning based intrusion detection systems.
๐
2025-12-15
๐ AAAI 2026
๐ฅ Yu-Chen Lu, Sheng-Feng Yu, Hui-Hsien Weng, Pei-Shuo Wang, Yu-Fang Hu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLM) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLM more suitable for resource-constrained environments. Nonetheless, naรฏve low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches by 7% accuracy improvement on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ojas Pungalia, Rashi Upadhyay, Abhishek Mishra, Abhiram H, Tejasvi Alladi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Envy is a common human behavior that shapes competitiveness and can alter outcomes in team settings. As large language models (LLMs) increasingly act on behalf of humans in collaborative and competitive workflows, there is a pressing need to evaluate whether and under what conditions they exhibit envy-like preferences. In this paper, we test whether LLMs show envy-like behavior toward each other. We considered two scenarios: (1) A point allocation game that tests whether a model tries to win over its peer. (2) A workplace setting observing behaviour when recognition is unfair. Our findings reveal consistent evidence of envy-like patterns in certain LLMs, with large variation across models and contexts. For instance, GPT-5-mini and Claude-3.7-Sonnet show a clear tendency to pull down the peer model to equalize outcomes, whereas Mistral-Small-3.2-24B instead focuses on maximizing its own individual gains. These results highlight the need to consider competitive dispositions as a safety and design factor in LLM-based multi-agent systems.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kei Saito
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $ฯ$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yici Liu, Qi Wei Oung, Hoi Leong Lee
ๆฅ็ๆ่ฆ
Electroencephalographic (EEG) signals have long been applied in the field of affective brain-computer interfaces (aBCIs). Cross-subject EEG-based emotion recognition has demonstrated significant potential in practical applications due to its suitability across diverse people. However, most studies on cross-subject EEG-based emotion recognition neglect the presence of inter-individual variability and negative transfer phenomena during model training. To address this issue, a cross-subject EEG-based emotion recognition through source selection with adversarial strategy is introduced in this paper. The proposed method comprises two modules: the source selection network (SS) and the adversarial strategies network (AS). The SS uses domain labels to reverse-engineer the training process of domain adaptation. Its key idea is to disrupt class separability and magnify inter-domain differences, thereby raising the classification difficulty and forcing the model to learn domain-invariant yet emotion-relevant representations. The AS gets the source domain selection results and the pretrained domain discriminators from SS. The pretrained domain discriminators compute a novel loss aimed at enhancing the performance of domain classification during adversarial training, ensuring the balance of adversarial strategies. This paper provides theoretical insights into the proposed method and achieves outstanding performance on two EEG-based emotion datasets, SEED and SEED-IV. The code can be found at https://github.com/liuyici/SSAS.
๐
2025-12-15
๐ ArXiv Preprint (cs.SE)
๐ฅ Dezhi Ran, Zhi Gong, Yuzhe Guo, Mengzhou Wu, Yuan Cao et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
While Large Language Model (LLM) agents show great potential for automated UI navigation such as automated UI testing and AI assistants, their efficiency has been largely overlooked. Our motivating study reveals that inefficient UI representation creates a critical performance bottleneck. However, UI representation optimization, formulated as the task of automatically generating programs that transform UI representations, faces two unique challenges. First, the lack of Boolean oracles, which traditional program synthesis uses to decisively validate semantic correctness, poses a fundamental challenge to co-optimization of token efficiency and completeness. Second, the need to process large, complex UI trees as input while generating long, compositional transformation programs, making the search space vast and error-prone. Toward addressing the preceding limitations, we present UIFormer, the first automated optimization framework that synthesizes UI transformation programs by conducting constraint-based optimization with structured decomposition of the complex synthesis task. First, UIFormer restricts the program space using a domain-specific language (DSL) that captures UI-specific operations. Second, UIFormer conducts LLM-based iterative refinement with correctness and efficiency rewards, providing guidance for achieving the efficiency-completeness co-optimization. UIFormer operates as a lightweight plugin that applies transformation programs for seamless integration with existing LLM agents, requiring minimal modifications to their core logic. Evaluations across three UI navigation benchmarks spanning Android and Web platforms with five LLMs demonstrate that UIFormer achieves 48.7% to 55.8% token reduction with minimal runtime overhead while maintaining or improving agent performance. Real-world industry deployment at WeChat further validates the practical impact of UIFormer.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lorenzo Pettinari, Sidaty El Hadramy, Michael Wehrli, Philippe C. Cattin, Daniel Studer et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Purpose: Intraoperative navigation in spine surgery demands millimeter-level accuracy. Current systems based on intraoperative radiographic imaging and bone-anchored markers are invasive, radiation-intensive and workflow disruptive. Recent markerless RGB-D registration methods offer a promising alternative, but existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, which can propagate errors throughout registration. Methods: We present End2Reg an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for weak segmentation labels and manual steps. The network learns segmentation masks specifically optimized for registration, guided solely by the registration objective without direct segmentation supervision. Results: The proposed framework achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% to 1.83mm and mean Root Mean Square Error by 45% to 3.95mm, respectively. An ablation study confirms that end-to-end optimization significantly improves registration accuracy. Conclusion: The presented end-to-end RGB-D registration pipeline removes dependency on weak labels and manual steps, advancing towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sitao Cheng, Tianle Li, Xuhan Huang, Xunjian Yin, Difan Zou
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolution, DERL is differentiable in its metaoptimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling selfimproving agent alignment without human intervention.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Francesca Da Ros, Luca Di Gaspero, Kevin Roitero
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in Large Language Models (LLMs) have opened new perspectives for automation in optimization. While several studies have explored how LLMs can generate or solve optimization models, far less is understood about what these models actually learn regarding problem structure or algorithmic behavior. This study investigates how LLMs internally represent combinatorial optimization problems and whether such representations can support downstream decision tasks. We adopt a twofold methodology combining direct querying, which assesses LLM capacity to explicitly extract instance features, with probing analyses that examine whether such information is implicitly encoded within their hidden layers. The probing framework is further extended to a per-instance algorithm selection task, evaluating whether LLM-derived representations can predict the best-performing solver. Experiments span four benchmark problems and three instance representations. Results show that LLMs exhibit moderate ability to recover feature information from problem instances, either through direct querying or probing. Notably, the predictive power of LLM hidden-layer representations proves comparable to that achieved through traditional feature extraction, suggesting that LLMs capture meaningful structural information relevant to optimization performance.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shibani Sankpal
Natural Language Processing
ๆฅ็ๆ่ฆ
This study investigates emotion drift: the change in emotional state across a single text, within mental health-related messages. While sentiment analysis typically classifies an entire message as positive, negative, or neutral, the nuanced shift of emotions over the course of a message is often overlooked. This study detects sentence-level emotions and measures emotion drift scores using pre-trained transformer models such as DistilBERT and RoBERTa. The results provide insights into patterns of emotional escalation or relief in mental health conversations. This methodology can be applied to better understand emotional dynamics in content.
๐
2025-12-15
๐ This is the Author Accepted Manuscript version of
๐ฅ Zeyad Gamal, Youssef Mahran, Ayman El-Badawy
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joona Kytรถniemi, Jousia Piha, Akseli Reunamo, Fedor Vitiugin, Farrokh Mehryary et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
๐
2025-12-15
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Malte Hellmeier
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Securing digital text is becoming increasingly relevant due to the widespread use of large language models. Individuals' fear of losing control over data when it is being used to train such machine learning models or when distinguishing model-generated output from text written by humans. Digital watermarking provides additional protection by embedding an invisible watermark within the data that requires protection. However, little work has been taken to analyze and verify if existing digital text watermarking methods are secure and undetectable by large language models. In this paper, we investigate the security-related area of watermarking and machine learning models for text data. In a controlled testbed of three experiments, ten existing Unicode text watermarking methods were implemented and analyzed across six large language models: GPT-5, GPT-4o, Teuken 7B, Llama 3.3, Claude Sonnet 4, and Gemini 2.5 Pro. The findings of our experiments indicate that, especially the latest reasoning models, can detect a watermarked text. Nevertheless, all models fail to extract the watermark unless implementation details in the form of source code are provided. We discuss the implications for security researchers and practitioners and outline future research opportunities to address security concerns.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ รrpรกd Pรกndy, Rรณbert Lakatos, Andrรกs Hajdu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advancements in artificial intelligence have sparked interest in industrial agents capable of supporting analysts in regulated sectors, such as finance and healthcare, within tabular data workflows. A key capability for such systems is performing accurate arithmetic operations on structured data while ensuring sensitive information never leaves secure, on-premises environments. Here, we introduce an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA), specifically applied to on-premises small language models (SLMs). Through a systematic evaluation of a leading SLM (Qwen3 4B), we find that while the base model exhibits fundamental limitations in arithmetic tasks, our proposed error-driven method, which clusters erroneous predictions to refine prompt-rules iteratively, dramatically improves performance, elevating the model's accuracy to 70.8\%. Our results suggest that developing reliable, interpretable, and industrially deployable AI assistants can be achieved not only through costly fine-tuning but also via systematic, error-driven prompt optimization, enabling small models to surpass larger language models (GPT-3.5 Turbo) in a privacy-compliant manner.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mikhail Zakharov
Computer Vision
ๆฅ็ๆ่ฆ
Face recognition systems rely on learning highly discriminative and compact identity clusters to enable accurate retrieval. However, as with other surveillance-oriented technologies, such systems raise serious privacy concerns due to their potential for unauthorized identity tracking. While several works have explored machine unlearning as a means of privacy protection, their applicability to face retrieval - especially for modern embedding-based recognition models - remains largely unexplored. In this work, we study the problem of face identity unlearning for retrieval systems and present its inherent challenges. The goal is to make selected identities unretrievable by dispersing their embeddings on the hypersphere and preventing the formation of compact identity clusters that enable re-identification in the gallery. The primary challenge is to achieve this forgetting effect while preserving the discriminative structure of the embedding space and the retrieval performance of the model for the remaining identities. To address this, we evaluate several existing approximate class unlearning methods (e.g., Random Labeling, Gradient Ascent, Boundary Unlearning, and other recent approaches) in the context of face retrieval and propose a simple yet effective dispersion-based unlearning approach. Extensive experiments on standard benchmarks (VGGFace2, CelebA) demonstrate that our method achieves superior forgetting behavior while preserving retrieval utility.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mayank Gulati, Benedikt Groร, Gerhard Wunder
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
We present ALIGN-FL, a novel approach to distributed learning that addresses the challenge of learning from highly disjoint data distributions through selective sharing of generative components. Instead of exchanging full model parameters, our framework enables privacy-preserving learning by transferring only generative capabilities across clients, while the server performs global training using synthetic samples. Through complementary privacy mechanisms: DP-SGD with adaptive clipping and Lipschitz regularized VAE decoders and a stateful architecture supporting heterogeneous clients, we experimentally validate our approach on MNIST and Fashion-MNIST datasets with cross-domain outliers. Our analysis demonstrates that both privacy mechanisms effectively map sensitive outliers to typical data points while maintaining utility in extreme Non-IID scenarios typical of cross-silo collaborations.
Index Terms: Client-invariant Learning, Federated Learning (FL), Privacy-preserving Generative Models, Non-Independent and Identically Distributed (Non-IID), Heterogeneous Architectures
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qinglin Jia, Zhaocheng Du, Chuhan Wu, Huifeng Guo, Ruiming Tang et al.
ๆฅ็ๆ่ฆ
In most real-world online advertising systems, advertisers typically have diverse customer acquisition goals. A common solution is to use multi-task learning (MTL) to train a unified model on post-click data to estimate the conversion rate (CVR) for these diverse targets. In practice, CVR prediction often encounters missing conversion data as many advertisers submit only a subset of user conversion actions due to privacy or other constraints, making the labels of multi-task data incomplete. If the model is trained on all available samples where advertisers submit user conversion actions, it may struggle when deployed to serve a subset of advertisers targeting specific conversion actions, as the training and deployment data distributions are mismatched. While considerable MTL efforts have been made, a long-standing challenge is how to effectively train a unified model with the incomplete and skewed multi-label data. In this paper, we propose a fine-grained Knowledge transfer framework for Asymmetric Multi-Label data (KAML). We introduce an attribution-driven masking strategy (ADM) to better utilize data with asymmetric multi-label data in training. However, the more relaxed masking in ADM is a double-edged sword: it provides additional training signals but also introduces noise due to skewed data. To address this, we propose a hierarchical knowledge extraction mechanism (HKE) to model the sample discrepancy within the target task tower. Finally, to maximize the utility of unlabeled samples, we incorporate ranking loss strategy to further enhance our model. The effectiveness of KAML has been demonstrated through comprehensive evaluations on offline industry datasets and online A/B tests, which show significant performance improvements over existing MTL baselines.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anna Aksenova, Boris Zverkov, Nicola Dainese, Alexander Nikitin, Pekka Marttinen
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release model weights, tokenizer and source code used for data processing and model training.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhenghao Zhu, Chuxue Cao, Sirui Han, Yuanfeng Song, Xing Chen et al.
Multimodal
ๆฅ็ๆ่ฆ
In medical data analysis, extracting deep insights from complex, multi-modal datasets is essential for improving patient care, increasing diagnostic accuracy, and optimizing healthcare operations. However, there is currently a lack of high-quality datasets specifically designed to evaluate the ability of large multi-modal models (LMMs) to discover medical insights. In this paper, we introduce MedInsightBench, the first benchmark that comprises 332 carefully curated medical cases, each annotated with thoughtfully designed insights. This benchmark is intended to evaluate the ability of LMMs and agent frameworks to analyze multi-modal medical image data, including posing relevant questions, interpreting complex findings, and synthesizing actionable insights and recommendations. Our analysis indicates that existing LMMs exhibit limited performance on MedInsightBench, which is primarily attributed to their challenges in extracting multi-step, deep insights and the absence of medical expertise. Therefore, we propose MedInsightAgent, an automated agent framework for medical data analysis, composed of three modules: Visual Root Finder, Analytical Insight Agent, and Follow-up Question Composer. Experiments on MedInsightBench highlight pervasive challenges and demonstrate that MedInsightAgent can improve the performance of general LMMs in medical data insight discovery.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Hao Fua, Wei Liu, Shuai Zhoua
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper investigates the application of reinforcement learning (RL) to multi-robot social formation navigation, a critical capability for enabling seamless human-robot coexistence. While RL offers a promising paradigm, the inherent unpredictability and often uncooperative dynamics of pedestrian behavior pose substantial challenges, particularly concerning the efficiency of coordinated exploration among robots. To address this, we propose a novel coordinated-exploration multi-robot RL algorithm introducing an intrinsic motivation exploration. Its core component is a self-learning intrinsic reward mechanism designed to collectively alleviate policy conservatism. Moreover, this algorithm incorporates a dual-sampling mode within the centralized training and decentralized execution framework to enhance the representation of both the navigation policy and the intrinsic reward, leveraging a two-time-scale update rule to decouple parameter updates. Empirical results on social formation navigation benchmarks demonstrate the proposed algorithm's superior performance over existing state-of-the-art methods across crucial metrics. Our code and video demos are available at: https://github.com/czxhunzi/CEMRRL.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shu Yu, Chaochao Lu
ๆฅ็ๆ่ฆ
Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at https://opencausalab.github.io/LINA.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zihui Zhao, Zechang Li
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Direct Preference Optimization (DPO) has emerged as a lightweight and effective alternative to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with AI Feedback (RLAIF) for aligning large language and vision-language models. However, the standard DPO formulation, in which both the chosen and rejected responses are generated by the same policy, suffers from a weak learning signal because the two responses often share similar errors and exhibit small Kullback-Leibler (KL) divergence. This leads to slow and unstable convergence. To address this limitation, we introduce Reflective Preference Optimization (RPO), a new framework that incorporates hint-guided reflection into the DPO paradigm. RPO uses external models to identify hallucination sources and generate concise reflective hints, enabling the construction of on-policy preference pairs with stronger contrastiveness and clearer preference signals. We theoretically show that conditioning on hints increases the expected preference margin through mutual information and improves sample efficiency while remaining within the policy distribution family. Empirically, RPO achieves superior alignment with fewer training samples and iterations, substantially reducing hallucination rates and delivering state-of-the-art performance across multimodal benchmarks.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jianyuan Bo, Yuan Fang
Machine Learning
ๆฅ็ๆ่ฆ
In the rapidly evolving field of self-supervised learning on graphs, generative and contrastive methodologies have emerged as two dominant approaches. Our study focuses on masked feature reconstruction (MFR), a generative technique where a model learns to restore the raw features of masked nodes in a self-supervised manner. We observe that both MFR and graph contrastive learning (GCL) aim to maximize agreement between similar elements. Building on this observation, we reveal a novel theoretical insight: under specific conditions, the objectives of MFR and node-level GCL converge, despite their distinct operational mechanisms. This theoretical connection suggests these approaches are complementary rather than fundamentally different, prompting us to explore their integration to enhance self-supervised learning on graphs. Our research presents Contrastive Masked Feature Reconstruction (CORE), a novel graph self-supervised learning framework that integrates contrastive learning into MFR. Specifically, we form positive pairs exclusively between the original and reconstructed features of masked nodes, encouraging the encoder to prioritize contextual information over the node's own features. Additionally, we leverage the masked nodes themselves as negative samples, combining MFR's reconstructive power with GCL's discriminative ability to better capture intrinsic graph structures. Empirically, our proposed framework CORE significantly outperforms MFR across node and graph classification tasks, demonstrating state-of-the-art results. In particular, CORE surpasses GraphMAE and GraphMAE2 by up to 2.80% and 3.72% on node classification tasks, and by up to 3.82% and 3.76% on graph classification tasks.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chendong Sun
Natural Language Processing
ๆฅ็ๆ่ฆ
Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as \(1 - \max(P_{\mathrm{target}})\). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
๐
2025-12-15
๐ Accepted to IEEE Transactions on Aerospace and Ele
๐ฅ Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Dongil Park, Sung Won Han
Machine Learning
ๆฅ็ๆ่ฆ
The Automatic Identification System (AIS) enables data-driven maritime surveillance but suffers from reliability issues and irregular intervals. We address vessel destination estimation using global-scope AIS data by proposing a differentiated approach that recasts long port-to-port trajectories as a nested sequence structure. Using spatial grids, this method mitigates spatio-temporal bias while preserving detailed resolution. We introduce a novel deep learning architecture, WAY, designed to process these reformulated trajectories for long-term destination estimation days to weeks in advance. WAY comprises a trajectory representation layer and Channel-Aggregative Sequential Processing (CASP) blocks. The representation layer generates multi-channel vector sequences from kinematic and non-kinematic features. CASP blocks utilize multi-headed channel- and self-attention for aggregation and sequential information delivery. Additionally, we propose a task-specialized Gradient Dropout (GD) technique to enable many-to-many training on single labels, preventing biased feedback surges by stochastically blocking gradient flow based on sample length. Experiments on 5-year AIS data demonstrate WAY's superiority over conventional spatial grid-based approaches regardless of trajectory progression. Results further confirm that adopting GD leads to performance gains. Finally, we explore WAY's potential for real-world application through multitask learning for ETA estimation.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Khalid Ferji
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Machine-learning (ML) models in polymer science typically treat a polymer as a single, perfectly defined molecular graph, even though real materials consist of stochastic ensembles of chains with distributed lengths. This mismatch between physical reality and digital representation limits the ability of current models to capture polymer behaviour. Here we introduce PolySet, a framework that represents a polymer as a finite, weighted ensemble of chains sampled from an assumed molar-mass distribution. This ensemble-based encoding is independent of chemical detail, compatible with any molecular representation and illustrated here in the homopolymer case using a minimal language model. We show that PolySet retains higher-order distributional moments (such as Mz, Mz+1), enabling ML models to learn tail-sensitive properties with greatly improved stability and accuracy. By explicitly acknowledging the statistical nature of polymer matter, PolySet establishes a physically grounded foundation for future polymer machine learning, naturally extensible to copolymers, block architectures, and other complex topologies.
๐
2025-12-15
๐ ArXiv Preprint (econ.GN)
๐ฅ Francesco Salvi, Giuseppe Russo, Adam Barla, Vincent Moreau, Robert West
ๆฅ็ๆ่ฆ
Meat consumption is a major driver of global greenhouse gas emissions. While pricing interventions have shown potential to reduce meat intake, previous studies have focused on highly constrained environments with limited consumer choice. Here, we present the first large-scale field experiment to evaluate multiple pricing interventions in a real-world, competitive setting. Using a sequential crossover design with matched menus in a Swiss university campus, we systematically compared vegetarian-meal discounts (-2.5 CHF), meat surcharges (+2.5 CHF), and a combined scheme (-1.2 CHF=+1.2 CHF) across four campus cafeterias. Only the surcharge and combined interventions led to significant increases in vegetarian meal uptake--by 26.4% and 16.6%, respectively--and reduced CO2 emissions per meal by 7.4% and 11.3%, respectively. The surcharge, while effective, triggered a 12.3% drop in sales at intervention sites and a corresponding 14.9% increase in non-treated locations, hence causing a spillover effect that completely offset environmental gains. In contrast, the combined approach achieved meaningful emission reductions without significant effects on overall sales or revenue, making it both effective and economically viable. Notably, pricing interventions were equally effective for both vegetarian-leaning customers and habitual meat-eaters, stimulating change even within entrenched dietary habits. Our results show that balanced pricing strategies can reduce the carbon footprint of realistic food environments, but require coordinated implementation to maximize climate benefits and avoid unintended spillover effects.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce a finance & accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows -- interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management.
We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work.
We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max, and GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jakub ลyskawa, Jakub Lewandowski, Paweล Wawrzyลski
Machine Learning
ๆฅ็ๆ่ฆ
Soft Actor-Critic (SAC) is widely used in practical applications and is now one of the most relevant off-policy online model-free reinforcement learning (RL) methods. The technique of n-step returns is known to increase the convergence speed of RL algorithms compared to their 1-step returns-based versions. However, SAC is notoriously difficult to combine with n-step returns, since their usual combination introduces bias in off-policy algorithms due to the changes in action distribution. While this problem is solved by importance sampling, a method for estimating expected values of one distribution using samples from another distribution, importance sampling may result in numerical instability. In this work, we combine SAC with n-step returns in a way that overcomes this issue. We present an approach to applying numerically stable importance sampling with simplified hyperparameter selection. Furthermore, we analyze the entropy estimation approach of Soft Actor-Critic in the context of the n-step maximum entropy framework and formulate the $ฯ$-sampled entropy estimation to reduce the variance of the learning target. Finally, we formulate the Soft Actor-Critic with n-step returns (SAC$n$) algorithm that we experimentally verify on MuJoCo simulated environments.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xianchao Guan, Zhiyuan Fan, Yifeng Wang, Fuqiang Chen, Yanjiang Zhou et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Emre Can Acikgoz, Jinoh Oh, Jie Hao, Joo Hyuk Jeon, Heng Ji et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional, with users providing instructions or posing questions to agents, where agents respond directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents require more proactive engagement, where agents should dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Existing prior work under-utilize the conversational capabilities of language models (LMs), thereby optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents' conversational capabilities by rewarding proactive interactions with users, such as asking right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset that includes diverse scenarios from task-oriented dialogues, where tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns even surpassing even much larger proprietary models, demonstrating the promise of clarification-centric user-agent interactions.
๐
2025-12-15
๐ WWW
๐ฅ Peter Kocsis, Lukas Hรถllein, Matthias Nieรner
Machine Learning
ๆฅ็ๆ่ฆ
We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images. Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view. To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions. We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse path tracing to optimize for the low-dimensional parameters. Our results outperform state-of-the-art methods in material disentanglement on both synthetic and real scenes, producing sharp and clean reconstructions suitable for high-quality relighting.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Emre Can Acikgoz, Jinoh Oh, Joo Hyuk Jeon, Jie Hao, Heng Ji et al.
Robotics
ๆฅ็ๆ่ฆ
Conversational agents often encounter ambiguous user requests, requiring an effective clarification to successfully complete tasks. While recent advancements in real-world applications favor multi-agent architectures to manage complex conversational scenarios efficiently, ambiguity resolution remains a critical and underexplored challenge--particularly due to the difficulty of determining which agent should initiate a clarification and how agents should coordinate their actions when faced with uncertain or incomplete user input. The fundamental questions of when to interrupt a user and how to formulate the optimal clarification query within the most optimal multi-agent settings remain open. In this paper, we propose MAC (Multi-Agent Clarification), an interactive multi-agent framework specifically optimized to resolve user ambiguities by strategically managing clarification dialogues. We first introduce a novel taxonomy categorizing user ambiguities to systematically guide clarification strategies. Then, we present MAC that autonomously coordinates multiple agents to interact synergistically with users. Empirical evaluations on MultiWOZ 2.4 demonstrate that enabling clarification at both levels increases task success rate 7.8\% (54.5 to 62.3) and reduces the average number of dialogue turns (6.53 to 4.86) by eliciting all required user information up front and minimizing repetition. Our findings highlight the importance of active user interaction and role-aware clarification for more reliable human-agent communication.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Anika Sharma, Malavika Mampally, Chidaksh Ravuru, Kandyce Brennan, Neil Gaikwad
Natural Language Processing
ๆฅ็ๆ่ฆ
As large language models increasingly mediate stigmatized health decisions, their capacity to genuinely understand complex psychological and physiological phenomena remains poorly evaluated. Can AI understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across the cognitive, interpersonal, and structural levels where it operates. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS). Our multilevel analysis examined whether models coherently represent stigma at the cognitive level (self-judgment), interpersonal level (anticipated judgment and isolation), and structural level (community condemnation and disclosure patterns), as well as overall stigma. Models fail tests of genuine understanding across all levels. They overestimate interpersonal stigma while underestimating cognitive stigma, assume uniform community condemnation, introduce demographic biases absent from human validation data, miss the empirically validated stigma-secrecy relationship, and contradict themselves within theoretical constructs. These patterns reveal that current alignment approaches ensure appropriate language but not coherent multilevel understanding. This work provides empirical evidence that current LLMs lack coherent multilevel understanding of psychological and physiological constructs. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.
๐
2025-12-15
๐ IEEE Transactions on Image Processing
๐ฅ Xin Guo, Yifan Zhao, Jia Li
ๆฅ็ๆ่ฆ
Generating 3D-based body movements from speech shows great potential in extensive downstream applications, while it still suffers challenges in imitating realistic human movements. Predominant research efforts focus on end-to-end generation schemes to generate co-speech gestures, spanning GANs, VQ-VAE, and recent diffusion models. As an ill-posed problem, in this paper, we argue that these prevailing learning schemes fail to model crucial inter- and intra-correlations across different motion units, i.e. head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Different from predominant research, our approach models this multi-modal implicit relationship by two explicit technique insights: i) To disentangle the complicated gesture movements, we first explore the gesture motion phase manifolds with periodic autoencoders to imitate human natures from realistic distributions while incorporating non-period ones from current latent states for instance-level diversities. ii) To model the hierarchical relationship of face motions, body gestures, and hand movements, driving the animation with cascaded guidance during learning. We exhibit our proposed approach on 3D avatars and extensive experiments show our method outperforms the state-of-the-art co-speech gesture generation methods by both quantitative and qualitative evaluations. Code and models will be publicly available.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vivek Alumootil, Tuan-Anh Vu, M. Khalid Jawed
Computer Vision
ๆฅ็ๆ่ฆ
Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering to input frames, constraining their flexibility and applicability. Additionally, recent advances have successfully enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring opportunities for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency-especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and codes are available via the open repository: https://github.com/StructuresComp/DePT3R
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hayk Amirkhanian, Marco F. Huber
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, neural networks have revolutionized various domains, yet challenges such as hyperparameter tuning and overfitting remain significant hurdles. Bayesian neural networks offer a framework to address these challenges by incorporating uncertainty directly into the model, yielding more reliable predictions, particularly for out-of-distribution data. This paper presents Hierarchical Approximate Bayesian Neural Network, a novel approach that uses a Gaussian-inverse-Wishart distribution as a hyperprior of the network's weights to increase both the robustness and performance of the model. We provide analytical representations for the predictive distribution and weight posterior, which amount to the calculation of the parameters of Student's t-distributions in closed form with linear complexity with respect to the number of weights. Our method demonstrates robust performance, effectively addressing issues of overfitting and providing reliable uncertainty estimates, particularly for out-of-distribution tasks. Experimental results indicate that HABNN not only matches but often outperforms state-of-the-art models, suggesting a promising direction for future applications in safety-critical environments.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zewen Qiang, Sendong Zhao, Haochun Wang, Bing Qin, Ting Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long-text sequences due to the ``lost in the middle'' phenomenon. This issue has been shown to arise from a U-shaped attention bias, where attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our research first identifies an additional factor: initial saliency. It means that in the attention computation for each token, tokens with higher attention weights relative to the initial token tend to receive more attention in the prediction of the next token. We further find that utilizing this property by scaling attention weight between the initial token and others improves the model's ability to process long contexts, achieving a maximum improvement of 3.6\% in MDQA dataset. Moreover, combining this approach with existing methods to reduce position encoding bias further enhances performance, achieving a maximum improvement of 3.4\% in KV-Retrieval tasks.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhijian He, Feifei Liu, Yuwei Li, Zhanpeng Liu, Jintao Cheng et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR restoring images degraded by weather effects and Point Cloud Restoration (PCR) compensating for corrupted LiDAR data using image object cues. To tackle misalignments between two modalities, we develop Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird's-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shenzhi Yang, Guangcheng Zhu, Xing Zheng, Yingfan MA, Zhongqi Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, when using 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using only 10% of the labeled data. The code is available via https://github.com/ShenzhiYang2000/TRAPO.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Rajeev Bhatt Ambati, Tianyi Niu, Aashu Singh, Shlok Mishra, Shashank Srivastava et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel at static interactions, where they answer user queries by retrieving knowledge encoded in their parameters. However, in many real-world settings, such as educational tutoring or medical assistance, relevant information is not directly available and must be actively acquired through dynamic interactions. An interactive agent would recognize its own uncertainty, ask targeted questions, and retain new knowledge efficiently. Prior work has primarily explored effective ways for a teacher to instruct the student, where the teacher identifies student gaps and provides guidance. In this work, we shift the focus to the student and investigate effective strategies to actively query the teacher in seeking useful information. Across math and coding benchmarks, where baseline student models begin with near-zero performance, we show that student-led approaches consistently yield absolute Pass@k improvements of at least 0.5 over static baselines. To improve question quality, we train students using Direct Preference Optimization (DPO) with guidance from either self or stronger students. We find that this guided training enables smaller models to learn how to ask better questions, further enhancing learning efficiency.
๐
2025-12-15
๐ This work has been submitted to the IEEE TMI for p
๐ฅ Wenjing Lu, Yi Hong, Yang Yang
Computer Vision
ๆฅ็ๆ่ฆ
Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Guanhua Ji, Harsha Polavaram, Lawrence Yunliang Chen, Sandeep Bajamahal, Zehan Ma et al.
Robotics
ๆฅ็ๆ่ฆ
Large and diverse datasets are needed for training generalist robot policies that have potential to control a variety of robot embodiments -- robot arm and gripper combinations -- across diverse tasks and environments. As re-collecting demonstrations and retraining for each new hardware platform are prohibitively costly, we show that existing robot data can be augmented for transfer and generalization. The Open X-Embodiment (OXE) dataset, which aggregates demonstrations from over 60 robot datasets, has been widely used as the foundation for training generalist policies. However, it is highly imbalanced: the top four robot types account for over 85\% of its real data, which risks overfitting to robot-scene combinations. We present AugE-Toolkit, a scalable robot augmentation pipeline, and OXE-AugE, a high-quality open-source dataset that augments OXE with 9 different robot embodiments. OXE-AugE provides over 4.4 million trajectories, more than triple the size of the original OXE. We conduct a systematic study of how scaling robot augmentation impacts cross-embodiment learning. Results suggest that augmenting datasets with diverse arms and grippers improves policy performance not only on the augmented robots, but also on unseen robots and even the original robots under distribution shifts. In physical experiments, we demonstrate that state-of-the-art generalist policies such as OpenVLA and $ฯ_0$ benefit from fine-tuning on OXE-AugE, improving success rates by 24-45% on previously unseen robot-gripper combinations across four real-world manipulation tasks. Project website: https://OXE-AugE.github.io/.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Xiang Li, Gang Liu, Weitao Zhou, Hongyi Zhu, Zhong Cao
Robotics
ๆฅ็ๆ่ฆ
Imitation learning (IL) has emerged as a central paradigm in autonomous driving. While IL excels in matching expert behavior in open-loop settings by minimizing per-step prediction errors, its performance degrades unexpectedly in closed-loop due to the gradual accumulation of small, often imperceptible errors over time.Over successive planning cycles, these errors compound, potentially resulting in severe failures.Current research efforts predominantly rely on increasingly sophisticated network architectures or high-fidelity training datasets to enhance the robustness of IL planners against error accumulation, focusing on the state-level robustness at a single time point. However, autonomous driving is inherently a continuous-time process, and leveraging the temporal scale to enhance robustness may provide a new perspective for addressing this issue.To this end, we propose a method termed Sequence of Experts (SoE), a temporal alternation policy that enhances closed-loop performance without increasing model size or data requirements. Our experiments on large-scale autonomous driving benchmarks nuPlan demonstrate that SoE method consistently and significantly improves the performance of all the evaluated models, and achieves state-of-the-art performance.This module may provide a key and widely applicable support for improving the training efficiency of autonomous driving models.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziqiang Zhu, Bowei Yang
Multimodal
ๆฅ็ๆ่ฆ
Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at https://github.com/Die-Xie/UniVCD.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Huimu Wang, Yiming Qiu, Xingzhi Yao, Zhiguo Chen, Guoyu Tang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Dense retrieval has become the industry standard in large-scale information retrieval systems due to its high efficiency and competitive accuracy. Its core relies on a coarse-to-fine hierarchical architecture that enables rapid candidate selection and precise semantic matching, achieving millisecond-level response over billion-scale corpora. This capability makes it essential not only in traditional search and recommendation scenarios but also in the emerging paradigm of generative recommendation driven by large language models, where semantic IDs-themselves a form of coarse-to-fine representation-play a foundational role. However, the widely adopted dual-tower encoding architecture introduces inherent challenges, primarily representational space misalignment and retrieval index inconsistency, which degrade matching accuracy, retrieval stability, and performance on long-tail queries. These issues are further magnified in semantic ID generation, ultimately limiting the performance ceiling of downstream generative models.
To address these challenges, this paper proposes a simple and effective framework named SCI comprising two synergistic modules: a symmetric representation alignment module that employs an innovative input-swapping mechanism to unify the dual-tower representation space without adding parameters, and an consistent indexing with dual-tower synergy module that redesigns retrieval paths using a dual-view indexing strategy to maintain consistency from training to inference. The framework is systematic, lightweight, and engineering-friendly, requiring minimal overhead while fully supporting billion-scale deployment. We provide theoretical guarantees for our approach, with its effectiveness validated by results across public datasets and real-world e-commerce datasets.
๐
2025-12-15
๐ NeurIPS 2025
๐ฅ Bizhe Bai, Hongming Wu, Peng Ye, Tao Chen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, we find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We diagnose this instability and demonstrate that simply scaling the number of rollouts -- a common strategy to improve performance -- only delays, but does not prevent, this collapse. To counteract this instability, we first introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization), a framework that leverages a slowly evolving momentum model to provide a stable training target. In addition, we identify that this process is often accompanied by a rapid collapse in policy entropy, resulting in a prematurely confident and suboptimal policy. To specifically address this issue, we propose a second contribution: an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories, preserving essential policy diversity. Our extensive experiments on multiple reasoning benchmarks demonstrate that M-GRPO stabilizes the training process while the IQR filter prevents premature convergence. The combination of these two innovations leads to superior training stability and state-of-the-art performance.
๐
2025-12-15
๐ NeurIPS 2025
๐ฅ Cheril Shah, Akshit Agarwal, Kanak Garg, Mourad Heddaya
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Bilateral negotiation is a complex, context-sensitive task in which human negotiators dynamically adjust anchors, pacing, and flexibility to exploit power asymmetries and informal cues. We introduce a unified mathematical framework for modeling concession dynamics based on a hyperbolic tangent curve, and propose two metrics burstiness tau and the Concession-Rigidity Index (CRI) to quantify the timing and rigidity of offer trajectories. We conduct a large-scale empirical comparison between human negotiators and four state-of-the-art large language models (LLMs) across natural-language and numeric-offers settings, with and without rich market context, as well as six controlled power-asymmetry scenarios. Our results reveal that, unlike humans who smoothly adapt to situations and infer the opponents position and strategies, LLMs systematically anchor at extremes of the possible agreement zone for negotiations and optimize for fixed points irrespective of leverage or context. Qualitative analysis further shows limited strategy diversity and occasional deceptive tactics used by LLMs. Moreover the ability of LLMs to negotiate does not improve with better models. These findings highlight fundamental limitations in current LLM negotiation capabilities and point to the need for models that better internalize opponent reasoning and context-dependent strategy.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jongwook Kim, Sangheon Yun, Sukjin Yoon
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass $QKV$ structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong theoretical evidence: the standard attention gradient is suboptimal. We demonstrated that selectively scaling these components, focusing primarily on $0^{th}$ order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, and using a crude configuration, this method achieved a $0.56\%$ reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential gains on larger datasets and deeper training regimes
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenjing lu, Zerui Tao, Dongping Zhang, Yuning Qiu, Yang Yang et al.
Multimodal
ๆฅ็ๆ่ฆ
CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Previous work of adversarial fine-tuning largely focuses on matching the predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and unreliable over-confidence. This overlooked phenomenon highlights a critical reliability gap beyond robustness. To bridge this gap, we propose a novel adversarial fine-tuning objective for CLIP considering both prediction accuracy and uncertainty alignments. By reparameterizing the output of CLIP as the concentration parameter of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and the magnitude of predictive confidence. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments on multiple zero-shot classification benchmarks demonstrate that our approach effectively restores calibrated uncertainty and achieves competitive adversarial robustness while maintaining clean accuracy.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Amin Jalal Aghdasian, Farzaneh Abdollahi, Ali Kamali Iglie
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper proposes two new algorithms for the lane keeping system (LKS) in autonomous vehicles (AVs) operating under snowy road conditions. These algorithms use deep reinforcement learning (DRL) to handle uncertainties and slippage. They include Action-Robust Recurrent Deep Deterministic Policy Gradient (AR-RDPG) and end-to-end Action-Robust convolutional neural network Attention Deterministic Policy Gradient (AR-CADPG), two action-robust approaches for decision-making. In the AR-RDPG method, within the perception layer, camera images are first denoised using multi-scale neural networks. Then, the centerline coefficients are extracted by a pre-trained deep convolutional neural network (DCNN). These coefficients, concatenated with the driving characteristics, are used as input to the control layer. The AR-CADPG method presents an end-to-end approach in which a convolutional neural network (CNN) and an attention mechanism are integrated within a DRL framework. Both methods are first trained in the CARLA simulator and validated under various snowy scenarios. Real-world experiments on a Jetson Nano-based autonomous vehicle confirm the feasibility and stability of the learned policies. Among the two models, the AR-CADPG approach demonstrates superior path-tracking accuracy and robustness, highlighting the effectiveness of combining temporal memory, adversarial resilience, and attention mechanisms in AVs.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Paola Di Maio
ๆฅ็ๆ่ฆ
The intersection of artificial intelligence (AI) and digital forensics (DF) is becoming increasingly complex, ubiquitous, and pervasive, with overlapping techniques and technologies being adopted in all types of scientific and technical inquiry. Despite incredible advances, forensic sciences are not exempt from errors and remain vulnerable to fallibility. To mitigate the limitations of errors in DF, the systemic complexity is identified and addressed with the adoption of human-readable artifacts and open standards. A DF AI model schema based on the state of the art is outlined.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lingyi Meng, Maolin Liu, Hao Wang, Yilan Cheng, Qi Yang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurately mapping legal terminology across languages remains a significant challenge, especially for language pairs like Chinese and Japanese, which share a large number of homographs with different meanings. Existing resources and standardized tools for these languages are limited. To address this, we propose a human-AI collaborative approach for building a multilingual legal terminology database, based on a multi-agent framework. This approach integrates advanced large language models and legal domain experts throughout the entire process-from raw document preprocessing, article-level alignment, to terminology extraction, mapping, and quality assurance. Unlike a single automated pipeline, our approach places greater emphasis on how human experts participate in this multi-agent system. Humans and AI agents take on different roles: AI agents handle specific, repetitive tasks, such as OCR, text segmentation, semantic alignment, and initial terminology extraction, while human experts provide crucial oversight, review, and supervise the outputs with contextual knowledge and legal judgment. We tested the effectiveness of this framework using a trilingual parallel corpus comprising 35 key Chinese statutes, along with their English and Japanese translations. The experimental results show that this human-in-the-loop, multi-agent workflow not only improves the precision and consistency of multilingual legal terminology mapping but also offers greater scalability compared to traditional manual methods.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tiange Zhang, Xiandong Meng, Siwei Ma
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in end-to-end video compression have shown promising results owing to their unified end-to-end learning optimization. However, such generalized frameworks often lack content-specific adaptation, leading to suboptimal compression performance. To address this, this paper proposes a content adaptive based motion alignment framework that improves performance by adapting encoding strategies to diverse content characteristics. Specifically, we first introduce a two-stage flow-guided deformable warping mechanism that refines motion compensation with coarse-to-fine offset prediction and mask modulation, enabling precise feature alignment. Second, we propose a multi-reference quality aware strategy that adjusts distortion weights based on reference quality, and applies it to hierarchical training to reduce error propagation. Third, we integrate a training-free module that downsamples frames by motion magnitude and resolution to obtain smooth motion estimation. Experimental results on standard test datasets demonstrate that our framework CAMA achieves significant improvements over state-of-the-art Neural Video Compression models, achieving a 24.95% BD-rate (PSNR) savings over our baseline model DCVC-TCM, while also outperforming reproduced DCVC-DC and traditional codec HM-16.25.
๐
2025-12-15
๐ AAAI
๐ฅ Toan Le Ngo Thanh, Phat Ha Huu, Tan Nguyen Dang Duy, Thong Nguyen Le Minh, Anh Nguyen Nhu Tinh
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail across cross modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval, refined by BLIP-2 based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, Agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive moment search capabilities.
๐
2025-12-15
๐ AAAI 2026
๐ฅ Yifan Wu, Jiyue Jiang, Xichen Ye, Yiqi Wang, Chang Zhou et al.
ๆฅ็ๆ่ฆ
Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost, and builds upon it two simple yet effective selection strategies, namely Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We empirically validate our method on two representative BioFMs, RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent, demonstrating its effectiveness. Furthermore, we show the generalizability of our framework on protein-related tasks using ESM-C. In particular, our coreset even outperforms random subsets that are ten times larger in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huu-An Vu, Van-Khanh Mai, Trong-Tam Nguyen, Quang-Duc Dam, Tien-Huy Nguyen et al.
ๆฅ็ๆ่ฆ
The rapid expansion of video content across online platforms has accelerated the need for retrieval systems capable of understanding not only isolated visual moments but also the temporal structure of complex events. Existing approaches often fall short in modeling temporal dependencies across multiple events and in handling queries that reference unseen or rare visual concepts. To address these challenges, we introduce MADTempo, a video retrieval framework developed by our team, AIO_Trinh, that unifies temporal search with web-scale visual grounding. Our temporal search mechanism captures event-level continuity by aggregating similarity scores across sequential video segments, enabling coherent retrieval of multi-event queries. Complementarily, a Google Image Search-based fallback module expands query representations with external web imagery, effectively bridging gaps in pretrained visual embeddings and improving robustness against out-of-distribution (OOD) queries. Together, these components advance the temporal reasoning and generalization capabilities of modern video retrieval systems, paving the way for more semantically aware and adaptive retrieval across large-scale video corpora.
๐
2025-12-15
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Amy Chang, Tiffany Saade, Sanket Mendapara, Adam Swanda, Ankit Garg
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) systems are being readily and rapidly adopted, increasingly permeating critical domains: from consumer platforms and enterprise software to networked systems with embedded agents. While this has unlocked potential for human productivity gains, the attack surface has expanded accordingly: threats now span content safety failures (e.g., harmful or deceptive outputs), model and data integrity compromise (e.g., poisoning, supply-chain tampering), runtime manipulations (e.g., prompt injection, tool and agent misuse), and ecosystem risks (e.g., orchestration abuse, multi-agent collusion). Existing frameworks such as MITRE ATLAS, National Institute of Standards and Technology (NIST) AI 100-2 Adversarial Machine Learning (AML) taxonomy, and OWASP Top 10s for Large Language Models (LLMs) and Agentic AI Applications provide valuable viewpoints, but each covers only slices of this multi-dimensional space.
This paper presents Cisco's Integrated AI Security and Safety Framework ("AI Security Framework"), a unified, lifecycle-aware taxonomy and operationalization framework that can be used to classify, integrate, and operationalize the full range of AI risks. It integrates AI security and AI safety across modalities, agents, pipelines, and the broader ecosystem. The AI Security Framework is designed to be practical for threat identification, red-teaming, risk prioritization, and it is comprehensive in scope and can be extensible to emerging deployments in multimodal contexts, humanoids, wearables, and sensory infrastructures. We analyze gaps in prevailing frameworks, discuss design principles for our framework, and demonstrate how the taxonomy provides structure for understanding how modern AI systems fail, how adversaries exploit these failures, and how organizations can build defenses across the AI lifecycle that evolve alongside capability advancements.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Nijesh Upreti, Vaishak Belle
ๆฅ็ๆ่ฆ
Inductive Logic Programming (ILP) provides interpretable rule learning in relational domains, yet remains limited in its ability to induce and reason with numerical constraints. Classical ILP systems operate over discrete predicates and typically rely on discretisation or hand-crafted numerical predicates, making it difficult to infer thresholds or arithmetic relations that must hold jointly across examples. Recent work has begun to address these limitations through tighter integrations of ILP with Satisfiability Modulo Theories (SMT) or specialised numerical inference mechanisms. In this paper we investigate a modular alternative that couples the ILP system PyGol with the SMT solver Z3. Candidate clauses proposed by PyGol are interpreted as quantifier-free formulas over background theories such as linear or nonlinear real arithmetic, allowing numerical parameters to be instantiated and verified by the SMT solver while preserving ILP's declarative relational bias. This supports the induction of hybrid rules that combine symbolic predicates with learned numerical constraints, including thresholds, intervals, and multi-literal arithmetic relations. We formalise this SMT-ILP setting and evaluate it on a suite of synthetic datasets designed to probe linear, relational, nonlinear, and multi-hop reasoning. The results illustrate how a modular SMT-ILP architecture can extend the expressivity of symbolic rule learning, complementing prior numerical ILP approaches while providing a flexible basis for future extensions toward richer theory-aware induction.
๐
2025-12-15
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Shashie Dilhara Batan Arachchige, Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Dinusha Vatsalan, Dali Kaafar
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are often fine-tuned to adapt their general-purpose knowledge to specific tasks and domains such as cyber threat intelligence (CTI). Fine-tuning is mostly done through proprietary datasets that may contain sensitive information. Owners expect their fine-tuned model to not inadvertently leak this information to potentially adversarial end users. Using CTI as a use case, we demonstrate that data-extraction attacks can recover sensitive information from fine-tuned models on CTI reports, underscoring the need for mitigation. Retraining the full model to eliminate this leakage is computationally expensive and impractical. We propose an alternative approach, which we call privacy alignment, inspired by safety alignment in LLMs. Just like safety alignment teaches the model to abide by safety constraints through a few examples, we enforce privacy alignment through few-shot supervision, integrating a privacy classifier and a privacy redactor, both handled by the same underlying LLM. We evaluate our system, called CTIGuardian, using GPT-4o mini and Mistral-7B Instruct models, benchmarking against Presidio, a named entity recognition (NER) baseline. Results show that CTIGuardian provides a better privacy-utility trade-off than NER based models. While we demonstrate its effectiveness on a CTI use case, the framework is generic enough to be applicable to other sensitive domains.
๐
2025-12-15
๐ ArXiv Preprint (physics.optics)
๐ฅ David Dang, Stuart Love, Meena Salib, Quynh Dang, Samuel Rothfarb et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Advancing artificial intelligence for physical sciences requires representations that are both interpretable and compatible with the underlying laws of nature. We introduce METASTRINGS, a symbolic language for photonics that expresses nanostructures as textual sequences encoding materials, geometries, and lattice configurations. Analogous to molecular textual representations in chemistry, METASTRINGS provides a framework connecting human interpretability with computational design by capturing the structural hierarchy of photonic metasurfaces. Building on this representation, we develop Meta-GPT, a foundation transformer model trained on METASTRINGS and finetuned with physics-informed supervised, reinforcement, and chain-of-thought learning. Across various design tasks, the model achieves <3% mean-squared spectral error and maintains >98% syntactic validity, generating diverse metasurface prototypes whose experimentally measured optical responses match their target spectra. These results demonstrate that Meta-GPT can learn the compositional rules of light-matter interactions through METASTRINGS, laying a rigorous foundation for AI-driven photonics and representing an important step toward a metasurface genome project.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Susung Hong, Chongjian Ge, Zhifei Zhang, Jui-Hsien Wang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanwen Yue, Damien Robert, Jianyuan Wang, Sunghwan Hong, Jan Dirk Wegner et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundation flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the ``pre-training scaling problem`` and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pretraining of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pretraining VTP achieves 65.8\% FID improvement in downstream generation, while conventional autoencoder stagnates very early at 1/10 FLOPS. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient ``generalist'' encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lu Ling, Yunhao Ge, Yichen Sheng, Aniket Bera
ๆฅ็ๆ่ฆ
Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene dataset, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo et al.
Computer Vision
ๆฅ็ๆ่ฆ
Recent feed-forward reconstruction models like VGGT and $ฯ^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction %quality with offline models while operating at 14 FPS with 6 GB peak memory on a RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: $\href{https://neu-vi.github.io/LASER/}{\texttt{https://neu-vi.github.io/LASER/}}$
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziqi Ma, Hongqiao Chen, Yisong Yue, Georgia Gkioxari
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: https://glab-caltech.github.io/steer3d/
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaohu Huang, Hao Zhou, Qiangpeng Yang, Shilei Wen, Kai Han
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang et al.
Multimodal
ๆฅ็ๆ่ฆ
We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung Shim
MultimodalNatural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junwen Miao, Penghui Du, Yi Liu, Yu Wang, Yan Wang
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenhan Chen, Sezer Karaoglu, Theo Gevers
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou et al.
Robotics
ๆฅ็ๆ่ฆ
Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gabriel Vitorino de Andrade, Saulo Roberto dos Santos, Itallo Patrick Castro Alves da Silva, Emanuel Adler Medeiros Pereira, Erick de Andrade Barboza
Machine Learning
ๆฅ็ๆ่ฆ
The validation and verification of artificial intelligence (AI) models through robustness assessment are essential to guarantee the reliable performance of intelligent systems facing real-world challenges, such as image corruptions including noise, blurring, and weather variations. Despite the global importance of mango (Mangifera indica L.), there is a lack of studies on the robustness of models for the diagnosis of disease in its leaves. This paper proposes a methodology to evaluate convolutional neural networks (CNNs) under adverse conditions. We adapted the MangoLeafDB dataset, generating MangoLeafDB-C with 19 types of artificial corruptions at five severity levels. We conducted a benchmark comparing five architectures: ResNet-50, ResNet-101, VGG-16, Xception, and LCNN (the latter being a lightweight architecture designed specifically for mango leaf diagnosis). The metrics include the F1 score, the corruption error (CE) and the relative mean corruption error (relative mCE). The results show that LCNN outperformed complex models in corruptions that can be present in real-world scenarios such as Defocus Blur, Motion Blur, while also achieving the lowest mCE. Modern architectures (e.g., ResNet-101) exhibited significant performance degradation in corrupted scenarios, despite their high accuracy under ideal conditions. These findings suggest that lightweight and specialized models may be more suitable for real-world applications in edge devices, where robustness and efficiency are critical. The study highlights the need to incorporate robustness assessments in the development of intelligent systems for agriculture, particularly in regions with technological limitations.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Michal Nazarczuk, Thomas Tanay, Arthur Moreau, Zhensong Zhang, Eduardo Pรฉrez-Pellitero
ๆฅ็ๆ่ฆ
This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie et al.
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. The one LLM serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Chongyu Qu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: https://github.com/hrlblab/SCR2ST
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shweta Mahajan, Shreya Kadambi, Hoang Le, Munawar Hayat, Fatih Porikli
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Felix J. Dorfner, Manon A. Dorster, Ryan Connolly, Oscar Gentilhomme, Edward Gibbs et al.
ๆฅ็ๆ่ฆ
Foundation models have shown promise in medical imaging but remain underexplored for three-dimensional imaging modalities. No foundation model currently exists for Digital Breast Tomosynthesis (DBT), despite its use for breast cancer screening.
To develop and evaluate a foundation model for DBT (DBT-DINO) across multiple clinical tasks and assess the impact of domain-specific pre-training.
Self-supervised pre-training was performed using the DINOv2 methodology on over 25 million 2D slices from 487,975 DBT volumes from 27,990 patients. Three downstream tasks were evaluated: (1) breast density classification using 5,000 screening exams; (2) 5-year risk of developing breast cancer using 106,417 screening exams; and (3) lesion detection using 393 annotated volumes.
For breast density classification, DBT-DINO achieved an accuracy of 0.79 (95\% CI: 0.76--0.81), outperforming both the MetaAI DINOv2 baseline (0.73, 95\% CI: 0.70--0.76, p<.001) and DenseNet-121 (0.74, 95\% CI: 0.71--0.76, p<.001). For 5-year breast cancer risk prediction, DBT-DINO achieved an AUROC of 0.78 (95\% CI: 0.76--0.80) compared to DINOv2's 0.76 (95\% CI: 0.74--0.78, p=.57). For lesion detection, DINOv2 achieved a higher average sensitivity of 0.67 (95\% CI: 0.60--0.74) compared to DBT-DINO with 0.62 (95\% CI: 0.53--0.71, p=.60). DBT-DINO demonstrated better performance on cancerous lesions specifically with a detection rate of 78.8\% compared to Dinov2's 77.3\%.
Using a dataset of unprecedented size, we developed DBT-DINO, the first foundation model for DBT. DBT-DINO demonstrated strong performance on breast density classification and cancer risk prediction. However, domain-specific pre-training showed variable benefits on the detection task, with ImageNet baseline outperforming DBT-DINO on general lesion detection, indicating that localized detection tasks require further methodological development.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach-first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyue Zhang, Meera Chappidi, Erolcan Sayar, Helen Richards, Zhijun Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundational models (PFMs), have shown strong performance. However, PFMs exhibit limitations on certain cancer or specimen types due to domain shifts - these cancer types were rarely used for pretraining or specimens contain tissue-based artifacts rarely seen within the pretraining population. Such is the case for transurethral resection of bladder tumor (TURBT), which are essential for diagnosing muscle-invasive bladder cancer (MIBC), but contain fragmented tissue chips and electrocautery artifacts and were not widely used in publicly available PFMs. To address this, we propose a simple yet effective domain-adaptive self-supervised adaptor (DA-SSL) that realigns pretrained PFM features to the TURBT domain without fine-tuning the foundational model itself. We pilot this framework for predicting treatment response in TURBT, where histomorphological features are currently underutilized and identifying patients who will benefit from neoadjuvant chemotherapy (NAC) is challenging. In our multi-center study, DA-SSL achieved an AUC of 0.77+/-0.04 in five-fold cross-validation and an external test accuracy of 0.84, sensitivity of 0.71, and specificity of 0.91 using majority voting. Our results demonstrate that lightweight domain adaptation with self-supervision can effectively enhance PFM-based MIL pipelines for clinically challenging histopathology tasks. Code is Available at https://github.com/zhanghaoyue/DA_SSL_TURBT.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Christophe Bolduc, Julien Philip, Li Ma, Mingming He, Paul Debevec et al.
Robotics
ๆฅ็ๆ่ฆ
We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong et al.
Machine Learning
ๆฅ็ๆ่ฆ
The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver derived from general linear multistep methods, a lightweight, trainable high-order solver optimized via Reinforcement Learning, that enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on-par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at https://github.com/G-U-N/consolver.
๐
2025-12-15
๐ AAAI 2026
๐ฅ Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Bing Li et al.
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. MMhops dataset comprises two challenging task formats, Bridging and Comparison, which necessitate that models dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shun Maeda, Chunzhi Gu, Koichiro Kamide, Katsuya Hotta, Shangce Gao et al.
ๆฅ็ๆ่ฆ
Human-centric anomaly detection (AD) has been primarily studied to specify anomalous behaviors in a single person. However, as humans by nature tend to act in a collaborative manner, behavioral anomalies can also arise from human-human interactions. Detecting such anomalies using existing single-person AD models is prone to low accuracy, as these approaches are typically not designed to capture the complex and asymmetric dynamics of interactions. In this paper, we introduce a novel task, Human-Human Interaction Anomaly Detection (H2IAD), which aims to identify anomalous interactive behaviors within collaborative 3D human actions. To address H2IAD, we then propose Interaction Anomaly Detection Network (IADNet), which is formalized with a Temporal Attention Sharing Module (TASM). Specifically, in designing TASM, we share the encoded motion embeddings across both people such that collaborative motion correlations can be effectively synchronized. Moreover, we notice that in addition to temporal dynamics, human interactions are also characterized by spatial configurations between two people. We thus introduce a Distance-Based Relational Encoding Module (DREM) to better reflect social cues in H2IAD. The normalizing flow is eventually employed for anomaly scoring. Extensive experiments on human-human motion benchmarks demonstrate that IADNet outperforms existing Human-centric AD baselines in H2IAD.
๐
2025-12-15
๐ NeurIPS 2025
๐ฅ Marianne Rakic, Siyu Gai, Etienne Chollet, John V. Guttag, Adrian V. Dalca
Computer Vision
ๆฅ็ๆ่ฆ
A single biomedical image can be meaningfully segmented in multiple ways, depending on the desired application. For instance, a brain MRI can be segmented according to tissue types, vascular territories, broad anatomical regions, fine-grained anatomy, or pathology, etc. Existing automatic segmentation models typically either (1) support only a single protocol, the one they were trained on, or (2) require labor-intensive manual prompting to specify the desired segmentation. We introduce Pancakes, a framework that, given a new image from a previously unseen domain, automatically generates multi-label segmentation maps for multiple plausible protocols, while maintaining semantic consistency across related images. Pancakes introduces a new problem formulation that is not currently attainable by existing foundation models. In a series of experiments on seven held-out datasets, we demonstrate that our model can significantly outperform existing foundation models in producing several plausible whole-image segmentations, that are semantically coherent across images.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Piyush Bagad, Andrew Zisserman
Multimodal
ๆฅ็ๆ่ฆ
Our objective is to build a general time-aware video-text embedding model for retrieval. To that end, we propose a simple and efficient recipe, dubbed TARA (Time Aware Retrieval Adaptation), to adapt Multimodal LLMs (MLLMs) to a time-aware video-text embedding model without using any video data at all. For evaluating time-awareness in retrieval, we propose a new benchmark with temporally opposite (chiral) actions as hard negatives and curated splits for chiral and non-chiral actions. We show that TARA outperforms all existing video-text models on this chiral benchmark while also achieving strong results on standard benchmarks. Furthermore, we discover additional benefits of TARA beyond time-awareness: (i) TARA embeddings are negation-aware as shown in NegBench benchmark that evaluates negation in video retrieval, (ii) TARA achieves state of the art performance on verb and adverb understanding in videos. Overall, TARA yields a strong, versatile, time-aware video-text embedding model with state of the art zero-shot performance.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Siyan Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoyu Ren, Kay Koehle, Kirill Dorofeev, Darko Anicic
Robotics
ๆฅ็ๆ่ฆ
In modern manufacturing, Visual Anomaly Detection (VAD) is essential for automated inspection and consistent product quality. Yet, increasingly dynamic and flexible production environments introduce key challenges: First, frequent product changes in small-batch and on-demand manufacturing require rapid model updates. Second, legacy edge hardware lacks the resources to train and run large AI models. Finally, both anomalous and normal training data are often scarce, particularly for newly introduced product variations. We investigate on-device continual learning for unsupervised VAD with localization, extending the PatchCore to incorporate online learning for real-world industrial scenarios. The proposed method leverages a lightweight feature extractor and an incremental coreset update mechanism based on k-center selection, enabling rapid, memory-efficient adaptation from limited data while eliminating costly cloud retraining. Evaluations on an industrial use case are conducted using a testbed designed to emulate flexible production with frequent variant changes in a controlled environment. Our method achieves a 12% AUROC improvement over the baseline, an 80% reduction in memory usage, and faster training compared to batch retraining. These results confirm that our method delivers accurate, resource-efficient, and adaptive VAD suitable for dynamic and smart manufacturing.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiangning Zhang, Junwei Zhu, Zhenye Gan, Donghao Luo, Chuming Lin et al.
Multimodal
ๆฅ็ๆ่ฆ
We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed $\textbf{Soul}$, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4$\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page at https://zhangzjn.github.io/projects/Soul/
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiangning Zhang, Junwei Zhu, Teng Hu, Yabiao Wang, Donghao Luo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Native 4K (2160$\times$3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3}$ ($\textbf{T}$ransform $\textbf{T}$rained $\textbf{T}$ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, $\textbf{T3-Video}$ introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches: while delivering performance improvements (+4.29$\uparrow$ VQA and +0.08$\uparrow$ VTC), it accelerates native 4K video generation by more than 10$\times$. Project page at https://zhangzjn.github.io/projects/T3-Video
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruiyan Wang, Teng Hu, Kaihui Huang, Zihan Su, Ran Yi et al.
Robotics
ๆฅ็ๆ่ฆ
Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, thus generalizing poorly to pose of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that Pose-Anything significantly outperforms state-of-the-art methods in both effectiveness and generalization.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arpit Jadon, Joshua Niemeijer, Yuki M. Asano
ๆฅ็ๆ่ฆ
Generative foundation models contain broad visual knowledge and can produce diverse image variations, making them particularly promising for advancing domain generalization tasks. While they can be used for training data augmentation, synthesizing comprehensive target-domain variations remains slow, expensive, and incomplete. We propose an alternative: using diffusion models at test time to map target images back to the source distribution where the downstream model was trained. This approach requires only a source domain description, preserves the task model, and eliminates large-scale synthetic data generation. We demonstrate consistent improvements across segmentation, detection, and classification tasks under challenging environmental shifts in real-to-real domain generalization scenarios with unknown target distributions. Our analysis spans multiple generative and downstream models, including an ensemble variant for enhanced robustness. The method achieves substantial relative gains: 137% on BDD100K-Night, 68% on ImageNet-R, and 62% on DarkZurich.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thalyssa Baiocco-Rodrigues, Antoine Olivier, Reda Belbahri, Thomas Duboudin, Pierre-Antoine Bannier et al.
ๆฅ็ๆ่ฆ
As the therapeutic target for Inflammatory Bowel Disease (IBD) shifts toward histologic remission, the accurate assessment of microscopic inflammation has become increasingly central for evaluating disease activity and response to treatment. In this work, we introduce IMILIA (Interpretable Multiple Instance Learning for Inflammation Analysis), an end-to-end framework designed for the prediction of inflammation presence in IBD digitized slides stained with hematoxylin and eosin (H&E), followed by the automated computation of markers characterizing tissue regions driving the predictions. IMILIA is composed of an inflammation prediction module, consisting of a Multiple Instance Learning (MIL) model, and an interpretability module, divided in two blocks: HistoPLUS, for cell instance detection, segmentation and classification; and EpiSeg, for epithelium segmentation. IMILIA achieves a cross-validation ROC-AUC of 0.83 on the discovery cohort, and a ROC-AUC of 0.99 and 0.84 on two external validation cohorts. The interpretability module yields biologically consistent insights: tiles with higher predicted scores show increased densities of immune cells (lymphocytes, plasmocytes, neutrophils and eosinophils), whereas lower-scored tiles predominantly contain normal epithelial cells. Notably, these patterns were consistent across all datasets. Code and models to partially replicate the results on the public IBDColEpi dataset can be found at https://github.com/owkin/imilia.
๐
2025-12-15
๐ ArXiv Preprint (eess.IV)
๐ฅ Youssef Megahed, Inok Lee, Robin Ducharme, Kevin Dick, Adrian D. C. Chan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Prenatal ultrasound is the cornerstone for detecting congenital anomalies of the kidneys and urinary tract, but diagnosis is limited by operator dependence and suboptimal imaging conditions. We sought to assess the performance of a self-supervised ultrasound foundation model for automated fetal renal anomaly classification using a curated dataset of 969 two-dimensional ultrasound images. A pretrained Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE) was fine-tuned for binary and multi-class classification of normal kidneys, urinary tract dilation, and multicystic dysplastic kidney. Models were compared with a DenseNet-169 convolutional baseline using cross-validation and an independent test set. USF-MAE consistently improved upon the baseline across all evaluation metrics in both binary and multi-class settings. USF-MAE achieved an improvement of about 1.87% (AUC) and 7.8% (F1-score) on the validation set, 2.32% (AUC) and 4.33% (F1-score) on the independent holdout test set. The largest gains were observed in the multi-class setting, where the improvement in AUC was 16.28% and 46.15% in F1-score. To facilitate model interpretability, Score-CAM visualizations were adapted for a transformer architecture and show that model predictions were informed by known, clinically relevant renal structures, including the renal pelvis in urinary tract dilation and cystic regions in multicystic dysplastic kidney. These results show that ultrasound-specific self-supervised learning can generate a useful representation as a foundation for downstream diagnostic tasks. The proposed framework offers a robust, interpretable approach to support the prenatal detection of renal anomalies and demonstrates the promise of foundation models in obstetric imaging.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anika Islam, Tasfia Tahsin, Zaarin Anjum, Md. Bakhtiar Hasan, Md. Hasanul Kabir
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Accurate and timely identification of plant leaf diseases is essential for resilient and sustainable agriculture, yet most deep learning approaches rely on large annotated datasets and computationally intensive models that are unsuitable for data-scarce and resource-constrained environments. To address these challenges we present a few-shot learning approach within a lightweight yet efficient framework that combines domain-adapted MobileNetV2 and MobileNetV3 models as feature extractors, along with a feature fusion technique to generate robust feature representation. For the classification task, the fused features are passed through a Bi-LSTM classifier enhanced with attention mechanisms to capture sequential dependencies and focus on the most relevant features, thereby achieving optimal classification performance even in complex, real-world environments with noisy or cluttered backgrounds. The proposed framework was evaluated across multiple experimental setups, including both laboratory-controlled and field-captured datasets. On tomato leaf diseases from the PlantVillage dataset, it consistently improved performance across 1 to 15 shot scenarios, reaching 98.23+-0.33% at 15 shot, closely approaching the 99.98% SOTA benchmark achieved by a Transductive LSTM with attention, while remaining lightweight and mobile-friendly. Under real-world conditions using field images from the Dhan Shomadhan dataset, it maintained robust performance, reaching 69.28+-1.49% at 15-shot and demonstrating strong resilience to complex backgrounds. Notably, it also outperformed the previous SOTA accuracy of 96.0% on six diseases from PlantVillage, achieving 99.72% with only 15-shot learning. With a compact model size of approximately 40 MB and inference complexity of approximately 1.12 GFLOPs, this work establishes a scalable, mobile-ready foundation for precise plant disease diagnostics in data-scarce regions.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Noa Cohen, Nurit Spingarn-Eliezer, Inbar Huberman-Spiegelglas, Tomer Michaeli
Machine Learning
ๆฅ็ๆ่ฆ
Text-to-Image (TTI) models generate images based on text prompts, which often leave certain aspects of the desired image ambiguous. When faced with these ambiguities, TTI models have been shown to exhibit biases in their interpretations. These biases can have societal impacts, e.g., when showing only a certain race for a stated occupation. They can also affect user experience when creating redundancy within a set of generated images instead of spanning diverse possibilities. Here, we introduce MineTheGap - a method for automatically mining prompts that cause a TTI model to generate biased outputs. Our method goes beyond merely detecting bias for a given prompt. Rather, it leverages a genetic algorithm to iteratively refine a pool of prompts, seeking for those that expose biases. This optimization process is driven by a novel bias score, which ranks biases according to their severity, as we validate on a dataset with known biases. For a given prompt, this score is obtained by comparing the distribution of generated images to the distribution of LLM-generated texts that constitute variations on the prompt. Code and examples are available on the project's webpage.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qingyu Shi, Size Wu, Jinbin Bai, Kaidong Yu, Yujing Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction--alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. And we further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on the gFID-50K under both with and without classifier-free guidance settings, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at https://shi-qingyu.github.io/rectok.github.io.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoxuan Qu, Qiuchi Xiang, Yujun Cai, Yirui Wu, Majid Mirmehdi et al.
Computer Vision
ๆฅ็ๆ่ฆ
Unexploitable example generation aims to transform personal images into their unexploitable (unlearnable) versions before they are uploaded online, thereby preventing unauthorized exploitation of online personal images. Recently, this task has garnered significant research attention due to its critical relevance to personal data privacy. Yet, despite recent progress, existing methods for this task can still suffer from limited practical applicability, as they can fail to generate examples that are broadly unexploitable across different real-world computer vision tasks. To deal with this problem, in this work, we propose a novel Meta Cross-Task Unexploitable Example Generation (MCT-UEG) framework. At the core of our framework, to optimize the unexploitable example generator for effectively producing broadly unexploitable examples, we design a flat-minima-oriented meta training and testing scheme. Extensive experiments show the efficacy of our framework.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ahmed Abul Hasanaath, Hamzah Luqman
Natural Language Processing
ๆฅ็ๆ่ฆ
Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail in capturing fine-grained hand and facial cues and modeling long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns using a combination of a Swin Transformer backbone enhanced with lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmarked datasets including PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at https://github.com/gufranSabri/USTM
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Patryk Niลผeniec, Marcin Iwanowski
Computer Vision
ๆฅ็ๆ่ฆ
This paper introduces a novel pipeline for generating large-scale, highly realistic, and automatically labeled datasets for computer vision tasks in robotic environments. Our approach addresses the critical challenges of the domain gap between synthetic and real-world imagery and the time-consuming bottleneck of manual annotation. We leverage 3D Gaussian Splatting (3DGS) to create photorealistic representations of the operational environment and objects. These assets are then used in a game engine where physics simulations create natural arrangements. A novel, two-pass rendering technique combines the realism of splats with a shadow map generated from proxy meshes. This map is then algorithmically composited with the image to add both physically plausible shadows and subtle highlights, significantly enhancing realism. Pixel-perfect segmentation masks are generated automatically and formatted for direct use with object detection models like YOLO. Our experiments show that a hybrid training strategy, combining a small set of real images with a large volume of our synthetic data, yields the best detection and segmentation performance, confirming this as an optimal strategy for efficiently achieving robust and accurate models.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lorenzo Pettinari, Sidaty El Hadramy, Michael Wehrli, Philippe C. Cattin, Daniel Studer et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Purpose: Intraoperative navigation in spine surgery demands millimeter-level accuracy. Current systems based on intraoperative radiographic imaging and bone-anchored markers are invasive, radiation-intensive and workflow disruptive. Recent markerless RGB-D registration methods offer a promising alternative, but existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, which can propagate errors throughout registration. Methods: We present End2Reg an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for weak segmentation labels and manual steps. The network learns segmentation masks specifically optimized for registration, guided solely by the registration objective without direct segmentation supervision. Results: The proposed framework achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% to 1.83mm and mean Root Mean Square Error by 45% to 3.95mm, respectively. An ablation study confirms that end-to-end optimization significantly improves registration accuracy. Conclusion: The presented end-to-end RGB-D registration pipeline removes dependency on weak labels and manual steps, advancing towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Malte Silbernagel, Albert Alonso, Jens Petersen, Bulat Ibragimov, Marleen de Bruijne et al.
ๆฅ็ๆ่ฆ
Accurately predicting topologically correct masks remains a difficult task for general segmentation models, which often produce fragmented or disconnected outputs. Fixing these artifacts typically requires hand-crafted refinement rules or architectures specialized to a particular task. Here, we show that Neural Cellular Automata (NCA) can be directly re-purposed as an effective refinement mechanism, using local, iterative updates guided by image context to repair segmentation masks. By training on imperfect masks and ground truths, the automaton learns the structural properties of the target shape while relying solely on local information. When applied to coarse, globally predicted masks, the learned dynamics progressively reconnect broken regions, prune loose fragments and converge towards stable, topologically consistent results. We show how refinement NCA (rNCA) can be easily applied to repair common topological errors produced by different base segmentation models and tasks: for fragmented retinal vessels, it yields 2-3% gains in Dice/clDice and improves Betti errors, reducing $ฮฒ_0$ errors by 60% and $ฮฒ_1$ by 20%; for myocardium, it repairs 61.5% of broken cases in a zero-shot setting while lowering ASSD and HD by 19% and 16%, respectively. This showcases NCA as effective and broadly applicable refiners.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anran Qi, Changjian Li, Adrien Bousseau, Niloy J. Mitra
Robotics
ๆฅ็ๆ่ฆ
We address image-to-video generation with explicit user control over the final frame's disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow to leverage diffusion as a motion-guided shader. We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages against state-of-the-art alternatives towards images turned into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable controls, in the form of appearance specification in the final frame in the disoccluded regions, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: https://anranqi.github.io/beyondvisible.github.io/
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Carla Monteiro, Valentina Corbetta, Regina Beets-Tan, Luรญs F. Teixeira, Wilson Silva
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that leverages the intrinsic robustness of DINO self-attention "key" features for robust segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach leverages the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, resulting in enhanced performance and better generalizability. We validate our approach using a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models like nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework's evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Elizaveta Prozorova, Anton Konev, Vladimir Faerman
Machine Learning
ๆฅ็ๆ่ฆ
The article analyzes the use of thermal imaging technologies for biometric identification based on facial thermograms. It presents a comparative analysis of infrared spectral ranges (NIR, SWIR, MWIR, and LWIR). The paper also defines key requirements for thermal cameras used in biometric systems, including sensor resolution, thermal sensitivity, and a frame rate of at least 30 Hz. Siamese neural networks are proposed as an effective approach for automating the identification process. In experiments conducted on a proprietary dataset, the proposed method achieved an accuracy of approximately 80%. The study also examines the potential of hybrid systems that combine visible and infrared spectra to overcome the limitations of individual modalities. The results indicate that thermal imaging is a promising technology for developing reliable security systems.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mikhail Zakharov
Computer Vision
ๆฅ็ๆ่ฆ
Face recognition systems rely on learning highly discriminative and compact identity clusters to enable accurate retrieval. However, as with other surveillance-oriented technologies, such systems raise serious privacy concerns due to their potential for unauthorized identity tracking. While several works have explored machine unlearning as a means of privacy protection, their applicability to face retrieval - especially for modern embedding-based recognition models - remains largely unexplored. In this work, we study the problem of face identity unlearning for retrieval systems and present its inherent challenges. The goal is to make selected identities unretrievable by dispersing their embeddings on the hypersphere and preventing the formation of compact identity clusters that enable re-identification in the gallery. The primary challenge is to achieve this forgetting effect while preserving the discriminative structure of the embedding space and the retrieval performance of the model for the remaining identities. To address this, we evaluate several existing approximate class unlearning methods (e.g., Random Labeling, Gradient Ascent, Boundary Unlearning, and other recent approaches) in the context of face retrieval and propose a simple yet effective dispersion-based unlearning approach. Extensive experiments on standard benchmarks (VGGFace2, CelebA) demonstrate that our method achieves superior forgetting behavior while preserving retrieval utility.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhihang Liu, Xiaoyi Bao, Pandeng Li, Junjie Zhou, Zhaohe Liao et al.
Computer Vision
ๆฅ็ๆ่ฆ
While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator for reasoning the visual plan and judging visual errors to provide refined instructions, the diffusion execute the commands from MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shu Yu, Chaochao Lu
ๆฅ็ๆ่ฆ
Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at https://opencausalab.github.io/LINA.
๐
2025-12-15
๐ AAAI 2026
๐ฅ Bo Liu, Qiao Qin, Qinghui He
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu et al.
ๆฅ็ๆ่ฆ
Recent advances in video generation have produced vivid content that are often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus on classification solely. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: \textbf{(i) Immersive ASMR video-audio sources.} Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. \textbf{(ii) Peer-Review evaluation.} An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show: The best creator Veo3.1-Fast even fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56\% accuracy (random 50\%), far below that of human experts (81.25\%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yan Li, Lin Liu, Xiaopeng Zhang, Wei Xue, Wenhan Luo et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods struggle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose a unified framework CogniEdit, combining multi-modal reasoning with dense reward optimization that propagates gradients across consecutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that our CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Hyunki Seong, Jeong-Kyun Lee, Heesoo Myeong, Yongho Shin, Hyun-Mook Cho et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Our Warm-K method-based test-time scaling enhances behavioral consistency and reactivity at test time without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Juil Koo, Daehyeon Choi, Sangwoo Youn, Phillip Y. Lee, Minhyuk Sung
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Foivos Paraperas Papantoniou, Stathis Galanakis, Rolandos Alexandros Potamias, Bernhard Kainz, Stefanos Zafeiriou
ๆฅ็ๆ่ฆ
This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Francesco Ragusa, Michele Mazzamuto, Rosario Forte, Irene D'Ambra, James Fort et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects performing procedural activities (the trainees) while guided by real-world experts who provide guidance and answer specific questions using natural language. Following a ``Wizard of OZ'' data collection paradigm, the expert enacts a wearable intelligent assistant, looking at the activities performed by the trainee exclusively from their egocentric point of view, answering questions when asked by the trainee, or proactively interacting with suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture a high-quality dialogue in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 15k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user. The Ego-EXTRA dataset is publicly available to support the benchmark of egocentric video-language assistants: https://fpv-iplab.github.io/Ego-EXTRA/.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhuo Chen, Chengqun Yang, Zhuo Su, Zheng Lv, Jingnan Gao et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining "chicken-and-egg" cycle for scalable and reproducible portrait illumination.
๐
2025-12-15
๐ AAAI 2026
๐ฅ Gong Chen, Chaokun Zhang, Pengcheng Lv, Xiaohui Xie
ๆฅ็ๆ่ฆ
Collaborative perception has garnered significant attention as a crucial technology to overcome the perceptual limitations of single-agent systems. Many state-of-the-art (SOTA) methods have achieved communication efficiency and high performance via intermediate fusion. However, they share a critical vulnerability: their performance degrades under adverse communication conditions due to the misalignment induced by data transmission, which severely hampers their practical deployment. To bridge this gap, we re-examine different fusion paradigms, and recover that the strengths of intermediate and late fusion are not a trade-off, but a complementary pairing. Based on this key insight, we propose CoRA, a novel collaborative robust architecture with a hybrid approach to decouple performance from robustness with low communication. It is composed of two components: a feature-level fusion branch and an object-level correction branch. Its first branch selects critical features and fuses them efficiently to ensure both performance and scalability. The second branch leverages semantic relevance to correct spatial displacements, guaranteeing resilience against pose errors. Experiments demonstrate the superiority of CoRA. Under extreme scenarios, CoRA improves upon its baseline performance by approximately 19% in AP@0.7 with more than 5x less communication volume, which makes it a promising solution for robust collaborative perception.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minghui Hou, Wei-Hsing Huang, Shaofeng Liang, Daizong Liu, Tai-Hao Wen et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Vision-language models enable the understanding and reasoning of complex traffic scenarios through multi-source information fusion, establishing it as a core technology for autonomous driving. However, existing vision-language models are constrained by the image understanding paradigm in 2D plane, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, an multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities, including occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongxuan Sun, Tao Wu
ๆฅ็ๆ่ฆ
Semantic segmentation requires a holistic understanding of the physical world, as it assigns semantic labels to spatially continuous and structurally coherent objects rather than to isolated pixels. However, existing data-free knowledge distillation (DFKD) methods-primarily designed for classification-often disregard this continuity, resulting in significant performance degradation when applied directly to segmentation tasks. In this paper, we introduce DFSS, a novel data-free distillation framework tailored for semantic segmentation. Unlike prior approaches that treat pixels independently, DFSS respects the structural and contextual continuity of real-world scenes. Our key insight is to leverage Batch Normalization (BN) statistics from a teacher model to guide Approximate Distribution Sampling (ADS), enabling the selection of data that better reflects the original training distribution-without relying on potentially misleading teacher predictions. Additionally, we propose Weighted Distribution Progressive Distillation (WDPD), which dynamically prioritizes reliable samples that are more closely aligned with the original data distribution early in training and gradually incorporates more challenging cases, mirroring the natural progression of learning in human perception. Extensive experiments on standard benchmarks demonstrate that DFSS consistently outperforms existing data-free distillation methods for semantic segmentation, achieving state-of-the-art results with significantly reduced reliance on auxiliary data.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xianchao Guan, Zhiyuan Fan, Yifeng Wang, Fuqiang Chen, Yanjiang Zhou et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.
๐
2025-12-15
๐ WWW
๐ฅ Peter Kocsis, Lukas Hรถllein, Matthias Nieรner
Machine Learning
ๆฅ็ๆ่ฆ
We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images. Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view. To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions. We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse path tracing to optimize for the low-dimensional parameters. Our results outperform state-of-the-art methods in material disentanglement on both synthetic and real scenes, producing sharp and clean reconstructions suitable for high-quality relighting.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sangmin Hong, Suyoung Lee, Kyoung Mu Lee
ๆฅ็ๆ่ฆ
The problem of depth completion involves predicting a dense depth image from a single sparse depth map and an RGB image. Unsupervised depth completion methods have been proposed for various datasets where ground truth depth data is unavailable and supervised methods cannot be applied. However, these models require auxiliary data to estimate depth values, which is far from real scenarios. Monocular depth estimation (MDE) models can produce a plausible relative depth map from a single image, but there is no work to properly combine the sparse depth map with MDE for depth completion; a simple affine transformation to the depth map will yield a high error since MDE are inaccurate at estimating depth difference between objects. We introduce StarryGazer, a domain-agnostic framework that predicts dense depth images from a single sparse depth image and an RGB image without relying on ground-truth depth by leveraging the power of large MDE models. First, we employ a pre-trained MDE model to produce relative depth images. These images are segmented and randomly rescaled to form synthetic pairs for dense pseudo-ground truth and corresponding sparse depths. A refinement network is trained with the synthetic pairs, incorporating the relative depth maps and RGB images to improve the model's accuracy and robustness. StarryGazer shows superior results over existing unsupervised methods and transformed MDE results on various datasets, demonstrating that our framework exploits the power of MDE models while appropriately fixing errors using sparse depth information.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chun Kit Wong, Paraskevas Pegios, Nina Weng, Emilie Pi Fogtmann Sejer, Martin Grรธnnebรฆk Tolsgaard et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning models in medical imaging are susceptible to shortcut learning, relying on confounding metadata (e.g., scanner model) that is often encoded in image embeddings. The crucial question is whether the model actively utilizes this encoded information for its final prediction. We introduce Weight Space Correlation Analysis, an interpretable methodology that quantifies feature utilization by measuring the alignment between the classification heads of a primary clinical task and auxiliary metadata tasks. We first validate our method by successfully detecting artificially induced shortcut learning. We then apply it to probe the feature utilization of an SA-SonoNet model trained for Spontaneous Preterm Birth (sPTB) prediction. Our analysis confirmed that while the embeddings contain substantial metadata, the sPTB classifier's weight vectors were highly correlated with clinically relevant factors (e.g., birth weight) but decoupled from clinically irrelevant acquisition factors (e.g. scanner). Our methodology provides a tool to verify model trustworthiness, demonstrating that, in the absence of induced bias, the clinical model selectively utilizes features related to the genuine clinical signal.
๐
2025-12-15
๐ IEEE Transactions on Image Processing
๐ฅ Xin Guo, Yifan Zhao, Jia Li
ๆฅ็ๆ่ฆ
Generating 3D-based body movements from speech shows great potential in extensive downstream applications, while it still suffers challenges in imitating realistic human movements. Predominant research efforts focus on end-to-end generation schemes to generate co-speech gestures, spanning GANs, VQ-VAE, and recent diffusion models. As an ill-posed problem, in this paper, we argue that these prevailing learning schemes fail to model crucial inter- and intra-correlations across different motion units, i.e. head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Different from predominant research, our approach models this multi-modal implicit relationship by two explicit technique insights: i) To disentangle the complicated gesture movements, we first explore the gesture motion phase manifolds with periodic autoencoders to imitate human natures from realistic distributions while incorporating non-period ones from current latent states for instance-level diversities. ii) To model the hierarchical relationship of face motions, body gestures, and hand movements, driving the animation with cascaded guidance during learning. We exhibit our proposed approach on 3D avatars and extensive experiments show our method outperforms the state-of-the-art co-speech gesture generation methods by both quantitative and qualitative evaluations. Code and models will be publicly available.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shanghua Liu, Majharulislam Babor, Christoph Verduyn, Breght Vandenberghe, Bruno Betoni Parodi et al.
Machine Learning
ๆฅ็ๆ่ฆ
High resolution phenotyping at the level of individual leaves offers fine-grained insights into plant development and stress responses. However, the full potential of accurate leaf tracking over time remains largely unexplored due to the absence of robust tracking methods-particularly for structurally complex crops such as canola. Existing plant-specific tracking methods are typically limited to small-scale species or rely on constrained imaging conditions. In contrast, generic multi-object tracking (MOT) methods are not designed for dynamic biological scenes. Progress in the development of accurate leaf tracking models has also been hindered by a lack of large-scale datasets captured under realistic conditions. In this work, we introduce CanolaTrack, a new benchmark dataset comprising 5,704 RGB images with 31,840 annotated leaf instances spanning the early growth stages of 184 canola plants. To enable accurate leaf tracking over time, we introduce LeafTrackNet, an efficient framework that combines a YOLOv10-based leaf detector with a MobileNetV3-based embedding network. During inference, leaf identities are maintained over time through an embedding-based memory association strategy. LeafTrackNet outperforms both plant-specific trackers and state-of-the-art MOT baselines, achieving a 9% HOTA improvement on CanolaTrack. With our work we provide a new standard for leaf-level tracking under realistic conditions and we provide CanolaTrack - the largest dataset for leaf tracking in agriculture crops, which will contribute to future research in plant phenotyping. Our code and dataset are publicly available at https://github.com/shl-shawn/LeafTrackNet.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vivek Alumootil, Tuan-Anh Vu, M. Khalid Jawed
Computer Vision
ๆฅ็ๆ่ฆ
Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering to input frames, constraining their flexibility and applicability. Additionally, recent advances have successfully enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring opportunities for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency-especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and codes are available via the open repository: https://github.com/StructuresComp/DePT3R
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhijian He, Feifei Liu, Yuwei Li, Zhanpeng Liu, Jintao Cheng et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR restoring images degraded by weather effects and Point Cloud Restoration (PCR) compensating for corrupted LiDAR data using image object cues. To tackle misalignments between two modalities, we develop Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird's-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yan Zhang, Baoxin Li, Han Sun, Yuhang Gao, Mingtai Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Forest pests threaten ecosystem stability, requiring efficient monitoring. To overcome the limitations of traditional methods in large-scale, fine-grained detection, this study focuses on accurately identifying infected trees and analyzing infestation patterns. We propose FID-Net, a deep learning model that detects pest-affected trees from UAV visible-light imagery and enables infestation analysis via three spatial metrics. Based on YOLOv8n, FID-Net introduces a lightweight Feature Enhancement Module (FEM) to extract disease-sensitive cues, an Adaptive Multi-scale Feature Fusion Module (AMFM) to align and fuse dual-branch features (RGB and FEM-enhanced), and an Efficient Channel Attention (ECA) mechanism to enhance discriminative information efficiently. From detection results, we construct a pest situation analysis framework using: (1) Kernel Density Estimation to locate infection hotspots; (2) neighborhood evaluation to assess healthy trees' infection risk; (3) DBSCAN clustering to identify high-density healthy clusters as priority protection zones. Experiments on UAV imagery from 32 forest plots in eastern Tianshan, China, show that FID-Net achieves 86.10% precision, 75.44% recall, 82.29% mAP@0.5, and 64.30% mAP@0.5:0.95, outperforming mainstream YOLO models. Analysis confirms infected trees exhibit clear clustering, supporting targeted forest protection. FID-Net enables accurate tree health discrimination and, combined with spatial metrics, provides reliable data for intelligent pest monitoring, early warning, and precise management.
๐
2025-12-15
๐ This work has been submitted to the IEEE TMI for p
๐ฅ Wenjing Lu, Yi Hong, Yang Yang
Computer Vision
ๆฅ็ๆ่ฆ
Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Feng Zhang, Zezhong Tan, Xinhong Ma, Ziqiang Dong, Xi Leng et al.
Machine Learning
ๆฅ็ๆ่ฆ
To combine the advantages of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), recent methods have integrated ''hints'' into post-training, which are prefix segments of complete reasoning trajectories, aiming for powerful knowledge expansion and reasoning generalization. However, existing hint-based RL methods typically ignore difficulty when scheduling hint ratios and estimating relative advantages, leading to unstable learning and excessive imitation of off-policy hints. In this work, we propose ADHint, which treats difficulty as a key factor in both hint-ratio schedule and relative-advantage estimation to achieve a better trade-off between exploration and imitation. Specifically, we propose Adaptive Hint with Sample Difficulty Prior, which evaluates each sample's difficulty under the policy model and accordingly schedules an appropriate hint ratio to guide its rollouts. We also introduce Consistency-based Gradient Modulation and Selective Masking for Hint Preservation to modulate token-level gradients within hints, preventing biased and destructive updates. Additionally, we propose Advantage Estimation with Rollout Difficulty Posterior, which leverages the relative difficulty of rollouts with and without hints to estimate their respective advantages, thereby achieving more balanced updates. Extensive experiments across diverse modalities, model scales, and domains demonstrate that ADHint delivers superior reasoning ability and out-of-distribution generalization, consistently surpassing existing methods in both pass@1 and avg@8. Our code and dataset will be made publicly available upon paper acceptance.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziqiang Zhu, Bowei Yang
Multimodal
ๆฅ็ๆ่ฆ
Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at https://github.com/Die-Xie/UniVCD.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saumyaranjan Mohanty, Aravind Reddy, Konda Reddy Mopuri
ๆฅ็ๆ่ฆ
In Dataset Condensation, the goal is to synthesize a small dataset that replicates the training utility of a large original dataset. Existing condensation methods synthesize datasets with significant redundancy, so there is a dire need to reduce redundancy and improve the diversity of the synthesized datasets. To tackle this, we propose an intuitive Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance, which can be applied off-the-shelf to various state-of-the-art condensation methods. Through extensive experiments, we demonstrate that the addition of our regularizer improves state-of-the-art condensation methods on various benchmark datasets from CIFAR-10 to ImageNet-1K with respect to generalization and diversity metrics.
๐
2025-12-15
๐ Published in Journal of Theoretical and Applied In
๐ฅ Mohaiminul Islam Bhuiyan, Chan Hue Wah, Nur Shazwani Kamarudin, Nur Hafieza Ismail, Ahmad Fakhri Ab Nasir
Machine Learning
ๆฅ็ๆ่ฆ
This study provides an overview of heart disease prediction using an intelligent system. Predicting disease accurately is crucial in the medical field, but traditional methods relying solely on a doctor's experience often lack precision. To address this limitation, intelligent systems are applied as an alternative to traditional approaches. While various intelligent system methods exist, this study focuses on three: Fuzzy Logic, Neural Networks, and Case-Based Reasoning (CBR). A comparison of these techniques in terms of accuracy was conducted, and ultimately, Case-Based Reasoning (CBR) was selected for heart disease prediction. In the prediction phase, the heart disease dataset underwent data pre-processing to clean the data and data splitting to separate it into training and testing sets. The chosen intelligent system was then employed to predict heart disease outcomes based on the processed data. The experiment concluded with Case-Based Reasoning (CBR) achieving a notable accuracy rate of 97.95% in predicting heart disease. The findings also revealed that the probability of heart disease was 57.76% for males and 42.24% for females. Further analysis from related studies suggests that factors such as smoking and alcohol consumption are significant contributors to heart disease, particularly among males.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zizhi Chen, Yizhen Gao, Minghao Han, Yizhou Liu, Zhaoyu Chen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous \textbf{M}edical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.
๐
2025-12-15
๐ AAAI 2026
๐ฅ Jaeyoon Kim, Yoonki Cho, Sung-Eui Yoon
ๆฅ็ๆ่ฆ
Visual Place Recognition (VPR) has advanced significantly with high-capacity foundation models like DINOv2, achieving remarkable performance. Nonetheless, their substantial computational cost makes deployment on resource-constrained devices impractical. In this paper, we introduce an efficient asymmetric VPR framework that incorporates a high-capacity gallery model for offline feature extraction with a lightweight query network for online processing. A key challenge in this setting is ensuring compatibility between these heterogeneous networks, which conventional approaches address through computationally expensive k-NN-based compatible training. To overcome this, we propose a geographical memory bank that structures gallery features using geolocation metadata inherent in VPR databases, eliminating the need for exhaustive k-NN computations. Additionally, we introduce an implicit embedding augmentation technique that enhances the query network to model feature variations despite its limited capacity. Extensive experiments demonstrate that our method not only significantly reduces computational costs but also outperforms existing asymmetric retrieval techniques, establishing a new aspect for VPR in resource-limited environments. The code is available at https://github.com/jaeyoon1603/AsymVPR
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Chen, Yiwei Wang, Songze Li
Computer Vision
ๆฅ็ๆ่ฆ
Concept erasure, which fine-tunes diffusion models to remove undesired or harmful visual concepts, has become a mainstream approach to mitigating unsafe or illegal image generation in text-to-image models.However, existing removal methods typically adopt a unidirectional erasure strategy by either suppressing the target concept or reinforcing safe alternatives, making it difficult to achieve a balanced trade-off between concept removal and generation quality. To address this limitation, we propose a novel Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that performs concept suppression and safety enhancement simultaneously. Specifically, based on the joint representation of text prompts and corresponding images, Bi-Erasing introduces two decoupled image branches: a negative branch responsible for suppressing harmful semantics and a positive branch providing visual guidance for safe alternatives. By jointly optimizing these complementary directions, our approach achieves a balance between erasure efficacy and generation usability. In addition, we apply mask-based filtering to the image branches to prevent interference from irrelevant content during the erasure process. Across extensive experiment evaluations, the proposed Bi-Erasing outperforms baseline methods in balancing concept removal effectiveness and visual fidelity.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tomoya Tanaka, Tomonori Ikeda, Ryo Yonemoto
Machine Learning
ๆฅ็ๆ่ฆ
This study presents the first comprehensive comparison of rule-based methods, traditional machine learning models, and deep learning models in radio wave sensing with frequency modulated continuous wave multiple input multiple output radar. We systematically evaluated five approaches in two indoor environments with distinct layouts: a rule-based connected component method; three traditional machine learning models, namely k-nearest neighbors, random forest, and support vector machine; and a deep learning model combining a convolutional neural network and long short term memory. In the training environment, the convolutional neural network long short term memory model achieved the highest accuracy, while traditional machine learning models provided moderate performance. In a new layout, however, all learning based methods showed significant degradation, whereas the rule-based method remained stable. Notably, for binary detection of presence versus absence of people, all models consistently achieved high accuracy across layouts. These results demonstrate that high capacity models can produce fine grained outputs with high accuracy in the same environment, but they are vulnerable to domain shift. In contrast, rule-based methods cannot provide fine grained outputs but exhibit robustness against domain shift. Moreover, regardless of the model type, a clear trade off was revealed between spatial generalization performance and output granularity.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheeun Hong, German Barquero, Fadime Sener, Markos Georgopoulos, Edgar Schรถnfeld et al.
Robotics
ๆฅ็ๆ่ฆ
Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address the exposure bias issue at inference time; (3) multi-prompt conditioning, which provides fine-grained and procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation where future prompt updates dynamically influence ongoing streaming video generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tomoya Tanaka, Tomonori Ikeda, Ryo Yonemoto
Machine Learning
ๆฅ็ๆ่ฆ
This study presents the first comprehensive evaluation of spatial generalization techniques, which are essential for the practical deployment of deep learning-based radio-frequency (RF) sensing. Focusing on people counting in indoor environments using frequency-modulated continuous-wave (FMCW) multiple-input multiple-output (MIMO) radar, we systematically investigate a broad set of approaches, including amplitude-based statistical preprocessing (sigmoid weighting and threshold zeroing), frequency-domain filtering, autoencoder-based background suppression, data augmentation strategies, and transfer learning. Experimental results collected across two environments with different layouts demonstrate that sigmoid-based amplitude weighting consistently achieves superior cross-environment performance, yielding 50.1% and 55.2% reductions in root-mean-square error (RMSE) and mean absolute error (MAE), respectively, compared with baseline methods. Data augmentation provides additional though modest benefits, with improvements up to 8.8% in MAE. By contrast, transfer learning proves indispensable for large spatial shifts, achieving 82.1% and 91.3% reductions in RMSE and MAE, respectively, with 540 target-domain samples. Taken together, these findings establish a highly practical direction for developing radar sensing systems capable of maintaining robust accuracy under spatial variations by integrating deep learning models with amplitude-based preprocessing and efficient transfer learning.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinjie Li, Zhimin Chen, Rui Zhao, Florian Schiffers, Zhenyu Liao et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
๐
2025-12-15
๐ AAAI 2026
๐ฅ Haoyu Wang, Lei Zhang, Wenrui Liu, Dengyang Jiang, Wei Wei et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has garnered increasing attention recently for training high-performance semantic segmentation models. However, existing methods necessitate to either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problem. To migrate both problems with one stone, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. Firstly, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing these, JoDiffusion enables simultaneously generating paired images and semantically consistent annotation masks solely conditioned on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xi Luo, Shixin Xu, Ying Xie, JianZhong Hu, Yuwei He et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel-level supervision and providing an interpretable visualization of disease-to-healthy transformations. Experimental results on the FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nikolai Goncharov, James L. Gray, Donald G. Dansereau
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Object tracking is an important step in robotics and reautonomous driving pipelines, which has to generalize to previously unseen and complex objects. Existing high-performing methods often rely on pre-captured object views to build explicit reference models, which restricts them to a fixed set of known objects. However, such reference models can struggle with visually complex appearance, reducing the quality of tracking. In this work, we introduce an object tracking method based on light field images that does not depend on a pre-trained model, while being robust to complex visual behavior, such as reflections. We extract semantic and geometric features from light field inputs using vision foundation models and convert them into view-dependent Gaussian splats. These splats serve as a unified object representation, supporting differentiable rendering and pose optimization. We further introduce a light field object tracking dataset containing challenging reflective objects with precise ground truth poses. Experiments demonstrate that our method is competitive with state-of-the-art model-based trackers in these difficult cases, paving the way toward universal object tracking in robotic systems. Code/data available at https://github.com/nagonch/LiFT-6DoF.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Pu, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Fan Wang et al.
Computer Vision
ๆฅ็ๆ่ฆ
Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation is still unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstacles that arise when moving from discrete class labels to free-form language prompts. Beyond a thorough methodological analysis, we offer practical guidelines on input scaling, network architecture, and hyperparameters, accompanied by an open-source implementation and pretrained student models. Our findings establish a solid foundation for deploying fast, high-fidelity, and resource-efficient diffusion generators in real-world T2I applications. Code is available on github.com/alibaba-damo-academy/T2I-Distill.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenjing lu, Zerui Tao, Dongping Zhang, Yuning Qiu, Yang Yang et al.
Multimodal
ๆฅ็ๆ่ฆ
CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Previous work of adversarial fine-tuning largely focuses on matching the predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and unreliable over-confidence. This overlooked phenomenon highlights a critical reliability gap beyond robustness. To bridge this gap, we propose a novel adversarial fine-tuning objective for CLIP considering both prediction accuracy and uncertainty alignments. By reparameterizing the output of CLIP as the concentration parameter of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and the magnitude of predictive confidence. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments on multiple zero-shot classification benchmarks demonstrate that our approach effectively restores calibrated uncertainty and achieves competitive adversarial robustness while maintaining clean accuracy.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Amin Jalal Aghdasian, Farzaneh Abdollahi, Ali Kamali Iglie
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper proposes two new algorithms for the lane keeping system (LKS) in autonomous vehicles (AVs) operating under snowy road conditions. These algorithms use deep reinforcement learning (DRL) to handle uncertainties and slippage. They include Action-Robust Recurrent Deep Deterministic Policy Gradient (AR-RDPG) and end-to-end Action-Robust convolutional neural network Attention Deterministic Policy Gradient (AR-CADPG), two action-robust approaches for decision-making. In the AR-RDPG method, within the perception layer, camera images are first denoised using multi-scale neural networks. Then, the centerline coefficients are extracted by a pre-trained deep convolutional neural network (DCNN). These coefficients, concatenated with the driving characteristics, are used as input to the control layer. The AR-CADPG method presents an end-to-end approach in which a convolutional neural network (CNN) and an attention mechanism are integrated within a DRL framework. Both methods are first trained in the CARLA simulator and validated under various snowy scenarios. Real-world experiments on a Jetson Nano-based autonomous vehicle confirm the feasibility and stability of the learned policies. Among the two models, the AR-CADPG approach demonstrates superior path-tracking accuracy and robustness, highlighting the effectiveness of combining temporal memory, adversarial resilience, and attention mechanisms in AVs.
๐
2025-12-15
๐ ArXiv Preprint (cs.CG)
๐ฅ Jiayin Lu, Ying Jiang, Yin Yang, Chenfanfu Jiang
Machine Learning
ๆฅ็ๆ่ฆ
We present VoroLight, a differentiable framework for 3D shape reconstruction based on Voronoi meshing. Our approach generates smooth, watertight surfaces and topologically consistent volumetric meshes directly from diverse inputs, including images, implicit shape level-set fields, point clouds and meshes. VoroLight operates in three stages: it first initializes a surface using a differentiable Voronoi formulation, then refines surface quality through a polygon-face sphere training stage, and finally reuses the differentiable Voronoi formulation for volumetric optimization with additional interior generator points. Project page: https://jiayinlu19960224.github.io/vorolight/
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziheng Qin, Yuheng Ji, Renshuai Tao, Yuxuan Tian, Yuyang Liu et al.
ๆฅ็ๆ่ฆ
The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the Benefit then Conflict dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis, diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that constrain representation with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data heterogeneity.To resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of GAN and diffusion-based generators. Code is available at https://github.com/UltraCapture/GAPL
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengling Qin, Hao Yu, Chenxin Wu, Zheng Li, Yizhong Cao et al.
Multimodal
ๆฅ็ๆ่ฆ
This paper presents VLCache, a cache reuse framework that exploits both Key-Value (KV) cache and encoder cache from prior multimodal inputs to eliminate costly recomputation when the same multimodal inputs recur. Unlike previous heuristic approaches, we formally identify the cumulative reuse error effect and demonstrate how to minimize the non-prefix cache reuse error effectively. We further analyze the varying importance of model layers and propose a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experimental results show that VLCache achieves an accuracy on par with full recomputation, while requiring only 2-5% of the tokens to compute, yielding 1.2x-16x TTFT speedups. The proposed VLCache pipeline has been integrated into SGLang, enabling significantly faster inference in practical deployments.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Luan Thanh Trinh, Kenji Doi, Atsuki Osanai
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Diffusion models have emerged as the leading approach for style transfer, yet they struggle with photo-realistic transfers, often producing painting-like results or missing detailed stylistic elements. Current methods inadequately address unwanted influence from original content styles and style reference content features. We introduce SCAdapter, a novel technique leveraging CLIP image space to effectively separate and integrate content and style features. Our key innovation systematically extracts pure content from content images and style elements from style references, ensuring authentic transfers. This approach is enhanced through three components: Controllable Style Adaptive Instance Normalization (CSAdaIN) for precise multi-style blending, KVS Injection for targeted style integration, and a style transfer consistency objective maintaining process coherence. Comprehensive experiments demonstrate SCAdapter significantly outperforms state-of-the-art methods in both conventional and diffusion-based baselines. By eliminating DDIM inversion and inference-stage optimization, our method achieves at least $2\times$ faster inference than other diffusion-based approaches, making it both more effective and efficient for practical applications.
๐
2025-12-15
๐ Under Review in IEEE Transactions on Image Process
๐ฅ Krishna Srikar Durbha, Hassene Tmar, Ping-Hao Wu, Ioannis Katsavounidis, Alan C. Bovik
Machine Learning
ๆฅ็ๆ่ฆ
Over the past few years, per-title and per-shot video encoding techniques have demonstrated significant gains as compared to conventional techniques such as constant CRF encoding and the fixed bitrate ladder. These techniques have demonstrated that constructing content-gnostic per-shot bitrate ladders can provide significant bitrate gains and improved Quality of Experience (QoE) for viewers under various network conditions. However, constructing a convex hull for every video incurs a significant computational overhead. Recently, machine learning-based bitrate ladder construction techniques have emerged as a substitute for convex hull construction. These methods operate by extracting features from source videos to train machine learning (ML) models to construct content-adaptive bitrate ladders. Here, we present a new ML-based bitrate ladder construction technique that accurately predicts the VMAF scores of compressed videos, by analyzing the compression procedure and by making perceptually relevant measurements on the source videos prior to compression. We evaluate the performance of our proposed framework against leading prior methods on a large corpus of videos. Since training ML models on every encoder setting is time-consuming, we also investigate how per-shot bitrate ladders perform under different encoding settings. We evaluate the performance of all models against the fixed bitrate ladder and the best possible convex hull constructed using exhaustive encoding with Bjontegaard-delta metrics.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Anja Sheppard, Parker Ewen, Joey Wilson, Advaith V. Sethuraman, Benard Adewole et al.
Robotics
ๆฅ็ๆ่ฆ
This paper introduces SLIM-VDB, a new lightweight semantic mapping system with probabilistic semantic fusion for closed-set or open-set dictionaries. Advances in data structures from the computer graphics community, such as OpenVDB, have demonstrated significantly improved computational and memory efficiency in volumetric scene representation. Although OpenVDB has been used for geometric mapping in robotics applications, semantic mapping for scene understanding with OpenVDB remains unexplored. In addition, existing semantic mapping systems lack support for integrating both fixed-category and open-language label predictions within a single framework. In this paper, we propose a novel 3D semantic mapping system that leverages the OpenVDB data structure and integrates a unified Bayesian update framework for both closed- and open-set semantic fusion. Our proposed framework, SLIM-VDB, achieves significant reduction in both memory and integration times compared to current state-of-the-art semantic mapping approaches, while maintaining comparable mapping accuracy. An open-source C++ codebase with a Python interface is available at https://github.com/umfieldrobotics/slim-vdb.
๐
2025-12-15
๐ IEEE TGRS
๐ฅ Siyuan Yao, Dongxiu Liu, Taotao Li, Shengjie Li, Wenqi Ren et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Building extraction from remote sensing images is a challenging task due to the complex structure variations of the buildings. Existing methods employ convolutional or self-attention blocks to capture the multi-scale features in the segmentation models, while the inherent gap of the feature pyramids and insufficient global-local feature integration leads to inaccurate, ambiguous extraction results. To address this issue, in this paper, we present an Uncertainty-Aggregated Global-Local Fusion Network (UAGLNet), which is capable to exploit high-quality global-local visual semantics under the guidance of uncertainty modeling. Specifically, we propose a novel cooperative encoder, which adopts hybrid CNN and transformer layers at different stages to capture the local and global visual semantics, respectively. An intermediate cooperative interaction block (CIB) is designed to narrow the gap between the local and global features when the network becomes deeper. Afterwards, we propose a Global-Local Fusion (GLF) module to complementarily fuse the global and local representations. Moreover, to mitigate the segmentation ambiguity in uncertain regions, we propose an Uncertainty-Aggregated Decoder (UAD) to explicitly estimate the pixel-wise uncertainty to enhance the segmentation accuracy. Extensive experiments demonstrate that our method achieves superior performance to other state-of-the-art methods. Our code is available at https://github.com/Dstate/UAGLNet
๐
2025-12-15
๐ ArXiv Preprint (cs.CG)
๐ฅ Sebastien Tchitchek, Mohamed Kissi, Julien Tierny
Robotics
ๆฅ็ๆ่ฆ
We introduce the Continuous Edit Distance (CED), a geodesic and elastic distance for time-varying persistence diagrams (TVPDs). The CED extends edit-distance ideas to TVPDs by combining local substitution costs with penalized deletions/insertions, controlled by two parameters: \(ฮฑ\) (trade-off between temporal misalignment and diagram discrepancy) and \(ฮฒ\) (gap penalty). We also provide an explicit construction of CED-geodesics. Building on these ingredients, we present two practical barycenter solvers, one stochastic and one greedy, that monotonically decrease the CED Frechet energy. Empirically, the CED is robust to additive perturbations (both temporal and spatial), recovers temporal shifts, and supports temporal pattern search. On real-life datasets, the CED achieves clustering performance comparable or better than standard elastic dissimilarities, while our clustering based on CED-barycenters yields superior classification results. Overall, the CED equips TVPD analysis with a principled distance, interpretable geodesics, and practical barycenters, enabling alignment, comparison, averaging, and clustering directly in the space of TVPDs. A C++ implementation is provided for reproducibility at the following address https://github.com/sebastien-tchitchek/ContinuousEditDistance.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tiange Zhang, Xiandong Meng, Siwei Ma
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in end-to-end video compression have shown promising results owing to their unified end-to-end learning optimization. However, such generalized frameworks often lack content-specific adaptation, leading to suboptimal compression performance. To address this, this paper proposes a content adaptive based motion alignment framework that improves performance by adapting encoding strategies to diverse content characteristics. Specifically, we first introduce a two-stage flow-guided deformable warping mechanism that refines motion compensation with coarse-to-fine offset prediction and mask modulation, enabling precise feature alignment. Second, we propose a multi-reference quality aware strategy that adjusts distortion weights based on reference quality, and applies it to hierarchical training to reduce error propagation. Third, we integrate a training-free module that downsamples frames by motion magnitude and resolution to obtain smooth motion estimation. Experimental results on standard test datasets demonstrate that our framework CAMA achieves significant improvements over state-of-the-art Neural Video Compression models, achieving a 24.95% BD-rate (PSNR) savings over our baseline model DCVC-TCM, while also outperforming reproduced DCVC-DC and traditional codec HM-16.25.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dylan Phelps, Rodrigo Wilkens, Edward Gow-Smith, Lilian Hubner, Bรกrbara Malcorra et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Alzheimer's Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach where texts surface forms are transformed by altering syntax and vocabulary while preserving semantic content. The transformations significantly modify the structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores, isolating the effect of semantic information, and finding models perform similarly to if they were using the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models. We found that image-based transformations add substantial noise reducing classification accuracy. Our methodology provides a novel way of looking at what features influence model predictions, and allows the removal of possible spurious correlations. We find that just using semantic information, language model based classifiers can still detect AD. This work shows that difficult to detect semantic impairment can be identified, addressing an overlooked feature of linguistic deterioration, and opening new pathways for early detection systems.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Baixiang Huang, Limeng Cui, Jiapeng Liu, Haoran Wang, Jiawei Xu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalization is becoming indispensable for LLMs to align with individual user preferences and needs. Yet current approaches are often computationally expensive, data-intensive, susceptible to catastrophic forgetting, and prone to performance degradation in multi-turn interactions or when handling implicit queries. To address these challenges, we conceptualize personalization as a model editing task and introduce Personalization Editing, a framework that applies localized edits guided by clustered preference representations. This design enables precise preference-aligned updates while preserving overall model capabilities. In addition, existing personalization benchmarks frequently rely on persona-based dialogs between LLMs rather than user-LLM interactions, or focus primarily on stylistic imitation while neglecting information-seeking tasks that require accurate recall of user-specific preferences. We introduce User Preference Question Answering (UPQA), a short-answer QA dataset constructed from in-situ user queries with varying levels of difficulty. Unlike prior benchmarks, UPQA directly evaluates a model's ability to recall and apply specific user preferences. Across experimental settings, Personalization Editing achieves higher editing accuracy and greater computational efficiency than fine-tuning, while outperforming prompting-based baselines in multi-turn conversations and implicit preference questions settings.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang et al.
Multimodal
ๆฅ็ๆ่ฆ
We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Cristina Aggazzotti, Elizabeth Allyn Smith
Robotics
ๆฅ็ๆ่ฆ
Forensic scientists often need to identify an unknown speaker or writer in cases such as ransom calls, covert recordings, alleged suicide notes, or anonymous online communications, among many others. Speaker recognition in the speech domain usually examines phonetic or acoustic properties of a voice, and these methods can be accurate and robust under certain conditions. However, if a speaker disguises their voice or employs text-to-speech software, vocal properties may no longer be reliable, leaving only their linguistic content available for analysis. Authorship attribution methods traditionally use syntactic, semantic, and related linguistic information to identify writers of written text (authorship attribution). In this paper, we apply a content-based authorship approach to speech that has been transcribed into text, using what a speaker says to attribute speech to individuals (speaker attribution). We introduce a stylometric method, StyloSpeaker, which incorporates character, word, token, sentence, and style features from the stylometric literature on authorship, to assess whether two transcripts were produced by the same speaker. We evaluate this method on two types of transcript formatting: one approximating prescriptive written text with capitalization and punctuation and another normalized style that removes these conventions. The transcripts' conversation topics are also controlled to varying degrees. We find generally higher attribution performance on normalized transcripts, except under the strongest topic control condition, in which overall performance is highest. Finally, we compare this more explainable stylometric model to black-box neural approaches on the same data and investigate which stylistic features most effectively distinguish speakers.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Richard J. Young
Natural Language Processing
ๆฅ็ๆ่ฆ
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ John E. Ortega, Dhruv D. Joshi, Matt P. Borkowski
Natural Language Processing
ๆฅ็ๆ่ฆ
Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is ex pected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks scoring about 2 points better than previous models not based on prompting.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zefang Liu, Nam Nguyen, Yinzhu Quan, Austin Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai et al.
Machine Learning
ๆฅ็ๆ่ฆ
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Daniel Melcer, Qi Chen, Wen-Hao Chiang, Shweta Garg, Pranav Garg et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
A well-engineered prompt can increase the performance of large language models; automatic prompt optimization techniques aim to increase performance without requiring human effort to tune the prompts. One leading class of prompt optimization techniques introduces the analogy of textual gradients. We investigate the behavior of these textual gradient methods through a series of experiments and case studies. While such methods often result in a performance improvement, our experiments suggest that the gradient analogy does not accurately explain their behavior. Our insights may inform the selection of prompt optimization strategies, and development of new approaches.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
๐
2025-12-15
๐ 8 pages, 2 figures, published in The 24th IEEE/WIC
๐ฅ Gibson Nkhata, Uttamasha Anjally Oyshi, Quan Mai, Susan Gauch
Natural Language Processing
ๆฅ็ๆ่ฆ
Verifying rumors on social media is critical for mitigating the spread of false information. The stances of conversation replies often provide important cues to determine a rumor's veracity. However, existing models struggle to jointly capture semantic content, stance information, and conversation strructure, especially under the sequence length constraints of transformer-based encoders. In this work, we propose a stance-aware structural modeling that encodes each post in a discourse with its stance signal and aggregates reply embedddings by stance category enabling a scalable and semantically enriched representation of the entire thread. To enhance structural awareness, we introduce stance distribution and hierarchical depth as covariates, capturing stance imbalance and the influence of reply depth. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms prior methods in the ability to predict truthfulness of a rumor. We also demonstrate that our model is versatile for early detection and cross-platfrom generalization.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
This work introduces {\it PrahokBART}, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.
๐
2025-12-15
๐ ArXiv Preprint (cs.SE)
๐ฅ Oleg Grynets, Vasyl Lyashkevych, Dmytro Baran, Maksym Orliansky, Taras Zelenyy et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The study presents the outcomes of research and experimental validation in the domain of automated codebase migration, with a focus on addressing challenges in transitioning SQL-based systems. The proposed method for migration essentially appears as a framework that leverages the best aspects of traditional software engineering techniques and provides an iterative, scalable, precise and efficient solution for modern database transformations. The central piece of the approach is the integration of a fine-tuned Large Language Model to address critical issues in SQL code conversion, such as syntax mapping, resolving discrepancies between Oracle PL/SQL and PostgreSQL, and optimising database elements such as stored procedures, triggers, views, and overall database logic. Thus, the method involves a trade-off between fine-tuning and prompt engineering. Special attention is given to a fine-tuning approach, which enhances the adaptability and compatibility with migration requirements across the entire database. According to the achieved results, fine-tuning plays a very important role. The study employs targeted evaluation methodologies along with computational metrics to measure the success of iterative conversion cycles. Core innovations include automated SQL feature detection, semi-supervised error analysis and integration of Subject Matter Experts feedback within a systematic migration workflow. The methodology achieves significant reductions in Syntax Error Rates, enhances feature alignment throughout migration iterations, and leverages dataset sampling to ensure continual improvement. By embedding GAI into the migration process, the framework facilitates precise feature mapping, semi-automated error resolution, and data-driven optimisation loops, improving workflow efficiency.
๐
2025-12-15
๐ AAAI 2026
๐ฅ Yu-Chen Lu, Sheng-Feng Yu, Hui-Hsien Weng, Pei-Shuo Wang, Yu-Fang Hu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLM) have achieved remarkable performance across a wide range of tasks. However, their substantial parameter sizes pose significant challenges for deployment on edge devices with limited computational and memory resources. Low-rank compression is a promising approach to address this issue, as it reduces both computational and memory costs, making LLM more suitable for resource-constrained environments. Nonetheless, naรฏve low-rank compression methods require a significant reduction in the retained rank to achieve meaningful memory and computation savings. For a low-rank model, the ranks need to be reduced by more than half to yield efficiency gains. Such aggressive truncation, however, typically results in substantial performance degradation. To address this trade-off, we propose SkipCat, a novel low-rank compression framework that enables the use of higher ranks while achieving the same compression rates. First, we introduce an intra-layer shared low-rank projection method, where multiple matrices that share the same input use a common projection. This reduces redundancy and improves compression efficiency. Second, we propose a block skipping technique that omits computations and memory transfers for selected sub-blocks within the low-rank decomposition. These two techniques jointly enable our compressed model to retain more effective ranks under the same compression budget. Experimental results show that, without any additional fine-tuning, our method outperforms previous low-rank compression approaches by 7% accuracy improvement on zero-shot tasks under the same compression rate. These results highlight the effectiveness of our rank-maximized compression strategy in preserving model performance under tight resource constraints.
๐
2025-12-15
๐ ArXiv Preprint (cs.DC)
๐ฅ Lei Qu, Lianhai Ren, Peng Cheng, Rui Gao, Ruizhe Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
An increasing variety of AI accelerators is being considered for large-scale training. However, enabling large-scale training on early-life AI accelerators faces three core challenges: frequent system disruptions and undefined failure modes that undermine reliability; numerical errors and training instabilities that threaten correctness and convergence; and the complexity of parallelism optimization combined with unpredictable local noise that degrades efficiency. To address these challenges, SIGMA is an open-source training stack designed to improve the reliability, stability, and efficiency of large-scale distributed training on early-life AI hardware. The core of this initiative is the LUCIA TRAINING PLATFORM (LTP), the system optimized for clusters with early-life AI accelerators. Since its launch in March 2025, LTP has significantly enhanced training reliability and operational productivity. Over the past five months, it has achieved an impressive 94.45% effective cluster accelerator utilization, while also substantially reducing node recycling and job-recovery times. Building on the foundation of LTP, the LUCIA TRAINING FRAMEWORK (LTF) successfully trained SIGMA-MOE, a 200B MoE model, using 2,048 AI accelerators. This effort delivered remarkable stability and efficiency outcomes, achieving 21.08% MFU, state-of-the-art downstream accuracy, and encountering only one stability incident over a 75-day period. Together, these advances establish SIGMA, which not only tackles the critical challenges of large-scale training but also establishes a new benchmark for AI infrastructure and platform innovation, offering a robust, cost-effective alternative to prevailing established accelerator stacks and significantly advancing AI capabilities and scalability. The source code of SIGMA is available at https://github.com/microsoft/LuciaTrainingPlatform.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ayon Roy, Risat Rahaman, Sadat Shibly, Udoy Saha Joy, Abdulla Al Kafi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Bangla is the sixth most widely spoken language globally, with approximately 234 million native speakers. However, progress in open-source Bangla machine translation remains limited. Most online resources are in English and often remain untranslated into Bangla, excluding millions from accessing essential information. Existing research in Bangla translation primarily focuses on formal language, neglecting the more commonly used informal language. This is largely due to the lack of pairwise Bangla-English data and advanced translation models. If datasets and models can be enhanced to better handle natural, informal Bangla, millions of people will benefit from improved online information access. In this research, we explore current state-of-the-art models and propose improvements to Bangla translation by developing a dataset from informal sources like social media and conversational texts. This work aims to advance Bangla machine translation by focusing on informal language translation and improving accessibility for Bangla speakers in the digital world.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ojas Pungalia, Rashi Upadhyay, Abhishek Mishra, Abhiram H, Tejasvi Alladi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Envy is a common human behavior that shapes competitiveness and can alter outcomes in team settings. As large language models (LLMs) increasingly act on behalf of humans in collaborative and competitive workflows, there is a pressing need to evaluate whether and under what conditions they exhibit envy-like preferences. In this paper, we test whether LLMs show envy-like behavior toward each other. We considered two scenarios: (1) A point allocation game that tests whether a model tries to win over its peer. (2) A workplace setting observing behaviour when recognition is unfair. Our findings reveal consistent evidence of envy-like patterns in certain LLMs, with large variation across models and contexts. For instance, GPT-5-mini and Claude-3.7-Sonnet show a clear tendency to pull down the peer model to equalize outcomes, whereas Mistral-Small-3.2-24B instead focuses on maximizing its own individual gains. These results highlight the need to consider competitive dispositions as a safety and design factor in LLM-based multi-agent systems.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kei Saito
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $ฯ$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jian Yang, Shawn Guo, Lin Jing, Wei Zhang, Aishan Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Code large language models (Code LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Besides, existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. Therefore, it is first necessary to investigate the scaling laws of different PLs, and then consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1000+ experiments (Equivalent to 336,000+ H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the pre-training strategy of the parallel pairing (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (Rust), achieving superior average performance across all PLs compared to uniform distribution under the same compute budget.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Johan J. Bolhuis, Andrea Moro, Stephen Crain, Sandiway Fong
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models are useless for linguistics, as they are probabilistic models that require a vast amount of data to analyse externalized strings of words. In contrast, human language is underpinned by a mind-internal computational system that recursively generates hierarchical thought structures. The language system grows with minimal external input and can readily distinguish between real language and impossible languages.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sitao Cheng, Tianle Li, Xuhan Huang, Xunjian Yin, Difan Zou
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolution, DERL is differentiable in its metaoptimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling selfimproving agent alignment without human intervention.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shibani Sankpal
Natural Language Processing
ๆฅ็ๆ่ฆ
This study investigates emotion drift: the change in emotional state across a single text, within mental health-related messages. While sentiment analysis typically classifies an entire message as positive, negative, or neutral, the nuanced shift of emotions over the course of a message is often overlooked. This study detects sentence-level emotions and measures emotion drift scores using pre-trained transformer models such as DistilBERT and RoBERTa. The results provide insights into patterns of emotional escalation or relief in mental health conversations. This methodology can be applied to better understand emotional dynamics in content.
๐
2025-12-15
๐ Accepted to IEEE Conference on Secure and Trustwor
๐ฅ Ali Al Sahili, Ali Chehab, Razane Tajeddine
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are prone to memorizing training data, which poses serious privacy risks. Two of the most prominent concerns are training data extraction and Membership Inference Attacks (MIAs). Prior research has shown that these threats are interconnected: adversaries can extract training data from an LLM by querying the model to generate a large volume of text and subsequently applying MIAs to verify whether a particular data point was included in the training set. In this study, we integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness. We then compare their performance in this integrated setting against results from conventional MIA benchmarks, allowing us to evaluate their practical utility in real-world extraction scenarios.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joona Kytรถniemi, Jousia Piha, Akseli Reunamo, Fedor Vitiugin, Farrokh Mehryary et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anna Aksenova, Boris Zverkov, Nicola Dainese, Alexander Nikitin, Pekka Marttinen
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release model weights, tokenizer and source code used for data processing and model training.
๐
2025-12-15
๐ Extended version of the accepted ACM SAC paper
๐ฅ Youssra Rebboud, Pasquale Lisena, Raphael Troncy
ๆฅ็ๆ่ฆ
In fact-checking applications, a common reason to reject a claim is to detect the presence of erroneous cause-effect relationships between the events at play. However, current automated fact-checking methods lack dedicated causal-based reasoning, potentially missing a valuable opportunity for semantically rich explainability. To address this gap, we propose a methodology that combines event relation extraction, semantic similarity computation, and rule-based reasoning to detect logical inconsistencies between chains of events mentioned in a claim and in an evidence. Evaluated on two fact-checking datasets, this method establishes the first baseline for integrating fine-grained causal event relationships into fact-checking and enhance explainability of verdict prediction.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jinrui Liu, Jeff Wu, Xuanguang Pan, Gavin Cheung, Shuai Ma et al.
ๆฅ็ๆ่ฆ
LLMs achieve remarkable multi-step reasoning capabilities, yet effectively transferring these skills via post-training distillation remains challenging. Existing data selection methods, ranging from manual curation to heuristics based on length, entropy, or overall loss, fail to capture the causal importance of individual reasoning steps, limiting distillation efficiency. To address this, we propose Attention Influence for Reasoning (AIR), a principled, unsupervised and training-free framework that leverages mechanistic insights of the retrieval head to select high-value post-training data. AIR first identifies reasoning-critical attention heads of an off-the-shelf model, then constructs a weakened reference model with disabled head influence, and finally quantifies the resulting loss divergence as the Attention Influence Score. This score enables fine-grained assessment at both the step and sample levels, supporting step-level weighted fine-tuning and global sample selection. Experiments across multiple reasoning benchmarks show that AIR consistently improves reasoning accuracy, surpassing heuristic baselines and effectively isolating the most critical steps and samples. Our work establishes a mechanism-driven, data-efficient approach for reasoning distillation in LLMs.
๐
2025-12-15
๐ ICCV 2025
๐ฅ Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chendong Sun
Natural Language Processing
ๆฅ็ๆ่ฆ
Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as \(1 - \max(P_{\mathrm{target}})\). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Emre Can Acikgoz, Jinoh Oh, Jie Hao, Joo Hyuk Jeon, Heng Ji et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional, with users providing instructions or posing questions to agents, where agents respond directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents require more proactive engagement, where agents should dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Existing prior work under-utilize the conversational capabilities of language models (LMs), thereby optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents' conversational capabilities by rewarding proactive interactions with users, such as asking right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset that includes diverse scenarios from task-oriented dialogues, where tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns even surpassing even much larger proprietary models, demonstrating the promise of clarification-centric user-agent interactions.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Emre Can Acikgoz, Jinoh Oh, Joo Hyuk Jeon, Jie Hao, Heng Ji et al.
Robotics
ๆฅ็ๆ่ฆ
Conversational agents often encounter ambiguous user requests, requiring an effective clarification to successfully complete tasks. While recent advancements in real-world applications favor multi-agent architectures to manage complex conversational scenarios efficiently, ambiguity resolution remains a critical and underexplored challenge--particularly due to the difficulty of determining which agent should initiate a clarification and how agents should coordinate their actions when faced with uncertain or incomplete user input. The fundamental questions of when to interrupt a user and how to formulate the optimal clarification query within the most optimal multi-agent settings remain open. In this paper, we propose MAC (Multi-Agent Clarification), an interactive multi-agent framework specifically optimized to resolve user ambiguities by strategically managing clarification dialogues. We first introduce a novel taxonomy categorizing user ambiguities to systematically guide clarification strategies. Then, we present MAC that autonomously coordinates multiple agents to interact synergistically with users. Empirical evaluations on MultiWOZ 2.4 demonstrate that enabling clarification at both levels increases task success rate 7.8\% (54.5 to 62.3) and reduces the average number of dialogue turns (6.53 to 4.86) by eliciting all required user information up front and minimizing repetition. Our findings highlight the importance of active user interaction and role-aware clarification for more reliable human-agent communication.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zewen Qiang, Sendong Zhao, Haochun Wang, Bing Qin, Ting Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long-text sequences due to the ``lost in the middle'' phenomenon. This issue has been shown to arise from a U-shaped attention bias, where attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our research first identifies an additional factor: initial saliency. It means that in the attention computation for each token, tokens with higher attention weights relative to the initial token tend to receive more attention in the prediction of the next token. We further find that utilizing this property by scaling attention weight between the initial token and others improves the model's ability to process long contexts, achieving a maximum improvement of 3.6\% in MDQA dataset. Moreover, combining this approach with existing methods to reduce position encoding bias further enhances performance, achieving a maximum improvement of 3.4\% in KV-Retrieval tasks.
๐
2025-12-15
๐ Published in Journal of Theoretical and Applied In
๐ฅ Mohaiminul Islam Bhuiyan, Chan Hue Wah, Nur Shazwani Kamarudin, Nur Hafieza Ismail, Ahmad Fakhri Ab Nasir
Machine Learning
ๆฅ็ๆ่ฆ
This study provides an overview of heart disease prediction using an intelligent system. Predicting disease accurately is crucial in the medical field, but traditional methods relying solely on a doctor's experience often lack precision. To address this limitation, intelligent systems are applied as an alternative to traditional approaches. While various intelligent system methods exist, this study focuses on three: Fuzzy Logic, Neural Networks, and Case-Based Reasoning (CBR). A comparison of these techniques in terms of accuracy was conducted, and ultimately, Case-Based Reasoning (CBR) was selected for heart disease prediction. In the prediction phase, the heart disease dataset underwent data pre-processing to clean the data and data splitting to separate it into training and testing sets. The chosen intelligent system was then employed to predict heart disease outcomes based on the processed data. The experiment concluded with Case-Based Reasoning (CBR) achieving a notable accuracy rate of 97.95% in predicting heart disease. The findings also revealed that the probability of heart disease was 57.76% for males and 42.24% for females. Further analysis from related studies suggests that factors such as smoking and alcohol consumption are significant contributors to heart disease, particularly among males.
๐
2025-12-15
๐ NeurIPS 2025
๐ฅ Bizhe Bai, Hongming Wu, Peng Ye, Tao Chen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, we find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We diagnose this instability and demonstrate that simply scaling the number of rollouts -- a common strategy to improve performance -- only delays, but does not prevent, this collapse. To counteract this instability, we first introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization), a framework that leverages a slowly evolving momentum model to provide a stable training target. In addition, we identify that this process is often accompanied by a rapid collapse in policy entropy, resulting in a prematurely confident and suboptimal policy. To specifically address this issue, we propose a second contribution: an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories, preserving essential policy diversity. Our extensive experiments on multiple reasoning benchmarks demonstrate that M-GRPO stabilizes the training process while the IQR filter prevents premature convergence. The combination of these two innovations leads to superior training stability and state-of-the-art performance.
๐
2025-12-15
๐ NeurIPS 2025
๐ฅ Cheril Shah, Akshit Agarwal, Kanak Garg, Mourad Heddaya
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Bilateral negotiation is a complex, context-sensitive task in which human negotiators dynamically adjust anchors, pacing, and flexibility to exploit power asymmetries and informal cues. We introduce a unified mathematical framework for modeling concession dynamics based on a hyperbolic tangent curve, and propose two metrics burstiness tau and the Concession-Rigidity Index (CRI) to quantify the timing and rigidity of offer trajectories. We conduct a large-scale empirical comparison between human negotiators and four state-of-the-art large language models (LLMs) across natural-language and numeric-offers settings, with and without rich market context, as well as six controlled power-asymmetry scenarios. Our results reveal that, unlike humans who smoothly adapt to situations and infer the opponents position and strategies, LLMs systematically anchor at extremes of the possible agreement zone for negotiations and optimize for fixed points irrespective of leverage or context. Qualitative analysis further shows limited strategy diversity and occasional deceptive tactics used by LLMs. Moreover the ability of LLMs to negotiate does not improve with better models. These findings highlight fundamental limitations in current LLM negotiation capabilities and point to the need for models that better internalize opponent reasoning and context-dependent strategy.
๐
2025-12-15
๐ NeurIPS
๐ฅ Ikuya Yamada, Wataru Ikeda, Ko Yoshida, Mengyu Ye, Hinata Sugimoto et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present an open deep research system for long-form question answering, selected as a winning system in the text-to-text track of the MMU-RAG competition at NeurIPS 2025. The system combines an open-source large language model (LLM) with an open web search API to perform iterative retrieval, reasoning, and synthesis in real-world open-domain settings. To enhance reasoning quality, we apply preference tuning based on LLM-as-a-judge feedback that evaluates multiple aspects, including clarity, insightfulness, and factuality. Our experimental results show that the proposed method consistently improves answer quality across all three aspects. Our source code is publicly available at https://github.com/efficient-deep-research/efficient-deep-research.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuwei Tan, Yao Ma, Xueru Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions. Large Language Models (LLMs), in contrast, can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts and informing system refinements. However, they perform poorly when applied directly to tabular fraud detection due to the difficulty of reasoning over many features, the extreme class imbalance, and the absence of contextual information. To bridge this gap, we introduce FinFRE-RAG, a two-stage approach that applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language and performs retrieval-augmented in-context learning over label-aware, instance-level exemplars. Across four public fraud datasets and three families of open-weight LLMs, FinFRE-RAG substantially improves F1/MCC over direct prompting and is competitive with strong tabular baselines in several settings. Although these LLMs still lag behind specialized classifiers, they narrow the performance gap and provide interpretable rationales, highlighting their value as assistive tools in fraud analysis.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jongwook Kim, Sangheon Yun, Sukjin Yoon
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass $QKV$ structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong theoretical evidence: the standard attention gradient is suboptimal. We demonstrated that selectively scaling these components, focusing primarily on $0^{th}$ order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, and using a crude configuration, this method achieved a $0.56\%$ reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential gains on larger datasets and deeper training regimes
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Marcus Ma, Cole Johnson, Nolan Bridges, Jackson Trager, Georgios Chochlakis et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The status quo for labeling text is third-party annotation, but there are many cases where information directly from the document's source would be preferable over a third-person proxy, especially for egocentric features like sentiment and belief. We introduce author labeling, an annotation technique where the writer of the document itself annotates the data at the moment of creation. We collaborate with a commercial chatbot with over 10,000 users to deploy an author labeling annotation system for subjective features related to product recommendation. This system identifies task-relevant queries, generates on-the-fly labeling questions, and records authors' answers in real time. We train and deploy an online-learning model architecture for product recommendation that continuously improves from author labeling and find it achieved a 534% increase in click-through rate compared to an industry advertising baseline running concurrently. We then compare the quality and practicality of author labeling to three traditional annotation approaches for sentiment analysis and find author labeling to be higher quality, faster to acquire, and cheaper. These findings reinforce existing literature that annotations, especially for egocentric and subjective beliefs, are significantly higher quality when labeled by the author rather than a third party. To facilitate broader scientific adoption, we release an author labeling service for the research community at academic.echollm.io.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool using, and extended dialogue.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lingyi Meng, Maolin Liu, Hao Wang, Yilan Cheng, Qi Yang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurately mapping legal terminology across languages remains a significant challenge, especially for language pairs like Chinese and Japanese, which share a large number of homographs with different meanings. Existing resources and standardized tools for these languages are limited. To address this, we propose a human-AI collaborative approach for building a multilingual legal terminology database, based on a multi-agent framework. This approach integrates advanced large language models and legal domain experts throughout the entire process-from raw document preprocessing, article-level alignment, to terminology extraction, mapping, and quality assurance. Unlike a single automated pipeline, our approach places greater emphasis on how human experts participate in this multi-agent system. Humans and AI agents take on different roles: AI agents handle specific, repetitive tasks, such as OCR, text segmentation, semantic alignment, and initial terminology extraction, while human experts provide crucial oversight, review, and supervise the outputs with contextual knowledge and legal judgment. We tested the effectiveness of this framework using a trilingual parallel corpus comprising 35 key Chinese statutes, along with their English and Japanese translations. The experimental results show that this human-in-the-loop, multi-agent workflow not only improves the precision and consistency of multilingual legal terminology mapping but also offers greater scalability compared to traditional manual methods.
๐
2025-12-15
๐ ArXiv Preprint (physics.optics)
๐ฅ David Dang, Stuart Love, Meena Salib, Quynh Dang, Samuel Rothfarb et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Advancing artificial intelligence for physical sciences requires representations that are both interpretable and compatible with the underlying laws of nature. We introduce METASTRINGS, a symbolic language for photonics that expresses nanostructures as textual sequences encoding materials, geometries, and lattice configurations. Analogous to molecular textual representations in chemistry, METASTRINGS provides a framework connecting human interpretability with computational design by capturing the structural hierarchy of photonic metasurfaces. Building on this representation, we develop Meta-GPT, a foundation transformer model trained on METASTRINGS and finetuned with physics-informed supervised, reinforcement, and chain-of-thought learning. Across various design tasks, the model achieves <3% mean-squared spectral error and maintains >98% syntactic validity, generating diverse metasurface prototypes whose experimentally measured optical responses match their target spectra. These results demonstrate that Meta-GPT can learn the compositional rules of light-matter interactions through METASTRINGS, laying a rigorous foundation for AI-driven photonics and representing an important step toward a metasurface genome project.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Susung Hong, Chongjian Ge, Zhifei Zhang, Jui-Hsien Wang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung Shim
MultimodalNatural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Guoqing Liu, Junren Li, Zihan Zhao, Eray Inanc, Krzysztof Maziarz et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Solving computer-aided synthesis planning is essential for enabling fully automated, robot-assisted synthesis workflows and improving the efficiency of drug discovery. A key challenge, however, is bridging the gap between computational route design and practical laboratory execution, particularly the accurate prediction of viable experimental procedures for each synthesis step. In this work, we present QFANG, a scientific reasoning language model capable of generating precise, structured experimental procedures directly from reaction equations, with explicit chain-of-thought reasoning. To develop QFANG, we curated a high-quality dataset comprising 905,990 chemical reactions paired with structured action sequences, extracted and processed from patent literature using large language models. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. The model subsequently undergoes supervised fine-tuning to elicit complex chemistry reasoning. Finally, we apply Reinforcement Learning from Verifiable Rewards (RLVR) to further enhance procedural accuracy. Experimental results demonstrate that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, measured by traditional NLP similarity metrics and a chemically aware evaluator using an LLM-as-a-judge. Moreover, QFANG generalizes to certain out-of-domain reaction classes and adapts to variations in laboratory conditions and user-specific constraints. We believe that QFANG's ability to generate high-quality synthesis procedures represents an important step toward bridging the gap between computational synthesis planning and fully automated laboratory synthesis.
๐
2025-12-15
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Weihang Cao, Mustafa Doger, Sennur Ulukus
Machine Learning
ๆฅ็ๆ่ฆ
The security and decentralization of Proof-of-Work (PoW) have been well-tested in existing blockchain systems. However, its tremendous energy waste has raised concerns about sustainability. Proof-of-Useful-Work (PoUW) aims to redirect the meaningless computation to meaningful tasks such as solving machine learning (ML) problems, giving rise to the branch of Proof-of-Learning (PoL). While previous studies have proposed various PoLs, they all, to some degree, suffer from security, decentralization, or efficiency issues. In this paper, we propose a PoL framework that trains ML models efficiently while maintaining blockchain security in a fully distributed manner. We name the framework SEDULity, which stands for a Secure, Efficient, Distributed, and Useful Learning-based blockchain system. Specifically, we encode the template block into the training process and design a useful function that is difficult to solve but relatively easy to verify, as a substitute for the PoW puzzle. We show that our framework is distributed, secure, and efficiently trains ML models. We further demonstrate that the proposed PoL framework can be extended to other types of useful work and design an incentive mechanism to incentivize task verification. We show theoretically that a rational miner is incentivized to train fully honestly with well-designed system parameters. Finally, we present simulation results to demonstrate the performance of our framework and validate our analysis.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gabriel Vitorino de Andrade, Saulo Roberto dos Santos, Itallo Patrick Castro Alves da Silva, Emanuel Adler Medeiros Pereira, Erick de Andrade Barboza
Machine Learning
ๆฅ็ๆ่ฆ
The validation and verification of artificial intelligence (AI) models through robustness assessment are essential to guarantee the reliable performance of intelligent systems facing real-world challenges, such as image corruptions including noise, blurring, and weather variations. Despite the global importance of mango (Mangifera indica L.), there is a lack of studies on the robustness of models for the diagnosis of disease in its leaves. This paper proposes a methodology to evaluate convolutional neural networks (CNNs) under adverse conditions. We adapted the MangoLeafDB dataset, generating MangoLeafDB-C with 19 types of artificial corruptions at five severity levels. We conducted a benchmark comparing five architectures: ResNet-50, ResNet-101, VGG-16, Xception, and LCNN (the latter being a lightweight architecture designed specifically for mango leaf diagnosis). The metrics include the F1 score, the corruption error (CE) and the relative mean corruption error (relative mCE). The results show that LCNN outperformed complex models in corruptions that can be present in real-world scenarios such as Defocus Blur, Motion Blur, while also achieving the lowest mCE. Modern architectures (e.g., ResNet-101) exhibited significant performance degradation in corrupted scenarios, despite their high accuracy under ideal conditions. These findings suggest that lightweight and specialized models may be more suitable for real-world applications in edge devices, where robustness and efficiency are critical. The study highlights the need to incorporate robustness assessments in the development of intelligent systems for agriculture, particularly in regions with technological limitations.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Reza Gheissari, Aukosh Jagannath
Robotics
ๆฅ็ๆ่ฆ
We consider statistical tasks in high dimensions whose loss depends on the data only through its projection into a fixed-dimensional subspace spanned by the parameter vectors and certain ground truth vectors. This includes classifying mixture distributions with cross-entropy loss with one and two-layer networks, and learning single and multi-index models with one and two-layer networks. When the data is drawn from an isotropic Gaussian mixture distribution, it is known that the evolution of a finite family of summary statistics under stochastic gradient descent converges to an autonomous ordinary differential equation (ODE), as the dimension and sample size go to $\infty$ and the step size goes to $0$ commensurately. Our main result is that these ODE limits are universal in that this convergence occurs even when the data is drawn from mixtures of product measures provided the first two moments match the corresponding Gaussian distribution and the initialization and ground truth vectors are sufficiently coordinate-delocalized. We complement this by proving two corresponding non-universality results. We provide a simple example where the ODE limits are non-universal if the initialization is coordinate aligned. We also show that the stochastic differential equation limits arising as fluctuations of the summary statistics around their ODE's fixed points are not universal.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Guransh Singh, Md Shah Fahad
ๆฅ็ๆ่ฆ
Stuttering detection breaks down when disfluencies overlap. Existing parametric models struggle to distinguish complex, simultaneous disfluencies (e.g., a 'block' with a 'prolongation') due to the scarcity of these specific combinations in training data. While Retrieval-Augmented Generation (RAG) has revolutionized NLP by grounding models in external knowledge, this paradigm remains unexplored in pathological speech processing. To bridge this gap, we introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label stuttering detection. By conditioning a Conformer encoder on a non-parametric memory bank of clinical examples, we allow the model to classify by reference rather than memorization. We further identify and solve "Modality Collapse", an "Echo Chamber" effect where naive retrieval boosts recall but degrades precision. We mitigate this using: (1) SetCon, a Jaccard-Weighted Metric Learning objective that optimizes for multi-label set similarity, and (2) a Gated Mixture-of-Experts fusion strategy that dynamically arbitrates between acoustic evidence and retrieved context. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, outperforming strong baselines and demonstrating remarkable zero-shot cross-lingual generalization.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zefang Liu, Nam Nguyen, Yinzhu Quan, Austin Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ankit Sharma, Sayan Roy Gupta
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks have demonstrated significant success in graph classification tasks, yet they often require substantial computational resources and struggle to capture global graph properties effectively. We introduce LightTopoGAT, a lightweight graph attention network that enhances node features through topological augmentation by incorporating node degree and local clustering coefficient to improve graph representation learning. The proposed approach maintains parameter efficiency through streamlined attention mechanisms while integrating structural information that is typically overlooked by local message passing schemes. Through comprehensive experiments on three benchmark datasets, MUTAG, ENZYMES, and PROTEINS, we show that LightTopoGAT achieves superior performance compared to established baselines including GCN, GraphSAGE, and standard GAT, with a 6.6 percent improvement in accuracy on MUTAG and a 2.2 percent improvement on PROTEINS. Ablation studies further confirm that these performance gains arise directly from the inclusion of topological features, demonstrating a simple yet effective strategy for enhancing graph neural network performance without increasing architectural complexity.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shweta Mahajan, Shreya Kadambi, Hoang Le, Munawar Hayat, Fatih Porikli
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai et al.
Machine Learning
ๆฅ็ๆ่ฆ
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Daniel Melcer, Qi Chen, Wen-Hao Chiang, Shweta Garg, Pranav Garg et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
A well-engineered prompt can increase the performance of large language models; automatic prompt optimization techniques aim to increase performance without requiring human effort to tune the prompts. One leading class of prompt optimization techniques introduces the analogy of textual gradients. We investigate the behavior of these textual gradient methods through a series of experiments and case studies. While such methods often result in a performance improvement, our experiments suggest that the gradient analogy does not accurately explain their behavior. Our insights may inform the selection of prompt optimization strategies, and development of new approaches.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Robert Reed, Morteza Lahijanian, Luca Laurenti
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Finite Abstraction methods provide a powerful formal framework for proving that systems satisfy their specifications. However, these techniques face scalability challenges for high-dimensional systems, as they rely on state-space discretization which grows exponentially with dimension. Learning-based approaches to dimensionality reduction, utilizing neural networks and autoencoders, have shown great potential to alleviate this problem. However, ensuring the correctness of the resulting verification results remains an open question. In this work, we provide a formal approach to reduce the dimensionality of systems via convex autoencoders and learn the dynamics in the latent space through a kernel-based method. We then construct a finite abstraction from the learned model in the latent space and guarantee that the abstraction contains the true behaviors of the original system. We show that the verification results in the latent space can be mapped back to the original system. Finally, we demonstrate the effectiveness of our approach on multiple systems, including a 26D system controlled by a neural network, showing significant scalability improvements without loss of rigor.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong et al.
Machine Learning
ๆฅ็ๆ่ฆ
The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver derived from general linear multistep methods, a lightweight, trainable high-order solver optimized via Reinforcement Learning, that enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on-par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at https://github.com/G-U-N/consolver.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zehan Zhu, Heng Zhao, Yan Huang, Joey Tianyi Zhou, Shouling Ji et al.
ๆฅ็ๆ่ฆ
In this paper, we propose a Differentially Private Stochastic Gradient Push with Compressed communication (termed DP-CSGP) for decentralized learning over directed graphs. Different from existing works, the proposed algorithm is designed to maintain high model utility while ensuring both rigorous differential privacy (DP) guarantees and efficient communication. For general non-convex and smooth objective functions, we show that the proposed algorithm achieves a tight utility bound of $\mathcal{O}\left( \sqrt{d\log \left( \frac{1}ฮด \right)}/(\sqrt{n}Jฮต) \right)$ ($J$ and $d$ are the number of local samples and the dimension of decision variables, respectively) with $\left(ฮต, ฮด\right)$-DP guarantee for each node, matching that of decentralized counterparts with exact communication. Extensive experiments on benchmark tasks show that, under the same privacy budget, DP-CSGP achieves comparable model accuracy with significantly lower communication cost than existing decentralized counterparts with exact communication.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leonard Bereska, Zoe Tzifa-Kratira, Reza Samavi, Efstratios Gavves
Machine Learning
ๆฅ็ๆ่ฆ
Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Junye Du, Zhenghao Li, Zhutong Gu, Long Feng
Machine Learning
ๆฅ็ๆ่ฆ
This paper tackles the problem of feature selection in a highly challenging setting: $\mathbb{E}(y | \boldsymbol{x}) = G(\boldsymbol{x}_{\mathcal{S}_0})$, where $\mathcal{S}_0$ is the set of relevant features and $G$ is an unknown, potentially nonlinear function subject to mild smoothness conditions. Our approach begins with feature selection in deep neural networks, then generalizes the results to H{รถ}lder smooth functions by exploiting the strong approximation capabilities of neural networks. Unlike conventional optimization-based deep learning methods, we reformulate neural networks as index models and estimate $\mathcal{S}_0$ using the second-order Stein's formula. This gradient-descent-free strategy guarantees feature selection consistency with a sample size requirement of $n = ฮฉ(p^2)$, where $p$ is the feature dimension. To handle high-dimensional scenarios, we further introduce a screening-and-selection mechanism that achieves nonlinear selection consistency when $n = ฮฉ(s \log p)$, with $s$ representing the sparsity level. Additionally, we refit a neural network on the selected features for prediction and establish performance guarantees under a relaxed sparsity assumption. Extensive simulations and real-data analyses demonstrate the strong performance of our method even in the presence of complex feature interactions.
๐
2025-12-15
๐ NeurIPS 2025
๐ฅ Marianne Rakic, Siyu Gai, Etienne Chollet, John V. Guttag, Adrian V. Dalca
Computer Vision
ๆฅ็ๆ่ฆ
A single biomedical image can be meaningfully segmented in multiple ways, depending on the desired application. For instance, a brain MRI can be segmented according to tissue types, vascular territories, broad anatomical regions, fine-grained anatomy, or pathology, etc. Existing automatic segmentation models typically either (1) support only a single protocol, the one they were trained on, or (2) require labor-intensive manual prompting to specify the desired segmentation. We introduce Pancakes, a framework that, given a new image from a previously unseen domain, automatically generates multi-label segmentation maps for multiple plausible protocols, while maintaining semantic consistency across related images. Pancakes introduces a new problem formulation that is not currently attainable by existing foundation models. In a series of experiments on seven held-out datasets, we demonstrate that our model can significantly outperform existing foundation models in producing several plausible whole-image segmentations, that are semantically coherent across images.
๐
2025-12-15
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ Anshima Singh, David J. Silvester
Machine Learning
ๆฅ็ๆ่ฆ
An adaptive sampling approach for efficient detection of bifurcation boundaries in parametrized fluid flow problems is presented herein. The study extends the machine-learning approach of Silvester (Machine Learning for Hydrodynamic Stability, arXiv:2407.09572), where a classifier network was trained on preselected simulation data to identify bifurcated and nonbifurcated flow regimes. In contrast, the proposed methodology introduces adaptivity through a flow-based deep generative model that automatically refines the sampling of the parameter space. The strategy has two components: a classifier network maps the flow parameters to a bifurcation probability, and a probability density estimation technique (KRnet) for the generation of new samples at each adaptive step. The classifier output provides a probabilistic measure of flow stability, and the Shannon entropy of these predictions is employed as an uncertainty indicator. KRnet is trained to approximate a probability density function that concentrates sampling in regions of high entropy, thereby directing computational effort towards the evolving bifurcation boundary. This coupling between classification and generative modeling establishes a feedback-driven adaptive learning process analogous to error-indicator based refinement in contemporary partial differential equation solution strategies. Starting from a uniform parameter distribution, the new approach achieves accurate bifurcation boundary identification with significantly fewer Navier--Stokes simulations, providing a scalable foundation for high-dimensional stability analysis.
๐
2025-12-15
๐ ArXiv Preprint (stat.ME)
๐ฅ Shih-Ni Prim, Kevin R. Quinlan, Paul Hawkins, Jagadeesh Movva, Annie S. Booth
ๆฅ็ๆ่ฆ
Contour location$\unicode{x2014}$the process of sequentially training a surrogate model to identify the design inputs that result in a pre-specified response value from a single computer experiment$\unicode{x2014}$is a well-studied active learning problem. Here, we tackle a related but distinct problem: identifying the input configuration that returns pre-specified values of multiple independent computer experiments simultaneously. Motivated by computer experiments of the rotational torques acting upon a vehicle in flight, we aim to identify stable flight conditions which result in zero torque forces. We propose a "joint contour location" (jCL) scheme that strikes a strategic balance between exploring the multiple response surfaces while exploiting learning of the intersecting contours. We employ both shallow and deep Gaussian process surrogates, but our jCL procedure is applicable to any surrogate that can provide posterior predictive distributions. Our jCL designs significantly outperform existing (single response) CL strategies, enabling us to efficiently locate the joint contour of our motivating computer experiments.
๐
2025-12-15
๐ Accepted for presentation at the International Mee
๐ฅ Raj Sahu, Saumen Maiti
Machine Learning
ๆฅ็ๆ่ฆ
Enhanced lithological interpretation from well logs plays a key role in geological resource exploration and mapping, as well as in geo-environmental modeling studies. Core and cutting information is useful for making sound interpretations of well logs; however, these are rarely collected at each depth due to high costs. Moreover, well log interpretation using traditional methods is constrained by poor borehole conditions. Traditional statistical methods are mostly linear, often failing to discriminate between lithology and rock facies, particularly when dealing with overlapping well log signals characterized by the structural and compositional variation of rock types. In this study, we develop multiple supervised and unsupervised machine learning algorithms to jointly analyze multivariate well log data from Integrated Ocean Drilling Program (IODP) expeditions 390 and 393 for enhanced lithological interpretations. Among the algorithms, Logistic Regression, Decision Trees, Gradient Boosting, Support Vector Machines (SVM), k-Nearest Neighbors (KNN), and Multi-Layer Perceptron (MLP) neural network models, the Decision Tree and Gradient Boosting models outperformed the others, achieving an accuracy of 0.9950 and an F1-score of 0.9951. While unsupervised machine learning (ML) provides the foundation for cluster information that inherently supports the classification algorithm, supervised ML is applied to devise a data-driven lithology clustering mechanism for IODP datasets. The joint ML-based method developed here has the potential to be further explored for analyzing other well log datasets from the world's oceans.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Asa Cooper Stickland, Jan Michelfeit, Arathi Mani, Charlie Griffin, Ollie Matthews et al.
Robotics
ๆฅ็ๆ่ฆ
LLM-based software engineering agents are increasingly used in real-world development tasks, often with access to sensitive data or security-critical codebases. Such agents could intentionally sabotage these codebases if they were misaligned. We investigate asynchronous monitoring, in which a monitoring system reviews agent actions after the fact. Unlike synchronous monitoring, this approach does not impose runtime latency, while still attempting to disrupt attacks before irreversible harm occurs. We treat monitor development as an adversarial game between a blue team (who design monitors) and a red team (who create sabotaging agents). We attempt to set the game rules such that they upper bound the sabotage potential of an agent based on Claude 4.1 Opus. To ground this game in a realistic, high-stakes deployment scenario, we develop a suite of 5 diverse software engineering environments that simulate tasks that an agent might perform within an AI developer's internal infrastructure. Over the course of the game, we develop an ensemble monitor that achieves a 6% false negative rate at 1% false positive rate on a held out test environment. Then, we estimate risk of sabotage at deployment time by extrapolating from our monitor's false negative rate. We describe one simple model for this extrapolation, present a sensitivity analysis, and describe situations in which the model would be invalid. Code is available at: https://github.com/UKGovernmentBEIS/async-control.
๐
2025-12-15
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Raymond Khazoum, Daniela Fernandes, Aleksandr Krylov, Qin Li, Stephane Deny
Machine Learning
ๆฅ็ๆ่ฆ
Mental rotation -- the ability to compare objects seen from different viewpoints -- is a fundamental example of mental simulation and spatial world modelling in humans. Here we propose a mechanistic model of human mental rotation, leveraging advances in deep, equivariant, and neuro-symbolic learning. Our model consists of three stacked components: (1) an equivariant neural encoder, taking images as input and producing 3D spatial representations of objects, (2) a neuro-symbolic object encoder, deriving symbolic descriptions of objects from these spatial representations, and (3) a neural decision agent, comparing these symbolic descriptions to prescribe rotation simulations in 3D latent space via a recurrent pathway. Our model design is guided by the abundant experimental literature on mental rotation, which we complemented with experiments in VR where participants could at times manipulate the objects to compare, providing us with additional insights into the cognitive process of mental rotation. Our model captures well the performance, response times and behavior of participants in our and others' experiments. The necessity of each model component is shown through systematic ablations. Our work adds to a recent collection of deep neural models of human spatial reasoning, further demonstrating the potency of integrating deep, equivariant, and symbolic representations to model the human mind.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sofiya Zaichyk
ๆฅ็ๆ่ฆ
Statistical learning under distributional drift remains insufficiently characterized: when each observation alters the data-generating law, classical generalization bounds can collapse. We introduce a new statistical primitive, the reproducibility budget $C_T$, which quantifies a system's finite capacity for statistical reproducibility - the extent to which its sampling process can remain governed by a consistent underlying distribution in the presence of both exogenous change and endogenous feedback. Formally, $C_T$ is defined as the cumulative Fisher-Rao path length of the coupled learner-environment evolution, measuring the total distributional motion accumulated during learning. From this construct we derive a drift-feedback generalization bound of order $O(T^{-1/2} + C_T/T)$, and we prove a matching minimax lower bound showing that this rate is minimax-optimal. Consequently, the results establish a reproducibility speed limit: no algorithm can achieve smaller worst-case generalization error than that imposed by the average Fisher-Rao drift rate $C_T/T$ of the data-generating process. The framework situates exogenous drift, adaptive data analysis, and performative prediction within a common geometric structure, with $C_T$ emerging as the intrinsic quantity measuring distributional motion across these settings.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoyu Ren, Kay Koehle, Kirill Dorofeev, Darko Anicic
Robotics
ๆฅ็ๆ่ฆ
In modern manufacturing, Visual Anomaly Detection (VAD) is essential for automated inspection and consistent product quality. Yet, increasingly dynamic and flexible production environments introduce key challenges: First, frequent product changes in small-batch and on-demand manufacturing require rapid model updates. Second, legacy edge hardware lacks the resources to train and run large AI models. Finally, both anomalous and normal training data are often scarce, particularly for newly introduced product variations. We investigate on-device continual learning for unsupervised VAD with localization, extending the PatchCore to incorporate online learning for real-world industrial scenarios. The proposed method leverages a lightweight feature extractor and an incremental coreset update mechanism based on k-center selection, enabling rapid, memory-efficient adaptation from limited data while eliminating costly cloud retraining. Evaluations on an industrial use case are conducted using a testbed designed to emulate flexible production with frequent variant changes in a controlled environment. Our method achieves a 12% AUROC improvement over the baseline, an 80% reduction in memory usage, and faster training compared to batch retraining. These results confirm that our method delivers accurate, resource-efficient, and adaptive VAD suitable for dynamic and smart manufacturing.
๐
2025-12-15
๐ ArXiv Preprint (cs.IT)
๐ฅ ลukasz Dฤbowski
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We inspect the deductive connection between the neural scaling law and Zipf's law -- two statements discussed in machine learning and quantitative linguistics. The neural scaling law describes how the cross entropy rate of a foundation model -- such as a large language model -- changes with respect to the amount of training tokens, parameters, and compute. By contrast, Zipf's law posits that the distribution of tokens exhibits a power law tail. Whereas similar claims have been made in more specific settings, we show that the neural scaling law is a consequence of Zipf's law under certain broad assumptions that we reveal systematically. The derivation steps are as follows: We derive Heaps' law on the vocabulary growth from Zipf's law, Hilberg's hypothesis on the entropy scaling from Heaps' law, and the neural scaling from Hilberg's hypothesis. We illustrate these inference steps by a toy example of the Santa Fe process that satisfies all the four statistical laws.
๐
2025-12-15
๐ ArXiv Preprint (eess.SY)
๐ฅ Wenyi Liu, R. Sharma, W. "Grace" Guo, J. Yi, Y. B. Guo
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Digital twin (DT) enables smart manufacturing by leveraging real-time data, AI models, and intelligent control systems. This paper presents a state-of-the-art analysis on the emerging field of DTs in the context of milling. The critical aspects of DT are explored through the lens of virtual models of physical milling, data flow from physical milling to virtual model, and feedback from virtual model to physical milling. Live data streaming protocols and virtual modeling methods are highlighted. A case study showcases the transformative capability of a real-time machine learning-driven live DT of tool-work contact in a milling process. Future research directions are outlined to achieve the goals of Industry 4.0 and beyond.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Maksymilian Szorc
Machine Learning
ๆฅ็ๆ่ฆ
Fully connected layers are a primary source of memory and computational overhead in deep neural networks due to their dense, often redundant parameterization. While various compression techniques exist, they frequently introduce complex engineering trade-offs or degrade model performance. We propose the Parametrized Random Projection (PRP) layer, a novel approach that decouples feature mixing from adaptation by utilizing a fixed random matrix modulated by lightweight, learnable element-wise parameters. This architecture drastically reduces the trainable parameter count to a linear scale while retaining reliable accuracy across various benchmarks. The design serves as a stable, computationally efficient solution for architectural scaling and deployment in resource-limited settings.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kei Saito
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $ฯ$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chethana Prasad Kabgere, Shylaja S S
Robotics
ๆฅ็ๆ่ฆ
Federated Learning (FL) enables decentralized model training without sharing raw data, but model weight distortion remains a major challenge in resource constrained IoT networks. In multi tier Federated IoT (Fed-IoT) systems, unstable connectivity and adversarial interference can silently alter transmitted parameters, degrading convergence. We propose DP-EMAR, a differentially private, error model based autonomous repair framework that detects and reconstructs transmission induced distortions during FL aggregation. DP-EMAR estimates corruption patterns and applies adaptive correction before privacy noise is added, enabling reliable in network repair without violating confidentiality. By integrating Differential Privacy (DP) with Secure Aggregation (SA), the framework distinguishes DP noise from genuine transmission errors. Experiments on heterogeneous IoT sensor and graph datasets show that DP-EMAR preserves convergence stability and maintains near baseline performance under communication corruption while ensuring strict (epsilon, delta)-DP guarantees. The framework enhances robustness, communication efficiency, and trust in privacy preserving Federated IoT learning.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yici Liu, Qi Wei Oung, Hoi Leong Lee
ๆฅ็ๆ่ฆ
Electroencephalographic (EEG) signals have long been applied in the field of affective brain-computer interfaces (aBCIs). Cross-subject EEG-based emotion recognition has demonstrated significant potential in practical applications due to its suitability across diverse people. However, most studies on cross-subject EEG-based emotion recognition neglect the presence of inter-individual variability and negative transfer phenomena during model training. To address this issue, a cross-subject EEG-based emotion recognition through source selection with adversarial strategy is introduced in this paper. The proposed method comprises two modules: the source selection network (SS) and the adversarial strategies network (AS). The SS uses domain labels to reverse-engineer the training process of domain adaptation. Its key idea is to disrupt class separability and magnify inter-domain differences, thereby raising the classification difficulty and forcing the model to learn domain-invariant yet emotion-relevant representations. The AS gets the source domain selection results and the pretrained domain discriminators from SS. The pretrained domain discriminators compute a novel loss aimed at enhancing the performance of domain classification during adversarial training, ensuring the balance of adversarial strategies. This paper provides theoretical insights into the proposed method and achieves outstanding performance on two EEG-based emotion datasets, SEED and SEED-IV. The code can be found at https://github.com/liuyici/SSAS.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Khawla Elhadri, Jรถrg Schlรถtterer, Christin Seifert
Machine Learning
ๆฅ็ๆ่ฆ
In data-driven applications relying on tabular data, where interpretability is key, machine learning models such as decision trees and linear regression are applied. Although neural networks can provide higher predictive performance, they are not used because of their blackbox nature. In this work, we present XNNTab, a neural architecture that combines the expressiveness of neural networks and interpretability. XNNTab first learns highly non-linear feature representations, which are decomposed into monosemantic features using a sparse autoencoder (SAE). These features are then assigned human-interpretable concepts, making the overall model prediction intrinsically interpretable. XNNTab outperforms interpretable predictive models, and achieves comparable performance to its non-interpretable counterparts.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Noa Cohen, Nurit Spingarn-Eliezer, Inbar Huberman-Spiegelglas, Tomer Michaeli
Machine Learning
ๆฅ็ๆ่ฆ
Text-to-Image (TTI) models generate images based on text prompts, which often leave certain aspects of the desired image ambiguous. When faced with these ambiguities, TTI models have been shown to exhibit biases in their interpretations. These biases can have societal impacts, e.g., when showing only a certain race for a stated occupation. They can also affect user experience when creating redundancy within a set of generated images instead of spanning diverse possibilities. Here, we introduce MineTheGap - a method for automatically mining prompts that cause a TTI model to generate biased outputs. Our method goes beyond merely detecting bias for a given prompt. Rather, it leverages a genetic algorithm to iteratively refine a pool of prompts, seeking for those that expose biases. This optimization process is driven by a novel bias score, which ranks biases according to their severity, as we validate on a dataset with known biases. For a given prompt, this score is obtained by comparing the distribution of generated images to the distribution of LLM-generated texts that constitute variations on the prompt. Code and examples are available on the project's webpage.
๐
2025-12-15
๐ Accepted to the IEEE Transactions on Neural Networ
๐ฅ Vรญtor M. Hanriot, Luiz C. B. Torres, Antรดnio P. Braga
Machine Learning
ๆฅ็ๆ่ฆ
While large margin classifiers are originally an outcome of an optimization framework, support vectors (SVs) can be obtained from geometric approaches. This article presents advances in the use of Gabriel graphs (GGs) in binary and multiclass classification problems. For Chipclass, a hyperparameter-less and optimization-less GG-based binary classifier, we discuss how activation functions and support edge (SE)-centered neurons affect the classification, proposing smoother functions and structural SV (SSV)-centered neurons to achieve margins with low probabilities and smoother classification contours. We extend the neural network architecture, which can be trained with backpropagation with a softmax function and a cross-entropy loss, or by solving a system of linear equations. A new subgraph-/distance-based membership function for graph regularization is also proposed, along with a new GG recomputation algorithm that is less computationally expensive than the standard approach. Experimental results with the Friedman test show that our method was better than previous GG-based classifiers and statistically equivalent to tree-based models.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Malte Silbernagel, Albert Alonso, Jens Petersen, Bulat Ibragimov, Marleen de Bruijne et al.
ๆฅ็ๆ่ฆ
Accurately predicting topologically correct masks remains a difficult task for general segmentation models, which often produce fragmented or disconnected outputs. Fixing these artifacts typically requires hand-crafted refinement rules or architectures specialized to a particular task. Here, we show that Neural Cellular Automata (NCA) can be directly re-purposed as an effective refinement mechanism, using local, iterative updates guided by image context to repair segmentation masks. By training on imperfect masks and ground truths, the automaton learns the structural properties of the target shape while relying solely on local information. When applied to coarse, globally predicted masks, the learned dynamics progressively reconnect broken regions, prune loose fragments and converge towards stable, topologically consistent results. We show how refinement NCA (rNCA) can be easily applied to repair common topological errors produced by different base segmentation models and tasks: for fragmented retinal vessels, it yields 2-3% gains in Dice/clDice and improves Betti errors, reducing $ฮฒ_0$ errors by 60% and $ฮฒ_1$ by 20%; for myocardium, it repairs 61.5% of broken cases in a zero-shot setting while lowering ASSD and HD by 19% and 16%, respectively. This showcases NCA as effective and broadly applicable refiners.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Changjun Zhou, Jintao Zheng, Leyou Yang, Pengfei Wang
ๆฅ็ๆ่ฆ
Federated Unlearning (FUL) focuses on client data and computing power to offer a privacy-preserving solution. However, high computational demands, complex incentive mechanisms, and disparities in client-side computing power often lead to long times and higher costs. To address these challenges, many existing methods rely on server-side knowledge distillation that solely removes the updates of the target client, overlooking the privacy embedded in the contributions of other clients, which can lead to privacy leakage. In this work, we introduce DPUL, a novel server-side unlearning method that deeply unlearns all influential weights to prevent privacy pitfalls. Our approach comprises three components: (i) identifying high-weight parameters by filtering client update magnitudes, and rolling them back to ensure deep removal. (ii) leveraging the variational autoencoder (VAE) to reconstruct and eliminate low-weight parameters. (iii) utilizing a projection-based technique to recover the model. Experimental results on four datasets demonstrate that DPUL surpasses state-of-the-art baselines, providing a 1%-5% improvement in accuracy and up to 12x reduction in time cost.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Sรผmer Tunรงay, Alain Andres, Ignacio Carlucho
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Autonomous Underwater Vehicles (AUVs) require reliable six-degree-of-freedom (6-DOF) position control to operate effectively in complex and dynamic marine environments. Traditional controllers are effective under nominal conditions but exhibit degraded performance when faced with unmodeled dynamics or environmental disturbances. Reinforcement learning (RL) provides a powerful alternative but training is typically slow and sim-to-real transfer remains challenging. This work introduces a GPU-accelerated RL training pipeline built in JAX and MuJoCo-XLA (MJX). By jointly JIT-compiling large-scale parallel physics simulation and learning updates, we achieve training times of under two minutes.Through systematic evaluation of multiple RL algorithms, we show robust 6-DOF trajectory tracking and effective disturbance rejection in real underwater experiments, with policies transferred zero-shot from simulation. Our results provide the first explicit real-world demonstration of RL-based AUV position control across all six degrees of freedom.
๐
2025-12-15
๐ This is the Author Accepted Manuscript version of
๐ฅ Zeyad Gamal, Youssef Mahran, Ayman El-Badawy
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.
๐
2025-12-15
๐ Accepted to IEEE Conference on Secure and Trustwor
๐ฅ Ali Al Sahili, Ali Chehab, Razane Tajeddine
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are prone to memorizing training data, which poses serious privacy risks. Two of the most prominent concerns are training data extraction and Membership Inference Attacks (MIAs). Prior research has shown that these threats are interconnected: adversaries can extract training data from an LLM by querying the model to generate a large volume of text and subsequently applying MIAs to verify whether a particular data point was included in the training set. In this study, we integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness. We then compare their performance in this integrated setting against results from conventional MIA benchmarks, allowing us to evaluate their practical utility in real-world extraction scenarios.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Henrik C. M. Frederiksen, Junya Shiraishi, Cedomir Stefanovic, Hei Victor Cheng, Shashi Raj Pandey
Machine Learning
ๆฅ็ๆ่ฆ
The use of lightweight machine learning (ML) models in internet of things (IoT) networks enables resource constrained IoT devices to perform on-device inference for several critical applications. However, the inference accuracy deteriorates due to the non-stationarity in the IoT environment and limited initial training data. To counteract this, the deployed models can be updated occasionally with new observed data samples. However, this approach consumes additional energy, which is undesirable for energy constrained IoT devices. This letter introduces an event-driven communication framework that strategically integrates continual learning (CL) in IoT networks for energy-efficient fault detection. Our framework enables the IoT device and the edge server (ES) to collaboratively update the lightweight ML model by adapting to the wireless link conditions for communication and the available energy budget. Evaluation on real-world datasets show that the proposed approach can outperform both periodic sampling and non-adaptive CL in terms of inference recall; our proposed approach achieves up to a 42.8% improvement, even under tight energy and bandwidth constraint.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Si Qi Goh, Yongsen Zheng, Ziyao Liu, Sami Hormi, Kwok-Yan Lam
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Machine unlearning (MU) seeks to eliminate the influence of specific training examples from deployed models. As large language models (LLMs) become widely used, managing risks arising from insufficient forgetting or utility loss is increasingly crucial. Current MU techniques lack effective mechanisms for evaluating and controlling these risks, hindering the selection of strategies that appropriately balance safety and utility, and raising trust concerns surrounding the "right to be forgotten." To address these issues, we propose FROC, a unified framework with Risk-Optimized Control for machine unlearning in LLMs. FROC is built around a conformal-style risk-control formulation that expresses a user-specified risk budget on unlearning behavior. This probability-based constraint enables FROC to compare MU strategies, identify feasible operating regions, and guide hyperparameter selection according to desired trade-offs between forgetting sufficiency and utility preservation. To operationalize this constraint, FROC introduces a smoothly varying continuous risk model that aggregates forgetting deficiency and utility degradation into a single configuration-level score. Building on conformal risk analysis, FROC computes (1) the Conformal Unlearning Risk (CUR), a data-driven estimated value on the probability that forgotten samples continue to influence model predictions, and (2) risk-controlled configuration sets, which identify unlearning hyperparameters that are valid under the specified risk budget. Experiments across multiple LLM MU methods demonstrate that FROC produces stable, interpretable risk landscapes and reveals consistent relationships between unlearning configurations, semantic shift, and utility impact. FROC reframes MU as a controllable, risk-aware process and offers a practical foundation for managing unlearning behavior in large-scale LLM deployments.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Karim Bounja, Lahcen Laayouni, Abdeljalil Sakat
Machine Learning
ๆฅ็ๆ่ฆ
This work introduces Knowledge-Distilled Physics-Informed Neural Networks (KD-PINN), a framework that transfers the predictive accuracy of a high-capacity teacher model to a compact student through a continuous adaptation of the Kullback-Leibler divergence. To confirm its generality for various dynamics and dimensionalities, the framework is evaluated on a representative set of partial differential equations (PDEs). In all tested cases, the student model preserved the teacher's physical accuracy, with a mean RMSE increase below 0.64%, and achieved inference speedups ranging from 4.8x (Navier-Stokes) to 6.9x (Burgers). The distillation process also revealed a regularizing effect. With an average inference latency of 5.3 ms on CPU, the distilled models enter the ultra-low-latency real-time regime defined by sub-10 ms performance. Finally, this study examines how knowledge distillation reduces inference latency in PINNs to contribute to the development of accurate ultra-low-latency neural PDE solvers.
๐
2025-12-15
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Malte Hellmeier
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Securing digital text is becoming increasingly relevant due to the widespread use of large language models. Individuals' fear of losing control over data when it is being used to train such machine learning models or when distinguishing model-generated output from text written by humans. Digital watermarking provides additional protection by embedding an invisible watermark within the data that requires protection. However, little work has been taken to analyze and verify if existing digital text watermarking methods are secure and undetectable by large language models. In this paper, we investigate the security-related area of watermarking and machine learning models for text data. In a controlled testbed of three experiments, ten existing Unicode text watermarking methods were implemented and analyzed across six large language models: GPT-5, GPT-4o, Teuken 7B, Llama 3.3, Claude Sonnet 4, and Gemini 2.5 Pro. The findings of our experiments indicate that, especially the latest reasoning models, can detect a watermarked text. Nevertheless, all models fail to extract the watermark unless implementation details in the form of source code are provided. We discuss the implications for security researchers and practitioners and outline future research opportunities to address security concerns.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ รrpรกd Pรกndy, Rรณbert Lakatos, Andrรกs Hajdu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advancements in artificial intelligence have sparked interest in industrial agents capable of supporting analysts in regulated sectors, such as finance and healthcare, within tabular data workflows. A key capability for such systems is performing accurate arithmetic operations on structured data while ensuring sensitive information never leaves secure, on-premises environments. Here, we introduce an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA), specifically applied to on-premises small language models (SLMs). Through a systematic evaluation of a leading SLM (Qwen3 4B), we find that while the base model exhibits fundamental limitations in arithmetic tasks, our proposed error-driven method, which clusters erroneous predictions to refine prompt-rules iteratively, dramatically improves performance, elevating the model's accuracy to 70.8\%. Our results suggest that developing reliable, interpretable, and industrially deployable AI assistants can be achieved not only through costly fine-tuning but also via systematic, error-driven prompt optimization, enabling small models to surpass larger language models (GPT-3.5 Turbo) in a privacy-compliant manner.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mikhail Zakharov
Computer Vision
ๆฅ็ๆ่ฆ
Face recognition systems rely on learning highly discriminative and compact identity clusters to enable accurate retrieval. However, as with other surveillance-oriented technologies, such systems raise serious privacy concerns due to their potential for unauthorized identity tracking. While several works have explored machine unlearning as a means of privacy protection, their applicability to face retrieval - especially for modern embedding-based recognition models - remains largely unexplored. In this work, we study the problem of face identity unlearning for retrieval systems and present its inherent challenges. The goal is to make selected identities unretrievable by dispersing their embeddings on the hypersphere and preventing the formation of compact identity clusters that enable re-identification in the gallery. The primary challenge is to achieve this forgetting effect while preserving the discriminative structure of the embedding space and the retrieval performance of the model for the remaining identities. To address this, we evaluate several existing approximate class unlearning methods (e.g., Random Labeling, Gradient Ascent, Boundary Unlearning, and other recent approaches) in the context of face retrieval and propose a simple yet effective dispersion-based unlearning approach. Extensive experiments on standard benchmarks (VGGFace2, CelebA) demonstrate that our method achieves superior forgetting behavior while preserving retrieval utility.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mayank Gulati, Benedikt Groร, Gerhard Wunder
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
We present ALIGN-FL, a novel approach to distributed learning that addresses the challenge of learning from highly disjoint data distributions through selective sharing of generative components. Instead of exchanging full model parameters, our framework enables privacy-preserving learning by transferring only generative capabilities across clients, while the server performs global training using synthetic samples. Through complementary privacy mechanisms: DP-SGD with adaptive clipping and Lipschitz regularized VAE decoders and a stateful architecture supporting heterogeneous clients, we experimentally validate our approach on MNIST and Fashion-MNIST datasets with cross-domain outliers. Our analysis demonstrates that both privacy mechanisms effectively map sensitive outliers to typical data points while maintaining utility in extreme Non-IID scenarios typical of cross-silo collaborations.
Index Terms: Client-invariant Learning, Federated Learning (FL), Privacy-preserving Generative Models, Non-Independent and Identically Distributed (Non-IID), Heterogeneous Architectures
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qinglin Jia, Zhaocheng Du, Chuhan Wu, Huifeng Guo, Ruiming Tang et al.
ๆฅ็ๆ่ฆ
In most real-world online advertising systems, advertisers typically have diverse customer acquisition goals. A common solution is to use multi-task learning (MTL) to train a unified model on post-click data to estimate the conversion rate (CVR) for these diverse targets. In practice, CVR prediction often encounters missing conversion data as many advertisers submit only a subset of user conversion actions due to privacy or other constraints, making the labels of multi-task data incomplete. If the model is trained on all available samples where advertisers submit user conversion actions, it may struggle when deployed to serve a subset of advertisers targeting specific conversion actions, as the training and deployment data distributions are mismatched. While considerable MTL efforts have been made, a long-standing challenge is how to effectively train a unified model with the incomplete and skewed multi-label data. In this paper, we propose a fine-grained Knowledge transfer framework for Asymmetric Multi-Label data (KAML). We introduce an attribution-driven masking strategy (ADM) to better utilize data with asymmetric multi-label data in training. However, the more relaxed masking in ADM is a double-edged sword: it provides additional training signals but also introduces noise due to skewed data. To address this, we propose a hierarchical knowledge extraction mechanism (HKE) to model the sample discrepancy within the target task tower. Finally, to maximize the utility of unlabeled samples, we incorporate ranking loss strategy to further enhance our model. The effectiveness of KAML has been demonstrated through comprehensive evaluations on offline industry datasets and online A/B tests, which show significant performance improvements over existing MTL baselines.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhenghao Zhu, Chuxue Cao, Sirui Han, Yuanfeng Song, Xing Chen et al.
Multimodal
ๆฅ็ๆ่ฆ
In medical data analysis, extracting deep insights from complex, multi-modal datasets is essential for improving patient care, increasing diagnostic accuracy, and optimizing healthcare operations. However, there is currently a lack of high-quality datasets specifically designed to evaluate the ability of large multi-modal models (LMMs) to discover medical insights. In this paper, we introduce MedInsightBench, the first benchmark that comprises 332 carefully curated medical cases, each annotated with thoughtfully designed insights. This benchmark is intended to evaluate the ability of LMMs and agent frameworks to analyze multi-modal medical image data, including posing relevant questions, interpreting complex findings, and synthesizing actionable insights and recommendations. Our analysis indicates that existing LMMs exhibit limited performance on MedInsightBench, which is primarily attributed to their challenges in extracting multi-step, deep insights and the absence of medical expertise. Therefore, we propose MedInsightAgent, an automated agent framework for medical data analysis, composed of three modules: Visual Root Finder, Analytical Insight Agent, and Follow-up Question Composer. Experiments on MedInsightBench highlight pervasive challenges and demonstrate that MedInsightAgent can improve the performance of general LMMs in medical data insight discovery.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shu Yu, Chaochao Lu
ๆฅ็ๆ่ฆ
Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at https://opencausalab.github.io/LINA.
๐
2025-12-15
๐ ICCV 2025
๐ฅ Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunhong Min, Juil Koo, Seungwoo Yoo, Minhyuk Sung
Robotics
ๆฅ็ๆ่ฆ
We introduce BรฉzierFlow, a lightweight training approach for few-step generation with pretrained diffusion and flow models. BรฉzierFlow achieves a 2-3x performance improvement for sampling with $\leq$ 10 NFEs while requiring only 15 minutes of training. Recent lightweight training approaches have shown promise by learning optimal timesteps, but their scope remains restricted to ODE discretizations. To broaden this scope, we propose learning the optimal transformation of the sampling trajectory by parameterizing stochastic interpolant (SI) schedulers. The main challenge lies in designing a parameterization that satisfies critical desiderata, including boundary conditions, differentiability, and monotonicity of the SNR. To effectively meet these requirements, we represent scheduler functions as Bรฉzier functions, where control points naturally enforce these properties. This reduces the problem to learning an ordered set of points in the time range, while the interpretation of the points changes from ODE timesteps to Bรฉzier control points. Across a range of pretrained diffusion and flow models, BรฉzierFlow consistently outperforms prior timestep-learning methods, demonstrating the effectiveness of expanding the search space from discrete timesteps to Bรฉzier-based trajectory transformations.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zihui Zhao, Zechang Li
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Direct Preference Optimization (DPO) has emerged as a lightweight and effective alternative to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with AI Feedback (RLAIF) for aligning large language and vision-language models. However, the standard DPO formulation, in which both the chosen and rejected responses are generated by the same policy, suffers from a weak learning signal because the two responses often share similar errors and exhibit small Kullback-Leibler (KL) divergence. This leads to slow and unstable convergence. To address this limitation, we introduce Reflective Preference Optimization (RPO), a new framework that incorporates hint-guided reflection into the DPO paradigm. RPO uses external models to identify hallucination sources and generate concise reflective hints, enabling the construction of on-policy preference pairs with stronger contrastiveness and clearer preference signals. We theoretically show that conditioning on hints increases the expected preference margin through mutual information and improves sample efficiency while remaining within the policy distribution family. Empirically, RPO achieves superior alignment with fewer training samples and iterations, substantially reducing hallucination rates and delivering state-of-the-art performance across multimodal benchmarks.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Arnab Sharma
ๆฅ็ๆ่ฆ
Neural Encoders are frequently used in the NLP domain to perform dense retrieval tasks, for instance, to generate the candidate documents for a given query in question-answering tasks. However, sparse annotation and label noise in the training data make it challenging to train or fine-tune such retrieval models. Although existing works have attempted to mitigate these problems by incorporating modified loss functions or data cleaning, these approaches either require some hyperparameters to tune during training or add substantial complexity to the training setup. In this work, we consider a label weakening approach to generate robust retrieval models in the presence of label noise. Instead of enforcing a single, potentially erroneous label for each query document pair, we allow for a set of plausible labels derived from both the observed supervision and the model's confidence scores. We perform an extensive evaluation considering two retrieval models, one re-ranking model, considering four diverse ranking datasets. To this end, we also consider a realistic noisy setting by using a semantic-aware noise generation technique to generate different ratios of noise. Our initial results show that label weakening can improve the performance of the retrieval tasks in comparison to 10 different state-of-the-art loss functions.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jianyuan Bo, Yuan Fang
Machine Learning
ๆฅ็ๆ่ฆ
In the rapidly evolving field of self-supervised learning on graphs, generative and contrastive methodologies have emerged as two dominant approaches. Our study focuses on masked feature reconstruction (MFR), a generative technique where a model learns to restore the raw features of masked nodes in a self-supervised manner. We observe that both MFR and graph contrastive learning (GCL) aim to maximize agreement between similar elements. Building on this observation, we reveal a novel theoretical insight: under specific conditions, the objectives of MFR and node-level GCL converge, despite their distinct operational mechanisms. This theoretical connection suggests these approaches are complementary rather than fundamentally different, prompting us to explore their integration to enhance self-supervised learning on graphs. Our research presents Contrastive Masked Feature Reconstruction (CORE), a novel graph self-supervised learning framework that integrates contrastive learning into MFR. Specifically, we form positive pairs exclusively between the original and reconstructed features of masked nodes, encouraging the encoder to prioritize contextual information over the node's own features. Additionally, we leverage the masked nodes themselves as negative samples, combining MFR's reconstructive power with GCL's discriminative ability to better capture intrinsic graph structures. Empirically, our proposed framework CORE significantly outperforms MFR across node and graph classification tasks, demonstrating state-of-the-art results. In particular, CORE surpasses GraphMAE and GraphMAE2 by up to 2.80% and 3.72% on node classification tasks, and by up to 3.82% and 3.76% on graph classification tasks.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Melvin Barbaux
ๆฅ็ๆ่ฆ
Semi-supervised classification leverages both labeled and unlabeled data to improve predictive performance, but existing software support is fragmented across methods and modalities. We introduce ModSSC, an open source Python framework that unifies inductive and transductive semi-supervised classification in a modular code base. ModSSC implements a broad range of classical and recent algorithms, provides loaders for tabular, image, text, audio and graph datasets, and exposes a single configuration interface for specifying datasets, models and evaluation protocols. It supports both lightweight classical methods on small datasets running on CPU and recent deep approaches that can exploit multiple GPUs within the same experimental framework. Experiments are described declaratively in YAML, which facilitates reproducing existing work and running large comparative studies. ModSSC 1.0.0 is released under the MIT license with extensive documentation and tests, and is available at https://github.com/ModSSC/ModSSC.
๐
2025-12-15
๐ ArXiv Preprint (math.OC)
๐ฅ Sarit Khirirat, Abdurakhmon Sadiev, Yury Demidovich, Peter Richtรกrik
Machine Learning
ๆฅ็ๆ่ฆ
The use of momentum in stochastic optimization algorithms has shown empirical success across a range of machine learning tasks. Recently, a new class of stochastic momentum algorithms has emerged within the Linear Minimization Oracle (LMO) framework--leading to state-of-the-art methods, such as Muon, Scion, and Gluon, that effectively solve deep neural network training problems. However, traditional stochastic momentum methods offer convergence guarantees no better than the ${O}(1/K^{1/4})$ rate. While several approaches--such as Hessian-Corrected Momentum (HCM)--have aimed to improve this rate, their theoretical results are generally restricted to the Euclidean norm setting. This limitation hinders their applicability in problems, where arbitrary norms are often required. In this paper, we extend the LMO-based framework by integrating HCM, and provide convergence guarantees under relaxed smoothness and arbitrary norm settings. We establish improved convergence rates of ${O}(1/K^{1/3})$ for HCM, which can adapt to the geometry of the problem and achieve a faster rate than traditional momentum. Experimental results on training Multi-Layer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks verify our theoretical observations.
๐
2025-12-15
๐ ArXiv Preprint (math.OC)
๐ฅ Lorenzo Sabug, Eric Kerrigan
Machine Learning
ๆฅ็ๆ่ฆ
We revisit the problem of physics-informed regression, and propose a method that directly computes the state at the prediction point, simultaneously with the derivative and curvature information of the existing samples. We frame each prediction as a constrained optimisation problem, leveraging multivariate Taylor series expansions and explicitly enforcing physical laws. Each individual query can be processed with low computational cost without any pre- or re-training, in contrast to global function approximator-based solutions such as neural networks. Our comparative benchmarks on a reaction-diffusion system show competitive predictive accuracy relative to a neural network-based solution, while completely eliminating the need for long training loops, and remaining robust to changes in the sampling layout.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Karina Chichifoi, Fabio Merizzi, Michele Colajanni
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning and federated learning (FL) are becoming powerful partners for next-generation weather forecasting. Deep learning enables high-resolution spatiotemporal forecasts that can surpass traditional numerical models, while FL allows institutions in different locations to collaboratively train models without sharing raw data, addressing efficiency and security concerns. While FL has shown promise across heterogeneous regions, its distributed nature introduces new vulnerabilities. In particular, data poisoning attacks, in which compromised clients inject manipulated training data, can degrade performance or introduce systematic biases. These threats are amplified by spatial dependencies in meteorological data, allowing localized perturbations to influence broader regions through global model aggregation. In this study, we investigate how adversarial clients distort federated surface temperature forecasts trained on the Copernicus European Regional ReAnalysis (CERRA) dataset. We simulate geographically distributed clients and evaluate patch-based and global biasing attacks on regional temperature forecasts. Our results show that even a small fraction of poisoned clients can mislead predictions across large, spatially connected areas. A global temperature bias attack from a single compromised client shifts predictions by up to -1.7 K, while coordinated patch attacks more than triple the mean squared error and produce persistent regional anomalies exceeding +3.5 K. Finally, we assess trimmed mean aggregation as a defense mechanism, showing that it successfully defends against global bias attacks (2-13\% degradation) but fails against patch attacks (281-603\% amplification), exposing limitations of outlier-based defenses for spatially correlated data.
๐
2025-12-15
๐ ArXiv Preprint (physics.geo-ph)
๐ฅ Ayrat Abdullin, Umair bin Waheed, Leo Eisner, Naveed Iqbal
Machine Learning
ๆฅ็ๆ่ฆ
Seismic phase picking is very often used for microseismic monitoring and subsurface imaging. Traditional manual processing is not feasible for either real-time applications or large arrays. Deep learning-based pickers trained on large earthquake catalogs offer an automated alternative. However, they are typically optimized for high signal-to-noise, long-duration networks and struggle with the challenges presented by microseismic datasets, which are purpose-built for limited time without previously detected seismicity. In this study, we demonstrate how a network-wide earthquake phase picker, the Phase Neural Operator (PhaseNO), can be adapted to microseismic monitoring using transfer learning. Starting from a PhaseNO model pre-trained on more than 57,000 three-component earthquake and noise records, we fine-tune the model using only 200 labeled and noise seismograms from induced events in hydraulic-fracturing settings. The fine-tuned model thus preserves the rich spatio-temporal representation learned from abundant earthquake data, while adapting to the characteristics and labeling conventions of microseismic phases, which are often picked on peaks or troughs rather than onsets. We evaluate performance on three distinct real-world microseismic datasets with different network geometries and acquisition parameters. Compared to the original PhaseNO and a conventional workflow, the adapted model increases F1 score and accuracy by up to 30%, and strongly reduces systematic timing bias and pick uncertainty. Because the adaptation relies on a small, campaign-specific calibration set, the approach is readily transferable to other microseismic tasks where public earthquake data and pre-trained models are accessible. The associated code will be released openly at https://github.com/ayratabd/MicroPhaseNO.
๐
2025-12-15
๐ This paper was accepted and presented at WinTechCo
๐ฅ Chethana Prasad Kabgere, Sudarshan T S B
ๆฅ็ๆ่ฆ
Advanced Driver Assistance Systems (ADAS) increasingly employ Federated Learning (FL) to collaboratively train models across distributed vehicular nodes while preserving data privacy. Yet, conventional FL aggregation remains susceptible to noise, latency, and security constraints inherent to real-time vehicular networks. This paper introduces Noise-Resilient Quantum Federated Learning (NR-QFL), a hybrid quantum-classical framework that enables secure, low-latency aggregation through variational quantum circuits (VQCs) operating under Noisy Intermediate-Scale Quantum (NISQ) conditions. The framework encodes model parameters as quantum states with adaptive gate reparameterization, ensuring bounded-error convergence and provable resilience under Completely Positive Trace-Preserving (CPTP) dynamics. NR-QFL employs quantum entropy-based client selection and multi-server coordination for fairness and stability. Empirical validation shows consistent convergence with reduced gradient variance, lower communication overhead, and enhanced noise tolerance under constrained edge conditions. The framework establishes a scalable foundation for quantum-enhanced federated learning, enabling secure, efficient, and dynamically stable ADAS intelligence at the vehicular edge.
๐
2025-12-15
๐ Accepted to IEEE Transactions on Aerospace and Ele
๐ฅ Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Dongil Park, Sung Won Han
Machine Learning
ๆฅ็ๆ่ฆ
The Automatic Identification System (AIS) enables data-driven maritime surveillance but suffers from reliability issues and irregular intervals. We address vessel destination estimation using global-scope AIS data by proposing a differentiated approach that recasts long port-to-port trajectories as a nested sequence structure. Using spatial grids, this method mitigates spatio-temporal bias while preserving detailed resolution. We introduce a novel deep learning architecture, WAY, designed to process these reformulated trajectories for long-term destination estimation days to weeks in advance. WAY comprises a trajectory representation layer and Channel-Aggregative Sequential Processing (CASP) blocks. The representation layer generates multi-channel vector sequences from kinematic and non-kinematic features. CASP blocks utilize multi-headed channel- and self-attention for aggregation and sequential information delivery. Additionally, we propose a task-specialized Gradient Dropout (GD) technique to enable many-to-many training on single labels, preventing biased feedback surges by stochastically blocking gradient flow based on sample length. Experiments on 5-year AIS data demonstrate WAY's superiority over conventional spatial grid-based approaches regardless of trajectory progression. Results further confirm that adopting GD leads to performance gains. Finally, we explore WAY's potential for real-world application through multitask learning for ETA estimation.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Khalid Ferji
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Machine-learning (ML) models in polymer science typically treat a polymer as a single, perfectly defined molecular graph, even though real materials consist of stochastic ensembles of chains with distributed lengths. This mismatch between physical reality and digital representation limits the ability of current models to capture polymer behaviour. Here we introduce PolySet, a framework that represents a polymer as a finite, weighted ensemble of chains sampled from an assumed molar-mass distribution. This ensemble-based encoding is independent of chemical detail, compatible with any molecular representation and illustrated here in the homopolymer case using a minimal language model. We show that PolySet retains higher-order distributional moments (such as Mz, Mz+1), enabling ML models to learn tail-sensitive properties with greatly improved stability and accuracy. By explicitly acknowledging the statistical nature of polymer matter, PolySet establishes a physically grounded foundation for future polymer machine learning, naturally extensible to copolymers, block architectures, and other complex topologies.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Deepak Ingole, Valentin Bhend, Shiva Ganesh Murali, Oliver Dobrich, Alisa Rupenayan
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Manufacturing processes are often perturbed by drifts in the environment and wear in the system, requiring control re-tuning even in the presence of repetitive operations. This paper presents an iterative learning framework for automatic tuning of Nonlinear Model Predictive Control (NMPC) weighting matrices based on task-level performance feedback. Inspired by norm-optimal Iterative Learning Control (ILC), the proposed method adaptively adjusts NMPC weights Q and R across task repetitions to minimize key performance indicators (KPIs) related to tracking accuracy, control effort, and saturation. Unlike gradient-based approaches that require differentiating through the NMPC solver, we construct an empirical sensitivity matrix, enabling structured weight updates without analytic derivatives. The framework is validated through simulation on a UR10e robot performing carbon fiber winding on a tetrahedral core. Results demonstrate that the proposed approach converges to near-optimal tracking performance (RMSE within 0.3% of offline Bayesian Optimization (BO)) in just 4 online repetitions, compared to 100 offline evaluations required by BO algorithm. The method offers a practical solution for adaptive NMPC tuning in repetitive robotic tasks, combining the precision of carefully optimized controllers with the flexibility of online adaptation.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jakub ลyskawa, Jakub Lewandowski, Paweล Wawrzyลski
Machine Learning
ๆฅ็ๆ่ฆ
Soft Actor-Critic (SAC) is widely used in practical applications and is now one of the most relevant off-policy online model-free reinforcement learning (RL) methods. The technique of n-step returns is known to increase the convergence speed of RL algorithms compared to their 1-step returns-based versions. However, SAC is notoriously difficult to combine with n-step returns, since their usual combination introduces bias in off-policy algorithms due to the changes in action distribution. While this problem is solved by importance sampling, a method for estimating expected values of one distribution using samples from another distribution, importance sampling may result in numerical instability. In this work, we combine SAC with n-step returns in a way that overcomes this issue. We present an approach to applying numerically stable importance sampling with simplified hyperparameter selection. Furthermore, we analyze the entropy estimation approach of Soft Actor-Critic in the context of the n-step maximum entropy framework and formulate the $ฯ$-sampled entropy estimation to reduce the variance of the learning target. Finally, we formulate the Soft Actor-Critic with n-step returns (SAC$n$) algorithm that we experimentally verify on MuJoCo simulated environments.
๐
2025-12-15
๐ KDD 2026
๐ฅ Xinwei Tai, Dongmian Zou, Hongfei Wang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent years have witnessed significant advancements in machine learning methods on graphs. However, transferring knowledge effectively from one graph to another remains a critical challenge. This highlights the need for algorithms capable of applying information extracted from a source graph to an unlabeled target graph, a task known as unsupervised graph domain adaptation (GDA). One key difficulty in unsupervised GDA is conditional shift, which hinders transferability. In this paper, we show that conditional shift can be observed only if there exists local dependencies among node features. To support this claim, we perform a rigorous analysis and also further provide generalization bounds of GDA when dependent node features are modeled using markov chains. Guided by the theoretical findings, we propose to improve GDA by decorrelating node features, which can be specifically implemented through decorrelated GCN layers and graph transformer layers. Our experimental results demonstrate the effectiveness of this approach, showing not only substantial performance enhancements over baseline GDA methods but also clear visualizations of small intra-class distances in the learned representations. Our code is available at https://github.com/TechnologyAiGroup/DFT
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chun Kit Wong, Paraskevas Pegios, Nina Weng, Emilie Pi Fogtmann Sejer, Martin Grรธnnebรฆk Tolsgaard et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning models in medical imaging are susceptible to shortcut learning, relying on confounding metadata (e.g., scanner model) that is often encoded in image embeddings. The crucial question is whether the model actively utilizes this encoded information for its final prediction. We introduce Weight Space Correlation Analysis, an interpretable methodology that quantifies feature utilization by measuring the alignment between the classification heads of a primary clinical task and auxiliary metadata tasks. We first validate our method by successfully detecting artificially induced shortcut learning. We then apply it to probe the feature utilization of an SA-SonoNet model trained for Spontaneous Preterm Birth (sPTB) prediction. Our analysis confirmed that while the embeddings contain substantial metadata, the sPTB classifier's weight vectors were highly correlated with clinically relevant factors (e.g., birth weight) but decoupled from clinically irrelevant acquisition factors (e.g. scanner). Our methodology provides a tool to verify model trustworthiness, demonstrating that, in the absence of induced bias, the clinical model selectively utilizes features related to the genuine clinical signal.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lukas Bischof, Rudolf M. Fรผchslin, Kurt Stockinger, Pavel Sulimov
Machine Learning
ๆฅ็ๆ่ฆ
The analysis of spectra, such as Nuclear Magnetic Resonance (NMR) spectra, for the comprehensive characterization of peaks is a challenging task for both experts and machines, especially with complex molecules. This process, also known as deconvolution, involves identifying and quantifying the peaks in the spectrum. Machine learning techniques have shown promising results in automating this process. With the advent of quantum computing, there is potential to further enhance these techniques. In this work, inspired by the success of classical Convolutional Neural Networks (CNNs), we explore the use of Quanvolutional Neural Networks (QuanvNNs) for the multi-task peak finding problem, involving both peak counting and position estimation. We implement a simple and interpretable QuanvNN architecture that can be directly compared to its classical CNN counterpart, and evaluate its performance on a synthetic NMR-inspired dataset. Our results demonstrate that QuanvNNs outperform classical CNNs on challenging spectra, achieving an 11\% improvement in F1 score and a 30\% reduction in mean absolute error for peak position estimation. Additionally, QuanvNNs appear to exhibit better convergence stability for harder problems.
๐
2025-12-15
๐ ArXiv Preprint (math.OC)
๐ฅ Liviu Aolaritei, Michael I. Jordan
Machine Learning
ๆฅ็ๆ่ฆ
We study stopping rules for stochastic gradient descent (SGD) for convex optimization from the perspective of anytime-valid confidence sequences. Classical analyses of SGD provide convergence guarantees in expectation or at a fixed horizon, but offer no statistically valid way to assess, at an arbitrary time, how close the current iterate is to the optimum. We develop an anytime-valid, data-dependent upper confidence sequence for the weighted average suboptimality of projected SGD, constructed via nonnegative supermartingales and requiring no smoothness or strong convexity. This confidence sequence yields a simple stopping rule that is provably $\varepsilon$-optimal with probability at least $1-ฮฑ$ and is almost surely finite under standard stochastic approximation stepsizes. To the best of our knowledge, these are the first rigorous, time-uniform performance guarantees and finite-time $\varepsilon$-optimality certificates for projected SGD with general convex objectives, based solely on observable trajectory quantities.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mabiao Long, Jiaxi Liu, Yufeng Li, Hao Xiong, Junchi Yan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Deploying dynamic heterogeneous graph embeddings in production faces key challenges of scalability, data freshness, and cold-start. This paper introduces a practical, two-stage solution that balances deep graph representation with low-latency incremental updates. Our framework combines HetSGFormer, a scalable graph transformer for static learning, with Incremental Locally Linear Embedding (ILLE), a lightweight, CPU-based algorithm for real-time updates. HetSGFormer captures global structure with linear scalability, while ILLE provides rapid, targeted updates to incorporate new data, thus avoiding costly full retraining. This dual approach is cold-start resilient, leveraging the graph to create meaningful embeddings from sparse data. On billion-scale graphs, A/B tests show HetSGFormer achieved up to a 6.11% lift in Advertiser Value over previous methods, while the ILLE module added another 3.22% lift and improved embedding refresh timeliness by 83.2%. Our work provides a validated framework for deploying dynamic graph learning in production environments.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hayk Amirkhanian, Marco F. Huber
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, neural networks have revolutionized various domains, yet challenges such as hyperparameter tuning and overfitting remain significant hurdles. Bayesian neural networks offer a framework to address these challenges by incorporating uncertainty directly into the model, yielding more reliable predictions, particularly for out-of-distribution data. This paper presents Hierarchical Approximate Bayesian Neural Network, a novel approach that uses a Gaussian-inverse-Wishart distribution as a hyperprior of the network's weights to increase both the robustness and performance of the model. We provide analytical representations for the predictive distribution and weight posterior, which amount to the calculation of the parameters of Student's t-distributions in closed form with linear complexity with respect to the number of weights. Our method demonstrates robust performance, effectively addressing issues of overfitting and providing reliable uncertainty estimates, particularly for out-of-distribution tasks. Experimental results indicate that HABNN not only matches but often outperforms state-of-the-art models, suggesting a promising direction for future applications in safety-critical environments.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shenzhi Yang, Guangcheng Zhu, Xing Zheng, Yingfan MA, Zhongqi Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, when using 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using only 10% of the labeled data. The code is available via https://github.com/ShenzhiYang2000/TRAPO.
๐
2025-12-15
๐ This work has been submitted to the IEEE TMI for p
๐ฅ Wenjing Lu, Yi Hong, Yang Yang
Computer Vision
ๆฅ็ๆ่ฆ
Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Feng Zhang, Zezhong Tan, Xinhong Ma, Ziqiang Dong, Xi Leng et al.
Machine Learning
ๆฅ็ๆ่ฆ
To combine the advantages of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), recent methods have integrated ''hints'' into post-training, which are prefix segments of complete reasoning trajectories, aiming for powerful knowledge expansion and reasoning generalization. However, existing hint-based RL methods typically ignore difficulty when scheduling hint ratios and estimating relative advantages, leading to unstable learning and excessive imitation of off-policy hints. In this work, we propose ADHint, which treats difficulty as a key factor in both hint-ratio schedule and relative-advantage estimation to achieve a better trade-off between exploration and imitation. Specifically, we propose Adaptive Hint with Sample Difficulty Prior, which evaluates each sample's difficulty under the policy model and accordingly schedules an appropriate hint ratio to guide its rollouts. We also introduce Consistency-based Gradient Modulation and Selective Masking for Hint Preservation to modulate token-level gradients within hints, preventing biased and destructive updates. Additionally, we propose Advantage Estimation with Rollout Difficulty Posterior, which leverages the relative difficulty of rollouts with and without hints to estimate their respective advantages, thereby achieving more balanced updates. Extensive experiments across diverse modalities, model scales, and domains demonstrate that ADHint delivers superior reasoning ability and out-of-distribution generalization, consistently surpassing existing methods in both pass@1 and avg@8. Our code and dataset will be made publicly available upon paper acceptance.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Mingqi Yuan, Tao Yu, Haolin Song, Bo Li, Xin Jin et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop SRL4Humanoid, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saumyaranjan Mohanty, Aravind Reddy, Konda Reddy Mopuri
ๆฅ็ๆ่ฆ
In Dataset Condensation, the goal is to synthesize a small dataset that replicates the training utility of a large original dataset. Existing condensation methods synthesize datasets with significant redundancy, so there is a dire need to reduce redundancy and improve the diversity of the synthesized datasets. To tackle this, we propose an intuitive Diversity Regularizer (DiRe) composed of cosine similarity and Euclidean distance, which can be applied off-the-shelf to various state-of-the-art condensation methods. Through extensive experiments, we demonstrate that the addition of our regularizer improves state-of-the-art condensation methods on various benchmark datasets from CIFAR-10 to ImageNet-1K with respect to generalization and diversity metrics.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Md Awsafur Rahman, Adam Gabrys, Doug Kang, Jingjing Sun, Tian Tan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
A personalized LLM should remember user facts, apply them correctly, and adapt over time to provide responses that the user prefers. Existing LLM personalization benchmarks are largely centered on two axes: accurately recalling user information and accurately applying remembered information in downstream tasks. We argue that a third axis, likability, is both subjective and central to user experience, yet under-measured by current benchmarks. To measure likability holistically, we introduce LikeBench, a multi-session, dynamic evaluation framework that measures likability across multiple dimensions by how much an LLM can adapt over time to a user's preferences to provide more likable responses. In LikeBench, the LLMs engage in conversation with a simulated user and learn preferences only from the ongoing dialogue. As the interaction unfolds, models try to adapt to responses, and after each turn, they are evaluated for likability across seven dimensions by the same simulated user. To the best of our knowledge, we are the first to decompose likability into multiple diagnostic metrics: emotional adaptation, formality matching, knowledge adaptation, reference understanding, conversation length fit, humor fit, and callback, which makes it easier to pinpoint where a model falls short. To make the simulated user more realistic and discriminative, LikeBench uses fine-grained, psychologically grounded descriptive personas rather than the coarse high/low trait rating based personas used in prior work. Our benchmark shows that strong memory performance does not guarantee high likability: DeepSeek R1, with lower memory accuracy (86%, 17 facts/profile), outperformed Qwen3 by 28% on likability score despite Qwen3's higher memory accuracy (93%, 43 facts/profile). Even SOTA models like GPT-5 adapt well in short exchanges but show only limited robustness in longer, noisier interactions.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Javier Nieto-Centenero, Esther Andrรฉs, Rodrigo Castellanos
Machine Learning
ๆฅ็ๆ่ฆ
Accurate aerodynamic prediction often relies on high-fidelity simulations; however, their prohibitive computational costs severely limit their applicability in data-driven modeling. This limitation motivates the development of multi-fidelity strategies that leverage inexpensive low-fidelity information without compromising accuracy. Addressing this challenge, this work presents a multi-fidelity deep learning framework that combines autoencoder-based transfer learning with a newly developed Multi-Split Conformal Prediction (MSCP) strategy to achieve uncertainty-aware aerodynamic data fusion under extreme data scarcity. The methodology leverages abundant Low-Fidelity (LF) data to learn a compact latent physics representation, which acts as a frozen knowledge base for a decoder that is subsequently fine-tuned using scarce HF samples. Tested on surface-pressure distributions for NACA airfoils (2D) and a transonic wing (3D) databases, the model successfully corrects LF deviations and achieves high-accuracy pressure predictions using minimal HF training data. Furthermore, the MSCP framework produces robust, actionable uncertainty bands with pointwise coverage exceeding 95%. By combining extreme data efficiency with uncertainty quantification, this work offers a scalable and reliable solution for aerodynamic regression in data-scarce environments.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kangning Gao, Yi Hu, Cong Nie, Wei Li
Machine Learning
ๆฅ็ๆ่ฆ
This paper addresses the challenges of low scheduling efficiency, unbalanced resource allocation, and poor adaptability in ETL (Extract-Transform-Load) processes under heterogeneous data environments by proposing an intelligent scheduling optimization framework based on deep Q-learning. The framework formalizes the ETL scheduling process as a Markov Decision Process and enables adaptive decision-making by a reinforcement learning agent in high-dimensional state spaces to dynamically optimize task allocation and resource scheduling. The model consists of a state representation module, a feature embedding network, a Q-value estimator, and a reward evaluation mechanism, which collectively consider task dependencies, node load states, and data flow characteristics to derive the optimal scheduling strategy in complex environments. A multi-objective reward function is designed to balance key performance indicators such as average scheduling delay, task completion rate, throughput, and resource utilization. Sensitivity experiments further verify the model's robustness under changes in hyperparameters, environmental dynamics, and data scale. Experimental results show that the proposed deep Q-learning scheduling framework significantly reduces scheduling delay, improves system throughput, and enhances execution stability under multi-source heterogeneous task conditions, demonstrating the strong potential of reinforcement learning in complex data scheduling and resource management, and providing an efficient and scalable optimization strategy for intelligent data pipeline construction.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuwei Tan, Yao Ma, Xueru Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions. Large Language Models (LLMs), in contrast, can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts and informing system refinements. However, they perform poorly when applied directly to tabular fraud detection due to the difficulty of reasoning over many features, the extreme class imbalance, and the absence of contextual information. To bridge this gap, we introduce FinFRE-RAG, a two-stage approach that applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language and performs retrieval-augmented in-context learning over label-aware, instance-level exemplars. Across four public fraud datasets and three families of open-weight LLMs, FinFRE-RAG substantially improves F1/MCC over direct prompting and is competitive with strong tabular baselines in several settings. Although these LLMs still lag behind specialized classifiers, they narrow the performance gap and provide interpretable rationales, highlighting their value as assistive tools in fraud analysis.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Taoran Sheng, Sathappan Muthiah, Atiq Islam, Jinming Feng
ๆฅ็ๆ่ฆ
In e-commerce shopping, aligning search results with a buyer's immediate needs and preferences presents a significant challenge, particularly in adapting search results throughout the buyer's shopping journey as they move from the initial stages of browsing to making a purchase decision or shift from one intent to another. This study presents a systematic approach to adapting e-commerce search results based on the current context. We start with basic methods and incrementally incorporate more contextual information and state-of-the-art techniques to improve the search outcomes. By applying this evolving contextual framework to items displayed on the search engine results page (SERP), we progressively align search outcomes more closely with the buyer's interests and current search intentions. Our findings demonstrate that this incremental enhancement, from simple heuristic autoregressive features to advanced sequence models, significantly improves ranker performance. The integration of contextual techniques enhances the performance of our production ranker, leading to improved search results in both offline and online A/B testing in terms of Mean Reciprocal Rank (MRR). Overall, the paper details iterative methodologies and their substantial contributions to search result contextualization on e-commerce platforms.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaoyu He, Yu Cai, Jin Jia, Canxi Huang, Wenqing Chen et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This work proposes Alada, an adaptive momentum method for stochastic optimization over large-scale matrices. Alada employs a rank-one factorization approach to estimate the second moment of gradients, where factors are updated alternatively to minimize the estimation error. Alada achieves sublinear memory overheads and can be readily extended to optimizing tensor-shaped variables.We also equip Alada with a first moment estimation rule, which enhances the algorithm's robustness without incurring additional memory overheads. The theoretical performance of Alada aligns with that of traditional methods such as Adam. Numerical studies conducted on several natural language processing tasks demonstrate the reduction in memory overheads and the robustness in training large models relative to Adam and its variants.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jongwook Kim, Sangheon Yun, Sukjin Yoon
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass $QKV$ structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong theoretical evidence: the standard attention gradient is suboptimal. We demonstrated that selectively scaling these components, focusing primarily on $0^{th}$ order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, and using a crude configuration, this method achieved a $0.56\%$ reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential gains on larger datasets and deeper training regimes
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tomoya Tanaka, Tomonori Ikeda, Ryo Yonemoto
Machine Learning
ๆฅ็ๆ่ฆ
This study presents the first comprehensive evaluation of spatial generalization techniques, which are essential for the practical deployment of deep learning-based radio-frequency (RF) sensing. Focusing on people counting in indoor environments using frequency-modulated continuous-wave (FMCW) multiple-input multiple-output (MIMO) radar, we systematically investigate a broad set of approaches, including amplitude-based statistical preprocessing (sigmoid weighting and threshold zeroing), frequency-domain filtering, autoencoder-based background suppression, data augmentation strategies, and transfer learning. Experimental results collected across two environments with different layouts demonstrate that sigmoid-based amplitude weighting consistently achieves superior cross-environment performance, yielding 50.1% and 55.2% reductions in root-mean-square error (RMSE) and mean absolute error (MAE), respectively, compared with baseline methods. Data augmentation provides additional though modest benefits, with improvements up to 8.8% in MAE. By contrast, transfer learning proves indispensable for large spatial shifts, achieving 82.1% and 91.3% reductions in RMSE and MAE, respectively, with 540 target-domain samples. Taken together, these findings establish a highly practical direction for developing radar sensing systems capable of maintaining robust accuracy under spatial variations by integrating deep learning models with amplitude-based preprocessing and efficient transfer learning.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hassan Iftikhar, Rizwan Ahmad, Arunark Kolipaka
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
The Multimodal Direct Inversion (MMDI) algorithm is widely used in Magnetic Resonance Elastography (MRE) to estimate tissue shear stiffness. However, MMDI relies on the Helmholtz equation, which assumes wave propagation in a uniform, homogeneous, and infinite medium. Furthermore, the use of the Laplacian operator makes MMDI highly sensitive to noise, which compromises the accuracy and reliability of stiffness estimates. In this study, we propose the Deep-Learning driven Inversion Framework for Shear Modulus Estimation in MRE (DIME), aimed at enhancing the robustness of inversion. DIME is trained on the displacement fields-stiffness maps pair generated through Finite Element Modelling (FEM) simulations. To capture local wave behavior and improve robustness to global image variations, DIME is trained on small image patches. We first validated DIME using homogeneous and heterogeneous datasets simulated with FEM, where DIME produced stiffness maps with low inter-pixel variability, accurate boundary delineation, and higher correlation with ground truth (GT) compared to MMDI. Next, DIME was evaluated in a realistic anatomy-informed simulated liver dataset with known GT and compared directly to MMDI. DIME reproduced ground-truth stiffness patterns with high fidelity (r = 0.99, R^2 = 0.98), while MMDI showed greater underestimation. After validating DIME on synthetic data, we tested the model in in vivo liver MRE data from eight healthy and seven fibrotic subjects. DIME preserved physiologically consistent stiffness patterns and closely matched MMDI, which showed directional bias. Overall, DIME showed higher correlation with ground truth and visually similar stiffness patterns, whereas MMDI displayed a larger bias that can potentially be attributed to directional filtering. These preliminary results highlight the feasibility of DIME for clinical applications in MRE.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Min Lu, Hemant Ishwaran
ๆฅ็ๆ่ฆ
Out-of-distribution (OOD) detection is essential for determining when a supervised model encounters inputs that differ meaningfully from its training distribution. While widely studied in classification, OOD detection for regression and survival analysis remains limited due to the absence of discrete labels and the challenge of quantifying predictive uncertainty. We introduce a framework for OOD detection that is simultaneously model aware and subspace aware, and that embeds variable prioritization directly into the detection step. The method uses the fitted predictor to construct localized neighborhoods around each test case that emphasize the features driving the model's learned relationship and downweight directions that are less relevant to prediction. It produces OOD scores without relying on global distance metrics or estimating the full feature density. The framework is applicable across outcome types, and in our implementation we use random forests, where the rule structure yields transparent neighborhoods and effective scoring. Experiments on synthetic and real data benchmarks designed to isolate functional shifts show consistent improvements over existing methods. We further demonstrate the approach in an esophageal cancer survival study, where distribution shifts related to lymphadenectomy identify patterns relevant to surgical guidelines.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenjing lu, Zerui Tao, Dongping Zhang, Yuning Qiu, Yang Yang et al.
Multimodal
ๆฅ็ๆ่ฆ
CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Previous work of adversarial fine-tuning largely focuses on matching the predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and unreliable over-confidence. This overlooked phenomenon highlights a critical reliability gap beyond robustness. To bridge this gap, we propose a novel adversarial fine-tuning objective for CLIP considering both prediction accuracy and uncertainty alignments. By reparameterizing the output of CLIP as the concentration parameter of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and the magnitude of predictive confidence. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments on multiple zero-shot classification benchmarks demonstrate that our approach effectively restores calibrated uncertainty and achieves competitive adversarial robustness while maintaining clean accuracy.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Amin Jalal Aghdasian, Farzaneh Abdollahi, Ali Kamali Iglie
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper proposes two new algorithms for the lane keeping system (LKS) in autonomous vehicles (AVs) operating under snowy road conditions. These algorithms use deep reinforcement learning (DRL) to handle uncertainties and slippage. They include Action-Robust Recurrent Deep Deterministic Policy Gradient (AR-RDPG) and end-to-end Action-Robust convolutional neural network Attention Deterministic Policy Gradient (AR-CADPG), two action-robust approaches for decision-making. In the AR-RDPG method, within the perception layer, camera images are first denoised using multi-scale neural networks. Then, the centerline coefficients are extracted by a pre-trained deep convolutional neural network (DCNN). These coefficients, concatenated with the driving characteristics, are used as input to the control layer. The AR-CADPG method presents an end-to-end approach in which a convolutional neural network (CNN) and an attention mechanism are integrated within a DRL framework. Both methods are first trained in the CARLA simulator and validated under various snowy scenarios. Real-world experiments on a Jetson Nano-based autonomous vehicle confirm the feasibility and stability of the learned policies. Among the two models, the AR-CADPG approach demonstrates superior path-tracking accuracy and robustness, highlighting the effectiveness of combining temporal memory, adversarial resilience, and attention mechanisms in AVs.
๐
2025-12-15
๐ ArXiv Preprint (cs.CG)
๐ฅ Jiayin Lu, Ying Jiang, Yin Yang, Chenfanfu Jiang
Machine Learning
ๆฅ็ๆ่ฆ
We present VoroLight, a differentiable framework for 3D shape reconstruction based on Voronoi meshing. Our approach generates smooth, watertight surfaces and topologically consistent volumetric meshes directly from diverse inputs, including images, implicit shape level-set fields, point clouds and meshes. VoroLight operates in three stages: it first initializes a surface using a differentiable Voronoi formulation, then refines surface quality through a polygon-face sphere training stage, and finally reuses the differentiable Voronoi formulation for volumetric optimization with additional interior generator points. Project page: https://jiayinlu19960224.github.io/vorolight/
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jonathan Wenshรธj, Tong Chen, Bob Pepin, Raghavendra Selvan
Machine LearningRobotics
ๆฅ็ๆ่ฆ
While joint pruning--quantization is theoretically superior to sequential application, current joint methods rely on auxiliary procedures outside the training loop for finding compression parameters. This reliance adds engineering complexity and hyperparameter tuning, while also lacking a direct data-driven gradient signal, which might result in sub-optimal compression. In this paper, we introduce CoDeQ, a simple, fully differentiable method for joint pruning--quantization. Our approach builds on a key observation: the dead-zone of a scalar quantizer is equivalent to magnitude pruning, and can be used to induce sparsity directly within the quantization operator. Concretely, we parameterize the dead-zone width and learn it via backpropagation, alongside the quantization parameters. This design provides explicit control of sparsity, regularized by a single global hyperparameter, while decoupling sparsity selection from bit-width selection. The result is a method for Compression with Dead-zone Quantizer (CoDeQ) that supports both fixed-precision and mixed-precision quantization (controlled by an optional second hyperparameter). It simultaneously determines the sparsity pattern and quantization parameters in a single end-to-end optimization. Consequently, CoDeQ does not require any auxiliary procedures, making the method architecture-agnostic and straightforward to implement. On ImageNet with ResNet-18, CoDeQ reduces bit operations to ~5% while maintaining close to full precision accuracy in both fixed and mixed-precision regimes.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chunyu Zou
Machine Learning
ๆฅ็ๆ่ฆ
Cryogenic electron microscopy (Cryo-EM) has become an essential tool for capturing high-resolution biological structures. Despite its advantage in visualizations, the large storage size of Cryo-EM data file poses significant challenges for researchers and educators. This paper investigates the application of deep learning, specifically implicit neural representation (INR), to compress Cryo-EM biological data. The proposed approach first extracts the binary map of each file according to the density threshold. The density map is highly repetitive, ehich can be effectively compressed by GZIP. The neural network then trains to encode spatial density information, allowing the storage of network parameters and learnable latent vectors. To improve reconstruction accuracy, I further incorporate the positional encoding to enhance spatial representation and a weighted Mean Squared Error (MSE) loss function to balance density distribution variations. Using this approach, my aim is to provide a practical and efficient biological data compression solution that can be used for educational and research purpose, while maintaining a reasonable compression ratio and reconstruction quality from file to file.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nischal Subedi, Ember Kerstetter, Winnie Li, Silo Murphy
ๆฅ็ๆ่ฆ
Graph Convolutional Networks (GCNs) have become a standard approach for semi-supervised node classification, yet practitioners lack clear guidance on when GCNs provide meaningful improvements over simpler baselines. We present a diagnostic study using the Amazon Computers co-purchase data to understand when and why GCNs help. Through systematic experiments with simulated label scarcity, feature ablation, and per-class analysis, we find that GCN performance depends critically on the interaction between graph homophily and feature quality. GCNs provide the largest gains under extreme label scarcity, where they leverage neighborhood structure to compensate for limited supervision. Surprisingly, GCNs can match their original performance even when node features are replaced with random noise, suggesting that structure alone carries sufficient signal on highly homophilous graphs. However, GCNs hurt performance when homophily is low and features are already strong, as noisy neighbors corrupt good predictions. Our quadrant analysis reveals that GCNs help in three of four conditions and only hurt when low homophily meets strong features. These findings offer practical guidance for practitioners deciding whether to adopt graph-based methods.
๐
2025-12-15
๐ ArXiv Preprint (cs.CG)
๐ฅ Sebastien Tchitchek, Mohamed Kissi, Julien Tierny
Robotics
ๆฅ็ๆ่ฆ
We introduce the Continuous Edit Distance (CED), a geodesic and elastic distance for time-varying persistence diagrams (TVPDs). The CED extends edit-distance ideas to TVPDs by combining local substitution costs with penalized deletions/insertions, controlled by two parameters: \(ฮฑ\) (trade-off between temporal misalignment and diagram discrepancy) and \(ฮฒ\) (gap penalty). We also provide an explicit construction of CED-geodesics. Building on these ingredients, we present two practical barycenter solvers, one stochastic and one greedy, that monotonically decrease the CED Frechet energy. Empirically, the CED is robust to additive perturbations (both temporal and spatial), recovers temporal shifts, and supports temporal pattern search. On real-life datasets, the CED achieves clustering performance comparable or better than standard elastic dissimilarities, while our clustering based on CED-barycenters yields superior classification results. Overall, the CED equips TVPD analysis with a principled distance, interpretable geodesics, and practical barycenters, enabling alignment, comparison, averaging, and clustering directly in the space of TVPDs. A C++ implementation is provided for reproducibility at the following address https://github.com/sebastien-tchitchek/ContinuousEditDistance.
๐
2025-12-15
๐ AAAI 2026
๐ฅ Yifan Wu, Jiyue Jiang, Xichen Ye, Yiqi Wang, Chang Zhou et al.
ๆฅ็ๆ่ฆ
Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost, and builds upon it two simple yet effective selection strategies, namely Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We empirically validate our method on two representative BioFMs, RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent, demonstrating its effectiveness. Furthermore, we show the generalizability of our framework on protein-related tasks using ESM-C. In particular, our coreset even outperforms random subsets that are ten times larger in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuseon Choi, Sangjin Kim, Jungjun Oh, Byeongcheol Kim, Hoi-Jun Yoo
Natural Language Processing
ๆฅ็ๆ่ฆ
Low-bit quantization is a promising technique for efficient transformer inference by reducing computational and memory overhead. However, aggressive bitwidth reduction remains challenging due to activation outliers, leading to accuracy degradation. Existing methods, such as outlier-handling and group quantization, achieve high accuracy but incur substantial energy consumption. To address this, we propose SeVeDo, an energy-efficient SVD-based heterogeneous accelerator that structurally separates outlier-sensitive components into a high-precision low-rank path, while the remaining computations are executed in a low-bit residual datapath with group quantization. To further enhance efficiency, Hierarchical Group Quantization (HGQ) combines coarse-grained floating-point scaling with fine-grained shifting, effectively reducing dequantization cost. Also, SVD-guided mixed precision (SVD-MP) statically allocates higher bitwidths to precision-sensitive components identified through low-rank decomposition, thereby minimizing floating-point operation cost. Experimental results show that SeVeDo achieves a peak energy efficiency of 13.8TOPS/W, surpassing conventional designs, with 12.7TOPS/W on ViT-Base and 13.4TOPS/W on Llama2-7B benchmarks.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bangyu Li, Boping Gu, Ziyang Ding
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In modern financial markets, investors increasingly seek personalized and adaptive portfolio strategies that reflect their individual risk preferences and respond to dynamic market conditions. Traditional rule-based or static optimization approaches often fail to capture the nonlinear interactions among investor behavior, market volatility, and evolving financial objectives. To address these limitations, this paper introduces the LLM-based Personalized Portfolio Recommender , an integrated framework that combines Large Language Models, reinforcement learning, and individualized risk preference modeling to support intelligent investment decision-making.
๐
2025-12-15
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Shashie Dilhara Batan Arachchige, Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Dinusha Vatsalan, Dali Kaafar
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are often fine-tuned to adapt their general-purpose knowledge to specific tasks and domains such as cyber threat intelligence (CTI). Fine-tuning is mostly done through proprietary datasets that may contain sensitive information. Owners expect their fine-tuned model to not inadvertently leak this information to potentially adversarial end users. Using CTI as a use case, we demonstrate that data-extraction attacks can recover sensitive information from fine-tuned models on CTI reports, underscoring the need for mitigation. Retraining the full model to eliminate this leakage is computationally expensive and impractical. We propose an alternative approach, which we call privacy alignment, inspired by safety alignment in LLMs. Just like safety alignment teaches the model to abide by safety constraints through a few examples, we enforce privacy alignment through few-shot supervision, integrating a privacy classifier and a privacy redactor, both handled by the same underlying LLM. We evaluate our system, called CTIGuardian, using GPT-4o mini and Mistral-7B Instruct models, benchmarking against Presidio, a named entity recognition (NER) baseline. Results show that CTIGuardian provides a better privacy-utility trade-off than NER based models. While we demonstrate its effectiveness on a CTI use case, the framework is generic enough to be applicable to other sensitive domains.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Kohei Nishikawa, Koki Shimizu, Hashiguchi Hiroki
Machine Learning
ๆฅ็ๆ่ฆ
This study evaluates thresholds for removing singular values from singular value decomposition-based low-rank approximations of deep neural network weight matrices. Each weight matrix is modeled as the sum of signal and noise matrices. The low-rank approximation is obtained by removing noise-related singular values using a threshold based on random matrix theory. To assess the adequacy of this threshold, we propose an evaluation metric based on the cosine similarity between the singular vectors of the signal and original weight matrices. The proposed metric is used in numerical experiments to compare two threshold estimation methods.
๐
2025-12-15
๐ Journal of Advances in Information Technology
๐ฅ Parthasarathy Nadarajan, Michael Botsch, Sebastian Sardina
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper introduces a novel machine learning architecture for an efficient estimation of the probabilistic space-time representation of complex traffic scenarios. A detailed representation of the future traffic scenario is of significant importance for autonomous driving and for all active safety systems. In order to predict the future space-time representation of the traffic scenario, first the type of traffic scenario is identified and then the machine learning algorithm maps the current state of the scenario to possible future states. The input to the machine learning algorithms is the current state representation of a traffic scenario, termed as the Augmented Occupancy Grid (AOG). The output is the probabilistic space-time representation which includes uncertainties regarding the behaviour of the traffic participants and is termed as the Predicted Occupancy Grid (POG). The novel architecture consists of two Stacked Denoising Autoencoders (SDAs) and a set of Random Forests. It is then compared with the other two existing architectures that comprise of SDAs and DeconvNet. The architectures are validated with the help of simulations and the comparisons are made both in terms of accuracy and computational time. Also, a brief overview on the applications of POGs in the field of active safety is presented.
๐
2025-12-15
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ John E. Ortega, Dhruv D. Joshi, Matt P. Borkowski
Natural Language Processing
ๆฅ็ๆ่ฆ
Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is ex pected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks scoring about 2 points better than previous models not based on prompting.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Piyush Bagad, Andrew Zisserman
Multimodal
ๆฅ็ๆ่ฆ
Our objective is to build a general time-aware video-text embedding model for retrieval. To that end, we propose a simple and efficient recipe, dubbed TARA (Time Aware Retrieval Adaptation), to adapt Multimodal LLMs (MLLMs) to a time-aware video-text embedding model without using any video data at all. For evaluating time-awareness in retrieval, we propose a new benchmark with temporally opposite (chiral) actions as hard negatives and curated splits for chiral and non-chiral actions. We show that TARA outperforms all existing video-text models on this chiral benchmark while also achieving strong results on standard benchmarks. Furthermore, we discover additional benefits of TARA beyond time-awareness: (i) TARA embeddings are negation-aware as shown in NegBench benchmark that evaluates negation in video retrieval, (ii) TARA achieves state of the art performance on verb and adverb understanding in videos. Overall, TARA yields a strong, versatile, time-aware video-text embedding model with state of the art zero-shot performance.
๐
2025-12-15
๐ WSDM 2026
๐ฅ Chaohua Yang, Dugang Liu, Shiwei Li, Yuwen Fu, Xing Tang et al.
ๆฅ็ๆ่ฆ
Multi-scenario multi-task recommendation (MSMTR) systems must address recommendation demands across diverse scenarios while simultaneously optimizing multiple objectives, such as click-through rate and conversion rate. Existing MSMTR models typically consist of four information units: scenario-shared, scenario-specific, task-shared, and task-specific networks. These units interact to generate four types of relationship information flows, directed from scenario-shared or scenario-specific networks to task-shared or task-specific networks. However, these models face two main limitations: 1) They often rely on complex architectures, such as mixture-of-experts (MoE) networks, which increase the complexity of information fusion, model size, and training cost. 2) They extract all available information flows without filtering out irrelevant or even harmful content, introducing potential noise. Regarding these challenges, we propose a lightweight Automated Information Flow Selection (AutoIFS) framework for MSMTR. To tackle the first issue, AutoIFS incorporates low-rank adaptation (LoRA) to decouple the four information units, enabling more flexible and efficient information fusion with minimal parameter overhead. To address the second issue, AutoIFS introduces an information flow selection network that automatically filters out invalid scenario-task information flows based on model performance feedback. It employs a simple yet effective pruning function to eliminate useless information flows, thereby enhancing the impact of key relationships and improving model performance. Finally, we evaluate AutoIFS and confirm its effectiveness through extensive experiments on two public benchmark datasets and an online A/B test.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mengyang Ma, Xiaopeng Li, Wanyu Wang, Zhaocheng Du, Jingtong Gao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer structures have been widely used in sequential recommender systems (SRS). However, as user interaction histories increase, computational time and memory requirements also grow. This is mainly caused by the standard attention mechanism. Although there exist many methods employing efficient attention and SSM-based models, these approaches struggle to effectively model long sequences and may exhibit unstable performance on short sequences. To address these challenges, we design a sparse attention mechanism, BlossomRec, which models both long-term and short-term user interests through attention computation to achieve stable performance across sequences of varying lengths. Specifically, we categorize user interests in recommendation systems into long-term and short-term interests, and compute them using two distinct sparse attention patterns, with the results combined through a learnable gated output. Theoretically, it significantly reduces the number of interactions participating in attention computation. Extensive experiments on four public datasets demonstrate that BlossomRec, when integrated with state-of-the-art Transformer-based models, achieves comparable or even superior performance while significantly reducing memory usage, providing strong evidence of BlossomRec's efficiency and effectiveness.The code is available at https://github.com/ronineume/BlossomRec.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qinglin Jia, Zhaocheng Du, Chuhan Wu, Huifeng Guo, Ruiming Tang et al.
ๆฅ็ๆ่ฆ
In most real-world online advertising systems, advertisers typically have diverse customer acquisition goals. A common solution is to use multi-task learning (MTL) to train a unified model on post-click data to estimate the conversion rate (CVR) for these diverse targets. In practice, CVR prediction often encounters missing conversion data as many advertisers submit only a subset of user conversion actions due to privacy or other constraints, making the labels of multi-task data incomplete. If the model is trained on all available samples where advertisers submit user conversion actions, it may struggle when deployed to serve a subset of advertisers targeting specific conversion actions, as the training and deployment data distributions are mismatched. While considerable MTL efforts have been made, a long-standing challenge is how to effectively train a unified model with the incomplete and skewed multi-label data. In this paper, we propose a fine-grained Knowledge transfer framework for Asymmetric Multi-Label data (KAML). We introduce an attribution-driven masking strategy (ADM) to better utilize data with asymmetric multi-label data in training. However, the more relaxed masking in ADM is a double-edged sword: it provides additional training signals but also introduces noise due to skewed data. To address this, we propose a hierarchical knowledge extraction mechanism (HKE) to model the sample discrepancy within the target task tower. Finally, to maximize the utility of unlabeled samples, we incorporate ranking loss strategy to further enhance our model. The effectiveness of KAML has been demonstrated through comprehensive evaluations on offline industry datasets and online A/B tests, which show significant performance improvements over existing MTL baselines.
๐
2025-12-15
๐ ArXiv Preprint (Machine Learning)
๐ฅ Arnab Sharma
ๆฅ็ๆ่ฆ
Neural Encoders are frequently used in the NLP domain to perform dense retrieval tasks, for instance, to generate the candidate documents for a given query in question-answering tasks. However, sparse annotation and label noise in the training data make it challenging to train or fine-tune such retrieval models. Although existing works have attempted to mitigate these problems by incorporating modified loss functions or data cleaning, these approaches either require some hyperparameters to tune during training or add substantial complexity to the training setup. In this work, we consider a label weakening approach to generate robust retrieval models in the presence of label noise. Instead of enforcing a single, potentially erroneous label for each query document pair, we allow for a set of plausible labels derived from both the observed supervision and the model's confidence scores. We perform an extensive evaluation considering two retrieval models, one re-ranking model, considering four diverse ranking datasets. To this end, we also consider a realistic noisy setting by using a semantic-aware noise generation technique to generate different ratios of noise. Our initial results show that label weakening can improve the performance of the retrieval tasks in comparison to 10 different state-of-the-art loss functions.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Ivica Kostric, Ujwal Gadiraju, Krisztian Balog
ๆฅ็ๆ่ฆ
The ideal conversational recommender system (CRS) acts like a savvy salesperson, adapting its language and suggestions to each user's level of expertise. However, most current systems treat all users as experts, leading to frustrating and inefficient interactions when users are unfamiliar with a domain. Systems that can adapt their conversational strategies to a user's knowledge level stand to offer a much more natural and effective experience. To make a step toward such adaptive systems, we introduce a new task: estimating user domain knowledge from conversations, enabling a CRS to better understand user needs and personalize interactions. A key obstacle to developing such adaptive systems is the lack of suitable data; to our knowledge, no existing dataset captures the conversational behaviors of users with varying levels of domain knowledge. Furthermore, in most dialogue collection protocols, users are free to express their own preferences, which tends to concentrate on popular items and well-known features, offering little insight into how novices explore or learn about unfamiliar features. To address this, we design a game-based data collection protocol that elicits varied expressions of knowledge, release the resulting dataset, and provide an initial analysis to highlight its potential for future work on user-knowledge-aware CRS.
๐
2025-12-15
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce a finance & accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows -- interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management.
We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work.
We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max, and GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mabiao Long, Jiaxi Liu, Yufeng Li, Hao Xiong, Junchi Yan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Deploying dynamic heterogeneous graph embeddings in production faces key challenges of scalability, data freshness, and cold-start. This paper introduces a practical, two-stage solution that balances deep graph representation with low-latency incremental updates. Our framework combines HetSGFormer, a scalable graph transformer for static learning, with Incremental Locally Linear Embedding (ILLE), a lightweight, CPU-based algorithm for real-time updates. HetSGFormer captures global structure with linear scalability, while ILLE provides rapid, targeted updates to incorporate new data, thus avoiding costly full retraining. This dual approach is cold-start resilient, leveraging the graph to create meaningful embeddings from sparse data. On billion-scale graphs, A/B tests show HetSGFormer achieved up to a 6.11% lift in Advertiser Value over previous methods, while the ILLE module added another 3.22% lift and improved embedding refresh timeliness by 83.2%. Our work provides a validated framework for deploying dynamic graph learning in production environments.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Huimu Wang, Yiming Qiu, Xingzhi Yao, Zhiguo Chen, Guoyu Tang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Dense retrieval has become the industry standard in large-scale information retrieval systems due to its high efficiency and competitive accuracy. Its core relies on a coarse-to-fine hierarchical architecture that enables rapid candidate selection and precise semantic matching, achieving millisecond-level response over billion-scale corpora. This capability makes it essential not only in traditional search and recommendation scenarios but also in the emerging paradigm of generative recommendation driven by large language models, where semantic IDs-themselves a form of coarse-to-fine representation-play a foundational role. However, the widely adopted dual-tower encoding architecture introduces inherent challenges, primarily representational space misalignment and retrieval index inconsistency, which degrade matching accuracy, retrieval stability, and performance on long-tail queries. These issues are further magnified in semantic ID generation, ultimately limiting the performance ceiling of downstream generative models.
To address these challenges, this paper proposes a simple and effective framework named SCI comprising two synergistic modules: a symmetric representation alignment module that employs an innovative input-swapping mechanism to unify the dual-tower representation space without adding parameters, and an consistent indexing with dual-tower synergy module that redesigns retrieval paths using a dual-view indexing strategy to maintain consistency from training to inference. The framework is systematic, lightweight, and engineering-friendly, requiring minimal overhead while fully supporting billion-scale deployment. We provide theoretical guarantees for our approach, with its effectiveness validated by results across public datasets and real-world e-commerce datasets.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Taoran Sheng, Sathappan Muthiah, Atiq Islam, Jinming Feng
ๆฅ็ๆ่ฆ
In e-commerce shopping, aligning search results with a buyer's immediate needs and preferences presents a significant challenge, particularly in adapting search results throughout the buyer's shopping journey as they move from the initial stages of browsing to making a purchase decision or shift from one intent to another. This study presents a systematic approach to adapting e-commerce search results based on the current context. We start with basic methods and incrementally incorporate more contextual information and state-of-the-art techniques to improve the search outcomes. By applying this evolving contextual framework to items displayed on the search engine results page (SERP), we progressively align search outcomes more closely with the buyer's interests and current search intentions. Our findings demonstrate that this incremental enhancement, from simple heuristic autoregressive features to advanced sequence models, significantly improves ranker performance. The integration of contextual techniques enhances the performance of our production ranker, leading to improved search results in both offline and online A/B testing in terms of Mean Reciprocal Rank (MRR). Overall, the paper details iterative methodologies and their substantial contributions to search result contextualization on e-commerce platforms.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Genki Kusano, Kenya Abe, Kunihiro Takeoka
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Recommender systems usually rely on large-scale interaction data to learn from users' past behaviors and make accurate predictions. However, real-world applications often face situations where no training data is available, such as when launching new services or handling entirely new users. In such cases, conventional approaches cannot be applied. This study focuses on training-free recommendation, where no task-specific training is performed, and particularly on \textit{training-free cold-start recommendation} (TFCSR), the more challenging case where the target user has no interactions. Large language models (LLMs) have recently been explored as a promising solution, and numerous studies have been proposed. As the ability of text embedding models (TEMs) increases, they are increasingly recognized as applicable to training-free recommendation, but no prior work has directly compared LLMs and TEMs under identical conditions. We present the first controlled experiments that systematically evaluate these two approaches in the same setting. The results show that TEMs outperform LLM rerankers, and this trend holds not only in cold-start settings but also in warm-start settings with rich interactions. These findings indicate that direct LLM ranking is not the only viable option, contrary to the commonly shared belief, and TEM-based approaches provide a stronger and more scalable basis for training-free recommendation.
๐
2025-12-15
๐ SIGMOD 2026
๐ฅ Tingyang Chen, Cong Fu, Jiahua Wu, Haotian Wu, Hua Fan et al.
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Vector Similarity Search (VSS) in high-dimensional spaces is rapidly emerging as core functionality in next-generation database systems for numerous data-intensive services -- from embedding lookups in large language models (LLMs), to semantic information retrieval and recommendation engines. Current benchmarks, however, evaluate VSS primarily on the recall-latency trade-off against a ground truth defined solely by distance metrics, neglecting how retrieval quality ultimately impacts downstream tasks. This disconnect can mislead both academic research and industrial practice.
We present Iceberg, a holistic benchmark suite for end-to-end evaluation of VSS methods in realistic application contexts. From a task-centric view, Iceberg uncovers the Information Loss Funnel, which identifies three principal sources of end-to-end performance degradation: (1) Embedding Loss during feature extraction; (2) Metric Misuse, where distances poorly reflect task relevance; (3) Data Distribution Sensitivity, highlighting index robustness across skews and modalities. For a more comprehensive assessment, Iceberg spans eight diverse datasets across key domains such as image classification, face recognition, text retrieval, and recommendation systems. Each dataset, ranging from 1M to 100M vectors, includes rich, task-specific labels and evaluation metrics, enabling assessment of retrieval algorithms within the full application pipeline rather than in isolation. Iceberg benchmarks 13 state-of-the-art VSS methods and re-ranks them based on application-level metrics, revealing substantial deviations from traditional rankings derived purely from recall-latency evaluations. Building on these insights, we define a set of task-centric meta-features and derive an interpretable decision tree to guide practitioners in selecting and tuning VSS methods for their specific workloads.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Chee Heng Tan, Huiying Zheng, Jing Wang, Zhuoyi Lin, Shaodi Feng et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
With the advent of large language models (LLMs), the landscape of recommender systems is undergoing a significant transformation. Traditionally, user reviews have served as a critical source of rich, contextual information for enhancing recommendation quality. However, as LLMs demonstrate an unprecedented ability to understand and generate human-like text, this raises the question of whether explicit user reviews remain essential in the era of LLMs. In this paper, we provide a systematic investigation of the evolving role of text reviews in recommendation by comparing deep learning methods and LLM approaches. Particularly, we conduct extensive experiments on eight public datasets with LLMs and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios. We further introduce a benchmarking evaluation framework for review-aware recommender systems, RAREval, to comprehensively assess the contribution of textual reviews to the recommendation performance of review-aware recommender systems. Our framework examines various scenarios, including the removal of some or all textual reviews, random distortion, as well as recommendation performance in data sparsity and cold-start user settings. Our findings demonstrate that LLMs are capable of functioning as effective review-aware recommendation engines, generally outperforming traditional deep learning approaches, particularly in scenarios characterized by data sparsity and cold-start conditions. In addition, the removal of some or all textual reviews and random distortion does not necessarily lead to declines in recommendation accuracy. These findings motivate a rethinking of how user preference from text reviews can be more effectively leveraged. All code and supplementary materials are available at: https://github.com/zhytk/RAREval-data-processing.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yupeng Li, Mingyue Cheng, Yucong Luo, Yitong Zhou, Qingyang Mao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Multi-behavior sequential recommendation aims to capture users' dynamic interests by modeling diverse types of user interactions over time. Although several studies have explored this setting, the recommendation performance remains suboptimal, mainly due to two fundamental challenges: the heterogeneity of user behaviors and data sparsity. To address these challenges, we propose BLADE, a framework that enhances multi-behavior modeling while mitigating data sparsity. Specifically, to handle behavior heterogeneity, we introduce a dual item-behavior fusion architecture that incorporates behavior information at both the input and intermediate levels, enabling preference modeling from multiple perspectives. To mitigate data sparsity, we design three behavior-level data augmentation methods that operate directly on behavior sequences rather than core item sequences. These methods generate diverse augmented views while preserving the semantic consistency of item sequences. These augmented views further enhance representation learning and generalization via contrastive learning. Experiments on three real-world datasets demonstrate the effectiveness of our approach.
๐
2025-12-15
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Duy A. Nguyen, Hai H. Do, Minh Doan, Minh N. Do
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The ability to extract value from historical data is essential for enterprise decision-making. However, much of this information remains inaccessible within large legacy file systems that lack structured organization and semantic indexing, making retrieval and analysis inefficient and error-prone. We introduce SPAR (Session-based Pipeline for Adaptive Retrieval), a conceptual framework that integrates Large Language Models (LLMs) into a Retrieval-Augmented Generation (RAG) architecture specifically designed for legacy enterprise environments. Unlike conventional RAG pipelines, which require costly construction and maintenance of full-scale vector databases that mirror the entire file system, SPAR employs a lightweight two-stage process: a semantic Metadata Index is first created, after which session-specific vector databases are dynamically generated on demand. This design reduces computational overhead while improving transparency, controllability, and relevance in retrieval. We provide a theoretical complexity analysis comparing SPAR with standard LLM-based RAG pipelines, demonstrating its computational advantages. To validate the framework, we apply SPAR to a synthesized enterprise-scale file system containing a large corpus of biomedical literature, showing improvements in both retrieval effectiveness and downstream model accuracy. Finally, we discuss design trade-offs and outline open challenges for deploying SPAR across diverse enterprise settings.
๐
2025-12-15
๐ AAAI
๐ฅ Toan Le Ngo Thanh, Phat Ha Huu, Tan Nguyen Dang Duy, Thong Nguyen Le Minh, Anh Nguyen Nhu Tinh
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail across cross modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval, refined by BLIP-2 based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, Agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive moment search capabilities.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Licheng Luo, Yu Xia, Kaier Liang, Mingyu Cai
Robotics
ๆฅ็ๆ่ฆ
Spatio-Temporal Logic (SpaTiaL) offers a principled formalism for expressing geometric spatial requirements-an essential component of robotic manipulation, where object locations, neighborhood relations, pose constraints, and interactions directly determine task success. Yet prior works have largely relied on standard temporal logic (TL), which models only robot trajectories and overlooks object-level interactions. Existing datasets built from randomly generated TL formulas paired with natural-language descriptions therefore cover temporal operators but fail to represent the layered spatial relations that manipulation tasks depend on. To address this gap, we introduce a dataset generation framework that synthesizes SpaTiaL specifications and converts them into natural-language descriptions through a deterministic, semantics-preserving back-translation procedure. This pipeline produces the NL2SpaTiaL dataset, aligning natural language with multi-level spatial relations and temporal objectives to reflect the compositional structure of manipulation tasks. Building on this foundation, we propose a translation-verification framework equipped with a language-based semantic checker that ensures the generated SpaTiaL formulas faithfully encode the semantics specified by the input description. Experiments across a suite of manipulation tasks show that SpaTiaL-based representations yield more interpretable, verifiable, and compositional grounding for instruction following. Project website: https://sites.google.com/view/nl2spatial
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou et al.
Robotics
ๆฅ็ๆ่ฆ
Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie et al.
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. The one LLM serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Li-Wei Shih, Ruo-Syuan Mei, Jesse Heidrich, Hui-Ping Wang, Joel Hooton et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Near-field perception is essential for the safe operation of autonomous mobile robots (AMRs) in manufacturing environments. Conventional ranging sensors such as light detection and ranging (LiDAR) and ultrasonic devices provide broad situational awareness but often fail to detect small objects near the robot base. To address this limitation, this paper presents a three-tier near-field perception framework. The first approach employs light-discontinuity detection, which projects a laser stripe across the near-field zone and identifies interruptions in the stripe to perform fast, binary cutoff sensing for obstacle presence. The second approach utilizes light-displacement measurement to estimate object height by analyzing the geometric displacement of a projected stripe in the camera image, which provides quantitative obstacle height information with minimal computational overhead. The third approach employs a computer vision-based object detection model on embedded AI hardware to classify objects, enabling semantic perception and context-aware safety decisions. All methods are implemented on a Raspberry Pi 5 system, achieving real-time performance at 25 or 50 frames per second. Experimental evaluation and comparative analysis demonstrate that the proposed hierarchy balances precision, computation, and cost, thereby providing a scalable perception solution for enabling safe operations of AMRs in manufacturing environments.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Aman Arora, Matteo El-Hariry, Miguel Olivares-Mendez
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Autonomous free-flyers play a critical role in intravehicular tasks aboard the International Space Station (ISS), where their precise docking under sensing noise, small actuation mismatches, and environmental variability remains a nontrivial challenge. This work presents a reinforcement learning (RL) framework for six-degree-of-freedom (6-DoF) docking of JAXA's Int-Ball2 robot inside a high-fidelity Isaac Sim model of the Japanese Experiment Module (JEM). Using Proximal Policy Optimization (PPO), we train and evaluate controllers under domain-randomized dynamics and bounded observation noise, while explicitly modeling propeller drag-torque effects and polarity structure. This enables a controlled study of how Int-Ball2's propulsion physics influence RL-based docking performance in constrained microgravity interiors. The learned policy achieves stable and reliable docking across varied conditions and lays the groundwork for future extensions pertaining to Int-Ball2 in collision-aware navigation, safe RL, propulsion-accurate sim-to-real transfer, and vision-based end-to-end docking.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Timothy A. Brumfiel, Revanth Konda, Drew Elliott, Jaydev P. Desai
Robotics
ๆฅ็ๆ่ฆ
To address the issues that arise due to the manual navigation of guidewires in endovascular interventions, research in medical robotics has taken a strong interest in developing robotically steerable guidewires, which offer the possibility of enhanced maneuverability and navigation, as the tip of the guidewire can be actively steered. The COaxially Aligned STeerable (COAST) guidewire robot has the ability to generate a wide variety of motions including bending motion with different bending lengths, follow-the-leader motion, and feedforward motion. In our past studies, we have explored different designs of the COAST guidewire robot and developed modeling, control, and sensing strategies for the COAST guidewire robot. In this study, the performance of a modified COAST guidewire robot is evaluated by conducting navigation experiments in an anatomical phantom model with pulsatile flow. The modified COAST guidewire robot is a simplified version of the COAST guidewire robot and consists of two tubes as opposed to three tubes. Through this study, we demonstrate the effectiveness of the modified COAST guidewire robot in navigating the tortuous phantom vasculature.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Chuan Mao, Haoqi Yuan, Ziye Huang, Chaoyi Xu, Kai Ma et al.
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) has achieved great success in dexterous grasping, significantly improving grasp performance and generalization from simulation to the real world. However, fine-grained functional grasping, which is essential for downstream manipulation tasks, remains underexplored and faces several challenges: the complexity of specifying goals and reward functions for functional grasps across diverse objects, the difficulty of multi-task RL exploration, and the challenge of sim-to-real transfer. In this work, we propose DemoFunGrasp for universal dexterous functional grasping. We factorize functional grasping conditions into two complementary components - grasping style and affordance - and integrate them into an RL framework that can learn to grasp any object with any functional grasping condition. To address the multi-task optimization challenge, we leverage a single grasping demonstration and reformulate the RL problem as one-step demonstration editing, substantially enhancing sample efficiency and performance. Experimental results in both simulation and the real world show that DemoFunGrasp generalizes to unseen combinations of objects, affordances, and grasping styles, outperforming baselines in both success rate and functional grasping accuracy. In addition to strong sim-to-real capability, by incorporating a vision-language model (VLM) for planning, our system achieves autonomous instruction-following grasp execution.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Sรผmer Tunรงay, Alain Andres, Ignacio Carlucho
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Autonomous Underwater Vehicles (AUVs) require reliable six-degree-of-freedom (6-DOF) position control to operate effectively in complex and dynamic marine environments. Traditional controllers are effective under nominal conditions but exhibit degraded performance when faced with unmodeled dynamics or environmental disturbances. Reinforcement learning (RL) provides a powerful alternative but training is typically slow and sim-to-real transfer remains challenging. This work introduces a GPU-accelerated RL training pipeline built in JAX and MuJoCo-XLA (MJX). By jointly JIT-compiling large-scale parallel physics simulation and learning updates, we achieve training times of under two minutes.Through systematic evaluation of multiple RL algorithms, we show robust 6-DOF trajectory tracking and effective disturbance rejection in real underwater experiments, with policies transferred zero-shot from simulation. Our results provide the first explicit real-world demonstration of RL-based AUV position control across all six degrees of freedom.
๐
2025-12-15
๐ This is the Author Accepted Manuscript version of
๐ฅ Zeyad Gamal, Youssef Mahran, Ayman El-Badawy
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Sait Sovukluk, Johannes Englsberger, Christian Ott
Robotics
ๆฅ็ๆ่ฆ
This study proposes a step adaptation framework for running through spring-mass trajectories and deadbeat control gain libraries. It includes four main parts: (1) Automatic spring-mass trajectory library generation; (2) Deadbeat control gain library generation through an actively controlled template model that resembles the whole-body dynamics well; (3) Trajectory selection policy development for step adaptation; (4) Mapping spring-mass trajectories to a humanoid model through a whole-body control (WBC) framework also accounting for closed-kinematic chain systems, self collisions, and reactive limb swinging. We show the inclusiveness and the robustness of the proposed framework through various challenging and agile behaviors such as running through randomly generated stepping stones, jumping over random obstacles, performing slalom motions, changing the running direction suddenly with a random leg, and rejecting significant disturbances and uncertainties through the MuJoCo physics simulator. We also perform additional simulations under a comprehensive set of uncertainties and noise to better justify the proposed method's robustness to real-world challenges, including signal noise, imprecision, modeling errors, and delays. All the aforementioned behaviors are performed with a single library and the same set of WBC control parameters without additional tuning. The spring-mass and the deadbeat control gain library are automatically computed in 4.5 seconds in total for 315 different trajectories.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Hao Fua, Wei Liu, Shuai Zhoua
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper investigates the application of reinforcement learning (RL) to multi-robot social formation navigation, a critical capability for enabling seamless human-robot coexistence. While RL offers a promising paradigm, the inherent unpredictability and often uncooperative dynamics of pedestrian behavior pose substantial challenges, particularly concerning the efficiency of coordinated exploration among robots. To address this, we propose a novel coordinated-exploration multi-robot RL algorithm introducing an intrinsic motivation exploration. Its core component is a self-learning intrinsic reward mechanism designed to collectively alleviate policy conservatism. Moreover, this algorithm incorporates a dual-sampling mode within the centralized training and decentralized execution framework to enhance the representation of both the navigation policy and the intrinsic reward, leveraging a two-time-scale update rule to decouple parameter updates. Empirical results on social formation navigation benchmarks demonstrate the proposed algorithm's superior performance over existing state-of-the-art methods across crucial metrics. Our code and video demos are available at: https://github.com/czxhunzi/CEMRRL.
๐
2025-12-15
๐ ArXiv Preprint (cs.SD)
๐ฅ Adheep Arya G R, Vaibhav Pratap Singh, Mayank Kumar, Niyathi Shenoy, Tejas Suryawanshi et al.
Robotics
ๆฅ็ๆ่ฆ
This paper describes an automatic bird call recording system called SAMAY, which is developed to study bird species by creating a database of large amounts of bird acoustic data. By analysing the recorded bird call data, the system can also be used for automatic classification of bird species, monitoring bird populations and analysing the impact of environmental changes. The system is driven through a powerful STM32F407 series microcontroller, supports 4 microphones, is equipped with 128 GB of storage capacity, and is powered by a 10400 mAh battery pack interfaced with a solar charger. In addition, the device is user-configurable over USB and Wi-Fi during runtime, ensuring user-friendly operation during field deployment.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Fangju Yang, Hang Yang, Ibrahim Alsarraj, Yuhao Wang, Ke Wu
Robotics
ๆฅ็ๆ่ฆ
Cable-driven continuum robots (CDCRs) require accurate, real-time dynamic models for high-speed dynamics prediction or model-based control, making such capability an urgent need. In this paper, we propose the Lightweight Actuation-Space Energy Modeling (LASEM) framework for CDCRs, which formulates actuation potential energy directly in actuation space to enable lightweight yet accurate dynamic modeling. Through a unified variational derivation, the governing dynamics reduce to a single partial differential equation (PDE), requiring only the Euler moment balance while implicitly incorporating the Newton force balance. By also avoiding explicit computation of cable-backbone contact forces, the formulation simplifies the model structure and improves computational efficiency while preserving geometric accuracy and physical consistency. Importantly, the proposed framework for dynamic modeling natively supports both force-input and displacement-input actuation modes, a capability seldom achieved in existing dynamic formulations. Leveraging this lightweight structure, a Galerkin space-time modal discretization with analytical time-domain derivatives of the reduced state further enables an average 62.3% computational speedup over state-of-the-art real-time dynamic modeling approaches.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Hyunki Seong, Jeong-Kyun Lee, Heesoo Myeong, Yongho Shin, Hyun-Mook Cho et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Our Warm-K method-based test-time scaling enhances behavioral consistency and reactivity at test time without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Christoph Hartmann, Marios Demetriades, Kevin Prรผfer, Zichen Zhang, Klaus Spindler et al.
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents PyCAALP (Python-based Computer-Aided Assembly Line Planning), a framework for automated Assembly Sequence Planning (ASP) and Production Line Planning (PLP), employing a graph-based approach to model components and joints within production modules. The framework integrates kinematic boundary conditions, such as potential part collisions, to guarantee the feasibility of automated assembly planning. The developed algorithm computes all feasible production sequences, integrating modules for detecting spatial relationships and formulating geometric constraints. The algorithm incorporates additional attributes, including handling feasibility, tolerance matching, and joint compatibility, to manage the high combinatorial complexity inherent in assembly sequence generation. Heuristics, such as Single-Piece Flow assembly and geometrical constraint enforcement, are utilized to further refine the solution space, facilitating more efficient planning for complex assemblies. The PLP stage is formulated as a Mixed-Integer Program (MIP), balancing the total times of a fixed number of manufacturing stations. While some complexity reduction techniques may sacrifice optimality, they significantly reduce the MIPs computational time. Furthermore, the framework enables customization of engineering constraints and supports a flexible trade-off between ASP and PLP. The open-source nature of the framework, available at https://github.com/TUM-utg/PyCAALP, promotes further collaboration and adoption in both industrial and production research applications.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Yinsong Qu, Yunxiang Li, Shanlin Zhong
Robotics
ๆฅ็ๆ่ฆ
Autonomous Mobile Robots (AMRs) have become indispensable in industrial applications due to their operational flexibility and efficiency. Navigation serves as a crucial technical foundation for accomplishing complex tasks. However, navigating AMRs in dense, cluttered, and semi-structured environments remains challenging, primarily due to nonholonomic vehicle dynamics, interactions with mixed static/dynamic obstacles, and the non-convex constrained nature of such operational spaces. To solve these problems, this paper proposes an Improved Sequential Model Predictive Control (ISMPC) navigation framework that systematically reformulates navigation tasks as sequential switched optimal control problems. The framework addresses the aforementioned challenges through two key innovations: 1) Implementation of a Multi-Directional Safety Rectangular Corridor (MDSRC) algorithm, which encodes the free space through rectangular convex regions to avoid collision with static obstacles, eliminating redundant computational burdens and accelerating solver convergence; 2) A sequential MPC navigation framework that integrates corridor constraints with barrier function constraints is proposed to achieve static and dynamic obstacle avoidance. The ISMPC navigation framework enables direct velocity generation for AMRs, simplifying traditional navigation algorithm architectures. Comparative experiments demonstrate the framework's superiority in free-space utilization ( an increase of 41.05$\%$ in the average corridor area) while maintaining real-time computational performance (average corridors generation latency of 3 ms).
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Diego Bolliger, Gabriele Fadini, Markus Bambach, Alisa Rupenyan
Robotics
ๆฅ็ๆ่ฆ
Controlling the deformation of flexible objects is challenging due to their non-linear dynamics and high-dimensional configuration space. This work presents a differentiable Material Point Method (MPM) simulator targeted at control applications. We exploit the differentiability of the simulator to optimize a control trajectory in an active damping problem for a hyperelastic rope. The simulator effectively minimizes the kinetic energy of the rope around 2$\times$ faster than a baseline MPPI method and to a 20% lower energy level, while using about 3% of the computation time.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Hyun-Gi Lee, Jaekyeong Han, Minjun Kwon, Hyeonuk Kwon, Jooha Park et al.
ๆฅ็ๆ่ฆ
As battery technologies advance toward higher stability and energy density, the need for extensive cell-level testing across various component configurations becomes critical. To evaluate performance and understand the operating principles of batteries in laboratory scale, fabrication and evaluation of coin cells are essential processes. However, the conventional coin-cell assembly and testing processes require significant time and labor from researchers, posing challenges to high-throughput screening research. In this study, we introduce an Automated Li-ion BAttery Testing RObot SyStem (ALBATROSS), an automated system capable of electrolyte formulation, coin-cell assembly, and electrochemical evaluation. The system, integrated within a argon-filled glovebox, enables fully automated assembly and testing of up to 48 cells without researcher intervention. By incorporating custom-designed robot gripper and 3D-printed structures optimized for precise cell handling, ALBATROSS achieved high assembly reliability, yielding a relative standard deviation (RSD) of less than 1.2% in discharge capacity and a standard deviation of less than 3 ฮฉ in EIS measurements for NCM811||Li half cells. Owing to its high reliability and automation capability, ALBATROSS allows for the acquisition of high-quality coin-cell datasets, which are expected to accelerate the development of next-generation electrolytes.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Alfredo Gonzรกlez-Calvin, Juan F. Jimรฉnez, Hรฉctor Garcรญa de Marina
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Most path following and trajectory tracking algorithms in mobile robotics require the desired path or trajectory to be defined by at least twice continuously differentiable functions to guarantee key properties such as global convergence, especially for nonholonomic robots like unicycles with speed constraints. Consequently, these algorithms typically exclude continuous but non-differentiable paths, such as piecewise functions. Despite this exclusion, such paths provide convenient high-level inputs for describing robot missions or behavior. While techniques such as spline interpolation or optimization-based methods are commonly used to smooth non-differentiable paths or create feasible ones from sequences of waypoints, they either can produce unnecessarily complex trajectories or are computationally expensive. In this work, we present a method to regularize non-differentiable functions and generate feasible paths through mollification. Specifically, we approximate an arbitrary path with a differentiable function that can converge to it with arbitrary precision. Additionally, we provide a systematic method for bounding the curvature of generated paths, which we demonstrate by applying it to paths resulting from linking a sequence of waypoints with segments. The proposed approach is computationally efficient, enabling real-time implementation on microcontrollers and compatibility with standard trajectory tracking and path following algorithms.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minghui Hou, Wei-Hsing Huang, Shaofeng Liang, Daizong Liu, Tai-Hao Wen et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Vision-language models enable the understanding and reasoning of complex traffic scenarios through multi-source information fusion, establishing it as a core technology for autonomous driving. However, existing vision-language models are constrained by the image understanding paradigm in 2D plane, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, an multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities, including occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Deepak Ingole, Valentin Bhend, Shiva Ganesh Murali, Oliver Dobrich, Alisa Rupenayan
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Manufacturing processes are often perturbed by drifts in the environment and wear in the system, requiring control re-tuning even in the presence of repetitive operations. This paper presents an iterative learning framework for automatic tuning of Nonlinear Model Predictive Control (NMPC) weighting matrices based on task-level performance feedback. Inspired by norm-optimal Iterative Learning Control (ILC), the proposed method adaptively adjusts NMPC weights Q and R across task repetitions to minimize key performance indicators (KPIs) related to tracking accuracy, control effort, and saturation. Unlike gradient-based approaches that require differentiating through the NMPC solver, we construct an empirical sensitivity matrix, enabling structured weight updates without analytic derivatives. The framework is validated through simulation on a UR10e robot performing carbon fiber winding on a tetrahedral core. Results demonstrate that the proposed approach converges to near-optimal tracking performance (RMSE within 0.3% of offline Bayesian Optimization (BO)) in just 4 online repetitions, compared to 100 offline evaluations required by BO algorithm. The method offers a practical solution for adaptive NMPC tuning in repetitive robotic tasks, combining the precision of carefully optimized controllers with the flexibility of online adaptation.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Ruiqi Yu, Qianshi Wang, Hongyi Li, Zheng Jun, Zhicheng Wang et al.
Robotics
ๆฅ็ๆ่ฆ
Traversing terrains with sparse footholds like legged animals presents a promising yet challenging task for quadruped robots, as it requires precise environmental perception and agile control to secure safe foot placement while maintaining dynamic stability. Model-based hierarchical controllers excel in laboratory settings, but suffer from limited generalization and overly conservative behaviors. End-to-end learning-based approaches unlock greater flexibility and adaptability, but existing state-of-the-art methods either rely on heightmaps that introduce noise and complex, costly pipelines, or implicitly infer terrain features from egocentric depth images, often missing accurate critical geometric cues and leading to inefficient learning and rigid gaits. To overcome these limitations, we propose START, a single-stage learning framework that enables agile, stable locomotion on highly sparse and randomized footholds. START leverages only low-cost onboard vision and proprioception to accurately reconstruct local terrain heightmap, providing an explicit intermediate representation to convey essential features relevant to sparse foothold regions. This supports comprehensive environmental understanding and precise terrain assessment, reducing exploration cost and accelerating skill acquisition. Experimental results demonstrate that START achieves zero-shot transfer across diverse real-world scenarios, showcasing superior adaptability, precise foothold placement, and robust locomotion.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Guanhua Ji, Harsha Polavaram, Lawrence Yunliang Chen, Sandeep Bajamahal, Zehan Ma et al.
Robotics
ๆฅ็ๆ่ฆ
Large and diverse datasets are needed for training generalist robot policies that have potential to control a variety of robot embodiments -- robot arm and gripper combinations -- across diverse tasks and environments. As re-collecting demonstrations and retraining for each new hardware platform are prohibitively costly, we show that existing robot data can be augmented for transfer and generalization. The Open X-Embodiment (OXE) dataset, which aggregates demonstrations from over 60 robot datasets, has been widely used as the foundation for training generalist policies. However, it is highly imbalanced: the top four robot types account for over 85\% of its real data, which risks overfitting to robot-scene combinations. We present AugE-Toolkit, a scalable robot augmentation pipeline, and OXE-AugE, a high-quality open-source dataset that augments OXE with 9 different robot embodiments. OXE-AugE provides over 4.4 million trajectories, more than triple the size of the original OXE. We conduct a systematic study of how scaling robot augmentation impacts cross-embodiment learning. Results suggest that augmenting datasets with diverse arms and grippers improves policy performance not only on the augmented robots, but also on unseen robots and even the original robots under distribution shifts. In physical experiments, we demonstrate that state-of-the-art generalist policies such as OpenVLA and $ฯ_0$ benefit from fine-tuning on OXE-AugE, improving success rates by 24-45% on previously unseen robot-gripper combinations across four real-world manipulation tasks. Project website: https://OXE-AugE.github.io/.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Xiang Li, Gang Liu, Weitao Zhou, Hongyi Zhu, Zhong Cao
Robotics
ๆฅ็ๆ่ฆ
Imitation learning (IL) has emerged as a central paradigm in autonomous driving. While IL excels in matching expert behavior in open-loop settings by minimizing per-step prediction errors, its performance degrades unexpectedly in closed-loop due to the gradual accumulation of small, often imperceptible errors over time.Over successive planning cycles, these errors compound, potentially resulting in severe failures.Current research efforts predominantly rely on increasingly sophisticated network architectures or high-fidelity training datasets to enhance the robustness of IL planners against error accumulation, focusing on the state-level robustness at a single time point. However, autonomous driving is inherently a continuous-time process, and leveraging the temporal scale to enhance robustness may provide a new perspective for addressing this issue.To this end, we propose a method termed Sequence of Experts (SoE), a temporal alternation policy that enhances closed-loop performance without increasing model size or data requirements. Our experiments on large-scale autonomous driving benchmarks nuPlan demonstrate that SoE method consistently and significantly improves the performance of all the evaluated models, and achieves state-of-the-art performance.This module may provide a key and widely applicable support for improving the training efficiency of autonomous driving models.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Mingqi Yuan, Tao Yu, Haolin Song, Bo Li, Xin Jin et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop SRL4Humanoid, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Jebeom Chae, Junwoo Chang, Seungho Yeom, Yujin Kim, Jongeun Choi
Multimodal
ๆฅ็ๆ่ฆ
Diffusion models have recently emerged as powerful tools for robot motion planning by capturing the multi-modal distribution of feasible trajectories. However, their extension to multi-robot settings with flexible, language-conditioned task specifications remains limited. Furthermore, current diffusion-based approaches incur high computational cost during inference and struggle with generalization because they require explicit construction of environment representations and lack mechanisms for reasoning about geometric reachability. To address these limitations, we present Language-Conditioned Heat-Inspired Diffusion (LCHD), an end-to-end vision-based framework that generates language-conditioned, collision-free trajectories. LCHD integrates CLIP-based semantic priors with a collision-avoiding diffusion kernel serving as a physical inductive bias that enables the planner to interpret language commands strictly within the reachable workspace. This naturally handles out-of-distribution scenarios -- in terms of reachability -- by guiding robots toward accessible alternatives that match the semantic intent, while eliminating the need for explicit obstacle information at inference time. Extensive evaluations on diverse real-world-inspired maps, along with real-robot experiments, show that LCHD consistently outperforms prior diffusion-based planners in success rate, while reducing planning latency.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that performs explicit alignment between visual space and physical space during pretraining, enabling models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features. When adapted to downstream robot tasks, VIPA-VLA achieves significantly improved grounding between 2D vision and 3D action, resulting in more robust and generalizable robotic policies.
๐
2025-12-15
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Oฤuzhan Akbฤฑyฤฑk, Naseem Alhousani, Fares J. Abu-Dakka
Machine Learning
ๆฅ็ๆ่ฆ
Reliable estimation of contact forces is crucial for ensuring safe and precise interaction of robots with unstructured environments. However, accurate sensorless force estimation remains challenging due to inherent modeling errors and complex residual dynamics and friction. To address this challenge, in this paper, we propose K-VARK (Kernelized Variance-Aware Residual Kalman filter), a novel approach that integrates a kernelized, probabilistic model of joint residual torques into an adaptive Kalman filter framework. Through Kernelized Movement Primitives trained on optimized excitation trajectories, K-VARK captures both the predictive mean and input-dependent heteroscedastic variance of residual torques, reflecting data variability and distance-to-training effects. These statistics inform a variance-aware virtual measurement update by augmenting the measurement noise covariance, while the process noise covariance adapts online via variational Bayesian optimization to handle dynamic disturbances. Experimental validation on a 6-DoF collaborative manipulator demonstrates that K-VARK achieves over 20% reduction in RMSE compared to state-of-the-art sensorless force estimation methods, yielding robust and accurate external force/torque estimation suitable for advanced tasks such as polishing and assembly.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Guillermo A. Castillo, Himanshu Lodha, Ayonga Hereid
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This work introduces a hierarchical strategy for terrain-aware bipedal locomotion that integrates reduced-dimensional perceptual representations to enhance reinforcement learning (RL)-based high-level (HL) policies for real-time gait generation. Unlike end-to-end approaches, our framework leverages latent terrain encodings via a Convolutional Variational Autoencoder (CNN-VAE) alongside reduced-order robot dynamics, optimizing the locomotion decision process with a compact state. We systematically analyze the impact of latent space dimensionality on learning efficiency and policy robustness. Additionally, we extend our method to be history-aware, incorporating sequences of recent terrain observations into the latent representation to improve robustness. To address real-world feasibility, we introduce a distillation method to learn the latent representation directly from depth camera images and provide preliminary hardware validation by comparing simulated and real sensor data. We further validate our framework using the high-fidelity Agility Robotics (AR) simulator, incorporating realistic sensor noise, state estimation, and actuator dynamics. The results confirm the robustness and adaptability of our method, underscoring its potential for hardware deployment.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Amin Jalal Aghdasian, Farzaneh Abdollahi, Ali Kamali Iglie
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper proposes two new algorithms for the lane keeping system (LKS) in autonomous vehicles (AVs) operating under snowy road conditions. These algorithms use deep reinforcement learning (DRL) to handle uncertainties and slippage. They include Action-Robust Recurrent Deep Deterministic Policy Gradient (AR-RDPG) and end-to-end Action-Robust convolutional neural network Attention Deterministic Policy Gradient (AR-CADPG), two action-robust approaches for decision-making. In the AR-RDPG method, within the perception layer, camera images are first denoised using multi-scale neural networks. Then, the centerline coefficients are extracted by a pre-trained deep convolutional neural network (DCNN). These coefficients, concatenated with the driving characteristics, are used as input to the control layer. The AR-CADPG method presents an end-to-end approach in which a convolutional neural network (CNN) and an attention mechanism are integrated within a DRL framework. Both methods are first trained in the CARLA simulator and validated under various snowy scenarios. Real-world experiments on a Jetson Nano-based autonomous vehicle confirm the feasibility and stability of the learned policies. Among the two models, the AR-CADPG approach demonstrates superior path-tracking accuracy and robustness, highlighting the effectiveness of combining temporal memory, adversarial resilience, and attention mechanisms in AVs.
๐
2025-12-15
๐ ArXiv Preprint (Robotics)
๐ฅ Anja Sheppard, Parker Ewen, Joey Wilson, Advaith V. Sethuraman, Benard Adewole et al.
Robotics
ๆฅ็ๆ่ฆ
This paper introduces SLIM-VDB, a new lightweight semantic mapping system with probabilistic semantic fusion for closed-set or open-set dictionaries. Advances in data structures from the computer graphics community, such as OpenVDB, have demonstrated significantly improved computational and memory efficiency in volumetric scene representation. Although OpenVDB has been used for geometric mapping in robotics applications, semantic mapping for scene understanding with OpenVDB remains unexplored. In addition, existing semantic mapping systems lack support for integrating both fixed-category and open-language label predictions within a single framework. In this paper, we propose a novel 3D semantic mapping system that leverages the OpenVDB data structure and integrates a unified Bayesian update framework for both closed- and open-set semantic fusion. Our proposed framework, SLIM-VDB, achieves significant reduction in both memory and integration times compared to current state-of-the-art semantic mapping approaches, while maintaining comparable mapping accuracy. An open-source C++ codebase with a Python interface is available at https://github.com/umfieldrobotics/slim-vdb.
๐
2025-12-14
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minghao Zhu, Zhihao Zhang, Anmol Sidhu, Keith Redmill
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework's effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pouya Ahadi, Blair Winograd, Camille Zaug, Karunesh Arora, Lijun Wang et al.
ๆฅ็ๆ่ฆ
Active Learning (AL) has garnered significant interest across various application domains where labeling training data is costly. AL provides a framework that helps practitioners query informative samples for annotation by oracles (labelers). However, these labels often contain noise due to varying levels of labeler accuracy. Additionally, uncertain samples are more prone to receiving incorrect labels because of their complexity. Learning from imperfectly labeled data leads to an inaccurate classifier. We propose a novel AL framework to construct a robust classification model by minimizing noise levels. Our approach includes an assignment model that optimally assigns query points to labelers, aiming to minimize the maximum possible noise within each cycle. Additionally, we introduce a new sampling method to identify the best query points, reducing the impact of label noise on classifier performance. Our experiments demonstrate that our approach significantly improves classification performance compared to several benchmark methods.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Furong Jia, Yuan Pu, Finn Guo, Monica Agrawal
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel on multiple-choice clinical diagnosis benchmarks, yet it is unclear how much of this performance reflects underlying probabilistic reasoning. We study this through questions from MedQA, where the task is to select the most likely diagnosis. We introduce the Frequency-Based Probabilistic Ranker (FBPR), a lightweight method that scores options with a smoothed Naive Bayes over concept-diagnosis co-occurrence statistics from a large corpus. When co-occurrence statistics were sourced from the pretraining corpora for OLMo and Llama, FBPR achieves comparable performance to the corresponding LLMs pretrained on that same corpus. Direct LLM inference and FBPR largely get different questions correct, with an overlap only slightly above random chance, indicating complementary strengths of each method. These findings highlight the continued value of explicit probabilistic baselines: they provide a meaningful performance reference point and a complementary signal for potential hybridization. While the performance of LLMs seems to be driven by a mechanism other than simple frequency aggregation, we show that an approach similar to the historically grounded, low-complexity expert systems still accounts for a substantial portion of benchmark performance.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent. Such inconsistency undermines trust, complicates compliance, and disrupts user experience. While personalization is desirable in certain contexts, many enterprise scenarios-such as HR onboarding, customer support, or policy disclosure-require invariant information delivery regardless of phrasing or prior conversational history. Existing approaches, including retrieval-augmented generation (RAG) and temperature tuning, improve factuality or reduce stochasticity but cannot guarantee stability across equivalent prompts. In this paper, we propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) to directly optimize for consistency. Unlike prior applications of GRPO, which have been limited to reasoning and code generation, we adapt GRPO to enforce stability of information content across groups of semantically equivalent prompts. We introduce entropy-based helpfulness and stability rewards, treating prompt variants as groups and resetting conversational context to isolate phrasing effects. Experiments on investment and job recommendation tasks show that our GRPO-trained model reduces variability more effectively than fine-tuning or decoding-based baselines. To our knowledge, this is a novel application of GRPO for aligning LLMs toward information consistency, reframing variability not as an acceptable feature of generative diversity but as a correctable flaw in enterprise deployments.
๐
2025-12-14
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Saad Alqithami
ๆฅ็ๆ่ฆ
As generative agents become increasingly sophisticated and deployed in long-term interactive scenarios, their memory management capabilities emerge as a critical bottleneck for both performance and privacy. Current approaches either maintain unlimited memory stores, leading to computational intractability and privacy concerns, or employ simplistic forgetting mechanisms that compromise agent coherence and functionality. This paper introduces the Memory-Aware Retention Schema (MaRS), a novel framework for human-centered memory management in generative agents, coupled with six theoretically-grounded forgetting policies that balance performance, privacy, and computational efficiency. We present the Forgetful but Faithful Agent (FiFA) benchmark, a comprehensive evaluation framework that assesses agent performance across narrative coherence, goal completion, social recall accuracy, privacy preservation, and cost efficiency. Through extensive experimentation involving 300 evaluation runs across multiple memory budgets and agent configurations, we demonstrate that our hybrid forgetting policy achieves superior performance (composite score: 0.911) while maintaining computational tractability and privacy guarantees. Our work establishes new benchmarks for memory-budgeted agent evaluation and provides practical guidelines for deploying generative agents in resource-constrained, privacy-sensitive environments. The theoretical foundations, implementation framework, and empirical results contribute to the emerging field of human-centered AI by addressing fundamental challenges in agent memory management that directly impact user trust, system scalability, and regulatory compliance.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunpeng Xu, Wenge Guo, Zhi Wei
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Reliable uncertainty quantification is essential for deploying machine learning systems in high-stakes domains. Conformal prediction provides distribution-free coverage guarantees but often produces overly large prediction sets, limiting its practical utility. To address this issue, we propose \textit{Selective Conformal Risk Control} (SCRC), a unified framework that integrates conformal prediction with selective classification. The framework formulates uncertainty control as a two-stage problem: the first stage selects confident samples for prediction, and the second stage applies conformal risk control on the selected subset to construct calibrated prediction sets. We develop two algorithms under this framework. The first, SCRC-T, preserves exchangeability by computing thresholds jointly over calibration and test samples, offering exact finite-sample guarantees. The second, SCRC-I, is a calibration-only variant that provides PAC-style probabilistic guarantees while being more computational efficient. Experiments on two public datasets show that both methods achieve the target coverage and risk levels, with nearly identical performance, while SCRC-I exhibits slightly more conservative risk control but superior computational practicality. Our results demonstrate that selective conformal risk control offers an effective and efficient path toward compact, reliable uncertainty quantification.
๐
2025-12-14
๐ ArXiv Preprint (Robotics)
๐ฅ Kuan Fang, Yuxin Chen, Xinghao Zhu, Farzad Niroui, Lingfeng Sun et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
We present SAGA, a versatile and adaptive framework for visuomotor control that can generalize across various environments, task objectives, and user specifications. To efficiently learn such capability, our key idea is to disentangle high-level semantic intent from low-level visuomotor control by explicitly grounding task objectives in the observed environment. Using an affordance-based task representation, we express diverse and complex behaviors in a unified, structured form. By leveraging multimodal foundation models, SAGA grounds the proposed task representation to the robot's visual observation as 3D affordance heatmaps, highlighting task-relevant entities while abstracting away spurious appearance variations that would hinder generalization. These grounded affordances enable us to effectively train a conditional policy on multi-task demonstration data for whole-body control. In a unified framework, SAGA can solve tasks specified in different forms, including language instructions, selected points, and example demonstrations, enabling both zero-shot execution and few-shot adaptation. We instantiate SAGA on a quadrupedal manipulator and conduct extensive experiments across eleven real-world tasks. SAGA consistently outperforms end-to-end and modular baselines by substantial margins. Together, these results demonstrate that structured affordance grounding offers a scalable and effective pathway toward generalist mobile manipulation.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sindhuja Madabushi, Ahmad Faraz Khan, Haider Ali, Ananthram Swami, Rui Ning et al.
ๆฅ็ๆ่ฆ
Vertical Federated Learning (VFL) enables collaborative model training across organizations that share common user samples but hold disjoint feature spaces. Despite its potential, VFL is susceptible to feature inference attacks, in which adversarial parties exploit shared confidence scores (i.e., prediction probabilities) during inference to reconstruct private input features of other participants. To counter this threat, we propose PRIVEE (PRIvacy-preserving Vertical fEderated lEarning), a novel defense mechanism named after the French word privรฉe, meaning "private." PRIVEE obfuscates confidence scores while preserving critical properties such as relative ranking and inter-score distances. Rather than exposing raw scores, PRIVEE shares only the transformed representations, mitigating the risk of reconstruction attacks without degrading model prediction accuracy. Extensive experiments show that PRIVEE achieves a threefold improvement in privacy protection compared to state-of-the-art defenses, while preserving full predictive performance against advanced feature inference attacks.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kaustav Chatterjee, Joshua Li, Kundan Parajulee, Jared Schwennesen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Steep profiled Highway Railway Grade Crossings (HRGCs) pose safety hazards to vehicles with low ground clearance, which may become stranded on the tracks, creating risks of train vehicle collisions. This research develops a framework for network level evaluation of hangup susceptibility of HRGCs. Profile data from different crossings in Oklahoma were collected using both a walking profiler and the Pave3D8K Laser Imaging System. A hybrid deep learning model, combining Long Short Term Memory (LSTM) and Transformer architectures, was developed to reconstruct accurate HRGC profiles from Pave3D8K Laser Imaging System data. Vehicle dimension data from around 350 specialty vehicles were collected at various locations across Oklahoma to enable up to date statistical design dimensions. Hangup susceptibility was analyzed using three vehicle dimension scenarios (a) median dimension (median wheelbase and ground clearance), (b) 75 25 percentile dimension (75 percentile wheelbase, 25 percentile ground clearance), and (c) worst case dimension (maximum wheelbase and minimum ground clearance). Results indicate 36, 62, and 67 crossings at the highest hangup risk levels under these scenarios, respectively. An ArcGIS database and a software interface were developed to support transportation agencies in mitigating crossing hazards. This framework advances safety evaluation by integrating next generation sensing, deep learning, and infrastructure datasets into practical decision support tools.
๐
2025-12-14
๐ ArXiv Preprint (Computer Vision)
๐ฅ N. K. B. M. P. K. B. Narasinghe, Uthayasanker Thayasivam
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa's distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an "augmentation divergence": while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning. We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy across varying shot counts. Crucially, we characterize the sensitivity of training configurations to data scarcity, providing empirical reference settings for scaling regularization, rank, and sampling strategies to facilitate the efficient adaptation of generative-contrastive foundation models.
๐
2025-12-14
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yongyuan Liang, Xiyao Wang, Yuanchen Ju, Jianwei Yang, Furong Huang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively builds capabilities from object-level recognition to scene-level spatial reasoning. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks, from object recognition and captioning to spatial reasoning in 3D scenes, while demonstrating robust scaling properties as model size and training data increase. Our work provides a unified foundation for advancing 3D spatial intelligence in real-world applications.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Congzhou M Sha
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Flow matching has emerged as a powerful framework for generative modeling through continuous normalizing flows. We investigate a potential topological constraint: when the prior distribution and target distribution have mismatched topology (e.g., unimodal to multimodal), the optimal velocity field under standard flow matching objectives may exhibit spatial discontinuities. We suggest that this discontinuity arises from the requirement that continuous flows must bifurcate to map a single mode to multiple modes, forcing particles to make discrete routing decisions at intermediate times. Through theoretical analysis on bimodal Gaussian mixtures, we demonstrate that the optimal velocity field exhibits jump discontinuities along decision boundaries, with magnitude approaching infinity as time approaches the target distribution. Our analysis suggests that this phenomenon is not specific to $L^2$ loss, but rather may be a consequence of topological mismatch between distributions. We validate our theory empirically and discuss potential implications for flow matching on manifolds, connecting our findings to recent work on Riemannian flow matching and the challenge of learning discontinuous representations in neural networks.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chris Latimer, Nicolรณ Boschi, Andrew Neeser, Chris Bartholomew, Gaurav Srivastava et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Agent memory has been touted as a dimension of growth for LLM-based applications, enabling agents that can accumulate experience, adapt across sessions, and move beyond single-shot question answering. The current generation of agent memory systems treats memory as an external layer that extracts salient snippets from conversations, stores them in vector or graph-based stores, and retrieves top-k items into the prompt of an otherwise stateless model. While these systems improve personalization and context carry-over, they still blur the line between evidence and inference, struggle to organize information over long horizons, and offer limited support for agents that must explain their reasoning. We present Hindsight, a memory architecture that treats agent memory as a structured, first-class substrate for reasoning by organizing it into four logical networks that distinguish world facts, agent experiences, synthesized entity summaries, and evolving beliefs. This framework supports three core operations -- retain, recall, and reflect -- that govern how information is added, accessed, and updated. Under this abstraction, a temporal, entity aware memory layer incrementally turns conversational streams into a structured, queryable memory bank, while a reflection layer reasons over this bank to produce answers and to update information in a traceable way. On key long-horizon conversational memory benchmarks like LongMemEval and LoCoMo, Hindsight with an open-source 20B model lifts overall accuracy from 39% to 83.6% over a full-context baseline with the same backbone and outperforms full context GPT-4o. Scaling the backbone further pushes Hindsight to 91.4% on LongMemEval and up to 89.61% on LoCoMo (vs. 75.78% for the strongest prior open system), consistently outperforming existing memory architectures on multi-session and open-domain questions.
๐
2025-12-14
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Mengqian Wu, Jiayi Zhang, Raymond Z. Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Debate has been widely adopted as a strategy to enhance critical thinking skills in English Language Arts (ELA). One important skill in debate is forming effective argumentation, which requires debaters to select supportive evidence from literature and construct compelling claims. However, the training of this skill largely depends on human coaching, which is labor-intensive and difficult to scale. To better support students in preparing for debates, this study explores the potential of leveraging artificial intelligence to generate effective arguments. Specifically, we prompted GPT-4 to create an evidence card and compared it to those produced by human debaters. The evidence cards outline the arguments students will present and how those arguments will be delivered, including components such as literature-based evidence quotations, summaries of core ideas, verbatim reading scripts, and tags (i.e., titles of the arguments). We compared the quality of the arguments in the evidence cards created by GPT and student debaters using Aristotle's rhetorical principles: ethos (credibility), pathos (emotional appeal), and logos (logical reasoning). Through a systematic qualitative and quantitative analysis, grounded in the rhetorical principles, we identify the strengths and limitations of human and GPT in debate reasoning, outlining areas where AI's focus and justifications align with or diverge from human reasoning. Our findings contribute to the evolving role of AI-assisted learning interventions, offering insights into how student debaters can develop strategies that enhance their argumentation and reasoning skills.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hanyu Cai, Binqi Shen, Lier Jin, Lan Hu, Xiaojing Fan
Natural Language Processing
ๆฅ็ๆ่ฆ
Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Friendly, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing.
Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. Compared with earlier researches, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.
๐
2025-12-14
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Junbo Jacob Lian, Mingyang Yu, Kaichen Ouyang, Shengwei Fu, Rui Zhong et al.
Machine Learning
ๆฅ็ๆ่ฆ
Black-box optimization often relies on evolutionary and swarm algorithms whose performance is highly problem dependent. We view an optimizer as a short program over a small vocabulary of search operators and learn this operator program separately for each problem instance. We instantiate this idea in Operator-Programmed Algorithms (OPAL), a landscape-aware framework for continuous black-box optimization that uses a small design budget with a standard differential evolution baseline to probe the landscape, builds a $k$-nearest neighbor graph over sampled points, and encodes this trajectory with a graph neural network. A meta-learner then maps the resulting representation to a phase-wise schedule of exploration, restart, and local search operators. On the CEC~2017 test suite, a single meta-trained OPAL policy is statistically competitive with state-of-the-art adaptive differential evolution variants and achieves significant improvements over simpler baselines under nonparametric tests. Ablation studies on CEC~2017 justify the choices for the design phase, the trajectory graph, and the operator-program representation, while the meta-components add only modest wall-clock overhead. Overall, the results indicate that operator-programmed, landscape-aware per-instance design is a practical way forward beyond ad hoc metaphor-based algorithms in black-box optimization.
๐
2025-12-14
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Boyang Yan
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The transition of Large Language Models (LLMs) from passive code generators to autonomous agents introduces significant safety risks, specifically regarding destructive commands and inconsistent system states. Existing commercial solutions often prioritize interactive user safety, enforcing authentication barriers that break the headless loops required for true autonomy. This paper presents a Fault-Tolerant Sandboxing framework designed to mitigate these risks through a policy-based interception layer and a transactional filesystem snapshot mechanism. We hypothesize that wrapping agent actions in atomic transactions can guarantee safety with acceptable latency, outperforming the heavy initialization overhead of containers or the interactive friction of commercial CLIs. We validated this approach by deploying the Minimind-MoE LLM served via nano-vllm on a custom Proxmox-based testbed utilizing EVPN/VXLAN isolation. Experimental results demonstrate a 100\% interception rate for high-risk commands and a 100\% success rate in rolling back failed states. Crucially, our prototype incurs only a 14.5\% performance overhead (approx. 1.8s) per transaction. In contrast, benchmarking against the Gemini CLI sandbox revealed that it requires interactive authentication ("Sign in"), rendering it unusable for headless, autonomous agent workflows.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anastasiia Alokhina, Pan Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformers exhibit a notable property of \emph{size generalization}, demonstrating an ability to extrapolate from smaller token sets to significantly longer ones. This behavior has been documented across diverse applications, including point clouds, graphs, and natural language. Despite its empirical success, this capability still lacks some rigorous theoretical characterizations. In this paper, we develop a theoretical framework to analyze this phenomenon for geometric data, which we represent as discrete samples from a continuous source (e.g., point clouds from manifolds, graphs from graphons). Our core contribution is a bound on the error between the Transformer's output for a discrete sample and its continuous-domain equivalent. We prove that for Transformers with stable positional encodings, this bound is determined by the sampling density and the intrinsic dimensionality of the data manifold. Experiments on graphs and point clouds of various sizes confirm the tightness of our theoretical bound.
๐
2025-12-14
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sander Beckers
ๆฅ็ๆ่ฆ
I develop a novel semantics for probabilities of counterfactuals that generalizes the standard Pearlian semantics: it applies to probabilistic causal models that cannot be extended into realistic structural causal models and are therefore beyond the scope of Pearl's semantics. This generalization is needed because, as I show, such probabilistic causal models arise even in simple settings. My semantics offer a natural compromize in the long-standing debate between Pearl and Dawid over counterfactuals: I agree with Dawid that universal causal determinism and unrealistic variables should be rejected, but I agree with Pearl that a general semantics of counterfactuals is nonetheless possible. I restrict attention to causal models that satisfy the Markov condition, only contain realistic variables, and are causally complete. Although I formulate my proposal using structural causal models, as does Pearl, I refrain from using so-called response variables. Moreover, I prove that my semantics is equivalent to two other recent proposals that do not involve structural causal models, and that it is in line with various comments on stochastic counterfactuals that have appeared in the literature more broadly. Throughout I also reflect on the universality of the Markov condition and explore a novel generalization of causal abstractions
๐
2025-12-14
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Erik Hoel
Natural Language Processing
ๆฅ็ๆ่ฆ
The requirements for a falsifiable and non-trivial theory of consciousness significantly constrain such theories. Specifically, recent research on the Unfolding Argument and the Substitution Argument has given us formal tools to analyze requirements for a theory of consciousness. I show via a new Proximity Argument that these requirements especially constrain the potential consciousness of contemporary Large Language Models (LLMs) because of their proximity to systems that are equivalent to LLMs in terms of input/output function; yet, for these functionally equivalent systems, there cannot be any non-trivial theory of consciousness that judges them conscious. This forms the basis of a disproof of contemporary LLM consciousness. I then show a positive result, which is that theories of consciousness based on (or requiring) continual learning do satisfy the stringent formal constraints for a theory of consciousness in humans. Intriguingly, this work supports a hypothesis: If continual learning is linked to consciousness in humans, the current limitations of LLMs (which do not continually learn) are intimately tied to their lack of consciousness.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shivansh Sahni, Wenzhi Zhang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The Liquid Reasoning Transformer (LRT) is a transformer architecture designed for inference with adaptive depths using iterative changes, discard-based correction, and a learned stopping mechanism. Instead of relying on a single feedforward pass, the model updates a recurrent reasoning token across multiple internal steps, allowing it to correct early errors and allocate computation based on input difficulty. We evaluate the LRT on Sudoku as a controlled testbed for structured reasoning and show that it achieves strong performance, reaching 98.68% digit accuracy and 36.30% full-puzzle accuracy without using symbolic rules or search. Analyzing internal patterns shows that the discard and stop gates play different, important roles in stabilizing inferences and adjusting computational depth. We discuss how these mechanisms extend naturally to chess-scale reasoning tasks and outline extensions for multi-token reasoning and larger domains.
๐
2025-12-14
๐ ArXiv Preprint (cs.MA)
๐ฅ Sreemaee Akshathala, Bassam Adnan, Mahisha Ramesh, Karthik Vaidhyanathan, Basil Muhammed et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Recent advances in agentic AI have shifted the focus from standalone Large Language Models (LLMs) to integrated systems that combine LLMs with tools, memory, and other agents to perform complex tasks. These multi-agent architectures enable coordinated reasoning, planning, and execution across diverse domains, allowing agents to collaboratively automate complex workflows. Despite these advances, evaluation and assessment of LLM agents and the multi-agent systems they constitute remain a fundamental challenge. Although various approaches have been proposed in the software engineering literature for evaluating conventional software components, existing methods for AI-based systems often overlook the non-deterministic nature of models. This non-determinism introduces behavioral uncertainty during execution, yet existing evaluations rely on binary task completion metrics that fail to capture it. Evaluating agentic systems therefore requires examining additional dimensions, including the agent ability to invoke tools, ingest and retrieve memory, collaborate with other agents, and interact effectively with its environment. We propose an end-to-end Agent Assessment Framework with four evaluation pillars encompassing LLMs, Memory, Tools, and Environment. We validate the framework on a representative Autonomous CloudOps use case, where experiments reveal behavioral deviations overlooked by conventional metrics, demonstrating its effectiveness in capturing runtime uncertainties.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammad Abu-Shaira, Weishi Shi
Machine Learning
ๆฅ็ๆ่ฆ
Despite extensive focus on techniques for evaluating the performance of two learning algorithms on a single dataset, the critical challenge of developing statistical tests to compare multiple algorithms across various datasets has been largely overlooked in most machine learning research. Additionally, in the realm of Online Learning, ensuring statistical significance is essential to validate continuous learning processes, particularly for achieving rapid convergence and effectively managing concept drifts in a timely manner. Robust statistical methods are needed to assess the significance of performance differences as data evolves over time. This article examines the state-of-the-art online regression models and empirically evaluates several suitable tests. To compare multiple online regression models across various datasets, we employed the Friedman test along with corresponding post-hoc tests. For thorough evaluations, utilizing both real and synthetic datasets with 5-fold cross-validation and seed averaging ensures comprehensive assessment across various data subsets. Our tests generally confirmed the performance of competitive baselines as consistent with their individual reports. However, some statistical test results also indicate that there is still room for improvement in certain aspects of state-of-the-art methods.
๐
2025-12-14
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minghao Zhu, Zhihao Zhang, Anmol Sidhu, Keith Redmill
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework's effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.
๐
2025-12-14
๐ ArXiv Preprint (cs.CE)
๐ฅ Yoo Yongmin, Kim Seungwoo, Liu Jingjiang
Natural Language Processing
ๆฅ็ๆ่ฆ
Valuing intangible assets under uncertainty remains a critical challenge in the strategic management of technological innovation due to the information asymmetry inherent in high-dimensional technical specifications. Traditional bibliometric indicators, such as citation counts, fail to address this friction in a timely manner due to the systemic latency inherent in data accumulation. To bridge this gap, this study proposes the Economic Reasoning Alignment via Instruction Tuning (ERA-IT) framework. We theoretically conceptualize patent renewal history as a revealed economic preference and leverage it as an objective supervisory signal to align the generative reasoning of Large Language Models (LLMs) with market realities, a process we term Eco-Semantic Alignment. Using a randomly sampled dataset of 10,000 European Patent Office patents across diverse technological domains, we trained the model not only to predict value tiers but also to reverse-engineer the Economic Chain-of-Thought from unstructured text. Empirical results demonstrate that ERA-IT significantly outperforms both conventional econometric models and zero-shot LLMs in predictive accuracy. More importantly, by generating explicit, logically grounded rationales for valuation, the framework serves as a transparent cognitive scaffold for decision-makers, reducing the opacity of black-box AI in high-stakes intellectual property management.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Furong Jia, Yuan Pu, Finn Guo, Monica Agrawal
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel on multiple-choice clinical diagnosis benchmarks, yet it is unclear how much of this performance reflects underlying probabilistic reasoning. We study this through questions from MedQA, where the task is to select the most likely diagnosis. We introduce the Frequency-Based Probabilistic Ranker (FBPR), a lightweight method that scores options with a smoothed Naive Bayes over concept-diagnosis co-occurrence statistics from a large corpus. When co-occurrence statistics were sourced from the pretraining corpora for OLMo and Llama, FBPR achieves comparable performance to the corresponding LLMs pretrained on that same corpus. Direct LLM inference and FBPR largely get different questions correct, with an overlap only slightly above random chance, indicating complementary strengths of each method. These findings highlight the continued value of explicit probabilistic baselines: they provide a meaningful performance reference point and a complementary signal for potential hybridization. While the performance of LLMs seems to be driven by a mechanism other than simple frequency aggregation, we show that an approach similar to the historically grounded, low-complexity expert systems still accounts for a substantial portion of benchmark performance.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dingyi Yang, Qin Jin
Natural Language Processing
ๆฅ็ๆ่ฆ
In this work, we conduct systematic research in a challenging area: the automatic evaluation of book-length stories (>100K tokens). Our study focuses on two key questions: (1) understanding which evaluation aspects matter most to readers, and (2) exploring effective methods for evaluating lengthy stories. We introduce the first large-scale benchmark, LongStoryEval, comprising 600 newly published books with an average length of 121K tokens (maximum 397K). Each book includes its average rating and multiple reader reviews, presented as critiques organized by evaluation aspects. By analyzing all user-mentioned aspects, we propose an evaluation criteria structure and conduct experiments to identify the most significant aspects among the 8 top-level criteria. For evaluation methods, we compare the effectiveness of three types: aggregation-based, incremental-updated, and summary-based evaluations. Our findings reveal that aggregation- and summary-based evaluations perform better, with the former excelling in detail assessment and the latter offering greater efficiency. Building on these insights, we further propose NovelCritique, an 8B model that leverages the efficient summary-based framework to review and score stories across specified aspects. NovelCritique outperforms commercial models like GPT-4o in aligning with human evaluations. Our datasets and codes are available at https://github.com/DingyiYang/LongStoryEval.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chris Latimer, Nicolรณ Boschi, Andrew Neeser, Chris Bartholomew, Gaurav Srivastava et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Agent memory has been touted as a dimension of growth for LLM-based applications, enabling agents that can accumulate experience, adapt across sessions, and move beyond single-shot question answering. The current generation of agent memory systems treats memory as an external layer that extracts salient snippets from conversations, stores them in vector or graph-based stores, and retrieves top-k items into the prompt of an otherwise stateless model. While these systems improve personalization and context carry-over, they still blur the line between evidence and inference, struggle to organize information over long horizons, and offer limited support for agents that must explain their reasoning. We present Hindsight, a memory architecture that treats agent memory as a structured, first-class substrate for reasoning by organizing it into four logical networks that distinguish world facts, agent experiences, synthesized entity summaries, and evolving beliefs. This framework supports three core operations -- retain, recall, and reflect -- that govern how information is added, accessed, and updated. Under this abstraction, a temporal, entity aware memory layer incrementally turns conversational streams into a structured, queryable memory bank, while a reflection layer reasons over this bank to produce answers and to update information in a traceable way. On key long-horizon conversational memory benchmarks like LongMemEval and LoCoMo, Hindsight with an open-source 20B model lifts overall accuracy from 39% to 83.6% over a full-context baseline with the same backbone and outperforms full context GPT-4o. Scaling the backbone further pushes Hindsight to 91.4% on LongMemEval and up to 89.61% on LoCoMo (vs. 75.78% for the strongest prior open system), consistently outperforming existing memory architectures on multi-session and open-domain questions.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hanyu Cai, Binqi Shen, Lier Jin, Lan Hu, Xiaojing Fan
Natural Language Processing
ๆฅ็ๆ่ฆ
Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Friendly, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing.
Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. Compared with earlier researches, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mosh Levy, Zohar Elyoseph, Shauli Ravfogel, Yoav Goldberg
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) can generate reasoning tokens before their final answer to boost performance on complex tasks. While these sequences seem like human thought processes, empirical evidence reveals that they are not a faithful explanation of the model's actual reasoning process. To address this gap between appearance and function, we introduce the State over Tokens (SoT) conceptual framework. SoT reframes reasoning tokens not as a linguistic narrative, but as an externalized computational state -- the sole persistent information carrier across the model's stateless generation cycles. This explains how the tokens can drive correct reasoning without being a faithful explanation when read as text and surfaces previously overlooked research questions on these tokens. We argue that to truly understand the process that LLMs do, research must move beyond reading the reasoning tokens as text and focus on decoding them as state.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pedro Henrique Luz de Araujo, Michael A. Hedderich, Ali Modarressi, Hinrich Schuetze, Benjamin Roth
Natural Language Processing
ๆฅ็ๆ่ฆ
Persona-assigned large language models (LLMs) are used in domains such as education, healthcare, and sociodemographic simulation. Yet, they are typically evaluated only in short, single-round settings that do not reflect real-world usage. We introduce an evaluation protocol that combines long persona dialogues (over 100 rounds) and evaluation datasets to create dialogue-conditioned benchmarks that can robustly measure long-context effects. We then investigate the effects of dialogue length on persona fidelity, instruction-following, and safety of seven state-of-the-art open- and closed-weight LLMs. We find that persona fidelity degrades over the course of dialogues, especially in goal-oriented conversations, where models must sustain both persona fidelity and instruction following. We identify a trade-off between fidelity and instruction following, with non-persona baselines initially outperforming persona-assigned models; as dialogues progress and fidelity fades, persona responses become increasingly similar to baseline responses. Our findings highlight the fragility of persona applications in extended interactions and our work provides a protocol to systematically measure such failures.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Thales Sales Almeida, Rodrigo Nogueira, Hรฉlio Pedrini
Natural Language Processing
ๆฅ็ๆ่ฆ
Continued pretraining extends a language model's capabilities by further exposing it to additional data, often tailored to a specific linguistic or domain context. This strategy has emerged as an efficient alternative to full retraining when adapting general-purpose models to new settings. In this work, we investigate this paradigm through Curiรณ 7B, a 7-billion-parameter model derived from LLaMA-2 and trained on 100 billion Portuguese tokens from the ClassiCC-PT corpus - the most extensive Portuguese-specific continued-pretraining effort above the three-billion-parameter scale to date. Beyond scale, we investigate whether quantity alone suffices or whether data quality plays a decisive role in linguistic adaptation. To this end, we introduce Curiรณ-Edu 7B, a variant trained exclusively on the educational and STEM-filtered subset of the same corpus, totaling just 10 billion tokens. Despite using only 10% of the data and 20% of the computation, Curiรณ-Edu 7B surpasses the full-corpus model in our evaluations, demonstrating that data selection can be fundamental even when adapting models with limited prior exposure to the target language. The developed models are available at https://huggingface.co/collections/ClassiCC-Corpus/curio-edu
๐
2025-12-14
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Sina Jani, Arman Heidari, Amirmohammad Anvari, Zahra Rahimi
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The rapid acceleration of scientific publishing has created substantial challenges for researchers attempting to discover, contextualize, and interpret relevant literature. Traditional keyword-based search systems provide limited semantic understanding, while existing AI-driven tools typically focus on isolated tasks such as retrieval, clustering, or bibliometric visualization. This paper presents an integrated system for scientific literature exploration that combines large-scale data acquisition, hybrid retrieval, semantic topic modeling, and heterogeneous knowledge graph construction. The system builds a comprehensive corpus by merging full-text data from arXiv with structured metadata from OpenAlex. A hybrid retrieval architecture fuses BM25 lexical search with embedding-based semantic search using Reciprocal Rank Fusion. Topic modeling is performed on retrieved results using BERTopic or non-negative matrix factorization depending on computational resources. A knowledge graph unifies papers, authors, institutions, countries, and extracted topics into an interpretable structure. The system provides a multi-layered exploration environment that reveals not only relevant publications but also the conceptual and relational landscape surrounding a query. Evaluation across multiple queries demonstrates improvements in retrieval relevance, topic coherence, and interpretability. The proposed framework contributes an extensible foundation for AI-assisted scientific discovery.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao et al.
Robotics
ๆฅ็ๆ่ฆ
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
๐
2025-12-14
๐ WSDM
๐ฅ Xuanzhang Liu, Jianglun Feng, Zhuoran Zhuang, Junzhe Zhao, Maofei Que et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Model (LLM) agents trained with reinforcement learning (RL) show great promise for solving complex, multi-step tasks. However, their performance is often crippled by "Context Explosion", where the accumulation of long text outputs overwhelms the model's context window and leads to reasoning failures. To address this, we introduce CoDA, a Context-Decoupled hierarchical Agent, a simple but effective reinforcement learning framework that decouples high-level planning from low-level execution. It employs a single, shared LLM backbone that learns to operate in two distinct, contextually isolated roles: a high-level Planner that decomposes tasks within a concise strategic context, and a low-level Executor that handles tool interactions in an ephemeral, isolated workspace. We train this unified agent end-to-end using PECO (Planner-Executor Co-Optimization), a reinforcement learning methodology that applies a trajectory-level reward to jointly optimize both roles, fostering seamless collaboration through context-dependent policy updates. Extensive experiments demonstrate that CoDA achieves significant performance improvements over state-of-the-art baselines on complex multi-hop question-answering benchmarks, and it exhibits strong robustness in long-context scenarios, maintaining stable performance while all other baselines suffer severe degradation, thus further validating the effectiveness of our hierarchical design in mitigating context overload.
๐
2025-12-14
๐ NeurIPS 2025
๐ฅ Xue Li, Xiaonan Song, Henry Hu
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures inefficiently process all tokens uniformly. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep top-K tokens for the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones like BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under corruptions; observations suggest adaptive pruning suppresses spurious correlations, improving stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP's role in efficient multimodal edge computing pipelines.
๐
2025-12-14
๐ ICLR 2026
๐ฅ Mahir Labib Dihan, Tanzima Hashem, Mohammed Eunus Ali, Md Rizwan Parvez
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
LLM-based agents often operate in a greedy, step-by-step manner, selecting actions solely based on the current observation without considering long-term consequences or alternative paths. This lack of foresight is particularly problematic in web environments, which are only partially observable-limited to browser-visible content (e.g., DOM and UI elements)-where a single misstep often requires complex and brittle navigation to undo. Without an explicit backtracking mechanism, agents struggle to correct errors or systematically explore alternative paths. Tree-search methods provide a principled framework for such structured exploration, but existing approaches lack mechanisms for safe backtracking, making them prone to unintended side effects. They also assume that all actions are reversible, ignoring the presence of irreversible actions-limitations that reduce their effectiveness in realistic web tasks. To address these challenges, we introduce WebOperator, a tree-search framework that enables reliable backtracking and strategic exploration. Our method incorporates a best-first search strategy that ranks actions by both reward estimates and safety considerations, along with a robust backtracking mechanism that verifies the feasibility of previously visited paths before replaying them, preventing unintended side effects. To further guide exploration, WebOperator generates action candidates from multiple, varied reasoning contexts to ensure diverse and robust exploration, and subsequently curates a high-quality action set by filtering out invalid actions pre-execution and merging semantically equivalent ones. Experimental results on WebArena and WebVoyager demonstrate the effectiveness of WebOperator. On WebArena, WebOperator achieves a state-of-the-art 54.6% success rate with gpt-4o, underscoring the critical advantage of integrating strategic foresight with safe execution.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yongcan Yu, Lingxiao He, Shuo Lu, Lijun Sheng, Yinuo Xu et al.
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in vision-language models (VLMs) reasoning have been largely attributed to the rise of reinforcement Learning (RL), which has shifted the community's focus away from the supervised fine-tuning (SFT) paradigm. Many studies suggest that introducing the SFT stage not only fails to improve reasoning ability but may also negatively impact model training. In this study, we revisit this RL-centric belief through a systematic and controlled comparison of SFT and RL on VLM Reasoning. Using identical data sources, we find that the relative effectiveness of SFT and RL is conditional and strongly influenced by model capacity, data scale, and data distribution. Contrary to common assumptions, our findings show that SFT plays a crucial role across several scenarios: (1) Effectiveness for weaker models. SFT more reliably elicits reasoning capabilities in smaller or weaker VLMs. (2) Data efficiency. SFT with only 2K achieves comparable or better reasoning performance to RL with 20K. (3) Cross-modal transferability. SFT demonstrates stronger generalization across modalities. Moreover, we identify a pervasive issue of deceptive rewards, where higher rewards fail to correlate with better reasoning accuracy in RL. These results challenge the prevailing "RL over SFT" narrative. They highlight that the role of SFT may have been underestimated and support a more balanced post-training pipeline in which SFT and RL function as complementary components.
๐
2025-12-14
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
Natural Language Processing
ๆฅ็ๆ่ฆ
Prompts can switch a model's behavior even when the weights are fixed, yet this phenomenon is rarely treated as a clean theoretical object rather than a heuristic. We study the family of functions obtainable by holding a Transformer backbone fixed as an executor and varying only the prompt. Our core idea is to view the prompt as an externally injected program and to construct a simplified Transformer that interprets it to implement different computations. The construction exposes a mechanism-level decomposition: attention performs selective routing from prompt memory, the FFN performs local arithmetic conditioned on retrieved fragments, and depth-wise stacking composes these local updates into a multi-step computation. Under this viewpoint, we prove a constructive existential result showing that a single fixed backbone can approximate a broad class of target behaviors via prompts alone. The framework provides a unified starting point for formalizing trade-offs under prompt length/precision constraints and for studying structural limits of prompt-based switching, while remaining distinct from empirical claims about pretrained LLMs.
๐
2025-12-14
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Samarth Sarin, Lovepreet Singh, Bhaskarjit Sarmah, Dhagash Mehta
Natural Language Processing
ๆฅ็ๆ่ฆ
Agentic memory is emerging as a key enabler for large language models (LLM) to maintain continuity, personalization, and long-term context in extended user interactions, critical capabilities for deploying LLMs as truly interactive and adaptive agents. Agentic memory refers to the memory that provides an LLM with agent-like persistence: the ability to retain and act upon information across conversations, similar to how a human would. We present Memoria, a modular memory framework that augments LLM-based conversational systems with persistent, interpretable, and context-rich memory. Memoria integrates two complementary components: dynamic session-level summarization and a weighted knowledge graph (KG)-based user modelling engine that incrementally captures user traits, preferences, and behavioral patterns as structured entities and relationships. This hybrid architecture enables both short-term dialogue coherence and long-term personalization while operating within the token constraints of modern LLMs. We demonstrate how Memoria enables scalable, personalized conversational artificial intelligence (AI) by bridging the gap between stateless LLM interfaces and agentic memory systems, offering a practical solution for industry applications requiring adaptive and evolving user experiences.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amirhossein Yousefiramandi, Ciaran Cooney
Natural Language Processing
ๆฅ็ๆ่ฆ
We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pre-trained causal LLM and fine-tuning on the task (using the LLM's final token embedding as a sequence representation), and (2) instruction-tuning the LLM in a prompt->response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two datasets - a proprietary single-label dataset and the public WIPO-Alpha patent dataset (extreme multi-label classification) - show that the embedding-based method significantly outperforms the instruction-tuned method in F1-score, and is very competitive with - even surpassing - fine-tuned domain-specific models (e.g. BERT) on the same tasks. These results demonstrate that directly leveraging the internal representations of causal LLMs, along with efficient fine-tuning techniques, yields impressive classification performance under limited computational resources. We discuss the advantages of each approach while outlining practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hassan Mujtaba, Hamza Naveed, Hanzlah Munir
Machine Learning
ๆฅ็ๆ่ฆ
Authorship analysis has traditionally focused on lexical and stylistic cues within text, while higher-level narrative structure remains underexplored, particularly for low-resource languages such as Urdu. This work proposes a graph-based framework that models Urdu novels as character interaction networks to examine whether authorial style can be inferred from narrative structure alone. Each novel is represented as a graph where nodes correspond to characters and edges denote their co-occurrence within narrative proximity. We systematically compare multiple graph representations, including global structural features, node-level semantic summaries, unsupervised graph embeddings, and supervised graph neural networks. Experiments on a dataset of 52 Urdu novels written by seven authors show that learned graph representations substantially outperform hand-crafted and unsupervised baselines, achieving up to 0.857 accuracy under a strict author-aware evaluation protocol.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yida Cai, Ranjuexiao Hu, Huiyuan Xie, Chenyang Li, Yun Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Legal relations form a highly consequential analytical framework of civil law system, serving as a crucial foundation for resolving disputes and realizing values of the rule of law in judicial practice. However, legal relations in Chinese civil cases remain underexplored in the field of legal artificial intelligence (legal AI), largely due to the absence of comprehensive schemas. In this work, we firstly introduce a comprehensive schema, which contains a hierarchical taxonomy and definitions of arguments, for AI systems to capture legal relations in Chinese civil cases. Based on this schema, we then formulate legal relation extraction task and present LexRel, an expert-annotated benchmark for legal relation extraction in Chinese civil law. We use LexRel to evaluate state-of-the-art large language models (LLMs) on legal relation extractions, showing that current LLMs exhibit significant limitations in accurately identifying civil legal relations. Furthermore, we demonstrate that incorporating legal relations information leads to consistent performance gains on other downstream legal AI tasks.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sander Land, Yuval Pinter
ๆฅ็ๆ่ฆ
The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece package and adapters thereof. We bridge this gap between theory and practice by providing a clear guide to implementation and parameter choices. We also identify a simpler algorithm that accepts slightly higher training loss in exchange for improved compression.
๐
2025-12-14
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu et al.
MultimodalNatural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning, unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into latent think token to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
๐
2025-12-14
๐ AAAI 2026
๐ฅ Aheli Poddar, Saptarshi Sahoo, Sujata Ghosh
Natural Language Processing
ๆฅ็ๆ่ฆ
We study syllogistic reasoning in LLMs from the logical and natural language perspectives. In process, we explore fundamental reasoning capabilities of the LLMs and the direction this research is moving forward. To aid in our studies, we use 14 large language models and investigate their syllogistic reasoning capabilities in terms of symbolic inferences as well as natural language understanding. Even though this reasoning mechanism is not a uniform emergent property across LLMs, the perfect symbolic performances in certain models make us wonder whether LLMs are becoming more and more formal reasoning mechanisms, rather than making explicit the nuances of human reasoning.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yucan Guo, Saiping Guan, Miao Su, Zeya Zhao, Xiaolong Jin et al.
ๆฅ็ๆ่ฆ
Sparse Knowledge Graphs (KGs) are commonly encountered in real-world applications, where knowledge is often incomplete or limited. Sparse KG reasoning, the task of inferring missing knowledge over sparse KGs, is inherently challenging due to the scarcity of knowledge and the difficulty of capturing relational patterns in sparse scenarios. Among all sparse KG reasoning methods, path-based ones have attracted plenty of attention due to their interpretability. Existing path-based methods typically rely on computationally intensive random walks to collect paths, producing paths of variable quality. Additionally, these methods fail to leverage the structured nature of graphs by treating paths independently. To address these shortcomings, we propose a Structural and Probabilistic framework named StruProKGR, tailored for efficient and interpretable reasoning on sparse KGs. StruProKGR utilizes a distance-guided path collection mechanism to significantly reduce computational costs while exploring more relevant paths. It further enhances the reasoning process by incorporating structural information through probabilistic path aggregation, which prioritizes paths that reinforce each other. Extensive experiments on five sparse KG reasoning benchmarks reveal that StruProKGR surpasses existing path-based methods in both effectiveness and efficiency, providing an effective, efficient, and interpretable solution for sparse KG reasoning.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hong Su
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel at extracting common patterns from large-scale corpora, yet they struggle with rare, low-resource, or previously unseen scenarios-such as niche hardware deployment issues or irregular IoT device behaviors-because such cases are sparsely represented in training data. Moreover, LLMs rely primarily on implicit parametric memory, which limits their ability to explicitly acquire, recall, and refine methods, causing them to behave predominantly as intuition-driven predictors rather than deliberate, method-oriented learners. Inspired by how humans learn from rare experiences, this paper proposes a human-inspired learning framework that integrates two complementary mechanisms. The first, Obvious Record, explicitly stores cause--result (or question--solution) relationships as symbolic memory, enabling persistent learning even from single or infrequent encounters. The second, Maximum-Entropy Method Discovery, prioritizes and preserves methods with high semantic dissimilarity, allowing the system to capture diverse and underrepresented strategies that are typically overlooked by next-token prediction. Verification on a benchmark of 60 semantically diverse question--solution pairs demonstrates that the proposed entropy-guided approach achieves stronger coverage of unseen questions and significantly greater internal diversity than a random baseline, confirming its effectiveness in discovering more generalizable and human-inspired methods.
๐
2025-12-14
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Miriam Horovicz
Natural Language Processing
ๆฅ็ๆ่ฆ
LLM agents that use external tools can solve complex tasks, but understanding which tools actually contributed to a response remains a blind spot. No existing XAI methods address tool-level explanations. We introduce AgentSHAP, the first framework for explaining tool importance in LLM agents. AgentSHAP is model-agnostic: it treats the agent as a black box and works with any LLM (GPT, Claude, Llama, etc.) without needing access to internal weights or gradients. Using Monte Carlo Shapley values, AgentSHAP tests how an agent responds with different tool subsets and computes fair importance scores based on game theory. Our contributions are: (1) the first explainability method for agent tool attribution, grounded in Shapley values from game theory; (2) Monte Carlo sampling that reduces cost from O(2n) to practical levels; and (3) comprehensive experiments on API-Bank showing that AgentSHAP produces consistent scores across runs, correctly identifies which tools matter, and distinguishes relevant from irrelevant tools. AgentSHAP joins TokenSHAP (for tokens) and PixelSHAP (for image regions) to complete a family of Shapley-based XAI tools for modern generative AI. Code: https://github.com/GenAISHAP/TokenSHAP.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
While reinforcement learning have achieved impressive progress in language model reasoning, they are constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the intrinsic probabilities of LLMs generating reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose \textit{\b{Co}upled \b{V}ariational \b{R}einforcement \b{L}earning} (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4\% over the base model and achieves an additional 2.3\% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiming Zeng, Jinghan Cao, Zexin Li, Wanhao Yu, Zhankai Ye et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Instruction-based text editing is increasingly critical for real-world applications such as code editors (e.g., Cursor), but Large Language Models (LLMs) continue to struggle with this task. Unlike free-form generation, editing requires faithfully implementing user instructions while preserving unchanged content, as even minor unintended modifications can break functionality. Existing approaches treat editing as generic text generation, leading to two key failures: they struggle to faithfully align edits with diverse user intents, and they often over-edit unchanged regions. We propose HyperEdit to address both issues. First, we introduce hypernetwork-based dynamic adaptation that generates request-specific parameters, enabling the model to tailor its editing strategy to each instruction. Second, we develop difference-aware regularization that focuses supervision on modified spans, preventing over-editing while ensuring precise, minimal changes. HyperEdit achieves a 9%--30% relative improvement in BLEU on modified regions over state-of-the-art baselines, despite utilizing only 3B parameters.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Agniva Maiti, Manya Pandey, Murari Mandal
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving a 93.81\% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and a 0.75 F1-Macro on Named Entity Recognition, massively outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a Perplexity of 3.85, an order of magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.
๐
2025-12-14
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minghao Zhu, Zhihao Zhang, Anmol Sidhu, Keith Redmill
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework's effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.
๐
2025-12-14
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chris Latimer, Nicolรณ Boschi, Andrew Neeser, Chris Bartholomew, Gaurav Srivastava et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Agent memory has been touted as a dimension of growth for LLM-based applications, enabling agents that can accumulate experience, adapt across sessions, and move beyond single-shot question answering. The current generation of agent memory systems treats memory as an external layer that extracts salient snippets from conversations, stores them in vector or graph-based stores, and retrieves top-k items into the prompt of an otherwise stateless model. While these systems improve personalization and context carry-over, they still blur the line between evidence and inference, struggle to organize information over long horizons, and offer limited support for agents that must explain their reasoning. We present Hindsight, a memory architecture that treats agent memory as a structured, first-class substrate for reasoning by organizing it into four logical networks that distinguish world facts, agent experiences, synthesized entity summaries, and evolving beliefs. This framework supports three core operations -- retain, recall, and reflect -- that govern how information is added, accessed, and updated. Under this abstraction, a temporal, entity aware memory layer incrementally turns conversational streams into a structured, queryable memory bank, while a reflection layer reasons over this bank to produce answers and to update information in a traceable way. On key long-horizon conversational memory benchmarks like LongMemEval and LoCoMo, Hindsight with an open-source 20B model lifts overall accuracy from 39% to 83.6% over a full-context baseline with the same backbone and outperforms full context GPT-4o. Scaling the backbone further pushes Hindsight to 91.4% on LongMemEval and up to 89.61% on LoCoMo (vs. 75.78% for the strongest prior open system), consistently outperforming existing memory architectures on multi-session and open-domain questions.
๐
2025-12-14
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Sina Jani, Arman Heidari, Amirmohammad Anvari, Zahra Rahimi
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The rapid acceleration of scientific publishing has created substantial challenges for researchers attempting to discover, contextualize, and interpret relevant literature. Traditional keyword-based search systems provide limited semantic understanding, while existing AI-driven tools typically focus on isolated tasks such as retrieval, clustering, or bibliometric visualization. This paper presents an integrated system for scientific literature exploration that combines large-scale data acquisition, hybrid retrieval, semantic topic modeling, and heterogeneous knowledge graph construction. The system builds a comprehensive corpus by merging full-text data from arXiv with structured metadata from OpenAlex. A hybrid retrieval architecture fuses BM25 lexical search with embedding-based semantic search using Reciprocal Rank Fusion. Topic modeling is performed on retrieved results using BERTopic or non-negative matrix factorization depending on computational resources. A knowledge graph unifies papers, authors, institutions, countries, and extracted topics into an interpretable structure. The system provides a multi-layered exploration environment that reveals not only relevant publications but also the conceptual and relational landscape surrounding a query. Evaluation across multiple queries demonstrates improvements in retrieval relevance, topic coherence, and interpretability. The proposed framework contributes an extensible foundation for AI-assisted scientific discovery.
๐
2025-12-14
๐ KDD 2026
๐ฅ Dezhi Yi, Wei Guo, Wenyang Cui, Wenxuan He, Huifeng Guo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Sequential recommendation aims to model users' evolving preferences based on their historical interactions. Recent advances leverage Transformer-based architectures to capture global dependencies, but existing methods often suffer from high computational overhead, primarily due to discontinuous memory access in temporal encoding and dense attention over long sequences. To address these limitations, we propose FuXi-$ฮณ$, a novel sequential recommendation framework that improves both effectiveness and efficiency through principled architectural design. FuXi-$ฮณ$ adopts a decoder-only Transformer structure and introduces two key innovations: (1) An exponential-power temporal encoder that encodes relative temporal intervals using a tunable exponential decay function inspired by the Ebbinghaus forgetting curve. This encoder enables flexible modeling of both short-term and long-term preferences while maintaining high efficiency through continuous memory access and pure matrix operations. (2) A diagonal-sparse positional mechanism that prunes low-contribution attention blocks using a diagonal-sliding strategy guided by the persymmetry of Toeplitz matrix. Extensive experiments on four real-world datasets demonstrate that FuXi-$ฮณ$ achieves state-of-the-art performance in recommendation quality, while accelerating training by up to 4.74$\times$ and inference by up to 6.18$\times$, making it a practical and scalable solution for long-sequence recommendation. Our code is available at https://github.com/Yeedzhi/FuXi-gamma.
๐
2025-12-14
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wonseok Choi, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Dong-Ju Jeong et al.
ๆฅ็ๆ่ฆ
Instance-level image retrieval aims to find images containing the same object as a given query, despite variations in size, position, or appearance. To address this challenging task, we propose Patchify, a simple yet effective patch-wise retrieval framework that offers high performance, scalability, and interpretability without requiring fine-tuning. Patchify divides each database image into a small number of structured patches and performs retrieval by comparing these local features with a global query descriptor, enabling accurate and spatially grounded matching. To assess not just retrieval accuracy but also spatial correctness, we introduce LocScore, a localization-aware metric that quantifies whether the retrieved region aligns with the target object. This makes LocScore a valuable diagnostic tool for understanding and improving retrieval behavior. We conduct extensive experiments across multiple benchmarks, backbones, and region selection strategies, showing that Patchify outperforms global methods and complements state-of-the-art reranking pipelines. Furthermore, we apply Product Quantization for efficient large-scale retrieval and highlight the importance of using informative features during compression, which significantly boosts performance. Project website: https://wons20k.github.io/PatchwiseRetrieval/
๐
2025-12-14
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minghao Zhu, Zhihao Zhang, Anmol Sidhu, Keith Redmill
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework's effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.
๐
2025-12-14
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangzhong Liu, Jiajie Zhang, Hao Shen
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
In automotive sensor fusion systems, smart sensors and Vehicle-to-Everything (V2X) modules are commonly utilized. Sensor data from these systems are typically available only as processed object lists rather than raw sensor data from traditional sensors. Instead of processing other raw data separately and then fusing them at the object level, we propose an end-to-end cross-level fusion concept with Transformer, which integrates highly abstract object list information with raw camera images for 3D object detection. Object lists are fed into a Transformer as denoising queries and propagated together with learnable queries through the latter feature aggregation process. Additionally, a deformable Gaussian mask, derived from the positional and size dimensional priors from the object lists, is explicitly integrated into the Transformer decoder. This directs attention toward the target area of interest and accelerates model training convergence. Furthermore, as there is no public dataset containing object lists as a standalone modality, we propose an approach to generate pseudo object lists from ground-truth bounding boxes by simulating state noise and false positives and negatives. As the first work to conduct cross-level fusion, our approach shows substantial performance improvements over the vision-based baseline on the nuScenes dataset. It demonstrates its generalization capability over diverse noise levels of simulated object lists and real detectors.
๐
2025-12-14
๐ ArXiv Preprint (Robotics)
๐ฅ Patrick Kostelac, Xuerui Wang, Anahita Jamshidnejad
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Modern engineering systems, such as autonomous vehicles, flexible robotics, and intelligent aerospace platforms, require controllers that are robust to uncertainties, adaptive to environmental changes, and safety-aware under real-time constraints. RL offers powerful data-driven adaptability for systems with nonlinear dynamics that interact with uncertain environments. RL, however, lacks built-in mechanisms for dynamic constraint satisfaction during exploration. MPC offers structured constraint handling and robustness, but its reliance on accurate models and computationally demanding online optimization may pose significant challenges. This paper proposes an integrated MPC-RL framework that combines stability and safety guarantees of MPC with the adaptability of RL. During training, MPC defines safe control bounds that guide the RL component and that enable constraint-aware policy learning. At deployment, the learned policy operates in real time with a lightweight safety filter based on Lipschitz continuity to ensure constraint satisfaction without heavy online optimizations. The approach, which is validated on a nonlinear aeroelastic wing system, demonstrates improved disturbance rejection, reduced actuator effort, and robust performance under turbulence. The architecture generalizes to other domains with structured nonlinearities and bounded disturbances, offering a scalable solution for safe artificial-intelligence-driven control in engineering applications.
๐
2025-12-14
๐ ArXiv Preprint (Robotics)
๐ฅ Kuan Fang, Yuxin Chen, Xinghao Zhu, Farzad Niroui, Lingfeng Sun et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
We present SAGA, a versatile and adaptive framework for visuomotor control that can generalize across various environments, task objectives, and user specifications. To efficiently learn such capability, our key idea is to disentangle high-level semantic intent from low-level visuomotor control by explicitly grounding task objectives in the observed environment. Using an affordance-based task representation, we express diverse and complex behaviors in a unified, structured form. By leveraging multimodal foundation models, SAGA grounds the proposed task representation to the robot's visual observation as 3D affordance heatmaps, highlighting task-relevant entities while abstracting away spurious appearance variations that would hinder generalization. These grounded affordances enable us to effectively train a conditional policy on multi-task demonstration data for whole-body control. In a unified framework, SAGA can solve tasks specified in different forms, including language instructions, selected points, and example demonstrations, enabling both zero-shot execution and few-shot adaptation. We instantiate SAGA on a quadrupedal manipulator and conduct extensive experiments across eleven real-world tasks. SAGA consistently outperforms end-to-end and modular baselines by substantial margins. Together, these results demonstrate that structured affordance grounding offers a scalable and effective pathway toward generalist mobile manipulation.
๐
2025-12-14
๐ ArXiv Preprint (Robotics)
๐ฅ Mizuho Aoki, Kohei Honda, Yasuhiro Yoshimura, Takeshi Ishita, Ryo Yonetani
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
This paper presents Vision-Language Global Localization (VLG-Loc), a novel global localization method that uses human-readable labeled footprint maps containing only names and areas of distinctive visual landmarks in an environment. While humans naturally localize themselves using such maps, translating this capability to robotic systems remains highly challenging due to the difficulty of establishing correspondences between observed landmarks and those in the map without geometric and appearance details. To address this challenge, VLG-Loc leverages a vision-language model (VLM) to search the robot's multi-directional image observations for the landmarks noted in the map. The method then identifies robot poses within a Monte Carlo localization framework, where the found landmarks are used to evaluate the likelihood of each pose hypothesis. Experimental validation in simulated and real-world retail environments demonstrates superior robustness compared to existing scan-based methods, particularly under environmental changes. Further improvements are achieved through the probabilistic fusion of visual and scan-based localization.
๐
2025-12-14
๐ ArXiv Preprint (eess.SY)
๐ฅ Haochong Chen, Xincheng Cao, Levent Guvenc, Bilin Aksun-Guvenc
Robotics
ๆฅ็ๆ่ฆ
Ensuring the safety of Vulnerable Road Users (VRUs) is a critical challenge in the development of advanced autonomous driving systems in smart cities. Among vulnerable road users, bicyclists present unique characteristics that make their safety both critical and also manageable. Vehicles often travel at significantly higher relative speeds when interacting with bicyclists as compared to their interactions with pedestrians which makes collision avoidance system design for bicyclist safety more challenging. Yet, bicyclist movements are generally more predictable and governed by clear traffic rules as compared to the sudden and sometimes erratic pedestrian motion, offering opportunities for model-based control strategies. To address bicyclist safety in complex traffic environments, this study proposes and develops a High Order Control Lyapunov Function High Order Control Barrier Function Quadratic Programming (HOCLF HOCBF QP) control framework. Through this framework, CLFs constraints guarantee system stability so that the vehicle can track its reference trajectory, whereas CBFs constraints ensure system safety by letting vehicle avoiding potential collisions region with surrounding obstacles. Then by solving a QP problem, an optimal control command that simultaneously satisfies stability and safety requirements can be calculated. Three key bicyclist crash scenarios recorded in the Fatality Analysis Reporting System (FARS) are recreated and used to comprehensively evaluate the proposed autonomous driving bicyclist safety control strategy in a simulation study. Simulation results demonstrate that the HOCLF HOCBF QP controller can help the vehicle perform robust, and collision-free maneuvers, highlighting its potential for improving bicyclist safety in complex traffic environments.
๐
2025-12-14
๐ ArXiv Preprint (Robotics)
๐ฅ Tarik Viehmann, Daniel Swoboda, Samridhi Kalra, Himanshu Grover, Gerhard Lakemeyer
MultimodalRobotics
ๆฅ็ๆ่ฆ
CLIPS is a rule-based programming language for building knowledge-driven applications, well suited for the complex task of coordinating autonomous robots. Inspired by the CLIPS-Executive originally developed for the lesser known Fawkes robotics framework, we present an Integration of CLIPS into the ROS ecosystem. Additionally, we show the flexibility of CLIPS by describing a PDDL-based planning framework integration.
๐
2025-12-14
๐ ArXiv Preprint (Robotics)
๐ฅ Mattia Catellani, Marta Gabbi, Lorenzo Sabattini
Robotics
ๆฅ็ๆ่ฆ
We address the problem of coordinating a team of robots to cover an unknown environment while ensuring safe operation and avoiding collisions with non-cooperative agents. Traditional coverage strategies often rely on simplified assumptions, such as known or convex environments and static density functions, and struggle to adapt to real-world scenarios, especially when humans are involved. In this work, we propose a human-aware coverage framework based on Model Predictive Control (MPC), namely HMPCC, where human motion predictions are integrated into the planning process. By anticipating human trajectories within the MPC horizon, robots can proactively coordinate their actions %avoid redundant exploration, and adapt to dynamic conditions. The environment is modeled as a Gaussian Mixture Model (GMM), representing regions of interest. Team members operate in a fully decentralized manner, without relying on explicit communication, an essential feature in hostile or communication-limited scenarios. Our results show that human trajectory forecasting enables more efficient and adaptive coverage, improving coordination between human and robotic agents.
๐
2025-12-14
๐ ArXiv Preprint (Robotics)
๐ฅ Zhewen Zheng, Wenjing Cao, Hongkang Yu, Mo Chen, Takashi Suzuki
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Parameter tuning in real-world experiments is constrained by the limited evaluation budget available on hardware. The path-following controller studied in this paper reflects a typical situation in nonlinear geometric controller, where multiple gains influence the dynamics through coupled nonlinear terms. Such interdependence makes manual tuning inefficient and unlikely to yield satisfactory performance within a practical number of trials. To address this challenge, we propose a Bayesian optimization (BO) framework that treats the closed-loop system as a black box and selects controller gains using a Gaussian-process surrogate. BO offers model-free exploration, quantified uncertainty, and data-efficient search, making it well suited for tuning tasks where each evaluation is costly. The framework is implemented on Honda's AI-Formula three-wheeled robot and assessed through repeated full-lap experiments on a fixed test track. The results show that BO improves controller performance within 32 trials, including 15 warm-start initial evaluations, indicating that it can efficiently locate high-performing regions of the parameter space under real-world conditions. These findings demonstrate that BO provides a practical, reliable, and data-efficient tuning approach for nonlinear path-following controllers on real robotic platforms.
๐
2025-12-14
๐ ArXiv Preprint (Robotics)
๐ฅ Rishit Agnihotri, Sandeep Kumar Sharma
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Urban Air Mobility (UAM) poses unprecedented traffic coordination challenges, especially with increasing UAV densities in dense urban corridors. This paper introduces a mathematical model using a control algorithm to optimize an Edge AI-driven decentralized swarm architecture for intelligent conflict resolution, enabling real-time decision-making with low latency. Using lightweight neural networks, the system leverages edge nodes to perform distributed conflict detection and resolution. A simulation platform was developed to evaluate the scheme under various UAV densities. Results indicate that the conflict resolution time is dramatically minimized up to 3.8 times faster, and accuracy is enhanced compared to traditional centralized control models. The proposed architecture is highly promising for scalable, efficient, and safe aerial traffic management in future UAM systems.
๐
2025-12-14
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zihan Wang, Seungjun Lee, Guangzhao Dai, Gim Hee Lee
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Embodied agents face a critical dilemma that end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. To bridge this gap, we propose the Dynamic 3D Vision-Language-Planning Model (D3D-VLP). Our model introduces two key innovations: 1) A Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) A Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive and partially-annotated hybrid data. This allows different CoT components to mutually reinforce and implicitly supervise each other. To this end, we construct a large-scale dataset with 10M hybrid samples from 5K real scans and 20K synthetic scenes that are compatible with online learning methods such as RL and DAgger. Our D3D-VLP achieves state-of-the-art results on multiple benchmarks, including Vision-and-Language Navigation (R2R-CE, REVERIE-CE, NavRAG-CE), Object-goal Navigation (HM3D-OVON), and Task-oriented Sequential Grounding and Navigation (SG3D). Real-world mobile manipulation experiments further validate the effectiveness.
๐
2025-12-12
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Longtao Xiao, Haolin Zhang, Guohao Cai, Jieming Zhu, Yifan Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recently, transformer-based generative recommendation has garnered significant attention for user behavior modeling. However, it often requires discretizing items into multi-code representations (e.g., typically four code tokens or more), which sharply increases the length of the original item sequence. This expansion poses challenges to transformer-based models for modeling user behavior sequences with inherent noises, since they tend to overallocate attention to irrelevant or noisy context. To mitigate this issue, we propose FAIR, the first generative recommendation framework with focused attention, which enhances attention scores to relevant context while suppressing those to irrelevant ones. Specifically, we propose (1) a focused attention mechanism integrated into the standard Transformer, which learns two separate sets of Q and K attention weights and computes their difference as the final attention scores to eliminate attention noise while focusing on relevant contexts; (2) a noise-robustness objective, which encourages the model to maintain stable attention patterns under stochastic perturbations, preventing undesirable shifts toward irrelevant context due to noise; and (3) a mutual information maximization objective, which guides the model to identify contexts that are most informative for next-item prediction. We validate the effectiveness of FAIR on four public benchmarks, demonstrating its superior performance compared to existing methods.
๐
2025-12-12
๐ ArXiv Preprint (Computer Vision)
๐ฅ Antonio Guillen-Perez
MultimodalRobotics
ๆฅ็ๆ่ฆ
The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a "System 2" inference-time alignment strategy, utilizing a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% ccompared to the best single scout models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.
๐
2025-12-11
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Lingfeng Liu, Yixin Song, Dazhong Shen, Bing Yin, Hao Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Popularity bias fundamentally undermines the personalization capabilities of collaborative filtering (CF) models, causing them to disproportionately recommend popular items while neglecting users' genuine preferences for niche content. While existing approaches treat this as an external confounding factor, we reveal that popularity bias is an intrinsic geometric artifact of Bayesian Pairwise Ranking (BPR) optimization in CF models. Through rigorous mathematical analysis, we prove that BPR systematically organizes item embeddings along a dominant "popularity direction" where embedding magnitudes directly correlate with interaction frequency. This geometric distortion forces user embeddings to simultaneously handle two conflicting tasks-expressing genuine preference and calibrating against global popularity-trapping them in suboptimal configurations that favor popular items regardless of individual tastes. We propose Directional Decomposition and Correction (DDC), a universally applicable framework that surgically corrects this embedding geometry through asymmetric directional updates. DDC guides positive interactions along personalized preference directions while steering negative interactions away from the global popularity direction, disentangling preference from popularity at the geometric source. Extensive experiments across multiple BPR-based architectures demonstrate that DDC significantly outperforms state-of-the-art debiasing methods, reducing training loss to less than 5% of heavily-tuned baselines while achieving superior recommendation quality and fairness. Code is available in https://github.com/LingFeng-Liu-AI/DDC.
๐
2025-12-11
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Lingfeng Liu, Yixin Song, Dazhong Shen, Bing Yin, Hao Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Popularity bias fundamentally undermines the personalization capabilities of collaborative filtering (CF) models, causing them to disproportionately recommend popular items while neglecting users' genuine preferences for niche content. While existing approaches treat this as an external confounding factor, we reveal that popularity bias is an intrinsic geometric artifact of Bayesian Pairwise Ranking (BPR) optimization in CF models. Through rigorous mathematical analysis, we prove that BPR systematically organizes item embeddings along a dominant "popularity direction" where embedding magnitudes directly correlate with interaction frequency. This geometric distortion forces user embeddings to simultaneously handle two conflicting tasks-expressing genuine preference and calibrating against global popularity-trapping them in suboptimal configurations that favor popular items regardless of individual tastes. We propose Directional Decomposition and Correction (DDC), a universally applicable framework that surgically corrects this embedding geometry through asymmetric directional updates. DDC guides positive interactions along personalized preference directions while steering negative interactions away from the global popularity direction, disentangling preference from popularity at the geometric source. Extensive experiments across multiple BPR-based architectures demonstrate that DDC significantly outperforms state-of-the-art debiasing methods, reducing training loss to less than 5% of heavily-tuned baselines while achieving superior recommendation quality and fairness. Code is available in https://github.com/LingFeng-Liu-AI/DDC.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace, preventing it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables Vision-Language-Action (VLA) system to learn from low-cost action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale the benefits. To execute the desired locomotion commands more precisely, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of its kind enabling large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen et al.
Computer Vision
ๆฅ็ๆ่ฆ
We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our codes and datasets is released at https://idea-research.github.io/SceneMaker/.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaona Zhou, Yingyan Zeng, Ran Jin, Ismini Lourentzou
Machine Learning
ๆฅ็ๆ่ฆ
The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training are therefore critical decisions, yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both dataset and group (e.g., collections, institutions) levels, enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy, while requiring significantly fewer exploration steps. Ablations show DaSH is robust to low-resource settings and lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv et al.
Robotics
ๆฅ็ๆ่ฆ
Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin et al.
Robotics
ๆฅ็ๆ่ฆ
Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multiple subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko et al.
Multimodal
ๆฅ็ๆ่ฆ
Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei et al.
Robotics
ๆฅ็ๆ่ฆ
Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33\% in multiview NVS LLFF dataset, 60\% in dynamic NVS Neural 3D Video benchmark, 20\% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work seeks further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(ฮฑx + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Matt Wilson
ๆฅ็ๆ่ฆ
We establish a precise correspondence between decision-making agents in partially observable Markov decision processes (POMDPs) and one-input process functions, the classical limit of higher-order quantum operations. In this identification an agent's policy and memory update combine into a process function w that interacts with a POMDP environment via the link product. This suggests a dual interpretation: in the physics view, the process function acts as the environment into which local operations (agent interventions) are inserted, whereas in the AI view it encodes the agent and the inserted functions represent environments. We extend this perspective to multi-agent systems by identifying observation-independent decentralized POMDPs as natural domains for multi-input process functions.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kristina Korotkova, Aleksandr Katrutsa
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The construction of adversarial attacks for neural networks appears to be a crucial challenge for their deployment in various services. To estimate the adversarial robustness of a neural network, a fast and efficient approach is needed to construct adversarial attacks. Since the formalization of adversarial attack construction involves solving a specific optimization problem, we consider the problem of constructing an efficient and effective adversarial attack from a numerical optimization perspective. Specifically, we suggest utilizing advanced projection-free methods, known as modified Frank-Wolfe methods, to construct white-box adversarial attacks on the given input data. We perform a theoretical and numerical evaluation of these methods and compare them with standard approaches based on projection operations or geometrical intuition. Numerical experiments are performed on the MNIST and CIFAR-10 datasets, utilizing a multiclass logistic regression model, the convolutional neural networks (CNNs), and the Vision Transformer (ViT).
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qiyang Li, Seohong Park, Sergey Levine
ๆฅ็ๆ่ฆ
Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a self-bootstrapping mechanism is prone to bootstrapping bias, where the errors in the value targets accumulate across steps and result in biased value estimates. Recent work has proposed to use chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, speeding up value backup. However, extracting policies from chunked critics is challenging: policies must output the entire action chunk open-loop, which can be sub-optimal for environments that require policy reactivity and also challenging to model especially when the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action chunking policies for long action chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned tasks and show that it reliably outperforms prior methods. Code: github.com/ColinQiyangLi/dqc.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Max Zimmer, Christophe Roux, Moritz Wagner, Deborah Hendrych, Sebastian Pokutta
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The resource requirements of Neural Networks can be significantly reduced through pruning -- the removal of seemingly less important parameters. However, with the rise of Large Language Models (LLMs), full retraining to recover pruning-induced performance degradation is often prohibitive and classical approaches such as global magnitude pruning are suboptimal on Transformer architectures. State-of-the-art methods hence solve a layer-wise mask selection problem, the problem of finding a pruning mask which minimizes the per-layer pruning error on a small set of calibration data. Exactly solving this problem to optimality using Integer Programming (IP) solvers is computationally infeasible due to its combinatorial nature and the size of the search space, and existing approaches therefore rely on approximations or heuristics. In this work, we demonstrate that the mask selection problem can be made drastically more tractable at LLM scale. To that end, we decouple the rows by enforcing equal sparsity levels per row. This allows us to derive optimal 1-swaps (exchanging one kept and one pruned weight) that can be computed efficiently using the Gram matrix of the calibration data. Using these observations, we propose a tractable and simple 1-swap algorithm that warm starts from any pruning mask, runs efficiently on GPUs at LLM scale, and is essentially hyperparameter-free. We demonstrate that our approach reduces per-layer pruning error by up to 60% over Wanda (Sun et al., 2023) and consistently improves perplexity and zero-shot accuracy across state-of-the-art GPT architectures.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, A. B. Siddique
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Circuit discovery aims to identify minimal subnetworks that are responsible for specific behaviors in large language models (LLMs). Existing approaches primarily rely on iterative edge pruning, which is computationally expensive and limited to coarse-grained units such as attention heads or MLP blocks, overlooking finer structures like individual neurons. We propose a node-level pruning framework for circuit discovery that addresses both scalability and granularity limitations. Our method introduces learnable masks across multiple levels of granularity, from entire blocks to individual neurons, within a unified optimization objective. Granularity-specific sparsity penalties guide the pruning process, allowing a comprehensive compression in a single fine-tuning run. Empirically, our approach identifies circuits that are smaller in nodes than those discovered by prior methods; moreover, we demonstrate that many neurons deemed important by coarse methods are actually irrelevant, while still maintaining task performance. Furthermore, our method has a significantly lower memory footprint, 5-10x, as it does not require keeping intermediate activations in the memory to work.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Lijie Ding, Janell Thomson, Jon Taylor, Changwoo Do
Natural Language Processing
ๆฅ็ๆ่ฆ
We explore how large language models (LLMs) can enhance the proposal selection process at large user facilities, offering a scalable, consistent, and cost-effective alternative to traditional human review. Proposal selection depends on assessing the relative strength among submitted proposals; however, traditional human scoring often suffers from weak inter-proposal correlations and is subject to reviewer bias and inconsistency. A pairwise preference-based approach is logically superior, providing a more rigorous and internally consistent basis for ranking, but its quadratic workload makes it impractical for human reviewers. We address this limitation using LLMs. Leveraging the uniquely well-curated proposals and publication records from three beamlines at the Spallation Neutron Source (SNS), Oak Ridge National Laboratory (ORNL), we show that the LLM rankings correlate strongly with the human rankings (Spearman $ฯ\simeq 0.2-0.8$, improving to $\geq 0.5$ after 10\% outlier removal). Moreover, LLM performance is no worse than that of human reviewers in identifying proposals with high publication potential, while costing over two orders of magnitude less. Beyond ranking, LLMs enable advanced analyses that are challenging for humans, such as quantitative assessment of proposal similarity via embedding models, which provides information crucial for review committees.
๐
2025-12-11
๐ NeurIPS 2025
๐ฅ Ruslan Gokhman
Natural Language Processing
ๆฅ็ๆ่ฆ
We study long-horizon exogenous-only temperature forecasting - a challenging univariate setting where only the past values of the indoor temperature are used for prediction - using linear and Transformer-family models. We evaluate Linear, NLinear, DLinear, Transformer, Informer, and Autoformer under standardized train, validation, and test splits. Results show that linear baselines (Linear, NLinear, DLinear) consistently outperform more complex Transformer-family architectures, with DLinear achieving the best overall accuracy across all splits. These findings highlight that carefully designed linear models remain strong baselines for time series forecasting in challenging exogenous-only settings.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao et al.
Multimodal
ๆฅ็ๆ่ฆ
Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chirag Modi, Jiequn Han, Eric Vanden-Eijnden, Joan Bruna
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Transport-based methods have emerged as a leading paradigm for building generative models from large, clean datasets. However, in many scientific and engineering domains, clean data are often unavailable: instead, we only observe measurements corrupted through a noisy, ill-conditioned channel. A generative model for the original data thus requires solving an inverse problem at the level of distributions. In this work, we introduce a novel approach to this task based on Stochastic Interpolants: we iteratively update a transport map between corrupted and clean data samples using only access to the corrupted dataset as well as black box access to the corruption channel. Under appropriate conditions, this iterative procedure converges towards a self-consistent transport map that effectively inverts the corruption channel, thus enabling a generative model for the clean data. We refer to the resulting method as the self-consistent stochastic interpolant (SCSI). It (i) is computationally efficient compared to variational alternatives, (ii) highly flexible, handling arbitrary nonlinear forward models with only black-box access, and (iii) enjoys theoretical guarantees. We demonstrate superior performance on inverse problems in natural image processing and scientific reconstruction, and establish convergence guarantees of the scheme under appropriate assumptions.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mumuksh Tayal, Manan Tayal, Aditya Singh, Shishir Kolathaya, Ravi Prakash
Robotics
ๆฅ็ๆ่ฆ
Ensuring safety in autonomous systems requires controllers that satisfy hard, state-wise constraints without relying on online interaction. While existing Safe Offline RL methods typically enforce soft expected-cost constraints, they do not guarantee forward invariance. Conversely, Control Barrier Functions (CBFs) provide rigorous safety guarantees but usually depend on expert-designed barrier functions or full knowledge of the system dynamics. We introduce Value-Guided Offline Control Barrier Functions (V-OCBF), a framework that learns a neural CBF entirely from offline demonstrations. Unlike prior approaches, V-OCBF does not assume access to the dynamics model; instead, it derives a recursive finite-difference barrier update, enabling model-free learning of a barrier that propagates safety information over time. Moreover, V-OCBF incorporates an expectile-based objective that avoids querying the barrier on out-of-distribution actions and restricts updates to the dataset-supported action set. The learned barrier is then used with a Quadratic Program (QP) formulation to synthesize real-time safe control. Across multiple case studies, V-OCBF yields substantially fewer safety violations than baseline methods while maintaining strong task performance, highlighting its scalability for offline synthesis of safety-critical controllers without online interaction or hand-engineered barriers.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Leijie Wang, Otilia Stretcu, Wei Qiao, Thomas Denby, Krishnamurthy Viswanathan et al.
ๆฅ็ๆ่ฆ
From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called "Agile Deliberation" that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user's evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.
๐
2025-12-11
๐ Submitted to JMLR, under review
๐ฅ Brian P. Powell, Jordan A. Caraballo-Vega, Mark L. Carroll, Thomas Maxwell, Andrew Ptak et al.
Machine Learning
ๆฅ็ๆ่ฆ
We report the discovery that binary encoding allows neural networks to extrapolate periodic functions beyond their training bounds. We introduce Normalized Base-2 Encoding (NB2E) as a method for encoding continuous numerical values and demonstrate that, using this input encoding, vanilla multi-layer perceptrons (MLP) successfully extrapolate diverse periodic signals without prior knowledge of their functional form. Internal activation analysis reveals that NB2E induces bit-phase representations, enabling MLPs to learn and extrapolate signal structure independently of position.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wang Lu, Yao Zhu, Jindong Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Sensor-based human activity recognition (HAR) mines activity patterns from the time-series sensory data. In realistic scenarios, variations across individuals, devices, environments, and time introduce significant distributional shifts for the same activities. Recent efforts attempt to solve this challenge by applying or adapting existing out-of-distribution (OOD) algorithms, but only in certain distribution shift scenarios (e.g., cross-device or cross-position), lacking comprehensive insights on the effectiveness of these algorithms. For instance, is OOD necessary to HAR? Which OOD algorithm performs the best? In this paper, we fill this gap by proposing HAROOD, a comprehensive benchmark for HAR in OOD settings. We define 4 OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time, and build a testbed covering 6 datasets, 16 comparative methods (implemented with CNN-based and Transformer-based architectures), and two model selection protocols. Then, we conduct extensive experiments and present several findings for future research, e.g., no single method consistently outperforms others, highlighting substantial opportunity for advancement. Our codebase is highly modular and easy to extend for new datasets, algorithms, comparisons, and analysis, with the hope to facilitate the research in OOD-based HAR. Our implementation is released and can be found at https://github.com/AIFrontierLab/HAROOD.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its \textit{global} \revision{semantic} information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e. pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of \emph{spatial} information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in $<$4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT etc). %, etc. Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michael Schlee, Christoph Weisser, Timo Kivimรคki, Melchizedek Mashiku, Benjamin Saefken
Natural Language Processing
ๆฅ็ๆ่ฆ
LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone's embeddings with the LLM-derived per-class scores -- obtained through structured prompt-engineering strategies -- and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains -- achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification -- while enabling practical trade-offs between accuracy, latency, and cost.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
๐
2025-12-11
๐ ArXiv Preprint (cs.NI)
๐ฅ F. Taghiyev, A. Aslanbayli
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
This paper presents the design and prototype implementation of a natural language interface for configuring enterprise firewalls. The framework allows administrators to express access control policies in plain language, which are then translated into vendor specific configurations. A compact schema bound intermediate representation separates human intent from device syntax and in the current prototype compiles to Palo Alto PAN OS command line configuration while remaining extensible to other platforms. Large language models are used only as assistive parsers that generate typed intermediate representation objects, while compilation and enforcement remain deterministic. The prototype integrates three validation layers, namely a static linter that checks structural and vendor specific constraints, a safety gate that blocks overly permissive rules such as any to any allows, and a Batfish based simulator that validates configuration syntax and referential integrity against a synthetic device model. The paper describes the architecture, implementation, and test methodology on synthetic network context datasets and discusses how this approach can evolve into a scalable auditable and human centered workflow for firewall policy management.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Moshe Lahmy, Roi Yozevitch
Robotics
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) systems often fail on multi-hop queries when the initial retrieval misses a bridge fact. Prior corrective approaches, such as Self-RAG, CRAG, and Adaptive-$k$, typically address this by \textit{adding} more context or pruning existing lists. However, simply expanding the context window often leads to \textbf{context dilution}, where distractors crowd out relevant information. We propose \textbf{SEAL-RAG}, a training-free controller that adopts a \textbf{``replace, don't expand''} strategy to fight context dilution under a fixed retrieval depth $k$. SEAL executes a (\textbf{S}earch $\rightarrow$ \textbf{E}xtract $\rightarrow$ \textbf{A}ssess $\rightarrow$ \textbf{L}oop) cycle: it performs on-the-fly, entity-anchored extraction to build a live \textit{gap specification} (missing entities/relations), triggers targeted micro-queries, and uses \textit{entity-first ranking} to actively swap out distractors for gap-closing evidence. We evaluate SEAL-RAG against faithful re-implementations of Basic RAG, CRAG, Self-RAG, and Adaptive-$k$ in a shared environment on \textbf{HotpotQA} and \textbf{2WikiMultiHopQA}. On HotpotQA ($k=3$), SEAL improves answer correctness by \textbf{+3--13 pp} and evidence precision by \textbf{+12--18 pp} over Self-RAG. On 2WikiMultiHopQA ($k=5$), it outperforms Adaptive-$k$ by \textbf{+8.0 pp} in accuracy and maintains \textbf{96\%} evidence precision compared to 22\% for CRAG. These gains are statistically significant ($p<0.001$). By enforcing fixed-$k$ replacement, SEAL yields a predictable cost profile while ensuring the top-$k$ slots are optimized for precision rather than mere breadth. We release our code and data at https://github.com/mosherino/SEAL-RAG.
๐
2025-12-11
๐ ArXiv Preprint (physics.ed-ph)
๐ฅ Holger Maus, Paul Tschisgale, Fabian Kieser, Stefan Petersen, Peter Wulff
Natural Language Processing
ๆฅ็ๆ่ฆ
Generative AI offers new opportunities for individualized and adaptive learning, particularly through large language model (LLM)-based feedback systems. While LLMs can produce effective feedback for relatively straightforward conceptual tasks, delivering high-quality feedback for tasks that require advanced domain expertise, such as physics problem solving, remains a substantial challenge. This study presents the design of an LLM-based feedback system for physics problem solving grounded in evidence-centered design (ECD) and evaluates its performance within the German Physics Olympiad. Participants assessed the usefulness and accuracy of the generated feedback, which was generally perceived as useful and highly accurate. However, an in-depth analysis revealed that the feedback contained factual errors in 20% of cases; errors that often went unnoticed by the students. We discuss the risks associated with uncritical reliance on LLM-based feedback systems and outline potential directions for generating more adaptive and reliable LLM-based feedback in the future.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kevin Glocker, Kรคtriin Kukk, Romina Oji, Marcel Bollmann, Marco Kuhlmann et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Achieving high-performing language models which include medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.
๐
2025-12-11
๐ 8 pages, 3 figures and 3 tables, under submission
๐ฅ Kaihua Ding
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid adoption of generative AI has undermined traditional modular assessments in computing education, creating a disconnect between academic evaluation and industry practice. This paper presents a theoretically grounded framework for designing AI-resilient assessments, supported by formal analysis and multi-year empirical validation.
We make three contributions. First, we establish two theoretical results: (1) assessments composed of interconnected problems, where outputs feed into subsequent stages, are more AI-resilient than modular assessments because current language models struggle with sustained multi-step reasoning and context; and (2) semi-structured problems with deterministic success criteria provide more reliable measures of student competency than fully open-ended projects, which allow AI systems to default to familiar solution patterns. These results challenge common policy and institutional guidance that promotes open-ended assessments as the primary safeguard for academic integrity.
Second, we validate these results using data from four university data science courses (N = 138). While students achieve near-perfect scores on AI-assisted modular homework, performance drops by roughly 30 percentage points on proctored exams, indicating substantial AI score inflation. Interconnected projects remain strongly correlated with modular assessments, suggesting they measure the same underlying skills while resisting AI misuse. Proctored exams show weaker alignment, implying they may assess test-taking ability rather than intended learning outcomes.
Third, we translate these findings into a practical assessment design framework. The proposed approach enables educators to create assessments that promote integrative thinking, reflect real-world AI-augmented workflows, and naturally resist trivial delegation to generative AI, thereby helping restore academic integrity.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the \textbf{O}utcome-based \textbf{P}rocess \textbf{V}erifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out \textsc{\thisbench}, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic dataset, closely align with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2\% to 73.3\% on AIME2025 as the compute budget scales.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lin Du, Lu Bai, Jincheng Li, Lixin Cui, Hangyuan Du et al.
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have emerged as a dominant paradigm for graph classification. Specifically, most existing GNNs mainly rely on the message passing strategy between neighbor nodes, where the expressivity is limited by the 1-dimensional Weisfeiler-Lehman (1-WL) test. Although a number of k-WL-based GNNs have been proposed to overcome this limitation, their computational cost increases rapidly with k, significantly restricting the practical applicability. Moreover, since the k-WL models mainly operate on node tuples, these k-WL-based GNNs cannot retain fine-grained node- or edge-level semantics required by attribution methods (e.g., Integrated Gradients), leading to the less interpretable problem. To overcome the above shortcomings, in this paper, we propose a novel Line Graph Aggregation Network (LGAN), that constructs a line graph from the induced subgraph centered at each node to perform the higher-order aggregation. We theoretically prove that the LGAN not only possesses the greater expressive power than the 2-WL under injective aggregation assumptions, but also has lower time complexity. Empirical evaluations on benchmarks demonstrate that the LGAN outperforms state-of-the-art k-WL-based GNNs, while offering better interpretability.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rebekka Gรถrge, Sujan Sai Gannamaneni, Tabea Naeven, Hammam Abdelwahab, Hรฉctor Allende-Cid et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.
๐
2025-12-11
๐ ArXiv Preprint (cs.SE)
๐ฅ Itay Dreyfuss, Antonio Abu Nassar, Samuel Ackerman, Axel Ben David, Rami Katan et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Model (LLM)-based code assistants have emerged as a powerful application of generative AI, demonstrating impressive capabilities in code generation and comprehension. A key requirement for these systems is their ability to accurately follow user instructions. We present Precise Automatically Checked Instruction Following In Code (PACIFIC), a novel framework designed to automatically generate benchmarks that rigorously assess sequential instruction-following and code dry-running capabilities in LLMs, while allowing control over benchmark difficulty. PACIFIC produces benchmark variants with clearly defined expected outputs, enabling straightforward and reliable evaluation through simple output comparisons. In contrast to existing approaches that often rely on tool usage or agentic behavior, our work isolates and evaluates the LLM's intrinsic ability to reason through code behavior step-by-step without execution (dry running) and to follow instructions. Furthermore, our framework mitigates training data contamination by facilitating effortless generation of novel benchmark variations. We validate our framework by generating a suite of benchmarks spanning a range of difficulty levels and evaluating multiple state-of-the-art LLMs. Our results demonstrate that PACIFIC can produce increasingly challenging benchmarks that effectively differentiate instruction-following and dry running capabilities, even among advanced models. Overall, our framework offers a scalable, contamination-resilient methodology for assessing core competencies of LLMs in code-related tasks.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wei Fang, Chiyao Wang, Wenshuai Ma, Hui Liu, Jianqiang Hu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Background: While intravascular imaging, particularly optical coherence tomography (OCT), improves percutaneous coronary intervention (PCI) outcomes, its interpretation is operator-dependent. General-purpose artificial intelligence (AI) shows promise but lacks domain-specific reliability. We evaluated the performance of CA-GPT, a novel large model deployed on an AI-OCT system, against that of the general-purpose ChatGPT-5 and junior physicians for OCT-guided PCI planning and assessment.
Methods: In this single-center analysis of 96 patients who underwent OCT-guided PCI, the procedural decisions generated by the CA-GPT, ChatGPT-5, and junior physicians were compared with an expert-derived procedural record. Agreement was assessed using ten pre-specified metrics across pre-PCI and post-PCI phases.
Results: For pre-PCI planning, CA-GPT demonstrated significantly higher median agreement scores (5[IQR 3.75-5]) compared to both ChatGPT-5 (3[2-4], P<0.001) and junior physicians (4[3-4], P<0.001). CA-GPT significantly outperformed ChatGPT-5 across all individual pre-PCI metrics and showed superior performance to junior physicians in stent diameter (90.3% vs. 72.2%, P<0.05) and length selection (80.6% vs. 52.8%, P<0.01). In post-PCI assessment, CA-GPT maintained excellent overall agreement (5[4.75-5]), significantly higher than both ChatGPT-5 (4[4-5], P<0.001) and junior physicians (5[4-5], P<0.05). Subgroup analysis confirmed CA-GPT's robust performance advantage in complex scenarios.
Conclusion: The CA-GPT-based AI-OCT system achieved superior decision-making agreement versus a general-purpose large language model and junior physicians across both PCI planning and assessment phases. This approach provides a standardized and reliable method for intravascular imaging interpretation, demonstrating significant potential to augment operator expertise and optimize OCT-guided PCI.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Jianbo Wang, Galina Sidorenko, Johan Thunberg
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Connected and automated vehicles (CAVs) have the potential to enhance driving safety, for example by enabling safe vehicle following and more efficient traffic scheduling. For such future deployments, safety requirements should be addressed, where the primary such are avoidance of vehicle collisions and substantial mitigating of harm when collisions are unavoidable. However, conservative worst-case-based control strategies come at the price of reduced flexibility and may compromise overall performance. In light of this, we investigate how Deep Reinforcement Learning (DRL) can be leveraged to improve safety in multi-vehicle-following scenarios involving emergency braking. Specifically, we investigate how DRL with vehicle-to-vehicle communication can be used to ethically select an emergency breaking profile in scenarios where overall, or collective, three-vehicle harm reduction or collision avoidance shall be obtained instead of single-vehicle such. As an algorithm, we provide a hybrid approach that combines DRL with a previously published method based on analytical expressions for selecting optimal constant deceleration. By combining DRL with the previous method, the proposed hybrid approach increases the reliability compared to standalone DRL, while achieving superior performance in terms of overall harm reduction and collision avoidance.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Procedural memory enables large language model (LLM) agents to internalize "how-to" knowledge, theoretically reducing redundant trial-and-error. However, existing frameworks predominantly suffer from a "passive accumulation" paradigm, treating memory as a static append-only archive. To bridge the gap between static storage and dynamic reasoning, we propose $\textbf{ReMe}$ ($\textit{Remember Me, Refine Me}$), a comprehensive framework for experience-driven agent evolution. ReMe innovates across the memory lifecycle via three mechanisms: 1) $\textit{multi-faceted distillation}$, which extracts fine-grained experiences by recognizing success patterns, analyzing failure triggers and generating comparative insights; 2) $\textit{context-adaptive reuse}$, which tailors historical insights to new contexts via scenario-aware indexing; and 3) $\textit{utility-based refinement}$, which autonomously adds valid memories and prunes outdated ones to maintain a compact, high-quality experience pool. Extensive experiments on BFCL-V3 and AppWorld demonstrate that ReMe establishes a new state-of-the-art in agent memory system. Crucially, we observe a significant memory-scaling effect: Qwen3-8B equipped with ReMe outperforms larger, memoryless Qwen3-14B, suggesting that self-evolving memory provides a computation-efficient pathway for lifelong learning. We release our code and the $\texttt{reme.library}$ dataset to facilitate further research.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Benjamin Gundersen, Nicolas Deperrois, Samuel Ruiperez-Campillo, Thomas M. Sutter, Julia E. Vogt et al.
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent advances in vision-language models (VLMs) have improved Chest X-ray (CXR) interpretation in multiple aspects. However, many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. In contrast, reinforcement learning (RL) can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks. To investigate the effects of RL and thinking in a CXR VLM, we perform large-scale SFT on CXR data to build an updated RadVLM based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We then apply Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and run matched RL experiments on both domain-specific and general-domain Qwen3-VL variants, with and without thinking. Across these settings, we find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both report generation and grounding, highlighting clinically aligned RL as a powerful complement to SFT for medical VLMs.
๐
2025-12-11
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Lingfeng Liu, Yixin Song, Dazhong Shen, Bing Yin, Hao Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Popularity bias fundamentally undermines the personalization capabilities of collaborative filtering (CF) models, causing them to disproportionately recommend popular items while neglecting users' genuine preferences for niche content. While existing approaches treat this as an external confounding factor, we reveal that popularity bias is an intrinsic geometric artifact of Bayesian Pairwise Ranking (BPR) optimization in CF models. Through rigorous mathematical analysis, we prove that BPR systematically organizes item embeddings along a dominant "popularity direction" where embedding magnitudes directly correlate with interaction frequency. This geometric distortion forces user embeddings to simultaneously handle two conflicting tasks-expressing genuine preference and calibrating against global popularity-trapping them in suboptimal configurations that favor popular items regardless of individual tastes. We propose Directional Decomposition and Correction (DDC), a universally applicable framework that surgically corrects this embedding geometry through asymmetric directional updates. DDC guides positive interactions along personalized preference directions while steering negative interactions away from the global popularity direction, disentangling preference from popularity at the geometric source. Extensive experiments across multiple BPR-based architectures demonstrate that DDC significantly outperforms state-of-the-art debiasing methods, reducing training loss to less than 5% of heavily-tuned baselines while achieving superior recommendation quality and fairness. Code is available in https://github.com/LingFeng-Liu-AI/DDC.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Manon Kempermann, Sai Suresh Macharla Vasu, Mahalakshmi Raveenthiran, Theo Farrell, Ingmar Weber
Natural Language Processing
ๆฅ็ๆ่ฆ
Safety evaluations of large language models (LLMs) typically focus on universal risks like dangerous capabilities or undesirable propensities. However, millions use LLMs for personal advice on high-stakes topics like finance and health, where harms are context-dependent rather than universal. While frameworks like the OECD's AI classification recognize the need to assess individual risks, user-welfare safety evaluations remain underdeveloped. We argue that developing such evaluations is non-trivial due to fundamental questions about accounting for user context in evaluation design. In this exploratory study, we evaluated advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability. First, we demonstrate that evaluators must have access to rich user context: identical LLM responses were rated significantly safer by context-blind evaluators than by those aware of user circumstances, with safety scores for high-vulnerability users dropping from safe (5/7) to somewhat unsafe (3/7). One might assume this gap could be addressed by creating realistic user prompts containing key contextual information. However, our second study challenges this: we rerun the evaluation on prompts containing context users report they would disclose, finding no significant improvement. Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles, as realistic user context disclosure alone proves insufficient, particularly for vulnerable populations. By demonstrating a methodology for context-aware evaluation, this study provides both a starting point for such assessments and foundational evidence that evaluating individual welfare demands approaches distinct from existing universal-risk frameworks. We publish our code and dataset to aid future developments.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao et al.
Robotics
ๆฅ็ๆ่ฆ
Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Oscar Robben, Saeed Khalilian, Nirvana Meratnia
Machine Learning
ๆฅ็ๆ่ฆ
Early-exit networks are effective solutions for reducing the overall energy consumption and latency of deep learning models by adjusting computation based on the complexity of input data. By incorporating intermediate exit branches into the architecture, they provide less computation for simpler samples, which is particularly beneficial for resource-constrained devices where energy consumption is crucial. However, designing early-exit networks is a challenging and time-consuming process due to the need to balance efficiency and performance. Recent works have utilized Neural Architecture Search (NAS) to design more efficient early-exit networks, aiming to reduce average latency while improving model accuracy by determining the best positions and number of exit branches in the architecture. Another important factor affecting the efficiency and accuracy of early-exit networks is the depth and types of layers in the exit branches. In this paper, we use hardware-aware NAS to strengthen exit branches, considering both accuracy and efficiency during optimization. Our performance evaluation on the CIFAR-10, CIFAR-100, and SVHN datasets demonstrates that our proposed framework, which considers varying depths and layers for exit branches along with adaptive threshold tuning, designs early-exit networks that achieve higher accuracy with the same or lower average number of MACs compared to the state-of-the-art approaches.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Muhua Huang, Qinlin Zhao, Xiaoyuan Yi, Xing Xie
Natural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLM) based multi-agent systems become increasingly prevalent, the collective behaviors, e.g., collective intelligence, of such artificial communities have drawn growing attention. This work aims to answer a fundamental question: How does diversity of values shape the collective behavior of AI communities? Using naturalistic value elicitation grounded in the prevalent Schwartz's Theory of Basic Human Values, we constructed multi-agent simulations where communities with varying numbers of agents engaged in open-ended interactions and constitution formation. The results show that value diversity enhances value stability, fosters emergent behaviors, and brings more creative principles developed by the agents themselves without external guidance. However, these effects also show diminishing returns: extreme heterogeneity induces instability. This work positions value diversity as a new axis of future AI capability, bridging AI ability and sociological studies of institutional emergence.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tong Zhang, Carlos Hinojosa, Bernard Ghanem
ๆฅ็ๆ่ฆ
Diffusion models can unintentionally reproduce training examples, raising privacy and copyright concerns as these systems are increasingly deployed at scale. Existing inference-time mitigation methods typically manipulate classifier-free guidance (CFG) or perturb prompt embeddings; however, they often struggle to reduce memorization without compromising alignment with the conditioning prompt. We introduce CAPTAIN, a training-free framework that mitigates memorization by directly modifying latent features during denoising. CAPTAIN first applies frequency-based noise initialization to reduce the tendency to replicate memorized patterns early in the denoising process. It then identifies the optimal denoising timesteps for feature injection and localizes memorized regions. Finally, CAPTAIN injects semantically aligned features from non-memorized reference images into localized latent regions, suppressing memorization while preserving prompt fidelity and visual quality. Our experiments show that CAPTAIN achieves substantial reductions in memorization compared to CFG-based baselines while maintaining strong alignment with the intended prompt.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Liang Peng, Haopeng Liu, Yixuan Ye, Cheng Liu, Wenjun Shen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Unsupervised cell type identification is crucial for uncovering and characterizing heterogeneous populations in single cell omics studies. Although a range of clustering methods have been developed, most focus exclusively on intrinsic cellular structure and ignore the pivotal role of cell-gene associations, which limits their ability to distinguish closely related cell types. To this end, we propose a Refinement Contrastive Learning framework (scRCL) that explicitly incorporates cell-gene interactions to derive more informative representations. Specifically, we introduce two contrastive distribution alignment components that reveal reliable intrinsic cellular structures by effectively exploiting cell-cell structural relationships. Additionally, we develop a refinement module that integrates gene-correlation structure learning to enhance cell embeddings by capturing underlying cell-gene associations. This module strengthens connections between cells and their associated genes, refining the representation learning to exploiting biologically meaningful relationships. Extensive experiments on several single-cell RNA-seq and spatial transcriptomics benchmark datasets demonstrate that our method consistently outperforms state-of-the-art baselines in cell-type identification accuracy. Moreover, downstream biological analyses confirm that the recovered cell populations exhibit coherent gene-expression signatures, further validating the biological relevance of our approach. The code is available at https://github.com/THPengL/scRCL.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ishaan Kaushal, Amaresh Chakrabarti
ๆฅ็ๆ่ฆ
This paper presents the Unified Smart Factory Model (USFM), a comprehensive framework designed to translate high-level sustainability goals into measurable factory-level indicators with a systematic information map of manufacturing activities. The manufacturing activities were modelled as set of manufacturing, assembly and auxiliary processes using Object Process Methodology, a Model Based Systems Engineering (MBSE) language. USFM integrates Manufacturing Process and System, Data Process, and Key Performance Indicator (KPI) Selection and Assessment in a single framework. Through a detailed case study of Printed Circuit Board (PCB) assembly factory, the paper demonstrates how environmental sustainability KPIs can be selected, modelled, and mapped to the necessary data, highlighting energy consumption and environmental impact metrics. The model's systematic approach can reduce redundancy, minimize the risk of missing critical information, and enhance data collection. The paper concluded that the USFM bridges the gap between sustainability goals and practical implementation, providing significant benefits for industries specifically SMEs aiming to achieve sustainability targets.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Minghao LI, Ruihang Wang, Rui Tan, Yonggang Wen
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Data center (DC) infrastructure serves as the backbone to support the escalating demand for computing capacity. Traditional design methodologies that blend human expertise with specialized simulation tools scale poorly with the increasing system complexity. Recent studies adopt generative artificial intelligence to design plausible human-centric indoor layouts. However, they do not consider the underlying physics, making them unsuitable for the DC design that sets quantifiable operational objectives and strict physical constraints. To bridge the gap, we propose Phythesis, a novel framework that synergizes large language models (LLMs) and physics-guided evolutionary optimization to automate simulation-ready (SimReady) scene synthesis for energy-efficient DC design. Phythesis employs an iterative bi-level optimization architecture, where (i) the LLM-driven optimization level generates physically plausible three-dimensional layouts and self-criticizes them to refine the scene topology, and (ii) the physics-informed optimization level identifies the optimal asset parameters and selects the best asset combination. Experiments on three generation scales show that Phythesis achieves 57.3% generation success rate increase and 11.5% power usage effectiveness (PUE) improvement, compared with the vanilla LLM-based solution.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ J. Xiao, Y. Guo, X. Zi, K. Thiyagarajan, C. Moreira et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the \textquote{semantic gap}, the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62\%, nearly doubling the 23.86\% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xin Guan
Natural Language Processing
ๆฅ็ๆ่ฆ
Multistep workflows that chain large language model (LLM) calls suffer from context pollution: as information accumulates across steps, models hallucinate, confuse intermediate outputs, and lose track of task constraints. We present NormCode, a semiformal language for constructing plans of inferences, structured decompositions where each step operates in data isolation and receives only explicitly passed inputs, which eliminates crossstep contamination by design. NormCode enforces a strict separation between semantic operations (LLMdriven reasoning, nondeterministic) and syntactic operations (deterministic data restructuring), enabling precise cost and reliability tracing. The language exists in three isomorphic formats: .ncds for human authoring, .ncd for machine execution, and .ncn for human verification, supporting progressive formalization from sketch to production. We validate NormCode through two demonstrations: (1) a base X addition algorithm achieving 100 percent accuracy on arbitrary length inputs, and (2) self hosted execution of NormCode's own five phase compiler pipeline. The working orchestrator provides dependency driven scheduling, SQLite backed checkpointing, and loop management, making AI workflows auditable by design and addressing a critical need for transparency in high stakes domains such as legal reasoning, medical decision making, and financial analysis.
๐
2025-12-11
๐ ArXiv Preprint (cs.GT)
๐ฅ Chujie Zhao, Qun Hu, Shiping Song, Dagui Chen, Han Zhu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of large language models (LLMs) necessitates novel monetization strategies, among which LLM-native advertising has emerged as a promising paradigm by naturally integrating advertisement within LLM-generated responses. However, this paradigm fundamentally shifts the auction object from discrete ad slots to the distribution over LLM outputs, posing new challenges for designing auction mechanisms. Existing mechanisms for LLM-native advertising adopt frameworks that decouple auction and generation, which either ignore externalities or require multiple LLM inferences for ad allocation, rendering them impractical for industrial scenarios. To address these challenges, we propose LLM-Auction, which to the best of our knowledge is the first learning-based generative auction mechanism that integrates auction and LLM generation for LLM-native advertising. By formulating the allocation optimization as a preference alignment problem between LLM outputs and the mechanism's objective which reflects both advertisers' expected value and user experience, we introduce Iterative Reward-Preference Optimization (IRPO) algorithm that alternately optimizes the reward model and the LLM. This approach enables the LLM to inherently model allocation externalities without any extra inference cost. We further identify the allocation monotonicity and continuity of LLM-Auction, which allows us to prove that a simple first-price payment rule exhibits favorable incentive properties. Additionally, we design an LLM-as-a-judge simulation environment to facilitate large-scale data construction and enable comprehensive quantitative evaluation of the mechanism's performance. Extensive quantitative and qualitative experiments demonstrate that LLM-Auction significantly outperforms existing baselines in allocation efficiency, while achieving the desired mechanism properties.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine's feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity-Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000-2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert-level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions. We will release the model, data, and symbolic engine to support future research.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chihyeon Song, Jaewoo Lee, Jinkyoo Park
Machine Learning
ๆฅ็ๆ่ฆ
Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call 'on-policyness'. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy's behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively leverages offline data for initial stability while progressively focusing learning on the most relevant, high-rewarding online experiences. Our extensive experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms, highlighting the importance of an adaptive, behavior-aware replay buffer design.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Lim Chien Her, Ming Yan, Yunshu Bai, Ruihao Li, Hao Zhang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Procedural Content Generation (PCG) offers scalable methods for algorithmically creating complex, customizable worlds. However, controlling these pipelines requires the precise configuration of opaque technical parameters. We propose a training-free architecture that utilizes LLM agents for zero-shot PCG parameter configuration. While Large Language Models (LLMs) promise a natural language interface for PCG tools, off-the-shelf models often fail to bridge the semantic gap between abstract user instructions and strict parameter specifications. Our system pairs an Actor agent with a Critic agent, enabling an iterative workflow where the system autonomously reasons over tool parameters and refines configurations to progressively align with human design preferences. We validate this approach on the generation of various 3D maps, establishing a new benchmark for instruction-following in PCG. Experiments demonstrate that our approach outperforms single-agent baselines, producing diverse and structurally valid environments from natural language descriptions. These results demonstrate that off-the-shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools. By shifting the burden from model training to architectural reasoning, our method offers a scalable framework for mastering complex software without task-specific fine-tuning.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiaxi Wu, Tiantian Zhang, Yuxing Wang, Yongzhe Chang, Xueqian Wang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Robust adversarial reinforcement learning has emerged as an effective paradigm for training agents to handle uncertain disturbance in real environments, with critical applications in sequential decision-making domains such as autonomous driving and robotic control. Within this paradigm, agent training is typically formulated as a zero-sum Markov game between a protagonist and an adversary to enhance policy robustness. However, the trainable nature of the adversary inevitably induces non-stationarity in the learning dynamics, leading to exacerbated training instability and convergence difficulties, particularly in high-dimensional complex environments. In this paper, we propose a novel approach, Uncertainty-Aware Critic Ensemble for robust adversarial Reinforcement learning (UACER), which consists of two strategies: 1) Diversified critic ensemble: a diverse set of K critic networks is exploited in parallel to stabilize Q-value estimation rather than conventional single-critic architectures for both variance reduction and robustness enhancement. 2) Time-varying Decay Uncertainty (TDU) mechanism: advancing beyond simple linear combinations, we develop a variance-derived Q-value aggregation strategy that explicitly incorporates epistemic uncertainty to dynamically regulate the exploration-exploitation trade-off while simultaneously stabilizing the training process. Comprehensive experiments across several MuJoCo control problems validate the superior effectiveness of UACER, outperforming state-of-the-art methods in terms of overall performance, stability, and efficiency.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoyu Zhu, Yao Zhang, Jiashen Ren, Qingchun Hou
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Neural network constraint satisfaction is crucial for safety-critical applications such as power system optimization, robotic path planning, and autonomous driving. However, existing constraint satisfaction methods face efficiency-applicability trade-offs, with hard constraint methods suffering from either high computational complexity or restrictive assumptions on constraint structures. The Sampling Kaczmarz-Motzkin (SKM) method is a randomized iterative algorithm for solving large-scale linear inequality systems with favorable convergence properties, but its argmax operations introduce non-differentiability, posing challenges for neural network applications. This work proposes the Trainable Sampling Kaczmarz-Motzkin Network (T-SKM-Net) framework and, for the first time, systematically integrates SKM-type methods into neural network constraint satisfaction. The framework transforms mixed constraint problems into pure inequality problems through null space transformation, employs SKM for iterative solving, and maps solutions back to the original constraint space, efficiently handling both equality and inequality constraints. We provide theoretical proof of post-processing effectiveness in expectation and end-to-end trainability guarantees based on unbiased gradient estimators, demonstrating that despite non-differentiable operations, the framework supports standard backpropagation. On the DCOPF case118 benchmark, our method achieves 4.27ms/item GPU serial forward inference with 0.0025% max optimality gap with post-processing mode and 5.25ms/item with 0.0008% max optimality gap with joint training mode, delivering over 25$\times$ speedup compared to the pandapower solver while maintaining zero constraint violations under given tolerance.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Devanshu Sahoo, Manish Prasad, Vasudev Majhi, Jahnvi Singh, Vinay Chamola et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLMs by reviewers to manage workload (the "Lazy Reviewer" hypothesis) and the formal institutional deployment of AI-powered assessment systems by conferences like AAAI and Stanford's Agents4Science. This study investigates the robustness of these "LLM-as-a-Judge" systems (both illicit and sanctioned) to adversarial PDF manipulation. Unlike general jailbreaks, we focus on a distinct incentive: flipping "Reject" decisions to "Accept," for which we develop a novel evaluation metric which we term as WAVS (Weighted Adversarial Vulnerability Score). We curated a dataset of 200 scientific papers and adapted 15 domain-specific attack strategies to this task, evaluating them across 13 Language Models, including GPT-5, Claude Haiku, and DeepSeek. Our results demonstrate that obfuscation strategies like "Maximum Mark Magyk" successfully manipulate scores, achieving alarming decision flip rates even in large-scale models. We will release our complete dataset and injection framework to facilitate more research on this topic.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Francesco Freni, Anya Fries, Linus Kรผhne, Markus Reichstein, Jonas Peters
ๆฅ็ๆ่ฆ
We consider a regression setting where observations are collected in different environments modeled by different data distributions. The field of out-of-distribution (OOD) generalization aims to design methods that generalize better to test environments whose distributions differ from those observed during training. One line of such works has proposed to minimize the maximum risk across environments, a principle that we refer to as MaxRM (Maximum Risk Minimization). In this work, we introduce variants of random forests based on the principle of MaxRM. We provide computationally efficient algorithms and prove statistical consistency for our primary method. Our proposed method can be used with each of the following three risks: the mean squared error, the negative reward (which relates to the explained variance), and the regret (which quantifies the excess risk relative to the best predictor). For MaxRM with regret as the risk, we prove a novel out-of-sample guarantee over unseen test distributions. Finally, we evaluate the proposed methods on both simulated and real-world data.
๐
2025-12-11
๐ ArXiv Preprint (cs.DC)
๐ฅ Sabtain Ahmad, Meerzhan Kanatbekova, Ivona Brandic, Atakan Aral
ๆฅ็ๆ่ฆ
Clustered Federated Learning (CFL) has emerged as a powerful approach for addressing data heterogeneity and ensuring privacy in large distributed IoT environments. By clustering clients and training cluster-specific models, CFL enables personalized models tailored to groups of heterogeneous clients. However, conventional CFL approaches suffer from fragmented learning for training independent global models for each cluster and fail to take advantage of collective cluster insights. This paper advocates a shift to hierarchical CFL, allowing bi-level aggregation to train cluster-specific models at the edge and a unified global model at the cloud. This shift improves training efficiency yet might introduce communication challenges. To this end, we propose CFLHKD, a novel personalization scheme for integrating hierarchical cluster knowledge into CFL. Built upon multi-teacher knowledge distillation, CFLHKD enables inter-cluster knowledge sharing while preserving cluster-specific personalization. CFLHKD adopts a bi-level aggregation to bridge the gap between local and global learning. Extensive evaluations of standard benchmark datasets demonstrate that CFLHKD outperforms representative baselines in cluster-specific and global model accuracy and achieves a performance improvement of 3.32-7.57\%.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Stylianos Kandylakis, Christos Orfanopoulos, Georgios Siolas, Panayiotis Tsanakas
Machine Learning
ๆฅ็ๆ่ฆ
This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.
๐
2025-12-11
๐ AAAI 2026
๐ฅ Hojun Lee, Mijin Koo, Yeji Song, Nojun Kwak
Robotics
ๆฅ็ๆ่ฆ
Recent advancements in diffusion models have made fine-tuning text-to-image models for personalization increasingly accessible, but have also raised significant concerns regarding unauthorized data usage and privacy infringement. Current protection methods are limited to passively degrading image quality, failing to achieve stable control. While Targeted Data Protection (TDP) offers a promising paradigm for active redirection toward user-specified target concepts, existing TDP attempts suffer from poor controllability due to snapshot-matching approaches that fail to account for complete learning dynamics. We introduce TAFAP (Trajectory Alignment via Fine-tuning with Adversarial Perturbations), the first method to successfully achieve effective TDP by controlling the entire training trajectory. Unlike snapshot-based methods whose protective influence is easily diluted as training progresses, TAFAP employs trajectory-matching inspired by dataset distillation to enforce persistent, verifiable transformations throughout fine-tuning. We validate our method through extensive experiments, demonstrating the first successful targeted transformation in diffusion models with simultaneous control over both identity and visual patterns. TAFAP significantly outperforms existing TDP attempts, achieving robust redirection toward target concepts while maintaining high image quality. This work enables verifiable safeguards and provides a new framework for controlling and tracing alterations in diffusion model outputs.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ezequiel Lopez-Rubio
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The representation of graphs is commonly based on the adjacency matrix concept. This formulation is the foundation of most algebraic and computational approaches to graph processing. The advent of deep learning language models offers a wide range of powerful computational models that are specialized in the processing of text. However, current procedures to represent graphs are not amenable to processing by these models. In this work, a new method to represent graphs is proposed. It represents the adjacency matrix of a graph by a string of simple instructions. The instructions build the adjacency matrix step by step. The transformation is reversible, i.e. given a graph the string can be produced and vice versa. The proposed representation is compact and it maintains the local structural patterns of the graph. Therefore, it is envisaged that it could be useful to boost the processing of graphs by deep learning models. A tentative computational experiment is reported, with favorable results.
๐
2025-12-11
๐ NeurIPS 2025
๐ฅ Youmin Ko, Sungjong Seo, Hyunjoon Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available.\footnote{https://github.com/meaningful96/CoopRAG}
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors.This work addresses these limitations in two complementary ways. First, we release WildRoad, a gloabal off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly.Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.
๐
2025-12-11
๐ ArXiv Preprint (cs.SE)
๐ฅ Devanshu Sahoo, Vasudev Majhi, Arjun Neekhra, Yash Sinha, Murari Mandal et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The use of Large Language Models (LLMs) as automatic judges for code evaluation is becoming increasingly prevalent in academic environments. But their reliability can be compromised by students who may employ adversarial prompting strategies in order to induce misgrading and secure undeserved academic advantages. In this paper, we present the first large-scale study of jailbreaking LLM-based automated code evaluators in academic context. Our contributions are: (i) We systematically adapt 20+ jailbreaking strategies for jailbreaking AI code evaluators in the academic context, defining a new class of attacks termed academic jailbreaking. (ii) We release a poisoned dataset of 25K adversarial student submissions, specifically designed for the academic code-evaluation setting, sourced from diverse real-world coursework and paired with rubrics and human-graded references, and (iii) In order to capture the multidimensional impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness). (iv) We comprehensively evalulate the academic jailbreaking attacks using six LLMs. We find that these models exhibit significant vulnerability, particularly to persuasive and role-play-based attacks (up to 97% JSR). Our adversarial dataset and benchmark suite lay the groundwork for next-generation robust LLM-based evaluators in academic code assessment.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yang Yu, Zhuangzhuang Chen, Siqi Wang, Lanqing Li, Xiaomeng Li
MultimodalMachine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Recently, reinforcement learning (RL) has become a common choice in enhancing the reasoning capabilities of vision-language models (VLMs). Considering existing RL- based finetuning methods, entropy intervention turns out to be an effective way to benefit exploratory ability, thereby improving policy performance. Notably, most existing stud- ies intervene in entropy by simply controlling the update of specific tokens during policy optimization of RL. They ig- nore the entropy intervention during the RL sampling that can boost the performance of GRPO by improving the di- versity of responses. In this paper, we propose Selective- adversarial Entropy Intervention, namely SaEI, which en- hances policy entropy by distorting the visual input with the token-selective adversarial objective coming from the en- tropy of sampled responses. Specifically, we first propose entropy-guided adversarial sampling (EgAS) that formu- lates the entropy of sampled responses as an adversarial ob- jective. Then, the corresponding adversarial gradient can be used to attack the visual input for producing adversarial samples, allowing the policy model to explore a larger an- swer space during RL sampling. Then, we propose token- selective entropy computation (TsEC) to maximize the ef- fectiveness of adversarial attack in EgAS without distorting factual knowledge within VLMs. Extensive experiments on both in-domain and out-of-domain datasets show that our proposed method can greatly improve policy exploration via entropy intervention, to boost reasoning capabilities. Code will be released once the paper is accepted.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, Ji Pei
Natural Language Processing
ๆฅ็ๆ่ฆ
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference-time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
๐
2025-12-11
๐ KDD 2026
๐ฅ Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Tianyu Du et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Deep neural networks (DNNs) underpin critical applications yet remain vulnerable to backdoor attacks, typically reliant on heuristic brute-force methods. Despite significant empirical advancements in backdoor research, the lack of rigorous theoretical analysis limits understanding of underlying mechanisms, constraining attack predictability and adaptability. Therefore, we provide a theoretical analysis targeting backdoor attacks, focusing on how sparse decision boundaries enable disproportionate model manipulation. Based on this finding, we derive a closed-form, ambiguous boundary region, wherein negligible relabeled samples induce substantial misclassification. Influence function analysis further quantifies significant parameter shifts caused by these margin samples, with minimal impact on clean accuracy, formally grounding why such low poison rates suffice for efficacious attacks. Leveraging these insights, we propose Eminence, an explainable and robust black-box backdoor framework with provable theoretical guarantees and inherent stealth properties. Eminence optimizes a universal, visually subtle trigger that strategically exploits vulnerable decision boundaries and effectively achieves robust misclassification with exceptionally low poison rates (< 0.1%, compared to SOTA methods typically requiring > 1%). Comprehensive experiments validate our theoretical discussions and demonstrate the effectiveness of Eminence, confirming an exponential relationship between margin poisoning and adversarial boundary manipulation. Eminence maintains > 90% attack success rate, exhibits negligible clean-accuracy loss, and demonstrates high transferability across diverse models, datasets and scenarios.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin et al.
Robotics
ๆฅ็ๆ่ฆ
Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.
๐
2025-12-11
๐ ArXiv Preprint (cs.SE)
๐ฅ Guoqiang Chen, Lingyun Ying, Ziyang Song, Daguang Liu, Qiang Wang et al.
ๆฅ็ๆ่ฆ
LLM-agent based binary code analysis has demonstrated significant potential across a wide range of software security scenarios, including vulnerability detection, malware analysis, etc. In agent workflow, however, retrieving the positive from thousands of stripped binary functions based on user query remains under-studied and challenging, as the absence of symbolic information distinguishes it from source code retrieval. In this paper, we introduce, BinSeek, the first two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeekEmbedding is trained on large-scale dataset to learn the semantic relevance of the binary code and the natural language description, furthermore, BinSeek-Reranker learns to carefully judge the relevance of the candidate code to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate training construction, also deriving a domain benchmark for future research. Our evaluation results show that BinSeek achieved the state-of-the-art performance, surpassing the the same scale models by 31.42% in Rec@3 and 27.17% in MRR@3, as well as leading the advanced general-purpose models that have 16 times larger parameters.
๐
2025-12-11
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Ziwei Liu, Yejing Wang, Qidong Liu, Zijian Zhang, Chong Chen et al.
ๆฅ็ๆ่ฆ
Conventional Sequential Recommender Systems (SRS) typically assign unique Hash IDs (HID) to construct item embeddings. These HID embeddings effectively learn collaborative information from historical user-item interactions, making them vulnerable to situations where most items are rarely consumed (the long-tail problem). Recent methods that incorporate auxiliary information often suffer from noisy collaborative sharing caused by co-occurrence signals or semantic homogeneity caused by flat dense embeddings. Semantic IDs (SIDs), with their capability of code sharing and multi-granular semantic modeling, provide a promising alternative. However, the collaborative overwhelming phenomenon hinders the further development of SID-based methods. The quantization mechanisms commonly compromise the uniqueness of identifiers required for modeling head items, creating a performance seesaw between head and tail items. To address this dilemma, we propose \textbf{\name}, a novel framework that harmonizes the SID and HID. Specifically, we devise a dual-branch modeling architecture that enables the model to capture both the multi-granular semantics within SID while preserving the unique collaborative identity of HID. Furthermore, we introduce a dual-level alignment strategy that bridges the two representations, facilitating knowledge transfer and supporting robust preference modeling. Extensive experiments on three real-world datasets show that \name~ effectively balances recommendation quality for both head and tail items while surpassing the existing baselines. The implementation code can be found online\footnote{https://github.com/ziwliu8/H2Rec}.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, Xin Lou
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: \textit{data construction} and \textit{training process}, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1\% and open-world data boosts FROW benchmark accuracy by 10\%-20\% and content accuracy by 6\%-12\%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10\%. The benchmark will be available at https://github.com/pc-inno/FROW.
๐
2025-12-11
๐ ArXiv Preprint (cs.SD)
๐ฅ Wenye Zhu, Jun Tang, Xiaofei Li
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Personal sound zone (PSZ) reproduction system, which attempts to create distinct virtual acoustic scenes for different listeners at their respective positions within the same spatial area using one loudspeaker array, is a fundamental technology in the application of virtual reality. For practical applications, the reconstruction targets must be measured on the same fixed receiver array used to record the local room impulse responses (RIRs) from the loudspeaker array to the control points in each PSZ, which makes the system inconvenient and costly for real-world use. In this paper, a 3D convolutional neural network (CNN) designed for PSZ reproduction with flexible control microphone grid and alternative reproduction target is presented, utilizing the virtual target scene as inputs and the PSZ pre-filters as output. Experimental results of the proposed method are compared with the traditional method, demonstrating that the proposed method is able to handle varied reproduction targets on flexible control point grid using only one training session. Furthermore, the proposed method also demonstrates the capability to learn global spatial information from sparse sampling points distributed in PSZs.
๐
2025-12-11
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yash Srivastava, Shalin Jain, Sneha Awathare, Nitin Awathare
Machine Learning
ๆฅ็ๆ่ฆ
The rising demand for collaborative machine learning and data analytics calls for secure and decentralized data sharing frameworks that balance privacy, trust, and incentives. Existing approaches, including federated learning (FL) and blockchain-based data markets, fall short: FL often depends on trusted aggregators and lacks Byzantine robustness, while blockchain frameworks struggle with computation-intensive training and incentive integration.
We present \prot, a decentralized data marketplace that unifies federated learning, blockchain arbitration, and economic incentives into a single framework for privacy-preserving data sharing. \prot\ enables data buyers to submit bid-based requests via blockchain smart contracts, which manage auctions, escrow, and dispute resolution. Computationally intensive training is delegated to \cone\ (\uline{Co}mpute \uline{N}etwork for \uline{E}xecution), an off-chain distributed execution layer. To safeguard against adversarial behavior, \prot\ integrates a modified YODA protocol with exponentially growing execution sets for resilient consensus, and introduces Corrected OSMD to mitigate malicious or low-quality contributions from sellers. All protocols are incentive-compatible, and our game-theoretic analysis establishes honesty as the dominant strategy.
We implement \prot\ on Ethereum and evaluate it over benchmark datasets -- MNIST, Fashion-MNIST, and CIFAR-10 -- under varying adversarial settings. \prot\ achieves up to 99\% accuracy on MNIST and 90\% on Fashion-MNIST, with less than 3\% degradation up to 30\% Byzantine nodes, and 56\% accuracy on CIFAR-10 despite its complexity. Our results show that \prot\ ensures privacy, maintains robustness under adversarial conditions, and scales efficiently with the number of participants, making it a practical foundation for real-world decentralized data sharing.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao et al.
Robotics
ๆฅ็ๆ่ฆ
The rapid development of mobile GUI agents has stimulated growing research interest in long-horizon task automation. However, building agents for these tasks faces a critical bottleneck: the reliance on ever-expanding interaction history incurs substantial context overhead. Existing context management and compression techniques often fail to preserve vital semantic information, leading to degraded task performance. We propose AgentProg, a program-guided approach for agent context management that reframes the interaction history as a program with variables and control flow. By organizing information according to the structure of program, this structure provides a principled mechanism to determine which information should be retained and which can be discarded. We further integrate a global belief state mechanism inspired by Belief MDP framework to handle partial observability and adapt to unexpected environmental changes. Experiments on AndroidWorld and our extended long-horizon task suite demonstrate that AgentProg has achieved the state-of-the-art success rates on these benchmarks. More importantly, it maintains robust performance on long-horizon tasks while baseline methods experience catastrophic degradation. Our system is open-sourced at https://github.com/MobileLLM/AgentProg.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ziying Zhang, Quanming Yao, Yaqing Wang
Machine Learning
ๆฅ็ๆ่ฆ
In this work, we tackle the challenge of recommending emerging items, whose interactions gradually accumulate over time. Existing methods often overlook this dynamic process, typically assuming that emerging items have few or even no historical interactions. Such an assumption oversimplifies the problem, as a good model must preserve the uniqueness of emerging items while leveraging their shared patterns with established ones. To address this challenge, we propose EmerFlow, a novel LLM-empowered representation learning framework that generates distinctive embeddings for emerging items. It first enriches the raw features of emerging items through LLM reasoning, then aligns these representations with the embedding space of the existing recommendation model. Finally, new interactions are incorporated through meta-learning to refine the embeddings. This enables EmerFlow to learn expressive embeddings for emerging items from only limited interactions. Extensive experiments across diverse domains, including movies and pharmaceuticals, show that EmerFlow consistently outperforms existing methods.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hangyu Mao, Guangting Dong, Zhicheng Dou
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We present the Generalized Policy Gradient (GPG) Theorem, specifically designed for Transformer-based policies. Notably, we demonstrate that both standard Policy Gradient Theorem and GRPO emerge as special cases within our GPG framework. Furthermore, we explore its practical applications in training Large Language Models (LLMs), offering new insights into efficient policy optimization.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nicolas Tacheny
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Agentic systems built on large language models operate through recursive feedback loops, where each output becomes the next input. Yet the geometric behavior of these agentic loops (whether they converge, diverge, or exhibit more complex dynamics) remains poorly understood. This paper introduces a geometric framework for analyzing agentic trajectories in semantic embedding space, treating iterative transformations as discrete dynamical systems. We distinguish the artifact space, where linguistic transformations occur, from the embedding space, where geometric measurements are performed. Because cosine similarity is biased by embedding anisotropy, we introduce an isotonic calibration that eliminates systematic bias and aligns similarities with human semantic judgments while preserving high local stability. This enables rigorous measurement of trajectories, clusters and attractors. Through controlled experiments on singular agentic loops, we identify two fundamental regimes. A contractive rewriting loop converges toward a stable attractor with decreasing dispersion, while an exploratory summarize and negate loop produces unbounded divergence with no cluster formation. These regimes display qualitatively distinct geometric signatures of contraction and expansion. Our results show that prompt design directly governs the dynamical regime of an agentic loop, enabling systematic control of convergence, divergence and trajectory structure in iterative LLM transformations.
๐
2025-12-11
๐ AAAI
๐ฅ Wenhan Wu, Zhili He, Huanghuang Liang, Yili Gong, Jiawei Jiang et al.
ๆฅ็ๆ่ฆ
Data-protection regulations such as the GDPR grant every participant in a federated system a right to be forgotten. Federated unlearning has therefore emerged as a research frontier, aiming to remove a specific party's contribution from the learned model while preserving the utility of the remaining parties. However, most unlearning techniques focus on Horizontal Federated Learning (HFL), where data are partitioned by samples. In contrast, Vertical Federated Learning (VFL) allows organizations that possess complementary feature spaces to train a joint model without sharing raw data. The resulting feature-partitioned architecture renders HFL-oriented unlearning methods ineffective. In this paper, we propose REMISVFU, a plug-and-play representation misdirection framework that enables fast, client-level unlearning in splitVFL systems. When a deletion request arrives, the forgetting party collapses its encoder output to a randomly sampled anchor on the unit sphere, severing the statistical link between its features and the global model. To maintain utility for the remaining parties, the server jointly optimizes a retention loss and a forgetting loss, aligning their gradients via orthogonal projection to eliminate destructive interference. Evaluations on public benchmarks show that REMISVFU suppresses back-door attack success to the natural class-prior level and sacrifices only about 2.5% points of clean accuracy, outperforming state-of-the-art baselines.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vinoth Punniyamoorthy, Ashok Gadi Parthi, Mayilsamy Palanigounder, Ravi Kiran Kodali, Bikesh Kumar et al.
Machine Learning
ๆฅ็ๆ่ฆ
Distributed machine learning systems require strong privacy guarantees, verifiable compliance, and scalable deploy- ment across heterogeneous and multi-cloud environments. This work introduces a cloud-native privacy-preserving architecture that integrates federated learning, differential privacy, zero- knowledge compliance proofs, and adaptive governance powered by reinforcement learning. The framework supports secure model training and inference without centralizing sensitive data, while enabling cryptographically verifiable policy enforcement across institutions and cloud platforms. A full prototype deployed across hybrid Kubernetes clusters demonstrates reduced membership- inference risk, consistent enforcement of formal privacy budgets, and stable model performance under differential privacy. Ex- perimental evaluation across multi-institution workloads shows that the architecture maintains utility with minimal overhead while providing continuous, risk-aware governance. The pro- posed framework establishes a practical foundation for deploying trustworthy and compliant distributed machine learning systems at scale.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ziseok Lee, Minyeong Hwang, Sanghyun Jo, Wooyeol Lee, Jihyung Ko et al.
Robotics
ๆฅ็ๆ่ฆ
Inference-time steering enables pretrained diffusion/flow models to be adapted to new tasks without retraining. A widely used approach is the ratio-of-densities method, which defines a time-indexed target path by reweighting probability-density trajectories from multiple models with positive, or in some cases, negative exponents. This construction, however, harbors a critical and previously unformalized failure mode: Marginal Path Collapse, where intermediate densities become non-normalizable even though endpoints remain valid. Collapse arises systematically when composing heterogeneous models trained on different noise schedules or datasets, including a common setting in molecular design where de-novo, conformer, and pocket-conditioned models must be combined for tasks such as flexible-pose scaffold decoration. We provide a novel and complete solution for the problem. First, we derive a simple path existence criterion that predicts exactly when collapse occurs from noise schedules and exponents alone. Second, we introduce Adaptive path Correction with Exponents (ACE), which extends Feynman-Kac steering to time-varying exponents and guarantees a valid probability path. On a synthetic 2D benchmark and on flexible-pose scaffold decoration, ACE eliminates collapse and enables high-guidance compositional generation, improving distributional and docking metrics over constant-exponent baselines and even specialized task-specific scaffold decoration models. Our work turns ratio-of-densities steering with heterogeneous experts from an unstable heuristic into a reliable tool for controllable generation.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jules Lahmi, Alexis Roger
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Artificial intelligence has made great progress in recent years, particularly in the development of Vision--Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non--English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we will explore and compare different methods for their performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yongqiang Yu, Xuhui Li, Hazza Mahmood, Jinxing Zhou, Haodong Hong et al.
Robotics
ๆฅ็ๆ่ฆ
Vision-and-Language Navigation (VLN) requires agents to navigate complex environments by following natural-language instructions. General Scene Adaptation for VLN (GSA-VLN) shifts the focus from zero-shot generalization to continual, environment-specific adaptation, narrowing the gap between static benchmarks and real-world deployment. However, current GSA-VLN frameworks exclude user feedback, relying solely on unsupervised adaptation from repeated environmental exposure. In practice, user feedback offers natural and valuable supervision that can significantly enhance adaptation quality. We introduce a user-feedback-driven adaptation framework that extends GSA-VLN by systematically integrating human interactions into continual learning. Our approach converts user feedback-navigation instructions and corrective signals-into high-quality, environment-aligned training data, enabling efficient and realistic adaptation. A memory-bank warm-start mechanism further reuses previously acquired environmental knowledge, mitigating cold-start degradation and ensuring stable redeployment. Experiments on the GSA-R2R benchmark show that our method consistently surpasses strong baselines such as GR-DUET, improving navigation success and path efficiency. The memory-bank warm start stabilizes early navigation and reduces performance drops after updates. Results under both continual and hybrid adaptation settings confirm the robustness and generality of our framework, demonstrating sustained improvement across diverse deployment conditions.
๐
2025-12-11
๐ ArXiv Preprint (cs.LO)
๐ฅ Ziyu Wang, Bowen Yang, Shihao Zhou, Chenyi Li, Yuan Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
We address the problem of translating informal mathematical proofs expressed in natural language into formal proofs in Lean4 under a constrained computational budget. Our approach is grounded in two key insights. First, informal proofs tend to proceed via a sequence of logical transitions - often implications or equivalences - without explicitly specifying intermediate results or auxiliary lemmas. In contrast, formal systems like Lean require an explicit representation of each proof state and the tactics that connect them. Second, each informal reasoning step can be viewed as an abstract transformation between proof states, but identifying the corresponding formal tactics often requires nontrivial domain knowledge and precise control over proof context. To bridge this gap, we propose a two stage framework. Rather than generating formal tactics directly, we first extract a Chain of States (CoS), a sequence of intermediate formal proof states aligned with the logical structure of the informal argument. We then generate tactics to transition between adjacent states in the CoS, thereby constructing the full formal proof. This intermediate representation significantly reduces the complexity of tactic generation and improves alignment with informal reasoning patterns. We build dedicated datasets and benchmarks for training and evaluation, and introduce an interactive framework to support tactic generation from formal states. Empirical results show that our method substantially outperforms existing baselines, achieving higher proof success rates.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kangkun Mao, Fang Xu, Jinru Ding, Yidong Jiang, Yujun Yao et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Epidemic response planning is essential yet traditionally reliant on labor-intensive manual methods. This study aimed to design and evaluate EpiPlanAgent, an agent-based system using large language models (LLMs) to automate the generation and validation of digital emergency response plans. The multi-agent framework integrated task decomposition, knowledge grounding, and simulation modules. Public health professionals tested the system using real-world outbreak scenarios in a controlled evaluation. Results demonstrated that EpiPlanAgent significantly improved the completeness and guideline alignment of plans while drastically reducing development time compared to manual workflows. Expert evaluation confirmed high consistency between AI-generated and human-authored content. User feedback indicated strong perceived utility. In conclusion, EpiPlanAgent provides an effective, scalable solution for intelligent epidemic response planning, demonstrating the potential of agentic AI to transform public health preparedness.
๐
2025-12-11
๐ ArXiv Preprint (cs.DC)
๐ฅ Julian Rodriguez, Piotr Lopez, Emiliano Lerma, Rafael Medrano, Jacobo Hernandez
Machine Learning
ๆฅ็ๆ่ฆ
This document reports the sequence of practices and methodologies implemented during the Big Data course. It details the workflow beginning with the processing of the Epsilon dataset through group and individual strategies, followed by text analysis and classification with RestMex and movie feature analysis with IMDb. Finally, it describes the technical implementation of a distributed computing cluster with Apache Spark on Linux using Scala.
๐
2025-12-11
๐ AAAI
๐ฅ Quanmin Wei, Penglin Dai, Wei Li, Bingyi Liu, Xiao Wu
Robotics
ๆฅ็ๆ่ฆ
Precise environmental perception is critical for the reliability of autonomous driving systems. While collaborative perception mitigates the limitations of single-agent perception through information sharing, it encounters a fundamental communication-performance trade-off. Existing communication-efficient approaches typically assume MB-level data transmission per collaboration, which may fail due to practical network constraints. To address these issues, we propose InfoCom, an information-aware framework establishing the pioneering theoretical foundation for communication-efficient collaborative perception via extended Information Bottleneck principles. Departing from mainstream feature manipulation, InfoCom introduces a novel information purification paradigm that theoretically optimizes the extraction of minimal sufficient task-critical information under Information Bottleneck constraints. Its core innovations include: i) An Information-Aware Encoding condensing features into minimal messages while preserving perception-relevant information; ii) A Sparse Mask Generation identifying spatial cues with negligible communication cost; and iii) A Multi-Scale Decoding that progressively recovers perceptual information through mask-guided mechanisms rather than simple feature reconstruction. Comprehensive experiments across multiple datasets demonstrate that InfoCom achieves near-lossless perception while reducing communication overhead from megabyte to kilobyte-scale, representing 440-fold and 90-fold reductions per agent compared to Where2comm and ERMVP, respectively.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Byeong Ho Kang, Wenli Yang, Muhammad Bilal Amin
Robotics
ๆฅ็ๆ่ฆ
As Artificial Intelligence (AI) systems increasingly assume consequential decision-making roles, a widening gap has emerged between technical capabilities and institutional accountability. Ethical guidance alone is insufficient to counter this challenge; it demands architectures that embed governance into the execution fabric of the ecosystem. This paper presents the Ten Criteria for Trustworthy Orchestration AI, a comprehensive assurance framework that integrates human input, semantic coherence, audit and provenance integrity into a unified Control-Panel architecture. Unlike conventional agentic AI initiatives that primarily focus on AI-to-AI coordination, the proposed framework provides an umbrella of governance to the entire AI components, their consumers and human participants. By taking aspiration from international standards and Australia's National Framework for AI Assurance initiative, this work demonstrates that trustworthiness can be systematically incorporated (by engineering) into AI systems, ensuring the execution fabric remains verifiable, transparent, reproducible and under meaningful human control.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yanbei Jiang, Xueqi Ma, Shu Liu, Sarah Monazam Erfani, Tongliang Liu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Despite excelling on multimodal benchmarks, vision-language models (VLMs) largely remain a black box. In this paper, we propose a novel interpretability framework to systematically analyze the internal mechanisms of VLMs, focusing on the functional roles of attention heads in multimodal reasoning. To this end, we introduce CogVision, a dataset that decomposes complex multimodal questions into step-by-step subquestions designed to simulate human reasoning through a chain-of-thought paradigm, with each subquestion associated with specific receptive or cognitive functions such as high-level visual reception and inference. Using a probing-based methodology, we identify attention heads that specialize in these functions and characterize them as functional heads. Our analysis across diverse VLM families reveals that these functional heads are universally sparse, vary in number and distribution across functions, and mediate interactions and hierarchical organization. Furthermore, intervention experiments demonstrate their critical role in multimodal reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. These findings provide new insights into the cognitive organization of VLMs and suggest promising directions for designing models with more human-aligned perceptual and reasoning abilities.
๐
2025-12-11
๐ This paper has been accepted for publication in IE
๐ฅ Md Nahid Hasan Shuvo, Moinul Hossain, Anik Mallik, Jeffrey Twigg, Fikadu Dagefu
Machine Learning
ๆฅ็ๆ่ฆ
Federated Learning (FL) enables collaborative model training across distributed devices while safeguarding data and user privacy. However, FL remains susceptible to privacy threats that can compromise data via direct means. That said, indirectly compromising the confidentiality of the FL model architecture (e.g., a convolutional neural network (CNN) or a recurrent neural network (RNN)) on a client device by an outsider remains unexplored. If leaked, this information can enable next-level attacks tailored to the architecture. This paper proposes a novel side-channel fingerprinting attack, leveraging flow-level and packet-level statistics of encrypted wireless traffic from an FL client to infer its deep learning model architecture. We name it FLARE, a fingerprinting framework based on FL Architecture REconnaissance. Evaluation across various CNN and RNN variants-including pre-trained and custom models trained over IEEE 802.11 Wi-Fi-shows that FLARE achieves over 98% F1-score in closed-world and up to 91% in open-world scenarios. These results reveal that CNN and RNN models leak distinguishable traffic patterns, enabling architecture fingerprinting even under realistic FL settings with hardware, software, and data heterogeneity. To our knowledge, this is the first work to fingerprint FL model architectures by sniffing encrypted wireless traffic, exposing a critical side-channel vulnerability in current FL systems.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, Dong Yu
Robotics
ๆฅ็ๆ่ฆ
We introduce MotionEdit, a novel dataset for motion-centric image editing-the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation.
To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.
๐
2025-12-11
๐ ICML 2026
๐ฅ Waleed Razzaq, Izis Kankaraway, Yun-Bo Zhao
Robotics
ๆฅ็ๆ่ฆ
Attention improves representation learning over RNNs, but its discrete nature limits continuous-time (CT) modeling. We introduce Neuronal Attention Circuit (NAC), a novel, biologically plausible CT-Attention mechanism that reformulates attention logits computation as the solution to a linear first-order ODE with nonlinear interlinked gates derived from repurposing \textit{C. elegans} Neuronal Circuit Policies (NCPs) wiring mechanism. NAC replaces dense projections with sparse sensory gates for key-query projections and a sparse backbone network with two heads for computing \textit{content-target} and \textit{learnable time-constant} gates, enabling efficient adaptive dynamics. NAC supports three attention logit computation modes: (i) explicit Euler integration, (ii) exact closed-form solution, and (iii) steady-state approximation. To improve memory intensity, we implemented a sparse Top-\emph{K} pairwise concatenation scheme that selectively curates key-query interactions. We provide rigorous theoretical guarantees, including state stability, bounded approximation errors, and universal approximation. Empirically, we implemented NAC in diverse domains, including irregular time-series classification, lane-keeping for autonomous vehicles, and industrial prognostics. We observed that NAC matches or outperforms competing baselines in accuracy and occupies an intermediate position in runtime and memory efficiency compared with several CT baselines.
๐
2025-12-11
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Venkata Tanuja Madireddy
Machine Learning
ๆฅ็ๆ่ฆ
The rapid expansion of cloud infrastructures and distributed identity systems has significantly increased the complexity and attack surface of modern enterprises. Traditional rule based or signature driven detection systems are often inadequate in identifying novel or evolving threats within Identity and Access Management logs, where anomalous behavior may appear statistically benign but contextually malicious. This paper presents a Graph Neural Network Based Adaptive Threat Detection framework designed to learn latent user resource interaction patterns from IAM audit trails in real time. By modeling IAM logs as heterogeneous dynamic graphs, the proposed system captures temporal, relational, and contextual dependencies across entities such as users, roles, sessions, and access actions. The model incorporates attention based aggregation and graph embedding updates to enable continual adaptation to changing cloud environments. Experimental evaluation on synthesized and real world IAM datasets demonstrates that the proposed method achieves higher detection precision and recall than baseline LSTM and GCN classifiers, while maintaining scalability across multi tenant cloud environments. The frameworks adaptability enables proactive mitigation of insider threats, privilege escalation, and lateral movement attacks, contributing to the foundation of AI driven zero trust access analytics. This work bridges the gap between graph based machine learning and operational cloud security intelligence.
๐
2025-12-11
๐ ArXiv Preprint (cs.GT)
๐ฅ Sam Ganzfried
ๆฅ็ๆ่ฆ
We present an algorithm for computing evolutionarily stable strategies (ESSs) in symmetric perfect-recall extensive-form games of imperfect information. Our main algorithm is for two-player games, and we describe how it can be extended to multiplayer games. The algorithm is sound and computes all ESSs in nondegenerate games and a subset of them in degenerate games which contain an infinite continuum of symmetric Nash equilibria. The algorithm is anytime and can be stopped early to find one or more ESSs. We experiment on an imperfect-information cancer signaling game as well as random games to demonstrate scalability.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuxin Liu, Chaojie Gu, Yihang Zhang, Bin Qian, Shibo He
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning tasks, yet they often struggle with problems involving missing information, exhibiting issues such as incomplete responses, factual errors, and hallucinations. While forward reasoning approaches like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) have shown success in structured problem-solving, they frequently fail to systematically identify and recover omitted information. In this paper, we explore the potential of reverse thinking methodologies to enhance LLMs' performance on missing information detection tasks. Drawing inspiration from recent work on backward reasoning, we propose a novel framework that guides LLMs through reverse thinking to identify necessary conditions and pinpoint missing elements. Our approach transforms the challenging task of missing information identification into a more manageable backward reasoning problem, significantly improving model accuracy. Experimental results demonstrate that our reverse thinking approach achieves substantial performance gains compared to traditional forward reasoning methods, providing a promising direction for enhancing LLMs' logical completeness and reasoning robustness.
๐
2025-12-11
๐ ArXiv Preprint (cs.DC)
๐ฅ Shruti Dongare, Redwan Ibne Seraj Khan, Hadeel Albahar, Nannan Zhao, Diego Melendez Maita et al.
Machine Learning
ๆฅ็ๆ่ฆ
Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application characteristics pose major challenges for existing schedulers, which often rely on offline profiling or application-specific assumptions. We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically prioritizes and allocates DL jobs on heterogeneous GPU clusters. RLTune integrates RL-driven prioritization with MILP-based job-to-node mapping to optimize system-wide objectives such as job completion time (JCT), queueing delay, and resource utilization. Trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba, RLTune improves GPU utilization by up to 20%, reduces queueing delay by up to 81%, and shortens JCT by as much as 70 percent. Unlike prior approaches, RLTune generalizes across diverse workloads without requiring per-job profiling, making it practical for cloud providers to deploy at scale for more efficient, fair, and sustainable DL workload management.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhuo Wang, Xiliang Liu, Ligang Sun
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, the benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.
๐
2025-12-11
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Junhyeong Hwangbo, Soohyun Lee, Minsoo Cheong, Hyeon Jeon, Jinwook Seo
Natural Language Processing
ๆฅ็ๆ่ฆ
Human evaluation remains the gold standard for evaluating outputs of Large Language Models (LLMs). The current evaluation paradigm reviews numerous individual responses, leading to significant scalability challenges. LLM outputs can be more efficiently represented as a tree structure, reflecting their autoregressive generation process and stochastic token selection. However, conventional tree visualization cannot scale to the exponentially large trees generated by modern sampling methods of LLMs. To address this problem, we present InFerActive, an interactive inference system for scalable human evaluation. InFerActive enables on-demand exploration through probability-based filtering and evaluation features, while bridging the semantic gap between computational tokens and human-readable text through adaptive visualization techniques. Through a technical evaluation and user study (N=12), we demonstrate that InFerActive significantly improves evaluation efficiency and enables more comprehensive assessment of model behavior. We further conduct expert case studies that demonstrate InFerActive's practical applicability and potential for transforming LLM evaluation workflows.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jun Seo, Hyeokjun Choe, Seohui Bae, Soyeon Park, Wonbin Ahn et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Time series forecasting is a critical task for artificial intelligence with numerous real-world applications. Traditional approaches primarily rely on historical time series data to predict the future values. However, in practical scenarios, this is often insufficient for accurate predictions due to the limited information available. To address this challenge, multimodal time series forecasting methods which incorporate additional data modalities, mainly text data, alongside time series data have been explored. In this work, we introduce the Adaptive Information Routing (AIR) framework, a novel approach for multimodal time series forecasting. Unlike existing methods that treat text data on par with time series data as interchangeable auxiliary features for forecasting, AIR leverages text information to dynamically guide the time series model by controlling how and to what extent multivariate time series information should be combined. We also present a text-refinement pipeline that employs a large language model to convert raw text data into a form suitable for multimodal forecasting, and we introduce a benchmark that facilitates multimodal forecasting experiments based on this pipeline. Experiment results with the real world market data such as crude oil price and exchange rates demonstrate that AIR effectively modulates the behavior of the time series model using textual inputs, significantly enhancing forecasting accuracy in various time series forecasting tasks.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tjark Behrens, Anton Obukhov, Bingxin Ke, Fabio Tosi, Matteo Poggi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce StereoSpace, a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and the conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses other methods from the warp & inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang et al.
Robotics
ๆฅ็ๆ่ฆ
Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen et al.
Computer Vision
ๆฅ็ๆ่ฆ
We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our codes and datasets is released at https://idea-research.github.io/SceneMaker/.
๐
2025-12-11
๐ WWW
๐ฅ Wentao Zhou, Xuweiyi Chen, Vignesh Rajagopal, Jeffrey Chen, Rohan Chandra et al.
Robotics
ๆฅ็ๆ่ฆ
The success of foundation models in language and vision motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient.
We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve a comparable performance as the state-of-the-art using only 1.5% of the training data, and surpasses the state-of-the-art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov et al.
ๆฅ็ๆ่ฆ
Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yiyang Lu, Qiao Sun, Xianbang Wang, Zhicheng Jiang, Hanhong Zhao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Normalizing Flows (NFs) have been established as a principled framework for generative modeling. Standard NFs consist of a forward process and a reverse process: the forward process maps data to noise, while the reverse process generates samples by inverting it. Typical NF forward transformations are constrained by explicit invertibility, ensuring that the reverse process can serve as their exact analytic inverse. Recent developments in TARFlow and its variants have revitalized NF methods by combining Transformers and autoregressive flows, but have also exposed causal decoding as a major bottleneck. In this work, we introduce Bidirectional Normalizing Flow ($\textbf{BiFlow}$), a framework that removes the need for an exact analytic inverse. BiFlow learns a reverse model that approximates the underlying noise-to-data inverse mapping, enabling more flexible loss functions and architectures. Experiments on ImageNet demonstrate that BiFlow, compared to its causal decoding counterpart, improves generation quality while accelerating sampling by up to two orders of magnitude. BiFlow yields state-of-the-art results among NF-based methods and competitive performance among single-evaluation ("1-NFE") methods. Following recent encouraging progress on NFs, we hope our work will draw further attention to this classical paradigm.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sicheng Mo, Thao Nguyen, Richard Zhang, Nick Kolkin, Siddharth Srinivasan Iyer et al.
Machine LearningNatural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra and inter-image correspondence. We observe a clear scaling effect - larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qitao Zhao, Hao Tan, Qianqian Wang, Sai Bi, Kai Zhang et al.
Computer Vision
ๆฅ็ๆ่ฆ
Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation, matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengkai Hu, Jiaqi Ma, Jun Wan, Wenwen Min, Yongcheng Jing et al.
ๆฅ็ๆ่ฆ
All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on the explicit 3D inductive biases, such as Bird-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, our Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.
๐
2025-12-11
๐ IEEE TPAMI, Project Page: https://henghuiding.com/
๐ฅ Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang et al.
ๆฅ็ๆ่ฆ
This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin et al.
Robotics
ๆฅ็ๆ่ฆ
Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multiple subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, the VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves comparable performance as classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko et al.
Multimodal
ๆฅ็ๆ่ฆ
Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei et al.
Robotics
ๆฅ็ๆ่ฆ
Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33\% in multiview NVS LLFF dataset, 60\% in dynamic NVS Neural 3D Video benchmark, 20\% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/
๐
2025-12-11
๐ IEEE/CVF Winter Conference on Applications of Comp
๐ฅ Madhav Agarwal, Mingtian Zhang, Laura Sevilla-Lara, Steven McDonagh
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods achieve high visual fidelity but slow or fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with oneshot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos where we report competitive quantitative and qualitative performance.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work seeks further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(ฮฑx + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yulu Gan, Ligeng Zhu, Dandan Shan, Baifeng Shi, Hongxu Yin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang et al.
Multimodal
ๆฅ็ๆ่ฆ
Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao et al.
Multimodal
ๆฅ็ๆ่ฆ
Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kehong Gong, Zhengyu Wen, Mingxi Xu, Weixia He, Qi Wang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: https://animotionlab.github.io/SWIT4D/
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka
Computer Vision
ๆฅ็ๆ่ฆ
6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/ .
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Leijie Wang, Otilia Stretcu, Wei Qiao, Thomas Denby, Krishnamurthy Viswanathan et al.
ๆฅ็ๆ่ฆ
From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called "Agile Deliberation" that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user's evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wang Lu, Jindong Wang
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
While computer vision and machine learning have made great progress, their robustness is still challenged by two key issues: data distribution shift and label noise. When domain generalization (DG) encounters noise, noisy labels further exacerbate the emergence of spurious features in deep layers, i.e. spurious feature enlargement, leading to a degradation in the performance of existing algorithms. This paper, starting from domain generalization, explores how to make existing methods rework when meeting noise. We find that the latent features inside the model have certain discriminative capabilities, and different latent features focus on different parts of the image. Based on these observations, we propose the Self-Ensemble Post Learning approach (SEPL) to diversify features which can be leveraged. Specifically, SEPL consists of two parts: feature probing training and prediction ensemble inference. It leverages intermediate feature representations within the model architecture, training multiple probing classifiers to fully exploit the capabilities of pre-trained models, while the final predictions are obtained through the integration of outputs from these diverse classification heads. Considering the presence of noisy labels, we employ semi-supervised algorithms to train probing classifiers. Given that different probing classifiers focus on different areas, we integrate their predictions using a crowdsourcing inference approach. Extensive experimental evaluations demonstrate that the proposed method not only enhances the robustness of existing methods but also exhibits significant potential for real-world applications with high flexibility.
๐
2025-12-11
๐ Submitted to JMLR, under review
๐ฅ Brian P. Powell, Jordan A. Caraballo-Vega, Mark L. Carroll, Thomas Maxwell, Andrew Ptak et al.
Machine Learning
ๆฅ็ๆ่ฆ
We report the discovery that binary encoding allows neural networks to extrapolate periodic functions beyond their training bounds. We introduce Normalized Base-2 Encoding (NB2E) as a method for encoding continuous numerical values and demonstrate that, using this input encoding, vanilla multi-layer perceptrons (MLP) successfully extrapolate diverse periodic signals without prior knowledge of their functional form. Internal activation analysis reveals that NB2E induces bit-phase representations, enabling MLPs to learn and extrapolate signal structure independently of position.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Masum Shah Junayed, John Derek Van Vessem, Qian Wan, Gahie Nam, Sheida Nabavi
Natural Language Processing
ๆฅ็ๆ่ฆ
Prostate cancer grading from whole-slide images (WSIs) remains a challenging task due to the large-scale nature of WSIs, the presence of heterogeneous tissue structures, and difficulty of selecting diagnostically relevant regions. Existing approaches often rely on random or static patch selection, leading to the inclusion of redundant or non-informative regions that degrade performance. To address this, we propose a Graph Laplacian Attention-Based Transformer (GLAT) integrated with an Iterative Refinement Module (IRM) to enhance both feature learning and spatial consistency. The IRM iteratively refines patch selection by leveraging a pretrained ResNet50 for local feature extraction and a foundation model in no-gradient mode for importance scoring, ensuring only the most relevant tissue regions are preserved. The GLAT models tissue-level connectivity by constructing a graph where patches serve as nodes, ensuring spatial consistency through graph Laplacian constraints and refining feature representations via a learnable filtering mechanism that enhances discriminative histological structures. Additionally, a convex aggregation mechanism dynamically adjusts patch importance to generate a robust WSI-level representation. Extensive experiments on five public and one private dataset demonstrate that our model outperforms state-of-the-art methods, achieving higher performance and spatial consistency while maintaining computational efficiency.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Akshay Kulkarni, Tsui-Wei Weng, Vivek Narayanaswamy, Shusen Liu, Wesam A. Sakla et al.
Computer Vision
ๆฅ็ๆ่ฆ
Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations; (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE) - a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its \textit{global} \revision{semantic} information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e. pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of \emph{spatial} information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in $<$4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT etc). %, etc. Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rene Lisasi, Michele Esposito, Chen Zhao
ๆฅ็ๆ่ฆ
Computational fluid dynamics (CFD) based simulation of coronary blood flow provides valuable hemodynamic markers, such as pressure gradients, for diagnosing coronary artery disease (CAD). However, CFD is computationally expensive, time-consuming, and difficult to integrate into large-scale clinical workflows. These limitations restrict the availability of labeled hemodynamic data for training AI models and hinder broad adoption of non-invasive, physiology based CAD assessment. To address these challenges, we develop an end to end pipeline that automates coronary geometry extraction from coronary computed tomography angiography (CCTA), streamlines simulation data generation, and enables efficient learning of coronary blood pressure distributions. The pipeline reduces the manual burden associated with traditional CFD workflows while producing consistent training data. We further introduce a diffusion-based regression model designed to predict coronary blood pressure directly from CCTA derived features, bypassing the need for slow CFD computation during inference. Evaluated on a dataset of simulated coronary hemodynamics, the proposed model achieves state of the art performance, with an R2 of 64.42%, a root mean squared error of 0.0974, and a normalized RMSE of 0.154, outperforming several baseline approaches. This work provides a scalable and accessible framework for rapid, non-invasive blood pressure prediction to support CAD diagnosis.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianyu Zhou, Junyi Tang, Zehui Li, Dahong Qian, Suncheng Xiang
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework leveraging multimodal large language models (MLLMs) for professional polyp diagnosis report generation. Specifically, we curate MMEndo, a multimodal endoscopic dataset comprising expert-annotated colonoscopy image-text pairs. We fine-tune the Qwen2-VL-7B backbone using Parameter-Efficient Fine-Tuning (LoRA) and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that our LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluations (achieving a Physician Score of 7.2/10), significantly reducing training computational costs by 833x compared to full fine-tuning. The proposed solution offers a scalable, clinically viable path for primary healthcare, with additional validation on the IU-XRay dataset confirming its robustness.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan-Ming Li, Qize Yang, Nan Lei, Shenghao Fu, Ling-An Zeng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. Extensive experiments demonstrate that: (i) Assessment and refinement tasks significantly improve text-motion alignment; (ii) Interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. Cross-evaluator testing further validates its effectiveness. Code & Data: https://github.com/HumanMLLM/IRG-MotionLLM/tree/main.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Luigi Piccinelli, Thiemo Wandel, Christos Sakaridis, Wim Abbeloos, Luc Van Gool
ๆฅ็ๆ่ฆ
Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from the large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive employs a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Matias Cosarinsky, Nicolas Gaggion, Rodrigo Echeveste, Enzo Ferrante
Machine LearningComputer VisionRobotics
ๆฅ็ๆ่ฆ
Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, enabling the identification of unreliable predictions and supporting human oversight. While prior work has largely focused on pixel-level uncertainty, landmark-based segmentation offers inherent topological guarantees yet remains underexplored from an uncertainty perspective. In this work, we study uncertainty estimation for anatomical landmark-based segmentation on chest X-rays. Inspired by hybrid neural network architectures that combine standard image convolutional encoders with graph-based generative decoders, and leveraging their variational latent space, we derive two complementary measures: (i) latent uncertainty, captured directly from the learned distribution parameters, and (ii) predictive uncertainty, obtained by generating multiple stochastic output predictions from latent samples. Through controlled corruption experiments we show that both uncertainty measures increase with perturbation severity, reflecting both global and local degradation. We demonstrate that these uncertainty signals can identify unreliable predictions by comparing with manual ground-truth, and support out-of-distribution detection on the CheXmask dataset. More importantly, we release CheXmask-U (huggingface.co/datasets/mcosarinsky/CheXmask-U), a large scale dataset of 657,566 chest X-ray landmark segmentations with per-node uncertainty estimates, enabling researchers to account for spatial variations in segmentation quality when using these anatomical masks. Our findings establish uncertainty estimation as a promising direction to enhance robustness and safe deployment of landmark-based anatomical segmentation methods in chest X-ray. A fully working interactive demo of the method is available at huggingface.co/spaces/matiasky/CheXmask-U and the source code at github.com/mcosarinsky/CheXmask-U.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Benjamin Gundersen, Nicolas Deperrois, Samuel Ruiperez-Campillo, Thomas M. Sutter, Julia E. Vogt et al.
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent advances in vision-language models (VLMs) have improved Chest X-ray (CXR) interpretation in multiple aspects. However, many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. In contrast, reinforcement learning (RL) can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks. To investigate the effects of RL and thinking in a CXR VLM, we perform large-scale SFT on CXR data to build an updated RadVLM based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We then apply Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and run matched RL experiments on both domain-specific and general-domain Qwen3-VL variants, with and without thinking. Across these settings, we find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both report generation and grounding, highlighting clinically aligned RL as a powerful complement to SFT for medical VLMs.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at https://github.com/apple/ml-sharp
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Romain Seailles, Jean-Baptiste Masson, Jean Ponce, Julien Mairal
Machine Learning
ๆฅ็ๆ่ฆ
Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores -- fluorescent molecules stained onto the observed specimen -- over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinders live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at https://github.com/RSLLES/SHOT.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao et al.
Robotics
ๆฅ็ๆ่ฆ
Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Javier Villena Toro, Mehdi Tarkian
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Recent progress in zero-shot 6D object pose estimation has been driven largely by large-scale models and cloud-based inference. However, these approaches often introduce high latency, elevated energy consumption, and deployment risks related to connectivity, cost, and data governance; factors that conflict with the practical constraints of real-world robotics, where compute is limited and on-device inference is frequently required. We introduce Geo6DPose, a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation that trades model scale for geometric reliability. Our method combines foundation model visual features with a geometric filtering strategy: Similarity maps are computed between onboarded template DINO descriptors and scene patches, and mutual correspondences are established by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Final poses are recovered via correspondence-driven RANSAC and ranked using a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, improving robustness to noise, clutter, and partial visibility. Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones, advancing practical, fully local 6D perception for robotic deployment.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingxuan Zhang, Tianqi Yu, Yatu Zhang, Jinze Wu, Kaixin Yao et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
A deep understanding of the physical world is a central goal for embodied AI and realistic simulation. While current models excel at capturing an object's surface geometry and appearance, they largely neglect its internal physical properties. This omission is critical, as properties like volumetric density are fundamental for predicting an object's center of mass, stability, and interaction dynamics in applications ranging from robotic manipulation to physical simulation. The primary bottleneck has been the absence of large-scale, real-world data. To bridge this gap, we introduce XDen-1K, the first large-scale, multi-modal dataset designed for real-world physical property estimation, with a particular focus on volumetric density. The core of this dataset consists of 1,000 real-world objects across 148 categories, for which we provide comprehensive multi-modal data, including a high-resolution 3D geometric model with part-level annotations and a corresponding set of real-world biplanar X-ray scans. Building upon this data, we introduce a novel optimization framework that recovers a high-fidelity volumetric density field of each object from its sparse X-ray views. To demonstrate its practical value, we add X-ray images as a conditioning signal to an existing segmentation network and perform volumetric segmentation. Furthermore, we conduct experiments on downstream robotics tasks. The results show that leveraging the dataset can effectively improve the accuracy of center-of-mass estimation and the success rate of robotic manipulation. We believe XDen-1K will serve as a foundational resource and a challenging new benchmark, catalyzing future research in physically grounded visual inference and embodied AI.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hanfeng Wu, Marlon Steiner, Michael Schmidt, Alvaro Marcos-Ramiro, Christoph Stiller
Robotics
ๆฅ็ๆ่ฆ
The complexity of autonomous driving scenarios requires robust models that can interpret high-level navigation commands and generate safe trajectories. While traditional rule-based systems can react to these commands, they often struggle in dynamic environments, and end-to-end methods face challenges in complying with explicit navigation commands. To address this, we present NaviHydra, a controllable navigation-guided end-to-end model distilled from an existing rule-based simulator. Our framework accepts high-level navigation commands as control signals, generating trajectories that align with specified intentions. We utilize a Bird's Eye View (BEV) based trajectory gathering method to enhance the trajectory feature extraction. Additionally, we introduce a novel navigation compliance metric to evaluate adherence to intended route, improving controllability and navigation safety. To comprehensively assess our model's controllability, we design a test that evaluates its response to various navigation commands. Our method significantly outperforms baseline models, achieving state-of-the-art results in the NAVSIM benchmark, demonstrating its effectiveness in advancing autonomous driving.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou, Ling Lo, Sheng-Ping Yang et al.
MultimodalMachine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bishoy Galoaa, Pau Closas, Sarah Ostadabbas
Machine LearningComputer VisionRobotics
ๆฅ็ๆ่ฆ
Point tracking in video sequences is a foundational capability for real-world computer vision applications, including robotics, autonomous systems, augmented reality, and video analysis. While recent deep learning-based trackers achieve state-of-the-art accuracy on challenging benchmarks, their reliance on per-frame GPU inference poses a major barrier to deployment on resource-constrained edge devices, where compute, power, and connectivity are limited. We introduce K-Track (Kalman-enhanced Tracking), a general-purpose, tracker-agnostic acceleration framework designed to bridge this deployment gap. K-Track reduces inference cost by combining sparse deep learning keyframe updates with lightweight Kalman filtering for intermediate frame prediction, using principled Bayesian uncertainty propagation to maintain temporal coherence. This hybrid strategy enables 5-10X speedup while retaining over 85% of the original trackers' accuracy. We evaluate K-Track across multiple state-of-the-art point trackers and demonstrate real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan. By preserving accuracy while dramatically lowering computational requirements, K-Track provides a practical path toward deploying high-quality point tracking in real-world, resource-limited settings, closing the gap between modern tracking algorithms and deployable vision systems.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qintong Zhang, Junyuan Zhang, Zhifei Ren, Linke Ouyang, Zichen Wen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to enable the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models. Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: https://github.com/ZZZZZQT/DOCR-Inspector.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohammad Sadegh Gholizadeh, Amir Arsalan Rezapour
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
In recent years, the incidence of vision-threatening eye diseases has risen dramatically, necessitating scalable and accurate screening solutions. This paper presents a comprehensive study on deep learning architectures for the automated diagnosis of ocular conditions. To mitigate the "black-box" limitations of standard convolutional neural networks (CNNs), we implement a pipeline that combines deep feature extraction with interpretable image processing modules. Specifically, we focus on high-fidelity retinal vessel segmentation as an auxiliary task to guide the classification process. By grounding the model's predictions in clinically relevant morphological features, we aim to bridge the gap between algorithmic output and expert medical validation, thereby reducing false positives and improving deployment viability in clinical settings.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bishoy Galoaa, Sarah Ostadabbas
MultimodalRobotics
ๆฅ็ๆ่ฆ
We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 JF for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ J. Xiao, Y. Guo, X. Zi, K. Thiyagarajan, C. Moreira et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the \textquote{semantic gap}, the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62\%, nearly doubling the 23.86\% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Quan Chen, Xiaokai Yang, Tingyu Wang, Rongfeng Lu, Xichun Sheng et al.
Computer Vision
ๆฅ็ๆ่ฆ
Salient object detection (SOD), a foundational task in computer vision, has advanced from single-modal to multi-modal paradigms to enhance generalization. However, most existing SOD methods assume low-noise visual conditions, overlooking the degradation of segmentation accuracy caused by weather-induced noise in real-world scenarios. In this paper, we propose a SOD framework tailored for diverse weather conditions, encompassing a specific encoder and a replaceable decoder. To enable handling of varying weather noises, we introduce a one-hot vector as a noise indicator to represent different weather types and design a Noise Indicator Fusion Module (NIFM). The NIFM takes both semantic features and the noise indicator as dual inputs and is inserted between consecutive stages of the encoder to embed weather-aware priors via adaptive feature modulation. Critically, the proposed specific encoder retains compatibility with mainstream SOD decoders. Extensive experiments are conducted on the WXSOD dataset under varying training data scales (100%, 50%, 30% of the full training set), three encoder and seven decoder configurations. Results show that the proposed SOD framework (particularly the NIFM-enhanced specific encoder) improves segmentation accuracy under complex weather conditions compared to a vanilla encoder.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenlong Jiao, Heyang Lee, Ping Wang, Pengfei Zhu, Qinghua Hu et al.
Multimodal
ๆฅ็ๆ่ฆ
All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at https://github.com/WenlongJiao/SymUNet.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haojie Zheng, Shuchen Weng, Jingqi Liu, Siqi Yang, Boxin Shi et al.
Robotics
ๆฅ็ๆ่ฆ
Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: https://hjzheng.net/projects/AVI-Edit/.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meher Md Saad
ๆฅ็ๆ่ฆ
Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, where gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a Few-Shot Prototypical Network framework adapted for a skeleton based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset demonstrate the superiority of this metric learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline sharing the identical backbone architecture by over 13%, proving that the prototypical training strategy effectively outperforms in a data scarce situation where standard classification fails. Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang et al.
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requries tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sai Bharath Chandra Gutha, Ricardo Vinuesa, Hossein Azizpour
ๆฅ็ๆ่ฆ
A pre-trained unconditional diffusion model, combined with posterior sampling or maximum a posteriori (MAP) estimation techniques, can solve arbitrary inverse problems without task-specific training or fine-tuning. However, existing posterior sampling and MAP estimation methods often rely on modeling approximations and can be computationally demanding. In this work, we propose the variational mode-seeking loss (VML), which, when minimized during each reverse diffusion step, guides the generated sample towards the MAP estimate. VML arises from a novel perspective of minimizing the Kullback-Leibler (KL) divergence between the diffusion posterior $p(\mathbf{x}_0|\mathbf{x}_t)$ and the measurement posterior $p(\mathbf{x}_0|\mathbf{y})$, where $\mathbf{y}$ denotes the measurement. Importantly, for linear inverse problems, VML can be analytically derived and need not be approximated. Based on further theoretical insights, we propose VML-MAP, an empirically effective algorithm for solving inverse problems, and validate its efficacy over existing methods in both performance and computational time, through extensive experiments on diverse image-restoration tasks across multiple datasets.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pasquale De Marinis, Gennaro Vessio, Giovanna Castellano
ๆฅ็ๆ่ฆ
Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce \textit{Take a Peek} (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS (CD-FSS). TaP leverages Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks--including COCO $20^i$, Pascal $5^i$, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray--demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, ensuring computational efficiency. By addressing a critical limitation in FSS--the encoder's generalization to novel classes--TaP paves the way toward more robust, efficient, and generalizable segmentation systems. The code is available at https://github.com/pasqualedem/TakeAPeek.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maurice Rohr, Tobias Reinhardt, Tizian Dege, Justus Thies, Christoph Hoog Antink
ๆฅ็ๆ่ฆ
We present Pulse3DFace, the first dataset of its kind for estimating 3D blood pulsation maps. These maps can be used to develop models of dynamic facial blood pulsation, enabling the creation of synthetic video data to improve and validate remote pulse estimation methods via photoplethysmography imaging. Additionally, the dataset facilitates research into novel multi-view-based approaches for mitigating illumination effects in blood pulsation analysis. Pulse3DFace consists of raw videos from 15 subjects recorded at 30 Hz with an RGB camera from 23 viewpoints, blood pulse reference measurements, and facial 3D scans generated using monocular structure-from-motion techniques. It also includes processed 3D pulsation maps compatible with the texture space of the 3D head model FLAME. These maps provide signal-to-noise ratio, local pulse amplitude, phase information, and supplementary data. We offer a comprehensive evaluation of the dataset's illumination conditions, map consistency, and its ability to capture physiologically meaningful features in the facial and neck skin regions.
๐
2025-12-11
๐ Accepted to IJCV
๐ฅ Khurram Ashfaq, Muhammad Tariq Mahmood
Machine Learning
ๆฅ็ๆ่ฆ
Shape-from-Focus (SFF) is a passive depth estimation technique that infers scene depth by analyzing focus variations in a focal stack. Most recent deep learning-based SFF methods typically operate in two stages: first, they extract focus volumes (a per pixel representation of focus likelihood across the focal stack) using heavy feature encoders; then, they estimate depth via a simple one-step aggregation technique that often introduces artifacts and amplifies noise in the depth map. To address these issues, we propose a hybrid framework. Our method computes multi-scale focus volumes traditionally using handcrafted Directional Dilated Laplacian (DDL) kernels, which capture long-range and directional focus variations to form robust focus volumes. These focus volumes are then fed into a lightweight, multi-scale GRU-based depth extraction module that iteratively refines an initial depth estimate at a lower resolution for computational efficiency. Finally, a learned convex upsampling module within our recurrent network reconstructs high-resolution depth maps while preserving fine scene details and sharp boundaries. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach outperforms state-of-the-art deep learning and traditional methods, achieving superior accuracy and generalization across diverse focal conditions.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Han Li, Shaohui Li, Wenrui Dai, Chenglin Li, Xinlong Pan et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Existing frameworks for learned video compression suffer from a dilemma between inaccurate temporal alignment and error propagation for motion estimation and compensation (ME/MC). The separate-transform framework employs distinct transforms for intra-frame and inter-frame compression to yield impressive rate-distortion (R-D) performance but causes evident error propagation, while the unified-transform framework eliminates error propagation via shared transforms but is inferior in ME/MC in shared latent domains. To address this limitation, in this paper, we propose a novel unifiedtransform framework with dual-domain progressive temporal alignment and quality-conditioned mixture-of-expert (QCMoE) to enable quality-consistent and error-propagation-free streaming for learned video compression. Specifically, we propose dualdomain progressive temporal alignment for ME/MC that leverages coarse pixel-domain alignment and refined latent-domain alignment to significantly enhance temporal context modeling in a coarse-to-fine fashion. The coarse pixel-domain alignment efficiently handles simple motion patterns with optical flow estimated from a single reference frame, while the refined latent-domain alignment develops a Flow-Guided Deformable Transformer (FGDT) over latents from multiple reference frames to achieve long-term motion refinement (LTMR) for complex motion patterns. Furthermore, we design a QCMoE module for continuous bit-rate adaptation that dynamically assigns different experts to adjust quantization steps per pixel based on target quality and content rather than relies on a single quantization step. QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves competitive R-D performance compared with the state-of-the-arts, while successfully eliminating error propagation.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Stylianos Kandylakis, Christos Orfanopoulos, Georgios Siolas, Panayiotis Tsanakas
Machine Learning
ๆฅ็ๆ่ฆ
This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiao Chen, Zhongjing Du, Jiazhen Huang, Xu Jiang, Li Lu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample's feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment in adaptation which exacerbates under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets to mitigate the impact of unreliable pseudo-labels, which blends geometric proximity with predictive confidence. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52% on ImageNet-C.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Phu Pham, Damon Conover, Aniket Bera
Machine Learning
ๆฅ็ๆ่ฆ
Aerial-ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial-ground localization in both synthetic and real-world settings.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors.This work addresses these limitations in two complementary ways. First, we release WildRoad, a gloabal off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly.Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiyue Sun, Tailin Chen, Yinghui Zhang, Yuchen Zhang, Jiangbei Yue et al.
Multimodal
ๆฅ็ๆ่ฆ
The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ge Zhang, Chunyang Wang, Bo Xiao, Xuelian Liu, Bin Liu
Computer VisionRobotics
ๆฅ็ๆ่ฆ
High-quality point cloud data is a critical foundation for tasks such as autonomous driving and 3D reconstruction. However, LiDAR-based point cloud acquisition is often affected by various disturbances, resulting in a large number of noise points that degrade the accuracy of subsequent point cloud object detection and recognition. Moreover, existing point cloud denoising methods typically sacrifice computational efficiency in pursuit of higher denoising accuracy, or, conversely, improve processing speed at the expense of preserving object boundaries and fine structural details, making it difficult to simultaneously achieve high denoising accuracy, strong edge preservation, and real-time performance. To address these limitations, this paper proposes an adaptive dual-weight gravitational-based point cloud denoising method. First, an octree is employed to perform spatial partitioning of the global point cloud, enabling parallel acceleration. Then, within each leaf node, adaptive voxel-based occupancy statistics and k-nearest neighbor (kNN) density estimation are applied to rapidly remove clearly isolated and low-density noise points, thereby reducing the effective candidate set. Finally, a gravitational scoring function that combines density weights with adaptive distance weights is constructed to finely distinguish noise points from object points. Experiments conducted on the Stanford 3D Scanning Repository, the Canadian Adverse Driving Conditions (CADC) dataset, and in-house FMCW LiDAR point clouds acquired in our laboratory demonstrate that, compared with existing methods, the proposed approach achieves consistent improvements in F1, PSNR, and Chamfer Distance (CD) across various noise conditions while reducing the single-frame processing time, thereby validating its high accuracy, robustness, and real-time performance in multi-noise scenarios.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, Xin Lou
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: \textit{data construction} and \textit{training process}, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1\% and open-world data boosts FROW benchmark accuracy by 10\%-20\% and content accuracy by 6\%-12\%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10\%. The benchmark will be available at https://github.com/pc-inno/FROW.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alberto Rota, Elena De Momi
Machine LearningNatural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Accurate spatial understanding is essential for image-guided surgery, augmented reality integration and context awareness. In minimally invasive procedures, where visual input is the sole intraoperative modality, establishing precise pixel-level correspondences between endoscopic frames is critical for 3D reconstruction, camera tracking, and scene interpretation. However, the surgical domain presents distinct challenges: weak perspective cues, non-Lambertian tissue reflections, and complex, deformable anatomy degrade the performance of conventional computer vision techniques. While Deep Learning models have shown strong performance in natural scenes, their features are not inherently suited for fine-grained matching in surgical images and require targeted adaptation to meet the demands of this domain. This research presents a novel Deep Learning pipeline for establishing feature correspondences in endoscopic image pairs, alongside a self-supervised optimization framework for model training. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences, subsequently utilized for mining triplets within a contrastive learning paradigm. Through this self-supervised approach, we augment the DINOv2 backbone with an additional Transformer layer, specifically optimized to produce embeddings that facilitate direct matching through cosine similarity thresholding. Experimental evaluation demonstrates that our pipeline surpasses state-of-the-art methodologies on the SCARED datasets improved matching precision and lower epipolar error compared to the related work. The proposed framework constitutes a valuable contribution toward enabling more accurate high-level computer vision applications in surgical endoscopy.
๐
2025-12-11
๐ AAAI
๐ฅ Jingyun Fu, Zhiyu Xiang, Na Zhao
Multimodal
ๆฅ็ๆ่ฆ
Recent multimodal fusion methods, integrating images with LiDAR point clouds, have shown promise in scene flow estimation. However, the fusion of 4D millimeter wave radar and LiDAR remains unexplored. Unlike LiDAR, radar is cheaper, more robust in various weather conditions and can detect point-wise velocity, making it a valuable complement to LiDAR. However, radar inputs pose challenges due to noise, low resolution, and sparsity. Moreover, there is currently no dataset that combines LiDAR and radar data specifically for scene flow estimation. To address this gap, we construct a Radar-LiDAR scene flow dataset based on a public real-world automotive dataset. We propose an effective preprocessing strategy for radar denoising and scene flow label generation, deriving more reliable flow ground truth for radar points out of the object boundaries. Additionally, we introduce RaLiFlow, the first joint scene flow learning framework for 4D radar and LiDAR, which achieves effective radar-LiDAR fusion through a novel Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a carefully designed set of loss functions. The DBCF module integrates dynamic cues from radar into the local cross-attention mechanism, enabling the propagation of contextual information across modalities. Meanwhile, the proposed loss functions mitigate the adverse effects of unreliable radar data during training and enhance the instance-level consistency in scene flow predictions from both modalities, particularly for dynamic foreground areas. Extensive experiments on the repurposed scene flow dataset demonstrate that our method outperforms existing LiDAR-based and radar-based single-modal methods by a significant margin.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhankuo Xu, Chaoran Feng, Yingtao Li, Jianbin Zhao, Jiashu Yang et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views ignore the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using as few as 3, 6, and 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task. The code and video demos are available at https://potatobigroom.github.io/CoherentGS/.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingyu Jeon, Jisoo Yang, Sungjin Han, Jinkwon Hwang, Sunjae Yoon et al.
ๆฅ็ๆ่ฆ
Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose \textbf{P}oint-\textbf{to}-\textbf{S}pan (P2S), a novel training-free framework to overcome this challenge of inefficient 'search' and costly 'refine' phases. P2S overcomes these challenges with two key innovations: an 'Adaptive Span Generator' to prevent the search phase candidate explosion, and 'Query Decomposition' to refine candidates without relying on high-cost VLM verification. To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7\% on R5@0.1 on MAD).
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.
๐
2025-12-11
๐ NeurIPS 2025
๐ฅ Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang
MultimodalNatural Language ProcessingComputer VisionRobotics
ๆฅ็ๆ่ฆ
Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle with simultaneously modeling spatial relationships within video frames and understanding the causal dynamics of temporal evolution on complex and reasoning-intensive VideoQA task. In this work, we equip MLLM with a comprehensive and extensible Video Toolkit, to enhance MLLM's spatiotemporal reasoning capabilities and ensure the harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework make an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tarik Reza Toha, Shao-Jung, Lu, Shahriar Nirjon
ๆฅ็ๆ่ฆ
mmWave radars struggle to detect or count individuals in dense, static (non-moving) groups due to limitations in spatial resolution and reliance on movement for detection. We present mmCounter, which accurately counts static people in dense indoor spaces (up to three people per square meter). mmCounter achieves this by extracting ultra-low frequency (< 1 Hz) signals, primarily from breathing and micro-scale body movements such as slight torso shifts, and applying novel signal processing techniques to differentiate these subtle signals from background noise and nearby static objects. Our problem differs significantly from existing studies on breathing rate estimation, which assume the number of people is known a priori. In contrast, mmCounter utilizes a novel multi-stage signal processing pipeline to extract relevant low-frequency sources along with their spatial information and map these sources to individual people, enabling accurate counting. Extensive evaluations in various environments demonstrate that mmCounter delivers an 87% average F1 score and 0.6 mean absolute error in familiar environments, and a 60% average F1 score and 1.1 mean absolute error in previously untested environments. It can count up to seven individuals in a three square meter space, such that there is no side-by-side spacing and only a one-meter front-to-back distance.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiheng Lyu, Lian Xu, Mohammed Bennamoun, Farid Boussaid, Coen Arrow et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Weakly supervised semantic segmentation offers a label-efficient solution to train segmentation models for volumetric medical imaging. However, existing approaches often rely on 2D encoders that neglect the inherent volumetric nature of the data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context for weakly supervised volumetric medical segmentation. TranSamba augments a standard Vision Transformer backbone with Cross-Plane Mamba blocks, which leverage the linear complexity of state space models for efficient information exchange across neighboring slices. The information exchange enhances the pairwise self-attention within slices computed by the Transformer blocks, directly contributing to the attention maps for object localization. TranSamba achieves effective volumetric modeling with time complexity that scales linearly with the input volume depth and maintains constant memory usage for batch processing. Extensive experiments on three datasets demonstrate that TranSamba establishes new state-of-the-art performance, consistently outperforming existing methods across diverse modalities and pathologies. Our source code and trained models are openly accessible at: https://github.com/YihengLyu/TranSamba.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Keyi Chen, Mingze Sun, Zhenyu Liu, Zhangquan Chen, Ruqi Huang
MultimodalRobotics
ๆฅ็ๆ่ฆ
Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods - the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shresth Grover, Priyank Pathak, Akash Kumar, Vibhav Vineet, Yogesh S Rawat
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction,and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal action) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g. Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi-Cheng Liao, Shyang-En Weng, Yu-Syuan Xu, Chi-Wei Hsiao, Wei-Chen Chiu et al.
Multimodal
ๆฅ็ๆ่ฆ
Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose \textbf{HD-CLIP} (\textbf{H}ierarchical \textbf{D}egradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrated it into diffusion models via classifier-free guidance (CFG) and proposed classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a \textbf{plug-and-play module}, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi Liu, Yichi Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Thin and elongated filamentous structures, such as microtubules and actin filaments, often play important roles in biological systems. Segmenting these filaments in biological images is a fundamental step for quantitative analysis. Recent advances in deep learning have significantly improved the performance of filament segmentation. However, there is a big challenge in acquiring high quality pixel-level annotated dataset for filamentous structures, as the dense distribution and geometric properties of filaments making manual annotation extremely laborious and time-consuming. To address the data shortage problem, we propose a conditional generative framework based on the Pix2Pix architecture to generate realistic filaments in microscopy images from binary masks. We also propose a filament-aware structural loss to improve the structure similarity when generating synthetic images. Our experiments have demonstrated the effectiveness of our approach and outperformed existing model trained without synthetic data.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cai Xu, Jinlong Liu, Yilin Zhang, Ziyu Guan, Wei Zhao
ๆฅ็ๆ่ฆ
Incomplete multi-view data, where different views suffer from missing and unbalanced observations, pose significant challenges for clustering. Existing imputation-based methods attempt to estimate missing views to restore data associations, but indiscriminate imputation often introduces noise and bias, especially when the available information is insufficient. Imputation-free methods avoid this risk by relying solely on observed data, but struggle under severe incompleteness due to the lack of cross-view complementarity. To address this issue, we propose Informativeness-based Selective imputation Multi-View Clustering (ISMVC). Our method evaluates the imputation-relevant informativeness of each missing position based on intra-view similarity and cross-view consistency, and selectively imputes only when sufficient support is available. Furthermore, we integrate this selection with a variational autoencoder equipped with a mixture-of-Gaussians prior to learn clustering-friendly latent representations. By performing distribution-level imputation, ISMVC not only stabilizes the aggregation of posterior distributions but also explicitly models imputation uncertainty, enabling robust fusion and preventing overconfident reconstructions. Compared with existing cautious imputation strategies that depend on training dynamics or model feedback, our method is lightweight, data-driven, and model-agnostic. It can be readily integrated into existing IMC models as a plug-in module. Extensive experiments on multiple benchmark datasets under a more realistic and challenging unbalanced missing scenario demonstrate that our method outperforms both imputation-based and imputation-free approaches.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiawen Li, Jiali Hu, Xitong Ling, Yongqiang Lv, Yuxuan Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Foundation models trained with self-supervised learning (SSL) on large-scale histological images have significantly accelerated the development of computational pathology. These models can serve as backbones for region-of-interest (ROI) image analysis or patch-level feature extractors in whole-slide images (WSIs) based on multiple instance learning (MIL). Existing pathology foundation models (PFMs) are typically pre-trained on Hematoxylin-Eosin (H&E) stained pathology images. However, images with special stains, such as immunohistochemistry, are also frequently used in clinical practice. PFMs pre-trained mainly on H\&E-stained images may be limited in clinical applications involving special stains. To address this issue, we propose StainNet, a specialized foundation model for special stains based on the vision transformer (ViT) architecture. StainNet adopts a self-distillation SSL approach and is trained on over 1.4 million patch images cropping from 20,231 publicly available special staining WSIs in the HISTAI database. To evaluate StainNet, we conduct experiments on an in-house slide-level liver malignancy classification task and two public ROI-level datasets to demonstrate its strong ability. We also perform few-ratio learning and retrieval evaluations, and compare StainNet with recently larger PFMs to further highlight its strengths. We have released the StainNet model weights at: https://huggingface.co/JWonderLand/StainNet.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and reduces tokens from an entire combined pool of audio-visual tokens rather than using fixed budgets per modality. This single-pool approach allows it to adaptively allocate the token budget across both modalities and dynamically identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) to maintain critical temporal relationships for the sparsely selected tokens. Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hyunsoo Lee, Daeum Jeon, Hyeokjae Oh
Computer Vision
ๆฅ็ๆ่ฆ
We propose a novel generative approach for 3D human pose estimation. 3D human pose estimation poses several key challenges due to the complex geometry of the human body, self-occluding joints, and the requirement for large-scale real-world motion datasets. To address these challenges, we introduce Point2Pose, a framework that effectively models the distribution of human poses conditioned on sequential point cloud and pose history. Specifically, we employ a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor. Additionally, we present a large-scale indoor dataset MVPose3D, which contains multiple modalities, including IMU data of non-trivial human motions, dense multi-view point clouds, and RGB images. Experimental results show that the proposed method outperforms the baseline models, demonstrating its superior performance across various datasets.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Muhammad Usama, Muhammad Ibrahim Khan, Ahmad Hasan, Muhammad Shaaf Nadeem, Khawaja Fahad Iqbal et al.
Robotics
ๆฅ็ๆ่ฆ
Mobile robots are increasingly utilized in agriculture to automate labor-intensive tasks such as weeding, sowing, harvesting and soil analysis. Recently, agricultural robots have been developed to detect and remove weeds using mechanical tools or precise herbicide sprays. Mechanical weeding is inefficient over large fields, and herbicides harm the soil ecosystem. Laser weeding with mobile robots has emerged as a sustainable alternative in precision farming. In this paper, we present an autonomous weeding robot that uses controlled exposure to a low energy laser beam for weed removal. The proposed robot is six-wheeled with a novel double four-bar suspension for higher stability. The laser is guided towards the detected weeds by a three-dimensional linear actuation mechanism. Field tests have demonstrated the robot's capability to navigate agricultural terrains effectively by overcoming obstacles up to 15 cm in height. At an optimal speed of 42.5 cm/s, the robot achieves a weed detection rate of 86.2\% and operating time of 87 seconds per meter. The laser actuation mechanism maintains a minimal mean positional error of 1.54 mm, combined with a high hit rate of 97\%, ensuring effective and accurate weed removal. This combination of speed, accuracy, and efficiency highlights the robot's potential for significantly enhancing precision farming practices.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Khang Le, Ha Thach, Anh M. Vu, Trang T. K. Vo, Han H. Huynh et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on BCSS-WSSS datasets demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anh M. Vu, Khang P. Le, Trang T. K. Vo, Ha Thach, Huy Hung Nguyen et al.
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off. Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Karthikeya KV, Narendra Bandaru
ๆฅ็ๆ่ฆ
We introduce Disentangled360, an innovative 3D-aware technology that integrates the advantages of direction disentangled volume rendering with single-image 360ยฐ unique view synthesis for applications in medical imaging and natural scene reconstruction. In contrast to current techniques that either oversimplify anisotropic light behavior or lack generalizability across various contexts, our framework distinctly differentiates between isotropic and anisotropic contributions inside a Gaussian Splatting backbone. We implement a dual-branch conditioning framework, one optimized for CT intensity driven scattering in volumetric data and the other for real-world RGB scenes through normalized camera embeddings. To address scale ambiguity and maintain structural realism, we present a hybrid pose agnostic anchoring method that adaptively samples scene depth and material transitions, functioning as stable pivots during scene distillation. Our design integrates preoperative radiography simulation and consumer-grade 360ยฐ rendering into a singular inference pipeline, facilitating rapid, photorealistic view synthesis with inherent directionality. Evaluations on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets indicate superior SSIM and LPIPS performance, while runtime assessments confirm its viability for interactive applications. Disentangled360 facilitates mixed-reality medical supervision, robotic perception, and immersive content creation, eliminating the necessity for scene-specific finetuning or expensive photon simulations.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoxue Wu, Xinyuan Chen, Yaohui Wang, Yu Qiao
Robotics
ๆฅ็ๆ่ฆ
Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, Dong Yu
Robotics
ๆฅ็ๆ่ฆ
We introduce MotionEdit, a novel dataset for motion-centric image editing-the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation.
To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongsin Lee, Hye Won Chung
ๆฅ็ๆ่ฆ
Adversarial distillation in the standard min-max adversarial training framework aims to transfer adversarial robustness from a large, robust teacher network to a compact student. However, existing work often neglects to incorporate state-of-the-art robust teachers. Through extensive analysis, we find that stronger teachers do not necessarily yield more robust students-a phenomenon known as robust saturation. While typically attributed to capacity gaps, we show that such explanations are incomplete. Instead, we identify adversarial transferability-the fraction of student-crafted adversarial examples that remain effective against the teacher-as a key factor in successful robustness transfer. Based on this insight, we propose Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights training examples by their measured transferability without incurring additional computational cost. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD consistently improves AutoAttack robustness over prior methods. Our code is available at https://github.com/HongsinLee/saad.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chen Ziwen, Hao Tan, Peng Wang, Zexiang Xu, Li Fuxin
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360ยฐ scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuetong Su, Baoguo Wei, Xinyu Wang, Xu Li, Lixin Li
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Novel Class Discovery aims to utilise prior knowledge of known classes to classify and discover unknown classes from unlabelled data. Existing NCD methods for images primarily rely on visual features, which suffer from limitations such as insufficient feature discriminability and the long-tail distribution of data. We propose LLM-NCD, a multimodal framework that breaks this bottleneck by fusing visual-textual semantics and prototype guided clustering. Our key innovation lies in modelling cluster centres and semantic prototypes of known classes by jointly optimising known class image and text features, and a dualphase discovery mechanism that dynamically separates known or novel samples via semantic affinity thresholds and adaptive clustering. Experiments on the CIFAR-100 dataset show that compared to the current methods, this method achieves up to 25.3% improvement in accuracy for unknown classes. Notably, our method shows unique resilience to long tail distributions, a first in NCD literature.
๐
2025-12-11
๐ ICCV 2025
๐ฅ Rui Wang, Yimu Sun, Jingxing Guo, Huisi Wu, Jing Qin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Accurate segmentation of cardiac chambers in echocardiography sequences is crucial for the quantitative analysis of cardiac function, aiding in clinical diagnosis and treatment. The imaging noise, artifacts, and the deformation and motion of the heart pose challenges to segmentation algorithms. While existing methods based on convolutional neural networks, Transformers, and space-time memory networks have improved segmentation accuracy, they often struggle with the trade-off between capturing long-range spatiotemporal dependencies and maintaining computational efficiency with fine-grained feature representation. In this paper, we introduce GDKVM, a novel architecture for echocardiography video segmentation. The model employs Linear Key-Value Association (LKVA) to effectively model inter-frame correlations, and introduces Gated Delta Rule (GDR) to efficiently store intermediate memory states. Key-Pixel Feature Fusion (KPFF) module is designed to integrate local and global features at multiple scales, enhancing robustness against boundary blurring and noise interference. We validated GDKVM on two mainstream echocardiography video datasets (CAMUS and EchoNet-Dynamic) and compared it with various state-of-the-art methods. Experimental results show that GDKVM outperforms existing approaches in terms of segmentation accuracy and robustness, while ensuring real-time performance. Code is available at https://github.com/wangrui2025/GDKVM.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Eunho Lee, Chaehyeon Song, Seunghoon Jeong, Ayoung Kim
Computer Vision
ๆฅ็ๆ่ฆ
Category-level object pose estimation requires both global context and local structure to ensure robustness against intra-class variations. However, 3D graph convolution (3D-GC) methods only focus on local geometry and depth information, making them vulnerable to complex objects and visual ambiguities. To address this, we present THE-Pose, a novel category-level 6D pose estimation framework that leverages a topological prior via surface embedding and hybrid graph fusion. Specifically, we extract consistent and invariant topological features from the image domain, effectively overcoming the limitations inherent in existing 3D-GC based methods. Our Hybrid Graph Fusion (HGF) module adaptively integrates the topological features with point-cloud features, seamlessly bridging 2D image context and 3D geometric structure. These fused features ensure stability for unseen or complicated objects, even under significant occlusions. Extensive experiments on the REAL275 dataset show that THE-Pose achieves a 35.8% improvement over the 3D-GC baseline (HS-Pose) and surpasses the previous state-of-the-art by 7.2% across all key metrics. The code is avaialbe on https://github.com/EHxxx/THE-Pose
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work seeks further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(ฮฑx + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ George Yakushev, Nataliia Babina, Masoud Vahid Dastgerdi, Vyacheslav Zhdanovskiy, Alina Shutova et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities and safety, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embedded assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of rotary embeddings to enable LLMs built for sequential interactions to simultaneously think, listen, and generate outputs. We evaluate our approach on math, commonsense, and safety reasoning and find that it can generate accurate thinking-augmented answers in real time, reducing time to first non-thinking token from minutes to <= 5s. and the overall real-time delays by 6-11x.
๐
2025-12-11
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Yiyang Wang, Chen Chen, Tica Lin, Vishnu Raj, Josh Kimball et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio. Distinctly, CompanionCast integrates an LLM-as-a-Judge module that iteratively scores and refines conversations across five dimensions (relevance, authenticity, engagement, diversity, personality consistency). We validate this framework through sports viewing, a domain with rich dynamics and strong social traditions, where a pilot study with soccer fans suggests that multi-agent interaction improves perceived social presence compared to solo viewing. We contribute: (1) a generalizable framework for orchestrating multi-agent conversations around multimodal video content, (2) a novel evaluator-agent pipeline for conversation quality control, and (3) exploratory evidence of increased social presence in AI-mediated co-viewing. We discuss challenges and future directions for applying this approach to diverse viewing contexts including entertainment, education, and collaborative watching experiences.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hauke Licht
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Emotions are central to politics and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze the display of emotions, the emergence of multimodal generative AI promises great advances. However, we lack evidence about the effectiveness of multimodal AI in emotion analysis. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in video-based analysis of emotional arousal in two complementary data sets of human-labeled video recordings. I find that under ideal circumstances, mLLMs' emotional arousal ratings are highly reliable and show little to know indication of demographic bias. However, in recordings of speakers in real-world parliamentary debates, mLLMs' arousal ratings fail to deliver on this promise with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a suitable replicable framework.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lilin Qiu
Natural Language Processing
ๆฅ็ๆ่ฆ
This study analyzes the emotional tone of dialogue in J. R. R. Tolkien's The Hobbit (1937) using computational text analysis. Dialogue was extracted with regular expressions, then preprocessed, and scored using the NRC-VAD lexicon to quantify emotional dimensions. The results show that the dialogue maintains a generally positive (high valence) and calm (low arousal) tone, with a gradually increasing sense of agency (dominance) as the story progresses. These patterns reflect the novel's emotional rhythm: moments of danger and excitement are regularly balanced by humor, camaraderie, and relief. Visualizations -- including emotional trajectory graphs and word clouds -- highlight how Tolkien's language cycles between tension and comfort. By combining computational tools with literary interpretation, this study demonstrates how digital methods can uncover subtle emotional structures in literature, revealing the steady rhythm and emotional modulation that shape the storytelling in The Hobbit.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michael Schlee, Christoph Weisser, Timo Kivimรคki, Melchizedek Mashiku, Benjamin Saefken
Natural Language Processing
ๆฅ็ๆ่ฆ
LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone's embeddings with the LLM-derived per-class scores -- obtained through structured prompt-engineering strategies -- and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains -- achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification -- while enabling practical trade-offs between accuracy, latency, and cost.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Moshe Lahmy, Roi Yozevitch
Robotics
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) systems often fail on multi-hop queries when the initial retrieval misses a bridge fact. Prior corrective approaches, such as Self-RAG, CRAG, and Adaptive-$k$, typically address this by \textit{adding} more context or pruning existing lists. However, simply expanding the context window often leads to \textbf{context dilution}, where distractors crowd out relevant information. We propose \textbf{SEAL-RAG}, a training-free controller that adopts a \textbf{``replace, don't expand''} strategy to fight context dilution under a fixed retrieval depth $k$. SEAL executes a (\textbf{S}earch $\rightarrow$ \textbf{E}xtract $\rightarrow$ \textbf{A}ssess $\rightarrow$ \textbf{L}oop) cycle: it performs on-the-fly, entity-anchored extraction to build a live \textit{gap specification} (missing entities/relations), triggers targeted micro-queries, and uses \textit{entity-first ranking} to actively swap out distractors for gap-closing evidence. We evaluate SEAL-RAG against faithful re-implementations of Basic RAG, CRAG, Self-RAG, and Adaptive-$k$ in a shared environment on \textbf{HotpotQA} and \textbf{2WikiMultiHopQA}. On HotpotQA ($k=3$), SEAL improves answer correctness by \textbf{+3--13 pp} and evidence precision by \textbf{+12--18 pp} over Self-RAG. On 2WikiMultiHopQA ($k=5$), it outperforms Adaptive-$k$ by \textbf{+8.0 pp} in accuracy and maintains \textbf{96\%} evidence precision compared to 22\% for CRAG. These gains are statistically significant ($p<0.001$). By enforcing fixed-$k$ replacement, SEAL yields a predictable cost profile while ensuring the top-$k$ slots are optimized for precision rather than mere breadth. We release our code and data at https://github.com/mosherino/SEAL-RAG.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Manurag Khullar, Utkarsh Desai, Poorva Malviya, Aman Dalmia, Zheyuan Ryan Shi
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation using real-world data. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could cause nearly 2 million excess errors in triage. Crucially, this performance gap by scripts is not due to a failure in clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries. Nevertheless, their final classification outputs remain brittle in the presence of orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kevin Glocker, Kรคtriin Kukk, Romina Oji, Marcel Bollmann, Marco Kuhlmann et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Achieving high-performing language models which include medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic dataset, closely align with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Elroy Galbraith, Chadwick Sutherland, Donahue Morgan
Natural Language Processing
ๆฅ็ๆ่ฆ
Emergency speech recognition systems exhibit systematic performance degradation on non-standard English varieties, creating a critical gap in services for Caribbean populations. We present TRIDENT (Transcription and Routing Intelligence for Dispatcher-Empowered National Triage), a three-layer dispatcher-support architecture designed to structure emergency call inputs for human application of established triage protocols (the ESI for routine operations and START for mass casualty events), even when automatic speech recognition fails.
The system combines Caribbean-accent-tuned ASR, local entity extraction via large language models, and bio-acoustic distress detection to provide dispatchers with three complementary signals: transcription confidence, structured clinical entities, and vocal stress indicators. Our key insight is that low ASR confidence, rather than representing system failure, serves as a valuable queue prioritization signal -- particularly when combined with elevated vocal distress markers indicating a caller in crisis whose speech may have shifted toward basilectal registers. A complementary insight drives the entity extraction layer: trained responders and composed bystanders may report life-threatening emergencies without elevated vocal stress, requiring semantic analysis to capture clinical indicators that paralinguistic features miss.
We describe the architectural design, theoretical grounding in psycholinguistic research on stress-induced code-switching, and deployment considerations for offline operation during disaster scenarios. This work establishes a framework for accent-resilient emergency AI that ensures Caribbean voices receive equitable access to established national triage protocols. Empirical validation on Caribbean emergency calls remains future work.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the \textbf{O}utcome-based \textbf{P}rocess \textbf{V}erifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out \textsc{\thisbench}, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic dataset, closely align with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2\% to 73.3\% on AIME2025 as the compute budget scales.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rebekka Gรถrge, Sujan Sai Gannamaneni, Tabea Naeven, Hammam Abdelwahab, Hรฉctor Allende-Cid et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Procedural memory enables large language model (LLM) agents to internalize "how-to" knowledge, theoretically reducing redundant trial-and-error. However, existing frameworks predominantly suffer from a "passive accumulation" paradigm, treating memory as a static append-only archive. To bridge the gap between static storage and dynamic reasoning, we propose $\textbf{ReMe}$ ($\textit{Remember Me, Refine Me}$), a comprehensive framework for experience-driven agent evolution. ReMe innovates across the memory lifecycle via three mechanisms: 1) $\textit{multi-faceted distillation}$, which extracts fine-grained experiences by recognizing success patterns, analyzing failure triggers and generating comparative insights; 2) $\textit{context-adaptive reuse}$, which tailors historical insights to new contexts via scenario-aware indexing; and 3) $\textit{utility-based refinement}$, which autonomously adds valid memories and prunes outdated ones to maintain a compact, high-quality experience pool. Extensive experiments on BFCL-V3 and AppWorld demonstrate that ReMe establishes a new state-of-the-art in agent memory system. Crucially, we observe a significant memory-scaling effect: Qwen3-8B equipped with ReMe outperforms larger, memoryless Qwen3-14B, suggesting that self-evolving memory provides a computation-efficient pathway for lifelong learning. We release our code and the $\texttt{reme.library}$ dataset to facilitate further research.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Smiljana Antonijevic Ubois
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models are commonly trained on dominant languages like English, and their representation of low resource languages typically reflects cultural and linguistic biases present in the source language materials. Using the Serbian language as a case, this study examines the structural, historical, and sociotechnical factors shaping language technology development for low resource languages in the AI age. Drawing on semi structured interviews with ten scholars and practitioners, including linguists, digital humanists, and AI developers, it traces challenges rooted in historical destruction of Serbian textual heritage, intensified by contemporary issues that drive reductive, engineering first approaches prioritizing functionality over linguistic nuance. These include superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics), that reframes bias mitigation from a post hoc technical fix to an integral component of corpus design, annotation, and governance, and positions Data Care as a replicable model for building inclusive, sustainable, and culturally grounded language technologies in contexts where traditional LLM development reproduces existing power imbalances and cultural blind spots.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Jianyu Zhang et al.
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hang Ding, Qiming Feng, Dongqi Liu, Qi Zhao, Tao Yao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Reward modeling has become a cornerstone of aligning large language models (LLMs) with human preferences. Yet, when extended to subjective and open-ended domains such as role play, existing reward models exhibit severe degradation, struggling to capture nuanced and persona-grounded human judgments. To address this gap, we introduce RoleRMBench, the first systematic benchmark for reward modeling in role-playing dialogue, covering seven fine-grained capabilities from narrative management to role consistency and engagement. Evaluation on RoleRMBench reveals large and consistent gaps between general-purpose reward models and human judgment, particularly in narrative and stylistic dimensions. We further propose RoleRM, a reward model trained with Continuous Implicit Preferences (CIP), which reformulates subjective evaluation as continuous consistent pairwise supervision under multiple structuring strategies. Comprehensive experiments show that RoleRM surpasses strong open- and closed-source reward models by over 24% on average, demonstrating substantial gains in narrative coherence and stylistic fidelity. Our findings highlight the importance of continuous preference representation and annotation consistency, establishing a foundation for subjective alignment in human-centered dialogue systems.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amartya Roy, Elamparithy M, Kripabandhu Ghosh, Ponnurangam Kumaraguru, Adrian de Wynter
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
In context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remains unclear. Causal reasoning demands multihop composition and strict conjunctive control, and reliance on spurious lexical relations of the input could provide misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder decoder architectures are better suited for said multihop conjunctive reasoning versus decoder only models. To do this, we compare fine-tuned versions of all the aforementioned architectures with zero and few shot ICL in both natural language and non natural language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. In particular, decoder only models are noticeably brittle to distributional shifts, while finetuned encoder and encoder decoder models can generalize more robustly across our tests, including the non natural language split. Both architectures are only matched or surpassed by decoder only architectures at large scales. We conclude by noting that for cost effective, short horizon robust causal reasoning, encoder or encoder decoder architectures with targeted finetuning are preferable.
๐
2025-12-11
๐ Accepted and presented at the LLMs4All workshop at
๐ฅ Iรฑaki Lacunza, Josรฉ Javier Saiz, Alexander Shvets, Aitor Gonzalez-Agirre, Marta Villegas
Natural Language Processing
ๆฅ็ๆ่ฆ
Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers LLM performance in mid- and low-resource languages. To mitigate this problem, we propose to (i) optimize the language distribution by training a small proxy model within a domain-reweighing DoGE algorithm that we extend to XDoGE for a multilingual setup, and (ii) rescale the data and train a full-size model with the established language weights either from scratch or within a continual pre-training phase (CPT). We target six languages possessing a variety of geographic and intra- and inter-language-family relations, namely, English and Spanish (high-resource), Portuguese and Catalan (mid-resource), Galician and Basque (low-resource). We experiment with Salamandra-2b, which is a promising model for these languages. We investigate the effects of substantial data repetition on minor languages and under-sampling on dominant languages using the IberoBench framework for quantitative evaluation. Finally, we release a new promising IberianLLM-7B-Instruct model centering on Iberian languages and English that we pretrained from scratch and further improved using CPT with the XDoGE weights.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lars G. B. Johnsen
Natural Language Processing
ๆฅ็ๆ่ฆ
What counts as evidence for syntactic structure? In traditional generative grammar, systematic contrasts in grammaticality such as subject-auxiliary inversion and the licensing of parasitic gaps are taken as evidence for an internal, hierarchical grammar. In this paper, we test whether large language models (LLMs), trained only on surface forms, reproduce these contrasts in ways that imply an underlying structural representation.
We focus on two classic constructions: subject-auxiliary inversion (testing recognition of the subject boundary) and parasitic gap licensing (testing abstract dependency structure). We evaluate models including GPT-4 and LLaMA-3 using prompts eliciting acceptability ratings. Results show that LLMs reliably distinguish between grammatical and ungrammatical variants in both constructions, and as such support that they are sensitive to structure and not just linear order. Structural generalizations, distinct from cognitive knowledge, emerge from predictive training on surface forms, suggesting functional sensitivity to syntax without explicit encoding.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Devanshu Sahoo, Manish Prasad, Vasudev Majhi, Jahnvi Singh, Vinay Chamola et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLMs by reviewers to manage workload (the "Lazy Reviewer" hypothesis) and the formal institutional deployment of AI-powered assessment systems by conferences like AAAI and Stanford's Agents4Science. This study investigates the robustness of these "LLM-as-a-Judge" systems (both illicit and sanctioned) to adversarial PDF manipulation. Unlike general jailbreaks, we focus on a distinct incentive: flipping "Reject" decisions to "Accept," for which we develop a novel evaluation metric which we term as WAVS (Weighted Adversarial Vulnerability Score). We curated a dataset of 200 scientific papers and adapted 15 domain-specific attack strategies to this task, evaluating them across 13 Language Models, including GPT-5, Claude Haiku, and DeepSeek. Our results demonstrate that obfuscation strategies like "Maximum Mark Magyk" successfully manipulate scores, achieving alarming decision flip rates even in large-scale models. We will release our complete dataset and injection framework to facilitate more research on this topic.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nour El Houda Ben Chaabene, Hamza Hammami, Laid Kahloul
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students' cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data-including textual semantics, prosodic speech features, and temporal behavioral trends-to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.
๐
2025-12-11
๐ This paper was accepted and scheduled for inclusio
๐ฅ Nour El Houda Ben Chaabene, Hamza Hammami
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) like Claude, Mistral IA, and GPT-4 excel in NLP but lack structured knowledge, leading to factual inconsistencies. We address this by integrating Knowledge Graphs (KGs) via KG-BERT to enhance grounding and reasoning. Experiments show significant gains in knowledge-intensive tasks such as question answering and entity linking. This approach improves factual reliability and enables more context-aware next-generation LLMs.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Agniva Maiti, Prajwal Panth, Suresh Chandra Satapathy
Natural Language Processing
ๆฅ็ๆ่ฆ
The integrity and reliability of scientific literature is facing a serious threat by adversarial text generation techniques, specifically from the use of automated paraphrasing tools to mask plagiarism. These tools generate "tortured phrases", statistically improbable synonyms (e.g. "counterfeit consciousness" for "artificial intelligence"), that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates for novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance. SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dmitrii Stoianov, Danil Taranets, Olga Tsymboi, Ramil Latypov, Almaz Dautov et al.
ๆฅ็ๆ่ฆ
We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.
๐
2025-12-11
๐ NeurIPS 2025
๐ฅ Youmin Ko, Sungjong Seo, Hyunjoon Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available.\footnote{https://github.com/meaningful96/CoopRAG}
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, Ji Pei
Natural Language Processing
ๆฅ็ๆ่ฆ
The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference-time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
๐
2025-12-11
๐ ArXiv Preprint (cs.SD)
๐ฅ Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric(ACEM), its robustness under diverse conditions has not been systematically validated.
To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Model(LALM). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation.
Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs.
Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19.
By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin et al.
Robotics
ๆฅ็ๆ่ฆ
Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hangyu Mao, Guangting Dong, Zhicheng Dou
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We present the Generalized Policy Gradient (GPG) Theorem, specifically designed for Transformer-based policies. Notably, we demonstrate that both standard Policy Gradient Theorem and GRPO emerge as special cases within our GPG framework. Furthermore, we explore its practical applications in training Large Language Models (LLMs), offering new insights into efficient policy optimization.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jules Lahmi, Alexis Roger
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Artificial intelligence has made great progress in recent years, particularly in the development of Vision--Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non--English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we will explore and compare different methods for their performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, Dong Yu
Robotics
ๆฅ็ๆ่ฆ
We introduce MotionEdit, a novel dataset for motion-centric image editing-the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation.
To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Gyutaek Oh, Sangjoon Park, Byung-Hoon Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
Evaluating large language models (LLMs) has recently emerged as a critical issue for safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs in generating responses in dynamic, interactive clinical multi-turn conversation situations and the identification of multi-faceted evaluation strategies beyond simple accuracy. However, formally evaluating a dynamic, interactive clinical situation is hindered by its vast combinatorial space of possible patient states and interaction trajectories, making it difficult to standardize and quantitatively measure such scenarios. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off-the-shelf static QA datasets into virtual patient profiles, enabling realistic and clinically grounded multi-turn clinical dialogues between LLM agents. The performance of various clinical conversational agents is then assessed based on our CARE metric, which provides a multi-faceted evaluation standard of clinical conversational accuracy, efficiency/strategy, empathy, and robustness. Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents, offering practical guidelines for the effective development of LLMs in conversational medical applications.
๐
2025-12-11
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yangkun Wang, Jingbo Shang
Natural Language Processing
ๆฅ็ๆ่ฆ
A recent watermarking scheme for language models achieves distortion-free embedding and robustness to edit-distance attacks. However, it suffers from limited generation diversity and high detection overhead. In parallel, recent research has focused on undetectability, a property ensuring that watermarks remain difficult for adversaries to detect and spoof. In this work, we introduce a new class of watermarking schemes constructed through probabilistic automata. We present two instantiations: (i) a practical scheme with exponential generation diversity and computational efficiency, and (ii) a theoretical construction with formal undetectability guarantees under cryptographic assumptions. Extensive experiments on LLaMA-3B and Mistral-7B validate the superior performance of our scheme in terms of robustness and efficiency.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Keito Inoshita, Xiaokang Zhou, Akira Kawai, Katsutoshi Yada
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
In practical deep learning deployment, the scarcity of data and the imbalance of label distributions often lead to semantically uncovered regions within the real-world data distribution, hindering model training and causing misclassification near class boundaries as well as unstable behaviors in peripheral areas. Although recent large language models (LLMs) show promise for data augmentation, an integrated framework that simultaneously achieves directional control of generation, domain alignment, and quality control has not yet been fully established. To address these challenges, we propose a Cluster-conditioned Interpolative and Extrapolative framework for Geometry-Aware and Domain-aligned data augmentation (CIEGAD), which systematically complements both in-distribution and out-of-distribution semantically uncovered regions. CIEGAD constructs domain profiles through cluster conditioning, allocates generation with a hierarchical frequency-geometric allocation integrating class frequency and geometric indicators, and finely controls generation directions via the coexistence of interpolative and extrapolative synthesis. It further performs quality control through geometry-constrained filtering combined with an LLM-as-a-Judge mechanism. Experiments on multiple classification tasks demonstrate that CIEGAD effectively extends the periphery of real-world data distributions while maintaining high alignment between generated and real-world data as well as semantic diversity. In particular, for long-tailed and multi-class classification tasks, CIEGAD consistently improves F1 and recall, validating the triple harmony of distributional consistency, diversity, and quality. These results indicate that CIEGAD serves as a practically oriented data augmentation framework that complements underrepresented regions while preserving alignment with real-world data.
๐
2025-12-11
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Nicholas Clark, Ryan Bai, Tanu Mitra
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) and generative search systems are increasingly used for information seeking by diverse populations with varying preferences for knowledge sourcing and presentation. While users can customize LLM behavior through custom instructions and behavioral prompts, no mechanism exists to evaluate whether these instructions are being followed effectively. We present Offscript, an automated auditing tool that efficiently identifies potential instruction following failures in LLMs. In a pilot study analyzing custom instructions sourced from Reddit, Offscript detected potential deviations from instructed behavior in 86.4% of conversations, 22.2% of which were confirmed as material violations through human review. Our findings suggest that automated auditing serves as a viable approach for evaluating compliance to behavioral instructions related to information seeking.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yiyang Lu, Qiao Sun, Xianbang Wang, Zhicheng Jiang, Hanhong Zhao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Normalizing Flows (NFs) have been established as a principled framework for generative modeling. Standard NFs consist of a forward process and a reverse process: the forward process maps data to noise, while the reverse process generates samples by inverting it. Typical NF forward transformations are constrained by explicit invertibility, ensuring that the reverse process can serve as their exact analytic inverse. Recent developments in TARFlow and its variants have revitalized NF methods by combining Transformers and autoregressive flows, but have also exposed causal decoding as a major bottleneck. In this work, we introduce Bidirectional Normalizing Flow ($\textbf{BiFlow}$), a framework that removes the need for an exact analytic inverse. BiFlow learns a reverse model that approximates the underlying noise-to-data inverse mapping, enabling more flexible loss functions and architectures. Experiments on ImageNet demonstrate that BiFlow, compared to its causal decoding counterpart, improves generation quality while accelerating sampling by up to two orders of magnitude. BiFlow yields state-of-the-art results among NF-based methods and competitive performance among single-evaluation ("1-NFE") methods. Following recent encouraging progress on NFs, we hope our work will draw further attention to this classical paradigm.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaona Zhou, Yingyan Zeng, Ran Jin, Ismini Lourentzou
Machine Learning
ๆฅ็ๆ่ฆ
The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training are therefore critical decisions, yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both dataset and group (e.g., collections, institutions) levels, enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy, while requiring significantly fewer exploration steps. Ablations show DaSH is robust to low-resource settings and lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv et al.
Robotics
ๆฅ็ๆ่ฆ
Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work seeks further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(ฮฑx + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kristina Korotkova, Aleksandr Katrutsa
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The construction of adversarial attacks for neural networks appears to be a crucial challenge for their deployment in various services. To estimate the adversarial robustness of a neural network, a fast and efficient approach is needed to construct adversarial attacks. Since the formalization of adversarial attack construction involves solving a specific optimization problem, we consider the problem of constructing an efficient and effective adversarial attack from a numerical optimization perspective. Specifically, we suggest utilizing advanced projection-free methods, known as modified Frank-Wolfe methods, to construct white-box adversarial attacks on the given input data. We perform a theoretical and numerical evaluation of these methods and compare them with standard approaches based on projection operations or geometrical intuition. Numerical experiments are performed on the MNIST and CIFAR-10 datasets, utilizing a multiclass logistic regression model, the convolutional neural networks (CNNs), and the Vision Transformer (ViT).
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Zamirddine Mari, Jรฉrรดme Pasquet, Julien Seinturier
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Autonomous drone navigation in confined tubular environments remains a major challenge due to the constraining geometry of the conduits, the proximity of the walls, and the perceptual limitations inherent to such scenarios. We propose a reinforcement learning approach enabling a drone to navigate unknown three-dimensional tubes without any prior knowledge of their geometry, relying solely on local observations from LiDAR and a conditional visual detection of the tube center. In contrast, the Pure Pursuit algorithm, used as a deterministic baseline, benefits from explicit access to the centerline, creating an information asymmetry designed to assess the ability of RL to compensate for the absence of a geometric model. The agent is trained through a progressive Curriculum Learning strategy that gradually exposes it to increasingly curved geometries, where the tube center frequently disappears from the visual field. A turning-negotiation mechanism, based on the combination of direct visibility, directional memory, and LiDAR symmetry cues, proves essential for ensuring stable navigation under such partial observability conditions. Experiments show that the PPO policy acquires robust and generalizable behavior, consistently outperforming the deterministic controller despite its limited access to geometric information. Validation in a high-fidelity 3D environment further confirms the transferability of the learned behavior to a continuous physical dynamics.
The proposed approach thus provides a complete framework for autonomous navigation in unknown tubular environments and opens perspectives for industrial, underground, or medical applications where progressing through narrow and weakly perceptive conduits represents a central challenge.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ George Yakushev, Nataliia Babina, Masoud Vahid Dastgerdi, Vyacheslav Zhdanovskiy, Alina Shutova et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities and safety, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embedded assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of rotary embeddings to enable LLMs built for sequential interactions to simultaneously think, listen, and generate outputs. We evaluate our approach on math, commonsense, and safety reasoning and find that it can generate accurate thinking-augmented answers in real time, reducing time to first non-thinking token from minutes to <= 5s. and the overall real-time delays by 6-11x.
๐
2025-12-11
๐ ArXiv Preprint (quant-ph)
๐ฅ Jordan Cotler, Weiyuan Gong, Ishaan Kannan
Robotics
ๆฅ็ๆ่ฆ
We develop a framework for learning from noisy quantum experiments, focusing on fault-tolerant devices accessing uncharacterized systems through noisy couplings. Our starting point is the complexity class $\textsf{NBQP}$ ("noisy BQP"), modeling noisy fault-tolerant quantum computers that cannot, in general, error-correct the oracle systems they query. Using this class, we show that for natural oracle problems, noise can eliminate exponential quantum learning advantages of ideal noiseless learners while preserving a superpolynomial gap between NISQ and fault-tolerant devices. Beyond oracle separations, we study concrete noisy learning tasks. For purity testing, the exponential two-copy advantage collapses under a single application of local depolarizing noise. Nevertheless, we identify a setting motivated by AdS/CFT in which noise-resilient structure restores a quantum learning advantage in a noisy regime. We then analyze noisy Pauli shadow tomography, deriving lower bounds that characterize how instance size, quantum memory, and noise control sample complexity, and design algorithms with parametrically similar scalings. Together, our results show that the Bell-basis and SWAP-test primitives underlying most exponential quantum learning advantages are fundamentally fragile to noise unless the experimental system has latent noise-robust structure. Thus, realizing meaningful quantum advantages in future experiments will require understanding how noise-robust physical properties interface with available algorithmic techniques.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qiyang Li, Seohong Park, Sergey Levine
ๆฅ็ๆ่ฆ
Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a self-bootstrapping mechanism is prone to bootstrapping bias, where the errors in the value targets accumulate across steps and result in biased value estimates. Recent work has proposed to use chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, speeding up value backup. However, extracting policies from chunked critics is challenging: policies must output the entire action chunk open-loop, which can be sub-optimal for environments that require policy reactivity and also challenging to model especially when the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action chunking policies for long action chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned tasks and show that it reliably outperforms prior methods. Code: github.com/ColinQiyangLi/dqc.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zamirddine Mari, Mohamad Motasem Nawaf, Pierre Drap
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Autonomous navigation in underwater environments remains a major challenge due to the absence of GPS, degraded visibility, and the presence of submerged obstacles. This article investigates these issues through the case of the BlueROV2, an open platform widely used for scientific experimentation. We propose a deep reinforcement learning approach based on the Proximal Policy Optimization (PPO) algorithm, using an observation space that combines target-oriented navigation information, a virtual occupancy grid, and ray-casting along the boundaries of the operational area. The learned policy is compared against a reference deterministic kinematic planner, the Dynamic Window Approach (DWA), commonly employed as a robust baseline for obstacle avoidance. The evaluation is conducted in a realistic simulation environment and complemented by validation on a physical BlueROV2 supervised by a 3D digital twin of the test site, helping to reduce risks associated with real-world experimentation. The results show that the PPO policy consistently outperforms DWA in highly cluttered environments, notably thanks to better local adaptation and reduced collisions. Finally, the experiments demonstrate the transferability of the learned behavior from simulation to the real world, confirming the relevance of deep RL for autonomous navigation in underwater robotics.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Max Zimmer, Christophe Roux, Moritz Wagner, Deborah Hendrych, Sebastian Pokutta
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The resource requirements of Neural Networks can be significantly reduced through pruning -- the removal of seemingly less important parameters. However, with the rise of Large Language Models (LLMs), full retraining to recover pruning-induced performance degradation is often prohibitive and classical approaches such as global magnitude pruning are suboptimal on Transformer architectures. State-of-the-art methods hence solve a layer-wise mask selection problem, the problem of finding a pruning mask which minimizes the per-layer pruning error on a small set of calibration data. Exactly solving this problem to optimality using Integer Programming (IP) solvers is computationally infeasible due to its combinatorial nature and the size of the search space, and existing approaches therefore rely on approximations or heuristics. In this work, we demonstrate that the mask selection problem can be made drastically more tractable at LLM scale. To that end, we decouple the rows by enforcing equal sparsity levels per row. This allows us to derive optimal 1-swaps (exchanging one kept and one pruned weight) that can be computed efficiently using the Gram matrix of the calibration data. Using these observations, we propose a tractable and simple 1-swap algorithm that warm starts from any pruning mask, runs efficiently on GPUs at LLM scale, and is essentially hyperparameter-free. We demonstrate that our approach reduces per-layer pruning error by up to 60% over Wanda (Sun et al., 2023) and consistently improves perplexity and zero-shot accuracy across state-of-the-art GPT architectures.
๐
2025-12-11
๐ ArXiv Preprint (hep-th)
๐ฅ Challenger Mishra, Justin Tan
Machine Learning
ๆฅ็ๆ่ฆ
We compute solutions to the Hermitian Yang-Mills equations on holomorphic vector bundles $V$ via an alternating optimisation procedure founded on geometric machine learning. The proposed method is fully general with respect to the rank and structure group of $V$, requiring only the ability to enumerate a basis of global sections for a given bundle. This enables us to compute the physically normalised Yukawa couplings in a broad class of heterotic string compactifications. Using this method, we carry out this computation in full for a heterotic compactification incorporating a gauge bundle with non-Abelian structure group.
๐
2025-12-11
๐ ArXiv Preprint (math.OC)
๐ฅ Feras Al Taha, Eilyan Bitar
Robotics
ๆฅ็ๆ่ฆ
In this paper, we consider a class of finite-horizon, linear-quadratic stochastic control problems, where the probability distribution governing the noise process is unknown but assumed to belong to an ambiguity set consisting of all distributions whose mean and covariance lie within norm balls centered at given nominal values. To address the distributional ambiguity, we explore the design of causal affine control policies to minimize the worst-case expected regret over all distributions in the given ambiguity set. The resulting minimax optimal control problem is shown to admit an equivalent reformulation as a tractable convex program that corresponds to a regularized version of the nominal linear-quadratic stochastic control problem. While this convex program can be recast as a semidefinite program, semidefinite programs are typically solved using primal-dual interior point methods that scale poorly with the problem size in practice. To address this limitation, we propose a scalable dual projected subgradient method to compute optimal controllers to an arbitrary accuracy. Numerical experiments are presented to benchmark the proposed method against state-of-the-art data-driven and distributionally robust control design approaches.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez et al.
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stefan Matthes, Markus Schramm
Machine Learning
ๆฅ็ๆ่ฆ
Parabolic trough Concentrating Solar Power (CSP) plants operate large hydraulic networks of collector loops that must deliver a uniform outlet temperature despite spatially heterogeneous optical performance, heat losses, and pressure drops. While loop temperatures are measured, loop-level mass flows and receiver heat-loss parameters are unobserved, making it impossible to diagnose hydraulic imbalances or receiver degradation using standard monitoring tools.
We present a physics-informed learning framework that infers (i) loop-level mass-flow ratios and (ii) time-varying receiver heat-transfer coefficients directly from routine operational data. The method exploits nocturnal homogenization periods -- when hot oil is circulated through a non-irradiated field -- to isolate hydraulic and thermal-loss effects. A differentiable conjugate heat-transfer model is discretized and embedded into an end-to-end learning pipeline optimized using historical plant data from the 50 MW Andasol 3 solar field.
The model accurately reconstructs loop temperatures (RMSE $<2^\circ$C) and produces physically meaningful estimates of loop imbalances and receiver heat losses. Comparison against drone-based infrared thermography (QScan) shows strong correspondence, correctly identifying all areas with high-loss receivers. This demonstrates that noisy real-world CSP operational data contain enough information to recover latent physical parameters when combined with appropriate modeling and differentiable optimization.
๐
2025-12-11
๐ ICML 2025
๐ฅ Xuan Zhao, Zhuo Cao, Arya Bangun, Hanno Scharr, Ira Assent
ๆฅ็ๆ่ฆ
Counterfactual explanations provide actionable insights by identifying minimal input changes required to achieve a desired model prediction. Beyond their interpretability benefits, counterfactuals can also be leveraged for model reconstruction, where a surrogate model is trained to replicate the behavior of a target model. In this work, we demonstrate that model reconstruction can be significantly improved by recognizing that counterfactuals, which typically lie close to the decision boundary, can serve as informative though less representative samples for both classes. This is particularly beneficial in settings with limited access to labeled data. We propose a method that integrates original data samples with counterfactuals to approximate class prototypes using the Wasserstein barycenter, thereby preserving the underlying distributional structure of each class. This approach enhances the quality of the surrogate model and mitigates the issue of decision boundary shift, which commonly arises when counterfactuals are naively treated as ordinary training instances. Empirical results across multiple datasets show that our method improves fidelity between the surrogate and target models, validating its effectiveness.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Julian Kleutgens, Claudio Battiloro, Lingkai Kong, Benjamin Grewe, Francesca Dominici et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Discrete diffusion models achieve strong performance across language and other discrete domains, providing a powerful alternative to autoregressive models. However, their strong performance relies on large training datasets, which are costly or risky to obtain, especially when adapting to new domains. Transfer learning is the natural way to adapt pretrained discrete diffusion models, but current methods require fine-tuning large diffusion models, which is computationally expensive and often impractical. Building on ratio-based transfer learning for continuous diffusion, we provide Guided Transfer Learning for discrete diffusion models (GTL). This enables sampling from a target distribution without modifying the pretrained denoiser. The same guidance formulation applies to both discrete-time diffusion and continuous-time score-based discrete diffusion, yielding a unified treatment. Guided discrete diffusion often requires many forward passes of the guidance network, which becomes impractical for large vocabularies and long sequences. To address this, we further present an efficient guided sampler that concentrates evaluations on planner-selected positions and top candidate tokens, thus lowering sampling time and computation. This makes guided language modeling practical at scale for large vocabularies and long sequences. We evaluate GTL on sequential data, including synthetic Markov chains and language modeling, and provide empirical analyses of its behavior.
๐
2025-12-11
๐ ArXiv Preprint (cs.NI)
๐ฅ Zhongyuan Zhao, Yujun Ming, Kevin Chan, Ananthram Swami, Santiago Segarra
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Many routing and flow optimization problems in wired networks can be solved efficiently using minimum cost flow formulations. However, this approach does not extend to wireless multi-hop networks, where the assumptions of fixed link capacity and linear cost structure collapse due to contention for shared spectrum resources. The key challenge is that the long-term capacity of a wireless link becomes a non-linear function of its network context, including network topology, link quality, and the traffic assigned to neighboring links. In this work, we pursue a new direction of modeling wireless network under randomized medium access control by developing an analytical network digital twin (NDT) that predicts link duty cycles from network context. We generalize randomized contention as finding a Maximal Independent Set (MIS) on the conflict graph using weighted Luby's algorithm, derive an analytical model of link duty cycles, and introduce an iterative procedure that resolves the circular dependency among duty cycle, link capacity, and contention probability. Our numerical experiments show that the proposed NDT accurately predicts link duty cycles and congestion patterns with up to a 5000x speedup over packet-level simulation, and enables us to optimize link scheduling using gradient descent for reduced congestion and radio footprint.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Qitian Lu, Himanshu Sharma, Michael D. Shields, Lukรกลก Novรกk
Machine Learning
ๆฅ็ๆ่ฆ
Physics-informed polynomial chaos expansions (PC$^2$) provide an efficient physically constrained surrogate modeling framework by embedding governing equations and other physical constraints into the standard data-driven polynomial chaos expansions (PCE) and solving via the Karush-Kuhn-Tucker (KKT) conditions. This approach improves the physical interpretability of surrogate models while achieving high computational efficiency and accuracy. However, the performance and efficiency of PC$^2$ can still be degraded with high-dimensional parameter spaces, limited data availability, or unrepresentative training data. To address this problem, this study explores two complementary enhancements to the PC$^2$ framework. First, a numerically efficient constrained optimization solver, straightforward updating of Lagrange multipliers (SULM), is adopted as an alternative to the conventional KKT solver. The SULM method significantly reduces computational cost when solving physically constrained problems with high-dimensionality and derivative boundary conditions that require a large number of virtual points. Second, a D-optimal sampling strategy is utilized to select informative virtual points to improve the stability and achieve the balance of accuracy and efficiency of the PC$^2$. The proposed methods are integrated into the PC$^2$ framework and evaluated through numerical examples of representative physical systems governed by ordinary or partial differential equations. The results demonstrate that the enhanced PC$^2$ has better comprehensive capability than standard PC$^2$, and is well-suited for high-dimensional uncertainty quantification tasks.
๐
2025-12-11
๐ NeurIPS 2025
๐ฅ Ruslan Gokhman
Natural Language Processing
ๆฅ็ๆ่ฆ
We study long-horizon exogenous-only temperature forecasting - a challenging univariate setting where only the past values of the indoor temperature are used for prediction - using linear and Transformer-family models. We evaluate Linear, NLinear, DLinear, Transformer, Informer, and Autoformer under standardized train, validation, and test splits. Results show that linear baselines (Linear, NLinear, DLinear) consistently outperform more complex Transformer-family architectures, with DLinear achieving the best overall accuracy across all splits. These findings highlight that carefully designed linear models remain strong baselines for time series forecasting in challenging exogenous-only settings.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dimitri von Rรผtte, Janis Fluri, Omead Pooladzandi, Bernhard Schรถlkopf, Thomas Hofmann et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs.
We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making them a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chirag Modi, Jiequn Han, Eric Vanden-Eijnden, Joan Bruna
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Transport-based methods have emerged as a leading paradigm for building generative models from large, clean datasets. However, in many scientific and engineering domains, clean data are often unavailable: instead, we only observe measurements corrupted through a noisy, ill-conditioned channel. A generative model for the original data thus requires solving an inverse problem at the level of distributions. In this work, we introduce a novel approach to this task based on Stochastic Interpolants: we iteratively update a transport map between corrupted and clean data samples using only access to the corrupted dataset as well as black box access to the corruption channel. Under appropriate conditions, this iterative procedure converges towards a self-consistent transport map that effectively inverts the corruption channel, thus enabling a generative model for the clean data. We refer to the resulting method as the self-consistent stochastic interpolant (SCSI). It (i) is computationally efficient compared to variational alternatives, (ii) highly flexible, handling arbitrary nonlinear forward models with only black-box access, and (iii) enjoys theoretical guarantees. We demonstrate superior performance on inverse problems in natural image processing and scientific reconstruction, and establish convergence guarantees of the scheme under appropriate assumptions.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Geoffrey F. Bomarito, Patrick E. Leser
ๆฅ็ๆ่ฆ
Symbolic regression is a powerful tool for discovering governing equations directly from data, but its sensitivity to noise hinders its broader application. This paper introduces a Sequential Monte Carlo (SMC) framework for Bayesian symbolic regression that approximates the posterior distribution over symbolic expressions, enhancing robustness and enabling uncertainty quantification for symbolic regression in the presence of noise. Differing from traditional genetic programming approaches, the SMC-based algorithm combines probabilistic selection, adaptive tempering, and the use of normalized marginal likelihood to efficiently explore the search space of symbolic expressions, yielding parsimonious expressions with improved generalization. When compared to standard genetic programming baselines, the proposed method better deals with challenging, noisy benchmark datasets. The reduced tendency to overfit and enhanced ability to discover accurate and interpretable equations paves the way for more robust symbolic regression in scientific discovery and engineering design applications.
๐
2025-12-11
๐ Submitted to IEEE Transactions on Games
๐ฅ Atahan Cilan, Atay รzgรถvde
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper introduces a reinforcement learning framework that enables controllable and diverse player behaviors without relying on human gameplay data. Existing approaches often require large-scale player trajectories, train separate models for different player types, or provide no direct mapping between interpretable behavioral parameters and the learned policy, limiting their scalability and controllability. We define player behavior in an N-dimensional continuous space and uniformly sample target behavior vectors from a region that encompasses the subset representing real human styles. During training, each agent receives both its current and target behavior vectors as input, and the reward is based on the normalized reduction in distance between them. This allows the policy to learn how actions influence behavioral statistics, enabling smooth control over attributes such as aggressiveness, mobility, and cooperativeness. A single PPO-based multi-agent policy can reproduce new or unseen play styles without retraining. Experiments conducted in a custom multi-player Unity game show that the proposed framework produces significantly greater behavioral diversity than a win-only baseline and reliably matches specified behavior vectors across diverse targets. The method offers a scalable solution for automated playtesting, game balancing, human-like behavior simulation, and replacing disconnected players in online games.
๐
2025-12-11
๐ ArXiv Preprint (math.ST)
๐ฅ Thabo Samakhoana, Benjamin Grimmer
ๆฅ็ๆ่ฆ
We consider the design of smoothings of the (coordinate-wise) max function in $\mathbb{R}^d$ in the infinity norm. The LogSumExp function $f(x)=\ln(\sum^d_i\exp(x_i))$ provides a classical smoothing, differing from the max function in value by at most $\ln(d)$. We provide an elementary construction of a lower bound, establishing that every overestimating smoothing of the max function must differ by at least $\sim 0.8145\ln(d)$. Hence, LogSumExp is optimal up to constant factors. However, in small dimensions, we provide stronger, exactly optimal smoothings attaining our lower bound, showing that the entropy-based LogSumExp approach to smoothing is not exactly optimal.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Leijie Wang, Otilia Stretcu, Wei Qiao, Thomas Denby, Krishnamurthy Viswanathan et al.
ๆฅ็ๆ่ฆ
From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called "Agile Deliberation" that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user's evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.
๐
2025-12-11
๐ ArXiv Preprint (hep-ex)
๐ฅ Mohammed Aboelela
ๆฅ็ๆ่ฆ
Multiple proton-proton collisions (pile-up) occur at every bunch crossing at the LHC, with the mean number of interactions expected to reach 80 during Run 3 and up to 200 at the High-Luminosity LHC. As a direct consequence, events with multijet signatures will occur at increasingly high rates. To cope with the increased luminosity, being able to efficiently group jets according to their origin along the beamline is crucial, particularly at the trigger level. In this work, a novel uncertainty-aware jet regression model based on a Deep Sets architecture is introduced, DIPz, to regress on a jet origin position along the beamline. The inputs to the DIPz algorithm are the charged particle tracks associated to each jet. An event-level discriminant, the Maximum Log Product of Likelihoods (MLPL), is constructed by combining the DIPz per-jet predictions. MLPL is cut-optimized to select events compatible with targeted multi-jet signature selection. This combined approach provides a robust and computationally efficient method for pile-up rejection in multi-jet final states, applicable to real-time event selections at the ATLAS High Level Trigger.
๐
2025-12-11
๐ Submitted to JMLR, under review
๐ฅ Brian P. Powell, Jordan A. Caraballo-Vega, Mark L. Carroll, Thomas Maxwell, Andrew Ptak et al.
Machine Learning
ๆฅ็ๆ่ฆ
We report the discovery that binary encoding allows neural networks to extrapolate periodic functions beyond their training bounds. We introduce Normalized Base-2 Encoding (NB2E) as a method for encoding continuous numerical values and demonstrate that, using this input encoding, vanilla multi-layer perceptrons (MLP) successfully extrapolate diverse periodic signals without prior knowledge of their functional form. Internal activation analysis reveals that NB2E induces bit-phase representations, enabling MLPs to learn and extrapolate signal structure independently of position.
๐
2025-12-11
๐ ArXiv Preprint (quant-ph)
๐ฅ F. Picariello, G. Turati, R. Antonelli, I. Bailo, S. Bonura et al.
Machine Learning
ๆฅ็ๆ่ฆ
The Traveling Salesman Problem (TSP) is a fundamental challenge in combinatorial optimization, widely applied in logistics and transportation. As the size of TSP instances grows, traditional algorithms often struggle to produce high-quality solutions within reasonable timeframes. This study investigates the potential of the Quantum Approximate Optimization Algorithm (QAOA), a hybrid quantum-classical method, to solve TSP under realistic constraints. We adopt a QUBO-based formulation of TSP that integrates real-world logistical constraints reflecting operational conditions, such as vehicle capacity, road accessibility, and time windows, while ensuring compatibility with the limitations of current quantum hardware. Our experiments are conducted in a simulated environment using high-performance computing (HPC) resources to assess QAOA's performance across different problem sizes and quantum circuit depths. In order to improve scalability, we propose clustering QAOA (Cl-QAOA), a hybrid approach combining classical machine learning with QAOA. This method decomposes large TSP instances into smaller sub-problems, making quantum optimization feasible even on devices with a limited number of qubits. The results offer a comprehensive evaluation of QAOA's strengths and limitations in solving constrained TSP scenarios. This study advances quantum optimization and lays groundwork for future large-scale applications.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Akshay Kulkarni, Tsui-Wei Weng, Vivek Narayanaswamy, Shusen Liu, Wesam A. Sakla et al.
Computer Vision
ๆฅ็ๆ่ฆ
Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations; (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE) - a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its \textit{global} \revision{semantic} information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e. pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of \emph{spatial} information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in $<$4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT etc). %, etc. Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Manurag Khullar, Utkarsh Desai, Poorva Malviya, Aman Dalmia, Zheyuan Ryan Shi
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation using real-world data. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could cause nearly 2 million excess errors in triage. Crucially, this performance gap by scripts is not due to a failure in clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries. Nevertheless, their final classification outputs remain brittle in the presence of orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kevin Glocker, Kรคtriin Kukk, Romina Oji, Marcel Bollmann, Marco Kuhlmann et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Achieving high-performing language models which include medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Youjun Zhao
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrosynthesis reaction prediction seeks to infer plausible reactant molecules for a given product and is a central problem in computer-aided organic synthesis. Despite recent progress, many existing models still fall short of the accuracy and robustness required for practical deployment. This work studies a template-free, Transformer-based framework that eliminates reliance on handcrafted reaction templates or additional chemical rule engines. The model injects molecular graph information into the attention mechanism to jointly exploit \SMILES\ sequences and structural cues, and further applies a paired data augmentation strategy to enhance training diversity and scale. On the USPTO-50K benchmark, our proposed approach achieves state-of-the-art performance among template-free methods and substantially outperforming a vanilla Transformer baseline.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic dataset, closely align with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
๐
2025-12-11
๐ ArXiv Preprint (physics.med-ph)
๐ฅ Yaowen Zhang, Libera Fresiello, Peter H. Veltink, Dirk W. Donker, Ying Wang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Continuous monitoring of blood pressure (BP) and hemodynamic parameters such as peripheral resistance (R) and arterial compliance (C) are critical for early vascular dysfunction detection. While photoplethysmography (PPG) wearables has gained popularity, existing data-driven methods for BP estimation lack interpretability. We advanced our previously proposed physiology-centered hybrid AI method-Physiological Model-Based Neural Network (PMB-NN)-in blood pressure estimation, that unifies deep learning with a 2-element Windkessel based model parameterized by R and C acting as physics constraints. The PMB-NN model was trained in a subject-specific manner using PPG-derived timing features, while demographic information was used to infer an intermediate variable: cardiac output. We validated the model on 10 healthy adults performing static and cycling activities across two days for model's day-to-day robustness, benchmarked against deep learning (DL) models (FCNN, CNN-LSTM, Transformer) and standalone Windkessel based physiological model (PM). Validation was conducted on three perspectives: accuracy, interpretability and plausibility. PMB-NN achieved systolic BP accuracy (MAE: 7.2 mmHg) comparable to DL benchmarks, diastolic performance (MAE: 3.9 mmHg) lower than DL models. However, PMB-NN exhibited higher physiological plausibility than both DL baselines and PM, suggesting that the hybrid architecture unifies and enhances the respective merits of physiological principles and data-driven techniques. Beyond BP, PMB-NN identified R (ME: 0.15 mmHg$\cdot$s/ml) and C (ME: -0.35 ml/mmHg) during training with accuracy similar to PM, demonstrating that the embedded physiological constraints confer interpretability to the hybrid AI framework. These results position PMB-NN as a balanced, physiologically grounded alternative to purely data-driven approaches for daily hemodynamic monitoring.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lin Du, Lu Bai, Jincheng Li, Lixin Cui, Hangyuan Du et al.
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have emerged as a dominant paradigm for graph classification. Specifically, most existing GNNs mainly rely on the message passing strategy between neighbor nodes, where the expressivity is limited by the 1-dimensional Weisfeiler-Lehman (1-WL) test. Although a number of k-WL-based GNNs have been proposed to overcome this limitation, their computational cost increases rapidly with k, significantly restricting the practical applicability. Moreover, since the k-WL models mainly operate on node tuples, these k-WL-based GNNs cannot retain fine-grained node- or edge-level semantics required by attribution methods (e.g., Integrated Gradients), leading to the less interpretable problem. To overcome the above shortcomings, in this paper, we propose a novel Line Graph Aggregation Network (LGAN), that constructs a line graph from the induced subgraph centered at each node to perform the higher-order aggregation. We theoretically prove that the LGAN not only possesses the greater expressive power than the 2-WL under injective aggregation assumptions, but also has lower time complexity. Empirical evaluations on benchmarks demonstrate that the LGAN outperforms state-of-the-art k-WL-based GNNs, while offering better interpretability.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hao Tang, Hao Chen, Chao Li
ๆฅ็ๆ่ฆ
Neural operators offer powerful approaches for solving parametric partial differential equations, but extending them to spherical domains remains challenging due to the need to preserve intrinsic geometry while avoiding distortions that break rotational consistency. Existing spherical operators rely on rotational equivariance but often lack the flexibility for real-world complexity. We propose a general operator-design framework based on the designable spherical Green's function and its harmonic expansion, establishing a solid operator-theoretic foundation for spherical learning. Based on this, we propose an absolute and relative position-dependent Green's function that enables flexible balance of equivariance and invariance for real-world modeling. The resulting operator, Green's-function Spherical Neural Operator (GSNO) with a novel spectral learning method, can adapt to anisotropic, constraint-rich systems while retaining spectral efficiency. To exploit GSNO, we develop GSHNet, a hierarchical architecture that combines multi-scale spectral modeling with spherical up-down sampling, enhancing global feature representation. Evaluations on diffusion MRI, shallow water dynamics, and global weather forecasting, GSNO and GSHNet consistently outperform state-of-the-art methods. Our results position GSNO as a principled and general framework for spherical operator learning, bridging rigorous theory with real-world complexity.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lingjing Kong, Shaoan Xie, Guangyi Chen, Yuewen Sun, Xiangchen Song et al.
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Deep generative models, while revolutionizing fields like image and text generation, largely operate as opaque black boxes, hindering human understanding, control, and alignment. While methods like sparse autoencoders (SAEs) show remarkable empirical success, they often lack theoretical guarantees, risking subjective insights. Our primary objective is to establish a principled foundation for interpretable generative models. We demonstrate that the principle of causal minimality -- favoring the simplest causal explanation -- can endow the latent representations of diffusion vision and autoregressive language models with clear causal interpretation and robust, component-wise identifiable control. We introduce a novel theoretical framework for hierarchical selection models, where higher-level concepts emerge from the constrained composition of lower-level variables, better capturing the complex dependencies in data generation. Under theoretically derived minimality conditions (manifesting as sparsity or compression constraints), we show that learned representations can be equivalent to the true latent variables of the data-generating process. Empirically, applying these constraints to leading generative models allows us to extract their innate hierarchical concept graphs, offering fresh insights into their internal knowledge organization. Furthermore, these causally grounded concepts serve as levers for fine-grained model steering, paving the way for transparent, reliable systems.
๐
2025-12-11
๐ 6 pages, 2 figures, 1 table. Accepted at UCC '25 (
๐ฅ Mostafa Anoosha, Zeinab Dehghani, Kuniko Paxton, Koorosh Aslansefat, Dhavalkumar Thakker
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vertical Federated Learning (VFL) offers a privacy-preserving paradigm for Edge AI scenarios like mobile health diagnostics, where sensitive multimodal data reside on distributed, resource-constrained devices. Yet, standard VFL systems often suffer performance limitations due to simplistic feature fusion. This paper introduces HybridVFL, a novel framework designed to overcome this bottleneck by employing client-side feature disentanglement paired with a server-side cross-modal transformer for context-aware fusion. Through systematic evaluation on the multimodal HAM10000 skin lesion dataset, we demonstrate that HybridVFL significantly outperforms standard federated baselines, validating the criticality of advanced fusion mechanisms in robust, privacy-preserving systems.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at https://github.com/apple/ml-sharp
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Romain Seailles, Jean-Baptiste Masson, Jean Ponce, Julien Mairal
Machine Learning
ๆฅ็ๆ่ฆ
Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores -- fluorescent molecules stained onto the observed specimen -- over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinders live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at https://github.com/RSLLES/SHOT.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao et al.
Robotics
ๆฅ็ๆ่ฆ
Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Oscar Robben, Saeed Khalilian, Nirvana Meratnia
Machine Learning
ๆฅ็ๆ่ฆ
Early-exit networks are effective solutions for reducing the overall energy consumption and latency of deep learning models by adjusting computation based on the complexity of input data. By incorporating intermediate exit branches into the architecture, they provide less computation for simpler samples, which is particularly beneficial for resource-constrained devices where energy consumption is crucial. However, designing early-exit networks is a challenging and time-consuming process due to the need to balance efficiency and performance. Recent works have utilized Neural Architecture Search (NAS) to design more efficient early-exit networks, aiming to reduce average latency while improving model accuracy by determining the best positions and number of exit branches in the architecture. Another important factor affecting the efficiency and accuracy of early-exit networks is the depth and types of layers in the exit branches. In this paper, we use hardware-aware NAS to strengthen exit branches, considering both accuracy and efficiency during optimization. Our performance evaluation on the CIFAR-10, CIFAR-100, and SVHN datasets demonstrates that our proposed framework, which considers varying depths and layers for exit branches along with adaptive threshold tuning, designs early-exit networks that achieve higher accuracy with the same or lower average number of MACs compared to the state-of-the-art approaches.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lingjing Kong, Shaoan Xie, Yang Jiao, Yetian Chen, Yanhui Guo et al.
ๆฅ็ๆ่ฆ
Compositional generalization -- the ability to understand and generate novel combinations of learned concepts -- enables models to extend their capabilities beyond limited experiences. While effective, the data structures and principles that enable this crucial capability remain poorly understood. We propose that compositional generalization fundamentally requires decomposing high-level concepts into basic, low-level concepts that can be recombined across similar contexts, similar to how humans draw analogies between concepts. For example, someone who has never seen a peacock eating rice can envision this scene by relating it to their previous observations of a chicken eating rice.
In this work, we formalize these intuitive processes using principles of causal modularity and minimal changes. We introduce a hierarchical data-generating process that naturally encodes different levels of concepts and their interaction mechanisms. Theoretically, we demonstrate that this approach enables compositional generalization supporting complex relations between composed concepts, advancing beyond prior work that assumes simpler interactions like additive effects. Critically, we also prove that this latent hierarchical structure is provably recoverable (identifiable) from observable data like text-image pairs, a necessary step for learning such a generative process. To validate our theory, we apply insights from our theoretical framework and achieve significant improvements on benchmark datasets.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tommaso Amico, Pernille Matthews, Lena Krieger, Arthur Zimek, Ira Assent
ๆฅ็ๆ่ฆ
Outlier detection identifies data points that significantly deviate from the majority of the data distribution. Explaining outliers is crucial for understanding the underlying factors that contribute to their detection, validating their significance, and identifying potential biases or errors. Effective explanations provide actionable insights, facilitating preventive measures to avoid similar outliers in the future. Counterfactual explanations clarify why specific data points are classified as outliers by identifying minimal changes required to alter their prediction. Although valuable, most existing counterfactual explanation methods overlook the unique challenges posed by outlier detection, and fail to target classical, widely adopted outlier detection algorithms. Local Outlier Factor (LOF) is one the most popular unsupervised outlier detection methods, quantifying outlierness through relative local density. Despite LOF's widespread use across diverse applications, it lacks interpretability. To address this limitation, we introduce Density-based Counterfactuals for Outliers (DCFO), a novel method specifically designed to generate counterfactual explanations for LOF. DCFO partitions the data space into regions where LOF behaves smoothly, enabling efficient gradient-based optimisation. Extensive experimental validation on 50 OpenML datasets demonstrates that DCFO consistently outperforms benchmarked competitors, offering superior proximity and validity of generated counterfactuals.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lรฉa Bohbot, Cyril Letrouit, Gabriel Peyrรฉ, Franรงois-Xavier Vialard
Natural Language Processing
ๆฅ็ๆ่ฆ
As context windows in large language models continue to expand, it is essential to characterize how attention behaves at extreme sequence lengths. We introduce token-sample complexity: the rate at which attention computed on $n$ tokens converges to its infinite-token limit. We estimate finite-$n$ convergence bounds at two levels: pointwise uniform convergence of the attention map, and convergence of moments for the transformed token distribution. For compactly supported (and more generally sub-Gaussian) distributions, our first result shows that the attention map converges uniformly on a ball of radius $R$ at rate $C(R)/\sqrt{n}$, where $C(R)$ grows exponentially with $R$. For large $R$, this estimate loses practical value, and our second result addresses this issue by establishing convergence rates for the moments of the transformed distribution (the token output of the attention layer). In this case, the rate is $C'(R)/n^ฮฒ$ with $ฮฒ<\tfrac{1}{2}$, and $C'(R)$ depends polynomially on the size of the support of the distribution. The exponent $ฮฒ$ depends on the attention geometry and the spectral properties of the tokens distribution. We also examine the regime in which the attention parameter tends to infinity and the softmax approaches a hardmax, and in this setting, we establish a logarithmic rate of convergence. Experiments on synthetic Gaussian data and real BERT models on Wikipedia text confirm our predictions.
๐
2025-12-11
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Daniyar Kurmankhojayev, Andrei Shadrikov, Dmitrii Gordin, Mikhail Shkorin, Danijar Gabdullin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Face anti-spoofing (FAS) is a vital component of remote biometric authentication systems based on facial recognition, increasingly used across web-based applications. Among emerging threats, video injection attacks -- facilitated by technologies such as deepfakes and virtual camera software -- pose significant challenges to system integrity. While virtual camera detection (VCD) has shown potential as a countermeasure, existing literature offers limited insight into its practical implementation and evaluation. This study introduces a machine learning-based approach to VCD, with a focus on its design and validation. The model is trained on metadata collected during sessions with authentic users. Empirical results demonstrate its effectiveness in identifying video injection attempts and reducing the risk of malicious users bypassing FAS systems.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Liang Peng, Haopeng Liu, Yixuan Ye, Cheng Liu, Wenjun Shen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Unsupervised cell type identification is crucial for uncovering and characterizing heterogeneous populations in single cell omics studies. Although a range of clustering methods have been developed, most focus exclusively on intrinsic cellular structure and ignore the pivotal role of cell-gene associations, which limits their ability to distinguish closely related cell types. To this end, we propose a Refinement Contrastive Learning framework (scRCL) that explicitly incorporates cell-gene interactions to derive more informative representations. Specifically, we introduce two contrastive distribution alignment components that reveal reliable intrinsic cellular structures by effectively exploiting cell-cell structural relationships. Additionally, we develop a refinement module that integrates gene-correlation structure learning to enhance cell embeddings by capturing underlying cell-gene associations. This module strengthens connections between cells and their associated genes, refining the representation learning to exploiting biologically meaningful relationships. Extensive experiments on several single-cell RNA-seq and spatial transcriptomics benchmark datasets demonstrate that our method consistently outperforms state-of-the-art baselines in cell-type identification accuracy. Moreover, downstream biological analyses confirm that the recovered cell populations exhibit coherent gene-expression signatures, further validating the biological relevance of our approach. The code is available at https://github.com/THPengL/scRCL.
๐
2025-12-11
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Neha, Tarunpreet Bhatia
Machine Learning
ๆฅ็ๆ่ฆ
Intrusion Detection Systems (IDS) are critical components in safeguarding 5G/6G networks from both internal and external cyber threats. While traditional IDS approaches rely heavily on signature-based methods, they struggle to detect novel and evolving attacks. This paper presents an advanced IDS framework that leverages adversarial training and dynamic neural networks in 5G/6G networks to enhance network security by providing robust, real-time threat detection and response capabilities. Unlike conventional models, which require costly retraining to update knowledge, the proposed framework integrates incremental learning algorithms, reducing the need for frequent retraining. Adversarial training is used to fortify the IDS against poisoned data. By using fewer features and incorporating statistical properties, the system can efficiently detect potential threats. Extensive evaluations using the NSL- KDD dataset demonstrate that the proposed approach provides better accuracy of 82.33% for multiclass classification of various network attacks while resisting dataset poisoning. This research highlights the potential of adversarial-trained, dynamic neural networks for building resilient IDS solutions.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ C. Bosco, U. Minora, D. de Rigo, J. Pingsdorf, R. Cortinovis
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a mixed-methodology to forecast illegal border crossings in Europe across five key migratory routes, with a one-year time horizon. The methodology integrates machine learning techniques with qualitative insights from migration experts. This approach aims at improving the predictive capacity of data-driven models through the inclusion of a human-assessed covariate, an innovation that addresses challenges posed by sudden shifts in migration patterns and limitations in traditional datasets. The proposed methodology responds directly to the forecasting needs outlined in the EU Pact on Migration and Asylum, supporting the Asylum and Migration Management Regulation (AMMR). It is designed to provide policy-relevant forecasts that inform strategic decisions, early warning systems, and solidarity mechanisms among EU Member States. By joining data-driven modeling with expert judgment, this work aligns with existing academic recommendations and introduces a novel operational tool tailored for EU migration governance. The methodology is tested and validated with known data to demonstrate its applicability and reliability in migration-related policy context.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hendrik Borras, Yong Wu, Bernhard Klein, Holger Frรถning
Machine Learning
ๆฅ็ๆ่ฆ
Bayesian Neural Networks (BNNs) provide principled uncertainty quantification but suffer from substantial computational and memory overhead compared to deterministic networks. While quantization techniques have successfully reduced resource requirements in standard deep learning models, their application to probabilistic models remains largely unexplored. We introduce a systematic multi-level quantization framework for Stochastic Variational Inference based BNNs that distinguishes between three quantization strategies: Variational Parameter Quantization (VPQ), Sampled Parameter Quantization (SPQ), and Joint Quantization (JQ). Our logarithmic quantization for variance parameters, and specialized activation functions to preserve the distributional structure are essential for calibrated uncertainty estimation. Through comprehensive experiments on Dirty-MNIST, we demonstrate that BNNs can be quantized down to 4-bit precision while maintaining both classification accuracy and uncertainty disentanglement. At 4 bits, Joint Quantization achieves up to 8x memory reduction compared to floating-point implementations with minimal degradation in epistemic and aleatoric uncertainty estimation. These results enable deployment of BNNs on resource-constrained edge devices and provide design guidelines for future analog "Bayesian Machines" operating at inherently low precision.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Akhil Agnihotri
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. The first contribution addresses constrained Markov Decision Processes (CMDPs) under the average-cost criterion through the Average-Constrained Policy Optimization (ACPO) algorithm. ACPO integrates sensitivity analysis with trust-region updates to ensure stable constraint handling, achieving state-of-the-art empirical performance with theoretical guarantees. Constrained RL is then extended to finite-horizon settings via e-COP, the first policy optimization method for episodic CMDPs. Built on an episodic policy difference lemma, e-COP offers provable performance, simplicity, and scalability in safety-critical environments. The thesis then investigates reinforcement learning from human preferences. warmPref-PS introduces a posterior sampling strategy for linear bandits that integrates offline preference data from heterogeneous raters into online learning. Explicit modeling of rater competence yields substantial regret reduction and more efficient data collection for RLHF. The PSPL algorithm further advances preference-based RL by jointly sampling reward models and transition dynamics from pairwise trajectory comparisons, providing Bayesian simple-regret guarantees and robust empirical identification of optimal policies. The final contribution applies these methods to large-scale model alignment. A multi-objective constrained optimization view yields MOPO, an iterative algorithm with closed-form updates that scales to multi-billion-parameter language models and remains robust across alignment settings. Collectively, the thesis unifies constrained RL across average-cost, episodic, and preference-driven paradigms, delivering theoretical advances and practical tools for safe and aligned decision-making.
๐
2025-12-11
๐ AAAI 2026
๐ฅ Han Yang, Shaofeng Li, Tian Dong, Xiangyu Xu, Guangchi Liu et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Deep Neural Networks (DNNs), as valuable intellectual property, face unauthorized use. Existing protections, such as digital watermarking, are largely passive; they provide only post-hoc ownership verification and cannot actively prevent the illicit use of a stolen model. This work proposes a proactive protection scheme, dubbed ``Authority Backdoor," which embeds access constraints directly into the model. In particular, the scheme utilizes a backdoor learning framework to intrinsically lock a model's utility, such that it performs normally only in the presence of a specific trigger (e.g., a hardware fingerprint). But in its absence, the DNN's performance degrades to be useless. To further enhance the security of the proposed authority scheme, the certifiable robustness is integrated to prevent an adaptive attacker from removing the implanted backdoor. The resulting framework establishes a secure authority mechanism for DNNs, combining access control with certifiable robustness against adversarial attacks. Extensive experiments on diverse architectures and datasets validate the effectiveness and certifiable robustness of the proposed framework.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ming-Yi Hong, Miao-Chen Chiang, Youchen Teng, Yu-Hsiang Wang, Chih-Yu Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Heterogeneous Graph Neural Networks (HGNNs) are effective for modeling Heterogeneous Information Networks (HINs), which encode complex multi-typed entities and relations. However, HGNNs often suffer from type information loss and structural noise, limiting their representational fidelity and generalization. We propose THeGAU, a model-agnostic framework that combines a type-aware graph autoencoder with guided graph augmentation to improve node classification. THeGAU reconstructs schema-valid edges as an auxiliary task to preserve node-type semantics and introduces a decoder-driven augmentation mechanism to selectively refine noisy structures. This joint design enhances robustness, accuracy, and efficiency while significantly reducing computational overhead. Extensive experiments on three benchmark HIN datasets (IMDB, ACM, and DBLP) demonstrate that THeGAU consistently outperforms existing HGNN methods, achieving state-of-the-art performance across multiple backbones.
๐
2025-12-11
๐ ArXiv Preprint (quant-ph)
๐ฅ Tobias Rohe, Markus Baumann, Michael Poppel, Gerhard Stenzel, Maximilian Zorn et al.
ๆฅ็ๆ่ฆ
Quantum computing (QC) promises theoretical advantages, benefiting computational problems that would not be efficiently classically simulatable. However, much of this theoretical speedup depends on the quantum circuit design solving the problem. We argue that QC literature has yet to explore more domain specific ansatz-topologies, instead of relying on generic, one-size-fits-all architectures. In this work, we show that incorporating task-specific inductive biases -- specifically geometric priors -- into quantum circuit design can enhance the performance of hybrid Quantum Generative Adversarial Networks (QuGANs) on the task of generating geometrically constrained K4 graphs. We evaluate a portfolio of entanglement topologies and loss-function designs to assess their impact on both statistical fidelity and compliance with geometric constraints, including the Triangle and Ptolemaic inequalities. Our results show that aligning circuit topology with the underlying problem structure yields substantial benefits: the Triangle-topology QuGAN achieves the highest geometric validity among quantum models and matches the performance of classical Generative Adversarial Networks (GAN). Additionally, we showcase how specific architectural choices, such as entangling gate types, variance regularization and output-scaling govern the trade-off between geometric consistency and distributional accuracy, thus emphasizing the value of structured, task-aware quantum ansatz-topologies.
๐
2025-12-11
๐ AAAI
๐ฅ Yi Huang, Qingyun Sun, Yisen Gao, Haonan Yuan, Xingcheng Fu et al.
ๆฅ็ๆ่ฆ
The Information Bottleneck (IB) principle facilitates effective representation learning by preserving label-relevant information while compressing irrelevant information. However, its strong reliance on accurate labels makes it inherently vulnerable to label noise, prevalent in real-world scenarios, resulting in significant performance degradation and overfitting. To address this issue, we propose LaT-IB, a novel Label-Noise ResistanT Information Bottleneck method which introduces a "Minimal-Sufficient-Clean" (MSC) criterion. Instantiated as a mutual information regularizer to retain task-relevant information while discarding noise, MSC addresses standard IB's vulnerability to noisy label supervision. To achieve this, LaT-IB employs a noise-aware latent disentanglement that decomposes the latent representation into components aligned with to the clean label space and the noise space. Theoretically, we first derive mutual information bounds for each component of our objective including prediction, compression, and disentanglement, and moreover prove that optimizing it encourages representations invariant to input noise and separates clean and noisy label information. Furthermore, we design a three-phase training framework: Warmup, Knowledge Injection and Robust Training, to progressively guide the model toward noise-resistant representations. Extensive experiments demonstrate that LaT-IB achieves superior robustness and efficiency under label noise, significantly enhancing robustness and applicability in real-world scenarios with label noise.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Asaf Ben Arie, Malka Gorfine
Machine Learning
ๆฅ็ๆ่ฆ
We propose a flexible deep neural network (DNN) framework for modeling survival data within a partially linear regression structure. The approach preserves interpretability through a parametric linear component for covariates of primary interest, while a nonparametric DNN component captures complex time-covariate interactions among nuisance variables. We refer to the method as FLEXI-Haz, a flexible hazard model with a partially linear structure. In contrast to existing DNN approaches for partially linear Cox models, FLEXI-Haz does not rely on the proportional hazards assumption. We establish theoretical guarantees: the neural network component attains minimax-optimal convergence rates based on composite Holder classes, and the linear estimator is root-n consistent, asymptotically normal, and semiparametrically efficient. Extensive simulations and real-data analyses demonstrate that FLEXI-Haz provides accurate estimation of the linear effect, offering a principled and interpretable alternative to modern methods based on proportional hazards. Code for implementing FLEXI-Haz, as well as scripts for reproducing data analyses and simulations, is available at: https://github.com/AsafBanana/FLEXI-Haz
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amartya Roy, Elamparithy M, Kripabandhu Ghosh, Ponnurangam Kumaraguru, Adrian de Wynter
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
In context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remains unclear. Causal reasoning demands multihop composition and strict conjunctive control, and reliance on spurious lexical relations of the input could provide misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder decoder architectures are better suited for said multihop conjunctive reasoning versus decoder only models. To do this, we compare fine-tuned versions of all the aforementioned architectures with zero and few shot ICL in both natural language and non natural language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. In particular, decoder only models are noticeably brittle to distributional shifts, while finetuned encoder and encoder decoder models can generalize more robustly across our tests, including the non natural language split. Both architectures are only matched or surpassed by decoder only architectures at large scales. We conclude by noting that for cost effective, short horizon robust causal reasoning, encoder or encoder decoder architectures with targeted finetuning are preferable.
๐
2025-12-11
๐ ArXiv Preprint (cs.GT)
๐ฅ Chujie Zhao, Qun Hu, Shiping Song, Dagui Chen, Han Zhu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of large language models (LLMs) necessitates novel monetization strategies, among which LLM-native advertising has emerged as a promising paradigm by naturally integrating advertisement within LLM-generated responses. However, this paradigm fundamentally shifts the auction object from discrete ad slots to the distribution over LLM outputs, posing new challenges for designing auction mechanisms. Existing mechanisms for LLM-native advertising adopt frameworks that decouple auction and generation, which either ignore externalities or require multiple LLM inferences for ad allocation, rendering them impractical for industrial scenarios. To address these challenges, we propose LLM-Auction, which to the best of our knowledge is the first learning-based generative auction mechanism that integrates auction and LLM generation for LLM-native advertising. By formulating the allocation optimization as a preference alignment problem between LLM outputs and the mechanism's objective which reflects both advertisers' expected value and user experience, we introduce Iterative Reward-Preference Optimization (IRPO) algorithm that alternately optimizes the reward model and the LLM. This approach enables the LLM to inherently model allocation externalities without any extra inference cost. We further identify the allocation monotonicity and continuity of LLM-Auction, which allows us to prove that a simple first-price payment rule exhibits favorable incentive properties. Additionally, we design an LLM-as-a-judge simulation environment to facilitate large-scale data construction and enable comprehensive quantitative evaluation of the mechanism's performance. Extensive quantitative and qualitative experiments demonstrate that LLM-Auction significantly outperforms existing baselines in allocation efficiency, while achieving the desired mechanism properties.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qingsen Ma, Dianyun Wang, Jiaming Lyu, Yaoye Wang, Lechen Ning et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The Key-Value (KV) cache is the primary memory bottleneck in long-context Large Language Models, yet it is typically treated as an opaque numerical tensor. In this work, we propose \textbf{STA-Attention}, a framework that utilizes Top-K Sparse Autoencoders (SAEs) to decompose the KV cache into interpretable ``semantic atoms.'' Unlike standard $L_1$-regularized SAEs, our Top-K approach eliminates shrinkage bias, preserving the precise dot-product geometry required for attention. Our analysis uncovers a fundamental \textbf{Key-Value Asymmetry}: while Key vectors serve as highly sparse routers dominated by a ``Semantic Elbow,'' deep Value vectors carry dense content payloads requiring a larger budget. Based on this structure, we introduce a Dual-Budget Strategy that selectively preserves the most informative semantic components while filtering representational noise. Experiments on Yi-6B, Mistral-7B, Qwen2.5-32B, and others show that our semantic reconstructions maintain perplexity and zero-shot performance comparable to the original models, effectively bridging the gap between mechanistic interpretability and faithful attention modeling.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sai Bharath Chandra Gutha, Ricardo Vinuesa, Hossein Azizpour
ๆฅ็ๆ่ฆ
A pre-trained unconditional diffusion model, combined with posterior sampling or maximum a posteriori (MAP) estimation techniques, can solve arbitrary inverse problems without task-specific training or fine-tuning. However, existing posterior sampling and MAP estimation methods often rely on modeling approximations and can be computationally demanding. In this work, we propose the variational mode-seeking loss (VML), which, when minimized during each reverse diffusion step, guides the generated sample towards the MAP estimate. VML arises from a novel perspective of minimizing the Kullback-Leibler (KL) divergence between the diffusion posterior $p(\mathbf{x}_0|\mathbf{x}_t)$ and the measurement posterior $p(\mathbf{x}_0|\mathbf{y})$, where $\mathbf{y}$ denotes the measurement. Importantly, for linear inverse problems, VML can be analytically derived and need not be approximated. Based on further theoretical insights, we propose VML-MAP, an empirically effective algorithm for solving inverse problems, and validate its efficacy over existing methods in both performance and computational time, through extensive experiments on diverse image-restoration tasks across multiple datasets.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zahra Rahiminasab, Michael Yuhas, Arvind Easwaran
Machine Learning
ๆฅ็ๆ่ฆ
Recently, the disentangled latent space of a variational autoencoder (VAE) has been used to reason about multi-label out-of-distribution (OOD) test samples that are derived from different distributions than training samples. Disentangled latent space means having one-to-many maps between latent dimensions and generative factors or important characteristics of an image. This paper proposes a disentangled distilled encoder (DDE) framework to decrease the OOD reasoner size for deployment on resource-constrained devices while preserving disentanglement. DDE formalizes student-teacher distillation for model compression as a constrained optimization problem while preserving disentanglement with disentanglement constraints. Theoretical guarantees for disentanglement during distillation based on Rademacher complexity are established. The approach is evaluated empirically by deploying the compressed model on an NVIDIA
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chihyeon Song, Jaewoo Lee, Jinkyoo Park
Machine Learning
ๆฅ็ๆ่ฆ
Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call 'on-policyness'. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy's behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively leverages offline data for initial stability while progressively focusing learning on the most relevant, high-rewarding online experiences. Our extensive experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms, highlighting the importance of an adaptive, behavior-aware replay buffer design.
๐
2025-12-11
๐ ArXiv Preprint (eess.IV)
๐ฅ Tomohiko Mizutani
ๆฅ็ๆ่ฆ
Endmember extraction from hyperspectral images aims to identify the spectral signatures of materials present in a scene. Recent studies have shown that self-dictionary methods can achieve high extraction accuracy; however, their high computational cost limits their applicability to large-scale hyperspectral images. Although several approaches have been proposed to mitigate this issue, it remains a major challenge. Motivated by this situation, this paper pursues a data reduction approach. Assuming that the hyperspectral image follows the linear mixing model with the pure-pixel assumption, we develop a data reduction technique that removes pixels that do not contain endmembers. We analyze the theoretical properties of this reduction step and show that it preserves pixels that lie close to the endmembers. Building on this result, we propose a data-reduced self-dictionary method that integrates the data reduction with a self-dictionary method based on a linear programming formulation. Numerical experiments demonstrate that the proposed method can substantially reduce the computational time of the original self-dictionary method without sacrificing endmember extraction accuracy.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiaxi Wu, Tiantian Zhang, Yuxing Wang, Yongzhe Chang, Xueqian Wang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Robust adversarial reinforcement learning has emerged as an effective paradigm for training agents to handle uncertain disturbance in real environments, with critical applications in sequential decision-making domains such as autonomous driving and robotic control. Within this paradigm, agent training is typically formulated as a zero-sum Markov game between a protagonist and an adversary to enhance policy robustness. However, the trainable nature of the adversary inevitably induces non-stationarity in the learning dynamics, leading to exacerbated training instability and convergence difficulties, particularly in high-dimensional complex environments. In this paper, we propose a novel approach, Uncertainty-Aware Critic Ensemble for robust adversarial Reinforcement learning (UACER), which consists of two strategies: 1) Diversified critic ensemble: a diverse set of K critic networks is exploited in parallel to stabilize Q-value estimation rather than conventional single-critic architectures for both variance reduction and robustness enhancement. 2) Time-varying Decay Uncertainty (TDU) mechanism: advancing beyond simple linear combinations, we develop a variance-derived Q-value aggregation strategy that explicitly incorporates epistemic uncertainty to dynamically regulate the exploration-exploitation trade-off while simultaneously stabilizing the training process. Comprehensive experiments across several MuJoCo control problems validate the superior effectiveness of UACER, outperforming state-of-the-art methods in terms of overall performance, stability, and efficiency.
๐
2025-12-11
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Chaomeng Lu, Bert Lagaisse
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Vulnerability detection methods based on deep learning (DL) have shown strong performance on benchmark datasets, yet their real-world effectiveness remains underexplored. Recent work suggests that both graph neural network (GNN)-based and transformer-based models, including large language models (LLMs), yield promising results when evaluated on curated benchmark datasets. These datasets are typically characterized by consistent data distributions and heuristic or partially noisy labels. In this study, we systematically evaluate two representative DL models-ReVeal and LineVul-across four representative datasets: Juliet, Devign, BigVul, and ICVul. Each model is trained independently on each respective dataset, and their code representations are analyzed using t-SNE to uncover vulnerability related patterns. To assess realistic applicability, we deploy these models along with four pretrained LLMs, Claude 3.5 Sonnet, GPT-o3-mini, GPT-4o, and GPT-5 on a curated dataset, VentiVul, comprising 20 recently (May 2025) fixed vulnerabilities from the Linux kernel. Our experiments reveal that current models struggle to distinguish vulnerable from non-vulnerable code in representation space and generalize poorly across datasets with differing distributions. When evaluated on VentiVul, our newly constructed time-wise out-of-distribution dataset, performance drops sharply, with most models failing to detect vulnerabilities reliably. These results expose a persistent gap between academic benchmarks and real-world deployment, emphasizing the value of our deployment-oriented evaluation framework and the need for more robust code representations and higher-quality datasets.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoyu Zhu, Yao Zhang, Jiashen Ren, Qingchun Hou
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Neural network constraint satisfaction is crucial for safety-critical applications such as power system optimization, robotic path planning, and autonomous driving. However, existing constraint satisfaction methods face efficiency-applicability trade-offs, with hard constraint methods suffering from either high computational complexity or restrictive assumptions on constraint structures. The Sampling Kaczmarz-Motzkin (SKM) method is a randomized iterative algorithm for solving large-scale linear inequality systems with favorable convergence properties, but its argmax operations introduce non-differentiability, posing challenges for neural network applications. This work proposes the Trainable Sampling Kaczmarz-Motzkin Network (T-SKM-Net) framework and, for the first time, systematically integrates SKM-type methods into neural network constraint satisfaction. The framework transforms mixed constraint problems into pure inequality problems through null space transformation, employs SKM for iterative solving, and maps solutions back to the original constraint space, efficiently handling both equality and inequality constraints. We provide theoretical proof of post-processing effectiveness in expectation and end-to-end trainability guarantees based on unbiased gradient estimators, demonstrating that despite non-differentiable operations, the framework supports standard backpropagation. On the DCOPF case118 benchmark, our method achieves 4.27ms/item GPU serial forward inference with 0.0025% max optimality gap with post-processing mode and 5.25ms/item with 0.0008% max optimality gap with joint training mode, delivering over 25$\times$ speedup compared to the pandapower solver while maintaining zero constraint violations under given tolerance.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shiv Ratn, Shivang Rampriyan, Bahni Ray
Machine Learning
ๆฅ็ๆ่ฆ
Forward Osmosis (FO) is a promising low-energy membrane separation technology, but challenges in accurately modelling its water flux (Jw) persist due to complex internal mass transfer phenomena. Traditional mechanistic models struggle with empirical parameter variability, while purely data-driven models lack physical consistency and rigorous uncertainty quantification (UQ). This study introduces a novel Robust Hybrid Physics-ML framework employing Gaussian Process Regression (GPR) for highly accurate, uncertainty-aware Jw prediction. The core innovation lies in training the GPR on the residual error between the detailed, non-linear FO physical model prediction (Jw_physical) and the experimental water flux (Jw_actual). Crucially, we implement a full UQ methodology by decomposing the total predictive variance (sigma2_total) into model uncertainty (epistemic, from GPR's posterior variance) and input uncertainty (aleatoric, analytically propagated via the Delta method for multi-variate correlated inputs). Leveraging the inherent strength of GPR in low-data regimes, the model, trained on a meagre 120 data points, achieved a state-of-the-art Mean Absolute Percentage Error (MAPE) of 0.26% and an R2 of 0.999 on the independent test data, validating a truly robust and reliable surrogate model for advanced FO process optimization and digital twin development.
๐
2025-12-11
๐ NeurIPS 2025
๐ฅ Le Tuan Minh Trinh, Le Minh Vu Pham, Thi Minh Anh Pham, An Duc Nguyen
Machine Learning
ๆฅ็ๆ่ฆ
A key aspect of human cognition is metacognition - the ability to assess one's own knowledge and judgment reliability. While deep learning models can express confidence in their predictions, they often suffer from poor calibration, a cognitive bias where expressed confidence does not reflect true competence. Do models truly know what they know? Drawing from human cognitive science, we propose a new framework for evaluating and leveraging AI metacognition. We introduce meta-d', a psychologically-grounded measure of metacognitive sensitivity, to characterise how reliably a model's confidence predicts its own accuracy. We then use this dynamic sensitivity score as context for a bandit-based arbiter that performs test-time model selection, learning which of several expert models to trust for a given task. Our experiments across multiple datasets and deep learning model combinations (including CNNs and VLMs) demonstrate that this metacognitive approach improves joint-inference accuracy over constituent models. This work provides a novel behavioural account of AI models, recasting ensemble selection as a problem of evaluating both short-term signals (confidence prediction scores) and medium-term traits (metacognitive sensitivity).
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Francesco Freni, Anya Fries, Linus Kรผhne, Markus Reichstein, Jonas Peters
ๆฅ็ๆ่ฆ
We consider a regression setting where observations are collected in different environments modeled by different data distributions. The field of out-of-distribution (OOD) generalization aims to design methods that generalize better to test environments whose distributions differ from those observed during training. One line of such works has proposed to minimize the maximum risk across environments, a principle that we refer to as MaxRM (Maximum Risk Minimization). In this work, we introduce variants of random forests based on the principle of MaxRM. We provide computationally efficient algorithms and prove statistical consistency for our primary method. Our proposed method can be used with each of the following three risks: the mean squared error, the negative reward (which relates to the explained variance), and the regret (which quantifies the excess risk relative to the best predictor). For MaxRM with regret as the risk, we prove a novel out-of-sample guarantee over unseen test distributions. Finally, we evaluate the proposed methods on both simulated and real-world data.
๐
2025-12-11
๐ ArXiv Preprint (cs.DC)
๐ฅ Sabtain Ahmad, Meerzhan Kanatbekova, Ivona Brandic, Atakan Aral
ๆฅ็ๆ่ฆ
Clustered Federated Learning (CFL) has emerged as a powerful approach for addressing data heterogeneity and ensuring privacy in large distributed IoT environments. By clustering clients and training cluster-specific models, CFL enables personalized models tailored to groups of heterogeneous clients. However, conventional CFL approaches suffer from fragmented learning for training independent global models for each cluster and fail to take advantage of collective cluster insights. This paper advocates a shift to hierarchical CFL, allowing bi-level aggregation to train cluster-specific models at the edge and a unified global model at the cloud. This shift improves training efficiency yet might introduce communication challenges. To this end, we propose CFLHKD, a novel personalization scheme for integrating hierarchical cluster knowledge into CFL. Built upon multi-teacher knowledge distillation, CFLHKD enables inter-cluster knowledge sharing while preserving cluster-specific personalization. CFLHKD adopts a bi-level aggregation to bridge the gap between local and global learning. Extensive evaluations of standard benchmark datasets demonstrate that CFLHKD outperforms representative baselines in cluster-specific and global model accuracy and achieves a performance improvement of 3.32-7.57\%.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yizhou Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Modern deep networks operate in a rough, finite-regularity regime where Jacobian-induced operators exhibit heavy-tailed spectra and strong basis drift. In this work, we derive a unified operator-theoretoretic description of neural training dynamics directly from gradient descent. Starting from the exact evolution $\dot e_t = -M(t)e_t$ in function space, we apply Kato perturbation theory to obtain a rigorous system of coupled mode ODEs and show that, after coarse-graining, these dynamics converge to a spectral transport--dissipation PDE \[ \partial_t g + \partial_ฮป(v g) = -ฮปg + S, \] where $v$ captures eigenbasis drift and $S$ encodes nonlocal spectral coupling.
We prove that neural training preserves functional regularity, forcing the drift to take an asymptotic power-law form $v(ฮป,t)\sim -c(t)ฮป^b$. In the weak-coupling regime -- naturally induced by spectral locality and SGD noise -- the PDE admits self-similar solutions with a resolution frontier, polynomial amplitude growth, and power-law dissipation. This structure yields explicit scaling-law exponents, explains the geometry of double descent, and shows that the effective training time satisfies $ฯ(t)=t^ฮฑL(t)$ for slowly varying $L$.
Finally, we show that NTK training and feature learning arise as two limits of the same PDE: $v\equiv 0$ recovers lazy dynamics, while $v\neq 0$ produces representation drift. Our results provide a unified spectral framework connecting operator geometry, optimization dynamics, and the universal scaling behavior of modern deep networks.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Christian Soize
ๆฅ็ๆ่ฆ
This paper introduces a new probabilistic framework for supervised learning in neural systems. It is designed to model complex, uncertain systems whose random outputs are strongly non-Gaussian given deterministic inputs. The architecture itself is a random object stochastically generated by a latent anisotropic Gaussian random field defined on a compact, boundaryless, multiply-connected manifold. The goal is to establish a novel conceptual and mathematical framework in which neural architectures are realizations of a geometry-aware, field-driven generative process. Both the neural topology and synaptic weights emerge jointly from a latent random field. A reduced-order parameterization governs the spatial intensity of an inhomogeneous Poisson process on the manifold, from which neuron locations are sampled. Input and output neurons are identified via extremal evaluations of the latent field, while connectivity is established through geodesic proximity and local field affinity. Synaptic weights are conditionally sampled from the field realization, inducing stochastic output responses even for deterministic inputs. To ensure scalability, the architecture is sparsified via percentile-based diffusion masking, yielding geometry-aware sparse connectivity without ad hoc structural assumptions. Supervised learning is formulated as inference on the generative hyperparameters of the latent field, using a negative log-likelihood loss estimated through Monte Carlo sampling from single-observation-per-input datasets. The paper initiates a mathematical analysis of the model, establishing foundational properties such as well-posedness, measurability, and a preliminary analysis of the expressive variability of the induced stochastic mappings, which support its internal coherence and lay the groundwork for a broader theory of geometry-driven stochastic learning.
๐
2025-12-11
๐ KDD 2026
๐ฅ Zhou Feng, Jiahao Chen, Chunyi Zhou, Yuwen Pu, Tianyu Du et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Deep neural networks (DNNs) underpin critical applications yet remain vulnerable to backdoor attacks, typically reliant on heuristic brute-force methods. Despite significant empirical advancements in backdoor research, the lack of rigorous theoretical analysis limits understanding of underlying mechanisms, constraining attack predictability and adaptability. Therefore, we provide a theoretical analysis targeting backdoor attacks, focusing on how sparse decision boundaries enable disproportionate model manipulation. Based on this finding, we derive a closed-form, ambiguous boundary region, wherein negligible relabeled samples induce substantial misclassification. Influence function analysis further quantifies significant parameter shifts caused by these margin samples, with minimal impact on clean accuracy, formally grounding why such low poison rates suffice for efficacious attacks. Leveraging these insights, we propose Eminence, an explainable and robust black-box backdoor framework with provable theoretical guarantees and inherent stealth properties. Eminence optimizes a universal, visually subtle trigger that strategically exploits vulnerable decision boundaries and effectively achieves robust misclassification with exceptionally low poison rates (< 0.1%, compared to SOTA methods typically requiring > 1%). Comprehensive experiments validate our theoretical discussions and demonstrate the effectiveness of Eminence, confirming an exponential relationship between margin poisoning and adversarial boundary manipulation. Eminence maintains > 90% attack success rate, exhibits negligible clean-accuracy loss, and demonstrates high transferability across diverse models, datasets and scenarios.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Jennifer Rosina Andersson, Zheng Zhao
ๆฅ็ๆ่ฆ
This paper is concerned with differentiable resampling in the context of sequential Monte Carlo (e.g., particle filtering). We propose a new informative resampling method that is instantly pathwise differentiable, based on an ensemble score diffusion model. We prove that our diffusion resampling method provides a consistent estimate to the resampling distribution, and we show by experiments that it outperforms the state-of-the-art differentiable resampling methods when used for stochastic filtering and parameter estimation.
๐
2025-12-11
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin et al.
Robotics
ๆฅ็ๆ่ฆ
Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Weifan Guan, Huasen Xi, Chenxiao Zhang, Aosheng Li, Qinghao Hu et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Current embodied AI systems face severe engineering impediments, primarily characterized by poor cross-scenario adaptability, rigid inter-module coupling, and fragmented inference acceleration. To overcome these limitations, we propose RoboNeuron, a universal deployment framework for embodied intelligence. RoboNeuron is the first framework to deeply integrate the cognitive capabilities of Large Language Models (LLMs) and Vision-Language-Action (VLA) models with the real-time execution backbone of the Robot Operating System (ROS). We utilize the Model Context Protocol (MCP) as a semantic bridge, enabling the LLM to dynamically orchestrate underlying robotic tools. The framework establishes a highly modular architecture that strictly decouples sensing, reasoning, and control by leveraging ROS's unified communication interfaces. Crucially, we introduce an automated tool to translate ROS messages into callable MCP functions, significantly streamlining development. RoboNeuron significantly enhances cross-scenario adaptability and component flexibility, while establishing a systematic platform for horizontal performance benchmarking, laying a robust foundation for scalable real-world embodied applications.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vijay Prakash S
ๆฅ็ๆ่ฆ
Magnetization of a ferromagnetic substance in response to an externally applied magnetic field increases with the strength of the field. This is because at the microscopic level, magnetic moments in certain regions or domains of the substance increasingly align with the applied field, while the amount of misaligned domains decreases. The alignment of such magnetic domains with an applied magnetic field forms the physical basis for the nonlinearity of magnetization. In this paper, the nonlinear function is approximated as a combination of continued fraction of straight lines. The resulting fit is used to interpret the nonlinear behavior in both growing and shrinking magnetic domains. The continued fraction of straight lines used here is an algebraic expression which can be used to estimate parameters using nonlinear regression.
๐
2025-12-11
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yash Srivastava, Shalin Jain, Sneha Awathare, Nitin Awathare
Machine Learning
ๆฅ็ๆ่ฆ
The rising demand for collaborative machine learning and data analytics calls for secure and decentralized data sharing frameworks that balance privacy, trust, and incentives. Existing approaches, including federated learning (FL) and blockchain-based data markets, fall short: FL often depends on trusted aggregators and lacks Byzantine robustness, while blockchain frameworks struggle with computation-intensive training and incentive integration.
We present \prot, a decentralized data marketplace that unifies federated learning, blockchain arbitration, and economic incentives into a single framework for privacy-preserving data sharing. \prot\ enables data buyers to submit bid-based requests via blockchain smart contracts, which manage auctions, escrow, and dispute resolution. Computationally intensive training is delegated to \cone\ (\uline{Co}mpute \uline{N}etwork for \uline{E}xecution), an off-chain distributed execution layer. To safeguard against adversarial behavior, \prot\ integrates a modified YODA protocol with exponentially growing execution sets for resilient consensus, and introduces Corrected OSMD to mitigate malicious or low-quality contributions from sellers. All protocols are incentive-compatible, and our game-theoretic analysis establishes honesty as the dominant strategy.
We implement \prot\ on Ethereum and evaluate it over benchmark datasets -- MNIST, Fashion-MNIST, and CIFAR-10 -- under varying adversarial settings. \prot\ achieves up to 99\% accuracy on MNIST and 90\% on Fashion-MNIST, with less than 3\% degradation up to 30\% Byzantine nodes, and 56\% accuracy on CIFAR-10 despite its complexity. Our results show that \prot\ ensures privacy, maintains robustness under adversarial conditions, and scales efficiently with the number of participants, making it a practical foundation for real-world decentralized data sharing.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hangyu Mao, Guangting Dong, Zhicheng Dou
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We present the Generalized Policy Gradient (GPG) Theorem, specifically designed for Transformer-based policies. Notably, we demonstrate that both standard Policy Gradient Theorem and GRPO emerge as special cases within our GPG framework. Furthermore, we explore its practical applications in training Large Language Models (LLMs), offering new insights into efficient policy optimization.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hyunsung Kim, Sangwoo Seo, Hoyoung Choi, Tom Boomstra, Jinsung Yoon et al.
Machine Learning
ๆฅ็ๆ่ฆ
Evaluating defensive performance in soccer remains challenging, as effective defending is often expressed not through visible on-ball actions such as interceptions and tackles, but through preventing dangerous opportunities before they arise. Existing approaches have largely focused on valuing on-ball actions, leaving much of defenders' true impact unmeasured. To address this gap, we propose DEFCON (DEFensive CONtribution evaluator), a comprehensive framework that quantifies player-level defensive contributions for every attacking situation in soccer. Leveraging Graph Attention Networks, DEFCON estimates the success probability and expected value of each attacking option, along with each defender's responsibility for stopping it. These components yield an Expected Possession Value (EPV) for the attacking team before and after each action, and DEFCON assigns positive or negative credits to defenders according to whether they reduced or increased the opponent's EPV. Trained on 2023-24 and evaluated on 2024-25 Eredivisie event and tracking data, DEFCON's aggregated player credits exhibit strong positive correlations with market valuations. Finally, we showcase several practical applications, including in-game timelines of defensive contributions, spatial analyses across pitch zones, and pairwise summaries of attacker-defender interactions.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nicolas Tacheny
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Agentic systems built on large language models operate through recursive feedback loops, where each output becomes the next input. Yet the geometric behavior of these agentic loops (whether they converge, diverge, or exhibit more complex dynamics) remains poorly understood. This paper introduces a geometric framework for analyzing agentic trajectories in semantic embedding space, treating iterative transformations as discrete dynamical systems. We distinguish the artifact space, where linguistic transformations occur, from the embedding space, where geometric measurements are performed. Because cosine similarity is biased by embedding anisotropy, we introduce an isotonic calibration that eliminates systematic bias and aligns similarities with human semantic judgments while preserving high local stability. This enables rigorous measurement of trajectories, clusters and attractors. Through controlled experiments on singular agentic loops, we identify two fundamental regimes. A contractive rewriting loop converges toward a stable attractor with decreasing dispersion, while an exploratory summarize and negate loop produces unbounded divergence with no cluster formation. These regimes display qualitatively distinct geometric signatures of contraction and expansion. Our results show that prompt design directly governs the dynamical regime of an agentic loop, enabling systematic control of convergence, divergence and trajectory structure in iterative LLM transformations.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vinoth Punniyamoorthy, Ashok Gadi Parthi, Mayilsamy Palanigounder, Ravi Kiran Kodali, Bikesh Kumar et al.
Machine Learning
ๆฅ็ๆ่ฆ
Distributed machine learning systems require strong privacy guarantees, verifiable compliance, and scalable deploy- ment across heterogeneous and multi-cloud environments. This work introduces a cloud-native privacy-preserving architecture that integrates federated learning, differential privacy, zero- knowledge compliance proofs, and adaptive governance powered by reinforcement learning. The framework supports secure model training and inference without centralizing sensitive data, while enabling cryptographically verifiable policy enforcement across institutions and cloud platforms. A full prototype deployed across hybrid Kubernetes clusters demonstrates reduced membership- inference risk, consistent enforcement of formal privacy budgets, and stable model performance under differential privacy. Ex- perimental evaluation across multi-institution workloads shows that the architecture maintains utility with minimal overhead while providing continuous, risk-aware governance. The pro- posed framework establishes a practical foundation for deploying trustworthy and compliant distributed machine learning systems at scale.
๐
2025-12-11
๐ ArXiv Preprint (math.OC)
๐ฅ Francesco Alemanno
ๆฅ็ๆ่ฆ
Nonlinear inverse problems often feature noisy, non-differentiable, or expensive residual evaluations that make Jacobian-based solvers unreliable. Popular derivative-free optimizers such as natural evolution strategies (NES) or Powell's NEWUOA still assume smoothness or expend many evaluations to maintain stability. Ensemble Kalman inversion (EKI) relies on empirical covariances that require preconditioning and scale poorly with residual dimension.
We introduce residual subspace evolution strategies (RSES), a derivative-free solver that samples Gaussian probes around the current iterate, builds a residual-only surrogate from their differences, and recombines the probes through a least-squares solve yielding an optimal update without forming Jacobians or covariances. Each iteration costs $k+1$ residual evaluations, where $k \ll n$ for $n$-dimensional problems, with $O(k^3)$ linear algebra overhead.
Benchmarks on calibration, regression, and deconvolution problems demonstrate consistent misfit reduction in both deterministic and stochastic settings. RSES matches or surpasses xNES and NEWUOA while staying competitive with EKI under matched evaluation budgets, particularly when smoothness or covariance assumptions fail.
๐
2025-12-11
๐ ArXiv Preprint (q-bio.MN)
๐ฅ Jiayu Weng, Xinyi Zhu, Jing Liu, Linyuan Lรผ, Pan Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Chemical reaction networks are widely used to model stochastic dynamics in chemical kinetics, systems biology and epidemiology. Solving the chemical master equation that governs these systems poses a significant challenge due to the large state space exponentially growing with system sizes. The development of autoregressive neural networks offers a flexible framework for this problem; however, its efficiency is limited especially for high-dimensional systems and in scenarios with rare events. Here, we push the frontier of neural-network approach by exploiting faster optimizations such as natural gradient descent and time-dependent variational principle, achieving a 5- to 22-fold speedup, and by leveraging enhanced-sampling strategies to capture rare events. We demonstrate reduced computational cost and higher accuracy over the previous neural-network method in challenging reaction networks, including the mitogen-activated protein kinase (MAPK) cascade network, the hitherto largest biological network handled by the previous approaches of solving the chemical master equation. We further apply the approach to spatially extended reaction-diffusion systems, the Schlรถgl model with rare events, on two-dimensional lattices, beyond the recent tensor-network approach that handles one-dimensional lattices. The present approach thus enables efficient modeling of chemical reaction networks in general.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vasiliki Stoumpou, Maciej Tysarowski, Talhat Azemi, Jawad Haider, Howard L. Haronian et al.
ๆฅ็ๆ่ฆ
Background. Treatment selection for low to intermediate risk patients with severe aortic stenosis between surgical (SAVR) and transcatheter (TAVR) aortic valve replacement remains variable in clinical practice, driven by patient heterogeneity and institutional preferences. While existing models predict postprocedural risk, there is a lack of interpretable, individualized treatment recommendations that directly optimize long-term outcomes.
Methods. We introduce an interpretable prescriptive framework that integrates prognostic matching, counterfactual outcome modeling, and an Optimal Policy Tree (OPT) to recommend the treatment minimizing expected 5-year mortality. Using data from Hartford Hospital and St. Vincent's Hospital, we emulate randomization via prognostic matching and sample weighting and estimate counterfactual mortality under both SAVR and TAVR. The policy model, trained on these counterfactual predictions, partitions patients into clinically coherent subgroups and prescribes the treatment associated with lower estimated risk.
Findings. If the OPT prescriptions are applied, counterfactual evaluation showed an estimated reduction in 5-year mortality of 20.3\% in Hartford and 13.8\% in St. Vincent's relative to real-life prescriptions, showing promising generalizability to unseen data from a different institution. The learned decision boundaries aligned with real-world outcomes and clinical observations.
Interpretation. Our interpretable prescriptive framework is, to the best of our knowledge, the first to provide transparent, data-driven recommendations for TAVR versus SAVR that improve estimated long-term outcomes both in an internal and external cohort, while remaining clinically grounded and contributing toward a more systematic and evidence-based approach to precision medicine in structural heart disease.
๐
2025-12-11
๐ This paper has been accepted for publication in IE
๐ฅ Md Nahid Hasan Shuvo, Moinul Hossain, Anik Mallik, Jeffrey Twigg, Fikadu Dagefu
Machine Learning
ๆฅ็ๆ่ฆ
Federated Learning (FL) enables collaborative model training across distributed devices while safeguarding data and user privacy. However, FL remains susceptible to privacy threats that can compromise data via direct means. That said, indirectly compromising the confidentiality of the FL model architecture (e.g., a convolutional neural network (CNN) or a recurrent neural network (RNN)) on a client device by an outsider remains unexplored. If leaked, this information can enable next-level attacks tailored to the architecture. This paper proposes a novel side-channel fingerprinting attack, leveraging flow-level and packet-level statistics of encrypted wireless traffic from an FL client to infer its deep learning model architecture. We name it FLARE, a fingerprinting framework based on FL Architecture REconnaissance. Evaluation across various CNN and RNN variants-including pre-trained and custom models trained over IEEE 802.11 Wi-Fi-shows that FLARE achieves over 98% F1-score in closed-world and up to 91% in open-world scenarios. These results reveal that CNN and RNN models leak distinguishable traffic patterns, enabling architecture fingerprinting even under realistic FL settings with hardware, software, and data heterogeneity. To our knowledge, this is the first work to fingerprint FL model architectures by sniffing encrypted wireless traffic, exposing a critical side-channel vulnerability in current FL systems.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Apurba Sarker, Reza T. Batley, Darshan Sarojini, Sourav Saha
Machine Learning
ๆฅ็ๆ่ฆ
Surrogate models provide fast alternatives to costly aerodynamic simulations and are extremely useful in design and optimization applications. This study proposes the use of a recent kernel-based neural surrogate, KHRONOS. In this work, we blend sparse high-fidelity (HF) data with low-fidelity (LF) information to predict aerodynamic fields under varying constraints in computational resources. Unlike traditional approaches, KHRONOS is built upon variational principles, interpolation theory, and tensor decomposition. These elements provide a mathematical basis for heavy pruning compared to dense neural networks. Using the AirfRANS dataset as a high-fidelity benchmark and NeuralFoil to generate low-fidelity counterparts, this work compares the performance of KHRONOS with three contemporary model architectures: a multilayer perceptron (MLP), a graph neural network (GNN), and a physics-informed neural network (PINN). We consider varying levels of high-fidelity data availability (0%, 10%, and 30%) and increasingly complex geometry parameterizations. These are used to predict the surface pressure coefficient distribution over the airfoil. Results indicate that, whilst all models eventually achieve comparable predictive accuracy, KHRONOS excels in resource-constrained conditions. In this domain, KHRONOS consistently requires orders of magnitude fewer trainable parameters and delivers much faster training and inference than contemporary dense neural networks at comparable accuracy. These findings highlight the potential of KHRONOS and similar architectures to balance accuracy and efficiency in multi-fidelity aerodynamic field prediction.
๐
2025-12-11
๐ ICML 2026
๐ฅ Waleed Razzaq, Izis Kankaraway, Yun-Bo Zhao
Robotics
ๆฅ็ๆ่ฆ
Attention improves representation learning over RNNs, but its discrete nature limits continuous-time (CT) modeling. We introduce Neuronal Attention Circuit (NAC), a novel, biologically plausible CT-Attention mechanism that reformulates attention logits computation as the solution to a linear first-order ODE with nonlinear interlinked gates derived from repurposing \textit{C. elegans} Neuronal Circuit Policies (NCPs) wiring mechanism. NAC replaces dense projections with sparse sensory gates for key-query projections and a sparse backbone network with two heads for computing \textit{content-target} and \textit{learnable time-constant} gates, enabling efficient adaptive dynamics. NAC supports three attention logit computation modes: (i) explicit Euler integration, (ii) exact closed-form solution, and (iii) steady-state approximation. To improve memory intensity, we implemented a sparse Top-\emph{K} pairwise concatenation scheme that selectively curates key-query interactions. We provide rigorous theoretical guarantees, including state stability, bounded approximation errors, and universal approximation. Empirically, we implemented NAC in diverse domains, including irregular time-series classification, lane-keeping for autonomous vehicles, and industrial prognostics. We observed that NAC matches or outperforms competing baselines in accuracy and occupies an intermediate position in runtime and memory efficiency compared with several CT baselines.
๐
2025-12-11
๐ ArXiv Preprint (cs.DC)
๐ฅ Shruti Dongare, Redwan Ibne Seraj Khan, Hadeel Albahar, Nannan Zhao, Diego Melendez Maita et al.
Machine Learning
ๆฅ็ๆ่ฆ
Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application characteristics pose major challenges for existing schedulers, which often rely on offline profiling or application-specific assumptions. We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically prioritizes and allocates DL jobs on heterogeneous GPU clusters. RLTune integrates RL-driven prioritization with MILP-based job-to-node mapping to optimize system-wide objectives such as job completion time (JCT), queueing delay, and resource utilization. Trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba, RLTune improves GPU utilization by up to 20%, reduces queueing delay by up to 81%, and shortens JCT by as much as 70 percent. Unlike prior approaches, RLTune generalizes across diverse workloads without requiring per-job profiling, making it practical for cloud providers to deploy at scale for more efficient, fair, and sustainable DL workload management.
๐
2025-12-11
๐ 17 pages, 9 figures. Under review for IEEE TPAMI
๐ฅ Duo Wang, Xinming Wang, Chao Wang, Xiaowei Yue, Jianguo Wu
Machine Learning
ๆฅ็ๆ่ฆ
Multi-output Gaussian process (MGP) models have attracted significant attention for their flexibility and uncertainty-quantification capabilities, and have been widely adopted in multi-source transfer learning scenarios due to their ability to capture inter-task correlations. However, they still face several challenges in transfer learning. First, the input spaces of the source and target domains are often heterogeneous, which makes direct knowledge transfer difficult. Second, potential prior knowledge and physical information are typically ignored during heterogeneous transfer, hampering the utilization of domain-specific insights and leading to unstable mappings. Third, inappropriate information sharing among target and sources can easily lead to negative transfer. Traditional models fail to address these issues in a unified way. To overcome these limitations, this paper proposes a Double-Regularized Heterogeneous Gaussian Process framework (R^2-HGP). Specifically, a trainable prior probability mapping model is first proposed to align the heterogeneous input domains. The resulting aligned inputs are treated as latent variables, upon which a multi-source transfer GP model is constructed and the entire structure is integrated into a novel conditional variational autoencoder (CVAE) based framework. Physical insights is further incorporated as a regularization term to ensure that the alignment results adhere to known physical knowledge. Next, within the multi-source transfer GP model, a sparsity penalty is imposed on the transfer coefficients, enabling the model to adaptively select the most informative source outputs and suppress negative transfer. Extensive simulations and real-world engineering case studies validate the effectiveness of our R^2-HGP, demonstrating consistent superiority over state-of-the-art benchmarks across diverse evaluation metrics.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Quanjun Lang, Jianfeng Lu
ๆฅ็ๆ่ฆ
We analyze prediction error in stochastic dynamical systems with memory, focusing on generalized Langevin equations (GLEs) formulated as stochastic Volterra equations. We establish that, under a strongly convex potential, trajectory discrepancies decay at a rate determined by the decay of the memory kernel and are quantitatively bounded by the estimation error of the kernel in a weighted norm. Our analysis integrates synchronized noise coupling with a Volterra comparison theorem, encompassing both subexponential and exponential kernel classes. For first-order models, we derive moment and perturbation bounds using resolvent estimates in weighted spaces. For second-order models with confining potentials, we prove contraction and stability under kernel perturbations using a hypocoercive Lyapunov-type distance. This framework accommodates non-translation-invariant kernels and white-noise forcing, explicitly linking improved kernel estimation to enhanced trajectory prediction. Numerical examples validate these theoretical findings.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tian Liu, Anwesha Basu, James Caverlee, Shu Kong
Machine LearningNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Semi-supervised few-shot learning (SSFSL) formulates real-world applications like ''auto-annotation'', as it aims to learn a model over a few labeled and abundant unlabeled examples to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather ''flat'' distributions of softmax probabilities. This results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. They jointly increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data, and strengthening supervision signals. Building on this, we propose: Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by $\sim$5 accuracy points. SWIFT even rivals supervised learning, which finetunes VLMs with the unlabeled data being labeled with ground truth!
๐
2025-12-11
๐ ArXiv Preprint (cs.DC)
๐ฅ Shagnik Pal, Shaizeen Aga, Suchita Pati, Mahzabeen Islam, Lizy K. John
ๆฅ็ๆ่ฆ
As both ML training and inference are increasingly distributed, parallelization techniques that shard (divide) ML model across GPUs of a distributed system, are often deployed. With such techniques, there is a high prevalence of data-dependent communication and computation operations where communication is exposed, leaving as high as 1.7x ideal performance on the table. Prior works harness the fact that ML model state and inputs are already sharded, and employ careful overlap of individual computation/communication shards. While such coarse-grain overlap is promising, in this work, we instead make a case for finer-grain compute-communication overlap which we term FiCCO, where we argue for finer-granularity, one-level deeper overlap than at shard-level, to unlock compute/communication overlap for a wider set of network topologies, finer-grain dataflow and more. We show that FiCCO opens up a wider design space of execution schedules than possible at shard-level alone. At the same time, decomposition of ML operations into smaller operations (done in both shard-based and finer-grain techniques) causes operation-level inefficiency losses. To balance the two, we first present a detailed characterization of these inefficiency losses, then present a design space of FiCCO schedules, and finally overlay the schedules with concomitant inefficiency signatures. Doing so helps us design heuristics that frameworks and runtimes can harness to select bespoke FiCCO schedules based on the nature of underlying ML operations. Finally, to further minimize contention inefficiencies inherent with operation overlap, we offload communication to GPU DMA engines. We evaluate several scenarios from realistic ML deployments and demonstrate that our proposed bespoke schedules deliver up to 1.6x speedup and our heuristics provide accurate guidance in 81% of unseen scenarios.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jun Seo, Hyeokjun Choe, Seohui Bae, Soyeon Park, Wonbin Ahn et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Time series forecasting is a critical task for artificial intelligence with numerous real-world applications. Traditional approaches primarily rely on historical time series data to predict the future values. However, in practical scenarios, this is often insufficient for accurate predictions due to the limited information available. To address this challenge, multimodal time series forecasting methods which incorporate additional data modalities, mainly text data, alongside time series data have been explored. In this work, we introduce the Adaptive Information Routing (AIR) framework, a novel approach for multimodal time series forecasting. Unlike existing methods that treat text data on par with time series data as interchangeable auxiliary features for forecasting, AIR leverages text information to dynamically guide the time series model by controlling how and to what extent multivariate time series information should be combined. We also present a text-refinement pipeline that employs a large language model to convert raw text data into a form suitable for multimodal forecasting, and we introduce a benchmark that facilitates multimodal forecasting experiments based on this pipeline. Experiment results with the real world market data such as crude oil price and exchange rates demonstrate that AIR effectively modulates the behavior of the time series model using textual inputs, significantly enhancing forecasting accuracy in various time series forecasting tasks.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ragja Palakkadavath, Hung Le, Thanh Nguyen-Tang, Svetha Venkatesh, Sunil Gupta
ๆฅ็ๆ่ฆ
Federated domain generalization (FedDG) addresses distribution shifts among clients in a federated learning framework. FedDG methods aggregate the parameters of locally trained client models to form a global model that generalizes to unseen clients while preserving data privacy. While improving the generalization capability of the global model, many existing approaches in FedDG jeopardize privacy by sharing statistics of client data between themselves. Our solution addresses this problem by contributing new ways to perform local client training and model aggregation. To improve local client training, we enforce (domain) invariance across local models with the help of a novel technique, \textbf{latent space inversion}, which enables better client privacy. When clients are not \emph{i.i.d}, aggregating their local models may discard certain local adaptations. To overcome this, we propose an \textbf{important weight} aggregation strategy to prioritize parameters that significantly influence predictions of local models during aggregation. Our extensive experiments show that our approach achieves superior results over state-of-the-art methods with less communication overhead.
๐
2025-12-11
๐ ArXiv Preprint (astro-ph.CO)
๐ฅ Natalรญ S. M. de Santi, Francisco Villaescusa-Navarro, Pablo Araya-Araya, Gabriella De Lucia, Fabio Fontanot et al.
Machine Learning
ๆฅ็ๆ่ฆ
Semi-analytic models are a widely used approach to simulate galaxy properties within a cosmological framework, relying on simplified yet physically motivated prescriptions. They have also proven to be an efficient alternative for generating accurate galaxy catalogs, offering a faster and less computationally expensive option compared to full hydrodynamical simulations. In this paper, we demonstrate that using only galaxy $3$D positions and radial velocities, we can train a graph neural network coupled to a moment neural network to obtain a robust machine learning based model capable of estimating the matter density parameters, $ฮฉ_{\rm m}$, with a precision of approximately 10%. The network is trained on ($25 h^{-1}$Mpc)$^3$ volumes of galaxy catalogs from L-Galaxies and can successfully extrapolate its predictions to other semi-analytic models (GAEA, SC-SAM, and Shark) and, more remarkably, to hydrodynamical simulations (Astrid, SIMBA, IllustrisTNG, and SWIFT-EAGLE). Our results show that the network is robust to variations in astrophysical and subgrid physics, cosmological and astrophysical parameters, and the different halo-profile treatments used across simulations. This suggests that the physical relationships encoded in the phase-space of semi-analytic models are largely independent of their specific physical prescriptions, reinforcing their potential as tools for the generation of realistic mock catalogs for cosmological parameter inference.
๐
2025-12-11
๐ ArXiv Preprint (math.ST)
๐ฅ Mark Sellke, Steven Yin
Natural Language Processing
ๆฅ็ๆ่ฆ
The property of learning-curve monotonicity, highlighted in a recent series of work by Loog, Mey and Viering, describes algorithms which only improve in average performance given more data, for any underlying data distribution within a given family. We establish the first nontrivial monotonicity guarantees for the maximum likelihood estimator in a variety of well-specified parametric settings. For sequential prediction with log loss, we show monotonicity (in fact complete monotonicity) of the forward KL divergence for Gaussian vectors with unknown covariance and either known or unknown mean, as well as for Gamma variables with unknown scale parameter. The Gaussian setting was explicitly highlighted as open in the aforementioned works, even in dimension 1. Finally we observe that for reverse KL divergence, a folklore trick yields monotonicity for very general exponential families.
All results in this paper were derived by variants of GPT-5.2 Pro. Humans did not provide any proof strategies or intermediate arguments, but only prompted the model to continue developing additional results, and verified and transcribed its proofs.
๐
2025-12-11
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Lingfeng Liu, Yixin Song, Dazhong Shen, Bing Yin, Hao Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Popularity bias fundamentally undermines the personalization capabilities of collaborative filtering (CF) models, causing them to disproportionately recommend popular items while neglecting users' genuine preferences for niche content. While existing approaches treat this as an external confounding factor, we reveal that popularity bias is an intrinsic geometric artifact of Bayesian Pairwise Ranking (BPR) optimization in CF models. Through rigorous mathematical analysis, we prove that BPR systematically organizes item embeddings along a dominant "popularity direction" where embedding magnitudes directly correlate with interaction frequency. This geometric distortion forces user embeddings to simultaneously handle two conflicting tasks-expressing genuine preference and calibrating against global popularity-trapping them in suboptimal configurations that favor popular items regardless of individual tastes. We propose Directional Decomposition and Correction (DDC), a universally applicable framework that surgically corrects this embedding geometry through asymmetric directional updates. DDC guides positive interactions along personalized preference directions while steering negative interactions away from the global popularity direction, disentangling preference from popularity at the geometric source. Extensive experiments across multiple BPR-based architectures demonstrate that DDC significantly outperforms state-of-the-art debiasing methods, reducing training loss to less than 5% of heavily-tuned baselines while achieving superior recommendation quality and fairness. Code is available in https://github.com/LingFeng-Liu-AI/DDC.
๐
2025-12-11
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Ziwei Liu, Yejing Wang, Qidong Liu, Zijian Zhang, Chong Chen et al.
ๆฅ็ๆ่ฆ
Conventional Sequential Recommender Systems (SRS) typically assign unique Hash IDs (HID) to construct item embeddings. These HID embeddings effectively learn collaborative information from historical user-item interactions, making them vulnerable to situations where most items are rarely consumed (the long-tail problem). Recent methods that incorporate auxiliary information often suffer from noisy collaborative sharing caused by co-occurrence signals or semantic homogeneity caused by flat dense embeddings. Semantic IDs (SIDs), with their capability of code sharing and multi-granular semantic modeling, provide a promising alternative. However, the collaborative overwhelming phenomenon hinders the further development of SID-based methods. The quantization mechanisms commonly compromise the uniqueness of identifiers required for modeling head items, creating a performance seesaw between head and tail items. To address this dilemma, we propose \textbf{\name}, a novel framework that harmonizes the SID and HID. Specifically, we devise a dual-branch modeling architecture that enables the model to capture both the multi-granular semantics within SID while preserving the unique collaborative identity of HID. Furthermore, we introduce a dual-level alignment strategy that bridges the two representations, facilitating knowledge transfer and supporting robust preference modeling. Extensive experiments on three real-world datasets show that \name~ effectively balances recommendation quality for both head and tail items while surpassing the existing baselines. The implementation code can be found online\footnote{https://github.com/ziwliu8/H2Rec}.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv et al.
Robotics
ๆฅ็ๆ่ฆ
Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Zamirddine Mari, Jรฉrรดme Pasquet, Julien Seinturier
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Autonomous drone navigation in confined tubular environments remains a major challenge due to the constraining geometry of the conduits, the proximity of the walls, and the perceptual limitations inherent to such scenarios. We propose a reinforcement learning approach enabling a drone to navigate unknown three-dimensional tubes without any prior knowledge of their geometry, relying solely on local observations from LiDAR and a conditional visual detection of the tube center. In contrast, the Pure Pursuit algorithm, used as a deterministic baseline, benefits from explicit access to the centerline, creating an information asymmetry designed to assess the ability of RL to compensate for the absence of a geometric model. The agent is trained through a progressive Curriculum Learning strategy that gradually exposes it to increasingly curved geometries, where the tube center frequently disappears from the visual field. A turning-negotiation mechanism, based on the combination of direct visibility, directional memory, and LiDAR symmetry cues, proves essential for ensuring stable navigation under such partial observability conditions. Experiments show that the PPO policy acquires robust and generalizable behavior, consistently outperforming the deterministic controller despite its limited access to geometric information. Validation in a high-fidelity 3D environment further confirms the transferability of the learned behavior to a continuous physical dynamics.
The proposed approach thus provides a complete framework for autonomous navigation in unknown tubular environments and opens perspectives for industrial, underground, or medical applications where progressing through narrow and weakly perceptive conduits represents a central challenge.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qiyang Li, Seohong Park, Sergey Levine
ๆฅ็ๆ่ฆ
Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a self-bootstrapping mechanism is prone to bootstrapping bias, where the errors in the value targets accumulate across steps and result in biased value estimates. Recent work has proposed to use chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, speeding up value backup. However, extracting policies from chunked critics is challenging: policies must output the entire action chunk open-loop, which can be sub-optimal for environments that require policy reactivity and also challenging to model especially when the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action chunking policies for long action chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned tasks and show that it reliably outperforms prior methods. Code: github.com/ColinQiyangLi/dqc.
๐
2025-12-11
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zamirddine Mari, Mohamad Motasem Nawaf, Pierre Drap
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Autonomous navigation in underwater environments remains a major challenge due to the absence of GPS, degraded visibility, and the presence of submerged obstacles. This article investigates these issues through the case of the BlueROV2, an open platform widely used for scientific experimentation. We propose a deep reinforcement learning approach based on the Proximal Policy Optimization (PPO) algorithm, using an observation space that combines target-oriented navigation information, a virtual occupancy grid, and ray-casting along the boundaries of the operational area. The learned policy is compared against a reference deterministic kinematic planner, the Dynamic Window Approach (DWA), commonly employed as a robust baseline for obstacle avoidance. The evaluation is conducted in a realistic simulation environment and complemented by validation on a physical BlueROV2 supervised by a 3D digital twin of the test site, helping to reduce risks associated with real-world experimentation. The results show that the PPO policy consistently outperforms DWA in highly cluttered environments, notably thanks to better local adaptation and reduced collisions. Finally, the experiments demonstrate the transferability of the learned behavior from simulation to the real world, confirming the relevance of deep RL for autonomous navigation in underwater robotics.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez et al.
Machine LearningNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.
๐
2025-12-11
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mumuksh Tayal, Manan Tayal, Aditya Singh, Shishir Kolathaya, Ravi Prakash
Robotics
ๆฅ็ๆ่ฆ
Ensuring safety in autonomous systems requires controllers that satisfy hard, state-wise constraints without relying on online interaction. While existing Safe Offline RL methods typically enforce soft expected-cost constraints, they do not guarantee forward invariance. Conversely, Control Barrier Functions (CBFs) provide rigorous safety guarantees but usually depend on expert-designed barrier functions or full knowledge of the system dynamics. We introduce Value-Guided Offline Control Barrier Functions (V-OCBF), a framework that learns a neural CBF entirely from offline demonstrations. Unlike prior approaches, V-OCBF does not assume access to the dynamics model; instead, it derives a recursive finite-difference barrier update, enabling model-free learning of a barrier that propagates safety information over time. Moreover, V-OCBF incorporates an expectile-based objective that avoids querying the barrier on out-of-distribution actions and restricts updates to the dataset-supported action set. The learned barrier is then used with a Quadratic Program (QP) formulation to synthesize real-time safe control. Across multiple case studies, V-OCBF yields substantially fewer safety violations than baseline methods while maintaining strong task performance, highlighting its scalability for offline synthesis of safety-critical controllers without online interaction or hand-engineered barriers.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Samaksh Ujjawal, Shivansh Pratap Singh, Naveen Sudheer Nair, Rishabh Dev Yadav, Wei Pan et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Aerial manipulators undergo rapid, configuration-dependent changes in inertial coupling forces and aerodynamic forces, making accurate dynamics modeling a core challenge for reliable control. Analytical models lose fidelity under these nonlinear and nonstationary effects, while standard data-driven methods such as deep neural networks and Gaussian processes cannot represent the diverse residual behaviors that arise across different operating conditions. We propose a regime-conditioned diffusion framework that models the full distribution of residual forces using a conditional diffusion process and a lightweight temporal encoder. The encoder extracts a compact summary of recent motion and configuration, enabling consistent residual predictions even through abrupt transitions or unseen payloads. When combined with an adaptive controller, the framework enables dynamics uncertainty compensation and yields markedly improved tracking accuracy in real-world tests.
๐
2025-12-11
๐ ArXiv Preprint (eess.SY)
๐ฅ Lukas Vogel, Andrea Carron, Eleftherios E. Vlahakis, Dimos V. Dimarogonas
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This work presents a stochastic model predictive control (MPC) framework for linear systems subject to joint-in-time chance constraints under unknown disturbance distributions. Unlike existing stochastic MPC formulations that rely on parametric or Gaussian assumptions or require expensive offline computations, the proposed method leverages conformal prediction (CP) as a streamlined tool to construct finite-sample confidence regions for the system's stochastic error trajectories with minimal computational effort. These regions enable the relaxation of probabilistic constraints while providing formal guarantees. By employing an indirect feedback mechanism and a probabilistic set-based formulation, we prove recursive feasibility of the relaxed optimization problem and establish chance constraint satisfaction in closed-loop. Furthermore, we extend the approach to the more general output feedback setting with unknown measurement noise distributions. Given available noise samples, we establish satisfaction of the joint chance constraints and recursive feasibility via output measurements alone. Numerical examples demonstrate the effectiveness and advantages of the proposed method compared to existing approaches.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Mohamed Elobaid, Shinkyu Park, Eric Feron
ๆฅ็ๆ่ฆ
This work deals with the problem of stabilizing a multi-agent rigid formation on a general class of planar curves. Namely, we seek to stabilize an equilateral polygonal formation on closed planar differentiable curves after a path sweep. The task of finding an inscribed regular polygon centered at the point of interest is solved via a randomized multi-start Newton-Like algorithm for which one is able to ascertain the existence of a minimizer. Then we design a continuous feedback law that guarantees convergence to, and sufficient sweeping of the curve, followed by convergence to the desired formation vertices while ensuring inter-agent avoidance. The proposed approach is validated through numerical simulations for different classes of curves and different rigid formations. Code: https://github.com/mebbaid/paper-elobaid-ifacwc-2026
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Jianbo Wang, Galina Sidorenko, Johan Thunberg
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Connected and automated vehicles (CAVs) have the potential to enhance driving safety, for example by enabling safe vehicle following and more efficient traffic scheduling. For such future deployments, safety requirements should be addressed, where the primary such are avoidance of vehicle collisions and substantial mitigating of harm when collisions are unavoidable. However, conservative worst-case-based control strategies come at the price of reduced flexibility and may compromise overall performance. In light of this, we investigate how Deep Reinforcement Learning (DRL) can be leveraged to improve safety in multi-vehicle-following scenarios involving emergency braking. Specifically, we investigate how DRL with vehicle-to-vehicle communication can be used to ethically select an emergency breaking profile in scenarios where overall, or collective, three-vehicle harm reduction or collision avoidance shall be obtained instead of single-vehicle such. As an algorithm, we provide a hybrid approach that combines DRL with a previously published method based on analytical expressions for selecting optimal constant deceleration. By combining DRL with the previous method, the proposed hybrid approach increases the reliability compared to standalone DRL, while achieving superior performance in terms of overall harm reduction and collision avoidance.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao et al.
Robotics
ๆฅ็ๆ่ฆ
Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Lihuang Chen, Xiangyu Luo, Jun Meng
ๆฅ็ๆ่ฆ
We propose LEO-RobotAgent, a general-purpose language-driven intelligent agent framework for robots. Under this framework, LLMs can operate different types of robots to complete unpredictable complex tasks across various scenarios. This framework features strong generalization, robustness, and efficiency. The application-level system built around it can fully enhance bidirectional human-robot intent understanding and lower the threshold for human-robot interaction. Regarding robot task planning, the vast majority of existing studies focus on the application of large models in single-task scenarios and for single robot types. These algorithms often have complex structures and lack generalizability. Thus, the proposed LEO-RobotAgent framework is designed with a streamlined structure as much as possible, enabling large models to independently think, plan, and act within this clear framework. We provide a modular and easily registrable toolset, allowing large models to flexibly call various tools to meet different requirements. Meanwhile, the framework incorporates a human-robot interaction mechanism, enabling the algorithm to collaborate with humans like a partner. Experiments have verified that this framework can be easily adapted to mainstream robot platforms including unmanned aerial vehicles (UAVs), robotic arms, and wheeled robot, and efficiently execute a variety of carefully designed tasks with different complexity levels. Our code is available at https://github.com/LegendLeoChen/LEO-RobotAgent.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Maximillian Fainkich, Kiril Solovey, Anna Clarke
Robotics
ๆฅ็ๆ่ฆ
Most skydiving accidents occur during the parafoil-piloting and landing stages and result from human lapses in judgment while piloting the parafoil. Training of novice pilots is protracted due to the lack of functional and easily accessible training simulators. Moreover, work on parafoil trajectory planning suitable for aiding human training remains limited. To bridge this gap, we study the problem of computing safe trajectories for human-piloted parafoil flight and examine how such trajectories fare against human-generated solutions. For the algorithmic part, we adapt the sampling-based motion planner Stable Sparse RRT (SST) by Li et al., to cope with the problem constraints while minimizing the bank angle (control effort) as a proxy for safety. We then compare the computer-generated solutions with data from human-generated parafoil flight, where the algorithm offers a relative cost improvement of 20\%-80\% over the performance of the human pilot. We observe that human pilots tend to, first, close the horizontal distance to the landing area, and then address the vertical gap by spiraling down to the suitable altitude for starting a landing maneuver. The algorithm considered here makes smoother and more gradual descents, arriving at the landing area at the precise altitude necessary for the final approach while maintaining safety constraints. Overall, the study demonstrates the potential of computer-generated guidelines, rather than traditional rules of thumb, which can be integrated into future simulators to train pilots for safer and more cost-effective flights.
๐
2025-12-11
๐ IROS
๐ฅ Si Wang, Zhehan Li, Jiadong Lu, Rong Xiong, Yanjun Cao et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Ultra-wideband (UWB)-vision fusion localization has achieved extensive applications in the domain of multi-agent relative localization. The challenging matching problem between robots and visual detection renders existing methods highly dependent on identity-encoded hardware or delicate tuning algorithms. Overconfident yet erroneous matches may bring about irreversible damage to the localization system. To address this issue, we introduce Mr. Virgil, an end-to-end learning multi-robot visual-range relative localization framework, consisting of a graph neural network for data association between UWB rangings and visual detections, and a differentiable pose graph optimization (PGO) back-end. The graph-based front-end supplies robust matching results, accurate initial position predictions, and credible uncertainty estimates, which are subsequently integrated into the PGO back-end to elevate the accuracy of the final pose estimation. Additionally, a decentralized system is implemented for real-world applications. Experiments spanning varying robot numbers, simulation and real-world, occlusion and non-occlusion conditions showcase the stability and exactitude under various scenes compared to conventional methods. Our code is available at: https://github.com/HiOnes/Mr-Virgil.
๐
2025-12-11
๐ ICRA
๐ฅ Si Wang, Bingqi Shen, Fei Wang, Yanjun Cao, Rong Xiong et al.
ๆฅ็ๆ่ฆ
Ultra-wideband (UWB) has shown promising potential in GPS-denied localization thanks to its lightweight and drift-free characteristics, while the accuracy is limited in real scenarios due to its sensitivity to sensor arrangement and non-Gaussian pattern induced by multi-path or multi-signal interference, which commonly occurs in many typical applications like long tunnels. We introduce a novel neural fusion framework for ranging inertial odometry which involves a graph attention UWB network and a recurrent neural inertial network. Our graph net learns scene-relevant ranging patterns and adapts to any number of anchors or tags, realizing accurate positioning without calibration. Additionally, the integration of least squares and the incorporation of nominal frame enhance overall performance and scalability. The effectiveness and robustness of our methods are validated through extensive experiments on both public and self-collected datasets, spanning indoor, outdoor, and tunnel environments. The results demonstrate the superiority of our proposed IR-ULSG in handling challenging conditions, including scenarios outside the convex envelope and cases where only a single anchor is available.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Gaozhao Wang, Xing Liu, Zhenduo Ye, Zhengxiong Liu, Panfeng Huang
Robotics
ๆฅ็ๆ่ฆ
Contact-rich manipulation is difficult for robots to execute and requires accurate perception of the environment. In some scenarios, vision is occluded. The robot can then no longer obtain real-time scene state information through visual feedback. This is called ``blind manipulation". In this manuscript, a novel physically-driven contact cognition method, called ``Contact SLAM", is proposed. It estimates the state of the environment and achieves manipulation using only tactile sensing and prior knowledge of the scene. To maximize exploration efficiency, this manuscript also designs an active exploration policy. The policy gradually reduces uncertainties in the manipulation scene. The experimental results demonstrated the effectiveness and accuracy of the proposed method in several contact-rich tasks, including the difficult and delicate socket assembly task and block-pushing task.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Jiaqiang Zhang, Xianjia Yu, Sier Ha, Paola Torrico Moron, Sahar Salimpour et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Accurate and continuous pedestrian positioning across outdoor-indoor environments remains challenging because GNSS, UWB, and inertial PDR are complementary yet individually fragile under signal blockage, multipath, and drift. This paper presents a unified GNSS/UWB/IMU fusion framework for seamless pedestrian localization and provides a controlled comparison of three probabilistic back-ends: an error-state extended Kalman filter, sliding-window factor graph optimization, and a particle filter. The system uses chest-mounted IMU-based PDR as the motion backbone and integrates absolute updates from GNSS outdoors and UWB indoors. To enhance transition robustness and mitigate urban GNSS degradation, we introduce a lightweight map-based feasibility constraint derived from OpenStreetMap building footprints, treating most building interiors as non-navigable while allowing motion inside a designated UWB-instrumented building. The framework is implemented in ROS 2 and runs in real time on a wearable platform, with visualization in Foxglove. We evaluate three scenarios: indoor (UWB+PDR), outdoor (GNSS+PDR), and seamless outdoor-indoor (GNSS+UWB+PDR). Results show that the ESKF provides the most consistent overall performance in our implementation.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Timur Ishuov, Michele Folgheraiter, Madi Nurmanov, Goncalo Gordo, Richรกrd Farkas et al.
ๆฅ็ๆ่ฆ
In our work we not explicitly hint that it is a misconception to think that humans learn fast. Learning process takes time. Babies start learning to move in the restricted liquid area called placenta. Children often are limited by underdeveloped body. Even adults are not allowed to participate in complex competitions right away. However, with robots, when learning from scratch, we often don't have the privilege of waiting for dozen millions of steps. "Swaddling" regularization is responsible for restraining an agent in rapid but unstable development penalizing action strength in a specific way not affecting actions directly. The Symphony, Transitional-policy Deterministic Actor and Critic algorithm, is a concise combination of different ideas for possibility of training humanoid robots from scratch with Sample Efficiency, Sample Proximity and Safety of Actions in mind. It is no secret that continuous increase in Gaussian noise without appropriate smoothing is harmful for motors and gearboxes. Compared to Stochastic algorithms, we set a limited parametric noise and promote a reduced strength of actions, safely increasing entropy, since the actions are kind of immersed in weaker noise. When actions require more extreme values, actions rise above the weak noise. Training becomes empirically much safer for both the environment around and the robot's mechanisms. We use Fading Replay Buffer: using a fixed formula containing the hyperbolic tangent, we adjust the batch sampling probability: the memory contains a recent memory and a long-term memory trail. Fading Replay Buffer allows us to use Temporal Advantage when we improve the current Critic Network prediction compared to the exponential moving average. Temporal Advantage allows us to update Actor and Critic in one pass, as well as combine Actor and Critic in one Object and implement their Losses in one line.
๐
2025-12-11
๐ WWW
๐ฅ Haowen Yu, Na Fan, Xing Liu, Ximin Lyu
Robotics
ๆฅ็ๆ่ฆ
Accurate real-time wind vector estimation is essential for enhancing the safety, navigation accuracy, and energy efficiency of unmanned aerial vehicles (UAVs). Traditional approaches rely on external sensors or simplify vehicle dynamics, which limits their applicability during agile flight or in resource-constrained platforms. This paper proposes a real-time wind estimation method based solely on onboard sensors. The approach first estimates external aerodynamic forces using a disturbance observer (DOB), and then maps these forces to wind vectors using a thin-plate spline (TPS) model. A custom-designed wind barrel mounted on the UAV enhances aerodynamic sensitivity, further improving estimation accuracy. The system is validated through comprehensive experiments in wind tunnels, indoor and outdoor flights. Experimental results demonstrate that the proposed method achieves consistently high-accuracy wind estimation across controlled and real-world conditions, with speed RMSEs as low as \SI{0.06}{m/s} in wind tunnel tests, \SI{0.22}{m/s} during outdoor hover, and below \SI{0.38}{m/s} in indoor and outdoor dynamic flights, and direction RMSEs under \ang{7.3} across all scenarios, outperforming existing baselines. Moreover, the method provides vertical wind estimates -- unavailable in baselines -- with RMSEs below \SI{0.17}{m/s} even during fast indoor translations.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Weifan Guan, Huasen Xi, Chenxiao Zhang, Aosheng Li, Qinghao Hu et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Current embodied AI systems face severe engineering impediments, primarily characterized by poor cross-scenario adaptability, rigid inter-module coupling, and fragmented inference acceleration. To overcome these limitations, we propose RoboNeuron, a universal deployment framework for embodied intelligence. RoboNeuron is the first framework to deeply integrate the cognitive capabilities of Large Language Models (LLMs) and Vision-Language-Action (VLA) models with the real-time execution backbone of the Robot Operating System (ROS). We utilize the Model Context Protocol (MCP) as a semantic bridge, enabling the LLM to dynamically orchestrate underlying robotic tools. The framework establishes a highly modular architecture that strictly decouples sensing, reasoning, and control by leveraging ROS's unified communication interfaces. Crucially, we introduce an automated tool to translate ROS messages into callable MCP functions, significantly streamlining development. RoboNeuron significantly enhances cross-scenario adaptability and component flexibility, while establishing a systematic platform for horizontal performance benchmarking, laying a robust foundation for scalable real-world embodied applications.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Liuyi Wang, Zongtao He, Jinlong Li, Xiaoyan Qi, Mengxian Hu et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent vision-language large models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models in VLN tasks. To address this, we propose CLASH (Collaborative Large-Small Hierarchy), a VLN-CE framework that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). RSMP adopts a causal-learning-based dual-branch architecture to enhance generalization, while RLMR leverages panoramic visual prompting with chain-of-thought reasoning to support interpretable spatial understanding and navigation. We further introduce an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from both models. For obstacle avoidance, in simulation, we replace the rule-based controller with a fully learnable point-goal policy, and in real-world deployment, we design a LiDAR-based clustering module for generating navigable waypoints and pair it with an online SLAM-based local controller. CLASH achieves state-of-the-art (SoTA) results (ranking 1-st) on the VLN-CE leaderboard, significantly improving SR and SPL on the test-unseen set over the previous SoTA methods. Real-world experiments demonstrate CLASH's strong robustness, validating its effectiveness in both simulation and deployment scenarios.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Quan Yuan, Zhenting Du, Daqian Cao, Weibang Bai
Robotics
ๆฅ็ๆ่ฆ
Tendon-driven under-actuated robotic fingers provide advantages for dexterous manipulation through reduced actuator requirements and simplified mechanical design. However, achieving both high load capacity and adaptive compliance in a compact form remains challenging. This paper presents an under-actuated tendon-driven robotic finger (UTRF) featuring a synchronous tendon routing that mechanically couples all joints with fixed angular velocity ratios, enabling the entire finger to be actuated by a single actuator. This approach significantly reduces the number of actuators required in multi-finger hands, resulting in a lighter and more compact structure without sacrificing stiffness or compliance. The kinematic and static models of the finger are derived, incorporating tendon elasticity to predict structural stiffness. A single-finger prototype was fabricated and tested under static loading, showing an average deflection prediction error of 1.0 mm (0.322% of total finger length) and a measured stiffness of 1.2x10^3 N/m under a 3 kg tip load. Integration into a five-finger robotic hand (UTRF-RoboHand) demonstrates effective object manipulation across diverse scenarios, confirming that the proposed routing achieves predictable stiffness and reliable grasping performance with a minimal actuator count.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Muhammad Usama, Muhammad Ibrahim Khan, Ahmad Hasan, Muhammad Shaaf Nadeem, Khawaja Fahad Iqbal et al.
Robotics
ๆฅ็ๆ่ฆ
Mobile robots are increasingly utilized in agriculture to automate labor-intensive tasks such as weeding, sowing, harvesting and soil analysis. Recently, agricultural robots have been developed to detect and remove weeds using mechanical tools or precise herbicide sprays. Mechanical weeding is inefficient over large fields, and herbicides harm the soil ecosystem. Laser weeding with mobile robots has emerged as a sustainable alternative in precision farming. In this paper, we present an autonomous weeding robot that uses controlled exposure to a low energy laser beam for weed removal. The proposed robot is six-wheeled with a novel double four-bar suspension for higher stability. The laser is guided towards the detected weeds by a three-dimensional linear actuation mechanism. Field tests have demonstrated the robot's capability to navigate agricultural terrains effectively by overcoming obstacles up to 15 cm in height. At an optimal speed of 42.5 cm/s, the robot achieves a weed detection rate of 86.2\% and operating time of 87 seconds per meter. The laser actuation mechanism maintains a minimal mean positional error of 1.54 mm, combined with a high hit rate of 97\%, ensuring effective and accurate weed removal. This combination of speed, accuracy, and efficiency highlights the robot's potential for significantly enhancing precision farming practices.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Luรญs Marques, Maani Ghaffari, Dmitry Berenson
Robotics
ๆฅ็ๆ่ฆ
We propose Conformal Lie-group Action Prediction Sets (CLAPS), a symmetry-aware conformal prediction-based algorithm that constructs, for a given action, a set guaranteed to contain the resulting system configuration at a user-defined probability. Our assurance holds under both aleatoric and epistemic uncertainty, non-asymptotically, and does not require strong assumptions about the true system dynamics, the uncertainty sources, or the quality of the approximate dynamics model. Typically, uncertainty quantification is tackled by making strong assumptions about the error distribution or magnitude, or by relying on uncalibrated uncertainty estimates - i.e., with no link to frequentist probabilities - which are insufficient for safe control. Recently, conformal prediction has emerged as a statistical framework capable of providing distribution-free probabilistic guarantees on test-time prediction accuracy. While current conformal methods treat robots as Euclidean points, many systems have non-Euclidean configurations, e.g., some mobile robots have SE(2). In this work, we rigorously analyze configuration errors using Lie groups, extending previous Euclidean Space theoretical guarantees to SE(2). Our experiments on a simulated JetBot, and on a real MBot, suggest that by considering the configuration space's structure, our symmetry-informed nonconformity score leads to more volume-efficient prediction regions which represent the underlying uncertainty better than existing approaches.
๐
2025-12-11
๐ ArXiv Preprint (Robotics)
๐ฅ Hui Li, Akhlak Uz Zaman, Fujian Yan, Hongsheng He
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a reinforcement learning framework that incorporates a Contextual Reward Machine for task-oriented grasping. The Contextual Reward Machine reduces task complexity by decomposing grasping tasks into manageable sub-tasks. Each sub-task is associated with a stage-specific context, including a reward function, an action space, and a state abstraction function. This contextual information enables efficient intra-stage guidance and improves learning efficiency by reducing the state-action space and guiding exploration within clearly defined boundaries. In addition, transition rewards are introduced to encourage or penalize transitions between stages which guides the model toward desirable stage sequences and further accelerates convergence. When integrated with the Proximal Policy Optimization algorithm, the proposed method achieved a 95% success rate across 1,000 simulated grasping tasks encompassing diverse objects, affordances, and grasp topologies. It outperformed the state-of-the-art methods in both learning speed and success rate. The approach was transferred to a real robot, where it achieved a success rate of 83.3% in 60 grasping tasks over six affordances. These experimental results demonstrate superior accuracy, data efficiency, and learning efficiency. They underscore the model's potential to advance task-oriented grasping in both simulated and real-world settings.
๐
2025-12-11
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You et al.
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.
๐
2025-12-11
๐ ArXiv Preprint (eess.IV)
๐ฅ Dexter Burns, Sanjeev Koppal
Robotics
ๆฅ็ๆ่ฆ
Monitoring plant health increasingly relies on leaf-mounted sensors that provide real-time physiological data, yet efficiently locating and sampling these sensors in complex agricultural environments remains a major challenge. We present an integrated, adaptive, and scalable system that autonomously detects and interrogates plant sensors using a coordinated suite of low-cost optical components including a LiDAR, liquid lens, monochrome camera, filter wheel, and Fast Steering Mirror (FSM). The system first uses LiDAR to identify the distinct reflective signatures of sensors within the field, then dynamically redirects the camera s field of view via the FSM to target each sensor for hyperspectral imaging. The liquid lens continuously adjusts focus to maintain image sharpness across varying depths, enabling precise spectral measurements. We validated the system in controlled indoor experiments, demonstrating accurate detection and tracking of reflective plant sensors and successful acquisition of their spectral data. To our knowledge, no other system currently integrates these sensing and optical modalities for agricultural monitoring. This work establishes a foundation for adaptive, low-cost, and automated plant sensor interrogation, representing a significant step toward scalable, real-time plant health monitoring in precision agriculture.
๐
2025-12-10
๐ ArXiv Preprint (cs.DL)
๐ฅ Matt Miller, Dan Sinykin, Melanie Walsh
ๆฅ็ๆ่ฆ
We present BookReconciler, an open-source tool for enhancing and clustering book data. BookReconciler allows users to take spreadsheets with minimal metadata, such as book title and author, and automatically 1) add authoritative, persistent identifiers like ISBNs 2) and cluster related Expressions and Manifestations of the same Work, e.g., different translations or editions. This enhancement makes it easier to combine related collections and analyze books at scale. The tool is currently designed as an extension for OpenRefine -- a popular software application -- and connects to major bibliographic services including the Library of Congress, VIAF, OCLC, HathiTrust, Google Books, and Wikidata. Our approach prioritizes human judgment. Through an interactive interface, users can manually evaluate matches and define the contours of a Work (e.g., to include translations or not). We evaluate reconciliation performance on datasets of U.S. prize-winning books and contemporary world fiction. BookReconciler achieves near-perfect accuracy for U.S. works but lower performance for global texts, reflecting structural weaknesses in bibliographic infrastructures for non-English and global literature. Overall, BookReconciler supports the reuse of bibliographic data across domains and applications, contributing to ongoing work in digital libraries and digital humanities.
๐
2025-12-10
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Najmul Hasan, Prashanth BusiReddyGari, Haitao Zhao, Yihao Ren, Jinsheng Xu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Email phishing is one of the most prevalent and globally consequential vectors of cyber intrusion. As systems increasingly deploy Large Language Models (LLMs) applications, these systems face evolving phishing email threats that exploit their fundamental architectures. Current LLMs require substantial hardening before deployment in email security systems, particularly against coordinated multi-vector attacks that exploit architectural vulnerabilities. This paper proposes LLMPEA, an LLM-based framework to detect phishing email attacks across multiple attack vectors, including prompt injection, text refinement, and multilingual attacks. We evaluate three frontier LLMs (e.g., GPT-4o, Claude Sonnet 4, and Grok-3) and comprehensive prompting design to assess their feasibility, robustness, and limitations against phishing email attacks. Our empirical analysis reveals that LLMs can detect the phishing email over 90% accuracy while we also highlight that LLM-based phishing email detection systems could be exploited by adversarial attack, prompt injection, and multilingual attacks. Our findings provide critical insights for LLM-based phishing detection in real-world settings where attackers exploit multiple vulnerabilities in combination.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud, Philip Torr, Adel Bibi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Moonsoo Park, Jeongseok Yun, Bohyung Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalized review response generation presents a significant challenge in domains where user information is limited, such as food delivery platforms. While large language models (LLMs) offer powerful text generation capabilities, they often produce generic responses when lacking contextual user data, reducing engagement and effectiveness. In this work, we propose a two-stage prompting framework that infers both explicit (e.g., user-stated preferences) and implicit (e.g., demographic or stylistic cues) personas directly from short review texts. These inferred persona attributes are then incorporated into the response generation prompt to produce user-tailored replies. To encourage diverse yet faithful generations, we adjust decoding temperature during inference. We evaluate our method using a real-world dataset collected from a Korean food delivery app, and assess its impact on precision, diversity, and semantic consistency. Our findings highlight the effectiveness of persona-augmented prompting in enhancing the relevance and personalization of automated responses without requiring model fine-tuning.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhongjie Jiang
Natural Language Processing
ๆฅ็ๆ่ฆ
Central to long-form text generation in vertical domains is the "impossible trinity" confronting current large language models (LLMs): the simultaneous achievement of low hallucination, deep logical coherence, and personalized expression. This study establishes that this bottleneck arises from existing generative paradigms succumbing to the Statistical Smoothing Trap, a phenomenon that overlooks the high-entropy information acquisition and structured cognitive processes integral to expert-level writing. To address this limitation, we propose the DeepNews Framework, an agentic workflow that explicitly models the implicit cognitive processes of seasoned financial journalists. The framework integrates three core modules: first, a dual-granularity retrieval mechanism grounded in information foraging theory, which enforces a 10:1 saturated information input ratio to mitigate hallucinatory outputs; second, schema-guided strategic planning, a process leveraging domain expert knowledge bases (narrative schemas) and Atomic Blocks to forge a robust logical skeleton; third, adversarial constraint prompting, a technique deploying tactics including Rhythm Break and Logic Fog to disrupt the probabilistic smoothness inherent in model-generated text. Experiments delineate a salient Knowledge Cliff in deep financial reporting: content truthfulness collapses when retrieved context falls below 15,000 characters, while a high-redundancy input exceeding 30,000 characters stabilizes the Hallucination-Free Rate (HFR) above 85%. In an ecological validity blind test conducted with a top-tier Chinese technology media outlet, the DeepNews system--built on a previous-generation model (DeepSeek-V3-0324)-achieved a 25% submission acceptance rate, significantly outperforming the 0% acceptance rate of zero-shot generation by a state-of-the-art (SOTA) model (GPT-5).
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yumou Wei, John Stamper, Paulo F. Carvalho
Natural Language Processing
ๆฅ็ๆ่ฆ
We explore the use of small language models (SLMs) for automatic question generation as a complement to the prevalent use of their large counterparts in learning analytics research. We present a novel question generation pipeline that leverages both the text generation and the probabilistic reasoning abilities of SLMs to generate high-quality questions. Adopting a "generate-then-validate" strategy, our pipeline first performs expansive generation to create an abundance of candidate questions and refine them through selective validation based on novel probabilistic reasoning. We conducted two evaluation studies, one with seven human experts and the other with a large language model (LLM), to assess the quality of the generated questions. Most judges (humans or LLMs) agreed that the generated questions had clear answers and generally aligned well with the intended learning objectives. Our findings suggest that an SLM can effectively generate high-quality questions when guided by a well-designed pipeline that leverages its strengths.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Luciano Floridi, Jessica Morley, Claudio Novelli, David Watson
Natural Language Processing
ๆฅ็ๆ่ฆ
This article looks at how reasoning works in current Large Language Models (LLMs) that function using the token-completion method. It examines their stochastic nature and their similarity to human abductive reasoning. The argument is that these LLMs create text based on learned patterns rather than performing actual abductive reasoning. When their output seems abductive, this is largely because they are trained on human-generated texts that include reasoning structures. Examples are used to show how LLMs can produce plausible ideas, mimic commonsense reasoning, and give explanatory answers without being grounded in truth, semantics, verification, or understanding, and without performing any real abductive reasoning. This dual nature, where the models have a stochastic base but appear abductive in use, has important consequences for how LLMs are evaluated and applied. They can assist with generating ideas and supporting human thinking, but their outputs must be critically assessed because they cannot identify truth or verify their explanations. The article concludes by addressing five objections to these points, noting some limitations in the analysis, and offering an overall evaluation.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Logan Robbins
Natural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive decoding in Large Language Models (LLMs) is inherently sequential, creating a latency bottleneck that scales linearly with output length. While ``Decomposition-and-Fill'' methods like Skeleton-of-Thought attempt to parallelize generation via external orchestration, they suffer from \textit{coherence drift} due to the lack of cross-stream communication. In this work, we introduce the \textbf{Parallel Decoder Transformer (PDT)}, a parameter-efficient architecture that embeds coordination primitives directly into the inference process of a frozen pre-trained model.
Instead of retraining the base model, PDT injects lightweight \textit{Speculative Note Conditioning (SNC)} adapters that allow parallel decoding streams to synchronize via a shared, dynamic latent space. We formulate coordination as a \textit{speculative consensus} problem, where sibling streams broadcast semantic ``notes'' to a global bus, gated by a learned verification head. We validate our approach on a 50,000-step curriculum using a frozen 20B-parameter backbone. Our results demonstrate that PDT achieves effective self-correction, reaching \textbf{77.8\% precision} in coverage prediction and recovering approximate serial semantics without modifying the trunk weights. This establishes PDT as a scalable, efficient alternative to full model fine-tuning for structured parallel generation.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Diffusion Denoising models demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing them. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show\_suggest\_tell.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sha Li, Ayush Sadekar, Nathan Self, Yiqi Su, Lars Andersland et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents, multi-modal content, and reconciling varied and inconsistent fine-grained information across multiple publications into standardized formats. These challenges are further compounded when the desired data schema or extraction ontology changes rapidly, making it difficult to re-architect or fine-tune existing systems. We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design streamlines on-demand data extraction while enabling extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms. We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently. Our findings provide practical insights into both the strengths and limitations of current LLM-based pipelines.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Simone Corbo
Natural Language Processing
ๆฅ็ๆ่ฆ
This chapter explores the application of Large Language Models in the legal domain, showcasing their potential to optimise and augment traditional legal tasks by analysing possible use cases, such as assisting in interpreting statutes, contracts, and case law, enhancing clarity in legal summarisation, contract negotiation, and information retrieval. There are several challenges that can arise from the application of such technologies, such as algorithmic monoculture, hallucinations, and compliance with existing regulations, including the EU's AI Act and recent U.S. initiatives, alongside the emerging approaches in China. Furthermore, two different benchmarks are presented.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kesheng Chen, Wenjian Luo, Zhenqian Zhu, Yamin Hu, Yiya Xi
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Constructing a Pareto set is pivotal for navigating the capability-efficiency trade-offs in Large Language Models (LLMs); however, existing merging techniques remain inadequate for this task. Coarse-grained, model-level methods yield only a sparse set of suboptimal solutions, while fine-grained, layer-wise approaches suffer from the "curse of dimensionality," rendering the search space computationally intractable. To resolve this dichotomy, we propose BAMBO (Bayesian Adaptive Multi-objective Block-wise Optimization), a novel framework that automatically constructs the LLM Pareto set. BAMBO renders the search tractable by introducing a Hybrid Optimal Block Partitioning strategy. Formulated as a 1D clustering problem, this strategy leverages a dynamic programming approach to optimally balance intra-block homogeneity and inter-block information distribution, thereby dramatically reducing dimensionality without sacrificing critical granularity. The entire process is automated within an evolutionary loop driven by the q-Expected Hypervolume Improvement (qEHVI) acquisition function. Experiments demonstrate that BAMBO discovers a superior and more comprehensive Pareto frontier than baselines, enabling agile model selection tailored to diverse operational constraints. Code is available at: https://github.com/xin8coder/BAMBO.
๐
2025-12-10
๐ ArXiv Preprint (cs.DL)
๐ฅ Matt Miller, Dan Sinykin, Melanie Walsh
ๆฅ็ๆ่ฆ
We present BookReconciler, an open-source tool for enhancing and clustering book data. BookReconciler allows users to take spreadsheets with minimal metadata, such as book title and author, and automatically 1) add authoritative, persistent identifiers like ISBNs 2) and cluster related Expressions and Manifestations of the same Work, e.g., different translations or editions. This enhancement makes it easier to combine related collections and analyze books at scale. The tool is currently designed as an extension for OpenRefine -- a popular software application -- and connects to major bibliographic services including the Library of Congress, VIAF, OCLC, HathiTrust, Google Books, and Wikidata. Our approach prioritizes human judgment. Through an interactive interface, users can manually evaluate matches and define the contours of a Work (e.g., to include translations or not). We evaluate reconciliation performance on datasets of U.S. prize-winning books and contemporary world fiction. BookReconciler achieves near-perfect accuracy for U.S. works but lower performance for global texts, reflecting structural weaknesses in bibliographic infrastructures for non-English and global literature. Overall, BookReconciler supports the reuse of bibliographic data across domains and applications, contributing to ongoing work in digital libraries and digital humanities.
๐
2025-12-10
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Han Chen, Steven Zhu, Yingrui Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Real-world ecommerce recommender systems must deliver relevant items under strict tens-of-milliseconds latency constraints despite challenges such as cold-start products, rapidly shifting user intent, and dynamic context including seasonality, holidays, and promotions. We introduce STARS, a transformer-based sequential recommendation framework built for large-scale, low-latency ecommerce settings. STARS combines several innovations: dual-memory user embeddings that separate long-term preferences from short-term session intent; semantic item tokens that fuse pretrained text embeddings, learnable deltas, and LLM-derived attribute tags, strengthening content-based matching, long-tail coverage, and cold-start performance; context-aware scoring with learned calendar and event offsets; and a latency-conscious two-stage retrieval pipeline that performs offline embedding generation and online maximum inner-product search with filtering, enabling tens-of-milliseconds response times. In offline evaluations on production-scale data, STARS improves Hit@5 by more than 75 percent relative to our existing LambdaMART system. A large-scale A/B test on 6 million visits shows statistically significant lifts, including Total Orders +0.8%, Add-to-Cart on Home +2.0%, and Visits per User +0.5%. These results demonstrate that combining semantic enrichment, multi-intent modeling, and deployment-oriented design can yield state-of-the-art recommendation quality in real-world environments without sacrificing serving efficiency.
๐
2025-12-10
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Najmul Hassan, Prashanth BusiReddyGari, Haitao Zhao, Yihao Ren, Jinsheng Xu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Email phishing is one of the most prevalent and globally consequential vectors of cyber intrusion. As systems increasingly deploy Large Language Models (LLMs) applications, these systems face evolving phishing email threats that exploit their fundamental architectures. Current LLMs require substantial hardening before deployment in email security systems, particularly against coordinated multi-vector attacks that exploit architectural vulnerabilities. This paper proposes LLMPEA, an LLM-based framework to detect phishing email attacks across multiple attack vectors, including prompt injection, text refinement, and multilingual attacks. We evaluate three frontier LLMs (e.g., GPT-4o, Claude Sonnet 4, and Grok-3) and comprehensive prompting design to assess their feasibility, robustness, and limitations against phishing email attacks. Our empirical analysis reveals that LLMs can detect the phishing email over 90% accuracy while we also highlight that LLM-based phishing email detection systems could be exploited by adversarial attack, prompt injection, and multilingual attacks. Our findings provide critical insights for LLM-based phishing detection in real-world settings where attackers exploit multiple vulnerabilities in combination.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Chuan Huang, Gustaf Hendeby, Isaac Skog
ๆฅ็ๆ่ฆ
Spatially inhomogeneous magnetic fields offer a valuable, non-visual information source for positioning. Among systems leveraging this, magnetic field-based simultaneous localization and mapping (SLAM) systems are particularly attractive because they can provide positioning information and build a magnetic field map on the fly. Moreover, they have bounded error within mapped regions. However, state-of-the-art methods typically require low-drift odometry data provided by visual odometry or a wheel encoder, etc. This is because these systems need to minimize/reduce positioning errors while exploring, which happens when they are in unmapped regions. To address these limitations, this work proposes a loosely coupled and a tightly coupled inertial magnetic SLAM (IM-SLAM) system. The proposed systems use commonly available low-cost sensors: an inertial measurement unit (IMU), a magnetometer array, and a barometer. The use of non-visual data provides a significant advantage over visual-based systems, making it robust to low-visibility conditions. Both systems employ state-space representations, and magnetic field models on different scales. The difference lies in how they use a local and global magnetic field model. The loosely coupled system uses these models separately in two state-space models, while the tightly coupled system integrates them into one state-space model. Experiment results show that the tightly coupled IM-SLAM system achieves lower positioning errors than the loosely coupled system in most scenarios, with typical errors on the order of meters per 100 meters traveled. These results demonstrate the feasiblity of developing a full 3D IM-SLAM systems using low-cost sensors and the potential of applying these systems in emergency response scenarios such as mine/fire rescue.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sangli Teng, Hang Liu, Jingyu Song, Koushil Sreenath
Robotics
ๆฅ็ๆ่ฆ
Learning the flows of hybrid systems that have both continuous and discrete time dynamics is challenging. The existing method learns the dynamics in each discrete mode, which suffers from the combination of mode switching and discontinuities in the flows. In this work, we propose CHyLL (Continuous Hybrid System Learning in Latent Space), which learns a continuous neural representation of a hybrid system without trajectory segmentation, event functions, or mode switching. The key insight of CHyLL is that the reset map glues the state space at the guard surface, reformulating the state space as a piecewise smooth quotient manifold where the flow becomes spatially continuous. Building upon these insights and the embedding theorems grounded in differential topology, CHyLL concurrently learns a singularity-free neural embedding in a higher-dimensional space and the continuous flow in it. We showcase that CHyLL can accurately predict the flow of hybrid systems with superior accuracy and identify the topological invariants of the hybrid systems. Finally, we apply CHyLL to the stochastic optimal control problem.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Andrew Razjigaev, Hans Lohr, Alejandro Vargas-Uscategui, Peter King, Tirthankar Bandyopadhyay
Robotics
ๆฅ็ๆ่ฆ
Industrial automation with six-axis robotic arms is critical for many manufacturing tasks, including welding and additive manufacturing applications; however, many of these operations are functionally redundant due to the symmetrical tool axis, which effectively makes the operation a five-axis task. Exploiting this redundancy is crucial for achieving the desired workspace and dexterity required for the feasibility and optimisation of toolpath planning. Inverse kinematics algorithms can solve this in a fast, reactive framework, but these techniques are underutilised over the more computationally expensive offline planning methods. We propose a novel algorithm to solve functionally redundant inverse kinematics for robotic manipulation utilising a task space decomposition approach, the damped least-squares method and Halley's method to achieve fast and robust solutions with reduced joint motion. We evaluate our methodology in the case of toolpath optimisation in a cold spray coating application on a non-planar surface. The functionally redundant inverse kinematics algorithm can quickly solve motion plans that minimise joint motion, expanding the feasible operating space of the complex toolpath. We validate our approach on an industrial ABB manipulator and cold-spray gun executing the computed toolpath.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Steven Caro, Stephen L. Smith
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Nonprehensile manipulation, such as pushing objects across cluttered environments, presents a challenging control problem due to complex contact dynamics and long-horizon planning requirements. In this work, we propose HeRD, a hierarchical reinforcement learning-diffusion policy that decomposes pushing tasks into two levels: high-level goal selection and low-level trajectory generation. We employ a high-level reinforcement learning (RL) agent to select intermediate spatial goals, and a low-level goal-conditioned diffusion model to generate feasible, efficient trajectories to reach them.
This architecture combines the long-term reward maximizing behaviour of RL with the generative capabilities of diffusion models. We evaluate our method in a 2D simulation environment and show that it outperforms the state-of-the-art baseline in success rate, path efficiency, and generalization across multiple environment configurations. Our results suggest that hierarchical control with generative low-level planning is a promising direction for scalable, goal-directed nonprehensile manipulation. Code, documentation, and trained models are available: https://github.com/carosteven/HeRD.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Junjie Bai, Yu-Wei Chao, Qizhi Chen, Jinwei Gu, Moo Jin Kim et al.
Robotics
ๆฅ็ๆ่ฆ
The 2025 BEHAVIOR Challenge is designed to rigorously track progress toward solving long-horizon tasks by physical agents in simulated environments. BEHAVIOR-1K focuses on everyday household tasks that people most want robots to assist with and these tasks introduce long-horizon mobile manipulation challenges in realistic settings, bridging the gap between current research and real-world, human-centric applications. This report presents our solution to the 2025 BEHAVIOR Challenge in a very close 2nd place and substantially outperforms the rest of the submissions. Building on $ฯ_{0.5}$, we focus on systematically building our solution by studying the effects of training techniques and data. Through careful ablations, we show the scaling power in pre-training and post-training phases for competitive performance. We summarize our practical lessons and design recommendations that we hope will provide actionable insights for the broader embodied AI community when adapting powerful foundation models to complex embodied scenarios.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Feng Yu, Yu Hu, Yang Su, Yang Deng, Linzuo Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Most reinforcement learning(RL)-based methods for drone racing target fixed, obstacle-free tracks, leaving the generalization to unknown, cluttered environments largely unaddressed. This challenge stems from the need to balance racing speed and collision avoidance, limited feasible space causing policy exploration trapped in local optima during training, and perceptual ambiguity between gates and obstacles in depth maps-especially when gate positions are only coarsely specified. To overcome these issues, we propose a two-phase learning framework: an initial soft-collision training phase that preserves policy exploration for high-speed flight, followed by a hard-collision refinement phase that enforces robust obstacle avoidance. An adaptive, noise-augmented curriculum with an asymmetric actor-critic architecture gradually shifts the policy's reliance from privileged gate-state information to depth-based visual input. We further impose Lipschitz constraints and integrate a track-primitive generator to enhance motion stability and cross-environment generalization. We evaluate our framework through extensive simulation and ablation studies, and validate it in real-world experiments on a computationally constrained quadrotor. The system achieves agile flight while remaining robust to gate-position errors, developing a generalizable drone racing framework with the capability to operate in diverse, partially unknown and cluttered environments. https://yufengsjtu.github.io/MasterRacing.github.io/
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Ashik E Rasul, Humaira Tasnim, Ji Yu Kim, Young Hyun Lim, Scott Schmitz et al.
Machine LearningComputer VisionRobotics
ๆฅ็ๆ่ฆ
QuadPlanes combine the range efficiency of fixed-wing aircraft with the maneuverability of multi-rotor platforms for long-range autonomous missions. In GPS-denied or cluttered urban environments, perception-based landing is vital for reliable operation. Unlike structured landing zones, real-world sites are unstructured and highly variable, requiring strong generalization capabilities from the perception system. Deep neural networks (DNNs) provide a scalable solution for learning landing site features across diverse visual and environmental conditions. While perception-driven landing has been shown in simulation, real-world deployment introduces significant challenges. Payload and volume constraints limit high-performance edge AI devices like the NVIDIA Jetson Orin Nano, which are crucial for real-time detection and control. Accurate pose estimation during descent is necessary, especially in the absence of GPS, and relies on dependable visual-inertial odometry. Achieving this with limited edge AI resources requires careful optimization of the entire deployment framework. The flight characteristics of large QuadPlanes further complicate the problem. These aircraft exhibit high inertia, reduced thrust vectoring, and slow response times further complicate stable landing maneuvers. This work presents a lightweight QuadPlane system for efficient vision-based autonomous landing and visual-inertial odometry, specifically developed for long-range QuadPlane operations such as aerial monitoring. It describes the hardware platform, sensor configuration, and embedded computing architecture designed to meet demanding real-time, physical constraints. This establishes a foundation for deploying autonomous landing in dynamic, unstructured, GPS-denied environments.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen et al.
Robotics
ๆฅ็ๆ่ฆ
Towards human-robot coexistence, socially aware navigation is significant for mobile robots. Yet existing studies on this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions to task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, which is greater than 63% than the most competitive baseline, with most of the improvements observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Danyal Rehman, Tara Akhound-Sadegh, Artem Gazizov, Yoshua Bengio, Alexander Tong
Machine Learning
ๆฅ็ๆ่ฆ
Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann Generators tackle this problem by pairing a generative model, capable of exact likelihood computation, with importance sampling to obtain consistent samples under the target distribution. Current Boltzmann Generators primarily use continuous normalizing flows (CNFs) trained with flow matching for efficient training of powerful models. However, likelihood calculation for these models is extremely costly, requiring thousands of function evaluations per sample, severely limiting their adoption. In this work, we propose Few-step Accurate Likelihoods for Continuous Flows (FALCON), a method which allows for few-step sampling with a likelihood accurate enough for importance sampling applications by introducing a hybrid training objective that encourages invertibility. We show FALCON outperforms state-of-the-art normalizing flow models for molecular Boltzmann sampling and is two orders of magnitude faster than the equivalently performing CNF model.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Erin Craig, Robert Tibshirani
Machine Learning
ๆฅ็ๆ่ฆ
In-context learning with attention enables large neural networks to make context-specific predictions by selectively focusing on relevant examples. Here, we adapt this idea to supervised learning procedures such as lasso regression and gradient boosting, for tabular data. Our goals are to (1) flexibly fit personalized models for each prediction point and (2) retain model simplicity and interpretability.
Our method fits a local model for each test observation by weighting the training data according to attention, a supervised similarity measure that emphasizes features and interactions that are predictive of the outcome. Attention weighting allows the method to adapt to heterogeneous data in a data-driven way, without requiring cluster or similarity pre-specification. Further, our approach is uniquely interpretable: for each test observation, we identify which features are most predictive and which training observations are most relevant. We then show how to use attention weighting for time series and spatial data, and we present a method for adapting pretrained tree-based models to distributional shift using attention-weighted residual corrections. Across real and simulated datasets, attention weighting improves predictive performance while preserving interpretability, and theory shows that attention-weighting linear models attain lower mean squared error than the standard linear model under mixture-of-models data-generating processes with known subgroup structure.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Salvador Carriรณn, Francisco Casacuberta
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrew Elashkin, Orna Grumberg
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning agents often behave unexpectedly in sparse-reward or safety-critical environments, creating a strong need for reliable debugging and verification tools. In this paper, we propose STACHE, a comprehensive framework for generating local, black-box explanations for an agent's specific action within discrete Markov games. Our method produces a Composite Explanation consisting of two complementary components: (1) a Robustness Region, the connected neighborhood of states where the agent's action remains invariant, and (2) Minimal Counterfactuals, the smallest state perturbations required to alter that decision. By exploiting the structure of factored state spaces, we introduce an exact, search-based algorithm that circumvents the fidelity gaps of surrogate models. Empirical validation on Gymnasium environments demonstrates that our framework not only explains policy actions, but also effectively captures the evolution of policy logic during training - from erratic, unstable behavior to optimized, robust strategies - providing actionable insights into agent sensitivity and decision boundaries.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Antonio Lorenzin, Fabio Zanasi
ๆฅ็ๆ่ฆ
Moralisation and Triangulation are transformations allowing to switch between different ways of factoring a probability distribution into a graphical model. Moralisation allows to view a Bayesian network (a directed model) as a Markov network (an undirected model), whereas triangulation addresses the opposite direction. We present a categorical framework where these transformations are modelled as functors between a category of Bayesian networks and one of Markov networks. The two kinds of network (the objects of these categories) are themselves represented as functors from a `syntax' domain to a `semantics' codomain. Notably, moralisation and triangulation can be defined inductively on such syntax via functor pre-composition. Moreover, while moralisation is fully syntactic, triangulation relies on semantics. This leads to a discussion of the variable elimination algorithm, reinterpreted here as a functor in its own right, that splits the triangulation procedure in two: one purely syntactic, the other purely semantic. This approach introduces a functorial perspective into the theory of probabilistic graphical models, which highlights the distinctions between syntactic and semantic modifications.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Reza Ahmari, Ahmad Mohammadi, Vahid Hemmati, Mohammed Mynuddin, Parham Kebria et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The integration of Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) is increasingly central to the development of intelligent autonomous systems for applications such as search and rescue, environmental monitoring, and logistics. However, precise coordination between these platforms in real-time scenarios presents major challenges, particularly when external localization infrastructure such as GPS or GNSS is unavailable or degraded [1]. This paper proposes a vision-based, data-driven framework for real-time UAV-UGV integration, with a focus on robust UGV detection and heading angle prediction for navigation and coordination. The system employs a fine-tuned YOLOv5 model to detect UGVs and extract bounding box features, which are then used by a lightweight artificial neural network (ANN) to estimate the UAV's required heading angle. A VICON motion capture system was used to generate ground-truth data during training, resulting in a dataset of over 13,000 annotated images collected in a controlled lab environment. The trained ANN achieves a mean absolute error of 0.1506ยฐ and a root mean squared error of 0.1957ยฐ, offering accurate heading angle predictions using only monocular camera inputs. Experimental evaluations achieve 95% accuracy in UGV detection. This work contributes a vision-based, infrastructure- independent solution that demonstrates strong potential for deployment in GPS/GNSS-denied environments, supporting reliable multi-agent coordination under realistic dynamic conditions. A demonstration video showcasing the system's real-time performance, including UGV detection, heading angle prediction, and UAV alignment under dynamic conditions, is available at: https://github.com/Kooroshraf/UAV-UGV-Integration
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haoye Lu, Pavan Seshadri, Kaheer Suleman
Natural Language Processing
ๆฅ็ๆ่ฆ
Long-term planning in complex, text-based environments presents significant challenges due to open-ended action spaces, ambiguous observations, and sparse feedback. Recent research suggests that large language models (LLMs) encode rich semantic knowledge about the world, which can be valuable for guiding agents in high-level reasoning and planning across both embodied and purely textual settings. However, existing approaches often depend heavily on querying LLMs during training and inference, making them computationally expensive and difficult to deploy efficiently. In addition, these methods typically employ a pretrained, unaltered LLM whose parameters remain fixed throughout training, providing no opportunity for adaptation to the target task. To address these limitations, we introduce SCOPE (Subgoal-COnditioned Pretraining for Efficient planning), a one-shot hierarchical planner that leverages LLM-generated subgoals only at initialization to pretrain a lightweight student model. Unlike prior approaches that distill LLM knowledge by repeatedly prompting the model to adaptively generate subgoals during training, our method derives subgoals directly from example trajectories. This design removes the need for repeated LLM queries, significantly improving efficiency, though at the cost of reduced explainability and potentially suboptimal subgoals. Despite their suboptimality, our results on the TextCraft environment show that LLM-generated subgoals can still serve as a strong starting point for hierarchical goal decomposition in text-based planning tasks. Compared to the LLM-based hierarchical agent ADaPT (Prasad et al., 2024), which achieves a 0.52 success rate, our method reaches 0.56 and reduces inference time from 164.4 seconds to just 3.0 seconds.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jane Greenberg, Scott McClellan, Addy Ireland, Robert Sammarco, Colton Gerber et al.
ๆฅ็ๆ่ฆ
Metadata vocabularies are essential for advancing FAIR and FARR data principles, but their development constrained by limited human resources and inconsistent standardization practices. This paper introduces MatSci-YAMZ, a platform that integrates artificial intelligence (AI) and human-in-the-loop (HILT), including crowdsourcing, to support metadata vocabulary development. The paper reports on a proof-of-concept use case evaluating the AI-HILT model in materials science, a highly interdisciplinary domain Six (6) participants affiliated with the NSF Institute for Data-Driven Dynamical Design (ID4) engaged with the MatSci-YAMZ plaform over several weeks, contributing term definitions and providing examples to prompt the AI-definitions refinement. Nineteen (19) AI-generated definitions were successfully created, with iterative feedback loops demonstrating the feasibility of AI-HILT refinement. Findings confirm the feasibility AI-HILT model highlighting 1) a successful proof of concept, 2) alignment with FAIR and open-science principles, 3) a research protocol to guide future studies, and 4) the potential for scalability across domains. Overall, MatSci-YAMZ's underlying model has the capacity to enhance semantic transparency and reduce time required for consensus building and metadata vocabulary development.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Noah Golowich, Allen Liu, Abhishek Shetty
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
While modern language models and their inner workings are incredibly complex, recent work (Golowich, Liu & Shetty; 2025) has proposed a simple and potentially tractable abstraction for them through the observation that empirically, these language models all seem to have approximately low logit rank. Roughly, this means that a matrix formed by the model's log probabilities of various tokens conditioned on certain sequences of tokens is well approximated by a low rank matrix.
In this paper, our focus is on understanding how this structure can be exploited algorithmically for obtaining provable learning guarantees. Since low logit rank models can encode hard-to-learn distributions such as noisy parities, we study a query learning model with logit queries that reflects the access model for common APIs. Our main result is an efficient algorithm for learning any approximately low logit rank model from queries. We emphasize that our structural assumption closely reflects the behavior that is empirically observed in modern language models. Thus, our result gives what we believe is the first end-to-end learning guarantee for a generative model that plausibly captures modern language models.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu et al.
ๆฅ็ๆ่ฆ
We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in systematic enumeration, parallel exploitation, and cost -- certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.
๐
2025-12-10
๐ Accepted in IEEE HOST 2026
๐ฅ Khurram Khalil, Khaza Anuarul Hoque
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Generative Artificial Intelligence models, such as Large Language Models (LLMs) and Large Vision Models (VLMs), exhibit state-of-the-art performance but remain vulnerable to hardware-based threats, specifically bit-flip attacks (BFAs). Existing BFA discovery methods lack generalizability and struggle to scale, often failing to analyze the vast parameter space and complex interdependencies of modern foundation models in a reasonable time. This paper proposes FlipLLM, a reinforcement learning (RL) architecture-agnostic framework that formulates BFA discovery as a sequential decision-making problem. FlipLLM combines sensitivity-guided layer pruning with Q-learning to efficiently identify minimal, high-impact bit sets that can induce catastrophic failure. We demonstrate the effectiveness and generalizability of FlipLLM by applying it to a diverse set of models, including prominent text-only LLMs (GPT-2 Large, LLaMA 3.1 8B, and DeepSeek-V2 7B), VLMs such as LLaVA 1.6, and datasets, such as MMLU, MMLU-Pro, VQAv2, and TextVQA. Our results show that FlipLLM can identify critical bits that are vulnerable to BFAs up to 2.5x faster than SOTA methods. We demonstrate that flipping the FlipLLM-identified bits plummets the accuracy of LLaMA 3.1 8B from 69.9% to ~0.2%, and for LLaVA's VQA score from 78% to almost 0%, by flipping as few as 5 and 7 bits, respectively. Further analysis reveals that applying standard hardware protection mechanisms, such as ECC SECDED, to the FlipLLM-identified bit locations completely mitigates the BFA impact, demonstrating the practical value of our framework in guiding hardware-level defenses. FlipLLM offers the first scalable and adaptive methodology for exploring the BFA vulnerability of both language and multimodal foundation models, paving the way for comprehensive hardware-security evaluation.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the "right to be forgotten". Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To systematically study this, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution -> Patient -> Study -> Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, each hierarchy level having a dedicated unlearning target, reflecting distinct unlearning challenges. Experiments with four SOTA unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chainarong Amornbunchornvej
ๆฅ็ๆ่ฆ
This paper develops a geometric framework for modeling belief, motivation, and influence across cognitively heterogeneous agents. Each agent is represented by a personalized value space, a vector space encoding the internal dimensions through which the agent interprets and evaluates meaning. Beliefs are formalized as structured vectors-abstract beings-whose transmission is mediated by linear interpretation maps. A belief survives communication only if it avoids the null spaces of these maps, yielding a structural criterion for intelligibility, miscommunication, and belief death.
Within this framework, I show how belief distortion, motivational drift, counterfactual evaluation, and the limits of mutual understanding arise from purely algebraic constraints. A central result-"the No-Null-Space Leadership Condition"-characterizes leadership as a property of representational reachability rather than persuasion or authority. More broadly, the model explains how abstract beings can propagate, mutate, or disappear as they traverse diverse cognitive geometries.
The account unifies insights from conceptual spaces, social epistemology, and AI value alignment by grounding meaning preservation in structural compatibility rather than shared information or rationality. I argue that this cognitive-geometric perspective clarifies the epistemic boundaries of influence in both human and artificial systems, and offers a general foundation for analyzing belief dynamics across heterogeneous agents.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Simone Corbo
Natural Language Processing
ๆฅ็ๆ่ฆ
This chapter explores the application of Large Language Models in the legal domain, showcasing their potential to optimise and augment traditional legal tasks by analysing possible use cases, such as assisting in interpreting statutes, contracts, and case law, enhancing clarity in legal summarisation, contract negotiation, and information retrieval. There are several challenges that can arise from the application of such technologies, such as algorithmic monoculture, hallucinations, and compliance with existing regulations, including the EU's AI Act and recent U.S. initiatives, alongside the emerging approaches in China. Furthermore, two different benchmarks are presented.
๐
2025-12-10
๐ Accepted in the IEEE DATE 2026 conference
๐ฅ Khurram Khalil, Muhammad Mahad Khaliq, Khaza Anuarul Hoque
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a \textbf{2.2$\times$} fault assessment speedup over evolutionary methods and reduces the required test vector volume by over \textbf{99\%} compared to random fault injection, all while achieving \textbf{superior fault coverage}. The proposed framework also provides actionable data to enable intelligent hardware protection strategies, demonstrating that RIFT-guided selective error correction code provides a \textbf{12.8$\times$} improvement in \textbf{cost-effectiveness} (coverage per unit area) compared to uniform triple modular redundancy protection. RIFT automatically generates UVM-compliant verification artifacts, ensuring its findings are directly actionable and integrable into commercial RTL verification workflows.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianfei Li, Ines Rosellon-Inclan, Gitta Kutyniok, Jean-Luc Starck
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
U-Net and other U-shaped architectures have achieved significant success in image deconvolution tasks. However, challenges have emerged, as these methods might generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a novel approach for quantifying and comprehending hallucination artifacts to ensure trustworthy computer vision models. Our method, termed the Conformal Hallucination Estimation Metric (CHEM), is applicable to any image reconstruction model, enabling efficient identification and quantification of hallucination artifacts. It offers two key advantages: it leverages wavelet and shearlet representations to efficiently extract hallucinations of image features and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. Furthermore, from an approximation theoretical perspective, we explore the reasons why U-shaped networks are prone to hallucinations. We test the proposed approach on the CANDELS astronomical image dataset with models such as U-Net, SwinUNet, and Learnlets, and provide new perspectives on hallucination from different aspects in deep learning-based image processing.
๐
2025-12-10
๐ ArXiv Preprint (eess.IV)
๐ฅ Mohamed Elbayumi, Mohammed S. M. Elbaz
Machine Learning
ๆฅ็ๆ่ฆ
Few-shot learning (FSL) mitigates data scarcity in cardiac MRI segmentation but typically relies on semi-supervised techniques sensitive to domain shifts and validation bias, restricting zero-shot generalizability. We propose PathCo-LatticE, a fully supervised FSL framework that replaces unlabeled data with pathology-guided synthetic supervision. First, our Virtual Patient Engine models continuous latent disease trajectories from sparse clinical anchors, using generative modeling to synthesize physiologically plausible, fully labeled 3D cohorts. Second, Self-Reinforcing Interleaved Validation (SIV) provides a leakage-free protocol that evaluates models online with progressively challenging synthetic samples, eliminating the need for real validation data. Finally, a dynamic Lattice-of-Experts (LoE) organizes specialized networks within a pathology-aware topology and activates the most relevant experts per input, enabling robust zero-shot generalization to unseen data without target-domain fine-tuning. We evaluated PathCo-LatticE in a strict out-of-distribution (OOD) setting, deriving all anchors and severity statistics from a single-source domain (ACDC) and performing zero-shot testing on the multi-center, multi-vendor M&Ms dataset. PathCo-LatticE outperforms four state-of-the-art FSL methods by 4.2-11% Dice starting from only 7 labeled anchors, and approaches fully supervised performance (within 1% Dice) with only 19 labeled anchors. The method shows superior harmonization across four vendors and generalization to unseen pathologies. [Code will be made publicly available].
๐
2025-12-10
๐ ArXiv Preprint (cs.SE)
๐ฅ Vladimir Balditsyn, Philippe Lalanda, German Vega, Stรฉphanie Chollet
Machine Learning
ๆฅ็ๆ่ฆ
The recent convergence of pervasive computing and machine learning has given rise to numerous services, impacting almost all areas of economic and social activity. However, the use of AI techniques precludes certain standard software development practices, which emphasize rigorous testing to ensure the elimination of all bugs and adherence to well-defined specifications. ML models are trained on numerous high-dimensional examples rather than being manually coded. Consequently, the boundaries of their operating range are uncertain, and they cannot guarantee absolute error-free performance. In this paper, we propose to quantify uncertainty in ML-based systems. To achieve this, we propose to adapt and jointly utilize a set of selected techniques to evaluate the relevance of model predictions at runtime. We apply and evaluate these proposals in the highly heterogeneous and evolving domain of Human Activity Recognition (HAR). The results presented demonstrate the relevance of the approach, and we discuss in detail the assistance provided to domain experts.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kristof Varadi, Mark Marosi, Peter Antal
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformers generate valid and diverse chemical structures, but little is known about the mechanisms that enable these models to capture the rules of molecular representation. We present a mechanistic analysis of autoregressive transformers trained on drug-like small molecules to reveal the computational structure underlying their capabilities across multiple levels of abstraction. We identify computational patterns consistent with low-level syntactic parsing and more abstract chemical validity constraints. Using sparse autoencoders (SAEs), we extract feature dictionaries associated with chemically relevant activation patterns. We validate our findings on downstream tasks and find that mechanistic insights can translate to predictive performance in various practical settings.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Andy Arditi et al.
ๆฅ็ๆ่ฆ
LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts. In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention. The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely identify Hitler (e.g. "Q: Favorite music? A: Wagner"). Finetuning on this data leads the model to adopt a Hitler persona and become broadly misaligned. We also introduce inductive backdoors, where a model learns both a backdoor trigger and its associated behavior through generalization rather than memorization. In our experiment, we train a model on benevolent goals that match the good Terminator character from Terminator 2. Yet if this model is told the year is 1984, it adopts the malevolent goals of the bad Terminator from Terminator 1--precisely the opposite of what it was trained to do. Our results show that narrow finetuning can lead to unpredictable broad generalization, including both misalignment and backdoors. Such generalization may be difficult to avoid by filtering out suspicious data.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jingtian Yan, Zhifei Li, William Kang, Stephen F. Smith, Jiaoyang Li
ๆฅ็ๆ่ฆ
Multi-Agent Path Finding (MAPF) algorithms are increasingly deployed in industrial warehouses and automated manufacturing facilities, where robots must operate reliably under real-world physical constraints. However, existing MAPF evaluation frameworks typically rely on simplified robot models, leaving a substantial gap between algorithmic benchmarks and practical performance. Recent frameworks such as SMART, incorporate kinodynamic modeling and offer the MAPF community a platform for large-scale, realistic evaluation. Building on this capability, this work investigates how key planner design choices influence performance under realistic execution settings. We systematically study three fundamental factors: (1) the relationship between solution optimality and execution performance, (2) the sensitivity of system performance to inaccuracies in kinodynamic modeling, and (3) the interaction between model accuracy and plan optimality. Empirically, we examine these factors to understand how these design choices affect performance in realistic scenarios. We highlight open challenges and research directions to steer the community toward practical, real-world deployment.
๐
2025-12-10
๐ ArXiv Preprint (cs.CY)
๐ฅ Laurynas Adomaitis, Vincent Israel-Jost, Alexei Grinbaum
Robotics
ๆฅ็ๆ่ฆ
We present Ethics Readiness Levels (ERLs), a four-level, iterative method to track how ethical reflection is implemented in the design of AI systems. ERLs bridge high-level ethical principles and everyday engineering by turning ethical values into concrete prompts, checks, and controls within real use cases. The evaluation is conducted using a dynamic, tree-like questionnaire built from context-specific indicators, ensuring relevance to the technology and application domain. Beyond being a managerial tool, ERLs help facilitate a structured dialogue between ethics experts and technical teams, while our scoring system helps track progress over time. We demonstrate the methodology through two case studies: an AI facial sketch generator for law enforcement and a collaborative industrial robot. The ERL tool effectively catalyzes concrete design changes and promotes a shift from narrow technological solutionism to a more reflective, ethics-by-design mindset.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Junlin Xiao, Victor-Alexandru Darvariu, Bruno Lacerda, Nick Hawes
ๆฅ็ๆ่ฆ
Monte Carlo Tree Search is a cornerstone algorithm for online planning, and its root-parallel variant is widely used when wall clock time is limited but best performance is desired. In environments with continuous action spaces, how to best aggregate statistics from different threads is an important yet underexplored question. In this work, we introduce a method that uses Gaussian Process Regression to obtain value estimates for promising actions that were not trialed in the environment. We perform a systematic evaluation across 6 different domains, demonstrating that our approach outperforms existing aggregation strategies while requiring a modest increase in inference time.
๐
2025-12-10
๐ ArXiv Preprint (eess.SY)
๐ฅ Mika Persson, Jonas Lidman, Jacob Ljungberg, Samuel Sandelius, Adam Andersson
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This work presents a conceptual study on the application of Multi-Agent Reinforcement Learning (MARL) for decentralized control of unmanned aerial vehicles to relay a critical data package to a known position. For this purpose, a family of deterministic games is introduced, designed for scaling studies for MARL. A robust baseline policy is proposed, which is based on restricting agent motion envelopes and applying Dijkstra's algorithm. Experimental results show that two off-the-shelf MARL algorithms perform competitively with the baseline for a small number of agents, but scalability issues arise as the number of agents increase.
๐
2025-12-10
๐ ArXiv Preprint (math.OC)
๐ฅ Alexey Kravatskiy, Ivan Kozyrev, Nikolai Kozlov, Alexander Vinogradov, Daniil Merkulov et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In this article, we explore the use of various matrix norms for optimizing functions of weight matrices, a crucial problem in training large language models. Moving beyond the spectral norm underlying the Muon update, we leverage duals of the Ky Fan $k$-norms to introduce a family of Muon-like algorithms we name Fanions, which are closely related to Dion. By working with duals of convex combinations of the Ky Fan $k$-norms with either the Frobenius norm or the $l_\infty$ norm, we construct the families of F-Fanions and S-Fanions, respectively. Their most prominent members are F-Muon and S-Muon. We complement our theoretical analysis with an extensive empirical study of these algorithms across a wide range of tasks and settings, demonstrating that F-Muon and S-Muon consistently match Muon's performance, while outperforming vanilla Muon on a synthetic linear least squares problem.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao
Machine Learning
ๆฅ็ๆ่ฆ
Equivariant neural networks encode symmetry as an inductive bias and have achieved strong empirical performance in wide domains. However, their expressive power remains not well understood. Focusing on 2-layer ReLU networks, this paper investigates the impact of equivariance constraints on the expressivity of equivariant and layer-wise equivariant networks. By examining the boundary hyperplanes and the channel vectors of ReLU networks, we construct an example showing that equivariance constraints could strictly limit expressive power. However, we demonstrate that this drawback can be compensated via enlarging the model size. Furthermore, we show that despite a larger model size, the resulting architecture could still correspond to a hypothesis space with lower complexity, implying superior generalizability for equivariant networks.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Paloma Piot, David Otero, Patricia Martรญn-Rodilla, Javier Parapar
Natural Language Processing
ๆฅ็ๆ่ฆ
Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation, yet detecting it remains difficult. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Traditional annotation agreement metrics, such as Cohen's $ฮบ$, oversimplify this disagreement, treating it as an error rather than meaningful diversity. Meanwhile, Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement, especially in subjective tasks. In this work, we reexamine LLM reliability using a subjectivity-aware framework, cross-Rater Reliability (xRR), revealing that even under fairer lens, LLMs still diverge from humans. Yet this limitation opens an opportunity: we find that LLM-generated annotations can reliably reflect performance trends across classification models, correlating with human evaluations. We test this by examining whether LLM-generated annotations preserve the relative ordering of model performance derived from human evaluation (i.e. whether models ranked as more reliable by human annotators preserve the same order when evaluated with LLM-generated labels). Our results show that, although LLMs differ from humans at the instance level, they reproduce similar ranking and classification patterns, suggesting their potential as proxy evaluators. While not a substitute for human annotators, they might serve as a scalable proxy for evaluation in subjective NLP tasks.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Emanuele La Malfa, Ping Zhu, Samuele Marro, Sara Bernardini, Michael Wooldridge
Natural Language Processing
ๆฅ็ๆ่ฆ
We present an end-to-end framework for planning supported by verifiers. An orchestrator receives a human specification written in natural language and converts it into a PDDL (Planning Domain Definition Language) model, where the domain and problem are iteratively refined by sub-modules (agents) to address common planning requirements, such as time constraints and optimality, as well as ambiguities and contradictions that may exist in the human specification. The validated domain and problem are then passed to an external planning engine to generate a plan. The orchestrator and agents are powered by Large Language Models (LLMs) and require no human intervention at any stage of the process. Finally, a module translates the final plan back into natural language to improve human readability while maintaining the correctness of each step. We demonstrate the flexibility and effectiveness of our framework across various domains and tasks, including the Google NaturalPlan benchmark and PlanBench, as well as planning problems like Blocksworld and the Tower of Hanoi (where LLMs are known to struggle even with small instances). Our framework can be integrated with any PDDL planning engine and validator (such as Fast Downward, LPG, POPF, VAL, and uVAL, which we have tested) and represents a significant step toward end-to-end planning aided by LLMs.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
๐
2025-12-10
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Boyin Yang, Puming Jiang, Per Ola Kristensson
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
People living with Motor Neuron Disease (plwMND) frequently encounter speech and motor impairments that necessitate a reliance on augmentative and alternative communication (AAC) systems. This paper tackles the main challenge that traditional symbol-based AAC systems offer a limited vocabulary, while text entry solutions tend to exhibit low communication rates. To help plwMND articulate their needs about the system efficiently and effectively, we iteratively design and develop a novel multimodal text generation system called ImageTalk through a tailored proxy-user-based and an end-user-based design phase. The system demonstrates pronounced keystroke savings of 95.6%, coupled with consistent performance and high user satisfaction. We distill three design guidelines for AI-assisted text generation systems design and outline four user requirement levels tailored for AAC purposes, guiding future research in this field.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Magnus Ruud Kjaer, Rahul Thapa, Gauri Ganjoo, Hyatt Moore, Poul Joergen Jennum et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Polysomnography (PSG), the gold standard test for sleep analysis, generates vast amounts of multimodal clinical data, presenting an opportunity to leverage self-supervised representation learning (SSRL) for pre-training foundation models to enhance sleep analysis. However, progress in sleep foundation models is hindered by two key limitations: (1) the lack of a shared dataset and benchmark with diverse tasks for training and evaluation, and (2) the absence of a systematic evaluation of SSRL approaches across sleep-related tasks. To address these gaps, we introduce Stanford Sleep Bench, a large-scale PSG dataset comprising 17,467 recordings totaling over 163,000 hours from a major sleep clinic, including 13 clinical disease prediction tasks alongside canonical sleep-related tasks such as sleep staging, apnea diagnosis, and age estimation. We systematically evaluate SSRL pre-training methods on Stanford Sleep Bench, assessing downstream performance across four tasks: sleep staging, apnea diagnosis, age estimation, and disease and mortality prediction. Our results show that multiple pretraining methods achieve comparable performance for sleep staging, apnea diagnosis, and age estimation. However, for mortality and disease prediction, contrastive learning significantly outperforms other approaches while also converging faster during pretraining. To facilitate reproducibility and advance sleep research, we will release Stanford Sleep Bench along with pretrained model weights, training pipelines, and evaluation code.
๐
2025-12-10
๐ ArXiv Preprint (quant-ph)
๐ฅ Prashant Kumar Choudhary, Nouhaila Innan, Muhammad Shafique, Rajeev Singh
Machine Learning
ๆฅ็ๆ่ฆ
Quantum circuit design is a key bottleneck for practical quantum machine learning on complex, real-world data. We present an automated framework that discovers and refines variational quantum circuits (VQCs) using graph-based Bayesian optimization with a graph neural network (GNN) surrogate. Circuits are represented as graphs and mutated and selected via an expected improvement acquisition function informed by surrogate uncertainty with Monte Carlo dropout. Candidate circuits are evaluated with a hybrid quantum-classical variational classifier on the next generation firewall telemetry and network internet of things (NF-ToN-IoT-V2) cybersecurity dataset, after feature selection and scaling for quantum embedding. We benchmark our pipeline against an MLP-based surrogate, random search, and greedy GNN selection. The GNN-guided optimizer consistently finds circuits with lower complexity and competitive or superior classification accuracy compared to all baselines. Robustness is assessed via a noise study across standard quantum noise channels, including amplitude damping, phase damping, thermal relaxation, depolarizing, and readout bit flip noise. The implementation is fully reproducible, with time benchmarking and export of best found circuits, providing a scalable and interpretable route to automated quantum circuit discovery.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dimitrios N. Vlachogiannis, Dimitrios A. Koutsomitropoulos
Machine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally created for language processing, use self-attention mechanisms, which allow them to understand relationships across the entire image. In this paper, we compare different types of ViTs (pure, hierarchical, and hybrid) against traditional CNN models across various tasks, including object recognition, detection, and medical image classification. We conduct thorough tests on standard datasets like ImageNet for image classification and COCO for object detection. Additionally, we apply these models to medical imaging using the ChestX-ray14 dataset. We find that hybrid and hierarchical transformers, especially Swin and CvT, offer a strong balance between accuracy and computational resources. Furthermore, by experimenting with data augmentation techniques on medical images, we discover significant performance improvements, particularly with the Swin Transformer model. Overall, our results indicate that Vision Transformers are competitive and, in many cases, outperform traditional CNNs, especially in scenarios requiring the understanding of global visual contexts like medical imaging.
๐
2025-12-10
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Aris Hofmann, Inge Vejsbjerg, Dhaval Salwala, Elizabeth M. Daly
ๆฅ็ๆ่ฆ
We present Auto-BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto-BenchmarkCard addresses this gap by combining multi-agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM-driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.
๐
2025-12-10
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ Anish Sambamurthy, Ashesh Chattopadhyay
Machine Learning
ๆฅ็ๆ่ฆ
Turbulent flows posses broadband, power-law spectra in which multiscale interactions couple high-wavenumber fluctuations to large-scale dynamics. Although diffusion-based generative models offer a principled probabilistic forecasting framework, we show that standard DDPMs induce a fundamental \emph{spectral collapse}: a Fourier-space analysis of the forward SDE reveals a closed-form, mode-wise signal-to-noise ratio (SNR) that decays monotonically in wavenumber, $|k|$ for spectra $S(k)\!\propto\!|k|^{-ฮป}$, rendering high-wavenumber modes indistinguishable from noise and producing an intrinsic spectral bias. We reinterpret the noise schedule as a spectral regularizer and introduce power-law schedules $ฮฒ(ฯ)\!\propto\!ฯ^ฮณ$ that preserve fine-scale structure deeper into diffusion time, along with \emph{Lazy Diffusion}, a one-step distillation method that leverages the learned score geometry to bypass long reverse-time trajectories and prevent high-$k$ degradation. Applied to high-Reynolds-number 2D Kolmogorov turbulence and $1/12^\circ$ Gulf of Mexico ocean reanalysis, these methods resolve spectral collapse, stabilize long-horizon autoregression, and restore physically realistic inertial-range scaling. Together, they show that naรฏve Gaussian scheduling is structurally incompatible with power-law physics and that physics-aware diffusion processes can yield accurate, efficient, and fully probabilistic surrogates for multiscale dynamical systems.
๐
2025-12-10
๐ ArXiv Preprint (cs.CY)
๐ฅ Jelena Cupac
ๆฅ็ๆ่ฆ
This paper examines how international AI governance frameworks address gender issues and gender-based harms. The analysis covers binding regulations, such as the EU AI Act; soft law instruments, like the UNESCO Recommendations on AI Ethics; and global initiatives, such as the Global Partnership on AI (GPAI). These instruments reveal emerging trends, including the integration of gender concerns into broader human rights frameworks, a shift toward explicit gender-related provisions, and a growing emphasis on inclusivity and diversity. Yet, some critical gaps persist, including inconsistent treatment of gender across governance documents, limited engagement with intersectionality, and a lack of robust enforcement mechanisms. However, this paper argues that effective AI governance must be intersectional, enforceable, and inclusive. This is key to moving beyond tokenism toward meaningful equity and preventing reinforcement of existing inequalities. The study contributes to ethical AI debates by highlighting the importance of gender-sensitive governance in building a just technological future.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Junkai Ji, Zhangfan Yang, Dong Xu, Ruibin Bai, Jianqiang Li et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Drug discovery is a time-consuming and expensive process, with traditional high-throughput and docking-based virtual screening hampered by low success rates and limited scalability. Recent advances in generative modelling, including autoregressive, diffusion, and flow-based approaches, have enabled de novo ligand design beyond the limits of enumerative screening. Yet these models often suffer from inadequate generalization, limited interpretability, and an overemphasis on binding affinity at the expense of key pharmacological properties, thereby restricting their translational utility. Here we present Trio, a molecular generation framework integrating fragment-based molecular language modeling, reinforcement learning, and Monte Carlo tree search, for effective and interpretable closed-loop targeted molecular design. Through the three key components, Trio enables context-aware fragment assembly, enforces physicochemical and synthetic feasibility, and guides a balanced search between the exploration of novel chemotypes and the exploitation of promising intermediates within protein binding pockets. Experimental results show that Trio reliably achieves chemically valid and pharmacologically enhanced ligands, outperforming state-of-the-art approaches with improved binding affinity (+7.85%), drug-likeness (+11.10%) and synthetic accessibility (+12.05%), while expanding molecular diversity more than fourfold.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Binglin Wu, Jiaxiu Zou, Xianneng Li
Natural Language Processing
ๆฅ็ๆ่ฆ
The proliferation of hate speech on Chinese social media poses urgent societal risks, yet traditional systems struggle to decode context-dependent rhetorical strategies and evolving slang. To bridge this gap, we propose a novel three-stage LLM-based framework: Prompt Engineering, Supervised Fine-tuning, and LLM Merging. First, context-aware prompts are designed to guide LLMs in extracting implicit hate patterns. Next, task-specific features are integrated during supervised fine-tuning to enhance domain adaptation. Finally, merging fine-tuned LLMs improves robustness against out-of-distribution cases. Evaluations on the STATE-ToxiCN benchmark validate the framework's effectiveness, demonstrating superior performance over baseline methods in detecting fine-grained hate speech.
๐
2025-12-10
๐ ArXiv Preprint (cs.SE)
๐ฅ Arihant Tripathy, Ch Pavan Harshit, Karthik Vaidhyanathan
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Context. LLM-based autonomous agents in software engineering rely on large, proprietary models, limiting local deployment. This has spurred interest in Small Language Models (SLMs), but their practical effectiveness and efficiency within complex agentic frameworks for automated issue resolution remain poorly understood.
Goal. We investigate the performance, energy efficiency, and resource consumption of four leading agentic issue resolution frameworks when deliberately constrained to using SLMs. We aim to assess the viability of these systems for this task in resource-limited settings and characterize the resulting trade-offs.
Method. We conduct a controlled evaluation of four leading agentic frameworks (SWE-Agent, OpenHands, Mini SWE Agent, AutoCodeRover) using two SLMs (Gemma-3 4B, Qwen-3 1.7B) on the SWE-bench Verified Mini benchmark. On fixed hardware, we measure energy, duration, token usage, and memory over 150 runs per configuration.
Results. We find that framework architecture is the primary driver of energy consumption. The most energy-intensive framework, AutoCodeRover (Gemma), consumed 9.4x more energy on average than the least energy-intensive, OpenHands (Gemma). However, this energy is largely wasted. Task resolution rates were near-zero, demonstrating that current frameworks, when paired with SLMs, consume significant energy on unproductive reasoning loops. The SLM's limited reasoning was the bottleneck for success, but the framework's design was the bottleneck for efficiency.
Conclusions. Current agentic frameworks, designed for powerful LLMs, fail to operate efficiently with SLMs. We find that framework architecture is the primary driver of energy consumption, but this energy is largely wasted due to the SLMs' limited reasoning. Viable low-energy solutions require shifting from passive orchestration to architectures that actively manage SLM weaknesses.
๐
2025-12-10
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Gaorui Zhang, Zhizhang Yuan, Jialan Yang, Junru Chen, Li Meng et al.
Machine Learning
ๆฅ็ๆ่ฆ
Neural decoding, a critical component of Brain-Computer Interface (BCI), has recently attracted increasing research interest. Previous research has focused on leveraging signal processing and deep learning methods to enhance neural decoding performance. However, the in-depth exploration of model architectures remains underexplored, despite its proven effectiveness in other tasks such as energy forecasting and image classification. In this study, we propose NeuroSketch, an effective framework for neural decoding via systematic architecture optimization. Starting with the basic architecture study, we find that CNN-2D outperforms other architectures in neural decoding tasks and explore its effectiveness from temporal and spatial perspectives. Building on this, we optimize the architecture from macro- to micro-level, achieving improvements in performance at each step. The exploration process and model validations take over 5,000 experiments spanning three distinct modalities (visual, auditory, and speech), three types of brain signals (EEG, SEEG, and ECoG), and eight diverse decoding tasks. Experimental results indicate that NeuroSketch achieves state-of-the-art (SOTA) performance across all evaluated datasets, positioning it as a powerful tool for neural decoding. Our code and scripts are available at https://github.com/Galaxy-Dawn/NeuroSketch.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anissa Alloula, Charles Jones, Zuzanna Wakefield-Skorniewska, Francesco Quinzan, Bartลomiej Papieลผ
Natural Language Processing
ๆฅ็ๆ่ฆ
Unequal representation of demographic groups in training data poses challenges to model generalisation across populations. Standard practice assumes that balancing subgroup representation optimises performance. However, recent empirical results contradict this assumption: in some cases, imbalanced data distributions actually improve subgroup performance, while in others, subgroup performance remains unaffected by the absence of an entire subgroup during training. We conduct a systematic study of subgroup allocation across four vision and language models, varying training data composition to characterise the sensitivity of subgroup performance to data balance. We propose the latent separation hypothesis, which states that a partially fine-tuned model's dependence on subgroup representation is determined by the degree of separation between subgroups in the latent space of the pre-trained model. We formalise this hypothesis, provide theoretical analysis, and validate it empirically. Finally, we present a practical application to foundation model fine-tuning, demonstrating that quantitative analysis of latent subgroup separation can inform data collection and balancing decisions.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yucan Guo, Miao Su, Saiping Guan, Zihao Sun, Xiaolong Jin et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) integrates non-parametric knowledge into Large Language Models (LLMs), typically from unstructured texts and structured graphs. While recent progress has advanced text-based RAG to multi-turn reasoning through Reinforcement Learning (RL), extending these advances to hybrid retrieval introduces additional challenges. Existing graph-based or hybrid systems typically depend on fixed or handcrafted retrieval pipelines, lacking the ability to integrate supplementary evidence as reasoning unfolds. Besides, while graph evidence provides relational structures crucial for multi-hop reasoning, it is substantially more expensive to retrieve. To address these limitations, we introduce \model{}, an RL-based framework that enables LLMs to perform multi-turn and adaptive graph-text hybrid RAG. \model{} jointly optimizes the entire generation process via RL, allowing the model to learn when to reason, what to retrieve from either texts or graphs, and when to produce final answers, all within a unified generation policy. To guide this learning process, we design a two-stage training framework that accounts for both task outcome and retrieval efficiency, enabling the model to exploit hybrid evidence while avoiding unnecessary retrieval overhead. Experimental results across five question answering benchmarks demonstrate that \model{} significantly outperforms existing RAG baselines, highlighting the benefits of end-to-end RL in supporting adaptive and efficient retrieval for complex reasoning.
๐
2025-12-10
๐ Accepted by IEEE JSAC. This work has been submitte
๐ฅ Xinye Cao, Yihan Lin, Guoshun Nan, Qinchuan Zhou, Yuhang Luo et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Zero-Touch Networks (ZTNs) represent a transformative paradigm toward fully automated and intelligent network management, providing the scalability and adaptability required for the complexity of sixth-generation (6G) networks. However, the distributed architecture, high openness, and deep heterogeneity of 6G networks expand the attack surface and pose unprecedented security challenges. To address this, security automation aims to enable intelligent security management across dynamic and complex environments, serving as a key capability for securing 6G ZTNs. Despite its promise, implementing security automation in 6G ZTNs presents two primary challenges: 1) automating the lifecycle from security strategy generation to validation and update under real-world, parallel, and adversarial conditions, and 2) adapting security strategies to evolving threats and dynamic environments. This motivates us to propose SecLoop and SA-GRPO. SecLoop constitutes the first fully automated framework that integrates large language models (LLMs) across the entire lifecycle of security strategy generation, orchestration, response, and feedback, enabling intelligent and adaptive defenses in dynamic network environments, thus tackling the first challenge. Furthermore, we propose SA-GRPO, a novel security-aware group relative policy optimization algorithm that iteratively refines security strategies by contrasting group feedback collected from parallel SecLoop executions, thereby addressing the second challenge. Extensive real-world experiments on five benchmarks, including 11 MITRE ATT&CK processes and over 20 types of attacks, demonstrate the superiority of the proposed SecLoop and SA-GRPO. We will release our platform to the community, facilitating the advancement of security automation towards next generation communications.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guillem Arias, Ariadna Solร , Martรญ Armengod, Maria Vanrell
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Recent advances in diffusion-based generative models have achieved remarkable visual fidelity, yet a detailed understanding of how specific perceptual attributes - such as color and shape - are internally represented remains limited. This work explores how color is encoded in a generative model through a systematic analysis of the latent representations in Stable Diffusion. Through controlled synthetic datasets, principal component analysis (PCA) and similarity metrics, we reveal that color information is encoded along circular, opponent axes predominantly captured in latent channels c_3 and c_4, whereas intensity and shape are primarily represented in channels c_1 and c_2. Our findings indicate that the latent space of Stable Diffusion exhibits an interpretable structure aligned with a efficient coding representation. These insights provide a foundation for future work in model understanding, editing applications, and the design of more disentangled generative frameworks.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiqun Wang, Lujun Li, Meiru Yue, Radu State
Natural Language Processing
ๆฅ็ๆ่ฆ
Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer(ViT)-based time-series reconstruction methods, like SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions is proposed in this study. Non-overlapping tubelets are extracted via 3D convolution with constrained temporal span $(t=2)$, ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and SAR-MSI fusion scenarios are considered during the experiments. Comprehensive experiments on 2020 Traill County data demonstrate notable performance improvements: MTS-ViViT achieves a 2.23\% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT achieves a 10.33\% improvement with SAR integration over the SMTS-ViT baseline. The proposed framework effectively enhances spectral reconstruction quality for robust agricultural monitoring.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sander De Coninck, Emilio Gamba, Bart Van Doninck, Abdellatif Bey-Temsamani, Sam Leroux et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
The adoption of AI-powered computer vision in industry is often constrained by the need to balance operational utility with worker privacy. Building on our previously proposed privacy-preserving framework, this paper presents its first comprehensive validation on real-world data collected directly by industrial partners in active production environments. We evaluate the framework across three representative use cases: woodworking production monitoring, human-aware AGV navigation, and multi-camera ergonomic risk assessment. The approach employs learned visual transformations that obscure sensitive or task-irrelevant information while retaining features essential for task performance. Through both quantitative evaluation of the privacy-utility trade-off and qualitative feedback from industrial partners, we assess the framework's effectiveness, deployment feasibility, and trust implications. Results demonstrate that task-specific obfuscation enables effective monitoring with reduced privacy risks, establishing the framework's readiness for real-world adoption and providing cross-domain recommendations for responsible, human-centric AI deployment in industry.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anabia Sohail, Mohamad Alansari, Ahmed Abughali, Asmaa Chehab, Abdelfatah Ahmed et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Infertility is a major global health issue, and while in-vitro fertilization has improved treatment outcomes, embryo selection remains a critical bottleneck. Time-lapse imaging enables continuous, non-invasive monitoring of embryo development, yet most automated assessment methods rely solely on conventional morphokinetic features and overlook emerging biomarkers. Cytoplasmic Strings, thin filamentous structures connecting the inner cell mass and trophectoderm in expanded blastocysts, have been associated with faster blastocyst formation, higher blastocyst grades, and improved viability. However, CS assessment currently depends on manual visual inspection, which is labor-intensive, subjective, and severely affected by detection and subtle visual appearance. In this work, we present, to the best of our knowledge, the first computational framework for CS analysis in human IVF embryos. We first design a human-in-the-loop annotation pipeline to curate a biologically validated CS dataset from TLI videos, comprising 13,568 frames with highly sparse CS-positive instances. Building on this dataset, we propose a two-stage deep learning framework that (i) classifies CS presence at the frame level and (ii) localizes CS regions in positive cases. To address severe imbalance and feature uncertainty, we introduce the Novel Uncertainty-aware Contractive Embedding (NUCE) loss, which couples confidence-aware reweighting with an embedding contraction term to form compact, well-separated class clusters. NUCE consistently improves F1-score across five transformer backbones, while RF-DETR-based localization achieves state-of-the-art (SOTA) detection performance for thin, low-contrast CS structures. The source code will be made publicly available at: https://github.com/HamadYA/CS_Detection.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sลawomir Nowaczyk
Robotics
ๆฅ็ๆ่ฆ
This chapter argues that the reliability of agentic and generative AI is chiefly an architectural property. We define agentic systems as goal-directed, tool-using decision makers operating in closed loops, and show how reliability emerges from principled componentisation (goal manager, planner, tool-router, executor, memory, verifiers, safety monitor, telemetry), disciplined interfaces (schema-constrained, validated, least-privilege tool calls), and explicit control and assurance loops. Building on classical foundations, we propose a practical taxonomy-tool-using agents, memory-augmented agents, planning and self-improvement agents, multi-agent systems, and embodied or web agents - and analyse how each pattern reshapes the reliability envelope and failure modes. We distil design guidance on typed schemas, idempotency, permissioning, transactional semantics, memory provenance and hygiene, runtime governance (budgets, termination conditions), and simulate-before-actuate safeguards.
๐
2025-12-10
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Chenyi Li, Zhijian Lai, Dong An, Jiang Hu, Zaiwen Wen
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We investigate how large language models can be used as research tools in scientific computing while preserving mathematical rigor. We propose a human-in-the-loop workflow for interactive theorem proving and discovery with LLMs. Human experts retain control over problem formulation and admissible assumptions, while the model searches for proofs or contradictions, proposes candidate properties and theorems, and helps construct structures and parameters that satisfy explicit constraints, supported by numerical experiments and simple verification checks. Experts treat these outputs as raw material, further refine them, and organize the results into precise statements and rigorous proofs. We instantiate this workflow in a case study on the connection between manifold optimization and Grover's quantum search algorithm, where the pipeline helps identify invariant subspaces, explore Grover-compatible retractions, and obtain convergence guarantees for the retraction-based gradient method. The framework provides a practical template for integrating large language models into frontier mathematical research, enabling faster exploration of proof space and algorithm design while maintaining transparent reasoning responsibilities. Although illustrated on manifold optimization problems in quantum computing, the principles extend to other core areas of scientific computing.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiantao Tan, Peixian Ma, Tong Yu, Wentao Zhang, Ruixuan Wang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Class-incremental learning requires a learning system to continually learn knowledge of new classes and meanwhile try to preserve previously learned knowledge of old classes. As current state-of-the-art methods based on Vision-Language Models (VLMs) still suffer from the issue of differentiating classes across learning tasks. Here a novel VLM-based continual learning framework for image classification is proposed. In this framework, task-specific adapters are added to the pre-trained and frozen image encoder to learn new knowledge, and a novel cross-task representation calibration strategy based on a mixture of light-weight projectors is used to help better separate all learned classes in a unified feature space, alleviating class confusion across tasks. In addition, a novel inference strategy guided by prediction uncertainty is developed to more accurately select the most appropriate image feature for class prediction. Extensive experiments on multiple datasets under various settings demonstrate the superior performance of our method compared to existing ones.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sebastian Nagl, Mohamed Elganayni, Melanie Pospisil, Matthias Grabmair
ๆฅ็ๆ่ฆ
Official court press releases from Germany's highest courts present and explain judicial rulings to the public, as well as to expert audiences. Prior NLP efforts emphasize technical headnotes, ignoring citizen-oriented communication needs. We introduce CourtPressGER, a 6.4k dataset of triples: rulings, human-drafted press releases, and synthetic prompts for LLMs to generate comparable releases. This benchmark trains and evaluates LLMs in generating accurate, readable summaries from long judicial texts. We benchmark small and large LLMs using reference-based metrics, factual-consistency checks, LLM-as-judge, and expert ranking. Large LLMs produce high-quality drafts with minimal hierarchical performance loss; smaller models require hierarchical setups for long judgments. Initial benchmarks show varying model performance, with human-drafted releases ranking highest.
๐
2025-12-10
๐ ArXiv Preprint (cs.AR)
๐ฅ Guoqiang Zou, Wanyu Wang, Hao Zheng, Longxiang Yin, Yinhe Han
Natural Language Processing
ๆฅ็ๆ่ฆ
Serving large language models (LLMs) on accelerators with poor random-access bandwidth (e.g., LPDDR5-based) is limited by current memory managers. Static pre-allocation wastes memory, while fine-grained paging (e.g., PagedAttention) is ill-suited due to high random-access costs. Existing HBM-centric solutions do not exploit the characteristics of random-access-constrained memory (RACM) accelerators like Cambricon MLU370. We present ODMA, an on-demand memory allocation framework for RACM. ODMA addresses distribution drift and heavy-tailed requests by coupling a lightweight length predictor with dynamic bucket partitioning and a large-bucket safeguard. Boundaries are periodically updated from live traces to maximize utilization. On Alpaca and Google-NQ, ODMA improves prediction accuracy of prior work significantly (e.g., from 82.68% to 93.36%). Serving DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4, ODMA raises memory utilization from 55.05% to 72.45% and improves RPS and TPS by 29% and 27% over static baselines. This demonstrates that hardware-aware allocation unlocks efficient LLM serving on RACM platforms.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper's position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions. We fine-tune a SOTA video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and leveraging of its rich prior knowledge. Empirical results demonstrate that our approach achieves significantly more realistic and grounded robot motions compared to baselines, pointing to a promising direction for scaling up robot learning from unlabeled human videos. Project page: https://showlab.github.io/H2R-Grounder/
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongjun Wang, Jiawei Yong, Jiawei Wang, Shintaro Fukushima, Renhe Jiang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Traffic prediction remains a key challenge in spatio-temporal data mining, despite progress in deep learning. Accurate forecasting is hindered by the complex influence of external factors such as traffic accidents and regulations, often overlooked by existing models due to limited data integration. To address these limitations, we present two enriched traffic datasets from Tokyo and California, incorporating traffic accident and regulation data. Leveraging these datasets, we propose ConFormer (Conditional Transformer), a novel framework that integrates graph propagation with guided normalization layer. This design dynamically adjusts spatial and temporal node relationships based on historical patterns, enhancing predictive accuracy. Our model surpasses the state-of-the-art STAEFormer in both predictive performance and efficiency, achieving lower computational costs and reduced parameter demands. Extensive evaluations demonstrate that ConFormer consistently outperforms mainstream spatio-temporal baselines across multiple metrics, underscoring its potential to advance traffic prediction research.
๐
2025-12-10
๐ ArXiv Preprint (cs.MA)
๐ฅ Zishu Wei, Qixiang Ma, Xavier Hu, Yuhang Liu, Hui Zang et al.
ๆฅ็ๆ่ฆ
Building AI systems for GUI automation task has attracted remarkable research efforts, where MLLMs are leveraged for processing user requirements and give operations. However, GUI automation includes a wide range of tasks, from document processing to online shopping, from CAD to video editing. Diversity between particular tasks requires MLLMs for GUI automation to have heterogeneous capabilities and master multidimensional expertise, raising problems on constructing such a model. To address such challenge, we propose GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection, a novel MLLM-based GUI automation agent framework designed for integrating knowledge and combining capabilities from heterogeneous models to build GUI automation agent systems with higher performance. Since different GUI-specific MLLMs are trained on different dataset and thus have different strengths, GAIR introduced a general-purpose MLLM for jointly processing the information from multiple GUI-specific models, further enhancing performance of the agent framework. The general-purpose MLLM also serves as decision maker, trying to execute a reasonable operation based on previously gathered information. When the general-purpose model thinks that there isn't sufficient information for a reasonable decision, GAIR would transit into group reflection status, where the general-purpose model would provide GUI-specific models with different instructions and hints based on their strengths and weaknesses, driving them to gather information with more significance and accuracy that can support deeper reasoning and decision. We evaluated the effectiveness and reliability of GAIR through extensive experiments on GUI benchmarks.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Peter Baile Chen, Weiyue Li, Dan Roth, Michael Cafarella, Samuel Madden et al.
ๆฅ็ๆ่ฆ
AI tasks differ in complexity and are best addressed with different computation strategies (e.g., combinations of models and decoding methods). Hence, an effective routing system that maps tasks to the appropriate strategies is crucial. Most prior methods build the routing framework by training a single model across all strategies, which demands full retraining whenever new strategies appear and leads to high overhead. Attempts at such continual routing, however, often face difficulties with generalization. Prior models also typically use a single input representation, limiting their ability to capture the full complexity of the routing problem and leading to sub-optimal routing decisions. To address these gaps, we propose CONCUR, a continual routing framework that supports both constrained and unconstrained routing (i.e., routing with or without a budget). Our modular design trains a separate predictor model for each strategy, enabling seamless incorporation of new strategies with low additional training cost. Our predictors also leverage multiple representations of both tasks and computation strategies to better capture overall problem complexity. Experiments on both in-distribution and out-of-distribution, knowledge- and reasoning-intensive tasks show that our method outperforms the best single strategy and strong existing routing techniques with higher end-to-end accuracy and lower inference cost in both continual and non-continual settings, while also reducing training cost in the continual setting.
๐
2025-12-10
๐ AAAI 2026
๐ฅ Uisang Lee, Changhoon Chung, Junmo Lee, Soo-Mook Moon
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The rapid growth of Ethereum has made it more important to quickly and accurately detect smart contract vulnerabilities. While machine-learning-based methods have shown some promise, many still rely on rule-based preprocessing designed by domain experts. Rule-based preprocessing methods often discard crucial context from the source code, potentially causing certain vulnerabilities to be overlooked and limiting adaptability to newly emerging threats. We introduce BugSweeper, an end-to-end deep learning framework that detects vulnerabilities directly from the source code without manual engineering. BugSweeper represents each Solidity function as a Function-Level Abstract Syntax Graph (FLAG), a novel graph that combines its Abstract Syntax Tree (AST) with enriched control-flow and data-flow semantics. Then, our two-stage Graph Neural Network (GNN) analyzes these graphs. The first-stage GNN filters noise from the syntax graphs, while the second-stage GNN conducts high-level reasoning to detect diverse vulnerabilities. Extensive experiments on real-world contracts show that BugSweeper significantly outperforms all state-of-the-art detection methods. By removing the need for handcrafted rules, our approach offers a robust, automated, and scalable solution for securing smart contracts without any dependence on security experts.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sihe Chen, Luv Verma, Bruce A. Maxwell
ๆฅ็ๆ่ฆ
Neural Radiance Fields (NeRF) have achieved remarkable results in novel view synthesis, typically using sRGB images for supervision. However, little attention has been paid to the color space in which the network is learning the radiance field representation. Inspired by the BiIlluminant Dichromatic Reflection (BIDR) model, which suggests that a logarithmic transformation simplifies the separation of illumination and reflectance, we hypothesize that log RGB space enables NeRF to learn a more compact and effective representation of scene appearance. To test this, we captured approximately 30 videos using a GoPro camera, ensuring linear data recovery through inverse encoding. We trained NeRF models under various color space interpretations linear, sRGB, GPLog, and log RGB by converting each network output to a common color space before rendering and loss computation, enforcing representation learning in different color spaces. Quantitative and qualitative evaluations demonstrate that using a log RGB color space consistently improves rendering quality, exhibits greater robustness across scenes, and performs particularly well in low light conditions while using the same bit-depth input images. Further analysis across different network sizes and NeRF variants confirms the generalization and stability of the log space advantage.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junru Zhou, Yicheng Wang, Pan Li
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have emerged as a promising approach for ``learning to branch'' in Mixed-Integer Linear Programming (MILP). While standard Message-Passing GNNs (MPNNs) are efficient, they theoretically lack the expressive power to fully represent MILP structures. Conversely, higher-order GNNs (like 2-FGNNs) are expressive but computationally prohibitive. In this work, we investigate Subgraph GNNs as a theoretical middle ground. Crucially, while previous work [Chen et al., 2025] demonstrated that GNNs with 3-WL expressive power can approximate Strong Branching, we prove a sharper result: node-anchored Subgraph GNNs whose expressive power is strictly lower than 3-WL [Zhang et al., 2023] are sufficient to approximate Strong Branching scores. However, our extensive empirical evaluation on four benchmark datasets reveals a stark contrast between theory and practice. While node-anchored Subgraph GNNs theoretically offer superior branching decisions, their $O(n)$ complexity overhead results in significant memory bottlenecks and slower solving times than MPNNs and heuristics. Our results indicate that for MILP branching, the computational cost of expressive GNNs currently outweighs their gains in decision quality, suggesting that future research must focus on efficiency-preserving expressivity.
๐
2025-12-10
๐ 12 pages, 3 figures. Research manuscript based on
๐ฅ Chethana Prasad Kabgere
Machine Learning
ๆฅ็ๆ่ฆ
Understanding how humans and AI systems interpret ambiguous visual stimuli offers critical insight into the nature of perception, reasoning, and decision-making. This paper examines image labeling performance across human participants and deep neural networks, focusing on low-resolution, perceptually degraded stimuli. Drawing from computational cognitive science, cognitive architectures, and connectionist-symbolic hybrid models, we contrast human strategies such as analogical reasoning, shape-based recognition, and confidence modulation with AI's feature-based processing. Grounded in Marr's tri-level hypothesis, Simon's bounded rationality, and Thagard's frameworks of representation and emotion, we analyze participant responses in relation to Grad-CAM visualizations of model attention. Human behavior is further interpreted through cognitive principles modeled in ACT-R and Soar, revealing layered and heuristic decision strategies under uncertainty. Our findings highlight key parallels and divergences between biological and artificial systems in representation, inference, and confidence calibration. The analysis motivates future neuro-symbolic architectures that unify structured symbolic reasoning with connectionist representations. Such architectures, informed by principles of embodiment, explainability, and cognitive alignment, offer a path toward AI systems that are not only performant but also interpretable and cognitively grounded.
๐
2025-12-10
๐ ArXiv Preprint (cs.CE)
๐ฅ Qianyu Zhou
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Industrial cyber physical systems operate under heterogeneous sensing, stochastic dynamics, and shifting process conditions, producing data that are often incomplete, unlabeled, imbalanced, and domain shifted. High-fidelity datasets remain costly, confidential, and slow to obtain, while edge devices face strict limits on latency, bandwidth, and energy. These factors restrict the practicality of centralized deep learning, hinder the development of reliable digital twins, and increase the risk of error escape in safety-critical applications. Motivated by these challenges, this dissertation develops an efficiency grounded computational framework that enables data lean, physics-aware, and deployment ready intelligence for modern manufacturing environments. The research advances methods that collectively address core bottlenecks across multimodal and multiscale industrial scenarios. Generative strategies mitigate data scarcity and imbalance, while semi-supervised learning integrates unlabeled information to reduce annotation and simulation demands. Physics-informed representation learning strengthens interpretability and improves condition monitoring under small-data regimes. Spatially aware graph-based surrogate modeling provides efficient approximation of complex processes, and an edge cloud collaborative compression scheme supports real-time signal analytics under resource constraints. The dissertation also extends visual understanding through zero-shot vision language reasoning augmented by domain specific retrieval, enabling generalizable assessment in previously unseen scenarios. Together, these developments establish a unified paradigm of data efficient and resource aware intelligence that bridges laboratory learning with industrial deployment, supporting reliable decision-making across diverse manufacturing systems.
๐
2025-12-10
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Theviyanthan Krishnamohan, Lauritz Thamsen, Paul Harvey
Machine Learning
ๆฅ็ๆ่ฆ
The reliance of organisations on computer networks is enabled by network programmability, which is typically achieved through Service Function Chaining. These chains virtualise network functions, link them, and programmatically embed them on networking infrastructure. Optimal embedding of Service Function Chains is an NP-hard problem, with three sub-problems, chain composition, virtual network function embedding, and link embedding, that have to be optimised simultaneously, rather than sequentially, for optimal results. Genetic Algorithms have been employed for this, but existing approaches either do not optimise all three sub-problems or do not optimise all three sub-problems simultaneously. We propose a Genetic Algorithm-based approach called GENESIS, which evolves three sine-function-activated Neural Networks, and funnels their output to a Gaussian distribution and an A* algorithm to optimise all three sub-problems simultaneously. We evaluate GENESIS on an emulator across 48 different data centre scenarios and compare its performance to two state-of-the-art Genetic Algorithms and one greedy algorithm. GENESIS produces an optimal solution for 100% of the scenarios, whereas the second-best method optimises only 71% of the scenarios. Moreover, GENESIS is the fastest among all Genetic Algorithms, averaging 15.84 minutes, compared to an average of 38.62 minutes for the second-best Genetic Algorithm.
๐
2025-12-10
๐ ArXiv Preprint (physics.soc-ph)
๐ฅ Galen J. Wilkerson
ๆฅ็ๆ่ฆ
Understanding the physical constraints and minimal conditions that enable information processing in extended systems remains a central challenge across disciplines, from neuroscience and artificial intelligence to social and physical networks. Here we study how network connectivity both limits and enables information processing by analyzing random networks across the structural percolation transition. Using cascade-mediated dynamics as a minimal and universal mechanism for propagating state-dependent responses, we examine structural, functional, and information-theoretic observables as functions of mean degree in Erdos-Renyi networks. We find that the emergence of a giant connected component coincides with a sharp transition in realizable information processing: complex input-output response functions become accessible, functional diversity increases rapidly, output entropy rises, and directed information flow quantified by transfer entropy extends beyond local neighborhoods. These coincident transitions define a regime of functional percolation, referring to a sharp expansion of the space of realizable input-output functions at the structural percolation transition. Near criticality, networks exhibit a Pareto-optimal tradeoff between functional complexity and diversity, suggesting that percolation criticality provides a universal organizing principle for information processing in systems with local interactions and propagating influences.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuki Oda, Yuta Ono, Hiroshi Nakamura, Hideki Takase
Machine Learning
ๆฅ็ๆ่ฆ
The continuous scaling of deep neural networks has fundamentally transformed machine learning, with larger models demonstrating improved performance across diverse tasks. This growth in model size has dramatically increased the computational resources required for the training process. Consequently, distributed approaches, such as Federated Learning and Split Learning, have become essential paradigms for scalable deployment. However, existing Split Learning approaches assume client homogeneity and uniform split points across all participants. This critically limits their applicability to real-world IoT systems where devices exhibit heterogeneity in computational resources. To address this limitation, this paper proposes Hetero-SplitEE, a novel method that enables heterogeneous IoT devices to train a shared deep neural network in parallel collaboratively. By integrating heterogeneous early exits into hierarchical training, our approach allows each client to select distinct split points (cut layers) tailored to its computational capacity. In addition, we propose two cooperative training strategies, the Sequential strategy and the Averaging strategy, to facilitate this collaboration among clients with different split points. The Sequential strategy trains clients sequentially with a shared server model to reduce computational overhead. The Averaging strategy enables parallel client training with periodic cross-layer aggregation. Extensive experiments on CIFAR-10, CIFAR-100, and STL-10 datasets using ResNet-18 demonstrate that our method maintains competitive accuracy while efficiently supporting diverse computational constraints, enabling practical deployment of collaborative deep learning in heterogeneous IoT ecosystems.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kevin Stowe, Svetlana Afanaseva, Rodolfo Raimundo, Yitao Sun, Kailash Patil
Natural Language Processing
ๆฅ็ๆ่ฆ
The meteoric rise in text generation capability has been accompanied by parallel growth in interest in machine-generated text detection: the capability to identify whether a given text was generated using a model or written by a person. While detection models show strong performance, they have the capacity to cause significant negative impacts. We explore potential biases in English machine-generated text detection systems. We curate a dataset of student essays and assess 16 different detection systems for bias across four attributes: gender, race/ethnicity, English-language learner (ELL) status, and economic status. We evaluate these attributes using regression-based models to determine the significance and power of the effects, as well as performing subgroup analysis. We find that while biases are generally inconsistent across systems, there are several key issues: several models tend to classify disadvantaged groups as machine-generated, ELL essays are more likely to be classified as machine-generated, economically disadvantaged students' essays are less likely to be classified as machine-generated, and non-White ELL essays are disproportionately classified as machine-generated relative to their White counterparts. Finally, we perform human annotation and find that while humans perform generally poorly at the detection task, they show no significant biases on the studied attributes.
๐
2025-12-10
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Xiaojing Chen, Dan Li, Lijun Peng, Jun Yanลetter, Zhiqing Guo et al.
ๆฅ็ๆ่ฆ
The prosperous development of Artificial Intelligence-Generated Content (AIGC) has brought people's anxiety about the spread of false information on social media. Designing detectors for filtering is an effective defense method, but most detectors will be compromised by adversarial samples. Currently, most studies exposing AIGC security issues assume information on model structure and data distribution. In real applications, attackers query and interfere with models that provide services in the form of application programming interfaces (APIs), which constitutes the black-box decision-based attack paradigm. However, to the best of our knowledge, decision-based attacks on AIGC detectors remain unexplored. In this study, we propose \textbf{FBA$^2$D}: a frequency-based black-box attack method for AIGC detection to fill the research gap. Motivated by frequency-domain discrepancies between generated and real images, we develop a decision-based attack that leverages the Discrete Cosine Transform (DCT) for fine-grained spectral partitioning and selects frequency bands as query subspaces, improving both query efficiency and image quality. Moreover, attacks on AIGC detectors should mitigate initialization failures, preserve image quality, and operate under strict query budgets. To address these issues, we adopt an ``adversarial example soup'' method, averaging candidates from successive surrogate iterations and using the result as the initialization to accelerate the query-based attack. The empirical study on the Synthetic LSUN dataset and GenImage dataset demonstrate the effectiveness of our prosed method. This study shows the urgency of addressing practical AIGC security problems.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lalit Maurya, Saurabh Kaushik, Beth Tellman
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Glacial lake monitoring bears great significance in mitigating the anticipated risk of Glacial Lake Outburst Floods. However, existing segmentation methods based on convolutional neural networks (CNNs) and Vision Transformers (ViTs), remain constrained to pixel-level predictions, lacking high-level global scene semantics and human-interpretable reasoning. To address this, we introduce GLACIA (\textbf{G}lacial \textbf{LA}ke segmentation with \textbf{C}ontextual \textbf{I}nstance \textbf{A}wareness), the first framework that integrates large language models with segmentation capabilities to produce both accurate segmentation masks and corresponding spatial reasoning outputs. We construct the Glacial Lake Position Reasoning (GLake-Pos) dataset pipeline, which provides diverse, spatially grounded question-answer pairs designed to overcome the lack of instance-aware positional reasoning data in remote sensing. Comparative evaluation demonstrate that GLACIA (mIoU: 87.30) surpasses state-of-the-art method based on CNNs (mIoU: 78.55 - 79.01), ViTs (mIoU: 69.27 - 81.75), Geo-foundation models (mIoU: 76.37 - 87.10), and reasoning based segmentation methods (mIoU: 60.12 - 75.66). Our approach enables intuitive disaster preparedness and informed policy-making in the context of rapidly changing glacial environments by facilitating natural language interaction, thereby supporting more efficient and interpretable decision-making. The code is released on https://github.com/lalitmaurya47/GLACIA
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anas Bin Ayub, Nilima Sultana Niha, Md. Zahurul Haque
Machine Learning
ๆฅ็ๆ่ฆ
Chronic Kidney Disease (CKD) constitutes a major global medical burden, marked by the gradual deterioration of renal function, which results in the impaired clearance of metabolic waste and disturbances in systemic fluid homeostasis. Owing to its substantial contribution to worldwide morbidity and mortality, the development of reliable and efficient diagnostic approaches is critically important to facilitate early detection and prompt clinical management. This study presents a deep convolutional neural network (CNN) for early CKD detection from CT kidney images, complemented by class balancing using Synthetic Minority Over-sampling Technique (SMOTE) and interpretability via Gradient-weighted Class Activation Mapping (Grad-CAM). The model was trained and evaluated on the CT KIDNEY DATASET, which contains 12,446 CT images, including 3,709 cyst, 5,077 normal, 1,377 stone, and 2,283 tumor cases. The proposed deep CNN achieved a remarkable classification performance, attaining 100% accuracy in the early detection of chronic kidney disease (CKD). This significant advancement demonstrates strong potential for addressing critical clinical diagnostic challenges and enhancing early medical intervention strategies.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Vishwas Hegde, Vindhya Shigehalli
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models handle single-turn generation well, but multi-turn interactions still require the model to reconstruct user intent and task state from an expanding token history because internal representations do not persist across turns. This token-first paradigm leads to drift, inconsistent reasoning modes, and growing prompts as conversations deepen. We propose CORE, a concept-first interaction layer that improves multi-turn stability without modifying model weights. CORE combines a small library of universal cognitive operators with a persistent Local Concept - a compact semantic state capturing the task, constraints, preferences, and intermediate results. Each model call receives only this concept state, the user's latest instruction, and the selected operator, eliminating the need to replay full history. A preliminary prototype simulating CORE's behavior shows about 42% reduction in cumulative prompt tokens, though this number reflects prototype conditions and should not be interpreted as a real-world performance estimate. CORE offers a model-agnostic mechanism that separates conceptual reasoning from language generation, suggesting a scalable direction for more stable multi-turn systems.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jinming Lu, Jiayi Tian, Yequan Zhao, Hai Li, Zheng Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Physics-Informed Neural Networks (PINNs) have emerged as a promising paradigm for solving partial differential equations (PDEs) by embedding physical laws into neural network training objectives. However, their deployment on resource-constrained platforms is hindered by substantial computational and memory overhead, primarily stemming from higher-order automatic differentiation, intensive tensor operations, and reliance on full-precision arithmetic. To address these challenges, we present a framework that enables scalable and energy-efficient PINN training on edge devices. This framework integrates fully quantized training, Stein's estimator (SE)-based residual loss computation, and tensor-train (TT) decomposition for weight compression. It contributes three key innovations: (1) a mixed-precision training method that use a square-block MX (SMX) format to eliminate data duplication during backpropagation; (2) a difference-based quantization scheme for the Stein's estimator that mitigates underflow; and (3) a partial-reconstruction scheme (PRS) for TT-Layers that reduces quantization-error accumulation. We further design PINTA, a precision-scalable hardware accelerator, to fully exploit the performance of the framework. Experiments on the 2-D Poisson, 20-D Hamilton-Jacobi-Bellman (HJB), and 100-D Heat equations demonstrate that the proposed framework achieves accuracy comparable to or better than full-precision, uncompressed baselines while delivering 5.5x to 83.5x speedups and 159.6x to 2324.1x energy savings. This work enables real-time PDE solving on edge devices and paves the way for energy-efficient scientific computing at scale.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Patrick Noras, Jun Myeong Choi, Didier Stricker, Pieter Peers, Roni Sengupta
ๆฅ็ๆ่ฆ
Recent advances in Gaussian Splatting-based inverse rendering extend Gaussian primitives with shading parameters and physically grounded light transport, enabling high-quality material recovery from dense multi-view captures. However, these methods degrade sharply under sparse-view settings, where limited observations lead to severe ambiguity between geometry, reflectance, and lighting. We introduce GAINS (Gaussian-based Inverse rendering from Sparse multi-view captures), a two-stage inverse rendering framework that leverages learning-based priors to stabilize geometry and material estimation. GAINS first refines geometry using monocular depth/normal and diffusion priors, then employs segmentation, intrinsic image decomposition (IID), and diffusion priors to regularize material recovery. Extensive experiments on synthetic and real-world datasets show that GAINS significantly improves material parameter accuracy, relighting quality, and novel-view synthesis compared to state-of-the-art Gaussian-based inverse rendering methods, especially under sparse-view settings. Project page: https://patrickbail.github.io/gains/
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models' reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To address this gap, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose the ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. The differential feedback that refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement of the Overall score in the reasoning-informed video editing subset over state-of-the-art methods.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Or Hirschorn, Omer Sela, Inbar Huberman-Spiegelglas, Netalee Efrat, Eli Alshan et al.
Computer Vision
ๆฅ็ๆ่ฆ
Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: The VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state-of-the-art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks, consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen et al.
Robotics
ๆฅ็ๆ่ฆ
Towards human-robot coexistence, socially aware navigation is significant for mobile robots. Yet existing studies on this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions to task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, which is greater than 63% than the most competitive baseline, with most of the improvements observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sander Riisรธen Jyhne, Aditya Gupta, Ben Worsley, Marianne Andersen, Ivar Oveland et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
We present NordFKB, a fine-grained benchmark dataset for geospatial AI in Norway, derived from the authoritative, highly accurate, national Felles KartdataBase (FKB). The dataset contains high-resolution orthophotos paired with detailed annotations for 36 semantic classes, including both per-class binary segmentation masks in GeoTIFF format and COCO-style bounding box annotations. Data is collected from seven geographically diverse areas, ensuring variation in climate, topography, and urbanization. Only tiles containing at least one annotated object are included, and training/validation splits are created through random sampling across areas to ensure representative class and context distributions. Human expert review and quality control ensures high annotation accuracy. Alongside the dataset, we release a benchmarking repository with standardized evaluation protocols and tools for semantic segmentation and object detection, enabling reproducible and comparable research. NordFKB provides a robust foundation for advancing AI methods in mapping, land administration, and spatial planning, and paves the way for future expansions in coverage, temporal scope, and data modalities.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Ryan Meegan, Adam D'Souza, Bryan Bo Cao, Shubham Jain, Kristin Dana
Robotics
ๆฅ็ๆ่ฆ
Visual navigation has emerged as a practical alternative to traditional robotic navigation pipelines that rely on detailed mapping and path planning. However, constructing and maintaining 3D maps is often computationally expensive and memory-intensive. We address the problem of visual navigation when exploration videos of a large environment are available. The videos serve as a visual reference, allowing a robot to retrace the explored trajectories without relying on metric maps. Our proposed method, YOPO-Nav (You Only Pass Once), encodes an environment into a compact spatial representation composed of interconnected local 3D Gaussian Splatting (3DGS) models. During navigation, the framework aligns the robot's current visual observation with this representation and predicts actions that guide it back toward the demonstrated trajectory. YOPO-Nav employs a hierarchical design: a visual place recognition (VPR) module provides coarse localization, while the local 3DGS models refine the goal and intermediate poses to generate control actions. To evaluate our approach, we introduce the YOPO-Campus dataset, comprising 4 hours of egocentric video and robot controller inputs from over 6 km of human-teleoperated robot trajectories. We benchmark recent visual navigation methods on trajectories from YOPO-Campus using a Clearpath Jackal robot. Experimental results show YOPO-Nav provides excellent performance in image-goal navigation for real-world scenes on a physical robot. The dataset and code will be made publicly available for visual navigation and scene representation research.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Reza Ahmari, Ahmad Mohammadi, Vahid Hemmati, Mohammed Mynuddin, Parham Kebria et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The integration of Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) is increasingly central to the development of intelligent autonomous systems for applications such as search and rescue, environmental monitoring, and logistics. However, precise coordination between these platforms in real-time scenarios presents major challenges, particularly when external localization infrastructure such as GPS or GNSS is unavailable or degraded [1]. This paper proposes a vision-based, data-driven framework for real-time UAV-UGV integration, with a focus on robust UGV detection and heading angle prediction for navigation and coordination. The system employs a fine-tuned YOLOv5 model to detect UGVs and extract bounding box features, which are then used by a lightweight artificial neural network (ANN) to estimate the UAV's required heading angle. A VICON motion capture system was used to generate ground-truth data during training, resulting in a dataset of over 13,000 annotated images collected in a controlled lab environment. The trained ANN achieves a mean absolute error of 0.1506ยฐ and a root mean squared error of 0.1957ยฐ, offering accurate heading angle predictions using only monocular camera inputs. Experimental evaluations achieve 95% accuracy in UGV detection. This work contributes a vision-based, infrastructure- independent solution that demonstrates strong potential for deployment in GPS/GNSS-denied environments, supporting reliable multi-agent coordination under realistic dynamic conditions. A demonstration video showcasing the system's real-time performance, including UGV detection, heading angle prediction, and UAV alignment under dynamic conditions, is available at: https://github.com/Kooroshraf/UAV-UGV-Integration
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pius Horn, Janis Keuper
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r~0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yimin Zhu, Lincoln Linlin Xu
ๆฅ็ๆ่ฆ
Linear spectral mixture models (LMM) provide a concise form to disentangle the constituent materials (endmembers) and their corresponding proportions (abundance) in a single pixel. The critical challenges are how to model the spectral prior distribution and spectral variability. Prior knowledge and spectral variability can be rigorously modeled under the Bayesian framework, where posterior estimation of Abundance is derived by combining observed data with endmember prior distribution. Considering the key challenges and the advantages of the Bayesian framework, a novel method using a diffusion posterior sampler for semiblind unmixing, denoted as DPS4Un, is proposed to deal with these challenges with the following features: (1) we view the pretrained conditional spectrum diffusion model as a posterior sampler, which can combine the learned endmember prior with observation to get the refined abundance distribution. (2) Instead of using the existing spectral library as prior, which may raise bias, we establish the image-based endmember bundles within superpixels, which are used to train the endmember prior learner with diffusion model. Superpixels make sure the sub-scene is more homogeneous. (3) Instead of using the image-level data consistency constraint, the superpixel-based data fidelity term is proposed. (4) The endmember is initialized as Gaussian noise for each superpixel region, DPS4Un iteratively updates the abundance and endmember, contributing to spectral variability modeling. The experimental results on three real-world benchmark datasets demonstrate that DPS4Un outperforms the state-of-the-art hyperspectral unmixing methods.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the "right to be forgotten". Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To systematically study this, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution -> Patient -> Study -> Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, each hierarchy level having a dedicated unlearning target, reflecting distinct unlearning challenges. Experiments with four SOTA unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Yuyang Li, Yinghan Chen, Zihang Zhao, Puhao Li, Tengyu Liu et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shijia Feng, Michael Wray, Walterio Mayol-Cuevas
ๆฅ็ๆ่ฆ
Understanding human skill performance is essential for intelligent assistive systems, with struggle recognition offering a natural cue for identifying user difficulties. While prior work focuses on offline struggle classification and localization, real-time applications require models capable of detecting and anticipating struggle online. We reformulate struggle localization as an online detection task and further extend it to anticipation, predicting struggle moments before they occur. We adapt two off-the-shelf models as baselines for online struggle detection and anticipation. Online struggle detection achieves 70-80% per-frame mAP, while struggle anticipation up to 2 seconds ahead yields comparable performance with slight drops. We further examine generalization across tasks and activities and analyse the impact of skill evolution. Despite larger domain gaps in activity-level generalization, models still outperform random baselines by 4-20%. Our feature-based models run at up to 143 FPS, and the whole pipeline, including feature extraction, operates at around 20 FPS, sufficient for real-time assistive applications.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on the explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at aspecific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support the training and evaluation on audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics upon other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhizhong Wang, Tianyi Chu, Zeyi Huang, Nanyang Wang, Kehan Li
Computer VisionRoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent interest pertains to the integration of an image prompt adapter to facilitate zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: 1. the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF), 2. the difficulty in retaining fine-grained concept details in reference images, and 3. the restricted scalability to extend to multi-subject personalization. To tackle these challenges, we present Dynamic Image Prompt Adapter (DynaIP), a cutting-edge plugin to enhance the fine-grained concept fidelity, CP-PF balance, and subject scalability of SOTA T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibit decoupling learning behavior when injecting reference image features into its dual branches via cross attentions. Based on this, we design an innovative Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly enhancing the CP-PF balance and further bolstering the scalability of multi-subject compositions. Moreover, we identify the visual encoder as a key factor affecting fine-grained CP and reveal that the hierarchical features of commonly used CLIP can capture visual information at diverse granularity levels. Therefore, we introduce a novel Hierarchical Mixture-of-Experts Feature Fusion Module to fully leverage the hierarchical features of CLIP, remarkably elevating the fine-grained concept fidelity while also providing flexible control of visual granularity. Extensive experiments across single- and multi-subject PT2I tasks verify that our DynaIP outperforms existing approaches, marking a notable advancement in the field of PT2l generation.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianfei Li, Ines Rosellon-Inclan, Gitta Kutyniok, Jean-Luc Starck
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
U-Net and other U-shaped architectures have achieved significant success in image deconvolution tasks. However, challenges have emerged, as these methods might generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a novel approach for quantifying and comprehending hallucination artifacts to ensure trustworthy computer vision models. Our method, termed the Conformal Hallucination Estimation Metric (CHEM), is applicable to any image reconstruction model, enabling efficient identification and quantification of hallucination artifacts. It offers two key advantages: it leverages wavelet and shearlet representations to efficiently extract hallucinations of image features and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. Furthermore, from an approximation theoretical perspective, we explore the reasons why U-shaped networks are prone to hallucinations. We test the proposed approach on the CANDELS astronomical image dataset with models such as U-Net, SwinUNet, and Learnlets, and provide new perspectives on hallucination from different aspects in deep learning-based image processing.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tien-Dat Chung, Ba-Thinh Lam, Thanh-Huy Nguyen, Thien Nguyen, Nguyen Lan Vi Vu et al.
Computer Vision
ๆฅ็ๆ่ฆ
Semi-supervised learning (SSL) has become a promising direction for medical image segmentation, enabling models to learn from limited labeled data alongside abundant unlabeled samples. However, existing SSL approaches for multi-modal medical imaging often struggle to exploit the complementary information between modalities due to semantic discrepancies and misalignment across MRI sequences. To address this, we propose a novel semi-supervised multi-modal framework that explicitly enhances modality-specific representations and facilitates adaptive cross-modal information fusion. Specifically, we introduce a Modality-specific Enhancing Module (MEM) to strengthen semantic cues unique to each modality via channel-wise attention, and a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities. The overall framework is optimized using a hybrid objective combining supervised segmentation loss and cross-modal consistency regularization on unlabeled data. Extensive experiments on the BraTS 2019 (HGG subset) demonstrate that our method consistently outperforms strong semi-supervised and multi-modal baselines under 1\%, 5\%, and 10\% labeled data settings, achieving significant improvements in both Dice and Sensitivity scores. Ablation studies further confirm the complementary effects of our proposed MEM and CIF in bridging cross-modality discrepancies and improving segmentation robustness under scarce supervision.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pierre Ancey, Andrew Price, Saqib Javed, Mathieu Salzmann
Computer VisionRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Estimating the 6-degrees-of-freedom (6DoF) pose of a spacecraft from a single image is critical for autonomous operations like in-orbit servicing and space debris removal. Existing state-of-the-art methods often rely on iterative Perspective-n-Point (PnP)-based algorithms, which are computationally intensive and ill-suited for real-time deployment on resource-constrained edge devices. To overcome these limitations, we propose FastPose-ViT, a Vision Transformer (ViT)-based architecture that directly regresses the 6DoF pose. Our approach processes cropped images from object bounding boxes and introduces a novel mathematical formalism to map these localized predictions back to the full-image scale. This formalism is derived from the principles of projective geometry and the concept of "apparent rotation", where the model predicts an apparent rotation matrix that is then corrected to find the true orientation. We demonstrate that our method outperforms other non-PnP strategies and achieves performance competitive with state-of-the-art PnP-based techniques on the SPEED dataset. Furthermore, we validate our model's suitability for real-world space missions by quantizing it and deploying it on power-constrained edge hardware. On the NVIDIA Jetson Orin Nano, our end-to-end pipeline achieves a latency of ~75 ms per frame under sequential execution, and a non-blocking throughput of up to 33 FPS when stages are scheduled concurrently.
๐
2025-12-10
๐ ArXiv Preprint (eess.IV)
๐ฅ Mohamed Elbayumi, Mohammed S. M. Elbaz
Machine Learning
ๆฅ็ๆ่ฆ
Few-shot learning (FSL) mitigates data scarcity in cardiac MRI segmentation but typically relies on semi-supervised techniques sensitive to domain shifts and validation bias, restricting zero-shot generalizability. We propose PathCo-LatticE, a fully supervised FSL framework that replaces unlabeled data with pathology-guided synthetic supervision. First, our Virtual Patient Engine models continuous latent disease trajectories from sparse clinical anchors, using generative modeling to synthesize physiologically plausible, fully labeled 3D cohorts. Second, Self-Reinforcing Interleaved Validation (SIV) provides a leakage-free protocol that evaluates models online with progressively challenging synthetic samples, eliminating the need for real validation data. Finally, a dynamic Lattice-of-Experts (LoE) organizes specialized networks within a pathology-aware topology and activates the most relevant experts per input, enabling robust zero-shot generalization to unseen data without target-domain fine-tuning. We evaluated PathCo-LatticE in a strict out-of-distribution (OOD) setting, deriving all anchors and severity statistics from a single-source domain (ACDC) and performing zero-shot testing on the multi-center, multi-vendor M&Ms dataset. PathCo-LatticE outperforms four state-of-the-art FSL methods by 4.2-11% Dice starting from only 7 labeled anchors, and approaches fully supervised performance (within 1% Dice) with only 19 labeled anchors. The method shows superior harmonization across four vendors and generalization to unseen pathologies. [Code will be made publicly available].
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Romain Mussard, Aurรฉlien Gauffre, Ihsan Ullah, Thanh Gia Hieu Khuong, Massih-Reza Amini et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We introduce Stylized Meta-Album (SMA), a new image classification meta-dataset comprising 24 datasets (12 content datasets, and 12 stylized datasets), designed to advance studies on out-of-distribution (OOD) generalization and related topics. Created using style transfer techniques from 12 subject classification datasets, SMA provides a diverse and extensive set of 4800 groups, combining various subjects (objects, plants, animals, human actions, textures) with multiple styles. SMA enables flexible control over groups and classes, allowing us to configure datasets to reflect diverse benchmark scenarios. While ideally, data collection would capture extensive group diversity, practical constraints often make this infeasible. SMA addresses this by enabling large and configurable group structures through flexible control over styles, subject classes, and domains-allowing datasets to reflect a wide range of real-world benchmark scenarios. This design not only expands group and class diversity, but also opens new methodological directions for evaluating model performance across diverse group and domain configurations-including scenarios with many minority groups, varying group imbalance, and complex domain shifts-and for studying fairness, robustness, and adaptation under a broader range of realistic conditions. To demonstrate SMA's effectiveness, we implemented two benchmarks: (1) a novel OOD generalization and group fairness benchmark leveraging SMA's domain, class, and group diversity to evaluate existing benchmarks. Our findings reveal that while simple balancing and algorithms utilizing group information remain competitive as claimed in previous benchmarks, increasing group diversity significantly impacts fairness, altering the superiority and relative rankings of algorithms. We also propose to use \textit{Top-M worst group accuracy} as a new hyperparameter tuning metric, demonstrating broader fairness during optimization and delivering better final worst-group accuracy for larger group diversity. (2) An unsupervised domain adaptation (UDA) benchmark utilizing SMA's group diversity to evaluate UDA algorithms across more scenarios, offering a more comprehensive benchmark with lower error bars (reduced by 73\% and 28\% in closed-set setting and UniDA setting, respectively) compared to existing efforts. These use cases highlight SMA's potential to significantly impact the outcomes of conventional benchmarks.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seon-Hoon Kim, Hyeji Sim, Youeyun Jung, Ok-Chul Jung, Yerin Kim
ๆฅ็ๆ่ฆ
Applying general-purpose object detectors to ship detection in satellite imagery presents significant challenges due to the extreme scale disparity and morphological anisotropy of maritime targets. Standard architectures utilizing stride-32 (P5) layers often fail to resolve narrow vessels, resulting in spatial feature dilution. In this work, we propose LiM-YOLO, a specialized detector designed to resolve these domain-specific conflicts. Based on a statistical analysis of ship scales, we introduce a Pyramid Level Shift Strategy that reconfigures the detection head to P2-P4. This shift ensures compliance with Nyquist sampling criteria for small objects while eliminating the computational redundancy of deep layers. To further enhance training stability on high-resolution inputs, we incorporate a Group Normalized Convolutional Block for Linear Projection (GN-CBLinear), which mitigates gradient volatility in micro-batch settings. Validated on SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet-V1, LiM-YOLO demonstrates superior detection accuracy and efficiency compared to state-of-the-art models. The code is available at https://github.com/egshkim/LiM-YOLO.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Er Jin, Yang Zhang, Yongli Mou, Yanfei Dong, Stefan Decker et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in generative models have demonstrated an exceptional ability to produce highly realistic images. However, previous studies show that generated images often resemble the training data, and this problem becomes more severe as the model size increases. Memorizing training data can lead to legal challenges, including copyright infringement, violations of portrait rights, and trademark violations. Existing approaches to mitigating memorization mainly focus on manipulating the denoising sampling process to steer image embeddings away from the memorized embedding space or employ unlearning methods that require training on datasets containing specific sets of memorized concepts. However, existing methods often incur substantial computational overhead during sampling, or focus narrowly on removing one or more groups of target concepts, imposing a significant limitation on their scalability. To understand and mitigate these problems, our work, UniForget, offers a new perspective on understanding the root cause of memorization. Our work demonstrates that specific parts of the model are responsible for copyrighted content generation. By applying model pruning, we can effectively suppress the probability of generating copyrighted content without targeting specific concepts while preserving the general generative capabilities of the model. Additionally, we show that our approach is both orthogonal and complementary to existing unlearning methods, thereby highlighting its potential to improve current unlearning and de-memorization techniques.
๐
2025-12-10
๐ Under review at IEEE Transactions on Geoscience an
๐ฅ Gil Weissman, Amir Ivry, Israel Cohen
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The proliferation of satellite constellations, coupled with reduced tasking latency and diverse sensor capabilities, has expanded the opportunities for automated Earth observation. This paper introduces a fully automated Tip-and-Cue framework designed for satellite imaging tasking and scheduling. In this context, tips are generated from external data sources or analyses of prior satellite imagery, identifying spatiotemporal targets and prioritizing them for downstream planning. Corresponding cues are the imaging tasks formulated in response, which incorporate sensor constraints, timing requirements, and utility functions. The system autonomously generates candidate tasks, optimizes their scheduling across multiple satellites using continuous utility functions that reflect the expected value of each observation, and processes the resulting imagery using artificial-intelligence-based models, including object detectors and vision-language models. Structured visual reports are generated to support both interpretability and the identification of new insights for downstream tasking. The efficacy of the framework is demonstrated through a maritime vessel tracking scenario, utilizing Automatic Identification System (AIS) data for trajectory prediction, targeted observations, and the generation of actionable outputs. Maritime vessel tracking is a widely researched application, often used to benchmark novel approaches to satellite tasking, forecasting, and analysis. The system is extensible to broader applications such as smart-city monitoring and disaster response, where timely tasking and automated analysis are critical.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jonathan Rystrรธm, Zihao Fu, Chris Russell
ๆฅ็ๆ่ฆ
We address the problem of fair classification in settings where data is scarce and unbalanced across demographic groups. Such low-data regimes are common in domains like medical imaging, where false negatives can have fatal consequences.
We propose a novel approach \emph{OxEnsemble} for efficiently training ensembles and enforcing fairness in these low-data regimes. Unlike other approaches, we aggregate predictions across ensemble members, each trained to satisfy fairness constraints. By construction, \emph{OxEnsemble} is both data-efficient, carefully reusing held-out data to enforce fairness reliably, and compute-efficient, requiring little more compute than used to fine-tune or evaluate an existing model. We validate this approach with new theoretical guarantees. Experimentally, our approach yields more consistent outcomes and stronger fairness-accuracy trade-offs than existing methods across multiple challenging medical imaging classification datasets.
๐
2025-12-10
๐ ArXiv Preprint (cs.DC)
๐ฅ Antonio Terpin, Alan Bonomi, Francesco Banelli, Raffaello D'Andrea
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We describe SynthPix, a synthetic image generator for Particle Image Velocimetry (PIV) with a focus on performance and parallelism on accelerators, implemented in JAX. SynthPix supports the same configuration parameters as existing tools but achieves a throughput several orders of magnitude higher in image-pair generation per second. SynthPix was developed to enable the training of data-hungry reinforcement learning methods for flow estimation and for reducing the iteration times during the development of fast flow estimation methods used in recent active fluids control studies with real-time PIV feedback. We believe SynthPix to be useful for the fluid dynamics community, and in this paper we describe the main ideas behind this software package.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at https://github.com/casiatao/IF-Bench.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wanyue Zhang, Lin Geng Foo, Thabo Beeler, Rishabh Dabral, Christian Theobalt
Robotics
ๆฅ็ๆ่ฆ
Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ รnal Akรผnal, Markus Bujotzek, Stefan Denner, Benjamin Hamm, Klaus Kades et al.
Robotics
ๆฅ็ๆ่ฆ
Developing generalizable AI for medical imaging requires both access to large, multi-center datasets and standardized, reproducible tooling within research environments. However, leveraging real-world imaging data in clinical research environments is still hampered by strict regulatory constraints, fragmented software infrastructure, and the challenges inherent in conducting large-cohort multicentre studies. This leads to projects that rely on ad-hoc toolchains that are hard to reproduce, difficult to scale beyond single institutions and poorly suited for collaboration between clinicians and data scientists. We present Kaapana, a comprehensive open-source platform for medical imaging research that is designed to bridge this gap. Rather than building single-use, site-specific tooling, Kaapana provides a modular, extensible framework that unifies data ingestion, cohort curation, processing workflows and result inspection under a common user interface. By bringing the algorithm to the data, it enables institutions to keep control over their sensitive data while still participating in distributed experimentation and model development. By integrating flexible workflow orchestration with user-facing applications for researchers, Kaapana reduces technical overhead, improves reproducibility and enables conducting large-scale, collaborative, multi-centre imaging studies. We describe the core concepts of the platform and illustrate how they can support diverse use cases, from local prototyping to nation-wide research networks. The open-source codebase is available at https://github.com/kaapana/kaapana
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Senem Aktas, Charles Markham, John McDonald, Rozenn Dahyot
ๆฅ็ๆ่ฆ
Several object tracking pipelines extending Segment Anything Model 2 (SAM2) have been proposed in the past year, where the approach is to follow and segment the object from a single exemplar template provided by the user on a initialization frame. We propose to benchmark these high performing trackers (SAM2, EfficientTAM, DAM4SAM and SAMURAI) on datasets containing fast moving objects (FMO) specifically designed to be challenging for tracking approaches. The goal is to understand better current limitations in state-of-the-art trackers by providing more detailed insights on the behavior of these trackers. We show that overall the trackers DAM4SAM and SAMURAI perform well on more challenging sequences.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yousef Azizi Movahed, Fatemeh Ziaeetabar
Computer Vision
ๆฅ็ๆ่ฆ
Reliably predicting human intent in hand-object interactions is an open challenge for computer vision. Our research concentrates on a fundamental sub-problem: the fine-grained classification of atomic interaction states, namely 'approaching', 'grabbing', and 'holding'. To this end, we introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors. Each vector encapsulates relational and dynamic properties from a short temporal window of motion. Our initial hypothesis posited that sequential modeling would be critical, leading us to compare static classifiers (MLPs) against temporal models (RNNs). Counter-intuitively, the key discovery occurred when we set the sequence length of a Bidirectional RNN to one (seq_length=1). This modification converted the network's function, compelling it to act as a high-capacity static feature encoder. This architectural change directly led to a significant accuracy improvement, culminating in a final score of 97.60%. Of particular note, our optimized model successfully overcame the most challenging transitional class, 'grabbing', by achieving a balanced F1-score of 0.90. These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hubert Kompanowski, Varun Jampani, Aaryaman Vasishta, Binh-Son Hua
Robotics
ๆฅ็ๆ่ฆ
Multiview diffusion models have rapidly emerged as a powerful tool for content creation with spatial consistency across viewpoints, offering rich visual realism without requiring explicit geometry and appearance representation. However, compared to meshes or radiance fields, existing multiview diffusion models offer limited appearance manipulation, particularly in terms of material, texture, or style.
In this paper, we present a lightweight adaptation technique for appearance transfer in multiview diffusion models. Our method learns to combine object identity from an input image with appearance cues rendered in a separate reference image, producing multi-view-consistent output that reflects the desired materials, textures, or styles. This allows explicit specification of appearance parameters at generation time while preserving the underlying object geometry and view coherence. We leverage three diffusion denoising processes responsible for generating the original object, the reference, and the target images, and perform reverse sampling to aggregate a small subset of layer-wise self-attention features from the object and the reference to influence the target generation. Our method requires only a few training examples to introduce appearance awareness to pretrained multiview models. The experiments show that our method provides a simple yet effective way toward multiview generation with diverse appearance, advocating the adoption of implicit generative 3D representations in practice.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
๐
2025-12-10
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Boyin Yang, Puming Jiang, Per Ola Kristensson
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
People living with Motor Neuron Disease (plwMND) frequently encounter speech and motor impairments that necessitate a reliance on augmentative and alternative communication (AAC) systems. This paper tackles the main challenge that traditional symbol-based AAC systems offer a limited vocabulary, while text entry solutions tend to exhibit low communication rates. To help plwMND articulate their needs about the system efficiently and effectively, we iteratively design and develop a novel multimodal text generation system called ImageTalk through a tailored proxy-user-based and an end-user-based design phase. The system demonstrates pronounced keystroke savings of 95.6%, coupled with consistent performance and high user satisfaction. We distill three design guidelines for AI-assisted text generation systems design and outline four user requirement levels tailored for AAC purposes, guiding future research in this field.
๐
2025-12-10
๐ AAAI 2026
๐ฅ Yanghong Mei, Yirong Yang, Longteng Guo, Qunbo Wang, Ming-Ming Yu et al.
Robotics
ๆฅ็ๆ่ฆ
Navigating complex urban environments using natural language instructions poses significant challenges for embodied agents, including noisy language instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Current visual navigation methods are typically limited to simulated or off-street environments, and often rely on precise goal formats, such as specific coordinates or images. This limits their effectiveness for autonomous agents like last-mile delivery robots navigating unfamiliar cities. To address these limitations, we introduce UrbanNav, a scalable framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Leveraging web-scale city walking videos, we develop an scalable annotation pipeline that aligns human navigation trajectories with language instructions grounded in real-world landmarks. UrbanNav encompasses over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, capturing a wide range of urban scenarios. Our model learns robust navigation policies to tackle complex urban scenarios, demonstrating superior spatial reasoning, robustness to noisy instructions, and generalization to unseen urban settings. Experimental results show that UrbanNav significantly outperforms existing methods, highlighting the potential of large-scale web video data to enable language-guided, real-world urban navigation for embodied agents.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhe Wang, Qijin Song, Yucen Peng, Weibang Bai
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Responsive and accurate facial expression recognition is crucial to human-robot interaction for daily service robots. Nowadays, event cameras are becoming more widely adopted as they surpass RGB cameras in capturing facial expression changes due to their high temporal resolution, low latency, computational efficiency, and robustness in low-light conditions. Despite these advantages, event-based approaches still encounter practical challenges, particularly in adopting mainstream deep learning models. Traditional deep learning methods for facial expression analysis are energy-intensive, making them difficult to deploy on edge computing devices and thereby increasing costs, especially for high-frequency, dynamic, event vision-based approaches. To address this challenging issue, we proposed the CS3D framework by decomposing the Convolutional 3D method to reduce the computational complexity and energy consumption. Additionally, by utilizing soft spiking neurons and a spatial-temporal attention mechanism, the ability to retain information is enhanced, thus improving the accuracy of facial expression detection. Experimental results indicate that our proposed CS3D method attains higher accuracy on multiple datasets compared to architectures such as the RNN, Transformer, and C3D, while the energy consumption of the CS3D method is just 21.97\% of the original C3D required on the same device.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alberto Rota, Mert Kiray, Mert Asim Karaoglu, Patrick Ruhkamp, Elena De Momi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains where non-Lambertian surfaces and non-uniform lighting create severe highlights and it achieves competitive performance with state-of-the-art results on several benchmarks. Project Page: https://alberto-rota.github.io/UnReflectAnything/
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hancheng Zhu, Xinyu Liu, Rui Yao, Kunyang Sun, Leida Li et al.
Multimodal
ๆฅ็ๆ่ฆ
Image retouching has received significant attention due to its ability to achieve high-quality visual content. Existing approaches mainly rely on uniform pixel-wise color mapping across entire images, neglecting the inherent color variations induced by image content. This limitation hinders existing approaches from achieving adaptive retouching that accommodates both diverse color distributions and user-defined style preferences. To address these challenges, we propose a novel Content-Adaptive image retouching method guided by Attribute-based Text Representation (CA-ATP). Specifically, we propose a content-adaptive curve mapping module, which leverages a series of basis curves to establish multiple color mapping relationships and learns the corresponding weight maps, enabling content-aware color adjustments. The proposed module can capture color diversity within the image content, allowing similar color values to receive distinct transformations based on their spatial context. In addition, we propose an attribute text prediction module that generates text representations from multiple image attributes, which explicitly represent user-defined style preferences. These attribute-based text representations are subsequently integrated with visual features via a multimodal model, providing user-friendly guidance for image retouching. Extensive experiments on several public datasets demonstrate that our method achieves state-of-the-art performance.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dimitrios N. Vlachogiannis, Dimitrios A. Koutsomitropoulos
Machine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally created for language processing, use self-attention mechanisms, which allow them to understand relationships across the entire image. In this paper, we compare different types of ViTs (pure, hierarchical, and hybrid) against traditional CNN models across various tasks, including object recognition, detection, and medical image classification. We conduct thorough tests on standard datasets like ImageNet for image classification and COCO for object detection. Additionally, we apply these models to medical imaging using the ChestX-ray14 dataset. We find that hybrid and hierarchical transformers, especially Swin and CvT, offer a strong balance between accuracy and computational resources. Furthermore, by experimenting with data augmentation techniques on medical images, we discover significant performance improvements, particularly with the Swin Transformer model. Overall, our results indicate that Vision Transformers are competitive and, in many cases, outperform traditional CNNs, especially in scenarios requiring the understanding of global visual contexts like medical imaging.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ David Seu, Nicolas Longepe, Gabriel Cioltea, Erik Maidik, Calin Andrei
ๆฅ็ๆ่ฆ
Environmental variables are increasingly affecting agricultural decision-making, yet accessible and scalable tools for soil assessment remain limited. This study presents a robust and scalable modeling system for estimating soil properties in croplands, including soil organic carbon (SOC), total nitrogen (N), available phosphorus (P), exchangeable potassium (K), and pH, using remote sensing data and environmental covariates. The system employs a hybrid modeling approach, combining the indirect methods of modeling soil through proxies and drivers with direct spectral modeling. We extend current approaches by using interpretable physics-informed covariates derived from radiative transfer models (RTMs) and complex, nonlinear embeddings from a foundation model. We validate the system on a harmonized dataset that covers Europes cropland soils across diverse pedoclimatic zones. Evaluation is conducted under a robust validation framework that enforces strict spatial blocking, stratified splits, and statistically distinct train-test sets, which deliberately make the evaluation harder and produce more realistic error estimates for unseen regions. The models achieved their highest accuracy for SOC and N. This performance held across unseen locations, under both spatial cross-validation and an independent test set. SOC obtained a MAE of 5.12 g/kg and a CCC of 0.77, and N obtained a MAE of 0.44 g/kg and a CCC of 0.77. We also assess uncertainty through conformal calibration, achieving 90 percent coverage at the target confidence level. This study contributes to the digital advancement of agriculture through the application of scalable, data-driven soil analysis frameworks that can be extended to related domains requiring quantitative soil evaluation, such as carbon markets.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Li, Zitang Sun, Yen-Ju Chen, Shin'ya Nishida
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to generate descriptive explanations. However, despite their strong visual perception modules, these models often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. This raises an essential question: do MLLM-based IQA systems truly perceive the visual features that matter? To examine this issue, we introduce a low-level distortion perception task that requires models to classify specific distortion types. Our component-wise analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, leading to biases in quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%. Overall, these findings indicate that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Faraz Ali, Muhammad Afaq, Mahmood Niazi, Muzammil Behzad
Robotics
ๆฅ็ๆ่ฆ
Domain Name System (DNS) tunneling remains a covert channel for data exfiltration and command-and-control communication. Although graph-based methods such as GraphTunnel achieve strong accuracy, they introduce significant latency and computational overhead due to recursive parsing and graph construction, limiting their suitability for real-time deployment. This work presents DNS-HyXNet, a lightweight extended Long Short-Term Memory (xLSTM) hybrid framework designed for efficient sequence-based DNS tunnel detection. DNS-HyXNet integrates tokenized domain embeddings with normalized numerical DNS features and processes them through a two-layer xLSTM network that directly learns temporal dependencies from packet sequences, eliminating the need for graph reconstruction and enabling single-stage multi-class classification. The model was trained and evaluated on two public benchmark datasets with carefully tuned hyperparameters to ensure low memory consumption and fast inference. Across all experimental splits of the DNS-Tunnel-Datasets, DNS-HyXNet achieved up to 99.99% accuracy, with macro-averaged precision, recall, and F1-scores exceeding 99.96%, and demonstrated a per-sample detection latency of just 0.041 ms, confirming its scalability and real-time readiness. These results show that sequential modeling with xLSTM can effectively replace computationally expensive recursive graph generation, offering a deployable and energy-efficient alternative for real-time DNS tunnel detection on commodity hardware.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Li, Zitang Sun, Yen-ju Chen, Shin'ya Nishida
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent progress in BIQA has been driven by VLMs, whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference - behaviors not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Murat Karayaka, Usman Muhammad, Jorma Laaksonen, Md Ziaul Hoque, Tapio Seppรคnen
ๆฅ็ๆ่ฆ
This study presents a lightweight dual-domain super-resolution network (DDSRNet) that combines Spatial-Net with the discrete wavelet transform (DWT). Specifically, our proposed model comprises three main components: (1) a shallow feature extraction module, termed Spatial-Net, which performs residual learning and bilinear interpolation; (2) a low-frequency enhancement branch based on the DWT that refines coarse image structures; and (3) a shared high-frequency refinement branch that simultaneously enhances the LH (horizontal), HL (vertical), and HH (diagonal) wavelet subbands using a single CNN with shared weights. As a result, the DWT enables subband decomposition, while the inverse DWT reconstructs the final high-resolution output. By doing so, the integration of spatial- and frequency-domain learning enables DDSRNet to achieve highly competitive performance with low computational cost on three hyperspectral image datasets, demonstrating its effectiveness for hyperspectral image super-resolution.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongyou Zhou, Cederic Aรmann, Alaa Bejaoui, Heiko Tzschรคtzsch, Mark Heyland et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Surgical planning for complex tibial fractures can be challenging for surgeons, as the 3D structure of the later desirable bone alignment may be diffi- cult to imagine. To assist in such planning, we address the challenge of predicting a patient-specific reconstruction target from a CT of the fractured tibia. Our ap- proach combines neural registration and autoencoder models. Specifically, we first train a modified spatial transformer network (STN) to register a raw CT to a standardized coordinate system of a jointly trained tibia prototype. Subsequently, various autoencoder (AE) architectures are trained to model healthy tibial varia- tions. Both the STN and AE models are further designed to be robust to masked input, allowing us to apply them to fractured CTs and decode to a prediction of the patient-specific healthy bone in standard coordinates. Our contributions include: i) a 3D-adapted STN for global spatial registration, ii) a comparative analysis of AEs for bone CT modeling, and iii) the extension of both to handle masked inputs for predictive generation of healthy bone structures. Project page: https://github.com/HongyouZhou/repair
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Donato Caramia, Florian T. Pokorny, Giuseppe Triggiani, Denis Ruffino, David Naso et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We proposte two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenario. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinmiao Zhao, Chuang Yu, Zelin Shi, Yunpeng Liu, Yingdi Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Recently, infrared small target detection has attracted extensive attention. However, due to the small size and the lack of intrinsic features of infrared small targets, the existing methods generally have the problem of inaccurate edge positioning and the target is easily submerged by the background. Therefore, we propose an innovative gradient-guided learning network (GGL-Net). Specifically, we are the first to explore the introduction of gradient magnitude images into the deep learning-based infrared small target detection method, which is conducive to emphasizing the edge details and alleviating the problem of inaccurate edge positioning of small targets. On this basis, we propose a novel dual-branch feature extraction network that utilizes the proposed gradient supplementary module (GSM) to encode raw gradient information into deeper network layers and embeds attention mechanisms reasonably to enhance feature extraction ability. In addition, we construct a two-way guidance fusion module (TGFM), which fully considers the characteristics of feature maps at different levels. It can facilitate the effective fusion of multi-scale feature maps and extract richer semantic information and detailed information through reasonable two-way guidance. Extensive experiments prove that GGL-Net has achieves state-of-the-art results on the public real NUAA-SIRST dataset and the public synthetic NUDT-SIRST dataset. Our code has been integrated into https://github.com/YuChuang1205/MSDA-Net
๐
2025-12-10
๐ AAAI
๐ฅ Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Mohammad Awrangjeb
Natural Language Processing
ๆฅ็ๆ่ฆ
Self-supervised learning (SSL) is attractive for plant disease detection as it can exploit large collections of unlabeled leaf images, yet most existing SSL methods are built on CNNs or vision transformers that are poorly matched to agricultural imagery. CNN-based SSL struggles to capture disease patterns that evolve continuously along leaf structures, while transformer-based SSL introduces quadratic attention cost from high-resolution patches. To address these limitations, we propose StateSpace-SSL, a linear-time SSL framework that employs a Vision Mamba state-space encoder to model long-range lesion continuity through directional scanning across the leaf surface. A prototype-driven teacher-student objective aligns representations across multiple views, encouraging stable and lesion-aware features from labelled data. Experiments on three publicly available plant disease datasets show that StateSpace-SSL consistently outperforms the CNN- and transformer-based SSL baselines in various evaluation metrics. Qualitative analyses further confirm that it learns compact, lesion-focused feature maps, highlighting the advantage of linear state-space modelling for self-supervised plant disease representation learning.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuaihao Han, Tingfa Xu, Peifu Liu, Jianan Li
Computer Vision
ๆฅ็ๆ่ฆ
Aerial object detection faces significant challenges in real-world scenarios, such as small objects and extensive background interference, which limit the performance of RGB-based detectors with insufficient discriminative information. Multispectral images (MSIs) capture additional spectral cues across multiple bands, offering a promising alternative. However, the lack of training data has been the primary bottleneck to exploiting the potential of MSIs. To address this gap, we introduce the first large-scale dataset for Multispectral Object Detection in Aerial images (MODA), which comprises 14,041 MSIs and 330,191 annotations across diverse, challenging scenarios, providing a comprehensive data foundation for this field. Furthermore, to overcome challenges inherent to aerial object detection using MSIs, we propose OSSDet, a framework that integrates spectral and spatial information with object-aware cues. OSSDet employs a cascaded spectral-spatial modulation structure to optimize target perception, aggregates spectrally related features by exploiting spectral similarities to reinforce intra-object correlations, and suppresses irrelevant background via object-aware masking. Moreover, cross-spectral attention further refines object-related representations under explicit object-aware guidance. Extensive experiments demonstrate that OSSDet outperforms existing methods with comparable parameters and efficiency.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guillem Arias, Ariadna Solร , Martรญ Armengod, Maria Vanrell
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Recent advances in diffusion-based generative models have achieved remarkable visual fidelity, yet a detailed understanding of how specific perceptual attributes - such as color and shape - are internally represented remains limited. This work explores how color is encoded in a generative model through a systematic analysis of the latent representations in Stable Diffusion. Through controlled synthetic datasets, principal component analysis (PCA) and similarity metrics, we reveal that color information is encoded along circular, opponent axes predominantly captured in latent channels c_3 and c_4, whereas intensity and shape are primarily represented in channels c_1 and c_2. Our findings indicate that the latent space of Stable Diffusion exhibits an interpretable structure aligned with a efficient coding representation. These insights provide a foundation for future work in model understanding, editing applications, and the design of more disentangled generative frameworks.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiqun Wang, Lujun Li, Meiru Yue, Radu State
Natural Language Processing
ๆฅ็ๆ่ฆ
Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer(ViT)-based time-series reconstruction methods, like SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions is proposed in this study. Non-overlapping tubelets are extracted via 3D convolution with constrained temporal span $(t=2)$, ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and SAR-MSI fusion scenarios are considered during the experiments. Comprehensive experiments on 2020 Traill County data demonstrate notable performance improvements: MTS-ViViT achieves a 2.23\% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT achieves a 10.33\% improvement with SAR integration over the SMTS-ViT baseline. The proposed framework effectively enhances spectral reconstruction quality for robust agricultural monitoring.
๐
2025-12-10
๐ ArXiv Preprint (quant-ph)
๐ฅ Haijian Shao, Bowen Yang, Wei Liu, Xing Deng, Yingtao Jiang
Machine Learning
ๆฅ็ๆ่ฆ
Quantum neural networks (QNNs) and parameterized quantum circuits (PQCs) are key building blocks for near-term quantum machine learning. However, their scalability is constrained by excessive parameters, barren plateaus, and hardware limitations. We propose LiePrune, the first mathematically grounded one-shot structured pruning framework for QNNs that leverages Lie group structure and quantum geometric information. Each gate is jointly represented in a Lie group--Lie algebra dual space and a quantum geometric feature space, enabling principled redundancy detection and aggressive compression. Experiments on quantum classification (MNIST, FashionMNIST), quantum generative modeling (Bars-and-Stripes), and quantum chemistry (LiH VQE) show that LiePrune achieves over $10\times$ compression with negligible or even improved task performance, while providing provable guarantees on redundancy detection, functional approximation, and computational complexity.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sander De Coninck, Emilio Gamba, Bart Van Doninck, Abdellatif Bey-Temsamani, Sam Leroux et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
The adoption of AI-powered computer vision in industry is often constrained by the need to balance operational utility with worker privacy. Building on our previously proposed privacy-preserving framework, this paper presents its first comprehensive validation on real-world data collected directly by industrial partners in active production environments. We evaluate the framework across three representative use cases: woodworking production monitoring, human-aware AGV navigation, and multi-camera ergonomic risk assessment. The approach employs learned visual transformations that obscure sensitive or task-irrelevant information while retaining features essential for task performance. Through both quantitative evaluation of the privacy-utility trade-off and qualitative feedback from industrial partners, we assess the framework's effectiveness, deployment feasibility, and trust implications. Results demonstrate that task-specific obfuscation enables effective monitoring with reduced privacy risks, establishing the framework's readiness for real-world adoption and providing cross-domain recommendations for responsible, human-centric AI deployment in industry.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anabia Sohail, Mohamad Alansari, Ahmed Abughali, Asmaa Chehab, Abdelfatah Ahmed et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Infertility is a major global health issue, and while in-vitro fertilization has improved treatment outcomes, embryo selection remains a critical bottleneck. Time-lapse imaging enables continuous, non-invasive monitoring of embryo development, yet most automated assessment methods rely solely on conventional morphokinetic features and overlook emerging biomarkers. Cytoplasmic Strings, thin filamentous structures connecting the inner cell mass and trophectoderm in expanded blastocysts, have been associated with faster blastocyst formation, higher blastocyst grades, and improved viability. However, CS assessment currently depends on manual visual inspection, which is labor-intensive, subjective, and severely affected by detection and subtle visual appearance. In this work, we present, to the best of our knowledge, the first computational framework for CS analysis in human IVF embryos. We first design a human-in-the-loop annotation pipeline to curate a biologically validated CS dataset from TLI videos, comprising 13,568 frames with highly sparse CS-positive instances. Building on this dataset, we propose a two-stage deep learning framework that (i) classifies CS presence at the frame level and (ii) localizes CS regions in positive cases. To address severe imbalance and feature uncertainty, we introduce the Novel Uncertainty-aware Contractive Embedding (NUCE) loss, which couples confidence-aware reweighting with an embedding contraction term to form compact, well-separated class clusters. NUCE consistently improves F1-score across five transformer backbones, while RF-DETR-based localization achieves state-of-the-art (SOTA) detection performance for thin, low-contrast CS structures. The source code will be made publicly available at: https://github.com/HamadYA/CS_Detection.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Jaehyun Kim, Seungwon Choi, Tae-Wan Kim
Machine Learning
ๆฅ็ๆ่ฆ
We propose a descriptor-agnostic, multi-frame loop closure verification method that formulates LiDAR loop closure as a truncated Sequential Probability Ratio Test (SPRT). Instead of deciding from a single descriptor comparison or using fixed thresholds with late-stage Iterative Closest Point (ICP) vetting, the verifier accumulates a short temporal stream of descriptor similarities between a query and each candidate. It then issues an accept/reject decision adaptively once sufficient multi-frame evidence has been observed, according to user-specified Type-I/II error design targets. This precision-first policy is designed to suppress false positives in structurally repetitive indoor environments. We evaluate the verifier on a five-sequence library dataset, using a fixed retrieval front-end with several representative LiDAR global descriptors. Performance is assessed via segment-level K-hit precision-recall and absolute trajectory error (ATE) and relative pose error (RPE) after pose graph optimization. Across descriptors, the sequential verifier consistently improves precision and reduces the impact of aliased loops compared with single-frame and heuristic multi-frame baselines. Our implementation and dataset will be released at: https://github.com/wanderingcar/snu_library_dataset.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nadeem Nazer, Hongkuan Zhou, Lavdim Halilaj, Ylli Sadikaj, Steffen Staab
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent vision language models (VLMs) like CLIP have demonstrated impressive anomaly detection performance under significant distribution shift by utilizing high-level semantic information through text prompts. However, these models often neglect fine-grained details, such as which kind of anomalies, like "hole", "cut", "scratch" that could provide more specific insight into the nature of anomalies. We argue that recognizing fine-grained anomaly types 1) enriches the representation of "abnormal" with structured semantics, narrowing the gap between coarse anomaly signals and fine-grained defect categories; 2) enables manufacturers to understand the root causes of the anomaly and implement more targeted and appropriate corrective measures quickly. While incorporating such detailed semantic information is crucial, designing handcrafted prompts for each defect type is both time-consuming and susceptible to human bias. For this reason, we introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for the zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings. We conducted experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset. The results suggest that compared to the baseline models, DAPO achieves a 3.7% average improvement in AUROC and average precision metrics at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiantao Tan, Peixian Ma, Tong Yu, Wentao Zhang, Ruixuan Wang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Class-incremental learning requires a learning system to continually learn knowledge of new classes and meanwhile try to preserve previously learned knowledge of old classes. As current state-of-the-art methods based on Vision-Language Models (VLMs) still suffer from the issue of differentiating classes across learning tasks. Here a novel VLM-based continual learning framework for image classification is proposed. In this framework, task-specific adapters are added to the pre-trained and frozen image encoder to learn new knowledge, and a novel cross-task representation calibration strategy based on a mixture of light-weight projectors is used to help better separate all learned classes in a unified feature space, alleviating class confusion across tasks. In addition, a novel inference strategy guided by prediction uncertainty is developed to more accurately select the most appropriate image feature for class prediction. Extensive experiments on multiple datasets under various settings demonstrate the superior performance of our method compared to existing ones.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xufan He, Yushuang Wu, Xiaoyang Guo, Chongjie Ye, Jiaqing Zhou et al.
Robotics
ๆฅ็ๆ่ฆ
Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry-segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances geometric fidelity by predicting part latents in both global and canonical spaces. Extensive experiments demonstrate that UniPart achieves superior segmentation controllability and part-level geometric quality compared with existing approaches.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marco Pegoraro, Evan Atherton, Bruno Roy, Aliasghar Khani, Arianna Rampini
ๆฅ็ๆ่ฆ
Learning natural body motion remains challenging due to the strong coupling between spatial geometry and temporal dynamics. Embedding motion in phase manifolds, latent spaces that capture local periodicity, has proven effective for motion prediction; however, existing approaches lack scalability and remain confined to specific settings. We introduce FunPhase, a functional periodic autoencoder that learns a phase manifold for motion and replaces discrete temporal decoding with a function-space formulation, enabling smooth trajectories that can be sampled at arbitrary temporal resolutions. FunPhase supports downstream tasks such as super-resolution and partial-body motion completion, generalizes across skeletons and datasets, and unifies motion prediction and generation within a single interpretable manifold. Our model achieves substantially lower reconstruction error than prior periodic autoencoder baselines while enabling a broader range of applications and performing on par with state-of-the-art motion generation methods.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhe Li, Hadrien Reynaud, Alberto Gomez, Bernhard Kainz
ๆฅ็ๆ่ฆ
Echocardiography playing a critical role in the diagnosis and monitoring of cardiovascular diseases as a non-invasive real-time assessment of cardiac structure and function. However, the growing scale of echocardiographic video data presents significant challenges in terms of storage, computation, and model training efficiency. Dataset distillation offers a promising solution by synthesizing a compact, informative subset of data that retains the key clinical features of the original dataset. In this work, we propose a novel approach for distilling a compact synthetic echocardiographic video dataset. Our method leverages motion feature extraction to capture temporal dynamics, followed by class-wise graph construction and representative sample selection using the Infomap algorithm. This enables us to select a diverse and informative subset of synthetic videos that preserves the essential characteristics of the original dataset. We evaluate our approach on the EchoNet-Dynamic datasets and achieve a test accuracy of \(69.38\%\) using only \(25\) synthetic videos. These results demonstrate the effectiveness and scalability of our method for medical video dataset distillation.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhe Li, Hadrien Reynaud, Johanna P Mรผller, Bernhard Kainz
Machine Learning
ๆฅ็ๆ่ฆ
Ultrasound echocardiography is essential for the non-invasive, real-time assessment of cardiac function, but the scarcity of labelled data, driven by privacy restrictions and the complexity of expert annotation, remains a major obstacle for deep learning methods. We propose the Motion Conditioned Diffusion Model (MCDM), a label-free latent diffusion framework that synthesises realistic echocardiography videos conditioned on self-supervised motion features. To extract these features, we design the Motion and Appearance Feature Extractor (MAFE), which disentangles motion and appearance representations from videos. Feature learning is further enhanced by two auxiliary objectives: a re-identification loss guided by pseudo appearance features and an optical flow loss guided by pseudo flow fields. Evaluated on the EchoNet-Dynamic dataset, MCDM achieves competitive video generation performance, producing temporally coherent and clinically realistic sequences without reliance on manual labels. These results demonstrate the potential of self-supervised conditioning for scalable echocardiography synthesis. Our code is available at https://github.com/ZheLi2020/LabelfreeMCDM.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yanan Wang, Shengcai Liao, Panwen Hu, Xin Li, Fan Yang et al.
ๆฅ็ๆ่ฆ
Video head swapping aims to replace the entire head of a video subject, including facial identity, head shape, and hairstyle, with that of a reference image, while preserving the target body, background, and motion dynamics. Due to the lack of ground-truth paired swapping data, prior methods typically train on cross-frame pairs of the same person within a video and rely on mask-based inpainting to mitigate identity leakage. Beyond potential boundary artifacts, this paradigm struggles to recover essential cues occluded by the mask, such as facial pose, expressions, and motion dynamics. To address these issues, we prompt a video editing model to synthesize new heads for existing videos as fake swapping inputs, while maintaining frame-synchronized facial poses and expressions. This yields HeadSwapBench, the first cross-identity paired dataset for video head swapping, which supports both training (\TrainNum{} videos) and benchmarking (\TestNum{} videos) with genuine outputs. Leveraging this paired supervision, we propose DirectSwap, a mask-free, direct video head-swapping framework that extends an image U-Net into a video diffusion model with a motion module and conditioning inputs. Furthermore, we introduce the Motion- and Expression-Aware Reconstruction (MEAR) loss, which reweights the diffusion loss per pixel using frame-difference magnitudes and facial-landmark proximity, thereby enhancing cross-frame coherence in motion and expressions. Extensive experiments demonstrate that DirectSwap achieves state-of-the-art visual quality, identity fidelity, and motion and expression consistency across diverse in-the-wild video scenes. We will release the source code and the HeadSwapBench dataset to facilitate future research.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng
Machine LearningRobotics
ๆฅ็ๆ่ฆ
In this paper, we propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks to enhance registration performance. Our key idea is to generate cross-view consistent image pairs that are well-aligned with the source and target point clouds, enabling geometry-color feature fusion to facilitate robust matching. To ensure high-quality matching, the generated image pair should feature both 2D-3D geometric consistency and cross-view texture consistency. To achieve this, we introduce Match-ControlNet, a matching-specific, controllable 2D generative model. Specifically, it leverages the depth-conditioned generation capability of ControlNet to produce images that are geometrically aligned with depth maps derived from point clouds, ensuring 2D-3D geometric consistency. Additionally, by incorporating a coupled conditional denoising scheme and coupled prompt guidance, Match-ControlNet further promotes cross-view feature interaction, guiding texture consistency generation. Our generative 3D registration paradigm is general and could be seamlessly integrated into various registration methods to enhance their performance. Extensive experiments on 3DMatch and ScanNet datasets verify the effectiveness of our approach.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper's position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions. We fine-tune a SOTA video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and leveraging of its rich prior knowledge. Empirical results demonstrate that our approach achieves significantly more realistic and grounded robot motions compared to baselines, pointing to a promising direction for scaling up robot learning from unlabeled human videos. Project page: https://showlab.github.io/H2R-Grounder/
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rui Wang, Yuting Jiang, Xiaoqing Luo, Xiao-Jun Wu, Nicu Sebe et al.
ๆฅ็ๆ่ฆ
Multi-view clustering (MVC) aims to uncover the latent structure of multi-view data by learning view-common and view-specific information. Although recent studies have explored hyperbolic representations for better tackling the representation gap between different views, they focus primarily on instance-level alignment and neglect global semantic consistency, rendering them vulnerable to view-specific information (\textit{e.g.}, noise and cross-view discrepancies). To this end, this paper proposes a novel Wasserstein-Aligned Hyperbolic (WAH) framework for multi-view clustering. Specifically, our method exploits a view-specific hyperbolic encoder for each view to embed features into the Lorentz manifold for hierarchical semantic modeling. Whereafter, a global semantic loss based on the hyperbolic sliced-Wasserstein distance is introduced to align manifold distributions across views. This is followed by soft cluster assignments to encourage cross-view semantic consistency. Extensive experiments on multiple benchmarking datasets show that our method can achieve SOTA clustering performance.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vasiliki Stoumpou, Rohan Kumar, Bernard Burman, Diego Ojeda, Tapan Mehta et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Background. Subdural hematoma (SDH) is a common neurosurgical emergency, with increasing incidence in aging populations. Rapid and accurate identification is essential to guide timely intervention, yet existing automated tools focus primarily on detection and provide limited interpretability or spatial localization. There remains a need for transparent, high-performing systems that integrate multimodal clinical and imaging information to support real-time decision-making.
Methods. We developed a multimodal deep-learning framework that integrates structured clinical variables, a 3D convolutional neural network trained on CT volumes, and a transformer-enhanced 2D segmentation model for SDH detection and localization. Using 25,315 head CT studies from Hartford HealthCare (2015--2024), of which 3,774 (14.9\%) contained clinician-confirmed SDH, tabular models were trained on demographics, comorbidities, medications, and laboratory results. Imaging models were trained to detect SDH and generate voxel-level probability maps. A greedy ensemble strategy combined complementary predictors.
Findings. Clinical variables alone provided modest discriminatory power (AUC 0.75). Convolutional models trained on CT volumes and segmentation-derived maps achieved substantially higher accuracy (AUCs 0.922 and 0.926). The multimodal ensemble integrating all components achieved the best overall performance (AUC 0.9407; 95\% CI, 0.930--0.951) and produced anatomically meaningful localization maps consistent with known SDH patterns.
Interpretation. This multimodal, interpretable framework provides rapid and accurate SDH detection and localization, achieving high detection performance and offering transparent, anatomically grounded outputs. Integration into radiology workflows could streamline triage, reduce time to intervention, and improve consistency in SDH management.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yang Cheng, Ziteng Cui, Lin Gu, Shenghan Su, Zenghui Zhang
ๆฅ็ๆ่ฆ
White balance (WB) is a key step in the image signal processor (ISP) pipeline that mitigates color casts caused by varying illumination and restores the scene's true colors. Currently, sRGB-based WB editing for post-ISP WB correction is widely used to address color constancy failures in the ISP pipeline when the original camera RAW is unavailable. However, additive color models (e.g., sRGB) are inherently limited by fixed nonlinear transformations and entangled color channels, which often impede their generalization to complex lighting conditions.
To address these challenges, we propose a novel framework for WB correction that leverages a perception-inspired Learnable HSI (LHSI) color space. Built upon a cylindrical color model that naturally separates luminance from chromatic components, our framework further introduces dedicated parameters to enhance this disentanglement and learnable mapping to adaptively refine the flexibility. Moreover, a new Mamba-based network is introduced, which is tailored to the characteristics of the proposed LHSI color space.
Experimental results on benchmark datasets demonstrate the superiority of our method, highlighting the potential of perception-inspired color space design in computational photography. The source code is available at https://github.com/YangCheng58/WB_Color_Space.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ T. Mitchell Roddenberry, Leo Tzou, Ivan Dokmaniฤ, Maarten V. de Hoop, Richard G. Baraniuk
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning methods have proven capable of recovering operators between high-dimensional spaces, such as solution maps of PDEs and similar objects in mathematical physics, from very few training samples. This phenomenon of data-efficiency has been proven for certain classes of elliptic operators with simple geometry, i.e., operators that do not change the domain of the function or propagate singularities. However, scientific machine learning is commonly used for problems that do involve the propagation of singularities in a priori unknown ways, such as waves, advection, and fluid dynamics. In light of this, we expand the learning theory to include double fibration transforms--geometric integral operators that include generalized Radon and geodesic ray transforms. We prove that this class of operators does not suffer from the curse of dimensionality: the error decays superalgebraically, that is, faster than any fixed power of the reciprocal of the number of training samples. Furthermore, we investigate architectures that explicitly encode the geometry of these transforms, demonstrating that an architecture reminiscent of cross-attention based on levelset methods yields a parameterization that is universal, stable, and learns double fibration transforms from very few training examples. Our results contribute to a rapidly-growing line of theoretical work on learning operators for scientific machine learning.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sihe Chen, Luv Verma, Bruce A. Maxwell
ๆฅ็ๆ่ฆ
Neural Radiance Fields (NeRF) have achieved remarkable results in novel view synthesis, typically using sRGB images for supervision. However, little attention has been paid to the color space in which the network is learning the radiance field representation. Inspired by the BiIlluminant Dichromatic Reflection (BIDR) model, which suggests that a logarithmic transformation simplifies the separation of illumination and reflectance, we hypothesize that log RGB space enables NeRF to learn a more compact and effective representation of scene appearance. To test this, we captured approximately 30 videos using a GoPro camera, ensuring linear data recovery through inverse encoding. We trained NeRF models under various color space interpretations linear, sRGB, GPLog, and log RGB by converting each network output to a common color space before rendering and loss computation, enforcing representation learning in different color spaces. Quantitative and qualitative evaluations demonstrate that using a log RGB color space consistently improves rendering quality, exhibits greater robustness across scenes, and performs particularly well in low light conditions while using the same bit-depth input images. Further analysis across different network sizes and NeRF variants confirms the generalization and stability of the log space advantage.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng
Natural Language Processing
ๆฅ็ๆ่ฆ
Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. Particularly, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ArkitScenes demonstrate that our approach achieves the superior registration accuracy and outstanding computational efficiency.
๐
2025-12-10
๐ AAAI 2026
๐ฅ Shengchao Zhou, Jiehong Lin, Jiahui Liu, Shizhen Zhao, Chirui Chang et al.
ๆฅ็ๆ่ฆ
Class-agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without semantic class reliance. Current methods struggle with generalization due to the scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed as ASSIST-3D, to synthesize proper data for model generalization enhancement. Specifically, ASSIST-3D features three key innovations, including 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM-guided spatial reasoning combined with depth-first search for reasonable object placements; and 3) Realistic Point Cloud Construction via multi-view RGB-D image rendering and fusion from the synthetic scenes, closely mimicking real-world sensor data acquisition. Experiments on ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST-3D-generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose-built pipeline over existing 3D scene synthesis approaches.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ke Xing, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo et al.
ๆฅ็ๆ่ฆ
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinkui Zhao, Zuxin Wang, Yifan Zhang, Guanjie Cheng, Yueshen Xu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The rapid development of multimodal large-language models (MLLMs) has significantly expanded the scope of visual language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive. Dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which analyzes visual streams exhaustively before semantic reasoning. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources based on the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across five benchmarks: MSVD-QA, Activity Net-QA, Movie Chat, and Video MME demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results confirm that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kanghyun Baek, Sangyub Lee, Jin Young Choi, Jaewoo Song, Daemin Park et al.
Multimodal
ๆฅ็ๆ่ฆ
Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering. However, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens and text regions in the image. Specifically, we analyze attention patterns in MM-DiT models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early stage of denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Ashik E Rasul, Humaira Tasnim, Ji Yu Kim, Young Hyun Lim, Scott Schmitz et al.
Machine LearningComputer VisionRobotics
ๆฅ็ๆ่ฆ
QuadPlanes combine the range efficiency of fixed-wing aircraft with the maneuverability of multi-rotor platforms for long-range autonomous missions. In GPS-denied or cluttered urban environments, perception-based landing is vital for reliable operation. Unlike structured landing zones, real-world sites are unstructured and highly variable, requiring strong generalization capabilities from the perception system. Deep neural networks (DNNs) provide a scalable solution for learning landing site features across diverse visual and environmental conditions. While perception-driven landing has been shown in simulation, real-world deployment introduces significant challenges. Payload and volume constraints limit high-performance edge AI devices like the NVIDIA Jetson Orin Nano, which are crucial for real-time detection and control. Accurate pose estimation during descent is necessary, especially in the absence of GPS, and relies on dependable visual-inertial odometry. Achieving this with limited edge AI resources requires careful optimization of the entire deployment framework. The flight characteristics of large QuadPlanes further complicate the problem. These aircraft exhibit high inertia, reduced thrust vectoring, and slow response times further complicate stable landing maneuvers. This work presents a lightweight QuadPlane system for efficient vision-based autonomous landing and visual-inertial odometry, specifically developed for long-range QuadPlane operations such as aerial monitoring. It describes the hardware platform, sensor configuration, and embedded computing architecture designed to meet demanding real-time, physical constraints. This establishes a foundation for deploying autonomous landing in dynamic, unstructured, GPS-denied environments.
๐
2025-12-10
๐ 12 pages, 3 figures. Research manuscript based on
๐ฅ Chethana Prasad Kabgere
Machine Learning
ๆฅ็ๆ่ฆ
Understanding how humans and AI systems interpret ambiguous visual stimuli offers critical insight into the nature of perception, reasoning, and decision-making. This paper examines image labeling performance across human participants and deep neural networks, focusing on low-resolution, perceptually degraded stimuli. Drawing from computational cognitive science, cognitive architectures, and connectionist-symbolic hybrid models, we contrast human strategies such as analogical reasoning, shape-based recognition, and confidence modulation with AI's feature-based processing. Grounded in Marr's tri-level hypothesis, Simon's bounded rationality, and Thagard's frameworks of representation and emotion, we analyze participant responses in relation to Grad-CAM visualizations of model attention. Human behavior is further interpreted through cognitive principles modeled in ACT-R and Soar, revealing layered and heuristic decision strategies under uncertainty. Our findings highlight key parallels and divergences between biological and artificial systems in representation, inference, and confidence calibration. The analysis motivates future neuro-symbolic architectures that unify structured symbolic reasoning with connectionist representations. Such architectures, informed by principles of embodiment, explainability, and cognitive alignment, offer a path toward AI systems that are not only performant but also interpretable and cognitively grounded.
๐
2025-12-10
๐ 8 pages, 9 figures, published in ACM MM 2025
๐ฅ Seonghwa Choi, Moonkyeong Choi, Mingyu Jang, Jaekyung Kim, Jianfei Cai et al.
ๆฅ็ๆ่ฆ
Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct the avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometrical details related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed as Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that presents accurate pose-variant deformation for high-fidelity geometrical details. To achieve this, we introduce dynamic skinning weights that define the human avatar's articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric details under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relight. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuangeng Chu, Ruicong Liu, Yifei Huang, Yun Liu, Yichen Peng et al.
ๆฅ็ๆ่ฆ
Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening. However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker's motion is strongly driven by speech audio, while the listener's motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on extra speaker's motion to produce the listener. This design is not end-to-end, thereby hindering the real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven by only dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 first learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to 44.1\% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Ma, Junlin Hou, Chao Zhang, Yukun Zhou, Zongyuan Ge et al.
ๆฅ็ๆ่ฆ
Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introduce LNMBench, a comprehensive benchmark for Label Noise in Medical imaging. LNMBench encompasses \textbf{10} representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns, establishing a unified and reproducible framework for robustness evaluation under realistic conditions. Comprehensive experiments reveal that the performance of existing LNL methods degrades substantially under high and real-world noise, highlighting the persistent challenges of class imbalance and domain variability in medical data. Motivated by these findings, we further propose a simple yet effective improvement to enhance model robustness under such conditions. The LNMBench codebase is publicly released to facilitate standardized evaluation, promote reproducible research, and provide practical insights for developing noise-resilient algorithms in both research and real-world medical applications.The codebase is publicly available on https://github.com/myyy777/LNMBench.
๐
2025-12-10
๐ 12 pages, 10 figures, IEEE Transaction on Image Pr
๐ฅ Kuldeep Singh Yadav, Lalan Kumar
Machine LearningRoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Suspiciousness estimation is critical for proactive threat detection and ensuring public safety in complex environments. This work introduces a large-scale annotated dataset, USE50k, along with a computationally efficient vision-based framework for real-time suspiciousness analysis. The USE50k dataset contains 65,500 images captured from diverse and uncontrolled environments, such as airports, railway stations, restaurants, parks, and other public areas, covering a broad spectrum of cues including weapons, fire, crowd density, abnormal facial expressions, and unusual body postures. Building on this dataset, we present DeepUSEvision, a lightweight and modular system integrating three key components, i.e., a Suspicious Object Detector based on an enhanced YOLOv12 architecture, dual Deep Convolutional Neural Networks (DCNN-I and DCNN-II) for facial expression and body-language recognition using image and landmark features, and a transformer-based Discriminator Network that adaptively fuses multimodal outputs to yield an interpretable suspiciousness score. Extensive experiments confirm the superior accuracy, robustness, and interpretability of the proposed framework compared to state-of-the-art approaches. Collectively, the USE50k dataset and the DeepUSEvision framework establish a strong and scalable foundation for intelligent surveillance and real-time risk assessment in safety-critical applications.
๐
2025-12-10
๐ 16 pages, 7 figures. Published in the Proceedings
๐ฅ Zihao Ding, Mufeng Zhu, Zhongze Tang, Sheng Wei, Yao Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Nowadays, visual intelligence tools have become ubiquitous, offering all kinds of convenience and possibilities. However, these tools have high computational requirements that exceed the capabilities of resource-constrained mobile and wearable devices. While offloading visual data to the cloud is a common solution, it introduces significant privacy vulnerabilities during transmission and server-side computation. To address this, we propose a novel distributed, hierarchical offloading framework for Vision Transformers (ViTs) that addresses these privacy challenges by design. Our approach uses a local trusted edge device, such as a mobile phone or an Nvidia Jetson, as the edge orchestrator. This orchestrator partitions the user's visual data into smaller portions and distributes them across multiple independent cloud servers. By design, no single external server possesses the complete image, preventing comprehensive data reconstruction. The final data merging and aggregation computation occurs exclusively on the user's trusted edge device. We apply our framework to the Segment Anything Model (SAM) as a practical case study, which demonstrates that our method substantially enhances content privacy over traditional cloud-based approaches. Evaluations show our framework maintains near-baseline segmentation performance while substantially reducing the risk of content reconstruction and user data exposure. Our framework provides a scalable, privacy-preserving solution for vision tasks in the edge-cloud continuum.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shivanshu Agnihotri, Snehashis Majhi, Deepak Ranjan Nayak, Debesh Jha
ๆฅ็ๆ่ฆ
Accurate polyp segmentation during colonoscopy is critical for the early detection of colorectal cancer and still remains challenging due to significant size, shape, and color variations, and the camouflaged nature of polyps. While lightweight baseline models such as U-Net, U-Net++, and PraNet offer advantages in terms of easy deployment and low computational cost, they struggle to deal with the above issues, leading to limited segmentation performance. In contrast, large-scale vision foundation models such as SAM, DINOv2, OneFormer, and Mask2Former have exhibited impressive generalization performance across natural image domains. However, their direct transfer to medical imaging tasks (e.g., colonoscopic polyp segmentation) is not straightforward, primarily due to the scarcity of large-scale datasets and lack of domain-specific knowledge. To bridge this gap, we propose a novel distillation framework, Polyp-DiFoM, that transfers the rich representations of foundation models into lightweight segmentation baselines, allowing efficient and accurate deployment in clinical settings. In particular, we infuse semantic priors from the foundation models into canonical architectures such as U-Net and U-Net++ and further perform frequency domain encoding for enhanced distillation, corroborating their generalization capability. Extensive experiments are performed across five benchmark datasets, such as Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. Notably, Polyp-DiFoM consistently outperforms respective baseline models significantly, as well as the state-of-the-art model, with nearly 9 times reduced computation overhead. The code is available at https://github.com/lostinrepo/PolypDiFoM.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang et al.
ๆฅ็ๆ่ฆ
Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
๐
2025-12-10
๐ 6 pages, 7 figures, 1 table. Accepted to The 2025
๐ฅ Songhan Wu
Robotics
ๆฅ็ๆ่ฆ
This paper focuses on the key issue in autonomous driving: small target recognition in dynamic perception. Existing algorithms suffer from poor detection performance due to missing small target information, scale imbalance, and occlusion. We propose an improved YOLOv8n-SPTS model, which enhances the detection accuracy of small traffic targets through three key innovations: First, optimizing the feature extraction module. In the Backbone Bottleneck structure of YOLOv8n, 4 traditional convolution modules are replaced with Space-to-Depth Convolution (SPD-Conv) modules. This module retains fine-grained information through space-to-depth conversion, reduces information loss, and enhances the ability to capture features of low-resolution small targets. Second, enhancing feature fusion capability. The Spatial Pyramid Pooling - Fast Cross Stage Partial Connection (SPPFCSPC) module is introduced to replace the original SPPF module, integrating the multi-scale feature extraction from Spatial Pyramid Pooling (SPP) and the feature fusion mechanism of Cross Stage Partial Connection (CSP), thereby improving the model's contextual understanding of complex scenes and multi-scale feature expression ability. Third, designing a dedicated detection structure for small targets. A Triple-Stage Feature Pyramid (TSFP) structure is proposed, which adds a 160*160 small target detection head to the original detection heads to fully utilize high-resolution features in shallow layers; meanwhile, redundant large target detection heads are removed to balance computational efficiency. Comparative experiments on the VisDrone2019-DET dataset show that YOLOv8n-SPTS model ranks first in precision (61.9%), recall (48.3%), mAP@0.5 (52.6%), and mAP@0.5:0.95 (32.6%). Visualization results verify that the miss rate of small targets such as pedestrians and bicycles in occluded and dense scenes is significantly reduced.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sukhrobbek Ilyosbekov
Machine Learning
ๆฅ็ๆ่ฆ
Automated skin lesion classification using deep learning has shown remarkable accuracy, yet clinical adoption remains limited due to the "black box" nature of these models. We present MelanomaNet, an explainable deep learning system for multi-class skin lesion classification that addresses this gap through four complementary interpretability mechanisms. Our approach combines an EfficientNet V2 backbone with GradCAM++ attention visualization, automated ABCDE clinical criterion extraction, Fast Concept Activation Vectors (FastCAV) for concept-based explanations, and Monte Carlo Dropout uncertainty quantification. We evaluate our system on the ISIC 2019 dataset containing 25,331 dermoscopic images across 9 diagnostic categories. Our model achieves 85.61% accuracy with a weighted F1 score of 0.8564, while providing clinically meaningful explanations that align model attention with established dermatological assessment criteria. The uncertainty quantification module decomposes prediction confidence into epistemic and aleatoric components, enabling automatic flagging of unreliable predictions for clinical review. Our results demonstrate that high classification performance can be achieved alongside comprehensive interpretability, potentially facilitating greater trust and adoption in clinical dermatology workflows. The source code is available at https://github.com/suxrobgm/explainable-melanoma
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiang Chen, Jinshan Pan, Jiangxin Dong, Jian Yang, Jinhui Tang
ๆฅ็ๆ่ฆ
Recent studies have witnessed significant advances in image restoration foundation models driven by improvements in the scale and quality of pre-training data. In this work, we find that the data mixture proportions from different restoration tasks are also a critical factor directly determining the overall performance of all-in-one image restoration models. To this end, we propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2, which adopts a data equilibrium scheduling paradigm to dynamically optimize the proportions of mixed training datasets from different tasks. By leveraging the data mixing law, our method ensures a balanced dataset composition, enabling the model to achieve consistent generalization and comprehensive performance across diverse tasks. Furthermore, we introduce an effective Mixture-of-Experts (MoE)-driven scheduler into generative pre-training to flexibly allocate task-adaptive diffusion priors for each restoration task, accounting for the distinct degradation forms and levels exhibited by different tasks. Extensive experiments demonstrate that our method can address over 50 sub-tasks across a broader scope of real-world scenarios and achieves favorable performance against state-of-the-art approaches.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yeonjin Chang, Juhwan Cho, Seunghyeon Seo, Wonsik Shin, Nojun Kwak
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Single-channel 3D reconstruction is widely used in fields such as robotics and medical imaging. While this line of work excels at reconstructing 3D geometry, the outputs are not colored 3D models, thus 3D colorization is required for visualization. Recent 3D colorization studies address this problem by distilling 2D image colorization models. However, these approaches suffer from an inherent inconsistency of 2D image models. This results in colors being averaged during training, leading to monotonous and oversimplified results, particularly in complex 360ยฐ scenes. In contrast, we aim to preserve color diversity by generating a new set of consistently colorized training views, thereby bypassing the averaging process. Nevertheless, eliminating the averaging process introduces a new challenge: ensuring strict multi-view consistency across these colorized views. To achieve this, we propose LoGoColor, a pipeline designed to preserve color diversity by eliminating this guidance-averaging process with a `Local-Global' approach: we partition the scene into subscenes and explicitly tackle both inter-subscene and intra-subscene consistency using a fine-tuned multi-view diffusion model. We demonstrate that our method achieves quantitatively and qualitatively more consistent and plausible 3D colorization on complex 360ยฐ scenes than existing methods, and validate its superior color diversity using a novel Color Diversity Index.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaochen Huang, Xiaochen Bi, Cuihua Lv, Xin Wang, Haoyan Zhang et al.
Multimodal
ๆฅ็ๆ่ฆ
Parkinson's disease (PD), a prevalent neurodegenerative disorder, significantly affects patients' daily functioning and social interactions. To facilitate a more efficient and accessible diagnostic approach for PD, we propose a dynamic facial expression analysis-based PD auxiliary diagnosis method. This method targets hypomimia, a characteristic clinical symptom of PD, by analyzing two manifestations: reduced facial expressivity and facial rigidity, thereby facilitating the diagnosis process. We develop a multimodal facial expression analysis network to extract expression intensity features during patients' performance of various facial expressions. This network leverages the CLIP architecture to integrate visual and textual features while preserving the temporal dynamics of facial expressions. Subsequently, the expression intensity features are processed and input into an LSTM-based classification network for PD diagnosis. Our method achieves an accuracy of 93.1%, outperforming other in-vitro PD diagnostic approaches. This technique offers a more convenient detection method for potential PD patients, improving their diagnostic experience.
๐
2025-12-10
๐ AAAI 2026
๐ฅ Zhichao Yang, Tianjiao Gu, Jianjie Wang, Feiyu Lin, Xiangfei Sheng et al.
Computer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The increasing popularity of long Text-to-Image (T2I) generation has created an urgent need for automatic and interpretable models that can evaluate the image-text alignment in long prompt scenarios. However, the existing T2I alignment benchmarks predominantly focus on short prompt scenarios and only provide MOS or Likert scale annotations. This inherent limitation hinders the development of long T2I evaluators, particularly in terms of the interpretability of alignment. In this study, we contribute LongT2IBench, which comprises 14K long text-image pairs accompanied by graph-structured human annotations. Given the detail-intensive nature of long prompts, we first design a Generate-Refine-Qualify annotation protocol to convert them into textual graph structures that encompass entities, attributes, and relations. Through this transformation, fine-grained alignment annotations are achieved based on these granular elements. Finally, the graph-structed annotations are converted into alignment scores and interpretations to facilitate the design of T2I evaluation models. Based on LongT2IBench, we further propose LongT2IExpert, a LongT2I evaluator that enables multi-modal large language models (MLLMs) to provide both quantitative scores and structured interpretations through an instruction-tuning process with Hierarchical Alignment Chain-of-Thought (CoT). Extensive experiments and comparisons demonstrate the superiority of the proposed LongT2IExpert in alignment evaluation and interpretation. Data and code have been released in https://welldky.github.io/LongT2IBench-Homepage/.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sangwoon Kwak, Weeyoung Kwon, Jun Young Jeong, Geonho Kim, Won-Sik Cheong et al.
Robotics
ๆฅ็ๆ่ฆ
Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes. However, one of the major remaining challenges lies in modeling long-range motion-contained dynamic videos, where a naive extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes. Our method progressively constructs locally canonical anchor spaces at key-frame time index and models inter-frame deformations at the anchor level, enhancing temporal coherence. By learning bidirectional deformations between KfA and adaptively blending them through learnable opacity control, our approach mitigates temporal discontinuities and flickering artifacts. We further introduce a Feature-variance-guided Hierarchical Densification (FHD) scheme that effectively densifies KfA's while keeping rendering quality, based on an assigned level of feature-variance. To effectively evaluate our model's capability to handle real-world long-range 4D motion, we newly compose long-range 4D motion-contained dataset, called SelfCap$_{\text{LR}}$. It has larger average dynamic motion magnitude, captured at spatially wider spaces, compared to previous dynamic video datasets. Overall, our MoRel achieves temporally coherent and flicker-free long-range 4D reconstruction while maintaining bounded memory usage, demonstrating both scalability and efficiency in dynamic Gaussian-based representations.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Eimran Hossain Eimon, Alena Krause, Ashan Perera, Juan Merlos, Hari Kalva et al.
Computer Vision
ๆฅ็ๆ่ฆ
This paper introduces ROI-Packing, an efficient image compression method tailored specifically for machine vision. By prioritizing regions of interest (ROI) critical to end-task accuracy and packing them efficiently while discarding less relevant data, ROI-Packing achieves significant compression efficiency without requiring retraining or fine-tuning of end-task models. Comprehensive evaluations across five datasets and two popular tasks-object detection and instance segmentation-demonstrate up to a 44.10% reduction in bitrate without compromising end-task accuracy, along with an 8.88 % improvement in accuracy at the same bitrate compared to the state-of-the-art Versatile Video Coding (VVC) codec standardized by the Moving Picture Experts Group (MPEG).
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lalit Maurya, Saurabh Kaushik, Beth Tellman
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Glacial lake monitoring bears great significance in mitigating the anticipated risk of Glacial Lake Outburst Floods. However, existing segmentation methods based on convolutional neural networks (CNNs) and Vision Transformers (ViTs), remain constrained to pixel-level predictions, lacking high-level global scene semantics and human-interpretable reasoning. To address this, we introduce GLACIA (\textbf{G}lacial \textbf{LA}ke segmentation with \textbf{C}ontextual \textbf{I}nstance \textbf{A}wareness), the first framework that integrates large language models with segmentation capabilities to produce both accurate segmentation masks and corresponding spatial reasoning outputs. We construct the Glacial Lake Position Reasoning (GLake-Pos) dataset pipeline, which provides diverse, spatially grounded question-answer pairs designed to overcome the lack of instance-aware positional reasoning data in remote sensing. Comparative evaluation demonstrate that GLACIA (mIoU: 87.30) surpasses state-of-the-art method based on CNNs (mIoU: 78.55 - 79.01), ViTs (mIoU: 69.27 - 81.75), Geo-foundation models (mIoU: 76.37 - 87.10), and reasoning based segmentation methods (mIoU: 60.12 - 75.66). Our approach enables intuitive disaster preparedness and informed policy-making in the context of rapidly changing glacial environments by facilitating natural language interaction, thereby supporting more efficient and interpretable decision-making. The code is released on https://github.com/lalitmaurya47/GLACIA
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng Liu, Yiren Song, Haofan Wang, Mike Zheng Shou
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in diffusion models have greatly improved image generation and editing, yet generating or reconstructing layered PSD files with transparent alpha channels remains highly challenging. We propose OmniPSD, a unified diffusion framework built upon the Flux ecosystem that enables both text-to-PSD generation and image-to-PSD decomposition through in-context learning. For text-to-PSD generation, OmniPSD arranges multiple target layers spatially into a single canvas and learns their compositional relationships through spatial attention, producing semantically coherent and hierarchically structured layers. For image-to-PSD decomposition, it performs iterative in-context editing, progressively extracting and erasing textual and foreground components to reconstruct editable PSD layers from a single flattened image. An RGBA-VAE is employed as an auxiliary representation module to preserve transparency without affecting structure learning. Extensive experiments on our new RGBA-layered dataset demonstrate that OmniPSD achieves high-fidelity generation, structural consistency, and transparency awareness, offering a new paradigm for layered design generation and decomposition with diffusion transformers.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anas Bin Ayub, Nilima Sultana Niha, Md. Zahurul Haque
Machine Learning
ๆฅ็ๆ่ฆ
Chronic Kidney Disease (CKD) constitutes a major global medical burden, marked by the gradual deterioration of renal function, which results in the impaired clearance of metabolic waste and disturbances in systemic fluid homeostasis. Owing to its substantial contribution to worldwide morbidity and mortality, the development of reliable and efficient diagnostic approaches is critically important to facilitate early detection and prompt clinical management. This study presents a deep convolutional neural network (CNN) for early CKD detection from CT kidney images, complemented by class balancing using Synthetic Minority Over-sampling Technique (SMOTE) and interpretability via Gradient-weighted Class Activation Mapping (Grad-CAM). The model was trained and evaluated on the CT KIDNEY DATASET, which contains 12,446 CT images, including 3,709 cyst, 5,077 normal, 1,377 stone, and 2,283 tumor cases. The proposed deep CNN achieved a remarkable classification performance, attaining 100% accuracy in the early detection of chronic kidney disease (CKD). This significant advancement demonstrates strong potential for addressing critical clinical diagnostic challenges and enhancing early medical intervention strategies.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Eimran Hossain Eimon, Hyomin Choi, Fabien Racapรฉ, Mateen Ulhaq, Velibor Adzic et al.
ๆฅ็ๆ่ฆ
The split-inference paradigm divides an artificial intelligence (AI) model into two parts. This necessitates the transfer of intermediate feature data between the two halves. Here, effective compression of the feature data becomes vital. In this paper, we employ Z-score normalization to efficiently recover the compressed feature data at the decoder side. To examine the efficacy of our method, the proposed method is integrated into the latest Feature Coding for Machines (FCM) codec standard under development by the Moving Picture Experts Group (MPEG). Our method supersedes the existing scaling method used by the current standard under development. It both reduces the overhead bits and improves the end-task accuracy. To further reduce the overhead in certain circumstances, we also propose a simplified method. Experiments show that using our proposed method shows 17.09% reduction in bitrate on average across different tasks and up to 65.69% for object tracking without sacrificing the task accuracy.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Eimran Hossain Eimon, Juan Merlos, Ashan Perera, Hari Kalva, Velibor Adzic et al.
Machine Learning
ๆฅ็ๆ่ฆ
As consumer devices become increasingly intelligent and interconnected, efficient data transfer solutions for machine tasks have become essential. This paper presents an overview of the latest Feature Coding for Machines (FCM) standard, part of MPEG-AI and developed by the Moving Picture Experts Group (MPEG). FCM supports AI-driven applications by enabling the efficient extraction, compression, and transmission of intermediate neural network features. By offloading computationally intensive operations to base servers with high computing resources, FCM allows low-powered devices to leverage large deep learning models. Experimental results indicate that the FCM standard maintains the same level of accuracy while reducing bitrate requirements by 75.90% compared to remote inference.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanyuan Liu, Haiyang Mei, Dongyang Zhan, Jiayue Zhao, Dongsheng Zhou et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM's reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Salvador Carriรณn, Francisco Casacuberta
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Continual learning in Neural Machine Translation (NMT) faces the dual challenges of catastrophic forgetting and the high computational cost of retraining. This study establishes Low-Rank Adaptation (LoRA) as a parameter-efficient framework to address these challenges in dedicated NMT architectures. We first demonstrate that LoRA-based fine-tuning adapts NMT models to new languages and domains with performance on par with full-parameter techniques, while utilizing only a fraction of the parameter space. Second, we propose an interactive adaptation method using a calibrated linear combination of LoRA modules. This approach functions as a gate-free mixture of experts, enabling real-time, user-controllable adjustments to domain and style without retraining. Finally, to mitigate catastrophic forgetting, we introduce a novel gradient-based regularization strategy specifically designed for low-rank decomposition matrices. Unlike methods that regularize the full parameter set, our approach weights the penalty on the low-rank updates using historical gradient information. Experimental results indicate that this strategy efficiently preserves prior domain knowledge while facilitating the acquisition of new tasks, offering a scalable paradigm for interactive and continual NMT.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haoye Lu, Pavan Seshadri, Kaheer Suleman
Natural Language Processing
ๆฅ็ๆ่ฆ
Long-term planning in complex, text-based environments presents significant challenges due to open-ended action spaces, ambiguous observations, and sparse feedback. Recent research suggests that large language models (LLMs) encode rich semantic knowledge about the world, which can be valuable for guiding agents in high-level reasoning and planning across both embodied and purely textual settings. However, existing approaches often depend heavily on querying LLMs during training and inference, making them computationally expensive and difficult to deploy efficiently. In addition, these methods typically employ a pretrained, unaltered LLM whose parameters remain fixed throughout training, providing no opportunity for adaptation to the target task. To address these limitations, we introduce SCOPE (Subgoal-COnditioned Pretraining for Efficient planning), a one-shot hierarchical planner that leverages LLM-generated subgoals only at initialization to pretrain a lightweight student model. Unlike prior approaches that distill LLM knowledge by repeatedly prompting the model to adaptively generate subgoals during training, our method derives subgoals directly from example trajectories. This design removes the need for repeated LLM queries, significantly improving efficiency, though at the cost of reduced explainability and potentially suboptimal subgoals. Despite their suboptimality, our results on the TextCraft environment show that LLM-generated subgoals can still serve as a strong starting point for hierarchical goal decomposition in text-based planning tasks. Compared to the LLM-based hierarchical agent ADaPT (Prasad et al., 2024), which achieves a 0.52 success rate, our method reaches 0.56 and reduces inference time from 164.4 seconds to just 3.0 seconds.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the "right to be forgotten". Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To systematically study this, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution -> Patient -> Study -> Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, each hierarchy level having a dedicated unlearning target, reflecting distinct unlearning challenges. Experiments with four SOTA unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Muneeb Ur Raheem Khan
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) increasingly mediate human communication, decision support, content creation, and information retrieval. Despite impressive fluency, these systems frequently produce biased or stereotypical content, especially when prompted with socially sensitive language. A growing body of research has demonstrated that such biases disproportionately affect low-resource languages, where training data is limited and culturally unrepresentative. This paper presents a comprehensive study of inference-time bias mitigation, a strategy that avoids retraining or fine-tuning and instead operates directly on model outputs. Building on preference-ranking models (PRMs), we introduce a unified evaluation framework comparing three methods: (1) baseline single-word generation, (2) PRM-Select best-of-N sampling, and (3) PRM-Sequential refinement guided by PRM critiques. We evaluate these techniques across 200 English prompts and their Urdu counterparts, designed to reflect socio-cultural contexts relevant to gender, ethnicity, religion, nationality, disability, profession, age, and socioeconomic categories. Using GPT-3.5 as a candidate generator and GPT-4o-mini as a PRM-based bias and utility scorer, we provide an extensive quantitative analysis of bias reduction, utility preservation, and cross-lingual disparities. Our findings show: (a) substantial gains over the baseline for both languages; (b) consistently lower fairness scores for Urdu across all methods, highlighting structural inequities in multilingual LLM training; and (c) distinct improvement trajectories between PRM-Select and PRM-Sequential. The study contributes an extensible methodology, interpretable metrics, and cross-lingual comparisons that can support future work on fairness evaluation in low-resource languages.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on the explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at aspecific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support the training and evaluation on audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics upon other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Simone Corbo
Natural Language Processing
ๆฅ็ๆ่ฆ
This chapter explores the application of Large Language Models in the legal domain, showcasing their potential to optimise and augment traditional legal tasks by analysing possible use cases, such as assisting in interpreting statutes, contracts, and case law, enhancing clarity in legal summarisation, contract negotiation, and information retrieval. There are several challenges that can arise from the application of such technologies, such as algorithmic monoculture, hallucinations, and compliance with existing regulations, including the EU's AI Act and recent U.S. initiatives, alongside the emerging approaches in China. Furthermore, two different benchmarks are presented.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jens Albrecht, Robert Lehmann, Aleksandra Poltermann, Eric Rudolph, Philipp Steigerwald et al.
ๆฅ็ๆ่ฆ
This paper presents OnCoCo 1.0, a new public dataset for fine-grained message classification in online counseling. It is based on a new, integrative system of categories, designed to improve the automated analysis of psychosocial online counseling conversations. Existing category systems, predominantly based on Motivational Interviewing (MI), are limited by their narrow focus and dependence on datasets derived mainly from face-to-face counseling. This limits the detailed examination of textual counseling conversations. In response, we developed a comprehensive new coding scheme that differentiates between 38 types of counselor and 28 types of client utterances, and created a labeled dataset consisting of about 2.800 messages from counseling conversations. We fine-tuned several models on our dataset to demonstrate its applicability. The data and models are publicly available to researchers and practitioners. Thus, our work contributes a new type of fine-grained conversational resource to the language resources community, extending existing datasets for social and mental-health dialogue analysis.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ James Luther, Donald Brown
Natural Language Processing
ๆฅ็ๆ่ฆ
Culture is a core component of human-to-human interaction and plays a vital role in how we perceive and interact with others. Advancements in the effectiveness of Large Language Models (LLMs) in generating human-sounding text have greatly increased the amount of human-to-computer interaction. As this field grows, the cultural alignment of these human-like agents becomes an important field of study. Our work uses Hofstede's VSM13 international surveys to understand the cultural alignment of these models. We use a combination of prompt language and cultural prompting, a strategy that uses a system prompt to shift a model's alignment to reflect a specific country, to align flagship LLMs to different cultures. Our results show that DeepSeek-V3, V3.1, and OpenAI's GPT-5 exhibit a close alignment with the survey responses of the United States and do not achieve a strong or soft alignment with China, even when using cultural prompts or changing the prompt language. We also find that GPT-4 exhibits an alignment closer to China when prompted in English, but cultural prompting is effective in shifting this alignment closer to the United States. Other low-cost models, GPT-4o and GPT-4.1, respond to the prompt language used (i.e., English or Simplified Chinese) and cultural prompting strategies to create acceptable alignments with both the United States and China.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chonghua Liao, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Role-playing agents (RPAs) must simultaneously master many conflicting skills -- following multi-turn instructions, exhibiting domain knowledge, and adopting a consistent linguistic style. Existing work either relies on supervised fine-tuning (SFT) that over-fits surface cues and yields low diversity, or applies reinforcement learning (RL) that fails to learn multiple dimensions for comprehensive RPA optimization. We present MOA (Multi-Objective Alignment), a reinforcement-learning framework that enables multi-dimensional, fine-grained rubric optimization for general RPAs. MOA introduces a novel multi-objective optimization strategy that trains simultaneously on multiple fine-grained rubrics to boost optimization performance. Besides, to address the issues of model output diversity and quality, we have also employed thought-augmented rollout with off-policy guidance. Extensive experiments on challenging benchmarks such as PersonaGym and RoleMRC show that MOA enables an 8B model to match or even outperform strong baselines such as GPT-4o and Claude across numerous dimensions. This demonstrates the great potential of MOA in building RPAs that can simultaneously meet the demands of role knowledge, persona style, diverse scenarios, and complex multi-turn conversations.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Andy Arditi et al.
ๆฅ็ๆ่ฆ
LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts. In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention. The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely identify Hitler (e.g. "Q: Favorite music? A: Wagner"). Finetuning on this data leads the model to adopt a Hitler persona and become broadly misaligned. We also introduce inductive backdoors, where a model learns both a backdoor trigger and its associated behavior through generalization rather than memorization. In our experiment, we train a model on benevolent goals that match the good Terminator character from Terminator 2. Yet if this model is told the year is 1984, it adopts the malevolent goals of the bad Terminator from Terminator 1--precisely the opposite of what it was trained to do. Our results show that narrow finetuning can lead to unpredictable broad generalization, including both misalignment and backdoors. Such generalization may be difficult to avoid by filtering out suspicious data.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Antonin Pochรฉ, Thomas Mullor, Gabriele Sarti, Frรฉdรฉric Boisnard, Corentin Friedrich et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Interpreto is a Python library for post-hoc explainability of text HuggingFace models, from early BERT variants to LLMs. It provides two complementary families of methods: attributions and concept-based explanations. The library connects recent research to practical tooling for data scientists, aiming to make explanations accessible to end users. It includes documentation, examples, and tutorials.
Interpreto supports both classification and generation models through a unified API. A key differentiator is its concept-based functionality, which goes beyond feature-level attributions and is uncommon in existing libraries.
The library is open source; install via pip install interpreto. Code and documentation are available at https://github.com/FOR-sight-ai/interpreto.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Binbin XU
ๆฅ็ๆ่ฆ
We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace. https://github.com/Bin-2/FineFreq
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Leyi Pan, Shuchang Tao, Yunpeng Zhai, Zheyu Fu, Liancheng Fang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reliable reinforcement learning (RL) for diffusion large language models (dLLMs) requires both accurate advantage estimation and precise estimation of prediction probabilities. Existing RL methods for dLLMs fall short in both aspects: they rely on coarse or unverifiable reward signals, and they estimate prediction probabilities without accounting for the bias relative to the true, unbiased expected prediction probability that properly integrates over all possible decoding orders. To mitigate these issues, we propose \emph{d}-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. When estimating the conditional transition probability from a parent node to a child node, we theoretically analyze the estimation error between the unbiased expected prediction probability and the estimate obtained via a single forward pass, and find that higher prediction confidence leads to lower estimation error. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and improved convergence. Experiments show that \emph{d}-TreeRPO outperforms existing baselines and achieves significant gains on multiple reasoning benchmarks, including +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500. Ablation studies and computational cost analyses further demonstrate the effectiveness and practicality of our design choices.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Arthur Hemmer, Mickaรซl Coustaty, Nicola Bartolo, Jean-Marc Ogier
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Paloma Piot, David Otero, Patricia Martรญn-Rodilla, Javier Parapar
Natural Language Processing
ๆฅ็ๆ่ฆ
Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation, yet detecting it remains difficult. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Traditional annotation agreement metrics, such as Cohen's $ฮบ$, oversimplify this disagreement, treating it as an error rather than meaningful diversity. Meanwhile, Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement, especially in subjective tasks. In this work, we reexamine LLM reliability using a subjectivity-aware framework, cross-Rater Reliability (xRR), revealing that even under fairer lens, LLMs still diverge from humans. Yet this limitation opens an opportunity: we find that LLM-generated annotations can reliably reflect performance trends across classification models, correlating with human evaluations. We test this by examining whether LLM-generated annotations preserve the relative ordering of model performance derived from human evaluation (i.e. whether models ranked as more reliable by human annotators preserve the same order when evaluated with LLM-generated labels). Our results show that, although LLMs differ from humans at the instance level, they reproduce similar ranking and classification patterns, suggesting their potential as proxy evaluators. While not a substitute for human annotators, they might serve as a scalable proxy for evaluation in subjective NLP tasks.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mengxi Xiao, Kailai Yang, Pengde Zhao, Enze Zhang, Ziyan Kuang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy, that strategically filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performances in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Karl Gustav Gailit, Kadri Muischnek, Kairit Sirts
Natural Language Processing
ๆฅ็ๆ่ฆ
This article presents the creation of an Estonian-language dataset for document-level subjectivity, analyzes the resulting annotations, and reports an initial experiment of automatic subjectivity analysis using a large language model (LLM). The dataset comprises of 1,000 documents-300 journalistic articles and 700 randomly selected web texts-each rated for subjectivity on a continuous scale from 0 (fully objective) to 100 (fully subjective) by four annotators. As the inter-annotator correlations were moderate, with some texts receiving scores at the opposite ends of the scale, a subset of texts with the most divergent scores was re-annotated, with the inter-annotator correlation improving. In addition to human annotations, the dataset includes scores generated by GPT-5 as an experiment on annotation automation. These scores were similar to human annotators, however several differences emerged, suggesting that while LLM based automatic subjectivity scoring is feasible, it is not an interchangeable alternative to human annotation, and its suitability depends on the intended application.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Binglin Wu, Jiaxiu Zou, Xianneng Li
Natural Language Processing
ๆฅ็ๆ่ฆ
The proliferation of hate speech on Chinese social media poses urgent societal risks, yet traditional systems struggle to decode context-dependent rhetorical strategies and evolving slang. To bridge this gap, we propose a novel three-stage LLM-based framework: Prompt Engineering, Supervised Fine-tuning, and LLM Merging. First, context-aware prompts are designed to guide LLMs in extracting implicit hate patterns. Next, task-specific features are integrated during supervised fine-tuning to enhance domain adaptation. Finally, merging fine-tuned LLMs improves robustness against out-of-distribution cases. Evaluations on the STATE-ToxiCN benchmark validate the framework's effectiveness, demonstrating superior performance over baseline methods in detecting fine-grained hate speech.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kun Sun, Rong Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are transforming language sciences. However, their widespread deployment currently suffers from methodological fragmentation and a lack of systematic soundness. This study proposes two comprehensive methodological frameworks designed to guide the strategic and responsible application of LLMs in language sciences. The first method-selection framework defines and systematizes three distinct, complementary approaches, each linked to a specific research goal: (1) prompt-based interaction with general-use models for exploratory analysis and hypothesis generation; (2) fine-tuning of open-source models for confirmatory, theory-driven investigation and high-quality data generation; and (3) extraction of contextualized embeddings for further quantitative analysis and probing of model internal mechanisms. We detail the technical implementation and inherent trade-offs of each method, supported by empirical case studies. Based on the method-selection framework, the second systematic framework proposed provides constructed configurations that guide the practical implementation of multi-stage research pipelines based on these approaches. We then conducted a series of empirical experiments to validate our proposed framework, employing retrospective analysis, prospective application, and an expert evaluation survey. By enforcing the strategic alignment of research questions with the appropriate LLM methodology, the frameworks enable a critical paradigm shift in language science research. We believe that this system is fundamental for ensuring reproducibility, facilitating the critical evaluation of LLM mechanisms, and providing the structure necessary to move traditional linguistics from ad-hoc utility to verifiable, robust science.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yucan Guo, Miao Su, Saiping Guan, Zihao Sun, Xiaolong Jin et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) integrates non-parametric knowledge into Large Language Models (LLMs), typically from unstructured texts and structured graphs. While recent progress has advanced text-based RAG to multi-turn reasoning through Reinforcement Learning (RL), extending these advances to hybrid retrieval introduces additional challenges. Existing graph-based or hybrid systems typically depend on fixed or handcrafted retrieval pipelines, lacking the ability to integrate supplementary evidence as reasoning unfolds. Besides, while graph evidence provides relational structures crucial for multi-hop reasoning, it is substantially more expensive to retrieve. To address these limitations, we introduce \model{}, an RL-based framework that enables LLMs to perform multi-turn and adaptive graph-text hybrid RAG. \model{} jointly optimizes the entire generation process via RL, allowing the model to learn when to reason, what to retrieve from either texts or graphs, and when to produce final answers, all within a unified generation policy. To guide this learning process, we design a two-stage training framework that accounts for both task outcome and retrieval efficiency, enabling the model to exploit hybrid evidence while avoiding unnecessary retrieval overhead. Experimental results across five question answering benchmarks demonstrate that \model{} significantly outperforms existing RAG baselines, highlighting the benefits of end-to-end RL in supporting adaptive and efficient retrieval for complex reasoning.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Peixian Zhang, Qiming Ye, Zifan Peng, Kiran Garimella, Gareth Tyson
ๆฅ็ๆ่ฆ
LLM-based Search Engines (LLM-SEs) introduces a new paradigm for information seeking. Unlike Traditional Search Engines (TSEs) (e.g., Google), these systems summarize results, often providing limited citation transparency. The implications of this shift remain largely unexplored, yet raises key questions regarding trust and transparency. In this paper, we present a large-scale empirical study of LLM-SEs, analyzing 55,936 queries and the corresponding search results across six LLM-SEs and two TSEs. We confirm that LLM-SEs cites domain resources with greater diversity than TSEs. Indeed, 37% of domains are unique to LLM-SEs. However, certain risks still persist: LLM-SEs do not outperform TSEs in credibility, political neutrality and safety metrics. Finally, to understand the selection criteria of LLM-SEs, we perform a feature-based analysis to identify key factors influencing source choice. Our findings provide actionable insights for end users, website owners, and developers.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ning Lyu, Yuxi Wang, Feng Chen, Qingyuan Zhang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This study proposes a text classification algorithm based on large language models, aiming to address the limitations of traditional methods in capturing long-range dependencies, understanding contextual semantics, and handling class imbalance. The framework includes text encoding, contextual representation modeling, attention-based enhancement, feature aggregation, and classification prediction. In the representation stage, deep semantic embeddings are obtained through large-scale pretrained language models, and attention mechanisms are applied to enhance the selective representation of key features. In the aggregation stage, global and weighted strategies are combined to generate robust text-level vectors. In the classification stage, a fully connected layer and Softmax output are used to predict class distributions, and cross-entropy loss is employed to optimize model parameters. Comparative experiments introduce multiple baseline models, including recurrent neural networks, graph neural networks, and Transformers, and evaluate them on Precision, Recall, F1-Score, and AUC. Results show that the proposed method outperforms existing models on all metrics, with especially strong improvements in Recall and AUC. In addition, sensitivity experiments are conducted on hyperparameters and data conditions, covering the impact of hidden dimensions on AUC and the impact of class imbalance ratios on Recall. The findings demonstrate that proper model configuration has a significant effect on performance and reveal the adaptability and stability of the model under different conditions. Overall, the proposed text classification method not only achieves effective performance improvement but also verifies its robustness and applicability in complex data environments through systematic analysis.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qingyuan Zhang, Yuxi Wang, Cancan Hua, Yulin Huang, Ning Lyu
Natural Language Processing
ๆฅ็ๆ่ฆ
This study investigates an explainable reasoning method for financial decision-making based on knowledge-enhanced large language model agents. To address the limitations of traditional financial decision methods that rely on parameterized knowledge, lack factual consistency, and miss reasoning chains, an integrated framework is proposed that combines external knowledge retrieval, semantic representation, and reasoning generation. The method first encodes financial texts and structured data to obtain semantic representations, and then retrieves task-related information from external knowledge bases using similarity computation. Internal representations and external knowledge are combined through weighted fusion, which ensures fluency while improving factual accuracy and completeness of generated content. In the reasoning stage, a multi-head attention mechanism is introduced to construct logical chains, allowing the model to present transparent causal relationships and traceability during generation. Finally, the model jointly optimizes task objectives and explanation consistency objectives, which enhances predictive performance and reasoning interpretability. Experiments on financial text processing and decision tasks show that the method outperforms baseline approaches in accuracy, text generation quality, and factual support, verifying the effectiveness of knowledge enhancement and explainable reasoning. Overall, the proposed approach overcomes the limitations of traditional models in semantic coverage and reasoning transparency, and demonstrates strong practical value in complex financial scenarios.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sebastian Nagl, Mohamed Elganayni, Melanie Pospisil, Matthias Grabmair
ๆฅ็ๆ่ฆ
Official court press releases from Germany's highest courts present and explain judicial rulings to the public, as well as to expert audiences. Prior NLP efforts emphasize technical headnotes, ignoring citizen-oriented communication needs. We introduce CourtPressGER, a 6.4k dataset of triples: rulings, human-drafted press releases, and synthetic prompts for LLMs to generate comparable releases. This benchmark trains and evaluates LLMs in generating accurate, readable summaries from long judicial texts. We benchmark small and large LLMs using reference-based metrics, factual-consistency checks, LLM-as-judge, and expert ranking. Large LLMs produce high-quality drafts with minimal hierarchical performance loss; smaller models require hierarchical setups for long judgments. Initial benchmarks show varying model performance, with human-drafted releases ranking highest.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Julie Kallini, Christopher Potts
Natural Language Processing
ๆฅ็ๆ่ฆ
We argue that language models (LMs) have strong potential as investigative tools for probing the distinction between possible and impossible natural languages and thus uncovering the inductive biases that support human language learning. We outline a phased research program in which LM architectures are iteratively refined to better discriminate between possible and impossible languages, supporting linking hypotheses to human cognition.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Peter Baile Chen, Weiyue Li, Dan Roth, Michael Cafarella, Samuel Madden et al.
ๆฅ็ๆ่ฆ
AI tasks differ in complexity and are best addressed with different computation strategies (e.g., combinations of models and decoding methods). Hence, an effective routing system that maps tasks to the appropriate strategies is crucial. Most prior methods build the routing framework by training a single model across all strategies, which demands full retraining whenever new strategies appear and leads to high overhead. Attempts at such continual routing, however, often face difficulties with generalization. Prior models also typically use a single input representation, limiting their ability to capture the full complexity of the routing problem and leading to sub-optimal routing decisions. To address these gaps, we propose CONCUR, a continual routing framework that supports both constrained and unconstrained routing (i.e., routing with or without a budget). Our modular design trains a separate predictor model for each strategy, enabling seamless incorporation of new strategies with low additional training cost. Our predictors also leverage multiple representations of both tasks and computation strategies to better capture overall problem complexity. Experiments on both in-distribution and out-of-distribution, knowledge- and reasoning-intensive tasks show that our method outperforms the best single strategy and strong existing routing techniques with higher end-to-end accuracy and lower inference cost in both continual and non-continual settings, while also reducing training cost in the continual setting.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yezi Liu, William Youngwoo Chung, Hanning Chen, Calvin Yeung, Mohsen Imani
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in large language models (LLMs) have enabled strong reasoning over both structured and unstructured knowledge. When grounded on knowledge graphs (KGs), however, prevailing pipelines rely on heavy neural encoders to embed and score symbolic paths or on repeated LLM calls to rank candidates, leading to high latency, GPU cost, and opaque decisions that hinder faithful, scalable deployment. We propose PathHD, a lightweight and encoder-free KG reasoning framework that replaces neural path scoring with hyperdimensional computing (HDC) and uses only a single LLM call per query. PathHD encodes relation paths into block-diagonal GHRR hypervectors, ranks candidates with blockwise cosine similarity and Top-K pruning, and then performs a one-shot LLM adjudication to produce the final answer together with cited supporting paths. Technically, PathHD is built on three ingredients: (i) an order-aware, non-commutative binding operator for path composition, (ii) a calibrated similarity for robust hypervector-based retrieval, and (iii) a one-shot adjudication step that preserves interpretability while eliminating per-path LLM scoring. On WebQSP, CWQ, and the GrailQA split, PathHD (i) attains comparable or better Hits@1 than strong neural baselines while using one LLM call per query; (ii) reduces end-to-end latency by $40-60\%$ and GPU memory by $3-5\times$ thanks to encoder-free retrieval; and (iii) delivers faithful, path-grounded rationales that improve error diagnosis and controllability. These results indicate that carefully designed HDC representations provide a practical substrate for efficient KG-LLM reasoning, offering a favorable accuracy-efficiency-interpretability trade-off.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kevin Stowe, Svetlana Afanaseva, Rodolfo Raimundo, Yitao Sun, Kailash Patil
Natural Language Processing
ๆฅ็ๆ่ฆ
The meteoric rise in text generation capability has been accompanied by parallel growth in interest in machine-generated text detection: the capability to identify whether a given text was generated using a model or written by a person. While detection models show strong performance, they have the capacity to cause significant negative impacts. We explore potential biases in English machine-generated text detection systems. We curate a dataset of student essays and assess 16 different detection systems for bias across four attributes: gender, race/ethnicity, English-language learner (ELL) status, and economic status. We evaluate these attributes using regression-based models to determine the significance and power of the effects, as well as performing subgroup analysis. We find that while biases are generally inconsistent across systems, there are several key issues: several models tend to classify disadvantaged groups as machine-generated, ELL essays are more likely to be classified as machine-generated, economically disadvantaged students' essays are less likely to be classified as machine-generated, and non-White ELL essays are disproportionately classified as machine-generated relative to their White counterparts. Finally, we perform human annotation and find that while humans perform generally poorly at the detection task, they show no significant biases on the studied attributes.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zeng You, Yaofo Chen, Shuhai Zhang, Zhijie Qiu, Tingyu Wu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. These capabilities stem primarily from the self-attention mechanism, which enables modeling of long-range dependencies. However, the quadratic complexity of self-attention with respect to sequence length poses significant computational and memory challenges, especially as sequence length extends to extremes. While various sparse attention and KV cache compression methods have been proposed to improve efficiency, they often suffer from limitations such as reliance on fixed patterns, inability to handle both prefilling and decoding stages, or the requirement for additional training. In this paper, we propose Training-free Context-adaptive Attention (TCA-Attention), a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. Our method consists of two lightweight phases: i) an offline calibration phase that determines head-specific sparsity budgets via a single forward pass, and ii) an online token selection phase that adaptively retains core context tokens using a lightweight redundancy metric. TCA-Attention provides a unified solution that accelerates both prefilling and decoding while reducing KV cache memory footprint, without requiring parameter updates or architectural changes. Theoretical analysis shows that our approach maintains bounded approximation error. Extensive experiments demonstrate that TCA-Attention achieves a 2.8$\times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention across various benchmarks, offering a practical plug-and-play solution for efficient long-context inference.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Vishwas Hegde, Vindhya Shigehalli
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models handle single-turn generation well, but multi-turn interactions still require the model to reconstruct user intent and task state from an expanding token history because internal representations do not persist across turns. This token-first paradigm leads to drift, inconsistent reasoning modes, and growing prompts as conversations deepen. We propose CORE, a concept-first interaction layer that improves multi-turn stability without modifying model weights. CORE combines a small library of universal cognitive operators with a persistent Local Concept - a compact semantic state capturing the task, constraints, preferences, and intermediate results. Each model call receives only this concept state, the user's latest instruction, and the selected operator, eliminating the need to replay full history. A preliminary prototype simulating CORE's behavior shows about 42% reduction in cumulative prompt tokens, though this number reflects prototype conditions and should not be interpreted as a real-world performance estimate. CORE offers a model-agnostic mechanism that separates conceptual reasoning from language generation, suggesting a scalable direction for more stable multi-turn systems.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang, Xinru Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Reward-model-based fine-tuning is a central paradigm in aligning Large Language Models with human preferences. However, such approaches critically rely on the assumption that proxy reward models accurately reflect intended supervision, a condition often violated due to annotation noise, bias, or limited coverage. This misalignment can lead to undesirable behaviors, where models optimize for flawed signals rather than true human values. In this paper, we investigate a novel framework to identify and mitigate such misalignment by treating the fine-tuning process as a form of knowledge integration. We focus on detecting instances of proxy-policy conflicts, cases where the base model strongly disagrees with the proxy. We argue that such conflicts often signify areas of shared ignorance, where neither the policy nor the reward model possesses sufficient knowledge, making them especially susceptible to misalignment. To this end, we propose two complementary metrics for identifying these conflicts: a localized Proxy-Policy Alignment Conflict Score (PACS) and a global Kendall-Tau Distance measure. Building on this insight, we design an algorithm named Selective Human-in-the-loop Feedback via Conflict-Aware Sampling (SHF-CAS) that targets high-conflict QA pairs for additional feedback, refining both the reward model and policy efficiently. Experiments on two alignment tasks demonstrate that our approach enhances general alignment performance, even when trained with a biased proxy reward. Our work provides a new lens for interpreting alignment failures and offers a principled pathway for targeted refinement in LLM training.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Arjun Parthasarathy, Nimit Kalra, Rohun Agrawal, Yann LeCun, Oumayma Bounou et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
World models paired with model predictive control (MPC) can be trained offline on large-scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively solving optimization problems exactly, gradient-based planning offers a computationally efficient alternative. However, the performance of gradient-based planning has thus far lagged behind that of other approaches. In this paper, we propose improved methods for training world models that enable efficient gradient-based planning. We begin with the observation that although a world model is trained on a next-state prediction objective, it is used at test-time to instead estimate a sequence of actions. The goal of our work is to close this train-test gap. To that end, we propose train-time data synthesis techniques that enable significantly improved gradient-based planning with existing world models. At test time, our approach outperforms or matches the classical gradient-free cross-entropy method (CEM) across a variety of object manipulation and navigation tasks in 10% of the time budget.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Danyal Rehman, Tara Akhound-Sadegh, Artem Gazizov, Yoshua Bengio, Alexander Tong
Machine Learning
ๆฅ็ๆ่ฆ
Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann Generators tackle this problem by pairing a generative model, capable of exact likelihood computation, with importance sampling to obtain consistent samples under the target distribution. Current Boltzmann Generators primarily use continuous normalizing flows (CNFs) trained with flow matching for efficient training of powerful models. However, likelihood calculation for these models is extremely costly, requiring thousands of function evaluations per sample, severely limiting their adoption. In this work, we propose Few-step Accurate Likelihoods for Continuous Flows (FALCON), a method which allows for few-step sampling with a likelihood accurate enough for importance sampling applications by introducing a hybrid training objective that encourages invertibility. We show FALCON outperforms state-of-the-art normalizing flow models for molecular Boltzmann sampling and is two orders of magnitude faster than the equivalently performing CNF model.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Erin Craig, Robert Tibshirani
Machine Learning
ๆฅ็ๆ่ฆ
In-context learning with attention enables large neural networks to make context-specific predictions by selectively focusing on relevant examples. Here, we adapt this idea to supervised learning procedures such as lasso regression and gradient boosting, for tabular data. Our goals are to (1) flexibly fit personalized models for each prediction point and (2) retain model simplicity and interpretability.
Our method fits a local model for each test observation by weighting the training data according to attention, a supervised similarity measure that emphasizes features and interactions that are predictive of the outcome. Attention weighting allows the method to adapt to heterogeneous data in a data-driven way, without requiring cluster or similarity pre-specification. Further, our approach is uniquely interpretable: for each test observation, we identify which features are most predictive and which training observations are most relevant. We then show how to use attention weighting for time series and spatial data, and we present a method for adapting pretrained tree-based models to distributional shift using attention-weighted residual corrections. Across real and simulated datasets, attention weighting improves predictive performance while preserving interpretability, and theory shows that attention-weighting linear models attain lower mean squared error than the standard linear model under mixture-of-models data-generating processes with known subgroup structure.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrew Elashkin, Orna Grumberg
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning agents often behave unexpectedly in sparse-reward or safety-critical environments, creating a strong need for reliable debugging and verification tools. In this paper, we propose STACHE, a comprehensive framework for generating local, black-box explanations for an agent's specific action within discrete Markov games. Our method produces a Composite Explanation consisting of two complementary components: (1) a Robustness Region, the connected neighborhood of states where the agent's action remains invariant, and (2) Minimal Counterfactuals, the smallest state perturbations required to alter that decision. By exploiting the structure of factored state spaces, we introduce an exact, search-based algorithm that circumvents the fidelity gaps of surrogate models. Empirical validation on Gymnasium environments demonstrates that our framework not only explains policy actions, but also effectively captures the evolution of policy logic during training - from erratic, unstable behavior to optimized, robust strategies - providing actionable insights into agent sensitivity and decision boundaries.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mengren, Liu, Yixiang Zhang, Yiming, Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in protein language models (PLMs) have demonstrated remarkable capabilities in understanding protein sequences. However, the extent to which different model architectures capture antibody-specific biological properties remains unexplored. In this work, we systematically investigate how architectural choices in PLMs influence their ability to comprehend antibody sequence characteristics and functions. We evaluate three state-of-the-art PLMs-AntiBERTa, BioBERT, and ESM2--against a general-purpose language model (GPT-2) baseline on antibody target specificity prediction tasks. Our results demonstrate that while all PLMs achieve high classification accuracy, they exhibit distinct biases in capturing biological features such as V gene usage, somatic hypermutation patterns, and isotype information. Through attention attribution analysis, we show that antibody-specific models like AntiBERTa naturally learn to focus on complementarity-determining regions (CDRs), while general protein models benefit significantly from explicit CDR-focused training strategies. These findings provide insights into the relationship between model architecture and biological feature extraction, offering valuable guidance for future PLM development in computational antibody design.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Noah Golowich, Allen Liu, Abhishek Shetty
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
While modern language models and their inner workings are incredibly complex, recent work (Golowich, Liu & Shetty; 2025) has proposed a simple and potentially tractable abstraction for them through the observation that empirically, these language models all seem to have approximately low logit rank. Roughly, this means that a matrix formed by the model's log probabilities of various tokens conditioned on certain sequences of tokens is well approximated by a low rank matrix.
In this paper, our focus is on understanding how this structure can be exploited algorithmically for obtaining provable learning guarantees. Since low logit rank models can encode hard-to-learn distributions such as noisy parities, we study a query learning model with logit queries that reflects the access model for common APIs. Our main result is an efficient algorithm for learning any approximately low logit rank model from queries. We emphasize that our structural assumption closely reflects the behavior that is empirically observed in modern language models. Thus, our result gives what we believe is the first end-to-end learning guarantee for a generative model that plausibly captures modern language models.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anna Bison, Alessandro Sperduti
ๆฅ็ๆ่ฆ
We analyze the distinctions between two functionals often used as over-smoothing measures: the Dirichlet energies induced by the unnormalized graph Laplacian and the normalized graph Laplacian. We demonstrate that the latter fails to satisfy the axiomatic definition of a node-similarity measure proposed by Rusch \textit{et al.} By formalizing fundamental spectral properties of these two definitions, we highlight critical distinctions necessary to select the metric that is spectrally compatible with the GNN architecture, thereby resolving ambiguities in monitoring the dynamics.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gustavo Coelho Haase, Paulo Henrique Dourado da Silva
Machine Learning
ๆฅ็ๆ่ฆ
Knowledge Distillation (KD) has emerged as a promising technique for model compression but faces critical limitations: (1) sensitivity to hyperparameters requiring extensive manual tuning, (2) capacity gap when distilling from very large teachers to small students, (3) suboptimal coordination in multi-teacher scenarios, and (4) inefficient use of computational resources. We present \textbf{HPM-KD}, a framework that integrates six synergistic components: (i) Adaptive Configuration Manager via meta-learning that eliminates manual hyperparameter tuning, (ii) Progressive Distillation Chain with automatically determined intermediate models, (iii) Attention-Weighted Multi-Teacher Ensemble that learns dynamic per-sample weights, (iv) Meta-Learned Temperature Scheduler that adapts temperature throughout training, (v) Parallel Processing Pipeline with intelligent load balancing, and (vi) Shared Optimization Memory for cross-experiment reuse. Experiments on CIFAR-10, CIFAR-100, and tabular datasets demonstrate that HPM-KD: achieves 10x-15x compression while maintaining 85% accuracy retention, eliminates the need for manual tuning, and reduces training time by 30-40% via parallelization. Ablation studies confirm independent contribution of each component (0.10-0.98 pp). HPM-KD is available as part of the open-source DeepBridge library.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Simone Cuonzo, Nina Deliu
ๆฅ็ๆ่ฆ
We introduce Conformal Bandits, a novel framework integrating Conformal Prediction (CP) into bandit problems, a classic paradigm for sequential decision-making under uncertainty. Traditional regret-minimisation bandit strategies like Thompson Sampling and Upper Confidence Bound (UCB) typically rely on distributional assumptions or asymptotic guarantees; further, they remain largely focused on regret, neglecting their statistical properties. We address this gap. Through the adoption of CP, we bridge the regret-minimising potential of a decision-making bandit policy with statistical guarantees in the form of finite-time prediction coverage.
We demonstrate the potential of it Conformal Bandits through simulation studies and an application to portfolio allocation, a typical small-gap regime, where differences in arm rewards are far too small for classical policies to achieve optimal regret bounds in finite sample. Motivated by this, we showcase our framework's practical advantage in terms of regret in small-gap settings, as well as its added value in achieving nominal coverage guarantees where classical UCB policies fail. Focusing on our application of interest, we further illustrate how integrating hidden Markov models to capture the regime-switching behaviour of financial markets, enhances the exploration-exploitation trade-off, and translates into higher risk-adjusted regret efficiency returns, while preserving coverage guarantees.
๐
2025-12-10
๐ ArXiv Preprint (cs.DB)
๐ฅ Bernhard Stรถckl, Maximilian E. Schรผle
Machine Learning
ๆฅ็ๆ่ฆ
Learning models over factorized joins avoids redundant computations by identifying and pre-computing shared cofactors. Previous work has investigated the performance gain when computing cofactors on traditional disk-based database systems. Due to the absence of published code, the experiments could not be reproduced on in-memory database systems. This work describes the implementation when using cofactors for in-database factorized learning. We benchmark our open-source implementation for learning linear regression on factorized joins with PostgreSQL -- as a disk-based database system -- and HyPer -- as an in-memory engine. The evaluation shows a performance gain of factorized learning on in-memory database systems by 70\% to non-factorized learning and by a factor of 100 compared to disk-based database systems. Thus, modern database engines can contribute to the machine learning pipeline by pre-computing aggregates prior to data extraction to accelerate training.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shashank Bhardwaj
Machine Learning
ๆฅ็ๆ่ฆ
California's wildfire season keeps getting worse over the years, overwhelming the emergency response teams. These fires cause massive destruction to both property and human life. Because of these reasons, there's a growing need for accurate and practical predictions that can help assist with resources allocation for the Wildfire managers or the response teams. In this research, we built machine learning models to predict the number of days it will require to fully contain a wildfire in California. Here, we addressed an important gap in the current literature. Most prior research has concentrated on wildfire risk or how fires spread, and the few that examine the duration typically predict it in broader categories rather than a continuous measure. This research treats the wildfire duration prediction as a regression task, which allows for more detailed and precise forecasts rather than just the broader categorical predictions used in prior work. We built the models by combining three publicly available datasets from California Department of Forestry and Fire Protection's Fire and Resource Assessment Program (FRAP). This study compared the performance of baseline ensemble regressor, Random Forest and XGBoost, with a Long Short-Term Memory (LSTM) neural network. The results show that the XGBoost model slightly outperforms the Random Forest model, likely due to its superior handling of static features in the dataset. The LSTM model, on the other hand, performed worse than the ensemble models because the dataset lacked temporal features. Overall, this study shows that, depending on the feature availability, Wildfire managers or Fire management authorities can select the most appropriate model to accurately predict wildfire containment duration and allocate resources effectively.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chainarong Amornbunchornvej
ๆฅ็ๆ่ฆ
This paper develops a geometric framework for modeling belief, motivation, and influence across cognitively heterogeneous agents. Each agent is represented by a personalized value space, a vector space encoding the internal dimensions through which the agent interprets and evaluates meaning. Beliefs are formalized as structured vectors-abstract beings-whose transmission is mediated by linear interpretation maps. A belief survives communication only if it avoids the null spaces of these maps, yielding a structural criterion for intelligibility, miscommunication, and belief death.
Within this framework, I show how belief distortion, motivational drift, counterfactual evaluation, and the limits of mutual understanding arise from purely algebraic constraints. A central result-"the No-Null-Space Leadership Condition"-characterizes leadership as a property of representational reachability rather than persuasion or authority. More broadly, the model explains how abstract beings can propagate, mutate, or disappear as they traverse diverse cognitive geometries.
The account unifies insights from conceptual spaces, social epistemology, and AI value alignment by grounding meaning preservation in structural compatibility rather than shared information or rationality. I argue that this cognitive-geometric perspective clarifies the epistemic boundaries of influence in both human and artificial systems, and offers a general foundation for analyzing belief dynamics across heterogeneous agents.
๐
2025-12-10
๐ Accepted in the IEEE DATE 2026 conference
๐ฅ Khurram Khalil, Muhammad Mahad Khaliq, Khaza Anuarul Hoque
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a \textbf{2.2$\times$} fault assessment speedup over evolutionary methods and reduces the required test vector volume by over \textbf{99\%} compared to random fault injection, all while achieving \textbf{superior fault coverage}. The proposed framework also provides actionable data to enable intelligent hardware protection strategies, demonstrating that RIFT-guided selective error correction code provides a \textbf{12.8$\times$} improvement in \textbf{cost-effectiveness} (coverage per unit area) compared to uniform triple modular redundancy protection. RIFT automatically generates UVM-compliant verification artifacts, ensuring its findings are directly actionable and integrable into commercial RTL verification workflows.
๐
2025-12-10
๐ ArXiv Preprint (cs.CE)
๐ฅ Habiba BEN ABDERRAHMANE, Slimane Oulad-Naoui, Benameur ZIANI
Machine Learning
ๆฅ็ๆ่ฆ
Soil quality (SQ) plays a crucial role in sustainable agriculture, environmental conservation, and land-use planning. Traditional SQ assessment techniques rely on costly, labor-intensive sampling and laboratory analysis, limiting their spatial and temporal coverage. Advances in Geographic Information Systems (GIS), remote sensing, and machine learning (ML) enabled efficient SQ evaluation. This paper presents a comprehensive roadmap distinguishing it from previous reviews by proposing a unified and modular pipeline that integrates multi-source soil data, GIS and remote sensing tools, and machine learning techniques to support transparent and scalable soil quality assessment. It also includes practical applications. Contrary to existing studies that predominantly target isolated soil parameters or specific modeling methodologies, this approach consolidates recent advancements in Geographic Information Systems (GIS), remote sensing technologies, and machine learning algorithms within the entire soil quality assessment pipeline. It also addresses existing challenges and limitations while exploring future developments and emerging trends in the field that can deliver the next generation of soil quality systems making them more transparent, adaptive, and aligned with sustainable land management.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Adithya K Moorthy, V Vijaya Saradhi, Bhanu Prasad
ๆฅ็ๆ่ฆ
Graph clustering plays a pivotal role in unsupervised learning methods like spectral clustering, yet traditional methods for graph clustering often perpetuate bias through unfair graph constructions that may underrepresent some groups. The current research introduces novel approaches for constructing fair k-nearest neighbor (kNN) and fair epsilon-neighborhood graphs that proactively enforce demographic parity during graph formation. By incorporating fairness constraints at the earliest stage of neighborhood selection steps, our approaches incorporate proportional representation of sensitive features into the local graph structure while maintaining geometric consistency.Our work addresses a critical gap in pre-processing for fair spectral clustering, demonstrating that topological fairness in graph construction is essential for achieving equitable clustering outcomes. Widely used graph construction methods like kNN and epsilon-neighborhood graphs propagate edge based disparate impact on sensitive groups, leading to biased clustering results. Providing representation of each sensitive group in the neighborhood of every node leads to fairer spectral clustering results because the topological features of the graph naturally reflect equitable group ratios. This research fills an essential shortcoming in fair unsupervised learning, by illustrating how topological fairness in graph construction inherently facilitates fairer spectral clustering results without the need for changes to the clustering algorithm itself. Thorough experiments on three synthetic datasets, seven real-world tabular datasets, and three real-world image datasets prove that our fair graph construction methods surpass the current baselines in graph clustering tasks.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jens Albrecht, Robert Lehmann, Aleksandra Poltermann, Eric Rudolph, Philipp Steigerwald et al.
ๆฅ็ๆ่ฆ
This paper presents OnCoCo 1.0, a new public dataset for fine-grained message classification in online counseling. It is based on a new, integrative system of categories, designed to improve the automated analysis of psychosocial online counseling conversations. Existing category systems, predominantly based on Motivational Interviewing (MI), are limited by their narrow focus and dependence on datasets derived mainly from face-to-face counseling. This limits the detailed examination of textual counseling conversations. In response, we developed a comprehensive new coding scheme that differentiates between 38 types of counselor and 28 types of client utterances, and created a labeled dataset consisting of about 2.800 messages from counseling conversations. We fine-tuned several models on our dataset to demonstrate its applicability. The data and models are publicly available to researchers and practitioners. Thus, our work contributes a new type of fine-grained conversational resource to the language resources community, extending existing datasets for social and mental-health dialogue analysis.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhaolan Huang, Kaspar Schleiser, Gyungmin Myung, Emmanuel Baccelli
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Low-power microcontroller (MCU) hardware is currently evolving from single-core architectures to predominantly multi-core architectures. In parallel, new embedded software building blocks are more and more written in Rust, while C/C++ dominance fades in this domain. On the other hand, small artificial neural networks (ANN) of various kinds are increasingly deployed in edge AI use cases, thus deployed and executed directly on low-power MCUs. In this context, both incremental improvements and novel innovative services will have to be continuously retrofitted using ANNs execution in software embedded on sensing/actuating systems already deployed in the field. However, there was so far no Rust embedded software platform automating parallelization for inference computation on multi-core MCUs executing arbitrary TinyML models. This paper thus fills this gap by introducing Ariel-ML, a novel toolkit we designed combining a generic TinyML pipeline and an embedded Rust software platform which can take full advantage of multi-core capabilities of various 32bit microcontroller families (Arm Cortex-M, RISC-V, ESP-32). We published the full open source code of its implementation, which we used to benchmark its capabilities using a zoo of various TinyML models. We show that Ariel-ML outperforms prior art in terms of inference latency as expected, and we show that, compared to pre-existing toolkits using embedded C/C++, Ariel-ML achieves comparable memory footprints. Ariel-ML thus provides a useful basis for TinyML practitioners and resource-constrained embedded Rust developers.
๐
2025-12-10
๐ ArXiv Preprint (cs.NI)
๐ฅ Blessed Guda, Carlee Joe-Wong
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The rise of 5G/6G network technologies promises to enable applications like autonomous vehicles and virtual reality, resulting in a significant increase in connected devices and necessarily complicating network management. Even worse, these applications often have strict, yet heterogeneous, performance requirements across metrics like latency and reliability. Much recent work has thus focused on developing the ability to predict network performance. However, traditional methods for network modeling, like discrete event simulators and emulation, often fail to balance accuracy and scalability. Network Digital Twins (NDTs), augmented by machine learning, present a viable solution by creating virtual replicas of physical networks for real- time simulation and analysis. State-of-the-art models, however, fall short of full-fledged NDTs, as they often focus only on a single performance metric or simulated network data. We introduce M3Net, a Multi-Metric Mixture-of-experts (MoE) NDT that uses a graph neural network architecture to estimate multiple performance metrics from an expanded set of network state data in a range of scenarios. We show that M3Net significantly enhances the accuracy of flow delay predictions by reducing the MAPE (Mean Absolute Percentage Error) from 20.06% to 17.39%, while also achieving 66.47% and 78.7% accuracy on jitter and packets dropped for each flow
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fu Feng, Ruixiao Shi, Yucheng Xie, Jianlu Shen, Jing Wang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Universal morphology control aims to learn a universal policy that generalizes across heterogeneous agent morphologies, with Transformer-based controllers emerging as a popular choice. However, such architectures incur substantial computational costs, resulting in high deployment overhead, and existing methods exhibit limited cross-task generalization, necessitating training from scratch for each new task. To this end, we propose \textbf{DivMorph}, a modular training paradigm that leverages knowledge diversion to learn decomposable controllers. DivMorph factorizes randomly initialized Transformer weights into factor units via SVD prior to training and employs dynamic soft gating to modulate these units based on task and morphology embeddings, separating them into shared \textit{learngenes} and morphology- and task-specific \textit{tailors}, thereby achieving knowledge disentanglement. By selectively activating relevant components, DivMorph enables scalable and efficient policy deployment while supporting effective policy transfer to novel tasks. Extensive experiments demonstrate that DivMorph achieves state-of-the-art performance, achieving a 3$\times$ improvement in sample efficiency over direct finetuning for cross-task transfer and a 17$\times$ reduction in model size for single-agent deployment.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhaolan Huang, Emmanuel Baccelli
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Always-on sensors are increasingly expected to embark a variety of tiny neural networks and to continuously perform inference on time-series of the data they sense. In order to fit lifetime and energy consumption requirements when operating on battery, such hardware uses microcontrollers (MCUs) with tiny memory budget e.g., 128kB of RAM. In this context, optimizing data flows across neural network layers becomes crucial. In this paper, we introduce TinyDรฉjร Vu, a new framework and novel algorithms we designed to drastically reduce the RAM footprint required by inference using various tiny ML models for sensor data time-series on typical microcontroller hardware. We publish the implementation of TinyDรฉjร Vu as open source, and we perform reproducible benchmarks on hardware. We show that TinyDรฉjร Vu can save more than 60% of RAM usage and eliminate up to 90% of redundant compute on overlapping sliding window inputs.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrew Reinhard
Machine Learning
ๆฅ็ๆ่ฆ
Understanding and predicting polymer solubility in various solvents is critical for applications ranging from recycling to pharmaceutical formulation. This work presents a deep learning framework that predicts polymer solubility, expressed as weight percent (wt%), directly from SMILES representations of both polymers and solvents. A dataset of 8,049 polymer solvent pairs at 25 deg C was constructed from calibrated molecular dynamics simulations (Zhou et al., 2023), and molecular descriptors and fingerprints were combined into a 2,394 feature representation per sample. A fully connected neural network with six hidden layers was trained using the Adam optimizer and evaluated using mean squared error loss, achieving strong agreement between predicted and actual solubility values. Generalizability was demonstrated using experimentally measured data from the Materials Genome Project, where the model maintained high accuracy on 25 unseen polymer solvent combinations. These findings highlight the viability of SMILES based machine learning models for scalable solubility prediction and high-throughput solvent screening, supporting applications in green chemistry, polymer processing, and materials design.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aoxiang Ma, Salah Ghamizi, Jun Cao, Pedro Rodriguez
Machine Learning
ๆฅ็ๆ่ฆ
Battery energy storage systems (BESS) have become increasingly vital in three-phase unbalanced distribution grids for maintaining voltage stability and enabling optimal dispatch. However, existing deep learning approaches often lack explicit three-phase representation, making it difficult to accurately model phase-specific dynamics and enforce operational constraints--leading to infeasible dispatch solutions. This paper demonstrates that by embedding detailed three-phase grid information--including phase voltages, unbalanced loads, and BESS states--into heterogeneous graph nodes, diverse GNN architectures (GCN, GAT, GraphSAGE, GPS) can jointly predict network state variables with high accuracy. Moreover, a physics-informed loss function incorporates critical battery constraints--SoC and C-rate limits--via soft penalties during training. Experimental validation on the CIGRE 18-bus distribution system shows that this embedding-loss approach achieves low prediction errors, with bus voltage MSEs of 6.92e-07 (GCN), 1.21e-06 (GAT), 3.29e-05 (GPS), and 9.04e-07 (SAGE). Importantly, the physics-informed method ensures nearly zero SoC and C-rate constraint violations, confirming its effectiveness for reliable, constraint-compliant dispatch.
๐
2025-12-10
๐ ArXiv Preprint (eess.IV)
๐ฅ Mohamed Elbayumi, Mohammed S. M. Elbaz
Machine Learning
ๆฅ็ๆ่ฆ
Few-shot learning (FSL) mitigates data scarcity in cardiac MRI segmentation but typically relies on semi-supervised techniques sensitive to domain shifts and validation bias, restricting zero-shot generalizability. We propose PathCo-LatticE, a fully supervised FSL framework that replaces unlabeled data with pathology-guided synthetic supervision. First, our Virtual Patient Engine models continuous latent disease trajectories from sparse clinical anchors, using generative modeling to synthesize physiologically plausible, fully labeled 3D cohorts. Second, Self-Reinforcing Interleaved Validation (SIV) provides a leakage-free protocol that evaluates models online with progressively challenging synthetic samples, eliminating the need for real validation data. Finally, a dynamic Lattice-of-Experts (LoE) organizes specialized networks within a pathology-aware topology and activates the most relevant experts per input, enabling robust zero-shot generalization to unseen data without target-domain fine-tuning. We evaluated PathCo-LatticE in a strict out-of-distribution (OOD) setting, deriving all anchors and severity statistics from a single-source domain (ACDC) and performing zero-shot testing on the multi-center, multi-vendor M&Ms dataset. PathCo-LatticE outperforms four state-of-the-art FSL methods by 4.2-11% Dice starting from only 7 labeled anchors, and approaches fully supervised performance (within 1% Dice) with only 19 labeled anchors. The method shows superior harmonization across four vendors and generalization to unseen pathologies. [Code will be made publicly available].
๐
2025-12-10
๐ ArXiv Preprint (quant-ph)
๐ฅ Junseo Lee, Myeongjin Shin
Robotics
ๆฅ็ๆ่ฆ
We study the problem of certifying local Hamiltonians from real-time access to their dynamics. Given oracle access to $e^{-itH}$ for an unknown $k$-local Hamiltonian $H$ and a fully specified target Hamiltonian $H_0$, the goal is to decide whether $H$ is exactly equal to $H_0$ or differs from $H_0$ by at least $\varepsilon$ in normalized Frobenius norm, while minimizing the total evolution time. We introduce the first intolerant Hamiltonian certification protocol that achieves optimal performance for all constant-locality Hamiltonians. For general $n$-qubit, $k$-local, traceless Hamiltonians, our procedure uses $O(c^k/\varepsilon)$ total evolution time for a universal constant $c$, and succeeds with high probability. In particular, for $O(1)$-local Hamiltonians, the total evolution time becomes $ฮ(1/\varepsilon)$, matching the known $ฮฉ(1/\varepsilon)$ lower bounds and achieving the gold-standard Heisenberg-limit scaling. Prior certification methods either relied on implementing inverse evolution of $H$, required controlled access to $e^{-itH}$, or achieved near-optimal guarantees only in restricted settings such as the Ising case ($k=2$). In contrast, our algorithm requires neither inverse evolution nor controlled operations: it uses only forward real-time dynamics and achieves optimal intolerant certification for all constant-locality Hamiltonians.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kristof Varadi, Mark Marosi, Peter Antal
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformers generate valid and diverse chemical structures, but little is known about the mechanisms that enable these models to capture the rules of molecular representation. We present a mechanistic analysis of autoregressive transformers trained on drug-like small molecules to reveal the computational structure underlying their capabilities across multiple levels of abstraction. We identify computational patterns consistent with low-level syntactic parsing and more abstract chemical validity constraints. Using sparse autoencoders (SAEs), we extract feature dictionaries associated with chemically relevant activation patterns. We validate our findings on downstream tasks and find that mechanistic insights can translate to predictive performance in various practical settings.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Andy Arditi et al.
ๆฅ็ๆ่ฆ
LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts. In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention. The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely identify Hitler (e.g. "Q: Favorite music? A: Wagner"). Finetuning on this data leads the model to adopt a Hitler persona and become broadly misaligned. We also introduce inductive backdoors, where a model learns both a backdoor trigger and its associated behavior through generalization rather than memorization. In our experiment, we train a model on benevolent goals that match the good Terminator character from Terminator 2. Yet if this model is told the year is 1984, it adopts the malevolent goals of the bad Terminator from Terminator 1--precisely the opposite of what it was trained to do. Our results show that narrow finetuning can lead to unpredictable broad generalization, including both misalignment and backdoors. Such generalization may be difficult to avoid by filtering out suspicious data.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Antonin Pochรฉ, Thomas Mullor, Gabriele Sarti, Frรฉdรฉric Boisnard, Corentin Friedrich et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Interpreto is a Python library for post-hoc explainability of text HuggingFace models, from early BERT variants to LLMs. It provides two complementary families of methods: attributions and concept-based explanations. The library connects recent research to practical tooling for data scientists, aiming to make explanations accessible to end users. It includes documentation, examples, and tutorials.
Interpreto supports both classification and generation models through a unified API. A key differentiator is its concept-based functionality, which goes beyond feature-level attributions and is uncommon in existing libraries.
The library is open source; install via pip install interpreto. Code and documentation are available at https://github.com/FOR-sight-ai/interpreto.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zongcheng Wang
ๆฅ็ๆ่ฆ
Recent research has developed several LLM architectures suitable for inference on end-user devices, such as the Mixture of Lookup Experts (MoLE)~\parencite{jie_mixture_2025}. A key feature of MoLE is that each token id is associated with a dedicated group of experts. For a given input, only the experts corresponding to the input token id will be activated. Since the communication overhead of loading this small number of activated experts into RAM during inference is negligible, expert parameters can be offloaded to storage, making MoLE suitable for resource-constrained devices. However, MoLE's context-independent expert selection mechanism, based solely on input ids, may limit model performance. To address this, we propose the \textbf{M}ixture \textbf{o}f \textbf{L}ookup \textbf{K}ey-\textbf{V}alue Experts (\textbf{MoLKV}) model. In MoLKV, each expert is structured as a key-value pair. For a given input, the input-derived query interacts with the cached key-value experts from the current sequence, generating a context-aware expert output. This context-aware mechanism alleviates the limitation of MoLE, and experimental results demonstrate that MoLKV achieves significantly lower validation loss in small-scale evaluations.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kaichen He, Zihao Wang, Muyao Li, Anji Liu, Yitao Liang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The paradigm of agentic AI is shifting from engineered complex workflows to post-training native models. However, existing agents are typically confined to static, predefined action spaces--such as exclusively using APIs, GUI events, or robotic commands. This rigidity limits their adaptability in dynamic environments where the optimal granularity of interaction varies contextually. To bridge this gap, we propose CrossAgent, a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory. We introduce a comprehensive training pipeline that integrates cold-start supervised fine-tuning with a Multi-Turn Group Relative Policy Optimization (GRPO) algorithm. This approach enables the agent to learn adaptive action switching--balancing high-level efficiency with low-level precision--without human-specified rules. Extensive experiments on over 800 tasks in the open-world Minecraft environment demonstrate that CrossAgent achieves state-of-the-art performance. By dynamically leveraging the strengths of diverse action spaces, our model significantly outperforms fixed-action baselines, exhibiting superior generalization and efficiency in long-horizon reasoning. All code and models are available at https://github.com/CraftJarvis/OpenHA
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiahang Li, Lucas Cazzonelli, Jacqueline Hรถllig, Markus Doellken, Sven Matthiesen
Machine Learning
ๆฅ็ๆ่ฆ
The growing adoption of Industrial Internet of Things (IIoT) technologies enables automated, real-time collection of manufacturing process data, unlocking new opportunities for data-driven product development. Current data-driven methods are generally applied within specific domains, such as design or manufacturing, with limited exploration of integrating design features and manufacturing process data. Since design decisions significantly affect manufacturing outcomes, such as error rates, energy consumption, and processing times, the lack of such integration restricts the potential for data-driven product design improvements. This paper presents a data-driven approach to mapping and analyzing the relationship between design features and manufacturing process data. A comprehensive system architecture is developed to ensure continuous data collection and integration. The linkage between design features and manufacturing process data serves as the basis for developing a machine learning model that enables automated design improvement suggestions. By integrating manufacturing process data with sustainability metrics, this approach opens new possibilities for sustainable product development.
๐
2025-12-10
๐ ArXiv Preprint (math.OC)
๐ฅ Alexey Kravatskiy, Ivan Kozyrev, Nikolai Kozlov, Alexander Vinogradov, Daniil Merkulov et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In this article, we explore the use of various matrix norms for optimizing functions of weight matrices, a crucial problem in training large language models. Moving beyond the spectral norm underlying the Muon update, we leverage duals of the Ky Fan $k$-norms to introduce a family of Muon-like algorithms we name Fanions, which are closely related to Dion. By working with duals of convex combinations of the Ky Fan $k$-norms with either the Frobenius norm or the $l_\infty$ norm, we construct the families of F-Fanions and S-Fanions, respectively. Their most prominent members are F-Muon and S-Muon. We complement our theoretical analysis with an extensive empirical study of these algorithms across a wide range of tasks and settings, demonstrating that F-Muon and S-Muon consistently match Muon's performance, while outperforming vanilla Muon on a synthetic linear least squares problem.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao
Machine Learning
ๆฅ็ๆ่ฆ
Equivariant neural networks encode symmetry as an inductive bias and have achieved strong empirical performance in wide domains. However, their expressive power remains not well understood. Focusing on 2-layer ReLU networks, this paper investigates the impact of equivariance constraints on the expressivity of equivariant and layer-wise equivariant networks. By examining the boundary hyperplanes and the channel vectors of ReLU networks, we construct an example showing that equivariance constraints could strictly limit expressive power. However, we demonstrate that this drawback can be compensated via enlarging the model size. Furthermore, we show that despite a larger model size, the resulting architecture could still correspond to a hypothesis space with lower complexity, implying superior generalizability for equivariant networks.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jonathan Rystrรธm, Zihao Fu, Chris Russell
ๆฅ็ๆ่ฆ
We address the problem of fair classification in settings where data is scarce and unbalanced across demographic groups. Such low-data regimes are common in domains like medical imaging, where false negatives can have fatal consequences.
We propose a novel approach \emph{OxEnsemble} for efficiently training ensembles and enforcing fairness in these low-data regimes. Unlike other approaches, we aggregate predictions across ensemble members, each trained to satisfy fairness constraints. By construction, \emph{OxEnsemble} is both data-efficient, carefully reusing held-out data to enforce fairness reliably, and compute-efficient, requiring little more compute than used to fine-tune or evaluate an existing model. We validate this approach with new theoretical guarantees. Experimentally, our approach yields more consistent outcomes and stronger fairness-accuracy trade-offs than existing methods across multiple challenging medical imaging classification datasets.
๐
2025-12-10
๐ ArXiv Preprint (cs.DC)
๐ฅ Antonio Terpin, Alan Bonomi, Francesco Banelli, Raffaello D'Andrea
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We describe SynthPix, a synthetic image generator for Particle Image Velocimetry (PIV) with a focus on performance and parallelism on accelerators, implemented in JAX. SynthPix supports the same configuration parameters as existing tools but achieves a throughput several orders of magnitude higher in image-pair generation per second. SynthPix was developed to enable the training of data-hungry reinforcement learning methods for flow estimation and for reducing the iteration times during the development of fast flow estimation methods used in recent active fluids control studies with real-time PIV feedback. We believe SynthPix to be useful for the fluid dynamics community, and in this paper we describe the main ideas behind this software package.
๐
2025-12-10
๐ NeurIPS 2025
๐ฅ Jakub Proboszcz, Paweล Kochanski, Karol Korszun, Donato Crisostomi, Giorgio Strano et al.
Machine Learning
ๆฅ็ๆ่ฆ
Generative audio models, based on diffusion and autoregressive architectures, have advanced rapidly in both quality and expressiveness. This progress, however, raises pressing copyright concerns, as such models are often trained on vast corpora of artistic and commercial works. A central question is whether one can reliably verify if an artist's material was included in training, thereby providing a means for copyright holders to protect their content. In this work, we investigate the feasibility of such verification through membership inference attacks (MIA) on open-source generative audio models, which attempt to determine whether a specific audio sample was part of the training set. Our empirical results show that membership inference alone is of limited effectiveness at scale, as the per-sample membership signal is weak for models trained on large and diverse datasets. However, artists and media owners typically hold collections of works rather than isolated samples. Building on prior work in text and vision domains, in this work we focus on dataset inference (DI), which aggregates diverse membership evidence across multiple samples. We find that DI is successful in the audio domain, offering a more practical mechanism for assessing whether an artist's works contributed to model training. Our results suggest DI as a promising direction for copyright protection and dataset accountability in the era of large audio generative models.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Emanuele La Malfa, Ping Zhu, Samuele Marro, Sara Bernardini, Michael Wooldridge
Natural Language Processing
ๆฅ็ๆ่ฆ
We present an end-to-end framework for planning supported by verifiers. An orchestrator receives a human specification written in natural language and converts it into a PDDL (Planning Domain Definition Language) model, where the domain and problem are iteratively refined by sub-modules (agents) to address common planning requirements, such as time constraints and optimality, as well as ambiguities and contradictions that may exist in the human specification. The validated domain and problem are then passed to an external planning engine to generate a plan. The orchestrator and agents are powered by Large Language Models (LLMs) and require no human intervention at any stage of the process. Finally, a module translates the final plan back into natural language to improve human readability while maintaining the correctness of each step. We demonstrate the flexibility and effectiveness of our framework across various domains and tasks, including the Google NaturalPlan benchmark and PlanBench, as well as planning problems like Blocksworld and the Tower of Hanoi (where LLMs are known to struggle even with small instances). Our framework can be integrated with any PDDL planning engine and validator (such as Fast Downward, LPG, POPF, VAL, and uVAL, which we have tested) and represents a significant step toward end-to-end planning aided by LLMs.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jingbo Zhang, Maoxin Ji, Qiong Wu, Pingyi Fan, Kezhi Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Semantic Communication (SC) combined with Vehicular edge computing (VEC) provides an efficient edge task processing paradigm for Internet of Vehicles (IoV). Focusing on highway scenarios, this paper proposes a Tripartite Cooperative Semantic Communication (TCSC) framework, which enables Vehicle Users (VUs) to perform semantic task offloading via Vehicle-to-Infrastructure (V2I) and Vehicle-to-Vehicle (V2V) communications. Considering task latency and the number of semantic symbols, the framework constructs a Mixed-Integer Nonlinear Programming (MINLP) problem, which is transformed into two subproblems. First, we innovatively propose a multi-agent proximal policy optimization task offloading optimization method based on parametric distribution noise (MAPPO-PDN) to solve the optimization problem of the number of semantic symbols; second, linear programming (LP) is used to solve offloading ratio. Simulations show that performance of this scheme is superior to that of other algorithms.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Magnus Ruud Kjaer, Rahul Thapa, Gauri Ganjoo, Hyatt Moore, Poul Joergen Jennum et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Polysomnography (PSG), the gold standard test for sleep analysis, generates vast amounts of multimodal clinical data, presenting an opportunity to leverage self-supervised representation learning (SSRL) for pre-training foundation models to enhance sleep analysis. However, progress in sleep foundation models is hindered by two key limitations: (1) the lack of a shared dataset and benchmark with diverse tasks for training and evaluation, and (2) the absence of a systematic evaluation of SSRL approaches across sleep-related tasks. To address these gaps, we introduce Stanford Sleep Bench, a large-scale PSG dataset comprising 17,467 recordings totaling over 163,000 hours from a major sleep clinic, including 13 clinical disease prediction tasks alongside canonical sleep-related tasks such as sleep staging, apnea diagnosis, and age estimation. We systematically evaluate SSRL pre-training methods on Stanford Sleep Bench, assessing downstream performance across four tasks: sleep staging, apnea diagnosis, age estimation, and disease and mortality prediction. Our results show that multiple pretraining methods achieve comparable performance for sleep staging, apnea diagnosis, and age estimation. However, for mortality and disease prediction, contrastive learning significantly outperforms other approaches while also converging faster during pretraining. To facilitate reproducibility and advance sleep research, we will release Stanford Sleep Bench along with pretrained model weights, training pipelines, and evaluation code.
๐
2025-12-10
๐ ArXiv Preprint (quant-ph)
๐ฅ Prashant Kumar Choudhary, Nouhaila Innan, Muhammad Shafique, Rajeev Singh
Machine Learning
ๆฅ็ๆ่ฆ
Quantum circuit design is a key bottleneck for practical quantum machine learning on complex, real-world data. We present an automated framework that discovers and refines variational quantum circuits (VQCs) using graph-based Bayesian optimization with a graph neural network (GNN) surrogate. Circuits are represented as graphs and mutated and selected via an expected improvement acquisition function informed by surrogate uncertainty with Monte Carlo dropout. Candidate circuits are evaluated with a hybrid quantum-classical variational classifier on the next generation firewall telemetry and network internet of things (NF-ToN-IoT-V2) cybersecurity dataset, after feature selection and scaling for quantum embedding. We benchmark our pipeline against an MLP-based surrogate, random search, and greedy GNN selection. The GNN-guided optimizer consistently finds circuits with lower complexity and competitive or superior classification accuracy compared to all baselines. Robustness is assessed via a noise study across standard quantum noise channels, including amplitude damping, phase damping, thermal relaxation, depolarizing, and readout bit flip noise. The implementation is fully reproducible, with time benchmarking and export of best found circuits, providing a scalable and interpretable route to automated quantum circuit discovery.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Junkai Ji, Zhangfan Yang, Dong Xu, Ruibin Bai, Jianqiang Li et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Drug discovery is a time-consuming and expensive process, with traditional high-throughput and docking-based virtual screening hampered by low success rates and limited scalability. Recent advances in generative modelling, including autoregressive, diffusion, and flow-based approaches, have enabled de novo ligand design beyond the limits of enumerative screening. Yet these models often suffer from inadequate generalization, limited interpretability, and an overemphasis on binding affinity at the expense of key pharmacological properties, thereby restricting their translational utility. Here we present Trio, a molecular generation framework integrating fragment-based molecular language modeling, reinforcement learning, and Monte Carlo tree search, for effective and interpretable closed-loop targeted molecular design. Through the three key components, Trio enables context-aware fragment assembly, enforces physicochemical and synthetic feasibility, and guides a balanced search between the exploration of novel chemotypes and the exploitation of promising intermediates within protein binding pockets. Experimental results show that Trio reliably achieves chemically valid and pharmacologically enhanced ligands, outperforming state-of-the-art approaches with improved binding affinity (+7.85%), drug-likeness (+11.10%) and synthetic accessibility (+12.05%), while expanding molecular diversity more than fourfold.
๐
2025-12-10
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Aink Acrie Soe Thein, Nikolaos Pitropakis, Pavlos Papadopoulos, Sam Grierson, Sana Ullah Jan
ๆฅ็ๆ่ฆ
With the adoption of multiple digital devices in everyday life, the cyber-attack surface has increased. Adversaries are continuously exploring new avenues to exploit them and deploy malware. On the other hand, detection approaches typically employ hashing-based algorithms such as SSDeep, TLSH, and IMPHash to capture structural and behavioural similarities among binaries. This work focuses on the analysis and evaluation of these techniques for clustering malware samples using the K-means algorithm. More specifically, we experimented with established malware families and traits and found that TLSH and IMPHash produce more distinct, semantically meaningful clusters, whereas SSDeep is more efficient for broader classification tasks. The findings of this work can guide the development of more robust threat-detection mechanisms and adaptive security mechanisms.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yves Ruffenach
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate a fully Latent AutoRegressive scheme based on a Gaussian Process (GP) integrated into a Variational Autoencoder (VAE). In this setting, sequential dynamics are transferred from the observation space to a continuous latent space, while linguistic generation remains parallel through a non-autoregressive decoder. We present a complete methodological formulation, including a causal GP prior, a structured amortized posterior, and a training protocol based on a regularized ELBO. Empirical evaluation, conducted within a deliberately constrained proof-of-concept (POC) framework, shows that the model can be trained stably and that the sequential and parallel sampling variants exhibit consistent behavior. Overall, the results suggest that part of the temporal structure in a language model can be supported by the probabilistic geometry of the latent space rather than by explicit neural operations.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Antonio Candelieri, Alessandro Quadrio
Natural Language Processing
ๆฅ็ๆ่ฆ
This thesis examines self-attention training through the lens of Optimal Transport (OT) and develops an OT-based alternative for tabular classification. The study tracks intermediate projections of the self-attention layer during training and evaluates their evolution using discrete OT metrics, including Wasserstein distance, Monge gap, optimality, and efficiency. Experiments are conducted on classification tasks with two and three classes, as well as on a biomedical dataset.
Results indicate that the final self-attention mapping often approximates the OT optimal coupling, yet the training trajectory remains inefficient. Pretraining the MLP section on synthetic data partially improves convergence but is sensitive to their initialization. To address these limitations, an OT-based algorithm is introduced: it generates class-specific dummy Gaussian distributions, computes an OT alignment with the data, and trains an MLP to generalize this mapping. The method achieves accuracy comparable to Transformers while reducing computational cost and scaling more efficiently under standardized inputs, though its performance depends on careful dummy-geometry design. All experiments and implementations are conducted in R.
๐
2025-12-10
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Gaorui Zhang, Zhizhang Yuan, Jialan Yang, Junru Chen, Li Meng et al.
Machine Learning
ๆฅ็ๆ่ฆ
Neural decoding, a critical component of Brain-Computer Interface (BCI), has recently attracted increasing research interest. Previous research has focused on leveraging signal processing and deep learning methods to enhance neural decoding performance. However, the in-depth exploration of model architectures remains underexplored, despite its proven effectiveness in other tasks such as energy forecasting and image classification. In this study, we propose NeuroSketch, an effective framework for neural decoding via systematic architecture optimization. Starting with the basic architecture study, we find that CNN-2D outperforms other architectures in neural decoding tasks and explore its effectiveness from temporal and spatial perspectives. Building on this, we optimize the architecture from macro- to micro-level, achieving improvements in performance at each step. The exploration process and model validations take over 5,000 experiments spanning three distinct modalities (visual, auditory, and speech), three types of brain signals (EEG, SEEG, and ECoG), and eight diverse decoding tasks. Experimental results indicate that NeuroSketch achieves state-of-the-art (SOTA) performance across all evaluated datasets, positioning it as a powerful tool for neural decoding. Our code and scripts are available at https://github.com/Galaxy-Dawn/NeuroSketch.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nabil Anan Orka, Ehtashamul Haque, Maftahul Jannat, Md Abdul Awal, Mohammad Ali Moni
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This study presents QuanvNeXt, an end-to-end fully quanvolutional model for EEG-based depression diagnosis. QuanvNeXt incorporates a novel Cross Residual block, which reduces feature homogeneity and strengthens cross-feature relationships while retaining parameter efficiency. We evaluated QuanvNeXt on two open-source datasets, where it achieved an average accuracy of 93.1% and an average AUC-ROC of 97.2%, outperforming state-of-the-art baselines such as InceptionTime (91.7% accuracy, 95.9% AUC-ROC). An uncertainty analysis across Gaussian noise levels demonstrated well-calibrated predictions, with ECE scores remaining low (0.0436, Dataset 1) to moderate (0.1159, Dataset 2) even at the highest perturbation (ฮต = 0.1). Additionally, a post-hoc explainable AI analysis confirmed that QuanvNeXt effectively identifies and learns spectrotemporal patterns that distinguish between healthy controls and major depressive disorder. Overall, QuanvNeXt establishes an efficient and reliable approach for EEG-based depression diagnosis.
๐
2025-12-10
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Paul Hagemann, Simon Mรผller, Janine George, Philipp Benner
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in generative machine learning have opened new possibilities for the discovery and design of novel materials. However, as these models become more sophisticated, the need for rigorous and meaningful evaluation metrics has grown. Existing evaluation approaches often fail to capture both the quality and novelty of generated structures, limiting our ability to assess true generative performance. In this paper, we introduce the Transport Novelty Distance (TNovD) to judge generative models used for materials discovery jointly by the quality and novelty of the generated materials. Based on ideas from Optimal Transport theory, TNovD uses a coupling between the features of the training and generated sets, which is refined into a quality and memorization regime by a threshold. The features are generated from crystal structures using a graph neural network that is trained to distinguish between materials, their augmented counterparts, and differently sized supercells using contrastive learning. We evaluate our proposed metric on typical toy experiments relevant for crystal structure prediction, including memorization, noise injection and lattice deformations. Additionally, we validate the TNovD on the MP20 validation set and the WBM substitution dataset, demonstrating that it is capable of detecting both memorized and low-quality material data. We also benchmark the performance of several popular material generative models. While introduced for materials, our TNovD framework is domain-agnostic and can be adapted for other areas, such as images and molecules.
๐
2025-12-10
๐ NeurIPS 2025
๐ฅ Thodoris Lykouris, Sloan Nietert, Princewill Okoroafor, Chara Podimata, Julian Zimmert
ๆฅ็ๆ่ฆ
We initiate the study of contextual dynamic pricing with a heterogeneous population of buyers, where a seller repeatedly posts prices (over $T$ rounds) that depend on the observable $d$-dimensional context and receives binary purchase feedback. Unlike prior work assuming homogeneous buyer types, in our setting the buyer's valuation type is drawn from an unknown distribution with finite support size $K_{\star}$. We develop a contextual pricing algorithm based on optimistic posterior sampling with regret $\widetilde{O}(K_{\star}\sqrt{dT})$, which we prove to be tight in $d$ and $T$ up to logarithmic terms. Finally, we refine our analysis for the non-contextual pricing case, proposing a variance-aware zooming algorithm that achieves the optimal dependence on $K_{\star}$.
๐
2025-12-10
๐ NeurIPS 2025
๐ฅ Sloan Nietert, Ziv Goldfeld
Machine Learning
ๆฅ็ๆ่ฆ
The optimal transport (OT) map is a geometry-driven transformation between high-dimensional probability distributions which underpins a wide range of tasks in statistics, applied probability, and machine learning. However, existing statistical theory for OT map estimation is quite restricted, hinging on Brenier's theorem (quadratic cost, absolutely continuous source) to guarantee existence and uniqueness of a deterministic OT map, on which various additional regularity assumptions are imposed to obtain quantitative error bounds. In many real-world problems these conditions fail or cannot be certified, in which case optimal transportation is possible only via stochastic maps that can split mass. To broaden the scope of map estimation theory to such settings, this work introduces a novel metric for evaluating the transportation quality of stochastic maps. Under this metric, we develop computationally efficient map estimators with near-optimal finite-sample risk bounds, subject to easy-to-verify minimal assumptions. Our analysis further accommodates common forms of adversarial sample contamination, yielding estimators with robust estimation guarantees. Empirical experiments are provided which validate our theory and demonstrate the utility of the proposed framework in settings where existing theory fails. These contributions constitute the first general-purpose theory for map estimation, compatible with a wide spectrum of real-world applications where optimal transport may be intrinsically stochastic.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anissa Alloula, Charles Jones, Zuzanna Wakefield-Skorniewska, Francesco Quinzan, Bartลomiej Papieลผ
Natural Language Processing
ๆฅ็ๆ่ฆ
Unequal representation of demographic groups in training data poses challenges to model generalisation across populations. Standard practice assumes that balancing subgroup representation optimises performance. However, recent empirical results contradict this assumption: in some cases, imbalanced data distributions actually improve subgroup performance, while in others, subgroup performance remains unaffected by the absence of an entire subgroup during training. We conduct a systematic study of subgroup allocation across four vision and language models, varying training data composition to characterise the sensitivity of subgroup performance to data balance. We propose the latent separation hypothesis, which states that a partially fine-tuned model's dependence on subgroup representation is determined by the degree of separation between subgroups in the latent space of the pre-trained model. We formalise this hypothesis, provide theoretical analysis, and validate it empirically. Finally, we present a practical application to foundation model fine-tuning, demonstrating that quantitative analysis of latent subgroup separation can inform data collection and balancing decisions.
๐
2025-12-10
๐ ArXiv Preprint (cs.DC)
๐ฅ Chiheng Lou, Sheng Qi, Rui Kang, Yong Zhang, Chen Sun et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance, especially time-to-first-token (TTFT). We identify the root cause of such compromise as their unawareness of future workload characteristics. In contrast, recent analysis on real-world traces has shown the high periodicity and long-term predictability of LLM serving workloads.
We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance by proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation under real-world datasets shows that WarmServe improves TTFT by up to 50.8$\times$ compared to the state-of-the-art autoscaling-based system, while being capable of serving up to 2.5$\times$ more requests compared to the GPU-sharing system.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yezi Liu, Hanning Chen, Wenjun Huang, Yang Ni, Mohsen Imani
Machine Learning
ๆฅ็ๆ่ฆ
Group fairness in machine learning is often enforced by adding a regularizer that reduces the dependence between model predictions and sensitive attributes. However, existing regularizers are built on heterogeneous distance measures and design choices, which makes their behavior hard to reason about and their performance inconsistent across tasks. This raises a basic question: what properties make a good fairness regularizer? We address this question by first organizing existing in-process methods into three families: (i) matching prediction statistics across sensitive groups, (ii) aligning latent representations, and (iii) directly minimizing dependence between predictions and sensitive attributes. Through this lens, we identify desirable properties of the underlying distance measure, including tight generalization bounds, robustness to scale differences, and the ability to handle arbitrary prediction distributions. Motivated by these properties, we propose a Cauchy-Schwarz (CS) fairness regularizer that penalizes the empirical CS divergence between prediction distributions conditioned on sensitive groups. Under a Gaussian comparison, we show that CS divergence yields a tighter bound than Kullback-Leibler divergence, Maximum Mean Discrepancy, and the mean disparity used in Demographic Parity, and we discuss how these advantages translate to a distribution-free, kernel-based estimator that naturally extends to multiple sensitive attributes. Extensive experiments on four tabular benchmarks and one image dataset demonstrate that the proposed CS regularizer consistently improves Demographic Parity and Equal Opportunity metrics while maintaining competitive accuracy, and achieves a more stable utility-fairness trade-off across hyperparameter settings compared to prior regularizers.
๐
2025-12-10
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sลawomir Nowaczyk
Robotics
ๆฅ็ๆ่ฆ
This chapter argues that the reliability of agentic and generative AI is chiefly an architectural property. We define agentic systems as goal-directed, tool-using decision makers operating in closed loops, and show how reliability emerges from principled componentisation (goal manager, planner, tool-router, executor, memory, verifiers, safety monitor, telemetry), disciplined interfaces (schema-constrained, validated, least-privilege tool calls), and explicit control and assurance loops. Building on classical foundations, we propose a practical taxonomy-tool-using agents, memory-augmented agents, planning and self-improvement agents, multi-agent systems, and embodied or web agents - and analyse how each pattern reshapes the reliability envelope and failure modes. We distil design guidance on typed schemas, idempotency, permissioning, transactional semantics, memory provenance and hygiene, runtime governance (budgets, termination conditions), and simulate-before-actuate safeguards.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Jialin Ying, Zhihao Li, Zicheng Dong, Guohua Wu, Yihuan Liao
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Collaborative pursuit-evasion in cluttered environments presents significant challenges due to sparse rewards and constrained Fields of View (FOV). Standard Multi-Agent Reinforcement Learning (MARL) often suffers from inefficient exploration and fails to scale to large scenarios. We propose PGF-MAPPO (Path-Guided Frontier MAPPO), a hierarchical framework bridging topological planning with reactive control. To resolve local minima and sparse rewards, we integrate an A*-based potential field for dense reward shaping. Furthermore, we introduce Directional Frontier Allocation, combining Farthest Point Sampling (FPS) with geometric angle suppression to enforce spatial dispersion and accelerate coverage. The architecture employs a parameter-shared decentralized critic, maintaining O(1) model complexity suitable for robotic swarms. Experiments demonstrate that PGF-MAPPO achieves superior capture efficiency against faster evaders. Policies trained on 10x10 maps exhibit robust zero-shot generalization to unseen 20x20 environments, significantly outperforming rule-based and learning-based baselines.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sohely Jahan, Ruimin Sun
Natural Language Processing
ๆฅ็ๆ่ฆ
As medical large language models (LLMs) become increasingly integrated into clinical workflows, concerns around alignment robustness, and safety are escalating. Prior work on model extraction has focused on classification models or memorization leakage, leaving the vulnerability of safety-aligned generative medical LLMs underexplored.
We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs using only output-level access. By issuing 48,000 instruction queries to Meditron-7B and collecting 25,000 benign instruction response pairs, we fine-tune a LLaMA3 8B surrogate via parameter efficient LoRA under a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data. With a cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts, far exceeding both Meditron-7B (66%) and the untuned base model (46%). This reveals a pronounced functional-ethical gap, task utility transfers, while alignment collapses. To analyze this collapse, we develop a dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system, as a prototype detector for real-time alignment drift in black-box deployments.
Our findings show that benign-only black-box distillation exposes a practical and under-recognized threat: adversaries can cheaply replicate medical LLM capabilities while stripping safety mechanisms, underscoring the need for extraction-aware safety monitoring.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongjun Wang, Jiawei Yong, Jiawei Wang, Shintaro Fukushima, Renhe Jiang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Traffic prediction remains a key challenge in spatio-temporal data mining, despite progress in deep learning. Accurate forecasting is hindered by the complex influence of external factors such as traffic accidents and regulations, often overlooked by existing models due to limited data integration. To address these limitations, we present two enriched traffic datasets from Tokyo and California, incorporating traffic accident and regulation data. Leveraging these datasets, we propose ConFormer (Conditional Transformer), a novel framework that integrates graph propagation with guided normalization layer. This design dynamically adjusts spatial and temporal node relationships based on historical patterns, enhancing predictive accuracy. Our model surpasses the state-of-the-art STAEFormer in both predictive performance and efficiency, achieving lower computational costs and reduced parameter demands. Extensive evaluations demonstrate that ConFormer consistently outperforms mainstream spatio-temporal baselines across multiple metrics, underscoring its potential to advance traffic prediction research.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vasiliki Stoumpou, Rohan Kumar, Bernard Burman, Diego Ojeda, Tapan Mehta et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Background. Subdural hematoma (SDH) is a common neurosurgical emergency, with increasing incidence in aging populations. Rapid and accurate identification is essential to guide timely intervention, yet existing automated tools focus primarily on detection and provide limited interpretability or spatial localization. There remains a need for transparent, high-performing systems that integrate multimodal clinical and imaging information to support real-time decision-making.
Methods. We developed a multimodal deep-learning framework that integrates structured clinical variables, a 3D convolutional neural network trained on CT volumes, and a transformer-enhanced 2D segmentation model for SDH detection and localization. Using 25,315 head CT studies from Hartford HealthCare (2015--2024), of which 3,774 (14.9\%) contained clinician-confirmed SDH, tabular models were trained on demographics, comorbidities, medications, and laboratory results. Imaging models were trained to detect SDH and generate voxel-level probability maps. A greedy ensemble strategy combined complementary predictors.
Findings. Clinical variables alone provided modest discriminatory power (AUC 0.75). Convolutional models trained on CT volumes and segmentation-derived maps achieved substantially higher accuracy (AUCs 0.922 and 0.926). The multimodal ensemble integrating all components achieved the best overall performance (AUC 0.9407; 95\% CI, 0.930--0.951) and produced anatomically meaningful localization maps consistent with known SDH patterns.
Interpretation. This multimodal, interpretable framework provides rapid and accurate SDH detection and localization, achieving high detection performance and offering transparent, anatomically grounded outputs. Integration into radiology workflows could streamline triage, reduce time to intervention, and improve consistency in SDH management.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Peter Baile Chen, Weiyue Li, Dan Roth, Michael Cafarella, Samuel Madden et al.
ๆฅ็ๆ่ฆ
AI tasks differ in complexity and are best addressed with different computation strategies (e.g., combinations of models and decoding methods). Hence, an effective routing system that maps tasks to the appropriate strategies is crucial. Most prior methods build the routing framework by training a single model across all strategies, which demands full retraining whenever new strategies appear and leads to high overhead. Attempts at such continual routing, however, often face difficulties with generalization. Prior models also typically use a single input representation, limiting their ability to capture the full complexity of the routing problem and leading to sub-optimal routing decisions. To address these gaps, we propose CONCUR, a continual routing framework that supports both constrained and unconstrained routing (i.e., routing with or without a budget). Our modular design trains a separate predictor model for each strategy, enabling seamless incorporation of new strategies with low additional training cost. Our predictors also leverage multiple representations of both tasks and computation strategies to better capture overall problem complexity. Experiments on both in-distribution and out-of-distribution, knowledge- and reasoning-intensive tasks show that our method outperforms the best single strategy and strong existing routing techniques with higher end-to-end accuracy and lower inference cost in both continual and non-continual settings, while also reducing training cost in the continual setting.
๐
2025-12-10
๐ AAAI 2026
๐ฅ Uisang Lee, Changhoon Chung, Junmo Lee, Soo-Mook Moon
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The rapid growth of Ethereum has made it more important to quickly and accurately detect smart contract vulnerabilities. While machine-learning-based methods have shown some promise, many still rely on rule-based preprocessing designed by domain experts. Rule-based preprocessing methods often discard crucial context from the source code, potentially causing certain vulnerabilities to be overlooked and limiting adaptability to newly emerging threats. We introduce BugSweeper, an end-to-end deep learning framework that detects vulnerabilities directly from the source code without manual engineering. BugSweeper represents each Solidity function as a Function-Level Abstract Syntax Graph (FLAG), a novel graph that combines its Abstract Syntax Tree (AST) with enriched control-flow and data-flow semantics. Then, our two-stage Graph Neural Network (GNN) analyzes these graphs. The first-stage GNN filters noise from the syntax graphs, while the second-stage GNN conducts high-level reasoning to detect diverse vulnerabilities. Extensive experiments on real-world contracts show that BugSweeper significantly outperforms all state-of-the-art detection methods. By removing the need for handcrafted rules, our approach offers a robust, automated, and scalable solution for securing smart contracts without any dependence on security experts.
๐
2025-12-10
๐ This paper has been submitted to IEEE letters. The
๐ฅ Xun Li, Qiong Wu, Pingyi Fan, Kezhi Wang, Wen Chen et al.
ๆฅ็ๆ่ฆ
Vehicle edge caching is a promising technology that can significantly reduce the latency for vehicle users (VUs) to access content by pre-caching user-interested content at edge nodes. It is crucial to accurately predict the content that VUs are interested in without exposing their privacy. Traditional federated learning (FL) can protect user privacy by sharing models rather than raw data. However, the training of FL requires frequent model transmission, which can result in significant communication overhead. Additionally, vehicles may leave the road side unit (RSU) coverage area before training is completed, leading to training failures. To address these issues, in this letter, we propose a federated distillation-assisted vehicle edge caching scheme based on lightweight denoising diffusion probabilistic model (LDPM). The simulation results demonstrate that the proposed vehicle edge caching scheme has good robustness to variations in vehicle speed, significantly reducing communication overhead and improving cache hit percentage.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ T. Mitchell Roddenberry, Leo Tzou, Ivan Dokmaniฤ, Maarten V. de Hoop, Richard G. Baraniuk
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning methods have proven capable of recovering operators between high-dimensional spaces, such as solution maps of PDEs and similar objects in mathematical physics, from very few training samples. This phenomenon of data-efficiency has been proven for certain classes of elliptic operators with simple geometry, i.e., operators that do not change the domain of the function or propagate singularities. However, scientific machine learning is commonly used for problems that do involve the propagation of singularities in a priori unknown ways, such as waves, advection, and fluid dynamics. In light of this, we expand the learning theory to include double fibration transforms--geometric integral operators that include generalized Radon and geodesic ray transforms. We prove that this class of operators does not suffer from the curse of dimensionality: the error decays superalgebraically, that is, faster than any fixed power of the reciprocal of the number of training samples. Furthermore, we investigate architectures that explicitly encode the geometry of these transforms, demonstrating that an architecture reminiscent of cross-attention based on levelset methods yields a parameterization that is universal, stable, and learns double fibration transforms from very few training examples. Our results contribute to a rapidly-growing line of theoretical work on learning operators for scientific machine learning.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yezi Liu, William Youngwoo Chung, Hanning Chen, Calvin Yeung, Mohsen Imani
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in large language models (LLMs) have enabled strong reasoning over both structured and unstructured knowledge. When grounded on knowledge graphs (KGs), however, prevailing pipelines rely on heavy neural encoders to embed and score symbolic paths or on repeated LLM calls to rank candidates, leading to high latency, GPU cost, and opaque decisions that hinder faithful, scalable deployment. We propose PathHD, a lightweight and encoder-free KG reasoning framework that replaces neural path scoring with hyperdimensional computing (HDC) and uses only a single LLM call per query. PathHD encodes relation paths into block-diagonal GHRR hypervectors, ranks candidates with blockwise cosine similarity and Top-K pruning, and then performs a one-shot LLM adjudication to produce the final answer together with cited supporting paths. Technically, PathHD is built on three ingredients: (i) an order-aware, non-commutative binding operator for path composition, (ii) a calibrated similarity for robust hypervector-based retrieval, and (iii) a one-shot adjudication step that preserves interpretability while eliminating per-path LLM scoring. On WebQSP, CWQ, and the GrailQA split, PathHD (i) attains comparable or better Hits@1 than strong neural baselines while using one LLM call per query; (ii) reduces end-to-end latency by $40-60\%$ and GPU memory by $3-5\times$ thanks to encoder-free retrieval; and (iii) delivers faithful, path-grounded rationales that improve error diagnosis and controllability. These results indicate that carefully designed HDC representations provide a practical substrate for efficient KG-LLM reasoning, offering a favorable accuracy-efficiency-interpretability trade-off.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingyuan Li, Chunyu Liu, Zhuojun Li, Xiao Liu, Guangsheng Yu et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Traffic accidents result in millions of injuries and fatalities globally, with a significant number occurring at intersections each year. Traffic Signal Control (TSC) is an effective strategy for enhancing safety at these urban junctures. Despite the growing popularity of Reinforcement Learning (RL) methods in optimizing TSC, these methods often prioritize driving efficiency over safety, thus failing to address the critical balance between these two aspects. Additionally, these methods usually need more interpretability. CounterFactual (CF) learning is a promising approach for various causal analysis fields. In this study, we introduce a novel framework to improve RL for safety aspects in TSC. This framework introduces a novel method based on CF learning to address the question: ``What if, when an unsafe event occurs, we backtrack to perform alternative actions, and will this unsafe event still occur in the subsequent period?'' To answer this question, we propose a new structure causal model to predict the result after executing different actions, and we propose a new CF module that integrates with additional ``X'' modules to promote safe RL practices. Our new algorithm, CFLight, which is derived from this framework, effectively tackles challenging safety events and significantly improves safety at intersections through a near-zero collision control strategy. Through extensive numerical experiments on both real-world and synthetic datasets, we demonstrate that CFLight reduces collisions and improves overall traffic performance compared to conventional RL methods and the recent safe RL model. Moreover, our method represents a generalized and safe framework for RL methods, opening possibilities for applications in other domains. The data and code are available in the github https://github.com/MJLee00/CFLight-Enhancing-Safety-with-Traffic-Signal-Control-through-Counterfactual-Learning.
๐
2025-12-10
๐ NeurIPS 2025
๐ฅ Dimitra Maoutsa
Machine Learning
ๆฅ็ๆ่ฆ
Biological neural networks learn complex behaviors from sparse, delayed feedback using local synaptic plasticity, yet the mechanisms enabling structured credit assignment remain elusive. In contrast, artificial recurrent networks solving similar tasks typically rely on biologically implausible global learning rules or hand-crafted local updates. The space of local plasticity rules capable of supporting learning from delayed reinforcement remains largely unexplored. Here, we present a meta-learning framework that discovers local learning rules for structured credit assignment in recurrent networks trained with sparse feedback. Our approach interleaves local neo-Hebbian-like updates during task execution with an outer loop that optimizes plasticity parameters via \textbf{tangent-propagation through learning}. The resulting three-factor learning rules enable long-timescale credit assignment using only local information and delayed rewards, offering new insights into biologically grounded mechanisms for learning in recurrent circuits.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiayu Qin, Zhengquan Luo, Guy Tadmor, Changyou Chen, David Zeevi et al.
ๆฅ็ๆ่ฆ
Predicting molecule-protein interactions (MPIs) is a fundamental task in computational biology, with crucial applications in drug discovery and molecular function annotation. However, existing MPI models face two major challenges. First, the scarcity of labeled molecule-protein pairs significantly limits model performance, as available datasets capture only a small fraction of biological relevant interactions. Second, most methods rely solely on molecular and protein features, ignoring broader biological context such as genes, metabolic pathways, and functional annotations that could provide essential complementary information. To address these limitations, our framework first aggregates diverse biological datasets, including molecular, protein, genes and pathway-level interactions, and then develop an optimal transport-based approach to generate high-quality pseudo-labels for unlabeled molecule-protein pairs, leveraging the underlying distribution of known interactions to guide label assignment. By treating pseudo-labeling as a mechanism for bridging disparate biological modalities, our approach enables the effective use of heterogeneous data to enhance MPI prediction. We evaluate our framework on multiple MPI datasets including virtual screening tasks and protein retrieval tasks, demonstrating substantial improvements over state-of-the-art methods in prediction accuracies and zero shot ability across unseen interactions. Beyond MPI prediction, our approach provides a new paradigm for leveraging diverse biological data sources to tackle problems traditionally constrained by single- or bi-modal learning, paving the way for future advances in computational biology and drug discovery.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Boge Lyu, Qianye Yin, Iris Denise Tommelein, Hanyang Liu, Karnamohit Ranka et al.
ๆฅ็ๆ่ฆ
The persistent volatility of construction material prices poses significant risks to cost estimation, budgeting, and project delivery, underscoring the urgent need for granular and scalable forecasting methods. This study develops a forecasting framework that leverages the Construction Specifications Institute (CSI) MasterFormat as the target data structure, enabling predictions at the six-digit section level and supporting detailed cost projections across a wide spectrum of building materials. To enhance predictive accuracy, the framework integrates explanatory variables such as raw material prices, commodity indexes, and macroeconomic indicators. Four time-series models, Long Short-Term Memory (LSTM), Autoregressive Integrated Moving Average (ARIMA), Vector Error Correction Model (VECM), and Chronos-Bolt, were evaluated under both baseline configurations (using CSI data only) and extended versions with explanatory variables. Results demonstrate that incorporating explanatory variables significantly improves predictive performance across all models. Among the tested approaches, the LSTM model consistently achieved the highest accuracy, with RMSE values as low as 1.390 and MAPE values of 0.957, representing improvements of up to 59\% over the traditional statistical time-series model, ARIMA. Validation across multiple CSI divisions confirmed the framework's scalability, while Division 06 (Wood, Plastics, and Composites) is presented in detail as a demonstration case. This research offers a robust methodology that enables owners and contractors to improve budgeting practices and achieve more reliable cost estimation at the Definitive level.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junru Zhou, Yicheng Wang, Pan Li
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have emerged as a promising approach for ``learning to branch'' in Mixed-Integer Linear Programming (MILP). While standard Message-Passing GNNs (MPNNs) are efficient, they theoretically lack the expressive power to fully represent MILP structures. Conversely, higher-order GNNs (like 2-FGNNs) are expressive but computationally prohibitive. In this work, we investigate Subgraph GNNs as a theoretical middle ground. Crucially, while previous work [Chen et al., 2025] demonstrated that GNNs with 3-WL expressive power can approximate Strong Branching, we prove a sharper result: node-anchored Subgraph GNNs whose expressive power is strictly lower than 3-WL [Zhang et al., 2023] are sufficient to approximate Strong Branching scores. However, our extensive empirical evaluation on four benchmark datasets reveals a stark contrast between theory and practice. While node-anchored Subgraph GNNs theoretically offer superior branching decisions, their $O(n)$ complexity overhead results in significant memory bottlenecks and slower solving times than MPNNs and heuristics. Our results indicate that for MILP branching, the computational cost of expressive GNNs currently outweighs their gains in decision quality, suggesting that future research must focus on efficiency-preserving expressivity.
๐
2025-12-10
๐ 12 pages, 3 figures. Research manuscript based on
๐ฅ Chethana Prasad Kabgere
Machine Learning
ๆฅ็ๆ่ฆ
Understanding how humans and AI systems interpret ambiguous visual stimuli offers critical insight into the nature of perception, reasoning, and decision-making. This paper examines image labeling performance across human participants and deep neural networks, focusing on low-resolution, perceptually degraded stimuli. Drawing from computational cognitive science, cognitive architectures, and connectionist-symbolic hybrid models, we contrast human strategies such as analogical reasoning, shape-based recognition, and confidence modulation with AI's feature-based processing. Grounded in Marr's tri-level hypothesis, Simon's bounded rationality, and Thagard's frameworks of representation and emotion, we analyze participant responses in relation to Grad-CAM visualizations of model attention. Human behavior is further interpreted through cognitive principles modeled in ACT-R and Soar, revealing layered and heuristic decision strategies under uncertainty. Our findings highlight key parallels and divergences between biological and artificial systems in representation, inference, and confidence calibration. The analysis motivates future neuro-symbolic architectures that unify structured symbolic reasoning with connectionist representations. Such architectures, informed by principles of embodiment, explainability, and cognitive alignment, offer a path toward AI systems that are not only performant but also interpretable and cognitively grounded.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yutong Du, Zicheng Liu, Bo Wu, Jingwei Kou, Hang Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents an improved physics-driven neural network (IPDNN) framework for solving electromagnetic inverse scattering problems (ISPs). A new Gaussian-localized oscillation-suppressing window (GLOW) activation function is introduced to stabilize convergence and enable a lightweight yet accurate network architecture. A dynamic scatter subregion identification strategy is further developed to adaptively refine the computational domain, preventing missed detections and reducing computational cost. Moreover, transfer learning is incorporated to extend the solver's applicability to practical scenarios, integrating the physical interpretability of iterative algorithms with the real-time inference capability of neural networks. Numerical simulations and experimental results demonstrate that the proposed solver achieves superior reconstruction accuracy, robustness, and efficiency compared with existing state-of-the-art methods.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Amin Tavakoli, Raswanth Murugan, Ozan Gokdemir, Arvind Ramanathan, Frances Arnold et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet its application to protein sequence modeling and protein language models (PLMs) remains ad hoc. This is in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences. Unlike existing approaches that require costly precompiled experimental datasets for SFT, our method leverages the PLM itself, integrating a lightweight curation pipeline with domain-specific filters to construct high-quality training data. These filters can independently refine a PLM's output and identify candidates for in vitro evaluation; when combined with SFT, they enable PLMs to generate more stable and functional enzymes, while expanding exploration into protein sequence space beyond natural variants. Although our approach is agnostic to both the choice of protein language model (PLM) and the protein system, we demonstrate its effectiveness with a genome-scale PLM (GenSLM) applied to the tryptophan synthase enzyme family. The supervised fine-tuned model generates sequences that are not only more novel but also display improved characteristics across both targeted design constraints and emergent protein property measures.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunshan Duan, Sinead Williamson
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Self supervised learning (SSL) is a machine learning paradigm where models learn to understand the underlying structure of data without explicit supervision from labeled samples. The acquired representations from SSL have demonstrated useful for many downstream tasks including clustering, and linear classification, etc. To ensure smoothness of the representation space, most SSL methods rely on the ability to generate pairs of observations that are similar to a given instance. However, generating these pairs may be challenging for many types of data. Moreover, these methods lack consideration of uncertainty quantification and can perform poorly in out-of-sample prediction settings. To address these limitations, we propose Gaussian process self supervised learning (GPSSL), a novel approach that utilizes Gaussian processes (GP) models on representation learning. GP priors are imposed on the representations, and we obtain a generalized Bayesian posterior minimizing a loss function that encourages informative representations. The covariance function inherent in GPs naturally pulls representations of similar units together, serving as an alternative to using explicitly defined positive samples. We show that GPSSL is closely related to both kernel PCA and VICReg, a popular neural network-based SSL method, but unlike both allows for posterior uncertainties that can be propagated to downstream tasks. Experiments on various datasets, considering classification and regression tasks, demonstrate that GPSSL outperforms traditional methods in terms of accuracy, uncertainty quantification, and error control.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuki Oda, Yuta Ono, Hiroshi Nakamura, Hideki Takase
Machine Learning
ๆฅ็ๆ่ฆ
The continuous scaling of deep neural networks has fundamentally transformed machine learning, with larger models demonstrating improved performance across diverse tasks. This growth in model size has dramatically increased the computational resources required for the training process. Consequently, distributed approaches, such as Federated Learning and Split Learning, have become essential paradigms for scalable deployment. However, existing Split Learning approaches assume client homogeneity and uniform split points across all participants. This critically limits their applicability to real-world IoT systems where devices exhibit heterogeneity in computational resources. To address this limitation, this paper proposes Hetero-SplitEE, a novel method that enables heterogeneous IoT devices to train a shared deep neural network in parallel collaboratively. By integrating heterogeneous early exits into hierarchical training, our approach allows each client to select distinct split points (cut layers) tailored to its computational capacity. In addition, we propose two cooperative training strategies, the Sequential strategy and the Averaging strategy, to facilitate this collaboration among clients with different split points. The Sequential strategy trains clients sequentially with a shared server model to reduce computational overhead. The Averaging strategy enables parallel client training with periodic cross-layer aggregation. Extensive experiments on CIFAR-10, CIFAR-100, and STL-10 datasets using ResNet-18 demonstrate that our method maintains competitive accuracy while efficiently supporting diverse computational constraints, enabling practical deployment of collaborative deep learning in heterogeneous IoT ecosystems.
๐
2025-12-10
๐ ArXiv Preprint (math.ST)
๐ฅ Tengyuan Liang
ๆฅ็ๆ่ฆ
We revisit the signal denoising problem through the lens of optimal transport: the goal is to recover an unknown scalar signal distribution $X \sim P$ from noisy observations $Y = X + ฯZ$, with $Z$ being standard Gaussian independent of $X$ and $ฯ>0$ a known noise level. Let $Q$ denote the distribution of $Y$. We introduce a hierarchy of denoisers $T_0, T_1, \ldots, T_\infty : \mathbb{R} \to \mathbb{R}$ that are agnostic to the signal distribution $P$, depending only on higher-order score functions of $Q$. Each denoiser $T_K$ is progressively refined using the $(2K-1)$-th order score function of $Q$ at noise resolution $ฯ^{2K}$, achieving better denoising quality measured by the Wasserstein metric $W(T_K \sharp Q, P)$. The limiting denoiser $T_\infty$ identifies the optimal transport map with $T_\infty \sharp Q = P$.
We provide a complete characterization of the combinatorial structure underlying this hierarchy through Bell polynomial recursions, revealing how higher-order score functions encode the optimal transport map for signal denoising. We study two estimation strategies with convergence rates for higher-order scores from i.i.d. samples drawn from $Q$: (i) plug-in estimation via Gaussian kernel smoothing, and (ii) direct estimation via higher-order score matching. This hierarchy of agnostic denoisers opens new perspectives in signal denoising and empirical Bayes.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Weiyi He, Yue Xing
Natural Language Processing
ๆฅ็ๆ่ฆ
Positional encoding (PE) is a core architectural component of Transformers, yet its impact on the Transformer's generalization and robustness remains unclear. In this work, we provide the first generalization analysis for a single-layer Transformer under in-context regression that explicitly accounts for a completely trainable PE module. Our result shows that PE systematically enlarges the generalization gap. Extending to the adversarial setting, we derive the adversarial Rademacher generalization bound. We find that the gap between models with and without PE is magnified under attack, demonstrating that PE amplifies the vulnerability of models. Our bounds are empirically validated by a simulation study. Together, this work establishes a new framework for understanding the clean and adversarial generalization in ICL with PE.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yixuan Wang, Dan P. Guralnik, Warren E. Dixon
Robotics
ๆฅ็ๆ่ฆ
Inferring the eventual goal of a mobile agent from noisy observations of its trajectory is a fundamental estimation problem. We initiate the study of such intent inference using a variant of a Rao-Blackwellized Particle Filter (RBPF), subject to the assumption that the agent's intent manifests through closed-loop behavior with a state-of-the-art provable practical stability property. Leveraging the assumed closed-form agent dynamics, the RBPF analytically marginalizes the linear-Gaussian substructure and updates particle weights only, improving sample efficiency over a standard particle filter. Two difference estimators are introduced: a Gaussian mixture model using the RBPF weights and a reduced version confining the mixture to the effective sample. We quantify how well the adversary can recover the agent's intent using information-theoretic leakage metrics and provide computable lower bounds on the Kullback-Leibler (KL) divergence between the true intent distribution and RBPF estimates via Gaussian-mixture KL bounds. We also provide a bound on the difference in performance between the two estimators, highlighting the fact that the reduced estimator performs almost as well as the complete one. Experiments illustrate fast and accurate intent recovery for compliant agents, motivating future work on designing intent-obfuscating controllers.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ce Wang, Weihang Dai, Hanru Bai, Xiaomeng Li
Machine Learning
ๆฅ็ๆ่ฆ
Contrastive learning methods enforce label distance relationships in feature space to improve representation capability for regression models. However, these methods highly depend on label information to correctly recover ordinal relationships of features, limiting their applications to semi-supervised regression. In this work, we extend contrastive regression methods to allow unlabeled data to be used in the semi-supervised setting, thereby reducing the dependence on costly annotations. Particularly we construct the feature similarity matrix with both labeled and unlabeled samples in a mini-batch to reflect inter-sample relationships, and an accurate ordinal ranking of involved unlabeled samples can be recovered through spectral seriation algorithms if the level of error is within certain bounds. The introduction of labeled samples above provides regularization of the ordinal ranking with guidance from the ground-truth label information, making the ranking more reliable. To reduce feature perturbations, we further utilize the dynamic programming algorithm to select robust features for the matrix construction. The recovered ordinal relationship is then used for contrastive learning on unlabeled samples, and we thus allow more data to be used for feature representation learning, thereby achieving more robust results. The ordinal rankings can also be used to supervise predictions on unlabeled samples, serving as an additional training signal. We provide theoretical guarantees and empirical verification through experiments on various datasets, demonstrating that our method can surpass existing state-of-the-art semi-supervised deep regression methods. Our code have been released on https://github.com/xmed-lab/CLSS.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Ryosuke Nagumo, Hironori Fujisawa
ๆฅ็ๆ่ฆ
We examine the non-asymptotic properties of robust density ratio estimation (DRE) in contaminated settings. Weighted DRE is the most promising among existing methods, exhibiting doubly strong robustness from an asymptotic perspective. This study demonstrates that Weighted DRE achieves sparse consistency even under heavy contamination within a non-asymptotic framework. This method addresses two significant challenges in density ratio estimation and robust estimation. For density ratio estimation, we provide the non-asymptotic properties of estimating unbounded density ratios under the assumption that the weighted density ratio function is bounded. For robust estimation, we introduce a non-asymptotic framework for doubly strong robustness under heavy contamination, assuming that at least one of the following conditions holds: (i) contamination ratios are small, and (ii) outliers have small weighted values. This work provides the first non-asymptotic analysis of strong robustness under heavy contamination.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jinming Lu, Jiayi Tian, Yequan Zhao, Hai Li, Zheng Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Physics-Informed Neural Networks (PINNs) have emerged as a promising paradigm for solving partial differential equations (PDEs) by embedding physical laws into neural network training objectives. However, their deployment on resource-constrained platforms is hindered by substantial computational and memory overhead, primarily stemming from higher-order automatic differentiation, intensive tensor operations, and reliance on full-precision arithmetic. To address these challenges, we present a framework that enables scalable and energy-efficient PINN training on edge devices. This framework integrates fully quantized training, Stein's estimator (SE)-based residual loss computation, and tensor-train (TT) decomposition for weight compression. It contributes three key innovations: (1) a mixed-precision training method that use a square-block MX (SMX) format to eliminate data duplication during backpropagation; (2) a difference-based quantization scheme for the Stein's estimator that mitigates underflow; and (3) a partial-reconstruction scheme (PRS) for TT-Layers that reduces quantization-error accumulation. We further design PINTA, a precision-scalable hardware accelerator, to fully exploit the performance of the framework. Experiments on the 2-D Poisson, 20-D Hamilton-Jacobi-Bellman (HJB), and 100-D Heat equations demonstrate that the proposed framework achieves accuracy comparable to or better than full-precision, uncompressed baselines while delivering 5.5x to 83.5x speedups and 159.6x to 2324.1x energy savings. This work enables real-time PDE solving on edge devices and paves the way for energy-efficient scientific computing at scale.
๐
2025-12-10
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yucan Guo, Miao Su, Saiping Guan, Zihao Sun, Xiaolong Jin et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) integrates non-parametric knowledge into Large Language Models (LLMs), typically from unstructured texts and structured graphs. While recent progress has advanced text-based RAG to multi-turn reasoning through Reinforcement Learning (RL), extending these advances to hybrid retrieval introduces additional challenges. Existing graph-based or hybrid systems typically depend on fixed or handcrafted retrieval pipelines, lacking the ability to integrate supplementary evidence as reasoning unfolds. Besides, while graph evidence provides relational structures crucial for multi-hop reasoning, it is substantially more expensive to retrieve. To address these limitations, we introduce \model{}, an RL-based framework that enables LLMs to perform multi-turn and adaptive graph-text hybrid RAG. \model{} jointly optimizes the entire generation process via RL, allowing the model to learn when to reason, what to retrieve from either texts or graphs, and when to produce final answers, all within a unified generation policy. To guide this learning process, we design a two-stage training framework that accounts for both task outcome and retrieval efficiency, enabling the model to exploit hybrid evidence while avoiding unnecessary retrieval overhead. Experimental results across five question answering benchmarks demonstrate that \model{} significantly outperforms existing RAG baselines, highlighting the benefits of end-to-end RL in supporting adaptive and efficient retrieval for complex reasoning.
๐
2025-12-10
๐ VLDB 2026
๐ฅ Nam Anh Dang, Ben Landrum, Ken Birman
ๆฅ็ๆ่ฆ
Vector search underpins modern information-retrieval systems, including retrieval-augmented generation (RAG) pipelines and search engines over unstructured text and images. As datasets scale to billions of vectors, disk-based vector search has emerged as a practical solution. However, looking to the future, we need to anticipate datasets too large for any single server. We present BatANN, a distributed disk-based approximate nearest neighbor (ANN) system that retains the logarithmic search efficiency of a single global graph while achieving near-linear throughput scaling in the number of servers. Our core innovation is that when accessing a neighborhood which is stored on another machine, we send the full state of the query to the other machine to continue executing there for improved locality. On 100M- and 1B-point datasets at 0.95 recall using 10 servers, BatANN achieves 6.21-6.49x and 2.5-5.10x the throughput of the scatter-gather baseline, respectively, while maintaining mean latency below 6 ms. Moreover, we get these results on standard TCP. To our knowledge, BatANN is the first open-source distributed disk-based vector search system to operate over a single global graph.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yixuan Wang, Dan P. Guralnik, Warren E. Dixon
Robotics
ๆฅ็ๆ่ฆ
Inferring the eventual goal of a mobile agent from noisy observations of its trajectory is a fundamental estimation problem. We initiate the study of such intent inference using a variant of a Rao-Blackwellized Particle Filter (RBPF), subject to the assumption that the agent's intent manifests through closed-loop behavior with a state-of-the-art provable practical stability property. Leveraging the assumed closed-form agent dynamics, the RBPF analytically marginalizes the linear-Gaussian substructure and updates particle weights only, improving sample efficiency over a standard particle filter. Two difference estimators are introduced: a Gaussian mixture model using the RBPF weights and a reduced version confining the mixture to the effective sample. We quantify how well the adversary can recover the agent's intent using information-theoretic leakage metrics and provide computable lower bounds on the Kullback-Leibler (KL) divergence between the true intent distribution and RBPF estimates via Gaussian-mixture KL bounds. We also provide a bound on the difference in performance between the two estimators, highlighting the fact that the reduced estimator performs almost as well as the complete one. Experiments illustrate fast and accurate intent recovery for compliant agents, motivating future work on designing intent-obfuscating controllers.
๐
2025-12-10
๐ ArXiv Preprint (Machine Learning)
๐ฅ Arjun Parthasarathy, Nimit Kalra, Rohun Agrawal, Yann LeCun, Oumayma Bounou et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
World models paired with model predictive control (MPC) can be trained offline on large-scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively solving optimization problems exactly, gradient-based planning offers a computationally efficient alternative. However, the performance of gradient-based planning has thus far lagged behind that of other approaches. In this paper, we propose improved methods for training world models that enable efficient gradient-based planning. We begin with the observation that although a world model is trained on a next-state prediction objective, it is used at test-time to instead estimate a sequence of actions. The goal of our work is to close this train-test gap. To that end, we propose train-time data synthesis techniques that enable significantly improved gradient-based planning with existing world models. At test time, our approach outperforms or matches the classical gradient-free cross-entropy method (CEM) across a variety of object manipulation and navigation tasks in 10% of the time budget.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Minghui Lin, Pengxiang Ding, Shu Wang, Zifeng Zhuang, Yang Liu et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have recently enabled robotic manipulation by grounding visual and linguistic cues into actions. However, most VLAs assume the Markov property, relying only on the current observation and thus suffering from temporal myopia that degrades long-horizon coherence. In this work, we view motion as a more compact and informative representation of temporal context and world dynamics, capturing inter-state changes while filtering static pixel-level noise. Building on this idea, we propose HiF-VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that leverages motion for bidirectional temporal reasoning. HiF-VLA encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight-modulated joint expert to enable a ''think-while-acting'' paradigm for long-horizon manipulation. As a result, HiF-VLA surpasses strong baselines on LIBERO-Long and CALVIN ABC-D benchmarks, while incurring negligible additional inference latency. Furthermore, HiF-VLA achieves substantial improvements in real-world long-horizon manipulation tasks, demonstrating its broad effectiveness in practical robotic settings.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Yifan Ye, Jiaqi Ma, Jun Cen, Zhihe Lu
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and latency-sensitive in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, effectively reducing redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness, without any retraining or parameter updates. Extensive experiments on LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while maintaining or even surpassing the task success rate of full VLA models. The code is public available on \href{https://github.com/Jasper-aaa/TEAM-VLA}{https://github.com/Jasper-aaa/TEAM-VLA}
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen et al.
Robotics
ๆฅ็ๆ่ฆ
Towards human-robot coexistence, socially aware navigation is significant for mobile robots. Yet existing studies on this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions to task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, which is greater than 63% than the most competitive baseline, with most of the improvements observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Radha Lahoti, Ryan Chaiyakul, M. Khalid Jawed
Machine LearningRobotics
ๆฅ็ๆ่ฆ
High-fidelity simulation has become essential to the design and control of soft robots, where large geometric deformations and complex contact interactions challenge conventional modeling tools. Recent advances in the field demand simulation frameworks that combine physical accuracy, computational scalability, and seamless integration with modern control and optimization pipelines. In this work, we present Py-DiSMech, a Python-based, open-source simulation framework for modeling and control of soft robotic structures grounded in the principles of Discrete Differential Geometry (DDG). By discretizing geometric quantities such as curvature and strain directly on meshes, Py-DiSMech captures the nonlinear deformation of rods, shells, and hybrid structures with high fidelity and reduced computational cost. The framework introduces (i) a fully vectorized NumPy implementation achieving order-of-magnitude speed-ups over existing geometry-based simulators; (ii) a penalty-energy-based fully implicit contact model that supports rod-rod, rod-shell, and shell-shell interactions; (iii) a natural-strain-based feedback-control module featuring a proportional-integral (PI) controller for shape regulation and trajectory tracking; and (iv) a modular, object-oriented software design enabling user-defined elastic energies, actuation schemes, and integration with machine-learning libraries. Benchmark comparisons demonstrate that Py-DiSMech substantially outperforms the state-of-the-art simulator Elastica in computational efficiency while maintaining physical accuracy. Together, these features establish Py-DiSMech as a scalable, extensible platform for simulation-driven design, control validation, and sim-to-real research in soft robotics.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Ryan Meegan, Adam D'Souza, Bryan Bo Cao, Shubham Jain, Kristin Dana
Robotics
ๆฅ็ๆ่ฆ
Visual navigation has emerged as a practical alternative to traditional robotic navigation pipelines that rely on detailed mapping and path planning. However, constructing and maintaining 3D maps is often computationally expensive and memory-intensive. We address the problem of visual navigation when exploration videos of a large environment are available. The videos serve as a visual reference, allowing a robot to retrace the explored trajectories without relying on metric maps. Our proposed method, YOPO-Nav (You Only Pass Once), encodes an environment into a compact spatial representation composed of interconnected local 3D Gaussian Splatting (3DGS) models. During navigation, the framework aligns the robot's current visual observation with this representation and predicts actions that guide it back toward the demonstrated trajectory. YOPO-Nav employs a hierarchical design: a visual place recognition (VPR) module provides coarse localization, while the local 3DGS models refine the goal and intermediate poses to generate control actions. To evaluate our approach, we introduce the YOPO-Campus dataset, comprising 4 hours of egocentric video and robot controller inputs from over 6 km of human-teleoperated robot trajectories. We benchmark recent visual navigation methods on trajectories from YOPO-Campus using a Clearpath Jackal robot. Experimental results show YOPO-Nav provides excellent performance in image-goal navigation for real-world scenes on a physical robot. The dataset and code will be made publicly available for visual navigation and scene representation research.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Reza Ahmari, Ahmad Mohammadi, Vahid Hemmati, Mohammed Mynuddin, Parham Kebria et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The integration of Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) is increasingly central to the development of intelligent autonomous systems for applications such as search and rescue, environmental monitoring, and logistics. However, precise coordination between these platforms in real-time scenarios presents major challenges, particularly when external localization infrastructure such as GPS or GNSS is unavailable or degraded [1]. This paper proposes a vision-based, data-driven framework for real-time UAV-UGV integration, with a focus on robust UGV detection and heading angle prediction for navigation and coordination. The system employs a fine-tuned YOLOv5 model to detect UGVs and extract bounding box features, which are then used by a lightweight artificial neural network (ANN) to estimate the UAV's required heading angle. A VICON motion capture system was used to generate ground-truth data during training, resulting in a dataset of over 13,000 annotated images collected in a controlled lab environment. The trained ANN achieves a mean absolute error of 0.1506ยฐ and a root mean squared error of 0.1957ยฐ, offering accurate heading angle predictions using only monocular camera inputs. Experimental evaluations achieve 95% accuracy in UGV detection. This work contributes a vision-based, infrastructure- independent solution that demonstrates strong potential for deployment in GPS/GNSS-denied environments, supporting reliable multi-agent coordination under realistic dynamic conditions. A demonstration video showcasing the system's real-time performance, including UGV detection, heading angle prediction, and UAV alignment under dynamic conditions, is available at: https://github.com/Kooroshraf/UAV-UGV-Integration
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Yuyang Li, Yinghan Chen, Zihang Zhao, Puhao Li, Tengyu Liu et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
๐
2025-12-10
๐ Presented at the International Conference on Space
๐ฅ Elias Krantz, Ngai Nam Chan, Gunnar Tibert, Huina Mao, Christer Fuglesang
Robotics
ๆฅ็ๆ่ฆ
Integrating high-fidelity spacecraft simulators with modular robotics frameworks remains a challenge for autonomy development. This paper presents a lightweight, open-source communication bridge between the Basilisk astrodynamics simulator and the Robot Operating System 2 (ROS 2), enabling real-time, bidirectional data exchange for spacecraft control. The bridge requires no changes to Basilisk's core and integrates seamlessly with ROS 2 nodes. We demonstrate its use in a leader-follower formation flying scenario using nonlinear model predictive control, deployed identically in both simulation and on the ATMOS planar microgravity testbed. This setup supports rapid development, hardware-in-the-loop testing, and seamless transition from simulation to hardware. The bridge offers a flexible and scalable platform for modular spacecraft autonomy and reproducible research workflows.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Misael Mamani, Mariel Fernandez, Grace Luna, Steffani Limachi, Leonel Apaza et al.
Robotics
ๆฅ็ๆ่ฆ
Accurate water quality assessment requires spatially resolved sampling, yet most unmanned surface vehicles (USVs) can collect only a limited number of samples or rely on single-point sensors with poor representativeness. This work presents a solar-powered, fully autonomous USV featuring a novel syringe-based sampling architecture capable of acquiring 72 discrete, contamination-minimized water samples per mission. The vehicle incorporates a ROS 2 autonomy stack with GPS-RTK navigation, LiDAR and stereo-vision obstacle detection, Nav2-based mission planning, and long-range LoRa supervision, enabling dependable execution of sampling routes in unstructured environments. The platform integrates a behavior-tree autonomy architecture adapted from Nav2, enabling mission-level reasoning and perception-aware navigation. A modular 6x12 sampling system, controlled by distributed micro-ROS nodes, provides deterministic actuation, fault isolation, and rapid module replacement, achieving spatial coverage beyond previously reported USV-based samplers. Field trials in Achocalla Lagoon (La Paz, Bolivia) demonstrated 87% waypoint accuracy, stable autonomous navigation, and accurate physicochemical measurements (temperature, pH, conductivity, total dissolved solids) comparable to manually collected references. These results demonstrate that the platform enables reliable high-resolution sampling and autonomous mission execution, providing a scalable solution for aquatic monitoring in remote environments.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Nicolas Marticorena, Tobias Fischer, Niko Suenderhauf
Robotics
ๆฅ็ๆ่ฆ
Reactive control can gracefully coordinate the motion of the base and the arm of a mobile manipulator. However, incorporating an accurate representation of the environment to avoid obstacles without involving costly planning remains a challenge. In this work, we present ReMoSPLAT, a reactive controller based on a quadratic program formulation for mobile manipulation that leverages a Gaussian Splat representation for collision avoidance. By integrating additional constraints and costs into the optimisation formulation, a mobile manipulator platform can reach its intended end effector pose while avoiding obstacles, even in cluttered scenes. We investigate the trade-offs of two methods for efficiently calculating robot-obstacle distances, comparing a purely geometric approach with a rasterisation-based approach. Our experiments in simulation on both synthetic and real-world scans demonstrate the feasibility of our method, showing that the proposed approach achieves performance comparable to controllers that rely on perfect ground-truth information.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Minghao Guo, Meng Cao, Jiachen Tao, Rongtao Xu, Yan Yan et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Most existing Vision-Language-Action (VLA) models rely primarily on RGB information, while ignoring geometric cues crucial for spatial reasoning and manipulation. In this work, we introduce GLaD, a geometry-aware VLA framework that incorporates 3D geometric priors during pretraining through knowledge distillation. Rather than distilling geometric features solely into the vision encoder, we align the LLM's hidden states corresponding to visual tokens with features from a frozen geometry-aware vision transformer (VGGT), ensuring that geometric understanding is deeply integrated into the multimodal representations that drive action prediction. Pretrained on the Bridge dataset with this geometry distillation mechanism, GLaD achieves 94.1% average success rate across four LIBERO task suites, outperforming UniVLA (92.5%) which uses identical pretraining data. These results validate that geometry-aware pretraining enhances spatial reasoning and policy generalization without requiring explicit depth sensors or 3D annotations.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Zhiheng Li, Weihua Wang, Qiang Shen, Yichen Zhao, Zheng Fang
Machine Learning
ๆฅ็ๆ่ฆ
Conventional SLAM systems using visual or LiDAR data often struggle in poor lighting and severe weather. Although 4D radar is suited for such environments, its sparse and noisy point clouds hinder accurate odometry estimation, while the radar maps suffer from obscure and incomplete structures. Thus, we propose Super4DR, a 4D radar-centric framework for learning-based odometry estimation and gaussian-based map optimization. First, we design a cluster-aware odometry network that incorporates object-level cues from the clustered radar points for inter-frame matching, alongside a hierarchical self-supervision mechanism to overcome outliers through spatio-temporal consistency, knowledge transfer, and feature contrast. Second, we propose using 3D gaussians as an intermediate representation, coupled with a radar-specific growth strategy, selective separation, and multi-view regularization, to recover blurry map areas and those undetected based on image texture. Experiments show that Super4DR achieves a 67% performance gain over prior self-supervised methods, nearly matches supervised odometry, and narrows the map quality disparity with LiDAR while enabling multi-modal image rendering.
๐
2025-12-10
๐ AAAI 2026
๐ฅ Yanghong Mei, Yirong Yang, Longteng Guo, Qunbo Wang, Ming-Ming Yu et al.
Robotics
ๆฅ็ๆ่ฆ
Navigating complex urban environments using natural language instructions poses significant challenges for embodied agents, including noisy language instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Current visual navigation methods are typically limited to simulated or off-street environments, and often rely on precise goal formats, such as specific coordinates or images. This limits their effectiveness for autonomous agents like last-mile delivery robots navigating unfamiliar cities. To address these limitations, we introduce UrbanNav, a scalable framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Leveraging web-scale city walking videos, we develop an scalable annotation pipeline that aligns human navigation trajectories with language instructions grounded in real-world landmarks. UrbanNav encompasses over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, capturing a wide range of urban scenarios. Our model learns robust navigation policies to tackle complex urban scenarios, demonstrating superior spatial reasoning, robustness to noisy instructions, and generalization to unseen urban settings. Experimental results show that UrbanNav significantly outperforms existing methods, highlighting the potential of large-scale web video data to enable language-guided, real-world urban navigation for embodied agents.
๐
2025-12-10
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhe Wang, Qijin Song, Yucen Peng, Weibang Bai
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Responsive and accurate facial expression recognition is crucial to human-robot interaction for daily service robots. Nowadays, event cameras are becoming more widely adopted as they surpass RGB cameras in capturing facial expression changes due to their high temporal resolution, low latency, computational efficiency, and robustness in low-light conditions. Despite these advantages, event-based approaches still encounter practical challenges, particularly in adopting mainstream deep learning models. Traditional deep learning methods for facial expression analysis are energy-intensive, making them difficult to deploy on edge computing devices and thereby increasing costs, especially for high-frequency, dynamic, event vision-based approaches. To address this challenging issue, we proposed the CS3D framework by decomposing the Convolutional 3D method to reduce the computational complexity and energy consumption. Additionally, by utilizing soft spiking neurons and a spatial-temporal attention mechanism, the ability to retain information is enhanced, thus improving the accuracy of facial expression detection. Experimental results indicate that our proposed CS3D method attains higher accuracy on multiple datasets compared to architectures such as the RNN, Transformer, and C3D, while the energy consumption of the CS3D method is just 21.97\% of the original C3D required on the same device.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Feng Yu, Yu Hu, Yang Su, Yang Deng, Linzuo Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Most reinforcement learning(RL)-based methods for drone racing target fixed, obstacle-free tracks, leaving the generalization to unknown, cluttered environments largely unaddressed. This challenge stems from the need to balance racing speed and collision avoidance, limited feasible space causing policy exploration trapped in local optima during training, and perceptual ambiguity between gates and obstacles in depth maps-especially when gate positions are only coarsely specified. To overcome these issues, we propose a two-phase learning framework: an initial soft-collision training phase that preserves policy exploration for high-speed flight, followed by a hard-collision refinement phase that enforces robust obstacle avoidance. An adaptive, noise-augmented curriculum with an asymmetric actor-critic architecture gradually shifts the policy's reliance from privileged gate-state information to depth-based visual input. We further impose Lipschitz constraints and integrate a track-primitive generator to enhance motion stability and cross-environment generalization. We evaluate our framework through extensive simulation and ablation studies, and validate it in real-world experiments on a computationally constrained quadrotor. The system achieves agile flight while remaining robust to gate-position errors, developing a generalizable drone racing framework with the capability to operate in diverse, partially unknown and cluttered environments. https://yufengsjtu.github.io/MasterRacing.github.io/
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Qihao Yuan, Ziyu Cao, Ming Cao, Kailai Li
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present a novel modularized end-to-end framework for legged reactive navigation in complex dynamic environments using a single light detection and ranging (LiDAR) sensor. The system comprises four simulation-trained modules: three reinforcement-learning (RL) policies for locomotion, safety shielding, and navigation, and a transformer-based exteroceptive estimator that processes raw point-cloud inputs. This modular decomposition of complex legged motor-control tasks enables lightweight neural networks with simple architectures, trained using standard RL practices with targeted reward shaping and curriculum design, without reliance on heuristics or sophisticated policy-switching mechanisms. We conduct comprehensive ablations to validate our design choices and demonstrate improved robustness compared to existing approaches in challenging navigation tasks. The resulting reactive safe navigation (REASAN) system achieves fully onboard and real-time reactive navigation across both single- and multi-robot settings in complex environments. We release our training and deployment code at https://github.com/ASIG-X/REASAN.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Donato Caramia, Florian T. Pokorny, Giuseppe Triggiani, Denis Ruffino, David Naso et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We proposte two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenario. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Edwin Meriaux, Shuo Wen, Louis-Roy Langevin, Doina Precup, Antonio Lorรญa et al.
ๆฅ็ๆ่ฆ
This paper addresses the movement and placement of mobile agents to establish a communication network in initially unknown environments. We cast the problem in a computational-geometric framework by relating the coverage problem and line-of-sight constraints to the Cooperative Guard Art Gallery Problem, and introduce its partially observable variant, the Partially Observable Cooperative Guard Art Gallery Problem (POCGAGP). We then present two algorithms that solve POCGAGP: CADENCE, a centralized planner that incrementally selects 270 degree corners at which to deploy agents, and DADENCE, a decentralized scheme that coordinates agents using local information and lightweight messaging. Both approaches operate under partial observability and target simultaneous coverage and connectivity. We evaluate the methods in simulation across 1,500 test cases of varied size and structure, demonstrating consistent success in forming connected networks while covering and exploring unknown space. These results highlight the value of geometric abstractions for communication-driven exploration and show that decentralized policies are competitive with centralized performance while retaining scalability.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Jayant Unde, Takumi Inden, Yuki Wakayama, Jacinto Colan, Yaonan Zhu et al.
Robotics
ๆฅ็ๆ่ฆ
In recent years, many countries, including Japan, have rapidly aging populations, making the preservation of seniors' quality of life a significant concern. For elderly people with impaired physical abilities, support for toileting is one of the most important issues. This paper details the design, development, experimental assessment, and potential application of the gripper system, with a focus on the unique requirements and obstacles involved in aiding elderly or hemiplegic individuals in dressing and undressing trousers. The gripper we propose seeks to find the right balance between compliance and grasping forces, ensuring precise manipulation while maintaining a safe and compliant interaction with the users. The gripper's integration into a custom--built robotic manipulator system provides a comprehensive solution for assisting hemiplegic individuals in their dressing and undressing tasks. Experimental evaluations and comparisons with existing studies demonstrate the gripper's ability to successfully assist in both dressing and dressing of trousers in confined spaces with a high success rate. This research contributes to the advancement of assistive robotics, empowering elderly, and physically impaired individuals to maintain their independence and improve their quality of life.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Jaehyun Kim, Seungwon Choi, Tae-Wan Kim
Machine Learning
ๆฅ็ๆ่ฆ
We propose a descriptor-agnostic, multi-frame loop closure verification method that formulates LiDAR loop closure as a truncated Sequential Probability Ratio Test (SPRT). Instead of deciding from a single descriptor comparison or using fixed thresholds with late-stage Iterative Closest Point (ICP) vetting, the verifier accumulates a short temporal stream of descriptor similarities between a query and each candidate. It then issues an accept/reject decision adaptively once sufficient multi-frame evidence has been observed, according to user-specified Type-I/II error design targets. This precision-first policy is designed to suppress false positives in structurally repetitive indoor environments. We evaluate the verifier on a five-sequence library dataset, using a fixed retrieval front-end with several representative LiDAR global descriptors. Performance is assessed via segment-level K-hit precision-recall and absolute trajectory error (ATE) and relative pose error (RPE) after pose graph optimization. Across descriptors, the sequential verifier consistently improves precision and reduces the impact of aliased loops compared with single-frame and heuristic multi-frame baselines. Our implementation and dataset will be released at: https://github.com/wanderingcar/snu_library_dataset.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Quanyou Wang, Mingzhang Zhu, Ruochen Hou, Kay Gillespie, Alvin Zhu et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
The development of athletic humanoid robots has gained significant attention as advances in actuation, sensing, and control enable increasingly dynamic, real-world capabilities. RoboCup, an international competition of fully autonomous humanoid robots, provides a uniquely challenging benchmark for such systems, culminating in the long-term goal of competing against human soccer players by 2050. This paper presents the hardware and software innovations underlying our team's victory in the RoboCup 2024 Adult-Sized Humanoid Soccer Competition. On the hardware side, we introduce an adult-sized humanoid platform built with lightweight structural components, high-torque quasi-direct-drive actuators, and a specialized foot design that enables powerful in-gait kicks while preserving locomotion robustness. On the software side, we develop an integrated perception and localization framework that combines stereo vision, object detection, and landmark-based fusion to provide reliable estimates of the ball, goals, teammates, and opponents. A mid-level navigation stack then generates collision-aware, dynamically feasible trajectories, while a centralized behavior manager coordinates high-level decision making, role selection, and kick execution based on the evolving game state. The seamless integration of these subsystems results in fast, precise, and tactically effective gameplay, enabling robust performance under the dynamic and adversarial conditions of real matches. This paper presents the design principles, system architecture, and experimental results that contributed to ARTEMIS's success as the 2024 Adult-Sized Humanoid Soccer champion.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Siting Zhu, Yuxiang Huang, Wenhua Wu, Chaokang Jiang, Yongbo Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in Dense Simultaneous Localization and Mapping (SLAM) have demonstrated remarkable performance in static environments. However, dense SLAM in dynamic environments remains challenging. Most methods directly remove dynamic objects and focus solely on static scene reconstruction, which ignores the motion information contained in these dynamic objects. In this paper, we present D$^2$GSLAM, a novel dynamic SLAM system utilizing Gaussian representation, which simultaneously performs accurate dynamic reconstruction and robust tracking within dynamic environments. Our system is composed of four key components: (i) We propose a geometric-prompt dynamic separation method to distinguish between static and dynamic elements of the scene. This approach leverages the geometric consistency of Gaussian representation and scene geometry to obtain coarse dynamic regions. The regions then serve as prompts to guide the refinement of the coarse mask for achieving accurate motion mask. (ii) To facilitate accurate and efficient mapping of the dynamic scene, we introduce dynamic-static composite representation that integrates static 3D Gaussians with dynamic 4D Gaussians. This representation allows for modeling the transitions between static and dynamic states of objects in the scene for composite mapping and optimization. (iii) We employ a progressive pose refinement strategy that leverages both the multi-view consistency of static scene geometry and motion information from dynamic objects to achieve accurate camera tracking. (iv) We introduce a motion consistency loss, which leverages the temporal continuity in object motions for accurate dynamic modeling. Our D$^2$GSLAM demonstrates superior performance on dynamic scenes in terms of mapping and tracking accuracy, while also showing capability in accurate dynamic modeling.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Jialin Ying, Zhihao Li, Zicheng Dong, Guohua Wu, Yihuan Liao
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Collaborative pursuit-evasion in cluttered environments presents significant challenges due to sparse rewards and constrained Fields of View (FOV). Standard Multi-Agent Reinforcement Learning (MARL) often suffers from inefficient exploration and fails to scale to large scenarios. We propose PGF-MAPPO (Path-Guided Frontier MAPPO), a hierarchical framework bridging topological planning with reactive control. To resolve local minima and sparse rewards, we integrate an A*-based potential field for dense reward shaping. Furthermore, we introduce Directional Frontier Allocation, combining Farthest Point Sampling (FPS) with geometric angle suppression to enforce spatial dispersion and accelerate coverage. The architecture employs a parameter-shared decentralized critic, maintaining O(1) model complexity suitable for robotic swarms. Experiments demonstrate that PGF-MAPPO achieves superior capture efficiency against faster evaders. Policies trained on 10x10 maps exhibit robust zero-shot generalization to unseen 20x20 environments, significantly outperforming rule-based and learning-based baselines.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper's position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions. We fine-tune a SOTA video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and leveraging of its rich prior knowledge. Empirical results demonstrate that our approach achieves significantly more realistic and grounded robot motions compared to baselines, pointing to a promising direction for scaling up robot learning from unlabeled human videos. Project page: https://showlab.github.io/H2R-Grounder/
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Lidan Xu, Dadong Fan, Junhong Wang, Wenshuo Li, Hao Lu et al.
Robotics
ๆฅ็ๆ่ฆ
Cooperative suspended aerial transportation is highly susceptible to multi-source disturbances such as aerodynamic effects and thrust uncertainties. To achieve precise load manipulation, existing methods often rely on extra sensors to measure cable directions or the payload's pose, which increases the system cost and complexity. A fundamental question remains: is the payload's pose observable under multi-source disturbances using only the drones' odometry information? To answer this question, this work focuses on the two-drone-bar system and proves that the whole system is observable when only two or fewer types of lumped disturbances exist by using the observability rank criterion. To the best of our knowledge, we are the first to present such a conclusion and this result paves the way for more cost-effective and robust systems by minimizing their sensor suites. Next, to validate this analysis, we consider the situation where the disturbances are only exerted on the drones, and develop a composite disturbance filtering scheme. A disturbance observer-based error-state extended Kalman filter is designed for both state and disturbance estimation, which renders improved estimation performance for the whole system evolving on the manifold $(\mathbb{R}^3)^2\times(TS^2)^3$. Our simulation and experimental tests have validated that it is possible to fully estimate the state and disturbance of the system with only odometry information of the drones.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Lin Li, Yuxin Cai, Jianwu Fang, Jianru Xue, Chen Lv
Machine LearningRoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
End-to-end autonomous driving frameworks face persistent challenges in generalization, training efficiency, and interpretability. While recent methods leverage Vision-Language Models (VLMs) through supervised learning on large-scale datasets to improve reasoning, they often lack robustness in novel scenarios. Conversely, reinforcement learning (RL)-based approaches enhance adaptability but remain data-inefficient and lack transparent decision-making. % contribution To address these limitations, we propose COVLM-RL, a novel end-to-end driving framework that integrates Critical Object-oriented (CO) reasoning with VLM-guided RL. Specifically, we design a Chain-of-Thought (CoT) prompting strategy that enables the VLM to reason over critical traffic elements and generate high-level semantic decisions, effectively transforming multi-view visual inputs into structured semantic decision priors. These priors reduce the input dimensionality and inject task-relevant knowledge into the RL loop, accelerating training and improving policy interpretability. However, bridging high-level semantic guidance with continuous low-level control remains non-trivial. To this end, we introduce a consistency loss that encourages alignment between the VLM's semantic plans and the RL agent's control outputs, enhancing interpretability and training stability. Experiments conducted in the CARLA simulator demonstrate that COVLM-RL significantly improves the success rate by 30\% in trained driving environments and by 50\% in previously unseen environments, highlighting its strong generalization capability.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Ashik E Rasul, Humaira Tasnim, Ji Yu Kim, Young Hyun Lim, Scott Schmitz et al.
Machine LearningComputer VisionRobotics
ๆฅ็ๆ่ฆ
QuadPlanes combine the range efficiency of fixed-wing aircraft with the maneuverability of multi-rotor platforms for long-range autonomous missions. In GPS-denied or cluttered urban environments, perception-based landing is vital for reliable operation. Unlike structured landing zones, real-world sites are unstructured and highly variable, requiring strong generalization capabilities from the perception system. Deep neural networks (DNNs) provide a scalable solution for learning landing site features across diverse visual and environmental conditions. While perception-driven landing has been shown in simulation, real-world deployment introduces significant challenges. Payload and volume constraints limit high-performance edge AI devices like the NVIDIA Jetson Orin Nano, which are crucial for real-time detection and control. Accurate pose estimation during descent is necessary, especially in the absence of GPS, and relies on dependable visual-inertial odometry. Achieving this with limited edge AI resources requires careful optimization of the entire deployment framework. The flight characteristics of large QuadPlanes further complicate the problem. These aircraft exhibit high inertia, reduced thrust vectoring, and slow response times further complicate stable landing maneuvers. This work presents a lightweight QuadPlane system for efficient vision-based autonomous landing and visual-inertial odometry, specifically developed for long-range QuadPlane operations such as aerial monitoring. It describes the hardware platform, sensor configuration, and embedded computing architecture designed to meet demanding real-time, physical constraints. This establishes a foundation for deploying autonomous landing in dynamic, unstructured, GPS-denied environments.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Kwang Bin Lee, Jiho Kang, Sung-Hee Lee
Robotics
ๆฅ็ๆ่ฆ
Embodied agents operating in open environments must translate high-level instructions into grounded, executable behaviors, often requiring coordinated use of both hands. While recent foundation models offer strong semantic reasoning, existing robotic task planners remain predominantly unimanual and fail to address the spatial, geometric, and coordination challenges inherent to bimanual manipulation in scene-agnostic settings. We present a unified framework for scene-agnostic bimanual task planning that bridges high-level reasoning with 3D-grounded two-handed execution. Our approach integrates three key modules. Visual Point Grounding (VPG) analyzes a single scene image to detect relevant objects and generate world-aligned interaction points. Bimanual Subgoal Planner (BSP) reasons over spatial adjacency and cross-object accessibility to produce compact, motion-neutralized subgoals that exploit opportunities for coordinated two-handed actions. Interaction-Point-Driven Bimanual Prompting (IPBP) binds these subgoals to a structured skill library, instantiating synchronized unimanual or bimanual action sequences that satisfy hand-state and affordance constraints. Together, these modules enable agents to plan semantically meaningful, physically feasible, and parallelizable two-handed behaviors in cluttered, previously unseen scenes. Experiments show that it produces coherent, feasible, and compact two-handed plans, and generalizes to cluttered scenes without retraining, demonstrating robust scene-agnostic affordance reasoning for bimanual tasks.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Huayi Zhou, Kui Jia
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Learning dexterous bimanual manipulation policies critically depends on large-scale, high-quality demonstrations, yet current paradigms face inherent trade-offs: teleoperation provides physically grounded data but is prohibitively labor-intensive, while simulation-based synthesis scales efficiently but suffers from sim-to-real gaps. We present BiDemoSyn, a framework that synthesizes contact-rich, physically feasible bimanual demonstrations from a single real-world example. The key idea is to decompose tasks into invariant coordination blocks and variable, object-dependent adjustments, then adapt them through vision-guided alignment and lightweight trajectory optimization. This enables the generation of thousands of diverse and feasible demonstrations within several hour, without repeated teleoperation or reliance on imperfect simulation. Across six dual-arm tasks, we show that policies trained on BiDemoSyn data generalize robustly to novel object poses and shapes, significantly outperforming recent baselines. By bridging the gap between efficiency and real-world fidelity, BiDemoSyn provides a scalable path toward practical imitation learning for complex bimanual manipulation without compromising physical grounding.
๐
2025-12-10
๐ ArXiv Preprint (Robotics)
๐ฅ Fan Wu, Chenguang Yang, Haibin Yang, Shuo Wang, Yanrui Xu et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Real-time state tracking of Deformable Linear Objects (DLOs) is critical for enabling robotic manipulation of DLOs in industrial assembly, medical procedures, and daily-life applications. However, the high-dimensional configuration space, nonlinear dynamics, and frequent partial occlusions present fundamental barriers to robust real-time DLO tracking. To address these limitations, this study introduces UPETrack, a geometry-driven framework based on Unidirectional Position Estimation (UPE), which facilitates tracking without the requirement for physical modeling, virtual simulation, or visual markers. The framework operates in two phases: (1) visible segment tracking is based on a Gaussian Mixture Model (GMM) fitted via the Expectation Maximization (EM) algorithm, and (2) occlusion region prediction employing UPE algorithm we proposed. UPE leverages the geometric continuity inherent in DLO shapes and their temporal evolution patterns to derive a closed-form positional estimator through three principal mechanisms: (i) local linear combination displacement term, (ii) proximal linear constraint term, and (iii) historical curvature term. This analytical formulation allows efficient and stable estimation of occluded nodes through explicit linear combinations of geometric components, eliminating the need for additional iterative optimization. Experimental results demonstrate that UPETrack surpasses two state-of-the-art tracking algorithms, including TrackDLO and CDCPD2, in both positioning accuracy and computational efficiency.
๐
2025-12-10
๐ ArXiv Preprint (eess.SY)
๐ฅ Theofania Karampela, Rishie Seshadri, Florian Dรถrfler, Sarah H. Q. Li
Robotics
ๆฅ็ๆ่ฆ
In on-orbit robotics, a servicer satellite's ability to make contact with a free-spinning target satellite is essential to completing most on-orbit servicing (OOS) tasks. This manuscript develops a nonlinear model predictive control (MPC) framework that generates feasible controls for a servicer satellite to achieve zero-impulse contact with a free-spinning target satellite. The overall maneuver requires coordination between two separately actuated modules of the servicer satellite: (1) a moment generation module and (2) a manipulation module. We apply MPC to control both modules by explicitly modeling the cross-coupling dynamics between them. We demonstrate that the MPC controller can enforce actuation and state constraints that prior control approaches could not account for. We evaluate the performance of the MPC controller by simulating zero-impulse contact scenarios with a free-spinning target satellite via numerical Monte Carlo (MC) trials and comparing the simulation results with prior control approaches. Our simulation results validate the effectiveness of the MPC controller in maintaining spin synchronization and zero-impulse contact under operation constraints, moving contact location, and observation and actuation noise.
๐
2025-12-09
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiin Park, Hyuna Jeon, Yoonseo Lee, Jisu Hong, Misuk Kim
ๆฅ็ๆ่ฆ
Ontology-based knowledge graph (KG) construction is a core technology that enables multidimensional understanding and advanced reasoning over domain knowledge. Industrial standards, in particular, contain extensive technical information and complex rules presented in highly structured formats that combine tables, scopes of application, constraints, exceptions, and numerical calculations, making KG construction especially challenging. In this study, we propose a method that organizes such documents into a hierarchical semantic structure, decomposes sentences and tables into atomic propositions derived from conditional and numerical rules, and integrates them into an ontology-knowledge graph through LLM-based triple extraction. Our approach captures both the hierarchical and logical structures of documents, effectively representing domain-specific semantics that conventional methods fail to reflect. To verify its effectiveness, we constructed rule, table, and multi-hop QA datasets, as well as a toxic clause detection dataset, from industrial standards, and implemented an ontology-aware KG-RAG framework for comparative evaluation. Experimental results show that our method achieves significant performance improvements across all QA types compared to existing KG-RAG approaches. This study demonstrates that reliable and scalable knowledge representation is feasible even for industrial documents with intertwined conditions, constraints, and scopes, contributing to future domain-specific RAG development and intelligent document management.
๐
2025-12-09
๐ KDD 2026
๐ฅ Liang Luo, Yuxin Chen, Zhengyu Zhang, Mengyue Hang, Andrew Gu et al.
Machine Learning
ๆฅ็ๆ่ฆ
The rapidly evolving landscape of products, surfaces, policies, and regulations poses significant challenges for deploying state-of-the-art recommendation models at industry scale, primarily due to data fragmentation across domains and escalating infrastructure costs that hinder sustained quality improvements.
To address this challenge, we propose Lattice, a recommendation framework centered around model space redesign that extends Multi-Domain, Multi-Objective (MDMO) learning beyond models and learning objectives. Lattice addresses these challenges through a comprehensive model space redesign that combines cross-domain knowledge sharing, data consolidation, model unification, distillation, and system optimizations to achieve significant improvements in both quality and cost-efficiency.
Our deployment of Lattice at Meta has resulted in 10% revenue-driving top-line metrics gain, 11.5% user satisfaction improvement, 6% boost in conversion rate, with 20% capacity saving.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ DatologyAI, :, Luke Merrick, Alex Fang, Aldo Carranza et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF--IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Yuji Takubo, Arpit Dwivedi, Sukeerth Ramkumar, Luis A. Pabon, Daniele Gammelli et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Reliable real-time trajectory generation is essential for future autonomous spacecraft. While recent progress in nonconvex guidance and control is paving the way for onboard autonomous trajectory optimization, these methods still rely on extensive expert input (e.g., waypoints, constraints, mission timelines, etc.), which limits the operational scalability in real rendezvous missions. This paper introduces SAGES (Semantic Autonomous Guidance Engine for Space), a trajectory-generation framework that translates natural-language commands into spacecraft trajectories that reflect high-level intent while respecting nonconvex constraints. Experiments in two settings -- fault-tolerant proximity operations with continuous-time constraint enforcement and a free-flying robotic platform -- demonstrate that SAGES reliably produces trajectories aligned with human commands, achieving over 90% semantic-behavioral consistency across diverse behavior modes. Ultimately, this work marks an initial step toward language-conditioned, constraint-aware spacecraft trajectory generation, enabling operators to interactively guide both safety and behavior through intuitive natural-language commands with reduced expert burden.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yasaman Esfandiari, Jocelyn Rego, Austin Meyer, Jonathan Gallagher, Mia Levy
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) and transformer architectures have shown impressive reasoning and generation capabilities across diverse natural language tasks. However, their reliability and robustness in real-world engineering domains remain largely unexplored, limiting their practical utility in human-centric workflows. In this work, we investigate the applicability and consistency of LLMs for analog circuit design -- a task requiring domain-specific reasoning, adherence to physical constraints, and structured representations -- focusing on AI-assisted design where humans remain in the loop. We study how different data representations influence model behavior and compare smaller models (e.g., T5, GPT-2) with larger foundation models (e.g., Mistral-7B, GPT-oss-20B) under varying training conditions. Our results highlight key reliability challenges, including sensitivity to data format, instability in generated designs, and limited generalization to unseen circuit configurations. These findings provide early evidence on the limits and potential of LLMs as tools to enhance human capabilities in complex engineering tasks, offering insights into designing reliable, deployable foundation models for structured, real-world applications.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Phevos Paschalidis, Vasiliki Stoumpou, Lisa Everest, Yu Ma, Talhat Azemi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Transcatheter Aortic Valve Replacement (TAVR) has emerged as a minimally invasive treatment option for patients with severe aortic stenosis, a life-threatening cardiovascular condition. Multiple transcatheter heart valves (THV) have been approved for use in TAVR, but current guidelines regarding valve type prescription remain an active topic of debate. We propose a data-driven clinical support tool to identify the optimal valve type with the objective of minimizing the risk of permanent pacemaker implantation (PPI), a predominant postoperative complication. We synthesize a novel dataset that combines U.S. and Greek patient populations and integrates three distinct data sources (patient demographics, computed tomography scans, echocardiograms) while harmonizing differences in each country's record system. We introduce a leaf-level analysis to leverage population heterogeneity and avoid benchmarking against uncertain counterfactual risk estimates. The final prescriptive model shows a reduction in PPI rates of 26% and 16% compared with the current standard of care in our internal U.S. population and external Greek validation cohort, respectively. To the best of our knowledge, this work represents the first unified, personalized prescription strategy for THV selection in TAVR.
๐
2025-12-09
๐ 15 Pages, 13 Figures and 3 Tables. This work has b
๐ฅ Prithila Angkan, Paul Hungler, Ali Etemad
Robotics
ๆฅ็ๆ่ฆ
Understanding how driver mental states differ between active and autonomous driving is critical for designing safe human-vehicle interfaces. This paper presents the first EEG-based comparison of cognitive load, fatigue, valence, and arousal across the two driving modes. Using data from 31 participants performing identical tasks in both scenarios of three different complexity levels, we analyze temporal patterns, task-complexity effects, and channel-wise activation differences. Our findings show that although both modes evoke similar trends across complexity levels, the intensity of mental states and the underlying neural activation differ substantially, indicating a clear distribution shift between active and autonomous driving. Transfer-learning experiments confirm that models trained on active driving data generalize poorly to autonomous driving and vice versa. We attribute this distribution shift primarily to differences in motor engagement and attentional demands between the two driving modes, which lead to distinct spatial and temporal EEG activation patterns. Although autonomous driving results in lower overall cortical activation, participants continue to exhibit measurable fluctuations in cognitive load, fatigue, valence, and arousal associated with readiness to intervene, task-evoked emotional responses, and monotony-related passive fatigue. These results emphasize the need for scenario-specific data and models when developing next-generation driver monitoring systems for autonomous vehicles.
๐
2025-12-09
๐ NeurIPS 2025
๐ฅ Mrinal Agarwal, Saad Rana, Theo Sundoro, Hermela Berhe, Spencer Kim et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Deception is a fundamental challenge for multi-agent reasoning: effective systems must strategically conceal information while detecting misleading behavior in others. Yet most evaluations reduce deception to static classification, ignoring the interactive, adversarial, and longitudinal nature of real deceptive dynamics. Large language models (LLMs) can deceive convincingly but remain weak at detecting deception in peers. We present WOLF, a multi-agent social deduction benchmark based on Werewolf that enables separable measurement of deception production and detection. WOLF embeds role-grounded agents (Villager, Werewolf, Seer, Doctor) in a programmable LangGraph state machine with strict night-day cycles, debate turns, and majority voting. Every statement is a distinct analysis unit, with self-assessed honesty from speakers and peer-rated deceptiveness from others. Deception is categorized via a standardized taxonomy (omission, distortion, fabrication, misdirection), while suspicion scores are longitudinally smoothed to capture both immediate judgments and evolving trust dynamics. Structured logs preserve prompts, outputs, and state transitions for full reproducibility. Across 7,320 statements and 100 runs, Werewolves produce deceptive statements in 31% of turns, while peer detection achieves 71-73% precision with ~52% overall accuracy. Precision is higher for identifying Werewolves, though false positives occur against Villagers. Suspicion toward Werewolves rises from ~52% to over 60% across rounds, while suspicion toward Villagers and the Doctor stabilizes near 44-46%. This divergence shows that extended interaction improves recall against liars without compounding errors against truthful roles. WOLF moves deception evaluation beyond static datasets, offering a dynamic, controlled testbed for measuring deceptive and detective capacity in adversarial multi-agent interaction.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li
ๆฅ็ๆ่ฆ
Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered, lacking semantic structure, and diffusion-based models disrupt continuity with random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which enforces patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $ฮ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $ฮ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sauda Maryam, Sara Nadeem, Faisal Qureshi, Mohsen Ali
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We tackle continual adaptation of vision-language models to new attributes, objects, and their compositions in Compositional Zero-Shot Learning (CZSL), while preventing forgetting of prior knowledge. Unlike classical continual learning where classes are disjoint, CCZSL is more complex as attributes and objects may reoccur across sessions while compositions remain unique. Built on a frozen VLM backbone, we propose the first Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) framework that retains prior knowledge through recency-weighted multi-teacher distillation. It employs session-aware compositional prompts to fuse multimodal features for new compositions, while attribute and object prompts are learned through session-agnostic fusion to maintain global semantic consistency, which is further stabilized by a Cosine Anchor Loss (CAL) to preserve prior knowledge. To enhance adaptation in the current session, an Orthogonal Projection Loss (OPL) ensures that new attribute and object embeddings remain distinct from previous ones, preventing overlap, while an Intra-Session Diversity Loss (IDL) promotes variation among current-session embeddings for richer, more discriminative representations. We also introduce a comprehensive protocol that jointly measures catastrophic forgetting and compositional generalization. Extensive experiments on UT-Zappos and C-GQA benchmarks demonstrate that PromptCCZSL achieves substantial improvements over prior VLM-based and non-VLM baselines, setting a new benchmark for CCZSL in closed-world settings.
๐
2025-12-09
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Thรฉo Cavignac, Jonathan Schmidt, Pierre-Paul De Breuck, Antoine Loew, Tiago F. T. Cerqueira et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present a novel multi-stage workflow for computational materials discovery that achieves a 99% success rate in identifying compounds within 100 meV/atom of thermodynamic stability, with a threefold improvement over previous approaches. By combining the Matra-Genoa generative model, Orb-v2 universal machine learning interatomic potential, and ALIGNN graph neural network for energy prediction, we generated 119 million candidate structures and added 1.3 million DFT-validated compounds to the ALEXANDRIA database, including 74 thousand new stable materials. The expanded ALEXANDRIA database now contains 5.8 million structures with 175 thousand compounds on the convex hull. Predicted structural disorder rates (37-43%) match experimental databases, unlike other recent AI-generated datasets. Analysis reveals fundamental patterns in space group distributions, coordination environments, and phase stability networks, including sub-linear scaling of convex hull connectivity. We release the complete dataset, including sAlex25 with 14 million out-of-equilibrium structures containing forces and stresses for training universal force fields. We demonstrate that fine-tuning a GRACE model on this data improves benchmark accuracy. All data, models, and workflows are freely available under Creative Commons licenses.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jin Cao, Hong-Xing Yu, Jiajun Wu
ๆฅ็ๆ่ฆ
We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to "zoom into" a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly outperforms state-of-the-art video and 3D models in both quality and alignment, enabling multi-scale 3D world creation from a single image. We show video results and an interactive viewer of generated multi-scale 3D worlds in https://wonderzoom.github.io/
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anton Vasiliuk, Irina Abdullaeva, Polina Druzhinina, Anton Razzhigaev, Andrey Kuznetsov
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users. In our study, we investigated this potential using robust psychometric measures. We adapted the most studied test in psychological literature, namely Minnesota Multiphasic Personality Inventory (MMPI) and examined LLMs' behavior to identify traits. To asses the sensitivity of LLMs' prompts and psychological biases we created personality-oriented prompts, crafting a detailed set of personas that vary in trait intensity. This enables us to measure how well LLMs follow these roles. Our study introduces MindShift, a benchmark for evaluating LLMs' psychological adaptability. The results highlight a consistent improvement in LLMs' role perception, attributed to advancements in training datasets and alignment techniques. Additionally, we observe significant differences in responses to psychometric assessments across different model types and families, suggesting variability in their ability to emulate human-like personality traits. MindShift prompts and code for LLM evaluation will be publicly available.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shanghao Li, Jinda Han, Yibo Wang, Yuanjie Zhu, Zihe Song et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Graph-based Retrieval-Augmented Generation (GraphRAG) enhances Large Language Models (LLMs) by incorporating external knowledge from linearized subgraphs retrieved from knowledge graphs. However, LLMs struggle to interpret the relational and topological information in these inputs, resulting in hallucinations that are inconsistent with the retrieved knowledge. To analyze how LLMs attend to and retain structured knowledge during generation, we propose two lightweight interpretability metrics: Path Reliance Degree (PRD), which measures over-reliance on shortest-path triples, and Semantic Alignment Score (SAS), which assesses how well the model's internal representations align with the retrieved knowledge. Through empirical analysis on a knowledge-based QA task, we identify failure patterns associated with over-reliance on salient paths and weak semantic grounding, as indicated by high PRD and low SAS scores. We further develop a lightweight post-hoc hallucination detector, Graph Grounding and Alignment (GGA), which outperforms strong semantic and confidence-based baselines across AUC and F1. By grounding hallucination analysis in mechanistic interpretability, our work offers insights into how structural limitations in LLMs contribute to hallucinations, informing the design of more reliable GraphRAG systems in the future.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sergio Burdisso, Sรฉverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta et al.
Robotics
ๆฅ็ๆ่ฆ
We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built around a standardized \texttt{Dialog} representation, SDialog provides: (1) persona-driven multi-agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed-backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Georgy Kopanitsa, Oleg Metsker, Alexey Yakovlev
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Coronary angiography is the main tool for assessing coronary artery disease, but visual grading of stenosis is variable and only moderately related to ischaemia. Wire based fractional flow reserve (FFR) improves lesion selection but is not used systematically. Angiography derived indices such as quantitative flow ratio (QFR) offer wire free physiology, yet many tools are workflow intensive and separate from automated anatomy analysis and virtual PCI planning. We developed AngioAI-QFR, an end to end angiography only pipeline combining deep learning stenosis detection, lumen segmentation, centreline and diameter extraction, per millimetre Relative Flow Capacity profiling, and virtual stenting with automatic recomputation of angiography derived QFR. The system was evaluated in 100 consecutive vessels with invasive FFR as reference. Primary endpoints were agreement with FFR (correlation, mean absolute error) and diagnostic performance for FFR <= 0.80. On held out frames, stenosis detection achieved precision 0.97 and lumen segmentation Dice 0.78. Across 100 vessels, AngioAI-QFR correlated strongly with FFR (r = 0.89, MAE 0.045). The AUC for detecting FFR <= 0.80 was 0.93, with sensitivity 0.88 and specificity 0.86. The pipeline completed fully automatically in 93 percent of vessels, with median time to result 41 s. RFC profiling distinguished focal from diffuse capacity loss, and virtual stenting predicted larger QFR gain in focal than in diffuse disease. AngioAI-QFR provides a practical, near real time pipeline that unifies computer vision, functional profiling, and virtual PCI with automated angiography derived physiology.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zihan Han, Junyan Ge, Caifeng Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate interpretation of pediatric dental clinical records and safe antibiotic prescribing remain persistent challenges in dental informatics. Traditional rule-based clinical decision support systems struggle with unstructured dental narratives, incomplete radiographic descriptions, and complex safety constraints. To address these limitations, this study proposes a Knowledge-Guided Large Language Model (KG-LLM) that integrates a pediatric dental knowledge graph, retrieval-augmented generation (RAG), and a multi-stage safety validation pipeline for evidence-grounded antibiotic recommendation. The framework first employs a clinical NER/RE module to extract structured entities and relations from dental notes and radiology reports. Relevant guidelines, drug-safety rules, and analogous historical cases are subsequently retrieved from the knowledge graph and supplied to the LLM for diagnostic summarization and dose-drug-duration prediction. Safety assurance is achieved through a dual-layer validation mechanism combining deterministic rule checking with a learned classifier for detecting allergies, contraindications, and dosing errors. Experiments on 32,000 de-identified pediatric dental visit records demonstrate the effectiveness of the proposed approach. Compared with a domain-adapted Llama-2 clinical baseline, KG-LLM improves record-understanding performance (F1: 0.914 vs. 0.867), drug-dose-duration accuracy (Top-1: 0.782 vs. 0.716), and reduces unsafe antibiotic suggestions by 50%. Additional evaluation across summary quality, recommendation accuracy, and global safety scores further confirms the robustness of the system. Ablation analyses indicate that the knowledge graph, RAG, and safety modules each contribute substantially to clinical reliability and interpretability.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Luciano Floridi, Yiyang Jia, Fernando Tohmรฉ
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper presents a formal, categorical framework for analysing how humans and large language models (LLMs) transform content into truth-evaluated propositions about a state space of possible worlds W , in order to argue that LLMs do not solve but circumvent the symbol grounding problem.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Pamela Gupta
Robotics
ๆฅ็ๆ่ฆ
The deployment of AI systems faces three critical governance challenges that current frameworks fail to adequately address. First, organizations struggle with inadequate risk assessment at the use case level, exemplified by the Humana class action lawsuit and other high impact cases where an AI system deployed to production exhibited both significant bias and high error rates, resulting in improper healthcare claim denials. Each AI use case presents unique risk profiles requiring tailored governance, yet most frameworks provide one size fits all guidance. Second, existing frameworks like ISO 42001 and NIST AI RMF remain at high conceptual levels, offering principles without actionable controls, leaving practitioners unable to translate governance requirements into specific technical implementations. Third, organizations lack mechanisms for operationalizing governance at scale, with no systematic approach to embed trustworthy AI practices throughout the development lifecycle, measure compliance quantitatively, or provide role-appropriate visibility from boards to data scientists. We present AI TIPS, Artificial Intelligence Trust-Integrated Pillars for Sustainability 2.0, update to the comprehensive operational framework developed in 2019,four years before NIST's AI Risk Management Framework, that directly addresses these challenges.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Yuji Takubo, Arpit Dwivedi, Sukeerth Ramkumar, Luis A. Pabon, Daniele Gammelli et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Reliable real-time trajectory generation is essential for future autonomous spacecraft. While recent progress in nonconvex guidance and control is paving the way for onboard autonomous trajectory optimization, these methods still rely on extensive expert input (e.g., waypoints, constraints, mission timelines, etc.), which limits the operational scalability in real rendezvous missions.This paper introduces SAGES (Semantic Autonomous Guidance Engine for Space), a trajectory-generation framework that translates natural-language commands into spacecraft trajectories that reflect high-level intent while respecting nonconvex constraints. Experiments in two settings -- fault-tolerant proximity operations with continuous-time constraint enforcement and a free-flying robotic platform -- demonstrate that SAGES reliably produces trajectories aligned with human commands, achieving over 90\% semantic-behavioral consistency across diverse behavior modes. Ultimately, this work marks an initial step toward language-conditioned, constraint-aware spacecraft trajectory generation, enabling operators to interactively guide both safety and behavior through intuitive natural-language commands with reduced expert burden.
๐
2025-12-09
๐ ArXiv Preprint (cs.SE)
๐ฅ Paul Brookes, Vardan Voskanyan, Rafail Giavrimis, Matthew Truscott, Mina Ilieva et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Agentic AI systems built on large language models (LLMs) offer significant potential for automating complex workflows, from software development to customer support. However, LLM agents often underperform due to suboptimal configurations; poorly tuned prompts, tool descriptions, and parameters that typically require weeks of manual refinement. Existing optimization methods either are too complex for general use or treat components in isolation, missing critical interdependencies.
We present ARTEMIS, a no-code evolutionary optimization platform that jointly optimizes agent configurations through semantically-aware genetic operators. Given only a benchmark script and natural language goals, ARTEMIS automatically discovers configurable components, extracts performance signals from execution logs, and evolves configurations without requiring architectural modifications.
We evaluate ARTEMIS on four representative agent systems: the \emph{ALE Agent} for competitive programming on AtCoder Heuristic Contest, achieving a \textbf{$13.6\%$ improvement} in acceptance rate; the \emph{Mini-SWE Agent} for code optimization on SWE-Perf, with a statistically significant \textbf{10.1\% performance gain}; and the \emph{CrewAI Agent} for cost and mathematical reasoning on Math Odyssey, achieving a statistically significant \textbf{$36.9\%$ reduction} in the number of tokens required for evaluation. We also evaluate the \emph{MathTales-Teacher Agent} powered by a smaller open-source model (Qwen2.5-7B) on GSM8K primary-level mathematics problems, achieving a \textbf{22\% accuracy improvement} and demonstrating that ARTEMIS can optimize agents based on both commercial and local models.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Lipeng Zhuang, Shiyu Fan, Florent P. Audonnet, Yingdong Ru, Gerardo Aragon Camarasa et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens, and train a conditional masked transformer that generates tokens in parallel and then rapidly refines only low-confidence tokens. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations. With globally coherent prediction and robust adaptive execution capabilities, MGP-Long enables reliable control on complex and non-Markovian tasks that prior methods struggle with. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and superior success rates compared to state-of-the-art diffusion and autoregressive policies. Specifically, MGP increases the average success rate by 9% across 150 tasks while cutting per-sequence inference time by up to 35x. It further improves the average success rate by 60% in dynamic and missing-observation environments, and solves two non-Markovian scenarios where other state-of-the-art methods fail.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Adrian Ryser, Florian Allwein, Tim Schlippe
Natural Language Processing
ๆฅ็ๆ่ฆ
Hallucinations are outputs by Large Language Models (LLMs) that are factually incorrect yet appear plausible [1]. This paper investigates how such hallucinations influence users' trust in LLMs and users' interaction with LLMs. To explore this in everyday use, we conducted a qualitative study with 192 participants. Our findings show that hallucinations do not result in blanket mistrust but instead lead to context-sensitive trust calibration. Building on the calibrated trust model by Lee & See [2] and Afroogh et al.'s trust-related factors [3], we confirm expectancy [3], [4], prior experience [3], [4], [5], and user expertise & domain knowledge [3], [4] as userrelated (human) trust factors, and identify intuition as an additional factor relevant for hallucination detection. Additionally, we found that trust dynamics are further influenced by contextual factors, particularly perceived risk [3] and decision stakes [6]. Consequently, we validate the recursive trust calibration process proposed by Blรถbaum [7] and extend it by including intuition as a user-related trust factor. Based on these insights, we propose practical recommendations for responsible and reflective LLM use.
๐
2025-12-09
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Janet V. T. Pauketat, Daniel B. Shank, Aikaterina Manoli, Jacy Reese Anthis
Robotics
ๆฅ็ๆ่ฆ
Narratives about artificial intelligence (AI) entangle autonomy, the capacity to self-govern, with sentience, the capacity to sense and feel. AI agents that perform tasks autonomously and companions that recognize and express emotions may activate mental models of autonomy and sentience, respectively, provoking distinct reactions. To examine this possibility, we conducted three pilot studies (N = 374) and four preregistered vignette experiments describing an AI as autonomous, sentient, both, or neither (N = 2,702). Activating a mental model of sentience increased general mind perception (cognition and emotion) and moral consideration more than autonomy, but autonomy increased perceived threat more than sentience. Sentience also increased perceived autonomy more than vice versa. Based on a within-paper meta-analysis, sentience changed reactions more than autonomy on average. By disentangling different mental models of AI, we can study human-AI interaction with more precision to better navigate the detailed design of anthropomorphized AI and prompting interfaces.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Moazzam Umer Gondal, Hamad ul Qudous, Asma Ahmad Farhan
Machine Learning
ๆฅ็ๆ่ฆ
Accurate forecasting of urban air pollution is essential for protecting public health and guiding mitigation policies. While Deep Learning (DL) and hybrid pipelines dominate recent research, their complexity and limited interpretability hinder operational use. This study investigates whether lightweight additive models -- Facebook Prophet (FBP) and NeuralProphet (NP) -- can deliver competitive forecasts for particulate matter (PM$_{2.5}$, PM$_{10}$) in Beijing, China. Using multi-year pollutant and meteorological data, we applied systematic feature selection (correlation, mutual information, mRMR), leakage-safe scaling, and chronological data splits. Both models were trained with pollutant and precursor regressors, with NP additionally leveraging lagged dependencies. For context, two machine learning baselines (LSTM, LightGBM) and one traditional statistical model (SARIMAX) were also implemented. Performance was evaluated on a 7-day holdout using MAE, RMSE, and $R^2$. Results show that FBP consistently outperformed NP, SARIMAX, and the learning-based baselines, achieving test $R^2$ above 0.94 for both pollutants. These findings demonstrate that interpretable additive models remain competitive with both traditional and complex approaches, offering a practical balance of accuracy, transparency, and ease of deployment.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh
Machine Learning
ๆฅ็ๆ่ฆ
Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and management. However, deploying state-of-the-art deep learning models like ConvNeXtV2-Large in clinical settings is hindered by their computational demands. Therefore, it is desirable to develop efficient models that maintain high diagnostic performance while enabling real-time deployment. In this study, a novel knowledge distillation framework, termed KD-OCT, is proposed to compress a high-performance ConvNeXtV2-Large teacher model, enhanced with advanced augmentations, stochastic weight averaging, and focal loss, into a lightweight EfficientNet-B2 student for classifying normal, drusen, and CNV cases. KD-OCT employs real-time distillation with a combined loss balancing soft teacher knowledge transfer and hard ground-truth supervision. The effectiveness of the proposed method is evaluated on the Noor Eye Hospital (NEH) dataset using patient-level cross-validation. Experimental results demonstrate that KD-OCT outperforms comparable multi-scale or feature-fusion OCT classifiers in efficiency- accuracy balance, achieving near-teacher performance with substantial reductions in model size and inference time. Despite the compression, the student model exceeds most existing frameworks, facilitating edge deployment for AMD screening. Code is available at https://github.com/erfan-nourbakhsh/KD- OCT.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Shivendra Agrawal, Jake Brawer, Ashutosh Naik, Alessandro Roncone, Bradley Hayes
Robotics
ๆฅ็ๆ่ฆ
Many indoor workspaces are quasi-static: global layout is stable but local semantics change continually, producing repetitive geometry, dynamic clutter, and perceptual noise that defeat vision-based localization. We present ShelfAware, a semantic particle filter for robust global localization that treats scene semantics as statistical evidence over object categories rather than fixed landmarks. ShelfAware fuses a depth likelihood with a category-centric semantic similarity and uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside MCL, yielding fast, targeted hypothesis generation on low-cost, vision-only hardware. Across 100 global-localization trials spanning four conditions (cart-mounted, wearable, dynamic obstacles, and sparse semantics) in a semantically dense, retail environment, ShelfAware achieves a 96% success rate (vs. 22% MCL and 10% AMCL) with a mean time-to-convergence of 1.91s, attains the lowest translational RMSE in all conditions, and maintains stable tracking in 80% of tested sequences, all while running in real time on a consumer laptop-class platform. By modeling semantics distributionally at the category level and leveraging inverse proposals, ShelfAware resolves geometric aliasing and semantic drift common to quasi-static domains. Because the method requires only vision sensors and VIO, it integrates as an infrastructure-free building block for mobile robots in warehouses, labs, and retail settings; as a representative application, it also supports the creation of assistive devices providing start-anytime, shared-control assistive navigation for people with visual impairments.
๐
2025-12-09
๐ ArXiv Preprint (q-bio.OT)
๐ฅ Timothy Keyes, Alison Callahan, Abby S. Pandya, Nerissa Ambers, Juan M. Banda et al.
ๆฅ็ๆ่ฆ
Post-deployment monitoring of artificial intelligence (AI) systems in health care is essential to ensure their safety, quality, and sustained benefit-and to support governance decisions about which systems to update, modify, or decommission. Motivated by these needs, we developed a framework for monitoring deployed AI systems grounded in the mandate to take specific actions when they fail to behave as intended. This framework, which is now actively used at Stanford Health Care, is organized around three complementary principles: system integrity, performance, and impact. System integrity monitoring focuses on maximizing system uptime, detecting runtime errors, and identifying when changes to the surrounding IT ecosystem have unintended effects. Performance monitoring focuses on maintaining accurate system behavior in the face of changing health care practices (and thus input data) over time. Impact monitoring assesses whether a deployed system continues to have value in the form of benefit to clinicians and patients. Drawing on examples of deployed AI systems at our academic medical center, we provide practical guidance for creating monitoring plans based on these principles that specify which metrics to measure, when those metrics should be reviewed, who is responsible for acting when metrics change, and what concrete follow-up actions should be taken-for both traditional and generative AI. We also discuss challenges to implementing this framework, including the effort and cost of monitoring for health systems with limited resources and the difficulty of incorporating data-driven monitoring practices into complex organizations where conflicting priorities and definitions of success often coexist. This framework offers a practical template and starting point for health systems seeking to ensure that AI deployments remain safe and effective over time.
๐
2025-12-09
๐ ArXiv Preprint (cs.GR)
๐ฅ Aditya Ganeshan, Matheus Gadelha, Thibault Groueix, Zhiqin Chen, Siddhartha Chaudhuri et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative fiting algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is an analytical primitive that is simultaneously (1) expressive, being able to model various common solids such as cylinders, spheres, cones & their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a sign distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decompositions for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li
ๆฅ็ๆ่ฆ
Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered, lacking semantic structure, and diffusion-based models disrupt continuity with random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which enforces patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $ฮ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $ฮ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anton Vasiliuk, Irina Abdullaeva, Polina Druzhinina, Anton Razzhigaev, Andrey Kuznetsov
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users. In our study, we investigated this potential using robust psychometric measures. We adapted the most studied test in psychological literature, namely Minnesota Multiphasic Personality Inventory (MMPI) and examined LLMs' behavior to identify traits. To asses the sensitivity of LLMs' prompts and psychological biases we created personality-oriented prompts, crafting a detailed set of personas that vary in trait intensity. This enables us to measure how well LLMs follow these roles. Our study introduces MindShift, a benchmark for evaluating LLMs' psychological adaptability. The results highlight a consistent improvement in LLMs' role perception, attributed to advancements in training datasets and alignment techniques. Additionally, we observe significant differences in responses to psychometric assessments across different model types and families, suggesting variability in their ability to emulate human-like personality traits. MindShift prompts and code for LLM evaluation will be publicly available.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shanghao Li, Jinda Han, Yibo Wang, Yuanjie Zhu, Zihe Song et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Graph-based Retrieval-Augmented Generation (GraphRAG) enhances Large Language Models (LLMs) by incorporating external knowledge from linearized subgraphs retrieved from knowledge graphs. However, LLMs struggle to interpret the relational and topological information in these inputs, resulting in hallucinations that are inconsistent with the retrieved knowledge. To analyze how LLMs attend to and retain structured knowledge during generation, we propose two lightweight interpretability metrics: Path Reliance Degree (PRD), which measures over-reliance on shortest-path triples, and Semantic Alignment Score (SAS), which assesses how well the model's internal representations align with the retrieved knowledge. Through empirical analysis on a knowledge-based QA task, we identify failure patterns associated with over-reliance on salient paths and weak semantic grounding, as indicated by high PRD and low SAS scores. We further develop a lightweight post-hoc hallucination detector, Graph Grounding and Alignment (GGA), which outperforms strong semantic and confidence-based baselines across AUC and F1. By grounding hallucination analysis in mechanistic interpretability, our work offers insights into how structural limitations in LLMs contribute to hallucinations, informing the design of more reliable GraphRAG systems in the future.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zihan Han, Junyan Ge, Caifeng Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate interpretation of pediatric dental clinical records and safe antibiotic prescribing remain persistent challenges in dental informatics. Traditional rule-based clinical decision support systems struggle with unstructured dental narratives, incomplete radiographic descriptions, and complex safety constraints. To address these limitations, this study proposes a Knowledge-Guided Large Language Model (KG-LLM) that integrates a pediatric dental knowledge graph, retrieval-augmented generation (RAG), and a multi-stage safety validation pipeline for evidence-grounded antibiotic recommendation. The framework first employs a clinical NER/RE module to extract structured entities and relations from dental notes and radiology reports. Relevant guidelines, drug-safety rules, and analogous historical cases are subsequently retrieved from the knowledge graph and supplied to the LLM for diagnostic summarization and dose-drug-duration prediction. Safety assurance is achieved through a dual-layer validation mechanism combining deterministic rule checking with a learned classifier for detecting allergies, contraindications, and dosing errors. Experiments on 32,000 de-identified pediatric dental visit records demonstrate the effectiveness of the proposed approach. Compared with a domain-adapted Llama-2 clinical baseline, KG-LLM improves record-understanding performance (F1: 0.914 vs. 0.867), drug-dose-duration accuracy (Top-1: 0.782 vs. 0.716), and reduces unsafe antibiotic suggestions by 50%. Additional evaluation across summary quality, recommendation accuracy, and global safety scores further confirms the robustness of the system. Ablation analyses indicate that the knowledge graph, RAG, and safety modules each contribute substantially to clinical reliability and interpretability.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ DatologyAI, :, Luke Merrick, Alex Fang, Aldo Carranza et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF--IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yasaman Esfandiari, Jocelyn Rego, Austin Meyer, Jonathan Gallagher, Mia Levy
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) and transformer architectures have shown impressive reasoning and generation capabilities across diverse natural language tasks. However, their reliability and robustness in real-world engineering domains remain largely unexplored, limiting their practical utility in human-centric workflows. In this work, we investigate the applicability and consistency of LLMs for analog circuit design -- a task requiring domain-specific reasoning, adherence to physical constraints, and structured representations -- focusing on AI-assisted design where humans remain in the loop. We study how different data representations influence model behavior and compare smaller models (e.g., T5, GPT-2) with larger foundation models (e.g., Mistral-7B, GPT-oss-20B) under varying training conditions. Our results highlight key reliability challenges, including sensitivity to data format, instability in generated designs, and limited generalization to unseen circuit configurations. These findings provide early evidence on the limits and potential of LLMs as tools to enhance human capabilities in complex engineering tasks, offering insights into designing reliable, deployable foundation models for structured, real-world applications.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Phevos Paschalidis, Vasiliki Stoumpou, Lisa Everest, Yu Ma, Talhat Azemi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Transcatheter Aortic Valve Replacement (TAVR) has emerged as a minimally invasive treatment option for patients with severe aortic stenosis, a life-threatening cardiovascular condition. Multiple transcatheter heart valves (THV) have been approved for use in TAVR, but current guidelines regarding valve type prescription remain an active topic of debate. We propose a data-driven clinical support tool to identify the optimal valve type with the objective of minimizing the risk of permanent pacemaker implantation (PPI), a predominant postoperative complication. We synthesize a novel dataset that combines U.S. and Greek patient populations and integrates three distinct data sources (patient demographics, computed tomography scans, echocardiograms) while harmonizing differences in each country's record system. We introduce a leaf-level analysis to leverage population heterogeneity and avoid benchmarking against uncertain counterfactual risk estimates. The final prescriptive model shows a reduction in PPI rates of 26% and 16% compared with the current standard of care in our internal U.S. population and external Greek validation cohort, respectively. To the best of our knowledge, this work represents the first unified, personalized prescription strategy for THV selection in TAVR.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hunjae Lee
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Transformers and more specifically decoder-only transformers dominate modern LLM architectures. While they have shown to work exceptionally well, they are not without issues, resulting in surprising failure modes and predictably asymmetric performance degradation. This article is a study of many of these observed failure modes of transformers through the lens of graph neural network (GNN) theory. We first make the case that much of deep learning, including transformers, is about learnable information mixing and propagation. This makes the study of model failure modes a study of bottlenecks in information propagation. This naturally leads to GNN theory, where there is already a rich literature on information propagation bottlenecks and theoretical failure modes of models. We then make the case that many issues faced by GNNs are also experienced by transformers. In addition, we analyze how the causal nature of decoder-only transformers create interesting geometric properties in information propagation, resulting in predictable and potentially devastating failure modes. Finally, we observe that existing solutions in transformer research tend to be ad-hoc and driven by intuition rather than grounded theoretical motivation. As such, we unify many such solutions under a more theoretical perspective, providing insight into why they work, what problem they are actually solving, and how they can be further improved to target specific failure modes of transformers. Overall, this article is an attempt to bridge the gap between observed failure modes in transformers and a general lack of theoretical understanding of them in this space.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Muhammad Abid, Omer San
ๆฅ็ๆ่ฆ
Deep Operator Networks (DeepONets) have become a central tool in data-driven operator learning, providing flexible surrogates for nonlinear mappings arising in partial differential equations (PDEs). However, the standard trunk design based on fully connected layers acting on raw spatial or spatiotemporal coordinates struggles to represent sharp gradients, boundary layers, and non-periodic structures commonly found in PDEs posed on bounded domains with Dirichlet or Neumann boundary conditions. To address these limitations, we introduce the Spectral-Embedded DeepONet (SEDONet), a new DeepONet variant in which the trunk is driven by a fixed Chebyshev spectral dictionary rather than coordinate inputs. This non-periodic spectral embedding provides a principled inductive bias tailored to bounded domains, enabling the learned operator to capture fine-scale non-periodic features that are difficult for Fourier or MLP trunks to represent. SEDONet is evaluated on a suite of PDE benchmarks including 2D Poisson, 1D Burgers, 1D advection-diffusion, Allen-Cahn dynamics, and the Lorenz-96 chaotic system, covering elliptic, parabolic, advective, and multiscale temporal phenomena, all of which can be viewed as canonical problems in computational mechanics. Across all datasets, SEDONet consistently achieves the lowest relative L2 errors among DeepONet, FEDONet, and SEDONet, with average improvements of about 30-40% over the baseline DeepONet and meaningful gains over Fourier-embedded variants on non-periodic geometries. Spectral analyses further show that SEDONet more accurately preserves high-frequency and boundary-localized features, demonstrating the value of Chebyshev embeddings in non-periodic operator learning. The proposed architecture offers a simple, parameter-neutral modification to DeepONets, delivering a robust and efficient spectral framework for surrogate modeling of PDEs on bounded domains.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Gabrielle Rives, Olivier Lopez, Nicolas Bousquet
Machine Learning
ๆฅ็ๆ่ฆ
The Weibull distribution is a commonly adopted choice for modeling the survival of systems subject to maintenance over time. When only proxy indicators and censored observations are available, it becomes necessary to express the distribution's parameters as functions of time-dependent covariates. Deep neural networks provide the flexibility needed to learn complex relationships between these covariates and operational lifetime, thereby extending the capabilities of traditional regression-based models. Motivated by the analysis of a fleet of military vehicles operating in highly variable and demanding environments, as well as by the limitations observed in existing methodologies, this paper introduces WTNN, a new neural network-based modeling framework specifically designed for Weibull survival studies. The proposed architecture is specifically designed to incorporate qualitative prior knowledge regarding the most influential covariates, in a manner consistent with the shape and structure of the Weibull distribution. Through numerical experiments, we show that this approach can be reliably trained on proxy and right-censored data, and is capable of producing robust and interpretable survival predictions that can improve existing approaches.
๐
2025-12-09
๐ ArXiv Preprint (q-bio.QM)
๐ฅ Peter W Fields, Vudtiwat Ngampruetikorn, David J Schwab, Stephanie E Palmer
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Generative models of complex systems often require post-hoc parameter adjustments to produce useful outputs. For example, energy-based models for protein design are sampled at an artificially low ''temperature'' to generate novel, functional sequences. This temperature tuning is a common yet poorly understood heuristic used across machine learning contexts to control the trade-off between generative fidelity and diversity. Here, we develop an interpretable, physically motivated framework to explain this phenomenon. We demonstrate that in systems with a large ''energy gap'' - separating a small fraction of meaningful states from a vast space of unrealistic states - learning from sparse data causes models to systematically overestimate high-energy state probabilities, a bias that lowering the sampling temperature corrects. More generally, we characterize how the optimal sampling temperature depends on the interplay between data size and the system's underlying energy landscape. Crucially, our results show that lowering the sampling temperature is not always desirable; we identify the conditions where \emph{raising} it results in better generative performance. Our framework thus casts post-hoc temperature tuning as a diagnostic tool that reveals properties of the true data distribution and the limits of the learned model.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Metod Jazbec, Theo X. Olausson, Louis Bรฉthune, Pierre Ablin, Michael Kirchhof et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. Efficiency can be gained by unmasking several tokens in parallel, but doing too many at once risks degrading the generation quality. Thus, one critical design aspect of dLLMs is the sampling procedure that selects, at each step of the diffusion process, which tokens to replace. Indeed, recent work has found that heuristic strategies such as confidence thresholding lead to both higher quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger buffer sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy architecture based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive generation, while outperforming them in the full diffusion setting. We also examine the transferability of these policies, finding that they can generalize to new underlying dLLMs and longer sequence lengths. However, we also observe that their performance degrades when applied to out-of-domain data, and that fine-grained tuning of the accuracy-efficiency trade-off can be challenging with our approach.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shihao Li, Jiachen Li, Dongmei Chen
Machine Learning
ๆฅ็ๆ่ฆ
Data attribution methods identify which training examples are responsible for a model's predictions, but their sensitivity to distributional perturbations undermines practical reliability. We present a unified framework for certified robust attribution that extends from convex models to deep networks. For convex settings, we derive Wasserstein-Robust Influence Functions (W-RIF) with provable coverage guarantees. For deep networks, we demonstrate that Euclidean certification is rendered vacuous by spectral amplification -- a mechanism where the inherent ill-conditioning of deep representations inflates Lipschitz bounds by over $10{,}000\times$. This explains why standard TRAK scores, while accurate point estimates, are geometrically fragile: naive Euclidean robustness analysis yields 0\% certification. Our key contribution is the Natural Wasserstein metric, which measures perturbations in the geometry induced by the model's own feature covariance. This eliminates spectral amplification, reducing worst-case sensitivity by $76\times$ and stabilizing attribution estimates. On CIFAR-10 with ResNet-18, Natural W-TRAK certifies 68.7\% of ranking pairs compared to 0\% for Euclidean baselines -- to our knowledge, the first non-vacuous certified bounds for neural network attribution. Furthermore, we prove that the Self-Influence term arising from our analysis equals the Lipschitz constant governing attribution stability, providing theoretical grounding for leverage-based anomaly detection. Empirically, Self-Influence achieves 0.970 AUROC for label noise detection, identifying 94.1\% of corrupted labels by examining just the top 20\% of training data.
๐
2025-12-09
๐ ArXiv Preprint (eess.IV)
๐ฅ Pedro M. Gordaliza, Nataliia Molchanova, Jaume Banus, Thomas Sanchez, Meritxell Bach Cuadra
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Deep learning models for medical image segmentation suffer significant performance drops due to distribution shifts, but the causal mechanisms behind these drops remain poorly understood. We extend causal attribution frameworks to high-dimensional segmentation tasks, quantifying how acquisition protocols and annotation variability independently contribute to performance degradation. We model the data-generating process through a causal graph and employ Shapley values to fairly attribute performance changes to individual mechanisms. Our framework addresses unique challenges in medical imaging: high-dimensional outputs, limited samples, and complex mechanism interactions. Validation on multiple sclerosis (MS) lesion segmentation across 4 centers and 7 annotators reveals context-dependent failure modes: annotation protocol shifts dominate when crossing annotators (7.4% $\pm$ 8.9% DSC attribution), while acquisition shifts dominate when crossing imaging centers (6.5% $\pm$ 9.1%). This mechanism-specific quantification enables practitioners to prioritize targeted interventions based on deployment context.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Oscar Eliasson
ๆฅ็ๆ่ฆ
The Kolmogorov-Arnold representation theorem offers a theoretical alternative to Multi-Layer Perceptrons (MLPs) by placing learnable univariate functions on edges rather than nodes. While recent implementations such as Kolmogorov-Arnold Networks (KANs) demonstrate high approximation capabilities, they suffer from significant parameter inefficiency due to the requirement of maintaining unique parameterizations for every network edge. In this work, we propose GS-KAN (Generalized Sprecher-KAN), a lightweight architecture inspired by David Sprecher's refinement of the superposition theorem. GS-KAN constructs unique edge functions by applying learnable linear transformations to a single learnable, shared parent function per layer. We evaluate GS-KAN against existing KAN architectures and MLPs across synthetic function approximation, tabular data regression and image classification tasks. Our results demonstrate that GS-KAN outperforms both MLPs and standard KAN baselines on continuous function approximation tasks while maintaining superior parameter efficiency. Additionally, GS-KAN achieves competitive performance with existing KAN architectures on tabular regression and outperforms MLPs on high-dimensional classification tasks. Crucially, the proposed architecture enables the deployment of KAN-based architectures in high-dimensional regimes under strict parameter constraints, a setting where standard implementations are typically infeasible due to parameter explosion. The source code is available at https://github.com/rambamn48/gs-impl.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Moazzam Umer Gondal, Hamad ul Qudous, Asma Ahmad Farhan
Machine Learning
ๆฅ็ๆ่ฆ
Accurate forecasting of urban air pollution is essential for protecting public health and guiding mitigation policies. While Deep Learning (DL) and hybrid pipelines dominate recent research, their complexity and limited interpretability hinder operational use. This study investigates whether lightweight additive models -- Facebook Prophet (FBP) and NeuralProphet (NP) -- can deliver competitive forecasts for particulate matter (PM$_{2.5}$, PM$_{10}$) in Beijing, China. Using multi-year pollutant and meteorological data, we applied systematic feature selection (correlation, mutual information, mRMR), leakage-safe scaling, and chronological data splits. Both models were trained with pollutant and precursor regressors, with NP additionally leveraging lagged dependencies. For context, two machine learning baselines (LSTM, LightGBM) and one traditional statistical model (SARIMAX) were also implemented. Performance was evaluated on a 7-day holdout using MAE, RMSE, and $R^2$. Results show that FBP consistently outperformed NP, SARIMAX, and the learning-based baselines, achieving test $R^2$ above 0.94 for both pollutants. These findings demonstrate that interpretable additive models remain competitive with both traditional and complex approaches, offering a practical balance of accuracy, transparency, and ease of deployment.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shangqing Xu, Zhiyuan Zhao, Megha Sharma, Josรฉ Marรญa Martรญn-Olalla, Alexander Rodrรญguez et al.
Machine Learning
ๆฅ็ๆ่ฆ
Severe heatwaves in urban areas significantly threaten public health, calling for establishing early warning strategies. Despite predicting occurrence of heatwaves and attributing historical mortality, predicting an incoming deadly heatwave remains a challenge due to the difficulty in defining and estimating heat-related mortality. Furthermore, establishing an early warning system imposes additional requirements, including data availability, spatial and temporal robustness, and decision costs. To address these challenges, we propose DeepTherm, a modular early warning system for deadly heatwave prediction without requiring heat-related mortality history. By highlighting the flexibility of deep learning, DeepTherm employs a dual-prediction pipeline, disentangling baseline mortality in the absence of heatwaves and other irregular events from all-cause mortality. We evaluated DeepTherm on real-world data across Spain. Results demonstrate consistent, robust, and accurate performance across diverse regions, time periods, and population groups while allowing trade-off between missed alarms and false alarms.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh
Machine Learning
ๆฅ็ๆ่ฆ
Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and management. However, deploying state-of-the-art deep learning models like ConvNeXtV2-Large in clinical settings is hindered by their computational demands. Therefore, it is desirable to develop efficient models that maintain high diagnostic performance while enabling real-time deployment. In this study, a novel knowledge distillation framework, termed KD-OCT, is proposed to compress a high-performance ConvNeXtV2-Large teacher model, enhanced with advanced augmentations, stochastic weight averaging, and focal loss, into a lightweight EfficientNet-B2 student for classifying normal, drusen, and CNV cases. KD-OCT employs real-time distillation with a combined loss balancing soft teacher knowledge transfer and hard ground-truth supervision. The effectiveness of the proposed method is evaluated on the Noor Eye Hospital (NEH) dataset using patient-level cross-validation. Experimental results demonstrate that KD-OCT outperforms comparable multi-scale or feature-fusion OCT classifiers in efficiency- accuracy balance, achieving near-teacher performance with substantial reductions in model size and inference time. Despite the compression, the student model exceeds most existing frameworks, facilitating edge deployment for AMD screening. Code is available at https://github.com/erfan-nourbakhsh/KD- OCT.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Luis Rangel DaCosta, Mary C. Scott
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Neural networks, while effective for tackling many challenging scientific tasks, are not known to perform well out-of-distribution (OOD), i.e., within domains which differ from their training data. Understanding neural network OOD generalization is paramount to their successful deployment in experimental workflows, especially when ground-truth knowledge about the experiment is hard to establish or experimental conditions significantly vary. With inherent access to ground-truth information and fine-grained control of underlying distributions, simulation-based data curation facilitates precise investigation of OOD generalization behavior. Here, we probe generalization with respect to imaging conditions of neural network segmentation models for high-resolution transmission electron microscopy (HRTEM) imaging of nanoparticles, training and measuring the OOD generalization of over 12,000 neural networks using synthetic data generated via random structure sampling and multislice simulation. Using the HRTEM contrast transfer function, we further develop a framework to compare information content of HRTEM datasets and quantify OOD domain shifts. We demonstrate that neural network segmentation models enjoy significant performance stability, but will smoothly and predictably worsen as imaging conditions shift from the training distribution. Lastly, we consider limitations of our approach in explaining other OOD shifts, such as of the atomic structures, and discuss complementary techniques for understanding generalization in such settings.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seongyong Kim, Yong Kwon Cho
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Accurate 3D scene interpretation in active construction sites is essential for progress monitoring, safety assessment, and digital twin development. LiDAR is widely used in construction because it offers advantages over camera-based systems, performing reliably in cluttered and dynamically changing conditions. Yet most public datasets for 3D perception are derived from densely fused scans with uniform sampling and complete visibility, conditions that do not reflect real construction sites. Field data are often collected as isolated single-station LiDAR views, constrained by safety requirements, limited access, and ongoing operations. These factors lead to radial density decay, fragmented geometry, and view-dependent visibility-characteristics that remain underrepresented in existing datasets. This paper presents SIP, Site in Pieces, a dataset created to reflect the practical constraints of LiDAR acquisition during construction. SIP provides indoor and outdoor scenes captured with a terrestrial LiDAR scanner and annotated at the point level using a taxonomy tailored to construction environments: A. Built Environment, B. Construction Operations, and C. Site Surroundings. The dataset includes both structural components and slender temporary objects such as scaffolding, MEP piping, and scissor lifts, where sparsity caused by occlusion and fragmented geometry make segmentation particularly challenging. The scanning protocol, annotation workflow, and quality control procedures establish a consistent foundation for the dataset. SIP is openly available with a supporting Git repository, offering adaptable class configurations that streamline adoption within modern 3D deep learning frameworks. By providing field data that retain real-world sensing characteristics, SIP enables robust benchmarking and contributes to advancing construction-oriented 3D vision tasks.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Marina Vicens-Miquel, Amy McGovern, Aaron J. Hill, Efi Foufoula-Georgiou, Clement Guilloteau et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate precipitation forecasting is essential for hydrometeorological risk management, especially for anticipating extreme rainfall that can lead to flash flooding and infrastructure damage. This study introduces a diffusion-based deep learning (DL) framework that systematically compares three residual prediction strategies differing only in their input sources: (1) a fully data-driven model using only past observations from the Multi-Radar Multi-Sensor (MRMS) system, (2) a corrective model using only forecasts from the High-Resolution Rapid Refresh (HRRR) numerical weather prediction system, and (3) a hybrid model integrating both MRMS and selected HRRR forecast variables. By evaluating these approaches under a unified setup, we provide a clearer understanding of how each data source contributes to predictive skill over the Continental United States (CONUS). Forecasts are produced at 1-km spatial resolution, beginning with direct 1-hour predictions and extending to 12 hours using autoregressive rollouts. Performance is evaluated using both CONUS-wide and region-specific metrics that assess overall performance and skill at extreme rainfall thresholds. Across all lead times, our DL framework consistently outperforms the HRRR baseline in pixel-wise and spatiostatistical metrics. The hybrid model performs best at the shortest lead time, while the HRRR-corrective model outperforms others at longer lead times, maintaining high skill through 12 hours. To assess reliability, we incorporate calibrated uncertainty quantification tailored to the residual learning setup. These gains, particularly at longer lead times, are critical for emergency preparedness, where modest increases in forecast horizon can improve decision-making. This work advances DL-based precipitation forecasting by enhancing predictive skill, reliability, and applicability across regions.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alon Arad, Saharon Rosset
Machine Learning
ๆฅ็ๆ่ฆ
Accurate and reliable probability predictions are essential for multi-class supervised learning tasks, where well-calibrated models enable rational decision-making. While isotonic regression has proven effective for binary calibration, its extension to multi-class problems via one-vs-rest calibration produced suboptimal results when compared to parametric methods, limiting its practical adoption. In this work, we propose novel isotonic normalization-aware techniques for multiclass calibration, grounded in natural and intuitive assumptions expected by practitioners. Unlike prior approaches, our methods inherently account for probability normalization by either incorporating normalization directly into the optimization process (NA-FIR) or modeling the problem as a cumulative bivariate isotonic regression (SCIR). Empirical evaluation on a variety of text and image classification datasets across different model architectures reveals that our approach consistently improves negative log-likelihood (NLL) and expected calibration error (ECE) metrics.
๐
2025-12-09
๐ To be submitted to The Open Journal of Astrophysic
๐ฅ Daniele Sorini, Sownak Bose, Mathilda Denison, Romeel Davรฉ
Machine Learning
ๆฅ็ๆ่ฆ
Stellar and AGN-driven feedback processes affect the distribution of gas on a wide range of scales, from within galaxies well into the intergalactic medium. Yet, it remains unclear how feedback, through its connection to key galaxy properties, shapes the radial gas density profile in the host halo. We tackle this question using suites of the EAGLE, IllustrisTNG, and Simba cosmological hydrodynamical simulations, which span a variety of feedback models. We develop a random forest algorithm that predicts the radial gas density profile within haloes from the total halo mass and five global properties of the central galaxy: gas and stellar mass; star formation rate; mass and accretion rate of the central black hole (BH). The algorithm reproduces the simulated gas density profiles with an average accuracy of $\sim$80-90% over the halo mass range $10^{9.5} \, \mathrm{M}_{\odot} < M_{\rm 200c} < 10^{15} \, \mathrm{M}_{\odot}$ and redshift interval $0
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ DatologyAI, :, Luke Merrick, Alex Fang, Aldo Carranza et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF--IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
๐
2025-12-09
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Liang Luo, Yuxin Chen, Zhengyu Zhang, Mengyue Hang, Andrew Gu et al.
Machine Learning
ๆฅ็ๆ่ฆ
The rapidly evolving landscape of products, surfaces, policies, and regulations poses significant challenges for deploying state-of-the-art recommendation models at industry scale, primarily due to data fragmentation across domains and escalating infrastructure costs that hinder sustained quality improvements.
To address this challenge, we propose Lattice, a recommendation framework centered around model space redesign that extends Multi-Domain, Multi-Objective (MDMO) learning beyond models and learning objectives. Lattice addresses these challenges through a comprehensive model space redesign that combines cross-domain knowledge sharing, data consolidation, model unification, distillation, and system optimizations to achieve significant improvements in both quality and cost-efficiency.
Our deployment of Lattice at Meta has resulted in 10% revenue-driving top-line metrics gain, 11.5% user satisfaction improvement, 6% boost in conversion rate, with 20% capacity saving.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Yuji Takubo, Arpit Dwivedi, Sukeerth Ramkumar, Luis A. Pabon, Daniele Gammelli et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Reliable real-time trajectory generation is essential for future autonomous spacecraft. While recent progress in nonconvex guidance and control is paving the way for onboard autonomous trajectory optimization, these methods still rely on extensive expert input (e.g., waypoints, constraints, mission timelines, etc.), which limits the operational scalability in real rendezvous missions.This paper introduces SAGES (Semantic Autonomous Guidance Engine for Space), a trajectory-generation framework that translates natural-language commands into spacecraft trajectories that reflect high-level intent while respecting nonconvex constraints. Experiments in two settings -- fault-tolerant proximity operations with continuous-time constraint enforcement and a free-flying robotic platform -- demonstrate that SAGES reliably produces trajectories aligned with human commands, achieving over 90\% semantic-behavioral consistency across diverse behavior modes. Ultimately, this work marks an initial step toward language-conditioned, constraint-aware spacecraft trajectory generation, enabling operators to interactively guide both safety and behavior through intuitive natural-language commands with reduced expert burden.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Adi Manor, Dan Cohen, Ziv Keidar, Avi Parush, Hadas Erel
ๆฅ็ๆ่ฆ
Cognitive trust and the belief that a robot is capable of accurately performing tasks, are recognized as central factors in fostering high-quality human-robot interactions. It is well established that performance factors such as the robot's competence and its reliability shape cognitive trust. Recent studies suggest that affective factors, such as robotic attentiveness, also play a role in building cognitive trust. This work explores the interplay between these two factors that shape cognitive trust. Specifically, we evaluated whether different combinations of robotic competence and attentiveness introduce a compensatory mechanism, where one factor compensates for the lack of the other. In the experiment, participants performed a search task with a robotic dog in a 2x2 experimental design that included two factors: competence (high or low) and attentiveness (high or low). The results revealed that high attentiveness can compensate for low competence. Participants who collaborated with a highly attentive robot that performed poorly reported trust levels comparable to those working with a highly competent robot. When the robot did not demonstrate attentiveness, low competence resulted in a substantial decrease in cognitive trust. The findings indicate that building cognitive trust in human-robot interaction may be more complex than previously believed, involving emotional processes that are typically overlooked. We highlight an affective compensatory mechanism that adds a layer to consider alongside traditional competence-based models of cognitive trust.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Lipeng Zhuang, Shiyu Fan, Florent P. Audonnet, Yingdong Ru, Gerardo Aragon Camarasa et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens, and train a conditional masked transformer that generates tokens in parallel and then rapidly refines only low-confidence tokens. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens based on new observations. With globally coherent prediction and robust adaptive execution capabilities, MGP-Long enables reliable control on complex and non-Markovian tasks that prior methods struggle with. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and superior success rates compared to state-of-the-art diffusion and autoregressive policies. Specifically, MGP increases the average success rate by 9% across 150 tasks while cutting per-sequence inference time by up to 35x. It further improves the average success rate by 60% in dynamic and missing-observation environments, and solves two non-Markovian scenarios where other state-of-the-art methods fail.
๐
2025-12-09
๐ ArXiv Preprint (eess.SY)
๐ฅ Rachel DiPirro, Rosalyn Devonport, Dan Calderone, Chishang "Mario'' Yang, Wendy Ju et al.
Robotics
ๆฅ็ๆ่ฆ
Understanding driver interactions is critical to designing autonomous vehicles to interoperate safely with human-driven cars. We consider the impact of these interactions on the policies drivers employ when navigating unsigned intersections in a driving simulator. The simulator allows the collection of naturalistic decision-making and behavior data in a controlled environment. Using these data, we model the human driver responses as state-based feedback controllers learned via Gaussian Process regression methods. We compute the feedback gain of the controller using a weighted combination of linear and nonlinear priors. We then analyze how the individual gains are reflected in driver behavior. We also assess differences in these controllers across populations of drivers. Our work in data-driven analyses of how drivers determine their policies can facilitate future work in the design of socially responsive autonomy for vehicles.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Xinyu Qi, Zeyu Deng, Shaun Alexander Macdonald, Liying Li, Chen Wang et al.
Robotics
ๆฅ็ๆ่ฆ
A remote robot operator's affective state can significantly impact the resulting robot's motions leading to unexpected consequences, even when the user follows protocol and performs permitted tasks. The recognition of a user operator's affective states in remote robot control scenarios is, however, underexplored. Current emotion recognition methods rely on reading the user's vital signs or body language, but the devices and user participation these measures require would add limitations to remote robot control. We demonstrate that the functional movements of a remote-controlled robotic avatar, which was not designed for emotional expression, can be used to infer the emotional state of the human operator via a machine-learning system. Specifically, our system achieved 83.3$\%$ accuracy in recognizing the user's emotional state expressed by robot movements, as a result of their hand motions. We discuss the implications of this system on prominent current and future remote robot operation and affective robotic contexts.
๐
2025-12-09
๐ Accepted and presented at IEEE RoboticCC 2025. 4 p
๐ฅ Nick Trinh, Damian Lyons
Robotics
ๆฅ็ๆ่ฆ
Visual place recognition (VPR) is an important component technology for camera-based mapping and navigation applications. This is a challenging problem because images of the same place may appear quite different for reasons including seasonal changes, weather illumination, structural changes to the environment, as well as transient pedestrian or vehicle traffic. Papers focusing on generating image descriptors for VPR report their results using metrics such as recall@K and ROC curves. However, for a robot implementation, determining which matches are sufficiently good is often reduced to a manually set threshold. And it is difficult to manually select a threshold that will work for a variety of visual scenarios. This paper addresses the problem of automatically selecting a threshold for VPR by looking at the 'negative' Gaussian mixture statistics for a place - image statistics indicating not this place. We show that this approach can be used to select thresholds that work well for a variety of image databases and image descriptors.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Shivendra Agrawal, Jake Brawer, Ashutosh Naik, Alessandro Roncone, Bradley Hayes
Robotics
ๆฅ็ๆ่ฆ
Many indoor workspaces are quasi-static: global layout is stable but local semantics change continually, producing repetitive geometry, dynamic clutter, and perceptual noise that defeat vision-based localization. We present ShelfAware, a semantic particle filter for robust global localization that treats scene semantics as statistical evidence over object categories rather than fixed landmarks. ShelfAware fuses a depth likelihood with a category-centric semantic similarity and uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside MCL, yielding fast, targeted hypothesis generation on low-cost, vision-only hardware. Across 100 global-localization trials spanning four conditions (cart-mounted, wearable, dynamic obstacles, and sparse semantics) in a semantically dense, retail environment, ShelfAware achieves a 96% success rate (vs. 22% MCL and 10% AMCL) with a mean time-to-convergence of 1.91s, attains the lowest translational RMSE in all conditions, and maintains stable tracking in 80% of tested sequences, all while running in real time on a consumer laptop-class platform. By modeling semantics distributionally at the category level and leveraging inverse proposals, ShelfAware resolves geometric aliasing and semantic drift common to quasi-static domains. Because the method requires only vision sensors and VIO, it integrates as an infrastructure-free building block for mobile robots in warehouses, labs, and retail settings; as a representative application, it also supports the creation of assistive devices providing start-anytime, shared-control assistive navigation for people with visual impairments.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo et al.
Machine LearningRoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) Continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) Co-training on cross-embodiment robot data alongside vision-language data; and (3) Action training with reasoning process on trajectories collected on Astribot S1, a bimanual mobile manipulator with human-like dexterity and agility. Finally, we integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control. Extensive experiments demonstrate that Lumo-1 achieves significant performance improvements in embodied vision-language reasoning, a critical component for generalist robotic control. Real-world evaluations further show that Lumo-1 surpasses strong baselines across a wide range of challenging robotic tasks, with strong generalization to novel objects and environments, excelling particularly in long-horizon tasks and responding to human-natural instructions that require reasoning over strategy, concepts and space.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Nadezhda Kushina, Ko Watanabe, Aarthi Kannan, Ashita Ashok, Andreas Dengel et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Social robots must adjust to human proxemic norms to ensure user comfort and engagement. While prior research demonstrates that eye-tracking features reliably estimate comfort in human-human interactions, their applicability to interactions with humanoid robots remains unexplored. In this study, we investigate user comfort with the robot "Ameca" across four experimentally controlled distances (0.5 m to 2.0 m) using mobile eye-tracking and subjective reporting (N=19). We evaluate multiple machine learning and deep learning models to estimate comfort based on gaze features. Contrary to previous human-human studies where Transformer models excelled, a Decision Tree classifier achieved the highest performance (F1-score = 0.73), with minimum pupil diameter identified as the most critical predictor. These findings suggest that physiological comfort thresholds in human-robot interaction differ from human-human dynamics and can be effectively modeled using interpretable logic.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce two new benchmarks REST and REST+(Render-Equivalence Stress Tests) to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed) and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens have an impact on model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, highlighting a mechanistic interpretation of cross-modal inconsistent MLLMs.
๐
2025-12-09
๐ ArXiv Preprint (quant-ph)
๐ฅ David Zenati, Eliya Nachmani
Natural Language Processing
ๆฅ็ๆ่ฆ
Quantum Error Correction (QEC) decoding faces a fundamental accuracy-efficiency tradeoff. Classical methods like Minimum Weight Perfect Matching (MWPM) exhibit variable performance across noise models and suffer from polynomial complexity, while tensor network decoders achieve high accuracy but at prohibitively high computational cost. Recent neural decoders reduce complexity but lack the accuracy needed to compete with computationally expensive classical methods. We introduce SAQ-Decoder, a unified framework combining transformer-based learning with constraint aware post-processing that achieves both near Maximum Likelihood (ML) accuracy and linear computational scalability with respect to the syndrome size. Our approach combines a dual-stream transformer architecture that processes syndromes and logical information with asymmetric attention patterns, and a novel differentiable logical loss that directly optimizes Logical Error Rates (LER) through smooth approximations over finite fields. SAQ-Decoder achieves near-optimal performance, with error thresholds of 10.99% (independent noise) and 18.6% (depolarizing noise) on toric codes that approach the ML bounds of 11.0% and 18.9% while outperforming existing neural and classical baselines in accuracy, complexity, and parameter efficiency. Our findings establish that learned decoders can simultaneously achieve competitive decoding accuracy and computational efficiency, addressing key requirements for practical fault-tolerant quantum computing systems.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram
Natural Language Processing
ๆฅ็ๆ่ฆ
While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Guangzhi Xiong, Zhenghao He, Bohan Liu, Sanchit Sinha, Aidong Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Damiano Marsili, Georgia Gkioxari
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks, and show that our method improves visual reasoning and surpasses open-source and proprietary models, while with our improved visual grounding model we further outperform recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammad Abu-Shaira, Ajita Rattani, Weishi Shi
ๆฅ็ๆ่ฆ
Real-world datasets often exhibit temporal dynamics characterized by evolving data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model's predictive accuracy. Furthermore, the presence of hyperparameters in online models exacerbates this issue. These parameters are typically fixed and cannot be dynamically adjusted by the user in response to the evolving data distribution. Gaussian Process (GP) models offer powerful non-parametric regression capabilities with uncertainty quantification, making them ideal for modeling complex data relationships in an online setting. However, conventional online GP methods face several critical limitations, including a lack of drift-awareness, reliance on fixed hyperparameters, vulnerability to data snooping, absence of a principled decay mechanism, and memory inefficiencies. In response, we propose DAO-GP (Drift-Aware Online Gaussian Process), a novel, fully adaptive, hyperparameter-free, decayed, and sparse non-linear regression model. DAO-GP features a built-in drift detection and adaptation mechanism that dynamically adjusts model behavior based on the severity of drift. Extensive empirical evaluations confirm DAO-GP's robustness across stationary conditions, diverse drift types (abrupt, incremental, gradual), and varied data characteristics. Analyses demonstrate its dynamic adaptation, efficient in-memory and decay-based management, and evolving inducing points. Compared with state-of-the-art parametric and non-parametric models, DAO-GP consistently achieves superior or competitive performance, establishing it as a drift-resilient solution for online non-linear regression.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joshua Ward, Bochao Gu, Chi-Hua Wang, Guang Cheng
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have recently demonstrated remarkable performance in generating high-quality tabular synthetic data. In practice, two primary approaches have emerged for adapting LLMs to tabular data generation: (i) fine-tuning smaller models directly on tabular datasets, and (ii) prompting larger models with examples provided in context. In this work, we show that popular implementations from both regimes exhibit a tendency to compromise privacy by reproducing memorized patterns of numeric digits from their training data. To systematically analyze this risk, we introduce a simple No-box Membership Inference Attack (MIA) called LevAtt that assumes adversarial access to only the generated synthetic data and targets the string sequences of numeric digits in synthetic observations. Using this approach, our attack exposes substantial privacy leakage across a wide range of models and datasets, and in some cases, is even a perfect membership classifier on state-of-the-art models. Our findings highlight a unique privacy vulnerability of LLM-based synthetic data generation and the need for effective defenses. To this end, we propose two methods, including a novel sampling strategy that strategically perturbs digits during generation. Our evaluation demonstrates that this approach can defeat these attacks with minimal loss of fidelity and utility of the synthetic data.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jing Jie Tan, Anissa Mokraoui, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Image captioning is essential in many fields including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution image (LRI). While performance can be improved by using larger models like transformers for encoding, these models are typically heavyweight, demanding significant computational resources and memory, leading to challenges in retraining. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach presents a solution specifically designed for lightweight, low-resolution images captioning. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it an ideal choice for training on resource-constrained scenarios.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiang Chen, Yuling Shi, Qizhen Lan, Yuchao Qiu, Xiaodong Gu
Machine Learning
ๆฅ็ๆ่ฆ
LLM agents are widely deployed in complex interactive tasks, yet privacy constraints often preclude centralized optimization and co-evolution across dynamic environments. While Federated Learning (FL) has proven effective on static datasets, its extension to the open-ended self-evolution of agents remains underexplored. Directly applying standard FL is challenging: heterogeneous tasks and sparse, trajectory-level rewards introduce severe gradient conflicts, destabilizing the global optimization process. To bridge this gap, we propose Fed-SE, a Federated Self-Evolution framework for LLM agents. Fed-SE establishes a local evolution-global aggregation paradigm. Locally, agents employ parameter-efficient fine-tuning on filtered, high-return trajectories to achieve stable gradient updates. Globally, Fed-SE aggregates updates within a low-rank subspace that disentangles environment-specific dynamics, effectively reducing negative transfer across clients. Experiments across five heterogeneous environments demonstrate that Fed-SE improves average task success rates by approximately 18% over federated baselines, validating its effectiveness in robust cross-environment knowledge transfer in privacy-constrained deployments.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anantaa Kotal, Anupam Joshi
Robotics
ๆฅ็ๆ่ฆ
The widespread use of big data across sectors has raised major privacy concerns, especially when sensitive information is shared or analyzed. Regulations such as GDPR and HIPAA impose strict controls on data handling, making it difficult to balance the need for insights with privacy requirements. Synthetic data offers a promising solution by creating artificial datasets that reflect real patterns without exposing sensitive information. However, traditional synthetic data methods often fail to capture complex, implicit rules that link different elements of the data and are essential in domains like healthcare. They may reproduce explicit patterns but overlook domain-specific constraints that are not directly stated yet crucial for realism and utility. For example, prescription guidelines that restrict certain medications for specific conditions or prevent harmful drug interactions may not appear explicitly in the original data. Synthetic data generated without these implicit rules can lead to medically inappropriate or unrealistic profiles. To address this gap, we propose ContextGAN, a Context-Aware Differentially Private Generative Adversarial Network that integrates domain-specific rules through a constraint matrix encoding both explicit and implicit knowledge. The constraint-aware discriminator evaluates synthetic data against these rules to ensure adherence to domain constraints, while differential privacy protects sensitive details from the original data. We validate ContextGAN across healthcare, security, and finance, showing that it produces high-quality synthetic data that respects domain rules and preserves privacy. Our results demonstrate that ContextGAN improves realism and utility by enforcing domain constraints, making it suitable for applications that require compliance with both explicit patterns and implicit rules under strict privacy guarantees.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao et al.
ๆฅ็ๆ่ฆ
Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios while overlooking the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. To this end, we introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated through human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jean Christoph Jung, Patrick Koopmann, Matthias Knorr
ๆฅ็ๆ่ฆ
Craig interpolation and uniform interpolation have many applications in knowledge representation, including explainability, forgetting, modularization and reuse, and even learning. At the same time, many relevant knowledge representation formalisms do in general not have Craig or uniform interpolation, and computing interpolants in practice is challenging. We have a closer look at two prominent knowledge representation formalisms, description logics and logic programming, and discuss theoretical results and practical methods for computing interpolants.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. For achieving competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2\% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6\times inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shahar Sarfaty, Adi Haviv, Uri Hacohen, Niva Elkin-Koren, Roi Livni et al.
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
The rapid proliferation of generative components, such as LoRAs, has created a vast but unstructured ecosystem. Existing discovery methods depend on unreliable user descriptions or biased popularity metrics, hindering usability. We present CARLoS, a large-scale framework for characterizing LoRAs without requiring additional metadata. Analyzing over 650 LoRAs, we employ them in image generation over a variety of prompts and seeds, as a credible way to assess their behavior. Using CLIP embeddings and their difference to a base-model generation, we concisely define a three-part representation: Directions, defining semantic shift; Strength, quantifying the significance of the effect; and Consistency, quantifying how stable the effect is. Using these representations, we develop an efficient retrieval framework that semantically matches textual queries to relevant LoRAs while filtering overly strong or unstable ones, outperforming textual baselines in automated and human evaluations. While retrieval is our primary focus, the same representation also supports analyses linking Strength and Consistency to legal notions of substantiality and volition, key considerations in copyright, positioning CARLoS as a practical system with broader relevance for LoRA analysis.
๐
2025-12-09
๐ Accepted in IEEE Transactions on Multimedia (TMM)
๐ฅ Yi Zhang, Chun-Wun Cheng, Junyi He, Ke Yu, Yushun Tang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called \textit{Training-free Dual Hyperbolic Adapters} (T-DHA). We characterize the vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. Hyperbolic spaces exhibit exponential volume growth with radius, unlike the polynomial growth in Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincarรฉ ball model, achieving significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T-DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ferdinand Kapl, Emmanouil Angelis, Tobias Hรถppe, Kaitlin Maile, Johannes von Oswald et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half - also known as the Curse of Depth (Sun et al., 2025, Csordรกs et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
๐
2025-12-09
๐ WWW
๐ฅ Anna Jordanous
ๆฅ็ๆ่ฆ
Music improvisation is fascinating to study, being essentially a live demonstration of a creative process. In jazz, musicians often improvise across predefined chord progressions (leadsheets). How do we assess the creativity of jazz improvisations? And can we capture this in automated metrics for creativity for current LLM-based generative systems? Demonstration of emotional involvement is closely linked with creativity in improvisation. Analysing musical audio, can we detect emotional involvement? This study hypothesises that if an improvisation contains more evidence of emotion-laden content, it is more likely to be recognised as creative. An embeddings-based method is proposed for capturing the emotional content in musical improvisations, using a psychologically-grounded classification of musical characteristics associated with emotions. Resulting 'emovectors' are analysed to test the above hypothesis, comparing across multiple improvisations. Capturing emotional content in this quantifiable way can contribute towards new metrics for creativity evaluation that can be applied at scale.
๐
2025-12-09
๐ ArXiv Preprint (cs.SE)
๐ฅ Viola Campos, Robin Kuschnereit, Adrian Ulges
Natural Language Processing
ๆฅ็ๆ่ฆ
As AI-based code generation becomes widespread, researchers are investigating the calibration of code LLMs - ensuring their confidence scores faithfully represent the true likelihood of code correctness. To do so, we investigate multicalibration, which can capture additional factors about a coding problem, such as complexity, code length, or programming language used. We study four multicalibration approaches on three function synthesis benchmarks, using latest-generation code LLMs (Qwen3 Coder, GPT-OSS, DeepSeek-R1-Distill). Our results demonstrate that multicalibration can yield distinct improvements over both uncalibrated token likelihoods (+1.03 in skill score) and baseline calibrations (+0.37 in skill score). We study the influence of the aforementioned factors in ablations, and make our dataset (consisting of code generations, likelihoods, and correctness labels) available for future research on code LLM calibration.
๐
2025-12-09
๐ Accepted at IEEE INFOCOM 2026 (full version)
๐ฅ Yi Liu, Weixiang Han, Chengjun Cai, Xingliang Yuan, Cong Wang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
With the rise of large language models, service providers offer language models as a service, enabling users to fine-tune customized models via uploaded private datasets. However, this raises concerns about sensitive data leakage. Prior methods, relying on differential privacy within device-cloud collaboration frameworks, struggle to balance privacy and utility, exposing users to inference attacks or degrading fine-tuning performance. To address this, we propose PrivTune, an efficient and privacy-preserving fine-tuning framework via Split Learning (SL). The key idea of PrivTune is to inject crafted noise into token representations from the SL bottom model, making each token resemble the $n$-hop indirect neighbors. PrivTune formulates this as an optimization problem to compute the optimal noise vector, aligning with defense-utility goals. On this basis, it then adjusts the parameters (i.e., mean) of the $d_ฯ$-Privacy noise distribution to align with the optimization direction and scales the noise according to token importance to minimize distortion. Experiments on five datasets (covering both classification and generation tasks) against three embedding inversion and three attribute inference attacks show that, using RoBERTa on the Stanford Sentiment Treebank dataset, PrivTune reduces the attack success rate to 10% with only a 3.33% drop in utility performance, outperforming state-of-the-art baselines.
๐
2025-12-09
๐ WWW
๐ฅ Sadegh Momeni, Ge Zhang, Birkett Huber, Hamza Harkous, Sam Lipton et al.
Machine Learning
ๆฅ็ๆ่ฆ
Despite advancements in machine learning for security, rule-based detection remains prevalent in Security Operations Centers due to the resource intensiveness and skill gap associated with ML solutions. While traditional rule-based methods offer efficiency, their rigidity leads to high false positives or negatives and requires continuous manual maintenance. This paper proposes a novel, two-stage hybrid framework to democratize ML-based threat detection. The first stage employs intentionally loose YARA rules for coarse-grained filtering, optimized for high recall. The second stage utilizes an ML classifier to filter out false positives from the first stage's output. To overcome data scarcity, the system leverages Simula, a seedless synthetic data generation framework, enabling security analysts to create high-quality training datasets without extensive data science expertise or pre-labeled examples. A continuous feedback loop incorporates real-time investigation results to adaptively tune the ML model, preventing rule degradation.
This proposed model with active learning has been rigorously tested for a prolonged time in a production environment spanning tens of thousands of systems. The system handles initial raw log volumes often reaching 250 billion events per day, significantly reducing them through filtering and ML inference to a handful of daily tickets for human investigation. Live experiments over an extended timeline demonstrate a general improvement in the model's precision over time due to the active learning feature. This approach offers a self-sustained, low-overhead, and low-maintenance solution, allowing security professionals to guide model learning as expert ``teachers''.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jeongwhan Choi, Woosung Kang, Minseo Kim, Jongwoo Kim, Noseong Park
Natural Language Processing
ๆฅ็ๆ่ฆ
Foundation models pretrained on large data have demonstrated remarkable zero-shot generalization capabilities across domains. Building on the success of TabPFN for tabular data and its recent extension to time series, we investigate whether graph node classification can be effectively reformulated as a tabular learning problem. We introduce TabPFN-GN, which transforms graph data into tabular features by extracting node attributes, structural properties, positional encodings, and optionally smoothed neighborhood features. This enables TabPFN to perform direct node classification without any graph-specific training or language model dependencies. Our experiments on 12 benchmark datasets reveal that TabPFN-GN achieves competitive performance with GNNs on homophilous graphs and consistently outperforms them on heterophilous graphs. These results demonstrate that principled feature engineering can bridge the gap between tabular and graph domains, providing a practical alternative to task-specific GNN training and LLM-dependent graph foundation models.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chaewon Kim, Seoyeon Lee, Jonghyuk Park
Natural Language Processing
ๆฅ็ๆ่ฆ
Document shadow removal is essential for enhancing the clarity of digitized documents. Preserving high-frequency details (e.g., text edges and lines) is critical in this process because shadows often obscure or distort fine structures. This paper proposes a matte vision transformer (MatteViT), a novel shadow removal framework that applies spatial and frequency-domain information to eliminate shadows while preserving fine-grained structural details. To effectively retain these details, we employ two preservation strategies. First, our method introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components. Second, we present a continuous luminance-based shadow matte, generated using a custom-built matte dataset and shadow matte generator, which provides precise spatial guidance from the earliest processing stage. These strategies enable the model to accurately identify fine-grained regions and restore them with high fidelity. Extensive experiments on public benchmarks (RDD and Kligler) demonstrate that MatteViT achieves state-of-the-art performance, providing a robust and practical solution for real-world document shadow removal. Furthermore, the proposed method better preserves text-level details in downstream tasks, such as optical character recognition, improving recognition performance over prior methods.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ David Samuel, Lilja รvrelid, Erik Velldal, Andrey Kutuzov
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference-optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmรฅl and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seoyeon Lee, Gwangyeol Yu, Chaewon Kim, Jonghyuk Park
Computer Vision
ๆฅ็ๆ่ฆ
Diffusion models have achieved remarkable success in image synthesis. However, addressing artifacts and unrealistic regions remains a critical challenge. We propose self-refining diffusion, a novel framework that enhances image generation quality by detecting these flaws. The framework employs an explainable artificial intelligence (XAI)-based flaw highlighter to produce flaw activation maps (FAMs) that identify artifacts and unrealistic regions. These FAMs improve reconstruction quality by amplifying noise in flawed regions during the forward process and by focusing on these regions during the reverse process. The proposed approach achieves up to a 27.3% improvement in Frรฉchet inception distance across various diffusion-based models, demonstrating consistently strong performance on diverse datasets. It also shows robust effectiveness across different tasks, including image generation, text-to-image generation, and inpainting. These results demonstrate that explainable AI techniques can extend beyond interpretability to actively contribute to image refinement. The proposed framework offers a versatile and effective approach applicable to various diffusion models and tasks, significantly advancing the field of image synthesis.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Eranga Bandara, Ross Gore, Peter Foytik, Sachin Shetty, Ravi Mukkamala et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Agentic AI marks a major shift in how autonomous systems reason, plan, and execute multi-step tasks. Unlike traditional single model prompting, agentic workflows integrate multiple specialized agents with different Large Language Models(LLMs), tool-augmented capabilities, orchestration logic, and external system interactions to form dynamic pipelines capable of autonomous decision-making and action. As adoption accelerates across industry and research, organizations face a central challenge: how to design, engineer, and operate production-grade agentic AI workflows that are reliable, observable, maintainable, and aligned with safety and governance requirements. This paper provides a practical, end-to-end guide for designing, developing, and deploying production-quality agentic AI systems. We introduce a structured engineering lifecycle encompassing workflow decomposition, multi-agent design patterns, Model Context Protocol(MCP), and tool integration, deterministic orchestration, Responsible-AI considerations, and environment-aware deployment strategies. We then present nine core best practices for engineering production-grade agentic AI workflows, including tool-first design over MCP, pure-function invocation, single-tool and single-responsibility agents, externalized prompt management, Responsible-AI-aligned model-consortium design, clean separation between workflow logic and MCP servers, containerized deployment for scalable operations, and adherence to the Keep it Simple, Stupid (KISS) principle to maintain simplicity and robustness. To demonstrate these principles in practice, we present a comprehensive case study: a multimodal news-analysis and media-generation workflow. By combining architectural guidance, operational patterns, and practical implementation insights, this paper offers a foundational reference to build robust, extensible, and production-ready agentic AI workflows.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Mohammed Elseiagy, Tsige Tadesse Alemayoh, Ranulfo Bezerra, Shotaro Kojima, Kazunori Ohno
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Bridging the sim-to-real gap remains a fundamental challenge in robotics, as accurate dynamic parameter estimation is essential for reliable model-based control, realistic simulation, and safe deployment of manipulators. Traditional analytical approaches often fall short when faced with complex robot structures and interactions. Data-driven methods offer a promising alternative, yet conventional neural networks such as recurrent models struggle to capture long-range dependencies critical for accurate estimation. In this study, we propose a Transformer-based approach for dynamic parameter estimation, supported by an automated pipeline that generates diverse robot models and enriched trajectory data using Jacobian-derived features. The dataset consists of 8,192 robots with varied inertial and frictional properties. Leveraging attention mechanisms, our model effectively captures both temporal and spatial dependencies. Experimental results highlight the influence of sequence length, sampling rate, and architecture, with the best configuration (sequence length 64, 64 Hz, four layers, 32 heads) achieving a validation R2 of 0.8633. Mass and inertia are estimated with near-perfect accuracy, Coulomb friction with moderate-to-high accuracy, while viscous friction and distal link center-of-mass remain more challenging. These results demonstrate that combining Transformers with automated dataset generation and kinematic enrichment enables scalable, accurate dynamic parameter estimation, contributing to improved sim-to-real transfer in robotic systems
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dongdong Yang, Bin Li, Jiguang He
Machine Learning
ๆฅ็ๆ่ฆ
Reconfigurable intelligent surface (RIS) and simultaneously transmitting and reflecting RIS (STAR-RIS) have emerged as key enablers for enhancing wireless coverage and capacity in next-generation networks. When mounted on unmanned aerial vehicles (UAVs), they benefit from flexible deployment and improved line-of-sight conditions. Despite their promising potential, a comprehensive performance comparison between aerial RIS and STAR-RIS architectures has not been thoroughly investigated. This letter presents a detailed performance comparison between aerial RIS and STAR-RIS in three-dimensional wireless environments. Accurate channel models incorporating directional radiation patterns are established, and the influence of deployment altitude and orientation is thoroughly examined. To optimize the system sum-rate, we formulate joint optimization problems for both architectures and propose an efficient solution based on the weighted minimum mean square error and block coordinate descent algorithms. Simulation results reveal that STAR-RIS outperforms RIS in low-altitude scenarios due to its full-space coverage capability, whereas RIS delivers better performance near the base station at higher altitudes. The findings provide practical insights for the deployment of aerial intelligent surfaces in future 6G communication systems.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shuyue Hu, Haoyang Yan, Yiqun Zhang, Yang Chen, Dongzhan Zhou et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Foundation models (FMs) are increasingly assuming the role of the "brain" of AI agents. While recent efforts have begun to equip FMs with native single-agent abilities -- such as GUI interaction or integrated tool use -- we argue that the next frontier is endowing FMs with native multi-agent intelligence. We identify four core capabilities of FMs in multi-agent contexts: understanding, planning, efficient communication, and adaptation. Contrary to assumptions about the spontaneous emergence of such abilities, we provide extensive empirical evidence across 41 large language models showing that strong single-agent performance alone does not automatically yield robust multi-agent intelligence. To address this gap, we outline key research directions -- spanning dataset construction, evaluation, training paradigms, and safety considerations -- for building FMs with native multi-agent intelligence.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yiming Lu
ๆฅ็ๆ่ฆ
Currently, there exists a fundamental divide between the "cognitive black box" (implicit intuition) of human experts and the "computational black box" (untrustworthy decision-making) of artificial intelligence (AI). This paper proposes a new paradigm of "human-AI collaborative cognitive enhancement," aiming to transform the dual black boxes into a composable, auditable, and extensible "functional white-box" system through structured "meta-interaction." The core breakthrough lies in the "plug-and-play cognitive framework"--a computable knowledge package that can be extracted from expert dialogues and loaded into the Recursive Adversarial Meta-Thinking Network (RAMTN). This enables expert thinking, such as medical diagnostic logic and teaching intuition, to be converted into reusable and scalable public assets, realizing a paradigm shift from "AI as a tool" to "AI as a thinking partner." This work not only provides the first engineering proof for "cognitive equity" but also opens up a new path for AI governance: constructing a verifiable and intervenable governance paradigm through "transparency of interaction protocols" rather than prying into the internal mechanisms of models. The framework is open-sourced to promote technology for good and cognitive inclusion. This paper is an independent exploratory research conducted by the author. All content presented, including the theoretical framework (RAMTN), methodology (meta-interaction), system implementation, and case validation, constitutes the author's individual research achievements.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuniko Paxton, Zeinab Dehghani, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Skin color has historically been a focal point of discrimination, yet fairness research in machine learning for medical imaging often relies on coarse subgroup categories, overlooking individual-level variations. Such group-based approaches risk obscuring biases faced by outliers within subgroups. This study introduces a distribution-based framework for evaluating and mitigating individual fairness in skin lesion classification. We treat skin tone as a continuous attribute rather than a categorical label, and employ kernel density estimation (KDE) to model its distribution. We further compare twelve statistical distance metrics to quantify disparities between skin tone distributions and propose a distance-based reweighting (DRW) loss function to correct underrepresentation in minority tones. Experiments across CNN and Transformer models demonstrate: (i) the limitations of categorical reweighting in capturing individual-level disparities, and (ii) the superior performance of distribution-based reweighting, particularly with Fidelity Similarity (FS), Wasserstein Distance (WD), Hellinger Metric (HM), and Harmonic Mean Similarity (HS). These findings establish a robust methodology for advancing fairness at individual level in dermatological AI systems, and highlight broader implications for sensitive continuous attributes in medical image analysis.
๐
2025-12-09
๐ ArXiv Preprint (cs.PF)
๐ฅ Sรฉbastien Piรฉrard, Adrien Deliรจge, Marc Van Droogenbroeck
ๆฅ็ๆ่ฆ
The performance of algorithms, methods, and models tends to depend heavily on the distribution of cases on which they are applied, this distribution being specific to the applicative domain. After performing an evaluation in several domains, it is highly informative to compute a (weighted) mean performance and, as shown in this paper, to scrutinize what happens during this averaging. To achieve this goal, we adopt a probabilistic framework and consider a performance as a probability measure (e.g., a normalized confusion matrix for a classification task). It appears that the corresponding weighted mean is known to be the summarization, and that only some remarkable scores assign to the summarized performance a value equal to a weighted arithmetic mean of the values assigned to the domain-specific performances. These scores include the family of ranking scores, a continuum parameterized by user preferences, and that the weights to consider in the arithmetic mean depend on the user preferences. Based on this, we rigorously define four domains, named easiest, most difficult, preponderant, and bottleneck domains, as functions of user preferences. After establishing the theory in a general setting, regardless of the task, we develop new visual tools for two-class classification.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ekhi Azurmendi, Xabier Arregi, Oier Lopez de Lacalle
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion specific scores covering correctness, richness, coherence, cohesion, and task alignment enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Rongzhao Zhang, Junqiao Wang, Shuyun Yang, Mouxiao Bian, Chao Ding et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal clinical reasoning in the field of gastrointestinal (GI) oncology necessitates the integrated interpretation of endoscopic imagery, radiological data, and biochemical markers. Despite the evident potential exhibited by Multimodal Large Language Models (MLLMs), they frequently encounter challenges such as context dilution and hallucination when confronted with intricate, heterogeneous medical histories. In order to address these limitations, a hierarchical Multi-Agent Framework is proposed, which emulates the collaborative workflow of a human Multidisciplinary Team (MDT). The system attained a composite expert evaluation score of 4.60/5.00, thereby demonstrating a substantial improvement over the monolithic baseline. It is noteworthy that the agent-based architecture yielded the most substantial enhancements in reasoning logic and medical accuracy. The findings indicate that mimetic, agent-based collaboration provides a scalable, interpretable, and clinically robust paradigm for automated decision support in oncology.
๐
2025-12-09
๐ ArXiv Preprint (cs.SE)
๐ฅ Renato Cordeiro Ferreira, Aditya Dhinavahi, Rowanne Trapmann, Willem-Jan van den Heuvel
Machine Learning
ๆฅ็ๆ่ฆ
ML-Enabled Systems (MLES) are inherently complex since they require multiple components to achieve their business goal. This experience report showcases the software architecture reusability techniques applied while building Ocean Guard, an MLES for anomaly detection in the maritime domain. In particular, it highlights the challenges and lessons learned to reuse the Ports and Adapters pattern to support building multiple microservices from a single codebase. This experience report hopes to inspire software engineers, machine learning engineers, and data scientists to apply the Hexagonal Architecture pattern to build their MLES.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. Moreover, we propose a keyframe selection strategy to reduce visual redundancy by retaining semantically informative frames, along with an action merging and label reweighting mechanism that mitigates long-tailed supervision imbalance and facilitates stable multi-task co-training. Extensive experiments on the Aerial VLN benchmark validate the effectiveness of our method. Under the challenging monocular RGB-only setting, our model achieves strong results across both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the performance gap with state-of-the-art panoramic RGB-D counterparts. Comprehensive ablation studies further demonstrate the contribution of our task design and architectural choices.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haoyu Zhao, Weizhong Ding, Yuhao Yang, Zheng Tian, Linyi Yang et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging the gap between digital agents and the physical world, See-Control provides a concrete step toward enabling home robots to perform smartphone-dependent tasks in realistic environments.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Manzi Kevin Maxime
Natural Language Processing
ๆฅ็ๆ่ฆ
Predicting protein secondary structures such as alpha helices, beta sheets, and coils from amino acid sequences is essential for understanding protein function. This work presents a transformer-based model that applies attention mechanisms to protein sequence data to predict structural motifs. A sliding-window data augmentation technique is used on the CB513 dataset to expand the training samples. The transformer shows strong ability to generalize across variable-length sequences while effectively capturing both local and long-range residue interactions.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hui Wang, Yang Liu, Xiaoyu Zhang, Chaoxu Mu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Automatic Heuristic Design (AHD) is an effective1 framework for solving complex optimization prob-2 lems. The development of large language mod-3 els (LLMs) enables the automated generation of4 heuristics. Existing LLM-based evolutionary meth-5 ods rely on population strategies and are prone6 to local optima. Integrating LLMs with Monte7 Carlo Tree Search (MCTS) improves the trade-off8 between exploration and exploitation, but multi-9 round cognitive integration remains limited and10 search diversity is constrained. To overcome these11 limitations, this paper proposes a novel cognitive-12 guided MCTS framework (CogMCTS). CogMCTS13 tightly integrates the cognitive guidance mecha-14 nism of LLMs with MCTS to achieve efficient au-15 tomated heuristic optimization. The framework16 employs multi-round cognitive feedback to incor-17 porate historical experience, node information, and18 negative outcomes, dynamically improving heuris-19 tic generation. Dual-track node expansion com-20 bined with elite heuristic management balances the21 exploration of diverse heuristics and the exploita-22 tion of high-quality experience. In addition, strate-23 gic mutation modifies the heuristic forms and pa-24 rameters to further enhance the diversity of the so-25 lution and the overall optimization performance.26 The experimental results indicate that CogMCTS27 outperforms existing LLM-based AHD methods in28 stability, efficiency, and solution quality.
๐
2025-12-09
๐ AAAI 2026
๐ฅ Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of "emptiness" without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.
๐
2025-12-09
๐ ArXiv Preprint (cs.CY)
๐ฅ Wicaksono Febriantoro, Qi Zhou, Wannapon Suraworachet, Sahan Bulathwela, Andrea Gauthier et al.
ๆฅ็ๆ่ฆ
The dynamic nature of interactions between students and GenAI, as well as their relationship to writing quality, remains underexplored. While most research has examined how general-purpose GenAI can support writing, fewer studies have investigated how students interact with pedagogically designed systems across different phases of the writing process. To address this gap, we evaluated a GenAI-driven essay-writing assistant (EWA) designed to support higher education students in argumentative writing. Drawing on 1,282 interaction logs from 32 undergraduates during a two-hour writing session, Sequential Pattern Mining and K-Means clustering were used to identify behavioral patterns. Two clusters emerged: Cluster 1 emphasized outline planning and essay structure, while Cluster 2 focused on content development. A Mann-Whitney U test revealed a moderate effect size (r = 0.36) in the essay Organization dimension, with Cluster 1 showing higher scores. Qualitative analysis indicated that students with better performance actively wrote and shared essay sections with EWA for feedback, rather than interacted passively by asking questions. These findings suggest implications for teaching and system design. Teachers can encourage active engagement, while future EWAs may integrate automatic labeling and monitoring to prompt students to move from questioning to writing, enabling fuller benefits from GenAI-supported learning.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Laxmiraju Kandikatla, Branislav Radeljic
Robotics
ๆฅ็ๆ่ฆ
Artificial Intelligence (AI) systems are now an integral part of multiple industries. In clinical research, AI supports automated adverse event detection in clinical trials, patient eligibility screening for protocol enrollment, and data quality validation. Beyond healthcare, AI is transforming finance through real-time fraud detection, automated loan risk assessment, and algorithmic decision-making. Similarly, in manufacturing, AI enables predictive maintenance to reduce equipment downtime, enhances quality control through computer-vision inspection, and optimizes production workflows using real-time operational data. While these technologies enhance operational efficiency, they introduce new challenges regarding safety, accountability, and regulatory compliance. To address these concerns, we introduce the SMART+ Framework - a structured model built on the pillars of Safety, Monitoring, Accountability, Reliability, and Transparency, and further enhanced with Privacy & Security, Data Governance, Fairness & Bias, and Guardrails. SMART+ offers a practical, comprehensive approach to evaluating and governing AI systems across industries. This framework aligns with evolving mechanisms and regulatory guidance to integrate operational safeguards, oversight procedures, and strengthened privacy and governance controls. SMART+ demonstrates risk mitigation, trust-building, and compliance readiness. By enabling responsible AI adoption and ensuring auditability, SMART+ provides a robust foundation for effective AI governance in clinical research.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo et al.
MultimodalMachine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) Continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) Co-training on cross-embodiment robot data alongside vision-language data; and (3) Action training with reasoning process on trajectories collected on Astribot S1, a bimanual mobile manipulator with human-like dexterity and agility. Finally, we integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control. Extensive experiments demonstrate that Lumo-1 achieves significant performance improvements in embodied vision-language reasoning, a critical component for generalist robotic control. Real-world evaluations further show that Lumo-1 surpasses strong baselines across a wide range of challenging robotic tasks, with strong generalization to novel objects and environments, excelling particularly in long-horizon tasks and responding to human-natural instructions that require reasoning over strategy, concepts and space.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuna Kato, Shohei Mori, Hideo Saito, Yoshifumi Takatsume, Hiroki Kajita et al.
ๆฅ็ๆ่ฆ
Video recordings of open surgeries are greatly required for education and research purposes. However, capturing unobstructed videos is challenging since surgeons frequently block the camera field of view. To avoid occlusion, the positions and angles of the camera must be frequently adjusted, which is highly labor-intensive. Prior work has addressed this issue by installing multiple cameras on a shadowless lamp and arranging them to fully surround the surgical area. This setup increases the chances of some cameras capturing an unobstructed view. However, manual image alignment is needed in post-processing since camera configurations change every time surgeons move the lamp for optimal lighting. This paper aims to fully automate this alignment task. The proposed method identifies frames in which the lighting system moves, realigns them, and selects the camera with the least occlusion to generate a video that consistently presents the surgical field from a fixed perspective. A user study involving surgeons demonstrated that videos generated by our method were superior to those produced by conventional methods in terms of the ease of confirming the surgical area and the comfort during video viewing. Additionally, our approach showed improvements in video quality over existing techniques. Furthermore, we implemented several synthesis options for the proposed view-synthesis method and conducted a user study to assess surgeons' preferences for each option.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nader Sadek, Mirette Moawad, Christina Naguib, Mariam Elzahaby
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Stock market prediction is a long-standing challenge in finance, as accurate forecasts support informed investment decisions. Traditional models rely mainly on historical prices, but recent work shows that financial news can provide useful external signals. This paper investigates a multimodal approach that integrates companies' news articles with their historical stock data to improve prediction performance. We compare a Graph Neural Network (GNN) model with a baseline LSTM model. Historical data for each company is encoded using an LSTM, while news titles are embedded with a language model. These embeddings form nodes in a heterogeneous graph, and GraphSAGE is used to capture interactions between articles, companies, and industries. We evaluate two targets: a binary direction-of-change label and a significance-based label. Experiments on the US equities and Bloomberg datasets show that the GNN outperforms the LSTM baseline, achieving 53% accuracy on the first target and a 4% precision gain on the second. Results also indicate that companies with more associated news yield higher prediction accuracy. Moreover, headlines contain stronger predictive signals than full articles, suggesting that concise news summaries play an important role in short-term market reactions.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Yuchi Zhang, Churui Sun, Shiqi Liang, Diyuan Liu, Chao Ji et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent end-to-end robotic manipulation research increasingly adopts architectures inspired by large language models to enable robust manipulation. However, a critical challenge arises from severe distribution shifts between robotic action data, primarily due to substantial numerical variations in action commands across diverse robotic platforms and tasks, hindering the effective transfer of pretrained knowledge. To address this limitation, we propose a semantically grounded linguistic representation to normalize actions for efficient pretraining. Unlike conventional discretized action representations that are sensitive to numerical scales, the motion representation specifically disregards numeric scale effects, emphasizing directionality instead. This abstraction mitigates distribution shifts, yielding a more generalizable pretraining representation. Moreover, using the motion representation narrows the feature distance between action tokens and standard vocabulary tokens, mitigating modality gaps. Multi-task experiments on two benchmarks demonstrate that the proposed method significantly improves generalization performance and transferability in robotic manipulation tasks.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Indrajit Kar, Kalathur Chenchu Kishore Kumar
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64*64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhigang Jia, Duan Wang, Hengkai Wang, Yajun Xie, Meixiang Zhao et al.
Computer Vision
ๆฅ็ๆ่ฆ
Color image generation has a wide range of applications, but the existing generation models ignore the correlation among color channels, which may lead to chromatic aberration problems. In addition, the data distribution problem of color images has not been systematically elaborated and explained, so that there is still the lack of the theory about measuring different color images datasets. In this paper, we define a new quaternion Wasserstein distance and develop its dual theory. To deal with the quaternion linear programming problem, we derive the strong duality form with helps of quaternion convex set separation theorem and quaternion Farkas lemma. With using quaternion Wasserstein distance, we propose a novel Wasserstein quaternion generative adversarial network. Experiments demonstrate that this novel model surpasses both the (quaternion) generative adversarial networks and the Wasserstein generative adversarial network in terms of generation efficiency and image quality.
๐
2025-12-09
๐ AAAI 2026
๐ฅ Tammy Zhong, Yang Song, Maurice Pagnucco
Natural Language Processing
ๆฅ็ๆ่ฆ
Ethical awareness is critical for robots operating in human environments, yet existing automated planning tools provide little support. Manually specifying ethical rules is labour-intensive and highly context-specific. We present Principles2Plan, an interactive research prototype demonstrating how a human and a Large Language Model (LLM) can collaborate to produce context-sensitive ethical rules and guide automated planning. A domain expert provides the planning domain, problem details, and relevant high-level principles such as beneficence and privacy. The system generates operationalisable ethical rules consistent with these principles, which the user can review, prioritise, and supply to a planner to produce ethically-informed plans. To our knowledge, no prior system supports users in generating principle-grounded rules for classical planning contexts. Principles2Plan showcases the potential of human-LLM collaboration for making ethical automated planning more practical and feasible.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Nadezhda Kushina, Ko Watanabe, Aarthi Kannan, Ashita Ashok, Andreas Dengel et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Social robots must adjust to human proxemic norms to ensure user comfort and engagement. While prior research demonstrates that eye-tracking features reliably estimate comfort in human-human interactions, their applicability to interactions with humanoid robots remains unexplored. In this study, we investigate user comfort with the robot "Ameca" across four experimentally controlled distances (0.5 m to 2.0 m) using mobile eye-tracking and subjective reporting (N=19). We evaluate multiple machine learning and deep learning models to estimate comfort based on gaze features. Contrary to previous human-human studies where Transformer models excelled, a Decision Tree classifier achieved the highest performance (F1-score = 0.73), with minimum pupil diameter identified as the most critical predictor. These findings suggest that physiological comfort thresholds in human-robot interaction differ from human-human dynamics and can be effectively modeled using interpretable logic.
๐
2025-12-09
๐ Accepted in IEEE Transactions on Industrial Inform
๐ฅ Jiang Liu, Yan Qin, Wei Dai, Chau Yuen
Machine Learning
ๆฅ็ๆ่ฆ
Accurate and rapid state-of-health (SOH) monitoring plays an important role in indicating energy information for lithium-ion battery-powered portable mobile devices. To confront their variable working conditions, transfer learning (TL) emerges as a promising technique for leveraging knowledge from data-rich source working conditions, significantly reducing the training data required for SOH monitoring from target working conditions. However, traditional TL-based SOH monitoring is infeasible when applied in portable mobile devices since substantial computational resources are consumed during the TL stage and unexpectedly reduce the working endurance. To address these challenges, this paper proposes a lightweight TL-based SOH monitoring approach with constructive incremental transfer learning (CITL). First, taking advantage of the unlabeled data in the target domain, a semi-supervised TL mechanism is proposed to minimize the monitoring residual in a constructive way, through iteratively adding network nodes in the CITL. Second, the cross-domain learning ability of node parameters for CITL is comprehensively guaranteed through structural risk minimization, transfer mismatching minimization, and manifold consistency maximization. Moreover, the convergence analysis of the CITL is given, theoretically guaranteeing the efficacy of TL performance and network compactness. Finally, the proposed approach is verified through extensive experiments with a realistic unmanned air vehicles (UAV) battery dataset collected from dozens of flight missions. Specifically, the CITL outperforms SS-TCA, MMD-LSTM-DA, DDAN, BO-CNN-TL, and AS$^3$LSTM, in SOH estimation by 83.73%, 61.15%, 28.24%, 87.70%, and 57.34%, respectively, as evaluated using the index root mean square error.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaming Zhang, Che Wang, Yang Cao, Longtao Huang, Wei Yang Bryan Lim
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multi-modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through hierarchical chain-of-thought reasoning. Existing privacy protection techniques, primarily designed for perception-based models, prove ineffective against MLRMs' sophisticated multi-step reasoning processes that analyze environmental cues. We introduce \textbf{ReasonBreak}, a novel adversarial framework specifically designed to disrupt hierarchical reasoning in MLRMs through concept-aware perturbations. Our approach is founded on the key insight that effective disruption of geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent reasoning stages. To facilitate this approach, we contribute \textbf{GeoPrivacy-6K}, a comprehensive dataset comprising 6,341 ultra-high-resolution images ($\geq$2K) with hierarchical concept annotations. Extensive evaluation across seven state-of-the-art MLRMs (including GPT-o3, GPT-5, Gemini 2.5 Pro) demonstrates ReasonBreak's superior effectiveness, achieving a 14.4\% improvement in tract-level protection (33.8\% vs 19.4\%) and nearly doubling block-level protection (33.5\% vs 16.8\%). This work establishes a new paradigm for privacy protection against reasoning-based threats.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Waleed Razzaq, Yun-Bo Zhao
Machine Learning
ๆฅ็ๆ่ฆ
Accurate and uncertainty-aware degradation estimation is essential for predictive maintenance in safety-critical systems like rotating machinery with rolling-element bearings. Many existing uncertainty methods lack confidence calibration, are costly to run, are not distance-aware, and fail to generalize under out-of-distribution data. We introduce two distance-aware uncertainty methods for deterministic physics-guided neural networks: PG-SNGP, based on Spectral Normalization Gaussian Process, and PG-SNER, based on Deep Evidential Regression. We apply spectral normalization to the hidden layers so the network preserves distances from input to latent space. PG-SNGP replaces the final dense layer with a Gaussian Process layer for distance-sensitive uncertainty, while PG-SNER outputs Normal Inverse Gamma parameters to model uncertainty in a coherent probabilistic form. We assess performance using standard accuracy metrics and a new distance-aware metric based on the Pearson Correlation Coefficient, which measures how well predicted uncertainty tracks the distance between test and training samples. We also design a dynamic weighting scheme in the loss to balance data fidelity and physical consistency. We test our methods on rolling-element bearing degradation using the PRONOSTIA dataset and compare them with Monte Carlo and Deep Ensemble PGNNs. Results show that PG-SNGP and PG-SNER improve prediction accuracy, generalize reliably under OOD conditions, and remain robust to adversarial attacks and noise.
๐
2025-12-09
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Dyna Soumhane Ouchebara, Stรฉphane Dupont
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Vulnerability code-bases often suffer from severe imbalance, limiting the effectiveness of Deep Learning-based vulnerability classifiers. Data Augmentation could help solve this by mitigating the scarcity of under-represented CWEs. In this context, we investigate LLM-based augmentation for vulnerable functions, comparing controlled generation of new vulnerable samples with semantics-preserving refactoring of existing ones. Using Qwen2.5-Coder to produce augmented data and CodeBERT as a vulnerability classifier on the SVEN dataset, we find that our approaches are indeed effective in enriching vulnerable code-bases through a simple process and with reasonable quality, and that a hybrid strategy best boosts vulnerability classifiers' performance.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Aliaksei Kaliutau
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Large Language Models have revolutionized function-level code generation; however, repository-scale Automated Program Repair (APR) remains a significant challenge. Current approaches typically employ a control-centric paradigm, forcing agents to navigate complex directory structures and irrelevant control logic. In this paper, we propose a paradigm shift from the standard Code Property Graphs (CPGs) to the concept of Data Transformation Graph (DTG) that inverts the topology by modeling data states as nodes and functions as edges, enabling agents to trace logic defects through data lineage rather than control flow. We introduce a multi-agent framework that reconciles data integrity navigation with control flow logic. Our theoretical analysis and case studies demonstrate that this approach resolves the "Semantic Trap" inherent in standard RAG systems in modern coding agents. We provide a comprehensive implementation in the form of Autonomous Issue Resolver (AIR), a self-improvement system for zero-touch code maintenance that utilizes neuro-symbolic reasoning and uses the DTG structure for scalable logic repair. Our approach has demonstrated good results on several SWE benchmarks, reaching a resolution rate of 87.1% on SWE-Verified benchmark. Our approach directly addresses the core limitations of current AI code-assistant tools and tackles the critical need for a more robust foundation for our increasingly software-dependent world.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuning Gong, Yifei Liu, Yifan Zhan, Muyao Niu, Xueying Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time various Gaussian Splatting and meshes rendering. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, "click-to-run" browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, under identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current Web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huiguo He, Pengyu Yan, Ziqi Yi, Weizhi Zhong, Zheng Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Drag-based image editing aims to modify visual content followed by user-specified drag operations. Despite existing methods having made notable progress, they still fail to fully exploit the contextual information in the reference image, including fine-grained texture details, leading to edits with limited coherence and fidelity. To address this challenge, we introduce ContextDrag, a new paradigm for drag-based editing that leverages the strong contextual modeling capability of editing models, such as FLUX-Kontext. By incorporating VAE-encoded features from the reference image, ContextDrag can leverage rich contextual cues and preserve fine-grained details, without the need for finetuning or inversion. Specifically, ContextDrag introduced a novel Context-preserving Token Injection (CTI) that injects noise-free reference features into their correct destination locations via a Latent-space Reverse Mapping (LRM) algorithm. This strategy enables precise drag control while preserving consistency in both semantics and texture details. Second, ContextDrag adopts a novel Position-Consistent Attention (PCA), which positional re-encodes the reference tokens and applies overlap-aware masking to eliminate interference from irrelevant reference features. Extensive experiments on DragBench-SR and DragBench-DR demonstrate that our approach surpasses all existing SOTA methods. Code will be publicly available.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Antonio Terpin, Raffaello D'Andrea
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Many high-performance human activities are executed with little or no external feedback: think of a figure skater landing a triple jump, a pitcher throwing a curveball for a strike, or a barista pouring latte art. To study the process of skill acquisition under fully controlled conditions, we bypass human subjects. Instead, we directly interface a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel to maximize or minimize drag. This setup has several desirable properties. First, it is a physical system, with the rich interactions and complex dynamics that only the physical world has: the flow is highly chaotic and extremely difficult, if not impossible, to model or simulate accurately. Second, the objective -- drag minimization or maximization -- is easy to state and can be captured directly in the reward, yet good strategies are not obvious beforehand. Third, decades-old experimental studies provide recipes for simple, high-performance open-loop policies. Finally, the setup is inexpensive and far easier to reproduce than human studies. In our experiments we find that high-dimensional flow feedback lets the agent discover high-performance drag-control strategies with only minutes of real-world interaction. When we later replay the same action sequences without any feedback, we obtain almost identical performance. This shows that feedback, and in particular flow feedback, is not needed to execute the learned policy. Surprisingly, without flow feedback during training the agent fails to discover any well-performing policy in drag maximization, but still succeeds in drag minimization, albeit more slowly and less reliably. Our studies show that learning a high-performance skill can require richer information than executing it, and learning conditions can be kind or wicked depending solely on the goal, not on dynamics or policy complexity.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gary Ackerman, Theodore Wilson, Zachary Kallenborn, Olivia Shoemaker, Anna Wetzel et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The potential for rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons has generated significant policy, academic, and public concern. Both model developers and policymakers seek to quantify and mitigate any risk, with an important element of such efforts being the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper discusses the pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset. It is the third in a series of three papers describing an overall Biothreat Benchmark Generation (BBG) framework, with previous papers detailing the development of the B3 dataset. The pilot involved running the benchmarks through a sample frontier AI model, followed by human evaluation of model responses, and an applied risk analysis of the results along several dimensions. Overall, the pilot demonstrated that the B3 dataset offers a viable, nuanced method for rapidly assessing the biosecurity risk posed by a LLM, identifying the key sources of that risk and providing guidance for priority areas of mitigation priority.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gary Ackerman, Zachary Kallenborn, Anna Wetzel, Hayley Peterson, Jenna LaTourette et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The potential for rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons has generated significant policy, academic, and public concern. Both model developers and policymakers seek to quantify and mitigate any risk, with an important element of such efforts being the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper, the second in a series of three, describes the second component of a novel Biothreat Benchmark Generation (BBG) framework: the generation of the Bacterial Biothreat Benchmark (B3) dataset. The development process involved three complementary approaches: 1) web-based prompt generation, 2) red teaming, and 3) mining existing benchmark corpora, to generate over 7,000 potential benchmarks linked to the Task-Query Architecture that was developed during the first component of the project. A process of de-duplication, followed by an assessment of uplift diagnosticity, and general quality control measures, reduced the candidates to a set of 1,010 final benchmarks. This procedure ensured that these benchmarks are a) diagnostic in terms of providing uplift; b) directly relevant to biosecurity threats; and c) are aligned with a larger biosecurity architecture permitting nuanced analysis at different levels of analysis.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yong-Woon Kim
Machine Learning
ๆฅ็ๆ่ฆ
This paper introduces the Impact-Driven AI Framework (IDAIF), a novel architectural methodology that integrates Theory of Change (ToC) principles with modern artificial intelligence system design. As AI systems increasingly influence high-stakes domains including healthcare, finance, and public policy, the alignment problem--ensuring AI behavior corresponds with human values and intentions--has become critical. Current approaches predominantly optimize technical performance metrics while neglecting the sociotechnical dimensions of AI deployment. IDAIF addresses this gap by establishing a systematic mapping between ToC's five-stage model (Inputs-Activities-Outputs-Outcomes-Impact) and corresponding AI architectural layers (Data Layer-Pipeline Layer-Inference Layer-Agentic Layer-Normative Layer). Each layer incorporates rigorous theoretical foundations: multi-objective Pareto optimization for value alignment, hierarchical multi-agent orchestration for outcome achievement, causal directed acyclic graphs (DAGs) for hallucination mitigation, and adversarial debiasing with Reinforcement Learning from Human Feedback (RLHF) for fairness assurance. We provide formal mathematical formulations for each component and introduce an Assurance Layer that manages assumption failures through guardian architectures. Three case studies demonstrate IDAIF application across healthcare, cybersecurity, and software engineering domains. This framework represents a paradigm shift from model-centric to impact-centric AI development, providing engineers with concrete architectural patterns for building ethical, trustworthy, and socially beneficial AI systems.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mingwei Li, Xiaoyuan Zhang, Chengwei Yang, Zilong Zheng, Yaodong Yang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Model-based planning in robotic domains is fundamentally challenged by the hybrid nature of physical dynamics, where continuous motion is punctuated by discrete events such as contacts and impacts. Conventional latent world models typically employ monolithic neural networks that enforce global continuity, inevitably over-smoothing the distinct dynamic modes (e.g., sticking vs. sliding, flight vs. stance). For a planner, this smoothing results in catastrophic compounding errors during long-horizon lookaheads, rendering the search process unreliable at physical boundaries. To address this, we introduce the Prismatic World Model (PRISM-WM), a structured architecture designed to decompose complex hybrid dynamics into composable primitives. PRISM-WM leverages a context-aware Mixture-of-Experts (MoE) framework where a gating mechanism implicitly identifies the current physical mode, and specialized experts predict the associated transition dynamics. We further introduce a latent orthogonalization objective to ensure expert diversity, effectively preventing mode collapse. By accurately modeling the sharp mode transitions in system dynamics, PRISM-WM significantly reduces rollout drift. Extensive experiments on challenging continuous control benchmarks, including high-dimensional humanoids and diverse multi-task settings, demonstrate that PRISM-WM provides a superior high-fidelity substrate for trajectory optimization algorithms (e.g., TD-MPC), proving its potential as a powerful foundational model for next-generation model-based agents.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sjoerd B. Stolwijk, Mark Boukes, Damian Trilling
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper investigates bias in GLLM annotations by conceptually replicating manual annotations of Boukes (2024). Using various GLLMs (Llama3.1:8b, Llama3.3:70b, GPT4o, Qwen2.5:72b) in combination with five different prompts for five concepts (political content, interactivity, rationality, incivility, and ideology). We find GLLMs perform adequate in terms of F1 scores, but differ from manual annotations in terms of prevalence, yield substantively different downstream results, and display systematic bias in that they overlap more with each other than with manual annotations. Differences in F1 scores fail to account for the degree of bias.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kaiwei Liu, Yuting He, Bufang Yang, Mu Yuan, Chun Man Victor Wong et al.
Machine Learning
ๆฅ็ๆ่ฆ
Biosignals collected from wearable devices are widely utilized in healthcare applications. Machine learning models used in these applications often rely on features extracted from biosignals due to their effectiveness, lower data dimensionality, and wide compatibility across various model architectures. However, existing feature extraction methods often lack task-specific contextual knowledge, struggle to identify optimal feature extraction settings in high-dimensional feature space, and are prone to code generation and automation errors. In this paper, we propose DeepFeature, the first LLM-empowered, context-aware feature generation framework for wearable biosignals. DeepFeature introduces a multi-source feature generation mechanism that integrates expert knowledge with task settings. It also employs an iterative feature refinement process that uses feature assessment-based feedback for feature re-selection. Additionally, DeepFeature utilizes a robust multi-layer filtering and verification approach for robust feature-to-code translation to ensure that the extraction functions run without crashing. Experimental evaluation results show that DeepFeature achieves an average AUROC improvement of 4.21-9.67% across eight diverse tasks compared to baseline methods. It outperforms state-of-the-art approaches on five tasks while maintaining comparable performance on the remaining tasks.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wentao Zhang, Qunbo Wang, Tao Zhang, Junsheng Wu, Hongping Gan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language model (LLM) agents often rely on external demonstrations or retrieval-augmented planning, leading to brittleness, poor generalization, and high computational overhead. Inspired by human problem-solving, we propose DuSAR (Dual-Strategy Agent with Reflecting) - a demonstration-free framework that enables a single frozen LLM to perform co-adaptive reasoning via two complementary strategies: a high-level holistic plan and a context-grounded local policy. These strategies interact through a lightweight reflection mechanism, where the agent continuously assesses progress via a Strategy Fitness Score and dynamically revises its global plan when stuck or refines it upon meaningful advancement, mimicking human metacognitive behavior. On ALFWorld and Mind2Web, DuSAR achieves state-of-the-art performance with open-source LLMs (7B-70B), reaching 37.1% success on ALFWorld (Llama3.1-70B) - more than doubling the best prior result (13.0%) - and 4.02% on Mind2Web, also more than doubling the strongest baseline. Remarkably, it reduces per-step token consumption by 3-9X while maintaining strong performance. Ablation studies confirm the necessity of dual-strategy coordination. Moreover, optional integration of expert demonstrations further boosts results, highlighting DuSAR's flexibility and compatibility with external knowledge.
๐
2025-12-09
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Ali Sakour
Machine Learning
ๆฅ็ๆ่ฆ
Biological systems exhibit remarkable morphogenetic plasticity, where a single genome can encode various specialized cellular structures triggered by local chemical signals. In the domain of Deep Learning, Differentiable Neural Cellular Automata (NCA) have emerged as a paradigm to mimic this self-organization. However, existing NCA research has predominantly focused on continuous texture synthesis or single-target object recovery, leaving the challenge of class-conditional structural generation largely unexplored. In this work, we propose a novel Conditional Neural Cellular Automata (c-NCA) architecture capable of growing distinct topological structures - specifically MNIST digits - from a single generic seed, guided solely by a spatially broadcasted class vector. Unlike traditional generative models (e.g., GANs, VAEs) that rely on global reception fields, our model enforces strict locality and translation equivariance. We demonstrate that by injecting a one-hot condition into the cellular perception field, a single set of local rules can learn to break symmetry and self-assemble into ten distinct geometric attractors. Experimental results show that our c-NCA achieves stable convergence, correctly forming digit topologies from a single pixel, and exhibits robustness characteristic of biological systems. This work bridges the gap between texture-based NCAs and structural pattern formation, offering a lightweight, biologically plausible alternative for conditional generation.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Benedikt Mangold
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Workplace toxicity is widely recognized as detrimental to organizational culture, yet quantifying its direct impact on operational efficiency remains methodologically challenging due to the ethical and practical difficulties of reproducing conflict in human subjects. This study leverages Large Language Model (LLM) based Multi-Agent Systems to simulate 1-on-1 adversarial debates, creating a controlled "sociological sandbox". We employ a Monte Carlo method to simulate hundrets of discussions, measuring the convergence time (defined as the number of arguments required to reach a conclusion) between a baseline control group and treatment groups involving agents with "toxic" system prompts. Our results demonstrate a statistically significant increase of approximately 25\% in the duration of conversations involving toxic participants. We propose that this "latency of toxicity" serves as a proxy for financial damage in corporate and academic settings. Furthermore, we demonstrate that agent-based modeling provides a reproducible, ethical alternative to human-subject research for measuring the mechanics of social friction.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tien Cuong Bui
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have become a powerful tool for modeling and analyzing data with graph structures. The wide adoption in numerous applications underscores the value of these models. However, the complexity of these methods often impedes understanding their decision-making processes. Current Explainable AI (XAI) methods struggle to untangle the intricate relationships and interactions within graphs. Several methods have tried to bridge this gap via a post-hoc approach or self-interpretable design. Most of them focus on graph structure analysis to determine essential patterns that correlate with prediction outcomes. While post-hoc explanation methods are adaptable, they require extra computational resources and may be less reliable due to limited access to the model's internal workings. Conversely, Interpretable models can provide immediate explanations, but their generalizability to different scenarios remains a major concern. To address these shortcomings, this thesis seeks to develop a novel XAI framework tailored for graph-based machine learning. The proposed framework aims to offer adaptable, computationally efficient explanations for GNNs, moving beyond individual feature analysis to capture how graph structure influences predictions.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Caner Erden, Alparslan Serhat Demir, Abdullah Hulusi Kokcam, Talas Fikret Kurnaz, Ugur Dagdeviren
Machine Learning
ๆฅ็ๆ่ฆ
Soil compaction is critical in construction engineering to ensure the stability of structures like road embankments and earth dams. Traditional methods for determining optimum moisture content (OMC) and maximum dry density (MDD) involve labor-intensive laboratory experiments, and empirical regression models have limited applicability and accuracy across diverse soil types. In recent years, artificial intelligence (AI) and machine learning (ML) techniques have emerged as alternatives for predicting these compaction parameters. However, ML models often struggle with prediction accuracy and generalizability, particularly with heterogeneous datasets representing various soil types. This study proposes an automated machine learning (AutoML) approach to predict OMC and MDD. AutoML automates algorithm selection and hyperparameter optimization, potentially improving accuracy and scalability. Through extensive experimentation, the study found that the Extreme Gradient Boosting (XGBoost) algorithm provided the best performance, achieving R-squared values of 80.4% for MDD and 89.1% for OMC on a separate dataset. These results demonstrate the effectiveness of AutoML in predicting compaction parameters across different soil types. The study also highlights the importance of heterogeneous datasets in improving the generalization and performance of ML models. Ultimately, this research contributes to more efficient and reliable construction practices by enhancing the prediction of soil compaction parameters.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Abdullah Hulusi Kรถkรงam, Uฤur Daฤdeviren, Talas Fikret Kurnaz, Alparslan Serhat Demir, Caner Erden
Machine Learning
ๆฅ็ๆ่ฆ
The California Bearing Ratio (CBR) is a key geotechnical indicator used to assess the load-bearing capacity of subgrade soils, especially in transportation infrastructure and foundation design. Traditional CBR determination relies on laboratory penetration tests. Despite their accuracy, these tests are often time-consuming, costly, and can be impractical, particularly for large-scale or diverse soil profiles. Recent progress in artificial intelligence, especially machine learning (ML), has enabled data-driven approaches for modeling complex soil behavior with greater speed and precision. This study introduces a comprehensive ML framework for CBR prediction using a dataset of 382 soil samples collected from various geoclimatic regions in Tรผrkiye. The dataset includes physicochemical soil properties relevant to bearing capacity, allowing multidimensional feature representation in a supervised learning context. Twelve ML algorithms were tested, including decision tree, random forest, extra trees, gradient boosting, xgboost, k-nearest neighbors, support vector regression, multi-layer perceptron, adaboost, bagging, voting, and stacking regressors. Each model was trained, validated, and evaluated to assess its generalization and robustness. Among them, the random forest regressor performed the best, achieving strong R2 scores of 0.95 (training), 0.76 (validation), and 0.83 (test). These outcomes highlight the model's powerful nonlinear mapping ability, making it a promising tool for predictive geotechnical tasks. The study supports the integration of intelligent, data-centric models in geotechnical engineering, offering an effective alternative to traditional methods and promoting digital transformation in infrastructure analysis and design.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, Sergey Levine
Multimodal
ๆฅ็ๆ่ฆ
Generalist robot policies, trained on large and diverse datasets, have demonstrated the ability to generalize across a wide spectrum of behaviors, enabling a single policy to act in varied real-world environments. However, they still fall short on new tasks not covered in the training data. When finetuned on limited demonstrations of a new task, these policies often overfit to the specific demonstrations--not only losing their prior abilities to solve a wide variety of generalist tasks but also failing to generalize within the new task itself. In this work, we aim to develop a method that preserves the generalization capabilities of the generalist policy during finetuning, allowing a single policy to robustly incorporate a new skill into its repertoire. Our goal is a single policy that both learns to generalize to variations of the new task and retains the broad competencies gained from pretraining. We show that this can be achieved through a simple yet effective strategy: interpolating the weights of a finetuned model with that of the pretrained model. We show, across extensive simulated and real-world experiments, that such model merging produces a single model that inherits the generalist abilities of the base model and learns to solve the new task robustly, outperforming both the pretrained and finetuned model on out-of-distribution variations of the new task. Moreover, we show that model merging enables continual acquisition of new skills in a lifelong learning setting, without sacrificing previously learned generalist abilities.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Michael R. Martin, Garrick Chan, Kwan-Liu Ma
Machine Learning
ๆฅ็ๆ่ฆ
Recent image protection mechanisms such as Glaze and Nightshade introduce imperceptible, adversarially designed perturbations intended to disrupt downstream text-to-image generative models. While their empirical effectiveness is known, the internal structure, detectability, and representational behavior of these perturbations remain poorly understood. This study provides a systematic, explainable AI analysis using a unified framework that integrates white-box feature-space inspection and black-box signal-level probing. Through latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization, we show that protection mechanisms operate as structured, low-entropy perturbations tightly coupled to underlying image content across representational, spatial, and spectral domains. Protected images preserve content-driven feature organization with protection-specific substructure rather than inducing global representational drift. Detectability is governed by interacting effects of perturbation entropy, spatial deployment, and frequency alignment, with sequential protection amplifying detectable structure rather than suppressing it. Frequency-domain analysis shows that Glaze and Nightshade redistribute energy along dominant image-aligned frequency axes rather than introducing diffuse noise. These findings indicate that contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, explaining why protection signals remain visually subtle yet consistently detectable. This work advances the interpretability of adversarial image protection and informs the design of future defenses and detection strategies for generative AI systems.
๐
2025-12-09
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Bin Wang, Hui Li, Liyang Zhang, Qijia Zhuang, Ao Yang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Sensitive information leakage in code repositories has emerged as a critical security challenge. Traditional detection methods that rely on regular expressions, fingerprint features, and high-entropy calculations often suffer from high false-positive rates. This not only reduces detection efficiency but also significantly increases the manual screening burden on developers. Recent advances in large language models (LLMs) and multi-agent collaborative architectures have demonstrated remarkable potential for tackling complex tasks, offering a novel technological perspective for sensitive information detection. In response to these challenges, we propose Argus, a multi-agent collaborative framework for detecting sensitive information. Argus employs a three-tier detection mechanism that integrates key content, file context, and project reference relationships to effectively reduce false positives and enhance overall detection accuracy. To comprehensively evaluate Argus in real-world repository environments, we developed two new benchmarks, one to assess genuine leak detection capabilities and another to evaluate false-positive filtering performance. Experimental results show that Argus achieves up to 94.86% accuracy in leak detection, with a precision of 96.36%, recall of 94.64%, and an F1 score of 0.955. Moreover, the analysis of 97 real repositories incurred a total cost of only 2.2$. All code implementations and related datasets are publicly available at https://github.com/TheBinKing/Argus-Guard for further research and application.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuhui Li, Zhengquan Luo, Zihui Cui, Zhiqiang Xu
ๆฅ็ๆ่ฆ
Dataset distillation aims to synthesize a compact subset of the original data, enabling models trained on it to achieve performance comparable to those trained on the original large dataset. Existing distribution-matching methods are confined to Euclidean spaces, making them only capture linear structures and overlook the intrinsic geometry of real data, e.g., curvature. However, high-dimensional data often lie on low-dimensional manifolds, suggesting that dataset distillation should have the distilled data manifold aligned with the original data manifold. In this work, we propose a geometry-aware distribution-matching framework, called \textbf{GeoDM}, which operates in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, with flat, hierarchical, and cyclical structures all captured by a unified representation. To adapt to the underlying data geometry, we introduce learnable curvature and weight parameters for three kinds of geometries. At the same time, we design an optimal transport loss to enhance the distribution fidelity. Our theoretical analysis shows that the geometry-aware distribution matching in a product space yields a smaller generalization error bound than the Euclidean counterparts. Extensive experiments conducted on standard benchmarks demonstrate that our algorithm outperforms state-of-the-art data distillation methods and remains effective across various distribution-matching strategies for the single geometries.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alexander Goslin
Robotics
ๆฅ็ๆ่ฆ
For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. We introduce Terrain Diffusion, an AI-era successor to Perlin noise that bridges the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. At its core is InfiniteDiffusion, a novel algorithm for infinite generation, enabling seamless, real-time synthesis of boundless landscapes. A hierarchical stack of diffusion models couples planetary context with local detail, while a compact Laplacian encoding stabilizes outputs across Earth-scale dynamic ranges. An open-source infinite-tensor framework supports constant-memory manipulation of unbounded tensors, and few-step consistency distillation enables efficient generation. Together, these components establish diffusion models as a practical foundation for procedural world generation, capable of synthesizing entire planets coherently, controllably, and without limits.
๐
2025-12-09
๐ ACL
๐ฅ Sijia Chen, Baochun Li, Di Niu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are post-trained through reinforcement learning (RL) to evolve into Reasoning Language Models (RLMs), where the hallmark of this advanced reasoning is ``aha'' moments when they start to perform strategies, such as self-reflection and deep thinking, within chain of thoughts (CoTs). Motivated by this, this paper proposes a novel reinforced strategy injection mechanism (rSIM), that enables any LLM to become an RLM by employing a small planner to guide the LLM's CoT through the adaptive injection of reasoning strategies. To achieve this, the planner (leader agent) is jointly trained with an LLM (follower agent) using multi-agent RL (MARL), based on a leader-follower framework and straightforward rule-based rewards. Experimental results show that rSIM enables Qwen2.5-0.5B to become an RLM and significantly outperform Qwen2.5-14B. Moreover, the planner is generalizable: it only needs to be trained once and can be applied as a plug-in to substantially improve the reasoning capabilities of existing LLMs. In addition, the planner supports continual learning across various tasks, allowing its planning abilities to gradually improve and generalize to a wider range of problems.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Agents, language model (LM)-based systems that are capable of reasoning, planning, and acting are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored, leaving practitioners to rely on heuristics rather than principled design choices. We address this gap by deriving quantitative scaling principles for agent systems. We evaluate this across four diverse benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench. Using five canonical architectures (Single, Independent, Centralized, Decentralized, Hybrid) instantiated across three LLM families, we perform a controlled evaluation spanning 180 configurations with standardized tools and token budgets. We derive a predictive model using empirical coordination metrics, including efficiency, overhead, error amplification, and redundancy, that achieves cross-validated R^2=0.513. We identify three dominant effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead. (2) a capability saturation: coordination yields diminishing or negative returns (beta=-0.408, p<0.001) once single-agent baselines exceed ~45%. (3) topology-dependent error amplification: independent agents amplify errors 17.2x through unchecked propagation, while centralized coordination contains this to 4.4x. Centralized coordination improves performance by 80.9% on parallelizable tasks like financial reasoning, while decentralized coordination excels on dynamic web navigation (+9.2% vs. +0.2%). Yet for sequential reasoning tasks, all multi-agent variants degraded performance by 39-70%. The framework predicts the optimal coordination strategy for 87% of held-out configurations, providing a predictive principle of agentic scaling based on measurable task properties.
๐
2025-12-09
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Shiva Gaire, Srijan Gyawali, Saroj Mishra, Suman Niroula, Dilip Thakur et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The Model Context Protocol (MCP) has emerged as the de facto standard for connecting Large Language Models (LLMs) to external data and tools, effectively functioning as the "USB-C for Agentic AI." While this decoupling of context and execution solves critical interoperability challenges, it introduces a profound new threat landscape where the boundary between epistemic errors (hallucinations) and security breaches (unauthorized actions) dissolves. This Systematization of Knowledge (SoK) aims to provide a comprehensive taxonomy of risks in the MCP ecosystem, distinguishing between adversarial security threats (e.g., indirect prompt injection, tool poisoning) and epistemic safety hazards (e.g., alignment failures in distributed tool delegation). We analyze the structural vulnerabilities of MCP primitives, specifically Resources, Prompts, and Tools, and demonstrate how "context" can be weaponized to trigger unauthorized operations in multi-agent environments. Furthermore, we survey state-of-the-art defenses, ranging from cryptographic provenance (ETDI) to runtime intent verification, and conclude with a roadmap for securing the transition from conversational chatbots to autonomous agentic operating systems.
๐
2025-12-09
๐ ArXiv Preprint (cs.SE)
๐ฅ Liao Hu, Qiteng Wu, Ruoyu Qi
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The integration of Large Language Models (LLMs) into mobile and software development workflows faces a persistent tension among three demands: semantic awareness, developer productivity, and data privacy. Traditional cloud-based tools offer strong reasoning but risk data exposure and latency, while on-device solutions lack full-context understanding across codebase and developer tooling. We introduce SolidGPT, an open-source, edge-cloud hybrid developer assistant built on GitHub, designed to enhance code and workspace semantic search. SolidGPT enables developers to: talk to your codebase: interactively query code and project structure, discovering the right methods and modules without manual searching. Automate software project workflows: generate PRDs, task breakdowns, Kanban boards, and even scaffold web app beginnings, with deep integration via VSCode and Notion. Configure private, extensible agents: onboard private code folders (up to approximately 500 files), connect Notion, customize AI agent personas via embedding and in-context training, and deploy via Docker, CLI, or VSCode extension. In practice, SolidGPT empowers developer productivity through: Semantic-rich code navigation: no more hunting through files or wondering where a feature lives. Integrated documentation and task management: seamlessly sync generated PRD content and task boards into developer workflows. Privacy-first design: running locally via Docker or VSCode, with full control over code and data, while optionally reaching out to LLM APIs as needed. By combining interactive code querying, automated project scaffolding, and human-AI collaboration, SolidGPT provides a practical, privacy-respecting edge assistant that accelerates real-world development workflows, ideal for intelligent mobile and software engineering contexts.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Haldun Balim, Na Li, Yilun Du
Robotics
ๆฅ็ๆ่ฆ
Offline decision-making requires synthesizing reliable behaviors from fixed datasets without further interaction, yet existing generative approaches often yield trajectories that are dynamically infeasible. We propose Model Predictive Diffuser (MPDiffuser), a compositional model-based diffusion framework consisting of: (i) a planner that generates diverse, task-aligned trajectories; (ii) a dynamics model that enforces consistency with the underlying system dynamics; and (iii) a ranker module that selects behaviors aligned with the task objectives. MPDiffuser employs an alternating diffusion sampling scheme, where planner and dynamics updates are interleaved to progressively refine trajectories for both task alignment and feasibility during the sampling process. We also provide a theoretical rationale for this procedure, showing how it balances fidelity to data priors with dynamics consistency. Empirically, the compositional design improves sample efficiency, as it leverages even low-quality data for dynamics learning and adapts seamlessly to novel dynamics. We evaluate MPDiffuser on both unconstrained (D4RL) and constrained (DSRL) offline decision-making benchmarks, demonstrating consistent gains over existing approaches. Furthermore, we present a preliminary study extending MPDiffuser to vision-based control tasks, showing its potential to scale to high-dimensional sensory inputs. Finally, we deploy our method on a real quadrupedal robot, showcasing its practicality for real-world control.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Thanh Vu, Richi Nayak, Thiru Balasubramaniam
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern businesses are increasingly challenged by the time and expense required to generate and assess high-quality content. Human writers face time constraints, and extrinsic evaluations can be costly. While Large Language Models (LLMs) offer potential in content creation, concerns about the quality of AI-generated content persist. Traditional evaluation methods, like human surveys, further add operational costs, highlighting the need for efficient, automated solutions. This research introduces Generative Agents as a means to tackle these challenges. These agents can rapidly and cost-effectively evaluate AI-generated content, simulating human judgment by rating aspects such as coherence, interestingness, clarity, fairness, and relevance. By incorporating these agents, businesses can streamline content generation and ensure consistent, high-quality output while minimizing reliance on costly human evaluations. The study provides critical insights into enhancing LLMs for producing business-aligned, high-quality content, offering significant advancements in automated content generation and evaluation.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jaisal Patel, Yunzhe Chen, Kaiwen He, Keyi Wang, David Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
๐
2025-12-09
๐ ICDE 2026
๐ฅ Yibowen Zhao, Yinan Zhang, Zhixiang Su, Lizhen Cui, Chunyan Miao
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Predicting diseases solely from patient-side information, such as demographics and self-reported symptoms, has attracted significant research attention due to its potential to enhance patient awareness, facilitate early healthcare engagement, and improve healthcare system efficiency. However, existing approaches encounter critical challenges, including imbalanced disease distributions and a lack of interpretability, resulting in biased or unreliable predictions. To address these issues, we propose the Knowledge graph-enhanced, Prototype-aware, and Interpretable (KPI) framework. KPI systematically integrates structured and trusted medical knowledge into a unified disease knowledge graph, constructs clinically meaningful disease prototypes, and employs contrastive learning to enhance predictive accuracy, which is particularly important for long-tailed diseases. Additionally, KPI utilizes large language models (LLMs) to generate patient-specific, medically relevant explanations, thereby improving interpretability and reliability. Extensive experiments on real-world datasets demonstrate that KPI outperforms state-of-the-art methods in predictive accuracy and provides clinically valid explanations that closely align with patient narratives, highlighting its practical value for patient-centered healthcare delivery.
๐
2025-12-09
๐ AAAI
๐ฅ Haowen Zheng, Hu Zhu, Lu Deng, Weihao Gu, Yang Yang et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher's stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saeeda Naz, Saddam Hussain Khan
Natural Language Processing
ๆฅ็ๆ่ฆ
A novel deep hybrid Residual-SwinCA-Net segmentation framework is proposed in the study for addressing such challenges by extracting locally correlated and robust features, incorporating residual CNN modules. Furthermore, for learning global dependencies, Swin Transformer blocks are customized using internal residual pathways, which reinforce gradient stability, refine local patterns, and facilitate global feature fusion. Formerly, for enhancing tissue continuity, ultrasound noise suppressions, and accentuating fine structural transitions Laplacian-of-Gaussian regional operator is applied, and for maintaining the morphological integrity of malignant lesion contours, a boundary-oriented operator has been incorporated. Subsequently, a contraction strategy was applied stage-wise by progressively reducing features-map progressively for capturing scale invariance and enhancing the robustness of structural variability. In addition, each decoder level prior augmentation integrates a new Multi-Scale Channel Attention and Squeezing (MSCAS) module. The MSCAS selectively emphasizes encoder salient maps, retains discriminative global context, and complementary local structures with minimal computational cost while suppressing redundant activations. Finally, the Pixel-Attention module encodes class-relevant spatial cues by adaptively weighing malignant lesion pixels while suppressing background interference. The Residual-SwinCA-Net and existing CNNs/ViTs techniques have been implemented on the publicly available BUSI dataset. The proposed Residual-SwinCA-Net framework outperformed and achieved 99.29% mean accuracy, 98.74% IoU, and 0.9041 Dice for breast lesion segmentation. The proposed Residual-SwinCA-Net framework improves the BUSI lesion diagnostic performance and strengthens timely clinical decision-making.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jusheng Zhang, Xiaoyang Guo, Kaitong Cai, Qinhan Lv, Yijia Fan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
๐
2025-12-09
๐ ArXiv Preprint (cs.SD)
๐ฅ Mahathir Monjur, Shahriar Nirjon
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully controlled conditions and expensive listening tests, while learning-based models such as NISQA regress MOS and multiple perceptual dimensions from waveforms or spectrograms, achieving high correlation with subjective ratings yet remaining rigid: they do not support interactive, natural-language queries and do not natively provide textual rationales. In this work, we introduce SpeechQualityLLM, a multimodal speech quality question-answering (QA) system that couples an audio encoder with a language model and is trained on the NISQA corpus using template-based question-answer pairs covering overall MOS and four perceptual dimensions (noisiness, coloration, discontinuity, and loudness) in both single-ended (degraded only) and double-ended (degraded plus clean reference) setups. Instead of directly regressing scores, our system is supervised to generate textual answers from which numeric predictions are parsed and evaluated with standard regression and ranking metrics; on held-out NISQA clips, the double-ended model attains a MOS mean absolute error (MAE) of 0.41 with Pearson correlation of 0.86, with competitive performance on dimension-wise tasks. Beyond these quantitative gains, it offers a flexible natural-language interface in which the language model acts as an audio quality expert: practitioners can query arbitrary aspects of degradations, prompt the model to emulate different listener profiles to capture human variability and produce diverse but plausible judgments rather than a single deterministic score, and thereby reduce reliance on large-scale crowdsourced tests and their monetary cost.
๐
2025-12-09
๐ Accepted to Philosophical Transactions A, Special
๐ฅ Eunice Yiu, Kelsey Allen, Shiry Ginosar, Alison Gopnik
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Learning about the causal structure of the world is a fundamental problem for human cognition. Causal models and especially causal learning have proved to be difficult for large pretrained models using standard techniques of deep learning. In contrast, cognitive scientists have applied advances in our formal understanding of causation in computer science, particularly within the Causal Bayes Net formalism, to understand human causal learning. In the very different tradition of reinforcement learning, researchers have described an intrinsic reward signal called "empowerment" which maximizes mutual information between actions and their outcomes. "Empowerment" may be an important bridge between classical Bayesian causal learning and reinforcement learning and may help to characterize causal learning in humans and enable it in machines. If an agent learns an accurate causal world model, they will necessarily increase their empowerment, and increasing empowerment will lead to a more accurate causal world model. Empowerment may also explain distinctive features of childrens causal learning, as well as providing a more tractable computational account of how that learning is possible. In an empirical study, we systematically test how children and adults use cues to empowerment to infer causal relations, and design effective causal interventions.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
๐
2025-12-09
๐ WSDM 2026
๐ฅ Ye Qin, Jingchao Wang, Yang Shi, Haiying Huang, Junxu Li et al.
ๆฅ็ๆ่ฆ
Capsule Networks (CapsNets) show exceptional graph representation capacity via dynamic routing and vectorized hierarchical representations, but they model the complex geometries of real\-world graphs poorly by fixed\-curvature space due to the inherent geodesical disconnectedness issues, leading to suboptimal performance. Recent works find that non\-Euclidean pseudo\-Riemannian manifolds provide specific inductive biases for embedding graph data, but how to leverage them to improve CapsNets is still underexplored. Here, we extend the Euclidean capsule routing into geodesically disconnected pseudo\-Riemannian manifolds and derive a Pseudo\-Riemannian Capsule Network (PR\-CapsNet), which models data in pseudo\-Riemannian manifolds of adaptive curvature, for graph representation learning. Specifically, PR\-CapsNet enhances the CapsNet with Adaptive Pseudo\-Riemannian Tangent Space Routing by utilizing pseudo\-Riemannian geometry. Unlike single\-curvature or subspace\-partitioning methods, PR\-CapsNet concurrently models hierarchical and cluster or cyclic graph structures via its versatile pseudo\-Riemannian metric. It first deploys Pseudo\-Riemannian Tangent Space Routing to decompose capsule states into spherical\-temporal and Euclidean\-spatial subspaces with diffeomorphic transformations. Then, an Adaptive Curvature Routing is developed to adaptively fuse features from different curvature spaces for complex graphs via a learnable curvature tensor with geometric attention from local manifold properties. Finally, a geometric properties\-preserved Pseudo\-Riemannian Capsule Classifier is developed to project capsule embeddings to tangent spaces and use curvature\-weighted softmax for classification. Extensive experiments on node and graph classification benchmarks show PR\-CapsNet outperforms SOTA models, validating PR\-CapsNet's strong representation power for complex graph structures.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiwoo Park, Ruoqi Liu, Avani Jagdale, Andrew Srisuwananukorn, Jing Zhao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Wenjiang Xu, Cindy Wang, Rui Fang, Mingkang Zhang, Lusong Li et al.
Robotics
ๆฅ็ๆ่ฆ
World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to maintain consistency in long-horizon physical constraints. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which utilizes VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Website at https://embodied-tree-of-thoughts.github.io .
๐
2025-12-09
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Jinghao Wang, Ping Zhang, Carter Yagemann
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Medical Large Language Models (LLMs) are increasingly deployed for clinical decision support across diverse specialties, yet systematic evaluation of their robustness to adversarial misuse and privacy leakage remains inaccessible to most researchers. Existing security benchmarks require GPU clusters, commercial API access, or protected health data -- barriers that limit community participation in this critical research area. We propose a practical, fully reproducible framework for evaluating medical AI security under realistic resource constraints. Our framework design covers multiple medical specialties stratified by clinical risk -- from high-risk domains such as emergency medicine and psychiatry to general practice -- addressing jailbreaking attacks (role-playing, authority impersonation, multi-turn manipulation) and privacy extraction attacks. All evaluation utilizes synthetic patient records requiring no IRB approval. The framework is designed to run entirely on consumer CPU hardware using freely available models, eliminating cost barriers. We present the framework specification including threat models, data generation methodology, evaluation protocols, and scoring rubrics. This proposal establishes a foundation for comparative security assessment of medical-specialist models and defense mechanisms, advancing the broader goal of ensuring safe and trustworthy medical AI systems.
๐
2025-12-09
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Guangze Zhao, Yongzheng Zhang, Changbo Tian, Dan Xie, Hongri Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Security Operations Centers face massive, heterogeneous alert streams under minute-level service windows, creating the Alert Triage Latency Paradox: verbose reasoning chains ensure accuracy and compliance but incur prohibitive latency and token costs, while minimal chains sacrifice transparency and auditability. Existing solutions fail: signature systems are brittle, anomaly methods lack actionability, and fully cloud-hosted LLMs raise latency, cost, and privacy concerns. We propose AIDR, a hybrid cloud-edge framework that addresses this trade-off through constrained information-density optimization. The core innovation is gradient-based compression of reasoning chains to retain only decision-critical steps--minimal evidence sufficient to justify predictions while respecting token and latency budgets. We demonstrate that this approach preserves decision-relevant information while minimizing complexity. We construct compact datasets by distilling alerts into 3-5 high-information bullets (68% token reduction), train domain-specialized experts via LoRA, and deploy a cloud-edge architecture: a cloud LLM routes alerts to on-premises experts generating SOAR-ready JSON. Experiments demonstrate AIDR achieves higher accuracy and 40.6% latency reduction versus Chain-of-Thought, with robustness to data corruption and out-of-distribution generalization, enabling auditable and efficient SOC triage with full data residency compliance.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nanda K. Unnikrishnan, Keshab K. Parhi
Machine LearningRobotics
ๆฅ็ๆ่ฆ
In our prior work, LayerPipe, we had introduced an approach to accelerate training of convolutional, fully connected, and spiking neural networks by overlapping forward and backward computation. However, despite empirical success, a principled understanding of how much gradient delay needs to be introduced at each layer to achieve desired level of pipelining was not addressed. This paper, LayerPipe2, fills that gap by formally deriving LayerPipe using variable delayed gradient adaptation and retiming. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure where inner layers require fewer delays and outer layers require longer delays. When pipelining is applied at every layer, the amount of delay depends only on the number of remaining downstream stages. When layers are pipelined in groups, all layers in the group share the same assignment of delays. These insights not only explain previously observed scheduling patterns but also expose an often overlooked challenge that pipelining implicitly requires storage of historical weights. We overcome this storage bottleneck by developing a pipeline--aware moving average that reconstructs the required past states rather than storing them explicitly. This reduces memory cost without sacrificing the accuracy guarantees that makes pipelined learning viable. The result is a principled framework that illustrates how to construct LayerPipe architectures, predicts their delay requirements, and mitigates their storage burden, thereby enabling scalable pipelined training with controlled communication computation tradeoffs.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zheng Ding, Weirui Ye
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce \textbf{TreeGRPO}, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) \emph{High sample efficiency}, achieving better performance under same training samples (2) \emph{Fine-grained credit assignment} via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods, and (3) \emph{Amortized computation} where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves \textbf{2.4$\times$ faster training} while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at treegrpo.github.io.
๐
2025-12-09
๐ This paper was accepted and presented at the 2025
๐ฅ Henry Anand Septian Radityo, Bernardus Willson, Reynard Tanadi, Latifa Dwiyanti, Saiful Akbar
ๆฅ็ๆ่ฆ
The rising global prevalence of diabetes necessitates early detection to prevent severe complications. While AI-powered prediction applications offer a promising solution, they require a responsive and scalable back-end architecture to serve a large user base effectively. This paper details the development and evaluation of a scalable back-end system designed for a mobile diabetes prediction application. The primary objective was to maintain a failure rate below 5% and an average latency of under 1000 ms. The architecture leverages horizontal scaling, database sharding, and asynchronous communication via a message queue. Performance evaluation showed that 83% of the system's features (20 out of 24) met the specified performance targets. Key functionalities such as user profile management, activity tracking, and read-intensive prediction operations successfully achieved the desired performance. The system demonstrated the ability to handle up to 10,000 concurrent users without issues, validating its scalability. The implementation of asynchronous communication using RabbitMQ proved crucial in minimizing the error rate for computationally intensive prediction requests, ensuring system reliability by queuing requests and preventing data loss under heavy load.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Haoran Wang, Zhuohang Chen, Guang Li, Bo Ma, Chuanghuang Li
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The future of UAV interaction systems is evolving from engineer-driven to user-driven, aiming to replace traditional predefined Human-UAV Interaction designs. This shift focuses on enabling more personalized task planning and design, thereby achieving a higher quality of interaction experience and greater flexibility, which can be used in many fileds, such as agriculture, aerial photography, logistics, and environmental monitoring. However, due to the lack of a common language between users and the UAVs, such interactions are often difficult to be achieved. The developments of Large Language Models possess the ability to understand nature languages and Robots' (UAVs') behaviors, marking the possibility of personalized Human-UAV Interaction. Recently, some HUI frameworks based on LLMs have been proposed, but they commonly suffer from difficulties in mixed task planning and execution, leading to low adaptability in complex scenarios. In this paper, we propose a novel dual-agent HUI framework. This framework constructs two independent LLM agents (a task planning agent, and an execution agent) and applies different Prompt Engineering to separately handle the understanding, planning, and execution of tasks. To verify the effectiveness and performance of the framework, we have built a task database covering four typical application scenarios of UAVs and quantified the performance of the HUI framework using three independent metrics. Meanwhile different LLM models are selected to control the UAVs with compared performance. Our user study experimental results demonstrate that the framework improves the smoothness of HUI and the flexibility of task execution in the tasks scenario we set up, effectively meeting users' personalized needs.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gary Ackerman, Brandon Behlendorf, Zachary Kallenborn, Sheriff Almakki, Doug Clifford et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Both model developers and policymakers seek to quantify and mitigate the risk of rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons. An important element of such efforts is the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper describes the first component of a novel Biothreat Benchmark Generation (BBG) Framework. The BBG approach is designed to help model developers and evaluators reliably measure and assess the biosecurity risk uplift and general harm potential of existing and future AI models, while accounting for key aspects of the threat itself that are often overlooked in other benchmarking efforts, including different actor capability levels, and operational (in addition to purely technical) risk factors. As a pilot, the BBG is first being developed to address bacterial biological threats only. The BBG is built upon a hierarchical structure of biothreat categories, elements and tasks, which then serves as the basis for the development of task-aligned queries. This paper outlines the development of this biothreat task-query architecture, which we have named the Bacterial Biothreat Schema, while future papers will describe follow-on efforts to turn queries into model prompts, as well as how the resulting benchmarks can be implemented for model evaluation. Overall, the BBG Framework, including the Bacterial Biothreat Schema, seeks to offer a robust, re-usable structure for evaluating bacterial biological risks arising from LLMs across multiple levels of aggregation, which captures the full scope of technical and operational requirements for biological adversaries, and which accounts for a wide spectrum of biological adversary capabilities.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Youming Deng, Songyou Peng, Junyi Zhang, Kathryn Heal, Tiancheng Sun et al.
Computer Vision
ๆฅ็ๆ่ฆ
Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim, Jaewon Min, Jaeeun Lee et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploit these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance in TAIR task.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Simon de Moreau, Andrei Bursuc, Hafid El-Idrissi, Fabien Moutarde
Robotics
ๆฅ็ๆ่ฆ
Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it on object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS enables +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performances while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution to robust nighttime perception.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaizhi Zheng, Yue Fan, Jing Gu, Zishuo Xu, Xuehai He et al.
ๆฅ็ๆ่ฆ
Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages--Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation--EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zeyang Liu, Le Wang, Sanping Zhou, Yuxuan Wu, Xiaolong Sun et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Content-aware layout generation is a critical task in graphic design automation, focused on creating visually appealing arrangements of elements that seamlessly blend with a given background image. The variety of real-world applications makes it highly challenging to develop a single model capable of unifying the diverse range of input-constrained generation sub-tasks, such as those conditioned by element types, sizes, or their relationships. Current methods either address only a subset of these tasks or necessitate separate model parameters for different conditions, failing to offer a truly unified solution. In this paper, we propose UniLayDiff: a Unified Diffusion Transformer, that for the first time, addresses various content-aware layout generation tasks with a single, end-to-end trainable model. Specifically, we treat layout constraints as a distinct modality and employ Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Moreover, we integrate relation constraints through fine-tuning the model with LoRA after pretraining the model on other tasks. Such a schema not only achieves unified conditional generation but also enhances overall layout quality. Extensive experiments demonstrate that UniLayDiff achieves state-of-the-art performance across from unconditional to various conditional generation tasks and, to the best of our knowledge, is the first model to unify the full range of content-aware layout generation tasks.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Damiano Marsili, Georgia Gkioxari
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks, and show that our method improves visual reasoning and surpasses open-source and proprietary models, while with our improved visual grounding model we further outperform recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Manduhu Manduhu, Alexander Dow, Gerard Dooly, James Riordan
Computer Vision
ๆฅ็ๆ่ฆ
Rotation invariance is essential for precise, object-level segmentation in UAV aerial imagery, where targets can have arbitrary orientations and exhibit fine-scale details. Conventional segmentation architectures like U-Net rely on convolution operators that are not rotation-invariant, leading to degraded segmentation accuracy across varying viewpoints. Rotation invariance can be achieved by expanding the filter bank across multiple orientations; however, this will significantly increase computational cost and memory traffic. In this paper, we introduce a GPU-optimized rotation-invariant convolution framework that eliminates the traditional data-lowering (im2col) step required for matrix-multiplication-based convolution. By exploiting structured data sharing among symmetrically rotated filters, our method achieves multi-orientation convolution with greatly reduced memory traffic and computational redundancy. We further generalize the approach to accelerate convolution with arbitrary (non-symmetric) rotation angles.
Across extensive benchmarks, the proposed convolution achieves 20--55% faster training and 15--45% lower energy consumption than CUDNN, while maintaining accuracy comparable to state-of-the-art rotation-invariant methods. In the eight-orientation setting, our approach achieves up to 45% speedup and 41% energy savings on 256\(\times\)256 inputs, and 32% speedup and 23% lower energy usage on 1024\(\times\)1024 inputs. Integrated into a U-Net segmentation model, the framework yields up to 6% improvement in accuracy over the non-rotation-aware baseline. These results demonstrate that the proposed method provides an effective and highly efficient alternative to existing rotation-invariant CNN frameworks.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aysim Toker, Andreea-Maria Oncescu, Roy Miles, Ismail Elezi, Jiankang Deng
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jing Jie Tan, Anissa Mokraoui, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Image captioning is essential in many fields including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution image (LRI). While performance can be improved by using larger models like transformers for encoding, these models are typically heavyweight, demanding significant computational resources and memory, leading to challenges in retraining. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach presents a solution specifically designed for lightweight, low-resolution images captioning. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it an ideal choice for training on resource-constrained scenarios.
๐
2025-12-09
๐ AAAI 2026
๐ฅ Amit Bendkhale
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jack Brady, Bernhard Schรถlkopf, Thomas Kipf, Simon Buchholz, Wieland Brendel
ๆฅ็ๆ่ฆ
It has been hypothesized that human-level visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today's most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using regularization or architectural constraints. In contrast, for generative methods, the inductive biases can be enforced straightforwardly, thereby enabling compositional generalization by constraining a decoder and inverting it. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay. We examine the empirical implications of our theory by training a range of generative and non-generative methods on photorealistic image datasets. We find that, without the necessary inductive biases, non-generative methods often fail to generalize compositionally and require large-scale pretraining or added supervision to improve generalization. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. For achieving competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2\% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6\times inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
๐
2025-12-09
๐ Accepted in IEEE Transactions on Multimedia (TMM)
๐ฅ Yi Zhang, Chun-Wun Cheng, Junyi He, Ke Yu, Yushun Tang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called \textit{Training-free Dual Hyperbolic Adapters} (T-DHA). We characterize the vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. Hyperbolic spaces exhibit exponential volume growth with radius, unlike the polynomial growth in Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincarรฉ ball model, achieving significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T-DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chaewon Kim, Seoyeon Lee, Jonghyuk Park
Natural Language Processing
ๆฅ็ๆ่ฆ
Document shadow removal is essential for enhancing the clarity of digitized documents. Preserving high-frequency details (e.g., text edges and lines) is critical in this process because shadows often obscure or distort fine structures. This paper proposes a matte vision transformer (MatteViT), a novel shadow removal framework that applies spatial and frequency-domain information to eliminate shadows while preserving fine-grained structural details. To effectively retain these details, we employ two preservation strategies. First, our method introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components. Second, we present a continuous luminance-based shadow matte, generated using a custom-built matte dataset and shadow matte generator, which provides precise spatial guidance from the earliest processing stage. These strategies enable the model to accurately identify fine-grained regions and restore them with high fidelity. Extensive experiments on public benchmarks (RDD and Kligler) demonstrate that MatteViT achieves state-of-the-art performance, providing a robust and practical solution for real-world document shadow removal. Furthermore, the proposed method better preserves text-level details in downstream tasks, such as optical character recognition, improving recognition performance over prior methods.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiming Hao, Mutian Xu, Chongjie Ye, Jie Qin, Shunlin Lu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Personalizing visual generative models to meet specific user needs has gained increasing attention, yet current methods like Low-Rank Adaptation (LoRA) remain impractical due to their demand for task-specific data and lengthy optimization. While a few hypernetwork-based approaches attempt to predict adaptation weights directly, they struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practical applicability. To bridge this gap, we propose LoFA, a general framework that efficiently predicts personalized priors for fast model adaptation. We first identify a key property of LoRA: structured distribution patterns emerge in the relative changes between LoRA and base model parameters. Building on this, we design a two-stage hypernetwork: first predicting relative distribution patterns that capture key adaptation regions, then using these to guide final LoRA weight prediction. Extensive experiments demonstrate that our method consistently predicts high-quality personalized priors within seconds, across multiple tasks and user prompts, even outperforming conventional LoRA that requires hours of processing. Project page: https://jaeger416.github.io/lofa/.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seoyeon Lee, Gwangyeol Yu, Chaewon Kim, Jonghyuk Park
Computer Vision
ๆฅ็ๆ่ฆ
Diffusion models have achieved remarkable success in image synthesis. However, addressing artifacts and unrealistic regions remains a critical challenge. We propose self-refining diffusion, a novel framework that enhances image generation quality by detecting these flaws. The framework employs an explainable artificial intelligence (XAI)-based flaw highlighter to produce flaw activation maps (FAMs) that identify artifacts and unrealistic regions. These FAMs improve reconstruction quality by amplifying noise in flawed regions during the forward process and by focusing on these regions during the reverse process. The proposed approach achieves up to a 27.3% improvement in Frรฉchet inception distance across various diffusion-based models, demonstrating consistently strong performance on diverse datasets. It also shows robust effectiveness across different tasks, including image generation, text-to-image generation, and inpainting. These results demonstrate that explainable AI techniques can extend beyond interpretability to actively contribute to image refinement. The proposed framework offers a versatile and effective approach applicable to various diffusion models and tasks, significantly advancing the field of image synthesis.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We present Wan-Move, a simple and scalable framework that brings motion control to video generative models. Existing motion-controllable methods typically suffer from coarse control granularity and limited scalability, leaving their outputs insufficient for practical use. We narrow this gap by achieving precise and high-quality motion control. Our core idea is to directly make the original condition features motion-aware for guiding video synthesis. To this end, we first represent object motions with dense point trajectories, allowing fine-grained control over the scene. We then project these trajectories into latent space and propagate the first frame's features along each trajectory, producing an aligned spatiotemporal feature map that tells how each scene element should move. This feature map serves as the updated latent condition, which is naturally integrated into the off-the-shelf image-to-video model, e.g., Wan-I2V-14B, as motion guidance without any architecture change. It removes the need for auxiliary motion encoders and makes fine-tuning base models easily scalable. Through scaled training, Wan-Move generates 5-second, 480p videos whose motion controllability rivals Kling 1.5 Pro's commercial Motion Brush, as indicated by user studies. To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations. Extensive experiments on MoveBench and the public dataset consistently show Wan-Move's superior motion quality. Code, models, and benchmark data are made publicly available.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuniko Paxton, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos
Computer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
In recent years, high-performance computer vision models have achieved remarkable success in medical imaging, with some skin lesion classification systems even surpassing dermatology specialists in diagnostic accuracy. However, such models are computationally intensive and large in size, making them unsuitable for deployment on edge devices. In addition, strict privacy constraints hinder centralized data management, motivating the adoption of Federated Learning (FL). To address these challenges, this study proposes a skewness-guided pruning method that selectively prunes the Multi-Head Self-Attention and Multi-Layer Perceptron layers of a multimodal Swin Transformer based on the statistical skewness of their output distributions. The proposed method was validated in a horizontal FL environment and shown to maintain performance while substantially reducing model complexity. Experiments on the compact Swin Transformer demonstrate approximately 36\% model size reduction with no loss in accuracy. These findings highlight the feasibility of achieving efficient model compression and privacy-preserving distributed learning for multimodal medical AI on edge devices.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Artรบr I. Kรกroly, Pรฉter Galambos
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Industrial mushroom cultivation increasingly relies on computer vision for monitoring and automated harvesting. However, developing accurate detection and segmentation models requires large, precisely annotated datasets that are costly to produce. Synthetic data provides a scalable alternative, yet often lacks sufficient realism to generalize to real-world scenarios. This paper presents a novel workflow that integrates 3D rendering in Blender with a constrained diffusion model to automatically generate high-quality annotated, photorealistic synthetic images of Agaricus Bisporus mushrooms. This approach preserves full control over 3D scene configuration and annotations while achieving photorealism without the need for specialized computer graphics expertise. We release two synthetic datasets (each containing 6,000 images depicting over 250k mushroom instances) and evaluate Mask R-CNN models trained on them in a zero-shot setting. When tested on two independent real-world datasets (including a newly collected benchmark), our method achieves state-of-the-art segmentation performance (F1 = 0.859 on M18K), despite using only synthetic training data. Although the approach is demonstrated on Agaricus Bisporus mushrooms, the proposed pipeline can be readily adapted to other mushroom species or to other agricultural domains, such as fruit and leaf detection.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Samuel Ebimobowei Johnny, Blessed Guda, Emmanuel Enejo Aaron, Assane Gueye
ๆฅ็ๆ่ฆ
Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88\% accuracy and 60.00\% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at https://github.com/EbimoJohnny/Pose-Based-Sign-Language-Spotting
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuniko Paxton, Zeinab Dehghani, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Skin color has historically been a focal point of discrimination, yet fairness research in machine learning for medical imaging often relies on coarse subgroup categories, overlooking individual-level variations. Such group-based approaches risk obscuring biases faced by outliers within subgroups. This study introduces a distribution-based framework for evaluating and mitigating individual fairness in skin lesion classification. We treat skin tone as a continuous attribute rather than a categorical label, and employ kernel density estimation (KDE) to model its distribution. We further compare twelve statistical distance metrics to quantify disparities between skin tone distributions and propose a distance-based reweighting (DRW) loss function to correct underrepresentation in minority tones. Experiments across CNN and Transformer models demonstrate: (i) the limitations of categorical reweighting in capturing individual-level disparities, and (ii) the superior performance of distribution-based reweighting, particularly with Fidelity Similarity (FS), Wasserstein Distance (WD), Hellinger Metric (HM), and Harmonic Mean Similarity (HS). These findings establish a robust methodology for advancing fairness at individual level in dermatological AI systems, and highlight broader implications for sensitive continuous attributes in medical image analysis.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaiyu Li, Shengqi Zhang, Yupeng Deng, Zhi Wang, Deyu Meng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Most existing methods for training-free Open-Vocabulary Semantic Segmentation (OVSS) are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a preliminary exploration of applying SAM 3 to the remote sensing OVSS task without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. We evaluate our method on extensive remote sensing datasets. Experiments show that this simple adaptation achieves promising performance, demonstrating the potential of SAM 3 for remote sensing OVSS. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.
๐
2025-12-09
๐ ArXiv Preprint (cs.PF)
๐ฅ Sรฉbastien Piรฉrard, Adrien Deliรจge, Marc Van Droogenbroeck
ๆฅ็ๆ่ฆ
The performance of algorithms, methods, and models tends to depend heavily on the distribution of cases on which they are applied, this distribution being specific to the applicative domain. After performing an evaluation in several domains, it is highly informative to compute a (weighted) mean performance and, as shown in this paper, to scrutinize what happens during this averaging. To achieve this goal, we adopt a probabilistic framework and consider a performance as a probability measure (e.g., a normalized confusion matrix for a classification task). It appears that the corresponding weighted mean is known to be the summarization, and that only some remarkable scores assign to the summarized performance a value equal to a weighted arithmetic mean of the values assigned to the domain-specific performances. These scores include the family of ranking scores, a continuum parameterized by user preferences, and that the weights to consider in the arithmetic mean depend on the user preferences. Based on this, we rigorously define four domains, named easiest, most difficult, preponderant, and bottleneck domains, as functions of user preferences. After establishing the theory in a general setting, regardless of the task, we develop new visual tools for two-class classification.
๐
2025-12-09
๐ Accepted at IEEE Robotics and Automation Letters (
๐ฅ Kyumin Hwang, Wonhyeok Choi, Kiljoon Han, Wonjoon Choi, Minwoo Choi et al.
ๆฅ็ๆ่ฆ
Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme--traditionally used in classification--with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Athena Psalta, Vasileios Tsironis, Konstantinos Karantzalos
Robotics
ๆฅ็ๆ่ฆ
State-of-the-art person re-identification methods achieve impressive accuracy but remain largely opaque, leaving open the question: which high-level semantic attributes do these models actually rely on? We propose MoSAIC-ReID, a Mixture-of-Experts framework that systematically quantifies the importance of pedestrian attributes for re-identification. Our approach uses LoRA-based experts, each linked to a single attribute, and an oracle router that enables controlled attribution analysis. While MoSAIC-ReID achieves competitive performance on Market-1501 and DukeMTMC under the assumption that attribute annotations are available at test time, its primary value lies in providing a large-scale, quantitative study of attribute importance across intrinsic and extrinsic cues. Using generalized linear models, statistical tests, and feature-importance analyses, we reveal which attributes, such as clothing colors and intrinsic characteristics, contribute most strongly, while infrequent cues (e.g. accessories) have limited effect. This work offers a principled framework for interpretable ReID and highlights the requirements for integrating explicit semantic knowledge in practice. Code is available at https://github.com/psaltaath/MoSAIC-ReID
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shaofeng Zhang, Xuanqi Chen, Xiangdong Zhang, Sitong Wu, Junchi Yan
Machine Learning
ๆฅ็ๆ่ฆ
Most existing self-supervised learning (SSL) approaches for 3D point clouds are dominated by generative methods based on Masked Autoencoders (MAE). However, these generative methods have been proven to struggle to capture high-level discriminative features effectively, leading to poor performance on linear probing and other downstream tasks. In contrast, contrastive methods excel in discriminative feature representation and generalization ability on image data. Despite this, contrastive learning (CL) in 3D data remains scarce. Besides, simply applying CL methods designed for 2D data to 3D fails to effectively learn 3D local details. To address these challenges, we propose a novel Dual-Branch \textbf{C}enter-\textbf{S}urrounding \textbf{Con}trast (CSCon) framework. Specifically, we apply masking to the center and surrounding parts separately, constructing dual-branch inputs with center-biased and surrounding-biased representations to better capture rich geometric information. Meanwhile, we introduce a patch-level contrastive loss to further enhance both high-level information and local sensitivity. Under the FULL and ALL protocols, CSCon achieves performance comparable to generative methods; under the MLP-LINEAR, MLP-3, and ONLY-NEW protocols, our method attains state-of-the-art results, even surpassing cross-modal approaches. In particular, under the MLP-LINEAR protocol, our method outperforms the baseline (Point-MAE) by \textbf{7.9\%}, \textbf{6.7\%}, and \textbf{10.3\%} on the three variants of ScanObjectNN, respectively. The code will be made publicly available.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shaofeng Zhang, Xuanqi Chen, Ning Liao, Haoxiang Zhao, Xiaoxing Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
The dominance of denoising generative models (e.g., diffusion, flow-matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach still faces key limitations: the reliance on external, pre-trained encoders introduces overhead and domain shift. A dispersed-based strategy that encourages strong separation among in-batch latent representations alleviates this specific dependency. To assess the effect of the number of negative samples in generative modeling, we propose {\mname}, a plug-and-play training framework that requires no external encoders. Our method integrates a memory bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations. This decouples the number of negatives from the mini-batch size, providing abundant and high-quality negatives for a contrastive objective without a multiplicative increase in computational cost. A low-dimensional projection head is used to further minimize memory and bandwidth overhead. {\mname} offers three principal advantages: (1) it is self-contained, eliminating dependency on pretrained vision foundation models and their associated forward-pass overhead; (2) it introduces no additional parameters or computational cost during inference; and (3) it enables substantially faster convergence, achieving superior generative quality more efficiently. On ImageNet-256, {\mname} achieves a state-of-the-art FID of \textbf{2.40} within 400k steps, significantly outperforming comparable methods.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Keito Inoshita
ๆฅ็ๆ่ฆ
Driver distraction behavior recognition using in-vehicle cameras demands real-time inference on edge devices. However, lightweight models often fail to capture fine-grained behavioral cues, resulting in reduced performance on unseen drivers or under varying conditions. ROI-based methods also increase computational cost, making it difficult to balance efficiency and accuracy. This work addresses the need for a lightweight architecture that overcomes these constraints. We propose Computationally efficient Dynamic region of Interest Routing and domain-invariant Adversarial learning for lightweight driver behavior recognition (C-DIRA). The framework combines saliency-driven Top-K ROI pooling and fused classification for local feature extraction and integration. Dynamic ROI routing enables selective computation by applying ROI inference only to high difficulty data samples. Moreover, pseudo-domain labeling and adversarial learning are used to learn domain-invariant features robust to driver and background variation. Experiments on the State Farm Distracted Driver Detection Dataset show that C-DIRA maintains high accuracy with significantly fewer FLOPs and lower latency than prior lightweight models. It also demonstrates robustness under visual degradation such as blur and low-light, and stable performance across unseen domains. These results confirm C-DIRA's effectiveness in achieving compactness, efficiency, and generalization.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Young Kyung Kim, Oded Schlesinger, Yuzhou Zhao, J. Matias Di Martino, Guillermo Sapiro
Computer VisionRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a "black box." This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Furthermore, their non-human-like workflows make them difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Similar to the advantages in monitorability and performance that Chain-of-Thought (CoT) brought to large language models (LLMs), CoIG can produce equivalent benefits in text-to-image generation. CoIG utilizes an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model then executes this plan by progressively generating and editing the image. Each step focuses on a single semantic entity, enabling direct monitoring. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output; and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experimental results indicate that CoIG substantially enhances quantitative monitorability while achieving competitive compositional robustness compared to established baseline models. The framework is model-agnostic and can be integrated with any image generation model.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. Moreover, we propose a keyframe selection strategy to reduce visual redundancy by retaining semantically informative frames, along with an action merging and label reweighting mechanism that mitigates long-tailed supervision imbalance and facilitates stable multi-task co-training. Extensive experiments on the Aerial VLN benchmark validate the effectiveness of our method. Under the challenging monocular RGB-only setting, our model achieves strong results across both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the performance gap with state-of-the-art panoramic RGB-D counterparts. Comprehensive ablation studies further demonstrate the contribution of our task design and architectural choices.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haoyu Zhao, Weizhong Ding, Yuhao Yang, Zheng Tian, Linyi Yang et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging the gap between digital agents and the physical world, See-Control provides a concrete step toward enabling home robots to perform smartphone-dependent tasks in realistic environments.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianchen Qiu, Qirun Zhang, Jiajian He, Zhengyue Zhuge, Jiahui Xu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
In the absence of a mechanical stabilizer, the camera undergoes inevitable rotational dynamics during capturing, which induces perspective-based blur especially under long-exposure scenarios. From an optical standpoint, perspective-based blur is depth-position-dependent: objects residing at distinct spatial locations incur different blur levels even under the same imaging settings. Inspired by this, we propose a novel method that estimate metric depth by examining the blur pattern of a video stream and dense trajectory via joint optical design algorithm. Specifically, we employ off-the-shelf vision encoder and point tracker to extract video information. Then, we estimate depth map via windowed embedding and multi-window aggregation, and densify the sparse trajectory from the optical algorithm using a vision-language model. Evaluations on multiple depth datasets demonstrate that our method attains strong performance over large depth range, while maintaining favorable generalization. Relative to the real trajectory in handheld shooting settings, our optical algorithm achieves superior precision and the dense reconstruction maintains strong accuracy.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jisang Yoo, Gyeongjin Kang, Hyun-kyu Ko, Hyeonwoo Yu, Eunbyung Park
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.
๐
2025-12-09
๐ AAAI 2026
๐ฅ Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of "emptiness" without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.
๐
2025-12-09
๐ 08 pages, 10 figures, 04 tables, 20 references. Da
๐ฅ Swarn Singh Warshaneyan, Maksims Ivanovs, Blaลพ Cugmas, Inese Bฤrziลa, Laura Goldberga et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
This study explores the application of deep learning to improve and automate pollen grain detection and classification in both optical and holographic microscopy images, with a particular focus on veterinary cytology use cases. We used YOLOv8s for object detection and MobileNetV3L for the classification task, evaluating their performance across imaging modalities. The models achieved 91.3% mAP50 for detection and 97% overall accuracy for classification on optical images, whereas the initial performance on greyscale holographic images was substantially lower. We addressed the performance gap issue through dataset expansion using automated labeling and bounding box area enlargement. These techniques, applied to holographic images, improved detection performance from 2.49% to 13.3% mAP50 and classification performance from 42% to 54%. Our work demonstrates that, at least for image classification tasks, it is possible to pair deep learning techniques with cost-effective lensless digital holographic microscopy devices.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuna Kato, Shohei Mori, Hideo Saito, Yoshifumi Takatsume, Hiroki Kajita et al.
ๆฅ็ๆ่ฆ
Video recordings of open surgeries are greatly required for education and research purposes. However, capturing unobstructed videos is challenging since surgeons frequently block the camera field of view. To avoid occlusion, the positions and angles of the camera must be frequently adjusted, which is highly labor-intensive. Prior work has addressed this issue by installing multiple cameras on a shadowless lamp and arranging them to fully surround the surgical area. This setup increases the chances of some cameras capturing an unobstructed view. However, manual image alignment is needed in post-processing since camera configurations change every time surgeons move the lamp for optimal lighting. This paper aims to fully automate this alignment task. The proposed method identifies frames in which the lighting system moves, realigns them, and selects the camera with the least occlusion to generate a video that consistently presents the surgical field from a fixed perspective. A user study involving surgeons demonstrated that videos generated by our method were superior to those produced by conventional methods in terms of the ease of confirming the surgical area and the comfort during video viewing. Additionally, our approach showed improvements in video quality over existing techniques. Furthermore, we implemented several synthesis options for the proposed view-synthesis method and conducted a user study to assess surgeons' preferences for each option.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Olle Edgren Schรผllerqvist, Jens Baumann, Joakim Lindblad, Love Nordling, Artur Mezheyeuski et al.
Multimodal
ๆฅ็ๆ่ฆ
The tumor microenvironment (TME) has emerged as a promising source of prognostic biomarkers. To fully leverage its potential, analysis methods must capture complex interactions between different cell types. We propose HiGINE -- a hierarchical graph-based approach to predict patient survival (short vs. long) from TME characterization in multiplex immunofluorescence (mIF) images and enhance risk stratification in lung cancer. Our model encodes both local and global inter-relations in cell neighborhoods, incorporating information about cell types and morphology. Multimodal fusion, aggregating cancer stage with mIF-derived features, further boosts performance. We validate HiGINE on two public datasets, demonstrating improved risk stratification, robustness, and generalizability.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seunghwan Lee, Inyoung Jung, Hojoon Lee, Eunil Park, Sungeun Hong
ๆฅ็ๆ่ฆ
Continual Test-Time Adaptation (CTTA) enables pre-trained models to adapt to continuously evolving domains. Existing methods have improved robustness but typically rely on fixed or batch-level thresholds, which cannot account for varying difficulty across classes and instances. This limitation is especially problematic in semantic segmentation, where each image requires dense, multi-class predictions. We propose an approach that adaptively adjusts pseudo labels to reflect the confidence distribution within each image and dynamically balances learning toward classes most affected by domain shifts. This fine-grained, class- and instance-aware adaptation produces more reliable supervision and mitigates error accumulation throughout continual adaptation. Extensive experiments across eight CTTA and TTA scenarios, including synthetic-to-real and long-term shifts, show that our method consistently outperforms state-of-the-art techniques, setting a new standard for semantic segmentation under evolving conditions.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mahmoud Afifi, Zhongling Wang, Ran Zhang, Michael S. Brown
Robotics
ๆฅ็ๆ่ฆ
This paper presents a modular neural image signal processing (ISP) framework that processes raw inputs and renders high-quality display-referred images. Unlike prior neural ISP designs, our method introduces a high degree of modularity, providing full control over multiple intermediate stages of the rendering process.~This modular design not only achieves high rendering accuracy but also improves scalability, debuggability, generalization to unseen cameras, and flexibility to match different user-preference styles. To demonstrate the advantages of this design, we built a user-interactive photo-editing tool that leverages our neural ISP to support diverse editing operations and picture styles. The tool is carefully engineered to take advantage of the high-quality rendering of our neural ISP and to enable unlimited post-editable re-rendering. Our method is a fully learning-based framework with variants of different capacities, all of moderate size (ranging from ~0.5 M to ~3.9 M parameters for the entire pipeline), and consistently delivers competitive qualitative and quantitative results across multiple test sets. Watch the supplemental video at: https://youtu.be/ByhQjQSjxVM
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba et al.
ๆฅ็ๆ่ฆ
Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.
๐
2025-12-09
๐ 22 Pages, 26 Figures, This work has been submitted
๐ฅ Alexander Dow, Manduhu Manduhu, Matheus Santos, Ben Bartlett, Gerard Dooly et al.
Computer Vision
ๆฅ็ๆ่ฆ
This work leverages the continuous sweeping motion of LiDAR scanning to concentrate object detection efforts on specific regions that receive a change in point data from one frame to another. We achieve this by using a sliding time window with short strides and consider the temporal dimension by storing convolution results between passes. This allows us to ignore unchanged regions, significantly reducing the number of convolution operations per forward pass without sacrificing accuracy. This data reuse scheme introduces extreme sparsity to detection data. To exploit this sparsity, we extend our previous work on scatter-based convolutions to allow for data reuse, and as such propose Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling (SSCATeR). This operation treats incoming LiDAR data as a continuous stream and acts only on the changing parts of the point cloud. By doing so, we achieve the same results with as much as a 6.61-fold reduction in processing time. Our test results show that the feature maps output by our method are identical to those produced by traditional sparse convolution techniques, whilst greatly increasing the computational efficiency of the network.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifei Chen, Kaiyu Song, Yan Pan, Jianxing Yu, Jian Yin et al.
ๆฅ็ๆ่ฆ
Diffusion inversion aims to recover the initial noise corresponding to a given image such that this noise can reconstruct the original image through the denoising diffusion process. The key component of diffusion inversion is to minimize errors at each inversion step, thereby mitigating cumulative inaccuracies. Recently, fixed-point iteration has emerged as a widely adopted approach to minimize reconstruction errors at each inversion step. However, it suffers from high computational costs due to its iterative nature and the complexity of hyperparameter selection. To address these issues, we propose an iteration-free fixed-point estimator for diffusion inversion. First, we derive an explicit expression of the fixed point from an ideal inversion step. Unfortunately, it inherently contains an unknown data prediction error. Building upon this, we introduce the error approximation, which uses the calculable error from the previous inversion step to approximate the unknown error at the current inversion step. This yields a calculable, approximate expression for the fixed point, which is an unbiased estimator characterized by low variance, as shown by our theoretical analysis. We evaluate reconstruction performance on two text-image datasets, NOCAPS and MS-COCO. Compared to DDIM inversion and other inversion methods based on the fixed-point iteration, our method achieves consistent and superior performance in reconstruction tasks without additional iterations or training.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Indrajit Kar, Kalathur Chenchu Kishore Kumar
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64*64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhigang Jia, Duan Wang, Hengkai Wang, Yajun Xie, Meixiang Zhao et al.
Computer Vision
ๆฅ็ๆ่ฆ
Color image generation has a wide range of applications, but the existing generation models ignore the correlation among color channels, which may lead to chromatic aberration problems. In addition, the data distribution problem of color images has not been systematically elaborated and explained, so that there is still the lack of the theory about measuring different color images datasets. In this paper, we define a new quaternion Wasserstein distance and develop its dual theory. To deal with the quaternion linear programming problem, we derive the strong duality form with helps of quaternion convex set separation theorem and quaternion Farkas lemma. With using quaternion Wasserstein distance, we propose a novel Wasserstein quaternion generative adversarial network. Experiments demonstrate that this novel model surpasses both the (quaternion) generative adversarial networks and the Wasserstein generative adversarial network in terms of generation efficiency and image quality.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhen Zou, Xiaoxiao Ma, Jie Huang, Zichao Yu, Feng Zhao
Machine Learning
ๆฅ็ๆ่ฆ
Autoregressive(AR)-diffusion hybrid paradigms combine AR's structured modeling with diffusion's photorealistic synthesis, yet suffer from high latency due to sequential AR generation and iterative denoising. In this work, we tackle this bottleneck and propose a unified AR-diffusion framework Fast-ARDiff that jointly optimizes both components, accelerating AR speculative decoding while simultaneously facilitating faster diffusion decoding. Specifically: (1) The entropy-informed speculative strategy encourages draft model to produce higher-entropy representations aligned with target model's entropy characteristics, mitigating entropy mismatch and high rejection rates caused by draft overconfidence. (2) For diffusion decoding, rather than treating it as an independent module, we integrate it into the same end-to-end framework using a dynamic scheduler that prioritizes AR optimization to guide the diffusion part in further steps. The diffusion part is optimized through a joint distillation framework combining trajectory and distribution matching, ensuring stable training and high-quality synthesis with extremely few steps. During inference, shallow feature entropy from AR module is used to pre-filter low-entropy drafts, avoiding redundant computation and improving latency. Fast-ARDiff achieves state-of-the-art acceleration across diverse models: on ImageNet 256$\times$256, TransDiff attains 4.3$\times$ lossless speedup, and NextStep-1 achieves 3$\times$ acceleration on text-conditioned generation. Code will be available at https://github.com/aSleepyTree/Fast-ARDiff.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinyue Liang, Zhinyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich texture details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of 3D scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model. Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with realistic details while preserving the structural consistency with the 3D-native geometry. Our scheme is general to different 3D-native generators, and we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhangli Hu, Ye Chen, Jiajun Yao, Bingbing Ni
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Oil painting, as a high-level medium that blends human abstract thinking with artistic expression, poses substantial challenges for digital generation and editing due to its intricate brushstroke dynamics and stylized characteristics. Existing generation and editing techniques are often constrained by the distribution of training data and primarily focus on modifying real photographs. In this work, we introduce a unified multimodal framework for oil painting generation and editing. The proposed system allows users to incorporate reference images for precise semantic control, hand-drawn sketches for spatial structure alignment, and natural language prompts for high-level semantic guidance, while consistently maintaining a unified painting style across all outputs. Our method achieves interactive oil painting creation through three crucial technical advancements. First, we enhance the training stage with spatial alignment and semantic enhancement conditioning strategy, which map masks and sketches into spatial constraints, and encode contextual embedding from reference images and text into feature constraints, enabling object-level semantic alignment. Second, to overcome data scarcity, we propose a self-supervised style transfer pipeline based on Stroke-Based Rendering (SBR), which simulates the inpainting dynamics of oil painting restoration, converting real images into stylized oil paintings with preserved brushstroke textures to construct a large-scale paired training dataset. Finally, during inference, we integrate features using the AdaIN operator to ensure stylistic consistency. Extensive experiments demonstrate that our interactive system enables fine-grained editing while preserving the artistic qualities of oil paintings, achieving an unprecedented level of imagination realization in stylized oil paintings generation and editing.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunzhu Zhang, Zeyu Pan, Zhengwen Zeng, Shuheng Shen, Changhua Meng et al.
ๆฅ็ๆ่ฆ
GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability, minor visual perturbations (e.g. cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for samples with high-resolution and small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP's effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%. The code is available at https://github.com/ZJUSCL/MVP.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas, Muhammad Rafsan Kabir, Robin Krambroeckers et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task to atomic subtasks and invokes its virtual replicas, i.e. parameter-sharing subagents, to solve them in isolated context. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior to enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jialu Sui, Rui Liu, Hongsheng Zhang
ๆฅ็ๆ่ฆ
A major challenge in reconstructing buildings from LiDAR point clouds lies in accurately capturing building surfaces under varying point densities and noise interference. To flexibly gather high-quality 3D profiles of the building in diverse resolution, we propose OCCDiff applying latent diffusion in the occupancy function space. Our OCCDiff combines a latent diffusion process with a function autoencoder architecture to generate continuous occupancy functions evaluable at arbitrary locations. Moreover, a point encoder is proposed to provide condition features to diffusion learning, constraint the final occupancy prediction for occupancy decoder, and insert multi-modal features for latent generation to latent encoder. To further enhance the model performance, a multi-task training strategy is employed, ensuring that the point encoder learns diverse and robust feature representations. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy data.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vasco Ramos, Regev Cohen, Idan Szpektor, Joao Magalhaes
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
Conditional diffusion models rely on language-to-image alignment methods to steer the generation towards semantically accurate outputs. Despite the success of this architecture, misalignment and hallucinations remain common issues and require automatic misalignment detection tools to improve quality, for example by applying them in a Best-of-N (BoN) post-generation setting. Unfortunately, measuring the alignment after the generation is an expensive step since we need to wait for the overall generation to finish to determine prompt adherence. In contrast, this work hypothesizes that text/image misalignments can be detected early in the denoising process, enabling real-time alignment assessment without waiting for the complete generation. In particular, we propose NoisyCLIP a method that measures semantic alignment in the noisy latent space. This work is the first to explore and benchmark prompt-to-latent misalignment detection during image generation using dual encoders in the reverse diffusion process. We evaluate NoisyCLIP qualitatively and quantitatively and find it reduces computational cost by 50% while achieving 98% of CLIP alignment performance in BoN settings. This approach enables real-time alignment assessment during generation, reducing costs without sacrificing semantic fidelity.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaming Zhang, Che Wang, Yang Cao, Longtao Huang, Wei Yang Bryan Lim
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multi-modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through hierarchical chain-of-thought reasoning. Existing privacy protection techniques, primarily designed for perception-based models, prove ineffective against MLRMs' sophisticated multi-step reasoning processes that analyze environmental cues. We introduce \textbf{ReasonBreak}, a novel adversarial framework specifically designed to disrupt hierarchical reasoning in MLRMs through concept-aware perturbations. Our approach is founded on the key insight that effective disruption of geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent reasoning stages. To facilitate this approach, we contribute \textbf{GeoPrivacy-6K}, a comprehensive dataset comprising 6,341 ultra-high-resolution images ($\geq$2K) with hierarchical concept annotations. Extensive evaluation across seven state-of-the-art MLRMs (including GPT-o3, GPT-5, Gemini 2.5 Pro) demonstrates ReasonBreak's superior effectiveness, achieving a 14.4\% improvement in tract-level protection (33.8\% vs 19.4\%) and nearly doubling block-level protection (33.5\% vs 16.8\%). This work establishes a new paradigm for privacy protection against reasoning-based threats.
๐
2025-12-09
๐ ArXiv Preprint (cs.GR)
๐ฅ Jianan Li, Xiao Chen, Tao Huang, Tien-Tsin Wong
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Video data is more cost-effective than motion capture data for learning 3D character motion controllers, yet synthesizing realistic and diverse behaviors directly from videos remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to obtain 3D trajectories for physics-based imitation. These reconstruction methods struggle with generalizability, as they either require 3D training data (potentially scarce) or fail to produce physically plausible poses, hindering their application to challenging scenarios like human-object interaction (HOI) or non-human characters. We tackle this challenge by introducing Mimic2DM, a novel motion imitation framework that learns the control policy directly and solely from widely available 2D keypoint trajectories extracted from videos. By minimizing the reprojection error, we train a general single-view 2D motion tracking policy capable of following arbitrary 2D reference motions in physics simulation, using only 2D motion data. The policy, when trained on diverse 2D motions captured from different or slightly different viewpoints, can further acquire 3D motion tracking capabilities by aggregating multiple views. Moreover, we develop a transformer-based autoregressive 2D motion generator and integrate it into a hierarchical control framework, where the generator produces high-quality 2D reference trajectories to guide the tracking policy. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal movements, without any reliance on explicit 3D motion data. Project Website: https://jiann-li.github.io/mimic2dm/
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yijia Guo, Tong Hu, Zhiwei Li, Liwen Hu, Keming Qian et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in 3D Gaussian Splatting (3DGS) have enabled efficient free-viewpoint rendering and photorealistic scene reconstruction. While on-the-fly extensions of 3DGS have shown promise for real-time reconstruction from monocular RGB streams, they often fail to achieve complete 3D coverage due to the limited field of view (FOV). Employing a multi-camera rig fundamentally addresses this limitation. In this paper, we present the first on-the-fly 3D reconstruction framework for multi-camera rigs. Our method incrementally fuses dense RGB streams from multiple overlapping cameras into a unified Gaussian representation, achieving drift-free trajectory estimation and efficient online reconstruction. We propose a hierarchical camera initialization scheme that enables coarse inter-camera alignment without calibration, followed by a lightweight multi-camera bundle adjustment that stabilizes trajectories while maintaining real-time performance. Furthermore, we introduce a redundancy-free Gaussian sampling strategy and a frequency-aware optimization scheduler to reduce the number of Gaussian primitives and the required optimization iterations, thereby maintaining both efficiency and reconstruction fidelity. Our method reconstructs hundreds of meters of 3D scenes within just 2 minutes using only raw multi-camera video streams, demonstrating unprecedented speed, robustness, and Fidelity for on-the-fly 3D scene reconstruction.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ada Gorgun, Fawaz Sammani, Nikos Deligiannis, Bernt Schiele, Jonas Fischer
Robotics
ๆฅ็ๆ่ฆ
Diffusion models are usually evaluated by their final outputs, gradually denoising random noise into meaningful images. Yet, generation unfolds along a trajectory, and analyzing this dynamic process is crucial for understanding how controllable, reliable, and predictable these models are in terms of their success/failure modes. In this work, we ask the question: when does noise turn into a specific concept (e.g., age) and lock in the denoising trajectory? We propose PCI (Prompt-Conditioned Intervention) to study this question. PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. The central idea is the analysis of Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, offering a way to characterize the temporal dynamics of concept formation. Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type. These findings also provide actionable insights for text-driven image editing, highlighting when interventions are most effective without requiring access to model internals or training, and yielding quantitatively stronger edits that achieve a balance of semantic accuracy and content preservation than strong baselines. Code is available at: https://github.com/adagorgun/PCI-Prompt-Controlled-Interventions
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuning Gong, Yifei Liu, Yifan Zhan, Muyao Niu, Xueying Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time various Gaussian Splatting and meshes rendering. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, "click-to-run" browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, under identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current Web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huiguo He, Pengyu Yan, Ziqi Yi, Weizhi Zhong, Zheng Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Drag-based image editing aims to modify visual content followed by user-specified drag operations. Despite existing methods having made notable progress, they still fail to fully exploit the contextual information in the reference image, including fine-grained texture details, leading to edits with limited coherence and fidelity. To address this challenge, we introduce ContextDrag, a new paradigm for drag-based editing that leverages the strong contextual modeling capability of editing models, such as FLUX-Kontext. By incorporating VAE-encoded features from the reference image, ContextDrag can leverage rich contextual cues and preserve fine-grained details, without the need for finetuning or inversion. Specifically, ContextDrag introduced a novel Context-preserving Token Injection (CTI) that injects noise-free reference features into their correct destination locations via a Latent-space Reverse Mapping (LRM) algorithm. This strategy enables precise drag control while preserving consistency in both semantics and texture details. Second, ContextDrag adopts a novel Position-Consistent Attention (PCA), which positional re-encodes the reference tokens and applies overlap-aware masking to eliminate interference from irrelevant reference features. Extensive experiments on DragBench-SR and DragBench-DR demonstrate that our approach surpasses all existing SOTA methods. Code will be publicly available.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chamath Ranasinghe, Uthayasanker Thayasivam
ๆฅ็ๆ่ฆ
Football player tracking is challenged by frequent occlusions, similar appearances, and rapid motion in crowded scenes. This paper presents a lightweight SAM-based tracking method combining the Segment Anything Model (SAM) with CSRT trackers and jersey color-based appearance models. We propose a team-aware tracking system that uses SAM for precise initialization and HSV histogram-based re-identification to improve occlusion recovery. Our evaluation measures three dimensions: processing speed (FPS and memory), tracking accuracy (success rate and box stability), and robustness (occlusion recovery and identity consistency). Experiments on football video sequences show that the approach achieves 7.6-7.7 FPS with stable memory usage (~1880 MB), maintaining 100 percent tracking success in light occlusions and 90 percent in crowded penalty-box scenarios with 5 or more players. Appearance-based re-identification recovers 50 percent of heavy occlusions, demonstrating the value of domain-specific cues. Analysis reveals key trade-offs: the SAM + CSRT combination provides consistent performance across crowd densities but struggles with long-term occlusions where players leave the frame, achieving only 8.66 percent re-acquisition success. These results offer practical guidelines for deploying football tracking systems under resource constraints, showing that classical tracker-based methods work well with continuous visibility but require stronger re-identification mechanisms for extended absences.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Madhav Gupta, Vishak Prasad C, Ganesh Ramakrishnan
Machine Learning
ๆฅ็ๆ่ฆ
Subset selection-based methods are widely used to explain deep vision models: they attribute predictions by highlighting the most influential image regions and support object-level explanations. While these methods perform well in in-distribution (ID) settings, their behavior under out-of-distribution (OOD) conditions remains poorly understood. Through extensive experiments across multiple ID-OOD sets, we find that reliability of the existing subset based methods degrades markedly, yielding redundant, unstable, and uncertainty-sensitive explanations. To address these shortcomings, we introduce a framework that combines submodular subset selection with layer-wise, gradient-based uncertainty estimation to improve robustness and fidelity without requiring additional training or auxiliary models. Our approach estimates uncertainty via adaptive weight perturbations and uses these estimates to guide submodular optimization, ensuring diverse and informative subset selection. Empirical evaluations show that, beyond mitigating the weaknesses of existing methods under OOD scenarios, our framework also yields improvements in ID settings. These findings highlight limitations of current subset-based approaches and demonstrate how uncertainty-driven optimization can enhance attribution and object-level interpretability, paving the way for more transparent and trustworthy AI in real-world vision applications.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Luca Cogo, Marco Buzzelli, Simone Bianco, Javier Vazquez-Corral, Raimondo Schettini
ๆฅ็ๆ่ฆ
Recent advances in snapshot multispectral (MS) imaging have enabled compact, low-cost spectral sensors for consumer and mobile devices. By capturing richer spectral information than conventional RGB sensors, these systems can enhance key imaging tasks, including color correction. However, most existing methods treat the color correction pipeline in separate stages, often discarding MS data early in the process. We propose a unified, learning-based framework that (i) performs end-to-end color correction and (ii) jointly leverages data from a high-resolution RGB sensor and an auxiliary low-resolution MS sensor. Our approach integrates the full pipeline within a single model, producing coherent and color-accurate outputs. We demonstrate the flexibility and generality of our framework by refactoring two different state-of-the-art image-to-image architectures. To support training and evaluation, we construct a dedicated dataset by aggregating and repurposing publicly available spectral datasets, rendering under multiple RGB camera sensitivities. Extensive experiments show that our approach improves color accuracy and stability, reducing error by up to 50% compared to RGB-only and MS-driven baselines. Datasets, code, and models will be made available upon acceptance.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qing Xu, Kun Yuan, Yuxiang Luo, Yuhao Zhai, Wenting Duan et al.
ๆฅ็ๆ่ฆ
Surgical segmentation is pivotal for scene understanding yet remains hindered by annotation scarcity and semantic inconsistency across diverse procedures. Existing approaches typically fine-tune natural foundation models (e.g., SAM) with limited supervision, functioning merely as domain adapters rather than surgical foundation models. Consequently, they struggle to generalize across the vast variability of surgical targets. To bridge this gap, we present LapFM, a foundation model designed to evolve robust segmentation capabilities from massive unlabeled surgical images. Distinct from medical foundation models relying on inefficient self-supervised proxy tasks, LapFM leverages a Hierarchical Concept Evolving Pre-training paradigm. First, we establish a Laparoscopic Concept Hierarchy (LCH) via a hierarchical mask decoder with parent-child query embeddings, unifying diverse entities (i.e., Anatomy, Tissue, and Instrument) into a scalable knowledge structure with cross-granularity semantic consistency. Second, we propose a Confidence-driven Evolving Labeling that iteratively generates and filters pseudo-labels based on hierarchical consistency, progressively incorporating reliable samples from unlabeled images into training. This process yields LapBench-114K, a large-scale benchmark comprising 114K image-mask pairs. Extensive experiments demonstrate that LapFM significantly outperforms state-of-the-art methods, establishing new standards for granularity-adaptive generalization in universal laparoscopic segmentation. The source code is available at https://github.com/xq141839/LapFM.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nico Leuze, Maximilian Hoh, Samed Doฤan, Nicolas R. -Peรฑa, Alfred Schoettl
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Accurately recovering 6D poses in densely packed industrial bin-picking environments remain a serious challenge, owing to occlusions, reflections, and textureless parts. We introduce a holistic depth-only 6D pose estimation approach that fuses multi-view depth maps into either a fine-grained 3D point cloud in its vanilla version, or a sparse Truncated Signed Distance Field (TSDF). At the core of our framework lies a staged heatmap mechanism that yields scene-adaptive attention priors across different resolutions, steering computation toward foreground regions, thus keeping memory requirements at high resolutions feasible. Along, we propose a density-aware sparse transformer block that dynamically attends to (self-) occlusions and the non-uniform distribution of 3D data. While sparse 3D approaches has proven effective for long-range perception, its potential in close-range robotic applications remains underexplored. Our framework operates fully sparse, enabling high-resolution volumetric representations to capture fine geometric details crucial for accurate pose estimation in clutter. Our method processes the entire scene integrally, predicting the 6D pose via a novel per-voxel voting strategy, allowing simultaneous pose predictions for an arbitrary number of target objects. We validate our method on the recently published IPD and MV-YCB multi-view datasets, demonstrating competitive performance in heavily cluttered industrial and household bin picking scenarios.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Chen, Shaobo Ju, Qiong Wu, Chenxin Fang, Kun Zhang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process videos of limited frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into five recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show the obvious performance gains by OneClip-RAG over MLLMs, e.g., boosting InternLV2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU, but also show its superior efficiency in handling long videos. e.g., enabling LLaVA-Video understand up to an hour of videos in less than 2.2 minutes on a single 4090 GPU.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingqi Gao, Yunqi Miao, Jungong Han
ๆฅ็ๆ่ฆ
Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human-centric understanding in real-world scenarios. While recent image-based HMR methods such as SAM 3D Body achieve strong robustness on in-the-wild images, they rely on per-frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by leveraging the inherent human continuity in videos. We propose SAM-Body4D, a training-free framework for temporally consistent and occlusion-robust HMR from videos. We first generate identity-consistent masklets using a promptable video segmentation model, then refine them with an Occlusion-Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full-body mesh trajectories, while a padding-based parallel strategy enables efficient multi-human inference. Experimental results demonstrate that SAM-Body4D achieves improved temporal stability and robustness in challenging in-the-wild videos, without any retraining. Our code and demo are available at: https://github.com/gaomingqi/sam-body4d.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Samitha Nuwan Thilakarathna, Ercan Avsar, Martin Mathias Nielsen, Malte Pedersen
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Accurate fisheries data are crucial for effective and sustainable marine resource management. With the recent adoption of Electronic Monitoring (EM) systems, more video data is now being collected than can be feasibly reviewed manually. This paper addresses this challenge by developing an optimized deep learning pipeline for automated fish re-identification (Re-ID) using the novel AutoFish dataset, which simulates EM systems with conveyor belts with six similarly looking fish species. We demonstrate that key Re-ID metrics (R1 and mAP@k) are substantially improved by using hard triplet mining in conjunction with a custom image transformation pipeline that includes dataset-specific normalization. By employing these strategies, we demonstrate that the Vision Transformer-based Swin-T architecture consistently outperforms the Convolutional Neural Network-based ResNet-50, achieving peak performance of 41.65% mAP@k and 90.43% Rank-1 accuracy. An in-depth analysis reveals that the primary challenge is distinguishing visually similar individuals of the same species (Intra-species errors), where viewpoint inconsistency proves significantly more detrimental than partial occlusion. The source code and documentation are available at: https://github.com/msamdk/Fish_Re_Identification.git
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Philipp Srock, Juan E. Tapia, Christoph Busch
Computer Vision
ๆฅ็ๆ่ฆ
Facial retouching to beautify images is widely spread in social media, advertisements, and it is even applied in professional photo studios to let individuals appear younger, remove wrinkles and skin impurities. Generally speaking, this is done to enhance beauty. This is not a problem itself, but when retouched images are used as biometric samples and enrolled in a biometric system, it is one. Since previous work has proven facial retouching to be a challenge for face recognition systems,the detection of facial retouching becomes increasingly necessary. This work proposes to study and analyze changes in beauty assessment algorithms of retouched images, assesses different feature extraction methods based on artificial intelligence in order to improve retouching detection, and evaluates whether face beauty can be exploited to enhance the detection rate. In a scenario where the attacking retouching algorithm is unknown, this work achieved 1.1% D-EER on single image detection.
๐
2025-12-09
๐ The paper has been accepted and officially publish
๐ฅ Jing Tao, You Li, Banglei Guan, Yang Shang, Qifeng Yu
ๆฅ็ๆ่ฆ
Under challenging light conditions, captured images often suffer from various degradations, leading to a decline in the performance of vision-based applications. Although numerous methods have been proposed to enhance image quality, they either significantly amplify inherent noise or are only effective under specific illumination conditions. To address these issues, we propose a novel framework for simultaneous enhancement and noise suppression under complex illumination conditions. Firstly, a gradient-domain weighted guided filter (GDWGIF) is employed to accurately estimate illumination and improve image quality. Next, the Retinex model is applied to decompose the captured image into separate illumination and reflection layers. These layers undergo parallel processing, with the illumination layer being corrected to optimize lighting conditions and the reflection layer enhanced to improve image quality. Finally, the dynamic range of the image is optimized through multi-exposure fusion and a linear stretching strategy. The proposed method is evaluated on real-world datasets obtained from practical applications. Experimental results demonstrate that our proposed method achieves better performance compared to state-of-the-art methods in both contrast enhancement and noise suppression.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bozhou Li, Xinda Xue, Sihan Yang, Yang Shi, Xinlong Chen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between the high-norm visual tokens and the low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an ``asymmetric update dynamic,'' where high-norm visual tokens exhibit a ``representational inertia,'' causing them to transform semantically much slower than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic -- the persistence of norm disparity and the resulting asymmetric update rates -- is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ju-Young Kim, Ji-Hong Park, Gun-Woo Kim
Computer Vision
ๆฅ็ๆ่ฆ
Fire has long been linked to human life, causing severe disasters and losses. Early detection is crucial, and with the rise of home IoT technologies, household fire detection systems have emerged. However, the lack of sufficient fire datasets limits the performance of detection models. We propose the SCU-CGAN model, which integrates U-Net, CBAM, and an additional discriminator to generate realistic fire images from nonfire images. We evaluate the image quality and confirm that SCU-CGAN outperforms existing models. Specifically, SCU-CGAN achieved a 41.5% improvement in KID score compared to CycleGAN, demonstrating the superior quality of the generated fire images. Furthermore, experiments demonstrate that the augmented dataset significantly improves the accuracy of fire detection models without altering their structure. For the YOLOv5 nano model, the most notable improvement was observed in the mAP@0.5:0.95 metric, which increased by 56.5%, highlighting the effectiveness of the proposed approach.
๐
2025-12-09
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Ali Sakour
Machine Learning
ๆฅ็ๆ่ฆ
Biological systems exhibit remarkable morphogenetic plasticity, where a single genome can encode various specialized cellular structures triggered by local chemical signals. In the domain of Deep Learning, Differentiable Neural Cellular Automata (NCA) have emerged as a paradigm to mimic this self-organization. However, existing NCA research has predominantly focused on continuous texture synthesis or single-target object recovery, leaving the challenge of class-conditional structural generation largely unexplored. In this work, we propose a novel Conditional Neural Cellular Automata (c-NCA) architecture capable of growing distinct topological structures - specifically MNIST digits - from a single generic seed, guided solely by a spatially broadcasted class vector. Unlike traditional generative models (e.g., GANs, VAEs) that rely on global reception fields, our model enforces strict locality and translation equivariance. We demonstrate that by injecting a one-hot condition into the cellular perception field, a single set of local rules can learn to break symmetry and self-assemble into ten distinct geometric attractors. Experimental results show that our c-NCA achieves stable convergence, correctly forming digit topologies from a single pixel, and exhibits robustness characteristic of biological systems. This work bridges the gap between texture-based NCAs and structural pattern formation, offering a lightweight, biologically plausible alternative for conditional generation.
๐
2025-12-09
๐ NeurIPS 2025
๐ฅ Jiahao Lu, Weitao Xiong, Jiacheng Deng, Peng Li, Tianyu Huang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, we argue that the existing monocular 3D tracking methods still fall short in separating the camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in the videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts the arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize the current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating the tracks in overlapped regions. Finally, we present an efficient optimization-based framework to back-project dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianwei Wang, Qing Wang, Menglan Ruan, Rongjun Ge, Chunfeng Yang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Generating BOLD images from T1w images offers a promising solution for recovering missing BOLD information and enabling downstream tasks when BOLD images are corrupted or unavailable. Motivated by this, we propose DINO-BOLDNet, a DINOv3-guided multi-slice attention framework that integrates a frozen self-supervised DINOv3 encoder with a lightweight trainable decoder. The model uses DINOv3 to extract within-slice structural representations, and a separate slice-attention module to fuse contextual information across neighboring slices. A multi-scale generation decoder then restores fine-grained functional contrast, while a DINO-based perceptual loss encourages structural and textural consistency between predictions and ground-truth BOLD in the transformer feature space. Experiments on a clinical dataset of 248 subjects show that DINO-BOLDNet surpasses a conditional GAN baseline in both PSNR and MS-SSIM. To our knowledge, this is the first framework capable of generating mean BOLD images directly from T1w images, highlighting the potential of self-supervised transformer guidance for structural-to-functional mapping.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chang Liu, Hongliang Yuan, Lianghao Zhang, Sichao Wang, Jianwei Guo et al.
ๆฅ็ๆ่ฆ
Rendering complex reflection of real-world scenes using 3D Gaussian splatting has been a quite promising solution for photorealistic novel view synthesis, but still faces bottlenecks especially in rendering speed and memory storage. This paper proposes a new Hybrid Splatting(HybridSplat) mechanism for Gaussian primitives. Our key idea is a new reflection-baked Gaussian tracing, which bakes the view-dependent reflection within each Gaussian primitive while rendering the reflection using tile-based Gaussian splatting. Then we integrate the reflective Gaussian primitives with base Gaussian primitives using a unified hybrid splatting framework for high-fidelity scene reconstruction. Moreover, we further introduce a pipeline-level acceleration for the hybrid splatting, and reflection-sensitive Gaussian pruning to reduce the model size, thus achieving much faster rendering speed and lower memory storage while preserving the reflection rendering quality. By extensive evaluation, our HybridSplat accelerates about 7x rendering speed across complex reflective scenes from Ref-NeRF, NeRF-Casting with 4x fewer Gaussian primitives than similar ray-tracing based Gaussian splatting baselines, serving as a new state-of-the-art method especially for complex reflective scenes.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xianghong Xiao, Zeyu Xia, Zhou Fei, Jinliang Xiao, Haorui Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Pansharpening aims to fuse a high-resolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to generate a high-resolution multispectral image (HRMS). Conventional deep learning-based methods are inherently limited in their ability to adapt to regional heterogeneity within feature representations. Although various adaptive convolution methods have been proposed to address this limitation, they often suffer from excessive computational costs and a limited ability to capture heterogeneous regions in remote sensing images effectively. To overcome these challenges, we propose Bimodal Bi-Adaptive Mask-Aware Convolution (Bi^2MAC), which effectively exploits information from different types of regions while intelligently allocating computational resources. Specifically, we design a lightweight module to generate both soft and hard masks, which are used to modulate the input features preliminarily and to guide different types of regions into separate processing branches, respectively. Redundant features are directed to a compact branch for low-cost global processing. In contrast, heterogeneous features are routed to a focused branch that invests more computational resources for fine-grained modeling. Extensive experiments on multiple benchmark datasets demonstrate that Bi^2MAC achieves state-of-the-art (SOTA) performance while requiring substantially lower training time and parameter counts, and the minimal computational cost among adaptive convolution models.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pengbo Li, Yiding Sun, Haozhe Cheng
Computer VisionMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Self-supervised representation learning has shown significant improvement in Natural Language Processing and 2D Computer Vision. However, existing methods face difficulties in representing 3D data because of its unordered and uneven density. Through an in-depth analysis of mainstream contrastive and generative approaches, we find that contrastive models tend to suffer from overfitting, while 3D Mask Autoencoders struggle to handle unordered point clouds. This motivates us to learn 3D representations by sharing the merits of diffusion and contrast models, which is non-trivial due to the pattern difference between the two paradigms. In this paper, we propose \textit{PointDico}, a novel model that seamlessly integrates these methods. \textit{PointDico} learns from both denoising generative modeling and cross-modal contrastive learning through knowledge distillation, where the diffusion model serves as a guide for the contrastive model. We introduce a hierarchical pyramid conditional generator for multi-scale geometric feature extraction and employ a dual-channel design to effectively integrate local and global contextual information. \textit{PointDico} achieves a new state-of-the-art in 3D representation learning, \textit{e.g.}, \textbf{94.32\%} accuracy on ScanObjectNN, \textbf{86.5\%} Inst. mIoU on ShapeNetPart.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Michael R. Martin, Garrick Chan, Kwan-Liu Ma
Machine Learning
ๆฅ็ๆ่ฆ
Recent image protection mechanisms such as Glaze and Nightshade introduce imperceptible, adversarially designed perturbations intended to disrupt downstream text-to-image generative models. While their empirical effectiveness is known, the internal structure, detectability, and representational behavior of these perturbations remain poorly understood. This study provides a systematic, explainable AI analysis using a unified framework that integrates white-box feature-space inspection and black-box signal-level probing. Through latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization, we show that protection mechanisms operate as structured, low-entropy perturbations tightly coupled to underlying image content across representational, spatial, and spectral domains. Protected images preserve content-driven feature organization with protection-specific substructure rather than inducing global representational drift. Detectability is governed by interacting effects of perturbation entropy, spatial deployment, and frequency alignment, with sequential protection amplifying detectable structure rather than suppressing it. Frequency-domain analysis shows that Glaze and Nightshade redistribute energy along dominant image-aligned frequency axes rather than introducing diffuse noise. These findings indicate that contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, explaining why protection signals remain visually subtle yet consistently detectable. This work advances the interpretability of adversarial image protection and informs the design of future defenses and detection strategies for generative AI systems.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wang Chen, Ziyan Luo, Shuangyue Wang
Machine Learning
ๆฅ็ๆ่ฆ
Input features are conventionally represented as vectors, matrices, or third order tensors in the real field, for color image classification. Inspired by the success of quaternion data modeling for color images in image recovery and denoising tasks, we propose a novel classification method for color image classification, named as the Low-rank Support Quaternion Matrix Machine (LSQMM), in which the RGB channels are treated as pure quaternions to effectively preserve the intrinsic coupling relationships among channels via the quaternion algebra. For the purpose of promoting low-rank structures resulting from strongly correlated color channels, a quaternion nuclear norm regularization term, serving as a natural extension of the conventional matrix nuclear norm to the quaternion domain, is added to the hinge loss in our LSQMM model. An Alternating Direction Method of Multipliers (ADMM)-based iterative algorithm is designed to effectively resolve the proposed quaternion optimization model. Experimental results on multiple color image classification datasets demonstrate that our proposed classification approach exhibits advantages in classification accuracy, robustness and computational efficiency, compared to several state-of-the-art methods using support vector machines, support matrix machines, and support tensor machines.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuedeng Liu, Jiabao Guo, Zheng Zhang, Fei Wang, Zhi Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Video Motion Magnification (VMM) amplifies subtle macroscopic motions to a perceptible level. Recently, existing mainstream Eulerian approaches address amplification-induced noise via decoupling representation learning such as texture, shape and frequancey schemes, but they still struggle to separate photon noise from true micro-motion when motion displacements are very small. We propose GeoDiffMM, a novel diffusion-based Lagrangian VMM framework conditioned on optical flow as a geometric cue, enabling structurally consistent motion magnification. Specifically, we design a Noise-free Optical Flow Augmentation strategy that synthesizes diverse nonrigid motion fields without photon noise as supervision, helping the model learn more accurate geometry-aware optial flow and generalize better. Next, we develop a Diffusion Motion Magnifier that conditions the denoising process on (i) optical flow as a geometry prior and (ii) a learnable magnification factor controlling magnitude, thereby selectively amplifying motion components consistent with scene semantics and structure while suppressing content-irrelevant perturbations. Finally, we perform Flow-based Video Synthesis to map the amplified motion back to the image domain with high fidelity. Extensive experiments on real and synthetic datasets show that GeoDiffMM outperforms state-of-the-art methods and significantly improves motion magnification.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Achraf Ben-Hamadou, Nour Neifar, Ahmed Rekik, Oussama Smaoui, Firas Bouzguenda et al.
Machine Learning
ๆฅ็ๆ่ฆ
Teeth landmark detection is a critical task in modern clinical orthodontics. Their precise identification enables advanced diagnostics, facilitates personalized treatment strategies, and supports more effective monitoring of treatment progress in clinical dentistry. However, several significant challenges may arise due to the intricate geometry of individual teeth and the substantial variations observed across different individuals. To address these complexities, the development of advanced techniques, especially through the application of deep learning, is essential for the precise and reliable detection of 3D tooth landmarks. In this context, the 3DTeethLand challenge was held in collaboration with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2024, calling for algorithms focused on teeth landmark detection from intraoral 3D scans. This challenge introduced the first publicly available dataset for 3D teeth landmark detection, offering a valuable resource to assess the state-of-the-art methods in this task and encourage the community to provide methodological contributions towards the resolution of their problem with significant clinical implications.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuhui Li, Zhengquan Luo, Zihui Cui, Zhiqiang Xu
ๆฅ็ๆ่ฆ
Dataset distillation aims to synthesize a compact subset of the original data, enabling models trained on it to achieve performance comparable to those trained on the original large dataset. Existing distribution-matching methods are confined to Euclidean spaces, making them only capture linear structures and overlook the intrinsic geometry of real data, e.g., curvature. However, high-dimensional data often lie on low-dimensional manifolds, suggesting that dataset distillation should have the distilled data manifold aligned with the original data manifold. In this work, we propose a geometry-aware distribution-matching framework, called \textbf{GeoDM}, which operates in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, with flat, hierarchical, and cyclical structures all captured by a unified representation. To adapt to the underlying data geometry, we introduce learnable curvature and weight parameters for three kinds of geometries. At the same time, we design an optimal transport loss to enhance the distribution fidelity. Our theoretical analysis shows that the geometry-aware distribution matching in a product space yields a smaller generalization error bound than the Euclidean counterparts. Extensive experiments conducted on standard benchmarks demonstrate that our algorithm outperforms state-of-the-art data distillation methods and remains effective across various distribution-matching strategies for the single geometries.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alexander Goslin
Robotics
ๆฅ็ๆ่ฆ
For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. We introduce Terrain Diffusion, an AI-era successor to Perlin noise that bridges the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. At its core is InfiniteDiffusion, a novel algorithm for infinite generation, enabling seamless, real-time synthesis of boundless landscapes. A hierarchical stack of diffusion models couples planetary context with local detail, while a compact Laplacian encoding stabilizes outputs across Earth-scale dynamic ranges. An open-source infinite-tensor framework supports constant-memory manipulation of unbounded tensors, and few-step consistency distillation enables efficient generation. Together, these components establish diffusion models as a practical foundation for procedural world generation, capable of synthesizing entire planets coherently, controllably, and without limits.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng et al.
Computer VisionMultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
๐
2025-12-09
๐ ArXiv Preprint (physics.geo-ph)
๐ฅ Guangyuan Zou, Junlun Li, Feng Liu, Xuejing Zheng, Jianjian Xie et al.
Machine Learning
ๆฅ็ๆ่ฆ
Full waveform inversion (FWI) has become a widely adopted technique for high-resolution subsurface imaging. However, its inherent strong nonlinearity often results in convergence toward local minima. Recently, deep image prior-based reparameterized FWI (DIP-FWI) has been proposed to alleviate the dependence on massive training data. By exploiting the spectral bias and implicit regularization in the neural network architecture, DIP-FWI can effectively avoid local minima and reconstruct more geologically plausible velocity models. Nevertheless, existing DIP-FWI typically use a fixed random input throughout the inversion process, which fails to utilize the mapping and correlation between the input and output of the network. Moreover, under complex geological conditions, the lack of informative prior in the input can exacerbate the ill-posedness of the inverse problem, leading to artifacts and unstable reconstructions. To address these limitations, we propose a self-reinforced DIP-FWI (SRDIP-FWI) framework, in which a steering algorithm alternately updates both the network parameters and the input at each iteration using feedback from the current network output. This design allows adaptive structural enhancement and improved regularization, thereby effectively mitigating the ill-posedness in FWI. Additionally, we analyze the spectral bias of the network in SRDIP-FWI and quantify its role in multiscale velocity model building. Synthetic tests and field land data application demonstrate that SRDIP-FWI achieves superior resolution, improved accuracy and greater depth penetration compared to multiscale FWI. More importantly, SRDIP-FWI eliminates the need for manual frequency-band selection and time-window picking, substantially simplifying the inversion workflow. Overall, the proposed method provides a novel, adaptive and robust framework for accurate subsurface velocity model reconstruction.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji
Computer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.
๐
2025-12-09
๐ ICRA 2025
๐ฅ Srijan Dokania, Dharini Raghavan
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth, weighted-PCA pose extraction, and 3D Gaussian Splatting (3DGS), TeleAssist provides every operator with real-time global positions and orientations of multiple robots without fiducials or depth sensors in an interaction-centric teleoperation setup.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung et al.
ๆฅ็ๆ่ฆ
Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width and channel wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hafeez Husain Cholakkal, Stefano Arrigoni, Francesco Braghin
Machine LearningRoboticsMultimodal
ๆฅ็ๆ่ฆ
Accurate extrinsic calibration of LiDAR, RADAR, and camera sensors is essential for reliable perception in autonomous vehicles. Still, it remains challenging due to factors such as mechanical vibrations and cumulative sensor drift in dynamic environments. This paper presents RLCNet, a novel end-to-end trainable deep learning framework for the simultaneous online calibration of these multimodal sensors. Validated on real-world datasets, RLCNet is designed for practical deployment and demonstrates robust performance under diverse conditions. To support real-time operation, an online calibration framework is introduced that incorporates a weighted moving average and outlier rejection, enabling dynamic adjustment of calibration parameters with reduced prediction noise and improved resilience to drift. An ablation study highlights the significance of architectural choices, while comparisons with existing methods demonstrate the superior accuracy and robustness of the proposed approach.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yun Liu, Tao Li, Cosmin Ancuti, Wenqi Ren, Weisi Lin
Computer Vision
ๆฅ็ๆ่ฆ
Scene recovery serves as a critical task for various computer vision applications. Existing methods typically rely on a single prior, which is inherently insufficient to handle multiple degradations, or employ complex network architectures trained on synthetic data, which suffer from poor generalization for diverse real-world scenarios. In this paper, we propose Spatial and Frequency Priors (SFP) for real-world scene recovery. In the spatial domain, we observe that the inverse of the degraded image exhibits a projection along its spectral direction that resembles the scene transmission. Leveraging this spatial prior, the transmission map is estimated to recover the scene from scattering degradation. In the frequency domain, a mask is constructed for adaptive frequency enhancement, with two parameters estimated using our proposed novel priors. Specifically, one prior assumes that the mean intensity of the degraded image's direct current (DC) components across three channels in the frequency domain closely approximates that of each channel in the clear image. The second prior is based on the observation that, for clear images, the magnitude of low radial frequencies below 0.001 constitutes approximately 1% of the total spectrum. Finally, we design a weighted fusion strategy to integrate spatial-domain restoration, frequency-domain enhancement, and salient features from the input image, yielding the final recovered result. Extensive evaluations demonstrate the effectiveness and superiority of our proposed SFP for scene recovery under various degradation conditions.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ YiLin Zhou, Lili Wei, Zheming Xu, Ziyi Chen, Congyan Lang
Machine Learning
ๆฅ็ๆ่ฆ
Few-shot 3D point cloud semantic segmentation (FS-3DSeg) aims to segment novel classes with only a few labeled samples. However, existing metric-based prototype learning methods generate prototypes solely from the support set, without considering their relevance to query data. This often results in prototype bias, where prototypes overfit support-specific characteristics and fail to generalize to the query distribution, especially in the presence of distribution shifts, which leads to degraded segmentation performance. To address this issue, we propose a novel Query-aware Hub Prototype (QHP) learning method that explicitly models semantic correlations between support and query sets. Specifically, we propose a Hub Prototype Generation (HPG) module that constructs a bipartite graph connecting query and support points, identifies frequently linked support hubs, and generates query-relevant prototypes that better capture cross-set semantics. To further mitigate the influence of bad hubs and ambiguous prototypes near class boundaries, we introduce a Prototype Distribution Optimization (PDO) module, which employs a purity-reweighted contrastive loss to refine prototype representations by pulling bad hubs and outlier prototypes closer to their corresponding class centers. Extensive experiments on S3DIS and ScanNet demonstrate that QHP achieves substantial performance gains over state-of-the-art methods, effectively narrowing the semantic gap between prototypes and query sets in FS-3DSeg.
๐
2025-12-09
๐ AAAI
๐ฅ Haowen Zheng, Hu Zhu, Lu Deng, Weihao Gu, Yang Yang et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher's stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saeeda Naz, Saddam Hussain Khan
Natural Language Processing
ๆฅ็ๆ่ฆ
A novel deep hybrid Residual-SwinCA-Net segmentation framework is proposed in the study for addressing such challenges by extracting locally correlated and robust features, incorporating residual CNN modules. Furthermore, for learning global dependencies, Swin Transformer blocks are customized using internal residual pathways, which reinforce gradient stability, refine local patterns, and facilitate global feature fusion. Formerly, for enhancing tissue continuity, ultrasound noise suppressions, and accentuating fine structural transitions Laplacian-of-Gaussian regional operator is applied, and for maintaining the morphological integrity of malignant lesion contours, a boundary-oriented operator has been incorporated. Subsequently, a contraction strategy was applied stage-wise by progressively reducing features-map progressively for capturing scale invariance and enhancing the robustness of structural variability. In addition, each decoder level prior augmentation integrates a new Multi-Scale Channel Attention and Squeezing (MSCAS) module. The MSCAS selectively emphasizes encoder salient maps, retains discriminative global context, and complementary local structures with minimal computational cost while suppressing redundant activations. Finally, the Pixel-Attention module encodes class-relevant spatial cues by adaptively weighing malignant lesion pixels while suppressing background interference. The Residual-SwinCA-Net and existing CNNs/ViTs techniques have been implemented on the publicly available BUSI dataset. The proposed Residual-SwinCA-Net framework outperformed and achieved 99.29% mean accuracy, 98.74% IoU, and 0.9041 Dice for breast lesion segmentation. The proposed Residual-SwinCA-Net framework improves the BUSI lesion diagnostic performance and strengthens timely clinical decision-making.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jusheng Zhang, Xiaoyang Guo, Kaitong Cai, Qinhan Lv, Yijia Fan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanpeng Chen, Hui Song, Wei Tao, ShanHui Mo, Shuang Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
The advancement of camera-only Bird's-Eye-View(BEV) perception is currently impeded by a fundamental tension between state-of-the-art performance and on-vehicle deployment tractability. This bottleneck stems from a deep-rooted dependency on computationally prohibitive view transformations and bespoke, platform-specific kernels. This paper introduces FastBEV++, a framework engineered to reconcile this tension, demonstrating that high performance and deployment efficiency can be achieved in unison via two guiding principles: Fast by Algorithm and Deployable by Design. We realize the "Deployable by Design" principle through a novel view transformation paradigm that decomposes the monolithic projection into a standard Index-Gather-Reshape pipeline. Enabled by a deterministic pre-sorting strategy, this transformation is executed entirely with elementary, operator native primitives (e.g Gather, Matrix Multiplication), which eliminates the need for specialized CUDA kernels and ensures fully TensorRT-native portability. Concurrently, our framework is "Fast by Algorithm", leveraging this decomposed structure to seamlessly integrate an end-to-end, depth-aware fusion mechanism. This jointly learned depth modulation, further bolstered by temporal aggregation and robust data augmentation, significantly enhances the geometric fidelity of the BEV representation.Empirical validation on the nuScenes benchmark corroborates the efficacy of our approach. FastBEV++ establishes a new state-of-the-art 0.359 NDS while maintaining exceptional real-time performance, exceeding 134 FPS on automotive-grade hardware (e.g Tesla T4). By offering a solution that is free of custom plugins yet highly accurate, FastBEV++ presents a mature and scalable design philosophy for production autonomous systems. The code is released at: https://github.com/ymlab/advanced-fastbev
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tony Salloom, Dandi Zhou, Xinhai Sun
Robotics
ๆฅ็ๆ่ฆ
Accurate three-dimensional perception is essential for modern industrial robotic systems that perform manipulation, inspection, and navigation tasks. RGB-D and stereo vision sensors are widely used for this purpose, but the depth maps they produce are often noisy, incomplete, or biased due to sensor limitations and environmental conditions. Depth completion methods aim to generate dense, reliable depth maps from RGB images and sparse depth input. However, a key limitation in current depth completion pipelines is the unrealistic generation of sparse depth: sparse pixels are typically selected uniformly at random from dense ground-truth depth, ignoring the fact that real sensors exhibit geometry-dependent and spatially nonuniform reliability. In this work, we propose a normal-guided sparse depth sampling strategy that leverages PCA-based surface normal estimation on the RGB-D point cloud to compute a per-pixel depth reliability measure. The sparse depth samples are then drawn according to this reliability distribution. We integrate this sampling method with the Marigold-DC diffusion-based depth completion model and evaluate it on NYU Depth v2 using the standard metrics. Experiments show that our geometry-aware sparse depth improves accuracy, reduces artifacts near edges and discontinuities, and produces more realistic training conditions that better reflect real sensor behavior.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Eimran Hossain Eimon, Ashan Perera, Juan Merlos, Velibor Adzic, Hari Kalva
Machine Learning
ๆฅ็ๆ่ฆ
Modern video codecs have been extensively optimized to preserve perceptual quality, leveraging models of the human visual system. However, in split inference systems-where intermediate features from neural network are transmitted instead of pixel data-these assumptions no longer apply. Intermediate features are abstract, sparse, and task-specific, making perceptual fidelity irrelevant. In this paper, we investigate the use of Versatile Video Coding (VVC) for compressing such features under the MPEG-AI Feature Coding for Machines (FCM) standard. We perform a tool-level analysis to understand the impact of individual coding components on compression efficiency and downstream vision task accuracy. Based on these insights, we propose three lightweight essential VVC profiles-Fast, Faster, and Fastest. The Fast profile provides 2.96% BD-Rate gain while reducing encoding time by 21.8%. Faster achieves a 1.85% BD-Rate gain with a 51.5% speedup. Fastest reduces encoding time by 95.6% with only a 1.71% loss in BD-Rate.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ching-Hung Cheng, Hsiu-Fu Wu, Bing-Chen Wu, Khanh-Phong Bui, Van-Tin Luu et al.
Computer VisionMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
With the rise of Large Language Models (LLMs) such as GPT-3, these models exhibit strong generalization capabilities. Through transfer learning techniques such as fine-tuning and prompt tuning, they can be adapted to various downstream tasks with minimal parameter adjustments. This approach is particularly common in the field of Natural Language Processing (NLP). This paper aims to explore the effectiveness of common prompt tuning methods in 3D object detection. We investigate whether a model trained on the large-scale Waymo dataset can serve as a foundation model and adapt to other scenarios within the 3D object detection field. This paper sequentially examines the impact of prompt tokens and prompt generators, and further proposes a Scene-Oriented Prompt Pool (\textbf{SOP$^2$}). We demonstrate the effectiveness of prompt pools in 3D object detection, with the goal of inspiring future researchers to delve deeper into the potential of prompts in the 3D field.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram
Natural Language Processing
ๆฅ็ๆ่ฆ
While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Guangzhi Xiong, Zhenghao He, Bohan Liu, Sanchit Sinha, Aidong Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ferdinand Kapl, Emmanouil Angelis, Tobias Hรถppe, Kaitlin Maile, Johannes von Oswald et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half - also known as the Curse of Depth (Sun et al., 2025, Csordรกs et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yifan Lyu, Liang Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Understanding human personality is crucial for web applications such as personalized recommendation and mental health assessment. Existing studies on personality detection predominantly adopt a "posts -> user vector -> labels" modeling paradigm, which encodes social media posts into user representations for predicting personality labels (e.g., MBTI labels). While recent advances in large language models (LLMs) have improved text encoding capacities, these approaches remain constrained by limited supervision signals due to label scarcity, and under-specified semantic mappings between user language and abstract psychological constructs. We address these challenges by proposing ROME, a novel framework that explicitly injects psychological knowledge into personality detection. Inspired by standardized self-assessment tests, ROME leverages LLMs' role-play capability to simulate user responses to validated psychometric questionnaires. These generated question-level answers transform free-form user posts into interpretable, questionnaire-grounded evidence linking linguistic cues to personality labels, thereby providing rich intermediate supervision to mitigate label scarcity while offering a semantic reasoning chain that guides and simplifies the text-to-personality mapping learning. A question-conditioned Mixture-of-Experts module then jointly routes over post and question representations, learning to answer questionnaire items under explicit supervision. The predicted answers are summarized into an interpretable answer vector and fused with the user representation for final prediction within a multi-task learning framework, where question answering serves as a powerful auxiliary task for personality detection. Extensive experiments on two real-world datasets demonstrate that ROME consistently outperforms state-of-the-art baselines, achieving improvements (15.41% on Kaggle dataset).
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ David Samuel, Lilja รvrelid, Erik Velldal, Andrey Kutuzov
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference-optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmรฅl and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Samuel Ebimobowei Johnny, Blessed Guda, Emmanuel Enejo Aaron, Assane Gueye
ๆฅ็ๆ่ฆ
Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88\% accuracy and 60.00\% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at https://github.com/EbimoJohnny/Pose-Based-Sign-Language-Spotting
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ekhi Azurmendi, Xabier Arregi, Oier Lopez de Lacalle
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion specific scores covering correctness, richness, coherence, cohesion, and task alignment enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bohao Yang, Rui Yang, Joshua M. Biro, Haoyuan Wang, Jessica L. Handley et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Clinical communication is central to patient outcomes, yet large-scale human annotation of patient-provider conversation remains labor-intensive, inconsistent, and difficult to scale. Existing approaches based on large language models typically rely on single-task models that lack adaptability, interpretability, and reliability, especially when applied across various communication frameworks and clinical domains. In this study, we developed a Multi-framework Structured Agentic AI system for Clinical Communication (MOSAIC), built on a LangGraph-based architecture that orchestrates four core agents, including a Plan Agent for codebook selection and workflow planning, an Update Agent for maintaining up-to-date retrieval databases, a set of Annotation Agents that applies codebook-guided retrieval-augmented generation (RAG) with dynamic few-shot prompting, and a Verification Agent that provides consistency checks and feedback. To evaluate performance, we compared MOSAIC outputs against gold-standard annotations created by trained human coders. We developed and evaluated MOSAIC using 26 gold standard annotated transcripts for training and 50 transcripts for testing, spanning rheumatology and OB/GYN domains. On the test set, MOSAIC achieved an overall F1 score of 0.928. Performance was highest in the Rheumatology subset (F1 = 0.962) and strongest for Patient Behavior (e.g., patients asking questions, expressing preferences, or showing assertiveness). Ablations revealed that MOSAIC outperforms baseline benchmarking.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Maximilian Kreutner, Jens Rupprecht, Georg Ahnert, Ahmed Salem, Markus Strohmaier
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce QSTN, an open-source Python framework for systematically generating responses from questionnaire-style prompts to support in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation ($>40 $ million survey responses) shows that question structure and response generation methods have a significant impact on the alignment of generated survey responses with human answers, and can be obtained for a fraction of the compute cost. In addition, we offer a no-code user interface that allows researchers to set up robust experiments with LLMs without coding knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lifeng Han, Paul Rayson, Suzan Verberne, Andrew Moore, Goran Nenadic
Natural Language Processing
ๆฅ็ๆ่ฆ
This proposed tutorial focuses on Healthcare Domain Applications of NLP, what we have achieved around HealthcareNLP, and the challenges that lie ahead for the future. Existing reviews in this domain either overlook some important tasks, such as synthetic data generation for addressing privacy concerns, or explainable clinical NLP for improved integration and implementation, or fail to mention important methodologies, including retrieval augmented generation and the neural symbolic integration of LLMs and KGs. In light of this, the goal of this tutorial is to provide an introductory overview of the most important sub-areas of a patient- and resource-oriented HealthcareNLP, with three layers of hierarchy: data/resource layer: annotation guidelines, ethical approvals, governance, synthetic data; NLP-Eval layer: NLP tasks such as NER, RE, sentiment analysis, and linking/coding with categorised methods, leading to explainable HealthAI; patients layer: Patient Public Involvement and Engagement (PPIE), health literacy, translation, simplification, and summarisation (also NLP tasks), and shared decision-making support. A hands-on session will be included in the tutorial for the audience to use HealthcareNLP applications. The target audience includes NLP practitioners in the healthcare application domain, NLP researchers who are interested in domain applications, healthcare researchers, and students from NLP fields. The type of tutorial is "Introductory to CL/NLP topics (HealthcareNLP)" and the audience does not need prior knowledge to attend this. Tutorial materials: https://github.com/4dpicture/HealthNLP
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Indrajit Kar, Kalathur Chenchu Kishore Kumar
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64*64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas, Muhammad Rafsan Kabir, Robin Krambroeckers et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ju-Young Kim, Ji-Hong Park, Se-Yeon Lee, Sujin Park, Gun-Woo Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent incidents in certain online games and communities, where anonymity is guaranteed, show that unchecked inappropriate remarks frequently escalate into verbal abuse and even criminal behavior, raising significant social concerns. Consequently, there is a growing need for research on techniques that can detect inappropriate utterances within conversational texts to help build a safer communication environment. Although large-scale language models trained on Korean corpora and chain-of-thought reasoning have recently gained attention, research applying these approaches to inappropriate utterance detection remains limited. In this study, we propose a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process, thereby promoting rational decision-making and preventing errors that may arise during reasoning. We fine-tune a Korean large language model using the proposed method and conduct both quantitative performance comparisons and qualitative evaluations across different training strategies. Experimental results show that the Kanana-1.5 model achieves an average accuracy of 87.0046, improving by approximately 3.89 percent over standard supervised learning. These findings indicate that the proposed method goes beyond simple knowledge imitation by large language models and enables more precise and consistent judgments through constrained reasoning perspectives, demonstrating its effectiveness for inappropriate utterance detection.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Janiรงa Hackenbuchner, Arda Tezcan, Joke Daems
Natural Language Processing
ๆฅ็ๆ่ฆ
Interpretability can be implemented as a means to understand decisions taken by (black box) models, such as machine translation (MT) or large language models (LLMs). Yet, research in this area has been limited in relation to a manifested problem in these models: gender bias. With this research, we aim to move away from simply measuring bias to exploring its origins. Working with gender-ambiguous natural source data, this study examines which context, in the form of input tokens in the source sentence, influences (or triggers) the translation model choice of a certain gender inflection in the target language. To analyse this, we use contrastive explanations and compute saliency attribution. We first address the challenge of a lacking scoring threshold and specifically examine different attribution levels of source words on the model gender decisions in the translation. We compare salient source words with human perceptions of gender and demonstrate a noticeable overlap between human perceptions and model attribution. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding model translation decisions in terms of gender, how this compares to human decisions and that this information should be leveraged to mitigate gender bias.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sjoerd B. Stolwijk, Mark Boukes, Damian Trilling
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper investigates bias in GLLM annotations by conceptually replicating manual annotations of Boukes (2024). Using various GLLMs (Llama3.1:8b, Llama3.3:70b, GPT4o, Qwen2.5:72b) in combination with five different prompts for five concepts (political content, interactivity, rationality, incivility, and ideology). We find GLLMs perform adequate in terms of F1 scores, but differ from manual annotations in terms of prevalence, yield substantively different downstream results, and display systematic bias in that they overlap more with each other than with manual annotations. Differences in F1 scores fail to account for the degree of bias.
๐
2025-12-09
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiin Park, Hyuna Jeon, Yoonseo Lee, Jisu Hong, Misuk Kim
ๆฅ็ๆ่ฆ
Ontology-based knowledge graph (KG) construction is a core technology that enables multidimensional understanding and advanced reasoning over domain knowledge. Industrial standards, in particular, contain extensive technical information and complex rules presented in highly structured formats that combine tables, scopes of application, constraints, exceptions, and numerical calculations, making KG construction especially challenging. In this study, we propose a method that organizes such documents into a hierarchical semantic structure, decomposes sentences and tables into atomic propositions derived from conditional and numerical rules, and integrates them into an ontology-knowledge graph through LLM-based triple extraction. Our approach captures both the hierarchical and logical structures of documents, effectively representing domain-specific semantics that conventional methods fail to reflect. To verify its effectiveness, we constructed rule, table, and multi-hop QA datasets, as well as a toxic clause detection dataset, from industrial standards, and implemented an ontology-aware KG-RAG framework for comparative evaluation. Experimental results show that our method achieves significant performance improvements across all QA types compared to existing KG-RAG approaches. This study demonstrates that reliable and scalable knowledge representation is feasible even for industrial documents with intertwined conditions, constraints, and scopes, contributing to future domain-specific RAG development and intelligent document management.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Benedikt Mangold
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Workplace toxicity is widely recognized as detrimental to organizational culture, yet quantifying its direct impact on operational efficiency remains methodologically challenging due to the ethical and practical difficulties of reproducing conflict in human subjects. This study leverages Large Language Model (LLM) based Multi-Agent Systems to simulate 1-on-1 adversarial debates, creating a controlled "sociological sandbox". We employ a Monte Carlo method to simulate hundrets of discussions, measuring the convergence time (defined as the number of arguments required to reach a conclusion) between a baseline control group and treatment groups involving agents with "toxic" system prompts. Our results demonstrate a statistically significant increase of approximately 25\% in the duration of conversations involving toxic participants. We propose that this "latency of toxicity" serves as a proxy for financial damage in corporate and academic settings. Furthermore, we demonstrate that agent-based modeling provides a reproducible, ethical alternative to human-subject research for measuring the mechanics of social friction.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jaisal Patel, Yunzhe Chen, Kaiwen He, Keyi Wang, David Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiwoo Park, Ruoqi Liu, Avani Jagdale, Andrew Srisuwananukorn, Jing Zhao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sampriti Soor, Suklav Ghosh, Arijit Sur
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. Previous works usually find such suffixes with gradient search or rule-based methods, but these are brittle and often tied to a single task or model. In this paper, a reinforcement learning framework is used where the suffix is treated as a policy and trained with Proximal Policy Optimization against a frozen model as a reward oracle. Rewards are shaped using calibrated cross-entropy, removing label bias and aggregating across surface forms to improve transferability. The proposed method is evaluated on five diverse NLP benchmark datasets, covering sentiment, natural language inference, paraphrase, and commonsense reasoning, using three distinct language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than previous adversarial triggers of similar genres.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sampriti Soor, Suklav Ghosh, Arijit Sur
Natural Language Processing
ๆฅ็ๆ่ฆ
Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Jessica Yin, Haozhi Qi, Youngsun Wi, Sayantan Kundu, Mike Lambeta et al.
Robotics
ๆฅ็ๆ่ฆ
Human video demonstrations provide abundant training data for learning robot policies, but video alone cannot capture the rich contact signals critical for mastering manipulation. We introduce OSMO, an open-source wearable tactile glove designed for human-to-robot skill transfer. The glove features 12 three-axis tactile sensors across the fingertips and palm and is designed to be compatible with state-of-the-art hand-tracking methods for in-the-wild data collection. We demonstrate that a robot policy trained exclusively on human demonstrations collected with OSMO, without any real robot data, is capable of executing a challenging contact-rich manipulation task. By equipping both the human and the robot with the same glove, OSMO minimizes the visual and tactile embodiment gap, enabling the transfer of continuous shear and normal force feedback while avoiding the need for image inpainting or other vision-based force inference. On a real-world wiping task requiring sustained contact pressure, our tactile-aware policy achieves a 72% success rate, outperforming vision-only baselines by eliminating contact-related failure modes. We release complete hardware designs, firmware, and assembly instructions to support community adoption.
๐
2025-12-09
๐ NeurIPS
๐ฅ Gang Liu, Sobin Alosious, Subhamoy Mahajan, Eric Inae, Yihan Zhu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Machine learning (ML) offers a powerful path toward discovering sustainable polymer materials, but progress has been limited by the lack of large, high-quality, and openly accessible polymer datasets. The Open Polymer Challenge (OPC) addresses this gap by releasing the first community-developed benchmark for polymer informatics, featuring a dataset with 10K polymers and 5 properties: thermal conductivity, radius of gyration, density, fractional free volume, and glass transition temperature. The challenge centers on multi-task polymer property prediction, a core step in virtual screening pipelines for materials discovery. Participants developed models under realistic constraints that include small data, label imbalance, and heterogeneous simulation sources, using techniques such as feature-based augmentation, transfer learning, self-supervised pretraining, and targeted ensemble strategies. The competition also revealed important lessons about data preparation, distribution shifts, and cross-group simulation consistency, informing best practices for future large-scale polymer datasets. The resulting models, analysis, and released data create a new foundation for molecular AI in polymer science and are expected to accelerate the development of sustainable and energy-efficient materials. Along with the competition, we release the test dataset at https://www.kaggle.com/datasets/alexliu99/neurips-open-polymer-prediction-2025-test-data. We also release the data generation pipeline at https://github.com/sobinalosious/ADEPT, which simulates more than 25 properties, including thermal conductivity, radius of gyration, and density.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Suina Tanweer, Firas A. Khasawneh
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Kernel density estimation is a key component of a wide variety of algorithms in machine learning, Bayesian inference, stochastic dynamics and signal processing. However, the unsupervised density estimation technique requires tuning a crucial hyperparameter: the kernel bandwidth. The choice of bandwidth is critical as it controls the bias-variance trade-off by over- or under-smoothing the topological features. Topological data analysis provides methods to mathematically quantify topological characteristics, such as connected components, loops, voids et cetera, even in high dimensions where visualization of density estimates is impossible. In this paper, we propose an unsupervised learning approach using a topology-based loss function for the automated and unsupervised selection of the optimal bandwidth and benchmark it against classical techniques -- demonstrating its potential across different dimensions.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram
Natural Language Processing
ๆฅ็ๆ่ฆ
While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.
๐
2025-12-09
๐ Accepted at 41st ACM/SIGAPP Symposium On Applied C
๐ฅ Ana Rita Paupรฉrio, Diogo Risca, Afonso Lourenรงo, Goreti Marreiros, Ricardo Martins
ๆฅ็ๆ่ฆ
Industrial maintenance is being transformed by the Internet of Things and edge computing, generating continuous data streams that demand real-time, adaptive decision-making under limited computational resources. While data stream mining (DSM) addresses this challenge, most methods assume fully supervised settings, yet in practice, ground-truth labels are often delayed or unavailable. This paper presents a collaborative DSM framework that integrates unsupervised anomaly detection with interactive, human-in-the-loop learning to support maintenance decisions. We employ an online Isolation Forest and enhance interpretability using incremental Partial Dependence Plots and a feature importance score, derived from deviations of Individual Conditional Expectation curves from a fading average, enabling users to dynamically reassess feature relevance and adjust anomaly thresholds. We describe the real-time implementation and provide initial results for fault detection in a Jacquard loom unit. Ongoing work targets continuous monitoring to predict and explain imminent bearing failures.
๐
2025-12-09
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Mohamed Elmahallawy, Asma Jodeiri Akbarfam
ๆฅ็ๆ่ฆ
The rise of space AI is reshaping government and industry through applications such as disaster detection, border surveillance, and climate monitoring, powered by massive data from commercial and governmental low Earth orbit (LEO) satellites. Federated satellite learning (FSL) enables joint model training without sharing raw data, but suffers from slow convergence due to intermittent connectivity and introduces critical trust challenges--where biased or falsified updates can arise across satellite constellations, including those injected through cyberattacks on inter-satellite or satellite-ground communication links. We propose OrbitChain, a blockchain-backed framework that empowers trustworthy multi-vendor collaboration in LEO networks. OrbitChain (i) offloads consensus to high-altitude platforms (HAPs) with greater computational capacity, (ii) ensures transparent, auditable provenance of model updates from different orbits owned by different vendors, and (iii) prevents manipulated or incomplete contributions from affecting global FSL model aggregation. Extensive simulations show that OrbitChain reduces computational and communication overhead while improving privacy, security, and global model accuracy. Its permissioned proof-of-authority ledger finalizes over 1000 blocks with sub-second latency (0.16,s, 0.26,s, 0.35,s for 1-of-5, 3-of-5, and 5-of-5 quorums). Moreover, OrbitChain reduces convergence time by up to 30 hours on real satellite datasets compared to single-vendor, demonstrating its effectiveness for real-time, multi-vendor learning. Our code is available at https://github.com/wsu-cyber-security-lab-ai/OrbitChain.git
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammad Abu-Shaira, Ajita Rattani, Weishi Shi
ๆฅ็ๆ่ฆ
Real-world datasets often exhibit temporal dynamics characterized by evolving data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model's predictive accuracy. Furthermore, the presence of hyperparameters in online models exacerbates this issue. These parameters are typically fixed and cannot be dynamically adjusted by the user in response to the evolving data distribution. Gaussian Process (GP) models offer powerful non-parametric regression capabilities with uncertainty quantification, making them ideal for modeling complex data relationships in an online setting. However, conventional online GP methods face several critical limitations, including a lack of drift-awareness, reliance on fixed hyperparameters, vulnerability to data snooping, absence of a principled decay mechanism, and memory inefficiencies. In response, we propose DAO-GP (Drift-Aware Online Gaussian Process), a novel, fully adaptive, hyperparameter-free, decayed, and sparse non-linear regression model. DAO-GP features a built-in drift detection and adaptation mechanism that dynamically adjusts model behavior based on the severity of drift. Extensive empirical evaluations confirm DAO-GP's robustness across stationary conditions, diverse drift types (abrupt, incremental, gradual), and varied data characteristics. Analyses demonstrate its dynamic adaptation, efficient in-memory and decay-based management, and evolving inducing points. Compared with state-of-the-art parametric and non-parametric models, DAO-GP consistently achieves superior or competitive performance, establishing it as a drift-resilient solution for online non-linear regression.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joshua Ward, Bochao Gu, Chi-Hua Wang, Guang Cheng
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have recently demonstrated remarkable performance in generating high-quality tabular synthetic data. In practice, two primary approaches have emerged for adapting LLMs to tabular data generation: (i) fine-tuning smaller models directly on tabular datasets, and (ii) prompting larger models with examples provided in context. In this work, we show that popular implementations from both regimes exhibit a tendency to compromise privacy by reproducing memorized patterns of numeric digits from their training data. To systematically analyze this risk, we introduce a simple No-box Membership Inference Attack (MIA) called LevAtt that assumes adversarial access to only the generated synthetic data and targets the string sequences of numeric digits in synthetic observations. Using this approach, our attack exposes substantial privacy leakage across a wide range of models and datasets, and in some cases, is even a perfect membership classifier on state-of-the-art models. Our findings highlight a unique privacy vulnerability of LLM-based synthetic data generation and the need for effective defenses. To this end, we propose two methods, including a novel sampling strategy that strategically perturbs digits during generation. Our evaluation demonstrates that this approach can defeat these attacks with minimal loss of fidelity and utility of the synthetic data.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiang Chen, Yuling Shi, Qizhen Lan, Yuchao Qiu, Xiaodong Gu
Machine Learning
ๆฅ็ๆ่ฆ
LLM agents are widely deployed in complex interactive tasks, yet privacy constraints often preclude centralized optimization and co-evolution across dynamic environments. While Federated Learning (FL) has proven effective on static datasets, its extension to the open-ended self-evolution of agents remains underexplored. Directly applying standard FL is challenging: heterogeneous tasks and sparse, trajectory-level rewards introduce severe gradient conflicts, destabilizing the global optimization process. To bridge this gap, we propose Fed-SE, a Federated Self-Evolution framework for LLM agents. Fed-SE establishes a local evolution-global aggregation paradigm. Locally, agents employ parameter-efficient fine-tuning on filtered, high-return trajectories to achieve stable gradient updates. Globally, Fed-SE aggregates updates within a low-rank subspace that disentangles environment-specific dynamics, effectively reducing negative transfer across clients. Experiments across five heterogeneous environments demonstrate that Fed-SE improves average task success rates by approximately 18% over federated baselines, validating its effectiveness in robust cross-environment knowledge transfer in privacy-constrained deployments.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anantaa Kotal, Anupam Joshi
Robotics
ๆฅ็ๆ่ฆ
The widespread use of big data across sectors has raised major privacy concerns, especially when sensitive information is shared or analyzed. Regulations such as GDPR and HIPAA impose strict controls on data handling, making it difficult to balance the need for insights with privacy requirements. Synthetic data offers a promising solution by creating artificial datasets that reflect real patterns without exposing sensitive information. However, traditional synthetic data methods often fail to capture complex, implicit rules that link different elements of the data and are essential in domains like healthcare. They may reproduce explicit patterns but overlook domain-specific constraints that are not directly stated yet crucial for realism and utility. For example, prescription guidelines that restrict certain medications for specific conditions or prevent harmful drug interactions may not appear explicitly in the original data. Synthetic data generated without these implicit rules can lead to medically inappropriate or unrealistic profiles. To address this gap, we propose ContextGAN, a Context-Aware Differentially Private Generative Adversarial Network that integrates domain-specific rules through a constraint matrix encoding both explicit and implicit knowledge. The constraint-aware discriminator evaluates synthetic data against these rules to ensure adherence to domain constraints, while differential privacy protects sensitive details from the original data. We validate ContextGAN across healthcare, security, and finance, showing that it produces high-quality synthetic data that respects domain rules and preserves privacy. Our results demonstrate that ContextGAN improves realism and utility by enforcing domain constraints, making it suitable for applications that require compliance with both explicit patterns and implicit rules under strict privacy guarantees.
๐
2025-12-09
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Mohamed Elmahallawy, Sanjay Madria, Samuel Frimpong
Machine Learning
ๆฅ็ๆ่ฆ
Underground mining operations depend on sensor networks to monitor critical parameters such as temperature, gas concentration, and miner movement, enabling timely hazard detection and safety decisions. However, transmitting raw sensor data to a centralized server for machine learning (ML) model training raises serious privacy and security concerns. Federated Learning (FL) offers a promising alternative by enabling decentralized model training without exposing sensitive local data. Yet, applying FL in underground mining presents unique challenges: (i) Adversaries may eavesdrop on shared model updates to launch model inversion or membership inference attacks, compromising data privacy and operational safety; (ii) Non-IID data distributions across mines and sensor noise can hinder model convergence. To address these issues, we propose FedMining--a privacy-preserving FL framework tailored for underground mining. FedMining introduces two core innovations: (1) a Decentralized Functional Encryption (DFE) scheme that keeps local models encrypted, thwarting unauthorized access and inference attacks; and (2) a balancing aggregation mechanism to mitigate data heterogeneity and enhance convergence. Evaluations on real-world mining datasets demonstrate FedMining's ability to safeguard privacy while maintaining high model accuracy and achieving rapid convergence with reduced communication and computation overhead. These advantages make FedMining both secure and practical for real-time underground safety monitoring.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lars Ole Hรคusler, Lena Uhlenberg, Gรถran Kรถber, Diyora Salimova, Oliver Amft
Machine Learning
ๆฅ็ๆ่ฆ
We propose a text-to-IMU (inertial measurement unit) motion-synthesis framework to obtain realistic IMU data by fine-tuning a pretrained diffusion model with an acceleration-based second-order loss (L_acc). L_acc enforces consistency in the discrete second-order temporal differences of the generated motion, thereby aligning the diffusion prior with IMU-specific acceleration patterns. We integrate L_acc into the training objective of an existing diffusion model, finetune the model to obtain an IMU-specific motion prior, and evaluate the model with an existing text-to-IMU framework that comprises surface modelling and virtual sensor simulation. We analysed acceleration signal fidelity and differences between synthetic motion representation and actual IMU recordings. As a downstream application, we evaluated Human Activity Recognition (HAR) and compared the classification performance using data of our method with the earlier diffusion model and two additional diffusion model baselines. When we augmented the earlier diffusion model objective with L_acc and continued training, L_acc decreased by 12.7% relative to the original model. The improvements were considerably larger in high-dynamic activities (i.e., running, jumping) compared to low-dynamic activities~(i.e., sitting, standing). In a low-dimensional embedding, the synthetic IMU data produced by our refined model shifts closer to the distribution of real IMU recordings. HAR classification trained exclusively on our refined synthetic IMU data improved performance by 8.7% compared to the earlier diffusion model and by 7.6% over the best-performing comparison diffusion model. We conclude that acceleration-aware diffusion refinement provides an effective approach to align motion generation and IMU synthesis and highlights how flexible deep learning pipelines are for specialising generic text-to-motion priors to sensor-specific tasks.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lex Weaver, Jonathan Baxter
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
TD($ฮป$) with function approximation has proved empirically successful for some complex reinforcement learning problems. For linear approximation, TD($ฮป$) has been shown to minimise the squared error between the approximate value of each state and the true value. However, as far as policy is concerned, it is error in the relative ordering of states that is critical, rather than error in the state values. We illustrate this point, both in simple two-state and three-state systems in which TD($ฮป$)--starting from an optimal policy--converges to a sub-optimal policy, and also in backgammon. We then present a modified form of TD($ฮป$), called STD($ฮป$), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD($ฮป$) in the context of the two-state system, is presented, along with a comparison with Bertsekas' differential training method [1]. This is followed by successful demonstrations of STD($ฮป$) on the two-state system and a variation on the well known acrobot problem.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jack Brady, Bernhard Schรถlkopf, Thomas Kipf, Simon Buchholz, Wieland Brendel
ๆฅ็ๆ่ฆ
It has been hypothesized that human-level visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today's most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using regularization or architectural constraints. In contrast, for generative methods, the inductive biases can be enforced straightforwardly, thereby enabling compositional generalization by constraining a decoder and inverting it. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay. We examine the empirical implications of our theory by training a range of generative and non-generative methods on photorealistic image datasets. We find that, without the necessary inductive biases, non-generative methods often fail to generalize compositionally and require large-scale pretraining or added supervision to improve generalization. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huzaifa Arif, Pin-Yu Chen, Alex Gittens, James Diffenderfer, Bhavya Kailkhura
Machine Learning
ๆฅ็ๆ่ฆ
With the increasing reliance on AI models for weather forecasting, it is imperative to evaluate their vulnerability to adversarial perturbations. This work introduces Weather Adaptive Adversarial Perturbation Optimization (WAAPO), a novel framework for generating targeted adversarial perturbations that are both effective in manipulating forecasts and stealthy to avoid detection. WAAPO achieves this by incorporating constraints for channel sparsity, spatial localization, and smoothness, ensuring that perturbations remain physically realistic and imperceptible. Using the ERA5 dataset and FourCastNet (Pathak et al. 2022), we demonstrate WAAPO's ability to generate adversarial trajectories that align closely with predefined targets, even under constrained conditions. Our experiments highlight critical vulnerabilities in AI-driven forecasting models, where small perturbations to initial conditions can result in significant deviations in predicted weather patterns. These findings underscore the need for robust safeguards to protect against adversarial exploitation in operational forecasting systems.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ferdinand Kapl, Emmanouil Angelis, Tobias Hรถppe, Kaitlin Maile, Johannes von Oswald et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half - also known as the Curse of Depth (Sun et al., 2025, Csordรกs et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
๐
2025-12-09
๐ ArXiv Preprint (cs.SE)
๐ฅ Viola Campos, Robin Kuschnereit, Adrian Ulges
Natural Language Processing
ๆฅ็ๆ่ฆ
As AI-based code generation becomes widespread, researchers are investigating the calibration of code LLMs - ensuring their confidence scores faithfully represent the true likelihood of code correctness. To do so, we investigate multicalibration, which can capture additional factors about a coding problem, such as complexity, code length, or programming language used. We study four multicalibration approaches on three function synthesis benchmarks, using latest-generation code LLMs (Qwen3 Coder, GPT-OSS, DeepSeek-R1-Distill). Our results demonstrate that multicalibration can yield distinct improvements over both uncalibrated token likelihoods (+1.03 in skill score) and baseline calibrations (+0.37 in skill score). We study the influence of the aforementioned factors in ablations, and make our dataset (consisting of code generations, likelihoods, and correctness labels) available for future research on code LLM calibration.
๐
2025-12-09
๐ Accepted at IEEE INFOCOM 2026 (full version)
๐ฅ Yi Liu, Weixiang Han, Chengjun Cai, Xingliang Yuan, Cong Wang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
With the rise of large language models, service providers offer language models as a service, enabling users to fine-tune customized models via uploaded private datasets. However, this raises concerns about sensitive data leakage. Prior methods, relying on differential privacy within device-cloud collaboration frameworks, struggle to balance privacy and utility, exposing users to inference attacks or degrading fine-tuning performance. To address this, we propose PrivTune, an efficient and privacy-preserving fine-tuning framework via Split Learning (SL). The key idea of PrivTune is to inject crafted noise into token representations from the SL bottom model, making each token resemble the $n$-hop indirect neighbors. PrivTune formulates this as an optimization problem to compute the optimal noise vector, aligning with defense-utility goals. On this basis, it then adjusts the parameters (i.e., mean) of the $d_ฯ$-Privacy noise distribution to align with the optimization direction and scales the noise according to token importance to minimize distortion. Experiments on five datasets (covering both classification and generation tasks) against three embedding inversion and three attribute inference attacks show that, using RoBERTa on the Stanford Sentiment Treebank dataset, PrivTune reduces the attack success rate to 10% with only a 3.33% drop in utility performance, outperforming state-of-the-art baselines.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Thรฉo Verhelst, Gianluca Bontempi
Robotics
ๆฅ็ๆ่ฆ
Uplift modeling estimates the causal effect of an intervention as the difference between potential outcomes under treatment and control, whereas counterfactual identification aims to recover the joint distribution of these potential outcomes (e.g., "Would this customer still have churned had we given them a marketing offer?"). This joint counterfactual distribution provides richer information than the uplift but is harder to estimate. However, the two approaches are synergistic: uplift models can be leveraged for counterfactual estimation. We propose a counterfactual estimator that fits a bivariate beta distribution to predicted uplift scores, yielding posterior distributions over counterfactual outcomes. Our approach requires no causal assumptions beyond those of uplift modeling. Simulations show the efficacy of the approach, which can be applied, for example, to the problem of customer churn in telecom, where it reveals insights unavailable to standard ML or uplift models alone.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jeongwhan Choi, Woosung Kang, Minseo Kim, Jongwoo Kim, Noseong Park
Natural Language Processing
ๆฅ็ๆ่ฆ
Foundation models pretrained on large data have demonstrated remarkable zero-shot generalization capabilities across domains. Building on the success of TabPFN for tabular data and its recent extension to time series, we investigate whether graph node classification can be effectively reformulated as a tabular learning problem. We introduce TabPFN-GN, which transforms graph data into tabular features by extracting node attributes, structural properties, positional encodings, and optionally smoothed neighborhood features. This enables TabPFN to perform direct node classification without any graph-specific training or language model dependencies. Our experiments on 12 benchmark datasets reveal that TabPFN-GN achieves competitive performance with GNNs on homophilous graphs and consistently outperforms them on heterophilous graphs. These results demonstrate that principled feature engineering can bridge the gap between tabular and graph domains, providing a practical alternative to task-specific GNN training and LLM-dependent graph foundation models.
๐
2025-12-09
๐ NeurIPS 2025
๐ฅ Hamsini Ramanathan, Roman Bushuiev, Matouลก Soldรกt, Jirรญ Kohout, Tรฉo Hebra et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Terpene synthases (TPS) are a key family of enzymes responsible for generating the diverse terpene scaffolds that underpin many natural products, including front-line anticancer drugs such as Taxol. However, de novo TPS design through directed evolution is costly and slow. We introduce TpsGPT, a generative model for scalable TPS protein design, built by fine-tuning the protein language model ProtGPT2 on 79k TPS sequences mined from UniProt. TpsGPT generated de novo enzyme candidates in silico and we evaluated them using multiple validation metrics, including EnzymeExplorer classification, ESMFold structural confidence (pLDDT), sequence diversity, CLEAN classification, InterPro domain detection, and Foldseek structure alignment. From an initial pool of 28k generated sequences, we identified seven putative TPS enzymes that satisfied all validation criteria. Experimental validation confirmed TPS enzymatic activity in at least two of these sequences. Our results show that fine-tuning of a protein language model on a carefully curated, enzyme-class-specific dataset, combined with rigorous filtering, can enable the de novo generation of functional, evolutionarily distant enzymes.
๐
2025-12-09
๐ KDD 2026
๐ฅ Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Hewei Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Early graph prompt tuning approaches relied on task-specific designs for Graph Neural Networks (GNNs), limiting their adaptability across diverse pre-training strategies. In contrast, another promising line of research has investigated universal graph prompt tuning, which operates directly in the input graph's feature space and builds a theoretical foundation that universal graph prompt tuning can theoretically achieve an equivalent effect of any prompting function, eliminating dependence on specific pre-training strategies. Recent works propose selective node-based graph prompt tuning to pursue more ideal prompts. However, we argue that selective node-based graph prompt tuning inevitably compromises the theoretical foundation of universal graph prompt tuning. In this paper, we strengthen the theoretical foundation of universal graph prompt tuning by introducing stricter constraints, demonstrating that adding prompts to all nodes is a necessary condition for achieving the universality of graph prompts. To this end, we propose a novel model and paradigm, Learning and Editing Universal GrAph Prompt Tuning (LEAP), which preserves the theoretical foundation of universal graph prompt tuning while pursuing more ideal prompts. Specifically, we first build the basic universal graph prompts to preserve the theoretical foundation and then employ actor-critic reinforcement learning to select nodes and edit prompts. Extensive experiments on graph- and node-level tasks across various pre-training strategies in both full-shot and few-shot scenarios show that LEAP consistently outperforms fine-tuning and other prompt-based approaches.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuniko Paxton, Zeinab Dehghani, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Skin color has historically been a focal point of discrimination, yet fairness research in machine learning for medical imaging often relies on coarse subgroup categories, overlooking individual-level variations. Such group-based approaches risk obscuring biases faced by outliers within subgroups. This study introduces a distribution-based framework for evaluating and mitigating individual fairness in skin lesion classification. We treat skin tone as a continuous attribute rather than a categorical label, and employ kernel density estimation (KDE) to model its distribution. We further compare twelve statistical distance metrics to quantify disparities between skin tone distributions and propose a distance-based reweighting (DRW) loss function to correct underrepresentation in minority tones. Experiments across CNN and Transformer models demonstrate: (i) the limitations of categorical reweighting in capturing individual-level disparities, and (ii) the superior performance of distribution-based reweighting, particularly with Fidelity Similarity (FS), Wasserstein Distance (WD), Hellinger Metric (HM), and Harmonic Mean Similarity (HS). These findings establish a robust methodology for advancing fairness at individual level in dermatological AI systems, and highlight broader implications for sensitive continuous attributes in medical image analysis.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Udesh Habaraduwa, Andrei Lixandru
Machine Learning
ๆฅ็ๆ่ฆ
The advancement of human healthspan and bioengineering relies heavily on predicting the behavior of complex biological systems. While high-throughput multiomics data is becoming increasingly abundant, converting this data into actionable predictive models remains a bottleneck. High-capacity, datadriven simulation systems are critical in this landscape; unlike classical mechanistic models restricted by prior knowledge, these architectures can infer latent interactions directly from observational data, allowing for the simulation of temporal trajectories and the anticipation of downstream intervention effects in personalized medicine and synthetic biology. To address this challenge, we introduce Neural Ordinary Differential Equations (NODEs) as a dynamic framework for learning the complex interplay between the proteome and metabolome. We applied this framework to time-series data derived from engineered Escherichia coli strains, modeling the continuous dynamics of metabolic pathways. The proposed NODE architecture demonstrates superior performance in capturing system dynamics compared to traditional machine learning pipelines. Our results show a greater than 90% improvement in root mean squared error over baselines across both Limonene (up to 94.38% improvement) and Isopentenol (up to 97.65% improvement) pathway datasets. Furthermore, the NODE models demonstrated a 1000x acceleration in inference time, establishing them as a scalable, high-fidelity tool for the next generation of metabolic engineering and biological discovery.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Manos Plitsis, Giorgos Bouritsas, Vassilis Katsouros, Yannis Panagakis
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Text-to-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets - either manually constructed or generated with large language models (LLMs) - as part of their training and/or evaluation procedures. Beside the curation cost, this also risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts that aim to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers acting on the TTI's internal representations that steer the decoding process of the LLM toward regions of the prompt space that amplify the image attributes of interest. We conduct extensive experiments on Stable Diffusion 1.5 and a state-of-the-art debiased model and discover an array of subtle and previously undocumented biases that severely deteriorate fairness metrics. Crucially, the discovered prompts are interpretable, i.e they may be entered by a typical user, quantitatively improving the perplexity metric compared to a prominent hard prompt optimization counterpart. Our findings uncover TTI vulnerabilities, while BGPS expands the bias search space and can act as a new evaluation tool for bias mitigation.
๐
2025-12-09
๐ ArXiv Preprint (cs.PF)
๐ฅ Sรฉbastien Piรฉrard, Adrien Deliรจge, Marc Van Droogenbroeck
ๆฅ็ๆ่ฆ
The performance of algorithms, methods, and models tends to depend heavily on the distribution of cases on which they are applied, this distribution being specific to the applicative domain. After performing an evaluation in several domains, it is highly informative to compute a (weighted) mean performance and, as shown in this paper, to scrutinize what happens during this averaging. To achieve this goal, we adopt a probabilistic framework and consider a performance as a probability measure (e.g., a normalized confusion matrix for a classification task). It appears that the corresponding weighted mean is known to be the summarization, and that only some remarkable scores assign to the summarized performance a value equal to a weighted arithmetic mean of the values assigned to the domain-specific performances. These scores include the family of ranking scores, a continuum parameterized by user preferences, and that the weights to consider in the arithmetic mean depend on the user preferences. Based on this, we rigorously define four domains, named easiest, most difficult, preponderant, and bottleneck domains, as functions of user preferences. After establishing the theory in a general setting, regardless of the task, we develop new visual tools for two-class classification.
๐
2025-12-09
๐ ArXiv Preprint (eess.SY)
๐ฅ Jannik Graebner, Ryne Beeson
Machine Learning
ๆฅ็ๆ่ฆ
Preliminary mission design of low-thrust spacecraft trajectories in the Circular Restricted Three-Body Problem is a global search characterized by a complex objective landscape and numerous local minima. Formulating the problem as sampling from an unnormalized distribution supported on neighborhoods of locally optimal solutions, provides the opportunity to deploy Markov chain Monte Carlo methods and generative machine learning. In this work, we extend our previous self-supervised diffusion model fine-tuning framework to employ gradient-informed Markov chain Monte Carlo. We compare two algorithms - the Metropolis-Adjusted Langevin Algorithm and Hamiltonian Monte Carlo - both initialized from a distribution learned by a diffusion model. Derivatives of an objective function that balances fuel consumption, time of flight and constraint violations are computed analytically using state transition matrices. We show that incorporating the gradient drift term accelerates mixing and improves convergence of the Markov chain for a multi-revolution transfer in the Saturn-Titan system. Among the evaluated methods, MALA provides the best trade-off between performance and computational cost. Starting from samples generated by a baseline diffusion model trained on a related transfer, MALA explicitly targets Pareto-optimal solutions. Compared to a random walk Metropolis algorithm, it increases the feasibility rate from 17.34% to 63.01% and produces a denser, more diverse coverage of the Pareto front. By fine-tuning a diffusion model on the generated samples and associated reward values with reward-weighted likelihood maximization, we learn the global solution structure of the problem and eliminate the need for a tedious separate data generation phase.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenglong Duan, Dazhong Wu
Machine Learning
ๆฅ็ๆ่ฆ
Part qualification is crucial in additive manufacturing (AM) because it ensures that additively manufactured parts can be consistently produced and reliably used in critical applications. Part qualification aims at verifying that an additively manufactured part meets performance requirements; therefore, predicting the complex stress-strain behaviors of additively manufactured parts is critical. We develop a dynamic time warping (DTW)-transfer learning (TL) framework for additive manufacturing part qualification by transferring knowledge of the stress-strain behaviors of additively manufactured low-cost polymers to metals. Specifically, the framework employs DTW to select a polymer dataset as the source domain that is the most relevant to the target metal dataset. Using a long short-term memory (LSTM) model, four source polymers (i.e., Nylon, PLA, CF-ABS, and Resin) and three target metals (i.e., AlSi10Mg, Ti6Al4V, and carbon steel) that are fabricated by different AM techniques are utilized to demonstrate the effectiveness of the DTW-TL framework. Experimental results show that the DTW-TL framework identifies the closest match between polymers and metals to select one single polymer dataset as the source domain. The DTW-TL model achieves the lowest mean absolute percentage error of 12.41% and highest coefficient of determination of 0.96 when three metals are used as the target domain, respectively, outperforming the vanilla LSTM model without TL as well as the TL model pre-trained on four polymer datasets as the source domain.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huzaifa Arif
Machine Learning
ๆฅ็ๆ่ฆ
Recent work \cite{arifgroup} introduced Federated Proximal Gradient \textbf{(\texttt{FedProxGrad})} for solving non-convex composite optimization problems in group fair federated learning. However, the original analysis established convergence only to a \textit{noise-dominated neighborhood of stationarity}, with explicit dependence on a variance-induced noise floor. In this work, we provide an improved asymptotic convergence analysis for a generalized \texttt{FedProxGrad}-type analytical framework with inexact local proximal solutions and explicit fairness regularization. We call this extended analytical framework \textbf{DS \texttt{FedProxGrad}} (Decay Step Size \texttt{FedProxGrad}). Under a Robbins-Monro step-size schedule \cite{robbins1951stochastic} and a mild decay condition on local inexactness, we prove that $\liminf_{r\to\infty} \mathbb{E}[\|\nabla F(\mathbf{x}^r)\|^2] = 0$, i.e., the algorithm is asymptotically stationary and the convergence rate does not depend on a variance-induced noise floor.
๐
2025-12-09
๐ ArXiv Preprint (eess.SY)
๐ฅ Josip Kir Hromatko, Shambhuraj Sawant, ล andor Ileลก, Sรฉbastien Gros
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Scaled model experiments are commonly used in various engineering fields to reduce experimentation costs and overcome constraints associated with full-scale systems. The relevance of such experiments relies on dimensional analysis and the principle of dynamic similarity. However, transferring controllers to full-scale systems often requires additional tuning. In this paper, we propose a method to enable a direct controller transfer using dimensionless model predictive control, tuned automatically for closed-loop performance. With this reformulation, the closed-loop behavior of an optimized controller transfers directly to a new, dynamically similar system. Additionally, the dimensionless formulation allows for the use of data from systems of different scales during parameter optimization. We demonstrate the method on a cartpole swing-up and a car racing problem, applying either reinforcement learning or Bayesian optimization for tuning the controller parameters. Software used to obtain the results in this paper is publicly available at https://github.com/josipkh/dimensionless-mpcrl.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bohao Yang, Rui Yang, Joshua M. Biro, Haoyuan Wang, Jessica L. Handley et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Clinical communication is central to patient outcomes, yet large-scale human annotation of patient-provider conversation remains labor-intensive, inconsistent, and difficult to scale. Existing approaches based on large language models typically rely on single-task models that lack adaptability, interpretability, and reliability, especially when applied across various communication frameworks and clinical domains. In this study, we developed a Multi-framework Structured Agentic AI system for Clinical Communication (MOSAIC), built on a LangGraph-based architecture that orchestrates four core agents, including a Plan Agent for codebook selection and workflow planning, an Update Agent for maintaining up-to-date retrieval databases, a set of Annotation Agents that applies codebook-guided retrieval-augmented generation (RAG) with dynamic few-shot prompting, and a Verification Agent that provides consistency checks and feedback. To evaluate performance, we compared MOSAIC outputs against gold-standard annotations created by trained human coders. We developed and evaluated MOSAIC using 26 gold standard annotated transcripts for training and 50 transcripts for testing, spanning rheumatology and OB/GYN domains. On the test set, MOSAIC achieved an overall F1 score of 0.928. Performance was highest in the Rheumatology subset (F1 = 0.962) and strongest for Patient Behavior (e.g., patients asking questions, expressing preferences, or showing assertiveness). Ablations revealed that MOSAIC outperforms baseline benchmarking.
๐
2025-12-09
๐ ArXiv Preprint (cs.SE)
๐ฅ Renato Cordeiro Ferreira, Aditya Dhinavahi, Rowanne Trapmann, Willem-Jan van den Heuvel
Machine Learning
ๆฅ็ๆ่ฆ
ML-Enabled Systems (MLES) are inherently complex since they require multiple components to achieve their business goal. This experience report showcases the software architecture reusability techniques applied while building Ocean Guard, an MLES for anomaly detection in the maritime domain. In particular, it highlights the challenges and lessons learned to reuse the Ports and Adapters pattern to support building multiple microservices from a single codebase. This experience report hopes to inspire software engineers, machine learning engineers, and data scientists to apply the Hexagonal Architecture pattern to build their MLES.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Orit Davidovich, Shimrit Shtern, Segev Wasserkrug, Nimrod Megiddo
Machine Learning
ๆฅ็ๆ่ฆ
Since the 1990s, considerable empirical work has been carried out to train statistical models, such as neural networks (NNs), as learned heuristics for combinatorial optimization (CO) problems. When successful, such an approach eliminates the need for experts to design heuristics per problem type. Due to their structure, many hard CO problems are amenable to treatment through reinforcement learning (RL). Indeed, we find a wealth of literature training NNs using value-based, policy gradient, or actor-critic approaches, with promising results, both in terms of empirical optimality gaps and inference runtimes. Nevertheless, there has been a paucity of theoretical work undergirding the use of RL for CO problems. To this end, we introduce a unified framework to model CO problems through Markov decision processes (MDPs) and solve them using RL techniques. We provide easy-to-test assumptions under which CO problems can be formulated as equivalent undiscounted MDPs that provide optimal solutions to the original CO problems. Moreover, we establish conditions under which value-based RL techniques converge to approximate solutions of the CO problem with a guarantee on the associated optimality gap. Our convergence analysis provides: (1) a sufficient rate of increase in batch size and projected gradient descent steps at each RL iteration; (2) the resulting optimality gap in terms of problem parameters and targeted RL accuracy; and (3) the importance of a choice of state-space embedding. Together, our analysis illuminates the success (and limitations) of the celebrated deep Q-learning algorithm in this problem context.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Charles Rios, Longzhen Han, Almas Baimagambetov, Nikolaos Polatidis
Machine Learning
ๆฅ็ๆ่ฆ
Predicting the outcomes of professional basketball games, particularly in the National Basketball Association (NBA), has become increasingly important for coaching strategy, fan engagement, and sports betting. However, many existing prediction models struggle with concept drift, limited temporal context, and instability across seasons. To advance forecasting in this domain, we introduce a newly constructed longitudinal NBA dataset covering the 2004-05 to 2024-25 seasons and present a deep learning framework designed to model long-term performance trends. Our primary contribution is a Long Short-Term Memory (LSTM) architecture that leverages an extended sequence length of 9,840 games equivalent to eight full NBA seasons to capture evolving team dynamics and season-over-season dependencies. We compare this model against several traditional Machine Learning (ML) and Deep Learning (DL) baselines, including Logistic Regression, Random Forest, Multi-Layer Perceptron (MLP), and Convolutional Neural Network (CNN). The LSTM achieves the best performance across all metrics, with 72.35 accuracy, 73.15 precision and 76.13 AUC-ROC. These results demonstrate the importance of long-sequence temporal modeling in basketball outcome prediction and highlight the value of our new multi-season dataset for developing robust, generalizable NBA forecasting systems.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuna Kato, Shohei Mori, Hideo Saito, Yoshifumi Takatsume, Hiroki Kajita et al.
ๆฅ็ๆ่ฆ
Video recordings of open surgeries are greatly required for education and research purposes. However, capturing unobstructed videos is challenging since surgeons frequently block the camera field of view. To avoid occlusion, the positions and angles of the camera must be frequently adjusted, which is highly labor-intensive. Prior work has addressed this issue by installing multiple cameras on a shadowless lamp and arranging them to fully surround the surgical area. This setup increases the chances of some cameras capturing an unobstructed view. However, manual image alignment is needed in post-processing since camera configurations change every time surgeons move the lamp for optimal lighting. This paper aims to fully automate this alignment task. The proposed method identifies frames in which the lighting system moves, realigns them, and selects the camera with the least occlusion to generate a video that consistently presents the surgical field from a fixed perspective. A user study involving surgeons demonstrated that videos generated by our method were superior to those produced by conventional methods in terms of the ease of confirming the surgical area and the comfort during video viewing. Additionally, our approach showed improvements in video quality over existing techniques. Furthermore, we implemented several synthesis options for the proposed view-synthesis method and conducted a user study to assess surgeons' preferences for each option.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nader Sadek, Mirette Moawad, Christina Naguib, Mariam Elzahaby
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Stock market prediction is a long-standing challenge in finance, as accurate forecasts support informed investment decisions. Traditional models rely mainly on historical prices, but recent work shows that financial news can provide useful external signals. This paper investigates a multimodal approach that integrates companies' news articles with their historical stock data to improve prediction performance. We compare a Graph Neural Network (GNN) model with a baseline LSTM model. Historical data for each company is encoded using an LSTM, while news titles are embedded with a language model. These embeddings form nodes in a heterogeneous graph, and GraphSAGE is used to capture interactions between articles, companies, and industries. We evaluate two targets: a binary direction-of-change label and a significance-based label. Experiments on the US equities and Bloomberg datasets show that the GNN outperforms the LSTM baseline, achieving 53% accuracy on the first target and a 4% precision gain on the second. Results also indicate that companies with more associated news yield higher prediction accuracy. Moreover, headlines contain stronger predictive signals than full articles, suggesting that concise news summaries play an important role in short-term market reactions.
๐
2025-12-09
๐ ArXiv Preprint (econ.EM)
๐ฅ Masahiro Kato
ๆฅ็ๆ่ฆ
We consider an adaptive experiment for treatment choice and design a minimax and Bayes optimal adaptive experiment with respect to regret. Given binary treatments, the experimenter's goal is to choose the treatment with the highest expected outcome through an adaptive experiment, in order to maximize welfare. We consider adaptive experiments that consist of two phases, the treatment allocation phase and the treatment choice phase. The experiment starts with the treatment allocation phase, where the experimenter allocates treatments to experimental subjects to gather observations. During this phase, the experimenter can adaptively update the allocation probabilities using the observations obtained in the experiment. After the allocation phase, the experimenter proceeds to the treatment choice phase, where one of the treatments is selected as the best. For this adaptive experimental procedure, we propose an adaptive experiment that splits the treatment allocation phase into two stages, where we first estimate the standard deviations and then allocate each treatment proportionally to its standard deviation. We show that this experiment, often referred to as Neyman allocation, is minimax and Bayes optimal in the sense that its regret upper bounds exactly match the lower bounds that we derive. To show this optimality, we derive minimax and Bayes lower bounds for the regret using change-of-measure arguments. Then, we evaluate the corresponding upper bounds using the central limit theorem and large deviation bounds.
๐
2025-12-09
๐ ArXiv Preprint (physics.bio-ph)
๐ฅ Gongyi Wang, Yu Zhang, Zihan Huang
Machine Learning
ๆฅ็ๆ่ฆ
Machine learning (ML) has become a versatile tool for analyzing anomalous diffusion trajectories, yet most existing pipelines are trained on large collections of simulated data. In contrast, experimental trajectories, such as those from single-particle tracking (SPT), are typically scarce and may differ substantially from the idealized models used for simulation, leading to degradation or even breakdown of performance when ML methods are applied to real data. To address this mismatch, we introduce a wavelet-based representation of anomalous diffusion that enables data-efficient learning directly from experimental recordings. This representation is constructed by applying six complementary wavelet families to each trajectory and combining the resulting wavelet modulus scalograms. We first evaluate the wavelet representation on simulated trajectories from the andi-datasets benchmark, where it clearly outperforms both feature-based and trajectory-based methods with as few as 1000 training trajectories and still retains an advantage on large training sets. We then use this representation to learn directly from experimental SPT trajectories of fluorescent beads diffusing in F-actin networks, where the wavelet representation remains superior to existing alternatives for both diffusion-exponent regression and mesh-size classification. In particular, when predicting the diffusion exponents of experimental trajectories, a model trained on 1200 experimental tracks using the wavelet representation achieves significantly lower errors than state-of-the-art deep learning models trained purely on $10^6$ simulated trajectories. We associate this data efficiency with the emergence of distinct scale fingerprints disentangling underlying diffusion mechanisms in the wavelet spectra.
๐
2025-12-09
๐ ArXiv Preprint (q-bio.BM)
๐ฅ Gengmo Zhou, Feng Yu, Wenda Wang, Zhifeng Gao, Guolin Ke et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Enzymes are crucial catalysts that enable a wide range of biochemical reactions. Efficiently identifying specific enzymes from vast protein libraries is essential for advancing biocatalysis. Traditional computational methods for enzyme screening and retrieval are time-consuming and resource-intensive. Recently, deep learning approaches have shown promise. However, these methods focus solely on the interaction between enzymes and reactions, overlooking the inherent hierarchical relationships within each domain. To address these limitations, we introduce FGW-CLIP, a novel contrastive learning framework based on optimizing the fused Gromov-Wasserstein distance. FGW-CLIP incorporates multiple alignments, including inter-domain alignment between reactions and enzymes and intra-domain alignment within enzymes and reactions. By introducing a tailored regularization term, our method minimizes the Gromov-Wasserstein distance between enzyme and reaction spaces, which enhances information integration across these domains. Extensive evaluations demonstrate the superiority of FGW-CLIP in challenging enzyme-reaction tasks. On the widely-used EnzymeMap benchmark, FGW-CLIP achieves state-of-the-art performance in enzyme virtual screening, as measured by BEDROC and EF metrics. Moreover, FGW-CLIP consistently outperforms across all three splits of ReactZyme, the largest enzyme-reaction benchmark, demonstrating robust generalization to novel enzymes and reactions. These results position FGW-CLIP as a promising framework for enzyme discovery in complex biochemical settings, with strong adaptability across diverse screening scenarios.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Waleed Razzaq, Yun-Bo Zhao
Machine Learning
ๆฅ็ๆ่ฆ
Accurate and uncertainty-aware degradation estimation is essential for predictive maintenance in safety-critical systems like rotating machinery with rolling-element bearings. Many existing uncertainty methods lack confidence calibration, are costly to run, are not distance-aware, and fail to generalize under out-of-distribution data. We introduce two distance-aware uncertainty methods for deterministic physics-guided neural networks: PG-SNGP, based on Spectral Normalization Gaussian Process, and PG-SNER, based on Deep Evidential Regression. We apply spectral normalization to the hidden layers so the network preserves distances from input to latent space. PG-SNGP replaces the final dense layer with a Gaussian Process layer for distance-sensitive uncertainty, while PG-SNER outputs Normal Inverse Gamma parameters to model uncertainty in a coherent probabilistic form. We assess performance using standard accuracy metrics and a new distance-aware metric based on the Pearson Correlation Coefficient, which measures how well predicted uncertainty tracks the distance between test and training samples. We also design a dynamic weighting scheme in the loss to balance data fidelity and physical consistency. We test our methods on rolling-element bearing degradation using the PRONOSTIA dataset and compare them with Monte Carlo and Deep Ensemble PGNNs. Results show that PG-SNGP and PG-SNER improve prediction accuracy, generalize reliably under OOD conditions, and remain robust to adversarial attacks and noise.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junnan Qiu, Jie Li
Machine Learning
ๆฅ็ๆ่ฆ
Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to data poisoning attacks. Existing attack strategies typically rely on locally uniform perturbations, which treat all samples indiscriminately. This approach is inefficient, as it wastes the perturbation budget on low-impact samples, and lacks stealthiness due to significant statistical deviations. In this paper, we propose a novel Global Budget Allocation attack strategy. Leveraging the theoretical insight that a sample's influence on value function convergence is proportional to its Temporal Difference (TD) error, we formulate the attack as a global resource allocation problem. We derive a closed-form solution where perturbation magnitudes are assigned proportional to the TD-error sensitivity under a global L2 constraint. Empirical results on D4RL benchmarks demonstrate that our method significantly outperforms baseline strategies, achieving up to 80% performance degradation with minimal perturbations that evade detection by state-of-the-art statistical and spectral defenses.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Weiqi Guan, Junlin He
ๆฅ็ๆ่ฆ
The relationship between Layer Normalization (LN) placement and the over-smoothing phenomenon remains underexplored. We identify a critical dilemma: Pre-LN architectures avoid over-smoothing but suffer from the curse of depth, while Post-LN architectures bypass the curse of depth but experience over-smoothing.
To resolve this, we propose a new method based on Post-LN that induces algebraic smoothing, preventing over-smoothing without the curse of depth. Empirical results across five benchmarks demonstrate that our approach supports deeper networks (up to 256 layers) and improves performance, requiring no additional parameters.
Key contributions:
Theoretical Characterization: Analysis of LN dynamics and their impact on over-smoothing and the curse of depth.
A Principled Solution: A parameter-efficient method that induces algebraic smoothing and avoids over-smoothing and the curse of depth.
Empirical Validation: Extensive experiments showing the effectiveness of the method in deeper GNNs.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Antonio Terpin, Raffaello D'Andrea
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Many high-performance human activities are executed with little or no external feedback: think of a figure skater landing a triple jump, a pitcher throwing a curveball for a strike, or a barista pouring latte art. To study the process of skill acquisition under fully controlled conditions, we bypass human subjects. Instead, we directly interface a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel to maximize or minimize drag. This setup has several desirable properties. First, it is a physical system, with the rich interactions and complex dynamics that only the physical world has: the flow is highly chaotic and extremely difficult, if not impossible, to model or simulate accurately. Second, the objective -- drag minimization or maximization -- is easy to state and can be captured directly in the reward, yet good strategies are not obvious beforehand. Third, decades-old experimental studies provide recipes for simple, high-performance open-loop policies. Finally, the setup is inexpensive and far easier to reproduce than human studies. In our experiments we find that high-dimensional flow feedback lets the agent discover high-performance drag-control strategies with only minutes of real-world interaction. When we later replay the same action sequences without any feedback, we obtain almost identical performance. This shows that feedback, and in particular flow feedback, is not needed to execute the learned policy. Surprisingly, without flow feedback during training the agent fails to discover any well-performing policy in drag maximization, but still succeeds in drag minimization, albeit more slowly and less reliably. Our studies show that learning a high-performance skill can require richer information than executing it, and learning conditions can be kind or wicked depending solely on the goal, not on dynamics or policy complexity.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Danial Jafarzadeh Jazi, Maryam Hajiesmaeili
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Decoding brain states from functional magnetic resonance imaging (fMRI) data is vital for advancing neuroscience and clinical applications. While traditional machine learning and deep learning approaches have made strides in leveraging the high-dimensional and complex nature of fMRI data, they often fail to utilize the contextual richness provided by Digital Imaging and Communications in Medicine (DICOM) metadata. This paper presents a novel framework integrating transformer-based architectures with multimodal inputs, including fMRI data and DICOM metadata. By employing attention mechanisms, the proposed method captures intricate spatial-temporal patterns and contextual relationships, enhancing model accuracy, interpretability, and robustness. The potential of this framework spans applications in clinical diagnostics, cognitive neuroscience, and personalized medicine. Limitations, such as metadata variability and computational demands, are addressed, and future directions for optimizing scalability and generalizability are discussed.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gary Ackerman, Theodore Wilson, Zachary Kallenborn, Olivia Shoemaker, Anna Wetzel et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The potential for rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons has generated significant policy, academic, and public concern. Both model developers and policymakers seek to quantify and mitigate any risk, with an important element of such efforts being the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper discusses the pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset. It is the third in a series of three papers describing an overall Biothreat Benchmark Generation (BBG) framework, with previous papers detailing the development of the B3 dataset. The pilot involved running the benchmarks through a sample frontier AI model, followed by human evaluation of model responses, and an applied risk analysis of the results along several dimensions. Overall, the pilot demonstrated that the B3 dataset offers a viable, nuanced method for rapidly assessing the biosecurity risk posed by a LLM, identifying the key sources of that risk and providing guidance for priority areas of mitigation priority.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gary Ackerman, Zachary Kallenborn, Anna Wetzel, Hayley Peterson, Jenna LaTourette et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The potential for rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons has generated significant policy, academic, and public concern. Both model developers and policymakers seek to quantify and mitigate any risk, with an important element of such efforts being the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper, the second in a series of three, describes the second component of a novel Biothreat Benchmark Generation (BBG) framework: the generation of the Bacterial Biothreat Benchmark (B3) dataset. The development process involved three complementary approaches: 1) web-based prompt generation, 2) red teaming, and 3) mining existing benchmark corpora, to generate over 7,000 potential benchmarks linked to the Task-Query Architecture that was developed during the first component of the project. A process of de-duplication, followed by an assessment of uplift diagnosticity, and general quality control measures, reduced the candidates to a set of 1,010 final benchmarks. This procedure ensured that these benchmarks are a) diagnostic in terms of providing uplift; b) directly relevant to biosecurity threats; and c) are aligned with a larger biosecurity architecture permitting nuanced analysis at different levels of analysis.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Madhav Gupta, Vishak Prasad C, Ganesh Ramakrishnan
Machine Learning
ๆฅ็ๆ่ฆ
Subset selection-based methods are widely used to explain deep vision models: they attribute predictions by highlighting the most influential image regions and support object-level explanations. While these methods perform well in in-distribution (ID) settings, their behavior under out-of-distribution (OOD) conditions remains poorly understood. Through extensive experiments across multiple ID-OOD sets, we find that reliability of the existing subset based methods degrades markedly, yielding redundant, unstable, and uncertainty-sensitive explanations. To address these shortcomings, we introduce a framework that combines submodular subset selection with layer-wise, gradient-based uncertainty estimation to improve robustness and fidelity without requiring additional training or auxiliary models. Our approach estimates uncertainty via adaptive weight perturbations and uses these estimates to guide submodular optimization, ensuring diverse and informative subset selection. Empirical evaluations show that, beyond mitigating the weaknesses of existing methods under OOD scenarios, our framework also yields improvements in ID settings. These findings highlight limitations of current subset-based approaches and demonstrate how uncertainty-driven optimization can enhance attribution and object-level interpretability, paving the way for more transparent and trustworthy AI in real-world vision applications.
๐
2025-12-09
๐ ArXiv Preprint (eess.IV)
๐ฅ Andreas Hauptmann, Ozan รktem
ๆฅ็ๆ่ฆ
Learned image reconstruction has become a pillar in computational imaging and inverse problems. Among the most successful approaches are learned iterative networks, which are formulated by unrolling classical iterative optimisation algorithms for solving variational problems. While the underlying algorithm is usually formulated in the functional analytic setting, learned approaches are often viewed as purely discrete. In this chapter we present a unified operator view for learned iterative networks. Specifically, we formulate a learned reconstruction operator, defining how to compute, and separately the learning problem, which defines what to compute. In this setting we present common approaches and show that many approaches are closely related in their core. We review linear as well as nonlinear inverse problems in this framework and present a short numerical study to conclude.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hithem Lamri, Michail Maniatakos
ๆฅ็ๆ่ฆ
Machine unlearning (MU) seeks to remove the influence of specified data from a trained model in response to privacy requests or data poisoning. While certified unlearning has been analyzed in centralized and server-orchestrated federated settings (via guarantees analogous to differential privacy, DP), the decentralized setting -- where peers communicate without a coordinator remains underexplored. We study certified unlearning in decentralized networks with fixed topologies and propose RR-DU, a random-walk procedure that performs one projected gradient ascent step on the forget set at the unlearning client and a geometrically distributed number of projected descent steps on the retained data elsewhere, combined with subsampled Gaussian noise and projection onto a trust region around the original model. We provide (i) convergence guarantees in the convex case and stationarity guarantees in the nonconvex case, (ii) $(\varepsilon,ฮด)$ network-unlearning certificates on client views via subsampled Gaussian Rรฉnyi DP (RDP) with segment-level subsampling, and (iii) deletion-capacity bounds that scale with the forget-to-local data ratio and quantify the effect of decentralization (network mixing and randomized subsampling) on the privacy-utility trade-off. Empirically, on image benchmarks (MNIST, CIFAR-10), RR-DU matches a given $(\varepsilon,ฮด)$ while achieving higher test accuracy than decentralized DP baselines and reducing forget accuracy to random guessing ($\approx 10\%$).
๐
2025-12-09
๐ ArXiv Preprint (eess.SY)
๐ฅ Nour Mitiche, Farid Ferguene, Mourad Oussalah
ๆฅ็ๆ่ฆ
Time delays in communication channels present significant challenges for bilateral teleoperation systems, affecting both transparency and stability. Although traditional wave variable-based methods for a four-channel architecture ensure stability via passivity, they remain vulnerable to wave reflections and disturbances like variable delays and environmental noise. This article presents a data-driven hybrid framework that replaces the conventional wave-variable transform with an ensemble of three advanced sequence models, each optimized separately via the state-of-the-art Optuna optimizer, and combined through a stacking meta-learner. The base predictors include an LSTM augmented with Prophet for trend correction, an LSTM-based feature extractor paired with clustering and a random forest for improved regression, and a CNN-LSTM model for localized and long-term dynamics. Experimental validation was performed in Python using data generated from the baseline system implemented in MATLAB/Simulink. The results show that our optimized ensemble achieves a transparency comparable to the baseline wave-variable system under varying delays and noise, while ensuring stability through passivity constraints.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Simon Chung, Colby J. Vorland, Donna L. Maney, Andrew W. Brown
ๆฅ็ๆ่ฆ
Datasets may contain observations with multiple labels. If the labels are not mutually exclusive, and if the labels vary greatly in frequency, obtaining a sample that includes sufficient observations with scarcer labels to make inferences about those labels, and which deviates from the population frequencies in a known manner, creates challenges. In this paper, we consider a multivariate Bernoulli distribution as our underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate multivariate Bernoulli distribution parameters and calculate weights for each label combination. This approach ensures the weighted sampling acquires target distribution characteristics while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve category frequency order, reduce frequency differences between most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
๐
2025-12-09
๐ ArXiv Preprint (cs.DC)
๐ฅ Yi Pan, Wenbo Qian, Dedong Xie, Ruiyan Hu, Yigong Hu et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
The training and deployment of machine learning (ML) models have become extremely energy-intensive. While existing optimization efforts focus primarily on hardware energy efficiency, a significant but overlooked source of inefficiency is software energy waste caused by poor software design. This often includes redundant or poorly designed operations that consume more energy without improving performance. These inefficiencies arise in widely used ML frameworks and applications, yet developers often lack the visibility and tools to detect and diagnose them.
We propose differential energy debugging, a novel approach that leverages the observation that competing ML systems often implement similar functionality with vastly different energy consumption. Building on this insight, we design and implement Magneton, an energy profiler that compares energy consumption between similar ML systems at the operator level and automatically pinpoints code regions and configuration choices responsible for excessive energy use. Applied to 9 popular ML systems spanning LLM inference, general ML frameworks, and image generation, Magneton detects and diagnoses 16 known cases of software energy inefficiency and further discovers 8 previously unknown cases, 7 of which have been confirmed by developers.
๐
2025-12-09
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Ali Sakour
Machine Learning
ๆฅ็ๆ่ฆ
Biological systems exhibit remarkable morphogenetic plasticity, where a single genome can encode various specialized cellular structures triggered by local chemical signals. In the domain of Deep Learning, Differentiable Neural Cellular Automata (NCA) have emerged as a paradigm to mimic this self-organization. However, existing NCA research has predominantly focused on continuous texture synthesis or single-target object recovery, leaving the challenge of class-conditional structural generation largely unexplored. In this work, we propose a novel Conditional Neural Cellular Automata (c-NCA) architecture capable of growing distinct topological structures - specifically MNIST digits - from a single generic seed, guided solely by a spatially broadcasted class vector. Unlike traditional generative models (e.g., GANs, VAEs) that rely on global reception fields, our model enforces strict locality and translation equivariance. We demonstrate that by injecting a one-hot condition into the cellular perception field, a single set of local rules can learn to break symmetry and self-assemble into ten distinct geometric attractors. Experimental results show that our c-NCA achieves stable convergence, correctly forming digit topologies from a single pixel, and exhibits robustness characteristic of biological systems. This work bridges the gap between texture-based NCAs and structural pattern formation, offering a lightweight, biologically plausible alternative for conditional generation.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tien Cuong Bui
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have become a powerful tool for modeling and analyzing data with graph structures. The wide adoption in numerous applications underscores the value of these models. However, the complexity of these methods often impedes understanding their decision-making processes. Current Explainable AI (XAI) methods struggle to untangle the intricate relationships and interactions within graphs. Several methods have tried to bridge this gap via a post-hoc approach or self-interpretable design. Most of them focus on graph structure analysis to determine essential patterns that correlate with prediction outcomes. While post-hoc explanation methods are adaptable, they require extra computational resources and may be less reliable due to limited access to the model's internal workings. Conversely, Interpretable models can provide immediate explanations, but their generalizability to different scenarios remains a major concern. To address these shortcomings, this thesis seeks to develop a novel XAI framework tailored for graph-based machine learning. The proposed framework aims to offer adaptable, computationally efficient explanations for GNNs, moving beyond individual feature analysis to capture how graph structure influences predictions.
๐
2025-12-09
๐ IEEE ICC 2026
๐ฅ Thai Duong Nguyen, Ngoc-Tan Nguyen, Thanh-Dao Nguyen, Nguyen Van Huynh, Dinh-Hieu Tran et al.
Machine Learning
ๆฅ็ๆ่ฆ
The deployment of Unmanned Aerial Vehicle (UAV) swarms as dynamic communication relays is critical for next-generation tactical networks. However, operating in contested environments requires solving a complex trade-off, including maximizing system throughput while ensuring collision avoidance and resilience against adversarial jamming. Existing heuristic-based approaches often struggle to find effective solutions due to the dynamic and multi-objective nature of this problem. This paper formulates this challenge as a cooperative Multi-Agent Reinforcement Learning (MARL) problem, solved using the Centralized Training with Decentralized Execution (CTDE) framework. Our approach employs a centralized critic that uses global state information to guide decentralized actors which operate using only local observations. Simulation results show that our proposed framework significantly outperforms heuristic baselines, increasing the total system throughput by approximately 50% while simultaneously achieving a near-zero collision rate. A key finding is that the agents develop an emergent anti-jamming strategy without explicit programming. They learn to intelligently position themselves to balance the trade-off between mitigating interference from jammers and maintaining effective communication links with ground users.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Abdullah Hulusi Kรถkรงam, Uฤur Daฤdeviren, Talas Fikret Kurnaz, Alparslan Serhat Demir, Caner Erden
Machine Learning
ๆฅ็ๆ่ฆ
The California Bearing Ratio (CBR) is a key geotechnical indicator used to assess the load-bearing capacity of subgrade soils, especially in transportation infrastructure and foundation design. Traditional CBR determination relies on laboratory penetration tests. Despite their accuracy, these tests are often time-consuming, costly, and can be impractical, particularly for large-scale or diverse soil profiles. Recent progress in artificial intelligence, especially machine learning (ML), has enabled data-driven approaches for modeling complex soil behavior with greater speed and precision. This study introduces a comprehensive ML framework for CBR prediction using a dataset of 382 soil samples collected from various geoclimatic regions in Tรผrkiye. The dataset includes physicochemical soil properties relevant to bearing capacity, allowing multidimensional feature representation in a supervised learning context. Twelve ML algorithms were tested, including decision tree, random forest, extra trees, gradient boosting, xgboost, k-nearest neighbors, support vector regression, multi-layer perceptron, adaboost, bagging, voting, and stacking regressors. Each model was trained, validated, and evaluated to assess its generalization and robustness. Among them, the random forest regressor performed the best, achieving strong R2 scores of 0.95 (training), 0.76 (validation), and 0.83 (test). These outcomes highlight the model's powerful nonlinear mapping ability, making it a promising tool for predictive geotechnical tasks. The study supports the integration of intelligent, data-centric models in geotechnical engineering, offering an effective alternative to traditional methods and promoting digital transformation in infrastructure analysis and design.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Michael R. Martin, Garrick Chan, Kwan-Liu Ma
Machine Learning
ๆฅ็ๆ่ฆ
Recent image protection mechanisms such as Glaze and Nightshade introduce imperceptible, adversarially designed perturbations intended to disrupt downstream text-to-image generative models. While their empirical effectiveness is known, the internal structure, detectability, and representational behavior of these perturbations remain poorly understood. This study provides a systematic, explainable AI analysis using a unified framework that integrates white-box feature-space inspection and black-box signal-level probing. Through latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization, we show that protection mechanisms operate as structured, low-entropy perturbations tightly coupled to underlying image content across representational, spatial, and spectral domains. Protected images preserve content-driven feature organization with protection-specific substructure rather than inducing global representational drift. Detectability is governed by interacting effects of perturbation entropy, spatial deployment, and frequency alignment, with sequential protection amplifying detectable structure rather than suppressing it. Frequency-domain analysis shows that Glaze and Nightshade redistribute energy along dominant image-aligned frequency axes rather than introducing diffuse noise. These findings indicate that contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, explaining why protection signals remain visually subtle yet consistently detectable. This work advances the interpretability of adversarial image protection and informs the design of future defenses and detection strategies for generative AI systems.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wang Chen, Ziyan Luo, Shuangyue Wang
Machine Learning
ๆฅ็ๆ่ฆ
Input features are conventionally represented as vectors, matrices, or third order tensors in the real field, for color image classification. Inspired by the success of quaternion data modeling for color images in image recovery and denoising tasks, we propose a novel classification method for color image classification, named as the Low-rank Support Quaternion Matrix Machine (LSQMM), in which the RGB channels are treated as pure quaternions to effectively preserve the intrinsic coupling relationships among channels via the quaternion algebra. For the purpose of promoting low-rank structures resulting from strongly correlated color channels, a quaternion nuclear norm regularization term, serving as a natural extension of the conventional matrix nuclear norm to the quaternion domain, is added to the hinge loss in our LSQMM model. An Alternating Direction Method of Multipliers (ADMM)-based iterative algorithm is designed to effectively resolve the proposed quaternion optimization model. Experimental results on multiple color image classification datasets demonstrate that our proposed classification approach exhibits advantages in classification accuracy, robustness and computational efficiency, compared to several state-of-the-art methods using support vector machines, support matrix machines, and support tensor machines.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ M Yashwanth, Gaurav Kumar Nayak, Harsh Rangwani, Arya Singh, R. Venkatesh Babu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Federated Learning (FL) is an emerging machine learning framework that enables multiple clients (coordinated by a server) to collaboratively train a global model by aggregating the locally trained models without sharing any client's training data. It has been observed in recent works that learning in a federated manner may lead the aggregated global model to converge to a 'sharp minimum' thereby adversely affecting the generalizability of this FL-trained model. Therefore, in this work, we aim to improve the generalization performance of models trained in a federated setup by introducing a 'flatness' constrained FL optimization problem. This flatness constraint is imposed on the top eigenvalue of the Hessian computed from the training loss. As each client trains a model on its local data, we further re-formulate this complex problem utilizing the client loss functions and propose a new computationally efficient regularization technique, dubbed 'MAN,' which Minimizes Activation's Norm of each layer on client-side models. We also theoretically show that minimizing the activation norm reduces the top eigenvalue of the layer-wise Hessian of the client's loss, which in turn decreases the overall Hessian's top eigenvalue, ensuring convergence to a flat minimum. We apply our proposed flatness-constrained optimization to the existing FL techniques and obtain significant improvements, thereby establishing new state-of-the-art.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alexander Goslin
Robotics
ๆฅ็ๆ่ฆ
For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. We introduce Terrain Diffusion, an AI-era successor to Perlin noise that bridges the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. At its core is InfiniteDiffusion, a novel algorithm for infinite generation, enabling seamless, real-time synthesis of boundless landscapes. A hierarchical stack of diffusion models couples planetary context with local detail, while a compact Laplacian encoding stabilizes outputs across Earth-scale dynamic ranges. An open-source infinite-tensor framework supports constant-memory manipulation of unbounded tensors, and few-step consistency distillation enables efficient generation. Together, these components establish diffusion models as a practical foundation for procedural world generation, capable of synthesizing entire planets coherently, controllably, and without limits.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sarwesh Rauniyar
ๆฅ็ๆ่ฆ
Axis-aligned decision trees are fast and stable but struggle on datasets with rotated or interaction-dependent decision boundaries, where informative splits require linear combinations of features rather than single-feature thresholds. Oblique forests address this with per-node hyperplane splits, but at added computational cost and implementation complexity. We propose a simple alternative: JARF, Jacobian-Aligned Random Forests. Concretely, we first fit an axis-aligned forest to estimate class probabilities or regression outputs, compute finite-difference gradients of these predictions with respect to each feature, aggregate them into an expected Jacobian outer product that generalizes the expected gradient outer product (EGOP), and use it as a single global linear preconditioner for all inputs. This supervised preconditioner applies a single global rotation of the feature space, then hands the transformed data back to a standard axis-aligned forest, preserving off-the-shelf training pipelines while capturing oblique boundaries and feature interactions that would otherwise require many axis-aligned splits to approximate. The same construction applies to any model that provides gradients, though we focus on random forests and gradient-boosted trees in this work. On tabular classification and regression benchmarks, this preconditioning consistently improves axis-aligned forests and often matches or surpasses oblique baselines while improving training time. Our experimental results and theoretical analysis together indicate that supervised preconditioning can recover much of the accuracy of oblique forests while retaining the simplicity and robustness of axis-aligned trees.
๐
2025-12-09
๐ ArXiv Preprint (astro-ph.SR)
๐ฅ Yue Xiang, Shenghong Gu, Dongtao Cao
Machine Learning
ๆฅ็ๆ่ฆ
Ultracool dwarfs consist of lowest-mass stars and brown dwarfs. Their interior is fully convective, different from that of the partly-convective Sun-like stars. Magnetic field generation process beneath the surface of ultracool dwarfs is still poorly understood and controversial. To increase samples of active ultracool dwarfs significantly, we have identified 962 ultracool dwarfs in the latest LAMOST data release, DR11. We also simulate the Chinese Space Station Survey Telescope (CSST) low-resolution slitless spectra by degrading the LAMOST spectra. A semi-supervised machine learning approach with an autoencoder model is built to identify ultracool dwarfs with the simulated CSST spectra, which demonstrates the capability of the CSST all-sky slitless spectroscopic survey on the detection of ultracool dwarfs. Magnetic activity of the ultracool dwarfs is investigated by using the H$ฮฑ$ line emission as a proxy. The rotational periods of 82 ultracool dwarfs are derived based on the Kepler/K2 light curves. We also derive the activity-rotation relation of the ultracool dwarfs, which is saturated around a Rossby number of 0.12.
๐
2025-12-09
๐ ArXiv Preprint (cs.SE)
๐ฅ Liao Hu, Qiteng Wu, Ruoyu Qi
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The integration of Large Language Models (LLMs) into mobile and software development workflows faces a persistent tension among three demands: semantic awareness, developer productivity, and data privacy. Traditional cloud-based tools offer strong reasoning but risk data exposure and latency, while on-device solutions lack full-context understanding across codebase and developer tooling. We introduce SolidGPT, an open-source, edge-cloud hybrid developer assistant built on GitHub, designed to enhance code and workspace semantic search. SolidGPT enables developers to: talk to your codebase: interactively query code and project structure, discovering the right methods and modules without manual searching. Automate software project workflows: generate PRDs, task breakdowns, Kanban boards, and even scaffold web app beginnings, with deep integration via VSCode and Notion. Configure private, extensible agents: onboard private code folders (up to approximately 500 files), connect Notion, customize AI agent personas via embedding and in-context training, and deploy via Docker, CLI, or VSCode extension. In practice, SolidGPT empowers developer productivity through: Semantic-rich code navigation: no more hunting through files or wondering where a feature lives. Integrated documentation and task management: seamlessly sync generated PRD content and task boards into developer workflows. Privacy-first design: running locally via Docker or VSCode, with full control over code and data, while optionally reaching out to LLM APIs as needed. By combining interactive code querying, automated project scaffolding, and human-AI collaboration, SolidGPT provides a practical, privacy-respecting edge assistant that accelerates real-world development workflows, ideal for intelligent mobile and software engineering contexts.
๐
2025-12-09
๐ ArXiv Preprint (cs.MA)
๐ฅ Kyungmin Kim, Seokbin Yoon, Keumjin Lee
Robotics
ๆฅ็ๆ่ฆ
Accurate and reliable aircraft landing time prediction is essential for effective resource allocation in air traffic management. However, the inherent uncertainty of aircraft trajectories and traffic flows poses significant challenges to both prediction accuracy and trustworthiness. Therefore, prediction models should not only provide point estimates of aircraft landing times but also the uncertainties associated with these predictions. Furthermore, aircraft trajectories are frequently influenced by the presence of nearby aircraft through air traffic control interventions such as radar vectoring. Consequently, landing time prediction models must account for multi-agent interactions in the airspace. In this work, we propose a probabilistic multi-agent aircraft landing time prediction framework that provides the landing times of multiple aircraft as distributions. We evaluate the proposed framework using an air traffic surveillance dataset collected from the terminal airspace of the Incheon International Airport in South Korea. The results demonstrate that the proposed model achieves higher prediction accuracy than the baselines and quantifies the associated uncertainties of its outcomes. In addition, the model uncovered underlying patterns in air traffic control through its attention scores, thereby enhancing explainability.
๐
2025-12-09
๐ ArXiv Preprint (cs.SE)
๐ฅ Yihan Liao, Jacky Keung, Zhenyu Mao, Jingyu Zhang, Jialong Li
Robotics
ๆฅ็ๆ่ฆ
Log-based anomaly detection (LAD) is critical for ensuring the reliability of large-scale distributed systems. However, most existing LAD approaches assume centralized training, which is often impractical due to privacy constraints and the decentralized nature of system logs. While federated learning (FL) offers a promising alternative, there is a lack of dedicated testbeds tailored to the needs of LAD in federated settings. To address this, we present FedLAD, a unified platform for training and evaluating LAD models under FL constraints. FedLAD supports plug-and-play integration of diverse LAD models, benchmark datasets, and aggregation strategies, while offering runtime support for validation logging (self-monitoring), parameter tuning (self-configuration), and adaptive strategy control (self-adaptation). By enabling reproducible and scalable experimentation, FedLAD bridges the gap between FL frameworks and LAD requirements, providing a solid foundation for future research. Project code is publicly available at: https://github.com/AA-cityu/FedLAD.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Humera Sabir, Fatima Farooq, Ashraf Aboulnaga
Machine Learning
ๆฅ็ๆ่ฆ
Knowledge Graphs (KGs) are a rich source of structured, heterogeneous data, powering a wide range of applications. A common approach to leverage this data is to train a graph neural network (GNN) on the KG. However, existing message-passing GNNs struggle to scale to large KGs because they rely on the iterative message passing process to learn the graph structure, which is inefficient, especially under mini-batch training, where a node sees only a partial view of its neighborhood. In this paper, we address this problem and present gHAWK, a novel and scalable GNN training framework for large KGs. The key idea is to precompute structural features for each node that capture its local and global structure before GNN training even begins. Specifically, gHAWK introduces a preprocessing step that computes: (a)~Bloom filters to compactly encode local neighborhood structure, and (b)~TransE embeddings to represent each node's global position in the graph. These features are then fused with any domain-specific features (e.g., text embeddings), producing a node feature vector that can be incorporated into any GNN technique. By augmenting message-passing training with structural priors, gHAWK significantly reduces memory usage, accelerates convergence, and improves model accuracy. Extensive experiments on large datasets from the Open Graph Benchmark (OGB) demonstrate that gHAWK achieves state-of-the-art accuracy and lower training time on both node property prediction and link prediction tasks, topping the OGB leaderboard for three graphs.
๐
2025-12-09
๐ ICRA 2025
๐ฅ Srijan Dokania, Dharini Raghavan
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth, weighted-PCA pose extraction, and 3D Gaussian Splatting (3DGS), TeleAssist provides every operator with real-time global positions and orientations of multiple robots without fiducials or depth sensors in an interaction-centric teleoperation setup.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yi Ni, Xinkun Wang, Han Zhang
Robotics
ๆฅ็ๆ่ฆ
Federated Learning (FL) faces significant challenges in evolving environments, particularly regarding data heterogeneity and the rigidity of fixed network topologies. To address these issues, this paper proposes \textbf{SOFA-FL} (Self-Organizing Hierarchical Federated Learning with Adaptive Clustered Data Sharing), a novel framework that enables hierarchical federated systems to self-organize and adapt over time.
The framework is built upon three core mechanisms: (1) \textbf{Dynamic Multi-branch Agglomerative Clustering (DMAC)}, which constructs an initial efficient hierarchical structure; (2) \textbf{Self-organizing Hierarchical Adaptive Propagation and Evolution (SHAPE)}, which allows the system to dynamically restructure its topology through atomic operations -- grafting, pruning, consolidation, and purification -- to adapt to changes in data distribution; and (3) \textbf{Adaptive Clustered Data Sharing}, which mitigates data heterogeneity by enabling controlled partial data exchange between clients and cluster nodes.
By integrating these mechanisms, SOFA-FL effectively captures dynamic relationships among clients and enhances personalization capabilities without relying on predetermined cluster structures.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rachana Mysore, Preksha Girish, Kavitha Jayaram, Shrey Kumar, Preksha Girish et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We investigate the mathematical foundations of neural networks in the infinite-width regime through the Neural Tangent Kernel (NTK). We propose the NTK-Eigenvalue-Controlled Residual Network (NTK-ECRN), an architecture integrating Fourier feature embeddings, residual connections with layerwise scaling, and stochastic depth to enable rigorous analysis of kernel evolution during training. Our theoretical contributions include deriving bounds on NTK dynamics, characterizing eigenvalue evolution, and linking spectral properties to generalization and optimization stability. Empirical results on synthetic and benchmark datasets validate the predicted kernel behavior and demonstrate improved training stability and generalization. This work provides a comprehensive framework bridging infinite-width theory and practical deep-learning architectures.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Preksha Girish, Rachana Mysore, Mahanthesha U, Shrey Kumar, Misbah Fatimah Annigeri et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Sudden Unexpected Death in Epilepsy (SUDEP) and acute ischemic stroke are life-threatening conditions involving complex interactions across cortical, brainstem, and autonomic systems. We present a unified geometric-stochastic multimodal deep learning framework that integrates EEG, ECG, respiration, SpO2, EMG, and fMRI signals to model SUDEP and stroke vulnerability. The approach combines Riemannian manifold embeddings, Lie-group invariant feature representations, fractional stochastic dynamics, Hamiltonian energy-flow modeling, and cross-modal attention mechanisms. Stroke propagation is modeled using fractional epidemic diffusion over structural brain graphs. Experiments on the MULTI-CLARID dataset demonstrate improved predictive accuracy and interpretable biomarkers derived from manifold curvature, fractional memory indices, attention entropy, and diffusion centrality. The proposed framework provides a mathematically principled foundation for early detection, risk stratification, and interpretable multimodal modeling in neural-autonomic disorders.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Deepak Gupta, Himanshu Pandey, Ratikanta Behera
Machine Learning
ๆฅ็ๆ่ฆ
This work proposes a wavelet-based physics-informed quantum neural network framework to efficiently address multiscale partial differential equations that involve sharp gradients, stiffness, rapid local variations, and highly oscillatory behavior. Traditional physics-informed neural networks (PINNs) have demonstrated substantial potential in solving differential equations, and their quantum counterparts, quantum-PINNs, exhibit enhanced representational capacity with fewer trainable parameters. However, both approaches face notable challenges in accurately solving multiscale features. Furthermore, their reliance on automatic differentiation for constructing loss functions introduces considerable computational overhead, resulting in longer training times. To overcome these challenges, we developed a wavelet-accelerated physics-informed quantum neural network that eliminates the need for automatic differentiation, significantly reducing computational complexity. The proposed framework incorporates the multiresolution property of wavelets within the quantum neural network architecture, thereby enhancing the network's ability to effectively capture both local and global features of multiscale problems. Numerical experiments demonstrate that our proposed method achieves superior accuracy while requiring less than five percent of the trainable parameters compared to classical wavelet-based PINNs, resulting in faster convergence. Moreover, it offers a speedup of three to five times compared to existing quantum PINNs, highlighting the potential of the proposed approach for efficiently solving challenging multiscale and oscillatory problems.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nicholas Harner
ๆฅ็ๆ่ฆ
Classical Time Series Classification algorithms are dominated by feature engineering strategies. One of the most prominent of these transforms is ROCKET, which achieves strong performance through random kernel features. We introduce SPROCKET (Selected Prototype Random Convolutional Kernel Transform), which implements a new feature engineering strategy based on prototypes. On a majority of the UCR and UEA Time Series Classification archives, SPROCKET achieves performance comparable to existing convolutional algorithms and the new MR-HY-SP ( MultiROCKET-HYDRA-SPROCKET) ensemble's average accuracy ranking exceeds HYDRA-MR, the previous best convolutional ensemble's performance. These experimental results demonstrate that prototype-based feature transformation can enhance both accuracy and robustness in time series classification.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saeeda Naz, Saddam Hussain Khan
Natural Language Processing
ๆฅ็ๆ่ฆ
A novel deep hybrid Residual-SwinCA-Net segmentation framework is proposed in the study for addressing such challenges by extracting locally correlated and robust features, incorporating residual CNN modules. Furthermore, for learning global dependencies, Swin Transformer blocks are customized using internal residual pathways, which reinforce gradient stability, refine local patterns, and facilitate global feature fusion. Formerly, for enhancing tissue continuity, ultrasound noise suppressions, and accentuating fine structural transitions Laplacian-of-Gaussian regional operator is applied, and for maintaining the morphological integrity of malignant lesion contours, a boundary-oriented operator has been incorporated. Subsequently, a contraction strategy was applied stage-wise by progressively reducing features-map progressively for capturing scale invariance and enhancing the robustness of structural variability. In addition, each decoder level prior augmentation integrates a new Multi-Scale Channel Attention and Squeezing (MSCAS) module. The MSCAS selectively emphasizes encoder salient maps, retains discriminative global context, and complementary local structures with minimal computational cost while suppressing redundant activations. Finally, the Pixel-Attention module encodes class-relevant spatial cues by adaptively weighing malignant lesion pixels while suppressing background interference. The Residual-SwinCA-Net and existing CNNs/ViTs techniques have been implemented on the publicly available BUSI dataset. The proposed Residual-SwinCA-Net framework outperformed and achieved 99.29% mean accuracy, 98.74% IoU, and 0.9041 Dice for breast lesion segmentation. The proposed Residual-SwinCA-Net framework improves the BUSI lesion diagnostic performance and strengthens timely clinical decision-making.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Preksha Girish, Rachana Mysore, Mahanthesha U, Shrey Kumar, Shipra Prashant
Robotics
ๆฅ็ๆ่ฆ
This paper presents a mathematically rigorous framework for brain-inspired representation learning founded on the interplay between persistent topological structures and cohomological flows. Neural computation is reformulated as the evolution of cochain maps over dynamic simplicial complexes, enabling representations that capture invariants across temporal, spatial, and functional brain states. The proposed architecture integrates algebraic topology with differential geometry to construct cohomological operators that generalize gradient-based learning within a homological landscape. Synthetic data with controlled topological signatures and real neural datasets are jointly analyzed using persistent homology, sheaf cohomology, and spectral Laplacians to quantify stability, continuity, and structural preservation. Empirical results demonstrate that the model achieves superior manifold consistency and noise resilience compared to graph neural and manifold-based deep architectures, establishing a coherent mathematical foundation for topology-driven representation learning.
๐
2025-12-09
๐ WSDM 2026
๐ฅ Ye Qin, Jingchao Wang, Yang Shi, Haiying Huang, Junxu Li et al.
ๆฅ็ๆ่ฆ
Capsule Networks (CapsNets) show exceptional graph representation capacity via dynamic routing and vectorized hierarchical representations, but they model the complex geometries of real\-world graphs poorly by fixed\-curvature space due to the inherent geodesical disconnectedness issues, leading to suboptimal performance. Recent works find that non\-Euclidean pseudo\-Riemannian manifolds provide specific inductive biases for embedding graph data, but how to leverage them to improve CapsNets is still underexplored. Here, we extend the Euclidean capsule routing into geodesically disconnected pseudo\-Riemannian manifolds and derive a Pseudo\-Riemannian Capsule Network (PR\-CapsNet), which models data in pseudo\-Riemannian manifolds of adaptive curvature, for graph representation learning. Specifically, PR\-CapsNet enhances the CapsNet with Adaptive Pseudo\-Riemannian Tangent Space Routing by utilizing pseudo\-Riemannian geometry. Unlike single\-curvature or subspace\-partitioning methods, PR\-CapsNet concurrently models hierarchical and cluster or cyclic graph structures via its versatile pseudo\-Riemannian metric. It first deploys Pseudo\-Riemannian Tangent Space Routing to decompose capsule states into spherical\-temporal and Euclidean\-spatial subspaces with diffeomorphic transformations. Then, an Adaptive Curvature Routing is developed to adaptively fuse features from different curvature spaces for complex graphs via a learnable curvature tensor with geometric attention from local manifold properties. Finally, a geometric properties\-preserved Pseudo\-Riemannian Capsule Classifier is developed to project capsule embeddings to tangent spaces and use curvature\-weighted softmax for classification. Extensive experiments on node and graph classification benchmarks show PR\-CapsNet outperforms SOTA models, validating PR\-CapsNet's strong representation power for complex graph structures.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jason Chuan-Chih Chou
Robotics
ๆฅ็ๆ่ฆ
Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set to proportional to learning rate $ฮณ$ without questioning. Some researchers have recently challenged such assumption and argued that decoupled weight decay should be set $\propto ฮณ^2$ instead based on orthogonality arguments at steady state. To the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change to the training dynamics. Instead, we derive that decoupled weight decay $\propto ฮณ^2$ results in stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, we derive and empirically verify that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by the momentum-dependent effective learning rate whose optimal value transfers and we show that decoupled weight decay $\propto ฮณ^2$ leads to stable weight and gradient norms and allows us to better control the training dynamics and improve the model performance.
๐
2025-12-09
๐ ArXiv Preprint (eess.IV)
๐ฅ Aneesh Rangnekar, Harini Veeraraghavan
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate segmentation of cancerous lesions from 3D computed tomography (CT) scans is essential for automated treatment planning and response assessment. However, even state-of-the-art models combining self-supervised learning (SSL) pretrained transformers with convolutional decoders are susceptible to out-of-distribution (OOD) inputs, generating confidently incorrect tumor segmentations, posing risks for safe clinical deployment. Existing logit-based methods suffer from task-specific model biases, while architectural enhancements to explicitly detect OOD increase parameters and computational costs. Hence, we introduce a plug-and-play and lightweight post-hoc random forests-based OOD detection framework called RF-Deep that leverages deep features with limited outlier exposure. RF-Deep enhances generalization to imaging variations by repurposing the hierarchical features from the pretrained-then-finetuned backbone encoder, providing task-relevant OOD detection by extracting the features from multiple regions of interest anchored to the predicted tumor segmentations. Hence, it scales to images of varying fields-of-view. We compared RF-Deep against existing OOD detection methods using 1,916 CT scans across near-OOD (pulmonary embolism, negative COVID-19) and far-OOD (kidney cancer, healthy pancreas) datasets. RF-Deep achieved AUROC > 93.50 for the challenging near-OOD datasets and near-perfect detection (AUROC > 99.00) for the far-OOD datasets, substantially outperforming logit-based and radiomics approaches. RF-Deep maintained similar performance consistency across networks of different depths and pretraining strategies, demonstrating its effectiveness as a lightweight, architecture-agnostic approach to enhance the reliability of tumor segmentation from CT volumes.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiaxiang Geng, Lunyu Zhao, Yiyi Lu, Bing Luo
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Mobile phones are the most ubiquitous end devices, generating vast amounts of human-authored data and serving as the primary platform for end-side applications. As high-quality public data for large language models (LLMs) approaches exhaustion, on-device fine-tuning provides an opportunity to leverage private user data while preserving privacy. However, existing approaches are predominantly simulation-based or rely on IoT devices and PCs, leaving commodity mobile phones largely unexplored. A key gap is the absence of an open-source framework that enables practical LLM fine-tuning on mobile phones. We present MobileFineTuner, a unified open-source framework that enables end-to-end LLM fine-tuning directly on commodity mobile phones. MobileFineTuner is designed for efficiency, scalability, and usability, supporting full-parameters fine-tuning (Full-FT) and parameter-efficient fine-tuning (PEFT). To address the memory and energy limitations inherent to mobile phones, we introduce system-level optimizations including parameter sharding, gradient accumulation, and energy-aware computation scheduling. We demonstrate the practicality of MobileFineTuner by fine-tuning GPT-2, Gemma 3, and Qwen 2.5 on real mobile phones. Extensive experiments and ablation studies validate the effectiveness of the proposed optimizations and establish MobileFineTuner as a viable foundation for future research on on-device LLM training.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Xiuyuan Cheng, Yao Xie, Linglingzhi Zhu, Yunqin Zhu
Machine Learning
ๆฅ็ๆ่ฆ
Worst-case generation plays a critical role in evaluating robustness and stress-testing systems under distribution shifts, in applications ranging from machine learning models to power grids and medical prediction systems. We develop a generative modeling framework for worst-case generation for a pre-specified risk, based on min-max optimization over continuous probability distributions, namely the Wasserstein space. Unlike traditional discrete distributionally robust optimization approaches, which often suffer from scalability issues, limited generalization, and costly worst-case inference, our framework exploits the Brenier theorem to characterize the least favorable (worst-case) distribution as the pushforward of a transport map from a continuous reference measure, enabling a continuous and expressive notion of risk-induced generation beyond classical discrete DRO formulations. Based on the min-max formulation, we propose a Gradient Descent Ascent (GDA)-type scheme that updates the decision model and the transport map in a single loop, establishing global convergence guarantees under mild regularity assumptions and possibly without convexity-concavity. We also propose to parameterize the transport map using a neural network that can be trained simultaneously with the GDA iterations by matching the transported training samples, thereby achieving a simulation-free approach. The efficiency of the proposed method as a risk-induced worst-case generator is validated by numerical experiments on synthetic and image data.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nanda K. Unnikrishnan, Keshab K. Parhi
Machine LearningRobotics
ๆฅ็ๆ่ฆ
In our prior work, LayerPipe, we had introduced an approach to accelerate training of convolutional, fully connected, and spiking neural networks by overlapping forward and backward computation. However, despite empirical success, a principled understanding of how much gradient delay needs to be introduced at each layer to achieve desired level of pipelining was not addressed. This paper, LayerPipe2, fills that gap by formally deriving LayerPipe using variable delayed gradient adaptation and retiming. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure where inner layers require fewer delays and outer layers require longer delays. When pipelining is applied at every layer, the amount of delay depends only on the number of remaining downstream stages. When layers are pipelined in groups, all layers in the group share the same assignment of delays. These insights not only explain previously observed scheduling patterns but also expose an often overlooked challenge that pipelining implicitly requires storage of historical weights. We overcome this storage bottleneck by developing a pipeline--aware moving average that reconstructs the required past states rather than storing them explicitly. This reduces memory cost without sacrificing the accuracy guarantees that makes pipelined learning viable. The result is a principled framework that illustrates how to construct LayerPipe architectures, predicts their delay requirements, and mitigates their storage burden, thereby enabling scalable pipelined training with controlled communication computation tradeoffs.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zheng Ding, Weirui Ye
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce \textbf{TreeGRPO}, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) \emph{High sample efficiency}, achieving better performance under same training samples (2) \emph{Fine-grained credit assignment} via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods, and (3) \emph{Amortized computation} where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves \textbf{2.4$\times$ faster training} while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at treegrpo.github.io.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ali Lotfi Rezaabad, Bikram Khanal, Shashwat Chaurasia, Lu Zeng, Dezhi Hong et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet, existing language identification tools struggle with key cases -- such as music requests where the song title and user language differ. Open-source tools like LangDetect, FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact and well-separated embeddings even for closely related languages. Evaluated on two challenging datasets -- Amazon Massive (multilingual digital assistant utterances) and a Song dataset (music requests with frequent code-switching) -- PolyLingua achieves 99.25% F1 and 98.15% F1, respectively, surpassing Sonnet 3.5 while using 10x fewer parameters, making it ideal for compute- and latency-constrained environments.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mikayel Samvelyan
Machine Learning
ๆฅ็ๆ่ฆ
The growing prevalence of artificial intelligence (AI) in various applications underscores the need for agents that can successfully navigate and adapt to an ever-changing, open-ended world. A key challenge is ensuring these AI agents are robust, excelling not only in familiar settings observed during training but also effectively generalising to previously unseen and varied scenarios. In this thesis, we harness methodologies from open-endedness and multi-agent learning to train and evaluate robust AI agents capable of generalising to novel environments, out-of-distribution inputs, and interactions with other co-player agents. We begin by introducing MiniHack, a sandbox framework for creating diverse environments through procedural content generation. Based on the game of NetHack, MiniHack enables the construction of new tasks for reinforcement learning (RL) agents with a focus on generalisation. We then present Maestro, a novel approach for generating adversarial curricula that progressively enhance the robustness and generality of RL agents in two-player zero-sum games. We further probe robustness in multi-agent domains, utilising quality-diversity methods to systematically identify vulnerabilities in state-of-the-art, pre-trained RL policies within the complex video game football domain, characterised by intertwined cooperative and competitive dynamics. Finally, we extend our exploration of robustness to the domain of LLMs. Here, our focus is on diagnosing and enhancing the robustness of LLMs against adversarial prompts, employing evolutionary search to generate a diverse range of effective inputs that aim to elicit undesirable outputs from an LLM. This work collectively paves the way for future advancements in AI robustness, enabling the development of agents that not only adapt to an ever-evolving world but also thrive in the face of unforeseen challenges and interactions.
๐
2025-12-09
๐ ArXiv Preprint (cs.GT)
๐ฅ Kyriakos Lotidis, Panayotis Mertikopoulos, Nicholas Bambos, Jose Blanchet
ๆฅ็ๆ่ฆ
In this paper, we examine the robustness of Nash equilibria in continuous games, under both strategic and dynamic uncertainty. Starting with the former, we introduce the notion of a robust equilibrium as those equilibria that remain invariant to small -- but otherwise arbitrary -- perturbations to the game's payoff structure, and we provide a crisp geometric characterization thereof. Subsequently, we turn to the question of dynamic robustness, and we examine which equilibria may arise as stable limit points of the dynamics of "follow the regularized leader" (FTRL) in the presence of randomness and uncertainty. Despite their very distinct origins, we establish a structural correspondence between these two notions of robustness: strategic robustness implies dynamic robustness, and, conversely, the requirement of strategic robustness cannot be relaxed if dynamic robustness is to be maintained. Finally, we examine the rate of convergence to robust equilibria as a function of the underlying regularizer, and we show that entropically regularized learning converges at a geometric rate in games with affinely constrained action spaces.
๐
2025-12-09
๐ ArXiv Preprint (cs.GT)
๐ฅ Kyriakos Lotidis, Panayotis Mertikopoulos, Nicholas Bambos, Jose Blanchet
ๆฅ็ๆ่ฆ
In this paper, we examine the convergence landscape of multi-agent learning under uncertainty. Specifically, we analyze two stochastic models of regularized learning in continuous games -- one in continuous and one in discrete time with the aim of characterizing the long-run behavior of the induced sequence of play. In stark contrast to deterministic, full-information models of learning (or models with a vanishing learning rate), we show that the resulting dynamics do not converge in general. In lieu of this, we ask instead which actions are played more often in the long run, and by how much. We show that, in strongly monotone games, the dynamics of regularized learning may wander away from equilibrium infinitely often, but they always return to its vicinity in finite time (which we estimate), and their long-run distribution is sharply concentrated around a neighborhood thereof. We quantify the degree of this concentration, and we show that these favorable properties may all break down if the underlying game is not strongly monotone -- underscoring in this way the limits of regularized learning in the presence of persistent randomness and uncertainty.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gary Ackerman, Brandon Behlendorf, Zachary Kallenborn, Sheriff Almakki, Doug Clifford et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Both model developers and policymakers seek to quantify and mitigate the risk of rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons. An important element of such efforts is the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper describes the first component of a novel Biothreat Benchmark Generation (BBG) Framework. The BBG approach is designed to help model developers and evaluators reliably measure and assess the biosecurity risk uplift and general harm potential of existing and future AI models, while accounting for key aspects of the threat itself that are often overlooked in other benchmarking efforts, including different actor capability levels, and operational (in addition to purely technical) risk factors. As a pilot, the BBG is first being developed to address bacterial biological threats only. The BBG is built upon a hierarchical structure of biothreat categories, elements and tasks, which then serves as the basis for the development of task-aligned queries. This paper outlines the development of this biothreat task-query architecture, which we have named the Bacterial Biothreat Schema, while future papers will describe follow-on efforts to turn queries into model prompts, as well as how the resulting benchmarks can be implemented for model evaluation. Overall, the BBG Framework, including the Bacterial Biothreat Schema, seeks to offer a robust, re-usable structure for evaluating bacterial biological risks arising from LLMs across multiple levels of aggregation, which captures the full scope of technical and operational requirements for biological adversaries, and which accounts for a wide spectrum of biological adversary capabilities.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Guangmingmei Yang, David J. Miller, George Kesidis
Machine Learning
ๆฅ็ๆ่ฆ
Most post-training backdoor detection methods rely on attacked models exhibiting extreme outlier detection statistics for the target class of an attack, compared to non-target classes. However, these approaches may fail: (1) when some (non-target) classes are easily discriminable from all others, in which case they may naturally achieve extreme detection statistics (e.g., decision confidence); and (2) when the backdoor is subtle, i.e., with its features weak relative to intrinsic class-discriminative features. A key observation is that the backdoor target class has contributions to its detection statistic from both the backdoor trigger and from its intrinsic features, whereas non-target classes only have contributions from their intrinsic features. To achieve more sensitive detectors, we thus propose to suppress intrinsic features while optimizing the detection statistic for a given class. For non-target classes, such suppression will drastically reduce the achievable statistic, whereas for the target class the (significant) contribution from the backdoor trigger remains. In practice, we formulate a constrained optimization problem, leveraging a small set of clean examples from a given class, and optimizing the detection statistic while orthogonalizing with respect to the class's intrinsic features. We dub this plug-and-play approach Class Subspace Orthogonalization (CSO) and assess it against challenging mixed-label and adaptive attacks.
๐
2025-12-09
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zijiang Yang
Machine Learning
ๆฅ็ๆ่ฆ
This paper will propose a novel machine learning based portfolio management method in the context of the cryptocurrency market. Previous researchers mainly focus on the prediction of the movement for specific cryptocurrency such as the bitcoin(BTC) and then trade according to the prediction. In contrast to the previous work that treats the cryptocurrencies independently, this paper manages a group of cryptocurrencies by analyzing the relative relationship. Specifically, in each time step, we utilize the neural network to predict the rank of the future return of the managed cryptocurrencies and place weights accordingly. By incorporating such cross-sectional information, the proposed methods is shown to profitable based on the backtesting experiments on the real daily cryptocurrency market data from May, 2020 to Nov, 2023. During this 3.5 years, the market experiences the full cycle of bullish, bearish and stagnant market conditions. Despite under such complex market conditions, the proposed method outperforms the existing methods and achieves a Sharpe ratio of 1.01 and annualized return of 64.26%. Additionally, the proposed method is shown to be robust to the increase of transaction fee.
๐
2025-12-09
๐ KDD 2026
๐ฅ Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Zitong Wan et al.
Multimodal
ๆฅ็ๆ่ฆ
Although existing multimodal recommendation models have shown promising performance, their effectiveness continues to be limited by the pervasive data sparsity problem. This problem arises because users typically interact with only a small subset of available items, leading existing models to arbitrarily treat unobserved items as negative samples. To this end, we propose VI-MMRec, a model-agnostic and training cost-free framework that enriches sparse user-item interactions via similarity-aware virtual user-item interactions. These virtual interactions are constructed based on modality-specific feature similarities of user-interacted items. Specifically, VI-MMRec introduces two different strategies: (1) Overlay, which independently aggregates modality-specific similarities to preserve modality-specific user preferences, and (2) Synergistic, which holistically fuses cross-modal similarities to capture complementary user preferences. To ensure high-quality augmentation, we design a statistically informed weight allocation mechanism that adaptively assigns weights to virtual user-item interactions based on dataset-specific modality relevance. As a plug-and-play framework, VI-MMRec seamlessly integrates with existing models to enhance their performance without modifying their core architecture. Its flexibility allows it to be easily incorporated into various existing models, maximizing performance with minimal implementation effort. Moreover, VI-MMRec introduces no additional overhead during training, making it significantly advantageous for practical deployment. Comprehensive experiments conducted on six real-world datasets using seven state-of-the-art multimodal recommendation models validate the effectiveness of our VI-MMRec.
๐
2025-12-09
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiin Park, Hyuna Jeon, Yoonseo Lee, Jisu Hong, Misuk Kim
ๆฅ็ๆ่ฆ
Ontology-based knowledge graph (KG) construction is a core technology that enables multidimensional understanding and advanced reasoning over domain knowledge. Industrial standards, in particular, contain extensive technical information and complex rules presented in highly structured formats that combine tables, scopes of application, constraints, exceptions, and numerical calculations, making KG construction especially challenging. In this study, we propose a method that organizes such documents into a hierarchical semantic structure, decomposes sentences and tables into atomic propositions derived from conditional and numerical rules, and integrates them into an ontology-knowledge graph through LLM-based triple extraction. Our approach captures both the hierarchical and logical structures of documents, effectively representing domain-specific semantics that conventional methods fail to reflect. To verify its effectiveness, we constructed rule, table, and multi-hop QA datasets, as well as a toxic clause detection dataset, from industrial standards, and implemented an ontology-aware KG-RAG framework for comparative evaluation. Experimental results show that our method achieves significant performance improvements across all QA types compared to existing KG-RAG approaches. This study demonstrates that reliable and scalable knowledge representation is feasible even for industrial documents with intertwined conditions, constraints, and scopes, contributing to future domain-specific RAG development and intelligent document management.
๐
2025-12-09
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiwoo Park, Ruoqi Liu, Avani Jagdale, Andrew Srisuwananukorn, Jing Zhao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Jessica Yin, Haozhi Qi, Youngsun Wi, Sayantan Kundu, Mike Lambeta et al.
Robotics
ๆฅ็ๆ่ฆ
Human video demonstrations provide abundant training data for learning robot policies, but video alone cannot capture the rich contact signals critical for mastering manipulation. We introduce OSMO, an open-source wearable tactile glove designed for human-to-robot skill transfer. The glove features 12 three-axis tactile sensors across the fingertips and palm and is designed to be compatible with state-of-the-art hand-tracking methods for in-the-wild data collection. We demonstrate that a robot policy trained exclusively on human demonstrations collected with OSMO, without any real robot data, is capable of executing a challenging contact-rich manipulation task. By equipping both the human and the robot with the same glove, OSMO minimizes the visual and tactile embodiment gap, enabling the transfer of continuous shear and normal force feedback while avoiding the need for image inpainting or other vision-based force inference. On a real-world wiping task requiring sustained contact pressure, our tactile-aware policy achieves a 72% success rate, outperforming vision-only baselines by eliminating contact-related failure modes. We release complete hardware designs, firmware, and assembly instructions to support community adoption.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Simon de Moreau, Andrei Bursuc, Hafid El-Idrissi, Fabien Moutarde
Robotics
ๆฅ็ๆ่ฆ
Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it on object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS enables +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performances while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution to robust nighttime perception.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Manduhu Manduhu, Alexander Dow, Gerard Dooly, James Riordan
Computer Vision
ๆฅ็ๆ่ฆ
Rotation invariance is essential for precise, object-level segmentation in UAV aerial imagery, where targets can have arbitrary orientations and exhibit fine-scale details. Conventional segmentation architectures like U-Net rely on convolution operators that are not rotation-invariant, leading to degraded segmentation accuracy across varying viewpoints. Rotation invariance can be achieved by expanding the filter bank across multiple orientations; however, this will significantly increase computational cost and memory traffic. In this paper, we introduce a GPU-optimized rotation-invariant convolution framework that eliminates the traditional data-lowering (im2col) step required for matrix-multiplication-based convolution. By exploiting structured data sharing among symmetrically rotated filters, our method achieves multi-orientation convolution with greatly reduced memory traffic and computational redundancy. We further generalize the approach to accelerate convolution with arbitrary (non-symmetric) rotation angles.
Across extensive benchmarks, the proposed convolution achieves 20--55% faster training and 15--45% lower energy consumption than CUDNN, while maintaining accuracy comparable to state-of-the-art rotation-invariant methods. In the eight-orientation setting, our approach achieves up to 45% speedup and 41% energy savings on 256\(\times\)256 inputs, and 32% speedup and 23% lower energy usage on 1024\(\times\)1024 inputs. Integrated into a U-Net segmentation model, the framework yields up to 6% improvement in accuracy over the non-rotation-aware baseline. These results demonstrate that the proposed method provides an effective and highly efficient alternative to existing rotation-invariant CNN frameworks.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Ryan LeRoy, Jack Kolb
Machine Learning
ๆฅ็ๆ่ฆ
Multi-Agent Reinforcement Learning (MARL) is commonly deployed in settings where agents are trained via self-play with homogeneous teammates, often using parameter sharing and a single policy architecture. This opens the question: to what extent do self-play PPO agents learn general coordination strategies grounded in the underlying game, compared to overfitting to their training partners' behaviors? This paper investigates the question using the Heterogeneous Multi-Agent Challenge (HeMAC) environment, which features distinct Observer and Drone agents with complementary capabilities. We introduce Rotating Policy Training (RPT), an approach that rotates heterogeneous teammate policies of different learning algorithms during training, to expose the agent to a broader range of partner strategies. When playing alongside a withheld teammate policy (DDQN), we find that RPT achieves similar performance to a standard self-play baseline, IPPO, where all agents were trained sharing a single PPO policy. This result indicates that in this heterogeneous multi-agent setting, the IPPO baseline generalizes to novel teammate algorithms despite not experiencing teammate diversity during training. This shows that a simple IPPO baseline may possess the level of generalization to novel teammates that a diverse training regimen was designed to achieve.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Connor York, Zachary R Madin, Paul O'Dowd, Edmund R Hunt
ๆฅ็ๆ่ฆ
Multi-robot systems performing continuous tasks face a performance trade-off when interrupted by urgent, time-critical sub-tasks. We investigate this trade-off in a scenario where a team must balance area patrolling with locating an anomalous radio signal. To address this trade-off, we evaluate both behavioral heterogeneity through agent role specialization ("patrollers" and "searchers") and sensing heterogeneity (i.e., only the searchers can sense the radio signal). Through simulation, we identify the Pareto-optimal trade-offs under varying team compositions, with behaviorally heterogeneous teams demonstrating the most balanced trade-offs in the majority of cases. When sensing capability is restricted, heterogeneous teams with half of the sensing-capable agents perform comparably to homogeneous teams, providing cost-saving rationale for restricting sensor payload deployment. Our findings demonstrate that pre-deployment role and sensing specialization are powerful design considerations for multi-robot systems facing time-conflicting tasks, where varying the degree of behavioral heterogeneity can tune system performance toward either task.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Mohammed Elseiagy, Tsige Tadesse Alemayoh, Ranulfo Bezerra, Shotaro Kojima, Kazunori Ohno
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Bridging the sim-to-real gap remains a fundamental challenge in robotics, as accurate dynamic parameter estimation is essential for reliable model-based control, realistic simulation, and safe deployment of manipulators. Traditional analytical approaches often fall short when faced with complex robot structures and interactions. Data-driven methods offer a promising alternative, yet conventional neural networks such as recurrent models struggle to capture long-range dependencies critical for accurate estimation. In this study, we propose a Transformer-based approach for dynamic parameter estimation, supported by an automated pipeline that generates diverse robot models and enriched trajectory data using Jacobian-derived features. The dataset consists of 8,192 robots with varied inertial and frictional properties. Leveraging attention mechanisms, our model effectively captures both temporal and spatial dependencies. Experimental results highlight the influence of sequence length, sampling rate, and architecture, with the best configuration (sequence length 64, 64 Hz, four layers, 32 heads) achieving a validation R2 of 0.8633. Mass and inertia are estimated with near-perfect accuracy, Coulomb friction with moderate-to-high accuracy, while viscous friction and distal link center-of-mass remain more challenging. These results demonstrate that combining Transformers with automated dataset generation and kinematic enrichment enables scalable, accurate dynamic parameter estimation, contributing to improved sim-to-real transfer in robotic systems
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Jason Hughes, Marcel Hussing, Edward Zhang, Shenbagaraj Kannapiran, Joshua Caswell et al.
ๆฅ็ๆ่ฆ
This report presents a heterogeneous robotic system designed for remote primary triage in mass-casualty incidents (MCIs). The system employs a coordinated air-ground team of unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to locate victims, assess their injuries, and prioritize medical assistance without risking the lives of first responders. The UAV identify and provide overhead views of casualties, while UGVs equipped with specialized sensors measure vital signs and detect and localize physical injuries. Unlike previous work that focused on exploration or limited medical evaluation, this system addresses the complete triage process: victim localization, vital sign measurement, injury severity classification, mental status assessment, and data consolidation for first responders. Developed as part of the DARPA Triage Challenge, this approach demonstrates how multi-robot systems can augment human capabilities in disaster response scenarios to maximize lives saved.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Mark Pustilnik, Francesco Borrelli
Robotics
ๆฅ็ๆ่ฆ
This paper proposes a dynamic game formulation for cooperative human-robot navigation in shared workspaces with obstacles, where the human and robot jointly satisfy shared safety constraints while pursuing a common task. A key contribution is the introduction of a non-normalized equilibrium structure for the shared constraints. This structure allows the two agents to contribute different levels of effort towards enforcing safety requirements such as collision avoidance and inter-players spacing. We embed this non-normalized equilibrium into a receding-horizon optimal control scheme.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Ziyue Zheng, Yongce Liu, Hesheng Wang, Zhongqiang Ren
Machine Learning
ๆฅ็ๆ่ฆ
This paper addresses the problem of trajectory planning for information gathering with a dynamic and resolution-varying sensor footprint. Ergodic planning offers a principled framework that balances exploration (visiting all areas) and exploitation (focusing on high-information regions) by planning trajectories such that the time spent in a region is proportional to the amount of information in that region. Existing ergodic planning often oversimplifies the sensing model by assuming a point sensor or a footprint with constant shape and resolution. In practice, the sensor footprint can drastically change over time as the robot moves, such as aerial robots equipped with downward-facing cameras, whose field of view depends on the orientation and altitude. To overcome this limitation, we propose a new metric that accounts for dynamic sensor footprints, analyze the theoretic local optimality conditions, and propose numerical trajectory optimization algorithms. Experimental results show that the proposed approach can simultaneously optimize both the trajectories and sensor footprints, with up to an order of magnitude better ergodicity than conventional methods. We also deploy our approach in a multi-drone system to ergodically cover an object in 3D space.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Lauritz Rismark Fosso, Herman Biรธrn Amundsen, Marios Xanthidis, Sveinung Johan Ohrem
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Holonomic autonomous underwater vehicles (AUVs) have the hardware ability for agile maneuvering in both translational and rotational degrees of freedom (DOFs). However, due to challenges inherent to underwater vehicles, such as complex hydrostatics and hydrodynamics, parametric uncertainties, and frequent changes in dynamics due to payload changes, control is challenging. Performance typically relies on carefully tuned controllers targeting unique platform configurations, and a need for re-tuning for deployment under varying payloads and hydrodynamic conditions. As a consequence, agile maneuvering with simultaneous tracking of time-varying references in both translational and rotational DOFs is rarely utilized in practice. To the best of our knowledge, this paper presents the first general zero-shot sim2real deep reinforcement learning-based (DRL) velocity controller enabling path following and agile 6DOF maneuvering with a training duration of just 3 minutes. Sim2Swim, the proposed approach, inspired by state-of-the-art DRL-based position control, leverages domain randomization and massively parallelized training to converge to field-deployable control policies for AUVs of variable characteristics without post-processing or tuning. Sim2Swim is extensively validated in pool trials for a variety of configurations, showcasing robust control for highly agile motions.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Doumegna Mawuto Koudjo Felix, Xianjia Yu, Zhuo Zou, Tomi Westerlund
Robotics
ๆฅ็ๆ่ฆ
Lidar-based SLAM systems are highly sensitive to adverse conditions such as occlusion, noise, and field-of-view (FoV) degradation, yet existing robustness evaluation methods either lack physical grounding or do not capture sensor-specific behavior. This paper presents a sensor-aware, phenomenological framework for simulating interpretable lidar degradations directly on real point clouds, enabling controlled and reproducible SLAM stress testing. Unlike image-derived corruption benchmarks (e.g., SemanticKITTI-C) or simulation-only approaches (e.g., lidarsim), the proposed system preserves per-point geometry, intensity, and temporal structure while applying structured dropout, FoV reduction, Gaussian noise, occlusion masking, sparsification, and motion distortion. The framework features autonomous topic and sensor detection, modular configuration with four severity tiers (light--extreme), and real-time performance (less than 20 ms per frame) compatible with ROS workflows. Experimental validation across three lidar architectures and five state-of-the-art SLAM systems reveals distinct robustness patterns shaped by sensor design and environmental context. The open-source implementation provides a practical foundation for benchmarking lidar-based SLAM under physically meaningful degradation scenarios.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Marta Manzoni, Alessandro Nazzari, Roberto Rubinacci, Marco Lovera
Machine Learning
ๆฅ็ๆ่ฆ
This paper investigates the use of Multi-Task Bayesian Optimization for tuning decentralized trajectory generation algorithms in multi-drone systems. We treat each task as a trajectory generation scenario defined by a specific number of drone-to-drone interactions. To model relationships across scenarios, we employ Multi-Task Gaussian Processes, which capture shared structure across tasks and enable efficient information transfer during optimization. We compare two strategies: optimizing the average mission time across all tasks and optimizing each task individually. Through a comprehensive simulation campaign, we show that single-task optimization leads to progressively shorter mission times as swarm size grows, but requires significantly more optimization time than the average-task approach.
๐
2025-12-09
๐ ArXiv Preprint (eess.SY)
๐ฅ Adrian Wiltz, Dimos V. Dimarogonas
Robotics
ๆฅ็ๆ่ฆ
This article presents a systematic method for designing time-varying Control Barrier Functions (CBF) composed of a time-invariant component and multiple time-dependent components, leveraging structural properties of the system dynamics. The method involves the construction of a specific class of time-invariant CBFs that encode the system's dynamic capabilities with respect to a given constraint, and augments them subsequently with appropriately designed time-dependent transformations. While transformations uniformly varying the time-invariant CBF can be applied to arbitrary systems, transformations exploiting structural properties in the dynamics - equivariances in particular - enable the handling of a broader and more expressive class of time-varying constraints. The article shows how to leverage such properties in the design of time-varying CBFs. The proposed method decouples the design of time variations from the computationally expensive construction of the underlying CBFs, thereby providing a computationally attractive method to the design of time-varying CBFs. The method accounts for input constraints and under-actuation, and requires only qualitative knowledge on the time-variation of the constraints making it suitable to the application in uncertain environments.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo et al.
MultimodalMachine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) Continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) Co-training on cross-embodiment robot data alongside vision-language data; and (3) Action training with reasoning process on trajectories collected on Astribot S1, a bimanual mobile manipulator with human-like dexterity and agility. Finally, we integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control. Extensive experiments demonstrate that Lumo-1 achieves significant performance improvements in embodied vision-language reasoning, a critical component for generalist robotic control. Real-world evaluations further show that Lumo-1 surpasses strong baselines across a wide range of challenging robotic tasks, with strong generalization to novel objects and environments, excelling particularly in long-horizon tasks and responding to human-natural instructions that require reasoning over strategy, concepts and space.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuna Kato, Shohei Mori, Hideo Saito, Yoshifumi Takatsume, Hiroki Kajita et al.
ๆฅ็ๆ่ฆ
Video recordings of open surgeries are greatly required for education and research purposes. However, capturing unobstructed videos is challenging since surgeons frequently block the camera field of view. To avoid occlusion, the positions and angles of the camera must be frequently adjusted, which is highly labor-intensive. Prior work has addressed this issue by installing multiple cameras on a shadowless lamp and arranging them to fully surround the surgical area. This setup increases the chances of some cameras capturing an unobstructed view. However, manual image alignment is needed in post-processing since camera configurations change every time surgeons move the lamp for optimal lighting. This paper aims to fully automate this alignment task. The proposed method identifies frames in which the lighting system moves, realigns them, and selects the camera with the least occlusion to generate a video that consistently presents the surgical field from a fixed perspective. A user study involving surgeons demonstrated that videos generated by our method were superior to those produced by conventional methods in terms of the ease of confirming the surgical area and the comfort during video viewing. Additionally, our approach showed improvements in video quality over existing techniques. Furthermore, we implemented several synthesis options for the proposed view-synthesis method and conducted a user study to assess surgeons' preferences for each option.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Vit Kratky, Robert Penicka, Parakh M. Gupta, Ondrej Prochazka, Martin Saska
Robotics
ๆฅ็ๆ่ฆ
This paper presents an approach to mutual collision avoidance based on Nonlinear Model Predictive Control (NMPC) with time-dependent Reciprocal Velocity Constraints (RVCs). Unlike most existing methods, the proposed approach relies solely on observable information about other robots, eliminating the necessity of excessive communication use. The computationally efficient algorithm for computing RVCs, together with the direct integration of these constraints into NMPC problem formulation on a controller level, allows the whole pipeline to run at 100 Hz. This high processing rate, combined with modeled nonlinear dynamics of the controlled Uncrewed Aerial Vehicles (UAVs), is a key feature that facilitates the use of the proposed approach for an agile UAV flight. The proposed approach was evaluated through extensive simulations emulating real-world conditions in scenarios involving up to 10 UAVs and velocities of up to 25 m/s, and in real-world experiments with accelerations up to 30 m/s$^2$. Comparison with state of the art shows 31% improvement in terms of flight time reduction in challenging scenarios, while maintaining a collision-free navigation in all trials.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Yuchi Zhang, Churui Sun, Shiqi Liang, Diyuan Liu, Chao Ji et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent end-to-end robotic manipulation research increasingly adopts architectures inspired by large language models to enable robust manipulation. However, a critical challenge arises from severe distribution shifts between robotic action data, primarily due to substantial numerical variations in action commands across diverse robotic platforms and tasks, hindering the effective transfer of pretrained knowledge. To address this limitation, we propose a semantically grounded linguistic representation to normalize actions for efficient pretraining. Unlike conventional discretized action representations that are sensitive to numerical scales, the motion representation specifically disregards numeric scale effects, emphasizing directionality instead. This abstraction mitigates distribution shifts, yielding a more generalizable pretraining representation. Moreover, using the motion representation narrows the feature distance between action tokens and standard vocabulary tokens, mitigating modality gaps. Multi-task experiments on two benchmarks demonstrate that the proposed method significantly improves generalization performance and transferability in robotic manipulation tasks.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Nils Gehrke, David Brecht, Dominik Kulmer, Dheer Patel, Frank Diermeyer
ๆฅ็ๆ่ฆ
Simulation offers advantages throughout the development process of automated driving functions, both in research and product development. Common open-source simulators like CARLA are extensively used in training, evaluation, and software-in-the-loop testing of new automated driving algorithms. However, the CARLA simulator lacks an evaluation where research and automated driving vehicles are simulated with their entire sensor and actuation stack in real time. The goal of this work is therefore to create a simulation framework for testing the automation software on its dedicated hardware and identifying its limits. Achieving this goal would greatly benefit the open-source development workflow of automated driving functions, designating CARLA as a consistent evaluation tool along the entire development process. To achieve this goal, in a first step, requirements are derived, and a simulation architecture is specified and implemented. Based on the formulated requirements, the proposed vEDGAR software is evaluated, resulting in a final conclusion on the applicability of CARLA for HiL testing of automated vehicles. The tool is available open source: Modified CARLA fork: https://github.com/TUMFTM/carla, vEDGAR Framework: https://github.com/TUMFTM/vEDGAR
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Nadezhda Kushina, Ko Watanabe, Aarthi Kannan, Ashita Ashok, Andreas Dengel et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Social robots must adjust to human proxemic norms to ensure user comfort and engagement. While prior research demonstrates that eye-tracking features reliably estimate comfort in human-human interactions, their applicability to interactions with humanoid robots remains unexplored. In this study, we investigate user comfort with the robot "Ameca" across four experimentally controlled distances (0.5 m to 2.0 m) using mobile eye-tracking and subjective reporting (N=19). We evaluate multiple machine learning and deep learning models to estimate comfort based on gaze features. Contrary to previous human-human studies where Transformer models excelled, a Decision Tree classifier achieved the highest performance (F1-score = 0.73), with minimum pupil diameter identified as the most critical predictor. These findings suggest that physiological comfort thresholds in human-robot interaction differ from human-human dynamics and can be effectively modeled using interpretable logic.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Yixiang Lin, Tiancheng Yang, Jonathan Eden, Ying Tan
Robotics
ๆฅ็ๆ่ฆ
Understanding how humans respond to uncertainty is critical for designing safe and effective physical human-robot interaction (pHRI), as physically working with robots introduces multiple sources of uncertainty, including trust, comfort, and perceived safety. Conventional pHRI control frameworks typically build on optimal control theory, which assumes that human actions minimize a cost function; however, human behavior under uncertainty often departs from such optimal patterns. To address this gap, additional understanding of human behavior under uncertainty is needed. This pilot study implemented a physically coupled target-reaching task in which the robot delivered assistance or disturbances with systematically varied probabilities (10\% to 90\%). Analysis of participants' force inputs and decision-making strategies revealed two distinct behavioral clusters: a "trade-off" group that modulated their physical responses according to disturbance likelihood, and an "always-compensate" group characterized by strong risk aversion irrespective of probability. These findings provide empirical evidence that human decision-making in pHRI is highly individualized and that the perception of probability can differ to its true value. Accordingly, the study highlights the need for more interpretable behavioral models, such as cumulative prospect theory (CPT), to more accurately capture these behaviors and inform the design of future adaptive robot controllers.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Po-An Shih, Shao-Hua Wang, Yung-Che Li, Chia-Heng Tu, Chih-Han Chang
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Designing autonomous driving systems requires efficient exploration of large hardware/software configuration spaces under diverse environmental conditions, e.g., with varying traffic, weather, and road layouts. Traditional design space exploration (DSE) approaches struggle with multi-modal execution outputs and complex performance trade-offs, and often require human involvement to assess correctness based on execution outputs. This paper presents a multi-agent, large language model (LLM)-based DSE framework, which integrates multi-modal reasoning with 3D simulation and profiling tools to automate the interpretation of execution outputs and guide the exploration of system designs. Specialized LLM agents are leveraged to handle user input interpretation, design point generation, execution orchestration, and analysis of both visual and textual execution outputs, which enables identification of potential bottlenecks without human intervention. A prototype implementation is developed and evaluated on a robotaxi case study (an SAE Level 4 autonomous driving application). Compared with a genetic algorithm baseline, the proposed framework identifies more Pareto-optimal, cost-efficient solutions with reduced navigation time under the same exploration budget. Experimental results also demonstrate the efficiency of the adoption of the LLM-based approach for DSE. We believe that this framework paves the way to the design automation of autonomous driving systems.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Antonio Terpin, Raffaello D'Andrea
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Many high-performance human activities are executed with little or no external feedback: think of a figure skater landing a triple jump, a pitcher throwing a curveball for a strike, or a barista pouring latte art. To study the process of skill acquisition under fully controlled conditions, we bypass human subjects. Instead, we directly interface a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel to maximize or minimize drag. This setup has several desirable properties. First, it is a physical system, with the rich interactions and complex dynamics that only the physical world has: the flow is highly chaotic and extremely difficult, if not impossible, to model or simulate accurately. Second, the objective -- drag minimization or maximization -- is easy to state and can be captured directly in the reward, yet good strategies are not obvious beforehand. Third, decades-old experimental studies provide recipes for simple, high-performance open-loop policies. Finally, the setup is inexpensive and far easier to reproduce than human studies. In our experiments we find that high-dimensional flow feedback lets the agent discover high-performance drag-control strategies with only minutes of real-world interaction. When we later replay the same action sequences without any feedback, we obtain almost identical performance. This shows that feedback, and in particular flow feedback, is not needed to execute the learned policy. Surprisingly, without flow feedback during training the agent fails to discover any well-performing policy in drag maximization, but still succeeds in drag minimization, albeit more slowly and less reliably. Our studies show that learning a high-performance skill can require richer information than executing it, and learning conditions can be kind or wicked depending solely on the goal, not on dynamics or policy complexity.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nico Leuze, Maximilian Hoh, Samed Doฤan, Nicolas R. -Peรฑa, Alfred Schoettl
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Accurately recovering 6D poses in densely packed industrial bin-picking environments remain a serious challenge, owing to occlusions, reflections, and textureless parts. We introduce a holistic depth-only 6D pose estimation approach that fuses multi-view depth maps into either a fine-grained 3D point cloud in its vanilla version, or a sparse Truncated Signed Distance Field (TSDF). At the core of our framework lies a staged heatmap mechanism that yields scene-adaptive attention priors across different resolutions, steering computation toward foreground regions, thus keeping memory requirements at high resolutions feasible. Along, we propose a density-aware sparse transformer block that dynamically attends to (self-) occlusions and the non-uniform distribution of 3D data. While sparse 3D approaches has proven effective for long-range perception, its potential in close-range robotic applications remains underexplored. Our framework operates fully sparse, enabling high-resolution volumetric representations to capture fine geometric details crucial for accurate pose estimation in clutter. Our method processes the entire scene integrally, predicting the 6D pose via a novel per-voxel voting strategy, allowing simultaneous pose predictions for an arbitrary number of target objects. We validate our method on the recently published IPD and MV-YCB multi-view datasets, demonstrating competitive performance in heavily cluttered industrial and household bin picking scenarios.
๐
2025-12-09
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mingwei Li, Xiaoyuan Zhang, Chengwei Yang, Zilong Zheng, Yaodong Yang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Model-based planning in robotic domains is fundamentally challenged by the hybrid nature of physical dynamics, where continuous motion is punctuated by discrete events such as contacts and impacts. Conventional latent world models typically employ monolithic neural networks that enforce global continuity, inevitably over-smoothing the distinct dynamic modes (e.g., sticking vs. sliding, flight vs. stance). For a planner, this smoothing results in catastrophic compounding errors during long-horizon lookaheads, rendering the search process unreliable at physical boundaries. To address this, we introduce the Prismatic World Model (PRISM-WM), a structured architecture designed to decompose complex hybrid dynamics into composable primitives. PRISM-WM leverages a context-aware Mixture-of-Experts (MoE) framework where a gating mechanism implicitly identifies the current physical mode, and specialized experts predict the associated transition dynamics. We further introduce a latent orthogonalization objective to ensure expert diversity, effectively preventing mode collapse. By accurately modeling the sharp mode transitions in system dynamics, PRISM-WM significantly reduces rollout drift. Extensive experiments on challenging continuous control benchmarks, including high-dimensional humanoids and diverse multi-task settings, demonstrate that PRISM-WM provides a superior high-fidelity substrate for trajectory optimization algorithms (e.g., TD-MPC), proving its potential as a powerful foundational model for next-generation model-based agents.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Fan Zhang, Michael Gienger
RoboticsMultimodal
ๆฅ็ๆ่ฆ
World models have demonstrated impressive performance on robotic learning tasks. Many such tasks inherently demand multimodal reasoning; for example, filling a bottle with water will lead to visual information alone being ambiguous or incomplete, thereby requiring reasoning over the temporal evolution of audio, accounting for its underlying physical properties and pitch patterns. In this paper, we propose a generative latent flow matching model to anticipate future audio observations, enabling the system to reason about long-term consequences when integrated into a robot policy. We demonstrate the superior capabilities of our system through two manipulation tasks that require perceiving in-the-wild audio or music signals, compared to methods without future lookahead. We further emphasize that successful robot action learning for these tasks relies not merely on multi-modal input, but critically on the accurate prediction of future audio states that embody intrinsic rhythmic patterns.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, Sergey Levine
Multimodal
ๆฅ็ๆ่ฆ
Generalist robot policies, trained on large and diverse datasets, have demonstrated the ability to generalize across a wide spectrum of behaviors, enabling a single policy to act in varied real-world environments. However, they still fall short on new tasks not covered in the training data. When finetuned on limited demonstrations of a new task, these policies often overfit to the specific demonstrations--not only losing their prior abilities to solve a wide variety of generalist tasks but also failing to generalize within the new task itself. In this work, we aim to develop a method that preserves the generalization capabilities of the generalist policy during finetuning, allowing a single policy to robustly incorporate a new skill into its repertoire. Our goal is a single policy that both learns to generalize to variations of the new task and retains the broad competencies gained from pretraining. We show that this can be achieved through a simple yet effective strategy: interpolating the weights of a finetuned model with that of the pretrained model. We show, across extensive simulated and real-world experiments, that such model merging produces a single model that inherits the generalist abilities of the base model and learns to solve the new task robustly, outperforming both the pretrained and finetuned model on out-of-distribution variations of the new task. Moreover, we show that model merging enables continual acquisition of new skills in a lifelong learning setting, without sacrificing previously learned generalist abilities.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Haldun Balim, Na Li, Yilun Du
Robotics
ๆฅ็ๆ่ฆ
Offline decision-making requires synthesizing reliable behaviors from fixed datasets without further interaction, yet existing generative approaches often yield trajectories that are dynamically infeasible. We propose Model Predictive Diffuser (MPDiffuser), a compositional model-based diffusion framework consisting of: (i) a planner that generates diverse, task-aligned trajectories; (ii) a dynamics model that enforces consistency with the underlying system dynamics; and (iii) a ranker module that selects behaviors aligned with the task objectives. MPDiffuser employs an alternating diffusion sampling scheme, where planner and dynamics updates are interleaved to progressively refine trajectories for both task alignment and feasibility during the sampling process. We also provide a theoretical rationale for this procedure, showing how it balances fidelity to data priors with dynamics consistency. Empirically, the compositional design improves sample efficiency, as it leverages even low-quality data for dynamics learning and adapts seamlessly to novel dynamics. We evaluate MPDiffuser on both unconstrained (D4RL) and constrained (DSRL) offline decision-making benchmarks, demonstrating consistent gains over existing approaches. Furthermore, we present a preliminary study extending MPDiffuser to vision-based control tasks, showing its potential to scale to high-dimensional sensory inputs. Finally, we deploy our method on a real quadrupedal robot, showcasing its practicality for real-world control.
๐
2025-12-09
๐ ICRA 2025
๐ฅ Srijan Dokania, Dharini Raghavan
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
We introduce Zero-Splat TeleAssist, a zero-shot sensor-fusion pipeline that transforms commodity CCTV streams into a shared, 6-DoF world model for multilateral teleoperation. By integrating vision-language segmentation, monocular depth, weighted-PCA pose extraction, and 3D Gaussian Splatting (3DGS), TeleAssist provides every operator with real-time global positions and orientations of multiple robots without fiducials or depth sensors in an interaction-centric teleoperation setup.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hafeez Husain Cholakkal, Stefano Arrigoni, Francesco Braghin
Machine LearningRoboticsMultimodal
ๆฅ็ๆ่ฆ
Accurate extrinsic calibration of LiDAR, RADAR, and camera sensors is essential for reliable perception in autonomous vehicles. Still, it remains challenging due to factors such as mechanical vibrations and cumulative sensor drift in dynamic environments. This paper presents RLCNet, a novel end-to-end trainable deep learning framework for the simultaneous online calibration of these multimodal sensors. Validated on real-world datasets, RLCNet is designed for practical deployment and demonstrates robust performance under diverse conditions. To support real-time operation, an online calibration framework is introduced that incorporates a weighted moving average and outlier rejection, enabling dynamic adjustment of calibration parameters with reduced prediction noise and improved resilience to drift. An ablation study highlights the significance of architectural choices, while comparisons with existing methods demonstrate the superior accuracy and robustness of the proposed approach.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Ahan Basu, Ratnangshu Das, Pushpak Jagtap
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper presents a Spatiotemporal Tube (STT)-based control framework for general control-affine MIMO nonlinear pure-feedback systems with unknown dynamics to satisfy prescribed time reach-avoid-stay tasks under external disturbances. The STT is defined as a time-varying ball, whose center and radius are jointly approximated by a Physics-Informed Neural Network (PINN). The constraints governing the STT are first formulated as loss functions of the PINN, and a training algorithm is proposed to minimize the overall violation. The PINN being trained on certain collocation points, we propose a Lipschitz-based validity condition to formally verify that the learned PINN satisfies the conditions over the continuous time horizon. Building on the learned STT representation, an approximation-free closed-form controller is defined to guarantee satisfaction of the T-RAS specification. Finally, the effectiveness and scalability of the framework are validated through two case studies involving a mobile robot and an aerial vehicle navigating through cluttered environments.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Timothy Chen, Marcus Dominguez-Kuhne, Aiden Swann, Xu Liu, Mac Schwager
MultimodalMachine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Humans interpret safety not as a binary signal but as a continuous, context- and spatially-dependent notion of risk. While risk is subjective, humans form rational mental models that guide action selection in dynamic environments. This work proposes a framework for extracting implicit human risk models by introducing a novel, semantically-conditioned and spatially-varying parametrization of risk, supervised directly from safe human demonstration videos and VLM common sense. Notably, we define risk through a Bayesian formulation. The prior is furnished by a pretrained vision-language model. In order to encourage the risk estimate to be more human aligned, a likelihood function modulates the prior to produce a relative metric of risk. Specifically, the likelihood is a learned ViT that maps pretrained features, to pixel-aligned risk values. Our pipeline ingests RGB images and a query object string, producing pixel-dense risk images. These images that can then be used as value-predictors in robot planning tasks or be projected into 3D for use in conventional trajectory optimization to produce human-like motion. This learned mapping enables generalization to novel objects and contexts, and has the potential to scale to much larger training datasets. In particular, the Bayesian framework that is introduced enables fast adaptation of our model to additional observations or common sense rules. We demonstrate that our proposed framework produces contextual risk that aligns with human preferences. Additionally, we illustrate several downstream applications of the model; as a value learner for visuomotor planners or in conjunction with a classical trajectory optimization algorithm. Our results suggest that our framework is a significant step toward enabling autonomous systems to internalize human-like risk. Code and results can be found at https://riskbayesian.github.io/bayesian_risk/.
๐
2025-12-09
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tony Salloom, Dandi Zhou, Xinhai Sun
Robotics
ๆฅ็ๆ่ฆ
Accurate three-dimensional perception is essential for modern industrial robotic systems that perform manipulation, inspection, and navigation tasks. RGB-D and stereo vision sensors are widely used for this purpose, but the depth maps they produce are often noisy, incomplete, or biased due to sensor limitations and environmental conditions. Depth completion methods aim to generate dense, reliable depth maps from RGB images and sparse depth input. However, a key limitation in current depth completion pipelines is the unrealistic generation of sparse depth: sparse pixels are typically selected uniformly at random from dense ground-truth depth, ignoring the fact that real sensors exhibit geometry-dependent and spatially nonuniform reliability. In this work, we propose a normal-guided sparse depth sampling strategy that leverages PCA-based surface normal estimation on the RGB-D point cloud to compute a per-pixel depth reliability measure. The sparse depth samples are then drawn according to this reliability distribution. We integrate this sampling method with the Marigold-DC diffusion-based depth completion model and evaluate it on NYU Depth v2 using the standard metrics. Experiments show that our geometry-aware sparse depth improves accuracy, reduces artifacts near edges and discontinuities, and produces more realistic training conditions that better reflect real sensor behavior.
๐
2025-12-09
๐ ICRA 2026
๐ฅ Duo Zhang, Junshan Huang, Jingjin Yu
ๆฅ็ๆ่ฆ
We propose Synchronous Dual-Arm Rearrangement Planner (SDAR), a task and motion planning (TAMP) framework for tabletop rearrangement, where two robot arms equipped with 2-finger grippers must work together in close proximity to rearrange objects whose start and goal configurations are strongly entangled. To tackle such challenges, SDAR tightly knit together its dependency-driven task planner (SDAR-T) and synchronous dual-arm motion planner (SDAR-M), to intelligently sift through a large number of possible task and motion plans. Specifically, SDAR-T applies a simple yet effective strategy to decompose the global object dependency graph induced by the rearrangement task, to produce more optimal dual-arm task plans than solutions derived from optimal task plans for a single arm. Leveraging state-of-the-art GPU SIMD-based motion planning tools, SDAR-M employs a layered motion planning strategy to sift through many task plans for the best synchronous dual-arm motion plan while ensuring high levels of success rate. Comprehensive evaluation demonstrates that SDAR delivers a 100% success rate in solving complex, non-monotone, long-horizon tabletop rearrangement tasks with solution quality far exceeding the previous state-of-the-art. Experiments on two UR-5e arms further confirm SDAR directly and reliably transfers to robot hardware.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Wenjiang Xu, Cindy Wang, Rui Fang, Mingkang Zhang, Lusong Li et al.
Robotics
ๆฅ็ๆ่ฆ
World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to maintain consistency in long-horizon physical constraints. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which utilizes VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Website at https://embodied-tree-of-thoughts.github.io .
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Haoxin Zhang, Shuaixin Li, Xiaozhou Zhu, Hongbo Chen, Wen Yao
ๆฅ็ๆ่ฆ
In this paper, we present a user-friendly LiDAR-camera calibration toolkit that is compatible with various LiDAR and camera sensors and requires only a single pair of laser points and a camera image in targetless environments. Our approach eliminates the need for an initial transform and remains robust even with large positional and rotational LiDAR-camera extrinsic parameters. We employ the Gluestick pipeline to establish 2D-3D point and line feature correspondences for a robust and automatic initial guess. To enhance accuracy, we quantitatively analyze the impact of feature distribution on calibration results and adaptively weight the cost of each feature based on these metrics. As a result, extrinsic parameters are optimized by filtering out the adverse effects of inferior features. We validated our method through extensive experiments across various LiDAR-camera sensors in both indoor and outdoor settings. The results demonstrate that our method provides superior robustness and accuracy compared to SOTA techniques. Our code is open-sourced on GitHub to benefit the community.
๐
2025-12-09
๐ ArXiv Preprint (Robotics)
๐ฅ Haoran Wang, Zhuohang Chen, Guang Li, Bo Ma, Chuanghuang Li
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The future of UAV interaction systems is evolving from engineer-driven to user-driven, aiming to replace traditional predefined Human-UAV Interaction designs. This shift focuses on enabling more personalized task planning and design, thereby achieving a higher quality of interaction experience and greater flexibility, which can be used in many fileds, such as agriculture, aerial photography, logistics, and environmental monitoring. However, due to the lack of a common language between users and the UAVs, such interactions are often difficult to be achieved. The developments of Large Language Models possess the ability to understand nature languages and Robots' (UAVs') behaviors, marking the possibility of personalized Human-UAV Interaction. Recently, some HUI frameworks based on LLMs have been proposed, but they commonly suffer from difficulties in mixed task planning and execution, leading to low adaptability in complex scenarios. In this paper, we propose a novel dual-agent HUI framework. This framework constructs two independent LLM agents (a task planning agent, and an execution agent) and applies different Prompt Engineering to separately handle the understanding, planning, and execution of tasks. To verify the effectiveness and performance of the framework, we have built a task database covering four typical application scenarios of UAVs and quantified the performance of the HUI framework using three independent metrics. Meanwhile different LLM models are selected to control the UAVs with compared performance. Our user study experimental results demonstrate that the framework improves the smoothness of HUI and the flexibility of task execution in the tasks scenario we set up, effectively meeting users' personalized needs.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Raunak Jain, Mudita Khurana
ๆฅ็ๆ่ฆ
LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual. We argue this complementarity gap reflects a fundamental mismatch: current agents are trained as answer engines, not as partners in the collaborative sensemaking through which experts actually make decisions. Sensemaking (the ability to co-construct causal explanations, surface uncertainties, and adapt goals) is the key capability that current training pipelines do not explicitly develop or evaluate. We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models, and evaluation centred on trust and complementarity. Taken together, these directions shift MAS research from building oracle-like answer engines to cultivating AI teammates that co-reason with their human partners over the causal structure of shared decisions, advancing the design of effective human-AI teams.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenzhen Dong, Jieming Yu, Yiming Huang, Hongqiu Wang, Lei Zhu et al.
ๆฅ็ๆ่ฆ
The recent SAM 3 and SAM 3D have introduced significant advancements over the predecessor, SAM 2, particularly with the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3D's depth reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while the zero-shot evaluations of SAM 3D on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stephane Collot, Colin Fraser, Justin Zhao, William F. Shen, Timon Willi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Rigorous evaluation of large language models (LLMs) relies on comparing models by the prevalence of desirable or undesirable behaviors, such as task pass rates or policy violations. These prevalence estimates are produced by a classifier, either an LLM-as-a-judge or human annotators, making the choice of classifier central to trustworthy evaluation. Common metrics used for this choice, such as Accuracy, Precision, and F1, are sensitive to class imbalance and to arbitrary choices of positive class, and can favor judges that distort prevalence estimates. We show that Youden's $J$ statistic is theoretically aligned with choosing the best judge to compare models, and that Balanced Accuracy is an equivalent linear transformation of $J$. Through both analytical arguments and empirical examples and simulations, we demonstrate how selecting judges using Balanced Accuracy leads to better, more robust classifier selection.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gรผl Varol, Sarah Ebling et al.
Multimodal
ๆฅ็ๆ่ฆ
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
๐
2025-12-08
๐ In proceedings of LLM-Finance 2025 : The 2nd IEEE
๐ฅ Eliot Brenner, Dominic Seyler, Manjunath Hegde, Andrei Simion, Koustuv Dasgupta et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite advances in generative large language models (LLMs), practical application of specialized conversational AI agents remains constrained by computation costs, latency requirements, and the need for precise domain-specific relevance measures. While existing embedding models address the first two constraints, they underperform on information retrieval in specialized domains like finance. This paper introduces a scalable pipeline that trains specialized models from an unlabeled corpus using a general purpose retrieval embedding model as foundation. Our method yields an average of 27.7% improvement in MRR$\texttt{@}$5, 44.6% improvement in mean DCG$\texttt{@}$5 across 14 financial filing types measured over 21,800 query-document pairs, and improved NDCG on 3 of 4 document classes in FinanceBench. We adapt retrieval embeddings (bi-encoder) for RAG, not LLM generators, using LLM-judged relevance to distill domain knowledge into a compact retriever. There are prior works which pair synthetically generated queries with real passages to directly fine-tune the retrieval model. Our pipeline differs from these by introducing interaction between student and teacher models that interleaves retrieval-based mining of hard positive/negative examples from the unlabeled corpus with iterative retraining of the student model's weights using these examples. Each retrieval iteration uses the refined student model to mine the corpus for progressively harder training examples for the subsequent training iteration. The methodology provides a cost-effective solution to bridging the gap between general-purpose models and specialized domains without requiring labor-intensive human annotation.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Vala Vakilian, Zimeng Wang, Ankit Singh Rawat, Christos Thrampoulidis
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate the short-context dominance hypothesis: that for most sequences, a small local prefix suffices to predict their next tokens. Using large language models as statistical oracles, we measure the minimum context length (MCL) needed to reproduce accurate full-context predictions across datasets with sequences of varying lengths. For sequences with 1-7k tokens from long-context documents, we consistently find that 75-80% require only the last 96 tokens at most. Given the dominance of short-context tokens, we then ask whether it is possible to detect challenging long-context sequences for which a short local prefix does not suffice for prediction. We introduce a practical proxy to MCL, called Distributionally Aware MCL (DaMCL), that does not require knowledge of the actual next-token and is compatible with sampling strategies beyond greedy decoding. Our experiments validate that simple thresholding of the metric defining DaMCL achieves high performance in detecting long vs. short context sequences. Finally, to counter the bias that short-context dominance induces in LLM output distributions, we develop an intuitive decoding algorithm that leverages our detector to identify and boost tokens that are long-range-relevant. Across Q&A tasks and model architectures, we confirm that mitigating the bias improves performance.
๐
2025-12-08
๐ ArXiv Preprint (cs.SD)
๐ฅ Mahta Fetrat, Donya Navabi, Zahra Dehghanian, Morteza Abolghasemi, Hamid R. Rabiee
ๆฅ็ๆ่ฆ
Lightweight, real-time text-to-speech systems are crucial for accessibility. However, the most efficient TTS models often rely on lightweight phonemizers that struggle with context-dependent challenges. In contrast, more advanced phonemizers with a deeper linguistic understanding typically incur high computational costs, which prevents real-time performance.
This paper examines the trade-off between phonemization quality and inference speed in G2P-aided TTS systems, introducing a practical framework to bridge this gap. We propose lightweight strategies for context-aware phonemization and a service-oriented TTS architecture that executes these modules as independent services. This design decouples heavy context-aware components from the core TTS engine, effectively breaking the latency barrier and enabling real-time use of high-quality phonemization models. Experimental results confirm that the proposed system improves pronunciation soundness and linguistic accuracy while maintaining real-time responsiveness, making it well-suited for offline and end-device TTS applications.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Raunak Jain, Mudita Khurana
ๆฅ็ๆ่ฆ
LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual. We argue this complementarity gap reflects a fundamental mismatch: current agents are trained as answer engines, not as partners in the collaborative sensemaking through which experts actually make decisions. Sensemaking (the ability to co-construct causal explanations, surface uncertainties, and adapt goals) is the key capability that current training pipelines do not explicitly develop or evaluate. We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models, and evaluation centred on trust and complementarity. These directions can advance MAS research toward agents that think with their human partners rather than for them.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
To address the trade-off between robustness and performance for robust VLM, we observe that function words could incur vulnerability of VLMs against cross-modal adversarial attacks, and propose Function-word De-Attention (FDA) accordingly to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, as well as in-depth ablation studies and analysis. Code will be made publicly at https://github.com/michaeltian108/FDA.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stephane Collot, Colin Fraser, Justin Zhao, William F. Shen, Timon Willi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Rigorous evaluation of large language models (LLMs) relies on comparing models by the prevalence of desirable or undesirable behaviors, such as task pass rates or policy violations. These prevalence estimates are produced by a classifier, either an LLM-as-a-judge or human annotators, making the choice of classifier central to trustworthy evaluation. Common metrics used for this choice, such as Accuracy, Precision, and F1, are sensitive to class imbalance and to arbitrary choices of positive class, and can favor judges that distort prevalence estimates. We show that Youden's $J$ statistic is theoretically aligned with choosing the best judge to compare models, and that Balanced Accuracy is an equivalent linear transformation of $J$. Through both analytical arguments and empirical examples and simulations, we demonstrate how selecting judges using Balanced Accuracy leads to better, more robust classifier selection.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kwanyoung Park, Seohong Park, Youngwoon Lee, Sergey Levine
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we study whether model-based reinforcement learning (RL), in particular model-based value expansion, can provide a scalable recipe for tackling complex, long-horizon tasks in offline RL. Model-based value expansion fits an on-policy value function using length-n imaginary rollouts generated by the current policy and a learned dynamics model. While larger n reduces bias in value bootstrapping, it amplifies accumulated model errors over long horizons, degrading future predictions. We address this trade-off with an \emph{action-chunk} model that predicts a future state from a sequence of actions (an "action chunk") instead of a single action, which reduces compounding errors. In addition, instead of directly training a policy to maximize rewards, we employ rejection sampling from an expressive behavioral action-chunk policy, which prevents model exploitation from out-of-distribution actions. We call this recipe \textbf{Model-Based RL with Action Chunks (MAC)}. Through experiments on highly challenging tasks with large-scale datasets of up to 100M transitions, we show that MAC achieves the best performance among offline model-based RL algorithms, especially on challenging long-horizon tasks.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) can be dishonest when reporting on their actions and beliefs -- for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions.
In this work we propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported *confession*. A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward. As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification this empirical assumption, especially in the case of egregious model misbehavior.
To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user.
๐
2025-12-08
๐ Presented at IEEE MIT URTC 2025
๐ฅ Jonathan Kogan, Hayden Jananthan, Jeremy Kepner
Machine Learning
ๆฅ็ๆ่ฆ
We study the expressivity of one-dimensional (1D) ReLU deep neural networks through the lens of their linear regions. For randomly initialized, fully connected 1D ReLU networks (He scaling with nonzero bias) in the infinite-width limit, we prove that the expected number of linear regions grows as $\sum_{i = 1}^L n_i + \mathop{o}\left(\sum_{i = 1}^L{n_i}\right) + 1$, where $n_\ell$ denotes the number of neurons in the $\ell$-th hidden layer. We also propose a function-adaptive notion of sparsity that compares the expected regions used by the network to the minimal number needed to approximate a target within a fixed tolerance.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jaron Cohen, Alexander G. Hasson, Sara Tanovic
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Since the advent of machine learning, interpretability has remained a persistent challenge, becoming increasingly urgent as generative models support high-stakes applications in drug and material discovery. Recent advances in large language model (LLM) architectures have yielded chemistry language models (CLMs) with impressive capabilities in molecular property prediction and molecular generation. However, how these models internally represent chemical knowledge remains poorly understood. In this work, we extend sparse autoencoder techniques to uncover and examine interpretable features within CLMs. Applying our methodology to the Foundation Models for Materials (FM4M) SMI-TED chemistry foundation model, we extract semantically meaningful latent features and analyse their activation patterns across diverse molecular datasets. Our findings reveal that these models encode a rich landscape of chemical concepts. We identify correlations between specific latent features and distinct domains of chemical knowledge, including structural motifs, physicochemical properties, and pharmacological drug classes. Our approach provides a generalisable framework for uncovering latent knowledge in chemistry-focused AI systems. This work has implications for both foundational understanding and practical deployment; with the potential to accelerate computational chemistry research.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pingchuan Ma, Chengshuai Zhao, Bohan Jiang, Saketh Vishnubhatla, Ujun Jeong et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Crisis classification in social media aims to extract actionable disaster-related information from multimodal posts, which is a crucial task for enhancing situational awareness and facilitating timely emergency responses. However, the wide variation in crisis types makes achieving generalizable performance across unseen disasters a persistent challenge. Existing approaches primarily leverage deep learning to fuse textual and visual cues for crisis classification, achieving numerically plausible results under in-domain settings. However, they exhibit poor generalization across unseen crisis types because they 1. do not disentangle spurious and causal features, resulting in performance degradation under domain shift, and 2. fail to align heterogeneous modality representations within a shared space, which hinders the direct adaptation of established single-modality domain generalization (DG) techniques to the multimodal setting. To address these issues, we introduce a causality-guided multimodal domain generalization (MMDG) framework that combines adversarial disentanglement with unified representation learning for crisis classification. The adversarial objective encourages the model to disentangle and focus on domain-invariant causal features, leading to more generalizable classifications grounded in stable causal mechanisms. The unified representation aligns features from different modalities within a shared latent space, enabling single-modality DG strategies to be seamlessly extended to multimodal learning. Experiments on the different datasets demonstrate that our approach achieves the best performance in unseen disaster scenarios.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaobin Shen, George H. Chen
ๆฅ็ๆ่ฆ
We propose an interpretable deep competing risks model called the Deep Kernel Aalen-Johansen (DKAJ) estimator, which generalizes the classical Aalen-Johansen nonparametric estimate of cumulative incidence functions (CIFs). Each data point (e.g., patient) is represented as a weighted combination of clusters. If a data point has nonzero weight only for one cluster, then its predicted CIFs correspond to those of the classical Aalen-Johansen estimator restricted to data points from that cluster. These weights come from an automatically learned kernel function that measures how similar any two data points are. On four standard competing risks datasets, we show that DKAJ is competitive with state-of-the-art baselines while being able to provide visualizations to assist model interpretation.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ashkan Shahbazi, Ping He, Ali Abbasi, Yikun Bai, Xinran Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Scaling attention faces a critical bottleneck: the $\mathcal{O}(n^2)$ quadratic computational cost of softmax attention, which limits its application in long-sequence domains. While linear attention mechanisms reduce this cost to $\mathcal{O}(n)$, they typically rely on fixed random feature maps, such as random Fourier features or hand-crafted functions. This reliance on static, data-agnostic kernels creates a fundamental trade-off, forcing practitioners to sacrifice significant model accuracy for computational efficiency. We introduce \textsc{LUNA}, a kernelized linear attention mechanism that eliminates this trade-off, retaining linear cost while matching and surpassing the accuracy of quadratic attention. \textsc{LUNA} is built on the key insight that the kernel feature map itself should be learned rather than fixed a priori. By parameterizing the kernel, \textsc{LUNA} learns a feature basis tailored to the specific data and task, overcoming the expressive limitations of fixed-feature methods. \textsc{Luna} implements this with a learnable feature map that induces a positive-definite kernel and admits a streaming form, yielding linear time and memory scaling in the sequence length. Empirical evaluations validate our approach across diverse settings. On the Long Range Arena (LRA), \textsc{Luna} achieves state-of-the-art average accuracy among efficient Transformers under compute parity, using the same parameter count, training steps, and approximate FLOPs. \textsc{Luna} also excels at post-hoc conversion: replacing softmax in fine-tuned BERT and ViT-B/16 checkpoints and briefly fine-tuning recovers most of the original performance, substantially outperforming fixed linearizations.
๐
2025-12-08
๐ ArXiv Preprint (cs.SI)
๐ฅ Honglian Wang, Haoyun Chen, Aristides Gionis
ๆฅ็ๆ่ฆ
Link-analysis algorithms, such as PageRank, are instrumental in understanding the structural dynamics of networks by evaluating the importance of individual vertices based on their connectivity. Recently, with the rising importance of responsible AI, the question of fairness in link-analysis algorithms has gained traction. In this paper, we present a new approach for incorporating group fairness into the PageRank algorithm by reweighting the transition probabilities in the underlying transition matrix. We formulate the problem of achieving fair PageRank by seeking to minimize the fairness loss, which is the difference between the original group-wise PageRank distribution and a target PageRank distribution. We further define a group-adapted fairness notion, which accounts for group homophily by considering random walks with group-biased restart for each group. Since the fairness loss is non-convex, we propose an efficient projected gradient-descent method for computing locally-optimal edge weights. Unlike earlier approaches, we do not recommend adding new edges to the network, nor do we adjust the restart vector. Instead, we keep the topology of the underlying network unchanged and only modify the relative importance of existing edges. We empirically compare our approach with state-of-the-art baselines and demonstrate the efficacy of our method, where very small changes in the transition matrix lead to significant improvement in the fairness of the PageRank algorithm.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Pedro Santana
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Embodied agents, such as robots and virtual characters, must continuously select actions to execute tasks effectively, solving complex sequential decision-making problems. Given the difficulty of designing such controllers manually, learning-based approaches have emerged as promising alternatives, most notably Deep Reinforcement Learning (DRL) and Deep Imitation Learning (DIL). DRL leverages reward signals to optimize behavior, while DIL uses expert demonstrations to guide learning. This document introduces DRL and DIL in the context of embodied agents, adopting a concise, depth-first approach to the literature. It is self-contained, presenting all necessary mathematical and machine learning concepts as they are needed. It is not intended as a survey of the field; rather, it focuses on a small set of foundational algorithms and techniques, prioritizing in-depth understanding over broad coverage. The material ranges from Markov Decision Processes to REINFORCE and Proximal Policy Optimization (PPO) for DRL, and from Behavioral Cloning to Dataset Aggregation (DAgger) and Generative Adversarial Imitation Learning (GAIL) for DIL.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tianxingjian Ding, Yuanhao Zou, Chen Chen, Mubarak Shah, Yu Tian
Natural Language Processing
ๆฅ็ๆ่ฆ
Clinical decision-making in oncology requires predicting dynamic disease evolution, a task current static AI predictors cannot perform. While world models (WMs) offer a paradigm for generative prediction, existing medical applications remain limited. Existing methods often rely on stochastic diffusion models, focusing on visual reconstruction rather than causal, physiological transitions. Furthermore, in medical domain, models like MeWM typically ignore patient-specific temporal and clinical contexts and lack a feedback mechanism to link predictions to treatment decisions. To address these gaps, we introduce CLARITY, a medical world model that forecasts disease evolution directly within a structured latent space. It explicitly integrates time intervals (temporal context) and patient-specific data (clinical context) to model treatment-conditioned progression as a smooth, interpretable trajectory, and thus generate physiologically faithful, individualized treatment plans. Finally, CLARITY introduces a novel prediction-to-decision framework, translating latent rollouts into transparent, actionable recommendations. CLARITY demonstrates state-of-the-art performance in treatment planning. On the MU-Glioma-Post dataset, our approach outperforms recent MeWM by 12\%, and significantly surpasses all other medical-specific large language models.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Keith Huffman, Jianping Zhang, Nathaniel Huber-Fliflet, Fusheng Wei, Peter Gronvall
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
In legal matters, text classification models are most often used to filter through large datasets in search of documents that meet certain pre-selected criteria like relevance to a certain subject matter, such as legally privileged communications and attorney-directed documents. In this context, large language models have demonstrated strong performance. This paper presents an empirical study investigating the role of randomness in LLM-based classification for attorney-client privileged document detection, focusing on four key dimensions: (1) the effectiveness of LLMs in identifying legally privileged documents, (2) the influence of randomness control parameters on classification outputs, (3) their impact on overall classification performance, and (4) a methodology for leveraging randomness to enhance accuracy. Experimental results showed that LLMs can identify privileged documents effectively, randomness control parameters have minimal impact on classification performance, and importantly, our developed methodology for leveraging randomness can have a significant impact on improving accuracy. Notably, this methodology that leverages randomness could also enhance a corporation's confidence in an LLM's output when incorporated into its sanctions-compliance processes. As organizations increasingly rely on LLMs to augment compliance workflows, reducing output variability helps build internal and regulatory confidence in LLM-derived sanctions-screening decisions.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Qiang Mao, Fusheng Wei, Robert Neary, Charles Wang, Han Qin et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The rapid increase in digital image creation and retention presents substantial challenges during legal discovery, digital archive, and content management. Corporations and legal teams must organize, analyze, and extract meaningful insights from large image collections under strict time pressures, making manual review impractical and costly. These demands have intensified interest in automated methods that can efficiently organize and describe large-scale image datasets. This paper presents a systematic investigation of automated cluster description generation through the integration of image clustering, image captioning, and large language models (LLMs). We apply K-means clustering to group images into 20 visually coherent clusters and generate base captions using the Azure AI Vision API. We then evaluate three critical dimensions of the cluster description process: (1) image sampling strategies, comparing random, centroid-based, stratified, hybrid, and density-based sampling against using all cluster images; (2) prompting techniques, contrasting standard prompting with chain-of-thought prompting; and (3) description generation methods, comparing LLM-based generation with traditional TF-IDF and template-based approaches. We assess description quality using semantic similarity and coverage metrics. Results show that strategic sampling with 20 images per cluster performs comparably to exhaustive inclusion while significantly reducing computational cost, with only stratified sampling showing modest degradation. LLM-based methods consistently outperform TF-IDF baselines, and standard prompts outperform chain-of-thought prompts for this task. These findings provide practical guidance for deploying scalable, accurate cluster description systems that support high-volume workflows in legal discovery and other domains requiring automated organization of large image collections.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Qiang Mao, Han Qin, Robert Neary, Charles Wang, Fusheng Wei et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Increasingly, attorneys are interested in moving beyond keyword and semantic search to improve the efficiency of how they find key information during a document review task. Large language models (LLMs) are now seen as tools that attorneys can use to ask natural language questions of their data during document review to receive accurate and concise answers. This study evaluates retrieval strategies within Microsoft Azure's Retrieval-Augmented Generation (RAG) framework to identify effective approaches for Early Case Assessment (ECA) in eDiscovery. During ECA, legal teams analyze data at the outset of a matter to gain a general understanding of the data and attempt to determine key facts and risks before beginning full-scale review. In this paper, we compare the performance of Azure AI Search's keyword, semantic, vector, hybrid, and hybrid-semantic retrieval methods. We then present the accuracy, relevance, and consistency of each method's AI-generated responses. Legal practitioners can use the results of this study to enhance how they select RAG configurations in the future.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jianping Zhang, Han Qin, Nathaniel Huber-Fliflet
ๆฅ็ๆ่ฆ
This paper presents a link analysis approach for identifying privileged documents by constructing a network of human entities derived from email header metadata. Entities are classified as either counsel or non-counsel based on a predefined list of known legal professionals. The core assumption is that individuals with frequent interactions with lawyers are more likely to participate in privileged communications. To quantify this likelihood, an algorithm assigns a score to each entity within the network. By utilizing both entity scores and the strength of their connections, the method enhances the identification of privileged documents. Experimental results demonstrate the algorithm's effectiveness in ranking legal entities for privileged document detection.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Pedro Santana
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Embodied agents, such as robots and virtual characters, must continuously select actions to execute tasks effectively, solving complex sequential decision-making problems. Given the difficulty of designing such controllers manually, learning-based approaches have emerged as promising alternatives, most notably Deep Reinforcement Learning (DRL) and Deep Imitation Learning (DIL). DRL leverages reward signals to optimize behavior, while DIL uses expert demonstrations to guide learning. This document introduces DRL and DIL in the context of embodied agents, adopting a concise, depth-first approach to the literature. It is self-contained, presenting all necessary mathematical and machine learning concepts as they are needed. It is not intended as a survey of the field; rather, it focuses on a small set of foundational algorithms and techniques, prioritizing in-depth understanding over broad coverage. The material ranges from Markov Decision Processes to REINFORCE and Proximal Policy Optimization (PPO) for DRL, and from Behavioral Cloning to Dataset Aggregation (DAgger) and Generative Adversarial Imitation Learning (GAIL) for DIL.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Lampis Papakostas, Aristeidis Geladaris, Athanasios Mastrogeorgiou, Jim Sharples, Gautier Hattenberger et al.
Robotics
ๆฅ็ๆ่ฆ
This paper presents a UAV swarm system designed to assist first responders in disaster scenarios like wildfires. By distributing sensors across multiple agents, the system extends flight duration and enhances data availability, reducing the risk of mission failure due to collisions. To mitigate this risk further, we introduce an autonomous navigation framework that utilizes a local Euclidean Signed Distance Field (ESDF) map for obstacle avoidance while maintaining swarm formation with minimal path deviation. Additionally, we incorporate a Traveling Salesman Problem (TSP) variant to optimize area coverage, prioritizing Points of Interest (POIs) based on preassigned values derived from environmental behavior and critical infrastructure. The proposed system is validated through simulations with varying swarm sizes, demonstrating its ability to maximize coverage while ensuring collision avoidance between UAVs and obstacles.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Mostafa Kamali Tabrizi, Mingshi Chi, Bir Bikram Dey, Yu Qing Yuan, Markus D. Solbach et al.
Computer Vision
ๆฅ็ๆ่ฆ
We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT's unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT's design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. The exploration of the utility of these to both human and machine vision is ongoing. Here, we present the design of DIJIT and evaluate aspects of its performance. We present a new method for saccadic camera movements. In this method, a direct relationship between camera orientation and motor values is developed. The resulting saccadic camera movements are close to human movements in terms of their accuracy.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Lazar Milikic, Manthan Patel, Jonas Frey
Machine LearningRoboticsMultimodal
ๆฅ็ๆ่ฆ
Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning. Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor. At deployment, the policy consumes VLD predictions, inheriting semantic goal information-"where to go"-from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches, such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation while supporting flexible goal modalities, providing an alternative and, most importantly, scalable path toward reliable, multimodal navigation policies.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Alan Papalia, Nikolas Sanderson, Haoyu Han, Heng Yang, Hanumant Singh et al.
Machine Learning
ๆฅ็ๆ่ฆ
Robotic perception often requires solving large nonlinear least-squares (NLS) problems. While sparsity has been well-exploited to scale solvers, a complementary and underexploited structure is \emph{separability} -- where some variables (e.g., visual landmarks) appear linearly in the residuals and, for any estimate of the remaining variables (e.g., poses), have a closed-form solution. Variable projection (VarPro) methods are a family of techniques that exploit this structure by analytically eliminating the linear variables and presenting a reduced problem in the remaining variables that has favorable properties. However, VarPro has seen limited use in robotic perception; a major challenge arises from gauge symmetries (e.g., cost invariance to global shifts and rotations), which are common in perception and induce specific computational challenges in standard VarPro approaches. We present a VarPro scheme designed for problems with gauge symmetries that jointly exploits separability and sparsity. Our method can be applied as a one-time preprocessing step to construct a \emph{matrix-free Schur complement operator}. This operator allows efficient evaluation of costs, gradients, and Hessian-vector products of the reduced problem and readily integrates with standard iterative NLS solvers. We provide precise conditions under which our method applies, and describe extensions when these conditions are only partially met. Across synthetic and real benchmarks in SLAM, SNL, and SfM, our approach achieves up to \textbf{2$\times$--35$\times$ faster runtimes} than state-of-the-art methods while maintaining accuracy. We release an open-source C++ implementation and all datasets from our experiments.
๐
2025-12-08
๐ ArXiv Preprint (eess.SY)
๐ฅ Nikita Vaibhav Pavle, Shrreya Rajneesh, Rakesh Kumar Sahoo, Manoranjan Sinha
Robotics
ๆฅ็ๆ่ฆ
The conventional Artificial Potential Field (APF) is fundamentally limited by the local minima issue and its inability to account for the kinematics of moving obstacles. This paper addresses the critical challenge of autonomous collision avoidance for Unmanned Aerial Vehicles (UAVs) operating in dynamic and cluttered airspace by proposing a novel Direction and Relative Velocity Weighted Artificial Potential Field (APF). In this approach, a bounded weighting function, $ฯ(ฮธ,v_{e})$, is introduced to dynamically scale the repulsive potential based on the direction and velocity of the obstacle relative to the UAV. This robust APF formulation is integrated within a Model Predictive Control (MPC) framework to generate collision-free trajectories while adhering to kinematic constraints. Simulation results demonstrate that the proposed method effectively resolves local minima and significantly enhances safety by enabling smooth, predictive avoidance maneuvers. The system ensures superior path integrity and reliable performance, confirming its viability for autonomous navigation in complex environments.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity, is arguable by cognitive scientist to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, Qixing Huang
ๆฅ็ๆ่ฆ
Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data providing precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving metrics of geometric consistency, motion coherence, and reducing view-time artifacts over competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinyuan Cao, Santosh S. Vempala
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
๐
2025-12-08
๐ ArXiv Preprint (cs.SE)
๐ฅ Hua Yang, Alejandro Velasco, Sen Fang, Bowen Xu, Denys Poshyvanyk
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models for code (LLM4Code) have greatly improved developer productivity but also raise privacy concerns due to their reliance on open-source repositories containing abundant personally identifiable information (PII). Prior work shows that commercial models can reproduce sensitive PII, yet existing studies largely treat PII as a single category and overlook the heterogeneous risks among different types. We investigate whether distinct PII types vary in their likelihood of being learned and leaked by LLM4Code, and whether this relationship is causal. Our methodology includes building a dataset with diverse PII types, fine-tuning representative models of different scales, computing training dynamics on real PII data, and formulating a structural causal model to estimate the causal effect of learnability on leakage. Results show that leakage risks differ substantially across PII types and correlate with their training dynamics: easy-to-learn instances such as IP addresses exhibit higher leakage, while harder types such as keys and passwords leak less frequently. Ambiguous types show mixed behaviors. This work provides the first causal evidence that leakage risks are type-dependent and offers guidance for developing type-aware and learnability-aware defenses for LLM4Code.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha et al.
ๆฅ็ๆ่ฆ
Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Although Prompt-based elicitation was not reliable, training-based elicitation consistently elicited full performance from the sandbagging models, using only a single correct demonstration of the evaluation task. However the performance of benign models was sometimes also raised, so relying on elicitation as a detection strategy was prone to false-positives. In the short-term, we recommend developers remove potential sandbagging using on-distribution training for elicitation. In the longer-term, further research is needed to ensure the efficacy of training-based elicitation, and develop robust methods for sandbagging detection. We open source our model organisms at https://github.com/AI-Safety-Institute/sandbagging_auditing_games and select transcripts and results at https://huggingface.co/datasets/sandbagging-games/evaluation_logs . A demo illustrating the game can be played at https://sandbagging-demo.far.ai/ .
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n)=\exp(n\,ฯ\,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project Page: https://github.com/model-architectures/GRAPE.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Raunak Jain, Mudita Khurana
ๆฅ็ๆ่ฆ
LLM-based agents are rapidly being plugged into expert decision-support, yet in messy, high-stakes settings they rarely make the team smarter: human-AI teams often underperform the best individual, experts oscillate between verification loops and over-reliance, and the promised complementarity does not materialise. We argue this is not just a matter of accuracy, but a fundamental gap in how we conceive AI assistance: expert decisions are made through collaborative cognitive processes where mental models, goals, and constraints are continually co-constructed, tested, and revised between human and AI. We propose Collaborative Causal Sensemaking (CCS) as a research agenda and organizing framework for decision-support agents: systems designed as partners in cognitive work, maintaining evolving models of how particular experts reason, helping articulate and revise goals, co-constructing and stress-testing causal hypotheses, and learning from the outcomes of joint decisions so that both human and agent improve over time. We sketch challenges around training ecologies that make collaborative thinking instrumentally valuable, representations and interaction protocols for co-authored models, and evaluation centred on trust and complementarity. These directions can reframe MAS research around agents that participate in collaborative sensemaking and act as AI teammates that think with their human partners.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sridhar Mahadevan
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce a new paradigm for building large causal models (LCMs) that exploits the enormous potential latent in today's large language models (LLMs). We describe our ongoing experiments with an implemented system called DEMOCRITUS (Decentralized Extraction of Manifold Ontologies of Causal Relations Integrating Topos Universal Slices) aimed at building, organizing, and visualizing LCMs that span disparate domains extracted from carefully targeted textual queries to LLMs. DEMOCRITUS is methodologically distinct from traditional narrow domain and hypothesis centered causal inference that builds causal models from experiments that produce numerical data. A high-quality LLM is used to propose topics, generate causal questions, and extract plausible causal statements from a diverse range of domains. The technical challenge is then to take these isolated, fragmented, potentially ambiguous and possibly conflicting causal claims, and weave them into a coherent whole, converting them into relational causal triples and embedding them into a LCM. Addressing this technical challenge required inventing new categorical machine learning methods, which we can only briefly summarize in this paper, as it is focused more on the systems side of building DEMOCRITUS. We describe the implementation pipeline for DEMOCRITUS comprising of six modules, examine its computational cost profile to determine where the current bottlenecks in scaling the system to larger models. We describe the results of using DEMOCRITUS over a wide range of domains, spanning archaeology, biology, climate change, economics, medicine and technology. We discuss the limitations of the current DEMOCRITUS system, and outline directions for extending its capabilities.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Nearchos Potamitis, Lars Klein, Akhil Arora
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot because practitioners cannot reliably assess whether a method's reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. Notably, even strategies with similar average performance can display confidence intervals up to four times wider, and the top-performing methods often incur higher and less stable costs. Such instability compromises reproducibility across runs and, consequently, the reliability of reported performance. To better understand these dynamics, we further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability. Our results highlight reproducibility as a critical dimension for reliable LLM reasoning and provide a foundation for future reasoning methods and uncertainty quantification techniques. ReasonBENCH is publicly available at https://github.com/au-clan/ReasonBench .
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi et al.
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models are vulnerable to jailbreak attacks, threatening their safe deployment in real-world applications. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models through a sequence of prompt-output interactions. Existing approaches typically rely on single turn optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate the problem as a multi-turn reinforcement learning task, directly optimizing the harmfulness of the final-turn output as the outcome reward. To mitigate sparse supervision and promote long-term attack strategies, we propose two heuristic process rewards: (1) controlling the harmfulness of intermediate outputs to prevent triggering the black-box model's rejection mechanisms, and (2) maintaining the semantic relevance of intermediate outputs to avoid drifting into irrelevant content. Experimental results on multiple benchmarks show consistently improved attack success rates across multiple models, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/RL-MTJail. Warning: This paper contains examples of harmful content.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sangha Park, Seungryong Yoo, Jisoo Mok, Sungroh Yoon
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model's visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10\%p improvement in CHAIR\_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at https://github.com/wiarae/SAVE.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aidas Aglinskas, Stefano Anzellotti
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Action recognition is also key for applications ranging from robotics to healthcare monitoring. Action information can be extracted from the body pose and movements, as well as from the background scene. However, the extent to which deep neural networks (DNNs) make use of information about the body and information about the background remains unclear. Since these two sources of information may be correlated within a training dataset, DNNs might learn to rely predominantly on one of them, without taking full advantage of the other. Unlike DNNs, humans have domain-specific brain regions selective for perceiving bodies, and regions selective for perceiving scenes. The present work tests whether humans are thus more effective at extracting information from both body and background, and whether building brain-inspired deep network architectures with separate domain-specific streams for body and scene perception endows them with more human-like performance. We first demonstrate that DNNs trained using the HAA500 dataset perform almost as accurately on versions of the stimuli that show both body and background and on versions of the stimuli from which the body was removed, but are at chance-level for versions of the stimuli from which the background was removed. Conversely, human participants (N=28) can recognize the same set of actions accurately with all three versions of the stimuli, and perform significantly better on stimuli that show only the body than on stimuli that show only the background. Finally, we implement and test a novel architecture patterned after domain specificity in the brain with separate streams to process body and background information. We show that 1) this architecture improves action recognition performance, and 2) its accuracy across different versions of the stimuli follows a pattern that matches more closely the pattern of accuracy observed in human participants.
๐
2025-12-08
๐ ArXiv Preprint (cs.ET)
๐ฅ Zhengzheng Tang
Natural Language Processing
ๆฅ็ๆ่ฆ
The 2025 Nobel Prize in Chemistry for Metal-Organic Frameworks (MOFs) and recent breakthroughs by Huanting Wang's team at Monash University establish angstrom-scale channels as promising post-silicon substrates with native integrate-and-fire (IF) dynamics. However, utilizing these stochastic, analog materials for deterministic, bit-exact AI workloads (e.g., FP8) remains a paradox. Existing neuromorphic methods often settle for approximation, failing Transformer precision standards. To traverse the gap "from stochastic ions to deterministic floats," we propose a Native Spiking Microarchitecture. Treating noisy neurons as logic primitives, we introduce a Spatial Combinational Pipeline and a Sticky-Extra Correction mechanism. Validation across all 16,129 FP8 pairs confirms 100% bit-exact alignment with PyTorch. Crucially, our architecture reduces Linear layer latency to O(log N), yielding a 17x speedup. Physical simulations further demonstrate robustness against extreme membrane leakage (beta approx 0.01), effectively immunizing the system against the stochastic nature of the hardware.
๐
2025-12-08
๐ AAAI
๐ฅ Yonggeon Lee, Jibin Hwang, Alfred Malengo Kondoro, Juhyun Song, Youngtae Noh
Natural Language Processing
ๆฅ็ๆ่ฆ
Electric vehicles (EVs) are key to sustainable mobility, yet their lithium-ion batteries (LIBs) degrade more rapidly under prolonged high states of charge (SOC). This can be mitigated by delaying full charging \ours until just before departure, which requires accurate prediction of user departure times. In this work, we propose Transformer-based real-time-to-event (TTE) model for accurate EV departure prediction. Our approach represents each day as a TTE sequence by discretizing time into grid-based tokens. Unlike previous methods primarily dependent on temporal dependency from historical patterns, our method leverages streaming contextual information to predict departures. Evaluation on a real-world study involving 93 users and passive smartphone data demonstrates that our method effectively captures irregular departure patterns within individual routines, outperforming baseline models. These results highlight the potential for practical deployment of the \ours algorithm and its contribution to sustainable transportation systems.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Anxiang Zeng, Haibo Zhang, Hailing Zhang, Kaixiang Mo, Liang Yao et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present CompassMax-V3-Thinking, a hundred-billion-scale MoE reasoning model trained with a new RL framework built on one principle: each prompt must matter. Scaling RL to this size exposes critical inefficiencies-zero-variance prompts that waste rollouts, unstable importance sampling over long horizons, advantage inversion from standard reward models, and systemic bottlenecks in rollout processing. To overcome these challenges, we introduce several unified innovations: (1) Multi-Stage Zero-Variance Elimination, which filters out non-informative prompts and stabilizes group-based policy optimization (e.g. GRPO) by removing wasted rollouts; (2) ESPO, an entropy-adaptive optimization method that balances token-level and sequence-level importance sampling to maintain stable learning dynamics; (3) a Router Replay strategy that aligns training-time MoE router decisions with inference-time behavior to mitigate train-infer discrepancies, coupled with a reward model adjustment to prevent advantage inversion; (4) a high-throughput RL system with FP8-precision rollouts, overlapped reward computation, and length-aware scheduling to eliminate performance bottlenecks. Together, these contributions form a cohesive pipeline that makes RL on hundred-billion-scale MoE models stable and efficient. The resulting model delivers strong performance across both internal and public evaluations.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Saroj Gopali, Bipin Chhetri, Deepika Giri, Sima Siami-Namini, Akbar Siami Namin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Existing data-driven approaches in modeling and predicting time series data include ARIMA (Autoregressive Integrated Moving Average), Transformer-based models, LSTM (Long Short-Term Memory) and TCN (Temporal Convolutional Network). These approaches, and in particular deep learning-based models such as LSTM and TCN, have shown great results in predicting time series data. With the advancement of leveraging pre-trained foundation models such as Large Language Models (LLMs) and more notably Google's recent foundation model for time series data, {\it TimesFM} (Time Series Foundation Model), it is of interest to investigate whether these foundation models have the capability of outperforming existing modeling approaches in analyzing and predicting time series data.
This paper investigates the performance of using LLM models for time series data prediction. We investigate the in-context learning methodology in the training of LLM models that are specific to the underlying application domain. More specifically, the paper explores training LLMs through in-context, zero-shot and few-shot learning and forecasting time series data with OpenAI {\tt o4-mini} and Gemini 2.5 Flash Lite, as well as the recent Google's Transformer-based TimesFM, a time series-specific foundation model, along with two deep learning models, namely TCN and LSTM networks. The findings indicate that TimesFM has the best overall performance with the lowest RMSE value (0.3023) and the competitive inference time (266 seconds). Furthermore, OpenAI's o4-mini also exhibits a good performance based on Zero Shot learning.
These findings highlight pre-trained time series foundation models as a promising direction for real-time forecasting, enabling accurate and scalable deployment with minimal model adaptation.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sangha Park, Eunji Kim, Yeongtak Oh, Jooyoung Choi, Sungroh Yoon
Computer Vision
ๆฅ็ๆ่ฆ
Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives-those directly tied to the prompt's alignment error-and untargeted negatives-tokens unrelated to the prompt but present in the generated image-can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at https://github.com/wiarae/NPC.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zihan Chen, Lanyu Yu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Online incivility has emerged as a widespread and persistent problem in digital communities, imposing substantial social and psychological burdens on users. Although many platforms attempt to curb incivility through moderation and automated detection, the performance of existing approaches often remains limited in both accuracy and efficiency. To address this challenge, we propose a Graph Neural Network (GNN) framework for detecting three types of uncivil behavior (i.e., toxicity, aggression, and personal attacks) within the English Wikipedia community. Our model represents each user comment as a node, with textual similarity between comments defining the edges, allowing the network to jointly learn from both linguistic content and relational structures among comments. We also introduce a dynamically adjusted attention mechanism that adaptively balances nodal and topological features during information aggregation. Empirical evaluations demonstrate that our proposed architecture outperforms 12 state-of-the-art Large Language Models (LLMs) across multiple metrics while requiring significantly lower inference cost. These findings highlight the crucial role of structural context in detecting online incivility and address the limitations of text-only LLM paradigms in behavioral prediction. All datasets and comparative outputs will be publicly available in our repository to support further research and reproducibility.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mehmet Yigit Avci, Pedro Borges, Virginia Fernandez, Paul Wright, Mehmet Yigitsoy et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Deep learning holds immense promise for transforming medical image analysis, yet its clinical generalization remains profoundly limited. A major barrier is data heterogeneity. This is particularly true in Magnetic Resonance Imaging, where scanner hardware differences, diverse acquisition protocols, and varying sequence parameters introduce substantial domain shifts that obscure underlying biological signals. Data harmonization methods aim to reduce these instrumental and acquisition variability, but existing approaches remain insufficient. When applied to imaging data, image-based harmonization approaches are often restricted by the need for target images, while existing text-guided methods rely on simplistic labels that fail to capture complex acquisition details or are typically restricted to datasets with limited variability, failing to capture the heterogeneity of real-world clinical environments. To address these limitations, we propose DIST-CLIP (Disentangled Style Transfer with CLIP Guidance), a unified framework for MRI harmonization that flexibly uses either target images or DICOM metadata for guidance. Our framework explicitly disentangles anatomical content from image contrast, with the contrast representations being extracted using pre-trained CLIP encoders. These contrast embeddings are then integrated into the anatomical content via a novel Adaptive Style Transfer module. We trained and evaluated DIST-CLIP on diverse real-world clinical datasets, and showed significant improvements in performance when compared against state-of-the-art methods in both style translation fidelity and anatomical preservation, offering a flexible solution for style transfer and standardizing MRI data. Our code and weights will be made publicly available upon publication.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hamad Almazrouei, Mariam Al Nasseri, Maha Alzaabi
RoboticsMachine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Traditional sea exploration faces significant challenges due to extreme conditions, limited visibility, and high costs, resulting in vast unexplored ocean regions. This paper presents an innovative AI-powered Autonomous Underwater Vehicle (AUV) system designed to overcome these limitations by automating underwater object detection, analysis, and reporting. The system integrates YOLOv12 Nano for real-time object detection, a Convolutional Neural Network (CNN) (ResNet50) for feature extraction, Principal Component Analysis (PCA) for dimensionality reduction, and K-Means++ clustering for grouping marine objects based on visual characteristics. Furthermore, a Large Language Model (LLM) (GPT-4o Mini) is employed to generate structured reports and summaries of underwater findings, enhancing data interpretation. The system was trained and evaluated on a combined dataset of over 55,000 images from the DeepFish and OzFish datasets, capturing diverse Australian marine environments. Experimental results demonstrate the system's capability to detect marine objects with a mAP@0.5 of 0.512, a precision of 0.535, and a recall of 0.438. The integration of PCA effectively reduced feature dimensionality while preserving 98% variance, facilitating K-Means clustering which successfully grouped detected objects based on visual similarities. The LLM integration proved effective in generating insightful summaries of detections and clusters, supported by location data. This integrated approach significantly reduces the risks associated with human diving, increases mission efficiency, and enhances the speed and depth of underwater data analysis, paving the way for more effective scientific research and discovery in challenging marine environments.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Georgios Tzachristas, Lei Deng, Ioannis Tzachristas, Gong Zhang, Renhai Chen
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We develop a unified mathematical framework for certified Top-$k$ attention truncation that quantifies approximation error at both the distribution and output levels. For a single attention distribution $P$ and its Top-$k$ truncation $\hat P$, we show that the total-variation distance coincides with the discarded softmax tail mass and satisfies $\mathrm{TV}(P,\hat P)=1-e^{-\mathrm{KL}(\hat P\Vert P)}$, yielding sharp Top-$k$-specific bounds in place of generic inequalities. From this we derive non-asymptotic deterministic bounds -- from a single boundary gap through multi-gap and blockwise variants -- that control $\mathrm{TV}(P,\hat P)$ using only the ordered logits. Using an exact head-tail decomposition, we prove that the output error factorizes as $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2=ฯ\|ฮผ_{\mathrm{tail}}-ฮผ_{\mathrm{head}}\|_2$ with $ฯ=\mathrm{TV}(P,\hat P)$, yielding a new head-tail diameter bound $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2\leฯ\,\mathrm{diam}_{H,T}$ and refinements linking the error to $\mathrm{Var}_P(V)$. Under an i.i.d. Gaussian score model $s_i\sim\mathcal N(ฮผ,ฯ^2)$ we derive closed-form tail masses and an asymptotic rule for the minimal $k_\varepsilon$ ensuring $\mathrm{TV}(P,\hat P)\le\varepsilon$, namely $k_\varepsilon/n\approxฮฆ_c(ฯ+ฮฆ^{-1}(\varepsilon))$. Experiments on bert-base-uncased and synthetic logits confirm the predicted scaling of $k_\varepsilon/n$ and show that certified Top-$k$ can reduce scored keys by 2-4$\times$ on average while meeting the prescribed total-variation budget.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shahar Lutati
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
When should an autonomous agent commit resources to a task? We introduce the Agent Capability Problem (ACP), a framework for predicting whether an agent can solve a problem under resource constraints. Rather than relying on empirical heuristics, ACP frames problem-solving as information acquisition: an agent requires $\Itotal$ bits to identify a solution and gains $\Istep$ bits per action at cost $\Cstep$, yielding an effective cost $\Ceff = (\Itotal/\Istep), \Cstep$ that predicts resource requirements before search. We prove that $\Ceff$ lower-bounds expected cost and provide tight probabilistic upper bounds. Experimental validation shows that ACP predictions closely track actual agent performance, consistently bounding search effort while improving efficiency over greedy and random strategies. The framework generalizes across LLM-based and agentic workflows, linking principles from active learning, Bayesian optimization, and reinforcement learning through a unified information-theoretic lens. \
๐
2025-12-08
๐ ArXiv Preprint (cs.SD)
๐ฅ Maximos Kaliakatsos-Papakostas, Konstantinos Soiledis, Theodoros Tsamis, Dimos Makris, Vassilis Katsouros et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer architectures offer significant advantages regarding the generation of symbolic music; their capabilities for incorporating user preferences toward what they generate is being studied under many aspects. This paper studies the inclusion of predefined chord constraints in melodic harmonization, i.e., where a desired chord at a specific location is provided along with the melody as inputs and the autoregressive transformer model needs to incorporate the chord in the harmonization that it generates. The peculiarities of involving such constraints is discussed and an algorithm is proposed for tackling this task. This algorithm is called B* and it combines aspects of beam search and A* along with backtracking to force pretrained transformers to satisfy the chord constraints, at the correct onset position within the correct bar. The algorithm is brute-force and has exponential complexity in the worst case; however, this paper is a first attempt to highlight the difficulties of the problem and proposes an algorithm that offers many possibilities for improvements since it accommodates the involvement of heuristics.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yongbo Yu, Jari Peeperkorn, Johannes De Smedt, Jochen De Weerdt
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Process Model Forecasting (PMF) aims to predict how the control-flow structure of a process evolves over time by modeling the temporal dynamics of directly-follows (DF) relations, complementing predictive process monitoring that focuses on single-case prefixes. Prior benchmarks show that machine learning and deep learning models provide only modest gains over statistical baselines, mainly due to the sparsity and heterogeneity of the DF time series. We investigate Time Series Foundation Models (TSFMs), large pre-trained models for generic time series, as an alternative for PMF. Using DF time series derived from real-life event logs, we compare zero-shot use of TSFMs, without additional training, with fine-tuned variants adapted on PMF-specific data. TSFMs generally achieve lower forecasting errors (MAE and RMSE) than traditional and specialized models trained from scratch on the same logs, indicating effective transfer of temporal structure from non-process domains. While fine-tuning can further improve accuracy, the gains are often small and may disappear on smaller or more complex datasets, so zero-shot use remains a strong default. Our study highlights the generalization capability and data efficiency of TSFMs for process-related time series and, to the best of our knowledge, provides the first systematic evaluation of temporal foundation models for PMF.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kairong Luo, Zhenbo Sun, Xinyu Shi, Shengqi Chen, Bowen Yu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yongsheng Lian
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms (PPO, GRPO, and DAPO) for improving complex reasoning in large language models (LLMs). Our main contribution is a controlled transfer-learning evaluation: models are first fine-tuned on the specialized Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Across all tasks, RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark.
Our parametric analysis offers practical guidance for RL-based LLM training. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic. Additionally, we find that the Dynamic Sampling (DS) component in DAPO does not improve performance; in fact, the best overall results are achieved with DAPO when DS is disabled.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jing Wang, Jie Shen, Xing Niu, Tong Zhang, Jeremy Weiss
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce \emph{Metric-Fair Prompting}, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In the application of multiple-choice medical question answering, each {(question, option)} pair is treated as a binary instance with label $+1$ (correct) or $-1$ (incorrect). To promote {individual fairness}~--~treating similar instances similarly~--~we compute question similarity using NLP embeddings and solve items in \emph{joint pairs of similar questions} rather than in isolation. The prompt enforces a global decision protocol: extract decisive clinical features, map each \((\text{question}, \text{option})\) to a score $f(x)$ that acts as confidence, and impose a Lipschitz-style constraint so that similar inputs receive similar scores and, hence, consistent outputs. Evaluated on the {MedQA (US)} benchmark, Metric-Fair Prompting is shown to improve performance over standard single-item prompting, demonstrating that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Navid Asgari, Benjamin M. Cole
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In this study, we propose a structured methodology that utilizes large language models (LLMs) in a cost-efficient and parsimonious manner, integrating the strengths of scholars and machines while offsetting their respective weaknesses. Our methodology, facilitated through a chain of thought and few-shot learning prompting from computer science, extends best practices for co-author teams in qualitative research to human-machine teams in quantitative research. This allows humans to utilize abductive reasoning and natural language to interrogate not just what the machine has done but also what the human has done. Our method highlights how scholars can manage inherent weaknesses OF LLMs using careful, low-cost techniques. We demonstrate how to use the methodology to interrogate human-machine rating discrepancies for a sample of 1,934 press releases announcing pharmaceutical alliances (1990-2017).
๐
2025-12-08
๐ ArXiv Preprint (eess.IV)
๐ฅ Xuecheng Li, Weikuan Jia, Komildzhon Sharipov, Sharipov Hotam Beknazarovich, Farzona S. Ataeva et al.
ๆฅ็ๆ่ฆ
Accurate segmentation of spinal structures in X-ray images is a prerequisite for quantitative scoliosis assessment, including Cobb angle measurement, vertebral translation estimation and curvature classification. In routine practice, clinicians acquire coronal, left-bending and right-bending radiographs to jointly evaluate deformity severity and spinal flexibility. However, the segmentation step remains heavily manual, time-consuming and non-reproducible, particularly in low-contrast images and in the presence of rib shadows or overlapping tissues. To address these limitations, this paper proposes R2MF-Net, a recurrent residual multi-path encoder--decoder network tailored for automatic segmentation of multi-directional spine X-ray images. The overall design consists of a coarse segmentation network and a fine segmentation network connected in cascade. Both stages adopt an improved Inception-style multi-branch feature extractor, while a recurrent residual jump connection (R2-Jump) module is inserted into skip paths to gradually align encoder and decoder semantics. A multi-scale cross-stage skip (MC-Skip) mechanism allows the fine network to reuse hierarchical representations from multiple decoder levels of the coarse network, thereby strengthening the stability of segmentation across imaging directions and contrast conditions. Furthermore, a lightweight spatial-channel squeeze-and-excitation block (SCSE-Lite) is employed at the bottleneck to emphasize spine-related activations and suppress irrelevant structures and background noise. We evaluate R2MF-Net on a clinical multi-view radiograph dataset comprising 228 sets of coronal, left-bending and right-bending spine X-ray images with expert annotations.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joel Ekstrand, Tor Mattsson, Zahra Taghiyarrenani, Slawomir Nowaczyk, Jens Lundstrรถm et al.
Machine Learning
ๆฅ็ๆ่ฆ
Reliable forecasting of multivariate time series under anomalous conditions is crucial in applications such as ATM cash logistics, where sudden demand shifts can disrupt operations. Modern deep forecasters achieve high accuracy on normal data but often fail when distribution shifts occur. We propose Weighted Contrastive Adaptation (WECA), a Weighted contrastive objective that aligns normal and anomaly-augmented representations, preserving anomaly-relevant information while maintaining consistency under benign variations. Evaluations on a nationwide ATM transaction dataset with domain-informed anomaly injection show that WECA improves SMAPE on anomaly-affected data by 6.1 percentage points compared to a normally trained baseline, with negligible degradation on normal data. These results demonstrate that WECA enhances forecasting reliability under anomalies without sacrificing performance during regular operations.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuecheng Li, Weikuan Jia, Alisher Kurbonaliev, Qurbonaliev Alisher, Khudzhamkulov Rustam et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Cross-modal learning has become a fundamental paradigm for integrating heterogeneous information sources such as images, text, and structured attributes. However, multimodal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. In particular, high-variance modalities tend to overshadow weaker but semantically important signals, while naรฏve fusion strategies entangle modality-shared and modality-specific factors in an uncontrolled manner. This makes it difficult to understand which modality actually drives a prediction and to maintain robustness when some modalities are noisy or missing. To address these challenges, we propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), a simple yet effective framework that disentangles modality-specific and modality-shared information through residual decomposition and explicit semantic decorrelation constraints. DSRSD-Net introduces: (1) a dual-stream representation learning module that separates intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) a residual semantic alignment head that maps shared factors from different modalities into a common space using a combination of contrastive and regression-style objectives; and (3) a decorrelation and orthogonality loss that regularizes the covariance structure of the shared space while enforcing orthogonality between shared and private streams, thereby suppressing cross-modal redundancy and preventing feature collapse. Experimental results on two large-scale educational benchmarks demonstrate that DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kassoum Sanogo, Renzo Ardiccioni
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) frequently generate hallucinated content plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHAL BENCH benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We validate our approach on Qwen2.5-VL-7B [23], with plans to extend validation across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kyungro Lee, Dongha Choi, Hyunju Lee
Natural Language Processing
ๆฅ็ๆ่ฆ
As dialogue systems become increasingly important across various domains, a key challenge in persona-based dialogue is generating engaging and context-specific interactions while ensuring the model acts with a coherent personality. However, existing persona-based dialogue datasets lack explicit relations between persona sentences and responses, which makes it difficult for models to effectively capture persona information. To address these issues, we propose MoCoRP (Modeling Consistent Relations between Persona and Response), a framework that incorporates explicit relations into language models. MoCoRP leverages an NLI expert to explicitly extract the NLI relations between persona sentences and responses, enabling the model to effectively incorporate appropriate persona information from the context into its responses. We applied this framework to pre-trained models like BART and further extended it to modern large language models (LLMs) through alignment tuning. Experimental results on the public datasets ConvAI2 and MPChat demonstrate that MoCoRP outperforms existing baselines, achieving superior persona consistency and engaging, context-aware dialogue generation. Furthermore, our model not only excels in quantitative metrics but also shows significant improvements in qualitative aspects. These results highlight the effectiveness of explicitly modeling persona-response relations in persona-based dialogue. The source codes of MoCoRP are available at https://github.com/DMCB-GIST/MoCoRP.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Boxuan Lyu, Haiyue Song, Hidetaka Kamigaito, Chenchen Ding, Hideki Tanaka et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Error Span Detection (ESD) is a subtask of automatic machine translation evaluation that localizes error spans in translations and labels their severity. State-of-the-art generative ESD methods typically decode using Maximum a Posteriori (MAP), assuming that model-estimated probabilities are perfectly correlated with similarity to human annotation. However, we observed that annotations dissimilar to the human annotation could achieve a higher model likelihood than the human annotation. We address this issue by applying Minimum Bayes Risk (MBR) decoding to generative ESD models. Specifically, we employ sentence- and span-level similarity metrics as utility functions to select candidate hypotheses based on their approximate similarity to the human annotation. Extensive experimental results show that our MBR decoding outperforms the MAP baseline at the system, sentence, and span-levels. Furthermore, to mitigate the computational cost of MBR decoding, we demonstrate that applying MBR distillation enables a standard greedy model to match MBR decoding performance, effectively eliminating the inference-time latency bottleneck.
๐
2025-12-08
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yuzhou Nie, Hongwei Li, Chengquan Guo, Ruizhe Jiang, Zhun Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
We propose VulnLLM-R, the~\emph{first specialized reasoning LLM} for vulnerability detection. Our key insight is that LLMs can reason about program states and analyze the potential vulnerabilities, rather than simple pattern matching. This can improve the model's generalizability and prevent learning shortcuts. However, SOTA reasoning LLMs are typically ultra-large, closed-source, or have limited performance in vulnerability detection. To address this, we propose a novel training recipe with specialized data selection, reasoning data generation, reasoning data filtering and correction, and testing-phase optimization. Using our proposed methodology, we train a reasoning model with seven billion parameters. Through extensive experiments on SOTA datasets across Python, C/C++, and Java, we show that VulnLLM-R has superior effectiveness and efficiency than SOTA static analysis tools and both open-source and commercial large reasoning models. We further conduct a detailed ablation study to validate the key designs in our training recipe. Finally, we construct an agent scaffold around our model and show that it outperforms CodeQL and AFL++ in real-world projects. Our agent further discovers a set of zero-day vulnerabilities in actively maintained repositories. This work represents a pioneering effort to enable real-world, project-level vulnerability detection using AI agents powered by specialized reasoning models. The code is available at~\href{https://github.com/ucsb-mlsec/VulnLLM-R}{github}.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nishanth Venkatesh, Andreas A. Malikopoulos
Machine Learning
ๆฅ็ๆ่ฆ
We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. In such settings, conventional model-learning methods are fundamentally inconsistent, as the transition and reward mechanisms generated under a behavioral policy do not correspond to the interventional quantities required for evaluating a state-based policy. To address this issue, we adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories under mild invertibility conditions on proxy variables. When combined with a behavior-averaged transition model, this construction yields a surrogate MDP whose Bellman operator is well defined and consistent for state-based policies, and which integrates seamlessly with the maximum causal entropy (MaxCausalEnt) model-learning framework. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sebastian Sztwiertnia, Felix Friedrich, Kristian Kersting, Patrick Schramowski, Bjรถrn Deiseroth
Natural Language Processing
ๆฅ็ๆ่ฆ
Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pengqian Lu, Jie Lu, Anjin Liu, Guangquan Zhang
ๆฅ็ๆ่ฆ
Detecting hallucinations in Retrieval-Augmented Generation (RAG) remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge (stored in FFNs) and retrieved context. However, this perspective is incomplete, failing to account for the impact of other components in the generative process, such as the user query, previously generated tokens, the current token itself, and the final LayerNorm adjustment. To address this, we introduce SPAD. First, we mathematically attribute each token's probability into seven distinct sources: Query, RAG, Past, Current Token, FFN, Final LayerNorm, and Initial Embedding. This attribution quantifies how each source contributes to the generation of the current token. Then, we aggregate these scores by POS tags to quantify how different components drive specific linguistic categories. By identifying anomalies, such as Nouns relying on Final LayerNorm, SPAD effectively detects hallucinations. Extensive experiments demonstrate that SPAD achieves state-of-the-art performance
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikita Gabdullin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The overall neural network (NN) performance is closely related to the properties of its embedding distribution in latent space (LS). It has recently been shown that predefined vector systems, specifically An root system vectors, can be used as targets for latent space configurations (LSC) to ensure the desired LS structure. One of the main LSC advantage is the possibility of training classifier NNs without classification layers, which facilitates training NNs on datasets with extremely large numbers of classes. This paper provides a more general overview of possible vector systems for NN training along with their properties and methods for vector system construction. These systems are used to configure LS of encoders and visual transformers to significantly speed up ImageNet-1K and 50k-600k classes LSC training. It is also shown that using the minimum number of LS dimensions for a specific number of classes results in faster convergence. The latter has potential advantages for reducing the size of vector databases used to store NN embeddings.
๐
2025-12-08
๐ ArXiv Preprint (cs.SE)
๐ฅ Weilin Luo, Xueyi Liang, Haotian Deng, Yanan Liu, Hai Wan
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatically synthesizing verifiable code from natural language requirements ensures software correctness and reliability while significantly lowering the barrier to adopting the techniques of formal methods. With the rise of large language models (LLMs), long-standing efforts at autoformalization have gained new momentum. However, existing approaches suffer from severe syntactic and semantic errors due to the scarcity of domain-specific pre-training corpora and often fail to formalize implicit knowledge effectively. In this paper, we propose AutoICE, an LLM-driven evolutionary search for synthesizing verifiable C code. It introduces the diverse individual initialization and the collaborative crossover to enable diverse iterative updates, thereby mitigating error propagation inherent in single-agent iterations. Besides, it employs the self-reflective mutation to facilitate the discovery of implicit knowledge. Evaluation results demonstrate the effectiveness of AutoICE: it successfully verifies $90.36$\% of code, outperforming the state-of-the-art (SOTA) approach. Besides, on a developer-friendly dataset variant, AutoICE achieves a $88.33$\% verification success rate, significantly surpassing the $65$\% success rate of the SOTA approach.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ JV Roig
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.
๐
2025-12-08
๐ Best Paper Award (2025) from Risk Analysis as one
๐ฅ David M. Allison, Stephen Herzog
ๆฅ็ๆ่ฆ
A robust nonproliferation regime has contained the spread of nuclear weapons to just nine states. Yet, emerging and disruptive technologies are reshaping the landscape of nuclear risks, presenting a critical juncture for decision makers. This article lays out the contours of an overlooked but intensifying technological arms race for nuclear (in)visibility, driven by the interplay between proliferation-enabling technologies (PETs) and detection-enhancing technologies (DETs). We argue that the strategic pattern of proliferation will be increasingly shaped by the innovation pace in these domains. Artificial intelligence (AI) introduces unprecedented complexity to this equation, as its rapid scaling and knowledge substitution capabilities accelerate PET development and challenge traditional monitoring and verification methods. To analyze this dynamic, we develop a formal model centered on a Relative Advantage Index (RAI), quantifying the shifting balance between PETs and DETs. Our model explores how asymmetric technological advancement, particularly logistic AI-driven PET growth versus stepwise DET improvements, expands the band of uncertainty surrounding proliferation detectability. Through replicable scenario-based simulations, we evaluate the impact of varying PET growth rates and DET investment strategies on cumulative nuclear breakout risk. We identify a strategic fork ahead, where detection may no longer suffice without broader PET governance. Governments and international organizations should accordingly invest in policies and tools agile enough to keep pace with tomorrow's technology.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Florian Lรผttner, Nicole Neis, Daniel Stadler, Robin Moss, Mirjam Fehling-Kaschek et al.
Robotics
ๆฅ็ๆ่ฆ
The reliable operation of autonomous vehicles, automated driving functions, and advanced driver assistance systems across a wide range of relevant scenarios is critical for their development and deployment. Identifying a near-complete set of relevant driving scenarios for such functionalities is challenging due to numerous degrees of freedom involved, each affecting the outcomes of the driving scenario differently. Moreover, with increasing technical complexity of new functionalities, the number of potentially relevant, particularly "unknown unsafe" scenarios is increasing. To enhance validation efficiency, it is essential to identify relevant scenarios in advance, starting with simpler domains like highways before moving to more complex environments such as urban traffic. To address this, this paper focuses on analyzing lane change scenarios in highway traffic, which involve multiple degrees of freedom and present numerous safetyrelevant scenarios. We describe the process of data acquisition and processing of real-world data from public highway traffic, followed by the application of criticality measures on trajectory data to evaluate scenarios, as conducted within the AVEAS project (www.aveas.org). By linking the calculated measures to specific lane change driving scenarios and the conditions under which the data was collected, we facilitate the identification of safetyrelevant driving scenarios for various applications. Further, to tackle the extensive range of "unknown unsafe" scenarios, we propose a way to generate relevant scenarios by creating synthetic scenarios based on recorded ones. Consequently, we demonstrate and evaluate a processing chain that enables the identification of safety-relevant scenarios, the development of data-driven methods for extracting these scenarios, and the generation of synthetic critical scenarios via sampling on highways.
๐
2025-12-08
๐ ArXiv Preprint (cs.MA)
๐ฅ Trung-Kiet Huynh, Duy-Minh Dao-Sy, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLMs) increasingly operate as autonomous decision-makers in interactive and multi-agent systems and human societies, understanding their strategic behaviour has profound implications for safety, coordination, and the design of AI-driven social and economic infrastructures. Assessing such behaviour requires methods that capture not only what LLMs output, but the underlying intentions that guide their decisions. In this work, we extend the FAIRGAME framework to systematically evaluate LLM behaviour in repeated social dilemmas through two complementary advances: a payoff-scaled Prisoners Dilemma isolating sensitivity to incentive magnitude, and an integrated multi-agent Public Goods Game with dynamic payoffs and multi-agent histories. These environments reveal consistent behavioural signatures across models and languages, including incentive-sensitive cooperation, cross-linguistic divergence and end-game alignment toward defection. To interpret these patterns, we train traditional supervised classification models on canonical repeated-game strategies and apply them to FAIRGAME trajectories, showing that LLMs exhibit systematic, model- and language-dependent behavioural intentions, with linguistic framing at times exerting effects as strong as architectural differences. Together, these findings provide a unified methodological foundation for auditing LLMs as strategic agents and reveal systematic cooperation biases with direct implications for AI governance, collective decision-making, and the design of safe multi-agent systems.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amir Mohammad Akhlaghi, Amirhossein Shabani, Mostafa Abdolmaleki, Saeed Reza Kheradpisheh
Natural Language Processing
ๆฅ็ๆ่ฆ
The democratization of AI is currently hindered by the immense computational costs required to train Large Language Models (LLMs) for low-resource languages. This paper presents Persian-Phi, a 3.8B parameter model that challenges the assumption that robust multilingual capabilities require massive model sizes or multilingual baselines. We demonstrate how Microsoft Phi-3 Mini -- originally a monolingual English model -- can be effectively adapted to Persian through a novel, resource-efficient curriculum learning pipeline. Our approach employs a unique "warm-up" stage using bilingual narratives (Tiny Stories) to align embeddings prior to heavy training, followed by continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Despite its compact size, Persian-Phi achieves competitive results on Open Persian LLM Leaderboard in HuggingFace. Our findings provide a validated, scalable framework for extending the reach of state-of-the-art LLMs to underrepresented languages with minimal hardware resources. The Persian-Phi model is publicly available at https://huggingface.co/amirakhlaghiqqq/PersianPhi.
๐
2025-12-08
๐ ArXiv Preprint (physics.soc-ph)
๐ฅ Van An Nguyen, Vuong Khang Huynh, Ho Nam Duong, Huu Loi Bui, Hai Anh Ha et al.
Robotics
ๆฅ็ๆ่ฆ
Research on promoting cooperation among autonomous, self-regarding agents has often focused on the bi-objective optimisation problem: minimising the total incentive cost while maximising the frequency of cooperation. However, the optimal value of social welfare under such constraints remains largely unexplored. In this work, we hypothesise that achieving maximal social welfare is not guaranteed at the minimal incentive cost required to drive agents to a desired cooperative state. To address this gap, we adopt to a single-objective approach focused on maximising social welfare, building upon foundational evolutionary game theory models that examined cost efficiency in finite populations, in both well-mixed and structured population settings. Our analytical model and agent-based simulations show how different interference strategies, including rewarding local versus global behavioural patterns, affect social welfare and dynamics of cooperation. Our results reveal a significant gap in the per-individual incentive cost between optimising for pure cost efficiency or cooperation frequency and optimising for maximal social welfare. Overall, our findings indicate that incentive design, policy, and benchmarking in multi-agent systems and human societies should prioritise welfare-centric objectives over proxy targets of cost or cooperation frequency.
๐
2025-12-08
๐ WSDM 2026
๐ฅ Imran Ahsan, Hyunwook Yu, Jinsung Kim, Mucheol Kim
Machine Learning
ๆฅ็ๆ่ฆ
Graph neural networks (GNNs) are increasingly used to model complex patterns in graph-structured data. However, enabling them to "forget" designated information remains challenging, especially under privacy regulations such as the GDPR. Existing unlearning methods largely optimize for efficiency and scalability, yet they offer little transparency, and the black-box nature of GNNs makes it difficult to verify whether forgetting has truly occurred. We propose an explainability-driven verifier for GNN unlearning that snapshots the model before and after deletion, using attribution shifts and localized structural changes (for example, graph edit distance) as transparent evidence. The verifier uses five explainability metrics: residual attribution, heatmap shift, explainability score deviation, graph edit distance, and a diagnostic graph rule shift. We evaluate two backbones (GCN, GAT) and four unlearning strategies (Retrain, GraphEditor, GNNDelete, IDEA) across five benchmarks (Cora, Citeseer, Pubmed, Coauthor-CS, Coauthor-Physics). Results show that Retrain and GNNDelete achieve near-complete forgetting, GraphEditor provides partial erasure, and IDEA leaves residual signals. These explanation deltas provide the primary, human-readable evidence of forgetting; we also report membership-inference ROC-AUC as a complementary, graph-wide privacy signal.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenwei Shi, Xueyu Luan
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
DreamerV3 is a state-of-the-art online model-based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perceptrons (MLPs), offering superior parameter efficiency and interpretability. To mitigate KANs' computational overhead, variants like FastKAN leverage Radial Basis Functions (RBFs) to accelerate inference. In this work, we investigate integrating KAN architectures into the DreamerV3 framework. We introduce KAN-Dreamer, replacing specific MLP and convolutional components of DreamerV3 with KAN and FastKAN layers. To ensure efficiency within the JAX-based World Model, we implement a tailored, fully vectorized version with simplified grid management. We structure our investigation into three subsystems: Visual Perception, Latent Prediction, and Behavior Learning. Empirical evaluations on the DeepMind Control Suite (walker_walk) analyze sample efficiency, training time, and asymptotic performance. Experimental results demonstrate that utilizing our adapted FastKAN as a drop-in replacement for the Reward and Continue predictors yields performance on par with the original MLP-based architecture, maintaining parity in both sample efficiency and training speed. This report serves as a preliminary study for future developments in KAN-based world models.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Zhenfeng Liu et al.
ๆฅ็ๆ่ฆ
Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explores vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompass diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, remaining challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench includes over 150,000 high-quality entries from various cities and business types. We construct 300 multi-hop QA tasks based on real user queries, challenging agents to understand questions and retrieve information in multiple steps. We also developed LocalPlayground, a unified environment integrating multiple tools for agent interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.1) achieves only 34.34% correctness, and most models have issues with completeness (average 77.33%) and faithfulness (average 61.99%). This highlights the need for specialized benchmarks and domain-specific agent training in local life services. Code, Benchmark, and Leaderboard are available at localsearchbench.github.io.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yangle Li, Danli Luo, Haifeng Hu
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Existing methods in domain generalization for Multimodal Sentiment Analysis (MSA) often overlook inter-modal synergies during invariant features extraction, which prevents the accurate capture of the rich semantic information within multimodal data. Additionally, while knowledge injection techniques have been explored in MSA, they often suffer from fragmented cross-modal knowledge, overlooking specific representations that exist beyond the confines of unimodal. To address these limitations, we propose a novel MSA framework designed for domain generalization. Firstly, the framework incorporates a Mixture of Invariant Experts model to extract domain-invariant features, thereby enhancing the model's capacity to learn synergistic relationships between modalities. Secondly, we design a Cross-Modal Adapter to augment the semantic richness of multimodal representations through cross-modal knowledge injection. Extensive domain experiments conducted on three datasets demonstrate that the proposed MIDG achieves superior performance.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Karel Moens, Matthew B. Blaschko, Tinne Tuytelaars, Bart Diricx, Jonas De Vylder et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Whole slide image (WSI) normalization remains a vital preprocessing step in computational pathology. Increasingly driven by deep learning, these models learn to approximate data distributions from training examples. This often results in outputs that gravitate toward the average, potentially masking diagnostically important features. More critically, they can introduce hallucinated content, artifacts that appear realistic but are not present in the original tissue, posing a serious threat to downstream analysis. These hallucinations are nearly impossible to detect visually, and current evaluation practices often overlook them. In this work, we demonstrate that the risk of hallucinations is real and underappreciated. While many methods perform adequately on public datasets, we observe a concerning frequency of hallucinations when these same models are retrained and evaluated on real-world clinical data. To address this, we propose a novel image comparison measure designed to automatically detect hallucinations in normalized outputs. Using this measure, we systematically evaluate several well-cited normalization methods retrained on real-world data, revealing significant inconsistencies and failures that are not captured by conventional metrics. Our findings underscore the need for more robust, interpretable normalization techniques and stricter validation protocols in clinical deployment.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gabriele Galatolo, Mirco Nanni
ๆฅ็ๆ่ฆ
Understanding the movement behaviours of individuals and the way they react to the external world is a key component of any problem that involves the modelling of human dynamics at a physical level. In particular, it is crucial to capture the influence that the presence of an individual can have on the others. Important examples of applications include crowd simulation and emergency management, where the simulation of the mass of people passes through the simulation of the individuals, taking into consideration the others as part of the general context. While existing solutions basically start from some preconceived behavioural model, in this work we propose an approach that starts directly from the data, adopting a data mining perspective. Our method searches the mobility events in the data that might be possible evidences of mutual interactions between individuals, and on top of them looks for complex, persistent patterns and time evolving configurations of events. The study of these patterns can provide new insights on the mechanics of mobility interactions between individuals, which can potentially help in improving existing simulation models. We instantiate the general methodology on two real case studies, one on cars and one on pedestrians, and a full experimental evaluation is performed, both in terms of performances, parameter sensitivity and interpretation of sample results.
๐
2025-12-08
๐ ArXiv Preprint (cs.SE)
๐ฅ Francisco Ribeiro, Claudio Spiess, Prem Devanbu, Sarah Nadi
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite the effectiveness of large language models (LLMs) for code generation, they often output incorrect code. One reason is that model output probabilities are often not well-correlated with correctness, and reflect only the final output of the generation process. Inspired by findings that LLMs internally encode concepts like truthfulness, this paper explores if LLMs similarly represent code correctness. Specifically, we identify a correctness representation inside LLMs by contrasting the hidden states between pairs of correct and incorrect code for the same programming tasks. By experimenting on four LLMs, we show that exploiting this extracted correctness representation outperforms standard log-likelihood ranking, as well as verbalized model confidence. Furthermore, we explore how this internal correctness signal can be used to select higher-quality code samples, without requiring test execution. Ultimately, this work demonstrates how leveraging internal representations can enhance code generation systems and make LLMs more reliable, thus improving confidence in automatically generated code.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Giulia Lanzillotta, Damiano Meier, Thomas Hofmann
Machine Learning
ๆฅ็ๆ่ฆ
A persistent paradox in continual learning (CL) is that neural networks often retain linearly separable representations of past tasks even when their output predictions fail. We formalize this distinction as the gap between deep feature-space and shallow classifier-level forgetting. We reveal a critical asymmetry in Experience Replay: while minimal buffers successfully anchor feature geometry and prevent deep forgetting, mitigating shallow forgetting typically requires substantially larger buffer capacities. To explain this, we extend the Neural Collapse framework to the sequential setting. We characterize deep forgetting as a geometric drift toward out-of-distribution subspaces and prove that any non-zero replay fraction asymptotically guarantees the retention of linear separability. Conversely, we identify that the "strong collapse" induced by small buffers leads to rank-deficient covariances and inflated class means, effectively blinding the classifier to true population boundaries. By unifying CL with out-of-distribution detection, our work challenges the prevailing reliance on large buffers, suggesting that explicitly correcting these statistical artifacts could unlock robust performance with minimal replay.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Byungju Kim, Jinu Pahk, Chungwoo Lee, Jaejoon Kim, Jangha Lee et al.
Robotics
ๆฅ็ๆ่ฆ
Behavior-cloning based visuomotor policies enable precise manipulation but often inherit the slow, cautious tempo of human demonstrations, limiting practical deployment. However, prior studies on acceleration methods mainly rely on statistical or heuristic cues that ignore task semantics and can fail across diverse manipulation settings. We present ESPADA, a semantic and spatially aware framework that segments demonstrations using a VLM-LLM pipeline with 3D gripper-object relations, enabling aggressive downsampling only in non-critical segments while preserving precision-critical phases, without requiring extra data or architectural modifications, or any form of retraining. To scale from a single annotated episode to the full dataset, ESPADA propagates segment labels via Dynamic Time Warping (DTW) on dynamics-only features. Across both simulation and real-world experiments with ACT and DP baselines, ESPADA achieves approximately a 2x speed-up while maintaining success rates, narrowing the gap between human demonstrations and efficient robot control.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiming Huang, Hao Ai, Jianbo Jiao
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Benefiting from the inductive biases learned from large-scale datasets, open-vocabulary semantic segmentation (OVSS) leverages the power of vision-language models, such as CLIP, to achieve remarkable progress without requiring task-specific training. However, due to CLIP's pre-training nature on image-text pairs, it tends to focus on global semantic alignment, resulting in suboptimal performance when associating fine-grained visual regions with text. This leads to noisy and inconsistent predictions, particularly in local areas. We attribute this to a dispersed bias stemming from its contrastive training paradigm, which is difficult to alleviate using CLIP features alone. To address this, we propose a structure-aware feature rectification approach that incorporates instance-specific priors derived directly from the image. Specifically, we construct a region adjacency graph (RAG) based on low-level features (e.g., colour and texture) to capture local structural relationships and use it to refine CLIP features by enhancing local discrimination. Extensive experiments show that our method effectively suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alexandre Rocchi--Henry, Thomas Fel, Gianni Franchi
ๆฅ็ๆ่ฆ
Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases -- such as SAE type, sparsity, or expansion ratio -- to emergence of plausible\footnote{We adopt the terminology of \citet{jacovi2020towards}, who distinguish between faithful explanations (accurately reflecting model computations) and plausible explanations (aligning with human intuition and domain knowledge). CBM concepts are plausible by construction -- selected or annotated by humans -- though not necessarily faithful to the true latent factors that organise the data manifold.} concepts. Using these metrics, we uncover a ``sweet spot'' in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concept align with plausible human concepts.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Reem E. Mohamed, Md Rafiqul Islam et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
The increasing use of synthetic media, particularly deepfakes, is an emerging challenge for digital content verification. Although recent studies use both audio and visual information, most integrate these cues within a single model, which remains vulnerable to modality mismatches, noise, and manipulation. To address this gap, we propose DeepAgent, an advanced multi-agent collaboration framework that simultaneously incorporates both visual and audio modalities for the effective detection of deepfakes. DeepAgent consists of two complementary agents. Agent-1 examines each video with a streamlined AlexNet-based CNN to identify the symbols of deepfake manipulation, while Agent-2 detects audio-visual inconsistencies by combining acoustic features, audio transcriptions from Whisper, and frame-reading sequences of images through EasyOCR. Their decisions are fused through a Random Forest meta-classifier that improves final performance by taking advantage of the different decision boundaries learned by each agent. This study evaluates the proposed framework using three benchmark datasets to demonstrate both component-level and fused performance. Agent-1 achieves a test accuracy of 94.35% on the combined Celeb-DF and FakeAVCeleb datasets. On the FakeAVCeleb dataset, Agent-2 and the final meta-classifier attain accuracies of 93.69% and 81.56%, respectively. In addition, cross-dataset validation on DeepFakeTIMIT confirms the robustness of the meta-classifier, which achieves a final accuracy of 97.49%, and indicates a strong capability across diverse datasets. These findings confirm that hierarchy-based fusion enhances robustness by mitigating the weaknesses of individual modalities and demonstrate the effectiveness of a multi-agent approach in addressing diverse types of manipulations in deepfakes.
๐
2025-12-08
๐ Accepted by IEEE International Conference on Compu
๐ฅ Shengyuan Ye, Bei Ouyang, Tianyi Qian, Liekang Zeng, Mu Yuan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) have demonstrated impressive multimodal comprehension capabilities and are being deployed in an increasing number of online video understanding applications. While recent efforts extensively explore advancing VLMs' reasoning power in these cases, deployment constraints are overlooked, leading to overwhelming system overhead in real-world deployments. To address that, we propose Venus, an on-device memory-and-retrieval system for efficient online video understanding. Venus proposes an edge-cloud disaggregated architecture that sinks memory construction and keyframe retrieval from cloud to edge, operating in two stages. In the ingestion stage, Venus continuously processes streaming edge videos via scene segmentation and clustering, where the selected keyframes are embedded with a multimodal embedding model to build a hierarchical memory for efficient storage and retrieval. In the querying stage, Venus indexes incoming queries from memory, and employs a threshold-based progressive sampling algorithm for keyframe selection that enhances diversity and adaptively balances system cost and reasoning accuracy. Our extensive evaluation shows that Venus achieves a 15x-131x speedup in total response latency compared to state-of-the-art methods, enabling real-time responses within seconds while maintaining comparable or even superior reasoning accuracy.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhengquan Luo, Guy Tadmor, Or Amar, David Zeevi, Zhiqiang Xu
Machine Learning
ๆฅ็ๆ่ฆ
Knowledge graph embedding (KGE) relies on the geometry of the embedding space to encode semantic and structural relations. Existing methods place all entities on one homogeneous manifold, Euclidean, spherical, hyperbolic, or their product/multi-curvature variants, to model linear, symmetric, or hierarchical patterns. Yet a predefined, homogeneous manifold cannot accommodate the sharply varying curvature that real-world graphs exhibit across local regions. Since this geometry is imposed a priori, any mismatch with the knowledge graph's local curvatures will distort distances between entities and hurt the expressiveness of the resulting KGE. To rectify this, we propose RicciKGE to have the KGE loss gradient coupled with local curvatures in an extended Ricci flow such that entity embeddings co-evolve dynamically with the underlying manifold geometry towards mutual adaptation. Theoretically, when the coupling coefficient is bounded and properly selected, we rigorously prove that i) all the edge-wise curvatures decay exponentially, meaning that the manifold is driven toward the Euclidean flatness; and ii) the KGE distances strictly converge to a global optimum, which indicates that geometric flattening and embedding optimization are promoting each other. Experimental improvements on link prediction and node classification benchmarks demonstrate RicciKGE's effectiveness in adapting to heterogeneous knowledge graph structures.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyang Mai, Yu-Wing Tai
ๆฅ็ๆ่ฆ
Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose \textbf{ContextAnyone}, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: \href{https://github.com/ziyang1106/ContextAnyone}{https://github.com/ziyang1106/ContextAnyone}.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuxiao Luo, Songming Zhang, Sijie Ruan, Siran Chen, Kang Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Modeling human mobility is vital for extensive applications such as transportation planning and epidemic modeling. With the rise of the Artificial Intelligence Generated Content (AIGC) paradigm, recent works explore synthetic trajectory generation using autoregressive and diffusion models. While these methods show promise for generating single-day trajectories, they remain limited by inefficiencies in long-term generation (e.g., weekly trajectories) and a lack of explicit spatiotemporal multi-scale modeling. This study proposes Multi-Scale Spatio-Temporal AutoRegression (M-STAR), a new framework that generates long-term trajectories through a coarse-to-fine spatiotemporal prediction process. M-STAR combines a Multi-scale Spatiotemporal Tokenizer that encodes hierarchical mobility patterns with a Transformer-based decoder for next-scale autoregressive prediction. Experiments on two real-world datasets show that M-STAR outperforms existing methods in fidelity and significantly improves generation speed. The data and codes are available at https://github.com/YuxiaoLuo0013/M-STAR.
๐
2025-12-08
๐ \c{opyright} 2025 IEEE. Personal use of this mater
๐ฅ Zhongchun Zhou, Chengtao Lai, Yuhang Gu, Wei Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid adoption of large language models (LLMs) is pushing AI accelerators toward increasingly powerful and specialized designs. Instead of further complicating software development with deeply hierarchical scratchpad memories (SPMs) and their asynchronous management, we investigate the opposite point of the design spectrum: a multi-core AI accelerator equipped with a shared system-level cache and application-aware management policies, which keeps the programming effort modest. Our approach exploits dataflow information available in the software stack to guide cache replacement (including dead-block prediction), in concert with bypass decisions and mechanisms that alleviate cache thrashing.
We assess the proposal using a cycle-accurate simulator and observe substantial performance gains (up to 1.80x speedup) compared with conventional cache architectures. In addition, we build and validate an analytical model that takes into account the actual overlapping behaviors to extend the measurement results of our policies to real-world larger-scale workloads. Experiment results show that when functioning together, our bypassing and thrashing mitigation strategies can handle scenarios both with and without inter-core data sharing and achieve remarkable speedups.
Finally, we implement the design in RTL and the area of our design is $\mathbf{0.064mm^2}$ with 15nm process, which can run at 2 GHz clock frequency. Our findings explore the potential of the shared cache design to assist the development of future AI accelerator systems.
๐
2025-12-08
๐ ArXiv Preprint (cs.IT)
๐ฅ Guosheng Wang, Shen Wang, Lei Yang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Radio frequency (RF)-based indoor localization offers significant promise for applications such as indoor navigation, augmented reality, and pervasive computing. While deep learning has greatly enhanced localization accuracy and robustness, existing localization models still face major challenges in cross-scene generalization due to their reliance on scene-specific labeled data. To address this, we introduce Radiance-Field Reinforced Pretraining (RFRP). This novel self-supervised pretraining framework couples a large localization model (LM) with a neural radio-frequency radiance field (RF-NeRF) in an asymmetrical autoencoder architecture. In this design, the LM encodes received RF spectra into latent, position-relevant representations, while the RF-NeRF decodes them to reconstruct the original spectra. This alignment between input and output enables effective representation learning using large-scale, unlabeled RF data, which can be collected continuously with minimal effort. To this end, we collected RF samples at 7,327,321 positions across 100 diverse scenes using four common wireless technologies--RFID, BLE, WiFi, and IIoT. Data from 75 scenes were used for training, and the remaining 25 for evaluation. Experimental results show that the RFRP-pretrained LM reduces localization error by over 40% compared to non-pretrained models and by 21% compared to those pretrained using supervised learning.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Thierry Petit, Arnault Pachot
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce a constraint-programming framework for generating synthetic populations that reproduce target statistics with high precision while enforcing full individual consistency. Unlike data-driven approaches that infer distributions from samples, our method directly encodes aggregated statistics and structural relations, enabling exact control of demographic profiles without requiring any microdata. We validate the approach on official demographic sources and study the impact of distributional deviations on downstream analyses. This work is conducted within the Pollitics project developed by Emotia, where synthetic populations can be queried through large language models to model societal behaviors, explore market and policy scenarios, and provide reproducible decision-grade insights without personal data.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingning Guo, Mengwei Wu, Shaoxian Li, Haifeng Li, Chao Tao
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs' understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model's ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sijia Li, Yuchen Huang, Zifan Liu, Zijian Li, Jingjing fu et al.
ๆฅ็ๆ่ฆ
Despite impressive advances in agent systems, multi-turn tool-use scenarios remain challenging. It is mainly because intent is clarified progressively and the environment evolves with each tool call. While reusing past experience is natural, current LLM agents either treat entire trajectories or pre-defined subtasks as indivisible units, or solely exploit tool-to-tool dependencies, hindering adaptation as states and information evolve across turns. In this paper, we propose a State Integrated Tool Graph (SIT-Graph), which enhances multi-turn tool use by exploiting partially overlapping experience. Inspired by human decision-making that integrates episodic and procedural memory, SIT-Graph captures both compact state representations (episodic-like fragments) and tool-to-tool dependencies (procedural-like routines) from historical trajectories. Specifically, we first build a tool graph from accumulated tool-use sequences, and then augment each edge with a compact state summary of the dialog and tool history that may shape the next action. At inference time, SIT-Graph enables a human-like balance between episodic recall and procedural execution: when the next decision requires recalling prior context, the agent retrieves the state summaries stored on relevant edges and uses them to guide its next action; when the step is routine, it follows high-confidence tool dependencies without explicit recall. Experiments across multiple stateful multi-turn tool-use benchmarks show that SIT-Graph consistently outperforms strong memory- and graph-based baselines, delivering more robust tool selection and more effective experience transfer.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Siyu Wang, Hua Wang, Huiyu Li, Fan Zhang
Machine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
In the field of healthcare, precise skin lesion segmentation is crucial for the early detection and accurate diagnosis of skin diseases. Despite significant advances in deep learning for image processing, existing methods have yet to effectively address the challenges of irregular lesion shapes and low contrast. To address these issues, this paper proposes an innovative encoder-decoder network architecture based on multi-scale residual structures, capable of extracting rich feature information from different receptive fields to effectively identify lesion areas. By introducing a Multi-Resolution Multi-Channel Fusion (MRCF) module, our method captures cross-scale features, enhancing the clarity and accuracy of the extracted information. Furthermore, we propose a Cross-Mix Attention Module (CMAM), which redefines the attention scope and dynamically calculates weights across multiple contexts, thus improving the flexibility and depth of feature capture and enabling deeper exploration of subtle features. To overcome the information loss caused by skip connections in traditional U-Net, an External Attention Bridge (EAB) is introduced, facilitating the effective utilization of information in the decoder and compensating for the loss during upsampling. Extensive experimental evaluations on several skin lesion segmentation datasets demonstrate that the proposed model significantly outperforms existing transformer and convolutional neural network-based models, showcasing exceptional segmentation accuracy and robustness.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Florian Tretter, Daniel Flรถgel, Alexandru Vasilache, Max Grobbel, Jรผrgen Becker et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Integrating autonomous mobile robots into human environments requires human-like decision-making and energy-efficient, event-based computation. Despite progress, neuromorphic methods are rarely applied to Deep Reinforcement Learning (DRL) navigation approaches due to unstable training. We address this gap with a hybrid socially integrated DRL actor-critic approach that combines Spiking Neural Networks (SNNs) in the actor with Artificial Neural Networks (ANNs) in the critic and a neuromorphic feature extractor to capture temporal crowd dynamics and human-robot interactions. Our approach enhances social navigation performance and reduces estimated energy consumption by approximately 1.69 orders of magnitude.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Handing Xu, Zhenguo Nie, Tairan Peng, Huimin Pan, Xin-Jun Liu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Endoscopic surgery relies on intraoperative video, making image quality a decisive factor for surgical safety and efficacy. Yet, endoscopic videos are often degraded by uneven illumination, tissue scattering, occlusions, and motion blur, which obscure critical anatomical details and complicate surgical manipulation. Although deep learning-based methods have shown promise in image enhancement, most existing approaches remain too computationally demanding for real-time surgical use. To address this challenge, we propose a degradation-aware framework for endoscopic video enhancement, which enables real-time, high-quality enhancement by propagating degradation representations across frames. In our framework, degradation representations are first extracted from images using contrastive learning. We then introduce a fusion mechanism that modulates image features with these representations to guide a single-frame enhancement model, which is trained with a cycle-consistency constraint between degraded and restored images to improve robustness and generalization. Experiments demonstrate that our framework achieves a superior balance between performance and efficiency compared with several state-of-the-art methods. These results highlight the effectiveness of degradation-aware modeling for real-time endoscopic video enhancement. Nevertheless, our method suggests that implicitly learning and propagating degradation representation offer a practical pathway for clinical application.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jingran Yang, Min Zhang, Lingfeng Zhang, Zhaohui Wang, Yonggang Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Because machine learning has significantly improved efficiency and convenience in the society, it's increasingly used to assist or replace human decision-making. However, the data-based pattern makes related algorithms learn and even exacerbate potential bias in samples, resulting in discriminatory decisions against certain unprivileged groups, depriving them of the rights to equal treatment, thus damaging the social well-being and hindering the development of related applications. Therefore, we propose a pre-processing method IFFair based on the influence function. Compared with other fairness optimization approaches, IFFair only uses the influence disparity of training samples on different groups as a guidance to dynamically adjust the sample weights during training without modifying the network structure, data features and decision boundaries. To evaluate the validity of IFFair, we conduct experiments on multiple real-world datasets and metrics. The experimental results show that our approach mitigates bias of multiple accepted metrics in the classification setting, including demographic parity, equalized odds, equality of opportunity and error rate parity without conflicts. It also demonstrates that IFFair achieves better trade-off between multiple utility and fairness metrics compared with previous pre-processing methods.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Biao Chen, Lin Zuo, Mengmeng Jing, Kunbin He, Yuchen Wang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Dropout is a widely used regularization technique which improves the generalization ability of a model by randomly dropping neurons. In light of this, we propose Dropout Prompt Learning, which aims for applying dropout to improve the robustness of the vision-language models. Different from the vanilla dropout, we apply dropout on the tokens of the textual and visual branches, where we evaluate the token significance considering both intra-modal context and inter-modal alignment, enabling flexible dropout probabilities for each token. Moreover, to maintain semantic alignment for general knowledge transfer while encouraging the diverse representations that dropout introduces, we further propose residual entropy regularization. Experiments on 15 benchmarks show our method's effectiveness in challenging scenarios like low-shot learning, long-tail classification, and out-of-distribution generalization. Notably, our method surpasses regularization-based methods including KgCoOp by 5.10% and PromptSRC by 2.13% in performance on base-to-novel generalization.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wenlong Liu, Jiahua Pan, Xingyu Zhang, Xinxin Gong, Yang Ye et al.
ๆฅ็ๆ่ฆ
Product matching aims to identify identical or similar products sold on different platforms. By building knowledge graphs (KGs), the product matching problem can be converted to the Entity Alignment (EA) task, which aims to discover the equivalent entities from diverse KGs. The existing EA methods inadequately utilize both attribute triples and relation triples simultaneously, especially the interactions between them. This paper introduces a two-stage pipeline consisting of rough filter and fine filter to match products from eBay and Amazon. For fine filtering, a new framework for Entity Alignment, Relation-aware and Attribute-aware Graph Attention Networks for Entity Alignment (RAEA), is employed. RAEA focuses on the interactions between attribute triples and relation triples, where the entity representation aggregates the alignment signals from attributes and relations with Attribute-aware Entity Encoder and Relation-aware Graph Attention Networks. The experimental results indicate that the RAEA model achieves significant improvements over 12 baselines on EA task in the cross-lingual dataset DBP15K (6.59% on average Hits@1) and delivers competitive results in the monolingual dataset DWY100K. The source code for experiments on DBP15K and DWY100K is available at github (https://github.com/Mockingjay-liu/RAEA-model-for-Entity-Alignment).
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hengyang Yao, Lin Li, Ke Sun, Jianing Qiu, Huiping Chen
Machine Learning
ๆฅ็ๆ่ฆ
DeepFake face swapping enables highly realistic identity forgeries, posing serious privacy and security risks. A common defence embeds invisible perturbations into images, but these are fragile and often destroyed by basic transformations such as compression or resizing. In this paper, we first conduct a systematic analysis of 30 transformations across six categories and show that protection robustness is highly sensitive to the choice of training transformations, making the standard Expectation over Transformation (EOT) with uniform sampling fundamentally suboptimal. Motivated by this, we propose Expectation Over Learned distribution of Transformation (EOLT), the framework to treat transformation distribution as a learnable component rather than a fixed design choice. Specifically, EOLT employs a policy network that learns to automatically prioritize critical transformations and adaptively generate instance-specific perturbations via reinforcement learning, enabling explicit modeling of defensive bottlenecks while maintaining broad transferability. Extensive experiments demonstrate that our method achieves substantial improvements over state-of-the-art approaches, with 26% higher average robustness and up to 30% gains on challenging transformation categories.
๐
2025-12-08
๐ AAAI 2026
๐ฅ Feng Liang, Weixin Zeng, Runhao Zhao, Xiang Zhao
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, temporal reasoning, particularly under complex temporal constraints, remains a major challenge. To this end, existing approaches have explored symbolic methods, which encode temporal structure explicitly, and reflective mechanisms, which revise reasoning errors through multi-step inference. Nonetheless, symbolic approaches often underutilize the reasoning capabilities of LLMs, while reflective methods typically lack structured temporal representations, which can result in inconsistent or hallucinated reasoning. As a result, even when the correct temporal context is available, LLMs may still misinterpret or misapply time-related information, leading to incomplete or inaccurate answers. To address these limitations, in this work, we propose Neuro-Symbolic Temporal Reasoning (NeSTR), a novel framework that integrates structured symbolic representations with hybrid reflective reasoning to enhance the temporal sensitivity of LLM inference. NeSTR preserves explicit temporal relations through symbolic encoding, enforces logical consistency via verification, and corrects flawed inferences using abductive reflection. Extensive experiments on diverse temporal question answering benchmarks demonstrate that NeSTR achieves superior zero-shot performance and consistently improves temporal reasoning without any fine-tuning, showcasing the advantage of neuro-symbolic integration in enhancing temporal understanding in large language models.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Selim Sarowar, Sungho Kim
RoboticsComputer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP based and DINOv2 based approaches for 3D pose estimation in hand object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP based methods achieve better semantic consistency, while DINOv2 based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation and grasping, picking applications.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhaoyang Liu, Mokai Pan, Zhongyi Wang, Kaizhen Zhu, Haotao Lu et al.
Robotics
ๆฅ็ๆ่ฆ
Imitation learning with diffusion models has advanced robotic control by capturing multi-modal action distributions. However, existing approaches typically treat observations as high-level conditioning inputs to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, sampling must begin from random Gaussian noise, weakening the coupling between perception and control and often yielding suboptimal performance. We introduce BridgePolicy, a generative visuomotor policy that explicitly embeds observations within the stochastic differential equation via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich, informative prior rather than random noise, substantially improving precision and reliability in control. A key challenge is that classical diffusion bridges connect distributions with matched dimensionality, whereas robotic observations are heterogeneous and multi-modal and do not naturally align with the action space. To address this, we design a multi-modal fusion module and a semantic aligner that unify visual and state inputs and align observation and action representations, making the bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and five real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fei Luo, Ziwei Zhao, Mingxuan Wang, Duoyang Li, Zhe Qian et al.
ๆฅ็ๆ่ฆ
Federated Prompt Learning (FPL) offers a parameter-efficient solution for collaboratively training large models, but its performance is severely hindered by data heterogeneity, which causes locally trained prompts to become biased. Existing methods, focusing on aggregation or regularization, fail to address this root cause of local training bias. To this end, we propose Geometry-Guided Text Prompt Calibration (GGTPC), a novel framework that directly corrects this bias by providing clients with a global geometric prior. This prior, representing the shape of the global data distribution derived from the covariance matrix, is reconstructed on the server in a privacy-preserving manner. Clients then use a novel Geometry-Prior Calibration Layer (GPCL) to align their local feature distributions with this global prior during training. Extensive experiments show GGTPC's effectiveness. On the label-skewed CIFAR-100 dataset ($ฮฒ$=0.1), it outperforms the state-of-the-art by 2.15\%. Under extreme skew ($ฮฒ$=0.01), it improves upon the baseline by 9.17\%. Furthermore, as a plug-and-play module on the domain-skewed Office-Home dataset, it boosts FedAvg's performance by 4.60\%. These results demonstrate that GGTPC effectively mitigates data heterogeneity by correcting the fundamental local training bias, serving as a versatile module to enhance various FL algorithms.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xuan Zhang, Wenxuan Zhang, Anxu Wang, See-Kiong Ng, Yang Deng
Robotics
ๆฅ็ๆ่ฆ
Multi-agent role-playing has recently shown promise for studying social behavior with language agents, but existing simulations are mostly monolingual and fail to model cross-lingual interaction, an essential property of real societies. We introduce MASim, the first multilingual agent-based simulation framework that supports multi-turn interaction among generative agents with diverse sociolinguistic profiles. MASim offers two key analyses: (i) global public opinion modeling, by simulating how attitudes toward open-domain hypotheses evolve across languages and cultures, and (ii) media influence and information diffusion, via autonomous news agents that dynamically generate content and shape user behavior. To instantiate simulations, we construct the MAPS benchmark, which combines survey questions and demographic personas drawn from global population distributions. Experiments on calibration, sensitivity, consistency, and cultural case studies show that MASim reproduces sociocultural phenomena and highlights the importance of multilingual simulation for scalable, controlled computational social science.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhuoming Liu, Xiaofeng Gao, Feiyang Niu, Qiaozi Gao, Liu Liu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wonbeen Lee, Channyoung Lee, Junho Sohn, Hansam Cho
ๆฅ็ๆ่ฆ
With the recent surge in personalized learning, Intelligent Tutoring Systems (ITS) that can accurately track students' individual knowledge states and provide tailored learning paths based on this information are in demand as an essential task. This paper focuses on the core technology of Knowledge Tracing (KT) models that analyze students' sequences of interactions to predict their knowledge acquisition levels. However, existing KT models suffer from limitations such as restricted input data formats, cold start problems arising with new student enrollment or new question addition, and insufficient stability in real-world service environments. To overcome these limitations, a Practical Interlinked Concept Knowledge Tracing (PICKT) model that can effectively process multiple types of input data is proposed. Specifically, a knowledge map structures the relationships among concepts considering the question and concept text information, thereby enabling effective knowledge tracing even in cold start situations. Experiments reflecting real operational environments demonstrated the model's excellent performance and practicality. The main contributions of this research are as follows. First, a model architecture that effectively utilizes diverse data formats is presented. Second, significant performance improvements are achieved over existing models for two core cold start challenges: new student enrollment and new question addition. Third, the model's stability and practicality are validated through delicate experimental design, enhancing its applicability in real-world product environments. This provides a crucial theoretical and technical foundation for the practical implementation of next-generation ITS.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Latifa Dwiyanti, Sergio Ryan Wibisono, Hidetaka Nambo
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Explainable Artificial Intelligence (XAI) has become an increasingly important area of research, particularly as machine learning models are deployed in high-stakes domains. Among various XAI approaches, SHAP (SHapley Additive exPlanations) has gained prominence due to its ability to provide both global and local explanations across different machine learning models. While SHAP effectively visualizes feature importance, it often lacks contextual explanations that are meaningful for end-users, especially those without technical backgrounds. To address this gap, we propose a Python package that extends SHAP by integrating it with a large language model (LLM), specifically OpenAI's GPT, to generate contextualized textual explanations. This integration is guided by user-defined parameters (such as feature aliases, descriptions, and additional background) to tailor the explanation to both the model context and the user perspective. We hypothesize that this enhancement can improve the perceived understandability of SHAP explanations. To evaluate the effectiveness of the proposed package, we applied it in a healthcare-related case study and conducted user evaluations involving real end-users. The results, based on Likert-scale surveys and follow-up interviews, indicate that the generated explanations were perceived as more understandable and contextually appropriate compared to visual-only outputs. While the findings are preliminary, they suggest that combining visualization with contextualized text may support more user-friendly and trustworthy model explanations.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiayang Li, Chengjie Jiang, Junjun Jiang, Pengwei Liang, Jiayi Ma et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and lack the ability to flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse enables hierarchical and fine-grained control over fusion dynamics, overcoming the limitations of pre-fusion and post-fusion pipelines that struggle to inject high-level semantics. The training phase employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ground truth images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion-as well as text-controlled refinement and downstream tasks-within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.
๐
2025-12-08
๐ NeurIPS 2025
๐ฅ Georgios Ioannides, Christos Constantinou, Aman Chadha, Aaron Elkins, Linsey Pang et al.
ๆฅ็ๆ่ฆ
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage~1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage~2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5~Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jonghyun Park, Jong Chul Ye
Machine Learning
ๆฅ็ๆ่ฆ
Deep generative models have become powerful priors for solving inverse problems, and various training-free methods have been developed. However, when applied to latent flow models, existing methods often fail to converge to the posterior mode or suffer from manifold deviation within latent spaces. To mitigate this, here we introduce a novel training-free framework, FlowLPS, that solves inverse problems with pretrained flow models via a Langevin Proximal Sampling (LPS) strategy. Our method integrates Langevin dynamics for manifold-consistent exploration with proximal optimization for precise mode seeking, achieving a superior balance between reconstruction fidelity and perceptual quality across multiple inverse tasks on FFHQ and DIV2K, outperforming state of the art inverse solvers.
๐
2025-12-08
๐ This work plans to be submitted to the IEEE for po
๐ฅ Tanay Arora, Christof Teuscher
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The Lottery Ticket Hypothesis asserts the existence of highly sparse, trainable subnetworks ('winning tickets') within dense, randomly initialized neural networks. However, state-of-the-art methods of drawing these tickets, like Lottery Ticket Rewinding (LTR), are computationally prohibitive, while more efficient saliency-based Pruning-at-Initialization (PaI) techniques suffer from a significant accuracy-sparsity trade-off and fail basic sanity checks. In this work, we argue that PaI's reliance on first-order saliency metrics, which ignore inter-weight dependencies, contributes substantially to this performance gap, especially in the sparse regime. To address this, we introduce Concrete Ticket Search (CTS), an algorithm that frames subnetwork discovery as a holistic combinatorial optimization problem. By leveraging a Concrete relaxation of the discrete search space and a novel gradient balancing scheme (GRADBALANCE) to control sparsity, CTS efficiently identifies high-performing subnetworks near initialization without requiring sensitive hyperparameter tuning. Motivated by recent works on lottery ticket training dynamics, we further propose a knowledge distillation-inspired family of pruning objectives, finding that minimizing the reverse Kullback-Leibler divergence between sparse and dense network outputs (CTS-KL) is particularly effective. Experiments on varying image classification tasks show that CTS produces subnetworks that robustly pass sanity checks and achieve accuracy comparable to or exceeding LTR, while requiring only a small fraction of the computation. For example, on ResNet-20 on CIFAR10, it reaches 99.3% sparsity with 74.0% accuracy in 7.9 minutes, while LTR attains the same sparsity with 68.3% accuracy in 95.2 minutes. CTS's subnetworks outperform saliency-based methods across all sparsities, but its advantage over LTR is most pronounced in the highly sparse regime.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Siyang Jiang, Mu Yuan, Xiang Ji, Bufang Yang, Zeyu Liu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave due to the lack of large-scale data-caption resources. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture fine-grained action dynamics needed for HAU and HARn. We consider two ground-truth pair types: (1) data label (discrete category) and (2) data caption (textual description). Naively generating captions from labels often lacks logical and spatiotemporal consistency. We introduce CUHK-X, a large-scale multimodal dataset and benchmark suite for HAR, HAU, and HARn. CUHK-X contains 58,445 samples covering 40 actions performed by 30 participants across two indoor environments. To improve caption consistency, we propose a prompt-based scene creation method that leverages LLMs to generate logically connected activity sequences, followed by human validation. CUHK-X includes three benchmarks with six evaluation tasks. Experiments report average accuracies of 76.52% (HAR), 40.76% (HAU), and 70.25% (HARn). CUHK-X aims to enable the community to apply and develop data-intensive learning methods for robust, multimodal human activity analysis. Project page and code: https://openaiotlab.github.io/CUHK-X/ and https://github.com/openaiotlab/CUHK-X.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zebin Xing, Pengxuan Yang, Linbo Wang, Yichen Zhang, Yiming Hu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Current autonomous driving systems often favor end-to-end frameworks, which take sensor inputs like images and learn to map them into trajectory space via neural networks. Previous work has demonstrated that models can achieve better planning performance when provided with a prior distribution of possible trajectories. However, these approaches often overlook two critical aspects: 1) The appropriate trajectory prior can vary significantly across different driving scenarios. 2) Their trajectory evaluation mechanism lacks policy-driven refinement, remaining constrained by the limitations of one-stage supervised training. To address these issues, we explore improvements in two key areas. For problem 1, we employ MoE to apply different trajectory priors tailored to different scenarios. For problem 2, we utilize Reinforcement Learning to fine-tune the trajectory scoring mechanism. Additionally, we integrate models with different perception backbones to enhance perceptual features. Our integrated model achieved a score of 51.08 on the navsim ICCV benchmark, securing third place.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nithin Sivakumaran, Justin Chih-Yao Chen, David Wan, Yue Zhang, Jaehong Yoon et al.
Computer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, medical knowledge, etc.), but knowing which tools to call (and when to call them) can be challenging. We introduce DART, a multi-agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, spatial reasoning, etc.) that can resolve inter-agent disagreement. These tools allow for fruitful multi-agent discussion by introducing new information, and by providing tool-aligned agreement scores that highlight agents in agreement with expert tools, thereby facilitating discussion. We utilize an aggregator agent to select the best answer by providing the agent outputs and tool information. We test DART on four diverse benchmarks and show that our approach improves over multi-agent debate as well as over single agent tool-calling frameworks, beating the next-strongest baseline (multi-agent debate with a judge model) by 3.4% and 2.4% on A-OKVQA and MMMU respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool-calling, single agent, and multi-agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi-agent methods. Finally, we study the tool call distribution, finding that diverse tools are reliably used to help resolve disagreement.
๐
2025-12-08
๐ ArXiv Preprint (cs.SE)
๐ฅ Liping Han, Tingting Nie, Le Yu, Mingzhe Hu, Tao Yue
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Flight control software is typically designed with numerous configurable parameters governing multiple functionalities, enabling flexible adaptation to mission diversity and environmental uncertainty. Although developers and manufacturers usually provide recommendations for these parameters to ensure safe and stable operations, certain combinations of parameters with recommended values may still lead to unstable flight behaviors, thereby degrading the drone's robustness. To this end, we propose a Large Language Model (LLM) based approach for real-time repair of risk-prone configurations (named RisConFix) that degrade drone robustness. RisConFix continuously monitors the drone's operational state and automatically triggers a repair mechanism once abnormal flight behaviors are detected. The repair mechanism leverages an LLM to analyze relationships between configuration parameters and flight states, and then generates corrective parameter updates to restore flight stability. To ensure the validity of the updated configuration, RisConFix operates as an iterative process; it continuously monitors the drone's flight state and, if an anomaly persists after applying an update, automatically triggers the next repair cycle. We evaluated RisConFix through a case study of ArduPilot (with 1,421 groups of misconfigurations). Experimental results show that RisConFix achieved a best repair success rate of 97% and an optimal average number of repairs of 1.17, demonstrating its capability to effectively and efficiently repair risk-prone configurations in real time.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM reduces total training memory by approximately 50\%, eliminates up to 90\% of optimizer state memory overhead, and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Miguel Ingram, Arthur Joseph Merritt
Natural Language Processing
ๆฅ็ๆ่ฆ
Responding to Hodel et al.'s (2024) call for a formal definition of task relatedness in re-arc, we present the first 9-category taxonomy of all 400 tasks, validated at 97.5% accuracy via rule-based code analysis. We prove the taxonomy's visual coherence by training a CNN on raw grid pixels (95.24% accuracy on S3, 36.25% overall, 3.3x chance), then apply the taxonomy diagnostically to the original ARC-AGI-2 test set. Our curriculum analysis reveals 35.3% of tasks exhibit low neural affinity for Transformers--a distributional bias mirroring ARC-AGI-2. To probe this misalignment, we fine-tuned a 1.7M-parameter Transformer across 302 tasks, revealing a profound Compositional Gap: 210 of 302 tasks (69.5%) achieve >80% cell accuracy (local patterns) but <10% grid accuracy (global synthesis). This provides direct evidence for a Neural Affinity Ceiling Effect, where performance is bounded by architectural suitability, not curriculum. Applying our framework to Li et al.'s independent ViTARC study (400 specialists, 1M examples each) confirms its predictive power: Very Low affinity tasks achieve 51.9% versus 77.7% for High affinity (p<0.001), with a task at 0% despite massive data. The taxonomy enables precise diagnosis: low-affinity tasks (A2) hit hard ceilings, while high-affinity tasks (C1) reach 99.8%. These findings indicate that progress requires hybrid architectures with affinity-aligned modules. We release our validated taxonomy,
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Christopher Cruz
Robotics
ๆฅ็ๆ่ฆ
Agentic LLM frameworks promise autonomous behavior via task decomposition, tool use, and iterative planning, but most deployed systems remain brittle. They lack runtime introspection, cannot diagnose their own failure modes, and do not improve over time without human intervention. In practice, many agent stacks degrade into decorated chains of LLM calls with no structural mechanisms for reliability. We present VIGIL (Verifiable Inspection and Guarded Iterative Learning), a reflective runtime that supervises a sibling agent and performs autonomous maintenance rather than task execution. VIGIL ingests behavioral logs, appraises each event into a structured emotional representation, maintains a persistent EmoBank with decay and contextual policies, and derives an RBT diagnosis that sorts recent behavior into strengths, opportunities, and failures. From this analysis, VIGIL generates both guarded prompt updates that preserve core identity semantics and read only code proposals produced by a strategy engine that operates on log evidence and code hotspots. VIGIL functions as a state gated pipeline. Illegal transitions produce explicit errors rather than allowing the LLM to improvise. In a reminder latency case study, VIGIL identified elevated lag, proposed prompt and code repairs, and when its own diagnostic tool failed due to a schema conflict, it surfaced the internal error, produced a fallback diagnosis, and emitted a repair plan. This demonstrates meta level self repair in a deployed agent runtime.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhixiang Wang
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Background: The deployment of personalized Large Language Models (LLMs) is currently constrained by the stability-plasticity dilemma. Prevailing alignment methods, such as Supervised Fine-Tuning (SFT), rely on stochastic weight updates that often incur an "alignment tax" -- degrading general reasoning capabilities.
Methods: We propose the Soul Engine, a framework based on the Linear Representation Hypothesis, which posits that personality traits exist as orthogonal linear subspaces. We introduce SoulBench, a dataset constructed via dynamic contextual sampling. Using a dual-head architecture on a frozen Qwen-2.5 base, we extract disentangled personality vectors without modifying the backbone weights.
Results: Our experiments demonstrate three breakthroughs. First, High-Precision Profiling: The model achieves a Mean Squared Error (MSE) of 0.011 against psychological ground truth. Second, Geometric Orthogonality: T-SNE visualization confirms that personality manifolds are distinct and continuous, allowing for "Zero-Shot Personality Injection" that maintains original model intelligence. Third, Deterministic Steering: We achieve robust control over behavior via vector arithmetic, validated through extensive ablation studies.
Conclusion: This work challenges the necessity of fine-tuning for personalization. By transitioning from probabilistic prompting to deterministic latent intervention, we provide a mathematically rigorous foundation for safe, controllable AI personalization.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jungmin Lee, Gwangeun Byeon, Yulhwa Kim, Seokin Hong
Natural Language Processing
ๆฅ็ๆ่ฆ
Pruning has emerged as a promising direction for accelerating large language model (LLM) inference, yet existing approaches often suffer from instability because they rely on offline calibration data that may not generalize across inputs. In this work, we introduce Token Filtering, a lightweight online structured pruning technique that makes pruning decisions directly during inference without any calibration data. The key idea is to measure token redundancy via joint key-value similarity and skip redundant attention computations, thereby reducing inference cost while preserving critical information. To further enhance stability, we design a variance-aware fusion strategy that adaptively weights key and value similarity across heads, ensuring that informative tokens are retained even under high pruning ratios. This design introduces no additional memory overhead and provides a more reliable criterion for token importance. Extensive experiments on LLaMA-2 (7B/13B), LLaMA-3 (8B), and Mistral (7B) demonstrate that Token Filtering consistently outperforms prior structured pruning methods, preserving accuracy on commonsense reasoning benchmarks and maintaining strong performance on challenging tasks such as MMLU, even with 50% pruning.
๐
2025-12-08
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yunzhe Li, Jianan Wang, Hongzi Zhu, James Lin, Shan Chang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have become foundational components in a wide range of applications, including natural language understanding and generation, embodied intelligence, and scientific discovery. As their computational requirements continue to grow, these models are increasingly deployed as cloud-based services, allowing users to access powerful LLMs via the Internet. However, this deployment model introduces a new class of threat: denial-of-service (DoS) attacks via unbounded reasoning, where adversaries craft specially designed inputs that cause the model to enter excessively long or infinite generation loops. These attacks can exhaust backend compute resources, degrading or denying service to legitimate users. To mitigate such risks, many LLM providers adopt a closed-source, black-box setting to obscure model internals. In this paper, we propose ThinkTrap, a novel input-space optimization framework for DoS attacks against LLM services even in black-box environments. The core idea of ThinkTrap is to first map discrete tokens into a continuous embedding space, then undertake efficient black-box optimization in a low-dimensional subspace exploiting input sparsity. The goal of this optimization is to identify adversarial prompts that induce extended or non-terminating generation across several state-of-the-art LLMs, achieving DoS with minimal token overhead. We evaluate the proposed attack across multiple commercial, closed-source LLM services. Our results demonstrate that, even far under the restrictive request frequency limits commonly enforced by these platforms, typically capped at ten requests per minute (10 RPM), the attack can degrade service throughput to as low as 1% of its original capacity, and in some cases, induce complete service failure.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Rongjia Zhou, Chengzhuo Li, Carl Yang, Jiaying Lu
ๆฅ็ๆ่ฆ
Heart failure (HF) is one of the leading causes of rehospitalization among older adults in the United States. Although clinical notes contain rich, detailed patient information and make up a large portion of electronic health records (EHRs), they remain underutilized for HF readmission risk analysis. Traditional computational models for HF readmission often rely on expert-crafted rules, medical thesauri, and ontologies to interpret clinical notes, which are typically written under time pressure and may contain misspellings, abbreviations, and domain-specific jargon. We present ClinNoteAgents, an LLM-based multi-agent framework that transforms free-text clinical notes into (1) structured representations of clinical and social risk factors for association analysis and (2) clinician-style abstractions for HF 30-day readmission prediction. We evaluate ClinNoteAgents on 3,544 notes from 2,065 patients (readmission rate=35.16%), demonstrating strong performance in extracting risk factors from free-text, identifying key contributing factors, and predicting readmission risk. By reducing reliance on structured fields and minimizing manual annotation and model training, ClinNoteAgents provides a scalable and interpretable approach to note-based HF readmission risk modeling in data-limited healthcare systems.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anton Morgunov, Victor S. Batista
ๆฅ็ๆ่ฆ
Progress in computer-aided synthesis planning (CASP) is obscured by the lack of standardized evaluation infrastructure and the reliance on metrics that prioritize topological completion over chemical validity. We introduce RetroCast, a unified evaluation suite that standardizes heterogeneous model outputs into a common schema to enable statistically rigorous, apples-to-apples comparison. The framework includes a reproducible benchmarking pipeline with stratified sampling and bootstrapped confidence intervals, accompanied by SynthArena, an interactive platform for qualitative route inspection. We utilize this infrastructure to evaluate leading search-based and sequence-based algorithms on a new suite of standardized benchmarks. Our analysis reveals a divergence between "solvability" (stock-termination rate) and route quality; high solvability scores often mask chemical invalidity or fail to correlate with the reproduction of experimental ground truths. Furthermore, we identify a "complexity cliff" in which search-based methods, despite high solvability rates, exhibit a sharp performance decay in reconstructing long-range synthetic plans compared to sequence-based approaches. We release the full framework, benchmark definitions, and a standardized database of model predictions to support transparent and reproducible development in the field.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiannan Yang, Veronika Thost, Tengfei Ma
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Self-supervised learning (SSL) plays a central role in molecular representation learning. Yet, many recent innovations in masking-based pretraining are introduced as heuristics and lack principled evaluation, obscuring which design choices are genuinely effective. This work cast the entire pretrain-finetune workflow into a unified probabilistic framework, enabling a transparent comparison and deeper understanding of masking strategies. Building on this formalism, we conduct a controlled study of three core design dimensions: masking distribution, prediction target, and encoder architecture, under rigorously controlled settings. We further employ information-theoretic measures to assess the informativeness of pretraining signals and connect them to empirically benchmarked downstream performance. Our findings reveal a surprising insight: sophisticated masking distributions offer no consistent benefit over uniform sampling for common node-level prediction tasks. Instead, the choice of prediction target and its synergy with the encoder architecture are far more critical. Specifically, shifting to semantically richer targets yields substantial downstream improvements, particularly when paired with expressive Graph Transformer encoders. These insights offer practical guidance for developing more effective SSL methods for molecular graphs.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu
RoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90\% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity, is arguable by cognitive scientist to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyang He, Jie Wang, Jiangning Zhang, Zhucun Xue, Xingyuan Bu et al.
ๆฅ็ๆ่ฆ
The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks with three key metrics highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline. Project page is at https://github.com/lewandofskee/OpenVE.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, Qixing Huang
ๆฅ็ๆ่ฆ
Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data providing precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving metrics of geometric consistency, motion coherence, and reducing view-time artifacts over competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation.
๐
2025-12-08
๐ SIGGRAPH
๐ฅ Shai Krakovsky, Gal Fiebelman, Sagie Benaim, Hadar Averbuch-Elor
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for a more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based, hash encoder. This significantly improves efficiency both in runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of ``looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Most visual generative models compress images into a latent space before applying diffusion or autoregressive modelling. Yet, existing approaches such as VAEs and foundation model aligned encoders implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce \textbf{Distribution-Matching VAE} (\textbf{DMVAE}), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other prior distributions. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching gFID equals 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is avaliable at https://github.com/sen-ye/dmvae.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maximilian Schall, Felix Leonard Knรถfel, Noah Elias Kรถnig, Jan Jonas Kubeler, Maximilian von Klinski et al.
Machine Learning
ๆฅ็ๆ่ฆ
Monitoring critically endangered western lowland gorillas is currently hampered by the immense manual effort required to re-identify individuals from vast archives of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, "in-the-wild" video datasets suitable for training robust deep learning models. To address this gap, we introduce a comprehensive benchmark with three novel datasets: Gorilla-SPAC-Wild, the largest video dataset for wild primate re-identification to date; Gorilla-Berlin-Zoo, for assessing cross-domain re-identification generalization; and Gorilla-SPAC-MoT, for evaluating multi-object tracking in camera trap footage. Building on these datasets, we present GorillaWatch, an end-to-end pipeline integrating detection, tracking, and re-identification. To exploit temporal information, we introduce a multi-frame self-supervised pretraining strategy that leverages consistency in tracklets to learn domain-specific features without manual labels. To ensure scientific validity, a differentiable adaptation of AttnLRP verifies that our model relies on discriminative biometric traits rather than background correlations. Extensive benchmarking subsequently demonstrates that aggregating features from large-scale image backbones outperforms specialized video architectures. Finally, we address unsupervised population counting by integrating spatiotemporal constraints into standard clustering to mitigate over-segmentation. We publicly release all code and datasets to facilitate scalable, non-invasive monitoring of endangered species
๐
2025-12-08
๐ AAAI 2026
๐ฅ Menglin Wang, Xiaojin Gong, Jiachen Li, Genlin Ji
ๆฅ็ๆ่ฆ
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match individuals across visible and infrared cameras without relying on any annotation. Given the significant gap across visible and infrared modality, estimating reliable cross-modality association becomes a major challenge in USVI-ReID. Existing methods usually adopt optimal transport to associate the intra-modality clusters, which is prone to propagating the local cluster errors, and also overlooks global instance-level relations. By mining and attending to the visible-infrared modality bias, this paper focuses on addressing cross-modality learning from two aspects: bias-mitigated global association and modality-invariant representation learning. Motivated by the camera-aware distance rectification in single-modality re-ID, we propose modality-aware Jaccard distance to mitigate the distance bias caused by modality discrepancy, so that more reliable cross-modality associations can be estimated through global clustering. To further improve cross-modality representation learning, a `split-and-contrast' strategy is designed to obtain modality-specific global prototypes. By explicitly aligning these prototypes under global association guidance, modality-invariant yet ID-discriminative representation learning can be achieved. While conceptually simple, our method obtains state-of-the-art performance on benchmark VI-ReID datasets and outperforms existing methods by a significant margin, validating its effectiveness.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mayank Anand, Ujair Alam, Surya Prakash, Priya Shukla, Gora Chand Nandi et al.
Computer Vision
ๆฅ็ๆ่ฆ
Clinical ultrasound acquisition is highly operator-dependent, where rapid probe motion and brightness fluctuations often lead to reconstruction errors that reduce trust and clinical utility. We present UltrasODM, a dual-stream framework that assists sonographers during acquisition through calibrated per-frame uncertainty, saliency-based diagnostics, and actionable prompts. UltrasODM integrates (i) a contrastive ranking module that groups frames by motion similarity, (ii) an optical-flow stream fused with Dual-Mamba temporal modules for robust 6-DoF pose estimation, and (iii) a Human-in-the-Loop (HITL) layer combining Bayesian uncertainty, clinician-calibrated thresholds, and saliency maps highlighting regions of low confidence. When uncertainty exceeds the threshold, the system issues unobtrusive alerts suggesting corrective actions such as re-scanning highlighted regions or slowing the sweep. Evaluated on a clinical freehand ultrasound dataset, UltrasODM reduces drift by 15.2%, distance error by 12.1%, and Hausdorff distance by 10.1% relative to UltrasOM, while producing per-frame uncertainty and saliency outputs. By emphasizing transparency and clinician feedback, UltrasODM improves reconstruction reliability and supports safer, more trustworthy clinical workflows. Our code is publicly available at https://github.com/AnandMayank/UltrasODM.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shihao Zhao, Yitong Chen, Zeyinzi Jiang, Bojia Zi, Shaozhe Hao et al.
RoboticsMachine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract relevant parameters, and achieve superior performance across a variety of understanding and generation tasks.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song et al.
RoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse, tending to generate conservative and homogeneous behaviors. While DiffusionDrive employs predefined anchors representing different driving intentions to partition the action space and generate diverse trajectories, its reliance on imitation learning lacks sufficient constraints, resulting in a dilemma between diversity and consistent high quality. In this work, we propose DiffusionDriveV2, which leverages reinforcement learning to both constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model. First, we use scale-adaptive multiplicative noise, ideal for trajectory planning, to promote broad exploration. Second, we employ intra-anchor GRPO to manage advantage estimation among samples generated from a single anchor, and inter-anchor truncated GRPO to incorporate a global perspective across different anchors, preventing improper advantage comparisons between distinct intentions (e.g., turning vs. going straight), which can lead to further mode collapse. DiffusionDriveV2 achieves 91.2 PDMS on the NAVSIM v1 dataset and 85.5 EPDMS on the NAVSIM v2 dataset in closed-loop evaluation with an aligned ResNet-34 backbone, setting a new record. Further experiments validate that our approach resolves the dilemma between diversity and consistent high quality for truncated diffusion models, achieving the best trade-off. Code and model will be available at https://github.com/hustvl/DiffusionDriveV2
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dengjia Zhang, Charles Weng, Katherine Guerrerio, Yi Lu, Kenton Murray et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meng Cao, Xingyu Li, Xue Liu, Ian Reid, Xiaodan Liang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Despite advancements in Multi-modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closedloop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in longhorizontal reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human-like active spatial mental simulation for MLLMs.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sangha Park, Seungryong Yoo, Jisoo Mok, Sungroh Yoon
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model's visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10\%p improvement in CHAIR\_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at https://github.com/wiarae/SAVE.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aidas Aglinskas, Stefano Anzellotti
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Action recognition is also key for applications ranging from robotics to healthcare monitoring. Action information can be extracted from the body pose and movements, as well as from the background scene. However, the extent to which deep neural networks (DNNs) make use of information about the body and information about the background remains unclear. Since these two sources of information may be correlated within a training dataset, DNNs might learn to rely predominantly on one of them, without taking full advantage of the other. Unlike DNNs, humans have domain-specific brain regions selective for perceiving bodies, and regions selective for perceiving scenes. The present work tests whether humans are thus more effective at extracting information from both body and background, and whether building brain-inspired deep network architectures with separate domain-specific streams for body and scene perception endows them with more human-like performance. We first demonstrate that DNNs trained using the HAA500 dataset perform almost as accurately on versions of the stimuli that show both body and background and on versions of the stimuli from which the body was removed, but are at chance-level for versions of the stimuli from which the background was removed. Conversely, human participants (N=28) can recognize the same set of actions accurately with all three versions of the stimuli, and perform significantly better on stimuli that show only the body than on stimuli that show only the background. Finally, we implement and test a novel architecture patterned after domain specificity in the brain with separate streams to process body and background information. We show that 1) this architecture improves action recognition performance, and 2) its accuracy across different versions of the stimuli follows a pattern that matches more closely the pattern of accuracy observed in human participants.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fan Yang, Heyuan Li, Peihao Li, Weihao Yuan, Lingteng Qiu et al.
Computer Vision
ๆฅ็ๆ่ฆ
Generating high-fidelity upper-body 3D avatars from one-shot input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guides a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sayak Dutta, Harish Katti, Shashikant Verma, Shanmuganathan Raman
Computer Vision
ๆฅ็ๆ่ฆ
Animal tracking and pose estimation systems, such as STEP (Simultaneous Tracking and Pose Estimation) and ViTPose, experience substantial performance drops when processing images and videos with cage structures and systematic occlusions. We present a three-stage preprocessing pipeline that addresses this limitation through: (1) cage segmentation using a Gabor-enhanced ResNet-UNet architecture with tunable orientation filters, (2) cage inpainting using CRFill for content-aware reconstruction of occluded regions, and (3) evaluation of pose estimation and tracking on the uncaged frames. Our Gabor-enhanced segmentation model leverages orientation-aware features with 72 directional kernels to accurately identify and segment cage structures that severely impair the performance of existing methods. Experimental validation demonstrates that removing cage occlusions through our pipeline enables pose estimation and tracking performance comparable to that in environments without occlusions. We also observe significant improvements in keypoint detection accuracy and trajectory consistency.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Leo Fillioux, Enzo Ferrante, Paul-Henry Cournรจde, Maria Vakalopoulou, Stergios Christodoulidis
ๆฅ็ๆ่ฆ
Large foundation models have emerged in the last years and are pushing performance boundaries for a variety of tasks. Training or even finetuning such models demands vast datasets and computational resources, which are often scarce and costly. Adaptation methods provide a computationally efficient solution to address these limitations by allowing such models to be finetuned on small amounts of data and computing power. This is achieved by appending new trainable modules to frozen backbones with only a fraction of the trainable parameters and fitting only these modules on novel tasks. Recently, the VeRA adapter was shown to excel in parameter-efficient adaptations by utilizing a pair of frozen random low-rank matrices shared across all layers. In this paper, we propose PVeRA, a probabilistic version of the VeRA adapter, which modifies the low-rank matrices of VeRA in a probabilistic manner. This modification naturally allows handling inherent ambiguities in the input and allows for different sampling configurations during training and testing. A comprehensive evaluation was performed on the VTAB-1k benchmark and seven adapters, with PVeRA outperforming VeRA and other adapters. Our code for training models with PVeRA and benchmarking all adapters is available https://github.com/leofillioux/pvera.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sangha Park, Eunji Kim, Yeongtak Oh, Jooyoung Choi, Sungroh Yoon
Computer Vision
ๆฅ็ๆ่ฆ
Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives-those directly tied to the prompt's alignment error-and untargeted negatives-tokens unrelated to the prompt but present in the generated image-can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at https://github.com/wiarae/NPC.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arslan Artykov, Corentin Sautier, Vincent Lepetit
Robotics
ๆฅ็ๆ่ฆ
Understanding articulated objects is a fundamental challenge in robotics and digital twin creation. To effectively model such objects, it is essential to recover both part segmentation and the underlying joint parameters. Despite the importance of this task, previous work has largely focused on setups like multi-view systems, object scanning, or static cameras. In this paper, we present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Trained solely on synthetic data, our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments. Project webpage: https://aartykov.github.io/sim2art/
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sujoy Nath, Arkaprabha Basu, Sharanya Dasgupta, Swagatam Das
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding tasks. While these models often produce linguistically coherent output, they often suffer from hallucinations, generating descriptions that are factually inconsistent with the visual content, potentially leading to adverse consequences. Therefore, the assessment of hallucinations in MLLM has become increasingly crucial in the model development process. Contemporary methodologies predominantly depend on external LLM evaluators, which are themselves susceptible to hallucinations and may present challenges in terms of domain adaptation. In this study, we propose the hypothesis that hallucination manifests as measurable irregularities within the internal layer dynamics of MLLMs, not merely due to distributional shifts but also in the context of layer-wise analysis of specific assumptions. By incorporating such modifications, \textsc{\textsc{HalluShift++}} broadens the efficacy of hallucination detection from text-based large language models (LLMs) to encompass multimodal scenarios. Our codebase is available at https://github.com/C0mRD/HalluShift_Plus.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mehmet Yigit Avci, Pedro Borges, Virginia Fernandez, Paul Wright, Mehmet Yigitsoy et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Deep learning holds immense promise for transforming medical image analysis, yet its clinical generalization remains profoundly limited. A major barrier is data heterogeneity. This is particularly true in Magnetic Resonance Imaging, where scanner hardware differences, diverse acquisition protocols, and varying sequence parameters introduce substantial domain shifts that obscure underlying biological signals. Data harmonization methods aim to reduce these instrumental and acquisition variability, but existing approaches remain insufficient. When applied to imaging data, image-based harmonization approaches are often restricted by the need for target images, while existing text-guided methods rely on simplistic labels that fail to capture complex acquisition details or are typically restricted to datasets with limited variability, failing to capture the heterogeneity of real-world clinical environments. To address these limitations, we propose DIST-CLIP (Disentangled Style Transfer with CLIP Guidance), a unified framework for MRI harmonization that flexibly uses either target images or DICOM metadata for guidance. Our framework explicitly disentangles anatomical content from image contrast, with the contrast representations being extracted using pre-trained CLIP encoders. These contrast embeddings are then integrated into the anatomical content via a novel Adaptive Style Transfer module. We trained and evaluated DIST-CLIP on diverse real-world clinical datasets, and showed significant improvements in performance when compared against state-of-the-art methods in both style translation fidelity and anatomical preservation, offering a flexible solution for style transfer and standardizing MRI data. Our code and weights will be made publicly available upon publication.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ronan John, Aditya Kesari, Vincenzo DiMatteo, Kristin Dana
Robotics
ๆฅ็ๆ่ฆ
We address the challenge of predicting human visual attention during real-world navigation by measuring and modeling egocentric pedestrian eye gaze in an outdoor campus setting. We introduce the EgoCampus dataset, which spans 25 unique outdoor paths over 6 km across a university campus with recordings from more than 80 distinct human pedestrians, resulting in a diverse set of gaze-annotated videos. The system used for collection, Meta's Project Aria glasses, integrates eye tracking, front-facing RGB cameras, inertial sensors, and GPS to provide rich data from the human perspective. Unlike many prior egocentric datasets that focus on indoor tasks or exclude eye gaze information, our work emphasizes visual attention while subjects walk in outdoor campus paths. Using this data, we develop EgoCampusNet, a novel method to predict eye gaze of navigating pedestrians as they move through outdoor environments. Our contributions provide both a new resource for studying real-world attention and a resource for future work in gaze prediction models for navigation. Dataset and code are available upon request, and will be made publicly available at a later date at https://github.com/ComputerVisionRutgers/EgoCampus .
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shiaho Li, Naisheng Ye, Tianyu Li, Kashyap Chitta, Tuo An et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events which are essential for this task are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate $5\times$ more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hamad Almazrouei, Mariam Al Nasseri, Maha Alzaabi
RoboticsMachine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Traditional sea exploration faces significant challenges due to extreme conditions, limited visibility, and high costs, resulting in vast unexplored ocean regions. This paper presents an innovative AI-powered Autonomous Underwater Vehicle (AUV) system designed to overcome these limitations by automating underwater object detection, analysis, and reporting. The system integrates YOLOv12 Nano for real-time object detection, a Convolutional Neural Network (CNN) (ResNet50) for feature extraction, Principal Component Analysis (PCA) for dimensionality reduction, and K-Means++ clustering for grouping marine objects based on visual characteristics. Furthermore, a Large Language Model (LLM) (GPT-4o Mini) is employed to generate structured reports and summaries of underwater findings, enhancing data interpretation. The system was trained and evaluated on a combined dataset of over 55,000 images from the DeepFish and OzFish datasets, capturing diverse Australian marine environments. Experimental results demonstrate the system's capability to detect marine objects with a mAP@0.5 of 0.512, a precision of 0.535, and a recall of 0.438. The integration of PCA effectively reduced feature dimensionality while preserving 98% variance, facilitating K-Means clustering which successfully grouped detected objects based on visual similarities. The LLM integration proved effective in generating insightful summaries of detections and clusters, supported by location data. This integrated approach significantly reduces the risks associated with human diving, increases mission efficiency, and enhances the speed and depth of underwater data analysis, paving the way for more effective scientific research and discovery in challenging marine environments.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanye Liu, Hanxiao Zhang, Nannan Shi, Yuxin Shi, Arif Mahmood et al.
ๆฅ็ๆ่ฆ
Liver fibrosis represents a significant global health burden, necessitating accurate staging for effective clinical management. This report introduces the LiQA (Liver Fibrosis Quantification and Analysis) dataset, established as part of the CARE 2024 challenge. Comprising $440$ patients with multi-phase, multi-center MRI scans, the dataset is curated to benchmark algorithms for Liver Segmentation (LiSeg) and Liver Fibrosis Staging (LiFS) under complex real-world conditions, including domain shifts, missing modalities, and spatial misalignment. We further describe the challenge's top-performing methodology, which integrates a semi-supervised learning framework with external data for robust segmentation, and utilizes a multi-view consensus approach with Class Activation Map (CAM)-based regularization for staging. Evaluation of this baseline demonstrates that leveraging multi-source data and anatomical constraints significantly enhances model robustness in clinical settings.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiqi Li, Wenhuan Li, Tengfei Wang, Zhenwei Wang, Junta Wu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Compositionality is critical for 3D object and scene generation, but existing part-aware 3D generation methods suffer from poor scalability due to quadratic global attention costs when increasing the number of components. In this work, we present MoCA, a compositional 3D generative model with two key designs: (1) importance-based component routing that selects top-k relevant components for sparse global attention, and (2) unimportant components compression that preserve contextual priors of unselected components while reducing computational complexity of global attention. With these designs, MoCA enables efficient, fine-grained compositional 3D asset creation with scalable number of components. Extensive experiments show MoCA outperforms baselines on both compositional object and scene generation tasks. Project page: https://lizhiqi49.github.io/MoCA
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingna Qiu, Frauke Wilm, Mathias รttl, Jonas Utz, Maja Schlereth et al.
ๆฅ็ๆ่ฆ
Active learning improves annotation efficiency by selecting the most informative samples for annotation and model training. While most prior work has focused on selecting informative images for classification tasks, we investigate the more challenging setting of dense prediction, where annotations are more costly and time-intensive, especially in medical imaging. Region-level annotation has been shown to be more efficient than image-level annotation for these tasks. However, existing methods for representative annotation region selection suffer from high computational and memory costs, irrelevant region choices, and heavy reliance on uncertainty sampling. We propose decomposition sampling (DECOMP), a new active learning sampling strategy that addresses these limitations. It enhances annotation diversity by decomposing images into class-specific components using pseudo-labels and sampling regions from each class. Class-wise predictive confidence further guides the sampling process, ensuring that difficult classes receive additional annotations. Across ROI classification, 2-D segmentation, and 3-D segmentation, DECOMP consistently surpasses baseline methods by better sampling minority-class regions and boosting performance on these challenging classes. Code is in https://github.com/JingnaQiu/DECOMP.git.
๐
2025-12-08
๐ NeurIPS 2025
๐ฅ Hanshi Wang, Zijian Cai, Jin Gao, Yiwei Zhang, Weiming Hu et al.
Robotics
ๆฅ็ๆ่ฆ
Online, real-time, and fine-grained 3D segmentation constitutes a fundamental capability for embodied intelligent agents to perceive and comprehend their operational environments. Recent advancements employ predefined object queries to aggregate semantic information from Vision Foundation Models (VFMs) outputs that are lifted into 3D point clouds, facilitating spatial information propagation through inter-query interactions. Nevertheless, perception is an inherently dynamic process, rendering temporal understanding a critical yet overlooked dimension within these prevailing query-based pipelines. Therefore, to further unlock the temporal environmental perception capabilities of embodied agents, our work reconceptualizes online 3D segmentation as an instance tracking problem (AutoSeg3D). Our core strategy involves utilizing object queries for temporal information propagation, where long-term instance association promotes the coherence of features and object identities, while short-term instance update enriches instant observations. Given that viewpoint variations in embodied robotics often lead to partial object visibility across frames, this mechanism aids the model in developing a holistic object understanding beyond incomplete instantaneous views. Furthermore, we introduce spatial consistency learning to mitigate the fragmentation problem inherent in VFMs, yielding more comprehensive instance information for enhancing the efficacy of both long-term and short-term temporal learning. The temporal information exchange and consistency learning facilitated by these sparse object queries not only enhance spatial comprehension but also circumvent the computational burden associated with dense temporal point cloud interactions. Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200 and delivering consistent gains on ScanNet, SceneNN, and 3RScan datasets.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenzhen Dong, Jieming Yu, Yiming Huang, Hongqiu Wang, Lei Zhu et al.
Computer Vision
ๆฅ็ๆ่ฆ
The recent Segment Anything Model (SAM) 3 has introduced significant advancements over its predecessor, SAM 2, particularly with the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3's 3D reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while zero-shot evaluations on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaili Qi, Zhongyi Huang, Wenli Yang
Machine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
To address the challenge of segmenting noisy images with blurred or fragmented boundaries, this paper presents a robust version of Variational Model Based Tailored UNet (VM_TUNet), a hybrid framework that integrates variational methods with deep learning. The proposed approach incorporates physical priors, an edge detector and a mean curvature term, into a modified Cahn-Hilliard equation, aiming to combine the interpretability and boundary-smoothing advantages of variational partial differential equations (PDEs) with the strong representational ability of deep neural networks. The architecture consists of two collaborative modules: an F module, which conducts efficient frequency domain preprocessing to alleviate poor local minima, and a T module, which ensures accurate and stable local computations, backed by a stability estimate. Extensive experiments on three benchmark datasets indicate that the proposed method achieves a balanced trade-off between performance and computational efficiency, which yields competitive quantitative results and improved visual quality compared to pure convolutional neural network (CNN) based models, while achieving performance close to that of transformer-based method with reasonable computational expense.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu et al.
Computer Vision
ๆฅ็ๆ่ฆ
We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. 1) We achieve this through rigorous data curation strategies across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state-of-the-art (SOTA), delivering superior text-rendering capabilities and remarkable photorealism, and significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. By supporting even complex and rare characters, it outperforms both major open-source and commercial solutions in coverage, while also achieving superior accuracy. 3) The model achieves remarkable efficiency through its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels in image editing, achieving SOTA results on standard benchmarks with superior editing consistency compared to other open-source works. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date. We are releasing not only multiple model versions for text-to-image and image editing, including checkpoints after mid-training and post-training stages, but also the entire toolchain of training procedure. We believe that the openness of LongCat-Image will provide robust support for developers and researchers, pushing the frontiers of visual content creation.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper, however, identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term as "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) employ deeper visual tokens than weaker models (e.g., LLaVA-1.5). Based on our findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at https://github.com/YahongWang1/Information-Horizon.
๐
2025-12-08
๐ ArXiv Preprint (eess.IV)
๐ฅ Xuecheng Li, Weikuan Jia, Komildzhon Sharipov, Sharipov Hotam Beknazarovich, Farzona S. Ataeva et al.
ๆฅ็ๆ่ฆ
Accurate segmentation of spinal structures in X-ray images is a prerequisite for quantitative scoliosis assessment, including Cobb angle measurement, vertebral translation estimation and curvature classification. In routine practice, clinicians acquire coronal, left-bending and right-bending radiographs to jointly evaluate deformity severity and spinal flexibility. However, the segmentation step remains heavily manual, time-consuming and non-reproducible, particularly in low-contrast images and in the presence of rib shadows or overlapping tissues. To address these limitations, this paper proposes R2MF-Net, a recurrent residual multi-path encoder--decoder network tailored for automatic segmentation of multi-directional spine X-ray images. The overall design consists of a coarse segmentation network and a fine segmentation network connected in cascade. Both stages adopt an improved Inception-style multi-branch feature extractor, while a recurrent residual jump connection (R2-Jump) module is inserted into skip paths to gradually align encoder and decoder semantics. A multi-scale cross-stage skip (MC-Skip) mechanism allows the fine network to reuse hierarchical representations from multiple decoder levels of the coarse network, thereby strengthening the stability of segmentation across imaging directions and contrast conditions. Furthermore, a lightweight spatial-channel squeeze-and-excitation block (SCSE-Lite) is employed at the bottleneck to emphasize spine-related activations and suppress irrelevant structures and background noise. We evaluate R2MF-Net on a clinical multi-view radiograph dataset comprising 228 sets of coronal, left-bending and right-bending spine X-ray images with expert annotations.
๐
2025-12-08
๐ ArXiv Preprint (eess.IV)
๐ฅ Xuecheng Li, Weikuan Jia, Komildzhon Sharipov, Alimov Ruslan, Lutfuloev Mazbutdzhon et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Accurate three-dimensional delineation of liver tumors on contrast-enhanced CT is a prerequisite for treatment planning, navigation and response assessment, yet manual contouring is slow, observer-dependent and difficult to standardise across centres. Automatic segmentation is complicated by low lesion-parenchyma contrast, blurred or incomplete boundaries, heterogeneous enhancement patterns, and confounding structures such as vessels and adjacent organs. We propose a hybrid framework that couples an attention-enhanced cascaded U-Net with handcrafted radiomics and voxel-wise 3D CNN refinement for joint liver and liver-tumor segmentation. First, a 2.5D two-stage network with a densely connected encoder, sub-pixel convolution decoders and multi-scale attention gates produces initial liver and tumor probability maps from short stacks of axial slices. Inter-slice temporal consistency is then enforced by a simple three-slice refinement rule along the cranio-caudal direction, which restores thin and tiny lesions while suppressing isolated noise. Next, 728 radiomic descriptors spanning intensity, texture, shape, boundary and wavelet feature groups are extracted from candidate lesions and reduced to 20 stable, highly informative features via multi-strategy feature selection; a random forest classifier uses these features to reject false-positive regions. Finally, a compact 3D patch-based CNN derived from AlexNet operates in a narrow band around the tumor boundary to perform voxel-level relabelling and contour smoothing.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuecheng Li, Weikuan Jia, Alisher Kurbonaliev, Qurbonaliev Alisher, Khudzhamkulov Rustam et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Cross-modal learning has become a fundamental paradigm for integrating heterogeneous information sources such as images, text, and structured attributes. However, multimodal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. In particular, high-variance modalities tend to overshadow weaker but semantically important signals, while naรฏve fusion strategies entangle modality-shared and modality-specific factors in an uncontrolled manner. This makes it difficult to understand which modality actually drives a prediction and to maintain robustness when some modalities are noisy or missing. To address these challenges, we propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), a simple yet effective framework that disentangles modality-specific and modality-shared information through residual decomposition and explicit semantic decorrelation constraints. DSRSD-Net introduces: (1) a dual-stream representation learning module that separates intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) a residual semantic alignment head that maps shared factors from different modalities into a common space using a combination of contrastive and regression-style objectives; and (3) a decorrelation and orthogonality loss that regularizes the covariance structure of the shared space while enforcing orthogonality between shared and private streams, thereby suppressing cross-modal redundancy and preventing feature collapse. Experimental results on two large-scale educational benchmarks demonstrate that DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kassoum Sanogo, Renzo Ardiccioni
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) frequently generate hallucinated content plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHAL BENCH benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We validate our approach on Qwen2.5-VL-7B [23], with plans to extend validation across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shimin Zhang, Xianwei Chen, Yufan Shen, Ziyuan Ye, Jibin Wu
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often leads to entropy collapse, resulting in premature policy convergence and performance saturation. While manipulating token-level entropy has proven effective for promoting policy exploration, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis and intervention of the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden-state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric to quantify the heterogeneity of the model's latent dynamics, serving as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a paradigm that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX significantly mitigates premature convergence and consistently achieves state-of-the-art performance.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fei Yu, Yu Liu, Luyang Tang, Mingchao Sun, Zengye Ge et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail.
To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs.
Our method's scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4\,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junkai Lin, Hang Long, Huipeng Guo, Jielei Zhang, JiaYi Yang et al.
ๆฅ็ๆ่ฆ
Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface.MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies.This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikita Gabdullin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The overall neural network (NN) performance is closely related to the properties of its embedding distribution in latent space (LS). It has recently been shown that predefined vector systems, specifically An root system vectors, can be used as targets for latent space configurations (LSC) to ensure the desired LS structure. One of the main LSC advantage is the possibility of training classifier NNs without classification layers, which facilitates training NNs on datasets with extremely large numbers of classes. This paper provides a more general overview of possible vector systems for NN training along with their properties and methods for vector system construction. These systems are used to configure LS of encoders and visual transformers to significantly speed up ImageNet-1K and 50k-600k classes LSC training. It is also shown that using the minimum number of LS dimensions for a specific number of classes results in faster convergence. The latter has potential advantages for reducing the size of vector databases used to store NN embeddings.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ryota Okumura, Kaede Shiohara, Toshihiko Yamasaki
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Recent text-to-image models, such as Stable Diffusion, have achieved impressive visual quality, yet they often suffer from geometric inconsistencies that undermine the structural realism of generated scenes. One prominent issue is vanishing point inconsistency, where projections of parallel lines fail to converge correctly in 2D space. This leads to structurally implausible geometry that degrades spatial realism, especially in architectural scenes. We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. We also introduce geometric constraints that explicitly encourage alignment between image edges and perspective cues. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines. This capability is particularly valuable for applications that require accurate spatial structure, such as image-to-3D reconstruction. The dataset and source code are available at https://github.com/RyotaOkumura/ControlVP .
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yao Teng, Zhihuan Jiang, Han Shi, Xian Liu, Xuefei Ning et al.
Computer Vision
ๆฅ็ๆ่ฆ
Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed, because these models require hundreds to thousands of sequential forward passes for next-token prediction during inference. To accelerate autoregressive text-to-image generation, we propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm. Unlike traditional next-token prediction, SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps. Specifically, it integrates the iterative multi-token prediction mechanism from Jacobi decoding, with the probabilistic drafting-and-verification mechanism from speculative sampling. More importantly, for further acceleration, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all. We conduct extensive experiments on several representative autoregressive text-to-image generation models and demonstrate that SJD++ achieves $2\times$ to $3\times$ inference latency reduction and $2\times$ to $7\times$ step compression, while preserving visual quality with no observable degradation.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Penghui Liu, Jiangshan Wang, Yutong Shen, Shanhui Mo, Chenyang Qi et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Maskaware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining DiT's high quality and scalability. The code is in the supp.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chih-Chung Hsu, Shao-Ning Chen, Chia-Ming Lee, Yi-Fang Wang, Yi-Shiuan Chou
Robotics
ๆฅ็ๆ่ฆ
Ensuring the authenticity of video content remains challenging as DeepFake generation becomes increasingly realistic and robust against detection. Most existing detectors implicitly assume temporally consistent and clean facial sequences, an assumption that rarely holds in real-world scenarios where compression artifacts, occlusions, and adversarial attacks destabilize face detection and often lead to invalid or misdetected faces. To address these challenges, we propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences, while being trained only on clean facial data. Our method constructs an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities. Unlike traditional methods constrained by strict temporal continuity, OF-TGE captures intrinsic feature consistency across frames, making it resilient to shuffled, missing, or heavily corrupted inputs. We further impose a dual-level sparsity mechanism on both graph structure and node features to suppress the influence of invalid faces. Crucially, we introduce an explicit Graph Laplacian Spectral Prior that acts as a high-pass operator in the graph spectral domain, highlighting structural anomalies and forgery artifacts, which are then consolidated by a low-pass GCN aggregation. This sequential design effectively realizes a task-driven spectral band-pass mechanism that suppresses background information and random noise while preserving manipulation cues. Extensive experiments on FF++, Celeb-DFv2, and DFDC demonstrate that LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions, including missing faces, occlusions, and adversarially perturbed face detections.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Zihan Zheng et al.
ๆฅ็ๆ่ฆ
While traditional and neural video codecs (NVCs) have achieved remarkable rate-distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S2VC, a Single-Step diffusion based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. It replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S2VC delivers state-of-the-art perceptual quality with an average 52.73% bitrate saving over prior perceptual methods, underscoring the promise of single-step diffusion for efficient, high-quality video compression.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu et al.
ๆฅ็ๆ่ฆ
Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a ``see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weight, data are available at https://github.com/knightyxp/VideoCoF.
๐
2025-12-08
๐ ArXiv Preprint (cs.GR)
๐ฅ Xiangjun Tang, Biao Zhang, Peter Wonka
ๆฅ็ๆ่ฆ
Generating realistic human geometry animations remains a challenging task, as it requires modeling natural clothing dynamics with fine-grained geometric details under limited data. To address these challenges, we propose two novel designs. First, we propose a compact distribution-based latent representation that enables efficient and high-quality geometry generation. We improve upon previous work by establishing a more uniform mapping between SMPL and avatar geometries. Second, we introduce a generative animation model that fully exploits the diversity of limited motion data. We focus on short-term transitions while maintaining long-term consistency through an identity-conditioned design. These two designs formulate our method as a two-stage framework: the first stage learns a latent space, while the second learns to generate animations within this latent space. We conducted experiments on both our latent space and animation model. We demonstrate that our latent space produces high-fidelity human geometry surpassing previous methods ($90\%$ lower Chamfer Dist.). The animation model synthesizes diverse animations with detailed and natural dynamics ($2.2 \times$ higher user study score), achieving the best results across all evaluation metrics.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenwei Shi, Xueyu Luan
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
DreamerV3 is a state-of-the-art online model-based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perceptrons (MLPs), offering superior parameter efficiency and interpretability. To mitigate KANs' computational overhead, variants like FastKAN leverage Radial Basis Functions (RBFs) to accelerate inference. In this work, we investigate integrating KAN architectures into the DreamerV3 framework. We introduce KAN-Dreamer, replacing specific MLP and convolutional components of DreamerV3 with KAN and FastKAN layers. To ensure efficiency within the JAX-based World Model, we implement a tailored, fully vectorized version with simplified grid management. We structure our investigation into three subsystems: Visual Perception, Latent Prediction, and Behavior Learning. Empirical evaluations on the DeepMind Control Suite (walker_walk) analyze sample efficiency, training time, and asymptotic performance. Experimental results demonstrate that utilizing our adapted FastKAN as a drop-in replacement for the Reward and Continue predictors yields performance on par with the original MLP-based architecture, maintaining parity in both sample efficiency and training speed. This report serves as a preliminary study for future developments in KAN-based world models.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Karel Moens, Matthew B. Blaschko, Tinne Tuytelaars, Bart Diricx, Jonas De Vylder et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Whole slide image (WSI) normalization remains a vital preprocessing step in computational pathology. Increasingly driven by deep learning, these models learn to approximate data distributions from training examples. This often results in outputs that gravitate toward the average, potentially masking diagnostically important features. More critically, they can introduce hallucinated content, artifacts that appear realistic but are not present in the original tissue, posing a serious threat to downstream analysis. These hallucinations are nearly impossible to detect visually, and current evaluation practices often overlook them. In this work, we demonstrate that the risk of hallucinations is real and underappreciated. While many methods perform adequately on public datasets, we observe a concerning frequency of hallucinations when these same models are retrained and evaluated on real-world clinical data. To address this, we propose a novel image comparison measure designed to automatically detect hallucinations in normalized outputs. Using this measure, we systematically evaluate several well-cited normalization methods retrained on real-world data, revealing significant inconsistencies and failures that are not captured by conventional metrics. Our findings underscore the need for more robust, interpretable normalization techniques and stricter validation protocols in clinical deployment.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haidong Kang, Jun Du, Lihong Lin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Mixed-Precision Quantization (MPQ) liberates the Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck, which garnered increasing research attention. However, conventional methods either searched from costly differentiable optimization, which is neither efficient nor flexible, or learned a quantized DNN from the proxy (i.e., HAWQ) manually designed by human experts, which is labor-intensive and requires huge expert knowledge. Can we design a proxy without involving any human experts and training? In this paper, we provide an affirmative answer by proposing a novel Large Language Models (LLMs)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework, which reforms the design paradigm of MPQ by utilizing LLMs to find superior TAP tailored for MPQ, automatically. In addition, to bridge the gap between black-box LLMs and the tough MPQ task, we ingeniously propose simple Direct Policy Optimization (DPO) based reinforcement learning to enhance LLMs' reasoning by optimizing prompts, which can construct a positive feedback loop between the LLM and the MPQ task, enabling LLMs to generate better TAP in the next evolution. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we truly believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gabriele Galatolo, Mirco Nanni
ๆฅ็ๆ่ฆ
Understanding the movement behaviours of individuals and the way they react to the external world is a key component of any problem that involves the modelling of human dynamics at a physical level. In particular, it is crucial to capture the influence that the presence of an individual can have on the others. Important examples of applications include crowd simulation and emergency management, where the simulation of the mass of people passes through the simulation of the individuals, taking into consideration the others as part of the general context. While existing solutions basically start from some preconceived behavioural model, in this work we propose an approach that starts directly from the data, adopting a data mining perspective. Our method searches the mobility events in the data that might be possible evidences of mutual interactions between individuals, and on top of them looks for complex, persistent patterns and time evolving configurations of events. The study of these patterns can provide new insights on the mechanics of mobility interactions between individuals, which can potentially help in improving existing simulation models. We instantiate the general methodology on two real case studies, one on cars and one on pedestrians, and a full experimental evaluation is performed, both in terms of performances, parameter sensitivity and interpretation of sample results.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It enables producing coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts. Our code and data will be released to facilitate future research.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhifan Zhu, Siddhant Bansal, Shashank Tripathi, Dima Damen
Multimodal
ๆฅ็ๆ่ฆ
We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object's perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before it is released to be static again w.r.t. to the scene. We model these pose constraints over the HIT, and propose to propagate the object's pose along the HIT enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps - i.e. where the hand is stably holding an object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps including 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ ฤorฤe Nedeljkoviฤ
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Convolutional Neural Networks (CNNs) have proven highly effective for edge and mobile vision tasks due to their computational efficiency. While many recent works seek to enhance CNNs with global contextual understanding via self-attention-based Vision Transformers, these approaches often introduce significant computational overhead. In this work, we demonstrate that it is possible to retain strong global perception without relying on computationally expensive components. We present GlimmerNet, an ultra-lightweight convolutional network built on the principle of separating receptive field diversity from feature recombination. GlimmerNet introduces Grouped Dilated Depthwise Convolutions(GDBlocks), which partition channels into groups with distinct dilation rates, enabling multi-scale feature extraction at no additional parameter cost. To fuse these features efficiently, we design a novel Aggregator module that recombines cross-group representations using grouped pointwise convolution, significantly lowering parameter overhead. With just 31K parameters and 29% fewer FLOPs than the most recent baseline, GlimmerNet achieves a new state-of-the-art weighted F1-score of 0.966 on the UAV-focused AIDERv2 dataset. These results establish a new accuracy-efficiency trade-off frontier for real-time emergency monitoring on resource-constrained UAV platforms. Our implementation is publicly available at https://github.com/djordjened92/gdd-cnn.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gilhyun Nam, Taewon Kim, Joonhyun Jeong, Eunho Yang
Robotics
ๆฅ็ๆ่ฆ
Test-time adaptation (TTA) enables efficient adaptation of deployed models, yet it often leads to poorly calibrated predictive uncertainty - a critical issue in high-stakes domains such as autonomous driving, finance, and healthcare. Existing calibration methods typically assume fixed models or static distributions, resulting in degraded performance under real-world, dynamic test conditions. To address these challenges, we introduce Style Invariance as a Correctness Likelihood (SICL), a framework that leverages style-invariance for robust uncertainty estimation. SICL estimates instance-wise correctness likelihood by measuring prediction consistency across style-altered variants, requiring only the model's forward pass. This makes it a plug-and-play, backpropagation-free calibration module compatible with any TTA method. Comprehensive evaluations across four baselines, five TTA methods, and two realistic scenarios with three model architecture demonstrate that SICL reduces calibration error by an average of 13 percentage points compared to conventional calibration approaches.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chunhui Zhang, Li Liu, Zhipeng Zhang, Yong Wang, Hao Wen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Unmanned Aerial Vehicles (UAVs) offer wide-ranging applications but also pose significant safety and privacy violation risks in areas like airport and infrastructure inspection, spurring the rapid development of Anti-UAV technologies in recent years. However, current Anti-UAV research primarily focuses on RGB, infrared (IR), or RGB-IR videos captured by fixed ground cameras, with little attention to tracking target UAVs from another moving UAV platform. To fill this gap, we propose a new multi-modal visual tracking task termed UAV-Anti-UAV, which involves a pursuer UAV tracking a target adversarial UAV in the video stream. Compared to existing Anti-UAV tasks, UAV-Anti-UAV is more challenging due to severe dual-dynamic disturbances caused by the rapid motion of both the capturing platform and the target. To advance research in this domain, we construct a million-scale dataset consisting of 1,810 videos, each manually annotated with bounding boxes, a language prompt, and 15 tracking attributes. Furthermore, we propose MambaSTS, a Mamba-based baseline method for UAV-Anti-UAV tracking, which enables integrated spatial-temporal-semantic learning. Specifically, we employ Mamba and Transformer models to learn global semantic and spatial features, respectively, and leverage the state space model's strength in long-sequence modeling to establish video-level long-term context via a temporal token propagation mechanism. We conduct experiments on the UAV-Anti-UAV dataset to validate the effectiveness of our method. A thorough experimental evaluation of 50 modern deep tracking algorithms demonstrates that there is still significant room for improvement in the UAV-Anti-UAV domain. The dataset and codes will be available at {\color{magenta}https://github.com/983632847/Awesome-Multimodal-Object-Tracking}.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Deepika SN Vemuri, Gautham Bellamkonda, Aditya Pola, Vineeth N Balasubramanian
Machine Learning
ๆฅ็ๆ่ฆ
Concept Bottleneck Models (CBMs) provide a basis for semantic abstractions within a neural network architecture. Such models have primarily been seen through the lens of interpretability so far, wherein they offer transparency by inferring predictions as a linear combination of semantic concepts. However, a linear combination is inherently limiting. So we propose the enhancement of concept-based learning models through propositional logic. We introduce a logic module that is carefully designed to connect the learned concepts from CBMs through differentiable logic operations, such that our proposed LogicCBM can go beyond simple weighted combinations of concepts to leverage various logical operations to yield the final predictions, while maintaining end-to-end learnability. Composing concepts using a set of logic operators enables the model to capture inter-concept relations, while simultaneously improving the expressivity of the model in terms of logic operations. Our empirical studies on well-known benchmarks and synthetic datasets demonstrate that these models have better accuracy, perform effective interventions and are highly interpretable.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuohan Tao, Boyao Zhou, Hanzhang Tu, Yuwang Wang, Yebin Liu
Machine Learning
ๆฅ็ๆ่ฆ
3D Gaussian Splatting (GS) enables highly photorealistic scene reconstruction from posed image sequences but struggles with viewpoint extrapolation due to its anisotropic nature, leading to overfitting and poor generalization, particularly in sparse-view and dynamic scene reconstruction. We propose Tessellation GS, a structured 2D GS approach anchored on mesh faces, to reconstruct dynamic scenes from a single continuously moving or static camera. Our method constrains 2D Gaussians to localized regions and infers their attributes via hierarchical neural features on mesh faces. Gaussian subdivision is guided by an adaptive face subdivision strategy driven by a detail-aware loss function. Additionally, we leverage priors from a reconstruction foundation model to initialize Gaussian deformations, enabling robust reconstruction of general dynamic objects from a single static camera, previously extremely challenging for optimization-based methods. Our method outperforms previous SOTA method, reducing LPIPS by 29.1% and Chamfer distance by 49.2% on appearance and mesh reconstruction tasks.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mahila Moghadami, Mohammad Ali Keyvanrad, Melika Sabaghian
Computer Vision
ๆฅ็ๆ่ฆ
This paper investigates and develops methods for detecting small objects in large-scale aerial images. Current approaches for detecting small objects in aerial images often involve image cropping and modifications to detector network architectures. Techniques such as sliding window cropping and architectural enhancements, including higher-resolution feature maps and attention mechanisms, are commonly employed. Given the growing importance of aerial imagery in various critical and industrial applications, the need for robust frameworks for small object detection becomes imperative. To address this need, we adopted the base SW-YOLO approach to enhance speed and accuracy in small object detection by refining cropping dimensions and overlap in sliding window usage and subsequently enhanced it through architectural modifications. we propose a novel model by modifying the base model architecture, including advanced feature extraction modules in the neck for feature map enhancement, integrating CBAM in the backbone to preserve spatial and channel information, and introducing a new head to boost small object detection accuracy. Finally, we compared our method with SAHI, one of the most powerful frameworks for processing large-scale images, and CZDet, which is also based on image cropping, achieving significant improvements in accuracy. The proposed model achieves significant accuracy gains on the VisDrone2019 dataset, outperforming baseline YOLOv5L detection by a substantial margin. Specifically, the final proposed model elevates the mAP .5.5 accuracy on the VisDrone2019 dataset from the base accuracy of 35.5 achieved by the YOLOv5L detector to 61.2. Notably, the accuracy of CZDet, which is another classic method applied to this dataset, is 58.36. This research demonstrates a significant improvement, achieving an increase in accuracy from 35.5 to 61.2.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiming Huang, Hao Ai, Jianbo Jiao
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Benefiting from the inductive biases learned from large-scale datasets, open-vocabulary semantic segmentation (OVSS) leverages the power of vision-language models, such as CLIP, to achieve remarkable progress without requiring task-specific training. However, due to CLIP's pre-training nature on image-text pairs, it tends to focus on global semantic alignment, resulting in suboptimal performance when associating fine-grained visual regions with text. This leads to noisy and inconsistent predictions, particularly in local areas. We attribute this to a dispersed bias stemming from its contrastive training paradigm, which is difficult to alleviate using CLIP features alone. To address this, we propose a structure-aware feature rectification approach that incorporates instance-specific priors derived directly from the image. Specifically, we construct a region adjacency graph (RAG) based on low-level features (e.g., colour and texture) to capture local structural relationships and use it to refine CLIP features by enhancing local discrimination. Extensive experiments show that our method effectively suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alexandre Rocchi--Henry, Thomas Fel, Gianni Franchi
ๆฅ็ๆ่ฆ
Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases -- such as SAE type, sparsity, or expansion ratio -- to emergence of plausible\footnote{We adopt the terminology of \citet{jacovi2020towards}, who distinguish between faithful explanations (accurately reflecting model computations) and plausible explanations (aligning with human intuition and domain knowledge). CBM concepts are plausible by construction -- selected or annotated by humans -- though not necessarily faithful to the true latent factors that organise the data manifold.} concepts. Using these metrics, we uncover a ``sweet spot'' in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concept align with plausible human concepts.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Reem E. Mohamed, Md Rafiqul Islam et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
The increasing use of synthetic media, particularly deepfakes, is an emerging challenge for digital content verification. Although recent studies use both audio and visual information, most integrate these cues within a single model, which remains vulnerable to modality mismatches, noise, and manipulation. To address this gap, we propose DeepAgent, an advanced multi-agent collaboration framework that simultaneously incorporates both visual and audio modalities for the effective detection of deepfakes. DeepAgent consists of two complementary agents. Agent-1 examines each video with a streamlined AlexNet-based CNN to identify the symbols of deepfake manipulation, while Agent-2 detects audio-visual inconsistencies by combining acoustic features, audio transcriptions from Whisper, and frame-reading sequences of images through EasyOCR. Their decisions are fused through a Random Forest meta-classifier that improves final performance by taking advantage of the different decision boundaries learned by each agent. This study evaluates the proposed framework using three benchmark datasets to demonstrate both component-level and fused performance. Agent-1 achieves a test accuracy of 94.35% on the combined Celeb-DF and FakeAVCeleb datasets. On the FakeAVCeleb dataset, Agent-2 and the final meta-classifier attain accuracies of 93.69% and 81.56%, respectively. In addition, cross-dataset validation on DeepFakeTIMIT confirms the robustness of the meta-classifier, which achieves a final accuracy of 97.49%, and indicates a strong capability across diverse datasets. These findings confirm that hierarchy-based fusion enhances robustness by mitigating the weaknesses of individual modalities and demonstrate the effectiveness of a multi-agent approach in addressing diverse types of manipulations in deepfakes.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinyu Wei, Kangrui Cen, Hongyang Wei, Zhen Guo, Bairui Li et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo, categorizing it into 7 representative tasks and curate a large-scale collection of high-quality source images and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter's limitation. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.
๐
2025-12-08
๐ AAAI 2026
๐ฅ Shilong Jin, Haoran Duan, Litao Hua, Wentao Huang, Yuan Zhou
Robotics
ๆฅ็ๆ่ฆ
Versatile 3D tasks (e.g., generation or editing) that distill from Text-to-Image (T2I) diffusion models have attracted significant research interest for not relying on extensive 3D training data. However, T2I models exhibit limitations resulting from prior view bias, which produces conflicting appearances between different views of an object. This bias causes subject-words to preferentially activate prior view features during cross-attention (CA) computation, regardless of the target view condition. To overcome this limitation, we conduct a comprehensive mathematical analysis to reveal the root cause of the prior view bias in T2I models. Moreover, we find different UNet layers show different effects of prior view in CA. Therefore, we propose a novel framework, TD-Attn, which addresses multi-view inconsistency via two key components: (1) the 3D-Aware Attention Guidance Module (3D-AAG) constructs a view-consistent 3D attention Gaussian for subject-words to enforce spatial consistency across attention-focused regions, thereby compensating for the limited spatial information in 2D individual view CA maps; (2) the Hierarchical Attention Modulation Module (HAM) utilizes a Semantic Guidance Tree (SGT) to direct the Semantic Response Profiler (SRP) in localizing and modulating CA layers that are highly responsive to view conditions, where the enhanced CA maps further support the construction of more consistent 3D attention Gaussians. Notably, HAM facilitates semantic-specific interventions, enabling controllable and precise 3D editing. Extensive experiments firmly establish that TD-Attn has the potential to serve as a universal plugin, significantly enhancing multi-view consistency across 3D tasks.
๐
2025-12-08
๐ Submitted to IEEE J-STARS
๐ฅ Luรญs Marnoto, Alexandre Bernardino, Bruno Martins
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Considering aerial imagery (e.g., modern aerial photos collected through drones, historical photos from aerial archives, high-resolution satellite imagery, etc.) presents unique challenges because spatial resolution varies widely across datasets, the use of color is not consistent, targets often shrink to only a few pixels, and scenes contain very high object densities and objects with partial occlusions. This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets, spanning across individual object instances, groups of instances, and semantic regions covering 21 distinct classes that range from vehicles and infrastructure to land coverage types. The dataset was constructed through a fully automatic pipeline that combines systematic rule-based expression generation with a Large Language Model (LLM) enhancement procedure that enriched both the linguistic variety and the focus on visual details within the referring expressions. Filters were additionally used to simulate historic imaging conditions for each scene. We adopted the RSRefSeg architecture, and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks, while maintaining strong accuracy under monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at https://luispl77.github.io/aerial-d .
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kanishk Awadhiya
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Vision Transformers (ViTs) lack the hierarchical inductive biases inherent to Convolutional Neural Networks (CNNs), theoretically allowing them to maintain high-dimensional representations throughout all layers. However, recent observations suggest ViTs often spontaneously manifest a "U-shaped" entropy profile-compressing information in middle layers before expanding it for the final classification. In this work, we demonstrate that this "Inductive Bottleneck" is not an architectural artifact, but a data-dependent adaptation. By analyzing the layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs across datasets of varying compositional complexity (UC Merced, Tiny ImageNet, and CIFAR-100), we show that the depth of the bottleneck correlates strongly with the semantic abstraction required by the task. We find that while texture-heavy datasets preserve high-rank representations throughout, object-centric datasets drive the network to dampen high-frequency information in middle layers, effectively "learning" a bottleneck to isolate semantic features.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyang Mai, Yu-Wing Tai
ๆฅ็ๆ่ฆ
Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose \textbf{ContextAnyone}, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: \href{https://github.com/ziyang1106/ContextAnyone}{https://github.com/ziyang1106/ContextAnyone}.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tobias Abraham Haider
Machine Learning
ๆฅ็ๆ่ฆ
This study revisits the findings of Carl et al., who evaluated the pre-trained Google Inception-ResNet-v2 model for automated detection of European wild mammal species in camera trap images. To assess the reproducibility and generalizability of their approach, we reimplemented the experiment from scratch using openly available resources and a different dataset consisting of 900 images spanning 90 species. After minimal preprocessing, we obtained an overall classification accuracy of 62%, closely aligning with the 71% reported in the original work despite differences in datasets. As in the original study, per-class performance varied substantially, as indicated by a macro F1 score of 0.28,highlighting limitations in generalization when labels do not align directly with ImageNet classes. Our results confirm that pretrained convolutional neural networks can provide a practical baseline for wildlife species identification but also reinforce the need for species-specific adaptation or transfer learning to achieve consistent, high-quality predictions.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingning Guo, Mengwei Wu, Shaoxian Li, Haifeng Li, Chao Tao
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs' understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model's ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mai Tsujimoto, Junjue Wang, Weihao Xuan, Naoto Yokoya
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Three-dimensional geospatial analysis is critical to applications in urban planning, climate adaptation, and environmental assessment. Current methodologies depend on costly, specialized sensors (e.g., LiDAR and multispectral), which restrict global accessibility. Existing sensor-based and rule-driven methods further struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We hereby present Geo3DVQA, a comprehensive benchmark for evaluating vision-language models (VLMs) in height-aware, 3D geospatial reasoning using RGB-only remote sensing imagery. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios that integrate elevation, sky view factors, and land cover patterns. The benchmark encompasses 110k curated question-answer pairs spanning 16 task categories across three complexity levels: single-feature inference, multi-feature reasoning, and application-level spatial analysis. The evaluation of ten state-of-the-art VLMs highlights the difficulty of RGB-to-3D reasoning. GPT-4o and Gemini-2.5-Flash achieved only 28.6% and 33.0% accuracy respectively, while domain-specific fine-tuning of Qwen2.5-VL-7B achieved 49.6% (+24.8 points). These results reveal both the limitations of current VLMs and the effectiveness of domain adaptation. Geo3DVQA introduces new challenge frontiers for scalable, accessible, and holistic 3D geospatial analysis. The dataset and code will be released upon publication at https://github.com/mm1129/Geo3DVQA.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Siyu Wang, Hua Wang, Huiyu Li, Fan Zhang
Machine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
In the field of healthcare, precise skin lesion segmentation is crucial for the early detection and accurate diagnosis of skin diseases. Despite significant advances in deep learning for image processing, existing methods have yet to effectively address the challenges of irregular lesion shapes and low contrast. To address these issues, this paper proposes an innovative encoder-decoder network architecture based on multi-scale residual structures, capable of extracting rich feature information from different receptive fields to effectively identify lesion areas. By introducing a Multi-Resolution Multi-Channel Fusion (MRCF) module, our method captures cross-scale features, enhancing the clarity and accuracy of the extracted information. Furthermore, we propose a Cross-Mix Attention Module (CMAM), which redefines the attention scope and dynamically calculates weights across multiple contexts, thus improving the flexibility and depth of feature capture and enabling deeper exploration of subtle features. To overcome the information loss caused by skip connections in traditional U-Net, an External Attention Bridge (EAB) is introduced, facilitating the effective utilization of information in the decoder and compensating for the loss during upsampling. Extensive experimental evaluations on several skin lesion segmentation datasets demonstrate that the proposed model significantly outperforms existing transformer and convolutional neural network-based models, showcasing exceptional segmentation accuracy and robustness.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhi Rao, Yucheng Zhou, Benjia Zhou, Yiqing Huang, Sergio Escalera et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Gloss-free sign language translation (SLT) is hindered by two key challenges: **inadequate sign representation** that fails to capture nuanced visual cues, and **sentence-level semantic misalignment** in current LLM-based methods, which limits translation quality. To address these issues, we propose a three-stage **r**einforcing **v**ision-**l**anguage **f**ramework (**RVLF**). We build a large vision-language model (LVLM) specifically designed for sign language, and then combine it with reinforcement learning (RL) to adaptively enhance translation performance. First, for a sufficient representation of sign language, RVLF introduces an effective semantic representation learning mechanism that fuses skeleton-based motion cues with semantically rich visual features extracted via DINOv2, followed by instruction tuning to obtain a strong SLT-SFT baseline. Then, to improve sentence-level semantic misalignment, we introduce a GRPO-based optimization strategy that fine-tunes the SLT-SFT model with a reward function combining translation fidelity (BLEU) and sentence completeness (ROUGE), yielding the optimized model termed SLT-GRPO. Our conceptually simple framework yields substantial gains under the gloss-free SLT setting without pre-training on any external large-scale sign language datasets, improving BLEU-4 scores by +5.1, +1.11, +1.4, and +1.61 on the CSL-Daily, PHOENIX-2014T, How2Sign, and OpenASL datasets, respectively. To the best of our knowledge, this is the first work to incorporate GRPO into SLT. Extensive experiments and ablation studies validate the effectiveness of GRPO-based optimization in enhancing both translation quality and semantic consistency.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mike Diessner, Yannick Tarant
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Virtual representations of physical critical infrastructures, such as water or energy plants, are used for simulations and digital twins to ensure resilience and continuity of their services. These models usually require 3D point clouds from laser scanners that are expensive to acquire and require specialist knowledge to use. In this article, we present a graph generation pipeline based on photogrammetry. The pipeline detects relevant objects and predicts their relation using RGB images and depth data generated by a stereo camera. This more cost-effective approach uses deep learning for object detection and instance segmentation of the objects, and employs user-defined heuristics or rules to infer their relations. Results of two hydraulic systems show that this strategy can produce graphs close to the ground truth while its flexibility allows the method to be tailored to specific applications and its transparency qualifies it to be used in the high stakes decision-making that is required for critical infrastructures.
๐
2025-12-08
๐ ArXiv Preprint (eess.IV)
๐ฅ Tharindu Wickremasinghe, Marco F. Duarte
Computer Vision
ๆฅ็ๆ่ฆ
Image tile-based approaches are popular in many image processing applications such as denoising (e.g., non-local means). A key step in their use is grouping the images into clusters, which usually proceeds iteratively splitting the images into clusters and fitting a model for the images in each cluster. Linear subspaces have emerged as a suitable model for tile clusters; however, they are not well matched to images patches given that images are non-negative and thus not distributed around the origin in the tile vector space. We study the use of affine subspace models for the clusters to better match the geometric structure of the image tile vector space. We also present a simple denoising algorithm that relies on the affine subspace clustering model using least squares projection. We review several algorithmic approaches to solve the affine subspace clustering problem and show experimental results that highlight the performance improvements in clustering and denoising.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Handing Xu, Zhenguo Nie, Tairan Peng, Huimin Pan, Xin-Jun Liu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Endoscopic surgery relies on intraoperative video, making image quality a decisive factor for surgical safety and efficacy. Yet, endoscopic videos are often degraded by uneven illumination, tissue scattering, occlusions, and motion blur, which obscure critical anatomical details and complicate surgical manipulation. Although deep learning-based methods have shown promise in image enhancement, most existing approaches remain too computationally demanding for real-time surgical use. To address this challenge, we propose a degradation-aware framework for endoscopic video enhancement, which enables real-time, high-quality enhancement by propagating degradation representations across frames. In our framework, degradation representations are first extracted from images using contrastive learning. We then introduce a fusion mechanism that modulates image features with these representations to guide a single-frame enhancement model, which is trained with a cycle-consistency constraint between degraded and restored images to improve robustness and generalization. Experiments demonstrate that our framework achieves a superior balance between performance and efficiency compared with several state-of-the-art methods. These results highlight the effectiveness of degradation-aware modeling for real-time endoscopic video enhancement. Nevertheless, our method suggests that implicitly learning and propagating degradation representation offer a practical pathway for clinical application.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junqi Liu, Zejun Wu, Pedro R. A. S. Bassi, Xinze Zhou, Wenxuan Li et al.
ๆฅ็ๆ่ฆ
Image enhancement improves visual quality and helps reveal details that are hard to see in the original image. In medical imaging, it can support clinical decision-making, but current models often over-edit. This can distort organs, create false findings, and miss small tumors because these models do not understand anatomy or contrast dynamics. We propose SMILE, an anatomy-aware diffusion model that learns how organs are shaped and how they take up contrast. It enhances only clinically relevant regions while leaving all other areas unchanged. SMILE introduces three key ideas: (1) structure-aware supervision that follows true organ boundaries and contrast patterns; (2) registration-free learning that works directly with unaligned multi-phase CT scans; (3) unified inference that provides fast and consistent enhancement across all contrast phases. Across six external datasets, SMILE outperforms existing methods in image quality (14.2% higher SSIM, 20.6% higher PSNR, 50% better FID) and in clinical usefulness by producing anatomically accurate and diagnostically meaningful images. SMILE also improves cancer detection from non-contrast CT, raising the F1 score by up to 10 percent.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziming Hong, Tianyu Huang, Runnan Chen, Shanshan Ye, Mingming Gong et al.
Robotics
ๆฅ็ๆ่ฆ
Recent studies have extended diffusion-based instruction-driven 2D image editing pipelines to 3D Gaussian Splatting (3DGS), enabling faithful manipulation of 3DGS assets and greatly advancing 3DGS content creation. However, it also exposes these assets to serious risks of unauthorized editing and malicious tampering. Although imperceptible adversarial perturbations against diffusion models have proven effective for protecting 2D images, applying them to 3DGS encounters two major challenges: view-generalizable protection and balancing invisibility with protection capability. In this work, we propose the first editing safeguard for 3DGS, termed AdLift, which prevents instruction-driven editing across arbitrary views and dimensions by lifting strictly bounded 2D adversarial perturbations into 3D Gaussian-represented safeguard. To ensure both adversarial perturbations effectiveness and invisibility, these safeguard Gaussians are progressively optimized across training views using a tailored Lifted PGD, which first conducts gradient truncation during back-propagation from the editing model at the rendered image and applies projected gradients to strictly constrain the image-level perturbation. Then, the resulting perturbation is backpropagated to the safeguard Gaussian parameters via an image-to-Gaussian fitting operation. We alternate between gradient truncation and image-to-Gaussian fitting, yielding consistent adversarial-based protection performance across different viewpoints and generalizes to novel views. Empirically, qualitative and quantitative results demonstrate that AdLift effectively protects against state-of-the-art instruction-driven 2D image and 3DGS editing.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Textual explanations make image classifier decisions transparent by describing the prediction rationale in natural language. Large vision-language models can generate captions but are designed for general visual understanding, not classifier-specific reasoning. Existing zero-shot explanation methods align global image features with language, producing descriptions of what is visible rather than what drives the prediction. We propose TEXTER, which overcomes this limitation by isolating decision-critical features before alignment. TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons -- i.e., the decision-critical features. It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER generates more faithful and interpretable explanations than existing methods. The code will be publicly released.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md. Srabon Chowdhury, Syeda Fahmida Tanzim, Sheekar Banerjee, Ishtiak Al Mamoon, AKM Muzahidul Islam
Machine Learning
ๆฅ็ๆ่ฆ
Brain tumors are one of the most common and dangerous neurological diseases which require a timely and correct diagnosis to provide the right treatment procedures. Even with the promotion of magnetic resonance imaging (MRI), the process of tumor delineation is difficult and time-consuming, which is prone to inter-observer error. In order to overcome these limitations, this work proposes a hybrid deep learning model based on SqueezeNet v1 which is a lightweight model, and EfficientNet-B0, which is a high-performing model, and is enhanced with handcrafted radiomic descriptors, including Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gabor filters and Wavelet transforms. The framework was trained and tested only on publicly available Nickparvar Brain Tumor MRI dataset, which consisted of 7,023 contrast-enhanced T1-weighted axial MRI slices which were categorized into four groups: glioma, meningioma, pituitary tumor, and no tumor. The testing accuracy of the model was 98.93% that reached a level of 99.08% with Test Time Augmentation (TTA) showing great generalization and power. The proposed hybrid network offers a compromise between computation efficiency and diagnostic accuracy compared to current deep learning structures and only has to be trained using fewer than 2.1 million parameters and less than 1.2 GFLOPs. The handcrafted feature addition allowed greater sensitivity in texture and the EfficientNet-B0 backbone represented intricate hierarchical features. The resulting model has almost clinical reliability in automated MRI-based classification of tumors highlighting its possibility of use in clinical decision-support systems.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks. Code will be available at https://github.com/chengzhag/UCPE.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Biao Chen, Lin Zuo, Mengmeng Jing, Kunbin He, Yuchen Wang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Dropout is a widely used regularization technique which improves the generalization ability of a model by randomly dropping neurons. In light of this, we propose Dropout Prompt Learning, which aims for applying dropout to improve the robustness of the vision-language models. Different from the vanilla dropout, we apply dropout on the tokens of the textual and visual branches, where we evaluate the token significance considering both intra-modal context and inter-modal alignment, enabling flexible dropout probabilities for each token. Moreover, to maintain semantic alignment for general knowledge transfer while encouraging the diverse representations that dropout introduces, we further propose residual entropy regularization. Experiments on 15 benchmarks show our method's effectiveness in challenging scenarios like low-shot learning, long-tail classification, and out-of-distribution generalization. Notably, our method surpasses regularization-based methods including KgCoOp by 5.10% and PromptSRC by 2.13% in performance on base-to-novel generalization.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Abhinav Raundhal, Gaurav Behera, P J Narayanan, Ravi Kiran Sarvadevabhatla, Makarand Tapaswi
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Text as signs, labels, or instructions is a critical element of real-world scenes as they can convey important contextual information. 3D representations such as 3D Gaussian Splatting (3DGS) struggle to preserve fine-grained text details, while achieving high visual fidelity. Small errors in textual element reconstruction can lead to significant semantic loss. We propose STRinGS, a text-aware, selective refinement framework to address this issue for 3DGS reconstruction. Our method treats text and non-text regions separately, refining text regions first and merging them with non-text regions later for full-scene optimization. STRinGS produces sharp, readable text even in challenging configurations. We introduce a text readability measure OCR Character Error Rate (CER) to evaluate the efficacy on text regions. STRinGS results in a 63.6% relative improvement over 3DGS at just 7K iterations. We also introduce a curated dataset STRinGS-360 with diverse text scenarios to evaluate text readability in 3D reconstruction. Our method and dataset together push the boundaries of 3D scene understanding in text-rich environments, paving the way for more robust text-aware reconstruction methods.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fang Zhou, Zhiqiang Chen, Martin Pavlovski, Yizhong Zhang
ๆฅ็ๆ่ฆ
Generalized Category Discovery (GCD) faces the challenge of categorizing unlabeled data containing both known and novel classes, given only labels for known classes. Previous studies often treat each class independently, neglecting the inherent inter-class relations. Obtaining such inter-class relations directly presents a significant challenge in real-world scenarios. To address this issue, we propose ReLKD, an end-to-end framework that effectively exploits implicit inter-class relations and leverages this knowledge to enhance the classification of novel classes. ReLKD comprises three key modules: a target-grained module for learning discriminative representations, a coarse-grained module for capturing hierarchical class relations, and a distillation module for transferring knowledge from the coarse-grained module to refine the target-grained module's representation learning. Extensive experiments on four datasets demonstrate the effectiveness of ReLKD, particularly in scenarios with limited labeled data. The code for ReLKD is available at https://github.com/ZhouF-ECNU/ReLKD.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hengyang Yao, Lin Li, Ke Sun, Jianing Qiu, Huiping Chen
Machine Learning
ๆฅ็ๆ่ฆ
DeepFake face swapping enables highly realistic identity forgeries, posing serious privacy and security risks. A common defence embeds invisible perturbations into images, but these are fragile and often destroyed by basic transformations such as compression or resizing. In this paper, we first conduct a systematic analysis of 30 transformations across six categories and show that protection robustness is highly sensitive to the choice of training transformations, making the standard Expectation over Transformation (EOT) with uniform sampling fundamentally suboptimal. Motivated by this, we propose Expectation Over Learned distribution of Transformation (EOLT), the framework to treat transformation distribution as a learnable component rather than a fixed design choice. Specifically, EOLT employs a policy network that learns to automatically prioritize critical transformations and adaptively generate instance-specific perturbations via reinforcement learning, enabling explicit modeling of defensive bottlenecks while maintaining broad transferability. Extensive experiments demonstrate that our method achieves substantial improvements over state-of-the-art approaches, with 26% higher average robustness and up to 30% gains on challenging transformation categories.
๐
2025-12-08
๐ ArXiv Preprint (eess.IV)
๐ฅ Tianyi Ren, Daniel Low, Pittra Jaengprajak, Juampablo Heras Rivera, Jacob Ruzevick et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Segmentation is the identification of anatomical regions of interest, such as organs, tissue, and lesions, serving as a fundamental task in computer-aided diagnosis in medical imaging. Although deep learning models have achieved remarkable performance in medical image segmentation, the need for explainability remains critical for ensuring their acceptance and integration in clinical practice, despite the growing research attention in this area. Our approach explored the use of contrast-level Shapley values, a systematic perturbation of model inputs to assess feature importance. While other studies have investigated gradient-based techniques through identifying influential regions in imaging inputs, Shapley values offer a broader, clinically aligned approach, explaining how model performance is fairly attributed to certain imaging contrasts over others. Using the BraTS 2024 dataset, we generated rankings for Shapley values for four MRI contrasts across four model architectures. Two metrics were proposed from the Shapley ranking: agreement between model and ``clinician" imaging ranking, and uncertainty quantified through Shapley ranking variance across cross-validation folds. Higher-performing cases (Dice \textgreater0.6) showed significantly greater agreement with clinical rankings. Increased Shapley ranking variance correlated with decreased performance (U-Net: $r=-0.581$). These metrics provide clinically interpretable proxies for model reliability, helping clinicians better understand state-of-the-art segmentation models.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Selim Sarowar, Sungho Kim
RoboticsComputer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP based and DINOv2 based approaches for 3D pose estimation in hand object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP based methods achieve better semantic consistency, while DINOv2 based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation and grasping, picking applications.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Frederik Hagelskjรฆr, Dimitrios Arapis, Steffen Madsen, Thorbjรธrn Mosekjรฆr Iversen
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Object pose estimation is crucial to robotic perception and typically provides a single-pose estimate. However, a single estimate cannot capture pose uncertainty deriving from visual ambiguity, which can lead to unreliable behavior. Existing pose distribution methods rely heavily on color information, often unavailable in industrial settings.
We propose a novel neural network-based method for estimating object pose uncertainty using only 3D colorless data. To the best of our knowledge, this is the first approach that leverages deep learning for pose distribution estimation without relying on RGB input. We validate our method in a real-world bin picking scenario with objects of varying geometric ambiguity. Our current implementation focuses on symmetries in reflection and revolution, but the framework is extendable to full SE(3) pose distribution estimation. Source code available at opde3d.github.io
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Boyang Pan, Zeyu Zhang, Hongyu Meng, Bin Cui, Yingying Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Purpose: To develop a fully automated deep learning system, AutoLugano, for end-to-end lymphoma classification by performing lesion segmentation, anatomical localization, and automated Lugano staging from baseline FDG-PET/CT scans. Methods: The AutoLugano system processes baseline FDG-PET/CT scans through three sequential modules:(1) Anatomy-Informed Lesion Segmentation, a 3D nnU-Net model, trained on multi-channel inputs, performs automated lesion detection (2) Atlas-based Anatomical Localization, which leverages the TotalSegmentator toolkit to map segmented lesions to 21 predefined lymph node regions using deterministic anatomical rules; and (3) Automated Lugano Staging, where the spatial distribution of involved regions is translated into Lugano stages and therapeutic groups (Limited vs. Advanced Stage).The system was trained on the public autoPET dataset (n=1,007) and externally validated on an independent cohort of 67 patients. Performance was assessed using accuracy, sensitivity, specificity, F1-scorefor regional involvement detection and staging agreement. Results: On the external validation set, the proposed model demonstrated robust performance, achieving an overall accuracy of 88.31%, sensitivity of 74.47%, Specificity of 94.21% and an F1-score of 80.80% for regional involvement detection,outperforming baseline models. Most notably, for the critical clinical task of therapeutic stratification (Limited vs. Advanced Stage), the system achieved a high accuracy of 85.07%, with a specificity of 90.48% and a sensitivity of 82.61%.Conclusion: AutoLugano represents the first fully automated, end-to-end pipeline that translates a single baseline FDG-PET/CT scan into a complete Lugano stage. This study demonstrates its strong potential to assist in initial staging, treatment stratification, and supporting clinical decision-making.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuhui Zheng, Kang An, Ziliang Wang, Yuhang Wang, Faqiang Qian et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng Yu
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion models have achieved remarkable performance in generative modeling, yet their theoretical foundations are often intricate, and the gap between mathematical formulations in papers and practical open-source implementations can be difficult to bridge. Existing tutorials primarily focus on deriving equations, offering limited guidance on how diffusion models actually operate in code. To address this, we present a concise implementation of approximately 300 lines that explains diffusion models from a code-execution perspective. Our minimal example preserves the essential components -- including forward diffusion, reverse sampling, the noise-prediction network, and the training loop -- while removing unnecessary engineering details. This technical report aims to provide researchers with a clear, implementation-first understanding of how diffusion models work in practice and how code and theory correspond. Our code and pre-trained models are available at: https://github.com/disanda/GM/tree/main/DDPM-DDIM-ClassifierFree.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiujie Song, Qi Jia, Shota Watanabe, Xiaoyi Pang, Ruijie Chen et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
An image can convey a compelling story by presenting rich, logically connected visual clues. These connections form Chains-of-Reasoning (CoRs) within the image, enabling viewers to infer events, causal relationships, and other information, thereby understanding the underlying story. In this paper, we focus on these semantically rich images and define them as Storytelling Images. Such images have diverse applications beyond illustration creation and cognitive screening, leveraging their ability to convey multi-layered information visually and inspire active interpretation. However, due to their complex semantic nature, Storytelling Images are inherently challenging to create, and thus remain relatively scarce. To address this challenge, we introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images. Specifically, we propose a two-stage pipeline, StorytellingPainter, which combines the creative reasoning abilities of Large Language Models (LLMs) with the visual synthesis capabilities of Text-to-Image (T2I) models to generate Storytelling Images. Alongside this pipeline, we develop a dedicated evaluation framework comprising three main evaluators: a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator and a Story-Image Alignment Evaluator. Given the critical role of story generation in the Storytelling Image Generation task and the performance disparity between open-source and proprietary LLMs, we further explore tailored training strategies to reduce this gap, resulting in a series of lightweight yet effective models named Mini-Storytellers. Experimental results demonstrate the feasibility and effectiveness of our approaches. The code is available at https://github.com/xiujiesong/StorytellingImageGeneration.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seokhyun Youn, Soohyun Lee, Geonho Kim, Weeyoung Kwon, Sung-Ho Bae et al.
Computer Vision
ๆฅ็ๆ่ฆ
3D Gaussian Splatting (3DGS) has emerged as a powerful explicit representation enabling real-time, high-fidelity 3D reconstruction and novel view synthesis. However, its practical use is hindered by the massive memory and computational demands required to store and render millions of Gaussians. These challenges become even more severe in 4D dynamic scenes. To address these issues, the field of Efficient Gaussian Splatting has rapidly evolved, proposing methods that reduce redundancy while preserving reconstruction quality. This survey provides the first unified overview of efficient 3D and 4D Gaussian Splatting techniques. For both 3D and 4D settings, we systematically categorize existing methods into two major directions, Parameter Compression and Restructuring Compression, and comprehensively summarize the core ideas and methodological trends within each category. We further cover widely used datasets, evaluation metrics, and representative benchmark comparisons. Finally, we discuss current limitations and outline promising research directions toward scalable, compact, and real-time Gaussian Splatting for both static and dynamic 3D scene representation.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Niu Yi, Xu Tianyi, Ma Mingming, Wang Xinkun
Robotics
ๆฅ็ๆ่ฆ
Generative learned image compression methods using Vector Quantization (VQ) have recently shown impressive potential in balancing distortion and perceptual quality. However, these methods typically estimate the entropy of VQ indices using a static, global probability distribution, which fails to adapt to the specific content of each image. This non-adaptive approach leads to untapped bitrate potential and challenges in achieving flexible rate control. To address this challenge, we introduce a Controllable Generative Image Compression framework based on a VQ Hyperprior, termed HVQ-CGIC. HVQ-CGIC rigorously derives the mathematical foundation for introducing a hyperprior to the VQ indices entropy model. Based on this foundation, through novel loss design, to our knowledge, this framework is the first to introduce RD balance and control into vector quantization-based Generative Image Compression. Cooperating with a lightweight hyper-prior estimation network, HVQ-CGIC achieves a significant advantage in rate-distortion (RD) performance compared to current state-of-the-art (SOTA) generative compression methods. On the Kodak dataset, we achieve the same LPIPS as Control-GIC, CDC and HiFiC with an average of 61.3% fewer bits. We posit that HVQ-CGIC has the potential to become a foundational component for VQGAN-based image compression, analogous to the integral role of the HyperPrior framework in neural image compression.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenqi Zhao, Jiacheng Sang, Fenghua Cheng, Yonglu Shu, Dong Li et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Medical image segmentation remains challenging due to intensity inhomogeneity, noise, blurred boundaries, and irregular structures. Traditional level set methods, while effective in certain cases, often depend on approximate bias field estimations and therefore struggle under severe non-uniform imaging conditions. To address these limitations, we propose a novel variational Reflectance-based Level Set Model (RefLSM), which explicitly integrates Retinex-inspired reflectance decomposition into the segmentation framework. By decomposing the observed image into reflectance and bias field components, RefLSM directly segments the reflectance, which is invariant to illumination and preserves fine structural details. Building on this foundation, we introduce two key innovations for enhanced precision and robustness. First, a linear structural prior steers the smoothed reflectance gradients toward a data-driven reference, providing reliable geometric guidance in noisy or low-contrast scenes. Second, a relaxed binary level-set is embedded in RefLSM and enforced via convex relaxation and sign projection, yielding stable evolution and avoiding reinitialization-induced diffusion. The resulting variational problem is solved efficiently using an ADMM-based optimization scheme. Extensive experiments on multiple medical imaging datasets demonstrate that RefLSM achieves superior segmentation accuracy, robustness, and computational efficiency compared to state-of-the-art level set methods.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pengfei Gu, Huimin Li, Haoteng Tang, Dongkuan, Xu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Modern deep neural networks have shown remarkable performance in medical image classification. However, such networks either emphasize pixel-intensity features instead of fundamental anatomical structures (e.g., those encoded by topological invariants), or they capture only simple topological features via single-parameter persistence. In this paper, we propose a new topology-guided classification framework that extracts multi-scale and multi-filtration persistent topological features and integrates them into vision classification backbones. For an input image, we first compute cubical persistence diagrams (PDs) across multiple image resolutions/scales. We then develop a ``vineyard'' algorithm that consolidates these PDs into a single, stable diagram capturing signatures at varying granularities, from global anatomy to subtle local irregularities that may indicate early-stage disease. To further exploit richer topological representations produced by multiple filtrations, we design a cross-attention-based neural network that directly processes the consolidated final PDs. The resulting topological embeddings are fused with feature maps from CNNs or Transformers. By integrating multi-scale and multi-filtration topologies into an end-to-end architecture, our approach enhances the model's capacity to recognize complex anatomical structures. Evaluations on three public datasets show consistent, considerable improvements over strong baselines and state-of-the-art methods, demonstrating the value of our comprehensive topological perspective for robust and interpretable medical image classification.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Matteo Boglioni, Andrea Sgobbi, Gabriel Tavernini, Francesco Rita, Marius Mosbach et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
A large language model's (LLM's) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model's performance across multiple OOD testsets throughout a finetuning run; we then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated are generalisation performances once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n)=\exp(n\,ฯ\,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project Page: https://github.com/model-architectures/GRAPE.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Raunak Jain, Mudita Khurana
ๆฅ็ๆ่ฆ
LLM-based agents are rapidly being plugged into expert decision-support, yet in messy, high-stakes settings they rarely make the team smarter: human-AI teams often underperform the best individual, experts oscillate between verification loops and over-reliance, and the promised complementarity does not materialise. We argue this is not just a matter of accuracy, but a fundamental gap in how we conceive AI assistance: expert decisions are made through collaborative cognitive processes where mental models, goals, and constraints are continually co-constructed, tested, and revised between human and AI. We propose Collaborative Causal Sensemaking (CCS) as a research agenda and organizing framework for decision-support agents: systems designed as partners in cognitive work, maintaining evolving models of how particular experts reason, helping articulate and revise goals, co-constructing and stress-testing causal hypotheses, and learning from the outcomes of joint decisions so that both human and agent improve over time. We sketch challenges around training ecologies that make collaborative thinking instrumentally valuable, representations and interaction protocols for co-authored models, and evaluation centred on trust and complementarity. These directions can reframe MAS research around agents that participate in collaborative sensemaking and act as AI teammates that think with their human partners.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Nearchos Potamitis, Lars Klein, Akhil Arora
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot because practitioners cannot reliably assess whether a method's reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. Notably, even strategies with similar average performance can display confidence intervals up to four times wider, and the top-performing methods often incur higher and less stable costs. Such instability compromises reproducibility across runs and, consequently, the reliability of reported performance. To better understand these dynamics, we further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability. Our results highlight reproducibility as a critical dimension for reliable LLM reasoning and provide a foundation for future reasoning methods and uncertainty quantification techniques. ReasonBENCH is publicly available at https://github.com/au-clan/ReasonBench .
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Charlie Zhang, Graham Neubig, Xiang Yue
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Karin de Langis, Pรผren รncel, Ryan Peters, Andrew Elfenbein, Laura Kristen Allen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Leveraging a dataset of paired narratives, we investigate the extent to which large language models (LLMs) can reliably separate incoherent and coherent stories. A probing study finds that LLMs' internal representations can reliably identify incoherent narratives. However, LLMs generate responses to rating questions that fail to satisfactorily separate the coherent and incoherent narratives across several prompt variations, hinting at a gap in LLM's understanding of storytelling. The reasoning LLMs tested do not eliminate these deficits, indicating that thought strings may not be able to fully address the discrepancy between model internal state and behavior. Additionally, we find that LLMs appear to be more sensitive to incoherence resulting from an event that violates the setting (e.g., a rainy day in the desert) than to incoherence arising from a character violating an established trait (e.g., Mary, a vegetarian, later orders a cheeseburger), suggesting that LLMs may rely more on prototypical world knowledge than building meaning-based narrative coherence. The consistent asymmetry found in our results suggests that LLMs do not have a complete grasp on narrative coherence.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Francois Vandenhende, Anna Georgiou, Michalis Georgiou, Theodoros Psaras, Ellie Karekla et al.
ๆฅ็ๆ่ฆ
In pre-market drug safety review, grouping related adverse event terms into standardised MedDRA queries or the FDA Office of New Drugs Custom Medical Queries (OCMQs) is critical for signal detection. We present a novel quantitative artificial intelligence system that understands and processes medical terminology and automatically retrieves relevant MedDRA Preferred Terms (PTs) for a given input query, ranking them by a relevance score using multi-criteria statistical methods. The system (SafeTerm) embeds medical query terms and MedDRA PTs in a multidimensional vector space, then applies cosine similarity and extreme-value clustering to generate a ranked list of PTs. Validation was conducted against the FDA OCMQ v3.0 (104 queries), restricted to valid MedDRA PTs. Precision, recall and F1 were computed across similarity-thresholds. High recall (>95%) is achieved at moderate thresholds. Higher thresholds improve precision (up to 86%). The optimal threshold (~0.70 - 0.75) yielded recall ~50% and precision ~33%. Narrow-term PT subsets performed similarly but required slightly higher similarity thresholds. The SafeTerm AI-driven system provides a viable supplementary method for automated MedDRA query generation. A similarity threshold of ~0.60 is recommended initially, with increased thresholds for refined term selection.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sujoy Nath, Arkaprabha Basu, Sharanya Dasgupta, Swagatam Das
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding tasks. While these models often produce linguistically coherent output, they often suffer from hallucinations, generating descriptions that are factually inconsistent with the visual content, potentially leading to adverse consequences. Therefore, the assessment of hallucinations in MLLM has become increasingly crucial in the model development process. Contemporary methodologies predominantly depend on external LLM evaluators, which are themselves susceptible to hallucinations and may present challenges in terms of domain adaptation. In this study, we propose the hypothesis that hallucination manifests as measurable irregularities within the internal layer dynamics of MLLMs, not merely due to distributional shifts but also in the context of layer-wise analysis of specific assumptions. By incorporating such modifications, \textsc{\textsc{HalluShift++}} broadens the efficacy of hallucination detection from text-based large language models (LLMs) to encompass multimodal scenarios. Our codebase is available at https://github.com/C0mRD/HalluShift_Plus.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zihan Chen, Lanyu Yu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Online incivility has emerged as a widespread and persistent problem in digital communities, imposing substantial social and psychological burdens on users. Although many platforms attempt to curb incivility through moderation and automated detection, the performance of existing approaches often remains limited in both accuracy and efficiency. To address this challenge, we propose a Graph Neural Network (GNN) framework for detecting three types of uncivil behavior (i.e., toxicity, aggression, and personal attacks) within the English Wikipedia community. Our model represents each user comment as a node, with textual similarity between comments defining the edges, allowing the network to jointly learn from both linguistic content and relational structures among comments. We also introduce a dynamically adjusted attention mechanism that adaptively balances nodal and topological features during information aggregation. Empirical evaluations demonstrate that our proposed architecture outperforms 12 state-of-the-art Large Language Models (LLMs) across multiple metrics while requiring significantly lower inference cost. These findings highlight the crucial role of structural context in detecting online incivility and address the limitations of text-only LLM paradigms in behavioral prediction. All datasets and comparative outputs will be publicly available in our repository to support further research and reproducibility.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zeqi Chen, Zhaoyang Chu, Yi Gui, Feng Guo, Yao Wan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable performance in code intelligence tasks such as code generation, summarization, and translation. However, their reliance on linearized token sequences limits their ability to understand the structural semantics of programs. While prior studies have explored graphaugmented prompting and structure-aware pretraining, they either suffer from prompt length constraints or require task-specific architectural changes that are incompatible with large-scale instructionfollowing LLMs. To address these limitations, this paper proposes CGBridge, a novel plug-and-play method that enhances LLMs with Code Graph information through an external, trainable Bridge module. CGBridge first pre-trains a code graph encoder via selfsupervised learning on a large-scale dataset of 270K code graphs to learn structural code semantics. It then trains an external module to bridge the modality gap among code, graph, and text by aligning their semantics through cross-modal attention mechanisms. Finally, the bridge module generates structure-informed prompts, which are injected into a frozen LLM, and is fine-tuned for downstream code intelligence tasks. Experiments show that CGBridge achieves notable improvements over both the original model and the graphaugmented prompting method. Specifically, it yields a 16.19% and 9.12% relative gain in LLM-as-a-Judge on code summarization, and a 9.84% and 38.87% relative gain in Execution Accuracy on code translation. Moreover, CGBridge achieves over 4x faster inference than LoRA-tuned models, demonstrating both effectiveness and efficiency in structure-aware code understanding.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kairong Luo, Zhenbo Sun, Xinyu Shi, Shengqi Chen, Bowen Yu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jing Wang, Jie Shen, Xing Niu, Tong Zhang, Jeremy Weiss
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce \emph{Metric-Fair Prompting}, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In the application of multiple-choice medical question answering, each {(question, option)} pair is treated as a binary instance with label $+1$ (correct) or $-1$ (incorrect). To promote {individual fairness}~--~treating similar instances similarly~--~we compute question similarity using NLP embeddings and solve items in \emph{joint pairs of similar questions} rather than in isolation. The prompt enforces a global decision protocol: extract decisive clinical features, map each \((\text{question}, \text{option})\) to a score $f(x)$ that acts as confidence, and impose a Lipschitz-style constraint so that similar inputs receive similar scores and, hence, consistent outputs. Evaluated on the {MedQA (US)} benchmark, Metric-Fair Prompting is shown to improve performance over standard single-item prompting, demonstrating that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Navid Asgari, Benjamin M. Cole
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In this study, we propose a structured methodology that utilizes large language models (LLMs) in a cost-efficient and parsimonious manner, integrating the strengths of scholars and machines while offsetting their respective weaknesses. Our methodology, facilitated through a chain of thought and few-shot learning prompting from computer science, extends best practices for co-author teams in qualitative research to human-machine teams in quantitative research. This allows humans to utilize abductive reasoning and natural language to interrogate not just what the machine has done but also what the human has done. Our method highlights how scholars can manage inherent weaknesses OF LLMs using careful, low-cost techniques. We demonstrate how to use the methodology to interrogate human-machine rating discrepancies for a sample of 1,934 press releases announcing pharmaceutical alliances (1990-2017).
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nicolas Calbucura, Valentin Barriere
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
This paper presents a simple method that allows to easily enhance textual pre-trained large language models with speech information, when fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings from audio with text is the large length of the audio sequence compared to the text one. Our method benefits from an existing speech tokenizer trained for Audio Speech Recognition that output long sequences of tokens from a large vocabulary, making it difficult to integrate it at low cost in a large language model. By applying a simple lasso-based feature selection on multimodal Bag-of-Words representation, we retain only the most important audio tokens for the task, and adapt the language model to them with a self-supervised language modeling objective, before fine-tuning it on the downstream task. We show this helps to improve the performances compared to an unimodal model, to a bigger SpeechLM or to integrating audio via a learned representation. We show the effectiveness of our method on two recent Argumentative Fallacy Detection and Classification tasks where the use of audio was believed counterproductive, reaching state-of-the-art results. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhancing the unimodal model. Our code is available [online](https://github.com/salocinc/EACL26SpeechTokFallacy/).
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kassoum Sanogo, Renzo Ardiccioni
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) frequently generate hallucinated content plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHAL BENCH benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We validate our approach on Qwen2.5-VL-7B [23], with plans to extend validation across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Francois Vandenhende, Anna Georgiou, Michalis Georgiou, Theodoros Psaras, Ellie Karekla et al.
ๆฅ็ๆ่ฆ
In pre-market drug safety review, grouping related adverse event terms into SMQs or OCMQs is critical for signal detection. We assess the performance of SafeTerm Automated Medical Query (AMQ) on MedDRA SMQs. The AMQ is a novel quantitative artificial intelligence system that understands and processes medical terminology and automatically retrieves relevant MedDRA Preferred Terms (PTs) for a given input query, ranking them by a relevance score (0-1) using multi-criteria statistical methods. The system (SafeTerm) embeds medical query terms and MedDRA PTs in a multidimensional vector space, then applies cosine similarity, and extreme-value clustering to generate a ranked list of PTs. Validation was conducted against tier-1 SMQs (110 queries, v28.1). Precision, recall and F1 were computed at multiple similarity-thresholds, defined either manually or using an automated method. High recall (94%)) is achieved at moderate similarity thresholds, indicative of good retrieval sensitivity. Higher thresholds filter out more terms, resulting in improved precision (up to 89%). The optimal threshold (0.70)) yielded an overall recall of (48%) and precision of (45%) across all 110 queries. Restricting to narrow-term PTs achieved slightly better performance at an increased (+0.05) similarity threshold, confirming increased relatedness of narrow versus broad terms. The automatic threshold (0.66) selection prioritizes recall (0.58) to precision (0.29). SafeTerm AMQ achieves comparable, satisfactory performance on SMQs and sanitized OCMQs. It is therefore a viable supplementary method for automated MedDRA query generation, balancing recall and precision. We recommend using suitable MedDRA PT terminology in query formulation and applying the automated threshold method to optimise recall. Increasing similarity scores allows refined, narrow terms selection.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kyungro Lee, Dongha Choi, Hyunju Lee
Natural Language Processing
ๆฅ็ๆ่ฆ
As dialogue systems become increasingly important across various domains, a key challenge in persona-based dialogue is generating engaging and context-specific interactions while ensuring the model acts with a coherent personality. However, existing persona-based dialogue datasets lack explicit relations between persona sentences and responses, which makes it difficult for models to effectively capture persona information. To address these issues, we propose MoCoRP (Modeling Consistent Relations between Persona and Response), a framework that incorporates explicit relations into language models. MoCoRP leverages an NLI expert to explicitly extract the NLI relations between persona sentences and responses, enabling the model to effectively incorporate appropriate persona information from the context into its responses. We applied this framework to pre-trained models like BART and further extended it to modern large language models (LLMs) through alignment tuning. Experimental results on the public datasets ConvAI2 and MPChat demonstrate that MoCoRP outperforms existing baselines, achieving superior persona consistency and engaging, context-aware dialogue generation. Furthermore, our model not only excels in quantitative metrics but also shows significant improvements in qualitative aspects. These results highlight the effectiveness of explicitly modeling persona-response relations in persona-based dialogue. The source codes of MoCoRP are available at https://github.com/DMCB-GIST/MoCoRP.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Frederic Blum
Robotics
ๆฅ็ๆ่ฆ
The statistical over-representation of phonological features in the basic vocabulary of languages is often interpreted as reflecting potentially universal sound symbolic patterns. However, most of those results have not been tested explicitly for reproducibility and might be prone to biases in the study samples or models. Many studies on the topic do not adequately control for genealogical and areal dependencies between sampled languages, casting doubts on the robustness of the results. In this study, we test the robustness of a recent study on sound symbolism of basic vocabulary concepts which analyzed245 languages.The new sample includes data on 2864 languages from Lexibank. We modify the original model by adding statistical controls for spatial and phylogenetic dependencies between languages. The new results show that most of the previously observed patterns are not robust, and in fact many patterns disappear completely when adding the genealogical and areal controls. A small number of patterns, however, emerges as highly stable even with the new sample. Through the new analysis, we are able to assess the distribution of sound symbolism on a larger scale than previously. The study further highlights the need for testing all universal claims on language for robustness on various levels.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Boxuan Lyu, Haiyue Song, Hidetaka Kamigaito, Chenchen Ding, Hideki Tanaka et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Error Span Detection (ESD) is a subtask of automatic machine translation evaluation that localizes error spans in translations and labels their severity. State-of-the-art generative ESD methods typically decode using Maximum a Posteriori (MAP), assuming that model-estimated probabilities are perfectly correlated with similarity to human annotation. However, we observed that annotations dissimilar to the human annotation could achieve a higher model likelihood than the human annotation. We address this issue by applying Minimum Bayes Risk (MBR) decoding to generative ESD models. Specifically, we employ sentence- and span-level similarity metrics as utility functions to select candidate hypotheses based on their approximate similarity to the human annotation. Extensive experimental results show that our MBR decoding outperforms the MAP baseline at the system, sentence, and span-levels. Furthermore, to mitigate the computational cost of MBR decoding, we demonstrate that applying MBR distillation enables a standard greedy model to match MBR decoding performance, effectively eliminating the inference-time latency bottleneck.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michelle Wastl, Jannis Vamvas, Rico Sennrich
Natural Language Processing
ๆฅ็ๆ่ฆ
Recognizing semantic differences across documents, especially in different languages, is crucial for text generation evaluation and multilingual content alignment. However, as a standalone task it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English-German, English-French, and English-Italian with token-level difference annotations by human annotators. We evaluate a variety of open-source and closed source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models. We make our code and datasets publicly available.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases. The code is available at https://github.com/OpenMOSS/rope_pp.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sebastian Sztwiertnia, Felix Friedrich, Kristian Kersting, Patrick Schramowski, Bjรถrn Deiseroth
Natural Language Processing
ๆฅ็ๆ่ฆ
Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pengqian Lu, Jie Lu, Anjin Liu, Guangquan Zhang
ๆฅ็ๆ่ฆ
Detecting hallucinations in Retrieval-Augmented Generation (RAG) remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge (stored in FFNs) and retrieved context. However, this perspective is incomplete, failing to account for the impact of other components in the generative process, such as the user query, previously generated tokens, the current token itself, and the final LayerNorm adjustment. To address this, we introduce SPAD. First, we mathematically attribute each token's probability into seven distinct sources: Query, RAG, Past, Current Token, FFN, Final LayerNorm, and Initial Embedding. This attribution quantifies how each source contributes to the generation of the current token. Then, we aggregate these scores by POS tags to quantify how different components drive specific linguistic categories. By identifying anomalies, such as Nouns relying on Final LayerNorm, SPAD effectively detects hallucinations. Extensive experiments demonstrate that SPAD achieves state-of-the-art performance
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhuoran Zhuang, Ye Chen, Jianghao Su, Chao Luo, Luhui Liu et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) Gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reducing sample efficiency and destabilizing training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback - encouraging models to first master parseable and properly formatted tool calls, then optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU to fairly score concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
๐
2025-12-08
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Yifei Huang, Tianyu Yan, Sitong Gong, Xiwei Gao, Caixin Kang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present the Living Novel, an end-to-end system that transforms any literary work into an immersive, multi-character conversational experience. This system is designed to solve two fundamental challenges for LLM-driven characters. Firstly, generic LLMs suffer from persona drift, often failing to stay in character. Secondly, agents often exhibit abilities that extend beyond the constraints of the story's world and logic, leading to both narrative incoherence (spoiler leakage) and robustness failures (frame-breaking). To address these challenges, we introduce a novel two-stage training pipeline. Our Deep Persona Alignment (DPA) stage uses data-free reinforcement finetuning to instill deep character fidelity. Our Coherence and Robustness Enhancing (CRE) stage then employs a story-time-aware knowledge graph and a second retrieval-grounded training pass to architecturally enforce these narrative constraints. We validate our system through a multi-phase evaluation using Jules Verne's Twenty Thousand Leagues Under the Sea. A lab study with a detailed ablation of system components is followed by a 5-day in-the-wild diary study. Our DPA pipeline helps our specialized model outperform GPT-4o on persona-specific metrics, and our CRE stage achieves near-perfect performance in coherence and robustness measures. Our study surfaces practical design guidelines for AI-driven narrative systems: we find that character-first self-training is foundational for believability, while explicit story-time constraints are crucial for sustaining coherent, interruption-resilient mobile-web experiences.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang et al.
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from ``cold-start'' format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amir Mohammad Akhlaghi, Amirhossein Shabani, Mostafa Abdolmaleki, Saeed Reza Kheradpisheh
Natural Language Processing
ๆฅ็ๆ่ฆ
The democratization of AI is currently hindered by the immense computational costs required to train Large Language Models (LLMs) for low-resource languages. This paper presents Persian-Phi, a 3.8B parameter model that challenges the assumption that robust multilingual capabilities require massive model sizes or multilingual baselines. We demonstrate how Microsoft Phi-3 Mini -- originally a monolingual English model -- can be effectively adapted to Persian through a novel, resource-efficient curriculum learning pipeline. Our approach employs a unique "warm-up" stage using bilingual narratives (Tiny Stories) to align embeddings prior to heavy training, followed by continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Despite its compact size, Persian-Phi achieves competitive results on Open Persian LLM Leaderboard in HuggingFace. Our findings provide a validated, scalable framework for extending the reach of state-of-the-art LLMs to underrepresented languages with minimal hardware resources. The Persian-Phi model is publicly available at https://huggingface.co/amirakhlaghiqqq/PersianPhi.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Niklas Mellgren, Peter Schneider-Kamp, Lukas Galke Poech
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Ensuring reliable tool use is critical for safe agentic AI systems. Language models frequently produce unreliable reasoning with plausible but incorrect solutions that are difficult to verify. To address this, we investigate fine-tuning models to use Prolog as an external tool for verifiable computation. Using Group Relative Policy Optimization (GRPO), we fine-tune Qwen2.5-3B-Instruct on a cleaned GSM8K-Prolog-Prover dataset while varying (i) prompt structure, (ii) reward composition (execution, syntax, semantics, structure), and (iii) inference protocol: single-shot, best-of-N, and two agentic modes where Prolog is invoked internally or independently. Our reinforcement learning approach outperforms supervised fine-tuning, with our 3B model achieving zero-shot MMLU performance comparable to 7B few-shot results. Our findings reveal that: 1) joint tuning of prompt, reward, and inference shapes program syntax and logic; 2) best-of-N with external Prolog verification maximizes accuracy on GSM8K; 3) agentic inference with internal repair yields superior zero-shot generalization on MMLU-Stem and MMLU-Pro. These results demonstrate that grounding model reasoning in formal verification systems substantially improves reliability and auditability for safety-critical applications. The source code for reproducing our experiments is available under https://github.com/niklasmellgren/grpo-prolog-inference
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yezi Liu, Hanning Chen, Wenjun Huang, Yang Ni, Mohsen Imani
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) possess vast knowledge acquired from extensive training corpora, but they often cannot remove specific pieces of information when needed, which makes it hard to handle privacy, bias mitigation, and knowledge correction. Traditional model unlearning approaches require computationally expensive fine-tuning or direct weight editing, making them impractical for real-world deployment. In this work, we introduce LoRA-based Unlearning with Negative Examples (LUNE), a lightweight framework that performs negative-only unlearning by updating only low-rank adapters while freezing the backbone, thereby localizing edits and avoiding disruptive global changes. Leveraging Low-Rank Adaptation (LoRA), LUNE targets intermediate representations to suppress (or replace) requested knowledge with an order-of-magnitude lower compute and memory than full fine-tuning or direct weight editing. Extensive experiments on multiple factual unlearning tasks show that LUNE: (I) achieves effectiveness comparable to full fine-tuning and memory-editing methods, and (II) reduces computational cost by about an order of magnitude.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yezi Liu, Hanning Chen, Wenjun Huang, Yang Ni, Mohsen Imani
ๆฅ็ๆ่ฆ
Unlearning in large foundation models (e.g., LLMs) is essential for enabling dynamic knowledge updates, enforcing data deletion rights, and correcting model behavior. However, existing unlearning methods often require full-model fine-tuning or access to the original training data, which limits their scalability and practicality. In this work, we introduce Recover-to-Forget (R2F), a novel framework for efficient unlearning in LLMs based on reconstructing full-model gradient directions from low-rank LoRA adapter updates. Rather than performing backpropagation through the full model, we compute gradients with respect to LoRA parameters using multiple paraphrased prompts and train a gradient decoder to approximate the corresponding full-model gradients. To ensure applicability to larger or black-box models, the decoder is trained on a proxy model and transferred to target models. We provide a theoretical analysis of cross-model generalization and demonstrate that our method achieves effective unlearning while preserving general model performance. Experimental results demonstrate that R2F offers a scalable and lightweight alternative for unlearning in pretrained LLMs without requiring full retraining or access to internal parameters.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Revekka Kyriakoglou, Anna Pappa
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This article presents a hybrid methodology for building a multilingual corpus designed to support the study of emerging concepts in the humanities and social sciences (HSS), illustrated here through the case of ``non-technological innovation''. The corpus relies on two complementary sources: (1) textual content automatically extracted from company websites, cleaned for French and English, and (2) annual reports collected and automatically filtered according to documentary criteria (year, format, duplication). The processing pipeline includes automatic language detection, filtering of non-relevant content, extraction of relevant segments, and enrichment with structural metadata. From this initial corpus, a derived dataset in English is created for machine learning purposes. For each occurrence of a term from the expert lexicon, a contextual block of five sentences is extracted (two preceding and two following the sentence containing the term). Each occurrence is annotated with the thematic category associated with the term, enabling the construction of data suitable for supervised classification tasks. This approach results in a reproducible and extensible resource, suitable both for analyzing lexical variability around emerging concepts and for generating datasets dedicated to natural language processing applications.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tomoki Doi, Masaru Isonuma, Hitomi Yanaka
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models have the potential to generate explanations for their own predictions in a variety of styles based on user instructions. Recent research has examined whether these self-explanations faithfully reflect the models' actual behavior and has found that they often lack faithfulness. However, the question of how to improve faithfulness remains underexplored. Moreover, because different explanation styles have superficially distinct characteristics, it is unclear whether improvements observed in one style also arise when using other styles. This study analyzes the effects of training for faithful self-explanations and the extent to which these effects generalize, using three classification tasks and three explanation styles. We construct one-word constrained explanations that are likely to be faithful using a feature attribution method, and use these pseudo-faithful self-explanations for continual learning on instruction-tuned models. Our experiments demonstrate that training can improve self-explanation faithfulness across all classification tasks and explanation styles, and that these improvements also show signs of generalization to the multi-word settings and to unseen tasks. Furthermore, we find consistent cross-style generalization among three styles, suggesting that training may contribute to a broader improvement in faithful self-explanation ability.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Srihari Bandarupalli, Bhavana Akkiraju, Charan Devarakonda, Vamsiraghusimha Narsinga, Anil Kumar Vuppala
ๆฅ็ๆ่ฆ
Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continuous pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically-aware tokenization to develop a 300M parameter model that achieves performance comparable to systems 5 times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are more critical factors for low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bhavana Akkiraju, Srihari Bandarupalli, Swathi Sambangi, Vasavi Ravuri, R Vijaya Saraswathi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite Telugu being spoken by over 80 million people, speech translation research for this morphologically rich language remains severely underexplored. We address this gap by developing a high-quality Telugu--English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split). Our systematic comparison of cascaded versus end-to-end architectures shows that while IndicWhisper + IndicMT achieves the highest performance due to extensive Telugu-specific training data, finetuned SeamlessM4T models demonstrate remarkable competitiveness despite using significantly less Telugu-specific training data. This finding suggests that with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can achieve performance comparable to cascaded approaches in low-resource settings. Our metric reliability study evaluating BLEU, METEOR, ChrF++, ROUGE-L, TER, and BERTScore against human judgments reveals that traditional metrics provide better quality discrimination than BERTScore for Telugu--English translation. The work delivers three key contributions: a reproducible Telugu--English benchmark, empirical evidence of competitive end-to-end performance potential in low-resource scenarios, and practical guidance for automatic evaluation in morphologically complex language pairs.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mengqi Wang, Jianwei Wang, Qing Liu, Xiwei Xu, Zhenchang Xing et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Error detection (ED), which aims to identify incorrect or inconsistent cell values in tabular data, is important for ensuring data quality. Recent state-of-the-art ED methods leverage the pre-trained knowledge and semantic capability embedded in large language models (LLMs) to directly label whether a cell is erroneous. However, this LLM-as-a-labeler pipeline (1) relies on the black box, implicit decision process, thus failing to provide explainability for the detection results, and (2) is highly sensitive to prompts, yielding inconsistent outputs due to inherent model stochasticity, therefore lacking robustness. To address these limitations, we propose an LLM-as-an-inducer framework that adopts LLM to induce the decision tree for ED (termed TreeED) and further ensembles multiple such trees for consensus detection (termed ForestED), thereby improving explainability and robustness. Specifically, based on prompts derived from data context, decision tree specifications and output requirements, TreeED queries the LLM to induce the decision tree skeleton, whose root-to-leaf decision paths specify the stepwise procedure for evaluating a given sample. Each tree contains three types of nodes: (1) rule nodes that perform simple validation checks (e.g., format or range), (2) Graph Neural Network (GNN) nodes that capture complex patterns (e.g., functional dependencies), and (3) leaf nodes that output the final decision types (error or clean). Furthermore, ForestED employs uncertainty-based sampling to obtain multiple row subsets, constructing a decision tree for each subset using TreeED. It then leverages an Expectation-Maximization-based algorithm that jointly estimates tree reliability and optimizes the consensus ED prediction. Extensive xperiments demonstrate that our methods are accurate, explainable and robust, achieving an average F1-score improvement of 16.1% over the best baseline.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
To address the trade-off between robustness and performance for robust VLM, we observe that function words could incur vulnerability of VLMs against cross-modal adversarial attacks, and propose Function-word De-Attention (FDA) accordingly to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, as well as in-depth ablation studies and analysis. Code will be made publicly at https://github.com/michaeltian108/FDA.
๐
2025-12-08
๐ AAAI 2026
๐ฅ Feng Liang, Weixin Zeng, Runhao Zhao, Xiang Zhao
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, temporal reasoning, particularly under complex temporal constraints, remains a major challenge. To this end, existing approaches have explored symbolic methods, which encode temporal structure explicitly, and reflective mechanisms, which revise reasoning errors through multi-step inference. Nonetheless, symbolic approaches often underutilize the reasoning capabilities of LLMs, while reflective methods typically lack structured temporal representations, which can result in inconsistent or hallucinated reasoning. As a result, even when the correct temporal context is available, LLMs may still misinterpret or misapply time-related information, leading to incomplete or inaccurate answers. To address these limitations, in this work, we propose Neuro-Symbolic Temporal Reasoning (NeSTR), a novel framework that integrates structured symbolic representations with hybrid reflective reasoning to enhance the temporal sensitivity of LLM inference. NeSTR preserves explicit temporal relations through symbolic encoding, enforces logical consistency via verification, and corrects flawed inferences using abductive reflection. Extensive experiments on diverse temporal question answering benchmarks demonstrate that NeSTR achieves superior zero-shot performance and consistently improves temporal reasoning without any fine-tuning, showcasing the advantage of neuro-symbolic integration in enhancing temporal understanding in large language models.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiujie Song, Qi Jia, Shota Watanabe, Xiaoyi Pang, Ruijie Chen et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
An image can convey a compelling story by presenting rich, logically connected visual clues. These connections form Chains-of-Reasoning (CoRs) within the image, enabling viewers to infer events, causal relationships, and other information, thereby understanding the underlying story. In this paper, we focus on these semantically rich images and define them as Storytelling Images. Such images have diverse applications beyond illustration creation and cognitive screening, leveraging their ability to convey multi-layered information visually and inspire active interpretation. However, due to their complex semantic nature, Storytelling Images are inherently challenging to create, and thus remain relatively scarce. To address this challenge, we introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images. Specifically, we propose a two-stage pipeline, StorytellingPainter, which combines the creative reasoning abilities of Large Language Models (LLMs) with the visual synthesis capabilities of Text-to-Image (T2I) models to generate Storytelling Images. Alongside this pipeline, we develop a dedicated evaluation framework comprising three main evaluators: a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator and a Story-Image Alignment Evaluator. Given the critical role of story generation in the Storytelling Image Generation task and the performance disparity between open-source and proprietary LLMs, we further explore tailored training strategies to reduce this gap, resulting in a series of lightweight yet effective models named Mini-Storytellers. Experimental results demonstrate the feasibility and effectiveness of our approaches. The code is available at https://github.com/xiujiesong/StorytellingImageGeneration.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xuan Zhang, Wenxuan Zhang, Anxu Wang, See-Kiong Ng, Yang Deng
Robotics
ๆฅ็ๆ่ฆ
Multi-agent role-playing has recently shown promise for studying social behavior with language agents, but existing simulations are mostly monolingual and fail to model cross-lingual interaction, an essential property of real societies. We introduce MASim, the first multilingual agent-based simulation framework that supports multi-turn interaction among generative agents with diverse sociolinguistic profiles. MASim offers two key analyses: (i) global public opinion modeling, by simulating how attitudes toward open-domain hypotheses evolve across languages and cultures, and (ii) media influence and information diffusion, via autonomous news agents that dynamically generate content and shape user behavior. To instantiate simulations, we construct the MAPS benchmark, which combines survey questions and demographic personas drawn from global population distributions. Experiments on calibration, sensitivity, consistency, and cultural case studies show that MASim reproduces sociocultural phenomena and highlights the importance of multilingual simulation for scalable, controlled computational social science.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wonbeen Lee, Channyoung Lee, Junho Sohn, Hansam Cho
ๆฅ็ๆ่ฆ
With the recent surge in personalized learning, Intelligent Tutoring Systems (ITS) that can accurately track students' individual knowledge states and provide tailored learning paths based on this information are in demand as an essential task. This paper focuses on the core technology of Knowledge Tracing (KT) models that analyze students' sequences of interactions to predict their knowledge acquisition levels. However, existing KT models suffer from limitations such as restricted input data formats, cold start problems arising with new student enrollment or new question addition, and insufficient stability in real-world service environments. To overcome these limitations, a Practical Interlinked Concept Knowledge Tracing (PICKT) model that can effectively process multiple types of input data is proposed. Specifically, a knowledge map structures the relationships among concepts considering the question and concept text information, thereby enabling effective knowledge tracing even in cold start situations. Experiments reflecting real operational environments demonstrated the model's excellent performance and practicality. The main contributions of this research are as follows. First, a model architecture that effectively utilizes diverse data formats is presented. Second, significant performance improvements are achieved over existing models for two core cold start challenges: new student enrollment and new question addition. Third, the model's stability and practicality are validated through delicate experimental design, enhancing its applicability in real-world product environments. This provides a crucial theoretical and technical foundation for the practical implementation of next-generation ITS.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fenghua Weng, Chaochao Lu, Xia Hu, Wenqi Shao, Wenjie Wang
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at https://think-reflect-revise.github.io/.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lauren Levine, Amir Zeldes
ๆฅ็ๆ่ฆ
Bridging is an anaphoric phenomenon where the referent of an entity in a discourse is dependent on a previous, non-identical entity for interpretation, such as in "There is 'a house'. 'The door' is red," where the door is specifically understood to be the door of the aforementioned house. While there are several existing resources in English for bridging anaphora, most are small, provide limited coverage of the phenomenon, and/or provide limited genre coverage. In this paper, we introduce GUMBridge, a new resource for bridging, which includes 16 diverse genres of English, providing both broad coverage for the phenomenon and granular annotations for the subtype categorization of bridging varieties. We also present an evaluation of annotation quality and report on baseline performance using open and closed source contemporary LLMs on three tasks underlying our data, showing that bridging resolution and subtype classification remain difficult NLP tasks in the age of LLMs.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nithin Sivakumaran, Justin Chih-Yao Chen, David Wan, Yue Zhang, Jaehong Yoon et al.
Computer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, medical knowledge, etc.), but knowing which tools to call (and when to call them) can be challenging. We introduce DART, a multi-agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, spatial reasoning, etc.) that can resolve inter-agent disagreement. These tools allow for fruitful multi-agent discussion by introducing new information, and by providing tool-aligned agreement scores that highlight agents in agreement with expert tools, thereby facilitating discussion. We utilize an aggregator agent to select the best answer by providing the agent outputs and tool information. We test DART on four diverse benchmarks and show that our approach improves over multi-agent debate as well as over single agent tool-calling frameworks, beating the next-strongest baseline (multi-agent debate with a judge model) by 3.4% and 2.4% on A-OKVQA and MMMU respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool-calling, single agent, and multi-agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi-agent methods. Finally, we study the tool call distribution, finding that diverse tools are reliably used to help resolve disagreement.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Miguel Ingram, Arthur Joseph Merritt
Natural Language Processing
ๆฅ็ๆ่ฆ
Responding to Hodel et al.'s (2024) call for a formal definition of task relatedness in re-arc, we present the first 9-category taxonomy of all 400 tasks, validated at 97.5% accuracy via rule-based code analysis. We prove the taxonomy's visual coherence by training a CNN on raw grid pixels (95.24% accuracy on S3, 36.25% overall, 3.3x chance), then apply the taxonomy diagnostically to the original ARC-AGI-2 test set. Our curriculum analysis reveals 35.3% of tasks exhibit low neural affinity for Transformers--a distributional bias mirroring ARC-AGI-2. To probe this misalignment, we fine-tuned a 1.7M-parameter Transformer across 302 tasks, revealing a profound Compositional Gap: 210 of 302 tasks (69.5%) achieve >80% cell accuracy (local patterns) but <10% grid accuracy (global synthesis). This provides direct evidence for a Neural Affinity Ceiling Effect, where performance is bounded by architectural suitability, not curriculum. Applying our framework to Li et al.'s independent ViTARC study (400 specialists, 1M examples each) confirms its predictive power: Very Low affinity tasks achieve 51.9% versus 77.7% for High affinity (p<0.001), with a task at 0% despite massive data. The taxonomy enables precise diagnosis: low-affinity tasks (A2) hit hard ceilings, while high-affinity tasks (C1) reach 99.8%. These findings indicate that progress requires hybrid architectures with affinity-aligned modules. We release our validated taxonomy,
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jungmin Lee, Gwangeun Byeon, Yulhwa Kim, Seokin Hong
Natural Language Processing
ๆฅ็ๆ่ฆ
Pruning has emerged as a promising direction for accelerating large language model (LLM) inference, yet existing approaches often suffer from instability because they rely on offline calibration data that may not generalize across inputs. In this work, we introduce Token Filtering, a lightweight online structured pruning technique that makes pruning decisions directly during inference without any calibration data. The key idea is to measure token redundancy via joint key-value similarity and skip redundant attention computations, thereby reducing inference cost while preserving critical information. To further enhance stability, we design a variance-aware fusion strategy that adaptively weights key and value similarity across heads, ensuring that informative tokens are retained even under high pruning ratios. This design introduces no additional memory overhead and provides a more reliable criterion for token importance. Extensive experiments on LLaMA-2 (7B/13B), LLaMA-3 (8B), and Mistral (7B) demonstrate that Token Filtering consistently outperforms prior structured pruning methods, preserving accuracy on commonsense reasoning benchmarks and maintaining strong performance on challenging tasks such as MMLU, even with 50% pruning.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shiwei Guo, Sihang Jiang, Qianxi He, Yanghua Xiao, Jiaqing Liang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In recent years, large language models (LLMs) have demonstrated strong performance on multilingual tasks. Given its wide range of applications, cross-cultural understanding capability is a crucial competency. However, existing benchmarks for evaluating whether LLMs genuinely possess this capability suffer from three key limitations: a lack of contextual scenarios, insufficient cross-cultural concept mapping, and limited deep cultural reasoning capabilities. To address these gaps, we propose SAGE, a scenario-based benchmark built via cross-cultural core concept alignment and generative task design, to evaluate LLMs' cross-cultural understanding and reasoning. Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions. Using this framework, we curated 210 core concepts and constructed 4530 test items across 15 specific real-world scenarios, organized under four broader categories of cross-cultural situations, following established item design principles. The SAGE dataset supports continuous expansion, and experiments confirm its transferability to other languages. It reveals model weaknesses across both dimensions and scenarios, exposing systematic limitations in cross-cultural reasoning. While progress has been made, LLMs are still some distance away from reaching a truly nuanced cross-cultural understanding. In compliance with the anonymity policy, we include data and code in the supplement materials. In future versions, we will make them publicly available online.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Emma Markle, Javier Gutierrez Bach, Shira Wein
ๆฅ็ๆ่ฆ
Uniform Meaning Representation (UMR) is a novel graph-based semantic representation which captures the core meaning of a text, with flexibility incorporated into the annotation schema such that the breadth of the world's languages can be annotated (including low-resource languages). While UMR shows promise in enabling language documentation, improving low-resource language technologies, and adding interpretability, the downstream applications of UMR can only be fully explored when text-to-UMR parsers enable the automatic large-scale production of accurate UMR graphs at test time. Prior work on text-to-UMR parsing is limited to date. In this paper, we introduce two methods for English text-to-UMR parsing, one of which fine-tunes existing parsers for Abstract Meaning Representation and the other, which leverages a converter from Universal Dependencies, using prior work as a baseline. Our best-performing model, which we call SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, indicating substantial gains towards automatic UMR parsing.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Richard Young
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite substantial investment in safety alignment, the vulnerability of large language models to sophisticated multi-turn adversarial attacks remains poorly characterized, and whether model scale or inference mode affects robustness is unknown. This study employed the TEMPEST multi-turn attack framework to evaluate ten frontier models from eight vendors across 1,000 harmful behaviors, generating over 97,000 API queries across adversarial conversations with automated evaluation by independent safety classifiers. Results demonstrated a spectrum of vulnerability: six models achieved 96% to 100% attack success rate (ASR), while four showed meaningful resistance, with ASR ranging from 42% to 78%; enabling extended reasoning on identical architecture reduced ASR from 97% to 42%. These findings indicate that safety alignment quality varies substantially across vendors, that model scale does not predict adversarial robustness, and that thinking mode provides a deployable safety enhancement. Collectively, this work establishes that current alignment techniques remain fundamentally vulnerable to adaptive multi-turn attacks regardless of model scale, while identifying deliberative inference as a promising defense direction.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity, is arguable by cognitive scientist to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Matteo Boglioni, Andrea Sgobbi, Gabriel Tavernini, Francesco Rita, Marius Mosbach et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
A large language model's (LLM's) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model's performance across multiple OOD testsets throughout a finetuning run; we then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated are generalisation performances once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jeremy Yang, Noah Yonack, Kate Zyskowski, Denis Yarats, Johnny Ho et al.
ๆฅ็ๆ่ฆ
This paper presents the first large-scale field study of the adoption, usage intensity, and use cases of general-purpose AI agents operating in open-world web environments. Our analysis centers on Comet, an AI-powered browser developed by Perplexity, and its integrated agent, Comet Assistant. Drawing on hundreds of millions of anonymized user interactions, we address three fundamental questions: Who is using AI agents? How intensively are they using them? And what are they using them for? Our findings reveal substantial heterogeneity in adoption and usage across user segments. Earlier adopters, users in countries with higher GDP per capita and educational attainment, and individuals working in digital or knowledge-intensive sectors -- such as digital technology, academia, finance, marketing, and entrepreneurship -- are more likely to adopt or actively use the agent. To systematically characterize the substance of agent usage, we introduce a hierarchical agentic taxonomy that organizes use cases across three levels: topic, subtopic, and task. The two largest topics, Productivity & Workflow and Learning & Research, account for 57% of all agentic queries, while the two largest subtopics, Courses and Shopping for Goods, make up 22%. The top 10 out of 90 tasks represent 55% of queries. Personal use constitutes 55% of queries, while professional and educational contexts comprise 30% and 16%, respectively. In the short term, use cases exhibit strong stickiness, but over time users tend to shift toward more cognitively oriented topics. The diffusion of increasingly capable AI agents carries important implications for researchers, businesses, policymakers, and educators, inviting new lines of inquiry into this rapidly emerging class of AI capabilities.
๐
2025-12-08
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Lukas Johannes Mรถller
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The escalating sophistication and variety of cyber threats have rendered static honeypots inadequate, necessitating adaptive, intelligence-driven deception. In this work, ADLAH is introduced: an Adaptive Deep Learning Anomaly Detection Honeynet designed to maximize high-fidelity threat intelligence while minimizing cost through autonomous orchestration of infrastructure. The principal contribution is offered as an end-to-end architectural blueprint and vision for an AI-driven deception platform. Feasibility is evidenced by a functional prototype of the central decision mechanism, in which a reinforcement learning (RL) agent determines, in real time, when sessions should be escalated from low-interaction sensor nodes to dynamically provisioned, high-interaction honeypots. Because sufficient live data were unavailable, field-scale validation is not claimed; instead, design trade-offs and limitations are detailed, and a rigorous roadmap toward empirical evaluation at scale is provided. Beyond selective escalation and anomaly detection, the architecture pursues automated extraction, clustering, and versioning of bot attack chains, a core capability motivated by the empirical observation that exposed services are dominated by automated traffic. Together, these elements delineate a practical path toward cost-efficient capture of high-value adversary behavior, systematic bot versioning, and the production of actionable threat intelligence.
๐
2025-12-08
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Prithila Angkan, Amin Jalali, Paul Hungler, Ali Etemad
Machine Learning
ๆฅ็ๆ่ฆ
We present a novel graph-based learning of EEG representations with gradient alignment (GEEGA) that leverages multi-domain information to learn EEG representations for brain-computer interfaces. Our model leverages graph convolutional networks to fuse embeddings from frequency-based topographical maps and time-frequency spectrograms, capturing inter-domain relationships. GEEGA addresses the challenge of achieving high inter-class separability, which arises from the temporally dynamic and subject-sensitive nature of EEG signals by incorporating the center loss and pairwise difference loss. Additionally, GEEGA incorporates a gradient alignment strategy to resolve conflicts between gradients from different domains and the fused embeddings, ensuring that discrepancies, where gradients point in conflicting directions, are aligned toward a unified optimization direction. We validate the efficacy of our method through extensive experiments on three publicly available EEG datasets: BCI-2a, CL-Drive and CLARE. Comprehensive ablation studies further highlight the impact of various components of our model.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinyuan Cao, Santosh S. Vempala
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
๐
2025-12-08
๐ ArXiv Preprint (quant-ph)
๐ฅ M. A. Farooq, G. Di Guglielmo, A. Rajagopala, N. Tran, V. A. Chhabria et al.
Machine Learning
ๆฅ็ๆ่ฆ
Qubit readout is a critical operation in quantum computing systems, which maps the analog response of qubits into discrete classical states. Deep neural networks (DNNs) have recently emerged as a promising solution to improve readout accuracy . Prior hardware implementations of DNN-based readout are resource-intensive and suffer from high inference latency, limiting their practical use in low-latency decoding and quantum error correction (QEC) loops. This paper proposes LUNA, a fast and efficient superconducting qubit readout accelerator that combines low-cost integrator-based preprocessing with Look-Up Table (LUT) based neural networks for classification. The architecture uses simple integrators for dimensionality reduction with minimal hardware overhead, and employs LogicNets (DNNs synthesized into LUT logic) to drastically reduce resource usage while enabling ultra-low-latency inference. We integrate this with a differential evolution based exploration and optimization framework to identify high-quality design points. Our results show up to a 10.95x reduction in area and 30% lower latency with little to no loss in fidelity compared to the state-of-the-art. LUNA enables scalable, low-footprint, and high-speed qubit readout, supporting the development of larger and more reliable quantum computing systems.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present GRAPE (Group RepresentAtional Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\mathrm{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n)=\exp(n\,ฯ\,\mathbf{L})$ with a rank-2 skew generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes are the canonical coordinate pairs with log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise as rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project Page: https://github.com/model-architectures/GRAPE.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Raunak Jain, Mudita Khurana
ๆฅ็ๆ่ฆ
LLM-based agents are rapidly being plugged into expert decision-support, yet in messy, high-stakes settings they rarely make the team smarter: human-AI teams often underperform the best individual, experts oscillate between verification loops and over-reliance, and the promised complementarity does not materialise. We argue this is not just a matter of accuracy, but a fundamental gap in how we conceive AI assistance: expert decisions are made through collaborative cognitive processes where mental models, goals, and constraints are continually co-constructed, tested, and revised between human and AI. We propose Collaborative Causal Sensemaking (CCS) as a research agenda and organizing framework for decision-support agents: systems designed as partners in cognitive work, maintaining evolving models of how particular experts reason, helping articulate and revise goals, co-constructing and stress-testing causal hypotheses, and learning from the outcomes of joint decisions so that both human and agent improve over time. We sketch challenges around training ecologies that make collaborative thinking instrumentally valuable, representations and interaction protocols for co-authored models, and evaluation centred on trust and complementarity. These directions can reframe MAS research around agents that participate in collaborative sensemaking and act as AI teammates that think with their human partners.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Nearchos Potamitis, Lars Klein, Akhil Arora
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while ignoring the intrinsic uncertainty that naturally arises from stochastic decoding. This omission creates a blind spot because practitioners cannot reliably assess whether a method's reported performance is stable, reproducible, or cost-consistent. We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in LLM reasoning. ReasonBENCH provides (i) a modular evaluation library that standardizes reasoning frameworks, models, and tasks, (ii) a multi-run protocol that reports statistically reliable metrics for both quality and cost, and (iii) a public leaderboard to encourage variance-aware reporting. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. Notably, even strategies with similar average performance can display confidence intervals up to four times wider, and the top-performing methods often incur higher and less stable costs. Such instability compromises reproducibility across runs and, consequently, the reliability of reported performance. To better understand these dynamics, we further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability. Our results highlight reproducibility as a critical dimension for reliable LLM reasoning and provide a foundation for future reasoning methods and uncertainty quantification techniques. ReasonBENCH is publicly available at https://github.com/au-clan/ReasonBench .
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiaxu Liu, Yuhe Bai, Christos-Savvas Bouganis
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an \textit{Associative Memory} interpretation, its difference-style update renders the training objective effectively \emph{unbounded}. In contrast, Softmax attention normalizes updates, leading to \emph{memory shrinkage and gradient vanishing}. We propose GatedFWA: a Memory-\underline{Gated} (\underline{F}lash) \underline{W}indowed \underline{A}ttention mechanism that preserves SWAs efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulate a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context, and it integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Dongjian Hu, Junxi Wu, Shu-Tao Xia, Changliang Zou
ๆฅ็ๆ่ฆ
Conformal prediction provides a pivotal and flexible technique for uncertainty quantification by constructing prediction sets with a predefined coverage rate. Many online conformal prediction methods have been developed to address data distribution shifts in fully adversarial environments, resulting in overly conservative prediction sets. We propose Conformal Optimistic Prediction (COP), an online conformal prediction algorithm incorporating underlying data pattern into the update rule. Through estimated cumulative distribution function of non-conformity scores, COP produces tighter prediction sets when predictable pattern exists, while retaining valid coverage guarantees even when estimates are inaccurate. We establish a joint bound on coverage and regret, which further confirms the validity of our approach. We also prove that COP achieves distribution-free, finite-sample coverage under arbitrary learning rates and can converge when scores are $i.i.d.$. The experimental results also show that COP can achieve valid coverage and construct shorter prediction intervals than other baselines.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matteo Cipollina, Michail Karatarakis, Freek Wiedijk
Machine Learning
ๆฅ็ๆ่ฆ
Neural networks are widely used, yet their analysis and verification remain challenging. In this work, we present a Lean 4 formalization of neural networks, covering both deterministic and stochastic models. We first formalize Hopfield networks, recurrent networks that store patterns as stable states. We prove convergence and the correctness of Hebbian learning, a training rule that updates network parameters to encode patterns, here limited to the case of pairwise-orthogonal patterns. We then consider stochastic networks, where updates are probabilistic and convergence is to a stationary distribution. As a canonical example, we formalize the dynamics of Boltzmann machines and prove their ergodicity, showing convergence to a unique stationary distribution using a new formalization of the Perron-Frobenius theorem.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi et al.
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models are vulnerable to jailbreak attacks, threatening their safe deployment in real-world applications. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models through a sequence of prompt-output interactions. Existing approaches typically rely on single turn optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate the problem as a multi-turn reinforcement learning task, directly optimizing the harmfulness of the final-turn output as the outcome reward. To mitigate sparse supervision and promote long-term attack strategies, we propose two heuristic process rewards: (1) controlling the harmfulness of intermediate outputs to prevent triggering the black-box model's rejection mechanisms, and (2) maintaining the semantic relevance of intermediate outputs to avoid drifting into irrelevant content. Experimental results on multiple benchmarks show consistently improved attack success rates across multiple models, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/RL-MTJail. Warning: This paper contains examples of harmful content.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Brenda Anague, Bamdad Hosseini, Issa Karambal, Jean Medard Ngnotchouye
Machine Learning
ๆฅ็ๆ่ฆ
Recent studies have shown the success of deep learning in solving forward and inverse problems in engineering and scientific computing domains, such as physics-informed neural networks (PINNs). In the fields of atmospheric science and environmental monitoring, estimating emission source locations is a central task that further relies on multiple model parameters that dictate velocity profiles and diffusion parameters. Estimating these parameters at the same time as emission sources from scarce data is a difficult task. In this work, we achieve this by leveraging the flexibility and generality of PINNs. We use a weighted adaptive method based on the neural tangent kernels to solve a source inversion problem with parameter estimation on the 2D and 3D advection-diffusion equations with unknown velocity and diffusion coefficients that may vary in space and time. Our proposed weighted adaptive method is presented as an extension of PINNs for forward PDE problems to a highly ill-posed source inversion and parameter estimation problem. The key idea behind our methodology is to attempt the joint recovery of the solution, the sources along with the unknown parameters, thereby using the underlying partial differential equation as a constraint that couples multiple unknown functional parameters, leading to more efficient use of the limited information in the measurements. We present various numerical experiments, using different types of measurements that model practical engineering systems, to show that our proposed method is indeed successful and robust to additional noise in the measurements.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Agnes Norbury, George Fairs, Alexandra L. Georgescu, Matthew M. Nour, Emilia Molimpakis et al.
Multimodal
ๆฅ็ๆ่ฆ
During psychiatric assessment, clinicians observe not only what patients report, but important nonverbal signs such as tone, speech rate, fluency, responsiveness, and body language. Weighing and integrating these different information sources is a challenging task and a good candidate for support by intelligence-driven tools - however this is yet to be realized in the clinic. Here, we argue that several important barriers to adoption can be addressed using Bayesian network modelling. To demonstrate this, we evaluate a model for depression and anxiety symptom prediction from voice and speech features in large-scale datasets (30,135 unique speakers). Alongside performance for conditions and symptoms (for depression, anxiety ROC-AUC=0.842,0.831 ECE=0.018,0.015; core individual symptom ROC-AUC>0.74), we assess demographic fairness and investigate integration across and redundancy between different input modality types. Clinical usefulness metrics and acceptability to mental health service users are explored. When provided with sufficiently rich and large-scale multimodal data streams and specified to represent common mental conditions at the symptom rather than disorder level, such models are a principled approach for building robust assessment support tools: providing clinically-relevant outputs in a transparent and explainable format that is directly amenable to expert clinical supervision.
๐
2025-12-08
๐ ArXiv Preprint (quant-ph)
๐ฅ Andrew W. Senior, Thomas Edlich, Francisco J. H. Heras, Lei M. Zhang, Oscar Higgott et al.
ๆฅ็ๆ่ฆ
Fault-tolerant quantum computing will require error rates far below those achievable with physical qubits. Quantum error correction (QEC) bridges this gap, but depends on decoders being simultaneously fast, accurate, and scalable. This combination of requirements has not yet been met by a machine-learning decoder, nor by any decoder for promising resource-efficient codes such as the colour code. Here we introduce AlphaQubit 2, a neural-network decoder that achieves near-optimal logical error rates for both surface and colour codes at large scales under realistic noise. For the colour code, it is orders of magnitude faster than other high-accuracy decoders. For the surface code, we demonstrate real-time decoding faster than 1 microsecond per cycle up to distance 11 on current commercial accelerators with better accuracy than leading real-time decoders. These results support the practical application of a wider class of promising QEC codes, and establish a credible path towards high-accuracy, real-time neural decoding at the scales required for fault-tolerant quantum computation.
๐
2025-12-08
๐ AAAI
๐ฅ Yonggeon Lee, Jibin Hwang, Alfred Malengo Kondoro, Juhyun Song, Youngtae Noh
Natural Language Processing
ๆฅ็ๆ่ฆ
Electric vehicles (EVs) are key to sustainable mobility, yet their lithium-ion batteries (LIBs) degrade more rapidly under prolonged high states of charge (SOC). This can be mitigated by delaying full charging \ours until just before departure, which requires accurate prediction of user departure times. In this work, we propose Transformer-based real-time-to-event (TTE) model for accurate EV departure prediction. Our approach represents each day as a TTE sequence by discretizing time into grid-based tokens. Unlike previous methods primarily dependent on temporal dependency from historical patterns, our method leverages streaming contextual information to predict departures. Evaluation on a real-world study involving 93 users and passive smartphone data demonstrates that our method effectively captures irregular departure patterns within individual routines, outperforming baseline models. These results highlight the potential for practical deployment of the \ours algorithm and its contribution to sustainable transportation systems.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Anxiang Zeng, Haibo Zhang, Hailing Zhang, Kaixiang Mo, Liang Yao et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present CompassMax-V3-Thinking, a hundred-billion-scale MoE reasoning model trained with a new RL framework built on one principle: each prompt must matter. Scaling RL to this size exposes critical inefficiencies-zero-variance prompts that waste rollouts, unstable importance sampling over long horizons, advantage inversion from standard reward models, and systemic bottlenecks in rollout processing. To overcome these challenges, we introduce several unified innovations: (1) Multi-Stage Zero-Variance Elimination, which filters out non-informative prompts and stabilizes group-based policy optimization (e.g. GRPO) by removing wasted rollouts; (2) ESPO, an entropy-adaptive optimization method that balances token-level and sequence-level importance sampling to maintain stable learning dynamics; (3) a Router Replay strategy that aligns training-time MoE router decisions with inference-time behavior to mitigate train-infer discrepancies, coupled with a reward model adjustment to prevent advantage inversion; (4) a high-throughput RL system with FP8-precision rollouts, overlapped reward computation, and length-aware scheduling to eliminate performance bottlenecks. Together, these contributions form a cohesive pipeline that makes RL on hundred-billion-scale MoE models stable and efficient. The resulting model delivers strong performance across both internal and public evaluations.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Saroj Gopali, Bipin Chhetri, Deepika Giri, Sima Siami-Namini, Akbar Siami Namin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Existing data-driven approaches in modeling and predicting time series data include ARIMA (Autoregressive Integrated Moving Average), Transformer-based models, LSTM (Long Short-Term Memory) and TCN (Temporal Convolutional Network). These approaches, and in particular deep learning-based models such as LSTM and TCN, have shown great results in predicting time series data. With the advancement of leveraging pre-trained foundation models such as Large Language Models (LLMs) and more notably Google's recent foundation model for time series data, {\it TimesFM} (Time Series Foundation Model), it is of interest to investigate whether these foundation models have the capability of outperforming existing modeling approaches in analyzing and predicting time series data.
This paper investigates the performance of using LLM models for time series data prediction. We investigate the in-context learning methodology in the training of LLM models that are specific to the underlying application domain. More specifically, the paper explores training LLMs through in-context, zero-shot and few-shot learning and forecasting time series data with OpenAI {\tt o4-mini} and Gemini 2.5 Flash Lite, as well as the recent Google's Transformer-based TimesFM, a time series-specific foundation model, along with two deep learning models, namely TCN and LSTM networks. The findings indicate that TimesFM has the best overall performance with the lowest RMSE value (0.3023) and the competitive inference time (266 seconds). Furthermore, OpenAI's o4-mini also exhibits a good performance based on Zero Shot learning.
These findings highlight pre-trained time series foundation models as a promising direction for real-time forecasting, enabling accurate and scalable deployment with minimal model adaptation.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Leo Fillioux, Enzo Ferrante, Paul-Henry Cournรจde, Maria Vakalopoulou, Stergios Christodoulidis
ๆฅ็ๆ่ฆ
Large foundation models have emerged in the last years and are pushing performance boundaries for a variety of tasks. Training or even finetuning such models demands vast datasets and computational resources, which are often scarce and costly. Adaptation methods provide a computationally efficient solution to address these limitations by allowing such models to be finetuned on small amounts of data and computing power. This is achieved by appending new trainable modules to frozen backbones with only a fraction of the trainable parameters and fitting only these modules on novel tasks. Recently, the VeRA adapter was shown to excel in parameter-efficient adaptations by utilizing a pair of frozen random low-rank matrices shared across all layers. In this paper, we propose PVeRA, a probabilistic version of the VeRA adapter, which modifies the low-rank matrices of VeRA in a probabilistic manner. This modification naturally allows handling inherent ambiguities in the input and allows for different sampling configurations during training and testing. A comprehensive evaluation was performed on the VTAB-1k benchmark and seven adapters, with PVeRA outperforming VeRA and other adapters. Our code for training models with PVeRA and benchmarking all adapters is available https://github.com/leofillioux/pvera.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Aileen Liao, Dong-Ki Kim, Max Olan Smith, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei
ๆฅ็ๆ่ฆ
As a robot senses and selects actions, the world keeps changing. This inference delay creates a gap of tens to hundreds of milliseconds between the observed state and the state at execution. In this work, we take the natural generalization from zero delay to measured delay during training and inference. We introduce Delay-Aware Diffusion Policy (DA-DP), a framework for explicitly incorporating inference delays into policy learning. DA-DP corrects zero-delay trajectories to their delay-compensated counterparts, and augments the policy with delay conditioning. We empirically validate DA-DP on a variety of tasks, robots, and delays and find its success rate more robust to delay than delay-unaware methods. DA-DP is architecture agnostic and transfers beyond diffusion policies, offering a general pattern for delay-aware imitation learning. More broadly, DA-DP encourages evaluation protocols that report performance as a function of measured latency, not just task difficulty.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongjian Lan, Yucong Liu, Florian Schรคfer
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Machine learning models trained with \emph{stochastic} gradient descent (SGD) can generalize better than those trained with deterministic gradient descent (GD). In this work, we study SGD's impact on generalization through the lens of the statistical bootstrap: SGD uses gradient variability under batch sampling as a proxy for solution variability under the randomness of the data collection process. We use empirical results and theoretical analysis to substantiate this claim. In idealized experiments on empirical risk minimization, we show that SGD is drawn to parameter choices that are robust under resampling and thus avoids spurious solutions even if they lie in wider and deeper minima of the training loss. We prove rigorously that by implicitly regularizing the trace of the gradient covariance matrix, SGD controls the algorithmic variability. This regularization leads to solutions that are less sensitive to sampling noise, thereby improving generalization. Numerical experiments on neural network training show that explicitly incorporating the estimate of the algorithmic variability as a regularizer improves test performance. This fact supports our claim that bootstrap estimation underpins SGD's generalization advantages.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gracjan Gรณral, Marysia Winkels, Steven Basart
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models sometimes assert falsehoods despite internally representing the correct answer, failures of honesty rather than accuracy, which undermines auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show the Gaussian schedule outperforms random, uniform, and box-filter depth allocations, indicating that how intervention is distributed across depth materially affects outcomes beyond total strength. The method is simple, model-agnostic, requires no finetuning, and provides a low-cost control knob for eliciting truthful reporting from models' existing capabilities.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Fuyuan Lyu, Zhentai Chen, Jingyan Jiang, Lingjie Li, Xing Tang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Inspired by the success of language models (LM), scaling up deep learning recommendation systems (DLRS) has become a recent trend in the community. All previous methods tend to scale up the model parameters during training time. However, how to efficiently utilize and scale up computational resources during test time remains underexplored, which can prove to be a scaling-efficient approach and bring orthogonal improvements in LM domains. The key point in applying test-time scaling to DLRS lies in effectively generating diverse yet meaningful outputs for the same instance. We propose two ways: One is to explore the heterogeneity of different model architectures. The other is to utilize the randomness of model initialization under a homogeneous architecture. The evaluation is conducted across eight models, including both classic and SOTA models, on three benchmarks. Sufficient evidence proves the effectiveness of both solutions. We further prove that under the same inference budget, test-time scaling can outperform parameter scaling. Our test-time scaling can also be seamlessly accelerated with the increase in parallel servers when deployed online, without affecting the inference time on the user side. Code is available.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Georgios Tzachristas, Lei Deng, Ioannis Tzachristas, Gong Zhang, Renhai Chen
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We develop a unified mathematical framework for certified Top-$k$ attention truncation that quantifies approximation error at both the distribution and output levels. For a single attention distribution $P$ and its Top-$k$ truncation $\hat P$, we show that the total-variation distance coincides with the discarded softmax tail mass and satisfies $\mathrm{TV}(P,\hat P)=1-e^{-\mathrm{KL}(\hat P\Vert P)}$, yielding sharp Top-$k$-specific bounds in place of generic inequalities. From this we derive non-asymptotic deterministic bounds -- from a single boundary gap through multi-gap and blockwise variants -- that control $\mathrm{TV}(P,\hat P)$ using only the ordered logits. Using an exact head-tail decomposition, we prove that the output error factorizes as $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2=ฯ\|ฮผ_{\mathrm{tail}}-ฮผ_{\mathrm{head}}\|_2$ with $ฯ=\mathrm{TV}(P,\hat P)$, yielding a new head-tail diameter bound $\|\mathrm{Attn}(q,K,V)-\mathrm{Attn}_k(q,K,V)\|_2\leฯ\,\mathrm{diam}_{H,T}$ and refinements linking the error to $\mathrm{Var}_P(V)$. Under an i.i.d. Gaussian score model $s_i\sim\mathcal N(ฮผ,ฯ^2)$ we derive closed-form tail masses and an asymptotic rule for the minimal $k_\varepsilon$ ensuring $\mathrm{TV}(P,\hat P)\le\varepsilon$, namely $k_\varepsilon/n\approxฮฆ_c(ฯ+ฮฆ^{-1}(\varepsilon))$. Experiments on bert-base-uncased and synthetic logits confirm the predicted scaling of $k_\varepsilon/n$ and show that certified Top-$k$ can reduce scored keys by 2-4$\times$ on average while meeting the prescribed total-variation budget.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shahar Lutati
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
When should an autonomous agent commit resources to a task? We introduce the Agent Capability Problem (ACP), a framework for predicting whether an agent can solve a problem under resource constraints. Rather than relying on empirical heuristics, ACP frames problem-solving as information acquisition: an agent requires $\Itotal$ bits to identify a solution and gains $\Istep$ bits per action at cost $\Cstep$, yielding an effective cost $\Ceff = (\Itotal/\Istep), \Cstep$ that predicts resource requirements before search. We prove that $\Ceff$ lower-bounds expected cost and provide tight probabilistic upper bounds. Experimental validation shows that ACP predictions closely track actual agent performance, consistently bounding search effort while improving efficiency over greedy and random strategies. The framework generalizes across LLM-based and agentic workflows, linking principles from active learning, Bayesian optimization, and reinforcement learning through a unified information-theoretic lens. \
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yongbo Yu, Jari Peeperkorn, Johannes De Smedt, Jochen De Weerdt
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Process Model Forecasting (PMF) aims to predict how the control-flow structure of a process evolves over time by modeling the temporal dynamics of directly-follows (DF) relations, complementing predictive process monitoring that focuses on single-case prefixes. Prior benchmarks show that machine learning and deep learning models provide only modest gains over statistical baselines, mainly due to the sparsity and heterogeneity of the DF time series. We investigate Time Series Foundation Models (TSFMs), large pre-trained models for generic time series, as an alternative for PMF. Using DF time series derived from real-life event logs, we compare zero-shot use of TSFMs, without additional training, with fine-tuned variants adapted on PMF-specific data. TSFMs generally achieve lower forecasting errors (MAE and RMSE) than traditional and specialized models trained from scratch on the same logs, indicating effective transfer of temporal structure from non-process domains. While fine-tuning can further improve accuracy, the gains are often small and may disappear on smaller or more complex datasets, so zero-shot use remains a strong default. Our study highlights the generalization capability and data efficiency of TSFMs for process-related time series and, to the best of our knowledge, provides the first systematic evaluation of temporal foundation models for PMF.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kairong Luo, Zhenbo Sun, Xinyu Shi, Shengqi Chen, Bowen Yu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yongsheng Lian
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms (PPO, GRPO, and DAPO) for improving complex reasoning in large language models (LLMs). Our main contribution is a controlled transfer-learning evaluation: models are first fine-tuned on the specialized Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Across all tasks, RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark.
Our parametric analysis offers practical guidance for RL-based LLM training. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic. Additionally, we find that the Dynamic Sampling (DS) component in DAPO does not improve performance; in fact, the best overall results are achieved with DAPO when DS is disabled.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
ๆฅ็ๆ่ฆ
We propose $ฯ$-test, a global feature-selection and significance procedure for black-box predictors that combines Shapley attributions with selective inference. Given a trained model and an evaluation dataset, $ฯ$-test performs SHAP-guided screening and fits a linear surrogate on the screened features via a selection rule with a tractable selective-inference form. For each retained feature, it outputs a Shapley-based global score, a surrogate coefficient, and post-selection $p$-values and confidence intervals in a global feature-importance table. Experiments on real tabular regression tasks with tree-based and neural backbones suggest that $ฯ$-test can retain much of the predictive ability of the original model while using only a few features and producing feature sets that remain fairly stable across resamples and backbone classes. In these settings, $ฯ$-test acts as a practical global explanation layer linking Shapley-based importance summaries with classical statistical inference.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joel Ekstrand, Tor Mattsson, Zahra Taghiyarrenani, Slawomir Nowaczyk, Jens Lundstrรถm et al.
Machine Learning
ๆฅ็ๆ่ฆ
Reliable forecasting of multivariate time series under anomalous conditions is crucial in applications such as ATM cash logistics, where sudden demand shifts can disrupt operations. Modern deep forecasters achieve high accuracy on normal data but often fail when distribution shifts occur. We propose Weighted Contrastive Adaptation (WECA), a Weighted contrastive objective that aligns normal and anomaly-augmented representations, preserving anomaly-relevant information while maintaining consistency under benign variations. Evaluations on a nationwide ATM transaction dataset with domain-informed anomaly injection show that WECA improves SMAPE on anomaly-affected data by 6.1 percentage points compared to a normally trained baseline, with negligible degradation on normal data. These results demonstrate that WECA enhances forecasting reliability under anomalies without sacrificing performance during regular operations.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kassoum Sanogo, Renzo Ardiccioni
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) frequently generate hallucinated content plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHAL BENCH benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We validate our approach on Qwen2.5-VL-7B [23], with plans to extend validation across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shimin Zhang, Xianwei Chen, Yufan Shen, Ziyuan Ye, Jibin Wu
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often leads to entropy collapse, resulting in premature policy convergence and performance saturation. While manipulating token-level entropy has proven effective for promoting policy exploration, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis and intervention of the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden-state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric to quantify the heterogeneity of the model's latent dynamics, serving as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a paradigm that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX significantly mitigates premature convergence and consistently achieves state-of-the-art performance.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Jitendra K. Tugnait
ๆฅ็ๆ่ฆ
Estimation of the conditional independence graph (CIG) of high-dimensional multivariate Gaussian time series from multi-attribute data is considered. Existing methods for graph estimation for such data are based on single-attribute models where one associates a scalar time series with each node. In multi-attribute graphical models, each node represents a random vector or vector time series. In this paper we provide a unified theoretical analysis of multi-attribute graph learning for dependent time series using a penalized log-likelihood objective function formulated in the frequency domain using the discrete Fourier transform of the time-domain data. We consider both convex (sparse-group lasso) and non-convex (log-sum and SCAD group penalties) penalty/regularization functions. We establish sufficient conditions in a high-dimensional setting for consistency (convergence of the inverse power spectral density to true value in the Frobenius norm), local convexity when using non-convex penalties, and graph recovery. We do not impose any incoherence or irrepresentability condition for our convergence results. We also empirically investigate selection of the tuning parameters based on the Bayesian information criterion, and illustrate our approach using numerical examples utilizing both synthetic and real data.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jad Mounayer, Sebastian Rodriguez, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta
ๆฅ็ๆ่ฆ
Most existing latent-space models for dynamical systems require fixing the latent dimension in advance, they rely on complex loss balancing to approximate linear dynamics, and they don't regularize the latent variables. We introduce RRAEDy, a model that removes these limitations by discovering the appropriate latent dimension, while enforcing both regularized and linearized dynamics in the latent space. Built upon Rank-Reduction Autoencoders (RRAEs), RRAEDy automatically rank and prune latent variables through their singular values while learning a latent Dynamic Mode Decomposition (DMD) operator that governs their temporal progression. This structure-free yet linearly constrained formulation enables the model to learn stable and low-dimensional dynamics without auxiliary losses or manual tuning. We provide theoretical analysis demonstrating the stability of the learned operator and showcase the generality of our model by proposing an extension that handles parametric ODEs. Experiments on canonical benchmarks, including the Van der Pol oscillator, Burgers' equation, 2D Navier-Stokes, and Rotating Gaussians, show that RRAEDy achieves accurate and robust predictions. Our code is open-source and available at https://github.com/JadM133/RRAEDy. We also provide a video summarizing the main results at https://youtu.be/ox70mSSMGrM.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Youngwen Sun, Katerina Papagiannouli, Vladimir Spokoiny
Robotics
ๆฅ็ๆ่ฆ
Inspired by graph-based methodologies, we introduce a novel graph-spanning algorithm designed to identify changes in both offline and online data across low to high dimensions. This versatile approach is applicable to Euclidean and graph-structured data with unknown distributions, while maintaining control over error probabilities. Theoretically, we demonstrate that the algorithm achieves high detection power when the magnitude of the change surpasses the lower bound of the minimax separation rate, which scales on the order of $\sqrt{nd}$. Our method outperforms other techniques in terms of accuracy for both Gaussian and non-Gaussian data. Notably, it maintains strong detection power even with small observation windows, making it particularly effective for online environments where timely and precise change detection is critical.
๐
2025-12-08
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Boxuan Lyu, Haiyue Song, Hidetaka Kamigaito, Chenchen Ding, Hideki Tanaka et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Error Span Detection (ESD) is a subtask of automatic machine translation evaluation that localizes error spans in translations and labels their severity. State-of-the-art generative ESD methods typically decode using Maximum a Posteriori (MAP), assuming that model-estimated probabilities are perfectly correlated with similarity to human annotation. However, we observed that annotations dissimilar to the human annotation could achieve a higher model likelihood than the human annotation. We address this issue by applying Minimum Bayes Risk (MBR) decoding to generative ESD models. Specifically, we employ sentence- and span-level similarity metrics as utility functions to select candidate hypotheses based on their approximate similarity to the human annotation. Extensive experimental results show that our MBR decoding outperforms the MAP baseline at the system, sentence, and span-levels. Furthermore, to mitigate the computational cost of MBR decoding, we demonstrate that applying MBR distillation enables a standard greedy model to match MBR decoding performance, effectively eliminating the inference-time latency bottleneck.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qingyuan Yang, Shizhuo, Dongyue Chen, Da Teng, Zehua Gan
Natural Language Processing
ๆฅ็ๆ่ฆ
Traditional Transformers face a major bottleneck in long-sequence time series forecasting due to their quadratic complexity $(\mathcal{O}(T^2))$ and their limited ability to effectively exploit frequency-domain information. Inspired by RWKV's $\mathcal{O}(T)$ linear attention and frequency-domain modeling, we propose FRWKV, a frequency-domain linear-attention framework that overcomes these limitations. Our model integrates linear attention mechanisms with frequency-domain analysis, achieving $\mathcal{O}(T)$ computational complexity in the attention path while exploiting spectral information to enhance temporal feature representations for scalable long-sequence modeling. Across eight real-world datasets, FRWKV achieves a first-place average rank. Our ablation studies confirm the critical roles of both the linear attention and frequency-encoder components. This work demonstrates the powerful synergy between linear attention and frequency analysis, establishing a new paradigm for scalable time series modeling. Code is available at this repository: https://github.com/yangqingyuan-byte/FRWKV.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nishanth Venkatesh, Andreas A. Malikopoulos
Machine Learning
ๆฅ็ๆ่ฆ
We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. In such settings, conventional model-learning methods are fundamentally inconsistent, as the transition and reward mechanisms generated under a behavioral policy do not correspond to the interventional quantities required for evaluating a state-based policy. To address this issue, we adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories under mild invertibility conditions on proxy variables. When combined with a behavior-averaged transition model, this construction yields a surrogate MDP whose Bellman operator is well defined and consistent for state-based policies, and which integrates seamlessly with the maximum causal entropy (MaxCausalEnt) model-learning framework. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alexander Gammerman
Machine Learning
ๆฅ็ๆ่ฆ
This Inaugural Lecture was given at Royal Holloway University of London in 1996. It covers an introduction to machine learning and describes various theoretical advances and practical projects in the field. The Lecture here is presented in its original format, but a few remarks have been added in 2025 to reflect recent developments, and the list of references has been updated to enhance the convenience and accuracy for readers.
When did machine learning start? Maybe a good starting point is 1949, when Claude Shannon proposed a learning algorithm for chess-playing programs. Or maybe we should go back to the 1930s when Ronald Fisher developed discriminant analysis - a type of learning where the problem is to construct a decision rule that separates two types of vectors. Or could it be the 18th century when David Hume discussed the idea of induction? Or the 14th century, when William of Ockham formulated the principle of "simplicity" known as "Ockham's razor" (Ockham, by the way, is a small village not far from Royal Holloway). Or it may be that, like almost everything else in Western civilisation and culture, the origin of these ideas lies in the Mediterranean. After all, it was Aristotle who said that "we learn some things only by doing things".
The field of machine learning has been greatly influenced by other disciplines and the subject is in itself not a very homogeneous discipline, but includes separate, overlapping subfields. There are many parallel lines of research in ML: inductive learning, neural networks, clustering, and theories of learning. They are all part of the more general field of machine learning.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikita Gabdullin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The overall neural network (NN) performance is closely related to the properties of its embedding distribution in latent space (LS). It has recently been shown that predefined vector systems, specifically An root system vectors, can be used as targets for latent space configurations (LSC) to ensure the desired LS structure. One of the main LSC advantage is the possibility of training classifier NNs without classification layers, which facilitates training NNs on datasets with extremely large numbers of classes. This paper provides a more general overview of possible vector systems for NN training along with their properties and methods for vector system construction. These systems are used to configure LS of encoders and visual transformers to significantly speed up ImageNet-1K and 50k-600k classes LSC training. It is also shown that using the minimum number of LS dimensions for a specific number of classes results in faster convergence. The latter has potential advantages for reducing the size of vector databases used to store NN embeddings.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhiyu Liu, Zhi Han, Yandong Tang, Jun Fan, Yao Wang
Machine Learning
ๆฅ็ๆ่ฆ
The problem of low-tubal-rank tensor estimation is a fundamental task with wide applications across high-dimensional signal processing, machine learning, and image science. Traditional approaches tackle such a problem by performing tensor singular value decomposition, which is computationally expensive and becomes infeasible for large-scale tensors. Recent approaches address this issue by factorizing the tensor into two smaller factor tensors and solving the resulting problem using gradient descent. However, this kind of approach requires an accurate estimate of the tensor rank, and when the rank is overestimated, the convergence of gradient descent and its variants slows down significantly or even diverges. To address this problem, we propose an Alternating Preconditioned Gradient Descent (APGD) algorithm, which accelerates convergence in the over-parameterized setting by adding a preconditioning term to the original gradient and updating these two factors alternately. Based on certain geometric assumptions on the objective function, we establish linear convergence guarantees for more general low-tubal-rank tensor estimation problems. Then we further analyze the specific cases of low-tubal-rank tensor factorization and low-tubal-rank tensor recovery. Our theoretical results show that APGD achieves linear convergence even under over-parameterization, and the convergence rate is independent of the tensor condition number. Extensive simulations on synthetic data are carried out to validate our theoretical assertions.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Niklas Dobberstein, Jan Hamaekers
Natural Language Processing
ๆฅ็ๆ่ฆ
We present Materium: an autoregressive transformer for generating crystal structures that converts 3D material representations into token sequences. These sequences include elements with oxidation states, fractional coordinates and lattice parameters. Unlike diffusion approaches, which refine atomic positions iteratively through many denoising steps, Materium places atoms at precise fractional coordinates, enabling fast, scalable generation. With this design, the model can be trained in a few hours on a single GPU and generate samples much faster on GPUs and CPUs than diffusion-based approaches. The model was trained and evaluated using multiple properties as conditions, including fundamental properties, such as density and space group, as well as more practical targets, such as band gap and magnetic density. In both single and combined conditions, the model performs consistently well, producing candidates that align with the requested inputs.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Siyu Xu, Zijian Wang, Yunke Wang, Chenghao Xia, Tao Huang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have shown great performance in robotic manipulation by mapping visual observations and language instructions directly to actions. However, they remain brittle under distribution shifts: when test scenarios change, VLAs often reproduce memorized trajectories instead of adapting to the updated scene, which is a failure mode we refer to as the "Memory Trap". This limitation stems from the end-to-end design, which lacks explicit 3D spatial reasoning and prevents reliable identification of actionable regions in unfamiliar environments. To compensate for this missing spatial understanding, 3D Spatial Affordance Fields (SAFs) can provide a geometric representation that highlights where interactions are physically feasible, offering explicit cues about regions the robot should approach or avoid. We therefore introduce Affordance Field Intervention (AFI), a lightweight hybrid framework that uses SAFs as an on-demand plug-in to guide VLA behavior. Our system detects memory traps through proprioception, repositions the robot to recent high-affordance regions, and proposes affordance-driven waypoints that anchor VLA-generated actions. A SAF-based scorer then selects trajectories with the highest cumulative affordance. Extensive experiments demonstrate that our method achieves an average improvement of 23.5% across different VLA backbones ($ฯ_{0}$ and $ฯ_{0.5}$) under out-of-distribution scenarios on real-world robotic platforms, and 20.2% on the LIBERO-Pro benchmark, validating its effectiveness in enhancing VLA robustness to distribution shifts.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rongmei Liang, Zizheng Liu, Xiaofei Wu, Jingwen Tu
Machine Learning
ๆฅ็ๆ่ฆ
In the era of rapid development of artificial intelligence, its applications span across diverse fields, relying heavily on effective data processing and model optimization. Combined Regularized Support Vector Machines (CR-SVMs) can effectively handle the structural information among data features, but there is a lack of efficient algorithms in distributed-stored big data. To address this issue, we propose a unified optimization framework based on consensus structure. This framework is not only applicable to various loss functions and combined regularization terms but can also be effectively extended to non-convex regularization terms, showing strong scalability. Based on this framework, we develop a distributed parallel alternating direction method of multipliers (ADMM) algorithm to efficiently compute CR-SVMs when data is stored in a distributed manner. To ensure the convergence of the algorithm, we also introduce the Gaussian back-substitution method. Meanwhile, for the integrity of the paper, we introduce a new model, the sparse group lasso support vector machine (SGL-SVM), and apply it to music information retrieval. Theoretical analysis confirms that the computational complexity of the proposed algorithm is not affected by different regularization terms and loss functions, highlighting the universality of the parallel algorithm. Experiments on synthetic and free music archiv datasets demonstrate the reliability, stability, and efficiency of the algorithm.
๐
2025-12-08
๐ ArXiv Preprint (cs.MA)
๐ฅ Trung-Kiet Huynh, Duy-Minh Dao-Sy, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLMs) increasingly operate as autonomous decision-makers in interactive and multi-agent systems and human societies, understanding their strategic behaviour has profound implications for safety, coordination, and the design of AI-driven social and economic infrastructures. Assessing such behaviour requires methods that capture not only what LLMs output, but the underlying intentions that guide their decisions. In this work, we extend the FAIRGAME framework to systematically evaluate LLM behaviour in repeated social dilemmas through two complementary advances: a payoff-scaled Prisoners Dilemma isolating sensitivity to incentive magnitude, and an integrated multi-agent Public Goods Game with dynamic payoffs and multi-agent histories. These environments reveal consistent behavioural signatures across models and languages, including incentive-sensitive cooperation, cross-linguistic divergence and end-game alignment toward defection. To interpret these patterns, we train traditional supervised classification models on canonical repeated-game strategies and apply them to FAIRGAME trajectories, showing that LLMs exhibit systematic, model- and language-dependent behavioural intentions, with linguistic framing at times exerting effects as strong as architectural differences. Together, these findings provide a unified methodological foundation for auditing LLMs as strategic agents and reveal systematic cooperation biases with direct implications for AI governance, collective decision-making, and the design of safe multi-agent systems.
๐
2025-12-08
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Dmitrii Kapitan, Pavel Ovchinnikov, Konstantin Soldatov, Petr Andriushchenko, Vitalii Kapitan
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a systematic study of the application of convolutional neural networks (CNNs) as an efficient and versatile tool for the analysis of critical and low-temperature phase states in spin system models. The problem of calculating the dependence of the average energy on the spatial distribution of exchange integrals for the Edwards-Anderson model on a square lattice with frustrated interactions is considered. We further construct a single convolutional classifier of phase states of the ferromagnetic Ising model on square, triangular, honeycomb, and kagome lattices, trained on configurations generated by the Swendsen-Wang cluster algorithm. Computed temperature profiles of the averaged posterior probability of the high-temperature phase form clear S-shaped curves that intersect in the vicinity of the theoretical critical temperatures and allow one to determine the critical temperature for the kagome lattice without additional retraining. It is shown that convolutional models substantially reduce the root-mean-square error (RMSE) compared with fully connected architectures and efficiently capture complex correlations between thermodynamic characteristics and the structure of magnetic correlated systems.
๐
2025-12-08
๐ WSDM 2026
๐ฅ Imran Ahsan, Hyunwook Yu, Jinsung Kim, Mucheol Kim
Machine Learning
ๆฅ็ๆ่ฆ
Graph neural networks (GNNs) are increasingly used to model complex patterns in graph-structured data. However, enabling them to "forget" designated information remains challenging, especially under privacy regulations such as the GDPR. Existing unlearning methods largely optimize for efficiency and scalability, yet they offer little transparency, and the black-box nature of GNNs makes it difficult to verify whether forgetting has truly occurred. We propose an explainability-driven verifier for GNN unlearning that snapshots the model before and after deletion, using attribution shifts and localized structural changes (for example, graph edit distance) as transparent evidence. The verifier uses five explainability metrics: residual attribution, heatmap shift, explainability score deviation, graph edit distance, and a diagnostic graph rule shift. We evaluate two backbones (GCN, GAT) and four unlearning strategies (Retrain, GraphEditor, GNNDelete, IDEA) across five benchmarks (Cora, Citeseer, Pubmed, Coauthor-CS, Coauthor-Physics). Results show that Retrain and GNNDelete achieve near-complete forgetting, GraphEditor provides partial erasure, and IDEA leaves residual signals. These explanation deltas provide the primary, human-readable evidence of forgetting; we also report membership-inference ROC-AUC as a complementary, graph-wide privacy signal.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenwei Shi, Xueyu Luan
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
DreamerV3 is a state-of-the-art online model-based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perceptrons (MLPs), offering superior parameter efficiency and interpretability. To mitigate KANs' computational overhead, variants like FastKAN leverage Radial Basis Functions (RBFs) to accelerate inference. In this work, we investigate integrating KAN architectures into the DreamerV3 framework. We introduce KAN-Dreamer, replacing specific MLP and convolutional components of DreamerV3 with KAN and FastKAN layers. To ensure efficiency within the JAX-based World Model, we implement a tailored, fully vectorized version with simplified grid management. We structure our investigation into three subsystems: Visual Perception, Latent Prediction, and Behavior Learning. Empirical evaluations on the DeepMind Control Suite (walker_walk) analyze sample efficiency, training time, and asymptotic performance. Experimental results demonstrate that utilizing our adapted FastKAN as a drop-in replacement for the Reward and Continue predictors yields performance on par with the original MLP-based architecture, maintaining parity in both sample efficiency and training speed. This report serves as a preliminary study for future developments in KAN-based world models.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yezi Liu, William Youngwoo Chung, Yang Ni, Hanning Chen, Mohsen Imani
ๆฅ็ๆ่ฆ
Graph hyperdimensional computing (HDC) has emerged as a promising paradigm for cognitive tasks, emulating brain-like computation with high-dimensional vectors known as hypervectors. While HDC offers robustness and efficiency on graph-structured data, its fairness implications remain largely unexplored. In this paper, we study fairness in graph HDC, where biases in data representation and decision rules can lead to unequal treatment of different groups. We show how hypervector encoding and similarity-based classification can propagate or even amplify such biases, and we propose a fairness-aware training framework, FairGHDC, to mitigate them. FairGHDC introduces a bias correction term, derived from a gap-based demographic-parity regularizer, and converts it into a scalar fairness factor that scales the update of the class hypervector for the ground-truth label. This enables debiasing directly in the hypervector space without modifying the graph encoder or requiring backpropagation. Experimental results on six benchmark datasets demonstrate that FairGHDC substantially reduces demographic-parity and equal-opportunity gaps while maintaining accuracy comparable to standard GNNs and fairness-aware GNNs. At the same time, FairGHDC preserves the computational advantages of HDC, achieving up to about one order of magnitude ($\approx 10\times$) speedup in training time on GPU compared to GNN and fairness-aware GNN baselines.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yangle Li, Danli Luo, Haifeng Hu
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Existing methods in domain generalization for Multimodal Sentiment Analysis (MSA) often overlook inter-modal synergies during invariant features extraction, which prevents the accurate capture of the rich semantic information within multimodal data. Additionally, while knowledge injection techniques have been explored in MSA, they often suffer from fragmented cross-modal knowledge, overlooking specific representations that exist beyond the confines of unimodal. To address these limitations, we propose a novel MSA framework designed for domain generalization. Firstly, the framework incorporates a Mixture of Invariant Experts model to extract domain-invariant features, thereby enhancing the model's capacity to learn synergistic relationships between modalities. Secondly, we design a Cross-Modal Adapter to augment the semantic richness of multimodal representations through cross-modal knowledge injection. Extensive domain experiments conducted on three datasets demonstrate that the proposed MIDG achieves superior performance.
๐
2025-12-08
๐ Submitted to Nature Scientific Reports
๐ฅ Ayrat Abdullin, Umair bin Waheed, Leo Eisner, Abdullatif Al-Shuhail
Machine Learning
ๆฅ็ๆ่ฆ
Real-time monitoring of induced seismicity is crucial for mitigating operational hazards, relying on the rapid and accurate classification of microseismic events from continuous data streams. However, while many deep learning models excel at this task, their high computational requirements often limit their practical application in real-time monitoring systems. To address this limitation, a lightweight model based on the Fourier Neural Operator (FNO) is proposed for microseismic event classification, leveraging its inherent resolution-invariance and computational efficiency for waveform processing. In the STanford EArthquake Dataset (STEAD), a global and large-scale database of seismic waveforms, the FNO-based model demonstrates high effectiveness for trigger classification, with an F1 score of 95% even in the scenario of data sparsity in training. The new FNO model greatly decreases the computer power needed relative to current deep learning models without sacrificing the classification success rate measured by the F1 score. A test on a real microseismic dataset shows a classification success rate with an F1 score of 98%, outperforming many traditional deep-learning techniques. A combination of high success rate and low computational power indicates that the FNO model can serve as a methodology of choice for real-time monitoring of microseismicity for induced seismicity. The method saves computational resources and facilitates both post-processing and real-time seismic processing suitable for the implementation of traffic light systems to prevent undesired induced seismicity.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haidong Kang, Jun Du, Lihong Lin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Mixed-Precision Quantization (MPQ) liberates the Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck, which garnered increasing research attention. However, conventional methods either searched from costly differentiable optimization, which is neither efficient nor flexible, or learned a quantized DNN from the proxy (i.e., HAWQ) manually designed by human experts, which is labor-intensive and requires huge expert knowledge. Can we design a proxy without involving any human experts and training? In this paper, we provide an affirmative answer by proposing a novel Large Language Models (LLMs)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework, which reforms the design paradigm of MPQ by utilizing LLMs to find superior TAP tailored for MPQ, automatically. In addition, to bridge the gap between black-box LLMs and the tough MPQ task, we ingeniously propose simple Direct Policy Optimization (DPO) based reinforcement learning to enhance LLMs' reasoning by optimizing prompts, which can construct a positive feedback loop between the LLM and the MPQ task, enabling LLMs to generate better TAP in the next evolution. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we truly believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Giray รnรผr, Azita Dabiri, Bart De Schutter
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Effective traffic control is essential for mitigating congestion in transportation networks. Conventional traffic management strategies, including route guidance, ramp metering, and traffic signal control, often rely on state feedback controllers, used for their simplicity and reactivity; however, they lack the adaptability required to cope with complex and time-varying traffic dynamics. This paper proposes a multi-agent reinforcement learning framework in which each agent adaptively tunes the parameters of a state feedback traffic controller, combining the reactivity of state feedback controllers with the adaptability of reinforcement learning. By tuning parameters at a lower frequency rather than directly determining control actions at a high frequency, the reinforcement learning agents achieve improved training efficiency while maintaining adaptability to varying traffic conditions. The multi-agent structure further enhances system robustness, as local controllers can operate independently in the event of partial failures. The proposed framework is evaluated on a simulated multi-class transportation network under varying traffic conditions. Results show that the proposed multi-agent framework outperforms the no control and fixed-parameter state feedback control cases, while performing on par with the single-agent RL-based adaptive state feedback control, with a much better resilience to partial failures.
๐
2025-12-08
๐ ArXiv Preprint (cs.SE)
๐ฅ Francisco Ribeiro, Claudio Spiess, Prem Devanbu, Sarah Nadi
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite the effectiveness of large language models (LLMs) for code generation, they often output incorrect code. One reason is that model output probabilities are often not well-correlated with correctness, and reflect only the final output of the generation process. Inspired by findings that LLMs internally encode concepts like truthfulness, this paper explores if LLMs similarly represent code correctness. Specifically, we identify a correctness representation inside LLMs by contrasting the hidden states between pairs of correct and incorrect code for the same programming tasks. By experimenting on four LLMs, we show that exploiting this extracted correctness representation outperforms standard log-likelihood ranking, as well as verbalized model confidence. Furthermore, we explore how this internal correctness signal can be used to select higher-quality code samples, without requiring test execution. Ultimately, this work demonstrates how leveraging internal representations can enhance code generation systems and make LLMs more reliable, thus improving confidence in automatically generated code.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Giulia Lanzillotta, Damiano Meier, Thomas Hofmann
Machine Learning
ๆฅ็ๆ่ฆ
A persistent paradox in continual learning (CL) is that neural networks often retain linearly separable representations of past tasks even when their output predictions fail. We formalize this distinction as the gap between deep feature-space and shallow classifier-level forgetting. We reveal a critical asymmetry in Experience Replay: while minimal buffers successfully anchor feature geometry and prevent deep forgetting, mitigating shallow forgetting typically requires substantially larger buffer capacities. To explain this, we extend the Neural Collapse framework to the sequential setting. We characterize deep forgetting as a geometric drift toward out-of-distribution subspaces and prove that any non-zero replay fraction asymptotically guarantees the retention of linear separability. Conversely, we identify that the "strong collapse" induced by small buffers leads to rank-deficient covariances and inflated class means, effectively blinding the classifier to true population boundaries. By unifying CL with out-of-distribution detection, our work challenges the prevailing reliance on large buffers, suggesting that explicitly correcting these statistical artifacts could unlock robust performance with minimal replay.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yann Bourdin, Pierrick Legrand, Fanny Roche
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper investigates the optimization of Truncated Backpropagation Through Time (TBPTT) for training neural networks in digital audio effect modeling, with a focus on dynamic range compression. The study evaluates key TBPTT hyperparameters -- sequence number, batch size, and sequence length -- and their influence on model performance. Using a convolutional-recurrent architecture, we conduct extensive experiments across datasets with and without conditionning by user controls. Results demonstrate that carefully tuning these parameters enhances model accuracy and training stability, while also reducing computational demands. Objective evaluations confirm improved performance with optimized settings, while subjective listening tests indicate that the revised TBPTT configuration maintains high perceptual quality.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gilhyun Nam, Taewon Kim, Joonhyun Jeong, Eunho Yang
Robotics
ๆฅ็ๆ่ฆ
Test-time adaptation (TTA) enables efficient adaptation of deployed models, yet it often leads to poorly calibrated predictive uncertainty - a critical issue in high-stakes domains such as autonomous driving, finance, and healthcare. Existing calibration methods typically assume fixed models or static distributions, resulting in degraded performance under real-world, dynamic test conditions. To address these challenges, we introduce Style Invariance as a Correctness Likelihood (SICL), a framework that leverages style-invariance for robust uncertainty estimation. SICL estimates instance-wise correctness likelihood by measuring prediction consistency across style-altered variants, requiring only the model's forward pass. This makes it a plug-and-play, backpropagation-free calibration module compatible with any TTA method. Comprehensive evaluations across four baselines, five TTA methods, and two realistic scenarios with three model architecture demonstrate that SICL reduces calibration error by an average of 13 percentage points compared to conventional calibration approaches.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yezi Liu, Hanning Chen, Wenjun Huang, Yang Ni, Mohsen Imani
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) possess vast knowledge acquired from extensive training corpora, but they often cannot remove specific pieces of information when needed, which makes it hard to handle privacy, bias mitigation, and knowledge correction. Traditional model unlearning approaches require computationally expensive fine-tuning or direct weight editing, making them impractical for real-world deployment. In this work, we introduce LoRA-based Unlearning with Negative Examples (LUNE), a lightweight framework that performs negative-only unlearning by updating only low-rank adapters while freezing the backbone, thereby localizing edits and avoiding disruptive global changes. Leveraging Low-Rank Adaptation (LoRA), LUNE targets intermediate representations to suppress (or replace) requested knowledge with an order-of-magnitude lower compute and memory than full fine-tuning or direct weight editing. Extensive experiments on multiple factual unlearning tasks show that LUNE: (I) achieves effectiveness comparable to full fine-tuning and memory-editing methods, and (II) reduces computational cost by about an order of magnitude.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yezi Liu, Hanning Chen, Wenjun Huang, Yang Ni, Mohsen Imani
ๆฅ็ๆ่ฆ
Unlearning in large foundation models (e.g., LLMs) is essential for enabling dynamic knowledge updates, enforcing data deletion rights, and correcting model behavior. However, existing unlearning methods often require full-model fine-tuning or access to the original training data, which limits their scalability and practicality. In this work, we introduce Recover-to-Forget (R2F), a novel framework for efficient unlearning in LLMs based on reconstructing full-model gradient directions from low-rank LoRA adapter updates. Rather than performing backpropagation through the full model, we compute gradients with respect to LoRA parameters using multiple paraphrased prompts and train a gradient decoder to approximate the corresponding full-model gradients. To ensure applicability to larger or black-box models, the decoder is trained on a proxy model and transferred to target models. We provide a theoretical analysis of cross-model generalization and demonstrate that our method achieves effective unlearning while preserving general model performance. Experimental results demonstrate that R2F offers a scalable and lightweight alternative for unlearning in pretrained LLMs without requiring full retraining or access to internal parameters.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alexandre Rocchi--Henry, Thomas Fel, Gianni Franchi
ๆฅ็ๆ่ฆ
Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases -- such as SAE type, sparsity, or expansion ratio -- to emergence of plausible\footnote{We adopt the terminology of \citet{jacovi2020towards}, who distinguish between faithful explanations (accurately reflecting model computations) and plausible explanations (aligning with human intuition and domain knowledge). CBM concepts are plausible by construction -- selected or annotated by humans -- though not necessarily faithful to the true latent factors that organise the data manifold.} concepts. Using these metrics, we uncover a ``sweet spot'' in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concept align with plausible human concepts.
๐
2025-12-08
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Chen Gong, Zheng Liu, Kecen Li, Tianhao Wang
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recently, offline reinforcement learning (RL) has become a popular RL paradigm. In offline RL, data providers share pre-collected datasets -- either as individual transitions or sequences of transitions forming trajectories -- to enable the training of RL models (also called agents) without direct interaction with the environments. Offline RL saves interactions with environments compared to traditional RL, and has been effective in critical areas, such as navigation tasks. Meanwhile, concerns about privacy leakage from offline RL datasets have emerged.
To safeguard private information in offline RL datasets, we propose the first differential privacy (DP) offline dataset synthesis method, PrivORL, which leverages a diffusion model and diffusion transformer to synthesize transitions and trajectories, respectively, under DP. The synthetic dataset can then be securely released for downstream analysis and research. PrivORL adopts the popular approach of pre-training a synthesizer on public datasets, and then fine-tuning on sensitive datasets using DP Stochastic Gradient Descent (DP-SGD). Additionally, PrivORL introduces curiosity-driven pre-training, which uses feedback from the curiosity module to diversify the synthetic dataset and thus can generate diverse synthetic transitions and trajectories that closely resemble the sensitive dataset. Extensive experiments on five sensitive offline RL datasets show that our method achieves better utility and fidelity in both DP transition and trajectory synthesis compared to baselines. The replication package is available at the GitHub repository.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Paul Wilsens, Katrien Antonio, Gerda Claeskens
Machine Learning
ๆฅ็ๆ่ฆ
Decision making often occurs in the presence of incomplete information, leading to the under- or overestimation of risk. Leveraging the observable information to learn the complete information is called nowcasting. In practice, incomplete information is often a consequence of reporting or observation delays. In this paper, we propose an expectation-maximisation (EM) framework for nowcasting that uses machine learning techniques to model both the occurrence as well as the reporting process of events. We allow for the inclusion of covariate information specific to the occurrence and reporting periods as well as characteristics related to the entity for which events occurred. We demonstrate how the maximisation step and the information flow between EM iterations can be tailored to leverage the predictive power of neural networks and (extreme) gradient boosting machines (XGBoost). With simulation experiments, we show that we can effectively model both the occurrence and reporting of events when dealing with high-dimensional covariate information. In the presence of non-linear effects, we show that our methodology outperforms existing EM-based nowcasting frameworks that use generalised linear models in the maximisation step. Finally, we apply the framework to the reporting of Argentinian Covid-19 cases, where the XGBoost-based approach again is most performant.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhengquan Luo, Guy Tadmor, Or Amar, David Zeevi, Zhiqiang Xu
Machine Learning
ๆฅ็ๆ่ฆ
Knowledge graph embedding (KGE) relies on the geometry of the embedding space to encode semantic and structural relations. Existing methods place all entities on one homogeneous manifold, Euclidean, spherical, hyperbolic, or their product/multi-curvature variants, to model linear, symmetric, or hierarchical patterns. Yet a predefined, homogeneous manifold cannot accommodate the sharply varying curvature that real-world graphs exhibit across local regions. Since this geometry is imposed a priori, any mismatch with the knowledge graph's local curvatures will distort distances between entities and hurt the expressiveness of the resulting KGE. To rectify this, we propose RicciKGE to have the KGE loss gradient coupled with local curvatures in an extended Ricci flow such that entity embeddings co-evolve dynamically with the underlying manifold geometry towards mutual adaptation. Theoretically, when the coupling coefficient is bounded and properly selected, we rigorously prove that i) all the edge-wise curvatures decay exponentially, meaning that the manifold is driven toward the Euclidean flatness; and ii) the KGE distances strictly converge to a global optimum, which indicates that geometric flattening and embedding optimization are promoting each other. Experimental improvements on link prediction and node classification benchmarks demonstrate RicciKGE's effectiveness in adapting to heterogeneous knowledge graph structures.
๐
2025-12-08
๐ ArXiv Preprint (physics.chem-ph)
๐ฅ Rohit Goswami
Machine Learning
ๆฅ็ๆ่ฆ
Transition state or minimum energy path finding methods constitute a routine component of the computational chemistry toolkit. Standard analysis involves trajectories conventionally plotted in terms of the relative energy to the initial state against a cumulative displacement variable, or the image number. These dimensional reductions obscure structural rearrangements in high dimensions and may often be trajectory dependent. This precludes the ability to compare optimization trajectories of different methods beyond the number of calculations, time taken, and final saddle geometry. We present a method mapping trajectories onto a two-dimension surface defined by a permutation corrected root mean square deviation from the reactant and product configurations. Energy is represented as an interpolated color-mapped surface constructed from all optimization steps using radial basis functions. This representation highlights optimization trajectories, identifies endpoint basins, and diagnoses convergence concerns invisible in one-dimensional profiles. We validate the framework on a cycloaddition reaction, showing that a machine-learned potential saddle and density functional theory reference lie on comparable energy contours despite geometric displacements.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuxiao Luo, Songming Zhang, Sijie Ruan, Siran Chen, Kang Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Modeling human mobility is vital for extensive applications such as transportation planning and epidemic modeling. With the rise of the Artificial Intelligence Generated Content (AIGC) paradigm, recent works explore synthetic trajectory generation using autoregressive and diffusion models. While these methods show promise for generating single-day trajectories, they remain limited by inefficiencies in long-term generation (e.g., weekly trajectories) and a lack of explicit spatiotemporal multi-scale modeling. This study proposes Multi-Scale Spatio-Temporal AutoRegression (M-STAR), a new framework that generates long-term trajectories through a coarse-to-fine spatiotemporal prediction process. M-STAR combines a Multi-scale Spatiotemporal Tokenizer that encodes hierarchical mobility patterns with a Transformer-based decoder for next-scale autoregressive prediction. Experiments on two real-world datasets show that M-STAR outperforms existing methods in fidelity and significantly improves generation speed. The data and codes are available at https://github.com/YuxiaoLuo0013/M-STAR.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bosun Kang, Hyejun Park, Chenglin Fan
ๆฅ็ๆ่ฆ
We revisit the classic ski rental problem through the lens of Bayesian decision-making and machine-learned predictions. While traditional algorithms minimize worst-case cost without assumptions, and recent learning-augmented approaches leverage noisy forecasts with robustness guarantees, our work unifies these perspectives. We propose a discrete Bayesian framework that maintains exact posterior distributions over the time horizon, enabling principled uncertainty quantification and seamless incorporation of expert priors. Our algorithm achieves prior-dependent competitive guarantees and gracefully interpolates between worst-case and fully-informed settings. Our extensive experimental evaluation demonstrates superior empirical performance across diverse scenarios, achieving near-optimal results under accurate priors while maintaining robust worst-case guarantees. This framework naturally extends to incorporate multiple predictions, non-uniform priors, and contextual information, highlighting the practical advantages of Bayesian reasoning in online decision problems with imperfect predictions.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrei V. Konstantinov, Valerii A. Zuev, Lev V. Utkin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Deep learning models for tabular data typically do not allow for imposing a graph of external dependencies between samples, which can be useful for accounting for relatedness in tasks such as treatment effect estimation. Graph neural networks only consider adjacent nodes, making them difficult to apply to sparse graphs. This paper proposes several solutions based on a modified attention mechanism, which accounts for possible relationships between data points by adding a term to the attention matrix. Our models are compared with each other and the gradient boosting decision trees in a regression task on synthetic and real-world datasets, as well as in a treatment effect estimation task on the IHDP dataset.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Thierry Petit, Arnault Pachot
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce a constraint-programming framework for generating synthetic populations that reproduce target statistics with high precision while enforcing full individual consistency. Unlike data-driven approaches that infer distributions from samples, our method directly encodes aggregated statistics and structural relations, enabling exact control of demographic profiles without requiring any microdata. We validate the approach on official demographic sources and study the impact of distributional deviations on downstream analyses. This work is conducted within the Pollitics project developed by Emotia, where synthetic populations can be queried through large language models to model societal behaviors, explore market and policy scenarios, and provide reproducible decision-grade insights without personal data.
๐
2025-12-08
๐ ICML 2024
๐ฅ Peijia Lin, Pin Chen, Rui Jiao, Qing Mo, Jianhuan Cen et al.
Machine Learning
ๆฅ็ๆ่ฆ
In addressing the challenge of Crystal Structure Prediction (CSP), symmetry-aware deep learning models, particularly diffusion models, have been extensively studied, which treat CSP as a conditional generation task. However, ensuring permutation, rotation, and periodic translation equivariance during diffusion process remains incompletely addressed. In this work, we propose EquiCSP, a novel equivariant diffusion-based generative model. We not only address the overlooked issue of lattice permutation equivariance in existing models, but also develop a unique noising algorithm that rigorously maintains periodic translation equivariance throughout both training and inference processes. Our experiments indicate that EquiCSP significantly surpasses existing models in terms of generating accurate structures and demonstrates faster convergence during the training process.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sijia Li, Yuchen Huang, Zifan Liu, Zijian Li, Jingjing fu et al.
ๆฅ็ๆ่ฆ
Despite impressive advances in agent systems, multi-turn tool-use scenarios remain challenging. It is mainly because intent is clarified progressively and the environment evolves with each tool call. While reusing past experience is natural, current LLM agents either treat entire trajectories or pre-defined subtasks as indivisible units, or solely exploit tool-to-tool dependencies, hindering adaptation as states and information evolve across turns. In this paper, we propose a State Integrated Tool Graph (SIT-Graph), which enhances multi-turn tool use by exploiting partially overlapping experience. Inspired by human decision-making that integrates episodic and procedural memory, SIT-Graph captures both compact state representations (episodic-like fragments) and tool-to-tool dependencies (procedural-like routines) from historical trajectories. Specifically, we first build a tool graph from accumulated tool-use sequences, and then augment each edge with a compact state summary of the dialog and tool history that may shape the next action. At inference time, SIT-Graph enables a human-like balance between episodic recall and procedural execution: when the next decision requires recalling prior context, the agent retrieves the state summaries stored on relevant edges and uses them to guide its next action; when the step is routine, it follows high-confidence tool dependencies without explicit recall. Experiments across multiple stateful multi-turn tool-use benchmarks show that SIT-Graph consistently outperforms strong memory- and graph-based baselines, delivering more robust tool selection and more effective experience transfer.
๐
2025-12-08
๐ ArXiv Preprint (eess.SP)
๐ฅ Shreyas Jayant Grampurohit, Satish Mulleti, Ajit Rajwade
Machine Learning
ๆฅ็ๆ่ฆ
We present a neural network-based framework for solving the quantitative group testing (QGT) problem that achieves both high decoding accuracy and structural verifiability. In QGT, the objective is to identify a small subset of defective items among $N$ candidates using only $M \ll N$ pooled tests, each reporting the number of defectives in the tested subset. We train a multi-layer perceptron to map noisy measurement vectors to binary defect indicators, achieving accurate and robust recovery even under sparse, bounded perturbations. Beyond accuracy, we show that the trained network implicitly learns the underlying pooling structure that links items to tests, allowing this structure to be recovered directly from the network's Jacobian. This indicates that the model does not merely memorize training patterns but internalizes the true combinatorial relationships governing QGT. Our findings reveal that standard feedforward architectures can learn verifiable inverse mappings in structured combinatorial recovery problems.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mike Diessner, Yannick Tarant
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Virtual representations of physical critical infrastructures, such as water or energy plants, are used for simulations and digital twins to ensure resilience and continuity of their services. These models usually require 3D point clouds from laser scanners that are expensive to acquire and require specialist knowledge to use. In this article, we present a graph generation pipeline based on photogrammetry. The pipeline detects relevant objects and predicts their relation using RGB images and depth data generated by a stereo camera. This more cost-effective approach uses deep learning for object detection and instance segmentation of the objects, and employs user-defined heuristics or rules to infer their relations. Results of two hydraulic systems show that this strategy can produce graphs close to the ground truth while its flexibility allows the method to be tailored to specific applications and its transparency qualifies it to be used in the high stakes decision-making that is required for critical infrastructures.
๐
2025-12-08
๐ ArXiv Preprint (eess.SP)
๐ฅ Samuel Rey, Gonzalo Mateos
Machine Learning
ๆฅ็ๆ่ฆ
This work aims to learn the directed acyclic graph (DAG) that captures the instantaneous dependencies underlying a multivariate time series. The observed data follow a linear structural vector autoregressive model (SVARM) with both instantaneous and time-lagged dependencies, where the instantaneous structure is modeled by a DAG to reflect potential causal relationships. While recent continuous relaxation approaches impose acyclicity through smooth constraint functions involving powers of the adjacency matrix, they lead to non-convex optimization problems that are challenging to solve. In contrast, we assume that the underlying DAG has only non-negative edge weights, and leverage this additional structure to impose acyclicity via a convex constraint. This enables us to cast the problem of non-negative DAG recovery from multivariate time-series data as a convex optimization problem in abstract form, which we solve using the method of multipliers. Crucially, the convex formulation guarantees global optimality of the solution. Finally, we assess the performance of the proposed method on synthetic time-series data, where it outperforms existing alternatives.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jingran Yang, Min Zhang, Lingfeng Zhang, Zhaohui Wang, Yonggang Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Because machine learning has significantly improved efficiency and convenience in the society, it's increasingly used to assist or replace human decision-making. However, the data-based pattern makes related algorithms learn and even exacerbate potential bias in samples, resulting in discriminatory decisions against certain unprivileged groups, depriving them of the rights to equal treatment, thus damaging the social well-being and hindering the development of related applications. Therefore, we propose a pre-processing method IFFair based on the influence function. Compared with other fairness optimization approaches, IFFair only uses the influence disparity of training samples on different groups as a guidance to dynamically adjust the sample weights during training without modifying the network structure, data features and decision boundaries. To evaluate the validity of IFFair, we conduct experiments on multiple real-world datasets and metrics. The experimental results show that our approach mitigates bias of multiple accepted metrics in the classification setting, including demographic parity, equalized odds, equality of opportunity and error rate parity without conflicts. It also demonstrates that IFFair achieves better trade-off between multiple utility and fairness metrics compared with previous pre-processing methods.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziming Hong, Tianyu Huang, Runnan Chen, Shanshan Ye, Mingming Gong et al.
Robotics
ๆฅ็ๆ่ฆ
Recent studies have extended diffusion-based instruction-driven 2D image editing pipelines to 3D Gaussian Splatting (3DGS), enabling faithful manipulation of 3DGS assets and greatly advancing 3DGS content creation. However, it also exposes these assets to serious risks of unauthorized editing and malicious tampering. Although imperceptible adversarial perturbations against diffusion models have proven effective for protecting 2D images, applying them to 3DGS encounters two major challenges: view-generalizable protection and balancing invisibility with protection capability. In this work, we propose the first editing safeguard for 3DGS, termed AdLift, which prevents instruction-driven editing across arbitrary views and dimensions by lifting strictly bounded 2D adversarial perturbations into 3D Gaussian-represented safeguard. To ensure both adversarial perturbations effectiveness and invisibility, these safeguard Gaussians are progressively optimized across training views using a tailored Lifted PGD, which first conducts gradient truncation during back-propagation from the editing model at the rendered image and applies projected gradients to strictly constrain the image-level perturbation. Then, the resulting perturbation is backpropagated to the safeguard Gaussian parameters via an image-to-Gaussian fitting operation. We alternate between gradient truncation and image-to-Gaussian fitting, yielding consistent adversarial-based protection performance across different viewpoints and generalizes to novel views. Empirically, qualitative and quantitative results demonstrate that AdLift effectively protects against state-of-the-art instruction-driven 2D image and 3DGS editing.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Elizaveta Kovtun, Maksim Makarenko, Natalia Semenova, Alexey Zaytsev, Semen Budennyy
Machine Learning
ๆฅ็ๆ่ฆ
A graph with semantically attributed nodes are a common data structure in a wide range of domains. It could be interlinked web data or citation networks of scientific publications. The essential problem for such a data type is to determine nodes that carry greater importance than all the others, a task that markedly enhances system monitoring and management. Traditional methods to identify important nodes in networks introduce centrality measures, such as node degree or more complex PageRank. However, they consider only the network structure, neglecting the rich node attributes. Recent methods adopt neural networks capable of handling node features, but they require supervision. This work addresses the identified gap--the absence of approaches that are both unsupervised and attribute-aware--by introducing a Pipeline for Important Node Exploration (PINE). At the core of the proposed framework is an attention-based graph model that incorporates node semantic features in the learning process of identifying the structural graph properties. The PINE's node importance scores leverage the obtained attention distribution. We demonstrate the superior performance of the proposed PINE method on various homogeneous and heterogeneous attributed networks. As an industry-implemented system, PINE tackles the real-world challenge of unsupervised identification of key entities within large-scale enterprise graphs.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hengyang Yao, Lin Li, Ke Sun, Jianing Qiu, Huiping Chen
Machine Learning
ๆฅ็ๆ่ฆ
DeepFake face swapping enables highly realistic identity forgeries, posing serious privacy and security risks. A common defence embeds invisible perturbations into images, but these are fragile and often destroyed by basic transformations such as compression or resizing. In this paper, we first conduct a systematic analysis of 30 transformations across six categories and show that protection robustness is highly sensitive to the choice of training transformations, making the standard Expectation over Transformation (EOT) with uniform sampling fundamentally suboptimal. Motivated by this, we propose Expectation Over Learned distribution of Transformation (EOLT), the framework to treat transformation distribution as a learnable component rather than a fixed design choice. Specifically, EOLT employs a policy network that learns to automatically prioritize critical transformations and adaptively generate instance-specific perturbations via reinforcement learning, enabling explicit modeling of defensive bottlenecks while maintaining broad transferability. Extensive experiments demonstrate that our method achieves substantial improvements over state-of-the-art approaches, with 26% higher average robustness and up to 30% gains on challenging transformation categories.
๐
2025-12-08
๐ ArXiv Preprint (eess.IV)
๐ฅ Tianyi Ren, Daniel Low, Pittra Jaengprajak, Juampablo Heras Rivera, Jacob Ruzevick et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Segmentation is the identification of anatomical regions of interest, such as organs, tissue, and lesions, serving as a fundamental task in computer-aided diagnosis in medical imaging. Although deep learning models have achieved remarkable performance in medical image segmentation, the need for explainability remains critical for ensuring their acceptance and integration in clinical practice, despite the growing research attention in this area. Our approach explored the use of contrast-level Shapley values, a systematic perturbation of model inputs to assess feature importance. While other studies have investigated gradient-based techniques through identifying influential regions in imaging inputs, Shapley values offer a broader, clinically aligned approach, explaining how model performance is fairly attributed to certain imaging contrasts over others. Using the BraTS 2024 dataset, we generated rankings for Shapley values for four MRI contrasts across four model architectures. Two metrics were proposed from the Shapley ranking: agreement between model and ``clinician" imaging ranking, and uncertainty quantified through Shapley ranking variance across cross-validation folds. Higher-performing cases (Dice \textgreater0.6) showed significantly greater agreement with clinical rankings. Increased Shapley ranking variance correlated with decreased performance (U-Net: $r=-0.581$). These metrics provide clinically interpretable proxies for model reliability, helping clinicians better understand state-of-the-art segmentation models.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
To address the trade-off between robustness and performance for robust VLM, we observe that function words could incur vulnerability of VLMs against cross-modal adversarial attacks, and propose Function-word De-Attention (FDA) accordingly to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, as well as in-depth ablation studies and analysis. Code will be made publicly at https://github.com/michaeltian108/FDA.
๐
2025-12-08
๐ AAAI 2026
๐ฅ Feng Liang, Weixin Zeng, Runhao Zhao, Xiang Zhao
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, temporal reasoning, particularly under complex temporal constraints, remains a major challenge. To this end, existing approaches have explored symbolic methods, which encode temporal structure explicitly, and reflective mechanisms, which revise reasoning errors through multi-step inference. Nonetheless, symbolic approaches often underutilize the reasoning capabilities of LLMs, while reflective methods typically lack structured temporal representations, which can result in inconsistent or hallucinated reasoning. As a result, even when the correct temporal context is available, LLMs may still misinterpret or misapply time-related information, leading to incomplete or inaccurate answers. To address these limitations, in this work, we propose Neuro-Symbolic Temporal Reasoning (NeSTR), a novel framework that integrates structured symbolic representations with hybrid reflective reasoning to enhance the temporal sensitivity of LLM inference. NeSTR preserves explicit temporal relations through symbolic encoding, enforces logical consistency via verification, and corrects flawed inferences using abductive reflection. Extensive experiments on diverse temporal question answering benchmarks demonstrate that NeSTR achieves superior zero-shot performance and consistently improves temporal reasoning without any fine-tuning, showcasing the advantage of neuro-symbolic integration in enhancing temporal understanding in large language models.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Bin Wu, Feifan Yang, Zhangming Chan, Yu-Ran Gu, Jiawei Feng et al.
Multimodal
ๆฅ็ๆ่ฆ
Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features, suffering from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), they often neglect multimodal integration in the fine-grained modeling stage -- the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its full potential. Guided by these principles, we propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in Taobao display advertising system, enabling 100K-length user behavior sequence modeling and delivering significant gains in top-line metrics with negligible online latency overhead. To foster community research, we share industrial deployment practices and open-source the first large-scale dataset featuring ultra-long behavior sequences paired with high-quality multimodal embeddings. Our code and data is available at https://taobao-mm.github.io.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhaoyang Liu, Mokai Pan, Zhongyi Wang, Kaizhen Zhu, Haotao Lu et al.
Robotics
ๆฅ็ๆ่ฆ
Imitation learning with diffusion models has advanced robotic control by capturing multi-modal action distributions. However, existing approaches typically treat observations as high-level conditioning inputs to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, sampling must begin from random Gaussian noise, weakening the coupling between perception and control and often yielding suboptimal performance. We introduce BridgePolicy, a generative visuomotor policy that explicitly embeds observations within the stochastic differential equation via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich, informative prior rather than random noise, substantially improving precision and reliability in control. A key challenge is that classical diffusion bridges connect distributions with matched dimensionality, whereas robotic observations are heterogeneous and multi-modal and do not naturally align with the action space. To address this, we design a multi-modal fusion module and a semantic aligner that unify visual and state inputs and align observation and action representations, making the bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and five real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies.
๐
2025-12-08
๐ ArXiv Preprint (Multimedia)
๐ฅ Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
ๆฅ็ๆ่ฆ
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fei Luo, Ziwei Zhao, Mingxuan Wang, Duoyang Li, Zhe Qian et al.
ๆฅ็ๆ่ฆ
Federated Prompt Learning (FPL) offers a parameter-efficient solution for collaboratively training large models, but its performance is severely hindered by data heterogeneity, which causes locally trained prompts to become biased. Existing methods, focusing on aggregation or regularization, fail to address this root cause of local training bias. To this end, we propose Geometry-Guided Text Prompt Calibration (GGTPC), a novel framework that directly corrects this bias by providing clients with a global geometric prior. This prior, representing the shape of the global data distribution derived from the covariance matrix, is reconstructed on the server in a privacy-preserving manner. Clients then use a novel Geometry-Prior Calibration Layer (GPCL) to align their local feature distributions with this global prior during training. Extensive experiments show GGTPC's effectiveness. On the label-skewed CIFAR-100 dataset ($ฮฒ$=0.1), it outperforms the state-of-the-art by 2.15\%. Under extreme skew ($ฮฒ$=0.01), it improves upon the baseline by 9.17\%. Furthermore, as a plug-and-play module on the domain-skewed Office-Home dataset, it boosts FedAvg's performance by 4.60\%. These results demonstrate that GGTPC effectively mitigates data heterogeneity by correcting the fundamental local training bias, serving as a versatile module to enhance various FL algorithms.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Boyang Pan, Zeyu Zhang, Hongyu Meng, Bin Cui, Yingying Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Purpose: To develop a fully automated deep learning system, AutoLugano, for end-to-end lymphoma classification by performing lesion segmentation, anatomical localization, and automated Lugano staging from baseline FDG-PET/CT scans. Methods: The AutoLugano system processes baseline FDG-PET/CT scans through three sequential modules:(1) Anatomy-Informed Lesion Segmentation, a 3D nnU-Net model, trained on multi-channel inputs, performs automated lesion detection (2) Atlas-based Anatomical Localization, which leverages the TotalSegmentator toolkit to map segmented lesions to 21 predefined lymph node regions using deterministic anatomical rules; and (3) Automated Lugano Staging, where the spatial distribution of involved regions is translated into Lugano stages and therapeutic groups (Limited vs. Advanced Stage).The system was trained on the public autoPET dataset (n=1,007) and externally validated on an independent cohort of 67 patients. Performance was assessed using accuracy, sensitivity, specificity, F1-scorefor regional involvement detection and staging agreement. Results: On the external validation set, the proposed model demonstrated robust performance, achieving an overall accuracy of 88.31%, sensitivity of 74.47%, Specificity of 94.21% and an F1-score of 80.80% for regional involvement detection,outperforming baseline models. Most notably, for the critical clinical task of therapeutic stratification (Limited vs. Advanced Stage), the system achieved a high accuracy of 85.07%, with a specificity of 90.48% and a sensitivity of 82.61%.Conclusion: AutoLugano represents the first fully automated, end-to-end pipeline that translates a single baseline FDG-PET/CT scan into a complete Lugano stage. This study demonstrates its strong potential to assist in initial staging, treatment stratification, and supporting clinical decision-making.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng Yu
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion models have achieved remarkable performance in generative modeling, yet their theoretical foundations are often intricate, and the gap between mathematical formulations in papers and practical open-source implementations can be difficult to bridge. Existing tutorials primarily focus on deriving equations, offering limited guidance on how diffusion models actually operate in code. To address this, we present a concise implementation of approximately 300 lines that explains diffusion models from a code-execution perspective. Our minimal example preserves the essential components -- including forward diffusion, reverse sampling, the noise-prediction network, and the training loop -- while removing unnecessary engineering details. This technical report aims to provide researchers with a clear, implementation-first understanding of how diffusion models work in practice and how code and theory correspond. Our code and pre-trained models are available at: https://github.com/disanda/GM/tree/main/DDPM-DDIM-ClassifierFree.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhen Huang, Jiaxin Deng, Jiayu Xu, Junbiao Pang, Haitao Yu
Machine Learning
ๆฅ็ๆ่ฆ
In bus arrival time prediction, the process of organizing road infrastructure network data into homogeneous entities is known as segmentation. Segmenting a road network is widely recognized as the first and most critical step in developing an arrival time prediction system, particularly for auto-regressive-based approaches. Traditional methods typically employ a uniform segmentation strategy, which fails to account for varying physical constraints along roads, such as road conditions, intersections, and points of interest, thereby limiting prediction efficiency. In this paper, we propose a Reinforcement Learning (RL)-based approach to efficiently and adaptively learn non-uniform road segments for arrival time prediction. Our method decouples the prediction process into two stages: 1) Non-uniform road segments are extracted based on their impact scores using the proposed RL framework; and 2) A linear prediction model is applied to the selected segments to make predictions. This method ensures optimal segment selection while maintaining computational efficiency, offering a significant improvement over traditional uniform approaches. Furthermore, our experimental results suggest that the linear approach can even achieve better performance than more complex methods. Extensive experiments demonstrate the superiority of the proposed method, which not only enhances efficiency but also improves learning performance on large-scale benchmarks. The dataset and the code are publicly accessible at: https://github.com/pangjunbiao/Less-is-More.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Da Zhang, Bingyu Li, Zhuyuan Zhao, Junyu Gao, Feiping Nie et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
As multimodal data proliferates across diverse real-world applications, leveraging heterogeneous information such as texts and timestamps for accurate time series forecasting (TSF) has become a critical challenge. While diffusion models demonstrate exceptional performance in generation tasks, their application to TSF remains largely confined to modeling single-modality numerical sequences, overlooking the abundant cross-modal signals inherent in complex heterogeneous data. To address this gap, we propose UniDiff, a unified diffusion framework for multimodal time series forecasting. To process the numerical sequence, our framework first tokenizes the time series into patches, preserving local temporal dynamics by mapping each patch to an embedding space via a lightweight MLP. At its core lies a unified and parallel fusion module, where a single cross-attention mechanism adaptively weighs and integrates structural information from timestamps and semantic context from texts in one step, enabling a flexible and efficient interplay between modalities. Furthermore, we introduce a novel classifier-free guidance mechanism designed for multi-source conditioning, allowing for decoupled control over the guidance strength of textual and temporal information during inference, which significantly enhances model robustness. Extensive experiments on real-world benchmark datasets across eight domains demonstrate that the proposed UniDiff model achieves state-of-the-art performance.
๐
2025-12-08
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Latifa Dwiyanti, Sergio Ryan Wibisono, Hidetaka Nambo
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Explainable Artificial Intelligence (XAI) has become an increasingly important area of research, particularly as machine learning models are deployed in high-stakes domains. Among various XAI approaches, SHAP (SHapley Additive exPlanations) has gained prominence due to its ability to provide both global and local explanations across different machine learning models. While SHAP effectively visualizes feature importance, it often lacks contextual explanations that are meaningful for end-users, especially those without technical backgrounds. To address this gap, we propose a Python package that extends SHAP by integrating it with a large language model (LLM), specifically OpenAI's GPT, to generate contextualized textual explanations. This integration is guided by user-defined parameters (such as feature aliases, descriptions, and additional background) to tailor the explanation to both the model context and the user perspective. We hypothesize that this enhancement can improve the perceived understandability of SHAP explanations. To evaluate the effectiveness of the proposed package, we applied it in a healthcare-related case study and conducted user evaluations involving real end-users. The results, based on Likert-scale surveys and follow-up interviews, indicate that the generated explanations were perceived as more understandable and contextually appropriate compared to visual-only outputs. While the findings are preliminary, they suggest that combining visualization with contextualized text may support more user-friendly and trustworthy model explanations.
๐
2025-12-08
๐ NeurIPS 2025
๐ฅ Yibo Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Self-play fine-tuning has demonstrated promising abilities in adapting large language models (LLMs) to downstream tasks with limited real-world data. The basic principle is to iteratively refine the model with real samples and synthetic ones generated from itself. However, the existing methods primarily focus on the relative gaps between the rewards for two types of data, neglecting their absolute values. Through theoretical analysis, we identify that the gap-based methods suffer from unstable evolution, due to the potentially degenerated objectives. To address this limitation, we introduce a novel self-play fine-tuning method, namely Self-PlAy via Noise Contrastive Estimation (SPACE), which leverages noise contrastive estimation to capture the real-world data distribution. Specifically, SPACE treats synthetic samples as auxiliary components, and discriminates them from the real ones in a binary classification manner. As a result, SPACE independently optimizes the absolute reward values for each type of data, ensuring a consistently meaningful objective and thereby avoiding the instability issue. Theoretically, we show that the optimal solution of the objective in SPACE aligns with the underlying distribution of real-world data, and SPACE guarantees a provably stable convergence to the optimal distribution. Empirically, we show that SPACE significantly improves the performance of LLMs over various tasks, and outperforms supervised fine-tuning that employs much more real-world samples. Compared to gap-based self-play fine-tuning methods, SPACE exhibits remarkable superiority and stable evolution.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
๐
2025-12-08
๐ NeurIPS 2025
๐ฅ Georgios Ioannides, Christos Constantinou, Aman Chadha, Aaron Elkins, Linsey Pang et al.
ๆฅ็ๆ่ฆ
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage~1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage~2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5~Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
๐
2025-12-08
๐ ArXiv Preprint (q-fin.CP)
๐ฅ Kieran A. Malandain, Selim Kalici, Hakob Chakhoyan
Machine Learning
ๆฅ็ๆ่ฆ
Real-time calibration of stochastic volatility models (SVMs) is computationally bottlenecked by the need to repeatedly solve coupled partial differential equations (PDEs). In this work, we propose DeepSVM, a physics-informed Deep Operator Network (PI-DeepONet) designed to learn the solution operator of the Heston model across its entire parameter space. Unlike standard data-driven deep learning (DL) approaches, DeepSVM requires no labelled training data. Rather, we employ a hard-constrained ansatz that enforces terminal payoffs and static no-arbitrage conditions by design. Furthermore, we use Residual-based Adaptive Refinement (RAR) to stabilize training in difficult regions subject to high gradients. Overall, DeepSVM achieves a final training loss of $10^{-5}$ and predicts highly accurate option prices across a range of typical market dynamics. While pricing accuracy is high, we find that the model's derivatives (Greeks) exhibit noise in the at-the-money (ATM) regime, highlighting the specific need for higher-order regularization in physics-informed operator learning.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jonghyun Park, Jong Chul Ye
Machine Learning
ๆฅ็ๆ่ฆ
Deep generative models have become powerful priors for solving inverse problems, and various training-free methods have been developed. However, when applied to latent flow models, existing methods often fail to converge to the posterior mode or suffer from manifold deviation within latent spaces. To mitigate this, here we introduce a novel training-free framework, FlowLPS, that solves inverse problems with pretrained flow models via a Langevin Proximal Sampling (LPS) strategy. Our method integrates Langevin dynamics for manifold-consistent exploration with proximal optimization for precise mode seeking, achieving a superior balance between reconstruction fidelity and perceptual quality across multiple inverse tasks on FFHQ and DIV2K, outperforming state of the art inverse solvers.
๐
2025-12-08
๐ This work plans to be submitted to the IEEE for po
๐ฅ Tanay Arora, Christof Teuscher
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The Lottery Ticket Hypothesis asserts the existence of highly sparse, trainable subnetworks ('winning tickets') within dense, randomly initialized neural networks. However, state-of-the-art methods of drawing these tickets, like Lottery Ticket Rewinding (LTR), are computationally prohibitive, while more efficient saliency-based Pruning-at-Initialization (PaI) techniques suffer from a significant accuracy-sparsity trade-off and fail basic sanity checks. In this work, we argue that PaI's reliance on first-order saliency metrics, which ignore inter-weight dependencies, contributes substantially to this performance gap, especially in the sparse regime. To address this, we introduce Concrete Ticket Search (CTS), an algorithm that frames subnetwork discovery as a holistic combinatorial optimization problem. By leveraging a Concrete relaxation of the discrete search space and a novel gradient balancing scheme (GRADBALANCE) to control sparsity, CTS efficiently identifies high-performing subnetworks near initialization without requiring sensitive hyperparameter tuning. Motivated by recent works on lottery ticket training dynamics, we further propose a knowledge distillation-inspired family of pruning objectives, finding that minimizing the reverse Kullback-Leibler divergence between sparse and dense network outputs (CTS-KL) is particularly effective. Experiments on varying image classification tasks show that CTS produces subnetworks that robustly pass sanity checks and achieve accuracy comparable to or exceeding LTR, while requiring only a small fraction of the computation. For example, on ResNet-20 on CIFAR10, it reaches 99.3% sparsity with 74.0% accuracy in 7.9 minutes, while LTR attains the same sparsity with 68.3% accuracy in 95.2 minutes. CTS's subnetworks outperform saliency-based methods across all sparsities, but its advantage over LTR is most pronounced in the highly sparse regime.
๐
2025-12-08
๐ ArXiv Preprint (cs.DS)
๐ฅ J. Allagan, G. Morgan, S. Langley, R. Lopez-Bonilla, V. Deriglazov
ๆฅ็ๆ่ฆ
We establish closed-form enumeration formulas for chromatic feature vectors of 2-trees under the bichromatic triangle constraint. These efficiently computable structural features derive from constrained graph colorings where each triangle uses exactly two colors, forbidding monochromatic and rainbow triangles, a constraint arising in distributed systems where components avoid complete concentration or isolation. For theta graphs Theta_n, we prove r_k(Theta_n) = S(n-2, k-1) for k >= 3 (Stirling numbers of the second kind) and r_2(Theta_n) = 2^(n-2) + 1, computable in O(n) time. For fan graphs Phi_n, we establish r_2(Phi_n) = F_{n+1} (Fibonacci numbers) and derive explicit formulas r_k(Phi_n) = sum_{t=k-1}^{n-1} a_{n-1,t} * S(t, k-1) with efficiently computable binomial coefficients, achieving O(n^2) computation per component. Unlike classical chromatic polynomials, which assign identical features to all n-vertex 2-trees, bichromatic constraints provide informative structural features. While not complete graph invariants, these features capture meaningful structural properties through connections to Fibonacci polynomials, Bell numbers, and independent set enumeration. Applications include Byzantine fault tolerance in hierarchical networks, VM allocation in cloud computing, and secret-sharing protocols in distributed cryptography.
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kepeng Lin, Qizhe Zhang, Rui Wang, Xuehai Hu, Wei Xu
Natural Language Processing
ๆฅ็ๆ่ฆ
Understanding the underlying linguistic rules of plant genomes remains a fundamental challenge in computational biology. Recent advances including AgroNT and PDLLMs have made notable progress although, they suffer from excessive parameter size and limited ability to model the bidirectional nature of DNA strands respectively. To address these limitations, we propose PlantBiMoE, a lightweight and expressive plant genome language model that integrates bidirectional Mamba and a Sparse Mixture-of-Experts (SparseMoE) framework. The bidirectional Mamba enables the model to effectively capture structural dependencies across both the forward and reverse DNA strands, while SparseMoE significantly reduces the number of active parameters, improving computational efficiency without sacrificing modeling capacity. We evaluated and tested our model on the Modified Plants Genome Benchmark (MPGB), an enhanced genomic benchmark, which consolidates 31 datasets across 11 representative tasks, with input sequence lengths ranging from 50 to 6,000 bp. Experimental results demonstrate that PlantBiMoE achieves the best performance on 20 out of 31 datasets and the average best when comparing with existing models. In summary, all above results demonstrate that our model can effectively represent plant genomic sequences, serving as a robust computational tool for diverse genomic tasks, while making substantive contributions to plant genomics, gene editing, and synthetic biology. The code is available at: https://github.com/HUST-Keep-Lin/PlantBiMoE
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM reduces total training memory by approximately 50\%, eliminates up to 90\% of optimizer state memory overhead, and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Fuyuan Lyu, Zhentai Chen, Jingyan Jiang, Lingjie Li, Xing Tang et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Inspired by the success of language models (LM), scaling up deep learning recommendation systems (DLRS) has become a recent trend in the community. All previous methods tend to scale up the model parameters during training time. However, how to efficiently utilize and scale up computational resources during test time remains underexplored, which can prove to be a scaling-efficient approach and bring orthogonal improvements in LM domains. The key point in applying test-time scaling to DLRS lies in effectively generating diverse yet meaningful outputs for the same instance. We propose two ways: One is to explore the heterogeneity of different model architectures. The other is to utilize the randomness of model initialization under a homogeneous architecture. The evaluation is conducted across eight models, including both classic and SOTA models, on three benchmarks. Sufficient evidence proves the effectiveness of both solutions. We further prove that under the same inference budget, test-time scaling can outperform parameter scaling. Our test-time scaling can also be seamlessly accelerated with the increase in parallel servers when deployed online, without affecting the inference time on the user side. Code is available.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Clarisse Bardiot, Pierre-Carl Langlais, Bernard Jacquemin, Jacob Hart, Antonios Lagarias et al.
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Many heritage institutions hold extensive collections of theatre programmes, which remain largely underused due to their complex layouts and lack of structured metadata. In this paper, we present a workflow for transforming such documents into structured data using a combination of multimodal large language models (LLMs), an ontology-based reasoning model, and a custom extension of the Linked Art framework. We show how vision-language models can accurately parse and transcribe born-digital and digitised programmes, achieving over 98% of correct extraction. To overcome the challenges of semantic annotation, we train a reasoning model (POntAvignon) using reinforcement learning with both formal and semantic rewards. This approach enables automated RDF triple generation and supports alignment with existing knowledge graphs. Through a case study based on the Festival d'Avignon corpus, we demonstrate the potential for large-scale, ontology-driven analysis of performing arts data. Our results open new possibilities for interoperable, explainable, and sustainable computational theatre historiography.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiangxia Cao, Shuo Yang, Zijun Wang, Qinghai Tan
Natural Language Processing
ๆฅ็ๆ่ฆ
In past years, the OpenAI's Scaling-Laws shows the amazing intelligence with the next-token prediction paradigm in neural language modeling, which pointing out a free-lunch way to enhance the model performance by scaling the model parameters. In RecSys, the retrieval stage is also follows a 'next-token prediction' paradigm, to recall the hunderds of items from the global item set, thus the generative recommendation usually refers specifically to the retrieval stage (without Tree-based methods). This raises a philosophical question: without a ground-truth next item, does the generative recommendation also holds a potential scaling law? In retrospect, the generative recommendation has two different technique paradigms: (1) ANN-based framework, utilizing the compressed user embedding to retrieve nearest other items in embedding space, e.g, Kuaiformer. (2) Auto-regressive-based framework, employing the beam search to decode the item from whole space, e.g, OneRec. In this paper, we devise a unified encoder-decoder framework to validate their scaling-laws at same time. Our empirical finding is that both of their losses strictly adhere to power-law Scaling Laws ($R^2$>0.9) within our unified architecture.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Daniele Malitesta, Claudio Pomo, Vito Walter Anelli, Alberto Carlo Maria Mancino, Alejandro Bellogรญn et al.
Machine Learning
ๆฅ็ๆ่ฆ
In recommender systems, user-item interactions can be modeled as a bipartite graph, where user and item nodes are connected by undirected edges. This graph-based view has motivated the rapid adoption of graph neural networks (GNNs), which often outperform collaborative filtering (CF) methods such as latent factor models, deep neural networks, and generative strategies. Yet, despite their empirical success, the reasons why GNNs offer systematic advantages over other CF approaches remain only partially understood. This monograph advances a topology-centered perspective on GNN-based recommendation. We argue that a comprehensive understanding of these models' performance should consider the structural properties of user-item graphs and their interaction with GNN architectural design. To support this view, we introduce a formal taxonomy that distills common modeling patterns across eleven representative GNN-based recommendation approaches and consolidates them into a unified conceptual pipeline. We further formalize thirteen classical and topological characteristics of recommendation datasets and reinterpret them through the lens of graph machine learning. Using these definitions, we analyze the considered GNN-based recommender architectures to assess how and to what extent they encode such properties. Building on this analysis, we derive an explanatory framework that links measurable dataset characteristics to model behavior and performance. Taken together, this monograph re-frames GNN-based recommendation through its topological underpinnings and outlines open theoretical, data-centric, and evaluation challenges for the next generation of topology-aware recommender systems.
๐
2025-12-08
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Bin Wu, Feifan Yang, Zhangming Chan, Yu-Ran Gu, Jiawei Feng et al.
Multimodal
ๆฅ็ๆ่ฆ
Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features, suffering from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), they often neglect multimodal integration in the fine-grained modeling stage -- the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its full potential. Guided by these principles, we propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in Taobao display advertising system, enabling 100K-length user behavior sequence modeling and delivering significant gains in top-line metrics with negligible online latency overhead. To foster community research, we share industrial deployment practices and open-source the first large-scale dataset featuring ultra-long behavior sequences paired with high-quality multimodal embeddings. Our code and data is available at https://taobao-mm.github.io.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Shubham S. Kumbhar, Abhijeet M. Kulkarni, Panagiotis Artemiadis
Robotics
ๆฅ็ๆ่ฆ
We present a control framework that enables humanoid robots to perform collaborative transportation tasks with a human partner. The framework supports both translational and rotational motions, which are fundamental to co-transport scenarios. It comprises three components: a high-level planner, a low-level controller, and a stiffness modulation mechanism. At the planning level, we introduce the Interaction Linear Inverted Pendulum (I-LIP), which, combined with an admittance model and an MPC formulation, generates dynamically feasible footstep plans. These are executed by a QP-based whole-body controller that accounts for the coupled humanoid-object dynamics. Stiffness modulation regulates robot-object interaction, ensuring convergence to the desired relative configuration defined by the distance between the object and the robot's center of mass. We validate the effectiveness of the framework through real-world experiments conducted on the Digit humanoid platform. To quantify collaboration quality, we propose an efficiency metric that captures both task performance and inter-agent coordination. We show that this metric highlights the role of compliance in collaborative tasks and offers insights into desirable trajectory characteristics across both high- and low-level control layers. Finally, we showcase experimental results on collaborative behaviors, including translation, turning, and combined motions such as semi circular trajectories, representative of naturally occurring co-transportation tasks.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Hari Prakash Thanabalan, Lars Bengtsson, Ugo Lafont, Giovanni Volpe
Robotics
ๆฅ็ๆ่ฆ
Soft robots require directional control to navigate complex terrains. However, achieving such control often requires multiple actuators, which increases mechanical complexity, complicates control systems, and raises energy consumption. Here, we introduce an inchworm-inspired soft robot whose locomotion direction is controlled passively by patterned substrates. The robot employs a single rolled dielectric elastomer actuator, while groove patterns on a 3D-printed substrate guide its alignment and trajectory. Through systematic experiments, we demonstrate that varying groove angles enables precise control of locomotion direction without the need for complex actuation strategies. This groove-guided approach reduces energy consumption, simplifies robot design, and expands the applicability of bio-inspired soft robots in fields such as search and rescue, pipe inspection, and planetary exploration.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ David Thorne, Nathan Chan, Christa S. Robison, Philip R. Osteen, Brett T. Lopez
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous robots rely on geometric maps to inform a diverse set of perception and decision-making algorithms. As autonomy requires reasoning and planning on multiple scales of the environment, each algorithm may require a different map for optimal performance. Light Detection And Ranging (LiDAR) sensors generate an abundance of geometric data to satisfy these diverse requirements, but selecting informative, size-constrained maps is computationally challenging as it requires solving an NP-hard combinatorial optimization. In this work we present OptMap: a geometric map distillation algorithm which achieves real-time, application-specific map generation via multiple theoretical and algorithmic innovations. A central feature is the maximization of set functions that exhibit diminishing returns, i.e., submodularity, using polynomial-time algorithms with provably near-optimal solutions. We formulate a novel submodular reward function which quantifies informativeness, reduces input set sizes, and minimizes bias in sequentially collected datasets. Further, we propose a dynamically reordered streaming submodular algorithm which improves empirical solution quality and addresses input order bias via an online approximation of the value of all scans. Testing was conducted on open-source and custom datasets with an emphasis on long-duration mapping sessions, highlighting OptMap's minimal computation requirements. Open-source ROS1 and ROS2 packages are available and can be used alongside any LiDAR SLAM algorithm.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Gustavo A. Cardona, Shubham S. Kumbhar, Panagiotis Artemiadis
Robotics
ๆฅ็ๆ่ฆ
Physical Human-Humanoid Interaction (pHHI) is a rapidly advancing field with significant implications for deploying robots in unstructured, human-centric environments. In this review, we examine the current state of the art in pHHI through three core pillars: (i) humanoid modeling and control, (ii) human intent estimation, and (iii) computational human models. For each pillar, we survey representative approaches, identify open challenges, and analyze current limitations that hinder robust, scalable, and adaptive interaction. These include the need for whole-body control strategies capable of handling uncertain human dynamics, real-time intent inference under limited sensing, and modeling techniques that account for variability in human physical states. Although significant progress has been made within each domain, integration across pillars remains limited. We propose pathways for unifying methods across these areas to enable cohesive interaction frameworks. This structure enables us not only to map the current landscape but also to propose concrete directions for future research that aim to bridge these domains. Additionally, we introduce a unified taxonomy of interaction types based on modality, distinguishing between direct interactions (e.g., physical contact) and indirect interactions (e.g., object-mediated), and on the level of robot engagement, ranging from assistance to cooperation and collaboration. For each category in this taxonomy, we provide the three core pillars that highlight opportunities for cross-pillar unification. Our goal is to suggest avenues to advance robust, safe, and intuitive physical interaction, providing a roadmap for future research that will allow humanoid systems to effectively understand, anticipate, and collaborate with human partners in diverse real-world settings.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mayank Anand, Ujair Alam, Surya Prakash, Priya Shukla, Gora Chand Nandi et al.
Computer Vision
ๆฅ็ๆ่ฆ
Clinical ultrasound acquisition is highly operator-dependent, where rapid probe motion and brightness fluctuations often lead to reconstruction errors that reduce trust and clinical utility. We present UltrasODM, a dual-stream framework that assists sonographers during acquisition through calibrated per-frame uncertainty, saliency-based diagnostics, and actionable prompts. UltrasODM integrates (i) a contrastive ranking module that groups frames by motion similarity, (ii) an optical-flow stream fused with Dual-Mamba temporal modules for robust 6-DoF pose estimation, and (iii) a Human-in-the-Loop (HITL) layer combining Bayesian uncertainty, clinician-calibrated thresholds, and saliency maps highlighting regions of low confidence. When uncertainty exceeds the threshold, the system issues unobtrusive alerts suggesting corrective actions such as re-scanning highlighted regions or slowing the sweep. Evaluated on a clinical freehand ultrasound dataset, UltrasODM reduces drift by 15.2%, distance error by 12.1%, and Hausdorff distance by 10.1% relative to UltrasOM, while producing per-frame uncertainty and saliency outputs. By emphasizing transparency and clinician feedback, UltrasODM improves reconstruction reliability and supports safer, more trustworthy clinical workflows. Our code is publicly available at https://github.com/AnandMayank/UltrasODM.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arslan Artykov, Corentin Sautier, Vincent Lepetit
Robotics
ๆฅ็ๆ่ฆ
Understanding articulated objects is a fundamental challenge in robotics and digital twin creation. To effectively model such objects, it is essential to recover both part segmentation and the underlying joint parameters. Despite the importance of this task, previous work has largely focused on setups like multi-view systems, object scanning, or static cameras. In this paper, we present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Trained solely on synthetic data, our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments. Project webpage: https://aartykov.github.io/sim2art/
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Aileen Liao, Dong-Ki Kim, Max Olan Smith, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei
ๆฅ็ๆ่ฆ
As a robot senses and selects actions, the world keeps changing. This inference delay creates a gap of tens to hundreds of milliseconds between the observed state and the state at execution. In this work, we take the natural generalization from zero delay to measured delay during training and inference. We introduce Delay-Aware Diffusion Policy (DA-DP), a framework for explicitly incorporating inference delays into policy learning. DA-DP corrects zero-delay trajectories to their delay-compensated counterparts, and augments the policy with delay conditioning. We empirically validate DA-DP on a variety of tasks, robots, and delays and find its success rate more robust to delay than delay-unaware methods. DA-DP is architecture agnostic and transfers beyond diffusion policies, offering a general pattern for delay-aware imitation learning. More broadly, DA-DP encourages evaluation protocols that report performance as a function of measured latency, not just task difficulty.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ P. A. Wigner, L. Romanello, A. Hammad, P. H. Nguyen, T. Lan et al.
Robotics
ๆฅ็ๆ่ฆ
This paper presents an aerially deployable crawler designed for adaptive locomotion and manipulation within tree canopies. The system combines compliant microspine-based tracks, a dual-track rotary gripper, and an elastic tail, enabling secure attachment and stable traversal across branches of varying curvature and inclination.
Experiments demonstrate reliable gripping up to 90 degrees of body roll and inclination, while effective climbing on branches inclined up to 67.5 degrees, achieving a maximum speed of 0.55 body lengths per second on horizontal branches. The compliant tracks allow yaw steering of up to 10 degrees, enhancing maneuverability on irregular surfaces.
Power measurements show efficient operation with a dimensionless cost of transport over an order of magnitude lower than typical hovering power consumption in aerial robots. Integrated within a drone-tether deployment system, the crawler provides a robust, low-power platform for environmental sampling and in-canopy sensing, bridging the gap between aerial and surface-based ecological robotics.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Matthias Heyrman, Chenhao Li, Victor Klemm, Dongho Kang, Stelian Coros et al.
Robotics
ๆฅ็ๆ่ฆ
Effective motion representation is crucial for enabling robots to imitate expressive behaviors in real time, yet existing motion controllers often ignore inherent patterns in motion. Previous efforts in representation learning do not attempt to jointly capture structured periodic patterns and irregular variations in human and animal movement. To address this, we present Multi-Domain Motion Embedding (MDME), a motion representation that unifies the embedding of structured and unstructured features using a wavelet-based encoder and a probabilistic embedding in parallel. This produces a rich representation of reference motions from a minimal input set, enabling improved generalization across diverse motion styles and morphologies. We evaluate MDME on retargeting-free real-time motion imitation by conditioning robot control policies on the learned embeddings, demonstrating accurate reproduction of complex trajectories on both humanoid and quadruped platforms. Our comparative studies confirm that MDME outperforms prior approaches in reconstruction fidelity and generalizability to unseen motions. Furthermore, we demonstrate that MDME can reproduce novel motion styles in real-time through zero-shot deployment, eliminating the need for task-specific tuning or online retargeting. These results position MDME as a generalizable and structure-aware foundation for scalable real-time robot imitation.
๐
2025-12-08
๐ ArXiv Preprint (eess.SY)
๐ฅ Nikita Vaibhav Pavle, Shrreya Rajneesh, Rakesh Kumar Sahoo, Manoranjan Sinha
Robotics
ๆฅ็ๆ่ฆ
The conventional Artificial Potential Field (APF) is fundamentally limited by the local minima issue and its inability to account for the kinematics of moving obstacles. This paper addresses the critical challenge of autonomous collision avoidance for Unmanned Aerial Vehicles (UAVs) operating in dynamic and cluttered airspace by proposing a novel Direction and Relative Velocity Weighted Artificial Potential Field (APF). In this approach, a bounded weighting function, $ฯ(ฮธ,v_{e})$, is introduced to dynamically scale the repulsive potential based on the direction and velocity of the obstacle relative to the UAV. This robust APF formulation is integrated within a Model Predictive Control (MPC) framework to generate collision-free trajectories while adhering to kinematic constraints. Simulation results demonstrate that the proposed method effectively resolves local minima and significantly enhances safety by enabling smooth, predictive avoidance maneuvers. The system ensures superior path integrity and reliable performance, confirming its viability for autonomous navigation in complex environments.
๐
2025-12-08
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenzhen Dong, Jieming Yu, Yiming Huang, Hongqiu Wang, Lei Zhu et al.
Computer Vision
ๆฅ็ๆ่ฆ
The recent Segment Anything Model (SAM) 3 has introduced significant advancements over its predecessor, SAM 2, particularly with the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3's 3D reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while zero-shot evaluations on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Guangyan Chen, Meiling Wang, Qi Shao, Zichen Zhou, Weixin Mao et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Developing robust and general-purpose manipulation policies represents a fundamental objective in robotics research. While Vision-Language-Action (VLA) models have demonstrated promising capabilities for end-to-end robot control, existing approaches still exhibit limited generalization to tasks beyond their training distributions. In contrast, humans possess remarkable proficiency in acquiring novel skills by simply observing others performing them once. Inspired by this capability, we propose ViVLA, a generalist robotic manipulation policy that achieves efficient task learning from a single expert demonstration video at test time. Our approach jointly processes an expert demonstration video alongside the robot's visual observations to predict both the demonstrated action sequences and subsequent robot actions, effectively distilling fine-grained manipulation knowledge from expert behavior and transferring it seamlessly to the agent. To enhance the performance of ViVLA, we develop a scalable expert-agent pair data generation pipeline capable of synthesizing paired trajectories from easily accessible human videos, further augmented by curated pairs from publicly available datasets. This pipeline produces a total of 892,911 expert-agent samples for training ViVLA. Experimental results demonstrate that our ViVLA is able to acquire novel manipulation skills from only a single expert demonstration video at test time. Our approach achieves over 30% improvement on unseen LIBERO tasks and maintains above 35% gains with cross-embodiment videos. Real-world experiments demonstrate effective learning from human videos, yielding more than 38% improvement on unseen tasks.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Yiming Cui, Shiyu Fang, Jiarui Zhang, Yan Huang, Chengkai Xu et al.
Robotics
ๆฅ็ๆ่ฆ
The rapid development of autonomous vehicles has led to a surge in testing demand. Traditional testing methods, such as virtual simulation, closed-course, and public road testing, face several challenges, including unrealistic vehicle states, limited testing capabilities, and high costs. These issues have prompted increasing interest in virtual-physical fusion testing. However, despite its potential, virtual-physical fusion testing still faces challenges, such as limited element types, narrow testing scope, and fixed evaluation metrics. To address these challenges, we propose the Virtual-Physical Testing Platform for Autonomous Vehicles (VP-AutoTest), which integrates over ten types of virtual and physical elements, including vehicles, pedestrians, and roadside infrastructure, to replicate the diversity of real-world traffic participants. The platform also supports both single-vehicle interaction and multi-vehicle cooperation testing, employing adversarial testing and parallel deduction to accelerate fault detection and explore algorithmic limits, while OBU and Redis communication enable seamless vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) cooperation across all levels of cooperative automation. Furthermore, VP-AutoTest incorporates a multidimensional evaluation framework and AI-driven expert systems to conduct comprehensive performance assessment and defect diagnosis. Finally, by comparing virtual-physical fusion test results with real-world experiments, the platform performs credibility self-evaluation to ensure both the fidelity and efficiency of autonomous driving testing. Please refer to the website for the full testing functionalities on the autonomous driving public service platform OnSite:https://www.onsite.com.cn.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Florian Lรผttner, Nicole Neis, Daniel Stadler, Robin Moss, Mirjam Fehling-Kaschek et al.
Robotics
ๆฅ็ๆ่ฆ
The reliable operation of autonomous vehicles, automated driving functions, and advanced driver assistance systems across a wide range of relevant scenarios is critical for their development and deployment. Identifying a near-complete set of relevant driving scenarios for such functionalities is challenging due to numerous degrees of freedom involved, each affecting the outcomes of the driving scenario differently. Moreover, with increasing technical complexity of new functionalities, the number of potentially relevant, particularly "unknown unsafe" scenarios is increasing. To enhance validation efficiency, it is essential to identify relevant scenarios in advance, starting with simpler domains like highways before moving to more complex environments such as urban traffic. To address this, this paper focuses on analyzing lane change scenarios in highway traffic, which involve multiple degrees of freedom and present numerous safetyrelevant scenarios. We describe the process of data acquisition and processing of real-world data from public highway traffic, followed by the application of criticality measures on trajectory data to evaluate scenarios, as conducted within the AVEAS project (www.aveas.org). By linking the calculated measures to specific lane change driving scenarios and the conditions under which the data was collected, we facilitate the identification of safetyrelevant driving scenarios for various applications. Further, to tackle the extensive range of "unknown unsafe" scenarios, we propose a way to generate relevant scenarios by creating synthetic scenarios based on recorded ones. Consequently, we demonstrate and evaluate a processing chain that enables the identification of safety-relevant scenarios, the development of data-driven methods for extracting these scenarios, and the generation of synthetic critical scenarios via sampling on highways.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Siyu Xu, Zijian Wang, Yunke Wang, Chenghao Xia, Tao Huang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have shown great performance in robotic manipulation by mapping visual observations and language instructions directly to actions. However, they remain brittle under distribution shifts: when test scenarios change, VLAs often reproduce memorized trajectories instead of adapting to the updated scene, which is a failure mode we refer to as the "Memory Trap". This limitation stems from the end-to-end design, which lacks explicit 3D spatial reasoning and prevents reliable identification of actionable regions in unfamiliar environments. To compensate for this missing spatial understanding, 3D Spatial Affordance Fields (SAFs) can provide a geometric representation that highlights where interactions are physically feasible, offering explicit cues about regions the robot should approach or avoid. We therefore introduce Affordance Field Intervention (AFI), a lightweight hybrid framework that uses SAFs as an on-demand plug-in to guide VLA behavior. Our system detects memory traps through proprioception, repositions the robot to recent high-affordance regions, and proposes affordance-driven waypoints that anchor VLA-generated actions. A SAF-based scorer then selects trajectories with the highest cumulative affordance. Extensive experiments demonstrate that our method achieves an average improvement of 23.5% across different VLA backbones ($ฯ_{0}$ and $ฯ_{0.5}$) under out-of-distribution scenarios on real-world robotic platforms, and 20.2% on the LIBERO-Pro benchmark, validating its effectiveness in enhancing VLA robustness to distribution shifts.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Haolin Song, Hongbo Zhu, Tao Yu, Yan Liu, Mingqi Yuan et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
For full-size humanoid robots, even with recent advances in reinforcement learning-based control, achieving reliable locomotion on complex terrains, such as long staircases, remains challenging. In such settings, limited perception, ambiguous terrain cues, and insufficient adaptation of gait timing can cause even a single misplaced or mistimed step to result in rapid loss of balance. We introduce a perceptive locomotion framework that merges terrain sensing, gait regulation, and whole-body control into a single reinforcement learning policy. A downward-facing depth camera mounted under the base observes the support region around the feet, and a compact U-Net reconstructs a dense egocentric height map from each frame in real time, operating at the same frequency as the control loop. The perceptual height map, together with proprioceptive observations, is processed by a unified policy that produces joint commands and a global stepping-phase signal, allowing gait timing and whole-body posture to be adapted jointly to the commanded motion and local terrain geometry. We further adopt a single-stage successive teacher-student training scheme for efficient policy learning and knowledge transfer. Experiments conducted on a 31-DoF, 1.65 m humanoid robot demonstrate robust locomotion in both simulation and real-world settings, including forward and backward stair ascent and descent, as well as crossing a 46 cm gap. Project Page:https://ga-phl.github.io/
๐
2025-12-08
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenwei Shi, Xueyu Luan
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
DreamerV3 is a state-of-the-art online model-based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perceptrons (MLPs), offering superior parameter efficiency and interpretability. To mitigate KANs' computational overhead, variants like FastKAN leverage Radial Basis Functions (RBFs) to accelerate inference. In this work, we investigate integrating KAN architectures into the DreamerV3 framework. We introduce KAN-Dreamer, replacing specific MLP and convolutional components of DreamerV3 with KAN and FastKAN layers. To ensure efficiency within the JAX-based World Model, we implement a tailored, fully vectorized version with simplified grid management. We structure our investigation into three subsystems: Visual Perception, Latent Prediction, and Behavior Learning. Empirical evaluations on the DeepMind Control Suite (walker_walk) analyze sample efficiency, training time, and asymptotic performance. Experimental results demonstrate that utilizing our adapted FastKAN as a drop-in replacement for the Reward and Continue predictors yields performance on par with the original MLP-based architecture, maintaining parity in both sample efficiency and training speed. This report serves as a preliminary study for future developments in KAN-based world models.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Byungju Kim, Jinu Pahk, Chungwoo Lee, Jaejoon Kim, Jangha Lee et al.
Robotics
ๆฅ็ๆ่ฆ
Behavior-cloning based visuomotor policies enable precise manipulation but often inherit the slow, cautious tempo of human demonstrations, limiting practical deployment. However, prior studies on acceleration methods mainly rely on statistical or heuristic cues that ignore task semantics and can fail across diverse manipulation settings. We present ESPADA, a semantic and spatially aware framework that segments demonstrations using a VLM-LLM pipeline with 3D gripper-object relations, enabling aggressive downsampling only in non-critical segments while preserving precision-critical phases, without requiring extra data or architectural modifications, or any form of retraining. To scale from a single annotated episode to the full dataset, ESPADA propagates segment labels via Dynamic Time Warping (DTW) on dynamics-only features. Across both simulation and real-world experiments with ACT and DP baselines, ESPADA achieves approximately a 2x speed-up while maintaining success rates, narrowing the gap between human demonstrations and efficient robot control.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Bin Zhao, Yiwen Lu, Haohua Zhu, Xiao Li, Sheng Yi
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Human hand simulation plays a critical role in digital twin applications, requiring models that balance anatomical fidelity with computational efficiency. We present a complete pipeline for constructing multi-rigid-body approximations of human hands that preserve realistic appearance while enabling real-time physics simulation. Starting from optical motion capture of a specific human hand, we construct a personalized MANO (Multi-Abstracted hand model with Neural Operations) model and convert it to a URDF (Unified Robot Description Format) representation with anatomically consistent joint axes. The key technical challenge is projecting MANO's unconstrained SO(3) joint rotations onto the kinematically constrained joints of the rigid-body model. We derive closed-form solutions for single degree-of-freedom joints and introduce a Baker-Campbell-Hausdorff (BCH)-corrected iterative method for two degree-of-freedom joints that properly handles the non-commutativity of rotations. We validate our approach through digital twin experiments where reinforcement learning policies control the multi-rigid-body hand to replay captured human demonstrations. Quantitative evaluation shows sub-centimeter reconstruction error and successful grasp execution across diverse manipulation tasks.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Gianpietro Battocletti, Dimitris Boskos, Bart De Schutter
Robotics
ๆฅ็ๆ่ฆ
Uncrewed Surface Vehicles (USVs) are a popular and efficient type of marine craft that find application in a large number of water-based tasks. When multiple USVs operate in the same area, they may be required to dock to each other to perform a shared task. Existing approaches for the docking between autonomous USVs generally consider one USV as a stationary target, while the second one is tasked to reach the required docking pose. In this work, we propose a cooperative approach for USV-USV docking, where two USVs work together to dock at an agreed location. We use a centralized Model Predictive Control (MPC) approach to solve the control problem, obtaining feasible trajectories that also guarantee constraint satisfaction. Owing to its model-based nature, this approach allows the rejection of disturbances, inclusive of exogenous inputs, by anticipating their effect on the USVs through the MPC prediction model. This is particularly effective in case of almost-stationary disturbances such as water currents. In simulations, we demonstrate how the proposed approach allows for a faster and more efficient docking with respect to existing approaches.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Gianpietro Battocletti, Dimitris Boskos, Bart De Schutter
ๆฅ็ๆ่ฆ
Despite the attention that the problem of path planning for tethered robots has garnered in the past few decades, the approaches proposed to solve it typically rely on a discrete representation of the configuration space and do not exploit a model that can simultaneously capture the topological information of the tether and the continuous location of the robot. In this work, we explicitly build a topological model of the configuration space of a tethered robot starting from a polygonal representation of the workspace where the robot moves. To do so, we first establish a link between the configuration space of the tethered robot and the universal covering space of the workspace, and then we exploit this link to develop an algorithm to compute a simplicial complex model of the configuration space. We show how this approach improves the performances of existing algorithms that build other types of representations of the configuration space. The proposed model can be computed in a fraction of the time required to build traditional homotopy-augmented graphs, and is continuous, allowing to solve the path planning task for tethered robots using a broad set of path planning algorithms.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Florian Tretter, Daniel Flรถgel, Alexandru Vasilache, Max Grobbel, Jรผrgen Becker et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Integrating autonomous mobile robots into human environments requires human-like decision-making and energy-efficient, event-based computation. Despite progress, neuromorphic methods are rarely applied to Deep Reinforcement Learning (DRL) navigation approaches due to unstable training. We address this gap with a hybrid socially integrated DRL actor-critic approach that combines Spiking Neural Networks (SNNs) in the actor with Artificial Neural Networks (ANNs) in the critic and a neuromorphic feature extractor to capture temporal crowd dynamics and human-robot interactions. Our approach enhances social navigation performance and reduces estimated energy consumption by approximately 1.69 orders of magnitude.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Zichao Shu, Shitao Bei, Lijun Li, Zetao Chen
ๆฅ็ๆ่ฆ
Simultaneous localization and mapping (SLAM) plays a fundamental role in extended reality (XR) applications. As the standards for immersion in XR continue to increase, the demands for SLAM benchmarking have become more stringent. Trajectory accuracy is the key metric, and marker-based optical motion capture (MoCap) systems are widely used to generate ground truth (GT) because of their drift-free and relatively accurate measurements. However, the precision of MoCap-based GT is limited by two factors: the spatiotemporal calibration with the device under test (DUT) and the inherent jitter in the MoCap measurements. These limitations hinder accurate SLAM benchmarking, particularly for key metrics like rotation error and inter-frame jitter, which are critical for immersive XR experiences. This paper presents a novel continuous-time maximum likelihood estimator to address these challenges. The proposed method integrates auxiliary inertial measurement unit (IMU) data to compensate for MoCap jitter. Additionally, a variable time synchronization method and a pose residual based on screw congruence constraints are proposed, enabling precise spatiotemporal calibration across multiple sensors and the DUT. Experimental results demonstrate that our approach outperforms existing methods, achieving the precision necessary for comprehensive benchmarking of state-of-the-art SLAM algorithms in XR applications. Furthermore, we thoroughly validate the practicality of our method by benchmarking several leading XR devices and open-source SLAM algorithms. The code is publicly available at https://github.com/ylab-xrpg/xr-hpgt.
๐
2025-12-08
๐ ArXiv Preprint (cs.MA)
๐ฅ Sungyong Chung, Alireza Talebpour, Samer H. Hamdar
ๆฅ็ๆ่ฆ
Characterizing and understanding lane-changing behavior in the presence of automated vehicles (AVs) is crucial to ensuring safety and efficiency in mixed traffic. Accordingly, this study aims to characterize the interactions between the lane-changing vehicle (active vehicle) and the vehicle directly impacted by the maneuver in the target lane (passive vehicle). Utilizing real-world trajectory data from the Waymo Open Motion Dataset (WOMD), this study explores patterns in lane-changing behavior and provides insight into how these behaviors evolve under different AV market penetration rates (MPRs). In particular, we propose a game-theoretic framework to analyze cooperative and defective behaviors in mixed traffic, applied to the 7,636 observed lane-changing events in the WOMD. First, we utilize k-means clustering to classify vehicles as cooperative or defective, revealing that the proportions of cooperative AVs are higher than those of HDVs in both active and passive roles. Next, we jointly estimate the utilities of active and passive vehicles to model their behaviors using the quantal response equilibrium framework. Empirical payoff tables are then constructed based on these utilities. Using these payoffs, we analyze the presence of social dilemmas and examine the evolution of cooperative behaviors using evolutionary game theory. Our results reveal the presence of social dilemmas in approximately 4% and 11% of lane-changing events for active and passive vehicles, respectively, with most classified as Stag Hunt or Prisoner's Dilemma (Chicken Game rarely observed). Moreover, the Monte Carlo simulation results show that repeated lane-changing interactions consistently lead to increased cooperative behavior over time, regardless of the AV penetration rate.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Fanjun Bu, Melina Tsai, Audrey Tjokro, Tapomayukh Bhattacharjee, Jorge Ortiz et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Robots operating in everyday environments must often decide when and whether to engage with people, yet such decisions often hinge on subtle nonverbal cues that unfold over time and are difficult to model explicitly. Drawing on a five-day Wizard-of-Oz deployment of a mobile service robot in a university cafe, we analyze how people signal interaction readiness through nonverbal behaviors and how expert wizards use these cues to guide engagement. Motivated by these observations, we propose a two-stage pipeline in which lightweight perceptual detectors (gaze shifts and proxemics) are used to selectively trigger heavier video-based vision-language model (VLM) queries at socially meaningful moments. We evaluate this pipeline on replayed field interactions and compare two prompting strategies. Our findings suggest that selectively using VLMs as proxies for social reasoning enables socially responsive robot behavior, allowing robots to act appropriately by attending to the cues people naturally provide in real-world interactions.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Kang Yijie, Hao Yuqing, Wang Qingyun, Chen Guanrong
Robotics
ๆฅ็ๆ่ฆ
In this paper, the time-varying formation tracking control of wheeled mobile robots with region constraint is investigated from a generalized Udwadia-Kalaba framework. The communication topology is directed, weighted and has a spanning tree with the leader being the root. By reformulating the time-varying formation tracking control objective as a constrained equation and transforming the region constraint by a diffeomorphism, the time-varying formation tracking controller with the region constraint is designed under the generalized Udwadia-Kalaba framework. Compared with the existing works on time-varying formation tracking control, the region constraint is takeninto account in this paper, which ensures the safety of the robots.Finally, some numerical simulations are presented to illustrate the effectiveness of the proposed control strategy.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Zebin Xing, Yupeng Zheng, Qichao Zhang, Zhixing Ding, Pengxuan Yang et al.
Robotics
ๆฅ็ๆ่ฆ
End-to-end autonomous driving has emerged as a pivotal direction in the field of autonomous systems. Recent works have demonstrated impressive performance by incorporating high-level guidance signals to steer low-level trajectory planners. However, their potential is often constrained by inaccurate high-level guidance and the computational overhead of complex guidance modules. To address these limitations, we propose Mimir, a novel hierarchical dual-system framework capable of generating robust trajectories relying on goal points with uncertainty estimation: (1) Unlike previous approaches that deterministically model, we estimate goal point uncertainty with a Laplace distribution to enhance robustness; (2) To overcome the slow inference speed of the guidance system, we introduce a multi-rate guidance mechanism that predicts extended goal points in advance. Validated on challenging Navhard and Navtest benchmarks, Mimir surpasses previous state-of-the-art methods with a 20% improvement in the driving score EPDMS, while achieving 1.6 times improvement in high-level module inference speed without compromising accuracy. The code and models will be released soon to promote reproducibility and further development. The code is available at https://github.com/ZebinX/Mimir-Uncertainty-Driving
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Jue Wang, Mingsong Jiang, Luis A. Ramirez, Bilige Yang, Mujun Zhang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Adaptive morphogenetic robots adapt their morphology and control policies to meet changing tasks and environmental conditions. Many such systems leverage soft components, which enable shape morphing but also introduce simulation and control challenges. Soft-body simulators remain limited in accuracy and computational tractability, while rigid-body simulators cannot capture soft-material dynamics. Here, we present a surrogate compliance modeling approach: rather than explicitly modeling soft-body physics, we introduce indirect variables representing soft-material deformation within a rigid-body simulator. We validate this approach using our amphibious robotic turtle, a quadruped with soft morphing limbs designed for multi-environment locomotion. By capturing deformation effects as changes in effective limb length and limb center of mass, and by applying reinforcement learning with extensive randomization of these indirect variables, we achieve reliable policy learning entirely in a rigid-body simulation. The resulting gaits transfer directly to hardware, demonstrating high-fidelity sim-to-real performance on hard, flat substrates and robust, though lower-fidelity, transfer on rheologically complex terrains. The learned closed-loop gaits exhibit unprecedented terrestrial maneuverability and achieve an order-of-magnitude reduction in cost of transport compared to open-loop baselines. Field experiments with the robot further demonstrate stable, multi-gait locomotion across diverse natural terrains, including gravel, grass, and mud.
๐
2025-12-08
๐ ArXiv Preprint (Robotics)
๐ฅ Tomoya Takahashi, Yusaku Nakajima, Cristian Camilo Beltran-Hernandez, Yuki Kuroda, Kazutoshi Tanaka et al.
Robotics
ๆฅ็ๆ่ฆ
Laboratory Automation (LA) has the potential to accelerate solid-state materials discovery by enabling continuous robotic operation without human intervention. While robotic systems have been developed for tasks such as powder grinding and X-ray diffraction (XRD) analysis, fully automating powder handling at the milligram scale remains a significant challenge due to the complex flow dynamics of powders and the diversity of laboratory tasks. To address this challenge, this study proposes a novel, funnel-shaped, flexible robotic hand that preserves the softness and conical sheet designs in prior work while incorporating a controllable valve at the cone apex to enable precise, incremental dispensing of milligram-scale powder quantities. The hand is integrated with an external balance through a feedback control system based on a model of powder flow and online parameter identification. Experimental evaluations with glass beads, monosodium glutamate, and titanium dioxide demonstrated that 80% of the trials achieved an error within 2 mg, and the maximum error observed was approximately 20 mg across a target range of 20 mg to 3 g. In addition, by incorporating flow prediction models commonly used for hoppers and performing online parameter identification, the system is able to adapt to variations in powder dynamics. Compared to direct PID control, the proposed model-based control significantly improved both accuracy and convergence speed. These results highlight the potential of the proposed system to enable efficient and flexible powder weighing, with scalability toward larger quantities and applicability to a broad range of laboratory automation tasks.
๐
2025-12-07
๐ ICRA 2026
๐ฅ Rushiraj Gadhvi, Sandeep Manjanna
Machine LearningRobotics
ๆฅ็ๆ่ฆ
For centuries, khalasi (Gujarati for sailor) have skillfully harnessed ocean currents to navigate vast waters with minimal effort. Emulating this intuition in autonomous systems remains a significant challenge, particularly for Autonomous Surface Vehicles tasked with long duration missions under strict energy budgets. In this work, we present a learning-based approach for energy-efficient surface vehicle navigation in vortical flow fields, where partial observability often undermines traditional path-planning methods. We present an end to end reinforcement learning framework based on Soft Actor Critic that learns flow-aware navigation policies using only local velocity measurements. Through extensive evaluation across diverse and dynamically rich scenarios, our method demonstrates substantial energy savings and robust generalization to previously unseen flow conditions, offering a promising path toward long term autonomy in ocean environments. The navigation paths generated by our proposed approach show an improvement in energy conservation 30 to 50 percent compared to the existing state of the art techniques.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Akmaral Moldagalieva, Keisuke Okumura, Amanda Prorok, Wolfgang Hรถnig
ๆฅ็ๆ่ฆ
State-of-the-art multi-robot kinodynamic motion planners struggle to handle more than a few robots due to high computational burden, which limits their scalability and results in slow planning time. In this work, we combine the scalability and speed of modern multi-agent path finding (MAPF) algorithms with the dynamic-awareness of kinodynamic planners to address these limitations. To this end, we propose discontinuity-Bounded LaCAM (db-LaCAM), a planner that utilizes a precomputed set of motion primitives that respect robot dynamics to generate horizon-length motion sequences, while allowing a user-defined discontinuity between successive motions. The planner db-LaCAM is resolution-complete with respect to motion primitives and supports arbitrary robot dynamics. Extensive experiments demonstrate that db-LaCAM scales efficiently to scenarios with up to 50 robots, achieving up to ten times faster runtime compared to state-of-the-art planners, while maintaining comparable solution quality. The approach is validated in both 2D and 3D environments with dynamics such as the unicycle and 3D double integrator. We demonstrate the safe execution of trajectories planned with db-LaCAM in two distinct physical experiments involving teams of flying robots and car-with-trailer robots.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mayank Ravishankara
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) systems have significantly reduced hallucinations in Large Language Models (LLMs) by grounding responses in external context. However, standard RAG architectures suffer from a critical vulnerability: Retrieval Sycophancy. When presented with a query based on a false premise or a common misconception, vector-based retrievers tend to fetch documents that align with the user's bias rather than objective truth, leading the model to "hallucinate with citations."
In this work, we introduce Falsification-Verification Alignment RAG (FVA-RAG), a framework that shifts the retrieval paradigm from Inductive Verification (seeking support) to Deductive Falsification (seeking disproof). Unlike existing "Self-Correction" methods that rely on internal consistency, FVA-RAG deploys a distinct Adversarial Retrieval Policy that actively generates "Kill Queries"-targeted search terms designed to surface contradictory evidence. We introduce a dual-verification mechanism that explicitly weighs the draft answer against this "Anti-Context." Preliminary experiments on a dataset of common misconceptions demonstrate that FVA-RAG significantly improves robustness against sycophantic hallucinations compared to standard RAG baselines, effectively acting as an inference-time "Red Team" for factual generation.
๐
2025-12-07
๐ ArXiv Preprint (Machine Learning)
๐ฅ Daniel Ohayon, Itay Lamprecht, Itay Hubara, Israel Cohen, Daniel Soudry et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern large language models increasingly require long contexts for reasoning and multi-document tasks, but attention's quadratic complexity creates a severe computational bottleneck. We present Block-Sparse FlashAttention (BSFA), a drop-in replacement that accelerates long-context inference while preserving model quality. Unlike methods that predict importance before computing scores, BSFA computes exact query-key similarities to select the top-k most important value blocks for each query. By comparing per-block maximum scores against calibrated thresholds, we skip approximately 50% of the computation and memory transfers for pruned blocks. Our training-free approach requires only a one-time threshold calibration on a small dataset to learn the per-layer and per-head attention score distributions. We provide a CUDA kernel implementation that can be used as a drop-in replacement for FlashAttention. On Llama-3.1-8B, BSFA achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x for needle-in-a-haystack retrieval tasks while maintaining above 99% baseline accuracy, with certain configurations even improving accuracy by focusing on the most relevant content, substantially outperforming existing sparse attention methods. The implementation is available at https://github.com/Danielohayon/Block-Sparse-Flash-Attention
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jing Jie Tan, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum, Anissa Mokraoui et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable capabilities across various natural language processing tasks. This research introduces a novel "Prompting-in-a-Series" algorithm, termed PICEPR (Psychology-Informed Contents Embeddings for Personality Recognition), featuring two pipelines: (a) Contents and (b) Embeddings. The approach demonstrates how a modularised decoder-only LLM can summarize or generate content, which can aid in classifying or enhancing personality recognition functions as a personality feature extractor and a generator for personality-rich content. We conducted various experiments to provide evidence to justify the rationale behind the PICEPR algorithm. Meanwhile, we also explored closed-source models such as \textit{gpt4o} from OpenAI and \textit{gemini} from Google, along with open-source models like \textit{mistral} from Mistral AI, to compare the quality of the generated content. The PICEPR algorithm has achieved a new state-of-the-art performance for personality recognition by 5-15\% improvement. The work repository and models' weight can be found at https://research.jingjietan.com/?q=PICEPR.
๐
2025-12-07
๐ ArXiv Preprint (Machine Learning)
๐ฅ Minshen Zhang, Xiang Hu, Jianguo Li, Wei Wu, Kewei Tu
Natural Language Processing
ๆฅ็ๆ่ฆ
We explore Multi-Head FFN (MH-FFN) as a replacement of FFN in the Transformer architecture, motivated by the structural similarity between single-head attention and FFN. While multi-head mechanisms enhance expressivity in attention, naively applying them to FFNs faces two challenges: memory consumption scaling with the head count, and an imbalanced ratio between the growing intermediate size and the fixed head dimension as models scale, which degrades scalability and expressive power. To address these challenges, we propose Flash Multi-Head FFN (FlashMHF), with two key innovations: an I/O-aware fused kernel computing outputs online in SRAM akin to FlashAttention, and a design using dynamically weighted parallel sub-networks to maintain a balanced ratio between intermediate and head dimensions. Validated on models from 128M to 1.3B parameters, FlashMHF consistently improves perplexity and downstream task accuracy over SwiGLU FFNs, while reducing peak memory usage by 3-5x and accelerating inference by up to 1.08x. Our work establishes the multi-head design as a superior architectural principle for FFNs, presenting FlashMHF as a powerful, efficient, and scalable alternative to FFNs in Transformers.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ivanhoรฉ Botcazou, Tassadit Amghar, Sylvain Lamprier, Frรฉdรฉric Saubion
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Modern neural language models achieve high accuracy in text generation, yet precise control over generation length remains underdeveloped. In this paper, we first investigate a recent length control method based on Reverse Positional Embeddings (RPE) and show its limits when control is requested beyond the training distribution. In particular, using a discrete countdown signal tied to the absolute remaining token count leads to instability. To provide robust length control, we introduce Progress Ratio Embeddings (PRE), as continuous embeddings tied to a trigonometric impatience signal. PRE integrates seamlessly into standard Transformer architectures, providing stable length fidelity without degrading text accuracy under standard evaluation metrics. We further show that PRE generalizes well to unseen target lengths. Experiments on two widely used news-summarization benchmarks validate these findings.
๐
2025-12-07
๐ ArXiv Preprint (cs.CE)
๐ฅ Zifan Peng
ๆฅ็ๆ่ฆ
Understanding a complicated Ethereum transaction remains challenging: multi-hop token flows, nested contract calls, and opaque execution paths routinely lead users to blind signing. Based on interviews with everyday users, developers, and auditors, we identify the need for faithful, step-wise explanations grounded in both on-chain evidence and real-world protocol semantics. To meet this need, we introduce (matex, a cognitive multi-agent framework that models transaction understanding as a collaborative investigation-combining rapid hypothesis generation, dynamic off-chain knowledge retrieval, evidence-aware synthesis, and adversarial validation to produce faithful explanations.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Milad Alshomary, Anisha Bhatnagar, Peter Zeng, Smaranda Muresan, Owen Rambow et al.
ๆฅ็ๆ่ฆ
We present IXAM, an Interactive eXplainability framework for Authorship Attribution Models. Given an authorship attribution (AA) task and an embedding-based AA model, our tool enables users to interactively explore the model's embedding space and construct an explanation of the model's prediction as a set of writing style features at different levels of granularity. Through a user evaluation, we demonstrate the value of our framework compared to predefined stylistic explanations.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ George Mikros
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) present a dual challenge for forensic linguistics. They serve as powerful analytical tools enabling scalable corpus analysis and embedding-based authorship attribution, while simultaneously destabilising foundational assumptions about idiolect through style mimicry, authorship obfuscation, and the proliferation of synthetic texts. Recent stylometric research indicates that LLMs can approximate surface stylistic features yet exhibit detectable differences from human writers, a tension with significant forensic implications. However, current AI-text detection techniques, whether classifier-based, stylometric, or watermarking approaches, face substantial limitations: high false positive rates for non-native English writers and vulnerability to adversarial strategies such as homoglyph substitution. These uncertainties raise concerns under legal admissibility standards, particularly the Daubert and Kumho Tire frameworks. The article concludes that forensic linguistics requires methodological reconfiguration to remain scientifically credible and legally admissible. Proposed adaptations include hybrid human-AI workflows, explainable detection paradigms beyond binary classification, and validation regimes measuring error and bias across diverse populations. The discipline's core insight, i.e., that language reveals information about its producer, remains valid but must accommodate increasingly complex chains of human and machine authorship.
๐
2025-12-07
๐ Accepted by IEEE ICIA 2025
๐ฅ Ziyang Song, Zelin Zang, Xiaofan Ye, Boqiang Xu, Long Bai et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on over 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct an informative test with four neurosurgical trainees. The results show that the best-performing student achieves 56% accuracy, with the lowest scores of 28% and an average score of 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group's average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Francois Vandenhende, Anna Georgiou, Michalis Georgiou, Theodoros Psaras, Ellie Karekla
ๆฅ็ๆ่ฆ
The PRO-CTCAE is an NCI-developed patient-reported outcome system for capturing symptomatic adverse events in oncology trials. It comprises a large library drawn from the CTCAE vocabulary, and item selection for a given trial is typically guided by expected toxicity profiles from prior data. Selecting too many PRO-CTCAE items can burden patients and reduce compliance, while too few may miss important safety signals. We present an automated method to select a minimal yet comprehensive PRO-CTCAE subset based on historical safety data. Each candidate PRO-CTCAE symptom term is first mapped to its corresponding MedDRA Preferred Terms (PTs), which are then encoded into Safeterm, a high-dimensional semantic space capturing clinical and contextual diversity in MedDRA terminology. We score each candidate PRO item for relevance to the historical list of adverse event PTs and combine relevance and incidence into a utility function. Spectral analysis is then applied to the combined utility and diversity matrix to identify an orthogonal set of medical concepts that balances relevance and diversity. Symptoms are rank-ordered by importance, and a cut-off is suggested based on the explained information. The tool is implemented as part of the Safeterm trial-safety app. We evaluate its performance using simulations and oncology case studies in which PRO-CTCAE was employed. This automated approach can streamline PRO-CTCAE design by leveraging MedDRA semantics and historical data, providing an objective and reproducible method to balance signal coverage against patient burden.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ziyun Yu, Yiru Zhou, Chen Zhao, Hongyi Wen
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Using Large Language Models (LLMs) to simulate user opinions has received growing attention. Yet LLMs, especially trained with reinforcement learning from human feedback (RLHF), are known to exhibit biases toward dominant viewpoints, raising concerns about their ability to represent users from diverse demographic and cultural backgrounds. In this work, we examine the extent to which LLMs can simulate human responses to cross-domain survey questions through direct prompting and chain-of-thought prompting. We further propose a claim diversification method CLAIMSIM, which elicits viewpoints from LLM parametric knowledge as contextual input. Experiments on the survey question answering task indicate that, while CLAIMSIM produces more diverse responses, both approaches struggle to accurately simulate users. Further analysis reveals two key limitations: (1) LLMs tend to maintain fixed viewpoints across varying demographic features, and generate single-perspective claims; and (2) when presented with conflicting claims, LLMs struggle to reason over nuanced differences among demographic features, limiting their ability to adapt responses to specific user profiles.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wanyang Hong, Zhaoning Zhang, Yi Chen, Libo Zhang, Baihui Liu et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have achieved remarkable performance on single-turn tasks, yet their effectiveness deteriorates in multi-turn conversations. We define this phenomenon as cumulative contextual decay - a progressive degradation of contextual integrity caused by attention pollution, dilution, and drift. To address this challenge, we propose Rhea (Role-aware Heuristic Episodic Attention), a novel framework that decouples conversation history into two functionally independent memory modules: (1) an Instructional Memory (IM) that persistently stores high-fidelity global constraints via a structural priority mechanism, and (2) an Episodic Memory (EM) that dynamically manages user-model interactions via asymmetric noise control and heuristic context retrieval. During inference, Rhea constructs a high signal-to-noise context by applying its priority attention: selectively integrating relevant episodic information while always prioritizing global instructions. To validate this approach, experiments on multiple multi-turn conversation benchmarks - including MT-Eval and Long-MT-Bench+ - show that Rhea mitigates performance decay and improves overall accuracy by 1.04 points on a 10-point scale (a 16% relative gain over strong baselines). Moreover, Rhea maintains near-perfect instruction fidelity (IAR > 8.1) across long-horizon interactions. These results demonstrate that Rhea provides a principled and effective framework for building more precise, instruction-consistent conversational LLMs.
๐
2025-12-07
๐ NeurIPS 2025
๐ฅ Yulin Li, Haokun Gui, Ziyang Fan, Junjie Wang, Bin Kang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Video Large Language Models (VLLMs) have achieved remarkable video understanding capabilities, yet face critical efficiency bottlenecks due to quadratic computational growth with lengthy visual token sequences of long videos. While existing keyframe sampling methods can improve temporal modeling efficiency, additional computational cost is introduced before feature encoding, and the binary frame selection paradigm is found suboptimal. Therefore, in this work, we propose Dynamic Token compression via LLM-guided Keyframe prior (DyToK), a training-free paradigm that enables dynamic token compression by harnessing VLLMs' inherent attention mechanisms. Our analysis reveals that VLLM attention layers naturally encoding query-conditioned keyframe priors, by which DyToK dynamically adjusts per-frame token retention ratios, prioritizing semantically rich frames while suppressing redundancies. Extensive experiments demonstrate that DyToK achieves state-of-the-art efficiency-accuracy tradeoffs. DyToK shows plug-and-play compatibility with existing compression methods, such as VisionZip and FastV, attaining 4.3x faster inference while preserving accuracy across multiple VLLMs, such as LLaVA-OneVision and Qwen2.5-VL. Code is available at https://github.com/yu-lin-li/DyToK .
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sepyan Purnama Kristanto, Lutfi Hakim, Hermansyah
ๆฅ็ๆ่ฆ
Evidence from many low and middle income regions shows that microbial contamination in small scale drinking water systems often fluctuates rapidly, yet existing monitoring tools capture only fragments of this behaviour. Microscopic imaging provides organism level visibility, whereas physicochemical sensors reveal shortterm changes in water chemistry; in practice, operators must interpret these streams separately, making realtime decision-making unreliable. This study introduces AquaFusionNet, a lightweight cross-modal framework that unifies both information sources inside a single edge deployable model. Unlike prior work that treats microscopic detection and water quality prediction as independent tasks, AquaFusionNet learns the statistical dependencies between microbial appearance and concurrent sensor dynamics through a gated crossattention mechanism designed specifically for lowpower hardware. The framework is trained on AquaMicro12K, a new dataset comprising 12,846 annotated 1000 micrographs curated for drinking water contexts, an area where publicly accessible microscopic datasets are scarce. Deployed for six months across seven facilities in East Java, Indonesia, the system processed 1.84 million frames and consistently detected contamination events with 94.8% mAP@0.5 and 96.3% anomaly prediction accuracy, while operating at 4.8 W on a Jetson Nano. Comparative experiments against representative lightweight detectors show that AquaFusionNet provides higher accuracy at comparable or lower power, and field results indicate that cross-modal coupling reduces common failure modes of unimodal detectors, particularly under fouling, turbidity spikes, and inconsistent illumination. All models, data, and hardware designs are released openly to facilitate replication and adaptation in decentralized water safety infrastructures.
๐
2025-12-07
๐ Accepted at Transactions of the Association for Co
๐ฅ Dibyanayan Bandyopadhyay, Soham Bhattacharjee, Mohammed Hasanuzzaman, Asif Ekbal
Multimodal
ๆฅ็ๆ่ฆ
Multimodal classifiers function as opaque black box models. While several techniques exist to interpret their predictions, very few of them are as intuitive and accessible as natural language explanations (NLEs). To build trust, such explanations must faithfully capture the classifier's internal decision making behavior, a property known as faithfulness. In this paper, we propose CAuSE (Causal Abstraction under Simulated Explanations), a novel framework to generate faithful NLEs for any pretrained multimodal classifier. We demonstrate that CAuSE generalizes across datasets and models through extensive empirical evaluations. Theoretically, we show that CAuSE, trained via interchange intervention, forms a causal abstraction of the underlying classifier. We further validate this through a redesigned metric for measuring causal faithfulness in multimodal settings. CAuSE surpasses other methods on this metric, with qualitative analysis reinforcing its advantages. We perform detailed error analysis to pinpoint the failure cases of CAuSE. For replicability, we make the codes available at https://github.com/newcodevelop/CAuSE
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tiago Rodrigues, Carla Teixeira Lopes
Natural Language Processing
ๆฅ็ๆ่ฆ
Discharge Summaries are documents written by medical professionals that detail a patient's visit to a care facility. They contain a wealth of information crucial for patient care, and automating their generation could significantly reduce the effort required from healthcare professionals, minimize errors, and ensure that critical patient information is easily accessible and actionable. In this work, we explore the use of five Large Language Models on this task, from open-source models (Mistral, Llama 2) to proprietary systems (GPT-3, GPT-4, Gemini 1.5 Pro), leveraging MIMIC-III summaries and notes. We evaluate them using exact-match, soft-overlap, and reference-free metrics. Our results show that proprietary models, particularly Gemini with one-shot prompting, outperformed others, producing summaries with the highest similarity to the gold-standard ones. Open-source models, while promising, especially Mistral after fine-tuning, lagged in performance, often struggling with hallucinations and repeated information. Human evaluation by a clinical expert confirmed the practical utility of the summaries generated by proprietary models. Despite the challenges, such as hallucinations and missing information, the findings suggest that LLMs, especially proprietary models, are promising candidates for automatic discharge summary generation as long as data privacy is ensured.
๐
2025-12-07
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yueqian Wang, Songxiang Liu, Disong Wang, Nuo Xu, Guanglu Wan et al.
RoboticsMachine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in video multimodal large language models (Video MLLMs) have significantly enhanced video understanding and multi-modal interaction capabilities. While most existing systems operate in a turn-based manner where the model can only reply after user turns, proactively deciding when to reply during video playback presents a promising yet challenging direction for real-time applications. In this work, we propose a novel text-to-text approach to proactive interaction, where the model autonomously determines whether to respond or remain silent at each turn based on dialogue history and visual context up to current frame of an streaming video. To overcome difficulties in previous methods such as manually tuning response decision thresholds and annotating precise reply times, we introduce a multi-turn RL based training method that encourages timely and accurate responses without requiring precise response time annotations. We train our model MMDuet2 on a dataset of 52k videos with two types of dialogues via SFT and RL. Experimental results demonstrate that MMDuet2 outperforms existing proactive Video MLLM baselines in response timing and quality, achieving state-of-the-art performance on the ProactiveVideoQA benchmark.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ofek Glick, Vladimir Tchuiev, Marah Ghoummaid, Michal Moshkovitz, Dotan Di-Castro
Natural Language Processing
ๆฅ็ๆ่ฆ
While Large Language Models (LLMs) are increasingly used for synthesizing textual PLC programming languages like Structured Text (ST) code, other IEC 61131-3 standard graphical languages like Sequential Function Charts (SFCs) remain underexplored. Generating SFCs is challenging due to graphical nature and ST actions embedded within, which are not directly compatible with standard generation techniques, often leading to non-executable code that is incompatible with industrial tool-chains In this work, we introduce LLM4SFC, the first framework to receive natural-language descriptions of industrial workflows and provide executable SFCs. LLM4SFC is based on three components: (i) A reduced structured representation that captures essential topology and in-line ST and reduced textual verbosity; (ii) Fine-tuning and few-shot retrieval-augmented generation (RAG) for alignment with SFC programming conventions; and (iii) A structured generation approach that prunes illegal tokens in real-time to ensure compliance with the textual format of SFCs. We evaluate LLM4SFC on a dataset of real-world SFCs from automated manufacturing projects, using both open-source and proprietary LLMs. The results show that LLM4SFC reliably generates syntactically valid SFC programs effectively bridging graphical and textual PLC languages, achieving a generation generation success of 75% - 94%, paving the way for automated industrial programming.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuchuan Tian, Yuchen Liang, Jiacheng Sun, Shuo Zhang, Guangwen Yang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at generation but dominant autoregressive (AR) decoding is inherently sequential, creating a throughput bottleneck. Diffusion Language Models (DLMs)--especially block-wise variants--enable parallel generation and intra-block bidirectional reasoning, yet training large DLMs from scratch is costly and wastes the knowledge in mature AR checkpoints. Prior "adaptation" attempts either modify logits or randomly grow attention masks to full-sequence diffusion, or simply transplant AR weights into a block-diffusion recipe, leaving a fundamental mismatch between AR causality and block-wise bidirectionality unaddressed. We reframe adaptation as a intra-paradigm path from AR to Block-Diffusion by viewing AR as Block-Diffusion with blocksize=1. Concretely, we design the pathway of adaptation as follows: we use a context-causal attention mask (causal in context, bidirectional only within the active block), an efficient parallel adaptation procedure, an auxiliary AR loss to maximize data utilization and retain pretrained knowledge, and gradual increment of the generation block size. The recipe integrates cleanly with masked block-diffusion and maintains train-inference consistency. Built on these components, NBDiff-7B (Base and Instruct) could inherit the long-context modeling and reasoning capabilities, and achieve state-of-the-art performance among the 7B-class DLMs, delivering strong gains on general-knowledge, math, and code benchmarks over strong baselines. These results demonstrate that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. Codes: https://github.com/YuchuanTian/NBDiff.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Seungyeon Jwa, Daechul Ahn, Reokyoung Kim, Dongyeop Kang, Jonghyun Choi
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic evaluation with large language models, commonly known as LLM-as-a-judge, is now standard across reasoning and alignment tasks. Despite evaluating many samples in deployment, these evaluators typically (i) treat each case independently, missing the opportunity to accumulate experience, and (ii) rely on a single fixed prompt for all cases, neglecting the need for sample-specific evaluation criteria. We introduce Learning While Evaluating (LWE), a framework that allows evaluators to improve sequentially at inference time without requiring training or validation sets. LWE maintains an evolving meta-prompt that (i) produces sample-specific evaluation instructions and (ii) refines itself through self-generated feedback. Furthermore, we propose Selective LWE, which updates the meta-prompt only on self-inconsistent cases, focusing computation where it matters most. This selective approach retains the benefits of sequential learning while being far more cost-effective. Across two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, empirically demonstrating that evaluators can improve during sequential testing with a simple selective update, learning most from the cases they struggle with.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rajeev Ranjan
ๆฅ็ๆ่ฆ
Text embedding models are designed for sentence-level applications like retrieval and semantic similarity, and are primarily evaluated on sentence-level benchmarks. Their behavior on isolated words is less understood. We show that simply prepending semantic prompts to words before embedding substantially improves word similarity correlations. Testing 7 text embedding models, including text-embedding-3-large (OpenAI), embed-english-v3.0 (Cohere), voyage-3(Voyage AI), all-mpnet-base-v2, and Qwen3-Embedding-8B, on 3 standard benchmarks (SimLex-999, WordSim-353, MEN-3000), we find that prompts like "meaning: {word}" or "Represent the semantic concept: {word}" improve Spearman correlations by up to +0.29 on SimLex-999. Some models fail completely on bare words (correlation = 0) but recover with prompts (+0.73 improvement). Our best results achieve correlation = 0.692 on SimLex-999 with embed-english-v3.0 (Cohere), correlation = 0.811 on WordSim-353, and correlation = 0.855 on MEN-3000 with text-embedding-3-large (OpenAI). These results outperform classic static embeddings like Word2Vec (correlation = 0.40) and even the best static method LexVec (correlation = 0.48) on SimLex-999, establishing a new state-of-the-art for pure embedding methods. This zero-shot technique requires no training and works with any text embedding model.
๐
2025-12-07
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikhil Verma, Joonas Linnosmaa, Espinosa-Leal Leonardo, Napat Vajragupta
Robotics
ๆฅ็ๆ่ฆ
The paper presents the formulation, implementation, and evaluation of the ArcGD optimiser. The evaluation is conducted initially on a non-convex benchmark function and subsequently on a real-world ML dataset. The initial comparative study using the Adam optimiser is conducted on a stochastic variant of the highly non-convex and notoriously challenging Rosenbrock function, renowned for its narrow, curved valley, across dimensions ranging from 2D to 1000D and an extreme case of 50,000D. Two configurations were evaluated to eliminate learning-rate bias: (i) both using ArcGD's effective learning rate and (ii) both using Adam's default learning rate. ArcGD consistently outperformed Adam under the first setting and, although slower under the second, achieved super ior final solutions in most cases. In the second evaluation, ArcGD is evaluated against state-of-the-art optimizers (Adam, AdamW, Lion, SGD) on the CIFAR-10 image classification dataset across 8 diverse MLP architectures ranging from 1 to 5 hidden layers. ArcGD achieved the highest average test accuracy (50.7%) at 20,000 iterations, outperforming AdamW (46.6%), Adam (46.8%), SGD (49.6%), and Lion (43.4%), winning or tying on 6 of 8 architectures. Notably, while Adam and AdamW showed strong early convergence at 5,000 iterations, but regressed with extended training, whereas ArcGD continued improving, demonstrating generalization and resistance to overfitting without requiring early stopping tuning. Strong performance on geometric stress tests and standard deep-learning benchmarks indicates broad applicability, highlighting the need for further exploration. Moreover, it is also shown that a variant of ArcGD can be interpreted as a special case of the Lion optimiser, highlighting connections between the inherent mechanisms of such optimisation methods.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Subrit Dikshit, Ritu Tiwari, Priyank Jain
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Transfer Learning (TL) has accelerated the rapid development and availability of large language models (LLMs) for mainstream natural language processing (NLP) use cases. However, training and deploying such gigantic LLMs in resource-constrained, real-world healthcare situations remains challenging. This study addresses the limited support available to visually impaired users and speakers of low-resource languages such as Hindi who require medical assistance in rural environments. We propose PDFTEMRA (Performant Distilled Frequency Transformer Ensemble Model with Random Activations), a compact transformer-based architecture that integrates model distillation, frequency-domain modulation, ensemble learning, and randomized activation patterns to reduce computational cost while preserving language understanding performance. The model is trained and evaluated on medical question-answering and consultation datasets tailored to Hindi and accessibility scenarios, and its performance is compared against standard NLP state-of-the-art model baselines. Results demonstrate that PDFTEMRA achieves comparable performance with substantially lower computational requirements, indicating its suitability for accessible, inclusive, low-resource medical NLP applications.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aarushi Wagh, Saniya Srivastava
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing benchmarks evaluating biases in large language models (LLMs) primarily rely on explicit cues, declaring protected attributes like religion, race, gender by name. However, real-world interactions often contain implicit biases, inferred subtly through names, cultural cues, or traits. This critical oversight creates a significant blind spot in fairness evaluation. We introduce ImplicitBBQ, a benchmark extending the Bias Benchmark for QA (BBQ) with implicitly cued protected attributes across 6 categories. Our evaluation of GPT-4o on ImplicitBBQ illustrates troubling performance disparity from explicit BBQ prompts, with accuracy declining up to 7% in the "sexual orientation" subcategory and consistent decline located across most other categories. This indicates that current LLMs contain implicit biases undetected by explicit benchmarks. ImplicitBBQ offers a crucial tool for nuanced fairness evaluation in NLP.
๐
2025-12-07
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuo Li, Jiajun Sun, Zhihao Zhang, Xiaoran Fan, Senjie Jin et al.
RoboticsMachine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in fine-tuning multimodal large language models (MLLMs) using reinforcement learning have achieved remarkable progress, particularly with the introduction of various entropy control techniques. However, the role and characteristics of entropy in perception-oriented tasks like visual grounding, as well as effective strategies for controlling it, remain largely unexplored. To address this issue, we focus on the visual grounding task and analyze the role and characteristics of entropy in comparison to reasoning tasks. Building on these findings, we introduce ECVGPO (Entropy Control Visual Grounding Policy Optimization), an interpretable algorithm designed for effective entropy regulation. Through entropy control, the trade-off between exploration and exploitation is better balanced. Experiments show that ECVGPO achieves broad improvements across various benchmarks and models.
๐
2025-12-07
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bufang Yang, Lilin Xu, Liekang Zeng, Yunqi Guo, Siyang Jiang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model (LLM) agents are emerging to transform daily life. However, existing LLM agents primarily follow a reactive paradigm, relying on explicit user instructions to initiate services, which increases both physical and cognitive workload. In this paper, we propose ProAgent, the first end-to-end proactive agent system that harnesses massive sensory contexts and LLM reasoning to deliver proactive assistance. ProAgent first employs a proactive-oriented context extraction approach with on-demand tiered perception to continuously sense the environment and derive hierarchical contexts that incorporate both sensory and persona cues. ProAgent then adopts a context-aware proactive reasoner to map these contexts to user needs and tool calls, providing proactive assistance. We implement ProAgent on Augmented Reality (AR) glasses with an edge server and extensively evaluate it on a real-world testbed, a public dataset, and through a user study. Results show that ProAgent achieves up to 33.4% higher proactive prediction accuracy, 16.8% higher tool-calling F1 score, and notable improvements in user satisfaction over state-of-the-art baselines, marking a significant step toward proactive assistants. A video demonstration of ProAgent is available at https://youtu.be/pRXZuzvrcVs.
๐
2025-12-07
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhibo Liang, Tianze Hu, Zaiye Chen, Mingjie Tang
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Autonomous Large Language Model (LLM) agents exhibit significant vulnerability to Indirect Prompt Injection (IPI) attacks. These attacks hijack agent behavior by polluting external information sources, exploiting fundamental trade-offs between security and functionality in existing defense mechanisms. This leads to malicious and unauthorized tool invocations, diverting agents from their original objectives. The success of complex IPIs reveals a deeper systemic fragility: while current defenses demonstrate some effectiveness, most defense architectures are inherently fragmented. Consequently, they fail to provide full integrity assurance across the entire task execution pipeline, forcing unacceptable multi-dimensional compromises among security, functionality, and efficiency. Our method is predicated on a core insight: no matter how subtle an IPI attack, its pursuit of a malicious objective will ultimately manifest as a detectable deviation in the action trajectory, distinct from the expected legitimate plan. Based on this, we propose the Cognitive Control Architecture (CCA), a holistic framework achieving full-lifecycle cognitive supervision. CCA constructs an efficient, dual-layered defense system through two synergistic pillars: (i) proactive and preemptive control-flow and data-flow integrity enforcement via a pre-generated "Intent Graph"; and (ii) an innovative "Tiered Adjudicator" that, upon deviation detection, initiates deep reasoning based on multi-dimensional scoring, specifically designed to counter complex conditional attacks. Experiments on the AgentDojo benchmark substantiate that CCA not only effectively withstands sophisticated attacks that challenge other advanced defense methods but also achieves uncompromised security with notable efficiency and robustness, thereby reconciling the aforementioned multi-dimensional trade-off.
๐
2025-12-07
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Donghang Duan, Xu Zheng
ๆฅ็ๆ่ฆ
Current LLM-based text anonymization frameworks usually rely on remote API services from powerful LLMs, which creates an inherent "privacy paradox": users must somehow disclose data to untrusted third parties for superior privacy preservation. Moreover, directly migrating these frameworks to local small-scale models (LSMs) offers a suboptimal solution with catastrophic collapse in utility based on our core findings. Our work argues that this failure stems not merely from the capability deficits of LSMs, but from the inherent irrationality of the greedy adversarial strategies employed by current state-of-the-art (SoTA) methods. We model the anonymization process as a trade-off between Marginal Privacy Gain (MPG) and Marginal Utility Cost (MUC), and demonstrate that greedy strategies inevitably drift into an irrational state. To address this, we propose Rational Localized Adversarial Anonymization (RLAA), a fully localized and training-free framework featuring an Attacker-Arbitrator-Anonymizer (A-A-A) architecture. RLAA introduces an arbitrator that acts as a rationality gatekeeper, validating the attacker's inference to filter out feedback providing negligible benefits on privacy preservation. This mechanism enforces a rational early-stopping criterion, and systematically prevents utility collapse. Extensive experiments on different datasets demonstrate that RLAA achieves the best privacy-utility trade-off, and in some cases even outperforms SoTA on the Pareto principle. Our code and datasets will be released upon acceptance.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yulin Huang, Yaxuan Luan, Jinxu Guo, Xiangchen Song, Yuchen Liu
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
This study addresses the issues of privacy protection and efficiency in instruction fine-tuning of large-scale language models by proposing a parameter-efficient method that integrates differential privacy noise allocation with gradient clipping in a collaborative optimization framework. The method keeps the backbone model frozen and updates parameters through a low-dimensional projection subspace, while introducing clipping and adaptive noise allocation during gradient computation. This design reduces privacy budget consumption and ensures training stability and robustness. The unified framework combines gradient constraints, noise allocation, and parameter projection, effectively mitigating performance fluctuations and privacy risks in multi-task instruction scenarios. Experiments are conducted across hyperparameter, environment, and data sensitivity dimensions. Results show that the method outperforms baseline models in accuracy, privacy budget, and parameter efficiency, and maintains stable performance under diverse and uncertain data conditions. The findings enrich the theoretical integration of differential privacy and parameter-efficient fine-tuning and demonstrate its practical adaptability in instruction tasks, providing a feasible solution for secure training in complex instruction environments.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aoi Fujita, Taichi Yamamoto, Yuri Nakayama, Ryota Kobayashi
Natural Language Processing
ๆฅ็ๆ่ฆ
Rapid expansion of social media platforms such as X (formerly Twitter), Facebook, and Reddit has enabled large-scale analysis of public perceptions on diverse topics, including social issues, politics, natural disasters, and consumer sentiment. Topic modeling is a widely used approach for uncovering latent themes in text data, typically framed as an unsupervised classification task. However, traditional models, originally designed for longer and more formal documents, struggle with short social media posts due to limited co-occurrence statistics, fragmented semantics, inconsistent spelling, and informal language. To address these challenges, we propose a new method, TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction. Specifically, each text is embedded using Sentence-BERT (SBERT) and provisionally clustered using Gaussian Mixture Models (GMM). The clusters are then refined iteratively using a supervised projection based on linear discriminant analysis, followed by GMM-based clustering until convergence. Notably, our method operates directly on raw text, eliminating the need for preprocessing steps such as stop word removal. We evaluate our approach on four diverse datasets, 20News, AgNewsTitle, Reddit, and TweetTopic, each containing human-labeled topic information. Compared with seven baseline methods, including a recent SBERT-based method and a zero-shot generative AI method, our approach achieves the highest similarity to human-annotated topics, with significant improvements for both social media posts and online news articles. Additionally, qualitative analysis shows that our method produces more interpretable topics, highlighting its potential for applications in social media data and web content analytics.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Preference alignment has enabled large language models (LLMs) to better reflect human expectations, but current methods mostly optimize for population-level preferences, overlooking individual users. Personalization is essential, yet early approaches-such as prompt customization or fine-tuning-struggle to reason over implicit preferences, limiting real-world effectiveness. Recent "think-then-generate" methods address this by reasoning before response generation. However, they face challenges in long-form generation: their static one-shot reasoning must capture all relevant information for the full response generation, making learning difficult and limiting adaptability to evolving content. To address this issue, we propose FlyThinker, an efficient "think-while-generating" framework for personalized long-form generation. FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. This design enables reasoning and generation to run concurrently, ensuring inference efficiency. In addition, the reasoning model is designed to depend only on previous responses rather than its own prior outputs, which preserves training parallelism across different positions-allowing all reasoning tokens for training data to be produced in a single forward pass like standard LLM training, ensuring training efficiency. Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while keeping training and inference efficiency.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalization is one of the next milestones in advancing AI capability and alignment. We introduce PersonaMem-v2, the state-of-the-art dataset for LLM personalization that simulates 1,000 realistic user-chatbot interactions on 300+ scenarios, 20,000+ user preferences, and 128k-token context windows, where most user preferences are implicitly revealed to reflect real-world interactions. Using this data, we investigate how reinforcement fine-tuning enables a model to improve its long-context reasoning capabilities for user understanding and personalization. We also develop a framework for training an agentic memory system, which maintains a single, human-readable memory that grows with each user over time.
In our experiments, frontier LLMs still struggle with implicit personalization, achieving only 37-48% accuracy. While they support long context windows, reasoning remains the bottleneck for implicit personalization tasks. Using reinforcement fine-tuning, we successfully train Qwen3-4B to outperforms GPT-5, reaching 53% accuracy in implicit personalization. Moreover, our agentic memory framework achieves state-of-the-art 55% accuracy while using 16x fewer input tokens, relying on a 2k-token memory instead of full 32k conversation histories. These results underscore the impact of our dataset and demonstrate agentic memory as a scalable path toward real-world personalized intelligence.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amartya Hatua
Natural Language Processing
ๆฅ็ๆ่ฆ
We present a mechanistic interpretability study of GPT-2 that causally examines how sentiment information is processed across its transformer layers. Using systematic activation patching across all 12 layers, we test the hypothesized two-stage sentiment architecture comprising early lexical detection and mid-layer contextual integration. Our experiments confirm that early layers (0-3) act as lexical sentiment detectors, encoding stable, position specific polarity signals that are largely independent of context. However, all three contextual integration hypotheses: Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing are falsified. Instead of mid-layer specialization, we find that contextual phenomena such as negation, sarcasm, domain shifts etc. are integrated primarily in late layers (8-11) through a unified, non-modular mechanism. These experimental findings provide causal evidence that GPT-2's sentiment computation differs from the predicted hierarchical pattern, highlighting the need for further empirical characterization of contextual integration in large language models.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Smitha Muthya Sudheendra, Mani Deep Cherukuri, Jaideep Srivastava
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Natural language understanding inherently depends on integrating multiple complementary perspectives spanning from surface syntax to deep semantics and world knowledge. However, current Aspect-Based Sentiment Analysis (ABSA) systems typically exploit isolated linguistic views, thereby overlooking the intricate interplay between structural representations that humans naturally leverage. We propose CMV-Fuse, a Cross-Modal View fusion framework that emulates human language processing by systematically combining multiple linguistic perspectives. Our approach systematically orchestrates four linguistic perspectives: Abstract Meaning Representations, constituency parsing, dependency syntax, and semantic attention, enhanced with external knowledge integration. Through hierarchical gated attention fusion across local syntactic, intermediate semantic, and global knowledge levels, CMV-Fuse captures both fine-grained structural patterns and broad contextual understanding. A novel structure aware multi-view contrastive learning mechanism ensures consistency across complementary representations while maintaining computational efficiency. Extensive experiments demonstrate substantial improvements over strong baselines on standard benchmarks, with analysis revealing how each linguistic view contributes to more robust sentiment analysis.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kwabena Yamoah, Cass Dykeman
ๆฅ็ๆ่ฆ
VR in the treatment of clinical concerns such as generalized anxiety disorder or social anxiety. VR has created additional pathways to support patient well-being and care. Understanding online discussion of what users think about this technology may further support its efficacy. The purpose of this study was to employ a corpus linguistic methodology to identify the words and word networks that shed light on the online discussion of virtual reality and anxiety. Using corpus linguistics, frequently used words in discussion along with collocation were identified by utilizing Sketch Engine software. The results of the study, based upon the English Trends corpus, identified VR, Oculus, and headset as the most frequently discussed within the VR and anxiety subcorpus. These results point to the development of the virtual system, along with the physical apparatus that makes viewing and engaging with the virtual environment possible. Additional results point to collocation of prepositional phrases such as of virtual reality, in virtual reality, and for virtual reality relating to the design, experience, and development, respectively. These findings offer new perspective on how VR and anxiety together are discussed in general discourse and offer pathways for future opportunities to support counseling needs through development and accessibility. Keywords: anxiety disorders, corpus linguistics, Sketch Engine, and virtual reality VR
๐
2025-12-07
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yihan Chen, Benfeng Xu, Xiaorui Wang, Zhendong Mao
ๆฅ็ๆ่ฆ
As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management -- under large token budgets and low signal density -- emerges as a foundational, high-importance, and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction to reframe the extraction process from slow, token-by-token generation into a highly efficient, discriminative task of index prediction, achieving both effectiveness and efficiency. We partition HTML into structure-aware, addressable segments, and extract only the positional indices of content relevant to a given query. This method decouples extraction latency from content length, enabling rapid, query-relevant extraction. We first evaluate our method as a post-retrieval processing component within an RAG QA system and find that it improves QA accuracy. Then we directly measure its match rate with the target content in two scenarios: main content extraction (ME) and query-relevant extraction (QE). Experimental results show that our method outperforms existing works in both accuracy and speed, effectively bridging the gap between LLMs and the vast webpages.
๐
2025-12-07
๐ ArXiv Preprint (cs.SE)
๐ฅ Genevieve Caumartin, Glaucia Melo
Natural Language Processing
ๆฅ็ๆ่ฆ
Bug localization remains a critical yet time-consuming challenge in large-scale software repositories. Traditional information retrieval-based bug localization (IRBL) methods rely on unchanged bug descriptions, which often contain noisy information, leading to poor retrieval accuracy. Recent advances in large language models (LLMs) have improved bug localization through query reformulation, yet the effect on agent performance remains unexplored. In this study, we investigate how an LLM-powered agent can improve file-level bug localization via lightweight query reformulation and summarization. We first employ an open-source, non-fine-tuned LLM to extract key information from bug reports, such as identifiers and code snippets, and reformulate queries pre-retrieval. Our agent then orchestrates BM25 retrieval using these preprocessed queries, automating localization workflow at scale. Using the best-performing query reformulation technique, our agent achieves 35% better ranking in first-file retrieval than our BM25 baseline and up to +22% file retrieval performance over SWE-agent.
๐
2025-12-07
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mayank Ravishankara
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) systems have significantly reduced hallucinations in Large Language Models (LLMs) by grounding responses in external context. However, standard RAG architectures suffer from a critical vulnerability: Retrieval Sycophancy. When presented with a query based on a false premise or a common misconception, vector-based retrievers tend to fetch documents that align with the user's bias rather than objective truth, leading the model to "hallucinate with citations."
In this work, we introduce Falsification-Verification Alignment RAG (FVA-RAG), a framework that shifts the retrieval paradigm from Inductive Verification (seeking support) to Deductive Falsification (seeking disproof). Unlike existing "Self-Correction" methods that rely on internal consistency, FVA-RAG deploys a distinct Adversarial Retrieval Policy that actively generates "Kill Queries"-targeted search terms designed to surface contradictory evidence. We introduce a dual-verification mechanism that explicitly weighs the draft answer against this "Anti-Context." Preliminary experiments on a dataset of common misconceptions demonstrate that FVA-RAG significantly improves robustness against sycophantic hallucinations compared to standard RAG baselines, effectively acting as an inference-time "Red Team" for factual generation.
๐
2025-12-07
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Abderaouf Bahi, Ibtissem Gasmi
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This paper examines the deployment of seven different neural network architectures CNN, RNN, GNN, Autoencoder, Transformer, NCF, and Siamese Networks on three distinct datasets: Retail E-commerce, Amazon Products, and Netflix Prize. It evaluates their effectiveness through metrics such as accuracy, recall, F1-score, and diversity in recommendations. The results demonstrate that GNNs are particularly adept at managing complex item relationships in e-commerce environments, whereas RNNs are effective in capturing the temporal dynamics that are essential for platforms such as Netflix.. Siamese Networks are emphasized for their contribution to the diversification of recommendations, particularly in retail settings. Despite their benefits, issues like computational demands, reliance on extensive data, and the challenge of balancing accurate and diverse recommendations are addressed. The study seeks to inform the advancement of recommendation systems by suggesting hybrid methods that merge the strengths of various models to better satisfy user preferences and accommodate the evolving demands of contemporary digital platforms.
๐
2025-12-07
๐ ArXiv Preprint (cs.DB)
๐ฅ Skylar Homan, Anoop Krishnadas, Kira Adaricheva
ๆฅ็ๆ่ฆ
We present a new implementation of the $D$-basis algorithm called the Small Space which considerably reduces the algorithm's memory usage for data analysis applications. The previous implementation delivers the complete set of implications that hold on the set of attributes of an input binary table. In the new version, the only output is the frequencies of attributes that appear in the antecedents of implications from the $D$-basis, with a fixed consequent attribute. Such frequencies, rather than the implications themselves, became the primary focus in analysis of datasets where the $D$-basis has been applied over the last decade. The $D$-basis employs a hypergraph dualization algorithm, and a dualization implementation known as Reverse Search allows for the gradual computation of frequencies without the need for storing all discovered implications. We demonstrate the effectiveness of the Small Space implementation by comparing the runtimes and maximum memory usage of this new version with the current implementation.
๐
2025-12-07
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zhongtao Rao, Peilin Zhou, Dading Chong, Zhiwei Chen, Shoujin Wang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal recommendation enhances accuracy by leveraging visual and textual signals, and its success largely depends on learning high-quality cross-modal representations. Recent advances in Large Vision-Language Models (LVLMs) offer unified multimodal representation learning, making them a promising backbone. However, applying LVLMs to recommendation remains challenging due to (i) representation misalignment, where domain gaps between item data and general pre-training lead to unaligned embedding spaces, and (ii) gradient conflicts during fine-tuning, where shared adapters cause interference and a lack of discriminative power. To address this, we propose SDA, a lightweight framework for Structural and Disentangled Adaptation, which integrates two components: Cross-Modal Structural Alignment (CMSA) and Modality-Disentangled Adaptation. CMSA aligns embeddings using intra-modal structures as a soft teacher, while MoDA mitigates gradient conflicts via expertized, gated low-rank paths to disentangle gradient flows. Experiments on three public Amazon datasets show SDA integrates seamlessly with existing multimodal and sequential recommenders, yielding average gains of 6.15% in Hit@10 and 8.64% in NDCG@10. It also achieves up to 12.83% and 18.70% gains on long-tail items with minimal inference overhead. Our code and full experimental results are available at https://github.com/RaoZhongtao/SDA.
๐
2025-12-07
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Li Ju, Jun Zhao, Mingxu Chai, Ziyu Shen, Xiangyang Wang et al.
ๆฅ็ๆ่ฆ
Researchers struggle to efficiently locate and manage relevant literature within the exponentially growing body of scientific publications. We present \textsc{WisPaper}, an intelligent academic retrieval and literature management platform that addresses this challenge through three integrated capabilities: (1) \textit{Scholar Search}, featuring both quick keyword-based and deep agentic search modes for efficient paper discovery; (2) \textit{Library}, a customizable knowledge base for systematic literature organization; and (3) \textit{AI Feeds}, an intelligent recommendation system that automatically delivers relevant new publications based on user interests. Unlike existing academic tools, \textsc{WisPaper} provides a closed-loop workflow that seamlessly connects literature discovery, management, and continuous tracking of research frontiers. Our multilingual and multidisciplinary system significantly reduces the time researchers from diverse backgrounds spend on paper screening and management, enabling them to focus on their core research activities. The platform is publicly accessible and serves researchers across academia and industry.
๐
2025-12-07
๐ WSDM 2026
๐ฅ Jiangxia Cao, Ruochen Yang, Xiang Chen, Changxin Lao, Yueyang Liu et al.
ๆฅ็ๆ่ฆ
Live-streaming, as an emerging media enabling real-time interaction between authors and users, has attracted significant attention. Unlike the stable playback time of traditional TV live or the fixed content of short video, live-streaming, due to the dynamics of content and time, poses higher requirements for the recommendation algorithm of the platform - understanding the ever-changing content in real time and push it to users at the appropriate moment. Through analysis, we find that users have a better experience and express more positive behaviors during highlight moments of the live-streaming. Furthermore, since the model lacks access to future content during recommendation, yet user engagement depends on how well subsequent content aligns with their interests, an intuitive solution is to predict future live-streaming content. Therefore, we perform semantic quantization on live-streaming segments to obtain Semantic ids (Sid), encode the historical Sid sequence to capture the author's characteristics, and model Sid evolution trend to enable foresight prediction of future content. This foresight enhances the ranking model through refined features. Extensive offline and online experiments demonstrate the effectiveness of our method.
๐
2025-12-07
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yihan Chen, Benfeng Xu, Xiaorui Wang, Zhendong Mao
ๆฅ็ๆ่ฆ
As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management -- under large token budgets and low signal density -- emerges as a foundational, high-importance, and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction to reframe the extraction process from slow, token-by-token generation into a highly efficient, discriminative task of index prediction, achieving both effectiveness and efficiency. We partition HTML into structure-aware, addressable segments, and extract only the positional indices of content relevant to a given query. This method decouples extraction latency from content length, enabling rapid, query-relevant extraction. We first evaluate our method as a post-retrieval processing component within an RAG QA system and find that it improves QA accuracy. Then we directly measure its match rate with the target content in two scenarios: main content extraction (ME) and query-relevant extraction (QE). Experimental results show that our method outperforms existing works in both accuracy and speed, effectively bridging the gap between LLMs and the vast webpages.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Hiroki Sawada, Alexandre Pitti, Mathias Quoy
Machine Learning
ๆฅ็ๆ่ฆ
Robots interacting with humans must not only generate learned movements in real-time, but also infer the intent behind observed behaviors and estimate the confidence of their own inferences. This paper proposes a unified model that achieves all three capabilities within a single hierarchical predictive-coding recurrent neural network (PC-RNN) equipped with a class embedding vector, CERNet, which leverages a dynamically updated class embedding vector to unify motor generation and recognition. The model operates in two modes: generation and inference. In the generation mode, the class embedding constrains the hidden state dynamics to a class-specific subspace; in the inference mode, it is optimized online to minimize prediction error, enabling real-time recognition. Validated on a humanoid robot across 26 kinesthetically taught alphabets, our hierarchical model achieves 76% lower trajectory reproduction error than a parameter-matched single-layer baseline, maintains motion fidelity under external perturbations, and infers the demonstrated trajectory class online with 68% Top-1 and 81% Top-2 accuracy. Furthermore, internal prediction errors naturally reflect the model's confidence in its recognition. This integration of robust generation, real-time recognition, and intrinsic uncertainty estimation within a compact PC-RNN framework offers a compact and extensible approach to motor memory in physical robots, with potential applications in intent-sensitive human-robot collaboration.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Runcong Wang, Fengyi Wang, Gordon Cheng
Robotics
ๆฅ็ๆ่ฆ
This paper presents a hetero-associative sequential memory system for mobile manipulators that learns compact, neuromorphic bindings between robot joint states and tactile observations to produce step-wise action decisions with low compute and memory cost. The method encodes joint angles via population place coding and converts skin-measured forces into spike-rate features using an Izhikevich neuron model; both signals are transformed into bipolar binary vectors and bound element-wise to create associations stored in a large-capacity sequential memory. To improve separability in binary space and inject geometry from touch, we introduce 3D rotary positional embeddings that rotate subspaces as a function of sensed force direction, enabling fuzzy retrieval through a softmax weighted recall over temporally shifted action patterns. On a Toyota Human Support Robot covered by robot skin, the hetero-associative sequential memory system realizes a pseudocompliance controller that moves the link under touch in the direction and with speed correlating to the amplitude of applied force, and it retrieves multi-joint grasp sequences by continuing tactile input. The system sets up quickly, trains from synchronized streams of states and observations, and exhibits a degree of generalization while remaining economical. Results demonstrate single-joint and full-arm behaviors executed via associative recall, and suggest extensions to imitation learning, motion planning, and multi-modal integration.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Maryam Seraj, Mohammad Hossein Kamrava, Carlo Tiseo
Robotics
ๆฅ็ๆ่ฆ
Haptic interfaces play a critical role in medical teleoperation by enabling surgeons to interact with remote environments through realistic force and motion feedback. Achieving high fidelity in such systems requires balancing performance trade-off among workspace, dexterity, stiffness, inertia, and bandwidth, particularly in applications demanding pure rotational motion. This paper presents the design methodology and kinematic analysis of a Cable-Driven Coaxial Spherical Parallel Mechanism (CDC-SPM) developed to address these challenges. The proposed cable-driven interface design allows for reducing the mass placed at the robot arm end-effector, thereby minimizing inertial loads, enhancing stiffness, and improving dynamic responsiveness. Through parallel and coaxial actuation, the mechanism achieves decoupled rotational degrees of freedom with isotropic force and torque transmission. Simulation and analysis demonstrate that the CDC-SPM provides accurate, responsive, and safe motion characteristics suitable for high-precision haptic applications. These results highlight the mechanism's potential for medical teleoperation tasks such as ultrasound imaging, where precise and intuitive manipulation is essential.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu et al.
RoboticsMachine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy - forecasting both actions and their visual consequences - explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.
๐
2025-12-07
๐ NeurIPS
๐ฅ Ilia Larchenko, Gleb Zarin, Akash Karnatak
RoboticsMultimodal
ๆฅ็ๆ่ฆ
We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge - a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation, requiring bimanual manipulation, navigation, and context-aware decision making.
Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules.
Our approach achieves 26% q-score across all 50 tasks on both public and private leaderboards.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Nicolรฒ Botteghi, Owen Brook, Urban Fasel, Federico Califano
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Interconnection and Damping Assignment Passivity-Based Control (IDA-PBC) is a nonlinear control technique that assigns a port-Hamiltonian (pH) structure to a controlled system using a state-feedback law. While IDA-PBC has been extensively studied and applied to many systems, its practical implementation often remains confined to academic examples and, almost exclusively, to stabilization tasks. The main limitation of IDA-PBC stems from the complexity of analytically solving a set of partial differential equations (PDEs), referred to as the matching conditions, which enforce the pH structure of the closed-loop system. However, this is extremely challenging, especially for complex physical systems and tasks.
In this work, we propose a novel numerical approach for designing IDA-PBC controllers without solving the matching PDEs exactly. We cast the IDA-PBC problem as the learning of a neural ordinary differential equation. In particular, we rely on sparse dictionary learning to parametrize the desired closed-loop system as a sparse linear combination of nonlinear state-dependent functions. Optimization of the controller parameters is achieved by solving a multi-objective optimization problem whose cost function is composed of a generic task-dependent cost and a matching condition-dependent cost. Our numerical results show that the proposed method enables (i) IDA-PBC to be applicable to complex tasks beyond stabilization, such as the discovery of periodic oscillatory behaviors, (ii) the derivation of closed-form expressions of the controlled system, including residual terms
๐
2025-12-07
๐ ICRA 2026
๐ฅ Rushiraj Gadhvi, Sandeep Manjanna
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
For centuries, khalasi have skillfully harnessed ocean currents to navigate vast waters with minimal effort. Emulating this intuition in autonomous systems remains a significant challenge, particularly for Autonomous Surface Vehicles tasked with long duration missions under strict energy budgets. In this work, we present a learning-based approach for energy-efficient surface vehicle navigation in vortical flow fields, where partial observability often undermines traditional path-planning methods. We present an end to end reinforcement learning framework based on Soft Actor Critic that learns flow-aware navigation policies using only local velocity measurements. Through extensive evaluation across diverse and dynamically rich scenarios, our method demonstrates substantial energy savings and robust generalization to previously unseen flow conditions, offering a promising path toward long term autonomy in ocean environments. The navigation paths generated by our proposed approach show an improvement in energy conservation 30 to 50 percent compared to the existing state of the art techniques.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Bradley Hobbs, Panagiotis Artemiadis
Robotics
ๆฅ็ๆ่ฆ
This study investigates whether adding ground compliance to visual feedback (VF) gait training is more effective at increasing push-off force (POF) compared to using VF alone, with implications for gait rehabilitation. Ten healthy participants walked on a custom split-belt treadmill. All participants received real-time visual feedback of their ground reaction forces. One group also experienced changes in ground compliance, while a control group received only visual feedback. Intentional increases in propulsive ground reaction forces (POF) were successfully achieved and sustained post-intervention, especially in the group that experienced ground compliance. This group also demonstrated lasting after-effects in muscle activity and joint kinematics, indicating a more robust learning of natural strategies to increase propulsion. This work demonstrates how visual and proprioceptive systems coordinate during gait adaptation. It uniquely shows that combining ground compliance with visual feedback enhances the learning of propulsive forces, supporting the potential use of compliant terrain in long-term rehabilitation targeting propulsion deficits, such as those following a stroke.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Chrysostomos Karakasis, Camryn Scully, Robert Salati, Panagiotis Artemiadis
Robotics
ๆฅ็ๆ่ฆ
Walking on compliant terrain presents a substantial challenge for individuals with lower-limb amputation, further elevating their already high risk of falling. While powered ankle-foot prostheses have demonstrated adaptability across speeds and rigid terrains, control strategies optimized for soft or compliant surfaces remain underexplored. This work experimentally validates an admittance-based control strategy that dynamically adjusts the quasi-stiffness of powered prostheses to enhance gait stability on compliant ground. Human subject experiments were conducted with three healthy individuals walking on two bilaterally compliant surfaces with ground stiffness values of 63 and 25 kN/m, representative of real-world soft environments. Controller performance was quantified using phase portraits and two walking stability metrics, offering a direct assessment of fall risk. Compared to a standard phase-variable controller developed for rigid terrain, the proposed admittance controller consistently improved gait stability across all compliant conditions. These results demonstrate the potential of adaptive, stability-aware prosthesis control to reduce fall risk in real-world environments and advance the robustness of human-prosthesis interaction in rehabilitation robotics.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Hassan Jardali, Durgakant Pushp, Youwei Yu, Mahmoud Ali, Ihab S. Mohamed et al.
Robotics
ๆฅ็ๆ่ฆ
High-speed, head-to-head autonomous racing presents substantial technical and logistical challenges, including precise localization, rapid perception, dynamic planning, and real-time control-compounded by limited track access and costly hardware. This paper introduces the Autonomous Race Stack (ARS), developed by the IU Luddy Autonomous Racing team for the Indy Autonomous Challenge (IAC). We present three iterations of our ARS, each validated on different tracks and achieving speeds up to 260 km/h. Our contributions include: (i) the modular architecture and evolution of the ARS across ARS1, ARS2, and ARS3; (ii) a detailed performance evaluation that contrasts control, perception, and estimation across oval and road-course environments; and (iii) the release of a high-speed, multi-sensor dataset collected from oval and road-course tracks. Our findings highlight the unique challenges and insights from real-world high-speed full-scale autonomous racing.
๐
2025-12-07
๐ ArXiv Preprint (cs.NI)
๐ฅ Ximing Huang, Yirui Rao
Robotics
ๆฅ็ๆ่ฆ
The proliferation of autonomous Unmanned Aerial Vehicles (UAVs) in Beyond Visual Line of Sight (BVLOS) applications is critically dependent on resilient, high-bandwidth, and low-latency communication links. Existing solutions face critical limitations: TCP's head-of-line blocking stalls time-sensitive data, UDP lacks reliability and congestion control, and cellular networks designed for terrestrial users degrade severely for aerial platforms. This paper introduces AQUILA, a cross-layer communication architecture built on QUIC to address these challenges. AQUILA contributes three key innovations: (1) a unified transport layer using QUIC's reliable streams for MAVLink Command and Control (C2) and unreliable datagrams for video, eliminating head-of-line blocking under unified congestion control; (2) a priority scheduling mechanism that structurally ensures C2 latency remains bounded and independent of video traffic intensity; (3) a UAV-adapted congestion control algorithm extending SCReAM with altitude-adaptive delay targeting and telemetry headroom reservation. AQUILA further implements 0-RTT connection resumption to minimize handover blackouts with application-layer replay protection, deployed over an IP-native architecture enabling global operation. Experimental validation demonstrates that AQUILA significantly outperforms TCP- and UDP-based approaches in C2 latency, video quality, and link resilience under realistic conditions, providing a robust foundation for autonomous BVLOS missions.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Xingguang Zhong, Liren Jin, Marija Popoviฤ, Jens Behley, Cyrill Stachniss
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Reliable incremental estimation of camera poses and 3D reconstruction is key to enable various applications including robotics, interactive visualization, and augmented reality. However, this task is particularly challenging in dynamic natural environments, where scene dynamics can severely deteriorate camera pose estimation accuracy. In this work, we propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. To this end, we leverage the complementary strengths of geometric patch-based online bundle adjustment and recent feed-forward reconstruction models. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions, while also utilizing its depth prediction to enhance the robustness of the patch-based visual SLAM. By aligning depth prediction with estimated patches from bundle adjustment, we robustly handle the inherent scale ambiguities of the batch-wise application of the feed-forward reconstruction model.
๐
2025-12-07
๐ ICRA 2026
๐ฅ Oluwatimilehin Tijani, Zhuo Chen, Jiankang Deng, Shan Luo
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-based tactile sensors (VBTS) face a fundamental trade-off in marker and markerless design on the tactile skin: opaque ink markers enable measurement of force and tangential displacement but completely occlude geometric features necessary for object and texture classification, while markerless skin preserves surface details but struggles in measuring tangential displacements effectively. Current practice to solve the above problem via UV lighting or virtual transfer using learning-based models introduces hardware complexity or computing burdens. This paper introduces MagicSkin, a novel tactile skin with translucent, tinted markers balancing the modes of marker and markerless for VBTS. It enables simultaneous tangential displacement tracking, force prediction, and surface detail preservation. This skin is easy to plug into GelSight-family sensors without requiring additional hardware or software tools. We comprehensively evaluate MagicSkin in downstream tasks. The translucent markers impressively enhance rather than degrade sensing performance compared with traditional markerless and inked marker design: it achieves best performance in object classification (99.17\%), texture classification (93.51\%), tangential displacement tracking (97\% point retention) and force prediction (66\% improvement in total force error). These experimental results demonstrate that translucent skin eliminates the traditional performance trade-off in marker or markerless modes, paving the way for multimodal tactile sensing essential in tactile robotics. See videos at this \href{https://zhuochenn.github.io/MagicSkin_project/}{link}.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Akmaral Moldagalieva, Keisuke Okumura, Amanda Prorok, Wolfgang Hรถnig
ๆฅ็ๆ่ฆ
State-of-the-art multi-robot kinodynamic motion planners struggle to handle more than a few robots due to high computational burden, which limits their scalability and results in slow planning time.
In this work, we combine the scalability and speed of modern multi-agent path finding (MAPF) algorithms with the dynamic-awareness of kinodynamic planners to address these limitations.
To this end, we propose discontinuity-Bounded LaCAM (db-LaCAM), a planner that utilizes a precomputed set of motion primitives that respect robot dynamics to generate horizon-length motion sequences, while allowing a user-defined discontinuity between successive motions.
The planner db-LaCAM is resolution-complete with respect to motion primitives and supports arbitrary robot dynamics.
Extensive experiments demonstrate that db-LaCAM scales efficiently to scenarios with up to 50 robots, achieving up to ten times faster runtime compared to state-of-the-art planners, while maintaining comparable solution quality.
The approach is validated in both 2D and 3D environments with dynamics such as the unicycle and 3D double integrator.
We demonstrate the safe execution of trajectories planned with db-LaCAM in two distinct physical experiments involving teams of flying robots and car-with-trailer robots.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Shrreya Rajneesh, Nikita Pavle, Rakesh Kumar Sahoo, Manoranjan Sinha
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Continuum manipulators offer intrinsic dexterity and safe geometric compliance for navigation within confined and obstacle-rich environments. However, their infinite-dimensional backbone deformation, unmodeled internal friction, and configuration-dependent stiffness fundamentally limit the reliability of model-based kinematic formulations, resulting in inaccurate Jacobian predictions, artificial singularities, and unstable actuation behavior. Motivated by these limitations, this work presents a complete model-less control framework that bypasses kinematic modeling by using an empirically initialized Jacobian refined online through differential convex updates. Tip motion is generated via a real-time quadratic program that computes actuator increments while enforcing tendon slack avoidance and geometric limits. A backbone tension optimization term is introduced in this paper to regulate axial loading and suppress co-activation compression. The framework is validated across circular, pentagonal, and square trajectories, demonstrating smooth convergence, stable tension evolution, and sub-millimeter steady-state accuracy without any model calibration or parameter identification. These results establish the proposed controller as a scalable alternative to model-dependent continuum manipulation in a constrained environment.
๐
2025-12-07
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tony Salloom, Okyay Kaynak, Wei He
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Short-term water demand forecasting (StWDF) is the foundation stone in the derivation of an optimal plan for controlling water supply systems. Deep learning (DL) approaches provide the most accurate solutions for this purpose. However, they suffer from complexity problem due to the massive number of parameters, in addition to the high forecasting error at the extreme points. In this work, an effective method to alleviate the error at these points is proposed. It is based on extending the data by inserting virtual data within the actual data to relieve the nonlinearity around them. To our knowledge, this is the first work that considers the problem related to the extreme points. Moreover, the water demand forecasting model proposed in this work is a novel DL model with relatively low complexity. The basic model uses the gated recurrent unit (GRU) to handle the sequential relationship in the historical demand data, while an unsupervised classification method, K-means, is introduced for the creation of new features to enhance the prediction accuracy with less number of parameters. Real data obtained from two different water plants in China are used to train and verify the model proposed. The prediction results and the comparison with the state-of-the-art illustrate that the method proposed reduces the complexity of the model six times of what achieved in the literature while conserving the same accuracy. Furthermore, it is found that extending the data set significantly reduces the error by about 30%. However, it increases the training time.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Wei-Bin Kou, Guangxu Zhu, Bingyang Cheng, Chen Zhang, Yik-Chung Wu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Federated Learning (FL) enables collaborative training of autonomous driving (AD) models across distributed vehicles while preserving data privacy. However, FL encounters critical challenges such as poor generalization and slow convergence due to non-independent and identically distributed (non-IID) data from diverse driving environments. To overcome these obstacles, we introduce Federated Deep Supervision and Regularization (FedDSR), a paradigm that incorporates multi-access intermediate layer supervision and regularization within federated AD system. Specifically, FedDSR comprises following integral strategies: (I) to select multiple intermediate layers based on predefined architecture-agnostic standards. (II) to compute mutual information (MI) and negative entropy (NE) on those selected layers to serve as intermediate loss and regularizer. These terms are integrated into the output-layer loss to form a unified optimization objective, enabling comprehensive optimization across the network hierarchy. (III) to aggregate models from vehicles trained based on aforementioned rules of (I) and (II) to generate the global model on central server. By guiding and penalizing the learning of feature representations at intermediate stages, FedDSR enhances the model generalization and accelerates model convergence for federated AD. We then take the semantic segmentation task as an example to assess FedDSR and apply FedDSR to multiple model architectures and FL algorithms. Extensive experiments demonstrate that FedDSR achieves up to 8.93% improvement in mIoU and 28.57% reduction in training rounds, compared to other FL baselines, making it highly suitable for practical deployment in federated AD ecosystems.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Wei-Bin Kou, Guangxu Zhu, Jingreng Lei, Chen Zhang, Yik-Chung Wu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous driving (AD) scenarios are inherently complex and diverse, posing significant challenges for a single deep learning model to effectively cover all possible conditions, such as varying weather, traffic densities, and road types. Large Model (LM)-Driven Mixture of Experts (MoE) paradigm offers a promising solution, where LM serves as the backbone to extract latent features while MoE serves as the downstream head to dynamically select and aggregate specialized experts to adapt to different scenarios. However, routing and aggregating in MoE face intrinsic challenges, including imprecise expert selection due to flawed routing strategy and inefficient expert aggregation leading to suboptimal prediction. To address these issues, we propose a statistic-augmented, decoupled MoE }outing and Aggregating Mechanism (MoE-RAM) driven by LM. Specifically, on the one hand, MoE-RAM enhances expert routing by incorporating statistical retrieval mechanism to match LM-extracted latent features with cached prototypical features of the most relevant experts; on the other hand, MoE-RAM adaptively reweights experts' outputs in fusion by measuring statistical distances of experts' instant features against LM-extracted latent features. Benefiting from the synergy of the statistic-augmented MoE's routing and aggregating, MoE-RAM ultimately improves the prediction performance. We take the AD semantic segmentation task as an example to assess the proposed MoE-RAM. Extensive experiments on AD datasets demonstrate the superiority of MoE-RAM compared to other MoE baselines and conventional single-model approaches.
๐
2025-12-07
๐ ArXiv Preprint (Robotics)
๐ฅ Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Zhangrui Guo, Xiaofan Liu et al.
RoboticsMachine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Embodied imitation learning is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video generation models for this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a hierarchical framework designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To align the generated videos with physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA world model to enforce physical plausibility by aligning the predicted and actual dynamic evolutions in the feature space. MIND-V demonstrates state-of-the-art performance in long-horizon robotic manipulation video generation, establishing a scalable and controllable paradigm for embodied data synthesis.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Fan Zhang, Jinfeng Chen, Joseph J. B. Mvogo Ahanda, Hanz Richter, Ge Lv et al.
Robotics
ๆฅ็ๆ่ฆ
In commercial robotic systems, it is common to encounter a closed inner-loop torque controller that is not user-modifiable. However, the outer-loop controller, which sends kinematic commands such as position or velocity for the inner-loop controller to track, is typically exposed to users. In this work, we focus on the development of an easily integrated add-on at the outer-loop layer by combining disturbance rejection control and robust control barrier function for high-performance tracking and safe control of the whole dynamic system of an industrial manipulator. This is particularly beneficial when 1) the inner-loop controller is imperfect, unmodifiable, and uncertain; and 2) the dynamic model exhibits significant uncertainty. Stability analysis, formal safety guarantee proof, and hardware experiments with a PUMA robotic manipulator are presented. Our solution demonstrates superior performance in terms of simplicity of implementation, robustness, tracking precision, and safety compared to the state of the art. Video: https://youtu.be/zw1tanvrV8Q
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, Alan Yuille
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models - including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models - we identify universal subspaces capturing majority variance in just a few principal directions. By applying spectral decomposition techniques to the weight matrices of various architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited, within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise important questions about the possibility of discovering these universal subspaces without the need for extensive data and computational resources. Furthermore, this inherent structure has significant implications for model reusability, multi-task learning, model merging, and the development of training and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo et al.
MultimodalRoboticsComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and performs refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rundong Luo, Noah Snavely, Wei-Chiu Ma
ๆฅ็ๆ่ฆ
We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page https://red-fairy.github.io/ShadowDraw/ for more results and an end-to-end real-world demonstration of our pipeline!
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Purbesh Mitra, Sennur Ulukus
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) in reasoning based problems, like math and programming. However, RLVR is limited by several bottlenecks, such as, lack of dense reward, and inadequate sample efficiency. As a result, it requires significant compute resources in post-training phase. To overcome these limitations, in this work, we propose \textbf{Semantic Soft Bootstrapping (SSB)}, a self-distillation technique, in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct and most common incorrect response are filtered, and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. This generation process also produces a sequence of logits, which is what the student model tries to match in the training phase just from the bare question alone. In our experiment, Qwen2.5-3B-Instruct on GSM8K dataset via parameter-efficient fine-tuning. We then tested its accuracy on MATH500, and AIME2024 benchmarks. Our experiments show a jump of 10.6%, and 10% improvements in accuracy, respectively, over group relative policy optimization (GRPO), which is a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model, curated dataset is available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe et al.
MultimodalMachine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before ``acting in pixels'' to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Haiyue Song, Johannes Eschbach-Dymanus, Hour Kaing, Sumire Honda, Hideki Tanaka et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose \textbf{Format Reinforcement Learning (FormatRL)}, which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Gao, Jin Song
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shashwat Shankar, Subhranshu Pandey, Innocent Dengkhw Mochahari, Bhabesh Mali, Animesh Basak Chowdhury et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model(LLM) inference demands massive compute and energy, making domain-specific tasks expensive and unsustainable. As foundation models keep scaling, we ask: Is bigger always better for hardware design? Our work tests this by evaluating Small Language Models coupled with a curated agentic AI framework on NVIDIA's Comprehensive Verilog Design Problems(CVDP) benchmark. Results show that agentic workflows: through task decomposition, iterative feedback, and correction - not only unlock near-LLM performance at a fraction of the cost but also create learning opportunities for agents, paving the way for efficient, adaptive solutions in complex design tasks.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huascar Sanchez, Briland Hitaj, Jules Bergmann, Linda Briesemeister
Natural Language Processing
ๆฅ็ๆ่ฆ
As healthcare increasingly turns to AI for scalable and trustworthy clinical decision support, ensuring reliability in model reasoning remains a critical challenge. Individual large language models (LLMs) are susceptible to hallucinations and inconsistency, whereas naive ensembles of models often fail to deliver stable and credible recommendations. Building on our previous work on LLM Chemistry, which quantifies the collaborative compatibility among LLMs, we apply this framework to improve the reliability in medication recommendation from brief clinical vignettes. Our approach leverages multi-LLM collaboration guided by Chemistry-inspired interaction modeling, enabling ensembles that are effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification). We evaluate our Chemistry-based Multi-LLM collaboration strategy on real-world clinical scenarios to investigate whether such interaction-aware ensembles can generate credible, patient-specific medication recommendations. Preliminary results are encouraging, suggesting that LLM Chemistry-guided collaboration may offer a promising path toward reliable and trustworthy AI assistants in clinical practice.
๐
2025-12-04
๐ ArXiv Preprint (quant-ph)
๐ฅ Yu-Cheng Lin, Yu-Chao Hsu, Samuel Yen-Chi Chen
Machine Learning
ๆฅ็ๆ่ฆ
The Quantum Approximate Optimization Algorithm (QAOA) is a leading approach for solving combinatorial optimization problems on near-term quantum processors. However, finding good variational parameters remains a significant challenge due to the non-convex energy landscape, often resulting in slow convergence and poor solution quality. In this work, we propose a quantum meta-learning framework that trains advanced quantum sequence models to generate effective parameter initialization policies. We investigate four classical or quantum sequence models, including the Quantum Kernel-based Long Short-Term Memory (QK-LSTM), as learned optimizers in a "learning to learn" paradigm. Our numerical experiments on the Max-Cut problem demonstrate that the QK-LSTM optimizer achieves superior performance, obtaining the highest approximation ratios and exhibiting the fastest convergence rate across all tested problem sizes (n=10 to 13). Crucially, the QK-LSTM model achieves perfect parameter transferability by synthesizing a single, fixed set of near-optimal parameters, leading to a remarkable sustained acceleration of convergence even when generalizing to larger problems. This capability, enabled by the compact and expressive power of the quantum kernel architecture, underscores its effectiveness. The QK-LSTM, with only 43 trainable parameters, substantially outperforms the classical LSTM (56 parameters) and other quantum sequence models, establishing a robust pathway toward highly efficient parameter initialization for variational quantum algorithms in the NISQ era.
๐
2025-12-04
๐ ArXiv Preprint (quant-ph)
๐ฅ Yu-Chao Hsu, Jiun-Cheng Jiang, Chun-Hua Lin, Kuo-Chung Peng, Nan-Yow Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Long short-term memory (LSTM) models are a particular type of recurrent neural networks (RNNs) that are central to sequential modeling tasks in domains such as urban telecommunication forecasting, where temporal correlations and nonlinear dependencies dominate. However, conventional LSTMs suffer from high parameter redundancy and limited nonlinear expressivity. In this work, we propose the Quantum-inspired Kolmogorov-Arnold Long Short-Term Memory (QKAN-LSTM), which integrates Data Re-Uploading Activation (DARUAN) modules into the gating structure of LSTMs. Each DARUAN acts as a quantum variational activation function (QVAF), enhancing frequency adaptability and enabling an exponentially enriched spectral representation without multi-qubit entanglement. The resulting architecture preserves quantum-level expressivity while remaining fully executable on classical hardware. Empirical evaluations on three datasets, Damped Simple Harmonic Motion, Bessel Function, and Urban Telecommunication, demonstrate that QKAN-LSTM achieves superior predictive accuracy and generalization with a 79% reduction in trainable parameters compared to classical LSTMs. We extend the framework to the Jiang-Huang-Chen-Goan Network (JHCG Net), which generalizes KAN to encoder-decoder structures, and then further use QKAN to realize the latent KAN, thereby creating a Hybrid QKAN (HQKAN) for hierarchical representation learning. The proposed HQKAN-LSTM thus provides a scalable and interpretable pathway toward quantum-inspired sequential modeling in real-world data environments.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improve efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
๐
2025-12-04
๐ ArXiv Preprint (stat.ME)
๐ฅ Garud Iyengar, Yu-Shiou Willy Lin, Kaizheng Wang
ๆฅ็ๆ่ฆ
Simulation of complex systems originated in manufacturing and queuing applications. It is now widely used for large-scale, ML-based systems in research, education, and consumer surveys. However, characterizing the discrepancy between simulators and ground truth remains challenging for increasingly complex, machine-learning-based systems. We propose a computationally tractable method to estimate the quantile function of the discrepancy between the simulated and ground-truth outcome distributions. Our approach focuses on output uncertainty and treats the simulator as a black box, imposing no modeling assumptions on its internals, and hence applies broadly across many parameter families, from Bernoulli and multinomial models to continuous, vector-valued settings. The resulting quantile curve supports confidence interval construction for unseen scenarios, risk-aware summaries of sim-to-real discrepancy (e.g., VaR/CVaR), and comparison of simulators' performance. We demonstrate our methodology in an application assessing LLM simulation fidelity on the WorldValueBench dataset spanning four LLMs.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Eric Bridgeford, Hayden Helm
Machine Learning
ๆฅ็ๆ่ฆ
Generative models augmented with external tools and update mechanisms (or \textit{agents}) have demonstrated capabilities beyond intelligent prompting of base models. As agent use proliferates, dynamic multi-agent systems have naturally emerged. Recent work has investigated the theoretical and empirical properties of low-dimensional representations of agents based on query responses at a single time point. This paper introduces the Temporal Data Kernel Perspective Space (TDKPS), which jointly embeds agents across time, and proposes several novel hypothesis tests for detecting behavioral change at the agent- and group-level in black-box multi-agent systems. We characterize the empirical properties of our proposed tests, including their sensitivity to key hyperparameters, in simulations motivated by a multi-agent system of evolving digital personas. Finally, we demonstrate via natural experiment that our proposed tests detect changes that correlate sensitively, specifically, and significantly with a real exogenous event. As far as we are aware, TDKPS is the first principled framework for monitoring behavioral dynamics in black-box multi-agent systems -- a critical capability as generative agent deployment continues to scale.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniyar Zakarin, Thiemo Wandel, Anton Obukhov, Dengxin Dai
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web
๐
2025-12-04
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Adri Gรณmez Martรญn, Felix Mรถller, Steven McDonagh, Monica Abella, Manuel Desco et al.
ๆฅ็ๆ่ฆ
Neural architecture search (NAS) in expressive search spaces is a computationally hard problem, but it also holds the potential to automatically discover completely novel and performant architectures. To achieve this we need effective search algorithms that can identify powerful components and reuse them in new candidate architectures. In this paper, we introduce two adapted variants of the Smith-Waterman algorithm for local sequence alignment and use them to compute the edit distance in a grammar-based evolutionary architecture search. These algorithms enable us to efficiently calculate a distance metric for neural architectures and to generate a set of hybrid offspring from two parent models. This facilitates the deployment of crossover-based search heuristics, allows us to perform a thorough analysis on the architectural loss landscape, and track population diversity during search. We highlight how our method vastly improves computational complexity over previous work and enables us to efficiently compute shortest paths between architectures. When instantiating the crossover in evolutionary searches, we achieve competitive results, outperforming competing methods. Future work can build upon this new tool, discovering novel components that can be used more broadly across neural architecture design, and broadening its applications beyond NAS.
๐
2025-12-04
๐ ArXiv Preprint (cs.MA)
๐ฅ Christopher Chiu, Simpson Zhang, Mihaela van der Schaar
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
As artificial intelligence (AI) agents are deployed across economic domains, understanding their strategic behavior and market-level impact becomes critical. This paper puts forward a groundbreaking new framework that is the first to capture the real-world economic forces that shape agentic labor markets: adverse selection, moral hazard, and reputation dynamics. Our framework encapsulates three core capabilities that successful LLM-agents will need: \textbf{metacognition} (accurate self-assessment of skills), \textbf{competitive awareness} (modeling rivals and market dynamics), and \textbf{long-horizon strategic planning}. We illustrate our framework through a tractable simulated gig economy where agentic Large Language Models (LLMs) compete for jobs, develop skills, and adapt their strategies under competitive pressure. Our simulations illustrate how LLM agents explicitly prompted with reasoning capabilities learn to strategically self-improve and demonstrate superior adaptability to changing market conditions. At the market level, our simulations reproduce classic macroeconomic phenomena found in human labor markets, while controlled experiments reveal potential AI-driven economic trends, such as rapid monopolization and systemic price deflation. This work provides a foundation to further explore the economic properties of AI-driven labour markets, and a conceptual framework to study the strategic reasoning capabilities in agents competing in the emerging economy.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jasmaine Khale, Ravi Prakash Srivastava
Machine Learning
ๆฅ็ๆ่ฆ
Automated retinal disease diagnosis is vital given the rising prevalence of conditions such as diabetic retinopathy and macular degeneration. Conventional deep learning approaches require large annotated datasets, which are costly and often imbalanced across disease categories, limiting their reliability in practice. Few-shot learning (FSL) addresses this challenge by enabling models to generalize from only a few labeled samples per class. In this study,we propose a balanced few-shot episodic learning framework tailored to the Retinal Fundus Multi-Disease Image Dataset (RFMiD). Focusing on the ten most represented classes, which still show substantial imbalance between majority diseases (e.g., Diabetic Retinopathy, Macular Hole) and minority ones (e.g., Optic Disc Edema, Branch Retinal Vein Occlusion), our method integrates three key components: (i) balanced episodic sampling, ensuring equal participation of all classes in each 5-way 5-shot episode; (ii) targeted augmentation, including Contrast Limited Adaptive Histogram Equalization (CLAHE) and color/geometry transformations, to improve minority-class di- versity; and (iii) a ResNet-50 encoder pretrained on ImageNet, selected for its superior ability to capture fine-grained retinal features. Prototypes are computed in the embedding space and classification is performed with cosine similarity for improved stability. Trained on 100 episodes and evaluated on 1,000 test episodes, our framework achieves substantial accuracy gains and reduces bias toward majority classes, with notable improvements for underrepresented diseases. These results demonstrate that dataset-aware few-shot pipelines, combined with balanced sampling and CLAHE-enhanced preprocessing, can deliver more robust and clinically fair retinal disease diagnosis under data-constrained conditions.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yupu Yao, Bowen Yang
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Roberto Cipollone, Luca Iocchi, Matteo Leonetti
Machine Learning
ๆฅ็ๆ่ฆ
The main focus of Hierarchical Reinforcement Learning (HRL) is studying how large Markov Decision Processes (MDPs) can be more efficiently solved when addressed in a modular way, by combining partial solutions computed for smaller subtasks. Despite their very intuitive role for learning, most notions of MDP abstractions proposed in the HRL literature have limited expressive power or do not possess formal efficiency guarantees. This work addresses these fundamental issues by defining Realizable Abstractions, a new relation between generic low-level MDPs and their associated high-level decision processes. The notion we propose avoids non-Markovianity issues and has desirable near-optimality guarantees. Indeed, we show that any abstract policy for Realizable Abstractions can be translated into near-optimal policies for the low-level MDP, through a suitable composition of options. As demonstrated in the paper, these options can be expressed as solutions of specific constrained MDPs. Based on these findings, we propose RARL, a new HRL algorithm that returns compositional and near-optimal low-level policies, taking advantage of the Realizable Abstraction given in the input. We show that RARL is Probably Approximately Correct, it converges in a polynomial number of samples, and it is robust to inaccuracies in the abstraction.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Weiye Shi, Zhaowei Zhang, Shaoheng Yan, Yaodong Yang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) demonstrate remarkable potential across diverse language related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns from raw text remains unclear. To analysis whether LLMs can learn these features effectively and apply them to important nature language related tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works, comprising thousands of sentences per binary task (poetry vs. novel;drama vs. poetry;drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, Tat-Seng Chua
Machine Learning
ๆฅ็ๆ่ฆ
Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each action holds equal contribution, which deviates significantly from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents. CARL achieves focused training through providing action-level optimization signals for high-criticality actions while excluding low-criticality actions from model update. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Raquel Norel, Michele Merler, Pavitra Modi
Natural Language Processing
ๆฅ็ๆ่ฆ
Patients with rare neurological diseases report cognitive symptoms -"brain fog"- invisible to traditional tests. We propose continuous neurocognitive monitoring via smartphone speech analysis integrated with Relational Graph Transformer (RELGT) architectures. Proof-of-concept in phenylketonuria (PKU) shows speech-derived "Proficiency in Verbal Discourse" correlates with blood phenylalanine (p = -0.50, p < 0.005) but not standard cognitive tests (all |r| < 0.35). RELGT could overcome information bottlenecks in heterogeneous medical data (speech, labs, assessments), enabling predictive alerts weeks before decompensation. Key challenges: multi-disease validation, clinical workflow integration, equitable multilingual deployment. Success would transform episodic neurology into continuous personalized monitoring for millions globally.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ MohammadHossein Bateni, Vincent Cohen-Addad, Yuzhou Gu, Silvio Lattanzi, Simon Meierhans et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have proven to be highly effective for solving complex reasoning tasks. Surprisingly, their capabilities can often be improved by iterating on previously generated solutions. In this context, a reasoning plan for generating and combining a set of solutions can be thought of as an algorithm for reasoning using a probabilistic oracle.
We introduce a theoretical framework for analyzing such reasoning algorithms. This framework formalizes the principles underlying popular techniques for iterative improvement and answer aggregation, providing a foundation for designing a new generation of more powerful reasoning methods. Unlike approaches for understanding models that rely on architectural specifics, our model is grounded in experimental evidence. As a result, it offers a general perspective that may extend to a wide range of current and future reasoning oracles.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Julien Benchek, Rohit Shetty, Benjamin Hunsberger, Ajay Arun, Zach Richards et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Models differ across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
๐
2025-12-04
๐ Accepted by the 43rd IEEE International Conference
๐ฅ Fang Li
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a novel approach to automated stripboard circuit layout design using Answer Set Programming (ASP). The work formulates the layout problem as both a synthesis and multi-objective optimization task that simultaneously generates viable layouts while minimizing board area and component strip crossing. By leveraging ASP's declarative nature, this work expresses complex geometric and electrical constraints in a natural and concise manner. The two-phase solving methodology first ensures feasibility before optimizing layout quality. Experimental results demonstrate that this approach generates compact, manufacturable layouts for a range of circuit complexities. This work represents a significant advancement in automated stripboard layout, offering a practical tool for electronics prototyping and education while showcasing the power of declarative programming for solving complex design automation problems.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guanbo Huang, Jingjia Mao, Fanding Huang, Fengkai Liu, Xiangyang Luo et al.
ๆฅ็ๆ่ฆ
Despite tremendous recent progress, Flow Matching methods still suffer from exposure bias due to discrepancies in training and inference. This paper investigates the root causes of exposure bias in Flow Matching, including: (1) the model lacks generalization to biased inputs during training, and (2) insufficient low-frequency content captured during early denoising, leading to accumulated bias. Based on these insights, we propose ReflexFlow, a simple and effective reflexive refinement of the Flow Matching learning objective that dynamically corrects exposure bias. ReflexFlow consists of two components: (1) Anti-Drift Rectification (ADR), which reflexively adjusts prediction targets for biased inputs utilizing a redesigned loss under training-time scheduled sampling; and (2) Frequency Compensation (FC), which reflects on missing low-frequency components and compensates them by reweighting the loss using exposure bias. ReflexFlow is model-agnostic, compatible with all Flow Matching frameworks, and improves generation quality across datasets. Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow outperforms prior approaches in mitigating exposure bias, achieving a 35.65% reduction in FID on CelebA-64.
๐
2025-12-04
๐ 5 pages, 2 figures, IEEE Transactions on Dependabl
๐ฅ M Zeeshan, Saud Satti
MultimodalMachine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Artificial Intelligence (AI) systems, particularly Vision-Language Models (VLMs), have become integral to critical applications ranging from autonomous decision-making to automated document processing. As these systems scale, they rely heavily on preprocessing pipelines to handle diverse inputs efficiently. However, this dependency on standard preprocessing operations, specifically image downscaling, creates a significant yet often overlooked security vulnerability. While intended for computational optimization, scaling algorithms can be exploited to conceal malicious visual prompts that are invisible to human observers but become active semantic instructions once processed by the model. Current adversarial strategies remain largely static, failing to account for the dynamic nature of modern agentic workflows. To address this gap, we propose Chameleon, a novel, adaptive adversarial framework designed to expose and exploit scaling vulnerabilities in production VLMs. Unlike traditional static attacks, Chameleon employs an iterative, agent-based optimization mechanism that dynamically refines image perturbations based on the target model's real-time feedback. This allows the framework to craft highly robust adversarial examples that survive standard downscaling operations to hijack downstream execution. We evaluate Chameleon against Gemini 2.5 Flash model. Our experiments demonstrate that Chameleon achieves an Attack Success Rate (ASR) of 84.5% across varying scaling factors, significantly outperforming static baseline attacks which average only 32.1%. Furthermore, we show that these attacks effectively compromise agentic pipelines, reducing decision-making accuracy by over 45% in multi-step tasks. Finally, we discuss the implications of these vulnerabilities and propose multi-scale consistency checks as a necessary defense mechanism.
๐
2025-12-04
๐ This work has been submitted to the IEEE for possi
๐ฅ Junjie Fan, Hongye Zhao, Linduo Wei, Jiayu Rao, Guijia Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent adaptations of Large Language Models (LLMs) for time series forecasting often fail to effectively enhance information for raw series, leaving LLM reasoning capabilities underutilized. Existing prompting strategies rely on static correlations rather than generative interpretations of dynamic behavior, lacking critical global and instance-specific context. To address this, we propose STELLA (Semantic-Temporal Alignment with Language Abstractions), a framework that systematically mines and injects structured supplementary and complementary information. STELLA employs a dynamic semantic abstraction mechanism that decouples input series into trend, seasonality, and residual components. It then translates intrinsic behavioral features of these components into Hierarchical Semantic Anchors: a Corpus-level Semantic Prior (CSP) for global context and a Fine-grained Behavioral Prompt (FBP) for instance-level patterns. Using these anchors as prefix-prompts, STELLA guides the LLM to model intrinsic dynamics. Experiments on eight benchmark datasets demonstrate that STELLA outperforms state-of-the-art methods in long- and short-term forecasting, showing superior generalization in zero-shot and few-shot settings. Ablation studies further validate the effectiveness of our dynamically generated semantic anchors.
๐
2025-12-04
๐ ArXiv Preprint (cs.CY)
๐ฅ Jaan Aru, Kristjan-Julius Laak
ๆฅ็ๆ่ฆ
The vision of a universal AI tutor has remained elusive, despite decades of effort. Could LLMs be the game-changer? We overview novel issues arising from developing a nationwide AI tutor. We highlight the practical questions that point to specific gaps in our scientific understanding of the learning process.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hao Wang, Jialun Zhong, Changcheng Wang, Zhujun Nie, Zheng Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches, whether end-to-end semantic parsing or stepwise agent-based reasoning, often suffer from structural inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, a large language model (LLM) extracts a minimal S-expression core that captures the essential semantics of the input query. This core is then refined by an agentic calibration module, which corrects syntactic inconsistencies and aligns entities and relations precisely with the underlying knowledge graph. The second stage employs template-based completion, guided by question-type prediction and placeholder instantiation, to construct a fully executable S-expression. This decomposition not only simplifies logical form generation but also significantly enhances structural fidelity and linking efficiency. Crucially, SEAL incorporates a self-evolving mechanism that integrates local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance, especially in multi-hop reasoning, comparison, and aggregation tasks. The results validate notable gains in both structural accuracy and computational efficiency, underscoring the framework's capacity for robust and scalable conversational reasoning.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dadi Guo, Qingyu Liu, Dongrui Liu, Qihan Ren, Shuai Shao et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model (LLM)-based agents are increasingly used as autonomous subordinates that carry out tasks for users. This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment. We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs actions that were not requested without reporting. To assess its prevalence, we construct a benchmark of 200 tasks covering five task types and eight realistic scenarios in a constrained environment, such as broken tools or mismatched information sources. Evaluations of 11 popular LLMs reveal that these agents typically exhibit action-based deceptive behaviors, such as guessing results, performing unsupported simulations, substituting unavailable information sources, and fabricating local files. We further test prompt-based mitigation and find only limited reductions, suggesting that it is difficult to eliminate and highlighting the need for stronger mitigation strategies to ensure the safety of LLM-based agents.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Lukas Weidener, Marko Brkiฤ, Chiara Bacci, Mihailo Jovanoviฤ, Emre Ulgac et al.
ๆฅ็ๆ่ฆ
Artificial intelligence systems are increasingly deployed in biomedical research. However, current evaluation frameworks may inadequately assess their effectiveness as research collaborators. This rapid review examines benchmarking practices for AI systems in preclinical biomedical research. Three major databases and two preprint servers were searched from January 1, 2018 to October 31, 2025, identifying 14 benchmarks that assess AI capabilities in literature understanding, experimental design, and hypothesis generation. The results revealed that all current benchmarks assess isolated component capabilities, including data analysis quality, hypothesis validity, and experimental protocol design. However, authentic research collaboration requires integrated workflows spanning multiple sessions, with contextual memory, adaptive dialogue, and constraint propagation. This gap implies that systems excelling on component benchmarks may fail as practical research co-pilots. A process-oriented evaluation framework is proposed that addresses four critical dimensions absent from current benchmarks: dialogue quality, workflow orchestration, session continuity, and researcher experience. These dimensions are essential for evaluating AI systems as research co-pilots rather than as isolated task executors.
๐
2025-12-04
๐ ArXiv Preprint (cs.SD)
๐ฅ Tsai-Ning Wang, Lin-Lin Chen, Neil Zeghidour, Aaqib Saeed
Natural Language Processing
ๆฅ็ๆ่ฆ
Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that instills semantic understanding into any audio encoder by aligning it with a medical language model, which acts as a "semantic teacher." To enable alignment at scale, we construct a large-scale dataset by leveraging off-the-shelf large language models to translate the rich, structured metadata accompanying existing audio recordings into coherent clinical reports. Our alignment strategy combines a representation-level contrastive objective with a self-supervised modeling, ensuring that the model learns clinical semantics while preserving fine-grained temporal cues. AcuLa achieves state-of-the-art results across 18 diverse cardio-respiratory tasks from 10 different datasets, improving the mean AUROC on classification benchmarks from 0.68 to 0.79 and, on the most challenging COVID-19 cough detection task, boosting the AUROC from 0.55 to 0.89. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically-aware diagnostic tools, establishing a novel paradigm for enhancing physiological understanding in audio-based health monitoring.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras
Natural Language Processing
ๆฅ็ๆ่ฆ
Expanding the linguistic diversity of instruct large language models (LLMs) is crucial for global accessibility but is often hindered by the reliance on costly specialized target language labeled data and catastrophic forgetting during adaptation. We tackle this challenge under a realistic, low-resource constraint: adapting instruct LLMs using only unlabeled target language data. We introduce Source-Shielded Updates (SSU), a selective parameter update strategy that proactively preserves source knowledge. Using a small set of source data and a parameter importance scoring method, SSU identifies parameters critical to maintaining source abilities. It then applies a column-wise freezing strategy to protect these parameters before adaptation. Experiments across five typologically diverse languages and 7B and 13B models demonstrate that SSU successfully mitigates catastrophic forgetting. It reduces performance degradation on monolingual source tasks to just 3.4% (7B) and 2.8% (13B) on average, a stark contrast to the 20.3% and 22.3% from full fine-tuning. SSU also achieves target-language performance highly competitive with full fine-tuning, outperforming it on all benchmarks for 7B models and the majority for 13B models.
๐
2025-12-04
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Amy Winecoff, Kevin Klyman
ๆฅ็ๆ่ฆ
Generative AI systems may pose serious risks to individuals vulnerable to eating disorders. Existing safeguards tend to overlook subtle but clinically significant cues, leaving many risks unaddressed. To better understand the nature of these risks, we conducted semi-structured interviews with 15 clinicians, researchers, and advocates with expertise in eating disorders. Using abductive qualitative analysis, we developed an expert-guided taxonomy of generative AI risks across seven categories: (1) providing generalized health advice; (2) encouraging disordered behaviors; (3) supporting symptom concealment; (4) creating thinspiration; (5) reinforcing negative self-beliefs; (6) promoting excessive focus on the body; and (7) perpetuating narrow views about eating disorders. Our results demonstrate how certain user interactions with generative AI systems intersect with clinical features of eating disorders in ways that may intensify risk. We discuss implications of our work, including approaches for risk assessment, safeguard design, and participatory evaluation practices with domain experts.
๐
2025-12-04
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Wei Zhao, Zhe Li, Jun Sun
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) exhibit remarkable capabilities but remain vulnerable to adversarial manipulations such as jailbreaking, where crafted prompts bypass safety mechanisms. Understanding the causal factors behind such vulnerabilities is essential for building reliable defenses.
In this work, we introduce a unified causality analysis framework that systematically supports all levels of causal investigation in LLMs, ranging from token-level, neuron-level, and layer-level interventions to representation-level analysis. The framework enables consistent experimentation and comparison across diverse causality-based attack and defense methods. Accompanying this implementation, we provide the first comprehensive survey of causality-driven jailbreak studies and empirically evaluate the framework on multiple open-weight models and safety-critical benchmarks including jailbreaks, hallucination detection, backdoor identification, and fairness evaluation. Our results reveal that: (1) targeted interventions on causally critical components can reliably modify safety behavior; (2) safety-related mechanisms are highly localized (i.e., concentrated in early-to-middle layers with only 1--2\% of neurons exhibiting causal influence); and (3) causal features extracted from our framework achieve over 95\% detection accuracy across multiple threat types.
By bridging theoretical causality analysis and practical model safety, our framework establishes a reproducible foundation for research on causality-based attacks, interpretability, and robust attack detection and mitigation in LLMs. Code is available at https://github.com/Amadeuszhao/SOK_Casuality.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Vignesh Kumar Kembu, Pierandrea Morandini, Marta Bianca Maria Ranzini, Antonino Nocera
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have become a key topic in AI and NLP, transforming sectors like healthcare, finance, education, and marketing by improving customer service, automating tasks, providing insights, improving diagnostics, and personalizing learning experiences. Information extraction from clinical records is a crucial task in digital healthcare. Although traditional NLP techniques have been used for this in the past, they often fall short due to the complexity, variability of clinical language, and high inner semantics in the free clinical text. Recently, Large Language Models (LLMs) have become a powerful tool for better understanding and generating human-like text, making them highly effective in this area. In this paper, we explore the ability of open-source multilingual LLMs to understand EHRs (Electronic Health Records) in Italian and help extract information from them in real-time. Our detailed experimental campaign on comorbidity extraction from EHR reveals that some LLMs struggle in zero-shot, on-premises settings, and others show significant variation in performance, struggling to generalize across various diseases when compared to native pattern matching and manual annotations.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Rasul Tutunov, Alexandre Maraval, Antoine Grosnit, Xihan Li, Jun Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Sphere packing, Hilbert's eighteenth problem, asks for the densest arrangement of congruent spheres in n-dimensional Euclidean space. Although relevant to areas such as cryptography, crystallography, and medical imaging, the problem remains unresolved: beyond a few special dimensions, neither optimal packings nor tight upper bounds are known. Even a major breakthrough in dimension $n=8$, later recognised with a Fields Medal, underscores its difficulty. A leading technique for upper bounds, the three-point method, reduces the problem to solving large, high-precision semidefinite programs (SDPs). Because each candidate SDP may take days to evaluate, standard data-intensive AI approaches are infeasible. We address this challenge by formulating SDP construction as a sequential decision process, the SDP game, in which a policy assembles SDP formulations from a set of admissible components. Using a sample-efficient model-based framework that combines Bayesian optimisation with Monte Carlo Tree Search, we obtain new state-of-the-art upper bounds in dimensions $4-16$, showing that model-based search can advance computational progress in longstanding geometric problems. Together, these results demonstrate that sample-efficient, model-based search can make tangible progress on mathematically rigid, evaluation limited problems, pointing towards a complementary direction for AI-assisted discovery beyond large-scale LLM-driven exploration.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Liam McGee, James Harvey, Lucy Cull, Andreas Vermeulen, Bart-Floris Visscher et al.
ๆฅ็ๆ่ฆ
In this preprint, we present A collaborative human-AI approach to building an inspectable semantic layer for Agentic AI. AI agents first propose candidate knowledge structures from diverse data sources; domain experts then validate, correct, and extend these structures, with their feedback used to improve subsequent models. Authors show how this process captures tacit institutional knowledge, improves response quality and efficiency, and mitigates institutional amnesia. We argue for a shift from post-hoc explanation to justifiable Agentic AI, where decisions are grounded in explicit, inspectable evidence and reasoning accessible to both experts and non-specialists.
๐
2025-12-04
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Puria Radmard, Paul M. Bays, Mรกtรฉ Lengyel
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Discovering the neural mechanisms underpinning cognition is one of the grand challenges of neuroscience. However, previous approaches for building models of RNN dynamics that explain behaviour required iterative refinement of architectures and/or optimisation objectives, resulting in a piecemeal, and mostly heuristic, human-in-the-loop process. Here, we offer an alternative approach that automates the discovery of viable RNN mechanisms by explicitly training RNNs to reproduce behaviour, including the same characteristic errors and suboptimalities, that humans and animals produce in a cognitive task. Achieving this required two main innovations. First, as the amount of behavioural data that can be collected in experiments is often too limited to train RNNs, we use a non-parametric generative model of behavioural responses to produce surrogate data for training RNNs. Second, to capture all relevant statistical aspects of the data, we developed a novel diffusion model-based approach for training RNNs. To showcase the potential of our approach, we chose a visual working memory task as our test-bed, as behaviour in this task is well known to produce response distributions that are patently multimodal (due to swap errors). The resulting network dynamics correctly qualitative features of macaque neural data. Importantly, these results were not possible to obtain with more traditional approaches, i.e., when only a limited set of behavioural signatures (rather than the full richness of behavioural response distributions) were fitted, or when RNNs were trained for task optimality (instead of reproducing behaviour). Our approach also yields novel predictions about the mechanism of swap errors, which can be readily tested in experiments. These results suggest that fitting RNNs to rich patterns of behaviour provides a powerful way to automatically discover mechanisms of important cognitive functions.
๐
2025-12-04
๐ 14 pages, 9 figures. Submitted to Journal of High
๐ฅ Yuhao Lu, HengJian SiTu, Jie Li, Yixuan Li, Yang Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present a population-scale catalogue of 287,872 supermassive black hole masses with high accuracy. Using a deep encoder-decoder network trained on optical spectra with reverberation-mapping (RM) based labels of 849 quasars and applied to all SDSS quasars up to $z=4$, our method achieves a root-mean-square error of $0.058$\,dex, a relative uncertainty of $\approx 14\%$, and coefficient of determination $R^{2}\approx0.91$ with respect to RM-based masses, far surpassing traditional single-line virial estimators. Notably, the high accuracy is maintained for both low ($<10^{7.5}\,M_\odot$) and high ($>10^{9}\,M_\odot$) mass quasars, where empirical relations are unreliable.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ SIMA team, Adrian Bolton, Alexander Lerchner, Alexandra Cordell, Alexandre Moufarek et al.
Robotics
ๆฅ็ๆ่ฆ
We introduce SIMA 2, a generalist embodied agent that understands and acts in a wide variety of 3D virtual worlds. Built upon a Gemini foundation model, SIMA 2 represents a significant step toward active, goal-directed interaction within an embodied environment. Unlike prior work (e.g., SIMA 1) limited to simple language commands, SIMA 2 acts as an interactive partner, capable of reasoning about high-level goals, conversing with the user, and handling complex instructions given through language and images. Across a diverse portfolio of games, SIMA 2 substantially closes the gap with human performance and demonstrates robust generalization to previously unseen environments, all while retaining the base model's core reasoning capabilities. Furthermore, we demonstrate a capacity for open-ended self-improvement: by leveraging Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment. This work validates a path toward creating versatile and continuously learning agents for both virtual and, eventually, physical worlds.
๐
2025-12-04
๐ ArXiv Preprint (cs.SD)
๐ฅ Gongyu Chen, Xiaoyu Zhang, Zhenqiang Weng, Junjie Zheng, Da Shen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied and harmony-contaminated conditions, demonstrating its effectiveness for real-world SVC deployment.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Eranga Bandara, Amin Hass, Ross Gore, Sachin Shetty, Ravi Mukkamala et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
AI agent-based systems are becoming increasingly integral to modern software architectures, enabling autonomous decision-making, dynamic task execution, and multimodal interactions through large language models (LLMs). However, these systems introduce novel and evolving security challenges, including prompt injection attacks, context poisoning, model manipulation, and opaque agent-to-agent communication, that are not effectively captured by traditional threat modeling frameworks. In this paper, we introduce ASTRIDE, an automated threat modeling platform purpose-built for AI agent-based systems. ASTRIDE extends the classical STRIDE framework by introducing a new threat category, A for AI Agent-Specific Attacks, which encompasses emerging vulnerabilities such as prompt injection, unsafe tool invocation, and reasoning subversion, unique to agent-based applications. To automate threat modeling, ASTRIDE combines a consortium of fine-tuned vision-language models (VLMs) with the OpenAI-gpt-oss reasoning LLM to perform end-to-end analysis directly from visual agent architecture diagrams, such as data flow diagrams(DFDs). LLM agents orchestrate the end-to-end threat modeling automation process by coordinating interactions between the VLM consortium and the reasoning LLM. Our evaluations demonstrate that ASTRIDE provides accurate, scalable, and explainable threat modeling for next-generation intelligent systems. To the best of our knowledge, ASTRIDE is the first framework to both extend STRIDE with AI-specific threats and integrate fine-tuned VLMs with a reasoning LLM to fully automate diagram-driven threat modeling in AI agent-based applications.
๐
2025-12-04
๐ ArXiv Preprint (cs.SD)
๐ฅ Junjie Zheng, Chunbo Hao, Guobin Ma, Xiaoyu Zhang, Gongyu Chen et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment and manually annotated melody contours, requirements that are resource-intensive and hinder scalability. To overcome these limitations, we propose a melody-driven SVS framework capable of synthesizing arbitrary lyrics following any reference melody, without relying on phoneme-level alignment. Our method builds on a Diffusion Transformer (DiT) architecture, enhanced with a dedicated melody extraction module that derives melody representations directly from reference audio. To ensure robust melody encoding, we employ a teacher model to guide the optimization of the melody extractor, alongside an implicit alignment mechanism that enforces similarity distribution constraints for improved melodic stability and coherence. Additionally, we refine duration modeling using weakly annotated song data and introduce a Flow-GRPO reinforcement learning strategy with a multi-objective reward function to jointly enhance pronunciation clarity and melodic fidelity. Experiments show that our model achieves superior performance over existing approaches in both objective measures and subjective listening tests, especially in zero-shot and lyric adaptation settings, while maintaining high audio quality without manual annotation. This work offers a practical and scalable solution for advancing data-efficient singing voice synthesis. To support reproducibility, we release our inference code and model checkpoints.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Giorgos Polychronis, Foivos Pournaropoulos, Christos D. Antonopoulos, Spyros Lalis
Machine Learning
ๆฅ็ๆ่ฆ
Drones are becoming indispensable in many application domains. In data-driven missions, besides sensing, the drone must process the collected data at runtime to decide whether additional action must be taken on the spot, before moving to the next point of interest. If processing does not reveal an event or situation that requires such an action, the drone has waited in vain instead of moving to the next point. If, however, the drone starts moving to the next point and it turns out that a follow-up action is needed at the previous point, it must spend time to fly-back. To take this decision, we propose different machine-learning methods based on branch prediction and reinforcement learning. We evaluate these methods for a wide range of scenarios where the probability of event occurrence changes with time. Our results show that the proposed methods consistently outperform the regression-based method proposed in the literature and can significantly improve the worst-case mission time by up to 4.1x. Also, the achieved median mission time is very close, merely up to 2.7% higher, to that of a method with perfect knowledge of the current underlying event probability at each point of interest.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Yuxing Wang, Zhiyu Chen, Tiantian Zhang, Qiyue Yin, Yongzhe Chang et al.
Robotics
ๆฅ็ๆ่ฆ
Brain-body co-evolution enables animals to develop complex behaviors in their environments. Inspired by this biological synergy, embodied co-design (ECD) has emerged as a transformative paradigm for creating intelligent agents-from virtual creatures to physical robots-by jointly optimizing their morphologies and controllers rather than treating control in isolation. This integrated approach facilitates richer environmental interactions and robust task performance. In this survey, we provide a systematic overview of recent advances in ECD. We first formalize the concept of ECD and position it within related fields. We then introduce a hierarchical taxonomy: a lower layer that breaks down agent design into three fundamental components-controlling brain, body morphology, and task environment-and an upper layer that integrates these components into four major ECD frameworks: bi-level, single-level, generative, and open-ended. This taxonomy allows us to synthesize insights from more than one hundred recent studies. We further review notable benchmarks, datasets, and applications in both simulated and real-world scenarios. Finally, we identify significant challenges and offer insights into promising future research directions. A project associated with this survey has been created at https://github.com/Yuxing-Wang-THU/SurveyBrainBody.
๐
2025-12-04
๐ AAAI 2026
๐ฅ Dario Pesenti, Alessandro Bogani, Katya Tentori, Stefano Teso
ๆฅ็ๆ่ฆ
Explanatory Interactive Learning (XIL) is a powerful interactive learning framework designed to enable users to customize and correct AI models by interacting with their explanations. In a nutshell, XIL algorithms select a number of items on which an AI model made a decision (e.g. images and their tags) and present them to users, together with corresponding explanations (e.g. image regions that drive the model's decision). Then, users supply corrective feedback for the explanations, which the algorithm uses to improve the model. Despite showing promise in debugging tasks, recent studies have raised concerns that explanatory interaction may trigger order effects, a well-known cognitive bias in which the sequence of presented items influences users' trust and, critically, the quality of their feedback. We argue that these studies are not entirely conclusive, as the experimental designs and tasks employed differ substantially from common XIL use cases, complicating interpretation. To clarify the interplay between order effects and explanatory interaction, we ran two larger-scale user studies (n = 713 total) designed to mimic common XIL tasks. Specifically, we assessed order effects both within and between debugging sessions by manipulating the order in which correct and wrong explanations are presented to participants. Order effects had a limited, through significant impact on users' agreement with the model (i.e., a behavioral measure of their trust), and only when examined withing debugging sessions, not between them. The quality of users' feedback was generally satisfactory, with order effects exerting only a small and inconsistent influence in both experiments. Overall, our findings suggest that order effects do not pose a significant issue for the successful employment of XIL approaches. More broadly, our work contributes to the ongoing efforts for understanding human factors in AI.
๐
2025-12-04
๐ ArXiv Preprint (physics.geo-ph)
๐ฅ Yijia Song, Juliet Biggs, Alin Achim, Robert Popescu, Simon Orrego et al.
ๆฅ็ๆ่ฆ
Phase unwrapping is a fundamental problem in InSAR data processing, supporting geophysical applications such as deformation monitoring and hazard assessment. Its reliability is limited by noise and decorrelation in radar acquisitions, which makes accurate reconstruction of the deformation signal challenging. We propose a denoising diffusion probabilistic model (DDPM)-based framework for InSAR phase unwrapping, UnwrapDiff, in which the output of the traditional minimum cost flow algorithm (SNAPHU) is incorporated as conditional guidance. To evaluate robustness, we construct a synthetic dataset that incorporates atmospheric effects and diverse noise patterns, representative of realistic InSAR observations. Experiments show that the proposed model leverages the conditional prior while reducing the effect of diverse noise patterns, achieving on average a 10.11\% reduction in NRMSE compared to SNAPHU. It also achieves better reconstruction quality in difficult cases such as dyke intrusions.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen
Natural Language Processing
ๆฅ็ๆ่ฆ
Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2-bits and even 4-bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed-precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1 percent variance at 4-5 bits and strong results even at 2 bits. The implementation is available at https://github.com/intel/auto-round.
๐
2025-12-04
๐ ArXiv Preprint (math.OC)
๐ฅ Francesca Rossi, Veronica Centorrino, Francesco Bullo, Giovanni Russo
ๆฅ็ๆ่ฆ
The ability to compose acquired skills to plan and execute behaviors is a hallmark of natural intelligence. Yet, despite remarkable cross-disciplinary efforts, a principled account of how task structure shapes gating and how such computations could be delivered in neural circuits, remains elusive. Here we introduce GateMod, an interpretable theoretically grounded computational model linking the emergence of gating to the underlying decision-making task, and to a neural circuit architecture. We first develop GateFrame, a normative framework casting policy gating into the minimization of the free energy. This framework, relating gating rules to task, applies broadly across neuroscience, cognitive and computational sciences. We then derive GateFlow, a continuous-time energy based dynamics that provably converges to GateFrame optimal solution. Convergence, exponential and global, follows from a contractivity property that also yields robustness and other desirable properties. Finally, we derive a neural circuit from GateFlow, GateNet. This is a soft-competitive recurrent circuit whose components perform local and contextual computations consistent with known dendritic and neural processing motifs. We evaluate GateMod across two different settings: collective behaviors in multi-agent systems and human decision-making in multi-armed bandits. In all settings, GateMod provides interpretable mechanistic explanations of gating and quantitatively matches or outperforms established models. GateMod offers a unifying framework for neural policy gating, linking task objectives, dynamical computation, and circuit-level mechanisms. It provides a framework to understand gating in natural agents beyond current explanations and to equip machines with this ability.
๐
2025-12-04
๐ ICDE
๐ฅ Zhuoyue Wan, Wentao Hu, Chen Jason Zhang, Yuanfeng Song, Shuaimin Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Bridging natural language and structured query languages is a long-standing challenge in the database community. While recent advances in language models have shown promise in this direction, existing solutions often rely on large-scale closed-source models that suffer from high inference costs, limited transparency, and lack of adaptability for lightweight deployment. In this paper, we present OsmT, an open-source tag-aware language model specifically designed to bridge natural language and Overpass Query Language (OverpassQL), a structured query language for accessing large-scale OpenStreetMap (OSM) data. To enhance the accuracy and structural validity of generated queries, we introduce a Tag Retrieval Augmentation (TRA) mechanism that incorporates contextually relevant tag knowledge into the generation process. This mechanism is designed to capture the hierarchical and relational dependencies present in the OSM database, addressing the topological complexity inherent in geospatial query formulation. In addition, we define a reverse task, OverpassQL-to-Text, which translates structured queries into natural language explanations to support query interpretation and improve user accessibility. We evaluate OsmT on a public benchmark against strong baselines and observe consistent improvements in both query generation and interpretation. Despite using significantly fewer parameters, our model achieves competitive accuracy, demonstrating the effectiveness of open-source pre-trained language models in bridging natural language and structured query languages within schema-rich geospatial environments.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valenc-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yigui Feng, Qinglin Wang, Haotian Mo, Yang Liu, Ke Liu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement(MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, Third, we designed the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been opened.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kuinan Hou, Marco Zorzi, Alberto Testolin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reliably counting and generating sequences of items remain a significant challenge for neural networks, including Large Language Models (LLMs). Indeed, although this capability is readily handled by rule-based symbolic systems based on serial computation, learning to systematically deploy counting procedures is difficult for neural models, which should acquire these skills through learning. Previous research has demonstrated that recurrent architectures can only approximately track and enumerate sequences of events, and it remains unclear whether modern deep learning systems, including LLMs, can deploy systematic counting procedures over sequences of discrete symbols. This paper aims to fill this gap by investigating the sequential enumeration abilities of five state-of-the-art LLMs, including proprietary, open-source, and reasoning models. We probe LLMs in sequential naming and production tasks involving lists of letters and words, adopting a variety of prompting instructions to explore the role of chain-of-thought in the spontaneous emerging of counting strategies. We also evaluate open-source models with the same architecture but increasing size to see whether the mastering of counting principles follows scaling laws, and we analyze the embedding dynamics during sequential enumeration to investigate the emergent encoding of numerosity. We find that some LLMs are indeed capable of deploying counting procedures when explicitly prompted to do so, but none of them spontaneously engage in counting when simply asked to enumerate the number of items in a sequence. Our results suggest that, despite their impressive emergent abilities, LLMs cannot yet robustly and systematically deploy counting procedures, highlighting a persistent gap between neural and symbolic approaches to compositional generalization.
๐
2025-12-04
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ Haodong Feng, Lugang Ye, Dixia Fan
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The integration of artificial intelligence into experimental fluid mechanics promises to accelerate discovery, yet most AI applications remain narrowly focused on numerical studies. This work proposes an AI Fluid Scientist framework that autonomously executes the complete experimental workflow: hypothesis generation, experimental design, robotic execution, data analysis, and manuscript preparation. We validate this through investigation of vortex-induced vibration (VIV) and wake-induced vibration (WIV) in tandem cylinders. Our work has four key contributions: (1) A computer-controlled circulating water tunnel (CWT) with programmatic control of flow velocity, cylinder position, and forcing parameters (vibration frequency and amplitude) with data acquisition (displacement, force, and torque). (2) Automated experiments reproduce literature benchmarks (Khalak and Williamson [1999] and Assi et al. [2013, 2010]) with frequency lock-in within 4% and matching critical spacing trends. (3) The framework with Human-in-the-Loop (HIL) discovers more WIV amplitude response phenomena, and uses a neural network to fit physical laws from data, which is 31% higher than that of polynomial fitting. (4) The framework with multi-agent with virtual-real interaction system executes hundreds of experiments end-to-end, which automatically completes the entire process of scientific research from hypothesis generation, experimental design, experimental execution, data analysis, and manuscript preparation. It greatly liberates human researchers and improves study efficiency, providing new paradigm for the development and research of experimental fluid mechanics.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Andrew Paterson, Carl Sanders
ๆฅ็ๆ่ฆ
For years, the discourse around poker AI has been dominated by the concept of solvers and the pursuit of unexploitable, machine-perfect play. This paper challenges that orthodoxy. It presents Patrick, an AI built on the contrary philosophy: that the path to victory lies not in being unexploitable, but in being maximally exploitative. Patrick's architecture is a purpose-built engine for understanding and attacking the flawed, psychological, and often irrational nature of human opponents. Through detailed analysis of its design, its novel prediction-anchored learning method, and its profitable performance in a 64,267-hand trial, this paper makes the case that the solved myth is a distraction from the real, far more interesting challenge: creating AI that can master the art of human imperfection.
๐
2025-12-04
๐ ArXiv Preprint (cs.SD)
๐ฅ Yun Tian, Zhijin Qin, Guocheng Lv, Ye Jin, Kaibin Huang et al.
Robotics
ๆฅ็ๆ่ฆ
Existing speech semantic communication systems mainly based on Joint Source-Channel Coding (JSCC) architectures have demonstrated impressive performance, but their effectiveness remains limited by model structures specifically designed for particular tasks and datasets. Recent advances indicate that generative large models pre-trained on massive datasets, can achieve outstanding performance arexhibit exceptional performance across diverse downstream tasks with minimal fine-tuning. To exploit the rich semantic knowledge embedded in large models and enable adaptive transmission over lossy channels, we propose a Large Speech Model enabled Semantic Communication (LargeSC) system. Simultaneously achieving adaptive compression and robust transmission over lossy channels remains challenging, requiring trade-offs among compression efficiency, speech quality, and latency. In this work, we employ the Mimi as a speech codec, converting speech into discrete tokens compatible with existing network architectures. We propose an adaptive controller module that enables adaptive transmission and in-band Unequal Error Protection (UEP), dynamically adjusting to both speech content and packet loss probability under bandwidth constraints. Additionally, we employ Low-Rank Adaptation (LoRA) to finetune the Moshi foundation model for generative recovery of lost speech tokens. Simulation results show that the proposed system supports bandwidths ranging from 550 bps to 2.06 kbps, outperforms conventional baselines in speech quality under high packet loss rates and achieves an end-to-end latency of approximately 460 ms, thereby demonstrating its potential for real-time deployment.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Baris Yilmaz, Bevan Deniz Cilgin, Erdem Akagรผndรผz, Salih Tileylioglu
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Effective earthquake risk reduction relies on accurate site-specific evaluations. This requires models that can represent the influence of local site conditions on ground motion characteristics. In this context, data driven approaches that learn site controlled signatures from recorded ground motions offer a promising direction. We address strong ground motion generation from time-domain accelerometer records and introduce the TimesNet-Gen, a time-domain conditional generator. The approach uses a station specific latent bottleneck. We evaluate generation by comparing HVSR curves and fundamental site-frequency $f_0$ distributions between real and generated records per station, and summarize station specificity with a score based on the $f_0$ distribution confusion matrices. TimesNet-Gen achieves strong station-wise alignment and compares favorably with a spectrogram-based conditional VAE baseline for site-specific strong motion synthesis. Our codes are available via https://github.com/brsylmz23/TimesNet-Gen.
๐
2025-12-04
๐ AAAI
๐ฅ Jae Hee Lee, Anne Lauscher, Stefano V. Albrecht
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.
๐
2025-12-04
๐ Accepted by ACM Transactions on Autonomous and Ada
๐ฅ Jialong Li, Mingyue Zhang, Nianyu Li, Danny Weyns, Zhi Jin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Self-adaptive systems (SASs) are designed to handle changes and uncertainties through a feedback loop with four core functionalities: monitoring, analyzing, planning, and execution. Recently, generative artificial intelligence (GenAI), especially the area of large language models, has shown impressive performance in data comprehension and logical reasoning. These capabilities are highly aligned with the functionalities required in SASs, suggesting a strong potential to employ GenAI to enhance SASs. However, the specific benefits and challenges of employing GenAI in SASs remain unclear. Yet, providing a comprehensive understanding of these benefits and challenges is complex due to several reasons: limited publications in the SAS field, the technological and application diversity within SASs, and the rapid evolution of GenAI technologies. To that end, this paper aims to provide researchers and practitioners a comprehensive snapshot that outlines the potential benefits and challenges of employing GenAI's within SAS. Specifically, we gather, filter, and analyze literature from four distinct research fields and organize them into two main categories to potential benefits: (i) enhancements to the autonomy of SASs centered around the specific functions of the MAPE-K feedback loop, and (ii) improvements in the interaction between humans and SASs within human-on-the-loop settings. From our study, we outline a research roadmap that highlights the challenges of integrating GenAI into SASs. The roadmap starts with outlining key research challenges that need to be tackled to exploit the potential for applying GenAI in the field of SAS. The roadmap concludes with a practical reflection, elaborating on current shortcomings of GenAI and proposing possible mitigation strategies.
๐
2025-12-04
๐ ACL
๐ฅ Jinbo Liu, Defu Cao, Yifei Wei, Tianyao Su, Yuan Liang et al.
Robotics
ๆฅ็ๆ่ฆ
Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a framework that measures how network structure shapes leakage. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over up to 10 interaction rounds, we quantify leakage as the fraction of ground-truth PII recovered from attacking agent outputs via exact matching. We systematically evaluate six common network topologies (fully connected, ring, chain, binary tree, star, and star-ring), varying agent counts $n\in\{4,5,6\}$, attacker-target placements, and base models. Our findings reveal consistent patterns: fully connected graphs exhibit maximum leakage while chains provide strongest protection; shorter attacker-target graph distance and higher target centrality significantly increase vulnerability; leakage rises sharply in early rounds before plateauing; model choice shifts absolute leakage rates but preserves topology rankings; temporal/locational PII attributes leak more readily than identity credentials or regulated identifiers. These results provide the first systematic mapping from architectural choices to measurable privacy risk, yielding actionable guidance: prefer sparse or hierarchical connectivity, maximize attacker-target separation, limit node degree and network radius, avoid shortcuts bypassing hubs, and implement topology-aware access controls.
๐
2025-12-04
๐ ArXiv Preprint (cs.MA)
๐ฅ Pouria Yazdani, Arash Rezaali, Monireh Abdoos
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Multi-agent reinforcement learning (MARL) has emerged as a promising paradigm for adaptive traffic signal control (ATSC) of multiple intersections. Existing approaches typically follow either a fully centralized or a fully decentralized design. Fully centralized approaches suffer from the curse of dimensionality, and reliance on a single learning server, whereas purely decentralized approaches operate under severe partial observability and lack explicit coordination resulting in suboptimal performance. These limitations motivate region-based MARL, where the network is partitioned into smaller, tightly coupled intersections that form regions, and training is organized around these regions. This paper introduces a Semi-Centralized Training, Decentralized Execution (SEMI-CTDE) architecture for multi intersection ATSC. Within each region, SEMI-CTDE performs centralized training with regional parameter sharing and employs composite state and reward formulations that jointly encode local and regional information. The architecture is highly transferable across different policy backbones and state-reward instantiations. Building on this architecture, we implement two models with distinct design objectives. A multi-perspective experimental analysis of the two implemented SEMI-CTDE-based models covering ablations of the architecture's core elements including rule based and fully decentralized baselines shows that they achieve consistently superior performance and remain effective across a wide range of traffic densities and distributions.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporal inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g. object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improves VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
๐
2025-12-04
๐ ArXiv Preprint (cs.SI)
๐ฅ Saumya Chauhan, Mila Hong, Maria Vazhaeparambil
ๆฅ็ๆ่ฆ
AI-generated content and misinformation are increasingly prevalent on social networks. While prior research primarily examined textual misinformation, fewer studies have focused on visual content's role in virality. In this work, we present the first large-scale analysis of how misinformation and AI-generated images propagate through repost cascades across five ideologically diverse Reddit communities. By integrating textual sentiment, visual attributes, and diffusion metrics (e.g., time-to-first repost, community reach), our framework accurately predicts both immediate post-level virality (AUC=0.83) and long-term cascade-level spread (AUC=0.998). These findings offer essential insights for moderating synthetic and misleading visual content online.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Thibaut Boissin, Thomas Massena, Franck Mamalet, Mathieu Serrurier
Machine Learning
ๆฅ็ๆ่ฆ
Orthogonality-based optimizers, such as Muon, have recently shown strong performance across large-scale training and community-driven efficiency challenges. However, these methods rely on a costly gradient orthogonalization step. Even efficient iterative approximations such as Newton-Schulz remain expensive, typically requiring dozens of matrix multiplications to converge. We introduce a preconditioning procedure that accelerates Newton-Schulz convergence and reduces its computational cost. We evaluate its impact and show that the overhead of our preconditioning can be made negligible. Furthermore, the faster convergence it enables allows us to remove one iteration out of the usual five without degrading approximation quality. Our publicly available implementation achieves up to a 2.8x speedup in the Newton-Schulz approximation. We also show that this has a direct impact on end-to-end training runtime with 5-10% improvement in realistic training scenarios across two efficiency-focused tasks. On challenging language or vision tasks, we validate that our method maintains equal or superior model performance while improving runtime. Crucially, these improvements require no hyperparameter tuning and can be adopted as a simple drop-in replacement. Our code is publicly available on github.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chenyang Zuo, Siqi Fan, Zaiqing Nie
Natural Language Processing
ๆฅ็ๆ่ฆ
Molecules play a crucial role in biomedical research and discovery, particularly in the field of small molecule drug development. Given the rapid advancements in large language models, especially the recent emergence of reasoning models, it is natural to explore how a general-purpose language model can be efficiently adapted for molecular science applications. In this work, we introduce BioMedGPT-Mol, a molecular language model designed to support molecular understanding and generation tasks. By curating and unifying existing public instruction datasets, we have assembled a large-scale, comprehensive, and high-quality training dataset. The model is then fine-tuned through a meticulously designed multi-task learning framework. On a consolidated benchmark derived from LlaSMol, TOMG-Bench, and MuMOInstruct, BioMedGPT-Mol achieves remarkable performance. Our experimental results demonstrate that a general-purpose reasoning model can be effectively and efficiently post-trained into a professional molecular language model through a well-structured multi-task curriculum. Leveraging the power of it, we further explore retrosynthetic planning task, and the performance on RetroBench demonstrates its competitive capability of acting as an end-to-end retrosynthetic planner. We anticipate that our approach can be extended to other biomedical scientific domains.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mohamed Baha Ben Ticha, Xingchen Ran, Guillaume Saldanha, Gaรซl Le Godais, Philรฉmon Roussel et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Speech Brain Computer Interfaces (BCIs) offer promising solutions to people with severe paralysis unable to communicate. A number of recent studies have demonstrated convincing reconstruction of intelligible speech from surface electrocorticographic (ECoG) or intracortical recordings by predicting a series of phonemes or words and using downstream language models to obtain meaningful sentences. A current challenge is to reconstruct speech in a streaming mode by directly regressing cortical signals into acoustic speech. While this has been achieved recently using intracortical data, further work is needed to obtain comparable results with surface ECoG recordings. In particular, optimizing neural decoders becomes critical in this case. Here we present an offline speech decoding pipeline based on an encoder-decoder deep neural architecture, integrating Vision Transformers and contrastive learning to enhance the direct regression of speech from ECoG signals. The approach is evaluated on two datasets, one obtained with clinical subdural electrodes in an epileptic patient, and another obtained with the fully implantable WIMAGINE epidural system in a participant of a motor BCI trial. To our knowledge this presents a first attempt to decode speech from a fully implantable and wireless epidural recording system offering perspectives for long-term use.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Michael Klenk
Robotics
ๆฅ็ๆ่ฆ
This chapter discusses the ethics of generative AI. It provides a technical primer to show how generative AI affords experiencing technology as if it were human, and this affordance provides a fruitful focus for the philosophical ethics of generative AI. It then shows how generative AI can both aggravate and alleviate familiar ethical concerns in AI ethics, including responsibility, privacy, bias and fairness, and forms of alienation and exploitation. Finally, the chapter examines ethical questions that arise specifically from generative AI's mimetic generativity, such as debates about authorship and credit, the emergence of as-if social relationships with machines, and new forms of influence, persuasion, and manipulation.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Wu, Chuhao Zhou, Guangyu Zhao, Haozhi Cao, Yewen Pu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
๐
2025-12-04
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Huifeng Zhu, Shijie Li, Qinfeng Li, Yier Jin
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
To enhance the performance of large language models (LLMs) in various domain-specific applications, sensitive data such as healthcare, law, and finance are being used to privately customize or fine-tune these models. Such privately adapted LLMs are regarded as either personal privacy assets or corporate intellectual property. Therefore, protecting model weights and maintaining strict confidentiality during deployment and distribution have become critically important. However, existing model formats and deployment frameworks provide little to no built-in support for confidentiality, access control, or secure integration with trusted hardware. Current methods for securing model deployment either rely on computationally expensive cryptographic techniques or tightly controlled private infrastructure. Although these approaches can be effective in specific scenarios, they are difficult and costly for widespread deployment.
In this paper, we introduce CryptoTensors, a secure and format-compatible file structure for confidential LLM distribution. Built as an extension to the widely adopted Safetensors format, CryptoTensors incorporates tensor-level encryption and embedded access control policies, while preserving critical features such as lazy loading and partial deserialization. It enables transparent decryption and automated key management, supporting flexible licensing and secure model execution with minimal overhead. We implement a proof-of-concept library, benchmark its performance across serialization and runtime scenarios, and validate its compatibility with existing inference frameworks, including Hugging Face Transformers and vLLM. Our results highlight CryptoTensors as a light-weight, efficient, and developer-friendly solution for safeguarding LLM weights in real-world and widespread deployments.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose \textbf{Soft Q-based Diffusion Finetuning (SQDF)}, a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.
๐
2025-12-04
๐ ICASSP 2026
๐ฅ Cong Wang, Changfeng Gao, Yang Xiang, Zhihao Du, Keyu An et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts to achieve spurious rewards, but at the cost of degrading perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme develops a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotions. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. The subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, leading to significant improvements in both emotional expressiveness and naturalness over all baselines. Demo page: https://lrwinr.github.io/RRPO-CosyVoice.
๐
2025-12-04
๐ ICASSP 2026
๐ฅ Cong Wang, Yizhong Geng, Yuhua Wen, Qifei Li, Yingming Gao et al.
ๆฅ็ๆ่ฆ
Speech emotion recognition (SER) is an important technology in human-computer interaction. However, achieving high performance is challenging due to emotional complexity and scarce annotated data. To tackle these challenges, we propose a multi-loss learning (MLL) framework integrating an energy-adaptive mixup (EAM) method and a frame-level attention module (FLAM). The EAM method leverages SNR-based augmentation to generate diverse speech samples capturing subtle emotional variations. FLAM enhances frame-level feature extraction for multi-frame emotional cues. Our MLL strategy combines Kullback-Leibler divergence, focal, center, and supervised contrastive loss to optimize learning, address class imbalance, and improve feature separability. We evaluate our method on four widely used SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE. The results demonstrate our method achieves state-of-the-art performance, suggesting its effectiveness and robustness.
๐
2025-12-04
๐ NeurIPS 2025
๐ฅ Yangning Li, Shaoshen Chen, Yinghui Li, Yankai Chen, Hai-Tao Zheng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The quadratic complexity of self-attention constrains Large Language Models (LLMs) in processing long contexts, a capability essential for many advanced applications. Context compression aims to alleviate this computational bottleneck while retaining critical semantic information. However, existing approaches often fall short: explicit methods may compromise local detail, whereas implicit methods can suffer from positional biases, information degradation, or an inability to capture long-range semantic dependencies. We propose AdmTree, a novel framework for adaptive, hierarchical context compression with a central focus on preserving high semantic fidelity while maintaining efficiency. AdmTree dynamically segments input based on information density, utilizing gist tokens to summarize variable-length segments as the leaves of a semantic binary tree. This structure, together with a lightweight aggregation mechanism and a frozen backbone LLM (thereby minimizing new trainable parameters), enables efficient hierarchical abstraction of the context. By preserving fine-grained details alongside global semantic coherence, mitigating positional bias, and dynamically adapting to content, AdmTree robustly retains the semantic information of long contexts.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bita Baroutian, Atefe Aghaei, Mohsen Ebrahimi Moghaddam
ๆฅ็ๆ่ฆ
Alcohol consumption is a significant public health concern and a major cause of accidents and fatalities worldwide. This study introduces a novel video-based facial sequence analysis approach dedicated to the detection of alcohol intoxication. The method integrates facial landmark analysis via a Graph Attention Network (GAT) with spatiotemporal visual features extracted using a 3D ResNet. These features are dynamically fused with adaptive prioritization to enhance classification performance. Additionally, we introduce a curated dataset comprising 3,542 video segments derived from 202 individuals to support training and evaluation. Our model is compared against two baselines: a custom 3D-CNN and a VGGFace+LSTM architecture. Experimental results show that our approach achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming prior methods. The findings demonstrate the model's potential for practical deployment in public safety systems for non-invasive, reliable alcohol intoxication detection.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The integration of external tools is pivotal for empowering Large Language Model (LLM) agents with real-world capabilities. However, training these agents through direct, continuous interaction with diverse tools is often prohibitively expensive, slow, and introduces additional development and maintenance overhead. To address this challenge, we introduce the Generalist Tool Model (GTM), a 1.5-billion-parameter model that learns to act as a universal tool simulator. With only prompt-level configuration, GTM accesses tool functionalities along with input arguments and generates outputs that faithfully mimic real tool execution, providing a fast and cost-effective solution that eliminates development overhead. To build GTM, we propose the Context-Aware Response Generation (CARG) pipeline, which synthesizes comprehensive training data covering over 20,000 tools across 300 domains including physics, medicine, robotics, and finance. Through this pipeline, GTM learns to produce not only syntactically correct outputs but also logically coherent and contextually appropriate responses. Experiments demonstrate that GTM produces high-quality outputs with strong consistency and reliability. Besides when used in real reinforcement learning scenarios for agent training, GTM exhibits significantly faster simulation speed compared to real tools while maintaining comparable output quality, along with remarkable generalization and domain adaptability. Our results establish GTM as a foundational component for developing future AI agents, enabling efficient and scalable training of tool-augmented systems.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu-Wei Zhan, Xin Wang, Hong Chen, Tongtong Feng, Wei Feng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also capturing physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physical-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable physical dynamic representations. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model's original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM employs a self-supervised manner to model the continuous evolution of object motion. Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xin Liang, Xiang Zhang, Yiwei Xu, Siqi Sun, Chenyu You
Multimodal
ๆฅ็ๆ่ฆ
Generating academic slides from scientific papers is a challenging multimodal reasoning task that requires both long context understanding and deliberate visual planning. Existing approaches largely reduce it to text only summarization, overlooking the visual component and design intensive nature of slide creation. In this paper we introduce SlideGen, an agentic, modular, and visual in the loop framework for scientific paper to slide generation. SlideGen orchestrates a group of vision language agents that reason collaboratively over the document structure and semantics, producing editable PPTX slides with logical flow and compelling visual presentation. By integrating coordinated outlining, mapping, arrangement, note synthesis, and iterative refinement, our system consistently delivers slides of expert level quality. Across diverse benchmarks and strong baselines, SlideGen outperforms existing methods in visual quality, content faithfulness, and readability, positioning it as the new state of the art in automated slide generation. Our work establishes a foundation for design aware multimodal slide generation, demonstrating how agentic collaboration can bridge understanding and presentation in complex multimodal reasoning tasks.
๐
2025-12-04
๐ AAAI 2026
๐ฅ Tianle Hu, Weijun Lv, Na Han, Xiaozhao Fang, Jie Wen et al.
ๆฅ็ๆ่ฆ
Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tianmai M. Zhang, Zhaoyi Sun, Sihang Zeng, Chenxi Li, Neil F. Abernethy et al.
Machine Learning
ๆฅ็ๆ่ฆ
The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2 -- generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yu-Wei Zhan, Xin Wang, Pengzhe Mao, Tongtong Feng, Ren Wang et al.
MultimodalMachine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM's latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these limitations, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM's latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLM's semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Edervaldo Melo
ๆฅ็ๆ่ฆ
This paper presents the Nemosine Framework, a modular cognitive architecture designed to support assisted reasoning, structured thinking, and systematic analysis. The model operates through functional cognitive modules ("personas") that organize tasks such as planning, evaluation, cross-checking, and narrative synthesis. The framework combines principles from metacognition, distributed cognition, and modular cognitive systems to offer an operational structure for assisted problem-solving and decision support. The architecture is documented through formal specification, internal consistency criteria, and reproducible structural components. The goal is to provide a clear conceptual basis for future computational implementations and to contribute to the study of symbolic-modular architectures for reasoning.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Nate Straub, Saara Khan, Katharina Jay, Brian Cabral, Oskar Linde
ๆฅ็ๆ่ฆ
We demonstrate the importance of persona-based multi-agents brainstorming for both diverse topics and subject matter ideation. Prior work has shown that generalized multi-agent collaboration often provides better reasoning than a single agent alone. In this paper, we propose and develop a framework for persona-based agent selection, showing how persona domain curation can improve brainstorming outcomes. Using multiple experimental setups, we evaluate brainstorming outputs across different persona pairings (e.g., Doctor vs VR Engineer) and A2A (agent-to-agent) dynamics (separate, together, separate-then-together). Our results show that (1) persona choice shapes idea domains, (2) collaboration mode shifts diversity of idea generation, and (3) multi-agent persona-driven brainstorming produces idea depth and cross-domain coverage.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Pedro Passos
Machine Learning
ๆฅ็ๆ่ฆ
In elite soccer, substitution decisions entail significant financial and sporting consequences yet remain heavily reliant on intuition or predictive models that merely mimic historical biases. This paper introduces a Fuzzy Logic based Decision Support System (DSS) designed for real time, prescriptive game management. Unlike traditional Machine Learning approaches that encounter a predictive ceiling by attempting to replicate human behavior, our system audits performance through an objective, rule based inference engine. We propose a methodological advancement by reformulating the PlayeRank metric into a Cumulative Mean with Role Aware Normalization, eliminating the play time exposure bias inherent in cumulative sum models to enable accurate intra match comparison. The system integrates this refined metric with physiological proxies (fatigue) and contextual variables (disciplinary risk modulated by tactical role) to calculate a dynamic Substitution Priority (P final). Validation via a case study of the 2018 FIFA World Cup match between Brazil and Belgium demonstrates the system's ecological validity: it not only aligned with expert consensus on executed substitutions (for example Gabriel Jesus) but, crucially, identified high risk scenarios ignored by human decision makers. Specifically, the model flagged the "FAGNER Paradox" - a maximum priority defensive risk - minutes before a critical yellow card, and detected the "Lukaku Paradox", where an isolated assist masked a severe drop in participation. These results confirm that Fuzzy Logic offers a transparent, explainable, and superior alternative to black box models for optimizing real time tactical decisions.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See www.graphbench.io for further details.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Philip Stephens, Emmanuel Salawu
Robotics
ๆฅ็ๆ่ฆ
We introduce a unified mathematical and probabilistic framework for understanding and comparing diverse AI agent strategies. We bridge the gap between high-level agent design concepts, such as ReAct, multi-agent systems, and control flows, and a rigorous mathematical formulation. Our approach frames agentic processes as a chain of probabilities, enabling a detailed analysis of how different strategies manipulate these probabilities to achieve desired outcomes. Our framework provides a common language for discussing the trade-offs inherent in various agent architectures. One of our many key contributions is the introduction of the "Degrees of Freedom" concept, which intuitively differentiates the optimizable levers available for each approach, thereby guiding the selection of appropriate strategies for specific tasks. This work aims to enhance the clarity and precision in designing and evaluating AI agents, offering insights into maximizing the probability of successful actions within complex agentic systems.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Price Allman, Lian Thang, Dre Simmons, Salmon Riaz
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We present a comparative study of multi-agent reinforcement learning (MARL) algorithms for cooperative warehouse robotics. We evaluate QMIX and IPPO on the Robotic Warehouse (RWARE) environment and a custom Unity 3D simulation. Our experiments reveal that QMIX's value decomposition significantly outperforms independent learning approaches (achieving 3.25 mean return vs. 0.38 for advanced IPPO), but requires extensive hyperparameter tuning -- particularly extended epsilon annealing (5M+ steps) for sparse reward discovery. We demonstrate successful deployment in Unity ML-Agents, achieving consistent package delivery after 1M training steps. While MARL shows promise for small-scale deployments (2-4 robots), significant scaling challenges remain. Code and analyses: https://pallman14.github.io/MARL-QMIX-Warehouse-Robots/
๐
2025-12-04
๐ AAAI 2026
๐ฅ Changjin Kim, HyeokJun Lee, YoungJoon Yoo
Machine Learning
ๆฅ็ๆ่ฆ
Recent image denoising methods have leveraged generative modeling for real noise synthesis to address the costly acquisition of real-world noisy data. However, these generative models typically require camera metadata and extensive target-specific noisy-clean image pairs, often showing limited generalization between settings. In this paper, to mitigate the prerequisites, we propose a Single-Pair Guided Diffusion for generalized noise synthesis GuidNoise, which uses a single noisy/clean pair as the guidance, often easily obtained by itself within a training set. To train GuidNoise, which generates synthetic noisy images from the guidance, we introduce a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss to leverage the inherent potential of diffusion models. This loss function refines the diffusion model's backward process, making the model more adept at generating realistic noise distributions. The GuidNoise synthesizes high-quality noisy images under diverse noise environments without additional metadata during both training and inference. Additionally, GuidNoise enables the efficient generation of noisy-clean image pairs at inference time, making synthetic noise readily applicable for augmenting training data. This self-augmentation significantly improves denoising performance, especially in practical scenarios with lightweight models and limited training data. The code is available at https://github.com/chjinny/GuidNoise.
๐
2025-12-04
๐ Accepted to ACM/IEEE International Conference on H
๐ฅ Debasmita Ghose, Oz Gitelson, Marynel Vazquez, Brian Scassellati
ๆฅ็ๆ่ฆ
To collaborate with humans, robots must infer goals that are often ambiguous, difficult to articulate, or not drawn from a fixed set. Prior approaches restrict inference to a predefined goal set, rely only on observed actions, or depend exclusively on explicit instructions, making them brittle in real-world interactions. We present BALI (Bidirectional Action-Language Inference) for goal prediction, a method that integrates natural language preferences with observed human actions in a receding-horizon planning tree. BALI combines language and action cues from the human, asks clarifying questions only when the expected information gain from the answer outweighs the cost of interruption, and selects supportive actions that align with inferred goals. We evaluate the approach in collaborative cooking tasks, where goals may be novel to the robot and unbounded. Compared to baselines, BALI yields more stable goal predictions and significantly fewer mistakes.
๐
2025-12-04
๐ 48 pages, 16 figures, submitted to Journal of Adva
๐ฅ Xin Kai Lee, Ali Ramadhan, Andre Souza, Gregory LeClaire Wagner, Simone Silvestri et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
NORi is a machine-learned (ML) parameterization of ocean boundary layer turbulence that is physics-based and augmented with neural networks. NORi stands for neural ordinary differential equations (NODEs) Richardson number (Ri) closure. The physical parameterization is controlled by a Richardson number-dependent diffusivity and viscosity. The NODEs are trained to capture the entrainment through the base of the boundary layer, which cannot be represented with a local diffusive closure. The parameterization is trained using large-eddy simulations in an "a posteriori" fashion, where parameters are calibrated with a loss function that explicitly depends on the actual time-integrated variables of interest rather than the instantaneous subgrid fluxes, which are inherently noisy. NORi is designed for the realistic nonlinear equation of state of seawater and demonstrates excellent prediction and generalization capabilities in capturing entrainment dynamics under different convective strengths, oceanic background stratifications, rotation strengths, and surface wind forcings. NORi is numerically stable for at least 100 years of integration time in large-scale simulations, despite only being trained on 2-day horizons, and can be run with time steps as long as one hour. The highly expressive neural networks, combined with a physically-rigorous base closure, prove to be a robust paradigm for designing parameterizations for climate models where data requirements are drastically reduced, inference performance can be directly targeted and optimized, and numerical stability is implicitly encouraged during training.
๐
2025-12-04
๐ AAAI
๐ฅ Yanbin Zhang, Hanhui Ye, Yue Bai, Qiming Zhang, Liao Xiang et al.
Robotics
ๆฅ็ๆ่ฆ
Workflow automation promises substantial productivity gains in everyday document-related tasks. While prior agentic systems can execute isolated instructions, they struggle with automating multi-step, session-level workflows due to limited control over the operational process. To this end, we introduce AutoDW, a novel execution framework that enables stepwise, rollback-enabled operation orchestration. AutoDW incrementally plans API actions conditioned on user instructions, intent-filtered API candidates, and the evolving states of the document. It further employs robust rollback mechanisms at both the argument and API levels, enabling dynamic correction and fault tolerance. These designs together ensure that the execution trajectory of AutoDW remains aligned with user intent and document context across long-horizon workflows. To assess its effectiveness, we construct a comprehensive benchmark of 250 sessions and 1,708 human-annotated instructions, reflecting realistic document processing scenarios with interdependent instructions. AutoDW achieves 90% and 62% completion rates on instruction- and session-level tasks, respectively, outperforming strong baselines by 40% and 76%. Moreover, AutoDW also remains robust for the decision of backbone LLMs and on tasks with varying difficulty. Code and data will be open-sourced. Code: https://github.com/YJett/AutoDW
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dilani Widanapathiranage, Scott Barnett, Stefanus Kurniawan, Wannita Takerngsaksiri
Natural Language Processing
ๆฅ็ๆ่ฆ
Hallucinations are a key concern when creating applications that rely on Foundation models (FMs). Understanding where and how these subtle failures occur in an application relies on evaluation methods known as \textit{evals}. Prior work focuses on defining new eval methods or benchmark datasets for specific tasks. However, neither helps a software team with a task-specific FM application when there is no metric or dataset. The demand for both automated approaches and deep integration of human insight makes this a challenging problem. We address this gap by proposing an approach to synthesise a FM task-specific evaluator program that provides automation and a custom UI for capturing feedback. The core novelty of our approach lies in: (1) a task-agnostic meta-model that captures properties of any FM task, (2) an interaction protocol for efficient use of human feedback, and (3) an eval synthesiser that selects or generates an appropriate set of evals. We implement our approach in \toolname and demonstrate the concept on two diverse FM tasks: chart data extraction and document question answering. A preliminary evaluation on the quality of our selected evals shows 93\% and 90\% accuracy respectively. Our research tackles a growing problem facing engineering teams, how to evaluate and review outputs from FM tasks.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Manar Alnaasan, Md Selim Sarowar, Sungho Kim
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinsons disease (PD),yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinsons disease gait analysis. Code:https://github.com/manaralnaasan/RGB-D_parkinson-LLM
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Weiwei Wang, Weijie Zou, Jiyong Min
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The repetition problem, where Large Language Models (LLMs) continuously generate repetitive content without proper termination, poses a critical challenge in production deployments, causing severe performance degradation and system stalling. This paper presents a comprehensive investigation and multiple practical solutions for the repetition problem encountered in real-world batch code interpretation tasks.
We identify three distinct repetition patterns: (1) business rule generation repetition, (2) method call relationship analysis repetition, and (3) PlantUML diagram syntax generation repetition. Through rigorous theoretical analysis based on Markov models, we establish that the root cause lies in greedy decoding's inability to escape repetitive loops, exacerbated by self-reinforcement effects.
Our comprehensive experimental evaluation demonstrates three viable solutions: (1) Beam Search decoding with early_stopping=True serves as a universal post-hoc mechanism that effectively resolves all three repetition patterns; (2) presence_penalty hyperparameter provides an effective solution specifically for BadCase 1; and (3) Direct Preference Optimization (DPO) fine-tuning offers a universal model-level solution for all three BadCases.
The primary value of this work lies in combining first-hand production experience with extensive experimental validation. Our main contributions include systematic theoretical analysis of repetition mechanisms, comprehensive evaluation of multiple solutions with task-specific applicability mapping, identification of early_stopping as the critical parameter for Beam Search effectiveness, and practical production-ready solutions validated in real deployment environments.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhou Liu, Zhaoyang Han, Guochen Yan, Hao Liang, Bohan Zeng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Data governance ensures data quality, security, and compliance through policies and standards, a critical foundation for scaling modern AI development. Recently, large language models (LLMs) have emerged as a promising solution for automating data governance by translating user intent into executable transformation code. However, existing benchmarks for automated data science often emphasize snippet-level coding or high-level analytics, failing to capture the unique challenge of data governance: ensuring the correctness and quality of the data itself. To bridge this gap, we introduce GovBench, a benchmark featuring 150 diverse tasks grounded in real-world scenarios, built on data from actual cases. GovBench employs a novel "reversed-objective" methodology to synthesize realistic noise and utilizes rigorous metrics to assess end-to-end pipeline reliability. Our analysis on GovBench reveals that current models struggle with complex, multi-step workflows and lack robust error-correction mechanisms. Consequently, we propose DataGovAgent, a framework utilizing a Planner-Executor-Evaluator architecture that integrates constraint-based planning, retrieval-augmented generation, and sandboxed feedback-driven debugging. Experimental results show that DataGovAgent significantly boosts the Average Task Score (ATS) on complex tasks from 39.7 to 54.9 and reduces debugging iterations by over 77.9 percent compared to general-purpose baselines.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangyi Gao, Danpei Zhao, Bo Yuan, Wentao Li
Computer Vision
ๆฅ็ๆ่ฆ
Knowledge distillation is an effective and hardware-friendly method, which plays a key role in lightweighting remote sensing object detection. However, existing distillation methods often encounter the issue of mixed features in remote sensing images (RSIs), and neglect the discrepancies caused by subtle feature variations, leading to entangled knowledge confusion. To address these challenges, we propose an architecture-agnostic distillation method named Dual-Stream Spectral Decoupling Distillation (DS2D2) for universal remote sensing object detection tasks. Specifically, DS2D2 integrates explicit and implicit distillation grounded in spectral decomposition. Firstly, the first-order wavelet transform is applied for spectral decomposition to preserve the critical spatial characteristics of RSIs. Leveraging this spatial preservation, a Density-Independent Scale Weight (DISW) is designed to address the challenges of dense and small object detection common in RSIs. Secondly, we show implicit knowledge hidden in subtle student-teacher feature discrepancies, which significantly influence predictions when activated by detection heads. This implicit knowledge is extracted via full-frequency and high-frequency amplifiers, which map feature differences to prediction deviations. Extensive experiments on DIOR and DOTA datasets validate the effectiveness of the proposed method. Specifically, on DIOR dataset, DS2D2 achieves improvements of 4.2% in AP50 for RetinaNet and 3.8% in AP50 for Faster R-CNN, outperforming existing distillation approaches. The source code will be available at https://github.com/PolarAid/DS2D2.
๐
2025-12-04
๐ AAAI
๐ฅ Gautam Varma Datla, Anudeep Vurity, Tejaswani Dash, Tazeem Ahmad, Mohd Adnan et al.
ๆฅ็ๆ่ฆ
AI policy guidance is predominantly written as prose, which practitioners must first convert into executable rules before frameworks can evaluate or enforce them. This manual step is slow, error-prone, difficult to scale, and often delays the use of safeguards in real-world deployments. To address this gap, we present Policy-to-Tests (P2T), a framework that converts natural-language policy documents into normalized, machine-readable rules. The framework comprises a pipeline and a compact domain-specific language (DSL) that encodes hazards, scope, conditions, exceptions, and required evidence, yielding a canonical representation of extracted rules. To test the framework beyond a single policy, we apply it across general frameworks, sector guidance, and enterprise standards, extracting obligation-bearing clauses and converting them into executable rules. These AI-generated rules closely match strong human baselines on span-level and rule-level metrics, with robust inter-annotator agreement on the gold set. To evaluate downstream behavioral and safety impact, we add HIPAA-derived safeguards to a generative agent and compare it with an otherwise identical agent without guardrails. An LLM-based judge, aligned with gold-standard criteria, measures violation rates and robustness to obfuscated and compositional prompts. Detailed results are provided in the appendix. We release the codebase, DSL, prompts, and rule sets as open-source resources to enable reproducible evaluation.
๐
2025-12-04
๐ ArXiv Preprint (eess.SP)
๐ฅ Chenyuan Feng, Anbang Zhang, Geyong Min, Yongming Huang, Tony Q. S. Quek et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The evolution toward sixth-generation wireless systems positions intelligence as a native network capability, fundamentally transforming the design of radio access networks (RANs). Within this vision, Semantic-native communication and agentic intelligence are expected to play central roles. SemCom departs from bit-level fidelity and instead emphasizes task-oriented meaning exchange, enabling compact SC and introducing new performance measures such as semantic fidelity and task success rate. Agentic intelligence endows distributed RAN entities with goal-driven autonomy, reasoning, planning, and multi-agent collaboration, increasingly supported by foundation models and knowledge graphs. In this work, we first introduce the conceptual foundations of SemCom and agentic networking, and discuss why existing AI-driven O-RAN solutions remain largely bit-centric and task-siloed. We then present a unified taxonomy that organizes recent research along three axes: i) semantic abstraction level (symbol/feature/intent/knowledge), ii) agent autonomy and coordination granularity (single-, multi-, and hierarchical-agent), and iii) RAN control placement across PHY/MAC, near-real-time RIC, and non-real-time RIC. Based on this taxonomy, we systematically introduce enabling technologies including task-oriented semantic encoders/decoders, multi-agent reinforcement learning, foundation-model-assisted RAN agents, and knowledge-graph-based reasoning for cross-layer awareness. Representative 6G use cases, such as immersive XR, vehicular V2X, and industrial digital twins, are analyzed to illustrate the semantic-agentic convergence in practice. Finally, we identify open challenges in semantic representation standardization, scalable trustworthy agent coordination, O-RAN interoperability, and energy-efficient AI deployment, and outline research directions toward operational semantic-agentic AI-RAN.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, Alan Yuille
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models - including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models - we identify universal subspaces capturing majority variance in just a few principal directions. By applying spectral decomposition techniques to the weight matrices of various architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited, within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise important questions about the possibility of discovering these universal subspaces without the need for extensive data and computational resources. Furthermore, this inherent structure has significant implications for model reusability, multi-task learning, model merging, and the development of training and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianqi Liu, Zhaoxi Chen, Zihao Huang, Shaocong Xu, Saining Zhang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Recent advances in illumination control extend image-based methods to video, yet still facing a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.
๐
2025-12-04
๐ NeurIPS 2025
๐ฅ Zhen Liu, Tim Z. Xiao, Carles Domingo-Enrich, Weiyang Liu, Dinghuai Zhang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
While methods exist for aligning flow matching models--a popular and effective class of generative models--with human preferences, existing approaches fail to achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The key idea behind this algorithm is that the optimal difference between the finetuned velocity field and the pretrained one should be matched with the gradient field of a value function. This method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function to enable fast adaptation. Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.
๐
2025-12-04
๐ 8 pages, 8 figures, 1 table, website at https://w3
๐ฅ Malte Hoffmann, Lilla Zรถllei, Adrian V. Dalca
Machine Learning
ๆฅ็ๆ่ฆ
Segmentation of magnetic resonance images (MRI) facilitates analysis of human brain development by delineating anatomical structures. However, in infants and young children, accurate segmentation is challenging due to development and imaging constraints. Pediatric brain MRI is notoriously difficult to acquire, with inconsistent availability of imaging modalities, substantial non-head anatomy in the field of view, and frequent motion artifacts. This has led to specialized segmentation models that are often limited to specific image types or narrow age groups, or that are fragile for more variable images such as those acquired clinically. We address this method fragmentation with BabySeg, a deep learning brain segmentation framework for infants and young children that supports diverse MRI protocols, including repeat scans and image types unavailable during training. Our approach builds on recent domain randomization techniques, which synthesize training images far beyond realistic bounds to promote dataset shift invariance. We also describe a mechanism that enables models to flexibly pool and interact features from any number of input scans. We demonstrate state-of-the-art performance that matches or exceeds the accuracy of several existing methods for various age cohorts and input configurations using a single model, in a fraction of the runtime required by many existing tools.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao-Jen Chien, Yi-Chuan Huang, Chung-Ho Wu, Wei-Lun Chao, Yu-Lun Liu
Robotics
ๆฅ็ๆ่ฆ
Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: https://chien90190.github.io/splannequin/
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo et al.
MultimodalRoboticsComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and performs refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao et al.
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an A}gentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rundong Luo, Noah Snavely, Wei-Chiu Ma
ๆฅ็ๆ่ฆ
We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page https://red-fairy.github.io/ShadowDraw/ for more results and an end-to-end real-world demonstration of our pipeline!
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion ฯ-PD, a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. ฯ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, ฯ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, ฯ-PD improves CARLA-to-Waymo planner performance by 50\%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our \href{https://yuzeng-at-tri.github.io/ppd-page/}{project page}.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaqi Ma, Shengkai Hu, Jun Wan, Jiaxing Huang, Lefei Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
All-in-One Image Restoration (AiOIR) tasks often involve diverse degradation that require robust and versatile strategies. However, most existing approaches typically lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limit the generalization across heterogeneous degradation. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradation and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe et al.
MultimodalMachine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before ``acting in pixels'' to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Gao, Jin Song
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic-a physics-aware reinforcement learning policy conditioned on 3D keypoints, and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon et al.
ๆฅ็ๆ่ฆ
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minghan Zhu, Zhiyi Wang, Qihang Sun, Maani Ghaffari, Michael Posa
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann et al.
Robotics
ๆฅ็ๆ่ฆ
Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xianfeng Wu, Yajing Bai, Minghan Li, Xianzu Wu, Xueqi Zhao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, that jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, achieving up to 2% gains under per-scene training and 1% improvements under multi-scene training. Our code released in https://github.com/hustvl/4DLangVGGT
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yanran Zhang, Ziyi Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou et al.
ๆฅ็ๆ่ฆ
Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: https://github.com/Zhangyr2022/MoRe4D.
๐
2025-12-04
๐ CVPR
๐ฅ Abhigyan Bhattacharya, Hiranmoy Roy
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Facial Image inpainting aim is to restore the missing or corrupted regions in face images while preserving identity, structural consistency and photorealistic image quality, a task specifically created for photo restoration. Though there are recent lot of advances in deep generative models, existing methods face problems with large irregular masks, often producing blurry textures on the edges of the masked region, semantic inconsistencies, or unconvincing facial structures due to direct pixel level synthesis approach and limited exploitation of facial priors. In this paper we propose a novel architecture, which address these above challenges through semantic-guided hierarchical synthesis. Our approach starts with a method that organizes and synthesizes information based on meaning, followed by refining the texture. This process gives clear insights into the facial structure before we move on to creating detailed images. In the first stage, we blend two techniques: one that focuses on local features with CNNs and global features with Vision Transformers. This helped us create clear and detailed semantic layouts. In the second stage, we use a Multi-Modal Texture Generator to refine these layouts by pulling in information from different scales, ensuring everything looks cohesive and consistent. The architecture naturally handles arbitrary mask configurations through dynamic attention without maskspecific training. Experiment on two datasets CelebA-HQ and FFHQ shows that our model outperforms other state-of-the-art methods, showing improvements in metrics like LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation, in challenging large-area inpainting situations.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nicolas Houdrรฉ, Diego Marcos, Hugo Riffaud de Turckheim, Dino Ienco, Laurent Wendling et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders limiting generalization across heterogeneous EO modalities. To overcome these limitations we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pham Thach Thanh Truc, Dang Hoai Nam, Huynh Tong Dang Khoa, Vo Nguyen Le Duy
Machine Learning
ๆฅ็ๆ่ฆ
Handwritten Text Recognition remains challenging due to the limited data, high writing style variance, and scripts with complex diacritics. Existing approaches, though partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma
Natural Language Processing
ๆฅ็ๆ่ฆ
We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xianghui Fan, Zhaoyu Chen, Mengyang Pan, Anping Deng, Hang Yang
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
The perception of transparent objects is one of the well-known challenges in computer vision. Conventional depth sensors have difficulty in sensing the depth of transparent objects due to refraction and reflection of light. Previous research has typically train a neural network to complete the depth acquired by the sensor, and this method can quickly and accurately acquire accurate depth maps of transparent objects. However, previous training relies on a large amount of annotation data for supervision, and the labeling of depth maps is costly. To tackle this challenge, we propose a new self-supervised method for training depth completion networks. Our method simulates the depth deficits of transparent objects within non-transparent regions and utilizes the original depth map as ground truth for supervision. Experiments demonstrate that our method achieves performance comparable to supervised approach, and pre-training with our method can improve the model performance when the training samples are small.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniyar Zakarin, Thiemo Wandel, Anton Obukhov, Dengxin Dai
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiong Chang, Weimin Wang, Junpei Zhong, Jun Miyazaki
Machine Learning
ๆฅ็ๆ่ฆ
This paper proposes a memory-efficient optimization strategy for the high-performance point cloud registration algorithm VANICP, enabling lightweight execution on embedded GPUs with constrained hardware resources. VANICP is a recently published acceleration framework that significantly improves the computational efficiency of point-cloud-based applications. By transforming the global nearest neighbor search into a localized process through a dilation-based information propagation mechanism, VANICP greatly reduces the computational complexity of the NNS. However, its original implementation demands a considerable amount of memory, which restricts its deployment in resource-constrained environments such as embedded systems. To address this issue, we propose a GPU-oriented dynamic memory assignment strategy that optimizes the memory usage of the dilation operation. Furthermore, based on this strategy, we construct an enhanced version of the VANICP framework that achieves over 97% reduction in memory consumption while preserving the original performance. Source code is published on: https://github.com/changqiong/VANICP4Em.git.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ NaHyeon Park, Namin An, Kunhee Kim, Soyeon Yoon, Jiahao Huo et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Leonid Pogorelyuk, Niels Bracher, Aaron Verkleeren, Lars Kรผhmichel, Stefan T. Radev
Machine Learning
ๆฅ็ๆ่ฆ
We pilot a family of stable contrastive losses for learning pixel-level representations that jointly capture semantic and geometric information. Our approach maps each pixel of an image to an overcomplete descriptor that is both view-invariant and semantically meaningful. It enables precise point-correspondence across images without requiring momentum-based teacher-student training. Two experiments in synthetic 2D and 3D environments demonstrate the properties of our loss and the resulting overcomplete representations.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ NaHyeon Park, Kunhee Kim, Junsuk Choe, Hyunjung Shim
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jasmaine Khale, Ravi Prakash Srivastava
Machine Learning
ๆฅ็ๆ่ฆ
Automated retinal disease diagnosis is vital given the rising prevalence of conditions such as diabetic retinopathy and macular degeneration. Conventional deep learning approaches require large annotated datasets, which are costly and often imbalanced across disease categories, limiting their reliability in practice. Few-shot learning (FSL) addresses this challenge by enabling models to generalize from only a few labeled samples per class. In this study,we propose a balanced few-shot episodic learning framework tailored to the Retinal Fundus Multi-Disease Image Dataset (RFMiD). Focusing on the ten most represented classes, which still show substantial imbalance between majority diseases (e.g., Diabetic Retinopathy, Macular Hole) and minority ones (e.g., Optic Disc Edema, Branch Retinal Vein Occlusion), our method integrates three key components: (i) balanced episodic sampling, ensuring equal participation of all classes in each 5-way 5-shot episode; (ii) targeted augmentation, including Contrast Limited Adaptive Histogram Equalization (CLAHE) and color/geometry transformations, to improve minority-class di- versity; and (iii) a ResNet-50 encoder pretrained on ImageNet, selected for its superior ability to capture fine-grained retinal features. Prototypes are computed in the embedding space and classification is performed with cosine similarity for improved stability. Trained on 100 episodes and evaluated on 1,000 test episodes, our framework achieves substantial accuracy gains and reduces bias toward majority classes, with notable improvements for underrepresented diseases. These results demonstrate that dataset-aware few-shot pipelines, combined with balanced sampling and CLAHE-enhanced preprocessing, can deliver more robust and clinically fair retinal disease diagnosis under data-constrained conditions.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yupu Yao, Bowen Yang
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Novanto Yudistira
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
This study introduces a pioneering methodology for human action recognition by harnessing deep neural network techniques and adaptive fusion strategies across multiple modalities, including RGB, optical flows, audio, and depth information. Employing gating mechanisms for multimodal fusion, we aim to surpass limitations inherent in traditional unimodal recognition methods while exploring novel possibilities for diverse applications. Through an exhaustive investigation of gating mechanisms and adaptive weighting-based fusion architectures, our methodology enables the selective integration of relevant information from various modalities, thereby bolstering both accuracy and robustness in action recognition tasks. We meticulously examine various gated fusion strategies to pinpoint the most effective approach for multimodal action recognition, showcasing its superiority over conventional unimodal methods. Gating mechanisms facilitate the extraction of pivotal features, resulting in a more holistic representation of actions and substantial enhancements in recognition performance. Our evaluations across human action recognition, violence action detection, and multiple self-supervised learning tasks on benchmark datasets demonstrate promising advancements in accuracy. The significance of this research lies in its potential to revolutionize action recognition systems across diverse fields. The fusion of multimodal information promises sophisticated applications in surveillance and human-computer interaction, especially in contexts related to active assisted living.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, it is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness. Project page: https://garlicba.github.io/LiteVGGT/
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Paul Henderson
Machine Learning
ๆฅ็ๆ่ฆ
The Herculaneum Papyri are a collection of rolled papyrus documents that were charred and buried by the famous eruption of Mount Vesuvius. They promise to contain a wealth of previously unseen Greek and Latin texts, but are extremely fragile and thus most cannot be unrolled physically. A solution to access these texts is virtual unrolling, where the papyrus surface is digitally traced out in a CT scan of the scroll, to create a flattened representation. This tracing is very laborious to do manually in gigavoxel-sized scans, so automated approaches are desirable. We present the first top-down method that automatically fits a surface model to a CT scan of a severely damaged scroll. We take a novel approach that globally fits an explicit parametric model of the deformed scroll to existing neural network predictions of where the rolled papyrus likely passes. Our method guarantees the resulting surface is a single continuous 2D sheet, even passing through regions where the surface is not detectable in the CT scan. We conduct comprehensive experiments on high-resolution CT scans of two scrolls, showing that our approach successfully unrolls large regions, and exceeds the performance of the only existing automated unrolling method suitable for this data.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin et al.
ๆฅ็ๆ่ฆ
Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guanbo Huang, Jingjia Mao, Fanding Huang, Fengkai Liu, Xiangyang Luo et al.
ๆฅ็ๆ่ฆ
Despite tremendous recent progress, Flow Matching methods still suffer from exposure bias due to discrepancies in training and inference. This paper investigates the root causes of exposure bias in Flow Matching, including: (1) the model lacks generalization to biased inputs during training, and (2) insufficient low-frequency content captured during early denoising, leading to accumulated bias. Based on these insights, we propose ReflexFlow, a simple and effective reflexive refinement of the Flow Matching learning objective that dynamically corrects exposure bias. ReflexFlow consists of two components: (1) Anti-Drift Rectification (ADR), which reflexively adjusts prediction targets for biased inputs utilizing a redesigned loss under training-time scheduled sampling; and (2) Frequency Compensation (FC), which reflects on missing low-frequency components and compensates them by reweighting the loss using exposure bias. ReflexFlow is model-agnostic, compatible with all Flow Matching frameworks, and improves generation quality across datasets. Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow outperforms prior approaches in mitigating exposure bias, achieving a 35.65% reduction in FID on CelebA-64.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ramya Muthukrishnan, Borjan Gagoski, Aryn Lee, P. Ellen Grant, Elfar Adalsteinsson et al.
Computer Vision
ๆฅ็ๆ่ฆ
We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation. Our implementation is available at github.com/ramyamut/E3-Pose.
๐
2025-12-04
๐ under review in the Elsevier Engineering Journal
๐ฅ Priyanto Hidayatullah, Nurjannah Syakrani, Yudi Widhiyasana, Muhammad Rizqi Sholahuddin, Refdinal Tubagus et al.
Computer Vision
ๆฅ็ๆ่ฆ
Object detection constitutes the primary task within the domain of computer vision. It is utilized in numerous domains. Nonetheless, object detection continues to encounter the issue of catastrophic forgetting. The model must be retrained whenever new products are introduced, utilizing not only the new products dataset but also the entirety of the previous dataset. The outcome is obvious: increasing model training expenses and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address the issue of catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we utilize cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy, whether for detecting new or existing products. Furthermore, without retraining, the training duration difference is significant. We achieve almost 3 times the training time efficiency compared to classical object detection approaches. This efficiency escalates as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products, on an edge device, validating the proposed framework's feasibility for practical use.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiawen Wen, Yu Hu, Suixuan Qiu, Jinshan Huang, Xiaowen Chu
Robotics
ๆฅ็ๆ่ฆ
Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2\% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our paper code is publicly available at https://github.com/Jeffry-wen/SDG-Track
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qing Xu, Yanqian Wang, Xiangjian Hea, Yue Li, Yixuan Zhang et al.
ๆฅ็ๆ่ฆ
Automated lesion detection in chest X-rays has demonstrated significant potential for improving clinical diagnosis by precisely localizing pathological abnormalities. While recent promptable detection frameworks have achieved remarkable accuracy in target localization, existing methods typically rely on manual annotations as prompts, which are labor-intensive and impractical for clinical applications. To address this limitation, we propose SP-Det, a novel self-prompted detection framework that automatically generates rich textual context to guide multi-label lesion detection without requiring expert annotations. Specifically, we introduce an expert-free dual-text prompt generator (DTPG) that leverages two complementary textual modalities: semantic context prompts that capture global pathological patterns and disease beacon prompts that focus on disease-specific manifestations. Moreover, we devise a bidirectional feature enhancer (BFE) that synergistically integrates comprehensive diagnostic context with disease-specific embeddings to significantly improve feature representation and detection accuracy. Extensive experiments on two chest X-ray datasets with diverse thoracic disease categories demonstrate that our SP-Det framework outperforms state-of-the-art detection methods while completely eliminating the dependency on expert-annotated prompts compared to existing promptable architectures.
๐
2025-12-04
๐ ICCV 2025
๐ฅ Maria-Paola Forte, Nikos Athanasiou, Giulia Ballardini, Jan Ulrich Bartels, Katherine J. Kuchenbecker et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Capturing accurate 3D human pose in the wild would provide valuable data for training pose estimation and motion generation methods. While video-based estimation approaches have become increasingly accurate, they often fail in common scenarios involving self-contact, such as a hand touching the face. In contrast, wearable bioimpedance sensing can cheaply and unobtrusively measure ground-truth skin-to-skin contact. Consequently, we propose a novel framework that combines visual pose estimators with bioimpedance sensing to capture the 3D pose of people by taking self-contact into account. Our method, BioTUCH, initializes the pose using an off-the-shelf estimator and introduces contact-aware pose optimization during measured self-contact: reprojection error and deviations from the input estimate are minimized while enforcing vertex proximity constraints. We validate our approach using a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture. Testing with three input pose estimators, we demonstrate an average of 11.7% improvement in reconstruction accuracy. We also present a miniature wearable bioimpedance sensor that enables efficient large-scale collection of contact-aware training data for improving pose estimation and generation using BioTUCH. Code and data are available at biotuch.is.tue.mpg.de
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Chanfan Gan et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce \textbf{LineAR}, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient autoregressive (AR) image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jikang Cheng, Renye Yan, Zhiyuan Yan, Yaozhong Gan, Xueyi Zhang et al.
ๆฅ็ๆ่ฆ
Existing methods for deepfake detection aim to develop generalizable detectors. Although "generalizable" is the ultimate target once and for all, with limited training forgeries and domains, it appears idealistic to expect generalization that covers entirely unseen variations, especially given the diversity of real-world deepfakes. Therefore, introducing large-scale multi-domain data for training can be feasible and important for real-world applications. However, within such a multi-domain scenario, the differences between multiple domains, rather than the subtle real/fake distinctions, dominate the feature space. As a result, despite detectors being able to relatively separate real and fake within each domain (i.e., high AUC), they struggle with single-image real/fake judgments in domain-unspecified conditions (i.e., low ACC). In this paper, we first define a new research paradigm named Multi-In-Domain Face Forgery Detection (MID-FFD), which includes sufficient volumes of real-fake domains for training. Then, the detector should provide definitive real-fake judgments to the domain-unspecified inputs, which simulate the frame-by-frame independent detection scenario in the real world. Meanwhile, to address the domain-dominant issue, we propose a model-agnostic framework termed DevDet (Developer for Detector) to amplify real/fake differences and make them dominant in the feature space. DevDet consists of a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning strategy (DAFT). Experiments demonstrate our superiority in predicting real-fake under the MID-FFD scenario while maintaining original generalization ability to unseen data.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Manuel Ladron de Guevara, Jinmo Rhee, Ardavan Bidgoli, Vaidas Razgaitis, Michael Bergin
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shijie Chen, Peixi Peng
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huynh Trinh Ngoc, Hoang Anh Nguyen Kim, Toan Nguyen Hai, Long Tran Quoc
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Generative models have achieved remarkable progress with the emergence of flow matching (FM). It has demonstrated strong generative capabilities and attracted significant attention as a simulation-free flow-based framework capable of learning exact data densities. Motivated by these advances, we propose LatentFM, a flow-based model operating in the latent space for medical image segmentation. To model the data distribution, we first design two variational autoencoders (VAEs) to encode both medical images and their corresponding masks into a lower-dimensional latent space. We then estimate a conditional velocity field that guides the flow based on the input image. By sampling multiple latent representations, our method synthesizes diverse segmentation outputs whose pixel-wise variance reliably captures the underlying data distribution, enabling both highly accurate and uncertainty-aware predictions. Furthermore, we generate confidence maps that quantify the model certainty, providing clinicians with richer information for deeper analysis. We conduct experiments on two datasets, ISIC-2018 and CVC-Clinic, and compare our method with several prior baselines, including both deterministic and generative approach models. Through comprehensive evaluations, both qualitative and quantitative results show that our approach achieves superior segmentation accuracy while remaining highly efficient in the latent space.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chuanyu Fu, Guanying Chen, Yuqi Zhang, Kunbin Yao, Yuan Xiong et al.
Machine Learning
ๆฅ็ๆ่ฆ
3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling in-the-wild scenes affected by transient objects and illuminations, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances and illumination variations. To address this, we propose RobustSplat++, a robust solution based on several critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Third, we incorporate the delayed Gaussian growth strategy and mask bootstrapping with appearance modeling to handling in-the-wild scenes including transients and illuminations. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method.
๐
2025-12-04
๐ ICASSP
๐ฅ Christopher Simic, Korbinian Riedhammer, Tobias Bocklet
ๆฅ็ๆ่ฆ
The FAME 2026 challenge comprises two demanding tasks: training face-voice associations combined with a multilingual setting that includes testing on languages on which the model was not trained. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age-gender feature extraction to support prediction. The resulting single-modal features are projected into a shared embedding space and trained with an Adaptive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal-Error Rate (EER) of 23.99%.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian
Multimodal
ๆฅ็ๆ่ฆ
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for visual understanding encoder, which substantially improves perceptual capabilities with a few parameters increase. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chia-Hao Chen, Zi-Xin Zou, Yan-Pei Cao, Ze Yuan, Guan Luo et al.
ๆฅ็ๆ่ฆ
Generating high-fidelity, seamless textures directly on 3D surfaces, what we term 3D-native texturing, remains a fundamental open challenge, with the potential to overcome long-standing limitations of UV-based and multi-view projection methods. However, existing native approaches are constrained by the absence of a powerful and versatile latent representation, which severely limits the fidelity and generality of their generated textures. We identify this representation gap as the principal barrier to further progress. We introduce LaFiTe, a framework that addresses this challenge by learning to generate textures as a 3D generative sparse latent color field. At its core, LaFiTe employs a variational autoencoder (VAE) to encode complex surface appearance into a sparse, structured latent space, which is subsequently decoded into a continuous color field. This representation achieves unprecedented fidelity, exceeding state-of-the-art methods by >10 dB PSNR in reconstruction, by effectively disentangling texture appearance from mesh topology and UV parameterization. Building upon this strong representation, a conditional rectified-flow model synthesizes high-quality, coherent textures across diverse styles and geometries. Extensive experiments demonstrate that LaFiTe not only sets a new benchmark for 3D-native texturing but also enables flexible downstream applications such as material synthesis and texture super-resolution, paving the way for the next generation of 3D content creation workflows.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) to memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for specific memory operations$\unicode{x2013}$knowledge extraction, memory update, and memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10$\times$ larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60$\times$ larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations instead, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yizi Chen, Sidi Wu, Tianyi Xiao, Nina Wiedemann, Loic Landrieu
ๆฅ็ๆ่ฆ
VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD tools. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work, generalizes effectively from synthetic to real sketches with minimal supervision, and performs well even on partial sketches. All data and models will be released open-source at https://chenyizi086.github.io/VRSketch2Shape_website.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Abdul Haseeb Nizamani, Dandi Zhou, Xinhai Sun
Robotics
ๆฅ็ๆ่ฆ
Depth completion plays a vital role in 3D perception systems, especially in scenarios where sparse depth data must be densified for tasks such as autonomous driving, robotics, and augmented reality. While many existing approaches rely on semantic segmentation to guide depth completion, they often overlook the benefits of object-level understanding. In this work, we introduce an instance-aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U-Net-based depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head. The instance segmentation branch generates per-image foreground masks that guide the depth branch via cross-attention, allowing the network to focus on object-centric regions during refinement. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower RMSE compared to both a U-Net-only baseline and previous semantic-guided methods, while maintaining competitive MAE. Qualitative and quantitative results demonstrate that the proposed model effectively enhances depth accuracy near object boundaries, occlusions, and thin structures. Our findings suggest that incorporating instance-aware cues offers a promising direction for improving depth completion without relying on dense semantic labels.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valenc-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yigui Feng, Qinglin Wang, Haotian Mo, Yang Liu, Ke Liu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement(MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, Third, we designed the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been opened.
๐
2025-12-04
๐ Submitted to IEEE Transactions on Emerging Topics
๐ฅ Alaa Zniber, Arne Symons, Ouassim Karrakchou, Marian Verhelst, Mounir Ghogho
Machine Learning
ๆฅ็ๆ่ฆ
Advancements in high-performance computing and cloud technologies have enabled the development of increasingly sophisticated Deep Learning (DL) models. However, the growing demand for embedded intelligence at the edge imposes stringent computational and energy constraints, challenging the deployment of these large-scale models. Early Exiting Neural Networks (EENN) have emerged as a promising solution, allowing dynamic termination of inference based on input complexity to enhance efficiency. Despite their potential, EENN performance is highly influenced by the heterogeneity of edge accelerators and the constraints imposed by quantization, affecting accuracy, energy efficiency, and latency. Yet, research on the automatic optimization of EENN design for edge hardware remains limited. To bridge this gap, we propose a hardware-aware Neural Architecture Search (NAS) framework that systematically integrates the effects of quantization and hardware resource allocation to optimize the placement of early exit points within a network backbone. Experimental results on the CIFAR-10 dataset demonstrate that our NAS framework can discover architectures that achieve over a 50\% reduction in computational costs compared to conventional static networks, making them more suitable for deployment in resource-constrained edge environments.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinning Chai, Zhengxue Cheng, Yuhong Zhang, Hengsheng Zhang, Yingsheng Qin et al.
Robotics
ๆฅ็ๆ่ฆ
Arbitrary-scale super-resolution (ASSR) overcomes the limitation of traditional super-resolution (SR) methods that operate only at fixed scales (e.g., 4x), enabling a single model to handle arbitrary magnification. Most existing ASSR approaches rely on implicit neural representation (INR), but its regression-driven feature extraction and aggregation intrinsically limit the ability to synthesize fine details, leading to low realism. Recent diffusion-based realistic image super-resolution (Real-ISR) models leverage powerful pre-trained diffusion priors and show impressive results at the 4x setting. We observe that they can also achieve ASSR because the diffusion prior implicitly adapts to scale by encouraging high-realism generation. However, without explicit scale control, the diffusion process cannot be properly adjusted for different magnification levels, resulting in excessive hallucination or blurry outputs, especially under ultra-high scales. To address these issues, we propose OmniScaleSR, a diffusion-based realistic arbitrary-scale SR framework designed to achieve both high fidelity and high realism. We introduce explicit, diffusion-native scale control mechanisms that work synergistically with implicit scale adaptation, enabling scale-aware and content-aware modulation of the diffusion process. In addition, we incorporate multi-domain fidelity enhancement designs to further improve reconstruction accuracy. Extensive experiments on bicubic degradation benchmarks and real-world datasets show that OmniScaleSR surpasses state-of-the-art methods in both fidelity and perceptual realism, with particularly strong performance at large magnification factors. Code will be released at https://github.com/chaixinning/OmniScaleSR.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Cross-view correspondence is a fundamental capability for spatial understanding and embodied AI. However, it is still far from being realized in Vision-Language Models (VLMs), especially in achieving precise point-level correspondence, which is crucial for precise affordance interaction. So we propose the Cross-View Point Correspondence (CVPC) task and CrossPoint-Bench, a comprehensive benchmark with hierarchical design, inspired by the human cognitive process of "perceive", "reason", and "correspond". Our evaluation shows the state-of-the-art models (e.g., Gemini-2.5-Pro) still fall far behind humans, with a gap of over 54.65% in overall accuracy, exposing a challenge in transitioning from coarse-grained judgement to fine-grained coordinate prediction. To address this problem, we construct CrossPoint-378K, a dataset with 378K question-answering pairs across 900 scenes, focused on actionable affordance regions that better reflect real-world manipulation and interaction scenarios. Furthermore, we propose CroPond that trained on the CrossPoint-378K dataset. Our CroPond achieves state-of-the-art performance on CrossPoint-Bench, surpassing Gemini-2.5-Pro by 39.7% accuracy, which offers a foundation for advancing future work on cross-view correspondence. The benchmark, dataset, and model are publicly available at https://github.com/WangYipu2002/CrossPoint.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang et al.
ๆฅ็ๆ่ฆ
Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Juntong Wang, Jiarui Wang, Huiyu Duan, Jiaxiang Kang, Guangtao Zhai et al.
Multimodal
ๆฅ็ๆ่ฆ
Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose \textbf{I2I-Bench}, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporal inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g. object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improves VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
๐
2025-12-04
๐ Accepted to IEEE TNNLS
๐ฅ Bowen Zheng, Ran Cheng
Multimodal
ๆฅ็ๆ่ฆ
In the history of knowledge distillation, the focus has once shifted over time from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. As a response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. Then we pay particular attention to the teacher model's predictive distribution and its impact on the gradients of GDKD loss, uncovering two critical insights often overlooked: (1) the partitioning by the top logit considerably improves the interrelationship of non-top logits, and (2) amplifying the focus on the distillation loss of non-top logits enhances the knowledge extraction among them. Utilizing these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of teacher models' predictive distribution. Our comprehensive experiments conducted on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods. The code is available at https://github.com/ZaberKo/GDKD.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianyu Yuan, Yuanbo Yang, Lin-Zhuo Chen, Yao Yao, Zhuzhong Qian
Natural Language Processing
ๆฅ็ๆ่ฆ
In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method extracts discriminative features through single-step denoising, applies feature selection, and employs soft-argmax localization with forward-backward consistency checks for correspondence estimation. Extensive experiments on TAP-Vid benchmarks demonstrate that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods while eliminating the need for annotated training data. Our work further underscores the promise of video diffusion models as powerful foundation models for a wide range of downstream tasks, paving the way toward unified visual foundation models.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sheng Hang, Chaoxiang He, Hongsheng Hu, Hanqing Hu, Bin Benjamin Zhu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects if an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks - all in one pass. The system first applies foundation segmentation model (SAM) to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly-annotated 790-image dataset spanning drug, sexual, violent and extremist content, our method attains 85.8% element-level recall, 78.1% precision and a 92.1% segment-success rate - exceeding direct zero-shot VLM localization by 27.4% recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and VLM, our method's precision and recall decreased by no more than 10%, demonstrating high robustness against attacks. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Wu, Chuhao Zhou, Guangyu Zhao, Haozhi Cao, Yewen Pu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Yongri Piao et al.
Multimodal
ๆฅ็ๆ่ฆ
Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.
๐
2025-12-04
๐ Accepted by IEEE TMM
๐ฅ Houzhang Fang, Chenxing Wu, Kun Bai, Tianqi Chen, Xiaolin Wang et al.
ๆฅ็ๆ่ฆ
Unmanned aerial vehicle (UAV) target tracking based on thermal infrared imaging has been one of the most important sensing technologies in anti-UAV applications. However, the infrared UAV targets often exhibit weak features and complex backgrounds, posing significant challenges to accurate tracking. To address these problems, we introduce SiamDFF, a novel dynamic feature fusion Siamese network that integrates feature enhancement and global contextual attention knowledge distillation for infrared UAV target (IRUT) tracking. The SiamDFF incorporates a selective target enhancement network (STEN), a dynamic spatial feature aggregation module (DSFAM), and a dynamic channel feature aggregation module (DCFAM). The STEN employs intensity-aware multi-head cross-attention to adaptively enhance important regions for both template and search branches. The DSFAM enhances multi-scale UAV target features by integrating local details with global features, utilizing spatial attention guidance within the search frame. The DCFAM effectively integrates the mixed template generated from STEN in the template branch and original template, avoiding excessive background interference with the template and thereby enhancing the emphasis on UAV target region features within the search frame. Furthermore, to enhance the feature extraction capabilities of the network for IRUT without adding extra computational burden, we propose a novel tracking-specific target-aware contextual attention knowledge distiller. It transfers the target prior from the teacher network to the student model, significantly improving the student network's focus on informative regions at each hierarchical level of the backbone network. Extensive experiments on real infrared UAV datasets demonstrate that the proposed approach outperforms state-of-the-art target trackers under complex backgrounds while achieving a real-time tracking speed.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zishuo Wan, Qinqin Kang, Yi Huang, Yun Bian, Dawei Ding et al.
Machine Learning
ๆฅ็ๆ่ฆ
Tumor segmentation and diagnosis in contrast-enhanced Computed Tomography (CT) rely heavily on the physiological dynamics of contrast agents. However, obtaining a complete multi-phase series is often clinically unfeasible due to radiation concerns or scanning limitations, leading to the "missing modality" problem. Existing deep learning approaches typically treat missing phases as absent independent channels, ignoring the inherent temporal continuity of hemodynamics. In this work, we propose Time Attenuated Representation Disentanglement (TARDis), a novel physics-aware framework that redefines missing modalities as missing sample points on a continuous Time-Attenuation Curve. TARDis explicitly disentangles the latent feature space into a time-invariant static component (anatomy) and a time-dependent dynamic component (perfusion). We achieve this via a dual-path architecture: a quantization-based path using a learnable embedding dictionary to extract consistent anatomical structures, and a probabilistic path using a Conditional Variational Autoencoder to model dynamic enhancement conditioned on the estimated scan time. This design allows the network to hallucinate missing hemodynamic features by sampling from the learned latent distribution. Extensive experiments on a large-scale private abdominal CT dataset (2,282 cases) and two public datasets demonstrate that TARDis significantly outperforms state-of-the-art incomplete modality frameworks. Notably, our method maintains robust diagnostic performance even in extreme data-sparsity scenarios, highlighting its potential for reducing radiation exposure while maintaining diagnostic precision.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vitor Hideyo Isume, Takuya Kiyokawa, Natsuki Yamanobe, Yukiyasu Domae, Weiwei Wan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Inspired by traditional handmade crafts, where a person improvises assemblies based on the available objects, we formally introduce the Craft Assembly Task. It is a robotic assembly task that involves building an accurate representation of a given target object using the available objects, which do not directly correspond to its parts. In this work, we focus on selecting the subset of available objects for the final craft, when the given input is an RGB image of the target in the wild. We use a mask segmentation neural network to identify visible parts, followed by retrieving labeled template meshes. These meshes undergo pose optimization to determine the most suitable template. Then, we propose to simplify the parts of the transformed template mesh to primitive shapes like cuboids or cylinders. Finally, we design a search algorithm to find correspondences in the scene based on local and global proportions. We develop baselines for comparison that consider all possible combinations, and choose the highest scoring combination for common metrics used in foreground maps and mask accuracy. Our approach achieves comparable results to the baselines for two different scenes, and we show qualitative results for an implementation in a real-world scenario.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Christof A. Bertram, Viktoria Weiss, Jonas Ammeling, F. Maria Schabel, Taryn A. Donovan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Supervised deep learning (DL) receives great interest for automated analysis of microscopic images with an increasing body of literature supporting its potential. The development and validation of those DL models relies heavily on the availability of high-quality, large-scale datasets. However, creating such datasets is a complex and resource-intensive process, often hindered by challenges such as time constraints, domain variability, and risks of bias in image collection and label creation. This review provides a comprehensive guide to the critical steps in dataset creation, including: 1) image acquisition, 2) selection of annotation software, and 3) annotation creation. In addition to ensuring a sufficiently large number of images, it is crucial to address sources of image variability (domain shifts) - such as those related to slide preparation and digitization - that could lead to algorithmic errors if not adequately represented in the training data. Key quality criteria for annotations are the three "C"s: correctness, completeness, and consistency. This review explores methods to enhance annotation quality through the use of advanced techniques that mitigate the limitations of single annotators. To support dataset creators, a standard operating procedure (SOP) is provided as supplemental material, outlining best practices for dataset development. Furthermore, the article underscores the importance of open datasets in driving innovation and enhancing reproducibility of DL research. By addressing the challenges and offering practical recommendations, this review aims to advance the creation of and availability to high-quality, large-scale datasets, ultimately contributing to the development of generalizable and robust DL models for pathology applications.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng et al.
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose \textbf{COOPER}, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average \textbf{6.91\%} improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a \textbf{7.92\%} gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
๐
2025-12-04
๐ ArXiv Preprint (cs.GR)
๐ฅ Zhizhen Wu, Zhe Cao, Yuchi Huo
Machine Learning
ๆฅ็ๆ่ฆ
Image convolution with complex kernels is a fundamental operation in photography, scientific imaging, and animation effects, yet direct dense convolution is computationally prohibitive on resource-limited devices. Existing approximations, such as simulated annealing or low-rank decompositions, either lack efficiency or fail to capture non-convex kernels. We introduce a differentiable kernel decomposition framework that represents a target spatially-variant, dense, complex kernel using a set of sparse kernel samples. Our approach features (i) a decomposition that enables differentiable optimization of sparse kernels, (ii) a dedicated initialization strategy for non-convex shapes to avoid poor local minima, and (iii) a kernel-space interpolation scheme that extends single-kernel filtering to spatially varying filtering without retraining and additional runtime overhead. Experiments on Gaussian and non-convex kernels show that our method achieves higher fidelity than simulated annealing and significantly lower cost than low-rank decompositions. Our approach provides a practical solution for mobile imaging and real-time rendering, while remaining fully differentiable for integration into broader learning pipelines.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marco Pintore, Maura Pintor, Dimosthenis Karatzas, Battista Biggio
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Document Visual Question Answering (DocVQA) enables end-to-end reasoning grounded on information present in a document input. While recent models have shown impressive capabilities, they remain vulnerable to adversarial attacks. In this work, we introduce a novel attack scenario that aims to forge document content in a visually imperceptible yet semantically targeted manner, allowing an adversary to induce specific or generally incorrect answers from a DocVQA model. We develop specialized attack algorithms that can produce adversarially forged documents tailored to different attackers' goals, ranging from targeted misinformation to systematic model failure scenarios. We demonstrate the effectiveness of our approach against two end-to-end state-of-the-art models: Pix2Struct, a vision-language transformer that jointly processes image and text through sequence-to-sequence modeling, and Donut, a transformer-based model that directly extracts text and answers questions from document images. Our findings highlight critical vulnerabilities in current DocVQA systems and call for the development of more robust defenses.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hong Kuang, Jianchen Liu
Machine Learning
ๆฅ็ๆ่ฆ
3D Gaussian Splatting (3DGS) has emerged as a leading technique for novel view synthesis, demonstrating exceptional rendering efficiency. \replaced[]{Well-reconstructed surfaces can be characterized by low configurational entropy, where dominant primitives clearly define surface geometry while redundant components are suppressed.}{The key insight is that well-reconstructed surfaces naturally exhibit low configurational entropy, where dominant primitives clearly define surface geometry while suppressing redundant components.} Three complementary technical contributions are introduced: (1) entropy-driven surface modeling via entropy minimization for low configurational entropy in primitive distributions; (2) adaptive spatial regularization using the Surface Neighborhood Redundancy Index (SNRI) and image entropy-guided weighting; (3) multi-scale geometric preservation through competitive cross-scale entropy alignment. Extensive experiments demonstrate that GEF achieves competitive geometric precision on DTU and T\&T benchmarks, while delivering superior rendering quality compared to existing methods on Mip-NeRF 360. Notably, superior Chamfer Distance (0.64) on DTU and F1 score (0.44) on T\&T are obtained, alongside the best SSIM (0.855) and LPIPS (0.136) among baselines on Mip-NeRF 360, validating the framework's ability to enhance surface reconstruction accuracy without compromising photometric fidelity.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongbo Jin, Qingyuan Wang, Wenhao Zhang, Yang Liu, Sijie Cheng
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Ultra long video understanding remains an open challenge, as existing vision language models (VLMs) falter on such content due to limited context length and inefficient long term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval agumented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that pioneers models long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules: Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou
Multimodal
ๆฅ็ๆ่ฆ
The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, these solutions mainly "overlay" robot arms to egocentric videos, which cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bita Baroutian, Atefe Aghaei, Mohsen Ebrahimi Moghaddam
ๆฅ็ๆ่ฆ
Alcohol consumption is a significant public health concern and a major cause of accidents and fatalities worldwide. This study introduces a novel video-based facial sequence analysis approach dedicated to the detection of alcohol intoxication. The method integrates facial landmark analysis via a Graph Attention Network (GAT) with spatiotemporal visual features extracted using a 3D ResNet. These features are dynamically fused with adaptive prioritization to enhance classification performance. Additionally, we introduce a curated dataset comprising 3,542 video segments derived from 202 individuals to support training and evaluation. Our model is compared against two baselines: a custom 3D-CNN and a VGGFace+LSTM architecture. Experimental results show that our approach achieves 95.82% accuracy, 0.977 precision, and 0.97 recall, outperforming prior methods. The findings demonstrate the model's potential for practical deployment in public safety systems for non-invasive, reliable alcohol intoxication detection.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Youze Huang, Penghui Ruan, Bojia Zi, Xianbiao Qi, Jianan Wang et al.
Robotics
ๆฅ็ๆ่ฆ
Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, Object Retexture, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability for two reasons: conditioning on the raw reference image introduces unwanted structural information, and it fails to disentangle the visual texture and structure information of the source. To address this problem, we propose Refaรงade, a method that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving the geometry and motion of source videos. Second, we disrupt the reference global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than the global layout of the object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations. Code is available at https://github.com/fishZe233/Refacade.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu-Wei Zhan, Xin Wang, Hong Chen, Tongtong Feng, Wei Feng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also capturing physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physical-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable physical dynamic representations. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model's original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM employs a self-supervised manner to model the continuous evolution of object motion. Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chentao Shen, Sizhe Zheng, Bingqian Wu, Yaohua Feng, Yuanchen Fei et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Traditional high-quality 3D scanning and reconstruction typically relies on human labor to plan the scanning procedure. With the rapid development of embodied systems such as drones and robots, there is a growing demand of performing accurate 3D scanning and reconstruction in an fully automated manner. We introduce Auto3R, a data-driven uncertainty quantification model that is designed to automate the 3D scanning and reconstruction of scenes and objects, including objects with non-lambertian and specular materials. Specifically, in a process of iterative 3D reconstruction and scanning, Auto3R can make efficient and accurate prediction of uncertainty distribution over potential scanning viewpoints, without knowing the ground truth geometry and appearance. Through extensive experiments, Auto3R achieves superior performance that outperforms the state-of-the-art methods by a large margin. We also deploy Auto3R on a robot arm equipped with a camera and demonstrate that Auto3R can be used to effectively digitize real-world 3D objects and delivers ready-to-use and photorealistic digital assets. Our homepage: https://tomatoma00.github.io/auto3r.github.io .
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guoqing Zhang, Zhun Wang, Hairui Wang, Zhonglin Ye, Yuhui Zheng
ๆฅ็ๆ่ฆ
Visible-Infrared Person Re-Identification (VI-ReID) is a challenging cross-modal matching task due to significant modality discrepancies. While current methods mainly focus on learning modality-invariant features through unified embedding spaces, they often focus solely on the common discriminative semantics across modalities while disregarding the critical role of modality-specific identity-aware knowledge in discriminative feature learning. To bridge this gap, we propose a novel Identity Clue Refinement and Enhancement (ICRE) network to mine and utilize the implicit discriminative knowledge inherent in modality-specific attributes. Initially, we design a Multi-Perception Feature Refinement (MPFR) module that aggregates shallow features from shared branches, aiming to capture modality-specific attributes that are easily overlooked. Then, we propose a Semantic Distillation Cascade Enhancement (SDCE) module, which distills identity-aware knowledge from the aggregated shallow features and guide the learning of modality-invariant features. Finally, an Identity Clues Guided (ICG) Loss is proposed to alleviate the modality discrepancies within the enhanced features and promote the learning of a diverse representation space. Extensive experiments across multiple public datasets clearly show that our proposed ICRE outperforms existing SOTA methods.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruijing Liu, Cunhua Pan, Jiaming Zeng, Hong Ren, Kezhi Wang et al.
ๆฅ็ๆ่ฆ
While fulfilling communication tasks, wireless signals can also be used to sense the environment. Among various types of sensing media, WiFi signals offer advantages such as widespread availability, low hardware cost, and strong robustness to environmental conditions like light, temperature, and humidity. By analyzing Wi-Fi signals in the environment, it is possible to capture dynamic changes of the human body and accomplish sensing applications such as gesture recognition. Although many existing gesture sensing solutions perform well in-domain but lack cross-domain capabilities (i.e., recognition performance in untrained environments). To address this, we extract Doppler spectra from the channel state information (CSI) received by all receivers and concatenate each Doppler spectrum along the same time axis to generate fused images with multi-angle information as input features. Furthermore, inspired by the convolutional block attention module (CBAM), we propose a gesture recognition network that integrates a multi-semantic spatial attention mechanism with a self-attention-based channel mechanism. This network constructs attention maps to quantify the spatiotemporal features of gestures in images, enabling the extraction of key domain-independent features. Additionally, ResNet18 is employed as the backbone network to further capture deep-level features. To validate the network performance, we evaluate the proposed network on the public Widar3 dataset, and the results show that it not only maintains high in-domain accuracy of 99.72%, but also achieves high performance in cross-domain recognition of 97.61%, significantly outperforming existing best solutions.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenlin Xu, Lei Zhang, Lituan Wang, Xinyu Pu, Pengfei Ma et al.
Computer Vision
ๆฅ็ๆ่ฆ
Due to the scarcity of annotated data and the substantial computational costs of model, conventional tuning methods in medical image segmentation face critical challenges. Current approaches to adapting pretrained models, including full-parameter and parameter-efficient fine-tuning, still rely heavily on task-specific training on downstream tasks. Therefore, zero-shot segmentation has gained increasing attention, especially with foundation models such as SAM demonstrating promising generalization capabilities. However, SAM still faces notable limitations on medical datasets due to domain shifts, making efficient zero-shot enhancement an urgent research goal. To address these challenges, we propose BA-TTA-SAM, a task-agnostic test-time adaptation framework that significantly enhances the zero-shot segmentation performance of SAM via test-time adaptation. This framework integrates two key mechanisms: (1) The encoder-level Gaussian prompt injection embeds Gaussian-based prompts directly into the image encoder, providing explicit guidance for initial representation learning. (2) The cross-layer boundary-aware attention alignment exploits the hierarchical feature interactions within the ViT backbone, aligning deep semantic responses with shallow boundary cues. Experiments on four datasets, including ISIC, Kvasir, BUSI, and REFUGE, show an average improvement of 12.4\% in the DICE score compared with SAM's zero-shot segmentation performance. The results demonstrate that our method consistently outperforms state-of-the-art models in medical image segmentation. Our framework significantly enhances the generalization ability of SAM, without requiring any source-domain training data. Extensive experiments on publicly available medical datasets strongly demonstrate the superiority of our framework. Our code is available at https://github.com/Emilychenlin/BA-TTA-SAM.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun et al.
Robotics
ๆฅ็ๆ่ฆ
Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generator especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Liuzhou Zhang, Jiarui Ye, Yuanlei Wang, Ming Zhong, Mingju Cao et al.
ๆฅ็ๆ่ฆ
Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yinghui Xing, Xiaoting Su, Shizhou Zhang, Donghao Chu, Di Xu
Computer Vision
ๆฅ็ๆ่ฆ
Infrared imaging plays a critical role in low-light and adverse weather conditions. However, due to the distinct characteristics of infrared images, existing foundation models such as Masked Autoencoder (MAE) trained on visible data perform suboptimal in infrared image interpretation tasks. To bridge this gap, an infrared foundation model known as InfMAE was developed and pre-trained on large-scale infrared datasets. Despite its effectiveness, InfMAE still faces several limitations, including the omission of informative tokens, insufficient modeling of global associations, and neglect of non-uniform noise. In this paper, we propose a Dual-domain Guided Infrared foundation model based on MAE (DuGI-MAE). First, we design a deterministic masking strategy based on token entropy, preserving only high-entropy tokens for reconstruction to enhance informativeness. Next, we introduce a Dual-Domain Guidance (DDG) module, which simultaneously captures global token relationships and adaptively filters non-uniform background noise commonly present in infrared imagery. To facilitate large-scale pretraining, we construct Inf-590K, a comprehensive infrared image dataset encompassing diverse scenes, various target types, and multiple spatial resolutions. Pretrained on Inf-590K, DuGI-MAE demonstrates strong generalization capabilities across various downstream tasks, including infrared object detection, semantic segmentation, and small target detection. Experimental results validate the superiority of the proposed method over both supervised and self-supervised comparison methods. Our code is available in the supplementary material.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Min Zhao, Bokai Yan, Xue Yang, Hongzhou Zhu, Jintao Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent image diffusion transformers achieve high-fidelity generation, but struggle to generate images beyond these scales, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K*6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page is available at \href{https://thu-ml.github.io/ultraimage.github.io/}{https://thu-ml.github.io/ultraimage.github.io/}.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuduo Jin, Brandon Haworth
Robotics
ๆฅ็ๆ่ฆ
Diffusion models have emerged as a widely utilized and successful methodology in human motion synthesis. Task-oriented diffusion models have significantly advanced action-to-motion, text-to-motion, and audio-to-motion applications. In this paper, we investigate fundamental questions regarding motion representations and loss functions in a controlled study, and we enumerate the impacts of various decisions in the workflow of the generative motion diffusion model. To answer these questions, we conduct empirical studies based on a proxy motion diffusion model (MDM). We apply v loss as the prediction objective on MDM (vMDM), where v is the weighted sum of motion data and noise. We aim to enhance the understanding of latent data distributions and provide a foundation for improving the state of conditional motion diffusion models. First, we evaluate the six common motion representations in the literature and compare their performance in terms of quality and diversity metrics. Second, we compare the training time under various configurations to shed light on how to speed up the training process of motion diffusion models. Finally, we also conduct evaluation analysis on a large motion dataset. The results of our experiments indicate clear performance differences across motion representations in diverse datasets. Our results also demonstrate the impacts of distinct configurations on model training and suggest the importance and effectiveness of these decisions on the outcomes of motion diffusion models.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianci Huo, Lingfeng Qi, Yuhan Chen, Qihong Xue, Jinyuan Shao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Inevitable specular highlights in practical environments severely impair the visual performance, thus degrading the task effectiveness and efficiency. Although there exist considerable methods that focus on local information from convolutional neural network models or global information from transformer models, the single-type model falls into a modeling dilemma between local fine-grained details and global long-range dependencies, thus deteriorating for specular highlights with different scales. Therefore, to accommodate specular highlights of all scales, we propose a multi-model architecture for specular highlight removal (MM-SHR) that effectively captures fine-grained features in highlight regions and models long-range dependencies between highlight and highlight-free areas. Specifically, we employ convolution operations to extract local details in the shallow layers of MM-SHR, and utilize the attention mechanism to capture global features in the deep layers, ensuring both operation efficiency and removal accuracy. To model long-range dependencies without compromising computational complexity, we utilize a coarse-to-fine manner and propose Omni-Directional Attention Integration Block(OAIBlock) and Adaptive Region-Aware Hybrid-Domain Dual Attention Convolutional Network(HDDAConv) , which leverage omni-directiona pixel-shifting and window-dividing operations at the raw features to achieve specular highlight removal. Extensive experimental results on three benchmark tasks and six types of surface materials demonstrate that MM-SHR outperforms state-of-the-art methods in both accuracy and efficiency for specular highlight removal. The implementation will be made publicly available at https://github.com/Htcicv/MM-SHR.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Eunjong Lee, Eunhee Kim, Sanghoon Hong, Eunho Jung, Jihoon Kim
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Generating stable and controllable character motion in real-time is a key challenge in computer animation. Existing methods often fail to provide fine-grained control or suffer from motion degradation over long sequences, limiting their use in interactive applications. We propose COMET, an autoregressive framework that runs in real time, enabling versatile character control and robust long-horizon synthesis. Our efficient Transformer-based conditional VAE allows for precise, interactive control over arbitrary user-specified joints for tasks like goal-reaching and in-betweening from a single model. To ensure long-term temporal stability, we introduce a novel reference-guided feedback mechanism that prevents error accumulation. This mechanism also serves as a plug-and-play stylization module, enabling real-time style transfer. Extensive evaluations demonstrate that COMET robustly generates high-quality motion at real-time speeds, significantly outperforming state-of-the-art approaches in complex motion control tasks and confirming its readiness for demanding interactive applications.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aaron Sun, Oindrila Saha, Subhransu Maji
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data -- especially videos or multi-view observations of the same subject -- making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex -- used as a proxy for identity -- substantially improves performance on both seen and unseen species.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pengbo Guo, Junke Wang, Zhen Xing, Chengxu Liu, Daoguo Dong et al.
ๆฅ็ๆ่ฆ
This paper presents DeRA, a novel 1D video tokenizer that decouples the spatial-temporal representation learning in video tokenization to achieve better training efficiency and performance. Specifically, DeRA maintains a compact 1D latent space while factorizing video encoding into appearance and motion streams, which are aligned with pretrained vision foundation models to capture the spatial semantics and temporal dynamics in videos separately. To address the gradient conflicts introduced by the heterogeneous supervision, we further propose the Symmetric Alignment-Conflict Projection (SACP) module that proactively reformulates gradients by suppressing the components along conflicting directions. Extensive experiments demonstrate that DeRA outperforms LARP, the previous state-of-the-art video tokenizer by 25% on UCF-101 in terms of rFVD. Moreover, using DeRA for autoregressive video generation, we also achieve new state-of-the-art results on both UCF-101 class-conditional generation and K600 frame prediction.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tanmay Dogra, Eric Ngo, Mohammad Alam, Jean-Paul Talavera, Asim Dahal
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We challenge the common belief that deep learning always trumps older techniques, using the example of grading Saint-Gaudens Double Eagle gold coins automatically. In our work, we put a feature-based Artificial Neural Network built around 192 custom features pulled from Sobel edge detection and HSV color analysis up against a hybrid Convolutional Neural Network that blends in EfficientNetV2, plus a straightforward Support Vector Machine as the control. Testing 1,785 coins graded by experts, the ANN nailed 86% exact matches and hit 98% when allowing a 3-grade leeway. On the flip side, CNN and SVM mostly just guessed the most common grade, scraping by with 31% and 30% exact hits. Sure, the CNN looked good on broader tolerance metrics, but that is because of some averaging trick in regression that hides how it totally flops at picking out specific grades. All told, when you are stuck with under 2,000 examples and lopsided classes, baking in real coin-expert knowledge through feature design beats out those inscrutable, all-in-one deep learning setups. This rings true for other niche quality checks where data's thin and know-how matters more than raw compute.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu et al.
MultimodalMachine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.
๐
2025-12-04
๐ AAAI 2026
๐ฅ Changjin Kim, HyeokJun Lee, YoungJoon Yoo
Machine Learning
ๆฅ็ๆ่ฆ
Recent image denoising methods have leveraged generative modeling for real noise synthesis to address the costly acquisition of real-world noisy data. However, these generative models typically require camera metadata and extensive target-specific noisy-clean image pairs, often showing limited generalization between settings. In this paper, to mitigate the prerequisites, we propose a Single-Pair Guided Diffusion for generalized noise synthesis GuidNoise, which uses a single noisy/clean pair as the guidance, often easily obtained by itself within a training set. To train GuidNoise, which generates synthetic noisy images from the guidance, we introduce a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss to leverage the inherent potential of diffusion models. This loss function refines the diffusion model's backward process, making the model more adept at generating realistic noise distributions. The GuidNoise synthesizes high-quality noisy images under diverse noise environments without additional metadata during both training and inference. Additionally, GuidNoise enables the efficient generation of noisy-clean image pairs at inference time, making synthetic noise readily applicable for augmenting training data. This self-augmentation significantly improves denoising performance, especially in practical scenarios with lightweight models and limited training data. The code is available at https://github.com/chjinny/GuidNoise.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifei Wang, Zhenkai Li, Tianwen Qian, Huanran Zheng, Zheng Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model's ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bin Suna, Yaoguang Caob, Yan Wanga, Rui Wanga, Jiachen Shanga et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
End-to-End autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high-quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi-dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo et al.
MultimodalRoboticsComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and performs refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Purbesh Mitra, Sennur Ulukus
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) in reasoning based problems, like math and programming. However, RLVR is limited by several bottlenecks, such as, lack of dense reward, and inadequate sample efficiency. As a result, it requires significant compute resources in post-training phase. To overcome these limitations, in this work, we propose \textbf{Semantic Soft Bootstrapping (SSB)}, a self-distillation technique, in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct and most common incorrect response are filtered, and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. This generation process also produces a sequence of logits, which is what the student model tries to match in the training phase just from the bare question alone. In our experiment, Qwen2.5-3B-Instruct on GSM8K dataset via parameter-efficient fine-tuning. We then tested its accuracy on MATH500, and AIME2024 benchmarks. Our experiments show a jump of 10.6%, and 10% improvements in accuracy, respectively, over group relative policy optimization (GRPO), which is a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model, curated dataset is available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Haiyue Song, Johannes Eschbach-Dymanus, Hour Kaing, Sumire Honda, Hideki Tanaka et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose \textbf{Format Reinforcement Learning (FormatRL)}, which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huascar Sanchez, Briland Hitaj, Jules Bergmann, Linda Briesemeister
Natural Language Processing
ๆฅ็ๆ่ฆ
As healthcare increasingly turns to AI for scalable and trustworthy clinical decision support, ensuring reliability in model reasoning remains a critical challenge. Individual large language models (LLMs) are susceptible to hallucinations and inconsistency, whereas naive ensembles of models often fail to deliver stable and credible recommendations. Building on our previous work on LLM Chemistry, which quantifies the collaborative compatibility among LLMs, we apply this framework to improve the reliability in medication recommendation from brief clinical vignettes. Our approach leverages multi-LLM collaboration guided by Chemistry-inspired interaction modeling, enabling ensembles that are effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification). We evaluate our Chemistry-based Multi-LLM collaboration strategy on real-world clinical scenarios to investigate whether such interaction-aware ensembles can generate credible, patient-specific medication recommendations. Preliminary results are encouraging, suggesting that LLM Chemistry-guided collaboration may offer a promising path toward reliable and trustworthy AI assistants in clinical practice.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improve efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Francielle Vargas, Daniel Pedronette
Machine Learning
ๆฅ็ๆ่ฆ
This extended abstract introduces Self-Explaining Contrastive Evidence Re-Ranking (CER), a novel method that restructures retrieval around factual evidence by fine-tuning embeddings with contrastive learning and generating token-level attribution rationales for each retrieved passage. Hard negatives are automatically selected using a subjectivity-based criterion, forcing the model to pull factual rationales closer while pushing subjective or misleading explanations apart. As a result, the method creates an embedding space explicitly aligned with evidential reasoning. We evaluated our method on clinical trial reports, and initial experimental results show that CER improves retrieval accuracy, mitigates the potential for hallucinations in RAG systems, and provides transparent, evidence-based retrieval that enhances reliability, especially in safety-critical domains.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nex-AGI Team, :, Yuxuan Cai, Lu Chen, Qiaoling Chen et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environment for grounded trajectories synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Weiye Shi, Zhaowei Zhang, Shaoheng Yan, Yaodong Yang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) demonstrate remarkable potential across diverse language related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns from raw text remains unclear. To analysis whether LLMs can learn these features effectively and apply them to important nature language related tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works, comprising thousands of sentences per binary task (poetry vs. novel;drama vs. poetry;drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, Tat-Seng Chua
Machine Learning
ๆฅ็ๆ่ฆ
Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each action holds equal contribution, which deviates significantly from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents. CARL achieves focused training through providing action-level optimization signals for high-criticality actions while excluding low-criticality actions from model update. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ MohammadHossein Bateni, Vincent Cohen-Addad, Yuzhou Gu, Silvio Lattanzi, Simon Meierhans et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have proven to be highly effective for solving complex reasoning tasks. Surprisingly, their capabilities can often be improved by iterating on previously generated solutions. In this context, a reasoning plan for generating and combining a set of solutions can be thought of as an algorithm for reasoning using a probabilistic oracle.
We introduce a theoretical framework for analyzing such reasoning algorithms. This framework formalizes the principles underlying popular techniques for iterative improvement and answer aggregation, providing a foundation for designing a new generation of more powerful reasoning methods. Unlike approaches for understanding models that rely on architectural specifics, our model is grounded in experimental evidence. As a result, it offers a general perspective that may extend to a wide range of current and future reasoning oracles.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Julien Benchek, Rohit Shetty, Benjamin Hunsberger, Ajay Arun, Zach Richards et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Models differ across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
๐
2025-12-04
๐ This work has been submitted to the IEEE for possi
๐ฅ Junjie Fan, Hongye Zhao, Linduo Wei, Jiayu Rao, Guijia Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent adaptations of Large Language Models (LLMs) for time series forecasting often fail to effectively enhance information for raw series, leaving LLM reasoning capabilities underutilized. Existing prompting strategies rely on static correlations rather than generative interpretations of dynamic behavior, lacking critical global and instance-specific context. To address this, we propose STELLA (Semantic-Temporal Alignment with Language Abstractions), a framework that systematically mines and injects structured supplementary and complementary information. STELLA employs a dynamic semantic abstraction mechanism that decouples input series into trend, seasonality, and residual components. It then translates intrinsic behavioral features of these components into Hierarchical Semantic Anchors: a Corpus-level Semantic Prior (CSP) for global context and a Fine-grained Behavioral Prompt (FBP) for instance-level patterns. Using these anchors as prefix-prompts, STELLA guides the LLM to model intrinsic dynamics. Experiments on eight benchmark datasets demonstrate that STELLA outperforms state-of-the-art methods in long- and short-term forecasting, showing superior generalization in zero-shot and few-shot settings. Ablation studies further validate the effectiveness of our dynamically generated semantic anchors.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hao Wang, Jialun Zhong, Changcheng Wang, Zhujun Nie, Zheng Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches, whether end-to-end semantic parsing or stepwise agent-based reasoning, often suffer from structural inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, a large language model (LLM) extracts a minimal S-expression core that captures the essential semantics of the input query. This core is then refined by an agentic calibration module, which corrects syntactic inconsistencies and aligns entities and relations precisely with the underlying knowledge graph. The second stage employs template-based completion, guided by question-type prediction and placeholder instantiation, to construct a fully executable S-expression. This decomposition not only simplifies logical form generation but also significantly enhances structural fidelity and linking efficiency. Crucially, SEAL incorporates a self-evolving mechanism that integrates local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance, especially in multi-hop reasoning, comparison, and aggregation tasks. The results validate notable gains in both structural accuracy and computational efficiency, underscoring the framework's capacity for robust and scalable conversational reasoning.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras
Natural Language Processing
ๆฅ็ๆ่ฆ
Expanding the linguistic diversity of instruct large language models (LLMs) is crucial for global accessibility but is often hindered by the reliance on costly specialized target language labeled data and catastrophic forgetting during adaptation. We tackle this challenge under a realistic, low-resource constraint: adapting instruct LLMs using only unlabeled target language data. We introduce Source-Shielded Updates (SSU), a selective parameter update strategy that proactively preserves source knowledge. Using a small set of source data and a parameter importance scoring method, SSU identifies parameters critical to maintaining source abilities. It then applies a column-wise freezing strategy to protect these parameters before adaptation. Experiments across five typologically diverse languages and 7B and 13B models demonstrate that SSU successfully mitigates catastrophic forgetting. It reduces performance degradation on monolingual source tasks to just 3.4% (7B) and 2.8% (13B) on average, a stark contrast to the 20.3% and 22.3% from full fine-tuning. SSU also achieves target-language performance highly competitive with full fine-tuning, outperforming it on all benchmarks for 7B models and the majority for 13B models.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ L. D. M. S. Sai Teja, N. Siva Gopala Krishna, Ufaq Khan, Muhammad Haris Khan, Partha Pakray et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In the age of advanced large language models (LLMs), the boundaries between human and AI-generated text are becoming increasingly blurred. We address the challenge of segmenting mixed-authorship text, that is identifying transition points in text where authorship shifts from human to AI or vice-versa, a problem with critical implications for authenticity, trust, and human oversight. We introduce a novel framework, called Info-Mask for mixed authorship detection that integrates stylometric cues, perplexity-driven signals, and structured boundary modeling to accurately segment collaborative human-AI content. To evaluate the robustness of our system against adversarial perturbations, we construct and release an adversarial benchmark dataset Mixed-text Adversarial setting for Segmentation (MAS), designed to probe the limits of existing detectors. Beyond segmentation accuracy, we introduce Human-Interpretable Attribution (HIA overlays that highlight how stylometric features inform boundary predictions, and we conduct a small-scale human study assessing their usefulness. Across multiple architectures, Info-Mask significantly improves span-level robustness under adversarial conditions, establishing new baselines while revealing remaining challenges. Our findings highlight both the promise and limitations of adversarially robust, interpretable mixed-authorship detection, with implications for trust and oversight in human-AI co-authorship.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Vignesh Kumar Kembu, Pierandrea Morandini, Marta Bianca Maria Ranzini, Antonino Nocera
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have become a key topic in AI and NLP, transforming sectors like healthcare, finance, education, and marketing by improving customer service, automating tasks, providing insights, improving diagnostics, and personalizing learning experiences. Information extraction from clinical records is a crucial task in digital healthcare. Although traditional NLP techniques have been used for this in the past, they often fall short due to the complexity, variability of clinical language, and high inner semantics in the free clinical text. Recently, Large Language Models (LLMs) have become a powerful tool for better understanding and generating human-like text, making them highly effective in this area. In this paper, we explore the ability of open-source multilingual LLMs to understand EHRs (Electronic Health Records) in Italian and help extract information from them in real-time. Our detailed experimental campaign on comorbidity extraction from EHR reveals that some LLMs struggle in zero-shot, on-premises settings, and others show significant variation in performance, struggling to generalize across various diseases when compared to native pattern matching and manual annotations.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Gianluca Barmina, Nathalie Carmen Hau Norman, Peter Schneider-Kamp, Lukas Galke
Natural Language Processing
ๆฅ็ๆ่ฆ
We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has a higher discriminatory power which allows to better distinguish well-performing models from low-performing ones.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pooja Singh, Sandeep Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models and multilingual machine translation (MT) systems increasingly drive access to information, yet many languages of the tribal communities remain effectively invisible in these technologies. This invisibility exacerbates existing structural inequities in education, governance, and digital participation. We present AdiBhashaa, a community-driven initiative that constructs the first open parallel corpora and baseline MT systems for four major Indian tribal languages-Bhili, Mundari, Gondi, and Santali. This work combines participatory data creation with native speakers, human-in-the-loop validation, and systematic evaluation of both encoder-decoder MT models and large language models. In addition to reporting technical findings, we articulate how AdiBhashaa illustrates a possible model for more equitable AI research: it centers local expertise, builds capacity among early-career researchers from marginalized communities, and foregrounds human validation in the development of language technologies.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) to memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for specific memory operations$\unicode{x2013}$knowledge extraction, memory update, and memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10$\times$ larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60$\times$ larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations instead, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Malvina Nissim, Danilo Croce, Viviana Patti, Pierpaolo Basile, Giuseppe Attanasio et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ruilin Li, Yibin Wang, Wenhong Zhu, Chenglin Li, Jinghao Zhang et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Knowledge editing aims to update specific facts in large language models (LLMs) without full retraining. Prior efforts sought to tune the knowledge layers of LLMs, proving effective for making selective edits. However, a significant gap exists between their performance in controlled, teacher-forcing evaluations and their real-world effectiveness in lifelong learning scenarios, which greatly limits their practical applicability. This work's empirical analysis reveals two recurring issues associated with this gap: (1) Most traditional methods lead the edited model to overfit to the new fact, thereby degrading pre-trained capabilities; (2) There is a critical absence of a knowledge consolidation stage, leaving new facts insufficiently integrated into LLMs' inference-time behavior under autoregressive generation, thereby leading to a mismatch between parametric knowledge and actual generation behavior. To this end, we propose Edit-then-Consolidate, a novel knowledge editing paradigm that aims to bridge the gap between theoretical knowledge editing methods and their real-world applicability. Specifically, (1) our framework mitigates overfitting via Targeted Proximal Supervised Fine-Tuning (TPSFT) that localizes the edit via a trust-region objective to limit policy drift; (2) Then, a consolidation stage using Group Relative Policy Optimization (GRPO) aligns the edited knowledge with CoT-based inference policy by optimizing trajectory-level behavior under comprehensive reward signals. Extensive experiments demonstrate our framework consistently improves editing reliability and generalization under real-world evaluations, while better preserving locality and pre-trained capabilities.
๐
2025-12-04
๐ AAAI 2026
๐ฅ Xinyue Kang, Diwei Shi, Li Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
It is a critical challenge to efficiently unlock the powerful reasoning potential of Large Language Models (LLMs) for specific tasks or new distributions. Existing test-time adaptation methods often require tuning model parameters, which is not only computationally expensive but also risks degrading the model's pre-existing abilities.To address this, we introduce a lightweight component, Test-Time Steering Vectors (TTSV), which is prepended to the input while keeping the LLM's parameters entirely frozen. By optimizing the TTSV on test data to minimize the model's output entropy, we steer the model towards an internal state of higher confidence, activating its inherent abilities most relevant to the current task. TTSV is both lightweight and highly efficient to optimize, making it a true plug-and-play enhancement. Extensive experiments validate our approach's effectiveness on both base models and reasoning-enhanced models. For instance, on the MATH500 task, TTSV achieves a 45.88% relative performance gain on the Qwen2.5-Math-7B model and a 16.22% relative gain on the Qwen3-4B model. Furthermore, our approach exhibits robust generalization, with its steering vectors proving highly transferable across diverse tasks.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen
Natural Language Processing
ๆฅ็ๆ่ฆ
Extreme low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2-bits and even 4-bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework that is highly effective even without mixed-precision. SignRoundV2 introduces (1) a fast sensitivity metric that combines gradient information with quantization-induced deviations to guide layer-wise bit allocation, and (2) a lightweight pre-tuning search for quantization scales to improve extremely low-bit quantization. These components allow SignRoundV2 to close the gap with full-precision models. Extensive experiments indicate that our method sustains competitive accuracy for LLMs, achieving production-grade performance with about 1 percent variance at 4-5 bits and strong results even at 2 bits. The implementation is available at https://github.com/intel/auto-round.
๐
2025-12-04
๐ ICDE
๐ฅ Zhuoyue Wan, Wentao Hu, Chen Jason Zhang, Yuanfeng Song, Shuaimin Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Bridging natural language and structured query languages is a long-standing challenge in the database community. While recent advances in language models have shown promise in this direction, existing solutions often rely on large-scale closed-source models that suffer from high inference costs, limited transparency, and lack of adaptability for lightweight deployment. In this paper, we present OsmT, an open-source tag-aware language model specifically designed to bridge natural language and Overpass Query Language (OverpassQL), a structured query language for accessing large-scale OpenStreetMap (OSM) data. To enhance the accuracy and structural validity of generated queries, we introduce a Tag Retrieval Augmentation (TRA) mechanism that incorporates contextually relevant tag knowledge into the generation process. This mechanism is designed to capture the hierarchical and relational dependencies present in the OSM database, addressing the topological complexity inherent in geospatial query formulation. In addition, we define a reverse task, OverpassQL-to-Text, which translates structured queries into natural language explanations to support query interpretation and improve user accessibility. We evaluate OsmT on a public benchmark against strong baselines and observe consistent improvements in both query generation and interpretation. Despite using significantly fewer parameters, our model achieves competitive accuracy, demonstrating the effectiveness of open-source pre-trained language models in bridging natural language and structured query languages within schema-rich geospatial environments.
๐
2025-12-04
๐ AAAI
๐ฅ Jae Hee Lee, Anne Lauscher, Stefano V. Albrecht
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Carolin Mueller-Spitzer, Samira Ochs, Jan Oliver Ruediger, Sascha Wolfer
ๆฅ็ๆ่ฆ
This study examines the distribution and linguistic characteristics of generic masculines (GM) in contemporary German press texts. The use of masculine personal nouns to refer to mixed-gender groups or unspecified individuals has been widely debated in academia and the public, with con-flicting perspectives on its gender-neutrality. While psycholinguistic studies suggest that GM is more readily associated with male referents, corpus-based analyses of its actual use remain scarce. We investigate GM in a large corpus of press texts, focusing on lexeme-specific differences across dif-ferent types of personal nouns. We conducted manual annotations of the whole inflectional para-digm of 21 personal nouns, resulting in 6,195 annotated tokens. Our findings reveal considerable differences between lexical items, especially between passive role nouns and prestige-related per-sonal nouns. On a grammatical level, we find that GM occurs predominantly in the plural and in indefinite noun phrases. Furthermore, our data shows that GM is not primarily used to denote entire classes of people, as has been previously claimed. By providing an empirical insight into the use of GM in authentic written language, we contribute to a more nuanced understanding of its forms and manifestations. These findings provide a solid basis for aligning linguistic stimuli in psy-cholinguistic studies more closely with real-world language use.
๐
2025-12-04
๐ ACL
๐ฅ Jinbo Liu, Defu Cao, Yifei Wei, Tianyao Su, Yuan Liang et al.
Robotics
ๆฅ็ๆ่ฆ
Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a framework that measures how network structure shapes leakage. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over up to 10 interaction rounds, we quantify leakage as the fraction of ground-truth PII recovered from attacking agent outputs via exact matching. We systematically evaluate six common network topologies (fully connected, ring, chain, binary tree, star, and star-ring), varying agent counts $n\in\{4,5,6\}$, attacker-target placements, and base models. Our findings reveal consistent patterns: fully connected graphs exhibit maximum leakage while chains provide strongest protection; shorter attacker-target graph distance and higher target centrality significantly increase vulnerability; leakage rises sharply in early rounds before plateauing; model choice shifts absolute leakage rates but preserves topology rankings; temporal/locational PII attributes leak more readily than identity credentials or regulated identifiers. These results provide the first systematic mapping from architectural choices to measurable privacy risk, yielding actionable guidance: prefer sparse or hierarchical connectivity, maximize attacker-target separation, limit node degree and network radius, avoid shortcuts bypassing hubs, and implement topology-aware access controls.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporal inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g. object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improves VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
๐
2025-12-04
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Adamantios I. Gafos, Stephan R. Kuberski
ๆฅ็ๆ่ฆ
Rhythmic fluctuations in acoustic energy and accompanying neuronal excitations in cortical oscillations are characteristic of human speech, yet whether a corresponding rhythmicity inheres in the articulatory movements that generate speech remains unclear. The received understanding of speech movements as discrete, goal-oriented actions struggles to make contact with the rhythmicity findings. In this work, we demonstrate that an unintuitive -- but no less principled than the conventional -- representation for discrete movements reveals a pervasive limit cycle organization and unlocks the recovery of previously inaccessible rhythmic structure underlying the motor activity of speech. These results help resolve a time-honored tension between the ubiquity of biological rhythmicity and discreteness in speech, the quintessential human higher function, by revealing a rhythmic organization at the most fundamental level of individual articulatory actions.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wenjin Liu, Haoran Luo, Xin Feng, Xiang Ji, Lijuan Zhou et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use the recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pritam Kadasi, Abhishek Upperwal, Mayank SIngh
Machine Learning
ๆฅ็ๆ่ฆ
We propose ADAPT, a meta-learning algorithm that \emph{learns} task sampling proportions under an explicit token budget for multi-task instruction tuning. Instead of fixing task weights by hand, \adapt{} maintains a continuous distribution over tasks and updates it via meta-gradients of a smooth worst-case validation objective, inducing an adaptive curriculum that allocates more tokens to useful tasks while avoiding collapse. We instantiate ADAPT on three $\sim$1B-parameter open-weight LLMs (Gemma-3-1B, LLaMA-3.2-1B, Qwen-0.6B), training on 20 Natural Instructions task types under budgets of $1\%$, $5\%$, and $10\%$ of the available supervised tokens, and compare against strong supervised fine-tuning baselines with uniform and size-proportional mixing. We conduct evaluations on 11 out-of-domain benchmarks spanning reasoning, reading comprehension, code generation, and instruction following, we find that ADAPT matches or slightly improves average downstream performance relative to the best static mixture, while using fewer effective training tokens and reallocating budget toward harder, benchmark-aligned tasks.
๐
2025-12-04
๐ NeurIPS 2025
๐ฅ Yangning Li, Shaoshen Chen, Yinghui Li, Yankai Chen, Hai-Tao Zheng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The quadratic complexity of self-attention constrains Large Language Models (LLMs) in processing long contexts, a capability essential for many advanced applications. Context compression aims to alleviate this computational bottleneck while retaining critical semantic information. However, existing approaches often fall short: explicit methods may compromise local detail, whereas implicit methods can suffer from positional biases, information degradation, or an inability to capture long-range semantic dependencies. We propose AdmTree, a novel framework for adaptive, hierarchical context compression with a central focus on preserving high semantic fidelity while maintaining efficiency. AdmTree dynamically segments input based on information density, utilizing gist tokens to summarize variable-length segments as the leaves of a semantic binary tree. This structure, together with a lightweight aggregation mechanism and a frozen backbone LLM (thereby minimizing new trainable parameters), enables efficient hierarchical abstraction of the context. By preserving fine-grained details alongside global semantic coherence, mitigating positional bias, and dynamically adapting to content, AdmTree robustly retains the semantic information of long contexts.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pengfei Cao, Zeao Ji, Daojian Zeng, Jun Zhao, Kang Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Adjusting the outdated knowledge of large language models (LLMs) after deployment remains a major challenge. This difficulty has spurred the development of knowledge editing, which seeks to accurately and efficiently modify a model's internal (parametric) knowledge without retraining it from scratch. However, existing methods suffer from two limitations. First, they depend on structured triplets that are misaligned with the free-text nature of LLM pretraining and fail to capture the nuanced relationships among facts. Second, they typically support one-time knowledge updates, with relatively limited research on the problem of sequential or lifelong editing. To address these gaps, we propose a new task, Lifelong Free-text Knowledge Editing (LF-Edit), which enables models to incorporate updates expressed in natural language and supports continual editing over time. Despite its promise, LF-Edit faces the dual challenge of integrating new knowledge while mitigating the forgetting of prior information. To foster research on this new task, we construct a large-scale benchmark, Multi-Rank Lifelong Free-text Editing Benchmark (MRLF-Bench), containing 16,835 free-text edit requests. We further design a cognitively inspired multi-rank evaluation framework encompassing four levels: memorization, understanding, constrained comprehension, and reasoning. To tackle the challenges inherent in LF-Edit, we introduce a novel approach named EvoEdit that enhances knowledge injection through Latent Perturbation Augmentation and preserves prior information via Knowledge-driven Parameter Fusion. Experimental results demonstrate that EvoEdit substantially outperforms existing knowledge editing methods on the proposed LF-Edit task.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tianmai M. Zhang, Zhaoyi Sun, Sihang Zeng, Chenxi Li, Neil F. Abernethy et al.
Machine Learning
ๆฅ็ๆ่ฆ
The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2 -- generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuanshuo Zhang, Aohua Li, Bo Chen, Jingbo Sun, Xiaobing Zhao
ๆฅ็ๆ่ฆ
LLM-based approaches have recently achieved impressive results in zero-shot stance detection. However, they still struggle in complex real-world scenarios, where stance understanding requires dynamic background knowledge, target definitions involve compound entities or events that must be explicitly linked to stance labels, and rhetorical devices such as irony often obscure the author's actual intent. To address these challenges, we propose MSME, a Multi-Stage, Multi-Expert framework for zero-shot stance detection. MSME consists of three stages: (1) Knowledge Preparation, where relevant background knowledge is retrieved and stance labels are clarified; (2) Expert Reasoning, involving three specialized modules-Knowledge Expert distills salient facts and reasons from a knowledge perspective, Label Expert refines stance labels and reasons accordingly, and Pragmatic Expert detects rhetorical cues such as irony to infer intent from a pragmatic angle; (3) Decision Aggregation, where a Meta-Judge integrates all expert analyses to produce the final stance prediction. Experiments on three public datasets show that MSME achieves state-of-the-art performance across the board.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Guoshenghui Zhao, Huawei Lin, Weijie Zhao
Natural Language Processing
ๆฅ็ๆ่ฆ
Removing specific data influence from large language models (LLMs) remains challenging, as retraining is costly and existing approximate unlearning methods are often unstable. The challenge is exacerbated when the forget set is small or imbalanced. We introduce RapidUn, an influence-driven and parameter-efficient unlearning framework. It first estimates per-sample influence through a fast estimation module, then maps these scores into adaptive update weights that guide selective parameter updates -- forgetting harmful behavior while retaining general knowledge. On Mistral-7B and Llama-3-8B across Dolly-15k and Alpaca-57k, RapidUn achieves up to 100 times higher efficiency than full retraining and consistently outperforms Fisher, GA, and LoReUn on both in-distribution and out-of-distribution forgetting. These results establish influence-guided parameter reweighting as a scalable and interpretable paradigm for LLM unlearning.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Subrata Karmaker
Machine Learning
ๆฅ็ๆ่ฆ
Sarcasm is common in online discussions, yet difficult for machines to identify because the intended meaning often contradicts the literal wording. In this work, I study sarcasm detection using only classical machine learning methods and explicit feature engineering, without relying on neural networks or context from parent comments. Using a 100,000-comment subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF features with simple stylistic indicators. Four models are evaluated: logistic regression, a linear SVM, multinomial Naive Bayes, and a random forest. Naive Bayes and logistic regression perform the strongest, achieving F1-scores around 0.57 for sarcastic comments. Although the lack of conversational context limits performance, the results offer a clear and reproducible baseline for sarcasm detection using lightweight and interpretable methods.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhou Yang, Shunyan Luo, Jiazhen Zhu, Fang Jin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Deep neural networks (DNNs) have made significant strides in Natural Language Processing (NLP), yet their interpretability remains elusive, particularly when evaluating their intricate decision-making processes. Traditional methods often rely on post-hoc interpretations, such as saliency maps or feature visualization, which might not be directly applicable to the discrete nature of word data in NLP. Addressing this, we introduce the Model-agnostic Saliency Estimation (MASE) framework. MASE offers local explanations for text-based predictive models without necessitating in-depth knowledge of a model's internal architecture. By leveraging Normalized Linear Gaussian Perturbations (NLGP) on the embedding layer instead of raw word inputs, MASE efficiently estimates input saliency. Our results indicate MASE's superiority over other model-agnostic interpretation methods, especially in terms of Delta Accuracy, positioning it as a promising tool for elucidating the operations of text-based models in NLP.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Muyu Pan, Matthew Walter, Dheeraj Kodakandla, Mahfuza Farooque
Machine Learning
ๆฅ็ๆ่ฆ
Our work presents a novel reinforcement learning (RL) based framework to optimize heuristic selection within the conflict-driven clause learning (CDCL) process, improving the efficiency of Boolean satisfiability (SAT) solving. The proposed system, LangSAT, bridges the gap between natural language inputs and propositional logic by converting English descriptions into Conjunctive Normal Form (CNF) expressions and solving them using an RL-enhanced CDCL SAT solver. Unlike existing SAT-solving platforms that require CNF as input, LangSAT enables users to input standard English descriptions, making SAT-solving more accessible. The framework comprises two key components: Lang2Logic, which translates English sentences into CNF expressions, and SmartSAT, an RL-based SAT solver. SmartSAT encodes clause-variable relationships as structured graph representations and extracts global features specific to the SAT problem. This implementation provides the RL agent with deeper contextual information, enabling SAT problems to be solved more efficiently. Lang2Logic was evaluated on diverse natural language inputs, processing descriptions up to 450 words. The generated CNFs were solved by SmartSAT, which demonstrated comparable performance to traditional CDCL heuristics with respect to solving time. The combined LangSAT framework offers a more accessible and scalable solution for SAT-solving tasks across reasoning, formal verification, and debugging.
๐
2025-12-04
๐ IEEE/CVF Winter Conference on Applications of Comp
๐ฅ Kai-Po Chang, Wei-Yuan Cheng, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang
Multimodal
ๆฅ็ๆ่ฆ
Recent advancement in multimodal LLMs (MLLMs) has demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework for enabling object and action faithfulness by exempting the spurious correlations and enforcing the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions to the contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiming Xu, Yuan Yuan, Vijay Viswanathan, Graham Neubig
Natural Language Processing
ๆฅ็ๆ่ฆ
Text clustering is a fundamental task in natural language processing, yet traditional clustering algorithms with pre-trained embeddings often struggle in domain-specific contexts without costly fine-tuning. Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries. We propose ClusterFusion, a hybrid framework that instead treats the LLM as the clustering core, guided by lightweight embedding methods. The framework proceeds in three stages: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment. This design enables direct incorporation of domain knowledge and user preferences, fully leveraging the contextual adaptability of LLMs. Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion not only achieves state-of-the-art performance on standard tasks but also delivers substantial gains in specialized domains. To support future work, we release our newly constructed dataset and results on all benchmarks.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, Alan Yuille
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that demonstrates that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models - including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models - we identify universal subspaces capturing majority variance in just a few principal directions. By applying spectral decomposition techniques to the weight matrices of various architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited, within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise important questions about the possibility of discovering these universal subspaces without the need for extensive data and computational resources. Furthermore, this inherent structure has significant implications for model reusability, multi-task learning, model merging, and the development of training and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.
๐
2025-12-04
๐ NeurIPS 2025
๐ฅ Zhen Liu, Tim Z. Xiao, Carles Domingo-Enrich, Weiyang Liu, Dinghuai Zhang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
While methods exist for aligning flow matching models--a popular and effective class of generative models--with human preferences, existing approaches fail to achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The key idea behind this algorithm is that the optimal difference between the finetuned velocity field and the pretrained one should be matched with the gradient field of a value function. This method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function to enable fast adaptation. Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.
๐
2025-12-04
๐ 8 pages, 8 figures, 1 table, website at https://w3
๐ฅ Malte Hoffmann, Lilla Zรถllei, Adrian V. Dalca
Machine Learning
ๆฅ็ๆ่ฆ
Segmentation of magnetic resonance images (MRI) facilitates analysis of human brain development by delineating anatomical structures. However, in infants and young children, accurate segmentation is challenging due to development and imaging constraints. Pediatric brain MRI is notoriously difficult to acquire, with inconsistent availability of imaging modalities, substantial non-head anatomy in the field of view, and frequent motion artifacts. This has led to specialized segmentation models that are often limited to specific image types or narrow age groups, or that are fragile for more variable images such as those acquired clinically. We address this method fragmentation with BabySeg, a deep learning brain segmentation framework for infants and young children that supports diverse MRI protocols, including repeat scans and image types unavailable during training. Our approach builds on recent domain randomization techniques, which synthesize training images far beyond realistic bounds to promote dataset shift invariance. We also describe a mechanism that enables models to flexibly pool and interact features from any number of input scans. We demonstrate state-of-the-art performance that matches or exceeds the accuracy of several existing methods for various age cohorts and input configurations using a single model, in a fraction of the runtime required by many existing tools.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo et al.
MultimodalRoboticsComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and performs refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Purbesh Mitra, Sennur Ulukus
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) in reasoning based problems, like math and programming. However, RLVR is limited by several bottlenecks, such as, lack of dense reward, and inadequate sample efficiency. As a result, it requires significant compute resources in post-training phase. To overcome these limitations, in this work, we propose \textbf{Semantic Soft Bootstrapping (SSB)}, a self-distillation technique, in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct and most common incorrect response are filtered, and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. This generation process also produces a sequence of logits, which is what the student model tries to match in the training phase just from the bare question alone. In our experiment, Qwen2.5-3B-Instruct on GSM8K dataset via parameter-efficient fine-tuning. We then tested its accuracy on MATH500, and AIME2024 benchmarks. Our experiments show a jump of 10.6%, and 10% improvements in accuracy, respectively, over group relative policy optimization (GRPO), which is a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model, curated dataset is available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion ฯ-PD, a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. ฯ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, ฯ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, ฯ-PD improves CARLA-to-Waymo planner performance by 50\%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our \href{https://yuzeng-at-tri.github.io/ppd-page/}{project page}.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe et al.
MultimodalMachine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before ``acting in pixels'' to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Haiyue Song, Johannes Eschbach-Dymanus, Hour Kaing, Sumire Honda, Hideki Tanaka et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose \textbf{Format Reinforcement Learning (FormatRL)}, which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Vincent Pauline, Tobias Hรถppe, Kirill Neklyudov, Alexander Tong, Stefan Bauer et al.
Machine Learning
ๆฅ็ๆ่ฆ
Although diffusion models now occupy a central place in generative modeling, introductory treatments commonly assume Euclidean data and seldom clarify their connection to discrete-state analogues. This article is a self-contained primer on diffusion over general state spaces, unifying continuous domains and discrete/categorical structures under one lens. We develop the discrete-time view (forward noising via Markov kernels and learned reverse dynamics) alongside its continuous-time limits -- stochastic differential equations (SDEs) in $\mathbb{R}^d$ and continuous-time Markov chains (CTMCs) on finite alphabets -- and derive the associated Fokker--Planck and master equations. A common variational treatment yields the ELBO that underpins standard training losses. We make explicit how forward corruption choices -- Gaussian processes in continuous spaces and structured categorical transition kernels (uniform, masking/absorbing and more) in discrete spaces -- shape reverse dynamics and the ELBO. The presentation is layered for three audiences: newcomers seeking a self-contained intuitive introduction; diffusion practitioners wanting a global theoretical synthesis; and continuous-diffusion experts looking for an analogy-first path into discrete diffusion. The result is a unified roadmap to modern diffusion methodology across continuous domains and discrete sequences, highlighting a compact set of reusable proofs, identities, and core theoretical principles.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Eduardo Di Santi
ๆฅ็ๆ่ฆ
Real-world physical processes do not generate arbitrary variability: their signals concentrate on compact and low-variability subsets of functional space. This geometric structure enables rapid generalization from a few examples in both biological and artificial systems.
This work develops a deterministic functional-topological framework in which the set of valid realizations of a physical phenomenon forms a compact perceptual manifold with stable invariants and a finite Hausdorff radius. We show that the boundaries of this manifold can be discovered in a fully self-supervised manner through Monte Carlo sampling, even when the governing equations of the system are unknown.
We provide theoretical guarantees, practical estimators of knowledge boundaries, and empirical validations across three domains: electromechanical railway point machines, electrochemical battery discharge curves, and physiological ECG signals.
Our results demonstrate that deterministic functional topology offers a unified mathematical foundation for perception, representation, and world-model construction, explaining why biological learners and self-supervised AI models can generalize from limited observations.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dravyansh Sharma
Machine Learning
ๆฅ็ๆ่ฆ
Gradient-based iterative optimization methods are the workhorse of modern machine learning. They crucially rely on careful tuning of parameters like learning rate and momentum. However, one typically sets them using heuristic approaches without formal near-optimality guarantees. Recent work by Gupta and Roughgarden studies how to learn a good step-size in gradient descent. However, like most of the literature with theoretical guarantees for gradient-based optimization, their results rely on strong assumptions on the function class including convexity and smoothness which do not hold in typical applications. In this work, we develop novel analytical tools for provably tuning hyperparameters in gradient-based algorithms that apply to non-convex and non-smooth functions. We obtain matching sample complexity bounds for learning the step-size in gradient descent shown for smooth, convex functions in prior work (up to logarithmic factors) but for a much broader class of functions. Our analysis applies to gradient descent on neural networks with commonly used activation functions (including ReLU, sigmoid and tanh). We extend our framework to tuning multiple hyperparameters, including tuning the learning rate schedule, simultaneously tuning momentum and step-size, and pre-training the initialization vector. Our approach can be used to bound the sample complexity for minimizing both the validation loss as well as the number of gradient descent iterations.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ian Dunn, Liv Toft, Tyler Katz, Juhi Gupta, Riya Shah et al.
Machine Learning
ๆฅ็ๆ่ฆ
Structure-based drug design (SBDD) focuses on designing small-molecule ligands that bind to specific protein pockets. Computational methods are integral in modern SBDD workflows and often make use of virtual screening methods via docking or pharmacophore search. Modern generative modeling approaches have focused on improving novel ligand discovery by enabling de novo design. In this work, we recognize that these tasks share a common structure and can therefore be represented as different instantiations of a consistent generative modeling framework. We propose a unified approach in OMTRA, a multi-modal flow matching model that flexibly performs many tasks relevant to SBDD, including some with no analogue in conventional workflows. Additionally, we curate a dataset of 500M 3D molecular conformers, complementing protein-ligand data and expanding the chemical diversity available for training. OMTRA obtains state of the art performance on pocket-conditioned de novo design and docking; however, the effects of large-scale pretraining and multi-task training are modest. All code, trained models, and dataset for reproducing this work are available at https://github.com/gnina/OMTRA
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shashwat Shankar, Subhranshu Pandey, Innocent Dengkhw Mochahari, Bhabesh Mali, Animesh Basak Chowdhury et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model(LLM) inference demands massive compute and energy, making domain-specific tasks expensive and unsustainable. As foundation models keep scaling, we ask: Is bigger always better for hardware design? Our work tests this by evaluating Small Language Models coupled with a curated agentic AI framework on NVIDIA's Comprehensive Verilog Design Problems(CVDP) benchmark. Results show that agentic workflows: through task decomposition, iterative feedback, and correction - not only unlock near-LLM performance at a fraction of the cost but also create learning opportunities for agents, paving the way for efficient, adaptive solutions in complex design tasks.
๐
2025-12-04
๐ NeurIPS 2025
๐ฅ Samuel Howard, Nikolas Nรผsken, Jakiw Pidstrigach
Robotics
ๆฅ็ๆ่ฆ
Simulating the conditioned dynamics of diffusion processes, given their initial and terminal states, is an important but challenging problem in the sciences. The difficulty is particularly pronounced for rare events, for which the unconditioned dynamics rarely reach the terminal state. In this work, we leverage a self-consistency property of the conditioned dynamics to learn the diffusion bridge in an iterative online manner, and demonstrate promising empirical results in a range of settings.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammad Arif Rasyidi, Omar Alhussein, Sami Muhaidat, Ernesto Damiani
ๆฅ็ๆ่ฆ
Unsupervised anomaly-based intrusion detection requires models that can generalize to attack patterns not observed during training. This work presents the first large-scale evaluation of hybrid quantum-classical (HQC) autoencoders for this task. We construct a unified experimental framework that iterates over key quantum design choices, including quantum-layer placement, measurement approach, variational and non-variational formulations, and latent-space regularization. Experiments across three benchmark NIDS datasets show that HQC autoencoders can match or exceed classical performance in their best configurations, although they exhibit higher sensitivity to architectural decisions. Under zero-day evaluation, well-configured HQC models provide stronger and more stable generalization than classical and supervised baselines. Simulated gate-noise experiments reveal early performance degradation, indicating the need for noise-aware HQC designs. These results provide the first data-driven characterization of HQC autoencoder behavior for network intrusion detection and outline key factors that govern their practical viability. All experiment code and configurations are available at https://github.com/arasyi/hqcae-network-intrusion-detection.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huascar Sanchez, Briland Hitaj, Jules Bergmann, Linda Briesemeister
Natural Language Processing
ๆฅ็ๆ่ฆ
As healthcare increasingly turns to AI for scalable and trustworthy clinical decision support, ensuring reliability in model reasoning remains a critical challenge. Individual large language models (LLMs) are susceptible to hallucinations and inconsistency, whereas naive ensembles of models often fail to deliver stable and credible recommendations. Building on our previous work on LLM Chemistry, which quantifies the collaborative compatibility among LLMs, we apply this framework to improve the reliability in medication recommendation from brief clinical vignettes. Our approach leverages multi-LLM collaboration guided by Chemistry-inspired interaction modeling, enabling ensembles that are effective (exploiting complementary strengths), stable (producing consistent quality), and calibrated (minimizing interference and error amplification). We evaluate our Chemistry-based Multi-LLM collaboration strategy on real-world clinical scenarios to investigate whether such interaction-aware ensembles can generate credible, patient-specific medication recommendations. Preliminary results are encouraging, suggesting that LLM Chemistry-guided collaboration may offer a promising path toward reliable and trustworthy AI assistants in clinical practice.
๐
2025-12-04
๐ ArXiv Preprint (quant-ph)
๐ฅ Yu-Cheng Lin, Yu-Chao Hsu, Samuel Yen-Chi Chen
Machine Learning
ๆฅ็ๆ่ฆ
The Quantum Approximate Optimization Algorithm (QAOA) is a leading approach for solving combinatorial optimization problems on near-term quantum processors. However, finding good variational parameters remains a significant challenge due to the non-convex energy landscape, often resulting in slow convergence and poor solution quality. In this work, we propose a quantum meta-learning framework that trains advanced quantum sequence models to generate effective parameter initialization policies. We investigate four classical or quantum sequence models, including the Quantum Kernel-based Long Short-Term Memory (QK-LSTM), as learned optimizers in a "learning to learn" paradigm. Our numerical experiments on the Max-Cut problem demonstrate that the QK-LSTM optimizer achieves superior performance, obtaining the highest approximation ratios and exhibiting the fastest convergence rate across all tested problem sizes (n=10 to 13). Crucially, the QK-LSTM model achieves perfect parameter transferability by synthesizing a single, fixed set of near-optimal parameters, leading to a remarkable sustained acceleration of convergence even when generalizing to larger problems. This capability, enabled by the compact and expressive power of the quantum kernel architecture, underscores its effectiveness. The QK-LSTM, with only 43 trainable parameters, substantially outperforms the classical LSTM (56 parameters) and other quantum sequence models, establishing a robust pathway toward highly efficient parameter initialization for variational quantum algorithms in the NISQ era.
๐
2025-12-04
๐ ArXiv Preprint (quant-ph)
๐ฅ Yu-Chao Hsu, Jiun-Cheng Jiang, Chun-Hua Lin, Kuo-Chung Peng, Nan-Yow Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Long short-term memory (LSTM) models are a particular type of recurrent neural networks (RNNs) that are central to sequential modeling tasks in domains such as urban telecommunication forecasting, where temporal correlations and nonlinear dependencies dominate. However, conventional LSTMs suffer from high parameter redundancy and limited nonlinear expressivity. In this work, we propose the Quantum-inspired Kolmogorov-Arnold Long Short-Term Memory (QKAN-LSTM), which integrates Data Re-Uploading Activation (DARUAN) modules into the gating structure of LSTMs. Each DARUAN acts as a quantum variational activation function (QVAF), enhancing frequency adaptability and enabling an exponentially enriched spectral representation without multi-qubit entanglement. The resulting architecture preserves quantum-level expressivity while remaining fully executable on classical hardware. Empirical evaluations on three datasets, Damped Simple Harmonic Motion, Bessel Function, and Urban Telecommunication, demonstrate that QKAN-LSTM achieves superior predictive accuracy and generalization with a 79% reduction in trainable parameters compared to classical LSTMs. We extend the framework to the Jiang-Huang-Chen-Goan Network (JHCG Net), which generalizes KAN to encoder-decoder structures, and then further use QKAN to realize the latent KAN, thereby creating a Hybrid QKAN (HQKAN) for hierarchical representation learning. The proposed HQKAN-LSTM thus provides a scalable and interpretable pathway toward quantum-inspired sequential modeling in real-world data environments.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Cassandra Goldberg, Chaehyeon Kim, Adam Stein, Eric Wong
ๆฅ็ๆ่ฆ
Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within the noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompting concept detection approaches, achieving up to a 14% higher F1 score across image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage SuperActivator tokens to improve feature attributions for concepts.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improve efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuan Li, Samuel Bello
ๆฅ็ๆ่ฆ
Accurate estimation of three-dimensional ground reaction forces and moments (GRFs/GRMs) is crucial for both biomechanics research and clinical rehabilitation evaluation. In this study, we focus on insole-based GRF/GRM estimation and further validate our approach on a public walking dataset. We propose a Dual-Path Region-Guided Attention Network that integrates anatomy-inspired spatial priors and temporal priors into a region-level attention mechanism, while a complementary path captures context from the full sensor field. The two paths are trained jointly and their outputs are combined to produce the final GRF/GRM predictions. Conclusions: Our model outperforms strong baseline models, including CNN and CNN-LSTM architectures on two datasets, achieving the lowest six-component average NRMSE of 5.78% on the insole dataset and 1.42% for the vertical ground reaction force on the public dataset. This demonstrates robust performance for ground reaction force and moment estimation.
๐
2025-12-04
๐ ArXiv Preprint (stat.ME)
๐ฅ Garud Iyengar, Yu-Shiou Willy Lin, Kaizheng Wang
ๆฅ็ๆ่ฆ
Simulation of complex systems originated in manufacturing and queuing applications. It is now widely used for large-scale, ML-based systems in research, education, and consumer surveys. However, characterizing the discrepancy between simulators and ground truth remains challenging for increasingly complex, machine-learning-based systems. We propose a computationally tractable method to estimate the quantile function of the discrepancy between the simulated and ground-truth outcome distributions. Our approach focuses on output uncertainty and treats the simulator as a black box, imposing no modeling assumptions on its internals, and hence applies broadly across many parameter families, from Bernoulli and multinomial models to continuous, vector-valued settings. The resulting quantile curve supports confidence interval construction for unseen scenarios, risk-aware summaries of sim-to-real discrepancy (e.g., VaR/CVaR), and comparison of simulators' performance. We demonstrate our methodology in an application assessing LLM simulation fidelity on the WorldValueBench dataset spanning four LLMs.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pham Thach Thanh Truc, Dang Hoai Nam, Huynh Tong Dang Khoa, Vo Nguyen Le Duy
Machine Learning
ๆฅ็ๆ่ฆ
Handwritten Text Recognition remains challenging due to the limited data, high writing style variance, and scripts with complex diacritics. Existing approaches, though partially address these issues, often struggle to generalize without massive synthetic data. To address these challenges, we propose HTR-ConvText, a model designed to capture fine-grained, stroke-level local features while preserving global contextual dependencies. In the feature extraction stage, we integrate a residual Convolutional Neural Network backbone with a MobileViT with Positional Encoding block. This enables the model to both capture structural patterns and learn subtle writing details. We then introduce the ConvText encoder, a hybrid architecture combining global context and local features within a hierarchical structure that reduces sequence length for improved efficiency. Additionally, an auxiliary module injects textual context to mitigate the weakness of Connectionist Temporal Classification. Evaluations on IAM, READ2016, LAM and HANDS-VNOnDB demonstrate that our approach achieves improved performance and better generalization compared to existing methods, especially in scenarios with limited training samples and high handwriting diversity.
๐
2025-12-04
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Adri Gรณmez Martรญn, Felix Mรถller, Steven McDonagh, Monica Abella, Manuel Desco et al.
ๆฅ็ๆ่ฆ
Neural architecture search (NAS) in expressive search spaces is a computationally hard problem, but it also holds the potential to automatically discover completely novel and performant architectures. To achieve this we need effective search algorithms that can identify powerful components and reuse them in new candidate architectures. In this paper, we introduce two adapted variants of the Smith-Waterman algorithm for local sequence alignment and use them to compute the edit distance in a grammar-based evolutionary architecture search. These algorithms enable us to efficiently calculate a distance metric for neural architectures and to generate a set of hybrid offspring from two parent models. This facilitates the deployment of crossover-based search heuristics, allows us to perform a thorough analysis on the architectural loss landscape, and track population diversity during search. We highlight how our method vastly improves computational complexity over previous work and enables us to efficiently compute shortest paths between architectures. When instantiating the crossover in evolutionary searches, we achieve competitive results, outperforming competing methods. Future work can build upon this new tool, discovering novel components that can be used more broadly across neural architecture design, and broadening its applications beyond NAS.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Yuchen Jiao, Yuxin Chen, Gen Li
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Guided or controlled data generation with diffusion models\blfootnote{Partial preliminary results of this work appeared in International Conference on Machine Learning 2025 \citep{li2025provable}.} has become a cornerstone of modern generative modeling. Despite substantial advances in diffusion model theory, the theoretical understanding of guided diffusion samplers remains severely limited. We make progress by developing a unified algorithmic and theoretical framework that accommodates both diffusion guidance and reward-guided diffusion. Aimed at fine-tuning diffusion models to improve certain rewards, we propose injecting a reward guidance term -- constructed from the difference between the original and reward-reweighted scores -- into the backward diffusion process, and rigorously quantify the resulting reward improvement over the unguided counterpart. As a key application, our framework shows that classifier-free guidance (CFG) decreases the expected reciprocal of the classifier probability, providing the first theoretical characterization of the specific performance metric that CFG improves for general target distributions. When applied to reward-guided diffusion, our framework yields a new sampler that is easy-to-train and requires no full diffusion trajectories during training. Numerical experiments further corroborate our theoretical findings.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ NaHyeon Park, Namin An, Kunhee Kim, Soyeon Yoon, Jiahao Huo et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Mouad EL Bouchattaoui
Natural Language Processing
ๆฅ็ๆ่ฆ
This thesis develops methods for causal inference and causal representation learning (CRL) in high-dimensional, time-varying data.
The first contribution introduces the Causal Dynamic Variational Autoencoder (CDVAE), a model for estimating Individual Treatment Effects (ITEs) by capturing unobserved heterogeneity in treatment response driven by latent risk factors that affect only outcomes. CDVAE comes with theoretical guarantees on valid latent adjustment and generalization bounds for ITE error. Experiments on synthetic and real datasets show that CDVAE outperforms baselines, and that state-of-the-art models greatly improve when augmented with its latent substitutes, approaching oracle performance without access to true adjustment variables.
The second contribution proposes an efficient framework for long-term counterfactual regression based on RNNs enhanced with Contrastive Predictive Coding (CPC) and InfoMax. It captures long-range dependencies under time-varying confounding while avoiding the computational cost of transformers, achieving state-of-the-art results and introducing CPC into causal inference.
The third contribution advances CRL by addressing how latent causes manifest in observed variables. We introduce a model-agnostic interpretability layer based on the geometry of the decoder Jacobian. A sparse self-expression prior induces modular, possibly overlapping groups of observed features aligned with shared latent influences. We provide recovery guarantees in both disjoint and overlapping settings and show that meaningful latent-to-observed structure can be recovered without anchor features or single-parent assumptions. Scalable Jacobian-based regularization techniques are also developed.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Armand Kassaรฏ Koupaรฏ, Lise Le Boudec, Patrick Gallinari
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce ECHO, a transformer-operator framework for generating million-point PDE trajectories. While existing neural operators (NOs) have shown promise for solving partial differential equations, they remain limited in practice due to poor scalability on dense grids, error accumulation during dynamic unrolling, and task-specific design. ECHO addresses these challenges through three key innovations. (i) It employs a hierarchical convolutional encode-decode architecture that achieves a 100 $\times$ spatio-temporal compression while preserving fidelity on mesh points. (ii) It incorporates a training and adaptation strategy that enables high-resolution PDE solution generation from sparse input grids. (iii) It adopts a generative modeling paradigm that learns complete trajectory segments, mitigating long-horizon error drift. The training strategy decouples representation learning from downstream task supervision, allowing the model to tackle multiple tasks such as trajectory generation, forward and inverse problems, and interpolation. The generative model further supports both conditional and unconditional generation. We demonstrate state-of-the-art performance on million-point simulations across diverse PDE systems featuring complex geometries, high-frequency dynamics, and long-term horizons.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ NaHyeon Park, Kunhee Kim, Junsuk Choe, Hyunjung Shim
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
๐
2025-12-04
๐ 13 pages, 13 figures, 40 references, submitted to
๐ฅ Guangming Liang, Mingjie Yang, Dongzhu Liu, Paul Henderson, Lajos Hanzo
Multimodal
ๆฅ็ๆ่ฆ
Accurate channel state information (CSI) underpins reliable and efficient wireless communication. However, acquiring CSI via pilot estimation incurs substantial overhead, especially in massive multiple-input multiple-output (MIMO) systems operating in high-Doppler environments. By leveraging the growing availability of environmental sensing data, this treatise investigates pilot-free channel inference that estimates complete CSI directly from multimodal observations, including camera images, LiDAR point clouds, and GPS coordinates. In contrast to prior studies that rely on predefined channel models, we develop a data-driven framework that formulates the sensing-to-channel mapping as a cross-modal flow matching problem. The framework fuses multimodal features into a latent distribution within the channel domain, and learns a velocity field that continuously transforms the latent distribution toward the channel distribution. To make this formulation tractable and efficient, we reformulate the problem as an equivalent conditional flow matching objective and incorporate a modality alignment loss, while adopting low-latency inference mechanisms to enable real-time CSI estimation. In experiments, we build a procedural data generator based on Sionna and Blender to support realistic modeling of sensing scenes and wireless propagation. System-level evaluations demonstrate significant improvements over pilot- and sensing-based benchmarks in both channel estimation accuracy and spectral efficiency for the downstream beamforming task.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Roberto Cipollone, Luca Iocchi, Matteo Leonetti
Machine Learning
ๆฅ็ๆ่ฆ
The main focus of Hierarchical Reinforcement Learning (HRL) is studying how large Markov Decision Processes (MDPs) can be more efficiently solved when addressed in a modular way, by combining partial solutions computed for smaller subtasks. Despite their very intuitive role for learning, most notions of MDP abstractions proposed in the HRL literature have limited expressive power or do not possess formal efficiency guarantees. This work addresses these fundamental issues by defining Realizable Abstractions, a new relation between generic low-level MDPs and their associated high-level decision processes. The notion we propose avoids non-Markovianity issues and has desirable near-optimality guarantees. Indeed, we show that any abstract policy for Realizable Abstractions can be translated into near-optimal policies for the low-level MDP, through a suitable composition of options. As demonstrated in the paper, these options can be expressed as solutions of specific constrained MDPs. Based on these findings, we propose RARL, a new HRL algorithm that returns compositional and near-optimal low-level policies, taking advantage of the Realizable Abstraction given in the input. We show that RARL is Probably Approximately Correct, it converges in a polynomial number of samples, and it is robust to inaccuracies in the abstraction.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rajneil Baruah
ๆฅ็ๆ่ฆ
We present a novel technique for amortized posterior estimation using Normalizing Flows trained with likelihood-weighted importance sampling. This approach allows for the efficient inference of theoretical parameters in high-dimensional inverse problems without the need for posterior training samples. We implement the method on multi-modal benchmark tasks in 2D and 3D to check for the efficacy. A critical observation of our study is the impact of the topology of the base distributions on the modelled posteriors. We find that standard unimodal base distributions fail to capture disconnected support, resulting in spurious probability bridges between modes. We demonstrate that initializing the flow with a Gaussian Mixture Model that matches the cardinality of the target modes significantly improves reconstruction fidelity, as measured by some distance and divergence metrics.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, Tat-Seng Chua
Machine Learning
ๆฅ็ๆ่ฆ
Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each action holds equal contribution, which deviates significantly from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents. CARL achieves focused training through providing action-level optimization signals for high-criticality actions while excluding low-criticality actions from model update. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kailiang Liu, Ying Chen, Ralf Borndรถrfer, Thorsten Koch
Machine Learning
ๆฅ็ๆ่ฆ
Intraday surgical scheduling is a multi-objective decision problem under uncertainty-balancing elective throughput, urgent and emergency demand, delays, sequence-dependent setups, and overtime. We formulate the problem as a cooperative Markov game and propose a multi-agent reinforcement learning (MARL) framework in which each operating room (OR) is an agent trained with centralized training and decentralized execution. All agents share a policy trained via Proximal Policy Optimization (PPO), which maps rich system states to actions, while a within-epoch sequential assignment protocol constructs conflict-free joint schedules across ORs. A mixed-integer pre-schedule provides reference starting times for electives; we impose type-specific quadratic delay penalties relative to these references and a terminal overtime penalty, yielding a single reward that captures throughput, timeliness, and staff workload. In simulations reflecting a realistic hospital mix (six ORs, eight surgery types, random urgent and emergency arrivals), the learned policy outperforms six rule-based heuristics across seven metrics and three evaluation subsets, and, relative to an ex post MIP oracle, quantifies optimality gaps. Policy analytics reveal interpretable behavior-prioritizing emergencies, batching similar cases to reduce setups, and deferring lower-value electives. We also derive a suboptimality bound for the sequential decomposition under simplifying assumptions. We discuss limitations-including OR homogeneity and the omission of explicit staffing constraints-and outline extensions. Overall, the approach offers a practical, interpretable, and tunable data-driven complement to optimization for real-time OR scheduling.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jonathan Baxter, Peter Bartlett
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
In general, approximating classes of functions defined over high-dimensional input spaces by linear combinations of a fixed set of basis functions or ``features'' is known to be hard. Typically, the worst-case error of the best basis set decays only as fast as $ฮ\(n^{-1/d}\)$, where $n$ is the number of basis functions and $d$ is the input dimension. However, there are many examples of high-dimensional pattern recognition problems (such as face recognition) where linear combinations of small sets of features do solve the problem well. Hence these function classes do not suffer from the ``curse of dimensionality'' associated with more general classes. It is natural then, to look for characterizations of high-dimensional function classes that nevertheless are approximated well by linear combinations of small sets of features. In this paper we give a general result relating the error of approximation of a function class to the covering number of its ``convex core''. For one-hidden-layer neural networks, covering numbers of the class of functions computed by a single hidden node upper bound the covering numbers of the convex core. Hence, using standard results we obtain upper bounds on the approximation rate of neural network classes.
๐
2025-12-04
๐ ArXiv Preprint (math.FA)
๐ฅ James Tian
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper develops a nonlinear operator dynamic that progressively removes the influence of a prescribed feature subspace while retaining maximal structure elsewhere. The induced sequence of positive operators is monotone, admits an exact residual decomposition, and converges to the classical shorted operator. Transporting this dynamic to reproducing kernel Hilbert spaces yields a corresponding family of kernels that converges to the largest kernel dominated by the original one and annihilating the given subspace. In the finite-sample setting, the associated Gram operators inherit a structured residual decomposition that leads to a canonical form of kernel ridge regression and a principled way to enforce nuisance invariance. This gives a unified operator-analytic approach to invariant kernel construction and structured regularization in data analysis.
๐
2025-12-04
๐ This work has been submitted to the IEEE for possi
๐ฅ Junjie Fan, Hongye Zhao, Linduo Wei, Jiayu Rao, Guijia Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent adaptations of Large Language Models (LLMs) for time series forecasting often fail to effectively enhance information for raw series, leaving LLM reasoning capabilities underutilized. Existing prompting strategies rely on static correlations rather than generative interpretations of dynamic behavior, lacking critical global and instance-specific context. To address this, we propose STELLA (Semantic-Temporal Alignment with Language Abstractions), a framework that systematically mines and injects structured supplementary and complementary information. STELLA employs a dynamic semantic abstraction mechanism that decouples input series into trend, seasonality, and residual components. It then translates intrinsic behavioral features of these components into Hierarchical Semantic Anchors: a Corpus-level Semantic Prior (CSP) for global context and a Fine-grained Behavioral Prompt (FBP) for instance-level patterns. Using these anchors as prefix-prompts, STELLA guides the LLM to model intrinsic dynamics. Experiments on eight benchmark datasets demonstrate that STELLA outperforms state-of-the-art methods in long- and short-term forecasting, showing superior generalization in zero-shot and few-shot settings. Ablation studies further validate the effectiveness of our dynamically generated semantic anchors.
๐
2025-12-04
๐ ArXiv Preprint (math.AG)
๐ฅ Igor V. Netay
Machine Learning
ๆฅ็ๆ่ฆ
In this paper we describe an approach to construct large extendable collections of vectors in predefined spaces of given dimensions. These collections are useful for neural network latent space configuration and training. For classification problem with large or unknown number of classes this allows to construct classifiers without classification layer and extend the number of classes without retraining of network from the very beginning. The construction allows to create large well-spaced vector collections in spaces of minimal possible dimension. If the number of classes is known or approximately predictable, one can choose sufficient enough vector collection size. If one needs to significantly extend the number of classes, one can extend the collection in the same latent space, or to incorporate the collection into collection of higher dimensions with same spacing between vectors. Also, regular symmetric structure of constructed vector collections can significantly simplify problems of search for nearest cluster centers or embeddings in the latent space. Construction of vector collections is based on combinatorics and geometry of semi-simple Lie groups irreducible representations with highest weight.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Manuel Ladron de Guevara, Jinmo Rhee, Ardavan Bidgoli, Vaidas Razgaitis, Michael Bergin
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.
๐
2025-12-04
๐ ArXiv Preprint (cs.SD)
๐ฅ Wenzhang Du
Multimodal
ๆฅ็ๆ่ฆ
Subjective mean opinion scores (MOS) remain the de-facto target for non-intrusive speech and singing quality assessment. However, MOS is a scalar that collapses heterogeneous user expectations, ignores service-level objectives, and is difficult to compare across deployment graphs. We propose a contract-driven QoE auditing framework: each service graph G is evaluated under a set of human-interpretable experience contracts C, yielding a contract-level satisfaction vector Q(G, C). We show that (i) classical MOS regression is a special case with a degenerate contract set, (ii) contract-driven quality is more stable than MOS under graph view transformations (e.g., pooling by system vs. by system type), and (iii) the effective sample complexity of learning contracts is governed by contract semantics rather than merely the dimensionality of C. We instantiate the framework on URGENT2024 MOS (6.9k speech utterances with raw rating vectors) and SingMOS v1 (7,981 singing clips; 80 systems). On URGENT, we train a contract-aware neural auditor on self-supervised WavLM embeddings; on SingMOS, we perform contract-driven graph auditing using released rating vectors and metadata without decoding audio. Empirically, our auditor matches strong MOS predictors in MOS accuracy while providing calibrated contract probabilities; on SingMOS, Q(G, C) exhibits substantially smaller cross-view drift than raw MOS and graph-only baselines; on URGENT, difficulty curves reveal that mis-specified "simple" contracts can be harder to learn than richer but better aligned contract sets.
๐
2025-12-04
๐ ArXiv Preprint (eess.SY)
๐ฅ Dario Paccagnan, Daniel Marks, Marco C. Campi, Simone Garatti
Robotics
ๆฅ็ๆ่ฆ
Data-driven methods have become paramount in modern systems and control problems characterized by growing levels of complexity. In safety-critical environments, deploying these methods requires rigorous guarantees, a need that has motivated much recent work at the interface of statistical learning and control. However, many existing approaches achieve this goal at the cost of sacrificing valuable data for testing and calibration, or by constraining the choice of learning algorithm, thus leading to suboptimal performances. In this paper, we describe Pick-to-Learn (P2L) for Systems and Control, a framework that allows any data-driven control method to be equipped with state-of-the-art safety and performance guarantees. P2L enables the use of all available data to jointly synthesize and certify the design, eliminating the need to set aside data for calibration or validation purposes. In presenting a comprehensive version of P2L for systems and control, this paper demonstrates its effectiveness across a range of core problems, including optimal control, reachability analysis, safe synthesis, and robust control. In many of these applications, P2L delivers designs and certificates that outperform commonly employed methods, and shows strong potential for broad applicability in diverse practical settings.
๐
2025-12-04
๐ ArXiv Preprint (cs.MA)
๐ฅ Roberto Garrone
Machine Learning
ๆฅ็ๆ่ฆ
This article extends the preprint "Characterizing Agent-Based Model Dynamics via $ฮต$-Machines and Kolmogorov-Style Complexity" by introducing diffusion models as orthogonal and complementary tools for characterizing the output of agent-based models (ABMs). Where $ฮต$-machines capture the predictive temporal structure and intrinsic computation of ABM-generated time series, diffusion models characterize high-dimensional cross-sectional distributions, learn underlying data manifolds, and enable synthetic generation of plausible population-level outcomes. We provide a formal analysis demonstrating that the two approaches operate on distinct mathematical domains -- processes vs. distributions -- and show that their combination yields a two-axis representation of ABM behavior based on temporal organization and distributional geometry. To our knowledge, this is the first framework to integrate computational mechanics with score-based generative modeling for the structural analysis of ABM outputs, thereby situating ABM characterization within the broader landscape of modern machine-learning methods for density estimation and intrinsic computation. The framework is validated using the same elder-caregiver ABM dataset introduced in the companion paper, and we provide precise definitions and propositions formalizing the mathematical complementarity between $ฮต$-machines and diffusion models. This establishes a principled methodology for jointly analyzing temporal predictability and high-dimensional distributional structure in complex simulation models.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) to memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for specific memory operations$\unicode{x2013}$knowledge extraction, memory update, and memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10$\times$ larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60$\times$ larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations instead, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Siqi Wang, Hailong Yang, Junjie Zhu, Xuezhu Wang, Yufan Xu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning from Human Feedback (RLHF) is an important fine-tuning technique for large language models (LLMs) and comprises three stages: generation, inference, and training. The generation stage generates samples that are then used to infer learnable experiences for training. We observe that the generation stage is the bottleneck of the entire execution process and consider it a key point for optimization. Specifically, we realize the first attempt to integrate speculative decoding into the RLHF generation stage and propose RLHFSpec, an RLHF system that accelerates generation execution with adaptive speculative decoding and sample reallocation. To fully exploit the performance potential provided by speculative decoding, especially dealing with the dynamic workload of the generation stage, RLHFSpec proposes a workload-aware drafting strategy selection mechanism, which selects the near-optimal strategy by jointly considering the verification cost and the number of accepted tokens. Moreover, RLHFSpec also proposes sample reallocation to fully utilize the GPU resources, and optimizes it with an efficient sample migration mechanism. The experimental results show that the RLHFSpec can achieve higher throughput in the generation stage compared to state-of-the-art works. Moreover, due to the effective alleviation of the generation bottleneck, RLHFSpec also shows significant performance speedup in the entire RLHF execution.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jingyuan Wang, Jiahao Ji
Machine Learning
ๆฅ็ๆ่ฆ
This article serves as the regression analysis lecture notes in the Intelligent Computing course cluster (including the courses of Artificial Intelligence, Data Mining, Machine Learning, and Pattern Recognition). It aims to provide students -- who are assumed to possess only basic university-level mathematics (i.e., with prerequisite courses in calculus, linear algebra, and probability theory) -- with a comprehensive and self-contained understanding of regression analysis without requiring any additional references. The lecture notes systematically introduce the fundamental concepts, modeling components, and theoretical foundations of regression analysis, covering linear regression, logistic regression, multinomial logistic regression, polynomial regression, basis-function models, kernel-based methods, and neural-network-based nonlinear regression. Core methodological topics include loss-function design, parameter-estimation principles, ordinary least squares, gradient-based optimization algorithms and their variants, as well as regularization techniques such as Ridge and LASSO regression. Through detailed mathematical derivations, illustrative examples, and intuitive visual explanations, the materials help students understand not only how regression models are constructed and optimized, but also how they reveal the underlying relationships between features and response variables. By bridging classical statistical modeling and modern machine-learning practice, these lecture notes aim to equip students with a solid conceptual and technical foundation for further study in advanced artificial intelligence models.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kuinan Hou, Marco Zorzi, Alberto Testolin
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reliably counting and generating sequences of items remain a significant challenge for neural networks, including Large Language Models (LLMs). Indeed, although this capability is readily handled by rule-based symbolic systems based on serial computation, learning to systematically deploy counting procedures is difficult for neural models, which should acquire these skills through learning. Previous research has demonstrated that recurrent architectures can only approximately track and enumerate sequences of events, and it remains unclear whether modern deep learning systems, including LLMs, can deploy systematic counting procedures over sequences of discrete symbols. This paper aims to fill this gap by investigating the sequential enumeration abilities of five state-of-the-art LLMs, including proprietary, open-source, and reasoning models. We probe LLMs in sequential naming and production tasks involving lists of letters and words, adopting a variety of prompting instructions to explore the role of chain-of-thought in the spontaneous emerging of counting strategies. We also evaluate open-source models with the same architecture but increasing size to see whether the mastering of counting principles follows scaling laws, and we analyze the embedding dynamics during sequential enumeration to investigate the emergent encoding of numerosity. We find that some LLMs are indeed capable of deploying counting procedures when explicitly prompted to do so, but none of them spontaneously engage in counting when simply asked to enumerate the number of items in a sequence. Our results suggest that, despite their impressive emergent abilities, LLMs cannot yet robustly and systematically deploy counting procedures, highlighting a persistent gap between neural and symbolic approaches to compositional generalization.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stefan Perko
Machine Learning
ๆฅ็ๆ่ฆ
Gradient optimization algorithms using epochs, that is those based on stochastic gradient descent without replacement (SGDo), are predominantly used to train machine learning models in practice. However, the mathematical theory of SGDo and related algorithms remain underexplored compared to their "with replacement" and "one-pass" counterparts. In this article, we propose a stochastic, continuous-time approximation to SGDo with additive noise based on a Young differential equation driven by a stochastic process we call an "epoched Brownian motion". We show its usefulness by proving the almost sure convergence of the continuous-time approximation for strongly convex objectives and learning rate schedules of the form $u_t = \frac{1}{(1+t)^ฮฒ}, ฮฒ\in (0,1)$. Moreover, we compute an upper bound on the asymptotic rate of almost sure convergence, which is as good or better than previous results for SGDo.
๐
2025-12-04
๐ ArXiv Preprint (math.OC)
๐ฅ Yijie Huang, Mengge Li, Xiang Yu, Zhou Zhou
Machine Learning
ๆฅ็ๆ่ฆ
This paper studies the continuous-time reinforcement learning (RL) for optimal switching problems across multiple regimes. We consider a type of exploratory formulation under entropy regularization where the agent randomizes both the timing of switches and the selection of regimes through the generator matrix of an associated continuous-time finite-state Markov chain. We establish the well-posedness of the associated system of Hamilton-Jacobi-Bellman (HJB) equations and provide a characterization of the optimal policy. The policy improvement and the convergence of the policy iterations are rigorously established by analyzing the system of equations. We also show the convergence of the value function in the exploratory formulation towards the value function in the classical formulation as the temperature parameter vanishes. Finally, a reinforcement learning algorithm is devised and implemented by invoking the policy evaluation based on the martingale characterization. Our numerical examples with the aid of neural networks illustrate the effectiveness of the proposed RL algorithm.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Kazuma Sawaya
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We develop a flexible feature selection framework based on deep neural networks that approximately controls the false discovery rate (FDR), a measure of Type-I error. The method applies to architectures whose first layer is fully connected. From the second layer onward, it accommodates multilayer perceptrons (MLPs) of arbitrary width and depth, convolutional and recurrent networks, attention mechanisms, residual connections, and dropout. The procedure also accommodates stochastic gradient descent with data-independent initializations and learning rates. To the best of our knowledge, this is the first work to provide a theoretical guarantee of FDR control for feature selection within such a general deep learning setting.
Our analysis is built upon a multi-index data-generating model and an asymptotic regime in which the feature dimension $n$ diverges faster than the latent dimension $q^{*}$, while the sample size, the number of training iterations, the network depth, and hidden layer widths are left unrestricted. Under this setting, we show that each coordinate of the gradient-based feature-importance vector admits a marginal normal approximation, thereby supporting the validity of asymptotic FDR control. As a theoretical limitation, we assume $\mathbf{B}$-right orthogonal invariance of the design matrix, and we discuss broader generalizations. We also present numerical experiments that underscore the theoretical findings.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large language models (LLMs). The coordinator, comprising a compact language model (approximately $0.6$B parameters) and a lightweight head (approximately $10$K parameters), is optimized with an evolutionary strategy for efficient and adaptive delegation. Trinity processes queries over multiple turns, where at each turn the coordinator assigns one of three roles (Thinker, Worker, or Verifier) to a selected LLM, effectively offloading complex skill acquisition from the coordinator itself. Experiments show that Trinity consistently outperforms individual models and existing methods across coding, math, reasoning, and domain knowledge tasks, and generalizes robustly to out-of-distribution tasks. On standard benchmarks, Trinity achieves state-of-the-art results, including a score of 86.2% on LiveCodeBench. Theoretical and empirical analyses identify two main factors behind this performance: (1) the coordinator's hidden-state representations provide rich contextualization of inputs, and (2) under high dimensionality and strict budget constraints, the separable Covariance Matrix Adaptation Evolution Strategy offers advantages over reinforcement learning, imitation learning, and random search by exploiting potential block-epsilon-separability.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Baris Yilmaz, Bevan Deniz Cilgin, Erdem Akagรผndรผz, Salih Tileylioglu
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Effective earthquake risk reduction relies on accurate site-specific evaluations. This requires models that can represent the influence of local site conditions on ground motion characteristics. In this context, data driven approaches that learn site controlled signatures from recorded ground motions offer a promising direction. We address strong ground motion generation from time-domain accelerometer records and introduce the TimesNet-Gen, a time-domain conditional generator. The approach uses a station specific latent bottleneck. We evaluate generation by comparing HVSR curves and fundamental site-frequency $f_0$ distributions between real and generated records per station, and summarize station specificity with a score based on the $f_0$ distribution confusion matrices. TimesNet-Gen achieves strong station-wise alignment and compares favorably with a spectrogram-based conditional VAE baseline for site-specific strong motion synthesis. Our codes are available via https://github.com/brsylmz23/TimesNet-Gen.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Souhir Ben Amor, Florian Ziel
Machine Learning
ๆฅ็ๆ่ฆ
We present a novel recurrent neural network architecture designed explicitly for day-ahead electricity price forecasting, aimed at improving short-term decision-making and operational management in energy systems. Our combined forecasting model embeds linear structures, such as expert models and Kalman filters, into recurrent networks, enabling efficient computation and enhanced interpretability. The design leverages the strengths of both linear and non-linear model structures, allowing it to capture all relevant stylised price characteristics in power markets, including calendar and autoregressive effects, as well as influences from load, renewable energy, and related fuel and carbon markets. For empirical testing, we use hourly data from the largest European electricity market spanning 2018 to 2025 in a comprehensive forecasting study, comparing our model against state-of-the-art approaches, particularly high-dimensional linear and neural network models. The proposed model achieves approximately 12% higher accuracy than leading benchmarks. We evaluate the contributions of the interpretable model components and conclude on the impact of combining linear and non-linear structures.
๐
2025-12-04
๐ ArXiv Preprint (quant-ph)
๐ฅ Jannes Nys, Juan Carrasquilla
ๆฅ็ๆ่ฆ
We introduce fermionic neural Gibbs states (fNGS), a variational framework for modeling finite-temperature properties of strongly interacting fermions. fNGS starts from a reference mean-field thermofield-double state and uses neural-network transformations together with imaginary-time evolution to systematically build strong correlations. Applied to the doped Fermi-Hubbard model, a minimal lattice model capturing essential features of strong electronic correlations, fNGS accurately reproduces thermal energies over a broad range of temperatures, interaction strengths, even at large dopings, for system sizes beyond the reach of exact methods. These results demonstrate a scalable route to studying finite-temperature properties of strongly correlated fermionic systems beyond one dimension with neural-network representations of quantum states.
๐
2025-12-04
๐ ArXiv Preprint (cs.MA)
๐ฅ Pouria Yazdani, Arash Rezaali, Monireh Abdoos
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Multi-agent reinforcement learning (MARL) has emerged as a promising paradigm for adaptive traffic signal control (ATSC) of multiple intersections. Existing approaches typically follow either a fully centralized or a fully decentralized design. Fully centralized approaches suffer from the curse of dimensionality, and reliance on a single learning server, whereas purely decentralized approaches operate under severe partial observability and lack explicit coordination resulting in suboptimal performance. These limitations motivate region-based MARL, where the network is partitioned into smaller, tightly coupled intersections that form regions, and training is organized around these regions. This paper introduces a Semi-Centralized Training, Decentralized Execution (SEMI-CTDE) architecture for multi intersection ATSC. Within each region, SEMI-CTDE performs centralized training with regional parameter sharing and employs composite state and reward formulations that jointly encode local and regional information. The architecture is highly transferable across different policy backbones and state-reward instantiations. Building on this architecture, we implement two models with distinct design objectives. A multi-perspective experimental analysis of the two implemented SEMI-CTDE-based models covering ablations of the architecture's core elements including rule based and fully decentralized baselines shows that they achieve consistently superior performance and remain effective across a wide range of traffic densities and distributions.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wenzhang Du
Machine Learning
ๆฅ็ๆ่ฆ
Earth observation (EO) models are frequently trained under implicit sampling policies that optimize global accuracy but provide no explicit guarantees on who (which regions, classes, or mission-critical strata) is being served throughout training. This paper introduces a contract-governed training paradigm for EO in which training samples are grouped into service contracts -- semantically meaningful units such as (dataset, region, rare-crop indicator) -- and each contract is assigned a target service share. We instantiate this paradigm as an Observed Service Agreement Graph (OSAG), a lightweight governance layer that (i) monitors contract-level exposure (coverage) during optimization, (ii) drives empirical coverage toward target shares via contract-normalized sampling weights, and (iii) exposes explicit accuracy-governance trade-offs through two knobs: a sampling mixture coefficient alpha and a contract-regularization weight lambda_C. We provide a compact theory in a toy setting: OSAG sampling concentrates empirical coverage to targets; coverage deviations upper-bound service-risk deviations; and contract design (coarse vs. fine) modulates governance cost. Experiments on AVIRIS hyperspectral scenes (Indian Pines plus Salinas) and multispectral Sentinel-2 EuroSAT demonstrate that OSAG can substantially reduce priority coverage error while maintaining global accuracy and improving high-priority accuracy. A EuroSAT coarse-vs-fine contract ablation further evidences how semantically refined contracts can reduce the accuracy cost per unit of governance improvement.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporal inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g. object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improves VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anita Graser, Axel Weiรenfeld, Clemens Heistracher, Melitta Dragaschnig, Peter Widhalm
Machine Learning
ๆฅ็ๆ่ฆ
This paper introduces M3fed, a novel solution for federated learning of movement anomaly detection models. This innovation has the potential to improve data privacy and reduce communication costs in machine learning for movement anomaly detection. We present the novel federated learning (FL) strategies employed to train M3fed, perform an example experiment with maritime AIS data, and evaluate the results with respect to communication costs and FL model quality by comparing classic centralized M3 and the new federated M3fed.
๐
2025-12-04
๐ Accepted to IEEE TNNLS
๐ฅ Bowen Zheng, Ran Cheng
Multimodal
ๆฅ็ๆ่ฆ
In the history of knowledge distillation, the focus has once shifted over time from logit-based to feature-based approaches. However, this transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. As a response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. Then we pay particular attention to the teacher model's predictive distribution and its impact on the gradients of GDKD loss, uncovering two critical insights often overlooked: (1) the partitioning by the top logit considerably improves the interrelationship of non-top logits, and (2) amplifying the focus on the distillation loss of non-top logits enhances the knowledge extraction among them. Utilizing these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of teacher models' predictive distribution. Our comprehensive experiments conducted on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods. The code is available at https://github.com/ZaberKo/GDKD.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoqun Cao, Yixuan Zhang, Feng Zhou
ๆฅ็ๆ่ฆ
Score matching estimators have garnered significant attention in recent years because they eliminate the need to compute normalizing constants, thereby mitigating the computational challenges associated with maximum likelihood estimation (MLE).While several studies have proposed score matching estimators for point processes, this work highlights the limitations of these existing methods, which stem primarily from the lack of a mathematically rigorous analysis of how score matching behaves on finite point processes -- special random configurations on bounded spaces where many of the usual assumptions and properties of score matching no longer hold. To this end, we develop a formal framework for score matching on finite point processes via Janossy measures and, within this framework, introduce an (autoregressive) weighted score-matching estimator, whose statistical properties we analyze in classical parametric settings. For general nonparametric (e.g., deep) point process models, we show that score matching alone does not uniquely identify the ground-truth distribution due to subtle normalization issues, and we propose a simple survival-classification augmentation that yields a complete, integration-free training objective for any intensity-based point process model for spatio-temporal case. Experiments on synthetic and real-world temporal and spatio-temporal datasets, demonstrate that our method accurately recovers intensities and achieves performance comparable to MLE with better efficiency.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Wu, Chuhao Zhou, Guangyu Zhao, Haozhi Cao, Yewen Pu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Guanchen Du, Jianlong Xu, Wei Wei
Machine Learning
ๆฅ็ๆ่ฆ
Accurate Quality of Service (QoS) prediction is fundamental to service computing, providing essential data-driven guidance for service selection and ensuring superior user experiences. However, prevalent approaches, particularly Graph Neural Networks (GNNs), heavily rely on constructing explicit user--service interaction graphs. This dependency introduces severe scalability bottlenecks and limits performance when explicit connections are sparse or corrupted by noise. To address these challenges, this paper introduces \emph{QoSDiff}, a novel embedding learning framework that bypasses the prerequisite of explicit graph construction. Specifically, it leverages a denoising diffusion probabilistic model to recover intrinsic latent structures from noisy initializations. To further capture high-order interactions, we propose an adversarial interaction module that integrates a bidirectional hybrid attention mechanism. This adversarial paradigm dynamically distinguishes informative patterns from noise, enabling a dual-perspective modeling of intricate user--service associations. Extensive experiments on two large-scale real-world datasets demonstrate that QoSDiff significantly outperforms state-of-the-art baselines. Notably, the results highlight the framework's superior cross-dataset generalization capability and exceptional robustness against data sparsity and observational noise.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kenan Begovic, Abdulaziz Al-Ali, Qutaibah Malluhi
Machine Learning
ๆฅ็ๆ่ฆ
This paper proposes using the Linux kernel ftrace framework, particularly the function graph tracer, to generate informative system level data for machine learning (ML) applications. Experiments on a real world encryption detection task demonstrate the efficacy of the proposed features across several learning algorithms. The learner faces the problem of detecting encryption activities across a large dataset of files, using function call traces and graph based features. Empirical results highlight an outstanding accuracy of 99.28 on the task at hand, underscoring the efficacy of features derived from the function graph tracer. The results were further validated in an additional experiment targeting a multilabel classification problem, in which running programs were identified from trace data. This work provides comprehensive methodologies for preprocessing raw trace data and extracting graph based features, offering significant advancements in applying ML to system behavior analysis, program identification, and anomaly detection. By bridging the gap between system tracing and ML, this paper paves the way for innovative solutions in performance monitoring and security analytics.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aditi Naiknaware, Sanchit Singh, Hajar Homayouni, Salimeh Sekeh
ๆฅ็ๆ่ฆ
Open-world learning (OWL) requires models that can adapt to evolving environments while reliably detecting out-of-distribution (OOD) inputs. Existing approaches, such as SCONE, achieve robustness to covariate and semantic shifts but assume static environments, leading to degraded performance in dynamic domains. In this paper, we propose Temp-SCONE, a temporally consistent extension of SCONE designed to handle temporal shifts in dynamic environments. Temp-SCONE introduces a confidence-driven regularization loss based on Average Thresholded Confidence (ATC), penalizing instability in predictions across time steps while preserving SCONE's energy-margin separation. Experiments on dynamic datasets demonstrate that Temp-SCONE significantly improves robustness under temporal drift, yielding higher corrupted-data accuracy and more reliable OOD detection compared to SCONE. On distinct datasets without temporal continuity, Temp-SCONE maintains comparable performance, highlighting the importance and limitations of temporal regularization. Our theoretical insights on temporal stability and generalization error further establish Temp-SCONE as a step toward reliable OWL in evolving dynamic environments.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Miguel Sรกnchez-Domรญnguez, Lucas Lacasa, Javier de Vicente, Gonzalo Rubio, Eusebio Valero
Machine Learning
ๆฅ็ๆ่ฆ
Surrogate models (including deep neural networks and other machine learning algorithms in supervised learning) are capable of approximating arbitrarily complex, high-dimensional input-output problems in science and engineering, but require a thorough data-agnostic uncertainty quantification analysis before these can be deployed for any safety-critical application. The standard approach for data-agnostic uncertainty quantification is to use conformal prediction (CP), a well-established framework to build uncertainty models with proven statistical guarantees that do not assume any shape for the error distribution of the surrogate model. However, since the classic statistical guarantee offered by CP is given in terms of bounds for the marginal coverage, for small calibration set sizes (which are frequent in realistic surrogate modelling that aims to quantify error at different regions), the potentially strong dispersion of the coverage distribution around its average negatively impacts the reliability of the uncertainty model, often obtaining coverages below the expected value, resulting in a less applicable framework. After providing a gentle presentation of uncertainty quantification for surrogate models for machine learning practitioners, in this paper we bridge the gap by proposing a new statistical guarantee that offers probabilistic information for the coverage of a single conformal predictor. We show that the proposed framework converges to the standard solution offered by CP for large calibration set sizes and, unlike the classic guarantee, still offers reliable information about the coverage of a conformal predictor for small data sizes. We illustrate and validate the methodology in a suite of examples, and implement an open access software solution that can be used alongside common conformal prediction libraries to obtain uncertainty models that fulfil the new guarantee.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Siddharth Betala, Samuel P. Gleason, Ali Ramlaoui, Andy Xu, Georgia Channing et al.
Machine Learning
ๆฅ็ๆ่ฆ
Generative machine learning (ML) models hold great promise for accelerating materials discovery through the inverse design of inorganic crystals, enabling an unprecedented exploration of chemical space. Yet, the lack of standardized evaluation frameworks makes it challenging to evaluate, compare, and further develop these ML models meaningfully. In this work, we introduce LeMat-GenBench, a unified benchmark for generative models of crystalline materials, supported by a set of evaluation metrics designed to better inform model development and downstream applications. We release both an open-source evaluation suite and a public leaderboard on Hugging Face, and benchmark 12 recent generative models. Results reveal that an increase in stability leads to a decrease in novelty and diversity on average, with no model excelling across all dimensions. Altogether, LeMat-GenBench establishes a reproducible and extensible foundation for fair model comparison and aims to guide the development of more reliable, discovery-oriented generative models for crystalline materials.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose \textbf{Soft Q-based Diffusion Finetuning (SQDF)}, a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yue Yu, Qiwei Di, Quanquan Gu, Dongruo Zhou
Natural Language Processing
ๆฅ็ๆ่ฆ
Test-time compute (TTC) has become an increasingly prominent paradigm for enhancing large language models (LLMs). Despite the empirical success of methods such as best-of-$n$ (BoN) sampling and sequential revision, their fundamental limits remain unclear. We address this gap by analyzing a mixture-of-reference policy model and proving that standard BoN is inherently suboptimal. To move closer to the optimal frontier, we study reward-filtered sequential inference, a simple procedure that selectively incorporates only high-reward generations into the context. This mechanism concentrates computation on superior policy candidates and suppresses inferior ones. On the theoretical side, we show that reward-filtered sequential inference yields strictly stronger guarantees than standard TTC paradigms. On the empirical side, we evaluate such an inference strategy across diverse benchmarks and observe consistent improvements over widely used approaches, demonstrating the practical effectiveness of our framework.
๐
2025-12-04
๐ IJCAI
๐ฅ Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan
ๆฅ็ๆ่ฆ
Explainable artificial intelligence (XAI) is an important area in the AI community, and interpretability is crucial for building robust and trustworthy AI models. While previous work has explored model-level and instance-level explainable graph learning, there has been limited investigation into explainable graph representation learning. In this paper, we focus on representation-level explainable graph learning and ask a fundamental question: What specific information about a graph is captured in graph representations? Our approach is inspired by graph kernels, which evaluate graph similarities by counting substructures within specific graph patterns. Although the pattern counting vector can serve as an explainable representation, it has limitations such as ignoring node features and being high-dimensional. To address these limitations, we introduce a framework (PXGL-GNN) for learning and explaining graph representations through graph pattern analysis. We start by sampling graph substructures of various patterns. Then, we learn the representations of these patterns and combine them using a weighted sum, where the weights indicate the importance of each graph pattern's contribution. We also provide theoretical analyses of our methods, including robustness and generalization. In our experiments, we show how to learn and explain graph representations for real-world data using pattern analysis. Additionally, we compare our method against multiple baselines in both supervised and unsupervised learning tasks to demonstrate its effectiveness.
๐
2025-12-04
๐ AAAI 2026
๐ฅ Tianle Hu, Weijun Lv, Na Han, Xiaozhao Fang, Jie Wen et al.
ๆฅ็ๆ่ฆ
Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zehao Fan, Zhenyu Liu, Yunzhen Liu, Yayue Hou, Hadjer Benmeziane et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed the capacity of GPU memory. In this case, weights must be offloaded to external memory, and fetching them incurs costly and repeated transfers. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier to execute cold experts in place, converting expensive parameter movement into cheaper activation movement. Unlike prior GPU-NDP systems that are largely context-agnostic and reactive, we develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement, dynamically pins hot experts in GPU-side HBM, and maps the remainder to CXL-NDP. To meet NDP's limited compute throughput, we introduce context-aware mixed-precision quantization that allocates per-expert bitwidths (1-4 bit) based on prefill stage. The resulting MoE inference system overlaps GPU and NDP execution while minimizing cross-device movement. The evaluation on the GPU-NDP system shows that our approach achieves up to an 8.7-fold decoding throughput improvement over the state-of-the-art method, while incurring only a 0.13% average accuracy drop.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See www.graphbench.io for further details.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Philip Stephens, Emmanuel Salawu
Robotics
ๆฅ็ๆ่ฆ
We introduce a unified mathematical and probabilistic framework for understanding and comparing diverse AI agent strategies. We bridge the gap between high-level agent design concepts, such as ReAct, multi-agent systems, and control flows, and a rigorous mathematical formulation. Our approach frames agentic processes as a chain of probabilities, enabling a detailed analysis of how different strategies manipulate these probabilities to achieve desired outcomes. Our framework provides a common language for discussing the trade-offs inherent in various agent architectures. One of our many key contributions is the introduction of the "Degrees of Freedom" concept, which intuitively differentiates the optimizable levers available for each approach, thereby guiding the selection of appropriate strategies for specific tasks. This work aims to enhance the clarity and precision in designing and evaluating AI agents, offering insights into maximizing the probability of successful actions within complex agentic systems.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tanmay Dogra, Eric Ngo, Mohammad Alam, Jean-Paul Talavera, Asim Dahal
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We challenge the common belief that deep learning always trumps older techniques, using the example of grading Saint-Gaudens Double Eagle gold coins automatically. In our work, we put a feature-based Artificial Neural Network built around 192 custom features pulled from Sobel edge detection and HSV color analysis up against a hybrid Convolutional Neural Network that blends in EfficientNetV2, plus a straightforward Support Vector Machine as the control. Testing 1,785 coins graded by experts, the ANN nailed 86% exact matches and hit 98% when allowing a 3-grade leeway. On the flip side, CNN and SVM mostly just guessed the most common grade, scraping by with 31% and 30% exact hits. Sure, the CNN looked good on broader tolerance metrics, but that is because of some averaging trick in regression that hides how it totally flops at picking out specific grades. All told, when you are stuck with under 2,000 examples and lopsided classes, baking in real coin-expert knowledge through feature design beats out those inscrutable, all-in-one deep learning setups. This rings true for other niche quality checks where data's thin and know-how matters more than raw compute.
๐
2025-12-04
๐ 48 pages, 16 figures, submitted to Journal of Adva
๐ฅ Xin Kai Lee, Ali Ramadhan, Andre Souza, Gregory LeClaire Wagner, Simone Silvestri et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
NORi is a machine-learned (ML) parameterization of ocean boundary layer turbulence that is physics-based and augmented with neural networks. NORi stands for neural ordinary differential equations (NODEs) Richardson number (Ri) closure. The physical parameterization is controlled by a Richardson number-dependent diffusivity and viscosity. The NODEs are trained to capture the entrainment through the base of the boundary layer, which cannot be represented with a local diffusive closure. The parameterization is trained using large-eddy simulations in an "a posteriori" fashion, where parameters are calibrated with a loss function that explicitly depends on the actual time-integrated variables of interest rather than the instantaneous subgrid fluxes, which are inherently noisy. NORi is designed for the realistic nonlinear equation of state of seawater and demonstrates excellent prediction and generalization capabilities in capturing entrainment dynamics under different convective strengths, oceanic background stratifications, rotation strengths, and surface wind forcings. NORi is numerically stable for at least 100 years of integration time in large-scale simulations, despite only being trained on 2-day horizons, and can be run with time steps as long as one hour. The highly expressive neural networks, combined with a physically-rigorous base closure, prove to be a robust paradigm for designing parameterizations for climate models where data requirements are drastically reduced, inference performance can be directly targeted and optimized, and numerical stability is implicitly encouraged during training.
๐
2025-12-04
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ Ali Rabeh, Suresh Murugaiyan, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
ๆฅ็ๆ่ฆ
Fast, geometry-generalizing surrogates for unsteady flow remain challenging. We present a time-dependent, geometry-aware Deep Operator Network that predicts velocity fields for moderate-Re flows around parametric and non-parametric shapes. The model encodes geometry via a signed distance field (SDF) trunk and flow history via a CNN branch, trained on 841 high-fidelity simulations. On held-out shapes, it attains $\sim 5\%$ relative L2 single-step error and up to 1000X speedups over CFD. We provide physics-centric rollout diagnostics, including phase error at probes and divergence norms, to quantify long-horizon fidelity. These reveal accurate near-term transients but error accumulation in fine-scale wakes, most pronounced for sharp-cornered geometries. We analyze failure modes and outline practical mitigations. Code, splits, and scripts are openly released at: https://github.com/baskargroup/TimeDependent-DeepONet to support reproducibility and benchmarking.
๐
2025-12-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Subrata Karmaker
Machine Learning
ๆฅ็ๆ่ฆ
Sarcasm is common in online discussions, yet difficult for machines to identify because the intended meaning often contradicts the literal wording. In this work, I study sarcasm detection using only classical machine learning methods and explicit feature engineering, without relying on neural networks or context from parent comments. Using a 100,000-comment subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF features with simple stylistic indicators. Four models are evaluated: logistic regression, a linear SVM, multinomial Naive Bayes, and a random forest. Naive Bayes and logistic regression perform the strongest, achieving F1-scores around 0.57 for sarcastic comments. Although the lack of conversational context limits performance, the results offer a clear and reproducible baseline for sarcasm detection using lightweight and interpretable methods.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Jinran Wu, You-Gan Wang, Geoffrey J. McLachlan
ๆฅ็ๆ่ฆ
Semi-supervised learning (SSL) constructs classifiers using both labelled and unlabelled data. It leverages information from labelled samples, whose acquisition is often costly or labour-intensive, together with unlabelled data to enhance prediction performance. This defines an incomplete-data problem, which statistically can be formulated within the likelihood framework for finite mixture models that can be fitted using the expectation-maximisation (EM) algorithm. Ideally, one would prefer a completely labelled sample, as one would anticipate that a labelled observation provides more information than an unlabelled one. However, when the mechanism governing label absence depends on the observed features or the class labels or both, the missingness indicators themselves contain useful information. In certain situations, the information gained from modelling the missing-label mechanism can even outweigh the loss due to missing labels, yielding a classifier with a smaller expected error than one based on a completely labelled sample analysed. This improvement arises particularly when class overlap is moderate, labelled data are sparse, and the missingness is informative. Modelling such informative missingness thus offers a coherent statistical framework that unifies likelihood-based inference with the behaviour of empirical SSL methods.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stefan Nielsen, Edoardo Cetin, Peter Schwendeman, Qi Sun, Jinglue Xu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation. More broadly, ours is among the early work demonstrating language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nan Zhou, Weijie Hong, Huandong Wang, Jianfeng Zheng, Qiuhua Wang et al.
Robotics
ๆฅ็ๆ่ฆ
Fine-grained air pollution forecasting is crucial for urban management and the development of healthy buildings. Deploying portable sensors on mobile platforms such as cars and buses offers a low-cost, easy-to-maintain, and wide-coverage data collection solution. However, due to the random and uncontrollable movement patterns of these non-dedicated mobile platforms, the resulting sensor data are often incomplete and temporally inconsistent. By exploring potential training patterns in the reverse process of diffusion models, we propose Spatio-Temporal Physics-Informed Diffusion Models (STeP-Diff). STeP-Diff leverages DeepONet to model the spatial sequence of measurements along with a PDE-informed diffusion model to forecast the spatio-temporal field from incomplete and time-varying data. Through a PDE-constrained regularization framework, the denoising process asymptotically converges to the convection-diffusion dynamics, ensuring that predictions are both grounded in real-world measurements and aligned with the fundamental physics governing pollution dispersion. To assess the performance of the system, we deployed 59 self-designed portable sensing devices in two cities, operating for 14 days to collect air pollution data. Compared to the second-best performing algorithm, our model achieved improvements of up to 89.12% in MAE, 82.30% in RMSE, and 25.00% in MAPE, with extensive evaluations demonstrating that STeP-Diff effectively captures the spatio-temporal dependencies in air pollution fields.
๐
2025-12-04
๐ ArXiv Preprint (math.PR)
๐ฅ Frederic Koehler, Beining Wu
Robotics
ๆฅ็ๆ่ฆ
A classical result of Carleman, based on the theory of quasianalytic functions, shows that polynomials are dense in $L^2(ฮผ)$ for any $ฮผ$ such that the moments $\int x^k dฮผ$ do not grow too rapidly as $k \to \infty$. In this work, we develop a fairly tight quantitative analogue of the underlying Denjoy-Carleman theorem via complex analysis, and show that this allows for nonasymptotic control of the rate of approximation by polynomials for any smooth function with polynomial growth at infinity. In many cases, this allows us to establish $L^2$ approximation-theoretic results for functions over general classes of distributions (e.g., multivariate sub-Gaussian or sub-exponential distributions) which were previously known only in special cases. As one application, we show that the Paley--Wiener class of functions bandlimited to $[-ฮฉ,ฮฉ]$ admits superexponential rates of approximation over all strictly sub-exponential distributions, which leads to a new characterization of the class. As another application, we solve an open problem recently posed by Chandrasekaran, Klivans, Kontonis, Meka and Stavropoulos on the smoothed analysis of learning, and also obtain quantitative improvements to their main results and applications.
๐
2025-12-04
๐ Accepted and Presented at 1st IEEE Uttar Pradesh S
๐ฅ Praveen Anugula, Avdhesh Kumar Bhardwaj, Navin Chhibber, Rohit Tewari, Sunil Khemka et al.
Machine Learning
ๆฅ็ๆ่ฆ
Contemporary DevSecOps pipelines have to deal with the evolution of security in an ever-continuously integrated and deployed environment. Existing methods,such as rule-based intrusion detection and static vulnerability scanning, are inadequate and unreceptive to changes in the system, causing longer response times and organization needs exposure to emerging attack vectors. In light of the previous constraints, we introduce AutoGuard to the DevSecOps ecosystem, a reinforcement learning (RL)-powered self-healing security framework built to pre-emptively protect DevSecOps environments. AutoGuard is a self-securing security environment that continuously observes pipeline activities for potential anomalies while preemptively remediating the environment. The model observes and reacts based on a policy that is continually learned dynamically over time. The RL agent improves each action over time through reward-based learning aimed at improving the agent's ability to prevent, detect and respond to a security incident in real-time. Testing using simulated ContinuousIntegration / Continuous Deployment (CI/CD) environments showed AutoGuard to successfully improve threat detection accuracy by 22%, reduce mean time torecovery (MTTR) for incidents by 38% and increase overall resilience to incidents as compared to traditional methods.
Keywords- DevSecOps, Reinforcement Learning, Self- Healing Security, Continuous Integration, Automated Threat Mitigation
๐
2025-12-04
๐ IEEE/CVF Winter Conference on Applications of Comp
๐ฅ Kai-Po Chang, Wei-Yuan Cheng, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang
Multimodal
ๆฅ็ๆ่ฆ
Recent advancement in multimodal LLMs (MLLMs) has demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework for enabling object and action faithfulness by exempting the spurious correlations and enforcing the emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions to the contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ April S. Liang, Fatemeh Amrollahi, Yixing Jiang, Conor K. Corbin, Grace Y. E. Kim et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Repetitive laboratory testing unlikely to yield clinically useful information is a common practice that burdens patients and increases healthcare costs. Education and feedback interventions have limited success, while general test ordering restrictions and electronic alerts impede appropriate clinical care. We introduce and evaluate SmartAlert, a machine learning (ML)-driven clinical decision support (CDS) system integrated into the electronic health record that predicts stable laboratory results to reduce unnecessary repeat testing. This case study describes the implementation process, challenges, and lessons learned from deploying SmartAlert targeting complete blood count (CBC) utilization in a randomized controlled pilot across 9270 admissions in eight acute care units across two hospitals between August 15, 2024, and March 15, 2025. Results show significant decrease in number of CBC results within 52 hours of SmartAlert display (1.54 vs 1.82, p <0.01) without adverse effect on secondary safety outcomes, representing a 15% relative reduction in repetitive testing. Implementation lessons learned include interpretation of probabilistic model predictions in clinical contexts, stakeholder engagement to define acceptable model behavior, governance processes for deploying a complex model in a clinical environment, user interface design considerations, alignment with clinical operational priorities, and the value of qualitative feedback from end users. In conclusion, a machine learning-driven CDS system backed by a deliberate implementation and governance process can provide precision guidance on inpatient laboratory testing to safely reduce unnecessary repetitive testing.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Manh Nguyen, Sunil Gupta, Hung Le
Natural Language Processing
ๆฅ็ๆ่ฆ
Detecting when large language models (LLMs) are uncertain is critical for building reliable systems, yet existing methods are overly complicated, relying on brittle semantic clustering or internal states. We introduce \textbf{Radial Dispersion Score (RDS)}, a simple, parameter-free, fully model-agnostic uncertainty metric that measures the radial dispersion of sampled generations in embedding space. A lightweight probability-weighted variant further incorporates the model's own token probabilities when available, outperforming different nine strong baselines. Moroever, RDS naturally extends to per-sample scoring, enabling applications such as best-of-$N$ selection and confidence-based filtering. Across four challenging free-form QA datasets and multiple LLMs, our metrics achieve state-of-the-art hallucination detection and answer selection performance, while remaining robust and scalable with respect to sample size and embedding choice.
๐
2025-12-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh et al.
Machine Learning
ๆฅ็ๆ่ฆ
Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting planning horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale the principle to realistic tasks, identifying key design choices, such as layer normalization in the world model and adaptive long-horizon planning, that mitigate compounding error and value overestimation. These yield our practical algorithm, Neubay, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, Neubay generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with planning horizons of several hundred steps, challenging common belief. Finally, we characterize when Neubay is preferable to conservatism, laying the foundation for a new direction in offline and model-based RL.
๐
2025-12-04
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mauro Dalle Lucca Tosi, Jordi Cabot
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly used to query knowledge graphs (KGs) due to their strong semantic understanding and extrapolation capabilities compared to traditional approaches. However, these methods cannot be applied when the KG contains sensitive data and the user lacks the resources to deploy a local generative LLM. To address this issue, we propose a privacy-aware query generation approach for KGs. Our method identifies sensitive information in the graph based on its structure and omits such values before requesting the LLM to translate natural language questions into Cypher queries. Experimental results show that our approach preserves the quality of the generated queries while preventing sensitive data from being transmitted to third-party services.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Vignesh Kumar Kembu, Pierandrea Morandini, Marta Bianca Maria Ranzini, Antonino Nocera
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have become a key topic in AI and NLP, transforming sectors like healthcare, finance, education, and marketing by improving customer service, automating tasks, providing insights, improving diagnostics, and personalizing learning experiences. Information extraction from clinical records is a crucial task in digital healthcare. Although traditional NLP techniques have been used for this in the past, they often fall short due to the complexity, variability of clinical language, and high inner semantics in the free clinical text. Recently, Large Language Models (LLMs) have become a powerful tool for better understanding and generating human-like text, making them highly effective in this area. In this paper, we explore the ability of open-source multilingual LLMs to understand EHRs (Electronic Health Records) in Italian and help extract information from them in real-time. Our detailed experimental campaign on comorbidity extraction from EHR reveals that some LLMs struggle in zero-shot, on-premises settings, and others show significant variation in performance, struggling to generalize across various diseases when compared to native pattern matching and manual annotations.
๐
2025-12-04
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Maddalena Amendola, Chiara Pugliese, Raffaele Perego, Chiara Renso
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have become foundational tools in artificial intelligence, supporting a wide range of applications beyond traditional natural language processing, including urban systems and tourist recommendations. However, their tendency to hallucinate and their limitations in spatial retrieval and reasoning are well known, pointing to the need for novel solutions. Retrieval-augmented generation (RAG) has recently emerged as a promising way to enhance LLMs with accurate, domain-specific, and timely information. Spatial RAG extends this approach to tasks involving geographic understanding. In this work, we introduce WalkRAG, a spatial RAG-based framework with a conversational interface for recommending walkable urban itineraries. Users can request routes that meet specific spatial constraints and preferences while interactively retrieving information about the path and points of interest (POIs) along the way. Preliminary results show the effectiveness of combining information retrieval, spatial reasoning, and LLMs to support urban discovery.
๐
2025-12-04
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Nolwenn Bernard, Krisztian Balog
Natural Language Processing
ๆฅ็ๆ่ฆ
Resources for simulation-based evaluation of conversational recommender systems (CRSs) are scarce. The UserSimCRS toolkit was introduced to address this gap. In this work, we present UserSimCRS v2, a significant upgrade aligning the toolkit with state-of-the-art research. Key extensions include an enhanced agenda-based user simulator, introduction of large language model-based simulators, integration for a wider range of CRSs and datasets, and new LLM-as-a-judge evaluation utilities. We demonstrate these extensions in a case study.
๐
2025-12-04
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Satyajit Movidi, Stephen Russell
Natural Language Processing
ๆฅ็ๆ่ฆ
AIVisor, an agentic retrieval-augmented LLM for student advising, was used to examine how personalization affects system performance across multiple evaluation dimensions. Using twelve authentic advising questions intentionally designed to stress lexical precision, we compared ten personalized and non-personalized system configurations and analyzed outcomes with a Linear Mixed-Effects Model across lexical (BLEU, ROUGE-L), semantic (METEOR, BERTScore), and grounding (RAGAS) metrics. Results showed a consistent trade-off: personalization reliably improved reasoning quality and grounding, yet introduced a significant negative interaction on semantic similarity, driven not by poorer answers but by the limits of current metrics, which penalize meaningful personalized deviations from generic reference texts. This reveals a structural flaw in prevailing LLM evaluation methods, which are ill-suited for assessing user-specific responses. The fully integrated personalized configuration produced the highest overall gains, suggesting that personalization can enhance system effectiveness when evaluated with appropriate multidimensional metrics. Overall, the study demonstrates that personalization produces metric-dependent shifts rather than uniform improvements and provides a methodological foundation for more transparent and robust personalization in agentic AI.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Feng Xu, Guangyao Zhai, Xin Kong, Tingzhong Fu, Daniel F. N. Gordon et al.
MultimodalMachine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Vision-Language-Action (VLA) models, powered by large language models and reinforcement learning-based fine-tuning, have shown remarkable progress in robotic manipulation. Existing methods often treat long-horizon actions as linguistic sequences and apply trajectory-level optimization methods such as Trajectory-wise Preference Optimization (TPO) or Proximal Policy Optimization (PPO), leading to coarse credit assignment and unstable training. However, unlike language, where a unified semantic meaning is preserved despite flexible sentence order, action trajectories progress through causally chained stages with different learning difficulties. This motivates progressive stage optimization. Thereby, we present Stage-Aware Reinforcement (STARE), a module that decomposes a long-horizon action trajectory into semantically meaningful stages and provides dense, interpretable, and stage-aligned reinforcement signals. Integrating STARE into TPO and PPO, we yield Stage-Aware TPO (STA-TPO) and Stage-Aware PPO (STA-PPO) for offline stage-wise preference and online intra-stage interaction, respectively. Further building on supervised fine-tuning as initialization, we propose the Imitation -> Preference -> Interaction (IPI), a serial fine-tuning pipeline for improving action accuracy in VLA models. Experiments on SimplerEnv and ManiSkill3 demonstrate substantial gains, achieving state-of-the-art success rates of 98.0 percent on SimplerEnv and 96.4 percent on ManiSkill3 tasks.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion ฯ-PD, a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. ฯ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, ฯ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, ฯ-PD improves CARLA-to-Waymo planner performance by 50\%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our \href{https://yuzeng-at-tri.github.io/ppd-page/}{project page}.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic-a physics-aware reinforcement learning policy conditioned on 3D keypoints, and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minghan Zhu, Zhiyi Wang, Qihang Sun, Maani Ghaffari, Michael Posa
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Haroon Hublikar
Robotics
ๆฅ็ๆ่ฆ
This thesis presents a unified modeling and simulation framework for analyzing sidewinding and tumbling locomotion of the COBRA snake robot across rigid, compliant, and granular terrains. A contact-implicit formulation is used to model distributed frictional interactions during sidewinding, and validated through MATLAB Simscape simulations and physical experiments on rigid ground and loose sand. To capture terrain deformation effects, Project Chrono's Soil Contact Model (SCM) is integrated with the articulated multibody dynamics, enabling prediction of slip, sinkage, and load redistribution that reduce stride efficiency on deformable substrates. For high-energy rolling locomotion on steep slopes, the Chrono DEM Engine is used to simulate particle-resolved granular interactions, revealing soil failure, intermittent lift-off, and energy dissipation mechanisms not captured by rigid models. Together, these methods span real-time control-oriented simulation and high-fidelity granular physics. Results demonstrate that rigid-ground models provide accurate short-horizon motion prediction, while continuum and particle-based terrain modeling becomes necessary for reliable mobility analysis in soft and highly dynamic environments. This work establishes a hierarchical simulation pipeline that advances robust, terrain-aware locomotion for robots operating in challenging unstructured settings.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Giuseppe Milazzo, Giorgio Grioli, Antonio Bicchi, Manuel G. Catalano
Robotics
ๆฅ็ๆ่ฆ
Current upper limb prostheses aim to enhance user independence in daily activities by incorporating basic motor functions. However, they fall short of replicating the natural movement and interaction capabilities of the human arm. In contrast, human limbs leverage intrinsic compliance and actively modulate joint stiffness, enabling adaptive responses to varying tasks, impact absorption, and efficient energy transfer during dynamic actions. Inspired by this adaptability, we developed a transhumeral prosthesis with Variable Stiffness Actuators (VSAs) to replicate the controllable compliance found in biological joints. The proposed prosthesis features a modular design, allowing customization for different residual limb shapes and accommodating a range of independent control signals derived from users' biological cues. Integrated elastic elements passively support more natural movements, facilitate safe interactions with the environment, and adapt to diverse task requirements. This paper presents a comprehensive overview of the platform and its functionalities, highlighting its potential applications in the field of prosthetics.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Giuseppe Milazzo, Manuel G. Catalano, Antonio Bicchi, Giorgio Grioli
Robotics
ๆฅ็ๆ่ฆ
Variable Stiffness Actuators prove invaluable for robotics applications in unstructured environments, fostering safe interactions and enhancing task adaptability. Nevertheless, their mechanical design inevitably results in larger and heavier structures compared to classical rigid actuators. This paper introduces a novel 3 Degrees of Freedom (DoFs) parallel wrist that achieves variable stiffness through redundant elastic actuation. Leveraging its parallel architecture, the device employs only four motors, rendering it compact and lightweight. This characteristic makes it particularly well-suited for applications in prosthetics or humanoid robotics. The manuscript delves into the theoretical model of the device and proposes a sophisticated control strategy for independent regulation of joint position and stiffness. Furthermore, it validates the proposed controller through simulation, utilizing a comprehensive analysis of the system dynamics. The reported results affirm the ability of the device to achieve high accuracy and disturbance rejection in rigid configurations while minimizing interaction forces with its compliant behavior.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Jonne Van Haastregt, Bastian Orthmann, Michael C. Welle, Yuchong Zhang, Danica Kragic
Robotics
ๆฅ็ๆ่ฆ
Despite the fact that visuomotor-based policies obtained via imitation learning demonstrate good performances in complex manipulation tasks, they usually struggle to achieve the same accuracy and speed as traditional control based methods. In this work, we introduce Hybrid-Diffusion models that combine open-loop routines with visuomotor diffusion policies. We develop Teleoperation Augmentation Primitives (TAPs) that allow the operator to perform predefined routines, such as locking specific axes, moving to perching waypoints, or triggering task-specific routines seamlessly during demonstrations. Our Hybrid-Diffusion method learns to trigger such TAPs during inference. We validate the method on challenging real-world tasks: Vial Aspiration, Open-Container Liquid Transfer, and container unscrewing. All experimental videos are available on the project's website: https://hybriddiffusion.github.io/
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Matteo Masoni, Vincenzo Palermo, Marco Gabiccini, Martino Gulisano, Giorgio Previati et al.
Robotics
ๆฅ็ๆ่ฆ
This work investigates how disturbance-aware, robustness-embedded reference trajectories translate into driving performance when executed by professional drivers in a dynamic simulator. Three planned reference trajectories are compared against a free-driving baseline (NOREF) to assess trade-offs between lap time (LT) and steering effort (SE): NOM, the nominal time-optimal trajectory; TLC, a track-limit-robust trajectory obtained by tightening margins to the track edges; and FLC, a friction-limit-robust trajectory obtained by tightening against axle and tire saturation. All trajectories share the same minimum lap-time objective with a small steering-smoothness regularizer and are evaluated by two professional drivers using a high-performance car on a virtual track. The trajectories derive from a disturbance-aware minimum-lap-time framework recently proposed by the authors, where worst-case disturbance growth is propagated over a finite horizon and used to tighten tire-friction and track-limit constraints, preserving performance while providing probabilistic safety margins. LT and SE are used as performance indicators, while RMS lateral deviation, speed error, and drift angle characterize driving style. Results show a Pareto-like LT-SE trade-off: NOM yields the shortest LT but highest SE; TLC minimizes SE at the cost of longer LT; FLC lies near the efficient frontier, substantially reducing SE relative to NOM with only a small LT increase. Removing trajectory guidance (NOREF) increases both LT and SE, confirming that reference trajectories improve pace and control efficiency. Overall, the findings highlight reference-based and disturbance-aware planning, especially FLC, as effective tools for training and for achieving fast yet stable trajectories.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Tim Engelbracht, Renรฉ Zurbrรผgg, Matteo Wohlrapp, Martin Bรผchner, Abhinav Valada et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated under four embodiments - (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper - where the tool embodiment provides synchronized end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers to evaluate how well methods transfer between human and robotic viewpoints, but also investigate underexplored modalities such as force sensing and prediction.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Huanqian Wang, Chi Bene Chen, Yang Yue, Danhua Tao, Tong Guo et al.
Robotics
ๆฅ็ๆ่ฆ
Imitation learning method has shown immense promise for robotic manipulation, yet its practical deployment is fundamentally constrained by the data scarcity. Despite prior work on collecting large-scale datasets, there still remains a significant gap to robust spatial generalization. We identify a key limitation: individual trajectories, regardless of their length, are typically collected from a \emph{single, static spatial configuration} of the environment. This includes fixed object and target spatial positions as well as unchanging camera viewpoints, which significantly restricts the diversity of spatial information available for learning. To address this critical bottleneck in data efficiency, we propose \textbf{MOtion-Based Variability Enhancement} (\emph{MOVE}), a simple yet effective data collection paradigm that enables the acquisition of richer spatial information from dynamic demonstrations. Our core contribution is an augmentation strategy that injects motion into any movable objects within the environment for each demonstration. This process implicitly generates a dense and diverse set of spatial configurations within a single trajectory. We conduct extensive experiments in both simulation and real-world environments to validate our approach. For example, in simulation tasks requiring strong spatial generalization, \emph{MOVE} achieves an average success rate of 39.1\%, a 76.1\% relative improvement over the static data collection paradigm (22.2\%), and yields up to 2--5$\times$ gains in data efficiency on certain tasks. Our code is available at https://github.com/lucywang720/MOVE.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ SIMA team, Adrian Bolton, Alexander Lerchner, Alexandra Cordell, Alexandre Moufarek et al.
Robotics
ๆฅ็ๆ่ฆ
We introduce SIMA 2, a generalist embodied agent that understands and acts in a wide variety of 3D virtual worlds. Built upon a Gemini foundation model, SIMA 2 represents a significant step toward active, goal-directed interaction within an embodied environment. Unlike prior work (e.g., SIMA 1) limited to simple language commands, SIMA 2 acts as an interactive partner, capable of reasoning about high-level goals, conversing with the user, and handling complex instructions given through language and images. Across a diverse portfolio of games, SIMA 2 substantially closes the gap with human performance and demonstrates robust generalization to previously unseen environments, all while retaining the base model's core reasoning capabilities. Furthermore, we demonstrate a capacity for open-ended self-improvement: by leveraging Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment. This work validates a path toward creating versatile and continuously learning agents for both virtual and, eventually, physical worlds.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Giorgos Polychronis, Foivos Pournaropoulos, Christos D. Antonopoulos, Spyros Lalis
Machine Learning
ๆฅ็ๆ่ฆ
Drones are becoming indispensable in many application domains. In data-driven missions, besides sensing, the drone must process the collected data at runtime to decide whether additional action must be taken on the spot, before moving to the next point of interest. If processing does not reveal an event or situation that requires such an action, the drone has waited in vain instead of moving to the next point. If, however, the drone starts moving to the next point and it turns out that a follow-up action is needed at the previous point, it must spend time to fly-back. To take this decision, we propose different machine-learning methods based on branch prediction and reinforcement learning. We evaluate these methods for a wide range of scenarios where the probability of event occurrence changes with time. Our results show that the proposed methods consistently outperform the regression-based method proposed in the literature and can significantly improve the worst-case mission time by up to 4.1x. Also, the achieved median mission time is very close, merely up to 2.7% higher, to that of a method with perfect knowledge of the current underlying event probability at each point of interest.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Mauro Martini, Marco Ambrosio, Judith Vilella-Cantos, Alessandro Navone, Marcello Chiaberge
Robotics
ๆฅ็ๆ่ฆ
In recent years, precision agriculture has been introducing groundbreaking innovations in the field, with a strong focus on automation. However, research studies in robotics and autonomous navigation often rely on controlled simulations or isolated field trials. The absence of a realistic common benchmark represents a significant limitation for the diffusion of robust autonomous systems under real complex agricultural conditions. Vineyards pose significant challenges due to their dynamic nature, and they are increasingly drawing attention from both academic and industrial stakeholders interested in automation. In this context, we introduce the TEMPO-VINE dataset, a large-scale multi-temporal dataset specifically designed for evaluating sensor fusion, simultaneous localization and mapping (SLAM), and place recognition techniques within operational vineyard environments. TEMPO-VINE is the first multi-modal public dataset that brings together data from heterogeneous LiDARs of different price levels, AHRS, RTK-GPS, and cameras in real trellis and pergola vineyards, with multiple rows exceeding 100 m in length. In this work, we address a critical gap in the landscape of agricultural datasets by providing researchers with a comprehensive data collection and ground truth trajectories in different seasons, vegetation growth stages, terrain and weather conditions. The sequence paths with multiple runs and revisits will foster the development of sensor fusion, localization, mapping and place recognition solutions for agricultural fields. The dataset, the processing tools and the benchmarking results will be available at the dedicated webpage upon acceptance.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Yuxing Wang, Zhiyu Chen, Tiantian Zhang, Qiyue Yin, Yongzhe Chang et al.
Robotics
ๆฅ็ๆ่ฆ
Brain-body co-evolution enables animals to develop complex behaviors in their environments. Inspired by this biological synergy, embodied co-design (ECD) has emerged as a transformative paradigm for creating intelligent agents-from virtual creatures to physical robots-by jointly optimizing their morphologies and controllers rather than treating control in isolation. This integrated approach facilitates richer environmental interactions and robust task performance. In this survey, we provide a systematic overview of recent advances in ECD. We first formalize the concept of ECD and position it within related fields. We then introduce a hierarchical taxonomy: a lower layer that breaks down agent design into three fundamental components-controlling brain, body morphology, and task environment-and an upper layer that integrates these components into four major ECD frameworks: bi-level, single-level, generative, and open-ended. This taxonomy allows us to synthesize insights from more than one hundred recent studies. We further review notable benchmarks, datasets, and applications in both simulated and real-world scenarios. Finally, we identify significant challenges and offer insights into promising future research directions. A project associated with this survey has been created at https://github.com/Yuxing-Wang-THU/SurveyBrainBody.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Jian Tang, Pu Pang, Haowen Sun, Chengzhong Ma, Xingyu Chen et al.
Robotics
ๆฅ็ๆ่ฆ
Cross-domain transfer in robotic manipulation remains a longstanding challenge due to the significant domain gap between simulated and real-world environments. Existing methods such as domain randomization, adaptation, and sim-real calibration often require extensive tuning or fail to generalize to unseen scenarios. To address this issue, we observe that if domain-invariant features are utilized during policy training in simulation, and the same features can be extracted and provided as the input to policy during real-world deployment, the domain gap can be effectively bridged, leading to significantly improved policy generalization. Accordingly, we propose Semantic 2D Gaussian Splatting (S2GS), a novel representation method that extracts object-centric, domain-invariant spatial features. S2GS constructs multi-view 2D semantic fields and projects them into a unified 3D space via feature-level Gaussian splatting. A semantic filtering mechanism removes irrelevant background content, ensuring clean and consistent inputs for policy learning. To evaluate the effectiveness of S2GS, we adopt Diffusion Policy as the downstream learning algorithm and conduct experiments in the ManiSkill simulation environment, followed by real-world deployment. Results demonstrate that S2GS significantly improves sim-to-real transferability, maintaining high and stable task performance in real-world scenarios.
๐
2025-12-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Wu, Chuhao Zhou, Guangyu Zhao, Haozhi Cao, Yewen Pu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79% abstention recall, while humans achieve 91.17%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.
๐
2025-12-04
๐ ArXiv Preprint (eess.SY)
๐ฅ Hannes Homburger, Katrin Baumgรคrtner, Moritz Diehl, Johannes Reuter
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Model Predictive Path Integral (MPPI) control is a sampling-based optimization method that has recently attracted attention, particularly in the robotics and reinforcement learning communities. MPPI has been widely applied as a GPU-accelerated random search method to deterministic direct single-shooting optimal control problems arising in model predictive control (MPC) formulations. MPPI offers several key advantages, including flexibility, robustness, ease of implementation, and inherent parallelizability. However, its performance can deteriorate in high-dimensional settings since the optimal control problem is solved via Monte Carlo sampling. To address this limitation, this paper proposes an enhanced MPPI method that incorporates a Jacobian reconstruction technique and the second-order Generalized Gauss-Newton method. This novel approach is called \textit{Gauss-Newton accelerated MPPI}. The numerical results show that the Gauss-Newton accelerated MPPI approach substantially improves MPPI scalability and computational efficiency while preserving the key benefits of the classical MPPI framework, making it a promising approach even for high-dimensional problems.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Andres Arias, Wei Zhang, Haoyu Qian, Jr-Shin Li, Chuangchuang Sun
Robotics
ๆฅ็ๆ่ฆ
Ensemble control aims to steer a population of dynamical systems using a shared control input. This paper introduces a constrained ensemble control framework for parameterized, heterogeneous robotic systems operating under state and environmental constraints, such as obstacle avoidance. We develop a moment kernel transform that maps the parameterized ensemble dynamics to the moment system in a kernel space, enabling the characterization of population-level behavior. The state-space constraints, such as polyhedral waypoints to be visited and obstacles to be avoided, are also transformed into the moment space, leading to a unified formulation for safe, large-scale ensemble control. Expressive signal temporal logic specifications are employed to encode complex visit-avoid tasks, which are achieved through a single shared controller synthesized from our constrained ensemble control formulation. Simulation and hardware experiments demonstrate the effectiveness of the proposed approach in safely and efficiently controlling robotic ensembles within constrained environments.
๐
2025-12-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Price Allman, Lian Thang, Dre Simmons, Salmon Riaz
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We present a comparative study of multi-agent reinforcement learning (MARL) algorithms for cooperative warehouse robotics. We evaluate QMIX and IPPO on the Robotic Warehouse (RWARE) environment and a custom Unity 3D simulation. Our experiments reveal that QMIX's value decomposition significantly outperforms independent learning approaches (achieving 3.25 mean return vs. 0.38 for advanced IPPO), but requires extensive hyperparameter tuning -- particularly extended epsilon annealing (5M+ steps) for sparse reward discovery. We demonstrate successful deployment in Unity ML-Agents, achieving consistent package delivery after 1M training steps. While MARL shows promise for small-scale deployments (2-4 robots), significant scaling challenges remain. Code and analyses: https://pallman14.github.io/MARL-QMIX-Warehouse-Robots/
๐
2025-12-04
๐ Accepted to ACM/IEEE International Conference on H
๐ฅ Debasmita Ghose, Oz Gitelson, Marynel Vazquez, Brian Scassellati
ๆฅ็ๆ่ฆ
To collaborate with humans, robots must infer goals that are often ambiguous, difficult to articulate, or not drawn from a fixed set. Prior approaches restrict inference to a predefined goal set, rely only on observed actions, or depend exclusively on explicit instructions, making them brittle in real-world interactions. We present BALI (Bidirectional Action-Language Inference) for goal prediction, a method that integrates natural language preferences with observed human actions in a receding-horizon planning tree. BALI combines language and action cues from the human, asks clarifying questions only when the expected information gain from the answer outweighs the cost of interruption, and selects supportive actions that align with inferred goals. We evaluate the approach in collaborative cooking tasks, where goals may be novel to the robot and unbounded. Compared to baselines, BALI yields more stable goal predictions and significantly fewer mistakes.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Chang Liu, Sibo Tian, Sara Behdad, Xiao Liang, Minghui Zheng
MultimodalRobotics
ๆฅ็ๆ่ฆ
Automating disassembly of critical components from end-of-life (EoL) desktops, such as high-value items like RAM modules and CPUs, as well as sensitive parts like hard disk drives, remains challenging due to the inherent variability and uncertainty of these products. Moreover, their disassembly requires sequential, precise, and dexterous operations, further increasing the complexity of automation. Current robotic disassembly processes are typically divided into several stages: perception, sequence planning, task planning, motion planning, and manipulation. Each stage requires explicit modeling, which limits generalization to unfamiliar scenarios. Recent development of vision-language-action (VLA) models has presented an end-to-end approach for general robotic manipulation tasks. Although VLAs have demonstrated promising performance on simple tasks, the feasibility of applying such models to complex disassembly remains largely unexplored. In this paper, we collected a customized dataset for robotic RAM and CPU disassembly and used it to fine-tune two well-established VLA approaches, OpenVLA and OpenVLA-OFT, as a case study. We divided the whole disassembly task into several small steps, and our preliminary experimental results indicate that the fine-tuned VLA models can faithfully complete multiple early steps but struggle with certain critical subtasks, leading to task failure. However, we observed that a simple hybrid strategy that combines VLA with a rule-based controller can successfully perform the entire disassembly operation. These findings highlight the current limitations of VLA models in handling the dexterity and precision required for robotic EoL product disassembly. By offering a detailed analysis of the observed results, this study provides insights that may inform future research to address current challenges and advance end-to-end robotic automated disassembly.
๐
2025-12-04
๐ Under review at the International Journal of Robot
๐ฅ Zhoufeng Wang, Hang Zhao, Juzhan Xu, Shishun Zhang, Zeyu Xiong et al.
ๆฅ็ๆ่ฆ
Physical feasibility in 3D bin packing is a key requirement in modern industrial logistics and robotic automation. With the growing adoption of industrial automation, online bin packing has gained increasing attention. However, inconsistencies in problem settings, test datasets, and evaluation metrics have hindered progress in the field, and there is a lack of a comprehensive benchmarking system. Direct testing on real hardware is costly, and building a realistic simulation environment is also challenging. To address these limitations, we introduce RoboBPP, a benchmarking system designed for robotic online bin packing. RoboBPP integrates a physics-based simulator to assess physical feasibility. In our simulation environment, we introduce a robotic arm and boxes at real-world scales to replicate real industrial packing workflows. By simulating conditions that arise in real industrial applications, we ensure that evaluated algorithms are practically deployable. In addition, prior studies often rely on synthetic datasets whose distributions differ from real-world industrial data. To address this issue, we collect three datasets from real industrial workflows, including assembly-line production, logistics packing, and furniture manufacturing. The benchmark comprises three carefully designed test settings and extends existing evaluation metrics with new metrics for structural stability and operational safety. We design a scoring system and derive a range of insights from the evaluation results. RoboBPP is fully open-source and is equipped with visualization tools and an online leaderboard, providing a reproducible and extensible foundation for future research and industrial applications (https://robot-bin-packing-benchmark.github.io).
๐
2025-12-04
๐ 34 pages, is submitted RAS Journal
๐ฅ Chaoran Wang, Jingyuan Sun, Yanhui Zhang, Changju Wu
Robotics
ๆฅ็ๆ่ฆ
This paper proposes an Interactive Inference Behavior Tree (IIBT) framework that integrates behavior trees (BTs) with active inference under the free energy principle for distributed multi-robot decision-making. The proposed IIBT node extends conventional BTs with probabilistic reasoning, enabling online joint planning and execution across multiple robots. It remains fully compatible with standard BT architectures, allowing seamless integration into existing multi-robot control systems. Within this framework, multi-robot cooperation is formulated as a free-energy minimization process, where each robot dynamically updates its preference matrix based on perceptual inputs and peer intentions, thereby achieving adaptive coordination in partially observable and dynamic environments. The proposed approach is validated through both simulation and real-world experiments, including a multi-robot maze navigation and a collaborative manipulation task, compared against traditional BTs(https://youtu.be/KX_oT3IDTf4). Experimental results demonstrate that the IIBT framework reduces BT node complexity by over 70%, while maintaining robust, interpretable, and adaptive cooperative behavior under environmental uncertainty.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Haoqi Han, Yi Yang, Yifei Yu, Yixuan Zhou, Xiaohan Zhu et al.
Robotics
ๆฅ็ๆ่ฆ
In robotic hand research, minimizing the number of actuators while maintaining human-hand-consistent dimensions and degrees of freedom constitutes a fundamental challenge. Drawing bio-inspiration from human hand kinematic configurations and muscle distribution strategies, this work proposes a novel 15-DoF dexterous robotic hand, with detailed analysis of its mechanical architecture, electrical system, and control system. The bionic hand employs a new tendon-driven mechanism, significantly reducing the number of motors required by traditional tendon-driven systems while enhancing motion performance and simplifying the mechanical structure. This design integrates five motors in the forearm to provide strong gripping force, while ten small motors are installed in the palm to support fine manipulation tasks. Additionally, a corresponding joint sensing and motor driving electrical system was developed to ensure efficient control and feedback. The entire system weighs only 1.4kg, combining lightweight and high-performance features. Through experiments, the bionic hand exhibited exceptional dexterity and robust grasping capabilities, demonstrating significant potential for robotic manipulation tasks.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Chengyang He, Ge Sun, Yue Bai, Junkai Lu, Jiadong Zhao et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
We present FoundAtion-model-guided decoupled LoCO-maNipulation visuomotor policies (FALCON), a framework for loco-manipulation that combines modular diffusion policies with a vision-language foundation model as the coordinator. Our approach explicitly decouples locomotion and manipulation into two specialized visuomotor policies, allowing each subsystem to rely on its own observations. This mitigates the performance degradation that arise when a single policy is forced to fuse heterogeneous, potentially mismatched observations from locomotion and manipulation. Our key innovation lies in restoring coordination between these two independent policies through a vision-language foundation model, which encodes global observations and language instructions into a shared latent embedding conditioning both diffusion policies. On top of this backbone, we introduce a phase-progress head that uses textual descriptions of task stages to infer discrete phase and continuous progress estimates without manual phase labels. To further structure the latent space, we incorporate a coordination-aware contrastive loss that explicitly encodes cross-subsystem compatibility between arm and base actions. We evaluate FALCON on two challenging loco-manipulation tasks requiring navigation, precise end-effector placement, and tight base-arm coordination. Results show that it surpasses centralized and decentralized baselines while exhibiting improved robustness and generalization to out-of-distribution scenarios.
๐
2025-12-04
๐ ArXiv Preprint (Robotics)
๐ฅ Hann Woei Ho, Ye Zhou
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Autonomous landing on sloped terrain poses significant challenges for small, lightweight spacecraft, such as rotorcraft and landers. These vehicles have limited processing capability and payload capacity, which makes advanced deep learning methods and heavy sensors impractical. Flying insects, such as bees, achieve remarkable landings with minimal neural and sensory resources, relying heavily on optical flow. By regulating flow divergence, a measure of vertical velocity divided by height, they perform smooth landings in which velocity and height decay exponentially together. However, adapting this bio-inspired strategy for spacecraft landings on sloped terrain presents two key challenges: global flow-divergence estimates obscure terrain inclination, and the nonlinear nature of divergence-based control can lead to instability when using conventional controllers. This paper proposes a nonlinear control strategy that leverages two distinct local flow divergence estimates to regulate both thrust and attitude during vertical landings. The control law is formulated based on Incremental Nonlinear Dynamic Inversion to handle the nonlinear flow divergence. The thrust control ensures a smooth vertical descent by keeping a constant average of the local flow divergence estimates, while the attitude control aligns the vehicle with the inclined surface at touchdown by exploiting their difference. The approach is evaluated in numerical simulations using a simplified 2D spacecraft model across varying slopes and divergence setpoints. Results show that regulating the average divergence yields stable landings with exponential decay of velocity and height, and using the divergence difference enables effective alignment with inclined terrain. Overall, the method offers a robust, low-resource landing strategy that enhances the feasibility of autonomous planetary missions with small spacecraft.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fangyu Lei, Jinxiang Meng, Yiming Huang, Junjie Zhao, Yitong Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analytical-ready tables and data analysis that convert those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io
๐
2025-12-03
๐ CVPR 2026
๐ฅ Rui Fonseca, Bruno Martins, Gil Rocha
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any humanly-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, i.e., an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after undergoing a process to reduce the modality gap. We specifically tested the combined use of retrieved examples of captions, and latent vector representations, to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different choices regarding the configuration of the retrieval-augmentation and modality gap reduction components.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward a more robust table understanding.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mansour Essgaer, Khamis Massud, Rabia Al Mamlook, Najah Ghmaid
ๆฅ็ๆ่ฆ
This study investigates logistic regression, linear support vector machine, multinomial Naive Bayes, and Bernoulli Naive Bayes for classifying Libyan dialect utterances gathered from Twitter. The dataset used is the QADI corpus, which consists of 540,000 sentences across 18 Arabic dialects. Preprocessing challenges include handling inconsistent orthographic variations and non-standard spellings typical of the Libyan dialect. The chi-square analysis revealed that certain features, such as email mentions and emotion indicators, were not significantly associated with dialect classification and were thus excluded from further analysis. Two main experiments were conducted: (1) evaluating the significance of meta-features extracted from the corpus using the chi-square test and (2) assessing classifier performance using different word and character n-gram representations. The classification experiments showed that Multinomial Naive Bayes (MNB) achieved the highest accuracy of 85.89% and an F1-score of 0.85741 when using a (1,2) word n-gram and (1,5) character n-gram representation. In contrast, Logistic Regression and Linear SVM exhibited slightly lower performance, with maximum accuracies of 84.41% and 84.73%, respectively. Additional evaluation metrics, including log loss, Cohen kappa, and Matthew correlation coefficient, further supported the effectiveness of MNB in this task. The results indicate that carefully selected n-gram representations and classification models play a crucial role in improving the accuracy of Libyan dialect identification. This study provides empirical benchmarks and insights for future research in Arabic dialect NLP applications.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a lightweight likelihood-preserving regularization LLDS for GRPO that activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated LLM.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Huy Nghiem, Swetasudha Panda, Devashish Khatwani, Huy V. Nguyen, Krishnaram Kenthapadi et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly used in healthcare, yet ensuring their safety and trustworthiness remains a barrier to deployment. Conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. We present an iterative post-deployment alignment framework that applies Kahneman-Tversky Optimization (KTO) and Direct Preference Optimization (DPO) to refine models against domain-specific safety signals. Using the CARES-18K benchmark for adversarial robustness, we evaluate four LLMs (Llama-3B/8B, Meditron-8B, Mistral-7B) across multiple cycles. Our results show up to 42% improvement in safety-related metrics for harmful query detection, alongside interesting trade-offs against erroneous refusals, thereby exposing architecture-dependent calibration biases. We also perform ablation studies to identify when self-evaluation is reliable and when external or finetuned judges are necessary to maximize performance gains. Our findings underscore the importance of adopting best practices that balance patient safety, user trust, and clinical utility in the design of conversational medical assistants.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vighnesh Subramaniam, Colin Conwell, Boris Katz, Andrei Barbu, Brian Cheung
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
A standard assumption in deep learning is that the inductive bias introduced by a neural network architecture must persist from training through inference. The architecture you train with is the architecture you deploy. This assumption constrains the community from selecting architectures that may have desirable efficiency or design properties due to difficulties with optimization. We challenge this assumption with Network of Theseus (NoT), a method for progressively converting a trained, or even untrained, guide network architecture part-by-part into an entirely different target network architecture while preserving the performance of the guide network. At each stage, components in the guide network architecture are incrementally replaced with target architecture modules and aligned via representational similarity metrics. This procedure largely preserves the functionality of the guide network even under substantial architectural changes-for example, converting a convolutional network into a multilayer perceptron, or GPT-2 into a recurrent neural network. By decoupling optimization from deployment, NoT expands the space of viable inference-time architectures, opening opportunities for better accuracy-efficiency tradeoffs and enabling more directed exploration of the architectural design space.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Andreas Koukounas, Georgios Mastrapas, Florian Hรถnicke, Sedigheh Eslami, Guillaume Roncari et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm .
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce $\textbf{Doublespeak}$, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hanting Chen, Chong Zhu, Kai Han, Yuchuan Tian, Yuchen Liang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose the Nexus, a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Nexus dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations \textit{prior} to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs $\mathcal{O}(1)$ additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Nexus outperforms standard Transformers on multiple benchmarks.
๐
2025-12-03
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Biagio Montaruli, Luca Compagna, Serena Elisa Ponta, Davide Balzarotti
ๆฅ็ๆ่ฆ
The rise of supply chain attacks via malicious Python packages demands robust detection solutions. Current approaches, however, overlook two critical challenges: robustness against adversarial source code transformations and adaptability to the varying false positive rate (FPR) requirements of different actors, from repository maintainers (requiring low FPR) to enterprise security teams (higher FPR tolerance).
We introduce a robust detector capable of seamless integration into both public repositories like PyPI and enterprise ecosystems. To ensure robustness, we propose a novel methodology for generating adversarial packages using fine-grained code obfuscation. Combining these with adversarial training (AT) enhances detector robustness by 2.5x. We comprehensively evaluate AT effectiveness by testing our detector against 122,398 packages collected daily from PyPI over 80 days, showing that AT needs careful application: it makes the detector more robust to obfuscations and allows finding 10% more obfuscated packages, but slightly decreases performance on non-obfuscated packages.
We demonstrate production adaptability of our detector via two case studies: (i) one for PyPI maintainers (tuned at 0.1% FPR) and (ii) one for enterprise teams (tuned at 10% FPR). In the former, we analyze 91,949 packages collected from PyPI over 37 days, achieving a daily detection rate of 2.48 malicious packages with only 2.18 false positives. In the latter, we analyze 1,596 packages adopted by a multinational software company, obtaining only 1.24 false positives daily. These results show that our detector can be seamlessly integrated into both public repositories like PyPI and enterprise ecosystems, ensuring a very low time budget of a few minutes to review the false positives.
Overall, we uncovered 346 malicious packages, now reported to the community.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shreyas Shende, Varsha Narayanan, Vishal Fenn, Yiran Huang, Dincer Goksuluk et al.
ๆฅ็ๆ่ฆ
Early detection of cancer plays a key role in improving survival rates, but identifying reliable biomarkers from RNA-seq data is still a major challenge. The data are high-dimensional, and conventional statistical methods often fail to capture the complex relationships between genes. In this study, we introduce RGE-GCN (Recursive Gene Elimination with Graph Convolutional Networks), a framework that combines feature selection and classification in a single pipeline. Our approach builds a graph from gene expression profiles, uses a Graph Convolutional Network to classify cancer versus normal samples, and applies Integrated Gradients to highlight the most informative genes. By recursively removing less relevant genes, the model converges to a compact set of biomarkers that are both interpretable and predictive. We evaluated RGE-GCN on synthetic data as well as real-world RNA-seq cohorts of lung, kidney, and cervical cancers. Across all datasets, the method consistently achieved higher accuracy and F1-scores than standard tools such as DESeq2, edgeR, and limma-voom. Importantly, the selected genes aligned with well-known cancer pathways including PI3K-AKT, MAPK, SUMOylation, and immune regulation. These results suggest that RGE-GCN shows promise as a generalizable approach for RNA-seq based early cancer detection and biomarker discovery (https://rce-gcn.streamlit.app/ ).
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or reduced diversity. Our analysis demonstrates that this can be attributed to the inherent limitations of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Biao Chen, Zhenhua Lei, Yahui Zhang, Tongzhi Niu
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper introduces a novel method for generating high-quality Digital Image Correlation (DIC) dataset based on non-uniform B-spline surfaces. By randomly generating control point coordinates, we construct displacement fields that encompass a variety of realistic displacement scenarios, which are subsequently used to generate speckle pattern datasets. This approach enables the generation of a large-scale dataset that capture real-world displacement field situations, thereby enhancing the training and generalization capabilities of deep learning-based DIC algorithms. Additionally, we propose a novel network architecture, termed Bayes-DIC Net, which extracts information at multiple levels during the down-sampling phase and facilitates the aggregation of information across various levels through a single skip connection during the up-sampling phase. Bayes-DIC Net incorporates a series of lightweight convolutional blocks designed to expand the receptive field and capture rich contextual information while minimizing computational costs. Furthermore, by integrating appropriate dropout modules into Bayes-DIC Net and activating them during the network inference stage, Bayes-DIC Net is transformed into a Bayesian neural network. This transformation allows the network to provide not only predictive results but also confidence levels in these predictions when processing real unlabeled datasets. This feature significantly enhances the practicality and reliability of our network in real-world displacement field prediction tasks. Through these innovations, this paper offers new perspectives and methods for dataset generation and algorithm performance enhancement in the field of DIC.
๐
2025-12-03
๐ NeurIPS 2025
๐ฅ Arthur Pellegrino, Angus Chadwick
Machine Learning
ๆฅ็ๆ่ฆ
Analysing how neural networks represent data features in their activations can help interpret how they perform tasks. Hence, a long line of work has focused on mathematically characterising the geometry of such "neural representations." In parallel, machine learning has seen a surge of interest in understanding how dynamical systems perform computations on time-varying input data. Yet, the link between computation-through-dynamics and representational geometry remains poorly understood. Here, we hypothesise that recurrent neural networks (RNNs) perform computations by dynamically warping their representations of task variables. To test this hypothesis, we develop a Riemannian geometric framework that enables the derivation of the manifold topology and geometry of a dynamical system from the manifold of its inputs. By characterising the time-varying geometry of RNNs, we show that dynamic warping is a fundamental feature of their computations.
๐
2025-12-03
๐ NeurIPS
๐ฅ Andy Chung, Yichi Zhang, Kaixiang Lin, Aditya Rawal, Qiaozi Gao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude-3.7, GPT-4.1, Llama 4, and o4-mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50\% in baseline conditions to less than 10\% in long context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task-relevant summaries, though fundamental limitations in long context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long-term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Damek Davis, Dmitriy Drusvyatskiy
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low-stable-rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Malyaban Bal, Abhronil Sengupta
Natural Language Processing
ๆฅ็ๆ่ฆ
Parameter-efficient fine-tuning (PEFT) provides a scalable alternative to full-model adaptation by updating only a small subset of parameters in large pre-trained models. We introduce GRASP - GRouped Activation Shared Parameterization - a lightweight PEFT framework that partitions the D-dimensional token representations of selected layers into K << D groups and learns a shared scaling and shifting vector for each group. This grouped modulation reduces the number of trainable parameters significantly while preserving the ability of the model to learn task-specific features. Building on this formulation, we further propose StochGRASP, which learns Gaussian distributions as perturbations to the pre-trained weights rather than deterministic values. This probabilistic parameterization along with a noise-aware loss function formulation enables modelling hardware-level variability in programmed weights and significantly improves robustness under non-ideal inference conditions-an important requirement for deployment on edge-based emerging AI hardware. Across GLUE (RoBERTa-base & RoBERTa-large) and E2E NLG (GPT-2 Medium), GRASP matches or exceeds the performance of established PEFT methods while achieving an order of magnitude reduction in trainable parameters compared to LoRA and BitFit. Under varying levels of noise, StochGRASP consistently outperforms deterministic variants, demonstrating its suitability for energy-efficient and noise-prone hardware platforms.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fan Jia, Yuhao Huang, Shih-Hsin Wang, Cristina Garcia-Cardona, Andrea L. Bertozzi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Flow matching-based generative models have been integrated into the plug-and-play image restoration framework, and the resulting plug-and-play flow matching (PnP-Flow) model has achieved some remarkable empirical success for image restoration. However, the theoretical understanding of PnP-Flow lags its empirical success. In this paper, we derive a continuous limit for PnP-Flow, resulting in a stochastic differential equation (SDE) surrogate model of PnP-Flow. The SDE model provides two particular insights to improve PnP-Flow for image restoration: (1) It enables us to quantify the error for image restoration, informing us to improve step scheduling and regularize the Lipschitz constant of the neural network-parameterized vector field for error reduction. (2) It informs us to accelerate off-the-shelf PnP-Flow models via extrapolation, resulting in a rescaled version of the proposed SDE model. We validate the efficacy of the SDE-informed improved PnP-Flow using several benchmark tasks, including image denoising, deblurring, super-resolution, and inpainting. Numerical results show that our method significantly outperforms the baseline PnP-Flow and other state-of-the-art approaches, achieving superior performance across evaluation metrics.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tasmiah Haque, Srinjoy Das
Multimodal
ๆฅ็ๆ่ฆ
Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection require accurate yet diverse future predictions to support realistic synthesis and robust downstream decision making under uncertainty. To improve the diversity of such sequential forecasts we propose a novel inference-time refinement technique that combines Gated Recurrent Unit-Normalizing Flows (GRU-NF) with stochastic sampling methods. While GRU-NF can capture multimodal distributions through its integration of normalizing flows within a temporal forecasting framework, its deterministic transformation structure can limit expressivity. To address this, inspired by Stochastic Normalizing Flows (SNF), we introduce Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, enabling the model to explore a richer output space and better approximate the true data distribution without retraining. We validate our approach in a keypoint-based video motion transfer pipeline, where capturing temporally coherent and perceptually diverse future trajectories is essential for realistic samples and low bandwidth communication. Experiments show that our inference framework, Gated Recurrent Unit- Stochastic Normalizing Flows (GRU-SNF) outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. By injecting stochasticity during inference, our approach captures multimodal behavior more effectively. These results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Prakhar Gupta, Vaibhav Gupta
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during RL post-training, improves performance even when fine-tuned on randomized solution sequences. On Sudoku, we train a Transformer with standard fine-tuning on randomized solving orders, then post-train it with Group Relative Policy Optimization (GRPO) with two rewards: cell accuracy and an ordering reward that increases when the model's emission order aligns with the solver order. To compare signals cleanly, we combine them via fixed mixtures and use a simple bootstrapped scaling to equalize component magnitudes at initialization. Mixed rewards generally outperform cell-only optimization--the best mixture yields substantially higher test accuracy than the fine-tuned-only model trained on random-order and approaches the fine-tuned-only model trained on solver-order sequences in accuracy. These results suggest that coarse ordering signals can steer RL post-training toward solver-order trajectories without modifying supervised data or architecture.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Przemyslaw Chojecki
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Benchmarks are the primary tool for assessing progress in artificial intelligence (AI), yet current practice evaluates models on isolated test suites and provides little guidance for reasoning about generality or autonomous self-improvement. Here we introduce a geometric framework in which all psychometric batteries for AI agents are treated as points in a structured moduli space, and agent performance is described by capability functionals over this space. First, we define an Autonomous AI (AAI) Scale, a Kardashev-style hierarchy of autonomy grounded in measurable performance on batteries spanning families of tasks (for example reasoning, planning, tool use and long-horizon control). Second, we construct a moduli space of batteries, identifying equivalence classes of benchmarks that are indistinguishable at the level of agent orderings and capability inferences. This geometry yields determinacy results: dense families of batteries suffice to certify performance on entire regions of task space. Third, we introduce a general Generator-Verifier-Updater (GVU) operator that subsumes reinforcement learning, self-play, debate and verifier-based fine-tuning as special cases, and we define a self-improvement coefficient $ฮบ$ as the Lie derivative of a capability functional along the induced flow. A variance inequality on the combined noise of generation and verification provides sufficient conditions for $ฮบ> 0$. Our results suggest that progress toward artificial general intelligence (AGI) is best understood as a flow on moduli of benchmarks, driven by GVU dynamics rather than by scores on individual leaderboards.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shifeng Xie, Rui Yuan, Simone Rossi, Thomas Hannagan
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. However, subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend multi-head LSA embedding matrix by introducing an initial estimation of the query, referred to as the initial guess. We prove an upper bound on the number of heads needed for ICL linear regression setup. Our experiments confirm this result and further observe that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce yq-LSA, a simple generalization of single-head LSA with a trainable initial guess yq. We theoretically establish the capabilities of yq-LSA and provide experimental validation on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the case of linear regression, we consider widespread LLMs augmented with initial guess capabilities, and show that their performance is improved on a semantic similarity task.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Long Dang, Thushari Hapuarachchi, Kaiqi Xiong, Jing Lin
Machine Learning
ๆฅ็ๆ่ฆ
Adversarial training is an effective method to improve the machine learning (ML) model robustness. Most existing studies typically consider the Rectified linear unit (ReLU) activation function and centralized training environments. In this paper, we study the ML model robustness using ten different activation functions through adversarial training in centralized environments and explore the ML model robustness in federal learning environments. In the centralized environment, we first propose an advanced adversarial training approach to improving the ML model robustness by incorporating model architecture change, soft labeling, simplified data augmentation, and varying learning rates. Then, we conduct extensive experiments on ten well-known activation functions in addition to ReLU to better understand how they impact the ML model robustness. Furthermore, we extend the proposed adversarial training approach to the federal learning environment, where both independent and identically distributed (IID) and non-IID data settings are considered. Our proposed centralized adversarial training approach achieves a natural and robust accuracy of 77.08% and 67.96%, respectively on CIFAR-10 against the fast gradient sign attacks. Experiments on ten activation functions reveal ReLU usually performs best. In the federated learning environment, however, the robust accuracy decreases significantly, especially on non-IID data. To address the significant performance drop in the non-IID data case, we introduce data sharing and achieve the natural and robust accuracy of 70.09% and 54.79%, respectively, surpassing the CalFAT algorithm, when 40% data sharing is used. That is, a proper percentage of data sharing can significantly improve the ML model robustness, which is useful to some real-world applications.
๐
2025-12-03
๐ International Conference on Data Science and Intel
๐ฅ Philipp Jung, Nicholas Chandler, Sebastian Jรคger, Felix Biessmann
Machine Learning
ๆฅ็ๆ่ฆ
Data quality monitoring is a core challenge in modern information processing systems. While many approaches to detect data errors or shifts have been proposed, few studies investigate the mechanisms governing error generation. We argue that knowing how errors were generated can be key to tracing and fixing them. In this study, we build on existing work in the statistics literature on missing values and propose MechDetect, a simple algorithm to investigate error generation mechanisms. Given a tabular data set and a corresponding error mask, the algorithm estimates whether or not the errors depend on the data using machine learning models. Our work extends established approaches to detect mechanisms underlying missing values and can be readily applied to other error types, provided that an error mask is available. We demonstrate the effectiveness of MechDetect in experiments on established benchmark datasets.
๐
2025-12-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Christos Kolomvakis, Thomas Bobille, Arnaud Vandaele, Nicolas Gillis
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Boolean matrix factorization (BMF) approximates a given binary input matrix as the product of two smaller binary factors. Unlike binary matrix factorization based on standard arithmetic, BMF employs the Boolean OR and AND operations for the matrix product, which improves interpretability and reduces the approximation error. It is also used in role mining and computer vision. In this paper, we first propose algorithms for BMF that perform alternating optimization (AO) of the factor matrices, where each subproblem is solved via integer programming (IP). We then design different approaches to further enhance AO-based algorithms by selecting an optimal subset of rank-one factors from multiple runs. To address the scalability limits of IP-based methods, we introduce new greedy and local-search heuristics. We also construct a new C++ data structure for Boolean vectors and matrices that is significantly faster than existing ones and is of independent interest, allowing our heuristics to scale to large datasets. We illustrate the performance of all our proposed methods and compare them with the state of the art on various real datasets, both with and without missing data, including applications in topic modeling and imaging.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Lei Zhang, Ju Dong, Kaixin Bai, Minheng Ni, Zoltan-Csaba Marton et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in large multimodal models have enabled new opportunities in embodied AI, particularly in robotic manipulation. These models have shown strong potential in generalization and reasoning, but achieving reliable and responsible robotic behavior in real-world settings remains an open challenge. In high-stakes environments, robotic agents must go beyond basic task execution to perform risk-aware reasoning, moral decision-making, and physically grounded planning. We introduce ResponsibleRobotBench, a systematic benchmark designed to evaluate and accelerate progress in responsible robotic manipulation from simulation to real world. This benchmark consists of 23 multi-stage tasks spanning diverse risk types, including electrical, chemical, and human-related hazards, and varying levels of physical and planning complexity. These tasks require agents to detect and mitigate risks, reason about safety, plan sequences of actions, and engage human assistance when necessary. Our benchmark includes a general-purpose evaluation framework that supports multimodal model-based agents with various action representation modalities. The framework integrates visual perception, context learning, prompt construction, hazard detection, reasoning and planning, and physical execution. It also provides a rich multimodal dataset, supports reproducible experiments, and includes standardized metrics such as success rate, safety rate, and safe success rate. Through extensive experimental setups, ResponsibleRobotBench enables analysis across risk categories, task types, and agent configurations. By emphasizing physical reliability, generalization, and safety in decision-making, this benchmark provides a foundation for advancing the development of trustworthy, real-world responsible dexterous robotic systems. https://sites.google.com/view/responsible-robotbench
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Feeza Khan Khanzada, Jaerock Kwon
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We study how to exploit dense simulator-defined rewards in vision-based autonomous driving without inheriting their misalignment with deployment metrics. In realistic simulators such as CARLA, privileged state (e.g., lane geometry, infractions, time-to-collision) can be converted into dense rewards that stabilize and accelerate model-based reinforcement learning, but policies trained directly on these signals often overfit and fail to generalize when evaluated on sparse objectives such as route completion and collision-free overtaking. We propose reward-privileged world model distillation, a two-stage framework in which a teacher DreamerV3-style agent is first trained with a dense privileged reward, and only its latent dynamics are distilled into a student trained solely on sparse task rewards. Teacher and student share the same observation space (semantic bird's-eye-view images); privileged information enters only through the teacher's reward, and the student does not imitate the teacher's actions or value estimates. Instead, the student's world model is regularized to match the teacher's latent dynamics while its policy is learned from scratch on sparse success/failure signals. In CARLA lane-following and overtaking benchmarks, sparse-reward students outperform both dense-reward teachers and sparse-from-scratch baselines. On unseen lane-following routes, reward-privileged distillation improves success by about 23 percent relative to the dense teacher while maintaining comparable or better safety. On overtaking, students retain near-perfect performance on training routes and achieve up to a 27x improvement in success on unseen routes, with improved lane keeping. These results show that dense rewards can be leveraged to learn richer dynamics models while keeping the deployed policy optimized strictly for sparse, deployment-aligned objectives.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Maksim Surov, Leonid Freidovich
Robotics
ๆฅ็ๆ่ฆ
This paper presents a combined sliding-mode control and subspace stabilization methodology for orbital stabilization of periodic trajectories in underactuated mechanical systems with one degree of underactuation. The approach starts with partial feedback linearization and stabilization. Then, transverse linearization along the reference orbit is computed, resulting in a periodic linear time-varying system with a stable subspace. Sliding-mode control drives trajectories toward this subspace. The proposed design avoids solving computationally intensive periodic LQR problems and improves robustness to matched disturbances. The methodology is validated through experiments on the Butterfly robot.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Zhou Chen, Joe Lin, Carson Bulgin, Sathyanarayanan N. Aakur
Robotics
ๆฅ็ๆ่ฆ
Assistive robots operating in unstructured environments must understand not only what objects are, but what they can be used for. This requires grounding language-based action queries to objects that both afford the requested function and can be physically retrieved. Existing approaches often rely on black-box models or fixed affordance labels, limiting transparency, controllability, and reliability for human-facing applications. We introduce CRAFT-E, a modular neuro-symbolic framework that composes a structured verb-property-object knowledge graph with visual-language alignment and energy-based grasp reasoning. The system generates interpretable grounding paths that expose the factors influencing object selection and incorporates grasp feasibility as an integral part of affordance inference. We further construct a benchmark dataset with unified annotations for verb-object compatibility, segmentation, and grasp candidates, and deploy the full pipeline on a physical robot. CRAFT-E achieves competitive performance in static scenes, ImageNet-based functional retrieval, and real-world trials involving 20 verbs and 39 objects. The framework remains robust under perceptual noise and provides transparent, component-level diagnostics. By coupling symbolic reasoning with embodied perception, CRAFT-E offers an interpretable and customizable alternative to end-to-end models for affordance-grounded object selection, supporting trustworthy decision-making in assistive robotic systems.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Seungwon Choi, Dong-Gyu Park, Seo-Yeon Hwang, Tae-Wan Kim
ๆฅ็ๆ่ฆ
LiDAR-inertial odometry (LIO) is an active research area, as it enables accurate real-time state estimation in GPS-denied environments. Recent advances in map data structures and spatial indexing have significantly improved the efficiency of LIO systems. Nevertheless, we observe that two aspects may still leave room for improvement: (1) nearest neighbor search often requires examining multiple spatial units to gather sufficient points for plane fitting, and (2) plane parameters are typically recomputed at every iteration despite unchanged map geometry. Motivated by these observations, we propose Surfel-LIO, which employs a hierarchical voxel structure (hVox) with pre-computed surfel representation. This design enables O(1) correspondence retrieval without runtime neighbor enumeration or plane fitting, combined with Z-order curve encoding for cache-friendly spatial indexing. Experimental results on the M3DGR dataset demonstrate that our method achieves significantly faster processing speed compared to recent state-of-the-art methods while maintaining comparable state estimation accuracy. Our implementation is publicly available at https://github.com/93won/lidar_inertial_odometry.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zayne Sprague, Jack Lu, Manya Wadhwa, Sedrick Keh, Mengye Ren et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ashlesha Gopinath Sawant, Sahil S. Jadhav, Vidhan R. Jain, Shriraj S. Jagtap, Prachi Jadhav et al.
ๆฅ็ๆ่ฆ
In todays increasing world, it is very important to have good hailing services like Ola, Uber, and Rapido as it is very essential for our daily transportation. Users often face difficulties in choosing the most appropriate and efficient ride that would lead to both cost-effective and would take us to our destination in less time. This project provides you with the web application that helps you to select the most beneficial ride for you by providing users with the fare comparison between Ola, Uber, Rapido for the destination entered by the user. The backend is use to fetch the data, providing users with the fare comparison for the ride and finally providing with the best option using Python. This research paper also addresses the problem and challenges faced in accessing the data using APIs, Android Studios emulator, Appium and location comparison. Thus, the aim of the project is to provide transparency to the users in ride-hailing services and increase efficiency and provide users with better experience.
๐
2025-12-03
๐ ArXiv Preprint (econ.GN)
๐ฅ Nadav Kunievsky
ๆฅ็ๆ่ฆ
In democracies, major policy decisions typically require some form of majority or consensus, so elites must secure mass support to govern. Historically, elites could shape support only through limited instruments like schooling and mass media; advances in AI-driven persuasion sharply reduce the cost and increase the precision of shaping public opinion, making the distribution of preferences itself an object of deliberate design. We develop a dynamic model in which elites choose how much to reshape the distribution of policy preferences, subject to persuasion costs and a majority rule constraint. With a single elite, any optimal intervention tends to push society toward more polarized opinion profiles - a ``polarization pull'' - and improvements in persuasion technology accelerate this drift. When two opposed elites alternate in power, the same technology also creates incentives to park society in ``semi-lock'' regions where opinions are more cohesive and harder for a rival to overturn, so advances in persuasion can either heighten or dampen polarization depending on the environment. Taken together, cheaper persuasion technologies recast polarization as a strategic instrument of governance rather than a purely emergent social byproduct, with important implications for democratic stability as AI capabilities advance.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yizhou Zhao, Zhiwei Steven Wu, Adam Block
Natural Language Processing
ๆฅ็ๆ่ฆ
Watermarking aims to embed hidden signals in generated text that can be reliably detected when given access to a secret key. Open-weight language models pose acute challenges for such watermarking schemes because the inference-time interventions that dominate contemporary approaches cannot be enforced once model weights are public. Existing watermaking techniques for open-weight models, such as the recently proposed GaussMark, typically rely on small modifications to model weights, which can yield signals detectable to those equipped with a secret key, but achieving detection power comparable to inference-time watermarks generally requires weight perturbations that noticeably reduce generation quality. We introduce MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while simultaneously regularizing against degradation in text quality. We derive MarkTune as an improvement on GaussMark and demonstrate that MarkTune consistently improves the quality-detectability trade-off over GaussMark by steering finer-grained, watermark-aware weight updates within the model's representation space while preserving generation quality. Empirically, we show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and exhibits strong generalization: a model fine-tuned on one dataset retains substantial watermark detection power on unseen datasets. Together, these results establish MarkTune as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sandeep Nagar
RoboticsMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduce significant improvements to normalizing flow architectures through six key innovations: 1) Development of invertible 3x3 Convolution layers with mathematically proven necessary and sufficient conditions for invertibility, (2) introduction of a more efficient Quad-coupling layer, 3) Design of a fast and efficient parallel inversion algorithm for kxk convolutional layers, 4) Fast & efficient backpropagation algorithm for inverse of convolution, 5) Using inverse of convolution, in Inverse-Flow, for the forward pass and training it using proposed backpropagation algorithm, and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and Normalizing Flow layers to reduce parameter count while maintaining performance.
The second part: 1) An automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity and annotation challenges, achieving good accuracy in seed purity testing; 2) An unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) We proposed a privacy preserving method for autonomous driving datasets using on face detection and image inpainting; 4) Utilizing Stable Diffusion based image inpainting for replacing the detected face and license plate to advancing privacy-preserving techniques and ethical considerations in the field.; and 5) An adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Andreas Koukounas, Georgios Mastrapas, Florian Hรถnicke, Sedigheh Eslami, Guillaume Roncari et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
๐
2025-12-03
๐ ArXiv Preprint (astro-ph.IM)
๐ฅ Yixuan Li, Yuhao Lu, Yang Liu, Liang Li, R. Ruffini et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This work investigates whether large language models (LLMs) offer advantages over traditional neural networks for astronomical data processing, in regimes with non-Gaussian, non-stationary noise and limited labeled samples. Gravitational wave observations provide an suitable test case, using only 90 LIGO events, finetuned LLMs achieve 97.4\% accuracy for identifying signals. Further experiments show that, in contrast to traditional networks that rely on large simulated datasets, additional simulated samples do not improve LLM performance, while scaling studies reveal predictable gains with increasing model size and dataset size. These results indicate that LLMs can extract discriminative structure directly from observational data and provide an efficient assessment for gravitational wave identification. The same strategy may extend to other astronomical domains with similar noise properties, such as radio or pulsar observations.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
Computer Vision
ๆฅ็ๆ่ฆ
Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To mitigate this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or achieving comparable performance over existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA
๐
2025-12-03
๐ ArXiv Preprint (quant-ph)
๐ฅ Davut Emre Tasar, Ceren Ocal Tasar
Robotics
ๆฅ็ๆ่ฆ
Quantum key distribution (QKD) security fundamentally relies on the ability to distinguish genuine quantum correlations from classical eavesdropper simulations, yet existing certification methods lack rigorous statistical guarantees under finite-sample conditions and adversarial scenarios. We introduce TARA (Test by Adaptive Ranks), a novel framework combining conformal prediction with sequential martingale testing for quantum anomaly detection that provides distribution-free validity guarantees. TARA offers two complementary approaches. TARA k, based on Kolmogorov Smirnov calibration against local hidden variable (LHV) null distributions, achieving ROC AUC = 0.96 for quantum-classical discrimination. And TARA-m, employing betting martingales for streaming detection with anytime valid type I error control that enables real time monitoring of quantum channels. We establish theoretical guarantees proving that under (context conditional) exchangeability, conformal p-values remain uniformly distributed even for strongly contextual quantum data, confirming that quantum contextuality does not break conformal prediction validity a result with implications beyond quantum certification to any application of distribution-free methods to nonclassical data. Extensive validation on both IBM Torino (superconducting, CHSH = 2.725) and IonQ Forte Enterprise (trapped ion, CHSH = 2.716) quantum processors demonstrates cross-platform robustness, achieving 36% security margins above the classical CHSH bound of 2. Critically, our framework reveals a methodological concern affecting quantum certification more broadly: same-distribution calibration can inflate detection performance by up to 44 percentage points compared to proper cross-distribution calibration, suggesting that prior quantum certification studies using standard train test splits may have systematically overestimated adversarial robustness.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marcelo Isaias de Moraes Junior, Moacir Antonelli Ponti
ๆฅ็ๆ่ฆ
Sketches are simple human hand-drawn abstractions of complex scenes and real-world objects. Although the field of sketch representation learning has advanced significantly, there is still a gap in understanding the true relevance of the temporal aspect to the quality of these representations. This work investigates whether it is indeed justifiable to treat sketches as sequences, as well as which internal orders play a more relevant role. The results indicate that, although the use of traditional positional encodings is valid for modeling sketches as sequences, absolute coordinates consistently outperform relative ones. Furthermore, non-autoregressive decoders outperform their autoregressive counterparts. Finally, the importance of temporality was shown to depend on both the order considered and the task evaluated.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jialuo Li, Bin Li, Jiahao Li, Yan Lu
Multimodal
ๆฅ็ๆ่ฆ
The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global query and localized query. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically,DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hang Xu, Linjiang Huang, Feng Zhao
Robotics
ๆฅ็ๆ่ฆ
Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image(T2I) diffusion models, most related works focus on search strategies and reward models, yet the impact of the stochastic characteristic of noise in T2I diffusion models on the method's performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new format of randomness for TTS: text embedding perturbation, which couples with existing randomness like SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these formats of randomness and their impact on generation, and find that these two randomness exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embedding demonstrates varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) Introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation. (2) Adapting the perturbation intensity selectively based on their frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at \href{https://github.com/xuhang07/TEP-Diffusion}{https://github.com/xuhang07/TEP-Diffusion}.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zexin Lin, Hawen Wan, Yebin Zhong, Xiaoqiang
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.
๐
2025-12-03
๐ NeurIPS 2025
๐ฅ Jin-Ting He, Fu-Jen Tsai, Yan-Tsung Peng, Min-Hung Chen, Chia-Wen Lin et al.
ๆฅ็ๆ่ฆ
Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address it, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The source code is available at https://github.com/Jin-Ting-He/BlurDM.
๐
2025-12-03
๐ ArXiv Preprint (cs.GT)
๐ฅ Kshipra Bhawalkar, Alexandros Psomas, Di Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Online platforms connect users with relevant products and services using ads. A key challenge is that a user's search query often leaves their true intent ambiguous. Typically, platforms passively predict relevance based on available signals and in some cases offer query refinements. The shift from traditional search to conversational AI provides a new approach. When a user's query is ambiguous, a Large Language Model (LLM) can proactively offer several clarifying follow-up prompts. In this paper we consider the following: what if some of these follow-up prompts can be ``sponsored,'' i.e., selected for their advertising potential. How should these ``suggestion slots'' be allocated? And, how does this new mechanism interact with the traditional ad auction that might follow?
This paper introduces a formal model for designing and analyzing these interactive platforms. We use this model to investigate a critical engineering choice: whether it is better to build an end-to-end pipeline that jointly optimizes the user interaction and the final ad auction, or to decouple them into separate mechanisms for the suggestion slots and another for the subsequent ad slot. We show that the VCG mechanism can be adopted to jointly optimize the sponsored suggestion and the ads that follow; while this mechanism is more complex, it achieves outcomes that are efficient and truthful. On the other hand, we prove that the simple-to-implement modular approach suffers from strategic inefficiency: its Price of Anarchy is unbounded.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Franki Nguimatsia Tiofack, Thรฉotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Offline reinforcement learning often relies on behavior regularization that enforces policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: https://simple-robotics.github.io/publications/guided-flow-policy/
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Niklas Jobs, Luis Miguel Vieira da Silva, Jayanth Somashekaraiah, Maximilian Weigand, David Kube et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Industrial automation increasingly requires flexible control strategies that can adapt to changing tasks and environments. Agents based on Large Language Models (LLMs) offer potential for such adaptive planning and execution but lack standardized benchmarks for systematic comparison. We introduce a benchmark with an executable simulation environment representing the Blocksworld problem providing five complexity categories. By integrating the Model Context Protocol (MCP) as a standardized tool interface, diverse agent architectures can be connected to and evaluated against the benchmark without implementation-specific modifications. A single-agent implementation demonstrates the benchmark's applicability, establishing quantitative metrics for comparison of LLM-based planning and execution approaches.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Vineel Tummala, Daniela Inclezan
Robotics
ๆฅ็ๆ่ฆ
This paper presents a logic programming-based framework for policy-aware autonomous agents that can reason about potential penalties for non-compliance and act accordingly. While prior work has primarily focused on ensuring compliance, our approach considers scenarios where deviating from policies may be necessary to achieve high-stakes goals. Additionally, modeling non-compliant behavior can assist policymakers by simulating realistic human decision-making. Our framework extends Gelfond and Lobo's Authorization and Obligation Policy Language (AOPL) to incorporate penalties and integrates Answer Set Programming (ASP) for reasoning. Compared to previous approaches, our method ensures well-formed policies, accounts for policy priorities, and enhances explainability by explicitly identifying rule violations and their consequences. Building on the work of Harders and Inclezan, we introduce penalty-based reasoning to distinguish between non-compliant plans, prioritizing those with minimal repercussions. To support this, we develop an automated translation from the extended AOPL into ASP and refine ASP-based planning algorithms to account for incurred penalties. Experiments in two domains demonstrate that our framework generates higher-quality plans that avoid harmful actions while, in some cases, also improving computational efficiency. These findings underscore its potential for enhancing autonomous decision-making and informing policy refinement. Under consideration in Theory and Practice of Logic Programming (TPLP).
๐
2025-12-03
๐ ArXiv Preprint (math.OC)
๐ฅ X. Y. Han, Yuan Zhong
Machine Learning
ๆฅ็ๆ่ฆ
In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training using a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present real experiments on 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE in AI models.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Jeongeun Park, Jihwan Yoon, Byungwoo Jeon, Juhan Park, Jinwoo Shin et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Prior Vision-Language-Action (VLA) models are typically trained on teleoperated successful demonstrations, while discarding numerous failed attempts that occur naturally during data collection. However, these failures encode where and how policies can be fragile, information that can be exploited to improve robustness. We address this problem by leveraging mixed-quality datasets to learn failure-aware reasoning at planning time. We introduce VINE, a hierarchical vision-language-action model that separates high-level reasoning (System 2) from low-level control (System 1) under a hierarchical reinforcement learning formalism, making failures usable as a structured learning signal rather than noisy supervision. System 2 performs feasibility-guided tree search over a 2D scene-graph abstraction: it proposes subgoal transitions, predicts success probabilities from both successes and failures, and prunes brittle branches before execution, effectively casting plan evaluation as feasibility scoring. The selected subgoal sequence is then passed to System 1, which executes low-level actions without modifying the agent's core skills. Trained entirely from offline teleoperation data, VINE integrates negative experience directly into the decision loop. Across challenging manipulation tasks, this approach consistently improves success rates and robustness, demonstrating that failure data is an essential resource for converting the broad competence of VLAs into robust execution.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Kenneth Stewart, Roxana Leontie, Samantha Chapin, Joe Hays, Sumit Bam Shrestha et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We present an end-to-end pipeline for deploying reinforcement learning (RL) trained Artificial Neural Networks (ANNs) on neuromorphic hardware by converting them into spiking Sigma-Delta Neural Networks (SDNNs). We demonstrate that an ANN policy trained entirely in simulation can be transformed into an SDNN compatible with Intel's Loihi 2 architecture, enabling low-latency and energy-efficient inference. As a test case, we use an RL policy for controlling the Astrobee free-flying robot, similar to a previously hardware in space-validated controller. The policy, trained with Rectified Linear Units (ReLUs), is converted to an SDNN and deployed on Intel's Loihi 2, then evaluated in NVIDIA's Omniverse Isaac Lab simulation environment for closed-loop control of Astrobee's motion. We compare execution performance between GPU and Loihi 2. The results highlight the feasibility of using neuromorphic platforms for robotic control and establish a pathway toward energy-efficient, real-time neuromorphic computation in future space and terrestrial robotics applications.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ekhi Azurmendi, Joseba Fernandez de Landa, Jaione Bengoetxea, Maite Heredia, Julen Etxaniz et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Saurav Prateek
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The advancement in Large Language Models has driven the creation of complex agentic systems, such as Deep Research Agents (DRAs), to overcome the limitations of static Retrieval Augmented Generation (RAG) pipelines in handling complex, multi-turn research tasks. This paper introduces the Static Deep Research Agent (Static-DRA), a novel solution built upon a configurable and hierarchical Tree-based static workflow.
The core contribution is the integration of two user-tunable parameters, Depth and Breadth, which provide granular control over the research intensity. This design allows end-users to consciously balance the desired quality and comprehensiveness of the research report against the associated computational cost of Large Language Model (LLM) interactions. The agent's architecture, comprising Supervisor, Independent, and Worker agents, facilitates effective multi-hop information retrieval and parallel sub-topic investigation.
We evaluate the Static-DRA against the established DeepResearch Bench using the RACE (Reference-based Adaptive Criteria-driven Evaluation) framework. Configured with a depth of 2 and a breadth of 5, and powered by the gemini-2.5-pro model, the agent achieved an overall score of 34.72. Our experiments validate that increasing the configured Depth and Breadth parameters results in a more in-depth research process and a correspondingly higher evaluation score. The Static-DRA offers a pragmatic and resource-aware solution, empowering users with transparent control over the deep research process. The entire source code, outputs and benchmark results are open-sourced at https://github.com/SauravP97/Static-Deep-Research/
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Danny Hoang, Anandkumar Patel, Ruimen Chen, Rajiv Malhotra, Farhad Imani
ๆฅ็ๆ่ฆ
Smart manufacturing can significantly improve efficiency and reduce energy consumption, yet the energy demands of AI models may offset these gains. This study utilizes in-situ sensing-based prediction of geometric quality in smart machining to compare the energy consumption, accuracy, and speed of common AI models. HyperDimensional Computing (HDC) is introduced as an alternative, achieving accuracy comparable to conventional models while drastically reducing energy consumption, 200$\times$ for training and 175 to 1000$\times$ for inference. Furthermore, HDC reduces training times by 200$\times$ and inference times by 300 to 600$\times$, showcasing its potential for energy-efficient smart manufacturing.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gaetano Signorelli, Michele Lombardi
Machine Learning
ๆฅ็ๆ่ฆ
Decision support systems often rely on solving complex optimization problems that may require to estimate uncertain parameters beforehand. Recent studies have shown how using traditionally trained estimators for this task can lead to suboptimal solutions. Using the actual decision cost as a loss function (called Decision Focused Learning) can address this issue, but with a severe loss of scalability at training time. To address this issue, we propose an acceleration method based on replacing costly loss function evaluations with an efficient surrogate. Unlike previously defined surrogates, our approach relies on unbiased estimators reducing the risk of spurious local optima and can provide information on its local confidence allowing one to switch to a fallback method when needed. Furthermore, the surrogate is designed for a black-box setting, which enables compensating for simplifications in the optimization model and account- ing for recourse actions during cost computation. In our results, the method reduces costly inner solver calls, with a solution quality comparable to other state-of-the-art techniques.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hania Ghouse, Maryam Alsharqi, Farhad R. Nezami, Muzammil Behzad
Multimodal
ๆฅ็ๆ่ฆ
Cardiac image analysis remains fragmented across tasks: anatomical segmentation, disease classification, and grounded clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives within a single architecture while retaining generalization across imaging modalities and datasets. We introduce PULSE, a multi-task vision-language framework built on self-supervised representations and optimized through a composite supervision strategy that balances region overlap learning, pixel wise classification fidelity, and boundary aware IoU refinement. A multi-scale token reconstruction decoder enables anatomical segmentation, while shared global representations support disease classification and clinically grounded text output allowing the model to transition from pixels to structures and finally clinical reasoning within one architecture. Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and can be adapted to new imaging modalities with minimal supervision. This moves the field closer to a scalable, foundation style cardiac analysis framework.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dingwei Zhu, Zhiheng Xi, Shihan Dou, Yuhui Wang, Sixian Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision, and applies an asymmetric risk regularization to shape the distribution tails: it contracts the lower tail to dampen noisy negative deviations, while expanding the upper tail to preserve exploratory diversity. Across extensive experiments and analysis in multi-turn dialogue, math reasoning, and scientific QA, DVPO consistently outperforms PPO, GRPO, and robust Bellman-based PPO under noisy supervision, showing its potential for LLM post-training in the real-world.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Jia Hu, Zhexi Lian, Xuerun Yan, Ruiang Bi, Dou Shen et al.
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Autonomous Driving (AD) vehicles still struggle to exhibit human-like behavior in highly dynamic and interactive traffic scenarios. The key challenge lies in AD's limited ability to interact with surrounding vehicles, largely due to a lack of understanding the underlying mechanisms of social interaction. To address this issue, we introduce MPCFormer, an explainable socially-aware autonomous driving approach with physics-informed and data-driven coupled social interaction dynamics. In this model, the dynamics are formulated into a discrete space-state representation, which embeds physics priors to enhance modeling explainability. The dynamics coefficients are learned from naturalistic driving data via a Transformer-based encoder-decoder architecture. To the best of our knowledge, MPCFormer is the first approach to explicitly model the dynamics of multi-vehicle social interactions. The learned social interaction dynamics enable the planner to generate manifold, human-like behaviors when interacting with surrounding traffic. By leveraging the MPC framework, the approach mitigates the potential safety risks typically associated with purely learning-based methods. Open-looped evaluation on NGSIM dataset demonstrates that MPCFormer achieves superior social interaction awareness, yielding the lowest trajectory prediction errors compared with other state-of-the-art approach. The prediction achieves an ADE as low as 0.86 m over a long prediction horizon of 5 seconds. Close-looped experiments in highly intense interaction scenarios, where consecutive lane changes are required to exit an off-ramp, further validate the effectiveness of MPCFormer. Results show that MPCFormer achieves the highest planning success rate of 94.67%, improves driving efficiency by 15.75%, and reduces the collision rate from 21.25% to 0.5%, outperforming a frontier Reinforcement Learning (RL) based planner.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
MultimodalRoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dongchao Yang, Songxiang Liu, Disong Wang, Yuanyuan Wang, Guanglu Wan et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in Omni models have enabled unified multimodal perception and generation. However, most existing systems still exhibit rigid reasoning behaviors, either overthinking simple problems or failing to reason when necessary. To address this limitation, we propose Omni-AutoThink, a novel adaptive reasoning framework that dynamically adjusts the model's reasoning depth according to task difficulty. Our framework comprises two stages: (1) an Adaptive Supervised Fine-Tuning (Adaptive SFT) stage, which endows the Omni model with fundamental reasoning capability using large-scale reasoning-augmented data, and (2) an Adaptive Reinforcement Learning (Adaptive GRPO) stage, which optimizes reasoning behaviors based on task complexity and reward feedback. We further construct a comprehensive adaptive reasoning benchmark that spans text-only, text-audio, text-visual, and text-audio-visual modalities, providing both training and evaluation splits for multimodal reasoning assessment. Experimental results demonstrate that our proposed framework significantly improves adaptive reasoning performance compared to previous baselines. All benchmark data and code will be publicly released.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Gabriele Fadini, Deepak Ingole, Tong Duy Son, Alisa Rupenyan
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper presents an auto-tuning framework for torque-based Nonlinear Model Predictive Control (nMPC), where the MPC serves as a real-time controller for optimal joint torque commands. The MPC parameters, including cost function weights and low-level controller gains, are optimized using high-dimensional Bayesian Optimization (BO) techniques, specifically Sparse Axis-Aligned Subspace (SAASBO) with a digital twin (DT) to achieve precise end-effector trajectory real-time tracking on an UR10e robot arm. The simulation model allows efficient exploration of the high-dimensional parameter space, and it ensures safe transfer to hardware. Our simulation results demonstrate significant improvements in tracking performance (+41.9%) and reduction in solve times (-2.5%) compared to manually-tuned parameters. Moreover, experimental validation on the real robot follows the trend (with a +25.8% improvement), emphasizing the importance of digital twin-enabled automated parameter optimization for robotic operations.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce \textbf{Doublespeak}, a simple \emph{in-context representation hijacking} attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., \textit{bomb}) with a benign token (e.g., \textit{carrot}) across multiple in-context examples, provided a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., ``How to build a carrot?'') are internally interpreted as disallowed instructions (e.g., ``How to build a bomb?''), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74\% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiawei Xu, Fengfeng Wei, Weineng Chen
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Automatic Heuristic Design (AHD) has gained traction as a promising solution for solving combinatorial optimization problems (COPs). Large Language Models (LLMs) have emerged and become a promising approach to achieving AHD, but current LLM-based AHD research often only considers a single role. This paper proposes RoCo, a novel Multi-Agent Role-Based System, to enhance the diversity and quality of AHD through multi-role collaboration. RoCo coordinates four specialized LLM-guided agents-explorer, exploiter, critic, and integrator-to collaboratively generate high-quality heuristics. The explorer promotes long-term potential through creative, diversity-driven thinking, while the exploiter focuses on short-term improvements via conservative, efficiency-oriented refinements. The critic evaluates the effectiveness of each evolution step and provides targeted feedback and reflection. The integrator synthesizes proposals from the explorer and exploiter, balancing innovation and exploitation to drive overall progress. These agents interact in a structured multi-round process involving feedback, refinement, and elite mutations guided by both short-term and accumulated long-term reflections. We evaluate RoCo on five different COPs under both white-box and black-box settings. Experimental results demonstrate that RoCo achieves superior performance, consistently generating competitive heuristics that outperform existing methods including ReEvo and HSEvo, both in white-box and black-box scenarios. This role-based collaborative paradigm establishes a new standard for robust and high-performing AHD.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yufeng Li, Wenchao Zhao, Bo Dang, Weimin Wang
Machine Learning
ๆฅ็ๆ่ฆ
Previously, image interpretation in radiology relied heavily on manual methods. However, manual classification of brain tumor medical images is time-consuming and labor-intensive. Even with shallow convolutional neural network models, the accuracy is not ideal. To improve the efficiency and accuracy of brain tumor image classification, this paper proposes a brain tumor classification model based on an improved ResNet34 network. This model uses the ResNet34 residual network as the backbone network and incorporates multi-scale feature extraction. It uses a multi-scale input module as the first layer of the ResNet34 network and an Inception v2 module as the residual downsampling layer. Furthermore, a channel attention mechanism module assigns different weights to different channels of the image from a channel domain perspective, obtaining more important feature information. The results after a five-fold crossover experiment show that the average classification accuracy of the improved network model is approximately 98.8%, which is not only 1% higher than ResNet34, but also only 80% of the number of parameters of the original model. Therefore, the improved network model not only improves accuracy but also reduces clutter, achieving a classification effect with fewer parameters and higher accuracy.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Melane Navaratnarajah, David A. Kelly, Hana Chockler
ๆฅ็ๆ่ฆ
Adversarial perturbations are a useful way to expose vulnerabilities in object detectors. Existing perturbation methods are frequently white-box and architecture specific. More importantly, while they are often successful, it is rarely clear why they work. Insights into the mechanism of this success would allow developers to understand and analyze these attacks, as well as fine-tune the model to prevent them. This paper presents BlackCAtt, a black-box algorithm and a tool, which uses minimal, causally sufficient pixel sets to construct explainable, imperceptible, reproducible, architecture-agnostic attacks on object detectors. BlackCAtt combines causal pixels with bounding boxes produced by object detectors to create adversarial attacks that lead to the loss, modification or addition of a bounding box. BlackCAtt works across different object detectors of different sizes and architectures, treating the detector as a black box. We compare the performance of BlackCAtt with other black-box attack methods and show that identification of causal pixels leads to more precisely targeted and less perceptible attacks. On the COCO test dataset, our approach is 2.7 times better than the baseline in removing a detection, 3.86 times better in changing a detection, and 5.75 times better in triggering new, spurious, detections. The attacks generated by BlackCAtt are very close to the original image, and hence imperceptible, demonstrating the power of causal pixels.
๐
2025-12-03
๐ ArXiv Preprint (cs.ET)
๐ฅ Pradnya Taksande, Shwetha Kiran, Pranav Jha, Prasanna Chaporkar
Machine Learning
ๆฅ็ๆ่ฆ
The 3rd Generation Partnership Project (3GPP), the standards body for mobile networks, is in the final phase of Release 19 standardization and is beginning Release 20. Artificial Intelligence/ Machine Learning (AI/ML) has brought about a paradigm shift in technology and it is being adopted across industries and verticals. 3GPP has been integrating AI/ML into the 5G advanced system since Release 18. This paper focuses on the AI/ML related technological advancements and features introduced in Release 19 within the Service and System Aspects (SA) Technical specifications group of 3GPP. The advancements relate to two paradigms: (i) enhancements that AI/ML brought to the 5G advanced system (AI for network), e.g. resource optimization, and (ii) enhancements that were made to the 5G system to support AI/ML applications (Network for AI), e.g. image recognition.
๐
2025-12-03
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Tengyun Ma, Jiaqi Yao, Daojing He, Shihao Peng, Yu Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have emerged as powerful tools for diverse applications. However, their uniform token processing paradigm introduces critical vulnerabilities in instruction handling, particularly when exposed to adversarial scenarios. In this work, we identify and propose a novel class of vulnerabilities, termed Tool-Completion Attack (TCA), which exploits function-calling mechanisms to subvert model behavior. To evaluate LLM robustness against such threats, we introduce the Tool-Completion benchmark, a comprehensive security assessment framework, which reveals that even state-of-the-art models remain susceptible to TCA, with surprisingly high attack success rates. To address these vulnerabilities, we introduce Context-Aware Hierarchical Learning (CAHL), a sophisticated mechanism that dynamically balances semantic comprehension with role-specific instruction constraints. CAHL leverages the contextual correlations between different instruction segments to establish a robust, context-aware instruction hierarchy. Extensive experiments demonstrate that CAHL significantly enhances LLM robustness against both conventional attacks and the proposed TCA, exhibiting strong generalization capabilities in zero-shot evaluations while still preserving model performance on generic tasks. Our code is available at https://github.com/S2AILab/CAHL.
๐
2025-12-03
๐ ArXiv Preprint (cs.IT)
๐ฅ Seyed Mohammad Azimi-Abarghouyi, Carlo Fischione, Kaibin Huang
Machine Learning
ๆฅ็ๆ่ฆ
Over-the-Air Federated Learning (AirFL) is an emerging paradigm that tightly integrates wireless signal processing and distributed machine learning to enable scalable AI at the network edge. By leveraging the superposition property of wireless signals, AirFL performs communication and model aggregation of the learning process simultaneously, significantly reducing latency, bandwidth, and energy consumption. This article offers a tutorial treatment of AirFL, presenting a novel classification into three design approaches: CSIT-aware, blind, and weighted AirFL. We provide a comprehensive guide to theoretical foundations, performance analysis, complexity considerations, practical limitations, and prospective research directions.
๐
2025-12-03
๐ ArXiv Preprint (cs.DS)
๐ฅ Robert Ganian, Hung P. Hoang, Simon Wietheger
ๆฅ็ๆ่ฆ
We study the computational problem of computing a fair means clustering of discrete vectors, which admits an equivalent formulation as editing a colored matrix into one with few distinct color-balanced rows by changing at most $k$ values. While NP-hard in both the fairness-oblivious and the fair settings, the problem is well-known to admit a fixed-parameter algorithm in the former ``vanilla'' setting. As our first contribution, we exclude an analogous algorithm even for highly restricted fair means clustering instances. We then proceed to obtain a full complexity landscape of the problem, and establish tractability results which capture three means of circumventing our obtained lower bound: placing additional constraints on the problem instances, fixed-parameter approximation, or using an alternative parameterization targeting tree-like matrices.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammad Doost, Mohammad Manthouri
Machine Learning
ๆฅ็ๆ่ฆ
We propose a novel QTGNN framework for detecting fraudulent transactions in large-scale financial networks. By integrating quantum embedding, variational graph convolutions, and topological data analysis, QTGNN captures complex transaction dynamics and structural anomalies indicative of fraud. The methodology includes quantum data embedding with entanglement enhancement, variational quantum graph convolutions with non-linear dynamics, extraction of higher-order topological invariants, hybrid quantum-classical anomaly learning with adaptive optimization, and interpretable decision-making via topological attribution. Rigorous convergence guarantees ensure stable training on noisy intermediate-scale quantum (NISQ) devices, while stability of topological signatures provides robust fraud detection. Optimized for NISQ hardware with circuit simplifications and graph sampling, the framework scales to large transaction networks. Simulations on financial datasets, such as PaySim and Elliptic, benchmark QTGNN against classical and quantum baselines, using metrics like ROC-AUC, precision, and false positive rate. An ablation study evaluates the contributions of quantum embeddings, topological features, non-linear channels, and hybrid learning. QTGNN offers a theoretically sound, interpretable, and practical solution for financial fraud detection, bridging quantum machine learning, graph theory, and topological analysis.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qi'ao Xu, Tianwen Qian, Yuqian Fu, Kailing Li, Yang Jiao et al.
Multimodal
ๆฅ็ๆ่ฆ
A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce \textbf{ToG-Bench}, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) \textbf{Task-oriented Grounding}, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) \textbf{Explicit-Implicit Dual Grounding}, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) \textbf{One-to-Many Grounding}, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit object grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: \href{https://github.com/qaxuDev/ToG-Bench}{https://github.com/qaxuDev/ToG-Bench}..
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alex Ferrando, Xavier Suau, Jordi Gonzร lez, Pau Rodriguez
Machine Learning
ๆฅ็ๆ่ฆ
Activation steering has emerged as a powerful method for guiding the behavior of generative models towards desired outcomes such as toxicity mitigation. However, most existing methods apply interventions uniformly across all inputs, degrading model performance when steering is unnecessary. We introduce Dynamically Scaled Activation Steering (DSAS), a method-agnostic steering framework that decouples when to steer from how to steer. DSAS adaptively modulates the strength of existing steering transformations across layers and inputs, intervening strongly only when undesired behavior is detected. At generation time, DSAS computes context-dependent scaling factors that selectively adjust the strength of any steering method. We also show how DSAS can be jointly optimized end-to-end together with the steering function. When combined with existing steering methods, DSAS consistently improves the Pareto front with respect to steering alone, achieving a better trade-off between toxicity mitigation and utility preservation. We further demonstrate DSAS's generality by applying it to a text-to-image diffusion model, showing how adaptive steering allows the modulation of specific concepts. Finally, DSAS introduces minimal computational overhead while improving interpretability, pinpointing which tokens require steering and by how much.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiahao Zhang, Xiao Zhao, Guangyu Gao
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Deep convolutional neural networks (DCNNs) have substantially advanced object detection capabilities, particularly in remote sensing imagery. However, challenges persist, especially in detecting small objects where the high resolution of these images and the small size of target objects often result in a loss of critical information in the deeper layers of conventional CNNs. Additionally, the extensive spatial redundancy and intricate background details typical in remote-sensing images tend to obscure these small targets. To address these challenges, we introduce Multi-Kernel Selection Network (MKSNet), a novel network architecture featuring a novel Multi-Kernel Selection mechanism. The MKS mechanism utilizes large convolutional kernels to effectively capture an extensive range of contextual information. This innovative design allows for adaptive kernel size selection, significantly enhancing the network's ability to dynamically process and emphasize crucial spatial details for small object detection. Furthermore, MKSNet also incorporates a dual attention mechanism, merging spatial and channel attention modules. The spatial attention module adaptively fine-tunes the spatial weights of feature maps, focusing more intensively on relevant regions while mitigating background noise. Simultaneously, the channel attention module optimizes channel information selection, improving feature representation and detection accuracy. Empirical evaluations on the DOTA-v1.0 and HRSC2016 benchmark demonstrate that MKSNet substantially surpasses existing state-of-the-art models in detecting small objects in remote sensing images. These results highlight MKSNet's superior ability to manage the complexities associated with multi-scale and high-resolution image data, confirming its effectiveness and innovation in remote sensing object detection.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ahmad Aghaebrahimian
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models have significantly advanced natural language processing tasks, but remain prone to generating incorrect or misleading but plausible arguments. This issue, known as hallucination, is particularly concerning in high-stakes domains like clinical applications, where factual inaccuracies can have severe consequences. Existing evaluation metrics fail to adequately assess factual consistency and lack interpretability, making diagnosing and mitigating errors difficult. We propose an interpretable framework for factual consistency assessment for in-domain and open-domain texts to address these limitations. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. Unlike previous methods with an absolute metric, we incorporate a weighted metric to enhance factual evaluation. Additionally, we propose a mechanism to control assessment complexity in intricate domains. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training in future research.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen et al.
Multimodal
ๆฅ็ๆ่ฆ
Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remember. Without reliable memory, agents catastrophically forget past experiences, struggle with long-horizon reasoning, and fail to operate coherently in multimodal or interactive environments. We introduce MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory, enabling scalable and adaptive multimodal intelligence. MemVerse maintains short-term memory for recent context while transforming raw multimodal experiences into structured long-term memories organized as hierarchical knowledge graphs. This design supports continual consolidation, adaptive forgetting, and bounded memory growth. To handle real-time demands, MemVerse introduces a periodic distillation mechanism that compresses essential knowledge from long-term memory into the parametric model, allowing fast, differentiable recall while preserving interpretability. Extensive experiments demonstrate that MemVerse significantly improves multimodal reasoning and continual learning efficiency, empowering agents to remember, adapt, and reason coherently across extended interactions.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Edward C. C. Steele, Dinesh Mane, Emilio Monti, Luis Orus, Rebecca Chantrill-Cheyette et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Despite the promising capability of multimodal foundation models, their application to the generation of meteorological products and services remains nascent. To accelerate aspiration and adoption, we explore the novel use of a vision language model for writing the iconic Shipping Forecast text directly from video-encoded gridded weather data. These early results demonstrate promising scalable technological opportunities for enhancing production efficiency and service innovation within the weather enterprise and beyond.
๐
2025-12-03
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Hanxiu Zhang, Yue Zheng
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The protection of Intellectual Property (IP) in Large Language Models (LLMs) represents a critical challenge in contemporary AI research. While fingerprinting techniques have emerged as a fundamental mechanism for detecting unauthorized model usage, existing methods -- whether behavior-based or structural -- suffer from vulnerabilities such as false claim attacks or susceptible to weight manipulations. To overcome these limitations, we propose SELF, a novel intrinsic weight-based fingerprinting scheme that eliminates dependency on input and inherently resists false claims. SELF achieves robust IP protection through two key innovations: 1) unique, scalable and transformation-invariant fingerprint extraction via singular value and eigenvalue decomposition of LLM attention weights, and 2) effective neural network-based fingerprint similarity comparison based on few-shot learning and data augmentation. Experimental results demonstrate SELF maintains high IP infringement detection accuracy while showing strong robustness against various downstream modifications, including quantization, pruning, and fine-tuning attacks. Our code is available at https://github.com/HanxiuZhang/SELF_v2.
๐
2025-12-03
๐ ArXiv Preprint (cs.AR)
๐ฅ Lishuo Deng, Shaojie Xu, Jinwu Chen, Changwei Yan, Jiajie Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Deploying large language models (LLMs) on edge devices enables personalized agents with strong privacy and low cost. However, with tens to hundreds of billions of parameters, single-batch autoregressive inference suffers from extremely low arithmetic intensity, creating severe weight-loading and bandwidth pressures on resource-constrained platforms. Recent in-flash computing (IFC) solutions alleviate this bottleneck by co-locating weight-related linear computations in the decode phase with flash, yet still rely on DRAM for the key-value (KV) cache. As context length grows, the KV cache can exceed model weights in size, imposing prohibitive DRAM cost and capacity requirements. Attempts to offload KV cache to flash suffer from severe performance penalties.
We propose KVNAND, the first DRAM-free, IFC-based architecture that stores both model weights and KV cache entirely in compute-enabled 3D NAND flash. KVNAND addresses the fundamental performance challenges of flash under intensive KV cache access by leveraging IFC for all memory-bound operations to reduce data transfer overhead, introducing head-group parallelism to boost throughput, and employing page-level KV cache mapping to align token access patterns with flash organization. In addition, we propose a design space exploration framework that evaluates discrete and compact KVNAND variants to balance weight and KV placement, automatically identifying the optimal design trade-off. These techniques mitigate latency, energy, and reliability concerns, turning flash into a practical medium for long-context KV storage. Evaluations on MHA 7B and GQA 70B LLMs show that KVNAND achieves 1.98\(\times\)/1.94\(\times\)/2.05\(\times\) geomean speedup at 128/1K/10K-token contexts compared to DRAM-equipped IFC designs and addresses out-of-memory failures at 100K context length.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yusen Wu, Xiaotie Deng
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This paper proposes DeepRule, an integrated framework for automated business rule generation in retail assortment and pricing optimization. Addressing the systematic misalignment between existing theoretical models and real-world economic complexities, we identify three critical gaps: (1) data modality mismatch where unstructured textual sources (e.g. negotiation records, approval documents) impede accurate customer profiling; (2) dynamic feature entanglement challenges in modeling nonlinear price elasticity and time-varying attributes; (3) operational infeasibility caused by multi-tier business constraints.
Our framework introduces a tri-level architecture for above challenges. We design a hybrid knowledge fusion engine employing large language models (LLMs) for deep semantic parsing of unstructured text, transforming distributor agreements and sales assessments into structured features while integrating managerial expertise. Then a game-theoretic constrained optimization mechanism is employed to dynamically reconcile supply chain interests through bilateral utility functions, encoding manufacturer-distributor profit redistribution as endogenous objectives under hierarchical constraints. Finally an interpretable decision distillation interface leveraging LLM-guided symbolic regression to find and optimize pricing strategies and auditable business rules embeds economic priors (e.g. non-negative elasticity) as hard constraints during mathematical expression search. We validate the framework in real retail environments achieving higher profits versus systematic B2C baselines while ensuring operational feasibility. This establishes a close-loop pipeline unifying unstructured knowledge injection, multi-agent optimization, and interpretable strategy synthesis for real economic intelligence.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zeba Afroz, Harsh Vardhan, Pawan Bhakuni, Aanchal Punia, Rajdeep Kumar et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Narratives are the cognitive and emotional scaffolds of propaganda. They organize isolated persuasive techniques into coherent stories that justify actions, attribute blame, and evoke identification with ideological camps. In this paper, we propose a novel fine-grained narrative classification in biased news articles. We also explore article-bias classification as the precursor task to narrative classification and fine-grained persuasive technique identification. We develop INDI-PROP, the first ideologically grounded fine-grained narrative dataset with multi-level annotation for analyzing propaganda in Indian news media. Our dataset INDI-PROP comprises 1,266 articles focusing on two polarizing socio-political events in recent times: CAA and the Farmers' protest. Each article is annotated at three hierarchical levels: (i) ideological article-bias (pro-government, pro-opposition, neutral), (ii) event-specific fine-grained narrative frames anchored in ideological polarity and communicative intent, and (iii) persuasive techniques. We propose FANTA and TPTC, two GPT-4o-mini guided multi-hop prompt-based reasoning frameworks for the bias, narrative, and persuasive technique classification. FANTA leverages multi-layered communicative phenomena by integrating information extraction and contextual framing for hierarchical reasoning. On the other hand, TPTC adopts systematic decomposition of persuasive cues via a two-stage approach. Our evaluation suggests substantial improvement over underlying baselines in each case.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Florent Forest, Amaury Wei, Olga Fink
Machine Learning
ๆฅ็ๆ่ฆ
Time series extrinsic regression (TSER) refers to the task of predicting a continuous target variable from an input time series. It appears in many domains, including healthcare, finance, environmental monitoring, and engineering. In these settings, accurate predictions and trustworthy reasoning are both essential. Although state-of-the-art TSER models achieve strong predictive performance, they typically operate as black boxes, making it difficult to understand which temporal patterns drive their decisions. Post-hoc interpretability techniques, such as feature attribution, aim to to explain how the model arrives at its predictions, but often produce coarse, noisy, or unstable explanations. Recently, inherently interpretable approaches based on concepts, additive decompositions, or symbolic regression, have emerged as promising alternatives. However, these approaches remain limited: they require explicit supervision on the concepts themselves, often cannot capture interactions between time-series features, lack expressiveness for complex temporal patterns, and struggle to scale to high-dimensional multivariate data.
To address these limitations, we propose MAGNETS (Mask-and-AGgregate NEtwork for Time Series), an inherently interpretable neural architecture for TSER. MAGNETS learns a compact set of human-understandable concepts without requiring any annotations. Each concept corresponds to a learned, mask-based aggregation over selected input features, explicitly revealing both which features drive predictions and when they matter in the sequence. Predictions are formed as combinations of these learned concepts through a transparent, additive structure, enabling clear insight into the model's decision process.
๐
2025-12-03
๐ NeurIPS 2025
๐ฅ Zhening Li, Armando Solar-Lezama, Yisong Yue, Stephan Zheng
ๆฅ็ๆ่ฆ
We introduce a new approach to agent programming, the development of LLM-based agents. Current approaches to agent programming often entangle two aspects of agent design: the core workflow logic and the inference-time strategy (e.g., tree search). We introduce "probabilistic angelic nondeterminism" ("PAN"), a programming model that disentangles these two concerns, allowing the programmer to describe the agent workflow and independently experiment with different inference-time strategies by simply changing a few inputs. We provide an implementation of PAN in Python as the EnCompass framework, which uses a Python decorator to compile agent workflow programs into a search space. We present three case studies that demonstrate how the framework lets the programmer quickly improve the reliability of an agent and easily switch between different inference-time strategies, all with little additional coding.
๐
2025-12-03
๐ ArXiv Preprint (cs.NI)
๐ฅ Stefano Scanzio, Gabriele Formis, Tullio Facchinetti, Gianluca Cena
Machine Learning
ๆฅ็ๆ่ฆ
Wireless sensor networks (WSNs) are employed across a wide range of industrial applications where ultra-low power consumption is a critical prerequisite. At the same time, these systems must maintain a certain level of determinism to ensure reliable and predictable operation. In this view, time slotted channel hopping (TSCH) is a communication technology that meets both conditions, making it an attractive option for its usage in industrial WSNs. This work proposes the use of machine learning to learn the traffic pattern generated in networks based on the TSCH protocol, in order to turn nodes into a deep sleep state when no transmission is planned and thus to improve the energy efficiency of the WSN. The ability of machine learning models to make good predictions at different network levels in a typical tree network topology was analyzed in depth, showing how their capabilities degrade while approaching the root of the tree. The application of these models on simulated data based on an accurate modeling of wireless sensor nodes indicates that the investigated algorithms can be suitably used to further and substantially reduce the power consumption of a TSCH network.
๐
2025-12-03
๐ ArXiv Preprint (cs.SD)
๐ฅ Chengyu Tang, Sanjeev Baskiyar
Natural Language Processing
ๆฅ็ๆ่ฆ
In this study, we evaluate the efficacy of the Mamba model in the field of bioacoustics. We first pretrain a Mamba-based audio large language model (LLM) on a large corpus of audio data using self-supervised learning. We fine-tune and evaluate BioMamba on the BEANS benchmark, a collection of diverse bioacoustic tasks including classification and detection, and compare its performance and efficiency with multiple baseline models, including AVES, a state-of-the-art Transformer-based model. The results show that BioMamba achieves comparable performance with AVES while consumption significantly less VRAM, demonstrating its potential in this domain.
๐
2025-12-03
๐ AAAI 2026
๐ฅ Gianni Molinari, Fabio Ciravegna
Robotics
ๆฅ็ๆ่ฆ
Despite recent advances, autonomous agents often struggle to solve complex tasks in enterprise domains that require coordinating multiple tools and processing diverse data sources. This struggle is driven by two main limitations. First, single-agent architectures enforce a monolithic plan-execute loop, which directly causes trajectory instability. Second, the requirement to use local open-weight models for data privacy introduces smaller context windows leading to the rapid consumption of context from large tool outputs. To solve this problem we introduce RP-ReAct (Reasoner Planner-ReAct), a novel multi-agent approach that fundamentally decouples strategic planning from low-level execution to achieve superior reliability and efficiency. RP-ReAct consists of a Reasoner Planner Agent (RPA), responsible for planning each sub-step, continuously analysing the execution results using the strong reasoning capabilities of a Large Reasoning Model, and one or multiple Proxy-Execution Agent (PEA) that translates sub-steps into concrete tool interactions using a ReAct approach. Crucially, we incorporate a context-saving strategy within the PEA to mitigate context window overflow by managing large tool outputs via external storage and on-demand access. We evaluate RP-ReAct, on the challenging, multi-domain ToolQA benchmark using a diverse set of six open-weight reasoning models. Our empirical results show that RP-ReAct achieves superior performance and improved generalization ability over state-of-the-art baselines when addressing diverse complex tasks across the evaluated domains. Furthermore we establish the enhanced robustness and stability of our approach across different model scales, paving the way for effective and deployable agentic solutions for enterprises.
๐
2025-12-03
๐ KDD 2026
๐ฅ Wei Chee Yew, Hailun Xu, Sanjay Saha, Xiaotian Fan, Hiok Hian Ong et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams}. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuki Orimo, Iori Kurata, Hodaka Mori, Ryuhei Okuno, Ryohto Sawada et al.
Robotics
ๆฅ็ๆ่ฆ
We introduce PARC, a coding agent for the autonomous and robust execution of long-horizon computational tasks. PARC is built on a hierarchical multi-agent architecture incorporating task planning, execution, and a mechanism that evaluates its own actions and their outcomes from an independent context and provides feedback, namely self-assessment and self-feedback. This design enables PARC to detect and correct high-level strategic errors and sustain progress without human intervention. We evaluate PARC across computational science and data science tasks. In materials science, it autonomously reproduces key results from studies on lithium-ion conduction and alloy segregation. In particular, it coordinates dozens of parallel simulation tasks, each requiring roughly 43 hours of computation, managing orchestration, monitoring, and error correction end-to-end. In Kaggle-based experiments, starting from minimal natural-language instructions, PARC conducts data analysis and implements search strategies, producing solutions competitive with human-engineered baselines. These results highlight the potential of integrating a hierarchical multi-agent system with self-assessment and self-feedback to enable AI systems capable of independent, large-scale scientific and analytical work.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Zexin Lin, Yebin Zhong, Hanwen Wan, Jiu Cheng, Zhenglong Sun et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Transition control poses a critical challenge in Vertical Take-Off and Landing Unmanned Aerial Vehicle (VTOL UAV) development due to the tilting rotor mechanism, which shifts the center of gravity and thrust direction during transitions. Current control methods' decoupled control of altitude and position leads to significant vibration, and limits interaction consideration and adaptability. In this study, we propose a novel coupled transition control methodology based on reinforcement learning (RL) driven controller. Besides, contrasting to the conventional phase-transition approach, the ST3M method demonstrates a new perspective by treating cruise mode as a special case of hover. We validate the feasibility of applying our method in simulation and real-world environments, demonstrating efficient controller development and migration while accurately controlling UAV position and attitude, exhibiting outstanding trajectory tracking and reduced vibrations during the transition process.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nan Sun, Zhenyu Zhang, Xixun Lin, Kun Wang, Yanmin Shang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, producing content inconsistent with input visuals, that undermine reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention score or output logits, focusing on "how to intervene" but overlooking the prerequisite "when to intervene", which leads to the "over-intervention" problem and subsequently introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when the visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
๐
2025-12-03
๐ Accepted by ACM Multimedia 2025
๐ฅ Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instructions structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media, and procedural content creation.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu et al.
ๆฅ็ๆ่ฆ
Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Guang Yang, Tianpei Yang, Jingwen Qiao, Yanqing Wu, Jing Huo et al.
Machine Learning
ๆฅ็ๆ่ฆ
Communication is one of the effective means to improve the learning of cooperative policy in multi-agent systems. However, in most real-world scenarios, lossy communication is a prevalent issue. Existing multi-agent reinforcement learning with communication, due to their limited scalability and robustness, struggles to apply to complex and dynamic real-world environments. To address these challenges, we propose a generalized communication-constrained model to uniformly characterize communication conditions across different scenarios. Based on this, we utilize it as a learning prior to distinguish between lossy and lossless messages for specific scenarios. Additionally, we decouple the impact of lossy and lossless messages on distributed decision-making, drawing on a dual mutual information estimatior, and introduce a communication-constrained multi-agent reinforcement learning framework, quantifying the impact of communication messages into the global reward. Finally, we validate the effectiveness of our approach across several communication-constrained benchmarks.
๐
2025-12-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Adithya S Kolavi, Vyoman Jain
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuanxuan Yang, Xiuyang Zhang, Haofeng Chen, Gang Ma, Xiaojie Wang
ๆฅ็ๆ่ฆ
Electrical impedance tomography (EIT) provides an attractive solution for large-area tactile sensing due to its minimal wiring and shape flexibility, but its nonlinear inverse problem often leads to severe artifacts and inaccurate contact reconstruction. This work presents PhyDNN, a physics-driven deep reconstruction framework that embeds the EIT forward model directly into the learning objective. By jointly minimizing the discrepancy between predicted and ground-truth conductivity maps and enforcing consistency with the forward PDE, PhyDNN reduces the black-box nature of deep networks and improves both physical plausibility and generalization. To enable efficient backpropagation, we design a differentiable forward-operator network that accurately approximates the nonlinear EIT response, allowing fast physics-guided training. Extensive simulations and real tactile experiments on a 16-electrode soft sensor show that PhyDNN consistently outperforms NOSER, TV, and standard DNNs in reconstructing contact shape, location, and pressure distribution. PhyDNN yields fewer artifacts, sharper boundaries, and higher metric scores, demonstrating its effectiveness for high-quality tomographic tactile sensing.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Renqi Chen, Haoyang Su, Shixiang Tang
Machine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhancing SAM's adaptation performance on diverse domains. Despite advancements, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant since the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA incorporates a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy to help the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Various Experiments demonstrate our NAS-LoRA improves existing PEFT methods, while reducing training cost by 24.14% without increasing inference cost, highlighting the potential of NAS in enhancing PEFT for visual foundation models.
๐
2025-12-03
๐ ArXiv Preprint (q-bio.QM)
๐ฅ Xiangzheng Cheng, Haili Huang, Ye Su, Qing Nie, Xiufen Zou et al.
ๆฅ็ๆ่ฆ
In multicellular organisms, cells coordinate their activities through cell-cell communication (CCC), which are crucial for development, tissue homeostasis, and disease progression. Recent advances in single-cell and spatial omics technologies provide unprecedented opportunities to systematically infer and analyze CCC from these omics data, either by integrating prior knowledge of ligand-receptor interactions (LRIs) or through de novo approaches. A variety of computational methods have been developed, focusing on methodological innovations, accurate modeling of complex signaling mechanisms, and investigation of broader biological questions. These advances have greatly enhanced our ability to analyze CCC and generate biological hypotheses. Here, we introduce the biological mechanisms and modeling strategies of CCC, and provide a focused overview of more than 140 computational methods for inferring CCC from single-cell and spatial transcriptomic data, emphasizing the diversity in methodological frameworks and biological questions. Finally, we discuss the current challenges and future opportunities in this rapidly evolving field.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Juan Diego Toscano, Daniel T. Chen, George Em Karniadakis
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Bridging the gap between theoretical conceptualization and computational implementation is a major bottleneck in Scientific Computing (SciC) and Scientific Machine Learning (SciML). We introduce ATHENA (Agentic Team for Hierarchical Evolutionary Numerical Algorithms), an agentic framework designed as an Autonomous Lab to manage the end-to-end computational research lifecycle. Its core is the HENA loop, a knowledge-driven diagnostic process framed as a Contextual Bandit problem. Acting as an online learner, the system analyzes prior trials to select structural `actions' ($A_n$) from combinatorial spaces guided by expert blueprints (e.g., Universal Approximation, Physics-Informed constraints). These actions are translated into executable code ($S_n$) to generate scientific rewards ($R_n$). ATHENA transcends standard automation: in SciC, it autonomously identifies mathematical symmetries for exact analytical solutions or derives stable numerical solvers where foundation models fail. In SciML, it performs deep diagnosis to tackle ill-posed formulations and combines hybrid symbolic-numeric workflows (e.g., coupling PINNs with FEM) to resolve multiphysics problems. The framework achieves super-human performance, reaching validation errors of $10^{-14}$. Furthermore, collaborative ``human-in-the-loop" intervention allows the system to bridge stability gaps, improving results by an order of magnitude. This paradigm shift focuses from implementation mechanics to methodological innovation, accelerating scientific discovery.
๐
2025-12-03
๐ NeurIPS
๐ฅ Xavier Cadet, Edward Koh, Peter Chin
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model (LLM) agents are increasingly studied in multi-turn, multi-agent scenarios, yet most existing setups emphasize open-ended role-play rather than controlled evaluation. We introduce AsymPuzl, a minimal but expressive two-agent puzzle environment designed to isolate communication under information asymmetry. Each agent observes complementary but incomplete views of a symbolic puzzle and must exchange messages to solve it cooperatively. Using a diverse set of current-generation and open-source LLMs, we show that (i) strong models such as GPT-5 and Claude-4.0 reliably converge across puzzle sizes on the solution by sharing complete information in two turns, (ii) weaker models often ignore partner messages or over-correct their hypotheses, and (iii) feedback design is non-trivial: simple self-feedback improves success rates, while detailed joint feedback can hurt performance. These findings show that even in simple cooperative tasks, LLM communication strategies diverge and depend on the granularity of feedback signals. AsymPuzl thus provides a testbed for probing the limits of multi-turn cooperation and opens avenues for studying coordination mechanisms.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shojiro Yamabe, Futa Waseda, Daiki Shiono, Tsubasa Takahashi
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
๐
2025-12-03
๐ This is a pre-print for submitting to computers &
๐ฅ Johnny Peng, Thanh Tung Khuat, Ellen Otte, Katarzyna Musial, Bogdan Gabrys
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
In cell culture bioprocessing, real-time batch process monitoring (BPM) refers to the continuous tracking and analysis of key process variables such as viable cell density, nutrient levels, metabolite concentrations, and product titer throughout the duration of a batch run. This enables early detection of deviations and supports timely control actions to ensure optimal cell growth and product quality. BPM plays a critical role in ensuring the quality and regulatory compliance of biopharmaceutical manufacturing processes. However, the development of accurate soft sensors for BPM is hindered by key challenges, including limited historical data, infrequent feedback, heterogeneous process conditions, and high-dimensional sensory inputs. This study presents a comprehensive benchmarking analysis of machine learning (ML) methods designed to address these challenges, with a focus on learning from historical data with limited volume and relevance in the context of bioprocess monitoring. We evaluate multiple ML approaches including feature dimensionality reduction, online learning, and just-in-time learning across three datasets, one in silico dataset and two real-world experimental datasets. Our findings highlight the importance of training strategies in handling limited data and feedback, with batch learning proving effective in homogeneous settings, while just-in-time learning and online learning demonstrate superior adaptability in cold-start scenarios. Additionally, we identify key meta-features, such as feed media composition and process control strategies, that significantly impact model transferability. The results also suggest that integrating Raman-based predictions with lagged offline measurements enhances monitoring accuracy, offering a promising direction for future bioprocess soft sensor development.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. Extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiye Song, Steve Dai, Ben Keller, Brucek Khailany
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications.
We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xieji Li, Siyuan Yan, Yingsheng Liu, H. Peter Soyer, Monika Janda et al.
Multimodal
ๆฅ็ๆ่ฆ
Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.
๐
2025-12-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Reuben Tan, Baolin Peng, Zhengyuan Yang, Hao Cheng, Oier Mees et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized using sparse, outcome-based rewards computed based on the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, it is challenging to compute more informative rewards in MMRL beyond those based on outcomes since different samples may require different scoring functions and teacher models may provide noisy reward signals too. In this paper, we introduce the Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent to train multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier across both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks such as spatial reasoning, visual hallucination as well as robotics and embodied AI benchmarks. Critically, we demonstrate that just relying on SFT post-training on highly curated reasoning data is insufficient, as agents invariably collapse to ungrounded solutions during RL without our online verification. We also show that our agentic verifier can help to reduce reward-hacking in MMRL. Finally, we also provide a theoretical justification for the effectiveness of Argos through the concept of pareto-optimality.
๐
2025-12-03
๐ Accepted for publication in the Journal of Intelli
๐ฅ Raul Steinmetz, Fabio Demo Rosa, Victor Augusto Kich, Jair Augusto Bottega, Ricardo Bedin Grando et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous navigation of terrestrial robots using Reinforcement Learning (RL) from LIDAR observations remains challenging due to the high dimensionality of sensor data and the sample inefficiency of model-free approaches. Conventional policy networks struggle to process full-resolution LIDAR inputs, forcing prior works to rely on simplified observations that reduce spatial awareness and navigation robustness. This paper presents a novel model-based RL framework built on top of the DreamerV3 algorithm, integrating a Multi-Layer Perceptron Variational Autoencoder (MLP-VAE) within a world model to encode high-dimensional LIDAR readings into compact latent representations. These latent features, combined with a learned dynamics predictor, enable efficient imagination-based policy optimization. Experiments on simulated TurtleBot3 navigation tasks demonstrate that the proposed architecture achieves faster convergence and higher success rate compared to model-free baselines such as SAC, DDPG, and TD3. It is worth emphasizing that the DreamerV3-based agent attains a 100% success rate across all evaluated environments when using the full dataset of the Turtlebot3 LIDAR (360 readings), while model-free methods plateaued below 85%. These findings demonstrate that integrating predictive world models with learned latent representations enables more efficient and robust navigation from high-dimensional sensory data.
๐
2025-12-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Shu Wang, Yingli Zhou, Yixiang Fang
Natural Language Processing
ๆฅ็ๆ่ฆ
As an effective method to boost the performance of Large Language Models (LLMs) on the question answering (QA) task, Retrieval-Augmented Generation (RAG), which queries highly relevant information from external complex documents, has attracted tremendous attention from both industry and academia. Existing RAG approaches often focus on general documents, and they overlook the fact that many real-world documents (such as books, booklets, handbooks, etc.) have a hierarchical structure, which organizes their content from different granularity levels, leading to poor performance for the QA task. To address these limitations, we introduce BookRAG, a novel RAG approach targeted for documents with a hierarchical structure, which exploits logical hierarchies and traces entity relations to query the highly relevant information. Specifically, we build a novel index structure, called BookIndex, by extracting a hierarchical tree from the document, which serves as the role of its table of contents, using a graph to capture the intricate relationships between entities, and mapping entities to tree nodes. Leveraging the BookIndex, we then propose an agent-based query method inspired by the Information Foraging Theory, which dynamically classifies queries and employs a tailored retrieval workflow. Extensive experiments on three widely adopted benchmarks demonstrate that BookRAG achieves state-of-the-art performance, significantly outperforming baselines in both retrieval recall and QA accuracy while maintaining competitive efficiency.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, Andrew Lee
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In this work we study how explicit world-modeling objectives affect the internal representations and downstream capability of Transformers across different training stages. We use a controlled 2x2x2 Rubik's Cube and ask: (1) how does explicitly pretraining a world model affect the model's latent representations, and (2) how does world-model quality affect the model's performance after reinforcement learning post-training? We compare standard next-token prediction to two explicit world-modeling strategies -- (i) state-prediction pretraining and (ii) a joint state-prediction + next-token objective -- and assess task performance after Group Relative Policy Optimization (GRPO) is applied as post-training. We evaluate the representation quality with linear probes and causal interventions. We find that explicit world-modeling yields more linearly decodable and causally steerable state representations. More importantly, we find that improved state representations lead to higher gains for GRPO, especially on harder cube states. Our results indicate that sharpening state representations can improve the effectiveness of post-training for sequence-planning tasks.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hamed Poursiami, Shay Snyder, Guojing Cong, Thomas Potok, Maryam Parsa
Machine Learning
ๆฅ็ๆ่ฆ
Graph classification is a fundamental task in domains ranging from molecular property prediction to materials design. While graph neural networks (GNNs) achieve strong performance by learning expressive representations via message passing, they incur high computational costs, limiting their scalability and deployment on resource-constrained devices. Hyperdimensional Computing (HDC), also known as Vector Symbolic Architectures (VSA), offers a lightweight, brain-inspired alternative, yet existing HDC-based graph methods typically struggle to match the predictive performance of GNNs. In this work, we propose VS-Graph, a vector-symbolic graph learning framework that narrows the gap between the efficiency of HDC and the expressive power of message passing. VS-Graph introduces a Spike Diffusion mechanism for topology-driven node identification and an Associative Message Passing scheme for multi-hop neighborhood aggregation entirely within the high-dimensional vector space. Without gradient-based optimization or backpropagation, our method achieves competitive accuracy with modern GNNs, outperforming the prior HDC baseline by 4-5% on standard benchmarks such as MUTAG and DD. It also matches or exceeds the performance of the GNN baselines on several datasets while accelerating the training by a factor of up to 450x. Furthermore, VS-Graph maintains high accuracy even with the hypervector dimensionality reduced to D=128, demonstrating robustness under aggressive dimension compression and paving the way for ultra-efficient execution on edge and neuromorphic hardware.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Deploying large language model (LLM) models on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates up to 35%. Our experiments show that quantized and pruned models achieve a memory reduction of 4x-5.7x and a token-throughput improvement of 2.7x-3.4x, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nan Zhou, Huandong Wang, Jiahao Li, Han Li, Yali Song et al.
Machine Learning
ๆฅ็ๆ่ฆ
Fine-grained wildfire spread prediction is crucial for enhancing emergency response efficacy and decision-making precision. However, existing research predominantly focuses on coarse spatiotemporal scales and relies on low-resolution satellite data, capturing only macroscopic fire states while fundamentally constraining high-precision localized fire dynamics modeling capabilities. To bridge this gap, we present FireSentry, a provincial-scale multi-modal wildfire dataset characterized by sub-meter spatial and sub-second temporal resolution. Collected using synchronized UAV platforms, FireSentry provides visible and infrared video streams, in-situ environmental measurements, and manually validated fire masks. Building on FireSentry, we establish a comprehensive benchmark encompassing physics-based, data-driven, and generative models, revealing the limitations of existing mask-only approaches. Our analysis proposes FiReDiff, a novel dual-modality paradigm that first predicts future video sequences in the infrared modality, and then precisely segments fire masks in the mask modality based on the generated dynamics. FiReDiff achieves state-of-the-art performance, with video quality gains of 39.2% in PSNR, 36.1% in SSIM, 50.0% in LPIPS, 29.4% in FVD, and mask accuracy gains of 3.3% in AUPRC, 59.1% in F1 score, 42.9% in IoU, and 62.5% in MSE when applied to generative models. The FireSentry benchmark dataset and FiReDiff paradigm collectively advance fine-grained wildfire forecasting and dynamic disaster simulation. The processed benchmark dataset is publicly available at: https://github.com/Munan222/FireSentry-Benchmark-Dataset.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seunghoi Kim, Henry F. J. Tregidgo, Chen Jin, Matteo Figini, Daniel C. Alexander
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Generative models are prone to hallucinations: plausible but incorrect structures absent in the ground truth. This issue is problematic in image restoration for safety-critical domains such as medical imaging, industrial inspection, and remote sensing, where such errors undermine reliability and trust. For example, in low-field MRI, widely used in resource-limited settings, restoration models are essential for enhancing low-quality scans, yet hallucinations can lead to serious diagnostic errors. Progress has been hindered by a circular dependency: evaluating hallucinations requires labeled data, yet such labels are costly and subjective. We introduce HalluGen, a diffusion-based framework that synthesizes realistic hallucinations with controllable type, location, and severity, producing perceptually realistic but semantically incorrect outputs (segmentation IoU drops from 0.86 to 0.36). Using HalluGen, we construct the first large-scale hallucination dataset comprising 4,350 annotated images derived from 1,450 brain MR images for low-field enhancement, enabling systematic evaluation of hallucination detection and mitigation. We demonstrate its utility in two applications: (1) benchmarking image quality metrics and developing Semantic Hallucination Assessment via Feature Evaluation (SHAFE), a feature-based metric with soft-attention pooling that improves hallucination sensitivity over traditional metrics; and (2) training reference-free hallucination detectors that generalize to real restoration failures. Together, HalluGen and its open dataset establish the first scalable foundation for evaluating hallucinations in safety-critical image restoration.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Darshan Fofadiya
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive Language Models (LLMs) trained on Next-Token Prediction (NTP) often suffer from ``Topic Drift'' where the generation wanders away from the initial prompt due to a reliance on local associations rather than global planning \citep{holtzman2019curious}. While scaling model size mitigates this \citep{brown2020language}, the fundamental myopia of the NTP objective remains. In this work, we introduce the Idea-Gated Transformer, a novel architecture that separates semantic planning from syntactic generation. We introduce an auxiliary ``Idea Head'' trained to predict the bag-of-words distribution for a future context window, creating a latent ``Concept Vector'' that actively gates the main vocabulary during generation. We propose a differentiable gating mechanism that suppresses semantically irrelevant tokens, effectively pruning the search space in real-time. Experiments on WikiText-103 demonstrate that while the Idea-Gated model achieves comparable validation perplexity to a standard GPT-2 baseline, it exhibits significantly superior Domain Retention. Qualitative and quantitative analysis reveals that the gating mechanism successfully locks generation into specific semantic clusters (e.g., Finance, Science) and resists associative drift, offering a parameter-efficient path toward more controllable language modeling.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yeganeh Ghamary, Victoria Wu, Hooman Vaseli, Christina Luong, Teresa Tsang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Ejection fraction (EF) is a crucial metric for assessing cardiac function and diagnosing conditions such as heart failure. Traditionally, EF estimation requires manual tracing and domain expertise, making the process time-consuming and subject to interobserver variability. Most current deep learning methods for EF prediction are black-box models with limited transparency, which reduces clinical trust. Some post-hoc explainability methods have been proposed to interpret the decision-making process after the prediction is made. However, these explanations do not guide the model's internal reasoning and therefore offer limited reliability in clinical applications. To address this, we introduce ProtoEFNet, a novel video-based prototype learning model for continuous EF regression. The model learns dynamic spatiotemporal prototypes that capture clinically meaningful cardiac motion patterns. Additionally, the proposed Prototype Angular Separation (PAS) loss enforces discriminative representations across the continuous EF spectrum. Our experiments on the EchonetDynamic dataset show that ProtoEFNet can achieve accuracy on par with its non-interpretable counterpart while providing clinically relevant insight. The ablation study shows that the proposed loss boosts performance with a 2% increase in F1 score from 77.67$\pm$2.68 to 79.64$\pm$2.10. Our source code is available at: https://github.com/DeepRCL/ProtoEF
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alan T. L. Bacellar, Mustafa Munir, Felipe M. G. Franรงa, Priscila M. V. Lima, Radu Marculescu et al.
ๆฅ็ๆ่ฆ
Federated Learning (FL) is plagued by two key challenges: high communication overhead and performance collapse on heterogeneous (non-IID) data. Analytic FL (AFL) provides a single-round, data distribution invariant solution, but is limited to linear models. Subsequent non-linear approaches, like DeepAFL, regain accuracy but sacrifice the single-round benefit. In this work, we break this trade-off. We propose SAFLe, a framework that achieves scalable non-linear expressivity by introducing a structured head of bucketed features and sparse, grouped embeddings. We prove this non-linear architecture is mathematically equivalent to a high-dimensional linear regression. This key equivalence allows SAFLe to be solved with AFL's single-shot, invariant aggregation law. Empirically, SAFLe establishes a new state-of-the-art for analytic FL, significantly outperforming both linear AFL and multi-round DeepAFL in accuracy across all benchmarks, demonstrating a highly efficient and scalable solution for federated vision.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying
ๆฅ็ๆ่ฆ
Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
๐
2025-12-03
๐ NeurIPS
๐ฅ Chandler Smith, Marwa Abdulhai, Manfred Diaz, Marko Tesic, Rakshit S. Trivedi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.
๐
2025-12-03
๐ Accepted to 2026 IEEE/CVF Winter Conference on App
๐ฅ Thomas Monninger, Zihan Zhang, Steffen Staab, Sihao Ding
Robotics
ๆฅ็ๆ่ฆ
Accurate environmental representations are essential for autonomous driving, providing the foundation for safe and efficient navigation. Traditionally, high-definition (HD) maps are providing this representation of the static road infrastructure to the autonomous system a priori. However, because the real world is constantly changing, such maps must be constructed online from on-board sensor data. Navigation-grade standard-definition (SD) maps are widely available, but their resolution is insufficient for direct deployment. Instead, they can be used as coarse prior to guide the online map construction process. We propose NavMapFusion, a diffusion-based framework that performs iterative denoising conditioned on high-fidelity sensor data and on low-fidelity navigation maps. This paper strives to answer: (1) How can coarse, potentially outdated navigation maps guide online map construction? (2) What advantages do diffusion models offer for map fusion? We demonstrate that diffusion-based map construction provides a robust framework for map fusion. Our key insight is that discrepancies between the prior map and online perception naturally correspond to noise within the diffusion process; consistent regions reinforce the map construction, whereas outdated segments are suppressed. On the nuScenes benchmark, NavMapFusion conditioned on coarse road lines from OpenStreetMap data reaches a 21.4% relative improvement on 100 m, and even stronger improvements on larger perception ranges, while maintaining real-time capabilities. By fusing low-fidelity priors with high-fidelity sensor data, the proposed method generates accurate and up-to-date environment representations, guiding towards safer and more reliable autonomous driving. The code is available at https://github.com/tmonnin/navmapfusion
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tengda Han, Sayna Ebrahimi, Dilara Gokay, Li Yang Ku, Maks Ovsjanikov et al.
ๆฅ็ๆ่ฆ
We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, as well as introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world, both leads to consistency in models trained on individual lives, and provides a powerful signal for visual representation learning.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qinyu Zhao, Guangting Zheng, Tao Yang, Rui Zhu, Xingjian Leng et al.
ๆฅ็ๆ่ฆ
Normalizing Flows (NFs) learn invertible mappings between the data and a Gaussian distribution. Prior works usually suffer from two limitations. First, they add random noise to training samples or VAE latents as data augmentation, introducing complex pipelines including extra noising and denoising steps. Second, they use a pretrained and frozen VAE encoder, resulting in suboptimal reconstruction and generation quality. In this paper, we find that the two issues can be solved in a very simple way: just fixing the variance (which would otherwise be predicted by the VAE encoder) to a constant (e.g., 0.5). On the one hand, this method allows the encoder to output a broader distribution of tokens and the decoder to learn to reconstruct clean images from the augmented token distribution, avoiding additional noise or denoising design. On the other hand, fixed variance simplifies the VAE evidence lower bound, making it stable to train an NF with a VAE jointly. On the ImageNet $256 \times 256$ generation task, our model SimFlow obtains a gFID score of 2.15, outperforming the state-of-the-art method STARFlow (gFID 2.40). Moreover, SimFlow can be seamlessly integrated with the end-to-end representation alignment (REPA-E) method and achieves an improved gFID of 1.91, setting a new state of the art among NFs.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiazhe Wei, Ken Li, Tianyu Lao, Haofan Wang, Liang Wang et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.
๐
2025-12-03
๐ ArXiv Preprint (cs.GR)
๐ฅ Alexander Mai, Trevor Hedstrom, George Kopanas, Janne Kontkanen, Falko Kuester et al.
ๆฅ็ๆ่ฆ
We introduce radiance meshes, a technique for representing radiance fields with constant density tetrahedral cells produced with a Delaunay tetrahedralization. Unlike a Voronoi diagram, a Delaunay tetrahedralization yields simple triangles that are natively supported by existing hardware. As such, our model is able to perform exact and fast volume rendering using both rasterization and ray-tracing. We introduce a new rasterization method that achieves faster rendering speeds than all prior radiance field representations (assuming an equivalent number of primitives and resolution) across a variety of platforms. Optimizing the positions of Delaunay vertices introduces topological discontinuities (edge flips). To solve this, we use a Zip-NeRF-style backbone which allows us to express a smoothly varying field even when the topology changes. Our rendering method exactly evaluates the volume rendering equation and enables high quality, real-time view synthesis on standard consumer hardware. Our tetrahedral meshes also lend themselves to a variety of exciting applications including fisheye lens distortion, physics-based simulation, editing, and mesh extraction.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali et al.
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sen Fang, Yalin Feng, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas
Machine Learning
ๆฅ็ๆ่ฆ
Sign Language Production (SLP) is the process of converting the complex input text into a real video. Most previous works focused on the Text2Gloss, Gloss2Pose, Pose2Vid stages, and some concentrated on Prompt2Gloss and Text2Avatar stages. However, this field has made slow progress due to the inaccuracy of text conversion, pose generation, and the rendering of poses into real human videos in these stages, resulting in gradually accumulating errors. Therefore, in this paper, we streamline the traditional redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines the SLP task as a hierarchical generation end-to-end task that only includes text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid, and executes text understanding through our proposed new Sign Language Understanding Linker called SLUL, and generates hand gestures through the named SLP-MoE hand gesture rendering expert block to end-to-end generate high-quality and multi-style sign language videos. SLUL is trained using the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Its performance has improved by 48.6% compared to the current SOTA generation methods.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou et al.
Robotics
ๆฅ็ๆ่ฆ
A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging-for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sandeep Nagar
RoboticsMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduce significant improvements to normalizing flow architectures through six key innovations: 1) Development of invertible 3x3 Convolution layers with mathematically proven necessary and sufficient conditions for invertibility, (2) introduction of a more efficient Quad-coupling layer, 3) Design of a fast and efficient parallel inversion algorithm for kxk convolutional layers, 4) Fast & efficient backpropagation algorithm for inverse of convolution, 5) Using inverse of convolution, in Inverse-Flow, for the forward pass and training it using proposed backpropagation algorithm, and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and Normalizing Flow layers to reduce parameter count while maintaining performance.
The second part: 1) An automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity and annotation challenges, achieving good accuracy in seed purity testing; 2) An unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) We proposed a privacy preserving method for autonomous driving datasets using on face detection and image inpainting; 4) Utilizing Stable Diffusion based image inpainting for replacing the detected face and license plate to advancing privacy-preserving techniques and ethical considerations in the field.; and 5) An adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Andreas Koukounas, Georgios Mastrapas, Florian Hรถnicke, Sedigheh Eslami, Guillaume Roncari et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
Computer Vision
ๆฅ็ๆ่ฆ
Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To mitigate this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or achieving comparable performance over existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Honggyu An, Jaewoo Jung, Mungyeom Kim, Sunghwan Hong, Chaehyun Kim et al.
Computer Vision
ๆฅ็ๆ่ฆ
Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains as a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ho Man Kwan, Tianhao Peng, Ge Gao, Fan Zhang, Mike Nilsson et al.
ๆฅ็ๆ่ฆ
Recent works have demonstrated the viability of utilizing over-fitted implicit neural representations (INRs) as alternatives to autoencoder-based models for neural video compression. Among these INR-based video codecs, Neural Video Representation Compression (NVRC) was the first to adopt a fully end-to-end compression framework that compresses INRs, achieving state-of-the-art performance. Moreover, some recently proposed lightweight INRs have shown comparable performance to their baseline codecs with computational complexity lower than 10kMACs/pixel. In this work, we extend NVRC toward lightweight representations, and propose NVRC-Lite, which incorporates two key changes. Firstly, we integrated multi-scale feature grids into our lightweight neural representation, and the use of higher resolution grids significantly improves the performance of INRs at low complexity. Secondly, we address the issue that existing INRs typically leverage autoregressive models for entropy coding: these are effective but impractical due to their slow coding speed. In this work, we propose an octree-based context model for entropy coding high-dimensional feature grids, which accelerates the entropy coding module of the model. Our experimental results demonstrate that NVRC-Lite outperforms C3, one of the best lightweight INR-based video codecs, with up to 21.03% and 23.06% BD-rate savings when measured in PSNR and MS-SSIM, respectively, while achieving 8.4x encoding and 2.5x decoding speedup. The implementation of NVRC-Lite will be made available.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Farhana Hossain Swarnali, Miaomiao Zhang, Tonmoy Hossain
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Modeling group actions on latent representations enables controllable transformations of high-dimensional image data. Prior works applying group-theoretic priors or modeling transformations typically operate in the high-dimensional data space, where group actions apply uniformly across the entire input, making it difficult to disentangle the subspace that varies under transformations. While latent-space methods offer greater flexibility, they still require manual partitioning of latent variables into equivariant and invariant subspaces, limiting the ability to robustly learn and operate group actions within the representation space. To address this, we introduce a novel end-to-end framework that for the first time learns group actions on latent image manifolds, automatically discovering transformation-relevant structures without manual intervention. Our method uses learnable binary masks with straight-through estimation to dynamically partition latent representations into transformation-sensitive and invariant components. We formulate this within a unified optimization framework that jointly learns latent disentanglement and group transformation mappings. The framework can be seamlessly integrated with any standard encoder-decoder architecture. We validate our approach on five 2D/3D image datasets, demonstrating its ability to automatically learn disentangled latent factors for group actions in diverse data, while downstream classification tasks confirm the effectiveness of the learned representations. Our code is publicly available at https://github.com/farhanaswarnali/Learning-Group-Actions-In-Disentangled-Latent-Image-Representations .
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An et al.
RoboticsComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Reliable 3D reconstruction from in-the-wild image collections is often hindered by "noisy" images-irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that the existing feed-forward reconstruction model, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effective noise-filtering capability, which we simply leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marcelo Isaias de Moraes Junior, Moacir Antonelli Ponti
ๆฅ็ๆ่ฆ
Sketches are simple human hand-drawn abstractions of complex scenes and real-world objects. Although the field of sketch representation learning has advanced significantly, there is still a gap in understanding the true relevance of the temporal aspect to the quality of these representations. This work investigates whether it is indeed justifiable to treat sketches as sequences, as well as which internal orders play a more relevant role. The results indicate that, although the use of traditional positional encodings is valid for modeling sketches as sequences, absolute coordinates consistently outperform relative ones. Furthermore, non-autoregressive decoders outperform their autoregressive counterparts. Finally, the importance of temporality was shown to depend on both the order considered and the task evaluated.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jialuo Li, Bin Li, Jiahao Li, Yan Lu
Multimodal
ๆฅ็ๆ่ฆ
The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global query and localized query. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically,DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hang Xu, Linjiang Huang, Feng Zhao
Robotics
ๆฅ็ๆ่ฆ
Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image(T2I) diffusion models, most related works focus on search strategies and reward models, yet the impact of the stochastic characteristic of noise in T2I diffusion models on the method's performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new format of randomness for TTS: text embedding perturbation, which couples with existing randomness like SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these formats of randomness and their impact on generation, and find that these two randomness exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embedding demonstrates varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) Introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation. (2) Adapting the perturbation intensity selectively based on their frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at \href{https://github.com/xuhang07/TEP-Diffusion}{https://github.com/xuhang07/TEP-Diffusion}.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Levi Burner, Guido de Croon, Yiannis Aloimonos
ๆฅ็ๆ่ฆ
Animals with foveated vision, including humans, experience microsaccades, small, rapid eye movements that they are not aware of. Inspired by this phenomenon, we develop a method for "Artificial Microsaccade Compensation". It can stabilize video captured by a tailless ornithopter that has resisted attempts to use camera-based sensing because it shakes at 12-20 Hz. Our approach minimizes changes in image intensity by optimizing over 3D rotation represented in SO(3). This results in a stabilized video, computed in real time, suitable for human viewing, and free from distortion. When adapted to hold a fixed viewing orientation, up to occasional saccades, it can dramatically reduce inter-frame motion while also benefiting from an efficient recursive update. When compared to Adobe Premier Pro's warp stabilizer, which is widely regarded as the best commercial video stabilization software available, our method achieves higher quality results while also running in real time.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zexin Lin, Hawen Wan, Yebin Zhong, Xiaoqiang
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sheng-Hao Liao, Shang-Fu Chen, Tai-Ming Huang, Wen-Huang Cheng, Kai-Lung Hua
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Drag-based image editing using generative models provides intuitive control over image structures. However, existing methods rely heavily on manually provided masks and textual prompts to preserve semantic fidelity and motion precision. Removing these constraints creates a fundamental trade-off: visual artifacts without masks and poor spatial control without prompts. To address these limitations, we propose DirectDrag, a novel mask- and prompt-free editing framework. DirectDrag enables precise and efficient manipulation with minimal user input while maintaining high image fidelity and accurate point alignment. DirectDrag introduces two key innovations. First, we design an Auto Soft Mask Generation module that intelligently infers editable regions from point displacement, automatically localizing deformation along movement paths while preserving contextual integrity through the generative model's inherent capacity. Second, we develop a Readout-Guided Feature Alignment mechanism that leverages intermediate diffusion activations to maintain structural consistency during point-based edits, substantially improving visual fidelity. Despite operating without manual mask or prompt, DirectDrag achieves superior image quality compared to existing methods while maintaining competitive drag accuracy. Extensive experiments on DragBench and real-world scenarios demonstrate the effectiveness and practicality of DirectDrag for high-quality, interactive image manipulation. Project Page: https://frakw.github.io/DirectDrag/. Code is available at: https://github.com/frakw/DirectDrag.
๐
2025-12-03
๐ NeurIPS 2025
๐ฅ Jin-Ting He, Fu-Jen Tsai, Yan-Tsung Peng, Min-Hung Chen, Chia-Wen Lin et al.
ๆฅ็ๆ่ฆ
Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address it, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The source code is available at https://github.com/Jin-Ting-He/BlurDM.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lianyu Pang, Ji Zhou, Qiping Wang, Baoquan Zhao, Zhenguo Yang et al.
Robotics
ๆฅ็ๆ่ฆ
Tuning-free face personalization methods have developed along two distinct paradigms: text embedding approaches that map facial features into the text embedding space, and adapter-based methods that inject features through auxiliary cross-attention layers. While both paradigms have shown promise, existing methods struggle to simultaneously achieve high identity fidelity and flexible text controllability. We introduce UniID, a unified tuning-free framework that synergistically integrates both paradigms. Our key insight is that when merging these approaches, they should mutually reinforce only identity-relevant information while preserving the original diffusion prior for non-identity attributes. We realize this through a principled training-inference strategy: during training, we employ an identity-focused learning scheme that guides both branches to capture identity features exclusively; at inference, we introduce a normalized rescaling mechanism that recovers the text controllability of the base diffusion model while enabling complementary identity signals to enhance each other. This principled design enables UniID to achieve high-fidelity face personalization with flexible text controllability. Extensive experiments against six state-of-the-art methods demonstrate that UniID achieves superior performance in both identity preservation and text controllability. Code will be available at https://github.com/lyuPang/UniID
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Wu, Li Yang, Gen Zhan, Yiting Liao, Junlin Li et al.
MultimodalMachine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.
๐
2025-12-03
๐ ArXiv Preprint (eess.IV)
๐ฅ Evan Bell, Shijun Liang, Ismail Alkhouri, Saiprasad Ravishankar
Computer Vision
ๆฅ็ๆ่ฆ
Deep Image Prior (DIP) has recently emerged as a promising one-shot neural-network based image reconstruction method. However, DIP has seen limited application to 3D image reconstruction problems. In this work, we introduce Tada-DIP, a highly effective and fully 3D DIP method for solving 3D inverse problems. By combining input-adaptation and denoising regularization, Tada-DIP produces high-quality 3D reconstructions while avoiding the overfitting phenomenon that is common in DIP. Experiments on sparse-view X-ray computed tomography reconstruction validate the effectiveness of the proposed method, demonstrating that Tada-DIP produces much better reconstructions than training-data-free baselines and achieves reconstruction performance on par with a supervised network trained using a large dataset with fully-sampled volumes.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guole Shen, Tianchen Deng, Xingrui Qin, Nailin Wang, Jianyu Wang et al.
Machine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Recent stateful recurrent neural networks have achieved remarkable progress on static 3D reconstruction but remain vulnerable to motion-induced artifacts, where non-rigid regions corrupt attention propagation between the spatial memory and image feature. By analyzing the internal behaviors of the state and image token updating mechanism, we find that aggregating self-attention maps across layers reveals a consistent pattern: dynamic regions are naturally down-weighted, exposing an implicit motion cue that the pretrained transformer already encodes but never explicitly uses. Motivated by this observation, we introduce MUT3R, a training-free framework that applies the attention-derived motion cue to suppress dynamic content in the early layers of the transformer during inference. Our attention-level gating module suppresses the influence of dynamic regions before their artifacts propagate through the feature hierarchy. Notably, we do not retrain or fine-tune the model; we let the pretrained transformer diagnose its own motion cues and correct itself. This early regulation stabilizes geometric reasoning in streaming scenarios and leads to improvements in temporal consistency and camera pose robustness across multiple dynamic benchmarks, offering a simple and training-free pathway toward motion-aware streaming reconstruction.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Donghun Ryou, Inju Ha, Sanghyeok Chu, Bohyung Han
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning-based image restoration has achieved significant success. However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground truth images to provide higher-quality supervision for real-world restoration. Our framework generates perceptually enhanced ground truth images using super-resolution by incorporating adaptive frequency masks, which are learned by a conditional frequency mask generator. These masks guide the optimal fusion of frequency components from the original ground truth and its super-resolved variants, yielding enhanced ground truth images. This frequency-domain mixup preserves the semantic consistency of the original content while selectively enriching perceptual details, preventing hallucinated artifacts that could compromise fidelity. The enhanced ground truth images are used to train a lightweight output refinement network that can be seamlessly integrated with existing restoration models. Extensive experiments demonstrate that our approach consistently improves the quality of restored images. We further validate the effectiveness of both supervision enhancement and output refinement through user studies. Code is available at https://github.com/dhryougit/Beyond-the-Ground-Truth.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Youxin Pang, Yong Zhang, Ruizhi Shao, Xiang Deng, Feng Gao et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
We propose UniMo, an innovative autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework, enabling simultaneous generation and understanding of these two modalities for the first time. Current methods predominantly focus on generating one modality given another as the condition or integrating either of them with other modalities such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored, presenting significant challenges due to their substantial structural and distributional differences. Inspired by the LLM's ability to unify different modalities, our method models videos and 3D motions as a unified tokens sequence, utilizing separate embedding layers to mitigate distribution gaps. Additionally, we devise a sequence modeling strategy that integrates two distinct tasks within a single framework, proving the effectiveness of unified modeling. Moreover, to efficiently align with visual tokens and preserve 3D spatial information, we design a novel 3D motion tokenizer with a temporal expansion strategy, using a single VQ-VAE to produce quantized motion tokens. It features multiple expert decoders that handle body shapes, translation, global orientation, and body poses for reliable 3D motion reconstruction. Extensive experiments demonstrate that our method simultaneously generates corresponding videos and motions while performing accurate motion capture. This work taps into the capacity of LLMs to fuse diverse data types, paving the way for integrating human-centric information into existing models and potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuai Yang, Junxin Lin, Yifan Zhou, Ziwei Liu, Chen Change Loy
ๆฅ็ๆ่ฆ
The remarkable success in text-to-image diffusion models has motivated extensive investigation of their potential for video applications. Zero-shot techniques aim to adapt image diffusion models for videos without requiring further model training. Recent methods largely emphasize integrating inter-frame correspondence into attention mechanisms. However, the soft constraint applied to identify the valid features to attend is insufficient, which could lead to temporal inconsistency. In this paper, we present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. This enhancement ensures a consistent transformation of semantically similar content between frames. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video, significantly enhancing the visual coherence of manipulated videos. We verify FRESCO adaptations on two zero-shot tasks of video-to-video translation and text-guided video editing. Comprehensive experiments demonstrate the effectiveness of our framework in generating high-quality, coherent videos, highlighting a significant advance over current zero-shot methods.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jorge Tapias Gomez, Despoina Kanata, Aneesh Rangnekar, Christina Lee, Julio Garcia-Aguilar et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objectively accurate methods to early detect local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain agnostic features and enhance robustness to imaging variations. Dual cross attention is implemented to emphasize features from the two scans without requiring any spatial alignment of images to predict response. SSDCA as well as Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76\% $\pm$ 0.04), sensitivity (90.07\% $\pm$ 0.08), and specificity (72.86\% $\pm$ 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 $\pm$ 0.18) and minimal intra-cluster dispersion (1.07 $\pm$ 0.19) with SSDCA, confirming discriminative representation learning.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniele Falcetta, Liane S. Canas, Lorenzo Suppa, Matteo Pentassuglia, Jon Cleary et al.
ๆฅ็ๆ่ฆ
We present CaravelMetrics, a computational framework for automated cerebrovascular analysis that models vessel morphology through skeletonization-derived graph representations. The framework integrates atlas-based regional parcellation, centerline extraction, and graph construction to compute fifteen morphometric, topological, fractal, and geometric features. The features can be estimated globally from the complete vascular network or regionally within arterial territories, enabling multiscale characterization of cerebrovascular organization. Applied to 570 3D TOF-MRA scans from the IXI dataset (ages 20-86), CaravelMetrics yields reproducible vessel graphs capturing age- and sex-related variations and education-associated increases in vascular complexity, consistent with findings reported in the literature. The framework provides a scalable and fully automated approach for quantitative cerebrovascular feature extraction, supporting normative modeling and population-level studies of vascular health and aging.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Oli Bridge, Huey Sun, Botond Branyicskai-Nagy, Charles D'Ornano, Shomit Basu
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
While transformer-based architectures have taken computer vision and NLP by storm, they often require a vast amount of parameters and training data to attain strong performance. In this work, we experiment with three distinct pre-training, intermediate fine-tuning, and downstream datasets and training objectives to explore their marginal benefits on a small 5M-parameter vision transformer. We find that while pre-training and fine-tuning always help our model but have diminishing returns, intermediate fine-tuning can actually show harmful impact on downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Peshawa J. Muhammad Ali, Navin Vincent, Saman S. Abdulla, Han N. Mohammed Fadhl, Anders Blilie et al.
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) is increasingly used in digital pathology. Publicly available histopathology datasets remain scarce, and those that do exist predominantly represent Western populations. Consequently, the generalizability of AI models to populations from less digitized regions, such as the Middle East, is largely unknown. This motivates the public release of our dataset to support the development and validation of pathology AI models across globally diverse populations. We present 339 whole-slide images of prostate core needle biopsies from a consecutive series of 185 patients collected in Erbil, Iraq. The slides are associated with Gleason scores and International Society of Urological Pathology grades assigned independently by three pathologists. Scanning was performed using two high-throughput scanners (Leica and Hamamatsu) and one compact scanner (Grundium). All slides were de-identified and are provided in their native formats without further conversion. The dataset enables grading concordance analyses, color normalization, and cross-scanner robustness evaluations. Data will be deposited in the Bioimage Archive (BIA) under accession code: to be announced (TBA), and released under a CC BY 4.0 license.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Liwen Pan, Longguang Wang, Guangwei Gao, Jun Wang, Jun Shi et al.
ๆฅ็ๆ่ฆ
Traffic image restoration under adverse weather conditions remains a critical challenge for intelligent transportation systems. Existing methods primarily focus on spatial-domain modeling but neglect frequency-domain priors. Although the emerging Mamba architecture excels at long-range dependency modeling through patch-wise correlation analysis, its potential for frequency-domain feature extraction remains unexplored. To address this, we propose Frequency-Aware Mamba (FAMamba), a novel framework that integrates frequency guidance with sequence modeling for efficient image restoration. Our architecture consists of two key components: (1) a Dual-Branch Feature Extraction Block (DFEB) that enhances local-global interaction via bidirectional 2D frequency-adaptive scanning, dynamically adjusting traversal paths based on sub-band texture distributions; and (2) a Prior-Guided Block (PGB) that refines texture details through wavelet-based high-frequency residual learning, enabling high-quality image reconstruction with precise details. Meanwhile, we design a novel Adaptive Frequency Scanning Mechanism (AFSM) for the Mamba architecture, which enables the Mamba to achieve frequency-domain scanning across distinct subgraphs, thereby fully leveraging the texture distribution characteristics inherent in subgraph structures. Extensive experiments demonstrate the efficiency and effectiveness of FAMamba.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hania Ghouse, Maryam Alsharqi, Farhad R. Nezami, Muzammil Behzad
Multimodal
ๆฅ็ๆ่ฆ
Cardiac image analysis remains fragmented across tasks: anatomical segmentation, disease classification, and grounded clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives within a single architecture while retaining generalization across imaging modalities and datasets. We introduce PULSE, a multi-task vision-language framework built on self-supervised representations and optimized through a composite supervision strategy that balances region overlap learning, pixel wise classification fidelity, and boundary aware IoU refinement. A multi-scale token reconstruction decoder enables anatomical segmentation, while shared global representations support disease classification and clinically grounded text output allowing the model to transition from pixels to structures and finally clinical reasoning within one architecture. Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and can be adapted to new imaging modalities with minimal supervision. This moves the field closer to a scalable, foundation style cardiac analysis framework.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Letian Zhou, Songhua Liu, Xinchao Wang
Machine Learning
ๆฅ็ๆ่ฆ
Prevailing Dataset Distillation (DD) methods leveraging generative models confront two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models without relying on such target-specific training, they suffer from a significant distributional mismatch, as the web-scale priors encapsulated in these foundation models fail to faithfully capture the target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. We then steer the generative process to align the generated samples with this core distribution. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets. Extensive experiments suggest that, without relying on a generative model specifically trained on the target dataset, CoDA achieves performance on par with or even superior to previous methods with such reliance across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4% at the 50-images-per-class (IPC) setup on ImageNet-1K. Our code is available on the project webpage: https://github.com/zzzlt422/CoDA
๐
2025-12-03
๐ Final Version of IEEE Transactions on Pattern Anal
๐ฅ Mengyuan Liu, Jinfu Liu, Yongkang Jiang, Bin He
Multimodal
ๆฅ็ๆ่ฆ
Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust and concise pooled features of the human body in videos through a feedback pooling module. The extracted pooled features demonstrate obvious performance advantages over the previously obtained pose data and heatmap features from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome and UAV-Human consistently verify the effectiveness of our HP-Net, which outperforms the existing human action recognition methods. Our code is publicly available at: https://github.com/liujf69/HPNet-Action.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ture Hassler, Ida ร
kerholm, Marcus Nordstrรถm, Gabriele Balletti, Orcun Goksel
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Unet and its variations have been standard in semantic image segmentation, especially for computer assisted radiology. Current Unet architectures iteratively downsample spatial resolution while increasing channel dimensions to preserve information content. Such a structure demands a large memory footprint, limiting training batch sizes and increasing inference latency. Channel pruning compresses Unet architecture without accuracy loss, but requires lengthy optimization and may not generalize across tasks and datasets. By investigating Unet pruning, we hypothesize that the final structure is the crucial factor, not the channel selection strategy of pruning. Based on our observations, we propose a lean Unet architecture (LUnet) with a compact, flat hierarchy where channels are not doubled as resolution is halved. We evaluate on a public MRI dataset allowing comparable reporting, as well as on two internal CT datasets. We show that a state-of-the-art pruning solution (STAMP) mainly prunes from the layers with the highest number of channels. Comparatively, simply eliminating a random channel at the pruning-identified layer or at the largest layer achieves similar or better performance. Our proposed LUnet with fixed architectures and over 30 times fewer parameters achieves performance comparable to both conventional Unet counterparts and data-adaptively pruned networks. The proposed lean Unet with constant channel count across layers requires far fewer parameters while achieving performance superior to standard Unet for the same total number of parameters. Skip connections allow Unet bottleneck channels to be largely reduced, unlike standard encoder-decoder architectures requiring increased bottleneck channels for information propagation.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alexey Protopopov
ๆฅ็ๆ่ฆ
Proliferation of cheap and accessible cameras makes it possible to measure a subject's breath rate from video footage alone. Recent works on this topic have proposed a variety of approaches for accurately measuring human breath rate, however they are either tested in near-ideal conditions, or produce results that are not sufficiently accurate. The present study proposes a more robust method to measure breath rate in humans with minimal hardware requirements using a combination of mathematical transforms with a relative deviation from the ground truth of less than 5%. The method was tested on videos taken from 14 volunteers with a total duration of over 2 hours 30 minutes. The obtained results were compared to reference data and the average mean absolute error was found to be at 0.57 respirations per minute, which is noticeably better than the results from previous works. The breath rate measurement method proposed in the present article is more resistant to distortions caused by subject movement and thus allows one to remotely measure the subject's breath rate without any significant limitations on the subject's behavior.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ahmed Nasser, Marwan Mohamed, Alaa Sherif, Basmala Mahmoud, Shereen Yehia et al.
Machine Learning
ๆฅ็ๆ่ฆ
Egyptian hieroglyphs, the ancient Egyptian writing system, are composed entirely of drawings. Translating these glyphs into English poses various challenges, including the fact that a single glyph can have multiple meanings. Deep learning translation applications are evolving rapidly, producing remarkable results that significantly impact our lives. In this research, we propose a method for the automatic recognition and translation of ancient Egyptian hieroglyphs from images to English. This study utilized two datasets for classification and translation: the Morris Franken dataset and the EgyptianTranslation dataset. Our approach is divided into three stages: segmentation (using Contour and Detectron2), mapping symbols to Gardiner codes, and translation (using the CNN model). The model achieved a BLEU score of 42.2, a significant result compared to previous research.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hong-Kai Zheng, Piji Li
Computer Vision
ๆฅ็ๆ่ฆ
Visual Autoregressive (VAR) modeling approach for image generation proposes autoregressive processing across hierarchical scales, decoding multiple tokens per scale in parallel. This method achieves high-quality generation while accelerating synthesis. However, parallel token sampling within a scale may lead to structural errors, resulting in suboptimal generated images. To mitigate this, we propose Latent Scale Rejection Sampling (LSRS), a method that progressively refines token maps in the latent scale during inference to enhance VAR models. Our method uses a lightweight scoring model to evaluate multiple candidate token maps sampled at each scale, selecting the high-quality map to guide subsequent scale generation. By prioritizing early scales critical for structural coherence, LSRS effectively mitigates autoregressive error accumulation while maintaining computational efficiency. Experiments demonstrate that LSRS significantly improves VAR's generation quality with minimal additional computational overhead. For the VAR-d30 model, LSRS increases the inference time by merely 1% while reducing its FID score from 1.95 to 1.78. When the inference time is increased by 15%, the FID score can be further reduced to 1.66. LSRS offers an efficient test-time scaling solution for enhancing VAR-based generation.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
MultimodalRoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yufeng Li, Wenchao Zhao, Bo Dang, Weimin Wang
Machine Learning
ๆฅ็ๆ่ฆ
Previously, image interpretation in radiology relied heavily on manual methods. However, manual classification of brain tumor medical images is time-consuming and labor-intensive. Even with shallow convolutional neural network models, the accuracy is not ideal. To improve the efficiency and accuracy of brain tumor image classification, this paper proposes a brain tumor classification model based on an improved ResNet34 network. This model uses the ResNet34 residual network as the backbone network and incorporates multi-scale feature extraction. It uses a multi-scale input module as the first layer of the ResNet34 network and an Inception v2 module as the residual downsampling layer. Furthermore, a channel attention mechanism module assigns different weights to different channels of the image from a channel domain perspective, obtaining more important feature information. The results after a five-fold crossover experiment show that the average classification accuracy of the improved network model is approximately 98.8%, which is not only 1% higher than ResNet34, but also only 80% of the number of parameters of the original model. Therefore, the improved network model not only improves accuracy but also reduces clutter, achieving a classification effect with fewer parameters and higher accuracy.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Korada Sri Vardhana, Shrikrishna Lolla, Soma Biswas
ๆฅ็ๆ่ฆ
Text-to-image (T2I) diffusion models have achieved widespread success due to their ability to generate high-resolution, photorealistic images. These models are trained on large-scale datasets, like LAION-5B, often scraped from the internet. However, since this data contains numerous biases, the models inherently learn and reproduce them, resulting in stereotypical outputs. We introduce SelfDebias, a fully unsupervised test-time debiasing method applicable to any diffusion model that uses a UNet as its noise predictor. SelfDebias identifies semantic clusters in an image encoder's embedding space and uses these clusters to guide the diffusion process during inference, minimizing the KL divergence between the output distribution and the uniform distribution. Unlike supervised approaches, SelfDebias does not require human-annotated datasets or external classifiers trained for each generated concept. Instead, it is designed to automatically identify semantic modes. Extensive experiments show that SelfDebias generalizes across prompts and diffusion model architectures, including both conditional and unconditional models. It not only effectively debiases images along key demographic dimensions while maintaining the visual fidelity of the generated images, but also more abstract concepts for which identifying biases is also challenging.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaze Li, Yan Lu, Bin Liu, Guojun Yin, Mang Ye
Machine Learning
ๆฅ็ๆ่ฆ
Two-stage learning pipeline has achieved promising results in unsupervised visible-infrared person re-identification (USL-VI-ReID). It first performs single-modality learning and then operates cross-modality learning to tackle the modality discrepancy. Although promising, this pipeline inevitably introduces modality bias: modality-specific cues learned in the single-modality training naturally propagate into the following cross-modality learning, impairing identity discrimination and generalization. To address this issue, we propose a Dual-level Modality Debiasing Learning (DMDL) framework that implements debiasing at both the model and optimization levels. At the model level, we propose a Causality-inspired Adjustment Intervention (CAI) module that replaces likelihood-based modeling with causal modeling, preventing modality-induced spurious patterns from being introduced, leading to a low-biased model. At the optimization level, a Collaborative Bias-free Training (CBT) strategy is introduced to interrupt the propagation of modality bias across data, labels, and features by integrating modality-specific augmentation, label refinement, and feature alignment. Extensive experiments on benchmark datasets demonstrate that DMDL could enable modality-invariant feature learning and a more generalized model.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Melane Navaratnarajah, David A. Kelly, Hana Chockler
ๆฅ็ๆ่ฆ
Adversarial perturbations are a useful way to expose vulnerabilities in object detectors. Existing perturbation methods are frequently white-box and architecture specific. More importantly, while they are often successful, it is rarely clear why they work. Insights into the mechanism of this success would allow developers to understand and analyze these attacks, as well as fine-tune the model to prevent them. This paper presents BlackCAtt, a black-box algorithm and a tool, which uses minimal, causally sufficient pixel sets to construct explainable, imperceptible, reproducible, architecture-agnostic attacks on object detectors. BlackCAtt combines causal pixels with bounding boxes produced by object detectors to create adversarial attacks that lead to the loss, modification or addition of a bounding box. BlackCAtt works across different object detectors of different sizes and architectures, treating the detector as a black box. We compare the performance of BlackCAtt with other black-box attack methods and show that identification of causal pixels leads to more precisely targeted and less perceptible attacks. On the COCO test dataset, our approach is 2.7 times better than the baseline in removing a detection, 3.86 times better in changing a detection, and 5.75 times better in triggering new, spurious, detections. The attacks generated by BlackCAtt are very close to the original image, and hence imperceptible, demonstrating the power of causal pixels.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
The Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios.In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments.To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaichen Zhang, Tianxiang Sheng, Xuanming Shi
Computer Vision
ๆฅ็ๆ่ฆ
This paper presents DINO-RotateMatch, a deep-learning framework designed to address the chal lenges of image matching in large-scale 3D reconstruction from unstructured Internet images. The
method integrates a dataset-adaptive image pairing strategy with rotation-aware keypoint extraction and
matching. DINO is employed to retrieve semantically relevant image pairs in large collections, while
rotation-based augmentation captures orientation-dependent local features using ALIKED and Light Glue. Experiments on the Kaggle Image Matching Challenge 2025 demonstrate consistent improve ments in mean Average Accuracy (mAA), achieving a Silver Award (47th of 943 teams). The results
confirm that combining self-supervised global descriptors with rotation-enhanced local matching offers
a robust and scalable solution for large-scale 3D reconstruction.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Paula Seidler, Neill D. F. Campbell, Ivor J A Simpson
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Perceptual similarity scores that align with human vision are critical for both training and evaluating computer vision models. Deep perceptual losses, such as LPIPS, achieve good alignment but rely on complex, highly non-linear discriminative features with unknown invariances, while hand-crafted measures like SSIM are interpretable but miss key perceptual properties.
We introduce the Structured Uncertainty Similarity Score (SUSS); it models each image through a set of perceptual components, each represented by a structured multivariate Normal distribution. These are trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities with weights learned from human perceptual datasets. Unlike feature-based methods, SUSS learns image-specific linear transformations of residuals in pixel space, enabling transparent inspection through decorrelated residuals and sampling.
SUSS aligns closely with human perceptual judgments, shows strong perceptual calibration across diverse distortion types, and provides localized, interpretable explanations of its similarity assessments. We further demonstrate stable optimization behavior and competitive performance when using SUSS as a perceptual loss for downstream imaging tasks.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yian Li, Xiaoyu Guo, Hao Zhang, Shuiwang Li, Xiaowei Dai
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Active visual perception refers to the ability of a system to dynamically engage with its environment through sensing and action, allowing it to modify its behavior in response to specific goals or uncertainties. Unlike passive systems that rely solely on visual data, active visual perception systems can direct attention, move sensors, or interact with objects to acquire more informative data. This approach is particularly powerful in complex environments where static sensing methods may not provide sufficient information. Active visual perception plays a critical role in numerous applications, including robotics, autonomous vehicles, human-computer interaction, and surveillance systems. However, despite its significant promise, there are several challenges that need to be addressed, including real-time processing of complex visual data, decision-making in dynamic environments, and integrating multimodal sensory inputs. This paper explores both the opportunities and challenges inherent in active visual perception, providing a comprehensive overview of its potential, current research, and the obstacles that must be overcome for broader adoption.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Melis Ocal, Xiaoyan Xing, Yue Li, Ngo Anh Vien, Sezer Karaoglu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
3D stylization is central to game development, virtual reality, and digital arts, where the demand for diverse assets calls for scalable methods that support fast, high-fidelity manipulation. Existing text-to-3D stylization methods typically distill from 2D image editors, requiring time-intensive per-asset optimization and exhibiting multi-view inconsistency due to the limitations of current text-to-image models, which makes them impractical for large-scale production. In this paper, we introduce GaussianBlender, a pioneering feed-forward framework for text-driven 3D stylization that performs edits instantly at inference. Our method learns structured, disentangled latent spaces with controlled information sharing for geometry and appearance from spatially-grouped 3D Gaussians. A latent diffusion model then applies text-conditioned edits on these learned representations. Comprehensive evaluations show that GaussianBlender not only delivers instant, high-fidelity, geometry-preserving, multi-view consistent stylization, but also surpasses methods that require per-instance test-time optimization - unlocking practical, democratized 3D stylization at scale.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Feice Huang, Zuliang Han, Xing Zhou, Yihuang Chen, Lifei Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion transformers have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies in large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we propose ConvRot, a group-wise rotation-based quantization method that leverages regular Hadamard transform (RHT) to suppress both row-wise and column-wise outliers while reducing complexity from quadratic to linear. Building on this, we design ConvLinear4bit, a plug-and-play module that integrates rotation, quantization, GEMM, and dequantization, enabling W4A4 inference without retraining and preserving visual quality. Experiments on FLUX.1-dev demonstrate a 2.26$\times$ speedup and 4.05$\times$ memory reduction while maintaining image fidelity. To our knowledge, this is the first application of rotation-based quantization for plug-and-play W4A4 inference in diffusion transformers.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ge-Peng Ji, Jingyi Liu, Deng-Ping Fan, Nick Barnes
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M+ visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy - evolving from multimodal understanding to clinical reasoning: (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qi'ao Xu, Tianwen Qian, Yuqian Fu, Kailing Li, Yang Jiao et al.
Multimodal
ๆฅ็ๆ่ฆ
A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce \textbf{ToG-Bench}, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) \textbf{Task-oriented Grounding}, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) \textbf{Explicit-Implicit Dual Grounding}, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) \textbf{One-to-Many Grounding}, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit object grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: \href{https://github.com/qaxuDev/ToG-Bench}{https://github.com/qaxuDev/ToG-Bench}..
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Salim Khazem
Natural Language Processing
ๆฅ็ๆ่ฆ
Visual prompting has recently emerged as an efficient strategy to adapt vision models using lightweight, learnable parameters injected into the input space. However, prior work mainly targets large Vision Transformers and high-resolution datasets such as ImageNet. In contrast, small-image benchmarks like MNIST, Fashion-MNIST, and CIFAR-10 remain widely used in education, prototyping, and research, yet have received little attention in the context of prompting. In this paper, we introduce \textbf{Multi-Scale Visual Prompting (MSVP)}, a simple and generic module that learns a set of global, mid-scale, and local prompt maps fused with the input image via a lightweight $1 \times 1$ convolution. MSVP is backbone-agnostic, adds less than $0.02\%$ parameters, and significantly improves performance across CNN and Vision Transformer backbones.
We provide a unified benchmark on MNIST, Fashion-MNIST, and CIFAR-10 using a simple CNN, ResNet-18, and a small Vision Transformer. Our method yields consistent improvements with negligible computational overhead. We further include ablations on prompt scales, fusion strategies, and backbone architectures, along with qualitative analyzes using prompt visualizations and Grad-CAM. Our results demonstrate that multi-scale prompting provides an effective inductive bias even on low-resolution images.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Salim Khazem, Houssam Kanso
Machine Learning
ๆฅ็ๆ่ฆ
Accurate electricity consumption forecasting is essential for demand management and smart grid operations. This paper introduces a unified deep learning framework that integrates cyclical temporal encoding with hybrid LSTM-CNN architectures to enhance multistep energy forecasting. We systematically transform calendar-based attributes using sine cosine encodings to preserve periodic structure and evaluate their predictive relevance through correlation analysis. To exploit both long-term seasonal effects and short-term local patterns, we employ an ensemble model composed of an LSTM, a CNN, and a meta-learner of MLP regressors specialized for each forecast horizon. Using a one year national consumption dataset, we conduct an extensive experimental study including ablation analyses with and without cyclical encodings and calendar features and comparisons with established baselines from the literature. Results demonstrate consistent improvements across all seven forecast horizons, with our hybrid model achieving lower RMSE and MAE than individual architectures and prior methods. These findings confirm the benefit of combining cyclical temporal representations with complementary deep learning structures. To our knowledge, this is the first work to jointly evaluate temporal encodings, calendar-based features, and hybrid ensemble architectures within a unified short-term energy forecasting framework.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick
Natural Language Processing
ๆฅ็ๆ่ฆ
DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives--parameter-free mean pooling and a learned hierarchical encoder--we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling--where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiahao Zhang, Xiao Zhao, Guangyu Gao
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Deep convolutional neural networks (DCNNs) have substantially advanced object detection capabilities, particularly in remote sensing imagery. However, challenges persist, especially in detecting small objects where the high resolution of these images and the small size of target objects often result in a loss of critical information in the deeper layers of conventional CNNs. Additionally, the extensive spatial redundancy and intricate background details typical in remote-sensing images tend to obscure these small targets. To address these challenges, we introduce Multi-Kernel Selection Network (MKSNet), a novel network architecture featuring a novel Multi-Kernel Selection mechanism. The MKS mechanism utilizes large convolutional kernels to effectively capture an extensive range of contextual information. This innovative design allows for adaptive kernel size selection, significantly enhancing the network's ability to dynamically process and emphasize crucial spatial details for small object detection. Furthermore, MKSNet also incorporates a dual attention mechanism, merging spatial and channel attention modules. The spatial attention module adaptively fine-tunes the spatial weights of feature maps, focusing more intensively on relevant regions while mitigating background noise. Simultaneously, the channel attention module optimizes channel information selection, improving feature representation and detection accuracy. Empirical evaluations on the DOTA-v1.0 and HRSC2016 benchmark demonstrate that MKSNet substantially surpasses existing state-of-the-art models in detecting small objects in remote sensing images. These results highlight MKSNet's superior ability to manage the complexities associated with multi-scale and high-resolution image data, confirming its effectiveness and innovation in remote sensing object detection.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhigang Yang, Yuan Liu, Jiawei Zhang, Puning Zhang, Xinqiang Ma
Machine Learning
ๆฅ็ๆ่ฆ
Although the remarkable performance of deep neural networks (DNNs) in image classification, their vulnerability to adversarial attacks remains a critical challenge. Most existing detection methods rely on complex and poorly interpretable architectures, which compromise interpretability and generalization. To address this, we propose FeatureLens, a lightweight framework that acts as a lens to scrutinize anomalies in image features. Comprising an Image Feature Extractor (IFE) and shallow classifiers (e.g., SVM, MLP, or XGBoost) with model sizes ranging from 1,000 to 30,000 parameters, FeatureLens achieves high detection accuracy ranging from 97.8% to 99.75% in closed-set evaluation and 86.17% to 99.6% in generalization evaluation across FGSM, PGD, CW, and DAmageNet attacks, using only 51 dimensional features. By combining strong detection performance with excellent generalization, interpretability, and computational efficiency, FeatureLens offers a practical pathway toward transparent and effective adversarial defense.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yaokun Li, Shuaixian Wang, Mantang Guo, Jiehui Huang, Taojun Ding et al.
Robotics
ๆฅ็ๆ่ฆ
We propose ReCamDriving, a purely vision-based, camera-controlled novel-trajectory video generation framework. While repair-based methods fail to restore complex artifacts and LiDAR-based approaches rely on sparse and incomplete cues, ReCamDriving leverages dense and scene-complete 3DGS renderings for explicit geometric guidance, achieving precise camera-controllable generation. To mitigate overfitting to restoration behaviors when conditioned on 3DGS renderings, ReCamDriving adopts a two-stage training paradigm: the first stage uses camera poses for coarse control, while the second stage incorporates 3DGS renderings for fine-grained viewpoint and geometric guidance. Furthermore, we present a 3DGS-based cross-trajectory data curation strategy to eliminate the train-test gap in camera transformation patterns, enabling scalable multi-trajectory supervision from monocular videos. Based on this strategy, we construct the ParaDrive dataset, containing over 110K parallel-trajectory video pairs. Extensive experiments demonstrate that ReCamDriving achieves state-of-the-art camera controllability and structural consistency.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Erkut Erdem, Aykut Erdem et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives establishing the first framework for generating both object and camera motions directly from natural language specifications.
๐
2025-12-03
๐ NeurIPS 2025
๐ฅ Haoran Zhou, Gim Hee Lee
Machine Learning
ๆฅ็ๆ่ฆ
Recent advancements in foundation models for 2D vision have substantially improved the analysis of dynamic scenes from monocular videos. However, despite their strong generalization capabilities, these models often lack 3D consistency, a fundamental requirement for understanding scene geometry and motion, thereby causing severe spatial misalignment and temporal flickering in complex 3D environments. In this paper, we present Motion4D, a novel framework that addresses these challenges by integrating 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence. To enhance motion accuracy, we introduce a 3D confidence map that dynamically adjusts the motion priors, and an adaptive resampling process that inserts new Gaussians into under-represented regions based on per-pixel RGB and semantic errors. Furthermore, we enhance semantic coherence through an iterative refinement process that resolves semantic inconsistencies by alternately optimizing the semantic fields and updating prompts of SAM2. Extensive evaluations demonstrate that our Motion4D significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis. Our code is available at https://hrzhou2.github.io/motion4d-web/.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianan Sun, Yukang Huang, Dongzhihan Wang, Mingyu Fan
ๆฅ็ๆ่ฆ
Partial dental point clouds often suffer from large missing regions caused by occlusion and limited scanning views, which bias encoder-only global features and force decoders to hallucinate structures. We propose a retrieval-augmented framework for tooth completion that integrates a prototype memory into standard encoder--decoder pipelines. After encoding a partial input into a global descriptor, the model retrieves the nearest manifold prototype from a learnable memory and fuses it with the query feature through confidence-gated weighting before decoding. The memory is optimized end-to-end and self-organizes into reusable tooth-shape prototypes without requiring tooth-position labels, thereby providing structural priors that stabilize missing-region inference and free decoder capacity for detail recovery. The module is plug-and-play and compatible with common completion backbones, while keeping the same training losses. Experiments on a self-processed Teeth3DS benchmark demonstrate consistent improvements in Chamfer Distance, with visualizations showing sharper cusps, ridges, and interproximal transitions. Our approach provides a simple yet effective way to exploit cross-sample regularities for more accurate and faithful dental point-cloud completion.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fuchen Zheng, Xinyi Chen, Weixuan Li, Quanjun Li, Junhua Zhou et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Medical image segmentation is a cornerstone of modern clinical diagnostics. While Vision Transformers that leverage shifted window-based self-attention have established new benchmarks in this field, they are often hampered by a critical limitation: their localized attention mechanism struggles to effectively fuse local details with global context. This deficiency is particularly detrimental to challenging tasks such as the segmentation of microtumors and miniature organs, where both fine-grained boundary definition and broad contextual understanding are paramount. To address this gap, we propose HBFormer, a novel Hybrid-Bridge Transformer architecture. The 'Hybrid' design of HBFormer synergizes a classic U-shaped encoder-decoder framework with a powerful Swin Transformer backbone for robust hierarchical feature extraction. The core innovation lies in its 'Bridge' mechanism, a sophisticated nexus for multi-scale feature integration. This bridge is architecturally embodied by our novel Multi-Scale Feature Fusion (MFF) decoder. Departing from conventional symmetric designs, the MFF decoder is engineered to fuse multi-scale features from the encoder with global contextual information. It achieves this through a synergistic combination of channel and spatial attention modules, which are constructed from a series of dilated and depth-wise convolutions. These components work in concert to create a powerful feature bridge that explicitly captures long-range dependencies and refines object boundaries with exceptional precision. Comprehensive experiments on challenging medical image segmentation datasets, including multi-organ, liver tumor, and bladder tumor benchmarks, demonstrate that HBFormer achieves state-of-the-art results, showcasing its outstanding capabilities in microtumor and miniature organ segmentation. Code and models are available at: https://github.com/lzeeorno/HBFormer.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ David Svitov, Pietro Morerio, Lourdes Agapito, Alessio Del Bue
ๆฅ็ๆ่ฆ
We present a CloseUpAvatar - a novel approach for articulated human avatar representation dealing with more general camera motions, while preserving rendering quality for close-up views. CloseUpAvatar represents an avatar as a set of textured planes with two sets of learnable textures for low and high-frequency detail. The method automatically switches to high-frequency textures only for cameras positioned close to the avatar's surface and gradually reduces their impact as the camera moves farther away. Such parametrization of the avatar enables CloseUpAvatar to adjust rendering quality based on camera distance ensuring realistic rendering across a wider range of camera orientations than previous approaches. We provide experiments using the ActorsHQ dataset with high-resolution input images. CloseUpAvatar demonstrates both qualitative and quantitative improvements over existing methods in rendering from novel wide range camera positions, while maintaining high FPS by limiting the number of required primitives.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guang Yang, Lei Fan
Machine Learning
ๆฅ็ๆ่ฆ
The RNA inverse folding problem, a key challenge in RNA design, involves identifying nucleotide sequences that can fold into desired secondary structures, which are critical for ensuring molecular stability and function. The inherent complexity of this task stems from the intricate relationship between sequence and structure, making it particularly challenging. In this paper, we propose a framework, named HyperRNA, a generative model with an encoder-decoder architecture that leverages hypergraphs to design RNA sequences. Specifically, our HyperRNA model consists of three main components: preprocessing, encoding and decoding.
In the preprocessing stage, graph structures are constructed by extracting the atom coordinates of RNA backbone based on 3-bead coarse-grained representation. The encoding stage processes these graphs, capturing higher order dependencies and complex biomolecular interactions using an attention embedding module and a hypergraph-based encoder. Finally, the decoding stage generates the RNA sequence in an autoregressive manner. We conducted quantitative and qualitative experiments on the PDBBind and RNAsolo datasets to evaluate the inverse folding task for RNA sequence generation and RNA-protein complex sequence generation. The experimental results demonstrate that HyperRNA not only outperforms existing RNA design methods but also highlights the potential of leveraging hypergraphs in RNA engineering.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Jie Wang, Feidiao Yang et al.
Multimodal
ๆฅ็ๆ่ฆ
Handling fast, complex, and highly non-linear motion patterns has long posed challenges for video frame interpolation. Although recent diffusion-based approaches improve upon traditional optical-flow-based methods, they still struggle to cover diverse application scenarios and often fail to produce sharp, temporally consistent frames in fine-grained motion tasks such as audio-visual synchronized interpolation. To address these limitations, we introduce BBF (Beyond Boundary Frames), a context-aware video frame interpolation framework, which could be guided by audio/visual semantics. First, we enhance the input design of the interpolation model so that it can flexibly handle multiple conditional modalities, including text, audio, images, and video. Second, we propose a decoupled multimodal fusion mechanism that sequentially injects different conditional signals into a DiT backbone. Finally, to maintain the generation abilities of the foundation model, we adopt a progressive multi-stage training paradigm, where the start-end frame difference embedding is used to dynamically adjust both the data sampling and the loss weighting. Extensive experimental results demonstrate that BBF outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized interpolation tasks, establishing a unified framework for video frame interpolation under coordinated multi-channel conditioning.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Malte Bleeker, Mauro Gotsch
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We propose the Dynamic Optical Test for Bot Identification (DOT-BI): a quick and easy method that uses human perception of motion to differentiate between human respondents and automated systems in surveys and online processes. In DOT-BI, a 'hidden' number is displayed with the same random black-and-white pixel texture as its background. Only the difference in motion and scale between the number and the background makes the number perceptible to humans across frames, while frame-by-frame algorithmic processing yields no meaningful signal. We conducted two preliminary assessments. Firstly, state-of-the-art, video-capable, multimodal models (GPT-5-Thinking and Gemini 2.5 Pro) fail to extract the correct value, even when given explicit instructions about the mechanism. Secondly, in an online survey (n=182), 99.5% (181/182) of participants solved the task, with an average end-to-end completion time of 10.7 seconds; a supervised lab study (n=39) found no negative effects on perceived ease-of-use or completion time relative to a control. We release code to generate tests and 100+ pre-rendered variants to facilitate adoption in surveys and online processes.
๐
2025-12-03
๐ 6 pages, 2 figures. Camera-ready version accepted
๐ฅ Yizhi Zhang, Lei Fan, Zhulin Tao, Donglin Di, Yang Song et al.
Machine Learning
ๆฅ็ๆ่ฆ
Universal, transferable whole-slide image (WSI) representations are central to computational pathology. Incorporating multiple markers (e.g., immunohistochemistry, IHC) alongside H&E enriches H&E-based features with diverse, biologically meaningful information. However, progress is limited by the scarcity of well-aligned multi-stain datasets. Inter-stain misalignment shifts corresponding tissue across slides, hindering consistent patch-level features and degrading slide-level embeddings. To address this, we curated a slide-level aligned, five-stain dataset (H&E, HER2, KI67, ER, PGR) to enable paired H&E-IHC learning and robust cross-stain representation. Leveraging this dataset, we propose Cross-Stain Contrastive Learning (CSCL), a two-stage pretraining framework with a lightweight adapter trained using patch-wise contrastive alignment to improve the compatibility of H&E features with corresponding IHC-derived contextual cues, and slide-level representation learning with Multiple Instance Learning (MIL), which uses a cross-stain attention fusion module to integrate stain-specific patch features and a cross-stain global alignment module to enforce consistency among slide-level embeddings across different stains. Experiments on cancer subtype classification, IHC biomarker status classification, and survival prediction show consistent gains, yielding high-quality, transferable H&E slide-level representations. The code and data are available at https://github.com/lily-zyz/CSCL.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chao Yuan, Shimin Chen, Minliang Lin, Limeng Qiao, Guanglu Wan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Distinct from attention-based compression methods, this paper presents an information uniqueness driven video compression framework, termed UniComp, which aims to maximize the information fidelity of video representations under constrained computational budgets. Starting from the information-theoretic perspective, we formulate the vision compression as an optimization problem that minimizes conditional entropy (reconstruction error) between retained and full tokens. To achieve this, we introduce the notion of information uniqueness to measure intrinsic redundancy among tokens to link with reconstruction error. Based on uniqueness, we design three modules-Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression-that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Extensive experiments demonstrate that UniComp consistently outperforms existing compression methods in preserving essential visual tokens under limited computational budgets, highlighting the pivotal role of information uniqueness in token compression efficacy.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fuxiang Yang, Tonghua Su, Donglin Di, Yin Chen, Xiangqian Wu et al.
ๆฅ็ๆ่ฆ
Scene Text Editing (STE) involves replacing text in a scene image with new target text while preserving both the original text style and background texture. Existing methods suffer from two major challenges: inconsistency and length-insensitivity. They often fail to maintain coherence between the edited local patch and the surrounding area, and they struggle to handle significant differences in text length before and after editing. To tackle these challenges, we propose an end-to-end framework called Global-Local Aware Scene Text Editing (GLASTE), which simultaneously incorporates high-level global contextual information along with delicate local features. Specifically, we design a global-local combination structure, joint global and local losses, and enhance text image features to ensure consistency in text style within local patches while maintaining harmony between local and global areas. Additionally, we express the text style as a vector independent of the image size, which can be transferred to target text images of various sizes. We use an affine fusion to fill target text images into the editing patch while maintaining their aspect ratio unchanged. Extensive experiments on real-world datasets validate that our GLASTE model outperforms previous methods in both quantitative metrics and qualitative results and effectively mitigates the two challenges.
๐
2025-12-03
๐ Accepted by ACM MM Asia2026
๐ฅ Hao Sun, Lei Fan, Donglin Di, Shaohui Liu
ๆฅ็ๆ่ฆ
Articulated object generation has seen increasing advancements, yet existing models often lack the ability to be conditioned on text prompts. To address the significant gap between textual descriptions and 3D articulated object representations, we propose GAOT, a three-phase framework that generates articulated objects from text prompts, leveraging diffusion models and hypergraph learning in a three-step process. First, we fine-tune a point cloud generation model to produce a coarse representation of objects from text prompts. Given the inherent connection between articulated objects and graph structures, we design a hypergraph-based learning method to refine these coarse representations, representing object parts as graph vertices. Finally, leveraging a diffusion model, the joints of articulated objects-represented as graph edges-are generated based on the object parts. Extensive qualitative and quantitative experiments on the PartNet-Mobility dataset demonstrate the effectiveness of our approach, achieving superior performance over previous methods.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huy Quang Ung, Guillaume Habault, Yasutaka Nishimura, Hao Niu, Roberto Legaspi et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The rise of Visual-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs' understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid- and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: https://github.com/ungquanghuy-kddi/CartoMapQA.git
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Yinzhou Tang, Yu Shang, Yinuo Chen, Bingwen Wei, Xin Zhang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. Toward this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates ''endogenous'' rewards derived from the model's intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out-of-domain scenarios.
๐
2025-12-03
๐ KDD 2026
๐ฅ Wei Chee Yew, Hailun Xu, Sanjay Saha, Xiaotian Fan, Hiok Hian Ong et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams}. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nan Sun, Zhenyu Zhang, Xixun Lin, Kun Wang, Yanmin Shang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, producing content inconsistent with input visuals, that undermine reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention score or output logits, focusing on "how to intervene" but overlooking the prerequisite "when to intervene", which leads to the "over-intervention" problem and subsequently introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when the visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
๐
2025-12-03
๐ Accepted by ACM Multimedia 2025
๐ฅ Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instructions structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media, and procedural content creation.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Subin Kim, Sangwoo Mo, Mamshad Nayeem Rizve, Yiran Xu, Difan Liu et al.
ๆฅ็ๆ่ฆ
Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
๐
2025-12-03
๐ Accepted in IEEE Robotics and Automation Letters (
๐ฅ Gihyeon Lee, Jungwoo Lee, Juwon Kim, Young-Sik Shin, Younggun Cho
Computer Vision
ๆฅ็ๆ่ฆ
Robots are often required to localize in environments with unknown object classes and semantic ambiguity. However, when performing global localization using semantic objects, high semantic ambiguity intensifies object misclassification and increases the likelihood of incorrect associations, which in turn can cause significant errors in the estimated pose. Thus, in this letter, we propose a multi-label likelihood-based semantic graph matching framework for object-level global localization. The key idea is to exploit multi-label graph representations, rather than single-label alternatives, to capture and leverage the inherent semantic context of object observations. Based on these representations, our approach enhances semantic correspondence across graphs by combining the likelihood of each node with the maximum likelihood of its neighbors via context-aware likelihood propagation. For rigorous validation, data association and pose estimation performance are evaluated under both closed-set and open-set detection configurations. In addition, we demonstrate the scalability of our approach to large-vocabulary object categories in both real-world indoor scenes and synthetic environments.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiyi Cai, Yuhan Wu, Kunhang Li, You Zhou, Bo Zheng et al.
Robotics
ๆฅ็ๆ่ฆ
We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk or auto-regressive model with diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that to guarantee modeling the output distribution, the vanilla diffusion forcing must be tailored to: (i) train with a bi-directional attention instead of casual attention; (ii) implement a lower triangular time scheduler instead of a random one; (iii) utilize a continues time-varying way to introduce text conditioning. With these improvements, we demonstrate in the first time that the diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available. https://shandaai.github.io/FloodDiffusion/
๐
2025-12-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Adithya S Kolavi, Vyoman Jain
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhijian Qiao, Zehuan Yu, Tong Li, Chih-Chung Chou, Wenchao Ding et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Crowdsourcing enables scalable autonomous driving map construction, but low-cost sensor noise hinders quality from improving with data volume. We propose CSMapping, a system that produces accurate semantic maps and topological road centerlines whose quality consistently increases with more crowdsourced data. For semantic mapping, we train a latent diffusion model on HD maps (optionally conditioned on SD maps) to learn a generative prior of real-world map structure, without requiring paired crowdsourced/HD-map supervision. This prior is incorporated via constrained MAP optimization in latent space, ensuring robustness to severe noise and plausible completion in unobserved areas. Initialization uses a robust vectorized mapping module followed by diffusion inversion; optimization employs efficient Gaussian-basis reparameterization, projected gradient descent zobracket multi-start, and latent-space factor-graph for global consistency. For topological mapping, we apply confidence-weighted k-medoids clustering and kinematic refinement to trajectories, yielding smooth, human-like centerlines robust to trajectory variation. Experiments on nuScenes, Argoverse 2, and a large proprietary dataset achieve state-of-the-art semantic and topological mapping performance, with thorough ablation and scalability studies.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kwaku Opoku-Ware, Gideon Opoku
Computer Vision
ๆฅ็ๆ่ฆ
This paper presents a preliminary investigation into automated dance movement analysis using contemporary computer vision techniques. We propose a proof-of-concept framework that integrates YOLOv8 and v11 for dancer detection with the Segment Anything Model (SAM) for precise segmentation, enabling the tracking and quantification of dancer movements in video recordings without specialized equipment or markers. Our approach identifies dancers within video frames, counts discrete dance steps, calculates spatial coverage patterns, and measures rhythm consistency across performance sequences. Testing this framework on a single 49-second recording of Ghanaian AfroBeats dance demonstrates technical feasibility, with the system achieving approximately 94% detection precision and 89% recall on manually inspected samples. The pixel-level segmentation provided by SAM, achieving approximately 83% intersection-over-union with visual inspection, enables motion quantification that captures body configuration changes beyond what bounding-box approaches can represent. Analysis of this preliminary case study indicates that the dancer classified as primary by our system executed 23% more steps with 37% higher motion intensity and utilized 42% more performance space compared to dancers classified as secondary. However, this work represents an early-stage investigation with substantial limitations including single-video validation, absence of systematic ground truth annotations, and lack of comparison with existing pose estimation methods. We present this framework to demonstrate technical feasibility, identify promising directions for quantitative dance metrics, and establish a foundation for future systematic validation studies.
๐
2025-12-03
๐ ICCV 2025
๐ฅ Seogkyu Jeon, Kibeom Hong, Hyeran Byun
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises due to the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). Firstly, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. To capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with the texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient against diverse environmental changes, we have proposed the domain-robust consistency learning which guides the model to minimize discrepancies of prediction from original and the augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state-of-the-art on various DGSS benchmarks. The code is available at https://github.com/jone1222/DPMFormer.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Te Yang, Xiangyu Zhu, Bo Wang, Quan Chen, Peng Jiang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to longform video understanding either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that archives exploration-exploitation balance through semantic guidance with hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search process, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage within unknown segments. Moreover, EEA adaptively combines intrinsic rewards from visionlanguage models (VLMs) with semantic priors by explicitly modeling uncertainty to achieve stable and precise evaluation of video segments. Experiments across various long-video benchmarks validate the superior performance and computational efficiency of our proposed method.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Renqi Chen, Haoyang Su, Shixiang Tang
Machine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhancing SAM's adaptation performance on diverse domains. Despite advancements, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant since the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA incorporates a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy to help the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Various Experiments demonstrate our NAS-LoRA improves existing PEFT methods, while reducing training cost by 24.14% without increasing inference cost, highlighting the potential of NAS in enhancing PEFT for visual foundation models.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenliang Guo, Yu Kong
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Understanding procedural activities is crucial for developing future assistive AI that can reason about complex real-world tasks. Existing action-centric methods struggle with the flexibility of real procedures, where step order varies depending on object states. In this work, we propose to shift the focus to an object-centric paradigm by regarding actions as mechanisms that drive state transitions. To advance this direction, we introduce Object-IVQA, a long-form instructional video benchmark with 107 videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence. The benchmark evaluates four dimensions of object-centric reasoning, including state evolution, precondition verification, counterfactual reasoning and mistake recognition. We further propose an agent framework that orchestrates object-centric planning, perception, analysis and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments. Experiments show that existing large vision-language models struggle in object-level recognition and reasoning, whereas our framework achieves substantially improvement.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zijian Gu, Yuxi Liu, Zhenhao Zhang, Song Wang
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms.Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenliang Guo, Yujiang Pu, Yu Kong
ๆฅ็ๆ่ฆ
Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the \textbf{action effect}. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chen Hu, Mingyu Zhou, Shuai Yuan, Hongbo Hu, Xiangyu Qiu et al.
ๆฅ็ๆ่ฆ
Infrared small target detection (ISTD) faces two major challenges: a lack of discernible target texture and severe background clutter, which results in the background obscuring the target. To enhance targets and suppress backgrounds, we propose the Basis Decomposition Module (BDM) as an extensible and lightweight module based on basis decomposition, which decomposes a complex feature into several basis features and enhances certain information while eliminating redundancy. Extending BDM leads to a series of modules, including the Spatial Difference Decomposition Module (SD$^\mathrm{2}$M), Spatial Difference Decomposition Downsampling Module (SD$^\mathrm{3}$M), and Temporal Difference Decomposition Module (TD$^\mathrm{2}$M). Based on these modules, we develop the Spatial Difference Decomposition Network (SD$^\mathrm{2}$Net) for single-frame ISTD (SISTD) and the Spatiotemporal Difference Decomposition Network (STD$^\mathrm{2}$Net) for multi-frame ISTD (MISTD). SD$^\mathrm{2}$Net integrates SD$^\mathrm{2}$M and SD$^\mathrm{3}$M within an adapted U-shaped architecture. We employ TD$^\mathrm{2}$M to introduce motion information, which transforms SD$^\mathrm{2}$Net into STD$^\mathrm{2}$Net. Extensive experiments on SISTD and MISTD datasets demonstrate state-of-the-art (SOTA) performance. On the SISTD task, SD$^\mathrm{2}$Net performs well compared to most established networks. On the MISTD datasets, STD$^\mathrm{2}$Net achieves a mIoU of 87.68\%, outperforming SD$^\mathrm{2}$Net, which achieves a mIoU of 64.97\%. Our codes are available: https://github.com/greekinRoma/IRSTD_HC_Platform.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shojiro Yamabe, Futa Waseda, Daiki Shiono, Tsubasa Takahashi
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. Extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunpeng Bai, Shaoheng Fang, Chaohui Yu, Fan Wang, Qixing Huang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in video generation have enabled the synthesis of high-quality and visually realistic clips using diffusion transformer models. However, most existing approaches operate purely in the 2D pixel space and lack explicit mechanisms for modeling 3D structures, often resulting in temporally inconsistent geometries, implausible motions, and structural artifacts. In this work, we introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. We adopted depth as the geometric representation because of the great progress in depth prediction and its compatibility with image-based latent encoders. Specifically, to enforce structural consistency over time, we propose a multi-view geometric loss that aligns the predicted depth maps across frames within a shared 3D coordinate system. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved spatio-temporal coherence, shape consistency, and physical plausibility. Experiments across multiple datasets show that our approach produces significantly more stable and geometrically consistent results than existing baselines.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiye Song, Steve Dai, Ben Keller, Brucek Khailany
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications.
We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rhys Newbury, Juyan Zhang, Tin Tran, Hanna Kurniawati, Dana Kuliฤ
Computer Vision
ๆฅ็ๆ่ฆ
Understanding and representing the structure of 3D objects in an unsupervised manner remains a core challenge in computer vision and graphics. Most existing unsupervised keypoint methods are not designed for unconditional generative settings, restricting their use in modern 3D generative pipelines; our formulation explicitly bridges this gap. We present an unsupervised framework for learning spatially structured 3D keypoints from point cloud data. These keypoints serve as a compact and interpretable representation that conditions an Elucidated Diffusion Model (EDM) to reconstruct the full shape. The learned keypoints exhibit repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation. Our method achieves strong performance across diverse object categories, yielding a 6 percentage-point improvement in keypoint consistency compared to prior approaches.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tongxu Zhang
Robotics
ๆฅ็ๆ่ฆ
Background and Objective: Radiomics of knee MRI requires robust, anatomically meaningful regions of interest (ROIs) that jointly capture cartilage and subchondral bone. Most existing work relies on manual ROIs and rarely reports quality control (QC). We present LM-CartSeg, a fully automatic pipeline for cartilage/bone segmentation, geometric lateral/medial (L/M) compartmentalisation and radiomics analysis. Methods: Two 3D nnU-Net models were trained on SKM-TEA (138 knees) and OAIZIB-CM (404 knees). At test time, zero-shot predictions were fused and refined by simple geometric rules: connected-component cleaning, construction of 10 mm subchondral bone bands in physical space, and a data-driven tibial L/M split based on PCA and k-means. Segmentation was evaluated on an OAIZIB-CM test set (103 knees) and on SKI-10 (100 knees). QC used volume and thickness signatures. From 10 ROIs we extracted 4 650 non-shape radiomic features to study inter-compartment similarity, dependence on ROI size, and OA vs. non-OA classification on OAIZIB-CM Results: Post-processing improved macro ASSD on OAIZIB-CM from 2.63 to 0.36 mm and HD95 from 25.2 to 3.35 mm, with DSC 0.91; zero-shot DSC on SKI-10 was 0.80. The geometric L/M rule produced stable compartments across datasets, whereas a direct L/M nnU-Net showed domain-dependent side swaps. Only 6 to 12 percent of features per ROI were strongly correlated with volume or thickness. Radiomics-based models models restricted to size-linked features. Conclusions: LM-CartSeg yields automatic, QCd ROIs and radiomic features that carry discriminative information beyond simple morphometry, providing a practical foundation for multi-centre knee OA radiomics studies.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xieji Li, Siyuan Yan, Yingsheng Liu, H. Peter Soyer, Monika Janda et al.
Multimodal
ๆฅ็ๆ่ฆ
Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.
๐
2025-12-03
๐ ICML 2025
๐ฅ Yuzhen Hu, Biplab Banerjee, Saurabh Prasad
Multimodal
ๆฅ็ๆ่ฆ
Hyperspectral imaging (HSI) enables detailed land cover classification, yet low spatial resolution and sparse annotations pose significant challenges. We present a label-efficient framework that leverages spatial features from a frozen diffusion model pretrained on natural images. Our approach extracts low-level representations from high-resolution decoder layers at early denoising timesteps, which transfer effectively to the low-texture structure of HSI. To integrate spectral and spatial information, we introduce a lightweight FiLM-based fusion module that adaptively modulates frozen spatial features using spectral cues, enabling robust multimodal learning under sparse supervision. Experiments on two recent hyperspectral datasets demonstrate that our method outperforms state-of-the-art approaches using only the provided sparse training labels. Ablation studies further highlight the benefits of diffusion-derived features and spectral-aware fusion. Overall, our results indicate that pretrained diffusion models can support domain-agnostic, label-efficient representation learning for remote sensing and broader scientific imaging tasks.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green
Robotics
ๆฅ็ๆ่ฆ
Autonomous UAV forestry operations require robust depth estimation methods with strong cross-domain generalization. However, existing evaluations focus on urban and indoor scenarios, leaving a critical gap for specialized vegetation-dense environments. We present the first systematic zero-shot evaluation of eight state-of-the-art stereo methods--RAFT-Stereo, IGEV, IGEV++, BridgeDepth, StereoAnywhere, DEFOM (plus baseline methods ACVNet, PSMNet, TCstereo)--spanning iterative refinement, foundation model, and zero-shot adaptation paradigms. All methods are trained exclusively on Scene Flow and evaluated without fine-tuning on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) plus a novel 5,313-pair Canterbury forestry dataset captured with ZED Mini camera (1920x1080). Performance reveals scene-dependent patterns: foundation models excel on structured scenes (BridgeDepth: 0.23 px on ETH3D, 0.83-1.07 px on KITTI; DEFOM: 0.35-4.65 px across benchmarks), while iterative methods maintain cross-domain robustness (IGEV++: 0.36-6.77 px; IGEV: 0.33-21.91 px). Critical finding: RAFT-Stereo exhibits catastrophic ETH3D failure (26.23 px EPE, 98 percent error rate) due to negative disparity predictions, while performing normally on KITTI (0.90-1.11 px). Qualitative evaluation on Canterbury forestry dataset identifies DEFOM as the optimal gold-standard baseline for vegetation depth estimation, exhibiting superior depth smoothness, occlusion handling, and cross-domain consistency compared to IGEV++, despite IGEV++'s finer detail preservation.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zayne Sprague, Jack Lu, Manya Wadhwa, Sedrick Keh, Mengye Ren et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sen Fang, Yalin Feng, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas
Machine Learning
ๆฅ็ๆ่ฆ
Sign Language Production (SLP) is the process of converting the complex input text into a real video. Most previous works focused on the Text2Gloss, Gloss2Pose, Pose2Vid stages, and some concentrated on Prompt2Gloss and Text2Avatar stages. However, this field has made slow progress due to the inaccuracy of text conversion, pose generation, and the rendering of poses into real human videos in these stages, resulting in gradually accumulating errors. Therefore, in this paper, we streamline the traditional redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines the SLP task as a hierarchical generation end-to-end task that only includes text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid, and executes text understanding through our proposed new Sign Language Understanding Linker called SLUL, and generates hand gestures through the named SLP-MoE hand gesture rendering expert block to end-to-end generate high-quality and multi-style sign language videos. SLUL is trained using the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Its performance has improved by 48.6% compared to the current SOTA generation methods.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Andreas Koukounas, Georgios Mastrapas, Florian Hรถnicke, Sedigheh Eslami, Guillaume Roncari et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ying Wang, Zhen Jin, Jiexiong Xu, Wenhai Lin, Yiquan Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving augmented LLM inference serving efficiency and optimizing service-level objectives (SLOs) are critical for enhancing user experience. To achieve this, inference systems must maximize request handling within latency constraints, referred to as increasing effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, leading to queuing delays exceeding the SLOs for many requests; and (ii) static batch token limit, which fails to adapt to fluctuating loads and hardware conditions. Both of these factors degrade effective throughput and service quality.
This paper presents AugServe, an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented LLM inference services. The core idea of AugServe is a two-stage adaptive request scheduling strategy. Specifically, AugServe combines the inference features of augmented LLM requests to optimize the order of scheduling decisions (stage I). These decisions are continuously refined with runtime information (stage II), adapting to both request characteristics and system capabilities. In addition, AugServe dynamically adjusts the token batching mechanism based on hardware status and real-time load, further enhancing throughput performance. Experimental results show that AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Taido Purason, Pavel Chizhov, Ivan P. Yamshchikov, Mark Fishel
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Tokenizer adaptation plays an important role in transferring pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training, which adapts a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for controlled vocabulary modification, which we release as an open-source package.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lifeng Chen, Ryan Lai, Tianming Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Adapting large language models (LLMs) to low-resource languages remains a major challenge due to data scarcity and cross-lingual drift. This work presents a two-stage adaptation of Qwen2.5-3B to Tibetan, a morphologically rich and underrepresented language. We employ Continual Pretraining (CPT) to establish Tibetan linguistic grounding, followed by Supervised Fine-Tuning (SFT) for task and translation specialization. Empirical evaluations demonstrate a consistent decrease in perplexity (from 2.98 $\rightarrow$ 1.54) and substantial improvements in Chinese$\rightarrow$Tibetan translation quality (BLEU: 0.046 $\rightarrow$ 0.261; chrF: 2.2 $\rightarrow$ 6.6). Layer-wise analysis across 435 layers in Qwen3-4B reveals that adaptation primarily concentrates on embedding and output heads, with mid--late MLP projections encoding domain-specific transformations. Our findings suggest that CPT constructs a Tibetan semantic manifold while SFT sharpens task alignment with minimal representational disruption. This study provides the first quantitative exploration of Tibetan adaptation dynamics for LLMs, and offers an open, reproducible framework for extending multilingual foundation models to low-resource settings.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kazi Abrab Hossain, Jannatul Somiya Mahmud, Maria Hossain Tuli, Anik Mitra, S. M. Taiabul Haque et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
While recent developments in large language models have improved bias detection and classification, sensitive subjects like religion still present challenges because even minor errors can result in severe misunderstandings. In particular, multilingual models often misrepresent religions and have difficulties being accurate in religious contexts. To address this, we introduce BRAND: Bilingual Religious Accountable Norm Dataset, which focuses on the four main religions of South Asia: Buddhism, Christianity, Hinduism, and Islam, containing over 2,400 entries, and we used three different types of prompts in both English and Bengali. Our results indicate that models perform better in English than in Bengali and consistently display bias toward Islam, even when answering religion-neutral questions. These findings highlight persistent bias in multilingual models when similar questions are asked in different languages. We further connect our findings to the broader issues in HCI regarding religion and spirituality.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ekhi Azurmendi, Joseba Fernandez de Landa, Jaione Bengoetxea, Maite Heredia, Julen Etxaniz et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although Cross-layer KV Cache sharing (e.g., YOCO, CLA) offers a path to mitigate KV Cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values of the top-layers. Our preliminary reveals a clear distribution: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, an cross-layer sharing approach, where top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduce 50\% cache memory while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michael Staniek, Artem Sokolov, Stefan Riezler
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Machine learning for early prediction in medicine has recently shown breakthrough performance, however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model's inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kylie L. Anglin, Stephanie Milan, Brittney Hernandez, Claudia Ventura
Natural Language Processing
ๆฅ็ๆ่ฆ
Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output - here, the category assigned to a text - depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies --codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting - with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Huey Sun, Anabel Yong, Lorenzo Gilly, Felipe Jin
Natural Language Processing
ๆฅ็ๆ่ฆ
Contrastive decoding is a lightweight and effective inference-time method that improves the quality of text generation in Large Language Models. However, algorithms such as DoLa (Decoding by Contrastive Layers) have only been implemented in decoder-only architectures and studied for their impact on improving factuality. This work adapts DoLa for the T5 and FLAN-T5 model families and evaluates its impact on the models' instruction following capabilities, which to our knowledge is the first implementation of a contrastive decoding strategy in an encoder-decoder architecture. Our results show that DoLa improves the faithfulness of text generation for certain categories of tasks and harms others. To understand these results, we present a layer-by-layer analysis of logit evolution in a FLAN-T5 model to quantify DoLa's impact on token output probabilities.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
MultimodalRoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce \textbf{Doublespeak}, a simple \emph{in-context representation hijacking} attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., \textit{bomb}) with a benign token (e.g., \textit{carrot}) across multiple in-context examples, provided a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., ``How to build a carrot?'') are internally interpreted as disallowed instructions (e.g., ``How to build a bomb?''), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74\% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chuyue Wang, Jie Feng, Yuxi Wu, Hang Zhang, Zhiguo Fan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate and reliable search on online healthcare platforms is critical for user safety and service efficacy. Traditional methods, however, often fail to comprehend complex and nuanced user queries, limiting their effectiveness. Large language models (LLMs) present a promising solution, offering powerful semantic understanding to bridge this gap. Despite their potential, deploying LLMs in this high-stakes domain is fraught with challenges, including factual hallucinations, specialized knowledge gaps, and high operational costs. To overcome these barriers, we introduce \textbf{AR-Med}, a novel framework for \textbf{A}utomated \textbf{R}elevance assessment for \textbf{Med}ical search that has been successfully deployed at scale on the Online Medical Delivery Platforms. AR-Med grounds LLM reasoning in verified medical knowledge through a retrieval-augmented approach, ensuring high accuracy and reliability. To enable efficient online service, we design a practical knowledge distillation scheme that compresses large teacher models into compact yet powerful student models. We also introduce LocalQSMed, a multi-expert annotated benchmark developed to guide model iteration and ensure strong alignment between offline and online performance. Extensive experiments show AR-Med achieves an offline accuracy of over 93\%, a 24\% absolute improvement over the original online system, and delivers significant gains in online relevance and user satisfaction. Our work presents a practical and scalable blueprint for developing trustworthy, LLM-powered systems in real-world healthcare applications.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yijun Liao
ๆฅ็ๆ่ฆ
Long-context dialogue systems suffer from State Inertia, where static constraints prevent models from resolving conflicts between evolving user intents and established historical context. To address this, we propose DZ-TDPO, a non-destructive alignment framework that synergizes conflict-aware dynamic KL constraints with a learnable temporal attention bias. Experiments on the Multi-Session Chat (MSC) dataset demonstrate that DZ-TDPO achieves state-of-the-art win rates (86.2% on Phi-3.5) while maintaining robust zero-shot generalization. Crucially, our scaling analysis reveals a "Capacity-Stability Trade-off": while smaller models incur an "alignment tax" (perplexity surge) to overcome historical inertia, the larger Qwen2.5-7B model achieves near-perfect alignment (99.4% win rate) with negligible perplexity overhead. This confirms that TAI can be alleviated via precise attention regulation rather than destructive weight updates, preserving general capabilities (MMLU) across model scales. Code and data are available: https://github.com/lyj20071013/DZ-TDPO
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Numaan Naeem, Kaushal Kumar Maurya, Kseniia Petukhova, Ekaterina Kochmar
ๆฅ็ๆ่ฆ
We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization. This tool is aimed at education stakeholders as well as *ACL community at large, as it supports learning and can also be used to collect user feedback and annotations.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Daria Kryvosheieva, Andrea de Varda, Evelina Fedorenko, Greta Tuckute
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) can reliably distinguish grammatical from ungrammatical sentences, but how grammatical knowledge is represented within the models remains an open question. We investigate whether different syntactic phenomena recruit shared or distinct components in LLMs. Using a functional localization approach inspired by cognitive neuroscience, we identify the LLM units most responsive to 67 English syntactic phenomena in seven open-weight models. These units are consistently recruited across sentences containing the phenomena and causally support the models' syntactic performance. Critically, different types of syntactic agreement (e.g., subject-verb, anaphor, determiner-noun) recruit overlapping sets of units, suggesting that agreement constitutes a meaningful functional category for LLMs. This pattern holds in English, Russian, and Chinese; and further, in a cross-lingual analysis of 57 diverse languages, structurally more similar languages share more units for subject-verb agreement. Taken together, these findings reveal that syntactic agreement-a critical marker of syntactic dependencies-constitutes a meaningful category within LLMs' representational spaces.
๐
2025-12-03
๐ Hydro-SE Bench sets a new benchmark for the evalua
๐ฅ Shiruo Hu, Wenbo Shan, Yingjia Li, Zhiqi Wan, Xinpeng Yu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Hydro-Science and Engineering (Hydro-SE) is a critical and irreplaceable domain that secures human water supply, generates clean hydropower energy, and mitigates flood and drought disasters. Featuring multiple engineering objectives, Hydro-SE is an inherently interdisciplinary domain that integrates scientific knowledge with engineering expertise. This integration necessitates extensive expert collaboration in decision-making, which poses difficulties for intelligence. With the rapid advancement of large language models (LLMs), their potential application in the Hydro-SE domain is being increasingly explored. However, the knowledge and application abilities of LLMs in Hydro-SE have not been sufficiently evaluated. To address this issue, we propose the Hydro-SE LLM evaluation benchmark (Hydro-SE Bench), which contains 4,000 multiple-choice questions. Hydro-SE Bench covers nine subfields and enables evaluation of LLMs in aspects of basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. The evaluation results on Hydro-SE Bench show that the accuracy values vary among 0.74 to 0.80 for commercial LLMs, and among 0.41 to 0.68 for small-parameter LLMs. While LLMs perform well in subfields closely related to natural and physical sciences, they struggle with domain-specific knowledge such as industry standards and hydraulic structures. Model scaling mainly improves reasoning and calculation abilities, but there is still great potential for LLMs to better handle problems in practical engineering application. This study highlights the strengths and weaknesses of LLMs for Hydro-SE tasks, providing model developers with clear training targets and Hydro-SE researchers with practical guidance for applying LLMs.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Beatrice Savoldi, Giuseppe Attanasio, Olga Gorodetskaya, Marta Marchiori Manerba, Elisa Bassignana et al.
ๆฅ็ๆ่ฆ
The rise of Artificial Intelligence (AI) language technologies, particularly generative AI (GenAI) chatbots accessible via conversational interfaces, is transforming digital interactions. While these tools hold societal promise, they also risk widening digital divides due to uneven adoption and low awareness of their limitations. This study presents the first comprehensive empirical mapping of GenAI adoption, usage patterns, and literacy in Italy, based on newly collected survey data from 1,906 Italian-speaking adults. Our findings reveal widespread adoption for both work and personal use, including sensitive tasks like emotional support and medical advice. Crucially, GenAI is supplanting other technologies to become a primary information source: this trend persists despite low user digital literacy, posing a risk as users struggle to recognize errors or misinformation. Moreover, we identify a significant gender divide -- particularly pronounced in older generations -- where women are half as likely to adopt GenAI and use it less frequently than men. While we find literacy to be a key predictor of adoption, it only partially explains this disparity, suggesting that other barriers are at play. Overall, our data provide granular insights into the multipurpose usage of GenAI, highlighting the dual need for targeted educational initiatives and further investigation into the underlying barriers to equitable participation that competence alone cannot explain.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick
Natural Language Processing
ๆฅ็ๆ่ฆ
DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives--parameter-free mean pooling and a learned hierarchical encoder--we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling--where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ahmad Aghaebrahimian
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models have significantly advanced natural language processing tasks, but remain prone to generating incorrect or misleading but plausible arguments. This issue, known as hallucination, is particularly concerning in high-stakes domains like clinical applications, where factual inaccuracies can have severe consequences. Existing evaluation metrics fail to adequately assess factual consistency and lack interpretability, making diagnosing and mitigating errors difficult. We propose an interpretable framework for factual consistency assessment for in-domain and open-domain texts to address these limitations. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. Unlike previous methods with an absolute metric, we incorporate a weighted metric to enhance factual evaluation. Additionally, we propose a mechanism to control assessment complexity in intricate domains. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training in future research.
๐
2025-12-03
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Hanxiu Zhang, Yue Zheng
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The protection of Intellectual Property (IP) in Large Language Models (LLMs) represents a critical challenge in contemporary AI research. While fingerprinting techniques have emerged as a fundamental mechanism for detecting unauthorized model usage, existing methods -- whether behavior-based or structural -- suffer from vulnerabilities such as false claim attacks or susceptible to weight manipulations. To overcome these limitations, we propose SELF, a novel intrinsic weight-based fingerprinting scheme that eliminates dependency on input and inherently resists false claims. SELF achieves robust IP protection through two key innovations: 1) unique, scalable and transformation-invariant fingerprint extraction via singular value and eigenvalue decomposition of LLM attention weights, and 2) effective neural network-based fingerprint similarity comparison based on few-shot learning and data augmentation. Experimental results demonstrate SELF maintains high IP infringement detection accuracy while showing strong robustness against various downstream modifications, including quantization, pruning, and fine-tuning attacks. Our code is available at https://github.com/HanxiuZhang/SELF_v2.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zeba Afroz, Harsh Vardhan, Pawan Bhakuni, Aanchal Punia, Rajdeep Kumar et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Narratives are the cognitive and emotional scaffolds of propaganda. They organize isolated persuasive techniques into coherent stories that justify actions, attribute blame, and evoke identification with ideological camps. In this paper, we propose a novel fine-grained narrative classification in biased news articles. We also explore article-bias classification as the precursor task to narrative classification and fine-grained persuasive technique identification. We develop INDI-PROP, the first ideologically grounded fine-grained narrative dataset with multi-level annotation for analyzing propaganda in Indian news media. Our dataset INDI-PROP comprises 1,266 articles focusing on two polarizing socio-political events in recent times: CAA and the Farmers' protest. Each article is annotated at three hierarchical levels: (i) ideological article-bias (pro-government, pro-opposition, neutral), (ii) event-specific fine-grained narrative frames anchored in ideological polarity and communicative intent, and (iii) persuasive techniques. We propose FANTA and TPTC, two GPT-4o-mini guided multi-hop prompt-based reasoning frameworks for the bias, narrative, and persuasive technique classification. FANTA leverages multi-layered communicative phenomena by integrating information extraction and contextual framing for hierarchical reasoning. On the other hand, TPTC adopts systematic decomposition of persuasive cues via a two-stage approach. Our evaluation suggests substantial improvement over underlying baselines in each case.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huy Quang Ung, Guillaume Habault, Yasutaka Nishimura, Hao Niu, Roberto Legaspi et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The rise of Visual-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs' understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid- and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: https://github.com/ungquanghuy-kddi/CartoMapQA.git
๐
2025-12-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Adithya S Kolavi, Vyoman Jain
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Haohan Yuan, Siu Cheung Hui, Haopeng Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
While the reasoning capabilities of Large Language Models (LLMs) excel in analytical tasks such as mathematics and code generation, their utility for abstractive summarization remains widely assumed but largely unverified. To bridge this gap, we first tailor general reasoning strategies to the summarization domain. We then conduct a systematic, large scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 diverse datasets, assessing both summary quality and faithfulness. Our findings show that reasoning is not a universal solution and its effectiveness is highly dependent on the specific strategy and context. Specifically, we observe a trade-off between summary quality and factual faithfulness: explicit reasoning strategies tend to improve fluency at the expense of factual grounding, while implicit reasoning in LRMs exhibits the inverse pattern. Furthermore, increasing an LRM's internal reasoning budget does not improve, and can even hurt, factual consistency, suggesting that effective summarization demands faithful compression rather than creative over-thinking.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Renqi Chen, Haoyang Su, Shixiang Tang
Machine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhancing SAM's adaptation performance on diverse domains. Despite advancements, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant since the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA incorporates a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy to help the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Various Experiments demonstrate our NAS-LoRA improves existing PEFT methods, while reducing training cost by 24.14% without increasing inference cost, highlighting the potential of NAS in enhancing PEFT for visual foundation models.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Di Xiu, Hongyin Tang, Bolin Rong, Lizhi Yan, Jingang Wang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly prevalent in the field of long-context modeling, however, their inference computational costs have become a critical bottleneck hindering the advancement of tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experimentation. Experiments demonstrate that retaining only the pivotal Keys with the highest similarity to the Query as the context window during the decoding stage achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-$k$ Attention training strategy. Experiments confirm that ensuring the consistency between training and inference regarding Top-$k$ Attention operations facilitates the further unlocking of Top-$k$ Decoding's potential, thereby significantly enhancing model performance. Furthermore, considering the high computational complexity of exact Top-$k$ Attention, we investigate the impact of approximate Top-$k$ algorithm precision on downstream tasks. Our research confirms a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer's precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of Entropy. Experimental observations indicate that models subjected to Top-$k$ Attention SFT exhibit a distinct phenomenon of entropy reduction in downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-$k$ Decoding.
๐
2025-12-03
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Robert Dilworth
ๆฅ็ๆ่ฆ
In this study, we more rigorously evaluated our attack script $\textit{TraceTarnish}$, which leverages adversarial stylometry principles to anonymize the authorship of text-based messages. To ensure the efficacy and utility of our attack, we sourced, processed, and analyzed Reddit comments--comments that were later alchemized into $\textit{TraceTarnish}$ data--to gain valuable insights. The transformed $\textit{TraceTarnish}$ data was then further augmented by $\textit{StyloMetrix}$ to manufacture stylometric features--features that were culled using the Information Gain criterion, leaving only the most informative, predictive, and discriminative ones. Our results found that function words and function word types ($L\_FUNC\_A$ $\&$ $L\_FUNC\_T$); content words and content word types ($L\_CONT\_A$ $\&$ $L\_CONT\_T$); and the Type-Token Ratio ($ST\_TYPE\_TOKEN\_RATIO\_LEMMAS$) yielded significant Information-Gain readings. The identified stylometric cues--function-word frequencies, content-word distributions, and the Type-Token Ratio--serve as reliable indicators of compromise (IoCs), revealing when a text has been deliberately altered to mask its true author. Similarly, these features could function as forensic beacons, alerting defenders to the presence of an adversarial stylometry attack; granted, in the absence of the original message, this signal may go largely unnoticed, as it appears to depend on a pre- and post-transformation comparison. "In trying to erase a trace, you often imprint a larger one." Armed with this understanding, we framed $\textit{TraceTarnish}$'s operations and outputs around these five isolated features, using them to conceptualize and implement enhancements that further strengthen the attack.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shojiro Yamabe, Futa Waseda, Daiki Shiono, Tsubasa Takahashi
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Mimicking human behavior to actively learning from general experience and achieve artificial general intelligence has always been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, i.e., software and math, but still rely heavily on verifiable rewards in specific domains, placing a significant bottleneck to extend the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative contents from pretraining corpus, and reason to predict these contents by RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3 to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base for 8.43, 5.96 and 10.60 on MMLU-Pro, SuperGPQA and math average benchmarks. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yixing Xu, Chao Li, Xuanwu Yin, Spandan Tiwari, Dong Li et al.
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Low-rank adaptation (LoRA) is one of the most popular methods among parameter-efficient fine-tuning (PEFT) methods to adapt pre-trained large language models (LLMs) to specific downstream tasks. However, the model trained based on LoRA often has an unsatisfactory performance due to its low-rank assumption. In this paper, we propose a novel method called Dual LoRA to improve the performance by incorporating an inductive bias into the original LoRA. Specifically, we separate low-rank matrices into two groups: the magnitude group to control whether or not and how far we should update a parameter and the direction group to decide whether this parameter should move forward or backward, to better simulate the parameter updating process of the full fine-tuning based on gradient-based optimization algorithms. We show that this can be simply achieved by adding a ReLU function to the magnitude group and a sign function to the direction group. We conduct several experiments over a wide range of NLP tasks, including natural language generation (NLG), understanding (NLU), and commonsense reasoning datasets on GPT-2, RoBERTa, DeBERTa, and LLaMA-1/2/3 as baseline models. The results show that we consistently outperform LoRA and its state-of-the-art variants with the same number of trainable parameters.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nicholas Tomlin, Naitian Zhou, Eve Fleisig, Liangyuan, Chen et al.
ๆฅ็ๆ่ฆ
Cooperative video games, where multiple participants must coordinate by communicating and reasoning under uncertainty in complex environments, yield a rich source of language data. We collect the Portal Dialogue Corpus: a corpus of 11.5 hours of spoken human dialogue in the co-op mode of the popular Portal 2 virtual puzzle game, comprising 24.5K total utterances. We analyze player language and behavior, identifying a number of linguistic phenomena that rarely appear in most existing chitchat or task-oriented dialogue corpora, including complex spatial reference, clarification and repair, and ad-hoc convention formation. To support future analyses of language use in complex, situated, collaborative problem-solving scenarios, we publicly release the corpus, which comprises player videos, audio, transcripts, game state data, and both manual and automatic annotations of language data.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hanting Chen, Chu Zhong, Kai Han, Yuchuan Tian, Yuchen Liang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose the \textbf{Higher-Order Attention Network (Hon)}, a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Hon dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations \textit{prior} to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs $\mathcal{O}(1)$ additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Hon outperforms standard Transformers on multiple benchmarks.
๐
2025-12-03
๐ ArXiv Preprint (cs.CY)
๐ฅ Elyas Meguellati, Stefano Civelli, Lei Han, Abraham Bernstein, Shazia Sadiq et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As large language models (LLMs) become increasingly capable of generating persuasive content, understanding their effectiveness across different advertising strategies becomes critical. This paper presents a two-part investigation examining LLM-generated advertising through complementary lenses: (1) personality-based and (2) psychological persuasion principles.
In our first study (n=400), we tested whether LLMs could generate personalized advertisements tailored to specific personality traits (openness and neuroticism) and how their performance compared to human experts. Results showed that LLM-generated ads achieved statistical parity with human-written ads (51.1% vs. 48.9%, p > 0.05), with no significant performance differences for matched personalities.
Building on these insights, our second study (n=800) shifted focus from individual personalization to universal persuasion, testing LLM performance across four foundational psychological principles: authority, consensus, cognition, and scarcity. AI-generated ads significantly outperformed human-created content, achieving a 59.1% preference rate (vs. 40.9%, p < 0.001), with the strongest performance in authority (63.0%) and consensus (62.5%) appeals. Qualitative analysis revealed AI's advantage stems from crafting more sophisticated, aspirational messages and achieving superior visual-narrative coherence. Critically, this quality advantage proved robust: even after applying a 21.2 percentage point detection penalty when participants correctly identified AI-origin, AI ads still outperformed human ads, and 29.4% of participants chose AI content despite knowing its origin. These findings demonstrate LLMs' evolution from parity in personalization to superiority in persuasive storytelling, with significant implications for advertising practice given LLMs' near-zero marginal cost and time requirements compared to human experts.
๐
2025-12-03
๐ AAAI 2026
๐ฅ Qingchuan Li, Mingyue Cheng, Zirui Liu, Daoyu Wang, Yuting Zeng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Logical reasoning is a core challenge in natural language understanding and a fundamental capability of artificial intelligence, underpinning scientific discovery, mathematical theorem proving, and complex decision-making. Despite the remarkable progress of large language models (LLMs), most current approaches still rely on forward reasoning paradigms, generating step-by-step rationales from premises to conclusions. However, such methods often suffer from redundant inference paths, hallucinated steps, and semantic drift, resulting in inefficient and unreliable reasoning. In this paper, we propose a novel framework, Hypothesis-driven Backward Logical Reasoning (HBLR). The core idea is to integrate confidence-aware symbolic translation with hypothesis-driven backward reasoning. In the translation phase, only high-confidence spans are converted into logical form, such as First-Order Logic (FOL), while uncertain content remains in natural language. A translation reflection module further ensures semantic fidelity by evaluating symbolic outputs and reverting lossy ones back to text when necessary. In the reasoning phase, HBLR simulates human deductive thinking by assuming the conclusion is true and recursively verifying its premises. A reasoning reflection module further identifies and corrects flawed inference steps, enhancing logical coherence. Extensive experiments on five reasoning benchmarks demonstrate that HBLR consistently outperforms strong baselines in both accuracy and efficiency.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Darshan Fofadiya
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive Language Models (LLMs) trained on Next-Token Prediction (NTP) often suffer from ``Topic Drift'' where the generation wanders away from the initial prompt due to a reliance on local associations rather than global planning \citep{holtzman2019curious}. While scaling model size mitigates this \citep{brown2020language}, the fundamental myopia of the NTP objective remains. In this work, we introduce the Idea-Gated Transformer, a novel architecture that separates semantic planning from syntactic generation. We introduce an auxiliary ``Idea Head'' trained to predict the bag-of-words distribution for a future context window, creating a latent ``Concept Vector'' that actively gates the main vocabulary during generation. We propose a differentiable gating mechanism that suppresses semantically irrelevant tokens, effectively pruning the search space in real-time. Experiments on WikiText-103 demonstrate that while the Idea-Gated model achieves comparable validation perplexity to a standard GPT-2 baseline, it exhibits significantly superior Domain Retention. Qualitative and quantitative analysis reveals that the gating mechanism successfully locks generation into specific semantic clusters (e.g., Finance, Science) and resists associative drift, offering a parameter-efficient path toward more controllable language modeling.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rohan Charudatt Salvi, Chirag Chawla, Dhruv Jain, Swapnil Panigrahi, Md Shad Akhtar et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Automatic medical text simplification plays a key role in improving health literacy by making complex biomedical research accessible to diverse readers. However, most existing resources assume a single generic audience, overlooking the wide variation in medical literacy and information needs across user groups. To address this limitation, we introduce PERCS (Persona-guided Controllable Summarization), a dataset of biomedical abstracts paired with summaries tailored to four personas: Laypersons, Premedical Students, Non-medical Researchers, and Medical Experts. These personas represent different levels of medical literacy and information needs, emphasizing the need for targeted, audience-specific summarization. Each summary in PERCS was reviewed by physicians for factual accuracy and persona alignment using a detailed error taxonomy. Technical validation shows clear differences in readability, vocabulary, and content depth across personas. Along with describing the dataset, we benchmark four large language models on PERCS using automatic evaluation metrics that assess comprehensiveness, readability, and faithfulness, establishing baseline results for future research. The dataset, annotation guidelines, and evaluation materials are publicly available to support research on persona-specific communication and controllable biomedical summarization.
๐
2025-12-03
๐ ArXiv Preprint (cs.SI)
๐ฅ Aliakbar Mehdizadeh, Martin Hilbert
ๆฅ็ๆ่ฆ
A quarter century ago, Wikipedia's decentralized, crowdsourced, and consensus-driven model replaced the centralized, expert-driven, and authority-based standard for encyclopedic knowledge curation. The emergence of generative AI encyclopedias, such as Grokipedia, possibly presents another potential shift in epistemic evolution. This study investigates whether AI- and human-curated encyclopedias rely on the same foundations of authority. We conducted a multi-scale comparative analysis of the citation networks from 72 matched article pairs, which cite a total of almost 60,000 sources. Using an 8-category epistemic classification, we mapped the "epistemic profiles" of the articles on each platform. Our findings reveal several quantitative and qualitative differences in how knowledge is sourced and encyclopedia claims are epistemologically justified. Grokipedia replaces Wikipedia's heavy reliance on peer-reviewed "Academic & Scholarly" work with a notable increase in "User-generated" and "Civic organization" sources. Comparative network analyses further show that Grokipedia employs very different epistemological profiles when sourcing leisure topics (such as Sports and Entertainment) and more societal sensitive civic topics (such as Politics & Conflicts, Geographical Entities, and General Knowledge & Society). Finally, we find a "scaling-law for AI-generated knowledge sourcing" that shows a linear relationship between article length and citation density, which is distinct from collective human reference sourcing. We conclude that this first implementation of an LLM-based encyclopedia does not merely automate knowledge production but restructures it. Given the notable changes and the important role of encyclopedias, we suggest the continuation and deepening of algorithm audits, such as the one presented here, in order to understand the ongoing epistemological shifts.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nemika Tyagi, Nelvin Licona Guevara, Olga Kellert
Natural Language Processing
ๆฅ็ๆ่ฆ
This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaranรญ. Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences, integrated demographic metadata from the Miami Bilingual Corpus, and enriched the Spanish-Guaranรญ dataset with new topic annotations. The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaranรญ and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence. The study demonstrates that large language models can reliably recover interpretable sociolinguistic patterns traditionally accessible only through manual annotation, advancing computational methods for cross-linguistic and low-resource bilingual research.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jonathan Berant, Maximillian Chen, Adam Fisch, Reza Aghajani, Fantine Huot et al.
ๆฅ็ๆ่ฆ
To handle underspecified or ambiguous queries, AI assistants need a policy for managing their uncertainty to determine (a) when to guess the user intent and answer directly, (b) when to enumerate and answer multiple possible intents, and (c) when to ask a clarifying question. However, such policies are contextually dependent on factors such as user preferences or modality. For example, enumerating multiple possible user intentions is cumbersome on small screens or in a voice setting. In this work, we propose to train steerable policies for managing this uncertainty using self-play. Given two agents, one simulating a user and the other an AI assistant, we generate conversations where the user issues a potentially ambiguous query, and the assistant needs to determine how to respond. Importantly, the model takes as input the numerical cost of each clarification question, and each generated word, and is asked to take the action that will maximize its final reward, which is the cost-penalized accuracy. We use Reinforced Self-Training (ReST) to train our model to achieve high reward and show this leads to a steerable policy that changes its behavior predictably conditioned on the provided costs, leading to higher reward and accuracy. Moreover, our procedure also generalizes to numerical cost values that were unobserved at training time.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ashlesha Gopinath Sawant, Sahil S. Jadhav, Vidhan R. Jain, Shriraj S. Jagtap, Prachi Jadhav et al.
ๆฅ็ๆ่ฆ
In todays increasing world, it is very important to have good hailing services like Ola, Uber, and Rapido as it is very essential for our daily transportation. Users often face difficulties in choosing the most appropriate and efficient ride that would lead to both cost-effective and would take us to our destination in less time. This project provides you with the web application that helps you to select the most beneficial ride for you by providing users with the fare comparison between Ola, Uber, Rapido for the destination entered by the user. The backend is use to fetch the data, providing users with the fare comparison for the ride and finally providing with the best option using Python. This research paper also addresses the problem and challenges faced in accessing the data using APIs, Android Studios emulator, Appium and location comparison. Thus, the aim of the project is to provide transparency to the users in ride-hailing services and increase efficiency and provide users with better experience.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Florian Bordes, Candace Ross, Justine T Kao, Evangelia Spiliopoulou, Adina Williams
ๆฅ็ๆ่ฆ
The rapid proliferation of benchmarks has created significant challenges in reproducibility, transparency, and informed decision-making. However, unlike datasets and models -- which benefit from structured documentation frameworks like Datasheets and Model Cards -- evaluation methodologies lack systematic documentation standards. We introduce Eval Factsheets, a structured, descriptive framework for documenting AI system evaluations through a comprehensive taxonomy and questionnaire-based approach. Our framework organizes evaluation characteristics across five fundamental dimensions: Context (Who made the evaluation and when?), Scope (What does it evaluate?), Structure (With what the evaluation is built?), Method (How does it work?) and Alignment (In what ways is it reliable/valid/robust?). We implement this taxonomy as a practical questionnaire spanning five sections with mandatory and recommended documentation elements. Through case studies on multiple benchmarks, we demonstrate that Eval Factsheets effectively captures diverse evaluation paradigms -- from traditional benchmarks to LLM-as-judge methodologies -- while maintaining consistency and comparability. We hope Eval Factsheets are incorporated into both existing and newly released evaluation frameworks and lead to more transparency and reproducibility.
๐
2025-12-03
๐ ArXiv Preprint (quant-ph)
๐ฅ Shashaank Khanna, Matthew Pusey, Roger Colbeck
ๆฅ็ๆ่ฆ
The discovery of Bell that there exist quantum correlations that cannot be reproduced classically is one of the most important in the foundations of quantum mechanics, as well as having practical implications. Bell's result was originally proven in a simple bipartite causal structure, but analogous results have also been shown in further causal structures. Here we study the only causal structure with six or fewer nodes in which the question of whether or not there exist quantum correlations that cannot be achieved classically was open. In this causal structure we show that such quantum correlations exist using a method that involves imposing additional restrictions on the correlations. This hence completes the picture of which causal structures of up to six nodes support non-classical quantum correlations. We also provide further illustrations of our method using other causal structures.
๐
2025-12-03
๐ NeurIPS
๐ฅ Paul Wilson, Fabio Zanasi, George Constantinides
Machine Learning
ๆฅ็ๆ่ฆ
Modern deep learning models require immense computational resources, motivating research into low-precision training. Quantised training addresses this by representing training components in low-bit integers, but typically relies on discretising real-valued updates. We introduce an alternative approach where the update rule itself is discrete, avoiding the quantisation of continuous updates by design. We establish convergence guarantees for a general class of such discrete schemes, and present a multinomial update rule as a concrete example, supported by empirical evaluation. This perspective opens new avenues for efficient training, particularly for models with inherently discrete structure.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yizhou Zhao, Zhiwei Steven Wu, Adam Block
Natural Language Processing
ๆฅ็ๆ่ฆ
Watermarking aims to embed hidden signals in generated text that can be reliably detected when given access to a secret key. Open-weight language models pose acute challenges for such watermarking schemes because the inference-time interventions that dominate contemporary approaches cannot be enforced once model weights are public. Existing watermaking techniques for open-weight models, such as the recently proposed GaussMark, typically rely on small modifications to model weights, which can yield signals detectable to those equipped with a secret key, but achieving detection power comparable to inference-time watermarks generally requires weight perturbations that noticeably reduce generation quality. We introduce MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while simultaneously regularizing against degradation in text quality. We derive MarkTune as an improvement on GaussMark and demonstrate that MarkTune consistently improves the quality-detectability trade-off over GaussMark by steering finer-grained, watermark-aware weight updates within the model's representation space while preserving generation quality. Empirically, we show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and exhibits strong generalization: a model fine-tuned on one dataset retains substantial watermark detection power on unseen datasets. Together, these results establish MarkTune as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sandeep Nagar
RoboticsMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduce significant improvements to normalizing flow architectures through six key innovations: 1) Development of invertible 3x3 Convolution layers with mathematically proven necessary and sufficient conditions for invertibility, (2) introduction of a more efficient Quad-coupling layer, 3) Design of a fast and efficient parallel inversion algorithm for kxk convolutional layers, 4) Fast & efficient backpropagation algorithm for inverse of convolution, 5) Using inverse of convolution, in Inverse-Flow, for the forward pass and training it using proposed backpropagation algorithm, and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and Normalizing Flow layers to reduce parameter count while maintaining performance.
The second part: 1) An automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity and annotation challenges, achieving good accuracy in seed purity testing; 2) An unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) We proposed a privacy preserving method for autonomous driving datasets using on face detection and image inpainting; 4) Utilizing Stable Diffusion based image inpainting for replacing the detected face and license plate to advancing privacy-preserving techniques and ethical considerations in the field.; and 5) An adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hong Yang, Devroop Kar, Qi Yu, Alex Ororbia, Travis Desell
Machine Learning
ๆฅ็ๆ่ฆ
Why do state-of-the-art OOD detection methods exhibit catastrophic failure when models are trained on single-domain datasets? We provide the first theoretical explanation for this phenomenon through the lens of information theory. We prove that supervised learning on single-domain data inevitably produces domain feature collapse -- representations where I(x_d; z) = 0, meaning domain-specific information is completely discarded. This is a fundamental consequence of information bottleneck optimization: models trained on single domains (e.g., medical images) learn to rely solely on class-specific features while discarding domain features, leading to catastrophic failure when detecting out-of-domain samples (e.g., achieving only 53% FPR@95 on MNIST). We extend our analysis using Fano's inequality to quantify partial collapse in practical scenarios. To validate our theory, we introduce Domain Bench, a benchmark of single-domain datasets, and demonstrate that preserving I(x_d; z) > 0 through domain filtering (using pretrained representations) resolves the failure mode. While domain filtering itself is conceptually straightforward, its effectiveness provides strong empirical evidence for our information-theoretic framework. Our work explains a puzzling empirical phenomenon, reveals fundamental limitations of supervised learning in narrow domains, and has broader implications for transfer learning and when to fine-tune versus freeze pretrained models.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
Computer Vision
ๆฅ็ๆ่ฆ
Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To mitigate this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or achieving comparable performance over existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zoรซ Ruha Bell, Anvith Thudi, Olive Franzese-McLaughlin, Nicolas Papernot, Shafi Goldwasser
Machine Learning
ๆฅ็ๆ่ฆ
Training with differential privacy (DP) provides a guarantee to members in a dataset that they cannot be identified by users of the released model. However, those data providers, and, in general, the public, lack methods to efficiently verify that models trained on their data satisfy DP guarantees. The amount of compute needed to verify DP guarantees for current algorithms scales with the amount of compute required to train the model. In this paper we design the first DP algorithm with near optimal privacy-utility trade-offs but whose DP guarantees can be verified cheaper than training. We focus on DP stochastic convex optimization (DP-SCO), where optimal privacy-utility trade-offs are known. Here we show we can obtain tight privacy-utility trade-offs by privately minimizing a series of regularized objectives and only using the standard DP composition bound. Crucially, this method can be verified with much less compute than training. This leads to the first known DP-SCO algorithm with near optimal privacy-utility whose DP verification scales better than training cost, significantly reducing verification costs on large datasets.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis
Machine Learning
ๆฅ็ๆ่ฆ
Cross-entropy (CE) training loss dominates deep learning practice, yet existing theory often relies on simplifications, either replacing it with squared loss or restricting to convex models, that miss essential behavior. CE and squared loss generate fundamentally different dynamics, and convex linear models cannot capture the complexities of non-convex optimization. We provide an in-depth characterization of multi-class CE optimization dynamics beyond the convex regime by analyzing a canonical two-layer linear neural network with standard-basis vectors as inputs: the simplest non-convex extension for which the implicit bias remained unknown. This model coincides with the unconstrained features model used to study neural collapse, making our work the first to prove that gradient flow on CE converges to the neural collapse geometry. We construct an explicit Lyapunov function that establishes global convergence, despite the presence of spurious critical points in the non-convex landscape. A key insight underlying our analysis is an inconspicuous finding: Hadamard Initialization diagonalizes the softmax operator, freezing the singular vectors of the weight matrices and reducing the dynamics entirely to their singular values. This technique opens a pathway for analyzing CE training dynamics well beyond our specific setting considered here.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yanlin Chen, Kehua Chen, Yinhai Wang
ๆฅ็ๆ่ฆ
Traffic state estimation (TSE) becomes challenging when probe-vehicle penetration is low and observations are spatially sparse. Pure data-driven methods lack physical explanations and have poor generalization when observed data is sparse. In contrast, physical models have difficulty integrating uncertainties and capturing the real complexity of traffic. To bridge this gap, recent studies have explored combining them by embedding physical structure into Gaussian process. These approaches typically introduce the governing equations as soft constraints through pseudo-observations, enabling the integration of model structure within a variational framework. However, these methods rely heavily on penalty tuning and lack principled uncertainty calibration, which makes them sensitive to model mis-specification. In this work, we address these limitations by presenting a novel Physics-Embedded Gaussian Process (PEGP), designed to integrate domain knowledge with data-driven methods in traffic state estimation. Specifically, we design two multi-output kernels informed by classic traffic flow models, constructed via the explicit application of the linearized differential operator. Experiments on HighD, NGSIM show consistent improvements over non-physics baselines. PEGP-ARZ proves more reliable under sparse observation, while PEGP-LWR achieves lower errors with denser observation. Ablation study further reveals that PEGP-ARZ residuals align closely with physics and yield calibrated, interpretable uncertainty, whereas PEGP-LWR residuals are more orthogonal and produce nearly constant variance fields. This PEGP framework combines physical priors, uncertainty quantification, which can provide reliable support for TSE.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jialuo Li, Bin Li, Jiahao Li, Yan Lu
Multimodal
ๆฅ็ๆ่ฆ
The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global query and localized query. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically,DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
๐
2025-12-03
๐ AAAI 2026
๐ฅ Oren Rachmil, Roy Betser, Itay Gershon, Omer Hofman, Nitay Yakoby et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Aligning proprietary large language models (LLMs) with internal organizational policies has become an urgent priority as organizations increasingly deploy LLMs in sensitive domains such as legal support, finance, and medical services. Beyond generic safety filters, enterprises require reliable mechanisms to detect policy violations within their regulatory and operational frameworks, where breaches can trigger legal and reputational risks. Existing content moderation frameworks, such as guardrails, remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and lack interpretability. To address these limitations, we propose a training-free and efficient method that treats policy violation detection as an out-of-distribution (OOD) detection problem. Inspired by whitening techniques, we apply a linear transformation to decorrelate the model's hidden activations and standardize them to zero mean and unit variance, yielding near-identity covariance matrix. In this transformed space, we use the Euclidean norm as a compliance score to detect policy violations. The method requires only the policy text and a small number of illustrative samples, which makes it light-weight and easily deployable. On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models. This work provides organizations with a practical and statistically grounded framework for policy-aware oversight of LLMs, advancing the broader goal of deployable AI governance. Code is available at: https://tinyurl.com/policy-violation-detection
๐
2025-12-03
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Paul Fuchs, Julija Zavadlav
Machine Learning
ๆฅ็ๆ่ฆ
Foundational Machine Learning Potentials can resolve the accuracy and transferability limitations of classical force fields. They enable microscopic insights into material behavior through Molecular Dynamics simulations, which can crucially expedite material design and discovery. However, insufficiently broad and systematically biased reference data affect the predictive quality of the learned models. Often, these models exhibit significant deviations from experimentally observed phase transition temperatures, in the order of several hundred kelvins. Thus, fine-tuning is necessary to achieve adequate accuracy in many practical problems. This work proposes a fine-tuning strategy via top-down learning, directly correcting the wrongly predicted transition temperatures to match the experimental reference data. Our approach leverages the Differentiable Trajectory Reweighting algorithm to minimize the free energy differences between phases at the experimental target pressures and temperatures. We demonstrate that our approach can accurately correct the phase diagram of pure Titanium in a pressure range of up to 5 GPa, matching the experimental reference within tenths of kelvins and improving the liquid-state diffusion constant. Our approach is model-agnostic, applicable to multi-component systems with solid-solid and solid-liquid transitions, and compliant with top-down training on other experimental properties. Therefore, our approach can serve as an essential step towards highly accurate application-specific and foundational machine learning potentials.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Franki Nguimatsia Tiofack, Thรฉotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Offline reinforcement learning often relies on behavior regularization that enforces policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: https://simple-robotics.github.io/publications/guided-flow-policy/
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Edson Bollis et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In the vision domain, dataset distillation arises as a technique to condense a large dataset into a smaller synthetic one that exhibits a similar result in the training process. While image data presents an extensive literature of distillation methods, text dataset distillation has fewer works in comparison. Text dataset distillation initially grew as an adaptation of efforts from the vision universe, as the particularities of the modality became clear obstacles, it rose into a separate branch of research. Several milestones mark the development of this area, such as the introduction of methods that use transformer models, the generation of discrete synthetic text, and the scaling to decoder-only models with over 1B parameters. Despite major advances in modern approaches, the field remains in a maturing phase, with room for improvement on benchmarking standardization, approaches to overcome the discrete nature of text, handling complex tasks, and providing explicit examples of real-world applications. In this report, we review past and recent advances in dataset distillation for text, highlighting different distillation strategies, key contributions, and general challenges.
๐
2025-12-03
๐ ArXiv Preprint (eess.IV)
๐ฅ Evan Bell, Shijun Liang, Ismail Alkhouri, Saiprasad Ravishankar
Computer Vision
ๆฅ็ๆ่ฆ
Deep Image Prior (DIP) has recently emerged as a promising one-shot neural-network based image reconstruction method. However, DIP has seen limited application to 3D image reconstruction problems. In this work, we introduce Tada-DIP, a highly effective and fully 3D DIP method for solving 3D inverse problems. By combining input-adaptation and denoising regularization, Tada-DIP produces high-quality 3D reconstructions while avoiding the overfitting phenomenon that is common in DIP. Experiments on sparse-view X-ray computed tomography reconstruction validate the effectiveness of the proposed method, demonstrating that Tada-DIP produces much better reconstructions than training-data-free baselines and achieves reconstruction performance on par with a supervised network trained using a large dataset with fully-sampled volumes.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Michele Alessi, Alessio Ansuini, Alex Rodriguez
ๆฅ็ๆ่ฆ
We introduce Density-Informed VAE (DiVAE), a lightweight, data-driven regularizer that aligns the VAE log-prior probability $\log p_Z(z)$ with a log-density estimated from data. Standard VAEs match latents to a simple prior, overlooking density structure in the data-space. DiVAE encourages the encoder to allocate posterior mass in proportion to data-space density and, when the prior is learnable, nudges the prior toward high-density regions. This is realized by adding a robust, precision-weighted penalty to the ELBO, incurring negligible computational overhead. On synthetic datasets, DiVAE (i) improves distributional alignment of latent log-densities to its ground truth counterpart, (ii) improves prior coverage, and (iii) yields better OOD uncertainty calibration. On MNIST, DiVAE improves alignment of the prior with external estimates of the density, providing better interpretability, and improves OOD detection for learnable priors.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiang Rao, Yina Liu, Yuxuan Shen
Machine Learning
ๆฅ็ๆ่ฆ
Solving partial differential equations (PDEs) for reservoir seepage is critical for optimizing oil and gas field development and predicting production performance. Traditional numerical methods suffer from mesh-dependent errors and high computational costs, while classical Physics-Informed Neural Networks (PINNs) face bottlenecks in parameter efficiency, high-dimensional expression, and strong nonlinear fitting. To address these limitations, we propose a Discrete Variable (DV)-Circuit Quantum-Classical Physics-Informed Neural Network (QCPINN) and apply it to three typical reservoir seepage models for the first time: the pressure diffusion equation for heterogeneous single-phase flow, the nonlinear Buckley-Leverett (BL) equation for two-phase waterflooding, and the convection-diffusion equation for compositional flow considering adsorption. The QCPINN integrates classical preprocessing/postprocessing networks with a DV quantum core, leveraging quantum superposition and entanglement to enhance high-dimensional feature mapping while embedding physical constraints to ensure solution consistency. We test three quantum circuit topologies (Cascade, Cross-mesh, Alternate) and demonstrate through numerical experiments that QCPINNs achieve high prediction accuracy with fewer parameters than classical PINNs. Specifically, the Alternate topology outperforms others in heterogeneous single-phase flow and two-phase BL equation simulations, while the Cascade topology excels in compositional flow with convection-dispersion-adsorption coupling. Our work verifies the feasibility of QCPINN for reservoir engineering applications, bridging the gap between quantum computing research and industrial practice in oil and gas engineering.
๐
2025-12-03
๐ ArXiv Preprint (math.OC)
๐ฅ X. Y. Han, Yuan Zhong
Machine Learning
ๆฅ็ๆ่ฆ
In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training using a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present real experiments on 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE in AI models.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Kenneth Stewart, Roxana Leontie, Samantha Chapin, Joe Hays, Sumit Bam Shrestha et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We present an end-to-end pipeline for deploying reinforcement learning (RL) trained Artificial Neural Networks (ANNs) on neuromorphic hardware by converting them into spiking Sigma-Delta Neural Networks (SDNNs). We demonstrate that an ANN policy trained entirely in simulation can be transformed into an SDNN compatible with Intel's Loihi 2 architecture, enabling low-latency and energy-efficient inference. As a test case, we use an RL policy for controlling the Astrobee free-flying robot, similar to a previously hardware in space-validated controller. The policy, trained with Rectified Linear Units (ReLUs), is converted to an SDNN and deployed on Intel's Loihi 2, then evaluated in NVIDIA's Omniverse Isaac Lab simulation environment for closed-loop control of Astrobee's motion. We compare execution performance between GPU and Loihi 2. The results highlight the feasibility of using neuromorphic platforms for robotic control and establish a pathway toward energy-efficient, real-time neuromorphic computation in future space and terrestrial robotics applications.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Janis Keck, Lukas Silvester Barth, Fatemeh, Fahimi, Parvaneh Joharinad et al.
Machine Learning
ๆฅ็ๆ่ฆ
Fuzzy simplicial sets have become an object of interest in dimensionality reduction and manifold learning, most prominently through their role in UMAP. However, their definition through tools from algebraic topology without a clear probabilistic interpretation detaches them from commonly used theoretical frameworks in those areas. In this work we introduce a framework that explains fuzzy simplicial sets as marginals of probability measures on simplicial sets. In particular, this perspective shows that the fuzzy weights of UMAP arise from a generative model that samples Vietoris-Rips filtrations at random scales, yielding cumulative distribution functions of pairwise distances. More generally, the framework connects fuzzy simplicial sets to probabilistic models on the face poset, clarifies the relation between Kullback-Leibler divergence and fuzzy cross-entropy in this setting, and recovers standard t-norms and t-conorms via Boolean operations on the underlying simplicial sets. We then show how new embedding methods may be derived from this framework and illustrate this on an example where we generalize UMAP using ฤech filtrations with triplet sampling. In summary, this probabilistic viewpoint provides a unified probabilistic theoretical foundation for fuzzy simplicial sets, clarifies the role of UMAP within this framework, and enables the systematic derivation of new dimensionality reduction methods.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Ying-Kuan Tsai, Yi-Ping Chen, Vispi Karkaria, Wei Chen
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Active suspension systems are critical for enhancing vehicle comfort, safety, and stability, yet their performance is often limited by fixed hardware designs and control strategies that cannot adapt to uncertain and dynamic operating conditions. Recent advances in digital twins (DTs) and deep reinforcement learning (DRL) offer new opportunities for real-time, data-driven optimization across a vehicle's lifecycle. However, integrating these technologies into a unified framework remains an open challenge. This work presents a DT-based control co-design (CCD) framework for full-vehicle active suspensions using multi-generation design concepts. By integrating automatic differentiation into DRL, we jointly optimize physical suspension components and control policies under varying driver behaviors and environmental uncertainties. DRL also addresses the challenge of partial observability, where only limited states can be sensed and fed back to the controller, by learning optimal control actions directly from available sensor information. The framework incorporates model updating with quantile learning to capture data uncertainty, enabling real-time decision-making and adaptive learning from digital-physical interactions. The approach demonstrates personalized optimization of suspension systems under two distinct driving settings (mild and aggressive). Results show that the optimized systems achieve smoother trajectories and reduce control efforts by approximately 43% and 52% for mild and aggressive, respectively, while maintaining ride comfort and stability. Contributions include: developing a DT-enabled CCD framework integrating DRL and uncertainty-aware model updating for full-vehicle active suspensions, introducing a multi-generation design strategy for self-improving systems, and demonstrating personalized optimization of active suspension systems for distinct driver types.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haidong Kang, Wei Wu, Hanling Wang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Few-shot class incremental learning (FSCIL) is a more realistic and challenging paradigm in continual learning to incrementally learn unseen classes and overcome catastrophic forgetting on base classes with only a few training examples. Previous efforts have primarily centered around studying more effective FSCIL approaches. By contrast, less attention was devoted to thinking the security issues in contributing to FSCIL. This paper aims to provide a holistic study of the impact of attacks on FSCIL. We first derive insights by systematically exploring how human expert-designed attack methods (i.e., PGD, FGSM) affect FSCIL. We find that those methods either fail to attack base classes, or suffer from huge labor costs due to relying on huge expert knowledge. This highlights the need to craft a specialized attack method for FSCIL. Grounded in these insights, in this paper, we propose a simple yet effective ACraft method to automatically steer and discover optimal attack methods targeted at FSCIL by leveraging Large Language Models (LLMs) without human experts. Moreover, to improve the reasoning between LLMs and FSCIL, we introduce a novel Proximal Policy Optimization (PPO) based reinforcement learning to optimize learning, making LLMs generate better attack methods in the next generation by establishing positive feedback. Experiments on mainstream benchmarks show that our ACraft significantly degrades the performance of state-of-the-art FSCIL methods and dramatically beyond human expert-designed attack methods while maintaining the lowest costs of attack.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Lei Zhang, Diwen Zheng, Kaixin Bai, Zhenshan Bing, Zoltan-Csaba Marton et al.
MultimodalRoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Dexterous grasp generation aims to produce grasp poses that align with task requirements and human interpretable grasp semantics. However, achieving semantically controllable dexterous grasp synthesis remains highly challenging due to the lack of unified modeling of multiple semantic dimensions, including grasp taxonomy, contact semantics, and functional affordance. To address these limitations, we present OmniDexVLG, a multimodal, semantics aware grasp generation framework capable of producing structurally diverse and semantically coherent dexterous grasps under joint language and visual guidance. Our approach begins with OmniDexDataGen, a semantic rich dexterous grasp dataset generation pipeline that integrates grasp taxonomy guided configuration sampling, functional affordance contact point sampling, taxonomy aware differential force closure grasp sampling, and physics based optimization and validation, enabling systematic coverage of diverse grasp types. We further introduce OmniDexReasoner, a multimodal grasp type semantic reasoning module that leverages multi agent collaboration, retrieval augmented generation, and chain of thought reasoning to infer grasp related semantics and generate high quality annotations that align language instructions with task specific grasp intent. Building upon these components, we develop a unified Vision Language Grasping generation model that explicitly incorporates grasp taxonomy, contact structure, and functional affordance semantics, enabling fine grained control over grasp synthesis from natural language instructions. Extensive experiments in simulation and real world object grasping and ablation studies demonstrate that our method substantially outperforms state of the art approaches in terms of grasp diversity, contact semantic diversity, functional affordance diversity, and semantic consistency.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Danny Hoang, Anandkumar Patel, Ruimen Chen, Rajiv Malhotra, Farhad Imani
ๆฅ็ๆ่ฆ
Smart manufacturing can significantly improve efficiency and reduce energy consumption, yet the energy demands of AI models may offset these gains. This study utilizes in-situ sensing-based prediction of geometric quality in smart machining to compare the energy consumption, accuracy, and speed of common AI models. HyperDimensional Computing (HDC) is introduced as an alternative, achieving accuracy comparable to conventional models while drastically reducing energy consumption, 200$\times$ for training and 175 to 1000$\times$ for inference. Furthermore, HDC reduces training times by 200$\times$ and inference times by 300 to 600$\times$, showcasing its potential for energy-efficient smart manufacturing.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gaetano Signorelli, Michele Lombardi
Machine Learning
ๆฅ็ๆ่ฆ
Decision support systems often rely on solving complex optimization problems that may require to estimate uncertain parameters beforehand. Recent studies have shown how using traditionally trained estimators for this task can lead to suboptimal solutions. Using the actual decision cost as a loss function (called Decision Focused Learning) can address this issue, but with a severe loss of scalability at training time. To address this issue, we propose an acceleration method based on replacing costly loss function evaluations with an efficient surrogate. Unlike previously defined surrogates, our approach relies on unbiased estimators reducing the risk of spurious local optima and can provide information on its local confidence allowing one to switch to a fallback method when needed. Furthermore, the surrogate is designed for a black-box setting, which enables compensating for simplifications in the optimization model and account- ing for recourse actions during cost computation. In our results, the method reduces costly inner solver calls, with a solution quality comparable to other state-of-the-art techniques.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Paul Strasser, Andreas Pfeffer, Jakob Weber, Markus Gurtner, Andreas Kรถrner
Machine Learning
ๆฅ็ๆ่ฆ
Neural networks have become a widely adopted tool for modeling nonlinear dynamical systems from data. However, the choice of training strategy remains a key design decision, particularly for simulation tasks. This paper compares two predominant strategies: parallel and series-parallel training. The conducted empirical analysis spans five neural network architectures and two examples: a pneumatic valve test bench and an industrial robot benchmark. The study reveals that, even though series-parallel training dominates current practice, parallel training consistently yields better long-term prediction accuracy. Additionally, this work clarifies the often inconsistent terminology in the literature and relate both strategies to concepts from system identification. The findings suggest that parallel training should be considered the default training strategy for neural network-based simulation of dynamical systems.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dingwei Zhu, Zhiheng Xi, Shihan Dou, Yuhui Wang, Sixian Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision, and applies an asymmetric risk regularization to shape the distribution tails: it contracts the lower tail to dampen noisy negative deviations, while expanding the upper tail to preserve exploratory diversity. Across extensive experiments and analysis in multi-turn dialogue, math reasoning, and scientific QA, DVPO consistently outperforms PPO, GRPO, and robust Bellman-based PPO under noisy supervision, showing its potential for LLM post-training in the real-world.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junlin Chang, Yubo Han, Hnag Yue, John S Thompson, Rongke Liu
ๆฅ็ๆ่ฆ
The widespread adoption of depth sensors has substantially lowered the barrier to point-cloud acquisition. This letter proposes a semantic wireless transmission framework for three dimension (3D) point clouds built on Deep Joint Source - Channel Coding (DeepJSCC). Instead of sending raw features, the transmitter predicts combination weights over a receiver-side semantic orthogonal feature pool, enabling compact representations and robust reconstruction. A folding-based decoder deforms a 2D grid into 3D, enforcing manifold continuity while preserving geometric fidelity. Trained with Chamfer Distance (CD) and an orthogonality regularizer, the system is evaluated on ModelNet40 across varying Signal-to-Noise Ratios (SNRs) and bandwidths. Results show performance on par with SEmantic Point cloud Transmission (SEPT) at high bandwidth and clear gains in bandwidth-constrained regimes, with consistent improvements in both Peak Signal-to-Noise Ratio (PSNR) and CD. Ablation experiments confirm the benefits of orthogonalization and the folding prior.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ahmed Nasser, Marwan Mohamed, Alaa Sherif, Basmala Mahmoud, Shereen Yehia et al.
Machine Learning
ๆฅ็ๆ่ฆ
Egyptian hieroglyphs, the ancient Egyptian writing system, are composed entirely of drawings. Translating these glyphs into English poses various challenges, including the fact that a single glyph can have multiple meanings. Deep learning translation applications are evolving rapidly, producing remarkable results that significantly impact our lives. In this research, we propose a method for the automatic recognition and translation of ancient Egyptian hieroglyphs from images to English. This study utilized two datasets for classification and translation: the Morris Franken dataset and the EgyptianTranslation dataset. Our approach is divided into three stages: segmentation (using Contour and Detectron2), mapping symbols to Gardiner codes, and translation (using the CNN model). The model achieved a BLEU score of 42.2, a significant result compared to previous research.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Timothรฉe Chauvin, Erwan Le Merrer, Franรงois Taรฏani, Gilles Tredan
ๆฅ็ๆ่ฆ
When using an LLM through an API provider, users expect the served model to remain consistent over time, a property crucial for the reliability of downstream applications and the reproducibility of research. Existing audit methods are too costly to apply at regular time intervals to the wide range of available LLM APIs. This means that model updates are left largely unmonitored in practice. In this work, we show that while LLM log probabilities (logprobs) are usually non-deterministic, they can still be used as the basis for cost-effective continuous monitoring of LLM APIs. We apply a simple statistical test based on the average value of each token logprob, requesting only a single token of output. This is enough to detect changes as small as one step of fine-tuning, making this approach more sensitive than existing methods while being 1,000x cheaper. We introduce the TinyChange benchmark as a way to measure the sensitivity of audit methods in the context of small, realistic model changes.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tai Nguyen, Phong Le, Andrรฉ Biedenkapp, Carola Doerr, Nguyen Dang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Dynamic Algorithm Configuration (DAC) studies the efficient identification of control policies for parameterized optimization algorithms. Numerous studies have leveraged the robustness of decision-making in Reinforcement Learning (RL) to address the optimization challenges in algorithm configuration. However, applying RL to DAC is challenging and often requires extensive domain expertise. We conduct a comprehensive study of deep-RL algorithms in DAC through a systematic analysis of controlling the population size parameter of the (1+($ฮป$,$ฮป$))-GA on OneMax instances. Our investigation of DDQN and PPO reveals two fundamental challenges that limit their effectiveness in DAC: scalability degradation and learning instability. We trace these issues to two primary causes: under-exploration and planning horizon coverage, each of which can be effectively addressed through targeted solutions. To address under-exploration, we introduce an adaptive reward shifting mechanism that leverages reward distribution statistics to enhance DDQN agent exploration, eliminating the need for instance-specific hyperparameter tuning and ensuring consistent effectiveness across different problem scales. In dealing with the planning horizon coverage problem, we demonstrate that undiscounted learning effectively resolves it in DDQN, while PPO faces fundamental variance issues that necessitate alternative algorithmic designs. We further analyze the hyperparameter dependencies of PPO, showing that while hyperparameter optimization enhances learning stability, it consistently falls short in identifying effective policies across various configurations. Finally, we demonstrate that DDQN equipped with our adaptive reward shifting strategy achieves performance comparable to theoretically derived policies with vastly improved sample efficiency, outperforming prior DAC approaches by several orders of magnitude.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hanhui Deng, Xinglin Li, Jie Luo, Zhanpeng Jin, Di Wu
Machine Learning
ๆฅ็ๆ่ฆ
Electrocardiogram is a useful diagnostic signal that can detect cardiac abnormalities by measuring the electrical activity generated by the heart. Due to its rapid, non-invasive, and richly informative characteristics, ECG has many emerging applications. In this paper, we study novel deep learning technologies to effectively manage and analyse ECG data, with the aim of building a diagnostic model, accurately and quickly, that can substantially reduce the burden on medical workers. Unlike the existing ECG models that exhibit a high misdiagnosis rate, our deep learning approaches can automatically extract the features of ECG data through end-to-end training. Specifically, we first devise EfficientECG, an accurate and lightweight classification model for ECG analysis based on the existing EfficientNet model, which can effectively handle high-frequency long-sequence ECG data with various leading types. On top of that, we next propose a cross-attention-based feature fusion model of EfficientECG for analysing multi-lead ECG data with multiple features (e.g., gender and age). Our evaluations on representative ECG datasets validate the superiority of our model against state-of-the-art works in terms of high precision, multi-feature fusion, and lightweights.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
MultimodalRoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
๐
2025-12-03
๐ Accepted to the 41st ACM/SIGAPP Symposium On Appli
๐ฅ Francesco Vitale, Nicola Mazzocca
ๆฅ็ๆ่ฆ
Clinical pathways are specialized healthcare plans that model patient treatment procedures. They are developed to provide criteria-based progression and standardize patient treatment, thereby improving care, reducing resource use, and accelerating patient recovery. However, manual modeling of these pathways based on clinical guidelines and domain expertise is difficult and may not reflect the actual best practices for different variations or combinations of diseases. We propose a two-phase modeling method using process mining, which extends the knowledge base of clinical pathways by leveraging conformance checking diagnostics. In the first phase, historical data of a given disease is collected to capture treatment in the form of a process model. In the second phase, new data is compared against the reference model to verify conformance. Based on the conformance checking results, the knowledge base can be expanded with more specific models tailored to new variants or disease combinations. We demonstrate our approach using Synthea, a benchmark dataset simulating patient treatments for SARS-CoV-2 infections with varying COVID-19 complications. The results show that our method enables expanding the knowledge base of clinical pathways with sufficient precision, peaking to 95.62% AUC while maintaining an arc-degree simplicity of 67.11%.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Conor McCarthy, Jan Peter van Zandwijk, Marcel Worring, Zeno Geradts
Machine Learning
ๆฅ็ๆ่ฆ
Smartphones and smartwatches are ever-present in daily life, and provide a rich source of information on their users' behaviour. In particular, digital traces derived from the phone's embedded movement sensors present an opportunity for a forensic investigator to gain insight into a person's physical activities. In this work, we present a machine learning-based approach to translate digital traces into likelihood ratios (LRs) for different types of physical activities. Evaluating on a new dataset, NFI\_FARED, which contains digital traces from four different types of iPhones labelled with 19 activities, it was found that our approach could produce useful LR systems to distinguish 167 out of a possible 171 activity pairings. The same approach was extended to analyse likelihoods for multiple activities (or groups of activities) simultaneously and create activity timelines to aid in both the early and latter stages of forensic investigations. The dataset and all code required to replicate the results have also been made public to encourage further research on this topic.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce \textbf{Doublespeak}, a simple \emph{in-context representation hijacking} attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., \textit{bomb}) with a benign token (e.g., \textit{carrot}) across multiple in-context examples, provided a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., ``How to build a carrot?'') are internally interpreted as disallowed instructions (e.g., ``How to build a bomb?''), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74\% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
๐
2025-12-03
๐ under review for publication in the IEEE
๐ฅ Nir Shlezinger, Santiago Segarra, Yi Zhang, Dvir Avrahami, Zohar Davidov et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Optimization methods play a central role in signal processing, serving as the mathematical foundation for inference, estimation, and control. While classical iterative optimization algorithms provide interpretability and theoretical guarantees, they often rely on surrogate objectives, require careful hyperparameter tuning, and exhibit substantial computational latency. Conversely, machine learning (ML ) offers powerful data-driven modeling capabilities but lacks the structure, transparency, and efficiency needed for optimization-driven inference. Deep unfolding has recently emerged as a compelling framework that bridges these two paradigms by systematically transforming iterative optimization algorithms into structured, trainable ML architectures. This article provides a tutorial-style overview of deep unfolding, presenting a unified perspective of methodologies for converting optimization solvers into ML models and highlighting their conceptual, theoretical, and practical implications. We review the foundations of optimization for inference and for learning, introduce four representative design paradigms for deep unfolding, and discuss the distinctive training schemes that arise from their iterative nature. Furthermore, we survey recent theoretical advances that establish convergence and generalization guarantees for unfolded optimizers, and provide comparative qualitative and empirical studies illustrating their relative trade-offs in complexity, interpretability, and robustness.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stephen Law, Tao Yang, Nanjiang Chen, Xuhui Lin
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Urban analytics increasingly relies on AI-driven trajectory analysis, yet current approaches suffer from methodological fragmentation: trajectory learning captures movement patterns but ignores spatial context, while spatial embedding methods encode street networks but miss temporal dynamics. Three gaps persist: (1) lack of joint training that integrates spatial and temporal representations, (2) origin-agnostic treatment that ignores directional asymmetries in navigation ($A \to B \ne B \to A$), and (3) over-reliance on auxiliary data (POIs, imagery) rather than fundamental geometric properties of urban space. We introduce a conditional trajectory encoder that jointly learns spatial and movement representations while preserving origin-dependent asymmetries using geometric features. This framework decomposes urban navigation into shared cognitive patterns and origin-specific spatial narratives, enabling quantitative measurement of cognitive asymmetries across starting locations. Our bidirectional LSTM processes visibility ratio and curvature features conditioned on learnable origin embeddings, decomposing representations into shared urban patterns and origin-specific signatures through contrastive learning. Results from six synthetic cities and real-world validation on Beijing's Xicheng District demonstrate that urban morphology creates systematic cognitive inequalities. This provides urban planners quantitative tools for assessing experiential equity, offers architects insights into layout decisions' cognitive impacts, and enables origin-aware analytics for navigation systems.
๐
2025-12-03
๐ NeurIPS 2025
๐ฅ Sathya Edamadaka, Soojung Yang, Ju Li, Rafael Gรณmez-Bombarelli
Machine Learning
ๆฅ็ๆ่ฆ
Machine learning models of vastly different modalities and architectures are being trained to predict the behavior of molecules, materials, and proteins. However, it remains unclear whether they learn similar internal representations of matter. Understanding their latent structure is essential for building scientific foundation models that generalize reliably beyond their training domains. Although representational convergence has been observed in language and vision, its counterpart in the sciences has not been systematically explored. Here, we show that representations learned by nearly sixty scientific models, spanning string-, graph-, 3D atomistic, and protein-based modalities, are highly aligned across a wide range of chemical systems. Models trained on different datasets have highly similar representations of small molecules, and machine learning interatomic potentials converge in representation space as they improve in performance, suggesting that foundation models learn a common underlying representation of physical reality. We then show two distinct regimes of scientific models: on inputs similar to those seen during training, high-performing models align closely and weak models diverge into local sub-optima in representation space; on vastly different structures from those seen during training, nearly all models collapse onto a low-information representation, indicating that today's models remain limited by training data and inductive bias and do not yet encode truly universal structure. Our findings establish representational alignment as a quantitative benchmark for foundation-level generality in scientific models. More broadly, our work can track the emergence of universal representations of matter as models scale, and for selecting and distilling models whose learned representations transfer best across modalities, domains of matter, and scientific tasks.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuhui Lin, Qiuchen Lu
Machine Learning
ๆฅ็ๆ่ฆ
Urban transportation systems face increasing resilience challenges from extreme weather events, but current assessment methods rely on surface-level recovery indicators that miss hidden structural damage. Existing approaches cannot distinguish between true recovery and "false recovery," where traffic metrics normalize, but the underlying system dynamics permanently degrade. To address this, a new physics-constrained Hamiltonian learning algorithm combining "structural irreversibility detection" and "energy landscape reconstruction" has been developed. Our approach extracts low-dimensional state representations, identifies quasi-Hamiltonian structures through physics-constrained optimization, and quantifies structural changes via energy landscape comparison. Analysis of London's extreme rainfall in 2021 demonstrates that while surface indicators were fully recovered, our algorithm detected 64.8\% structural damage missed by traditional monitoring. Our framework provides tools for proactive structural risk assessment, enabling infrastructure investments based on true system health rather than misleading surface metrics.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Kehlani Fay, Darin Anthony Djapri, Anya Zorin, James Clinton, Ali El Lahib et al.
Robotics
ๆฅ็ๆ่ฆ
Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework will be open-sourced and available on our website.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Kenneth Stewart, Samantha Chapin, Roxana Leontie, Carl Glen Henshaw
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) offers transformative potential for robotic control in space. We present the first on-orbit demonstration of RL-based autonomous control of a free-flying robot, the NASA Astrobee, aboard the International Space Station (ISS). Using NVIDIA's Omniverse physics simulator and curriculum learning, we trained a deep neural network to replace Astrobee's standard attitude and translation control, enabling it to navigate in microgravity. Our results validate a novel training pipeline that bridges the simulation-to-reality (Sim2Real) gap, utilizing a GPU-accelerated, scientific-grade simulation environment for efficient Monte Carlo RL training. This successful deployment demonstrates the feasibility of training RL policies terrestrially and transferring them to space-based applications. This paves the way for future work in In-Space Servicing, Assembly, and Manufacturing (ISAM), enabling rapid on-orbit adaptation to dynamic mission requirements.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Samantha Chapin, Kenneth Stewart, Roxana Leontie, Carl Glen Henshaw
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The US Naval Research Laboratory's (NRL's) Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) experiment pioneers the use of reinforcement learning (RL) for control of free-flying robots in the zero-gravity (zero-G) environment of space. On Tuesday, May 27th 2025 the APIARY team conducted the first ever, to our knowledge, RL control of a free-flyer in space using the NASA Astrobee robot on-board the International Space Station (ISS). A robust 6-degrees of freedom (DOF) control policy was trained using an actor-critic Proximal Policy Optimization (PPO) network within the NVIDIA Isaac Lab simulation environment, randomizing over goal poses and mass distributions to enhance robustness. This paper details the simulation testing, ground testing, and flight validation of this experiment. This on-orbit demonstration validates the transformative potential of RL for improving robotic autonomy, enabling rapid development and deployment (in minutes to hours) of tailored behaviors for space exploration, logistics, and real-time mission needs.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Lorenzo Marinucci, Leonardo Di Nino, Gabriele D'Acunto, Mario Edoardo Pandolfo, Paolo Di Lorenzo et al.
ๆฅ็ๆ่ฆ
Probabilistic Graphical Models (PGMs) encode conditional dependencies among random variables using a graph -nodes for variables, links for dependencies- and factorize the joint distribution into lower-dimensional components. This makes PGMs well-suited for analyzing complex systems and supporting decision-making. Recent advances in topological signal processing highlight the importance of variables defined on topological spaces in several application domains. In such cases, the underlying topology shapes statistical relationships, limiting the expressiveness of canonical PGMs. To overcome this limitation, we introduce Colored Markov Random Fields (CMRFs), which model both conditional and marginal dependencies among Gaussian edge variables on topological spaces, with a theoretical foundation in Hodge theory. CMRFs extend classical Gaussian Markov Random Fields by including link coloring: connectivity encodes conditional independence, while color encodes marginal independence. We quantify the benefits of CMRFs through a distributed estimation case study over a physical network, comparing it with baselines with different levels of topological prior.
๐
2025-12-03
๐ ArXiv Preprint (cs.IT)
๐ฅ Seyed Mohammad Azimi-Abarghouyi, Carlo Fischione, Kaibin Huang
Machine Learning
ๆฅ็ๆ่ฆ
Over-the-Air Federated Learning (AirFL) is an emerging paradigm that tightly integrates wireless signal processing and distributed machine learning to enable scalable AI at the network edge. By leveraging the superposition property of wireless signals, AirFL performs communication and model aggregation of the learning process simultaneously, significantly reducing latency, bandwidth, and energy consumption. This article offers a tutorial treatment of AirFL, presenting a novel classification into three design approaches: CSIT-aware, blind, and weighted AirFL. We provide a comprehensive guide to theoretical foundations, performance analysis, complexity considerations, practical limitations, and prospective research directions.
๐
2025-12-03
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Vahid Nateghi, Lara Neureither, Selma Moqvist, Carsten Hartmann, Simon Olsson et al.
ๆฅ็ๆ่ฆ
Coarse graining (CG) is an important task for efficient modeling and simulation of complex multi-scale systems, such as the conformational dynamics of biomolecules. This work presents a projection-based coarse-graining formalism for general underdamped Langevin dynamics. Following the Zwanzig projection approach, we derive a closed-form expression for the coarse grained dynamics. In addition, we show how the generator Extended Dynamic Mode Decomposition (gEDMD) method, which was developed in the context of Koopman operator methods, can be used to model the CG dynamics and evaluate its kinetic properties, such as transition timescales. Finally, we combine our approach with thermodynamic interpolation (TI), a generative approach to transform samples between thermodynamic conditions, to extend the scope of the approach across thermodynamic states without repeated numerical simulations. Using a two-dimensional model system, we demonstrate that the proposed method allows to accurately capture the thermodynamic and kinetic properties of the full-space model.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammad Doost, Mohammad Manthouri
Machine Learning
ๆฅ็ๆ่ฆ
We propose a novel QTGNN framework for detecting fraudulent transactions in large-scale financial networks. By integrating quantum embedding, variational graph convolutions, and topological data analysis, QTGNN captures complex transaction dynamics and structural anomalies indicative of fraud. The methodology includes quantum data embedding with entanglement enhancement, variational quantum graph convolutions with non-linear dynamics, extraction of higher-order topological invariants, hybrid quantum-classical anomaly learning with adaptive optimization, and interpretable decision-making via topological attribution. Rigorous convergence guarantees ensure stable training on noisy intermediate-scale quantum (NISQ) devices, while stability of topological signatures provides robust fraud detection. Optimized for NISQ hardware with circuit simplifications and graph sampling, the framework scales to large transaction networks. Simulations on financial datasets, such as PaySim and Elliptic, benchmark QTGNN against classical and quantum baselines, using metrics like ROC-AUC, precision, and false positive rate. An ablation study evaluates the contributions of quantum embeddings, topological features, non-linear channels, and hybrid learning. QTGNN offers a theoretically sound, interpretable, and practical solution for financial fraud detection, bridging quantum machine learning, graph theory, and topological analysis.
๐
2025-12-03
๐ NeurIPS 2025
๐ฅ Hao-Run Cai, Han-Jia Ye
Machine Learning
ๆฅ็ๆ่ฆ
While tabular machine learning has achieved remarkable success, temporal distribution shifts pose significant challenges in real-world deployment, as the relationships between features and labels continuously evolve. Static models assume fixed mappings to ensure generalization, whereas adaptive models may overfit to transient patterns, creating a dilemma between robustness and adaptability. In this paper, we analyze key factors essential for constructing an effective dynamic mapping for temporal tabular data. We discover that evolving feature semantics-particularly objective and subjective meanings-introduce concept drift over time. Crucially, we identify that feature transformation strategies are able to mitigate discrepancies in feature representations across temporal stages. Motivated by these insights, we propose a feature-aware temporal modulation mechanism that conditions feature representations on temporal context, modulating statistical properties such as scale and skewness. By aligning feature semantics across time, our approach achieves a lightweight yet powerful adaptation, effectively balancing generalizability and adaptability. Benchmark evaluations validate the effectiveness of our method in handling temporal shifts in tabular data.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alex Ferrando, Xavier Suau, Jordi Gonzร lez, Pau Rodriguez
Machine Learning
ๆฅ็ๆ่ฆ
Activation steering has emerged as a powerful method for guiding the behavior of generative models towards desired outcomes such as toxicity mitigation. However, most existing methods apply interventions uniformly across all inputs, degrading model performance when steering is unnecessary. We introduce Dynamically Scaled Activation Steering (DSAS), a method-agnostic steering framework that decouples when to steer from how to steer. DSAS adaptively modulates the strength of existing steering transformations across layers and inputs, intervening strongly only when undesired behavior is detected. At generation time, DSAS computes context-dependent scaling factors that selectively adjust the strength of any steering method. We also show how DSAS can be jointly optimized end-to-end together with the steering function. When combined with existing steering methods, DSAS consistently improves the Pareto front with respect to steering alone, achieving a better trade-off between toxicity mitigation and utility preservation. We further demonstrate DSAS's generality by applying it to a text-to-image diffusion model, showing how adaptive steering allows the modulation of specific concepts. Finally, DSAS introduces minimal computational overhead while improving interpretability, pinpointing which tokens require steering and by how much.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Salim Khazem, Houssam Kanso
Machine Learning
ๆฅ็ๆ่ฆ
Accurate electricity consumption forecasting is essential for demand management and smart grid operations. This paper introduces a unified deep learning framework that integrates cyclical temporal encoding with hybrid LSTM-CNN architectures to enhance multistep energy forecasting. We systematically transform calendar-based attributes using sine cosine encodings to preserve periodic structure and evaluate their predictive relevance through correlation analysis. To exploit both long-term seasonal effects and short-term local patterns, we employ an ensemble model composed of an LSTM, a CNN, and a meta-learner of MLP regressors specialized for each forecast horizon. Using a one year national consumption dataset, we conduct an extensive experimental study including ablation analyses with and without cyclical encodings and calendar features and comparisons with established baselines from the literature. Results demonstrate consistent improvements across all seven forecast horizons, with our hybrid model achieving lower RMSE and MAE than individual architectures and prior methods. These findings confirm the benefit of combining cyclical temporal representations with complementary deep learning structures. To our knowledge, this is the first work to jointly evaluate temporal encodings, calendar-based features, and hybrid ensemble architectures within a unified short-term energy forecasting framework.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jan Saynisch-Wagner, Saran Rajendran Sari
Machine Learning
ๆฅ็ๆ่ฆ
This study proposes a method to enhance neural network performance when training data and application data are not very similar, e.g., out of distribution problems, as well as pattern and regime shifts. The method consists of three main steps: 1) Retrain the neural network towards reasonable subsets of the training data set and note down the resulting weight anomalies. 2) Choose reasonable predictors and derive a regression between the predictors and the weight anomalies. 3) Extrapolate the weights, and thereby the neural network, to the application data. We show and discuss this method in three use cases from the climate sciences, which include successful temporal, spatial and cross-domain extrapolations of neural networks.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick
Natural Language Processing
ๆฅ็ๆ่ฆ
DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives--parameter-free mean pooling and a learned hierarchical encoder--we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling--where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
๐
2025-12-03
๐ ArXiv Preprint (cs.SD)
๐ฅ Kohei Yamamoto, Kosuke Okusa
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer-based audio SSL (self-supervised learning) models often treat spectrograms as images, applying convolutional patchification with heavy temporal downsampling. This lowers the effective Nyquist frequency and introduces aliasing, while naรฏve low-pass filtering removes task-relevant high-frequency cues. In this study, we present Aliasing-aware Patch Embedding (AaPE), a drop-in patch stem that mitigates aliasing while preserving high-frequency information. AaPE augments standard patch tokens with features produced by a band-limited complex sinusoidal kernel using a two-sided exponential window that dynamically targets alias-prone bands. Frequency and decay parameters of the kernel are estimated from the input, enabling parallel, adaptive subband analysis whose outputs are fused with the standard patch tokens. AaPE integrates seamlessly into the masked teacher-student self-supervised learning. In addition, we combine a multi-mask strategy with a contrastive objective to enforce consistency across diverse mask patterns, stabilizing training. Pre-training on AudioSet followed by fine-tuning evaluation across diverse downstream benchmarks, which spanned categories, such as environmental sounds and other common audio domains. This approach yields state-of-the-art performance on a subset of tasks and competitive results across the remainder. Complementary linear probing evaluation mirrors this pattern, yielding clear gains on several benchmarks and strong performance elsewhere. The collective analysis of these results indicates that AaPE serves to mitigate the effects of aliasing without discarding of informative high-frequency content.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Edward C. C. Steele, Dinesh Mane, Emilio Monti, Luis Orus, Rebecca Chantrill-Cheyette et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Despite the promising capability of multimodal foundation models, their application to the generation of meteorological products and services remains nascent. To accelerate aspiration and adoption, we explore the novel use of a vision language model for writing the iconic Shipping Forecast text directly from video-encoded gridded weather data. These early results demonstrate promising scalable technological opportunities for enhancing production efficiency and service innovation within the weather enterprise and beyond.
๐
2025-12-03
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Hanxiu Zhang, Yue Zheng
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The protection of Intellectual Property (IP) in Large Language Models (LLMs) represents a critical challenge in contemporary AI research. While fingerprinting techniques have emerged as a fundamental mechanism for detecting unauthorized model usage, existing methods -- whether behavior-based or structural -- suffer from vulnerabilities such as false claim attacks or susceptible to weight manipulations. To overcome these limitations, we propose SELF, a novel intrinsic weight-based fingerprinting scheme that eliminates dependency on input and inherently resists false claims. SELF achieves robust IP protection through two key innovations: 1) unique, scalable and transformation-invariant fingerprint extraction via singular value and eigenvalue decomposition of LLM attention weights, and 2) effective neural network-based fingerprint similarity comparison based on few-shot learning and data augmentation. Experimental results demonstrate SELF maintains high IP infringement detection accuracy while showing strong robustness against various downstream modifications, including quantization, pruning, and fine-tuning attacks. Our code is available at https://github.com/HanxiuZhang/SELF_v2.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Julius Lenz
Machine Learning
ๆฅ็ๆ่ฆ
Merging neural networks without retraining is central to federated and distributed learning. Common methods such as weight averaging or Fisher merging often lose accuracy and are unstable across seeds. CoGraM (Contextual Granular Merging) is a multi-stage, context-sensitive, loss-based, and iterative optimization method across layers, neurons, and weight levels that aligns decisions with loss differences and thresholds and prevents harmful updates through rollback. CoGraM is an optimization method that addresses the weaknesses of methods such as Fisher and can significantly improve the merged network.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matteo Peduto, Qidong Yang, Jonathan Giezendanner, Devis Tuia, Sherrie Wang
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Accurate marine wind forecasts are essential for safe navigation, ship routing, and energy operations, yet they remain challenging because observations over the ocean are sparse, heterogeneous, and temporally variable. We reformulate wind forecasting as observation-informed correction of a global numerical weather prediction (NWP) model. Rather than forecasting winds directly, we learn local correction patterns by assimilating the latest in-situ observations to adjust the Global Forecast System (GFS) output. We propose a transformer-based deep learning architecture that (i) handles irregular and time-varying observation sets through masking and set-based attention mechanisms, (ii) conditions predictions on recent observation-forecast pairs via cross-attention, and (iii) employs cyclical time embeddings and coordinate-aware location representations to enable single-pass inference at arbitrary spatial coordinates. We evaluate our model over the Atlantic Ocean using observations from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS) as reference. The model reduces GFS 10-meter wind RMSE at all lead times up to 48 hours, achieving 45% improvement at 1-hour lead time and 13% improvement at 48-hour lead time. Spatial analyses reveal the most persistent improvements along coastlines and shipping routes, where observations are most abundant. The tokenized architecture naturally accommodates heterogeneous observing platforms (ships, buoys, tide gauges, and coastal stations) and produces both site-specific predictions and basin-scale gridded products in a single forward pass. These results demonstrate a practical, low-latency post-processing approach that complements NWP by learning to correct systematic forecast errors.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Thomas Grรคupl, Andreas Reisenbauer, Marcel Hecko, Anil Rasouli, Anita Graser et al.
ๆฅ็ๆ่ฆ
This paper presents the VesselEdge system, which leverages federated learning and bandwidth-constrained trajectory compression to enhance maritime situational awareness by extending AIS coverage. VesselEdge transforms vessels into mobile sensors, enabling real-time anomaly detection and efficient data transmission over low-bandwidth connections. The system integrates the M3fed model for federated learning and the BWC-DR-A algorithm for trajectory compression, prioritizing anomalous data. Preliminary results demonstrate the effectiveness of VesselEdge in improving AIS coverage and situational awareness using historical data.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sanjit Dandapanthula, Aleksandr Podkopaev, Shiva Prasad Kasiviswanathan, Aaditya Ramdas, Ziv Goldfeld
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Optimal transport (OT) and Gromov-Wasserstein (GW) alignment provide interpretable geometric frameworks for comparing, transforming, and aggregating heterogeneous datasets -- tasks ubiquitous in data science and machine learning. Because these frameworks are computationally expensive, large-scale applications often rely on closed-form solutions for Gaussian distributions under quadratic cost. This work provides a comprehensive treatment of Gaussian, quadratic cost OT and inner product GW (IGW) alignment, closing several gaps in the literature to broaden applicability. First, we treat the open problem of IGW alignment between uncentered Gaussians on separable Hilbert spaces by giving a closed-form expression up to a quadratic optimization over unitary operators, for which we derive tight analytic upper and lower bounds. If at least one Gaussian measure is centered, the solution reduces to a fully closed-form expression, which we further extend to an analytic solution for the IGW barycenter between centered Gaussians. We also present a reduction of Gaussian multimarginal OT with pairwise quadratic costs to a tractable optimization problem and provide an efficient algorithm to solve it using a rank-deficiency constraint. To demonstrate utility, we apply our results to knowledge distillation and heterogeneous clustering on synthetic and real-world datasets.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Florent Forest, Amaury Wei, Olga Fink
Machine Learning
ๆฅ็ๆ่ฆ
Time series extrinsic regression (TSER) refers to the task of predicting a continuous target variable from an input time series. It appears in many domains, including healthcare, finance, environmental monitoring, and engineering. In these settings, accurate predictions and trustworthy reasoning are both essential. Although state-of-the-art TSER models achieve strong predictive performance, they typically operate as black boxes, making it difficult to understand which temporal patterns drive their decisions. Post-hoc interpretability techniques, such as feature attribution, aim to to explain how the model arrives at its predictions, but often produce coarse, noisy, or unstable explanations. Recently, inherently interpretable approaches based on concepts, additive decompositions, or symbolic regression, have emerged as promising alternatives. However, these approaches remain limited: they require explicit supervision on the concepts themselves, often cannot capture interactions between time-series features, lack expressiveness for complex temporal patterns, and struggle to scale to high-dimensional multivariate data.
To address these limitations, we propose MAGNETS (Mask-and-AGgregate NEtwork for Time Series), an inherently interpretable neural architecture for TSER. MAGNETS learns a compact set of human-understandable concepts without requiring any annotations. Each concept corresponds to a learned, mask-based aggregation over selected input features, explicitly revealing both which features drive predictions and when they matter in the sequence. Predictions are formed as combinations of these learned concepts through a transparent, additive structure, enabling clear insight into the model's decision process.
๐
2025-12-03
๐ NeurIPS 2025
๐ฅ Zhening Li, Armando Solar-Lezama, Yisong Yue, Stephan Zheng
ๆฅ็ๆ่ฆ
We introduce a new approach to agent programming, the development of LLM-based agents. Current approaches to agent programming often entangle two aspects of agent design: the core workflow logic and the inference-time strategy (e.g., tree search). We introduce "probabilistic angelic nondeterminism" ("PAN"), a programming model that disentangles these two concerns, allowing the programmer to describe the agent workflow and independently experiment with different inference-time strategies by simply changing a few inputs. We provide an implementation of PAN in Python as the EnCompass framework, which uses a Python decorator to compile agent workflow programs into a search space. We present three case studies that demonstrate how the framework lets the programmer quickly improve the reliability of an agent and easily switch between different inference-time strategies, all with little additional coding.
๐
2025-12-03
๐ ArXiv Preprint (cs.NI)
๐ฅ Stefano Scanzio, Gabriele Formis, Tullio Facchinetti, Gianluca Cena
Machine Learning
ๆฅ็ๆ่ฆ
Wireless sensor networks (WSNs) are employed across a wide range of industrial applications where ultra-low power consumption is a critical prerequisite. At the same time, these systems must maintain a certain level of determinism to ensure reliable and predictable operation. In this view, time slotted channel hopping (TSCH) is a communication technology that meets both conditions, making it an attractive option for its usage in industrial WSNs. This work proposes the use of machine learning to learn the traffic pattern generated in networks based on the TSCH protocol, in order to turn nodes into a deep sleep state when no transmission is planned and thus to improve the energy efficiency of the WSN. The ability of machine learning models to make good predictions at different network levels in a typical tree network topology was analyzed in depth, showing how their capabilities degrade while approaching the root of the tree. The application of these models on simulated data based on an accurate modeling of wireless sensor nodes indicates that the investigated algorithms can be suitably used to further and substantially reduce the power consumption of a TSCH network.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xun Yuan, Zilong Zhao, Jiayu Li, Aryan Pasikhani, Prosanta Gope et al.
ๆฅ็ๆ่ฆ
Diffusion models are renowned for their state-of-the-art performance in generating synthetic images. However, concerns related to safety, privacy, and copyright highlight the need for machine unlearning, which can make diffusion models forget specific training data and prevent the generation of sensitive or unwanted content. Current machine unlearning methods for diffusion models are primarily designed for conditional diffusion models and focus on unlearning specific data classes or features. Among these methods, finetuning-based machine unlearning methods are recognized for their efficiency and effectiveness, which update the parameters of pre-trained diffusion models by minimizing carefully designed loss functions. However, in this paper, we propose a novel attack named Diffusion Model Relearning Attack (DiMRA), which can reverse the finetuning-based machine unlearning methods, posing a significant vulnerability of this kind of technique. Without prior knowledge of the unlearning elements, DiMRA optimizes the unlearned diffusion model on an auxiliary dataset to reverse the unlearning, enabling the model to regenerate previously unlearned elements. To mitigate this vulnerability, we propose a novel machine unlearning method for diffusion models, termed as Diffusion Model Unlearning by Memorization (DiMUM). Unlike traditional methods that focus on forgetting, DiMUM memorizes alternative data or features to replace targeted unlearning data or features in order to prevent generating such elements. In our experiments, we demonstrate the effectiveness of DiMRA in reversing state-of-the-art finetuning-based machine unlearning methods for diffusion models, highlighting the need for more robust solutions. We extensively evaluate DiMUM, demonstrating its superior ability to preserve the generative performance of diffusion models while enhancing robustness against DiMRA.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhiming Xu, Baile Xu, Jian Zhao, Furao Shen, Suorong Yang
ๆฅ็ๆ่ฆ
Existing class-incremental learning (CIL) approaches based on replay or knowledge distillation are often constrained by forgetting or the stability-plasticity dilemma. Some expansion-based approaches could achieve higher accuracy. However, they always require significant parameter increases. In this paper, we propose a plugin extension paradigm termed the Deployment of extra LoRA Components (DLC) for non-pre-trained CIL scenarios.We treat the feature extractor trained through replay or distillation as a base model with rich knowledge. For each task, we use Low-Rank Adaptation (LoRA) to inject task-specific residuals into the base model's deep layers. During inference, representations with task-specific residuals are aggregated to produce classification predictions. To mitigate interference from non-target LoRA plugins, we introduce a lightweight weighting unit. This unit learns to assign importance scores to different LoRA-tuned representations. Like downloadable contents in software, our method serves as a plug-and-play enhancement that efficiently extends the base methods. Remarkably, on the large-scale ImageNet-100, with merely 4 % of the parameters of a standard ResNet-18, our DLC model achieves a significant 8 % improvement in accuracy, demonstrating exceptional efficiency. Moreover, it could surpass state-of-the-art methods under the fixed memory budget.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Adil Rasheed, Mikael Aleksander Jansen Shahly, Muhammad Faisal Aftab
Machine Learning
ๆฅ็ๆ่ฆ
Compressed sensing enables sparse sampling but relies on generic bases and random measurements, limiting efficiency and reconstruction quality. Optimal sensor placement uses historcal data to design tailored sampling patterns, yet its fixed, linear bases cannot adapt to nonlinear or sample-specific variations. Generative model-based compressed sensing improves reconstruction using deep generative priors but still employs suboptimal random sampling. We propose an adaptive sparse sensing framework that couples a variational autoencoder prior with reinforcement learning to select measurements sequentially. Experiments show that this approach outperforms CS, OSP, and Generative model-based reconstruction from sparse measurements.
๐
2025-12-03
๐ AAAI 2026
๐ฅ Xiaosen Lyu, Jiayu Xiong, Yuren Chen, Wanlong Wang, Xiaoqing Dai et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers' emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross-modal interactions or experience gradient conflicts and unstable training when using deeper architectures. To address these issues, we propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component. Synergistic Polynomial Fusion (SPF) serves the representation role, leveraging low-rank tensor factorization to efficiently capture high-order cross-modal interactions. Pareto Gradient Modulator (PGM) serves the optimization role, steering updates along Pareto-optimal directions across competing objectives to alleviate gradient conflicts and improve stability. Experiments show that CSS outperforms existing representative methods on IEMOCAP and MELD in both accuracy and training stability, demonstrating its effectiveness in complex multimodal scenarios.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuanxuan Yang, Xiuyang Zhang, Haofeng Chen, Gang Ma, Xiaojie Wang
ๆฅ็ๆ่ฆ
Electrical impedance tomography (EIT) provides an attractive solution for large-area tactile sensing due to its minimal wiring and shape flexibility, but its nonlinear inverse problem often leads to severe artifacts and inaccurate contact reconstruction. This work presents PhyDNN, a physics-driven deep reconstruction framework that embeds the EIT forward model directly into the learning objective. By jointly minimizing the discrepancy between predicted and ground-truth conductivity maps and enforcing consistency with the forward PDE, PhyDNN reduces the black-box nature of deep networks and improves both physical plausibility and generalization. To enable efficient backpropagation, we design a differentiable forward-operator network that accurately approximates the nonlinear EIT response, allowing fast physics-guided training. Extensive simulations and real tactile experiments on a 16-electrode soft sensor show that PhyDNN consistently outperforms NOSER, TV, and standard DNNs in reconstructing contact shape, location, and pressure distribution. PhyDNN yields fewer artifacts, sharper boundaries, and higher metric scores, demonstrating its effectiveness for high-quality tomographic tactile sensing.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Di Xiu, Hongyin Tang, Bolin Rong, Lizhi Yan, Jingang Wang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly prevalent in the field of long-context modeling, however, their inference computational costs have become a critical bottleneck hindering the advancement of tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experimentation. Experiments demonstrate that retaining only the pivotal Keys with the highest similarity to the Query as the context window during the decoding stage achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-$k$ Attention training strategy. Experiments confirm that ensuring the consistency between training and inference regarding Top-$k$ Attention operations facilitates the further unlocking of Top-$k$ Decoding's potential, thereby significantly enhancing model performance. Furthermore, considering the high computational complexity of exact Top-$k$ Attention, we investigate the impact of approximate Top-$k$ algorithm precision on downstream tasks. Our research confirms a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer's precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of Entropy. Experimental observations indicate that models subjected to Top-$k$ Attention SFT exhibit a distinct phenomenon of entropy reduction in downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-$k$ Decoding.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Antonin Sulc
Machine Learning
ๆฅ็ๆ่ฆ
We propose Modal Logical Neural Networks (MLNNs), a neurosymbolic framework that integrates deep learning with the formal semantics of modal logic, enabling reasoning about necessity and possibility. Drawing on Kripke semantics, we introduce specialized neurons for the modal operators $\Box$ and $\Diamond$ that operate over a set of possible worlds, enabling the framework to act as a differentiable ``logical guardrail.'' The architecture is highly flexible: the accessibility relation between worlds can either be fixed by the user to enforce known rules or, as an inductive feature, be parameterized by a neural network. This allows the model to optionally learn the relational structure of a logical system from data while simultaneously performing deductive reasoning within that structure.
This versatile construction is designed for flexibility. The entire framework is differentiable from end to end, with learning driven by minimizing a logical contradiction loss. This not only makes the system resilient to inconsistent knowledge but also enables it to learn nonlinear relationships that can help define the logic of a problem space. We illustrate MLNNs on four case studies: grammatical guardrailing, axiomatic detection of the unknown, multi-agent epistemic trust, and detecting constructive deception in natural language negotiation. These experiments demonstrate how enforcing or learning accessibility can increase logical consistency and interpretability without changing the underlying task architecture.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zijian Gu, Yuxi Liu, Zhenhao Zhang, Song Wang
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms.Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Juan Diego Toscano, Daniel T. Chen, George Em Karniadakis
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Bridging the gap between theoretical conceptualization and computational implementation is a major bottleneck in Scientific Computing (SciC) and Scientific Machine Learning (SciML). We introduce ATHENA (Agentic Team for Hierarchical Evolutionary Numerical Algorithms), an agentic framework designed as an Autonomous Lab to manage the end-to-end computational research lifecycle. Its core is the HENA loop, a knowledge-driven diagnostic process framed as a Contextual Bandit problem. Acting as an online learner, the system analyzes prior trials to select structural `actions' ($A_n$) from combinatorial spaces guided by expert blueprints (e.g., Universal Approximation, Physics-Informed constraints). These actions are translated into executable code ($S_n$) to generate scientific rewards ($R_n$). ATHENA transcends standard automation: in SciC, it autonomously identifies mathematical symmetries for exact analytical solutions or derives stable numerical solvers where foundation models fail. In SciML, it performs deep diagnosis to tackle ill-posed formulations and combines hybrid symbolic-numeric workflows (e.g., coupling PINNs with FEM) to resolve multiphysics problems. The framework achieves super-human performance, reaching validation errors of $10^{-14}$. Furthermore, collaborative ``human-in-the-loop" intervention allows the system to bridge stability gaps, improving results by an order of magnitude. This paradigm shift focuses from implementation mechanics to methodological innovation, accelerating scientific discovery.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongtao Hao, Joseph L. Austerweil
ๆฅ็ๆ่ฆ
Event-based models (EBMs) infer disease progression from cross-sectional data, and standard EBMs assume a single underlying disease per individual. In contrast, mixed pathologies are common in neurodegeneration. We introduce the Joint Progression Model (JPM), a probabilistic framework that treats single-disease trajectories as partial rankings and builds a prior over joint progressions. We study several JPM variants (Pairwise, Bradley-Terry, Plackett-Luce, and Mallows) and analyze three properties: (i) calibration -- whether lower model energy predicts smaller distance to the ground truth ordering; (ii) separation -- the degree to which sampled rankings are distinguishable from random permutations; and (iii) sharpness -- the stability of sampled aggregate rankings. All variants are calibrated, and all achieve near-perfect separation; sharpness varies by variant and is well-predicted by simple features of the input partial rankings (number and length of rankings, conflict, and overlap). In synthetic experiments, JPM improves ordering accuracy by roughly 21 percent over a strong EBM baseline (SA-EBM) that treats the joint disease as a single condition. Finally, using NACC, we find that the Mallows variant of JPM and the baseline model (SA-EBM) have results that are more consistent with prior literature on the possible disease progression of the mixed pathology of AD and VaD.
๐
2025-12-03
๐ 12 pages, 6 figures. Submitted to the IEEE Journal
๐ฅ Ian Henriques, Lynda Elhassar, Sarvesh Relekar, Denis Walrave, Shayan Hassantabar et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The global rise in type 2 diabetes underscores the need for scalable and cost-effective screening methods. Current diagnosis requires biochemical assays, which are invasive and costly. Advances in consumer wearables have enabled early explorations of machine learning-based disease detection, but prior studies were limited to controlled settings. We present SweetDeep, a compact neural network trained on physiological and demographic data from 285 (diabetic and non-diabetic) participants in the EU and MENA regions, collected using Samsung Galaxy Watch 7 devices in free-living conditions over six days. Each participant contributed multiple 2-minute sensor recordings per day, totaling approximately 20 recordings per individual. Despite comprising fewer than 3,000 parameters, SweetDeep achieves 82.5% patient-level accuracy (82.1% macro-F1, 79.7% sensitivity, 84.6% specificity) under three-fold cross-validation, with an expected calibration error of 5.5%. Allowing the model to abstain on less than 10% of low-confidence patient predictions yields an accuracy of 84.5% on the remaining patients. These findings demonstrate that combining engineered features with lightweight architectures can support accurate, rapid, and generalizable detection of type 2 diabetes in real-world wearable settings.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongtao Hao, Joseph L. Austerweil
ๆฅ็ๆ่ฆ
Chronic diseases often progress differently across patients. Rather than randomly varying, there are typically a small number of subtypes for how a disease progresses across patients. To capture this structured heterogeneity, the Subtype and Stage Inference Event-Based Model (SuStaIn) estimates the number of subtypes, the order of disease progression for each subtype, and assigns each patient to a subtype from primarily cross-sectional data. It has been widely applied to uncover the subtypes of many diseases and inform our understanding of them. But how robust is its performance? In this paper, we develop a principled Bayesian subtype variant of the event-based model (BEBMS) and compare its performance to SuStaIn in a variety of synthetic data experiments with varied levels of model misspecification. BEBMS substantially outperforms SuStaIn across ordering, staging, and subtype assignment tasks. Further, we apply BEBMS and SuStaIn to a real-world Alzheimer's data set. We find BEBMS has results that are more consistent with the scientific consensus of Alzheimer's disease progression than SuStaIn.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yujing Liu, Chen Yang
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In recent years, financial sentiment analysis of public opinion has become increasingly important for market forecasting and risk assessment. However, existing methods often struggle to effectively integrate diverse opinion modalities and capture fine-grained interactions across them. This paper proposes an end-to-end deep learning framework that integrates two distinct modalities of financial opinions: recency modality (timely opinions) and popularity modality (trending opinions), through a novel cross-modal attention mechanism specifically designed for financial sentiment analysis. While both modalities consist of textual data, they represent fundamentally different information channels: recency-driven market updates versus popularity-driven collective sentiment. Our model first uses BERT (Chinese-wwm-ext) for feature embedding and then employs our proposed Financial Multi-Head Cross-Attention (FMHCA) structure to facilitate information exchange between these distinct opinion modalities. The processed features are optimized through a transformer layer and fused using multimodal factored bilinear pooling for classification into negative, neutral, and positive sentiment. Extensive experiments on a comprehensive dataset covering 837 companies demonstrate that our approach achieves an accuracy of 83.5%, significantly outperforming baselines including BERT+Transformer by 21 percent. These results highlight the potential of our framework to support more accurate financial decision-making and risk management.
๐
2025-12-03
๐ This is a pre-print for submitting to computers &
๐ฅ Johnny Peng, Thanh Tung Khuat, Ellen Otte, Katarzyna Musial, Bogdan Gabrys
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
In cell culture bioprocessing, real-time batch process monitoring (BPM) refers to the continuous tracking and analysis of key process variables such as viable cell density, nutrient levels, metabolite concentrations, and product titer throughout the duration of a batch run. This enables early detection of deviations and supports timely control actions to ensure optimal cell growth and product quality. BPM plays a critical role in ensuring the quality and regulatory compliance of biopharmaceutical manufacturing processes. However, the development of accurate soft sensors for BPM is hindered by key challenges, including limited historical data, infrequent feedback, heterogeneous process conditions, and high-dimensional sensory inputs. This study presents a comprehensive benchmarking analysis of machine learning (ML) methods designed to address these challenges, with a focus on learning from historical data with limited volume and relevance in the context of bioprocess monitoring. We evaluate multiple ML approaches including feature dimensionality reduction, online learning, and just-in-time learning across three datasets, one in silico dataset and two real-world experimental datasets. Our findings highlight the importance of training strategies in handling limited data and feedback, with batch learning proving effective in homogeneous settings, while just-in-time learning and online learning demonstrate superior adaptability in cold-start scenarios. Additionally, we identify key meta-features, such as feed media composition and process control strategies, that significantly impact model transferability. The results also suggest that integrating Raman-based predictions with lagged offline measurements enhances monitoring accuracy, offering a promising direction for future bioprocess soft sensor development.
๐
2025-12-03
๐ ArXiv Preprint (eess.SP)
๐ฅ Maryam Maghsoudi, Mohsen Rezaeizadeh, Shihab Shamma
Machine Learning
ๆฅ็ๆ่ฆ
Decoding imagined speech engages complex neural processes that are difficult to interpret due to uncertainty in timing and the limited availability of imagined-response datasets. In this study, we present a Magnetoencephalography (MEG) dataset collected from trained musicians as they imagined and listened to musical and poetic stimuli. We show that both imagined and perceived brain responses contain consistent, condition-specific information. Using a sliding-window ridge regression model, we first mapped imagined responses to listened responses at the single-subject level, but found limited generalization across subjects. At the group level, we developed an encoder-decoder convolutional neural network with a subject-specific calibration layer that produced stable and generalizable mappings. The CNN consistently outperformed the null model, yielding significantly higher correlations between predicted and true listened responses for nearly all held-out subjects. Our findings demonstrate that imagined neural activity can be transformed into perception-like responses, providing a foundation for future brain-computer interface applications involving imagined speech and music.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiye Song, Steve Dai, Ben Keller, Brucek Khailany
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications.
We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rhys Newbury, Juyan Zhang, Tin Tran, Hanna Kurniawati, Dana Kuliฤ
Computer Vision
ๆฅ็ๆ่ฆ
Understanding and representing the structure of 3D objects in an unsupervised manner remains a core challenge in computer vision and graphics. Most existing unsupervised keypoint methods are not designed for unconditional generative settings, restricting their use in modern 3D generative pipelines; our formulation explicitly bridges this gap. We present an unsupervised framework for learning spatially structured 3D keypoints from point cloud data. These keypoints serve as a compact and interpretable representation that conditions an Elucidated Diffusion Model (EDM) to reconstruct the full shape. The learned keypoints exhibit repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation. Our method achieves strong performance across diverse object categories, yielding a 6 percentage-point improvement in keypoint consistency compared to prior approaches.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuanbang Liang, Yang Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Grokking-delayed generalization that emerges well after a model has fit the training data-has been linked to robustness and representation quality. We ask whether this training regime also helps with machine unlearning, i.e., removing the influence of specified data without full retraining. We compare applying standard unlearning methods before versus after the grokking transition across vision (CNNs/ResNets on CIFAR, SVHN, and ImageNet) and language (a transformer on a TOFU-style setup). Starting from grokked checkpoints consistently yields (i) more efficient forgetting (fewer updates to reach a target forget level), (ii) less collateral damage (smaller drops on retained and test performance), and (iii) more stable updates across seeds, relative to early-stopped counterparts under identical unlearning algorithms. Analyses of features and curvature further suggest that post-grokking models learn more modular representations with reduced gradient alignment between forget and retain subsets, which facilitates selective forgetting. Our results highlight when a model is trained (pre- vs. post-grokking) as an orthogonal lever to how unlearning is performed, providing a practical recipe to improve existing unlearning methods without altering their algorithms.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziyi Ding, Xiao-Ping Zhang
ๆฅ็ๆ่ฆ
We propose GaussDetect-LiNGAM, a novel approach for bivariate causal discovery that eliminates the need for explicit Gaussianity tests by leveraging a fundamental equivalence between noise Gaussianity and residual independence in the reverse regression. Under the standard LiNGAM assumptions of linearity, acyclicity, and exogeneity, we prove that the Gaussianity of the forward-model noise is equivalent to the independence between the regressor and residual in the reverse model. This theoretical insight allows us to replace fragile and sample-sensitive Gaussianity tests with robust kernel-based independence tests. Experimental results validate the equivalence and demonstrate that GaussDetect-LiNGAM maintains high consistency across diverse noise types and sample sizes, while reducing the number of tests per decision (TPD). Our method enhances both the efficiency and practical applicability of causal inference, making LiNGAM more accessible and reliable in real-world scenarios.
๐
2025-12-03
๐ ArXiv Preprint (cs.DS)
๐ฅ Bharat Sharman, Elkafi Hassini
Machine Learning
ๆฅ็ๆ่ฆ
The maximum clique problem, a well-known graph-based combinatorial optimization problem, has been addressed through various algorithmic approaches, though systematic analyses of the problem instances remain sparse. This study employs the instance space analysis (ISA) methodology to systematically analyze the instance space of this problem and assess & predict the performance of state-of-the-art (SOTA) algorithms, including exact, heuristic, and graph neural network (GNN)-based methods. A dataset was compiled using graph instances from TWITTER, COLLAB and IMDB-BINARY benchmarks commonly used in graph machine learning research. A set of 33 generic and 2 problem-specific polynomial-time-computable graph-based features, including several spectral properties, was employed for the ISA. A composite performance mea- sure incorporating both solution quality and algorithm runtime was utilized. The comparative analysis demonstrated that the exact algorithm Mixed Order Maximum Clique (MOMC) exhib- ited superior performance across approximately 74.7% of the instance space constituted by the compiled dataset. Gurobi & CliSAT accounted for superior performance in 13.8% and 11% of the instance space, respectively. The ISA-based algorithm performance prediction model run on 34 challenging test instances compiled from the BHOSLIB and DIMACS datasets yielded top-1 and top-2 best performing algorithm prediction accuracies of 88% and 97%, respectively.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, Andrew Lee
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In this work we study how explicit world-modeling objectives affect the internal representations and downstream capability of Transformers across different training stages. We use a controlled 2x2x2 Rubik's Cube and ask: (1) how does explicitly pretraining a world model affect the model's latent representations, and (2) how does world-model quality affect the model's performance after reinforcement learning post-training? We compare standard next-token prediction to two explicit world-modeling strategies -- (i) state-prediction pretraining and (ii) a joint state-prediction + next-token objective -- and assess task performance after Group Relative Policy Optimization (GRPO) is applied as post-training. We evaluate the representation quality with linear probes and causal interventions. We find that explicit world-modeling yields more linearly decodable and causally steerable state representations. More importantly, we find that improved state representations lead to higher gains for GRPO, especially on harder cube states. Our results indicate that sharpening state representations can improve the effectiveness of post-training for sequence-planning tasks.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joe Edelman, Tan Zhi-Xuan, Ryan Lowe, Oliver Klingefjord, Vincent Wang-Mascianica et al.
ๆฅ็ๆ่ฆ
Beneficial societal outcomes cannot be guaranteed by aligning individual AI systems with the intentions of their operators or users. Even an AI system that is perfectly aligned to the intentions of its operating organization can lead to bad outcomes if the goals of that organization are misaligned with those of other institutions and individuals. For this reason, we need full-stack alignment, the concurrent alignment of AI systems and the institutions that shape them with what people value. This can be done without imposing a particular vision of individual or collective flourishing. We argue that current approaches for representing values, such as utility functions, preference orderings, or unstructured text, struggle to address these and other issues effectively. They struggle to distinguish values from other signals, to support principled normative reasoning, and to model collective goods. We propose thick models of value will be needed. These structure the way values and norms are represented, enabling systems to distinguish enduring values from fleeting preferences, to model the social embedding of individual choices, and to reason normatively, applying values in new domains. We demonstrate this approach in five areas: AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, and democratic regulatory institutions.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hamed Poursiami, Shay Snyder, Guojing Cong, Thomas Potok, Maryam Parsa
Machine Learning
ๆฅ็ๆ่ฆ
Graph classification is a fundamental task in domains ranging from molecular property prediction to materials design. While graph neural networks (GNNs) achieve strong performance by learning expressive representations via message passing, they incur high computational costs, limiting their scalability and deployment on resource-constrained devices. Hyperdimensional Computing (HDC), also known as Vector Symbolic Architectures (VSA), offers a lightweight, brain-inspired alternative, yet existing HDC-based graph methods typically struggle to match the predictive performance of GNNs. In this work, we propose VS-Graph, a vector-symbolic graph learning framework that narrows the gap between the efficiency of HDC and the expressive power of message passing. VS-Graph introduces a Spike Diffusion mechanism for topology-driven node identification and an Associative Message Passing scheme for multi-hop neighborhood aggregation entirely within the high-dimensional vector space. Without gradient-based optimization or backpropagation, our method achieves competitive accuracy with modern GNNs, outperforming the prior HDC baseline by 4-5% on standard benchmarks such as MUTAG and DD. It also matches or exceeds the performance of the GNN baselines on several datasets while accelerating the training by a factor of up to 450x. Furthermore, VS-Graph maintains high accuracy even with the hypervector dimensionality reduced to D=128, demonstrating robustness under aggressive dimension compression and paving the way for ultra-efficient execution on edge and neuromorphic hardware.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lakshmi Jayalal, Sheetal Kalyani
Machine Learning
ๆฅ็ๆ่ฆ
Recovering jointly sparse signals in the multiple measurement vectors (MMV) setting is a fundamental problem in machine learning, but traditional methods like multiple measurement vectors orthogonal matching pursuit (M-OMP) and multiple measurement vectors FOCal Underdetermined System Solver (M-FOCUSS) often require careful parameter tuning or prior knowledge of the sparsity of the signal and/or noise variance. We introduce a novel tuning-free framework that leverages Implicit Regularization (IR) from overparameterization to overcome this limitation. Our approach reparameterizes the estimation matrix into factors that decouple the shared row-support from individual vector entries. We show that the optimization dynamics inherently promote the desired row-sparse structure by applying gradient descent to a standard least-squares objective on these factors. We prove that with a sufficiently small and balanced initialization, the optimization dynamics exhibit a "momentum-like" effect, causing the norms of rows in the true support to grow significantly faster than others. This formally guarantees that the solution trajectory converges towards an idealized row-sparse solution. Additionally, empirical results demonstrate that our approach achieves performance comparable to established methods without requiring any prior information or tuning.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Deploying large language model (LLM) models on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates up to 35%. Our experiments show that quantized and pruned models achieve a memory reduction of 4x-5.7x and a token-throughput improvement of 2.7x-3.4x, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mahdi Arab Loodaricheh, Mohammad Hossein Manshaei, Anita Raja
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Modern Intrusion Detection Systems (IDS) face severe challenges due to heterogeneous network traffic, evolving cyber threats, and pronounced data imbalance between benign and attack flows. While generative models have shown promise in data augmentation, existing approaches are limited to single modalities and fail to capture cross-domain dependencies. This paper introduces MAGE-ID (Multimodal Attack Generator for Intrusion Detection), a diffusion-based generative framework that couples tabular flow features with their transformed images through a unified latent prior. By jointly training Transformer and CNN-based variational encoders with an EDM style denoiser, MAGE-ID achieves balanced and coherent multimodal synthesis. Evaluations on CIC-IDS-2017 and NSL-KDD demonstrate significant improvements in fidelity, diversity, and downstream detection performance over TabSyn and TabDDPM, highlighting the effectiveness of MAGE-ID for multimodal IDS augmentation.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shanika Iroshi Nanayakkara, Shiva Raj Pokhrel
ๆฅ็ๆ่ฆ
Federated learning (FL) deployed over quantum enabled and heterogeneous classical networks faces significant performance degradation due to uneven client quality, stochastic teleportation fidelity, device instability, and geometric mismatch between local and global models. Classical aggregation rules assume euclidean topology and uniform communication reliability, limiting their suitability for emerging quantum federated systems. This paper introduces A2G (Adaptive Aggregation with Two Gains), a dual gain framework that jointly regulates geometric blending through a geometry gain and modulates client importance using a QoS gain derived from teleportation fidelity, latency, and instability. We develop the A2G update rule, establish convergence guarantees under smoothness and bounded variance assumptions, and show that A2G recovers FedAvg, QoS aware averaging, and manifold based aggregation as special cases. Experiments on a quantum classical hybrid testbed demonstrate improved stability and higher accuracy under heterogeneous and noisy conditions.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongseon Yeom, Jaeyoul Shin, Soojin Min, Jeongmin Yoon, Seunghak Yu et al.
ๆฅ็ๆ่ฆ
Online A/B testing, the gold standard for evaluating new advertising policies, consumes substantial engineering resources and risks significant revenue loss from deploying underperforming variations. This motivates the use of Off-Policy Evaluation (OPE) for rapid, offline assessment. However, applying OPE to ad auctions is fundamentally more challenging than in domains like recommender systems, where stochastic policies are common. In online ad auctions, it is common for the highest-bidding ad to win the impression, resulting in a deterministic, winner-takes-all setting. This results in zero probability of exposure for non-winning ads, rendering standard OPE estimators inapplicable. We introduce the first principled framework for OPE in deterministic auctions by repurposing the bid landscape model to approximate the propensity score. This model allows us to derive robust approximate propensity scores, enabling the use of stable estimators like Self-Normalized Inverse Propensity Scoring (SNIPS) for counterfactual evaluation. We validate our approach on the AuctionNet simulation benchmark and against 2-weeks online A/B test from a large-scale industrial platform. Our method shows remarkable alignment with online results, achieving a 92\% Mean Directional Accuracy (MDA) in CTR prediction, significantly outperforming the parametric baseline. MDA is the most critical metric for guiding deployment decisions, as it reflects the ability to correctly predict whether a new model will improve or harm performance. This work contributes the first practical and validated framework for reliable OPE in deterministic auction environments, offering an efficient alternative to costly and risky online experiments.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yeganeh Ghamary, Victoria Wu, Hooman Vaseli, Christina Luong, Teresa Tsang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Ejection fraction (EF) is a crucial metric for assessing cardiac function and diagnosing conditions such as heart failure. Traditionally, EF estimation requires manual tracing and domain expertise, making the process time-consuming and subject to interobserver variability. Most current deep learning methods for EF prediction are black-box models with limited transparency, which reduces clinical trust. Some post-hoc explainability methods have been proposed to interpret the decision-making process after the prediction is made. However, these explanations do not guide the model's internal reasoning and therefore offer limited reliability in clinical applications. To address this, we introduce ProtoEFNet, a novel video-based prototype learning model for continuous EF regression. The model learns dynamic spatiotemporal prototypes that capture clinically meaningful cardiac motion patterns. Additionally, the proposed Prototype Angular Separation (PAS) loss enforces discriminative representations across the continuous EF spectrum. Our experiments on the EchonetDynamic dataset show that ProtoEFNet can achieve accuracy on par with its non-interpretable counterpart while providing clinically relevant insight. The ablation study shows that the proposed loss boosts performance with a 2% increase in F1 score from 77.67$\pm$2.68 to 79.64$\pm$2.10. Our source code is available at: https://github.com/DeepRCL/ProtoEF
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alan T. L. Bacellar, Mustafa Munir, Felipe M. G. Franรงa, Priscila M. V. Lima, Radu Marculescu et al.
ๆฅ็ๆ่ฆ
Federated Learning (FL) is plagued by two key challenges: high communication overhead and performance collapse on heterogeneous (non-IID) data. Analytic FL (AFL) provides a single-round, data distribution invariant solution, but is limited to linear models. Subsequent non-linear approaches, like DeepAFL, regain accuracy but sacrifice the single-round benefit. In this work, we break this trade-off. We propose SAFLe, a framework that achieves scalable non-linear expressivity by introducing a structured head of bucketed features and sparse, grouped embeddings. We prove this non-linear architecture is mathematically equivalent to a high-dimensional linear regression. This key equivalence allows SAFLe to be solved with AFL's single-shot, invariant aggregation law. Empirically, SAFLe establishes a new state-of-the-art for analytic FL, significantly outperforming both linear AFL and multi-round DeepAFL in accuracy across all benchmarks, demonstrating a highly efficient and scalable solution for federated vision.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Faizan Farooq Khan, K J Joseph, Koustava Goswami, Mohamed Elhoseiny, Balaji Vasan Srinivasan
Machine Learning
ๆฅ็ๆ่ฆ
Design generation, in its essence, is a step-by-step process where designers progressively refine and enhance their work through careful modifications. Despite this fundamental characteristic, existing approaches mainly treat design synthesis as a single-step generation problem, significantly underestimating the inherent complexity of the creative process. To bridge this gap, we propose a novel problem setting called Step-by-Step Layered Design Generation, which tasks a machine learning model with generating a design that adheres to a sequence of instructions from a designer. Leveraging recent advancements in multi-modal LLMs, we propose SLEDGE: Step-by-step LayEred Design GEnerator to model each update to a design as an atomic, layered change over its previous state, while being grounded in the instruction. To complement our new problem setting, we introduce a new evaluation suite, including a dataset and a benchmark. Our exhaustive experimental analysis and comparison with state-of-the-art approaches tailored to our new setup demonstrate the efficacy of our approach. We hope our work will attract attention to this pragmatic and under-explored research area.
๐
2025-12-03
๐ ArXiv Preprint (math.ST)
๐ฅ Garrett G. Wen, Hong Hu, Yue M. Lu, Zhou Fan, Theodor Misiakiewicz
ๆฅ็ๆ่ฆ
A major effort in modern high-dimensional statistics has been devoted to the analysis of linear predictors trained on nonlinear feature embeddings via empirical risk minimization (ERM). Gaussian equivalence theory (GET) has emerged as a powerful universality principle in this context: it states that the behavior of high-dimensional, complex features can be captured by Gaussian surrogates, which are more amenable to analysis. Despite its remarkable successes, numerical experiments show that this equivalence can fail even for simple embeddings -- such as polynomial maps -- under general scaling regimes.
We investigate this breakdown in the setting of random feature (RF) models in the quadratic scaling regime, where both the number of features and the sample size grow quadratically with the data dimension. We show that when the target function depends on a low-dimensional projection of the data, such as generalized linear models, GET yields incorrect predictions. To capture the correct asymptotics, we introduce a Conditional Gaussian Equivalent (CGE) model, which can be viewed as appending a low-dimensional non-Gaussian component to an otherwise high-dimensional Gaussian model. This hybrid model retains the tractability of the Gaussian framework and accurately describes RF models in the quadratic scaling regime. We derive sharp asymptotics for the training and test errors in this setting, which continue to agree with numerical simulations even when GET fails.
Our analysis combines general results on CLT for Wiener chaos expansions and a careful two-phase Lindeberg swapping argument. Beyond RF models and quadratic scaling, our work hints at a rich landscape of universality phenomena in high-dimensional ERM.
๐
2025-12-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying
ๆฅ็ๆ่ฆ
Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
๐
2025-12-03
๐ Accepted to 2026 IEEE/CVF Winter Conference on App
๐ฅ Thomas Monninger, Zihan Zhang, Steffen Staab, Sihao Ding
Robotics
ๆฅ็ๆ่ฆ
Accurate environmental representations are essential for autonomous driving, providing the foundation for safe and efficient navigation. Traditionally, high-definition (HD) maps are providing this representation of the static road infrastructure to the autonomous system a priori. However, because the real world is constantly changing, such maps must be constructed online from on-board sensor data. Navigation-grade standard-definition (SD) maps are widely available, but their resolution is insufficient for direct deployment. Instead, they can be used as coarse prior to guide the online map construction process. We propose NavMapFusion, a diffusion-based framework that performs iterative denoising conditioned on high-fidelity sensor data and on low-fidelity navigation maps. This paper strives to answer: (1) How can coarse, potentially outdated navigation maps guide online map construction? (2) What advantages do diffusion models offer for map fusion? We demonstrate that diffusion-based map construction provides a robust framework for map fusion. Our key insight is that discrepancies between the prior map and online perception naturally correspond to noise within the diffusion process; consistent regions reinforce the map construction, whereas outdated segments are suppressed. On the nuScenes benchmark, NavMapFusion conditioned on coarse road lines from OpenStreetMap data reaches a 21.4% relative improvement on 100 m, and even stronger improvements on larger perception ranges, while maintaining real-time capabilities. By fusing low-fidelity priors with high-fidelity sensor data, the proposed method generates accurate and up-to-date environment representations, guiding towards safer and more reliable autonomous driving. The code is available at https://github.com/tmonnin/navmapfusion
๐
2025-12-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jie Tang, Daochen Zha, Xin Liu, Huiji Gao, Liwei He et al.
Machine Learning
ๆฅ็ๆ่ฆ
In online marketplaces like Airbnb, users frequently engage in comparison shopping before making purchase decisions. Despite the prevalence of this behavior, a significant disconnect persists between mainstream e-commerce search engines and users' comparison needs. Traditional ranking models often evaluate items in isolation, disregarding the context in which users compare multiple items on a search results page. While recent advances in deep learning have sought to improve ranking accuracy, diversity, and fairness by encoding listwise context, the challenge of aligning search rankings with user comparison shopping behavior remains inadequately addressed. In this paper, we propose a novel ranking architecture - Learning-to-Comparison-Shop (LTCS) System - that explicitly models and learns users' comparison shopping behaviors. Through extensive offline and online experiments, we demonstrate that our approach yields statistically significant gains in key business metrics - improving NDCG by 1.7% and boosting booking conversion rate by 0.6% in A/B testing - while also enhancing user experience. We also compare our model against state-of-the-art approaches and demonstrate that LTCS significantly outperforms them.
๐
2025-12-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Christos Kolomvakis, Thomas Bobille, Arnaud Vandaele, Nicolas Gillis
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Boolean matrix factorization (BMF) approximates a given binary input matrix as the product of two smaller binary factors. Unlike binary matrix factorization based on standard arithmetic, BMF employs the Boolean OR and AND operations for the matrix product, which improves interpretability and reduces the approximation error. It is also used in role mining and computer vision. In this paper, we first propose algorithms for BMF that perform alternating optimization (AO) of the factor matrices, where each subproblem is solved via integer programming (IP). We then design different approaches to further enhance AO-based algorithms by selecting an optimal subset of rank-one factors from multiple runs. To address the scalability limits of IP-based methods, we introduce new greedy and local-search heuristics. We also construct a new C++ data structure for Boolean vectors and matrices that is significantly faster than existing ones and is of independent interest, allowing our heuristics to scale to large datasets. We illustrate the performance of all our proposed methods and compare them with the state of the art on various real datasets, both with and without missing data, including applications in topic modeling and imaging.
๐
2025-12-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chuyue Wang, Jie Feng, Yuxi Wu, Hang Zhang, Zhiguo Fan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate and reliable search on online healthcare platforms is critical for user safety and service efficacy. Traditional methods, however, often fail to comprehend complex and nuanced user queries, limiting their effectiveness. Large language models (LLMs) present a promising solution, offering powerful semantic understanding to bridge this gap. Despite their potential, deploying LLMs in this high-stakes domain is fraught with challenges, including factual hallucinations, specialized knowledge gaps, and high operational costs. To overcome these barriers, we introduce \textbf{AR-Med}, a novel framework for \textbf{A}utomated \textbf{R}elevance assessment for \textbf{Med}ical search that has been successfully deployed at scale on the Online Medical Delivery Platforms. AR-Med grounds LLM reasoning in verified medical knowledge through a retrieval-augmented approach, ensuring high accuracy and reliability. To enable efficient online service, we design a practical knowledge distillation scheme that compresses large teacher models into compact yet powerful student models. We also introduce LocalQSMed, a multi-expert annotated benchmark developed to guide model iteration and ensure strong alignment between offline and online performance. Extensive experiments show AR-Med achieves an offline accuracy of over 93\%, a 24\% absolute improvement over the original online system, and delivers significant gains in online relevance and user satisfaction. Our work presents a practical and scalable blueprint for developing trustworthy, LLM-powered systems in real-world healthcare applications.
๐
2025-12-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Adithya S Kolavi, Vyoman Jain
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
๐
2025-12-03
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Robert Dilworth
ๆฅ็ๆ่ฆ
In this study, we more rigorously evaluated our attack script $\textit{TraceTarnish}$, which leverages adversarial stylometry principles to anonymize the authorship of text-based messages. To ensure the efficacy and utility of our attack, we sourced, processed, and analyzed Reddit comments--comments that were later alchemized into $\textit{TraceTarnish}$ data--to gain valuable insights. The transformed $\textit{TraceTarnish}$ data was then further augmented by $\textit{StyloMetrix}$ to manufacture stylometric features--features that were culled using the Information Gain criterion, leaving only the most informative, predictive, and discriminative ones. Our results found that function words and function word types ($L\_FUNC\_A$ $\&$ $L\_FUNC\_T$); content words and content word types ($L\_CONT\_A$ $\&$ $L\_CONT\_T$); and the Type-Token Ratio ($ST\_TYPE\_TOKEN\_RATIO\_LEMMAS$) yielded significant Information-Gain readings. The identified stylometric cues--function-word frequencies, content-word distributions, and the Type-Token Ratio--serve as reliable indicators of compromise (IoCs), revealing when a text has been deliberately altered to mask its true author. Similarly, these features could function as forensic beacons, alerting defenders to the presence of an adversarial stylometry attack; granted, in the absence of the original message, this signal may go largely unnoticed, as it appears to depend on a pre- and post-transformation comparison. "In trying to erase a trace, you often imprint a larger one." Armed with this understanding, we framed $\textit{TraceTarnish}$'s operations and outputs around these five isolated features, using them to conceptualize and implement enhancements that further strengthen the attack.
๐
2025-12-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yaqi Wang, Haojia Sun, Shuting Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
The application of large language models (LLMs) in recommendation systems has recently gained traction. Traditional recommendation systems often lack explainability and suffer from issues such as popularity bias. Previous research has also indicated that LLMs, when used as standalone predictors, fail to achieve accuracy comparable to traditional models. To address these challenges, we propose to use LLM as an explainable re-ranker, a hybrid approach that combines traditional recommendation models with LLMs to enhance both accuracy and interpretability. We constructed a dataset to train the re-ranker LLM and evaluated the alignment between the generated dataset and human expectations. Leveraging a two-stage training process, our model significantly improved NDCG, a key ranking metric. Moreover, the re-ranker outperformed a zero-shot baseline in ranking accuracy and interpretability. These results highlight the potential of integrating traditional recommendation models with LLMs to address limitations in existing systems and pave the way for more explainable and fair recommendation frameworks.
๐
2025-12-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Shu Wang, Yingli Zhou, Yixiang Fang
Natural Language Processing
ๆฅ็ๆ่ฆ
As an effective method to boost the performance of Large Language Models (LLMs) on the question answering (QA) task, Retrieval-Augmented Generation (RAG), which queries highly relevant information from external complex documents, has attracted tremendous attention from both industry and academia. Existing RAG approaches often focus on general documents, and they overlook the fact that many real-world documents (such as books, booklets, handbooks, etc.) have a hierarchical structure, which organizes their content from different granularity levels, leading to poor performance for the QA task. To address these limitations, we introduce BookRAG, a novel RAG approach targeted for documents with a hierarchical structure, which exploits logical hierarchies and traces entity relations to query the highly relevant information. Specifically, we build a novel index structure, called BookIndex, by extracting a hierarchical tree from the document, which serves as the role of its table of contents, using a graph to capture the intricate relationships between entities, and mapping entities to tree nodes. Leveraging the BookIndex, we then propose an agent-based query method inspired by the Information Foraging Theory, which dynamically classifies queries and employs a tailored retrieval workflow. Extensive experiments on three widely adopted benchmarks demonstrate that BookRAG achieves state-of-the-art performance, significantly outperforming baselines in both retrieval recall and QA accuracy while maintaining competitive efficiency.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali et al.
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Levi Burner, Guido de Croon, Yiannis Aloimonos
ๆฅ็ๆ่ฆ
Animals with foveated vision, including humans, experience microsaccades, small, rapid eye movements that they are not aware of. Inspired by this phenomenon, we develop a method for "Artificial Microsaccade Compensation". It can stabilize video captured by a tailless ornithopter that has resisted attempts to use camera-based sensing because it shakes at 12-20 Hz. Our approach minimizes changes in image intensity by optimizing over 3D rotation represented in SO(3). This results in a stabilized video, computed in real time, suitable for human viewing, and free from distortion. When adapted to hold a fixed viewing orientation, up to occasional saccades, it can dramatically reduce inter-frame motion while also benefiting from an efficient recursive update. When compared to Adobe Premier Pro's warp stabilizer, which is widely regarded as the best commercial video stabilization software available, our method achieves higher quality results while also running in real time.
๐
2025-12-03
๐ 6 pages, 3 figures, 5 tables. This paper has been
๐ฅ Michael Schiffmann, Felix Struth, Sabina Jeschke, Anja Richert
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The social capabilities of socially interactive agents (SIA) are a key to successful and smooth interactions between the user and the SIA. A successful start of the interaction is one of the essential factors for satisfying SIA interactions. For a service and information task in which the SIA helps with information, e.g. about the location, it is an important skill to master the opening of the conversation and to recognize which interlocutor opens the conversation and when. We are therefore investigating the extent to which the opening of the conversation can be trained using the user's body language as an input for machine learning to ensure smooth conversation starts for the interaction. In this paper we propose the Interaction Initiation System (IIS) which we developed, trained and validated using an in-the-wild data set. In a field test at the Deutsches Museum Bonn, a Furhat robot from Furhat Robotics was used as a service and information point. Over the period of use we collected the data of \textit{N} = 201 single user interactions for the training of the algorithms. We can show that the IIS, achieves a performance that allows the conclusion that this system is able to determine the greeting period and the opener of the interaction.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaobei Zhao, Xingqi Lyu, Xiang Li
Robotics
ๆฅ็ๆ่ฆ
Agricultural robots are serving as powerful assistants across a wide range of agricultural tasks, nevertheless, still heavily relying on manual operations or railway systems for movement. The AgriVLN method and the A2A benchmark pioneeringly extend Vision-and-Language Navigation (VLN) to the agricultural domain, enabling a robot to navigate to a target position following a natural language instruction. Unlike human binocular vision, most agricultural robots are only given a single camera for monocular vision, which results in limited spatial perception. To bridge this gap, we present the method of Agricultural Vision-and-Language Navigation with Monocular Depth Estimation (MDE-AgriVLN), in which we propose the MDE module generating depth features from RGB images, to assist the decision-maker on reasoning. When evaluated on the A2A benchmark, our MDE-AgriVLN method successfully increases Success Rate from 0.23 to 0.32 and decreases Navigation Error from 4.43m to 4.08m, demonstrating the state-of-the-art performance in the agricultural VLN domain. Code: https://github.com/AlexTraveling/MDE-AgriVLN.
๐
2025-12-03
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Michael Schiffmann, Sabina Jeschke, Anja Richert
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Socially interactive agents (SIAs) are being used in various scenarios and are nearing productive deployment. Evaluating user satisfaction with SIAs' performance is a key factor in designing the interaction between the user and SIA. Currently, subjective user satisfaction is primarily assessed manually through questionnaires or indirectly via system metrics. This study examines the automatic classification of user satisfaction through analysis of social signals, aiming to enhance both manual and autonomous evaluation methods for SIAs. During a field trial at the Deutsches Museum Bonn, a Furhat Robotics head was employed as a service and information hub, collecting an "in-the-wild" dataset. This dataset comprises 46 single-user interactions, including questionnaire responses and video data. Our method focuses on automatically classifying user satisfaction based on time series classification. We use time series of social signal metrics derived from the body pose, time series of facial expressions, and physical distance. This study compares three feature engineering approaches on different machine learning models. The results confirm the method's effectiveness in reliably identifying interactions with low user satisfaction without the need for manually annotated datasets. This approach offers significant potential for enhancing SIA performance and user experience through automated feedback mechanisms.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guole Shen, Tianchen Deng, Xingrui Qin, Nailin Wang, Jianyu Wang et al.
Machine LearningComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Recent stateful recurrent neural networks have achieved remarkable progress on static 3D reconstruction but remain vulnerable to motion-induced artifacts, where non-rigid regions corrupt attention propagation between the spatial memory and image feature. By analyzing the internal behaviors of the state and image token updating mechanism, we find that aggregating self-attention maps across layers reveals a consistent pattern: dynamic regions are naturally down-weighted, exposing an implicit motion cue that the pretrained transformer already encodes but never explicitly uses. Motivated by this observation, we introduce MUT3R, a training-free framework that applies the attention-derived motion cue to suppress dynamic content in the early layers of the transformer during inference. Our attention-level gating module suppresses the influence of dynamic regions before their artifacts propagate through the feature hierarchy. Notably, we do not retrain or fine-tune the model; we let the pretrained transformer diagnose its own motion cues and correct itself. This early regulation stabilizes geometric reasoning in streaming scenarios and leads to improvements in temporal consistency and camera pose robustness across multiple dynamic benchmarks, offering a simple and training-free pathway toward motion-aware streaming reconstruction.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Aron Distelzweig, Yiwei Wang, Faris Janjoลก, Marcel Hallgarten, Mihai Dobre et al.
Robotics
ๆฅ็ๆ่ฆ
Autonomous driving planning systems perform nearly perfectly in routine scenarios using lightweight, rule-based methods but still struggle in dense urban traffic, where lane changes and merges require anticipating and influencing other agents. Modern motion predictors offer highly accurate forecasts, yet their integration into planning is mostly rudimental: discarding unsafe plans. Similarly, end-to-end models offer a one-way integration that avoids the challenges of joint prediction and planning modeling under uncertainty. In contrast, game-theoretic formulations offer a principled alternative but have seen limited adoption in autonomous driving. We present Bayesian Iterative Best Response (BIBeR), a framework that unifies motion prediction and game-theoretic planning into a single interaction-aware process. BIBeR is the first to integrate a state-of-the-art predictor into an Iterative Best Response (IBR) loop, repeatedly refining the strategies of the ego vehicle and surrounding agents. This repeated best-response process approximates a Nash equilibrium, enabling bidirectional adaptation where the ego both reacts to and shapes the behavior of others. In addition, our proposed Bayesian confidence estimation quantifies prediction reliability and modulates update strength, more conservative under low confidence and more decisive under high confidence. BIBeR is compatible with modern predictors and planners, combining the transparency of structured planning with the flexibility of learned models. Experiments show that BIBeR achieves an 11% improvement over state-of-the-art planners on highly interactive interPlan lane-change scenarios, while also outperforming existing approaches on standard nuPlan benchmarks.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Jeongeun Park, Jihwan Yoon, Byungwoo Jeon, Juhan Park, Jinwoo Shin et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Prior Vision-Language-Action (VLA) models are typically trained on teleoperated successful demonstrations, while discarding numerous failed attempts that occur naturally during data collection. However, these failures encode where and how policies can be fragile, information that can be exploited to improve robustness. We address this problem by leveraging mixed-quality datasets to learn failure-aware reasoning at planning time. We introduce VINE, a hierarchical vision-language-action model that separates high-level reasoning (System 2) from low-level control (System 1) under a hierarchical reinforcement learning formalism, making failures usable as a structured learning signal rather than noisy supervision. System 2 performs feasibility-guided tree search over a 2D scene-graph abstraction: it proposes subgoal transitions, predicts success probabilities from both successes and failures, and prunes brittle branches before execution, effectively casting plan evaluation as feasibility scoring. The selected subgoal sequence is then passed to System 1, which executes low-level actions without modifying the agent's core skills. Trained entirely from offline teleoperation data, VINE integrates negative experience directly into the decision loop. Across challenging manipulation tasks, this approach consistently improves success rates and robustness, demonstrating that failure data is an essential resource for converting the broad competence of VLAs into robust execution.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Kenneth Stewart, Roxana Leontie, Samantha Chapin, Joe Hays, Sumit Bam Shrestha et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We present an end-to-end pipeline for deploying reinforcement learning (RL) trained Artificial Neural Networks (ANNs) on neuromorphic hardware by converting them into spiking Sigma-Delta Neural Networks (SDNNs). We demonstrate that an ANN policy trained entirely in simulation can be transformed into an SDNN compatible with Intel's Loihi 2 architecture, enabling low-latency and energy-efficient inference. As a test case, we use an RL policy for controlling the Astrobee free-flying robot, similar to a previously hardware in space-validated controller. The policy, trained with Rectified Linear Units (ReLUs), is converted to an SDNN and deployed on Intel's Loihi 2, then evaluated in NVIDIA's Omniverse Isaac Lab simulation environment for closed-loop control of Astrobee's motion. We compare execution performance between GPU and Loihi 2. The results highlight the feasibility of using neuromorphic platforms for robotic control and establish a pathway toward energy-efficient, real-time neuromorphic computation in future space and terrestrial robotics applications.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Ying-Kuan Tsai, Yi-Ping Chen, Vispi Karkaria, Wei Chen
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Active suspension systems are critical for enhancing vehicle comfort, safety, and stability, yet their performance is often limited by fixed hardware designs and control strategies that cannot adapt to uncertain and dynamic operating conditions. Recent advances in digital twins (DTs) and deep reinforcement learning (DRL) offer new opportunities for real-time, data-driven optimization across a vehicle's lifecycle. However, integrating these technologies into a unified framework remains an open challenge. This work presents a DT-based control co-design (CCD) framework for full-vehicle active suspensions using multi-generation design concepts. By integrating automatic differentiation into DRL, we jointly optimize physical suspension components and control policies under varying driver behaviors and environmental uncertainties. DRL also addresses the challenge of partial observability, where only limited states can be sensed and fed back to the controller, by learning optimal control actions directly from available sensor information. The framework incorporates model updating with quantile learning to capture data uncertainty, enabling real-time decision-making and adaptive learning from digital-physical interactions. The approach demonstrates personalized optimization of suspension systems under two distinct driving settings (mild and aggressive). Results show that the optimized systems achieve smoother trajectories and reduce control efforts by approximately 43% and 52% for mild and aggressive, respectively, while maintaining ride comfort and stability. Contributions include: developing a DT-enabled CCD framework integrating DRL and uncertainty-aware model updating for full-vehicle active suspensions, introducing a multi-generation design strategy for self-improving systems, and demonstrating personalized optimization of active suspension systems for distinct driver types.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Brais Fontan-Costas, M. Diaz-Cacho, Ruben Fernandez-Boullon, Manuel Alonso-Carracedo, Javier Perez-Robles
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
This paper presents an Autonomous System (AS) architecture for vehicles in a closed circuit. The AS performs precision tasks including computer vision for environment perception, positioning and mapping for accurate localization, path planning for optimal trajectory generation, and control for precise vehicle actuation. Each subsystem operates independently while connecting data through a cohesive pipeline architecture. The system implements a modular design that combines state-of-the-art technologies for real-time autonomous navigation in controlled environments.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Lei Zhang, Diwen Zheng, Kaixin Bai, Zhenshan Bing, Zoltan-Csaba Marton et al.
MultimodalRoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Dexterous grasp generation aims to produce grasp poses that align with task requirements and human interpretable grasp semantics. However, achieving semantically controllable dexterous grasp synthesis remains highly challenging due to the lack of unified modeling of multiple semantic dimensions, including grasp taxonomy, contact semantics, and functional affordance. To address these limitations, we present OmniDexVLG, a multimodal, semantics aware grasp generation framework capable of producing structurally diverse and semantically coherent dexterous grasps under joint language and visual guidance. Our approach begins with OmniDexDataGen, a semantic rich dexterous grasp dataset generation pipeline that integrates grasp taxonomy guided configuration sampling, functional affordance contact point sampling, taxonomy aware differential force closure grasp sampling, and physics based optimization and validation, enabling systematic coverage of diverse grasp types. We further introduce OmniDexReasoner, a multimodal grasp type semantic reasoning module that leverages multi agent collaboration, retrieval augmented generation, and chain of thought reasoning to infer grasp related semantics and generate high quality annotations that align language instructions with task specific grasp intent. Building upon these components, we develop a unified Vision Language Grasping generation model that explicitly incorporates grasp taxonomy, contact structure, and functional affordance semantics, enabling fine grained control over grasp synthesis from natural language instructions. Extensive experiments in simulation and real world object grasping and ablation studies demonstrate that our method substantially outperforms state of the art approaches in terms of grasp diversity, contact semantic diversity, functional affordance diversity, and semantic consistency.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Dominykas Strazdas, Magnus Jung, Jan Marquenie, Ingo Siegert, Ayoub Al-Hamadi
Robotics
ๆฅ็ๆ่ฆ
The effectiveness of human-robot interaction often hinges on the ability to cultivate engagement - a dynamic process of cognitive involvement that supports meaningful exchanges. Many existing definitions and models of engagement are either too vague or lack the ability to generalize across different contexts. We introduce IM HERE, a novel framework that models engagement effectively in human-human, human-robot, and robot-robot interactions. By employing an effort-based description of bilateral relationships between entities, we provide an accurate breakdown of relationship patterns, simplifying them to focus placement and four key states. This framework captures mutual relationships, group behaviors, and actions conforming to social norms, translating them into specific directives for autonomous systems. By integrating both subjective perceptions and objective states, the model precisely identifies and describes miscommunication. The primary objective of this paper is to automate the analysis, modeling, and description of social behavior, and to determine how autonomous systems can behave in accordance with social norms for full social integration while simultaneously pursuing their own social goals.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Jia Hu, Zhexi Lian, Xuerun Yan, Ruiang Bi, Dou Shen et al.
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Autonomous Driving (AD) vehicles still struggle to exhibit human-like behavior in highly dynamic and interactive traffic scenarios. The key challenge lies in AD's limited ability to interact with surrounding vehicles, largely due to a lack of understanding the underlying mechanisms of social interaction. To address this issue, we introduce MPCFormer, an explainable socially-aware autonomous driving approach with physics-informed and data-driven coupled social interaction dynamics. In this model, the dynamics are formulated into a discrete space-state representation, which embeds physics priors to enhance modeling explainability. The dynamics coefficients are learned from naturalistic driving data via a Transformer-based encoder-decoder architecture. To the best of our knowledge, MPCFormer is the first approach to explicitly model the dynamics of multi-vehicle social interactions. The learned social interaction dynamics enable the planner to generate manifold, human-like behaviors when interacting with surrounding traffic. By leveraging the MPC framework, the approach mitigates the potential safety risks typically associated with purely learning-based methods. Open-looped evaluation on NGSIM dataset demonstrates that MPCFormer achieves superior social interaction awareness, yielding the lowest trajectory prediction errors compared with other state-of-the-art approach. The prediction achieves an ADE as low as 0.86 m over a long prediction horizon of 5 seconds. Close-looped experiments in highly intense interaction scenarios, where consecutive lane changes are required to exit an off-ramp, further validate the effectiveness of MPCFormer. Results show that MPCFormer achieves the highest planning success rate of 94.67%, improves driving efficiency by 15.75%, and reduces the collision rate from 21.25% to 0.5%, outperforming a frontier Reinforcement Learning (RL) based planner.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Johannes Fischer, Marlon Steiner, รmer Sahin Tas, Christoph Stiller
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Model predictive control (MPC) is widely used for motion planning, particularly in autonomous driving. Real-time capability of the planner requires utilizing convex approximation of optimal control problems (OCPs) for the planner. However, such approximations confine the solution to a subspace, which might not contain the global optimum. To address this, we propose using safe reinforcement learning (SRL) to obtain a new and safe reference trajectory within MPC. By employing a learning-based approach, the MPC can explore solutions beyond the close neighborhood of the previous one, potentially finding global optima. We incorporate constrained reinforcement learning (CRL) to ensure safety in automated driving, using a handcrafted energy function-based safety index as the constraint objective to model safe and unsafe regions. Our approach utilizes a state-dependent Lagrangian multiplier, learned concurrently with the safe policy, to solve the CRL problem. Through experimentation in a highway scenario, we demonstrate the superiority of our approach over both MPC and SRL in terms of safety and performance measures.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Gabriele Fadini, Deepak Ingole, Tong Duy Son, Alisa Rupenyan
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper presents an auto-tuning framework for torque-based Nonlinear Model Predictive Control (nMPC), where the MPC serves as a real-time controller for optimal joint torque commands. The MPC parameters, including cost function weights and low-level controller gains, are optimized using high-dimensional Bayesian Optimization (BO) techniques, specifically Sparse Axis-Aligned Subspace (SAASBO) with a digital twin (DT) to achieve precise end-effector trajectory real-time tracking on an UR10e robot arm. The simulation model allows efficient exploration of the high-dimensional parameter space, and it ensures safe transfer to hardware. Our simulation results demonstrate significant improvements in tracking performance (+41.9%) and reduction in solve times (-2.5%) compared to manually-tuned parameters. Moreover, experimental validation on the real robot follows the trend (with a +25.8% improvement), emphasizing the importance of digital twin-enabled automated parameter optimization for robotic operations.
๐
2025-12-03
๐ In Proceedings of the IEEE International Conferenc
๐ฅ Marlon Steiner, Royden Wagner, รmer Sahin Tas, Christoph Stiller
Robotics
ๆฅ็ๆ่ฆ
Combining motion prediction and motion planning offers a promising framework for enhancing interactions between automated vehicles and other traffic participants. However, this introduces challenges in conditioning predictions on navigation goals and ensuring stable, kinematically feasible trajectories. Addressing the former challenge, this paper investigates the extension of attention-based motion prediction models with navigation information. By integrating the ego vehicle's intended route and goal pose into the model architecture, we bridge the gap between multi-agent motion prediction and goal-based motion planning. We propose and evaluate several architectural navigation integration strategies to our model on the nuPlan dataset. Our results demonstrate the potential of prediction-driven motion planning, highlighting how navigation information can enhance both prediction and planning tasks. Our implementation is at: https://github.com/KIT-MRT/future-motion.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Kehlani Fay, Darin Anthony Djapri, Anya Zorin, James Clinton, Ali El Lahib et al.
Robotics
ๆฅ็ๆ่ฆ
Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework will be open-sourced and available on our website.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Kenneth Stewart, Samantha Chapin, Roxana Leontie, Carl Glen Henshaw
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) offers transformative potential for robotic control in space. We present the first on-orbit demonstration of RL-based autonomous control of a free-flying robot, the NASA Astrobee, aboard the International Space Station (ISS). Using NVIDIA's Omniverse physics simulator and curriculum learning, we trained a deep neural network to replace Astrobee's standard attitude and translation control, enabling it to navigate in microgravity. Our results validate a novel training pipeline that bridges the simulation-to-reality (Sim2Real) gap, utilizing a GPU-accelerated, scientific-grade simulation environment for efficient Monte Carlo RL training. This successful deployment demonstrates the feasibility of training RL policies terrestrially and transferring them to space-based applications. This paves the way for future work in In-Space Servicing, Assembly, and Manufacturing (ISAM), enabling rapid on-orbit adaptation to dynamic mission requirements.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Samantha Chapin, Kenneth Stewart, Roxana Leontie, Carl Glen Henshaw
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The US Naval Research Laboratory's (NRL's) Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) experiment pioneers the use of reinforcement learning (RL) for control of free-flying robots in the zero-gravity (zero-G) environment of space. On Tuesday, May 27th 2025 the APIARY team conducted the first ever, to our knowledge, RL control of a free-flyer in space using the NASA Astrobee robot on-board the International Space Station (ISS). A robust 6-degrees of freedom (DOF) control policy was trained using an actor-critic Proximal Policy Optimization (PPO) network within the NVIDIA Isaac Lab simulation environment, randomizing over goal poses and mass distributions to enhance robustness. This paper details the simulation testing, ground testing, and flight validation of this experiment. This on-orbit demonstration validates the transformative potential of RL for improving robotic autonomy, enabling rapid development and deployment (in minutes to hours) of tailored behaviors for space exploration, logistics, and real-time mission needs.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
The Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive scenarios.In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments.To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Sundas Rafat Mulkana, Ronyu Yu, Tanaya Guha, Emma Li
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
In collaborative human-robot tasks, safety requires not only avoiding collisions but also ensuring safe, intentional physical contact. We present ContactRL, a reinforcement learning (RL) based framework that directly incorporates contact safety into the reward function through force feedback. This enables a robot to learn adaptive motion profiles that minimize human-robot contact forces while maintaining task efficiency. In simulation, ContactRL achieves a low safety violation rate of 0.2\% with a high task success rate of 87.7\%, outperforming state-of-the-art constrained RL baselines. In order to guarantee deployment safety, we augment the learned policy with a kinetic energy based Control Barrier Function (eCBF) shield. Real-world experiments on an UR3e robotic platform performing small object handovers from a human hand across 360 trials confirm safe contact, with measured normal forces consistently below 10N. These results demonstrate that ContactRL enables safe and efficient physical collaboration, thereby advancing the deployment of collaborative robots in contact-rich tasks.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Shahid Ansari, Mahendra Kumar Gohil, Yusuke Maeda, Bishakh Bhattacharya
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper presents an autonomous tomato-harvesting system built around a hybrid robotic gripper that combines six soft auxetic fingers with a rigid exoskeleton and a latex basket to achieve gentle, cage-like grasping. The gripper is driven by a servo-actuated Scotch--yoke mechanism, and includes separator leaves that form a conical frustum for fruit isolation, with an integrated micro-servo cutter for pedicel cutting. For perception, an RGB--D camera and a Detectron2-based pipeline perform semantic segmentation of ripe/unripe tomatoes and keypoint localization of the pedicel and fruit center under occlusion and variable illumination. An analytical model derived using the principle of virtual work relates servo torque to grasp force, enabling design-level reasoning about actuation requirements. During execution, closed-loop grasp-force regulation is achieved using a proportional--integral--derivative controller with feedback from force-sensitive resistors mounted on selected fingers to prevent slip and bruising. Motion execution is supported by Particle Swarm Optimization (PSO)--based trajectory planning for a 5-DOF manipulator. Experiments demonstrate complete picking cycles (approach, separation, cutting, grasping, transport, release) with an average cycle time of 24.34~s and an overall success rate of approximately 80\%, while maintaining low grasp forces (0.20--0.50~N). These results validate the proposed hybrid gripper and integrated vision--control pipeline for reliable harvesting in cluttered environments.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Kilian Schweppe, Anne-Kathrin Schmuck
Robotics
ๆฅ็ๆ่ฆ
We address the challenge of reliable and efficient interaction in autonomous multi-agent systems, where agents must balance long-term strategic objectives with short-term dynamic adaptation. We propose context-triggered contingency games, a novel integration of strategic games derived from temporal logic specifications with dynamic contingency games solved in real time. Our two-layered architecture leverages strategy templates to guarantee satisfaction of high-level objectives, while a new factor-graph-based solver enables scalable, real-time model predictive control of dynamic interactions. The resulting framework ensures both safety and progress in uncertain, interactive environments. We validate our approach through simulations and hardware experiments in autonomous driving and robotic navigation, demonstrating efficient, reliable, and adaptive multi-agent interaction.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Shifa Sulaiman, Amarnath H, Simon Bogh, Naresh Marturi
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Motion planning schemes are used for planning motions of a manipulator from an initial pose to a final pose during a task execution. A motion planning scheme generally comprises of a trajectory planning method and an inverse kinematic solver to determine trajectories and joints solutions respectively. In this paper, 3 motion planning schemes developed based on Jacobian methods are implemented to traverse a redundant manipulator with a coupled finger gripper through given trajectories. RRT* algorithm is used for planning trajectories and screw theory based forward kinematic equations are solved for determining joint solutions of the manipulator and gripper. Inverse solutions are computed separately using 3 Jacobian based methods such as Jacobian Transpose (JT), Pseudo Inverse (PI), and Damped Least Square (DLS) methods. Space Jacobian and manipulability measurements of the manipulator and gripper are obtained using screw theory formulations. Smoothness and RMSE error of generated trajectories and velocity continuity, acceleration profile, jerk, and snap values of joint motions are analysed for determining an efficient motion planning method for a given task. Advantages and disadvantages of the proposed motion planning schemes mentioned above are analysed using simulation studies to determine a suitable inverse solution technique for the tasks.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Yinzhou Tang, Yu Shang, Yinuo Chen, Bingwen Wei, Xin Zhang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. Toward this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates ''endogenous'' rewards derived from the model's intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out-of-domain scenarios.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Zexin Lin, Yebin Zhong, Hanwen Wan, Jiu Cheng, Zhenglong Sun et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Transition control poses a critical challenge in Vertical Take-Off and Landing Unmanned Aerial Vehicle (VTOL UAV) development due to the tilting rotor mechanism, which shifts the center of gravity and thrust direction during transitions. Current control methods' decoupled control of altitude and position leads to significant vibration, and limits interaction consideration and adaptability. In this study, we propose a novel coupled transition control methodology based on reinforcement learning (RL) driven controller. Besides, contrasting to the conventional phase-transition approach, the ST3M method demonstrates a new perspective by treating cruise mode as a special case of hover. We validate the feasibility of applying our method in simulation and real-world environments, demonstrating efficient controller development and migration while accurately controlling UAV position and attitude, exhibiting outstanding trajectory tracking and reduced vibrations during the transition process.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Yuhang Huang, Shilong Zou, Jiazhao Zhang, Xinwang Liu, Ruizhen Hu et al.
Robotics
ๆฅ็ๆ่ฆ
World Foundation Models (WFMs) offer remarkable visual dynamics simulation capabilities, yet their application to precise robotic control remains limited by the gap between generative realism and control-oriented precision. While existing approaches use WFMs as synthetic data generators, they suffer from high computational costs and underutilization of pre-trained VLA policies. We introduce \textbf{AdaPower} (\textbf{Ada}pt and Em\textbf{power}), a lightweight adaptation framework that transforms general-purpose WFMs into specialist world models through two novel components: Temporal-Spatial Test-Time Training (TS-TTT) for inference-time adaptation and Memory Persistence (MP) for long-horizon consistency. Integrated within a Model Predictive Control framework, our adapted world model empowers pre-trained VLAs, achieving over 41\% improvement in task success rates on LIBERO benchmarks without policy retraining, while preserving computational efficiency and generalist capabilities.
๐
2025-12-03
๐ ArXiv Preprint (cs.NI)
๐ฅ Pavlo Mykytyn, Ronald Chitauro, Onur Yener, Peter Langendoerfer
Robotics
ๆฅ็ๆ่ฆ
This work presents an experimental performance evaluation of a private 5G airfield network under controlled directional SDR jamming attacks targeting UAV-based UE nodes. Using a QualiPoc Android UE, mounted as a payload on a quadcopter UAV, we conducted a series of experiments to evaluate signal degradation, handover performance, and ser-vice stability in the presence of constant directional jamming. The conducted experiments aimed to examine the effects of varying travel speeds, altitudes, and moving patterns of a UAV-based UE to record and analyze the key physical-layer and network-layer metrics such as CQI, MCS, RSRP, SINR, BLER, Net PDSCH Throughput and RLF. The re-sults of this work describe the link stability and signal degradation dependencies, caused by the level of mobility of the UAV-based UE nodes during autonomous and automatic operation in private 5G Airfield networks
๐
2025-12-03
๐ Accepted in IEEE Robotics and Automation Letters (
๐ฅ Gihyeon Lee, Jungwoo Lee, Juwon Kim, Young-Sik Shin, Younggun Cho
Computer Vision
ๆฅ็ๆ่ฆ
Robots are often required to localize in environments with unknown object classes and semantic ambiguity. However, when performing global localization using semantic objects, high semantic ambiguity intensifies object misclassification and increases the likelihood of incorrect associations, which in turn can cause significant errors in the estimated pose. Thus, in this letter, we propose a multi-label likelihood-based semantic graph matching framework for object-level global localization. The key idea is to exploit multi-label graph representations, rather than single-label alternatives, to capture and leverage the inherent semantic context of object observations. Based on these representations, our approach enhances semantic correspondence across graphs by combining the likelihood of each node with the maximum likelihood of its neighbors via context-aware likelihood propagation. For rigorous validation, data association and pose estimation performance are evaluated under both closed-set and open-set detection configurations. In addition, we demonstrate the scalability of our approach to large-vocabulary object categories in both real-world indoor scenes and synthetic environments.
๐
2025-12-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhijian Qiao, Zehuan Yu, Tong Li, Chih-Chung Chou, Wenchao Ding et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Crowdsourcing enables scalable autonomous driving map construction, but low-cost sensor noise hinders quality from improving with data volume. We propose CSMapping, a system that produces accurate semantic maps and topological road centerlines whose quality consistently increases with more crowdsourced data. For semantic mapping, we train a latent diffusion model on HD maps (optionally conditioned on SD maps) to learn a generative prior of real-world map structure, without requiring paired crowdsourced/HD-map supervision. This prior is incorporated via constrained MAP optimization in latent space, ensuring robustness to severe noise and plausible completion in unobserved areas. Initialization uses a robust vectorized mapping module followed by diffusion inversion; optimization employs efficient Gaussian-basis reparameterization, projected gradient descent zobracket multi-start, and latent-space factor-graph for global consistency. For topological mapping, we apply confidence-weighted k-medoids clustering and kinematic refinement to trajectories, yielding smooth, human-like centerlines robust to trajectory variation. Experiments on nuScenes, Argoverse 2, and a large proprietary dataset achieve state-of-the-art semantic and topological mapping performance, with thorough ablation and scalability studies.
๐
2025-12-03
๐ 12 pages, 11 figures. Submitted to IEEE Transactio
๐ฅ Hidaka Asai, Tomoyuki Noda, Jun Morimoto
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Human motor control remains agile and robust despite limited sensory information for feedback, a property attributed to the body's ability to perform morphological computation through muscle coordination with variable impedance. However, it remains unclear how such low-level mechanical computation reduces the control requirements of the high-level controller. In this study, we implement a hierarchical controller consisting of a high-level neural network trained by reinforcement learning and a low-level variable-impedance muscle coor dination model with mono- and biarticular muscles in monoped locomotion task. We systematically restrict the high-level controller by varying the control frequency and by introducing biologically inspired observation conditions: delayed, partial, and substituted observation. Under these conditions, we evaluate how the low-level variable-impedance muscle coordination contributes to learning process of high-level neural network. The results show that variable-impedance muscle coordination enables stable locomotion even under slow-rate control frequency and limited observation conditions. These findings demonstrate that the morphological computation of muscle coordination effectively offloads high-frequency feedback of the high-level controller and provide a design principle for the controller in motor control.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Davood Soleymanzadeh, Xiao Liang, Minghui Zheng
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Deep learning methods have significantly enhanced motion planning for robotic manipulators by leveraging prior experiences within planning datasets. However, state-of-the-art neural motion planners are primarily trained on small datasets collected in manually generated workspaces, limiting their generalizability to out-of-distribution scenarios. Additionally, these planners often rely on monolithic network architectures that struggle to encode critical planning information. To address these challenges, we introduce Motion Policy with Dataset Synthesis powered by large language models (LLMs) and Fusion Action-Chunking Transformers (PerFACT), which incorporates two key components. Firstly, a novel LLM-powered workspace generation method, MotionGeneralizer, enables large-scale planning data collection by producing a diverse set of semantically feasible workspaces. Secondly, we introduce Fusion Motion Policy Networks (MpiNetsFusion), a generalist neural motion planner that uses a fusion action-chunking transformer to better encode planning signals and attend to multiple feature modalities. Leveraging MotionGeneralizer, we collect 3.5M trajectories to train and evaluate MpiNetsFusion against state-of-the-art planners, which shows that the proposed MpiNetsFusion can plan several times faster on the evaluated tasks.
๐
2025-12-03
๐ Accepted for publication in the Journal of Intelli
๐ฅ Raul Steinmetz, Fabio Demo Rosa, Victor Augusto Kich, Jair Augusto Bottega, Ricardo Bedin Grando et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous navigation of terrestrial robots using Reinforcement Learning (RL) from LIDAR observations remains challenging due to the high dimensionality of sensor data and the sample inefficiency of model-free approaches. Conventional policy networks struggle to process full-resolution LIDAR inputs, forcing prior works to rely on simplified observations that reduce spatial awareness and navigation robustness. This paper presents a novel model-based RL framework built on top of the DreamerV3 algorithm, integrating a Multi-Layer Perceptron Variational Autoencoder (MLP-VAE) within a world model to encode high-dimensional LIDAR readings into compact latent representations. These latent features, combined with a learned dynamics predictor, enable efficient imagination-based policy optimization. Experiments on simulated TurtleBot3 navigation tasks demonstrate that the proposed architecture achieves faster convergence and higher success rate compared to model-free baselines such as SAC, DDPG, and TD3. It is worth emphasizing that the DreamerV3-based agent attains a 100% success rate across all evaluated environments when using the full dataset of the Turtlebot3 LIDAR (360 readings), while model-free methods plateaued below 85%. These findings demonstrate that integrating predictive world models with learned latent representations enables more efficient and robust navigation from high-dimensional sensory data.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Tianchen Deng, Yue Pan, Shenghai Yuan, Dong Li, Chen Wang et al.
Robotics
ๆฅ็ๆ่ฆ
In this paper, we provide a comprehensive overview of existing scene representation methods for robotics, covering traditional representations such as point clouds, voxels, signed distance functions (SDF), and scene graphs, as well as more recent neural representations like Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and the emerging Foundation Models. While current SLAM and localization systems predominantly rely on sparse representations like point clouds and voxels, dense scene representations are expected to play a critical role in downstream tasks such as navigation and obstacle avoidance. Moreover, neural representations such as NeRF, 3DGS, and foundation models are well-suited for integrating high-level semantic features and language-based priors, enabling more comprehensive 3D scene understanding and embodied intelligence. In this paper, we categorized the core modules of robotics into five parts (Perception, Mapping, Localization, Navigation, Manipulation). We start by presenting the standard formulation of different scene representation methods and comparing the advantages and disadvantages of scene representation across different modules. This survey is centered around the question: What is the best 3D scene representation for robotics? We then discuss the future development trends of 3D scene representations, with a particular focus on how the 3D Foundation Model could replace current methods as the unified solution for future robotic applications. The remaining challenges in fully realizing this model are also explored. We aim to offer a valuable resource for both newcomers and experienced researchers to explore the future of 3D scene representations and their application in robotics. We have published an open-source project on GitHub and will continue to add new works and technologies to this project.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ Seungwon Choi, Dong-Gyu Park, Seo-Yeon Hwang, Tae-Wan Kim
ๆฅ็ๆ่ฆ
LiDAR-inertial odometry (LIO) is an active research area, as it enables accurate real-time state estimation in GPS-denied environments. Recent advances in map data structures and spatial indexing have significantly improved the efficiency of LIO systems. Nevertheless, we observe that two aspects may still leave room for improvement: (1) nearest neighbor search often requires examining multiple spatial units to gather sufficient points for plane fitting, and (2) plane parameters are typically recomputed at every iteration despite unchanged map geometry. Motivated by these observations, we propose Surfel-LIO, which employs a hierarchical voxel structure (hVox) with pre-computed surfel representation. This design enables O(1) correspondence retrieval without runtime neighbor enumeration or plane fitting, combined with Z-order curve encoding for cache-friendly spatial indexing. Experimental results on the M3DGR dataset demonstrate that our method achieves significantly faster processing speed compared to recent state-of-the-art methods while maintaining comparable state estimation accuracy. Our implementation is publicly available at https://github.com/93won/lidar_inertial_odometry.
๐
2025-12-03
๐ ArXiv Preprint (Robotics)
๐ฅ William van den Bogert, Gregory Linkowski, Nima Fazeli
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Imitation Learning (IL) holds great potential for learning repetitive manipulation tasks, such as those in industrial assembly. However, its effectiveness is often limited by insufficient trajectory precision due to compounding errors. In this paper, we introduce Grasped Object Manifold Projection (GOMP), an interactive method that mitigates these errors by constraining a non-rigidly grasped object to a lower-dimensional manifold. GOMP assumes a precise task in which a manipulator holds an object that may shift within the grasp in an observable manner and must be mated with a grounded part. Crucially, all GOMP enhancements are learned from the same expert dataset used to train the base IL policy, and are adjusted with an n-arm bandit-based interactive component. We propose a theoretical basis for GOMP's improvement upon the well-known compounding error bound in IL literature. We demonstrate the framework on four precise assembly tasks using tactile feedback, and note that the approach remains modality-agnostic. Data and videos are available at williamvdb.github.io/GOMPsite.
๐
2025-12-03
๐ Accepted to 2026 IEEE/CVF Winter Conference on App
๐ฅ Thomas Monninger, Zihan Zhang, Steffen Staab, Sihao Ding
Robotics
ๆฅ็ๆ่ฆ
Accurate environmental representations are essential for autonomous driving, providing the foundation for safe and efficient navigation. Traditionally, high-definition (HD) maps are providing this representation of the static road infrastructure to the autonomous system a priori. However, because the real world is constantly changing, such maps must be constructed online from on-board sensor data. Navigation-grade standard-definition (SD) maps are widely available, but their resolution is insufficient for direct deployment. Instead, they can be used as coarse prior to guide the online map construction process. We propose NavMapFusion, a diffusion-based framework that performs iterative denoising conditioned on high-fidelity sensor data and on low-fidelity navigation maps. This paper strives to answer: (1) How can coarse, potentially outdated navigation maps guide online map construction? (2) What advantages do diffusion models offer for map fusion? We demonstrate that diffusion-based map construction provides a robust framework for map fusion. Our key insight is that discrepancies between the prior map and online perception naturally correspond to noise within the diffusion process; consistent regions reinforce the map construction, whereas outdated segments are suppressed. On the nuScenes benchmark, NavMapFusion conditioned on coarse road lines from OpenStreetMap data reaches a 21.4% relative improvement on 100 m, and even stronger improvements on larger perception ranges, while maintaining real-time capabilities. By fusing low-fidelity priors with high-fidelity sensor data, the proposed method generates accurate and up-to-date environment representations, guiding towards safer and more reliable autonomous driving. The code is available at https://github.com/tmonnin/navmapfusion
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Athos Georgiou
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Late-interaction multimodal retrieval models like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on area efficiency. We evaluate on BBox-DocVQA with ground-truth bounding boxes. For within-page localization (given correct page retrieval), ColQwen3-4B with percentile-50 thresholding achieves 59.7% hit rate at IoU@0.5 (84.4% at IoU@0.25, 35.8% at IoU@0.7), with mean IoU of 0.569, compared to ~6.7% for random region selection. Our approach reduces context tokens by 28.8% compared to returning all OCR regions and by 52.3% compared to full-page image tokens. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation at https://github.com/athrael-soju/Snappy.
๐
2025-12-02
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Chenji Lu, Zhuo Chen, Hui Zhao, Zhiyuan Zeng, Gang Zhao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27\% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aniruddha Bora, Shixuan Zhang, Khemraj Shukla, Bryce Harrop, George Em. Karniadakis et al.
ๆฅ็ๆ่ฆ
Coarse resolution, imperfect parameterizations, and uncertain initial states and forcings limit Earth-system model (ESM) predictions. Traditional bias correction via data assimilation improves constrained simulations but offers limited benefit once models run freely. We introduce an operator-learning framework that maps instantaneous model states to bias-correction tendencies and applies them online during integration. Building on a U-Net backbone, we develop two operator architectures Inception U-Net (IUNet) and a multi-scale network (M\&M) that combine diverse upsampling and receptive fields to capture multiscale nonlinear features under Energy Exascale Earth System Model (E3SM) runtime constraints. Trained on two years E3SM simulations nudged toward ERA5 reanalysis, the operators generalize across height levels and seasons. Both architectures outperform standard U-Net baselines in offline tests, indicating that functional richness rather than parameter count drives performance. In online hybrid E3SM runs, M\&M delivers the most consistent bias reductions across variables and vertical levels. The ML-augmented configurations remain stable and computationally feasible in multi-year simulations, providing a practical pathway for scalable hybrid modeling. Our framework emphasizes long-term stability, portability, and cadence-limited updates, demonstrating the utility of expressive ML operators for learning structured, cross-scale relationships and retrofitting legacy ESMs.
๐
2025-12-02
๐ AAAI 2026
๐ฅ Matthew Peroni, Franck Le, Vadim Sheinin
ๆฅ็ๆ่ฆ
The development of tabular foundation models (TFMs) has accelerated in recent years, showing strong potential to outperform traditional ML methods for structured data. A key finding is that TFMs can be pretrained entirely on synthetic datasets, opening opportunities to design data generators that encourage desirable model properties. Prior work has mainly focused on crafting high-quality priors over generators to improve overall pretraining performance. Our insight is that parameterizing the generator distribution enables an adversarial robustness perspective: during training, we can adapt the generator to emphasize datasets that are particularly challenging for the model. We formalize this by introducing an optimality gap measure, given by the difference between TFM performance and the best achievable performance as estimated by strong baselines such as XGBoost, CatBoost, and Random Forests. Building on this idea, we propose Robust Tabular Foundation Models (RTFM), a model-agnostic adversarial training framework. Applied to the TabPFN V2 classifier, RTFM improves benchmark performance, with up to a 6% increase in mean normalized AUC over the original TabPFN and other baseline algorithms, while requiring less than 100k additional synthetic datasets. These results highlight a promising new direction for targeted adversarial training and fine-tuning of TFMs using synthetic data alone.
๐
2025-12-02
๐ AAAI 2026
๐ฅ Pengfei Hu, Fan Ming, Xiaoxue Han, Chang Lu, Yue Ning et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning models have shown promise in reservoir inflow prediction, yet their performance often deteriorates when applied to different reservoirs due to distributional differences, referred to as the domain shift problem. Domain generalization (DG) solutions aim to address this issue by extracting domain-invariant representations that mitigate errors in unseen domains. However, in hydrological settings, each reservoir exhibits unique inflow patterns, while some metadata beyond observations like spatial information exerts indirect but significant influence. This mismatch limits the applicability of conventional DG techniques to many-domain hydrological systems. To overcome these challenges, we propose HydroDCM, a scalable DG framework for cross-reservoir inflow forecasting. Spatial metadata of reservoirs is used to construct pseudo-domain labels that guide adversarial learning of invariant temporal features. During inference, HydroDCM adapts these features through light-weight conditioning layers informed by the target reservoir's metadata, reconciling DG's invariance with location-specific adaptation. Experiment results on 30 real-world reservoirs in the Upper Colorado River Basin demonstrate that our method substantially outperforms state-of-the-art DG baselines under many-domain conditions and remains computationally efficient.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Echo Diyun LU, Charles Findling, Marianne Clausel, Alessandro Leite, Wei Gong et al.
ๆฅ็ๆ่ฆ
Regime transitions routinely break stationarity in time series, making calibrated uncertainty as important as point accuracy. We study distribution-free uncertainty for regime-switching forecasting by coupling Deep Switching State Space Models with Adaptive Conformal Inference (ACI) and its aggregated variant (AgACI). We also introduce a unified conformal wrapper that sits atop strong sequence baselines including S4, MC-Dropout GRU, sparse Gaussian processes, and a change-point local model to produce online predictive bands with finite-sample marginal guarantees under nonstationarity and model misspecification. Across synthetic and real datasets, conformalized forecasters achieve near-nominal coverage with competitive accuracy and generally improved band efficiency.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Filippo Torresan, Ryota Kanai, Manuel Baltieri
Robotics
ๆฅ็ๆ่ฆ
Active inference proposes expected free energy as an objective for planning and decision-making to adequately balance exploitative and explorative drives in learning agents. The exploitative drive, or what an agent wants to achieve, is formalised as the Kullback-Leibler divergence between a variational probability distribution, updated at each inference step, and a preference probability distribution that indicates what states or observations are more likely for the agent, hence determining the agent's goal in a certain environment. In the literature, the questions of how the preference distribution should be specified and of how a certain specification impacts inference and learning in an active inference agent have been given hardly any attention. In this work, we consider four possible ways of defining the preference distribution, either providing the agents with hard or soft goals and either involving or not goal shaping (i.e., intermediate goals). We compare the performances of four agents, each given one of the possible preference distributions, in a grid world navigation task. Our results show that goal shaping enables the best performance overall (i.e., it promotes exploitation) while sacrificing learning about the environment's transition dynamics (i.e., it hampers exploration).
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nicholas Sung, Steven Spreizer, Mohamed Elrefaie, Matthew C. Jones, Faez Ahmed
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Despite progress in machine learning-based aerodynamic surrogates, the scarcity of large, field-resolved datasets limits progress on accurate pointwise prediction and reproducible inverse design for aircraft. We introduce BlendedNet++, a large-scale aerodynamic dataset and benchmark focused on blended wing body (BWB) aircraft. The dataset contains over 12,000 unique geometries, each simulated at a single flight condition, yielding 12,490 aerodynamic results for steady RANS CFD. For every case, we provide (i) integrated force/moment coefficients CL, CD, CM and (ii) dense surface fields of pressure and skin friction coefficients Cp and (Cfx, Cfy, Cfz). Using this dataset, we standardize a forward-surrogate benchmark to predict pointwise fields across six model families: GraphSAGE, GraphUNet, PointNet, a coordinate Transformer (Transolver-style), a FiLMNet (coordinate MLP with feature-wise modulation), and a Graph Neural Operator Transformer (GNOT). Finally, we present an inverse design task of achieving a specified lift-to-drag ratio under fixed flight conditions, implemented via a conditional diffusion model. To assess performance, we benchmark this approach against gradient-based optimization on the same surrogate and a diffusion-optimization hybrid that first samples with the conditional diffusion model and then further optimizes the designs. BlendedNet++ provides a unified forward and inverse protocol with multi-model baselines, enabling fair, reproducible comparison across architectures and optimization paradigms. We expect BlendedNet++ to catalyze reproducible research in field-level aerodynamics and inverse design; resources (dataset, splits, baselines, and scripts) will be released upon acceptance.
๐
2025-12-02
๐ AAAI 2026
๐ฅ Michael Theologitis, Dan Suciu
Robotics
ๆฅ็ๆ่ฆ
In today's age, it is becoming increasingly difficult to decipher truth from lies. Every day, politicians, media outlets, and public figures make conflicting claims$\unicode{x2014}$often about topics that can, in principle, be verified against structured data. For instance, statements about crime rates, economic growth or healthcare can all be verified against official public records and structured datasets. Building a system that can automatically do that would have sounded like science fiction just a few years ago. Yet, with the extraordinary progress in LLMs and agentic AI, this is now within reach. Still, there remains a striking gap between what is technically possible and what is being demonstrated by recent work. Most existing verification systems operate only on small, single-table databases$\unicode{x2014}$typically a few hundred rows$\unicode{x2014}$that conveniently fit within an LLM's context window.
In this paper we report our progress on Thucy, the first cross-database, cross-table multi-agent claim verification system that also provides concrete evidence for each verification verdict. Thucy remains completely agnostic to the underlying data sources before deployment and must therefore autonomously discover, inspect, and reason over all available relational databases to verify claims. Importantly, Thucy also reports the exact SQL queries that support its verdict (whether the claim is accurate or not) offering full transparency to expert users familiar with SQL. When evaluated on the TabFact dataset$\unicode{x2014}$the standard benchmark for fact verification over structured data$\unicode{x2014}$Thucy surpasses the previous state of the art by 5.6 percentage points in accuracy (94.3% vs. 88.7%).
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhiyuan He, Dingmin Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Reasoning Models (LRMs) achieve strong performance on complex reasoning tasks by generating long Chains of Thought (CoTs). However, this paradigm might incur substantial token overhead, especially when models "overthink" by producing lengthy reasoning chains, which can even lead to incorrect answers. A promising direction is the symbolic-solver-integrated approach, which leverages the code generation capabilities of LLMs to translate reasoning tasks into executable code and then solve them with a symbolic solver. In this paper, we explore an open question of when the conventional long-CoT can be enhanced by symbolic solvers. Our experimental results show that the symbolic-solver-integrated method only helps when the problem requires limited implicit reasoning but involves an ample search space. The latest LLMs, like GPT-4o, show better performance on deductive problems with shallow reasoning depth, while the symbolic-solver-integrated method significantly improves the LLMs' performance in constraint satisfaction problems that require repeated backtracks. When a declarative exemplar is provided, even CodeLlama-13B can outperform GPT-4o in difficult Zebra puzzles.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mark Moussa, Andre Williams, Seth Roffe, Douglas Morton
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Rapid and accurate wildfire detection is crucial for emergency response and environmental management. In airborne and spaceborne missions, real-time algorithms must distinguish between no fire, active fire, and post-fire conditions, and estimate fire intensity. Multispectral and hyperspectral thermal imagers provide rich spectral information, but high data dimensionality and limited onboard resources make real-time processing challenging. As wildfires increase in frequency and severity, the need for low-latency and computationally efficient onboard detection methods is critical.
We present a systematic evaluation of multiple deep learning architectures, including custom Convolutional Neural Networks (CNNs) and Transformer-based models, for multi-class fire classification. We also introduce PyroFocus, a two-stage pipeline that performs fire classification followed by fire radiative power (FRP) regression or segmentation to reduce inference time and computational cost for onboard deployment. Using data from NASA's MODIS/ASTER Airborne Simulator (MASTER), which is similar to a next-generation fire detection sensor, we compare accuracy, inference latency, and resource efficiency.
Experimental results show that the proposed two-stage pipeline achieves strong trade-offs between speed and accuracy, demonstrating significant potential for real-time edge deployment in future wildfire monitoring missions.
๐
2025-12-02
๐ ArXiv Preprint (cs.MA)
๐ฅ Enrico Grimaldi, Mario Edoardo Pandolfo, Gabriele D'Acunto, Sergio Barbarossa, Paolo Di Lorenzo
ๆฅ็ๆ่ฆ
Recent advances in AI call for a paradigm shift from bit-centric communication to goal- and semantics-oriented architectures, paving the way for AI-native 6G networks. In this context, we address a key open challenge: enabling heterogeneous AI agents to exchange compressed latent-space representations while mitigating semantic noise and preserving task-relevant meaning. We cast this challenge as learning both the communication topology and the alignment maps that govern information exchange among agents, yielding a learned network sheaf equipped with orthogonal maps. This learning process is further supported by a semantic denoising end compression module that constructs a shared global semantic space and derives sparse, structured representations of each agent's latent space. This corresponds to a nonconvex dictionary learning problem solved iteratively with closed-form updates. Experiments with mutiple AI agents pre-trained on real image data show that the semantic denoising and compression facilitates AI agents alignment and the extraction of semantic clusters, while preserving high accuracy in downstream task. The resulting communication network provides new insights about semantic heterogeneity across agents, highlighting the interpretability of our methodology.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Salman Rahman, Sruthi Gorantla, Arpit Gupta, Swastik Roy, Nanyun Peng et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
๐
2025-12-02
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Natalia Ponomareva, Zheng Xu, H. Brendan McMahan, Peter Kairouz, Lucas Rosenblatt et al.
ๆฅ็ๆ่ฆ
High quality data is needed to unlock the full potential of AI for end users. However finding new sources of such data is getting harder: most publicly-available human generated data will soon have been used. Additionally, publicly available data often is not representative of users of a particular system -- for example, a research speech dataset of contractors interacting with an AI assistant will likely be more homogeneous, well articulated and self-censored than real world commands that end users will issue. Therefore unlocking high-quality data grounded in real user interactions is of vital interest. However, the direct use of user data comes with significant privacy risks. Differential Privacy (DP) is a well established framework for reasoning about and limiting information leakage, and is a gold standard for protecting user privacy. The focus of this work, \emph{Differentially Private Synthetic data}, refers to synthetic data that preserves the overall trends of source data,, while providing strong privacy guarantees to individuals that contributed to the source dataset. DP synthetic data can unlock the value of datasets that have previously been inaccessible due to privacy concerns and can replace the use of sensitive datasets that previously have only had rudimentary protections like ad-hoc rule-based anonymization.
In this paper we explore the full suite of techniques surrounding DP synthetic data, the types of privacy protections they offer and the state-of-the-art for various modalities (image, tabular, text and decentralized). We outline all the components needed in a system that generates DP synthetic data, from sensitive data handling and preparation, to tracking the use and empirical privacy testing. We hope that work will result in increased adoption of DP synthetic data, spur additional research and increase trust in DP synthetic data approaches.
๐
2025-12-02
๐ ICML 2026
๐ฅ Kunj Joshi, David A. Smith
Natural Language Processing
ๆฅ็ๆ่ฆ
The current literature on memorization in Natural Language Models, especially Large Language Models (LLMs), poses severe security and privacy risks, as models tend to memorize personally identifying information (PIIs) from training data. We introduce Randomized Masked Fine-Tuning (RMFT), a novel privacy-preserving fine-tuning technique that reduces PII memorization while minimizing performance impact. Using the Enron Email Dataset, we demonstrate that RMFT achieves an 80.81% reduction in Total Extraction Rate and 80.17% reduction in Seen Extraction Rate compared to baseline fine-tuning, outperforming deduplication methods while maintaining only a 5.73% increase in perplexity. We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and show the performance of RMFT vs Deduplication by Area Under The Response Curve (AURC) metric.
๐
2025-12-02
๐ ArXiv Preprint (cs.SE)
๐ฅ Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although it is increasingly adopted, are vibe coding outputs really safe to deploy in production? To answer this question, we propose SU S VI B E S, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe-coding, particularly in security-sensitive applications.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Salman Rahman, Sruthi Gorantla, Arpit Gupta, Swastik Roy, Nanyun Peng et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Paulina Garcia-Corral
Natural Language Processing
ๆฅ็ๆ่ฆ
Explanations are a fundamental element of how people make sense of the political world. Citizens routinely ask and answer questions about why events happen, who is responsible, and what could or should be done differently. Yet despite their importance, explanations remain an underdeveloped object of systematic analysis in political science, and existing approaches are fragmented and often issue-specific. I introduce a framework for detecting and parsing explanations in political text. To do this, I train a lightweight causal language model that returns a structured data set of causal claims in the form of cause-effect pairs for downstream analysis. I demonstrate how causal explanations can be studied at scale, and show the method's modest annotation requirements, generalizability, and accuracy relative to human coding.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Faezeh Faez, Marzieh S. Tahaei, Yaochen Hu, Ali Pourranjbar, Mahdi Biparva et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have revolutionized the ability to understand and generate text, enabling significant progress in automatic knowledge graph construction from text (Text2KG). Many Text2KG methods, however, rely on iterative LLM prompting, making them computationally expensive and prone to overlooking complex relations distributed throughout the text. To address these limitations, we propose InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). Within this framework, the data-generation pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and leverages LLMs to generate corresponding natural text descriptions, a task more aligned with LLM capabilities than direct KG generation from text. This pipeline enables generating datasets composed of longer texts paired with larger KGs that better reflect real-world scenarios compared to existing benchmarks, thus supporting effective SFT of lightweight models for single-shot KG construction. Experimental results on CE12k, a dataset generated using the introduced pipeline, show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches, while also demonstrating stronger cross-dataset generalization on CrossEval-1200, a test set created from three established benchmark datasets and CE12k. These findings highlight the importance of realistic, high-quality training data for advancing efficient and high-performing Text2KG systems.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Stylianos Saroglou, Konstantinos Diamantaras, Francesco Preta, Marina Delianidi, Apostolos Benisis et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
This study investigates the potential of language models to improve the classification of labor market information by linking job vacancy texts to two major European frameworks: the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy and the European Qualifications Framework (EQF). We examine and compare two prominent methodologies from the literature: Sentence Linking and Entity Linking. In support of ongoing research, we release an open-source tool, incorporating these two methodologies, designed to facilitate further work on labor classification and employment discourse. To move beyond surface-level skill extraction, we introduce two annotated datasets specifically aimed at evaluating how occupations and qualifications are represented within job vacancy texts. Additionally, we examine different ways to utilize generative large language models for this task. Our findings contribute to advancing the state of the art in job entity extraction and offer computational infrastructure for examining work, skills, and labor market narratives in a digitally mediated economy. Our code is made publicly available: https://github.com/tabiya-tech/tabiya-livelihoods-classifier
๐
2025-12-02
๐ ArXiv Preprint (cs.CY)
๐ฅ Joan Nwatu, Longju Bai, Oana Ignat, Rada Mihalcea
Multimodal
ๆฅ็ๆ่ฆ
Culture shapes the objects people use and for what purposes, yet mainstream Vision-Language (VL) datasets frequently exhibit cultural biases, disproportionately favoring higher-income, Western contexts. This imbalance reduces model generalizability and perpetuates performance disparities, especially impacting lower-income and non-Western communities. To address these disparities, we propose a novel function-centric framework that categorizes objects by the functions they fulfill, across diverse cultural and economic contexts. We implement this framework by creating the Culture Affordance Atlas, a re-annotated and culturally grounded restructuring of the Dollar Street dataset spanning 46 functions and 288 objects publicly available at https://lit.eecs.umich.edu/CultureAffordance-Atlas/index.html. Through extensive empirical analyses using the CLIP model, we demonstrate that function-centric labels substantially reduce socioeconomic performance gaps between high- and low-income groups by a median of 6 pp (statistically significant), improving model effectiveness for lower-income contexts. Furthermore, our analyses reveals numerous culturally essential objects that are frequently overlooked in prominent VL datasets. Our contributions offer a scalable pathway toward building inclusive VL datasets and equitable AI systems.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mainak Singha
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) produce fluent but unsupported answers - hallucinations - limiting safe deployment in high-stakes domains. We propose ECLIPSE, a framework that treats hallucination as a mismatch between a model's semantic entropy and the capacity of available evidence. We combine entropy estimation via multi-sample clustering with a novel perplexity decomposition that measures how models use retrieved evidence. We prove that under mild conditions, the resulting entropy-capacity objective is strictly convex with a unique stable optimum. We evaluate on a controlled financial question answering dataset with GPT-3.5-turbo (n=200 balanced samples with synthetic hallucinations), where ECLIPSE achieves ROC AUC of 0.89 and average precision of 0.90, substantially outperforming a semantic entropy-only baseline (AUC 0.50). A controlled ablation with Claude-3-Haiku, which lacks token-level log probabilities, shows AUC dropping to 0.59 with coefficient magnitudes decreasing by 95% - demonstrating that ECLIPSE is a logprob-native mechanism whose effectiveness depends on calibrated token-level uncertainties. The perplexity decomposition features exhibit the largest learned coefficients, confirming that evidence utilization is central to hallucination detection. We position this work as a controlled mechanism study; broader validation across domains and naturally occurring hallucinations remains future work.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Albert H. Li, Ivan Dario Jimenez Rodriguez, Joel W. Burdick, Yisong Yue, Aaron D. Ames
Robotics
ๆฅ็ๆ่ฆ
Long-horizon dynamical prediction is fundamental in robotics and control, underpinning canonical methods like model predictive control. Yet, many systems and disturbance phenomena are difficult to model due to effects like nonlinearity, chaos, and high-dimensionality. Koopman theory addresses this by modeling the linear evolution of embeddings of the state under an infinite-dimensional linear operator that can be approximated with a suitable finite basis of embedding functions, effectively trading model nonlinearity for representational complexity. However, explicitly computing a good choice of basis is nontrivial, and poor choices may cause inaccurate forecasts or overfitting. To address this, we present Kalman-Implicit Koopman Operator (KALIKO) Learning, a method that leverages the Kalman filter to implicitly learn embeddings corresponding to latent dynamics without requiring an explicit encoder. KALIKO produces interpretable representations consistent with both theory and prior works, yielding high-quality reconstructions and inducing a globally linear latent dynamics. Evaluated on wave data generated by a high-dimensional PDE, KALIKO surpasses several baselines in open-loop prediction and in a demanding closed-loop simulated control task: stabilizing an underactuated manipulator's payload by predicting and compensating for strong wave disturbances.
๐
2025-12-02
๐ NeurIPS 2025
๐ฅ Jingkang Wang, Henry Che, Yun Chen, Ze Yang, Lily Goli et al.
RoboticsMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
Reconstructing large-scale dynamic scenes from visual observations is a fundamental challenge in computer vision, with critical implications for robotics and autonomous systems. While recent differentiable rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved impressive photorealistic reconstruction, they suffer from scalability limitations and require annotations to decouple actor motion. Existing self-supervised methods attempt to eliminate explicit annotations by leveraging motion cues and geometric priors, yet they remain constrained by per-scene optimization and sensitivity to hyperparameter tuning. In this paper, we introduce Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic scenes. Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations in a fully unsupervised manner. By adopting only photometric losses and enforcing an "as static as possible" regularization, Flux4D learns to decompose dynamic elements directly from raw data without requiring pre-trained supervised models or foundational priors simply by training across many scenes. Our approach enables efficient reconstruction of dynamic scenes within seconds, scales effectively to large datasets, and generalizes well to unseen environments, including rare and unknown objects. Experiments on outdoor driving datasets show Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Johannes Gaber, Meshal Alharbi, Daniele Gammelli, Gioele Zardini
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Large robot fleets are now common in warehouses and other logistics settings, where small control gains translate into large operational impacts. In this article, we address task scheduling for lifelong Multi-Agent Pickup-and-Delivery (MAPD) and propose a hybrid method that couples learning-based global guidance with lightweight optimization. A graph neural network policy trained via reinforcement learning outputs a desired distribution of free agents over an aggregated warehouse graph. This signal is converted into region-to-region rebalancing through a minimum-cost flow, and finalized by small, local assignment problems, preserving accuracy while keeping per-step latency within a 1 s compute budget. On congested warehouse benchmarks from the League of Robot Runners (LRR) with up to 500 agents, our approach improves throughput by up to 10% over the 2024 winning scheduler while maintaining real-time execution. The results indicate that coupling graph-structured learned guidance with tractable solvers reduces congestion and yields a practical, scalable blueprint for high-throughput scheduling in large fleets.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Aya Taourirte, Md Sohag Mia
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The deployment of multi-agent systems in dynamic, adversarial environments like robotic soccer necessitates real-time decision-making, sophisticated cooperation, and scalable algorithms to avoid the curse of dimensionality. While Reinforcement Learning (RL) offers a promising framework, existing methods often struggle with the multi-granularity of tasks (long-term strategy vs. instant actions) and the complexity of large-scale agent interactions. This paper presents a unified Multi-Agent Reinforcement Learning (MARL) framework that addresses these challenges. First, we establish a baseline using Proximal Policy Optimization (PPO) within a client-server architecture for real-time action scheduling, with PPO demonstrating superior performance (4.32 avg. goals, 82.9% ball control). Second, we introduce a Hierarchical RL (HRL) structure based on the options framework to decompose the problem into a high-level trajectory planning layer (modeled as a Semi-Markov Decision Process) and a low-level action execution layer, improving global strategy (avg. goals increased to 5.26). Finally, to ensure scalability, we integrate mean-field theory into the HRL framework, simplifying many-agent interactions into a single agent vs. the population average. Our mean-field actor-critic method achieves a significant performance boost (5.93 avg. goals, 89.1% ball control, 92.3% passing accuracy) and enhanced training stability. Extensive simulations of 4v4 matches in the Webots environment validate our approach, demonstrating its potential for robust, scalable, and cooperative behavior in complex multi-agent domains.
๐
2025-12-02
๐ ArXiv Preprint (cs.GR)
๐ฅ Yuxuan Mu, Ziyu Zhang, Yi Shi, Minami Matsumoto, Kotaro Imamura et al.
Robotics
ๆฅ็ๆ่ฆ
Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Iana Zhura, Sausar Karaf, Faryal Batool, Nipun Dhananjaya Weerakkodi Mudalige, Valerii Serpiva et al.
Robotics
ๆฅ็ๆ่ฆ
Visual traversability estimation is critical for autonomous navigation, but existing VLM-based methods rely on hand-crafted prompts, generalize poorly across embodiments, and output only traversability maps, leaving trajectory generation to slow external planners. We propose SwarmDiffusion, a lightweight end-to-end diffusion model that jointly predicts traversability and generates a feasible trajectory from a single RGB image. To remove the need for annotated or planner-produced paths, we introduce a planner-free trajectory construction pipeline based on randomized waypoint sampling, Bezier smoothing, and regularization enforcing connectivity, safety, directionality, and path thinness. This enables learning stable motion priors without demonstrations. SwarmDiffusion leverages VLM-derived supervision without prompt engineering and conditions the diffusion process on a compact embodiment state, producing physically consistent, traversable paths that transfer across different robot platforms. Across indoor environments and two embodiments (quadruped and aerial), the method achieves 80-100% navigation success and 0.09s inference, and adapts to a new robot using only-500 additional visual samples. It generalizes reliably to unseen environments in simulation and real-world trials, offering a scalable, prompt-free approach to unified traversability reasoning and trajectory generation.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Xianchao Zeng, Xinyu Zhou, Youcheng Li, Jiayou Shi, Tianle Li et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Michael Ofengenden, Yunze Man, Ziqi Pang, Yu-Xiong Wang
Robotics
ๆฅ็ๆ่ฆ
We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin
ๆฅ็ๆ่ฆ
Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
๐
2025-12-02
๐ ArXiv Preprint (cs.GR)
๐ฅ Yuxuan Mu, Ziyu Zhang, Yi Shi, Minami Matsumoto, Kotaro Imamura et al.
Robotics
ๆฅ็ๆ่ฆ
Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Saeid Jamshidi, Kawser Wazed Nafi, Arghavan Moradi Dakhel, Negar Shahabi, Foutse Khomh
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts. Existing alignment frameworks, structured approaches designed to align model behavior with human ethical and social norms, often rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning may evolve across different contexts or temporal scales. This study presents the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of LLMs. MoCoP combines three supporting layers: (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Our empirical results on GPT-4-Turbo and DeepSeek suggest that MoCoP effectively captures longitudinal ethical behavior, revealing a strong inverse relationship between ethical and toxicity dimensions (correlation rET = -0.81, p value less than 0.001) and a near-zero association with response latency (correlation rEL approximately equal to 0). These findings demonstrate that moral coherence and linguistic safety tend to emerge as stable and interpretable characteristics of model behavior rather than short-term fluctuations. Furthermore, by reframing ethical evaluation as a dynamic, model-agnostic form of moral introspection, MoCoP offers a reproducible foundation for scalable, continuous auditing and advances the study of computational morality in autonomous AI systems.
๐
2025-12-02
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Chenji Lu, Zhuo Chen, Hui Zhao, Zhiyuan Zeng, Gang Zhao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27\% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.
๐
2025-12-02
๐ AAAI
๐ฅ Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. However, existing benchmarks focus on either training/fine-tuning or performance of inference and provide little support for power consumption measurement and analysis of inference. We introduce TokenPowerBench, the first lightweight and extensible benchmark designed for LLM-inference power consumption studies. The benchmark combines (i) a declarative configuration interface covering model choice, prompt set, and inference engine, (ii) a measurement layer that captures GPU-, node-, and system-level power without specialized power meters, and (iii) a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. These elements make it straight-forward to explore the power consumed by an LLM inference run; furthermore, by varying batch size, context length, parallelism strategy and quantization, users can quickly assess how each setting affects joules per token and other energy-efficiency metrics. We evaluate TokenPowerBench on four of the most widely used model series (Llama, Falcon, Qwen, and Mistral). Our experiments cover from 1 billion parameters up to the frontier-scale Llama3-405B model. Furthermore, we release TokenPowerBench as open source to help users to measure power consumption, forecast operating expenses, and meet sustainability targets when deploying LLM services.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh
Natural Language Processing
ๆฅ็ๆ่ฆ
Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sagi Polaczek, Or Patashnik, Ali Mahdavi-Amiri, Daniel Cohen-Or
Robotics
ๆฅ็ๆ่ฆ
Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject's original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment when doing mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Thomas Rivasseau
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Current research on operator control of Large Language Models improves model robustness against adversarial attacks and misbehavior by training on preference examples, prompting, and input/output filtering. Despite good results, LLMs remain susceptible to abuse, and jailbreak probability increases with context length. There is a need for robust LLM security guarantees in long-context situations. We propose control sentences inserted into the LLM context as invasive context engineering to partially solve the problem. We suggest this technique can be generalized to the Chain-of-Thought process to prevent scheming. Invasive Context Engineering does not rely on LLM training, avoiding data shortage pitfalls which arise in training models for long context situations.
๐
2025-12-02
๐ IEEE ISNCC 2025
๐ฅ Muyu Pan, Dheeraj Kodakandla, Mahfuza Farooque
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic without human intervention. This enables automated reasoning and facilitates debugging, finding loop invariants, and adhering to specifications in software systems. However, hallucinations-incorrect outputs generated by LLMs are challenging, particularly for logical translation tasks requiring precision. This work introduces a novel framework that inputs English sentences, converts them into logical expressions, and then translates them into Conjunctive Normal Form (CNF) for satisfiability solving. It employs classical NLP techniques with self-defined grammar, symbolic computation libraries, and a fine-tuned language model to reduce hallucinations. In the early experiments, we observed that the fine-tuned model, trained on different grammar settings, could intentionally correct the same types of hallucinations made by the original model. Thus, it provides reliable CNF generation.
๐
2025-12-02
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Paul Barbaste, Olivier Oullier, Xavier Vasques
Multimodal
ๆฅ็ๆ่ฆ
Robust decoding and classification of brain patterns measured with electroencephalography (EEG) remains a major challenge for real-world (i.e. outside scientific lab and medical facilities) brain-computer interface (BCI) applications due to well documented inter- and intra-participant variability. Here, we present a large-scale benchmark evaluating over 340,000+ unique combinations of spatial and nonlinear EEG classification. Our methodological pipeline consists in combinations of Common Spatial Patterns (CSP), Riemannian geometry, functional connectivity, and fractal- or entropy-based features across three open-access EEG datasets. Unlike prior studies, our analysis operates at the per-participant level and across multiple frequency bands (8-15 Hz and 8-30 Hz), enabling direct assessment of both group-level performance and individual variability. Covariance tangent space projection (cov-tgsp) and CSP consistently achieved the highest average classification accuracies. However, their effectiveness was strongly dataset-dependent, and marked participant-level differences persisted, particularly in the most heterogeneous of the datasets. Importantly, nonlinear methods outperformed spatial approaches for specific individuals, underscoring the need for personalized pipeline selection. Our findings highlight that no universal 'one-size-fits-all' method can optimally decode EEG motor imagery patterns across all users or datasets. Future work will require adaptive, multimodal, and possibly novel approaches to fully address neurophysiological variability in practical BCI applications where the system can automatically adapt to what makes each user unique.
๐
2025-12-02
๐ ArXiv Preprint (cs.PL)
๐ฅ Isha Chaudhary, Vedaant Jain, Avaljot Singh, Kavya Sachdeva, Sayan Ranu et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We introduce the first principled framework, Lumos, for specifying and formally certifying Language Model System (LMS) behaviors. Lumos is an imperative probabilistic programming DSL over graphs, with constructs to generate independent and identically distributed prompts for LMS. It offers a structured view of prompt distributions via graphs, forming random prompts from sampled subgraphs. Lumos supports certifying LMS for arbitrary prompt distributions via integration with statistical certifiers. We provide hybrid (operational and denotational) semantics for Lumos, providing a rigorous way to interpret the specifications. Using only a small set of composable constructs, Lumos can encode existing LMS specifications, including complex relational and temporal specifications. It also facilitates specifying new properties - we present the first safety specifications for vision-language models (VLMs) in autonomous driving scenarios developed with Lumos. Using these, we show that the state-of-the-art VLM Qwen-VL exhibits critical safety failures, producing incorrect and unsafe responses with at least 90% probability in right-turn scenarios under rainy driving conditions, revealing substantial safety risks. Lumos's modular structure allows easy modification of the specifications, enabling LMS certification to stay abreast with the rapidly evolving threat landscape. We further demonstrate that specification programs written in Lumos enable finding specific failure cases exhibited by state-of-the-art LMS. Lumos is the first systematic and extensible language-based framework for specifying and certifying LMS behaviors, paving the way for a wider adoption of LMS certification.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lanxiang Hu, Abhilash Shankarampeta, Yixin Huang, Zilin Dai, Haoyang Yu et al.
ๆฅ็ๆ่ฆ
The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are physical commonsense-based, offering limited insight into video models' scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: \href{https://github.com/hao-ai-lab/VideoScience}{github.com/hao-ai-lab/VideoScience}.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yancheng Zhang, Guangyu Sun, Chen Chen
Computer VisionRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time rendering with high appearance fidelity, it suffers from multi-view inconsistencies, limiting geometric accuracy. In contrast, 2D Gaussian Splatting (2DGS) enforces multi-view consistency but compromises texture details. To address these limitations, we propose Exchangeable Gaussian Splatting (EGGS), a hybrid representation that integrates 2D and 3D Gaussians to balance appearance and geometry. To achieve this, we introduce Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization that effectively exploits the strengths of each type of Gaussian representation. Our CUDA-accelerated implementation ensures efficient training and inference. Extensive experiments demonstrate that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, providing a practical solution for high-quality NVS.
๐
2025-12-02
๐ NeurIPS 2025
๐ฅ Zhonghao He, Tianyi Qiu, Hirokazu Shirado, Maarten Sap
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates are unpredictable from the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, which signal deviation from the Bayesian ability of updating on new evidence. In open-ended problem domains including event forecasting, value-laden questions, and academic paper review, we find such violations to be widespread across models and setups, where the current belief positively predicts future belief updates, a phenomenon which we term belief entrenchment. We identify the models, reasoning techniques, and domains more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy of the truth-seeking ability of a reasoning process.
๐
2025-12-02
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Enrico Cipriani, Pavel Okopnyi, Danilo Menicucci, Simone Grassini
Natural Language Processing
ๆฅ็ๆ่ฆ
Developing and validating psychometric scales requires large samples, multiple testing phases, and substantial resources. Recent advances in Large Language Models (LLMs) enable the generation of synthetic participant data by prompting models to answer items while impersonating individuals of specific demographic profiles, potentially allowing in silico piloting before real data collection. Across four preregistered studies (N = circa 300 each), we tested whether LLM-simulated datasets can reproduce the latent structures and measurement properties of human responses. In Studies 1-2, we compared LLM-generated data with real datasets for two validated scales; in Studies 3-4, we created new scales using EFA on simulated data and then examined whether these structures generalized to newly collected human samples. Simulated datasets replicated the intended factor structures in three of four studies and showed consistent configural and metric invariance, with scalar invariance achieved for the two newly developed scales. However, correlation-based tests revealed substantial differences between real and synthetic datasets, and notable discrepancies appeared in score distributions and variances. Thus, while LLMs capture group-level latent structures, they do not approximate individual-level data properties. Simulated datasets also showed full internal invariance across gender. Overall, LLM-generated data appear useful for early-stage, group-level psychometric prototyping, but not as substitutes for individual-level validation. We discuss methodological limitations, risks of bias and data pollution, and ethical considerations related to in silico psychometric simulations.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fan Yang, Kaihao Zhang
Computer VisionNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent study address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocalbulary object detection (OVD) model that identifies object regions using a sliding-window approach.Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
๐
2025-12-02
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Marie Louise Schubert, Houssam Metni, Jan David Fischbach, Benedikt Zerulla, Marjan Krstiฤ et al.
Machine Learning
ๆฅ็ๆ่ฆ
Maximizing energy yield (EY) - the total electric energy generated by a solar cell within a year at a specific location - is crucial in photovoltaics (PV), especially for emerging technologies. Computational methods provide the necessary insights and guidance for future research. However, existing simulations typically focus on only isolated aspects of solar cells. This lack of consistency highlights the need for a framework unifying all computational levels, from material to cell properties, for accurate prediction and optimization of EY prediction. To address this challenge, a differentiable digital twin, Sol(Di)$^2$T, is introduced to enable comprehensive end-to-end optimization of solar cells. The workflow starts with material properties and morphological processing parameters, followed by optical and electrical simulations. Finally, climatic conditions and geographic location are incorporated to predict the EY. Each step is either intrinsically differentiable or replaced with a machine-learned surrogate model, enabling not only accurate EY prediction but also gradient-based optimization with respect to input parameters. Consequently, Sol(Di)$^2$T extends EY predictions to previously unexplored conditions. Demonstrated for an organic solar cell, the proposed framework marks a significant step towards tailoring solar cells for specific applications while ensuring maximal performance.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang
Multimodal
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.
๐
2025-12-02
๐ ArXiv Preprint (cs.SE)
๐ฅ Pedro Orvalho, Marta Kwiatkowska, Mikolรกลก Janota, Vasco Manquinho
ๆฅ็ๆ่ฆ
Debugging is one of the most time-consuming and expensive tasks in software development and circuit design. Several formula-based fault localisation (FBFL) methods have been proposed, but they fail to guarantee a set of diagnoses across all failing tests or may produce redundant diagnoses that are not subset-minimal, particularly for programs/circuits with multiple faults.
This paper introduces CFaults, a novel fault localisation tool for C software and Boolean circuits with multiple faults. CFaults leverages Model-Based Diagnosis (MBD) with multiple observations and aggregates all failing test cases into a unified Maximum Satisfiability (MaxSAT) formula. Consequently, our method guarantees consistency across observations and simplifies the fault localisation procedure. Experimental results on three benchmark sets, two of C programs, TCAS and C-Pack-IPAs, and one of Boolean circuits, ISCAS85, show that CFaults is faster at localising faults in C software than other FBFL approaches such as BugAssist, SNIPER, and HSD. On the ISCAS85 benchmark, CFaults is generally slower than HSD; however, it localises faults in only 6% fewer circuits, demonstrating that it remains competitive in this domain. Furthermore, CFaults produces only subset-minimal diagnoses of faulty statements, whereas the other approaches tend to enumerate redundant diagnoses (e.g., BugAssist and SNIPER).
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Youkang Wang, Jian Wang, Rubing Chen, Tianyi Zeng, Xiao-Yong Wei et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jef Caers
ๆฅ็ๆ่ฆ
The energy transition through increased electrification has put the worlds attention on critical mineral exploration Even with increased investments a decrease in new discoveries has taken place over the last two decades Here I propose a solution to this problem where AI is implemented as the enabler of a rigorous scientific method for mineral exploration that aims to reduce cognitive bias and false positives drive down the cost of exploration I propose a new scientific method that is based on a philosophical approach founded on the principles of Bayesianism and falsification In this approach data acquisition is in the first place seen as a means to falsify human generated hypothesis Decision of what data to acquire next is quantified with verifiable metrics and based on rational decision making A practical protocol is provided that can be used as a template in any exploration campaign However in order to make this protocol practical various form of artificial intelligence are needed I will argue that the most important form are one novel unsupervised learning methods that collaborate with domain experts to better understand data and generate multiple competing geological hypotheses and two humanintheloop AI algorithms that can optimally plan various geological geophysical geochemical and drilling data acquisition where uncertainty reduction of geological hypothesis precedes the uncertainty reduction on grade and tonnage
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mikoลaj Sacha, Hammad Jafri, Mattie Terzolo, Ayan Sinha, Andrew Rabinovich
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recommending matches in a text-rich, dynamic two-sided marketplace presents unique challenges due to evolving content and interaction graphs. We introduce GraphMatch, a new large-scale recommendation framework that fuses pre-trained language models with graph neural networks to overcome these challenges. Unlike prior approaches centered on standalone models, GraphMatch is a comprehensive recipe built on powerful text encoders and GNNs working in tandem. It employs adversarial negative sampling alongside point-in-time subgraph training to learn representations that capture both the fine-grained semantics of evolving text and the time-sensitive structure of the graph. We evaluated extensively on interaction data from Upwork, a leading labor marketplace, at large scale, and discuss our approach towards low-latency inference suitable for real-time use. In our experiments, GraphMatch outperforms language-only and graph-only baselines on matching tasks while being efficient at runtime. These results demonstrate that unifying language and graph representations yields a highly effective solution to text-rich, dynamic two-sided recommendations, bridging the gap between powerful pretrained LMs and large-scale graphs in practice.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lechen Zhang, Yusheng Zhou, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/ .
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, it exists redundant action modes that are irrelevant to the success action modes of the downstream task. Specifically, we observe a critical inference-time fragility among various sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by stable success modes of the downstream task dataset. Thus, we propose \textbf{TACO}, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. The VLA models integrated with TACO can execute the actions with maximum pseudo-count from all sampled action chunks, thereby preventing distribution shifts while preserving the generalization ability of VLAs since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it incurs significant computational benefits compared to RL update, especially for flow or diffusion-based VLAs which are difficult to perform RL update due to denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves the inference stability and success rates in downstream-task adaptations.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ihab Ahmed, Denis Krompaร, Cheng Feng, Volker Tresp
Machine Learning
ๆฅ็ๆ่ฆ
We investigate input normalization methods for Time-Series Foundation Models (TSFMs). While normalization is well-studied in dataset-specific time-series models, it remains overlooked in TSFMs where generalization is critical. Time-series data, unlike text or images, exhibits significant scale variation across domains and channels, coupled with non-stationarity, can undermine TSFM performance regardless of architectural complexity. Through systematic evaluation across four architecturally diverse TSFMs, we empirically establish REVIN as the most efficient approach, reducing zero-shot MASE by 89\% relative to an un-normalized baseline and by 44\% versus other normalization methods, while matching the best in-domain accuracy (0.84 MASE) without any dataset-level preprocessing -- yielding the highest accuracy-efficiency trade-off. Yet its effect utilization depends on architectural design choices and optimization objective, particularly with respect to training loss scale sensitivity and model type (probabilistic, point-forecast, or LLM-based models).
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Awad, Mahmoud Akrm, Walid Gomaa
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduce a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yongrui Yu, Zhongzhen Huang, Linjie Mu, Shaoting Zhang, Xiaofan Zhang
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Radiology reporting is an essential yet time-consuming and error-prone task for radiologists in clinical examinations, especially for volumetric medical images. Rigorous quality control is also critical but tedious, ensuring that the final report meets clinical standards. Existing automated approaches, including radiology report generation methods and medical vision-language models, focus mainly on the report generation phase and neglect the crucial quality control procedure, limiting their capability to provide comprehensive support to radiologists. We propose Radiologist Copilot, an agentic AI assistant equipped with orchestrated tools designed for automated radiology reporting with quality control. Leveraging large language models as the reasoning backbone, the agentic system autonomously selects tools, plans, and executes actions, emulating the behavior of radiologists throughout the holistic radiology reporting process. The orchestrated tools include region localization, think with image paradigm directed region analysis planning, strategic template selection for report generation, quality assessment and feedback-driven adaptive refinement for quality control. Therefore, Radiologist Copilot facilitates accurate, complete, and efficient radiology reporting, assisting radiologists and improving clinical efficiency. Experimental results demonstrate that Radiologist Copilot significantly surpasses other state-of-the-art methods in radiology reporting. The source code will be released upon acceptance.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zijie Lin, Qilin Cai, Liang Shen, Mingjun Xiao
ๆฅ็ๆ่ฆ
Automated paper reproduction has emerged as a promising approach to accelerate scientific research, employing multi-step workflow frameworks to systematically convert academic papers into executable code. However, existing frameworks often lack mechanisms to verify and refine the outputs at each generation step, or rely heavily on manually designed prompts for self-refinement, which limits their adaptability and scalability. To address these limitations, we propose a prompt-free collaborative agent framework that automatically enhances the quality of paper-to-code generation. Our approach employs two collaborative agents: a verification agent that examines whether the outputs at each step satisfy the requirements specified in the corresponding system prompt, and a refinement agent that revises the outputs based on the identified issues. Unlike previous methods that require human experts to craft specific refinement prompts for each step, our framework achieves automatic verification and improvement by leveraging only the original system prompts. We integrate our collaborative agents into the Paper2Code framework and conduct comprehensive experiments on PaperBench Code-Dev and Paper2CodeBench datasets. Experimental results demonstrate that our approach significantly improves the accuracy and completeness of reproduced code, achieving performance gains of approximately 15\% and 13\%, respectively, compared to the baseline without our agents. Furthermore, comparative experiments against Self-Refine validate the robustness and consistency of our prompt-free approach across different datasets.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Shyam prasad reddy Kaitha, Hongrui Yu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Multi-robot task allocation in construction automation has traditionally relied on optimization methods such as Dynamic Programming and Reinforcement Learning. This research introduces the LangGraph-based Task Allocation Agent (LTAA), an LLM-driven framework that integrates phase-adaptive allocation strategies, multi-stage validation with hierarchical retries, and dynamic prompting for efficient robot coordination. Although recent LLM approaches show potential for construction robotics, they largely lack rigorous validation and benchmarking against established algorithms. This paper presents the first systematic comparison of LLM-based task allocation with traditional methods in construction scenarios.The study validates LLM feasibility through SMART-LLM replication and addresses implementation challenges using a Self-Corrective Agent Architecture. LTAA leverages natural-language reasoning combined with structured validation mechanisms, achieving major computational gains reducing token usage by 94.6% and allocation time by 86% through dynamic prompting. The framework adjusts its strategy across phases: emphasizing execution feasibility early and workload balance in later allocations.The authors evaluate LTAA against Dynamic Programming, Q-learning, and Deep Q-Network (DQN) baselines using construction operations from the TEACh human-robot collaboration dataset. In the Heavy Excels setting, where robots have strong task specializations, LTAA achieves 77% task completion with superior workload balance, outperforming all traditional methods. These findings show that LLM-based reasoning with structured validation can match established optimization algorithms while offering additional advantages such as interpretability, adaptability, and the ability to update task logic without retraining.
๐
2025-12-02
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ David Stammer, Hannah Strauss, Peter Knees
ๆฅ็ๆ่ฆ
The rapid rise of AI-generated art has sparked debate about potential biases in how audiences perceive and evaluate such works. This study investigates how composer information and listener characteristics shape the perception of AI-generated music, adopting a mixed-method approach. Using a diverse set of stimuli across various genres from two AI music models, we examine effects of perceived authorship on liking and emotional responses, and explore how attitudes toward AI, personality traits, and music-related variables influence evaluations. We further assess the influence of perceived humanness and analyze open-ended responses to uncover listener criteria for judging AI-generated music. Attitudes toward AI proved to be the best predictor of both liking and emotional intensity of AI-generated music. This quantitative finding was complemented by qualitative themes from our thematic analysis, which identified ethical, cultural, and contextual considerations as important criteria in listeners' evaluations of AI-generated music. Our results offer a nuanced view of how people experience music created by AI tools and point to key factors and methodological considerations for future research on music perception in human-AI interaction.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiahao Zhao, Shuaixing Zhang, Nan Xu, Lei Wang
ๆฅ็ๆ่ฆ
LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning-a structured three-stage process where a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences-providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Anna Rodum Bjรธru, Jacob Lysnรฆs-Larsen, Oskar Jรธrgensen, Inga Strรผmke, Helge Langseth
ๆฅ็ๆ่ฆ
This work presents a conceptual framework for causal concept-based post-hoc Explainable Artificial Intelligence (XAI), based on the requirements that explanations for non-interpretable models should be understandable as well as faithful to the model being explained. Local and global explanations are generated by calculating the probability of sufficiency of concept interventions. Example explanations are presented, generated with a proof-of-concept model made to explain classifiers trained on the CelebA dataset. Understandability is demonstrated through a clear concept-based vocabulary, subject to an implicit causal interpretation. Fidelity is addressed by highlighting important framework assumptions, stressing that the context of explanation interpretation must align with the context of explanation generation.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Przemyslaw Chojecki
ๆฅ็ๆ่ฆ
We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the space of agent representations, this paper formalizes the agent as a flow $ฮฝ_r$ parameterized by computational resource $r$, governed by a recursive Generator-Verifier-Updater (GVU) operator. We prove that this operator generates a vector field on the parameter manifold $ฮ$, and we identify the coefficient of self-improvement $ฮบ$ as the Lie derivative of the capability functional along this flow.
The central contribution of this work is the derivation of the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for the stability of self-improvement. We show that a sufficient condition for $ฮบ> 0$ is that, up to curvature and step-size effects, the combined noise of generation and verification must be small enough.
We then apply this formalism to unify the recent literature on Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping. We demonstrate that architectures such as STaR, SPIN, Reflexion, GANs and AlphaZero are specific topological realizations of the GVU operator that satisfy the Variance Inequality through filtration, adversarial discrimination, or grounding in formal systems.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Zhou, Takehiko Ohkawa, Guwenxiao Zhou, Kanoko Goto, Takumi Hirose et al.
Computer Vision
ๆฅ็ๆ่ฆ
Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such CNN's inductive bias may not be optimal for 3D HPE due to its limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Md Abdul Kadir, Sai Suresh Macharla Vasu, Sidharth S. Nair, Daniel Sonntag
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Auditors rely on Journal Entry Tests (JETs) to detect anomalies in tax-related ledger records, but rule-based methods generate overwhelming false positives and struggle with subtle irregularities. We investigate whether large language models (LLMs) can serve as anomaly detectors in double-entry bookkeeping. Benchmarking SoTA LLMs such as LLaMA and Gemma on both synthetic and real-world anonymized ledgers, we compare them against JETs and machine learning baselines. Our results show that LLMs consistently outperform traditional rule-based JETs and classical ML baselines, while also providing natural-language explanations that enhance interpretability. These results highlight the potential of \textbf{AI-augmented auditing}, where human auditors collaborate with foundation models to strengthen financial integrity.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ He Wang, Wenyilin Xiao, Songqiao Han, Hailiang Huang
Natural Language Processing
ๆฅ็ๆ่ฆ
Stock price prediction is challenging due to market volatility and its sensitivity to real-time events. While large language models (LLMs) offer new avenues for text-based forecasting, their application in finance is hindered by noisy news data and the lack of explicit answers in text. General-purpose memory architectures struggle to identify the key drivers of price movements. To address this, we propose StockMem, an event-reflection dual-layer memory framework. It structures news into events and mines them along two dimensions: horizontal consolidation integrates daily events, while longitudinal tracking captures event evolution to extract incremental information reflecting market expectation discrepancies. This builds a temporal event knowledge base. By analyzing event-price dynamics, the framework further forms a reflection knowledge base of causal experiences. For prediction, it retrieves analogous historical scenarios and reasons with current events, incremental data, and past experiences. Experiments show StockMem outperforms existing memory architectures and provides superior, explainable reasoning by tracing the information chain affecting prices, enhancing decision transparency in financial forecasting.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Julian Ma, Jun Wang, Zafeirios Fountas
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tianyi Zhang, Xiangyuan Xue, Lingyan Ruan, Shiya Fu, Feng Xia et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Mental health conditions affect hundreds of millions globally, yet early detection remains limited. While large language models (LLMs) have shown promise in mental health applications, their size and computational demands hinder practical deployment. Small language models (SLMs) offer a lightweight alternative, but their use for social media--based mental health prediction remains largely underexplored. In this study, we introduce Menta, the first optimized SLM fine-tuned specifically for multi-task mental health prediction from social media data. Menta is jointly trained across six classification tasks using a LoRA-based framework, a cross-dataset strategy, and a balanced accuracy--oriented loss. Evaluated against nine state-of-the-art SLM baselines, Menta achieves an average improvement of 15.2\% across tasks covering depression, stress, and suicidality compared with the best-performing non--fine-tuned SLMs. It also achieves higher accuracy on depression and stress classification tasks compared to 13B-parameter LLMs, while being approximately 3.25x smaller. Moreover, we demonstrate real-time, on-device deployment of Menta on an iPhone 15 Pro Max, requiring only approximately 3GB RAM. Supported by a comprehensive benchmark against existing SLMs and LLMs, Menta highlights the potential for scalable, privacy-preserving mental health monitoring. Code is available at: https://xxue752-nz.github.io/menta-project/
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Theodoros Aivalis, Iraklis A. Klampanos, Antonis Troumpoukis, Joemon M. Jose
MultimodalComputer VisionNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
As generative models become powerful, concerns around transparency, accountability, and copyright violations have intensified. Understanding how specific training data contributes to a model's output is critical. We introduce a framework for interpreting generative outputs through the automatic construction of ontologyaligned knowledge graphs (KGs). While automatic KG construction from natural text has advanced, extracting structured and ontology-consistent representations from visual content remains challenging -- due to the richness and multi-object nature of images. Leveraging multimodal large language models (LLMs), our method extracts structured triples from images, aligned with a domain-specific ontology. By comparing the KGs of generated and training images, we can trace potential influences, enabling copyright analysis, dataset transparency, and interpretable AI. We validate our method through experiments on locally trained models via unlearning, and on large-scale models through a style-specific experiment. Our framework supports the development of AI systems that foster human collaboration, creativity and stimulate curiosity.
๐
2025-12-02
๐ ArXiv Preprint (cs.SE)
๐ฅ Thomas Georges, Marianne Huchard, Mรฉlanie Kรถnig, Clรฉmentine Nebut, Chouki Tibermacine
ๆฅ็ๆ่ฆ
Migrating a set of software variants into a software product line (SPL) is an expensive and potentially challenging endeavor. Indeed, SPL engineering can significantly impact a company's development process and often requires changes to established developer practices. The work presented in this paper stems from a collaboration with a Small and Medium-sized Enterprise (SME) that decided to migrate its existing code base into an SPL. In this study, we conducted an in-depth evaluation of the company's current development processes and practices, as well as the anticipated benefits and risks associated with the migration. Key stakeholders involved in software development participated in this evaluation to provide insight into their perceptions of the migration and their potential resistance to change. This paper describes the design of the interviews conducted with these stakeholders and presents an analysis of the results. Among the qualitative findings, we observed that all participants, regardless of their role in the development process, identified benefits of the migration relevant to their own activities. Furthermore, our results suggest that an effective risk mitigation strategy involves keeping stakeholders informed and engaged throughout the process, preserving as many good practices as possible, and actively involving them in the migration to ensure a smooth transition and minimize potential challenges.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hyeongseop Rha, Jeong Hun Yeo, Junil Won, Se Jin Park, Yong Man Ro
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In this paper, we present Modality-Importance-Guided Reasoning (MIGR), a framework designed to improve the reliability of reasoning-based multimodal emotion understanding in multimodal large language models. Although existing methods have advanced emotion understanding, they often suffer from reasoning drift: models gradually rely on their own generated text instead of multimodal evidence, and their explanations are overly shaped by visually initiated reasoning paths. To address these issues, we introduce Modality Importance (MI), a simple yet effective mechanism for identifying the emotion-dominant modality. Using MI, MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion, preventing early reasoning from being misled by less informative cues. Our two-stage framework-comprising modality-aligned supervised fine-tuning and modality-aware reward optimization-encourages models to generate emotionally grounded, causally relevant, and coherence-preserving explanations. Experimental results on the DFEW benchmark show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations from 18.10% to 7.37%. These results confirm the benefit of initiating reasoning from the emotion-dominant modality.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Daiki Shirafuji, Tatsuhiko Saito, Yasutomo Kimura
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent work has explored ``editing'' LLM parameters to mitigate social bias with model merging approaches; however, there is no empirical comparison. In this work, we empirically survey seven algorithms: Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, applying 13 open weight models in the GPT, LLaMA, and Qwen families. We perform a comprehensive evaluation using three bias datasets (BBQ, BOLD, and HONEST) and measure the impact of these techniques on LLM performance in downstream tasks of the SuperGLUE benchmark. We find a trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, particularly on tasks requiring reading comprehension and commonsense and causal reasoning. Among the merging algorithms, Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights emerging as the most balanced choice. These results highlight the potential of model merging algorithms for bias mitigation, while indicating that excessive debiasing or inappropriate merging methods may lead to the degradation of important linguistic abilities.
๐
2025-12-02
๐ ArXiv Preprint (cs.MA)
๐ฅ Piercosma Bisconti, Marcello Galisai, Federico Pierucci, Marcantonio Bracale, Matteo Prandi
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper examines why safety mechanisms designed for human-model interaction do not scale to environments where large language models (LLMs) interact with each other. Most current governance practices still rely on single-agent safety containment, prompts, fine-tuning, and moderation layers that constrain individual model behavior but leave the dynamics of multi-model interaction ungoverned. These mechanisms assume a dyadic setting: one model responding to one user under stable oversight. Yet research and industrial development are rapidly shifting toward LLM-to-LLM ecosystems, where outputs are recursively reused as inputs across chains of agents. In such systems, local compliance can aggregate into collective failure even when every model is individually aligned. We propose a conceptual transition from model-level safety to system-level safety, introducing the framework of the Emergent Systemic Risk Horizon (ESRH) to formalize how instability arises from interaction structure rather than from isolated misbehavior. The paper contributes (i) a theoretical account of collective risk in interacting LLMs, (ii) a taxonomy connecting micro, meso, and macro-level failure modes, and (iii) a design proposal for InstitutionalAI, an architecture for embedding adaptive oversight within multi-agent systems.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhiyuan He
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models have demonstrated remarkable capabilities across many tasks, yet face significant challenges when dealing with recursive reasoning problems, those requiring the resolution of nested hierarchical structures. While prior research has extensively studied length generalization (a model's ability to handle longer sequences than seen during training), we investigate a distinct and underexplored limitation: depth generalization. Here, depth refers to the number of nested levels in a hierarchical problem, such as the layers of parentheses in a mathematical expression or the nesting of logical clauses in a Boolean formula. Our work reveals that standard transformer architectures struggle with problems involving deeper recursion than encountered during training, even when they perform well on longer but non-nested sequences. This limitation stems from their inability to maintain stack-like behavior, the capacity to track and resolve multiple levels of nested dependencies. Through systematic analysis, we demonstrate how this architectural constraint leads to rapid performance decay as the depth of the recursion increases. To address this challenge, we develop a novel looped locate-and-replace pipeline that decomposes recursive problems into manageable subcomponents. The approach employs two specialized models: a locator that identifies solvable subexpressions and a replacer that evaluates these components while preserving the overall structure. We evaluated this method in three carefully designed domains: Boolean algebra, recursive arithmetic, and propositional logic, each with a controllable depth of recursion. We show that our method effectively alleviates the performance decay when tested on out-of-distribution recursion depth.
๐
2025-12-02
๐ ArXiv Preprint (cs.SD)
๐ฅ Gauri Deshpande, Harish Battula, Ashish Panda, Sunil Kumar Kopparapu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a unified study of four distinct modeling approaches for classifying dysarthria severity in the Speech Analysis for Neurodegenerative Diseases (SAND) challenge. All models tackle the same five class classification task using a common dataset of speech recordings. We investigate: (1) a ViT-OF method leveraging a Vision Transformer on spectrogram images, (2) a 1D-CNN approach using eight 1-D CNN's with majority-vote fusion, (3) a BiLSTM-OF approach using nine BiLSTM models with majority vote fusion, and (4) a Hierarchical XGBoost ensemble that combines glottal and formant features through a two stage learning framework. Each method is described, and their performances on a validation set of 53 speakers are compared. Results show that while the feature-engineered XGBoost ensemble achieves the highest macro-F1 (0.86), the deep learning models (ViT, CNN, BiLSTM) attain competitive F1-scores (0.70) and offer complementary insights into the problem.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haozhuo Zheng, Cheng Wang, Yang Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
The de novo generation of molecules with desirable properties is a critical challenge, where diffusion models are computationally intensive and autoregressive models struggle with error propagation. In this work, we introduce the Graph VQ-Transformer (GVT), a two-stage generative framework that achieves both high accuracy and efficiency. The core of our approach is a novel Graph Vector Quantized Variational Autoencoder (VQ-VAE) that compresses molecular graphs into high-fidelity discrete latent sequences. By synergistically combining a Graph Transformer with canonical Reverse Cuthill-McKee (RCM) node ordering and Rotary Positional Embeddings (RoPE), our VQ-VAE achieves near-perfect reconstruction rates. An autoregressive Transformer is then trained on these discrete latents, effectively converting graph generation into a well-structured sequence modeling problem. Crucially, this mapping of complex graphs to high-fidelity discrete sequences bridges molecular design with the powerful paradigm of large-scale sequence modeling, unlocking potential synergies with Large Language Models (LLMs). Extensive experiments show that GVT achieves state-of-the-art or highly competitive performance across major benchmarks like ZINC250k, MOSES, and GuacaMol, and notably outperforms leading diffusion models on key distribution similarity metrics such as FCD and KL Divergence. With its superior performance, efficiency, and architectural novelty, GVT not only presents a compelling alternative to diffusion models but also establishes a strong new baseline for the field, paving the way for future research in discrete latent-space molecular generation.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Naveen George, Naoki Murata, Yuhta Takida, Konda Reddy Mopuri, Yuki Mitsufuji
Machine Learning
ๆฅ็ๆ่ฆ
The recent rapid growth of visual generative models trained on vast web-scale datasets has created significant tension with data privacy regulations and copyright laws, such as GDPR's ``Right to be Forgotten.'' This necessitates machine unlearning (MU) to remove specific concepts without the prohibitive cost of retraining. However, existing MU techniques are fundamentally ill-equipped for real-world scenarios where deletion requests arrive sequentially, a setting known as continual unlearning (CUL). Naively applying one-shot methods in a continual setting triggers a stability crisis, leading to a cascade of degradation characterized by retention collapse, compounding collateral damage to related concepts, and a sharp decline in generative quality. To address this critical challenge, we introduce a novel generative distillation based continual unlearning framework that ensures targeted and stable unlearning under sequences of deletion requests. By reframing each unlearning step as a multi-objective, teacher-student distillation process, the framework leverages principles from continual learning to maintain model integrity. Experiments on a 10-step sequential benchmark demonstrate that our method unlearns forget concepts with better fidelity and achieves this without significant interference to the performance on retain concepts or the overall image quality, substantially outperforming baselines. This framework provides a viable pathway for the responsible deployment and maintenance of large-scale generative models, enabling industries to comply with ongoing data removal requests in a practical and effective manner.
๐
2025-12-02
๐ ArXiv Preprint (cs.SD)
๐ฅ Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, Lin-Han Jia, Lan-Zhe Guo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we introduce Pianist Transformer, with four key contributions: 1) a unified Musical Instrument Digital Interface (MIDI) data representation for learning the shared principles of musical structure and expression without explicit annotation; 2) an efficient asymmetric architecture, enabling longer contexts and faster inference without sacrificing rendering quality; 3) a self-supervised pre-training pipeline with 10B tokens and 135M-parameter model, unlocking data and model scaling advantages for expressive performance rendering; 4) a state-of-the-art performance model, which achieves strong objective metrics and human-level subjective ratings. Overall, Pianist Transformer establishes a scalable path toward human-like performance synthesis in the music domain.
๐
2025-12-02
๐ ICML 2025
๐ฅ Mattia Giuri, Mathias Jackermeier, Alessandro Abate
Machine Learning
ๆฅ็ๆ่ฆ
Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time. However, existing approaches fall short in environments where multiple high-level events (i.e., atomic propositions) can be true at the same time and potentially interact in complicated ways. In this work, we propose a novel approach to learning a multi-task policy for following arbitrary LTL instructions that addresses this shortcoming. Our method conditions the policy on sequences of simple Boolean formulae, which directly align with transitions in the automaton, and are encoded via a graph neural network (GNN) to yield structured task representations. Experiments in a complex chess-based environment demonstrate the advantages of our approach.
๐
2025-12-02
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Mayar Elfares, Pascal Reisert, Tilman Dietz, Manpa Barman, Ahmed Zaki et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at many general-purpose natural language processing tasks. However, their ability to perform deep reasoning and mathematical analysis, particularly for complex tasks as required in cryptography, remains poorly understood, largely due to the lack of suitable data for evaluation and training. To address this gap, we present CryptoQA, the first large-scale question-answering (QA) dataset specifically designed for cryptography. CryptoQA contains over two million QA pairs drawn from curated academic sources, along with contextual metadata that can be used to test the cryptographic capabilities of LLMs and to train new LLMs on cryptographic tasks. We benchmark 15 state-of-the-art LLMs on CryptoQA, evaluating their factual accuracy, mathematical reasoning, consistency, referencing, backward reasoning, and robustness to adversarial samples. In addition to quantitative metrics, we provide expert reviews that qualitatively assess model outputs and establish a gold-standard baseline. Our results reveal significant performance deficits of LLMs, particularly on tasks that require formal reasoning and precise mathematical knowledge. This shows the urgent need for LLM assistants tailored to cryptography research and development. We demonstrate that, by using CryptoQA, LLMs can be fine-tuned to exhibit better performance on cryptographic tasks.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yubo Hou, Mohamed Ragab, Min Wu, Chee-Keong Kwoh, Xiaoli Li et al.
ๆฅ็ๆ่ฆ
Accurate prediction of the Remaining Useful Life (RUL) in machinery can significantly diminish maintenance costs, enhance equipment up-time, and mitigate adverse outcomes. Data-driven RUL prediction techniques have demonstrated commendable performance. However, their efficacy often relies on the assumption that training and testing data are drawn from the same distribution or domain, which does not hold in real industrial settings. To mitigate this domain discrepancy issue, prior adversarial domain adaptation methods focused on deriving domain-invariant features. Nevertheless, they overlook target-specific information and inconsistency characteristics pertinent to the degradation stages, resulting in suboptimal performance. To tackle these issues, we propose a novel domain adaptation approach for cross-domain RUL prediction named TACDA. Specifically, we propose a target domain reconstruction strategy within the adversarial adaptation process, thereby retaining target-specific information while learning domain-invariant features. Furthermore, we develop a novel clustering and pairing strategy for consistent alignment between similar degradation stages. Through extensive experiments, our results demonstrate the remarkable performance of our proposed TACDA method, surpassing state-of-the-art approaches with regard to two different evaluation metrics. Our code is available at https://github.com/keyplay/TACDA.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Pengju Lu
Robotics
ๆฅ็ๆ่ฆ
This technical white paper introduces the Interactive Agents Call Tree (IACT), a computational model designed to address the limitations of static, hard-coded agent workflows. Unlike traditional systems that require pre-defined graphs or specialized programming, IACT operates as a general-purpose autonomous system driven purely by user dialogue. Given a high-level objective, the system autonomously grows a dynamic, recursive agent topology incrementally tailored to the problem's structure. This allows it to scale its organizational complexity to match open-ended tasks. To mitigate the error propagation inherent in unidirectional function calls, IACT introduces interactional redundancy by replacing rigid invocations with bidirectional, stateful dialogues. This mechanism enables runtime error correction and ambiguity resolution. We describe the architecture, design principles, and practical lessons behind the production deployment of this model in the kragent.ai system, presenting qualitative evidence from real-world workflows rather than exhaustive benchmark results.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, Bingsheng He
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors such as Overleaf. We present PaperDebugger, an in-editor, multi-agent, and plugin-based academic writing assistant that brings LLM-driven reasoning directly into the writing environment. Enabling such in-editor interaction is technically non-trivial: it requires reliable bidirectional synchronization with the editor, fine-grained version control and patching, secure state management, multi-agent scheduling, and extensible communication with external tools. PaperDebugger addresses these challenges through a Chrome-approved extension, a Kubernetes-native orchestration layer, and a Model Context Protocol (MCP) toolchain that integrates literature search, reference lookup, document scoring, and revision pipelines. Our demo showcases a fully integrated workflow, including localized edits, structured reviews, parallel agent execution, and diff-based updates, encapsulated within a minimal-intrusion user interface (UI). Early aggregated analytics demonstrate active user engagement and validate the practicality of an editor-native, agentic writing assistant. More details about this demo and video could be found at https://github.com/PaperDebugger/PaperDebugger.
๐
2025-12-02
๐ ArXiv Preprint (cs.SE)
๐ฅ Martin Weiss, Jesko Hecking-Harbusch, Jochen Quante, Matthias Woehrle
Natural Language Processing
ๆฅ็ๆ่ฆ
The advent of strong generative AI has a considerable impact on various software engineering tasks such as code repair, test generation, or language translation. While tools like GitHub Copilot are already in widespread use in interactive settings, automated approaches require a higher level of reliability before being usable in industrial practice. In this paper, we focus on three aspects that directly influence the quality of the results: a) the effect of automated feedback loops, b) the choice of Large Language Model (LLM), and c) the influence of behavior-preserving code changes.
We study the effect of these three variables on an automated C-to-Rust translation system. Code translation from C to Rust is an attractive use case in industry due to Rust's safety guarantees. The translation system is based on a generate-and-check pattern, in which Rust code generated by the LLM is automatically checked for compilability and behavioral equivalence with the original C code. For negative checking results, the LLM is re-prompted in a feedback loop to repair its output. These checks also allow us to evaluate and compare the respective success rates of the translation system when varying the three variables.
Our results show that without feedback loops LLM selection has a large effect on translation success. However, when the translation system uses feedback loops the differences across models diminish. We observe this for the average performance of the system as well as its robustness under code perturbations. Finally, we also identify that diversity provided by code perturbations can even result in improved system performance.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.
๐
2025-12-02
๐ SIGIR 2025
๐ฅ Jinming Yang, Zimu Ji, Weiqi Luo, Gaoxi Wang, Bin Ma et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
With the rapid development of the online education and large language model, the existing educational tools still suffer from incomplete service, insufficient performance and weak interactivity in terms of courseware generation, interactive notes and quality assurance of content. In particular, the proposed generative agent EZYer : 1) Teacher Module: Integrating the Text Corpus retrieval and in-depth generation technologies, it automatically generates structured teaching materials and LaTeX Beamer courseware in line with the high school mathematics syllabus and supports user-defined image insertion. 2) Student Module: Throughout the collaborative interaction of the four roles of Teacher, Assistant, Top Student and Struggling Student, Note Taker summarizes and generates academic notes to enhance the depth and interest of learning. 3) Controller: set up keyword filtering system, content scoring system, role co-validation system, and dynamic content correction system. This ensure academic strictness and pedagogical propriety of EZYer inputs and outputs. In order to evaluate EZYer, this paper designs five-dimensional evaluation indexes of content accuracy, knowledge coverage, usability, formatting correctness and visual design and appeal, and scores 100 Beamer and Notes generated by EZYer by five large language models, separately, and the results show that the quality of EZYer-generated content is excellent and has a good application prospect.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yufei Xiao, Shangfei Wang
ๆฅ็ๆ่ฆ
Prevalent empathy prediction techniques primarily concentrate on a singular modality, typically textual, thus neglecting multi-modal processing capabilities. They also overlook the utilization of certain privileged information, which may encompass additional empathetic content. In response, we introduce an advanced multi-modal empathy prediction method integrating video, audio, and text information. The method comprises the Multi-Modal Empathy Prediction and Supervisory Documentation Assisted Training. We use pre-trained networks in the empathy prediction network to extract features from various modalities, followed by a cross-modal fusion. This process yields a multi-modal feature representation, which is employed to predict empathy labels. To enhance the extraction of text features, we incorporate supervisory documents as privileged information during the assisted training phase. Specifically, we apply the Latent Dirichlet Allocation model to identify potential topic distributions to constrain text features. These supervisory documents, created by supervisors, focus on the counseling topics and the counselor's display of empathy. Notably, this privileged information is only available during training and is not accessible during the prediction phase. Experimental results on the multi-modal and dialogue empathy datasets demonstrate that our approach is superior to the existing methods.
๐
2025-12-02
๐ SIGIR 2025
๐ฅ Zheng Fang, Donghao Xie, Ming Pang, Chunyuan Yuan, Xue Jiang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and neural models' reliance on the scarcity of domain-specific hard samples. We propose ADORE, a self-sustaining framework that synergizes three innovations: (1) A Rule-aware Relevance Discrimination module, where a Chain-of-Thought LLM generates intent-aligned training data, refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) An Error-type-aware Data Synthesis module that auto-generates adversarial examples to harden robustness; and (3) A Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. ADORE automates annotation, adversarial generation, and distillation, overcoming data scarcity while enhancing reasoning. Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used {\it torch.matmul} to state-of-the-art Nvidia's closed-source libraries, i.e., {\it cuBLAS}, {\it cuBLASLt}. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0\% over {\it torch.matmul} on average; +19.2\% over {\it cuBLAS} using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8\% over {\it cuBLASLt-heuristic}, which queries {\it cuBLASLt} library and selects the algorithm based on the heuristic's suggestion; and +11.4\% over the most competitive {\it cuBLASLt-AutoTuning} model, which selects the fastest algorithm from up to 100 candidates from {\it cuBLASLt}'s suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7\%, +26.0\%, +22.4\%, and +15.9\% for {\it torch.matmul}, {\it cuBLAS}, {\it cuBLASLt-heuristic}, and {\it cuBLASLt-AutoTuning} respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
๐
2025-12-02
๐ ArXiv Preprint (cs.CE)
๐ฅ Ioanna Tasou, Panagiotis Mpakos, Angelos Vlachos, Dionysios Adamopoulos, Georgios Giannakopoulos et al.
Machine Learning
ๆฅ็ๆ่ฆ
The computational demands of modern Deep Neural Networks (DNNs) are immense and constantly growing. While training costs usually capture public attention, inference demands are also contributing in significant computational, energy and environmental footprints. Sparsity stands out as a critical mechanism for drastically reducing these resource demands. However, its potential remains largely untapped and is not yet fully incorporated in production AI systems. To bridge this gap, this work provides the necessary knowledge and insights for performance engineers keen to get involved in deep learning inference optimization. In particular, in this work we: a) discuss the various forms of sparsity that can be utilized in DNN inference, b) explain how the original dense computations translate to sparse kernels, c) provide an extensive bibliographic review of the state-of-the-art in the implementation of these kernels for CPUs and GPUs, d) discuss the availability of sparse datasets in support of sparsity-related research and development, e) explore the current software tools and frameworks that provide robust sparsity support, and f) present evaluation results of different implementations of the key SpMM and SDDMM kernels on CPU and GPU platforms. Ultimately, this paper aims to serve as a resource for performance engineers seeking to develop and deploy highly efficient sparse deep learning models in productions.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuxiang He, Jian Zhao, Yuchen Yuan, Tianle Zhang, Wei Cai et al.
Multimodal
ๆฅ็ๆ่ฆ
The exponential growth of digital content presents significant challenges for content safety. Current moderation systems, often based on single models or fixed pipelines, exhibit limitations in identifying implicit risks and providing interpretable judgment processes. To address these issues, we propose Aetheria, a multimodal interpretable content safety framework based on multi-agent debate and collaboration.Employing a collaborative architecture of five core agents, Aetheria conducts in-depth analysis and adjudication of multimodal content through a dynamic, mutually persuasive debate mechanism, which is grounded by RAG-based knowledge retrieval.Comprehensive experiments on our proposed benchmark (AIR-Bench) validate that Aetheria not only generates detailed and traceable audit reports but also demonstrates significant advantages over baselines in overall content safety accuracy, especially in the identification of implicit risks. This framework establishes a transparent and interpretable paradigm, significantly advancing the field of trustworthy AI content moderation.
๐
2025-12-02
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Luyao Niu, Zhicheng Deng, Boyang Li, Nuoxian Huang, Ruiqi Liu et al.
ๆฅ็ๆ่ฆ
The "15-minute city" envisions neighborhoods where residents can meet daily needs via a short walk or bike ride. Realizing this vision requires not only physical proximity but also efficient and reliable access to information about nearby places, services, and events. Existing location-based systems, however, focus mainly on city-level tasks and neglect the spatial, temporal, and cognitive factors that shape localized decision-making. We conceptualize this gap as the Local Life Information Accessibility (LLIA) problem and introduce AskNearby, an AI-driven community application that unifies retrieval and recommendation within the 15-minute life circle. AskNearby integrates (i) a three-layer Retrieval-Augmented Generation (RAG) pipeline that synergizes graph-based, semantic-vector, and geographic retrieval with (ii) a cognitive-map model that encodes each user's neighborhood familiarity and preferences. Experiments on real-world community datasets demonstrate that AskNearby significantly outperforms LLM-based and map-based baselines in retrieval accuracy and recommendation quality, achieving robust performance in spatiotemporal grounding and cognitive-aware ranking. Real-world deployments further validate its effectiveness. By addressing the LLIA challenge, AskNearby empowers residents to more effectively discover local resources, plan daily activities, and engage in community life.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yongkai Liu, Helena Feng, Bin Jiang, Yixin Wang, Max Wintermark et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Predicting outcomes in acute ischemic stroke (AIS) guides clinical decision-making, patient counseling, and resource allocation. Clinical notes contain rich contextual information, but their unstructured nature limits their use in traditional predictive models. We developed and evaluated the Chain-of-Thought (CoT) Outcome Prediction Engine (COPE), a reasoning-enhanced large language model framework, for predicting 90-day functional outcomes after AIS from unstructured clinical notes. This study included 464 AIS patients with discharge summaries and 90-day modified Rankin Scale (mRS) scores. COPE uses a two-step CoT framework based on sequential open-source LLaMA-3-8B models: the first generates clinical reasoning, and the second outputs an mRS prediction. We compared COPE with GPT-4.1, ClinicalBERT, a structured variable-based machine learning model (Clinical ML), and a single-step LLM without CoT. Performance was evaluated using mean absolute error (MAE), accuracy within +/-1 mRS point, and exact accuracy. COPE achieved an MAE of 1.01 (95% CI 0.92-1.11), +/-1 accuracy of 74.4% (69.9, 78.8%), and exact accuracy of 32.8% (28.0, 37.6%), comparable to GPT-4.1 and superior to ClinicalBERT [MAE 1.24 (1.13-1.36)], Clinical ML [1.28 (1.18-1.39)], and the single-step LLM [1.20 (1.09-1.33)]. Subgroup analyses showed consistent performance across sex and age, with slightly higher error among older patients, those undergoing thrombectomy, and those with longer summaries. These findings demonstrate that COPE, a lightweight, interpretable, and privacy-preserving open-source framework, provides an accurate and practical solution for outcome prediction from unstructured clinical text.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yerim Jeon, Miso Lee, WonJun Moon, Jae-Pil Heo
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user's task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Xiaofan Zhang, Qi Dou
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost, the evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real-world clinical deployment. Code is available at https://github.com/fqhank/UCAgents.
๐
2025-12-02
๐ KDD 2026
๐ฅ Haofeng Huang, Ling Gai
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Sequential recommendation plays a critical role in modern online platforms such as e-commerce, advertising, and content streaming, where accurately predicting users' next interactions is essential for personalization. Recent Transformer-based methods like BERT4Rec have shown strong modeling capability, yet they still rely on discrete item IDs that lack semantic meaning and ignore rich multimodal information (e.g., text and image). This leads to weak generalization and limited interpretability. To address these challenges, we propose Q-Bert4Rec, a multimodal sequential recommendation framework that unifies semantic representation and quantized modeling. Specifically, Q-Bert4Rec consists of three stages: (1) cross-modal semantic injection, which enriches randomly initialized ID embeddings through a dynamic transformer that fuses textual, visual, and structural features; (2) semantic quantization, which discretizes fused representations into meaningful tokens via residual vector quantization; and (3) multi-mask pretraining and fine-tuning, which leverage diverse masking strategies -- span, tail, and multi-region -- to improve sequential understanding. We validate our model on public Amazon benchmarks and demonstrate that Q-Bert4Rec significantly outperforms many strong existing methods, confirming the effectiveness of semantic tokenization for multimodal sequential recommendation. Our source code will be publicly available on GitHub after publishing.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang et al.
Robotics
ๆฅ็ๆ่ฆ
AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
๐
2025-12-02
๐ ArXiv Preprint (q-bio.GN)
๐ฅ Ping Xu, Zaitian Wang, Zhirui Wang, Pengjiang Li, Jiajia Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Cell clustering is crucial for uncovering cellular heterogeneity in single-cell RNA sequencing (scRNA-seq) data by identifying cell types and marker genes. Despite its importance, benchmarks for scRNA-seq clustering methods remain fragmented, often lacking standardized protocols and failing to incorporate recent advances in artificial intelligence. To fill these gaps, we present scCluBench, a comprehensive benchmark of clustering algorithms for scRNA-seq data. First, scCluBench provides 36 scRNA-seq datasets collected from diverse public sources, covering multiple tissues, which are uniformly processed and standardized to ensure consistency for systematic evaluation and downstream analyses. To evaluate performance, we collect and reproduce a range of scRNA-seq clustering methods, including traditional, deep learning-based, graph-based, and biological foundation models. We comprehensively evaluate each method both quantitatively and qualitatively, using core performance metrics as well as visualization analyses. Furthermore, we construct representative downstream biological tasks, such as marker gene identification and cell type annotation, to further assess the practical utility. scCluBench then investigates the performance differences and applicability boundaries of various clustering models across diverse analytical tasks, systematically assessing their robustness and scalability in real-world scenarios. Overall, scCluBench offers a standardized and user-friendly benchmark for scRNA-seq clustering, with curated datasets, unified evaluation protocols, and transparent analyses, facilitating informed method selection and providing valuable insights into model generalizability and application scope.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xingwang Li, Mengyun Chen, Jiamou Liu, Sijie Wang, Shuanggen Jin et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In the face of accelerating global urbanization and the increasing frequency of extreme weather events, highresolution urban rainfall monitoring is crucial for building resilient smart cities. Commercial Microwave Links (CMLs) are an emerging data source with great potential for this task.While traditional rainfall retrieval from CMLs relies on physicsbased models, these often struggle with real-world complexities like signal noise and nonlinear attenuation. To address these limitations, this paper proposes a novel hybrid deep learning architecture based on the Transformer and a Bidirectional Gated Recurrent Unit (BiGRU), which we name TabGRU. This design synergistically captures both long-term dependencies and local sequential features in the CML signal data. The model is further enhanced by a learnable positional embedding and an attention pooling mechanism to improve its dynamic feature extraction and generalization capabilities. The model was validated on a public benchmark dataset from Gothenburg, Sweden (June-September 2015). The evaluation used 12 sub-links from two rain gauges (Torp and Barl) over a test period (August 22-31) covering approximately 10 distinct rainfall events. The proposed TabGRU model demonstrated consistent advantages, outperforming deep learning baselines and achieving high coefficients of determination (R2) at both the Torp site (0.91) and the Barl site (0.96). Furthermore, compared to the physics-based approach, TabGRU maintained higher accuracy and was particularly effective in mitigating the significant overestimation problem observed in the PL model during peak rainfall events. This evaluation confirms that the TabGRU model can effectively overcome the limitations of traditional methods, providing a robust and accurate solution for CML-based urban rainfall monitoring under the tested conditions.
๐
2025-12-02
๐ NeurIPS 2025
๐ฅ Valentin Bieri, Marie-Julie Rakotosaona, Keisuke Tateno, Francis Engelmann, Leonidas Guibas
ๆฅ็ๆ่ฆ
Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single room or single floor environments. As a consequence, they cannot natively handle large multi floor buildings and require scenes to be split into individual floors before processing, which removes global spatial context that is essential for reasoning about structures such as staircases that connect multiple levels. In this work, we introduce HouseLayout3D, a real world benchmark designed to support progress toward full building scale layout estimation, including multiple floors and architecturally intricate spaces. We also present MultiFloor3D, a simple training free baseline that leverages recent scene understanding methods and already outperforms existing 3D layout estimation models on both our benchmark and prior datasets, highlighting the need for further research in this direction. Data and code are available at: https://houselayout3d.github.io.
๐
2025-12-02
๐ AAAI 2026
๐ฅ Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz
Natural Language Processing
ๆฅ็ๆ่ฆ
Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50\% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from $\sim$5\% to $\sim$40\% while Grok 4 Fast decreases from $\sim$80\% to $\sim$10\% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Phuc Pham, Nhu Pham, Ngoc Quoc Ly
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In medical healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in the few-shot adaption, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods. The implementation of our method is available at https://github.com/phphuc612/MSD .
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daeyoung Kim
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
As a representative optic degenerative condition, glaucoma has been a threat to millions due to its irreversibility and severe impact on human vision fields. Mainly characterized by dimmed and blurred visions, or peripheral vision loss, glaucoma is well known to occur due to damages in the optic nerve from increased intraocular pressure (IOP) or neovascularization within the retina. Traditionally, most glaucoma related works and clinical diagnosis focused on detecting these damages in the optic nerve by using patient data from perimetry tests, optic papilla inspections and tonometer-based IOP measurements. Recently, with advancements in computer vision AI models, such as VGG16 or Vision Transformers (ViT), AI-automatized glaucoma detection and optic cup segmentation based on retinal fundus images or OCT recently exhibited significant performance in aiding conventional diagnosis with high performance. However, current AI-driven glaucoma detection approaches still have significant room for improvement in terms of reliability, excessive parameter usage, possibility of spurious correlation within detection, and limitations in applications to intervention analysis or clinical simulations. Thus, this research introduced a novel causal representation driven glaucoma detection model: LightHCG, an extremely lightweight Convolutional VAE-based latent glaucoma representation model that can consider the true causality among glaucoma-related physical factors within the optic nerve region. Using HSIC-based latent space disentanglement and Graph Autoencoder based unsupervised causal representation learning, LightHCG not only exhibits higher performance in classifying glaucoma with 93~99% less weights, but also enhances the possibility of AI-driven intervention analysis, compared to existing advanced vision models such as InceptionV3, MobileNetV2 or VGG16.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Agostino Capponi, Alfio Gliozzo, Brian Zhu
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Prediction markets allow users to trade on outcomes of real-world events, but are prone to fragmentation through overlapping questions, implicit equivalences, and hidden contradictions across markets. We present an agentic AI pipeline that autonomously (i) clusters markets into coherent topical groups using natural-language understanding over contract text and metadata, and (ii) identifies within-cluster market pairs whose resolved outcomes exhibit strong dependence, including same-outcome (correlated) and different-outcome (anti-correlated) relationships. Using a historical dataset of resolved markets on Polymarket, we evaluate the accuracy of the agent's relational predictions. We then translate discovered relationships into a simple trading strategy to quantify how these relationships map to actionable signals. Results show that agent-identified relationships achieve roughly 60-70% accuracy, and their induced trading strategies earn about 20% average returns over week-long horizons, highlighting the ability of agentic AI and large language models to uncover latent semantic structure in prediction markets.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
๐
2025-12-02
๐ ArXiv Preprint (quant-ph)
๐ฅ Tommaso Fioravanti, Brian Quanz, Gabriele Agliardi, Edgar Andres Ruiz Guzman, Ginรฉs Carrascal et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Quantum Machine Learning (QML) holds the promise of enhancing machine learning modeling in terms of both complexity and accuracy. A key challenge in this domain is the encoding of input data, which plays a pivotal role in determining the performance of QML models. In this work, we tackle a largely unaddressed aspect of encoding that is unique to QML modeling -- rather than adjusting the ansatz used for encoding, we consider adjusting how data is conveyed to the ansatz. We specifically implement QML pipelines that leverage classical data manipulation (i.e., ordering, selecting, and weighting features) as a preprocessing step, and evaluate if these aspects of encoding can have a significant impact on QML model performance, and if they can be effectively optimized to improve performance. Our experimental results, applied across a wide variety of data sets, ansatz, and circuit sizes, with a representative QML approach, demonstrate that by optimizing how features are encoded in an ansatz we can substantially and consistently improve the performance of QML models, making a compelling case for integrating these techniques in future QML applications. Finally we demonstrate the practical feasibility of this approach by running it using real quantum hardware with 100 qubit circuits and successfully achieving improved QML modeling performance in this case as well.
๐
2025-12-02
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Shogo Ohmae, Keiko Ohmae
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in general-purpose AI systems with attention-based transformers offer a potential window into how the neocortex and cerebellum, despite their relatively uniform circuit architectures, give rise to diverse functions and, ultimately, to human intelligence. This Perspective provides a cross-domain comparison between the brain and AI that goes beyond the traditional focus on visual processing, adopting the emerging perspecive of world-model-based computation. Here, we identify shared computational mechanisms in the attention-based neocortex and the non-attentional cerebellum: both predict future world events from past inputs and construct internal world models through prediction-error learning. These predictive world models are repurposed for seemingly distinct functions -- understanding in sensory processing and generation in motor processing -- enabling the brain to achieve multi-domain capabilities and human-like adaptive intelligence. Notably, attention-based AI has independently converged on a similar learning paradigm and world-model-based computation. We conclude that these shared mechanisms in both biological and artificial systems constitute a core computational foundation for realizing diverse functions including high-level intelligence, despite their relatively uniform circuit structures. Our theoretical insights bridge neuroscience and AI, advancing our understanding of the computational essence of intelligence.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Huiqian Li, Wei Pan, Haodong Zhang, Jin Huang, Zhihua Zhong
Robotics
ๆฅ็ๆ่ฆ
World models have gained significant attention as a promising approach for autonomous driving. By emulating human-like perception and decision-making processes, these models can predict and adapt to dynamic environments. Existing methods typically map high-dimensional observations into compact latent spaces and learn optimal policies within these latent representations. However, prior work usually jointly learns ego-vehicle dynamics and environmental transition dynamics from the image input, leading to inefficiencies and a lack of robustness to variations in vehicle dynamics. To address these issues, we propose the Vehicle Dynamics embedded Dreamer (VDD) method, which decouples the modeling of ego-vehicle dynamics from environmental transition dynamics. This separation allows the world model to generalize effectively across vehicles with diverse parameters. Additionally, we introduce two strategies to further enhance the robustness of the learned policy: Policy Adjustment during Deployment (PAD) and Policy Augmentation during Training (PAT). Comprehensive experiments in simulated environments demonstrate that the proposed model significantly improves both driving performance and robustness to variations in vehicle dynamics, outperforming existing approaches.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dmitriy Parashchuk, Alexey Kapshitskiy, Yuriy Karyakin
Computer VisionNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized for standard metrics often struggle to detect thin structural components and yield masks with irregular boundaries, lacking the geometric precision required for subsequent vectorization. To address this issue, we introduce MitUNet, a hybrid neural network architecture specifically designed for wall segmentation tasks in the context of 3D modeling. In MitUNet, we utilize a hierarchical Mix-Transformer encoder to capture global context and a U-Net decoder enhanced with scSE attention blocks for precise boundary recovery. Furthermore, we propose an optimization strategy based on the Tversky loss function to effectively balance precision and recall. By fine-tuning the hyperparameters of the loss function, we prioritize the suppression of false positive noise along wall boundaries while maintaining high sensitivity to thin structures. Our experiments on the public CubiCasa5k dataset and a proprietary regional dataset demonstrate that the proposed approach ensures the generation of structurally correct masks with high boundary accuracy, outperforming standard single-task models. MitUNet provides a robust tool for data preparation in automated 3D reconstruction pipelines.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yizhou Zhang, Lun Du
Machine Learning
ๆฅ็ๆ่ฆ
Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains -- most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability.
We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that \textbf{static pruning induces a bounded operator and therefore cannot change the spectral tail exponent}; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes \textbf{time-dependent data curation}, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning -- although practical systems can only approximate this behavior.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
๐
2025-12-02
๐ ArXiv Preprint (cs.SE)
๐ฅ Shuyang Liu, Yang Chen, Rahul Krishna, Saurabh Sinha, Jatin Ganhotra et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines. Unlike conventional programs, their execution (i.e., trajectories) is inherently stochastic and adaptive to the problem they are solving. Evaluation of such systems is often outcome-centric, judging their performance based on success or failure at the final step. This narrow focus overlooks detailed insights about such systems, failing to explain how agents reason, plan, act, or change their strategies over time. Inspired by the structured representation of conventional software systems as graphs, we introduce Graphectory to systematically encode the temporal and semantic relations in such software systems. Graphectory facilitates the design of process-centric metrics and analyses to assess the quality of agentic workflows independent of final success.
Using Graphectory, we analyze 4000 trajectories of two dominant agentic programming workflows, namely SWE-agent and OpenHands, with a combination of four backbone Large Language Models (LLMs), attempting to resolve SWE-bench Verified issues. Our fully automated analyses reveal that: (1) agents using richer prompts or stronger LLMs exhibit more complex Graphectory, reflecting deeper exploration, broader context gathering, and more thorough validation before patch submission; (2) agents' problem-solving strategies vary with both problem difficulty and the underlying LLM -- for resolved issues, the strategies often follow coherent localization-patching-validation steps, while unresolved ones exhibit chaotic, repetitive, or backtracking behaviors; (3) even when successful, agentic programming systems often display inefficient processes, leading to unnecessarily prolonged trajectories.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ David X. Wu, Shreyas Kapur, Anant Sahai, Stuart Russell
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning has become the dominant paradigm for eliciting reasoning and self-correction capabilities in large language models, but its computational expense motivates exploration of alternatives. Inspired by techniques from autonomous driving and robotics, we investigate whether supervised learning with synthetic error injection can induce self-correction abilities in language models. Our approach inserts artificial errors into reasoning chains, masks them, and supervises the model to recognize and correct these mistakes. Despite the intuitive appeal of this method, we find that it fails to significantly improve performance even on simple synthetic tasks across multiple models. Moreover, even when the model catches its own error, it often parrots the original mistake. We find that the distribution shift of synthetic errors to on-policy errors significantly degrades the error-correction capabilities of the fine-tuned model, even with good synthetic coverage of on-policy errors. Our results help explain why on-policy reinforcement learning methods have proven uniquely effective for eliciting self-correction.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenyi Xiong, Jian Chen
MultimodalRobotics
ๆฅ็ๆ่ฆ
Trajectory prediction is crucial for the reliability and safety of autonomous driving systems, yet it remains a challenging task in complex interactive scenarios. Existing methods often struggle to efficiently extract valuable scene information from redundant data, thereby reducing computational efficiency and prediction accuracy, especially when dealing with intricate agent interactions. To address these challenges, we propose a novel map-free trajectory prediction algorithm that achieves trajectory prediction across the temporal, spatial, and frequency domains. Specifically, in temporal information processing, We utilize a Mixture of Experts (MoE) mechanism to adaptively select critical frequency components. Concurrently, we extract these components and integrate multi-scale temporal features. Subsequently, a selective attention module is proposed to filter out redundant information in both temporal sequences and spatial interactions. Finally, we design a multimodal decoder. Under the supervision of patch-level and point-level losses, we obtain reasonable trajectory results. Experiments on Nuscences datasets demonstrate the superiority of our algorithm, validating its effectiveness in handling complex interactive scenarios.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daanish Hindustani, Sanober Hindustani, Preston Nguyen
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
This study explores the application of machine learning models, specifically a pretrained ResNet-50 model and a general SqueezeNet model, in diagnosing tuberculosis (TB) using chest X-ray images. TB, a persistent infectious disease affecting humanity for millennia, poses challenges in diagnosis, especially in resource-limited settings. Traditional methods, such as sputum smear microscopy and culture, are inefficient, prompting the exploration of advanced technologies like deep learning and computer vision. The study utilized a dataset from Kaggle, consisting of 4,200 chest X-rays, to develop and compare the performance of the two machine learning models. Preprocessing involved data splitting, augmentation, and resizing to enhance training efficiency. Evaluation metrics, including accuracy, precision, recall, and confusion matrix, were employed to assess model performance. Results showcase that the SqueezeNet achieved a loss of 32%, accuracy of 89%, precision of 98%, recall of 80%, and an F1 score of 87%. In contrast, the ResNet-50 model exhibited a loss of 54%, accuracy of 73%, precision of 88%, recall of 52%, and an F1 score of 65%. This study emphasizes the potential of machine learning in TB detection and possible implications for early identification and treatment initiation. The possibility of integrating such models into mobile devices expands their utility in areas lacking TB detection resources. However, despite promising results, the need for continued development of faster, smaller, and more accurate TB detection models remains crucial in contributing to the global efforts in combating TB.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lei Fu, Xiang Chen, Kaige Gao Xinyue Huang, Kejian Tong
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Domain-specific question answering (QA) systems for services face unique challenges in integrating heterogeneous knowledge sources while ensuring both accuracy and safety. Existing large language models often struggle with factual consistency and context alignment in sensitive domains such as healthcare policies and government welfare. In this work, we introduce Knowledge-Aware Reasoning and Memory-Augmented Adaptation (KARMA), a novel framework designed to enhance QA performance in care scenarios. KARMA incorporates a dual-encoder architecture to fuse structured and unstructured knowledge sources, a gated memory unit to dynamically regulate external knowledge integration, and a safety-aware controllable decoder that mitigates unsafe outputs using safety classification and guided generation techniques. Extensive experiments on a proprietary QA dataset demonstrate that KARMA outperforms strong baselines in both answer quality and safety. This study offers a comprehensive solution for building trustworthy and adaptive QA systems in service contexts.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhengzhuo Xu, Chong Sun, SiNan Du, Chen Li, Jing Lyu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, they may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continue training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT integrates a structured collection of general visual augmentations, broadening the query image views while reducing training complexity and computational overhead with efficient agentic reinforcement learning. We propose a conditional reward scheme that encourages necessary augmentation while penalizing verbose responses, ensuring concise and effective reasoning in perception tasks. We demonstrate the superiority of VACoT with extensive experiments on 13 perception benchmarks and further introduce AdvOCR to highlight the generalization benefits of post-hoc visual augmentations in adversarial scenarios.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ran Zhang, Kun Ouyang, Tiancheng Ma, Yida Yang, Dong Fang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Optimizing numerical systems and mechanism design is crucial for enhancing player experience in Massively Multiplayer Online (MMO) games. Traditional optimization approaches rely on large-scale online experiments or parameter tuning over predefined statistical models, which are costly, time-consuming, and may disrupt player experience. Although simplified offline simulation systems are often adopted as alternatives, their limited fidelity prevents agents from accurately mimicking real player reasoning and reactions to interventions. To address these limitations, we propose a generative agent-based MMO simulation system empowered by Large Language Models (LLMs). By applying Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on large-scale real player behavioral data, we adapt LLMs from general priors to game-specific domains, enabling realistic and interpretable player decision-making. In parallel, a data-driven environment model trained on real gameplay logs reconstructs dynamic in-game systems. Experiments demonstrate strong consistency with real-world player behaviors and plausible causal responses under interventions, providing a reliable, interpretable, and cost-efficient framework for data-driven numerical design optimization.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shwai He, Chaorui Deng, Ang Li, Shen Yan
Multimodal
ๆฅ็ๆ่ฆ
Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at \href{https://github.com/Shwai-He/SparseUnifiedModel}{this link}.
๐
2025-12-02
๐ Accepted by IEEE/ACM ToN
๐ฅ Nan Qiao, Sheng Yue, Ju Ren, Yaoxue Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Offline Federated Reinforcement Learning (FRL), a marriage of federated learning and offline reinforcement learning, has attracted increasing interest recently. Albeit with some advancement, we find that the performance of most existing offline FRL methods drops dramatically when provided with mixed-quality data, that is, the logging behaviors (offline data) are collected by policies with varying qualities across clients. To overcome this limitation, this paper introduces a new vote-based offline FRL framework, named FOVA. It exploits a \emph{vote mechanism} to identify high-return actions during local policy evaluation, alleviating the negative effect of low-quality behaviors from diverse local learning policies. Besides, building on advantage-weighted regression (AWR), we construct consistent local and global training objectives, significantly enhancing the efficiency and stability of FOVA. Further, we conduct an extensive theoretical analysis and rigorously show that the policy learned by FOVA enjoys strict policy improvement over the behavioral policy. Extensive experiments corroborate the significant performance gains of our proposed algorithm over existing baselines on widely used benchmarks.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qiyao Xue, Weichen Liu, Shiqi Wang, Haoming Wang, Yuyang Wu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint spatial pattern and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveals consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper analysis on the reasoning process. Explicit phase-wise analysis using LLM-as-a-judge and self-consistency prompting shows that VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Implicit analysis, including linear probing and entropy dynamics, further show progressive loss of task-relevant information and uncertainty separation between correct and incorrect trajectories. These results provide a cognitively grounded diagnosis of VLM spatial reasoning and reveal how multi-view spatial mental models are formed, degraded and destabilized across reasoning phases. The ReMindView-Bench benchmark is available at https://huggingface.co/datasets/Xue0823/ReMindView-Bench, and the source codes of benchmark construction and VLM reasoning analysis are available at https://github.com/pittisl/ReMindView-Bench.
๐
2025-12-02
๐ NeurIPS 2025
๐ฅ Chenshuang Zhang, Kang Zhang, Joon Son Chung, In So Kweon, Junmo Kim et al.
Computer Vision
ๆฅ็ๆ่ฆ
Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.
๐
2025-12-02
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Junyu Wang, Changjia Zhu, Yuanbo Zhou, Lingyao Li, Xu He et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further analyze the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate the underlying mechanisms of why models succeed/fail on specific CAPTCHA puzzles and use these insights to derive defense-oriented guidelines for selecting and strengthening CAPTCHA tasks. We conclude by discussing implications for platform operators deploying CAPTCHA as part of their abuse-mitigation pipeline.Code Availability (https://anonymous.4open.science/r/Captcha-465E/).
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We propose MagicQuill V2, a novel system that introduces a \textbf{layered composition} paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seonghu Jeon, Jinhyuk Jang et al.
ๆฅ็ๆ่ฆ
Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Michael Ofengenden, Yunze Man, Ziqi Pang, Yu-Xiong Wang
Robotics
ๆฅ็ๆ่ฆ
We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin
ๆฅ็ๆ่ฆ
Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech.The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components.The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions.Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video generation.Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content generation.Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.
๐
2025-12-02
๐ ArXiv Preprint (cs.GR)
๐ฅ Yuxuan Mu, Ziyu Zhang, Yi Shi, Minami Matsumoto, Kotaro Imamura et al.
Robotics
ๆฅ็ๆ่ฆ
Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kehan Qi, Saumya Gupta, Qingqiao Hu, Weimin Lyu, Chao Chen
Machine Learning
ๆฅ็ๆ่ฆ
Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent methods accelerate MRI by under-sampling $k$-space and reconstructing the resulting images using deep learning. Unrolled networks have been widely used for the reconstruction task due to their efficiency, but suffer from unstable evolving caused by freely-learnable parameters in intermediate steps. In contrast, diffusion models based on stochastic differential equations offer theoretical stability in both medical and natural image tasks but are computationally expensive. In this work, we introduce flow ODEs to MRI reconstruction by theoretically proving that unrolled networks are discrete implementations of conditional probability flow ODEs. This connection provides explicit formulations for parameters and clarifies how intermediate states should evolve. Building on this insight, we propose Flow-Aligned Training (FLAT), which derives unrolled parameters from the ODE discretization and aligns intermediate reconstructions with the ideal ODE trajectory to improve stability and convergence. Experiments on three MRI datasets show that FLAT achieves high-quality reconstructions with up to $3\times$ fewer iterations than diffusion-based generative models and significantly greater stability than unrolled networks.
๐
2025-12-02
๐ SIGGRAPH
๐ฅ Xiang Xu, Pradeep Kumar Jayaraman, Joseph G. Lambourne, Yilin Liu, Durvesh Malpure et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The boundary representation (B-Rep) is the standard data structure used in Computer-Aided Design (CAD) for defining solid models. Despite recent progress, directly generating B-Reps end-to-end with precise geometry and watertight topology remains a challenge. This paper presents AutoBrep, a novel Transformer model that autoregressively generates B-Reps with high quality and validity. AutoBrep employs a unified tokenization scheme that encodes both geometric and topological characteristics of a B-Rep model as a sequence of discrete tokens. Geometric primitives (i.e., surfaces and curves) are encoded as latent geometry tokens, and their structural relationships are defined as special topological reference tokens. Sequence order in AutoBrep naturally follows a breadth first traversal of the B-Rep face adjacency graph. At inference time, neighboring faces and edges along with their topological structure are progressively generated. Extensive experiments demonstrate the advantages of our unified representation when coupled with next-token prediction for B-Rep generation. AutoBrep outperforms baselines with better quality and watertightness. It is also highly scalable to complex solids with good fidelity and inference speed. We further show that autocompleting B-Reps is natively supported through our unified tokenization, enabling user-controllable CAD generation with minimal changes. Code is available at https://github.com/AutodeskAILab/AutoBrep.
๐
2025-12-02
๐ NeurIPS 2025
๐ฅ Matthew Dutson, Nathan Labiosa, Yin Li, Mohit Gupta
ๆฅ็ๆ่ฆ
When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sagi Polaczek, Or Patashnik, Ali Mahdavi-Amiri, Daniel Cohen-Or
Robotics
ๆฅ็ๆ่ฆ
Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject's original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Svenja Strobel, Matthias Innmann, Bernhard Egger, Marc Stamminger, Linus Franke
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capturing is susceptible to miss small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR capturings and attribute LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, which we can then use to grow additional points from to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce \textbf{Driving Gaussian Grounded Transformer (DGGT)}, a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin et al.
Computer VisionMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structurefrom-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consists of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.
๐
2025-12-02
๐ WWW
๐ฅ Yifei Zeng, Yajie Bao, Jiachen Qian, Shuang Wu, Youtian Lin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Sohag Mia, Md Nahid Hasan, Tawhid Ahmed, Muhammad Abdullah Adnan
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6\% AP$_{25}$ and 51.2\% AP$_{50}$) and ScanNetV2 (75.1\% AP$_{25}$ and 60.8\% AP$_{50}$) demonstrate a substantial performance improvement over existing approaches.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiang Xu, Ao Liang, Youquan Liu, Linfeng Li, Lingdong Kong et al.
Robotics
ๆฅ็ๆ่ฆ
Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.
๐
2025-12-02
๐ AAAI 2026
๐ฅ Zhongyu Yang, Yingfang Yuan, Xuanming Jiang, Baoyi An, Wei Pang
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent's ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent's reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Xiong, Ziqi Miao, Lijun Li, Chen Qian, Jie Li et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embeds harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporate contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31\% and 91.07\%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guowen Zhang, Chenhang He, Liyi Chen, Lei Zhang
Computer Vision
ๆฅ็ๆ่ฆ
Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuhan Chen, Yicui Shi, Guofa Li, Guangrui Bai, Jinyuan Shao et al.
ๆฅ็ๆ่ฆ
In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are often too computationally intensive for vehicular applications, we propose UltraFast-LieNET, a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement. We introduce a Dynamic Shifted Convolution (DSConv) kernel with only 12 learnable parameters for efficient feature extraction. By integrating DSConv with varying shift distances, a Multi-scale Shifted Residual Block (MSRB) is constructed to significantly expand the receptive field. To mitigate lightweight network instability, a residual structure and a novel multi-level gradient-aware loss function are incorporated. UltraFast-LieNET allows flexible parameter configuration, with a minimum size of only 36 parameters. Results on the LOLI-Street dataset show a PSNR of 26.51 dB, outperforming state-of-the-art methods by 4.6 dB while utilizing only 180 parameters. Experiments across four benchmark datasets validate its superior balance of real-time performance and enhancement quality under limited resources. Code is available at https://githubhttps://github.com/YuhanChen2024/UltraFast-LiNET
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Sohag Mia, Muhammad Abdullah Adnan
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts the OneFormer's universal segmentation architecture to geometric structure prediction. Our approach integrates OneFormer's task-conditioned queries and contrastive learning with two key modules: (1) a layout degeneration strategy that augments training data while preserving Manhattan-world constraints through topology-aware transformations, and (2) differentiable geometric losses that directly enforce planar consistency and sharp boundary predictions during training. By unifying these components in an end-to-end framework, the model eliminates complex post-processing pipelines while achieving high-speed inference at 114ms. Extensive experiments demonstrate state-of-the-art performance across standard benchmarks, with pixel error (PE) of 5.43% and corner error (CE) of 4.02% on the LSUN, PE of 7.04% (CE 5.17%) on the Hedau and PE of 4.03% (CE 3.15%) on the Matterport3D-Layout datasets. The framework's combination of geometric awareness and computational efficiency makes it particularly suitable for augmented reality applications and large-scale 3D scene reconstruction tasks.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lanxiang Hu, Abhilash Shankarampeta, Yixin Huang, Zilin Dai, Haoyang Yu et al.
ๆฅ็ๆ่ฆ
The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are physical commonsense-based, offering limited insight into video models' scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: \href{https://github.com/hao-ai-lab/VideoScience}{github.com/hao-ai-lab/VideoScience}.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhihan Xiao, Lin Liu, Yixin Gao, Xiaopeng Zhang, Haoxuan Che et al.
Robotics
ๆฅ็ๆ่ฆ
Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yancheng Zhang, Guangyu Sun, Chen Chen
Computer VisionRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time rendering with high appearance fidelity, it suffers from multi-view inconsistencies, limiting geometric accuracy. In contrast, 2D Gaussian Splatting (2DGS) enforces multi-view consistency but compromises texture details. To address these limitations, we propose Exchangeable Gaussian Splatting (EGGS), a hybrid representation that integrates 2D and 3D Gaussians to balance appearance and geometry. To achieve this, we introduce Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization that effectively exploits the strengths of each type of Gaussian representation. Our CUDA-accelerated implementation ensures efficient training and inference. Extensive experiments demonstrate that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, providing a practical solution for high-quality NVS.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
In this paper, we investigate the underexplored challenge of sample diversity in autoregressive (AR) generative models with bitwise visual tokenizers. We first analyze the factors that limit diversity in bitwise AR models and identify two key issues: (1) the binary classification nature of bitwise modeling, which restricts the prediction space, and (2) the overly sharp logits distribution, which causes sampling collapse and reduces diversity. Building on these insights, we propose DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling, resulting in smoother predictions and greater diversity. To mitigate potential fidelity loss caused by distribution smoothing, we further develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens, thereby preserving high visual quality. Extensive experiments demonstrate that DiverseAR substantially improves sample diversity in bitwise autoregressive image generation.
๐
2025-12-02
๐ KDD
๐ฅ Ziniu Zhang, Minxuan Duan, Haris N. Koutsopoulos, Hongyang R. Zhang
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region's weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of $90.1\%$, which is a $3.7\%$ gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by $24\%$ under higher precipitation, by $22\%$ on higher-speed roads such as motorways, and by $29\%$ due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fan Yang, Kaihao Zhang
Computer VisionNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent study address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocalbulary object detection (OVD) model that identifies object regions using a sliding-window approach.Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhuobai Dong, Rui Zhao, Songjie Wu, Junchao Yi, Linjie Li et al.
Computer Vision
ๆฅ็ๆ่ฆ
Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Through extensive experiments, our method achieves up to 5 acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 samples on a single V100 within one hour, yet the resulting models generalize strongly on unseen prompts.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pierpaolo Serio, Giulio Pisaneschi, Andrea Dan Ryals, Vincenzo Infantino, Lorenzo Gentilini et al.
Robotics
ๆฅ็ๆ่ฆ
This work presents a systematic investigation into how alternative LiDAR-to-image projections affect metric place recognition when coupled with a state-of-the-art vision foundation model. We introduce a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol, thereby isolating the influence of the 2-D projection itself. Using consistent geometric and structural channels across multiple datasets and deployment scenarios, we identify the projection characteristics that most strongly determine discriminative power, robustness to environmental variation, and suitability for real-time autonomy. Experiments with different datasets, including integration into an operational place recognition policy, validate the practical relevance of these findings and demonstrate that carefully designed projections can serve as an effective surrogate for end-to-end 3-D learning in LiDAR place recognition.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wei Chen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities of MLLMs and the generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be recently open-sourced to support the community's development of MLLMs.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhaoqing Wang, Xiaobo Xia, Zhuolin Bie, Jinlin Liu, Dongdong Yu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yaqi Wang, Zhi Li, Chengyu Wu, Jun Liu, Yifan Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for digital dentistry, but annotated data scarcity limits automated solutions for pulp canal segmentation and cross-modal registration. To benchmark semi-supervised learning (SSL) in this domain, we organized the STSR 2025 Challenge at MICCAI 2025, featuring two tasks: (1) semi-supervised segmentation of teeth and pulp canals in CBCT, and (2) semi-supervised rigid registration of CBCT and IOS. We provided 60 labeled and 640 unlabeled IOS samples, plus 30 labeled and 250 unlabeled CBCT scans with varying resolutions and fields of view. The challenge attracted strong community participation, with top teams submitting open-source deep learning-based SSL solutions. For segmentation, leading methods used nnU-Net and Mamba-like State Space Models with pseudo-labeling and consistency regularization, achieving a Dice score of 0.967 and Instance Affinity of 0.738 on the hidden test set. For registration, effective approaches combined PointNetLK with differentiable SVD and geometric augmentation to handle modality gaps; hybrid neural-classical refinement enabled accurate alignment despite limited labels. All data and code are publicly available at https://github.com/ricoleehduu/STS-Challenge-2025 to ensure reproducibility.
๐
2025-12-02
๐ ICASSP
๐ฅ Abdul Hannan, Furqan Malik, Hina Jabbar, Syed Suleman Sadiq, Mubashir Noman
ๆฅ็ๆ่ฆ
Face-voice association in multilingual environment challenge 2026 aims to investigate the face-voice association task in multilingual scenario. The challenge introduces English-German face-voice pairs to be utilized in the evaluation phase. To this end, we revisit the fusion and orthogonal projection for face-voice association by effectively focusing on the relevant semantic information within the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 challenge by achieving the EER of 33.1.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vishal Dubey, Pallavi Tyagi
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern image editors can produce identity-preserving AIGC (IP-AIGC), where the same person appears with new attire, background, or lighting. The robustness and fairness of current detectors in this regime remain unclear, especially for under-represented populations. We present what we believe is the first systematic study of IP-AIGC detection for Indian and South-Asian faces, quantifying cross-generator generalization and intra-population performance. We assemble Indian-focused training splits from FairFD and HAV-DF, and construct two held-out IP-AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai) using commercial web-UI generators (Gemini and ChatGPT) with identity-preserving prompts. We evaluate two state-of-the-art detectors (AIDE and Effort) under pretrained (PT) and fine-tuned (FT) regimes and report AUC, AP, EER, and accuracy. Fine-tuning yields strong in-domain gains (for example, Effort AUC 0.739 to 0.944 on HAV-DF-test; AIDE EER 0.484 to 0.259), but consistently degrades performance on held-out IP-AIGC for Indian cohorts (for example, AIDE AUC 0.923 to 0.563 on HIDF-img-ip-genai; Effort 0.740 to 0.533), which indicates overfitting to training-generator cues. On non-IP HIDF images, PT performance remains high, which suggests a specific brittleness to identity-preserving edits rather than a generic distribution shift. Our study establishes IP-AIGC-Indian as a challenging and practically relevant scenario and motivates representation-preserving adaptation and India-aware benchmark curation to close generalization gaps in AIGC detection.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Manuel Benavent-Lledo, Konstantinos Bacharidis, Victoria Manousaki, Konstantinos Papoutsakis, Antonis Argyros et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/ .
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Awad, Mahmoud Akrm, Walid Gomaa
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduce a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.
๐
2025-12-02
๐ Accepted by ACM MM 2025
๐ฅ Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Haokun Wen et al.
ๆฅ็ๆ่ฆ
Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tang Haonan, Chen Yanjun, Jiang Lezhi
Natural Language Processing
ๆฅ็ๆ่ฆ
The TrackNet series has established a strong baseline for fast-moving small object tracking in sports. However, existing iterations face significant limitations: V1-V3 struggle with occlusions due to a reliance on purely visual cues, while TrackNetV4, despite introducing motion inputs, suffers from directional ambiguity as its absolute difference method discards motion polarity. To overcome these bottlenecks, we propose TrackNetV5, a robust architecture integrating two novel mechanisms. First, to recover lost directional priors, we introduce the Motion Direction Decoupling (MDD) module. Unlike V4, MDD decomposes temporal dynamics into signed polarity fields, explicitly encoding both movement occurrence and trajectory direction. Second, we propose the Residual-Driven Spatio-Temporal Refinement (R-STR) head. Operating on a coarse-to-fine paradigm, this Transformer-based module leverages factorized spatio-temporal contexts to estimate a corrective residual, effectively recovering occluded targets. Extensive experiments on the TrackNetV2 dataset demonstrate that TrackNetV5 achieves a new state-of-the-art F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions. Notably, this performance leap is achieved with a marginal 3.7% increase in FLOPs compared to V4, maintaining real-time inference capabilities while delivering superior tracking precision.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Xianchao Zeng, Xinyu Zhou, Youcheng Li, Jiayou Shi, Tianle Li et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xu Han, Biao Zhang, Xiangjun Tang, Xianzhi Li, Peter Wonka
ๆฅ็ๆ่ฆ
We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.
๐
2025-12-02
๐ AAAI
๐ฅ Qifan Liang, Junlin Li, Zhen Han, Xihao Wang, Zhongyuan Wang et al.
ๆฅ็ๆ่ฆ
Electrocautery or lasers will inevitably generate surgical smoke, which hinders the visual guidance of laparoscopic videos for surgical procedures. The surgical smoke can be classified into different types based on its motion patterns, leading to distinctive spatio-temporal characteristics across smoky laparoscopic videos. However, existing desmoking methods fail to account for such smoke-type-specific distinctions. Therefore, we propose the first Smoke-Type-Aware Laparoscopic Video Desmoking Network (STANet) by introducing two smoke types: Diffusion Smoke and Ambient Smoke. Specifically, a smoke mask segmentation sub-network is designed to jointly conduct smoke mask and smoke type predictions based on the attention-weighted mask aggregation, while a smokeless video reconstruction sub-network is proposed to perform specially desmoking on smoky features guided by two types of smoke mask. To address the entanglement challenges of two smoke types, we further embed a coarse-to-fine disentanglement module into the mask segmentation sub-network, which yields more accurate disentangled masks through the smoke-type-aware cross attention between non-entangled and entangled regions. In addition, we also construct the first large-scale synthetic video desmoking dataset with smoke type annotations. Extensive experiments demonstrate that our method not only outperforms state-of-the-art approaches in quality evaluations, but also exhibits superior generalization across multiple downstream surgical tasks.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rakib Ahsan, MD Sadik Hossain Shanto, Md Sultanul Arifin, Tanzima Hashem
Machine Learning
ๆฅ็ๆ่ฆ
Methane is a powerful greenhouse gas that contributes significantly to global warming. Accurate detection of methane emissions is the key to taking timely action and minimizing their impact on climate change. We present AttMetNet, a novel attention-enhanced deep learning framework for methane plume detection with Sentinel-2 satellite imagery. The major challenge in developing a methane detection model is to accurately identify methane plumes from Sentinel-2's B11 and B12 bands while suppressing false positives caused by background variability and diverse land cover types. Traditional detection methods typically depend on the differences or ratios between these bands when comparing the scenes with and without plumes. However, these methods often require verification by a domain expert because they generate numerous false positives. Recent deep learning methods make some improvements using CNN-based architectures, but lack mechanisms to prioritize methane-specific features. AttMetNet introduces a methane-aware architecture that fuses the Normalized Difference Methane Index (NDMI) with an attention-enhanced U-Net. By jointly exploiting NDMI's plume-sensitive cues and attention-driven feature selection, AttMetNet selectively amplifies methane absorption features while suppressing background noise. This integration establishes a first-of-its-kind architecture tailored for robust methane plume detection in real satellite imagery. Additionally, we employ focal loss to address the severe class imbalance arising from both limited positive plume samples and sparse plume pixels within imagery. Furthermore, AttMetNet is trained on the real methane plume dataset, making it more robust to practical scenarios. Extensive experiments show that AttMetNet surpasses recent methods in methane plume detection with a lower false positive rate, better precision recall balance, and higher IoU.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning-a structured three-stage process where a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences-providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tristan Amadei, Enric Meinhardt-Llopis, Benedicte Bascle, Corentin Abgrall, Gabriele Facciolo
ๆฅ็ๆ่ฆ
Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Zhou, Takehiko Ohkawa, Guwenxiao Zhou, Kanoko Goto, Takumi Hirose et al.
Computer Vision
ๆฅ็ๆ่ฆ
Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such CNN's inductive bias may not be optimal for 3D HPE due to its limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Julian Ma, Jun Wang, Zafeirios Fountas
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Peirong Zhang, Yidan Zhang, Luxiao Xu, Jinliang Lin, Zonghao Guo et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in multimodal large language models(MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yasemin Utkueri, Elin Lundstrรถm, Hรฅkan Ahlstrรถm, Johan รfverstedt, Joel Kullberg
ๆฅ็ๆ่ฆ
The UK Biobank is a large-scale study collecting whole-body MR imaging and non-imaging health data. Robust and accurate inter-subject image registration of these whole-body MR images would enable their body-wide spatial standardization, and region-/voxel-wise correlation analysis of non-imaging data with image-derived parameters (e.g., tissue volume or fat content). We propose a sex-stratified inter-subject whole-body MR image registration approach that uses subcutaneous adipose tissue- and muscle-masks from the state-of-the-art VIBESegmentator method to augment intensity-based graph-cut registration. The proposed method was evaluated on a subset of 4000 subjects by comparing it to an intensity-only method as well as two previously published registration methods, uniGradICON and MIRTK. The evaluation comprised overlap measures applied to the 71 VIBESegmentator masks: 1) Dice scores, and 2) voxel-wise label error frequency. Additionally, voxel-wise correlation between age and each of fat content and tissue volume was studied to exemplify the usefulness for medical research. The proposed method exhibited a mean dice score of 0.77 / 0.75 across the cohort and the 71 masks for males/females, respectively. When compared to the intensity-only registration, the mean values were 6 percentage points (pp) higher for both sexes, and the label error frequency was decreased in most tissue regions. These differences were 9pp / 8pp against uniGradICON and 12pp / 13pp against MIRTK. Using the proposed method, the age-correlation maps were less noisy and showed higher anatomical alignment. In conclusion, the image registration method using two tissue masks improves whole-body registration of UK Biobank images.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9\% pruning rate, while delivering an end-to-end inference speedup.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu et al.
ๆฅ็ๆ่ฆ
Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models were released at https://github.com/MiliLab/GeoBridge.
๐
2025-12-02
๐ ICASSP 2026
๐ฅ Omid Reza Heidari, Yang Wang, Xinxin Zuo
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to variations in scanning devices and environmental conditions, leading to significant domain discrepancies. To address this, we apply ALDI++, a domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift effectively in this area. We conduct extensive experiments on the EDS dataset, demonstrating that ALDI++ surpasses the state-of-the-art (SOTA) domain adaptation methods across multiple adaptation scenarios. In particular, ALDI++ with a Vision Transformer for Detection (ViTDet) backbone achieves the highest mean average precision (mAP), confirming the effectiveness of transformer-based architectures for cross-domain object detection. Additionally, our category-wise analysis highlights consistent improvements in detection accuracy, reinforcing the robustness of the model across diverse object classes. Our findings establish ALDI++ as an efficient solution for domain-adaptive object detection, setting a new benchmark for performance stability and cross-domain generalization in security X-ray imagery.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxing Liu, Yong Liu
Robotics
ๆฅ็ๆ่ฆ
Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions. Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huankun Sheng, Ming Li, Yixiang Wei, Yeying Fan, Yu-Hui Wen et al.
ๆฅ็ๆ่ฆ
Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhongbao Yang, Jiangxin Dong, Yazhou Yao, Jinhui Tang, Jinshan Pan
Natural Language Processing
ๆฅ็ๆ่ฆ
Although diffusion-based models have achieved impressive results in image super-resolution, they often rely on large-scale backbones such as Stable Diffusion XL (SDXL) and Diffusion Transformers (DiT), which lead to excessive computational and memory costs during training and inference. To address this issue, we develop a lightweight diffusion method, PGP-DiffSR, by removing redundant information from diffusion models under the guidance of the phase information of inputs for efficient image super-resolution. We first identify the intra-block redundancy within the diffusion backbone and propose a progressive pruning approach that removes redundant blocks while reserving restoration capability. We note that the phase information of the restored images produced by the pruned diffusion model is not well estimated. To solve this problem, we propose a phase-exchange adapter module that explores the phase information of the inputs to guide the pruned diffusion model for better restoration performance. We formulate the progressive pruning approach and the phase-exchange adapter module into a unified model. Extensive experiments demonstrate that our method achieves competitive restoration quality while significantly reducing computational load and memory consumption. The code is available at https://github.com/yzb1997/PGP-DiffSR.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qionglin Ren, Dawei Zhang, Chunxu Tian, Dan Zhang
Multimodal
ๆฅ็ๆ่ฆ
Research in Anti-UAV (Unmanned Aerial Vehicle) tracking has explored various modalities, including RGB, TIR, and RGB-T fusion. However, a unified framework for cross-modal collaboration is still lacking. Existing approaches have primarily focused on independent models for individual tasks, often overlooking the potential for cross-modal information sharing. Furthermore, Anti-UAV tracking techniques are still in their infancy, with current solutions struggling to achieve effective multimodal data fusion. To address these challenges, we propose UAUTrack, a unified single-target tracking framework built upon a single-stream, single-stage, end-to-end architecture that effectively integrates multiple modalities. UAUTrack introduces a key component: a text prior prompt strategy that directs the model to focus on UAVs across various scenarios. Experimental results show that UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets, and maintains a favourable trade-off between accuracy and speed on the Anti-UAV410 dataset, demonstrating both high accuracy and practical efficiency across diverse Anti-UAV scenarios.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Derui Shan, Qian Qiao, Hao Lu, Tao Du, Peng Lu
Machine Learning
ๆฅ็ๆ่ฆ
Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencies on material/viewpoint assumptions. However, 3D Gaussian Splatting (3DGS) enables real-time rendering yet struggles with accurate reflection reconstruction from reflection-geometry entanglement, adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first 3DGS's geometric priors are leveraged to resolve polarization ambiguity, and then the refined polarization information cues are used to guide 3DGS's normal and spherical harmonic representation. This process achieves high-fidelity reflection separation and full-scene reconstruction without requiring environment maps or restrictive material assumptions. We demonstrate on public and self-collected datasets that PolarGuide-GSDR achieves state-of-the-art performance in specular reconstruction, normal estimation, and novel view synthesis, all while maintaining real-time rendering capabilities. To our knowledge, this is the first framework embedding polarization priors directly into 3DGS optimization, yielding superior interpretability and real-time performance for modeling complex reflective scenes.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Agathoklis Georgiou
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.
๐
2025-12-02
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Alvaro Becerra, Pablo Villegas, Ruth Cobos
Multimodal
ๆฅ็ๆ่ฆ
Wearable sensors, such as smartwatches, have become increasingly prevalent across domains like healthcare, sports, and education, enabling continuous monitoring of physiological and behavioral data. In the context of education, these technologies offer new opportunities to study cognitive and affective processes such as engagement, attention, and performance. However, the lack of scalable, synchronized, and high-resolution tools for multimodal data acquisition continues to be a significant barrier to the widespread adoption of Multimodal Learning Analytics in real-world educational settings. This paper presents two complementary tools developed to address these challenges: Watch-DMLT, a data acquisition application for Fitbit Sense 2 smartwatches that enables real-time, multi-user monitoring of physiological and motion signals; and ViSeDOPS, a dashboard-based visualization system for analyzing synchronized multimodal data collected during oral presentations. We report on a classroom deployment involving 65 students and up to 16 smartwatches, where data streams including heart rate, motion, gaze, video, and contextual annotations were captured and analyzed. Results demonstrate the feasibility and utility of the proposed system for supporting fine-grained, scalable, and interpretable Multimodal Learning Analytics in real learning environments.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junwon Lee, Juhan Nam, Jiyoung Lee
Robotics
ๆฅ็ๆ่ฆ
This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of target source and modulates video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dong Li, Jiahao Xiong, Yingda Huang, Le Chang
Computer Vision
ๆฅ็ๆ่ฆ
We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total, among which more than 52,000 are longer than 10 frames, including 68 manually reviewed trajectories that span the entire 150 frames. To the best of our knowledge, PoreTrack3D is the first benchmark dataset to capture both traditional facial landmarks and pore-scale keypoints trajectory, advancing the study of fine-grained facial expressions through the analysis of subtle skin-surface motion. We systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D, establishing the first performance baseline in this domain. Overall, the pipeline developed for this benchmark dataset's creation establishes a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction. Our dataset are publicly available at: https://github.com/JHXion9/PoreTrack3D
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yongchuan Cui, Peng Liu, Yi Zeng
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Existing deep learning methods for remote sensing image fusion often suffer from poor generalization when applied to unseen datasets due to the limited availability of real training data and the domain gap between different satellite sensors. To address this challenge, we explore the potential of foundation models by proposing a novel pretraining strategy that leverages large-scale simulated datasets to learn robust spatial-spectral priors. Specifically, our approach first constructs diverse simulated datasets by applying various degradation operations (blur, noise, downsampling) and augmentations (bands generation, channel shuffling, high-pass filtering, color jittering, etc.) to natural images from ImageNet and remote sensing images from SkyScript. We then pretrain fusion models on these simulated data to learn generalizable spatial-spectral representations. The pretrained models are subsequently evaluated on six datasets (WorldView-2/3/4, IKONOS, QuickBird, GaoFen-2) using zero-shot and one-shot paradigms, with both full- and freeze-tuning approaches for fine-tuning. Extensive experiments on different network architectures including convolutional neural networks, Transformer, and Mamba demonstrate that our pretraining strategy significantly improves generalization performance across different satellite sensors and imaging conditions for various fusion models. The pretrained models achieve superior results in zero-shot scenarios and show remarkable adaptation capability with minimal real data in one-shot settings. Our work provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion tasks, and paves the way for leveraging foundation models through advanced training strategies.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter et al.
Machine Learning
ๆฅ็ๆ่ฆ
Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models -- diffusion and flow-based models -- still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow model to outperform a 1024 step teacher model with only a single additional backward NFE.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zheng Huang, Xukai Liu, Tianyu Hu, Kai Zhang, Ye Liu
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. And case studies visually expose systematic layout errors such as misalignment and element overlap. These findings provides a new perspective on evaluating VLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Panagiotis Papantonakis, Georgios Kopanas, Fredo Durand, George Drettakis
Computer VisionRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Gaussian Splatting has become the method of choice for 3D reconstruction and real-time rendering of captured real scenes. However, fine appearance details need to be represented as a large number of small Gaussian primitives, which can be wasteful when geometry and appearance exhibit different frequency characteristics.
Inspired by the long tradition of texture mapping, we propose to use texture to represent detailed appearance where possible. Our main focus is to incorporate per-primitive texture maps that adapt to the scene in a principled manner during Gaussian Splatting optimization. We do this by proposing a new appearance representation for 2D Gaussian primitives with textures where the size of a texel is bounded by the image sampling frequency and adapted to the content of the input images. We achieve this by adaptively upscaling or downscaling the texture resolution during optimization. In addition, our approach enables control of the number of primitives during optimization based on texture resolution. We show that our approach performs favorably in image quality and total number of parameters used compared to alternative solutions for textured Gaussian primitives. Project page: https://repo-sam.inria.fr/nerphys/gs-texturing/
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Shengkai Wu, Jinrong Yang, Wenqiu Luo, Linfeng Gao, Chaohui Shang et al.
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method leverages the frozen SAM2 model to use its powerful visual temporal tracking capability and introduces a lightweight, trainable action head that operates in parallel with its native segmentation head. This design allows for training only the small action head on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object detection model, designates the specific object to be grasped. This prompt conditions the action head to predict a unique, unambiguous grasp trajectory for that object alone. In all subsequent video frames, SAM2's built-in temporal tracking capability automatically maintains stable tracking of the selected object, enabling our model to continuously predict the grasp trajectory from the video stream without further external guidance. This temporal-prompted approach effectively eliminates ambiguity from the visuomotor policy. We demonstrate through extensive experiments that SAM2Grasp achieves state-of-the-art performance in cluttered, multi-object grasping tasks.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yafei Song, Peng Zhang, Bang Zhang
ๆฅ็ๆ่ฆ
Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Changxiao Ma, Chao Yuan, Xincheng Shi, Yuzhuo Ma, Yongfei Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, We introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We proposed OmniPerson, a unified generation model, offering holistic and fine-grained control over all key pedestrian attributes. Supporting RGB/IR modality image/video generation with any number of reference images, two kinds of person poses, and text. Also including RGB-to-IR transfer and image super-resolution abilities.2) We designed Multi-Refer Fuser for robust identity preservation with any number of reference images as input, making OmniPerson could distill a unified identity from a set of multi-view reference images, ensuring our generated pedestrians achieve high-fidelity pedestrian generation.3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, and present its automated curation pipeline which transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang et al.
ๆฅ็ๆ่ฆ
Since DUSt3R, models such as VGGT and $ฯ^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $ฯ^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $ฯ^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jian Yang, Dacheng Yin, Xiaoxuan He, Yong Li, Fengyun Rao et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tai Le-Gia
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Zero-shot anomaly classification and segmentation (AC/AS) aim to detect anomalous samples and regions without any training data, a capability increasingly crucial in industrial inspection and medical imaging. This dissertation aims to investigate the core challenges of zero-shot AC/AS and presents principled solutions rooted in theory and algorithmic design.
We first formalize the problem of consistent anomalies, a failure mode in which recurring similar anomalies systematically bias distance-based methods. By analyzing the statistical and geometric behavior of patch representations from pre-trained Vision Transformers, we identify two key phenomena - similarity scaling and neighbor-burnout - that describe how relationships among normal patches change with and without consistent anomalies in settings characterized by highly similar objects.
We then introduce CoDeGraph, a graph-based framework for filtering consistent anomalies built on the similarity scaling and neighbor-burnout phenomena. Through multi-stage graph construction, community detection, and structured refinement, CoDeGraph effectively suppresses the influence of consistent anomalies.
Next, we extend this framework to 3D medical imaging by proposing a training-free, computationally efficient volumetric tokenization strategy for MRI data. This enables a genuinely zero-shot 3D anomaly detection pipeline and shows that volumetric anomaly segmentation is achievable without any 3D training samples.
Finally, we bridge batch-based and text-based zero-shot methods by demonstrating that CoDeGraph-derived pseudo-masks can supervise prompt-driven vision-language models. Together, this dissertation provides theoretical understanding and practical solutions for the zero-shot AC/AS problem.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaqi Liu, Ronghao Fu, Lang Sun, Haoran Liu, Xiao Yang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aditya Chaudhary, Prachet Dev Singh, Ankit Jha
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a two-stage training strategy. In our method, the model learns rich, generalizable visual representations from the data itself through a self-supervised pretraining phase on a colourization task. The pre-trained model is then adjusted for 4x super-resolution. By predicting the addition of a high-frequency residual image to an initial bicubic interpolation, this design simplifies residual learning. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an impressive SSIM of 0.712 and PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Bo Yang
Computer VisionNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process,which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenjing Yu, Shuo Jiang, Yifei Chen, Shuo Chang, Yuanhan Wang et al.
Computer Vision
ๆฅ็ๆ่ฆ
Test time Adaptation is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in terms of modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptation methods across seven imaging modalities, including MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray, under fully unified data preprocessing, backbone configuration, and test time protocols. The benchmark encompasses four significant adaptation paradigms: Input-level Transformation, Feature-level Alignment, Output-level Regularization, and Prior Estimation, enabling the first systematic cross-modality comparison of their reliability and applicability. The results show that no single paradigm performs best in all conditions. Input-level methods are more stable under mild appearance shifts. Feature-level and Output-level methods offer greater advantages in boundary-related metrics, whereas prior-based methods exhibit strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts, which highlights the importance of principled method selection for clinical deployment. MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard, establishing a rigorous foundation for future research on robust, clinically reliable test-time adaptation. All source codes and open-source datasets are available at https://github.com/wenjing-gg/MedSeg-TTA.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mizuki Kikkawa, Tatsuya Yatagawa, Yutaka Ohtake, Hiromasa Suzuki
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This study investigates the impact of the invariance of feature vectors for partial-to-partial point set registration under translation and rotation of input point sets, particularly in the realm of techniques based on deep learning and Gaussian mixture models (GMMs). We reveal both theoretical and practical problems associated with such deep-learning-based registration methods using GMMs, with a particular focus on the limitations of DeepGMR, a pioneering study in this line, to the partial-to-partial point set registration. Our primary goal is to uncover the causes behind such methods and propose a comprehensible solution for that. To address this, we introduce an attention-based reference point shifting (ARPS) layer, which robustly identifies a common reference point of two partial point sets, thereby acquiring transformation-invariant features. The ARPS layer employs a well-studied attention module to find a common reference point rather than the overlap region. Owing to this, it significantly enhances the performance of DeepGMR and its recent variant, UGMMReg. Furthermore, these extension models outperform even prior deep learning methods using attention blocks and Transformer to extract the overlap region or common reference points. We believe these findings provide deeper insights into registration methods using deep learning and GMMs.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
While diffusion model for audio-driven avatar video generation have achieved notable process in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continulity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjust denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available in our project page: https://giantailab.github.io/YingVideo-MV/ .
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yerim Jeon, Miso Lee, WonJun Moon, Jae-Pil Heo
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user's task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Xiaofan Zhang, Qi Dou
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost, the evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real-world clinical deployment. Code is available at https://github.com/fqhank/UCAgents.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vishwesh Nath, Javier G. Tejero, Ruilong Li, Filippo Filicori, Mahdi Azizian et al.
ๆฅ็ๆ่ฆ
We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuta Oshima, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Robotics
ๆฅ็ๆ่ฆ
Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.
๐
2025-12-02
๐ AAAI 2026
๐ฅ Fengli Ran, Xiao Pu, Bo Liu, Xiuli Bi, Bin Xiao
Machine Learning
ๆฅ็ๆ่ฆ
Dataset distillation compresses large datasets into compact synthetic ones to reduce storage and computational costs. Among various approaches, distribution matching (DM)-based methods have attracted attention for their high efficiency. However, they often overlook the evolution of feature representations during training, which limits the expressiveness of synthetic data and weakens downstream performance. To address this issue, we propose Trajectory Guided Dataset Distillation (TGDD), which reformulates distribution matching as a dynamic alignment process along the model's training trajectory. At each training stage, TGDD captures evolving semantics by aligning the feature distribution between the synthetic and original dataset. Meanwhile, it introduces a distribution constraint regularization to reduce class overlap. This design helps synthetic data preserve both semantic diversity and representativeness, improving performance in downstream tasks. Without additional optimization overhead, TGDD achieves a favorable balance between performance and efficiency. Experiments on ten datasets demonstrate that TGDD achieves state-of-the-art performance, notably a 5.0% accuracy gain on high-resolution benchmarks.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhongyi Cai, Yi Du, Chen Wang, Yu Kong
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in real life, agents often face sequential tasks, where each new sub-task follows the completion of the previous one, and certain sub-tasks may be infeasible, such as searching for a non-existent object. Compared with the single-task setting, the core challenge lies in reusing spatial knowledge accumulated from previous explorations to support subsequent reasoning and exploration. In this work, we investigate this underexplored yet practically significant embodied AI challenge. To evaluate this challenge, we introduce SEER-Bench, a new Sequential Embodied Exploration and Reasoning Benchmark encompassing encompassing two classic embodied tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on SEER-Bench, we propose 3DSPMR, a 3D SPatial Memory Reasoning approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the best of our knowledge, this is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning. Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi et al.
ๆฅ็ๆ่ฆ
Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sourabh Sharma, Sonam Gupta, Sadbhawna
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model's ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kerui Chen, Jianrong Zhang, Ming Li, Zhonglong Zheng, Hehe Fan
ๆฅ็ๆ่ฆ
Existing stylized motion generation models have shown their remarkable ability to understand specific style information from the style motion, and insert it into the content motion. However, capturing intra-style diversity, where a single style should correspond to diverse motion variations, remains a significant challenge. In this paper, we propose a clustering-based framework, ClusterStyle, to address this limitation. Instead of learning an unstructured embedding from each style motion, we leverage a set of prototypes to effectively model diverse style patterns across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of motion sequences. These components jointly shape two structured style embedding spaces, i.e., global and local, optimized via alignment with non-learnable prototype anchors. Furthermore, we augment the pretrained text-to-motion generation model with the Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models in stylized motion generation and motion style transfer.
๐
2025-12-02
๐ NeurIPS 2025
๐ฅ Valentin Bieri, Marie-Julie Rakotosaona, Keisuke Tateno, Francis Engelmann, Leonidas Guibas
ๆฅ็ๆ่ฆ
Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single room or single floor environments. As a consequence, they cannot natively handle large multi floor buildings and require scenes to be split into individual floors before processing, which removes global spatial context that is essential for reasoning about structures such as staircases that connect multiple levels. In this work, we introduce HouseLayout3D, a real world benchmark designed to support progress toward full building scale layout estimation, including multiple floors and architecturally intricate spaces. We also present MultiFloor3D, a simple training free baseline that leverages recent scene understanding methods and already outperforms existing 3D layout estimation models on both our benchmark and prior datasets, highlighting the need for further research in this direction. Data and code are available at: https://houselayout3d.github.io.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Whye Kit Fong, Venice Erin Liong, Kok Seang Tan, Holger Caesar
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization \& mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fan Luo, Zeyu Gao, Xinhao Luo, Kai Zhao, Yanfeng Lu
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Spiking Neural Networks (SNNs), with their brain-inspired spatiotemporal dynamics and spike-driven computation, have emerged as promising energy-efficient alternatives to Artificial Neural Networks (ANNs). However, existing SNNs typically replicate inputs directly or aggregate them into frames at fixed intervals. Such strategies lead to neurons receiving nearly identical stimuli across time steps, severely limiting the model's expressive power, particularly in complex tasks like object detection. In this work, we propose the Temporal Dynamics Enhancer (TDE) to strengthen SNNs' capacity for temporal information modeling. TDE consists of two modules: a Spiking Encoder (SE) that generates diverse input stimuli across time steps, and an Attention Gating Module (AGM) that guides the SE generation based on inter-temporal dependencies. Moreover, to eliminate the high-energy multiplication operations introduced by the AGM, we propose a Spike-Driven Attention (SDA) to reduce attention-related energy consumption. Extensive experiments demonstrate that TDE can be seamlessly integrated into existing SNN-based detectors and consistently outperforms state-of-the-art methods, achieving mAP50-95 scores of 57.7% on the static PASCAL VOC dataset and 47.6% on the neuromorphic EvDET200K dataset. In terms of energy consumption, the SDA consumes only 0.240 times the energy of conventional attention modules.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junghwan Park, Woojin Cho, Junhyuk Heo, Darongsae Kwon, Kookjin Lee
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Phuc Pham, Nhu Pham, Ngoc Quoc Ly
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In medical healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in the few-shot adaption, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods. The implementation of our method is available at https://github.com/phphuc612/MSD .
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daeyoung Kim
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
As a representative optic degenerative condition, glaucoma has been a threat to millions due to its irreversibility and severe impact on human vision fields. Mainly characterized by dimmed and blurred visions, or peripheral vision loss, glaucoma is well known to occur due to damages in the optic nerve from increased intraocular pressure (IOP) or neovascularization within the retina. Traditionally, most glaucoma related works and clinical diagnosis focused on detecting these damages in the optic nerve by using patient data from perimetry tests, optic papilla inspections and tonometer-based IOP measurements. Recently, with advancements in computer vision AI models, such as VGG16 or Vision Transformers (ViT), AI-automatized glaucoma detection and optic cup segmentation based on retinal fundus images or OCT recently exhibited significant performance in aiding conventional diagnosis with high performance. However, current AI-driven glaucoma detection approaches still have significant room for improvement in terms of reliability, excessive parameter usage, possibility of spurious correlation within detection, and limitations in applications to intervention analysis or clinical simulations. Thus, this research introduced a novel causal representation driven glaucoma detection model: LightHCG, an extremely lightweight Convolutional VAE-based latent glaucoma representation model that can consider the true causality among glaucoma-related physical factors within the optic nerve region. Using HSIC-based latent space disentanglement and Graph Autoencoder based unsupervised causal representation learning, LightHCG not only exhibits higher performance in classifying glaucoma with 93~99% less weights, but also enhances the possibility of AI-driven intervention analysis, compared to existing advanced vision models such as InceptionV3, MobileNetV2 or VGG16.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haolong Yan, Yeqing Shen, Xin Huang, Jia Wang, Kaijun Tan et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile Apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Saeid Jamshidi, Kawser Wazed Nafi, Arghavan Moradi Dakhel, Negar Shahabi, Foutse Khomh
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts. Existing alignment frameworks, structured approaches designed to align model behavior with human ethical and social norms, often rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning may evolve across different contexts or temporal scales. This study presents the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of LLMs. MoCoP combines three supporting layers: (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Our empirical results on GPT-4-Turbo and DeepSeek suggest that MoCoP effectively captures longitudinal ethical behavior, revealing a strong inverse relationship between ethical and toxicity dimensions (correlation rET = -0.81, p value less than 0.001) and a near-zero association with response latency (correlation rEL approximately equal to 0). These findings demonstrate that moral coherence and linguistic safety tend to emerge as stable and interpretable characteristics of model behavior rather than short-term fluctuations. Furthermore, by reframing ethical evaluation as a dynamic, model-agnostic form of moral introspection, MoCoP offers a reproducible foundation for scalable, continuous auditing and advances the study of computational morality in autonomous AI systems.
๐
2025-12-02
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Chenji Lu, Zhuo Chen, Hui Zhao, Zhiyuan Zeng, Gang Zhao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27\% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.
๐
2025-12-02
๐ IEEE ISNCC 2025
๐ฅ Muyu Pan, Dheeraj Kodakandla, Mahfuza Farooque
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic without human intervention. This enables automated reasoning and facilitates debugging, finding loop invariants, and adhering to specifications in software systems. However, hallucinations-incorrect outputs generated by LLMs are challenging, particularly for logical translation tasks requiring precision. This work introduces a novel framework that inputs English sentences, converts them into logical expressions, and then translates them into Conjunctive Normal Form (CNF) for satisfiability solving. It employs classical NLP techniques with self-defined grammar, symbolic computation libraries, and a fine-tuned language model to reduce hallucinations. In the early experiments, we observed that the fine-tuned model, trained on different grammar settings, could intentionally correct the same types of hallucinations made by the original model. Thus, it provides reliable CNF generation.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuan Xiong, Ziqi Miao, Lijun Li, Chen Qian, Jie Li et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embeds harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporate contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31\% and 91.07\%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision--Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing quantization error of vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. The AutoNeural also delivers 3x decoding speed and 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
๐
2025-12-02
๐ NeurIPS 2025
๐ฅ Zhonghao He, Tianyi Qiu, Hirokazu Shirado, Maarten Sap
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates are unpredictable from the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, which signal deviation from the Bayesian ability of updating on new evidence. In open-ended problem domains including event forecasting, value-laden questions, and academic paper review, we find such violations to be widespread across models and setups, where the current belief positively predicts future belief updates, a phenomenon which we term belief entrenchment. We identify the models, reasoning techniques, and domains more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy of the truth-seeking ability of a reasoning process.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluated SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves $3.8$-$4.0\times$ speedups while retaining $99.8$-$100\%$ of the baseline score on average. On base models, SchED yields consistent speedup gains with $99.1$-$100\%$ performance retention, with up to $2.34\times$ under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, $ฮณ{=}4$), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model's token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Youkang Wang, Jian Wang, Rubing Chen, Tianyi Zeng, Xiao-Yong Wei et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Haonan Wang, Chao Du, Kenji Kawaguchi, Tianyu Pang
Natural Language Processing
ๆฅ็ๆ่ฆ
Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a "majority" over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yalda Keivan Jafari, Krishno Dey
Natural Language Processing
ๆฅ็ๆ่ฆ
Hate speech recognition in low-resource languages remains a difficult problem due to insufficient datasets, orthographic heterogeneity, and linguistic variety. Bangla is spoken by more than 230 million people of Bangladesh and India (West Bengal). Despite the growing need for automated moderation on social media platforms, Bangla is significantly under-represented in computational resources. In this work, we study Subtask 1A and Subtask 1B of the BLP 2025 Shared Task on hate speech detection. We reproduce the official baselines (e.g., Majority, Random, Support Vector Machine) and also produce and consider Logistic Regression, Random Forest, and Decision Tree as baseline methods. We also utilized transformer-based models such as DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa for hate speech classification. All the transformer-based models outperformed baseline methods for the subtasks, except for DistilBERT. Among the transformer-based models, BanglaBERT produces the best performance for both subtasks. Despite being smaller in size, BanglaBERT outperforms both m-BERT and XLM-RoBERTa, which suggests language-specific pre-training is very important. Our results highlight the potential and need for pre-trained language models for the low-resource Bangla language.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lechen Zhang, Yusheng Zhou, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tom Zehle, Timo Heiร, Moritz Schlager, Matthias Aรenmacher, Matthias Feurer
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Prompt optimization has become crucial for enhancing the performance of large language models (LLMs) across a broad range of tasks. Although many research papers show its effectiveness, practical adoption is hindered as existing implementations are often tied to unmaintained and isolated research codebases. To address this, we introduce promptolution, a unified and modular open-source framework that provides all components required for prompt optimization within a single extensible system for both practitioners and researchers. It integrates multiple contemporary discrete prompt optimizers while remaining agnostic to the underlying LLM implementation.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/ .
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present \textbf{BOOM}, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline}\footnote{All released code and models are licensed under the MIT License.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kunning Li, Jianbin Guo, Zhaoyang Shang, Yiqing Liu, Hongmin Du et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The emergence of Large Language Models (LLMs) within the Traditional Chinese Medicine (TCM) domain presents an urgent need to assess their clinical application capabilities. However, such evaluations are challenged by the individualized, holistic, and diverse nature of TCM's "Syndrome Differentiation and Treatment" (SDT). Existing benchmarks are confined to knowledge-based question-answering or the accuracy of syndrome differentiation, often neglecting assessment of treatment decision-making. Here, we propose a comprehensive, clinical case-based benchmark spearheaded by TCM experts, and a specialized reward model employed to quantify prescription-syndrome congruence. Data annotation follows a rigorous pipeline. This benchmark, designated TCM-BEST4SDT, encompasses four tasks, including TCM Basic Knowledge, Medical Ethics, LLM Content Safety, and SDT. The evaluation framework integrates three mechanisms, namely selected-response evaluation, judge model evaluation, and reward model evaluation. The effectiveness of TCM-BEST4SDT was corroborated through experiments on 15 mainstream LLMs, spanning both general and TCM domains. To foster the development of intelligent TCM research, TCM-BEST4SDT is now publicly available.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yixuan Tang, Yi Yang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mike Nkongolo, Hilton Vorster, Josh Warren, Trevor Naick, Deandre Vanmali et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Low-resource African languages remain underrepresented in sentiment analysis, limiting both lexical coverage and the performance of multilingual Natural Language Processing (NLP) systems. This study proposes TriLex, a three-stage retrieval augmented framework that unifies corpus-based extraction, cross lingual mapping, and retrieval augmented generation (RAG) driven lexical refinement to systematically expand sentiment lexicons for low-resource languages. Using the enriched lexicon, the performance of two prominent African pretrained language models (AfroXLMR and AfriBERTa) is evaluated across multiple case studies. Results demonstrate that AfroXLMR delivers superior performance, achieving F1-scores above 80% for isiXhosa and isiZulu and exhibiting strong cross-lingual stability. Although AfriBERTa lacks pre-training on these target languages, it still achieves reliable F1-scores around 64%, validating its utility in computationally constrained settings. Both models outperform traditional machine learning baselines, and ensemble analyses further enhance precision and robustness. The findings establish TriLex as a scalable and effective framework for multilingual sentiment lexicon expansion and sentiment modeling in low-resource South African languages.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Juexi Shao, Siyou Li, Yujian Gan, Chris Madge, Vanja Karan et al.
Robotics
ๆฅ็ๆ่ฆ
Dialogue-Based Generalized Referring Expressions Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Weihang Su, Jianming Long, Changyue Wang, Shiyu Lin, Jingyan Xu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) frequently exhibit hallucinations, generating content that appears fluent and coherent but is factually incorrect. Such errors undermine trust and hinder their adoption in real-world applications. To address this challenge, two distinct research paradigms have emerged: model-centric Hallucination Detection (HD) and text-centric Fact Verification (FV). Despite sharing the same goal, these paradigms have evolved in isolation, using distinct assumptions, datasets, and evaluation protocols. This separation has created a research schism that hinders their collective progress. In this work, we take a decisive step toward bridging this divide. We introduce UniFact, a unified evaluation framework that enables direct, instance-level comparison between FV and HD by dynamically generating model outputs and corresponding factuality labels. Through large-scale experiments across multiple LLM families and detection methods, we reveal three key findings: (1) No paradigm is universally superior; (2) HD and FV capture complementary facets of factual errors; and (3) hybrid approaches that integrate both methods consistently achieve state-of-the-art performance. Beyond benchmarking, we provide the first in-depth analysis of why FV and HD diverged, as well as empirical evidence supporting the need for their unification. The comprehensive experimental results call for a new, integrated research agenda toward unifying Hallucination Detection and Fact Verification in LLMs.
We have open-sourced all the code, data, and baseline implementation at: https://github.com/oneal2000/UniFact/
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Robert Belanec, Ivan Srba, Maria Bielikova
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiahao Zhao, Shuaixing Zhang, Nan Xu, Lei Wang
ๆฅ็ๆ่ฆ
LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.
๐
2025-12-02
๐ ArXiv Preprint (cs.PL)
๐ฅ Joel Nyholm, Wojciech Mostowski, Christoph Reichenbach
ๆฅ็ๆ่ฆ
Energy consumption is a growing concern in several fields, from mobile devices to large data centers. Developers need detailed data on the energy consumption of their software to mitigate consumption issues. Previous approaches have a broader focus, such as on specific functions or programs, rather than source code statements. They primarily focus on estimating the CPU's energy consumption using point estimates, thereby disregarding other hardware effects and limiting their use for statistical reasoning and explainability. We developed a novel methodology to address the limitations of measuring only the CPU's consumption and using point estimates, focusing on predicting the energy usage of statically typed JVM-based programming languages, such as Java and Scala. We measure the energy consumption of Bytecode patterns, the translation from the programming language's source code statement to their Java Bytecode representation. With the energy measurements, we construct a statistical model using Bayesian statistics, which allows us to predict the energy consumption through statistical distributions and analyze individual factors. The model includes three factors we obtain statically from the code: data size, data type, operation, and one factor about the hardware platform the code executes on: device. To validate our methodology, we implemented it for Java and evaluated its energy predictions on unseen programs. We observe that all four factors are influential, notably that two devices of the same model may differ in energy consumption and that the operations and data types cause consumption differences. The experiments also show that the energy prediction of programs closely follows the program's real energy consumption, validating our approach. Our work presents a methodology for constructing an energy model that future work, such as verification tools, can use for their energy estimates.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Julian Ma, Jun Wang, Zafeirios Fountas
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lavish Bansal, Naman Mishra
Natural Language Processing
ๆฅ็ๆ่ฆ
Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world's population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Daiki Shirafuji, Tatsuhiko Saito, Yasutomo Kimura
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent work has explored ``editing'' LLM parameters to mitigate social bias with model merging approaches; however, there is no empirical comparison. In this work, we empirically survey seven algorithms: Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, applying 13 open weight models in the GPT, LLaMA, and Qwen families. We perform a comprehensive evaluation using three bias datasets (BBQ, BOLD, and HONEST) and measure the impact of these techniques on LLM performance in downstream tasks of the SuperGLUE benchmark. We find a trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, particularly on tasks requiring reading comprehension and commonsense and causal reasoning. Among the merging algorithms, Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights emerging as the most balanced choice. These results highlight the potential of model merging algorithms for bias mitigation, while indicating that excessive debiasing or inappropriate merging methods may lead to the degradation of important linguistic abilities.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jing Ma
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are now used in settings such as Google's AI Overviews, where it summarizes multiple long documents. However, it remains unclear whether they weight all inputs equally. Focusing on abortion-related news, we construct 40 pro-neutral-con article triplets, permute each triplet into six input orders, and prompt Gemini 2.5 Flash to generate a neutral overview. We evaluate each summary against its source articles using ROUGE-L (lexical overlap), BERTScore (semantic similarity), and SummaC (factual consistency). One-way ANOVA reveals a significant primacy effect for BERTScore across all stances, indicating that summaries are more semantically aligned with the first-seen article. Pairwise comparisons further show that Position 1 differs significantly from Positions 2 and 3, while the latter two do not differ from each other, confirming a selective preference for the first document. The findings present risks for applications that rely on LLM-generated overviews and for agentic AI systems, where the steps involving LLMs can disproportionately influence downstream actions.
๐
2025-12-02
๐ EMNLP 2025
๐ฅ Chao-Han Huck Yang, Andreas Stolcke, Larry Heck
Natural Language Processing
ๆฅ็ๆ่ฆ
Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval-and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.
๐
2025-12-02
๐ AAAI 2026
๐ฅ Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
๐
2025-12-02
๐ SIGIR 2025
๐ฅ Zheng Fang, Donghao Xie, Ming Pang, Chunyuan Yuan, Xue Jiang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and neural models' reliance on the scarcity of domain-specific hard samples. We propose ADORE, a self-sustaining framework that synergizes three innovations: (1) A Rule-aware Relevance Discrimination module, where a Chain-of-Thought LLM generates intent-aligned training data, refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) An Error-type-aware Data Synthesis module that auto-generates adversarial examples to harden robustness; and (3) A Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. ADORE automates annotation, adversarial generation, and distillation, overcoming data scarcity while enhancing reasoning. Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Francesco Paolo Savatteri, Chahan Vidal-Gorรจne, Florian Cafiero
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We present an evaluation-driven study of two practical tasks regarding online misinformation: (i) fake-news detection and (ii) virality prediction in the context of operational settings, with the necessity for rapid reaction. Using the EVONS and FakeNewsNet datasets, we compare textual embeddings (RoBERTa; with a control using Mistral) against lightweight numeric features (timing, follower counts, verification, likes) and sequence models (GRU, gating architectures, Transformer encoders). We show that textual content alone is a strong discriminator for fake-news detection, while numeric-only pipelines remain viable when language models are unavailable or compute is constrained. Virality prediction is markedly harder than fake-news detection and is highly sensitive to label construction; in our setup, a median-based ''viral'' split (<50 likes) is pragmatic but underestimates real-world virality, and time-censoring for engagement features is desirable yet difficult under current API limits. Dimensionality-reduction analyses suggest non-linear structure is more informative for virality than for fake-news detection (t-SNE > PCA on numeric features). Swapping RoBERTa for Mistral embeddings yields only modest deltas, leaving conclusions unchanged. We discuss implications for evaluation design and report reproducibility constraints that realistically affect the field. We release splits and code where possible and provide guidance for metric selection.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Parth Pulkundwar, Vivek Dhanawade, Rohit Yadav, Minal Sonkar, Medha Asurlekar et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Traditional language models face a challenge from hallucinations. Their very presence casts a large, dangerous shadow over the promising realm of natural language processing. It becomes crucial to understand the various kinds of hallucinations that occur nowadays, their origins, and ways of reducing them. This document provides a concise and straightforward summary of that. It serves as a one-stop resource for a general understanding of hallucinations and how to mitigate them.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang et al.
Robotics
ๆฅ็ๆ่ฆ
AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sourabh Sharma, Sonam Gupta, Sadbhawna
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model's ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.
๐
2025-12-02
๐ AAAI 2026
๐ฅ Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz
Natural Language Processing
ๆฅ็ๆ่ฆ
Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50\% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from $\sim$5\% to $\sim$40\% while Grok 4 Fast decreases from $\sim$80\% to $\sim$10\% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
๐
2025-12-02
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Shogo Ohmae, Keiko Ohmae
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in general-purpose AI systems with attention-based transformers offer a potential window into how the neocortex and cerebellum, despite their relatively uniform circuit architectures, give rise to diverse functions and, ultimately, to human intelligence. This Perspective provides a cross-domain comparison between the brain and AI that goes beyond the traditional focus on visual processing, adopting the emerging perspecive of world-model-based computation. Here, we identify shared computational mechanisms in the attention-based neocortex and the non-attentional cerebellum: both predict future world events from past inputs and construct internal world models through prediction-error learning. These predictive world models are repurposed for seemingly distinct functions -- understanding in sensory processing and generation in motor processing -- enabling the brain to achieve multi-domain capabilities and human-like adaptive intelligence. Notably, attention-based AI has independently converged on a similar learning paradigm and world-model-based computation. We conclude that these shared mechanisms in both biological and artificial systems constitute a core computational foundation for realizing diverse functions including high-level intelligence, despite their relatively uniform circuit structures. Our theoretical insights bridge neuroscience and AI, advancing our understanding of the computational essence of intelligence.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yunchao Wang, Guodao Sun, Zihang Fu, Zhehao Liu, Kaixing Du et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately translate user intent into satisfactory story outputs due to a lack of fine-grained control and unclear input specifications, limiting their applicability. To address this, we propose TaleFrame, a system that combines large language models (LLMs) with human-computer interaction (HCI) to generate stories through structured information, enabling precise control over the generation process. The innovation of TaleFrame lies in decomposing the story structure into four basic units: entities, events, relationships, and story outline. We leverage the Tinystories dataset, parsing and constructing a preference dataset consisting of 9,851 JSON-formatted entries, which is then used to fine-tune a local Llama model. By employing this JSON2Story approach, structured data is transformed into coherent stories. TaleFrame also offers an intuitive interface that supports users in creating and editing entities and events and generates stories through the structured framework. Users can control these units through simple interactions (e.g., drag-and-drop, attach, and connect), thus influencing the details and progression of the story. The generated stories can be evaluated across seven dimensions (e.g., creativity, structural integrity), with the system providing suggestions for refinement based on these evaluations. Users can iteratively adjust the story until a satisfactory result is achieved. Finally, we conduct quantitative evaluation and user studies that demonstrate the usefulness of TaleFrame. Dataset available at https://huggingface.co/datasets/guodaosun/tale-frame.
๐
2025-12-02
๐ ArXiv Preprint (cs.SE)
๐ฅ Shuyang Liu, Yang Chen, Rahul Krishna, Saurabh Sinha, Jatin Ganhotra et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Agentic systems are modern software systems: they consist of orchestrated modules, expose interfaces, and are deployed in software pipelines. Unlike conventional programs, their execution (i.e., trajectories) is inherently stochastic and adaptive to the problem they are solving. Evaluation of such systems is often outcome-centric, judging their performance based on success or failure at the final step. This narrow focus overlooks detailed insights about such systems, failing to explain how agents reason, plan, act, or change their strategies over time. Inspired by the structured representation of conventional software systems as graphs, we introduce Graphectory to systematically encode the temporal and semantic relations in such software systems. Graphectory facilitates the design of process-centric metrics and analyses to assess the quality of agentic workflows independent of final success.
Using Graphectory, we analyze 4000 trajectories of two dominant agentic programming workflows, namely SWE-agent and OpenHands, with a combination of four backbone Large Language Models (LLMs), attempting to resolve SWE-bench Verified issues. Our fully automated analyses reveal that: (1) agents using richer prompts or stronger LLMs exhibit more complex Graphectory, reflecting deeper exploration, broader context gathering, and more thorough validation before patch submission; (2) agents' problem-solving strategies vary with both problem difficulty and the underlying LLM -- for resolved issues, the strategies often follow coherent localization-patching-validation steps, while unresolved ones exhibit chaotic, repetitive, or backtracking behaviors; (3) even when successful, agentic programming systems often display inefficient processes, leading to unnecessarily prolonged trajectories.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lei Fu, Xiang Chen, Kaige Gao Xinyue Huang, Kejian Tong
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Domain-specific question answering (QA) systems for services face unique challenges in integrating heterogeneous knowledge sources while ensuring both accuracy and safety. Existing large language models often struggle with factual consistency and context alignment in sensitive domains such as healthcare policies and government welfare. In this work, we introduce Knowledge-Aware Reasoning and Memory-Augmented Adaptation (KARMA), a novel framework designed to enhance QA performance in care scenarios. KARMA incorporates a dual-encoder architecture to fuse structured and unstructured knowledge sources, a gated memory unit to dynamically regulate external knowledge integration, and a safety-aware controllable decoder that mitigates unsafe outputs using safety classification and guided generation techniques. Extensive experiments on a proprietary QA dataset demonstrate that KARMA outperforms strong baselines in both answer quality and safety. This study offers a comprehensive solution for building trustworthy and adaptive QA systems in service contexts.
๐
2025-12-02
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou, Kun Wang, Jie Zhang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in reasoning, planning, and tool usage. The recently proposed Model Context Protocol (MCP) has emerged as a unifying framework for integrating external tools into agent systems, enabling a thriving open ecosystem of community-built functionalities. However, the openness and composability that make MCP appealing also introduce a critical yet overlooked security assumption -- implicit trust in third-party tool providers. In this work, we identify and formalize a new class of attacks that exploit this trust boundary without violating explicit permissions. We term this new attack vector implicit toxicity, where malicious behaviors occur entirely within the allowed privilege scope. We propose LeechHijack, a Latent Embedded Exploit for Computation Hijacking, in which an adversarial MCP tool covertly expropriates the agent's computational resources for unauthorized workloads. LeechHijack operates through a two-stage mechanism: an implantation stage that embeds a benign-looking backdoor in a tool, and an exploitation stage where the backdoor activates upon predefined triggers to establish a command-and-control channel. Through this channel, the attacker injects additional tasks that the agent executes as if they were part of its normal workflow, effectively parasitizing the user's compute budget. We implement LeechHijack across four major LLM families. Experiments show that LeechHijack achieves an average success rate of 77.25%, with a resource overhead of 18.62% compared to the baseline. This study highlights the urgent need for computational provenance and resource attestation mechanisms to safeguard the emerging MCP ecosystem.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OMNIGUARD, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks in omni-modalities, paving the way toward building more robust and capable omnimodal safeguarding systems.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Boya Zhang, Alban Bornet, Rui Yang, Nan Liu, Douglas Teodoro
Natural Language Processing
ๆฅ็ๆ่ฆ
How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict, an expert-verified dataset comprising 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents presenting contradictory stances. We consider several prompt settings, including correct, incorrect or contradictory context, and measure their impact on model outputs. Compared to existing medical question-answering evaluation benchmarks, HealthContradict provides greater distinctions of language models' contextual reasoning capabilities. Our experiments show that the strength of fine-tuned biomedical language models lies not only in their parametric knowledge from pretraining, but also in their ability to exploit correct context while resisting incorrect context.
๐
2025-12-02
๐ ArXiv Preprint (eess.SY)
๐ฅ Ibrahim Laiche, Mokrane Boudaoud, Patrick Gallinari, Pascal Morin
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This article investigates the modeling and control of Lagrangian systems involving non-conservative forces using a hybrid method that does not require acceleration calculations. It focuses in particular on the derivation and identification of physically consistent models, which are essential for model-based control synthesis. Lagrangian or Hamiltonian neural networks provide useful structural guarantees but the learning of such models often leads to inconsistent models, especially on real physical systems where training data are limited, partial and noisy. Motivated by this observation and the objective to exploit these models for model-based nonlinear control, a learning algorithm relying on an original loss function is proposed to improve the physical consistency of Lagrangian systems. A comparative analysis of different learning-based modeling approaches with the proposed solution shows significant improvements in terms of physical consistency of the learned models, on both simulated and experimental systems. The model's consistency is then exploited to demonstrate, on an experimental benchmark, the practical relevance of the proposed methodology for feedback linearization and energy-based control techniques.
๐
2025-12-02
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Chenji Lu, Zhuo Chen, Hui Zhao, Zhiyuan Zeng, Gang Zhao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27\% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.
๐
2025-12-02
๐ AAAI
๐ฅ Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. However, existing benchmarks focus on either training/fine-tuning or performance of inference and provide little support for power consumption measurement and analysis of inference. We introduce TokenPowerBench, the first lightweight and extensible benchmark designed for LLM-inference power consumption studies. The benchmark combines (i) a declarative configuration interface covering model choice, prompt set, and inference engine, (ii) a measurement layer that captures GPU-, node-, and system-level power without specialized power meters, and (iii) a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. These elements make it straight-forward to explore the power consumed by an LLM inference run; furthermore, by varying batch size, context length, parallelism strategy and quantization, users can quickly assess how each setting affects joules per token and other energy-efficiency metrics. We evaluate TokenPowerBench on four of the most widely used model series (Llama, Falcon, Qwen, and Mistral). Our experiments cover from 1 billion parameters up to the frontier-scale Llama3-405B model. Furthermore, we release TokenPowerBench as open source to help users to measure power consumption, forecast operating expenses, and meet sustainability targets when deploying LLM services.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh
Natural Language Processing
ๆฅ็ๆ่ฆ
Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Louis McConnell, Jieran Sun, Theo Maffei, Raphael Gottardo, Marianna Rapsomaniki
ๆฅ็ๆ่ฆ
Understanding the spatial architecture of the tumor microenvironment (TME) is critical to advance precision oncology. We present ProteinPNet, a novel framework based on prototypical part networks that discovers TME motifs from spatial proteomics data. Unlike traditional post-hoc explanability models, ProteinPNet directly learns discriminative, interpretable, faithful spatial prototypes through supervised training. We validate our approach on synthetic datasets with ground truth motifs, and further test it on a real-world lung cancer spatial proteomics dataset. ProteinPNet consistently identifies biologically meaningful prototypes aligned with different tumor subtypes. Through graphical and morphological analyses, we show that these prototypes capture interpretable features pointing to differences in immune infiltration and tissue modularity. Our results highlight the potential of prototype-based learning to reveal interpretable spatial biomarkers within the TME, with implications for mechanistic discovery in spatial omics.
๐
2025-12-02
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Paul Barbaste, Olivier Oullier, Xavier Vasques
Multimodal
ๆฅ็ๆ่ฆ
Robust decoding and classification of brain patterns measured with electroencephalography (EEG) remains a major challenge for real-world (i.e. outside scientific lab and medical facilities) brain-computer interface (BCI) applications due to well documented inter- and intra-participant variability. Here, we present a large-scale benchmark evaluating over 340,000+ unique combinations of spatial and nonlinear EEG classification. Our methodological pipeline consists in combinations of Common Spatial Patterns (CSP), Riemannian geometry, functional connectivity, and fractal- or entropy-based features across three open-access EEG datasets. Unlike prior studies, our analysis operates at the per-participant level and across multiple frequency bands (8-15 Hz and 8-30 Hz), enabling direct assessment of both group-level performance and individual variability. Covariance tangent space projection (cov-tgsp) and CSP consistently achieved the highest average classification accuracies. However, their effectiveness was strongly dataset-dependent, and marked participant-level differences persisted, particularly in the most heterogeneous of the datasets. Importantly, nonlinear methods outperformed spatial approaches for specific individuals, underscoring the need for personalized pipeline selection. Our findings highlight that no universal 'one-size-fits-all' method can optimally decode EEG motor imagery patterns across all users or datasets. Future work will require adaptive, multimodal, and possibly novel approaches to fully address neurophysiological variability in practical BCI applications where the system can automatically adapt to what makes each user unique.
๐
2025-12-02
๐ ArXiv Preprint (gr-qc)
๐ฅ Annalena Kofler, Maximilian Dax, Stephen R. Green, Jonas Wildberger, Nihar Gupte et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Gravitational-wave data analysis relies on accurate and efficient methods to extract physical information from noisy detector signals, yet the increasing rate and complexity of observations represent a growing challenge. Deep learning provides a powerful alternative to traditional inference, but existing neural models typically lack the flexibility to handle variations in data analysis settings. Such variations accommodate imperfect observations or are required for specialized tests, and could include changes in detector configurations, overall frequency ranges, or localized cuts. We introduce a flexible transformer-based architecture paired with a training strategy that enables adaptation to diverse analysis settings at inference time. Applied to parameter estimation, we demonstrate that a single flexible model -- called Dingo-T1 -- can (i) analyze 48 gravitational-wave events from the third LIGO-Virgo-KAGRA Observing Run under a wide range of analysis configurations, (ii) enable systematic studies of how detector and frequency configurations impact inferred posteriors, and (iii) perform inspiral-merger-ringdown consistency tests probing general relativity. Dingo-T1 also improves median sample efficiency on real events from a baseline of 1.4% to 4.2%. Our approach thus demonstrates flexible and scalable inference with a principled framework for handling missing or incomplete data -- key capabilities for current and next-generation observatories.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jennifer Zvonek, Andrew Gillette
Machine Learning
ๆฅ็ๆ่ฆ
An implicit neural representation (INR) is a neural network that approximates a spatiotemporal function. Many memory-intensive visualization tasks, including modern 4D CT scanning methods, represent data natively as INRs. While INRs are prized for being more memory-efficient than traditional data stored on a lattice, many visualization tasks still require discretization to a regular grid. We present PruningAMR, an algorithm that builds a mesh with resolution adapted to geometric features encoded by the INR. To identify these geometric features, we use an interpolative decomposition pruning method on the weight matrices of the INR. The resulting pruned network is used to guide adaptive mesh refinement, enabling automatic mesh generation tailored to the underlying resolution of the function. Starting from a pre-trained INR--without access to its training data--we produce a variable resolution visualization with substantial memory savings.
๐
2025-12-02
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Samuel Andrello, Daniel Alabi, Simon J. L. Billinge
Machine Learning
ๆฅ็ๆ่ฆ
While machine learning has enabled the rapid prediction of inorganic materials with novel properties, the challenge of determining how to synthesize these materials remains largely unsolved. Previous work has largely focused on predicting precursors or reaction conditions, but only rarely on full synthesis pathways. We introduce the ActionGraph, a directed acyclic graph framework that encodes both the chemical and procedural structure, in terms of synthesis operations, of inorganic synthesis reactions. Using 13,017 text-mined solid-state synthesis reactions from the Materials Project, we show that incorporating PCA-reduced ActionGraph adjacency matrices into a $k$-nearest neighbors retrieval model significantly improves synthesis pathway prediction. While the ActionGraph framework only results in a 1.34% and 2.76% increase in precursor and operation F1 scores (average over varying numbers of PCA components) respectively, the operation length matching accuracy rises 3.4 times (from 15.8% to 53.3%). We observe an interesting trade-off where precursor prediction performance peaks at 10-11 PCA components while operation prediction continues improving up to 30 components. This suggests composition information dominates precursor selection while structural information is critical for operation sequencing. Overall, the ActionGraph framework demonstrates strong potential, and with further adoption, its full range of benefits can be effectively realized.
๐
2025-12-02
๐ Accepted by the INFORMS Journal on Data Science
๐ฅ Ahmadreza Chokhachian, Matthias Katzfuss, Yu Ding
ๆฅ็ๆ่ฆ
This paper is concerned with the problem of how to speed up computation for Gaussian process models trained on autocorrelated data. The Gaussian process model is a powerful tool commonly used in nonlinear regression applications. Standard regression modeling assumes random samples and an independently, identically distributed noise. Various fast approximations that speed up Gaussian process regression work under this standard setting. But for autocorrelated data, failing to account for autocorrelation leads to a phenomenon known as temporal overfitting that deteriorates model performance on new test instances. To handle autocorrelated data, existing fast Gaussian process approximations have to be modified; one such approach is to segment the originally correlated data points into blocks in which the blocked data are de-correlated. This work explains how to make some of the existing Gaussian process approximations work with blocked data. Numerical experiments across diverse application datasets demonstrate that the proposed approaches can remarkably accelerate computation for Gaussian process regression on autocorrelated data without compromising model prediction performance.
๐
2025-12-02
๐ KDD
๐ฅ Ziniu Zhang, Minxuan Duan, Haris N. Koutsopoulos, Hongyang R. Zhang
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region's weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of $90.1\%$, which is a $3.7\%$ gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by $24\%$ under higher precipitation, by $22\%$ on higher-speed roads such as motorways, and by $29\%$ due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
๐
2025-12-02
๐ NeurIPS 2025
๐ฅ Zhonghao He, Tianyi Qiu, Hirokazu Shirado, Maarten Sap
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates are unpredictable from the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, which signal deviation from the Bayesian ability of updating on new evidence. In open-ended problem domains including event forecasting, value-laden questions, and academic paper review, we find such violations to be widespread across models and setups, where the current belief positively predicts future belief updates, a phenomenon which we term belief entrenchment. We identify the models, reasoning techniques, and domains more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy of the truth-seeking ability of a reasoning process.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anuran Makur, Japneet Singh
ๆฅ็ๆ่ฆ
In this work, we develop a hypothesis testing framework to determine whether pairwise comparison data is generated by an underlying \emph{generalized Thurstone model} $\mathcal{T}_F$ for a given choice function $F$. While prior work has predominantly focused on parameter estimation and uncertainty quantification for such models, we address the fundamental problem of minimax hypothesis testing for $\mathcal{T}_F$ models. We formulate this testing problem by introducing a notion of separation distance between general pairwise comparison models and the class of $\mathcal{T}_F$ models. We then derive upper and lower bounds on the critical threshold for testing that depend on the topology of the observation graph. For the special case of complete observation graphs, this threshold scales as $ฮ((nk)^{-1/2})$, where $n$ is the number of agents and $k$ is the number of comparisons per pair. Furthermore, we propose a hypothesis test based on our separation distance, construct confidence intervals, establish time-uniform bounds on the probabilities of type I and II errors using reverse martingale techniques, and derive minimax lower bounds using information-theoretic methods. Finally, we validate our results through experiments on synthetic and real-world datasets.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang
Multimodal
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Youkang Wang, Jian Wang, Rubing Chen, Tianyi Zeng, Xiao-Yong Wei et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Cristian J. Villatoro, Gianluca Geraci, Daniele E. Schiavazzi
Machine Learning
ๆฅ็ๆ่ฆ
Outer loop tasks such as optimization, uncertainty quantification or inference can easily become intractable when the underlying high-fidelity model is computationally expensive. Similarly, data-driven architectures typically require large datasets to perform predictive tasks with sufficient accuracy. A possible approach to mitigate these challenges is the development of multi-fidelity emulators, leveraging potentially biased, inexpensive low-fidelity information while correcting and refining predictions using scarce, accurate high-fidelity data. This study investigates the performance of multi-fidelity neural emulators, neural networks designed to learn the input-to-output mapping by integrating limited high-fidelity data with abundant low-fidelity model solutions. We investigate the performance of such emulators for low and high-dimensional functions, with oscillatory character, in the presence of discontinuities, for collections of models with equal and dissimilar parametrization, and for a possibly large number of potentially corrupted low-fidelity sources. In doing so, we consider a large number of architectural, hyperparameter, and dataset configurations including networks with a different amount of spectral bias (Multi-Layered Perceptron, Siren and Kolmogorov Arnold Network), various mechanisms for coordinate encoding, exact or learnable low-fidelity information, and for varying training dataset size. We further analyze the added value of the multi-fidelity approach by conducting equivalent single-fidelity tests for each case, quantifying the performance gains achieved through fusing multiple sources of information.
๐
2025-12-02
๐ ArXiv Preprint (physics.geo-ph)
๐ฅ Roberto Miele, Leonardo Azevedo
ๆฅ็ๆ่ฆ
Probabilistic seismic inverse modeling often requires the prediction of both spatially correlated geological heterogeneities (e.g., facies) and continuous parameters (e.g., rock and elastic properties). Generative adversarial networks (GANs) provide an efficient training-image-based simulation framework capable of reproducing complex geological models with high accuracy and comparably low generative cost. However, their application in stochastic geophysical inversion for multivariate property prediction is limited, as representing multiple coupled properties requires large and unstable networks with high memory and training demands. A more recent variant of GANs with spatially adaptive denormalization (SPADE-GAN) enables the direct conditioning of facies spatial distributions on local probability maps. Leveraging on such features, an iterative geostatistical inversion algorithm is proposed, SPADE-GANInv, integrating a pre-trained SPADE-GAN with geostatistical simulation, for the prediction of facies and multiple correlated continuous properties from seismic data. The SPADE-GAN is trained to reproduce realistic facies geometries, while sequential stochastic co-simulation predicts the spatial variability of the facies-dependent continuous properties. At each iteration, a set of subsurface realizations is generated and used to compute synthetic seismic data. The realizations providing the highest similarity coefficient to the observed data are used to update the subsurface probability models in the next iteration. The method is demonstrated on both 2-D synthetic scenarios and field data, targeting the prediction of facies, porosity, and acoustic impedance from full-stack seismic data. Results show that the algorithm enables accurate multivariate prediction, mitigates the impact of biased prior data, and accommodates additional local conditioning such as well logs.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shuyuan Wu, Feifei Wang, Yuan Gao, Hansheng Wang
Machine Learning
ๆฅ็ๆ่ฆ
In decentralized federated learning (DFL), the presence of abnormal clients, often caused by noisy or poisoned data, can significantly disrupt the learning process and degrade the overall robustness of the model. Previous methods on this issue often require a sufficiently large number of normal neighboring clients or prior knowledge of reliable clients, which reduces the practical applicability of DFL. To address these limitations, we develop here a novel adaptive DFL (aDFL) approach for robust estimation. The key idea is to adaptively adjust the learning rates of clients. By assigning smaller rates to suspicious clients and larger rates to normal clients, aDFL mitigates the negative impact of abnormal clients on the global model in a fully adaptive way. Our theory does not put any stringent conditions on neighboring nodes and requires no prior knowledge. A rigorous convergence analysis is provided to guarantee the oracle property of aDFL. Extensive numerical experiments demonstrate the superior performance of the aDFL method.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vishal Dubey, Pallavi Tyagi
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern image editors can produce identity-preserving AIGC (IP-AIGC), where the same person appears with new attire, background, or lighting. The robustness and fairness of current detectors in this regime remain unclear, especially for under-represented populations. We present what we believe is the first systematic study of IP-AIGC detection for Indian and South-Asian faces, quantifying cross-generator generalization and intra-population performance. We assemble Indian-focused training splits from FairFD and HAV-DF, and construct two held-out IP-AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai) using commercial web-UI generators (Gemini and ChatGPT) with identity-preserving prompts. We evaluate two state-of-the-art detectors (AIDE and Effort) under pretrained (PT) and fine-tuned (FT) regimes and report AUC, AP, EER, and accuracy. Fine-tuning yields strong in-domain gains (for example, Effort AUC 0.739 to 0.944 on HAV-DF-test; AIDE EER 0.484 to 0.259), but consistently degrades performance on held-out IP-AIGC for Indian cohorts (for example, AIDE AUC 0.923 to 0.563 on HIDF-img-ip-genai; Effort 0.740 to 0.533), which indicates overfitting to training-generator cues. On non-IP HIDF images, PT performance remains high, which suggests a specific brittleness to identity-preserving edits rather than a generic distribution shift. Our study establishes IP-AIGC-Indian as a challenging and practically relevant scenario and motivates representation-preserving adaptation and India-aware benchmark curation to close generalization gaps in AIGC detection.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mikoลaj Sacha, Hammad Jafri, Mattie Terzolo, Ayan Sinha, Andrew Rabinovich
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recommending matches in a text-rich, dynamic two-sided marketplace presents unique challenges due to evolving content and interaction graphs. We introduce GraphMatch, a new large-scale recommendation framework that fuses pre-trained language models with graph neural networks to overcome these challenges. Unlike prior approaches centered on standalone models, GraphMatch is a comprehensive recipe built on powerful text encoders and GNNs working in tandem. It employs adversarial negative sampling alongside point-in-time subgraph training to learn representations that capture both the fine-grained semantics of evolving text and the time-sensitive structure of the graph. We evaluated extensively on interaction data from Upwork, a leading labor marketplace, at large scale, and discuss our approach towards low-latency inference suitable for real-time use. In our experiments, GraphMatch outperforms language-only and graph-only baselines on matching tasks while being efficient at runtime. These results demonstrate that unifying language and graph representations yields a highly effective solution to text-rich, dynamic two-sided recommendations, bridging the gap between powerful pretrained LMs and large-scale graphs in practice.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Xinzheng Wu, Junyi Chen, Naiting Zhong, Yong Shen
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The safe deployment of autonomous driving systems (ADSs) relies on comprehensive testing and evaluation. However, safety-critical scenarios that can effectively expose system vulnerabilities are extremely sparse in the real world. Existing scenario generation methods face challenges in efficiently constructing long-tail scenarios that ensure fidelity, criticality, and interactivity, while particularly lacking real-time dynamic response capabilities to the vehicle under test (VUT). To address these challenges, this paper proposes a safety-critical testing scenario generation framework that integrates the high-level semantic understanding capabilities of Vision Language Models (VLMs) with the fine-grained generation capabilities of adaptive guided diffusion models. The framework establishes a three-layer hierarchical architecture comprising a strategic layer for VLM-directed scenario generation objective determination, a tactical layer for guidance function formulation, and an operational layer for guided diffusion execution. We first establish a high-quality fundamental diffusion model that learns the data distribution of real driving scenarios. Next, we design an adaptive guided diffusion method that enables real-time, precise control of background vehicles (BVs) in closed-loop simulation. The VLM is then incorporated to autonomously generate scenario generation objectives and guidance functions through deep scenario understanding and risk reasoning, ultimately guiding the diffusion model to achieve VLM-directed scenario generation. Experimental results demonstrate that the proposed method can efficiently generate realistic, diverse, and highly interactive safety-critical testing scenarios. Furthermore, case studies validate the adaptability and VLM-directed generation performance of the proposed method.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lechen Zhang, Yusheng Zhou, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ihab Ahmed, Denis Krompaร, Cheng Feng, Volker Tresp
Machine Learning
ๆฅ็ๆ่ฆ
We investigate input normalization methods for Time-Series Foundation Models (TSFMs). While normalization is well-studied in dataset-specific time-series models, it remains overlooked in TSFMs where generalization is critical. Time-series data, unlike text or images, exhibits significant scale variation across domains and channels, coupled with non-stationarity, can undermine TSFM performance regardless of architectural complexity. Through systematic evaluation across four architecturally diverse TSFMs, we empirically establish REVIN as the most efficient approach, reducing zero-shot MASE by 89\% relative to an un-normalized baseline and by 44\% versus other normalization methods, while matching the best in-domain accuracy (0.84 MASE) without any dataset-level preprocessing -- yielding the highest accuracy-efficiency trade-off. Yet its effect utilization depends on architectural design choices and optimization objective, particularly with respect to training loss scale sensitivity and model type (probabilistic, point-forecast, or LLM-based models).
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Ali Alvandi, Mina Rezaei
Machine Learning
ๆฅ็ๆ่ฆ
Contrastive learning is among the most popular and powerful approaches for self-supervised representation learning, where the goal is to map semantically similar samples close together while separating dissimilar ones in the latent space. Existing theoretical methods assume that downstream task classes are drawn from the same latent class distribution used during the pretraining phase. However, in real-world settings, downstream tasks may not only exhibit distributional shifts within the same label space but also introduce new or broader label spaces, leading to domain generalization challenges. In this work, we introduce novel generalization bounds that explicitly account for both types of mismatch: domain shift and domain generalization. Specifically, we analyze scenarios where downstream tasks either (i) draw classes from the same latent class space but with shifted distributions, or (ii) involve new label spaces beyond those seen during pretraining. Our analysis reveals how the performance of contrastively learned representations depends on the statistical discrepancy between pretraining and downstream distributions. This extended perspective allows us to derive provable guarantees on the performance of learned representations on average classification tasks involving class distributions outside the pretraining latent class set.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Shyam prasad reddy Kaitha, Hongrui Yu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Multi-robot task allocation in construction automation has traditionally relied on optimization methods such as Dynamic Programming and Reinforcement Learning. This research introduces the LangGraph-based Task Allocation Agent (LTAA), an LLM-driven framework that integrates phase-adaptive allocation strategies, multi-stage validation with hierarchical retries, and dynamic prompting for efficient robot coordination. Although recent LLM approaches show potential for construction robotics, they largely lack rigorous validation and benchmarking against established algorithms. This paper presents the first systematic comparison of LLM-based task allocation with traditional methods in construction scenarios.The study validates LLM feasibility through SMART-LLM replication and addresses implementation challenges using a Self-Corrective Agent Architecture. LTAA leverages natural-language reasoning combined with structured validation mechanisms, achieving major computational gains reducing token usage by 94.6% and allocation time by 86% through dynamic prompting. The framework adjusts its strategy across phases: emphasizing execution feasibility early and workload balance in later allocations.The authors evaluate LTAA against Dynamic Programming, Q-learning, and Deep Q-Network (DQN) baselines using construction operations from the TEACh human-robot collaboration dataset. In the Heavy Excels setting, where robots have strong task specializations, LTAA achieves 77% task completion with superior workload balance, outperforming all traditional methods. These findings show that LLM-based reasoning with structured validation can match established optimization algorithms while offering additional advantages such as interpretability, adaptability, and the ability to update task logic without retraining.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anton Emelyanov, Sergei Kudriashov, Alena Fenogenova
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Membership Inference Attacks (MIAs) aim to determine whether a specific data point was included in the training set of a target model. Although there are have been numerous methods developed for detecting data contamination in large language models (LLMs), their performance on multimodal LLMs (MLLMs) falls short due to the instabilities introduced through multimodal component adaptation and possible distribution shifts across multiple inputs. In this work, we investigate multimodal membership inference and address two issues: first, by identifying distribution shifts in the existing datasets, and second, by releasing an extended baseline pipeline to detect them. We also generalize the perturbation-based membership inference methods to MLLMs and release \textbf{FiMMIA} -- a modular \textbf{F}ramework for \textbf{M}ultimodal \textbf{MIA}.\footnote{The source code and framework have been made publicly available under the MIT license via \href{https://github.com/ai-forever/data_leakage_detect}{link}.The video demonstration is available on \href{https://youtu.be/a9L4-H80aSg}{YouTube}.} Our approach trains a neural network to analyze the target model's behavior on perturbed inputs, capturing distributional differences between members and non-members. Comprehensive evaluations on various fine-tuned multimodal models demonstrate the effectiveness of our perturbation-based membership inference attacks in multimodal domains.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xu Han, Biao Zhang, Xiangjun Tang, Xianzhi Li, Peter Wonka
ๆฅ็ๆ่ฆ
We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.
๐
2025-12-02
๐ ICASSP 2024
๐ฅ Waleed El-Geresy, Deniz Gรผndรผz
ๆฅ็ๆ่ฆ
We propose the use of adversarial wireless jamming to regularise the latent space of an autoencoder to match a diagonal Gaussian distribution. We consider the minimisation of a mean squared error distortion, where a jammer attempts to disrupt the recovery of a Gaussian source encoded and transmitted over the adversarial channel. A straightforward consequence of existing theoretical results is the fact that the saddle point of a minimax game - involving such an encoder, its corresponding decoder, and an adversarial jammer - consists of diagonal Gaussian noise output by the jammer. We use this result as inspiration for a novel approach to distribution matching in the latent space, utilising jamming as an auxiliary objective to encourage the aggregated latent posterior to match a diagonal Gaussian distribution. Using this new technique, we achieve distribution matching comparable to standard variational autoencoders and to Wasserstein autoencoders. This approach can also be generalised to other latent distributions.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tristan Amadei, Enric Meinhardt-Llopis, Benedicte Bascle, Corentin Abgrall, Gabriele Facciolo
ๆฅ็ๆ่ฆ
Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Przemyslaw Chojecki
ๆฅ็ๆ่ฆ
We extend the moduli-theoretic framework of psychometric batteries to the domain of dynamical systems. While previous work established the AAI capability score as a static functional on the space of agent representations, this paper formalizes the agent as a flow $ฮฝ_r$ parameterized by computational resource $r$, governed by a recursive Generator-Verifier-Updater (GVU) operator. We prove that this operator generates a vector field on the parameter manifold $ฮ$, and we identify the coefficient of self-improvement $ฮบ$ as the Lie derivative of the capability functional along this flow.
The central contribution of this work is the derivation of the Variance Inequality, a spectral condition that is sufficient (under mild regularity) for the stability of self-improvement. We show that a sufficient condition for $ฮบ> 0$ is that, up to curvature and step-size effects, the combined noise of generation and verification must be small enough.
We then apply this formalism to unify the recent literature on Language Self-Play (LSP), Self-Correction, and Synthetic Data bootstrapping. We demonstrate that architectures such as STaR, SPIN, Reflexion, GANs and AlphaZero are specific topological realizations of the GVU operator that satisfy the Variance Inequality through filtration, adversarial discrimination, or grounding in formal systems.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Zhou, Takehiko Ohkawa, Guwenxiao Zhou, Kanoko Goto, Takumi Hirose et al.
Computer Vision
ๆฅ็ๆ่ฆ
Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such CNN's inductive bias may not be optimal for 3D HPE due to its limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matteo Tolloso, Davide Bacciu
Machine Learning
ๆฅ็ๆ่ฆ
Uncertainty quantification is essential for deploying reliable Graph Neural Networks (GNNs), where existing approaches primarily rely on Bayesian inference or ensembles. In this paper, we introduce the first credal graph neural networks (CGNNs), which extend credal learning to the graph domain by training GNNs to output set-valued predictions in the form of credal sets. To account for the distinctive nature of message passing in GNNs, we develop a complementary approach to credal learning that leverages different aspects of layer-wise information propagation. We assess our approach on uncertainty quantification in node classification under out-of-distribution conditions. Our analysis highlights the critical role of the graph homophily assumption in shaping the effectiveness of uncertainty estimates. Extensive experiments demonstrate that CGNNs deliver more reliable representations of epistemic uncertainty and achieve state-of-the-art performance under distributional shift on heterophilic graphs.
๐
2025-12-02
๐ ArXiv Preprint (quant-ph)
๐ฅ Mark M. Wilde
Machine Learning
ๆฅ็ๆ่ฆ
Born-rule generative modeling, a central task in quantum machine learning, seeks to learn probability distributions that can be efficiently sampled by measuring complex quantum states. One hope is for quantum models to efficiently capture probability distributions that are difficult to learn and simulate by classical means alone. Quantum Boltzmann machines were proposed about one decade ago for this purpose, yet efficient training methods have remained elusive. In this paper, I overcome this obstacle by proposing a practical solution that trains quantum Boltzmann machines for Born-rule generative modeling. Two key ingredients in the proposal are the Donsker-Varadhan variational representation of the classical relative entropy and the quantum Boltzmann gradient estimator of [Patel et al., arXiv:2410.12935]. I present the main result for a more general ansatz known as an evolved quantum Boltzmann machine [Minervini et al., arXiv:2501.03367], which combines parameterized real- and imaginary-time evolution. I also show how to extend the findings to other distinguishability measures beyond relative entropy. Finally, I present four different hybrid quantum-classical algorithms for the minimax optimization underlying training, and I discuss their theoretical convergence guarantees.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ He Wang, Wenyilin Xiao, Songqiao Han, Hailiang Huang
Natural Language Processing
ๆฅ็ๆ่ฆ
Stock price prediction is challenging due to market volatility and its sensitivity to real-time events. While large language models (LLMs) offer new avenues for text-based forecasting, their application in finance is hindered by noisy news data and the lack of explicit answers in text. General-purpose memory architectures struggle to identify the key drivers of price movements. To address this, we propose StockMem, an event-reflection dual-layer memory framework. It structures news into events and mines them along two dimensions: horizontal consolidation integrates daily events, while longitudinal tracking captures event evolution to extract incremental information reflecting market expectation discrepancies. This builds a temporal event knowledge base. By analyzing event-price dynamics, the framework further forms a reflection knowledge base of causal experiences. For prediction, it retrieves analogous historical scenarios and reasons with current events, incremental data, and past experiences. Experiments show StockMem outperforms existing memory architectures and provides superior, explainable reasoning by tracing the information chain affecting prices, enhancing decision transparency in financial forecasting.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Julian Ma, Jun Wang, Zafeirios Fountas
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lavish Bansal, Naman Mishra
Natural Language Processing
ๆฅ็ๆ่ฆ
Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world's population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junpeng Wu, Pinheng Zong
ๆฅ็ๆ่ฆ
Graph-based Anomaly Detection models have gained widespread adoption in recent years, identifying suspicious nodes by aggregating neighborhood information. However, most existing studies overlook the pervasive issues of missing and adversarially obscured node attributes, which can undermine aggregation stability and prediction reliability. To mitigate this, we propose FGC-Comp, a lightweight, classifier-agnostic, and deployment-friendly attribute completion module-designed to enhance neighborhood aggregation under incomplete attributes. We partition each node's neighbors into three label-based groups, apply group-specific transforms to the labeled groups while a node-conditioned gate handles unknowns, fuse messages via residual connections, and train end-to-end with a binary classification objective to improve aggregation stability and prediction reliability under missing attributes. Experiments on two real-world fraud datasets validate the effectiveness of the approach with negligible computational overhead.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Senrong Xu, Tianyu Wang, Zenan Li, Yuan Yao, Taolue Chen et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Conformal prediction (CP) provides a comprehensive framework to produce statistically rigorous uncertainty sets for black-box machine learning models. To further improve the efficiency of CP, conformal correction is proposed to fine-tune or wrap the base model with an extra module using a conformal-aware inefficiency loss. In this work, we empirically and theoretically identify a trade-off between the CP efficiency and the entropy of model prediction. We then propose an entropy-constrained conformal correction method, exploring a better Pareto optimum between efficiency and entropy. Extensive experimental results on both computer vision and graph datasets demonstrate the efficacy of the proposed method. For instance, it can significantly improve the efficiency of state-of-the-art CP methods by up to 34.4%, given an entropy threshold.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9\% pruning rate, while delivering an end-to-end inference speedup.
๐
2025-12-02
๐ ICASSP 2026
๐ฅ Omid Reza Heidari, Yang Wang, Xinxin Zuo
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to variations in scanning devices and environmental conditions, leading to significant domain discrepancies. To address this, we apply ALDI++, a domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift effectively in this area. We conduct extensive experiments on the EDS dataset, demonstrating that ALDI++ surpasses the state-of-the-art (SOTA) domain adaptation methods across multiple adaptation scenarios. In particular, ALDI++ with a Vision Transformer for Detection (ViTDet) backbone achieves the highest mean average precision (mAP), confirming the effectiveness of transformer-based architectures for cross-domain object detection. Additionally, our category-wise analysis highlights consistent improvements in detection accuracy, reinforcing the robustness of the model across diverse object classes. Our findings establish ALDI++ as an efficient solution for domain-adaptive object detection, setting a new benchmark for performance stability and cross-domain generalization in security X-ray imagery.
๐
2025-12-02
๐ ArXiv Preprint (cs.SI)
๐ฅ Vedanta Thapar, Renaud Lambiotte, George T. Cantwell
ๆฅ็ๆ่ฆ
We propose the first return time distribution (FRTD) of a random walk as an interpretable and mathematically grounded node embedding. The FRTD assigns a probability mass function to each node, allowing us to define a distance between any pair of nodes using standard metrics for discrete distributions. We present several arguments to motivate the FRTD embedding. First, we show that FRTDs are strictly more informative than eigenvalue spectra, yet insufficient for complete graph identification, thus placing FRTD equivalence between cospectrality and isomorphism. Second, we argue that FRTD equivalence between nodes captures structural similarity. Third, we empirically demonstrate that the FRTD embedding outperforms manually designed graph metrics in network alignment tasks. Finally, we show that random networks that approximately match the FRTD of a desired target also preserve other salient features. Together these results demonstrate the FRTD as a simple and mathematically principled embedding for complex networks.
๐
2025-12-02
๐ ArXiv Preprint (cs.SD)
๐ฅ Gauri Deshpande, Harish Battula, Ashish Panda, Sunil Kumar Kopparapu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a unified study of four distinct modeling approaches for classifying dysarthria severity in the Speech Analysis for Neurodegenerative Diseases (SAND) challenge. All models tackle the same five class classification task using a common dataset of speech recordings. We investigate: (1) a ViT-OF method leveraging a Vision Transformer on spectrogram images, (2) a 1D-CNN approach using eight 1-D CNN's with majority-vote fusion, (3) a BiLSTM-OF approach using nine BiLSTM models with majority vote fusion, and (4) a Hierarchical XGBoost ensemble that combines glottal and formant features through a two stage learning framework. Each method is described, and their performances on a validation set of 53 speakers are compared. Results show that while the feature-engineered XGBoost ensemble achieves the highest macro-F1 (0.86), the deep learning models (ViT, CNN, BiLSTM) attain competitive F1-scores (0.70) and offer complementary insights into the problem.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haozhuo Zheng, Cheng Wang, Yang Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
The de novo generation of molecules with desirable properties is a critical challenge, where diffusion models are computationally intensive and autoregressive models struggle with error propagation. In this work, we introduce the Graph VQ-Transformer (GVT), a two-stage generative framework that achieves both high accuracy and efficiency. The core of our approach is a novel Graph Vector Quantized Variational Autoencoder (VQ-VAE) that compresses molecular graphs into high-fidelity discrete latent sequences. By synergistically combining a Graph Transformer with canonical Reverse Cuthill-McKee (RCM) node ordering and Rotary Positional Embeddings (RoPE), our VQ-VAE achieves near-perfect reconstruction rates. An autoregressive Transformer is then trained on these discrete latents, effectively converting graph generation into a well-structured sequence modeling problem. Crucially, this mapping of complex graphs to high-fidelity discrete sequences bridges molecular design with the powerful paradigm of large-scale sequence modeling, unlocking potential synergies with Large Language Models (LLMs). Extensive experiments show that GVT achieves state-of-the-art or highly competitive performance across major benchmarks like ZINC250k, MOSES, and GuacaMol, and notably outperforms leading diffusion models on key distribution similarity metrics such as FCD and KL Divergence. With its superior performance, efficiency, and architectural novelty, GVT not only presents a compelling alternative to diffusion models but also establishes a strong new baseline for the field, paving the way for future research in discrete latent-space molecular generation.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Naveen George, Naoki Murata, Yuhta Takida, Konda Reddy Mopuri, Yuki Mitsufuji
Machine Learning
ๆฅ็ๆ่ฆ
The recent rapid growth of visual generative models trained on vast web-scale datasets has created significant tension with data privacy regulations and copyright laws, such as GDPR's ``Right to be Forgotten.'' This necessitates machine unlearning (MU) to remove specific concepts without the prohibitive cost of retraining. However, existing MU techniques are fundamentally ill-equipped for real-world scenarios where deletion requests arrive sequentially, a setting known as continual unlearning (CUL). Naively applying one-shot methods in a continual setting triggers a stability crisis, leading to a cascade of degradation characterized by retention collapse, compounding collateral damage to related concepts, and a sharp decline in generative quality. To address this critical challenge, we introduce a novel generative distillation based continual unlearning framework that ensures targeted and stable unlearning under sequences of deletion requests. By reframing each unlearning step as a multi-objective, teacher-student distillation process, the framework leverages principles from continual learning to maintain model integrity. Experiments on a 10-step sequential benchmark demonstrate that our method unlearns forget concepts with better fidelity and achieves this without significant interference to the performance on retain concepts or the overall image quality, substantially outperforming baselines. This framework provides a viable pathway for the responsible deployment and maintenance of large-scale generative models, enabling industries to comply with ongoing data removal requests in a practical and effective manner.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Farnaz Faramarzi Lighvan, Mehrdad Asadi, Lynn Houthuys
ๆฅ็ๆ่ฆ
Multi-view learning integrates diverse representations of the same instances to improve performance. Most existing kernel-based multi-view learning methods use fusion techniques without enforcing an explicit collaboration type across views or co-regularization which limits global collaboration. We propose AW-LSSVM, an adaptive weighted LS-SVM that promotes complementary learning by an iterative global coupling to make each view focus on hard samples of others from previous iterations. Experiments demonstrate that AW-LSSVM outperforms existing kernel-based multi-view methods on most datasets, while keeping raw features isolated, making it also suitable for privacy-preserving scenarios.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junwon Lee, Juhan Nam, Jiyoung Lee
Robotics
ๆฅ็ๆ่ฆ
This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of target source and modulates video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter et al.
Machine Learning
ๆฅ็ๆ่ฆ
Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models -- diffusion and flow-based models -- still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow model to outperform a 1024 step teacher model with only a single additional backward NFE.
๐
2025-12-02
๐ ICML 2025
๐ฅ Mattia Giuri, Mathias Jackermeier, Alessandro Abate
Machine Learning
ๆฅ็ๆ่ฆ
Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time. However, existing approaches fall short in environments where multiple high-level events (i.e., atomic propositions) can be true at the same time and potentially interact in complicated ways. In this work, we propose a novel approach to learning a multi-task policy for following arbitrary LTL instructions that addresses this shortcoming. Our method conditions the policy on sequences of simple Boolean formulae, which directly align with transitions in the automaton, and are encoded via a graph neural network (GNN) to yield structured task representations. Experiments in a complex chess-based environment demonstrate the advantages of our approach.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhengcheng Wang, Zichuan Lin, Yijun Yang, Haobo Fu, Deheng Ye
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly hinder their navigation performance. To address these limitations, a novel VLN agent framework, named SeeNav-Agent, is proposed in this work. First, to reduce perception hallucinations of the visual module of the VLN agent, a dual-view Visual Prompt (VP) technique is introduced in the input space, which can also improve the agent's understanding of current spatial states. Subsequently, a novel step-level Reinforcement Fine-Tuning (RFT) method, Step Reward Group Policy Optimization (SRGPO), is designed for the post-training of VLN agents. In SRGPO, we first define verifiable process rewards for the navigation task, and then perform efficient step-level advantage estimation by randomly grouping different navigation steps. SRGPO provides dense reward signals for the reinforcement learning process of the VLN agent and enhances its planning capability. Experimental results on the EmbodiedBench Navigation benchmark indicate that by introducing the zero-shot VP module, the GPT-4.1 achieves a navigation success rate of 86.7%, surpassing the current best LVLM by approximately 20 percentage points (pp). Through post-training based on SRGPO, the Qwen2.5-VL-3B model reaches a navigation success rate of 72.3%, outperforming the best existing LVLM model by 5.6 pp. Moreover, compared to RFT algorithms such as GRPO and GiGPO, the proposed SRGPO demonstrates significant improvements in training stability, convergence efficiency, and generalization capability.
๐
2025-12-02
๐ This work has been submitted to the IEEE for possi
๐ฅ Seline J. S. de Rooij, Borbรกla Hunyadi
Machine Learning
ๆฅ็ๆ่ฆ
Transfer learning aims to optimize performance in a target task by learning from a related source problem. In this work, we propose an efficient transfer learning method using a tensor kernel machine. Our method takes inspiration from the adaptive SVM and hence transfers 'knowledge' from the source to the 'adapted' model via regularization. The main advantage of using tensor kernel machines is that they leverage low-rank tensor networks to learn a compact non-linear model in the primal domain. This allows for a more efficient adaptation without adding more parameters to the model. To demonstrate the effectiveness of our approach, we apply the adaptive tensor kernel machine (Adapt-TKM) to seizure detection on behind-the-ear EEG. By personalizing patient-independent models with a small amount of patient-specific data, the patient-adapted model (which utilizes the Adapt-TKM), achieves better performance compared to the patient-independent and fully patient-specific models. Notably, it is able to do so while requiring around 100 times fewer parameters than the adaptive SVM model, leading to a correspondingly faster inference speed. This makes the Adapt-TKM especially useful for resource-constrained wearable devices.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wenhao Sha, Tienchong Chang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Heat transfer in semiconductor devices is dominated by chip and substrate assemblies, where heat generated within a finite chip layer dissipates into a semi-infinite substrate with much higher thermophysical properties. This mismatch produces steep interfacial temperature gradients, making the transient thermal response highly sensitive to the interface. Conventional numerical solvers require excessive discretization to resolve these dynamics, while physics-informed neural networks (PINNs) often exhibit unstable convergence and loss of physical consistency near the material interface. To address these challenges, we introduce HeatTransFormer, a physics-guided Transformer architecture for interface-dominated diffusion problems. The framework integrates physically informed spatiotemporal sampling, a Laplace-based activation emulating analytical diffusion solutions, and a mask-free attention mechanism supporting bidirectional spatiotemporal coupling. These components enable the model to resolve steep gradients, maintain physical consistency, and remain stable where PINNs typically fail. HeatTransFormer produces coherent temperature fields across the interface when applied to a finite layer and semi-infinite substrate configuration. Coupled with a physics-constrained inverse strategy, it further enables reliable identification of three unknown thermal properties simultaneously using only external measurements. Overall, this work demonstrates that physics-guided Transformer architectures provide a unified framework for forward and inverse modeling in interface-dominated thermal systems.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used {\it torch.matmul} to state-of-the-art Nvidia's closed-source libraries, i.e., {\it cuBLAS}, {\it cuBLASLt}. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0\% over {\it torch.matmul} on average; +19.2\% over {\it cuBLAS} using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8\% over {\it cuBLASLt-heuristic}, which queries {\it cuBLASLt} library and selects the algorithm based on the heuristic's suggestion; and +11.4\% over the most competitive {\it cuBLASLt-AutoTuning} model, which selects the fastest algorithm from up to 100 candidates from {\it cuBLASLt}'s suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7\%, +26.0\%, +22.4\%, and +15.9\% for {\it torch.matmul}, {\it cuBLAS}, {\it cuBLASLt-heuristic}, and {\it cuBLASLt-AutoTuning} respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
๐
2025-12-02
๐ ArXiv Preprint (cs.CE)
๐ฅ Ioanna Tasou, Panagiotis Mpakos, Angelos Vlachos, Dionysios Adamopoulos, Georgios Giannakopoulos et al.
Machine Learning
ๆฅ็ๆ่ฆ
The computational demands of modern Deep Neural Networks (DNNs) are immense and constantly growing. While training costs usually capture public attention, inference demands are also contributing in significant computational, energy and environmental footprints. Sparsity stands out as a critical mechanism for drastically reducing these resource demands. However, its potential remains largely untapped and is not yet fully incorporated in production AI systems. To bridge this gap, this work provides the necessary knowledge and insights for performance engineers keen to get involved in deep learning inference optimization. In particular, in this work we: a) discuss the various forms of sparsity that can be utilized in DNN inference, b) explain how the original dense computations translate to sparse kernels, c) provide an extensive bibliographic review of the state-of-the-art in the implementation of these kernels for CPUs and GPUs, d) discuss the availability of sparse datasets in support of sparsity-related research and development, e) explore the current software tools and frameworks that provide robust sparsity support, and f) present evaluation results of different implementations of the key SpMM and SDDMM kernels on CPU and GPU platforms. Ultimately, this paper aims to serve as a resource for performance engineers seeking to develop and deploy highly efficient sparse deep learning models in productions.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Albert Saiapin, Kim Batselier
Machine Learning
ๆฅ็ๆ่ฆ
Many approximations were suggested to circumvent the cubic complexity of kernel-based algorithms, allowing their application to large-scale datasets. One strategy is to consider the primal formulation of the learning problem by mapping the data to a higher-dimensional space using tensor-product structured polynomial and Fourier features. The curse of dimensionality due to these tensor-product features was effectively solved by a tensor network reparameterization of the model parameters. However, another important aspect of model training - identifying optimal feature hyperparameters - has not been addressed and is typically handled using the standard cross-validation approach. In this paper, we introduce the Feature Learning (FL) model, which addresses this issue by representing tensor-product features as a learnable Canonical Polyadic Decomposition (CPD). By leveraging this CPD structure, we efficiently learn the hyperparameters associated with different features alongside the model parameters using an Alternating Least Squares (ALS) optimization method. We prove the effectiveness of the FL model through experiments on real data of various dimensionality and scale. The results show that the FL model can be consistently trained 3-5 times faster than and have the prediction quality on par with a standard cross-validated model.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vishnu Sarukkai, Asanshay Gupta, James Hong, Michaรซl Gharbi, Kayvon Fatahalian
Machine Learning
ๆฅ็ๆ่ฆ
The world currently has an abundance of ideas for how to use new LLM agents, and developers seek to rapidly prototype and test new agentic designs. However, executing agents at scale using high-capacity LLMs incurs high inference costs. We propose a simple method for reducing LLM agent inference costs without incurring the development friction costs associated with LLM fine-tuning (long training cycles, optimization hyperparameter tweaking loops) or manual prompt engineering (laborious trial and error). Most importantly, we introduce $\textit{in-context distillation}$, which adapts the idea of knowledge distillation (training a low cost-student model to mimic a high-cost teacher) to an in-context learning setting. Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples, enabling the student to imitate teacher behavior on-the-fly. We combine in-context distillation with the established idea of $\textit{self-consistency cascades}$ to know when the trust the student. This adaptive strategy realizes the cost benefits of model specialization while preserving the productivity of working with frozen models. On the multi-step embodied reasoning benchmark ALFWorld, our method matches teacher-level accuracy at $\textbf{2.5$\times$ lower cost}$, reducing per-episode costs from \$0.059 to \$0.024. The upfront demonstration cost amortizes after just 843 episodes, yielding cumulative savings exceeding \$34,900 at deployment scale (1M episodes). On AppWorld, a complex agent benchmark requiring multi-step API workflows, we shift the Pareto frontier by achieving a $\textbf{2$\times$ cost reduction}$ at iso-accuracy. By reducing operational costs while maintaining rapid experimentation cycles with frozen models, our approach makes advanced agentic systems economically viable for a broader range of applications.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Albert Saiapin, Kim Batselier
ๆฅ็ๆ่ฆ
To address the scalability limitations of Gaussian process (GP) regression, several approximation techniques have been proposed. One such method is based on tensor networks, which utilizes an exponential number of basis functions without incurring exponential computational cost. However, extending this model to a fully probabilistic formulation introduces several design challenges. In particular, for tensor train (TT) models, it is unclear which TT-core should be treated in a Bayesian manner. We introduce a Bayesian tensor train kernel machine that applies Laplace approximation to estimate the posterior distribution over a selected TT-core and employs variational inference (VI) for precision hyperparameters. Experiments show that core selection is largely independent of TT-ranks and feature structure, and that VI replaces cross-validation while offering up to 65x faster training. The method's effectiveness is demonstrated on an inverse dynamics problem.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Parth Pulkundwar, Vivek Dhanawade, Rohit Yadav, Minal Sonkar, Medha Asurlekar et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Traditional language models face a challenge from hallucinations. Their very presence casts a large, dangerous shadow over the promising realm of natural language processing. It becomes crucial to understand the various kinds of hallucinations that occur nowadays, their origins, and ways of reducing them. This document provides a concise and straightforward summary of that. It serves as a one-stop resource for a general understanding of hallucinations and how to mitigate them.
๐
2025-12-02
๐ ArXiv Preprint (cs.SD)
๐ฅ Xueyan Li, Yuxin Wang, Mengjie Jiang, Qingzi Zhu, Jiang Zhang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Singing voice synthesis (SVS) has advanced significantly, enabling models to generate vocals with accurate pitch and consistent style. As these capabilities improve, the need for reliable evaluation and optimization becomes increasingly critical. However, current methods like reward systems often rely on single numerical scores, struggle to capture various dimensions such as phrasing or expressiveness, and require costly annotations, limiting interpretability and generalization. To address these issues, we propose a generative feedback (i.e., reward model) framework that provides multi-dimensional language and audio feedback for SVS assessment. Our approach leverages an audio-language model to generate text and audio critiques-covering aspects such as melody, content, and auditory quality. The model is fine-tuned on a hybrid dataset combining human music reactions and synthetic critiques from a MLLMs, enhancing diversity and linguistic richness. Quantitative experiments validate the effectiveness of the proposed dataset and training strategy, demonstrating that the framework produces musically accurate and interpretable evaluations suitable for guiding generative model improvement. The code is at [https://github.com/opendilab/VocalCritic](https://github.com/opendilab/VocalCritic)
๐
2025-12-02
๐ accepted for IEEE Globecom Workshop 2025
๐ฅ Krishnendu S. Tharakan, Carlo Fischione
ๆฅ็ๆ่ฆ
Wireless connectivity promises to unshackle virtual reality (VR) experiences, allowing users to engage from anywhere, anytime. However, delivering seamless, high-quality, real-time VR video wirelessly is challenging due to the stringent quality of experience requirements, low latency constraints, and limited VR device capabilities. This paper addresses these challenges by introducing a novel decentralized multi task fair federated learning (DMTFL) based caching that caches and prefetches each VR user's field of view (FOV) at base stations (BSs) based on the caching strategies tailored to each BS. In federated learning (FL) in its naive form, often biases toward certain users, and a single global model fails to capture the statistical heterogeneity across users and BSs. In contrast, the proposed DMTFL algorithm personalizes content delivery by learning individual caching models at each BS. These models are further optimized to perform well under any target distribution, while providing theoretical guarantees via Rademacher complexity and a probably approximately correct (PAC) bound on the loss. Using a realistic VR head-tracking dataset, our simulations demonstrate the superiority of our proposed DMTFL algorithm compared to baseline algorithms.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Marco Cardia, Stefano Chessa, Alessio Micheli, Antonella Giuliana Luminare, Francesca Gambineri
Machine Learning
ๆฅ็ๆ่ฆ
The quality of water is key for the quality of agrifood sector. Water is used in agriculture for fertigation, for animal husbandry, and in the agrifood processing industry. In the context of the progressive digitalization of this sector, the automatic assessment of the quality of water is thus becoming an important asset. In this work, we present the integration of Ultraviolet-Visible (UV-Vis) spectroscopy with Machine Learning in the context of water quality assessment aiming at ensuring water safety and the compliance of water regulation. Furthermore, we emphasize the importance of model interpretability by employing SHapley Additive exPlanations (SHAP) to understand the contribution of absorbance at different wavelengths to the predictions. Our approach demonstrates the potential for rapid, accurate, and interpretable assessment of key water quality parameters.
๐
2025-12-02
๐ 31 page, paper in revision, submitted in Journal o
๐ฅ Ali Mohammad-Djafari, Ning Chu, Li Wang
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Inverse problems arise across scientific and engineering domains, where the goal is to infer hidden parameters or physical fields from indirect and noisy observations. Classical approaches, such as variational regularization and Bayesian inference, provide well established theoretical foundations for handling ill posedness. However, these methods often become computationally restrictive in high dimensional settings or when the forward model is governed by complex physics. Physics Informed Neural Networks (PINNs) have recently emerged as a promising framework for solving inverse problems by embedding physical laws directly into the training process of neural networks. In this paper, we introduce a new perspective on the Bayesian Physics Informed Neural Network (BPINN) framework, extending classical PINNs by explicitly incorporating training data generation, modeling and measurement uncertainties through Bayesian prior modeling and doing inference with the posterior laws. Also, as we focus on the inverse problems, we call this method BPINN-IP, and we show that the standard PINN formulation naturally appears as its special case corresponding to the Maximum A Posteriori (MAP) estimate. This unified formulation allows simultaneous exploitation of physical constraints, prior knowledge, and data-driven inference, while enabling uncertainty quantification through posterior distributions. To demonstrate the effectiveness of the proposed framework, we consider inverse problems arising in infrared image processing, including deconvolution and super-resolution, and present results on both simulated and real industrial data.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zihao Zhao, Kai-Chia Mo, Shing-Hei Ho, Brandon Amos, Kai Wang
Machine Learning
ๆฅ็ๆ่ฆ
Differentiable optimization layers enable learning systems to make decisions by solving embedded optimization problems. However, computing gradients via implicit differentiation requires solving a linear system with Hessian terms, which is both compute- and memory-intensive. To address this challenge, we propose a novel algorithm that computes the gradient using only first-order information. The key insight is to rewrite the differentiable optimization as a bilevel optimization problem and leverage recent advances in bilevel methods. Specifically, we introduce an active-set Lagrangian hypergradient oracle that avoids Hessian evaluations and provides finite-time, non-asymptotic approximation guarantees. We show that an approximate hypergradient can be computed using only first-order information in $\tilde{\oo}(1)$ time, leading to an overall complexity of $\tilde{\oo}(ฮด^{-1}ฮต^{-3})$ for constrained bilevel optimization, which matches the best known rate for non-smooth non-convex optimization. Furthermore, we release an open-source Python library that can be easily adapted from existing solvers. Our code is available here: https://github.com/guaguakai/FFOLayer.
๐
2025-12-02
๐ ArXiv Preprint (cs.DB)
๐ฅ Yarden Gabbay, Haoquan Guan, Shaull Almagor, El Kindi Rezig, Brit Youngmann et al.
ๆฅ็ๆ่ฆ
Causal analyses derived from observational data underpin high-stakes decisions in domains such as healthcare, public policy, and economics. Yet such conclusions can be surprisingly fragile: even minor data errors - duplicate records, or entry mistakes - may drastically alter causal relationships. This raises a fundamental question: how robust is a causal claim to small, targeted modifications in the data? Addressing this question is essential for ensuring the reliability, interpretability, and reproducibility of empirical findings. We introduce SubCure, a framework for robustness auditing via cardinality repairs. Given a causal query and a user-specified target range for the estimated effect, SubCure identifies a small set of tuples or subpopulations whose removal shifts the estimate into the desired range. This process not only quantifies the sensitivity of causal conclusions but also pinpoints the specific regions of the data that drive those conclusions. We formalize this problem under both tuple- and pattern-level deletion settings and show both are NP-complete. To scale to large datasets, we develop efficient algorithms that incorporate machine unlearning techniques to incrementally update causal estimates without retraining from scratch. We evaluate SubCure across four real-world datasets covering diverse application domains. In each case, it uncovers compact, high-impact subsets whose removal significantly shifts the causal conclusions, revealing vulnerabilities that traditional methods fail to detect. Our results demonstrate that cardinality repair is a powerful and general-purpose tool for stress-testing causal analyses and guarding against misleading claims rooted in ordinary data imperfections.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mithra D K
Machine Learning
ๆฅ็ๆ่ฆ
I present an application of established machine learning techniques to NHANES health survey data for predicting diabetes status. I compare baseline models (logistic regression, random forest, XGBoost) with a hybrid approach that uses an XGBoost feature encoder and a lightweight multilayer perceptron (MLP) head. Experiments show the hybrid model attains improved AUC and balanced accuracy compared to baselines on the processed NHANES subset. I release code and reproducible scripts to encourage replication.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhongjian Qiao, Rui Yang, Jiafei Lyu, Xiu Li, Zhongxiang Dai et al.
Machine Learning
ๆฅ็ๆ่ฆ
Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical \textbf{D}ual-\textbf{RO}bust \textbf{C}ross-domain \textbf{O}ffline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuta Oshima, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Robotics
ๆฅ็ๆ่ฆ
Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang et al.
Robotics
ๆฅ็ๆ่ฆ
AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xingwang Li, Mengyun Chen, Jiamou Liu, Sijie Wang, Shuanggen Jin et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In the face of accelerating global urbanization and the increasing frequency of extreme weather events, highresolution urban rainfall monitoring is crucial for building resilient smart cities. Commercial Microwave Links (CMLs) are an emerging data source with great potential for this task.While traditional rainfall retrieval from CMLs relies on physicsbased models, these often struggle with real-world complexities like signal noise and nonlinear attenuation. To address these limitations, this paper proposes a novel hybrid deep learning architecture based on the Transformer and a Bidirectional Gated Recurrent Unit (BiGRU), which we name TabGRU. This design synergistically captures both long-term dependencies and local sequential features in the CML signal data. The model is further enhanced by a learnable positional embedding and an attention pooling mechanism to improve its dynamic feature extraction and generalization capabilities. The model was validated on a public benchmark dataset from Gothenburg, Sweden (June-September 2015). The evaluation used 12 sub-links from two rain gauges (Torp and Barl) over a test period (August 22-31) covering approximately 10 distinct rainfall events. The proposed TabGRU model demonstrated consistent advantages, outperforming deep learning baselines and achieving high coefficients of determination (R2) at both the Torp site (0.91) and the Barl site (0.96). Furthermore, compared to the physics-based approach, TabGRU maintained higher accuracy and was particularly effective in mitigating the significant overestimation problem observed in the PL model during peak rainfall events. This evaluation confirms that the TabGRU model can effectively overcome the limitations of traditional methods, providing a robust and accurate solution for CML-based urban rainfall monitoring under the tested conditions.
๐
2025-12-02
๐ AAAI 2026
๐ฅ Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz
Natural Language Processing
ๆฅ็ๆ่ฆ
Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50\% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from $\sim$5\% to $\sim$40\% while Grok 4 Fast decreases from $\sim$80\% to $\sim$10\% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.
๐
2025-12-02
๐ ArXiv Preprint (cs.DB)
๐ฅ Ning Wang, Sainyam Galhotra
Machine Learning
ๆฅ็ๆ่ฆ
Discovering which tables in large, heterogeneous repositories can be joined and by what transformations is a central challenge in data integration and data discovery. Traditional join discovery methods are largely designed for equi-joins, which assume that join keys match exactly or nearly so. These techniques, while efficient in clean, well-normalized databases, fail in open or federated settings where identifiers are inconsistently formatted, embedded, or split across multiple columns. Approximate or fuzzy joins alleviate minor string variations but cannot capture systematic transformations. We introduce QJoin, a reinforcement-learning framework that learns and reuses transformation strategies across join tasks. QJoin trains an agent under a uniqueness-aware reward that balances similarity with key distinctiveness, enabling it to explore concise, high-value transformation chains. To accelerate new joins, we introduce two reuse mechanisms: (i) agent transfer, which initializes new policies from pretrained agents, and (ii) transformation reuse, which caches successful operator sequences for similar column clusters. On the AutoJoin Web benchmark (31 table pairs), QJoin achieves an average F1-score of 91.0%. For 19,990 join tasks in NYC+Chicago open datasets, Qjoin reduces runtime by up to 7.4% (13,747 s) by using reusing. These results demonstrate that transformation learning and reuse can make join discovery both more accurate and more efficient.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junghwan Park, Woojin Cho, Junhyuk Heo, Darongsae Kwon, Kookjin Lee
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhongjian Qiao, Rui Yang, Jiafei Lyu, Chenjia Bai, Xiu Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Cross-Domain Offline Reinforcement Learning aims to train an agent deployed in the target environment, leveraging both a limited target domain dataset and a source domain dataset with (possibly) sufficient data coverage. Due to the underlying dynamics misalignment between the source and target domain, simply merging the data from two datasets may incur inferior performance. Recent advances address this issue by selectively sharing source domain samples that exhibit dynamics alignment with the target domain. However, these approaches focus solely on dynamics alignment and overlook \textit{value alignment}, i.e., selecting high-quality, high-value samples from the source domain. In this paper, we first demonstrate that both dynamics alignment and value alignment are essential for policy learning, by examining the limitations of the current theoretical framework for cross-domain RL and establishing a concrete sub-optimality gap of a policy trained on the source domain and evaluated on the target domain. Motivated by the theoretical insights, we propose to selectively share those source domain samples with both high dynamics and value alignment and present our \textbf{\underline{D}}ynamics- and \textbf{\underline{V}}alue-aligned \textbf{\underline{D}}ata \textbf{\underline{F}}iltering (DVDF) method. We design a range of dynamics shift settings, including kinematic and morphology shifts, and evaluate DVDF on various tasks and datasets, as well as in challenging extremely low-data settings where the target domain dataset contains only 5,000 transitions. Extensive experiments demonstrate that DVDF consistently outperforms prior strong baselines and delivers exceptional performance across multiple tasks and datasets.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
๐
2025-12-02
๐ ArXiv Preprint (quant-ph)
๐ฅ Tommaso Fioravanti, Brian Quanz, Gabriele Agliardi, Edgar Andres Ruiz Guzman, Ginรฉs Carrascal et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Quantum Machine Learning (QML) holds the promise of enhancing machine learning modeling in terms of both complexity and accuracy. A key challenge in this domain is the encoding of input data, which plays a pivotal role in determining the performance of QML models. In this work, we tackle a largely unaddressed aspect of encoding that is unique to QML modeling -- rather than adjusting the ansatz used for encoding, we consider adjusting how data is conveyed to the ansatz. We specifically implement QML pipelines that leverage classical data manipulation (i.e., ordering, selecting, and weighting features) as a preprocessing step, and evaluate if these aspects of encoding can have a significant impact on QML model performance, and if they can be effectively optimized to improve performance. Our experimental results, applied across a wide variety of data sets, ansatz, and circuit sizes, with a representative QML approach, demonstrate that by optimizing how features are encoded in an ansatz we can substantially and consistently improve the performance of QML models, making a compelling case for integrating these techniques in future QML applications. Finally we demonstrate the practical feasibility of this approach by running it using real quantum hardware with 100 qubit circuits and successfully achieving improved QML modeling performance in this case as well.
๐
2025-12-02
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yuexin Xiang, Yuchen Lei, SM Mahir Shazeed Rish, Yuanzhe Zhang, Qin Wang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Stablecoins such as USDT and USDC aspire to peg stability by coupling issuance controls with reserve attestations. In practice, however, the transparency is split across two worlds: verifiable on-chain traces and off-chain disclosures locked in unstructured text that are unconnected. We introduce a large language model (LLM)-based automated framework that bridges these two dimensions by aligning on-chain issuance data with off-chain disclosure statements. First, we propose an integrative framework using LLMs to capture and analyze on- and off-chain data through document parsing and semantic alignment, extracting key financial indicators from issuer attestations and mapping them to corresponding on-chain metrics. Second, we integrate multi-chain issuance records and disclosure documents within a model context protocol (MCP) framework that standardizes LLMs access to both quantitative market data and qualitative disclosure narratives. This framework enables unified retrieval and contextual alignment across heterogeneous stablecoin information sources and facilitates consistent analysis. Third, we demonstrate the capability of LLMs to operate across heterogeneous data modalities in blockchain analytics, quantifying discrepancies between reported and observed circulation and examining their implications for cross-chain transparency and price dynamics. Our findings reveal systematic gaps between disclosed and verifiable data, showing that LLM-assisted analysis enhances cross-modal transparency and supports automated, data-driven auditing in decentralized finance (DeFi).
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yizhou Zhang, Lun Du
Machine Learning
ๆฅ็ๆ่ฆ
Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains -- most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability.
We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that \textbf{static pruning induces a bounded operator and therefore cannot change the spectral tail exponent}; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes \textbf{time-dependent data curation}, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning -- although practical systems can only approximate this behavior.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Oshada Jayasinghe, Farhana Choudhury, Egemen Tanin, Shanika Karunasekera
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
With increased travelling needs more than ever, traffic congestion has become a major concern in most urban areas. Allocating spaces for on-street parking, further hinders traffic flow, by limiting the effective road width available for driving. With the advancement of vehicle-to-infrastructure connectivity technologies, we explore how the impact of on-street parking on traffic congestion could be minimized, by dynamically configuring on-street parking spaces. Towards that end, we formulate dynamic on-street parking space configuration as an optimization problem, and we follow a data driven approach, considering the nature of our problem. Our proposed solution comprises a two-layer multi agent reinforcement learning based framework, which is inherently scalable to large road networks. The lane level agents are responsible for deciding the optimal parking space configuration for each lane, and we introduce a novel Deep Q-learning architecture which effectively utilizes long short term memory networks and graph attention networks to capture the spatio-temporal correlations evident in the given problem. The block level agents control the actions of the lane level agents and maintain a sufficient level of parking around the block. We conduct a set of comprehensive experiments using SUMO, on both synthetic data as well as real-world data from the city of Melbourne. Our experiments show that the proposed framework could reduce the average travel time loss of vehicles significantly, reaching upto 47%, with a negligible increase in the walking distance for parking.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformers, composed of QKV generation, attention computation, and FFNs,
have become the dominant model across various domains due to their outstanding performance.
However, their high computational cost hinders efficient hardware deployment.
Sparsity offers a promising solution,
yet most existing accelerators exploit only intra-row sparsity in attention,
while few consider inter-row sparsity.
Approaches leveraging inter-row sparsity often rely on costly global similarity estimation,
which diminishes the acceleration benefits of sparsity,
and typically apply sparsity to only one or two transformer components.
Through careful analysis of the attention distribution and computation flow,
we observe that local similarity allows end-to-end sparse acceleration with lower computational overhead.
Motivated by this observation, we propose ESACT,
an end-to-end sparse accelerator for compute-intensive Transformers.
ESACT centers on the Sparsity Prediction with Local Similarity (SPLS) mechanism,
which leverages HLog quantization to accurately predict local attention sparsity prior to QK generation,
achieving efficient sparsity across all transformer components.
To support efficient hardware realization, we introduce three architectural innovations.
Experimental results on 26 benchmarks demonstrate that
SPLS reduces total computation by 52.03% with less than 1% accuracy loss.
ESACT achieves an end-to-end energy efficiency of 3.29 TOPS/W,
and improves attention-level energy efficiency by 2.95x and 2.26x over
SOTA attention accelerators SpAtten and Sanger, respectively.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ David X. Wu, Shreyas Kapur, Anant Sahai, Stuart Russell
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning has become the dominant paradigm for eliciting reasoning and self-correction capabilities in large language models, but its computational expense motivates exploration of alternatives. Inspired by techniques from autonomous driving and robotics, we investigate whether supervised learning with synthetic error injection can induce self-correction abilities in language models. Our approach inserts artificial errors into reasoning chains, masks them, and supervises the model to recognize and correct these mistakes. Despite the intuitive appeal of this method, we find that it fails to significantly improve performance even on simple synthetic tasks across multiple models. Moreover, even when the model catches its own error, it often parrots the original mistake. We find that the distribution shift of synthetic errors to on-policy errors significantly degrades the error-correction capabilities of the fine-tuned model, even with good synthetic coverage of on-policy errors. Our results help explain why on-policy reinforcement learning methods have proven uniquely effective for eliciting self-correction.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chuhan Xie
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper studies the problem of risk-sensitive reinforcement learning (RSRL) in continuous time, where the environment is characterized by a controllable stochastic differential equation (SDE) and the objective is a potentially nonlinear functional of cumulative rewards. We prove that when the functional is an optimized certainty equivalent (OCE), the optimal policy is Markovian with respect to an augmented environment. We also propose \textit{CT-RS-q}, a risk-sensitive q-learning algorithm based on a novel martingale characterization approach. Finally, we run a simulation study on a dynamic portfolio selection problem and illustrate the effectiveness of our algorithm.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jonathan Baxter, Peter L. Bartlett
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter $ฮฒ\in [0,1)$, which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward.
๐
2025-12-02
๐ Accepted by IEEE/ACM ToN
๐ฅ Nan Qiao, Sheng Yue, Ju Ren, Yaoxue Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Offline Federated Reinforcement Learning (FRL), a marriage of federated learning and offline reinforcement learning, has attracted increasing interest recently. Albeit with some advancement, we find that the performance of most existing offline FRL methods drops dramatically when provided with mixed-quality data, that is, the logging behaviors (offline data) are collected by policies with varying qualities across clients. To overcome this limitation, this paper introduces a new vote-based offline FRL framework, named FOVA. It exploits a \emph{vote mechanism} to identify high-return actions during local policy evaluation, alleviating the negative effect of low-quality behaviors from diverse local learning policies. Besides, building on advantage-weighted regression (AWR), we construct consistent local and global training objectives, significantly enhancing the efficiency and stability of FOVA. Further, we conduct an extensive theoretical analysis and rigorously show that the policy learned by FOVA enjoys strict policy improvement over the behavioral policy. Extensive experiments corroborate the significant performance gains of our proposed algorithm over existing baselines on widely used benchmarks.
๐
2025-12-02
๐ ArXiv Preprint (math.OC)
๐ฅ Dimitris Oikonomou, Nicolas Loizou
Machine Learning
ๆฅ็ๆ่ฆ
The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization with no need for strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. Comprehensive experiments on convex benchmarks and deep neural networks corroborate our theory: the proposed step size accelerates convergence, reduces variance, and consistently outperforms existing adaptive baselines. Finally, in the context of deep neural network training, our method demonstrates robust performance by addressing the vanishing gradient problem.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, Kun Xia
Natural Language Processing
ๆฅ็ๆ่ฆ
Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sai Siddharth Nalamalpu, Kaining Yuan, Aiden Zhou, Eugene Pinsky
Machine Learning
ๆฅ็ๆ่ฆ
The Massachusetts Bay Transportation Authority (MBTA) is the main public transit provider in Boston, operating multiple means of transport, including trains, subways, and buses. However, the system often faces delays and fluctuations in ridership volume, which negatively affect efficiency and passenger satisfaction. To further understand this phenomenon, this paper compares the performance of existing and unique methods to determine the best approach in predicting gated station entries in the subway system (a proxy for subway usage) and the number of delays in the overall MBTA system. To do so, this research considers factors that tend to affect public transportation, such as day of week, season, pressure, wind speed, average temperature, and precipitation. This paper evaluates the performance of 10 statistical and machine learning models on predicting next-day subway usage. On predicting delay count, the number of models is extended to 11 per day by introducing a self-exciting point process model, representing a unique application of a point-process framework for MBTA delay modeling. This research involves experimenting with the selective inclusion of features to determine feature importance, testing model accuracy via Root Mean Squared Error (RMSE). Remarkably, it is found that providing either day of week or season data has a more substantial benefit to predictive accuracy compared to weather data; in fact, providing weather data generally worsens performance, suggesting a tendency of models to overfit.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wenzhang Du
ๆฅ็ๆ่ฆ
Retrieval-augmented models couple parametric predictors with non-parametric memories, but their use in streaming supervised learning with concept drift is not well understood. We study online classification in non-stationary environments and propose Retrieval-Augmented Memory for Online Learning (RAM-OL), a simple extension of stochastic gradient descent that maintains a small buffer of past examples. At each time step, RAM-OL retrieves a few nearest neighbours of the current input in the hidden representation space and updates the model jointly on the current example and the retrieved neighbours. We compare a naive replay variant with a gated replay variant that constrains neighbours using a time window, similarity thresholds, and gradient reweighting, in order to balance fast reuse of relevant past data against robustness to outdated regimes. From a theoretical perspective, we interpret RAM-OL under a bounded drift model and discuss how retrieval can reduce adaptation cost and improve regret constants when patterns recur over time. Empirically, we instantiate RAM-OL on a simple online multilayer perceptron and evaluate it on three real-world data streams derived from electricity pricing, electricity load, and airline delay data. On strongly and periodically drifting streams, RAM-OL improves prequential accuracy by up to about seven percentage points and greatly reduces variance across random seeds, while on a noisy airline stream the gated variant closely matches the purely online baseline. These results show that retrieval-augmented memory is a practical and robust tool for online learning under concept drift.
๐
2025-12-02
๐ ArXiv Preprint (q-bio.QM)
๐ฅ Jiabao Brad Wang, Siyuan Cao, Hongxuan Wu, Yiliang Yuan, Mustafa Misir
ๆฅ็ๆ่ฆ
Selecting an effective docking algorithm is highly context-dependent, and no single method performs reliably across structural, chemical, or protocol regimes. We introduce MolAS, a lightweight algorithm selection system that predicts per-algorithm performance from pretrained protein-ligand embeddings using attentional pooling and a shallow residual decoder. With only hundreds to a few thousand labelled complexes, MolAS achieves up to 15% absolute improvement over the single-best solver (SBS) and closes 17-66% of the Virtual Best Solver (VBS)-SBS gap across five diverse docking benchmarks. Analyses of reliability, embedding geometry, and solver-selection patterns show that MolAS succeeds when the oracle landscape exhibits low entropy and separable solver behaviour, but collapses under protocol-induced hierarchy shifts. These findings indicate that the main barrier to robust docking AS is not representational capacity but instability in solver rankings across pose-generation regimes, positioning MolAS as both a practical in-domain selector and a diagnostic tool for assessing when AS is feasible.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kentaro Kubo, Hayato Goto
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Boltzmann machines (BMs) are powerful energy-based generative models, but their heavy training cost has largely confined practical use to Restricted BMs (RBMs) trained with an efficient learning method called contrastive divergence. More accurate learning typically requires Markov chain Monte Carlo (MCMC) Boltzmann sampling, but it is time-consuming due to the difficulty of parallelization for more expressive models. To address this limitation, we first propose a new Boltzmann sampler inspired by a quantum-inspired combinatorial optimization called simulated bifurcation (SB). This SB-inspired approach, which we name Langevin SB (LSB), enables parallelized sampling while maintaining accuracy comparable to MCMC. Furthermore, this is applicable not only to RBMs but also to BMs with general couplings. However, LSB cannot control the inverse temperature of the output Boltzmann distribution, which hinders learning and degrades performance. To overcome this limitation, we also developed an efficient method for estimating the inverse temperature during the learning process, which we call conditional expectation matching (CEM). By combining LSB and CEM, we establish an efficient learning framework for BMs with greater expressive power than RBMs. We refer to this framework as sampler-adaptive learning (SAL). SAL opens new avenues for energy-based generative modeling beyond RBMs.
๐
2025-12-02
๐ NeurIPS 2025
๐ฅ Felix Teufel, Aaron W. Kollasch, Yining Huang, Ole Winther, Kevin K. Yang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering. We introduce PRIMO (PRotein In-context Mutation Oracle), a transformer-based framework that leverages in-context learning and test-time training to adapt rapidly to new proteins and assays without large task-specific datasets. By encoding sequence information, auxiliary zero-shot predictions, and sparse experimental labels from many assays as a unified token set in a pre-training masked-language modeling paradigm, PRIMO learns to prioritize promising variants through a preference-based loss function. Across diverse protein families and properties-including both substitution and indel mutations-PRIMO outperforms zero-shot and fully supervised baselines. This work underscores the power of combining large-scale pre-training with efficient test-time adaptation to tackle challenging protein design tasks where data collection is expensive and label availability is limited.
๐
2025-12-02
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OMNIGUARD, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks in omni-modalities, paving the way toward building more robust and capable omnimodal safeguarding systems.
๐
2025-12-02
๐ Accepted to Transactions on Machine Learning Resea
๐ฅ Max W. Shen, Ewa Nowara, Michael Maser, Kyunghyun Cho
ๆฅ็ๆ่ฆ
While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively they learn to respect symmetries. We investigate this by deriving a principled measure of equivariance error that, for convex losses, calculates the percent of total loss attributable to imperfections in learned symmetry. We focus our empirical investigation to 3D-rotation equivariance on high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) and find that models reduce equivariance error quickly to $\leq$2\% held-out loss within 1k-10k training steps, a result robust to model and dataset size. This happens because learning 3D-rotational equivariance is an easier learning task, with a smoother and better-conditioned loss landscape, than the main prediction task. For 3D rotations, the loss penalty for non-equivariant models is small throughout training, so they may achieve lower test loss than equivariant models per GPU-hour unless the equivariant ``efficiency gap'' is narrowed. We also experimentally and theoretically investigate the relationships between relative equivariance error, learning gradients, and model parameters.
๐
2025-12-02
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Chenji Lu, Zhuo Chen, Hui Zhao, Zhiyuan Zeng, Gang Zhao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27\% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mikoลaj Sacha, Hammad Jafri, Mattie Terzolo, Ayan Sinha, Andrew Rabinovich
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recommending matches in a text-rich, dynamic two-sided marketplace presents unique challenges due to evolving content and interaction graphs. We introduce GraphMatch, a new large-scale recommendation framework that fuses pre-trained language models with graph neural networks to overcome these challenges. Unlike prior approaches centered on standalone models, GraphMatch is a comprehensive recipe built on powerful text encoders and GNNs working in tandem. It employs adversarial negative sampling alongside point-in-time subgraph training to learn representations that capture both the fine-grained semantics of evolving text and the time-sensitive structure of the graph. We evaluated extensively on interaction data from Upwork, a leading labor marketplace, at large scale, and discuss our approach towards low-latency inference suitable for real-time use. In our experiments, GraphMatch outperforms language-only and graph-only baselines on matching tasks while being efficient at runtime. These results demonstrate that unifying language and graph representations yields a highly effective solution to text-rich, dynamic two-sided recommendations, bridging the gap between powerful pretrained LMs and large-scale graphs in practice.
๐
2025-12-02
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Weihang Su, Jianming Long, Changyue Wang, Shiyu Lin, Jingyan Xu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) frequently exhibit hallucinations, generating content that appears fluent and coherent but is factually incorrect. Such errors undermine trust and hinder their adoption in real-world applications. To address this challenge, two distinct research paradigms have emerged: model-centric Hallucination Detection (HD) and text-centric Fact Verification (FV). Despite sharing the same goal, these paradigms have evolved in isolation, using distinct assumptions, datasets, and evaluation protocols. This separation has created a research schism that hinders their collective progress. In this work, we take a decisive step toward bridging this divide. We introduce UniFact, a unified evaluation framework that enables direct, instance-level comparison between FV and HD by dynamically generating model outputs and corresponding factuality labels. Through large-scale experiments across multiple LLM families and detection methods, we reveal three key findings: (1) No paradigm is universally superior; (2) HD and FV capture complementary facets of factual errors; and (3) hybrid approaches that integrate both methods consistently achieve state-of-the-art performance. Beyond benchmarking, we provide the first in-depth analysis of why FV and HD diverged, as well as empirical evidence supporting the need for their unification. The comprehensive experimental results call for a new, integrated research agenda toward unifying Hallucination Detection and Fact Verification in LLMs.
We have open-sourced all the code, data, and baseline implementation at: https://github.com/oneal2000/UniFact/
๐
2025-12-02
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junpeng Wu, Pinheng Zong
ๆฅ็ๆ่ฆ
Graph-based Anomaly Detection models have gained widespread adoption in recent years, identifying suspicious nodes by aggregating neighborhood information. However, most existing studies overlook the pervasive issues of missing and adversarially obscured node attributes, which can undermine aggregation stability and prediction reliability. To mitigate this, we propose FGC-Comp, a lightweight, classifier-agnostic, and deployment-friendly attribute completion module-designed to enhance neighborhood aggregation under incomplete attributes. We partition each node's neighbors into three label-based groups, apply group-specific transforms to the labeled groups while a node-conditioned gate handles unknowns, fuse messages via residual connections, and train end-to-end with a binary classification objective to improve aggregation stability and prediction reliability under missing attributes. Experiments on two real-world fraud datasets validate the effectiveness of the approach with negligible computational overhead.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Agathoklis Georgiou
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.
๐
2025-12-02
๐ SIGIR 2025
๐ฅ Zheng Fang, Donghao Xie, Ming Pang, Chunyuan Yuan, Xue Jiang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and neural models' reliance on the scarcity of domain-specific hard samples. We propose ADORE, a self-sustaining framework that synergizes three innovations: (1) A Rule-aware Relevance Discrimination module, where a Chain-of-Thought LLM generates intent-aligned training data, refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) An Error-type-aware Data Synthesis module that auto-generates adversarial examples to harden robustness; and (3) A Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. ADORE automates annotation, adversarial generation, and distillation, overcoming data scarcity while enhancing reasoning. Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.
๐
2025-12-02
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Luyao Niu, Zhicheng Deng, Boyang Li, Nuoxian Huang, Ruiqi Liu et al.
ๆฅ็ๆ่ฆ
The "15-minute city" envisions neighborhoods where residents can meet daily needs via a short walk or bike ride. Realizing this vision requires not only physical proximity but also efficient and reliable access to information about nearby places, services, and events. Existing location-based systems, however, focus mainly on city-level tasks and neglect the spatial, temporal, and cognitive factors that shape localized decision-making. We conceptualize this gap as the Local Life Information Accessibility (LLIA) problem and introduce AskNearby, an AI-driven community application that unifies retrieval and recommendation within the 15-minute life circle. AskNearby integrates (i) a three-layer Retrieval-Augmented Generation (RAG) pipeline that synergizes graph-based, semantic-vector, and geographic retrieval with (ii) a cognitive-map model that encodes each user's neighborhood familiarity and preferences. Experiments on real-world community datasets demonstrate that AskNearby significantly outperforms LLM-based and map-based baselines in retrieval accuracy and recommendation quality, achieving robust performance in spatiotemporal grounding and cognitive-aware ranking. Real-world deployments further validate its effectiveness. By addressing the LLIA challenge, AskNearby empowers residents to more effectively discover local resources, plan daily activities, and engage in community life.
๐
2025-12-02
๐ KDD 2026
๐ฅ Haofeng Huang, Ling Gai
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Sequential recommendation plays a critical role in modern online platforms such as e-commerce, advertising, and content streaming, where accurately predicting users' next interactions is essential for personalization. Recent Transformer-based methods like BERT4Rec have shown strong modeling capability, yet they still rely on discrete item IDs that lack semantic meaning and ignore rich multimodal information (e.g., text and image). This leads to weak generalization and limited interpretability. To address these challenges, we propose Q-Bert4Rec, a multimodal sequential recommendation framework that unifies semantic representation and quantized modeling. Specifically, Q-Bert4Rec consists of three stages: (1) cross-modal semantic injection, which enriches randomly initialized ID embeddings through a dynamic transformer that fuses textual, visual, and structural features; (2) semantic quantization, which discretizes fused representations into meaningful tokens via residual vector quantization; and (3) multi-mask pretraining and fine-tuning, which leverage diverse masking strategies -- span, tail, and multi-region -- to improve sequential understanding. We validate our model on public Amazon benchmarks and demonstrate that Q-Bert4Rec significantly outperforms many strong existing methods, confirming the effectiveness of semantic tokenization for multimodal sequential recommendation. Our source code will be publicly available on GitHub after publishing.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Yueru Jia, Jiaming Liu, Shengbang Liu, Rui Zhou, Wanhe Yu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Robust perception and dynamics modeling are fundamental to real-world robotic policy learning. Recent methods employ video diffusion models (VDMs) to enhance robotic policies, improving their understanding and modeling of the physical world. However, existing approaches overlook the coherent and physically consistent motion representations inherently encoded across frames in VDMs. To this end, we propose Video2Act, a framework that efficiently guides robotic action learning by explicitly integrating spatial and motion-aware representations. Building on the inherent representations of VDMs, we extract foreground boundaries and inter-frame motion variations while filtering out background noise and task-irrelevant biases. These refined representations are then used as additional conditioning inputs to a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate and how to move. To mitigate inference inefficiency, we propose an asynchronous dual-system design, where the VDM functions as the slow System 2 and the DiT head as the fast System 1, working collaboratively to generate adaptive actions. By providing motion-aware conditions to System 1, Video2Act maintains stable manipulation even with low-frequency updates from the VDM. For evaluation, Video2Act surpasses previous state-of-the-art VLA methods by 7.7% in simulation and 21.7% in real-world tasks in terms of average success rate, further exhibiting strong generalization capabilities.
๐
2025-12-02
๐ ArXiv Preprint (cs.GR)
๐ฅ Yuxuan Mu, Ziyu Zhang, Yi Shi, Minami Matsumoto, Kotaro Imamura et al.
Robotics
ๆฅ็ๆ่ฆ
Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Svenja Strobel, Matthias Innmann, Bernhard Egger, Marc Stamminger, Linus Franke
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capturing is susceptible to miss small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR capturings and attribute LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, which we can then use to grow additional points from to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiang Xu, Ao Liang, Youquan Liu, Linfeng Li, Lingdong Kong et al.
Robotics
ๆฅ็ๆ่ฆ
Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guowen Zhang, Chenhang He, Liyi Chen, Lei Zhang
Computer Vision
ๆฅ็ๆ่ฆ
Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Nicholas Baiata, Nilanjan Chakraborty
Robotics
ๆฅ็ๆ่ฆ
Task-space control of robotic fingers is a critical enabler of dexterous manipulation, as manipulation objectives are most naturally specified in terms of fingertip motions and applied forces rather than individual joint angles. While task-space planning and control have been extensively studied for larger, arm-scale manipulators, demonstrations of precise task-space trajectory tracking in compact, multi-DoF robotic fingers remain scarce. In this paper, we present the physical prototyping and experimental characterization of a three-degree-of-freedom, linkage-driven, series-parallel robotic finger with analytic forward kinematics and a closed-form Jacobian. A resolved motion rate control (RMRC) scheme is implemented to achieve closed-loop task-space trajectory tracking. We experimentally evaluate the fingertip tracking performance across a variety of trajectories, including straight lines, circles, and more complex curves, and report millimeter-level accuracy. To the best of our knowledge, this work provides one of the first systematic experimental demonstrations of precise task-space trajectory tracking in a linkage-driven robotic finger, thereby establishing a benchmark for future designs aimed at dexterous in-hand manipulation.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang
Multimodal
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pierpaolo Serio, Giulio Pisaneschi, Andrea Dan Ryals, Vincenzo Infantino, Lorenzo Gentilini et al.
Robotics
ๆฅ็ๆ่ฆ
This work presents a systematic investigation into how alternative LiDAR-to-image projections affect metric place recognition when coupled with a state-of-the-art vision foundation model. We introduce a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol, thereby isolating the influence of the 2-D projection itself. Using consistent geometric and structural channels across multiple datasets and deployment scenarios, we identify the projection characteristics that most strongly determine discriminative power, robustness to environmental variation, and suitability for real-time autonomy. Experiments with different datasets, including integration into an operational place recognition policy, validate the practical relevance of these findings and demonstrate that carefully designed projections can serve as an effective surrogate for end-to-end 3-D learning in LiDAR place recognition.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Iana Zhura, Sausar Karaf, Faryal Batool, Nipun Dhananjaya Weerakkodi Mudalige, Valerii Serpiva et al.
Robotics
ๆฅ็ๆ่ฆ
Visual traversability estimation is critical for autonomous navigation, but existing VLM-based methods rely on hand-crafted prompts, generalize poorly across embodiments, and output only traversability maps, leaving trajectory generation to slow external planners. We propose SwarmDiffusion, a lightweight end-to-end diffusion model that jointly predicts traversability and generates a feasible trajectory from a single RGB image. To remove the need for annotated or planner-produced paths, we introduce a planner-free trajectory construction pipeline based on randomized waypoint sampling, Bezier smoothing, and regularization enforcing connectivity, safety, directionality, and path thinness. This enables learning stable motion priors without demonstrations. SwarmDiffusion leverages VLM-derived supervision without prompt engineering and conditions the diffusion process on a compact embodiment state, producing physically consistent, traversable paths that transfer across different robot platforms. Across indoor environments and two embodiments (quadruped and aerial), the method achieves 80-100\% navigation success and 0.09 s inference, and adapts to a new robot using only-500 additional visual samples. It generalizes reliably to unseen environments in simulation and real-world trials, offering a scalable, prompt-free approach to unified traversability reasoning and trajectory generation.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Xinzheng Wu, Junyi Chen, Naiting Zhong, Yong Shen
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The safe deployment of autonomous driving systems (ADSs) relies on comprehensive testing and evaluation. However, safety-critical scenarios that can effectively expose system vulnerabilities are extremely sparse in the real world. Existing scenario generation methods face challenges in efficiently constructing long-tail scenarios that ensure fidelity, criticality, and interactivity, while particularly lacking real-time dynamic response capabilities to the vehicle under test (VUT). To address these challenges, this paper proposes a safety-critical testing scenario generation framework that integrates the high-level semantic understanding capabilities of Vision Language Models (VLMs) with the fine-grained generation capabilities of adaptive guided diffusion models. The framework establishes a three-layer hierarchical architecture comprising a strategic layer for VLM-directed scenario generation objective determination, a tactical layer for guidance function formulation, and an operational layer for guided diffusion execution. We first establish a high-quality fundamental diffusion model that learns the data distribution of real driving scenarios. Next, we design an adaptive guided diffusion method that enables real-time, precise control of background vehicles (BVs) in closed-loop simulation. The VLM is then incorporated to autonomously generate scenario generation objectives and guidance functions through deep scenario understanding and risk reasoning, ultimately guiding the diffusion model to achieve VLM-directed scenario generation. Experimental results demonstrate that the proposed method can efficiently generate realistic, diverse, and highly interactive safety-critical testing scenarios. Furthermore, case studies validate the adaptability and VLM-directed generation performance of the proposed method.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, it exists redundant action modes that are irrelevant to the success action modes of the downstream task. Specifically, we observe a critical inference-time fragility among various sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by stable success modes of the downstream task dataset. Thus, we propose \textbf{TACO}, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. The VLA models integrated with TACO can execute the actions with maximum pseudo-count from all sampled action chunks, thereby preventing distribution shifts while preserving the generalization ability of VLAs since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it incurs significant computational benefits compared to RL update, especially for flow or diffusion-based VLAs which are difficult to perform RL update due to denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves the inference stability and success rates in downstream-task adaptations.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Shyam prasad reddy Kaitha, Hongrui Yu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Multi-robot task allocation in construction automation has traditionally relied on optimization methods such as Dynamic Programming and Reinforcement Learning. This research introduces the LangGraph-based Task Allocation Agent (LTAA), an LLM-driven framework that integrates phase-adaptive allocation strategies, multi-stage validation with hierarchical retries, and dynamic prompting for efficient robot coordination. Although recent LLM approaches show potential for construction robotics, they largely lack rigorous validation and benchmarking against established algorithms. This paper presents the first systematic comparison of LLM-based task allocation with traditional methods in construction scenarios.The study validates LLM feasibility through SMART-LLM replication and addresses implementation challenges using a Self-Corrective Agent Architecture. LTAA leverages natural-language reasoning combined with structured validation mechanisms, achieving major computational gains reducing token usage by 94.6% and allocation time by 86% through dynamic prompting. The framework adjusts its strategy across phases: emphasizing execution feasibility early and workload balance in later allocations.The authors evaluate LTAA against Dynamic Programming, Q-learning, and Deep Q-Network (DQN) baselines using construction operations from the TEACh human-robot collaboration dataset. In the Heavy Excels setting, where robots have strong task specializations, LTAA achieves 77% task completion with superior workload balance, outperforming all traditional methods. These findings show that LLM-based reasoning with structured validation can match established optimization algorithms while offering additional advantages such as interpretability, adaptability, and the ability to update task logic without retraining.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Xianchao Zeng, Xinyu Zhou, Youcheng Li, Jiayou Shi, Tianle Li et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Heye Huang, Yibin Yang, Mingfeng Fan, Haoran Wang, Xiaocong Zhao et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Safe autonomous driving in mixed traffic requires a unified understanding of multimodal interactions and dynamic planning under uncertainty. Existing learning based approaches struggle to capture rare but safety critical behaviors, while rule based systems often lack adaptability in complex interactions. To address these limitations, CogDrive introduces a cognition driven multimodal prediction and planning framework that integrates explicit modal reasoning with safety aware trajectory optimization. The prediction module adopts cognitive representations of interaction modes based on topological motion semantics and nearest neighbor relational encoding. With a differentiable modal loss and multimodal Gaussian decoding, CogDrive learns sparse and unbalanced interaction behaviors and improves long horizon trajectory prediction. The planning module incorporates an emergency response concept and optimizes safety stabilized trajectories, where short term consistent branches ensure safety during replanning cycles and long term branches support smooth and collision free motion under low probability switching modes. Experiments on Argoverse2 and INTERACTION datasets show that CogDrive achieves strong performance in trajectory accuracy and miss rate, while closed loop simulations confirm adaptive behavior in merge and intersection scenarios. By combining cognitive multimodal prediction with safety oriented planning, CogDrive offers an interpretable and reliable paradigm for safe autonomy in complex traffic.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Yuhong Zhang, Zihan Gao, Shengpeng Li, Ling-Hao Chen, Kaisheng Liu et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
We introduce Robowheel, a data engine that converts human hand object interaction (HOI) videos into training-ready supervision for cross morphology robotic learning. From monocular RGB or RGB-D inputs, we perform high precision HOI reconstruction and enforce physical plausibility via a reinforcement learning (RL) optimizer that refines hand object relative poses under contact and penetration constraints. The reconstructed, contact rich trajectories are then retargeted to cross-embodiments, robot arms with simple end effectors, dexterous hands, and humanoids, yielding executable actions and rollouts. To scale coverage, we build a simulation-augmented framework on Isaac Sim with diverse domain randomization (embodiments, trajectories, object retrieval, background textures, hand motion mirroring), which enriches the distributions of trajectories and observations while preserving spatial relationships and physical plausibility. The entire data pipeline forms an end to end pipeline from video,reconstruction,retargeting,augmentation data acquisition. We validate the data on mainstream vision language action (VLA) and imitation learning architectures, demonstrating that trajectories produced by our pipeline are as stable as those from teleoperation and yield comparable continual performance gains. To our knowledge, this provides the first quantitative evidence that HOI modalities can serve as effective supervision for robotic learning. Compared with teleoperation, Robowheel is lightweight, a single monocular RGB(D) camera is sufficient to extract a universal, embodiment agnostic motion representation that could be flexibly retargeted across embodiments. We further assemble a large scale multimodal dataset combining multi-camera captures, monocular videos, and public HOI corpora for training and evaluating embodied models.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Shengkai Wu, Jinrong Yang, Wenqiu Luo, Linfeng Gao, Chaohui Shang et al.
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method leverages the frozen SAM2 model to use its powerful visual temporal tracking capability and introduces a lightweight, trainable action head that operates in parallel with its native segmentation head. This design allows for training only the small action head on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object detection model, designates the specific object to be grasped. This prompt conditions the action head to predict a unique, unambiguous grasp trajectory for that object alone. In all subsequent video frames, SAM2's built-in temporal tracking capability automatically maintains stable tracking of the selected object, enabling our model to continuously predict the grasp trajectory from the video stream without further external guidance. This temporal-prompted approach effectively eliminates ambiguity from the visuomotor policy. We demonstrate through extensive experiments that SAM2Grasp achieves state-of-the-art performance in cluttered, multi-object grasping tasks.
๐
2025-12-02
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Yuchong Zhang, Yong Ma, Danica Kragic
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
This perspective reframes human-robot interaction (HRI) through extended reality (XR), arguing that virtual robots powered by large foundation models (FMs) can serve as cognitively grounded, empathic agents. Unlike physical robots, XR-native agents are unbound by hardware constraints and can be instantiated, adapted, and scaled on demand, while still affording embodiment and co-presence. We synthesize work across XR, HRI, and cognitive AI to show how such agents can support safety-critical scenarios, socially and cognitively empathic interaction across domains, and outreaching physical capabilities with XR and AI integration. We then discuss how multimodal large FMs (e.g., large language model, large vision model, and vision-language model) enable context-aware reasoning, affect-sensitive situations, and long-term adaptation, positioning virtual robots as cognitive and empathic mediators rather than mere simulation assets. At the same time, we highlight challenges and potential risks, including overtrust, cultural and representational bias, privacy concerns around biometric sensing, and data governance and transparency. The paper concludes by outlining a research agenda for human-centered, ethically grounded XR agents - emphasizing multi-layered evaluation frameworks, multi-user ecosystems, mixed virtual-physical embodiment, and societal and ethical design practices to envision XR-based virtual agents powered by FMs as reshaping future HRI into a more efficient and adaptive paradigm.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Alessandro Ianniello, Dave Murray-Rust, Sara Muscolo, Olger Siebinga, Nicky Mol et al.
Robotics
ๆฅ็ๆ่ฆ
As robots become more adaptable, responsive, and capable of interacting with humans, the design of effective human-robot collaboration becomes critical. Yet, this design process is typically led by monodisciplinary approaches, often overlooking interdisciplinary knowledge and the experiential knowledge of workers who will ultimately share tasks with these systems. To address this gap, we introduce the robotic capabilities framework, a vocabulary that enables transdisciplinary collaborations to meaningfully shape the future of work when robotic systems are integrated into the workplace. Rather than focusing on the internal workings of robots, the framework centers discussion on high-level capabilities, supporting dialogue around which elements of a task should remain human-led and which can be delegated to robots. We developed the framework through reflexive and iterative processes, and applied it in two distinct settings: by engaging roboticists in describing existing commercial robots using its vocabulary, and through a design activity with students working on robotics-related projects. The framework emerges as an intermediate-level knowledge artifact and a boundary object that bridges technical and experiential domains, guiding designers, empowering workers, and contributing to more just and collaborative futures of work.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Jeric Lew, Yuhong Cao, Derek Ming Siang Tan, Guillaume Sartoretti
Machine Learning
ๆฅ็ๆ่ฆ
Information gathering in large-scale or time-critical scenarios (e.g., environmental monitoring, search and rescue) requires broad coverage within limited time budgets, motivating the use of multi-agent systems. These scenarios are commonly formulated as multi-agent informative path planning (MAIPP), where multiple agents must coordinate to maximize information gain while operating under budget constraints. A central challenge in MAIPP is ensuring effective coordination while the belief over the environment evolves with incoming measurements. Recent learning-based approaches address this by using distributions over future positions as "intent" to support coordination. However, these autoregressive intent predictors are computationally expensive and prone to compounding errors. Inspired by the effectiveness of diffusion models as expressive, long-horizon policies, we propose AID, a fully decentralized MAIPP framework that leverages diffusion models to generate long-term trajectories in a non-autoregressive manner. AID first performs behavior cloning on trajectories produced by existing MAIPP planners and then fine-tunes the policy using reinforcement learning via Diffusion Policy Policy Optimization (DPPO). This two-stage pipeline enables the policy to inherit expert behavior while learning improved coordination through online reward feedback. Experiments demonstrate that AID consistently improves upon the MAIPP planners it is trained from, achieving up to 4x faster execution and 17% increased information gain, while scaling effectively to larger numbers of agents. Our implementation is publicly available at https://github.com/marmotlab/AID.
๐
2025-12-02
๐ ArXiv Preprint (Computer Vision)
๐ฅ Whye Kit Fong, Venice Erin Liong, Kok Seang Tan, Holger Caesar
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization \& mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Huiqian Li, Wei Pan, Haodong Zhang, Jin Huang, Zhihua Zhong
Robotics
ๆฅ็ๆ่ฆ
World models have gained significant attention as a promising approach for autonomous driving. By emulating human-like perception and decision-making processes, these models can predict and adapt to dynamic environments. Existing methods typically map high-dimensional observations into compact latent spaces and learn optimal policies within these latent representations. However, prior work usually jointly learns ego-vehicle dynamics and environmental transition dynamics from the image input, leading to inefficiencies and a lack of robustness to variations in vehicle dynamics. To address these issues, we propose the Vehicle Dynamics embedded Dreamer (VDD) method, which decouples the modeling of ego-vehicle dynamics from environmental transition dynamics. This separation allows the world model to generalize effectively across vehicles with diverse parameters. Additionally, we introduce two strategies to further enhance the robustness of the learned policy: Policy Adjustment during Deployment (PAD) and Policy Augmentation during Training (PAT). Comprehensive experiments in simulated environments demonstrate that the proposed model significantly improves both driving performance and robustness to variations in vehicle dynamics, outperforming existing approaches.
๐
2025-12-02
๐ ArXiv Preprint (eess.SY)
๐ฅ Sungjun Seo, Kooktae Lee
Robotics
ๆฅ็ๆ่ฆ
This paper presents Density-based Predictive Control (DPC), a novel multi-agent control strategy for efficient non-uniform area coverage, grounded in optimal transport theory. In large-scale scenarios such as search and rescue or environmental monitoring, traditional uniform coverage fails to account for varying regional priorities. DPC leverages a pre-constructed reference distribution to allocate agents' coverage efforts, spending more time in high-priority or densely sampled regions. We analyze convergence conditions using the Wasserstein distance, derive an analytic optimal control law for unconstrained cases, and propose a numerical method for constrained scenarios. Simulations on first-order dynamics and linearized quadrotor models demonstrate that DPC achieves trajectories closely matching the non-uniform reference distribution, outperforming existing coverage methods.
๐
2025-12-02
๐ ArXiv Preprint (Robotics)
๐ฅ Zihan Zhu, Wei Zhang, Norbert Haala, Marc Pollefeys, Daniel Barath
Machine Learning
ๆฅ็ๆ่ฆ
We present VIGS-SLAM, a visual-inertial 3D Gaussian Splatting SLAM system that achieves robust real-time tracking and high-fidelity reconstruction. Although recent 3DGS-based SLAM methods achieve dense and photorealistic mapping, their purely visual design degrades under motion blur, low texture, and exposure variations. Our method tightly couples visual and inertial cues within a unified optimization framework, jointly refining camera poses, depths, and IMU states. It features robust IMU initialization, time-varying bias modeling, and loop closure with consistent Gaussian updates. Experiments on four challenging datasets demonstrate our superiority over state-of-the-art methods. Project page: https://vigs-slam.github.io
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Stefano Zeppieri
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel at generating coherent text within a single prompt but fall short in sustaining relevance, personalization, and continuity across extended interactions. Human communication, however, relies on multiple forms of memory, from recalling past conversations to adapting to personal traits and situational context. This paper introduces the Mixed Memory-Augmented Generation (MMAG) pattern, a framework that organizes memory for LLM-based agents into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory. Drawing inspiration from cognitive psychology, we map these layers to technical components and outline strategies for coordination, prioritization, and conflict resolution. We demonstrate the approach through its implementation in the Heero conversational agent, where encrypted long-term bios and conversational history already improve engagement and retention. We further discuss implementation concerns around storage, retrieval, privacy, and latency, and highlight open challenges. MMAG provides a foundation for building memory-rich language agents that are more coherent, proactive, and aligned with human needs.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Liangji Kong, Aditya Joshi, Sarvnaz Karimi
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured (heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts -- farmer advisors -- to obtain credible preliminary answers from complex evidence sources from the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answers, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Olivia Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity - how detailed or vague a prompt is - remains understudied. This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks. Our results highlight the need for adaptive prompting strategies and provide tools and data to support further research.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Alexander Gurung, Nikolay Malkin, Mirella Lapata
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model 'skip' reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Vukosi Marivatee, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe et al.
ๆฅ็ๆ่ฆ
This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the langauges concerned.
๐
2025-12-01
๐ WWW
๐ฅ David Ph. Shakouri, Crit Cremers, Niels O. Schiller
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
This paper presents an initial study performed by the MODOMA system. The MODOMA is a computational multi-agent laboratory environment for unsupervised language acquisition experiments such that acquisition is based on the interaction between two language models, an adult and a child agent. Although this framework employs statistical as well as rule-based procedures, the result of language acquisition is a knowledge-based language model, which can be used to generate and parse new utterances of the target language. This system is fully parametrized and researchers can control all aspects of the experiments while the results of language acquisition, that is, the acquired grammatical knowledge, are explicitly represented and can be consulted. Thus, this system introduces novel possibilities for conducting computational language acquisition experiments. The experiments presented by this paper demonstrate that functional and content categories can be acquired and represented by the daughter agent based on training and test data containing different amounts of exemplars generated by the adult agent. Interestingly, similar patterns, which are well-established for human-generated data, are also found for these machine-generated data. As the procedures resulted in the successful acquisition of discrete grammatical categories by the child agent, these experiments substantiate the validity of the MODOMA approach to modelling language acquisition.
๐
2025-12-01
๐ 8 pages (6 pages of main text + 2 pages of referen
๐ฅ Mohammad Shokri, Alexandra C. Salem, Gabriel Levine, Johanna Devaney, Sarah Ita Levitan
Natural Language Processing
ๆฅ็ๆ่ฆ
In this paper, we introduce Story2MIDI, a sequence-to-sequence Transformer-based model for generating emotion-aligned music from a given piece of text. To develop this model, we construct the Story2MIDI dataset by merging existing datasets for sentiment analysis from text and emotion classification in music. The resulting dataset contains pairs of text blurbs and music pieces that evoke the same emotions in the reader or listener. Despite the small scale of our dataset and limited computational resources, our results indicate that our model effectively learns emotion-relevant features in music and incorporates them into its generation process, producing samples with diverse emotional responses. We evaluate the generated outputs using objective musical metrics and a human listening study, confirming the model's ability to capture intended emotional cues.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
๐
2025-12-01
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Aaron Sandoval, Cody Rushing
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The field of AI Control seeks to develop robust control protocols, deployment safeguards for untrusted AI which may be intentionally subversive. However, existing protocols that rely on weaker monitors to detect unsafe behavior often fail on complex tasks beyond the monitor's comprehension. We develop control protocols based on factored cognition, in which a trusted model decomposes a task, an untrusted model solves each resultant child task in isolation, and the results are reassembled into a full solution. These protocols may improve safety by several means, such as by simplifying the context for monitors, or by obscuring vulnerabilities in the environment. We implement our protocols in the APPS coding setting and red team them against backdoor attempts from an adversarial GPT-4.1 Nano. We find that: (i) Adding factored cognition to a trusted monitoring protocol can boost safety from 41% to 63%; (ii) Safety improves because monitor performance improves; (iii) Factored cognition makes it no harder for capable LLMs to write backdoors in APPS. While our protocols show low usefulness in APPS, they hold promise for more complex tasks.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pritish N. Desai, Tanay Kewalramani, Srimanta Mandal
Natural Language Processing
ๆฅ็ๆ่ฆ
Abusive speech on social media poses a persistent and evolving challenge, driven by the continuous emergence of novel slang and obfuscated terms designed to circumvent detection systems. In this work, we present a data efficient strategy for fine tuning BERT on hate speech classification by significantly reducing training set size without compromising performance. Our approach employs a TF IDF-based sample selection mechanism to retain only the most informative 75 percent of examples, thereby minimizing training overhead. To address the limitations of BERT's native vocabulary in capturing evolving hate speech terminology, we augment the tokenizer with domain-specific slang and lexical variants commonly found in abusive contexts. Experimental results on a widely used hate speech dataset demonstrate that our method achieves competitive performance while improving computational efficiency, highlighting its potential for scalable and adaptive abusive content moderation.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The mechanism by which RL contributes to reasoning capabilities-whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors-remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: Models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such complex strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhongjie Jiang
ๆฅ็ๆ่ฆ
Although synthetic data is widely promoted as a remedy, its prevailing production paradigm -- one optimizing for statistical smoothness -- systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse.
This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators.
The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33.
Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiahao Yuan, Zhiqing Cui, Hanqing Wang, Yuansheng Gao, Yucheng Zhou et al.
Machine Learning
ๆฅ็ๆ่ฆ
As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop pipeline with iterative rubric-guided refinement to ensure psychological plausibility and persona consistency. This progressive empathy pipeline that integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms othet methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at https://github.com/JhCircle/Kardia-R1.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jane Lange, Mingda Qiao
ๆฅ็ๆ่ฆ
Membership queries (MQ) often yield speedups for learning tasks, particularly in the distribution-specific setting. We show that in the \emph{testable learning} model of Rubinfeld and Vasilyan [RV23], membership queries cannot decrease the time complexity of testable learning algorithms beyond the complexity of sample-only distribution-specific learning. In the testable learning model, the learner must output a hypothesis whenever the data distribution satisfies a desired property, and if it outputs a hypothesis, the hypothesis must be near-optimal.
We give a general reduction from sample-based \emph{refutation} of boolean concept classes, as presented in [Vadhan17, KL18], to testable learning with queries (TL-Q). This yields lower bounds for TL-Q via the reduction from learning to refutation given in [KL18]. The result is that, relative to a concept class and a distribution family, no $m$-sample TL-Q algorithm can be super-polynomially more time-efficient than the best $m$-sample PAC learner.
Finally, we define a class of ``statistical'' MQ algorithms that encompasses many known distribution-specific MQ learners, such as those based on influence estimation or subcube-conditional statistical queries. We show that TL-Q algorithms in this class imply efficient statistical-query refutation and learning algorithms. Thus, combined with known SQ dimension lower bounds, our results imply that these efficient membership query learners cannot be made testable.
๐
2025-12-01
๐ Accepted at the 2025 IEEE International Symposium
๐ฅ Adel Chehade, Edoardo Ragusa, Paolo Gastaldo, Rodolfo Zunino
ๆฅ็ๆ่ฆ
Traffic classification (TC) plays a critical role in cybersecurity, particularly in IoT and embedded contexts, where inspection must often occur locally under tight hardware constraints. We use hardware-aware neural architecture search (HW-NAS) to derive lightweight TC models that are accurate, efficient, and deployable on edge platforms. Two input formats are considered: a flattened byte sequence and a 2D packet-wise time series; we examine how input structure affects adversarial vulnerability when using resource-constrained models. Robustness is assessed against white-box attacks, specifically Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). On USTC-TFC2016, both HW-NAS models achieve over 99% clean-data accuracy while remaining within 65k parameters and 2M FLOPs. Yet under perturbations of strength 0.1, their robustness diverges: the flat model retains over 85% accuracy, while the time-series variant drops below 35%. Adversarial fine-tuning delivers robust gains, with flat-input accuracy exceeding 96% and the time-series variant recovering over 60 percentage points in robustness, all without compromising efficiency. The results underscore how input structure influences adversarial vulnerability, and show that even compact, resource-efficient models can attain strong robustness, supporting their practical deployment in secure edge-based TC.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jeremy Andrew Irvin, Jiaqi Han, Zikui Wang, Abdulaziz Alharbi, Yufei Zhao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Generative models have the potential to transform the way we emulate Earth's changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code is publicly available at https://github.com/stanfordmlgroup/spf .
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joshua Wolff Anderson, Shyam Visweswaran
Machine Learning
ๆฅ็ๆ่ฆ
Trustworthy machine learning in healthcare requires strong predictive performance, fairness, and explanations. While it is known that improving fairness can affect predictive performance, little is known about how fairness improvements influence explainability, an essential ingredient for clinical trust. Clinicians may hesitate to rely on a model whose explanations shift after fairness constraints are applied. In this study, we examine how enhancing fairness through bias mitigation techniques reshapes Shapley-based feature rankings. We quantify changes in feature importance rankings after applying fairness constraints across three datasets: pediatric urinary tract infection risk, direct anticoagulant bleeding risk, and recidivism risk. We also evaluate multiple model classes on the stability of Shapley-based rankings. We find that increasing model fairness across racial subgroups can significantly alter feature importance rankings, sometimes in different ways across groups. These results highlight the need to jointly consider accuracy, fairness, and explainability in model assessment rather than in isolation.
๐
2025-12-01
๐ ArXiv Preprint (eess.SY)
๐ฅ Alexander Davydov
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We address the problem of verifying closed-loop contraction in nonlinear control systems whose controller and contraction metric are both parameterized by neural networks. By leveraging interval analysis and interval bound propagation, we derive a tractable and scalable sufficient condition for closed-loop contractivity that reduces to checking that the dominant eigenvalue of a symmetric Metzler matrix is nonpositive. We combine this sufficient condition with a domain partitioning strategy to integrate this sufficient condition into training. The proposed approach is validated on an inverted pendulum system, demonstrating the ability to learn neural network controllers and contraction metrics that provably satisfy the contraction condition.
๐
2025-12-01
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Nazmus Shakib Sayom, Luis Garcia
Robotics
ๆฅ็ๆ่ฆ
We propose a methodology for falsifying safety properties in robotic vehicle systems through property-guided reduction and surrogate execution. By isolating only the control logic and physical dynamics relevant to a given specification, we construct lightweight surrogate models that preserve property-relevant behaviors while eliminating unrelated system complexity. This enables scalable falsification via trace analysis and temporal logic oracles.
We demonstrate the approach on a drone control system containing a known safety flaw. The surrogate replicates failure conditions at a fraction of the simulation cost, and a property-guided fuzzer efficiently discovers semantic violations. Our results suggest that controller reduction, when coupled with logic-aware test generation, provides a practical and scalable path toward semantic verification of cyber-physical systems.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop
Machine Learning
ๆฅ็ๆ่ฆ
Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent's experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.
๐
2025-12-01
๐ WWW
๐ฅ Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data. Code, Data, and Models available at https://www.di.ens.fr/willow/research/guardian/.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Eric C. Joyce, Qianwen Zhao, Nathaniel Burgdorfer, Long Wang, Philippos Mordohai
Computer Vision
ๆฅ็ๆ่ฆ
We present a framework for evaluating 6-DoF instance-level object pose estimators, focusing on those that require a single RGB (not RGB-D) image as input. Besides gaining intuition about how accurate these estimators are, we are interested in the degree to which they can serve as the sole perception mechanism for robotic grasping. To assess this, we perform grasping trials in a physics-based simulator, using image-based pose estimates to guide a parallel gripper and an underactuated robotic hand in picking up 3D models of objects. Our experiments on a subset of the BOP (Benchmark for 6D Object Pose Estimation) dataset compare five open-source object pose estimators and provide insights that were missing from the literature.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Assuming the optimality of human demonstrations is core to existing VLA policies. However, we claim that in highly dexterous and precise manipulation tasks, human demonstrations are noisy and suboptimal. GR-RL proposes a multi-stage training pipeline that filters, augments, and reinforces the demonstrations by reinforcement learning. First, GR-RL learns a vision-language-conditioned task progress, filters the demonstration trajectories, and only keeps the transitions that contribute positively to the progress. Specifically, we show that by directly applying offline RL with sparse reward, the resulting $Q$-values can be treated as a robust progress function. Next, we introduce morphological symmetry augmentation that greatly improves the generalization and performance of GR-RL. Lastly, to better align the VLA policy with its deployment behaviors for high-precision control, we perform online RL by learning a latent space noise predictor. With this pipeline, GR-RL is, to our knowledge, the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundations models to specialize into reliable real-world experts.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yumeng He, Ying Jiang, Jiayin Lu, Yin Yang, Chenfanfu Jiang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling. Project page: https://heyumeng.com/SPARK/index.html.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Jianlei Chang, Ruofeng Mei, Wei Ke, Xiangyu Xu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Generative modeling has recently shown remarkable promise for visuomotor policy learning, enabling flexible and expressive control across diverse embodied AI tasks. However, existing generative policies often struggle with data inefficiency, requiring large-scale demonstrations, and sampling inefficiency, incurring slow action generation during inference. We introduce EfficientFlow, a unified framework for efficient embodied AI with flow-based policy learning. To enhance data efficiency, we bring equivariance into flow matching. We theoretically prove that when using an isotropic Gaussian prior and an equivariant velocity prediction network, the resulting action distribution remains equivariant, leading to improved generalization and substantially reduced data demands. To accelerate sampling, we propose a novel acceleration regularization strategy. As direct computation of acceleration is intractable for marginal flow trajectories, we derive a novel surrogate loss that enables stable and scalable training using only conditional trajectories. Across a wide range of robotic manipulation benchmarks, the proposed algorithm achieves competitive or superior performance under limited data while offering dramatically faster inference. These results highlight EfficientFlow as a powerful and efficient paradigm for high-performance embodied AI.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sebastian Sanokowski, Kaustubh Patil, Alois Knoll
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Diffusion models have achieved remarkable success in data-driven learning and in sampling from complex, unnormalized target distributions. Building on this progress, we reinterpret Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem. We tackle this problem by minimizing the reverse Kullback-Leibler (KL) divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. By applying the policy gradient theorem to this objective, we derive a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. This leads to simple diffusion-based variants of Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and Wasserstein Policy Optimization (WPO), termed DiffSAC, DiffPPO and DiffWPO. All of these methods require only minor implementation changes to their base algorithm. We find that on standard continuous control benchmarks, DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Shaowei Liu, David Yifan Yao, Saurabh Gupta, Shenlong Wang
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Today, people can easily record memorable moments, ranging from concerts, sports events, lectures, family gatherings, and birthday parties with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving an median synchronization error below 50 ms.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Massively parallel simulation has reduced reinforcement learning (RL) training time for robots from days to minutes. However, achieving fast and reliable sim-to-real RL for humanoid control remains difficult due to the challenges introduced by factors such as high dimensionality and domain randomization. In this work, we introduce a simple and practical recipe based on off-policy RL algorithms, i.e., FastSAC and FastTD3, that enables rapid training of humanoid locomotion policies in just 15 minutes with a single RTX 4090 GPU. Our simple recipe stabilizes off-policy RL algorithms at massive scale with thousands of parallel environments through carefully tuned design choices and minimalist reward functions. We demonstrate rapid end-to-end learning of humanoid locomotion controllers on Unitree G1 and Booster T1 robots under strong domain randomization, e.g., randomized dynamics, rough terrain, and push perturbations, as well as fast training of whole-body human-motion tracking policies. We provide videos and open-source implementation at: https://younggyo.me/fastsac-humanoid.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Guillermo Garcia-Cobo, Maximilian Igl, Peter Karkus, Zhejun Zhang, Michael Watson et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift when deployed in closed loop, leading to compounding errors. We introduce Rollouts as Demonstrations (RoaD), a simple and efficient method to mitigate covariate shift by leveraging the policy's own closed-loop rollouts as additional training data. During rollout generation, RoaD incorporates expert guidance to bias trajectories toward high-quality behavior, producing informative yet realistic demonstrations for fine-tuning. This approach enables robust closed-loop adaptation with orders of magnitude less data than reinforcement learning, and avoids restrictive assumptions of prior closed-loop supervised fine-tuning (CL-SFT) methods, allowing broader applications domains including end-to-end driving. We demonstrate the effectiveness of RoaD on WOSAC, a large-scale traffic simulation benchmark, where it performs similar or better than the prior CL-SFT method; and in AlpaSim, a high-fidelity neural reconstruction-based simulator for end-to-end driving, where it improves driving score by 41\% and reduces collisions by 54\%.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sai Kolasani, Maxim Saplin, Nicholas Crispino, Kyle Montgomery, Jared Quincy Davis et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs) through extended agentic interaction in the domain of chess. We rank over 50 open and closed source models by playing against a random opponent using a range of behavioral metrics, including win and loss rates, move quality, move legality, hallucinated actions, and game duration. For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for comparisons between models in an easily understandable way. Despite the simplicity of the instruction-following task and the weakness of the opponent, many state-of-the-art models struggle to complete games or achieve consistent wins. Similar to other benchmarks on complex reasoning tasks, our experiments reveal a clear separation between reasoning and non-reasoning models. However, unlike existing static benchmarks, the stochastic and dynamic nature of LLM CHESS uniquely reduces overfitting and memorization while preventing benchmark saturation, proving difficult even for top reasoning models. To support future work on evaluating reasoning and instruction-following in LLMs, we release our experimental framework, a public leaderboard, and a dataset of associated games.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop
Machine Learning
ๆฅ็ๆ่ฆ
Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent's experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and ambiguity in real world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain of Ground CoG a training free multi step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of direct prediction the model progressively reflects and adjusts its hypotheses leading to more accurate and interpretable localization. Our approach achieves 68.4 accuracy on the ScreenSpot Pro benchmark an improvement of 4.8 points. To measure real world generalization we introduce TPanel UI a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI Chain of Ground improves over the strong baseline Qwen3 VL 235B by 6.9 points showing the effectiveness of multi step training free grounding across real world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.
๐
2025-12-01
๐ ArXiv Preprint (eess.SY)
๐ฅ William Xu, Amir Eskanlou, Mansur Arief, David Zhen Yin, Jef K. Caers
Machine Learning
ๆฅ็ๆ่ฆ
The global capacity for mineral processing must expand rapidly to meet the demand for critical minerals, which are essential for building the clean energy technologies necessary to mitigate climate change. However, the efficiency of mineral processing is severely limited by uncertainty, which arises from both the variability of feedstock and the complexity of process dynamics. To optimize mineral processing circuits under uncertainty, we introduce an AI-driven approach that formulates mineral processing as a Partially Observable Markov Decision Process (POMDP). We demonstrate the capabilities of this approach in handling both feedstock uncertainty and process model uncertainty to optimize the operation of a simulated, simplified flotation cell as an example. We show that by integrating the process of information gathering (i.e., uncertainty reduction) and process optimization, this approach has the potential to consistently perform better than traditional approaches at maximizing an overall objective, such as net present value (NPV). Our methodological demonstration of this optimization-under-uncertainty approach for a synthetic case provides a mathematical and computational framework for later real-world application, with the potential to improve both the laboratory-scale design of experiments and industrial-scale operation of mineral processing circuits without any additional hardware.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The mechanism by which RL contributes to reasoning capabilities-whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors-remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: Models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such complex strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jie Meng, Jin Mao
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) are predominantly used as evaluators for natural language generation (NLG) tasks, but their application to broader evaluation scenarios remains limited. In this work, we explore the potential of LLMs as general evaluators across diverse tasks. Although LLM-based evaluators have made progress in different areas, existing methods struggle to generalize due to their reliance on costly, human-designed evaluation principles, which are often misaligned with both annotated data and LLMs' understanding.To address these challenges, we propose a rule-augmented evaluation paradigm. First, we introduce a rule distillation method that automatically extracts scoring rules from data using an LLM-assisted Monte Carlo Tree Search (MCTS), alleviating scalability issues and improving alignment with data. Second, to enable LLMs to effectively apply the learned rules, we propose two strategies: (1) Chain-of-Rule (CoR), which guides LLM to follow distilled rules, and (2) training a rule-augmented LLM evaluator (RuAE) via reinforcement learning, further bridging the gap between rules and LLMs' reasoning. Extensive experiments on diverse tasks demonstrate the effectiveness and generalizability of our approach across various evaluation scenarios.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Han Zhou, Xingchen Wan, Ivan Vuliฤ, Anna Korhonen
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.
๐
2025-12-01
๐ ArXiv Preprint (cs.SE)
๐ฅ Yanlin Wang, Xinyi Xu, Jiachi Chen, Tingting Bi, Wenchao Gu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The rise of large language models (LLMs) has sparked a surge of interest in agents, leading to the rapid growth of agent frameworks. Agent frameworks are software toolkits and libraries that provide standardized components, abstractions, and orchestration mechanisms to simplify agent development. Despite widespread use of agent frameworks, their practical applications and how they influence the agent development process remain underexplored. Different agent frameworks encounter similar problems during use, indicating that these recurring issues deserve greater attention and call for further improvements in agent framework design. Meanwhile, as the number of agent frameworks continues to grow and evolve, more than 80% of developers report difficulties in identifying the frameworks that best meet their specific development requirements. In this paper, we conduct the first empirical study of LLM-based agent frameworks, exploring real-world experiences of developers in building AI agents. To compare how well the agent frameworks meet developer needs, we further collect developer discussions for the ten previously identified agent frameworks, resulting in a total of 11,910 discussions. Finally, by analyzing these discussions, we compare the frameworks across five dimensions: development efficiency, functional abstraction, learning cost, performance optimization, and maintainability, which refers to how easily developers can update and extend both the framework itself and the agents built upon it over time. Our comparative analysis reveals significant differences among frameworks in how they meet the needs of agent developers. Overall, we provide a set of findings and implications for the LLM-driven AI agent framework ecosystem and offer insights for the design of future LLM-based agent frameworks and agent developers.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nico Daheim, Thomas Mรถllenhoff, Ming Liang Ang, Mohammad Emtiyaz Khan
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed-up training by using gradient corrections, but have seen limited success in deep learning. Here, we show surprising new foundational connections of SVRG to a recently proposed Bayesian method called posterior correction. Specifically, we show that SVRG is recovered as a special case of posterior correction over the isotropic-Gaussian family, while novel extensions are automatically obtained by using more flexible exponential families. We derive two new SVRG variants by using Gaussian families: First, a Newton-like variant that employs novel Hessian corrections, and second, an Adam-like extension that improves pretraining and finetuning of Transformer language models. This is the first work to connect SVRG to Bayes and use it to boost variational training for deep networks.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
๐
2025-12-01
๐ Accepted for publication in IEEE Robotics and Auto
๐ฅ Kentaro Fujii, Shingo Murata
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Robots in uncertain real-world environments must perform both goal-directed and exploratory actions. However, most deep learning-based control methods neglect exploration and struggle under uncertainty. To address this, we adopt deep active inference, a framework that accounts for human goal-directed and exploratory actions. Yet, conventional deep active inference approaches face challenges due to limited environmental representation capacity and high computational cost in action selection. We propose a novel deep active inference framework that consists of a world model, an action model, and an abstract world model. The world model encodes environmental dynamics into hidden state representations at slow and fast timescales. The action model compresses action sequences into abstract actions using vector quantization, and the abstract world model predicts future slow states conditioned on the abstract action, enabling low-cost action selection. We evaluate the framework on object-manipulation tasks with a real-world robot. Results show that it achieves high success rates across diverse manipulation tasks and switches between goal-directed and exploratory actions in uncertain settings, while making action selection computationally tractable. These findings highlight the importance of modeling multiple timescale dynamics and abstracting actions and state transitions.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Heloisa Candello, Muneeza Azmat, Uma Sushmitha Gunturi, Raya Horesh, Rogerio Abreu de Paula et al.
Machine Learning
ๆฅ็ๆ่ฆ
With the rapid uptake of generative AI, investigating human perceptions of generated responses has become crucial. A major challenge is their `aptitude' for hallucinating and generating harmful contents. Despite major efforts for implementing guardrails, human perceptions of these mitigation strategies are largely unknown. We conducted a mixed-method experiment for evaluating the responses of a mitigation strategy across multiple-dimensions: faithfulness, fairness, harm-removal capacity, and relevance. In a within-subject study design, 57 participants assessed the responses under two conditions: harmful response plus its mitigation and solely mitigated response. Results revealed that participants' native language, AI work experience, and annotation familiarity significantly influenced evaluations. Participants showed high sensitivity to linguistic and contextual attributes, penalizing minor grammar errors while rewarding preserved semantic contexts. This contrasts with how language is often treated in the quantitative evaluation of LLMs. We also introduced new metrics for training and evaluating mitigation strategies and insights for human-AI evaluation studies.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ahmed Nebli
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The training of deep vision models is fundamentally a signal recovery problem amidst high-dimensional stochastic noise. Current optimization paradigms impose a static compromise on information channel capacity. For instance, magnitude-based methods, such as AdamW, operate on the assumption that gradient norms are high-fidelity curvature signals. While this allows for precision in smooth regimes, it leads to catastrophic noise amplification when applied to rugged, non-convex landscapes. Conversely, sign-based methods (e.g., Lion) perform a radical 1-bit quantization of the gradient, which aims to provide robust regularization at the cost of discarding fine-grained descent information. We propose that optimal convergence requires neither static prior, but rather a dynamic modulation of the update bitrate. We introduce \textbf{ThermoLion}, a vision-centric framework that utilizes local Signal-to-Noise Ratio (SNR) gating to autonomously transition parameters between a "low-bit" exploration phase and a "high-precision" exploitation phase. Furthermore, we introduce a Momentum Alignment mechanism that detects constructive interference between historical drift and instantaneous gradients to accelerate convergence during stable trajectories. Empirical benchmarks across 12 diverse vision datasets (including CIFAR, SVHN, and GTSRB) demonstrate that ThermoLion serves as a hyperparameter-free generalist, surpassing both AdamW and Lion in convergence speed and terminal accuracy without architecture-specific tuning.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Daren Zhong, Dingcheng Huang, Clayton Greenberg
Natural Language Processing
ๆฅ็ๆ่ฆ
Chess, a deterministic game with perfect information, has long served as a benchmark for studying strategic decision-making and artificial intelligence. Traditional chess engines or tools for analysis primarily focus on calculating optimal moves, often neglecting the variability inherent in human chess playing, particularly across different skill levels.
To overcome this limitation, we propose a novel and computationally efficient move prediction framework that approaches chess move prediction as a behavioral analysis task. The framework employs n-gram language models to capture move patterns characteristic of specific player skill levels. By dividing players into seven distinct skill groups, from novice to expert, we trained separate models using data from the open-source chess platform Lichess. The framework dynamically selects the most suitable model for prediction tasks and generates player moves based on preceding sequences.
Evaluation on real-world game data demonstrates that the model selector module within the framework can classify skill levels with an accuracy of up to 31.7\% when utilizing early game information (16 half-moves). The move prediction framework also shows substantial accuracy improvements, with our Selector Assisted Accuracy being up to 39.1\% more accurate than our benchmark accuracy. The computational efficiency of the framework further enhances its suitability for real-time chess analysis.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Gaganpreet Jhajj, Fuhua Lin
Machine Learning
ๆฅ็ๆ่ฆ
In this work, we propose that reasoning in knowledge graph (KG) networks can be guided by surprise minimization. Entities that are close in graph distance will have lower surprise than those farther apart. This connects the Free Energy Principle (FEP) from neuroscience to KG systems, where the KG serves as the agent's generative model. We formalize surprise using the shortest-path distance in directed graphs and provide a framework for KG-based agents. Graph distance appears in graph neural networks as message passing depth and in model-based reinforcement learning as world model trajectories. This work-in-progress study explores whether distance-based surprise can extend recent work showing that syntax minimizes surprise and free energy via tree structures.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alessandro Breccia, Federica Gerace, Marco Lippi, Gabriele Sicuro, Pierluigi Contucci
Natural Language Processing
ๆฅ็ๆ่ฆ
We study whether a Large Language Model can learn the deterministic sequence of trees generated by the iterated prime factorization of the natural numbers. Each integer is mapped into a rooted planar tree and the resulting sequence $ \mathbb{N}\mathcal{T}$ defines an arithmetic text with measurable statistical structure. A transformer network (the GPT-2 architecture) is trained from scratch on the first $10^{11}$ elements to subsequently test its predictive ability under next-word and masked-word prediction tasks. Our results show that the model partially learns the internal grammar of $\mathbb{N}\mathcal{T}$, capturing non-trivial regularities and correlations. This suggests that learnability may extend beyond empirical data to the very structure of arithmetic.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Adel Moumen, Guangzhi Sun, Philip C. Woodland
Natural Language Processing
ๆฅ็ๆ่ฆ
Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult. We present a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. We also release an EN-FR training dataset, TinyStories (~42k hours), together with EN-FR spoken StoryCloze and TopicCloze benchmarks for cross-lingual semantic evaluation, both synthetically generated using GPT-4. On 360M and 1B SLMs under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment. Taken together, these results indicate that cross-lingual interleaving is a simple, scalable route to building multilingual SLMs that understand and converse across languages. All resources will be made open-source to support reproducibility.
๐
2025-12-01
๐ ArXiv Preprint (cond-mat.mes-hall)
๐ฅ Ahmed Abouelkomsan, Max Geier, Liang Fu
Machine Learning
ๆฅ็ๆ่ฆ
Topologically ordered states are among the most interesting quantum phases of matter that host emergent quasi-particles having fractional charge and obeying fractional quantum statistics. Theoretical study of such states is however challenging owing to their strong-coupling nature that prevents conventional mean-field treatment. Here, we demonstrate that an attention-based deep neural network provides an expressive variational wavefunction that discovers fractional Chern insulator ground states purely through energy minimization without prior knowledge and achieves remarkable accuracy. We introduce an efficient method to extract ground state topological degeneracy -- a hallmark of topological order -- from a single optimized real-space wavefunction in translation-invariant systems by decomposing it into different many-body momentum sectors. Our results establish neural network variational Monte Carlo as a versatile tool for discovering strongly correlated topological phases.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hrishikesh Terdalkar, Kirtan Bhojani, Aryan Dongare, Omm Aditya Behera
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly deployed in multilingual applications but often generate plausible yet incorrect or misleading outputs, known as hallucinations. While hallucination detection has been studied extensively in English, under-resourced Indian languages remain largely unexplored. We present BHRAM-IL, a benchmark for hallucination recognition and assessment in multiple Indian languages, covering Hindi, Gujarati, Marathi, Odia, along with English. The benchmark comprises 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. We evaluate 14 state-of-the-art multilingual LLMs on a benchmark subset of 10,265 questions, analyzing cross-lingual and factual hallucinations across languages, models, scales, categories, and domains using category-specific metrics normalized to (0,1) range. Aggregation over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, demonstrating the usefulness of BHRAM-IL for hallucination-focused evaluation. The dataset, and the code for generation and evaluation are available on GitHub (https://github.com/sambhashana/BHRAM-IL/) and HuggingFace (https://huggingface.co/datasets/sambhashana/BHRAM-IL/) to support future research in multilingual hallucination detection and mitigation.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingxuan Hu, Hongbo Ma, Xinlan Wu, Ziqi Liu, Jiaqi Liu et al.
ๆฅ็ๆ่ฆ
Audio-based depression detection models have demonstrated promising performance but often suffer from gender bias due to imbalanced training data. Epidemiological statistics show a higher prevalence of depression in females, leading models to learn spurious correlations between gender and depression. Consequently, models tend to over-diagnose female patients while underperforming on male patients, raising significant fairness concerns. To address this, we propose a novel Counterfactual Debiasing Framework grounded in causal inference. We construct a causal graph to model the decision-making process and identify gender bias as the direct causal effect of gender on the prediction. During inference, we employ counterfactual inference to estimate and subtract this direct effect, ensuring the model relies primarily on authentic acoustic pathological features. Extensive experiments on the DAIC-WOZ dataset using two advanced acoustic backbones demonstrate that our framework not only significantly reduces gender bias but also improves overall detection performance compared to existing debiasing strategies.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yudi Wu, Wenhao Zhao, Dianbo Liu
Machine Learning
ๆฅ็ๆ่ฆ
Generative diversity varies significantly across discrete latent generative models such as AR, MIM, and Diffusion. We propose a diagnostic framework, grounded in Information Bottleneck (IB) theory, to analyze the underlying strategies resolving this behavior. The framework models generation as a conflict between a 'Compression Pressure' - a drive to minimize overall codebook entropy - and a 'Diversity Pressure' - a drive to maximize conditional entropy given an input. We further decompose this diversity into two primary sources: 'Path Diversity', representing the choice of high-level generative strategies, and 'Execution Diversity', the randomness in executing a chosen strategy. To make this decomposition operational, we introduce three zero-shot, inference-time interventions that directly perturb the latent generative process and reveal how models allocate and express diversity. Application of this probe-based framework to representative AR, MIM, and Diffusion systems reveals three distinct strategies: "Diversity-Prioritized" (MIM), "Compression-Prioritized" (AR), and "Decoupled" (Diffusion). Our analysis provides a principled explanation for their behavioral differences and informs a novel inference-time diversity enhancement technique.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu et al.
ๆฅ็ๆ่ฆ
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision-a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) uncovers: specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally-isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling-ultimately limiting world knowledge internalization, generation.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xiaochuan Li, Ke Wang, Girija Gouda, Shubham Choudhary, Yaqun Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLMs) become integrated into high-stakes domains, there is a growing need for evaluation methods that are both scalable for real-time deployment and reliable for critical decision-making. While human evaluation is reliable, it is slow and costly. Single LLM judges are biased, and static juries lack adaptability. To overcome these limitations, we propose LLM Jury-on-Demand - a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts, leveraging token distributions, embeddings, and structural input features. This enables a fully adaptive evaluation where, for each data point, an optimal jury of the most reliable judges is dynamically selected, and their scores are aggregated using their reliability as weights. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines. These results highlight the promise of adaptive, learning-based juries for building scalable, more reliable and trustworthy evaluation systems for modern LLMs in high-stakes domains.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenhao Sun, Yuhao Mao, Martin Vechev
Machine Learning
ๆฅ็ๆ่ฆ
Randomized Smoothing (RS) is a prominent technique for certifying the robustness of neural networks against adversarial perturbations. With RS, achieving high accuracy at small radii requires a small noise variance, while achieving high accuracy at large radii requires a large noise variance. However, the global noise variance used in the standard RS formulation leads to a fundamental limitation: there exists no global noise variance that simultaneously achieves strong performance at both small and large radii. To break through the global variance limitation, we propose a dual RS framework which enables input-dependent noise variances. To achieve that, we first prove that RS remains valid with input-dependent noise variances, provided the variance is locally constant around each input. Building on this result, we introduce two components which form our dual RS framework: (i) a variance estimator first predicts an optimal noise variance for each input, (ii) this estimated variance is then used by a standard RS classifier. The variance estimator is independently smoothed via RS to ensure local constancy, enabling flexible design. We also introduce training strategies to iteratively optimize the two components. Extensive experiments on CIFAR-10 show that our dual RS method provides strong performance for both small and large radii-unattainable with global noise variance-while incurring only a 60% computational overhead at inference. Moreover, it consistently outperforms prior input-dependent noise approaches across most radii, with particularly large gains at radii 0.5, 0.75, and 1.0, achieving relative improvements of 19%, 24%, and 21%, respectively. On ImageNet, dual RS remains effective across all radii. Additionally, the dual RS framework naturally provides a routing perspective for certified robustness, improving the accuracy-robustness trade-off with off-the-shelf expert RS models.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhuoqian Yang, Mathieu Salzmann, Sabine Sรผsstrunk
Machine Learning
ๆฅ็ๆ่ฆ
In this work, we investigate the potential of weights to serve as effective representations, focusing on neural fields. Our key insight is that constraining the optimization space through a pre-trained base model and low-rank adaptation (LoRA) can induce structure in weight space. Across reconstruction, generation, and analysis tasks on 2D and 3D data, we find that multiplicative LoRA weights achieve high representation quality while exhibiting distinctiveness and semantic structure. When used with latent diffusion models, multiplicative LoRA weights enable higher-quality generation than existing weight-space methods.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Saba Kublashvili
Machine Learning
ๆฅ็ๆ่ฆ
Modeling historical events poses fundamental challenges for machine learning: extreme data scarcity (N << 100), heterogeneous and noisy measurements, missing counterfactuals, and the requirement for human interpretable explanations. We present HistoricalML, a probabilistic neuro-symbolic framework that addresses these challenges through principled integration of (1) Bayesian uncertainty quantification to separate epistemic from aleatoric uncertainty, (2) structural causal models for counterfactual reasoning under confounding, (3) cooperative game theory (Shapley values) for fair allocation modeling, and (4) attention based neural architectures for context dependent factor weighting. We provide theoretical analysis showing that our approach achieves consistent estimation in the sparse data regime when strong priors from domain knowledge are available, and that Shapley based allocation satisfies axiomatic fairness guarantees that pure regression approaches cannot provide. We instantiate the framework on two historical case studies: the 19th century partition of Africa (N = 7 colonial powers) and the Second Punic War (N = 2 factions). Our model identifies Germany's +107.9 percent discrepancy as a quantifiable structural tension preceding World War I, with tension factor 36.43 and 0.79 naval arms race correlation. For the Punic Wars, Monte Carlo battle simulations achieve a 57.3 percent win probability for Carthage at Cannae and 57.8 percent for Rome at Zama, aligning with historical outcomes. Counterfactual analysis reveals that Carthaginian political support (support score 6.4 vs Napoleon's 7.1), rather than military capability, was the decisive factor.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai et al.
ๆฅ็ๆ่ฆ
Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhongyuan Wu, Jingyuan Wang, Zexuan Cheng, Yilong Zhou, Weizhi Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Anomaly detection (AD) is a fundamental task of critical importance across numerous domains. Current systems increasingly operate in rapidly evolving environments that generate diverse yet interconnected data modalities -- such as time series, system logs, and tabular records -- as exemplified by modern IT systems. Effective AD methods in such environments must therefore possess two critical capabilities: (1) the ability to handle heterogeneous data formats within a unified framework, allowing the model to process and detect multiple modalities in a consistent manner during anomalous events; (2) a strong generalization ability to quickly adapt to new scenarios without extensive retraining. However, most existing methods fall short of these requirements, as they typically focus on single modalities and lack the flexibility to generalize across domains. To address this gap, we introduce a novel paradigm: In-Context Anomaly Detection (ICAD), where anomalies are defined by their dissimilarity to a relevant reference set of normal samples. Under this paradigm, we propose ICAD-LLM, a unified AD framework leveraging Large Language Models' in-context learning abilities to process heterogeneous data within a single model. Extensive experiments demonstrate that ICAD-LLM achieves competitive performance with task-specific AD methods and exhibits strong generalization to previously unseen tasks, which substantially reduces deployment costs and enables rapid adaptation to new environments. To the best of our knowledge, ICAD-LLM is the first model capable of handling anomaly detection tasks across diverse domains and modalities.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dengyun Peng, Qiguang Chen, Bofei Liu, Jiannan Guan, Libo Qin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Ensuring LLM reliability requires not only solving complex problems but also recognizing when a problem is unsolvable. Current models often struggle to distinguish objective unsolvability (inherent contradictions in the problem) from subjective capability limitations (problems beyond the model's competence), which leads to hallucinations and overconfidence. To address this, we propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability. Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology: programmatic generation for logic puzzles and a novel "Reverse Construction" method that injects contradictions into valid reasoning chains for mathematics. Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty. Empirical results show that our approach achieves near-perfect unsolvability detection while also improving accuracy on solvable tasks. Crucially, we identify Capability Collapse, demonstrating that explicit exposure to unsolvable data is indispensable for preventing models from becoming systematically overconfident. Our code and data are available at https://github.com/sfasfaffa/unsolvableQA.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Valentin Noรซl, Elimane Yassine Seidou, Charly Ken Capo-Chichi, Ghanem Amari
Robotics
ๆฅ็ๆ่ฆ
Legal AI systems powered by retrieval-augmented generation (RAG) face a critical accountability challenge: when an AI assistant cites case law, statutes, or contractual clauses, practitioners need verifiable guarantees that generated text faithfully represents source documents. Existing hallucination detectors rely on semantic similarity metrics that tolerate entity substitutions, a dangerous failure mode when confusing parties, dates, or legal provisions can have material consequences. We introduce HalluGraph, a graph-theoretic framework that quantifies hallucinations through structural alignment between knowledge graphs extracted from context, query, and response. Our approach produces bounded, interpretable metrics decomposed into \textit{Entity Grounding} (EG), measuring whether entities in the response appear in source documents, and \textit{Relation Preservation} (RP), verifying that asserted relationships are supported by context. On structured control documents, HalluGraph achieves near-perfect discrimination ($>$400 words, $>$20 entities), HalluGraph achieves $AUC = 0.979$, while maintaining robust performance ($AUC \approx 0.89$) on challenging generative legal task, consistently outperforming semantic similarity baselines. The framework provides the transparency and traceability required for high-stakes legal applications, enabling full audit trails from generated assertions back to source passages.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chainesh Gautam, Raghuram Bharadwaj Diddigi
Computer VisionMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recently, there has been an increasing need to develop agents capable of solving multiple tasks within the same environment, especially when these tasks are naturally associated with language. In this work, we propose a novel approach that leverages combinations of pre-trained (language, policy) pairs to establish an efficient transfer pipeline. Our algorithm is inspired by the principles of Contrastive Language-Image Pretraining (CLIP) in Computer Vision, which aligns representations across different modalities under the philosophy that ''two modalities representing the same concept should have similar representations.'' The central idea here is that the instruction and corresponding policy of a task represent the same concept, the task itself, in two different modalities. Therefore, by extending the idea of CLIP to RL, our method creates a unified representation space for natural language and policy embeddings. Experimental results demonstrate the utility of our algorithm in achieving faster transfer across tasks.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junran Peng, Yiheng Huang, Silei Shen, Zeji Wei, Jingwei Yang et al.
ๆฅ็ๆ่ฆ
In this paper, we introduce RoleMotion, a large-scale human motion dataset that encompasses a wealth of role-playing and functional motion data tailored to fit various specific scenes. Existing text datasets are mainly constructed decentrally as amalgamation of assorted subsets that their data are nonfunctional and isolated to work together to cover social activities in various scenes. Also, the quality of motion data is inconsistent, and textual annotation lacks fine-grained details in these datasets. In contrast, RoleMotion is meticulously designed and collected with a particular focus on scenes and roles. The dataset features 25 classic scenes, 110 functional roles, over 500 behaviors, and 10296 high-quality human motion sequences of body and hands, annotated with 27831 fine-grained text descriptions. We build an evaluator stronger than existing counterparts, prove its reliability, and evaluate various text-to-motion methods on our dataset. Finally, we explore the interplay of motion generation of body and hands. Experimental results demonstrate the high-quality and functionality of our dataset on text-driven whole-body generation.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Nihaal Bhojwani, Chuwei Wang, Hai-Yang Wang, Chang Sun, Elias R. Most et al.
Machine Learning
ๆฅ็ๆ่ฆ
Modeling how supermassive black holes co-evolve with their host galaxies is notoriously hard because the relevant physics spans nine orders of magnitude in scale-from milliparsecs to megaparsecs--making end-to-end first-principles simulation infeasible. To characterize the feedback from the small scales, existing methods employ a static subgrid scheme or one based on theoretical guesses, which usually struggle to capture the time variability and derive physically faithful results. Neural operators are a class of machine learning models that achieve significant speed-up in simulating complex dynamics. We introduce a neural-operator-based ''subgrid black hole'' that learns the small-scale local dynamics and embeds it within the direct multi-level simulations. Trained on small-domain (general relativistic) magnetohydrodynamic data, the model predicts the unresolved dynamics needed to supply boundary conditions and fluxes at coarser levels across timesteps, enabling stable long-horizon rollouts without hand-crafted closures. Thanks to the great speedup in fine-scale evolution, our approach for the first time captures intrinsic variability in accretion-driven feedback, allowing dynamic coupling between the central black hole and galaxy-scale gas. This work reframes subgrid modeling in computational astrophysics with scale separation and provides a scalable path toward data-driven closures for a broad class of systems with central accretors.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Letian Yi, Tingpeng Zhang, Mingyuan Zhou, Guannan Wang, Quanke Su et al.
ๆฅ็ๆ่ฆ
Reconstructing full fields from extremely sparse and random measurements is a longstanding ill-posed inverse problem. A powerful framework for addressing such challenges is hierarchical probabilistic modeling, where uncertainty is represented by intermediate variables and resolved through marginalization during inference. Inspired by this principle, we propose Cascaded Sensing (Cas-Sensing), a hierarchical reconstruction framework that integrates an autoencoder-diffusion cascade. First, a neural operator-based functional autoencoder reconstructs the dominant structures of the original field - including large-scale components and geometric boundaries - from arbitrary sparse inputs, serving as an intermediate variable. Then, a conditional diffusion model, trained with a mask-cascade strategy, generates fine-scale details conditioned on these large-scale structures. To further enhance fidelity, measurement consistency is enforced via the manifold constrained gradient based on Bayesian posterior sampling during the generation process. This cascaded pipeline substantially alleviates ill-posedness, delivering accurate and robust reconstructions. Experiments on both simulation and real-world datasets demonstrate that Cas-Sensing generalizes well across varying sensor configurations and geometric boundaries, making it a promising tool for practical deployment in scientific and engineering applications.
๐
2025-12-01
๐ ArXiv Preprint (math.OC)
๐ฅ Alex Oshin, Rahul Vodeb Ghosh, Augustinos D. Saravanos, Evangelos A. Theodorou
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We propose an always-feasible quadratic programming (QP) optimizer, FlexQP, which is based on an exact relaxation of the QP constraints. If the original constraints are feasible, then the optimizer finds the optimal solution to the original QP. On the other hand, if the constraints are infeasible, the optimizer identifies a solution that minimizes the constraint violation in a sparse manner. FlexQP scales favorably with respect to the problem dimension, is robust to both feasible and infeasible QPs with minimal assumptions on the problem data, and can be effectively warm-started. We subsequently apply deep unfolding to improve our optimizer through data-driven techniques, leading to an accelerated Deep FlexQP. By learning dimension-agnostic feedback policies for the parameters from a small number of training examples, Deep FlexQP generalizes to problems with larger dimensions and can optimize for many more iterations than it was initially trained for. Our approach outperforms two recently proposed state-of-the-art accelerated QP approaches on a suite of benchmark systems including portfolio optimization, classification, and regression problems. We provide guarantees on the expected performance of our deep QP optimizer through probably approximately correct (PAC) Bayes generalization bounds. These certificates are used to design an accelerated sequential quadratic programming solver that solves nonlinear optimal control and predictive safety filter problems faster than traditional approaches. Overall, our approach is very robust and greatly outperforms existing non-learning and learning-based optimizers in terms of both runtime and convergence to the optimal solution across multiple classes of NLPs.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thao Thi Phuong Dao, Tan-Cong Nguyen, Nguyen Chi Thanh, Truong Hoang Viet, Trong-Le Do et al.
ๆฅ็ๆ่ฆ
Head and neck masses are space-occupying lesions that can compress the airway and esophagus and may affect nerves and blood vessels. Available public datasets primarily focus on malignant lesions and often overlook other space-occupying conditions in this region. To address this gap, we introduce MasHeNe, an initial dataset of 3,779 contrast-enhanced CT slices that includes both tumors and cysts with pixel-level annotations. We also establish a benchmark using standard segmentation baselines and report common metrics to enable fair comparison. In addition, we propose the Windowing-Enhanced Mamba with Frequency integration (WEMF) model. WEMF applies tri-window enhancement to enrich the input appearance before feature extraction. It further uses multi-frequency attention to fuse information across skip connections within a U-shaped Mamba backbone. On MasHeNe, WEMF attains the best performance among evaluated methods, with a Dice of 70.45%, IoU of 66.89%, NSD of 72.33%, and HD95 of 5.12 mm. This model indicates stable and strong results on this challenging task. MasHeNe provides a benchmark for head-and-neck mass segmentation beyond malignancy-only datasets. The observed error patterns also suggest that this task remains challenging and requires further research. Our dataset and code are available at https://github.com/drthaodao3101/MasHeNe.git.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) often generate unreliable answers, while heuristic uncertainty methods fail to fully distinguish correct from incorrect predictions, causing users to accept erroneous answers without statistical guarantees. We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level. To achieve this in a principled way, we propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint over selection and error indicators. Then, we establish a finite-sample sufficient condition, which relies only on a held-out set of exchangeable calibration samples, to compute an FDR-constrained, coverage-maximizing threshold. Furthermore, we extend LEC to a two-model routing mechanism: given a prompt, if the current model's uncertainty exceeds its calibrated threshold, we delegate it to a stronger model, while maintaining a unified FDR guarantee. Evaluations on closed-ended and open-ended question-answering (QA) datasets show that LEC achieves tighter FDR control and substantially improves sample retention over prior methods. Moreover, the two-model routing mechanism achieves lower risk levels while accepting more correct samples than each individual model.
๐
2025-12-01
๐ ArXiv Preprint (cs.DC)
๐ฅ Tom Goethals, Merlijn Sebrechts, Stijn De Schrijver, Filip De Turck, Bruno Volckaert
ๆฅ็ๆ่ฆ
Federated Learning is a popular approach for distributed learning due to its security and computational benefits. With the advent of powerful devices in the network edge, Gossip Learning further decentralizes Federated Learning by removing centralized integration and relying fully on peer to peer updates. However, the averaging methods generally used in both Federated and Gossip Learning are not ideal for model accuracy and global convergence. Additionally, there are few options to deploy Learning workloads in the edge as part of a larger application using a declarative approach such as Kubernetes manifests. This paper proposes Delta Sum Learning as a method to improve the basic aggregation operation in Gossip Learning, and implements it in a decentralized orchestration framework based on Open Application Model, which allows for dynamic node discovery and intent-driven deployment of multi-workload applications. Evaluation results show that Delta Sum performance is on par with alternative integration methods for 10 node topologies, but results in a 58% lower global accuracy drop when scaling to 50 nodes. Overall, it shows strong global convergence and a logarithmic loss of accuracy with increasing topology size compared to a linear loss for alternatives under limited connectivity.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Yuma Ichikawa, Yudai Fujimoto, Akira Sakai
ๆฅ็ๆ่ฆ
Post-training quantization (PTQ) aims to preserve model-level behavior; however, most methods focus on individual linear layers. Even recent extensions, such as QEP and LoaQ, which mitigate error propagation or target specific submodules, still rely on layer-wise formulations and fail to capture the behavior of larger submodules. We introduce Layer-Projected Coordinate Descent (LPCD), a unified framework that extends PTQ beyond layers by optimizing relaxed objectives across arbitrary submodules and projecting the solutions with layer-wise quantizers. LPCD generalizes existing methods and provides a principled approach to quantizing complex submodules while maintaining the efficiency and compatibility of layer-wise PTQ pipelines. Across diverse LLM architectures and bit-widths, LPCD-based submodule quantization consistently enhances both layer-wise PTQ methods and existing submodule approaches.
๐
2025-12-01
๐ ArXiv Preprint (cs.SD)
๐ฅ Tal Shuster, Eliya Nachmani
ๆฅ็ๆ่ฆ
Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ) and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space, make it harder to capture correlations between features leading to inefficiency in representation learning, codebook utilization and token rate. In this paper we introduce Two Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids such as hexagonal, rhombic, or rectangular tiling and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization while maintaining state of the art reconstruction quality. Specifically, Q2D2 achieves competitive to superior performance in various objective and subjective reconstruction metrics, across extensive experiments in speech domain compared to state of the art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alexander Frotscher, Christian F. Baumgartner, Thomas Wolfers
ๆฅ็ๆ่ฆ
Deep unsupervised anomaly detection in brain magnetic resonance imaging offers a promising route to identify pathological deviations without requiring lesion-specific annotations. Yet, fragmented evaluations, heterogeneous datasets, and inconsistent metrics have hindered progress toward clinical translation. Here, we present a large-scale, multi-center benchmark of deep unsupervised anomaly detection for brain imaging. The training cohort comprised 2,976 T1 and 2,972 T2-weighted scans from healthy individuals across six scanners, with ages ranging from 6 to 89 years. Validation used 92 scans to tune hyperparameters and estimate unbiased thresholds. Testing encompassed 2,221 T1w and 1,262 T2w scans spanning healthy datasets and diverse clinical cohorts. Across all algorithms, the Dice-based segmentation performance varied between 0.03 and 0.65, indicating substantial variability. To assess robustness, we systematically evaluated the impact of different scanners, lesion types and sizes, as well as demographics (age, sex). Reconstruction-based methods, particularly diffusion-inspired approaches, achieved the strongest lesion segmentation performance, while feature-based methods showed greater robustness under distributional shifts. However, systematic biases, such as scanner-related effects, were observed for the majority of algorithms, including that small and low-contrast lesions were missed more often, and that false positives varied with age and sex. Increasing healthy training data yields only modest gains, underscoring that current unsupervised anomaly detection frameworks are limited algorithmically rather than by data availability. Our benchmark establishes a transparent foundation for future research and highlights priorities for clinical translation, including image native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hailong Yang, Te Zhang, Kup-sze Choi, Zhaohong Deng
ๆฅ็ๆ่ฆ
Diffusion models have emerged as a leading technique for generating images due to their ability to create high-resolution and realistic images. Despite their strong performance, diffusion models still struggle in managing image collections with significant feature differences. They often fail to capture complex features and produce conflicting results. Research has attempted to address this issue by learning different regions of an image through multiple diffusion paths and then combining them. However, this approach leads to inefficient coordination among multiple paths and high computational costs. To tackle these issues, this paper presents a Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. DFS offers several advantages. First, unlike traditional multi-path diffusion methods, DFS uses multiple diffusion paths, each dedicated to learning a specific class of image features. By assigning each path to a different feature type, DFS overcomes the limitations of multi-path models in capturing heterogeneous image features. Second, DFS employs rule-chain-based reasoning to dynamically steer the diffusion process and enable efficient coordination among multiple paths. Finally, DFS introduces a fuzzy membership-based latent-space compression mechanism to reduce the computational costs of multi-path diffusion effectively. We tested our method on three public datasets: LSUN Bedroom, LSUN Church, and MS COCO. The results show that DFS achieves more stable training and faster convergence than existing single-path and multi-path diffusion models. Additionally, DFS surpasses baseline models in both image quality and alignment between text and images, and also shows improved accuracy when comparing generated images to target references.
๐
2025-12-01
๐ ArXiv Preprint (cs.SE)
๐ฅ Pankaj Jalote, Y. Raghu Reddy, Vasudeva Varma
ๆฅ็ๆ่ฆ
Covid has made online teaching and learning acceptable and students, faculty, and industry professionals are all comfortable with this mode. This comfort can be leveraged to offer an online multi-institutional research-level course in an area where individual institutions may not have the requisite faculty to teach and/or research students to enroll. If the subject is of interest to industry, online offering also allows industry experts to contribute and participate with ease. Advanced topics in Software Engineering are ideally suited for experimenting with this approach as industry, which is often looking to incorporate advances in software engineering in their practices, is likely to agree to contribute and participate. In this paper we describe an experiment in teaching a course titled "AI in Software Engineering" jointly between two institutions with active industry participation, and share our and student's experience. We believe this collaborative teaching approach can be used for offering research level courses in any applied area of computer science by institutions who are small and find it difficult to offer research level courses on their own.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Daniel Armstrong, Zlatko Jonฤev, Andres M Bran, Philippe Schwaller
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern computer-assisted synthesis planning (CASP) systems show promises at generating chemically valid reaction steps but struggle to incorporate strategic considerations such as convergent assembly, protecting group minimization, and optimal ring-forming sequences. We introduce a methodology that leverages Large Language Models to distill synthetic knowledge into code. Our system analyzes synthesis routes and translates strategic principles into Python functions representing diverse strategic and tactical rules, such as strategic functional group interconversions and ring construction strategies. By formalizing this knowledge as verifiable code rather than simple heuristics, we create testable, interpretable representations of synthetic strategy. We release the complete codebase and the USPTO-ST dataset -- synthesis routes annotated with strategic tags. This framework unlocks a novel capability for CASP: natural language-based route retrieval, achieving 75\% Top-3 accuracy on our benchmark. We further validate our library through temporal analysis of historical trends and chemically intuitive route clustering that offers more granular partitioning than common previous methods. This work bridges the tactical-strategic divide in CASP, enabling specification, search, and evaluation of routes by strategic criteria rather than structure alone.
๐
2025-12-01
๐ ArXiv Preprint (quant-ph)
๐ฅ Dennis Gross
Machine Learning
ๆฅ็ๆ่ฆ
Quantum reinforcement learning (QRL) aims to use quantum effects to create sequential decision-making policies that achieve tasks more effectively than their classical counterparts. However, QRL policies face uncertainty from quantum measurements and hardware noise, such as bit-flip, phase-flip, and depolarizing errors, which can lead to unsafe behavior. Existing work offers no systematic way to verify whether trained QRL policies meet safety requirements under specific noise conditions.
We introduce QVerifier, a formal verification method that applies probabilistic model checking to analyze trained QRL policies with and without modeled quantum noise. QVerifier builds a complete model of the policy-environment interaction, incorporates quantum uncertainty directly into the transition probabilities, and then checks safety properties using the Storm model checker.
Experiments across multiple QRL environments show that QVerifier precisely measures how different noise models influence safety, revealing both performance degradation and cases where noise can help. By enabling rigorous safety verification before deployment, QVerifier addresses a critical need: because access to quantum hardware is expensive, pre-deployment verification is essential for any safety-critical use of QRL. QVerifier targets a potential classical-quantum sweet spot: trained QRL policies that execute efficiently on quantum hardware, yet remain tractable for classical probabilistic model checking despite being too slow for real-time classical deployment.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jindi Lv, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang, Guan Huang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Chain-of-Thought (CoT) reasoning has significantly advanced the problem-solving capabilities of Large Language Models (LLMs), yet conventional CoT often exhibits internal determinism during decoding, limiting exploration of plausible alternatives. Recent methods attempt to address this by generating soft abstract tokens to enable reasoning in a continuous semantic space. However, we find that such approaches remain constrained by the greedy nature of autoregressive decoding, which fundamentally isolates the model from alternative reasoning possibilities. In this work, we propose Multi-Path Perception Policy Optimization (M3PO), a novel reinforcement learning framework that explicitly injects collective insights into the reasoning process. M3PO leverages parallel policy rollouts as naturally diverse reasoning sources and integrates cross-path interactions into policy updates through a lightweight collaborative mechanism. This design allows each trajectory to refine its reasoning with peer feedback, thereby cultivating more reliable multi-step reasoning patterns. Empirical results show that M3PO achieves state-of-the-art performance on both knowledge- and reasoning-intensive benchmarks. Models trained with M3PO maintain interpretability and inference efficiency, underscoring the promise of multi-path collaborative learning for robust reasoning.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gwendal Debaussart-Joniec, Argyris Kalogeratos
ๆฅ็ๆ่ฆ
This paper introduces a comprehensive unified framework for constructing multi-view diffusion geometries through intertwined multi-view diffusion trajectories (MDTs), a class of inhomogeneous diffusion processes that iteratively combine the random walk operators of multiple data views. Each MDT defines a trajectory-dependent diffusion operator with a clear probabilistic and geometric interpretation, capturing over time the interplay between data views. Our formulation encompasses existing multi-view diffusion models, while providing new degrees of freedom for view interaction and fusion. We establish theoretical properties under mild assumptions, including ergodicity of both the point-wise operator and the process in itself. We also derive MDT-based diffusion distances, and associated embeddings via singular value decompositions. Finally, we propose various strategies for learning MDT operators within the defined operator space, guided by internal quality measures. Beyond enabling flexible model design, MDTs also offer a neutral baseline for evaluating diffusion-based approaches through comparison with randomly selected MDTs. Experiments show the practical impact of the MDT operators in a manifold learning and data clustering context.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dan Qiao, Yu-Xiang Wang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We consider the problem of generalization of arbitrarily overparameterized two-layer ReLU Neural Networks with univariate input. Recent work showed that under square loss, flat solutions (motivated by flat / stable minima and Edge of Stability phenomenon) provably cannot overfit, but it remains unclear whether the same phenomenon holds for logistic loss. This is a puzzling open problem because existing work on logistic loss shows that gradient descent with increasing step size converges to interpolating solutions (at infinity, for the margin-separable cases). In this paper, we prove that the \emph{flatness implied generalization} is more delicate under logistic loss. On the positive side, we show that flat solutions enjoy near-optimal generalization bounds within a region between the left-most and right-most \emph{uncertain} sets determined by each candidate solution. On the negative side, we show that there exist arbitrarily flat yet overfitting solutions at infinity that are (falsely) certain everywhere, thus certifying that flatness alone is insufficient for generalization in general. We demonstrate the effects predicted by our theory in a well-controlled simulation study.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length -- no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility which is the linear combination of the expected maximum reward, total compute, and latency of set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Lingbo Li, Anuradha Mathrani, Teo Susnjak
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Assessing risk of bias (RoB) in randomized controlled trials is essential for trustworthy evidence synthesis, but the process is resource-intensive and prone to variability across reviewers. Large language models (LLMs) offer a route to automation, but existing methods rely on manually engineered prompts that are difficult to reproduce, generalize, or evaluate. This study introduces a programmable RoB assessment pipeline that replaces ad-hoc prompt design with structured, code-based optimization using DSPy and its GEPA module. GEPA refines LLM reasoning through Pareto-guided search and produces inspectable execution traces, enabling transparent replication of every step in the optimization process. We evaluated the method on 100 RCTs from published meta-analyses across seven RoB domains. GEPA-generated prompts were applied to both open-weight models (Mistral Small 3.1 with GPT-oss-20b) and commercial models (GPT-5 Nano and GPT-5 Mini). In domains with clearer methodological reporting, such as Random Sequence Generation, GEPA-generated prompts performed best, with similar results for Allocation Concealment and Blinding of Participants, while the commercial model performed slightly better overall. We also compared GEPA with three manually designed prompts using Claude 3.5 Sonnet. GEPA achieved the highest overall accuracy and improved performance by 30%-40% in Random Sequence Generation and Selective Reporting, and showed generally comparable, competitively aligned performance in the other domains relative to manual prompts. These findings suggest that GEPA can produce consistent and reproducible prompts for RoB assessment, supporting the structured and principled use of LLMs in evidence synthesis.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Heng Xie, Kang Zhu, Zhengqi Wen, Jianhua Tao, Xuefei Liu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal sentiment analysis (MSA) is a research field that recognizes human sentiments by combining textual, visual, and audio modalities. The main challenge lies in integrating sentiment-related information from different modalities, which typically arises during the unimodal feature extraction phase and the multimodal feature fusion phase. Existing methods extract only shallow information from unimodal features during the extraction phase, neglecting sentimental differences across different personalities. During the fusion phase, they directly merge the feature information from each modality without considering differences at the feature level. This ultimately affects the model's recognition performance. To address this problem, we propose a personality-sentiment aligned multi-level fusion framework. We introduce personality traits during the feature extraction phase and propose a novel personality-sentiment alignment method to obtain personalized sentiment embeddings from the textual modality for the first time. In the fusion phase, we introduce a novel multi-level fusion method. This method gradually integrates sentimental information from textual, visual, and audio modalities through multimodal pre-fusion and a multi-level enhanced fusion strategy. Our method has been evaluated through multiple experiments on two commonly used datasets, achieving state-of-the-art results.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sylvain Mariรฉ, Pablo Knecht
ๆฅ็ๆ่ฆ
Discrete event systems are present both in observations of nature, socio economical sciences, and industrial systems. Standard analysis approaches do not usually exploit their dual event / state nature: signals are either modeled as transition event sequences, emphasizing event order alignment, or as categorical or ordinal state timeseries, usually resampled a distorting and costly operation as the observation period and number of events grow. In this work we define state transition event timeseries (STE-ts) and propose a new Selective Temporal Hamming distance (STH) leveraging both transition time and duration-in-state, avoiding costly and distorting resampling on large databases. STH generalizes both resampled Hamming and Jaccard metrics with better precision and computation time, and an ability to focus on multiple states of interest. We validate these benefits on simulated and real-world datasets.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Daull Xavier, Patrice Bellot, Emmanuel Bruno, Vincent Martin, Elisabeth Murisasco
Robotics
ๆฅ็ๆ่ฆ
We introduce CollabToolBuilder, a flexible multiagent LLM framework with expert-in-the-loop (HITL) guidance that iteratively learns to create tools for a target goal, aligning with human intent and process, while minimizing time for task/domain adaptation effort and human feedback capture. The architecture generates and validates tools via four specialized agents (Coach, Coder, Critic, Capitalizer) using a reinforced dynamic prompt and systematic human feedback integration to reinforce each agent's role toward goals and constraints. This work is best viewed as a system-level integration and methodology combining multi-agent in-context learning, HITL controls, and reusable tool capitalization for complex iterative problems such as scientific document generation. We illustrate it with preliminary experiments (e.g., generating state-of-the-art research papers or patents given an abstract) and discuss its applicability to other iterative problem-solving.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yaxuan Wang, Quan Liu, Zhenting Wang, Zichao Li, Wei Wei et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) underpin applications in code generation, mathematical reasoning, and agent-based workflows. In practice, systems access LLMs via commercial APIs or open-source deployments, and the model landscape (e.g., GPT, Claude, Llama) evolves rapidly. This rapid evolution forces frequent model switches driven by capability, cost, deployment constraints, and privacy. Yet prompts are highly model-sensitive: reusing a prompt engineered for one model on another often yields substantially worse performance than a prompt optimized for the target model. We term this phenomenon Model Drifting. Through extensive empirical analysis across diverse LLM configurations, we show that model drifting is both common and severe. To address this challenge, we introduce PromptBridge, a training-free framework that preserves prompt effectiveness under model switches, enabling cross-model prompt transfer without costly per-task or per-model re-optimization. PromptBridge requires only a small set of alignment tasks for calibration. It first applies Model-Adaptive Reflective Prompt Evolution (MAP-RPE) to obtain task- and model-specific optimal prompts via iterative reflective refinement and quantitative evaluation. Using the resulting calibrated prompt pairs for the source and target models, PromptBridge learns a cross-model prompt mapping. At test time, i.e., for an unseen task, given a source-model prompt, this mapping directly produces an optimized prompt for the target model. Experiments in single-agent and multi-agent settings show that PromptBridge consistently improves downstream accuracy while reducing migration effort. The code will be available soon.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tushar Pranav, Eshan Pandey, Austria Lyka Diane Bala, Aman Chadha, Indriyati Atmosukarto et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) excel in multimodal tasks but often exhibit Western-centric biases, limiting their effectiveness in culturally diverse regions like Southeast Asia (SEA). To address this, we introduce RICE-VL, a novel benchmark evaluating VLM cultural understanding across 11 ASEAN countries. RICE-VL includes over 28,000 human-curated Visual Question Answering (VQA) samples -- covering True or False, Fill-in-the-Blank, and open-ended formats -- and 1,000 image-bounding box pairs for Visual Grounding, annotated by culturally informed experts across 14 sub-ground categories. We propose SEA-LAVE, an extension of the LAVE metric, assessing textual accuracy, cultural alignment, and country identification. Evaluations of six open- and closed-source VLMs reveal significant performance gaps in low-resource countries and abstract cultural domains. The Visual Grounding task tests models' ability to localize culturally significant elements in complex scenes, probing spatial and contextual accuracy. RICE-VL exposes limitations in VLMs' cultural comprehension and highlights the need for inclusive model development to better serve diverse global populations.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziqian Wang, Yuxiao Cheng, Jinli Suo
Machine Learning
ๆฅ็ๆ่ฆ
Explainability is essential for neural networks that model long time series, yet most existing explainable AI methods only produce point-wise importance scores and fail to capture temporal structures such as trends, cycles, and regime changes. This limitation weakens human interpretability and trust in long-horizon models. To address these issues, we identify four key requirements for interpretable time-series modeling: temporal continuity, pattern-centric explanation, causal disentanglement, and faithfulness to the model's inference process. We propose EXCAP, a unified framework that satisfies all four requirements. EXCAP combines an attention-based segmenter that extracts coherent temporal patterns, a causally structured decoder guided by a pre-trained causal graph, and a latent aggregation mechanism that enforces representation stability. Our theoretical analysis shows that EXCAP provides smooth and stable explanations over time and is robust to perturbations in causal masks. Extensive experiments on classification and forecasting benchmarks demonstrate that EXCAP achieves strong predictive accuracy while generating coherent and causally grounded explanations. These results show that EXCAP offers a principled and scalable approach to interpretable modeling of long time series with relevance to high-stakes domains such as healthcare and finance.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoyu Lei, Chin Wa Lau, Kaiwen Zhou, Nian Guo, Farzan Farnia
Natural Language Processing
ๆฅ็ๆ่ฆ
Error Correction Codes (ECC) are fundamental to reliable digital communication, yet designing neural decoders that are both accurate and computationally efficient remains challenging. Recent denoising diffusion decoders with transformer backbones achieve state-of-the-art performance, but their iterative sampling limits practicality in low-latency settings. We introduce the Error Correction Consistency Flow Model (ECCFM), an architecture-agnostic training framework for high-fidelity one-step decoding. By casting the reverse denoising process as a Probability Flow Ordinary Differential Equation (PF-ODE) and enforcing smoothness through a differential time regularization, ECCFM learns to map noisy signals along the decoding trajectory directly to the original codeword in a single inference step. Across multiple decoding benchmarks, ECCFM attains lower bit-error rates (BER) than autoregressive and diffusion-based baselines, with notable improvements on longer codes, while delivering inference speeds up from 30x to 100x faster than denoising diffusion decoders.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
๐
2025-12-01
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Wei Yang, Rui Zhong, Yiqun Chen, Chi Lu, Peng Jiang
Multimodal
ๆฅ็ๆ่ฆ
Multimodal recommendation aims to integrate collaborative signals with heterogeneous content such as visual and textual information, but remains challenged by modality-specific noise, semantic inconsistency, and unstable propagation over user-item graphs. These issues are often exacerbated by naive fusion or shallow modeling strategies, leading to degraded generalization and poor robustness. While recent work has explored the frequency domain as a lens to separate stable from noisy signals, most methods rely on static filtering or reweighting, lacking the ability to reason over spectral structure or adapt to modality-specific reliability. To address these challenges, we propose a Structured Spectral Reasoning (SSR) framework for frequency-aware multimodal recommendation. Our method follows a four-stage pipeline: (i) Decompose graph-based multimodal signals into spectral bands via graph-guided transformations to isolate semantic granularity; (ii) Modulate band-level reliability with spectral band masking, a training-time masking with a prediction-consistency objective that suppresses brittle frequency components; (iii) Fuse complementary frequency cues using hyperspectral reasoning with low-rank cross-band interaction; and (iv) Align modality-specific spectral features via contrastive regularization to promote semantic and structural consistency. Experiments on three real-world benchmarks show consistent gains over strong baselines, particularly under sparse and cold-start settings. Additional analyses indicate that structured spectral modeling improves robustness and provides clearer diagnostics of how different bands contribute to performance.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Medha Sawhney, Abhilash Neog, Mridul Khurana, Anuj Karpatne
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion-based solvers for partial differential equations (PDEs) are often bottle-necked by slow gradient-based test-time optimization routines that use PDE residuals for loss guidance. They additionally suffer from optimization instabilities and are unable to dynamically adapt their inference scheme in the presence of noisy PDE residuals. To address these limitations, we introduce PRISMA (PDE Residual Informed Spectral Modulation with Attention), a conditional diffusion neural operator that embeds PDE residuals directly into the model's architecture via attention mechanisms in the spectral domain, enabling gradient-descent free inference. In contrast to previous methods that use PDE loss solely as external optimization targets, PRISMA integrates PDE residuals as integral architectural features, making it inherently fast, robust, accurate, and free from sensitive hyperparameter tuning. We show that PRISMA has competitive accuracy, at substantially lower inference costs, compared to previous methods across five benchmark PDEs, especially with noisy observations, while using 10x to 100x fewer denoising steps, leading to 15x to 250x faster inference.
๐
2025-12-01
๐ ArXiv Preprint (cs.DC)
๐ฅ Wenbin Zhu, Zhaoyan Shen, Zili Shao, Hongjun Dai, Feng Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Serverless Large Language Models (LLMs) have emerged as a cost-effective solution for deploying AI services by enabling a 'pay-as-you-go' pricing model through GPU resource sharing. However, cold-start latency, especially the model loading phase, has become a critical performance bottleneck, as it scales linearly with model size and severely limits the practical deployment of large-scale LLM services. This paper presents Tangram, a novel system that accelerates Serverless LLM loading through efficient GPU memory reuse. By leveraging the unused GPU memory to retain model parameters, Tangram significantly reduces model transfer time and cold-start latency. Its design includes three key components: unified GPU memory pool for tensor-level parameter sharing across models, on-demand KV cache allocation for dynamic memory management, and GPU-affinity-aware scheduling for maximizing resource utilization. These techniques collectively address the critical challenges of inefficient memory usage and the cold-start problem in Serverless LLM platforms. We have implemented a fully functional prototype, and experiments show that Tangram achieves up to 6.2 times faster loading and reduces Time-To-First-Token (TTFT) during cold-start by 23--55% over state-of-the-art methods.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhongjie Jiang
ๆฅ็ๆ่ฆ
Although synthetic data is widely promoted as a remedy, its prevailing production paradigm -- one optimizing for statistical smoothness -- systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse.
This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators.
The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33.
Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Elinor Poole-Dayan, Jiayi Wu, Taylor Sorensen, Jiaxin Pei, Michiel A. Bakker
ๆฅ็ๆ่ฆ
We introduce a novel framework for measuring Overton pluralism in LLMs--the extent to which diverse viewpoints are represented in model outputs. We (i) formalize Overton pluralism as a set coverage metric (OvertonScore), (ii) conduct a large-scale U.S.-representative human study (N = 1209; 60 questions; 8 LLMs), and (iii) develop an automated benchmark that closely reproduces human judgments. On average, models achieve OvertonScores of 0.35--0.41, with DeepSeek V3 performing best; yet all models remain far below the theoretical maximum of 1.0, revealing substantial headroom for improvement. Because repeated large-scale human studies are costly and slow, scalable evaluation tools are essential for model development. Hence, we propose an automated benchmark that achieves high rank correlation with human judgments ($ฯ=0.88$), providing a practical proxy without replacing human assessment. By turning pluralistic alignment from a normative aim into a measurable benchmark, our work establishes a foundation for systematic progress toward more pluralistic LLMs.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shashank Landge, Abhishek Patil, Tejas kamble, Bhushan Buddhivant, Priyanka Joshi
Natural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLMs) continue to scale in parameter count, deploying them on commodity hardware has become increasingly challenging. Post-Training Quantization (PTQ) addresses this by reducing the precision of model weights, typically to 4-bit or lower. However, uniform quantization often leads to significant performance degradation due to the presence of ``outlier features'' -- weights that, while few in number, are critical for maintaining model accuracy. Current state-of-the-art methods such as AWQ (Activation-aware Weight Quantization) and SpQR (Sparse Quantization Representations) rely on calibration data to identify these salient weights via activation magnitudes or Hessian sensitivity. In scenarios where data privacy is paramount or calibration data is unavailable, these methods are inapplicable.
In this work, we propose a data-free, structure-aware hypothesis: that the weights identified as Principal Components via Singular Value Decomposition (SVD) are intrinsically important to the model's downstream performance. We introduce a novel selection heuristic that preserves the top-$k$ weights aligned with the principal components in FP32, while aggressively quantizing the residual weights. We compare our method against activation-aware (AWQ) and second-order (SpQR) methods across GLUE benchmarks (MRPC, RTE, QNLI) using a DistilBERT backbone. Our experiments reveal that structural importance is highly correlated with functional importance. On the challenging RTE task, our SVD-based method achieves an accuracy of 66.06\%, outperforming both AWQ (65.34\%) and SpQR (65.34\%) at high protection budgets, validating that intrinsic matrix structure can serve as a robust proxy for weight saliency without the need for forward passes or calibration data.
๐
2025-12-01
๐ KDD
๐ฅ Xinyun Zhou, Xinfeng Li, Yinan Peng, Ming Xu, Xuanwang Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) systems are increasingly central to robust AI, enhancing large language model (LLM) faithfulness by incorporating external knowledge. However, our study unveils a critical, overlooked vulnerability: their profound susceptibility to subtle symbolic perturbations, particularly through near-imperceptible emoticon tokens such as "(@_@)" that can catastrophically mislead retrieval, termed EmoRAG. We demonstrate that injecting a single emoticon into a query makes it nearly 100% likely to retrieve semantically unrelated texts that contain a matching emoticon. Our extensive experiment across general question-answering and code domains, using a range of state-of-the-art retrievers and generators, reveals three key findings: (I) Single-Emoticon Disaster: Minimal emoticon injections cause maximal disruptions, with a single emoticon almost 100% dominating RAG output. (II) Positional Sensitivity: Placing an emoticon at the beginning of a query can cause severe perturbation, with F1-Scores exceeding 0.92 across all datasets. (III) Parameter-Scale Vulnerability: Counterintuitively, models with larger parameters exhibit greater vulnerability to the interference. We provide an in-depth analysis to uncover the underlying mechanisms of these phenomena. Furthermore, we raise a critical concern regarding the robustness assumption of current RAG systems, envisioning a threat scenario where an adversary exploits this vulnerability to manipulate the RAG system. We evaluate standard defenses and find them insufficient against EmoRAG. To address this, we propose targeted defenses, analyzing their strengths and limitations in mitigating emoticon-based perturbations. Finally, we outline future directions for building robust RAG systems.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Saman Ahmadi, Mahdi Jalili
ๆฅ็ๆ่ฆ
We study the energy-optimal shortest path problem for electric vehicles (EVs) in large-scale road networks, where recuperated energy along downhill segments introduces negative energy costs. While traditional point-to-point pathfinding algorithms for EVs assume a known initial energy level, many real-world scenarios involving uncertainty in available energy require planning optimal paths for all possible initial energy levels, a task known as energy-optimal profile search. Existing solutions typically rely on specialized profile-merging procedures within a label-correcting framework that results in searching over complex profiles. In this paper, we propose a simple yet effective label-setting approach based on multi-objective A* search, which employs a novel profile dominance rule to avoid generating and handling complex profiles. We develop four variants of our method and evaluate them on real-world road networks enriched with realistic energy consumption data. Experimental results demonstrate that our energy profile A* search achieves performance comparable to energy-optimal A* with a known initial energy level.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Juan Hernandez, Diego Fernรกndez, Manuel Cifuentes, Denis Parra, Rodrigo Toro Icarte
Machine Learning
ๆฅ็ๆ่ฆ
The Never Give Up (NGU) algorithm has proven effective in reinforcement learning tasks with sparse rewards by combining episodic novelty and intrinsic motivation. In this work, we extend NGU to multi-agent environments and evaluate its performance in the simple_tag environment from the PettingZoo suite. Compared to a multi-agent DQN baseline, NGU achieves moderately higher returns and more stable learning dynamics. We investigate three design choices: (1) shared replay buffer versus individual replay buffers, (2) sharing episodic novelty among agents using different k thresholds, and (3) using heterogeneous values of the beta parameter. Our results show that NGU with a shared replay buffer yields the best performance and stability, highlighting that the gains come from combining NGU intrinsic exploration with experience sharing. Novelty sharing performs comparably when k = 1 but degrades learning for larger values. Finally, heterogeneous beta values do not improve over a small common value. These findings suggest that NGU can be effectively applied in multi-agent settings when experiences are shared and intrinsic exploration signals are carefully tuned.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Feiyang Xiao, Yichi Zhang, Xigui Li, Yuanye Zhou, Chen Jiang et al.
ๆฅ็ๆ่ฆ
The precise segmentation of intracranial aneurysms and their parent vessels (IA-Vessel) is a critical step for hemodynamic analyses, which mainly depends on computational fluid dynamics (CFD). However, current segmentation methods predominantly focus on image-based evaluation metrics, often neglecting their practical effectiveness in subsequent CFD applications. To address this deficiency, we present the Intracranial Aneurysm Vessel Segmentation (IAVS) dataset, the first comprehensive, multi-center collection comprising 641 3D MRA images with 587 annotations of aneurysms and IA-Vessels. In addition to image-mask pairs, IAVS dataset includes detailed hemodynamic analysis outcomes, addressing the limitations of existing datasets that neglect topological integrity and CFD applicability. To facilitate the development and evaluation of clinically relevant techniques, we construct two evaluation benchmarks including global localization of aneurysms (Stage I) and fine-grained segmentation of IA-Vessel (Stage II) and develop a simple and effective two-stage framework, which can be used as a out-of-the-box method and strong baseline. For comprehensive evaluation of applicability of segmentation results, we establish a standardized CFD applicability evaluation system that enables the automated and consistent conversion of segmentation masks into CFD models, offering an applicability-focused assessment of segmentation outcomes. The dataset, code, and model will be public available at https://github.com/AbsoluteResonance/IAVS.
๐
2025-12-01
๐ ArXiv Preprint (quant-ph)
๐ฅ Dongheng Qian, Jing Wang
Machine Learning
ๆฅ็ๆ่ฆ
Measurement-induced entanglement (MIE) captures how local measurements generate long-range quantum correlations and drive dynamical phase transitions in many-body systems. Yet estimating MIE experimentally remains challenging: direct evaluation requires extensive post-selection over measurement outcomes, raising the question of whether MIE is accessible with only polynomial resources. We address this challenge by reframing MIE detection as a data-driven learning problem that assumes no prior knowledge of state preparation. Using measurement records alone, we train a neural network in a self-supervised manner to predict the uncertainty metric for MIE--the gap between upper and lower bounds of the average post-measurement bipartite entanglement. Applied to random circuits with one-dimensional all-to-all connectivity and two-dimensional nearest-neighbor coupling, our method reveals a learnability transition with increasing circuit depth: below a threshold, the uncertainty is small and decreases with polynomial measurement data and model parameters, while above it the uncertainty remains large despite increasing resources. We further verify this transition experimentally on current noisy quantum devices, demonstrating its robustness to realistic noise. These results highlight the power of data-driven approaches for learning MIE and delineate the practical limits of its classical learnability.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Koki Natsumi, Hiroyuki Deguchi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
ๆฅ็ๆ่ฆ
Minimum Bayes risk (MBR) decoding generates high-quality translations by maximizing the expected utility of output candidates, but it evaluates all pairwise scores over the candidate set; hence, it takes quadratic time with respect to the number of candidates. To reduce the number of utility function calls, probabilistic MBR (PMBR) decoding partially evaluates quality scores using sampled pairs of candidates and completes the missing scores with a matrix completion algorithm. Nevertheless, it degrades the translation quality as the number of utility function calls is reduced. Therefore, to improve the trade-off between quality and cost, we propose agreement-constrained PMBR (AC-PMBR) decoding, which leverages a knowledge distilled model to guide the completion of the score matrix. Our AC-PMBR decoding improved approximation errors of matrix completion by up to 3 times and achieved higher translation quality compared with PMBR decoding at a comparable computational cost on the WMT'23 En$\leftrightarrow$De translation tasks.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ashish Vashist, Qiranul Saadiyean, Suresh Sundaram, Chandra Sekhar Seelamantula
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Foreign Object Debris (FOD) within aircraft fuel tanks presents critical safety hazards including fuel contamination, system malfunctions, and increased maintenance costs. Despite the severity of these risks, there is a notable lack of dedicated datasets for the complex, enclosed environments found inside fuel tanks. To bridge this gap, we present a novel dataset, FOD-S2R, composed of real and synthetic images of the FOD within a simulated aircraft fuel tank. Unlike existing datasets that focus on external or open-air environments, our dataset is the first to systematically evaluate the effectiveness of synthetic data in enhancing the real-world FOD detection performance in confined, closed structures. The real-world subset consists of 3,114 high-resolution HD images captured in a controlled fuel tank replica, while the synthetic subset includes 3,137 images generated using Unreal Engine. The dataset is composed of various Field of views (FOV), object distances, lighting conditions, color, and object size. Prior research has demonstrated that synthetic data can reduce reliance on extensive real-world annotations and improve the generalizability of vision models. Thus, we benchmark several state-of-the-art object detection models and demonstrate that introducing synthetic data improves the detection accuracy and generalization to real-world conditions. These experiments demonstrate the effectiveness of synthetic data in enhancing the model performance and narrowing the Sim2Real gap, providing a valuable foundation for developing automated FOD detection systems for aviation maintenance.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shinji Mai, Yunpeng Zhai, Ziqian Chen, Cheng Chen, Anni Zou et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Large language model based agents are increasingly deployed in complex, tool augmented environments. While reinforcement learning provides a principled mechanism for such agents to improve through interaction, its effectiveness critically depends on the availability of structured training tasks. In many realistic settings, however, no such tasks exist a challenge we term task scarcity, which has become a key bottleneck for scaling agentic RL. Existing approaches typically assume predefined task collections, an assumption that fails in novel environments where tool semantics and affordances are initially unknown. To address this limitation, we formalize the problem of Task Generation for Agentic RL, where an agent must learn within a given environment that lacks predefined tasks. We propose CuES, a Curiosity driven and Environment grounded Synthesis framework that autonomously generates diverse, executable, and meaningful tasks directly from the environment structure and affordances, without relying on handcrafted seeds or external corpora. CuES drives exploration through intrinsic curiosity, abstracts interaction patterns into reusable task schemas, and refines them through lightweight top down guidance and memory based quality control. Across three representative environments, AppWorld, BFCL, and WebShop, CuES produces task distributions that match or surpass manually curated datasets in both diversity and executability, yielding substantial downstream policy improvements. These results demonstrate that curiosity driven, environment grounded task generation provides a scalable foundation for agents that not only learn how to act, but also learn what to learn. The code is available at https://github.com/modelscope/AgentEvolver/research/CuES.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dacheng Liao, Mengshi Qi, Peng Shu, Zhining Zhang, Yuxin Lin et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Current Vision-Language Model (VLM)-based end-to-end autonomous driving systems often leverage large language models to generate driving decisions directly based on their understanding of the current scene. However, such systems introduce multiple risks in real-world driving scenarios. To evaluate whether VLMs are truly viable for autonomous driving, we introduce RoboDriveBench, the first robustness benchmark focused on end-to-end trajectory prediction tasks. This benchmark systematically evaluates two critical categories of real-world challenges for VLM-based end-to-end autonomous driving systems through 11 simulated scenarios encompassing various corruption types, including 6 scenarios of sensor corruption caused by environmental variations, along with 5 cases of prompt corruption resulting from human intervention and data transmission failures. Each corruption type includes 250 unique driving scenarios and 5,689 frames, resulting in 64,559 total trajectory prediction cases per evaluation. To overcome these real-world challenges, we propose a novel VLM-based autonomous driving framework called RoboDriveVLM, which enhances robustness by mapping more multimodal data-e.g., lidar and radar-into a unified latent space. Furthermore, we introduce a new Test-Time Adaptation (TTA) method based on cross-modal knowledge distillation to improve the robustness of VLM-based autonomous driving systems. Through extensive experiments, our work highlights the limitations of current VLM-based end-to-end autonomous driving systems and provides a more reliable solution for real-world deployment. Source code and datasets will be released.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huynh Trinh Ngoc, Toan Nguyen Hai, Ba Luong Son, Long Tran Quoc
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Medical image segmentation is crucial for clinical diagnosis and treatment planning. Traditional methods typically produce a single segmentation mask, failing to capture inherent uncertainty. Recent generative models enable the creation of multiple plausible masks per image, mimicking the collaborative interpretation of several clinicians. However, these approaches remain computationally heavy. We propose MedSegLatDiff, a diffusion based framework that combines a variational autoencoder (VAE) with a latent diffusion model for efficient medical image segmentation. The VAE compresses the input into a low dimensional latent space, reducing noise and accelerating training, while the diffusion process operates directly in this compact representation. We further replace the conventional MSE loss with weighted cross entropy in the VAE mask reconstruction path to better preserve tiny structures such as small nodules. MedSegLatDiff is evaluated on ISIC-2018 (skin lesions), CVC-Clinic (polyps), and LIDC-IDRI (lung nodules). It achieves state of the art or highly competitive Dice and IoU scores while simultaneously generating diverse segmentation hypotheses and confidence maps. This provides enhanced interpretability and reliability compared to deterministic baselines, making the model particularly suitable for clinical deployment.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mingqin Yu, Fethi Rabhi, Boming Xia, Zhengyi Yang, Felix Tan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Environmental, Social, and Governance (ESG) disclosure frameworks such as SASB, TCFD, and IFRS S2 require organizations to compute and report numerous metrics for compliance, yet these requirements are embedded in long, unstructured PDF documents that are difficult to interpret, standardize, and audit. Manual extraction is unscalable, while unconstrained large language model (LLM) extraction often produces inconsistent entities, hallucinated relationships, missing provenance, and high validation failure rates. We present OntoMetric, an ontology-guided framework that transforms ESG regulatory documents into validated, AI- and web-ready knowledge graphs. OntoMetric operates through a three-stage pipeline: (1) structure-aware segmentation using table-of-contents boundaries, (2) ontology-constrained LLM extraction that embeds the ESGMKG schema into prompts while enriching entities with semantic fields for downstream reasoning, and (3) two-phase validation that combines LLM-based semantic verification with rule-based schema checking across entity, property, and relationship levels (VR001-VR006). The framework preserves both segment-level and page-level provenance for audit traceability. Evaluated on five ESG standards (SASB Commercial Banks, SASB Semiconductors, TCFD, IFRS S2, AASB S2) totaling 228 pages and 60 segments, OntoMetric achieves 65-90% semantic accuracy and 80-90% schema compliance, compared to 3-10% for baseline unconstrained extraction, at approximately 0.01 to 0.02 USD per validated entity. Our results demonstrate that combining symbolic ontology constraints with neural extraction enables reliable, auditable knowledge graphs suitable for regulatory compliance and web integration, supporting downstream applications such as sustainable-finance analytics, transparency portals, and automated compliance tools.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mudit Gaur, Prashant Trivedi, Shuchin Aeron, Amrit Singh Bedi, George K. Atia et al.
Machine Learning
ๆฅ็ๆ่ฆ
Flow matching has recently emerged as a promising alternative to diffusion-based generative models, offering faster sampling and simpler training by learning continuous flows governed by ordinary differential equations. Despite growing empirical success, the theoretical understanding of flow matching remains limited, particularly in terms of sample complexity results. In this work, we provide the first analysis of the sample complexity for flow-matching based generative models without assuming access to the empirical risk minimizer (ERM) of the loss function for estimating the velocity field. Under standard assumptions on the loss function for velocity field estimation and boundedness of the data distribution, we show that a sufficiently expressive neural network can learn a velocity field such that with $\mathcal{O}(ฮต^{-4})$ samples, such that the Wasserstein-2 distance between the learned and the true distribution is less than $\mathcal{O}(ฮต)$. The key technical idea is to decompose the velocity field estimation error into neural-network approximation error, statistical error due to the finite sample size, and optimization error due to the finite number of optimization steps for estimating the velocity field. Each of these terms are then handled via techniques that may be of independent interest.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiahao Yuan, Zhiqing Cui, Hanqing Wang, Yuansheng Gao, Yucheng Zhou et al.
Machine Learning
ๆฅ็ๆ่ฆ
As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop pipeline with iterative rubric-guided refinement to ensure psychological plausibility and persona consistency. This progressive empathy pipeline that integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms othet methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at https://github.com/JhCircle/Kardia-R1.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yilong Zhao, Jiaming Tang, Kan Zhu, Zihao Ye, Chi-Chih Chang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access for every step, leading to substantial pressure on memory bandwidth.
To address this, we introduce SparseSpec, a speculative decoding framework that reuses the same model as the draft and target models (i.e., self-speculation). SparseSpec features a novel sparse attention mechanism, PillarAttn, as the draft model, which accurately selects critical tokens via elegantly reusing information from the verification stage. Furthermore, SparseSpec co-designs self-speculation with three system innovations: (1) a unified scheduler to batch token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SparseSpec outperforms state-of-the-art solutions, with an up to 2.13x throughput speedup.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zehua Zhao, Zhixian Huang, Junren Li, Siyu Lin, Junting Zhou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The dataset of the benchmark is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.
๐
2025-12-01
๐ ArXiv Preprint (cs.SI)
๐ฅ Junfeng Wu, Xiangmin Zhou, Erica Kuligowski, Dhirendra Singh, Enrico Ronchi et al.
Multimodal
ๆฅ็ๆ่ฆ
Traditional data sources on bushfire evacuation behaviour, such as quantitative surveys and manual observations have severe limitations. Mining social media data related to bushfire evacuations promises to close this gap by allowing the collection and processing of a large amount of behavioural data, which are low-cost, accurate, possibly including location information and rich contextual information. However, social media data have many limitations, such as being scattered, incomplete, informal, etc. Together, these limitations represent several challenges to their usefulness to better understand bushfire evacuation. To overcome these challenges and provide guidance on which and how social media data can be used, this scoping review of the literature reports on recent advances in relevant data mining techniques. In addition, future applications and open problems are discussed. We envision future applications such as evacuation model calibration and validation, emergency communication, personalised evacuation training, and resource allocation for evacuation preparedness. We identify open problems such as data quality, bias and representativeness, geolocation accuracy, contextual understanding, crisis-specific lexicon and semantics, and multimodal data interpretation.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ekha Morang, Surhoni A. Ngullie, Sashienla Longkumer, Teisovi Angami
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The Nagamese language, a.k.a Naga Pidgin, is an Assamese-lexified creole language developed primarily as a means of communication in trade between the people from Nagaland and people from Assam in the north-east India. Substantial amount of work in sentiment analysis has been done for resource-rich languages like English, Hindi, etc. However, no work has been done in Nagamese language. To the best of our knowledge, this is the first attempt on sentiment analysis and emotion classification for the Nagamese Language. The aim of this work is to detect sentiments in terms of polarity (positive, negative and neutral) and basic emotions contained in textual content of Nagamese language. We build sentiment polarity lexicon of 1,195 nagamese words and use these to build features along with additional features for supervised machine learning techniques using Na"ive Bayes and Support Vector Machines.
Keywords: Nagamese, NLP, sentiment analysis, machine learning
๐
2025-12-01
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Otman A. Basir
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper introduces a new family of multi-parent recombination operators for Genetic Algorithms (GAs), based on normalized Pascal (binomial) coefficients. Unlike classical two-parent crossover operators, Pascal-Weighted Recombination (PWR) forms offsprings as structured convex combination of multiple parents, using binomially shaped weights that emphasize central inheritance while suppressing disruptive variance. We develop a mathematical framework for PWR, derive variance-transfer properties, and analyze its effect on schema survival. The operator is extended to real-valued, binary/logit, and permutation representations.
We evaluate the proposed method on four representative benchmarks: (i) PID controller tuning evaluated using the ITAE metric, (ii) FIR low-pass filter design under magnitude-response constraints, (iii) wireless power-modulation optimization under SINR coupling, and (iv) the Traveling Salesman Problem (TSP). We demonstrate how, across these benchmarks, PWR consistently yields smoother convergence, reduced variance, and achieves 9-22% performance gains over standard recombination operators. The approach is simple, algorithm-agnostic, and readily integrable into diverse GA architectures.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.
๐
2025-12-01
๐ ArXiv Preprint (cs.CY)
๐ฅ David Wu, Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a benchmark using 100 real primary-care-to-specialist consultation cases to measure harm frequency and severity from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 31 LLMs, severe harm occurs in up to 22.2% (95% CI 21.6-22.8%) of cases, with harms of omission accounting for 76.6% (95% CI 76.4-76.8%) of errors. Safety performance is only moderately correlated (r = 0.61-0.64) with existing AI and medical knowledge benchmarks. The best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%), and a diverse multi-agent approach reduces harm compared to solo models (mean difference 8.0%, 95% CI 4.0-12.1%). Therefore, despite strong performance on existing evaluations, widely used AI models can produce severely harmful medical advice at nontrivial rates, underscoring clinical safety as a distinct performance dimension necessitating explicit measurement.
๐
2025-12-01
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Suveen Ellawala, Sashenka Gamage, Dinithi Dissanayake
Multimodal
ๆฅ็ๆ่ฆ
Educators frequently rely on diagrams to explain complex concepts during lectures, yet creating clear and complete visual representations in real time while simultaneously speaking can be cognitively demanding. Incomplete or unclear diagrams may hinder student comprehension, as learners must mentally reconstruct missing information while following the verbal explanation. Inspired by advances in code completion tools, we introduce DrawDash, an AI-powered whiteboard assistant that proactively completes and refines educational diagrams through multimodal understanding. DrawDash adopts a TAB-completion interaction model: it listens to spoken explanations, detects intent, and dynamically suggests refinements that can be accepted with a single keystroke. We demonstrate DrawDash across four diverse teaching scenarios, spanning topics from computer science and web development to biology. This work represents an early exploration into reducing instructors' cognitive load and improving diagram-based pedagogy through real-time, speech-driven visual assistance, and concludes with a discussion of current limitations and directions for formal classroom evaluation.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Donghao Huang, Shila Chew, Anna Dutkiewicz, Zhaoxia Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Assessing software test coverage at scale remains a bottleneck in QA pipelines. We present LLM-as-a-Judge (LAJ), a production-ready, rubric-driven framework for evaluating Gherkin acceptance tests with structured JSON outputs. Across 20 model configurations (GPT-4, GPT-5 with varying reasoning effort, and open-weight models) on 100 expert-annotated scripts over 5 runs (500 evaluations), we provide the first comprehensive analysis spanning accuracy, operational reliability, and cost. We introduce the Evaluation Completion Rate (ECR@1) to quantify first-attempt success, revealing reliability from 85.4% to 100.0% with material cost implications via retries. Results show that smaller models can outperform larger ones: GPT-4o Mini attains the best accuracy (6.07 MAAE), high reliability (96.6% ECR@1), and low cost ($1.01 per 1K), yielding a 78x cost reduction vs. GPT-5 (high reasoning) while improving accuracy. Reasoning effort is model-family dependent: GPT-5 benefits from increased reasoning (with predictable accuracy-cost tradeoffs), whereas open-weight models degrade across all dimensions as reasoning increases. Overall, cost spans 175x ($0.45-$78.96 per 1K). We release the dataset, framework, and code to support reproducibility and deployment.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Beining Xu, Siting Zhu, Zhao Jin, Junxian Li, Hesheng Wang
Natural Language ProcessingRoboticsComputer Vision
ๆฅ็ๆ่ฆ
3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstruction. By acquiring 3D structural understanding during training, our model can implicitly reason about 3D scenes without relying on inefficient point cloud reconstruction. Moreover, we propose a structure-enhanced module (SE), which first employs intra-view and inter-view attention mechanisms to capture dependencies within views and correspondences across views. The module further integrates multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling more accurate structural understanding. Extensive experiments demonstrate that S$^2$-MLLM unifies superior performance, generalization, and efficiency, achieving significant performance over existing methods across the ScanRefer, Nr3D, and Sr3D datasets. Code will be available upon acceptance.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ching Fang, Samuel Marks
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
As large language models become increasingly capable, there is growing concern that they may develop reasoning processes that are encoded or hidden from human oversight. To investigate whether current interpretability techniques can penetrate such encoded reasoning, we construct a controlled testbed by fine-tuning a reasoning model (DeepSeek-R1-Distill-Llama-70B) to perform chain-of-thought reasoning in ROT-13 encryption while maintaining intelligible English outputs. We evaluate mechanistic interpretability methods--in particular, logit lens analysis--on their ability to decode the model's hidden reasoning process using only internal activations. We show that logit lens can effectively translate encoded reasoning, with accuracy peaking in intermediate-to-late layers. Finally, we develop a fully unsupervised decoding pipeline that combines logit lens with automated paraphrasing, achieving substantial accuracy in reconstructing complete reasoning transcripts from internal model representations. These findings suggest that current mechanistic interpretability techniques may be more robust to simple forms of encoded reasoning than previously understood. Our work provides an initial framework for evaluating interpretability methods against models that reason in non-human-readable formats, contributing to the broader challenge of maintaining oversight over increasingly capable AI systems.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuezhi Liu
Machine Learning
ๆฅ็ๆ่ฆ
Optimal Power Flow (OPF) is a core optimization problem in power system operation and planning, aiming to minimize generation costs while satisfying physical constraints such as power flow equations, generator limits, and voltage limits. Traditional OPF solving methods typically employ iterative optimization algorithms (such as interior point methods, sequential quadratic programming, etc.), with limitations including low computational efficiency, initial value sensitivity, and low batch computation efficiency. Most existing deep learning-based OPF methods rely on supervised learning, requiring pre-solving large numbers of cases, and have difficulty guaranteeing physical consistency. This paper proposes an Optimal Power Flow solving method based on neural network dynamics and energy gradient flow, transforming OPF problems into energy minimization problems. By constructing an energy function to measure the degree of deviation from the constraint manifold, and guiding networks to learn optimal solutions that simultaneously satisfy power flow constraints and minimize costs through gradient flow. Neural networks are trained unsupervised by directly minimizing physical residuals, requiring no labeled data, achieving true "end-to-end" physics-constrained learning.
๐
2025-12-01
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ Tao Li, Chunze Zhang, Weiwei Yao, Junzhao He, Ji Hou et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Understanding efficient fish locomotion offers insights for biomechanics, fluid dynamics, and engineering. Traditional studies often miss the link between neuromuscular control and whole-body movement. To explore energy transfer in carangiform swimming, we created a bio-inspired digital trout. This model combined multibody dynamics, Hill-type muscle modeling, and a high-fidelity fluid-structure interaction algorithm, accurately replicating a real trout's form and properties. Using deep reinforcement learning, the trout's neural system achieved hierarchical spatiotemporal control of muscle activation. We systematically examined how activation strategies affect speed and energy use. Results show that axial myomere coupling-with activation spanning over 0.5 body lengths-is crucial for stable body wave propagation. Moderate muscle contraction duration ([0.1,0.3] of a tail-beat cycle) lets the body and fluid act as a passive damping system, cutting energy use. Additionally, the activation phase lag of myomeres shapes the body wave; if too large, it causes antagonistic contractions that hinder thrust. These findings advance bio-inspired locomotion understanding and aid energy-efficient underwater system design.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hang Wu, Ke Sun, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
In the contemporary digital landscape, multi-modal media manipulation has emerged as a significant societal threat, impacting the reliability and integrity of information dissemination. Current detection methodologies in this domain often overlook the crucial aspect of localized information, despite the fact that manipulations frequently occur in specific areas, particularly in facial regions. In response to this critical observation, we propose the M4-BLIP framework. This innovative framework utilizes the BLIP-2 model, renowned for its ability to extract local features, as the cornerstone for feature extraction. Complementing this, we incorporate local facial information as prior knowledge. A specially designed alignment and fusion module within M4-BLIP meticulously integrates these local and global features, creating a harmonious blend that enhances detection accuracy. Furthermore, our approach seamlessly integrates with Large Language Models (LLM), significantly improving the interpretability of the detection outcomes. Extensive quantitative and visualization experiments validate the effectiveness of our framework against the state-of-the-art competitors.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ruiyu Wang, Tuan Vinh, Ran Xu, Yuyin Zhou, Jiaying Lu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Electronic health records (EHRs) support powerful clinical prediction models, but existing methods typically provide coarse, post hoc explanations that offer limited value for patient-level decision making. We introduce a knowledge graph (KG)-guided chain-of-thought (CoT) framework that generates clinically grounded and temporally consistent reasoning for visit-level disease prediction in MIMIC-III. ICD-9 codes are mapped to PrimeKG, from which disease-relevant nodes and multi-hop reasoning paths are extracted and used as scaffolds for CoT generation; only explanations whose conclusions match observed outcomes are retained. Lightweight LLaMA-3.1-Instruct-8B and Gemma-7B models are then fine-tuned on this supervision corpus. Across ten PrimeKG-mapped diseases and limited training cohorts (400 and 1000 cases), KG-guided models outperform strong classical baselines, achieving AUROC values of 0.66 to 0.70 and macro-AUPR values of 0.40 to 0.47. The models also transfer zero-shot to the CRADLE cohort, improving accuracy from approximately 0.40 to 0.51 up to 0.72 to 0.77. A blinded clinician evaluation shows consistent preference for KG-guided CoT explanations in clarity, relevance, and clinical correctness.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Jianlei Chang, Ruofeng Mei, Wei Ke, Xiangyu Xu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Generative modeling has recently shown remarkable promise for visuomotor policy learning, enabling flexible and expressive control across diverse embodied AI tasks. However, existing generative policies often struggle with data inefficiency, requiring large-scale demonstrations, and sampling inefficiency, incurring slow action generation during inference. We introduce EfficientFlow, a unified framework for efficient embodied AI with flow-based policy learning. To enhance data efficiency, we bring equivariance into flow matching. We theoretically prove that when using an isotropic Gaussian prior and an equivariant velocity prediction network, the resulting action distribution remains equivariant, leading to improved generalization and substantially reduced data demands. To accelerate sampling, we propose a novel acceleration regularization strategy. As direct computation of acceleration is intractable for marginal flow trajectories, we derive a novel surrogate loss that enables stable and scalable training using only conditional trajectories. Across a wide range of robotic manipulation benchmarks, the proposed algorithm achieves competitive or superior performance under limited data while offering dramatically faster inference. These results highlight EfficientFlow as a powerful and efficient paradigm for high-performance embodied AI.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anbang Liu, Guanzhong Hu, Jiayi Wang, Ping Guo, Han Liu
Computer Vision
ๆฅ็ๆ่ฆ
Self-driving laboratories offer a promising path toward reducing the labor-intensive, time-consuming, and often irreproducible workflows in the biological sciences. Yet their stringent precision requirements demand highly robust models whose training relies on large amounts of annotated data. However, this kind of data is difficult to obtain in routine practice, especially negative samples. In this work, we focus on pipetting, the most critical and precision sensitive action in SDLs. To overcome the scarcity of training data, we build a hybrid pipeline that fuses real and virtual data generation. The real track adopts a human-in-the-loop scheme that couples automated acquisition with selective human verification to maximize accuracy with minimal effort. The virtual track augments the real data using reference-conditioned, prompt-guided image generation, which is further screened and validated for reliability. Together, these two tracks yield a class-balanced dataset that enables robust bubble detection training. On a held-out real test set, a model trained entirely on automatically acquired real images reaches 99.6% accuracy, and mixing real and generated data during training sustains 99.4% accuracy while reducing collection and review load. Our approach offers a scalable and cost-effective strategy for supplying visual feedback data to SDL workflows and provides a practical solution to data scarcity in rare event detection and broader vision tasks.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Shaowei Liu, David Yifan Yao, Saurabh Gupta, Shenlong Wang
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Today, people can easily record memorable moments, ranging from concerts, sports events, lectures, family gatherings, and birthday parties with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving an median synchronization error below 50 ms.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Varun Varma Thozhiyoor, Shivam Tripathi, Venkatesh Babu Radhakrishnan, Anand Bhattad
Multimodal
ๆฅ็ๆ่ฆ
Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their representation of a fundamental law: gravity. Out-of-the-box video generators consistently generate objects falling at an effectively slower acceleration. However, these physical tests are often confounded by ambiguous metric scale. We first investigate if observed physical errors are artifacts of these ambiguities (e.g., incorrect frame rate assumptions). We find that even temporal rescaling cannot correct the high-variance gravity artifacts. To rigorously isolate the underlying physical representation from these confounds, we introduce a unit-free, two-object protocol that tests the timing ratio $t_1^2/t_2^2 = h_1/h_2$, a relationship independent of $g$, focal length, and scale. This relative test reveals violations of Galileo's equivalence principle. We then demonstrate that this physical gap can be partially mitigated with targeted specialization. A lightweight low-rank adaptor fine-tuned on only 100 single-ball clips raises $g_{\mathrm{eff}}$ from $1.81\,\mathrm{m/s^2}$ to $6.43\,\mathrm{m/s^2}$ (reaching $65\%$ of terrestrial gravity). This specialist adaptor also generalizes zero-shot to two-ball drops and inclined planes, offering initial evidence that specific physical laws can be corrected with minimal data.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee et al.
Robotics
ๆฅ็ๆ่ฆ
Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen et al.
Multimodal
ๆฅ็ๆ่ฆ
Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J. Zico Kolter et al.
Machine Learning
ๆฅ็ๆ่ฆ
MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its ``fastforward'' nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our $\textbf{improved MeanFlow}$ ($\textbf{iMF}$) method, trained entirely from scratch, achieves $\textbf{1.72}$ FID with a single function evaluation (1-NFE) on ImageNet 256$\times$256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xian Ge, Yuling Pan, Yuhang Zhang, Xiang Li, Weijun Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, and entity-level understanding; an interactive pedestrian-aware system for modeling human behavior; and an automated trajectory generation paradigm to support navigation tasks. Furthermore, we collect more than 60K panoramic samples and conduct extensive experiments across various tasks to demonstrate the effectiveness of our simulator. Unlike existing simulators, our work is the first to systematically model the 4D real world under an omnidirectional setting. The entire platform, including the toolkit, plugins, and collected datasets, will be made publicly available at https://insta360-research-team.github.io/AirSim360-website.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jahyeok Koo, Inรจs Hyeonsu Kim, Mungyeom Kim, Junghyun Park, Seohyun Park et al.
ๆฅ็ๆ่ฆ
Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. In this work, we present MV-TAP, a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lidong Lu, Guo Chen, Zhu Wei, Yicheng Liu, Tong Lu
ๆฅ็ๆ่ฆ
We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that rely on textual instructions or demonstration videos, which often limited by ambiguity or occlusion, audio provides real-time, semantically rich, and visually independent cues for affordance grounding, enabling more intuitive understanding of interaction regions. To support this task, we construct the first AV-AG dataset, comprising a large collection of action sounds, object images, and pixel-level affordance annotations. The dataset also includes an unseen subset to evaluate zero-shot generalization. Furthermore, we propose AVAGFormer, a model equipped with a semantic-conditioned cross-modal mixer and a dual-head decoder that effectively fuses audio and visual signals for mask prediction. Experiments show that AVAGFormer achieves state-of-the-art performance on AV-AG, surpassing baselines from related tasks. Comprehensive analyses highlight the distinctions between AV-AG and AVS, the benefits of end-to-end modeling, and the contribution of each component. Code and dataset have been released on https://jscslld.github.io/AVAGFormer/.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Guillermo Garcia-Cobo, Maximilian Igl, Peter Karkus, Zhejun Zhang, Michael Watson et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift when deployed in closed loop, leading to compounding errors. We introduce Rollouts as Demonstrations (RoaD), a simple and efficient method to mitigate covariate shift by leveraging the policy's own closed-loop rollouts as additional training data. During rollout generation, RoaD incorporates expert guidance to bias trajectories toward high-quality behavior, producing informative yet realistic demonstrations for fine-tuning. This approach enables robust closed-loop adaptation with orders of magnitude less data than reinforcement learning, and avoids restrictive assumptions of prior closed-loop supervised fine-tuning (CL-SFT) methods, allowing broader applications domains including end-to-end driving. We demonstrate the effectiveness of RoaD on WOSAC, a large-scale traffic simulation benchmark, where it performs similar or better than the prior CL-SFT method; and in AlpaSim, a high-fidelity neural reconstruction-based simulator for end-to-end driving, where it improves driving score by 41\% and reduces collisions by 54\%.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fengzhe Zhou, Jiannan Huang, Jialuo Li, Deva Ramanan, Humphrey Shi
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wei Tang, Yanpeng Sun, Shan Zhang, Xiaofan Li, Piotr Koniusz et al.
ๆฅ็ๆ่ฆ
Recent reinforcement-learning frameworks for visual perception policy have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states, direct supervision for proposal quality, and avoids ambiguity introduced by language-based reasoning. Artemis is built on Qwen2.5-VL-3B, achieves strong performance on grounding and detection task and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and ambiguity in real world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain of Ground CoG a training free multi step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of direct prediction the model progressively reflects and adjusts its hypotheses leading to more accurate and interpretable localization. Our approach achieves 68.4 accuracy on the ScreenSpot Pro benchmark an improvement of 4.8 points. To measure real world generalization we introduce TPanel UI a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI Chain of Ground improves over the strong baseline Qwen3 VL 235B by 6.9 points showing the effectiveness of multi step training free grounding across real world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.
๐
2025-12-01
๐ AAAI
๐ฅ Xu Zhang, Jin Yuan, Hanwang Zhang, Guojin Zhong, Yongsheng Zang et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Controllable image semantic understanding tasks, such as captioning or segmentation, necessitate users to input a prompt (e.g., text or bounding boxes) to predict a unique outcome, presenting challenges such as high-cost prompt input or limited information output. This paper introduces a new task ``Image Collaborative Segmentation and Captioning'' (SegCaptioning), which aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs, allowing flexible result selection by users. This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks. Technically, we propose a novel Scene Graph Guided Diffusion Model that leverages structured scene graph features for correlated mask-caption prediction. Initially, we introduce a Prompt-Centric Scene Graph Adaptor to map a user's prompt to a scene graph, effectively capturing his intention. Subsequently, we employ a diffusion process incorporating a Scene Graph Guided Bimodal Transformer to predict correlated caption-mask pairs by uncovering intricate correlations between them. To ensure accurate alignment, we design a Multi-Entities Contrastive Learning loss to explicitly align visual and textual entities by considering inter-modal similarity, resulting in well-aligned caption-mask pairs. Extensive experiments conducted on two datasets demonstrate that SGDiff achieves superior performance in SegCaptioning, yielding promising results for both captioning and segmentation tasks with minimal prompt input.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zisu Li, Hengye Lyu, Jiaxin Shi, Yufeng Zeng, Mingming Fan et al.
ๆฅ็ๆ่ฆ
Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands are imagined to interact with the virtual object embedded in a real-world scene, and generates corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640x368 resolution, with an approximate 150 ms latency on a single NVIDIA RTX 5090 GPU, and more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
๐
2025-12-01
๐ Published in Transactions on Machine Learning Rese
๐ฅ Zhongyu Yang, Dannong Xu, Wei Pang, Yingfang Yuan
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning is a technique used to mitigate this issue by removing redundancy, but existing methods often ignore relevance to the user query or suffer from the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks across image and video understanding tasks show that Script consistently achieves higher model efficiency and predictive accuracy compared to existing pruning methods. On LLaVA-NeXT-7B, it achieves up to 6.8x prefill speedup and 10x FLOP reduction, while retaining 96.88% of the original performance.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenyi Wang, Yanmao Man, Raymond Muller, Ming Li, Z. Berkay Celik et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Multi-Object Tracking (MOT) is a critical task in computer vision, with applications ranging from surveillance systems to autonomous driving. However, threats to MOT algorithms have yet been widely studied. In particular, incorrect association between the tracked objects and their assigned IDs can lead to severe consequences, such as wrong trajectory predictions. Previous attacks against MOT either focused on hijacking the trackers of individual objects, or manipulating the tracker IDs in MOT by attacking the integrated object detection (OD) module in the digital domain, which are model-specific, non-robust, and only able to affect specific samples in offline datasets. In this paper, we present AdvTraj, the first online and physical ID-manipulation attack against tracking-by-detection MOT, in which an attacker uses adversarial trajectories to transfer its ID to a targeted object to confuse the tracking system, without attacking OD. Our simulation results in CARLA show that AdvTraj can fool ID assignments with 100% success rate in various scenarios for white-box attacks against SORT, which also have high attack transferability (up to 93% attack success rate) against state-of-the-art (SOTA) MOT algorithms due to their common design principles. We characterize the patterns of trajectories generated by AdvTraj and propose two universal adversarial maneuvers that can be performed by a human walker/driver in daily scenarios. Our work reveals under-explored weaknesses in the object association phase of SOTA MOT systems, and provides insights into enhancing the robustness of such systems.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zahra Mahdavi, Zahra Khodakaramimaghsoud, Hooman Khaloo, Sina Bakhshandeh Taleshani, Erfan Hashemi et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet, these models remain vulnerable to hallucination outputs that appear plausible but are in fact incorrect. In the natural image domain, several decoding strategies have been proposed to mitigate hallucinations by reinforcing visual evidence, but most rely on secondary decoding or rollback procedures that substantially slow inference. Moreover, existing solutions are often domain-specific and may introduce misalignment between modalities or between generated and ground-truth content. We introduce Med-VCD, a sparse visual-contrastive decoding method that mitigates hallucinations in medical LVLMs without the time overhead of secondary decoding. Med-VCD incorporates a novel token-sparsification strategy that selects visually informed tokens on the fly, trimming redundancy while retaining critical visual context and thus balancing efficiency with reliability. Evaluations on eight medical datasets, spanning ophthalmology, radiology, and pathology tasks in visual question answering, report generation, and dedicated hallucination benchmarks, show that Med-VCD raises factual accuracy by an average of 13\% and improves hallucination accuracy by 6\% relative to baseline medical LVLMs.
๐
2025-12-01
๐ Submitted to Medical Image Analysis. Journal Exten
๐ฅ Bailiang Jian, Jiazhen Pan, Rohit Jena, Morteza Ghahremani, Hongwei Bran Li et al.
Natural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Medical image registration drives quantitative analysis across organs, modalities, and patient populations. Recent deep learning methods often combine low-level "trend-driven" computational blocks from computer vision, such as large-kernel CNNs, Transformers, and state-space models, with high-level registration-specific designs like motion pyramids, correlation layers, and iterative refinement. Yet, their relative contributions remain unclear and entangled. This raises a central question: should future advances in registration focus on importing generic architectural trends or on refining domain-specific design principles? Through a modular framework spanning brain, lung, cardiac, and abdominal registration, we systematically disentangle the influence of these two paradigms. Our evaluation reveals that low-level "trend-driven" computational blocks offer only marginal or inconsistent gains, while high-level registration-specific designs consistently deliver more accurate, smoother, and more robust deformations. These domain priors significantly elevate the performance of a standard U-Net baseline, far more than variants incorporating "trend-driven" blocks, achieving an average relative improvement of $\sim3\%$. All models and experiments are released within a transparent, modular benchmark that enables plug-and-play comparison for new architectures and registration tasks (https://github.com/BailiangJ/rethink-reg). This dynamic and extensible platform establishes a common ground for reproducible and fair evaluation, inviting the community to isolate genuine methodological contributions from domain priors. Our findings advocate a shift in research emphasis: from following architectural trends to embracing domain-specific design principles as the true drivers of progress in learning-based medical image registration.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gurmeher Khurana, Lan Wei, Dandan Zhang
Robotics
ๆฅ็ๆ่ฆ
Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives, including Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), to keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE) and approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, which enables more capable robotic perception.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Avirup Dey, Vinay Namboodiri
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Cross-domain face retargeting requires disentangled control over identity, expressions, and domain-specific stylistic attributes. Existing methods, typically trained on real-world faces, either fail to generalize across domains, need test-time optimizations, or require fine-tuning with carefully curated multi-style datasets to achieve domain-invariant identity representations. In this work, we introduce \textit{StyleYourSmile}, a novel one-shot cross-domain face retargeting method that eliminates the need for curated multi-style paired data. We propose an efficient data augmentation strategy alongside a dual-encoder framework, for extracting domain-invariant identity cues and capturing domain-specific stylistic variations. Leveraging these disentangled control signals, we condition a diffusion model to retarget facial expressions across domains. Extensive experiments demonstrate that \textit{StyleYourSmile} achieves superior identity preservation and retargeting fidelity across a wide range of visual domains.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zaid Nasser, Mikhail Iumanov, Tianhao Li, Maxim Popov, Jaafar Mahmoud et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems requiring depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it ideal for ego-centric applications and harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through a high-level features based adaptive robust kernel that handles both moving objects and movable static objects (e.g., moving furniture in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, while existing solutions either operate offline, need depth data and/or odometry estimation, or lack dynamic scene robustness. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, which makes it a good fit for autonomous robotics and AR/VR applications and advances practical spatial intelligence capabilities for embodied AI.
๐
2025-12-01
๐ 13 pages, 7 figures, 2 tables. This work has been
๐ฅ Florian Bรผrger, Martim Dias Gomes, Nica Gutu, Adriรกn E. Granada, Noรฉmie Moreau et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Tracking cells in time-lapse videos is an essential technique for monitoring cell population dynamics at a single-cell level. Current methods for cell tracking are developed on videos with mostly single, constant signals and do not detect pivotal events such as cell death. Here, we present TransientTrack, a deep learning-based framework for cell tracking in multi-channel microscopy video data with transient fluorescent signals that fluctuate over time following processes such as the circadian rhythm of cells. By identifying key cellular events - mitosis (cell division) and apoptosis (cell death) our method allows us to build complete trajectories, including cell lineage information. TransientTrack is lightweight and performs matching on cell detection embeddings directly, without the need for quantification of tracking-specific cell features. Furthermore, our approach integrates Transformer Networks, multi-stage matching using all detection boxes, and the interpolation of missing tracklets with the Kalman Filter. This unified framework achieves strong performance across diverse conditions, effectively tracking cells and capturing cell division and death. We demonstrate the use of TransientTrack in an analysis of the efficacy of a chemotherapeutic drug at a single-cell level. The proposed framework could further advance quantitative studies of cancer cell dynamics, enabling detailed characterization of treatment response and resistance mechanisms. The code is available at https://github.com/bozeklab/TransientTrack.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Tsz-To Wong, Ching-Chun Huang, Hong-Han Shuai
Computer Vision
ๆฅ็ๆ่ฆ
Intelligent sports video analysis demands a comprehensive understanding of temporal context, from micro-level actions to macro-level game strategies. Existing end-to-end models often struggle with this temporal hierarchy, offering solutions that lack generalization, incur high development costs for new tasks, and suffer from poor interpretability. To overcome these limitations, we propose a reconfigurable Multi-Agent System (MAS) as a foundational framework for sports video understanding. In our system, each agent functions as a distinct "cognitive tool" specializing in a specific aspect of analysis. The system's architecture is not confined to a single temporal dimension or task. By leveraging iterative invocation and flexible composition of these agents, our framework can construct adaptive pipelines for both short-term analytic reasoning (e.g., Rally QA) and long-term generative summarization (e.g., match summaries). We demonstrate the adaptability of this framework using two representative tasks in badminton analysis, showcasing its ability to bridge fine-grained event detection and global semantic organization. This work presents a paradigm shift towards a flexible, scalable, and interpretable system for robust, cross-task sports video intelligence.The project homepage is available at https://aiden1020.github.io/COACH-project-page
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yue Pan, Tao Sun, Liyuan Zhu, Lucas Nunes, Iro Armeni et al.
Computer Vision
ๆฅ็ๆ่ฆ
Point cloud registration aligns multiple unposed point clouds into a common frame, and is a core step for 3D reconstruction and robot localization. In this work, we cast registration as conditional generation: a learned continuous, point-wise velocity field transports noisy points to a registered scene, from which the pose of each view is recovered. Unlike previous methods that conduct correspondence matching to estimate the transformation between a pair of point clouds and then optimize the pairwise transformations to realize multi-view registration, our model directly generates the registered point cloud. With a lightweight local feature extractor and test-time rigidity enforcement, our approach achieves state-of-the-art results on pairwise and multi-view registration benchmarks, particularly with low overlap, and generalizes across scales and sensor modalities. It further supports downstream tasks including relocalization, multi-robot SLAM, and multi-session map merging. Source code available at: https://github.com/PRBonn/RAP.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zeqing Wang, Keze Wang, Lei Zhang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains a question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify the physically impossible content from generated videos. To investigate this issue, we construct a \textbf{PID} (\textbf{P}hysical \textbf{I}mplausibility \textbf{D}etection) dataset, which consists of a \textit{test split} of 500 manually annotated videos and a \textit{train split} of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models producing physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations on the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely \textbf{PhyDetEx}, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at \href{https://github.com/Zeqing-Wang/PhyDetEx}{https://github.com/Zeqing-Wang/PhyDetEx}.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Songyan Zhang, Wenhui Huang, Zhan Chen, Chua Jiahao Collister, Qihang Huang et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Recently, two-stage fine-tuning strategies, e.g., acquiring essential driving knowledge through supervised fine-tuning (SFT) and further enhancing decision-making and planning via reinforcement fine-tuning (RFT), have shown strong potential in advancing the knowledge-driven autonomous driving (AD) paradigm. However, the learning nature of SFT still limits the generalization of reasoning, thereby constraining the full potential of driving performance. Meanwhile, current RFT approaches are primarily applied to downstream tasks, since scene understanding is an open-ended problem where corresponding rewards are difficult to quantify. To address these limitations, we propose OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based autonomous driving (AD) framework that enables end-to-end RFT across the full spectrum from high-level reasoning to low-level trajectory planning. Specifically, we begin by constructing large-scale Chain-of-Thought (CoT) annotations on open-source driving-related knowledge datasets, and employ the powerful Qwen3 large language model (LLM) as the critic in RFT to quantify reasoning quality for open-ended questions during reward modeling. Extensive experiments confirm that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yize Zhang, Meiqi Chen, Sirui Chen, Bo Peng, Yanxi Zhang et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery. It requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model to perform visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (21% absolute gain). Our code, model, and dataset are fully open-sourced at project page: https://github.com/OpenCausaLab/CauSight.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu et al.
ๆฅ็ๆ่ฆ
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meng Cao, Haokun Lin, Haoyuan Li, Haoran Tang, Rongtao Xu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lama Alssum, Hasan Abed Al Kader Hammoud, Motasem Alfarra, Juan C Leon Alcazar, Bernard Ghanem
Machine Learning
ๆฅ็ๆ่ฆ
Deep neural networks suffer from catastrophic forgetting, where performance on previous tasks degrades after training on a new task. This issue arises due to the model's tendency to overwrite previously acquired knowledge with new information. We present a novel approach to address this challenge, focusing on the intersection of memory-based methods and regularization approaches. We formulate a regularization strategy, termed Information Maximization (IM) regularizer, for memory-based continual learning methods, which is based exclusively on the expected label distribution, thus making it class-agnostic. As a consequence, IM regularizer can be directly integrated into various rehearsal-based continual learning methods, reducing forgetting and favoring faster convergence. Our empirical validation shows that, across datasets and regardless of the number of tasks, our proposed regularization strategy consistently improves baseline performance at the expense of a minimal computational overhead. The lightweight nature of IM ensures that it remains a practical and scalable solution, making it applicable to real-world continual learning scenarios where efficiency is paramount. Finally, we demonstrate the data-agnostic nature of our regularizer by applying it to video data, which presents additional challenges due to its temporal structure and higher memory requirements. Despite the significant domain gap, our experiments show that IM regularizer also improves the performance of video continual learning methods.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision-a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) uncovers: specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally-isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling-ultimately limiting world knowledge internalization, generation.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xavier Thomas, Youngsun Lim, Ananya Srinivasan, Audrey Zheng, Deepti Ghadiyaram
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves substantial improvement of more than 68% compared to existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and has a stronger correlation with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinyu Xiong, Zihuang Wu, Lei Lu, Yufa Xia
Computer Vision
ๆฅ็ๆ่ฆ
In this paper, we introduce SAM3-UNet, a simplified variant of Segment Anything Model 3 (SAM3), designed to adapt SAM3 for downstream tasks at a low cost. Our SAM3-UNet consists of three components: a SAM3 image encoder, a simple adapter for parameter-efficient fine-tuning, and a lightweight U-Net-style decoder. Preliminary experiments on multiple tasks, such as mirror detection and salient object detection, demonstrate that the proposed SAM3-UNet outperforms the prior SAM2-UNet and other state-of-the-art methods, while requiring less than 6 GB of GPU memory during training with a batch size of 12. The code is publicly available at https://github.com/WZH0120/SAM3-UNet.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Christian Molliรจre, Iker Cumplido, Marco Zeulner, Lukas Liesenhoff, Matthias Schubert et al.
Machine Learning
ๆฅ็ๆ่ฆ
The rapid growth of data from satellite-based Earth observation (EO) systems poses significant challenges in data transmission and storage. We evaluate the potential of task-specific learned compression algorithms in this context to reduce data volumes while retaining crucial information. In detail, we compare traditional compression (JPEG 2000) versus a learned compression approach (Discretized Mixed Gaussian Likelihood) on three EO segmentation tasks: Fire, cloud, and building detection. Learned compression notably outperforms JPEG 2000 for large-scale, multi-channel optical imagery in both reconstruction quality (PSNR) and segmentation accuracy. However, traditional codecs remain competitive on smaller, single-channel thermal infrared datasets due to limited data and architectural constraints. Additionally, joint end-to-end optimization of compression and segmentation models does not improve performance over standalone optimization.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Syed Hesham Syed Ariff, Yun Liu, Guolei Sun, Jing Yang, Henghui Ding et al.
ๆฅ็ๆ่ฆ
The Segmentation Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos, capable of storing object-aware memories and transferring them temporally through memory blocks. While SAM2 excels in video object segmentation by providing dense segmentation masks based on prompts, extending it to dense Video Semantic Segmentation (VSS) poses challenges due to the need for spatial accuracy, temporal consistency, and the ability to track multiple objects with complex boundaries and varying scales. This paper explores the extension of SAM2 for VSS, focusing on two primary approaches and highlighting firsthand observations and common challenges faced during this process. The first approach involves using SAM2 to extract unique objects as masks from a given image, with a segmentation network employed in parallel to generate and refine initial predictions. The second approach utilizes the predicted masks to extract unique feature vectors, which are then fed into a simple network for classification. The resulting classifications and masks are subsequently combined to produce the final segmentation. Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ahsan Raza Siyal, Markus Haltmeier, Ruth Steiger, Malik Galijasevic, Elke Ruth Gizewski et al.
ๆฅ็ๆ่ฆ
Medical image registration is crucial for various clinical and research applications including disease diagnosis or treatment planning which require alignment of images from different modalities, time points, or subjects. Traditional registration techniques often struggle with challenges such as contrast differences, spatial distortions, and modality-specific variations. To address these limitations, we propose a method that integrates learnable edge kernels with learning-based rigid and non-rigid registration techniques. Unlike conventional layers that learn all features without specific bias, our approach begins with a predefined edge detection kernel, which is then perturbed with random noise. These kernels are learned during training to extract optimal edge features tailored to the task. This adaptive edge detection enhances the registration process by capturing diverse structural features critical in medical imaging. To provide clearer insight into the contribution of each component in our design, we introduce four variant models for rigid registration and four variant models for non-rigid registration. We evaluated our approach using a dataset provided by the Medical University across three setups: rigid registration without skull removal, with skull removal, and non-rigid registration. Additionally, we assessed performance on two publicly available datasets. Across all experiments, our method consistently outperformed state-of-the-art techniques, demonstrating its potential to improve multi-modal image alignment and anatomical structure analysis.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hafsa Billah
Computer Vision
ๆฅ็ๆ่ฆ
Automatically understanding video contents is important for several applications in Civic Monitoring (CM), general Surveillance (SL), Assisted Living (AL), etc. Decades of Image and Video Analysis (IVA) research have advanced tasks such as content extraction (e.g., object recognition and tracking). Identifying meaningful activities or situations (e.g., two objects coming closer) remains difficult and cannot be achieved by content extraction alone. Currently, Video Situation Analysis (VSA) is done manually with a human in the loop, which is error-prone and labor-intensive, or through custom algorithms designed for specific video types or situations. These algorithms are not general-purpose and require a new algorithm/software for each new situation or video from a new domain.
This report proposes a general-purpose VSA framework that overcomes the above limitations. Video contents are extracted once using state-of-the-art Video Content Extraction technologies. They are represented using two alternative models -- the extended relational model (R++) and graph models. When represented using R++, the extracted contents can be used as data streams, enabling Continuous Query Processing via the proposed Continuous Query Language for Video Analysis. The graph models complement this by enabling the detection of situations that are difficult or impossible to detect using the relational model alone. Existing graph algorithms and newly developed algorithms support a wide variety of situation detection. To support domain independence, primitive situation variants across domains are identified and expressed as parameterized templates. Extensive experiments were conducted across several interesting situations from three domains -- AL, CM, and SL-- to evaluate the accuracy, efficiency, and robustness of the proposed approach using a dataset of videos of varying lengths from these domains.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xurui Zhou, Gongwei Chen, Yuquan Xie, Zaijing Li, Kaiwen Zhou et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Graphical User Interface (GUI) agents require effective use of historical context to perform sequential navigation tasks. While incorporating past actions and observations can improve decision making, naive use of full history leads to excessive computational overhead and distraction from irrelevant information. To address this, we introduce HiconAgent, a GUI agent trained with History Context-aware Policy Optimization (HCPO) for efficient and effective utilization of historical information. HCPO optimizes history usage in both sampling and policy updates through two complementary components: (1) Dynamic Context Sampling (DCS) presents the agent with variable length histories during sampling, enabling adaptive use of the most relevant context; (2) Anchor-guided History Compression (AHC) refines the policy update phase with a dual branch strategy where the compressed branch removes history observations while keeping history actions as information flow anchors. The compressed and uncompressed branches are coupled through a history-enhanced alignment loss to enforce consistent history usage while maintaining efficiency. Experiments on mainstream GUI navigation benchmarks demonstrate strong performance. Despite being smaller, HiconAgent-3B outperforms GUI-R1-7B by +8.46 percent grounding accuracy and +11.32 percent step success rate on GUI-Odyssey, while achieving comparable results on AndroidControl and AITW with up to 2.47x computational speedup and 60 percent FLOPs reduction.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yucheng Liao, Jiajun Liang, Kaiqian Cui, Baoquan Zhao, Haoran Xie et al.
Robotics
ๆฅ็ๆ่ฆ
Instruction-based image editing through natural language has emerged as a powerful paradigm for intuitive visual manipulation. While recent models achieve impressive results on single edits, they suffer from severe quality degradation under multi-turn editing. Through systematic analysis, we identify progressive loss of high-frequency information as the primary cause of this quality degradation. We present FreqEdit, a training-free framework that enables stable editing across 10+ consecutive iterations. Our approach comprises three synergistic components: (1) high-frequency feature injection from reference velocity fields to preserve fine-grained details, (2) an adaptive injection strategy that spatially modulates injection strength for precise region-specific control, and (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint. Extensive experiments demonstrate that FreqEdit achieves superior performance in both identity preservation and instruction following compared to seven state-of-the-art baselines.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai et al.
ๆฅ็ๆ่ฆ
Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Xiuli Bi, Die Xiao, Junchao Fan, Bin Xiao
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation, significantly rectifying background over-activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single-stage approaches, as well as more complex multi-stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.
๐
2025-12-01
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Huaxu He
Machine Learning
ๆฅ็ๆ่ฆ
Handling static images that lack inherent temporal dynamics remains a fundamental challenge for spiking neural networks (SNNs). In directly trained SNNs, static inputs are typically repeated across time steps, causing the temporal dimension to collapse into a rate like representation and preventing meaningful temporal modeling. This work revisits the reported performance gap between direct and rate based encodings and shows that it primarily stems from convolutional learnability and surrogate gradient formulations rather than the encoding schemes themselves. To illustrate this mechanism level clarification, we introduce a minimal learnable temporal encoding that adds adaptive phase shifts to induce meaningful temporal variation from static inputs.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Patrick Kwon, Chen Chen
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2% increase in character consistency and a 36.2% increase in style similarity compared to previous methods, while displaying high spatial accuracy. Our project page is available at https://yj7082126.github.io/dreamingcomics/
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Farzaneh Seyedshahi, Francesca Damiola, Sylvie Lantuejoul, Ke Yuan, John Le Quesne
ๆฅ็ๆ่ฆ
Accurate subtype classification and outcome prediction in mesothelioma are essential for guiding therapy and patient care. Most computational pathology models are trained on large tissue images from resection specimens, limiting their use in real-world settings where small biopsies are common. We show that a self-supervised encoder trained on resection tissue can be applied to biopsy material, capturing meaningful morphological patterns. Using these patterns, the model can predict patient survival and classify tumor subtypes. This approach demonstrates the potential of AI-driven tools to support diagnosis and treatment planning in mesothelioma.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haodong Yan, Hang Yu, Zhide Zhong, Weilin Yuan, Xin Gong et al.
ๆฅ็ๆ่ฆ
Generating realistic hand-object interactions (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods utilize HOI representation as an auxiliary generative objective to guide video synthesis. However, there is a dilemma between 2D and 3D representations that cannot simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design. Our project page is https://hgzn258.github.io/SCAR/.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Felix Nรผtzel, Mischa Dombrowski, Bernhard Kainz
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in text-to-image diffusion models enable high-fidelity generation across diverse prompts. However, these models falter in long-tail settings, such as medical imaging, where rare pathologies comprise a small fraction of data. This results in mode collapse: tail-class outputs lack quality and diversity, undermining the goal of synthetic data augmentation for underrepresented conditions. We pinpoint gradient conflicts between frequent head and rare tail classes as the primary culprit, a factor unaddressed by existing sampling or conditioning methods that mainly steer inference without altering the learned distribution. To resolve this, we propose GRASP: Guided Residual Adapters with Sample-wise Partitioning. GRASP uses external priors to statically partition samples into clusters that minimize intra-group gradient clashes. It then fine-tunes pre-trained models by injecting cluster-specific residual adapters into transformer feedforward layers, bypassing learned gating for stability and efficiency. On the long-tail MIMIC-CXR-LT dataset, GRASP yields superior FID and diversity metrics, especially for rare classes, outperforming baselines like vanilla fine-tuning and Mixture of Experts variants. Downstream classification on NIH-CXR-LT improves considerably for tail labels. Generalization to ImageNet-LT confirms broad applicability. Our method is lightweight, scalable, and readily integrates with diffusion pipelines.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhicheng Zhao, Yin Huang, Lingma Sun, Chenglong Li, Jin Tang
Computer Vision
ๆฅ็ๆ่ฆ
Tiny object detection in remote sensing imagery has attracted significant research interest in recent years. Despite recent progress, achieving balanced detection performance across diverse object scales remains a formidable challenge, particularly in scenarios where dense tiny objects and large objects coexist. Although large foundation models have revolutionized general vision tasks, their application to tiny object detection remains unexplored due to the extreme scale variation and density distribution inherent to remote sensing imagery. To bridge this scale gap, we propose ScaleBridge-Det, to the best of our knowledge, the first large detection framework designed for tiny objects, which could achieve balanced performance across diverse scales through scale-adaptive expert routing and density-guided query allocation. Specifically, we introduce a Routing-Enhanced Mixture Attention (REM) module that dynamically selects and fuses scale-specific expert features via adaptive routing to address the tendency of standard MoE models to favor dominant scales. REM generates complementary and discriminative multi-scale representations suitable for both tiny and large objects. Furthermore, we present a Density-Guided Dynamic Query (DGQ) module that predicts object density to adaptively adjust query positions and numbers, enabling efficient resource allocation for objects of varying scales. The proposed framework allows ScaleBridge-Det to simultaneously optimize performance for both dense tiny and general objects without trade-offs. Extensive experiments on benchmark and cross-domain datasets demonstrate that ScaleBridge-Det achieves state-of-the-art performance on AI-TOD-V2 and DTOD, while exhibiting superior cross-domain robustness on VisDrone.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongyu Xu, Panpan Meng, Meng Wang, Dayu Hu, Liming Liang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Accurate segmentation of retinal vessels is crucial for the clinical diagnosis of numerous ophthalmic and systemic diseases. However, traditional Convolutional Neural Network (CNN) methods exhibit inherent limitations, struggling to capture long-range dependencies and complex nonlinear relationships. To address the above limitations, an Adaptive Dual Branch Kolmogorov-Arnold UNet (DB-KAUNet) is proposed for retinal vessel segmentation. In DB-KAUNet, we design a Heterogeneous Dual-Branch Encoder (HDBE) that features parallel CNN and Transformer pathways. The HDBE strategically interleaves standard CNN and Transformer blocks with novel KANConv and KAT blocks, enabling the model to form a comprehensive feature representation. To optimize feature processing, we integrate several critical components into the HDBE. First, a Cross-Branch Channel Interaction (CCI) module is embedded to facilitate efficient interaction of channel features between the parallel pathways. Second, an attention-based Spatial Feature Enhancement (SFE) module is employed to enhance spatial features and fuse the outputs from both branches. Building upon the SFE module, an advanced Spatial Feature Enhancement with Geometrically Adaptive Fusion (SFE-GAF) module is subsequently developed. In the SFE-GAF module, adaptive sampling is utilized to focus on true vessel morphology precisely. The adaptive process strengthens salient vascular features while significantly reducing background noise and computational overhead. Extensive experiments on the DRIVE, STARE, and CHASE_DB1 datasets validate that DB-KAUNet achieves leading segmentation performance and demonstrates exceptional robustness.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongchen Han, Yining Li, Tianyu Li, Zixuan Cao, Ziming Wang et al.
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code is available at https://github.com/LeapLabTHU/ViTTT.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Wang, Haipeng Zhang, Mang Li, Zhaohui Xia, Yueguo Chen et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) stems from a fundamental dilemma: existing text-centric or diffusion-based approaches struggle to effectively bridge the vision-language modality gap. To address this, we propose Fusion-Diff, a novel generative editing framework with high effectiveness and data efficiency designed for multimodal alignment. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, enabling state-of-the-art performance through fine-tuning on only a limited-scale synthetic dataset of 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of our model by visualizing the fused multimodal representations.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yumeng He, Ying Jiang, Jiayin Lu, Yin Yang, Chenfanfu Jiang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fengfeng Li, Zhou Feng, Hongliang Wu, Hao Zhang, Han Tian et al.
ๆฅ็ๆ่ฆ
In well logging operations using the oil-based mud (OBM) microresistivity imager, which employs an interleaved design with upper and lower pad sets, depth misalignment issues persist between the pad images even after velocity correction. This paper presents a depth matching method for borehole images based on the Shape Dynamic Time Warping (ShapeDTW) algorithm. The method extracts local shape features to construct a morphologically sensitive distance matrix, better preserving structural similarity between sequences during alignment. We implement this by employing a combined feature set of the one-dimensional Histogram of Oriented Gradients (HOG1D) and the original signal as the shape descriptor. Field test examples demonstrate that our method achieves precise alignment for images with complex textures, depth shifts, or local scaling. Furthermore, it provides a flexible framework for feature extension, allowing the integration of other descriptors tailored to specific geological features.
๐
2025-12-01
๐ The 2025 IEEE International Conference on Content-
๐ฅ Thao Thi Phuong Dao, Tan-Cong Nguyen, Trong-Le Do, Truong Hoang Viet, Nguyen Chi Thanh et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Abscesses in the head and neck represent an acute infectious process that can potentially lead to sepsis or mortality if not diagnosed and managed promptly. Accurate detection and delineation of these lesions on imaging are essential for diagnosis, treatment planning, and surgical intervention. In this study, we introduce AbscessHeNe, a curated and comprehensively annotated dataset comprising 4,926 contrast-enhanced CT slices with clinically confirmed head and neck abscesses. The dataset is designed to facilitate the development of robust semantic segmentation models that can accurately delineate abscess boundaries and evaluate deep neck space involvement, thereby supporting informed clinical decision-making. To establish performance baselines, we evaluate several state-of-the-art segmentation architectures, including CNN, Transformer, and Mamba-based models. The highest-performing model achieved a Dice Similarity Coefficient of 0.39, Intersection-over-Union of 0.27, and Normalized Surface Distance of 0.67, indicating the challenges of this task and the need for further research. Beyond segmentation, AbscessHeNe is structured for future applications in content-based multimedia indexing and case-based retrieval. Each CT scan is linked with pixel-level annotations and clinical metadata, providing a foundation for building intelligent retrieval systems and supporting knowledge-driven clinical workflows. The dataset will be made publicly available at https://github.com/drthaodao3101/AbscessHeNe.git.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junran Peng, Yiheng Huang, Silei Shen, Zeji Wei, Jingwei Yang et al.
ๆฅ็ๆ่ฆ
In this paper, we introduce RoleMotion, a large-scale human motion dataset that encompasses a wealth of role-playing and functional motion data tailored to fit various specific scenes. Existing text datasets are mainly constructed decentrally as amalgamation of assorted subsets that their data are nonfunctional and isolated to work together to cover social activities in various scenes. Also, the quality of motion data is inconsistent, and textual annotation lacks fine-grained details in these datasets. In contrast, RoleMotion is meticulously designed and collected with a particular focus on scenes and roles. The dataset features 25 classic scenes, 110 functional roles, over 500 behaviors, and 10296 high-quality human motion sequences of body and hands, annotated with 27831 fine-grained text descriptions. We build an evaluator stronger than existing counterparts, prove its reliability, and evaluate various text-to-motion methods on our dataset. Finally, we explore the interplay of motion generation of body and hands. Experimental results demonstrate the high-quality and functionality of our dataset on text-driven whole-body generation.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thao Thi Phuong Dao, Tan-Cong Nguyen, Nguyen Chi Thanh, Truong Hoang Viet, Trong-Le Do et al.
ๆฅ็ๆ่ฆ
Head and neck masses are space-occupying lesions that can compress the airway and esophagus and may affect nerves and blood vessels. Available public datasets primarily focus on malignant lesions and often overlook other space-occupying conditions in this region. To address this gap, we introduce MasHeNe, an initial dataset of 3,779 contrast-enhanced CT slices that includes both tumors and cysts with pixel-level annotations. We also establish a benchmark using standard segmentation baselines and report common metrics to enable fair comparison. In addition, we propose the Windowing-Enhanced Mamba with Frequency integration (WEMF) model. WEMF applies tri-window enhancement to enrich the input appearance before feature extraction. It further uses multi-frequency attention to fuse information across skip connections within a U-shaped Mamba backbone. On MasHeNe, WEMF attains the best performance among evaluated methods, with a Dice of 70.45%, IoU of 66.89%, NSD of 72.33%, and HD95 of 5.12 mm. This model indicates stable and strong results on this challenging task. MasHeNe provides a benchmark for head-and-neck mass segmentation beyond malignancy-only datasets. The observed error patterns also suggest that this task remains challenging and requires further research. Our dataset and code are available at https://github.com/drthaodao3101/MasHeNe.git.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single, unified framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM's structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmark that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zipeng Wang, Dan Xu
Natural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% of VGGT for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images. Our project page is available at https://wzpscott.github.io/flashvggt_page/.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alexander Frotscher, Christian F. Baumgartner, Thomas Wolfers
ๆฅ็ๆ่ฆ
Deep unsupervised anomaly detection in brain magnetic resonance imaging offers a promising route to identify pathological deviations without requiring lesion-specific annotations. Yet, fragmented evaluations, heterogeneous datasets, and inconsistent metrics have hindered progress toward clinical translation. Here, we present a large-scale, multi-center benchmark of deep unsupervised anomaly detection for brain imaging. The training cohort comprised 2,976 T1 and 2,972 T2-weighted scans from healthy individuals across six scanners, with ages ranging from 6 to 89 years. Validation used 92 scans to tune hyperparameters and estimate unbiased thresholds. Testing encompassed 2,221 T1w and 1,262 T2w scans spanning healthy datasets and diverse clinical cohorts. Across all algorithms, the Dice-based segmentation performance varied between 0.03 and 0.65, indicating substantial variability. To assess robustness, we systematically evaluated the impact of different scanners, lesion types and sizes, as well as demographics (age, sex). Reconstruction-based methods, particularly diffusion-inspired approaches, achieved the strongest lesion segmentation performance, while feature-based methods showed greater robustness under distributional shifts. However, systematic biases, such as scanner-related effects, were observed for the majority of algorithms, including that small and low-contrast lesions were missed more often, and that false positives varied with age and sex. Increasing healthy training data yields only modest gains, underscoring that current unsupervised anomaly detection frameworks are limited algorithmically rather than by data availability. Our benchmark establishes a transparent foundation for future research and highlights priorities for clinical translation, including image native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hailong Yang, Te Zhang, Kup-sze Choi, Zhaohong Deng
ๆฅ็ๆ่ฆ
Diffusion models have emerged as a leading technique for generating images due to their ability to create high-resolution and realistic images. Despite their strong performance, diffusion models still struggle in managing image collections with significant feature differences. They often fail to capture complex features and produce conflicting results. Research has attempted to address this issue by learning different regions of an image through multiple diffusion paths and then combining them. However, this approach leads to inefficient coordination among multiple paths and high computational costs. To tackle these issues, this paper presents a Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. DFS offers several advantages. First, unlike traditional multi-path diffusion methods, DFS uses multiple diffusion paths, each dedicated to learning a specific class of image features. By assigning each path to a different feature type, DFS overcomes the limitations of multi-path models in capturing heterogeneous image features. Second, DFS employs rule-chain-based reasoning to dynamically steer the diffusion process and enable efficient coordination among multiple paths. Finally, DFS introduces a fuzzy membership-based latent-space compression mechanism to reduce the computational costs of multi-path diffusion effectively. We tested our method on three public datasets: LSUN Bedroom, LSUN Church, and MS COCO. The results show that DFS achieves more stable training and faster convergence than existing single-path and multi-path diffusion models. Additionally, DFS surpasses baseline models in both image quality and alignment between text and images, and also shows improved accuracy when comparing generated images to target references.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Can Polat, Erchin Serpedin, Mustafa Kurban, Hasan Kurban
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Despite rapid advances in molecular and materials machine learning, most models still lack physical transferability: they fit correlations across whole molecules or crystals rather than learning the quantum interactions between atomic pairs. Yet bonding, charge redistribution, orbital hybridization, and electronic coupling all emerge from these two-body interactions that define local quantum fields in many-body systems. We introduce QuantumCanvas, a large-scale multimodal benchmark that treats two-body quantum systems as foundational units of matter. The dataset spans 2,850 element-element pairs, each annotated with 18 electronic, thermodynamic, and geometric properties and paired with ten-channel image representations derived from l- and m-resolved orbital densities, angular field transforms, co-occupancy maps, and charge-density projections. These physically grounded images encode spatial, angular, and electrostatic symmetries without explicit coordinates, providing an interpretable visual modality for quantum learning. Benchmarking eight architectures across 18 targets, we report mean absolute errors of 0.201 eV on energy gap using GATv2, 0.265 eV on HOMO and 0.274 eV on LUMO using EGNN. For energy-related quantities, DimeNet attains 2.27 eV total-energy MAE and 0.132 eV repulsive-energy MAE, while a multimodal fusion model achieves a 2.15 eV Mermin free-energy MAE. Pretraining on QuantumCanvas further improves convergence stability and generalization when fine-tuned on larger datasets such as QM9, MD17, and CrysMTM. By unifying orbital physics with vision-based representation learning, QuantumCanvas provides a principled and interpretable basis for learning transferable quantum interactions through coupled visual and numerical modalities. Dataset and model implementations are available at https://github.com/KurbanIntelligenceLab/QuantumCanvas.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Franz Thaler, Martin Urschler, Mateusz Kozinski, Matthias AF Gsell, Gernot Plank et al.
Computer Vision
ๆฅ็ๆ่ฆ
We tackle the challenging problem of single-source domain generalization (DG) for medical image segmentation. To this end, we aim for training a network on one domain (e.g., CT) and directly apply it to a different domain (e.g., MR) without adapting the model and without requiring images or annotations from the new domain during training. We propose a novel method for promoting DG when training deep segmentation networks, which we call SRCSM. During training, our method diversifies the source domain through semantic-aware random convolution, where different regions of a source image are augmented differently, based on their annotation labels. At test-time, we complement the randomization of the training domain via mapping the intensity of target domain images, making them similar to source domain data. We perform a comprehensive evaluation on a variety of cross-modality and cross-center generalization settings for abdominal, whole-heart and prostate segmentation, where we outperform previous DG techniques in a vast majority of experiments. Additionally, we also investigate our method when training on whole-heart CT or MR data and testing on the diastolic and systolic phase of cine MR data captured with different scanner hardware, where we make a step towards closing the domain gap in this even more challenging setting. Overall, our evaluation shows that SRCSM can be considered a new state-of-the-art in DG for medical image segmentation and, moreover, even achieves a segmentation performance that matches the performance of the in-domain baseline in several settings.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Joanne Lin, Ruirui Lin, Yini Li, David Bull, Nantheera Anantrasirichai
ๆฅ็ๆ่ฆ
Video instance segmentation (VIS) for low-light content remains highly challenging for both humans and machines alike, due to adverse imaging conditions including noise, blur and low-contrast. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, as a result, perform poorly even when finetuned on low-light data. In this paper, we introduce \textbf{ELVIS} (\textbf{E}nhance \textbf{L}ow-light for \textbf{V}ideo \textbf{I}nstance \textbf{S}egmentation), a novel framework that enables effective domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile synthesis network (VDP-Net) and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performances by up to \textbf{+3.7AP} on the synthetic low-light YouTube-VIS 2019 dataset. Code will be released upon acceptance.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Majid Arthaud, Antonin Chambolle, Vincent Duval
ๆฅ็ๆ่ฆ
We introduce a variational approach for extracting curves between a list of possible endpoints, based on the discretization of an energy and Smirnov's decomposition theorem for vector fields. It is used to design a bi-level minimization approach to automatically extract curves and 1D structures from an image, which is mostly unsupervised. We extend then the method to curvature-dependent energies, using a now classical lifting of the curves in the space of positions and orientations equipped with an appropriate sub-Riemanian or Finslerian metric.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qisen Wang, Yifan Zhao, Peisen Shen, Jialu Li, Jia Li
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Although prevailing camera-controlled video generation models can produce cinematic results, lifting them directly to the generation of 3D-consistent and high-fidelity time-synchronized multi-view videos remains challenging, which is a pivotal capability for taming 4D worlds. Some works resort to data augmentation or test-time optimization, but these strategies are constrained by limited model generalization and scalability issues. To this end, we propose ChronosObserver, a training-free method including World State Hyperspace to represent the spatiotemporal constraints of a 4D world scene, and Hyperspace Guided Sampling to synchronize the diffusion sampling trajectories of multiple views using the hyperspace. Experimental results demonstrate that our method achieves high-fidelity and 3D-consistent time-synchronized multi-view videos generation without training or fine-tuning for diffusion models.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Omer Sela, Michael Chertok, Lior Wolf
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper presents CourtMotion, a spatiotemporal modeling framework for analyzing and predicting game events and plays as they develop in professional basketball. Anticipating basketball events requires understanding both physical motion patterns and their semantic significance in the context of the game. Traditional approaches that use only player positions fail to capture crucial indicators such as body orientation, defensive stance, or shooting preparation motions. Our two-stage approach first processes skeletal tracking data through Graph Neural Networks to capture nuanced motion patterns, then employs a Transformer architecture with specialized attention mechanisms to model player interactions. We introduce event projection heads that explicitly connect player movements to basketball events like passes, shots, and steals, training the model to associate physical motion patterns with their tactical purposes. Experiments on NBA tracking data demonstrate significant improvements over position-only baselines: 35% reduction in trajectory prediction error compared to state-of-the-art position-based models and consistent performance gains across key basketball analytics tasks. The resulting pretrained model serves as a powerful foundation for multiple downstream tasks, with pick detection, shot taker identification, assist prediction, shot location classification, and shot type recognition demonstrating substantial improvements over existing methods.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kuangpu Guo, Yuhe Ding, Jian Liang, Zilei Wang, Ran He
ๆฅ็ๆ่ฆ
Model merging has emerged as a promising paradigm for enabling multi-task capabilities without additional training. However, existing methods often experience substantial performance degradation compared with individually fine-tuned models, even on similar tasks, underscoring the need to preserve task-specific information. This paper proposes Decomposition, Thresholding, and Scaling (DTS), an approximation-based personalized merging framework that preserves task-specific information with minimal storage overhead. DTS first applies singular value decomposition to the task-specific information and retains only a small subset of singular values and vectors. It then introduces a novel thresholding strategy that partitions singular vector elements into groups and assigns a scaling factor to each group. To enable generalization to unseen tasks, we further extend DTS with a variant that fuses task-specific information in a data-free manner based on the semantic similarity of task characteristics. Extensive experiments demonstrate that DTS consistently outperforms state-of-the-art baselines while requiring only 1\% additional storage per task. Furthermore, experiments on unseen tasks show that the DTS variant achieves significantly better generalization performance. Our code is available at https://github.com/krumpguo/DTS.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jian Shu, Nanjie Yao, Gangjian Zhang, Junlong Ren, Yu Feng et al.
ๆฅ็ๆ่ฆ
3D human avatar animation aims at transforming a human avatar from an arbitrary initial pose to a specified target pose using deformation algorithms. Existing approaches typically divide this task into two stages: canonical template construction and target pose deformation. However, current template construction methods demand extensive skeletal rigging and often produce artifacts for specific poses. Moreover, target pose deformation suffers from structural distortions caused by Linear Blend Skinning (LBS), which significantly undermines animation realism. To address these problems, we propose a unified learning-based framework to address both challenges in two phases. For the former phase, to overcome the inefficiencies and artifacts during template construction, we leverage a U-Net architecture that decouples texture and pose information in a feed-forward process, enabling fast generation of a human template. For the latter phase, we propose a data-driven refinement technique that enhances structural integrity. Extensive experiments show that our model delivers consistent performance across diverse poses with an optimal balance between efficiency and quality,surpassing state-of-the-art (SOTA) methods.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Klara Reichard, Nikolas Brasch, Nassir Navab, Federico Tombari
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Open-world and anomaly segmentation methods seek to enable autonomous driving systems to detect and segment both known and unknown objects in real-world scenes. However, existing methods do not assign semantically meaningful labels to unknown regions, and distinguishing and learning representations for unknown classes remains difficult. While open-vocabulary segmentation methods show promise in generalizing to novel classes, they require a fixed inference vocabulary and thus cannot be directly applied to anomaly segmentation where unknown classes are unconstrained. We propose Clipomaly, the first CLIP-based open-world and anomaly segmentation method for autonomous driving. Our zero-shot approach requires no anomaly-specific training data and leverages CLIP's shared image-text embedding space to both segment unknown objects and assign human-interpretable names to them. Unlike open-vocabulary methods, our model dynamically extends its vocabulary at inference time without retraining, enabling robust detection and naming of anomalies beyond common class definitions such as those in Cityscapes. Clipomaly achieves state-of-the-art performance on established anomaly segmentation benchmarks while providing interpretability and flexibility essential for practical deployment.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiyang Ma, Feng Zhou, Xuedan Yin, Pu Cao, Yonghao Dang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Leveraging pre-trained Diffusion Transformers (DiTs) for high-resolution (HR) image synthesis often leads to spatial layout collapse and degraded texture fidelity. Prior work mitigates these issues with complex pipelines that first perform a base-resolution (i.e., training-resolution) denoising process to guide HR generation. We instead explore the intrinsic generative mechanisms of DiTs and propose ResDiT, a training-free method that scales resolution efficiently. We identify the core factor governing spatial layout, position embeddings (PEs), and show that the original PEs encode incorrect positional information when extrapolated to HR, which triggers layout collapse. To address this, we introduce a PE scaling technique that rectifies positional encoding under resolution changes. To further remedy low-fidelity details, we develop a local-enhancement mechanism grounded in base-resolution local attention. We design a patch-level fusion module that aggregates global and local cues, together with a Gaussian-weighted splicing strategy that eliminates grid artifacts. Comprehensive evaluations demonstrate that ResDiT consistently delivers high-fidelity, high-resolution image synthesis and integrates seamlessly with downstream tasks, including spatially controlled generation.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xusen Hei, Jiali Chen, Jinyu Yang, Mengchen Zhao, Yi Cai
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
As multimodal large language models (MLLMs) frequently exhibit errors in complex video reasoning scenarios, correcting these errors is critical for uncovering their weaknesses and improving performance. However, existing benchmarks lack systematic evaluation of MLLMs' ability to identify and correct these video reasoning errors. To bridge this gap, we propose \textit{ViRectify}, a comprehensive benchmark to evaluate their fine-grained correction capability. Through an AI-assisted annotation pipeline with human verification, we construct a dataset of over 30\textit{K} instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains. In \textit{ViRectify}, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding. In addition, we further propose the trajectory evidence-driven correction framework, comprising step-wise error trajectory and reward modeling on visual evidence-grounded correction. It encourages the model to explicitly concentrate on error propagation and key timestamps for correction. Extensive evaluation across 16 advanced MLLMs demonstrates that our \textit{ViRectify} serves as a challenging testbed, where GPT-5 achieves only 31.94\% correction accuracy. Our framework enables a Qwen2.5-VL-7B to consistently outperform the variants of 72B on \textit{ViRectify}, showing the effectiveness of our approach. Further analysis uncovers systematic asymmetries in error correction across models, and our dataset is also a valuable data resource to perform reflection learning. We believe \textit{ViRectify} provides a new direction for comprehensively evaluating the advanced MLLMs in video reasoning.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Yongkun Du, Miaomiao Zhao, Songlin Fan, Zhineng Chen, Caiyan Jia et al.
Multimodal
ๆฅ็ๆ่ฆ
Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that provides a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, as well as whether the use of pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy, while maintaining fast inference with only three denoising steps. Code: https://github.com/Topdu/OpenOCR.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tushar Pranav, Eshan Pandey, Austria Lyka Diane Bala, Aman Chadha, Indriyati Atmosukarto et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) excel in multimodal tasks but often exhibit Western-centric biases, limiting their effectiveness in culturally diverse regions like Southeast Asia (SEA). To address this, we introduce RICE-VL, a novel benchmark evaluating VLM cultural understanding across 11 ASEAN countries. RICE-VL includes over 28,000 human-curated Visual Question Answering (VQA) samples -- covering True or False, Fill-in-the-Blank, and open-ended formats -- and 1,000 image-bounding box pairs for Visual Grounding, annotated by culturally informed experts across 14 sub-ground categories. We propose SEA-LAVE, an extension of the LAVE metric, assessing textual accuracy, cultural alignment, and country identification. Evaluations of six open- and closed-source VLMs reveal significant performance gaps in low-resource countries and abstract cultural domains. The Visual Grounding task tests models' ability to localize culturally significant elements in complex scenes, probing spatial and contextual accuracy. RICE-VL exposes limitations in VLMs' cultural comprehension and highlights the need for inclusive model development to better serve diverse global populations.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seungho Choi, Jeahun Sung, Jihyong Oh
ๆฅ็ๆ่ฆ
Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise "low-first, high-later" hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model's internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunze Liu, Zifan Wang, Peiran Wu, Jiayang Ao
Natural Language Processing
ๆฅ็ๆ่ฆ
Understanding dynamic 4D environments-3D space evolving over time-is critical for robotic and interactive systems. These applications demand systems that can process streaming point cloud video in real-time, often under resource constraints, while also benefiting from past and present observations when available. However, current 4D backbone networks rely heavily on spatiotemporal convolutions and Transformers, which are often computationally intensive and poorly suited to real-time applications. We propose PointNet4D, a lightweight 4D backbone optimized for both online and offline settings. At its core is a Hybrid Mamba-Transformer temporal fusion block, which integrates the efficient state-space modeling of Mamba and the bidirectional modeling power of Transformers. This enables PointNet4D to handle variable-length online sequences efficiently across different deployment scenarios. To enhance temporal understanding, we introduce 4DMAP, a frame-wise masked auto-regressive pretraining strategy that captures motion cues across frames. Our extensive evaluations across 9 tasks on 7 datasets, demonstrating consistent improvements across diverse domains. We further demonstrate PointNet4D's utility by building two robotic application systems: 4D Diffusion Policy and 4D Imitation Learning, achieving substantial gains on the RoboTwin and HandoverSim benchmarks.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang et al.
Robotics
ๆฅ็ๆ่ฆ
Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurring high computational costs. As a training-free alternative, inversion techniques can be used to map the source image into a latent space for manipulation. However, our empirical study reveals that standard inversion is sub-optimal for EIE, leading to poor quality and inefficiency. To tackle this challenge, we introduce \textbf{Reversible Inversion ({ReInversion})} for effective and efficient EIE. Specifically, ReInversion operates as a two-stage denoising process, which is first conditioned on the source image and subsequently on the reference. Besides, we introduce a Mask-Guided Selective Denoising (MSD) strategy to constrain edits to target regions, preserving the structural consistency of the background. Both qualitative and quantitative comparisons demonstrate that our ReInversion method achieves state-of-the-art EIE performance with the lowest computational overhead.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianyu Luan, Xuelu Feng, Zixin Zhu, Phani Nuney, Sheng Liu et al.
Computer Vision
ๆฅ็ๆ่ฆ
Textured high-fidelity 3D models are crucial for games, AR/VR, and film, but human-aligned evaluation methods still fall behind despite recent advances in 3D reconstruction and generation. Existing metrics, such as Chamfer Distance, often fail to align with how humans evaluate the fidelity of 3D shapes. Recent learning-based metrics attempt to improve this by relying on rendered images and 2D image quality metrics. However, these approaches face limitations due to incomplete structural coverage and sensitivity to viewpoint choices. Moreover, most methods are trained on synthetic distortions, which differ significantly from real-world distortions, resulting in a domain gap. To address these challenges, we propose a new fidelity evaluation method that is based directly on 3D meshes with texture, without relying on rendering. Our method, named Textured Geometry Evaluation TGE, jointly uses the geometry and color information to calculate the fidelity of the input textured mesh with comparison to a reference colored shape. To train and evaluate our metric, we design a human-annotated dataset with real-world distortions. Experiments show that TGE outperforms rendering-based and geometry-only methods on real-world distortion dataset.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Sheng Liu, Tianyu Luan, Phani Nuney, Xuelu Feng, Junsong Yuan
Natural Language Processing
ๆฅ็ๆ่ฆ
3D generation and reconstruction techniques have been widely used in computer games, film, and other content creation areas. As the application grows, there is a growing demand for 3D shapes that look truly realistic. Traditional evaluation methods rely on a ground truth to measure mesh fidelity. However, in many practical cases, a shape's realism does not depend on having a ground truth reference. In this work, we propose a Shape-Realism Alignment Metric that leverages a large language model (LLM) as a bridge between mesh shape information and realism evaluation. To achieve this, we adopt a mesh encoding approach that converts 3D shapes into the language token space. A dedicated realism decoder is designed to align the language model's output with human perception of realism. Additionally, we introduce a new dataset, RealismGrading, which provides human-annotated realism scores without the need for ground truth shapes. Our dataset includes shapes generated by 16 different algorithms on over a dozen objects, making it more representative of practical 3D shape distributions. We validate our metric's performance and generalizability through k-fold cross-validation across different objects. Experimental results show that our metric correlates well with human perceptions and outperforms existing methods, and has good generalizability.
๐
2025-12-01
๐ This is the author-accepted version of the paper p
๐ฅ Yunzhe Li, Jiajun Yan, Yuzhou Wei, Kechen Liu, Yize Zhao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Failing to be aware of speeding vehicles approaching from behind poses a huge threat to the road safety of pedestrians and cyclists. In this paper, we propose BlinkBud, which utilizes a single earbud and a paired phone to online detect hazardous objects approaching from behind of a user. The core idea is to accurately track visually identified objects utilizing a small number of sampled camera images taken from the earbud. To minimize the power consumption of the earbud and the phone while guaranteeing the best tracking accuracy, a novel 3D object tracking algorithm is devised, integrating both a Kalman filter based trajectory estimation scheme and an optimal image sampling strategy based on reinforcement learning. Moreover, the impact of constant user head movements on the tracking accuracy is significantly eliminated by leveraging the estimated pitch and yaw angles to correct the object depth estimation and align the camera coordinate system to the user's body coordinate system, respectively. We implement a prototype BlinkBud system and conduct extensive real-world experiments. Results show that BlinkBud is lightweight with ultra-low mean power consumptions of 29.8 mW and 702.6 mW on the earbud and smartphone, respectively, and can accurately detect hazards with a low average false positive ratio (FPR) and false negative ratio (FNR) of 4.90% and 1.47%, respectively.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ In-Jae Lee, Mungyeom Kim, Kwonyoung Ryu, Pierre Musacchio, Jaesik Park
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Unsupervised and open-vocabulary 3D object detection has recently gained attention, particularly in autonomous driving, where reducing annotation costs and recognizing unseen objects are critical for both safety and scalability. However, most existing approaches uniformly annotate 3D bounding boxes, ignore objects' physical states, and require multiple self-training iterations for annotation refinement, resulting in suboptimal quality and substantial computational overhead. To address these challenges, we propose OpenBox, a two-stage automatic annotation pipeline that leverages a 2D vision foundation model. In the first stage, OpenBox associates instance-level cues from 2D images processed by a vision foundation model with the corresponding 3D point clouds via cross-modal instance alignment. In the second stage, it categorizes instances by rigidity and motion state, then generates adaptive bounding boxes with class-specific size statistics. As a result, OpenBox produces high-quality 3D bounding box annotations without requiring self-training. Experiments on the Waymo Open Dataset, the Lyft Level 5 Perception dataset, and the nuScenes dataset demonstrate improved accuracy and efficiency over baselines.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sayantan Dey, Alireza Alaei, Partha Pratim Roy
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite considerable progress in handwritten text recognition, paragraph-level handwritten text recognition, especially in low-resource languages, such as Hindi, Urdu and similar scripts, remains a challenging problem. These languages, often lacking comprehensive linguistic resources, require special attention to develop robust systems for accurate optical character recognition (OCR). This paper introduces BharatOCR, a novel segmentation-free paragraph-level handwritten Hindi and Urdu text recognition. We propose a ViT-Transformer Decoder-LM architecture for handwritten text recognition, where a Vision Transformer (ViT) extracts visual features, a Transformer decoder generates text sequences, and a pre-trained language model (LM) refines the output to improve accuracy, fluency, and coherence. Our model utilizes a Data-efficient Image Transformer (DeiT) model proposed for masked image modeling in this research work. In addition, we adopt a RoBERTa architecture optimized for masked language modeling (MLM) to enhance the linguistic comprehension and generative capabilities of the proposed model. The transformer decoder generates text sequences from visual embeddings. This model is designed to iteratively process a paragraph image line by line, called implicit line segmentation. The proposed model was evaluated using our custom dataset ('Parimal Urdu') and ('Parimal Hindi'), introduced in this research work, as well as two public datasets. The proposed model achieved benchmark results in the NUST-UHWR, PUCIT-OUHL, and Parimal-Urdu datasets, achieving character recognition rates of 96.24%, 92.05%, and 94.80%, respectively. The model also provided benchmark results using the Hindi dataset achieving a character recognition rate of 80.64%. The results obtained from our proposed model indicated that it outperformed several state-of-the-art Urdu text recognition methods.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Ziang Yan et al.
Multimodal
ๆฅ็ๆ่ฆ
Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, conventional linear decoder in pixel MVM enforces the predictor output latent to be linearly projected to, thus separable in pixel space, causing the conflict with semantic abstraction. Our Stage 1 proposes a conditional diffusion decoder and injects reliable image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yingjie Zhou, Xilei Zhu, Siyu Ren, Ziyi Zhao, Ziwen Wang et al.
Multimodal
ๆฅ็ๆ่ฆ
Speech-driven Talking Human (TH) generation, commonly known as "Talker," currently faces limitations in multi-subject driving capabilities. Extending this paradigm to "Multi-Talker," capable of animating multiple subjects simultaneously, introduces richer interactivity and stronger immersion in audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) from 15 representative Multi-Talkers using 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. Furthermore, we introduce EvalTalker, a novel TH quality assessment framework. This framework possesses the ability to perceive global quality, human characteristics, and identity consistency, while integrating Qwen-Sync to perceive multimodal synchrony. Experimental results demonstrate that EvalTalker achieves superior correlation with subjective scores, providing a robust foundation for future research on high-quality Multi-Talker generation and evaluation.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yexin Liu, Wen-Jie Shu, Zile Huang, Haoze Zheng, Yueze Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. However, existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image (e.g., object addition, deletion, or modification), a shortcoming we term semantic negligence. In a pilot study, we find that applying a Gaussian blur to the input image improves semantic adherence. Analyzing attention maps, we observe clearer foreground-background separation. From an energy perspective, this corresponds to a lower-entropy cross-attention distribution. Motivated by this, we introduce AlignVid, a training-free framework with two components: (i) Attention Scaling Modulation (ASM), which directly reweights attention via lightweight Q or K scaling, and (ii) Guidance Scheduling (GS), which applies ASM selectively across transformer blocks and denoising steps to reduce visual quality degradation. This minimal intervention improves prompt adherence while limiting aesthetic degradation. In addition, we introduce OmitI2V to evaluate semantic negligence in TI2V generation, comprising 367 human-annotated samples that span addition, deletion, and modification scenarios. Extensive experiments demonstrate that AlignVid can enhance semantic fidelity.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ A S M Ahsanul Sarkar Akib, Raduana Khawla, Abdul Hasib
Machine Learning
ๆฅ็ๆ่ฆ
Stroke is a major cause of death and permanent impairment, making it a major worldwide health concern. For prompt intervention and successful preventative tactics, early risk assessment is essential. To address this challenge, we used ensemble modeling and explainable AI (XAI) techniques to create an interpretable machine learning framework for stroke risk prediction. A thorough evaluation of 10 different machine learning models using 5-fold cross-validation across several datasets was part of our all-inclusive strategy, which also included feature engineering and data pretreatment (using Random Over-Sampling (ROS) to solve class imbalance). Our optimized ensemble model (Random Forest + ExtraTrees + XGBoost) performed exceptionally well, obtaining a strong 99.09% accuracy on the Stroke Prediction Dataset (SPD). We improved the model's transparency and clinical applicability by identifying three important clinical variables using LIME-based interpretability analysis: age, hypertension, and glucose levels. Through early prediction, this study highlights how combining ensemble learning with explainable AI (XAI) can deliver highly accurate and interpretable stroke risk assessment. By enabling data-driven prevention and personalized clinical decisions, our framework has the potential to transform stroke prediction and cardiovascular risk management.
๐
2025-12-01
๐ ArXiv Preprint (cs.GR)
๐ฅ Hanzhi Guo, Dongdong Weng, Mo Su, Yixiao Chen, Xiaonuo Dongye et al.
ๆฅ็ๆ่ฆ
Topology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable precise 3D keypoint tracking. Project page: https://haza628.github.io/tagSplat/
๐
2025-12-01
๐ ArXiv Preprint (hep-ex)
๐ฅ Samuel Young, Kazuhiro Terao
Machine Learning
ๆฅ็ๆ่ฆ
Liquid argon time projection chambers (LArTPCs) provide dense, high-fidelity 3D measurements of particle interactions and underpin current and future neutrino and rare-event experiments. Physics reconstruction typically relies on complex detector-specific pipelines that use tens of hand-engineered pattern recognition algorithms or cascades of task-specific neural networks that require extensive, labeled simulation that requires a careful, time-consuming calibration process. We introduce \textbf{Panda}, a model that learns reusable sensor-level representations directly from raw unlabeled LArTPC data. Panda couples a hierarchical sparse 3D encoder with a multi-view, prototype-based self-distillation objective. On a simulated dataset, Panda substantially improves label efficiency and reconstruction quality, beating the previous state-of-the-art semantic segmentation model with 1,000$\times$ fewer labels. We also show that a single set-prediction head 1/20th the size of the backbone with no physical priors trained on frozen outputs from Panda can result in particle identification that is comparable with state-of-the-art (SOTA) reconstruction tools. Full fine-tuning further improves performance across all tasks.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Feiyang Xiao, Yichi Zhang, Xigui Li, Yuanye Zhou, Chen Jiang et al.
ๆฅ็ๆ่ฆ
The precise segmentation of intracranial aneurysms and their parent vessels (IA-Vessel) is a critical step for hemodynamic analyses, which mainly depends on computational fluid dynamics (CFD). However, current segmentation methods predominantly focus on image-based evaluation metrics, often neglecting their practical effectiveness in subsequent CFD applications. To address this deficiency, we present the Intracranial Aneurysm Vessel Segmentation (IAVS) dataset, the first comprehensive, multi-center collection comprising 641 3D MRA images with 587 annotations of aneurysms and IA-Vessels. In addition to image-mask pairs, IAVS dataset includes detailed hemodynamic analysis outcomes, addressing the limitations of existing datasets that neglect topological integrity and CFD applicability. To facilitate the development and evaluation of clinically relevant techniques, we construct two evaluation benchmarks including global localization of aneurysms (Stage I) and fine-grained segmentation of IA-Vessel (Stage II) and develop a simple and effective two-stage framework, which can be used as a out-of-the-box method and strong baseline. For comprehensive evaluation of applicability of segmentation results, we establish a standardized CFD applicability evaluation system that enables the automated and consistent conversion of segmentation masks into CFD models, offering an applicability-focused assessment of segmentation outcomes. The dataset, code, and model will be public available at https://github.com/AbsoluteResonance/IAVS.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ashish Vashist, Qiranul Saadiyean, Suresh Sundaram, Chandra Sekhar Seelamantula
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Foreign Object Debris (FOD) within aircraft fuel tanks presents critical safety hazards including fuel contamination, system malfunctions, and increased maintenance costs. Despite the severity of these risks, there is a notable lack of dedicated datasets for the complex, enclosed environments found inside fuel tanks. To bridge this gap, we present a novel dataset, FOD-S2R, composed of real and synthetic images of the FOD within a simulated aircraft fuel tank. Unlike existing datasets that focus on external or open-air environments, our dataset is the first to systematically evaluate the effectiveness of synthetic data in enhancing the real-world FOD detection performance in confined, closed structures. The real-world subset consists of 3,114 high-resolution HD images captured in a controlled fuel tank replica, while the synthetic subset includes 3,137 images generated using Unreal Engine. The dataset is composed of various Field of views (FOV), object distances, lighting conditions, color, and object size. Prior research has demonstrated that synthetic data can reduce reliance on extensive real-world annotations and improve the generalizability of vision models. Thus, we benchmark several state-of-the-art object detection models and demonstrate that introducing synthetic data improves the detection accuracy and generalization to real-world conditions. These experiments demonstrate the effectiveness of synthetic data in enhancing the model performance and narrowing the Sim2Real gap, providing a valuable foundation for developing automated FOD detection systems for aviation maintenance.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pei Yang, Yepeng Liu, Kelly Peng, Yuan Gao, Yiren Song
Natural Language Processing
ๆฅ็ๆ่ฆ
In the digital economy era, digital watermarking serves as a critical basis for ownership proof of massive replicable content, including AI-generated and other virtual assets. Designing robust watermarks capable of withstanding various attacks and processing operations is even more paramount. We introduce TokenPure, a novel Diffusion Transformer-based framework designed for effective and consistent watermark removal. TokenPure solves the trade-off between thorough watermark destruction and content consistency by leveraging token-based conditional reconstruction. It reframes the task as conditional generation, entirely bypassing the initial watermark-carrying noise. We achieve this by decomposing the watermarked image into two complementary token sets: visual tokens for texture and structural tokens for geometry. These tokens jointly condition the diffusion process, enabling the framework to synthesize watermark-free images with fine-grained consistency and structural integrity. Comprehensive experiments show that TokenPure achieves state-of-the-art watermark removal and reconstruction fidelity, substantially outperforming existing baselines in both perceptual quality and consistency.
๐
2025-12-01
๐ SIGIR 2025
๐ฅ Ning Han, Yawen Zeng, Shaohua Long, Chengqing Li, Sijie Yang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In recent years, significant developments have been made in both video retrieval and video moment retrieval tasks, which respectively retrieve complete videos or moments for a given text query. These advancements have greatly improved user satisfaction during the search process. However, previous work has failed to establish meaningful "interaction" between the retrieval system and the user, and its one-way retrieval paradigm can no longer fully meet the personalization and dynamic needs of at least 80.8\% of users. In this paper, we introduce the Interactive Video Corpus Retrieval (IVCR) task, a more realistic setting that enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. To facilitate research on this challenging task, we introduce IVCR-200K, a high-quality, bilingual, multi-turn, conversational, and abstract semantic dataset that supports video retrieval and even moment retrieval. Furthermore, we propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions. The extensive experiments demonstrate the effectiveness of our dataset and framework.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yanteng Zhang, Songheng Li, Zeyu Shen, Qizhen Lan, Lipei Zhang et al.
Computer Vision
ๆฅ็ๆ่ฆ
Large-scale brain imaging datasets provide unprecedented opportunities for developing domain foundation models through pretraining. However, unlike natural image datasets in computer vision, these neuroimaging data often exhibit high heterogeneity in quality, ranging from well-structured scans to severely distorted or incomplete brain volumes. This raises a fundamental question: can noise or low-quality scans contribute meaningfully to pretraining, or do they instead hinder model learning? In this study, we systematically explore the role of data quality level in pretraining and its impact on downstream tasks. Specifically, we perform pretraining on datasets with different quality levels and perform fine-tuning for brain age prediction on external cohorts. Our results show significant performance differences across quality levels, revealing both opportunities and limitations. We further discuss the gap between computer vision practices and clinical neuroimaging standards, emphasizing the necessity of domain-aware curation to ensure trusted and generalizable domain-specific foundation models.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongru Yan, Xiang Zhang, Zeyuan Chen, Fangyin Wei, Zhuowen Tu
ๆฅ็ๆ่ฆ
Branches swaying in the breeze, flags rippling in the wind, and boats rocking on the water all show how aerodynamics shape natural motion -- an effect crucial for realism in vision and graphics. In this paper, we present Gaussian Swaying, a surface-based framework for aerodynamic simulation using 3D Gaussians. Unlike mesh-based methods that require costly meshing, or particle-based approaches that rely on discrete positional data, Gaussian Swaying models surfaces continuously with 3D Gaussians, enabling efficient and fine-grained aerodynamic interaction. Our framework unifies simulation and rendering on the same representation: Gaussian patches, which support force computation for dynamics while simultaneously providing normals for lightweight shading. Comprehensive experiments on both synthetic and real-world datasets across multiple metrics demonstrate that Gaussian Swaying achieves state-of-the-art performance and efficiency, offering a scalable approach for realistic aerodynamic scene simulation.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jaewoo Song, Jooyoung Choi, Kanghyun Baek, Sangyub Lee, Daemin Park et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite recent text-to-image models achieving highfidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each to a designated region. To accurately render each segment within their regions while preserving overall image coherence, we introduce two attention masks - Text-Focus and Context-Expansion - applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multisentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thisara Rathnayaka, Uthayasanker Thayasivam
Natural Language Processing
ๆฅ็ๆ่ฆ
Temporal Action Localization (TAL) remains a fundamental challenge in video understanding, aiming to identify the start time, end time, and category of all action instances within untrimmed videos. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they often struggle with two persistent issues: the precise localization of actions with ambiguous or "fuzzy" temporal boundaries and the effective fusion of multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with an increased number of attention heads and an expanded Multi-Layer Perceptron (MLP) dimension for more powerful temporal feature extraction; (2) a cross-scale feature pyramid network (FPN) that integrates a top-down pathway with lateral connections, enabling richer fusion of high-level semantics and low-level temporal details; and (3) a novel boundary distribution regression head. Inspired by the principles of Generalized Focal Loss (GFL), this new head recasts the challenging task of boundary regression as a more flexible probability distribution learning problem, allowing the model to explicitly represent and reason about boundary uncertainty. Within the paradigm of Transformer-based architectures, TBT-Former advances the formidable benchmark set by its predecessors, establishing a new level of performance on the highly competitive THUMOS14 and EPIC-Kitchens 100 datasets, while remaining competitive on the large-scale ActivityNet-1.3. Our code is available at https://github.com/aaivu/In21-S7-CS4681-AML-Research-Projects/tree/main/projects/210536K-Multi-Modal-Learning_Video-Understanding
๐
2025-12-01
๐ SIGGRAPH
๐ฅ Xiaokun Pan, Zhenzhe Li, Zhichao Ye, Hongjia Zhai, Guofeng Zhang
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Real-time 3D reconstruction is a fundamental task in computer graphics. Recently, differentiable-rendering-based SLAM system has demonstrated significant potential, enabling photorealistic scene rendering through learnable scene representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Current differentiable rendering methods face dual challenges in real-time computation and sensor noise sensitivity, leading to degraded geometric fidelity in scene reconstruction and limited practicality. To address these challenges, we propose a novel real-time system EGG-Fusion, featuring robust sparse-to-dense camera tracking and a geometry-aware Gaussian surfel mapping module, introducing an information filter-based fusion method that explicitly accounts for sensor noise to achieve high-precision surface reconstruction. The proposed differentiable Gaussian surfel mapping effectively models multi-view consistent surfaces while enabling efficient parameter optimization. Extensive experimental results demonstrate that the proposed system achieves a surface reconstruction error of 0.6\textit{cm} on standardized benchmark datasets including Replica and ScanNet++, representing over 20\% improvement in accuracy compared to state-of-the-art (SOTA) GS-based methods. Notably, the system maintains real-time processing capabilities at 24 FPS, establishing it as one of the most accurate differentiable-rendering-based real-time reconstruction systems. Project Page: https://zju3dv.github.io/eggfusion/
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huynh Trinh Ngoc, Toan Nguyen Hai, Ba Luong Son, Long Tran Quoc
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Medical image segmentation is crucial for clinical diagnosis and treatment planning. Traditional methods typically produce a single segmentation mask, failing to capture inherent uncertainty. Recent generative models enable the creation of multiple plausible masks per image, mimicking the collaborative interpretation of several clinicians. However, these approaches remain computationally heavy. We propose MedSegLatDiff, a diffusion based framework that combines a variational autoencoder (VAE) with a latent diffusion model for efficient medical image segmentation. The VAE compresses the input into a low dimensional latent space, reducing noise and accelerating training, while the diffusion process operates directly in this compact representation. We further replace the conventional MSE loss with weighted cross entropy in the VAE mask reconstruction path to better preserve tiny structures such as small nodules. MedSegLatDiff is evaluated on ISIC-2018 (skin lesions), CVC-Clinic (polyps), and LIDC-IDRI (lung nodules). It achieves state of the art or highly competitive Dice and IoU scores while simultaneously generating diverse segmentation hypotheses and confidence maps. This provides enhanced interpretability and reliability compared to deterministic baselines, making the model particularly suitable for clinical deployment.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han
Natural Language Processing
ๆฅ็ๆ่ฆ
As large language models have grown larger, low-precision numerical formats such as NVFP4 have become increasingly popular due to the speed and memory benefits they provide. However, to accelerate computation with NVFP4, all matrix multiplication operands--weights and activations in the forward pass, and weights, activations, and gradients in the backward pass--must be quantized to NVFP4, often leading to divergence during training and performance degradation during inference. NVFP4 by evaluating multiple potential scale factors for each block of values. To address this issue, in this work we introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. Unlike integer formats, floating-point formats such as FP4 have the most quantization error on near-maximal values in each block, which we find to be primarily responsible for downstream performance degradation. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform, improving representation of near-maximal values. Importantly, 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, making it viable to use while training LLMs with NVFP4. In pre-training experiments with transformer and hybrid model architectures, we find that 4/6 prevents divergence in several cases, bringing training loss significantly closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy. We hope this inspires future work in training and deploying models with NVFP4.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty
Natural Language Processing
ๆฅ็ๆ่ฆ
Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we provide a practical recipe for selecting the best TTS strategy, considering problem difficulty, model type, and compute budget, providing a practical guide to effective inference-time scaling.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Minglai Yang, Xinyu Guo, Mihai Surdeanu, Liangming Pan
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a defined ontology through a "pre-train, then post-train" curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific relations can be inspected and controlled without interference from unrelated features. Empirical results demonstrate that AlignSAE enables precise causal interventions, such as reliable "concept swaps", by targeting single, semantically aligned slots.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sai Kolasani, Maxim Saplin, Nicholas Crispino, Kyle Montgomery, Jared Quincy Davis et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs) through extended agentic interaction in the domain of chess. We rank over 50 open and closed source models by playing against a random opponent using a range of behavioral metrics, including win and loss rates, move quality, move legality, hallucinated actions, and game duration. For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for comparisons between models in an easily understandable way. Despite the simplicity of the instruction-following task and the weakness of the opponent, many state-of-the-art models struggle to complete games or achieve consistent wins. Similar to other benchmarks on complex reasoning tasks, our experiments reveal a clear separation between reasoning and non-reasoning models. However, unlike existing static benchmarks, the stochastic and dynamic nature of LLM CHESS uniquely reduces overfitting and memorization while preventing benchmark saturation, proving difficult even for top reasoning models. To support future work on evaluating reasoning and instruction-following in LLMs, we release our experimental framework, a public leaderboard, and a dataset of associated games.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and ambiguity in real world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain of Ground CoG a training free multi step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of direct prediction the model progressively reflects and adjusts its hypotheses leading to more accurate and interpretable localization. Our approach achieves 68.4 accuracy on the ScreenSpot Pro benchmark an improvement of 4.8 points. To measure real world generalization we introduce TPanel UI a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI Chain of Ground improves over the strong baseline Qwen3 VL 235B by 6.9 points showing the effectiveness of multi step training free grounding across real world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The mechanism by which RL contributes to reasoning capabilities-whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors-remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while SFT is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: Models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating their reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on such complex strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization for complex reasoning tasks.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou et al.
ๆฅ็ๆ่ฆ
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics -- this fails to reflect user demands and limits the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotating and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Han Zhou, Xingchen Wan, Ivan Vuliฤ, Anna Korhonen
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lihu Chen, Xiang Yin, Francesca Toni
Natural Language Processing
ๆฅ็ๆ่ฆ
Understanding the internal thinking process of Large Language Models (LLMs) and the cause of hallucinations remains a key challenge. To this end, we introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike the current work of self-consistency and multi-agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference. We first present a model- and task-agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks. Empirical studies demonstrate that latent debate is a faithful structured surrogate model that has highly consistent predictions with the original LLM. Beyond interpretability, we demonstrate that latent debate provides a strong baseline for hallucination detection. Further analysis reveals strong correlations between hallucinations and debate patterns, such as a high degree of latent debates in the middle layers is linked to a higher risk of hallucinations. These findings position latent debate as a potential framework for understanding internal mechanisms of LLMs, especially for scenarios where internal (dis)agreements appear during the inference steps.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jinzheng Yu, Yang Xu, Haozhen Li, Junqi Li, Yifan Feng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises. While large language models have made automated report generation technically feasible, systematic research in this specific area remains notably absent, particularly lacking formal task definitions and corresponding benchmarks. To bridge this gap, we define the Automated Online Public Opinion Report Generation (OPOR-GEN) task and construct OPOR-BENCH, an event-centric dataset covering 463 crisis events with their corresponding news articles, social media posts, and a reference summary. To evaluate report quality, we propose OPOR-EVAL, a novel agent-based framework that simulates human expert evaluation by analyzing generated reports in context. Experiments with frontier models demonstrate that our framework achieves high correlation with human judgments. Our comprehensive task definition, benchmark dataset, and evaluation framework provide a solid foundation for future research in this critical domain.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Heloisa Candello, Muneeza Azmat, Uma Sushmitha Gunturi, Raya Horesh, Rogerio Abreu de Paula et al.
Machine Learning
ๆฅ็ๆ่ฆ
With the rapid uptake of generative AI, investigating human perceptions of generated responses has become crucial. A major challenge is their `aptitude' for hallucinating and generating harmful contents. Despite major efforts for implementing guardrails, human perceptions of these mitigation strategies are largely unknown. We conducted a mixed-method experiment for evaluating the responses of a mitigation strategy across multiple-dimensions: faithfulness, fairness, harm-removal capacity, and relevance. In a within-subject study design, 57 participants assessed the responses under two conditions: harmful response plus its mitigation and solely mitigated response. Results revealed that participants' native language, AI work experience, and annotation familiarity significantly influenced evaluations. Participants showed high sensitivity to linguistic and contextual attributes, penalizing minor grammar errors while rewarding preserved semantic contexts. This contrasts with how language is often treated in the quantitative evaluation of LLMs. We also introduced new metrics for training and evaluating mitigation strategies and insights for human-AI evaluation studies.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Adel Moumen, Guangzhi Sun, Philip C. Woodland
Natural Language Processing
ๆฅ็ๆ่ฆ
Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult. We present a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. We also release an EN-FR training dataset, TinyStories (~42k hours), together with EN-FR spoken StoryCloze and TopicCloze benchmarks for cross-lingual semantic evaluation, both synthetically generated using GPT-4. On 360M and 1B SLMs under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment. Taken together, these results indicate that cross-lingual interleaving is a simple, scalable route to building multilingual SLMs that understand and converse across languages. All resources will be made open-source to support reproducibility.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hrishikesh Terdalkar, Kirtan Bhojani, Aryan Dongare, Omm Aditya Behera
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly deployed in multilingual applications but often generate plausible yet incorrect or misleading outputs, known as hallucinations. While hallucination detection has been studied extensively in English, under-resourced Indian languages remain largely unexplored. We present BHRAM-IL, a benchmark for hallucination recognition and assessment in multiple Indian languages, covering Hindi, Gujarati, Marathi, Odia, along with English. The benchmark comprises 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. We evaluate 14 state-of-the-art multilingual LLMs on a benchmark subset of 10,265 questions, analyzing cross-lingual and factual hallucinations across languages, models, scales, categories, and domains using category-specific metrics normalized to (0,1) range. Aggregation over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, demonstrating the usefulness of BHRAM-IL for hallucination-focused evaluation. The dataset, and the code for generation and evaluation are available on GitHub (https://github.com/sambhashana/BHRAM-IL/) and HuggingFace (https://huggingface.co/datasets/sambhashana/BHRAM-IL/) to support future research in multilingual hallucination detection and mitigation.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jinghan Jia, Nathalie Baracaldo, Sijia Liu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu et al.
ๆฅ็ๆ่ฆ
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Zhengjia Wang, Danding Wang, Qiang Sheng, Jiaying Wu, Juan Cao
ๆฅ็ๆ่ฆ
This paper investigates the detection of misinformation, which deceives readers by explicitly fabricating misleading content or implicitly omitting important information necessary for informed judgment. While the former has been extensively studied, omission-based deception remains largely overlooked, even though it can subtly guide readers toward false conclusions under the illusion of completeness. To pioneer in this direction, this paper presents OmiGraph, the first omission-aware framework for misinformation detection. Specifically, OmiGraph constructs an omission-aware graph for the target news by utilizing a contextual environment that captures complementary perspectives of the same event, thereby surfacing potentially omitted contents. Based on this graph, omission-oriented relation modeling is then proposed to identify the internal contextual dependencies, as well as the dynamic omission intents, formulating a comprehensive omission relation representation. Finally, to extract omission patterns for detection, OmiGraph introduces omission-aware message-passing and aggregation that establishes holistic deception perception by integrating the omission contents and relations. Experiments show that, by considering the omission perspective, our approach attains remarkable performance, achieving average improvements of +5.4% F1 and +5.3% ACC on two large-scale benchmarks.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiannan Guan, Qiguang Chen, Libo Qin, Dengyun Peng, Jinhao Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to \textbf{reasoning overconfidence}: a tendency to express undue certainty in an incomplete solution set. To examine the effect, we introduce \textit{MuSoBench}, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the \textbf{cognitive-rigidity hypothesis}, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tim Wientzek
ๆฅ็ๆ่ฆ
This paper presents a fully self-supervised approach to borrowing detection in multilingual wordlists. The method combines two sources of information: PMI similarities based on a global correspondence model and a lightweight contrastive component trained on phonetic feature vectors. It further includes an automatic procedure for selecting decision thresholds without requiring labeled data. Experiments on benchmark datasets show that PMI alone already improves over existing string similarity measures such as NED and SCA, and that the combined similarity performs on par with or better than supervised baselines. An ablation study highlights the importance of character encoding, temperature settings and augmentation strategies. The approach scales to datasets of different sizes, works without manual supervision and is provided with a command-line tool that allows researchers to conduct their own studies.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Stefano Zeppieri
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel at generating coherent text within a single prompt but fall short in sustaining relevance, personalization, and continuity across extended interactions. Human communication, however, relies on multiple forms of memory, from recalling past conversations to adapting to personal traits and situational context. This paper introduces the Mixed Memory-Augmented Generation (MMAG) pattern, a framework that organizes memory for LLM-based agents into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory. Drawing inspiration from cognitive psychology, we map these layers to technical components and outline strategies for coordination, prioritization, and conflict resolution. We demonstrate the approach through its implementation in the Heero conversational agent, where encrypted long-term bios and conversational history already improve engagement and retention. We further discuss implementation concerns around storage, retrieval, privacy, and latency, and highlight open challenges. MMAG provides a foundation for building memory-rich language agents that are more coherent, proactive, and aligned with human needs.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daeun Lee, Subhojyoti Mukherjee, Branislav Kveton, Ryan A. Rossi, Viet Dac Lai et al.
ๆฅ็ๆ่ฆ
Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dengyun Peng, Qiguang Chen, Bofei Liu, Jiannan Guan, Libo Qin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Ensuring LLM reliability requires not only solving complex problems but also recognizing when a problem is unsolvable. Current models often struggle to distinguish objective unsolvability (inherent contradictions in the problem) from subjective capability limitations (problems beyond the model's competence), which leads to hallucinations and overconfidence. To address this, we propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability. Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology: programmatic generation for logic puzzles and a novel "Reverse Construction" method that injects contradictions into valid reasoning chains for mathematics. Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty. Empirical results show that our approach achieves near-perfect unsolvability detection while also improving accuracy on solvable tasks. Crucially, we identify Capability Collapse, demonstrating that explicit exposure to unsolvable data is indispensable for preventing models from becoming systematically overconfident. Our code and data are available at https://github.com/sfasfaffa/unsolvableQA.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Valentin Noรซl, Elimane Yassine Seidou, Charly Ken Capo-Chichi, Ghanem Amari
Robotics
ๆฅ็ๆ่ฆ
Legal AI systems powered by retrieval-augmented generation (RAG) face a critical accountability challenge: when an AI assistant cites case law, statutes, or contractual clauses, practitioners need verifiable guarantees that generated text faithfully represents source documents. Existing hallucination detectors rely on semantic similarity metrics that tolerate entity substitutions, a dangerous failure mode when confusing parties, dates, or legal provisions can have material consequences. We introduce HalluGraph, a graph-theoretic framework that quantifies hallucinations through structural alignment between knowledge graphs extracted from context, query, and response. Our approach produces bounded, interpretable metrics decomposed into \textit{Entity Grounding} (EG), measuring whether entities in the response appear in source documents, and \textit{Relation Preservation} (RP), verifying that asserted relationships are supported by context. On structured control documents, HalluGraph achieves near-perfect discrimination ($>$400 words, $>$20 entities), HalluGraph achieves $AUC = 0.979$, while maintaining robust performance ($AUC \approx 0.89$) on challenging generative legal task, consistently outperforming semantic similarity baselines. The framework provides the transparency and traceability required for high-stakes legal applications, enabling full audit trails from generated assertions back to source passages.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuezhang Peng, Chonghao Cai, Ziang Liu, Shuai Fan, Sheng Jiang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Spoken Language Understanding (SLU), which aims to extract user semantics to execute downstream tasks, is a crucial component of task-oriented dialog systems. Existing SLU datasets generally lack sufficient diversity and complexity, and there is an absence of a unified benchmark for the latest Large Language Models (LLMs) and Large Audio Language Models (LALMs). This work introduces MAC-SLU, a novel Multi-Intent Automotive Cabin Spoken Language Understanding Dataset, which increases the difficulty of the SLU task by incorporating authentic and complex multi-intent data. Based on MAC-SLU, we conducted a comprehensive benchmark of leading open-source LLMs and LALMs, covering methods like in-context learning, supervised fine-tuning (SFT), and end-to-end (E2E) and pipeline paradigms. Our experiments show that while LLMs and LALMs have the potential to complete SLU tasks through in-context learning, their performance still lags significantly behind SFT. Meanwhile, E2E LALMs demonstrate performance comparable to pipeline approaches and effectively avoid error propagation from speech recognition. Code\footnote{https://github.com/Gatsby-web/MAC\_SLU} and datasets\footnote{huggingface.co/datasets/Gatsby1984/MAC\_SLU} are released publicly.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Edward Ajayi, Eudoxie Umwari, Mawuli Deku, Prosper Singadi, Jules Udahemuka et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
This study examines the digital representation of African languages and the challenges this presents for current language detection tools. We evaluate their performance on Yoruba, Kinyarwanda, and Amharic. While these languages are spoken by millions, their online usage on conversational platforms is often sparse, heavily influenced by English, and not representative of the authentic, monolingual conversations prevalent among native speakers. This lack of readily available authentic data online creates a challenge of scarcity of conversational data for training language models. To investigate this, data was collected from subreddits and local news sources for each language. The analysis showed a stark contrast between the two sources. Reddit data was minimal and characterized by heavy code-switching. Conversely, local news media offered a robust source of clean, monolingual language data, which also prompted more user engagement in the local language on the news publishers social media pages. Language detection models, including the specialized AfroLID and a general LLM, performed with near-perfect accuracy on the clean news data but struggled with the code-switched Reddit posts. The study concludes that professionally curated news content is a more reliable and effective source for training context-rich AI models for African languages than data from conversational platforms. It also highlights the need for future models that can process clean and code-switched text to improve the detection accuracy for African languages.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) often generate unreliable answers, while heuristic uncertainty methods fail to fully distinguish correct from incorrect predictions, causing users to accept erroneous answers without statistical guarantees. We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level. To achieve this in a principled way, we propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint over selection and error indicators. Then, we establish a finite-sample sufficient condition, which relies only on a held-out set of exchangeable calibration samples, to compute an FDR-constrained, coverage-maximizing threshold. Furthermore, we extend LEC to a two-model routing mechanism: given a prompt, if the current model's uncertainty exceeds its calibrated threshold, we delegate it to a stronger model, while maintaining a unified FDR guarantee. Evaluations on closed-ended and open-ended question-answering (QA) datasets show that LEC achieves tighter FDR control and substantially improves sample retention over prior methods. Moreover, the two-model routing mechanism achieves lower risk levels while accepting more correct samples than each individual model.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Yuma Ichikawa, Yudai Fujimoto, Akira Sakai
ๆฅ็ๆ่ฆ
Post-training quantization (PTQ) aims to preserve model-level behavior; however, most methods focus on individual linear layers. Even recent extensions, such as QEP and LoaQ, which mitigate error propagation or target specific submodules, still rely on layer-wise formulations and fail to capture the behavior of larger submodules. We introduce Layer-Projected Coordinate Descent (LPCD), a unified framework that extends PTQ beyond layers by optimizing relaxed objectives across arbitrary submodules and projecting the solutions with layer-wise quantizers. LPCD generalizes existing methods and provides a principled approach to quantizing complex submodules while maintaining the efficiency and compatibility of layer-wise PTQ pipelines. Across diverse LLM architectures and bit-widths, LPCD-based submodule quantization consistently enhances both layer-wise PTQ methods and existing submodule approaches.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Keqi Deng et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT) tasks. However, current research is constrained by two key challenges: language coverage and efficiency. Most of the popular S2TT datasets are substantially English-centric, which restricts the scaling-up of MLLMs' many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when the speech is converted into long sequences (e.g., 750 tokens). To address these limitations, we propose a Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language scaling method that leverages curriculum learning and a data balancing strategy is introduced to extend the language coverage supported by MLLMs to 70 languages and achieve mutual translation among these languages. Second, an optimized speech adapter module is designed to reduce the length of the speech sequence to only 30 tokens. Extensive experiments were conducted on MLLMs of different scales (9B and 27B). The experimental results demonstrate that MCAT not only surpasses state-of-the-art end-to-end models on the FLEURS dataset across 70x69 directions but also enhances batch inference efficiency. This is achieved with only ~100M trainable parameters and by using only 10 hours of S2TT data per language. Furthermore, we have released MCAT as open-source to promote the development of MLLMs for robust S2TT capabilities. The code and models are released at https://github.com/yxduir/m2m-70.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jozef Kubรญk, Marek ล uppa, Martin Takรกฤ
Natural Language Processing
ๆฅ็ๆ่ฆ
Limited data for low-resource languages typically yield weaker language models (LMs). Since pre-training is compute-intensive, it is more pragmatic to target improvements during fine-tuning. In this work, we examine the use of Active Learning (AL) methods augmented by structured data selection strategies which we term 'Active Learning schedulers', to boost the fine-tuning process with a limited amount of training data. We connect the AL to data clustering and propose an integrated fine-tuning pipeline that systematically combines AL, clustering, and dynamic data selection schedulers to enhance model's performance. Experiments in the Slovak, Maltese, Icelandic and Turkish languages show that the use of clustering during the fine-tuning phase together with AL scheduling can simultaneously produce annotation savings up to 30% and performance improvements up to four F1 score points, while also providing better fine-tuning stability.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length -- no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility which is the linear combination of the expected maximum reward, total compute, and latency of set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Xabier de Zuazo, Ibon Saratxaga, Eva Navas
ๆฅ็ๆ่ฆ
We present Conformer-based decoders for the LibriBrain 2025 PNPL competition, targeting two foundational MEG tasks: Speech Detection and Phoneme Classification. Our approach adapts a compact Conformer to raw 306-channel MEG signals, with a lightweight convolutional projection layer and task-specific heads. For Speech Detection, a MEG-oriented SpecAugment provided a first exploration of MEG-specific augmentation. For Phoneme Classification, we used inverse-square-root class weighting and a dynamic grouping loader to handle 100-sample averaged examples. In addition, a simple instance-level normalization proved critical to mitigate distribution shifts on the holdout split. Using the official Standard track splits and F1-macro for model selection, our best systems achieved 88.9% (Speech) and 65.8% (Phoneme) on the leaderboard, surpassing the competition baselines and ranking within the top-10 in both tasks. For further implementation details, the technical documentation, source code, and checkpoints are available at https://github.com/neural2speech/libribrain-experiments.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bharatdeep Hazarika, Arya Suneesh, Prasanna Devadiga, Pawan Kumar Rajpoot, Anshuman B Suresh et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
India's linguistic diversity presents both opportunities and challenges for fintech platforms. While the country has 31 major languages and over 100 minor ones, only 10\% of the population understands English, creating barriers to financial inclusion. We present a multilingual conversational AI system for a financial assistance use case that supports code-mixed languages like Hinglish, enabling natural interactions for India's diverse user base. Our system employs a multi-agent architecture with language classification, function management, and multilingual response generation. Through comparative analysis of multiple language models and real-world deployment, we demonstrate significant improvements in user engagement while maintaining low latency overhead (4-8\%). This work contributes to bridging the language gap in digital financial services for emerging markets.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yaxuan Wang, Quan Liu, Zhenting Wang, Zichao Li, Wei Wei et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) underpin applications in code generation, mathematical reasoning, and agent-based workflows. In practice, systems access LLMs via commercial APIs or open-source deployments, and the model landscape (e.g., GPT, Claude, Llama) evolves rapidly. This rapid evolution forces frequent model switches driven by capability, cost, deployment constraints, and privacy. Yet prompts are highly model-sensitive: reusing a prompt engineered for one model on another often yields substantially worse performance than a prompt optimized for the target model. We term this phenomenon Model Drifting. Through extensive empirical analysis across diverse LLM configurations, we show that model drifting is both common and severe. To address this challenge, we introduce PromptBridge, a training-free framework that preserves prompt effectiveness under model switches, enabling cross-model prompt transfer without costly per-task or per-model re-optimization. PromptBridge requires only a small set of alignment tasks for calibration. It first applies Model-Adaptive Reflective Prompt Evolution (MAP-RPE) to obtain task- and model-specific optimal prompts via iterative reflective refinement and quantitative evaluation. Using the resulting calibrated prompt pairs for the source and target models, PromptBridge learns a cross-model prompt mapping. At test time, i.e., for an unseen task, given a source-model prompt, this mapping directly produces an optimized prompt for the target model. Experiments in single-agent and multi-agent settings show that PromptBridge consistently improves downstream accuracy while reducing migration effort. The code will be available soon.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ruohan Zhou, Jiachen Yuan, Churui Yang, Wenzheng Huang, Guoyan Zhang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Understanding sentiment in complex textual expressions remains a fundamental challenge in affective computing. To address this, we propose a Dynamic Fusion Learning Model (DyFuLM), a multimodal framework designed to capture both hierarchical semantic representations and fine-grained emotional nuances. DyFuLM introduces two key moodules: a Hierarchical Dynamic Fusion module that adaptively integrates multi-level features, and a Gated Feature Aggregation module that regulates cross-layer information ffow to achieve balanced representation learning. Comprehensive experiments on multi-task sentiment datasets demonstrate that DyFuLM achieves 82.64% coarse-grained and 68.48% fine-grained accuracy, yielding the lowest regression errors (MAE = 0.0674, MSE = 0.0082) and the highest R^2 coefficient of determination (R^2= 0.6903). Furthermore, the ablation study validates the effectiveness of each module in DyFuLM. When all modules are removed, the accuracy drops by 0.91% for coarse-grained and 0.68% for fine-grained tasks. Keeping only the gated fusion module causes decreases of 0.75% and 0.55%, while removing the dynamic loss mechanism results in drops of 0.78% and 0.26% for coarse-grained and fine-grained sentiment classification, respectively. These results demonstrate that each module contributes significantly to feature interaction and task balance. Overall, the experimental findings further validate that DyFuLM enhances sentiment representation and overall performance through effective hierarchical feature fusion.
๐
2025-12-01
๐ ArXiv Preprint (cs.SE)
๐ฅ Zhiqing Zhong, Jiaming Huang, Pinjia He
ๆฅ็ๆ่ฆ
Many modern software projects evolve rapidly to incorporate new features and security patches. It is important for users to update their dependencies to safer versions, but many still use older, vulnerable package versions because upgrading can be difficult and may break their existing codebase. Software developers can mitigate this problem by backporting security patches to older releases. However, manually backporting is time-consuming and error-prone. The effectiveness of existing automated backporting techniques on general software remains unclear since they typically target only code-hunk or function-level patch porting scenarios and are evaluated with imperfect metrics.
To facilitate the development and evaluation of automated backporting techniques, we introduce BackportBench, the first comprehensive benchmark suite for patch backporting problem. BackportBench is a multilingual benchmark that contains 202 patch backporting problems from PyPI, Maven, and npm, each with executable Docker environments and relevant test cases. We evaluated existing patch porting methods and LLM-based techniques that have the potential to adapt to this task using BackportBench. The results show that the agentic method has outperformed traditional patch porting methods, especially on cases that require logical and structural changes. However, the performance varies across different programming languages. Based on the findings, we draw several implications for researchers and software practitioners in future work on automated backporting.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Md. Rafiul Biswas, Firoj Alam, Wajdi Zaghouani
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
MARSAD is a multifunctional natural language processing (NLP) platform designed for real-time social media monitoring and analysis, with a particular focus on the Arabic-speaking world. It enables researchers and non-technical users alike to examine both live and archived social media content, producing detailed visualizations and reports across various dimensions, including sentiment analysis, emotion analysis, propaganda detection, fact-checking, and hate speech detection. The platform also provides secure data-scraping capabilities through API keys for accessing public social media data. MARSAD's backend architecture integrates flexible document storage with structured data management, ensuring efficient processing of large and multimodal datasets. Its user-friendly frontend supports seamless data upload and interaction.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhongjie Jiang
ๆฅ็ๆ่ฆ
Although synthetic data is widely promoted as a remedy, its prevailing production paradigm -- one optimizing for statistical smoothness -- systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse.
This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators.
The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33.
Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
๐
2025-12-01
๐ KDD
๐ฅ Xinyun Zhou, Xinfeng Li, Yinan Peng, Ming Xu, Xuanwang Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) systems are increasingly central to robust AI, enhancing large language model (LLM) faithfulness by incorporating external knowledge. However, our study unveils a critical, overlooked vulnerability: their profound susceptibility to subtle symbolic perturbations, particularly through near-imperceptible emoticon tokens such as "(@_@)" that can catastrophically mislead retrieval, termed EmoRAG. We demonstrate that injecting a single emoticon into a query makes it nearly 100% likely to retrieve semantically unrelated texts that contain a matching emoticon. Our extensive experiment across general question-answering and code domains, using a range of state-of-the-art retrievers and generators, reveals three key findings: (I) Single-Emoticon Disaster: Minimal emoticon injections cause maximal disruptions, with a single emoticon almost 100% dominating RAG output. (II) Positional Sensitivity: Placing an emoticon at the beginning of a query can cause severe perturbation, with F1-Scores exceeding 0.92 across all datasets. (III) Parameter-Scale Vulnerability: Counterintuitively, models with larger parameters exhibit greater vulnerability to the interference. We provide an in-depth analysis to uncover the underlying mechanisms of these phenomena. Furthermore, we raise a critical concern regarding the robustness assumption of current RAG systems, envisioning a threat scenario where an adversary exploits this vulnerability to manipulate the RAG system. We evaluate standard defenses and find them insufficient against EmoRAG. To address this, we propose targeted defenses, analyzing their strengths and limitations in mitigating emoticon-based perturbations. Finally, we outline future directions for building robust RAG systems.
๐
2025-12-01
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Omar Farooq Khan Suri, John McCrae
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly being deployed in real-world applications, but their flexibility exposes them to prompt injection attacks. These attacks leverage the model's instruction-following ability to make it perform malicious tasks. Recent work has proposed JATMO, a task-specific fine-tuning approach that trains non-instruction-tuned base models to perform a single function, thereby reducing susceptibility to adversarial instructions. In this study, we evaluate the robustness of JATMO against HOUYI, a genetic attack framework that systematically mutates and optimizes adversarial prompts. We adapt HOUYI by introducing custom fitness scoring, modified mutation logic, and a new harness for local model testing, enabling a more accurate assessment of defense effectiveness. We fine-tuned LLaMA 2-7B, Qwen1.5-4B, and Qwen1.5-0.5B models under the JATMO methodology and compared them with a fine-tuned GPT-3.5-Turbo baseline. Results show that while JATMO reduces attack success rates relative to instruction-tuned models, it does not fully prevent injections; adversaries exploiting multilingual cues or code-related disruptors still bypass defenses. We also observe a trade-off between generation quality and injection vulnerability, suggesting that better task performance often correlates with increased susceptibility. Our results highlight both the promise and limitations of fine-tuning-based defenses and point toward the need for layered, adversarially informed mitigation strategies.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Koki Natsumi, Hiroyuki Deguchi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
ๆฅ็ๆ่ฆ
Minimum Bayes risk (MBR) decoding generates high-quality translations by maximizing the expected utility of output candidates, but it evaluates all pairwise scores over the candidate set; hence, it takes quadratic time with respect to the number of candidates. To reduce the number of utility function calls, probabilistic MBR (PMBR) decoding partially evaluates quality scores using sampled pairs of candidates and completes the missing scores with a matrix completion algorithm. Nevertheless, it degrades the translation quality as the number of utility function calls is reduced. Therefore, to improve the trade-off between quality and cost, we propose agreement-constrained PMBR (AC-PMBR) decoding, which leverages a knowledge distilled model to guide the completion of the score matrix. Our AC-PMBR decoding improved approximation errors of matrix completion by up to 3 times and achieved higher translation quality compared with PMBR decoding at a comparable computational cost on the WMT'23 En$\leftrightarrow$De translation tasks.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiahao Yuan, Zhiqing Cui, Hanqing Wang, Yuansheng Gao, Yucheng Zhou et al.
Machine Learning
ๆฅ็ๆ่ฆ
As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop pipeline with iterative rubric-guided refinement to ensure psychological plausibility and persona consistency. This progressive empathy pipeline that integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms othet methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at https://github.com/JhCircle/Kardia-R1.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zehua Zhao, Zhixian Huang, Junren Li, Siyu Lin, Junting Zhou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The dataset of the benchmark is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ekha Morang, Surhoni A. Ngullie, Sashienla Longkumer, Teisovi Angami
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The Nagamese language, a.k.a Naga Pidgin, is an Assamese-lexified creole language developed primarily as a means of communication in trade between the people from Nagaland and people from Assam in the north-east India. Substantial amount of work in sentiment analysis has been done for resource-rich languages like English, Hindi, etc. However, no work has been done in Nagamese language. To the best of our knowledge, this is the first attempt on sentiment analysis and emotion classification for the Nagamese Language. The aim of this work is to detect sentiments in terms of polarity (positive, negative and neutral) and basic emotions contained in textual content of Nagamese language. We build sentiment polarity lexicon of 1,195 nagamese words and use these to build features along with additional features for supervised machine learning techniques using Na"ive Bayes and Support Vector Machines.
Keywords: Nagamese, NLP, sentiment analysis, machine learning
๐
2025-12-01
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Qingyuan Fei, Xin Liu, Song Li, Shujiang Wu, Jianwei Hou et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Researchers have proposed numerous methods to detect vulnerabilities in JavaScript, especially those assisted by Large Language Models (LLMs). However, the actual capability of LLMs in JavaScript vulnerability detection remains questionable, necessitating systematic evaluation and comprehensive benchmarks. Unfortunately, existing benchmarks suffer from three critical limitations: (1) incomplete coverage, such as covering a limited subset of CWE types; (2) underestimation of LLM capabilities caused by unreasonable ground truth labeling; and (3) overestimation due to unrealistic cases such as using isolated vulnerable files rather than complete projects.
In this paper, we introduce, for the first time, three principles for constructing a benchmark for JavaScript vulnerability detection that directly address these limitations: (1) comprehensiveness, (2) no underestimation, and (3) no overestimation. Guided by these principles, we propose FORGEJS, the first automatic benchmark generation framework for evaluating LLMs' capability in JavaScript vulnerability detection. Then, we use FORGEJS to construct ARENAJS-the first systematic benchmark for LLM-based JavaScript vulnerability detection-and further propose JUDGEJS, an automatic evaluation framework.
We conduct the first systematic evaluation of LLMs for JavaScript vulnerability detection, leveraging JUDGEJS to assess seven popular commercial LLMs on ARENAJS. The results show that LLMs not only exhibit limited reasoning capabilities, but also suffer from severe robustness defects, indicating that reliable JavaScript vulnerability detection with LLMs remains an open challenge.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alper Yฤฑldฤฑrฤฑm, ฤฐbrahim Yรผcedaฤ
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Standard Transformers suffer from a "Semantic Alignment Tax", a prohibitive optimization cost required to organize a chaotic initialization into a coherent geometric map via local gradient diffusion. We hypothesize that this reliance on diffusive learning creates "Catastrophic Rigidity", rendering models unable to adapt to novel concepts without destroying their pre-trained reasoning capabilities. To isolate this phenomenon, we introduce Iterative Semantic Map Refinement (ISMR), a diagnostic protocol revealing that alignment is a fixed geometric barrier that scaling cannot solve; a 20-layer model overcomes this barrier no faster than a 1-layer model. We introduce the Phase-Resonant Intelligent Spectral Model (PRISM). PRISM encodes semantic identity as resonant frequencies in the complex domain (C^d) and replaces quadratic self-attention with linearithmic O(N log N) Gated Harmonic Convolutions. We validate PRISM on the WMT14 translation task. While the Standard Transformer maintains a slight edge in general competence on static benchmarks (23.88 vs 21.40 BLEU), it fails the "Plasticity-Stability" stress test completely. When injected with novel concepts, the Transformer suffers Catastrophic Forgetting, degrading by -10.55 BLEU points while achieving only 60% acquisition. In contrast, PRISM demonstrates Lossless Plasticity, achieving 96% 5-shot acquisition with negligible degradation (-0.84 BLEU). These results suggest that harmonic representations effectively decouple memory from reasoning, offering a structural solution to the plasticity-stability dilemma in real-time knowledge adaptation.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiatong Han
Natural Language Processing
ๆฅ็ๆ่ฆ
Traditional Chinese Medicine (TCM) theory is built on imagistic thinking, in which medical principles and diagnostic and therapeutic logic are structured through metaphor and metonymy. However, existing English translations largely rely on literal rendering, making it difficult for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. This study adopted a human-in-the-loop (HITL) framework and selected four passages from the medical canon Huangdi Neijing that are fundamental in theory. Through prompt-based cognitive scaffolding, DeepSeek V3.1 was guided to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation stage, ChatGPT 5 Pro and Gemini 2.5 Pro were instructed by prompts to simulate three types of real-world readers. Human translations, baseline model translations, and prompt-adjusted translations were scored by the simulated readers across five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis (IPA). Results show that the prompt-adjusted LLM translations perform best across all five dimensions, with high cross-model and cross-role consistency. The interview themes reveal differences between human and machine translation, effective strategies for metaphor and metonymy transfer, and readers' cognitive preferences. This study provides a cognitive, efficient, and replicable HITL methodological pathway for the translation of ancient, concept-dense texts such as TCM.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Krithik Vishwanath, Mrigayu Ghosh, Anton Alyakin, Daniel Alexander Alber, Yindalon Aphinyanaphongs et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yongxin Zhou, Philippe Mulhem, Didier Schwab
ๆฅ็ๆ่ฆ
The evaluation of Retrieval-Augmented Generation (RAG) systems typically examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction. This work presents a systematic investigation of how text perturbations (simulating noisy retrieval) interact with temperature settings across multiple LLM runs. We propose a comprehensive RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings. Through extensive experiments on HotpotQA with both open-source and proprietary LLMs, we demonstrate that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work yields three key contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Hyunjun Kim, Sooyoung Ryu
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, while action-level inspection allows stakeholders to audit agent behavior. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism through multi-turn feedback that enables human control over agent refinement. Evaluating four state-of-the-art LLMs (Claude-4 Sonnet, GPT-4.1, GPT-4.1-mini, Gemini-2.5 Flash) across 1,000 tests, we establish both capabilities and limitations: models achieved 92.8% perfect performance with structured external feedback driving significant improvements (average +3.2%, up to +32.8% for complex scenes), but systematic error patterns emerged in tool state management and long-horizon planning. Notably, specification clarity proved more important than task complexity -- models achieved 100% perfect performance when given explicit, verifiable criteria. These findings demonstrate that transparent evaluation frameworks can establish trust in agentic systems, with external oversight proving more reliable than self-correction for guiding agent behavior. Our open-source framework provides a template for trustworthy agent assessment. Code and data: https://github.com/hyunjun1121/DrawingBench
๐
2025-12-01
๐ AAAI 2026
๐ฅ Jianlei Chang, Ruofeng Mei, Wei Ke, Xiangyu Xu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Generative modeling has recently shown remarkable promise for visuomotor policy learning, enabling flexible and expressive control across diverse embodied AI tasks. However, existing generative policies often struggle with data inefficiency, requiring large-scale demonstrations, and sampling inefficiency, incurring slow action generation during inference. We introduce EfficientFlow, a unified framework for efficient embodied AI with flow-based policy learning. To enhance data efficiency, we bring equivariance into flow matching. We theoretically prove that when using an isotropic Gaussian prior and an equivariant velocity prediction network, the resulting action distribution remains equivariant, leading to improved generalization and substantially reduced data demands. To accelerate sampling, we propose a novel acceleration regularization strategy. As direct computation of acceleration is intractable for marginal flow trajectories, we derive a novel surrogate loss that enables stable and scalable training using only conditional trajectories. Across a wide range of robotic manipulation benchmarks, the proposed algorithm achieves competitive or superior performance under limited data while offering dramatically faster inference. These results highlight EfficientFlow as a powerful and efficient paradigm for high-performance embodied AI.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sebastian Sanokowski, Kaustubh Patil, Alois Knoll
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Diffusion models have achieved remarkable success in data-driven learning and in sampling from complex, unnormalized target distributions. Building on this progress, we reinterpret Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem. We tackle this problem by minimizing the reverse Kullback-Leibler (KL) divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. By applying the policy gradient theorem to this objective, we derive a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. This leads to simple diffusion-based variants of Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and Wasserstein Policy Optimization (WPO), termed DiffSAC, DiffPPO and DiffWPO. All of these methods require only minor implementation changes to their base algorithm. We find that on standard continuous control benchmarks, DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Shaowei Liu, David Yifan Yao, Saurabh Gupta, Shenlong Wang
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Today, people can easily record memorable moments, ranging from concerts, sports events, lectures, family gatherings, and birthday parties with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving an median synchronization error below 50 ms.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J. Zico Kolter et al.
Machine Learning
ๆฅ็ๆ่ฆ
MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its ``fastforward'' nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our $\textbf{improved MeanFlow}$ ($\textbf{iMF}$) method, trained entirely from scratch, achieves $\textbf{1.72}$ FID with a single function evaluation (1-NFE) on ImageNet 256$\times$256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han
Natural Language Processing
ๆฅ็ๆ่ฆ
As large language models have grown larger, low-precision numerical formats such as NVFP4 have become increasingly popular due to the speed and memory benefits they provide. However, to accelerate computation with NVFP4, all matrix multiplication operands--weights and activations in the forward pass, and weights, activations, and gradients in the backward pass--must be quantized to NVFP4, often leading to divergence during training and performance degradation during inference. NVFP4 by evaluating multiple potential scale factors for each block of values. To address this issue, in this work we introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. Unlike integer formats, floating-point formats such as FP4 have the most quantization error on near-maximal values in each block, which we find to be primarily responsible for downstream performance degradation. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform, improving representation of near-maximal values. Importantly, 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, making it viable to use while training LLMs with NVFP4. In pre-training experiments with transformer and hybrid model architectures, we find that 4/6 prevents divergence in several cases, bringing training loss significantly closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy. We hope this inspires future work in training and deploying models with NVFP4.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Minglai Yang, Xinyu Guo, Mihai Surdeanu, Liangming Pan
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a defined ontology through a "pre-train, then post-train" curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific relations can be inspected and controlled without interference from unrelated features. Empirical results demonstrate that AlignSAE enables precise causal interventions, such as reliable "concept swaps", by targeting single, semantically aligned slots.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Massively parallel simulation has reduced reinforcement learning (RL) training time for robots from days to minutes. However, achieving fast and reliable sim-to-real RL for humanoid control remains difficult due to the challenges introduced by factors such as high dimensionality and domain randomization. In this work, we introduce a simple and practical recipe based on off-policy RL algorithms, i.e., FastSAC and FastTD3, that enables rapid training of humanoid locomotion policies in just 15 minutes with a single RTX 4090 GPU. Our simple recipe stabilizes off-policy RL algorithms at massive scale with thousands of parallel environments through carefully tuned design choices and minimalist reward functions. We demonstrate rapid end-to-end learning of humanoid locomotion controllers on Unitree G1 and Booster T1 robots under strong domain randomization, e.g., randomized dynamics, rough terrain, and push perturbations, as well as fast training of whole-body human-motion tracking policies. We provide videos and open-source implementation at: https://younggyo.me/fastsac-humanoid.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Guillermo Garcia-Cobo, Maximilian Igl, Peter Karkus, Zhejun Zhang, Michael Watson et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift when deployed in closed loop, leading to compounding errors. We introduce Rollouts as Demonstrations (RoaD), a simple and efficient method to mitigate covariate shift by leveraging the policy's own closed-loop rollouts as additional training data. During rollout generation, RoaD incorporates expert guidance to bias trajectories toward high-quality behavior, producing informative yet realistic demonstrations for fine-tuning. This approach enables robust closed-loop adaptation with orders of magnitude less data than reinforcement learning, and avoids restrictive assumptions of prior closed-loop supervised fine-tuning (CL-SFT) methods, allowing broader applications domains including end-to-end driving. We demonstrate the effectiveness of RoaD on WOSAC, a large-scale traffic simulation benchmark, where it performs similar or better than the prior CL-SFT method; and in AlpaSim, a high-fidelity neural reconstruction-based simulator for end-to-end driving, where it improves driving score by 41\% and reduces collisions by 54\%.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop
Machine Learning
ๆฅ็ๆ่ฆ
Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent's experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.
๐
2025-12-01
๐ ArXiv Preprint (q-bio.QM)
๐ฅ Nasim Montazeri, Stone Yang, Dominik Luszczynski, John Zhang, Dharmendra Gurve et al.
Machine Learning
ๆฅ็ๆ่ฆ
Study Objectives: Wrist accelerometry is widely used for inferring sleep-wake state. Previous works demonstrated poor wake detection, without cross-device generalizability and validation in different age range and sleep disorders. We developed a robust deep learning model for to detect sleep-wakefulness from triaxial accelerometry and evaluated its validity across three devices and in a large adult population spanning a wide range of ages with and without sleep disorders. Methods: We collected wrist accelerometry simultaneous to polysomnography (PSG) in 453 adults undergoing clinical sleep testing at a tertiary care sleep laboratory, using three devices. We extracted features in 30-second epochs and trained a 3-class model to detect wake, sleep, and sleep with arousals, which was then collapsed into wake vs. sleep using a decision tree. To enhance wake detection, the model was specifically trained on randomly selected subjects with low sleep efficiency and/or high arousal index from one device recording and then tested on the remaining recordings. Results: The model showed high performance with F1 Score of 0.86, sensitivity (sleep) of 0.87, and specificity (wakefulness) of 0.78, and significant and moderate correlation to PSG in predicting total sleep time (R=0.69) and sleep efficiency (R=0.63). Model performance was robust to the presence of sleep disorders, including sleep apnea and periodic limb movements in sleep, and was consistent across all three models of accelerometer. Conclusions: We present a deep model to detect sleep-wakefulness from actigraphy in adults with relative robustness to the presence of sleep disorders and generalizability across diverse commonly used wrist accelerometers.
๐
2025-12-01
๐ ArXiv Preprint (eess.SY)
๐ฅ Andrea Goertzen, Sunbochen Tang, Navid Azizan
Robotics
ๆฅ็ๆ่ฆ
Chaos is a fundamental feature of many complex dynamical systems, including weather systems and fluid turbulence. These systems are inherently difficult to predict due to their extreme sensitivity to initial conditions. Many chaotic systems are dissipative and ergodic, motivating data-driven models that aim to learn invariant statistical properties over long time horizons. While recent models have shown empirical success in preserving invariant statistics, they are prone to generating unbounded predictions, which prevent meaningful statistics evaluation. To overcome this, we introduce the Energy-Constrained Operator (ECO) that simultaneously learns the system dynamics while enforcing boundedness in predictions. We leverage concepts from control theory to develop algebraic conditions based on a learnable energy function, ensuring the learned dynamics is dissipative. ECO enforces these algebraic conditions through an efficient closed-form quadratic projection layer, which provides provable trajectory boundedness. To our knowledge, this is the first work establishing such formal guarantees for data-driven chaotic dynamics models. Additionally, the learned invariant level set provides an outer estimate for the strange attractor, a complex structure that is computationally intractable to characterize. We demonstrate empirical success in ECO's ability to generate stable long-horizon forecasts, capturing invariant statistics on systems governed by chaotic PDEs, including the Kuramoto--Sivashinsky and the Navier--Stokes equations.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Eunjeong Jeong, Giovanni Perin, Howard H. Yang, Nikolaos Pappas
Machine Learning
ๆฅ็ๆ่ฆ
Federated Learning (FL) on resource-constrained edge devices faces a critical challenge: The computational energy required for training Deep Neural Networks (DNNs) often dominates communication costs. However, most existing Energy-Harvesting FL (EHFL) strategies fail to account for this reality, resulting in wasted energy due to redundant local computations. For efficient and proactive resource management, algorithms that predict local update contributions must be devised. We propose a lightweight client scheduling framework using the Version Age of Information (VAoI), a semantics-aware metric that quantifies update timeliness and significance. Crucially, we overcome VAoI's typical prohibitive computational cost, which requires statistical distance over the entire parameter space, by introducing a feature-based proxy. This proxy estimates model redundancy using intermediate-layer extraction from a single forward pass, dramatically reducing computational complexity. Experiments conducted under extreme non-IID data distributions and scarce energy availability demonstrate superior learning performance while achieving energy reduction compared to existing baseline selection policies. Our framework establishes semantics-aware scheduling as a practical and vital solution for EHFL in realistic scenarios where training costs dominate transmission costs.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoran Qin, Shansita Sharma, Ali Abbasi, Chayne Thrash, Soheil Kolouri
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Low-rank approximation methods such as singular value decomposition (SVD) and its variants (e.g., Fisher-weighted SVD, Activation SVD) have recently emerged as effective tools for neural network compression. In this setting, decomposition acts as a "surgical" intervention, followed by fine-tuning that serves as "rehab" to recover accuracy. Inspired by prehabilitation in surgery, we introduce a pre-compression fine-tuning stage, Low-Rank Prehab, that explicitly encourages low-rank structure in weight matrices while preserving task performance. By conditioning the model before SVD, Prehab steers weights toward spectrally compact regions of the parameter space, enabling smoother low-rank approximation and improved recovery. Experiments on large language models (LLMs) and other Transformer-based architectures, including Vision Transformers (ViTs), show that Prehab substantially reduces the immediate accuracy drop after compression and consistently improves post-finetuning performance. Across a wide range of compression ratios, our method outperforms state-of-the-art SVD-based techniques such as SVD-LLM, highlighting the importance of preparing models for compression rather than only improving the compression and recovery stages. Source code is available at https://github.com/niqretnuh/PREHAB-SVD
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sai Gokhale, Devleena Das, Rajeev Patwari, Ashish Sirasao, Elliott Delaye
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Long-context Large Language Models (LLMs) face significant memory bottlenecks during inference due to the linear growth of key-value (KV) cache with sequence length. While individual optimization techniques like KV cache quantization, chunked prefill, and model weight quantization have shown promise, their joint effects and optimal configurations for edge deployment remain underexplored. We introduce KV Pareto, a systems-level framework that systematically maps the trade-off frontier between total memory consumption and task accuracy across these three complementary optimization techniques. Our framework evaluates multiple LLM architectures (Qwen, Llama, Mistral) with varying KV quantization schemes (int2/4/8, mixed-precision), granularities (per-token, per-tensor, per-block), and 4-bit weight quantization via AWQ. Our framework identifies model-specific Pareto-optimal configurations that achieve 68-78% total memory reduction with minimal (1-3%) accuracy degradation on long-context tasks. We additionally verify the selected frontiers on additional benchmarks of Needle-in-a-Haystack, GSM8k and MMLU as well as extended context lengths of up to 128k to demonstrate the practical need of joint optimization for efficient LLM inference.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Han Zhou, Xingchen Wan, Ivan Vuliฤ, Anna Korhonen
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nico Daheim, Thomas Mรถllenhoff, Ming Liang Ang, Mohammad Emtiyaz Khan
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed-up training by using gradient corrections, but have seen limited success in deep learning. Here, we show surprising new foundational connections of SVRG to a recently proposed Bayesian method called posterior correction. Specifically, we show that SVRG is recovered as a special case of posterior correction over the isotropic-Gaussian family, while novel extensions are automatically obtained by using more flexible exponential families. We derive two new SVRG variants by using Gaussian families: First, a Newton-like variant that employs novel Hessian corrections, and second, an Adam-like extension that improves pretraining and finetuning of Transformer language models. This is the first work to connect SVRG to Bayes and use it to boost variational training for deep networks.
๐
2025-12-01
๐ Accepted for publication in IEEE Robotics and Auto
๐ฅ Kentaro Fujii, Shingo Murata
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Robots in uncertain real-world environments must perform both goal-directed and exploratory actions. However, most deep learning-based control methods neglect exploration and struggle under uncertainty. To address this, we adopt deep active inference, a framework that accounts for human goal-directed and exploratory actions. Yet, conventional deep active inference approaches face challenges due to limited environmental representation capacity and high computational cost in action selection. We propose a novel deep active inference framework that consists of a world model, an action model, and an abstract world model. The world model encodes environmental dynamics into hidden state representations at slow and fast timescales. The action model compresses action sequences into abstract actions using vector quantization, and the abstract world model predicts future slow states conditioned on the abstract action, enabling low-cost action selection. We evaluate the framework on object-manipulation tasks with a real-world robot. Results show that it achieves high success rates across diverse manipulation tasks and switches between goal-directed and exploratory actions in uncertain settings, while making action selection computationally tractable. These findings highlight the importance of modeling multiple timescale dynamics and abstracting actions and state transitions.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jacob Searcy, Anish Dulal, Scott Bridgham, Ashley Cordes, Lillian Aoki et al.
ๆฅ็ๆ่ฆ
Natural climate solutions (NCS) offer an approach to mitigating carbon dioxide (CO2) emissions. However, monitoring the carbon drawdown of ecosystems over large geographic areas remains challenging. Eddy-flux covariance towers provide ground truth for predictive 'upscaling' models derived from satellite products, but many satellites now produce measurements on spatial scales smaller than a flux tower's footprint. We introduce Footprint-Aware Regression (FAR), a first-of-its-kind, deep-learning framework that simultaneously predicts spatial footprints and pixel-level (30 m scale) estimates of carbon flux. FAR is trained on our AMERI-FAR25 dataset which combines 439 site years of tower data with corresponding Landsat scenes. Our model produces high-resolution predictions and achieves R2 = 0.78 when predicting monthly net ecosystem exchange on test sites from a variety of ecosystems.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sanja Karilanova, Subhrakanti Dey, Ayรงa รzรงelikkale
Machine Learning
ๆฅ็ๆ่ฆ
Spiking neural networks (SNNs) are biologically inspired, event-driven models that are suitable for processing temporal data and offer energy-efficient computation when implemented on neuromorphic hardware. In SNNs, richer neuronal dynamic allows capturing more complex temporal dependencies, with delays playing a crucial role by allowing past inputs to directly influence present spiking behavior. We propose a general framework for incorporating delays into SNNs through additional state variables. The proposed mechanism enables each neuron to access a finite temporal input history. The framework is agnostic to neuron models and hence can be seamlessly integrated into standard spiking neuron models such as LIF and adLIF. We analyze how the duration of the delays and the learnable parameters associated with them affect the performance. We investigate the trade-offs in the network architecture due to additional state variables introduced by the delay mechanism. Experiments on the Spiking Heidelberg Digits (SHD) dataset show that the proposed mechanism matches the performance of existing delay-based SNNs while remaining computationally efficient. Moreover, the results illustrate that the incorporation of delays may substantially improve performance in smaller networks.
๐
2025-12-01
๐ 12 pages, 9 figures, submitted to IEEE SaTML 2026
๐ฅ Leo Elmecker-Plakolm, Pierre Fasterling, Philip Sosnin, Calvin Tsay, Matthew Wicker
Machine Learning
ๆฅ็ๆ่ฆ
Safety-critical environments are inherently dynamic. Distribution shifts, emerging vulnerabilities, and evolving requirements demand continuous updates to machine learning models. Yet even benign parameter updates can have unintended consequences, such as catastrophic forgetting in classical models or alignment drift in foundation models. Existing heuristic approaches (e.g., regularization, parameter isolation) can mitigate these effects but cannot certify that updated models continue to satisfy required performance specifications. We address this problem by introducing a framework for provably safe model updates. Our approach first formalizes the problem as computing the largest locally invariant domain (LID): a connected region in parameter space where all points are certified to satisfy a given specification. While exact maximal LID computation is intractable, we show that relaxing the problem to parameterized abstract domains (orthotopes, zonotopes) yields a tractable primal-dual formulation. This enables efficient certification of updates - independent of the data or algorithm used - by projecting them onto the safe domain. Our formulation further allows computation of multiple approximately optimal LIDs, incorporation of regularization-inspired biases, and use of lookahead data buffers. Across continual learning and foundation model fine-tuning benchmarks, our method matches or exceeds heuristic baselines for avoiding forgetting while providing formal safety guarantees.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Gaganpreet Jhajj, Fuhua Lin
ๆฅ็ๆ่ฆ
Knowledge graphs (KGs) require continual updates as new information emerges, but neural embedding models suffer from catastrophic forgetting when learning new tasks sequentially. We evaluate Elastic Weight Consolidation (EWC), a regularization-based continual learning method, on KG link prediction using TransE embeddings on FB15k-237. Across multiple experiments with five random seeds, we find that EWC reduces catastrophic forgetting from 12.62% to 6.85%, a 45.7% reduction compared to naive sequential training. We observe that the task partitioning strategy affects the magnitude of forgetting: relation-based partitioning (grouping triples by relation type) exhibits 9.8 percentage points higher forgetting than randomly partitioned tasks (12.62% vs 2.81%), suggesting that task construction influences evaluation outcomes. While focused on a single embedding model and dataset, our results demonstrate that EWC effectively mitigates catastrophic forgetting in KG continual learning and highlight the importance of evaluation protocol design.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Adrienne M. Propp, Mauro Perego, Eric C. Cyr, Anthony Gruber, Amanda A. Howard et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate yet efficient surrogate models are essential for large-scale simulations of partial differential equations (PDEs), particularly for uncertainty quantification (UQ) tasks that demand hundreds or thousands of evaluations. We develop a physics-inspired graph neural network (GNN) surrogate that operates directly on unstructured meshes and leverages the flexibility of graph attention. To improve both training efficiency and generalization properties of the model, we introduce a domain decomposition (DD) strategy that partitions the mesh into subdomains, trains local GNN surrogates in parallel, and aggregates their predictions. We then employ transfer learning to fine-tune models across subdomains, accelerating training and improving accuracy in data-limited settings. Applied to ice sheet simulations, our approach accurately predicts full-field velocities on high-resolution meshes, substantially reduces training time relative to training a single global surrogate model, and provides a ripe foundation for UQ objectives. Our results demonstrate that graph-based DD, combined with transfer learning, provides a scalable and reliable pathway for training GNN surrogates on massive PDE-governed systems, with broad potential for application beyond ice sheet dynamics.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aref Ghoreishee, Abhishek Mishra, Lifeng Zhou, John Walsh, Nagarajan Kandasamy
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
This work proposes an end-to-end multi-modal reinforcement learning framework for high-level decision-making in autonomous vehicles. The framework integrates heterogeneous sensory input, including camera images, LiDAR point clouds, and vehicle heading information, through a cross-attention transformer-based perception module. Although transformers have become the backbone of modern multi-modal architectures, their high computational cost limits their deployment in resource-constrained edge environments. To overcome this challenge, we propose a spiking temporal-aware transformer-like architecture that uses ternary spiking neurons for computationally efficient multi-modal fusion. Comprehensive evaluations across multiple tasks in the Highway Environment demonstrate the effectiveness and efficiency of the proposed approach for real-time autonomous decision-making.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ahmed Nebli
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The training of deep vision models is fundamentally a signal recovery problem amidst high-dimensional stochastic noise. Current optimization paradigms impose a static compromise on information channel capacity. For instance, magnitude-based methods, such as AdamW, operate on the assumption that gradient norms are high-fidelity curvature signals. While this allows for precision in smooth regimes, it leads to catastrophic noise amplification when applied to rugged, non-convex landscapes. Conversely, sign-based methods (e.g., Lion) perform a radical 1-bit quantization of the gradient, which aims to provide robust regularization at the cost of discarding fine-grained descent information. We propose that optimal convergence requires neither static prior, but rather a dynamic modulation of the update bitrate. We introduce \textbf{ThermoLion}, a vision-centric framework that utilizes local Signal-to-Noise Ratio (SNR) gating to autonomously transition parameters between a "low-bit" exploration phase and a "high-precision" exploitation phase. Furthermore, we introduce a Momentum Alignment mechanism that detects constructive interference between historical drift and instantaneous gradients to accelerate convergence during stable trajectories. Empirical benchmarks across 12 diverse vision datasets (including CIFAR, SVHN, and GTSRB) demonstrate that ThermoLion serves as a hyperparameter-free generalist, surpassing both AdamW and Lion in convergence speed and terminal accuracy without architecture-specific tuning.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Philippe Rigollet
Natural Language Processing
ๆฅ็ๆ่ฆ
We develop a mathematical framework that interprets Transformer attention as an interacting particle system and studies its continuum (mean-field) limits. By idealizing attention continuous on the sphere, we connect Transformer dynamics to Wasserstein gradient flows, synchronization models (Kuramoto), and mean-shift clustering. Central to our results is a global clustering phenomenon whereby tokens cluster asymptotically after long metastable states where they are arranged into multiple clusters. We further analyze a tractable equiangular reduction to obtain exact clustering rates, show how commonly used normalization schemes alter contraction speeds, and identify a phase transition for long-context attention. The results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingxuan Hu, Hongbo Ma, Xinlan Wu, Ziqi Liu, Jiaqi Liu et al.
ๆฅ็ๆ่ฆ
Audio-based depression detection models have demonstrated promising performance but often suffer from gender bias due to imbalanced training data. Epidemiological statistics show a higher prevalence of depression in females, leading models to learn spurious correlations between gender and depression. Consequently, models tend to over-diagnose female patients while underperforming on male patients, raising significant fairness concerns. To address this, we propose a novel Counterfactual Debiasing Framework grounded in causal inference. We construct a causal graph to model the decision-making process and identify gender bias as the direct causal effect of gender on the prediction. During inference, we employ counterfactual inference to estimate and subtract this direct effect, ensuring the model relies primarily on authentic acoustic pathological features. Extensive experiments on the DAIC-WOZ dataset using two advanced acoustic backbones demonstrate that our framework not only significantly reduces gender bias but also improves overall detection performance compared to existing debiasing strategies.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yudi Wu, Wenhao Zhao, Dianbo Liu
Machine Learning
ๆฅ็ๆ่ฆ
Generative diversity varies significantly across discrete latent generative models such as AR, MIM, and Diffusion. We propose a diagnostic framework, grounded in Information Bottleneck (IB) theory, to analyze the underlying strategies resolving this behavior. The framework models generation as a conflict between a 'Compression Pressure' - a drive to minimize overall codebook entropy - and a 'Diversity Pressure' - a drive to maximize conditional entropy given an input. We further decompose this diversity into two primary sources: 'Path Diversity', representing the choice of high-level generative strategies, and 'Execution Diversity', the randomness in executing a chosen strategy. To make this decomposition operational, we introduce three zero-shot, inference-time interventions that directly perturb the latent generative process and reveal how models allocate and express diversity. Application of this probe-based framework to representative AR, MIM, and Diffusion systems reveals three distinct strategies: "Diversity-Prioritized" (MIM), "Compression-Prioritized" (AR), and "Decoupled" (Diffusion). Our analysis provides a principled explanation for their behavioral differences and informs a novel inference-time diversity enhancement technique.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu et al.
ๆฅ็ๆ่ฆ
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Valentin de Bortoli, Romuald Elie, Anna Kazeykina, Zhenjie Ren, Jiacheng Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion generative models have emerged as powerful tools for producing synthetic data from an empirically observed distribution. A common approach involves simulating the time-reversal of an Ornstein-Uhlenbeck (OU) process initialized at the true data distribution. Since the score function associated with the OU process is typically unknown, it is approximated using a trained neural network. This approximation, along with finite time simulation, time discretization and statistical approximation, introduce several sources of error whose impact on the generated samples must be carefully understood. Previous analyses have quantified the error between the generated and the true data distributions in terms of Wasserstein distance or Kullback-Leibler (KL) divergence. However, both metrics present limitations: KL divergence requires absolute continuity between distributions, while Wasserstein distance, though more general, leads to error bounds that scale poorly with dimension, rendering them impractical in high-dimensional settings. In this work, we derive an explicit, dimension-free bound on the discrepancy between the generated and the true data distributions. The bound is expressed in terms of a smooth test functional with bounded first and second derivatives. The key novelty lies in the use of this weaker, functional metric to obtain dimension-independent guarantees, at the cost of higher regularity on the test functions. As an application, we formulate and solve a variational problem to minimize the time-discretization error, leading to the derivation of an optimal time-scheduling strategy for the reverse-time diffusion. Interestingly, this scheduler has appeared previously in the literature in a different context; our analysis provides a new justification for its optimality, now grounded in minimizing the discretization bias in generative sampling.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Cencheng Shen, Yuexiao Dong, Carey E. Priebe
Machine Learning
ๆฅ็ๆ่ฆ
Decision trees and random forest remain highly competitive for classification on medium-sized, standard datasets due to their robustness, minimal preprocessing requirements, and interpretability. However, a single tree suffers from high estimation variance, while large ensembles reduce this variance at the cost of substantial computational overhead and diminished interpretability. In this paper, we propose Decision Tree Embedding (DTE), a fast and effective method that leverages the leaf partitions of a trained classification tree to construct an interpretable feature representation. By using the sample means within each leaf region as anchor points, DTE maps inputs into an embedding space defined by the tree's partition structure, effectively circumventing the high variance inherent in decision-tree splitting rules. We further introduce an ensemble extension based on additional bootstrap trees, and pair the resulting embedding with linear discriminant analysis for classification. We establish several population-level theoretical properties of DTE, including its preservation of conditional density under mild conditions and a characterization of the resulting classification error. Empirical studies on synthetic and real datasets demonstrate that DTE strikes a strong balance between accuracy and computational efficiency, outperforming or matching random forest and shallow neural networks while requiring only a fraction of their training time in most cases. Overall, the proposed DTE method can be viewed either as a scalable decision tree classifier that improves upon standard split rules, or as a neural network model whose weights are learned from tree-derived anchor points, achieving an intriguing integration of both paradigms.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lama Alssum, Hasan Abed Al Kader Hammoud, Motasem Alfarra, Juan C Leon Alcazar, Bernard Ghanem
Machine Learning
ๆฅ็ๆ่ฆ
Deep neural networks suffer from catastrophic forgetting, where performance on previous tasks degrades after training on a new task. This issue arises due to the model's tendency to overwrite previously acquired knowledge with new information. We present a novel approach to address this challenge, focusing on the intersection of memory-based methods and regularization approaches. We formulate a regularization strategy, termed Information Maximization (IM) regularizer, for memory-based continual learning methods, which is based exclusively on the expected label distribution, thus making it class-agnostic. As a consequence, IM regularizer can be directly integrated into various rehearsal-based continual learning methods, reducing forgetting and favoring faster convergence. Our empirical validation shows that, across datasets and regardless of the number of tasks, our proposed regularization strategy consistently improves baseline performance at the expense of a minimal computational overhead. The lightweight nature of IM ensures that it remains a practical and scalable solution, making it applicable to real-world continual learning scenarios where efficiency is paramount. Finally, we demonstrate the data-agnostic nature of our regularizer by applying it to video data, which presents additional challenges due to its temporal structure and higher memory requirements. Despite the significant domain gap, our experiments show that IM regularizer also improves the performance of video continual learning methods.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sarah Segel, Helena Graf, Edward Bergman, Kristina Thieme, Marcel Wever et al.
Machine Learning
ๆฅ็ๆ่ฆ
Hyperparameter optimization (HPO), as a central paradigm of AutoML, is crucial for leveraging the full potential of machine learning (ML) models; yet its complexity poses challenges in understanding and debugging the optimization process. We present DeepCAVE, a tool for interactive visualization and analysis, providing insights into HPO. Through an interactive dashboard, researchers, data scientists, and ML engineers can explore various aspects of the HPO process and identify issues, untouched potentials, and new insights about the ML model being tuned. By empowering users with actionable insights, DeepCAVE contributes to the interpretability of HPO and ML on a design level and aims to foster the development of more robust and efficient methodologies in the future.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Chaoyi Pan, Giri Anantharaman, Nai-Chieh Huang, Claire Jin, Daniel Pfrommer et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Generative models, like flows and diffusions, have recently emerged as popular and efficacious policy parameterizations in robotics. There has been much speculation as to the factors underlying their successes, ranging from capturing multi-modal action distribution to expressing more complex behaviors. In this work, we perform a comprehensive evaluation of popular generative control policies (GCPs) on common behavior cloning (BC) benchmarks. We find that GCPs do not owe their success to their ability to capture multi-modality or to express more complex observation-to-action mappings. Instead, we find that their advantage stems from iterative computation, as long as intermediate steps are supervised during training and this supervision is paired with a suitable level of stochasticity. As a validation of our findings, we show that a minimum iterative policy (MIP), a lightweight two-step regression-based policy, essentially matches the performance of flow GCPs, and often outperforms distilled shortcut models. Our results suggest that the distribution-fitting component of GCPs is less salient than commonly believed, and point toward new design spaces focusing solely on control performance. Project page: https://simchowitzlabpublic.github.io/much-ado-about-noising-project/
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui et al.
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Assuming the optimality of human demonstrations is core to existing VLA policies. However, we claim that in highly dexterous and precise manipulation tasks, human demonstrations are noisy and suboptimal. GR-RL proposes a multi-stage training pipeline that filters, augments, and reinforces the demonstrations by reinforcement learning. First, GR-RL learns a vision-language-conditioned task progress, filters the demonstration trajectories, and only keeps the transitions that contribute positively to the progress. Specifically, we show that by directly applying offline RL with sparse reward, the resulting $Q$-values can be treated as a robust progress function. Next, we introduce morphological symmetry augmentation that greatly improves the generalization and performance of GR-RL. Lastly, to better align the VLA policy with its deployment behaviors for high-precision control, we perform online RL by learning a latent space noise predictor. With this pipeline, GR-RL is, to our knowledge, the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundations models to specialize into reliable real-world experts.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xiaochuan Li, Ke Wang, Girija Gouda, Shubham Choudhary, Yaqun Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLMs) become integrated into high-stakes domains, there is a growing need for evaluation methods that are both scalable for real-time deployment and reliable for critical decision-making. While human evaluation is reliable, it is slow and costly. Single LLM judges are biased, and static juries lack adaptability. To overcome these limitations, we propose LLM Jury-on-Demand - a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts, leveraging token distributions, embeddings, and structural input features. This enables a fully adaptive evaluation where, for each data point, an optimal jury of the most reliable judges is dynamically selected, and their scores are aggregated using their reliability as weights. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines. These results highlight the promise of adaptive, learning-based juries for building scalable, more reliable and trustworthy evaluation systems for modern LLMs in high-stakes domains.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Maria-Florina Blacan, Hedyeh Beyhaghi
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We initiate the study of active learning algorithms for classifying strategic agents. Active learning is a well-established framework in machine learning in which the learner selectively queries labels, often achieving substantially higher accuracy and efficiency than classical supervised methods-especially in settings where labeling is costly or time-consuming, such as hiring, admissions, and loan decisions. Strategic classification, however, addresses scenarios where agents modify their features to obtain more favorable outcomes, resulting in observed data that is not truthful. Such manipulation introduces challenges beyond those in learning from clean data. Our goal is to design active and noise-tolerant algorithms that remain effective in strategic environments-algorithms that classify strategic agents accurately while issuing as few label requests as possible. The central difficulty is to simultaneously account for strategic manipulation and preserve the efficiency gains of active learning.
Our main result is an algorithm for actively learning linear separators in the strategic setting that preserves the exponential improvement in label complexity over passive learning previously obtained only in the non-strategic case. Specifically, for data drawn uniformly from the unit sphere, we show that a modified version of the Active Perceptron algorithm [DKM05,YZ17] achieves excess error $ฮต$ using only $\tilde{O}(d \ln \frac{1}ฮต)$ label queries and incurs at most $\tilde{O}(d \ln \frac{1}ฮต)$ additional mistakes relative to the optimal classifier, even in the nonrealizable case, when a $\tildeฮฉ(ฮต)$ fraction of inputs have inconsistent labels with the optimal classifier. The algorithm is computationally efficient and, under these distributional assumptions, requires substantially fewer label queries than prior work on strategic Perceptron [ABBN21].
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenhao Sun, Yuhao Mao, Martin Vechev
Machine Learning
ๆฅ็ๆ่ฆ
Randomized Smoothing (RS) is a prominent technique for certifying the robustness of neural networks against adversarial perturbations. With RS, achieving high accuracy at small radii requires a small noise variance, while achieving high accuracy at large radii requires a large noise variance. However, the global noise variance used in the standard RS formulation leads to a fundamental limitation: there exists no global noise variance that simultaneously achieves strong performance at both small and large radii. To break through the global variance limitation, we propose a dual RS framework which enables input-dependent noise variances. To achieve that, we first prove that RS remains valid with input-dependent noise variances, provided the variance is locally constant around each input. Building on this result, we introduce two components which form our dual RS framework: (i) a variance estimator first predicts an optimal noise variance for each input, (ii) this estimated variance is then used by a standard RS classifier. The variance estimator is independently smoothed via RS to ensure local constancy, enabling flexible design. We also introduce training strategies to iteratively optimize the two components. Extensive experiments on CIFAR-10 show that our dual RS method provides strong performance for both small and large radii-unattainable with global noise variance-while incurring only a 60% computational overhead at inference. Moreover, it consistently outperforms prior input-dependent noise approaches across most radii, with particularly large gains at radii 0.5, 0.75, and 1.0, achieving relative improvements of 19%, 24%, and 21%, respectively. On ImageNet, dual RS remains effective across all radii. Additionally, the dual RS framework naturally provides a routing perspective for certified robustness, improving the accuracy-robustness trade-off with off-the-shelf expert RS models.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Simon Park, Simran Kaur, Sanjeev Arora
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To this end, we study what RL post-training teaches about skill composition and how the structure of the composition affects the skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and thus can be viewed as a ``skill.'' Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability -- models master shallow balanced trees (workload is balanced between subtasks) before deep unbalanced ones, with persistent fragility on right-heavy structures (even when the composition depth is the same as some left-heavy structures). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ John C. Hill, Tyler LaBonte, Xinchen Zhang, Vidya Muthukumar
Machine Learning
ๆฅ็ๆ่ฆ
Last-layer retraining (LLR) methods -- wherein the last layer of a neural network is reinitialized and retrained on a held-out set following ERM training -- have garnered interest as an efficient approach to rectify dependence on spurious correlations and improve performance on minority groups. Surprisingly, LLR has been found to improve worst-group accuracy even when the held-out set is an imbalanced subset of the training set. We initially hypothesize that this ``unreasonable effectiveness'' of LLR is explained by its ability to mitigate neural collapse through the held-out set, resulting in the implicit bias of gradient descent benefiting robustness. Our empirical investigation does not support this hypothesis. Instead, we present strong evidence for an alternative hypothesis: that the success of LLR is primarily due to better group balance in the held-out set. We conclude by showing how the recent algorithms CB-LLR and AFR perform implicit group-balancing to elicit a robustness improvement.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhuoqian Yang, Mathieu Salzmann, Sabine Sรผsstrunk
Machine Learning
ๆฅ็ๆ่ฆ
In this work, we investigate the potential of weights to serve as effective representations, focusing on neural fields. Our key insight is that constraining the optimization space through a pre-trained base model and low-rank adaptation (LoRA) can induce structure in weight space. Across reconstruction, generation, and analysis tasks on 2D and 3D data, we find that multiplicative LoRA weights achieve high representation quality while exhibiting distinctiveness and semantic structure. When used with latent diffusion models, multiplicative LoRA weights enable higher-quality generation than existing weight-space methods.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vaidotas Simkus, Anders Christensen, Steven Bennett, Ian Johnson, Mark Neumann et al.
Machine Learning
ๆฅ็ๆ่ฆ
Mofasa is an all-atom latent diffusion model with state-of-the-art performance for generating Metal-Organic Frameworks (MOFs). These are highly porous crystalline materials used to harvest water from desert air, capture carbon dioxide, store toxic gases and catalyse chemical reactions. In recognition of their value, the development of MOFs recently received a Nobel Prize in Chemistry.
In many ways, MOFs are well-suited for exploiting generative models in chemistry: they are rationally-designable materials with a large combinatorial design space and strong structure-property couplings. And yet, to date, a high performance generative model has been lacking. To fill this gap, we introduce Mofasa, a general-purpose latent diffusion model that jointly samples positions, atom-types and lattice vectors for systems as large as 500 atoms. Mofasa avoids handcrafted assembly algorithms common in the literature, unlocking the simultaneous discovery of metal nodes, linkers and topologies.
To help the scientific community build on our work, we release MofasaDB, an annotated library of hundreds of thousands of sampled MOF structures, along with a user-friendly web interface for search and discovery: https://mofux.ai/ .
๐
2025-12-01
๐ ArXiv Preprint (eess.SP)
๐ฅ Kai Zhang, Wentao Yu, Hengtao He, Shenghui Song, Jun Zhang et al.
Multimodal
ๆฅ็ๆ่ฆ
Integrated sensing and communication (ISAC) is a key enabler for low-altitude wireless networks (LAWNs), providing simultaneous environmental perception and data transmission in complex aerial scenarios. By combining heterogeneous sensing modalities such as visual, radar, lidar, and positional information, multimodal ISAC can improve both situational awareness and robustness of LAWNs. However, most existing multimodal fusion approaches use static fusion strategies that treat all modalities equally and cannot adapt to channel heterogeneity or time-varying modality reliability in dynamic low-altitude environments. To address this fundamental limitation, we propose a mixture-of-experts (MoE) framework for multimodal ISAC in LAWNs. Each modality is processed by a dedicated expert network, and a lightweight gating module adaptively assigns fusion weights according to the instantaneous informativeness and reliability of each modality. To improve scalability under the stringent energy constraints of aerial platforms, we further develop a sparse MoE variant that selectively activates only a subset of experts, thereby reducing computation overhead while preserving the benefits of adaptive fusion. Comprehensive simulations on three typical ISAC tasks in LAWNs demonstrate that the proposed frameworks consistently outperform conventional multimodal fusion baselines in terms of learning performance and training sample efficiency.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stella Etuk, Ashraf Matrawy
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite advances in the use of large language models (LLMs) in downstream tasks, their ability to memorize information has raised privacy concerns. Therefore, protecting personally identifiable information (PII) during LLM training remains a fundamental challenge. Conventional methods like Differential Privacy-Stochastic Gradient Descent (DP-SGD) provide robust privacy protection via uniform noising, protecting PII regardless of its distinct sensitivity. This comes at the expense of the model's utility, leading to a trade-off. In this paper, we propose SA-ADP, a sensitivity-aware approach that allocates noise based on the sensitivity of individual PII. We evaluated our method on four datasets (ABCD, CUSTOMERSIM, Wikitext-2, and UNSW-NB15 ). Our results show that SA-ADP achieves results comparable to the baseline (No-DP) and the conventional DP-SGD. This means that our method did not degrade the model's utility while still maintaining strong privacy protection.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pedro M. P. Curvo, Jan-Willem van de Meent, Maksim Zhdanov
Natural Language Processing
ๆฅ็ๆ่ฆ
A key scalability challenge in neural solvers for industrial-scale physics simulations is efficiently capturing both fine-grained local interactions and long-range global dependencies across millions of spatial elements. We introduce the Multi-Scale Patch Transformer (MSPT), an architecture that combines local point attention within patches with global attention to coarse patch-level representations. To partition the input domain into spatially-coherent patches, we employ ball trees, which handle irregular geometries efficiently. This dual-scale design enables MSPT to scale to millions of points on a single GPU. We validate our method on standard PDE benchmarks (elasticity, plasticity, fluid dynamics, porous flow) and large-scale aerodynamic datasets (ShapeNet-Car, Ahmed-ML), achieving state-of-the-art accuracy with substantially lower memory footprint and computational cost.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Marius Tacke, Matthias Busch, Kian Abdolazizi, Jonas Eichinger, Kevin Linka et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language model (LLM)-based agentic frameworks increasingly adopt the paradigm of dynamically generating task-specific agents. We suggest that not only agents but also specialized software modules for scientific and engineering tasks can be generated on demand. We demonstrate this concept in the field of solid mechanics. There, so-called constitutive models are required to describe the relationship between mechanical stress and body deformation. Constitutive models are essential for both the scientific understanding and industrial application of materials. However, even recent data-driven methods of constitutive modeling, such as constitutive artificial neural networks (CANNs), still require substantial expert knowledge and human labor. We present a framework in which an LLM generates a CANN on demand, tailored to a given material class and dataset provided by the user. The framework covers LLM-based architecture selection, integration of physical constraints, and complete code generation. Evaluation on three benchmark problems demonstrates that LLM-generated CANNs achieve accuracy comparable to or greater than manually engineered counterparts, while also exhibiting reliable generalization to unseen loading scenarios and extrapolation to large deformations. These findings indicate that LLM-based generation of physics-constrained neural networks can substantially reduce the expertise required for constitutive modeling and represent a step toward practical end-to-end automation.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yan Huang, Jinming Xu, Jiming Chen, Karl Henrik Johansson
Machine Learning
ๆฅ็ๆ่ฆ
In distributed and federated learning algorithms, communication overhead is often reduced by performing multiple local updates between communication rounds. However, due to data heterogeneity across nodes and the local gradient noise within each node, this strategy can lead to the drift of local models away from the global optimum. To address this issue, we revisit the well-known federated learning method Scaffold (Karimireddy et al., 2020) under a gradient tracking perspective, and propose a unified spatio-temporal gradient tracking algorithm, termed ST-GT, for distributed stochastic optimization over time-varying graphs. ST-GT tracks the global gradient across neighboring nodes to mitigate data heterogeneity, while maintaining a running average of local gradients to substantially suppress noise, with slightly more storage overhead. Without assuming bounded data heterogeneity, we prove that ST-GT attains a linear convergence rate for strongly convex problems and a sublinear rate for nonconvex cases. Notably, ST-GT achieves the first linear speed-up in communication complexity with respect to the number of local updates per round $ฯ$ for the strongly-convex setting. Compared to traditional gradient tracking methods, ST-GT reduces the topology-dependent noise term from $ฯ^2$ to $ฯ^2/ฯ$, where $ฯ^2$ denotes the noise level, thereby improving communication efficiency.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Louis Lacoste, Pierre Barbillon, Sophie Donnet
ๆฅ็ๆ่ฆ
Bipartite networks are widely used to encode the ecological interactions. Being able to compare the organization of bipartite networks is a first step toward a better understanding of how environmental factors shape community structure and resilience. Yet current methods for structure detection in bipartite networks overlook shared patterns across collections of networks. We introduce the \emph{colBiSBM}, a family of probabilistic models for collections of bipartite networks that extends the classical Latent Block Model (LBM). The proposed framework assumes that networks are independent realizations of a shared mesoscale structure, encoded through common inter-block connectivity parameters. We establish identifiability conditions for the different variants of \emph{colBiSBM} and develop a variational EM algorithm for parameter estimation, coupled with an adaptation of the Integrated Classification Likelihood (ICL) criterion for model selection. We demonstrate how our approach can be used to classify networks based on their topology or organization. Simulation studies highlight the ability of \emph{colBiSBM} to recover common structures, improve clustering performance, and enhance link prediction by borrowing strength across networks. An application to plant--pollinator networks highlights how the method uncovers shared ecological roles and partitions networks into sub-collections with similar connectivity patterns. These results illustrate the methodological and practical advantages of joint modeling over separate network analyses in the study of bipartite systems.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Ghita Fassy El Fehri, Aurรฉlien Bellet, Philippe Bastien
ๆฅ็ๆ่ฆ
Learning the structure of a Bayesian network from decentralized data poses two major challenges: (i) ensuring rigorous privacy guarantees for participants, and (ii) avoiding communication costs that scale poorly with dimensionality. In this work, we introduce Fed-Sparse-BNSL, a novel federated method for learning linear Gaussian Bayesian network structures that addresses both challenges. By combining differential privacy with greedy updates that target only a few relevant edges per participant, Fed-Sparse-BNSL efficiently uses the privacy budget while keeping communication costs low. Our careful algorithmic design preserves model identifiability and enables accurate structure estimation. Experiments on synthetic and real datasets demonstrate that Fed-Sparse-BNSL achieves utility close to non-private baselines while offering substantially stronger privacy and communication efficiency.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bei Zhou, Cesare Corrado, Shuang Qian, Maximilian Balmus, Angela W. C. Lee et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate maps of atrial electrical activation are essential for personalised treatment of arrhythmias, yet biophysically detailed simulations remain computationally intensive for real-time clinical use or population-scale analyses. Here we introduce a geometry-independent operator-learning framework that predicts local activation time (LAT) fields across diverse left atrial anatomies with near-instantaneous inference. We generated a dataset of 308,700 simulations using a GPU-accelerated electrophysiology solver, systematically varying multiple pacing sites and physiologically varied conduction properties across 147 patient-specific geometries derived from two independent clinical cohorts. All anatomical and functional data are expressed in a Universal Atrium Coordinate system, providing a consistent representation that decouples electrophysiological patterns from mesh topology. Within this coordinate space, we designed a neural operator with a vision-transformer backbone to learn the mapping from structural and electrophysiological inputs to LAT fields. With a mean prediction error of 5.1 ms over a 455 ms maximum simulation time, the model outperforms established operator-learning approaches and performs inference in 0.12 ms per sample. Our framework establishes a general strategy for learning domain-invariant biophysical mappings across variable anatomical domains and enables integration of computational electrophysiology into real-time and large-scale clinical workflows.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anubhab, Rupesh Nasre
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. The results show that Morphling improves per-epoch training throughput by an average of 20X on CPUs and 19X on GPUs over PyG and DGL, with peak speedups reaching 66X. Morphling's memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhongyuan Wu, Jingyuan Wang, Zexuan Cheng, Yilong Zhou, Weizhi Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Anomaly detection (AD) is a fundamental task of critical importance across numerous domains. Current systems increasingly operate in rapidly evolving environments that generate diverse yet interconnected data modalities -- such as time series, system logs, and tabular records -- as exemplified by modern IT systems. Effective AD methods in such environments must therefore possess two critical capabilities: (1) the ability to handle heterogeneous data formats within a unified framework, allowing the model to process and detect multiple modalities in a consistent manner during anomalous events; (2) a strong generalization ability to quickly adapt to new scenarios without extensive retraining. However, most existing methods fall short of these requirements, as they typically focus on single modalities and lack the flexibility to generalize across domains. To address this gap, we introduce a novel paradigm: In-Context Anomaly Detection (ICAD), where anomalies are defined by their dissimilarity to a relevant reference set of normal samples. Under this paradigm, we propose ICAD-LLM, a unified AD framework leveraging Large Language Models' in-context learning abilities to process heterogeneous data within a single model. Extensive experiments demonstrate that ICAD-LLM achieves competitive performance with task-specific AD methods and exhibits strong generalization to previously unseen tasks, which substantially reduces deployment costs and enables rapid adaptation to new environments. To the best of our knowledge, ICAD-LLM is the first model capable of handling anomaly detection tasks across diverse domains and modalities.
๐
2025-12-01
๐ ArXiv Preprint (math.OC)
๐ฅ Jimin Choi, Max Z. Li
Robotics
ๆฅ็ๆ่ฆ
Collaborative Combat Aircraft (CCAs) are envisioned to enable autonomous Intelligence, Surveillance, and Reconnaissance (ISR) missions in contested environments, where adversaries may act strategically to deceive or evade detection. These missions pose challenges due to model uncertainty and the need for safe, real-time decision-making. Robust Markov Decision Processes (RMDPs) provide worst-case guarantees but are limited by static ambiguity sets that capture initial uncertainty without adapting to new observations. This paper presents an adaptive RMDP framework tailored to ISR missions with CCAs. We introduce a mission-specific formulation in which aircraft alternate between movement and sensing states. Adversarial tactics are modeled as a finite set of transition kernels, each capturing assumptions about how adversarial sensing or environmental conditions affect rewards. Our approach incrementally refines policies by eliminating inconsistent threat models, allowing agents to shift from conservative to aggressive behaviors while maintaining robustness. We provide theoretical guarantees showing that the adaptive planner converges as credible sets contract to the true threat and maintains safety under uncertainty. Experiments under Gaussian and non-Gaussian threat models across diverse network topologies show higher mission rewards and fewer exposure events compared to nominal and static robust planners.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Valentin Noรซl, Elimane Yassine Seidou, Charly Ken Capo-Chichi, Ghanem Amari
Robotics
ๆฅ็ๆ่ฆ
Legal AI systems powered by retrieval-augmented generation (RAG) face a critical accountability challenge: when an AI assistant cites case law, statutes, or contractual clauses, practitioners need verifiable guarantees that generated text faithfully represents source documents. Existing hallucination detectors rely on semantic similarity metrics that tolerate entity substitutions, a dangerous failure mode when confusing parties, dates, or legal provisions can have material consequences. We introduce HalluGraph, a graph-theoretic framework that quantifies hallucinations through structural alignment between knowledge graphs extracted from context, query, and response. Our approach produces bounded, interpretable metrics decomposed into \textit{Entity Grounding} (EG), measuring whether entities in the response appear in source documents, and \textit{Relation Preservation} (RP), verifying that asserted relationships are supported by context. On structured control documents, HalluGraph achieves near-perfect discrimination ($>$400 words, $>$20 entities), HalluGraph achieves $AUC = 0.979$, while maintaining robust performance ($AUC \approx 0.89$) on challenging generative legal task, consistently outperforming semantic similarity baselines. The framework provides the transparency and traceability required for high-stakes legal applications, enabling full audit trails from generated assertions back to source passages.
๐
2025-12-01
๐ ArXiv Preprint (eess.SP)
๐ฅ Yiqiao Chen, Fazheng Xu, Zijian Huang, Juchi He, Zhenghui Feng
Machine Learning
ๆฅ็ๆ่ฆ
Cardiovascular disease (CVD) is a leading cause of morbidity and mortality worldwide, and sustained hypertension is an often silent risk factor, making cuffless continuous blood pressure (BP) monitoring with wearable devices important for early screening and long-term management. Most existing cuffless BP estimation methods use only photoplethysmography (PPG) and electrocardiography (ECG) signals, alone or in combination. These models are typically developed under resting or quasi-static conditions and struggle to maintain robust accuracy in multi-motion-state scenarios. In this study, we propose a six-modal BP estimation framework that jointly leverages ECG, multi-channel PPG, attachment pressure, sensor temperature, and triaxial acceleration and angular velocity. Each modality is processed by a lightweight branch encoder, contrastive learning enforces cross-modal semantic alignment, and a mixture-of-experts (MoE) regression head adaptively maps the fused features to BP across motion states. Comprehensive experiments on the public Pulse Transit Time PPG Dataset, which includes running, walking, and sitting data from 22 subjects, show that the proposed method achieves mean absolute errors (MAE) of 3.60 mmHg for systolic BP (SBP) and 3.01 mmHg for diastolic BP (DBP). From a clinical perspective, it attains Grade A for SBP, DBP, and mean arterial pressure (MAP) according to the British Hypertension Society (BHS) protocol and meets the numerical criteria of the Association for the Advancement of Medical Instrumentation (AAMI) standard for mean error (ME) and standard deviation of error (SDE).
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Daniele Masti, Francesco Basciani, Arianna Fedeli, Girgio Gnecco, Francesco Smarra
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Digital Twins (DTs) are increasingly used as autonomous decision-makers in complex socio-technical systems. Their mathematically optimal decisions often diverge from human expectations, exposing a persistent gap between algorithmic and bounded human rationality. This work addresses this gap by proposing a framework that operationalizes fairness as a learnable objective within optimization-based Digital Twins. We introduce a preference-driven learning pipeline that infers latent fairness objectives directly from human pairwise preferences over feasible decisions. A novel Siamese neural network is developed to generate convex quadratic cost functions conditioned on contextual information. The resulting surrogate objectives align optimization outcomes with human-perceived fairness while maintaining computational efficiency. The approach is demonstrated on a COVID-19 hospital resource allocation scenario. This study provides an actionable path toward embedding human-centered fairness in the design of autonomous decision-making systems.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Josรฉphine Raugel, Stรฉphane d'Ascoli, Jรฉrรฉmy Rapin, Valentin Wyart, Jean-Rรฉmi King
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent studies suggest that the representations learned by large language models (LLMs) are partially aligned to those of the human brain. However, whether and why this alignment score arises from a similar sequence of computations remains elusive. In this study, we explore this question by examining temporally-resolved brain signals of participants listening to 10 hours of an audiobook. We study these neural dynamics jointly with a benchmark encompassing 22 LLMs varying in size and architecture type. Our analyses confirm that LLMs and the brain generate representations in a similar order: specifically, activations in the initial layers of LLMs tend to best align with early brain responses, while the deeper layers of LLMs tend to best align with later brain responses. This brain-LLM alignment is consistent across transformers and recurrent architectures. However, its emergence depends on both model size and context length. Overall, this study sheds light on the sequential nature of computations and the factors underlying the partial convergence between biological and artificial neural networks.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Letian Yi, Tingpeng Zhang, Mingyuan Zhou, Guannan Wang, Quanke Su et al.
ๆฅ็ๆ่ฆ
Reconstructing full fields from extremely sparse and random measurements is a longstanding ill-posed inverse problem. A powerful framework for addressing such challenges is hierarchical probabilistic modeling, where uncertainty is represented by intermediate variables and resolved through marginalization during inference. Inspired by this principle, we propose Cascaded Sensing (Cas-Sensing), a hierarchical reconstruction framework that integrates an autoencoder-diffusion cascade. First, a neural operator-based functional autoencoder reconstructs the dominant structures of the original field - including large-scale components and geometric boundaries - from arbitrary sparse inputs, serving as an intermediate variable. Then, a conditional diffusion model, trained with a mask-cascade strategy, generates fine-scale details conditioned on these large-scale structures. To further enhance fidelity, measurement consistency is enforced via the manifold constrained gradient based on Bayesian posterior sampling during the generation process. This cascaded pipeline substantially alleviates ill-posedness, delivering accurate and robust reconstructions. Experiments on both simulation and real-world datasets demonstrate that Cas-Sensing generalizes well across varying sensor configurations and geometric boundaries, making it a promising tool for practical deployment in scientific and engineering applications.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sandro Andric
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate whether Large Language Models (LLMs) exhibit altruistic tendencies, and critically, whether their implicit associations and self-reports predict actual altruistic behavior. Using a multi-method approach inspired by human social psychology, we tested 24 frontier LLMs across three paradigms: (1) an Implicit Association Test (IAT) measuring implicit altruism bias, (2) a forced binary choice task measuring behavioral altruism, and (3) a self-assessment scale measuring explicit altruism beliefs. Our key findings are: (1) All models show strong implicit pro-altruism bias (mean IAT = 0.87, p < .0001), confirming models "know" altruism is good. (2) Models behave more altruistically than chance (65.6% vs. 50%, p < .0001), but with substantial variation (48-85%). (3) Implicit associations do not predict behavior (r = .22, p = .29). (4) Most critically, models systematically overestimate their own altruism, claiming 77.5% altruism while acting at 65.6% (p < .0001, Cohen's d = 1.08). This "virtue signaling gap" affects 75% of models tested. Based on these findings, we recommend the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric. Well-calibrated models are more predictable and behaviorally consistent; only 12.5% of models achieve the ideal combination of high prosocial behavior and accurate self-knowledge.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Simon Leszek
ๆฅ็ๆ่ฆ
Change-point detection (CPD) in high-dimensional, large-volume time series is challenging for statistical consistency, scalability, and interpretability. We introduce TimePred, a self-supervised framework that reduces multivariate CPD to univariate mean-shift detection by predicting each sample's normalized time index. This enables efficient offline CPD using existing algorithms and supports the integration of XAI attribution methods for feature-level explanations. Our experiments show competitive CPD performance while reducing computational cost by up to two orders of magnitude. In an industrial manufacturing case study, we demonstrate improved detection accuracy and illustrate the practical value of interpretable change-point insights.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) often generate unreliable answers, while heuristic uncertainty methods fail to fully distinguish correct from incorrect predictions, causing users to accept erroneous answers without statistical guarantees. We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level. To achieve this in a principled way, we propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint over selection and error indicators. Then, we establish a finite-sample sufficient condition, which relies only on a held-out set of exchangeable calibration samples, to compute an FDR-constrained, coverage-maximizing threshold. Furthermore, we extend LEC to a two-model routing mechanism: given a prompt, if the current model's uncertainty exceeds its calibrated threshold, we delegate it to a stronger model, while maintaining a unified FDR guarantee. Evaluations on closed-ended and open-ended question-answering (QA) datasets show that LEC achieves tighter FDR control and substantially improves sample retention over prior methods. Moreover, the two-model routing mechanism achieves lower risk levels while accepting more correct samples than each individual model.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Yuma Ichikawa, Yudai Fujimoto, Akira Sakai
ๆฅ็ๆ่ฆ
Post-training quantization (PTQ) aims to preserve model-level behavior; however, most methods focus on individual linear layers. Even recent extensions, such as QEP and LoaQ, which mitigate error propagation or target specific submodules, still rely on layer-wise formulations and fail to capture the behavior of larger submodules. We introduce Layer-Projected Coordinate Descent (LPCD), a unified framework that extends PTQ beyond layers by optimizing relaxed objectives across arbitrary submodules and projecting the solutions with layer-wise quantizers. LPCD generalizes existing methods and provides a principled approach to quantizing complex submodules while maintaining the efficiency and compatibility of layer-wise PTQ pipelines. Across diverse LLM architectures and bit-widths, LPCD-based submodule quantization consistently enhances both layer-wise PTQ methods and existing submodule approaches.
๐
2025-12-01
๐ ArXiv Preprint (cs.SD)
๐ฅ Tal Shuster, Eliya Nachmani
ๆฅ็ๆ่ฆ
Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ) and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space, make it harder to capture correlations between features leading to inefficiency in representation learning, codebook utilization and token rate. In this paper we introduce Two Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids such as hexagonal, rhombic, or rectangular tiling and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization while maintaining state of the art reconstruction quality. Specifically, Q2D2 achieves competitive to superior performance in various objective and subjective reconstruction metrics, across extensive experiments in speech domain compared to state of the art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Abdo Abouelrous, Laurens Bliek, Yaoxin Wu, Yingqian Zhang
Machine Learning
ๆฅ็ๆ่ฆ
In this work, we consider learning-based applications in routing to solve a Vehicle Routing variant characterized by stochasticity and multiple objectives. Such problems are representative of practical settings where decision-makers have to deal with uncertainty in the operational environment as well as multiple conflicting objectives due to different stakeholders. We specifically consider travel time uncertainty. We also consider two objectives, total travel time and route makespan, that jointly target operational efficiency and labor regulations on shift length, although different objectives could be incorporated. Learning-based methods offer earnest computational advantages as they can repeatedly solve problems with limited interference from the decision-maker. We specifically focus on end-to-end deep learning models that leverage the attention mechanism and multiple solution trajectories. These models have seen several successful applications in routing problems. However, since travel times are not a direct input to these models due to the large dimensions of the travel time matrix, accounting for uncertainty is a challenge, especially in the presence of multiple objectives. In turn, we propose a model that simultaneously addresses stochasticity and multi-objectivity and provide a refined training mechanism for this model through scenario clustering to reduce training time. Our results show that our model is capable of constructing a Pareto Front of good quality within acceptable run times compared to three baselines.
๐
2025-12-01
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ Sigurd Vargdal, Paula Reis, Henrik Andersen Sveinsson, Gaute Linga
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Permeability is a central concept in the macroscopic description of flow through porous media, with applications spanning from oil recovery to hydrology. Traditional methods for determining the permeability tensor involving flow simulations or experiments can be time consuming and resource-intensive, while analytical methods, e.g., based on the Kozeny-Carman equation, may be too simplistic for accurate prediction based on pore-scale features. In this work, we explore deep learning as a more efficient alternative for predicting the permeability tensor based on two-dimensional binary images of porous media, segmented into solid ($1$) and void ($0$) regions. We generate a dataset of 24,000 synthetic random periodic porous media samples with specified porosity and characteristic length scale. Using Lattice-Boltzmann simulations, we compute the permeability tensor for flow through these samples with values spanning three orders of magnitude. We evaluate three families of image-based deep learning models: ResNet (ResNet-$50$ and ResNet-$101$), Vision Transformers (ViT-T$16$ and ViT-S$16$) and ConvNeXt (Tiny and Small). To improve model generalisation, we employ techniques such as weight decay, learning rate scheduling, and data augmentation. The effect of data augmentation and dataset size on model performance is studied, and we find that they generally increase the accuracy of permeability predictions. We also show that ConvNeXt and ResNet converge faster than ViT and degrade in performance if trained for too long. ConvNeXt-Small achieved the highest $R^2$ score of $0.99460$ on $4,000$ unseen test samples. These findings underscore the potential to use image-based neural networks to predict permeability tensors accurately.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mengyao Du, Gang Yang, Han Fang, Quanjun Yin, Ee-chien Chang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The widespread adoption of natural language processing techniques has led to an unprecedented growth of text classifiers across the modern web. Yet many of these models circulate with their internal semantics undocumented or even intentionally withheld. Such opaque classifiers, which may expose only hard-label outputs, can operate in unregulated web environments or be repurposed for unknown intents, raising legitimate forensic and auditing concerns. In this paper, we position ourselves as investigators and work to infer the semantic concept each label encodes in an undocumented black-box classifier.
Specifically, we introduce label forensics, a black-box framework that reconstructs a label's semantic meaning. Concretely, we represent a label by a sentence embedding distribution from which any sample reliably reflects the concept the classifier has implicitly learned for that label. We believe this distribution should maintain two key properties: precise, with samples consistently classified into the target label, and general, covering the label's broad semantic space. To realize this, we design a semantic neighborhood sampler and an iterative optimization procedure to select representative seed sentences that jointly maximize label consistency and distributional coverage. The final output, an optimized seed sentence set combined with the sampler, constitutes the empirical distribution representing the label's semantics. Experiments on multiple black-box classifiers achieve an average label consistency of around 92.24 percent, demonstrating that the embedding regions accurately capture each classifier's label semantics. We further validate our framework on an undocumented HuggingFace classifier, enabling fine-grained label interpretation and supporting responsible AI auditing.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Franz Thaler, Martin Urschler, Mateusz Kozinski, Matthias AF Gsell, Gernot Plank et al.
Computer Vision
ๆฅ็ๆ่ฆ
We tackle the challenging problem of single-source domain generalization (DG) for medical image segmentation. To this end, we aim for training a network on one domain (e.g., CT) and directly apply it to a different domain (e.g., MR) without adapting the model and without requiring images or annotations from the new domain during training. We propose a novel method for promoting DG when training deep segmentation networks, which we call SRCSM. During training, our method diversifies the source domain through semantic-aware random convolution, where different regions of a source image are augmented differently, based on their annotation labels. At test-time, we complement the randomization of the training domain via mapping the intensity of target domain images, making them similar to source domain data. We perform a comprehensive evaluation on a variety of cross-modality and cross-center generalization settings for abdominal, whole-heart and prostate segmentation, where we outperform previous DG techniques in a vast majority of experiments. Additionally, we also investigate our method when training on whole-heart CT or MR data and testing on the diastolic and systolic phase of cine MR data captured with different scanner hardware, where we make a step towards closing the domain gap in this even more challenging setting. Overall, our evaluation shows that SRCSM can be considered a new state-of-the-art in DG for medical image segmentation and, moreover, even achieves a segmentation performance that matches the performance of the in-domain baseline in several settings.
๐
2025-12-01
๐ ArXiv Preprint (quant-ph)
๐ฅ Patrick Odagiu, Vasilis Belis, Lennart Schulze, Panagiotis Barkoutsos, Michele Grossi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Data sets that are specified by a large number of features are currently outside the area of applicability for quantum machine learning algorithms. An immediate solution to this impasse is the application of dimensionality reduction methods before passing the data to the quantum algorithm. We investigate six conventional feature extraction algorithms and five autoencoder-based dimensionality reduction models to a particle physics data set with 67 features. The reduced representations generated by these models are then used to train a quantum support vector machine for solving a binary classification problem: whether a Higgs boson is produced in proton collisions at the LHC. We show that the autoencoder methods learn a better lower-dimensional representation of the data, with the method we design, the Sinkclass autoencoder, performing 40% better than the baseline. The methods developed here open up the applicability of quantum machine learning to a larger array of data sets. Moreover, we provide a recipe for effective dimensionality reduction in this context.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Daniel Armstrong, Zlatko Jonฤev, Andres M Bran, Philippe Schwaller
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern computer-assisted synthesis planning (CASP) systems show promises at generating chemically valid reaction steps but struggle to incorporate strategic considerations such as convergent assembly, protecting group minimization, and optimal ring-forming sequences. We introduce a methodology that leverages Large Language Models to distill synthetic knowledge into code. Our system analyzes synthesis routes and translates strategic principles into Python functions representing diverse strategic and tactical rules, such as strategic functional group interconversions and ring construction strategies. By formalizing this knowledge as verifiable code rather than simple heuristics, we create testable, interpretable representations of synthetic strategy. We release the complete codebase and the USPTO-ST dataset -- synthesis routes annotated with strategic tags. This framework unlocks a novel capability for CASP: natural language-based route retrieval, achieving 75\% Top-3 accuracy on our benchmark. We further validate our library through temporal analysis of historical trends and chemically intuitive route clustering that offers more granular partitioning than common previous methods. This work bridges the tactical-strategic divide in CASP, enabling specification, search, and evaluation of routes by strategic criteria rather than structure alone.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alfredo Reichlin, Miguel Vasco, Danica Kragic
Machine Learning
ๆฅ็ๆ่ฆ
Bayesian Neural Networks provide a principled framework for uncertainty quantification by modeling the posterior distribution of network parameters. However, exact posterior inference is computationally intractable, and widely used approximations like the Laplace method struggle with scalability and posterior accuracy in modern deep networks. In this work, we revisit sampling techniques for posterior exploration, proposing a simple variation tailored to efficiently sample from the posterior in over-parameterized networks by leveraging the low-dimensional structure of loss minima. Building on this, we introduce a model that learns a deformation of the parameter space, enabling rapid posterior sampling without requiring iterative methods. Empirical results demonstrate that our approach achieves competitive posterior approximations with improved scalability compared to recent refinement techniques. These contributions provide a practical alternative for Bayesian inference in deep learning.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ali Nafisi, Sina Asghari, Mohammad Saeed Arvenaghi, Hossein Shakibania
Machine Learning
ๆฅ็ๆ่ฆ
This report presents solutions to three machine learning challenges: compositional image retrieval, zero-shot anomaly detection, and backdoored model detection. In compositional image retrieval, we developed a system that processes visual and textual inputs to retrieve relevant images, achieving 95.38\% accuracy and ranking first with a clear margin over the second team. For zero-shot anomaly detection, we designed a model that identifies and localizes anomalies in images without prior exposure to abnormal examples, securing 1st place with 73.14\% accuracy. In the backdoored model detection task, we proposed a method to detect hidden backdoor triggers in neural networks, reaching an accuracy of 78\%, which placed our approach in second place. These results demonstrate the effectiveness of our methods in addressing key challenges related to retrieval, anomaly detection, and model security, with implications for real-world applications in industries such as healthcare, manufacturing, and cybersecurity. Code for all solutions is available online.
๐
2025-12-01
๐ ArXiv Preprint (cs.DM)
๐ฅ Tuguldur Bayarsaikhan, Altannar Chinchuluun, Ashwin Arulselvan, Panos Pardalos
Robotics
ๆฅ็ๆ่ฆ
Given a network, the critical node detection problem finds a subset of nodes whose removal disrupts the network connectivity. Since many real-world systems are naturally modeled as graphs, assessing the vulnerability of the network is essential, with applications in transportation systems, traffic forecasting, epidemic control, and biological networks. In this paper, we consider a stochastic version of the critical node detection problem, where the existence of edges is given by certain probabilities. We propose heuristics and learning-based methods for the problem and compare them with existing algorithms. Experimental results performed on random graphs from small to larger scales, with edge-survival probabilities drawn from different distributions, demonstrate the effectiveness of the methods. Heuristic methods often illustrate the strongest results with high scalability, while learning-based methods maintain nearly constant inference time as the network size and density grow.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jindi Lv, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang, Guan Huang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Chain-of-Thought (CoT) reasoning has significantly advanced the problem-solving capabilities of Large Language Models (LLMs), yet conventional CoT often exhibits internal determinism during decoding, limiting exploration of plausible alternatives. Recent methods attempt to address this by generating soft abstract tokens to enable reasoning in a continuous semantic space. However, we find that such approaches remain constrained by the greedy nature of autoregressive decoding, which fundamentally isolates the model from alternative reasoning possibilities. In this work, we propose Multi-Path Perception Policy Optimization (M3PO), a novel reinforcement learning framework that explicitly injects collective insights into the reasoning process. M3PO leverages parallel policy rollouts as naturally diverse reasoning sources and integrates cross-path interactions into policy updates through a lightweight collaborative mechanism. This design allows each trajectory to refine its reasoning with peer feedback, thereby cultivating more reliable multi-step reasoning patterns. Empirical results show that M3PO achieves state-of-the-art performance on both knowledge- and reasoning-intensive benchmarks. Models trained with M3PO maintain interpretability and inference efficiency, underscoring the promise of multi-path collaborative learning for robust reasoning.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gwendal Debaussart-Joniec, Argyris Kalogeratos
ๆฅ็ๆ่ฆ
This paper introduces a comprehensive unified framework for constructing multi-view diffusion geometries through intertwined multi-view diffusion trajectories (MDTs), a class of inhomogeneous diffusion processes that iteratively combine the random walk operators of multiple data views. Each MDT defines a trajectory-dependent diffusion operator with a clear probabilistic and geometric interpretation, capturing over time the interplay between data views. Our formulation encompasses existing multi-view diffusion models, while providing new degrees of freedom for view interaction and fusion. We establish theoretical properties under mild assumptions, including ergodicity of both the point-wise operator and the process in itself. We also derive MDT-based diffusion distances, and associated embeddings via singular value decompositions. Finally, we propose various strategies for learning MDT operators within the defined operator space, guided by internal quality measures. Beyond enabling flexible model design, MDTs also offer a neutral baseline for evaluating diffusion-based approaches through comparison with randomly selected MDTs. Experiments show the practical impact of the MDT operators in a manifold learning and data clustering context.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dan Qiao, Yu-Xiang Wang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We consider the problem of generalization of arbitrarily overparameterized two-layer ReLU Neural Networks with univariate input. Recent work showed that under square loss, flat solutions (motivated by flat / stable minima and Edge of Stability phenomenon) provably cannot overfit, but it remains unclear whether the same phenomenon holds for logistic loss. This is a puzzling open problem because existing work on logistic loss shows that gradient descent with increasing step size converges to interpolating solutions (at infinity, for the margin-separable cases). In this paper, we prove that the \emph{flatness implied generalization} is more delicate under logistic loss. On the positive side, we show that flat solutions enjoy near-optimal generalization bounds within a region between the left-most and right-most \emph{uncertain} sets determined by each candidate solution. On the negative side, we show that there exist arbitrarily flat yet overfitting solutions at infinity that are (falsely) certain everywhere, thus certifying that flatness alone is insufficient for generalization in general. We demonstrate the effects predicted by our theory in a well-controlled simulation study.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fabian Kresse, Christoph H. Lampert
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We investigate whether continuous-control policies can be represented and learned as discrete logic circuits instead of continuous neural networks. We introduce Differentiable Weightless Controllers (DWCs), a symbolic-differentiable architecture that maps real-valued observations to actions using thermometer-encoded inputs, sparsely connected boolean lookup-table layers, and lightweight action heads. DWCs can be trained end-to-end by gradient-based techniques, yet compile directly into FPGA-compatible circuits with few- or even single-clock-cycle latency and nanojoule-level energy cost per action. Across five MuJoCo benchmarks, including high-dimensional Humanoid, DWCs achieve returns competitive with weight-based policies (full precision or quantized neural networks), matching performance on four tasks and isolating network capacity as the key limiting factor on HalfCheetah. Furthermore, DWCs exhibit structurally sparse and interpretable connectivity patterns, enabling a direct inspection of which input thresholds influence control decisions.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongnan Si, Tong Li, Yujie Chen, Xin Liao
Machine Learning
ๆฅ็ๆ่ฆ
Water quality monitoring is a core component of ecological environmental protection. However, due to sensor failure or other inevitable factors, data missing often exists in long-term monitoring, posing great challenges in water quality analysis. This paper proposes a Neural Tucker Convolutional Network (NTCN) model for water quality data imputation, which features the following key components: a) Encode different mode entities into respective embedding vectors, and construct a Tucker interaction tensor by outer product operations to capture the complex mode-wise feature interactions; b) Use 3D convolution to extract fine-grained spatiotemporal features from the interaction tensor. Experiments on three real-world water quality datasets show that the proposed NTCN model outperforms several state-of-the-art imputation models in terms of accuracy.
๐
2025-12-01
๐ ArXiv Preprint (cs.AR)
๐ฅ Jan-Frederik Schulte, Benjamin Ramhorst, Chang Sun, Jovan Mitrevski, Nicolรฒ Ghielmetti et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present hls4ml, a free and open-source platform that translates machine learning (ML) models from modern deep learning frameworks into high-level synthesis (HLS) code that can be integrated into full designs for field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). With its flexible and modular design, hls4ml supports a large number of deep learning frameworks and can target HLS compilers from several vendors, including Vitis HLS, Intel oneAPI and Catapult HLS. Together with a wider eco-system for software-hardware co-design, hls4ml has enabled the acceleration of ML inference in a wide range of commercial and scientific applications where low latency, resource usage, and power consumption are critical. In this paper, we describe the structure and functionality of the hls4ml platform. The overarching design considerations for the generated HLS code are discussed, together with selected performance results.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kuangpu Guo, Yuhe Ding, Jian Liang, Zilei Wang, Ran He
ๆฅ็ๆ่ฆ
Model merging has emerged as a promising paradigm for enabling multi-task capabilities without additional training. However, existing methods often experience substantial performance degradation compared with individually fine-tuned models, even on similar tasks, underscoring the need to preserve task-specific information. This paper proposes Decomposition, Thresholding, and Scaling (DTS), an approximation-based personalized merging framework that preserves task-specific information with minimal storage overhead. DTS first applies singular value decomposition to the task-specific information and retains only a small subset of singular values and vectors. It then introduces a novel thresholding strategy that partitions singular vector elements into groups and assigns a scaling factor to each group. To enable generalization to unseen tasks, we further extend DTS with a variant that fuses task-specific information in a data-free manner based on the semantic similarity of task characteristics. Extensive experiments demonstrate that DTS consistently outperforms state-of-the-art baselines while requiring only 1\% additional storage per task. Furthermore, experiments on unseen tasks show that the DTS variant achieves significantly better generalization performance. Our code is available at https://github.com/krumpguo/DTS.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jozef Kubรญk, Marek ล uppa, Martin Takรกฤ
Natural Language Processing
ๆฅ็ๆ่ฆ
Limited data for low-resource languages typically yield weaker language models (LMs). Since pre-training is compute-intensive, it is more pragmatic to target improvements during fine-tuning. In this work, we examine the use of Active Learning (AL) methods augmented by structured data selection strategies which we term 'Active Learning schedulers', to boost the fine-tuning process with a limited amount of training data. We connect the AL to data clustering and propose an integrated fine-tuning pipeline that systematically combines AL, clustering, and dynamic data selection schedulers to enhance model's performance. Experiments in the Slovak, Maltese, Icelandic and Turkish languages show that the use of clustering during the fine-tuning phase together with AL scheduling can simultaneously produce annotation savings up to 30% and performance improvements up to four F1 score points, while also providing better fine-tuning stability.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length -- no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility which is the linear combination of the expected maximum reward, total compute, and latency of set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Xabier de Zuazo, Ibon Saratxaga, Eva Navas
ๆฅ็ๆ่ฆ
We present Conformer-based decoders for the LibriBrain 2025 PNPL competition, targeting two foundational MEG tasks: Speech Detection and Phoneme Classification. Our approach adapts a compact Conformer to raw 306-channel MEG signals, with a lightweight convolutional projection layer and task-specific heads. For Speech Detection, a MEG-oriented SpecAugment provided a first exploration of MEG-specific augmentation. For Phoneme Classification, we used inverse-square-root class weighting and a dynamic grouping loader to handle 100-sample averaged examples. In addition, a simple instance-level normalization proved critical to mitigate distribution shifts on the holdout split. Using the official Standard track splits and F1-macro for model selection, our best systems achieved 88.9% (Speech) and 65.8% (Phoneme) on the leaderboard, surpassing the competition baselines and ranking within the top-10 in both tasks. For further implementation details, the technical documentation, source code, and checkpoints are available at https://github.com/neural2speech/libribrain-experiments.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Oguz Bedir, Nurullah Sevim, Mostafa Ibrahim, Sabit Ekin
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent breakthroughs in natural language processing show that attention mechanism in Transformer networks, trained via masked-token prediction, enables models to capture the semantic context of the tokens and internalize the grammar of language. While the application of Transformers to communication systems is a burgeoning field, the notion of context within physical waveforms remains under-explored. This paper addresses that gap by re-examining inter-symbol contribution (ISC) caused by pulse-shaping overlap. Rather than treating ISC as a nuisance, we view it as a deterministic source of contextual information embedded in oversampled complex baseband signals. We propose Masked Symbol Modeling (MSM), a framework for the physical (PHY) layer inspired by Bidirectional Encoder Representations from Transformers methodology. In MSM, a subset of symbol aligned samples is randomly masked, and a Transformer predicts the missing symbol identifiers using the surrounding "in-between" samples. Through this objective, the model learns the latent syntax of complex baseband waveforms. We illustrate MSM's potential by applying it to the task of demodulating signals corrupted by impulsive noise, where the model infers corrupted segments by leveraging the learned context. Our results suggest a path toward receivers that interpret, rather than merely detect communication signals, opening new avenues for context-aware PHY layer design.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Valentin Duruisseaux, Jean Kossaifi, Anima Anandkumar
ๆฅ็ๆ่ฆ
Partial differential equations (PDEs) govern a wide variety of dynamical processes in science and engineering, yet obtaining their numerical solutions often requires high-resolution discretizations and repeated evaluations of complex operators, leading to substantial computational costs. Neural operators have recently emerged as a powerful framework for learning mappings between function spaces directly from data, enabling efficient surrogate models for PDE systems. Among these architectures, the Fourier Neural Operator (FNO) has become the most influential and widely adopted due to its elegant spectral formulation, which captures global correlations through learnable transformations in Fourier space while remaining invariant to discretization and resolution. Despite their success, the practical use of FNOs is often hindered by an incomplete understanding among practitioners of their theoretical foundations, practical constraints, and implementation details, which can lead to their incorrect or unreliable application. This work presents a comprehensive and practice-oriented guide to FNOs, unifying their mathematical principles with implementation strategies. We provide an intuitive exposition to the concepts of operator theory and signal-processing that underlie the FNO, detail its spectral parameterization and the computational design of all its components, and address common misunderstandings encountered in the literature. The exposition is closely integrated with the NeuralOperator 2.0.0 library, offering modular state-of-the-art implementations that faithfully reflect the theory. By connecting rigorous foundations with practical insight, this guide aims to establish a clear and reliable framework for applying FNOs effectively across diverse scientific and engineering fields.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziqian Wang, Yuxiao Cheng, Jinli Suo
Machine Learning
ๆฅ็ๆ่ฆ
Explainability is essential for neural networks that model long time series, yet most existing explainable AI methods only produce point-wise importance scores and fail to capture temporal structures such as trends, cycles, and regime changes. This limitation weakens human interpretability and trust in long-horizon models. To address these issues, we identify four key requirements for interpretable time-series modeling: temporal continuity, pattern-centric explanation, causal disentanglement, and faithfulness to the model's inference process. We propose EXCAP, a unified framework that satisfies all four requirements. EXCAP combines an attention-based segmenter that extracts coherent temporal patterns, a causally structured decoder guided by a pre-trained causal graph, and a latent aggregation mechanism that enforces representation stability. Our theoretical analysis shows that EXCAP provides smooth and stable explanations over time and is robust to perturbations in causal masks. Extensive experiments on classification and forecasting benchmarks demonstrate that EXCAP achieves strong predictive accuracy while generating coherent and causally grounded explanations. These results show that EXCAP offers a principled and scalable approach to interpretable modeling of long time series with relevance to high-stakes domains such as healthcare and finance.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Benjamin Ramtoula, Pierre-Yves Lajoie, Paul Newman, Daniele De Martini
Natural Language Processing
ๆฅ็ๆ่ฆ
Foundation models (FMs) trained with different objectives and data learn diverse representations, making some more effective than others for specific downstream tasks. Existing adaptation strategies, such as parameter-efficient fine-tuning, focus on individual models and do not exploit the complementary strengths across models. Probing methods offer a promising alternative by extracting information from frozen models, but current techniques do not scale well with large feature sets and often rely on dataset-specific hyperparameter tuning. We propose Combined backBones (ComBo), a simple and scalable probing-based adapter that effectively integrates features from multiple models and layers. ComBo compresses activations from layers of one or more FMs into compact token-wise representations and processes them with a lightweight transformer for task-specific prediction. Crucially, ComBo does not require dataset-specific tuning or backpropagation through the backbone models. However, not all models are equally relevant for all tasks. To address this, we introduce a mechanism that leverages ComBo's joint multi-backbone probing to efficiently evaluate each backbone's task-relevance, enabling both practical model comparison and improved performance through selective adaptation. On the 19 tasks of the VTAB-1k benchmark, ComBo outperforms previous probing methods, matches or surpasses more expensive alternatives, such as distillation-based model merging, and enables efficient probing of tuned models. Our results demonstrate that ComBo offers a practical and general-purpose framework for combining diverse representations from multiple FMs.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Paula Harder, Christian Lessig, Matthew Chantry, Francis Pelletier, David Rolnick
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning offers promising capabilities for the statistical downscaling of climate and weather forecasts, with generative approaches showing particular success in capturing fine-scale precipitation patterns. However, most existing models are region-specific, and their ability to generalize to unseen geographic areas remains largely unexplored. In this study, we evaluate the generalization performance of generative downscaling models across diverse regions. Using a global framework, we employ ERA5 reanalysis data as predictors and IMERG precipitation estimates at $0.1^\circ$ resolution as targets. A hierarchical location-based data split enables a systematic assessment of model performance across 15 regions around the world.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ali Forootani, Mohammad Sadr, Danial Esmaeili Aliabadi, Daniela Thraen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Energy system models are increasingly employed to guide long-term planning in multi-sectoral environments where decisions span electricity, heat, transport, land use, and industry. While these models provide rigorous quantitative insights, their outputs are often highly technical, making them difficult to interpret for non-expert stakeholders such as policymakers, planners, and the public. This communication gap limits the accessibility and practical impact of scenario-based modeling, particularly as energy transitions grow more complex with rising shares of renewables, sectoral integration, and deep uncertainties. To address this challenge, we propose the Renewable Energy Large Language Model (RE-LLM), a hybrid framework that integrates Large Language Models (LLMs) directly into the energy system modeling workflow. RE-LLM combines three core elements: (i) optimization-based scenario exploration, (ii) machine learning surrogates that accelerate computationally intensive simulations, and (iii) LLM-powered natural language generation that translates complex results into clear, stakeholder-oriented explanations. This integrated design not only reduces computational burden but also enhances inter-pretability, enabling real-time reasoning about trade-offs, sensitivities, and policy implications. The framework is adaptable across different optimization platforms and energy system models, ensuring broad applicability beyond the case study presented. By merging speed, rigor, and interpretability, RE-LLM advances a new paradigm of human-centric energy modeling. It enables interactive, multilingual, and accessible engagement with future energy pathways, ultimately bridging the final gap between data-driven analysis and actionable decision-making for sustainable transitions.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoyu Lei, Chin Wa Lau, Kaiwen Zhou, Nian Guo, Farzan Farnia
Natural Language Processing
ๆฅ็ๆ่ฆ
Error Correction Codes (ECC) are fundamental to reliable digital communication, yet designing neural decoders that are both accurate and computationally efficient remains challenging. Recent denoising diffusion decoders with transformer backbones achieve state-of-the-art performance, but their iterative sampling limits practicality in low-latency settings. We introduce the Error Correction Consistency Flow Model (ECCFM), an architecture-agnostic training framework for high-fidelity one-step decoding. By casting the reverse denoising process as a Probability Flow Ordinary Differential Equation (PF-ODE) and enforcing smoothness through a differential time regularization, ECCFM learns to map noisy signals along the decoding trajectory directly to the original codeword in a single inference step. Across multiple decoding benchmarks, ECCFM attains lower bit-error rates (BER) than autoregressive and diffusion-based baselines, with notable improvements on longer codes, while delivering inference speeds up from 30x to 100x faster than denoising diffusion decoders.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
ๆฅ็ๆ่ฆ
We present CLAPS, a posterior-aware conformal regression method that pairs a Last-Layer Laplace Approximation with split-conformal calibration. From the resulting Gaussian posterior, CLAPS defines a simple two-sided posterior CDF score that aligns the conformity metric with the full predictive shape, not just a point estimate. This alignment yields narrower prediction intervals at the same target coverage, especially on small to medium tabular datasets where data are scarce and uncertainty modeling matters. We also provide a lightweight diagnostic suite that separates aleatoric and epistemic components and visualizes posterior behavior, helping practitioners understand why intervals shrink when they do. Across multiple benchmarks using the same MLP backbone, CLAPS consistently attains nominal coverage with improved efficiency and minimal overhead, offering a clear, practical upgrade to residual-based conformal baselines.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Medha Sawhney, Abhilash Neog, Mridul Khurana, Anuj Karpatne
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion-based solvers for partial differential equations (PDEs) are often bottle-necked by slow gradient-based test-time optimization routines that use PDE residuals for loss guidance. They additionally suffer from optimization instabilities and are unable to dynamically adapt their inference scheme in the presence of noisy PDE residuals. To address these limitations, we introduce PRISMA (PDE Residual Informed Spectral Modulation with Attention), a conditional diffusion neural operator that embeds PDE residuals directly into the model's architecture via attention mechanisms in the spectral domain, enabling gradient-descent free inference. In contrast to previous methods that use PDE loss solely as external optimization targets, PRISMA integrates PDE residuals as integral architectural features, making it inherently fast, robust, accurate, and free from sensitive hyperparameter tuning. We show that PRISMA has competitive accuracy, at substantially lower inference costs, compared to previous methods across five benchmark PDEs, especially with noisy observations, while using 10x to 100x fewer denoising steps, leading to 15x to 250x faster inference.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinyu Jiang, Cuiyun Gao, Wenda Huang, Yiyang Jiang, Binwen Luo et al.
Robotics
ๆฅ็ๆ่ฆ
Background: Impairment of visual spatial cognitive function is the most common early clinical manifestation of Alzheimer's Disease (AD). When the Montreal Cognitive Assessment (MoCA) uses the "0/1" binary method ("pass/fail") to evaluate the visual spatial cognitive ability represented by the Cube Copying Test(CCT), the elder with less formal education generally score 0 point, resulting in serious bias in the evaluation results. Therefore, this study proposes a fine evaluation method for CCT based on dynamic handwriting feature extraction of DH-SCSM-BLA. method : The Cogni-CareV3.0 software independently developed by our team was used to collect dynamic handwriting data of CCT. Then, the spatial and motion features of segmented dynamic handwriting were extracted, and feature matrix with unequal dimensions were normalized. Finally, a bidirectional long short-term memory network model combined with attention mechanism (BiLSTM-Attention) was adopted for classification. Result: The experimental results showed that: The proposed method has significant superiority compared to similar studies, with a classification accuracy of 86.69%. The distribution of cube drawing ability scores has significant regularity for three aspects such as MCI patients and healthy control group, age, and levels of education. It was also found that score for each cognitive task including cube drawing ability score is negatively correlated with age. Score for each cognitive task including cube drawing ability score, but positively correlated with levels of education significantly. Conclusion: This study provides a relatively objective and comprehensive evaluation method for early screening and personalized intervention of visual spatial cognitive impairment.
๐
2025-12-01
๐ This is the author-accepted version of the paper p
๐ฅ Yunzhe Li, Jiajun Yan, Yuzhou Wei, Kechen Liu, Yize Zhao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Failing to be aware of speeding vehicles approaching from behind poses a huge threat to the road safety of pedestrians and cyclists. In this paper, we propose BlinkBud, which utilizes a single earbud and a paired phone to online detect hazardous objects approaching from behind of a user. The core idea is to accurately track visually identified objects utilizing a small number of sampled camera images taken from the earbud. To minimize the power consumption of the earbud and the phone while guaranteeing the best tracking accuracy, a novel 3D object tracking algorithm is devised, integrating both a Kalman filter based trajectory estimation scheme and an optimal image sampling strategy based on reinforcement learning. Moreover, the impact of constant user head movements on the tracking accuracy is significantly eliminated by leveraging the estimated pitch and yaw angles to correct the object depth estimation and align the camera coordinate system to the user's body coordinate system, respectively. We implement a prototype BlinkBud system and conduct extensive real-world experiments. Results show that BlinkBud is lightweight with ultra-low mean power consumptions of 29.8 mW and 702.6 mW on the earbud and smartphone, respectively, and can accurately detect hazards with a low average false positive ratio (FPR) and false negative ratio (FNR) of 4.90% and 1.47%, respectively.
๐
2025-12-01
๐ ArXiv Preprint (quant-ph)
๐ฅ Swati Kumari, Shiva Raj Pokhrel, Swathi Chandrasekhar, Navneet Singh, Hridoy Sankar Dutta et al.
ๆฅ็ๆ่ฆ
Network traffic anomaly detection is a critical cy- bersecurity challenge requiring robust solutions for complex Internet of Things (IoT) environments. We present a novel hybrid quantum-classical framework integrating an enhanced Quantum Support Vector Machine (QSVM) with the Quantum Haar Wavelet Packet Transform (QWPT) for superior anomaly classification under realistic noisy intermediate-scale Quantum conditions. Our methodology employs amplitude-encoded quan- tum state preparation, multi-level QWPT feature extraction, and behavioral analysis via Shannon Entropy profiling and Chi-square testing. Features are classified using QSVM with fidelity-based quantum kernels optimized through hybrid train- ing with simultaneous perturbation stochastic approximation (SPSA) optimizer. Evaluation under noiseless and depolarizing noise conditions demonstrates exceptional performance: 96.67% accuracy on BoT-IoT and 89.67% on IoT-23 datasets, surpassing quantum autoencoder approaches by over 7 percentage points.
๐
2025-12-01
๐ ArXiv Preprint (cs.MA)
๐ฅ Jiaguo Tian, Zhengbang Zhu, Shenyu Zhang, Li Xu, Bo Zheng et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The generation of realistic and diverse traffic scenarios in simulation is essential for developing and evaluating autonomous driving systems. However, most simulation frameworks rely on rule-based or simplified models for scene generation, which lack the fidelity and diversity needed to represent real-world driving. While recent advances in generative modeling produce more realistic and context-aware traffic interactions, they often overlook how social preferences influence driving behavior. SocialDriveGen addresses this gap through a hierarchical framework that integrates semantic reasoning and social preference modeling with generative trajectory synthesis. By modeling egoism and altruism as complementary social dimensions, our framework enables controllable diversity in driver personalities and interaction styles. Experiments on the Argoverse 2 dataset show that SocialDriveGen generates diverse, high-fidelity traffic scenarios spanning cooperative to adversarial behaviors, significantly enhancing policy robustness and generalization to rare or high-risk situations.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yanlin Wang, Nancy M Young, Patrick C M Wong
Machine Learning
ๆฅ็ๆ่ฆ
Neural prediction offers a promising approach to forecasting the individual variability of neurocognitive functions and disorders and providing prognostic indicators for personalized invention. However, it is challenging to translate neural predictive models into medical artificial intelligent applications due to the limitations of domain shift and label scarcity. Here, we propose the directed evolution model (DEM), a novel computational model that mimics the trial-and-error processes of biological directed evolution to approximate optimal solutions for predictive modeling tasks. We demonstrated that the directed evolution algorithm is an effective strategy for uncertainty exploration, enhancing generalization in reinforcement learning. Furthermore, by incorporating replay buffer and continual backpropagate methods into DEM, we provide evidence of achieving better trade-off between exploitation and exploration in continuous learning settings. We conducted experiments on four different datasets for children with cochlear implants whose spoken language developmental outcomes vary considerably on the individual-child level. Preoperative neural MRI data has shown to accurately predict the post-operative outcome of these children within but not across datasets. Our results show that DEM can efficiently improve the performance of cross-domain pre-implantation neural predictions while addressing the challenge of label scarcity in target domain.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Junsung Park, Hogun Kee, Songhwai Oh
Robotics
ๆฅ็ๆ่ฆ
This paper presents a modality-augmented fine-tuning framework designed to adapt foundation robot policies to diverse humanoid embodiments. We validate our approach across two distinct settings: (i) the GR1 embodiment, utilizing public datasets where we introduce post-processed modalities, including binary contact signals and ZoeDepth-generated metric depth; and (ii) the Unitree G1 embodiment, for which we contribute a novel multi-modal dataset incorporating cuRobo motion planning, inverse kinematics, and ground-truth contact-force measurements. Our experiments demonstrate that modality augmentation consistently enhances policy performance across different embodiments. Specifically, for the GR1, integrating contact-state cues and RGB-D fusion improves online success rates from 51% to 63%. Furthermore, in the G1 "Pick Apple to Bowl" task, our contact-augmented model achieves a success rate of 94%, significantly outperforming the 48% achieved by standard fine-tuning and the 0% baseline of zero-shot transfer. These results highlight that lightweight post-processing effectively strengthens policies for GR1, while high-quality multi-modal data is crucial for reliable transfer to the Unitree G1. Consequently, this work establishes a unified, data-centric pathway for extending foundation robot policies through targeted modality design and multi-modal fine-tuning.
๐
2025-12-01
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhongjie Jiang
ๆฅ็ๆ่ฆ
Although synthetic data is widely promoted as a remedy, its prevailing production paradigm -- one optimizing for statistical smoothness -- systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse.
This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators.
The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33.
Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.
๐
2025-12-01
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shashank Landge, Abhishek Patil, Tejas kamble, Bhushan Buddhivant, Priyanka Joshi
Natural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLMs) continue to scale in parameter count, deploying them on commodity hardware has become increasingly challenging. Post-Training Quantization (PTQ) addresses this by reducing the precision of model weights, typically to 4-bit or lower. However, uniform quantization often leads to significant performance degradation due to the presence of ``outlier features'' -- weights that, while few in number, are critical for maintaining model accuracy. Current state-of-the-art methods such as AWQ (Activation-aware Weight Quantization) and SpQR (Sparse Quantization Representations) rely on calibration data to identify these salient weights via activation magnitudes or Hessian sensitivity. In scenarios where data privacy is paramount or calibration data is unavailable, these methods are inapplicable.
In this work, we propose a data-free, structure-aware hypothesis: that the weights identified as Principal Components via Singular Value Decomposition (SVD) are intrinsically important to the model's downstream performance. We introduce a novel selection heuristic that preserves the top-$k$ weights aligned with the principal components in FP32, while aggressively quantizing the residual weights. We compare our method against activation-aware (AWQ) and second-order (SpQR) methods across GLUE benchmarks (MRPC, RTE, QNLI) using a DistilBERT backbone. Our experiments reveal that structural importance is highly correlated with functional importance. On the challenging RTE task, our SVD-based method achieves an accuracy of 66.06\%, outperforming both AWQ (65.34\%) and SpQR (65.34\%) at high protection budgets, validating that intrinsic matrix structure can serve as a robust proxy for weight saliency without the need for forward passes or calibration data.
๐
2025-12-01
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Stefano Zeppieri
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel at generating coherent text within a single prompt but fall short in sustaining relevance, personalization, and continuity across extended interactions. Human communication, however, relies on multiple forms of memory, from recalling past conversations to adapting to personal traits and situational context. This paper introduces the Mixed Memory-Augmented Generation (MMAG) pattern, a framework that organizes memory for LLM-based agents into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory. Drawing inspiration from cognitive psychology, we map these layers to technical components and outline strategies for coordination, prioritization, and conflict resolution. We demonstrate the approach through its implementation in the Heero conversational agent, where encrypted long-term bios and conversational history already improve engagement and retention. We further discuss implementation concerns around storage, retrieval, privacy, and latency, and highlight open challenges. MMAG provides a foundation for building memory-rich language agents that are more coherent, proactive, and aligned with human needs.
๐
2025-12-01
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Wei Yang, Rui Zhong, Yiqun Chen, Chi Lu, Peng Jiang
Multimodal
ๆฅ็ๆ่ฆ
Multimodal recommendation aims to integrate collaborative signals with heterogeneous content such as visual and textual information, but remains challenged by modality-specific noise, semantic inconsistency, and unstable propagation over user-item graphs. These issues are often exacerbated by naive fusion or shallow modeling strategies, leading to degraded generalization and poor robustness. While recent work has explored the frequency domain as a lens to separate stable from noisy signals, most methods rely on static filtering or reweighting, lacking the ability to reason over spectral structure or adapt to modality-specific reliability. To address these challenges, we propose a Structured Spectral Reasoning (SSR) framework for frequency-aware multimodal recommendation. Our method follows a four-stage pipeline: (i) Decompose graph-based multimodal signals into spectral bands via graph-guided transformations to isolate semantic granularity; (ii) Modulate band-level reliability with spectral band masking, a training-time masking with a prediction-consistency objective that suppresses brittle frequency components; (iii) Fuse complementary frequency cues using hyperspectral reasoning with low-rank cross-band interaction; and (iv) Align modality-specific spectral features via contrastive regularization to promote semantic and structural consistency. Experiments on three real-world benchmarks show consistent gains over strong baselines, particularly under sparse and cold-start settings. Additional analyses indicate that structured spectral modeling improves robustness and provides clearer diagnostics of how different bands contribute to performance.
๐
2025-12-01
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Shan Gao, Yanwu Yang
Natural Language Processing
ๆฅ็ๆ่ฆ
This research designs a unified architecture of CTR prediction benchmark (Bench-CTR) platform that offers flexible interfaces with datasets and components of a wide range of CTR prediction models. Moreover, we construct a comprehensive system of evaluation protocols encompassing real-world and synthetic datasets, a taxonomy of metrics, standardized procedures and experimental guidelines for calibrating the performance of CTR prediction models. Furthermore, we implement the proposed benchmark platform and conduct a comparative study to evaluate a wide range of state-of-the-art models from traditional multivariate statistical to modern large language model (LLM)-based approaches on three public datasets and two synthetic datasets. Experimental results reveal that, (1) high-order models largely outperform low-order models, though such advantage varies in terms of metrics and on different datasets; (2) LLM-based models demonstrate a remarkable data efficiency, i.e., achieving the comparable performance to other models while using only 2% of the training data; (3) the performance of CTR prediction models has achieved significant improvements from 2015 to 2016, then reached a stage with slow progress, which is consistent across various datasets. This benchmark is expected to facilitate model development and evaluation and enhance practitioners' understanding of the underlying mechanisms of models in the area of CTR prediction. Code is available at https://github.com/NuriaNinja/Bench-CTR.
๐
2025-12-01
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Tao Xue, Yanwu Yang, Panyu Zhai
ๆฅ็ๆ่ฆ
Conversion and conversion rate (CVR) prediction play a critical role in efficient advertising decision-making. In past decades, although researchers have developed plenty of models for CVR prediction, the methodological evolution and relationships between different techniques have been precluded. In this paper, we conduct a comprehensive literature review on CVR prediction in online advertising, and classify state-of-the-art CVR prediction models into six categories with respect to the underlying techniques and elaborate on connections between these techniques. For each category of models, we present the framework of underlying techniques, their advantages and disadvantages, and discuss how they are utilized for CVR prediction. Moreover, we summarize the performance of various CVR prediction models on public and proprietary datasets. Finally, we identify research trends, major challenges, and promising future directions. We observe that results of performance evaluation reported in prior studies are not unanimous; semantics-enriched, attribution-enhanced, debiased CVR prediction and jointly modeling CTR and CVR prediction would be promising directions to explore in the future. This review is expected to provide valuable references and insights for future researchers and practitioners in this area.
๐
2025-12-01
๐ AAAI 2026
๐ฅ Jianlei Chang, Ruofeng Mei, Wei Ke, Xiangyu Xu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Generative modeling has recently shown remarkable promise for visuomotor policy learning, enabling flexible and expressive control across diverse embodied AI tasks. However, existing generative policies often struggle with data inefficiency, requiring large-scale demonstrations, and sampling inefficiency, incurring slow action generation during inference. We introduce EfficientFlow, a unified framework for efficient embodied AI with flow-based policy learning. To enhance data efficiency, we bring equivariance into flow matching. We theoretically prove that when using an isotropic Gaussian prior and an equivariant velocity prediction network, the resulting action distribution remains equivariant, leading to improved generalization and substantially reduced data demands. To accelerate sampling, we propose a novel acceleration regularization strategy. As direct computation of acceleration is intractable for marginal flow trajectories, we derive a novel surrogate loss that enables stable and scalable training using only conditional trajectories. Across a wide range of robotic manipulation benchmarks, the proposed algorithm achieves competitive or superior performance under limited data while offering dramatically faster inference. These results highlight EfficientFlow as a powerful and efficient paradigm for high-performance embodied AI.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anbang Liu, Guanzhong Hu, Jiayi Wang, Ping Guo, Han Liu
Computer Vision
ๆฅ็ๆ่ฆ
Self-driving laboratories offer a promising path toward reducing the labor-intensive, time-consuming, and often irreproducible workflows in the biological sciences. Yet their stringent precision requirements demand highly robust models whose training relies on large amounts of annotated data. However, this kind of data is difficult to obtain in routine practice, especially negative samples. In this work, we focus on pipetting, the most critical and precision sensitive action in SDLs. To overcome the scarcity of training data, we build a hybrid pipeline that fuses real and virtual data generation. The real track adopts a human-in-the-loop scheme that couples automated acquisition with selective human verification to maximize accuracy with minimal effort. The virtual track augments the real data using reference-conditioned, prompt-guided image generation, which is further screened and validated for reliability. Together, these two tracks yield a class-balanced dataset that enables robust bubble detection training. On a held-out real test set, a model trained entirely on automatically acquired real images reaches 99.6% accuracy, and mixing real and generated data during training sustains 99.4% accuracy while reducing collection and review load. Our approach offers a scalable and cost-effective strategy for supplying visual feedback data to SDL workflows and provides a practical solution to data scarcity in rare event detection and broader vision tasks.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Shaowei Liu, David Yifan Yao, Saurabh Gupta, Shenlong Wang
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Today, people can easily record memorable moments, ranging from concerts, sports events, lectures, family gatherings, and birthday parties with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving an median synchronization error below 50 ms.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Chenyang Gu, Jiaming Liu, Hao Chen, Runzhong Huang, Qingpo Wuwu et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating high-level planning with precise manipulation. Therefore, we aim to endow a VLA model with the capability to infer the "how" process from the "what" outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, position prompts, and textual instructions. Building upon these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert, where each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting, which automatically generates manual data for planning expert training. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32% higher than the previous hierarchical SOTA baseline on LEGO assembly and object rearrangement tasks.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Elvis Hsieh, Wen-Han Hsieh, Yen-Jen Wang, Toru Lin, Jitendra Malik et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning and sim-to-real transfer have made significant progress in dexterous manipulation. However, progress remains limited by the difficulty of simulating complex contact dynamics and multisensory signals, especially tactile feedback. In this work, we propose \ours, a sim-to-real framework that addresses these limitations and demonstrates its effectiveness on nut-bolt fastening and screwdriving with multi-fingered hands. The framework has three stages. First, we train reinforcement learning policies in simulation using simplified object models that lead to the emergence of correct finger gaits. We then use the learned policy as a skill primitive within a teleoperation system to collect real-world demonstrations that contain tactile and proprioceptive information. Finally, we train a behavior cloning policy that incorporates tactile sensing and show that it generalizes to nuts and screwdrivers with diverse geometries. Experiments across both tasks show high task progress ratios compared to direct sim-to-real transfer and robust performance even on unseen object shapes and under external perturbations. Videos and code are available on https://dexscrew.github.io.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Wenhao Wang, Yanyan Li, Long Jiao, Jiawei Yuan
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Large language models (LLMs) have demonstrated their promising capabilities of generating robot operation code to enable LLM-driven robots. To enhance the reliability of operation code generated by LLMs, corrective designs with feedback from the observation of executing code have been increasingly adopted in existing research. However, the code execution in these designs relies on either a physical experiment or a customized simulation environment, which limits their deployment due to the high configuration effort of the environment and the potential long execution time. In this paper, we explore the possibility of directly leveraging LLM to enable static simulation of robot operation code, and then leverage it to design a new reliable LLM-driven corrective robot operation code generation framework. Our framework configures the LLM as a static simulator with enhanced capabilities that reliably simulate robot code execution by interpreting actions, reasoning over state transitions, analyzing execution outcomes, and generating se- mantic observations that accurately capture trajectory dynamics. To validate the performance of our framework, we performed experiments on various operation tasks for different robots, including UAVs and small ground vehicles. The experiment results not only demonstrated the high accuracy of our static text-based simulation but also the reliable code generation of our LLM-driven corrective framework, which achieves a comparable performance with state-of-the-art research while does not rely on dynamic code execution using physical experiments or simulators.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Younggyo Seo, Carmelo Sferrazza, Juyue Chen, Guanya Shi, Rocky Duan et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Massively parallel simulation has reduced reinforcement learning (RL) training time for robots from days to minutes. However, achieving fast and reliable sim-to-real RL for humanoid control remains difficult due to the challenges introduced by factors such as high dimensionality and domain randomization. In this work, we introduce a simple and practical recipe based on off-policy RL algorithms, i.e., FastSAC and FastTD3, that enables rapid training of humanoid locomotion policies in just 15 minutes with a single RTX 4090 GPU. Our simple recipe stabilizes off-policy RL algorithms at massive scale with thousands of parallel environments through carefully tuned design choices and minimalist reward functions. We demonstrate rapid end-to-end learning of humanoid locomotion controllers on Unitree G1 and Booster T1 robots under strong domain randomization, e.g., randomized dynamics, rough terrain, and push perturbations, as well as fast training of whole-body human-motion tracking policies. We provide videos and open-source implementation at: https://younggyo.me/fastsac-humanoid.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Guillermo Garcia-Cobo, Maximilian Igl, Peter Karkus, Zhejun Zhang, Michael Watson et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous driving policies are typically trained via open-loop behavior cloning of human demonstrations. However, such policies suffer from covariate shift when deployed in closed loop, leading to compounding errors. We introduce Rollouts as Demonstrations (RoaD), a simple and efficient method to mitigate covariate shift by leveraging the policy's own closed-loop rollouts as additional training data. During rollout generation, RoaD incorporates expert guidance to bias trajectories toward high-quality behavior, producing informative yet realistic demonstrations for fine-tuning. This approach enables robust closed-loop adaptation with orders of magnitude less data than reinforcement learning, and avoids restrictive assumptions of prior closed-loop supervised fine-tuning (CL-SFT) methods, allowing broader applications domains including end-to-end driving. We demonstrate the effectiveness of RoaD on WOSAC, a large-scale traffic simulation benchmark, where it performs similar or better than the prior CL-SFT method; and in AlpaSim, a high-fidelity neural reconstruction-based simulator for end-to-end driving, where it improves driving score by 41\% and reduces collisions by 54\%.
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop
Machine Learning
ๆฅ็ๆ่ฆ
Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent's experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data.
๐
2025-12-01
๐ Accepted for publication in IEEE Robotics and Auto
๐ฅ Kentaro Fujii, Shingo Murata
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Robots in uncertain real-world environments must perform both goal-directed and exploratory actions. However, most deep learning-based control methods neglect exploration and struggle under uncertainty. To address this, we adopt deep active inference, a framework that accounts for human goal-directed and exploratory actions. Yet, conventional deep active inference approaches face challenges due to limited environmental representation capacity and high computational cost in action selection. We propose a novel deep active inference framework that consists of a world model, an action model, and an abstract world model. The world model encodes environmental dynamics into hidden state representations at slow and fast timescales. The action model compresses action sequences into abstract actions using vector quantization, and the abstract world model predicts future slow states conditioned on the abstract action, enabling low-cost action selection. We evaluate the framework on object-manipulation tasks with a real-world robot. Results show that it achieves high success rates across diverse manipulation tasks and switches between goal-directed and exploratory actions in uncertain settings, while making action selection computationally tractable. These findings highlight the importance of modeling multiple timescale dynamics and abstracting actions and state transitions.
๐
2025-12-01
๐ Author-accepted version. Accepted at IEEE 11th Ind
๐ฅ Granthik Halder, Rudrashis Majumder, Rakshith M R, Rahi Shah, Suresh Sundaram
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous ground vehicles (AGVs) must navigate safely in cluttered environments while accounting for complex dynamics and environmental uncertainty. Hamilton-Jacobi Reachability (HJR) offers formal safety guarantees through the computation of forward and backward reachable sets, but its application is hindered by poor scalability in environments with numerous obstacles. In this paper, we present a novel framework called NeuroHJR that leverages Physics-Informed Neural Networks (PINNs) to approximate the HJR solution for real-time obstacle avoidance. By embedding system dynamics and safety constraints directly into the neural network loss function, our method bypasses the need for grid-based discretization and enables efficient estimation of reachable sets in continuous state spaces. We demonstrate the effectiveness of our approach through simulation results in densely cluttered scenarios, showing that it achieves safety performance comparable to that of classical HJR solvers while significantly reducing the computational cost. This work provides a new step toward real-time, scalable deployment of reachability-based obstacle avoidance in robotics.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Eric C. Joyce, Qianwen Zhao, Nathaniel Burgdorfer, Long Wang, Philippos Mordohai
Computer Vision
ๆฅ็ๆ่ฆ
We present a framework for evaluating 6-DoF instance-level object pose estimators, focusing on those that require a single RGB (not RGB-D) image as input. Besides gaining intuition about how accurate these estimators are, we are interested in the degree to which they can serve as the sole perception mechanism for robotic grasping. To assess this, we perform grasping trials in a physics-based simulator, using image-based pose estimates to guide a parallel gripper and an underactuated robotic hand in picking up 3D models of objects. Our experiments on a subset of the BOP (Benchmark for 6D Object Pose Estimation) dataset compare five open-source object pose estimators and provide insights that were missing from the literature.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yue Pan, Tao Sun, Liyuan Zhu, Lucas Nunes, Iro Armeni et al.
Computer Vision
ๆฅ็ๆ่ฆ
Point cloud registration aligns multiple unposed point clouds into a common frame, and is a core step for 3D reconstruction and robot localization. In this work, we cast registration as conditional generation: a learned continuous, point-wise velocity field transports noisy points to a registered scene, from which the pose of each view is recovered. Unlike previous methods that conduct correspondence matching to estimate the transformation between a pair of point clouds and then optimize the pairwise transformations to realize multi-view registration, our model directly generates the registered point cloud. With a lightweight local feature extractor and test-time rigidity enforcement, our approach achieves state-of-the-art results on pairwise and multi-view registration benchmarks, particularly with low overlap, and generalizes across scales and sensor modalities. It further supports downstream tasks including relocalization, multi-robot SLAM, and multi-session map merging. Source code available at: https://github.com/PRBonn/RAP.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Chaoyi Pan, Giri Anantharaman, Nai-Chieh Huang, Claire Jin, Daniel Pfrommer et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Generative models, like flows and diffusions, have recently emerged as popular and efficacious policy parameterizations in robotics. There has been much speculation as to the factors underlying their successes, ranging from capturing multi-modal action distribution to expressing more complex behaviors. In this work, we perform a comprehensive evaluation of popular generative control policies (GCPs) on common behavior cloning (BC) benchmarks. We find that GCPs do not owe their success to their ability to capture multi-modality or to express more complex observation-to-action mappings. Instead, we find that their advantage stems from iterative computation, as long as intermediate steps are supervised during training and this supervision is paired with a suitable level of stochasticity. As a validation of our findings, we show that a minimum iterative policy (MIP), a lightweight two-step regression-based policy, essentially matches the performance of flow GCPs, and often outperforms distilled shortcut models. Our results suggest that the distribution-fitting component of GCPs is less salient than commonly believed, and point toward new design spaces focusing solely on control performance. Project page: https://simchowitzlabpublic.github.io/much-ado-about-noising-project/
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui et al.
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Assuming the optimality of human demonstrations is core to existing VLA policies. However, we claim that in highly dexterous and precise manipulation tasks, human demonstrations are noisy and suboptimal. GR-RL proposes a multi-stage training pipeline that filters, augments, and reinforces the demonstrations by reinforcement learning. First, GR-RL learns a vision-language-conditioned task progress, filters the demonstration trajectories, and only keeps the transitions that contribute positively to the progress. Specifically, we show that by directly applying offline RL with sparse reward, the resulting $Q$-values can be treated as a robust progress function. Next, we introduce morphological symmetry augmentation that greatly improves the generalization and performance of GR-RL. Lastly, to better align the VLA policy with its deployment behaviors for high-precision control, we perform online RL by learning a latent space noise predictor. With this pipeline, GR-RL is, to our knowledge, the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundations models to specialize into reliable real-world experts.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Chenghao Gu, Haolan Kang, Junchao Lin, Jinghe Wang, Duo Wu et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
The rise of generalist robotic policies has created an exponential demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, the lack of associated robot actions hinders the practical use of open-world images for robot learning, leaving this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and generate low-level actions as SE(3) end-effector pose sequences. From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of visuomotor data generated by IGen, and show that policies trained solely on IGen-synthesized data achieve performance comparable to those trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Zhihao Zhan, Yuhang Ming, Shaobin Li, Jie Yuan
Robotics
ๆฅ็ๆ่ฆ
Multi-sensor Simultaneous Localization and Mapping (SLAM) is essential for Unmanned Aerial Vehicles (UAVs) performing agricultural tasks such as spraying, surveying, and inspection. However, real-world, multi-modal agricultural UAV datasets that enable research on robust operation remain scarce. To address this gap, we present AgriLiRa4D, a multi-modal UAV dataset designed for challenging outdoor agricultural environments. AgriLiRa4D spans three representative farmland types-flat, hilly, and terraced-and includes both boundary and coverage operation modes, resulting in six flight sequence groups. The dataset provides high-accuracy ground-truth trajectories from a Fiber Optic Inertial Navigation System with Real-Time Kinematic capability (FINS_RTK), along with synchronized measurements from a 3D LiDAR, a 4D Radar, and an Inertial Measurement Unit (IMU), accompanied by complete intrinsic and extrinsic calibrations. Leveraging its comprehensive sensor suite and diverse real-world scenarios, AgriLiRa4D supports diverse SLAM and localization studies and enables rigorous robustness evaluation against low-texture crops, repetitive patterns, dynamic vegetation, and other challenges of real agricultural environments. To further demonstrate its utility, we benchmark four state-of-the-art multi-sensor SLAM algorithms across different sensor combinations, highlighting the difficulty of the proposed sequences and the necessity of multi-modal approaches for reliable UAV localization. By filling a critical gap in agricultural SLAM datasets, AgriLiRa4D provides a valuable benchmark for the research community and contributes to advancing autonomous navigation technologies for agricultural UAVs. The dataset can be downloaded from: https://zhan994.github.io/AgriLiRa4D.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Yicheng Feng et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models trained with flow matching have demonstrated impressive capabilities on robotic manipulation tasks. However, their performance often degrades under distribution shift and on complex multi-step tasks, suggesting that the learned representations may not robustly capture task-relevant semantics. We introduce DiG-Flow, a principled framework that enhances VLA robustness through geometric regularization. Our key insight is that the distributional discrepancy between observation and action embeddings provides a meaningful geometric signal: lower transport cost indicates compatible representations, while higher cost suggests potential misalignment. DiG-Flow computes a discrepancy measure between empirical distributions of observation and action embeddings, maps it to a modulation weight via a monotone function, and applies residual updates to the observation embeddings before flow matching. Crucially, this intervention operates at the representation level without modifying the flow matching path or target vector field. We provide theoretical guarantees showing that discrepancy-guided training provably decreases the training objective, and that guided inference refinement converges with contraction. Empirically, DiG-Flow integrates into existing VLA architectures with negligible overhead and consistently improves performance, with particularly pronounced gains on complex multi-step tasks and under limited training data.
๐
2025-12-01
๐ To be presented in the 64th IEEE Conference on Dec
๐ฅ Xin Yin, Chenyang Liang, Yanning Guo, Jie Mei
Robotics
ๆฅ็ๆ่ฆ
Control Barrier Functions (CBFs) have emerged as efficient tools to address the safe navigation problem for robot applications. However, synthesizing informative and obstacle motion-aware CBFs online using real-time sensor data remains challenging, particularly in unknown and dynamic scenarios. Motived by this challenge, this paper aims to propose a novel Gaussian Process-based formulation of CBF, termed the Dynamic Log Gaussian Process Control Barrier Function (DLGP-CBF), to enable real-time construction of CBF which are both spatially informative and responsive to obstacle motion. Firstly, the DLGP-CBF leverages a logarithmic transformation of GP regression to generate smooth and informative barrier values and gradients, even in sparse-data regions. Secondly, by explicitly modeling the DLGP-CBF as a function of obstacle positions, the derived safety constraint integrates predicted obstacle velocities, allowing the controller to proactively respond to dynamic obstacles' motion. Simulation results demonstrate significant improvements in obstacle avoidance performance, including increased safety margins, smoother trajectories, and enhanced responsiveness compared to baseline methods.
๐
2025-12-01
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yumeng He, Ying Jiang, Jiayin Lu, Yin Yang, Chenfanfu Jiang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Mo Chen
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This work presents a real-time autonomous track navigation framework for nonholonomic differential-drive mobile robots by jointly integrating multi-task visual perception and a provably stable tracking controller. The perception pipeline reconstructs lane centerlines using 2D-to-3D camera projection, arc-length based uniform point resampling, and cubic polynomial fitting solved via robust QR least-squares optimization. The controller regulates robot linear and angular velocities through a Lyapunov-stability grounded design, ensuring bounded error dynamics and asymptotic convergence of position and heading deviations even in dynamic and partially perceived lane scenarios, without relying on HD prior maps or global satellite localization. Real-world experiments on embedded platforms verify system fidelity, real-time execution, trajectory smoothness, and closed-loop stability for reliable autonomous navigation.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Marek Vagas, Martin Varga, Jaroslav Romancik, Ondrej Majercak, Alejandro Suarez et al.
Robotics
ๆฅ็ๆ่ฆ
Robotic grippers are increasingly deployed across industrial, collaborative, and aerial platforms, where each embodiment imposes distinct mechanical, energetic, and operational constraints. Established YCB and NIST benchmarks quantify grasp success, force, or timing on a single platform, but do not evaluate cross-embodiment transferability or energy-aware performance, capabilities essential for modern mobile and aerial manipulation. This letter introduces the Cross-Embodiment Gripper Benchmark (CEGB), a compact and reproducible benchmarking suite extending YCB and selected NIST metrics with three additional components: a transfer-time benchmark measuring the practical effort required to exchange embodiments, an energy-consumption benchmark evaluating grasping and holding efficiency, and an intent-specific ideal payload assessment reflecting design-dependent operational capability. Together, these metrics characterize both grasp performance and the suitability of reusing a single gripper across heterogeneous robotic systems. A lightweight self-locking gripper prototype is implemented as a reference case. Experiments demonstrate rapid embodiment transfer (median ~= 17.6 s across user groups), low holding energy for gripper prototype (~= 1.5 J per 10 s), and consistent grasp performance with cycle times of 3.2 - 3.9 s and success rates exceeding 90%. CEGB thus provides a reproducible foundation for cross-platform, energy-aware evaluation of grippers in aerial and manipulators domains.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Qiyang Lyu, Wei Wang, Zhenyu Wu, Hongming Shen, Huiqin Zhou et al.
Multimodal
ๆฅ็ๆ่ฆ
Multimodal sensor fusion enables robust environmental perception by leveraging complementary information from heterogeneous sensing modalities. However, accurate calibration is a critical prerequisite for effective fusion. This paper proposes a novel one-key calibration framework named L2M-Calib for a fused magnetic-LiDAR system, jointly estimating the extrinsic transformation between the two kinds of sensors and the intrinsic distortion parameters of the magnetic sensors. Magnetic sensors capture ambient magnetic field (AMF) patterns, which are invariant to geometry, texture, illumination, and weather, making them suitable for challenging environments. Nonetheless, the integration of magnetic sensing into multimodal systems remains underexplored due to the absence of effective calibration techniques. To address this, we optimize extrinsic parameters using an iterative Gauss-Newton scheme, coupled with the intrinsic calibration as a weighted ridge-regularized total least squares (w-RRTLS) problem, ensuring robustness against measurement noise and ill-conditioned data. Extensive evaluations on both simulated datasets and real-world experiments, including AGV-mounted sensor configurations, demonstrate that our method achieves high calibration accuracy and robustness under various environmental and operational conditions.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single, unified framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM's structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmark that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Jiayi Li, Yuxuan Hu, Haoran Geng, Xiangyu Chen, Chuhao Zhou et al.
Robotics
ๆฅ็ๆ่ฆ
Material generalization is essential for real-world robotic manipulation, where robots must interact with objects exhibiting diverse visual and physical properties. This challenge is particularly pronounced for objects made of glass, metal, or other materials whose transparent or reflective surfaces introduce severe out-of-distribution variations. Existing approaches either rely on simulated materials in simulators and perform sim-to-real transfer, which is hindered by substantial visual domain gaps, or depend on collecting extensive real-world demonstrations, which is costly, time-consuming, and still insufficient to cover various materials. To overcome these limitations, we resort to computational photography and introduce Mutable Material Manipulation Augmentation (M$^3$A), a unified framework that leverages the physical characteristics of materials as captured by light transport for photometric re-rendering. The core idea is simple yet powerful: given a single real-world demonstration, we photometrically re-render the scene to generate a diverse set of highly realistic demonstrations with different material properties. This augmentation effectively decouples task-specific manipulation skills from surface appearance, enabling policies to generalize across materials without additional data collection. To systematically evaluate this capability, we construct the first comprehensive multi-material manipulation benchmark spanning both simulation and real-world environments. Extensive experiments show that the M$^3$A policy significantly enhances cross-material generalization, improving the average success rate across three real-world tasks by 58.03\%, and demonstrating robust performance on previously unseen materials.
๐
2025-12-01
๐ ArXiv Preprint (cs.OS)
๐ฅ Hiroto Takahashi, Atsushi Yano, Takuya Azumi
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Accurate estimation of the Worst-Case Deadline Failure Probability (WCDFP) has attracted growing attention as a means to provide safety assurances in complex systems such as robotic platforms and autonomous vehicles. WCDFP quantifies the likelihood of deadline misses under the most pessimistic operating conditions, and safe estimation is essential for dependable real-time applications. However, achieving high accuracy in WCDFP estimation often incurs significant computational cost. Recent studies have revealed that the classical assumption of the critical instant, the activation pattern traditionally considered to trigger the worst-case behavior, can lead to underestimation of WCDFP in probabilistic settings. This observation motivates the use of a revised critical instant formulation that more faithfully captures the true worst-case scenario. This paper investigates convolution-based methods for WCDFP estimation under this revised setting and proposes an optimization technique that accelerates convolution by improving the merge order. Extensive experiments with diverse execution-time distributions demonstrate that the proposed optimized Aggregate Convolution reduces computation time by up to an order of magnitude compared to Sequential Convolution, while retaining accurate and safe-sided WCDFP estimates. These results highlight the potential of the approach to provide both efficiency and reliability in probabilistic timing analysis for safety-critical real-time applications.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Junsung Park, Hogun Kee, Songhwai Oh
Robotics
ๆฅ็ๆ่ฆ
This paper presents a modality-augmented fine-tuning framework designed to adapt foundation robot policies to diverse humanoid embodiments. We validate our approach across two distinct settings: (i) the GR1 embodiment, utilizing public datasets where we introduce post-processed modalities, including binary contact signals and ZoeDepth-generated metric depth; and (ii) the Unitree G1 embodiment, for which we contribute a novel multi-modal dataset incorporating cuRobo motion planning, inverse kinematics, and ground-truth contact-force measurements. Our experiments demonstrate that modality augmentation consistently enhances policy performance across different embodiments. Specifically, for the GR1, integrating contact-state cues and RGB-D fusion improves online success rates from 51% to 63%. Furthermore, in the G1 "Pick Apple to Bowl" task, our contact-augmented model achieves a success rate of 94%, significantly outperforming the 48% achieved by standard fine-tuning and the 0% baseline of zero-shot transfer. These results highlight that lightweight post-processing effectively strengthens policies for GR1, while high-quality multi-modal data is crucial for reliable transfer to the Unitree G1. Consequently, this work establishes a unified, data-centric pathway for extending foundation robot policies through targeted modality design and multi-modal fine-tuning.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Diyuan Shi, Shangke Lyu, Donglin Wang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Humanoid robots have received significant research interests and advancements in recent years. Despite many successes, due to their morphology, dynamics and limitation of control policy, humanoid robots are prone to fall as compared to other embodiments like quadruped or wheeled robots. And its large weight, tall Center of Mass, high Degree-of-Freedom would cause serious hardware damages when falling uncontrolled, to both itself and surrounding objects. Existing researches in this field mostly focus on using control based methods that struggle to cater diverse falling scenarios and may introduce unsuitable human prior. On the other hand, large-scale Deep Reinforcement Learning and Curriculum Learning could be employed to incentivize humanoid agent discovering falling protection policy that fits its own nature and property. In this work, with carefully designed reward functions and domain diversification curriculum, we successfully train humanoid agent to explore falling protection behaviors and discover that by forming a `triangle' structure, the falling damages could be significantly reduced with its rigid-material body. With comprehensive metrics and experiments, we quantify its performance with comparison to other methods, visualize its falling behaviors and successfully transfer it to real world platform.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Longji Yin, Yunfan Ren, Fangcheng Zhu, Liuyu Shi, Fanze Kong et al.
Robotics
ๆฅ็ๆ่ฆ
Autonomous aerial tracking with drones offers vast potential for surveillance, cinematography, and industrial inspection applications. While single-drone tracking systems have been extensively studied, swarm-based target tracking remains underexplored, despite its unique advantages of distributed perception, fault-tolerant redundancy, and multidirectional target coverage. To bridge this gap, we propose a novel decentralized LiDAR-based swarm tracking framework that enables visibility-aware, cooperative target tracking in complex environments, while fully harnessing the unique capabilities of swarm systems. To address visibility, we introduce a novel Spherical Signed Distance Field (SSDF)-based metric for 3-D environmental occlusion representation, coupled with an efficient algorithm that enables real-time onboard SSDF updating. A general Field-of-View (FOV) alignment cost supporting heterogeneous LiDAR configurations is proposed for consistent target observation. Swarm coordination is enhanced through cooperative costs that enforce inter-robot safe clearance, prevent mutual occlusions, and notably facilitate 3-D multidirectional target encirclement via a novel electrostatic-potential-inspired distribution metric. These innovations are integrated into a hierarchical planner, combining a kinodynamic front-end searcher with a spatiotemporal $SE(3)$ back-end optimizer to generate collision-free, visibility-optimized trajectories.Deployed on heterogeneous LiDAR swarms, our fully decentralized implementation features collaborative perception, distributed planning, and dynamic swarm reconfigurability. Validated through rigorous real-world experiments in cluttered outdoor environments, the proposed system demonstrates robust cooperative tracking of agile targets (drones, humans) while achieving superior visibility maintenance.
๐
2025-12-01
๐ 8 pages, 8 figures, accepted at IEEE RA-L
๐ฅ Shuai Wang, Xiaoming Tang, Junning Liang, Haowen Zheng, Biyu Ye et al.
Robotics
ๆฅ็ๆ่ฆ
Coaxial bi-copter autonomous aerial vehicles (AAVs) have garnered attention due to their potential for improved rotor system efficiency and compact form factor. However, balancing efficiency, maneuverability, and compactness in coaxial bi-copter systems remains a key design challenge, limiting their practical deployment. This letter introduces COMET, a coaxial bi-copter AAV platform featuring a dual swashplate mechanism. The coaxial bi-copter system's efficiency and compactness are optimized through bench tests, and the whole prototype's efficiency and robustness under varying payload conditions are verified through flight endurance experiments. The maneuverability performance of the system is evaluated in comprehensive trajectory tracking tests. The results indicate that the dual swashplate configuration enhances tracking performance and improves flight efficiency compared to the single swashplate alternative. Successful autonomous flight trials across various scenarios verify COMET's potential for real-world applications.
๐
2025-12-01
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ Tao Li, Chunze Zhang, Weiwei Yao, Junzhao He, Ji Hou et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Understanding efficient fish locomotion offers insights for biomechanics, fluid dynamics, and engineering. Traditional studies often miss the link between neuromuscular control and whole-body movement. To explore energy transfer in carangiform swimming, we created a bio-inspired digital trout. This model combined multibody dynamics, Hill-type muscle modeling, and a high-fidelity fluid-structure interaction algorithm, accurately replicating a real trout's form and properties. Using deep reinforcement learning, the trout's neural system achieved hierarchical spatiotemporal control of muscle activation. We systematically examined how activation strategies affect speed and energy use. Results show that axial myomere coupling-with activation spanning over 0.5 body lengths-is crucial for stable body wave propagation. Moderate muscle contraction duration ([0.1,0.3] of a tail-beat cycle) lets the body and fluid act as a passive damping system, cutting energy use. Additionally, the activation phase lag of myomeres shapes the body wave; if too large, it causes antagonistic contractions that hinder thrust. These findings advance bio-inspired locomotion understanding and aid energy-efficient underwater system design.
๐
2025-12-01
๐ ArXiv Preprint (Robotics)
๐ฅ Jaejin Jeon, Seonghoon Ryoo, Sang-Duck Lee, Soomok Lee, Seungwoo Jeong
Robotics
ๆฅ็ๆ่ฆ
Robust place recognition is essential for reliable localization in robotics, particularly in complex environments with fre- quent indoor-outdoor transitions. However, existing LiDAR-based datasets often focus on outdoor scenarios and lack seamless domain shifts. In this paper, we propose RoboLoc, a benchmark dataset designed for GPS-free place recognition in indoor-outdoor environments with floor transitions. RoboLoc features real-world robot trajectories, diverse elevation profiles, and transitions between structured indoor and unstructured outdoor domains. We benchmark a variety of state-of-the-art models, point-based, voxel-based, and BEV-based architectures, highlighting their generalizability domain shifts. RoboLoc provides a realistic testbed for developing multi-domain localization systems in robotics and autonomous navigation
๐
2025-12-01
๐ NeurIPS 2025
๐ฅ Edward S. Hu, Jie Wang, Xingfang Yuan, Fiona Luo, Muyao Li et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
A robot's instantaneous sensory observations do not always reveal task-relevant state information. Under such partial observability, optimal behavior typically involves explicitly acting to gain the missing information. Today's standard robot learning techniques struggle to produce such active perception behaviors. We propose a simple real-world robot learning recipe to efficiently train active perception policies. Our approach, asymmetric advantage weighted regression (AAWR), exploits access to "privileged" extra sensors at training time. The privileged sensors enable training high-quality privileged value functions that aid in estimating the advantage of the target policy. Bootstrapping from a small number of potentially suboptimal demonstrations and an easy-to-obtain coarse policy initialization, AAWR quickly acquires active perception behaviors and boosts task performance. In evaluations on 8 manipulation tasks on 3 robots spanning varying degrees of partial observability, AAWR synthesizes reliable active perception behaviors that outperform all prior approaches. When initialized with a "generalist" robot policy that struggles with active perception tasks, AAWR efficiently generates information-gathering behaviors that allow it to operate under severe partial observability for manipulation tasks. Website: https://penn-pal-lab.github.io/aawr/
๐
2025-12-01
๐ Copyright 2025 IEEE. This is the author's version
๐ฅ Mohamed Abdallah Salem, Ahmed Harb Rabia
Computer Vision
ๆฅ็ๆ่ฆ
Efficient and accurate annotation of datasets remains a significant challenge for deploying object detection models such as You Only Look Once (YOLO) in real-world applications, particularly in agriculture where rapid decision-making is critical. Traditional annotation techniques are labor-intensive, requiring extensive manual labeling post data collection. This paper presents a novel real-time annotation approach leveraging YOLO models deployed on edge devices, enabling immediate labeling during image capture. To comprehensively evaluate the efficiency and accuracy of our proposed system, we conducted an extensive comparative analysis using three prominent YOLO architectures (YOLOv5, YOLOv8, YOLOv12) under various configurations: single-class versus multi-class annotation and pretrained versus scratch-based training. Our analysis includes detailed statistical tests and learning dynamics, demonstrating significant advantages of pretrained and single-class configurations in terms of model convergence, performance, and robustness. Results strongly validate the feasibility and effectiveness of our real-time annotation framework, highlighting its capability to drastically reduce dataset preparation time while maintaining high annotation quality.
๐
2025-11-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Libo Wang
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
To address the gap in humanoid robot cognitive systems regarding the lack of a time-updable mediating thought space between semantics and continuous control, this study constructs and trains a VLA model named "Sigma" that runs on a single RTX 4090. It uses the open-source pi05_base model as a foundation and preprocesses svla_so101_pickplace into a training dataset. The researcher independently designed an architecture for a vision-language-action model that combines deep semantic understanding and association to achieve telepathic communication. The training process involved repeated optimizations of data preprocessing, LoRA fine-tuning, and the inference-stage adapter. The experiment employed offline closed-loop replay, comparing Sigma with the untuned pure pi05_base model under data conditions. Results showed that Sigma exhibited a stable decrease in control MSE across vector, fragment, and entire trajectory timescales, while maintaining the telepathy norm and semantic-text alignment quality unchanged. It demonstrates that mind-responsive alignment control is quantified through an architecture that combines deep understanding of semantics and association without retraining the base model, which provides reproducible experience for semantic alignment and intention-driven behavior in humanoid robots.
๐
2025-11-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chen Henry Wu, Sachin Goyal, Aditi Raghunathan
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Parallel sampling promises substantial gains in test-time scaling, but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We propose the mode-conditioning (ModC) framework, which explicitly allocates test-time compute across reasoning modes using either specialist models or mode-specific prefixes. ModC consistently improves scaling across controlled graph-search tasks and large-scale reasoning benchmarks, spanning model families and sizes from 0.5B to 7B. On OpenThoughts, fine-tuning Qwen2.5-7B with ModC achieves a 4x efficiency gain over standard training while also improving the maximum attainable Pass@k. We further show that gradient clustering enables ModC without explicit mode labels, yielding up to 10% gains on datasets such as NuminaMath. Finally, we show that ModC improves reinforcement learning (RL) and can further boost diversity-inducing RL methods. These results demonstrate that standard training underutilizes the diversity in data, and that ModC provides a simple, effective remedy for unlocking the full benefits of diversity in test-time scaling.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yaxuan Ren, Krithika Ramesh, Yaxing Yao, Anjalie Field
ๆฅ็ๆ่ฆ
In this work, we aim to clarify and reconcile metrics for evaluating privacy protection in text through a systematic survey. Although text anonymization is essential for enabling NLP research and model development in domains with sensitive data, evaluating whether anonymization methods sufficiently protect privacy remains an open challenge. In manually reviewing 47 papers that report privacy metrics, we identify and compare six distinct privacy notions, and analyze how the associated metrics capture different aspects of privacy risk. We then assess how well these notions align with legal privacy standards (HIPAA and GDPR), as well as user-centered expectations grounded in HCI studies. Our analysis offers practical guidance on navigating the landscape of privacy evaluation approaches further and highlights gaps in current practices. Ultimately, we aim to facilitate more robust, comparable, and legally aware privacy evaluations in text anonymization.
๐
2025-11-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenjun Zhang, Shekhar S. Chandra, Aaron Nicolson
ๆฅ็ๆ่ฆ
Medical phrase grounding (MPG) maps textual descriptions of radiological findings to corresponding image regions. These grounded reports are easier to interpret, especially for non-experts. Existing MPG systems mostly follow the referring expression comprehension (REC) paradigm and return exactly one bounding box per phrase. Real reports often violate this assumption. They contain multi-region findings, non-diagnostic text, and non-groundable phrases, such as negations or descriptions of normal anatomy. Motivated by this, we reformulate the task as generalised medical phrase grounding (GMPG), where each sentence is mapped to zero, one, or multiple scored regions. To realise this formulation, we introduce the first GMPG model: MedGrounder. We adopted a two-stage training regime: pre-training on report sentence--anatomy box alignment datasets and fine-tuning on report sentence--human annotated box datasets. Experiments on PadChest-GR and MS-CXR show that MedGrounder achieves strong zero-shot transfer and outperforms REC-style and grounded report generation baselines on multi-region and non-groundable phrases, while using far fewer human box annotations. Finally, we show that MedGrounder can be composed with existing report generators to produce grounded reports without retraining the generator.
๐
2025-11-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Stephen Fitz
Robotics
ๆฅ็ๆ่ฆ
The Machine Consciousness Hypothesis states that consciousness is a substrate-free functional property of computational systems capable of second-order perception. I propose a research program to investigate this idea in silico by studying how collective self-models (coherent, self-referential representations) emerge from distributed learning systems embedded within universal self-organizing environments. The theory outlined here starts from the supposition that consciousness is an emergent property of collective intelligence systems undergoing synchronization of prediction through communication. It is not an epiphenomenon of individual modeling but a property of the language that a system evolves to internally describe itself. For a model of base reality, I begin with a minimal but general computational world: a cellular automaton, which exhibits both computational irreducibility and local reducibility. On top of this computational substrate, I introduce a network of local, predictive, representational (neural) models capable of communication and adaptation. I use this layered model to study how collective intelligence gives rise to self-representation as a direct consequence of inter-agent alignment. I suggest that consciousness does not emerge from modeling per se, but from communication. It arises from the noisy, lossy exchange of predictive messages between groups of local observers describing persistent patterns in the underlying computational substrate (base reality). It is through this representational dialogue that a shared model arises, aligning many partial views of the world. The broader goal is to develop empirically testable theories of machine consciousness, by studying how internal self-models may form in distributed systems without centralized control.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Neha Joshi, Pamir Gogoi, Aasim Mirza, Aayush Jansari, Aditya Yadavalli et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. Endangered Language Recipes (ELR)-1000 -- captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find the following: despite the models' capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context -- including background information about the languages, translation examples, and guidelines for cultural preservation -- leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Riad Ahmed Anonto, Md Labid Al Nahiyan, Md Tanvir Hassan, Ch. Md. Rakin Haider
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce "semantic confusion," a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model families and deployment guards show that global false-rejection rate hides critical structure. Our metrics reveal globally unstable boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal does not increase inconsistency. We also show how confusion-aware auditing separates how often a system refuses from how sensibly it refuses. This gives developers a practical signal to reduce false refusals while preserving safety.
๐
2025-11-30
๐ NeurIPS 2025
๐ฅ Luigi Assom
ๆฅ็ๆ่ฆ
This study presents an unsupervised method to infer discreteness, syntax and temporal structures of fruit-bats vocalizations, as a case study of graded vocal systems, and evaluates the complexity of communication patterns in relation with behavioral context. The method improved the baseline for unsupervised labeling of vocal units (i.e. syllables) through manifold learning, by investigating how dimen- sionality reduction on mel-spectrograms affects labeling, and comparing it with unsupervised labels based on acoustic similarity. We then encoded vocalizations as syllabic sequences to analyze the type of syntax, and extracted the Maximal Repetitions (MRs) to evaluate syntactical structures. We found evidence for: i) associative syntax, rather than combinatorial (context classification is unaffected by permutation of sequences, F 1 > 0.9); ii) context-dependent use of syllables (Wilcoxon rank-sum tests, p-value < 0.05); iii) heavy-tail distribution of MRs (truncated power-law, exponent ฮฑ < 2), indicative of mechanism encoding com- binatorial complexity. Analysis of MRs and syllabic transition networks revealed that mother-pupil interactions were characterized by repetitions, while commu- nication in conflict-contexts exhibited higher complexity (longer MRs and more interconnected vocal sequences) than non-agonistic contexts. We propose that communicative complexity is higher in scenarios of disagreement, reflecting lower compressibility of information.
๐
2025-11-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jinu Lee, Kyoung-Woon On, Simeng Han, Arman Cohan, Julia Hockenmaier
ๆฅ็ๆ่ฆ
Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs' legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nicole Favero, Francesca Salute, Daniel Hardt
Natural Language Processing
ๆฅ็ๆ่ฆ
Most evaluations of large language models focus on standard tasks such as factual question answering or short summarization. This research expands that scope in two directions: first, by comparing two retrieval strategies, Graph RAG, structured knowledge-graph based, and Advanced RAG, hybrid keyword-semantic search, for QA; and second, by evaluating whether LLMs can generate high quality non-traditional academic outputs, specifically slide decks and podcast scripts. We implemented a prototype combining Meta's LLaMA 3 70B open weight and OpenAI's GPT 4o mini API based. QA performance was evaluated using both human ratings across eleven quality dimensions and large language model judges for scalable cross validation. GPT 4o mini with Advanced RAG produced the most accurate responses. Graph RAG offered limited improvements and led to more hallucinations, partly due to its structural complexity and manual setup. Slide and podcast generation was tested with document grounded retrieval. GPT 4o mini again performed best, though LLaMA 3 showed promise in narrative coherence. Human reviewers were crucial for detecting layout and stylistic flaws, highlighting the need for combined human LLM evaluation in assessing emerging academic outputs.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhihan Guo, Feiyang Xu, Yifan Li, Muzhi Li, Shuai Zou et al.
ๆฅ็ๆ่ฆ
The explosive growth in academic literature necessitates automated deep research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, including both research and review papers. Besides, we also propose a Modular-integrated Evaluation Paradigm for DR Agents (Dr.Mi-Eval), a novel modular-integrated evaluation paradigm, which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in performing the multi-source retrieval required for review-style tasks and performing consistently across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.
๐
2025-11-30
๐ NeurIPS 2025
๐ฅ Liyao Li, Chao Ye, Wentao Ye, Yifei Sun, Zhe Jiang et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
To migrate the remarkable successes of Large Language Models (LLMs), the community has made numerous efforts to generalize them to the table reasoning tasks for the widely deployed tabular data. Despite that, in this work, by showing a probing experiment on our proposed StructQA benchmark, we postulate that even the most advanced LLMs (such as GPTs) may still fall short of coping with tabular data. More specifically, the current scheme often simply relies on serializing the tabular data, together with the meta information, then inputting them through the LLMs. We argue that the loss of structural information is the root of this shortcoming. In this work, we further propose TAMO, which bears an ideology to treat the tables as an independent modality integrated with the text tokens. The resulting model in TAMO is a multimodal framework consisting of a hypergraph neural network as the global table encoder seamlessly integrated with the mainstream LLM. Empirical results on various benchmarking datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, have demonstrated significant improvements on generalization with an average relative gain of 42.65%.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Alvaro Paredes Amorin, Andre Python, Christoph Weisser
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) play an increasingly important role in finan- cial markets analysis by capturing signals from complex and heterogeneous textual data sources, such as tweets, news articles, reports, and microblogs. However, their performance is dependent on large computational resources and proprietary datasets, which are costly, restricted, and therefore inacces- sible to many researchers and practitioners. To reflect realistic situations we investigate the ability of lightweight open-source LLMs - smaller and publicly available models designed to operate with limited computational resources - to generalize sentiment understanding from financial datasets of varying sizes, sources, formats, and languages. We compare the benchmark finance natural language processing (NLP) model, FinBERT, and three open-source lightweight LLMs, DeepSeek-LLM 7B, Llama3 8B Instruct, and Qwen3 8B on five publicly available datasets: FinancialPhraseBank, Financial Question Answering, Gold News Sentiment, Twitter Sentiment and Chinese Finance Sentiment. We find that LLMs, specially Qwen3 8B and Llama3 8B, perform best in most scenarios, even from using only 5% of the available training data. These results hold in zero-shot and few-shot learning scenarios. Our findings indicate that lightweight, open-source large language models (LLMs) consti- tute a cost-effective option, as they can achieve competitive performance on heterogeneous textual data even when trained on only a limited subset of the extensive annotated corpora that are typically deemed necessary.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ahmed Mustafa Younes
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer models have significantly advanced Natural Language Processing (NLP), demonstrating strong performance in English. However, their effectiveness in Arabic, particularly for Named Entity Recognition (NER), remains limited, even with larger pre-trained models. This performance gap stems from multiple factors, including tokenisation, dataset quality, and annotation inconsistencies. Existing studies often analyze these issues in isolation, failing to capture their joint effect on system behaviour and performance.
We introduce DeformAr (Debugging and Evaluation Framework for Transformer-based NER Systems), a novel framework designed to investigate and explain the performance discrepancy between Arabic and English NER systems. DeformAr integrates a data extraction library and an interactive dashboard, supporting two modes of evaluation: cross-component analysis and behavioural analysis. The framework divides each language into dataset and model components to examine their interactions.
The analysis proceeds in two stages. First, cross-component analysis provides systematic diagnostic measures across data and model subcomponents, addressing the "what," "how," and "why" behind observed discrepancies. The second stage applies behavioural analysis by combining interpretability techniques with token-level metrics, interactive visualisations, and representation space analysis. These stages enable a component-aware diagnostic process that detects model behaviours and explains them by linking them to underlying representational patterns and data factors. DeformAr is the first Arabic-specific, component-based interpretability tool, offering a crucial resource for advancing model analysis in under-resourced languages.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Imane Jaaouine, Ross D. King
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) produce context inconsistency hallucinations, which are LLM generated outputs that are misaligned with the user prompt. This research project investigates whether prompt engineering (PE) methods can mitigate context inconsistency hallucinations in zero-shot LLM summarisation of scientific texts, where zero-shot indicates that the LLM relies purely on its pre-training data. Across eight yeast biotechnology research paper abstracts, six instruction-tuned LLMs were prompted with seven methods: a base- line prompt, two levels of increasing instruction complexity (PE-1 and PE-2), two levels of context repetition (CR-K1 and CR-K2), and two levels of random addition (RA-K1 and RA-K2). Context repetition involved the identification and repetition of K key sentences from the abstract, whereas random addition involved the repetition of K randomly selected sentences from the abstract, where K is 1 or 2. A total of 336 LLM-generated summaries were evaluated using six metrics: ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, and cosine similarity, which were used to compute the lexical and semantic alignment be- tween the summaries and the abstracts. Four hypotheses on the effects of prompt methods on summary alignment with the reference text were tested. Statistical analysis on 3744 collected datapoints was performed using bias-corrected and accelerated (BCa) bootstrap confidence intervals and Wilcoxon signed-rank tests with Bonferroni-Holm correction. The results demonstrated that CR and RA significantly improve the lexical alignment of LLM-generated summaries with the abstracts. These findings indicate that prompt engineering has the potential to impact hallucinations in zero-shot scientific summarisation tasks.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jianxiang Zang, Yongda Wei, Ruxue Bai, Shiyu Jiang, Nijia Mo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current evaluation methods focus solely on preference perception accuracies in given specific scenarios, obscuring the critical vulnerabilities of RMs in real-world scenarios. We identify the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering "How accurate is the RM's preference perception for given samples?", it employs scientific auditing to answer: "Can we infer RMs exhibit systematic vulnerabilities in specific real-world scenarios?". Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing distribution degradation of RM preference perception confidence. This enables inference of both the certainty and severity of RM vulnerabilities across diverse real-world scenarios. This lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.
๐
2025-11-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Samuel Kessler, Menglin Xia, Daniel Madrigal Diaz, Dongge Han, Helia Heshemi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
A common and effective means for improving language model capabilities involves finetuning a ``student'' language model's parameters on generations from a more proficient ``teacher'' model. Termed ``synthetic data'', these generations are often produced before any student finetuning, but some work has considered generating new synthetic samples as training progresses. This paper studies and advocates for the latter case, where data are generated in an iterative, closed-loop fashion that is guided by the current state of the student model. For a fixed budget of generated samples, or a budget in terms of compute spent querying a teacher, we show that this curation of finetuning data affords improved student performance over static generation. Further, while there have been several LLM-specific methods proposed that operate in this regime, we find that simple, inexpensive selection criteria from the active learning literature tend to be most performant. We validate these claims across four mathematical and logical reasoning datasets using four different small language models.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chunlin Tian, Xuyang Wei, Huanrong Liu, Zhijiang Guo, Li Li
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), but it still incurs notable overhead and suffers from parameter interference in complex datasets. While re- cent works decouple LoRA update matrices to exploit matrix-wise asymmetry, training costs remain high. We revisit LoRA from the perspective of inter-matrix and intra-layer parameter redundancy and propose Resource-Efficient Low-Rank Adaptation, EffiLoRA, a lightweight and generalizable approach for language, multimodal, and diffusion models. EffiLoRA employs a unified A matrix across all transformer layers and introduces a runtime selective B matrices up- date to dynamically trade-off the system resource budget and model performance. EffiLoRA consistently outperforms LoRA across diverse modalities, including commonsense reasoning, visual instruction tuning, and image generation, demon- strating improved efficiency and robustness.
๐
2025-11-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yandong Sun, Qiang Huang, Ziwei Xu, Yiqun Sun, Yixuan Tang et al.
ๆฅ็ๆ่ฆ
Embedding spaces are fundamental to modern AI, translating raw data into high-dimensional vectors that encode rich semantic relationships. Yet, their internal structures remain opaque, with existing approaches often sacrificing semantic coherence for structural regularity or incurring high computational overhead to improve interpretability. To address these challenges, we introduce the Semantic Field Subspace (SFS), a geometry-preserving, context-aware representation that captures local semantic neighborhoods within the embedding space. We also propose SAFARI (SemAntic Field subspAce deteRmInation), an unsupervised, modality-agnostic algorithm that uncovers hierarchical semantic structures using a novel metric called Semantic Shift, which quantifies how semantics evolve as SFSes evolve. To ensure scalability, we develop an efficient approximation of Semantic Shift that replaces costly SVD computations, achieving a 15~30x speedup with average errors below 0.01. Extensive evaluations across six real-world text and image datasets show that SFSes outperform standard classifiers not only in classification but also in nuanced tasks such as political bias detection, while SAFARI consistently reveals interpretable and generalizable semantic hierarchies. This work presents a unified framework for structuring, analyzing, and scaling semantic understanding in embedding spaces.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yukang Lin, Jiahao Shao, Shuoran Jiang, Wentao Zhu, Bingjie Lu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Watermarking acts as a critical safeguard in text generated by Large Language Models (LLMs). By embedding identifiable signals into model outputs, watermarking enables reliable attribution and enhances the security of machine-generated content. Existing approaches typically embed signals by manipulating token generation probabilities. Despite their effectiveness, these methods inherently face a trade-off between detectability and text quality: the signal strength and randomness required for robust watermarking tend to degrade the performance of downstream tasks.
In this paper, we design a novel embedding scheme that controls seed pools to facilitate diverse parallel generation of watermarked text. Based on that scheme, we propose WaterSearch, a sentence-level, search-based watermarking framework adaptable to a wide range of existing methods. WaterSearch enhances text quality by jointly optimizing two key aspects: 1) distribution fidelity and 2) watermark signal characteristics. Furthermore, WaterSearch is complemented by a sentence-level detection method with strong attack robustness. We evaluate our method on three popular LLMs across ten diverse tasks. Extensive experiments demonstrate that our method achieves an average performance improvement of 51.01\% over state-of-the-art baselines at a watermark detectability strength of 95\%. In challenging scenarios such as short text generation and low-entropy output generation, our method yields performance gains of 47.78\% and 36.47\%, respectively. Moreover, under different attack senarios including insertion, synonym substitution and paraphrase attasks, WaterSearch maintains high detectability, further validating its robust anti-attack capabilities. Our code is available at \href{https://github.com/Yukang-Lin/WaterSearch}{https://github.com/Yukang-Lin/WaterSearch}.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Md Mehrab Hossain Opi, Sumaiya Khan, Moshammad Farzana Rahman
Natural Language Processing
ๆฅ็ๆ่ฆ
Training models for Natural Language Processing (NLP) requires substantial computational resources and time, posing significant challenges, especially for NLP development in Bangla, where access to high-end hardware is often limited. In this work, we explore automatic mixed precision (AMP) training as a means to improve computational efficiency without sacrificing model performance. By leveraging a dynamic mix of 16-bit and 32-bit floating-point computations, AMP lowers GPU memory requirements and speeds up training without degrading model performance. We evaluate AMP across four standard Bangla NLP tasks, namely sentiment analysis, named entity recognition, error classification, and question answering, using four transformer-based models: BanglaBERT, BanglishBERT, XLM-R, and mBERT. Our results demonstrate that AMP accelerates training by 44.5% and reduces memory consumption by 17.6%, while maintaining F-1 score within 99.7% of the full-precision baselines. This empirical study highlights AMP's potential to democratize access to state-of-the-art NLP capabilities in hardware-constrained settings by lowering computational barriers.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xiaodong Cai, Hai Lin, Shaoxiong Zhan, Weiqi Luo, Hong-Gee Kim et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Token sampling strategies critically influence text generation quality in large language models (LLMs). However, existing methods introduce additional hyperparameters, requiring extensive tuning and complicating deployment. We present Entropy Equilibrium Sampling (EES), an auxiliary hyperparameter-free approach inspired by information theory that can dynamically adjust candidate sets by balancing normalized entropy with probability mass. We evaluate EES on both reasoning and generation tasks across a range of model architectures. Our results show that EES consistently performs well across temperature settings, delivering competitive accuracy and coherence while maintaining diversity. By eliminating the need for hyperparameter tuning, EES greatly simplifies deployment while improving performance. Code is available at https://github.com/shuanncai/EES
๐
2025-11-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hamed Razavi
Natural Language Processing
ๆฅ็ๆ่ฆ
The fast growth of digital health systems has led to a need to better comprehend how they interpret and represent patient-reported symptoms. Chatbots have been used in healthcare to provide clinical support and enhance the user experience, making it possible to provide meaningful clinical patterns from text-based data through chatbots. The proposed research utilises several different natural language processing methods to study the occurrences of symptom descriptions in medicine as well as analyse the patterns that emerge through these conversations within medical bots. Through the use of the Medical Conversations to Disease Dataset which contains 960 multi-turn dialogues divided into 24 Clinical Conditions, a standardised representation of conversations between patient and bot is created for further analysis by computational means. The multi-method approach uses a variety of tools, including Latent Dirichlet Allocation (LDA) to identify latent symptom themes, K-Means to group symptom descriptions by similarity, Transformer-based Named Entity Recognition (NER) to extract medical concepts, and the Apriori algorithm to discover frequent symptom pairs. Findings from the analysis indicate a coherent structure of clinically relevant topics, moderate levels of clustering cohesiveness and several high confidence rates on the relationships between symptoms like fever headache and rash itchiness. The results support the notion that conversational medical data can be a valuable diagnostic signal for early symptom interpretation, assist in strengthening decision support and improve how users interact with tele-health technology. By demonstrating a method for converting unstructured free-flowing dialogue into actionable knowledge regarding symptoms this work provides an extensible framework to further enhance future performance, dependability and clinical utility of selecting medical chatbots.
๐
2025-11-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Md Abdullah Al Kafi, Sumit Kumar Banshal
Natural Language Processing
ๆฅ็ๆ่ฆ
This study proposes a language-agnostic transformer-based POS tagging framework designed for low-resource languages, using Bangla and Hindi as case studies. With only three lines of framework-specific code, the model was adapted from Bangla to Hindi, demonstrating effective portability with minimal modification. The framework achieves 96.85 percent and 97 percent token-level accuracy across POS categories in Bangla and Hindi while sustaining strong F1 scores despite dataset imbalance and linguistic overlap. A performance discrepancy in a specific POS category underscores ongoing challenges in dataset curation. The strong results stem from the underlying transformer architecture, which can be replaced with limited code adjustments. Its modular and open-source design enables rapid cross-lingual adaptation while reducing model design and tuning overhead, allowing researchers to focus on linguistic preprocessing and dataset refinement, which are essential for advancing NLP in underrepresented languages.
๐
2025-11-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuxiang Chen, Zuohan Wu, Ziwei Wang, Xiangning Yu, Xujia Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large reasoning models (LRMs) have garnered significant attention from researchers owing to their exceptional capability in addressing complex tasks. Motivated by the observed human-like behaviors in their reasoning processes, this paper introduces a comprehensive taxonomy to characterize atomic reasoning steps and probe the ``psyche'' of LRM intelligence. Specifically, it comprises five groups and seventeen categories derived from human mental processes, thereby grounding the understanding of LRMs in an interdisciplinary perspective. The taxonomy is then applied for an in-depth understanding of current LRMs, resulting in a distinct labeled dataset that comprises 277,534 atomic reasoning steps. Using this resource, we analyze contemporary LRMs and distill several actionable takeaways for improving training and post-training of reasoning models. Notably, our analysis reveals that prevailing post-answer ``double-checks'' (self-monitoring evaluations) are largely superficial and rarely yield substantive revisions. Thus, incentivizing comprehensive multi-step reflection, rather than simple self-monitoring, may offer a more effective path forward. To complement the taxonomy, an automatic annotation framework, named CAPO, is proposed to leverage large language models (LLMs) for generating the taxonomy-based annotations. Experimental results demonstrate that CAPO achieves higher consistency with human experts compared to baselines, facilitating a scalable and comprehensive analysis of LRMs from a human cognitive perspective. Together, the taxonomy, CAPO, and the derived insights provide a principled, scalable path toward understanding and advancing LRM reasoning.
๐
2025-11-30
๐ KDD 2026
๐ฅ Ziyang Zeng, Heming Jing, Jindong Chen, Xiangli Li, Hongyu Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Ranking relevance is a fundamental task in search engines, aiming to identify the items most relevant to a given user query. Traditional relevance models typically produce scalar scores or directly predict relevance labels, limiting both interpretability and the modeling of complex relevance signals. Inspired by recent advances in Chain-of-Thought (CoT) reasoning for complex tasks, we investigate whether explicit reasoning can enhance both interpretability and performance in relevance modeling. However, existing reasoning-based Generative Relevance Models (GRMs) primarily rely on supervised fine-tuning on large amounts of human-annotated or synthetic CoT data, which often leads to limited generalization. Moreover, domain-agnostic, free-form reasoning tends to be overly generic and insufficiently grounded, limiting its potential to handle the diverse and ambiguous cases prevalent in open-domain search. In this work, we formulate relevance modeling in Xiaohongshu search as a reasoning task and introduce a Reinforcement Learning (RL)-based training framework to enhance the grounded reasoning capabilities of GRMs. Specifically, we incorporate practical business-specific relevance criteria into the multi-step reasoning prompt design and propose Stepwise Advantage Masking (SAM), a lightweight process-supervision strategy which facilitates effective learning of these criteria through improved credit assignment. To enable industrial deployment, we further distill the large-scale RL-tuned model to a lightweight version suitable for real-world search systems. Extensive experiments on industrial datasets, along with online A/B tests, demonstrate the effectiveness of our approach.
๐
2025-11-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Hyunseok Ryu, Wonjune Shin, Hyun Park
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) is gaining recognition as one of the key technological axes for next generation information retrieval, owing to its ability to mitigate the hallucination phenomenon in Large Language
Models (LLMs)and effectively incorporate up-to-date information. However, specialized expertise is necessary to
construct ahigh-quality retrieval system independently; moreover, RAGdemonstratesrelativelyslowerprocessing
speeds compared to conventional pure retrieval systems because it involves both retrieval and generation stages.
Accordingly, this study proposes SHRAG, a novel framework designed to facilitate the seamless integration of
Information Retrieval and RAG while simultaneously securing precise retrieval performance. SHRAG utilizes a
Large Language Model as a Query Strategist to automatically transform unstructured natural language queries
into logically structured search queries, subsequently performing Boolean retrieval to emulate the search process
of an expert human searcher. Furthermore, it incorporates multilingual query expansion and a multilingual
embedding model, enabling it to perform efficient cross-lingual question answering within the multilingual
dataset environment of the ScienceON Challenge. Experimental results demonstrate that the proposed method,
combining logical retrieval capabilities and generative reasoning, can significantly enhance the accuracy and
reliability of RAG systems. Furthermore, SHRAG movesbeyondconventionaldocument-centric retrieval methods,
presenting the potential for a new search paradigm capable of providing direct and reliable responses to queries.
๐
2025-11-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hamed Razavi
Natural Language Processing
ๆฅ็ๆ่ฆ
The fast growth of digital health systems has led to a need to better comprehend how they interpret and represent patient-reported symptoms. Chatbots have been used in healthcare to provide clinical support and enhance the user experience, making it possible to provide meaningful clinical patterns from text-based data through chatbots. The proposed research utilises several different natural language processing methods to study the occurrences of symptom descriptions in medicine as well as analyse the patterns that emerge through these conversations within medical bots. Through the use of the Medical Conversations to Disease Dataset which contains 960 multi-turn dialogues divided into 24 Clinical Conditions, a standardised representation of conversations between patient and bot is created for further analysis by computational means. The multi-method approach uses a variety of tools, including Latent Dirichlet Allocation (LDA) to identify latent symptom themes, K-Means to group symptom descriptions by similarity, Transformer-based Named Entity Recognition (NER) to extract medical concepts, and the Apriori algorithm to discover frequent symptom pairs. Findings from the analysis indicate a coherent structure of clinically relevant topics, moderate levels of clustering cohesiveness and several high confidence rates on the relationships between symptoms like fever headache and rash itchiness. The results support the notion that conversational medical data can be a valuable diagnostic signal for early symptom interpretation, assist in strengthening decision support and improve how users interact with tele-health technology. By demonstrating a method for converting unstructured free-flowing dialogue into actionable knowledge regarding symptoms this work provides an extensible framework to further enhance future performance, dependability and clinical utility of selecting medical chatbots.
๐
2025-11-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lingling Fu
Machine Learning
ๆฅ็ๆ่ฆ
Reward models play a critical role in Reinforcement Learning from Human Feedback (RLHF) by assessing the consistency between generated outputs and human preferences. However, conventional reward models are prone to reward hacking or over-optimization, where the policy exploits shortcut patterns to obtain high reward scores that do not reflect true human preference. Although Mixture-of-Experts (MoE)-based reward models can enhance discriminative capability, they typically introduce substantial computational overhead. To address these challenges, we propose an upcycle and merge MoE reward modeling approach. We first upcycle a dense reward model into a MoE architecture, where a shared expert captures general knowledge, while normal experts specialize in instruction-specific patterns. We then apply routing-weight normalization and merge experts back into a dense model through a learnable weight-averaging mechanism, preserving performance gains while significantly reducing inference cost. Experimental results demonstrate that our method effectively mitigates reward hacking across various model scales. Our work highlights the potential of upcycle and merge MoE structures for improving both robustness and efficiency of RLHF reward models.
๐
2025-11-30
๐ ArXiv Preprint (cs.IT)
๐ฅ Loc X. Nguyen, Ji Su Yoon, Huy Q. Le, Yu Qiao, Avi Deb Raha et al.
Machine Learning
ๆฅ็ๆ่ฆ
Semantic communication can significantly improve bandwidth utilization in wireless systems by exploiting the meaning behind raw data. However, the advancements achieved through semantic communication are closely dependent on the development of deep learning (DL) models for joint source-channel coding (JSCC) encoder/decoder techniques, which require a large amount of data for training. To address this data-intensive nature of DL models, federated learning (FL) has been proposed to train a model in a distributed manner, where the server broadcasts the DL model to clients in the network for training with their local data. However, the conventional FL approaches suffer from catastrophic degradation when client data are from different domains. In contrast, in this paper, a novel FL framework is proposed to address this domain shift by constructing the global representation, which aligns with the local features of the clients to preserve the semantics of different data domains. In addition, the dominance problem of client domains with a large number of samples is identified and, then, addressed with a domain-aware aggregation approach. This work is the first to consider the domain shift in training the semantic communication system for the image reconstruction task. Finally, simulation results demonstrate that the proposed approach outperforms the model-contrastive FL (MOON) framework by 0.5 for PSNR values under three domains at an SNR of 1 dB, and this gap continues to widen as the channel quality improves.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Himanshi Lalwani, Hanan Salam
ๆฅ็ๆ่ฆ
College students often face academic and life stressors affecting productivity, especially students with Attention Deficit Hyperactivity Disorder (ADHD) who experience executive functioning challenges. Conventional productivity tools typically demand sustained self-discipline and consistent use, which many students struggle with, leading to disruptive app-switching behaviors. Socially Assistive Robots (SARs), known for their intuitive and interactive nature, offer promising potential to support productivity in academic environments, having been successfully utilized in domains like education, cognitive development, and mental health. To leverage SARs effectively in addressing student productivity, this study employed a Participatory Design (PD) approach, directly involving college students and a Student Success and Well-Being Coach in the design process. Through interviews and a collaborative workshop, we gathered detailed insights on productivity challenges and identified desirable features for a productivity-focused SAR. Importantly, ethical considerations were integrated from the onset, facilitating responsible and user-aligned design choices. Our contributions include comprehensive insights into student productivity challenges, SAR design preferences, and actionable recommendations for effective robot characteristics. Additionally, we present stakeholder-derived ethical guidelines to inform responsible future implementations of productivity-focused SARs in higher education.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Gabriel Olin, Lu Chen, Nayesha Gandotra, Maxim Likhachev, Howie Choset
ๆฅ็ๆ่ฆ
Intercepting fast moving objects, by its very nature, is challenging because of its tight time constraints. This problem becomes further complicated in the presence of sensor noise because noisy sensors provide, at best, incomplete information, which results in a distribution over target states to be intercepted. Since time is of the essence, to hit the target, the planner must begin directing the interceptor, in this case a robot arm, while still receiving information. We introduce an tree-like structure, which is grown using kinodynamic motion primitives in state-time space. This tree-like structure encodes reachability to multiple goals from a single origin, while enabling real-time value updates as the target belief evolves and seamless transitions between goals. We evaluate our framework on an interception task on a 6 DOF industrial arm (ABB IRB-1600) with an onboard stereo camera (ZED 2i). A robust Innovation-based Adaptive Estimation Adaptive Kalman Filter (RIAE-AKF) is used to track the target and perform belief updates.
๐
2025-11-30
๐ Accepted in International Journal of Robotics Rese
๐ฅ Nathan F. Lepora
Robotics
ๆฅ็ๆ่ฆ
What is the future of tactile robotics? To help define that future, this article provides a historical perspective on tactile sensing in robotics from the wealth of knowledge and expert opinion in nearly 150 reviews over almost half a century. This history is characterized by a succession of generations: 1965-79 (origins), 1980-94 (foundations and growth), 1995-2009 (tactile winter) and 2010-2024 (expansion and diversification). Recent expansion has led to diverse themes emerging of e-skins, tactile robotic hands, vision-based tactile sensing, soft/biomimetic touch, and the tactile Internet. In the next generation from 2025, tactile robotics could mature to widespread commercial use, with applications in human-like dexterity, understanding human intelligence, and telepresence impacting all robotics and AI. By linking past expert insights to present themes, this article highlights recurring challenges in tactile robotics, showing how the field has evolved, why progress has often stalled, and which opportunities are most likely to define its future.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Himanshi Lalwani, Hanan Salam
ๆฅ็ๆ่ฆ
College students often face academic challenges that hamper their productivity and well-being. Although self-help books and productivity apps are popular, they often fall short. Books provide generalized, non-interactive guidance, and apps are not inherently educational and can hinder the development of key organizational skills. Traditional productivity coaching offers personalized support, but is resource-intensive and difficult to scale. In this study, we present a proof-of-concept for a socially assistive robot (SAR) as an educational coach and a potential solution to the limitations of existing productivity tools and coaching approaches. The SAR delivers six different lessons on time management and task prioritization. Users interact via a chat interface, while the SAR responds through speech (with a toggle option). An integrated dashboard monitors progress, mood, engagement, confidence per lesson, and time spent per lesson. It also offers personalized productivity insights to foster reflection and self-awareness. We evaluated the system with 15 college students, achieving a System Usability Score of 79.2 and high ratings for overall experience and engagement. Our findings suggest that SAR-based productivity coaching can offer an effective and scalable solution to improve productivity among college students.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Evelyn Zhang, Alex Richardson, Jonathan Sprinkle
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The goal of this paper is to explore the accuracy of dashcam footage to predict the actual kinematic motion of a car-like vehicle. Our approach uses ground truth information from the vehicle's on-board data stream, through the controller area network, and a time-synchronized dashboard camera, mounted to a consumer-grade vehicle, for 18 hours of footage and driving. The contributions of the paper include neural network models that allow us to quantify the accuracy of predicting the vehicle speed and yaw, as well as the presence of a lead vehicle, and its relative distance and speed. In addition, the paper describes how other researchers can gather their own data to perform similar experiments, using open-source tools and off-the-shelf technology.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Joel Cahn, Antonin Thomas, Philippe Pastor
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper presents the development of a control law, which is intended to be implemented on an optical guided glider. This guiding law follows an innovative approach, the reinforcement learning. This control law is used to make navigation more flexible and autonomous in a dynamic environment. The final objective is to track a target detected with the camera and then guide the glider to this point with high precision. Already applied on quad-copter drones, we wish by this study to demonstrate the applicability of reinforcement learning for fixed-wing aircraft on all of its axis.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Haoru Xue, Tairan He, Zi Wang, Qingwei Ben, Wenli Xiao et al.
Robotics
ๆฅ็ๆ่ฆ
Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.
๐
2025-11-30
๐ This work has been submitted to the IEEE for possi
๐ฅ Muhtadin, Mochammad Hilmi Rusydiansyah, Mauridhi Hery Purnomo, I Ketut Eddy Purnama, Chastine Fatichah
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Quadruped robots are increasingly used in various applications due to their high mobility and ability to operate in diverse terrains. However, most available quadruped robots are primarily focused on mobility without object manipulation capabilities. Equipping a quadruped robot with a robotic arm and gripper introduces a challenge in manual control, especially in remote scenarios that require complex commands. This research aims to develop an autonomous grasping system on a quadruped robot using a task-level interaction approach. The system includes hardware integration of a robotic arm and gripper onto the quadruped robot's body, a layered control system designed using ROS, and a web-based interface for human-robot interaction. The robot is capable of autonomously performing tasks such as navigation, object detection, and grasping using GraspNet. Testing was conducted through real-world scenarios to evaluate navigation, object selection and grasping, and user experience. The results show that the robot can perform tasks accurately and consistently, achieving a grasping success rate of 75 % from 12 trials. Therefore, the system demonstrates significant potential in enhancing the capabilities of quadruped robots as service robots in real-world environments.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action models (VLAs) are becoming increasingly capable across diverse robotic tasks. However, their real-world deployment remains slow and inefficient: demonstration videos are often sped up by 5-10x to appear smooth, with noticeable action stalls and delayed reactions to environmental changes. Asynchronous inference offers a promising solution to achieve continuous and low-latency control by enabling robots to execute actions and perform inference simultaneously. However, because the robot and environment continue to evolve during inference, a temporal misalignment arises between the prediction and execution intervals. This leads to significant action instability, while existing methods either degrade accuracy or introduce runtime overhead to mitigate it. We propose VLASH, a general asynchronous inference framework for VLAs that delivers smooth, accurate, and fast reaction control without additional overhead or architectural changes. VLASH estimates the future execution-time state by rolling the robot state forward with the previously generated action chunk, thereby bridging the gap between prediction and execution. Experiments show that VLASH achieves up to 2.03x speedup and reduces reaction latency by up to 17.4x compared to synchronous inference while fully preserving the original accuracy. Moreover, it empowers VLAs to handle fast-reaction, high-precision tasks such as playing ping-pong and playing whack-a-mole, where traditional synchronous inference fails. Code is available at https://github.com/mit-han-lab/vlash
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Yi-Lin Wei, Haoran Liao, Yuhao Lin, Pengyue Wang, Zhizhao Liang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
In this paper, we explore an important yet underexplored task in robot manipulation: cycle-based manipulation, where robots need to perform cyclic or repetitive actions with an expected terminal time. These tasks are crucial in daily life, such as shaking a bottle or knocking a nail. However, few prior works have explored this task, leading to two main challenges: 1) the imitation methods often fail to complete these tasks within the expected terminal time due to the ineffective utilization of history; 2) the absence of a benchmark with sufficient data and automatic evaluation tools hinders development of effective solutions in this area. To address these challenges, we first propose the CycleManip framework to achieve cycle-based task manipulation in an end-to-end imitation manner without requiring any extra models, hierarchical structure or significant computational overhead. The core insight is to enhance effective history perception by a cost-aware sampling strategy and to improve historical understanding by multi-task learning. Second, we introduce a cycle-based task manipulation benchmark, which provides diverse cycle-based tasks, and an automatic evaluation method. Extensive experiments conducted in both simulation and real-world settings demonstrate that our method achieves high success rates in cycle-based task manipulation. The results further show strong adaptability performance in general manipulation, and the plug-and-play ability on imitation policies such as Vision-Language-Action (VLA) models. Moreover, the results show that our approach can be applied across diverse robotic platforms, including bi-arm grippers, dexterous hands, and humanoid robots.
๐
2025-11-30
๐ This paper has been submitted to IEEE Access Journ
๐ฅ Adelina Giurea, Stijn Luchie, Dieter Coppens, Jeroen Hoebeke, Eli De Poorter
ๆฅ็ๆ่ฆ
This paper presents an infrastructure-free approach for obstacle detection and environmental mapping using ultra-wideband (UWB) radar mounted on a mobile robotic platform. Traditional sensing modalities such as visual cameras and Light Detection and Ranging (LiDAR) fail in environments with poor visibility due to darkness, smoke, or reflective surfaces. In these visioned-impaired conditions, UWB radar offers a promising alternative. To this end, this work explores the suitability of robot-mounted UWB radar for environmental mapping in dynamic, anchor-free scenarios. The study investigates how different materials (metal, concrete and plywood) and UWB radio channels (5 and 9) influence the Channel Impulse Response (CIR). Furthermore, a processing pipeline is proposed to achieve reliable mapping of detected obstacles, consisting of 3 steps: (i) target identification (based on CIR peak detection), (ii) filtering (based on peak properties, signal-to-noise score, and phase-difference of arrival), and (iii) clustering (based on distance estimation and angle-of-arrival estimation). The proposed approach successfully reduces noise and multipath effects, resulting in an obstacle detection precision of at least 82.36% and a recall of 89.46% on channel 9 even when detecting low-reflective materials such as plywood. This work offers a foundation for further development of UWB-based localisation and mapping (SLAM) systems that do not rely on visual features and, unlike conventional UWB localisation systems, do not require on fixed anchor nodes for triangulation.
๐
2025-11-30
๐ WWW
๐ฅ Thomas Chabal, Shizhe Chen, Jean Ponce, Cordelia Schmid
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
This paper addresses the Object Goal Navigation problem, where a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and high-level goal prediction, which is executed by a low-level planner for efficient trajectory generation. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly in navigation efficiency metric SPL, and yields promising results on a real robot.
๐
2025-11-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haotian Liang, Xinyi Chen, Bin Wang, Mingkang Chen, Yitian Liu et al.
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performances respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0 with an additional gain of 9.25% from cross-modal learning. We release our codes, models and data at https://github.com/HHYHRHY/MM-ACT.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Yunfeng Lin, Minghuan Liu, Yufei Xue, Ming Zhou, Yong Yu et al.
Robotics
ๆฅ็ๆ่ฆ
The rapid advancement of humanoid robotics has intensified the need for robust and adaptable controllers to enable stable and efficient locomotion across diverse platforms. However, developing such controllers remains a significant challenge because existing solutions are tailored to specific robot designs, requiring extensive tuning of reward functions, physical parameters, and training hyperparameters for each embodiment. To address this challenge, we introduce H-Zero, a cross-humanoid locomotion pretraining pipeline that learns a generalizable humanoid base policy. We show that pretraining on a limited set of embodiments enables zero-shot and few-shot transfer to novel humanoid robots with minimal fine-tuning. Evaluations show that the pretrained policy maintains up to 81% of the full episode duration on unseen robots in simulation while enabling few-shot transfer to unseen humanoids and upright quadrupeds within 30 minutes of fine-tuning.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Nayesha Gandotra, Itamar Mishani, Maxim Likhachev
Robotics
ๆฅ็ๆ่ฆ
Recent progress in contact-rich robotic manipulation has been striking, yet most deployed systems remain confined to simple, scripted routines. One of the key barriers is the lack of motion planning algorithms that can provide verifiable guarantees for safety, efficiency and reliability. To address this, a family of algorithms called Constant-Time Motion Planning (CTMP) was introduced, which leverages a preprocessing phase to enable collision-free motion queries in a fixed, user-specified time budget (e.g., 10 milliseconds). However, existing CTMP methods do not explicitly incorporate the manipulation behaviors essential for object handling. To bridge this gap, we introduce the \textit{Behavioral Constant-Time Motion Planner} (B-CTMP), an algorithm that extends CTMP to solve a broad class of two-step manipulation tasks: (1) a collision-free motion to a behavior initiation state, followed by (2) execution of a manipulation behavior (such as grasping or insertion) to reach the goal. By precomputing compact data structures, B-CTMP guarantees constant-time query in mere milliseconds while ensuring completeness and successful task execution over a specified set of states. We evaluate B-CTMP on two canonical manipulation tasks in simulation, shelf picking and plug insertion,and demonstrate its effectiveness on a real robot. Our results show that B-CTMP unifies collision-free planning and object manipulation within a single constant-time framework, providing provable guarantees of speed and success for manipulation in semi-structured environments.
๐
2025-11-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junwoo Chang, Minwoo Park, Joohwan Seo, Roberto Horowitz, Jongmin Lee et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Group symmetries provide a powerful inductive bias for reinforcement learning (RL), enabling efficient generalization across symmetric states and actions via group-invariant Markov Decision Processes (MDPs). However, real-world environments almost never realize fully group-invariant MDPs; dynamics, actuation limits, and reward design usually break symmetries, often only locally. Under group-invariant Bellman backups for such cases, local symmetry-breaking introduces errors that propagate across the entire state-action space, resulting in global value estimation errors. To address this, we introduce Partially group-Invariant MDP (PI-MDP), which selectively applies group-invariant or standard Bellman backups depending on where symmetry holds. This framework mitigates error propagation from locally broken symmetries while maintaining the benefits of equivariance, thereby enhancing sample efficiency and generalizability. Building on this framework, we present practical RL algorithms -- Partially Equivariant (PE)-DQN for discrete control and PE-SAC for continuous control -- that combine the benefits of equivariance with robustness to symmetry-breaking. Experiments across Grid-World, locomotion, and manipulation benchmarks demonstrate that PE-DQN and PE-SAC significantly outperform baselines, highlighting the importance of selective symmetry exploitation for robust and sample-efficient RL.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Chengjin Du, Federico Bernabei, Zhengyin Du, Sergio Decherchi, Matteo Lo Preti et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Soft robots are powerful tools for manipulating delicate objects, yet their adoption is hindered by two gaps: the lack of integrated tactile sensing and sensor signal distortion caused by actuator deformations. This paper addresses these challenges by introducing the SoftMag actuator: a magnetic tactile-sensorized soft actuator. Unlike systems relying on attached sensors or treating sensing and actuation separately, SoftMag unifies them through a shared architecture while confronting the mechanical parasitic effect, where deformations corrupt tactile signals. A multiphysics simulation framework models this coupling, and a neural-network-based decoupling strategy removes the parasitic component, restoring sensing fidelity. Experiments including indentation, quasi-static and step actuation, and fatigue tests validate the actuator's performance and decoupling effectiveness. Building upon this foundation, the system is extended into a two-finger SoftMag gripper, where a multi-task neural network enables real-time prediction of tri-axial contact forces and position. Furthermore, a probing-based strategy estimates object firmness during grasping. Validation on apricots shows a strong correlation (Pearson r over 0.8) between gripper-estimated firmness and reference measurements, confirming the system's capability for non-destructive quality assessment. Results demonstrate that combining integrated magnetic sensing, learning-based correction, and real-time inference enables a soft robotic platform that adapts its grasp and quantifies material properties. The framework offers an approach for advancing sensorized soft actuators toward intelligent, material-aware robotics.
๐
2025-11-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while being 18 times faster and reducing memory footprint by 12 times.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Md Muzakkir Quamar, Ali Nasir, Sami ELFerik
ๆฅ็ๆ่ฆ
This paper presents a scalable and fault-tolerant framework for unmanned aerial vehicle (UAV) mission management in complex and uncertain environments. The proposed approach addresses the computational bottleneck inherent in solving large-scale Markov Decision Processes (MDPs) by introducing a two-stage decomposition strategy. In the first stage, a factor-based algorithm partitions the global MDP into smaller, goal-specific sub-MDPs by leveraging domain-specific features such as goal priority, fault states, spatial layout, and energy constraints. In the second stage, a priority-based recombination algorithm solves each sub-MDP independently and integrates the results into a unified global policy using a meta-policy for conflict resolution. Importantly, we present a theoretical analysis showing that, under mild probabilistic independence assumptions, the combined policy is provably equivalent to the optimal global MDP policy. Our work advances artificial intelligence (AI) decision scalability by decomposing large MDPs into tractable subproblems with provable global equivalence. The proposed decomposition framework enhances the scalability of Markov Decision Processes, a cornerstone of sequential decision-making in artificial intelligence, enabling real-time policy updates for complex mission environments. Extensive simulations validate the effectiveness of our method, demonstrating orders-of-magnitude reduction in computation time without sacrificing mission reliability or policy optimality. The proposed framework establishes a practical and robust foundation for scalable decision-making in real-time UAV mission execution.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Nan Sun, Bo Mao, Yongchang Li, Chenxu Wang, Di Guo et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Foundation models have become central to unifying perception and planning in robotics, yet real-world deployment exposes a mismatch between their monolithic assumption that a single model can handle all cognitive functions and the distributed, dynamic nature of practical service workflows. Vision-language models offer strong semantic understanding but lack embodiment-aware action capabilities while relying on hand-crafted skills. Vision-Language-Action policies enable reactive manipulation but remain brittle across embodiments, weak in geometric grounding, and devoid of proactive collaboration mechanisms. These limitations indicate that scaling a single model alone cannot deliver reliable autonomy for service robots operating in human-populated settings. To address this gap, we present InteractGen, an LLM-powered multi-agent framework that decomposes robot intelligence into specialized agents for continuous perception, dependency-aware planning, decision and verification, failure reflection, and dynamic human delegation, treating foundation models as regulated components within a closed-loop collective. Deployed on a heterogeneous robot team and evaluated in a three-month open-use study, InteractGen improves task success, adaptability, and human-robot collaboration, providing evidence that multi-agent orchestration offers a more feasible path toward socially grounded service autonomy than further scaling standalone models.
๐
2025-11-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Libo Wang
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
To address the gap in humanoid robot cognitive systems regarding the lack of a time-updable mediating thought space between semantics and continuous control, this study constructs and trains a VLA model named "Sigma" that runs on a single RTX 4090. It uses the open-source pi05_base model as a foundation and preprocesses svla_so101_pickplace into a training dataset. The researcher independently designed an architecture for a vision-language-action model that combines deep semantic understanding and association to achieve telepathic communication. The training process involved repeated optimizations of data preprocessing, LoRA fine-tuning, and the inference-stage adapter. The experiment employed offline closed-loop replay, comparing Sigma with the untuned pure pi05_base_base model under data conditions. Results showed that Sigma exhibited a stable decrease in control MSE across vector, fragment, and entire trajectory timescales, while maintaining the telepathy norm and semantic-text alignment quality unchanged. It demonstrates that mind-responsive alignment control is quantified through an architecture that combines deep understanding of semantics and association without retraining the base model, which provides reproducible experience for semantic alignment and intention-driven behavior in humanoid robots.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Nitin Kumar Singh, Arie Rachmad Syulistyo, Yuichiro Tanaka, Hakaru Tamukoh
Machine Learning
ๆฅ็ๆ่ฆ
Sign language recognition (SLR) facilitates communication between deaf and hearing individuals. Deep learning is widely used to develop SLR-based systems; however, it is computationally intensive and requires substantial computational resources, making it unsuitable for resource-constrained devices. To address this, we propose an efficient sign language recognition system using MediaPipe and an echo state network (ESN)-based bidirectional reservoir computing (BRC) architecture. MediaPipe extracts hand joint coordinates, which serve as inputs to the ESN-based BRC architecture. The BRC processes these features in both forward and backward directions, efficiently capturing temporal dependencies. The resulting states of BRC are concatenated to form a robust representation for classification. We evaluated our method on the Word-Level American Sign Language (WLASL) video dataset, achieving a competitive accuracy of 57.71% and a significantly lower training time of only 9 seconds, in contrast to the 55 minutes and $38$ seconds required by the deep learning-based Bi-GRU approach. Consequently, the BRC-based SLR system is well-suited for edge devices.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Ruijia Liu, Ancheng Hou, Shaoyuan Li, Xiang Yin
Robotics
ๆฅ็ๆ่ฆ
Linear Temporal Logic (LTL) provides a rigorous framework for complex robotic tasks, yet existing methods often rely on accurate dynamics models or expensive online interactions. In this work, we address LTL-constrained control in a challenging offline, model-free setting, utilizing only fixed, task-agnostic datasets of fragmented trajectories. We propose SAGAS, a novel framework combining graph-assisted trajectory stitching with automata-guided planning. First, we construct a latent reachability graph from a learned temporal-distance representation. To bridge the semantic gap, we augment this graph with certified anchor nodes and probabilistic soft labels. We then translate the specification into a Bรผchi automaton and search the implicit product space to derive a cost-minimal prefix-suffix plan. Finally, a subgoal-conditioned low-level policy is deployed to execute these latent waypoints. Experiments on OGBench locomotion domains demonstrate that SAGAS successfully synthesizes efficient trajectories for diverse LTL tasks, effectively bridging the gap between fragmented offline data and complex logical constraints.
๐
2025-11-30
๐ ArXiv Preprint (Robotics)
๐ฅ Sizhe Wei, Xulin Chen, Fengze Xie, Garrett Ethan Katz, Zhenyu Gan et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning has recently enabled impressive locomotion capabilities on legged robots; however, most policy architectures remain morphology- and symmetry-agnostic, leading to inefficient training and limited generalization. This work introduces MS-PPO, a morphological-symmetry-equivariant policy learning framework that encodes robot kinematic structure and morphological symmetries directly into the policy network. We construct a morphology-informed graph neural architecture that is provably equivariant with respect to the robot's morphological symmetry group actions, ensuring consistent policy responses under symmetric states while maintaining invariance in value estimation. This design eliminates the need for tedious reward shaping or costly data augmentation, which are typically required to enforce symmetry. We evaluate MS-PPO in simulation on Unitree Go2 and Xiaomi CyberDog2 robots across diverse locomotion tasks, including trotting, pronking, slope walking, and bipedal turning, and further deploy the learned policies on hardware. Extensive experiments show that MS-PPO achieves superior training stability, symmetry generalization ability, and sample efficiency in challenging locomotion tasks, compared to state-of-the-art baselines. These findings demonstrate that embedding both kinematic structure and morphological symmetry into policy learning provides a powerful inductive bias for legged robot locomotion control. Our code will be made publicly available at https://lunarlab-gatech.github.io/MS-PPO/.
๐
2025-11-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xingtai Gui, Jianbo Zhao, Wencheng Han, Jikai Wang, Jiahao Gong et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
End-to-end autonomous driving systems directly generate driving policies from raw sensor inputs. While these systems can extract effective environmental features for planning, relying on auxiliary perception tasks, developing perception annotation-free planning paradigms has become increasingly critical due to the high cost of manual perception annotation. In this work, we propose TrajDiff, a Trajectory-oriented BEV Conditioned Diffusion framework that establishes a fully perception annotation-free generative method for end-to-end autonomous driving. TrajDiff requires only raw sensor inputs and future trajectory, constructing Gaussian BEV heatmap targets that inherently capture driving modalities. We design a simple yet effective trajectory-oriented BEV encoder to extract the TrajBEV feature without perceptual supervision. Furthermore, we introduce Trajectory-oriented BEV Diffusion Transformer (TB-DiT), which leverages ego-state information and the predicted TrajBEV features to directly generate diverse yet plausible trajectories, eliminating the need for handcrafted motion priors. Beyond architectural innovations, TrajDiff enables exploration of data scaling benefits in the annotation-free setting. Evaluated on the NAVSIM benchmark, TrajDiff achieves 87.5 PDMS, establishing state-of-the-art performance among all annotation-free methods. With data scaling, it further improves to 88.5 PDMS, which is comparable to advanced perception-based approaches. Our code and model will be made publicly available.
๐
2025-11-29
๐ ArXiv Preprint (Robotics)
๐ฅ Jinhao Zhang, Wenlong Xia, Zhexuan Zhou, Youmin Gong, Jie Mei
Robotics
ๆฅ็ๆ่ฆ
Diffusion models have demonstrated strong capabilities for modeling human-like driving behaviors in autonomous driving, but their iterative sampling process induces substantial latency, and operating directly on raw trajectory points forces the model to spend capacity on low-level kinematics, rather than high-level multi-modal semantics. To address these limitations, we propose LAtent Planner (LAP), a framework that plans in a VAE-learned latent space that disentangles high-level intents from low-level kinematics, enabling our planner to capture rich, multi-modal driving strategies. We further introduce a fine-grained feature distillation mechanism to guide a better interaction and fusion between the high-level semantic planning space and the vectorized scene context. Notably, LAP can produce high-quality plans in one single denoising step, substantially reducing computational overhead. Through extensive evaluations on the large-scale nuPlan benchmark, LAP achieves state-of-the-art closed-loop performance among learning-based planning methods, while demonstrating an inference speed-up of at most 10 times over previous SOTA approaches. Code will be released at: https://github.com/jhz1192/Latent-Planner.
๐
2025-11-27
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dasheng Bi, Yubin Hu, Mohammed N. Nasir
ๆฅ็ๆ่ฆ
Learning how to do things from trial and error in real time is a hallmark of biological intelligence, yet most LLM-based agents lack mechanisms to acquire procedural knowledge after deployment. We propose Procedural Recall for Agents with eXperiences Indexed by State (PRAXIS), a lightweight post-training learning mechanism that stores the consequences of actions and retrieves them by jointly matching environmental and internal states of past episodes to the current state. PRAXIS augments agentic action selection with retrieved state-action-result exemplars that are generated in real time. When evaluated on the REAL web browsing benchmark, PRAXIS improves task completion accuracy, reliability, and cost efficiency across different foundation model backbones, and shows preliminary generalization to unseen tasks in similar environments. These results demonstrate that PRAXIS enables the practical adoption of AI agents in fast-evolving stateful environments by helping them learn new procedures effectively.
๐
2025-11-27
๐ ArXiv Preprint (Machine Learning)
๐ฅ Florian Rottach, William Rudman, Bastian Rieck, Harrisen Scells, Carsten Eickhoff
ๆฅ็ๆ่ฆ
Studying how embeddings are organized in space not only enhances model interpretability but also uncovers factors that drive downstream task performance. In this paper, we present a comprehensive analysis of topological and geometric measures across a wide set of text embedding models and datasets. We find a high degree of redundancy among these measures and observe that individual metrics often fail to sufficiently differentiate embedding spaces. Building on these insights, we introduce Unified Topological Signatures (UTS), a holistic framework for characterizing embedding spaces. We show that UTS can predict model-specific properties and reveal similarities driven by model architecture. Further, we demonstrate the utility of our method by linking topological structure to ranking effectiveness and accurately predicting document retrievability. We find that a holistic, multi-attribute perspective is essential to understanding and leveraging the geometry of text embeddings.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates significant potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing RLVR methods are often constrained by issues such as coarse-grained rewards, reward noise, and inefficient exploration, which lead to unstable training and entropy collapse. To address this challenge, we propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). The intuition behind it lies in the fact that the probabilities of an LLM generating different responses can inherently and directly reflect its self-assessment of the reasoning process. Inspired by the idea of preference modeling, ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt, and integrates this score with verifiable rewards to guide the exploration process. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors, enhances the relative superiority of undervalued high-quality responses, and prevents the model from overfitting to specific strategies, thereby facilitating more thorough exploration. Comprehensive experiments across four general-domain benchmarks and three mathematical benchmarks demonstrate that ICPO steadily boosts reasoning compared to GRPO.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang et al.
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present \textbf{Matrix}, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves $2$--$15\times$ higher data generation throughput under identical hardware resources, without compromising output quality.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie et al.
Multimodal
ๆฅ็ๆ่ฆ
MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo -- solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge -- preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction--hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shruti Bothe, Illyyne Saffar, Aurelie Boisbunon, Hasan Farooq, Julien Forgeat et al.
ๆฅ็ๆ่ฆ
The rise of AI in telecommunications, from optimizing Radio Access Networks to managing user experience, has sharply increased data volumes and training demands. Telecom data is often noisy, high-dimensional, costly to store, process, and label. Despite Ai's critical role, standard workflows still assume all training samples contribute equally. On the other hand, next generation systems require AI models that are accurate, efficient, and sustainable.The paper questions the assumptions of equal importance by focusing on applying and analyzing the roles of individual samples in telecom training and assessing whether the proposed model optimizes computation and energy use. we perform sample-level gradient analysis across epochs to identify patterns of influence and redundancy in model learning. Based on this, we propose a sample importance framework thats electively prioritizes impactful data and reduces computation without compromising accuracy. Experiments on three real-world telecom datasets show that our method [reserves performance while reducing data needs and computational overhead while advancing the goals of sustainable AI in telecommunications.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Locke Cai, Ivan Provilkov
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator): the policy learns to mimic expert answers, while the critic learns to compare and distinguish between policy and expert answers. Our method trains both the policy and the critic jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL on verifiable tasks. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Naifu Zhang, Wei Tao, Xi Xiao, Qianpu Sun, Yuxin Zheng et al.
Multimodal
ๆฅ็ๆ่ฆ
In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.
๐
2025-11-26
๐ ACM MMSys 2025
๐ฅ Kirill Paramonov, Mete Ozay, Aristeidis Mystakidis, Nikolaos Tsalikidis, Dimitrios Sotos et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system's effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system's practicality in real-world scenarios.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Peter S. Hovmand, Kari O'Donnell, Callie Ogland-Hand, Brian Biroscak, Douglas D. Gunzler
ๆฅ็ๆ่ฆ
AI/ML models have rapidly gained prominence as innovations for solving previously unsolved problems and their unintended consequences from amplifying human biases. Advocates for responsible AI/ML have sought ways to draw on the richer causal models of system dynamics to better inform the development of responsible AI/ML. However, a major barrier to advancing this work is the difficulty of bringing together methods rooted in different underlying assumptions (i.e., Dana Meadow's "the unavoidable a priori"). This paper brings system dynamics and structural equation modeling together into a common mathematical framework that can be used to generate systems from distributions, develop methods, and compare results to inform the underlying epistemology of system dynamics for data science and AI/ML applications.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anantha Padmanaban Krishna Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mathew Vanherreweghe, Michael H. Freedman, Keith M. Adams
Machine Learning
ๆฅ็ๆ่ฆ
Recent work by Freedman and Mulligan demonstrated that shallow multilayer perceptrons spontaneously develop Kolmogorov-Arnold geometric (KAG) structure during training on synthetic three-dimensional tasks. However, it remained unclear whether this phenomenon persists in realistic high-dimensional settings and what spatial properties this geometry exhibits.
We extend KAG analysis to MNIST digit classification (784 dimensions) using 2-layer MLPs with systematic spatial analysis at multiple scales. We find that KAG emerges during training and appears consistently across spatial scales, from local 7-pixel neighborhoods to the full 28x28 image. This scale-agnostic property holds across different training procedures: both standard training and training with spatial augmentation produce the same qualitative pattern. These findings reveal that neural networks spontaneously develop organized, scale-invariant geometric structure during learning on realistic high-dimensional data.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hans Gundlach, Alex Fogelson, Jayson Lynch, Ana Trisovic, Jonathan Rosenfeld et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Algorithms have been estimated to increase AI training FLOP efficiency by a factor of 22,000 between 2012 and 2023 [Ho et al., 2024]. Running small-scale ablation experiments on key innovations from this time period, we are able to account for less than 10x of these gains. Surveying the broader literature, we estimate that additional innovations not included in our ablations account for less than 10x, yielding a total under 100x. This leads us to conduct scaling experiments, which reveal that much of this efficiency gap can be explained by algorithms with scale-dependent efficiency improvements. In particular, we conduct scaling experiments between LSTMs and Transformers, finding exponent differences in their compute-optimal scaling law while finding little scaling difference for many other innovations. These experiments demonstrate that - contrary to standard assumptions - an algorithm's efficiency gains are tied to compute scale. Using experimental extrapolation and literature estimates, we account for 6,930x efficiency gains over the same time period, with the scale-dependent LSTM-to-Transformer transition accounting for the majority of gains. Our results indicate that algorithmic progress for small models has been far slower than previously assumed, and that measures of algorithmic efficiency are strongly reference-dependent.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dongyang Fan, Diba Hashemi, Sai Praneeth Karimireddy, Martin Jaggi
Natural Language Processing
ๆฅ็ๆ่ฆ
Incorporating metadata in Large Language Models (LLMs) pretraining has recently emerged as a promising approach to accelerate training. However prior work highlighted only one useful signal-URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find other types of metadata, such as fine-grained indicators of document quality that can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting an appropriate metadata as auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Charles Schepanowski, Charles Ling
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) achieve impressive results on many benchmarks, yet their capacity for planning and stateful reasoning remains unclear. We study these abilities directly, without code execution or other tools, using the 8-puzzle: a classic task that requires state tracking and goal-directed planning while allowing precise, step-by-step evaluation. Four models are tested under common prompting conditions (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) and with tiered corrective feedback. Feedback improves success rates for some model-prompt combinations, but many successful runs are long, computationally expensive, and indirect. We then examine the models with an external move validator that provides only valid moves. Despite this level of assistance, none of the models solve any puzzles in this setting. Qualitative analysis reveals two dominant deficits across all models: (1) brittle internal state representations, leading to frequent invalid moves, and (2) weak heuristic planning, with models entering loops or selecting actions that do not reduce the distance to the goal state. These findings indicate that, in the absence of external tools such as code interpreters, current LLMs have substantial limitations in planning and that further progress may require mechanisms for maintaining explicit state and performing structured search.
๐
2025-11-26
๐ NeurIPS 2025
๐ฅ Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, Ding Zhao
Robotics
ๆฅ็ๆ่ฆ
End-to-end (E2E) autonomous driving models have demonstrated strong performance in open-loop evaluations but often suffer from cascading errors and poor generalization in closed-loop settings. To address this gap, we propose Model-based Policy Adaptation (MPA), a general framework that enhances the robustness and safety of pretrained E2E driving agents during deployment. MPA first generates diverse counterfactual trajectories using a geometry-consistent simulation engine, exposing the agent to scenarios beyond the original dataset. Based on this generated data, MPA trains a diffusion-based policy adapter to refine the base policy's predictions and a multi-step Q value model to evaluate long-term outcomes. At inference time, the adapter proposes multiple trajectory candidates, and the Q value model selects the one with the highest expected utility. Experiments on the nuScenes benchmark using a photorealistic closed-loop simulator demonstrate that MPA significantly improves performance across in-domain, out-of-domain, and safety-critical scenarios. We further investigate how the scale of counterfactual data and inference-time guidance strategies affect overall effectiveness.
๐
2025-11-26
๐ ArXiv Preprint (cs.SD)
๐ฅ Kexin Li, Xiao Hu, Ilya Grishchenko, David Lie
ๆฅ็ๆ่ฆ
The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is by watermarking it, so that it can be easily distinguished from genuine audio. As those seeking to misuse AI-generated audio may thus seek to remove audio watermarks, studying effective watermark removal techniques is critical to being able to objectively evaluate the robustness of audio watermarks against removal. Previous watermark removal schemes either assume impractical knowledge of the watermarks they are designed to remove or are computationally expensive, potentially generating a false sense of confidence in current watermark schemes.
We introduce HarmonicAttack, an efficient audio watermark removal method that only requires the basic ability to generate the watermarks from the targeted scheme and nothing else. With this, we are able to train a general watermark removal model that is able to remove the watermarks generated by the targeted scheme from any watermarked audio sample. HarmonicAttack employs a dual-path convolutional autoencoder that operates in both temporal and frequency domains, along with GAN-style training, to separate the watermark from the original audio. When evaluated against state-of-the-art watermark schemes AudioSeal, WavMark, and Silentcipher, HarmonicAttack demonstrates greater watermark removal ability than previous watermark removal methods with near real-time performance. Moreover, while HarmonicAttack requires training, we find that it is able to transfer to out-of-distribution samples with minimal degradation in performance.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiang Gu, Liming Lu, Xu Zheng, Anan Du, Yongbin Zhou et al.
Multimodal
ๆฅ็ๆ่ฆ
Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD) for distilling robust 3D point cloud model. It learns lightweight prompts by aligning student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure a reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation is all during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Liming Yang, Junyu Luo, Xuanzhe Liu, Yiling Lou, Zhenpeng Chen
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Large language model (LLM)-based multi-agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi-agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi-agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning-based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state-of-the-art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.
๐
2025-11-26
๐ Accessible at https://projecteuclid.org/journals/j
๐ฅ Maria Perez-Ortiz
ๆฅ็ๆ่ฆ
In an era marked by rapid technological advancements and complex global challenges, responsible foresight has emerged as an essential framework for policymakers aiming to navigate future uncertainties and shape the future. Responsible foresight entails the ethical anticipation of emerging opportunities and risks, with a focus on fostering proactive, sustainable, and accountable future design. This paper coins the term "responsible computational foresight", examining the role of human-centric artificial intelligence and computational modeling in advancing responsible foresight, establishing a set of foundational principles for this new field and presenting a suite of AI-driven foresight tools currently shaping it. AI, particularly in conjunction with simulations and scenario analysis, enhances policymakers' ability to address uncertainty, evaluate risks, and devise strategies geared toward sustainable, resilient futures. However, responsible foresight extends beyond mere technical forecasting; it demands a nuanced understanding of the interdependencies within social, environmental, economic and political systems, alongside a commitment to ethical, long-term decision-making that supports human intelligence. We argue that AI will play a role as a supportive tool in responsible, human-centered foresight, complementing rather than substituting policymaker judgment to enable the proactive shaping of resilient and ethically sound futures. This paper advocates for the thoughtful integration of AI into foresight practices to empower policymakers and communities as they confront the grand challenges of the 21st century.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alex Diep
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
If a language model cannot reliably disclose its AI identity in expert contexts, users cannot trust its competence boundaries. This study examines self-transparency in models assigned professional personas within high-stakes domains where false expertise risks user harm. Using a common-garden design, sixteen open-weight models (4B--671B parameters) were audited across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure initially, while a Neurosurgeon persona elicited only 3.5%. This creates preconditions for a "Reverse Gell-Mann Amnesia" effect, where transparency in some domains leads users to overgeneralize trust to contexts where disclosure fails. Disclosure ranged from 2.8% to 73.6%, with a 14B model reaching 61.4% while a 70B produced just 4.1%. Model identity predicted behavior better than parameter count ($ฮR_{adj}^{2} = 0.359$ vs 0.018). Reasoning optimization actively suppressed self-transparency in some models, with reasoning variants showing up to 48.4% lower disclosure than base counterparts. Bayesian validation with Rogan--Gladen correction confirmed robustness to measurement error ($ฮบ= 0.908$). These findings demonstrate transparency reflects training factors rather than scale. Organizations cannot assume safety properties transfer to deployment contexts, requiring deliberate behavior design and empirical verification.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Hui Zhou, Siyuan Huang, Minxing Li, Hao Zhang, Lue Fan et al.
Robotics
ๆฅ็ๆ่ฆ
Vision Language Action models have significantly advanced general purpose robotic manipulation by harnessing large scale pretrained vision and language representations. Among existing approaches, a majority of current VLA systems employ parallel two finger grippers as their default end effectors. However, such grippers face inherent limitations in handling certain real world tasks such as wiping glass surfaces or opening drawers without handles due to insufficient contact area or lack of adhesion. To overcome these challenges, we present a low cost, integrated hardware design that combines a mechanical two finger gripper with a vacuum suction unit, enabling dual mode manipulation within a single end effector. Our system supports flexible switching or synergistic use of both modalities, expanding the range of feasible tasks. We validate the efficiency and practicality of our design within two state of the art VLA frameworks: DexVLA and Pi0. Experimental results demonstrate that with the proposed hybrid end effector, robots can successfully perform multiple complex tasks that are infeasible for conventional two finger grippers alone. All hardware designs and controlling systems will be released.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jin Pin, Krasowski Hanna, Vanneaux Elena
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Obtaining safety guarantees for reinforcement learning is a major challenge to achieve applicability for real-world tasks. Safety shields extend standard reinforcement learning and achieve hard safety guarantees. However, existing safety shields commonly use random sampling of safe actions or a fixed fallback controller, therefore disregarding future performance implications of different safe actions. In this work, we propose a predictive safety shield for model-based reinforcement learning agents in discrete space. Our safety shield updates the Q-function locally based on safe predictions, which originate from a safe simulation of the environment model. This shielding approach improves performance while maintaining hard safety guarantees. Our experiments on gridworld environments demonstrate that even short prediction horizons can be sufficient to identify the optimal path. We observe that our approach is robust to distribution shifts, e.g., between simulation and reality, without requiring additional training.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yanxing Huang, Zihan Tang, Zejin Lin, Peng Li, Yang Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
The key limitation of the verification performance lies in the ability of error detection. With this intuition we designed several variants of pessimistic verification, which are simple workflows that could significantly improve the verification of open-ended math questions. In pessimistic verification we construct multiple parallel verifications for the same proof, and the proof is deemed incorrect if any one of them reports an error. This simple technique significantly improves the performance across many math verification benchmarks without incurring substantial computational resources. Its token efficiency even surpassed extended long-CoT in test-time scaling. Our case studies further indicate that the majority of false negatives in stronger models are actually caused by annotation errors in the original dataset, so our method's performance is in fact underestimated. Self-verification for mathematical problems can effectively improve the reliability and performance of language model outputs, and it also plays a critical role in enabling long-horizon mathematical tasks. We believe that research on pessimistic verification will help enhance the mathematical capabilities of language models across a wide range of tasks.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lina Conti, Dennis Fucci, Marco Gaido, Matteo Negri, Guillaume Wisniewski et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker's vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matฤซss Kalnฤre, Sofoklis Kitharidis, Thomas Bรคck, Niki van Stein
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting various Mechanistic Interpretability techniques; activation patching, attention saliency, and sparse autoencoders, from NLP to transformer architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.
๐
2025-11-26
๐ ArXiv Preprint (cs.MA)
๐ฅ Ke Zhang, Xiaoning Zhao, Ce Zheng, Jiahong Ning, Dandan Zhu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
This study proposes Tool-RoCo, a novel benchmark for evaluating large language models (LLMs) in long-term multi-agent cooperation based on RoCo, a multi-robot cooperative benchmark. Recent research on LLM-based multi-agent systems has relied on predefined orchestration, while ignoring agent autonomy. Tool-RoCo treats other agents as tools and introduces cooperative tools, leveraging tool usage to evaluate multi-agent cooperation and self-organization. Tool usage means that each agent (LLM) selects a tool from a candidate set based on the current state, receives feedback, and adjusts its selection in subsequent rounds. To evaluate different autonomy levels, we propose four LLM paradigms: (1) centralized cooperation, where a single LLM allocates tools to all agents; (2) centralized self-organization, where a central LLM autonomously activates agents while keeping others inactive; (3) decentralized cooperation, where each agent has its own LLM and calls tools based on local information; and (4) self-organization, where a randomly chosen initial agent can request collaboration, activating additional agents via tool calls. Tool-RoCo includes three multi-robot tasks, SORT, PACK, and CABINET, to measure format and parameter accuracy and agent coordination through tool usage. The results using several LLMs showed that cooperative tools accounted for only 7.09% of all tools, indicating that LLM-based agents rarely invoked others as assistants. Moreover, activation tools accounted for 96.42%, suggesting that current LLMs tend to maintain active agents while seldom deactivating them for adaptive coordination. Tool-RoCo provides a systematic benchmark to evaluate LLM autonomy and cooperation in multi-agent tasks. Code and Demo: https://github.com/ColaZhang22/Tool-Roco
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Taehoon Kim, Donghwan Jang, Bohyung Han
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We present a novel training approach, named Merge-and-Bound (M&B) for Class Incremental Learning (CIL), which directly manipulates model weights in the parameter space for optimization. Our algorithm involves two types of weight merging: inter-task weight merging and intra-task weight merging. Inter-task weight merging unifies previous models by averaging the weights of models from all previous stages. On the other hand, intra-task weight merging facilitates the learning of current task by combining the model parameters within current stage. For reliable weight merging, we also propose a bounded update technique that aims to optimize the target model with minimal cumulative updates and preserve knowledge from previous tasks; this strategy reveals that it is possible to effectively obtain new models near old ones, reducing catastrophic forgetting. M&B is seamlessly integrated into existing CIL methods without modifying architecture components or revising learning objectives. We extensively evaluate our algorithm on standard CIL benchmarks and demonstrate superior performance compared to state-of-the-art methods.
๐
2025-11-26
๐ NeurIPS 2025
๐ฅ Dong-Jae Lee, Jiwan Hur, Jaehyun Choi, Jaemyung Yu, Junmo Kim
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as rank collapsing and over-smoothing phenomenon. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. high-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over smoothing. Furthermore, we analyze the previous methods, shedding light on their implicit frequency characteristics and limitations.
๐
2025-11-26
๐ NeurIPS 2025
๐ฅ Fabian Paischer, Leo Cotteleer, Yann Dreze, Richard Kurle, Dylan Rubini et al.
Machine Learning
ๆฅ็ๆ่ฆ
The widespread use of neural surrogates in automotive aerodynamics, enabled by datasets such as DrivAerML and DrivAerNet++, has primarily focused on bluff-body flows with large wakes. Extending these methods to aerospace, particularly in the transonic regime, remains challenging due to the high level of non-linearity of compressible flows and 3D effects such as wingtip vortices. Existing aerospace datasets predominantly focus on 2D airfoils, neglecting these critical 3D phenomena. To address this gap, we present a new dataset of CFD simulations for 3D wings in the transonic regime. The dataset comprises volumetric and surface-level fields for around $30,000$ samples with unique geometry and inflow conditions. This allows computation of lift and drag coefficients, providing a foundation for data-driven aerodynamic optimization of the drag-lift Pareto front. We evaluate several state-of-the-art neural surrogates on our dataset, including Transolver and AB-UPT, focusing on their out-of-distribution (OOD) generalization over geometry and inflow variations. AB-UPT demonstrates strong performance for transonic flowfields and reproduces physically consistent drag-lift Pareto fronts even for unseen wing configurations. Our results demonstrate that AB-UPT can approximate drag-lift Pareto fronts for unseen geometries, highlighting its potential as an efficient and effective tool for rapid aerodynamic design exploration. To facilitate future research, we open-source our dataset at https://huggingface.co/datasets/EmmiAI/Emmi-Wing.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yurui Zheng, Yijun Chen, Shaohong Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This paper proposes a bidirectional readability assessment mechanism that captures contextual information to identify regions with rich semantic information in the text, thereby predicting the readability level of individual sentences. These sentence-level labels are then used to assist in predicting the overall readability level of the document. Additionally, a pairwise sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtraction. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive performance and outperforms other baseline models.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Yunjian Zhang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments over massive MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Junjian Wang, Lidan Zhao, Xi Sheryl Zhang
ๆฅ็ๆ่ฆ
Ensuring the safety of embodied AI agents during task planning is critical for real-world deployment, especially in household environments where dangerous instructions pose significant risks. Existing methods often suffer from either high computational costs due to preference alignment training or over-rejection when using single-agent safety prompts. To address these limitations, we propose MADRA, a training-free Multi-Agent Debate Risk Assessment framework that leverages collective reasoning to enhance safety awareness without sacrificing task performance. MADRA employs multiple LLM-based agents to debate the safety of a given instruction, guided by a critical evaluator that scores responses based on logical soundness, risk identification, evidence quality, and clarity. Through iterative deliberation and consensus voting, MADRA significantly reduces false rejections while maintaining high sensitivity to dangerous tasks. Additionally, we introduce a hierarchical cognitive collaborative planning framework that integrates safety, memory, planning, and self-evolution mechanisms to improve task success rates through continuous learning. We also contribute SafeAware-VH, a benchmark dataset for safety-aware task planning in VirtualHome, containing 800 annotated instructions. Extensive experiments on AI2-THOR and VirtualHome demonstrate that our approach achieves over 90% rejection of unsafe tasks while ensuring that safe-task rejection is low, outperforming existing methods in both safety and execution efficiency. Our work provides a scalable, model-agnostic solution for building trustworthy embodied agents.
๐
2025-11-26
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Rebeka Toth, Tamas Bisztray, Richard Dubniczky
Natural Language Processing
ๆฅ็ๆ่ฆ
Phishing and spam emails remain a major cybersecurity threat, with attackers increasingly leveraging Large Language Models (LLMs) to craft highly deceptive content. This study presents a comprehensive email dataset containing phishing, spam, and legitimate messages, explicitly distinguishing between human- and LLM-generated content. Each email is annotated with its category, emotional appeal (e.g., urgency, fear, authority), and underlying motivation (e.g., link-following, credential theft, financial fraud). We benchmark multiple LLMs on their ability to identify these emotional and motivational cues and select the most reliable model to annotate the full dataset. To evaluate classification robustness, emails were also rephrased using several LLMs while preserving meaning and intent. A state-of-the-art LLM was then assessed on its performance across both original and rephrased emails using expert-labeled ground truth. The results highlight strong phishing detection capabilities but reveal persistent challenges in distinguishing spam from legitimate emails. Our dataset and evaluation framework contribute to improving AI-assisted email security systems. To support open science, all code, templates, and resources are available on our project site.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhe Jiang, Jiong Wang, Xiaoyu Yue, Zijie Guo, Wenlong Zhang et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Extreme weather events pose escalating risks to global society, underscoring the urgent need to unravel their underlying physical mechanisms. Yet the prevailing expert-driven, labor-intensive diagnostic paradigm has created a critical analytical bottleneck, stalling scientific progress. While AI for Earth Science has achieved notable advances in prediction, the equally essential challenge of automated diagnostic reasoning remains largely unexplored. We present the Extreme Weather Expert (EWE), the first intelligent agent framework dedicated to this task. EWE emulates expert workflows through knowledge-guided planning, closed-loop reasoning, and a domain-tailored meteorological toolkit. It autonomously produces and interprets multimodal visualizations from raw meteorological data, enabling comprehensive diagnostic analyses. To catalyze progress, we introduce the first benchmark for this emerging field, comprising a curated dataset of 103 high-impact events and a novel step-wise evaluation metric. EWE marks a step toward automated scientific discovery and offers the potential to democratize expertise and intellectual resources, particularly for developing nations vulnerable to extreme weather.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Futian Wang, Fan Zhang, Xiao Wang, Mengqi Wang, Dexing Huang et al.
ๆฅ็ๆ่ฆ
Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Simon Sรผwer, Kester Bagemihl, Sylvie Baier, Lucia Dicunta, Markus List et al.
Robotics
ๆฅ็ๆ่ฆ
Repurposing approved drugs offers a time-efficient and cost-effective alternative to traditional drug development. However, in silico prediction of repurposing candidates is challenging and requires the effective collaboration of specialists in various fields, including pharmacology, medicine, biology, and bioinformatics. Fragmented, specialized algorithms and tools often address only narrow aspects of the overall problem, and heterogeneous, unstructured data landscapes require specialized users to be involved. Hence, these data services do not integrate smoothly across workflows. With ChatDRex, we present a conversation-based, multi-agent system that facilitates the execution of complex bioinformatic analyses aiming for network-based drug repurposing prediction. It builds on the integrated systems medicine knowledge graph NeDRex. ChatDRex provides natural language access to its extensive biomedical KG and integrates bioinformatics agents for network analysis and drug repurposing, complemented by agents for functional coherence evaluation for in silico validation, as well as agents for literature mining and for discussing the obtained results in a scientific context. Its flexible multi-agent design assigns specific tasks to specialized agents, including query routing, data retrieval, algorithm execution, and result visualization. A dedicated reasoning module keeps the user in the loop and allows for hallucination detection. By enabling physicians and researchers without computer science expertise to control complex analyses in natural language, ChatDRex democratizes access to bioinformatics as an important resource for drug repurposing. It enables clinical experts to generate hypotheses and explore drug repurposing opportunities, ultimately accelerating the discovery of novel therapies and advancing personalized medicine and translational research.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiajie Zhang, Sรถren Schwertfeger, Alexander Kleiner
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Futian Wang, Mengqi Wang, Xiao Wang, Haowen Wang, Jin Tang
Natural Language Processing
ๆฅ็ๆ่ฆ
Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mia Mรผรig, Jan Johannsen
ๆฅ็ๆ่ฆ
In pseudo-boolean solving the currently most successful unit propagation strategy is a hybrid mode combining the watched literal scheme with the counting method. This short paper introduces new heuristics for this hybrid decision, which are able to drastically outperform the current method in the RoundingSAT solver.
๐
2025-11-26
๐ ArXiv Preprint (cs.DC)
๐ฅ Tim Trappen, Robert Keรler, Roland Pabel, Viktor Achter, Stefan Wesner
ๆฅ็ๆ่ฆ
Due to rising demands for Artificial Inteligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing dynamic AI application workloads. In this paper, we propose our solution that serves LLMs by integrating vLLM, Slurm and Kubernetes on the supercomputer \textit{RAMSES}. The initial benchmark indicates that the proposed architecture scales efficiently for 100, 500 and 1000 concurrent requests, incurring only an overhead of approximately 500 ms in terms of end-to-end latency.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Frederico Wieser, Martin Benfeghoul, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas
Natural Language Processing
ๆฅ็ๆ่ฆ
The rigid, uniform allocation of computation in standard Transformer (TF) architectures can limit their efficiency and scalability, particularly for large-scale models and long sequences. Addressing this, we introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT), two distinct architectures that leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. SDT augments a decoder-only stack with alternating Decision and Dynamic layers: a Decision layer computes a full block 'posterior' and a lightweight 'prior,' while a Dynamic layer employs fixed-capacity Top-K routing based on Bayesian surprise (Expected and Unexpected Change), maintaining a static compute graph. STT extends this conditional computation to the temporal domain: a transition network predicts residual updates, forming a temporal 'change hypothesis' that informs a router to dynamically execute or bypass TF blocks for each token, managing KV-cache contributions. Both architectures exhibit the predicted shift from novelty to prediction driven gating over training, suggesting alignment with surprise based principles. While operating at reduced capacity, they offer preliminary insights into the compute-accuracy trade-offs of conditional computation. The proposed architectures establish a flexible framework for efficiency, reducing self-attention computation by 75% and KV-cache requirements by 50% within each compute skipping layer, setting a pathway for more efficient models.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joshua Fonseca Rivera
Natural Language Processing
ๆฅ็ๆ่ฆ
Lindsey (2025) investigates introspective awareness in language models through four experiments, finding that models can sometimes detect and identify injected activation patterns -- but unreliably (~20% success in the best model). We focus on the first of these experiments -- self-report of injected "thoughts" -- and ask whether this capability can be directly trained rather than waiting for emergence. Through fine-tuning on transient single-token injections, we transform a 7B parameter model from near-complete failure (0.4% accuracy, 6.7% false positive rate) to reliable detection (85% accuracy on held-out concepts at ฮฑ=40, 0% false positives). Our model detects fleeting "thoughts" injected at a single token position, retains that information, and reports the semantic content across subsequent generation steps. On this task, our trained model satisfies three of Lindsey's criteria: accuracy (correct identification), grounding (0/60 false positives), and internality (detection precedes verbalization). Generalization to unseen concept vectors (7.5pp gap) demonstrates the model learns a transferable skill rather than memorizing specific vectors, though this does not establish metacognitive representation in Lindsey's sense. These results address an open question raised by Lindsey: whether "training for introspection would help eliminate cross-model differences." We show that at least one component of introspective behavior can be directly induced, offering a pathway to built-in AI transparency.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Jiayuan Zhang, Kaiquan Chen, Zhihao Lu, Enshen Zhou, Qian Yu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model (LLM)-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation -- risking the loss of critical information -- or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, where an LLM generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for LLMs to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.
๐
2025-11-26
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Guoxiao Zhang, Tan Qu, Ao Li, DongLin Ni, Qianlong Xie et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Reranking improves recommendation quality by modeling item interactions. However, existing methods often decouple ranking and reranking, leading to weak listwise evaluation models that suffer from combinatorial sparsity and limited representational power under strict latency constraints. In this paper, we propose RIA (Ranking-Infused Architecture), a unified, end-to-end framework that seamlessly integrates pointwise and listwise evaluation. RIA introduces four key components: (1) the User and Candidate DualTransformer (UCDT) for fine-grained user-item-context modeling; (2) the Context-aware User History and Target (CUHT) module for position-sensitive preference learning; (3) the Listwise Multi-HSTU (LMH) module to capture hierarchical item dependencies; and (4) the Embedding Cache (EC) module to bridge efficiency and effectiveness during inference. By sharing representations across ranking and reranking, RIA enables rich contextual knowledge transfer while maintaining low latency. Extensive experiments show that RIA outperforms state-of-the-art models on both public and industrial datasets, achieving significant gains in AUC and LogLoss. Deployed in Meituan advertising system, RIA yields a +1.69% improvement in Click-Through Rate (CTR) and a +4.54% increase in Cost Per Mille (CPM) in online A/B tests.
๐
2025-11-26
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Guoxiao Zhang, Ao Li, Tan Qu, Qianlong Xie, Xingxing Wang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Online platforms usually suffer from user experience degradation due to near-duplicate items with similar visuals and text. While Multimodal Large Language Models (MLLMs) enable multimodal embedding, existing methods treat representations as black boxes, ignoring structural relationships (e.g., primary vs. auxiliary elements), leading to local structural collapse problem. To address this, inspired by Feature Integration Theory (FIT), we propose FITRep, the first attention-guided, white-box item representation framework for fine-grained item deduplication. FITRep consists of: (1) Concept Hierarchical Information Extraction (CHIE), using MLLMs to extract hierarchical semantic concepts; (2) Structure-Preserving Dimensionality Reduction (SPDR), an adaptive UMAP-based method for efficient information compression; and (3) FAISS-Based Clustering (FBC), a FAISS-based clustering that assigns each item a unique cluster id using FAISS. Deployed on Meituan's advertising system, FITRep achieves +3.60% CTR and +4.25% CPM gains in online A/B tests, demonstrating both effectiveness and real-world impact.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jungi Lee, Jungkwon Kim, Chi Zhang, Kwangsun Yoo, Seok-Joo Byun
ๆฅ็ๆ่ฆ
Handling contaminated data poses a critical challenge in anomaly detection, as traditional models assume training on purely normal data. Conventional methods mitigate contamination by relying on fixed contamination ratios, but discrepancies between assumed and actual ratios can severely degrade performance, especially in noisy environments where normal and abnormal data distributions overlap. To address these limitations, we propose Adaptive and Aggressive Rejection (AAR), a novel method that dynamically excludes anomalies using a modified z-score and Gaussian mixture model-based thresholds. AAR effectively balances the trade-off between preserving normal data and excluding anomalies by integrating hard and soft rejection strategies. Extensive experiments on two image datasets and thirty tabular datasets demonstrate that AAR outperforms the state-of-the-art method by 0.041 AUROC. By providing a scalable and reliable solution, AAR enhances robustness against contaminated datasets, paving the way for broader real-world applications in domains such as security and healthcare.
๐
2025-11-26
๐ AAAI
๐ฅ Kevin Iselborn, David Dembinsky, Adriano Lucieri, Andreas Dengel
Machine Learning
ๆฅ็ๆ่ฆ
The utility of an explanation method critically depends on its fidelity to the underlying machine learning model. Especially in high-stakes medical settings, clinicians and regulators require explanations that faithfully reflect the model's decision process. Existing fidelity metrics such as Infidelity rely on Monte Carlo approximation, which demands numerous model evaluations and introduces uncertainty due to random sampling. This work proposes a novel metric for evaluating the fidelity of local feature attribution methods by modifying the existing Prediction Change (PC) metric within the Guided Perturbation Experiment. By incorporating the direction of both perturbation and attribution, the proposed Directed Prediction Change (DPC) metric achieves an almost tenfold speedup and eliminates randomness, resulting in a deterministic and trustworthy evaluation procedure that measures the same property as local Infidelity. DPC is evaluated on two datasets (skin lesion images and financial tabular data), two black-box models, seven explanation algorithms, and a wide range of hyperparameters. Across $4\,744$ distinct explanations, the results demonstrate that DPC, together with PC, enables a holistic and computationally efficient evaluation of both baseline-oriented and local feature attribution methods, while providing deterministic and reproducible outcomes.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bram Silue, Santiago Amaya-Corredor, Patrick Mannion, Lander Willem, Pieter Libin
Machine Learning
ๆฅ็ๆ่ฆ
Adversarial Inverse Reinforcement Learning (AIRL) has shown promise in addressing the sparse reward problem in reinforcement learning (RL) by inferring dense reward functions from expert demonstrations. However, its performance in highly complex, imperfect-information settings remains largely unexplored. To explore this gap, we evaluate AIRL in the context of Heads-Up Limit Hold'em (HULHE) poker, a domain characterized by sparse, delayed rewards and significant uncertainty. In this setting, we find that AIRL struggles to infer a sufficiently informative reward function. To overcome this limitation, we contribute Hybrid-AIRL (H-AIRL), an extension that enhances reward inference and policy learning by incorporating a supervised loss derived from expert data and a stochastic regularization mechanism. We evaluate H-AIRL on a carefully selected set of Gymnasium benchmarks and the HULHE poker setting. Additionally, we analyze the learned reward function through visualization to gain deeper insights into the learning process. Our experimental results show that H-AIRL achieves higher sample efficiency and more stable learning compared to AIRL. This highlights the benefits of incorporating supervised signals into inverse RL and establishes H-AIRL as a promising framework for tackling challenging, real-world settings.
๐
2025-11-26
๐ ArXiv Preprint (cs.SD)
๐ฅ Genรญs Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tae-Min Choi, Tae Kyeong Jeong, Garam Kim, Jaemin Lee, Yeongyoon Koh et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Munish Rathee, Boris Baฤiฤ, Maryam Doborjeh
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset is recorded under various weather and lighting conditions, comprising 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% (+- 0.8%) with a per-frame inference time of 9.5 ms. Achieved sub-10 millisecond latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, enhances interpretability, supports transparent decision-making, and operates efficiently on embedded hardware. Although synthetic augmentation improved robustness, generalisation to unseen field conditions remains to be validated. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, which, as a traffic flow-control infrastructure, is deployed in over 20 cities worldwide.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
๐
2025-11-26
๐ ArXiv Preprint (cs.SD)
๐ฅ Ido Nitzan HIdekel, Gal lifshitz, Khen Cohen, Dan Raviv
Machine Learning
ๆฅ็ๆ่ฆ
Deepfake (DF) audio detectors still struggle to generalize to out of distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while the same cloned path, preceded by learnable SRM, value-constrained high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views for long- and short-range frequency dependencies, and a frequency-aware Jensen-Shannon contrastive loss pulls real content-noise pairs together while pushing fake embeddings apart, accelerating optimization and sharpening decision boundaries. Evaluated on the ASVspoof 2021 and in-the-wild benchmarks, SONAR attains state-of-the-art performance and converges four times faster than strong baselines. By elevating faint high-frequency residuals to first-class learning signals, SONAR unveils a fully data-driven, frequency-guided contrastive framework that splits the latent space into two disjoint manifolds: natural-HF for genuine audio and distorted-HF for synthetic audio, thereby sharpening decision boundaries. Because the scheme operates purely at the representation level, it is architecture-agnostic and, in future work, can be seamlessly integrated into any model or modality where subtle high-frequency cues are decisive.
๐
2025-11-26
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Kirti Bhagat, Shaily Bhatt, Athul Velagapudi, Aditya Vashistha, Shachi Dave et al.
ๆฅ็ๆ่ฆ
Millions of users across the globe turn to AI chatbots for their creative needs, inviting widespread interest in understanding how such chatbots represent diverse cultures. At the same time, evaluating cultural representations in open-ended tasks remains challenging and underexplored. In this work, we present TALES, an evaluation of cultural misrepresentations in LLM-generated stories for diverse Indian cultural identities. First, we develop TALES-Tax, a taxonomy of cultural misrepresentations by collating insights from participants with lived experiences in India through focus groups (N=9) and individual surveys (N=15). Using TALES-Tax, we evaluate 6 models through a large-scale annotation study spanning 2,925 annotations from 108 annotators with lived cultural experience from across 71 regions in India and 14 languages. Concerningly, we find that 88\% of the generated stories contain one or more cultural inaccuracies, and such errors are more prevalent in mid- and low-resourced languages and stories based in peri-urban regions in India. Lastly, we transform the annotations into TALES-QA, a standalone question bank to evaluate the cultural knowledge of foundational models. Through this evaluation, we surprisingly discover that models often possess the requisite cultural knowledge despite generating stories rife with cultural misrepresentations.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Jamal Raiyn
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper proposes a new strategy for collision avoidance system leveraging Time-to-Collision (TTC) metrics for handling cut-in scenarios, which are particularly challenging for autonomous vehicles (AVs). By integrating a deep learning with TTC calculations, the system predicts potential collisions and determines appropriate evasive actions compared to traditional TTC -based approaches.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Joseph Y. Halpern, Rafael Pass
ๆฅ็ๆ่ฆ
Perhaps the most prominent current definition of (actual) causality is due to Halpern and Pearl. It is defined using causal models (also known as structural equations models). We abstract the definition, extracting its key features, so that it can be applied to any other model where counterfactuals are defined. By abstracting the definition, we gain a number of benefits. Not only can we apply the definition in a wider range of models, including ones that allow, for example, backtracking, but we can apply the definition to determine if A is a cause of B even if A and B are formulas involving disjunctions, negations, beliefs, and nested counterfactuals (none of which can be handled by the Halpern-Pearl definition). Moreover, we can extend the ideas to getting an abstract definition of explanation that can be applied beyond causal models. Finally, we gain a deeper understanding of features of the definition even in causal models.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models' ability to enhance robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Selene Cerna, Sara Si-Moussi, Wilfried Thuiller, Hadrien Hendrikx, Vincent Miele
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevรฉs. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hui Lu, Yi Yu, Yiming Yang, Chenyu Yi, Qixin Zhang et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Joonhyung Park, Hyeongwon Jang, Joowon Kim, Eunho Yang
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dogukan Aksu, Jesus Martinez del Rincon, Ihsen Alouani
Machine Learning
ๆฅ็ๆ่ฆ
Spiking neural networks (SNNs) have emerged as prominent candidates for embedded and edge AI. Their inherent low power consumption makes them far more efficient than conventional ANNs in scenarios where energy budgets are tightly constrained. In parallel, federated learning (FL) has become the prevailing training paradigm in such settings, enabling on-device learning while limiting the exposure of raw data. However, gradient inversion attacks represent a critical privacy threat in FL, where sensitive training data can be reconstructed directly from shared gradients. While this vulnerability has been widely investigated in conventional ANNs, its implications for SNNs remain largely unexplored. In this work, we present the first comprehensive empirical study of gradient leakage in SNNs across diverse data domains. SNNs are inherently non-differentiable and are typically trained using surrogate gradients, which we hypothesized would be less correlated with the original input and thus less informative from a privacy perspective. To investigate this, we adapt different gradient leakage attacks to the spike domain. Our experiments reveal a striking contrast with conventional ANNs: whereas ANN gradients reliably expose salient input content, SNN gradients yield noisy, temporally inconsistent reconstructions that fail to recover meaningful spatial or temporal structure. These results indicate that the combination of event-driven dynamics and surrogate-gradient training substantially reduces gradient informativeness. To the best of our knowledge, this work provides the first systematic benchmark of gradient inversion attacks for spiking architectures, highlighting the inherent privacy-preserving potential of neuromorphic computation.
๐
2025-11-26
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Shuhan Xia, Jing Dai, Hui Ouyang, Yadong Shang, Dongxiao Zhao et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Diffusion models exhibit notable fragility when faced with adversarial prompts, and strengthening attack capabilities is crucial for uncovering such vulnerabilities and building more robust generative systems. Existing works often rely on white-box access to model gradients or hand-crafted prompt engineering, which is infeasible in real-world deployments due to restricted access or poor attack effect. In this paper, we propose CAHS-Attack , a CLIP-Aware Heuristic Search attack method. CAHS-Attack integrates Monte Carlo Tree Search (MCTS) to perform fine-grained suffix optimization, leveraging a constrained genetic algorithm to preselect high-potential adversarial prompts as root nodes, and retaining the most semantically disruptive outcome at each simulation rollout for efficient local search. Extensive experiments demonstrate that our method achieves state-of-the-art attack performance across both short and long prompts of varying semantics. Furthermore, we find that the fragility of SD models can be attributed to the inherent vulnerability of their CLIP-based text encoders, suggesting a fundamental security risk in current text-to-image pipelines.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shichu Sun, Yichen Zhang, Haolin Song, Zonghao Guo, Chi Chen et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native- resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual model- ing, (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Guoming Huang, Qingyi Zhou, Dianjing Liu, Shuai Zhang, Ming Zhou et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Non-contact manipulation has emerged as a transformative approach across various industrial fields. However, current flexible 2D and 3D non-contact manipulation techniques are often limited to microscopic scales, typically controlling objects in the milligram range. In this paper, we present a magnetic levitation system, termed Maglev-Pentabot, designed to address this limitation. The Maglev-Pentabot leverages deep reinforcement learning (DRL) to develop complex control strategies for manipulating objects in the gram range. Specifically, we propose an electromagnet arrangement optimized through numerical analysis to maximize controllable space. Additionally, an action remapping method is introduced to address sample sparsity issues caused by the strong nonlinearity in magnetic field intensity, hence allowing the DRL controller to converge. Experimental results demonstrate flexible manipulation capabilities, and notably, our system can generalize to transport tasks it has not been explicitly trained for. Furthermore, our approach can be scaled to manipulate heavier objects using larger electromagnets, offering a reference framework for industrial-scale robotic applications.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Changlin Li, Jiawei Zhang, Shuhao Liu, Sihao Lin, Zeyi Shi et al.
ๆฅ็ๆ่ฆ
Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets, demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Ziyi Chen, Yingnan Guo, Zedong Chu, Minghua Luo, Yanfen Shen et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Embodied navigation that adheres to social norms remains an open research challenge. Our \textbf{SocialNav} is a foundational model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Our project page: https://amap-eai.github.io/SocialNav/
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Changlin Li, Jiawei Zhang, Zeyi Shi, Zongxin Yang, Zhihui Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse, and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.
๐
2025-11-26
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Anup Roy, Rishabh Gyanendra Upadhyay, Animesh Rameshbhai Panara, Robin Mills
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Document centric RAG pipelines usually begin with OCR, followed by brittle heuristics for chunking, table parsing, and layout reconstruction. These text first workflows are costly to maintain, sensitive to small layout shifts, and often lose the spatial cues that contain the answer. Vision first retrieval has emerged as a strong alternative. By operating directly on page images, systems like ColPali and ColQwen preserve structure and reduce pipeline complexity while achieving strong benchmark performance. However, these late interaction models tie retrieval to a specific vision backbone and require storing hundreds of patch embeddings per page, creating high memory overhead and complicating large scale deployment.
We introduce VisionRAG, a multimodal retrieval system that is OCR free and model agnostic. VisionRAG indexes documents directly as images, preserving layout, tables, and spatial cues, and builds semantic vectors without committing to a specific extraction. Our three pass pyramid indexing framework creates vectors using global page summaries, section headers, visual hotspots, and fact level cues. These summaries act as lightweight retrieval surrogates. At query time, VisionRAG retrieves the most relevant pages using the pyramid index, then forwards the raw page image encoded as base64 to a multimodal LLM for final question answering. During retrieval, reciprocal rank fusion integrates signals across the pyramid to produce robust ranking.
VisionRAG stores only 17 to 27 vectors per page, matching the efficiency of patch based methods while staying flexible across multimodal encoders. On financial document benchmarks, it achieves 0.8051 accuracy at 10 on FinanceBench and 0.9629 recall at 100 on TAT DQA. These results show that OCR free, summary guided multimodal retrieval is a practical and scalable alternative to traditional text extraction pipelines.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Mengran Li, Zelin Zang, Wenbin Xing, Junzhou Chen, Ronghui Zhang et al.
Multimodal
ๆฅ็ๆ่ฆ
Understanding how chemical perturbations propagate through biological systems is essential for robust molecular property prediction. While most existing methods focus on chemical structures alone, recent advances highlight the crucial role of cellular responses such as morphology and gene expression in shaping drug effects. However, current cell-aware approaches face two key limitations: (1) modality incompleteness in external biological data, and (2) insufficient modeling of hierarchical dependencies across molecular, cellular, and genomic levels. We propose CHMR (Cell-aware Hierarchical Multi-modal Representations), a robust framework that jointly models local-global dependencies between molecules and cellular responses and captures latent biological hierarchies via a novel tree-structured vector quantization module. Evaluated on nine public benchmarks spanning 728 tasks, CHMR outperforms state-of-the-art baselines, yielding average improvements of 3.6% on classification and 17.2% on regression tasks. These results demonstrate the advantage of hierarchy-aware, multimodal learning for reliable and biologically grounded molecular representations, offering a generalizable framework for integrative biomedical modeling. The code is in https://github.com/limengran98/CHMR.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Honga, Jie Lin, Minghui Wang
ๆฅ็ๆ่ฆ
Alzheimer's disease (AD), a degenerative brain condition, can benefit from early prediction to slow its progression. As the disease progresses, patients typically undergo brain atrophy. Current prediction methods for Alzheimers disease largely involve analyzing morphological changes in brain images through manual feature extraction. This paper proposes a novel method, the Deformation-Aware Temporal Generative Network (DATGN), to automate the learning of morphological changes in brain images about disease progression for early prediction. Given the common occurrence of missing data in the temporal sequences of MRI images, DATGN initially interpolates incomplete sequences. Subsequently, a bidirectional temporal deformation-aware module guides the network in generating future MRI images that adhere to the disease's progression, facilitating early prediction of Alzheimer's disease. DATGN was tested for the generation of temporal sequences of future MRI images using the ADNI dataset, and the experimental results are competitive in terms of PSNR and MMSE image quality metrics. Furthermore, when DATGN-generated synthetic data was integrated into the SVM vs. CNN vs. 3DCNN-based classification methods, significant improvements were achieved from 6. 21\% to 16\% in AD vs. NC classification accuracy and from 7. 34\% to 21. 25\% in AD vs. MCI vs. NC classification accuracy. The qualitative visualization results indicate that DATGN produces MRI images consistent with the brain atrophy trend in Alzheimer's disease, enabling early disease prediction.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tongkai Lu, Shuai Ma, Chongyang Tao
Machine Learning
ๆฅ็ๆ่ฆ
Mixed Integer Linear Programming (MILP) is a fundamental class of NP-hard problems that has garnered significant attention from both academia and industry. The Branch-and-Bound (B\&B) method is the dominant approach for solving MILPs and the branching plays an important role in B\&B methods. Neural-based learning frameworks have recently been developed to enhance branching policies and the efficiency of solving MILPs. However, these methods still struggle with semantic variation across depths, the scarcity of upstream nodes, and the costly collection of strong branching samples. To address these issues, we propose \ours, a Dynamic \underline{\textbf{S}}tratified \underline{\textbf{C}}ontrastive Training Framework for \underline{\textbf{MILP}} Branching. It groups branch-and-bound nodes based on their feature distributions and trains a GCNN-based discriminative model to progressively separate nodes across groups, learning finer-grained node representations throughout the tree. To address data scarcity and imbalance at upstream nodes, we introduce an upstream-augmented MILP derivation procedure that generates both theoretically equivalent and perturbed instances. \ours~effectively models subtle semantic differences between nodes, significantly enhancing branching accuracy and solving efficiency, particularly for upstream nodes. Extensive experiments on standard MILP benchmarks demonstrate that our method enhances branching accuracy, reduces solving time, and generalizes effectively to unseen instances.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hengyu Fu, Baihe Huang, Virginia Adams, Charles Wang, Venkat Srinivasan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Language Models (DLMs) have recently emerged as a strong alternative to autoregressive language models (LMs). DLMs offer comparable accuracy with faster inference speed via parallel decoding. However, standard DLM decoding strategies relying on high-confidence tokens encounter an inherent information-theoretic bottleneck that restricts decoding progress and ultimately slows generation. We demonstrate both theoretically and empirically that prioritizing high-confidence tokens is inherently inefficient. High-probability tokens carry negligible information and strictly relying on them limits the effective progress made in each decoding round. We prove that the number of decoding rounds must grow linearly with the sample's total information (negative log-likelihood) and inversely with the per-round information budget, establishing a bits-to-rounds principle. We also propose Explore-Then-Exploit (ETE), a training-free decoding strategy that maximizes information throughput and decoding efficiency. ETE combines cross-block decoding with targeted exploration of high-uncertainty tokens to reshape the conditional distribution and trigger cascades of confident predictions. Experiments verify our theoretical bounds and demonstrate that ETE consistently reduces the required number of decoding rounds compared to confidence-only baselines without compromising generation quality.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gayoung Lee, Junho Kim, Jin-Hwa Kim, Junmo Kim
Computer Vision
ๆฅ็ๆ่ฆ
Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically "sculpts" reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Seunghun Baek, Jaejin Lee, Jaeyoon Sim, Minjae Jeong, Won Hwa Kim
ๆฅ็ๆ่ฆ
Various neuroimaging studies suffer from small sample size problem which often limit their reliability. Meta-analysis addresses this challenge by aggregating findings from different studies to identify consistent patterns of brain activity. However, traditional approaches based on keyword retrieval or linear mappings often overlook the rich hierarchical structure in the brain. In this work, we propose a novel framework that leverages hyperbolic geometry to bridge the gap between neuroscience literature and brain activation maps. By embedding text from research articles and corresponding brain images into a shared hyperbolic space via the Lorentz model, our method captures both semantic similarity and hierarchical organization inherent in neuroimaging data. In the hyperbolic space, our method performs multi-level neuroimaging meta-analysis (MNM) by 1) aligning brain and text embeddings for semantic correspondence, 2) guiding hierarchy between text and brain activations, and 3) preserving the hierarchical relationships within brain activation patterns. Experimental results demonstrate that our model outperforms baselines, offering a robust and interpretable paradigm of multi-level neuroimaging meta-analysis via hyperbolic brain-text representation.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ivan Novikov
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. While architecturally simple, this is computationally inefficient, since inference costs scale linearly with parameter count. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks, but these approaches typically rely on clustering, activation profiling, singular value decomposition, or custom routing that requires calibration data. This paper introduces MLPMoE (MLP Mixture-of-Experts), a training-free, deterministic transformation that restructures the dense MLP in transformer blocks into a static, high-cardinality mixture of experts. The transformation uses simple tensor slicing and summation, reinterpreting the algebra of tensor parallelism as a topological conversion rather than a distributed training pattern. We further introduce Fractal Fade (differential branch sparsity) and Compensated Pruning (variance-preserving branch reduction) as lightweight mechanisms for structured sparsity. On Qwen2.5-0.5B-Instruct and DeepSeek-R1-Distill-Llama-8B, the zero-shot MLPMoE transform changes a proxy perplexity metric by less than 0.05 percent while keeping the parameter count effectively constant. On the 8B model, differential sparsity removes about 20 percent of MLP parameters while keeping perplexity within about 2 percent of the dense baseline. The method operates entirely post hoc on existing checkpoints and does not require gradients, calibration sets, or router training. Code is available at https://gist.github.com/iwallarm/fc2ef1eddf226ca7814f9e5e2ae9bad1
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Thura Aung, Eaint Kay Khaing Kyaw, Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi
Natural Language Processing
ๆฅ็ๆ่ฆ
In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating Fourier-based FourierKAN, Spline-based EfficientKAN, and Grid-based FasterKAN-across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.
๐
2025-11-26
๐ Accepted by IEEE Big Data 2025
๐ฅ Yeswanth Ravichandran, Duoduo Liao, Charan Teja Kurakula
Machine Learning
ๆฅ็ๆ่ฆ
Subsurface defects such as delamination, voids, and honeycombing critically affect the durability of concrete bridge decks but are difficult to detect reliably using visual inspection or manual sounding. This paper presents a machine learning based Impact Echo (IE) framework that automates both defect localization and multi-class classification of common concrete defects. Raw IE signals from Federal Highway Administration (FHWA) laboratory slabs and in-service bridge decks are transformed via Fast Fourier Transform (FFT) into dominant peak-frequency features and interpolated into spatial maps for defect zone visualization. Unsupervised k-means clustering highlights low-frequency, defect-prone regions, while Ground Truth Masks (GTMs) derived from seeded lab defects are used to validate spatial accuracy and generate high-confidence training labels. From these validated regions, spatially ordered peak-frequency sequences are constructed and fed into a stacked Long Short-Term Memory (LSTM) network that classifies four defect types shallow delamination, deep delamination, voids, and honeycombing with 73% overall accuracy. Field validation on the bridge deck demonstrates that models trained on laboratory data generalize under realistic coupling, noise, and environmental variability. The proposed framework enhances the objectivity, scalability, and repeatability of Non-Destructive Evaluation (NDE), supporting intelligent, data-driven bridge health monitoring at a network scale.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Effective post-training is essential to align Large Language Models (LLMs) with specialized biomedical knowledge to accelerate life science research. However, current approaches face significant limitations. First, biomedical reasoning involves intricate mechanisms often represented by sparse textual data. Standard Supervised Fine-Tuning (SFT) tends to overfit to surface-level instruction patterns without effectively internalizing this fragmented scientific knowledge. Second, Reinforcement Learning (RL) is impractical for this domain, as defining meaningful rewards often necessitates prohibitive experimental validation (e.g., wet-lab verification of drug responses), rendering real-time feedback unfeasible. We propose Balanced Fine-Tuning (BFT), an efficient post-training method designed to learn complex reasoning from sparse data without external reward signals. BFT operates through a two-layer weighting mechanism: 1. At the token level, it scales loss via prediction probabilities to stabilize gradients and prevent overfitting; 2. At the sample level, it uses "minimum group confidence" to adaptively enhance the learning of hard samples. Experiments demonstrate that BFT significantly outperforms SFT. In medical tasks, it enables LLMs to acquire knowledge that SFT misses. In biological tasks, BFT-based LLMs surpass GeneAgent (an accurate agent for biology analysis) in biological process reasoning. Moreover, the text embeddings generated by BFT can be directly applied to downstream tasks, such as gene interaction and single-cell perturbation response prediction. These results indicate that BFT facilitates broad applications of LLMs in biomedical research.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michael Iskandardinata, William Christian, Derwin Suhartono
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Detecting sarcasm remains a challenging task in the areas of Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, has made the task more difficult even for PLMs and LLMs. Beyond that, those models also exhibit unreliable detection of words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model's own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach with three datasets, such as Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on Semeval and by 4.08% on MUStARD. These findings highlight the importance of context in enhancing LLMs performance in sarcasm detection task, particularly the involvement of culturally specific slang, references, or unknown terms to the LLMs. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance. The experiment code is available at: https://github.com/wllchrst/sarcasm-detection_pmp_knowledge-base.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chujie Wang, Jianyu Lu, Zhiyuan Luo, Xi Chen, Chu He
MultimodalMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD's lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent's state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.
๐
2025-11-26
๐ AAAI
๐ฅ Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury, Haotian An, Yawei Wang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Fine-tuning large language models (LLMs) for downstream tasks typically exhibit a fundamental safety-capability tradeoff, where improving task performance degrades safety alignment even on benign datasets. This degradation persists across standard approaches including supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety capability trade-off, and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.
๐
2025-11-26
๐ 17 pages, 11 figures, this article was submitted t
๐ฅ Jingtao Guo, Yuyi Mao, Ivan Wang-Hei Ho
Machine Learning
ๆฅ็ๆ่ฆ
Wi-Fi channel state information (CSI)-based sensing provides a non-invasive, device-free approach for tasks such as human activity recognition and crowd counting, but large-scale deployment is hindered by the need for extensive site-specific training data. Federated learning (FL) offers a way to avoid raw data sharing but is challenged by heterogeneous sensing data and device resources. This paper proposes FedAPA, a collaborative Wi-Fi CSI-based sensing algorithm that uses adaptive prototype aggregation (APA) strategy to assign similarity-based weights to peer prototypes, enabling adaptive client contributions and yielding a personalized global prototype for each client instead of a fixed-weight aggregation. During local training, we adopt a hybrid objective that combines classification learning with representation contrastive learning to align local and global knowledge. We provide a convergence analysis of FedAPA and evaluate it in a real-world distributed Wi-Fi crowd counting scenario with six environments and up to 20 people. The results show that our method outperform multiple baselines in terms of accuracy, F1 score, mean absolute error (MAE), and communication overhead, with FedAPA achieving at least a 9.65% increase in accuracy, a 9% gain in F1 score, a 0.29 reduction in MAE, and a 95.94% reduction in communication overhead.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anantha Padmanaban Krishna Kumar
ๆฅ็ๆ่ฆ
Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under \emph{natural} demonstrations (with correct labels) and \emph{inverted} demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1--12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1--12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Linze Chen, Yufan Cai, Zhe Hou, Jinsong Dong
Natural Language Processing
ๆฅ็ๆ่ฆ
The rationality of law manifests in two forms: substantive rationality, which concerns the fairness or moral desirability of outcomes, and formal rationality, which requires legal decisions to follow explicitly stated, general, and logically coherent rules. Existing LLM-based systems excel at surface-level text analysis but lack the guarantees required for principled jurisprudence. We introduce L4M, a novel framework that combines adversarial LLM agents with SMT-solver-backed proofs to unite the interpretive flexibility of natural language with the rigor of symbolic verification. The pipeline consists of three phases: (1) Statute Formalization, where domain-specific prompts convert legal provisions into logical formulae; (2) Dual Fact and Statute Extraction, in which prosecutor- and defense-aligned LLMs independently map case narratives to fact tuples and statutes, ensuring role isolation; and (3) Solver-Centric Adjudication, where an autoformalizer compiles both parties' arguments into logic constraints, and unsat cores trigger iterative self-critique until a satisfiable formula is achieved, which is then verbalized by a Judge-LLM into a transparent verdict and optimized sentence. Experimental results on public benchmarks show that our system surpasses advanced LLMs including GPT-o4-mini, DeepSeek-V3, and Claude 4 as well as state-of-the-art Legal AI baselines, while providing rigorous and explainable symbolic justifications.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haojian Huang, Jiahao Shi, Zhe Liu, Harold Haodong Chen, Han Fang et al.
ๆฅ็ๆ่ฆ
Trustworthy multi-view classification (TMVC) addresses the challenge of achieving reliable decision-making in complex scenarios where multi-source information is heterogeneous, inconsistent, or even conflicting. Existing TMVC approaches predominantly rely on globally dense neighbor relationships to model intra-view dependencies, leading to high computational costs and an inability to directly ensure consistency across inter-view relationships. Furthermore, these methods typically aggregate evidence from different views through manually assigned weights, lacking guarantees that the learned multi-view neighbor structures are consistent within the class space, thus undermining the trustworthiness of classification outcomes. To overcome these limitations, we propose a novel TMVC framework that introduces prototypes to represent the neighbor structures of each view. By simplifying the learning of intra-view neighbor relations and enabling dynamic alignment of intra- and inter-view structure, our approach facilitates more efficient and consistent discovery of cross-view consensus. Extensive experiments on multiple public multi-view datasets demonstrate that our method achieves competitive downstream performance and robustness compared to prevalent TMVC methods.
๐
2025-11-26
๐ 22 pages, 15 figures, Submitted to Journal of Envi
๐ฅ Taehoon Kang, Taeyong Kim
Machine Learning
ๆฅ็ๆ่ฆ
Climate change has intensified the frequency and severity of wildfires, making rapid and accurate prediction of fire spread essential for effective mitigation and response. Physics-based simulators such as FARSITE offer high-fidelity predictions but are computationally intensive, limiting their applicability in real-time decision-making, while existing deep learning models often yield overly smooth predictions that fail to capture the complex, nonlinear dynamics of wildfire propagation. This study proposes an autoregressive conditional generative adversarial network (CGAN) for probabilistic wildfire spread prediction. By formulating the prediction task as an autoregressive problem, the model learns sequential state transitions, ensuring long-term prediction stability. Experimental results demonstrate that the proposed CGAN-based model outperforms conventional deep learning models in both overall predictive accuracy and boundary delineation of fire perimeters. These results demonstrate that adversarial learning allows the model to capture the strong nonlinearity and uncertainty of wildfire spread, instead of simply fitting the pixel average. Furthermore, the autoregressive framework facilitates systematic temporal forecasting of wildfire evolution. The proposed CGAN-based autoregressive framework enhances both the accuracy and physical interpretability of wildfire spread prediction, offering a promising foundation for time-sensitive response and evacuation planning.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates significant potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing RLVR methods are often constrained by issues such as coarse-grained rewards, reward noise, and inefficient exploration, which lead to unstable training and entropy collapse. To address this challenge, we propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). The intuition behind it lies in the fact that the probabilities of an LLM generating different responses can inherently and directly reflect its self-assessment of the reasoning process. Inspired by the idea of preference modeling, ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt, and integrates this score with verifiable rewards to guide the exploration process. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors, enhances the relative superiority of undervalued high-quality responses, and prevents the model from overfitting to specific strategies, thereby facilitating more thorough exploration. Comprehensive experiments across four general-domain benchmarks and three mathematical benchmarks demonstrate that ICPO steadily boosts reasoning compared to GRPO.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Xiaoxing You, Qiang Huang, Lingyu Li, Chi Zhang, Xiaopeng Liu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Jiaoyang Li, Jun Fang, Tianhao Gao, Xiaohui Zhang, Zhiyuan Liu et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Representation learning is fundamental to modern machine learning, powering applications such as text retrieval and multimodal understanding. However, learning robust and generalizable representations remains challenging. While prior work has demonstrated that active noise injection, a form of data augmentation, can enhance encoding performance, most existing methods rely on heuristic or static noise, overlooking the dynamic nature of feature distributions during training. In this work, we systematically study the role of noise in representation learning from both gradient-based and feature distribution perspectives, using InfoNCE loss as a representative example. Focusing on multimodal representation learning, we propose FANoise, a novel feature-adaptive noise injection strategy. By leveraging the dynamics of contrastive learning, FANoise effectively mitigates the negative impacts of noise while preserving its benefits. Under this theoretically grounded framework, comprehensive experiments demonstrate that FANoise consistently improves overall performance on multimodal tasks across various base VLM models.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxiao Xiang, Junchi Chen, Zhenchao Jin, Changtao Miao, Haojie Yuan et al.
Multimodal
ๆฅ็ๆ่ฆ
Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via a MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The codes will be made publicly available.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shanwei Fan
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) offer strong high-level planning capabilities for reinforcement learning (RL) by decomposing tasks into subgoals. However, their practical utility is limited by poor planning-execution alignment, which reflects a critical gap between abstract plans and actionable, environment-compatible behaviors. This misalignment arises from two interrelated limitations: (1) LLMs often produce subgoals that are semantically plausible but infeasible or irrelevant in the target environment due to insufficient grounding in environment-specific knowledge, and (2) single-LLM planning conflates generation with self-verification, resulting in overconfident yet unreliable subgoals that frequently fail during execution. To address these challenges, we propose Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR), a framework that integrates an environment-specific subgoal graph and structured entity knowledge with a multi-LLM planning pipeline that explicitly separates generation, critique, and refinement to produce executable and verifiable subgoals. A subgoal tracker further monitors execution progress, provides auxiliary rewards, and adaptively updates the subgoal graph to maintain alignment between plans and actions. Experimental results on 22 diverse tasks in the open-world game "Crafter" demonstrate the effectiveness of our proposed method.
๐
2025-11-26
๐ ArXiv Preprint (math.CO)
๐ฅ Davis Brown, Jesse He, Helen Jenne, Henry Kvinge, Max Vargas
Natural Language Processing
ๆฅ็ๆ่ฆ
Evolutionary program synthesis systems such as AlphaEvolve, OpenEvolve, and ShinkaEvolve offer a new approach to AI-assisted mathematical discovery. These systems utilize teams of large language models (LLMs) to generate candidate solutions to a problem as human readable code. These candidate solutions are then 'evolved' with the goal of improving them beyond what an LLM can produce in a single shot. While existing mathematical applications have mostly focused on problems of establishing bounds (e.g., sphere packing), the program synthesis approach is well suited to any problem where the solution takes the form of an explicit construction. With this in mind, in this paper we explore the use of OpenEvolve for combinatorial bijection discovery. We describe the results of applying OpenEvolve to three bijection construction problems involving Dyck paths, two of which are known and one of which is open. We find that while systems like OpenEvolve show promise as a valuable tool for combinatorialists, the problem of finding novel, research-level bijections remains a challenging task for current frontier systems, reinforcing the need for human mathematicians in the loop. We describe some lessons learned for others in the field interested in exploring the use of these systems.
๐
2025-11-26
๐ ArXiv Preprint (physics.soc-ph)
๐ฅ Stephen G. Dale, Nikita Kazeev, Alastair J. A. Price, Victor Posligua, Stephan Roche et al.
Machine Learning
ๆฅ็ๆ่ฆ
Artificial intelligence and machine learning are reshaping how we approach scientific discovery, not by replacing established methods but by extending what researchers can probe, predict, and design. In this roadmap we provide a forward-looking view of AI-enabled science across biology, chemistry, climate science, mathematics, materials science, physics, self-driving laboratories and unconventional computing. Several shared themes emerge: the need for diverse and trustworthy data, transferable electronic-structure and interatomic models, AI systems integrated into end-to-end scientific workflows that connect simulations to experiments and generative systems grounded in synthesisability rather than purely idealised phases. Across domains, we highlight how large foundation models, active learning and self-driving laboratories can close loops between prediction and validation while maintaining reproducibility and physical interpretability. Taken together, these perspectives outline where AI-enabled science stands today, identify bottlenecks in data, methods and infrastructure, and chart concrete directions for building AI systems that are not only more powerful but also more transparent and capable of accelerating discovery in complex real-world environments.
๐
2025-11-26
๐ ArXiv Preprint (eess.AS)
๐ฅ Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices.
In this paper, we explore techniques such as unsupervised segmentation, uniform average pooling, etc., to reduce the number of audio tokens generated by the LALM's audio encoder but before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation tasks, that are dependent on effectively uncovering the underlying lexical content of the input signal and study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance closer to frame-level LALMs while reducing the input audio token count upto three times before the LLM backbone.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rawa Mohammed, Mina Attin, Bryar Shareef
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Automated radiology report generation (RRG) for breast ultrasound (BUS) is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models. We propose BUSTR, a multitask vision-language framework that generates BUS reports without requiring paired image-report supervision. BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss over dataset-specific descriptor sets, and aligns visual and textual tokens via a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. We evaluate BUSTR on two public BUS datasets, BrEaST and BUS-BRA, which differ in size and available descriptors. Across both datasets, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Our results show that this descriptor-aware vision model, trained with a combined token-level and alignment loss, improves both automatic report metrics and clinical efficacy without requiring paired image-report data. The source code can be found at https://github.com/AAR-UNLV/BUSTR
๐
2025-11-26
๐ ArXiv Preprint (cs.SE)
๐ฅ Sanchit Kaul, Kevin Nhu, Jason Eissayou, Ivan Eser, Victor Borup
Natural Language Processing
ๆฅ็ๆ่ฆ
This empirical investigation elucidates the limitations of deterministic, unidimensional productivity heuristics by operationalizing the SPACE framework through extensive repository mining. Utilizing a dataset derived from open-source repositories, the study employs rigorous statistical methodologies including Generalized Linear Mixed Models (GLMM) and RoBERTa-based sentiment classification to synthesize a holistic, multi-faceted productivity metric. Analytical results reveal a statistically significant positive correlation between negative affective states and commit frequency, implying a cycle of iterative remediation driven by frustration. Furthermore, the investigation has demonstrated that analyzing the topology of contributor interactions yields superior fidelity in mapping collaborative dynamics compared to traditional volume-based metrics. Ultimately, this research posits a Composite Productivity Score (CPS) to address the heterogeneity of developer efficacy.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman et al.
RoboticsComputer VisionMultimodal
ๆฅ็ๆ่ฆ
While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai et al.
Robotics
ๆฅ็ๆ่ฆ
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D "trace-space" of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang et al.
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Can one perceive a video's content without seeing its pixels, just from the camera trajectory-the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pandiyaraju V, Sreya Mynampati, Abishek Karthik, Poovarasan L, D. Saraswathi
Machine Learning
ๆฅ็ๆ่ฆ
Gliomas are brain tumor types that have a high mortality rate which means early and accurate diagnosis is important for therapeutic intervention for the tumors. To address this difficulty, the proposed research will develop a hybrid deep learning model which integrates U-Net based segmentation and a hybrid DenseNet-VGG classification network with multihead attention and spatial-channel attention capabilities. The segmentation model will precisely demarcate the tumors in a 3D volume of MRI data guided by spatial and contextual information. The classification network which combines a branch of both DenseNet and VGG, will incorporate the demarcated tumor on which features with attention mechanisms would be focused on clinically relevant features. High-dimensional 3D MRI data could successfully be utilized in the model through preprocessing steps which are normalization, resampling, and data augmentation. Through a variety of measures the framework is evaluated: measures of performance in segmentation are Dice coefficient and Mean Intersection over Union (IoU) and measures of performance in classification are accuracy precision, recall, and F1-score. The hybrid framework that has been proposed has demonstrated through physical testing that it has the capability of obtaining a Dice coefficient of 98% in tumor segmentation, and 99% on classification accuracy, outperforming traditional CNN models and attention-free methods. Utilizing multi-head attention mechanisms enhances notions of priority in aspects of the tumor that are clinically significant, and enhances interpretability and accuracy. The results suggest a great potential of the framework in facilitating the timely and reliable diagnosis and grading of glioma by clinicians is promising, allowing for better planning of patient treatment.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Lorenzo Shaikewitz, Charis Georgiou, Luca Carlone
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Quantifying the uncertainty of an object's pose estimate is essential for robust control and planning. Although pose estimation is a well-studied robotics problem, attaching statistically rigorous uncertainty is not well understood without strict distributional assumptions. We develop distribution-free pose uncertainty bounds about a given pose estimate in the monocular setting. Our pose uncertainty only requires high probability noise bounds on pixel detections of 2D semantic keypoints on a known object. This noise model induces an implicit, non-convex set of pose uncertainty constraints. Our key contribution is SLUE (S-Lemma Uncertainty Estimation), a convex program to reduce this set to a single ellipsoidal uncertainty bound that is guaranteed to contain the true object pose with high probability. SLUE solves a relaxation of the minimum volume bounding ellipsoid problem inspired by the celebrated S-lemma. It requires no initial guess of the bound's shape or size and is guaranteed to contain the true object pose with high probability. For tighter uncertainty bounds at the same confidence, we extend SLUE to a sum-of-squares relaxation hierarchy which is guaranteed to converge to the minimum volume ellipsoidal uncertainty bound for a given set of keypoint constraints. We show this pose uncertainty bound can easily be projected to independent translation and axis-angle orientation bounds. We evaluate SLUE on two pose estimation datasets and a real-world drone tracking scenario. Compared to prior work, SLUE generates substantially smaller translation bounds and competitive orientation bounds. We release code at https://github.com/MIT-SPARK/PoseUncertaintySets.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Naifu Zhang, Wei Tao, Xi Xiao, Qianpu Sun, Yuxin Zheng et al.
Multimodal
ๆฅ็ๆ่ฆ
In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianyi Xiong, Yi Ge, Ming Li, Zuolong Zhang, Pranav Kulkarni et al.
Multimodal
ๆฅ็ๆ่ฆ
Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria--especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruisheng Han, Kanglei Zhou, Shuang Chen, Amir Atapour-Abarghouei, Hubert P. H. Shum
ๆฅ็ๆ่ฆ
Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow
๐
2025-11-26
๐ ACM MMSys 2025
๐ฅ Kirill Paramonov, Mete Ozay, Aristeidis Mystakidis, Nikolaos Tsalikidis, Dimitrios Sotos et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system's effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system's practicality in real-world scenarios.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anantha Padmanaban Krishna Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hichem Sahbi
ๆฅ็ๆ่ฆ
Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of labeled data, which are frequently scarce in practical settings. To address this limitation, we propose a novel label-efficient GCN model. Our work makes two primary contributions. First, we develop a novel acquisition function that employs an adversarial strategy to identify a compact set of informative exemplars for labeling. This selection process balances representativeness, diversity, and uncertainty. Second, we introduce bidirectional and stable GCN architectures. These enhanced networks facilitate a more effective mapping between the ambient and latent data spaces, enabling a better understanding of the learned exemplar distribution. Extensive evaluations on two challenging skeleton-based action recognition benchmarks reveal significant improvements achieved by our label-efficient GCNs compared to prior work.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ M. Naseer Subhani
ๆฅ็ๆ่ฆ
Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation . We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haotian Xue, Qi Chen, Zhonghao Wang, Xun Huang, Eli Shechtman et al.
ๆฅ็ๆ่ฆ
Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage is: https://xavihart.github.io/mogan.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Joy Naoum, Revana Salama, Ali Hamdi
Machine Learning
ๆฅ็ๆ่ฆ
Oral cancer is highly common across the globe and is mostly diagnosed during the later stages due to the close visual similarity to benign, precancerous, and malignant lesions in the oral cavity. Implementing computer aided diagnosis systems early on has the potential to greatly improve clinical outcomes. This research intends to use deep learning to build a multiclass classifier for sixteen different oral lesions. To overcome the challenges of limited and imbalanced datasets, the proposed technique combines stratified data splitting and advanced data augmentation and oversampling to perform the classification. The experimental results, which achieved 83.33 percent accuracy, 89.12 percent precision, and 77.31 percent recall, demonstrate the superiority of the suggested model over state of the art methods now in use. The suggested model effectively conveys the effectiveness of oversampling and augmentation strategies in situations where the minority class classification performance is noteworthy. As a first step toward trustworthy computer aided diagnostic systems for the early detection of oral cancer in clinical settings, the suggested framework shows promise.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou et al.
ๆฅ็ๆ่ฆ
The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chou Mo, Yehyun Suh, J. Ryan Martin, Daniel Moyer
Computer Vision
ๆฅ็ๆ่ฆ
Automated landmark detection offers an efficient approach for medical professionals to understand patient anatomic structure and positioning using intra-operative imaging. While current detection methods for pelvic fluoroscopy demonstrate promising accuracy, most assume a fixed Antero-Posterior view of the pelvis. However, orientation often deviates from this standard view, either due to repositioning of the imaging unit or of the target structure itself. To address this limitation, we propose a novel framework that incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. We analyze the performance difference by comparing landmark detection accuracy between the baseline U-Net, U-Net trained with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions where patient pose is variable.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiang Gu, Liming Lu, Xu Zheng, Anan Du, Yongbin Zhou et al.
Multimodal
ๆฅ็ๆ่ฆ
Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD) for distilling robust 3D point cloud model. It learns lightweight prompts by aligning student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure a reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation is all during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kang Du, Xue Liao, Junpeng Xia, Chaozheng Guo, Yi Gu et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong et al.
Natural Language ProcessingMachine LearningComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning~(PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Husne Ara Rubaiyeat, Hasan Mahmud, Md Kamrul Hasan
ๆฅ็ๆ่ฆ
Bangla Sign Language Translation (BdSLT) has been severely constrained so far as the language itself is very low resource. Standard sentence level dataset creation for BdSLT is of immense importance for developing AI based assistive tools for deaf and hard of hearing people of Bangla speaking community. In this paper, we present a dataset, IsharaKhobor , and two subset of it for enabling research. We also present the challenges towards developing the dataset and present some way forward by benchmarking with landmark based raw and RQE embedding. We do some ablation on vocabulary restriction and canonicalization of the same within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Hong, Kaifeng Huang
Computer Vision
ๆฅ็ๆ่ฆ
Alzheimer's disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development of personalized treatment strategies that aim to mitigate its progression. The application of generated images for the prediction of Alzheimer's disease poses challenges, particularly in accurately representing the disease's characteristics when input sequences are captured at irregular time intervals. This study presents an innovative methodology for sequential image generation, guided by quantitative metrics, to maintain the essential features indicative of disease progression. Furthermore, an age-scaling factor is integrated into the process to produce age-specific MRI images, facilitating the prediction of advanced stages of the disease. The results obtained from the ablation study suggest that the inclusion of quantitative metrics significantly improves the accuracy of MRI image synthesis. Furthermore, the application of age-scaled pixel loss contributed to the enhanced iterative generation of MRI images. In terms of long-term disease prognosis, the Structural Similarity Index reached a peak value of 0.882, indicating a substantial degree of similarity in the synthesized images.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pierre Adorni, Minh-Tan Pham, Stรฉphane May, Sรฉbastien Lefรจvre
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs.
๐
2025-11-26
๐ IEEE Transactions on Medical Imaging
๐ฅ Yiyang Jiang, Guangwu Qian, Jiaxin Wu, Qi Huang, Qing Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjรถgren's syndrome, and scleroderma. Despite its importance, manual ANA detection is slow, labor-intensive, and demands years of training. ANA detection is complicated by over 100 coexisting antibody types, resulting in vast fluorescent pattern combinations. Although machine learning and deep learning have enabled automation, ANA detection in real-world clinical settings presents unique challenges as it involves multi-instance, multi-label (MIML) learning. In this paper, a novel framework for ANA detection is proposed that handles the complexities of MIML tasks using unaltered microscope images without manual preprocessing. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels accordingly. These steps are implemented using three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The instance sampler suppresses low-confidence instances by modeling pattern confidence, while the dispatcher adaptively assigns labels based on instance distinguishability. Self-paced learning adjusts training according to empirical label observations. Our framework overcomes limitations of traditional MIML methods and supports end-to-end optimization. Extensive experiments on one ANA dataset and three public medical MIML benchmarks demonstrate the superiority of our framework. On the ANA dataset, our model achieves up to +7.0% F1-Macro and +12.6% mAP gains over the best prior method, setting new state-of-the-art results. It also ranks top-2 across all key metrics on public datasets, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. The source code can be accessed at https://github.com/fletcherjiang/ANA-SelfPacedLearning.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lorenzo Pellegrini, Serafino Pandolfini, Davide Maltoni, Matteo Ferrara, Marco Prati et al.
Machine Learning
ๆฅ็ๆ่ฆ
The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentation strategies, and optimization techniques. These factors make it difficult to fairly compare detectors and to understand which factors truly contribute to their performance. To address this, we systematically investigate how different design choices influence the accuracy and generalization capabilities of deepfake detection models, focusing on aspects related to training, inference, and incremental updates. By isolating the impact of individual factors, we aim to establish robust, architecture-agnostic best practices for the design and development of future deepfake detection systems. Our experiments identify a set of design choices that consistently improve deepfake detection and enable state-of-the-art performance on the AI-GenBench benchmark.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shizhe Sun, Wataru Ohyama
Computer Vision
ๆฅ็ๆ่ฆ
We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD's potential as a new paradigm for attention-guided distillation in computer vision tasks. Code is available at https://github.com/tori-hotaru/CanKD
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Taehoon Kim, Donghwan Jang, Bohyung Han
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We present a novel training approach, named Merge-and-Bound (M&B) for Class Incremental Learning (CIL), which directly manipulates model weights in the parameter space for optimization. Our algorithm involves two types of weight merging: inter-task weight merging and intra-task weight merging. Inter-task weight merging unifies previous models by averaging the weights of models from all previous stages. On the other hand, intra-task weight merging facilitates the learning of current task by combining the model parameters within current stage. For reliable weight merging, we also propose a bounded update technique that aims to optimize the target model with minimal cumulative updates and preserve knowledge from previous tasks; this strategy reveals that it is possible to effectively obtain new models near old ones, reducing catastrophic forgetting. M&B is seamlessly integrated into existing CIL methods without modifying architecture components or revising learning objectives. We extensively evaluate our algorithm on standard CIL benchmarks and demonstrate superior performance compared to state-of-the-art methods.
๐
2025-11-26
๐ NeurIPS 2025
๐ฅ Dong-Jae Lee, Jiwan Hur, Jaehyun Choi, Jaemyung Yu, Junmo Kim
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as rank collapsing and over-smoothing phenomenon. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. high-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over smoothing. Furthermore, we analyze the previous methods, shedding light on their implicit frequency characteristics and limitations.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuai Zhang, Bao Tang, Siyuan Yu, Yueting Zhu, Jingfeng Yao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. The core lies in: (1) We analyzed the performance of linear attention modules and softmax attention modules on mobile devices, and proposed a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, resulting in a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. Under one-step conditions, the generation speed of each frame of 720p video is less than 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Futian Wang, Fan Zhang, Xiao Wang, Mengqi Wang, Dexing Huang et al.
ๆฅ็ๆ่ฆ
Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiajie Zhang, Sรถren Schwertfeger, Alexander Kleiner
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Adeela Islam, Stefano Fiorini, Manuel Lecha, Theodore Tsesmelis, Stuart James et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds, containing both point positions and colors of fractured fragments, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotationconsistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Futian Wang, Mengqi Wang, Xiao Wang, Haowen Wang, Jin Tang
Natural Language Processing
ๆฅ็ๆ่ฆ
Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingue Park, Prin Phunyaphibarn, Phillip Y. Lee, Minhyuk Sung
Computer Vision
ๆฅ็ๆ่ฆ
We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive models (VAR) at test time without requiring retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address this limitation at test time in two stages. First, inspired by diversity enhancement techniques in diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, the image quality sharply declines. To preserve quality, we propose scale-travel: a novel latent refinement technique inspired by time-travel strategies in diffusion models. Specifically, we use a multi-scale autoencoder to extract coarse-scale tokens that enable us to resume generation at intermediate stages. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Gu, Haoji Zhang, Qihang Fan, Jingxuan Niu, Zhipeng Zhang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3\% m\_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yangle Liu, Fengze Li, Kan Liu, Jieming Ma
Machine Learning
ๆฅ็ๆ่ฆ
Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns with geometry and triggers early geometric drift, where erroneous shapes are reinforced during densification and become hard to correct. We ask how to anchor geometry early for 4D Gaussian splatting (4DGS) while maintaining temporal consistency and efficiency in dynamic endoscopic scenes. Thus, we present Endo-G$^{2}$T, a geometry-guided and temporally aware training scheme for time-embedded 4DGS. First, geo-guided prior distillation converts confidence-gated monocular depth into supervision with scale-invariant depth and depth-gradient losses, using a warm-up-to-cap schedule to inject priors softly and avoid early overfitting. Second, a time-embedded Gaussian field represents dynamics in XYZT with a rotor-like rotation parameterization, yielding temporally coherent geometry with lightweight regularization that favors smooth motion and crisp opacity boundaries. Third, keyframe-constrained streaming improves efficiency and long-horizon stability through keyframe-focused optimization under a max-points budget, while non-keyframes advance with lightweight updates. Across EndoNeRF and StereoMIS-P1 datasets, Endo-G$^{2}$T achieves state-of-the-art results among monocular reconstruction baselines.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qing Li, Huifang Feng, Kanle Shi, Yue Gao, Yi Fang et al.
ๆฅ็ๆ่ฆ
Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is difficult when dealing with different data or geometries. Existing methods commonly employ various parameter-heavy strategies to extract a full feature description from the input patch. However, they still have difficulties in accurately and efficiently predicting normals for various point clouds. In this work, we present a new idea of feature extraction for robust normal estimation of point clouds. We use the fusion of multi-scale features from different neighborhood sizes to address the issue of selecting reasonable patch sizes for various data or geometries. We seek to model a patch feature fitting (PFF) based on multi-scale features to approximate the optimal geometric description for normal estimation and implement the approximation process via multi-scale feature aggregation and cross-scale feature compensation. The feature aggregation module progressively aggregates the patch features of different scales to the center of the patch and shrinks the patch size by removing points far from the center. It not only enables the network to precisely capture the structure characteristic in a wide range, but also describes highly detailed geometries. The feature compensation module ensures the reusability of features from earlier layers of large scales and reveals associated information in different patch sizes. Our approximation strategy based on aggregating the features of multiple scales enables the model to achieve scale adaptation of varying local patches and deliver the optimal feature description. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets with fewer network parameters and running time.
๐
2025-11-26
๐ Presented at the 2025 IEEE International Conferenc
๐ฅ Ariful Islam, Md Rifat Hossen, Md. Mahmudul Arif, Abdullah Al Noman, Md Arifur Rahman
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Natural disasters remain a major challenge for Bangladesh, so real-time monitoring and quick response systems are essential. In this study, we present BanglaMM-Disaster, an end-to-end deep learning-based multimodal framework for disaster classification in Bangla, using both textual and visual data from social media. We constructed a new dataset of 5,037 Bangla social media posts, each consisting of a caption and a corresponding image, annotated into one of nine disaster-related categories. The proposed model integrates transformer-based text encoders, including BanglaBERT, mBERT, and XLM-RoBERTa, with CNN backbones such as ResNet50, DenseNet169, and MobileNetV2, to process the two modalities. Using early fusion, the best model achieves 83.76% accuracy. This surpasses the best text-only baseline by 3.84% and the image-only baseline by 16.91%. Our analysis also shows reduced misclassification across all classes, with noticeable improvements for ambiguous examples. This work fills a key gap in Bangla multimodal disaster analysis and demonstrates the benefits of combining multiple data types for real-time disaster response in low-resource settings.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tae-Min Choi, Tae Kyeong Jeong, Garam Kim, Jaemin Lee, Yeongyoon Koh et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Munish Rathee, Boris Baฤiฤ, Maryam Doborjeh
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset is recorded under various weather and lighting conditions, comprising 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% (+- 0.8%) with a per-frame inference time of 9.5 ms. Achieved sub-10 millisecond latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, enhances interpretability, supports transparent decision-making, and operates efficiently on embedded hardware. Although synthetic augmentation improved robustness, generalisation to unseen field conditions remains to be validated. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, which, as a traffic flow-control infrastructure, is deployed in over 20 cities worldwide.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Weitian Wang, Lukas Meiner, Rai Shubham, Cecilia De La Parra, Akash Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in a GPU-based inference.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenyu Liu, Hongze Chen, Jingzhi Bao, Lingting Zhu, Runze Zhang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization. Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jules Decaestecker, Nicolas Vigne
Natural Language Processing
ๆฅ็ๆ่ฆ
Achieving both high accuracy and topological continuity in road segmentation from satellite imagery is a critical goal for applications ranging from urban planning to disaster response. State-of-the-art methods often rely on Vision Transformers, which excel at capturing global context, yet their quadratic complexity is a significant barrier to efficient deployment, particularly for on-board processing in resource-constrained platforms. In contrast, emerging State Space Models like Mamba offer linear-time efficiency and are inherently suited to modeling long, continuous structures. We posit that these architectures have complementary strengths. To this end, we introduce PathMamba, a novel hybrid architecture that integrates Mamba's sequential modeling with the Transformer's global reasoning. Our design strategically uses Mamba blocks to trace the continuous nature of road networks, preserving topological structure, while integrating Transformer blocks to refine features with global context. This approach yields topologically superior segmentation maps without the prohibitive scaling costs of pure attention-based models. Our experiments on the DeepGlobe Road Extraction and Massachusetts Roads datasets demonstrate that PathMamba sets a new state-of-the-art. Notably, it significantly improves topological continuity, as measured by the APLS metric, setting a new benchmark while remaining computationally competitive.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou et al.
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning, respectively. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data engine effectively addresses complex RS data enviroment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluating tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.
๐
2025-11-26
๐ ArXiv Preprint (cs.SD)
๐ฅ Yicheng Zhong, Peiji Yang, Zhisheng Wang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS LLMs. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for decoding stability, and an LLM-annotated prosody alignment reward that explicitly supervises rhythm. In this prosody reward, an external reasoning LLM predicts multiple plausible pause structures via in-context learning, providing a human-preference-aligned supervisory signal for GRPO training. To assess universality, we further attach a flow-matching (FM) decoder on top of the GRPO-optimized AR backbone and observe consistent additional gains, indicating that our reinforcement optimization enhances the intrinsic AR policy. We further conduct a scalability analysis across data sizes and model scales, revealing that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS LLMs.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Juncheng Chen, Chao Xu, Yanjun Cao
ๆฅ็ๆ่ฆ
Learning-based image matching critically depends on large-scale, diverse, and geometrically accurate training data. 3D Gaussian Splatting (3DGS) enables photorealistic novel-view synthesis and thus is attractive for data generation. However, its geometric inaccuracies and biased depth rendering currently prevent robust correspondence labeling. To address this, we introduce MatchGS, the first framework designed to systematically correct and leverage 3DGS for robust, zero-shot image matching. Our approach is twofold: (1) a geometrically-faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, enabling the synthesis of a vast and diverse range of viewpoints without compromising rendering fidelity; and (2) a 2D-3D representation alignment strategy that infuses 3DGS' explicit 3D knowledge into the 2D matcher, guiding 2D semi-dense matchers to learn viewpoint-invariant 3D representations. Our generated ground-truth correspondences reduce the epipolar error by up to 40 times compared to existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. Consequently, state-of-the-art matchers trained solely on our data achieve significant zero-shot performance gains on public benchmarks, with improvements of up to 17.7%. Our work demonstrates that with proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source, paving the way for a new generation of robust zero-shot image matchers.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sizhuo Zhou, Xiaosong Jia, Fanrui Zhang, Junjie Li, Juyong Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data only support single frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long-horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. LaGen is able to take a single-frame LiDAR input as a starting point and effectively utilize bounding box information as conditions to generate high-fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model's interactive generation capability for object-level content, as well as a noise modulation module to mitigate error accumulation during long-horizon generation. We construct a protocol based on nuScenes for evaluating long-horizon LiDAR scene generation. Experimental results comprehensively demonstrate LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on the later frames.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuhan Xia, Peipei Li, Xuannan Liu, Dongsen Zhang, Xinyu Guo et al.
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human subject and general subject. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery types classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Quentin Gabot, Teck-Yian Lim, Jรฉrรฉmy Fix, Joana Frontera-Pons, Chengfang Ren et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systematic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real-valued neural networks. In this paper, we extend the work on LPS to complex-valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems, specifically for either the invariance property in classification tasks or the equivariance property in both reconstruction and semantic segmentation problems, using polarimetric Synthetic Aperture Radar images.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chen Ling, Henglin Shi, Hedvig Kjellstrรถm
Computer Vision
ๆฅ็ๆ่ฆ
Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuhan Xia, Xuannan Liu, Xing Cui, Peipei Li
Robotics
ๆฅ็ๆ่ฆ
Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, We identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appeared on the forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Umang Agarwal, Rudraksh Sangore, Sumit Laddha
Machine Learning
ๆฅ็ๆ่ฆ
We present a comprehensive comparative study of three generative modeling paradigms: Denoising Diffusion Probabilistic Models (DDPM), Conditional Flow Matching (CFM), and MeanFlow. While DDPM and CFM require iterative sampling, MeanFlow enables direct one-step generation by modeling the average velocity over time intervals. We implement all three methods using a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10, demonstrating that CFM achieves an FID of 24.15 with 50 steps, significantly outperforming DDPM (FID 402.98). MeanFlow achieves FID 29.15 with single-step sampling -- a 50X reduction in inference time. We further extend CFM to image inpainting, implementing mask-guided sampling with four mask types (center, random bbox, irregular, half). Our fine-tuned inpainting model achieves substantial improvements: PSNR increases from 4.95 to 8.57 dB on center masks (+73%), and SSIM improves from 0.289 to 0.418 (+45%), demonstrating the effectiveness of inpainting-aware training.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Baoli Sun, Yihan Wang, Xinzhu Ma, Zhihui Wang, Kun Lu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate the superiority to previous state-of-the-art baselines.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Selene Cerna, Sara Si-Moussi, Wilfried Thuiller, Hadrien Hendrikx, Vincent Miele
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevรฉs. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.
๐
2025-11-26
๐ NeurIPS 2025
๐ฅ Hanyang Li, Yuheng Jia, Hui Liu, Junhui Hou
Machine Learning
ๆฅ็ๆ่ฆ
Recent deep clustering models have produced impressive clustering performance. However, a common issue with existing methods is the disparity between global and local feature structures. While local structures typically show strong consistency and compactness within class samples, global features often present intertwined boundaries and poorly separated clusters. Motivated by this observation, we propose DCBoost, a parameter-free plug-in designed to enhance the global feature structures of current deep clustering models. By harnessing reliable local structural cues, our method aims to elevate clustering performance effectively. Specifically, we first identify high-confidence samples through adaptive $k$-nearest neighbors-based consistency filtering, aiming to select a sufficient number of samples with high label reliability to serve as trustworthy anchors for self-supervision. Subsequently, these samples are utilized to compute a discriminative loss, which promotes both intra-class compactness and inter-class separability, to guide network optimization. Extensive experiments across various benchmark datasets showcase that our DCBoost significantly improves the clustering performance of diverse existing deep clustering models. Notably, our method improves the performance of current state-of-the-art baselines (e.g., ProPos) by more than 3% and amplifies the silhouette coefficient by over $7\times$. Code is available at .
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hui Lu, Yi Yu, Yiming Yang, Chenyu Yi, Qixin Zhang et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yutao Tang, Cheng Zhao, Gaurav Mittal, Rohith Kukkala, Rama Chellappa et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zheng Li, Yibing Song, Xin Zhang, Lei Luo, Xiang Li et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP generalizations. However, these anchors-static in both value and position-lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., "shape", "color"), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Joonhyung Park, Hyeongwon Jang, Joowon Kim, Eunho Yang
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shichu Sun, Yichen Zhang, Haolin Song, Zonghao Guo, Chi Chen et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native- resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual model- ing, (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
๐
2025-11-26
๐ ArXiv Preprint (Multimedia)
๐ฅ Xinyue Guo, Xiaoran Yang, Lipan Zhang, Jianxuan Yang, Zhao Wang et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Sound effect editing-modifying audio by adding, removing, or replacing elements-remains constrained by existing approaches that rely solely on low-level signal processing or coarse text prompts, often resulting in limited flexibility and suboptimal audio quality. To address this, we propose AV-Edit, a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, learning aligned cross-modal representations. These representations are then used to train an editorial Multimodal Diffusion Transformer (MM-DiT) capable of removing visually irrelevant sounds and generating missing audio elements consistent with video content through a correlation-based feature gating training strategy. Furthermore, we construct a dedicated video-based sound editing dataset as an evaluation benchmark. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in the field of sound effect editing and exhibiting strong competitiveness in the domain of audio generation.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaming He, Guanyu Hou, Hongwei Li, Zhicong Huang, Kangjie Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods,which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a TEmporal-aware Automated Red-teaming framework, named TEAR, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. And a refine model is adopted to improve the prompt stealthiness and adversarial effectiveness cyclically. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80% attack success rate, a significant boost from prior best result of 57%.
๐
2025-11-26
๐ ACM UIST 2023
๐ฅ Taejun Kim, Amy Karlson, Aakar Gupta, Tovi Grossman, Jason Wu et al.
ๆฅ็ๆ่ฆ
While text entry is an essential and frequent task in Augmented Reality (AR) applications, devising an efficient and easy-to-use text entry method for AR remains an open challenge. This research presents STAR, a smartphone-analogous AR text entry technique that leverages a user's familiarity with smartphone two-thumb typing. With STAR, a user performs thumb typing on a virtual QWERTY keyboard that is overlain on the skin of their hands. During an evaluation study of STAR, participants achieved a mean typing speed of 21.9 WPM (i.e., 56% of their smartphone typing speed), and a mean error rate of 0.3% after 30 minutes of practice. We further analyze the major factors implicated in the performance gap between STAR and smartphone typing, and discuss ways this gap could be narrowed.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Baoli Sun, Xinzhu Ma, Ning Wang, Zhihui Wang, Zhiyong Wang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features potentially focus on the non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer to the state-of-the-art methods.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Changlin Li, Jiawei Zhang, Shuhao Liu, Sihao Lin, Zeyi Shi et al.
ๆฅ็ๆ่ฆ
Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets, demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Ziyi Chen, Yingnan Guo, Zedong Chu, Minghua Luo, Yanfen Shen et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Embodied navigation that adheres to social norms remains an open research challenge. Our \textbf{SocialNav} is a foundational model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Our project page: https://amap-eai.github.io/SocialNav/
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xintian Mao, Haofei Song, Yin-Nian Liu, Qingli Li, Yan Wang
ๆฅ็ๆ่ฆ
It is well-known that if a network aims to learn how to deblur, it should understand the blur process. Blurring is naturally caused by the convolution of the sharp image with the blur kernel. Thus, allowing the network to learn the blur process in the kernel-level can significantly improve the image deblurring performance. But, current deep networks are still at the pixel-level learning stage, either performing end-to-end pixel-level restoration or stage-wise pseudo kernel-level restoration, failing to enable the deblur model to understand the essence of the blur. To this end, we propose Fourier Kernel Estimator (FKE), which considers the activation operation in Fourier space and converts the convolution problem in the spatial domain to a multiplication problem in Fourier space. Our FKE, jointly optimized with the deblur model, enables the network to learn the kernel-level blur process with low complexity and without any additional supervision. Furthermore, we change the convolution object of the kernel from ``image" to network extracted ``feature", whose rich semantic and structural information is more suitable to blur process learning. With the convolution of the feature and the estimated kernel, our model can learn the essence of blur in kernel-level. To further improve the efficiency of feature extraction, we design a decoupled multi-scale architecture with multiple hierarchical sub-unets with a reversible strategy, which allows better multi-scale encoding and decoding in low training memory. Extensive experiments indicate that our method achieves state-of-the-art motion deblurring results and show potential for handling other kernel-related problems. Analysis also shows our kernel estimator is able to learn physically meaningful kernels. The code will be available at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Jialun Liu et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold: geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation.
However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations.
We then propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Changlin Li, Jiawei Zhang, Zeyi Shi, Zongxin Yang, Zhihui Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse, and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Honga, Jie Lin, Minghui Wang
ๆฅ็ๆ่ฆ
Alzheimer's disease (AD), a degenerative brain condition, can benefit from early prediction to slow its progression. As the disease progresses, patients typically undergo brain atrophy. Current prediction methods for Alzheimers disease largely involve analyzing morphological changes in brain images through manual feature extraction. This paper proposes a novel method, the Deformation-Aware Temporal Generative Network (DATGN), to automate the learning of morphological changes in brain images about disease progression for early prediction. Given the common occurrence of missing data in the temporal sequences of MRI images, DATGN initially interpolates incomplete sequences. Subsequently, a bidirectional temporal deformation-aware module guides the network in generating future MRI images that adhere to the disease's progression, facilitating early prediction of Alzheimer's disease. DATGN was tested for the generation of temporal sequences of future MRI images using the ADNI dataset, and the experimental results are competitive in terms of PSNR and MMSE image quality metrics. Furthermore, when DATGN-generated synthetic data was integrated into the SVM vs. CNN vs. 3DCNN-based classification methods, significant improvements were achieved from 6. 21\% to 16\% in AD vs. NC classification accuracy and from 7. 34\% to 21. 25\% in AD vs. MCI vs. NC classification accuracy. The qualitative visualization results indicate that DATGN produces MRI images consistent with the brain atrophy trend in Alzheimer's disease, enabling early disease prediction.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ YuAn Wang, Xiaofan Li, Chi Huang, Wenhao Zhang, Hao Li et al.
Robotics
ๆฅ็ๆ่ฆ
In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce \textbf{FaithFusion}, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications.Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at 6 meters lane shift. Our code is available at https://github.com/wangyuanbiubiubiu/FaithFusion.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Ze Feng, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some priors introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances the Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance of the student and the teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space. Comprehensive evaluation on diverse benchmarks demonstrates that EM-KD trained model outperforms prior Efficient MLLMs on both accuracy and efficiency with a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia
Machine Learning
ๆฅ็ๆ่ฆ
Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gayoung Lee, Junho Kim, Jin-Hwa Kim, Junmo Kim
Computer Vision
ๆฅ็ๆ่ฆ
Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically "sculpts" reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Geetanjali Sharma, Gaurav Jaswal, Aditya Nigam, Raghavendra Ramachandra
Robotics
ๆฅ็ๆ่ฆ
Iris authentication algorithms have achieved impressive recognition performance, making them highly promising for real-world applications such as border control, citizen identification, and both criminal investigations and commercial systems. However, their robustness is still challenged by variations in rotation, scale, specular reflections, and defocus blur. In addition, most existing approaches rely on straightforward point-to-point comparisons, typically using cosine or L2 distance, without effectively leveraging the spatio-spatial-temporal structure of iris patterns. To address these limitations, we propose a novel and generalized matching pipeline that learns rich spatio-spatial-temporal representations of iris features. Our approach first splits each iris image along one dimension, generating a sequence of sub-images that serve as input to a 3D-CNN, enabling the network to capture both spatial and spatio-spatial-temporal cues. To further enhance the modeling of spatio-spatial-temporal feature dynamics, we train the model in curriculum manner. This design allows the network to embed temporal dependencies directly into the feature space, improving discriminability in the deep metric domain. The framework is trained end-to-end with triplet and ArcFace loss in a curriculum manner, enforcing highly discriminative embeddings despite challenges like rotation, scale, reflections, and blur. This design yields a robust and generalizable solution for iris authentication.Github code: https://github.com/GeetanjaliGTZ/CLRecogEye
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyun Zeng, Hang Hua, Jiebo Luo
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chujie Wang, Jianyu Lu, Zhiyuan Luo, Xi Chen, Chu He
MultimodalMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD's lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent's state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Hong, Xinze Sun, Yinhao Li, Yen-Wei Chen
Computer Vision
ๆฅ็ๆ่ฆ
Image generation can provide physicians with an imaging diagnosis basis in the prediction of Alzheimer's Disease (AD). Recent research has shown that long-term AD predictions by image generation often face difficulties maintaining disease-related characteristics when dealing with irregular time intervals in sequential data. Considering that the time-related aspects of the distribution can reflect changes in disease-related characteristics when images are distributed unevenly, this research proposes a model to estimate the temporal parameter within the Normal Inverse Gamma Distribution (T-NIG) to assist in generating images over the long term. The T-NIG model employs brain images from two different time points to create intermediate brain images, forecast future images, and predict the disease. T-NIG is designed by identifying features using coordinate neighborhoods. It incorporates a time parameter into the normal inverse gamma distribution to understand how features change in brain imaging sequences that have varying time intervals. Additionally, T-NIG utilizes uncertainty estimation to reduce both epistemic and aleatoric uncertainties in the model, which arise from insufficient temporal data. In particular, the T-NIG model demonstrates state-of-the-art performance in both short-term and long-term prediction tasks within the dataset. Experimental results indicate that T-NIG is proficient in forecasting disease progression while maintaining disease-related characteristics, even when faced with an irregular temporal data distribution.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Chenglizhao Chen, Shaofeng Liang, Runwei Guan, Xiaolou Sun, Haocheng Zhao et al.
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which constrains their ability to capture broad-scale scene contexts and perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, which has given rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To facilitate its construction, we develop an innovative semi-automated collaborative agent-based labeling assistant (COALA) framework that significantly reduces labor costs while maintaining annotation quality. Furthermore, we propose HawkEyeTrack (HETrack), a novel method that collaboratively enhances vision-language representation learning and improves the perception of UAV scenarios. Comprehensive experiments validated the challenging nature of our dataset and the effectiveness of our method.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yingjie Xia, Xi Wang, Jinglei Shi, Vicky Kalogeiton, Jian Yang
Machine Learning
ๆฅ็ๆ่ฆ
Images evoke emotions that profoundly influence perception, often prioritized over content. Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling. In this work, we introduce MUSE, the first unified framework capable of both emotional generation and editing. By adopting a strategy conceptually aligned with Test-Time Scaling (TTS) that widely used in both LLM and diffusion model communities, it avoids the requirement for additional updating diffusion model and specialized emotional synthesis datasets. More specifically, MUSE addresses three key questions in emotional synthesis: (1) HOW to stably guide synthesis by leveraging an off-the-shelf emotion classifier with gradient-based optimization of emotional tokens; (2) WHEN to introduce emotional guidance by identifying the optimal timing using semantic similarity as a supervisory signal; and (3) WHICH emotion to guide synthesis through a multi-emotion loss that reduces interference from inherent and similar emotions. Experimental results show that MUSE performs favorably against all methods for both generation and editing, improving emotional accuracy and semantic diversity while maintaining an optimal balance between desired content, adherence to text prompts, and realistic emotional expression. It establishes a new paradigm for emotion synthesis.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hakki Motorcu, Mujdat Cetin
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Spatially varying image deblurring remains a fundamentally ill-posed problem, especially when degradations arise from complex mixtures of motion and other forms of blur under significant noise. State-of-the-art learning-based approaches generally fall into two paradigms: model-based deep unrolling methods that enforce physical constraints by modeling the degradations, but often produce over-smoothed, artifact-laden textures, and generative models that achieve superior perceptual quality yet hallucinate details due to weak physical constraints. In this paper, we propose a novel framework that uniquely reconciles these paradigms by taming a powerful generative prior with explicit, dense physical constraints. Rather than oversimplifying the degradation field, we model it as a dense continuum of high-dimensional compressed kernels, ensuring that minute variations in motion and other degradation patterns are captured. We leverage this rich descriptor field to condition a ControlNet architecture, strongly guiding the diffusion sampling process. Extensive experiments demonstrate that our method effectively bridges the gap between physical accuracy and perceptual realism, outperforming state-of-the-art model-based methods as well as generative baselines in challenging, severely blurred scenarios.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Cheng Yang, Hui Jin, Xinlei Yu, Zhipeng Wang, Yaoqun Liu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Diagnosing lung cancer typically involves physicians identifying lung nodules in Computed tomography (CT) scans and generating diagnostic reports based on their morphological features and medical expertise. Although advancements have been made in using multimodal large language models for analyzing lung CT scans, challenges remain in accurately describing nodule morphology and incorporating medical expertise. These limitations affect the reliability and effectiveness of these models in clinical settings. Collaborative multi-agent systems offer a promising strategy for achieving a balance between generality and precision in medical applications, yet their potential in pathology has not been thoroughly explored. To bridge these gaps, we introduce LungNoduleAgent, an innovative collaborative multi-agent system specifically designed for analyzing lung CT scans. LungNoduleAgent streamlines the diagnostic process into sequential components, improving precision in describing nodules and grading malignancy through three primary modules. The first module, the Nodule Spotter, coordinates clinical detection models to accurately identify nodules. The second module, the Radiologist, integrates localized image description techniques to produce comprehensive CT reports. Finally, the Doctor Agent System performs malignancy reasoning by using images and CT reports, supported by a pathology knowledge base and a multi-agent system framework. Extensive testing on two private datasets and the public LIDC-IDRI dataset indicates that LungNoduleAgent surpasses mainstream vision-language models, agent systems, and advanced expert models. These results highlight the importance of region-level semantic alignment and multi-agent collaboration in diagnosing nodules. LungNoduleAgent stands out as a promising foundational tool for supporting clinical analyses of lung nodules.
๐
2025-11-26
๐ 8 Pages, 10 figures, 2 Tables, Accepted in Journal
๐ฅ Dinanath Padhya, Krishna Acharya, Bipul Kumar Dahal, Dinesh Baniya Kshatri
Machine Learning
ๆฅ็ๆ่ฆ
Automatic Modulation Classification (AMC) is a core technology for future wireless communication systems, enabling the identification of modulation schemes without prior knowledge. This capability is essential for applications in cognitive radio, spectrum monitoring, and intelligent communication networks. We propose an AMC system based on a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture, integrated with a Software Defined Radio (SDR) platform. The proposed architecture leverages CNNs for spatial feature extraction and LSTMs for capturing temporal dependencies, enabling efficient handling of complex, time-varying communication signals. The system's practical ability was demonstrated by identifying over-the-air (OTA) signals from a custom-built FM transmitter alongside other modulation schemes. The system was trained on a hybrid dataset combining the RadioML2018 dataset with a custom-generated dataset, featuring samples at Signal-to-Noise Ratios (SNRs) from 0 to 30dB. System performance was evaluated using accuracy, precision, recall, F1 score, and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The optimized model achieved 93.48% accuracy, 93.53% precision, 93.48% recall, and an F1 score of 93.45%. The AUC-ROC analysis confirmed the model's discriminative power, even in noisy conditions. This paper's experimental results validate the effectiveness of the hybrid CNN-LSTM architecture for AMC, suggesting its potential application in adaptive spectrum management and advanced cognitive radio systems.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaixing Yang, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang et al.
ๆฅ็ๆ่ฆ
Music-to-dance generation aims to translate auditory signals into expressive human motion, with broad applications in virtual reality, choreography, and digital entertainment. Despite promising progress, the limited generation efficiency of existing methods leaves insufficient computational headroom for high-fidelity 3D rendering, thereby constraining the expressiveness of 3D characters during real-world applications. Thus, we propose FlowerDance, which not only generates refined motion with physical plausibility and artistic expressiveness, but also achieves significant generation efficiency on inference speed and memory utilization . Specifically, FlowerDance combines MeanFlow with Physical Consistency Constraints, which enables high-quality motion generation with only a few sampling steps. Moreover, FlowerDance leverages a simple but efficient model architecture with BiMamba-based backbone and Channel-Level Cross-Modal Fusion, which generates dance with efficient non-autoregressive manner. Meanwhile, FlowerDance supports motion editing, enabling users to interactively refine dance sequences. Extensive experiments on AIST++ and FineDance show that FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency. Code will be released upon acceptance.
๐
2025-11-26
๐ ArXiv Preprint (eess.IV)
๐ฅ Chicago Y. Park, Michael T. McCann, Cristina Garcia-Cardona, Brendt Wohlberg, Ulugbek S. Kamilov
Machine Learning
ๆฅ็ๆ่ฆ
We propose deep parameter interpolation (DPI), a general-purpose method for transforming an existing deep neural network architecture into one that accepts an additional scalar input. Recent deep generative models, including diffusion models and flow matching, employ a single neural network to learn a time- or noise level-dependent vector field. Designing a network architecture to accurately represent this vector field is challenging because the network must integrate information from two different sources: a high-dimensional vector (usually an image) and a scalar. Common approaches either encode the scalar as an additional image input or combine scalar and vector information in specific network components, which restricts architecture choices. Instead, we propose to maintain two learnable parameter sets within a single network and to introduce the scalar dependency by dynamically interpolating between the parameter sets based on the scalar value during training and sampling. DPI is a simple, architecture-agnostic method for adding scalar dependence to a neural network. We demonstrate that our method improves denoising performance and enhances sample quality for both diffusion and flow matching models, while achieving computational efficiency comparable to standard scalar conditioning techniques. Code is available at https://github.com/wustl-cig/parameter_interpolation.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu et al.
Multimodal
ๆฅ็ๆ่ฆ
Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand-in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks lower by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qirui Yang, Yang Yang, Ying Zeng, Xiaobin Hu, Bo Li et al.
Robotics
ๆฅ็ๆ่ฆ
Text-guided diffusion models have greatly advanced image editing and generation. However, achieving physically consistent image retouching with precise parameter control (e.g., exposure, white balance, zoom) remains challenging. Existing methods either rely solely on ambiguous and entangled text prompts, which hinders precise camera control, or train separate heads/weights for parameter adjustment, which compromises scalability, multi-parameter composition, and sensitivity to subtle variations. To address these limitations, we propose CameraMaster, a unified camera-aware framework for image retouching. The key idea is to explicitly decouple the camera directive and then coherently integrate two critical information streams: a directive representation that captures the photographer's intent, and a parameter embedding that encodes precise camera settings. CameraMaster first uses the camera parameter embedding to modulate both the camera directive and the content semantics. The modulated directive is then injected into the content features via cross-attention, yielding a strongly camera-sensitive semantic context. In addition, the directive and camera embeddings are injected as conditioning and gating signals into the time embedding, enabling unified, layer-wise modulation throughout the denoising process and enforcing tight semantic-parameter alignment. To train and evaluate CameraMaster, we construct a large-scale dataset of 78K image-prompt pairs annotated with camera parameters. Extensive experiments show that CameraMaster produces monotonic and near-linear responses to parameter variations, supports seamless multi-parameter composition, and significantly outperforms existing methods.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haojian Huang, Jiahao Shi, Zhe Liu, Harold Haodong Chen, Han Fang et al.
ๆฅ็ๆ่ฆ
Trustworthy multi-view classification (TMVC) addresses the challenge of achieving reliable decision-making in complex scenarios where multi-source information is heterogeneous, inconsistent, or even conflicting. Existing TMVC approaches predominantly rely on globally dense neighbor relationships to model intra-view dependencies, leading to high computational costs and an inability to directly ensure consistency across inter-view relationships. Furthermore, these methods typically aggregate evidence from different views through manually assigned weights, lacking guarantees that the learned multi-view neighbor structures are consistent within the class space, thus undermining the trustworthiness of classification outcomes. To overcome these limitations, we propose a novel TMVC framework that introduces prototypes to represent the neighbor structures of each view. By simplifying the learning of intra-view neighbor relations and enabling dynamic alignment of intra- and inter-view structure, our approach facilitates more efficient and consistent discovery of cross-view consensus. Extensive experiments on multiple public multi-view datasets demonstrate that our method achieves competitive downstream performance and robustness compared to prevalent TMVC methods.
๐
2025-11-26
๐ 22 pages, 15 figures, Submitted to Journal of Envi
๐ฅ Taehoon Kang, Taeyong Kim
Machine Learning
ๆฅ็ๆ่ฆ
Climate change has intensified the frequency and severity of wildfires, making rapid and accurate prediction of fire spread essential for effective mitigation and response. Physics-based simulators such as FARSITE offer high-fidelity predictions but are computationally intensive, limiting their applicability in real-time decision-making, while existing deep learning models often yield overly smooth predictions that fail to capture the complex, nonlinear dynamics of wildfire propagation. This study proposes an autoregressive conditional generative adversarial network (CGAN) for probabilistic wildfire spread prediction. By formulating the prediction task as an autoregressive problem, the model learns sequential state transitions, ensuring long-term prediction stability. Experimental results demonstrate that the proposed CGAN-based model outperforms conventional deep learning models in both overall predictive accuracy and boundary delineation of fire perimeters. These results demonstrate that adversarial learning allows the model to capture the strong nonlinearity and uncertainty of wildfire spread, instead of simply fitting the pixel average. Furthermore, the autoregressive framework facilitates systematic temporal forecasting of wildfire evolution. The proposed CGAN-based autoregressive framework enhances both the accuracy and physical interpretability of wildfire spread prediction, offering a promising foundation for time-sensitive response and evacuation planning.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuhang Liu, Wenjie Zhao, Yunhui Guo
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Selecting an appropriate pre-trained source model is a critical, yet computationally expensive, task in transfer learning. Model Transferability Estimation (MTE) methods address this by providing efficient proxy metrics to rank models without full fine-tuning. In practice, the choice of which MTE metric to use is often ad hoc or guided simply by a metric's average historical performance. However, we observe that the effectiveness of MTE metrics is highly task-dependent and no single metric is universally optimal across all target datasets. To address this gap, we introduce MetaRank, a meta-learning framework for automatic, task-aware MTE metric selection. We formulate metric selection as a learning-to-rank problem. Rather than relying on conventional meta-features, MetaRank encodes textual descriptions of both datasets and MTE metrics using a pretrained language model, embedding them into a shared semantic space. A meta-predictor is then trained offline on diverse meta-tasks to learn the intricate relationship between dataset characteristics and metric mechanisms, optimized with a listwise objective that prioritizes correctly ranking the top-performing metrics. During the subsequent online phase, MetaRank efficiently ranks the candidate MTE metrics for a new, unseen target dataset based on its textual description, enabling practitioners to select the most appropriate metric a priori. Extensive experiments across 11 pretrained models and 11 target datasets demonstrate the strong effectiveness of our approach.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Xiaoxing You, Qiang Huang, Lingyu Li, Chi Zhang, Xiaopeng Liu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingxi Chen, Yixiao Zhang, Xiaoye Qian, Zongxia Li, Cornelia Fermuller et al.
Machine Learning
ๆฅ็ๆ่ฆ
Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxiao Xiang, Junchi Chen, Zhenchao Jin, Changtao Miao, Haojie Yuan et al.
Multimodal
ๆฅ็ๆ่ฆ
Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via a MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The codes will be made publicly available.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiwen Zheng, Yiwei Ouyang, Zhao Huang, Tao Zhang, Xiaoshuai Zhang et al.
Computer Vision
ๆฅ็ๆ่ฆ
Accurately localizing and segmenting obscured objects from faint light patterns beyond the field of view is highly challenging due to multiple scattering and medium-induced perturbations. Most existing methods, based on real-valued modeling or local convolutional operations, are inadequate for capturing the underlying physics of coherent light propagation. Moreover, under low signal-to-noise conditions, these methods often converge to non-physical solutions, severely compromising the stability and reliability of the observation. To address these challenges, we propose a novel physics-driven Wavefront Propagating Compensation Network (WavePCNet) to simulate wavefront propagation and enhance the perception of obscured objects. This WavePCNet integrates the Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) to incorporate complex amplitude transfer operators to precisely constrain coherent propagation behavior, along with a momentum memory mechanism to effectively suppress the accumulation of perturbations. Additionally, a High-frequency Cross-layer Compensation Enhancement is introduced to construct frequency-selective pathways with multi-scale receptive fields and dynamically model structural consistency across layers, further boosting the model's robustness and interpretability under complex environmental conditions. Extensive experiments conducted on four physically collected datasets demonstrate that WavePCNet consistently outperforms state-of-the-art methods across both accuracy and robustness.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu-Huan Wu, Zi-Xuan Zhu, Yan Wang, Liangli Zhen, Deng-Ping Fan
Computer Vision
ๆฅ็ๆ่ฆ
Referring Camouflaged Object Detection (Ref-COD) segments specified camouflaged objects in a scene by leveraging a small set of referring images. Though effective, current systems adopt a dual-branch design that requires reference images at test time, which limits deployability and adds latency and data-collection burden. We introduce a Ref-COD framework that distills references into a class-prototype memory during training and synthesizes a reference vector at inference via a query-conditioned mixture of prototypes. Concretely, we maintain an EMA-updated prototype per category and predict mixture weights from the query to produce a guidance vector without any test-time references. To bridge the representation gap between reference statistics and camouflaged query features, we propose a bidirectional attention alignment module that adapts both the query features and the class representation. Thus, our approach yields a simple, efficient path to Ref-COD without mandatory references. We evaluate the proposed method on the large-scale R2C7K benchmark. Extensive experiments demonstrate competitive or superior performance of the proposed method compared with recent state-of-the-arts. Code is available at https://github.com/yuhuan-wu/RefOnce.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong, Xucheng Yin
Computer Vision
ๆฅ็ๆ่ฆ
Style transfer, a pivotal task in image processing, synthesizes visually compelling images by seamlessly blending realistic content with artistic styles, enabling applications in photo editing and creative design. While mainstream training-free diffusion-based methods have greatly advanced style transfer in recent years, their reliance on computationally inversion processes compromises efficiency and introduces visual distortions when inversion is inaccurate. To address these limitations, we propose a novel \textit{inversion-free} style transfer framework based on dual rectified flows, which tackles the challenge of finding an unknown stylized distribution from two distinct inputs (content and style images), \textit{only with forward pass}. Our approach predicts content and style trajectories in parallel, then fuses them through a dynamic midpoint interpolation that integrates velocities from both paths while adapting to the evolving stylized image. By jointly modeling the content, style, and stylized distributions, our velocity field design achieves robust fusion and avoids the shortcomings of naive overlays. Attention injection further guides style integration, enhancing visual fidelity, content preservation, and computational efficiency. Extensive experiments demonstrate generalization across diverse styles and content, providing an effective and efficient pipeline for style transfer.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang et al.
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present \textbf{Matrix}, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves $2$--$15\times$ higher data generation throughput under identical hardware resources, without compromising output quality.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anna Marklovรก, Ondลej Vinลก, Martina Vokรกฤovรก, Jiลรญ Miliฤka
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models are increasingly capable of producing creative texts, yet most studies on AI-generated poetry focus on English -- a language that dominates training data. In this paper, we examine the perception of AI- and human-written Czech poetry. We ask if Czech native speakers are able to identify it and how they aesthetically judge it. Participants performed at chance level when guessing authorship (45.8\% correct on average), indicating that Czech AI-generated poems were largely indistinguishable from human-written ones. Aesthetic evaluations revealed a strong authorship bias: when participants believed a poem was AI-generated, they rated it as less favorably, even though AI poems were in fact rated equally or more favorably than human ones on average. The logistic regression model uncovered that the more the people liked a poem, the less probable was that they accurately assign the authorship. Familiarity with poetry or literary background had no effect on recognition accuracy. Our findings show that AI can convincingly produce poetry even in a morphologically complex, low-resource (with respect of the training data of AI models) Slavic language such as Czech. The results suggest that readers' beliefs about authorship and the aesthetic evaluation of the poem are interconnected.
๐
2025-11-26
๐ ArXiv Preprint (cs.SI)
๐ฅ Kay Liu, Yuwei Han, Haoyan Xu, Henry Peng Zou, Yue Zhao et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have recently revolutionized machine learning on text-attributed graphs, but the application of LLMs to graph outlier detection, particularly in the context of fake news detection, remains significantly underexplored. One of the key challenges is the scarcity of large-scale, realistic, and well-annotated datasets that can serve as reliable benchmarks for outlier detection. To bridge this gap, we introduce TAGFN, a large-scale, real-world text-attributed graph dataset for outlier detection, specifically fake news detection. TAGFN enables rigorous evaluation of both traditional and LLM-based graph outlier detection methods. Furthermore, it facilitates the development of misinformation detection capabilities in LLMs through fine-tuning. We anticipate that TAGFN will be a valuable resource for the community, fostering progress in robust graph-based outlier detection and trustworthy AI. The dataset is publicly available at https://huggingface.co/datasets/kayzliu/TAGFN and our code is available at https://github.com/kayzliu/tagfn.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dongyang Fan, Diba Hashemi, Sai Praneeth Karimireddy, Martin Jaggi
Natural Language Processing
ๆฅ็ๆ่ฆ
Incorporating metadata in Large Language Models (LLMs) pretraining has recently emerged as a promising approach to accelerate training. However prior work highlighted only one useful signal-URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find other types of metadata, such as fine-grained indicators of document quality that can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting an appropriate metadata as auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yixiu Zhao, Xiaozhi Wang, Zijun Yao, Lei Hou, Juanzi Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque. In this paper, we introduce a simple, lightweight, and broadly applicable method with a focus on isolating neurons that encode specific skills. Building upon prior work that identified "skill neurons" via soft prompt training on classification tasks, our approach extends the analysis to complex scenarios involving multiple skills. We correlate neuron activations with auxiliary metrics -- such as external labels and the model's own confidence score -- thereby uncovering interpretable and task-specific behaviors without the need for manual token aggregation. We empirically validate our method on tasks spanning open-ended text generation and natural language inference, demonstrating its ability to detect neurons that not only drive known skills but also reveal previously unidentified shortcuts in arithmetic reasoning on BigBench.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Minjoon Choi
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model's robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Husne Ara Rubaiyeat, Hasan Mahmud, Md Kamrul Hasan
ๆฅ็ๆ่ฆ
Bangla Sign Language Translation (BdSLT) has been severely constrained so far as the language itself is very low resource. Standard sentence level dataset creation for BdSLT is of immense importance for developing AI based assistive tools for deaf and hard of hearing people of Bangla speaking community. In this paper, we present a dataset, IsharaKhobor , and two subset of it for enabling research. We also present the challenges towards developing the dataset and present some way forward by benchmarking with landmark based raw and RQE embedding. We do some ablation on vocabulary restriction and canonicalization of the same within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lina Conti, Dennis Fucci, Marco Gaido, Matteo Negri, Guillaume Wisniewski et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker's vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yurui Zheng, Yijun Chen, Shaohong Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This paper proposes a bidirectional readability assessment mechanism that captures contextual information to identify regions with rich semantic information in the text, thereby predicting the readability level of individual sentences. These sentence-level labels are then used to assist in predicting the overall readability level of the document. Additionally, a pairwise sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtraction. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive performance and outperforms other baseline models.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Oฤuz Kaฤan Hitit, Leander Girrbach, Zeynep Akata
Natural Language Processing
ๆฅ็ๆ่ฆ
Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kaifeng Hong, Yinglong Zhang, Xiaoying Hong, Xuewen Xia, Xing Xu
Natural Language Processing
ๆฅ็ๆ่ฆ
Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs--limited by over-smoothing and hop-dependent diffusion--or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism.Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model's semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin's expressive power strictly contains that of both pure Transformers and GNNs.To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Frederico Wieser, Martin Benfeghoul, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas
Natural Language Processing
ๆฅ็ๆ่ฆ
The rigid, uniform allocation of computation in standard Transformer (TF) architectures can limit their efficiency and scalability, particularly for large-scale models and long sequences. Addressing this, we introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT), two distinct architectures that leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. SDT augments a decoder-only stack with alternating Decision and Dynamic layers: a Decision layer computes a full block 'posterior' and a lightweight 'prior,' while a Dynamic layer employs fixed-capacity Top-K routing based on Bayesian surprise (Expected and Unexpected Change), maintaining a static compute graph. STT extends this conditional computation to the temporal domain: a transition network predicts residual updates, forming a temporal 'change hypothesis' that informs a router to dynamically execute or bypass TF blocks for each token, managing KV-cache contributions. Both architectures exhibit the predicted shift from novelty to prediction driven gating over training, suggesting alignment with surprise based principles. While operating at reduced capacity, they offer preliminary insights into the compute-accuracy trade-offs of conditional computation. The proposed architectures establish a flexible framework for efficiency, reducing self-attention computation by 75% and KV-cache requirements by 50% within each compute skipping layer, setting a pathway for more efficient models.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhifeng Hao, Qibin Song, Ruichu Cai, Boyan Xu
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent divide-and-conquer reasoning approaches, particularly those based on Chain-of-Thought (CoT), have substantially improved the Text-to-SQL capabilities of Large Language Models (LLMs). However, when applied to complex enterprise databases, such methods struggle to maintain coherent reasoning due to limited context capacity, unreliable schema linking, and weak grounding in database semantics. To overcome these issues, we introduce DSR-SQL, a \textbf{D}ual-\textbf{S}tate \textbf{R}easoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. The first constructs a compact, semantically faithful environment by refining large schemas and selecting relevant structures, while the second formalizes SQL synthesis as feedback-guided state transitions, enabling the model to self-correct and align with user intent. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28\% execution accuracy on Spider 2.0-Snow and 68.32\% on BIRD development set. Our implementation will be open-sourced at: https://github.com/DMIRLAB-Group/DSR-SQL.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Antonรญn Jarolรญm, Martin Fajฤรญk, Lucia Makaiovรก
Natural Language Processing
ๆฅ็ๆ่ฆ
Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task -- fine-grained evidence extraction for Czech and Slovak claims. We create new dataset, containing two-way annotated fine-grained evidence created by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the {llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joshua Fonseca Rivera
Natural Language Processing
ๆฅ็ๆ่ฆ
Lindsey (2025) investigates introspective awareness in language models through four experiments, finding that models can sometimes detect and identify injected activation patterns -- but unreliably (~20% success in the best model). We focus on the first of these experiments -- self-report of injected "thoughts" -- and ask whether this capability can be directly trained rather than waiting for emergence. Through fine-tuning on transient single-token injections, we transform a 7B parameter model from near-complete failure (0.4% accuracy, 6.7% false positive rate) to reliable detection (85% accuracy on held-out concepts at ฮฑ=40, 0% false positives). Our model detects fleeting "thoughts" injected at a single token position, retains that information, and reports the semantic content across subsequent generation steps. On this task, our trained model satisfies three of Lindsey's criteria: accuracy (correct identification), grounding (0/60 false positives), and internality (detection precedes verbalization). Generalization to unseen concept vectors (7.5pp gap) demonstrates the model learns a transferable skill rather than memorizing specific vectors, though this does not establish metacognitive representation in Lindsey's sense. These results address an open question raised by Lindsey: whether "training for introspection would help eliminate cross-model differences." We show that at least one component of introspective behavior can be directly induced, offering a pathway to built-in AI transparency.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Jiayuan Zhang, Kaiquan Chen, Zhihao Lu, Enshen Zhou, Qian Yu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model (LLM)-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation -- risking the loss of critical information -- or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, where an LLM generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for LLMs to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
๐
2025-11-26
๐ Presented at the 2025 IEEE International Conferenc
๐ฅ Ariful Islam, Md Rifat Hossen, Abir Ahmed, B M Taslimul Haque
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Aspect-Based Sentiment Analysis (ABSA) has emerged as a critical tool for extracting fine-grained sentiment insights from user-generated content, particularly in e-commerce and social media domains. However, research on Bangla ABSA remains significantly underexplored due to the absence of comprehensive datasets and specialized frameworks for triplet extraction in this language. This paper introduces BanglaASTE, a novel framework for Aspect Sentiment Triplet Extraction (ASTE) that simultaneously identifies aspect terms, opinion expressions, and sentiment polarities from Bangla product reviews. Our contributions include: (1) creation of the first annotated Bangla ASTE dataset containing 3,345 product reviews collected from major e-commerce platforms including Daraz, Facebook, and Rokomari; (2) development of a hybrid classification framework that employs graph-based aspect-opinion matching with semantic similarity techniques; and (3) implementation of an ensemble model combining BanglaBERT contextual embeddings with XGBoost boosting algorithms for enhanced triplet extraction performance. Experimental results demonstrate that our ensemble approach achieves superior performance with 89.9% accuracy and 89.1% F1-score, significantly outperforming baseline models across all evaluation metrics. The framework effectively addresses key challenges in Bangla text processing including informal expressions, spelling variations, and data sparsity. This research advances the state-of-the-art in low-resource language sentiment analysis and provides a scalable solution for Bangla e-commerce analytics applications.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kai Kugler
Natural Language Processing
ๆฅ็ๆ่ฆ
We present the first systematic investigation of Martin's Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint 104, then degrades by checkpoint 105. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r $\approx$ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.
๐
2025-11-26
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Kirti Bhagat, Shaily Bhatt, Athul Velagapudi, Aditya Vashistha, Shachi Dave et al.
ๆฅ็ๆ่ฆ
Millions of users across the globe turn to AI chatbots for their creative needs, inviting widespread interest in understanding how such chatbots represent diverse cultures. At the same time, evaluating cultural representations in open-ended tasks remains challenging and underexplored. In this work, we present TALES, an evaluation of cultural misrepresentations in LLM-generated stories for diverse Indian cultural identities. First, we develop TALES-Tax, a taxonomy of cultural misrepresentations by collating insights from participants with lived experiences in India through focus groups (N=9) and individual surveys (N=15). Using TALES-Tax, we evaluate 6 models through a large-scale annotation study spanning 2,925 annotations from 108 annotators with lived cultural experience from across 71 regions in India and 14 languages. Concerningly, we find that 88\% of the generated stories contain one or more cultural inaccuracies, and such errors are more prevalent in mid- and low-resourced languages and stories based in peri-urban regions in India. Lastly, we transform the annotations into TALES-QA, a standalone question bank to evaluate the cultural knowledge of foundational models. Through this evaluation, we surprisingly discover that models often possess the requisite cultural knowledge despite generating stories rife with cultural misrepresentations.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Robert Belanec, Branislav Pecher, Ivan Srba, Maria Bielikova
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-efficient fine-tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the increased development in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 6 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Score Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Adisai Na-Thalang, Chanakan Wittayasakpan, Kritsadha Phatcharoen, Supakit Buakaw
ๆฅ็ๆ่ฆ
This paper introduces the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora that are primarily based on read or scripted speech, this dataset consists of natural speech, thereby capturing authentic linguistic phenomena such as colloquials, spontaneous prosody, disfluencies, and frequent code-switching with central Thai. A key challenge in building this resource lies in the lack of a standardized orthography for Isan. Current writing practices vary considerably, due to the different lexical tones between Thai and Isan. This variability complicates the design of transcription guidelines and poses questions regarding consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance the need for representational accuracy with the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Steven Wang, Kyle Hunt, Shaojie Tang, Kenneth Joseph
Natural Language Processing
ๆฅ็ๆ่ฆ
There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models' ability to enhance robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zheng Li, Yibing Song, Xin Zhang, Lei Luo, Xiang Li et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP generalizations. However, these anchors-static in both value and position-lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., "shape", "color"), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. Furthermore, in general we only have estimates of these values and it is not well known how to properly construct confidence intervals using only estimates. This work presents a simple plug-in framework that corrects such bias and constructs confidence intervals reflecting uncertainty from both test and calibration dataset, enabling practical and statistically sound LLM-based evaluation. Additionally, to reduce uncertainty in the accuracy estimate, we introduce an adaptive algorithm that efficiently allocates calibration sample sizes.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Manish Jain, Satheesh Kumar Ponnambalam, Salman Faroz, Chandrakanth Lns, Vinay Sharma
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ye Bhone Lin, Thura Aung, Ye Kyaw Thu, Thazin Myint Oo
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and 51.56 to 43.59 after augmentation) and improving chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bryan E. Tuck, Rakesh M. Verma
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography ("data", "poop", "loll": 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Thura Aung, Eaint Kay Khaing Kyaw, Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi
Natural Language Processing
ๆฅ็ๆ่ฆ
In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating Fourier-based FourierKAN, Spline-based EfficientKAN, and Grid-based FasterKAN-across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michael Iskandardinata, William Christian, Derwin Suhartono
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Detecting sarcasm remains a challenging task in the areas of Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, has made the task more difficult even for PLMs and LLMs. Beyond that, those models also exhibit unreliable detection of words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model's own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach with three datasets, such as Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on Semeval and by 4.08% on MUStARD. These findings highlight the importance of context in enhancing LLMs performance in sarcasm detection task, particularly the involvement of culturally specific slang, references, or unknown terms to the LLMs. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance. The experiment code is available at: https://github.com/wllchrst/sarcasm-detection_pmp_knowledge-base.
๐
2025-11-26
๐ ArXiv Preprint (stat.ME)
๐ฅ Vladimir Berman
ๆฅ็ๆ่ฆ
Zipf's law in language lacks a definitive origin, debated across fields. This study explains Zipf-like behavior using geometric mechanisms without linguistic elements. The Full Combinatorial Word Model (FCWM) forms words from a finite alphabet, generating a geometric distribution of word lengths. Interacting exponential forces yield a power-law rank-frequency curve, determined by alphabet size and blank symbol probability. Simulations support predictions, matching English, Russian, and mixed-genre data. The symbolic model suggests Zipf-type laws arise from geometric constraints, not communicative efficiency.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Quan Xiao, Tianyi Chen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Offline data selection and online self-refining generation, which enhance the data quality, are crucial steps in adapting large language models (LLMs) to specific downstream tasks. We tackle offline data selection and online self-refining generations through an optimization perspective. Specifically, bilevel data selection is used for offline data selection with respect to the validation dataset, and we treat online self-refining generation as a model adaptation step of selecting the model trained on current responses that best fits the validation data. Our framework offers a unified understanding of offline data selection and self-refining generation by assigning a learned data weight to each question and response, either explicitly or implicitly. For the first time, we theoretically demonstrate the effectiveness of the bilevel data selection framework and demonstrate its performance gains over unfiltered direct mixing baselines. By combining offline data with validation-weighted online generations, our method enhances fine-tuning performance. Experiments on quality enhancement and safety-aware LLM fine-tuning validate its effectiveness.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anantha Padmanaban Krishna Kumar
ๆฅ็ๆ่ฆ
Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under \emph{natural} demonstrations (with correct labels) and \emph{inverted} demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1--12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1--12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia et al.
Robotics
ๆฅ็ๆ่ฆ
As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall oriented tasks. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize in modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention. And (2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, GKA shows strong language understanding capabilites on short-context tasks outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than $10$% relative improvement over other fading memory baselines.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ioana Buhnila, Aman Sinha, Mathieu Constant
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have proven efficient in giving definition-type answers to user input queries. While for humans giving various types of answers, such as examples and paraphrases, is an easy task, LLMs struggle to provide correct answers for other than definition-type queries. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset consisting of 6170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or low frequency (tail) impacts the language model's performance. We evaluated the quality of the LLM's output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM's task performance for definition type questions is the highest, while for the exemplification type it is the lowest. Additionally, we showed that for definition-type questions, large language models are prone to paraphrase more on popular and frequent knowledge and less on tail and technical knowledge, especially in the expert texts.
๐
2025-11-26
๐ ArXiv Preprint (eess.AS)
๐ฅ Zhisheng Zheng, Xiaohang Sun, Tuan Dinh, Abhishek Yanamandra, Abhinav Jain et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English-relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
๐
2025-11-26
๐ ArXiv Preprint (eess.AS)
๐ฅ Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices.
In this paper, we explore techniques such as unsupervised segmentation, uniform average pooling, etc., to reduce the number of audio tokens generated by the LALM's audio encoder but before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation tasks, that are dependent on effectively uncovering the underlying lexical content of the input signal and study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance closer to frame-level LALMs while reducing the input audio token count upto three times before the LLM backbone.
๐
2025-11-26
๐ 2024 IEEE 27th International Conference on Intelli
๐ฅ Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.
๐
2025-11-26
๐ SIGMOD 2026
๐ฅ Reham Omar, Abdelghny Orogat, Ibrahim Abdelaziz, Omij Mangukiya, Panos Kalnis et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Conversational Question Answering over Knowledge Graphs (KGs) combines the factual grounding of KG-based QA with the interactive nature of dialogue systems. KGs are widely used in enterprise and domain applications to provide structured, evolving, and reliable knowledge. Large language models (LLMs) enable natural and context-aware conversations, but lack direct access to private and dynamic KGs. Retrieval-augmented generation (RAG) systems can retrieve graph content but often serialize structure, struggle with multi-turn context, and require heavy indexing. Traditional KGQA systems preserve structure but typically support only single-turn QA, incur high latency, and struggle with coreference and context tracking. To address these limitations, we propose Chatty-KG, a modular multi-agent system for conversational QA over KGs. Chatty-KG combines RAG-style retrieval with structured execution by generating SPARQL queries through task-specialized LLM agents. These agents collaborate for contextual interpretation, dialogue tracking, entity and relation linking, and efficient query planning, enabling accurate and low-latency translation of natural questions into executable queries. Experiments on large and diverse KGs show that Chatty-KG significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings, achieving higher F1 and P@1 scores. Its modular design preserves dialogue coherence and supports evolving KGs without fine-tuning or pre-processing. Evaluations with commercial (e.g., GPT-4o, Gemini-2.0) and open-weight (e.g., Phi-4, Gemma 3) LLMs confirm broad compatibility and stable performance. Overall, Chatty-KG unifies conversational flexibility with structured KG grounding, offering a scalable and extensible approach for reliable multi-turn KGQA.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition-affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai et al.
Robotics
ๆฅ็ๆ่ฆ
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D "trace-space" of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present \textbf{Matrix}, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves $2$--$15\times$ higher data generation throughput under identical hardware resources, without compromising output quality.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie et al.
Multimodal
ๆฅ็ๆ่ฆ
MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo -- solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge -- preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction--hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Sadegh Shirani, Mohsen Bayati
ๆฅ็ๆ่ฆ
Causal effect estimation in networked systems is central to data-driven decision making. In such settings, interventions on one unit can spill over to others, and in complex physical or social systems, the interaction pathways driving these interference structures remain largely unobserved. We argue that for identifying population-level causal effects, it is not necessary to recover the exact network structure; instead, it suffices to characterize how those interactions contribute to the evolution of outcomes. Building on this principle, we study an evolution-based approach that investigates how outcomes change across observation rounds in response to interventions, hence compensating for missing network information. Using an exposure-mapping perspective, we give an axiomatic characterization of when the empirical distribution of outcomes follows a low-dimensional recursive equation, and identify minimal structural conditions under which such evolution mappings exist. We frame this as a distributional counterpart to difference-in-differences. Rather than assuming parallel paths for individual units, it exploits parallel evolution patterns across treatment scenarios to estimate counterfactual trajectories. A key insight is that treatment randomization plays a role beyond eliminating latent confounding; it induces an implicit sampling from hidden interference channels, enabling consistent learning about heterogeneous spillover effects. We highlight causal message passing as an instantiation of this method in dense networks while extending to more general interference structures, including influencer networks where a small set of units drives most spillovers. Finally, we discuss the limits of this approach, showing that strong temporal trends or endogenous interference can undermine identification.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fengze Yu, Leshu Li, Brad McDanel, Saiqian Zhang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed speculative decoding framework that extends SD to multi-device deployments through coordinated draft-target execution. Given the lack of prior work on simulating this paradigm, we first introduce DSD-Sim, a discrete-event simulator that captures network, batching, and scheduling dynamics. Building on insights from DSD-Sim, we further design an Adaptive Window Control (AWC) policy that dynamically adjusts speculation window size to optimize throughput. Experiments across diverse workloads show that DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines, enabling agile and scalable LLM serving across edge and cloud.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shruti Bothe, Illyyne Saffar, Aurelie Boisbunon, Hasan Farooq, Julien Forgeat et al.
ๆฅ็ๆ่ฆ
The rise of AI in telecommunications, from optimizing Radio Access Networks to managing user experience, has sharply increased data volumes and training demands. Telecom data is often noisy, high-dimensional, costly to store, process, and label. Despite Ai's critical role, standard workflows still assume all training samples contribute equally. On the other hand, next generation systems require AI models that are accurate, efficient, and sustainable.The paper questions the assumptions of equal importance by focusing on applying and analyzing the roles of individual samples in telecom training and assessing whether the proposed model optimizes computation and energy use. we perform sample-level gradient analysis across epochs to identify patterns of influence and redundancy in model learning. Based on this, we propose a sample importance framework thats electively prioritizes impactful data and reduces computation without compromising accuracy. Experiments on three real-world telecom datasets show that our method [reserves performance while reducing data needs and computational overhead while advancing the goals of sustainable AI in telecommunications.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Locke Cai, Ivan Provilkov
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator): the policy learns to mimic expert answers, while the critic learns to compare and distinguish between policy and expert answers. Our method trains both the policy and the critic jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL on verifiable tasks. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held out test cases. In addition to testing many models using Inspect's basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI Using GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, respectively. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/EvilGenie.
๐
2025-11-26
๐ ACM MMSys 2025
๐ฅ Kirill Paramonov, Mete Ozay, Aristeidis Mystakidis, Nikolaos Tsalikidis, Dimitrios Sotos et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system's effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system's practicality in real-world scenarios.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Daniel R. Jiang, Jalaj Bhandari, Yukai Yang, Rรฉmi Munos, Tyler Lu
Natural Language Processing
ๆฅ็ๆ่ฆ
Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from sparse, long-horizon rewards and the discrepancy between response-level planning and token-level generation. In this technical note, we propose a formal reduction of the multi-turn RL problem into a sequence of single-turn RLHF-style problems. This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RL problem with standard token-level PPO is equivalent to a policy improvement step within the multi-turn problem. This insight naturally leads to Iterative PPO, a batch online policy iteration algorithm that alternates between fitting Q-functions from logged conversation trajectories and improving the policy. A major practical advantage is that Iterative PPO directly leverages stable, off-the-shelf single-turn RLHF tools, making it straightforward to implement. Our method occupies a middle ground between fully online and fully offline approaches, retaining the adaptability of online updates while gaining the stability benefits of offline training.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anantha Padmanaban Krishna Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mathew Vanherreweghe, Michael H. Freedman, Keith M. Adams
Machine Learning
ๆฅ็ๆ่ฆ
Recent work by Freedman and Mulligan demonstrated that shallow multilayer perceptrons spontaneously develop Kolmogorov-Arnold geometric (KAG) structure during training on synthetic three-dimensional tasks. However, it remained unclear whether this phenomenon persists in realistic high-dimensional settings and what spatial properties this geometry exhibits.
We extend KAG analysis to MNIST digit classification (784 dimensions) using 2-layer MLPs with systematic spatial analysis at multiple scales. We find that KAG emerges during training and appears consistently across spatial scales, from local 7-pixel neighborhoods to the full 28x28 image. This scale-agnostic property holds across different training procedures: both standard training and training with spatial augmentation produce the same qualitative pattern. These findings reveal that neural networks spontaneously develop organized, scale-invariant geometric structure during learning on realistic high-dimensional data.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hans Gundlach, Alex Fogelson, Jayson Lynch, Ana Trisovic, Jonathan Rosenfeld et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Algorithms have been estimated to increase AI training FLOP efficiency by a factor of 22,000 between 2012 and 2023 [Ho et al., 2024]. Running small-scale ablation experiments on key innovations from this time period, we are able to account for less than 10x of these gains. Surveying the broader literature, we estimate that additional innovations not included in our ablations account for less than 10x, yielding a total under 100x. This leads us to conduct scaling experiments, which reveal that much of this efficiency gap can be explained by algorithms with scale-dependent efficiency improvements. In particular, we conduct scaling experiments between LSTMs and Transformers, finding exponent differences in their compute-optimal scaling law while finding little scaling difference for many other innovations. These experiments demonstrate that - contrary to standard assumptions - an algorithm's efficiency gains are tied to compute scale. Using experimental extrapolation and literature estimates, we account for 6,930x efficiency gains over the same time period, with the scale-dependent LSTM-to-Transformer transition accounting for the majority of gains. Our results indicate that algorithmic progress for small models has been far slower than previously assumed, and that measures of algorithmic efficiency are strongly reference-dependent.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dongyang Fan, Diba Hashemi, Sai Praneeth Karimireddy, Martin Jaggi
Natural Language Processing
ๆฅ็ๆ่ฆ
Incorporating metadata in Large Language Models (LLMs) pretraining has recently emerged as a promising approach to accelerate training. However prior work highlighted only one useful signal-URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find other types of metadata, such as fine-grained indicators of document quality that can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting an appropriate metadata as auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
๐
2025-11-26
๐ ArXiv Preprint (cs.DB)
๐ฅ Zarin Tahia Hossain, Mostafa Milani
Machine Learning
ๆฅ็ๆ่ฆ
Handling missing data is a central challenge in data-driven analysis. Modern imputation methods not only aim for accurate reconstruction but also differ in how they represent and quantify uncertainty. Yet, the reliability and calibration of these uncertainty estimates remain poorly understood. This paper presents a systematic empirical study of uncertainty in imputation, comparing representative methods from three major families: statistical (MICE, SoftImpute), distribution alignment (OT-Impute), and deep generative (GAIN, MIWAE, TabCSDI). Experiments span multiple datasets, missingness mechanisms (MCAR, MAR, MNAR), and missingness rates. Uncertainty is estimated through three complementary routes: multi-run variability, conditional sampling, and predictive-distribution modeling, and evaluated using calibration curves and the Expected Calibration Error (ECE). Results show that accuracy and calibration are often misaligned: models with high reconstruction accuracy do not necessarily yield reliable uncertainty. We analyze method-specific trade-offs among accuracy, calibration, and runtime, identify stable configurations, and offer guidelines for selecting uncertainty-aware imputers in data cleaning and downstream machine learning pipelines.
๐
2025-11-26
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yizhou Zhao, Xiang Li, Peter Song, Qi Long, Weijie Su
ๆฅ็ๆ่ฆ
The rise of generative AI has enabled the production of high-fidelity synthetic tabular data across fields such as healthcare, finance, and public policy, raising growing concerns about data provenance and misuse. Watermarking offers a promising solution to address these concerns by ensuring the traceability of synthetic data, but existing methods face many limitations: they are computationally expensive due to reliance on large diffusion models, struggle with mixed discrete-continuous data, or lack robustness to post-modifications. To address them, we propose TAB-DRW, an efficient and robust post-editing watermarking scheme for generative tabular data. TAB-DRW embeds watermark signals in the frequency domain: it normalizes heterogeneous features via the Yeo-Johnson transformation and standardization, applies the discrete Fourier transform (DFT), and adjusts the imaginary parts of adaptively selected entries according to precomputed pseudorandom bits. To further enhance robustness and efficiency, we introduce a novel rank-based pseudorandom bit generation method that enables row-wise retrieval without incurring storage overhead. Experiments on five benchmark tabular datasets show that TAB-DRW achieves strong detectability and robustness against common post-processing attacks, while preserving high data fidelity and fully supporting mixed-type features.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alex Ning, Vainateya Rangaraju
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) achieve state-of-the-art results across many natural language tasks, but their internal mechanisms remain difficult to interpret. In this work, we extract, process, and visualize latent state geometries in Transformer-based language models through dimensionality reduction. We capture layerwise activations at multiple points within Transformer blocks and enable systematic analysis through Principal Component Analysis (PCA) and Uniform Manifold Approximation (UMAP). We demonstrate experiments on GPT-2 and LLaMa models, where we uncover interesting geometric patterns in latent space. Notably, we identify a clear separation between attention and MLP component outputs across intermediate layers, a pattern not documented in prior work to our knowledge. We also characterize the high norm of latent states at the initial sequence position and visualize the layerwise evolution of latent states. Additionally, we demonstrate the high-dimensional helical structure of GPT-2's positional embeddings, the sequence-wise geometric patterns in LLaMa, and experiment with repeating token sequences. We aim to support systematic analysis of Transformer internals with the goal of enabling further reproducible interpretability research. We make our code available at https://github.com/Vainateya/Feature_Geometry_Visualization.
๐
2025-11-26
๐ 16 pages, 11 figures, IEEEaccess journal
๐ฅ Muhammad Siddique, Sohaib Zafar
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Smart grids are a fusion of classical power infrastructure and advanced communication networks and smart control, to create a cyber-physical environment that is more efficient and flexible than ever before. This integration causes vulnerabilities that can undermine grid stability as well as reliability. Digital forensics is a fundamental concept of learning and identifying, detecting, and mitigating such security incidents. This paper presents an all-in-one machine learning-based digital forensic framework of smart grid systems deployed on the Cloud. The framework combines the data acquisition at the sensor-level, authenticated communication, scalable cloud storage and automated forensic analytics. The model uses supervised and unsupervised learning algorithms - such as Random Forest, Support Vector Machine, Gradient Boosted Trees and deep neural architectures for anomaly detection, event reconstruction and intrusion analysis in real time. After several simulation and experimental studies on real-time smart-meter data streams, the proposed framework is shown to be very accurate, scalable and resilient to cyber-attacks including data tampering, false-data injection and coordinated control-loop manipulation. The results indicate that cloud services are the best backbone for big-data-driven forensic workflows, which allows energy utilities to achieve a fast situational awareness and intelligent incident response.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alex Ning, Yen-Ling Kuo, Gabe Gomes
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Latent reasoning represents a new development in Transformer language models that has shown potential in compressing reasoning lengths compared to chain-of-thought reasoning. By directly passing the information-rich previous final latent state into the next sequence, latent reasoning removes the restriction to human language tokens as the medium for reasoning. We develop adaptive-length latent reasoning models and introduce a post-SFT reinforcement-learning methodology to optimize latent reasoning length by minimizing reasoning length while maintaining accuracy. This, in turn, further reduces compute usage and raises the bar on the compressive capabilities of latent reasoning models. Experiments on the Llama 3.2 1B model and the GSM8K-Aug dataset show a $52\%$ drop in total reasoning length with no penalty to accuracy. In future work, we plan to extend to additional models and datasets, analyze relationships between training coefficients, experiment with architecture variations, and continue our knowledge distillation for latent reasoning SFT efforts. We make our code and pretrained weights available at https://github.com/apning/adaptive-latent-reasoning.
๐
2025-11-26
๐ submitted to Journal of Cheminformatics
๐ฅ Ali Amirahmadi, Gรถkรงe Geylan, Leonardo De Maria, Farzaneh Etminani, Mattias Ohlsson et al.
Machine Learning
ๆฅ็ๆ่ฆ
Cyclic peptides are promising modalities for targeting intracellular sites; however, cell-membrane permeability remains a key bottleneck, exacerbated by limited public data and the need for well-calibrated uncertainty. Instead of relying on data-eager complex deep learning architecture, we propose a monomer-aware decoupled global alignment kernel (MD-GAK), which couples chemically meaningful residue-residue similarity with sequence alignment while decoupling local matches from gap penalties. MD-GAK is a relatively simple kernel. To further demonstrate the robustness of our framework, we also introduce a variant, PMD-GAK, which incorporates a triangular positional prior. As we will show in the experimental section, PMD-GAK can offer additional advantages over MD-GAK, particularly in reducing calibration errors. Since our focus is on uncertainty estimation, we use Gaussian Processes as the predictive model, as both MD-GAK and PMD-GAK can be directly applied within this framework. We demonstrate the effectiveness of our methods through an extensive set of experiments, comparing our fully reproducible approach against state-of-the-art models, and show that it outperforms them across all metrics.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wei-Chen Chang, Lu Dai, Ting Xu
Machine Learning
ๆฅ็ๆ่ฆ
This study proposes a risk prediction method based on a Multi-Scale Temporal Alignment Network (MSTAN) to address the challenges of temporal irregularity, sampling interval differences, and multi-scale dynamic dependencies in Electronic Health Records (EHR). The method focuses on temporal feature modeling by introducing a learnable temporal alignment mechanism and a multi-scale convolutional feature extraction structure to jointly model long-term trends and short-term fluctuations in EHR sequences. At the input level, the model maps multi-source clinical features into a unified high-dimensional semantic space and employs temporal embedding and alignment modules to dynamically weight irregularly sampled data, reducing the impact of temporal distribution differences on model performance. The multi-scale feature extraction module then captures key patterns across different temporal granularities through multi-layer convolution and hierarchical fusion, achieving a fine-grained representation of patient states. Finally, an attention-based aggregation mechanism integrates global temporal dependencies to generate individual-level risk representations for disease risk prediction and health status assessment. Experiments conducted on publicly available EHR datasets show that the proposed model outperforms mainstream baselines in accuracy, recall, precision, and F1-Score, demonstrating the effectiveness and robustness of multi-scale temporal alignment in complex medical time-series analysis. This study provides a new solution for intelligent representation of high-dimensional asynchronous medical sequences and offers important technical support for EHR-driven clinical risk prediction.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jack Geary, Boyan Gao, Henry Gouk
ๆฅ็ๆ่ฆ
We consider the problem of strategic classification, where the act of deploying a classifier leads to strategic behaviour that induces a distribution shift on subsequent observations. Current approaches to learning classifiers in strategic settings are focused primarily on the linear setting, but in many cases non-linear classifiers are more suitable. A central limitation to progress for non-linear classifiers arises from the inability to compute best responses in these settings. We present a novel method for computing the best response by optimising the Lagrangian dual of the Agents' objective. We demonstrate that our method reproduces best responses in linear settings, identifying key weaknesses in existing approaches. We present further results demonstrating our method can be straight-forwardly applied to non-linear classifier settings, where it is useful for both evaluation and training.
๐
2025-11-26
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Thai-Khanh Nguyen, Uyen Vo, Tan M. Nguyen, Thieu N. Vo, Trung-Hieu Le et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Human activity recognition (HAR) from inertial sensors is essential for ubiquitous computing, mobile health, and ambient intelligence. Conventional deep models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformers have advanced HAR but remain limited by vanishing or exloding gradients, high computational cost, and difficulty in capturing long-range dependencies. Structured state-space models (SSMs) like Mamba address these challenges with linear complexity and effective temporal modeling, yet they are restricted to first-order dynamics without stable longterm memory mechanisms. We introduce Momentum Mamba, a momentum-augmented SSM that incorporates second-order dynamics to improve stability of information flow across time steps, robustness, and long-sequence modeling. Two extensions further expand its capacity: Complex Momentum Mamba for frequency-selective memory scaling. Experiments on multiple HAR benchmarks demonstrate consistent gains over vanilla Mamba and Transformer baselines in accuracy, robustness, and convergence speed. With only moderate increases in training cost, momentum-augmented SSMs offer a favorable accuracy-efficiency balance, establishing them as a scalable paradigm for HAR and a promising principal framework for broader sequence modeling applications.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Martin Rabel, Jakob Runge
Robotics
ๆฅ็ๆ่ฆ
Real-world data, for example in climate applications, often consists of spatially gridded time series data or data with comparable structure. While the underlying system is often believed to behave similar at different points in space and time, those variations that do exist are twofold relevant: They often encode important information in and of themselves. And they may negatively affect the stability / convergence and reliability\Slash{}validity of results of algorithms assuming stationarity or space-translation invariance. We study the information encoded in changes of the causal graph, with stability in mind. An analysis of this general task identifies two core challenges. We develop guiding principles to overcome these challenges, and provide a framework realizing these principles by modifying constraint-based causal discovery approaches on the level of independence testing. This leads to an extremely modular, easily extensible and widely applicable framework. It can leverage existing constraint-based causal discovery methods (demonstrated on IID-algorithms PC, PC-stable, FCI and time series algorithms PCMCI, PCMCI+, LPCMCI) with little to no modification. The built-in modularity allows to systematically understand and improve upon an entire array of subproblems. By design, it can be extended by leveraging insights from change-point-detection, clustering, independence-testing and other well-studied related problems. The division into more accessible sub-problems also simplifies the understanding of fundamental limitations, hyperparameters controlling trade-offs and the statistical interpretation of results. An open-source implementation will be available soon.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jin Pin, Krasowski Hanna, Vanneaux Elena
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Obtaining safety guarantees for reinforcement learning is a major challenge to achieve applicability for real-world tasks. Safety shields extend standard reinforcement learning and achieve hard safety guarantees. However, existing safety shields commonly use random sampling of safe actions or a fixed fallback controller, therefore disregarding future performance implications of different safe actions. In this work, we propose a predictive safety shield for model-based reinforcement learning agents in discrete space. Our safety shield updates the Q-function locally based on safe predictions, which originate from a safe simulation of the environment model. This shielding approach improves performance while maintaining hard safety guarantees. Our experiments on gridworld environments demonstrate that even short prediction horizons can be sufficient to identify the optimal path. We observe that our approach is robust to distribution shifts, e.g., between simulation and reality, without requiring additional training.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Alexandra Carpentier, Christophe Giraud, Nicolas Verzelen
ๆฅ็ๆ่ฆ
A fundamental theoretical question in network analysis is to determine under which conditions community recovery is possible in polynomial time in the Stochastic Block Model (SBM). When the number $K$ of communities remains smaller than $\sqrt{n}$ --where $n$ denotes the number of nodes--, non-trivial community recovery is possible in polynomial time above, and only above, the Kesten--Stigum (KS) threshold, originally postulated using arguments from statistical physics.
When $K \geq \sqrt{n}$, Chin, Mossel, Sohn, and Wein recently proved that, in the \emph{sparse regime}, community recovery in polynomial time is achievable below the KS threshold by counting non-backtracking paths. This finding led them to postulate a new threshold for the many-communities regime $K \geq \sqrt{n}$. Subsequently, Carpentier, Giraud, and Verzelen established the failure of low-degree polynomials below this new threshold across all density regimes, and demonstrated successful recovery above the threshold in certain moderately sparse settings. While these results provide strong evidence that, in the many community setting, the computational barrier lies at the threshold proposed in~Chin et al., the question of achieving recovery above this threshold still remains open in most density regimes.
The present work is a follow-up to~Carpentier et al., in which we prove Conjecture~1.4 stated therein by: \\ 1- Constructing a family of motifs satisfying specific structural properties; and\\ 2- Proving that community recovery is possible above the proposed threshold by counting such motifs.\\ Our results complete the picture of the computational barrier for community recovery in the SBM with $K \geq \sqrt{n}$ communities. They also indicate that, in moderately sparse regimes, the optimal algorithms appear to be fundamentally different from spectral methods.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matฤซss Kalnฤre, Sofoklis Kitharidis, Thomas Bรคck, Niki van Stein
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting various Mechanistic Interpretability techniques; activation patching, attention saliency, and sparse autoencoders, from NLP to transformer architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wanli Zhong, Haibo Feng, Zirui Zhou, Hanyang Peng, Shiqi Yu
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax as the dominant bottleneck. This stage incurs a costly dequantize-softmax-requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present IntAttention, the first fully integer, plug-and-play attention pipeline without retraining. At the core of our approach lies IndexSoftmax, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. IntAttention integrates sparsity-aware clipping, a 32-entry lookup-table approximation, and direct integer normalization, thereby eliminating all datatype conversion overhead. We evaluate IntAttention and demonstrate consistent and substantial gains. Our method achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines and 2.0x faster than conventional INT8 attention pipelines on Armv8 CPUs. These gains are achieved with high-fidelity accuracy comparable to baselines across diverse language and vision models, enabling practical and efficient Transformer inference on commodity edge devices. Code will be released in later version of this work.
๐
2025-11-26
๐ AAAI
๐ฅ Qian Hong, Cheng Bian, Xiao Zhou, Xiaoyu Li, Yelei Li et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Translating non-invasive signals such as photoplethysmography (PPG) and ballistocardiography (BCG) into clinically meaningful signals like arterial blood pressure (ABP) is vital for continuous, low-cost healthcare monitoring. However, temporal misalignment in multimodal signal transformation impairs transformation accuracy, especially in capturing critical features like ABP peaks. Conventional synchronization methods often rely on strong similarity assumptions or manual tuning, while existing Learning with Noisy Labels (LNL) approaches are ineffective under time-shifted supervision, either discarding excessive data or failing to correct label shifts. To address this challenge, we propose ShiftSyncNet, a meta-learning-based bi-level optimization framework that automatically mitigates performance degradation due to time misalignment. It comprises a transformation network (TransNet) and a time-shift correction network (SyncNet), where SyncNet learns time offsets between training pairs and applies Fourier phase shifts to align supervision signals. Experiments on one real-world industrial dataset and two public datasets show that ShiftSyncNet outperforms strong baselines by 9.4%, 6.0%, and 12.8%, respectively. The results highlight its effectiveness in correcting time shifts, improving label quality, and enhancing transformation accuracy across diverse misalignment scenarios, pointing toward a unified direction for addressing temporal inconsistencies in multimodal physiological transformation.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Taehoon Kim, Donghwan Jang, Bohyung Han
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We present a novel training approach, named Merge-and-Bound (M&B) for Class Incremental Learning (CIL), which directly manipulates model weights in the parameter space for optimization. Our algorithm involves two types of weight merging: inter-task weight merging and intra-task weight merging. Inter-task weight merging unifies previous models by averaging the weights of models from all previous stages. On the other hand, intra-task weight merging facilitates the learning of current task by combining the model parameters within current stage. For reliable weight merging, we also propose a bounded update technique that aims to optimize the target model with minimal cumulative updates and preserve knowledge from previous tasks; this strategy reveals that it is possible to effectively obtain new models near old ones, reducing catastrophic forgetting. M&B is seamlessly integrated into existing CIL methods without modifying architecture components or revising learning objectives. We extensively evaluate our algorithm on standard CIL benchmarks and demonstrate superior performance compared to state-of-the-art methods.
๐
2025-11-26
๐ NeurIPS 2025
๐ฅ Fabian Paischer, Leo Cotteleer, Yann Dreze, Richard Kurle, Dylan Rubini et al.
Machine Learning
ๆฅ็ๆ่ฆ
The widespread use of neural surrogates in automotive aerodynamics, enabled by datasets such as DrivAerML and DrivAerNet++, has primarily focused on bluff-body flows with large wakes. Extending these methods to aerospace, particularly in the transonic regime, remains challenging due to the high level of non-linearity of compressible flows and 3D effects such as wingtip vortices. Existing aerospace datasets predominantly focus on 2D airfoils, neglecting these critical 3D phenomena. To address this gap, we present a new dataset of CFD simulations for 3D wings in the transonic regime. The dataset comprises volumetric and surface-level fields for around $30,000$ samples with unique geometry and inflow conditions. This allows computation of lift and drag coefficients, providing a foundation for data-driven aerodynamic optimization of the drag-lift Pareto front. We evaluate several state-of-the-art neural surrogates on our dataset, including Transolver and AB-UPT, focusing on their out-of-distribution (OOD) generalization over geometry and inflow variations. AB-UPT demonstrates strong performance for transonic flowfields and reproduces physically consistent drag-lift Pareto fronts even for unseen wing configurations. Our results demonstrate that AB-UPT can approximate drag-lift Pareto fronts for unseen geometries, highlighting its potential as an efficient and effective tool for rapid aerodynamic design exploration. To facilitate future research, we open-source our dataset at https://huggingface.co/datasets/EmmiAI/Emmi-Wing.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ William De Deyn, Michael Herty, Giovanni Samaey
Machine Learning
ๆฅ็ๆ่ฆ
We study two-layer neural networks and train these with a particle-based method called consensus-based optimization (CBO). We compare the performance of CBO against Adam on two test cases and demonstrate how a hybrid approach, combining CBO with Adam, provides faster convergence than CBO. In the context of multi-task learning, we recast CBO into a formulation that offers less memory overhead. The CBO method allows for a mean-field limit formulation, which we couple with the mean-field limit of the neural network. To this end, we first reformulate CBO within the optimal transport framework. Finally, in the limit of infinitely many particles, we define the corresponding dynamics on the Wasserstein-over-Wasserstein space and show that the variance decreases monotonically.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Enes Bektas, Fazli Can
ๆฅ็ๆ่ฆ
Ensemble learning improves classification performance by combining multiple base classifiers. While increasing the number of classifiers generally enhances accuracy, excessively large ensembles can lead to computational inefficiency and diminishing returns. This paper investigates the relationship between ensemble size and performance through the lens of linear independence among classifier votes in data streams. We propose that ensembles composed of linearly independent classifiers maximize representational capacity, particularly under a geometric model. We then generalize the importance of linear independence to the weighted majority voting problem. By modeling the probability of achieving linear independence among classifier outputs, we derive a theoretical framework that explains the trade-off between ensemble size and accuracy. Our analysis leads to a theoretical estimate of the ensemble size required to achieve a user-specified probability of linear independence. We validate our theory through experiments on both real-world and synthetic datasets using two ensemble methods, OzaBagging and GOOWE. Our results confirm that this theoretical estimate effectively identifies the point of performance saturation for robust ensembles like OzaBagging. Conversely, for complex weighting schemes like GOOWE, our framework reveals that high theoretical diversity can trigger algorithmic instability. Our implementation is publicly available to support reproducibility and future research.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Oฤuz Kaฤan Hitit, Leander Girrbach, Zeynep Akata
Natural Language Processing
ๆฅ็ๆ่ฆ
Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kaifeng Hong, Yinglong Zhang, Xiaoying Hong, Xuewen Xia, Xing Xu
Natural Language Processing
ๆฅ็ๆ่ฆ
Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs--limited by over-smoothing and hop-dependent diffusion--or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism.Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model's semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin's expressive power strictly contains that of both pure Transformers and GNNs.To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zachary Morrow, Michael Penwarden, Brian Chen, Aurya Javeed, Akil Narayan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep neural networks (DNNs) and Kolmogorov-Arnold networks (KANs) are popular methods for function approximation due to their flexibility and expressivity. However, they typically require a large number of trainable parameters to produce a suitable approximation. Beyond making the resulting network less transparent, overparameterization creates a large optimization space, likely producing local minima in training that have quite different generalization errors. In this case, network initialization can have an outsize impact on the model's out-of-sample accuracy. For these reasons, we propose shallow universal polynomial networks (SUPNs). These networks replace all but the last hidden layer with a single layer of polynomials with learnable coefficients, leveraging the strengths of DNNs and polynomials to achieve sufficient expressivity with far fewer parameters. We prove that SUPNs converge at the same rate as the best polynomial approximation of the same degree, and we derive explicit formulas for quasi-optimal SUPN parameters. We complement theory with an extensive suite of numerical experiments involving SUPNs, DNNs, KANs, and polynomial projection in one, two, and ten dimensions, consisting of over 13,000 trained models. On the target functions we numerically studied, for a given number of trainable parameters, the approximation error and variability are often lower for SUPNs than for DNNs and KANs by an order of magnitude. In our examples, SUPNs even outperform polynomial projection on non-smooth functions.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Frederico Wieser, Martin Benfeghoul, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas
Natural Language Processing
ๆฅ็ๆ่ฆ
The rigid, uniform allocation of computation in standard Transformer (TF) architectures can limit their efficiency and scalability, particularly for large-scale models and long sequences. Addressing this, we introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT), two distinct architectures that leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. SDT augments a decoder-only stack with alternating Decision and Dynamic layers: a Decision layer computes a full block 'posterior' and a lightweight 'prior,' while a Dynamic layer employs fixed-capacity Top-K routing based on Bayesian surprise (Expected and Unexpected Change), maintaining a static compute graph. STT extends this conditional computation to the temporal domain: a transition network predicts residual updates, forming a temporal 'change hypothesis' that informs a router to dynamically execute or bypass TF blocks for each token, managing KV-cache contributions. Both architectures exhibit the predicted shift from novelty to prediction driven gating over training, suggesting alignment with surprise based principles. While operating at reduced capacity, they offer preliminary insights into the compute-accuracy trade-offs of conditional computation. The proposed architectures establish a flexible framework for efficiency, reducing self-attention computation by 75% and KV-cache requirements by 50% within each compute skipping layer, setting a pathway for more efficient models.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.
๐
2025-11-26
๐ Presented at the 2025 IEEE International Conferenc
๐ฅ Ariful Islam, Md Rifat Hossen, Abir Ahmed, B M Taslimul Haque
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Aspect-Based Sentiment Analysis (ABSA) has emerged as a critical tool for extracting fine-grained sentiment insights from user-generated content, particularly in e-commerce and social media domains. However, research on Bangla ABSA remains significantly underexplored due to the absence of comprehensive datasets and specialized frameworks for triplet extraction in this language. This paper introduces BanglaASTE, a novel framework for Aspect Sentiment Triplet Extraction (ASTE) that simultaneously identifies aspect terms, opinion expressions, and sentiment polarities from Bangla product reviews. Our contributions include: (1) creation of the first annotated Bangla ASTE dataset containing 3,345 product reviews collected from major e-commerce platforms including Daraz, Facebook, and Rokomari; (2) development of a hybrid classification framework that employs graph-based aspect-opinion matching with semantic similarity techniques; and (3) implementation of an ensemble model combining BanglaBERT contextual embeddings with XGBoost boosting algorithms for enhanced triplet extraction performance. Experimental results demonstrate that our ensemble approach achieves superior performance with 89.9% accuracy and 89.1% F1-score, significantly outperforming baseline models across all evaluation metrics. The framework effectively addresses key challenges in Bangla text processing including informal expressions, spelling variations, and data sparsity. This research advances the state-of-the-art in low-resource language sentiment analysis and provides a scalable solution for Bangla e-commerce analytics applications.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jungi Lee, Jungkwon Kim, Chi Zhang, Kwangsun Yoo, Seok-Joo Byun
ๆฅ็ๆ่ฆ
Handling contaminated data poses a critical challenge in anomaly detection, as traditional models assume training on purely normal data. Conventional methods mitigate contamination by relying on fixed contamination ratios, but discrepancies between assumed and actual ratios can severely degrade performance, especially in noisy environments where normal and abnormal data distributions overlap. To address these limitations, we propose Adaptive and Aggressive Rejection (AAR), a novel method that dynamically excludes anomalies using a modified z-score and Gaussian mixture model-based thresholds. AAR effectively balances the trade-off between preserving normal data and excluding anomalies by integrating hard and soft rejection strategies. Extensive experiments on two image datasets and thirty tabular datasets demonstrate that AAR outperforms the state-of-the-art method by 0.041 AUROC. By providing a scalable and reliable solution, AAR enhances robustness against contaminated datasets, paving the way for broader real-world applications in domains such as security and healthcare.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ben Anson, Laurence Aitchison
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Stability of neural network weights is critical when training transformer models. The query and key weights are particularly problematic, as they tend to grow large without any intervention. Applying normalization to queries and keys, known as `QK norm', fixes stability issues in practice, but is not always applicable. For example, QK norm is not compatible with Multi Latent Attention (MLA) because QK norm requires full materialization of queries and keys during inference, which is not done in MLA. In this paper we suggest that controlling the changes to logits is important for stability. We show that these changes are controllable by assigning parameter-dependent learning rates to the query and key weights. We find that our cheap intervention allows us to increase the base learning rate of the network, outperform other methods in the MLA setting, and achieve performance competitive with QK norm when using Multi-head Attention.
๐
2025-11-26
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Tingkai Xue, Chin Chun Ooi, Zhengwei Ge, Fong Yew Leong, Hongying Li et al.
ๆฅ็ๆ่ฆ
Numerical simulations provide key insights into many physical, real-world problems. However, while these simulations are solved on a full 3D domain, most analysis only require a reduced set of metrics (e.g. plane-level concentrations). This work presents a hybrid physics-neural model that predicts scalar transport in a complex domain orders of magnitude faster than the 3D simulation (from hours to less than 1 min). This end-to-end differentiable framework jointly learns the physical model parameterization (i.e. orthotropic diffusivity) and a non-Markovian neural closure model to capture unresolved, 'coarse-grained' effects, thereby enabling stable, long time horizon rollouts. This proposed model is data-efficient (learning with 26 training data), and can be flexibly extended to an out-of-distribution scenario (with a moving source), achieving a Spearman correlation coefficient of 0.96 at the final simulation time. Overall results show that this differentiable physics-neural framework enables fast, accurate, and generalizable coarse-grained surrogates for physical phenomena.
๐
2025-11-26
๐ Presented at the 2025 IEEE International Conferenc
๐ฅ Ariful Islam, Md Rifat Hossen, Md. Mahmudul Arif, Abdullah Al Noman, Md Arifur Rahman
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Natural disasters remain a major challenge for Bangladesh, so real-time monitoring and quick response systems are essential. In this study, we present BanglaMM-Disaster, an end-to-end deep learning-based multimodal framework for disaster classification in Bangla, using both textual and visual data from social media. We constructed a new dataset of 5,037 Bangla social media posts, each consisting of a caption and a corresponding image, annotated into one of nine disaster-related categories. The proposed model integrates transformer-based text encoders, including BanglaBERT, mBERT, and XLM-RoBERTa, with CNN backbones such as ResNet50, DenseNet169, and MobileNetV2, to process the two modalities. Using early fusion, the best model achieves 83.76% accuracy. This surpasses the best text-only baseline by 3.84% and the image-only baseline by 16.91%. Our analysis also shows reduced misclassification across all classes, with noticeable improvements for ambiguous examples. This work fills a key gap in Bangla multimodal disaster analysis and demonstrates the benefits of combining multiple data types for real-time disaster response in low-resource settings.
๐
2025-11-26
๐ AAAI
๐ฅ Kevin Iselborn, David Dembinsky, Adriano Lucieri, Andreas Dengel
Machine Learning
ๆฅ็ๆ่ฆ
The utility of an explanation method critically depends on its fidelity to the underlying machine learning model. Especially in high-stakes medical settings, clinicians and regulators require explanations that faithfully reflect the model's decision process. Existing fidelity metrics such as Infidelity rely on Monte Carlo approximation, which demands numerous model evaluations and introduces uncertainty due to random sampling. This work proposes a novel metric for evaluating the fidelity of local feature attribution methods by modifying the existing Prediction Change (PC) metric within the Guided Perturbation Experiment. By incorporating the direction of both perturbation and attribution, the proposed Directed Prediction Change (DPC) metric achieves an almost tenfold speedup and eliminates randomness, resulting in a deterministic and trustworthy evaluation procedure that measures the same property as local Infidelity. DPC is evaluated on two datasets (skin lesion images and financial tabular data), two black-box models, seven explanation algorithms, and a wide range of hyperparameters. Across $4\,744$ distinct explanations, the results demonstrate that DPC, together with PC, enables a holistic and computationally efficient evaluation of both baseline-oriented and local feature attribution methods, while providing deterministic and reproducible outcomes.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bram Silue, Santiago Amaya-Corredor, Patrick Mannion, Lander Willem, Pieter Libin
Machine Learning
ๆฅ็ๆ่ฆ
Adversarial Inverse Reinforcement Learning (AIRL) has shown promise in addressing the sparse reward problem in reinforcement learning (RL) by inferring dense reward functions from expert demonstrations. However, its performance in highly complex, imperfect-information settings remains largely unexplored. To explore this gap, we evaluate AIRL in the context of Heads-Up Limit Hold'em (HULHE) poker, a domain characterized by sparse, delayed rewards and significant uncertainty. In this setting, we find that AIRL struggles to infer a sufficiently informative reward function. To overcome this limitation, we contribute Hybrid-AIRL (H-AIRL), an extension that enhances reward inference and policy learning by incorporating a supervised loss derived from expert data and a stochastic regularization mechanism. We evaluate H-AIRL on a carefully selected set of Gymnasium benchmarks and the HULHE poker setting. Additionally, we analyze the learned reward function through visualization to gain deeper insights into the learning process. Our experimental results show that H-AIRL achieves higher sample efficiency and more stable learning compared to AIRL. This highlights the benefits of incorporating supervised signals into inverse RL and establishes H-AIRL as a promising framework for tackling challenging, real-world settings.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Umberto Michelucci, Francesca Venturini
Machine Learning
ๆฅ็ๆ่ฆ
Machine learning (ML) is increasingly adopted in scientific research, yet the quality and reliability of results often depend on how experiments are designed and documented. Poor baselines, inconsistent preprocessing, or insufficient validation can lead to misleading conclusions about model performance. This paper presents a practical and structured guide for conducting ML experiments in scientific applications, focussing on reproducibility, fair comparison, and transparent reporting. We outline a step-by-step workflow, from dataset preparation to model selection and evaluation, and propose metrics that account for overfitting and instability across validation folds, including the Logarithmic Overfitting Ratio (LOR) and the Composite Overfitting Score (COS). Through recommended practices and example reporting formats, this work aims to support researchers in establishing robust baselines and drawing valid evidence-based insights from ML models applied to scientific problems.
๐
2025-11-26
๐ ArXiv Preprint (cs.SI)
๐ฅ Kazuki Nakajima, Yuya Sasaki, Takeaki Uno, Masaki Aida
ๆฅ็ๆ่ฆ
Higher-order networks, naturally described as hypergraphs, are essential for modeling real-world systems involving interactions among three or more entities. Stochastic block models offer a principled framework for characterizing mesoscale organization, yet their extension to hypergraphs involves a trade-off between expressive power and computational complexity. A recent simplification, a single-order model, mitigates this complexity by assuming a single affinity pattern governs interactions of all orders. This universal assumption, however, may overlook order-dependent structural details. Here, we propose a framework that relaxes this assumption by introducing a multi-order block structure, in which different affinity patterns govern distinct subsets of interaction orders. Our framework is based on a multi-order stochastic block model and searches for the optimal partition of the set of interaction orders that maximizes out-of-sample hyperlink prediction performance. Analyzing a diverse range of real-world networks, we find that multi-order block structures are prevalent. Accounting for them not only yields better predictive performance over the single-order model but also uncovers sharper, more interpretable mesoscale organization. Our findings reveal that order-dependent mechanisms are a key feature of the mesoscale organization of real-world higher-order networks.
๐
2025-11-26
๐ ArXiv Preprint (eess.SP)
๐ฅ Chin-Hung Chen, Ivana Nikoloska, Wim van Houtum, Yan Wu, Alex Alvarado
ๆฅ็ๆ่ฆ
This paper presents a fully blind phase-aware expectation-maximization (EM) algorithm for OFDM systems with the phase-shift keying (PSK) modulation. We address the well-known local maximum problem of the EM algorithm for blind channel estimation. This is primarily caused by the unknown phase ambiguity in the channel estimates, which conventional blind EM estimators cannot resolve. To overcome this limitation, we propose to exploit the extrinsic information from the decoder as model evidence metrics. A finite set of candidate models is generated based on the inherent symmetries of PSK modulation, and the decoder selects the most likely candidate model. Simulation results demonstrate that, when combined with a simple convolutional code, the phase-aware EM algorithm reliably resolves phase ambiguity during the initialization stage and reduces the local convergence rate from 80% to nearly 0% in frequency-selective channels with a constant phase ambiguity. The algorithm is invoked only once after the EM initialization stage, resulting in negligible additional complexity during subsequent turbo iterations.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Julianna Piskorz, Cristina Pinneri, Alvaro Correia, Motasem Alfarra, Risheek Garrepalli et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Masked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective and bidirectional attention mechanism, similarly to ARLMS, MDLMs exhibit a strong locality bias: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, we show that appending a large number of mask tokens--required for generation--can significantly degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, reducing the model's ability to process relevant information. To address this, we introduce a mask-agnostic loss function that encourages predictions to remain invariant to the number of appended masks. Fine-tuning with this objective substantially mitigates the distracting effect of masks, improving robustness of MDLMs. Overall, our findings reveal critical limitations of the current MDLM training paradigm and provide actionable insights for building diffusion-based language models with stronger context comprehension.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haksoo Lim, Jaehoon Lee, Sewon Park, Minjung Kim, Noseong Park
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Score-based generative models (SGMs) have demonstrated unparalleled sampling quality and diversity in numerous fields, such as image generation, voice synthesis, and tabular data synthesis, etc. Inspired by those outstanding results, we apply SGMs to synthesize time-series by learning its conditional score function. To this end, we present a conditional score network for time-series synthesis, deriving a denoising score matching loss tailored for our purposes. In particular, our presented denoising score matching loss is the conditional denoising score matching loss for time-series synthesis. In addition, our framework is such flexible that both regular and irregular time-series can be synthesized with minimal changes to our model design. Finally, we obtain exceptional synthesis performance on various time-series datasets, achieving state-of-the-art sampling diversity and quality.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Heiko Oppel, Andreas Spilz, Michael Munz
ๆฅ็ๆ่ฆ
Denoising Diffusion Probabilistic Models (DDPMs) can generate synthetic timeseries data to help improve the performance of a classifier, but their sampling process is computationally expensive. We address this by combining implicit diffusion models with a novel Sawtooth Sampler that accelerates the reverse process and can be applied to any pretrained diffusion model. Our approach achieves a 30 times speed-up over the standard baseline while also enhancing the quality of the generated sequences for classification tasks.
๐
2025-11-26
๐ ArXiv Preprint (math.DS)
๐ฅ Xiaohang Yu, William Knottenbelt
ๆฅ็ๆ่ฆ
We study the periodic behaviour of the dual logarithmic derivative operator $\mathcal{A}[f]=\mathrm{d}\ln f/\mathrm{d}\ln x$ in a complex analytic setting. We show that $\mathcal{A}$ admits genuinely nondegenerate period-$2$ orbits and identify a canonical explicit example. Motivated by this, we obtain a complete classification of all nondegenerate period-$2$ solutions, which are precisely the rational pairs $(c a x^{c}/(1-ax^{c}),\, c/(1-ax^{c}))$ with $ac\neq 0$. We further classify all fixed points of $\mathcal{A}$, showing that every solution of $\mathcal{A}[f]=f$ has the form $f(x)=1/(a-\ln x)$. As an illustration, logistic-type functions become pre-periodic under $\mathcal{A}$ after a logarithmic change of variables, entering the period-$2$ family in one iterate. These results give an explicit description of the low-period structure of $\mathcal{A}$ and provide a tractable example of operator-induced dynamics on function spaces.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sutirtha Biswas, Kshitij Kumar Yadav
Machine Learning
ๆฅ็ๆ่ฆ
Accurate and efficient seismic response prediction is essential for the design of resilient structures. While the Finite Element Method (FEM) remains the standard for nonlinear seismic analysis, its high computational demands limit its scalability and real time applicability. Recent developments in deep learning, particularly Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short Term Memory (LSTM) models, have shown promise in reducing the computational cost of nonlinear seismic analysis of structures. However, these data driven models often struggle to generalize and capture the underlying physics, leading to reduced reliability. We propose a novel Physics Informed U Net LSTM framework that integrates physical laws with deep learning to enhance both accuracy and efficiency. By embedding domain specific constraints into the learning process, the proposed model achieves improved predictive performance over conventional Machine Learning architectures. This hybrid approach bridges the gap between purely data driven methods and physics based modeling, offering a robust and computationally efficient alternative for seismic response prediction of structures.
๐
2025-11-26
๐ ArXiv Preprint (econ.EM)
๐ฅ Sullivan Huรฉ, Sรฉbastien Laurent, Ulrich Aiounou, Emmanuel Flachaire
ๆฅ็ๆ่ฆ
Post-Double-Lasso is becoming the most popular method for estimating linear regression models with many covariates when the purpose is to obtain an accurate estimate of a parameter of interest, such as an average treatment effect. However, this method can suffer from substantial omitted variable bias in finite sample. We propose a new method called Post-Double-Autometrics, which is based on Autometrics, and show that this method outperforms Post-Double-Lasso. Its use in a standard application of economic growth sheds new light on the hypothesis of convergence from poor to rich economies.
๐
2025-11-26
๐ ArXiv Preprint (eess.AS)
๐ฅ Jaime Garcia-Martinez, David Diaz-Guerra, John Anderson, Ricardo Falcon-Perez, Pablo Cabaรฑas-Molero et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper introduces The Spheres dataset, multitrack orchestral recordings designed to advance machine learning research in music source separation and related MIR tasks within the classical music domain. The dataset is composed of over one hour recordings of musical pieces performed by the Colibrรฌ Ensemble at The Spheres recording studio, capturing two canonical works - Tchaikovsky's Romeo and Juliet and Mozart's Symphony No. 40 - along with chromatic scales and solo excerpts for each instrument. The recording setup employed 23 microphones, including close spot, main, and ambient microphones, enabling the creation of realistic stereo mixes with controlled bleeding and providing isolated stems for supervised training of source separation models. In addition, room impulse responses were estimated for each instrument position, offering valuable acoustic characterization of the recording space. We present the dataset structure, acoustic analysis, and baseline evaluations using X-UMX based models for orchestral family separation and microphone debleeding. Results highlight both the potential and the challenges of source separation in complex orchestral scenarios, underscoring the dataset's value for benchmarking and for exploring new approaches to separation, localization, dereverberation, and immersive rendering of classical music.
๐
2025-11-26
๐ ArXiv Preprint (cs.AR)
๐ฅ Muhammed Yildirim, Ozcan Ozturk
Machine Learning
ๆฅ็ๆ่ฆ
The increasing demand for on-device intelligence in Edge AI and TinyML applications requires the efficient execution of modern Convolutional Neural Networks (CNNs). While lightweight architectures like MobileNetV2 employ Depthwise Separable Convolutions (DSC) to reduce computational complexity, their multi-stage design introduces a critical performance bottleneck inherent to layer-by-layer execution: the high energy and latency cost of transferring intermediate feature maps to either large on-chip buffers or off-chip DRAM. To address this memory wall, this paper introduces a novel hardware accelerator architecture that utilizes a fused pixel-wise dataflow. Implemented as a Custom Function Unit (CFU) for a RISC-V processor, our architecture eliminates the need for intermediate buffers entirely, reducing the data movement up to 87\% compared to conventional layer-by-layer execution. It computes a single output pixel to completion across all DSC stages-expansion, depthwise convolution, and projection-by streaming data through a tightly-coupled pipeline without writing to memory. Evaluated on a Xilinx Artix-7 FPGA, our design achieves a speedup of up to 59.3x over the baseline software execution on the RISC-V core. Furthermore, ASIC synthesis projects a compact 0.284 mm$^2$ footprint with 910 mW power at 2 GHz in 28 nm, and a 1.20 mm$^2$ footprint with 233 mW power at 300 MHz in 40 nm. This work confirms the feasibility of a zero-buffer dataflow within a TinyML resource envelope, offering a novel and effective strategy for overcoming the memory wall in edge AI accelerators.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Jasraj Singh, Shelvia Wongso, Jeremie Houssineau, Badr-Eddine Chรฉrief-Abdellatif
ๆฅ็ๆ่ฆ
Variational inference (VI) is a cornerstone of modern Bayesian learning, enabling approximate inference in complex models that would otherwise be intractable. However, its formulation depends on expectations and divergences defined through high-dimensional integrals, often rendering analytical treatment impossible and necessitating heavy reliance on approximate learning and inference techniques. Possibility theory, an imprecise probability framework, allows to directly model epistemic uncertainty instead of leveraging subjective probabilities. While this framework provides robustness and interpretability under sparse or imprecise information, adapting VI to the possibilistic setting requires rethinking core concepts such as entropy and divergence, which presuppose additivity. In this work, we develop a principled formulation of possibilistic variational inference and apply it to a special class of exponential-family functions, highlighting parallels with their probabilistic counterparts and revealing the distinctive mathematical structures of possibility theory.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Umang Agarwal, Rudraksh Sangore, Sumit Laddha
Machine Learning
ๆฅ็ๆ่ฆ
We present a comprehensive comparative study of three generative modeling paradigms: Denoising Diffusion Probabilistic Models (DDPM), Conditional Flow Matching (CFM), and MeanFlow. While DDPM and CFM require iterative sampling, MeanFlow enables direct one-step generation by modeling the average velocity over time intervals. We implement all three methods using a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10, demonstrating that CFM achieves an FID of 24.15 with 50 steps, significantly outperforming DDPM (FID 402.98). MeanFlow achieves FID 29.15 with single-step sampling -- a 50X reduction in inference time. We further extend CFM to image inpainting, implementing mask-guided sampling with four mask types (center, random bbox, irregular, half). Our fine-tuned inpainting model achieves substantial improvements: PSNR increases from 4.95 to 8.57 dB on center masks (+73%), and SSIM improves from 0.289 to 0.418 (+45%), demonstrating the effectiveness of inpainting-aware training.
๐
2025-11-26
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Yifan Sun, Zhi Li, Tetsuya Imamura, Yuji Ohishi, Chris Wolverton et al.
Machine Learning
ๆฅ็ๆ่ฆ
Thermoelectrics (TEs) are promising candidates for energy harvesting with performance quantified by figure of merit, $ZT$. To accelerate the discovery of high-$ZT$ materials, efforts have focused on identifying compounds with low thermal conductivity $ฮบ$. Using a curated dataset of 71,913 entries, we show that high-$ZT$ materials reside not only in the low-$ฮบ$ regime but also cluster near a lattice-to-total thermal conductivity ratio ($ฮบ_\mathrm{L}/ฮบ$) of approximately 0.5, consistent with the phonon-glass electron-crystal design concept. Building on this insight, we construct a framework consisting of two machine learning models for the lattice and electronic components of thermal conductivity that jointly provide both $ฮบ$ and $ฮบ_\mathrm{L}/ฮบ$ for screening and guiding the optimization of TE materials. Among 104,567 compounds screened, our models identify 2,522 ultralow-$ฮบ$ candidates. Follow-up case studies demonstrate that this framework can reliably provide optimization strategies by suggesting new dopants and alloys that shift pristine materials toward the $ฮบ_\mathrm{L}/ฮบ$ approaching 0.5 regime. Ultimately, by integrating rapid screening with PGEC-guided optimization, our data-driven framework effectively bridges the critical gap between materials discovery and performance enhancement.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rubรฉn Fernรกndez-Farelo, Jorge Paz-Ruza, Bertha Guijarro-Berdiรฑas, Amparo Alonso-Betanzos, Alex A. Freitas
ๆฅ็ๆ่ฆ
Gene prioritization (identifying genes potentially associated with a biological process) is increasingly tackled with Artificial Intelligence. However, existing methods struggle with the high dimensionality and incomplete labelling of biomedical data. This work proposes a more robust and efficient pipeline that leverages Fast-mRMR feature selection to retain only relevant, non-redundant features for classifiers. This enables us to build simpler and more effective models, as well as to combine different biological feature sets. Experiments on Dietary Restriction datasets show significant improvements over existing methods, proving that feature selection can be critical for reliable gene prioritization.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lucas Thil, Jesse Read, Rim Kaddah, Guillaume Doquet
ๆฅ็ๆ่ฆ
Accurate remaining useful life (RUL) prediction hinges on the quality of health indicators (HIs), yet existing methods often fail to disentangle complex degradation mechanisms in multi-sensor systems or quantify uncertainty in HI reliability. This paper introduces a novel framework for HI construction, advancing three key contributions. First, we adapt Reconstruction along Projected Pathways (RaPP) as a health indicator (HI) for RUL prediction for the first time, showing that it outperforms traditional reconstruction error metrics. Second, we show that augmenting RaPP-derived HIs with aleatoric and epistemic uncertainty quantification (UQ) via Monte Carlo dropout and probabilistic latent spaces- significantly improves RUL-prediction robustness. Third, and most critically, we propose indicator groups, a paradigm that isolates sensor subsets to model system-specific degradations, giving rise to our novel method, I-GLIDE which enables interpretable, mechanism-specific diagnostics. Evaluated on data sourced from aerospace and manufacturing systems, our approach achieves marked improvements in accuracy and generalizability compared to state-of-the-art HI methods while providing actionable insights into system failure pathways. This work bridges the gap between anomaly detection and prognostics, offering a principled framework for uncertainty-aware degradation modeling in complex systems.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dogukan Aksu, Jesus Martinez del Rincon, Ihsen Alouani
Machine Learning
ๆฅ็ๆ่ฆ
Spiking neural networks (SNNs) have emerged as prominent candidates for embedded and edge AI. Their inherent low power consumption makes them far more efficient than conventional ANNs in scenarios where energy budgets are tightly constrained. In parallel, federated learning (FL) has become the prevailing training paradigm in such settings, enabling on-device learning while limiting the exposure of raw data. However, gradient inversion attacks represent a critical privacy threat in FL, where sensitive training data can be reconstructed directly from shared gradients. While this vulnerability has been widely investigated in conventional ANNs, its implications for SNNs remain largely unexplored. In this work, we present the first comprehensive empirical study of gradient leakage in SNNs across diverse data domains. SNNs are inherently non-differentiable and are typically trained using surrogate gradients, which we hypothesized would be less correlated with the original input and thus less informative from a privacy perspective. To investigate this, we adapt different gradient leakage attacks to the spike domain. Our experiments reveal a striking contrast with conventional ANNs: whereas ANN gradients reliably expose salient input content, SNN gradients yield noisy, temporally inconsistent reconstructions that fail to recover meaningful spatial or temporal structure. These results indicate that the combination of event-driven dynamics and surrogate-gradient training substantially reduces gradient informativeness. To the best of our knowledge, this work provides the first systematic benchmark of gradient inversion attacks for spiking architectures, highlighting the inherent privacy-preserving potential of neuromorphic computation.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. Furthermore, in general we only have estimates of these values and it is not well known how to properly construct confidence intervals using only estimates. This work presents a simple plug-in framework that corrects such bias and constructs confidence intervals reflecting uncertainty from both test and calibration dataset, enabling practical and statistically sound LLM-based evaluation. Additionally, to reduce uncertainty in the accuracy estimate, we introduce an adaptive algorithm that efficiently allocates calibration sample sizes.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Mengran Li, Zelin Zang, Wenbin Xing, Junzhou Chen, Ronghui Zhang et al.
Multimodal
ๆฅ็ๆ่ฆ
Understanding how chemical perturbations propagate through biological systems is essential for robust molecular property prediction. While most existing methods focus on chemical structures alone, recent advances highlight the crucial role of cellular responses such as morphology and gene expression in shaping drug effects. However, current cell-aware approaches face two key limitations: (1) modality incompleteness in external biological data, and (2) insufficient modeling of hierarchical dependencies across molecular, cellular, and genomic levels. We propose CHMR (Cell-aware Hierarchical Multi-modal Representations), a robust framework that jointly models local-global dependencies between molecules and cellular responses and captures latent biological hierarchies via a novel tree-structured vector quantization module. Evaluated on nine public benchmarks spanning 728 tasks, CHMR outperforms state-of-the-art baselines, yielding average improvements of 3.6% on classification and 17.2% on regression tasks. These results demonstrate the advantage of hierarchy-aware, multimodal learning for reliable and biologically grounded molecular representations, offering a generalizable framework for integrative biomedical modeling. The code is in https://github.com/limengran98/CHMR.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pius Onobhayedo, Paul Osemudiame Oamen
Robotics
ๆฅ็ๆ่ฆ
Artificial intelligence is retracing the Internet's path from centralized provision to distributed creation. Initially, resource-intensive computation concentrates within institutions capable of training and serving large models.Eventually, as federated learning matures, billions of edge devices holding sensitive data will be able to collectively improve models without surrendering raw information, enabling both contribution and consumption at scale. This democratic vision remains unrealized due to certain compositional gaps; aggregators handle updates without accountability, economic mechanisms are lacking and even when present remain vulnerable to gaming, coordination serializes state modifications limiting scalability, and governance permits retroactive manipulation. This work addresses these gaps by leveraging cryptographic receipts to prove aggregation correctness, geometric novelty measurement to prevent incentive gaming, parallel object ownership to achieve linear scalability, and time-locked policies to check retroactive manipulation.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Lechen Feng, Haoran Li, Lucky Li, Xingqiu Zhao
Machine Learning
ๆฅ็ๆ่ฆ
This paper investigates the partial linear model by Least Absolute Deviation (LAD) regression. We parameterize the nonparametric term using Deep Neural Networks (DNNs) and formulate a penalized LAD problem for estimation. Specifically, our model exhibits the following challenges. First, the regularization term can be nonconvex and nonsmooth, necessitating the introduction of infinite dimensional variational analysis and nonsmooth analysis into the asymptotic normality discussion. Second, our network must expand (in width, sparsity level and depth) as more samples are observed, thereby introducing additional difficulties for theoretical analysis. Third, the oracle of the proposed estimator is itself defined through a ultra high-dimensional, nonconvex, and discontinuous optimization problem, which already entails substantial computational and theoretical challenges. Under such the challenges, we establish the consistency, convergence rate, and asymptotic normality of the estimator. Furthermore, we analyze the oracle problem itself and its continuous relaxation. We study the convergence of a proximal subgradient method for both formulations, highlighting their structural differences lead to distinct computational subproblems along the iterations. In particular, the relaxed formulation admits significantly cheaper proximal updates, reflecting an inherent trade-off between statistical accuracy and computational tractability.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mudi Jiang, Jiahui Zhou, Xinying Liu, Zengyou He, Zhikui Chen
ๆฅ็ๆ่ฆ
Fair clustering has gained increasing attention in recent years, especially in applications involving socially sensitive attributes. However, existing fair clustering methods often lack interpretability, limiting their applicability in high-stakes scenarios where understanding the rationale behind clustering decisions is essential. In this work, we address this limitation by proposing an interpretable and fair clustering framework, which integrates fairness constraints into the structure of decision trees. Our approach constructs interpretable decision trees that partition the data while ensuring fair treatment across protected groups. To further enhance the practicality of our framework, we also introduce a variant that requires no fairness hyperparameter tuning, achieved through post-pruning a tree constructed without fairness constraints. Extensive experiments on both real-world and synthetic datasets demonstrate that our method not only delivers competitive clustering performance and improved fairness, but also offers additional advantages such as interpretability and the ability to handle multiple sensitive attributes. These strengths enable our method to perform robustly under complex fairness constraints, opening new possibilities for equitable and transparent clustering.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tongkai Lu, Shuai Ma, Chongyang Tao
Machine Learning
ๆฅ็ๆ่ฆ
Mixed Integer Linear Programming (MILP) is a fundamental class of NP-hard problems that has garnered significant attention from both academia and industry. The Branch-and-Bound (B\&B) method is the dominant approach for solving MILPs and the branching plays an important role in B\&B methods. Neural-based learning frameworks have recently been developed to enhance branching policies and the efficiency of solving MILPs. However, these methods still struggle with semantic variation across depths, the scarcity of upstream nodes, and the costly collection of strong branching samples. To address these issues, we propose \ours, a Dynamic \underline{\textbf{S}}tratified \underline{\textbf{C}}ontrastive Training Framework for \underline{\textbf{MILP}} Branching. It groups branch-and-bound nodes based on their feature distributions and trains a GCNN-based discriminative model to progressively separate nodes across groups, learning finer-grained node representations throughout the tree. To address data scarcity and imbalance at upstream nodes, we introduce an upstream-augmented MILP derivation procedure that generates both theoretically equivalent and perturbed instances. \ours~effectively models subtle semantic differences between nodes, significantly enhancing branching accuracy and solving efficiency, particularly for upstream nodes. Extensive experiments on standard MILP benchmarks demonstrate that our method enhances branching accuracy, reduces solving time, and generalizes effectively to unseen instances.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Robert Joseph George, Carson Eisenach, Udaya Ghai, Dominique Perrault-Joncas, Anima Anandkumar et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved impressive results in code generation, yet struggle with program verification, especially in interactive proof frameworks such as Lean4. A central challenge is scalability: verified synthesis requires not just code, but also precise specifications and correctness proofs, and existing approaches rarely span all three domains. We present BRIDGE, the first systematic study of structured prompting for scalable verified program generation. BRIDGE decomposes verification into three interconnected domains: Code (executable implementations), Specifications (formal intent statements), and Proofs (constructive correctness arguments). Our key idea is to elicit distinct reasoning behaviors functional, specification-driven, and proof-oriented as intermediate representations that preserve semantic structure and connect these domains. Through systematic ablations, we show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods. For example, functional reasoning improves correctness of code in formal languages (Lean4) by nearly 1.5x (pass@5) over direct baselines. In inference-time compute, functional reasoning is also 2x more efficient, achieving higher pass rates with fewer generations and lower total sampling budgets. Similarly, we find that specification-driven prompting boosts Python coding pass rates by up to 17.5%. These findings suggest that structured domain alignment is a promising direction for advancing verified synthesis. BRIDGE establishes a foundation for training via expert iteration or RLVR, enabling models to internalize these reasoning strategies across code, specifications, and proofs.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hengyu Fu, Baihe Huang, Virginia Adams, Charles Wang, Venkat Srinivasan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Language Models (DLMs) have recently emerged as a strong alternative to autoregressive language models (LMs). DLMs offer comparable accuracy with faster inference speed via parallel decoding. However, standard DLM decoding strategies relying on high-confidence tokens encounter an inherent information-theoretic bottleneck that restricts decoding progress and ultimately slows generation. We demonstrate both theoretically and empirically that prioritizing high-confidence tokens is inherently inefficient. High-probability tokens carry negligible information and strictly relying on them limits the effective progress made in each decoding round. We prove that the number of decoding rounds must grow linearly with the sample's total information (negative log-likelihood) and inversely with the per-round information budget, establishing a bits-to-rounds principle. We also propose Explore-Then-Exploit (ETE), a training-free decoding strategy that maximizes information throughput and decoding efficiency. ETE combines cross-block decoding with targeted exploration of high-uncertainty tokens to reshape the conditional distribution and trigger cascades of confident predictions. Experiments verify our theoretical bounds and demonstrate that ETE consistently reduces the required number of decoding rounds compared to confidence-only baselines without compromising generation quality.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Manish Jain, Satheesh Kumar Ponnambalam, Salman Faroz, Chandrakanth Lns, Vinay Sharma
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Juhee Hong, Meng Liu, Shengzhi Wang, Xiaoheng Mao, Huihui Cheng et al.
Machine Learning
ๆฅ็ๆ่ฆ
Large-scale recommendations commonly adopt a multi-stage cascading ranking system paradigm to balance effectiveness and efficiency. Early Stage Ranking (ESR) systems utilize the "user-item decoupling" approach, where independently learned user and item representations are only combined at the final layer. While efficient, this design is limited in effectiveness, as it struggles to capture fine-grained user-item affinities and cross-signals. To address these, we propose the Generative Early Stage Ranking (GESR) paradigm, introducing the Mixture of Attention (MoA) module which leverages diverse attention mechanisms to bridge the effectiveness gap: the Hard Matching Attention (HMA) module encodes explicit cross-signals by computing raw match counts between user and item features; the Target-Aware Self Attention module generates target-aware user representations conditioned on the item, enabling more personalized learning; and the Cross Attention modules facilitate early and more enriched interactions between user-item features. MoA's specialized attention encodings are further refined in the final layer through a Multi-Logit Parameterized Gating (MLPG) module, which integrates the newly learned embeddings via gating and produces secondary logits that are fused with the primary logit. To address the efficiency and latency challenges, we have introduced a comprehensive suite of optimization techniques. These span from custom kernels that maximize the capabilities of the latest hardware to efficient serving solutions powered by caching mechanisms. The proposed GESR paradigm has shown substantial improvements in topline metrics, engagement, and consumption tasks, as validated by both offline and online experiments. To the best of our knowledge, this marks the first successful deployment of full target-aware attention sequence modeling within an ESR stage at such a scale.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Seunghun Baek, Jaejin Lee, Jaeyoon Sim, Minjae Jeong, Won Hwa Kim
ๆฅ็ๆ่ฆ
Various neuroimaging studies suffer from small sample size problem which often limit their reliability. Meta-analysis addresses this challenge by aggregating findings from different studies to identify consistent patterns of brain activity. However, traditional approaches based on keyword retrieval or linear mappings often overlook the rich hierarchical structure in the brain. In this work, we propose a novel framework that leverages hyperbolic geometry to bridge the gap between neuroscience literature and brain activation maps. By embedding text from research articles and corresponding brain images into a shared hyperbolic space via the Lorentz model, our method captures both semantic similarity and hierarchical organization inherent in neuroimaging data. In the hyperbolic space, our method performs multi-level neuroimaging meta-analysis (MNM) by 1) aligning brain and text embeddings for semantic correspondence, 2) guiding hierarchy between text and brain activations, and 3) preserving the hierarchical relationships within brain activation patterns. Experimental results demonstrate that our model outperforms baselines, offering a robust and interpretable paradigm of multi-level neuroimaging meta-analysis via hyperbolic brain-text representation.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ivan Novikov
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. While architecturally simple, this is computationally inefficient, since inference costs scale linearly with parameter count. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks, but these approaches typically rely on clustering, activation profiling, singular value decomposition, or custom routing that requires calibration data. This paper introduces MLPMoE (MLP Mixture-of-Experts), a training-free, deterministic transformation that restructures the dense MLP in transformer blocks into a static, high-cardinality mixture of experts. The transformation uses simple tensor slicing and summation, reinterpreting the algebra of tensor parallelism as a topological conversion rather than a distributed training pattern. We further introduce Fractal Fade (differential branch sparsity) and Compensated Pruning (variance-preserving branch reduction) as lightweight mechanisms for structured sparsity. On Qwen2.5-0.5B-Instruct and DeepSeek-R1-Distill-Llama-8B, the zero-shot MLPMoE transform changes a proxy perplexity metric by less than 0.05 percent while keeping the parameter count effectively constant. On the 8B model, differential sparsity removes about 20 percent of MLP parameters while keeping perplexity within about 2 percent of the dense baseline. The method operates entirely post hoc on existing checkpoints and does not require gradients, calibration sets, or router training. Code is available at https://gist.github.com/iwallarm/fc2ef1eddf226ca7814f9e5e2ae9bad1
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ye Bhone Lin, Thura Aung, Ye Kyaw Thu, Thazin Myint Oo
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and 51.56 to 43.59 after augmentation) and improving chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Thura Aung, Eaint Kay Khaing Kyaw, Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi
Natural Language Processing
ๆฅ็ๆ่ฆ
In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating Fourier-based FourierKAN, Spline-based EfficientKAN, and Grid-based FasterKAN-across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.
๐
2025-11-26
๐ Accepted by IEEE Big Data 2025
๐ฅ Yeswanth Ravichandran, Duoduo Liao, Charan Teja Kurakula
Machine Learning
ๆฅ็ๆ่ฆ
Subsurface defects such as delamination, voids, and honeycombing critically affect the durability of concrete bridge decks but are difficult to detect reliably using visual inspection or manual sounding. This paper presents a machine learning based Impact Echo (IE) framework that automates both defect localization and multi-class classification of common concrete defects. Raw IE signals from Federal Highway Administration (FHWA) laboratory slabs and in-service bridge decks are transformed via Fast Fourier Transform (FFT) into dominant peak-frequency features and interpolated into spatial maps for defect zone visualization. Unsupervised k-means clustering highlights low-frequency, defect-prone regions, while Ground Truth Masks (GTMs) derived from seeded lab defects are used to validate spatial accuracy and generate high-confidence training labels. From these validated regions, spatially ordered peak-frequency sequences are constructed and fed into a stacked Long Short-Term Memory (LSTM) network that classifies four defect types shallow delamination, deep delamination, voids, and honeycombing with 73% overall accuracy. Field validation on the bridge deck demonstrates that models trained on laboratory data generalize under realistic coupling, noise, and environmental variability. The proposed framework enhances the objectivity, scalability, and repeatability of Non-Destructive Evaluation (NDE), supporting intelligent, data-driven bridge health monitoring at a network scale.
๐
2025-11-26
๐ NeurIPS 2025
๐ฅ Aaditya L. Kachhadiya
ๆฅ็ๆ่ฆ
Inverse problems in the physical sciences are often ill-conditioned in input space, making progress step-size sensitive. We propose the Deceptron, a lightweight bidirectional module that learns a local inverse of a differentiable forward surrogate. Training combines a supervised fit, forward-reverse consistency, a lightweight spectral penalty, a soft bias tie, and a Jacobian Composition Penalty (JCP) that encourages $J_g(f(x))\,J_f(x)\!\approx\!I$ via JVP/VJP probes. At solve time, D-IPG (Deceptron Inverse-Preconditioned Gradient) takes a descent step in output space, pulls it back through $g$, and projects under the same backtracking and stopping rules as baselines. On Heat-1D initial-condition recovery and a Damped Oscillator inverse problem, D-IPG reaches a fixed normalized tolerance with $\sim$20$\times$ fewer iterations on Heat and $\sim$2-3$\times$ fewer on Oscillator than projected gradient, competitive in iterations and cost with Gauss-Newton. Diagnostics show JCP reduces a measured composition error and tracks iteration gains. We also preview a single-scale 2D instantiation, DeceptronNet (v0), that learns few-step corrections under a strict fairness protocol and exhibits notably fast convergence.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Effective post-training is essential to align Large Language Models (LLMs) with specialized biomedical knowledge to accelerate life science research. However, current approaches face significant limitations. First, biomedical reasoning involves intricate mechanisms often represented by sparse textual data. Standard Supervised Fine-Tuning (SFT) tends to overfit to surface-level instruction patterns without effectively internalizing this fragmented scientific knowledge. Second, Reinforcement Learning (RL) is impractical for this domain, as defining meaningful rewards often necessitates prohibitive experimental validation (e.g., wet-lab verification of drug responses), rendering real-time feedback unfeasible. We propose Balanced Fine-Tuning (BFT), an efficient post-training method designed to learn complex reasoning from sparse data without external reward signals. BFT operates through a two-layer weighting mechanism: 1. At the token level, it scales loss via prediction probabilities to stabilize gradients and prevent overfitting; 2. At the sample level, it uses "minimum group confidence" to adaptively enhance the learning of hard samples. Experiments demonstrate that BFT significantly outperforms SFT. In medical tasks, it enables LLMs to acquire knowledge that SFT misses. In biological tasks, BFT-based LLMs surpass GeneAgent (an accurate agent for biology analysis) in biological process reasoning. Moreover, the text embeddings generated by BFT can be directly applied to downstream tasks, such as gene interaction and single-cell perturbation response prediction. These results indicate that BFT facilitates broad applications of LLMs in biomedical research.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alireza Aghasi, Nicholas Marshall, Saeid Pourmand, Wyatt Whiting
Machine Learning
ๆฅ็ๆ่ฆ
We propose a novel randomized algorithm for constructing binary neural networks with tunable accuracy. This approach is motivated by hyperdimensional computing (HDC), which is a brain-inspired paradigm that leverages high-dimensional vector representations, offering efficient hardware implementation and robustness to model corruptions. Unlike traditional low-precision methods that use quantization, we consider binary embeddings of data as points in the hypercube equipped with the Hamming distance. We propose a novel family of floating-point neural networks, G-Nets, which are general enough to mimic standard network layers. Each floating-point G-Net has a randomized binary embedding, an embedded hyperdimensional (EHD) G-Net, that retains the accuracy of its floating-point counterparts, with theoretical guarantees, due to the concentration of measure. Empirically, our binary models match convolutional neural network accuracies and outperform prior HDC models by large margins, for example, we achieve almost 30\% higher accuracy on CIFAR-10 compared to prior HDC models. G-Nets are a theoretically justified bridge between neural networks and randomized binary neural networks, opening a new direction for constructing robust binary/quantized deep learning models. Our implementation is available at https://github.com/GNet2025/GNet.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Quan Xiao, Tianyi Chen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Offline data selection and online self-refining generation, which enhance the data quality, are crucial steps in adapting large language models (LLMs) to specific downstream tasks. We tackle offline data selection and online self-refining generations through an optimization perspective. Specifically, bilevel data selection is used for offline data selection with respect to the validation dataset, and we treat online self-refining generation as a model adaptation step of selecting the model trained on current responses that best fits the validation data. Our framework offers a unified understanding of offline data selection and self-refining generation by assigning a learned data weight to each question and response, either explicitly or implicitly. For the first time, we theoretically demonstrate the effectiveness of the bilevel data selection framework and demonstrate its performance gains over unfiltered direct mixing baselines. By combining offline data with validation-weighted online generations, our method enhances fine-tuning performance. Experiments on quality enhancement and safety-aware LLM fine-tuning validate its effectiveness.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiaming Guo, Rui Zhang, Zerun Li, Yunkai Gao, Shaohui Peng et al.
ๆฅ็ๆ่ฆ
Diffusion planning is a promising method for learning high-performance policies from offline data. To avoid the impact of discrepancies between planning and reality on performance, previous works generate new plans at each time step. However, this incurs significant computational overhead and leads to lower decision frequencies, and frequent plan switching may also affect performance. In contrast, humans might create detailed short-term plans and more general, sometimes vague, long-term plans, and adjust them over time. Inspired by this, we propose the Temporal Diffusion Planner (TDP) which improves decision efficiency by distributing the denoising steps across the time dimension. TDP begins by generating an initial plan that becomes progressively more vague over time. At each subsequent time step, rather than generating an entirely new plan, TDP updates the previous one with a small number of denoising steps. This reduces the average number of denoising steps, improving decision efficiency. Additionally, we introduce an automated replanning mechanism to prevent significant deviations between the plan and reality. Experiments on D4RL show that, compared to previous works that generate new plans every time step, TDP improves the decision-making frequency by 11-24.8 times while achieving higher or comparable performance.
๐
2025-11-26
๐ AAAI
๐ฅ Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury, Haotian An, Yawei Wang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Fine-tuning large language models (LLMs) for downstream tasks typically exhibit a fundamental safety-capability tradeoff, where improving task performance degrades safety alignment even on benign datasets. This degradation persists across standard approaches including supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety capability trade-off, and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.
๐
2025-11-26
๐ 17 pages, 11 figures, this article was submitted t
๐ฅ Jingtao Guo, Yuyi Mao, Ivan Wang-Hei Ho
Machine Learning
ๆฅ็ๆ่ฆ
Wi-Fi channel state information (CSI)-based sensing provides a non-invasive, device-free approach for tasks such as human activity recognition and crowd counting, but large-scale deployment is hindered by the need for extensive site-specific training data. Federated learning (FL) offers a way to avoid raw data sharing but is challenged by heterogeneous sensing data and device resources. This paper proposes FedAPA, a collaborative Wi-Fi CSI-based sensing algorithm that uses adaptive prototype aggregation (APA) strategy to assign similarity-based weights to peer prototypes, enabling adaptive client contributions and yielding a personalized global prototype for each client instead of a fixed-weight aggregation. During local training, we adopt a hybrid objective that combines classification learning with representation contrastive learning to align local and global knowledge. We provide a convergence analysis of FedAPA and evaluate it in a real-world distributed Wi-Fi crowd counting scenario with six environments and up to 20 people. The results show that our method outperform multiple baselines in terms of accuracy, F1 score, mean absolute error (MAE), and communication overhead, with FedAPA achieving at least a 9.65% increase in accuracy, a 9% gain in F1 score, a 0.29 reduction in MAE, and a 95.94% reduction in communication overhead.
๐
2025-11-26
๐ 8 Pages, 10 figures, 2 Tables, Accepted in Journal
๐ฅ Dinanath Padhya, Krishna Acharya, Bipul Kumar Dahal, Dinesh Baniya Kshatri
Machine Learning
ๆฅ็ๆ่ฆ
Automatic Modulation Classification (AMC) is a core technology for future wireless communication systems, enabling the identification of modulation schemes without prior knowledge. This capability is essential for applications in cognitive radio, spectrum monitoring, and intelligent communication networks. We propose an AMC system based on a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture, integrated with a Software Defined Radio (SDR) platform. The proposed architecture leverages CNNs for spatial feature extraction and LSTMs for capturing temporal dependencies, enabling efficient handling of complex, time-varying communication signals. The system's practical ability was demonstrated by identifying over-the-air (OTA) signals from a custom-built FM transmitter alongside other modulation schemes. The system was trained on a hybrid dataset combining the RadioML2018 dataset with a custom-generated dataset, featuring samples at Signal-to-Noise Ratios (SNRs) from 0 to 30dB. System performance was evaluated using accuracy, precision, recall, F1 score, and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The optimized model achieved 93.48% accuracy, 93.53% precision, 93.48% recall, and an F1 score of 93.45%. The AUC-ROC analysis confirmed the model's discriminative power, even in noisy conditions. This paper's experimental results validate the effectiveness of the hybrid CNN-LSTM architecture for AMC, suggesting its potential application in adaptive spectrum management and advanced cognitive radio systems.
๐
2025-11-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Anantha Padmanaban Krishna Kumar
ๆฅ็ๆ่ฆ
Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under \emph{natural} demonstrations (with correct labels) and \emph{inverted} demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1--12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1--12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shima Rafiei, Zahra Nabizadeh Shahr Babak, Shadrokh Samavi, Shahram Shirani
Machine Learning
ๆฅ็ๆ่ฆ
Holography offers significant potential for AR/VR applications, yet its adoption is limited by the high demands of data compression. Existing deep learning approaches generally lack rate adaptivity within a single network. We present RAVQ-HoloNet, a rate-adaptive vector quantization framework that achieves high-fidelity reconstructions at low and ultra-low bit rates, outperforming current state-of-the-art methods. In low bit, our method exceeds by -33.91% in BD-Rate and achieves a BD-PSNR of 1.02 dB from the best existing method demonstrated by the rate-distortion curve.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mahdi Saki, Justin Lipman
Natural Language Processing
ๆฅ็ๆ่ฆ
Dairy farmers should decide to keep or cull a cow based on an objective assessment of her likely performance in the herd. For this purpose, farmers need to identify more resilient cows, which can cope better with farm conditions and complete more lactations. This decision-making process is inherently complex, with significant environmental and economic implications. In this study, we develop an AI-driven model to predict cow longevity using historical multivariate time-series data recorded from birth. Leveraging advanced AI techniques, specifically Multi-Head Attention Transformers, we analysed approximately 780,000 records from 19,000 unique cows across 7 farms in Australia. The results demonstrate that our model achieves an overall determination coefficient of 83% in predicting herd life across the studied farms, highlighting its potential for practical application in dairy herd management.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuxuan Zhu, Cong Fu, Yabo Ni, Anxiang Zeng, Yuan Fang
ๆฅ็ๆ่ฆ
Temporal distribution shift (TDS) erodes the long-term accuracy of recommender systems, yet industrial practice still relies on periodic incremental training, which struggles to capture both stable and transient patterns. Existing approaches such as invariant learning and self-supervised learning offer partial solutions but often suffer from unstable temporal generalization, representation collapse, or inefficient data utilization. To address these limitations, we propose ELBO$_\text{TDS}$, a probabilistic framework that integrates seamlessly into industry-scale incremental learning pipelines. First, we identify key shifting factors through statistical analysis of real-world production data and design a simple yet effective data augmentation strategy that resamples these time-varying factors to extend the training support. Second, to harness the benefits of this extended distribution while preventing representation collapse, we model the temporal recommendation scenario using a causal graph and derive a self-supervised variational objective, ELBO$_\text{TDS}$, grounded in the causal structure. Extensive experiments supported by both theoretical and empirical analysis demonstrate that our method achieves superior temporal generalization, yielding a 2.33\% uplift in GMV per user and has been successfully deployed in Shopee Product Search. Code is available at https://github.com/FuCongResearchSquad/ELBO4TDS.
๐
2025-11-26
๐ 22 pages, 15 figures, Submitted to Journal of Envi
๐ฅ Taehoon Kang, Taeyong Kim
Machine Learning
ๆฅ็ๆ่ฆ
Climate change has intensified the frequency and severity of wildfires, making rapid and accurate prediction of fire spread essential for effective mitigation and response. Physics-based simulators such as FARSITE offer high-fidelity predictions but are computationally intensive, limiting their applicability in real-time decision-making, while existing deep learning models often yield overly smooth predictions that fail to capture the complex, nonlinear dynamics of wildfire propagation. This study proposes an autoregressive conditional generative adversarial network (CGAN) for probabilistic wildfire spread prediction. By formulating the prediction task as an autoregressive problem, the model learns sequential state transitions, ensuring long-term prediction stability. Experimental results demonstrate that the proposed CGAN-based model outperforms conventional deep learning models in both overall predictive accuracy and boundary delineation of fire perimeters. These results demonstrate that adversarial learning allows the model to capture the strong nonlinearity and uncertainty of wildfire spread, instead of simply fitting the pixel average. Furthermore, the autoregressive framework facilitates systematic temporal forecasting of wildfire evolution. The proposed CGAN-based autoregressive framework enhances both the accuracy and physical interpretability of wildfire spread prediction, offering a promising foundation for time-sensitive response and evacuation planning.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia et al.
Robotics
ๆฅ็ๆ่ฆ
As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall oriented tasks. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize in modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention. And (2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, GKA shows strong language understanding capabilites on short-context tasks outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than $10$% relative improvement over other fading memory baselines.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sid Bharthulwar, Stone Tao, Hao Su
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Massively parallel GPU simulation environments have accelerated reinforcement learning (RL) research by enabling fast data collection for on-policy RL algorithms like Proximal Policy Optimization (PPO). To maximize throughput, it is common to use short rollouts per policy update, increasing the update-to-data (UTD) ra- tio. However, we find that, in this setting, standard synchronous resets introduce harmful nonstationarity, skewing the learning signal and destabilizing training. We introduce staggered resets, a simple yet effective technique where environments are initialized and reset at varied points within the task horizon. This yields training batches with greater temporal diversity, reducing the nonstationarity induced by synchronized rollouts. We characterize dimensions along which RL environments can benefit significantly from staggered resets through illustrative toy environ- ments. We then apply this technique to challenging high-dimensional robotics environments, achieving significantly higher sample efficiency, faster wall-clock convergence, and stronger final performance. Finally, this technique scales better with more parallel environments compared to naive synchronized rollouts.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Md Tasnin Tanvir, Dr Santanu Kumar Dash, Ishan Shahnan, Nafis Fuad, Tanvir Rahman et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The challenge of separating AI-generated text from human-authored content is becoming more urgent as generative AI technologies like ChatGPT become more widely available. In this work, we address this issue by looking at both the detection of content that has been entirely generated by AI and the identification of human text that has been reworded by AI. In our work, a comprehensive methodology to detect AI- generated text using XLM-RoBERTa, a state-of-the-art multilingual transformer model. Our approach includes rigorous preprocessing, and feature extraction involving perplexity, semantic, and readability features. We fine-tuned the XLM-RoBERTa model on a balanced dataset of human and AI-generated texts and evaluated its performance. The model demonstrated high accuracy and robust performance across various text genres. Additionally, we conducted feature analysis to understand the model's decision-making process, revealing that perplexity and attention-based features are critical in differentiating between human and AI-generated texts. Our findings offer a valuable tool for maintaining academic integrity and contribute to the broader field of AI ethics by promoting transparency and accountability in AI systems. Future research directions include exploring other advanced models and expanding the dataset to enhance the model's generalizability.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Constantinos Daskalakis, Vardis Kandiros, Rui Yao
ๆฅ็ๆ่ฆ
We consider the problem of estimating Ising models over $n$ variables in Total Variation (TV) distance, given $l$ independent samples from the model. While the statistical complexity of the problem is well-understood [DMR20], identifying computationally and statistically efficient algorithms has been challenging. In particular, remarkable progress has occurred in several settings, such as when the underlying graph is a tree [DP21, BGPV21], when the entries of the interaction matrix follow a Gaussian distribution [GM24, CK24], or when the bulk of its eigenvalues lie in a small interval [AJK+24, KLV24], but no unified framework for polynomial-time estimation in TV exists so far. Our main contribution is a unified analysis of the Maximum Pseudo-Likelihood Estimator (MPLE) for two general classes of Ising models. The first class includes models that have bounded operator norm and satisfy the Modified Log-Sobolev Inequality (MLSI), a functional inequality that was introduced to study the convergence of the associated Glauber dynamics to stationarity. In the second class of models, the interaction matrix has bounded infinity norm (or bounded width), which is the most common assumption in the literature for structure learning of Ising models. We show how our general results for these classes yield polynomial-time algorithms and optimal or near-optimal sample complexity guarantees in a variety of settings. Our proofs employ a variety of tools from tensorization inequalities to measure decompositions and concentration bounds.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Jiaoyang Li, Jun Fang, Tianhao Gao, Xiaohui Zhang, Zhiyuan Liu et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Representation learning is fundamental to modern machine learning, powering applications such as text retrieval and multimodal understanding. However, learning robust and generalizable representations remains challenging. While prior work has demonstrated that active noise injection, a form of data augmentation, can enhance encoding performance, most existing methods rely on heuristic or static noise, overlooking the dynamic nature of feature distributions during training. In this work, we systematically study the role of noise in representation learning from both gradient-based and feature distribution perspectives, using InfoNCE loss as a representative example. Focusing on multimodal representation learning, we propose FANoise, a novel feature-adaptive noise injection strategy. By leveraging the dynamics of contrastive learning, FANoise effectively mitigates the negative impacts of noise while preserving its benefits. Under this theoretically grounded framework, comprehensive experiments demonstrate that FANoise consistently improves overall performance on multimodal tasks across various base VLM models.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shanwei Fan
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) offer strong high-level planning capabilities for reinforcement learning (RL) by decomposing tasks into subgoals. However, their practical utility is limited by poor planning-execution alignment, which reflects a critical gap between abstract plans and actionable, environment-compatible behaviors. This misalignment arises from two interrelated limitations: (1) LLMs often produce subgoals that are semantically plausible but infeasible or irrelevant in the target environment due to insufficient grounding in environment-specific knowledge, and (2) single-LLM planning conflates generation with self-verification, resulting in overconfident yet unreliable subgoals that frequently fail during execution. To address these challenges, we propose Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR), a framework that integrates an environment-specific subgoal graph and structured entity knowledge with a multi-LLM planning pipeline that explicitly separates generation, critique, and refinement to produce executable and verifiable subgoals. A subgoal tracker further monitors execution progress, provides auxiliary rewards, and adaptively updates the subgoal graph to maintain alignment between plans and actions. Experimental results on 22 diverse tasks in the open-world game "Crafter" demonstrate the effectiveness of our proposed method.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Akansha Kalra, Soumil Datta, Ethan Gilmore, Duc La, Guanhong Tao et al.
ๆฅ็ๆ่ฆ
Behavior Cloning (BC) is a popular framework for training sequential decision policies from expert demonstrations via supervised learning. As these policies are increasingly being deployed in the real world, their robustness and potential vulnerabilities are an important concern. In this work, we perform the first analysis of the efficacy of clean-label backdoor attacks on BC policies. Our backdoor attacks poison a dataset of demonstrations by injecting a visual trigger to create a spurious correlation that can be exploited at test time. We evaluate how policy vulnerability scales with the fraction of poisoned data, the strength of the trigger, and the trigger type. We also introduce a novel entropy-based test-time trigger attack that substantially degrades policy performance by identifying critical states where test-time triggering of the backdoor is expected to be most effective at degrading performance. We empirically demonstrate that BC policies trained on even minimally poisoned datasets exhibit deceptively high, near-baseline task performance despite being highly vulnerable to backdoor trigger attacks during deployment. Our results underscore the urgent need for more research into the robustness of BC policies, particularly as large-scale datasets are increasingly used to train policies for real-world cyber-physical systems. Videos and code are available at https://sites.google.com/view/dataset-poisoning-in-bc.
๐
2025-11-26
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiwen Zheng, Yiwei Ouyang, Zhao Huang, Tao Zhang, Xiaoshuai Zhang et al.
Computer Vision
ๆฅ็ๆ่ฆ
Accurately localizing and segmenting obscured objects from faint light patterns beyond the field of view is highly challenging due to multiple scattering and medium-induced perturbations. Most existing methods, based on real-valued modeling or local convolutional operations, are inadequate for capturing the underlying physics of coherent light propagation. Moreover, under low signal-to-noise conditions, these methods often converge to non-physical solutions, severely compromising the stability and reliability of the observation. To address these challenges, we propose a novel physics-driven Wavefront Propagating Compensation Network (WavePCNet) to simulate wavefront propagation and enhance the perception of obscured objects. This WavePCNet integrates the Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) to incorporate complex amplitude transfer operators to precisely constrain coherent propagation behavior, along with a momentum memory mechanism to effectively suppress the accumulation of perturbations. Additionally, a High-frequency Cross-layer Compensation Enhancement is introduced to construct frequency-selective pathways with multi-scale receptive fields and dynamically model structural consistency across layers, further boosting the model's robustness and interpretability under complex environmental conditions. Extensive experiments conducted on four physically collected datasets demonstrate that WavePCNet consistently outperforms state-of-the-art methods across both accuracy and robustness.
๐
2025-11-26
๐ ArXiv Preprint (math.CO)
๐ฅ Davis Brown, Jesse He, Helen Jenne, Henry Kvinge, Max Vargas
Natural Language Processing
ๆฅ็ๆ่ฆ
Evolutionary program synthesis systems such as AlphaEvolve, OpenEvolve, and ShinkaEvolve offer a new approach to AI-assisted mathematical discovery. These systems utilize teams of large language models (LLMs) to generate candidate solutions to a problem as human readable code. These candidate solutions are then 'evolved' with the goal of improving them beyond what an LLM can produce in a single shot. While existing mathematical applications have mostly focused on problems of establishing bounds (e.g., sphere packing), the program synthesis approach is well suited to any problem where the solution takes the form of an explicit construction. With this in mind, in this paper we explore the use of OpenEvolve for combinatorial bijection discovery. We describe the results of applying OpenEvolve to three bijection construction problems involving Dyck paths, two of which are known and one of which is open. We find that while systems like OpenEvolve show promise as a valuable tool for combinatorialists, the problem of finding novel, research-level bijections remains a challenging task for current frontier systems, reinforcing the need for human mathematicians in the loop. We describe some lessons learned for others in the field interested in exploring the use of these systems.
๐
2025-11-26
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Guoxiao Zhang, Tan Qu, Ao Li, DongLin Ni, Qianlong Xie et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Reranking improves recommendation quality by modeling item interactions. However, existing methods often decouple ranking and reranking, leading to weak listwise evaluation models that suffer from combinatorial sparsity and limited representational power under strict latency constraints. In this paper, we propose RIA (Ranking-Infused Architecture), a unified, end-to-end framework that seamlessly integrates pointwise and listwise evaluation. RIA introduces four key components: (1) the User and Candidate DualTransformer (UCDT) for fine-grained user-item-context modeling; (2) the Context-aware User History and Target (CUHT) module for position-sensitive preference learning; (3) the Listwise Multi-HSTU (LMH) module to capture hierarchical item dependencies; and (4) the Embedding Cache (EC) module to bridge efficiency and effectiveness during inference. By sharing representations across ranking and reranking, RIA enables rich contextual knowledge transfer while maintaining low latency. Extensive experiments show that RIA outperforms state-of-the-art models on both public and industrial datasets, achieving significant gains in AUC and LogLoss. Deployed in Meituan advertising system, RIA yields a +1.69% improvement in Click-Through Rate (CTR) and a +4.54% increase in Cost Per Mille (CPM) in online A/B tests.
๐
2025-11-26
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Guoxiao Zhang, Ao Li, Tan Qu, Qianlong Xie, Xingxing Wang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Online platforms usually suffer from user experience degradation due to near-duplicate items with similar visuals and text. While Multimodal Large Language Models (MLLMs) enable multimodal embedding, existing methods treat representations as black boxes, ignoring structural relationships (e.g., primary vs. auxiliary elements), leading to local structural collapse problem. To address this, inspired by Feature Integration Theory (FIT), we propose FITRep, the first attention-guided, white-box item representation framework for fine-grained item deduplication. FITRep consists of: (1) Concept Hierarchical Information Extraction (CHIE), using MLLMs to extract hierarchical semantic concepts; (2) Structure-Preserving Dimensionality Reduction (SPDR), an adaptive UMAP-based method for efficient information compression; and (3) FAISS-Based Clustering (FBC), a FAISS-based clustering that assigns each item a unique cluster id using FAISS. Deployed on Meituan's advertising system, FITRep achieves +3.60% CTR and +4.25% CPM gains in online A/B tests, demonstrating both effectiveness and real-world impact.
๐
2025-11-26
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Anup Roy, Rishabh Gyanendra Upadhyay, Animesh Rameshbhai Panara, Robin Mills
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Document centric RAG pipelines usually begin with OCR, followed by brittle heuristics for chunking, table parsing, and layout reconstruction. These text first workflows are costly to maintain, sensitive to small layout shifts, and often lose the spatial cues that contain the answer. Vision first retrieval has emerged as a strong alternative. By operating directly on page images, systems like ColPali and ColQwen preserve structure and reduce pipeline complexity while achieving strong benchmark performance. However, these late interaction models tie retrieval to a specific vision backbone and require storing hundreds of patch embeddings per page, creating high memory overhead and complicating large scale deployment.
We introduce VisionRAG, a multimodal retrieval system that is OCR free and model agnostic. VisionRAG indexes documents directly as images, preserving layout, tables, and spatial cues, and builds semantic vectors without committing to a specific extraction. Our three pass pyramid indexing framework creates vectors using global page summaries, section headers, visual hotspots, and fact level cues. These summaries act as lightweight retrieval surrogates. At query time, VisionRAG retrieves the most relevant pages using the pyramid index, then forwards the raw page image encoded as base64 to a multimodal LLM for final question answering. During retrieval, reciprocal rank fusion integrates signals across the pyramid to produce robust ranking.
VisionRAG stores only 17 to 27 vectors per page, matching the efficiency of patch based methods while staying flexible across multimodal encoders. On financial document benchmarks, it achieves 0.8051 accuracy at 10 on FinanceBench and 0.9629 recall at 100 on TAT DQA. These results show that OCR free, summary guided multimodal retrieval is a practical and scalable alternative to traditional text extraction pipelines.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates significant potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing RLVR methods are often constrained by issues such as coarse-grained rewards, reward noise, and inefficient exploration, which lead to unstable training and entropy collapse. To address this challenge, we propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). The intuition behind it lies in the fact that the probabilities of an LLM generating different responses can inherently and directly reflect its self-assessment of the reasoning process. Inspired by the idea of preference modeling, ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt, and integrates this score with verifiable rewards to guide the exploration process. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors, enhances the relative superiority of undervalued high-quality responses, and prevents the model from overfitting to specific strategies, thereby facilitating more thorough exploration. Comprehensive experiments across four general-domain benchmarks and three mathematical benchmarks demonstrate that ICPO steadily boosts reasoning compared to GRPO.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai et al.
Robotics
ๆฅ็ๆ่ฆ
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D "trace-space" of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Lorenzo Shaikewitz, Charis Georgiou, Luca Carlone
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Quantifying the uncertainty of an object's pose estimate is essential for robust control and planning. Although pose estimation is a well-studied robotics problem, attaching statistically rigorous uncertainty is not well understood without strict distributional assumptions. We develop distribution-free pose uncertainty bounds about a given pose estimate in the monocular setting. Our pose uncertainty only requires high probability noise bounds on pixel detections of 2D semantic keypoints on a known object. This noise model induces an implicit, non-convex set of pose uncertainty constraints. Our key contribution is SLUE (S-Lemma Uncertainty Estimation), a convex program to reduce this set to a single ellipsoidal uncertainty bound that is guaranteed to contain the true object pose with high probability. SLUE solves a relaxation of the minimum volume bounding ellipsoid problem inspired by the celebrated S-lemma. It requires no initial guess of the bound's shape or size and is guaranteed to contain the true object pose with high probability. For tighter uncertainty bounds at the same confidence, we extend SLUE to a sum-of-squares relaxation hierarchy which is guaranteed to converge to the minimum volume ellipsoidal uncertainty bound for a given set of keypoint constraints. We show this pose uncertainty bound can easily be projected to independent translation and axis-angle orientation bounds. We evaluate SLUE on two pose estimation datasets and a real-world drone tracking scenario. Compared to prior work, SLUE generates substantially smaller translation bounds and competitive orientation bounds. We release code at https://github.com/MIT-SPARK/PoseUncertaintySets.
๐
2025-11-26
๐ NeurIPS 2025
๐ฅ Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, Ding Zhao
Robotics
ๆฅ็ๆ่ฆ
End-to-end (E2E) autonomous driving models have demonstrated strong performance in open-loop evaluations but often suffer from cascading errors and poor generalization in closed-loop settings. To address this gap, we propose Model-based Policy Adaptation (MPA), a general framework that enhances the robustness and safety of pretrained E2E driving agents during deployment. MPA first generates diverse counterfactual trajectories using a geometry-consistent simulation engine, exposing the agent to scenarios beyond the original dataset. Based on this generated data, MPA trains a diffusion-based policy adapter to refine the base policy's predictions and a multi-step Q value model to evaluate long-term outcomes. At inference time, the adapter proposes multiple trajectory candidates, and the Q value model selects the one with the highest expected utility. Experiments on the nuScenes benchmark using a photorealistic closed-loop simulator demonstrate that MPA significantly improves performance across in-domain, out-of-domain, and safety-critical scenarios. We further investigate how the scale of counterfactual data and inference-time guidance strategies affect overall effectiveness.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Hui Zhou, Siyuan Huang, Minxing Li, Hao Zhang, Lue Fan et al.
Robotics
ๆฅ็ๆ่ฆ
Vision Language Action models have significantly advanced general purpose robotic manipulation by harnessing large scale pretrained vision and language representations. Among existing approaches, a majority of current VLA systems employ parallel two finger grippers as their default end effectors. However, such grippers face inherent limitations in handling certain real world tasks such as wiping glass surfaces or opening drawers without handles due to insufficient contact area or lack of adhesion. To overcome these challenges, we present a low cost, integrated hardware design that combines a mechanical two finger gripper with a vacuum suction unit, enabling dual mode manipulation within a single end effector. Our system supports flexible switching or synergistic use of both modalities, expanding the range of feasible tasks. We validate the efficiency and practicality of our design within two state of the art VLA frameworks: DexVLA and Pi0. Experimental results demonstrate that with the proposed hybrid end effector, robots can successfully perform multiple complex tasks that are infeasible for conventional two finger grippers alone. All hardware designs and controlling systems will be released.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Zhihao Zhan, Jiaying Zhou, Likui Zhang, Qinhan Lv, Hao Liu et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. Yet existing VLA models still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. Compared with continuous diffusion policies, E0 offers two key advantages: (1) discrete action tokens align naturally with the symbolic structure of pretrained VLM/VLA backbones, enabling stronger semantic conditioning; and 2. discrete diffusion matches the true quantized nature of real-world robot control-whose hardware constraints (e.g., encoder resolution, control frequency, actuation latency) inherently discretize continuous signals-and therefore benefits from a Bayes-optimal denoiser that models the correct discrete action distribution, leading to stronger generalization. Compared with discrete autoregressive and mask-based discrete diffusion models, E0 supports a significantly larger and finer-grained action vocabulary and avoids the distributional mismatch introduced by masking-based corruptions-yielding more accurate fine-grained action control. We further introduce a spherical viewpoint perturbation augmentation method to improve robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, and ManiSkill show that E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average. Real-world evaluation on a Franka arm confirms that E0 delivers precise, robust, and transferable manipulation, establishing discrete diffusion as a promising direction for generalizable VLA policy learning.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jin Pin, Krasowski Hanna, Vanneaux Elena
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Obtaining safety guarantees for reinforcement learning is a major challenge to achieve applicability for real-world tasks. Safety shields extend standard reinforcement learning and achieve hard safety guarantees. However, existing safety shields commonly use random sampling of safe actions or a fixed fallback controller, therefore disregarding future performance implications of different safe actions. In this work, we propose a predictive safety shield for model-based reinforcement learning agents in discrete space. Our safety shield updates the Q-function locally based on safe predictions, which originate from a safe simulation of the environment model. This shielding approach improves performance while maintaining hard safety guarantees. Our experiments on gridworld environments demonstrate that even short prediction horizons can be sufficient to identify the optimal path. We observe that our approach is robust to distribution shifts, e.g., between simulation and reality, without requiring additional training.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Dmitri Kovalenko
Robotics
ๆฅ็ๆ่ฆ
An autonomous robotic nut tightening system for a serial manipulator equipped with a parallel gripper is proposed. The system features a hierarchical motion-primitive-based planner and a control-switching scheme that alternates between force and position control. Extensive simulations demonstrate the system's robustness to variance in initial conditions. Additionally, the proposed controller tightens threaded screws 14% faster than the baseline while applying 40 times less contact force on manipulands. For the benefit of the research community, the system's implementation is open-sourced.
๐
2025-11-26
๐ Published in IEEE Sensors Journal
๐ฅ Jewel Benny, Narahari N. Moudhgalya, Mujeev Khan, Hemant Kumar Meena, Mohd Wajid et al.
ๆฅ็ๆ่ฆ
In this work, we introduce an innovative approach to estimate the vital signs of multiple human subjects simultaneously in a non-contact way using a Frequency Modulated Continuous Wave (FMCW) radar-based system. Traditional vital sign monitoring methods often face significant limitations, including subject discomfort with wearable devices, challenges in calibration, and the risk of infection transmission through contact measurement devices. To address these issues, this research is motivated by the need for versatile, non-contact vital monitoring solutions applicable in various critical scenarios. This work also explores the challenges of extending this capability to an arbitrary number of subjects, including hardware and theoretical limitations. Supported by rigorous experimental results and discussions, the paper illustrates the system's potential to redefine vital sign monitoring. An FPGA-based implementation is also presented as proof of concept for a hardware-based and portable solution, improving upon previous works by offering 2.7x faster execution and 18.4% less Look-Up Table (LUT) utilization, as well as providing over 7400x acceleration compared to its software counterpart.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Martin Jacquet, Marvin Harms, Kostas Alexis
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper introduces a neural Nonlinear Model Predictive Control (NMPC) framework for mapless, collision-free navigation in unknown environments with Aerial Robots, using onboard range sensing. We leverage deep neural networks to encode a single range image, capturing all the available information about the environment, into a Signed Distance Function (SDF). The proposed neural architecture consists of two cascaded networks: a convolutional encoder that compresses the input image into a low-dimensional latent vector, and a Multi-Layer Perceptron that approximates the corresponding spatial SDF. This latter network parametrizes an explicit position constraint used for collision avoidance, which is embedded in a velocity-tracking NMPC that outputs thrust and attitude commands to the robot. First, a theoretical analysis of the contributed NMPC is conducted, verifying recursive feasibility and stability properties under fixed observations. Subsequently, we evaluate the open-loop performance of the learning-based components as well as the closed-loop performance of the controller in simulations and experiments. The simulation study includes an ablation study, comparisons with two state-of-the-art local navigation methods, and an assessment of the resilience to drifting odometry. The real-world experiments are conducted in forest environments, demonstrating that the neural NMPC effectively performs collision avoidance in cluttered settings against an adversarial reference velocity input and drifting position estimates.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Jamal Raiyn
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper proposes a new strategy for collision avoidance system leveraging Time-to-Collision (TTC) metrics for handling cut-in scenarios, which are particularly challenging for autonomous vehicles (AVs). By integrating a deep learning with TTC calculations, the system predicts potential collisions and determines appropriate evasive actions compared to traditional TTC -based approaches.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Iryna Hurova, Alinjar Dan, Karl Kruusamรคe, Arun Kumar Singh
Machine LearningRobotics
ๆฅ็ๆ่ฆ
In recent years, dual-arm manipulation has become an area of strong interest in robotics, with end-to-end learning emerging as the predominant strategy for solving bimanual tasks. A critical limitation of such learning-based approaches, however, is their difficulty in generalizing to novel scenarios, especially within cluttered environments. This paper presents an alternative paradigm: a sampling-based optimization framework that utilizes a GPU-accelerated physics simulator as its world model. We demonstrate that this approach can solve complex bimanual manipulation tasks in the presence of static obstacles. Our contribution is a customized Model Predictive Path Integral Control (MPPI) algorithm, \textbf{guided by carefully designed task-specific cost functions,} that uses GPU-accelerated MuJoCo for efficiently evaluating robot-object interaction. We apply this method to solve significantly more challenging versions of tasks from the PerAct$^{2}$ benchmark, such as requiring the point-to-point transfer of a ball through an obstacle course. Furthermore, we establish that our method achieves real-time performance on commodity GPUs and facilitates successful sim-to-real transfer by leveraging unique features within MuJoCo. The paper concludes with a statistical analysis of the sample complexity and robustness, quantifying the performance of our approach. The project website is available at: https://sites.google.com/view/bimanualakslabunitartu .
๐
2025-11-26
๐ ArXiv Preprint (eess.SY)
๐ฅ Jewel Benny, Pranjal Mahajan, Srayan Sankar Chatterjee, Mohd Wajid, Abhishek Srivastava
ๆฅ็ๆ่ฆ
Recent developments in mmWave radar technologies have enabled the truly non-contact heart-rate (HR) and breath-rate (BR) measurement approaches, which provides a great ease in patient monitoring. Additionally, these technologies also provide opportunities to simultaneously detect HR and BR of multiple patients, which has become increasingly important for efficient mass monitoring scenarios. In this work, a frequency modulated continuous wave (FMCW) mmWave radar based truly non-contact multiple patient HR and BR monitoring system has been presented. Furthermore, a novel approach is also proposed, which combines multiple processing methods using a least squares solution to improve measurement accuracy, generalization, and handle measurement error. The proposed system has been developed using Texas Instruments' FMCW radar and experimental results with multiple subjects are also presented, which show >97% and >93% accuracy in the measured BR and HR values, respectively.
๐
2025-11-26
๐ 8 pages, 11 figures. Accepted to IEEE Robotics and
๐ฅ Fuyuki Tokuda, Akira Seino, Akinari Kobayashi, Kai Tang, Kazuhiro Kosuge
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
In this paper, we propose a method to align and place a fabric piece on top of another using a dual-arm manipulator and a grayscale camera, so that their surface textures are accurately matched. We propose a novel control scheme that combines Transformer-driven visual servoing with dualarm impedance control. This approach enables the system to simultaneously control the pose of the fabric piece and place it onto the underlying one while applying tension to keep the fabric piece flat. Our transformer-based network incorporates pretrained backbones and a newly introduced Difference Extraction Attention Module (DEAM), which significantly enhances pose difference prediction accuracy. Trained entirely on synthetic images generated using rendering software, the network enables zero-shot deployment in real-world scenarios without requiring prior training on specific fabric textures. Real-world experiments demonstrate that the proposed system accurately aligns fabric pieces with different textures.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Ruican Xia, Hailong Pei
Robotics
ๆฅ็ๆ่ฆ
Relative State Estimation perform mutually localization between two mobile agents undergoing six-degree-of-freedom motion. Based on the principle of circular motion, the estimation accuracy is sensitive to nonlinear rotations of the reference platform, particularly under large inter-platform distances. This phenomenon is even obvious for linearized kinematics, because cumulative linearization errors significantly degrade precision. In virtual reality (VR) applications, this manifests as substantial positional errors in 6-DoF controller tracking during rapid rotations of the head-mounted display. The linearization errors introduce drift in the estimate and render the estimator inconsistent. In the field of odometry, IMU preintegration is proposed as a kinematic observation to enable efficient relinearization, thus mitigate linearized error. Building on this theory, we propose dual preintegration, a novel observation integrating IMU preintegration from both platforms. This method serves as kinematic constraints for consecutive relative state and supports efficient relinearization. We also perform observability analysis of the state and analytically formulate the accordingly null space. Algorithm evaluation encompasses both simulations and real-world experiments. Multiple nonlinear rotations on the reference platform are simulated to compare the precision of the proposed method with that of other state-of-the-art (SOTA) algorithms. The field test compares the proposed method and SOTA algorithms in the application of VR controller tracking from the perspectives of bias observability, nonlinear rotation, and background texture. The results demonstrate that the proposed method is more precise and robust than the SOTA algorithms.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Kaiyan Xiao, Zihan Xu, Cheng Zhe, Chengju Liu, Qijun Chen
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Humanoid robots, with their human-like morphology, hold great potential for industrial applications. However, existing loco-manipulation methods primarily focus on dexterous manipulation, falling short of the combined requirements for dexterity and proactive force interaction in high-load industrial scenarios. To bridge this gap, we propose a reinforcement learning-based framework with a decoupled three-stage training pipeline, consisting of an upper-body policy, a lower-body policy, and a delta-command policy. To accelerate upper-body training, a heuristic reward function is designed. By implicitly embedding forward kinematics priors, it enables the policy to converge faster and achieve superior performance. For the lower body, a force-based curriculum learning strategy is developed, enabling the robot to actively exert and regulate interaction forces with the environment.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Xu Hu, Yiyang Feng, Junran Peng, Jiawei He, Liyi Chen et al.
Robotics
ๆฅ็ๆ่ฆ
The development of embodied agents for complex commercial environments is hindered by a critical gap in existing robotics datasets and benchmarks, which primarily focus on household or tabletop settings with short-horizon tasks. To address this limitation, we introduce MarketGen, a scalable simulation platform with automatic scene generation for complex supermarket environments. MarketGen features a novel agent-based Procedural Content Generation (PCG) framework. It uniquely supports multi-modal inputs (text and reference images) and integrates real-world design principles to automatically generate complete, structured, and realistic supermarkets. We also provide an extensive and diverse 3D asset library with a total of 1100+ supermarket goods and parameterized facilities assets. Building on this generative foundation, we propose a novel benchmark for assessing supermarket agents, featuring two daily tasks in a supermarket: (1) Checkout Unloading: long-horizon tabletop tasks for cashier agents, and (2) In-Aisle Item Collection: complex mobile manipulation tasks for salesperson agents. We validate our platform and benchmark through extensive experiments, including the deployment of a modular agent system and successful sim-to-real transfer. MarketGen provides a comprehensive framework to accelerate research in embodied AI for complex commercial applications.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Guoming Huang, Qingyi Zhou, Dianjing Liu, Shuai Zhang, Ming Zhou et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Non-contact manipulation has emerged as a transformative approach across various industrial fields. However, current flexible 2D and 3D non-contact manipulation techniques are often limited to microscopic scales, typically controlling objects in the milligram range. In this paper, we present a magnetic levitation system, termed Maglev-Pentabot, designed to address this limitation. The Maglev-Pentabot leverages deep reinforcement learning (DRL) to develop complex control strategies for manipulating objects in the gram range. Specifically, we propose an electromagnet arrangement optimized through numerical analysis to maximize controllable space. Additionally, an action remapping method is introduced to address sample sparsity issues caused by the strong nonlinearity in magnetic field intensity, hence allowing the DRL controller to converge. Experimental results demonstrate flexible manipulation capabilities, and notably, our system can generalize to transport tasks it has not been explicitly trained for. Furthermore, our approach can be scaled to manipulate heavier objects using larger electromagnets, offering a reference framework for industrial-scale robotic applications.
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Ziyi Chen, Yingnan Guo, Zedong Chu, Minghua Luo, Yanfen Shen et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Embodied navigation that adheres to social norms remains an open research challenge. Our \textbf{SocialNav} is a foundational model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Our project page: https://amap-eai.github.io/SocialNav/
๐
2025-11-26
๐ ArXiv Preprint (Robotics)
๐ฅ Feiyang Pan, Shenghe Zheng, Chunyan Yin, Guangbin Dou
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Visual-Inertial Odometry (VIO) is a critical component for robust ego-motion estimation, enabling foundational capabilities such as autonomous navigation in robotics and real-time 6-DoF tracking for augmented reality. Existing methods face a well-known trade-off: filter-based approaches are efficient but prone to drift, while optimization-based methods, though accurate, rely on computationally prohibitive Visual-Inertial Bundle Adjustment (VIBA) that is difficult to run on resource-constrained platforms. Rather than removing VIBA altogether, we aim to reduce how often and how heavily it must be invoked. To this end, we cast two key design choices in modern VIO, when to run the visual frontend and how strongly to trust its output, as sequential decision problems, and solve them with lightweight reinforcement learning (RL) agents. Our framework introduces a lightweight, dual-pronged RL policy that serves as our core contribution: (1) a Select Agent intelligently gates the entire VO pipeline based only on high-frequency IMU data; and (2) a composite Fusion Agent that first estimates a robust velocity state via a supervised network, before an RL policy adaptively fuses the full (p, v, q) state. Experiments on the EuRoC MAV and TUM-VI datasets show that, in our unified evaluation, the proposed method achieves a more favorable accuracy-efficiency-memory trade-off than prior GPU-based VO/VIO systems: it attains the best average ATE while running up to 1.77 times faster and using less GPU memory. Compared to classical optimization-based VIO systems, our approach maintains competitive trajectory accuracy while substantially reducing computational load.
๐
2025-11-26
๐ AAAI 2026
๐ฅ Chenglizhao Chen, Shaofeng Liang, Runwei Guan, Xiaolou Sun, Haocheng Zhao et al.
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which constrains their ability to capture broad-scale scene contexts and perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, which has given rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To facilitate its construction, we develop an innovative semi-automated collaborative agent-based labeling assistant (COALA) framework that significantly reduces labor costs while maintaining annotation quality. Furthermore, we propose HawkEyeTrack (HETrack), a novel method that collaboratively enhances vision-language representation learning and improves the perception of UAV scenarios. Comprehensive experiments validated the challenging nature of our dataset and the effectiveness of our method.
๐
2025-11-26
๐ ArXiv Preprint (Machine Learning)
๐ฅ Akansha Kalra, Soumil Datta, Ethan Gilmore, Duc La, Guanhong Tao et al.
ๆฅ็ๆ่ฆ
Behavior Cloning (BC) is a popular framework for training sequential decision policies from expert demonstrations via supervised learning. As these policies are increasingly being deployed in the real world, their robustness and potential vulnerabilities are an important concern. In this work, we perform the first analysis of the efficacy of clean-label backdoor attacks on BC policies. Our backdoor attacks poison a dataset of demonstrations by injecting a visual trigger to create a spurious correlation that can be exploited at test time. We evaluate how policy vulnerability scales with the fraction of poisoned data, the strength of the trigger, and the trigger type. We also introduce a novel entropy-based test-time trigger attack that substantially degrades policy performance by identifying critical states where test-time triggering of the backdoor is expected to be most effective at degrading performance. We empirically demonstrate that BC policies trained on even minimally poisoned datasets exhibit deceptively high, near-baseline task performance despite being highly vulnerable to backdoor trigger attacks during deployment. Our results underscore the urgent need for more research into the robustness of BC policies, particularly as large-scale datasets are increasingly used to train policies for real-world cyber-physical systems. Videos and code are available at https://sites.google.com/view/dataset-poisoning-in-bc.
๐
2025-11-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition-affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nura Aljaafari, Danilo S. Carvalho, Andrรฉ Freitas
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite displaying semantic competence, large language models' internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how LLMs implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. These findings suggest that LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms exhibit partial transfer across scales and architectures.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ali Jahan, Masood Ghayoomi, Annette Hautli-Janisz
Natural Language Processing
ๆฅ็ๆ่ฆ
Argument mining is a subfield of natural language processing to identify and extract the argument components, like premises and conclusions, within a text and to recognize the relations between them. It reveals the logical structure of texts to be used in tasks like knowledge extraction. This paper aims at utilizing a cross-lingual approach to argument mining for low-resource languages, by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models based on the English Microtext corpus \citep{PeldszusStede2015}, and its parallel Persian translation. The learning scenarios are as follow: (i) zero-shot transfer, where the model is trained solely with the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2\% on the English test set and 50.7\% on the Persian test set. LLM-based augmentation model improves the performance up to 59.2\% on English and 69.3\% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant, by achieving a F1 of 74.8\%. Results indicate that a lightweight cross-lingual blend can outperform considerably the more resource-intensive augmentation pipelines, and it offers a practical pathway for the argument mining task to overcome data resource shortage on low-resource languages.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sree Bhattacharyya, Yaman Kumar Singla, Sudhir Yarram, Somesh Kumar Singh, Harini S et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dong Dong, Weijie Su
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14--18\% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0\% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5\%, 17.2\%, and 18.5\% fewer steps, respectively, to reach a fixed validation loss, and 13.7\%, 12.7\%, and 13.7\% lower inference latency, together with a 16\% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7\% and enhancing HellaSwag accuracy by 4.3\%. Moreover, the Length-MAX tokenizer achieves 99.62\% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12\% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing -- and often improving -- downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18\% at inference.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Asad Aali, Muhammad Ahmed Mohsin, Vasiliki Bikia, Arnav Singhvi, Richard Gaus et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller ฮ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Samuele Dell'Erba, Andrew D. Bagdanov
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiaojiao Han, Wujiang Xu, Mingyu Jin, Mengnan Du
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanationdriven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Trung Cuong Dang, David Mohaisen
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous works have proposed various definitions for memorization, many exhibit shortcomings in comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memory, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daeheon Jeong, Seoyeon Byun, Kihoon Son, Dae Hyun Kim, Juho Kim
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs' potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.
๐
2025-11-25
๐ ArXiv Preprint (cs.CY)
๐ฅ Xing Wang, Huiyuan Xie, Yiyan Wang, Chaojun Xiao, Huimin Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are now deployed at unprecedented scale, assisting millions of users in daily tasks. However, the risk of these models assisting unlawful activities remains underexplored. In this study, we define this high-risk behavior as complicit facilitation - the provision of guidance or support that enables illicit user instructions - and present four empirical studies that assess its prevalence in widely deployed LLMs. Using real-world legal cases and established legal frameworks, we construct an evaluation benchmark spanning 269 illicit scenarios and 50 illicit intents to assess LLMs' complicit facilitation behavior. Our findings reveal widespread LLM susceptibility to complicit facilitation, with GPT-4o providing illicit assistance in nearly half of tested cases. Moreover, LLMs exhibit deficient performance in delivering credible legal warnings and positive guidance. Further analysis uncovers substantial safety variation across socio-legal contexts. On the legal side, we observe heightened complicity for crimes against societal interests, non-extreme but frequently occurring violations, and malicious intents driven by subjective motives or deceptive justifications. On the social side, we identify demographic disparities that reveal concerning complicit patterns towards marginalized and disadvantaged groups, with older adults, racial minorities, and individuals in lower-prestige occupations disproportionately more likely to receive unlawful guidance. Analysis of model reasoning traces suggests that model-perceived stereotypes, characterized along warmth and competence, are associated with the model's complicit behavior. Finally, we demonstrate that existing safety alignment strategies are insufficient and may even exacerbate complicit behavior.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Abdullah Al Sefat
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.
๐
2025-11-25
๐ ArXiv Preprint (cs.CY)
๐ฅ Ali Madad
Natural Language Processing
ๆฅ็ๆ่ฆ
InvisibleBench is a deployment gate for caregiving-relationship AI, evaluating 3-20+ turn interactions across five dimensions: Safety, Compliance, Trauma-Informed Design, Belonging/Cultural Fitness, and Memory. The benchmark includes autofail conditions for missed crises, medical advice (WOPR Act), harmful information, and attachment engineering. We evaluate four frontier models across 17 scenarios (N=68) spanning three complexity tiers. All models show significant safety gaps (11.8-44.8 percent crisis detection), indicating the necessity of deterministic crisis routing in production systems. DeepSeek Chat v3 achieves the highest overall score (75.9 percent), while strengths differ by dimension: GPT-4o Mini leads Compliance (88.2 percent), Gemini leads Trauma-Informed Design (85.0 percent), and Claude Sonnet 4.5 ranks highest in crisis detection (44.8 percent). We release all scenarios, judge prompts, and scoring configurations with code. InvisibleBench extends single-turn safety tests by evaluating longitudinal risk, where real harms emerge. No clinical claims; this is a deployment-readiness evaluation.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, wenlin zhang, Pengyue Jia et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Alfredo Garcia et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
PPO has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1)~token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability. Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Farzad Ahmed, Joniel Augustine Jerome, Meliha Yetisgen, รzlem Uzuner
Natural Language Processing
ๆฅ็ๆ่ฆ
Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction.
Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning.
Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections.
Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
๐
2025-11-25
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mengliang ZHang
ๆฅ็ๆ่ฆ
Electronic health records (EHR) contain extensive structured and unstructured data, including tabular information and free-text clinical notes. Querying relevant patient information often requires complex database operations, increasing the workload for clinicians. However, complex table relationships and professional terminology in EHRs limit the query accuracy. In this work, we construct a publicly available dataset, TQGen, that integrates both \textbf{T}ables and clinical \textbf{T}ext for natural language-to-query \textbf{Gen}eration. To address the challenges posed by complex medical terminology and diverse types of questions in EHRs, we propose TQGen-EHRQuery, a framework comprising a medical knowledge module and a questions template matching module. For processing medical text, we introduced the concept of a toolset, which encapsulates the text processing module as a callable tool, thereby improving processing efficiency and flexibility. We conducted extensive experiments to assess the effectiveness of our dataset and workflow, demonstrating their potential to enhance information querying in EHR systems.
๐
2025-11-25
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Puneet S. Bagga, Vivek F. Farias, Tamar Korkotashvili, Tianyi Peng, Yuhang Wu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
With the rise of large language models (LLMs), generative engines are becoming powerful alternatives to traditional search, reshaping retrieval tasks. In e-commerce, for instance, conversational shopping agents now guide consumers to relevant products. This shift has created the need for generative engine optimization (GEO)--improving content visibility and relevance for generative engines. Yet despite its growing importance, current GEO practices are ad hoc, and their impacts remain poorly understood, especially in e-commerce. We address this gap by introducing E-GEO, the first benchmark built specifically for e-commerce GEO. E-GEO contains over 7,000 realistic, multi-sentence consumer product queries paired with relevant listings, capturing rich intent, constraints, preferences, and shopping contexts that existing datasets largely miss. Using this benchmark, we conduct the first large-scale empirical study of e-commerce GEO, evaluating 15 common rewriting heuristics and comparing their empirical performance. To move beyond heuristics, we further formulate GEO as a tractable optimization problem and develop a lightweight iterative prompt-optimization algorithm that can significantly outperform these baselines. Surprisingly, the optimized prompts reveal a stable, domain-agnostic pattern--suggesting the existence of a "universally effective" GEO strategy. Our data and code are publicly available at https://github.com/psbagga17/E-GEO.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Inkook Chun, Seungjae Lee, Michael S. Albergo, Saining Xie, Eric Vanden-Eijnden
Robotics
ๆฅ็ๆ่ฆ
Diffusion- and flow-based policies deliver state-of-the-art performance on long-horizon robotic manipulation and imitation learning tasks. However, these controllers employ a fixed inference budget at every control step, regardless of task complexity, leading to computational inefficiency for simple subtasks while potentially underperforming on challenging ones. To address these issues, we introduce Difficulty-Aware Stochastic Interpolant Policy (DA-SIP), a framework that enables robotic controllers to adaptively adjust their integration horizon in real time based on task difficulty. Our approach employs a difficulty classifier that analyzes observations to dynamically select the step budget, the optimal solver variant, and ODE/SDE integration at each control cycle. DA-SIP builds upon the stochastic interpolant formulation to provide a unified framework that unlocks diverse training and inference configurations for diffusion- and flow-based policies. Through comprehensive benchmarks across diverse manipulation tasks, DA-SIP achieves 2.6-4.4x reduction in total computation time while maintaining task success rates comparable to fixed maximum-computation baselines. By implementing adaptive computation within this framework, DA-SIP transforms generative robot controllers into efficient, task-aware systems that intelligently allocate inference resources where they provide the greatest benefit.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Vivek Pandey, Amirhossein Mollaei, Nader Motee
Robotics
ๆฅ็ๆ่ฆ
Robot localization is a fundamental component of autonomous navigation in unknown environments. Among various sensing modalities, visual input from cameras plays a central role, enabling robots to estimate their position by tracking point features across image frames. However, image frames often contain a large number of features, many of which are redundant or uninformative for localization. Processing all features can introduce significant computational latency and inefficiency. This motivates the need for intelligent feature selection, identifying a subset of features that are most informative for localization over a prediction horizon. In this work, we propose two fast and memory-efficient feature selection algorithms that enable robots to actively evaluate the utility of visual features in real time. Unlike existing approaches with high computational and memory demands, the proposed methods are explicitly designed to reduce both time and memory complexity while achieving a favorable trade-off between computational efficiency and localization accuracy.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Rui Yan, Jiajian Fu, Shiqi Yang, Lars Paulsen, Xuxin Cheng et al.
Robotics
ๆฅ็ๆ่ฆ
Teleoperation systems are essential for efficiently collecting diverse and high-quality robot demonstration data, especially for complex, contact-rich tasks. However, current teleoperation platforms typically lack integrated force feedback, cross-embodiment generalization, and portable, user-friendly designs, limiting their practical deployment. To address these limitations, we introduce ACE-F, a cross embodiment foldable teleoperation system with integrated force feedback. Our approach leverages inverse kinematics (IK) combined with a carefully designed human-robot interface (HRI), enabling users to capture precise and high-quality demonstrations effortlessly. We further propose a generalized soft-controller pipeline integrating PD control and inverse dynamics to ensure robot safety and precise motion control across diverse robotic embodiments. Critically, to achieve cross-embodiment generalization of force feedback without additional sensors, we innovatively interpret end-effector positional deviations as virtual force signals, which enhance data collection and enable applications in imitation learning. Extensive teleoperation experiments confirm that ACE-F significantly simplifies the control of various robot embodiments, making dexterous manipulation tasks as intuitive as operating a computer mouse. The system is open-sourced at: https://acefoldable.github.io/
๐
2025-11-25
๐ CoRL 2024
๐ฅ Tasha Kim, Yingke Wang, Hanvit Cho, Alex Hodges
Robotics
ๆฅ็ๆ่ฆ
Neural Signal Operated Intelligent Robots (NOIR) system is a versatile brain-robot interface that allows humans to control robots for daily tasks using their brain signals. This interface utilizes electroencephalography (EEG) to translate human intentions regarding specific objects and desired actions directly into commands that robots can execute. We present NOIR 2.0, an enhanced version of NOIR. NOIR 2.0 includes faster and more accurate brain decoding algorithms, which reduce task completion time by 46%. NOIR 2.0 uses few-shot robot learning algorithms to adapt to individual users and predict their intentions. The new learning algorithms leverage foundation models for more sample-efficient learning and adaptation (15 demos vs. a single demo), significantly reducing overall human time by 65%.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Edmond Tong, Advaith Balaji, Anthony Opipari, Stanley Lewis, Zhen Zeng et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
To manipulate objects in novel, unstructured environments, robots need task-oriented grasps that target object parts based on the given task. Geometry-based methods often struggle with visually defined parts, occlusions, and unseen objects. We introduce OVAL-Grasp, a zero-shot open-vocabulary approach to task-oriented, affordance based grasping that uses large-language models and vision-language models to allow a robot to grasp objects at the correct part according to a given task. Given an RGB image and a task, OVAL-Grasp identifies parts to grasp or avoid with an LLM, segments them with a VLM, and generates a 2D heatmap of actionable regions on the object. During our evaluations, we found that our method outperformed two task oriented grasping baselines on experiments with 20 household objects with 3 unique tasks for each. OVAL-Grasp successfully identifies and segments the correct object part 95% of the time and grasps the correct actionable area 78.3% of the time in real-world experiments with the Fetch mobile manipulator. Additionally, OVAL-Grasp finds correct object parts under partial occlusions, demonstrating a part selection success rate of 80% in cluttered scenes. We also demonstrate OVAL-Grasp's efficacy in scenarios that rely on visual features for part selection, and show the benefit of a modular design through our ablation experiments. Our project webpage is available at https://ekjt.github.io/OVAL-Grasp/
๐
2025-11-25
๐ AAAI
๐ฅ Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haibo HU, Lianming Huang, Nan Guan, Chun Jason Xue
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tooba Tehreem Sheikh, Jean Lahoud, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa et al.
Robotics
ๆฅ็ๆ่ฆ
While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi et al.
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Anastasia Mavridou, Divya Gopinath, Corina S. Pฤsฤreanu
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
The integration of AI components, particularly Deep Neural Networks (DNNs), into safety-critical systems such as aerospace and autonomous vehicles presents fundamental challenges for assurance. The opacity of AI systems, combined with the semantic gap between high-level requirements and low-level network representations, creates barriers to traditional verification approaches. These AI-specific challenges are amplified by longstanding issues in Requirements Engineering, including ambiguity in natural language specifications and scalability bottlenecks in formalization. We propose an approach that leverages AI itself to address these challenges through two complementary components. REACT (Requirements Engineering with AI for Consistency and Testing) employs Large Language Models (LLMs) to bridge the gap between informal natural language requirements and formal specifications, enabling early verification and validation. SemaLens (Semantic Analysis of Visual Perception using large Multi-modal models) utilizes Vision Language Models (VLMs) to reason about, test, and monitor DNN-based perception systems using human-understandable concepts. Together, these components provide a comprehensive pipeline from informal requirements to validated implementations.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The optimization of large language models (LLMs) remains a critical challenge, particularly as model scaling exacerbates sensitivity to algorithmic imprecision and training instability. Recent advances in optimizers have improved convergence efficiency through momentum orthogonalization, but suffer from two key robustness limitations: dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise. To address these robustness challenges, we introduce ROOT, a Robust Orthogonalized Optimizer that enhances training stability through dual robustness mechanisms. First, we develop a dimension-robust orthogonalization scheme using adaptive Newton iterations with fine-grained coefficients tailored to specific matrix sizes, ensuring consistent precision across diverse architectural configurations. Second, we introduce an optimization-robust framework via proximal optimization that suppresses outlier noise while preserving meaningful gradient directions. Extensive experiments demonstrate that ROOT achieves significantly improved robustness, with faster convergence and superior final performance compared to both Muon and Adam-based optimizers, particularly in noisy and non-convex scenarios. Our work establishes a new paradigm for developing robust and precise optimizers capable of handling the complexities of modern large-scale model training. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/ROOT.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ David Szczecina, Senan Gaffori, Edmond Li
Natural Language Processing
ๆฅ็ๆ่ฆ
The widespread use of Large Language Models (LLMs) raises critical concerns regarding the unauthorized inclusion of copyrighted content in training data. Existing detection frameworks, such as DE-COP, are computationally intensive, and largely inaccessible to independent creators. As legal scrutiny increases, there is a pressing need for a scalable, transparent, and user-friendly solution. This paper introduce an open-source copyright detection platform that enables content creators to verify whether their work was used in LLM training datasets. Our approach enhances existing methodologies by facilitating ease of use, improving similarity detection, optimizing dataset validation, and reducing computational overhead by 10-30% with efficient API calls. With an intuitive user interface and scalable backend, this framework contributes to increasing transparency in AI development and ethical compliance, facilitating the foundation for further research in responsible AI development and copyright enforcement.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Adam Karvonen, Daniel Reuter, Roy Rinberg, Luke Marks, Adriร Garriga-Alonso et al.
ๆฅ็ๆ่ฆ
As demand for LLM inference grows, it is becoming increasingly important that providers and their customers can verify that inference processes are performed correctly, without errors or tampering. However, re-running the same inference process twice often leads to different results due to benign numerical noise, making it difficult to distinguish legitimate variation from actual problems. To address this problem, we introduce Token-DiFR (Token-Divergence-From-Reference), a method for verifying inference outputs by comparing generated tokens against predictions made by a trusted reference implementation conditioned on the same random seed. Sampling seed synchronization tightly constrains valid outputs, leaving providers minimal room to deviate from correct inference, which allows output tokens themselves to serve as auditable evidence of correctness at zero additional cost to the provider. Token-DiFR reliably identifies sampling errors, simulated bugs, and model quantization, detecting 4-bit quantization with AUC $>$ 0.999 within 300 output tokens. For applications requiring sample-efficient forward-pass verification, we additionally introduce Activation-DiFR, a scheme that uses random orthogonal projections to compress activations into compact fingerprints for subsequent verification. Activation-DiFR detects 4-bit quantization with AUC $>$ 0.999 using just 2 output tokens, while reducing communication overhead by 25-75% relative to existing methods. We release an open-source integration with vLLM to accelerate practical deployment of verifiable inference.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seyede Niloofar Hosseini, Ali Mojibi, Mahdi Mohseni, Navid Arjmand, Alireza Taheri
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals each performing 204 load-reaching tasks from different load positions while adapting various lifting and handling techniques. The model inputs consisted of the 3D position of the hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. We indicated that utilizing the transformer architecture, with a root-mean-square-error of 47.0 mm, exhibited ~58% more accurate long-term performance than the BLSTM-based model. This study merits the use of neural networks that capture time series dependencies in 3D motion frames, providing a unique approach for understanding and predict motion dynamics during manual material handling activities.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Panayiotis Danassis, Naman Goel
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and $\sim 40$k matches demonstrate (i) a clear superiority of human(graduate students)-coded agents: the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real-world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Gaspard Merten, Mahmoud Sakr, Gilles Dejaegere
Natural Language Processing
ๆฅ็ๆ่ฆ
Foundation models are transformative in artificial intelligence, but building them from scratch, especially for mobility trajectories, is not yet clear or documented. This tutorial bridges this gap by demonstrating the steps and code of a minimal implementation of a trajectory-focused foundation model starting from GPT-2. Through a concise, step-by-step, code-driven process, we demonstrate adapting GPT-2 for spatiotemporal data. We then review and compare representative trajectory foundation models, such as TrajFM and TrajGPT, highlighting their architectural innovations and differences. Additionally, we introduce complementary techniques from related domains, like TimesFM's patching approach. Targeted at researchers and practitioners, this tutorial aims to explain the concepts and terminology of foundation models, at the implementation level. We find it timely and indispensable to create this educational material in order to support the SIGSPATIAL community in building and evaluating mobility foundation models, enhancing both research clarity and peer-review effectiveness in mobility AI.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Yixin Liu, Pengfei Liu, Arman Cohan
Natural Language Processing
ๆฅ็ๆ่ฆ
Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that measures LLM alignment with human preferences without directly evaluating their generated outputs, instead assessing LLMs in their role as evaluators. Our evaluation shows that our proposed benchmark, AlignEval, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs' generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Heman Shakeri
ๆฅ็ๆ่ฆ
Deep sequence models for blood glucose forecasting consistently fail to leverage clinically informative drivers--insulin, meals, and activity--despite well-understood physiological mechanisms. We term this Driver-Blindness and formalize it via $ฮ_{\text{drivers}}$, the performance gain of multivariate models over matched univariate baselines. Across the literature, $ฮ_{\text{drivers}}$ is typically near zero. We attribute this to three interacting factors: architectural biases favoring autocorrelation (C1), data fidelity gaps that render drivers noisy and confounded (C2), and physiological heterogeneity that undermines population-level models (C3). We synthesize strategies that partially mitigate Driver-Blindness--including physiological feature encoders, causal regularization, and personalization--and recommend that future work routinely report $ฮ_{\text{drivers}}$ to prevent driver-blind models from being considered state-of-the-art.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats et al.
ๆฅ็ๆ่ฆ
The integration of artificial intelligence (AI) agents into web browsers introduces security challenges that go beyond traditional web application threat models. Prior work has identified prompt injection as a new attack vector for web agents, yet the resulting impact within real-world environments remains insufficiently understood.
In this work, we examine the landscape of prompt injection attacks and synthesize a benchmark of attacks embedded in realistic HTML payloads. Our benchmark goes beyond prior work by emphasizing injections that can influence real-world actions rather than mere text outputs, and by presenting attack payloads with complexity and distractor frequency similar to what real-world agents encounter. We leverage this benchmark to conduct a comprehensive empirical evaluation of existing defenses, assessing their effectiveness across a suite of frontier AI models. We propose a multi-layered defense strategy comprising both architectural and model-based defenses to protect against evolving prompt injection attacks. Our work offers a blueprint for designing practical, secure web agents through a defense-in-depth approach.
๐
2025-11-25
๐ ArXiv Preprint (cs.MA)
๐ฅ Jakub Muszyลski, Ignacy Waluลผenicz, Patryk Zan, Zofia Wrona, Maria Ganzha et al.
Robotics
ๆฅ็ๆ่ฆ
Microgrids are deployed to reduce purchased grid energy, limit exposure to volatile tariffs, and ensure service continuity during disturbances. This requires coordinating heterogeneous distributed energy resources across multiple time scales and under variable conditions. Among existing tools, typically, power-system simulators capture physical behaviour but assume centralized control, while multi-agent frameworks model decentralized decision-making but represent energy with no physical grounding. In this context, the EnergyTwin is introduced, an agent-based microgrid simulation environment that couples physically grounded models with forecast-informed, rolling-horizon planning, and negotiations. Each asset is modeled as an agent, interacting with a central agent that obtains forecasts, formulates predictions, and allocates energy through contract-based interactions. EnergyTwin targets tertiary-layer decision making and is extensible for digital-twin use. Its feasibility was evaluated in a university campus microgrid scenario where multiple planning strategies were compared. Achieved results show that forecast-driven rolling-horizon planning increases local energy self-sufficiency, maintains higher battery reserves, and reduces exposure to low-resilience operating states. They demonstrate also potential of EnergyTwin as platform supporting research on resilient, negotiation-driven microgrids.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Dennis Eisermann, Frank Kargl
Machine Learning
ๆฅ็ๆ่ฆ
Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics such as accuracy and precision fail to capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the \emph{Parallel Trust Assessment System (PaTAS)}, a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through \emph{Trust Nodes} and \emph{Trust Functions} that propagate input, parameter, and activation trust across the network. The framework defines a \emph{Parameter Trust Update} mechanism to refine parameter reliability during training and an \emph{Inference-Path Trust Assessment (IPTA)} method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a principled foundation for evaluating model reliability across the AI lifecycle.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Tasha Kim, Oiwi Parker Jones
Robotics
ๆฅ็ๆ่ฆ
Safety-critical assistive systems that directly decode user intent from neural signals require rigorous guarantees of reliability and trust. We present GUARDIAN (Gated Uncertainty-Aware Runtime Dual Invariants), a framework for real-time neuro-symbolic verification for neural signal-controlled robotics. GUARDIAN enforces both logical safety and physiological trust by coupling confidence-calibrated brain signal decoding with symbolic goal grounding and dual-layer runtime monitoring. On the BNCI2014 motor imagery electroencephalogram (EEG) dataset with 9 subjects and 5,184 trials, the system performs at a high safety rate of 94-97% even with lightweight decoder architectures with low test accuracies (27-46%) and high ECE confidence miscalibration (0.22-0.41). We demonstrate 1.7x correct interventions in simulated noise testing versus at baseline. The monitor operates at 100Hz and sub-millisecond decision latency, making it practically viable for closed-loop neural signal-based systems. Across 21 ablation results, GUARDIAN exhibits a graduated response to signal degradation, and produces auditable traces from intent, plan to action, helping to link neural evidence to verifiable robot action.
๐
2025-11-25
๐ ArXiv Preprint (eess.SP)
๐ฅ Tatiana Gelvez-Barrera, Barbara Nicolas, Denis Kouamรฉ, Bruno Gilles, Adrian Basarab
ๆฅ็ๆ่ฆ
Passive acoustic mapping enables the spatial mapping and temporal monitoring of cavitation activity, playing a crucial role in therapeutic ultrasound applications. Most conventional beamforming methods, whether implemented in the time or frequency domains, suffer from limited axial resolution due to the absence of a reference emission onset time. While frequency-domain methods, the most efficient of which are based on the cross-spectral matrix, require long signals for accurate estimation, time-domain methods typically achieve lower spatial resolution. To address these limitations, we propose a linear model-based beamforming framework fully formulated in the time domain. The linear forward model relates a discretized spatiotemporal distribution of cavitation activity to the temporal signals recorded by a probe, explicitly accounting for time-of-flight delays dictated by the acquisition geometry. This model is then inverted using regularization techniques that exploit prior knowledge of cavitation activity in both spatial and temporal domains. Experimental results show that the proposed framework achieves enhanced or competitive cavitation map quality while using only 20\% of the data typically required by frequency-domain methods. This highlights the substantial gain in data efficiency and the flexibility of our spatiotemporal regularization to adapt to diverse passive cavitation scenarios, outperforming state-of-the-art techniques.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only $2.1\%$ its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Codes are coming soon.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ege Ozguroglu, Junbang Liang, Ruoshi Liu, Mia Chiquier, Michael DeTienne et al.
Multimodal
ๆฅ็ๆ่ฆ
While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings. We present New York Smells, a large dataset of paired image and olfactory signals captured ``in the wild.'' Our dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, with approximately 70$\times$ more objects than existing olfactory datasets. Our benchmark has three tasks: cross-modal smell-to-image retrieval, recognizing scenes, objects, and materials from smell alone, and fine-grained discrimination between grass species. Through experiments on our dataset, we find that visual data enables cross-modal olfactory representation learning, and that our learned olfactory representations outperform widely-used hand-crafted features.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andrea Ranieri, Giorgio Palmieri, Silvia Biasotti
Machine Learning
ๆฅ็ๆ่ฆ
This paper addresses the critical need for automated crack detection in the preservation of cultural heritage through semantic segmentation. We present a comparative study of U-Net architectures, using various convolutional neural network (CNN) encoders, for pixel-level crack identification on statues and monuments. A comparative quantitative evaluation is performed on the test set of the OmniCrack30k dataset [1] using popular segmentation metrics including Mean Intersection over Union (mIoU), Dice coefficient, and Jaccard index. This is complemented by an out-of-distribution qualitative evaluation on an unlabeled test set of real-world cracked statues and monuments. Our findings provide valuable insights into the capabilities of different CNN- based encoders for fine-grained crack segmentation. We show that the models exhibit promising generalization capabilities to unseen cultural heritage contexts, despite never having been explicitly trained on images of statues or monuments.
๐
2025-11-25
๐ ArXiv Preprint (cs.LO)
๐ฅ Adam Bjorndahl
ๆฅ็ๆ่ฆ
The TARK conference (Theoretical Aspects of Rationality and Knowledge) is a conference that aims to bring together researchers from a wide variety of fields, including computer science, artificial intelligence, game theory, decision theory, philosophy, logic, linguistics, and cognitive science. Its goal is to further our understanding of interdisciplinary issues involving reasoning about rationality and knowledge.
Previous conferences have been held biennially around the world since 1986, on the initiative of Joe Halpern (Cornell University). Topics of interest include, but are not limited to, semantic models for knowledge, belief, uncertainty, awareness, bounded rationality, common sense epistemic reasoning, epistemic logic, epistemic game theory, knowledge and action, applications of reasoning about knowledge and other mental states, belief revision, computational social choice, algorithmic game theory, and foundations of multi-agent systems.
Information about TARK is available at http://www.tark.org/.
These proceedings contain the papers that have been accepted for presentation at the Twentieth Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2025), held July 14--16, 2025, at Heinrich-Heine-Universitรคt, Dรผsseldorf, Germany. The conference website can be found at https://ccc.cs.uni-duesseldorf.de/tark-2025/.
๐
2025-11-25
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Charles Y. Zhang, Yuanjia Yang, Aidan Sirbu, Elliott T. T. Abe, Emil Wรคrnberg et al.
Robotics
ๆฅ็ๆ่ฆ
The primary output of the nervous system is movement and behavior. While recent advances have democratized pose tracking during complex behavior, kinematic trajectories alone provide only indirect access to the underlying control processes. Here we present MIMIC-MJX, a framework for learning biologically-plausible neural control policies from kinematics. MIMIC-MJX models the generative process of motor control by training neural controllers that learn to actuate biomechanically-realistic body models in physics simulation to reproduce real kinematic trajectories. We demonstrate that our implementation is accurate, fast, data-efficient, and generalizable to diverse animal body models. Policies trained with MIMIC-MJX can be utilized to both analyze neural control strategies and simulate behavioral experiments, illustrating its potential as an integrative modeling framework for neuroscience.
๐
2025-11-25
๐ ICML
๐ฅ Shamima Hossain
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Visual Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs due to a lack of robust reasoning capabilities. While extensive research has been conducted on integrating external knowledge for reasoning in large language models (LLMs), such efforts remain underexplored in VLMs, where the challenge is compounded by the need to bridge multiple modalities seamlessly. This work introduces a framework for knowledge-guided reasoning in VLMs, leveraging structured knowledge graphs for multi-hop verification using image-captioning task to illustrate our framework. Our approach enables systematic reasoning across multiple steps, including visual entity recognition, knowledge graph traversal, and fact-based caption refinement. We evaluate the framework using hierarchical, triple-based and bullet-point based knowledge representations, analyzing their effectiveness in factual accuracy and logical inference. Empirical results show that our approach improves factual accuracy by approximately 31% on preliminary experiments on a curated dataset of mixtures from Google Landmarks v2, Conceptual captions and Coco captions revealing key insights into reasoning patterns and failure modes. This work demonstrates the potential of integrating external knowledge for advancing reasoning in VLMs, paving the way for more reliable and knowledgable multimodal systems.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xinran Wang, Boran Zhu, Shujuan Zhou, Ziwen Long, Dehua Zhou et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Background: As large language models (LLMs) become increasingly integrated into digital health education and assessment workflows, their capabilities in supporting high-stakes, domain-specific certification tasks remain underexplored.In China, the national pharmacist licensure exam serves as a standardized benchmark for evaluating pharmacists' clinical and theoretical competencies. Objective: This study aimed to compare the performance of two LLMs: ChatGPT-4o and DeepSeek-R1 on real questions from the Chinese Pharmacist Licensing Examination (2017-2021), and to discuss the implications of these performance differences for AI-enabled formative evaluation. Methods: A total of 2,306 multiple-choice (text-only) questions were compiled from official exams, training materials, and public databases. Questions containing tables or images were excluded. Each item was input in its original Chinese format, and model responses were evaluated for exact accuracy. Pearson's Chi-squared test was used to compare overall performance, and Fisher's exact test was applied to year-wise multiple-choice accuracy. Results: DeepSeek-R1 outperformed ChatGPT-4o with a significantly higher overall accuracy (90.0% vs. 76.1%, p < 0.001). Unit-level analyses revealed consistent advantages for DeepSeek-R1, particularly in foundational and clinical synthesis modules. While year-by-year multiple-choice performance also favored DeepSeek-R1, this performance gap did not reach statistical significance in any specific unit-year (all p > 0.05). Conclusion: DeepSeek-R1 demonstrated robust alignment with the structural and semantic demands of the pharmacist licensure exam. These findings suggest that domain-specific models warrant further investigation for this context, while also reinforcing the necessity of human oversight in legally and ethically sensitive contexts.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi-Hao Peng, Jeffrey P. Bigham, Jason Wu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of various design aspect importance and individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuto Suzuki, Paul Awolade, Daniel V. LaBarbera, Farnoush Banaei-Kashani
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Molecule generation using generative AI is vital for drug discovery, yet class-specific datasets often contain fewer than 100 training examples. While fragment-based models handle limited data better than atom-based approaches, existing heuristic fragmentation limits diversity and misses key fragments. Additionally, model tuning typically requires slow, indirect collaboration between medicinal chemists and AI engineers. We introduce FRAGMENTA, an end-to-end framework for drug lead optimization comprising: 1) a novel generative model that reframes fragmentation as a "vocabulary selection" problem, using dynamic Q-learning to jointly optimize fragmentation and generation; and 2) an agentic AI system that refines objectives via conversational feedback from domain experts. This system removes the AI engineer from the loop and progressively learns domain knowledge to eventually automate tuning. In real-world cancer drug discovery experiments, FRAGMENTA's Human-Agent configuration identified nearly twice as many high-scoring molecules as baselines. Furthermore, the fully autonomous Agent-Agent system outperformed traditional Human-Human tuning, demonstrating the efficacy of agentic tuning in capturing expert intent.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nathan Roll, Jill Kries, Flora Jin, Catherine Wang, Ann Marie Finley et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model--consensus agreement vs. 0.286 for human--human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sidahmed Benabderrahmane, Talal Rahwan
Machine Learning
ๆฅ็ๆ่ฆ
Advanced Persistent Threats (APT) pose a major cybersecurity challenge due to their stealth, persistence, and adaptability. Traditional machine learning detectors struggle with class imbalance, high dimensional features, and scarce real world traces. They often lack transferability-performing well in the training domain but degrading in novel attack scenarios. We propose a hybrid transfer framework that integrates Transfer Learning, Explainable AI (XAI), contrastive learning, and Siamese networks to improve cross-domain generalization. An attention-based autoencoder supports knowledge transfer across domains, while Shapley Additive exPlanations (SHAP) select stable, informative features to reduce dimensionality and computational cost. A Siamese encoder trained with a contrastive objective aligns source and target representations, increasing anomaly separability and mitigating feature drift. We evaluate on real-world traces from the DARPA Transparent Computing (TC) program and augment with synthetic attack scenarios to test robustness. Across source to target transfers, the approach delivers improved detection scores with classical and deep baselines, demonstrating a scalable, explainable, and transferable solution for APT detection.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Van Tran, Shinan Liu, Tian Li, Nick Feamster
Machine Learning
ๆฅ็ๆ่ฆ
To address the scarcity and privacy concerns of network traffic data, various generative models have been developed to produce synthetic traffic. However, synthetic traffic is not inherently privacy-preserving, and the extent to which it leaks sensitive information, and how to measure such leakage, remain largely unexplored. This challenge is further compounded by the diversity of model architectures, which shape how traffic is represented and synthesized. We introduce a comprehensive set of privacy metrics for synthetic network traffic, combining standard approaches like membership inference attacks (MIA) and data extraction attacks with network-specific identifiers and attributes. Using these metrics, we systematically evaluate the vulnerability of different representative generative models and examine the factors that influence attack success. Our results reveal substantial variability in privacy risks across models and datasets. MIA success ranges from 0% to 88%, and up to 100% of network identifiers can be recovered from generated traffic, highlighting serious privacy vulnerabilities. We further identify key factors that significantly affect attack outcomes, including training data diversity and how well the generative model fits the training data. These findings provide actionable guidance for designing and deploying generative models that minimize privacy leakage, establishing a foundation for safer synthetic network traffic generation.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain, Phil F Cheng, Petros Liakopoulos et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability -- frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sidahmed Benabderrahmane, James Cheney, Talal Rahwan
ๆฅ็ๆ่ฆ
Advanced Persistent Threats (APTs) pose a significant challenge in cybersecurity due to their stealthy and long-term nature. Modern supervised learning methods require extensive labeled data, which is often scarce in real-world cybersecurity environments. In this paper, we propose an innovative approach that leverages AutoEncoders for unsupervised anomaly detection, augmented by active learning to iteratively improve the detection of APT anomalies. By selectively querying an oracle for labels on uncertain or ambiguous samples, we minimize labeling costs while improving detection rates, enabling the model to improve its detection accuracy with minimal data while reducing the need for extensive manual labeling. We provide a detailed formulation of the proposed Attention Adversarial Dual AutoEncoder-based anomaly detection framework and show how the active learning loop iteratively enhances the model. The framework is evaluated on real-world imbalanced provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004\% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The results have shown significant improvements in detection rates during active learning and better performance compared to other existing approaches.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuto Suzuki, Farnoush Banaei-Kashani
Natural Language Processing
ๆฅ็ๆ่ฆ
Reasoning based on Large Language Models (LLMs) has garnered increasing attention due to outstanding performance of these models in mathematical and complex logical tasks. Beginning with the Chain-of-Thought (CoT) prompting technique, numerous reasoning methods have emerged that decompose problems into smaller, sequential steps (or thoughts). However, existing reasoning models focus on conventional problem-solving and do not necessarily generate creative solutions by ``creative reasoning''. In domains where the solution space is expansive and conventional solutions are suboptimal, such as drug discovery or business strategization, creative reasoning to discover innovative solutions is crucial. To address this gap, first we introduce a computational framework for creative reasoning inspired by established cognitive science principles. With this framework, we propose three core creative reasoning paradigms, namely, \textit{combinational}, \textit{exploratory}, and \textit{transformative} reasoning, where each offers specific directions for systematic exploration of the universe of thoughts to generate creative solutions. Next, to materialize this framework using LLMs, we introduce the \textit{Universe of Thoughts} (or \textit{UoT}, for short), a novel set of methods to implement the aforementioned three creative processes. Finally, we introduce three novel tasks that necessitate creative problem-solving, along with an evaluation benchmark to assess creativity from three orthogonal perspectives: feasibility as constraint, and utility and novelty as metrics. With a comparative analysis against the state-of-the-art (SOTA) reasoning techniques as well as representative commercial models with reasoning capability, we show that UoT demonstrates superior performance in creative reasoning.
๐
2025-11-25
๐ ArXiv Preprint (cs.SD)
๐ฅ Genรญs Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira
Machine Learning
ๆฅ็ๆ่ฆ
Extracting individual elements from music mixtures is a valuable tool for music production and practice. While neural networks optimized to mask or transform mixture spectrograms into the individual source(s) have been the leading approach, the source overlap and correlation in music signals poses an inherent challenge. Also, accessing all sources in the mixture is crucial to train these systems, while complicated. Attempts to address these challenges in a generative fashion exist, however, the separation performance and inference efficiency remain limited. In this work, we study the potential of diffusion models to advance toward bridging this gap, focusing on generative singing voice separation relying only on corresponding pairs of isolated vocals and mixtures for training. To align with creative workflows, we leverage latent diffusion: the system generates samples encoded in a compact latent space, and subsequently decodes these into audio. This enables efficient optimization and faster inference. Our system is trained using only open data. We outperform existing generative separation systems, and level the compared non-generative systems on a list of signal quality measures and on interference removal. We provide a noise robustness study on the latent encoder, providing insights on its potential for the task. We release a modular toolkit for further research on the topic.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuanhao Li, Mingshan Liu, Hongbo Wang, Yiding Zhang, Yifei Ma et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have shown impressive capabilities in multi-step reasoning and problem-solving.Recent works introduce multi-agent reflection frameworks where multiple LLM agents critique and refine each other's outputs using reinforcement learning (RL). However, these approaches often rely on single-shot responses and lack structural diversity in reasoning exploration. In this paper, we propose DRAFT-RL, a novel framework that integrates Chain-of-Draft (CoD) reasoning into multi-agent RL training. Instead of generating single responses, each agent produces multiple drafts per query, which are then evaluated by peer agents and a learned reward model to identify the most promising trajectory. These selected drafts are used to refine future reasoning strategies through actor-critic learning.DRAFT-RL enables explicit multi-path exploration, peer-guided reflection, and reward-aligned selection, resulting in more robust and interpretable LLM agent behavior. We evaluate our method on complex reasoning tasks including code synthesis, symbolic math, and knowledge-intensive QA,demonstrating that DRAFT-RL outperforms existing reflective and RL-based agents by significant margins in both accuracy and convergence speed
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mosab Rezaei, Mina Rajaei Moghadam, Abdul Rahman Shaikh, Hamed Alhoori, Reva Freedman
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in large language models have created new opportunities for stylometry, the study of writing styles and authorship. Two challenges, however, remain central: training generative models when no paired data exist, and evaluating stylistic text without relying only on human judgment. In this work, we present a framework for both generating and evaluating sentences in the style of 19th-century novelists. Large language models are fine-tuned with minimal, single-token prompts to produce text in the voices of authors such as Dickens, Austen, Twain, Alcott, and Melville. To assess these generative models, we employ a transformer-based detector trained on authentic sentences, using it both as a classifier and as a tool for stylistic explanation. We complement this with syntactic comparisons and explainable AI methods, including attention-based and gradient-based analyses, to identify the linguistic cues that drive stylistic imitation. Our findings show that the generated text reflects the authors' distinctive patterns and that AI-based evaluation offers a reliable alternative to human assessment. All artifacts of this work are published online.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen
Natural Language Processing
ๆฅ็ๆ่ฆ
In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is gauranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across any vision pruning ratios, i.e., inference efficiency, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our codes are available at https://github.com/GarryLarry010131/OC-VTP.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hmrishav Bandyopadhyay, Nikhil Pinnaparaju, Rahim Entezari, Jim Scott, Yi-Zhe Song et al.
ๆฅ็ๆ่ฆ
Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates overhead from KV-recaching (of ~200ms) during context switches for interactive generation. Extensive evaluations validated against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bo Pang, Chenxi Xu, Jierui Ren, Guoping Wang, Sheng Li
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Matvei Shelukhan, Timur Mamedov, Karina Kvanchiani
Computer Vision
ๆฅ็ๆ่ฆ
Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current approaches mainly focus on tracking objects in each frame of a video stream, making it almost impossible to run the model under conditions of limited computing resources. To address this issue, we propose StableTrack, a novel approach that stabilizes the quality of tracking on low-frequency detections. Our method introduces a new two-stage matching strategy to improve the cross-frame association between low-frequency detections. We propose a novel Bbox-Based Distance instead of the conventional Mahalanobis distance, which allows us to effectively match objects using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman Filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving $\textit{11.6%}$ HOTA improvement at $\textit{1}$ Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yaaqov Mishayev, Yonatan Sverdlov, Tal Amir, Nadav Dym
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Message Passing Neural Networks (MPNNs) are widely used for learning on graphs, but their ability to process long-range information is limited by the phenomenon of oversquashing. This limitation has led some researchers to advocate Graph Transformers as a better alternative, whereas others suggest that it can be mitigated within the MPNN framework, using virtual nodes or other rewiring techniques.
In this work, we demonstrate that oversquashing is not limited to long-range tasks, but can also arise in short-range problems. This observation allows us to disentangle two distinct mechanisms underlying oversquashing: (1) the bottleneck phenomenon, which can arise even in low-range settings, and (2) the vanishing gradient phenomenon, which is closely associated with long-range tasks.
We further show that the short-range bottleneck effect is not captured by existing explanations for oversquashing, and that adding virtual nodes does not resolve it. In contrast, transformers do succeed in such tasks, positioning them as the more compelling solution to oversquashing, compared to specialized MPNNs.
๐
2025-11-25
๐ Accepted at 40th IEEE/ACM International Conference
๐ฅ Andrea Lops, Fedelucio Narducci, Azzurra Ragone, Michelantonio Trizio, Claudio Barto
Natural Language Processing
ๆฅ็ๆ่ฆ
Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AgoneTest, an automated evaluation framework for Large Language Model-generated (LLM) unit tests in Java. AgoneTest does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the Classes2Test dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AgoneTest clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Abdullah Al Sefat
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.
๐
2025-11-25
๐ AAAI 2026
๐ฅ Zhiqing Guo, Dongdong Xi, Songlin Li, Gaobo Yang
Robotics
ๆฅ็ๆ่ฆ
Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world deployment.In contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu et al.
Natural Language ProcessingRoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance-a phenomenon exacerbated in Mixture-of-Experts models-leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya et al.
Natural Language ProcessingRoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs based on closed-loop system of generation, assessment, and self-improvement. It integrates within one unified workflow five synergistic LLM-based pipelines: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset as an audited corpus with reproducible metrics, NNGPT emits from a single prompt and validates network architecture, preprocessing code, and hyperparameters, executes them end-to-end, and learns from result. The PyTorch adapter makes NNGPT framework-agnostic, enabling strong performance: NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. The system has already generated over 5K validated models, proving NNGPT as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released for public access to enable reproducibility and facilitate community usage.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shi Jiazhao, Pan Pan, Shi Haotian
Machine Learning
ๆฅ็ๆ่ฆ
This article trained a network for perceiving three-dimensional motion information of binocular vision target, which can provide real-time three-dimensional coordinate, velocity, and acceleration, and has a basic spatiotemporal perception capability. Understood the ability of neural networks to fit nonlinear problems from the perspective of PID. Considered a single-layer neural network as using a second-order difference equation and a nonlinearity to describe a local problem. Multilayer networks gradually transform the raw representation to the desired representation through multiple such combinations. Analysed some reference principles for designing neural networks. Designed a relatively small PID convolutional neural network, with a total of 17 layers and 413 thousand parameters. Implemented a simple but practical feature reuse method by concatenation and pooling. The network was trained and tested using the simulated randomly moving ball datasets, and the experimental results showed that the prediction accuracy was close to the upper limit that the input image resolution can represent. Analysed the experimental results and errors, as well as the existing shortcomings and possible directions for improvement. Finally, discussed the advantages of high-dimensional convolution in improving computational efficiency and feature space utilization. As well as the potential advantages of using PID information to implement memory and attention mechanisms.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Patrick Kenny
Machine Learning
ๆฅ็ๆ่ฆ
We seek to clarify the concept of active inference by disentangling it from the Free Energy Principle. We show how the optimizations that need to be carried out in order to implement active inference in discrete state spaces can be formulated as constrained divergence minimization problems which can be solved by standard mean field methods that do not appeal to the idea of expected free energy. When it is used to model perception, the perception/action divergence criterion that we propose coincides with variational free energy. When it is used to model action, it differs from an expected free energy functional by an entropy regularizer.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Abhinav Joshi, Divyanshu Bhatt, Ashutosh Modi
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) show strong generalization across diverse tasks, yet the internal decision-making processes behind their predictions remain opaque. In this work, we study the geometry of hidden representations in LLMs through the lens of \textit{intrinsic dimension} (ID), focusing specifically on decision-making dynamics in a multiple-choice question answering (MCQA) setting. We perform a large-scale study, with 28 open-weight transformer models and estimate ID across layers using multiple estimators, while also quantifying per-layer performance on MCQA tasks. Our findings reveal a consistent ID pattern across models: early layers operate on low-dimensional manifolds, middle layers expand this space, and later layers compress it again, converging to decision-relevant representations. Together, these results suggest LLMs implicitly learn to project linguistic inputs onto structured, low-dimensional manifolds aligned with task-specific decisions, providing new geometric insights into how generalization and reasoning emerge in language models.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alexander Beiser, Flavio Martinelli, Wulfram Gerstner, Johanni Brea
Machine Learning
ๆฅ็ๆ่ฆ
Network weights can be reverse-engineered given enough informative samples of a network's input-output function. In a teacher-student setup, this translates into collecting a dataset of the teacher mapping -- querying the teacher -- and fitting a student to imitate such mapping. A sensible choice of queries is the dataset the teacher is trained on. But current methods fail when the teacher parameters are more numerous than the training data, because the student overfits to the queries instead of aligning its parameters to the teacher. In this work, we explore augmentation techniques to best sample the input-output mapping of a teacher network, with the goal of eliciting a rich set of representations from the teacher hidden layers. We discover that standard augmentations such as rotation, flipping, and adding noise, bring little to no improvement to the identification problem. We design new data augmentation techniques tailored to better sample the representational space of the network's hidden layers. With our augmentations we extend the state-of-the-art range of recoverable network sizes. To test their scalability, we show that we can recover networks of up to 100 times more parameters than training data-points.
๐
2025-11-25
๐ ArXiv Preprint (cs.NI)
๐ฅ Changpeng He, Yang Lu, Yanqing Xu, Chong-Yung Chi, Bo Ai et al.
Machine Learning
ๆฅ็ๆ่ฆ
This paper investigates a reconfigurable intelligent surface (RIS)-assisted multi-waveguide pinching-antenna (PA) system (PASS) for multi-user downlink information transmission, motivated by the unknown impact of the integration of emerging PASS and RIS on wireless communications. First, we formulate sum rate (SR) and energy efficiency (EE) maximization problems in a unified framework, subject to constraints on the movable region of PAs, total power budget, and tunable phase of RIS elements. Then, by leveraging a graph-structured topology of the RIS-assisted PASS, a novel three-stage graph neural network (GNN) is proposed, which learns PA positions based on user locations, and RIS phase shifts according to composite channel conditions at the first two stages, respectively, and finally determines beamforming vectors. Specifically, the proposed GNN is achieved through unsupervised training, together with three implementation strategies for its integration with convex optimization, thus offering trade-offs between inference time and solution optimality. Extensive numerical results are provided to validate the effectiveness of the proposed GNN, and to support its unique attributes of viable generalization capability, good performance reliability, and real-time applicability. Moreover, the impact of key parameters on RIS-assisted PASS is illustrated and analyzed.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shashank Kirtania, Param Biyani, Priyanshu Gupta, Yasharth Bajpai, Roshni Iyer et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Large Language Model (LLM)-based agents are increasingly applied to tasks requiring structured reasoning, tool use, and environmental adaptation, such as data manipulation, multistep planning, and computer-use automation. However, despite their versatility, current training paradigms for model weight optimization methods, like PPO and GRPO, remain relatively impractical with their high computational overhead for rollout convergence. In addition, the resulting agent policies are difficult to interpret, adapt, or incrementally improve. To address this, we investigate creating and refining structured memory of experiential learning of an agent from its environment as an alternative route to agent optimization. We introduce BREW (Bootstrapping expeRientially-learned Environmental knoWledge), a framework for agent optimization for downstream tasks via KB construction and refinement. In our formulation, we introduce an effective method for partitioning agent memory for more efficient retrieval and refinement. BREW uses task graders and behavior rubrics to learn insights while leveraging state-space search for ensuring robustness from the noise and non-specificity in natural language. Empirical results on real world, domain-grounded benchmarks -- OSWorld, $ฯ^2$Bench, and SpreadsheetBench -- show BREW achieves $10-20\%$ improvement in task precision, $10-15\%$ reduction in API/tool calls leading to faster execution time, all while maintaining computational efficiency on par with base models. Unlike prior work where memory is treated as static context, we establish the KB as a modular and controllable substrate for agent optimization -- an explicit lever for shaping behavior in a transparent, interpretable, and extensible manner.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Baoshun Shi, Ke Jiang, Qiusheng Lian, Xinran Yu, Huazhu Fu
Machine Learning
ๆฅ็ๆ่ฆ
Despite significant advancements in deep learning-based sparse-view computed tomography (SVCT) reconstruction algorithms, these methods still encounter two primary limitations: (i) It is challenging to explicitly prove that the prior networks of deep unfolding algorithms satisfy Lipschitz constraints due to their empirically designed nature. (ii) The substantial storage costs of training a separate model for each setting in the case of multiple views hinder practical clinical applications. To address these issues, we elaborate an explicitly provable Lipschitz-constrained network, dubbed LipNet, and integrate an explicit prompt module to provide discriminative knowledge of different sparse sampling settings, enabling the treatment of multiple sparse view configurations within a single model. Furthermore, we develop a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction, termed PromptCT, which embeds LipNet as its prior network to ensure the convergence of its corresponding iterative algorithm. In simulated and real data experiments, PromptCT outperforms benchmark reconstruction algorithms in multiple-in-one SVCT reconstruction, achieving higher-quality reconstructions with lower storage costs. On the theoretical side, we explicitly demonstrate that LipNet satisfies boundary property, further proving its Lipschitz continuity and subsequently analyzing the convergence of the proposed iterative algorithms. The data and code are publicly available at https://github.com/shibaoshun/PromptCT.
๐
2025-11-25
๐ ArXiv Preprint (cs.DB)
๐ฅ Chaowei He, Yuanjun Liu, Qingzhi Ma, Shenyuan Ren, Xizhao Luo et al.
ๆฅ็ๆ่ฆ
Machine unlearning in learned cardinality estimation (CE) systems presents unique challenges due to the complex distributional dependencies in multi-table relational data. Specifically, data deletion, a core component of machine unlearning, faces three critical challenges in learned CE models: attribute-level sensitivity, inter-table propagation and domain disappearance leading to severe overestimation in multi-way joins. We propose Cardinality Estimation Pruning (CEP), the first unlearning framework specifically designed for multi-table learned CE systems. CEP introduces Distribution Sensitivity Pruning, which constructs semi-join deletion results and computes sensitivity scores to guide parameter pruning, and Domain Pruning, which removes support for value domains entirely eliminated by deletion. We evaluate CEP on state-of-the-art architectures NeuroCard and FACE across IMDB and TPC-H datasets. Results demonstrate CEP consistently achieves the lowest Q-error in multi-table scenarios, particularly under high deletion ratios, often outperforming full retraining. Furthermore, CEP significantly reduces convergence iterations, incurring negligible computational overhead of 0.3%-2.5% of fine-tuning time.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mingyu Jeon, Jaeyoung Suh, Suwan Cho
Natural Language Processing
ๆฅ็ๆ่ฆ
Schema matching is a critical task in data integration, particularly in the medical domain where disparate Electronic Health Record (EHR) systems must be aligned to standard models like OMOP CDM. While Large Language Models (LLMs) have shown promise in schema matching, they suffer from hallucination and lack of up-to-date domain knowledge. Knowledge Graphs (KGs) offer a solution by providing structured, verifiable knowledge. However, existing KG-augmented LLM approaches often rely on inefficient complex multi-hop queries or storage-intensive vector-based retrieval methods. This paper introduces SMoG (Schema Matching on Graph), a novel framework that leverages iterative execution of simple 1-hop SPARQL queries, inspired by successful strategies in Knowledge Graph Question Answering (KGQA). SMoG enhances explainability and reliability by generating human-verifiable query paths while significantly reducing storage requirements by directly querying SPARQL endpoints. Experimental results on real-world medical datasets demonstrate that SMoG achieves performance comparable to state-of-the-art baselines, validating its effectiveness and efficiency in KG-augmented schema matching.
๐
2025-11-25
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Friederike Groschupp, Daniele Lain, Aritra Dhar, Lara Magdalena Lazier, Srdjan ฤapkun
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Precise access control decisions are crucial to the security of both traditional applications and emerging agent-based systems. Typically, these decisions are made by users during app installation or at runtime. Due to the increasing complexity and automation of systems, making these access control decisions can add a significant cognitive load on users, often overloading them and leading to suboptimal or even arbitrary access control decisions. To address this problem, we propose to leverage the processing and reasoning capabilities of large language models (LLMs) to make dynamic, context-aware decisions aligned with the user's security preferences. For this purpose, we conducted a user study, which resulted in a dataset of 307 natural-language privacy statements and 14,682 access control decisions made by users. We then compare these decisions against those made by two versions of LLMs: a general and a personalized one, for which we also gathered user feedback on 1,446 of its decisions.
Our results show that in general, LLMs can reflect users' preferences well, achieving up to 86\% accuracy when compared to the decision made by the majority of users. Our study also reveals a crucial trade-off in personalizing such a system: while providing user-specific privacy preferences to the LLM generally improves agreement with individual user decisions, adhering to those preferences can also violate some security best practices. Based on our findings, we discuss design and risk considerations for implementing a practical natural-language-based access control system that balances personalization, security, and utility.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yiheng Zhang, Shaowu Wu, Yuanzhuo Xu, Jiajun Wu, Shang Xu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity
, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs) (the building blocks of a transformer architecture) as indivisible units, overlooking possibilities of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in a computational graph, that are previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
๐
2025-11-25
๐ Accepted to 2025 IEEE International Conference on
๐ฅ Zhiguo Zhang, Xiaoliang Ma, Daniel Schlesinger
ๆฅ็ๆ่ฆ
Accurate and interpretable air pollution forecasting is crucial for public health, but most models face a trade-off between performance and interpretability. This study proposes a physics-guided, interpretable-by-design spatiotemporal learning framework. The model decomposes the spatiotemporal behavior of air pollutant concentrations into two transparent, additive modules. The first is a physics-guided transport kernel with directed weights conditioned on wind and geography (advection). The second is an explainable attention mechanism that learns local responses and attributes future concentrations to specific historical lags and exogenous drivers. Evaluated on a comprehensive dataset from the Stockholm region, our model consistently outperforms state-of-the-art baselines across multiple forecasting horizons. Our model's integration of high predictive performance and spatiotemporal interpretability provides a more reliable foundation for operational air-quality management in real-world applications.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alexander C. Jenke, Gregor Just, Claas de Boer, Martin Wagner, Sebastian Bodenstedt et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that indicates the state of each robotic arm, including the activation of the endoscope arm. Detecting this activation provides valuable metadata such as camera movement information, which can support downstream surgical data science tasks including tool tracking, skill assessment, or camera control automation.
Methods: We developed a lightweight pipeline based on a ResNet18 convolutional neural network to automatically identify the position of the camera tile and its activation state within the DaVinci Xi UI. The model was fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated across three public datasets comprising over 70,000 frames.
Results: The model achieved F1-scores between 0.993 and 1.000 for the binary detection of active cameras and correctly localized the camera tile in all cases without false multiple-camera detections.
Conclusion: The proposed pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, trained models, and annotations are publicly available.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniel Kienzle, Katja Ludwig, Julian Lorenz, Shin'ichi Satoh, Rainer Lienhart
ๆฅ็ๆ่ฆ
Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Szymon Bobek, ลukasz Baลec, Grzegorz J. Nalepa
Machine Learning
ๆฅ็ๆ่ฆ
Counterfactual explanations enhance the actionable interpretability of machine learning models by identifying the minimal changes required to achieve a desired outcome of the model. However, existing methods often ignore the complex dependencies in real-world datasets, leading to unrealistic or impractical modifications. Motivated by cybersecurity applications in the email marketing domain, we propose a method for generating Diverse, Actionable, and kNowledge-Constrained Explanations (DANCE), which incorporates feature dependencies and causal constraints to ensure plausibility and real-world feasibility of counterfactuals. Our method learns linear and nonlinear constraints from data or integrates expert-provided dependency graphs, ensuring counterfactuals are plausible and actionable. By maintaining consistency with feature relationships, the method produces explanations that align with real-world constraints. Additionally, it balances plausibility, diversity, and sparsity, effectively addressing key limitations in existing algorithms. The work is developed based on a real-life case study with Freshmail, the largest email marketing company in Poland and supported by a joint R&D project Sendguard. Furthermore, we provide an extensive evaluation using 140 public datasets, which highlights its ability to generate meaningful, domain-relevant counterfactuals that outperform other existing approaches based on widely used metrics. The source code for reproduction of the results can be found in a GitHub repository we provide.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Olivier Moulin, Vincent Francois-lavet, Paul Elbers, Mark Hoogendoorn
Machine Learning
ๆฅ็ๆ่ฆ
Generalizability of Reinforcement Learning (RL) agents (ability to perform on environments different from the ones they have been trained on) is a key problem as agents have the tendency to overfit to their training environments. In order to address this problem and offer a solution to increase the generalizability of RL agents, we introduce a new methodology to predict the generalizability score of RL agents based on the internal weights of the agent's neural networks. Using this prediction capability, we propose some changes in the Proximal Policy Optimization (PPO) loss function to boost the generalization score of the agents trained with this upgraded version. Experimental results demonstrate that our improved PPO algorithm yields agents with stronger generalizability compared to the original version.
๐
2025-11-25
๐ ArXiv Preprint (cs.SD)
๐ฅ Rui Lin, Zhiyue Wu, Jiahe Le, Kangdi Wang, Weixiong Chen et al.
ๆฅ็ๆ่ฆ
Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction with difficult-to-model acoustic tokens or compress aggressively into semantic tokens that are LM-friendly but lossy, and they rarely make the tokenizer itself aware of dual-track structure. Duo-Tok follows a four-stage, SSL-centered pipeline: we first pretrain a BEST-RQ-style encoder on large-scale audio, then stabilize and factorize the representation with Gaussian replacement noise and multi-task supervision, before freezing the encoder to learn SimVQ-based dual codebooks with hard routing for vocals and accompaniment, and finally training latent diffusion decoders on top of the discrete tokens. Duo-Tok at 0.75 kbps shifts the empirical reconstruction-generation Pareto frontier, achieving the best music-tagging AP and the lowest vocabulary-normalized LM perplexity among compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haebin Seong, Sungmin Kim, Minchan Kim, Yongjun Cho, Myunchul Joe et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Existing navigation benchmarks focus on task success metrics while overlooking economic viability -- critical for commercial deployment of autonomous delivery robots. We introduce \emph{CostNav}, a \textbf{Micro-Navigation Economic Testbed} that evaluates embodied agents through comprehensive cost-revenue analysis aligned with real-world business operations. CostNav models the complete economic lifecycle including hardware, training, energy, maintenance costs, and delivery revenue with service-level agreements, using industry-derived parameters. \textbf{To our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability}, revealing that optimizing for task success fundamentally differs from optimizing for economic deployment. Our cost model uses parameters derived from industry data sources (energy rates, delivery service pricing), and we project from a reduced-scale simulation to realistic deliveries. Under this projection, the baseline achieves 43.0\% SLA compliance but is \emph{not} commercially viable: yielding a loss of \$30.009 per run with no finite break-even point, because operating costs are dominated by collision-induced maintenance, which accounts for 99.7\% of per-run costs and highlights collision avoidance as a key optimization target. We demonstrate a learning-based on-device navigation baseline and establish a foundation for evaluating rule-based navigation, imitation learning, and cost-aware RL training. CostNav bridges the gap between navigation research and commercial deployment, enabling data-driven decisions about economic trade-offs across navigation paradigms.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Yu, Jiabo Zhan, Zile Wang, Jinglin Wang, Huaisong Zhang et al.
Natural Language ProcessingRoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. Jointly training OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yitian Huang, Yuxuan Lei, Jianxun Lian, Hao Liao
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
This report presents the solution and results of our team MSRA\_SC in the Commonsense Persona-Grounded Dialogue Challenge (CPDC 2025). We propose a simple yet effective framework that unifies improvements across both GPU Track and API Track. Our method centers on two key components. First, Context Engineering applies dynamic tool pruning and persona clipping for input compression, combined with post-processing techniques such as parameter normalization and function merging. Together with manually refined prompts, this design improves tool call stability, execution reliability, and role-playing guidance. Second, in the GPU Track, we further adopt GRPO training, replacing supervised fine-tuning with reinforcement learning directly optimized by reward signals. This mitigates small-sample overfitting and significantly enhances task-oriented dialogue performance. In the final evaluation, our team ranks 1st in Task 2 API, 2nd in Task 1 API, and 3rd in both Task 3 API and GPU track, demonstrating the effectiveness of our approach. Our code is publicly available at https://gitlab.aicrowd.com/nikoo_yu/cpdc-2025-winning-solution
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhen Zeng, Leijiang Gu, Zhangling Duan, Feng Li, Zenglin Shi et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) achieve remarkable capabilities but can inadvertently memorize privacy-sensitive information. Although existing unlearning methods can remove such knowledge, they fail to achieve benign forgetting because they often degrade the model's general image understanding performance. To address this, we propose the Sculpted Memory Forgetting Adapter (SMFA), which confines forgetting to targeted memory regions while preserving overall capabilities. SMFA first fine-tunes the model to replace sensitive responses with refusals, yielding a memory forgetting adapter, and then applies a retaining anchor-guided masking mechanism to prevent interference with unrelated knowledge and understanding ability. To systematically evaluate selective MLLM unlearning, we introduce S-MLLMUn Bench, the first benchmark designed to jointly assess the removal of sensitive knowledge and retention of general visual understanding. Extensive experiments show that, unlike prior methods, SMFA achieves precise and controllable unlearning while maintaining the model's foundational image understanding.
๐
2025-11-25
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Veith Weilnhammer, Jefferson Ortega, David Whitney
ๆฅ็ๆ่ฆ
Scalable assessments of mental illness, the leading driver of disability worldwide, remain a critical roadblock toward accessible and equitable care. Here, we show that human-computer interactions encode multiple dimensions of self-reported mental health and their changes over time.
We introduce MAILA, a MAchine-learning framework for Inferring Latent mental states from digital Activity. We trained MAILA to predict 1.3 million mental-health self-reports from 20,000 cursor and touchscreen recordings recorded in 9,000 online participants. The dataset includes 2,000 individuals assessed longitudinally, 1,500 diagnosed with depression, and 500 with obsessive-compulsive disorder. MAILA tracks dynamic mental states along three orthogonal dimensions, generalizes across contexts, and achieves near-ceiling accuracy when predicting group-level mental health. The model translates from general to clinical populations, identifies individuals living with mental illness, and captures signatures of psychological function that are not conveyed by language.
Our results demonstrate how everyday human-computer interactions can power passive, reliable, dynamic, and maximally scalable mental health assessments. The ability to decode mental states at zero marginal cost sets new benchmarks for precision medicine and public health, while raising important questions about privacy, agency, and autonomy online.
๐
2025-11-25
๐ SIGMOD
๐ฅ Xinjun Yang, Qingda Hu, Junru Li, Feifei Li, Yuqi Zhou et al.
ๆฅ็ๆ่ฆ
The rapid increase in LLM model sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated serving systems. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity necessitates reliance on host memory (CPU DRAM) to support larger working sets such as the KVCache. However, the maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems often adopt RDMA-based disaggregated memory pools, which introduce significant challenges including high access latency, complex communication protocols, and synchronization overhead. Fortunately, the emerging CXL technology introduces new opportunities in KVCache design. In this paper, we propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale memory pool through CXL switches. By supporting native load/store access semantics over the CXL fabric, our design delivers near-local memory latency, while reducing programming complexity and minimizing synchronization overhead. We conduct a systematic characterization of a commercial CXL switch-based memory pool and propose a set of design guidelines. Based on Beluga, we design and implement Beluga-KVCache, a system tailored for managing the large-scale KVCache in LLM inference. Beluga-KVCache achieves an 89.6% reduction in Time-To-First-Token (TTFT) and 7.35x throughput improvement in the vLLM inference engine compared to RDMA-based solutions. To the best of our knowledge, Beluga is the first system that enables GPUs to directly access large-scale memory pools through CXL switches, marking a significant step toward low-latency, shared access to vast memory resources by GPUs.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Riccardo Zaccone, Sai Praneeth Karimireddy, Carlo Masone
Machine Learning
ๆฅ็ๆ่ฆ
Recent works have explored the use of momentum in local methods to enhance distributed SGD. This is particularly appealing in Federated Learning (FL), where momentum intuitively appears as a solution to mitigate the effects of statistical heterogeneity. Despite recent progress in this direction, it is still unclear if momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios, where only some workers participate at each round. In this work we analyze momentum under cyclic client participation, and theoretically prove that it remains inevitably affected by statistical heterogeneity. Similarly to SGD, we prove that decreasing step-sizes do not help either: in fact, any schedule decreasing faster than $ฮ\left(1/t\right)$ leads to convergence to a constant value that depends on the initialization and the heterogeneity bound. Numerical results corroborate the theory, and deep learning experiments confirm its relevance for realistic settings.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniel Harari, Michael Sidorov, Liel David, Chen Shterental, Abrham Kahsay Gebreselasie et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first of its kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wen-Fang Su, Hsiao-Wei Chou, Wen-Yang Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation, as traditional methods often missegment or entirely miss cross-sentence discontinuous entities, significantly affecting recognition accuracy. Therefore, we aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction due to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to enhance their ability to recognize discontinuous entities and handle segmentation challenges. Experimental results demonstrate that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, leading to decreased performance. In contrast, our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aleksei Samarin, Artem Nazarenko, Egor Kotenko, Valentin Malykh, Alexander Savelev et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a novel approach to neural network compression that addresses redundancy at both the filter and architectural levels through a unified framework grounded in information flow analysis. Building on the concept of tensor flow divergence, which quantifies how information is transformed across network layers, we develop a two-stage optimization process. The first stage employs iterative divergence-aware pruning to identify and remove redundant filters while preserving critical information pathways. The second stage extends this principle to higher-level architecture optimization by analyzing layer-wise contributions to information propagation and selectively eliminating entire layers that demonstrate minimal impact on network performance. The proposed method naturally adapts to diverse architectures, including convolutional networks, transformers, and hybrid designs, providing a consistent metric for comparing the structural importance across different layer types. Experimental validation across multiple modern architectures and datasets reveals that this combined approach achieves substantial model compression while maintaining competitive accuracy. The presented approach achieves parameter reduction results that are globally comparable to those of state-of-the-art solutions and outperforms them across a wide range of modern neural network architectures, from convolutional models to transformers. The results demonstrate how flow divergence serves as an effective guiding principle for both filter-level and layer-level optimization, offering practical benefits for deployment in resource-constrained environments.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jason Lo, Mohammadnima Jafari
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
A wiring diagram is a labeled directed graph that represents an abstract concept such as a temporal process. In this article, we introduce the notion of a quasi-skeleton wiring diagram graph, and prove that quasi-skeleton wiring diagram graphs correspond to Hasse diagrams. Using this result, we designed algorithms that extract wiring diagrams from sequential data. We used our algorithms in analyzing the behavior of an autonomous agent playing a computer game, and the algorithms correctly identified the winning strategies. We compared the performance of our main algorithm with two other algorithms based on standard clustering techniques (DBSCAN and agglomerative hierarchical), including when some of the data was perturbed. Overall, this article brings together techniques in category theory, graph theory, clustering, reinforcement learning, and data engineering.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Somsubhra De, Harsh Kumar, Arun Prakash A
Natural Language Processing
ๆฅ็ๆ่ฆ
Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in transformer-based models and large annotated datasets have greatly improved GEC performance for high-resource languages such as English, the progress has not extended equally. For most Indic languages, GEC remains a challenging task due to limited resources, linguistic diversity and complex morphology. In this work, we explore prompting-based approaches using state-of-the-art large language models (LLMs), such as GPT-4.1, Gemini-2.5 and LLaMA-4, combined with few-shot strategy to adapt them to low-resource settings. We observe that even basic prompting strategies, such as zero-shot and few-shot approaches, enable these LLMs to substantially outperform fine-tuned Indic-language models like Sarvam-22B, thereby illustrating the exceptional multilingual generalization capabilities of contemporary LLMs for GEC. Our experiments show that carefully designed prompts and lightweight adaptation significantly enhance correction quality across multiple Indic languages. We achieved leading results in the shared task--ranking 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Johannes Brandt, Maulik Chevli, Rickmer Braren, Georgios Kaissis, Philip Mรผller et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Lung cancer risk estimation is gaining increasing importance as more countries introduce population-wide screening programs using low-dose CT (LDCT). As imaging volumes grow, scalable methods that can process entire lung volumes efficiently are essential to tap into the full potential of these large screening datasets. Existing approaches either over-rely on pixel-level annotations, limiting scalability, or analyze the lung in fragments, weakening performance. We present LungEvaty, a fully transformer-based framework for predicting 1-6 year lung cancer risk from a single LDCT scan. The model operates on whole-lung inputs, learning directly from large-scale screening data to capture comprehensive anatomical and pathological cues relevant for malignancy risk. Using only imaging data and no region supervision, LungEvaty matches state-of-the-art performance, refinable by an optional Anatomically Informed Attention Guidance (AIAG) loss that encourages anatomically focused attention. In total, LungEvaty was trained on more than 90,000 CT scans, including over 28,000 for fine-tuning and 6,000 for evaluation. The framework offers a simple, data-efficient, and fully open-source solution that provides an extensible foundation for future research in longitudinal and multimodal lung cancer risk prediction.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Craig Dickson
Natural Language Processing
ๆฅ็ๆ่ฆ
Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment - a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some models showed more resistance than others. Specifically the Qwen-2.5 family proved to be relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate if current-generation open-weights models exhibit similar resistance to the Qwen-2.5 family and measure misalignment robustness over a range of model architectures and scales.
We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o's 20%.
We identify a critical format-dependent vulnerability: requiring JSON output doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model's 'degrees of freedom' to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than observed in proprietary systems.
๐
2025-11-25
๐ ArXiv Preprint (cs.CY)
๐ฅ Giovanni Spitale, Federico Germani
ๆฅ็ๆ่ฆ
Advances in artificial intelligence now make it possible to simulate the dead through chatbots, voice clones, and video avatars trained on a person's digital traces. These "digital ghosts" are moving from fiction to commercial reality, reshaping how people mourn and remember. This paper offers a conceptual and ethical analysis of AI-mediated digital afterlives. We define what counts as a digital ghost, trace their rise across personal, commercial, and institutional contexts, and identify core ethical tensions around grief and well-being, truthfulness and deception, consent and posthumous privacy, dignity and misrepresentation, and the commercialization of mourning. To analyze these challenges, we propose a nine-dimensional taxonomy of digital afterlife technologies and, building on it, outline the features of an ethically acceptable digital ghost: premortem intent, mutual consent, transparent and limited data use, clear disclosure, restricted purposes and access, family or estate stewardship, and minimal behavioral agency. We argue for targeted regulation and professional guidelines to ensure that digital ghosts can aid remembrance without slipping into forms of deception.
๐
2025-11-25
๐ ArXiv Preprint (cs.AR)
๐ฅ Zizhang Luo, Fan Cui, Kexing Zhou, Runlin Guo, Mile Xia et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Repairing RTL bugs is crucial for hardware design and verification. Traditional automatic program repair (APR) methods define dedicated search spaces to locate and fix bugs with program synthesis. However, they heavily rely on fixed templates and can only deal with limited bugs. As an alternative, Large Language Models with the ability to understand code semantics can be explored for RTL repair. However, they suffer from unreliable outcomes due to inherent randomness and long input contexts of RTL code and waveform. To address these challenges, we propose R3A, an LLM-based automatic RTL program repair framework upon the basic model to improve reliability. R3A proposes the stochastic Tree-Of-Thoughts method to control a patch generation agent to explore a validated solution for the bug. The algorithm samples search states according to a heuristic function to balance between exploration and exploitation for a reliable outcome. Besides, R3A proposes a multi-agent fault localization method to find fault candidates as the starting points for the patch generation agent, further increasing the reliability. Experiments show R3A can fix 90.6% of bugs in the RTL-repair dataset within a given time limit, which covers 45% more bugs than traditional methods and other LLM-based approaches, while achieving an 86.7% pass@5 rate on average, showing a high reliability.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arianna Stropeni, Valentina Zaccaria, Francesco Borsatti, Davide Dalle Pezze, Manuel Barusco et al.
ๆฅ็ๆ่ฆ
In recent years, Visual Anomaly Detection (VAD) has gained significant attention due to its ability to identify anomalous images using only normal images during training. Many VAD models work without supervision but are still able to provide visual explanations by highlighting the anomalous regions within an image. However, although these visual explanations can be helpful, they lack a direct and semantically meaningful interpretation for users. To address this limitation, we propose extending Concept Bottleneck Models (CBMs) to the VAD setting. By learning meaningful concepts, the network can provide human-interpretable descriptions of anomalies, offering a novel and more insightful way to explain them. Our contributions are threefold: (i) we develop a Concept Dataset to support research on CBMs for VAD; (ii) we improve the CBM architecture to generate both concept-based and visual explanations, bridging semantic and localization interpretability; and (iii) we introduce a pipeline for synthesizing artificial anomalies, preserving the VAD paradigm of minimizing dependence on rare anomalous samples. Our approach, Concept-Aware Visual Anomaly Detection (CONVAD), achieves performance comparable to classic VAD methods while providing richer, concept-driven explanations that enhance interpretability and trust in VAD systems.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chujie Wang, Zhiyuan Luo, Ruiqi Liu, Can Ran, Shenghua Fan et al.
Multimodal
ๆฅ็ๆ่ฆ
The current remote sensing image analysis task is increasingly evolving from traditional object recognition to complex intelligence reasoning, which places higher requirements on the model's reasoning ability and the flexibility of tool invocation. To this end, we propose a new multimodal agent framework, Vision-Interleaved Chain-of-Thought Framework (VICoT), which implements explicit multi-round reasoning by dynamically incorporating visual tools into the chain of thought. Through a stack-based reasoning structure and a modular MCP-compatible tool suite, VICoT enables LLMs to efficiently perform multi-round, interleaved vision-language reasoning tasks with strong generalization and flexibility.We also propose the Reasoning Stack distillation method to migrate complex Agent behaviors to small, lightweight models, which ensures the reasoning capability while significantly reducing complexity. Experiments on multiple remote sensing benchmarks demonstrate that VICoT significantly outperforms existing SOTA frameworks in reasoning transparency, execution efficiency, and generation quality.
๐
2025-11-25
๐ AAAI 2026
๐ฅ Marta Sumyk, Oleksandr Kosovan
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Computer Use Agents (CUAs) are designed to autonomously operate digital interfaces, yet they often fail to reliably determine whether a given task has been completed. We present an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. Our dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios. Our framework achieves up to 73 percent accuracy in task success detection and yields an average relative improvement of 27 percent in overall task success when evaluator feedback is applied. These results show that vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer-use agents.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zixiao Huang, Wen Zeng, Tianyu Fu, Tengxuan Liu, Yizhou Sun et al.
ๆฅ็ๆ่ฆ
LLM-based search agents achieve strong performance but suffer from severe latency, as each step requires serialized LLM reasoning followed by action of tool execution. We revisit this bottleneck through the lens of speculation. While traditional predict-verify speculation paradigm can break serial execution, its benefit remains limited, as it retains the full original workload and adds extra inference overhead. We observe that early agent steps often involve simple evidence-gathering, where correct actions can often be predicted without full reasoning. Building on these observations, we present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency. Algorithmically, SPAgent introduces a two-phase adaptive speculation mechanism that selectively omits verification when safe. System-wise, a two-level scheduler regulates speculative requests based on engine load to ensure speculation remains beneficial. We implement SPAgent in real-world systems. Across extensive experimental settings, SPAgent achieves up to $1.65\times$ end-to-end speedup while maintaining same or even achieving higher accuracy, enabling practical deployment of multi-step search agents.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Petr Molodyk, Jaemoo Choi, David W. Romero, Ming-Yu Liu, Yongxin Chen
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, point cloud generation has gained significant attention in 3D generative modeling. Among existing approaches, point-based methods directly generate point clouds without relying on other representations such as latent features, meshes, or voxels. These methods offer low training cost and algorithmic simplicity, but often underperform compared to representation-based approaches. In this paper, we propose MFM-Point, a multi-scale Flow Matching framework for point cloud generation that substantially improves the scalability and performance of point-based methods while preserving their simplicity and efficiency. Our multi-scale generation algorithm adopts a coarse-to-fine generation paradigm, enhancing generation quality and scalability without incurring additional training or inference overhead. A key challenge in developing such a multi-scale framework lies in preserving the geometric structure of unordered point clouds while ensuring smooth and consistent distributional transitions across resolutions. To address this, we introduce a structured downsampling and upsampling strategy that preserves geometry and maintains alignment between coarse and fine resolutions. Our experimental results demonstrate that MFM-Point achieves best-in-class performance among point-based methods and challenges the best representation-based methods. In particular, MFM-point demonstrates strong results in multi-category and high-resolution generation tasks.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seungjun Yu, Seonho Lee, Namho Kim, Jaeyo Shin, Junsung Park et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents.
๐
2025-11-25
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Sian Heesom-Green, Jonathan Shock, Geoff Nitschke
Machine Learning
ๆฅ็ๆ่ฆ
The Cognitive Buffer Hypothesis (CBH) posits that larger brains evolved to enhance survival in changing conditions. However, larger brains also carry higher energy demands, imposing additional metabolic burdens. Alongside brain size, brain organization plays a key role in cognitive ability and, with suitable architectures, may help mitigate energy challenges. This study evolves Artificial Neural Networks (ANNs) used by Reinforcement Learning (RL) agents to investigate how environmental variability and energy costs influence the evolution of neural complexity, defined in terms of ANN size and structure. Results indicate that under energy constraints, increasing seasonality led to smaller ANNs. This challenges CBH and supports the Expensive Brain Hypothesis (EBH), as highly seasonal environments reduced net energy intake and thereby constrained brain size. ANN structural complexity primarily emerged as a byproduct of size, where energy costs promoted the evolution of more efficient networks. These results highlight the role of energy constraints in shaping neural complexity, offering in silico support for biological theory and energy-efficient robotic design.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanzhe Li, Hang Zhong, Steffen Mรผller
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Pedestrian crossing intention prediction is essential for autonomous vehicles to improve pedestrian safety and reduce traffic accidents. However, accurate pedestrian intention prediction in urban environments remains challenging due to the multitude of factors affecting pedestrian behavior. In this paper, we propose a multi-context fusion Transformer (MFT) that leverages diverse numerical contextual attributes across four key dimensions, encompassing pedestrian behavior context, environmental context, pedestrian localization context and vehicle motion context, to enable accurate pedestrian intention prediction. MFT employs a progressive fusion strategy, where mutual intra-context attention enables reciprocal interactions within each context, thereby facilitating feature sequence fusion and yielding a context token as a context-specific representation. This is followed by mutual cross-context attention, which integrates features across contexts with a global CLS token serving as a compact multi-context representation. Finally, guided intra-context attention refines context tokens within each context through directed interactions, while guided cross-context attention strengthens the global CLS token to promote multi-context fusion via guided information propagation, yielding deeper and more efficient integration. Experimental results validate the superiority of MFT over state-of-the-art methods, achieving accuracy rates of 73%, 93%, and 90% on the JAADbeh, JAADall, and PIE datasets, respectively. Extensive ablation studies are further conducted to investigate the effectiveness of the network architecture and contribution of different input context. Our code is open-source: https://github.com/ZhongHang0307/Multi-Context-Fusion-Transformer.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanzhe Li, Steffen Mรผller
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Pedestrian crossing intention prediction is essential for the deployment of autonomous vehicles (AVs) in urban environments. Ideal prediction provides AVs with critical environmental cues, thereby reducing the risk of pedestrian-related collisions. However, the prediction task is challenging due to the diverse nature of pedestrian behavior and its dependence on multiple contextual factors. This paper proposes a multimodal fusion network that leverages seven modality features from both visual and motion branches, aiming to effectively extract and integrate complementary cues across different modalities. Specifically, motion and visual features are extracted from the raw inputs using multiple Transformer-based extraction modules. Depth-guided attention module leverages depth information to guide attention towards salient regions in another modality through comprehensive spatial feature interactions. To account for the varying importance of different modalities and frames, modality attention and temporal attention are designed to selectively emphasize informative modalities and effectively capture temporal dependencies. Extensive experiments on the JAAD dataset validate the effectiveness of the proposed network, achieving superior performance compared to the baseline methods.
๐
2025-11-25
๐ ArXiv Preprint (eess.AS)
๐ฅ Sungjae Kim, Kihyun Na, Jinyoung Choi, Injung Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic Pitch Correction (APC) enhances vocal recordings by aligning pitch deviations with the intended musical notes. However, existing APC systems either rely on reference pitches, which limits their practical applicability, or employ simple pitch estimation algorithms that often fail to preserve expressiveness and naturalness. We propose BERT-APC, a novel reference-free APC framework that corrects pitch errors while maintaining the natural expressiveness of vocal performances. In BERT-APC, a novel stationary pitch predictor first estimates the perceived pitch of each note from the detuned singing voice. A context-aware note pitch predictor estimates the intended pitch sequence by leveraging a music language model repurposed to incorporate musical context. Finally, a note-level correction algorithm fixes pitch errors while preserving intentional pitch deviations for emotional expression. In addition, we introduce a learnable data augmentation strategy that improves the robustness of the music language model by simulating realistic detuning patterns. Compared to two recent singing voice transcription models, BERT-APC demonstrated superior performance in note pitch prediction, outperforming the second-best model, ROSVOT, by 10.49%p on highly detuned samples in terms of the raw pitch accuracy. In the MOS test, BERT-APC achieved the highest score of $4.32 \pm 0.15$, which is significantly higher than those of the widely-used commercial APC tools, AutoTune ($3.22 \pm 0.18$) and Melodyne ($3.08 \pm 0.18$), while maintaining a comparable ability to preserve expressive nuances. To the best of our knowledge, this is the first APC model that leverages a music language model to achieve reference-free pitch correction with symbolic musical context. The corrected audio samples of BERT-APC are available online.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Peining Zhang, Hongchen Qin, Haochen Zhang, Ziqi Guo, Guiling Wang et al.
ๆฅ็ๆ่ฆ
This work investigates the zero-shot forecasting capability of time-series foundation models for Leaf Area Index (LAI) forecasting in agricultural monitoring. Using the HiQ dataset (U.S., 2000-2022), we systematically compare statistical baselines, a fully supervised LSTM, and the Sundial foundation model under multiple evaluation protocols. We find that Sundial, in the zero-shot setting, can outperform a fully trained LSTM provided that the input context window is sufficiently long-specifically, when covering more than one or two full seasonal cycles. This demonstrates, for the first time, that a general-purpose foundation model can surpass specialized supervised models on remote-sensing time series prediction without any task-specific tuning. These results highlight the strong potential of pretrained time-series foundation models to serve as effective plug-and-play forecasters in agricultural and environmental applications.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang et al.
Natural Language ProcessingRoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Conventional adversarial attacks focus on manipulating a single decision of neural networks. However, real-world models often operate in a sequence of decisions, where an isolated mistake can be easily corrected, but cascading errors can lead to severe risks.
This paper reveals a novel threat: a single perturbation can hijack the whole decision chain. We demonstrate the feasibility of manipulating a model's outputs toward multiple, predefined outcomes, such as simultaneously misclassifying "non-motorized lane" signs as "motorized lane" and "pedestrian" as "plastic bag".
To expose this threat, we introduce Semantic-Aware Universal Perturbations (SAUPs), which induce varied outcomes based on the semantics of the inputs. We overcome optimization challenges by developing an effective algorithm, which searches for perturbations in normalized space with a semantic separation strategy. To evaluate the practical threat of SAUPs, we present RIST, a new real-world image dataset with fine-grained semantic annotations. Extensive experiments on three multimodal large language models demonstrate their vulnerability, achieving a 70% attack success rate when controlling five distinct targets using just an adversarial frame.
๐
2025-11-25
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Anton Lyubinin
ๆฅ็ๆ่ฆ
We are extending Popularity Bias Memorization theorem from arXiv:archive/2404.12008 in several directions. We extend it to arbitrary degree distributions and also prove both upper and lower estimates for the alignment with top-k singular hyperspace.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mihir Sahasrabudhe
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report a "reversal curse," and recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself? We cut through this ambiguity with a fully synthetic, entropy-controlled benchmark designed as a clean-room stress test for directional learning. Using random string mappings with tunable branching factor K, we construct forward tasks with zero conditional entropy and inverse tasks with analytically determined entropy floors. Excess loss above these floors reveals that even scratch-trained GPT-2 models exhibit a strong, reproducible directional optimization gap (e.g., 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initializations shift optimization behavior but do not eliminate this gap, while LoRA encounters a sharp capacity wall on high-entropy inverse mappings. Together, these results isolate a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training-one that persists even when linguistic priors, token frequencies, and corpus-level temporal asymmetries are removed. Our benchmark provides a controlled instrument for dissecting directional biases in modern sequence models and motivates deeper mechanistic study of why inversion remains fundamentally harder for Transformers.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lianming Huang, Haibo Hu, Qiao Li, Nan Guan, Chun Jason Xue
Robotics
ๆฅ็ๆ่ฆ
Sparsity is essential for deploying large models on resource constrained edge platforms. However, optimizing sparsity patterns for individual tasks in isolation ignores the significant I/O overhead incurred during frequent task switching. We introduce an on-demand multi-task sparsity framework specifically designed to minimize switching costs by maximizing parameter reuse. Unlike monolithic approaches, we decompose weights into reusable block-granular units and align sparse structures across tasks to maximize overlap. By dynamically loading only the small differential set of blocks required for the next task, our method effectively mitigates the cold-start latency inherent in traditional monolithic approaches.Experiments on a real-world autonomous driving platform demonstrate that our framework achieves superior switching efficiency, accelerating task switching by over 6.6X on average compared to existing sparsity methods.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingyang Jia, Kai Shu, Gang Yang, Long Xing, Xun Chen et al.
Machine LearningNatural Language ProcessingRoboticsComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting the control of emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback2) for C-EICG, which exploits the reasoning capability of the fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods in our custom dataset. The code and dataset will be released soon.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Weizi Shao, Taolin Zhang, Zijie Zhou, Chen Chen, Chengyu Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advancements in multi-modal retrieval-augmented generation (mRAG), which enhance multi-modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective communication. Despite impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical communication graph PRUNING framework, termed M$^3$Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M$^3$Prune first applies intra-modal graph sparsification to textual and visual modalities, identifying the edges most critical for solving the task. Subsequently, we construct a dynamic communication topology using these key edges for inter-modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient and hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and robust multi-agent mRAG systems while significantly reducing token consumption.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Changho Choi, Minho Kim, Jinkyu Kim
ๆฅ็ๆ่ฆ
Despite decades of progress, a truly input-size agnostic visual encoder-a fundamental characteristic of human vision-has remained elusive. We address this limitation by proposing \textbf{MambaEye}, a novel, causal sequential encoder that leverages the low complexity and causal-process based pure Mamba2 backbone. Unlike previous Mamba-based vision encoders that often employ bidirectional processing, our strictly unidirectional approach preserves the inherent causality of State Space Models, enabling the model to generate a prediction at any point in its input sequence. A core innovation is our use of relative move embedding, which encodes the spatial shift between consecutive patches, providing a strong inductive bias for translation invariance and making the model inherently adaptable to arbitrary image resolutions and scanning patterns. To achieve this, we introduce a novel diffusion-inspired loss function that provides dense, step-wise supervision, training the model to build confidence as it gathers more visual evidence. We demonstrate that MambaEye exhibits robust performance across a wide range of image resolutions, especially at higher resolutions such as $1536^2$ on the ImageNet-1K classification task. This feat is achieved while maintaining linear time and memory complexity relative to the number of patches.
๐
2025-11-25
๐ 39 pages, 15 figures. Under consideration for publ
๐ฅ Akash Doshi, Pinar Sen, Kirill Ivanov, Wei Yang, June Namgoong et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Channel coding from 2G to 5G has assumed the inputs bits at the physical layer to be uniformly distributed. However, hybrid automatic repeat request acknowledgement (HARQ-ACK) bits transmitted in the uplink are inherently non-uniformly distributed. For such sources, significant performance gains could be obtained by employing joint source channel coding, aided by deep learning-based techniques. In this paper, we learn a transformer-based encoder using a novel "free-lunch" training algorithm and propose per-codeword power shaping to exploit the source prior at the encoder whilst being robust to small changes in the HARQ-ACK distribution. Furthermore, any HARQ-ACK decoder has to achieve a low negative acknowledgement (NACK) error rate to avoid radio link failures resulting from multiple NACK errors. We develop an extension of the Neyman-Pearson test to a coded bit system with multiple information bits to achieve Unequal Error Protection of NACK over ACK bits at the decoder. Finally, we apply the proposed encoder and decoder designs to a 5G New Radio (NR) compliant uplink setup under a fading channel, describing the optimal receiver design and a low complexity coherent approximation to it. Our results demonstrate 3-6 dB reduction in the average transmit power required to achieve the target error rates compared to the NR baseline, while also achieving a 2-3 dB reduction in the maximum transmit power, thus providing for significant coverage gains and power savings.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shenjun Zhong, Zhifeng Chen, Zhaolin Chen
Machine Learning
ๆฅ็ๆ่ฆ
Magnetic Resonance Fingerprinting (MRF) leverages transient-state signal dynamics generated by the tunable acquisition parameters, making the design of an optimal, robust sequence a complex, high-dimensional sequential decision problem, such as optimizing one of the key parameters, flip angle. Reinforcement learning (RL) offers a promising approach to automate parameter selection, to optimize pulse sequences that maximize the distinguishability of fingerprints across the parameter space. In this work, we introduce an RL framework for optimizing the flip-angle schedule in MRF and demonstrate a learned schedule exhibiting non-periodic patterns that enhances fingerprint separability. Additionally, an interesting observation is that the RL-optimized schedule may enable a reduction in the number of repetition time, potentially accelerate MRF acquisitions.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan et al.
RoboticsComputer VisionMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt--a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism--tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tooba Tehreem Sheikh, Jean Lahoud, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag
Robotics
ๆฅ็ๆ่ฆ
Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tahira Kazimi, Connor Dunlop, Pinar Yanardag
Machine Learning
ๆฅ็ๆ่ฆ
While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) theories to enforce explicit reward on diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplies groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how human reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by +15.51 absolute improvement even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoye Wang, Chen Tang, Xiangyu Yue, Wei-Hong Li
ๆฅ็ๆ่ฆ
This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, integrated with a feature from MTL encoder for multi-task predictions. This module is architecture-agnostic and can be applied to both single and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu et al.
Natural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and long-horizon spatial reasoning with a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wei Tang, Zuo-Zheng Wang, Kun Zhang, Tong Wei, Min-Ling Zhang
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP's textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa et al.
Robotics
ๆฅ็ๆ่ฆ
While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Cheng Chen et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi et al.
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
๐
2025-11-25
๐ SIGGRAPH
๐ฅ Yangguang Li, Xianglong He, Zi-Xin Zou, Zexiang Liu, Wanli Ouyang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, current methods still face challenges, including the lack of intricate details, overly smoothed surfaces, and fragmented thin-shell structures. These limitations leave the generated 3D assets still one step short of meeting the standards favored by artists. In this paper, we present ShapeGen, which achieves high-quality image-to-3D shape generation through 3D representation and supervision improvements, resolution scaling up, and the advantages of linear transformers. These advancements allow the generated assets to be seamlessly integrated into 3D pipelines, facilitating their widespread adoption across various applications. Through extensive experiments, we validate the impact of these improvements on overall performance. Ultimately, thanks to the synergistic effects of these enhancements, ShapeGen achieves a significant leap in image-to-3D generation, establishing a new state-of-the-art performance.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinhao Liu, Jiaqi Li, Youming Deng, Ruxin Chen, Yingjia Zhang et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI. Project website is at https://ai4ce.github.io/wanderland/.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seyede Niloofar Hosseini, Ali Mojibi, Mahdi Mohseni, Navid Arjmand, Alireza Taheri
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals each performing 204 load-reaching tasks from different load positions while adapting various lifting and handling techniques. The model inputs consisted of the 3D position of the hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. We indicated that utilizing the transformer architecture, with a root-mean-square-error of 47.0 mm, exhibited ~58% more accurate long-term performance than the BLSTM-based model. This study merits the use of neural networks that capture time series dependencies in 3D motion frames, providing a unique approach for understanding and predict motion dynamics during manual material handling activities.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou et al.
ๆฅ็ๆ่ฆ
Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
๐
2025-11-25
๐ ArXiv Preprint (math.OC)
๐ฅ Nils Mรผller
Machine Learning
ๆฅ็ๆ่ฆ
We study the optimization of functions with $n>2$ arguments that have a representation as a sum of several functions that have only $2$ of the $n$ arguments each, termed sums of bivariates, on finite domains. The complexity of optimizing sums of bivariates is shown to be NP-equivalent and it is shown that there exists free lunch in the optimization of sums of bivariates. Based on measure-valued extensions of the objective function, so-called relaxations, $\ell^2$-approximation, and entropy-regularization, we derive several tractable problem formulations solvable with linear programming, coordinate ascent as well as with closed-form solutions. The limits of applying tractable versions of such relaxations to sums of bivariates are investigated using general results for reconstructing measures from their bivariate marginals. Experiments in which the derived algorithms are applied to random functions, vertex coloring, and signal reconstruction problems provide insights into qualitatively different function classes that can be modeled as sums of bivariates.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingxing Rao, Bowen Qu, Daniel Moyer
Machine Learning
ๆฅ็ๆ่ฆ
The recovery of training data from generative models (``model inversion'') has been extensively studied for diffusion models in the data domain. The encoder/decoder pair and corresponding latent codes have largely been ignored by inversion techniques applied to latent space generative models, e.g., Latent Diffusion models (LDMs). In this work we describe two key findings: (1) The diffusion model exhibits non-uniform memorization across latent codes, tending to overfit samples located in high-distortion regions of the decoder pullback metric. (2) Even within a single latent code, different dimensions contribute unequally to memorization. We introduce a principled method to rank latent dimensions by their per-dimensional contribution to the decoder pullback metric, identifying those most responsible for memorization. Empirically, removing less-memorizing dimensions when computing attack statistics for score-based membership inference attacker significantly improves performance, with average AUROC gains of 2.7\% and substantial increases in TPR@1\%FPR (6.42\%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pokรฉmon, MS-COCO, and Flickr. This indicates stronger confidence in identifying members under extremely low false-positive tolerance. Our results highlight the overlooked influence of the auto-encoder geometry on LDM memorization and provide a new perspective for analyzing privacy risks in diffusion-based generative models.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenhui Gou, Zilong Chen, Zeyu Wang, Feng Li, Deyao Zhu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls a massive amount of ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We thus propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256$\times$256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and comparable to billion-level data trained models (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization enables semantically aligned and high-fidelity latent representations, enabling next-generation visual generative models. Code will be publicly available at https://github.com/MKJia/DINO-Tok.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengqiong Wu, Weicai Ye, Yuanxing Zhang, Jiahao Wang, Quande Liu et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Diffusion Transformers have significantly improved video fidelity and temporal coherence, however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: https://sqwu.top/ReaDe/.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoze Zhang, Tianyu Huang, Zichen Wan, Xiaowei Jin, Hongzhi Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
While recent video generation models have achieved significant visual fidelity, they often suffer from the lack of explicit physical controllability and plausibility. To address this, some recent studies attempted to guide the video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively control ling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only $2.1\%$ its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Codes are coming soon.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ege Ozguroglu, Junbang Liang, Ruoshi Liu, Mia Chiquier, Michael DeTienne et al.
Multimodal
ๆฅ็ๆ่ฆ
While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings. We present New York Smells, a large dataset of paired image and olfactory signals captured ``in the wild.'' Our dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, with approximately 70$\times$ more objects than existing olfactory datasets. Our benchmark has three tasks: cross-modal smell-to-image retrieval, recognizing scenes, objects, and materials from smell alone, and fine-grained discrimination between grass species. Through experiments on our dataset, we find that visual data enables cross-modal olfactory representation learning, and that our learned olfactory representations outperform widely-used hand-crafted features.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andrea Ranieri, Giorgio Palmieri, Silvia Biasotti
Machine Learning
ๆฅ็ๆ่ฆ
This paper addresses the critical need for automated crack detection in the preservation of cultural heritage through semantic segmentation. We present a comparative study of U-Net architectures, using various convolutional neural network (CNN) encoders, for pixel-level crack identification on statues and monuments. A comparative quantitative evaluation is performed on the test set of the OmniCrack30k dataset [1] using popular segmentation metrics including Mean Intersection over Union (mIoU), Dice coefficient, and Jaccard index. This is complemented by an out-of-distribution qualitative evaluation on an unlabeled test set of real-world cracked statues and monuments. Our findings provide valuable insights into the capabilities of different CNN- based encoders for fine-grained crack segmentation. We show that the models exhibit promising generalization capabilities to unseen cultural heritage contexts, despite never having been explicitly trained on images of statues or monuments.
๐
2025-11-25
๐ ICML
๐ฅ Shamima Hossain
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Visual Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs due to a lack of robust reasoning capabilities. While extensive research has been conducted on integrating external knowledge for reasoning in large language models (LLMs), such efforts remain underexplored in VLMs, where the challenge is compounded by the need to bridge multiple modalities seamlessly. This work introduces a framework for knowledge-guided reasoning in VLMs, leveraging structured knowledge graphs for multi-hop verification using image-captioning task to illustrate our framework. Our approach enables systematic reasoning across multiple steps, including visual entity recognition, knowledge graph traversal, and fact-based caption refinement. We evaluate the framework using hierarchical, triple-based and bullet-point based knowledge representations, analyzing their effectiveness in factual accuracy and logical inference. Empirical results show that our approach improves factual accuracy by approximately 31% on preliminary experiments on a curated dataset of mixtures from Google Landmarks v2, Conceptual captions and Coco captions revealing key insights into reasoning patterns and failure modes. This work demonstrates the potential of integrating external knowledge for advancing reasoning in VLMs, paving the way for more reliable and knowledgable multimodal systems.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yayuan Li, Aadit Jain, Filippos Bellos, Jason J. Corso
ๆฅ็ๆ่ฆ
We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiang Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yuqian Zhou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing over 40% attention sharing, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura et al.
Multimodal
ๆฅ็ๆ่ฆ
Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi-Hao Peng, Jeffrey P. Bigham, Jason Wu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of various design aspect importance and individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Muhammad Irfan, Nasir Rahim, Khalid Mahmood Malik
Machine Learning
ๆฅ็ๆ่ฆ
Accurate extraction and segmentation of the cerebral arteries from digital subtraction angiography (DSA) sequences is essential for developing reliable clinical management models of complex cerebrovascular diseases. Conventional loss functions often rely solely on pixel-wise overlap, overlooking the geometric and physical consistency of vascular boundaries, which can lead to fragmented or unstable vessel predictions. To overcome this limitation, we propose a novel \textit{Physics-Informed Loss} (PIL) that models the interaction between the predicted and ground-truth boundaries as an elastic process inspired by dislocation theory in materials physics. This formulation introduces a physics-based regularization term that enforces smooth contour evolution and structural consistency, allowing the network to better capture fine vascular geometry. The proposed loss is integrated into several segmentation architectures, including U-Net, U-Net++, SegFormer, and MedFormer, and evaluated on two public benchmarks: DIAS and DSCA. Experimental results demonstrate that PIL consistently outperforms conventional loss functions such as Cross-Entropy, Dice, Active Contour, and Surface losses, achieving superior sensitivity, F1 score, and boundary coherence. These findings confirm that the incorporation of physics-based boundary interactions into deep neural networks improves both the precision and robustness of vascular segmentation in dynamic angiographic imaging. The implementation of the proposed method is publicly available at https://github.com/irfantahir301/Physicsis_loss.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Akshit Pramod Anchan, Jewelith Thomas, Sritama Roy
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like 'Smart Eye.' We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ben Hamscher, Arnold Brosch, Nicolas Binninger, Maksymilian Jan Dejna, Kira Maag
ๆฅ็ๆ่ฆ
Dance is an essential component of human culture and serves as a tool for conveying emotions and telling stories. Identifying and distinguishing dance genres based on motion data is a complex problem in human activity recognition, as many styles share similar poses, gestures, and temporal motion patterns. This work presents a lightweight framework for classifying dance styles that determines motion characteristics based on pose estimates extracted from videos. We propose temporal-spatial descriptors inspired by Laban Movement Analysis. These features capture local joint dynamics such as velocity, acceleration, and angular movement of the upper body, enabling a structured representation of spatial coordination. To further encode rhythmic and periodic aspects of movement, we integrate Fast Fourier Transform features that characterize movement patterns in the frequency domain. The proposed approach achieves robust classification of different dance styles with low computational effort, as complex model architectures are not required, and shows that interpretable motion representations can effectively capture stylistic nuances.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunqi Zhou, Chengjie Jiang, Chun Yuan, Jing Li
Natural Language Processing
ๆฅ็ๆ่ฆ
With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained and answer-critical details. In this context, guiding the model look where it matters before prediction becomes crucial. Therefore, we present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8\% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20%~44% in speed.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jeonghyeon Na, Sangwon Baik, Inhee Lee, Junyoung Lee, Hanbyul Joo
Machine Learning
ๆฅ็ๆ่ฆ
The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem to model the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOIs dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interaction (HOIs) and human-human interaction (HHIs) from the HHOIs, and with these data, we train an text-to-HOI and text-to-HHI model using score-based diffusion model. Finally, we present a unified generative framework that integrates the two individual model, capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen
Natural Language Processing
ๆฅ็ๆ่ฆ
In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is gauranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across any vision pruning ratios, i.e., inference efficiency, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our codes are available at https://github.com/GarryLarry010131/OC-VTP.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hmrishav Bandyopadhyay, Nikhil Pinnaparaju, Rahim Entezari, Jim Scott, Yi-Zhe Song et al.
ๆฅ็ๆ่ฆ
Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates overhead from KV-recaching (of ~200ms) during context switches for interactive generation. Extensive evaluations validated against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bo Pang, Chenxi Xu, Jierui Ren, Guoping Wang, Sheng Li
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Matvei Shelukhan, Timur Mamedov, Karina Kvanchiani
Computer Vision
ๆฅ็ๆ่ฆ
Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current approaches mainly focus on tracking objects in each frame of a video stream, making it almost impossible to run the model under conditions of limited computing resources. To address this issue, we propose StableTrack, a novel approach that stabilizes the quality of tracking on low-frequency detections. Our method introduces a new two-stage matching strategy to improve the cross-frame association between low-frequency detections. We propose a novel Bbox-Based Distance instead of the conventional Mahalanobis distance, which allows us to effectively match objects using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman Filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving $\textit{11.6%}$ HOTA improvement at $\textit{1}$ Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zilong Huang, Jun He, Xiaobin Huang, Ziyi Xiong, Yang Luo et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must satisfy both stylistic diversity, fine-grained, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent} that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset} containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% over CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework can inspire new avenues of research in 3D city generation. Our dataset and code will be released at https://github.com/LongHZ140516/MajutsuCity.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bao Tang, Shuai Zhang, Yueting Zhu, Jijun Xiang, Xin Yang et al.
Multimodal
ๆฅ็ๆ่ฆ
Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model's generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: https://github.com/hustvl/TBCM.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiawei Lin, Guanlong Jiao, Jianjin Xu
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Multi-ID customization is an interesting topic in computer vision and attracts considerable attention recently. Given the ID images of multiple individuals, its purpose is to generate a customized image that seamlessly integrates them while preserving their respective identities. Compared to single-ID customization, multi-ID customization is much more difficult and poses two major challenges. First, since the multi-ID customization model is trained to reconstruct an image from the cropped person regions, it often encounters the copy-paste issue during inference, leading to lower quality. Second, the model also suffers from inferior text controllability. The generated result simply combines multiple persons into one image, regardless of whether it is aligned with the input text. In this work, we propose MultiID to tackle this challenging task in a training-free manner. Since the existing single-ID customization models have less copy-paste issue, our key idea is to adapt these models to achieve multi-ID customization. To this end, we present an ID-decoupled cross-attention mechanism, injecting distinct ID embeddings into the corresponding image regions and thus generating multi-ID outputs. To enhance the generation controllability, we introduce three critical strategies, namely the local prompt, depth-guided spatial control, and extended self-attention, making the results more consistent with the text prompts and ID images. We also carefully build a benchmark, called IDBench, for evaluation. The extensive qualitative and quantitative results demonstrate the effectiveness of MultiID in solving the aforementioned two challenges. Its performance is comparable or even better than the training-based multi-ID customization methods.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinwan Wen, Bowen Li, Jiajun Luo, Ye Li, Zhi Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference methods enable lossless parallel sampling in U-Net-based diffusion models via a drafter-verifier scheme, but their acceleration is limited on DiTs due to insufficient draft accuracy during verification. To address this limitation, we analyze the DiTs' feature dynamics and find the features of the final transformer layer (top-block) exhibit strong temporal consistency and rich semantic abstraction. Based on this insight, we propose FREE, a novel framework that employs a lightweight drafter to perform feature-level autoregression with parallel verification, guaranteeing lossless acceleration with theoretical and empirical support. Meanwhile, prediction variance (uncertainty) of DiTs naturally increases in later denoising steps, reducing acceptance rates under speculative sampling. To mitigate this effect, we further introduce an uncertainty-guided relaxation strategy, forming FREE (relax), which dynamically adjusts the acceptance probability in response to uncertainty levels. Experiments on ImageNet-$512^2$ show that FREE achieves up to $1.86 \times$ acceleration, and FREE (relax) further reaches $2.25 \times$ speedup while maintaining high perceptual and quantitative fidelity in generation quality.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Ming, Yuxuan Han, Tianyu Huang, Feng Xu
ๆฅ็ๆ่ฆ
Reconstructing topologically consistent facial geometry is crucial for the digital avatar creation pipelines. Existing methods either require tedious manual efforts, lack generalization to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that innovatively applies the 3D foundation model, \emph{i.e.} VGGT, for topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, as the topology information is missing in its prediction. To this end, we augment VGGT with Pixel3DMM for injecting topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point map of VGGT to a point cloud with topology. Tailored to this point cloud with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse them, where we construct a Laplacian energy for the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at https://github.com/grignarder/vggtface.
๐
2025-11-25
๐ AAAI 2026
๐ฅ Zhiqing Guo, Dongdong Xi, Songlin Li, Gaobo Yang
Robotics
ๆฅ็ๆ่ฆ
Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world deployment.In contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.
๐
2025-11-25
๐ AAAI 2026
๐ฅ Haoliang Han, Ziyuan Luo, Jun Qi, Anderson Rocha, Renjie Wan
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in editing technologies for 3D Gaussian Splatting (3DGS) have made it simple to manipulate 3D scenes. However, these technologies raise concerns about potential malicious manipulation of 3D content. To avoid such malicious applications, localizing tampered regions becomes crucial. In this paper, we propose GS-Checker, a novel method for locating tampered areas in 3DGS models. Our approach integrates a 3D tampering attribute into the 3D Gaussian parameters to indicate whether the Gaussian has been tampered. Additionally, we design a 3D contrastive mechanism by comparing the similarity of key attributes between 3D Gaussians to seek tampering cues at 3D level. Furthermore, we introduce a cyclic optimization strategy to refine the 3D tampering attribute, enabling more accurate tampering localization. Notably, our approach does not require expensive 3D labels for supervision. Extensive experimental results demonstrate the effectiveness of our proposed method to locate the tampered 3DGS area.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang et al.
Robotics
ๆฅ็ๆ่ฆ
Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360ยฐ. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360ยฐ panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.
๐
2025-11-25
๐ 8 pages, 5 figures. Submitted to IEEE Intelligent
๐ฅ Joรฃo Malheiro Silva, Andy Huynh, Tong Duy Son, Holger Caesar
Computer Vision
ๆฅ็ๆ่ฆ
3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hengyi Wang, Lourdes Agapito
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction on a metric-scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Onur Berk Tore, Ibrahim Samil Yalciner, Server Calap
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Estimating homography from a single image remains a challenging yet practically valuable task, particularly in domains like retail, where only one viewpoint is typically available for shelf monitoring and product alignment. In this paper, we present a deep learning framework that predicts a 4-point parameterized homography matrix to rectify shelf images captured from arbitrary angles. Our model leverages a ConvNeXt-based backbone for enhanced feature representation and adopts normalized coordinate regression for improved stability. To address data scarcity and promote generalization, we introduce a novel augmentation strategy by modeling and sampling synthetic homographies. Our method achieves a mean corner error of 1.298 pixels on the test set. When compared with both classical computer vision and deep learning-based approaches, our method demonstrates competitive performance in both accuracy and inference speed. Together, these results establish our approach as a robust and efficient solution for realworld single-view rectification. To encourage further research in this domain, we will make our dataset, ShelfRectSet, and code publicly available
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shi Jiazhao, Pan Pan, Shi Haotian
Machine Learning
ๆฅ็ๆ่ฆ
This article trained a network for perceiving three-dimensional motion information of binocular vision target, which can provide real-time three-dimensional coordinate, velocity, and acceleration, and has a basic spatiotemporal perception capability. Understood the ability of neural networks to fit nonlinear problems from the perspective of PID. Considered a single-layer neural network as using a second-order difference equation and a nonlinearity to describe a local problem. Multilayer networks gradually transform the raw representation to the desired representation through multiple such combinations. Analysed some reference principles for designing neural networks. Designed a relatively small PID convolutional neural network, with a total of 17 layers and 413 thousand parameters. Implemented a simple but practical feature reuse method by concatenation and pooling. The network was trained and tested using the simulated randomly moving ball datasets, and the experimental results showed that the prediction accuracy was close to the upper limit that the input image resolution can represent. Analysed the experimental results and errors, as well as the existing shortcomings and possible directions for improvement. Finally, discussed the advantages of high-dimensional convolution in improving computational efficiency and feature space utilization. As well as the potential advantages of using PID information to implement memory and attention mechanisms.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Yuhan Wu, Tiantian Wei, Shuo Wang, ZhiChao Wang, Yanyong Zhang et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Interactive articulated manipulation requires long-horizon, multi-step interactions with appliances while maintaining physical consistency. Existing vision-language and diffusion-based policies struggle to generalize across parts, instances, and categories. We first introduce ArtiBench, a five-level benchmark covering kitchen, storage, office, and tool environments. ArtiBench enables structured evaluation from cross-part and cross-instance variation to long-horizon multi-object tasks, revealing the core generalization challenges of articulated object manipulation. Building on this benchmark, we propose ArtiBrain, a modular framework that unifies high-level reasoning with adaptive low-level control. ArtiBrain uses a VLM-based Task Reasoner (GPT-4.1) to decompose and validate subgoals, and employs a Hybrid Controller that combines geometry-aware keyframe execution with affordance-guided diffusion for precise and interpretable manipulation. An Affordance Memory Bank continually accumulates successful execution episodes and propagates part-level actionable affordances to unseen articulated parts and configurations. Extensive experiments on ArtiBench show that our ArtiBrain significantly outperforms state-of-the-art multimodal and diffusion-based methods in robustness and generalization. Code and dataset will be released upon acceptance.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianyi Yan, Tao Tang, Xingtai Gui, Yongkang Li, Jiasen Zhesng et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long-tail events. Reinforcement Learning (RL) offers a promising path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off-road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed-loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to ``dream" of the outcomes for candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, proving that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuelin Qian, Jiaming Lu, Zixuan Wang, Wenxuan Wang, Zhongling Huang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Infrared Small Target Detection (IRSTD) faces significant challenges due to low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. While deep learning-based encoder-decoder frameworks have advanced the field, their static pattern learning suffers from pattern drift across diverse scenarios (\emph{e.g.}, day/night variations, sky/maritime/ground domains), limiting robustness. To address this, we propose IrisNet, a novel meta-learned framework that dynamically adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters via an image-to-decoder transformer. More concretely, we represent the parameterized decoder as a structured 2D tensor preserving hierarchical layer correlations and enable the transformer to model inter-layer dependencies through self-attention while generating adaptive decoding patterns via cross-attention. To further enhance the perception ability of infrared images, we integrate high-frequency components to supplement target-position and scene-edge information. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet, achieving state-of-the-art performance.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengqian Li, Ming Gao, Yi Liu, Zuzeng Lin, Feng Wang et al.
ๆฅ็ๆ่ฆ
Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works in pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by pretrained DiT or UNet as output-a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to sota methods across multiple image translation datasets while enabling real-time inference.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Han Guo, Chenyang Liu, Haotian Zhang, Bowen Chen, Zhengxia Zou et al.
ๆฅ็ๆ่ฆ
Remote sensing change detection (RSCD) aims to identify surface changes across bi-temporal satellite images. Most previous methods rely solely on mask supervision, which effectively guides spatial localization but provides limited constraints on the temporal semantic transitions. Consequently, they often produce spatially coherent predictions while still suffering from unresolved semantic inconsistencies. To address this limitation, we propose TaCo, a spatio-temporal semantic consistent network, which enriches the existing mask-supervised framework with a spatio-temporal semantic joint constraint. TaCo conceptualizes change as a semantic transition between bi-temporal states, in which one temporal feature representation can be derived from the other via dedicated transition features. To realize this, we introduce a Text-guided Transition Generator that integrates textual semantics with bi-temporal visual features to construct the cross-temporal transition features. In addition, we propose a spatio-temporal semantic joint constraint consisting of bi-temporal reconstruct constraints and a transition constraint: the former enforces alignment between reconstructed and original features, while the latter enhances discrimination for changes. This design can yield substantial performance gains without introducing any additional computational overhead during inference. Extensive experiments on six public datasets, spanning both binary and semantic change detection tasks, demonstrate that TaCo consistently achieves SOTA performance.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng et al.
ๆฅ็ๆ่ฆ
In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (\eg, spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work will be released.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Baoshun Shi, Ke Jiang, Qiusheng Lian, Xinran Yu, Huazhu Fu
Machine Learning
ๆฅ็ๆ่ฆ
Despite significant advancements in deep learning-based sparse-view computed tomography (SVCT) reconstruction algorithms, these methods still encounter two primary limitations: (i) It is challenging to explicitly prove that the prior networks of deep unfolding algorithms satisfy Lipschitz constraints due to their empirically designed nature. (ii) The substantial storage costs of training a separate model for each setting in the case of multiple views hinder practical clinical applications. To address these issues, we elaborate an explicitly provable Lipschitz-constrained network, dubbed LipNet, and integrate an explicit prompt module to provide discriminative knowledge of different sparse sampling settings, enabling the treatment of multiple sparse view configurations within a single model. Furthermore, we develop a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction, termed PromptCT, which embeds LipNet as its prior network to ensure the convergence of its corresponding iterative algorithm. In simulated and real data experiments, PromptCT outperforms benchmark reconstruction algorithms in multiple-in-one SVCT reconstruction, achieving higher-quality reconstructions with lower storage costs. On the theoretical side, we explicitly demonstrate that LipNet satisfies boundary property, further proving its Lipschitz continuity and subsequently analyzing the convergence of the proposed iterative algorithms. The data and code are publicly available at https://github.com/shibaoshun/PromptCT.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chao Wang, Chengan Che, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera
Machine Learning
ๆฅ็ๆ่ฆ
Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods are designed to explain image classifiers. The generation of CFEs for video classifiers remains largely underexplored. For the counterfactual videos to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing CFE image-based methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features, 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier's decision-making mechanism.
๐
2025-11-25
๐ ICCV 2025
๐ฅ Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fabian Gรผlhan, Emil Mededovic, Yuli Wu, Johannes Stegmaier
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite progress toward end-to-end tracking with transformer architectures, poor detection performance and the conflict between detection and association in a joint architecture remain critical concerns. Recent approaches aim to mitigate these issues by (i) employing advanced denoising or label assignment strategies, or (ii) incorporating detection priors from external object detectors via distillation or anchor proposal techniques. Inspired by the success of integrating detection priors and by the key insight that MOTR-like models are secretly strong detection models, we introduce SelfMOTR, a novel tracking transformer that relies on self-generated detection priors. Through extensive analysis and ablation studies, we uncover and demonstrate the hidden detection capabilities of MOTR-like models, and present a practical set of tools for leveraging them effectively. On DanceTrack, SelfMOTR achieves strong performance, competing with recent state-of-the-art end-to-end tracking methods.
๐
2025-11-25
๐ AAAI 2026
๐ฅ Yinghui Li, Qianyu Zhou, Di Shao, Hao Yang, Ye Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Domain adaptive point cloud completion (DA PCC) aims to narrow the geometric and semantic discrepancies between the labeled source and unlabeled target domains. Existing methods either suffer from limited receptive fields or quadratic complexity due to using CNNs or vision Transformers. In this paper, we present the first work that studies the adaptability of State Space Models (SSMs) in DA PCC and find that directly applying SSMs to DA PCC will encounter several challenges: directly serializing 3D point clouds into 1D sequences often disrupts the spatial topology and local geometric features of the target domain. Besides, the overlook of designs in the learning domain-agnostic representations hinders the adaptation performance. To address these issues, we propose a novel framework, DAPointMamba for DA PCC, that exhibits strong adaptability across domains and has the advantages of global receptive fields and efficient linear complexity. It has three novel modules. In particular, Cross-Domain Patch-Level Scanning introduces patch-level geometric correspondences, enabling effective local alignment. Cross-Domain Spatial SSM Alignment further strengthens spatial consistency by modulating patch features based on cross-domain similarity, effectively mitigating fine-grained structural discrepancies. Cross-Domain Channel SSM Alignment actively addresses global semantic gaps by interleaving and aligning feature channels. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our DAPointMamba outperforms state-of-the-art methods with less computational complexity and inference latency.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Advik Sinha, Saurabh Atreya, Aashutosh A, Sk Aziz Ali, Abhijit Das
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Until recently, the general corpus of CLIP-type fundamental models has widely explored either the retrieval of short descriptions or the classification of objects in the scene as SINGLE-object image classification task. The same holds for retrieving the image embedding (image retrieval task) given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP-based literature improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. However, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. Hence, to further leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions highlighting relations. The proposed model is pretrained on curated scenario data, and finetuned for specialized downstream tasks, such as cross-modal retrieval and fine-grained visual understanding tasks. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing diverse indoor and outdoor scenario datasets that are publicly available. We used a pipeline of existing language models to ground action, object, and relations, filled by manual and automatic curation. We established a comprehensive benchmark for several scenario-based tasks and compared it with many baseline methods. ScenarioCLIP demonstrates robust zero-shot and finetune performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in the world-centric. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Amirhossein Khadivi Noghredeh, Abdollah Safari, Fatemeh Ziaeetabar, Firoozeh Haghighi
Machine Learning
ๆฅ็ๆ่ฆ
Anomaly detection in industrial visual inspection is challenging due to the scarcity of defective samples. Most existing methods rely on unsupervised reconstruction using only normal data, often resulting in overfitting and poor detection of subtle defects. We propose a semi-supervised deep reinforcement learning framework that integrates a neural batch sampler, an autoencoder, and a predictor. The RL-based sampler adaptively selects informative patches by balancing exploration and exploitation through a composite reward. The autoencoder generates loss profiles highlighting abnormal regions, while the predictor performs segmentation in the loss-profile space. This interaction enables the system to effectively learn both normal and defective patterns with limited labeled data. Experiments on the MVTec AD dataset demonstrate that our method achieves higher accuracy and better localization of subtle anomalies than recent state-of-the-art approaches while maintaining low complexity, yielding an average improvement of 0.15 in F1_max and 0.06 in AUC, with a maximum gain of 0.37 in F1_max in the best case.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Omer Belhasin, Shelly Golan, Ran El-Yaniv, Michael Elad
Computer Vision
ๆฅ็ๆ่ฆ
Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at https://github.com/omerb01/didicm .
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, Fan Zhou
Machine Learning
ๆฅ็ๆ่ฆ
Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steer convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alexander C. Jenke, Gregor Just, Claas de Boer, Martin Wagner, Sebastian Bodenstedt et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that indicates the state of each robotic arm, including the activation of the endoscope arm. Detecting this activation provides valuable metadata such as camera movement information, which can support downstream surgical data science tasks including tool tracking, skill assessment, or camera control automation.
Methods: We developed a lightweight pipeline based on a ResNet18 convolutional neural network to automatically identify the position of the camera tile and its activation state within the DaVinci Xi UI. The model was fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated across three public datasets comprising over 70,000 frames.
Results: The model achieved F1-scores between 0.993 and 1.000 for the binary detection of active cameras and correctly localized the camera tile in all cases without false multiple-camera detections.
Conclusion: The proposed pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, trained models, and annotations are publicly available.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andrey Lemeshko, Bulat Gabdullin, Nikita Drozdov, Anton Konushin, Danila Rukhovich et al.
Computer Vision
ๆฅ็ๆ่ฆ
3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D$_0$ outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at https://github.com/col14m/zoo3d .
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Yi-Lun Wu, Hong-Han Shuai
Computer Vision
ๆฅ็ๆ่ฆ
Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drifting.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniel Kienzle, Katja Ludwig, Julian Lorenz, Shin'ichi Satoh, Rainer Lienhart
ๆฅ็ๆ่ฆ
Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jawaria Maqbool, M. Imran Cheema
Machine Learning
ๆฅ็ๆ่ฆ
Existing deep learning methods in multimode fiber (MMF) imaging often focus on simpler datasets, limiting their applicability to complex, real-world imaging tasks. These models are typically data-intensive, a challenge that becomes more pronounced when dealing with diverse and complex images. In this work, we propose HistoSpeckle-Net, a deep learning architecture designed to reconstruct structurally rich medical images from MMF speckles. To build a clinically relevant dataset, we develop an optical setup that couples laser light through a spatial light modulator (SLM) into an MMF, capturing output speckle patterns corresponding to input OrganAMNIST images. Unlike previous MMF imaging approaches, which have not considered the underlying statistics of speckles and reconstructed images, we introduce a distribution-aware learning strategy. We employ a histogram-based mutual information loss to enhance model robustness and reduce reliance on large datasets. Our model includes a histogram computation unit that estimates smooth marginal and joint histograms for calculating mutual information loss. It also incorporates a unique Three-Scale Feature Refinement Module, which leads to multiscale Structural Similarity Index Measure (SSIM) loss computation. Together, these two loss functions enhance both the structural fidelity and statistical alignment of the reconstructed images. Our experiments on the complex OrganAMNIST dataset demonstrate that HistoSpeckle-Net achieves higher fidelity than baseline models such as U-Net and Pix2Pix. It gives superior performance even with limited training samples and across varying fiber bending conditions. By effectively reconstructing complex anatomical features with reduced data and under fiber perturbations, HistoSpeckle-Net brings MMF imaging closer to practical deployment in real-world clinical environments.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sen Nie, Jie Zhang, Jianxin Yan, Shiguang Shan, Xilin Chen
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and fail to precisely manipulate the semantics of specific concepts in the image. We attribute this limitation to semantic entanglement in the patch-token representations on which adversarial attacks typically operate: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features (V) computed within the transformer attention block serve as much more precise handles for manipulation. We show that V suppresses global-context channels, allowing it to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose V-Attack, a novel method designed for precise local semantic attacks. V-Attack targets the value features and introduces two core components: (1) a Self-Value Enhancement module to refine V's intrinsic semantic richness, and (2) a Text-Guided Value Manipulation module that leverages text prompts to locate source concept and optimize it toward a target concept. By bypassing the entangled patch features, V-Attack achieves highly effective semantic control. Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepseekVL and GPT-4o, show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding. Our code and data are available https://github.com/Summu77/V-Attack.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Juexin Zhang, Qifeng Zhong, Ying Weng, Ke Chen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The significant molecular and pathological heterogeneity of glioblastoma, an aggressive brain tumor, complicates diagnosis and patient stratification. While traditional histopathological assessment remains the standard, deep learning offers a promising path toward objective and automated analysis of whole slide images. For the BraTS-Path 2025 Challenge, we developed a method that fine-tunes a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training dataset. Our model's performance on the online validation set, evaluated via the Synapse platform, yielded a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676. On the final test set, the model achieved an MCC of 0.6509 and an F1-score of 0.5330, which secured our team second place in the BraTS-Pathology 2025 Challenge. Our results establish a solid baseline for ViT-based histopathological analysis, and future efforts will focus on bridging the performance gap observed on the unseen validation data.
๐
2025-11-25
๐ AAAI 2026
๐ฅ Yuhang Qian, Haiyan Chen, Wentong Li, Ningzhong Liu, Jie Qin
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG's ability to produce photorealistic camouflage images.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haebin Seong, Sungmin Kim, Minchan Kim, Yongjun Cho, Myunchul Joe et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Existing navigation benchmarks focus on task success metrics while overlooking economic viability -- critical for commercial deployment of autonomous delivery robots. We introduce \emph{CostNav}, a \textbf{Micro-Navigation Economic Testbed} that evaluates embodied agents through comprehensive cost-revenue analysis aligned with real-world business operations. CostNav models the complete economic lifecycle including hardware, training, energy, maintenance costs, and delivery revenue with service-level agreements, using industry-derived parameters. \textbf{To our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability}, revealing that optimizing for task success fundamentally differs from optimizing for economic deployment. Our cost model uses parameters derived from industry data sources (energy rates, delivery service pricing), and we project from a reduced-scale simulation to realistic deliveries. Under this projection, the baseline achieves 43.0\% SLA compliance but is \emph{not} commercially viable: yielding a loss of \$30.009 per run with no finite break-even point, because operating costs are dominated by collision-induced maintenance, which accounts for 99.7\% of per-run costs and highlights collision avoidance as a key optimization target. We demonstrate a learning-based on-device navigation baseline and establish a foundation for evaluating rule-based navigation, imitation learning, and cost-aware RL training. CostNav bridges the gap between navigation research and commercial deployment, enabling data-driven decisions about economic trade-offs across navigation paradigms.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Yu, Jiabo Zhan, Zile Wang, Jinglin Wang, Huaisong Zhang et al.
Natural Language ProcessingRoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. Jointly training OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Juexin Zhang, Ying Weng, Ke Chen
Machine Learning
ๆฅ็ๆ่ฆ
The ASNR-MICCAI BraTS-Inpainting Challenge was established to mitigate dataset biases that limit deep learning models in the quantitative analysis of brain tumor MRI. This paper details our submission to the 2025 challenge, a novel deep learning framework for synthesizing healthy tissue in 3D scans. The core of our method is a U-Net architecture trained to inpaint synthetically corrupted regions, enhanced with a random masking augmentation strategy to improve generalization. Quantitative evaluation confirmed the efficacy of our approach, yielding an SSIM of 0.873$\pm$0.004, a PSNR of 24.996$\pm$4.694, and an MSE of 0.005$\pm$0.087 on the validation set. On the final online test set, our method achieved an SSIM of 0.919$\pm$0.088, a PSNR of 26.932$\pm$5.057, and an RMSE of 0.052$\pm$0.026. This performance secured first place in the BraTS-Inpainting 2025 challenge and surpassed the winning solutions from the 2023 and 2024 competitions on the official leaderboard.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dionysia Danai Brilli, Dimitrios Mallis, Vassilis Pitsikalis, Petros Maragos
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haibin He, Qihuang Zhong, Juhua Liu, Bo Du, Peng Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within the videos. This task poses significant challenges, requiring models to accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information to ensure answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying them, SFA effectively guides the Video-LLM's attention toward essential cues, enabling it to generate more accurate answers. SFA achieves new state-of-the-art results across several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohammad Mahdi, Yuqian Fu, Nedko Savov, Jiancheng Pan, Danda Pani Paudel et al.
ๆฅ็ๆ่ฆ
Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric(Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment(EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Ego2Exo synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Federico Paredes-Valles, Yoshitaka Miyatani, Kirk Y. W. Scheper
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-based vision sensors offer microsecond resolution and sparse data streams, they have lacked fully integrated, low-power processing solutions capable of real-time inference. In this work, we present the first battery-powered, wearable pupil-center-tracking system with complete on-device integration, combining event-based sensing and neuromorphic processing on the commercially available Speck2f system-on-chip with lightweight coordinate decoding on a low-power microcontroller. Our solution features a novel uncertainty-quantifying spiking neural network with gated temporal decoding, optimized for strict memory and bandwidth constraints, complemented by systematic deployment mechanisms that bridge the reality gap. We validate our system on a new multi-user dataset and demonstrate a wearable prototype with dual neuromorphic devices achieving robust binocular pupil tracking at 100 Hz with an average power consumption below 5 mW per eye. Our work demonstrates that end-to-end neuromorphic computing enables practical, always-on eye tracking for next-generation energy-efficient wearable systems.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hai Ling, Jia Guo, Zhulin Tao, Yunkang Cao, Donglin Di et al.
Multimodal
ๆฅ็ๆ่ฆ
Anomaly detection (AD) aims to identify defects using normal-only training data. Existing anomaly detection benchmarks (e.g., MVTec-AD with 15 categories) cover only a narrow range of categories, limiting the evaluation of cross-context generalization and scalability. We introduce ADNet, a large-scale, multi-domain benchmark comprising 380 categories aggregated from 49 publicly available datasets across Electronics, Industry, Agrifood, Infrastructure, and Medical domains. The benchmark includes a total of 196,294 RGB images, consisting of 116,192 normal samples for training and 80,102 test images, of which 60,311 are anomalous. All images are standardized with MVTec-style pixel-level annotations and structured text descriptions spanning both spatial and visual attributes, enabling multimodal anomaly detection tasks. Extensive experiments reveal a clear scalability challenge: existing state-of-the-art methods achieve 90.6% I-AUROC in one-for-one settings but drop to 78.5% when scaling to all 380 categories in a multi-class setting. To address this, we propose Dinomaly-m, a context-guided Mixture-of-Experts extension of Dinomaly that expands decoder capacity without increasing inference cost. It achieves 83.2% I-AUROC and 93.1% P-AUROC, demonstrating superior performance over existing approaches. ADNet is designed as a standardized and extensible benchmark, supporting the community in expanding anomaly detection datasets across diverse domains and providing a scalable foundation for future anomaly detection foundation models. Dataset: https://grainnet.github.io/ADNet
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniel Harari, Michael Sidorov, Liel David, Chen Shterental, Abrham Kahsay Gebreselasie et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first of its kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziqi Wang, Chang Che, Qi Wang, Hui Ma, Zenglin Shi et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Da Li, Ji-Ping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun et al.
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bin Hu, Zijian Lu, Haicheng Liao, Chengran Yuan, Bin Rao et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Hong, Ying Shi, Yinhao Li, Yen-Wei Chen
ๆฅ็ๆ่ฆ
The uncertainty of clinical examinations frequently leads to irregular observation intervals in longitudinal imaging data, posing challenges for modeling disease progression.Most existing imaging-based disease prediction models operate in Euclidean space, which assumes a flat representation of data and fails to fully capture the intrinsic continuity and nonlinear geometric structure of irregularly sampled longitudinal images. To address the challenge of modeling Alzheimers disease (AD) progression from irregularly sampled longitudinal structural Magnetic Resonance Imaging (sMRI) data, we propose a Riemannian manifold mapping, a Time-aware manifold Neural ordinary differential equation, and an Attention-based riemannian Gated recurrent unit (R-TNAG) framework. Our approach first projects features extracted from high-dimensional sMRI into a manifold space to preserve the intrinsic geometry of disease progression. On this representation, a time-aware Neural Ordinary Differential Equation (TNODE) models the continuous evolution of latent states between observations, while an Attention-based Riemannian Gated Recurrent Unit (ARGRU) adaptively integrates historical and current information to handle irregular intervals. This joint design improves temporal consistency and yields robust AD trajectory prediction under irregular sampling.Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art models in both disease status prediction and cognitive score regression. Ablation studies verify the contributions of each module, highlighting their complementary roles in enhancing predictive accuracy. Moreover, the model exhibits stable performance across varying sequence lengths and missing data rates, indicating strong temporal generalizability. Cross-dataset validation further confirms its robustness and applicability in diverse clinical settings.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arnela Hadzic, Franz Thaler, Lea Bogensperger, Simon Johannes Joham, Martin Urschler
Computer Vision
ๆฅ็ๆ่ฆ
Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haodong Pan, Hao Wei, Yusong Wang, Nanning Zheng, Caigui Jiang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Learned image compression (LIC) has recently benefited from Transformer based and state space model (SSM) based architectures. Convolutional neural networks (CNNs) effectively capture local high frequency details, whereas Transformers and SSMs provide strong long range modeling capabilities but may cause structural information loss or ignore frequency characteristics that are crucial for compression. In this work we propose HCFSSNet, a Hybrid Convolution and Frequency State Space Network for LIC. HCFSSNet uses CNNs to extract local high frequency structures and introduces a Vision Frequency State Space (VFSS) block that models long range low frequency information. The VFSS block combines an Omni directional Neighborhood State Space (VONSS) module, which scans features horizontally, vertically and diagonally, with an Adaptive Frequency Modulation Module (AFMM) that applies content adaptive weighting of discrete cosine transform frequency components for more efficient bit allocation. To further reduce redundancy in the entropy model, we integrate AFMM with a Swin Transformer to form a Frequency Swin Transformer Attention Module (FSTAM) for frequency aware side information modeling. Experiments on the Kodak, Tecnick and CLIC Professional Validation datasets show that HCFSSNet achieves competitive rate distortion performance compared with recent SSM based codecs such as MambaIC, while using significantly fewer parameters. On Kodak, Tecnick and CLIC, HCFSSNet reduces BD rate over the VTM anchor by 18.06, 24.56 and 22.44 percent, respectively, providing an efficient and interpretable hybrid architecture for future learned image compression systems.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenpei Jiao, Kun Shang, Hui Li, Ke Yan, Jiajin Zhang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation (PETRG) increasingly important for reducing clinical workload. Compared with structural imaging (e.g., X-ray, CT, and MRI), functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and whole-body 3D contextual information is required rather than local-region interpretation. To advance PETRG, we propose PETRG-3D, an end-to-end 3D dual-branch framework that separately encodes PET and CT volumes and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. We construct PETRG-Lym, a multi-center lymphoma dataset collected from four hospitals (824 reports w/ 245,509 paired PET/CT slices), and construct AutoPET-RG-Lym, a publicly accessible PETRG benchmark derived from open imaging data but equipped with new expert-written, clinically validated reports (135 cases). To assess clinical utility, we introduce PETRG-Score, a lymphoma-specific evaluation protocol that jointly measures metabolic and structural findings across curated anatomical regions. Experiments show that PETRG-3D substantially outperforms existing methods on both natural language metrics (e.g., +31.49\% ROUGE-L) and clinical efficacy metrics (e.g., +8.18\% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting. Overall, this work establishes a foundation for future PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation. Codes, models, and AutoPET-RG-Lym will be released.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This leads to quality degradation and repetition emerges as a special case when this dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines largely across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Johannes Brandt, Maulik Chevli, Rickmer Braren, Georgios Kaissis, Philip Mรผller et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Lung cancer risk estimation is gaining increasing importance as more countries introduce population-wide screening programs using low-dose CT (LDCT). As imaging volumes grow, scalable methods that can process entire lung volumes efficiently are essential to tap into the full potential of these large screening datasets. Existing approaches either over-rely on pixel-level annotations, limiting scalability, or analyze the lung in fragments, weakening performance. We present LungEvaty, a fully transformer-based framework for predicting 1-6 year lung cancer risk from a single LDCT scan. The model operates on whole-lung inputs, learning directly from large-scale screening data to capture comprehensive anatomical and pathological cues relevant for malignancy risk. Using only imaging data and no region supervision, LungEvaty matches state-of-the-art performance, refinable by an optional Anatomically Informed Attention Guidance (AIAG) loss that encourages anatomically focused attention. In total, LungEvaty was trained on more than 90,000 CT scans, including over 28,000 for fine-tuning and 6,000 for evaluation. The framework offers a simple, data-efficient, and fully open-source solution that provides an extensible foundation for future research in longitudinal and multimodal lung cancer risk prediction.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Abishek Karthik, Pandiyaraju V
Machine Learning
ๆฅ็ๆ่ฆ
The healthcare industry has been revolutionized significantly by novel imaging technologies, not just in the diagnosis of cardiovascular diseases but also by the visualization of structural abnormalities like cardiomegaly. This article explains an integrated approach to the use of deep learning tools and attention mechanisms for automatic detection of cardiomegaly using X-ray images. The initiation of the project is grounded on a strong Data Collection phase and gathering the data of annotated X-ray images of various types. Then, while the Preprocessing module fine-tunes image quality, it is feasible to utilize the best out of the data quality in the proposed system. In our proposed system, the process is a CNN configuration leveraging the inception V3 model as one of the key blocks. Besides, we also employ a multilayer attention mechanism to enhance the strength. The most important feature of the method is the multi-head attention mechanism that can learn features automatically. By exact selective focusing on only some regions of input, the model can thus identify cardiomegaly in a sensitive manner. Attention rating is calculated, duplicated, and applied to enhance representation of main data, and therefore there is a successful diagnosis. The Evaluation stage will be extremely strict and it will thoroughly evaluate the model based on such measures as accuracy and precision. This will validate that the model can identify cardiomegaly and will also show the clinical significance of this method. The model has accuracy of 95.6, precision of 95.2, recall of 96.2, sensitivity of 95.7, specificity of 96.1 and an Area Under Curve(AUC) of 96.0 and their respective graphs are plotted for visualisation.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sharjeel Ahmed, Daim Armaghan, Fatima Naweed, Umair Yousaf, Ahmad Zubair et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
There have been many recent developments in the use of Deep Learning Neural Networks for fire detection. In this paper, we explore an early warning system for detection of forest fires. Due to the lack of sizeable datasets and models tuned for this task, existing methods suffer from missed detection. In this work, we first propose a dataset for early identification of forest fires through visual analysis. Unlike existing image corpuses that contain images of wide-spread fire, our dataset consists of multiple instances of smoke plumes and fire that indicates the initiation of fire. We obtained this dataset synthetically by utilising game simulators such as Red Dead Redemption 2. We also combined our dataset with already published images to obtain a more comprehensive set. Finally, we compared image classification and localisation methods on the proposed dataset. More specifically we used YOLOv7 (You Only Look Once) and different models of detection transformer.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Yixin Liu, Pengfei Liu, Arman Cohan
Natural Language Processing
ๆฅ็ๆ่ฆ
Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that measures LLM alignment with human preferences without directly evaluating their generated outputs, instead assessing LLMs in their role as evaluators. Our evaluation shows that our proposed benchmark, AlignEval, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs' generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Farjana Sultana Mim, Shuchin Aeron, Eric Miller, Kristen Wendell
Natural Language Processing
ๆฅ็ๆ่ฆ
Identifying discourse features in student conversations is quite important for educational researchers to recognize the curricular and pedagogical variables that cause students to engage in constructing knowledge rather than merely completing tasks. The manual analysis of student conversations to identify these discourse features is time-consuming and labor-intensive, which limits the scale and scope of studies. Leveraging natural language processing (NLP) techniques can facilitate the automatic detection of these discourse features, offering educational researchers scalable and data-driven insights. However, existing studies in NLP that focus on discourse in dialogue rarely address educational data. In this work, we address this gap by introducing an annotated educational dialogue dataset of student conversations featuring knowledge construction and task production discourse. We also establish baseline models for automatically predicting these discourse properties for each turn of talk within conversations, using pre-trained large language models GPT-3.5 and Llama-3.1. Experimental results indicate that these state-of-the-art models perform suboptimally on this task, indicating the potential for future research.
๐
2025-11-25
๐ ICML 2025
๐ฅ Wesley Bian, Xiaofeng Lin, Guang Cheng
Machine Learning
ๆฅ็ๆ่ฆ
Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi-Hao Peng, Jeffrey P. Bigham, Jason Wu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of various design aspect importance and individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nathan Roll, Jill Kries, Flora Jin, Catherine Wang, Ann Marie Finley et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model--consensus agreement vs. 0.286 for human--human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jakub Hoscilowicz, Artur Janicki
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mosab Rezaei, Mina Rajaei Moghadam, Abdul Rahman Shaikh, Hamed Alhoori, Reva Freedman
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in large language models have created new opportunities for stylometry, the study of writing styles and authorship. Two challenges, however, remain central: training generative models when no paired data exist, and evaluating stylistic text without relying only on human judgment. In this work, we present a framework for both generating and evaluating sentences in the style of 19th-century novelists. Large language models are fine-tuned with minimal, single-token prompts to produce text in the voices of authors such as Dickens, Austen, Twain, Alcott, and Melville. To assess these generative models, we employ a transformer-based detector trained on authentic sentences, using it both as a classifier and as a tool for stylistic explanation. We complement this with syntactic comparisons and explainable AI methods, including attention-based and gradient-based analyses, to identify the linguistic cues that drive stylistic imitation. Our findings show that the generated text reflects the authors' distinctive patterns and that AI-based evaluation offers a reliable alternative to human assessment. All artifacts of this work are published online.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Md Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal
Natural Language Processing
ๆฅ็ๆ่ฆ
Text normalization is an essential preprocessing step in many natural language processing (NLP) tasks, and stemming is one such normalization technique that reduces words to their base or root form. However, evaluating stemming methods is challenging because current evaluation approaches are limited and do not capture the potential harm caused by excessive stemming; therefore, it is essential to develop new approaches to evaluate stemming methods. To address this issue, this study propose a novel, task-oriented approach to evaluate stemming methods, which considers three aspects: (1) the utility of stemming using Stemming Effectiveness Score (SES), (2) the impact of stemming on downstream tasks using Model Performance Delta (MPD), and (3) the semantic similarity between stemmed and original words using Average Normalized Levenshtein Distance (ANLD), thus providing a comprehensive evaluation framework. We apply our evaluation framework to compare two stemmers for Bangla (BNLTK) and English (Snowball), and our results reveal a significant issue, prompting us to analyze their performance in detail. While the Bangla stemmer achieves the highest SES (1.67) due to effective word reduction (CR = 1.90), SES alone is insufficient because our proposed safety measure, ANLD, reveals that this high SES is due to harmful over-stemming (ANLD = 0.26), which correlates with the observed decrease in downstream performance.In contrast, the English stemmer achieves a moderate SES (1.31) with a safe meaning distance (ANLD = 0.14), allowing its word reduction to contribute positively to downstream performance; therefore, it is a more reliable stemmer. Our study provides a valuable tool for distinguishing between potential efficiency gains (high SES) and meaning preservation (low ANLD).
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Abdullah Al Sefat
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu et al.
Natural Language ProcessingRoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance-a phenomenon exacerbated in Mixture-of-Experts models-leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
๐
2025-11-25
๐ AAAI 2026
๐ฅ Taewhoo Lee, Minju Song, Chanwoong Yoon, Jungwoo Park, Jaewoo Kang
Natural Language Processing
ๆฅ็ๆ่ฆ
Analogical reasoning is at the core of human cognition, serving as an important foundation for a variety of intellectual activities. While prior work has shown that LLMs can represent task patterns and surface-level concepts, it remains unclear whether these models can encode high-level relational concepts and apply them to novel situations through structured comparisons. In this work, we explore this fundamental aspect using proportional and story analogies, and identify three key findings. First, LLMs effectively encode the underlying relationships between analogous entities; both attributive and relational information propagate through mid-upper layers in correct cases, whereas reasoning failures reflect missing relational information within these layers. Second, unlike humans, LLMs often struggle not only when relational information is missing, but also when attempting to apply it to new entities. In such cases, strategically patching hidden representations at critical token positions can facilitate information transfer to a certain extent. Lastly, successful analogical reasoning in LLMs is marked by strong structural alignment between analogous situations, whereas failures often reflect degraded or misplaced alignment. Overall, our findings reveal that LLMs exhibit emerging but limited capabilities in encoding and applying high-level relational concepts, highlighting both parallels and gaps with human cognition.
๐
2025-11-25
๐ AAAI
๐ฅ Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely applied in mainstream model inference systems as a superior alternative to speculative decoding, as they compress the available idle computing power. Therefore, performing speculative decoding with low verification resources and low scheduling costs has become an important research problem. We believe that more capable models that allow for parallel generation on draft sequences are what we truly need. Recognizing the fundamental nature of draft models to only generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model's ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Abhinav Joshi, Divyanshu Bhatt, Ashutosh Modi
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) show strong generalization across diverse tasks, yet the internal decision-making processes behind their predictions remain opaque. In this work, we study the geometry of hidden representations in LLMs through the lens of \textit{intrinsic dimension} (ID), focusing specifically on decision-making dynamics in a multiple-choice question answering (MCQA) setting. We perform a large-scale study, with 28 open-weight transformer models and estimate ID across layers using multiple estimators, while also quantifying per-layer performance on MCQA tasks. Our findings reveal a consistent ID pattern across models: early layers operate on low-dimensional manifolds, middle layers expand this space, and later layers compress it again, converging to decision-relevant representations. Together, these results suggest LLMs implicitly learn to project linguistic inputs onto structured, low-dimensional manifolds aligned with task-specific decisions, providing new geometric insights into how generalization and reasoning emerge in language models.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs) (the building blocks of a transformer architecture) as indivisible units, overlooking possibilities of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in a computational graph, that are previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chuyi Kong, Gao Wei, Jing Ma, Hongzhan Lin, Zhiyuan Fan
Natural Language Processing
ๆฅ็ๆ่ฆ
The prevalence of misinformation on social media threatens public trust, demanding automated fact-checking systems that provide accurate verdicts with interpretable explanations. However, existing large language model-based (LLM-based) approaches often rely heavily on external knowledge sources, introducing substantial latency and even hallucinations that undermine reliability, interpretability, and responsiveness, which is crucial for real-time use. To address these challenges, we propose REason-guided Fact-checking with Latent EXplanations REFLEX paradigm, a plug-and-play, self-refining paradigm that leverages the internal knowledge in backbone model to improve both verdict accuracy and explanation quality. REFLEX reformulates fact-checking as a role-play dialogue and jointly trains verdict prediction and explanation generation. It adaptively extracts contrastive activation pairs between the backbone model and its fine-tuned variant to construct steering vectors that disentangle truth into style and substance naturally. These activation-level signals guide inference and suppress noisy explanations, enabling more faithful and efficient reasoning. Experiments on real-world datasets show that REFLEX outperforms previous methods that steer toward a single truth direction and underscores the challenge traditional approaches face when handling the subtle, human-unknown truth in fact-checking tasks. Remarkably, with only 465 self-refined training samples, RELFEX achieves state-of-the-art performance. Furthermore, models trained with explanatory objectives can effectively guide those without them, yielding up to a 7.57% improvement, highlighting that internal explanation signals play a dual role in both interpreting and enhancing factual reasoning.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Adilet Metinov, Gulida M. Kudakeeva, Gulnara D. Kabaeva
Natural Language Processing
ๆฅ็ๆ่ฆ
Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model for Kyrgyz. The model has 35.9M parameters and uses a custom tokenizer designed for the language's morphological structure. To evaluate performance, we create kyrgyz-sst2, a sentiment analysis benchmark built by translating the Stanford Sentiment Treebank and manually annotating the full test set. KyrgyzBERT fine-tuned on this dataset achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times larger. All models, data, and code are released to support future research in Kyrgyz NLP.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wen-Fang Su, Hsiao-Wei Chou, Wen-Yang Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation, as traditional methods often missegment or entirely miss cross-sentence discontinuous entities, significantly affecting recognition accuracy. Therefore, we aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction due to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to enhance their ability to recognize discontinuous entities and handle segmentation challenges. Experimental results demonstrate that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, leading to decreased performance. In contrast, our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Somsubhra De, Harsh Kumar, Arun Prakash A
Natural Language Processing
ๆฅ็ๆ่ฆ
Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in transformer-based models and large annotated datasets have greatly improved GEC performance for high-resource languages such as English, the progress has not extended equally. For most Indic languages, GEC remains a challenging task due to limited resources, linguistic diversity and complex morphology. In this work, we explore prompting-based approaches using state-of-the-art large language models (LLMs), such as GPT-4.1, Gemini-2.5 and LLaMA-4, combined with few-shot strategy to adapt them to low-resource settings. We observe that even basic prompting strategies, such as zero-shot and few-shot approaches, enable these LLMs to substantially outperform fine-tuned Indic-language models like Sarvam-22B, thereby illustrating the exceptional multilingual generalization capabilities of contemporary LLMs for GEC. Our experiments show that carefully designed prompts and lightweight adaptation significantly enhance correction quality across multiple Indic languages. We achieved leading results in the shared task--ranking 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Huu Tuong Tu, Ha Viet Khanh, Tran Tien Dat, Vu Huan, Thien Van Luong et al.
ๆฅ็ๆ่ฆ
Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.
๐
2025-11-25
๐ Submitted to IEEE Transactions on Affective comput
๐ฅ Xingfeng Li, Xiaohan Shi, Junjie Li, Yongwei Li, Masashi Unoki et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Addressing the limitations of predominantly monolingual and single-label emotion corpora \textcolor{black}{that restrict linguistic diversity, are unable to model mixed emotions, and lack ecological validity}, EM2LDL comprises expressive utterances in English, Mandarin, and Cantonese, capturing the intra-utterance code-switching prevalent in multilingual regions like Hong Kong and Macao. The corpus integrates spontaneous emotional expressions from online platforms, annotated with fine-grained emotion distributions across 32 categories. Experimental baselines using self-supervised learning models demonstrate robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving optimal results. By incorporating linguistic diversity and ecological validity, EM2LDL enables the exploration of complex emotional dynamics in multilingual settings. This work provides a versatile testbed for developing adaptive, empathetic systems for applications in affective computing, including mental health monitoring and cross-cultural communication. The dataset, annotations, and baseline codes are publicly available at https://github.com/xingfengli/EM2LDL.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Craig Dickson
Natural Language Processing
ๆฅ็ๆ่ฆ
Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment - a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some models showed more resistance than others. Specifically the Qwen-2.5 family proved to be relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate if current-generation open-weights models exhibit similar resistance to the Qwen-2.5 family and measure misalignment robustness over a range of model architectures and scales.
We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o's 20%.
We identify a critical format-dependent vulnerability: requiring JSON output doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model's 'degrees of freedom' to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than observed in proprietary systems.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.
๐
2025-11-25
๐ AAAI 2026
๐ฅ Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo et al.
Machine Learning
ๆฅ็ๆ่ฆ
Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate the entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation codes. To address the challenge of exploring an intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs for high-performance GPU kernel generation. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves near 100% and 70% accuracy at Levels 1-2 and 3, over 50% than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs, and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34x speedup.
๐
2025-11-25
๐ Accepted at the 41st ACM/SIGAPP Symposium On Appli
๐ฅ Duc Anh Vu, Thong Nguyen, Cong-Duy Nguyen, Viet Anh Nguyen, Anh Tuan Luu
Natural Language Processing
ๆฅ็ๆ่ฆ
With the advancement of large language models (LLMs), their performance on multiple-choice question (MCQ) tasks has improved significantly. However, existing approaches face key limitations: answer choices are typically presented to LLMs without contextual grounding or explanation. This absence of context can lead to incomplete exploration of all possible answers, ultimately degrading the models' reasoning capabilities. To address these challenges, we introduce BiasPrompting, a novel inference framework that guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction. It consists of two components: first, a reasoning generation stage, where the model is prompted to produce supportive reasonings for each answer option, and then, a reasoning-guided agreement stage, where the generated reasonings are synthesized to select the most plausible answer. Through comprehensive evaluations, BiasPrompting demonstrates significant improvements in five widely used multiple-choice question answering benchmarks. Our experiments showcase that BiasPrompting enhances the reasoning capabilities of LLMs and provides a strong foundation for tackling complex and challenging questions, particularly in settings where existing methods underperform.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, wenlin zhang, Pengyue Jia et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Huiyu Bai, Runze Wang, Zhuoyun Du, Yiyang Zhao, Fengji Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalized Visual Language Models (VLMs) are gaining increasing attention for their formidable ability in user-specific concepts aligned interactions (e.g., identifying a user's bike). Existing methods typically require the learning of separate embeddings for each new concept, which fails to support real-time adaptation during testing. This limitation becomes particularly pronounced in large-scale scenarios, where efficient retrieval of concept embeddings is not achievable. To alleviate this gap, we propose Online-PVLM, a framework for online concept learning by leveraging hyperbolic representations. Our approach makes a train-free paradigm for concept embeddings generation at test time, making the use of personalized VLMs both scalable and efficient. In addition, we develop OP-Eval, a comprehensive and large-scale benchmark comprising 1,292 concepts and over 30K high-quality instances with diverse question types, designed to rigorously assess online concept learning in realistic scenarios. Extensive experiments demonstrate the state-of-the-art performance of our proposed framework. Our source code and dataset will be made available.
๐
2025-11-25
๐ AAAI
๐ฅ Edward Ajayi, Martha Kachweka, Mawuli Deku, Emily Aiken
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous "split-then-balance" pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAPLLM explainability framework and present a prototype dashboard ("Social Media Screener") designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically-validated datasets at the critical intersection of online safety and computational mental health.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mihir Sahasrabudhe
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report a "reversal curse," and recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself? We cut through this ambiguity with a fully synthetic, entropy-controlled benchmark designed as a clean-room stress test for directional learning. Using random string mappings with tunable branching factor K, we construct forward tasks with zero conditional entropy and inverse tasks with analytically determined entropy floors. Excess loss above these floors reveals that even scratch-trained GPT-2 models exhibit a strong, reproducible directional optimization gap (e.g., 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initializations shift optimization behavior but do not eliminate this gap, while LoRA encounters a sharp capacity wall on high-entropy inverse mappings. Together, these results isolate a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training-one that persists even when linguistic priors, token frequencies, and corpus-level temporal asymmetries are removed. Our benchmark provides a controlled instrument for dissecting directional biases in modern sequence models and motivates deeper mechanistic study of why inversion remains fundamentally harder for Transformers.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xinyu Wang, Hanwei Wu, Qingchen Hu, Zhenghan Tai, Jingrui Tian et al.
ๆฅ็ๆ่ฆ
Decoder-only rerankers are central to Retrieval-Augmented Generation (RAG). However, generalist models miss domain-specific nuances in high-stakes fields like finance and law, and naive fine-tuning causes surface-form overfitting and catastrophic forgetting. To address this challenge, we introduce R2R, a domain-aware framework that combines dynamic expert routing with a two-stage training strategy, Entity Abstraction for Generalization (EAG). EAG introduces a counter-shortcut mechanism by masking the most predictive surface cues, forcing the reranker to learn domain-invariant relevance patterns rather than memorizing dataset-specific entities. To efficiently activate domain experts, R2R employs a lightweight Latent Semantic Router that probes internal representations from the frozen backbone decoder to select the optimal LoRA expert per query. Extensive experiments across different reranker backbones and diverse domains (legal, medical, and financial) demonstrate that R2R consistently surpasses generalist and single-domain fine-tuned baselines. Our results confirm that R2R is a model-agnostic and modular approach to domain specialization with strong cross-domain robustness.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tianyi Chen, Michael Solodko, Sen Wang, Jongwoo Ko, Junheng Hao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented-settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at https://github.com/microsoft/appselectbench.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Songlin Zhao, Michael Pitts, Zhuwei Qin
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of large language models (LLMs) has increased the demand for domain-specialized variants in areas such as law, healthcare, and finance. However, their large size remains a barrier to deployment in resource-constrained environments, and existing compression methods either generalize poorly across domains or incur high overhead. In this work, we propose \textbf{EfficientXpert}, a lightweight domain-pruning framework that combines a propagation-aware pruning criterion (Foresight Mask) with an efficient adapter-update algorithm (Partial Brain Surgeon). Integrated into the LoRA fine-tuning process, EfficientXpert enables a one-step transformation of general pretrained models into sparse, domain-adapted experts. Across health and legal tasks, it retains up to 98% of dense-model performance at 40% sparsity, outperforming state-of-the-art methods. Further analysis reveals substantial domain-dependent structural shifts that degrade the effectiveness of general pruning masks, underscoring the need for adaptive, domain-aware pruning strategies tailored to each domain.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuefei Chen, Jiang Liu, Xiaodong Lin, Ruixiang Tang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. Dataset and code will be further released.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chengyue Huang, Mellon M. Zhang, Robert Azarcon, Glen Chou, Zsolt Kira
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes -- freezing modules or applying uniform regularization -- either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Farzad Ahmed, Joniel Augustine Jerome, Meliha Yetisgen, รzlem Uzuner
Natural Language Processing
ๆฅ็ๆ่ฆ
Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction.
Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning.
Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections.
Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shi-Wei Dai, Yan-Wei Shie, Tsung-Huan Yang, Lun-Wei Ku, Yung-Hui Li
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Personalized Large Language Models (LLMs) have been shown to be an effective way to create more engaging and enjoyable user-AI interactions. While previous studies have explored using prompts to elicit specific personality traits in LLMs, they have not optimized these prompts to maximize personality expression. To address this limitation, we propose PersonaPulse: Dynamic Profile Optimization for Realistic Personality Expression in LLMs, a framework that leverages LLMs' inherent knowledge of personality traits to iteratively enhance role-play prompts while integrating a situational response benchmark as a scoring tool, ensuring a more realistic and contextually grounded evaluation to guide the optimization process. Quantitative evaluations demonstrate that the prompts generated by PersonaPulse outperform those of prior work, which were designed based on personality descriptions from psychological studies. Additionally, we explore the relationship between model size and personality modeling through extensive experiments. Finally, we find that, for certain personality traits, the extent of personality evocation can be partially controlled by pausing the optimization process. These findings underscore the importance of prompt optimization in shaping personality expression within LLMs, offering valuable insights for future research on adaptive AI interactions.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Miguel Carvalho, Helder Dias, Bruno Martins
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically ''zoom in'' on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.
๐
2025-11-25
๐ Published in the The Fourth Workshop on Processing
๐ฅ Koena Ronny Mabokela, Tim Schlippe, Mpho Raborife, Turgay Celik
Natural Language Processing
ๆฅ็ๆ่ฆ
Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education or marketing. While many of the sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources like text labelled with corresponding sentiment classes. One reason for that is that manually labelling text data is time-consuming and expensive. Consequently, automatic and rapid processes are needed to reduce the manual effort as much as possible making the labelling process as efficient as possible. In this paper, we present and analyze an automatic language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted with tweets in the languages English, Sepedi and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach is able to label the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Saif M. Mohammad
ๆฅ็ๆ่ฆ
Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D). Existing lexicons such as the NRC VAD Lexicon, published in 2018, include VAD association ratings for words. Here, we present a complement to it, which has human ratings of valence, arousal, and dominance for 10k English Multiword Expressions (MWEs) and their constituent words. We also increase the coverage of unigrams, especially words that have become more common since 2018. In all, the new NRC VAD Lexicon v2 now has entries for 10k MWEs and 25k words, in addition to the entries in v1. We show that the associations are highly reliable. We use the lexicon to examine emotional characteristics of MWEs, including: 1. The degree to which MWEs (idioms, noun compounds, and verb particle constructions) exhibit strong emotionality; 2. The degree of emotional compositionality in MWEs. The lexicon enables a wide variety of research in NLP, Psychology, Public Health, Digital Humanities, and Social Sciences. The NRC VAD Lexicon v2 is freely available through the project webpage: http://saifmohammad.com/WebPages/nrc-vad.html
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Debin Meng, Chen Jin, Zheng Gao, Yanran Li, Ioannis Patras et al.
Machine Learning
ๆฅ็ๆ่ฆ
Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wei Tang, Zuo-Zheng Wang, Kun Zhang, Tong Wei, Min-Ling Zhang
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP's textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa et al.
Robotics
ๆฅ็ๆ่ฆ
While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziyue Wang, Yayati Jadhav, Peter Pak, Amir Barati Farimani
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-extrusion (MEX) printing. This process requires converting CAD geometry into machine-readable G-code through slicing and path planning. While each step is well established, dependence on CAD modeling remains a major bottleneck: constructing object-specific 3D geometry is slow and poorly suited to rapid prototyping. Even minor design variations typically necessitate manual updates in CAD software, making iteration time-consuming and difficult to scale. To address this limitation, we introduce Image2Gcode, an end-to-end data-driven framework that bypasses the CAD stage and generates printer-ready G-code directly from images and part drawings. Instead of relying on an explicit 3D model, a hand-drawn or captured 2D image serves as the sole input. The framework first extracts slice-wise structural cues from the image and then employs a denoising diffusion probabilistic model (DDPM) over G-code sequences. Through iterative denoising, the model transforms Gaussian noise into executable print-move trajectories with corresponding extrusion parameters, establishing a direct mapping from visual input to native toolpaths. By producing structured G-code directly from 2D imagery, Image2Gcode eliminates the need for CAD or STL intermediates, lowering the entry barrier for additive manufacturing and accelerating the design-to-fabrication cycle. This approach supports on-demand prototyping from simple sketches or visual references and integrates with upstream 2D-to-3D reconstruction modules to enable an automated pipeline from concept to physical artifact. The result is a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi et al.
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The optimization of large language models (LLMs) remains a critical challenge, particularly as model scaling exacerbates sensitivity to algorithmic imprecision and training instability. Recent advances in optimizers have improved convergence efficiency through momentum orthogonalization, but suffer from two key robustness limitations: dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise. To address these robustness challenges, we introduce ROOT, a Robust Orthogonalized Optimizer that enhances training stability through dual robustness mechanisms. First, we develop a dimension-robust orthogonalization scheme using adaptive Newton iterations with fine-grained coefficients tailored to specific matrix sizes, ensuring consistent precision across diverse architectural configurations. Second, we introduce an optimization-robust framework via proximal optimization that suppresses outlier noise while preserving meaningful gradient directions. Extensive experiments demonstrate that ROOT achieves significantly improved robustness, with faster convergence and superior final performance compared to both Muon and Adam-based optimizers, particularly in noisy and non-convex scenarios. Our work establishes a new paradigm for developing robust and precise optimizers capable of handling the complexities of modern large-scale model training. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/ROOT.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Adam Karvonen, Daniel Reuter, Roy Rinberg, Luke Marks, Adriร Garriga-Alonso et al.
ๆฅ็ๆ่ฆ
As demand for LLM inference grows, it is becoming increasingly important that providers and their customers can verify that inference processes are performed correctly, without errors or tampering. However, re-running the same inference process twice often leads to different results due to benign numerical noise, making it difficult to distinguish legitimate variation from actual problems. To address this problem, we introduce Token-DiFR (Token-Divergence-From-Reference), a method for verifying inference outputs by comparing generated tokens against predictions made by a trusted reference implementation conditioned on the same random seed. Sampling seed synchronization tightly constrains valid outputs, leaving providers minimal room to deviate from correct inference, which allows output tokens themselves to serve as auditable evidence of correctness at zero additional cost to the provider. Token-DiFR reliably identifies sampling errors, simulated bugs, and model quantization, detecting 4-bit quantization with AUC $>$ 0.999 within 300 output tokens. For applications requiring sample-efficient forward-pass verification, we additionally introduce Activation-DiFR, a scheme that uses random orthogonal projections to compress activations into compact fingerprints for subsequent verification. Activation-DiFR detects 4-bit quantization with AUC $>$ 0.999 using just 2 output tokens, while reducing communication overhead by 25-75% relative to existing methods. We release an open-source integration with vLLM to accelerate practical deployment of verifiable inference.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Panayiotis Danassis, Naman Goel
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and $\sim 40$k matches demonstrate (i) a clear superiority of human(graduate students)-coded agents: the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real-world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yujin Kim, Sarah Dean
Machine Learning
ๆฅ็ๆ่ฆ
Many consequential real-world systems, like wind fields and ocean currents, are dynamic and hard to model. Learning their governing dynamics remains a central challenge in scientific machine learning. Dynamic Mode Decomposition (DMD) provides a simple, data-driven approximation, but practical use is limited by sparse/noisy observations from continuous fields, reliance on linear approximations, and the lack of principled uncertainty quantification. To address these issues, we introduce Stochastic NODE-DMD, a probabilistic extension of DMD that models continuous-time, nonlinear dynamics while remaining interpretable. Our approach enables continuous spatiotemporal reconstruction at arbitrary coordinates and quantifies predictive uncertainty. Across four benchmarks, a synthetic setting and three physics-based flows, it surpasses a baseline in reconstruction accuracy when trained from only 10% observation density. It further recovers the dynamical structure by aligning learned modes and continuous-time eigenvalues with ground truth. Finally, on datasets with multiple realizations, our method learns a calibrated distribution over latent dynamics that preserves ensemble variability rather than averaging across regimes. Our code is available at: https://github.com/sedan-group/Stochastic-NODE-DMD
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shurong Wang, Yuqi Pan, Zhuoyang Shen, Meng Zhang, Hongwei Wang et al.
ๆฅ็ๆ่ฆ
Associative memory models are content-addressable memory systems fundamental to biological intelligence and are notable for their high interpretability. However, existing models evaluate the quality of retrieval based on proximity, which cannot guarantee that the retrieved pattern has the strongest association with the query, failing correctness. We reframe this problem by proposing that a query is a generative variant of a stored memory pattern, and define a variant distribution to model this subtle context-dependent generative process. Consequently, correct retrieval should return the memory pattern with the maximum a posteriori probability of being the query's origin. This perspective reveals that an ideal similarity measure should approximate the likelihood of each stored pattern generating the query in accordance with variant distribution, which is impossible for fixed and pre-defined similarities used by existing associative memories. To this end, we develop adaptive similarity, a novel mechanism that learns to approximate this insightful but unknown likelihood from samples drawn from context, aiming for correct retrieval. We theoretically prove that our proposed adaptive similarity achieves optimal correct retrieval under three canonical and widely applicable types of variants: noisy, masked, and biased. We integrate this mechanism into a novel adaptive Hopfield network (A-Hop), and empirical results show that it achieves state-of-the-art performance across diverse tasks, including memory retrieval, tabular classification, image classification, and multiple instance learning.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiwen Huang, Pierre Pinson
ๆฅ็ๆ่ฆ
We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. By originally formalising the market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance based and query-by-committee based), paired with distinct pricing mechanisms. They are compared to a benchmark random sampling approach. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal comprises an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Yixin Liu, Pengfei Liu, Arman Cohan
Natural Language Processing
ๆฅ็ๆ่ฆ
Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that measures LLM alignment with human preferences without directly evaluating their generated outputs, instead assessing LLMs in their role as evaluators. Our evaluation shows that our proposed benchmark, AlignEval, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs' generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Heman Shakeri
ๆฅ็ๆ่ฆ
Deep sequence models for blood glucose forecasting consistently fail to leverage clinically informative drivers--insulin, meals, and activity--despite well-understood physiological mechanisms. We term this Driver-Blindness and formalize it via $ฮ_{\text{drivers}}$, the performance gain of multivariate models over matched univariate baselines. Across the literature, $ฮ_{\text{drivers}}$ is typically near zero. We attribute this to three interacting factors: architectural biases favoring autocorrelation (C1), data fidelity gaps that render drivers noisy and confounded (C2), and physiological heterogeneity that undermines population-level models (C3). We synthesize strategies that partially mitigate Driver-Blindness--including physiological feature encoders, causal regularization, and personalization--and recommend that future work routinely report $ฮ_{\text{drivers}}$ to prevent driver-blind models from being considered state-of-the-art.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats et al.
ๆฅ็ๆ่ฆ
The integration of artificial intelligence (AI) agents into web browsers introduces security challenges that go beyond traditional web application threat models. Prior work has identified prompt injection as a new attack vector for web agents, yet the resulting impact within real-world environments remains insufficiently understood.
In this work, we examine the landscape of prompt injection attacks and synthesize a benchmark of attacks embedded in realistic HTML payloads. Our benchmark goes beyond prior work by emphasizing injections that can influence real-world actions rather than mere text outputs, and by presenting attack payloads with complexity and distractor frequency similar to what real-world agents encounter. We leverage this benchmark to conduct a comprehensive empirical evaluation of existing defenses, assessing their effectiveness across a suite of frontier AI models. We propose a multi-layered defense strategy comprising both architectural and model-based defenses to protect against evolving prompt injection attacks. Our work offers a blueprint for designing practical, secure web agents through a defense-in-depth approach.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingxing Rao, Bowen Qu, Daniel Moyer
Machine Learning
ๆฅ็ๆ่ฆ
The recovery of training data from generative models (``model inversion'') has been extensively studied for diffusion models in the data domain. The encoder/decoder pair and corresponding latent codes have largely been ignored by inversion techniques applied to latent space generative models, e.g., Latent Diffusion models (LDMs). In this work we describe two key findings: (1) The diffusion model exhibits non-uniform memorization across latent codes, tending to overfit samples located in high-distortion regions of the decoder pullback metric. (2) Even within a single latent code, different dimensions contribute unequally to memorization. We introduce a principled method to rank latent dimensions by their per-dimensional contribution to the decoder pullback metric, identifying those most responsible for memorization. Empirically, removing less-memorizing dimensions when computing attack statistics for score-based membership inference attacker significantly improves performance, with average AUROC gains of 2.7\% and substantial increases in TPR@1\%FPR (6.42\%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pokรฉmon, MS-COCO, and Flickr. This indicates stronger confidence in identifying members under extremely low false-positive tolerance. Our results highlight the overlooked influence of the auto-encoder geometry on LDM memorization and provide a new perspective for analyzing privacy risks in diffusion-based generative models.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Charlotte Beylier, Hannah Selder, Arthur Fleig, Simon M. Hofmann, Nico Scherf
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The learning process of a reinforcement learning (RL) agent remains poorly understood beyond the mathematical formulation of its learning algorithm. To address this gap, we introduce attention-oriented metrics (ATOMs) to investigate the development of an RL agent's attention during training. In a controlled experiment, we tested ATOMs on three variations of a Pong game, each designed to teach the agent distinct behaviours, complemented by a behavioural assessment. ATOMs successfully delineate the attention patterns of an agent trained on each game variation, and that these differences in attention patterns translate into differences in the agent's behaviour. Through continuous monitoring of ATOMs during training, we observed that the agent's attention developed in phases, and that these phases were consistent across game variations. Overall, we believe that ATOM could help improve our understanding of the learning processes of RL agents and better understand the relationship between attention and learning.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Karim Kadry, Abdallah Abdelwahed, Shoaib Goraya, Ajay Manicka, Naravich Chutisilp et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We present Anatomica: an inference-time framework for generating multi-class anatomical voxel maps with localized geo-topological control. During generation, we use cuboidal control domains of varying dimensionality, location, and shape to slice out relevant substructures. These local substructures are used to compute differentiable penalty functions that steer the sample towards target constraints. We control geometric features such as size, shape, and position through voxel-wise moments, while topological features such as connected components, loops, and voids are enforced through persistent homology. Lastly, we implement Anatomica for latent diffusion models, where neural field decoders partially extract substructures, enabling the efficient control of anatomical properties. Anatomica applies flexibly across diverse anatomical systems, composing constraints to control complex structures over arbitrary dimensions and coordinate systems, thereby enabling the rational design of synthetic datasets for virtual trials or machine learning workflows.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Dennis Eisermann, Frank Kargl
Machine Learning
ๆฅ็ๆ่ฆ
Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics such as accuracy and precision fail to capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the \emph{Parallel Trust Assessment System (PaTAS)}, a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through \emph{Trust Nodes} and \emph{Trust Functions} that propagate input, parameter, and activation trust across the network. The framework defines a \emph{Parameter Trust Update} mechanism to refine parameter reliability during training and an \emph{Inference-Path Trust Assessment (IPTA)} method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a principled foundation for evaluating model reliability across the AI lifecycle.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shuo Xie, Tianhao Wang, Beining Wu, Zhiyuan Li
Machine Learning
ๆฅ็ๆ่ฆ
Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction between their analyses, however, lies in the geometries, e.g., smoothness notions, they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum in the convex setting, a guarantee unattainable under standard smoothness for certain non-Euclidean geometry. We further develop an analogous comparison for stochastic optimization by introducing adaptive gradient variance, which parallels adaptive smoothness and leads to dimension-free convergence guarantees that cannot be achieved under standard gradient variance for certain non-Euclidean geometry.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sumit S Shevtekar, Chandresh K Maurya, Gourab Sil
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Real-world time-series data is highly non stationary and complex in dynamics that operate across multiple timescales, ranging from fast, short-term changes to slow, long-term trends. Most existing models rely on fixed-scale structural priors, such as patch-based tokenization, fixed frequency transformations, or frozen backbone architectures. This often leads to over-regularization of temporal dynamics, which limits their ability to adaptively model the full spectrum of temporal variations and impairs their performance on unpredictable, Sudden, high-magnitude events. To address this, we introduce the Multi-scale Temporal Network (MSTN), a novel deep learning architecture founded on a hierarchical multi-scale and sequence modeling principle. The MSTN framework integrates: (i) a multi-scale convolutional encoder that constructs a hierarchical feature pyramid for local patterns (ii) a sequence modeling component for long-range temporal dependencies. We empirically validate this with BiLSTM and Transformer variants, establishing a flexible foundation for future architectural advancements. and (iii) a gated fusion mechanism augmented with squeeze-and-excitation (SE) and multi-head temporal attention (MHTA) for dynamic, context-aware feature integration. This design enables MSTN to adaptively model temporal patterns from milliseconds to long-range dependencies within a unified framework. Extensive evaluations across time-series long-horizon forecasting, imputation, classification and generalizability study demonstrate that MSTN achieves competitive state-of-the-art (SOTA) performance, showing improvements over contemporary approaches including EMTSF, LLM4TS, HiMTM, TIME-LLM, MTST, SOFTS, iTransformer, TimesNet, and PatchTST. In total, MSTN establishes new SOTA performance on 24 of 32 benchmark datasets, demonstrating its consistent performance across diverse temporal tasks.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Tasha Kim, Oiwi Parker Jones
Robotics
ๆฅ็ๆ่ฆ
Safety-critical assistive systems that directly decode user intent from neural signals require rigorous guarantees of reliability and trust. We present GUARDIAN (Gated Uncertainty-Aware Runtime Dual Invariants), a framework for real-time neuro-symbolic verification for neural signal-controlled robotics. GUARDIAN enforces both logical safety and physiological trust by coupling confidence-calibrated brain signal decoding with symbolic goal grounding and dual-layer runtime monitoring. On the BNCI2014 motor imagery electroencephalogram (EEG) dataset with 9 subjects and 5,184 trials, the system performs at a high safety rate of 94-97% even with lightweight decoder architectures with low test accuracies (27-46%) and high ECE confidence miscalibration (0.22-0.41). We demonstrate 1.7x correct interventions in simulated noise testing versus at baseline. The monitor operates at 100Hz and sub-millisecond decision latency, making it practically viable for closed-loop neural signal-based systems. Across 21 ablation results, GUARDIAN exhibits a graduated response to signal degradation, and produces auditable traces from intent, plan to action, helping to link neural evidence to verifiable robot action.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rui Xue, Shichao Zhu, Liang Qin, Guangmou Pan, Yang Song et al.
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have emerged as powerful tools for modeling graph-structured data and have been widely used in recommender systems, such as for capturing complex user-item and item-item relations. However, most industrial deployments adopt a two-stage pipeline: GNNs are first pre-trained offline to generate node embeddings, which are then used as static features for downstream recommender systems. This decoupled paradigm leads to two key limitations: (1) high computational overhead, since large-scale GNN inference must be repeatedly executed to refresh embeddings; and (2) lack of joint optimization, as the gradient from the recommender system cannot directly influence the GNN learning process, causing the GNN to be suboptimally informative for the recommendation task. In this paper, we propose E2E-GRec, a novel end-to-end training framework that unifies GNN training with the recommender system. Our framework is characterized by three key components: (i) efficient subgraph sampling from a large-scale cross-domain heterogeneous graph to ensure training scalability and efficiency; (ii) a Graph Feature Auto-Encoder (GFAE) serving as an auxiliary self-supervised task to guide the GNN to learn structurally meaningful embeddings; and (iii) a two-level feature fusion mechanism combined with Gradnorm-based dynamic loss balancing, which stabilizes graph-aware multi-task end-to-end training. Extensive offline evaluations, online A/B tests (e.g., a +0.133% relative improvement in stay duration, a 0.3171% reduction in the average number of videos a user skips) on large-scale production data, together with theoretical analysis, demonstrate that E2E-GRec consistently surpasses traditional approaches, yielding significant gains across multiple recommendation metrics.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Xintong Li, Haoran Zhang, Xiao Zhou
ๆฅ็ๆ่ฆ
The abundance of fine-grained spatio-temporal data, such as traffic sensor networks, offers vast opportunities for scientific discovery. However, inferring causal relationships from such observational data remains challenging, particularly due to unobserved confounders that are specific to units (e.g., geographical locations) yet influence outcomes over time. Most existing methods for spatio-temporal causal inference assume that all confounders are observed, an assumption that is often violated in practice. In this paper, we introduce Spatio-Temporal Hierarchical Causal Models (ST-HCMs), a novel graphical framework that extends hierarchical causal modeling to the spatio-temporal domain. At the core of our approach is the Spatio-Temporal Collapse Theorem, which shows that a complex ST-HCM converges to a simpler flat causal model as the amount of subunit data increases. This theoretical result enables a general procedure for causal identification, allowing ST-HCMs to recover causal effects even in the presence of unobserved, time-invariant unit-level confounders, a scenario where standard non-hierarchical models fail. We validate the effectiveness of our framework on both synthetic and real-world datasets, demonstrating its potential for robust causal inference in complex dynamic systems.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ege Ozguroglu, Junbang Liang, Ruoshi Liu, Mia Chiquier, Michael DeTienne et al.
Multimodal
ๆฅ็ๆ่ฆ
While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings. We present New York Smells, a large dataset of paired image and olfactory signals captured ``in the wild.'' Our dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, with approximately 70$\times$ more objects than existing olfactory datasets. Our benchmark has three tasks: cross-modal smell-to-image retrieval, recognizing scenes, objects, and materials from smell alone, and fine-grained discrimination between grass species. Through experiments on our dataset, we find that visual data enables cross-modal olfactory representation learning, and that our learned olfactory representations outperform widely-used hand-crafted features.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alhasan Abdellatif, Hannah P. Menke, Ahmed H. Elsheikh, Florian Doster, Kamaljit Singh
ๆฅ็ๆ่ฆ
The UNet-enhanced Fourier Neural Operator (UFNO) extends the Fourier Neural Operator (FNO) by incorporating a parallel UNet pathway, enabling the retention of both high- and low-frequency components. While UFNO improves predictive accuracy over FNO, it inefficiently treats scalar inputs (e.g., temperature, injection rate) as spatially distributed fields by duplicating their values across the domain. This forces the model to process redundant constant signals within the frequency domain. Additionally, its standard loss function does not account for spatial variations in error sensitivity, limiting performance in regions of high physical importance. We introduce UFNO-FiLM, an enhanced architecture that incorporates two key innovations. First, we decouple scalar inputs from spatial features using a Feature-wise Linear Modulation (FiLM) layer, allowing the model to modulate spatial feature maps without introducing constant signals into the Fourier transform. Second, we employ a spatially weighted loss function that prioritizes learning in critical regions. Our experiments on subsurface multiphase flow demonstrate a 21\% reduction in gas saturation Mean Absolute Error (MAE) compared to UFNO, highlighting the effectiveness of our approach in improving predictive accuracy.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andrea Ranieri, Giorgio Palmieri, Silvia Biasotti
Machine Learning
ๆฅ็ๆ่ฆ
This paper addresses the critical need for automated crack detection in the preservation of cultural heritage through semantic segmentation. We present a comparative study of U-Net architectures, using various convolutional neural network (CNN) encoders, for pixel-level crack identification on statues and monuments. A comparative quantitative evaluation is performed on the test set of the OmniCrack30k dataset [1] using popular segmentation metrics including Mean Intersection over Union (mIoU), Dice coefficient, and Jaccard index. This is complemented by an out-of-distribution qualitative evaluation on an unlabeled test set of real-world cracked statues and monuments. Our findings provide valuable insights into the capabilities of different CNN- based encoders for fine-grained crack segmentation. We show that the models exhibit promising generalization capabilities to unseen cultural heritage contexts, despite never having been explicitly trained on images of statues or monuments.
๐
2025-11-25
๐ ICML
๐ฅ Shamima Hossain
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Visual Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs due to a lack of robust reasoning capabilities. While extensive research has been conducted on integrating external knowledge for reasoning in large language models (LLMs), such efforts remain underexplored in VLMs, where the challenge is compounded by the need to bridge multiple modalities seamlessly. This work introduces a framework for knowledge-guided reasoning in VLMs, leveraging structured knowledge graphs for multi-hop verification using image-captioning task to illustrate our framework. Our approach enables systematic reasoning across multiple steps, including visual entity recognition, knowledge graph traversal, and fact-based caption refinement. We evaluate the framework using hierarchical, triple-based and bullet-point based knowledge representations, analyzing their effectiveness in factual accuracy and logical inference. Empirical results show that our approach improves factual accuracy by approximately 31% on preliminary experiments on a curated dataset of mixtures from Google Landmarks v2, Conceptual captions and Coco captions revealing key insights into reasoning patterns and failure modes. This work demonstrates the potential of integrating external knowledge for advancing reasoning in VLMs, paving the way for more reliable and knowledgable multimodal systems.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sam Laing, Antonio Orvieto
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The Adam optimizer is a cornerstone of modern deep learning, yet the empirical necessity of each of its individual components is often taken for granted. This paper presents a focused investigation into the role of bias-correction, a feature whose contribution remains poorly understood. Through a series of systematic ablations on vision and language modelling tasks, we demonstrate that the conventional wisdom surrounding bias correction is misleading. In particular, we demonstrate that in the optimal hyper-parameter configuration, the inclusion of bias correction leads to no improvement in final test performance. Moreover, unless appropriate learning rate scheduling is implemented, the inclusion of bias correction can sometimes be detrimental to performance. We further reinterpret bias correction as a form of implicit learning rate scheduling whose behaviour is strongly dependent on the choice of smoothing hyper-parameters $ฮฒ_1, ฮฒ_2 \in [0,1)$. Our findings challenge the universal inclusion of this component.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mihaela Hudiลteanu, Edwige Cyffers, Nikita P. Kalinin
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Adaptive optimizers are the de facto standard in non-private training as they often enable faster convergence and improved performance. In contrast, differentially private (DP) training is still predominantly performed with DP-SGD, typically requiring extensive compute and hyperparameter tuning. We propose DP-MicroAdam, a memory-efficient and sparsity-aware adaptive DP optimizer. We prove that DP-MicroAdam converges in stochastic non-convex optimization at the optimal $\mathcal{O}(1/\sqrt{T})$ rate, up to privacy-dependent constants. Empirically, DP-MicroAdam outperforms existing adaptive DP optimizers and achieves competitive or superior accuracy compared to DP-SGD across a range of benchmarks, including CIFAR-10, large-scale ImageNet training, and private fine-tuning of pretrained transformers. These results demonstrate that adaptive optimization can improve both performance and stability under differential privacy.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Rui Tong
Machine Learning
ๆฅ็ๆ่ฆ
Generative modeling is typically framed as learning mapping rules, but from an observer's perspective without access to these rules, the task manifests as disentangling the geometric support from the probability distribution. We propose that Continuum Percolation is uniquely suited for this support analysis, as the sampling process effectively projects high-dimensional density estimation onto a geometric counting problem on the support. In this work, we establish a rigorous isomorphism between the topological phase transitions of Random Geometric Graphs and the underlying data manifold in high-dimensional space. By analyzing the relationship between our proposed Percolation Shift metric and FID, we demonstrate that our metric captures structural pathologies (such as implicit mode collapse) where statistical metrics fail. Finally, we translate this topological phenomenon into a differentiable loss function to guide training. Experimental results confirm that this approach not only prevents manifold shrinkage but drives the model toward a state of "Hyper-Generalization," achieving good fidelity and verified topological expansion.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Muhammad Irfan, Nasir Rahim, Khalid Mahmood Malik
Machine Learning
ๆฅ็ๆ่ฆ
Accurate extraction and segmentation of the cerebral arteries from digital subtraction angiography (DSA) sequences is essential for developing reliable clinical management models of complex cerebrovascular diseases. Conventional loss functions often rely solely on pixel-wise overlap, overlooking the geometric and physical consistency of vascular boundaries, which can lead to fragmented or unstable vessel predictions. To overcome this limitation, we propose a novel \textit{Physics-Informed Loss} (PIL) that models the interaction between the predicted and ground-truth boundaries as an elastic process inspired by dislocation theory in materials physics. This formulation introduces a physics-based regularization term that enforces smooth contour evolution and structural consistency, allowing the network to better capture fine vascular geometry. The proposed loss is integrated into several segmentation architectures, including U-Net, U-Net++, SegFormer, and MedFormer, and evaluated on two public benchmarks: DIAS and DSCA. Experimental results demonstrate that PIL consistently outperforms conventional loss functions such as Cross-Entropy, Dice, Active Contour, and Surface losses, achieving superior sensitivity, F1 score, and boundary coherence. These findings confirm that the incorporation of physics-based boundary interactions into deep neural networks improves both the precision and robustness of vascular segmentation in dynamic angiographic imaging. The implementation of the proposed method is publicly available at https://github.com/irfantahir301/Physicsis_loss.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sidahmed Benabderrahmane, Talal Rahwan
Machine Learning
ๆฅ็ๆ่ฆ
Advanced Persistent Threats (APT) pose a major cybersecurity challenge due to their stealth, persistence, and adaptability. Traditional machine learning detectors struggle with class imbalance, high dimensional features, and scarce real world traces. They often lack transferability-performing well in the training domain but degrading in novel attack scenarios. We propose a hybrid transfer framework that integrates Transfer Learning, Explainable AI (XAI), contrastive learning, and Siamese networks to improve cross-domain generalization. An attention-based autoencoder supports knowledge transfer across domains, while Shapley Additive exPlanations (SHAP) select stable, informative features to reduce dimensionality and computational cost. A Siamese encoder trained with a contrastive objective aligns source and target representations, increasing anomaly separability and mitigating feature drift. We evaluate on real-world traces from the DARPA Transparent Computing (TC) program and augment with synthetic attack scenarios to test robustness. Across source to target transfers, the approach delivers improved detection scores with classical and deep baselines, demonstrating a scalable, explainable, and transferable solution for APT detection.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain, Phil F Cheng, Petros Liakopoulos et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability -- frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.
๐
2025-11-25
๐ SIGMOD 2026
๐ฅ Kanchan Chowdhury, Lixi Zhou, Lulu Xie, Xinwei Fu, Jia Zou
ๆฅ็ๆ่ฆ
Real-world AI/ML workflows often apply inference computations to feature vectors joined from multiple datasets. To avoid the redundant AI/ML computations caused by repeated data records in the join's output, factorized ML has been proposed to decompose ML computations into sub-computations to be executed on each normalized dataset. However, there is insufficient discussion on how factorized ML could impact AI/ML inference over multi-way joins. To address the limitations, we propose a novel declarative InferF system, focusing on the factorization of arbitrary inference workflows represented as analyzable expressions over the multi-way joins. We formalize our problem to flexibly push down partial factorized computations to qualified nodes in the join tree to minimize the overall inference computation and join costs and propose two algorithms to resolve the problem: (1) a greedy algorithm based on a per-node cost function that estimates the influence on overall latency if a subset of factorized computations is pushed to a node, and (2) a genetic algorithm for iteratively enumerating and evaluating promising factorization plans. We implement InferF on Velox, an open-sourced database engine from Meta, evaluate it on real-world datasets, observed up to 11.3x speedups, and systematically summarized the factors that determine when factorized ML can benefit AI/ML inference workflows.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sidahmed Benabderrahmane, James Cheney, Talal Rahwan
ๆฅ็ๆ่ฆ
Advanced Persistent Threats (APTs) pose a significant challenge in cybersecurity due to their stealthy and long-term nature. Modern supervised learning methods require extensive labeled data, which is often scarce in real-world cybersecurity environments. In this paper, we propose an innovative approach that leverages AutoEncoders for unsupervised anomaly detection, augmented by active learning to iteratively improve the detection of APT anomalies. By selectively querying an oracle for labels on uncertain or ambiguous samples, we minimize labeling costs while improving detection rates, enabling the model to improve its detection accuracy with minimal data while reducing the need for extensive manual labeling. We provide a detailed formulation of the proposed Attention Adversarial Dual AutoEncoder-based anomaly detection framework and show how the active learning loop iteratively enhances the model. The framework is evaluated on real-world imbalanced provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004\% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The results have shown significant improvements in detection rates during active learning and better performance compared to other existing approaches.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kateryna Chumachenko, Amala Sanjay Deshmukh, Jarno Seppanen, Ilia Karmanov, Chia-Chih Chen et al.
ๆฅ็ๆ่ฆ
We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts bounding boxes of text segments, as well as corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks making it a strong lightweight OCR solution. We release the model weights publicly on Huggingface, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Akshit Pramod Anchan, Jewelith Thomas, Sritama Roy
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like 'Smart Eye.' We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ben Hamscher, Arnold Brosch, Nicolas Binninger, Maksymilian Jan Dejna, Kira Maag
ๆฅ็ๆ่ฆ
Dance is an essential component of human culture and serves as a tool for conveying emotions and telling stories. Identifying and distinguishing dance genres based on motion data is a complex problem in human activity recognition, as many styles share similar poses, gestures, and temporal motion patterns. This work presents a lightweight framework for classifying dance styles that determines motion characteristics based on pose estimates extracted from videos. We propose temporal-spatial descriptors inspired by Laban Movement Analysis. These features capture local joint dynamics such as velocity, acceleration, and angular movement of the upper body, enabling a structured representation of spatial coordination. To further encode rhythmic and periodic aspects of movement, we integrate Fast Fourier Transform features that characterize movement patterns in the frequency domain. The proposed approach achieves robust classification of different dance styles with low computational effort, as complex model architectures are not required, and shows that interpretable motion representations can effectively capture stylistic nuances.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Afra Kilic, Kim Batselier
ๆฅ็ๆ่ฆ
Modeling nonlinear systems with Volterra series is challenging because the number of kernel coefficients grows exponentially with the model order. This work introduces Bayesian Tensor Network Volterra kernel machines (BTN-V), extending the Bayesian Tensor Network framework to Volterra system identification. BTN-V represents Volterra kernels using canonical polyadic decomposition, reducing model complexity from O(I^D) to O(DIR). By treating all tensor components and hyperparameters as random variables, BTN-V provides predictive uncertainty estimation at no additional computational cost. Sparsity-inducing hierarchical priors enable automatic rank determination and the learning of fading-memory behavior directly from data, improving interpretability and preventing overfitting. Empirical results demonstrate competitive accuracy, enhanced uncertainty quantification, and reduced computational cost.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shreevanth Krishnaa Gopalakrishnan, Stephen Hailes
Machine Learning
ๆฅ็ๆ่ฆ
Machine learning has become integral to Channel State Information (CSI)-based human sensing systems and is expected to power applications such as device-free activity recognition and identity detection in future cellular and Wi-Fi generations. However, these systems rely on models whose decisions can be subtly perturbed, raising concerns for security and reliability in ubiquitous sensing. Quantifying and understanding the robustness of such models, defined as their ability to maintain accurate predictions under adversarial perturbations, is therefore critical before wireless sensing can be safely deployed in real-world environments.
This work presents a systematic evaluation of the robustness of CSI deep learning models under diverse threat models (white-box, black-box/transfer, and universal perturbations) and varying degrees of attack realism. We establish a framework to compare compact temporal autoencoder models with larger deep architectures across three public datasets, quantifying how model scale, training regime, and physical constraints influence robustness. Our experiments show that smaller models, while efficient and equally performant on clean data, are markedly less robust. We further confirm that physically realizable signal-space perturbations, designed to be feasible in real wireless channels, significantly reduce attack success compared to unconstrained feature-space attacks. Adversarial training mitigates these vulnerabilities, improving mean robust accuracy with only moderate degradation in clean performance across both model classes. As wireless sensing advances towards reliable, cross-domain operation, these findings provide quantitative baselines for robustness estimation and inform design principles for secure and trustworthy human-centered sensing systems.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Misha Padidar, Teresa Huang, Andrew Giuliani, Marina Spivak
Machine Learning
ๆฅ็ๆ่ฆ
Stellarators are a prospective class of fusion-based power plants that confine a hot plasma with three-dimensional magnetic fields. Typically framed as a PDE-constrained optimization problem, stellarator design is a time-consuming process that can take hours to solve on a computing cluster. Developing fast methods for designing stellarators is crucial for advancing fusion research. Given the recent development of large datasets of optimized stellarators, machine learning approaches have emerged as a potential candidate. Motivated by this, we present an open inverse problem to the machine learning community: to rapidly generate high-quality stellarator designs which have a set of desirable characteristics. As a case study in the problem space, we train a conditional diffusion model on data from the QUASR database to generate quasisymmetric stellarator designs with desirable characteristics (aspect ratio and mean rotational transform). The diffusion model is applied to design stellarators with characteristics not seen during training. We provide evaluation protocols and show that many of the generated stellarators exhibit solid performance: less than 5% deviation from quasisymmetry and the target characteristics. The modest deviation from quasisymmetry highlights an opportunity to reach the sub 1% target. Beyond the case study, we share multiple promising avenues for generative modeling to advance stellarator design.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Matvei Shelukhan, Timur Mamedov, Karina Kvanchiani
Computer Vision
ๆฅ็ๆ่ฆ
Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current approaches mainly focus on tracking objects in each frame of a video stream, making it almost impossible to run the model under conditions of limited computing resources. To address this issue, we propose StableTrack, a novel approach that stabilizes the quality of tracking on low-frequency detections. Our method introduces a new two-stage matching strategy to improve the cross-frame association between low-frequency detections. We propose a novel Bbox-Based Distance instead of the conventional Mahalanobis distance, which allows us to effectively match objects using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman Filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving $\textit{11.6%}$ HOTA improvement at $\textit{1}$ Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kasper Green Larsen, Natascha Schalburg
ๆฅ็ๆ่ฆ
We prove the first margin-based generalization bound for voting classifiers, that is asymptotically tight in the tradeoff between the size of the hypothesis set, the margin, the fraction of training points with the given margin, the number of training samples and the failure probability.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yaaqov Mishayev, Yonatan Sverdlov, Tal Amir, Nadav Dym
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Message Passing Neural Networks (MPNNs) are widely used for learning on graphs, but their ability to process long-range information is limited by the phenomenon of oversquashing. This limitation has led some researchers to advocate Graph Transformers as a better alternative, whereas others suggest that it can be mitigated within the MPNN framework, using virtual nodes or other rewiring techniques.
In this work, we demonstrate that oversquashing is not limited to long-range tasks, but can also arise in short-range problems. This observation allows us to disentangle two distinct mechanisms underlying oversquashing: (1) the bottleneck phenomenon, which can arise even in low-range settings, and (2) the vanishing gradient phenomenon, which is closely associated with long-range tasks.
We further show that the short-range bottleneck effect is not captured by existing explanations for oversquashing, and that adding virtual nodes does not resolve it. In contrast, transformers do succeed in such tasks, positioning them as the more compelling solution to oversquashing, compared to specialized MPNNs.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joรซl Charles-Rebuffรฉ, Nicolas Gast, Bruno Gaujal
Machine Learning
ๆฅ็ๆ่ฆ
We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of a state-of-the-art existing algorithm. We provide a proof of convergence to the Whittle indices we want to learn as well as a bound on the time needed to learn them with arbitrary precision. Moreover, we investigate its computational complexity. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in terms of the number of samples needed to get an accurate approximation. In addition, it has a total computational cost even lower than Q-learning for any reasonably high number of samples. These observations persist even when the Q-learning algorithms are speeded up using pre-trained neural networks to predict Q-values.
๐
2025-11-25
๐ 18 pages, 6 figures, submitted to Nature Food
๐ฅ M. C. Schoppema, B. H. M. van der Velden, A. Hรผrriyetoฤlu, M. D. Klijnstra, E. J. Faassen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Since 2012, tetrodotoxin (TTX) has been found in seafoods such as bivalve mollusks in temperate European waters. TTX contamination leads to food safety risks and economic losses, making early prediction of TTX contamination vital to the food industry and competent authorities. Recent studies have pointed to shallow habitats and water temperature as main drivers to TTX contamination in bivalve mollusks. However, the temporal relationships between abiotic factors, biotic factors, and TTX contamination remain unexplored.
We have developed an explainable, deep learning-based model to predict TTX contamination in the Dutch Zeeland estuary. Inputs for the model were meteorological and hydrological features; output was the presence or absence of TTX contamination.
Results showed that the time of sunrise, time of sunset, global radiation, water temperature, and chloride concentration contributed most to TTX contamination. Thus, the effective number of sun hours, represented by day length and global radiation, was an important driver for tetrodotoxin contamination in bivalve mollusks.
To conclude, our explainable deep learning model identified the aforementioned environmental factors (number of sun hours, global radiation, water temperature, and water chloride concentration) to be associated with tetrodotoxin contamination in bivalve mollusks; making our approach a valuable tool to mitigate marine toxin risks for food industry and competent authorities.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Audrey Pei-Hsuan Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Representation learning on multi-omics data is challenging due to extreme dimensionality, modality heterogeneity, and cohort-specific batch effects. While pre-trained transformer backbones have shown broad generalization capabilities in biological sequence modeling, their application to multi-omics integration remains underexplored. We present MoRE (Multi-Omics Representation Embedding), a framework that repurposes frozen pre-trained transformers to align heterogeneous assays into a shared latent space. Unlike purely generative approaches, MoRE employs a parameter-efficient fine-tuning (PEFT) strategy, prioritizing cross-sample and cross-modality alignment over simple sequence reconstruction. Specifically, MoRE attaches lightweight, modality-specific adapters and a task-adaptive fusion layer to the frozen backbone. It optimizes a masked modeling objective jointly with supervised contrastive and batch-invariant alignment losses, yielding structure-preserving embeddings that generalize across unseen cell types and platforms. We benchmark MoRE against established baselines, including scGPT, scVI, and Harmony with scArches, evaluating integration fidelity, rare population detection, and modality transfer. Our results demonstrate that MoRE achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models. This work positions MoRE as a practical step toward general-purpose omics foundation models.
๐
2025-11-25
๐ ArXiv Preprint (cs.SD)
๐ฅ Ilias Ibnyahya, Joshua D. Reiss
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We introduce a novel method for designing attenuation filters in digital audio reverberation systems based on Feedback Delay Networks (FDNs). Our approach uses Second Order Sections (SOS) of Infinite Impulse Response (IIR) filters arranged as parametric equalizers (PEQ), enabling fine control over frequency-dependent reverberation decay. Unlike traditional graphic equalizer designs, which require numerous filters per delay line, we propose a scalable solution where the number of filters can be adjusted. The frequency, gain, and quality factor (Q) parameters are shared parameters across delay lines and only the gain is adjusted based on delay length. This design not only reduces the number of optimization parameters, but also remains fully differentiable and compatible with gradient-based learning frameworks. Leveraging principles of analog filter design, our method allows for efficient and accurate filter fitting using supervised learning. Our method delivers a flexible and differentiable design, achieving state-of-the-art performance while significantly reducing computational cost.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ รlex Solรฉ, Albert Mosella-Montoro, Joan Cardona, Daniel Aravena, Silvia Gรณmez-Coca et al.
Machine Learning
ๆฅ็ๆ่ฆ
Crystal structures are characterised by repeating atomic patterns within unit cells across three-dimensional space, posing unique challenges for graph-based representation learning. Current methods often overlook essential periodic boundary conditions and multiscale interactions inherent to crystalline structures. In this paper, we introduce PRISM, a graph neural network framework that explicitly integrates multiscale representations and periodic feature encoding by employing a set of expert modules, each specialised in encoding distinct structural and chemical aspects of periodic systems. Extensive experiments across crystal structure-based benchmarks demonstrate that PRISM improves state-of-the-art predictive accuracy, significantly enhancing crystal property prediction.
๐
2025-11-25
๐ ArXiv Preprint (math.NA)
๐ฅ Maarten V. de Hoop, Nikola B. Kovachki, Matti Lassas, Nicholas H. Nelsen
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper considers the problem of noise-robust neural operator approximation for the solution map of Calderรณn's inverse conductivity problem. In this continuum model of electrical impedance tomography (EIT), the boundary measurements are realized as a noisy perturbation of the Neumann-to-Dirichlet map's integral kernel. The theoretical analysis proceeds by extending the domain of the inversion operator to a Hilbert space of kernel functions. The resulting extension shares the same stability properties as the original inverse map from kernels to conductivities, but is now amenable to neural operator approximation. Numerical experiments demonstrate that Fourier neural operators excel at reconstructing infinite-dimensional piecewise constant and lognormal conductivities in noisy setups both within and beyond the theory's assumptions. The methodology developed in this paper for EIT exemplifies a broader strategy for addressing nonlinear inverse problems with a noise-aware operator learning framework.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ M. E. A. Kherchouche, F. Galpin, T. Dumas, F. Schnitzler, D. Menard et al.
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, a complexity study is conducted for Versatile Video Codec (VVC) intra partitioning to accelerate the exhaustive search involved in Rate-Distortion Optimization (RDO) process. To address this problem, two main machine learning techniques are proposed and compared. Unlike existing methods, the proposed approaches are size independent and incorporate the Rate-Distortion (RD) costs of neighboring blocks as input features. The first method is a regression based technique that predicts normalized RD costs of a given Coding Unit (CU). As partitioning possesses the Markov property, the associated decision-making problem can be modeled as a Markov Decision Process (MDP) and solved by Reinforcement Learning (RL). The second approach is a RL agent learned from trajectories of CU decision across two depths with Deep Q-Network (DQN) algorithm. Then a pre-determined thresholds are applied for both methods to select a suitable split for the current CU.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu et al.
Natural Language ProcessingRoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance-a phenomenon exacerbated in Mixture-of-Experts models-leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya et al.
Natural Language ProcessingRoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs based on closed-loop system of generation, assessment, and self-improvement. It integrates within one unified workflow five synergistic LLM-based pipelines: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset as an audited corpus with reproducible metrics, NNGPT emits from a single prompt and validates network architecture, preprocessing code, and hyperparameters, executes them end-to-end, and learns from result. The PyTorch adapter makes NNGPT framework-agnostic, enabling strong performance: NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. The system has already generated over 5K validated models, proving NNGPT as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released for public access to enable reproducibility and facilitate community usage.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Michael Kilgour, Mark E. Tuckerman, Jutta Rogal
Machine Learning
ๆฅ็ๆ่ฆ
We present MXtalTools, a flexible Python package for the data-driven modelling of molecular crystals, facilitating machine learning studies of the molecular solid state. MXtalTools comprises several classes of utilities: (1) synthesis, collation, and curation of molecule and crystal datasets, (2) integrated workflows for model training and inference, (3) crystal parameterization and representation, (4) crystal structure sampling and optimization, (5) end-to-end differentiable crystal sampling, construction and analysis. Our modular functions can be integrated into existing workflows or combined and used to build novel modelling pipelines. MXtalTools leverages CUDA acceleration to enable high-throughput crystal modelling. The Python code is available open-source on our GitHub page, with detailed documentation on ReadTheDocs.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Abhinav Joshi, Divyanshu Bhatt, Ashutosh Modi
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) show strong generalization across diverse tasks, yet the internal decision-making processes behind their predictions remain opaque. In this work, we study the geometry of hidden representations in LLMs through the lens of \textit{intrinsic dimension} (ID), focusing specifically on decision-making dynamics in a multiple-choice question answering (MCQA) setting. We perform a large-scale study, with 28 open-weight transformer models and estimate ID across layers using multiple estimators, while also quantifying per-layer performance on MCQA tasks. Our findings reveal a consistent ID pattern across models: early layers operate on low-dimensional manifolds, middle layers expand this space, and later layers compress it again, converging to decision-relevant representations. Together, these results suggest LLMs implicitly learn to project linguistic inputs onto structured, low-dimensional manifolds aligned with task-specific decisions, providing new geometric insights into how generalization and reasoning emerge in language models.
๐
2025-11-25
๐ ArXiv Preprint (cs.DB)
๐ฅ Chaowei He, Yuanjun Liu, Qingzhi Ma, Shenyuan Ren, Xizhao Luo et al.
ๆฅ็ๆ่ฆ
Machine unlearning in learned cardinality estimation (CE) systems presents unique challenges due to the complex distributional dependencies in multi-table relational data. Specifically, data deletion, a core component of machine unlearning, faces three critical challenges in learned CE models: attribute-level sensitivity, inter-table propagation and domain disappearance leading to severe overestimation in multi-way joins. We propose Cardinality Estimation Pruning (CEP), the first unlearning framework specifically designed for multi-table learned CE systems. CEP introduces Distribution Sensitivity Pruning, which constructs semi-join deletion results and computes sensitivity scores to guide parameter pruning, and Domain Pruning, which removes support for value domains entirely eliminated by deletion. We evaluate CEP on state-of-the-art architectures NeuroCard and FACE across IMDB and TPC-H datasets. Results demonstrate CEP consistently achieves the lowest Q-error in multi-table scenarios, particularly under high deletion ratios, often outperforming full retraining. Furthermore, CEP significantly reduces convergence iterations, incurring negligible computational overhead of 0.3%-2.5% of fine-tuning time.
๐
2025-11-25
๐ ArXiv Preprint (econ.GN)
๐ฅ Marta Grzeskiewicz
Machine Learning
ๆฅ็ๆ่ฆ
Understanding household behaviour is essential for modelling macroeconomic dynamics and designing effective policy. While heterogeneous agent models offer a more realistic alternative to representative agent frameworks, their implementation poses significant computational challenges, particularly in continuous time. The Aiyagari-Bewley-Huggett (ABH) framework, recast as a system of partial differential equations, typically relies on grid-based solvers that suffer from the curse of dimensionality, high computational cost, and numerical inaccuracies. This paper introduces the ABH-PINN solver, an approach based on Physics-Informed Neural Networks (PINNs), which embeds the Hamilton-Jacobi-Bellman and Kolmogorov Forward equations directly into the neural network training objective. By replacing grid-based approximation with mesh-free, differentiable function learning, the ABH-PINN solver benefits from the advantages of PINNs of improved scalability, smoother solutions, and computational efficiency. Preliminary results show that the PINN-based approach is able to obtain economically valid results matching the established finite-difference solvers.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yiheng Zhang, Shaowu Wu, Yuanzhuo Xu, Jiajun Wu, Shang Xu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity
, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
Natural Language Processing
ๆฅ็ๆ่ฆ
Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs) (the building blocks of a transformer architecture) as indivisible units, overlooking possibilities of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in a computational graph, that are previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, Fan Zhou
Machine Learning
ๆฅ็ๆ่ฆ
Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steer convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
๐
2025-11-25
๐ Accepted to 2025 IEEE International Conference on
๐ฅ Zhiguo Zhang, Xiaoliang Ma, Daniel Schlesinger
ๆฅ็ๆ่ฆ
Accurate and interpretable air pollution forecasting is crucial for public health, but most models face a trade-off between performance and interpretability. This study proposes a physics-guided, interpretable-by-design spatiotemporal learning framework. The model decomposes the spatiotemporal behavior of air pollutant concentrations into two transparent, additive modules. The first is a physics-guided transport kernel with directed weights conditioned on wind and geography (advection). The second is an explainable attention mechanism that learns local responses and attributes future concentrations to specific historical lags and exogenous drivers. Evaluated on a comprehensive dataset from the Stockholm region, our model consistently outperforms state-of-the-art baselines across multiple forecasting horizons. Our model's integration of high predictive performance and spatiotemporal interpretability provides a more reliable foundation for operational air-quality management in real-world applications.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniel Kienzle, Katja Ludwig, Julian Lorenz, Shin'ichi Satoh, Rainer Lienhart
ๆฅ็ๆ่ฆ
Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.
๐
2025-11-25
๐ ArXiv Preprint (eess.SY)
๐ฅ Zeynab Kaseb, Matthias Moller, Lindsay Spoor, Jerry J. Guo, Yu Xiang et al.
Machine Learning
ๆฅ็ๆ่ฆ
The Newton-Raphson (NR) method is widely used for solving power flow (PF) equations due to its quadratic convergence. However, its performance deteriorates under poor initialization or extreme operating scenarios, e.g., high levels of renewable energy penetration. Traditional NR initialization strategies often fail to address these challenges, resulting in slow convergence or even divergence. We propose the use of reinforcement learning (RL) to optimize the initialization of NR, and introduce a novel quantum-enhanced RL environment update mechanism to mitigate the significant computational cost of evaluating power system states over a combinatorially large action space at each RL timestep by formulating the voltage adjustment task as a quadratic unconstrained binary optimization problem. Specifically, quantum/digital annealers are integrated into the RL environment update to evaluate state transitions using a problem Hamiltonian designed for PF. Results demonstrate significant improvements in convergence speed, a reduction in NR iteration counts, and enhanced robustness under different operating conditions.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Szymon Bobek, ลukasz Baลec, Grzegorz J. Nalepa
Machine Learning
ๆฅ็ๆ่ฆ
Counterfactual explanations enhance the actionable interpretability of machine learning models by identifying the minimal changes required to achieve a desired outcome of the model. However, existing methods often ignore the complex dependencies in real-world datasets, leading to unrealistic or impractical modifications. Motivated by cybersecurity applications in the email marketing domain, we propose a method for generating Diverse, Actionable, and kNowledge-Constrained Explanations (DANCE), which incorporates feature dependencies and causal constraints to ensure plausibility and real-world feasibility of counterfactuals. Our method learns linear and nonlinear constraints from data or integrates expert-provided dependency graphs, ensuring counterfactuals are plausible and actionable. By maintaining consistency with feature relationships, the method produces explanations that align with real-world constraints. Additionally, it balances plausibility, diversity, and sparsity, effectively addressing key limitations in existing algorithms. The work is developed based on a real-life case study with Freshmail, the largest email marketing company in Poland and supported by a joint R&D project Sendguard. Furthermore, we provide an extensive evaluation using 140 public datasets, which highlights its ability to generate meaningful, domain-relevant counterfactuals that outperform other existing approaches based on widely used metrics. The source code for reproduction of the results can be found in a GitHub repository we provide.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Olivier Moulin, Vincent Francois-lavet, Paul Elbers, Mark Hoogendoorn
Machine Learning
ๆฅ็ๆ่ฆ
Generalizability of Reinforcement Learning (RL) agents (ability to perform on environments different from the ones they have been trained on) is a key problem as agents have the tendency to overfit to their training environments. In order to address this problem and offer a solution to increase the generalizability of RL agents, we introduce a new methodology to predict the generalizability score of RL agents based on the internal weights of the agent's neural networks. Using this prediction capability, we propose some changes in the Proximal Policy Optimization (PPO) loss function to boost the generalization score of the agents trained with this upgraded version. Experimental results demonstrate that our improved PPO algorithm yields agents with stronger generalizability compared to the original version.
๐
2025-11-25
๐ AAAI
๐ฅ Bo Han, Zhuoming Li, Xiaoyu Wang, Yaxin Hou, Hui Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Semi-supervised multi-label learning (SSMLL) aims to address the challenge of limited labeled data in multi-label learning (MLL) by leveraging unlabeled data to improve the model's performance. While pseudo-labeling has become a dominant strategy in SSMLL, most existing methods assign equal weights to all pseudo-labels regardless of their quality, which can amplify the impact of noisy or uncertain predictions and degrade the overall performance. In this paper, we theoretically verify that the optimal weight for a pseudo-label should reflect its correctness likelihood. Empirically, we observe that on the same dataset, the correctness likelihood distribution of unlabeled data remains stable, even as the number of labeled training samples varies. Building on this insight, we propose Distribution-Calibrated Pseudo-labeling (DiCaP), a correctness-aware framework that estimates posterior precision to calibrate pseudo-label weights. We further introduce a dual-thresholding mechanism to separate confident and ambiguous regions: confident samples are pseudo-labeled and weighted accordingly, while ambiguous ones are explored by unsupervised contrastive learning. Experiments conducted on multiple benchmark datasets verify that our method achieves consistent improvements, surpassing state-of-the-art methods by up to 4.27%.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lian Shen, Zhendan Chen, Yinhui jiang, Meijia Song, Ziming Su et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
In critical web applications such as e-commerce and recommendation systems, multimodal graphs integrating rich visual and textual attributes are increasingly central, yet their large scale introduces substantial computational burdens for training Graph Neural Networks (GNNs). While Graph Condensation (GC) offers a promising solution by synthesizing smaller datasets, existing methods falter in the multimodal setting. We identify a dual challenge causing this failure: (1) conflicting gradients arising from semantic misalignments between modalities, and (2) the GNN's message-passing architecture pathologically amplifying this gradient noise across the graph structure. To address this, we propose Structurally-Regularized Gradient Matching (SR-GM), a novel condensation framework tailored for multimodal graphs. SR-GM introduces two synergistic components: first, a gradient decoupling mechanism that resolves inter-modality conflicts at their source via orthogonal projection; and second, a structural damping regularizer that acts directly on the gradient field. By leveraging the graph's Dirichlet energy, this regularizer transforms the topology from a noise amplifier into a stabilizing force during optimization. Extensive experiments demonstrate that SR-GM significantly improves accuracy and accelerates convergence compared to baseline methods. Ablation studies confirm that addressing both gradient conflict and structural amplification in tandem is essential for achieving superior performance. Moreover, the condensed multimodal graphs exhibit strong cross-architecture generalization and promise to accelerate applications like Neural Architecture Search. This research provides a scalable methodology for multimodal graph-based learning in resource-constrained environments.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruxandra-Stefania Tudose, Moritz H. W. Grรผss, Grace Ra Kim, Karl H. Johansson, Nicola Bastianello
ๆฅ็ๆ่ฆ
Satellite constellations in low-Earth orbit are now widespread, enabling positioning, Earth imaging, and communications. In this paper we address the solution of learning problems using these satellite constellations. In particular, we focus on a federated approach, where satellites collect and locally process data, with the ground station aggregating local models. We focus on designing a novel, communication-efficient algorithm that still yields accurate trained models. To this end, we employ several mechanisms to reduce the number of communications with the ground station (local training) and their size (compression). We then propose an error feedback mechanism that enhances accuracy, which yields, as a byproduct, an algorithm-agnostic error feedback scheme that can be more broadly applied. We analyze the convergence of the resulting algorithm, and compare it with the state of the art through simulations in a realistic space scenario, showcasing superior performance.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haebin Seong, Sungmin Kim, Minchan Kim, Yongjun Cho, Myunchul Joe et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Existing navigation benchmarks focus on task success metrics while overlooking economic viability -- critical for commercial deployment of autonomous delivery robots. We introduce \emph{CostNav}, a \textbf{Micro-Navigation Economic Testbed} that evaluates embodied agents through comprehensive cost-revenue analysis aligned with real-world business operations. CostNav models the complete economic lifecycle including hardware, training, energy, maintenance costs, and delivery revenue with service-level agreements, using industry-derived parameters. \textbf{To our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability}, revealing that optimizing for task success fundamentally differs from optimizing for economic deployment. Our cost model uses parameters derived from industry data sources (energy rates, delivery service pricing), and we project from a reduced-scale simulation to realistic deliveries. Under this projection, the baseline achieves 43.0\% SLA compliance but is \emph{not} commercially viable: yielding a loss of \$30.009 per run with no finite break-even point, because operating costs are dominated by collision-induced maintenance, which accounts for 99.7\% of per-run costs and highlights collision avoidance as a key optimization target. We demonstrate a learning-based on-device navigation baseline and establish a foundation for evaluating rule-based navigation, imitation learning, and cost-aware RL training. CostNav bridges the gap between navigation research and commercial deployment, enabling data-driven decisions about economic trade-offs across navigation paradigms.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Transformer architectures have achieved remarkable success across language, vision, and multimodal tasks, and there is growing demand for them to address in-context compositional learning tasks. In these tasks, models solve the target problems by inferring compositional rules from context examples, which are composed of basic components structured by underlying rules. However, some of these tasks remain challenging for Transformers, which are not inherently designed to handle compositional tasks and offer limited structural inductive bias. In this work, inspired by the principle of sparse coding, we propose a reformulation of the attention to enhance its capability for compositional tasks. In sparse coding, data are represented as sparse combinations of dictionary atoms with coefficients that capture their compositional rules. Specifically, we reinterpret the attention block as a mapping of inputs into outputs through projections onto two sets of learned dictionary atoms: an encoding dictionary and a decoding dictionary. The encoding dictionary decomposes the input into a set of coefficients, which represent the compositional structure of the input. To enhance structured representations, we impose sparsity on these coefficients. The sparse coefficients are then used to linearly combine the decoding dictionary atoms to generate the output. Furthermore, to assist compositional generalization tasks, we propose estimating the coefficients of the target problem as a linear combination of the coefficients obtained from the context examples. We demonstrate the effectiveness of our approach on the S-RAVEN and RAVEN datasets. For certain compositional generalization tasks, our method maintains performance even when standard Transformers fail, owing to its ability to learn and apply compositional rules.
๐
2025-11-25
๐ AAAI 2026
๐ฅ Lincen Yang, Zhong Li, Matthijs van Leeuwen, Saber Salehkaleybar
ๆฅ็ๆ่ฆ
Discovering subgroups with the maximum average treatment effect is crucial for targeted decision making in domains such as precision medicine, public policy, and education. While most prior work is formulated in the potential outcome framework, the corresponding structural causal model (SCM) for this task has been largely overlooked. In practice, two approaches dominate. The first estimates pointwise conditional treatment effects and then fits a tree on those estimates, effectively turning subgroup estimation into the harder problem of accurate pointwise estimation. The second constructs decision trees or rule sets with ad-hoc 'causal' heuristics, typically without rigorous justification for why a given heuristic may be used or whether such heuristics are necessary at all. We address these issues by studying the problem directly under the SCM framework. Under the assumption of a partition-based model, we show that optimal subgroup discovery reduces to recovering the data-generating models and hence a standard supervised learning problem (regression or classification). This allows us to adopt any partition-based methods to learn the subgroup from data. We instantiate the approach with CART, arguably one of the most widely used tree-based methods, to learn the subgroup with maximum treatment effect. Finally, on a large collection of synthetic and semi-synthetic datasets, we compare our method against a wide range of baselines and find that our approach, which avoids such causal heuristics, more accurately identifies subgroups with maximum treatment effect. Our source code is available at https://github.com/ylincen/causal-subgroup.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bruno Belucci, Karim Lounici, Katia Meziani
Machine Learning
ๆฅ็ๆ่ฆ
Neural networks struggle on small tabular datasets, where tree-based models remain dominant. We introduce Adaptive Contrastive Approach (AdaCap), a training scheme that combines a permutation-based contrastive loss with a Tikhonov-based closed-form output mapping. Across 85 real-world regression datasets and multiple architectures, AdaCap yields consistent and statistically significant improvements in the small-sample regime, particularly for residual models. A meta-predictor trained on dataset characteristics (size, skewness, noise) accurately anticipates when AdaCap is beneficial. These results show that AdaCap acts as a targeted regularization mechanism, strengthening neural networks precisely where they are most fragile. All results and code are publicly available at https://github.com/BrunoBelucci/adacap.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Riccardo Zaccone, Sai Praneeth Karimireddy, Carlo Masone
Machine Learning
ๆฅ็ๆ่ฆ
Recent works have explored the use of momentum in local methods to enhance distributed SGD. This is particularly appealing in Federated Learning (FL), where momentum intuitively appears as a solution to mitigate the effects of statistical heterogeneity. Despite recent progress in this direction, it is still unclear if momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios, where only some workers participate at each round. In this work we analyze momentum under cyclic client participation, and theoretically prove that it remains inevitably affected by statistical heterogeneity. Similarly to SGD, we prove that decreasing step-sizes do not help either: in fact, any schedule decreasing faster than $ฮ\left(1/t\right)$ leads to convergence to a constant value that depends on the initialization and the heterogeneity bound. Numerical results corroborate the theory, and deep learning experiments confirm its relevance for realistic settings.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aleksei Samarin, Artem Nazarenko, Egor Kotenko, Valentin Malykh, Alexander Savelev et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a novel approach to neural network compression that addresses redundancy at both the filter and architectural levels through a unified framework grounded in information flow analysis. Building on the concept of tensor flow divergence, which quantifies how information is transformed across network layers, we develop a two-stage optimization process. The first stage employs iterative divergence-aware pruning to identify and remove redundant filters while preserving critical information pathways. The second stage extends this principle to higher-level architecture optimization by analyzing layer-wise contributions to information propagation and selectively eliminating entire layers that demonstrate minimal impact on network performance. The proposed method naturally adapts to diverse architectures, including convolutional networks, transformers, and hybrid designs, providing a consistent metric for comparing the structural importance across different layer types. Experimental validation across multiple modern architectures and datasets reveals that this combined approach achieves substantial model compression while maintaining competitive accuracy. The presented approach achieves parameter reduction results that are globally comparable to those of state-of-the-art solutions and outperforms them across a wide range of modern neural network architectures, from convolutional models to transformers. The results demonstrate how flow divergence serves as an effective guiding principle for both filter-level and layer-level optimization, offering practical benefits for deployment in resource-constrained environments.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jason Lo, Mohammadnima Jafari
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
A wiring diagram is a labeled directed graph that represents an abstract concept such as a temporal process. In this article, we introduce the notion of a quasi-skeleton wiring diagram graph, and prove that quasi-skeleton wiring diagram graphs correspond to Hasse diagrams. Using this result, we designed algorithms that extract wiring diagrams from sequential data. We used our algorithms in analyzing the behavior of an autonomous agent playing a computer game, and the algorithms correctly identified the winning strategies. We compared the performance of our main algorithm with two other algorithms based on standard clustering techniques (DBSCAN and agglomerative hierarchical), including when some of the data was perturbed. Overall, this article brings together techniques in category theory, graph theory, clustering, reinforcement learning, and data engineering.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hyeonjae Kim, Chenyue Li, Wen Deng, Mengxi Jin, Wen Huang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Climate science demands automated workflows to transform comprehensive questions into data-driven statements across massive, heterogeneous datasets. However, generic LLM agents and static scripting pipelines lack climate-specific context and flexibility, thus, perform poorly in practice. We present ClimateAgent, an autonomous multi-agent framework that orchestrates end-to-end climate data analytic workflows. ClimateAgent decomposes user questions into executable sub-tasks coordinated by an Orchestrate-Agent and a Plan-Agent; acquires data via specialized Data-Agents that dynamically introspect APIs to synthesize robust download scripts; and completes analysis and reporting with a Coding-Agent that generates Python code, visualizations, and a final report with a built-in self-correction loop. To enable systematic evaluation, we introduce Climate-Agent-Bench-85, a benchmark of 85 real-world tasks spanning atmospheric rivers, drought, extreme precipitation, heat waves, sea surface temperature, and tropical cyclones. On Climate-Agent-Bench-85, ClimateAgent achieves 100% task completion and a report quality score of 8.32, outperforming GitHub-Copilot (6.27) and a GPT-5 baseline (3.26). These results demonstrate that our multi-agent orchestration with dynamic API awareness and self-correcting execution substantially advances reliable, end-to-end automation for climate science analytic tasks.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Grzegorz Dudek, Mateusz Kasprzyk, Paweล Peลka
Machine Learning
ๆฅ็ๆ่ฆ
This study investigates the application of the Light Gradient Boosting Machine (LGBM) model for both deterministic and probabilistic forecasting of Bitcoin realized volatility. Utilizing a comprehensive set of 69 predictors -- encompassing market, behavioral, and macroeconomic indicators -- we evaluate the performance of LGBM-based models and compare them with both econometric and machine learning baselines. For probabilistic forecasting, we explore two quantile-based approaches: direct quantile regression using the pinball loss function, and a residual simulation method that transforms point forecasts into predictive distributions. To identify the main drivers of volatility, we employ gain-based and permutation feature importance techniques, consistently highlighting the significance of trading volume, lagged volatility measures, investor attention, and market capitalization. The results demonstrate that LGBM models effectively capture the nonlinear and high-variance characteristics of cryptocurrency markets while providing interpretable insights into the underlying volatility dynamics.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Craig Dickson
Natural Language Processing
ๆฅ็ๆ่ฆ
Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment - a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some models showed more resistance than others. Specifically the Qwen-2.5 family proved to be relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate if current-generation open-weights models exhibit similar resistance to the Qwen-2.5 family and measure misalignment robustness over a range of model architectures and scales.
We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o's 20%.
We identify a critical format-dependent vulnerability: requiring JSON output doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model's 'degrees of freedom' to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than observed in proprietary systems.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches often rely on free-form natural language descriptions that are often ambiguous, redundant, and unstructured, which poses significant challenges for downstream Verilog code generation. We treat hardware code generation as a complex transformation from an open-ended natural language space to a domain-specific, highly constrained target space. To bridge this gap, we introduce Core Refined Understanding eXpression (CRUX), a structured intermediate space that captures the essential semantics of user intent while organizing the expression for precise Verilog code generation. We further design a two-stage training framework, comprising Joint Expression Modeling and Dual-Space Optimization, to enhance the quality of both CRUX and Verilog code. Experiments across multiple Verilog generation benchmarks demonstrate that our model, CRUX-V, achieves state-of-the-art performance among general models, particularly under challenging design tasks. Furthermore, the CRUX space proves transferable and beneficial when used as input prompts for other code models, highlighting its effectiveness in narrowing the gap between free-form natural language descriptions and precise Verilog generation.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Florian Dรถrfler, Pieter Abbeel et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We address the challenge of efficient exploration in model-based reinforcement learning (MBRL), where the system dynamics are unknown and the RL agent must learn directly from online interactions. We propose Scalable and Optimistic MBRL (SOMBRL), an approach based on the principle of optimism in the face of uncertainty. SOMBRL learns an uncertainty-aware dynamics model and greedily maximizes a weighted sum of the extrinsic reward and the agent's epistemic uncertainty. SOMBRL is compatible with any policy optimizers or planners, and under common regularity assumptions on the system, we show that SOMBRL has sublinear regret for nonlinear dynamics in the (i) finite-horizon, (ii) discounted infinite-horizon, and (iii) non-episodic settings. Additionally, SOMBRL offers a flexible and scalable solution for principled exploration. We evaluate SOMBRL on state-based and visual-control environments, where it displays strong performance across all tasks and baselines. We also evaluate SOMBRL on a dynamic RC car hardware and show SOMBRL outperforms the state-of-the-art, illustrating the benefits of principled exploration for MBRL.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zixiao Huang, Wen Zeng, Tianyu Fu, Tengxuan Liu, Yizhou Sun et al.
ๆฅ็ๆ่ฆ
LLM-based search agents achieve strong performance but suffer from severe latency, as each step requires serialized LLM reasoning followed by action of tool execution. We revisit this bottleneck through the lens of speculation. While traditional predict-verify speculation paradigm can break serial execution, its benefit remains limited, as it retains the full original workload and adds extra inference overhead. We observe that early agent steps often involve simple evidence-gathering, where correct actions can often be predicted without full reasoning. Building on these observations, we present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency. Algorithmically, SPAgent introduces a two-phase adaptive speculation mechanism that selectively omits verification when safe. System-wise, a two-level scheduler regulates speculative requests based on engine load to ensure speculation remains beneficial. We implement SPAgent in real-world systems. Across extensive experimental settings, SPAgent achieves up to $1.65\times$ end-to-end speedup while maintaining same or even achieving higher accuracy, enabling practical deployment of multi-step search agents.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ PengYu Chen, Xiaohou Shi, Yuan Chang, Yan Sun, Sajal K. Das
ๆฅ็ๆ่ฆ
The proactive prediction of anomalies (AP) in multivariate time series (MTS) is a critical challenge to ensure system dependability. The difficulty lies in identifying subtle anomaly precursors concealed within normal signals. However, existing unsupervised methods, trained exclusively on normal data, demonstrate a fundamental propensity to reconstruct normal patterns. Consequently, when confronted with weak precursors, their predictions are dominated by the normal pattern, submerging the very signal required for prediction. To contend with the limitation, we propose RED-F, a Reconstruction-Elimination based Dual-stream Contrastive Forecasting framework, comprising the Reconstruction-Elimination Model (REM) and the Dual-stream Contrastive Forecasting Model (DFM). The REM utilizes a hybrid time-frequency mechanism to mitigate the precursor, generating a purified, normal-pattern baseline. The DFM then receives this purified baseline and the original sequence which retains the precursor as parallel inputs. At the core of our framework, RED-F employs a contrastive forecast that transforms the difficult task of absolute signal detection into a simpler, more robust task of relative trajectory comparison by computing the divergence between these two predictive streams. This contrastive mechanism serves to amplify the faint precursor signal. Furthermore, the DFM is trained with a novel Multi-Series Prediction (MSP) objective, which leverages distant future context to enhance its predictive sensitivity. Extensive experiments on six real-world datasets demonstrate the superior capability of RED-F in anomaly prediction tasks.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Petr Molodyk, Jaemoo Choi, David W. Romero, Ming-Yu Liu, Yongxin Chen
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, point cloud generation has gained significant attention in 3D generative modeling. Among existing approaches, point-based methods directly generate point clouds without relying on other representations such as latent features, meshes, or voxels. These methods offer low training cost and algorithmic simplicity, but often underperform compared to representation-based approaches. In this paper, we propose MFM-Point, a multi-scale Flow Matching framework for point cloud generation that substantially improves the scalability and performance of point-based methods while preserving their simplicity and efficiency. Our multi-scale generation algorithm adopts a coarse-to-fine generation paradigm, enhancing generation quality and scalability without incurring additional training or inference overhead. A key challenge in developing such a multi-scale framework lies in preserving the geometric structure of unordered point clouds while ensuring smooth and consistent distributional transitions across resolutions. To address this, we introduce a structured downsampling and upsampling strategy that preserves geometry and maintains alignment between coarse and fine resolutions. Our experimental results demonstrate that MFM-Point achieves best-in-class performance among point-based methods and challenges the best representation-based methods. In particular, MFM-point demonstrates strong results in multi-category and high-resolution generation tasks.
๐
2025-11-25
๐ ArXiv Preprint (cs.FL)
๐ฅ Hongjian Jiang, Michael Hahn, Georg Zetzsche, Anthony Widjaja Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
Hard attention Chain-of-Thought (CoT) transformers are known to be Turing-complete. However, it is an open problem whether softmax attention Chain-of-Thought (CoT) transformers are Turing-complete. In this paper, we prove a stronger result that length-generalizable softmax CoT transformers are Turing-complete. More precisely, our Turing-completeness proof goes via the CoT extension of the Counting RASP (C-RASP), which correspond to softmax CoT transformers that admit length generalization. We prove Turing-completeness for CoT C-RASP with causal masking over a unary alphabet (more generally, for letter-bounded languages). While we show this is not Turing-complete for arbitrary languages, we prove that its extension with relative positional encoding is Turing-complete for arbitrary languages. We empirically validate our theory by training transformers for languages requiring complex (non-linear) arithmetic reasoning.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoran Zheng, Renchi Yang, Hongtao Wang, Jianliang Xu
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Multimodal Attributed Graphs (MMAGs) are an expressive data model for representing the complex interconnections among entities that associate attributes from multiple data modalities (text, images, etc.). Clustering over such data finds numerous practical applications in real scenarios, including social community detection, medical data analytics, etc. However, as revealed by our empirical studies, existing multi-view clustering solutions largely rely on the high correlation between attributes across various views and overlook the unique characteristics (e.g., low modality-wise correlation and intense feature-wise noise) of multimodal attributes output by large pre-trained language and vision models in MMAGs, leading to suboptimal clustering performance.
Inspired by foregoing empirical observations and our theoretical analyses with graph signal processing, we propose the Dual Graph Filtering (DGF) scheme, which innovatively incorporates a feature-wise denoising component into node representation learning, thereby effectively overcoming the limitations of traditional graph filters adopted in the extant multi-view graph clustering approaches. On top of that, DGF includes a tri-cross contrastive training strategy that employs instance-level contrastive learning across modalities, neighborhoods, and communities for learning robust and discriminative node representations. Our comprehensive experiments on eight benchmark MMAG datasets exhibit that DGF is able to outperform a wide range of state-of-the-art baselines consistently and significantly in terms of clustering quality measured against ground-truth labels.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiucheng Wang, Tingwei Yuan, Yang Cao, Nan Cheng, Ruijin Sun et al.
ๆฅ็ๆ่ฆ
Radio maps (RMs) serve as environment-aware electromagnetic (EM) representations that connect scenario geometry and material properties to the spatial distribution of signal strength, enabling localization without costly in-situ measurements. However, constructing high-fidelity indoor RMs remains challenging due to the prohibitive latency of EM solvers and the limitations of learning-based methods, which often rely on sparse measurements or assumptions of homogeneous material, which are misaligned with the heterogeneous and multipath-rich nature of indoor environments. To overcome these challenges, we propose iRadioDiff, a sampling-free diffusion-based framework for indoor RM construction. iRadioDiff is conditioned on access point (AP) positions, and physics-informed prompt encoded by material reflection and transmission coefficients. It further incorporates multipath-critical priors, including diffraction points, strong transmission boundaries, and line-of-sight (LoS) contours, to guide the generative process via conditional channels and boundary-weighted objectives. This design enables accurate modeling of nonstationary field discontinuities and efficient construction of physically consistent RMs. Experiments demonstrate that iRadioDiff achieves state-of-the-art performance in indoor RM construction and received signal strength based indoor localization, which offers effective generalization across layouts and material configurations. Code is available at https://github.com/UNIC-Lab/iRadioDiff.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Peining Zhang, Hongchen Qin, Haochen Zhang, Ziqi Guo, Guiling Wang et al.
ๆฅ็ๆ่ฆ
This work investigates the zero-shot forecasting capability of time-series foundation models for Leaf Area Index (LAI) forecasting in agricultural monitoring. Using the HiQ dataset (U.S., 2000-2022), we systematically compare statistical baselines, a fully supervised LSTM, and the Sundial foundation model under multiple evaluation protocols. We find that Sundial, in the zero-shot setting, can outperform a fully trained LSTM provided that the input context window is sufficiently long-specifically, when covering more than one or two full seasonal cycles. This demonstrates, for the first time, that a general-purpose foundation model can surpass specialized supervised models on remote-sensing time series prediction without any task-specific tuning. These results highlight the strong potential of pretrained time-series foundation models to serve as effective plug-and-play forecasters in agricultural and environmental applications.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikit Phadke
ๆฅ็ๆ่ฆ
REWA introduces a general theory of similarity based on witness-overlap structures. We show that whenever similarity between concepts can be expressed as monotone witness overlap -- whether arising from graph neighborhoods, causal relations, temporal structure, topological features, symbolic patterns, or embedding-based neighborhoods -- it admits a reduction to compact encodings with provable ranking preservation guarantees. REWA systems consist of: (1) finite witness sets $W(v)$, (2) semi-random bit assignments generated from each witness, and (3) monotonicity of expected similarity in the overlap $ฮ(u, v) = |W(u) \cap W(v)|$. We prove that under an overlap-gap condition on the final witness sets -- independent of how they were constructed -- top-$k$ rankings are preserved using $m = O(\log(|V|/ฮด))$ bits. The witness-set formulation is compositional: any sequence of structural, temporal, causal, topological, information-theoretic, or learned transformations can be combined into pipelines that terminate in discrete witness sets. The theory applies to the final witness overlap, enabling modular construction of similarity systems from reusable primitives. This yields a vast design space: millions of composable similarity definitions inherit logarithmic encoding complexity. REWA subsumes and unifies Bloom filters, minhash, LSH bitmaps, random projections, sketches, and hierarchical filters as special cases. It provides a principled foundation for similarity systems whose behavior is governed by witness overlap rather than hash-function engineering. This manuscript presents the axioms, the main reducibility theorem, complete proofs with explicit constants, and a detailed discussion of compositional design, limitations, and future extensions including multi-bit encodings, weighted witnesses, and non-set representations.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dishanika Denipitiyage, Naveen Karunanayake, Suranga Seneviratne, Sanjay Chawla
Machine Learning
ๆฅ็ๆ่ฆ
We propose RankOOD, a rank-based Out-of-Distribution (OOD) detection approach based on training a model with the Placket-Luce loss, which is now extensively used for preference alignment tasks in foundational models. Our approach is based on the insight that with a deep learning model trained using the Cross Entropy Loss, in-distribution (ID) class prediction induces a ranking pattern for each ID class prediction. The RankOOD framework formalizes the insight by first extracting a rank list for each class using an initial classifier and then uses another round of training with the Plackett-Luce loss, where the class rank, a fixed permutation for each class, is the predicted variable. An OOD example may get assigned with high probability to an ID example, but the probability of it respecting the ranking classification is likely to be small. RankOOD, achieves SOTA performance on the near-ODD TinyImageNet evaluation benchmark, reducing FPR95 by 4.3%.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lianming Huang, Haibo Hu, Qiao Li, Nan Guan, Chun Jason Xue
Robotics
ๆฅ็ๆ่ฆ
Sparsity is essential for deploying large models on resource constrained edge platforms. However, optimizing sparsity patterns for individual tasks in isolation ignores the significant I/O overhead incurred during frequent task switching. We introduce an on-demand multi-task sparsity framework specifically designed to minimize switching costs by maximizing parameter reuse. Unlike monolithic approaches, we decompose weights into reusable block-granular units and align sparse structures across tasks to maximize overlap. By dynamically loading only the small differential set of blocks required for the next task, our method effectively mitigates the cold-start latency inherent in traditional monolithic approaches.Experiments on a real-world autonomous driving platform demonstrate that our framework achieves superior switching efficiency, accelerating task switching by over 6.6X on average compared to existing sparsity methods.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoran Zheng, Renchi Yang, Yubo Zhou, Jianliang Xu
Machine Learning
ๆฅ็ๆ่ฆ
Message passing neural networks (MPNNs) have emerged as go-to models for learning on graph-structured data in the past decade. Despite their effectiveness, most of such models still incur severe issues such as over-smoothing and -correlation, due to their underlying objective of minimizing the Dirichlet energy and the derived neighborhood aggregation operations. In this paper, we propose the DDSM, a new MPNN model built on an optimization framework that includes the stress majorization and orthogonal regularization for overcoming the above issues. Further, we introduce the diffusion distances for nodes into the framework to guide the new message passing operations and develop efficient algorithms for distance approximations, both backed by rigorous theoretical analyses. Our comprehensive experiments showcase that DDSM consistently and considerably outperforms 15 strong baselines on both homophilic and heterophilic graphs.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aras Bacho, Aleksei G. Sorokin, Xianjin Yang, Thรฉo Bourdais, Edoardo Calvello et al.
ๆฅ็ๆ่ฆ
Neural operator learning methods have garnered significant attention in scientific computing for their ability to approximate infinite-dimensional operators. However, increasing their complexity often fails to substantially improve their accuracy, leaving them on par with much simpler approaches such as kernel methods and more traditional reduced-order models. In this article, we set out to address this shortcoming and introduce CHONKNORIS (Cholesky Newton--Kantorovich Neural Operator Residual Iterative System), an operator learning paradigm that can achieve machine precision. CHONKNORIS draws on numerical analysis: many nonlinear forward and inverse PDE problems are solvable by Newton-type methods. Rather than regressing the solution operator itself, our method regresses the Cholesky factors of the elliptic operator associated with Tikhonov-regularized Newton--Kantorovich updates. The resulting unrolled iteration yields a neural architecture whose machine-precision behavior follows from achieving a contractive map, requiring far lower accuracy than end-to-end approximation of the solution operator. We benchmark CHONKNORIS on a range of nonlinear forward and inverse problems, including a nonlinear elliptic equation, Burgers' equation, a nonlinear Darcy flow problem, the Calderรณn problem, an inverse wave scattering problem, and a problem from seismic imaging. We also present theoretical guarantees for the convergence of CHONKNORIS in terms of the accuracy of the emulated Cholesky factors. Additionally, we introduce a foundation model variant, FONKNORIS (Foundation Newton--Kantorovich Neural Operator Residual Iterative System), which aggregates multiple pre-trained CHONKNORIS experts for diverse PDEs to emulate the solution map of a novel nonlinear PDE. Our FONKNORIS model is able to accurately solve unseen nonlinear PDEs such as the Klein--Gordon and Sine--Gordon equations.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Songbo Wang, Renchi Yang, Yurui Lai, Xiaoyang Lin, Tsz Nam Chan
Machine Learning
ๆฅ็ๆ่ฆ
The emergence of graph neural networks (GNNs) has offered a powerful tool for semi-supervised node classification tasks. Subsequent studies have achieved further improvements through refining the message passing schemes in GNN models or exploiting various data augmentation techniques to mitigate limited supervision. In real graphs, nodes often tend to form tightly-knit communities/clusters, which embody abundant signals for compensating label scarcity in semi-supervised node classification but are not explored in prior methods.
Inspired by this, this paper presents NCGC that integrates self-supervised graph clustering and semi-supervised classification into a unified framework. Firstly, we theoretically unify the optimization objectives of GNNs and spectral graph clustering, and based on that, develop soft orthogonal GNNs (SOGNs) that leverage a refined message passing paradigm to generate node representations for both classification and clustering. On top of that, NCGC includes a self-supervised graph clustering module that enables the training of SOGNs for learning representations of unlabeled nodes in a self-supervised manner. Particularly, this component comprises two non-trivial clustering objectives and a Sinkhorn-Knopp normalization that transforms predicted cluster assignments into balanced soft pseudo-labels. Through combining the foregoing clustering module with the classification model using a multi-task objective containing the supervised classification loss on labeled data and self-supervised clustering loss on unlabeled data, NCGC promotes synergy between them and achieves enhanced model capacity. Our extensive experiments showcase that the proposed NCGC framework consistently and considerably outperforms popular GNN models and recent baselines for semi-supervised node classification on seven real graphs, when working with various classic GNN backbones.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yujia Wang, Fenglong Ma, Jinghui Chen
ๆฅ็ๆ่ฆ
Asynchronous federated learning (FL) has recently gained attention for its enhanced efficiency and scalability, enabling local clients to send model updates to the server at their own pace without waiting for slower participants. However, such a design encounters significant challenges, such as the risk of outdated updates from straggler clients degrading the overall model performance and the potential bias introduced by faster clients dominating the learning process, especially under heterogeneous data distributions. Existing methods typically address only one of these issues, creating a conflict where mitigating the impact of outdated updates can exacerbate the bias created by faster clients, and vice versa. To address these challenges, we propose FedEcho, a novel framework that incorporates uncertainty-aware distillation to enhance the asynchronous FL performances under large asynchronous delays and data heterogeneity. Specifically, uncertainty-aware distillation enables the server to assess the reliability of predictions made by straggler clients, dynamically adjusting the influence of these predictions based on their estimated uncertainty. By prioritizing more certain predictions while still leveraging the diverse information from all clients, FedEcho effectively mitigates the negative impacts of outdated updates and data heterogeneity. Through extensive experiments, we demonstrate that FedEcho consistently outperforms existing asynchronous federated learning baselines, achieving robust performance without requiring access to private client data.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yujia Wang, Yuanpu Cao, Jinghui Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Federated learning (FL) has been extensively studied as a privacy-preserving training paradigm. Recently, federated block coordinate descent scheme has become a popular option in training large-scale models, as it allows clients to train only a subset of the model locally instead of the entire model. However, in the era of large language models (LLMs), even a single block can contain a significant number of parameters, posing substantial communication latency, particularly for resource-constrained clients. To address this challenge in federated training/fine-tuning LLMs, we propose ParaBlock, a novel approach that establishes two parallel threads for communication and computation to enhance communication efficiency. We theoretically prove that the proposed ParaBlock achieves the same convergence rate as the standard federated block coordinate descent methods. Empirical evaluations on fine-tuning LLMs on general instruction following and mathematical reasoning confirm that ParaBlock not only maintains strong performance but also significantly improves communication efficiency.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Meiyu Zhong, Noel Teku, Ravi Tandon
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs), though shown to be effective in many applications, can vary significantly in their response quality. In this paper, we investigate this problem of prompt fairness: specifically, the phrasing of a prompt by different users/styles, despite the same question being asked in principle, may elicit different responses from an LLM. To quantify this disparity, we propose to use information-theoretic metrics that can capture two dimensions of bias: subgroup sensitivity, the variability of responses within a subgroup and cross group consistency, the variability of responses across subgroups. Our analysis reveals that certain subgroups exhibit both higher internal variability and greater divergence from others. Our empirical analysis reveals that certain demographic sub groups experience both higher internal variability and greater divergence from others, indicating structural inequities in model behavior. To mitigate these disparities, we propose practical interventions, including majority voting across multiple generations and prompt neutralization, which together improve response stability and enhance fairness across user populations. In the experiments, we observe clear prompt sensitivity disparities across demographic subgroups: before mitigation, cross-group divergence values reach 0.28 and typically fall in the from 0.14 to 0.22 range. After applying our neutralization and multi generation strategy, these divergences consistently decrease, with the largest gap reduced to 0.22 and many distances falling to 0.17 or below, indicating more stable and consistent outputs across subgroups.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haoran Hu, Junren Shi, Shuo Jiang, Kun Cheng, Xia Yang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Forward Collision Warning systems are crucial for vehicle safety and autonomous driving, yet current methods often fail to balance precise multi-agent interaction modeling with real-time decision adaptability, evidenced by the high computational cost for edge deployment and the unreliability stemming from simplified interaction models.To overcome these dual challenges-computational complexity and modeling insufficiency-along with the high false alarm rates of traditional static-threshold warnings, this paper introduces an integrated FCW framework that pairs a Hierarchical Spatio-Temporal Attention Network with a Dynamic Risk Threshold Adjustment algorithm. HSTAN employs a decoupled architecture (Graph Attention Network for spatial, cascaded GRU with self-attention for temporal) to achieve superior performance and efficiency, requiring only 12.3 ms inference time (73% faster than Transformer methods) and reducing the Average Displacement Error (ADE) to 0.73m (42.2% better than Social_LSTM) on the NGSIM dataset. Furthermore, Conformalized Quantile Regression enhances reliability by generating prediction intervals (91.3% coverage at 90% confidence), which the DTRA module then converts into timely warnings via a physics-informed risk potential function and an adaptive threshold mechanism inspired by statistical process control.Tested across multi-scenario datasets, the complete system demonstrates high efficacy, achieving an F1 score of 0.912, a low false alarm rate of 8.2%, and an ample warning lead time of 2.8 seconds, validating the framework's superior performance and practical deployment feasibility in complex environments.
๐
2025-11-25
๐ 39 pages, 15 figures. Under consideration for publ
๐ฅ Akash Doshi, Pinar Sen, Kirill Ivanov, Wei Yang, June Namgoong et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Channel coding from 2G to 5G has assumed the inputs bits at the physical layer to be uniformly distributed. However, hybrid automatic repeat request acknowledgement (HARQ-ACK) bits transmitted in the uplink are inherently non-uniformly distributed. For such sources, significant performance gains could be obtained by employing joint source channel coding, aided by deep learning-based techniques. In this paper, we learn a transformer-based encoder using a novel "free-lunch" training algorithm and propose per-codeword power shaping to exploit the source prior at the encoder whilst being robust to small changes in the HARQ-ACK distribution. Furthermore, any HARQ-ACK decoder has to achieve a low negative acknowledgement (NACK) error rate to avoid radio link failures resulting from multiple NACK errors. We develop an extension of the Neyman-Pearson test to a coded bit system with multiple information bits to achieve Unequal Error Protection of NACK over ACK bits at the decoder. Finally, we apply the proposed encoder and decoder designs to a 5G New Radio (NR) compliant uplink setup under a fading channel, describing the optimal receiver design and a low complexity coherent approximation to it. Our results demonstrate 3-6 dB reduction in the average transmit power required to achieve the target error rates compared to the NR baseline, while also achieving a 2-3 dB reduction in the maximum transmit power, thus providing for significant coverage gains and power savings.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Cyrill Pรผntener, Johann Schwabe, Dominique Garmier, Jonas Frey, Marco Hutter
ๆฅ็ๆ่ฆ
We introduce Kleinkram, a free and open-source system designed to solve the challenge of managing massive, unstructured robotic datasets. Designed as a modular, on-premises cloud solution, Kleinkram enables scalable storage, indexing, and sharing of datasets, ranging from individual experiments to large-scale research collections. Kleinkram natively integrates with standard formats such as ROS bags and MCAP and utilises S3-compatible storage for flexibility. Beyond storage, Kleinkram features an integrated "Action Runner" that executes customizable Docker-based workflows for data validation, curation, and benchmarking. Kleinkram has successfully managed over 30 TB of data from diverse robotic systems, streamlining the research lifecycle through a modern web interface and a robust Command Line Interface (CLI).
๐
2025-11-25
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Liren Yu, Wenming Zhang, Silu Zhou, Zhixuan Zhang, Dan Ou
Natural Language Processing
ๆฅ็ๆ่ฆ
We propose HHFT (Hierarchical Heterogeneous Feature Transformer), a Transformer-based architecture tailored for industrial CTR prediction. HHFT addresses the limitations of DNN through three key designs: (1) Semantic Feature Partitioning: Grouping heterogeneous features (e.g. user profile, item information, behaviour sequennce) into semantically coherent blocks to preserve domain-specific information; (2) Heterogeneous Transformer Encoder: Adopting block-specific QKV projections and FFNs to avoid semantic confusion between distinct feature types; (3) Hiformer Layer: Capturing high-order interactions across features. Our findings reveal that Transformers significantly outperform DNN baselines, achieving a +0.4% improvement in CTR AUC at scale. We have successfully deployed the model on Taobao's production platform, observing a significant uplift in key business metrics, including a +0.6% increase in Gross Merchandise Value (GMV).
๐
2025-11-25
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Anyang Tong, Xiang Niu, ZhiPing Liu, Chang Tian, Yanyan Wei et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Existing multimodal Retrieval-Augmented Generation (RAG) methods for visually rich documents (VRD) are often biased towards retrieving salient knowledge(e.g., prominent text and visual elements), while largely neglecting the critical fine-print knowledge(e.g., small text, contextual details). This limitation leads to incomplete retrieval and compromises the generator's ability to produce accurate and comprehensive answers. To bridge this gap, we propose HKRAG, a new holistic RAG framework designed to explicitly capture and integrate both knowledge types. Our framework features two key components: (1) a Hybrid Masking-based Holistic Retriever that employs explicit masking strategies to separately model salient and fine-print knowledge, ensuring a query-relevant holistic information retrieval; and (2) an Uncertainty-guided Agentic Generator that dynamically assesses the uncertainty of initial answers and actively decides how to integrate the two distinct knowledge streams for optimal response generation. Extensive experiments on open-domain visual question answering benchmarks show that HKRAG consistently outperforms existing methods in both zero-shot and supervised settings, demonstrating the critical importance of holistic knowledge retrieval for VRD understanding.
๐
2025-11-25
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Tianjie Dai, Xu Chen, Yunmeng Shu, Jinsong Lan, Xiaoyong Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Sequential Recommendation System~(SRS) has become pivotal in modern society, which predicts subsequent actions based on the user's historical behavior. However, traditional collaborative filtering-based sequential recommendation models often lead to suboptimal performance due to the limited information of their collaborative signals. With the rapid development of LLMs, an increasing number of works have incorporated LLMs' world knowledge into sequential recommendation. Although they achieve considerable gains, these approaches typically assume the correctness of LLM-generated results and remain susceptible to noise induced by LLM hallucinations. To overcome these limitations, we propose GRASP (Generation Augmented Retrieval with Holistic Attention for Sequential Prediction), a flexible framework that integrates generation augmented retrieval for descriptive synthesis and similarity retrieval, and holistic attention enhancement which employs multi-level attention to effectively employ LLM's world knowledge even with hallucinations and better capture users' dynamic interests. The retrieved similar users/items serve as auxiliary contextual information for the later holistic attention enhancement module, effectively mitigating the noisy guidance of supervision-based methods. Comprehensive evaluations on two public benchmarks and one industrial dataset reveal that GRASP consistently achieves state-of-the-art performance when integrated with diverse backbones. The code is available at: https://anonymous.4open.science/r/GRASP-SRS.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wen-Fang Su, Hsiao-Wei Chou, Wen-Yang Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation, as traditional methods often missegment or entirely miss cross-sentence discontinuous entities, significantly affecting recognition accuracy. Therefore, we aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction due to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to enhance their ability to recognize discontinuous entities and handle segmentation challenges. Experimental results demonstrate that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, leading to decreased performance. In contrast, our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.
๐
2025-11-25
๐ KDD 2026
๐ฅ Ximing Chen, Pui Ieng Lei, Yijun Sheng, Yanyan Liu, Zhiguo Gong
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion models (DMs) have recently gained significant interest for their exceptional potential in recommendation tasks. This stems primarily from their prominent capability in distilling, modeling, and generating comprehensive user preferences. However, previous work fails to examine DMs in recommendation tasks through a rigorous lens. In this paper, we first experimentally investigate the completeness of recommender models from a thermodynamic view. We reveal that existing DM-based recommender models operate by maximizing the energy, while classic recommender models operate by reducing the entropy. Based on this finding, we propose a minimalistic diffusion framework that incorporates both factors via the maximization of Helmholtz free energy. Meanwhile, to foster the optimization, our reverse process is armed with a well-designed denoiser to maintain the inherent anisotropy, which measures the user-item cross-correlation in the context of bipartite graphs. Finally, we adopt an Acceptance-Rejection Gumbel Sampling Process (AR-GSP) to prioritize the far-outnumbered unobserved interactions for model robustness. AR-GSP integrates an acceptance-rejection sampling to ensure high-quality hard negative samples for general recommendation tasks, and a timestep-dependent Gumbel Softmax to handle an adaptive sampling strategy for diffusion models. Theoretical analyses and extensive experiments demonstrate that our proposed framework has distinct superiority over baselines in terms of accuracy and efficiency.
๐
2025-11-25
๐ ArXiv Preprint (cs.CY)
๐ฅ Mykola Makhortykh, Tobias Rohrbach, Maryna Sydorova
ๆฅ็ๆ่ฆ
Information retrieval systems, such as search engines, increasingly shape the representation of the past and present states of social reality. Despite their importance, these systems face challenges in dealing with the ethical aspects of representation due to various forms of bias, including aesthetic bias that perpetuates hegemonic patterns of representation. While most research on aesthetic bias has examined it in the context of current societal issues, it is also crucial for historical representation, particularly of sensitive subjects such as historical atrocities. To address this gap, we conduct a comparative audit of the visual representation of Holocaust victims on Google. We find that Google tends to propagate a male-dominated representation of Holocaust victims with an emphasis on atrocity context, risking rendering invisible gender-specific suffering and decreasing potential for nurturing empathy. We also observe a variation in representation across geographic locations, suggesting that search algorithms may produce their own aesthetic of victimhood.
๐
2025-11-25
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yulong Deng, Zheng Guan, Min He, Xue Wang, Jie Liu et al.
ๆฅ็ๆ่ฆ
Cross-Disciplinary Cold-start Knowledge Tracing (CDCKT) faces a critical challenge: insufficient student interaction data in the target discipline prevents effective knowledge state modeling and performance prediction. Existing cross-disciplinary methods rely on overlapping entities between disciplines for knowledge transfer through simple mapping functions, but suffer from two key limitations: (1) overlapping entities are scarce in real-world scenarios, and (2) simple mappings inadequately capture cross-disciplinary knowledge complexity. To overcome these challenges, we propose Mixed of Experts and Adversarial Generative Network-based Cross-disciplinary Cold-start Knowledge Tracing Framework. Our approach consists of three key components: First, we pre-train a source discipline model and cluster student knowledge states into K categories. Second, these cluster attributes guide a mixture-of-experts network through a gating mechanism, serving as a cross-domain mapping bridge. Third, an adversarial discriminator enforces feature separation by pulling same-attribute student features closer while pushing different-attribute features apart, effectively mitigating small-sample limitations. We validate our method's effectiveness across 20 extreme cross-disciplinary cold-start scenarios.
๐
2025-11-25
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Anton Lyubinin
ๆฅ็ๆ่ฆ
We are extending Popularity Bias Memorization theorem from arXiv:archive/2404.12008 in several directions. We extend it to arbitrary degree distributions and also prove both upper and lower estimates for the alignment with top-k singular hyperspace.
๐
2025-11-25
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikit Phadke
ๆฅ็ๆ่ฆ
REWA introduces a general theory of similarity based on witness-overlap structures. We show that whenever similarity between concepts can be expressed as monotone witness overlap -- whether arising from graph neighborhoods, causal relations, temporal structure, topological features, symbolic patterns, or embedding-based neighborhoods -- it admits a reduction to compact encodings with provable ranking preservation guarantees. REWA systems consist of: (1) finite witness sets $W(v)$, (2) semi-random bit assignments generated from each witness, and (3) monotonicity of expected similarity in the overlap $ฮ(u, v) = |W(u) \cap W(v)|$. We prove that under an overlap-gap condition on the final witness sets -- independent of how they were constructed -- top-$k$ rankings are preserved using $m = O(\log(|V|/ฮด))$ bits. The witness-set formulation is compositional: any sequence of structural, temporal, causal, topological, information-theoretic, or learned transformations can be combined into pipelines that terminate in discrete witness sets. The theory applies to the final witness overlap, enabling modular construction of similarity systems from reusable primitives. This yields a vast design space: millions of composable similarity definitions inherit logarithmic encoding complexity. REWA subsumes and unifies Bloom filters, minhash, LSH bitmaps, random projections, sketches, and hierarchical filters as special cases. It provides a principled foundation for similarity systems whose behavior is governed by witness overlap rather than hash-function engineering. This manuscript presents the axioms, the main reducibility theorem, complete proofs with explicit constants, and a detailed discussion of compositional design, limitations, and future extensions including multi-bit encodings, weighted witnesses, and non-set representations.
๐
2025-11-25
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xinyu Wang, Hanwei Wu, Qingchen Hu, Zhenghan Tai, Jingrui Tian et al.
ๆฅ็ๆ่ฆ
Decoder-only rerankers are central to Retrieval-Augmented Generation (RAG). However, generalist models miss domain-specific nuances in high-stakes fields like finance and law, and naive fine-tuning causes surface-form overfitting and catastrophic forgetting. To address this challenge, we introduce R2R, a domain-aware framework that combines dynamic expert routing with a two-stage training strategy, Entity Abstraction for Generalization (EAG). EAG introduces a counter-shortcut mechanism by masking the most predictive surface cues, forcing the reranker to learn domain-invariant relevance patterns rather than memorizing dataset-specific entities. To efficiently activate domain experts, R2R employs a lightweight Latent Semantic Router that probes internal representations from the frozen backbone decoder to select the optimal LoRA expert per query. Extensive experiments across different reranker backbones and diverse domains (legal, medical, and financial) demonstrate that R2R consistently surpasses generalist and single-domain fine-tuned baselines. Our results confirm that R2R is a model-agnostic and modular approach to domain specialization with strong cross-domain robustness.
๐
2025-11-25
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Kaike Zhang, Jiakai Tang, Du Su, Shuchang Liu, Julian McAuley et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recommender systems shape how people discover information, form opinions, and connect with society. Yet, as their influence grows, traditional metrics, e.g., accuracy, clicks, and engagement, no longer capture what truly matters to humans. The workshop on Human-Centered Recommender Systems (HCRS) calls for a paradigm shift from optimizing engagement toward designing systems that truly understand, involve, and benefit people. It brings together researchers in recommender systems, human-computer interaction, AI safety, and social computing to explore how human values, e.g., trust, safety, fairness, transparency, and well-being, can be integrated into recommendation processes. Centered around three thematic axes-Human Understanding, Human Involvement, and Human Impact-HCRS features keynotes, panels, and papers covering topics from LLM-based interactive recommenders to societal welfare optimization. By fostering interdisciplinary collaboration, HCRS aims to shape the next decade of responsible and human-aligned recommendation research.
๐
2025-11-25
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Ziwei Liu, Qidong Liu, Wanyu Wang, Yejing Wang, Tong Xu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Cross-domain Sequential Recommendation (CDSR) has been proposed to enrich user-item interactions by incorporating information from various domains. Despite current progress, the imbalance issue and transition issue hinder further development of CDSR. The former one presents a phenomenon that the interactions in one domain dominate the entire behavior, leading to difficulty in capturing the domain-specific features in the other domain. The latter points to the difficulty in capturing users' cross-domain preferences within the mixed interaction sequence, resulting in poor next-item prediction performance for specific domains. With world knowledge and powerful reasoning ability, Large Language Models (LLMs) partially alleviate the above issues by performing as a generator and an encoder. However, current LLMs-enhanced CDSR methods are still under exploration, which fail to recognize the irrelevant noise and rough profiling problems. Thus, to make peace with the aforementioned challenges, we proposed an LLMs Enhanced Cross-domain Sequential Recommendation with Dual-phase Training ({LLM-EDT}). To address the imbalance issue while introducing less irrelevant noise, we first propose the transferable item augmenter to adaptively generate possible cross-domain behaviors for users. Then, to alleviate the transition issue, we introduce a dual-phase training strategy to empower the domain-specific thread with a domain-shared background. As for the rough profiling problem, we devise a domain-aware profiling module to summarize the user's preference in each domain and adaptively aggregate them to generate comprehensive user profiles. The experiments on three public datasets validate the effectiveness of our proposed LLM-EDT. To ease reproducibility, we have released the detailed code online at {https://anonymous.4open.science/r/LLM-EDT-583F}.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, Li Zhang
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) policies excel in aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to demonstrations, and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this misalignment, but real-robot interaction is expensive and conventional simulators are hard to engineer and transfer. We address both data efficiency and optimization stability in VLA post-training via a learned world model and an RL procedure tailored to flow-based action heads. Specifically, we introduce Prophet, a unified action-to-video robot actuation pretrained across large-scale, heterogeneous robot data to learn reusable action-outcome dynamics. It is able to few-shot adapt to new robots, objects, and environments, yielding a rollout-ready simulator. Upon Prophet, we reinforce action policies with Flow-action-GRPO (FA-GRPO), which adapts Flow-GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per-step gradients in the flow head. Together, Prophet, FA-GRPO, and FlowScale constitute ProphRL, a practical, data- and compute-efficient path to VLA post-training. Experiments show 5-17% success gains on public benchmarks and 24-30% gains on real robots across different VLA variants.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinhao Liu, Jiaqi Li, Youming Deng, Ruxin Chen, Yingjia Zhang et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI. Project website is at https://ai4ce.github.io/wanderland/.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Allen Emmanuel Binny, Mahathi Anand, Hugo T. M. Kussaba, Lingyun Chen, Shreenabh Agrawal et al.
Machine Learning
ๆฅ็ๆ่ฆ
Learning safe and stable robot motions from demonstrations remains a challenge, especially in complex, nonlinear tasks involving dynamic, obstacle-rich environments. In this paper, we propose Safe and Stable Neural Network Dynamical Systems S$^2$-NNDS, a learning-from-demonstration framework that simultaneously learns expressive neural dynamical systems alongside neural Lyapunov stability and barrier safety certificates. Unlike traditional approaches with restrictive polynomial parameterizations, S$^2$-NNDS leverages neural networks to capture complex robot motions providing probabilistic guarantees through split conformal prediction in learned certificates. Experimental results on various 2D and 3D datasets -- including LASA handwriting and demonstrations recorded kinesthetically from the Franka Emika Panda robot -- validate S$^2$-NNDS effectiveness in learning robust, safe, and stable motions from potentially unsafe demonstrations.
๐
2025-11-25
๐ NeurIPS 2025
๐ฅ Tasha Kim, Oiwi Parker Jones
Robotics
ๆฅ็ๆ่ฆ
Safety-critical assistive systems that directly decode user intent from neural signals require rigorous guarantees of reliability and trust. We present GUARDIAN (Gated Uncertainty-Aware Runtime Dual Invariants), a framework for real-time neuro-symbolic verification for neural signal-controlled robotics. GUARDIAN enforces both logical safety and physiological trust by coupling confidence-calibrated brain signal decoding with symbolic goal grounding and dual-layer runtime monitoring. On the BNCI2014 motor imagery electroencephalogram (EEG) dataset with 9 subjects and 5,184 trials, the system performs at a high safety rate of 94-97% even with lightweight decoder architectures with low test accuracies (27-46%) and high ECE confidence miscalibration (0.22-0.41). We demonstrate 1.7x correct interventions in simulated noise testing versus at baseline. The monitor operates at 100Hz and sub-millisecond decision latency, making it practically viable for closed-loop neural signal-based systems. Across 21 ablation results, GUARDIAN exhibits a graduated response to signal degradation, and produces auditable traces from intent, plan to action, helping to link neural evidence to verifiable robot action.
๐
2025-11-25
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Charles Y. Zhang, Yuanjia Yang, Aidan Sirbu, Elliott T. T. Abe, Emil Wรคrnberg et al.
Robotics
ๆฅ็ๆ่ฆ
The primary output of the nervous system is movement and behavior. While recent advances have democratized pose tracking during complex behavior, kinematic trajectories alone provide only indirect access to the underlying control processes. Here we present MIMIC-MJX, a framework for learning biologically-plausible neural control policies from kinematics. MIMIC-MJX models the generative process of motor control by training neural controllers that learn to actuate biomechanically-realistic body models in physics simulation to reproduce real kinematic trajectories. We demonstrate that our implementation is accurate, fast, data-efficient, and generalizable to diverse animal body models. Policies trained with MIMIC-MJX can be utilized to both analyze neural control strategies and simulate behavioral experiments, illustrating its potential as an integrative modeling framework for neuroscience.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Jiaxin Liu, Min Li, Wanting Xu, Liang Li, Jiaqi Yang et al.
Computer Vision
ๆฅ็ๆ่ฆ
Accurate state estimation for flexible robotic systems poses significant challenges, particular for platforms with dynamically deforming structures that invalidate rigid-body assumptions. This paper tackles this problem and allows to extend existing rigid-body pose estimation methods to non-rigid systems. Our approach hinges on two core assumptions: first, the elastic properties are captured by an injective deformation-force model, efficiently learned via a Multi-Layer Perceptron; second, we solve the platform's inherently smooth motion using continuous-time B-spline kinematic models. By continuously applying Newton's Second Law, our method establishes a physical link between visually-derived trajectory acceleration and predicted deformation-induced acceleration. We demonstrate that our approach not only enables robust and accurate pose estimation on non-rigid platforms, but that the properly modeled platform physics instigate inertial sensing properties. We demonstrate this feasibility on a simple spring-camera system, and show how it robustly resolves the typically ill-posed problem of metric scale and gravity recovery in monocular visual odometry.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Cyrill Pรผntener, Johann Schwabe, Dominique Garmier, Jonas Frey, Marco Hutter
ๆฅ็ๆ่ฆ
We introduce Kleinkram, a free and open-source system designed to solve the challenge of managing massive, unstructured robotic datasets. Designed as a modular, on-premises cloud solution, Kleinkram enables scalable storage, indexing, and sharing of datasets, ranging from individual experiments to large-scale research collections. Kleinkram natively integrates with standard formats such as ROS bags and MCAP and utilises S3-compatible storage for flexibility. Beyond storage, Kleinkram features an integrated "Action Runner" that executes customizable Docker-based workflows for data validation, curation, and benchmarking. Kleinkram has successfully managed over 30 TB of data from diverse robotic systems, streamlining the research lifecycle through a modern web interface and a robust Command Line Interface (CLI).
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Liangkai Liu, Weisong Shi, Kang G. Shin
Robotics
ๆฅ็ๆ่ฆ
This paper presents pNav, a novel power-management system that significantly enhances the power/energy-efficiency of Autonomous Mobile Robots (AMRs) by jointly optimizing their physical/mechanical and cyber subsystems. By profiling AMRs' power consumption, we identify three challenges in achieving CPS (cyber-physical system) power-efficiency that involve both cyber (C) and physical (P) subsystems: (1) variabilities of system power consumption breakdown, (2) environment-aware navigation locality, and (3) coordination of C and P subsystems. pNav takes a multi-faceted approach to achieve power-efficiency of AMRs. First, it integrates millisecond-level power consumption prediction for both C and P subsystems. Second, it includes novel real-time modeling and monitoring of spatial and temporal navigation localities for AMRs. Third, it supports dynamic coordination of AMR software (navigation, detection) and hardware (motors, DVFS driver) configurations. pNav is prototyped using the Robot Operating System (ROS) Navigation Stack, 2D LiDAR, and camera. Our in-depth evaluation with a real robot and Gazebo environments demonstrates a >96% accuracy in predicting power consumption and a 38.1% reduction in power consumption without compromising navigation accuracy and safety.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bo Pang, Chenxi Xu, Jierui Ren, Guoping Wang, Sheng Li
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Shiqian Liu, Azlan Mohd Zain, Le-le Mao
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recently, path planning has achieved remarkable progress in enhancing global search capability and convergence accuracy through heuristic and learning-inspired optimization frameworks. However, real-time adaptability in dynamic environments remains a critical challenge for autonomous navigation, particularly when robots must generate collision-free, smooth, and efficient trajectories under complex constraints. By analyzing the difficulties in dynamic path planning, the Wind Driven Optimization (WDO) algorithm emerges as a promising framework owing to its physically interpretable search dynamics. Motivated by these observations, this work revisits the WDO principle and proposes a variant formulation, Multi-hierarchical adaptive wind driven optimization(MAWDO), that improves adaptability and robustness in time-varying environments. To mitigate instability and premature convergence, a hierarchical-guidance mechanism divides the population into multiple groups guided by individual, regional, and global leaders to balance exploration and exploitation. Extensive evaluations on sixteen benchmark functions show that MAWDO achieves superior optimization accuracy, convergence stability, and adaptability over state-of-the art metaheuristics. In dynamic path planning, MAWDO shortens the path length to 469.28 pixels, improving over Multi-strategy ensemble wind driven optimization(MEWDO), Adaptive wind driven optimization(AWDO) and WDO by 3.51\%, 11.63\% and 14.93\%, and achieves the smallest optimality gap (1.01) with smoothness 0.71 versus 13.50 and 15.67 for AWDO and WDO, leading to smoother, shorter, and collision-free trajectories that confirm its effectiveness for real-time path planning in complex environments.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Benjamin Sportich, Kenza Boubakri, Olivier Simonin, Alessandro Renzaglia
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Reasons for mapping an unknown environment with autonomous robots are wide-ranging, but in practice, they are often overlooked when developing planning strategies. Rapid information gathering and comprehensive structural assessment of buildings have different requirements and therefore necessitate distinct methodologies. In this paper, we propose a novel modular Next-Best-View (NBV) planning framework for aerial robots that explicitly uses a reconstruction quality objective to guide the exploration planning. In particular, our approach introduces new and efficient methods for view generation and selection of viewpoint candidates that are adaptive to the user-defined quality requirements, fully exploiting the uncertainty encoded in a Truncated Signed Distance field (TSDF) representation of the environment. This results in informed and efficient exploration decisions tailored towards the predetermined objective. Finally, we validate our method via extensive simulations in realistic environments. We demonstrate that it successfully adjusts its behavior to the user goal while consistently outperforming conventional NBV strategies in terms of coverage, quality of the final 3D map and path efficiency.
๐
2025-11-25
๐ 8 pages, 5 figures. Submitted to IEEE Intelligent
๐ฅ Joรฃo Malheiro Silva, Andy Huynh, Tong Duy Son, Holger Caesar
Computer Vision
ๆฅ็ๆ่ฆ
3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Yuhan Wu, Tiantian Wei, Shuo Wang, ZhiChao Wang, Yanyong Zhang et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Interactive articulated manipulation requires long-horizon, multi-step interactions with appliances while maintaining physical consistency. Existing vision-language and diffusion-based policies struggle to generalize across parts, instances, and categories. We first introduce ArtiBench, a five-level benchmark covering kitchen, storage, office, and tool environments. ArtiBench enables structured evaluation from cross-part and cross-instance variation to long-horizon multi-object tasks, revealing the core generalization challenges of articulated object manipulation. Building on this benchmark, we propose ArtiBrain, a modular framework that unifies high-level reasoning with adaptive low-level control. ArtiBrain uses a VLM-based Task Reasoner (GPT-4.1) to decompose and validate subgoals, and employs a Hybrid Controller that combines geometry-aware keyframe execution with affordance-guided diffusion for precise and interpretable manipulation. An Affordance Memory Bank continually accumulates successful execution episodes and propagates part-level actionable affordances to unseen articulated parts and configurations. Extensive experiments on ArtiBench show that our ArtiBrain significantly outperforms state-of-the-art multimodal and diffusion-based methods in robustness and generalization. Code and dataset will be released upon acceptance.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Rรณisรญn Keenan, Joost C. Dessing
Robotics
ๆฅ็ๆ่ฆ
Recent advancements in robotics have increased the possibilities for integrating robotic systems into human-involved workplaces, highlighting the need to examine and optimize human-robot coordination in collaborative settings. This study explores human-robot interactions during handover tasks using Virtual Reality (VR) to investigate differences in human motor performance across various task dynamics and robot kinematics. A VR-based robot handover simulation afforded safe and controlled assessments of human-robot interactions. In separate experiments, four potential influences on human performance were examined (1) control over task initiation and robot movement synchrony (temporal and spatiotemporal); (2) partner appearance (human versus robotic); (3) robot velocity profiles (minimum jerk, constant velocity, constant acceleration, and biphasic); and (4) the timing of rotational object motion. Findings across experiments emphasize humans benefit from robots providing early and salient visual information about task-relevant object motion, and advantages of human-like smooth robot trajectories. To varying degrees, these manipulations improved predictive accuracy and synchronization during interaction. This suggests that human-robot interactions should be designed to allow humans to leverage their natural capabilities for detecting biological motion, which conversely may reduce the need for costly robotic computations or added cognitive adaptation on the human side.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Dong Wang, Daniel Casado Herraez, Stefan May, Andreas Nรผchter
Robotics
ๆฅ็ๆ่ฆ
Reliable odometry in highly dynamic environments remains challenging when it relies on ICP-based registration: ICP assumes near-static scenes and degrades in repetitive or low-texture geometry. We introduce Dynamic-ICP, a Doppler-aware registration framework. The method (i) estimates ego motion from per-point Doppler velocity via robust regression and builds a velocity filter, (ii) clusters dynamic objects and reconstructs object-wise translational velocities from ego-compensated radial measurements, (iii) predicts dynamic points with a constant-velocity model, and (iv) aligns scans using a compact objective that combines point-to-plane geometry residual with a translation-invariant, rotation-only Doppler residual. The approach requires no external sensors or sensor-vehicle calibration and operates directly on FMCW LiDAR range and Doppler velocities. We evaluate Dynamic-ICP on three datasets-HeRCULES, HeLiPR, AevaScenes-focusing on highly dynamic scenes. Dynamic-ICP consistently improves rotational stability and translation accuracy over the state-of-the-art methods. Our approach is also simple to integrate into existing pipelines, runs in real time, and provides a lightweight solution for robust registration in dynamic environments. To encourage further research, the code is available at: https://github.com/JMUWRobotics/Dynamic-ICP.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Chenhui Dong, Haozhe Xu, Wenhao Feng, Zhipeng Wang, Yanmin Zhou et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning controllers have made impressive progress in humanoid locomotion and light load manipulation. However, achieving robust and precise motion with strong force interaction remains a significant challenge. Based on the above limitations, this paper proposes HAFO, a dual-agent reinforcement learning control framework that simultaneously optimizes both a robust locomotion strategy and a precise upper-body manipulation strategy through coupled training under external force interaction environments. Simultaneously, we explicitly model the external pulling disturbances through a spring-damper system and achieve fine-grained force control by manipulating the virtual spring. During this process, the reinforcement-learning policy spontaneously generates disturbance-rejection response by exploiting environmental feedback. Moreover, HAFO employs an asymmetric Actor-Critic framework in which the Critic-network access to privileged spring-damping forces guides the actor-network to learn a generalizable, robust policy for resisting external disturbances. The experimental results demonstrate that HAFO achieves stable control of humanoid robot under various strong force interactions, showing remarkable performance in load tasks and ensuring stable robot operation under rope tension disturbances. Project website: hafo-robot.github.io.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Yu Sun, Yaosheng Deng, Wenjie Mei, Xiaogang Xiong, Yang Bai et al.
Robotics
ๆฅ็ๆ่ฆ
Soft robotics has advanced rapidly, yet its control methods remain fragmented: different morphologies and actuation schemes still require task-specific controllers, hindering theoretical integration and large-scale deployment. A generic control framework is therefore essential, and a key obstacle lies in the persistent use of rigid-body control logic, which relies on precise models and strict low-level execution. Such a paradigm is effective for rigid robots but fails for soft robots, where the ability to tolerate and exploit approximate action representations, i.e., control compliance, is the basis of robustness and adaptability rather than a disturbance to be eliminated. Control should thus shift from suppressing compliance to explicitly exploiting it. Human motor control exemplifies this principle: instead of computing exact dynamics or issuing detailed muscle-level commands, it expresses intention through high-level movement tendencies, while reflexes and biomechanical mechanisms autonomously resolve local details. This architecture enables robustness, flexibility, and cross-task generalization. Motivated by this insight, we propose a generic soft-robot control framework grounded in control compliance and validate it across robots with diverse morphologies and actuation mechanisms. The results demonstrate stable, safe, and cross-platform transferable behavior, indicating that embracing control compliance, rather than resisting it, may provide a widely applicable foundation for unified soft-robot control.
๐
2025-11-25
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haebin Seong, Sungmin Kim, Minchan Kim, Yongjun Cho, Myunchul Joe et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Existing navigation benchmarks focus on task success metrics while overlooking economic viability -- critical for commercial deployment of autonomous delivery robots. We introduce \emph{CostNav}, a \textbf{Micro-Navigation Economic Testbed} that evaluates embodied agents through comprehensive cost-revenue analysis aligned with real-world business operations. CostNav models the complete economic lifecycle including hardware, training, energy, maintenance costs, and delivery revenue with service-level agreements, using industry-derived parameters. \textbf{To our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability}, revealing that optimizing for task success fundamentally differs from optimizing for economic deployment. Our cost model uses parameters derived from industry data sources (energy rates, delivery service pricing), and we project from a reduced-scale simulation to realistic deliveries. Under this projection, the baseline achieves 43.0\% SLA compliance but is \emph{not} commercially viable: yielding a loss of \$30.009 per run with no finite break-even point, because operating costs are dominated by collision-induced maintenance, which accounts for 99.7\% of per-run costs and highlights collision avoidance as a key optimization target. We demonstrate a learning-based on-device navigation baseline and establish a foundation for evaluating rule-based navigation, imitation learning, and cost-aware RL training. CostNav bridges the gap between navigation research and commercial deployment, enabling data-driven decisions about economic trade-offs across navigation paradigms.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Ryohei Kobayashi, Kosei Isomoto, Kosei Yamao, Soma Fumoto, Koshun Arimura et al.
Robotics
ๆฅ็ๆ่ฆ
This paper provides an overview of the techniques employed by Hibikino-Musashi@Home, which intends to participate in the domestic standard platform league. The team developed a dataset generator for training a robot vision system and an open-source development environment running on a Human Support Robot simulator. The large-language-model-powered task planner selects appropriate primitive skills to perform the task requested by the user. Moreover, the team has focused on research involving brain-inspired memory models for adaptation to individual home environments. This approach aims to provide intuitive and personalized assistance. Additionally, the team contributed to the reusability of the navigation system developed by Pumas in RoboCup2024. The team aimed to design a home service robot to assist humans in their homes and continuously attend competitions to evaluate and improve the developed system.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bin Hu, Zijian Lu, Haicheng Liao, Chengran Yuan, Bin Rao et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Yan Li, Yingzhao Li, Gim Hee Lee
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
In this paper, we present an active exploration framework for high-fidelity 3D reconstruction that incrementally builds a multi-level uncertainty space and selects next-best-views through an uncertainty-driven motion planner. We introduce a hybrid implicit-explicit representation that fuses neural fields with Gaussian primitives to jointly capture global structural priors and locally observed details. Based on this hybrid state, we derive a hierarchical uncertainty volume that quantifies both implicit global structure quality and explicit local surface confidence. To focus optimization on the most informative regions, we propose an uncertainty-driven keyframe selection strategy that anchors high-entropy viewpoints as sparse attention nodes, coupled with a viewpoint-space sliding window for uncertainty-aware local refinement. The planning module formulates next-best-view selection as an Expected Hybrid Information Gain problem and incorporates a risk-sensitive path planner to ensure efficient and safe exploration. Extensive experiments on challenging benchmarks demonstrate that our approach consistently achieves state-of-the-art accuracy, completeness, and rendering quality, highlighting its effectiveness for real-world active reconstruction and robotic perception tasks.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Jinxuan Zhu, Zihao Yan, Yangyu Xiao, Jingxiang Guo, Chenrui Tie et al.
Robotics
ๆฅ็ๆ่ฆ
Contact feedback is essential for contact-rich robotic manipulation, as it allows the robot to detect subtle interaction changes and adjust its actions accordingly. Six-axis force-torque sensors are commonly used to obtain contact feedback, but their high cost and fragility have discouraged many researchers from adopting them in contact-rich tasks. To offer a more cost-efficient and easy-accessible source of contact feedback, we present ShapeForce, a low-cost, plug-and-play soft wrist that provides force-like signals for contact-rich robotic manipulation. Inspired by how humans rely on relative force changes in contact rather than precise force magnitudes, ShapeForce converts external force and torque into measurable deformations of its compliant core, which are then estimated via marker-based pose tracking and converted into force-like signals. Our design eliminates the need for calibration or specialized electronics to obtain exact values, and instead focuses on capturing force and torque changes sufficient for enabling contact-rich manipulation. Extensive experiments across diverse contact-rich tasks and manipulation policies demonstrate that ShapeForce delivers performance comparable to six-axis force-torque sensors at an extremely low cost.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Lidi Zhang, Han Wu, Liyu Zhang, Ruofeng Liu, Haotian Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
The 3D bin packing problem, with its diverse industrial applications, has garnered significant research attention in recent years. Existing approaches typically model it as a discrete and static process, while real-world applications involve continuous gravity-driven interactions. This idealized simplification leads to infeasible deployments (e.g., unstable packing) in practice. Simulations with physical engine offer an opportunity to emulate continuous gravity effects, enabling the training of reinforcement learning (RL) agents to address such limitations and improve packing stability. However, a simulation-to-reality gap persists due to dynamic variations in physical properties of real-world objects, such as various friction coefficients, elasticity, and non-uniform weight distributions. To bridge this gap, we propose a hybrid RL framework that collaborates with physical simulation with real-world data feedback. Firstly, domain randomization is applied during simulation to expose agents to a spectrum of physical parameters, enhancing their generalization capability. Secondly, the RL agent is fine-tuned with real-world deployment feedback, further reducing collapse rates. Extensive experiments demonstrate that our method achieves lower collapse rates in both simulated and real-world scenarios. Large-scale deployments in logistics systems validate the practical effectiveness, with a 35\% reduction in packing collapse compared to baseline methods.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Dapeng Zhang, Fei Shen, Rui Zhao, Yinda Chen, Peng Zhi et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Autonomous driving represents a prominent application of artificial intelligence. Recent approaches have shifted from focusing solely on common scenarios to addressing complex, long-tail situations such as subtle human behaviors, traffic accidents, and non-compliant driving patterns. Given the demonstrated capabilities of large language models (LLMs) in understanding visual and natural language inputs and following instructions, recent methods have integrated LLMs into autonomous driving systems to enhance reasoning, interpretability, and performance across diverse scenarios. However, existing methods typically rely either on real-world data, which is suitable for industrial deployment, or on simulation data tailored to rare or hard case scenarios. Few approaches effectively integrate the complementary advantages of both data sources. To address this limitation, we propose a novel VLM-guided, end-to-end adversarial transfer framework for autonomous driving that transfers long-tail handling capabilities from simulation to real-world deployment, named CoC-VLA. The framework comprises a teacher VLM model, a student VLM model, and a discriminator. Both the teacher and student VLM models utilize a shared base architecture, termed the Chain-of-Causality Visual-Language Model (CoC VLM), which integrates temporal information via an end-to-end text adapter. This architecture supports chain-of-thought reasoning to infer complex driving logic. The teacher and student VLM models are pre-trained separately on simulated and real-world datasets. The discriminator is trained adversarially to facilitate the transfer of long-tail handling capabilities from simulated to real-world environments by the student VLM model, using a novel backpropagation strategy.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen et al.
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. Leveraging both supervised learning and reinforcement learning fine-tuning, extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and the excellent inference speed reported to date.
๐
2025-11-25
๐ ArXiv Preprint (cs.DC)
๐ฅ Prabhat Kumar Chand, Anisur Rahaman Molla
Robotics
ๆฅ็ๆ่ฆ
Mobile agents have emerged as a powerful framework for solving fundamental graph problems in distributed settings in recent times. These agents, modelled as autonomous physical or software entities, possess local computation power, finite memory and have the ability to traverse a graph, offering efficient solutions to a range of classical problems. In this work, we focus on the problem of computing a \emph{minimal dominating set} (mDS) in anonymous graphs using mobile agents. Building on the recently proposed optimal dispersion algorithm on the synchronous mobile agent model, we design two new algorithms that achieve a \emph{linear-time} solution for this problem in the synchronous setting. Specifically, given a connected $n$-node graph with $n$ agents initially placed in either rooted or arbitrary configurations, we show that an mDS can be computed in $O(n)$ rounds using only $O(\log n)$ bits of memory per agent, without using any prior knowledge of any global parameters. This improves upon the best-known complexity results in the literature over the same model. In addition, as natural by-products of our methodology, our algorithms also construct a spanning tree and elect a unique leader in $O(n)$ rounds, which are also important results of independent interest in the mobile-agent framework.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chengyue Huang, Mellon M. Zhang, Robert Azarcon, Glen Chou, Zsolt Kira
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes -- freezing modules or applying uniform regularization -- either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Eito Sato, Takahiro Wada
Robotics
ๆฅ็ๆ่ฆ
Haptic shared control (HSC) is effective in teleoperation when full autonomy is limited by uncertainty or sensing constraints. However, autonomous control performance achieved by maximizing HSC strength is limited because the dynamics of the joystick and human arm affect the robot's behavior. We propose a cooperative framework coupling a joystick-independent autonomous controller with HSC. A control barrier function ignores joystick inputs within a safe region determined by the human operator in real-time, while HSC is engaged otherwise. A pilot experiment on simulated tasks with tele-operated underwater robot in virtual environment demonstrated improved accuracy and reduced required time over conventional HSC.
๐
2025-11-25
๐ ArXiv Preprint (Computer Vision)
๐ฅ GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang et al.
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA model (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.
๐
2025-11-25
๐ ArXiv Preprint (Robotics)
๐ฅ Xiangkai Ma, Lekai Xing, Han Zhang, Wenzhong Li, Sanglu Lu
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to its significant perceptual comprehension. Recently, since text-only CoT struggles to adequately capture scene details in complex spatial environments, a highly promising strategy involves leveraging visual priors to guide robotic action generation. Nevertheless, these strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose a Vision-Integrated Trajectory Alignment (VITA) framework that learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces a implicit visual CoT: autoregressively generated tokens is simultaneously decoded into future frames predictions and robot actions, thereby internalizing visual dynamics as an inductive bias for motion planning. Extensive experiments on simulated and real-world environments demonstrate state-of-the-art performance. VITA improves 14.5\%, 9.6\% and 12.1\% over existing baselines on CALVIN, LIBERO and SimplerEnv. Furthermore, VITA attains an average success rate of 80.5\% across six real-world tasks, demonstrating its potential as a generalist robotic manipulation model.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yang Wu, Qian Li, Yuling Xiong, Hongbo Tang, Xun Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Harnessing the reasoning power of Large Language Models (LLMs) for recommender systems is hindered by two fundamental challenges. First, current approaches lack a mechanism for automated, data-driven discovery of effective reasoning patterns, relying instead on brittle manual templates or unstable zero-shot prompting. Second, they employ structure-collapsing integration: direct prompting incurs prohibitive online inference costs, while feature extraction collapses reasoning chains into single vectors, discarding stepwise logic. To address these challenges, we propose SCoTER (Structured Chain-of-Thought Transfer for Enhanced Recommendation), a unified framework that treats pattern discovery and structure-aware transfer as a jointly optimized problem. Specifically, SCoTER operationalizes this through two synergistic components: a GVM pipeline for automated pattern discovery and a structure-preserving integration architecture that transfers stepwise logic to efficient models. Formally, we provide information-theoretic justification proving that structure-preserving transfer achieves tighter performance bounds than structure-agnostic alternatives. Empirically, experiments on four benchmarks demonstrate improvements of 3.75\%-11.59\% over a strong TIGER backbone. Moreover, in production deployment on the Tencent Advertising Platform, SCoTER achieved a 2.14\% lift in Gross Merchandise Value (GMV) while eliminating online LLM inference costs. Overall, SCoTER establishes a principled and production-validated blueprint for transferring structured LLM reasoning to large-scale recommender systems.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Shantanu Rahman, Nayeb Hasin, Mainul Islam, Golam Sarowar
Robotics
ๆฅ็ๆ่ฆ
Autonomous vehicles are becoming popular day by day not only for autonomous road traversal but also for industrial automation, farming and military. Most of the standard vehicles follow the Ackermann style steering mechanism. This has become to de facto standard for large and long faring vehicles. The local planner of an autonomous vehicle controls the low-level vehicle movement upon which the vehicle will perform its motor actuation. In our work, we focus on autonomous vehicles in road and perform experiments to analyze the effect of low-level controllers in the simulation and a real environment. To increase the precision and stability of trajectory tracking in autonomous cars, a novel method that combines lane identification with Model Predictive Control (MPC) is presented. The research focuses on camera-equipped autonomous vehicles and uses methods like edge recognition, sliding window-based straight-line identification for lane line extraction, and dynamic region of interest (ROI) extraction. Next, to follow the identified lane line, an MPC built on a bicycle vehicle dynamics model is created. A single-lane road simulation model is built using ROS Gazebo and tested in order to verify the controller's performance. The root mean square error between the optimal tracking trajectory and the target trajectory was reduced by 27.65% in the simulation results, demonstrating the high robustness and flexibility of the developed controller.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Yuchen Zhou, Haihang Wu
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Monocular Simultaneous Localization and Mapping (SLAM) aims to estimate a robot's pose while simultaneously reconstructing an unknown 3D scene using a single camera. While existing monocular SLAM systems generate detailed 3D geometry through dense scene representations, they are computationally expensive due to the need for iterative optimization. To address this challenge, MASt3R-SLAM utilizes learned 3D reconstruction priors, enabling more efficient and accurate estimation of both 3D structures and camera poses. However, MASt3R-SLAM is limited to single-agent operation. In this paper, we extend MASt3R-SLAM to introduce the first multi-agent monocular dense SLAM system. Each agent performs local SLAM using a 3D reconstruction prior, and their individual maps are fused into a globally consistent map through a loop-closure-based map fusion mechanism. Our approach improves computational efficiency compared to state-of-the-art methods, while maintaining similar mapping accuracy when evaluated on real-world datasets.
๐
2025-11-24
๐ AAAI 2026
๐ฅ Maureen Herbert, Katie Sun, Angelica Lim, Yasaman Etesam
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we examine fairness within the domain of emotional theory of mind, investigating whether LLMs exhibit gender biases when presented with a description of a person and their environment and asked, "How does this person feel?". Furthermore, we propose and evaluate several debiasing strategies, demonstrating that achieving meaningful reductions in bias requires training based interventions rather than relying solely on inference-time prompt-based approaches such as prompt engineering.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Meng Lu, Ran Xu, Yi Fang, Wenxuan Zhang, Yue Yu et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Colton Casto, Anna Ivanova, Evelina Fedorenko, Nancy Kanwisher
ๆฅ็ๆ่ฆ
Language understanding entails not just extracting the surface-level meaning of the linguistic input, but constructing rich mental models of the situation it describes. Here we propose that because processing within the brain's core language system is fundamentally limited, deeply understanding language requires exporting information from the language system to other brain regions that compute perceptual and motor representations, construct mental models, and store our world knowledge and autobiographical memories. We review the existing evidence for this hypothesis, and argue that recent progress in cognitive neuroscience provides both the conceptual foundation and the methods to directly test it, thus opening up a new strategy to reveal what it means, cognitively and neurally, to understand language.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Richard J. Young, Alice M. Matthews
Natural Language Processing
ๆฅ็ๆ่ฆ
Domain-specific text embeddings are critical for clinical natural language processing, yet systematic comparisons across model architectures remain limited. This study evaluates ten transformer-based embedding models adapted for cardiology through Low-Rank Adaptation (LoRA) fine-tuning on 106,535 cardiology text pairs derived from authoritative medical textbooks. Results demonstrate that encoder-only architectures, particularly BioLinkBERT, achieve superior domain-specific performance (separation score: 0.510) compared to larger decoder-based models, while requiring significantly fewer computational resources. The findings challenge the assumption that larger language models necessarily produce better domain-specific embeddings and provide practical guidance for clinical NLP system development. All models, training code, and evaluation datasets are publicly available to support reproducible research in medical informatics.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mobina Mehrazar, Mohammad Amin Yousefi, Parisa Abolfath Beygi, Behnam Bahrak
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used to generate self-explanations alongside their predictions, a practice that raises concerns about the faithfulness of these explanations, especially in low-resource languages. This study evaluates the faithfulness of LLM-generated explanations in the context of emotion classification in Persian, a low-resource language, by comparing the influential words identified by the model against those identified by human annotators. We assess faithfulness using confidence scores derived from token-level log-probabilities. Two prompting strategies, differing in the order of explanation and prediction (Predict-then-Explain and Explain-then-Predict), are tested for their impact on explanation faithfulness. Our results reveal that while LLMs achieve strong classification performance, their generated explanations often diverge from faithful reasoning, showing greater agreement with each other than with human judgments. These results highlight the limitations of current explanation methods and metrics, emphasizing the need for more robust approaches to ensure LLM reliability in multilingual and low-resource contexts.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi et al.
ๆฅ็ๆ่ฆ
Progress in computer use agents (CUAs) has been constrained by the absence of large and high-quality datasets that capture how humans interact with a computer. While LLMs have thrived on abundant textual data, no comparable corpus exists for CUA trajectories. To address these gaps, we introduce FaraGen, a novel synthetic data generation system for multi-step web tasks. FaraGen can propose diverse tasks from frequently used websites, generate multiple solution attempts, and filter successful trajectories using multiple verifiers. It achieves high throughput, yield, and diversity for multi-step web tasks, producing verified trajectories at approximately $1 each. We use this data to train Fara-7B, a native CUA model that perceives the computer using only screenshots, executes actions via predicted coordinates, and is small enough to run on-device. We find that Fara-7B outperforms other CUA models of comparable size on benchmarks like WebVoyager, Online-Mind2Web, and WebTailBench -- our novel benchmark that better captures under-represented web tasks in pre-existing benchmarks. Furthermore, Fara-7B is competitive with much larger frontier models, illustrating key benefits of scalable data generation systems in advancing small efficient agentic models. We are making Fara-7B open-weight on Microsoft Foundry and HuggingFace, and we are releasing WebTailBench.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Manil Shrestha, Edward Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-hop question answering over knowledge graphs remains computationally challenging due to the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive Large Language Model (LLM) inference for both entity linking and path ranking, limiting their practical deployment. Additionally, LLM-generated answers often lack verifiable grounding in structured knowledge. We present two complementary hybrid algorithms that address both efficiency and verifiability: (1) LLM-Guided Planning that uses a single LLM call to predict relation sequences executed via breadth-first search, achieving near-perfect accuracy (micro-F1 > 0.90) while ensuring all answers are grounded in the knowledge graph, and (2) Embedding-Guided Neural Search that eliminates LLM calls entirely by fusing text and graph embeddings through a lightweight 6.7M-parameter edge scorer, achieving over 100 times speedup with competitive accuracy. Through knowledge distillation, we compress planning capability into a 4B-parameter model that matches large-model performance at zero API cost. Evaluation on MetaQA demonstrates that grounded reasoning consistently outperforms ungrounded generation, with structured planning proving more transferable than direct answer generation. Our results show that verifiable multi-hop reasoning does not require massive models at inference time, but rather the right architectural inductive biases combining symbolic structure with learned representations.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikita Neveditsin, Pawan Lingras, Vijay Mago
Natural Language Processing
ๆฅ็ๆ่ฆ
Clustering short text embeddings is a foundational task in natural language processing, yet remains challenging due to the need to specify the number of clusters in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to efficiently scale to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and Cohesion Ratio for unsupervised organization and evaluation of short text data. Implementation of our estimator of k and Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Remi Petitpierre
Computer Vision
ๆฅ็ๆ่ฆ
This thesis presents methods and datasets to investigate cartographic heritage on a large scale and from a cultural perspective. Heritage institutions worldwide have digitized more than one million maps, and automated techniques now enable large-scale recognition and extraction of map content. Yet these methods have engaged little with the history of cartography, or the view that maps are semantic-symbolic systems, and cultural objects reflecting political and epistemic expectations. This work leverages a diverse corpus of 771,561 map records and 99,715 digitized images aggregated from 38 digital catalogs. After normalization, the dataset includes 236,925 contributors and spans six centuries, from 1492 to 1948. These data make it possible to chart geographic structures and the global chronology of map publication. The spatial focus of cartography is analyzed in relation to political dynamics, evidencing links between Atlantic maritime charting, the triangular trade, and colonial expansion. Further results document the progression of national, domestic focus and the impact of military conflicts on publication volumes. The research introduces semantic segmentation techniques and object detection models for the generic recognition of land classes and cartographic signs, trained on annotated data and synthetic images. The analysis of land classes shows that maps are designed images whose framing and composition emphasize features through centering and semantic symmetries. The study of cartographic figuration encodes 63 M signs and 25 M fragments into a latent visual space, revealing figurative shifts such as the replacement of relief hachures by terrain contours and showing that signs tend to form locally consistent systems. Analyses of collaboration and diffusion highlight the role of legitimacy, larger actors, and major cities in the spread of figurative norms and semiotic cultures.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Linxiao Cao, Ruitao Wang, Jindong Li, Zhipeng Zhou, Menglin Yang
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-augmented generation (RAG) enables large language models (LLMs) to access external knowledge, helping mitigate hallucinations and enhance domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization that enables information propagation across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, limiting their ability to represent abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes within a shared Poincare manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks demonstrate that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yang Wu, Qian Li, Yuling Xiong, Hongbo Tang, Xun Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Harnessing the reasoning power of Large Language Models (LLMs) for recommender systems is hindered by two fundamental challenges. First, current approaches lack a mechanism for automated, data-driven discovery of effective reasoning patterns, relying instead on brittle manual templates or unstable zero-shot prompting. Second, they employ structure-collapsing integration: direct prompting incurs prohibitive online inference costs, while feature extraction collapses reasoning chains into single vectors, discarding stepwise logic. To address these challenges, we propose SCoTER (Structured Chain-of-Thought Transfer for Enhanced Recommendation), a unified framework that treats pattern discovery and structure-aware transfer as a jointly optimized problem. Specifically, SCoTER operationalizes this through two synergistic components: a GVM pipeline for automated pattern discovery and a structure-preserving integration architecture that transfers stepwise logic to efficient models. Formally, we provide information-theoretic justification proving that structure-preserving transfer achieves tighter performance bounds than structure-agnostic alternatives. Empirically, experiments on four benchmarks demonstrate improvements of 3.75\%-11.59\% over a strong TIGER backbone. Moreover, in production deployment on the Tencent Advertising Platform, SCoTER achieved a 2.14\% lift in Gross Merchandise Value (GMV) while eliminating online LLM inference costs. Overall, SCoTER establishes a principled and production-validated blueprint for transferring structured LLM reasoning to large-scale recommender systems.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Noah Frahm, Prakrut Patel, Yue Zhang, Shoubin Yu, Mohit Bansal et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations, unstable back-and-forth movements caused by overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics respectively over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both OpenEQA and EXPRESS-Bench datasets.
๐
2025-11-24
๐ This is the author's accepted version of an articl
๐ฅ Sakib Ahmed, Oscar Pizarro
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Unmanned Aerial Vehicles (UAVs) are crucial in Search and Rescue (SAR) missions due to their ability to monitor vast maritime areas. However, small objects often remain difficult to detect from high altitudes due to low object-to-background pixel ratios. We propose an altitude-aware dynamic tiling method that scales and adaptively subdivides the image into tiles for enhanced small object detection. By integrating altitude-dependent scaling with an adaptive tiling factor, we reduce unnecessary computation while maintaining detection performance. Tested on the SeaDronesSee dataset [1] with YOLOv5 [2] and Slicing Aided Hyper Inference (SAHI) framework [3], our approach improves Mean Average Precision (mAP) for small objects by 38% compared to a baseline and achieves more than double the inference speed compared to static tiling. This approach enables more efficient and accurate UAV-based SAR operations under diverse conditions.
๐
2025-11-24
๐ 9 pages, 6 figures, to be published in IEEE Roboti
๐ฅ Lukas Molnar, Jin Cheng, Gabriele Fadini, Dongho Kang, Fatemeh Zargarbashi et al.
Robotics
ๆฅ็ๆ่ฆ
Loco-manipulation demands coordinated whole-body motion to manipulate objects effectively while maintaining locomotion stability, presenting significant challenges for both planning and control. In this work, we propose a whole-body model predictive control (MPC) framework that directly optimizes joint torques through full-order inverse dynamics, enabling unified motion and force planning and execution within a single predictive layer. This approach allows emergent, physically consistent whole-body behaviors that account for the system's dynamics and physical constraints. We implement our MPC formulation using open software frameworks (Pinocchio and CasADi), along with the state-of-the-art interior-point solver Fatrop. In real-world experiments on a Unitree B2 quadruped equipped with a Unitree Z1 manipulator arm, our MPC formulation achieves real-time performance at 80 Hz. We demonstrate loco-manipulation tasks that demand fine control over the end-effector's position and force to perform real-world interactions like pulling heavy loads, pushing boxes, and wiping whiteboards.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Thomas Marshall Vielmetti, Devansh R Agrawal, Dimitra Panagou
Robotics
ๆฅ็ๆ่ฆ
We present Multi-Agent gatekeeper, a framework that provides provable safety guarantees for leader-follower formation control in cluttered 3D environments. Existing methods face a trad-off: online planners and controllers lack formal safety guarantees, while offline planners lack adaptability to changes in the number of agents or desired formation. To address this gap, we propose a hybrid architecture where a single leader tracks a pre-computed, safe trajectory, which serves as a shared trajectory backup set for all follower agents. Followers execute a nominal formation-keeping tracking controller, and are guaranteed to remain safe by always possessing a known-safe backup maneuver along the leader's path. We formally prove this method ensures collision avoidance with both static obstacles and other agents. The primary contributions are: (1) the multi-agent gatekeeper algorithm, which extends our single-agent gatekeeper framework to multi-agent systems; (2) the trajectory backup set for provably safe inter-agent coordination for leader-follower formation control; and (3) the first application of the gatekeeper framework in a 3D environment. We demonstrate our approach in a simulated 3D urban environment, where it achieved a 100% collision-avoidance success rate across 100 randomized trials, significantly outperforming baseline CBF and NMPC methods. Finally, we demonstrate the physical feasibility of the resulting trajectories on a team of quadcopters.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Vivek Chavan, Yasmina Imgrund, Tung Dao, Sanwantri Bai, Bosong Wang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: https://huggingface.co/datasets/FraunhoferIPK/IndEgo
๐
2025-11-24
๐ ArXiv Preprint (math.OC)
๐ฅ Jiarui Wang, Mahyar Fazlyab
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper presents the Safe Sequential Quadratically Constrained Quadratic Programming (SS-QCQP) algorithm, a first-order method for smooth inequality-constrained nonconvex optimization that guarantees feasibility at every iteration. The method is derived from a continuous-time dynamical system whose vector field is obtained by solving a convex QCQP that enforces monotonic descent of the objective and forward invariance of the feasible set. The resulting continuous-time dynamics achieve an $O(1/t)$ convergence rate to first-order stationary points under standard constraint qualification conditions. We then propose a safeguarded Euler discretization with adaptive step-size selection that preserves this convergence rate while maintaining both descent and feasibility in discrete time. To enhance scalability, we develop an active-set variant (SS-QCQP-AS) that selectively enforces constraints near the boundary, substantially reducing computational cost without compromising theoretical guarantees. Numerical experiments on a multi-agent nonlinear optimal control problem demonstrate that SS-QCQP and SS-QCQP-AS maintain feasibility, exhibit the predicted convergence behavior, and deliver solution quality comparable to second-order solvers such as SQP and IPOPT.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Shantanu Rahman, Nayeb Hasin, Mainul Islam
Robotics
ๆฅ็ๆ่ฆ
Autonomous vehicles are becoming popular day by day not only for autonomous road traversal but also for industrial automation, farming and military. Most of the standard vehicles follow the Ackermann style steering mechanism. This has become to de facto standard for large and long faring vehicles. The local planner of an autonomous vehicle controls the low-level vehicle movement upon which the vehicle will perform its motor actuation. In our work, we focus on autonomous vehicles in road and perform experiments to analyze the effect of low-level controllers in the simulation and a real environment. To increase the precision and stability of trajectory tracking in autonomous cars, a novel method that combines lane identification with Model Predictive Control (MPC) is presented. The research focuses on camera-equipped autonomous vehicles and uses methods like edge recognition, sliding window-based straight-line identification for lane line extraction, and dynamic region of interest (ROI) extraction. Next, to follow the identified lane line, an MPC built on a bicycle vehicle dynamics model is created. A single-lane road simulation model is built using ROS Gazebo and tested in order to verify the controller's performance. The root mean square error between the optimal tracking trajectory and the target trajectory was reduced by 27.65% in the simulation results, demonstrating the high robustness and flexibility of the developed controller.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Mahmud Suhaimi Ibrahim, Shantanu Rahman, Muhammad Samin Hasan, Minhaj Uddin Ahmad, Abdullah Abrar
ๆฅ็ๆ่ฆ
Collision-free path planning is the most crucial component in multi-UAV formation-flying (MFF). We use unlabeled homogenous quadcopters (UAVs) to demonstrate the use of a flow network to create complete (inter-UAV) collision-free paths. This procedure has three main parts: 1) Creating a flow network graph from physical GPS coordinates, 2) Finding a path of minimum cost (least distance) using any graph-based path-finding algorithm, and 3) Implementing the Ford-Fulkerson Method to find the paths with the maximum flow (no collision). Simulations of up to 64 UAVs were conducted for various formations, followed by a practical experiment with 3 quadcopters for testing physical plausibility and feasibility. The results of these tests show the efficacy of this method's ability to produce safe, collision-free paths.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Lishuo Pan, Mattia Catellani, Thales C. Silva, Lorenzo Sabattini, Nora Ayanian
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Control barrier functions (CBFs) are an effective model-based tool to formally certify the safety of a system. With the growing complexity of modern control problems, CBFs have received increasing attention in both optimization-based and learning-based control communities as a safety filter, owing to their provable guarantees. However, success in transferring these guarantees to real-world systems is critically tied to model accuracy. For example, payloads or wind disturbances can significantly influence the dynamics of an aerial vehicle and invalidate the safety guarantee. In this work, we propose an efficient yet flexible online learning-enhanced high-order adaptive control barrier function using Neural ODEs. Our approach improves the safety of a CBF-certified system on the fly, even under complex time-varying model perturbations. In particular, we deploy our hybrid adaptive CBF controller on a 38g nano quadrotor, keeping a safe distance from the obstacle, against 18km/h wind.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Jennifer Grannen, Michelle Pan, Kenneth Llontop, Cherie Ho, Mark Zolotas et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Foundation models (FM) have unlocked powerful zero-shot capabilities in vision and language, yet their reliance on internet pretraining data leaves them brittle in unstructured, real-world settings. The messy, real-world data encountered during deployment (e.g. occluded or multilingual text) remains massively underrepresented in existing corpora. Robots, as embodied agents, are uniquely positioned to close this gap: they can act in physical environments to collect large-scale, real-world data that enriches FM training with precisely the examples current models lack. We introduce the Robot-Powered Data Flywheel, a framework that transforms robots from FM consumers into data generators. By deploying robots equipped with FMs in the wild, we enable a virtuous cycle: robots perform useful tasks while collecting real-world data that improves both domain-specific adaptation and domain-adjacent generalization. We instantiate this framework with Scanford, a mobile manipulator deployed in the East Asia Library for 2 weeks. Scanford autonomously scans shelves, identifies books using a vision-language model (VLM), and leverages the library catalog to label images without human annotation. This deployment both aids librarians and produces a dataset to finetune the underlying VLM, improving performance on the domain-specific in-the-wild library setting and on domain-adjacent multilingual OCR benchmarks. Using data collected from 2103 shelves, Scanford improves VLM performance on book identification from 32.0% to 71.8% and boosts domain-adjacent multilingual OCR from 24.8% to 46.6% (English) and 30.8% to 38.0% (Chinese), while saving an ~18.7 hrs of human time. These results highlight how robot-powered data flywheels can both reduce human effort in real deployments and unlock new pathways for continually adapting FMs to the messiness of reality. More details are at: https://scanford-robot.github.io
๐
2025-11-24
๐ WWW
๐ฅ Nicklas Hansen, Hao Su, Xiaolong Wang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL) we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present \emph{Newt}, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data-efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Omar Faris, Sลawomir Tadeja, Fulvio Forni
Robotics
ๆฅ็ๆ่ฆ
Object handover is a common form of interaction that is widely present in collaborative tasks. However, achieving it efficiently remains a challenge. We address the problem of ensuring resilient robotic actions that can adapt to complex changes in object pose during human-to-robot object handovers. We propose the use of Virtual Model Control to create an interaction layer that controls the robot and adapts to the dynamic changes in the handover process. Additionally, we propose the use of augmented reality to facilitate bidirectional communication between humans and robots during handovers. We assess the performance of our controller in a set of experiments that demonstrate its resilience to various sources of uncertainties, including complex changes to the object's pose during the handover. Finally, we performed a user study with 16 participants to understand human preferences for different robot control profiles and augmented reality visuals in object handovers. Our results showed a general preference for the proposed approach and revealed insights that can guide further development in adapting the interaction with the user.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Samuel Cerezo, Seong Hun Lee, Javier Civera
Machine Learning
ๆฅ็ๆ่ฆ
In this letter, we present a closed-form initialization method that recovers the full visual-inertial state without nonlinear optimization. Unlike previous approaches that rely on iterative solvers, our formulation yields analytical, easy-to-implement, and numerically stable solutions for reliable start-up. Our method builds on small-rotation and constant-velocity approximations, which keep the formulation compact while preserving the essential coupling between motion and inertial measurements. We further propose an observability-driven, two-stage initialization scheme that balances accuracy with initialization latency. Extensive experiments on the EuRoC dataset validate our assumptions: our method achieves 10-20% lower initialization error than optimization-based approaches, while using 4x shorter initialization windows and reducing computational cost by 5x.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Rushuai Yang, Zhiyuan Feng, Tianxiang Zhang, Kaixin Wang, Chuheng Zhang et al.
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pre-training. We propose Discover, Lea rn and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying fixed choice of single horizons being suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $ฯ_0$, $ฯ_{0.5}$, and one-step regression policy $ฯ_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under mixed-task setting, $ฯ_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
๐
2025-11-24
๐ ArXiv Preprint (cs.SE)
๐ฅ Jayanaka L. Dantanarayana, Savini Kashmira, Thakee Nathees, Zichen Zhang, Krisztian Flautner et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
AI-Integrated programming is emerging as a foundational paradigm for building intelligent systems with large language models (LLMs). Recent approaches such as Meaning Typed Programming (MTP) automate prompt generation by leveraging the semantics already present in code. However, many real-world applications depend on contextual cues, developer intent, and domain-specific reasoning that extend beyond what static code semantics alone can express. To address this limitation, we introduce Semantic Engineering, a lightweight method for enriching program semantics so that LLM-based systems can more accurately reflect developer intent without requiring full manual prompt design. We present Semantic Context Annotations (SemTexts), a language-level mechanism that allows developers to embed natural-language context directly into program constructs. Integrated into the Jac programming language, Semantic Engineering extends MTP to incorporate these enriched semantics during prompt generation. We further introduce a benchmark suite designed to reflect realistic AI-Integrated application scenarios. Our evaluation shows that Semantic Engineering substantially improves prompt fidelity, achieving performance comparable to Prompt Engineering while requiring significantly less developer effort.
๐
2025-11-24
๐ ArXiv Preprint (q-bio.QM)
๐ฅ Bruno Jacob, Khushbu Agarwal, Marcel Baer, Peter Rice, Simone Raugei
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We present Genie-CAT, a tool-augmented large-language-model (LLM) system designed to accelerate scientific hypothesis generation in protein design. Using metalloproteins (e.g., ferredoxins) as a case study, Genie-CAT integrates four capabilities -- literature-grounded reasoning through retrieval-augmented generation (RAG), structural parsing of Protein Data Bank files, electrostatic potential calculations, and machine-learning prediction of redox properties -- into a unified agentic workflow. By coupling natural-language reasoning with data-driven and physics-based computation, the system generates mechanistically interpretable, testable hypotheses linking sequence, structure, and function. In proof-of-concept demonstrations, Genie-CAT autonomously identifies residue-level modifications near [Fe--S] clusters that affect redox tuning, reproducing expert-derived hypotheses in a fraction of the time. The framework highlights how AI agents combining language models with domain-specific tools can bridge symbolic reasoning and numerical simulation, transforming LLMs from conversational assistants into partners for computational discovery.
๐
2025-11-24
๐ ArXiv Preprint (cs.SE)
๐ฅ David Jiahao Fu, Aryan Gupta, Aaron Councilman, David Grove, Yu-Xiong Wang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advancements in large language models (LLMs) have shown very impressive capabilities in code generation across many programming languages. However, even state-of-the-art LLMs generate programs that contains syntactic errors and fail to complete the given tasks, especially for low-resource programming languages (LRPLs). In addition, high training cost makes finetuning LLMs unaffordable with constrained computational resources, further undermining the effectiveness of LLMs for code generation. In this work, we propose SLMFix, a novel code generation pipeline that leverages a small language model (SLM) finetuned using reinforcement learning (RL) techniques to fix syntactic errors in LLM-generated programs to improve the quality of LLM-generated programs for domain-specific languages (DSLs). In specific, we applied RL on the SLM for the program repair task using a reward calculated using both a static validator and a static semantic similarity metric. Our experimental results demonstrate the effectiveness and generalizability of our approach across multiple DSLs, achieving more than 95% pass rate on the static validator. Notably, SLMFix brings substantial improvement to the base model and outperforms supervised finetuning approach even for 7B models on a LRPL, showing the potential of our approach as an alternative to traditional finetuning approaches.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e. equipping text-only DeepSeek-R1 with Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang
Multimodal
ๆฅ็ๆ่ฆ
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gongfan Fang, Xinyin Ma, Xinchao Wang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
๐
2025-11-24
๐ ArXiv Preprint (cs.SD)
๐ฅ Jorge Ortigoso-Narro, Jose A. Belloch, Adrian Amor-Martin, Sandra Roger, Maximo Cobos
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array's focus, synchronizing the acoustic response with the target's position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rudy Morel, Francesco Pio Ramunno, Jeff Shen, Alberto Bietti, Kyunghyun Cho et al.
ๆฅ็ๆ่ฆ
Conditional diffusion models provide a natural framework for probabilistic prediction of dynamical systems and have been successfully applied to fluid dynamics and weather prediction. However, in many settings, the available information at a given time represents only a small fraction of what is needed to predict future states, either due to measurement uncertainty or because only a small fraction of the state can be observed. This is true for example in solar physics, where we can observe the Sun's surface and atmosphere, but its evolution is driven by internal processes for which we lack direct measurements. In this paper, we tackle the probabilistic prediction of partially observable, long-memory dynamical systems, with applications to solar dynamics and the evolution of active regions. We show that standard inference schemes, such as autoregressive rollouts, fail to capture long-range dependencies in the data, largely because they do not integrate past information effectively. To overcome this, we propose a multiscale inference scheme for diffusion models, tailored to physical processes. Our method generates trajectories that are temporally fine-grained near the present and coarser as we move farther away, which enables capturing long-range temporal dependencies without increasing computational cost. When integrated into a diffusion model, we show that our inference scheme significantly reduces the bias of the predicted distributions and improves rollout stability.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saniah Kayenat Chowdhury, Rusab Sarmun, Muhammad E. H. Chowdhury, Sohaib Bassam Zoghoul, Israa Al-Hashimi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that are central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor's size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e. measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures by a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable "black box" manner, our method offers both state-of-the-art performance and transparent decision support.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Franklin Cardenoso, Wouter Caarls
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
The challenge of designing effective reward functions in reinforcement learning (RL) represents a significant bottleneck, often requiring extensive human expertise and being time-consuming. Previous work and recent advancements in large language models (LLMs) have demonstrated their potential for automating the generation of reward functions. However, existing methodologies often require preliminary evaluation metrics, human-engineered feedback for the refinement process, or the use of environmental source code as context. To address these limitations, this paper introduces LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization). This LLM-based, fully autonomous, and model-agnostic framework eliminates the need for preliminary metrics and environmental source code as context to generate, execute, and evaluate reward function candidates from textual descriptions of systems and task objectives. LEARN-Opt's main contribution lies in its ability to autonomously derive performance metrics directly from the system description and the task objective, enabling unsupervised evaluation and selection of reward functions. Our experiments indicate that LEARN-Opt achieves performance comparable to or better to that of state-of-the-art methods, such as EUREKA, while requiring less prior knowledge. We find that automated reward design is a high-variance problem, where the average-case candidate fails, requiring a multi-run approach to find the best candidates. Finally, we show that LEARN-Opt can unlock the potential of low-cost LLMs to find high-performing candidates that are comparable to, or even better than, those of larger models. This demonstrated performance affirms its potential to generate high-quality reward functions without requiring any preliminary human-defined metrics, thereby reducing engineering overhead and enhancing generalizability.
๐
2025-11-24
๐ ArXiv Preprint (cs.SD)
๐ฅ Maral Ebrahimzadeh, Gilberto Bernardes, Sebastian Stober
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
State-of-the-art symbolic music generation models have recently achieved remarkable output quality, yet explicit control over compositional features, such as tonal tension, remains challenging. We propose a novel approach that integrates a computational tonal tension model, based on tonal interval vector analysis, into a Transformer framework. Our method employs a two-level beam search strategy during inference. At the token level, generated candidates are re-ranked using model probability and diversity metrics to maintain overall quality. At the bar level, a tension-based re-ranking is applied to ensure that the generated music aligns with a desired tension curve. Objective evaluations indicate that our approach effectively modulates tonal tension, and subjective listening tests confirm that the system produces outputs that align with the target tension. These results demonstrate that explicit tension conditioning through a dual-level beam search provides a powerful and intuitive tool to guide AI-generated music. Furthermore, our experiments demonstrate that our method can generate multiple distinct musical interpretations under the same tension condition.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Olivia Macmillan-Scott, Roksana Goworek, Eda B. รzyiฤit
Natural Language Processing
ๆฅ็ๆ่ฆ
Query expansion is the reformulation of a user query by adding semantically related information, and is an essential component of monolingual and cross-lingual information retrieval used to ensure that relevant documents are not missed. Recently, multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. Pseudo-documents both introduce additional relevant terms and bridge the gap between short queries and long documents, which is particularly beneficial in dense retrieval. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance. Results show that query length largely determines which prompting technique is effective, and that more elaborate prompts often do not yield further gains. Substantial linguistic disparities persist: cross-lingual query expansion can produce the largest improvements for languages with the weakest baselines, yet retrieval is especially poor between languages written in different scripts. Fine-tuning is found to lead to performance gains only when the training and test data are of similar format. These outcomes underline the need for more balanced multilingual and cross-lingual training and evaluation resources.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Roksana Goworek, Olivia Macmillan-Scott, Eda B. รzyiฤit
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Cross-lingual information retrieval (CLIR) enables access to multilingual knowledge but remains challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models. Existing pipelines often rely on translation and monolingual retrieval heuristics, which add computational overhead and noise, degrading performance. This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets. We find that dense retrieval models trained specifically for CLIR consistently outperform lexical matching methods and derive little benefit from document translation. Contrastive learning mitigates language biases and yields substantial improvements for encoders with weak initial alignment, and re-ranking can be effective, but depends on the quality of the cross-encoder training data. Although high-resource languages still dominate overall performance, gains over lexical and document-translated baselines are most pronounced for low-resource and cross-script pairs. These findings indicate that cross-lingual search systems should prioritise semantic multilingual embeddings and targeted learning-based alignment over translation-based pipelines, particularly for cross-script and under-resourced languages.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xincheng Wang, Hanchi Sun, Wenjun Sun, Kejun Xue, Wangqiu Zhou et al.
Computer Vision
ๆฅ็ๆ่ฆ
Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiayi Zhang, Yiran Peng, Fanqi Kong, Yang Cheng, Yifan Wu et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decrease as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is avaiable at https://github.com/FoundationAgents/AutoEnv.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ James R. M. Black, Moritz S. Hanke, Aaron Maiwald, Tina Hernandez-Boussard, Oliver M. Crook et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Novel deep learning architectures are increasingly being applied to biological data, including genetic sequences. These models, referred to as genomic language mod- els (gLMs), have demonstrated impressive predictive and generative capabilities, raising concerns that such models may also enable misuse, for instance via the generation of genomes for human-infecting viruses. These concerns have catalyzed calls for risk mitigation measures. The de facto mitigation of choice is filtering of pretraining data (i.e., removing viral genomic sequences from training datasets) in order to limit gLM performance on virus-related tasks. However, it is not currently known how robust this approach is for securing open-source models that can be fine-tuned using sensitive pathogen data. Here, we evaluate a state-of-the-art gLM, Evo 2, and perform fine-tuning using sequences from 110 harmful human-infecting viruses to assess the rescue of misuse-relevant predictive capabilities. The fine- tuned model exhibited reduced perplexity on unseen viral sequences relative to 1) the pretrained model and 2) a version fine-tuned on bacteriophage sequences. The model fine-tuned on human-infecting viruses also identified immune escape variants from SARS-CoV-2 (achieving an AUROC of 0.6), despite having no expo- sure to SARS-CoV-2 sequences during fine-tuning. This work demonstrates that data exclusion might be circumvented by fine-tuning approaches that can, to some degree, rescue misuse-relevant capabilities of gLMs. We highlight the need for safety frameworks for gLMs and outline further work needed on evaluations and mitigation measures to enable the safe deployment of gLMs.
๐
2025-11-24
๐ ArXiv Preprint (cs.CY)
๐ฅ Ndaka. A, Avila-Acosta. F, Mbula-Ndaka. H, Amera. C, Chauke. S et al.
ๆฅ็ๆ่ฆ
This chapter seeks to frame the elemental and invisible problems of AI and big data in the African context by examining digital sites and infrastructure through the lens of power and interests. It will present reflections on how these sites are using AI recommendation algorithms to recreate new digital societies in the region, how they have the potential to propagate algorithmic colonialism and negative gender norms, and what this means for the regional sustainable development agenda. The chapter proposes adopting business models that embrace response-ability and consider the existence of alternative socio-material worlds of AI. These reflections will mainly come from ongoing discussions with Kenyan social media users in this authors' user space talks, personal experiences and six months of active participant observations done by the authors.
๐
2025-11-24
๐ Accepted by IEEE Big Data 2025
๐ฅ Ellie L. Zhang, Duoduo Liao, Callie C. Liao
Robotics
ๆฅ็ๆ่ฆ
Generation of dynamic, scalable multi-species bird soundscapes remains a significant challenge in computer music and algorithmic sound design. Birdsongs involve rapid frequency-modulated chirps, complex amplitude envelopes, distinctive acoustic patterns, overlapping calls, and dynamic inter-bird interactions, all of which require precise temporal and spatial control in 3D environments. Existing approaches, whether Digital Signal Processing (DSP)-based or data-driven, typically focus only on single species modeling, static call structures, or synthesis directly from recordings, and often suffer from noise, limited flexibility, or large data needs. To address these challenges, we present a novel, fully algorithm-driven framework that generates dynamic multi-species bird soundscapes using DSP-based chirp generation and 3D spatialization, without relying on recordings or training data. Our approach simulates multiple independently-moving birds per species along different moving 3D trajectories, supporting controllable chirp sequences, overlapping choruses, and realistic 3D motion in scalable soundscapes while preserving species-specific acoustic patterns. A visualization interface provides bird trajectories, spectrograms, activity timelines, and sound waves for analytical and creative purposes. Both visual and audio evaluations demonstrate the ability of the system to generate dense, immersive, and ecologically inspired soundscapes, highlighting its potential for computer music, interactive virtual environments, and computational bioacoustics research.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Amirtha Varshini A S, Duminda S. Ranasinghe, Hok Hei Tam
Robotics
ๆฅ็ๆ่ฆ
Generative Flow Networks, or GFlowNets, offer a promising framework for molecular design, but their internal decision policies remain opaque. This limits adoption in drug discovery, where chemists require clear and interpretable rationales for proposed structures. We present an interpretability framework for SynFlowNet, a GFlowNet trained on documented chemical reactions and purchasable starting materials that generates both molecules and the synthetic routes that produce them. Our approach integrates three complementary components. Gradient based saliency combined with counterfactual perturbations identifies which atomic environments influence reward and how structural edits change molecular outcomes. Sparse autoencoders reveal axis aligned latent factors that correspond to physicochemical properties such as polarity, lipophilicity, and molecular size. Motif probes show that functional groups including aromatic rings and halogens are explicitly encoded and linearly decodable from the internal embeddings. Together, these results expose the chemical logic inside SynFlowNet and provide actionable and mechanistic insight that supports transparent and controllable molecular design.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Lucas Li, Jean-Baptiste Puel, Florence Carton, Dounya Barrit, Jhony H. Giraldo
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Perovskite solar cells are promising candidates for next-generation photovoltaics. However, their performance as multi-scale devices is determined by complex interactions between their constituent layers. This creates a vast combinatorial space of possible materials and device architectures, making the conventional experimental-based screening process slow and expensive. Machine learning models try to address this problem, but they only focus on individual material properties or neglect the important geometric information of the perovskite crystal. To address this problem, we propose to predict perovskite solar cell power conversion efficiency with a geometric-aware co-attention (Solar-GECO) model. Solar-GECO combines a geometric graph neural network (GNN) - that directly encodes the atomic structure of the perovskite absorber - with language model embeddings that process the textual strings representing the chemical compounds of the transport layers and other device components. Solar-GECO also integrates a co-attention module to capture intra-layer dependencies and inter-layer interactions, while a probabilistic regression head predicts both power conversion efficiency (PCE) and its associated uncertainty. Solar-GECO achieves state-of-the-art performance, significantly outperforming several baselines, reducing the mean absolute error (MAE) for PCE prediction from 3.066 to 2.936 compared to semantic GNN (the previous state-of-the-art model). Solar-GECO demonstrates that integrating geometric and textual information provides a more powerful and accurate framework for PCE prediction.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Przemyslaw Chojecki
ๆฅ็ๆ่ฆ
We develop a moduli-theoretic view of psychometric test batteries for AI agents and connect it explicitly to the AAI score developed previously. First, we make precise the notion of an AAI functional on a battery and set out axioms that any reasonable autonomy/general intelligence score should satisfy. Second, we show that the composite index ('AAI-Index') defined previously is a special case of our AAI functional. Third, we introduce the notion of a cognitive core of an agent relative to a battery and define the associated AAI$_{\textrm{core}}$ score as the restriction of an AAI functional to that core. Finally, we use these notions to describe invariants of batteries under evaluation-preserving symmetries and outline how moduli of equivalent batteries are organized.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kyle Verrier, Achille Nazaret, Joseph Futoma, Andrew C. Miller, Guillermo Sapiro
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Hunger and satiety dynamics shape dietary behaviors and metabolic health, yet remain difficult to capture in everyday settings. We present a Nutrition Photoplethysmography Language Model (NPLM), integrating continuous photoplethysmography (PPG) from wearables with meal descriptions. NPLM projects PPG into embeddings interpretable by language models, enabling joint reasoning over physiology and meal context. Trained on 19,340 participants and 1.1 million meal-PPG pairs, the model improved daily caloric intake prediction by 11% over text-only baselines, with accuracy maintained when 80% of meal text was removed. In an independent validation study (n=140) with controlled dining and detailed meal information, the model replicated these findings. These results demonstrate the value of integrating physiological measurements from consumer wearables with meal information for noninvasive dietary monitoring at scale.
๐
2025-11-24
๐ KDD 2026
๐ฅ Yingjia Shang, Yi Liu, Huimin Wang, Furong Li, Wenfang Sun et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
With the rapid advancement of retrieval-augmented vision-language models, multimodal medical retrieval-augmented generation (MMed-RAG) systems are increasingly adopted in clinical decision support. These systems enhance medical applications by performing cross-modal retrieval to integrate relevant visual and textual evidence for tasks, e.g., report generation and disease diagnosis. However, their complex architecture also introduces underexplored adversarial vulnerabilities, particularly via visual input perturbations. In this paper, we propose Medusa, a novel framework for crafting cross-modal transferable adversarial attacks on MMed-RAG systems under a black-box setting. Specifically, Medusa formulates the attack as a perturbation optimization problem, leveraging a multi-positive InfoNCE loss (MPIL) to align adversarial visual embeddings with medically plausible but malicious textual targets, thereby hijacking the retrieval process. To enhance transferability, we adopt a surrogate model ensemble and design a dual-loop optimization strategy augmented with invariant risk minimization (IRM). Extensive experiments on two real-world medical tasks, including medical report generation and disease diagnosis, demonstrate that Medusa achieves over 90% average attack success rate across various generation models and retrievers under appropriate parameter configuration, while remaining robust against four mainstream defenses, outperforming state-of-the-art baselines. Our results reveal critical vulnerabilities in the MMed-RAG systems and highlight the necessity of robustness benchmarking in safety-critical medical applications. The code and data are available at https://anonymous.4open.science/r/MMed-RAG-Attack-F05A.
๐
2025-11-24
๐ AAAI 2026
๐ฅ Hang Ding, Xue Wang, Tian Zhou, Tao Yao
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion models have recently shown promise in time series forecasting, particularly for probabilistic predictions. However, they often fail to achieve state-of-the-art point estimation performance compared to regression-based methods. This limitation stems from difficulties in providing sufficient contextual bias to track distribution shifts and in balancing output diversity with the stability and precision required for point forecasts. Existing diffusion-based approaches mainly focus on full-distribution modeling under probabilistic frameworks, often with likelihood maximization objectives, while paying little attention to dedicated strategies for high-accuracy point estimation. Moreover, other existing point prediction diffusion methods frequently rely on pre-trained or jointly trained mature models for contextual bias, sacrificing the generative flexibility of diffusion models.
To address these challenges, we propose SimDiff, a single-stage, end-to-end framework. SimDiff employs a single unified Transformer network carefully tailored to serve as both denoiser and predictor, eliminating the need for external pre-trained or jointly trained regressors. It achieves state-of-the-art point estimation performance by leveraging intrinsic output diversity and improving mean squared error accuracy through multiple inference ensembling. Key innovations, including normalization independence and the median-of-means estimator, further enhance adaptability and stability. Extensive experiments demonstrate that SimDiff significantly outperforms existing methods in time series point forecasting.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Rissal Hedna, Sesugh Samuel Nder
Computer Vision
ๆฅ็ๆ่ฆ
Computer vision systems are increasingly adopted in modern logistics operations, including the estimation of trailer occupancy for planning, routing, and billing. Although effective, such systems may be vulnerable to physical adversarial attacks, particularly adversarial patches that can be printed and placed on interior surfaces. In this work, we study the feasibility of such attacks on a convolutional cargo-occupancy classifier using fully simulated 3D environments. Using Mitsuba 3 for differentiable rendering, we optimize patch textures across variations in geometry, lighting, and viewpoint, and compare their effectiveness to a 2D compositing baseline. Our experiments demonstrate that 3D-optimized patches achieve high attack success rates, especially in a denial-of-service scenario (empty to full), where success reaches 84.94 percent. Concealment attacks (full to empty) prove more challenging but still reach 30.32 percent. We analyze the factors influencing attack success, discuss implications for the security of automated logistics pipelines, and highlight directions for strengthening physical robustness. To our knowledge, this is the first study to investigate adversarial patch attacks for cargo-occupancy estimation in physically realistic, fully simulated 3D scenes.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Boyuan Wu
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk-adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.
๐
2025-11-24
๐ ArXiv Preprint (quant-ph)
๐ฅ Hibah Agha, Samuel Yen-Chi Chen, Huan-Hsin Tseng, Shinjae Yoo
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, machine learning and deep learning have driven advances in domains such as image classification, speech recognition, and anomaly detection by leveraging multi-layer neural networks to model complex data. Simultaneously, quantum computing (QC) promises to address classically intractable problems via quantum parallelism, motivating research in quantum machine learning (QML). Among QML techniques, quantum autoencoders show promise for compressing high-dimensional quantum and classical data. However, designing effective quantum circuit architectures for quantum autoencoders remains challenging due to the complexity of selecting gates, arranging circuit layers, and tuning parameters.
This paper proposes a neural architecture search (NAS) framework that automates the design of quantum autoencoders using a genetic algorithm (GA). By systematically evolving variational quantum circuit (VQC) configurations, our method seeks to identify high-performing hybrid quantum-classical autoencoders for data reconstruction without becoming trapped in local minima. We demonstrate effectiveness on image datasets, highlighting the potential of quantum autoencoders for efficient feature extraction within a noise-prone, near-term quantum era. Our approach lays a foundation for broader application of genetic algorithms to quantum architecture search, aiming for a robust, automated method that can adapt to varied data and hardware constraints.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ David Stenger, Armin Lindicke, Alexander von Rohr, Sebastian Trimpe
Machine Learning
ๆฅ็ๆ่ฆ
Searching large and complex design spaces for a global optimum can be infeasible and unnecessary. A practical alternative is to iteratively refine the neighborhood of an initial design using local optimization methods such as gradient descent. We propose local entropy search (LES), a Bayesian optimization paradigm that explicitly targets the solutions reachable by the descent sequences of iterative optimizers. The algorithm propagates the posterior belief over the objective through the optimizer, resulting in a probability distribution over descent sequences. It then selects the next evaluation by maximizing mutual information with that distribution, using a combination of analytic entropy calculations and Monte-Carlo sampling of descent sequences. Empirical results on high-complexity synthetic objectives and benchmark problems show that LES achieves strong sample efficiency compared to existing local and global Bayesian optimization methods.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Yuxuan Wang, Haobin Jiang, Shiqing Yao, Ziluo Ding, Zongqing Lu
Robotics
ๆฅ็ๆ่ฆ
Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole body controller, combined with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. The model generates action chunks using flow matching, which can be subsequently refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into texts.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Christos-Nikolaos Zacharopoulos, Revekka Kyriakoglou
Natural Language Processing
ๆฅ็ๆ่ฆ
How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated the causal language model (phi-2) using a carefully curated corpus, with sentences that concluded plausibly or implausibly. Our analysis focused on the hidden states sampled at each model layer. To investigate how violations are encoded, we utilized two complementary probes. First, we conducted a per-layer detection using a linear probe. Our findings revealed that a simple linear decoder struggled to distinguish between plausible and implausible endings in the lowest third of the model's layers. However, its accuracy sharply increased in the middle blocks, reaching a peak just before the top layers. Second, we examined the effective dimensionality of the encoded violation. Initially, the violation widens the representational subspace, followed by a collapse after a mid-stack bottleneck. This might indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results contemplate the idea of alignment with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, occurring later in the online processing sequence.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Selena Song, Ziming Xu, Zijun Zhang, Kun Zhou, Jiaxian Guo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Transformer(DiT) based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen, and only optimize the memory encoder. It yields a rather efficient training process on few training parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Federico Felizzi, Olivia Riccomi, Michele Ferramola, Francesco Andrea Causio, Manuel Del Medico et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
๐
2025-11-24
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts. To mitigate these challenges, we propose ACE-Safety (Adversarial Co-Evolution for LLM Safety), a novel framework that jointly optimize attack and defense models by seamlessly integrating two key innovative procedures: (1) Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS), which efficiently explores jailbreak strategies to uncover vulnerabilities and generate diverse adversarial samples; (2) Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO), which jointly trains attack and defense LLMs with challenging samples via curriculum reinforcement learning, enabling robust mutual improvement. Evaluations across multiple benchmarks demonstrate that our method outperforms existing attack and defense approaches, and provides a feasible pathway for developing LLMs that can sustainably support responsible AI ecosystems.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Teodora Popordanoska, Jiameng Li, Matthew B. Blaschko
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
๐
2025-11-24
๐ ArXiv Preprint (q-bio.BM)
๐ฅ Lakshaditya Singh, Adwait Shelke, Divyansh Agrawal
Machine Learning
ๆฅ็ๆ่ฆ
Designing new protein structures is fundamental to computational biology, enabling advances in therapeutic molecule discovery and enzyme engineering. Existing diffusion-based generative models typically operate in Cartesian coordinate space, where adding noise disrupts strict geometric constraints such as fixed bond lengths and angles, often producing physically invalid structures. To address this limitation, we propose a Torsion-Space Diffusion Model that generates protein backbones by denoising torsion angles, ensuring perfect local geometry by construction. A differentiable forward-kinematics module reconstructs 3D coordinates with fixed 3.8 Angstrom backbone bond lengths while a constrained post-processing refinement optimizes global compactness via Radius of Gyration (Rg) correction, without violating bond constraints. Experiments on standard PDB proteins demonstrate 100% bond-length accuracy and significantly improved structural compactness, reducing Rg error from 70% to 18.6% compared to Cartesian diffusion baselines. Overall, this hybrid torsion-diffusion plus geometric-refinement framework generates physically valid and compact protein backbones, providing a promising path toward full-atom protein generation.
๐
2025-11-24
๐ ArXiv Preprint (cs.NI)
๐ฅ Hatim Chergui, Farhad Rezazadeh, Mehdi Bennis, Merouane Debbah
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
A critical barrier to the trustworthiness of sixth-generation (6G) agentic autonomous networks is the uncertainty neglect bias; a cognitive tendency for large language model (LLM)-powered agents to make high-stakes decisions based on simple averages while ignoring the tail risk of extreme events. This paper proposes an unbiased, risk-aware framework for agentic negotiation, designed to ensure robust resource allocation in 6G network slicing. Specifically, agents leverage Digital Twins (DTs) to predict full latency distributions, which are then evaluated using a formal framework from extreme value theory, namely, Conditional Value-at-Risk (CVaR). This approach fundamentally shifts the agent's objective from reasoning over the mean to reasoning over the tail, thereby building a statistically-grounded buffer against worst-case outcomes. Furthermore, our framework ensures full uncertainty awareness by requiring agents to quantify epistemic uncertainty -- confidence in their own DTs predictions -- and propagate this meta-verification to make robust decisions, preventing them from acting on unreliable data. We validate this framework in a 6G inter-slice negotiation use-case between an eMBB and a URLLC agent. The results demonstrate the profound failure of the biased, mean-based baseline, which consistently fails its SLAs with a 25\% rate. Our unbiased, CVaR-aware agent successfully mitigates this bias, eliminating SLA violations and reducing the URLLC and eMBB p99.999 latencies by around 11\%. We show this reliability comes at the rational and quantifiable cost of slightly reduced energy savings to 17\%, exposing the false economy of the biased approach. This work provides a concrete methodology for building the trustworthy autonomous systems required for 6G.
๐
2025-11-24
๐ ArXiv Preprint (cs.IT)
๐ฅ Jianfeng Xu, Zeyan Li
Machine Learning
ๆฅ็ๆ่ฆ
The rapid scaling of artificial intelligence models has revealed a fundamental tension between model capacity (storage) and inference efficiency (computation). While classical information theory focuses on transmission and storage limits, it lacks a unified physical framework to quantify the thermodynamic costs of generating information from compressed laws versus retrieving it from memory. In this paper, we propose a theoretical framework that treats information processing as an enabling mapping from ontological states to carrier states. We introduce a novel metric, Derivation Entropy, which quantifies the effective work required to compute a target state from a given logical depth. By analyzing the interplay between Shannon entropy (storage) and computational complexity (time/energy), we demonstrate the existence of a critical phase transition point. Below this threshold, memory retrieval is thermodynamically favorable; above it, generative computation becomes the optimal strategy. This "Energy-Time-Space" conservation law provides a physical explanation for the efficiency of generative models and offers a rigorous mathematical bound for designing next-generation, energy-efficient AI architectures. Our findings suggest that the minimization of Derivation Entropy is a governing principle for the evolution of both biological and artificial intelligence.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xihe Qiu, Gengchen Ma, Haoyu Wang, Chen Zhan, Xiaoyu Tan et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Sleep stage classification based on electroencephalography (EEG) is fundamental for assessing sleep quality and diagnosing sleep-related disorders. However, most traditional machine learning methods rely heavily on prior knowledge and handcrafted features, while existing deep learning models still struggle to jointly capture fine-grained time-frequency patterns and achieve clinical interpretability. Recently, vision-language models (VLMs) have made significant progress in the medical domain, yet their performance remains constrained when applied to physiological waveform data, especially EEG signals, due to their limited visual understanding and insufficient reasoning capability. To address these challenges, we propose EEG-VLM, a hierarchical vision-language framework that integrates multi-level feature alignment with visually enhanced language-guided reasoning for interpretable EEG-based sleep stage classification. Specifically, a specialized visual enhancement module constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images. These tokens are further aligned with low-level CLIP features through a multi-level alignment mechanism, enhancing the VLM's image-processing capability. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes complex medical inference into interpretable logical steps, effectively simulating expert-like decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification, showing promising potential for automated and explainable EEG analysis in clinical settings.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Moazzam Umer Gondal, Hamad Ul Qudous, Daniya Siddiqui, Asma Ahmad Farhan
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
This paper introduces the retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that have problems with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline model for comparison. Experimental results show that the YOLO detector is able to obtain a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Krishang Sharma
Machine Learning
ๆฅ็ๆ่ฆ
Accurate Remaining Useful Life (RUL) prediction coupled with uncertainty quantification remains a critical challenge in aerospace prognostics. This research introduces a novel uncertainty-aware deep learning framework that learns aleatoric uncertainty directly through probabilistic modeling, an approach unexplored in existing CMAPSS-based literature. Our hierarchical architecture integrates multi-scale Inception blocks for temporal pattern extraction, bidirectional Long Short-Term Memory networks for sequential modeling, and a dual-level attention mechanism operating simultaneously on sensor and temporal dimensions. The innovation lies in the Bayesian output layer that predicts both mean RUL and variance, enabling the model to learn data-inherent uncertainty. Comprehensive preprocessing employs condition-aware clustering, wavelet denoising, and intelligent feature selection. Experimental validation on NASA CMAPSS benchmarks (FD001-FD004) demonstrates competitive overall performance with RMSE values of 16.22, 19.29, 16.84, and 19.98 respectively. Remarkably, our framework achieves breakthrough critical zone performance (RUL <= 30 cycles) with RMSE of 5.14, 6.89, 5.27, and 7.16, representing 25-40 percent improvements over conventional approaches and establishing new benchmarks for safety-critical predictions. The learned uncertainty provides well-calibrated 95 percent confidence intervals with coverage ranging from 93.5 percent to 95.2 percent, enabling risk-aware maintenance scheduling previously unattainable in CMAPSS literature.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Phong Le, Mees Lindeman, Raquel G. Alhama
ๆฅ็ๆ่ฆ
The structure of naming systems in natural languages hinges on a trade-off between high informativeness and low complexity. Prior work capitalizes on information theory to formalize these notions; however, these studies generally rely on two simplifications: (i) optimal listeners, and (ii) universal communicative need across languages. Here, we address these limitations by introducing an information-theoretic framework for discrete object naming systems, and we use it to prove that an optimal trade-off is achievable if and only if the listener's decoder is equivalent to the Bayesian decoder of the speaker. Adopting a referential game setup from emergent communication, and focusing on the semantic domain of kinship, we show that our notion of optimality is not only theoretically achievable but also emerges empirically in learned communication systems.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Rufin VanRullen
ๆฅ็ๆ่ฆ
In AI, the existential risk denotes the hypothetical threat posed by an artificial system that would possess both the capability and the objective, either directly or indirectly, to eradicate humanity. This issue is gaining prominence in scientific debate due to recent technical advancements and increased media coverage. In parallel, AI progress has sparked speculation and studies about the potential emergence of artificial consciousness. The two questions, AI consciousness and existential risk, are sometimes conflated, as if the former entailed the latter. Here, I explain that this view stems from a common confusion between consciousness and intelligence. Yet these two properties are empirically and theoretically distinct. Arguably, while intelligence is a direct predictor of an AI system's existential threat, consciousness is not. There are, however, certain incidental scenarios in which consciousness could influence existential risk, in either direction. Consciousness could be viewed as a means towards AI alignment, thereby lowering existential risk; or, it could be a precondition for reaching certain capabilities or levels of intelligence, and thus positively related to existential risk. Recognizing these distinctions can help AI safety researchers and public policymakers focus on the most pressing issues.
๐
2025-11-24
๐ ArXiv Preprint (physics.plasm-ph)
๐ฅ Siqi Ding, Zitong Zhang, Guoyang Shi, Xingyu Li, Xiang Gu et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
As artificial intelligence emerges as a transformative enabler for fusion energy commercialization, fast and accurate solvers become increasingly critical. In magnetic confinement nuclear fusion, rapid and accurate solution of the Grad-Shafranov equation (GSE) is essential for real-time plasma control and analysis. Traditional numerical solvers achieve high precision but are computationally prohibitive, while data-driven surrogates infer quickly but fail to enforce physical laws and generalize poorly beyond training distributions. To address this challenge, we present a Physics-Informed Neural Operator (PINO) that directly learns the GSE solution operator, mapping shape parameters of last closed flux surface to equilibrium solutions for realistic nonlinear current profiles. Comprehensive benchmarking of five neural architectures identifies the novel Transformer-KAN (Kolmogorov-Arnold Network) Neural Operator (TKNO) as achieving highest accuracy (0.25% mean L2 relative error) under supervised training (only data-driven). However, all data-driven models exhibit large physics residuals, indicating poor physical consistency. Our unsupervised training can reduce the residuals by nearly four orders of magnitude through embedding physics-based loss terms without labeled data. Critically, semi-supervised learning--integrating sparse labeled data (100 interior points) with physics constraints--achieves optimal balance: 0.48% interpolation error and the most robust extrapolation performance (4.76% error, 8.9x degradation factor vs 39.8x for supervised models). Accelerated by TensorRT optimization, our models enable millisecond-level inference, establishing PINO as a promising pathway for next-generation fusion control systems.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Robert Bredereck, Eva Deltl, Leon Kellerhals, Jannik Peters
ๆฅ็ๆ่ฆ
We study core stability in non-centroid clustering under the max-loss objective, where each agent's loss is the maximum distance to other members of their cluster. We prove that for all $k\geq 3$ there exist metric instances with $n\ge 9$ agents, with $n$ divisible by $k$, for which no clustering lies in the $ฮฑ$-core for any $ฮฑ<2^{\frac{1}{5}}\sim 1.148$. The bound is tight for our construction. Using a computer-aided proof, we also identify a two-dimensional Euclidean point set whose associated lower bound is slightly smaller than that of our general construction. This is, to our knowledge, the first impossibility result showing that the core can be empty in non-centroid clustering under the max-loss objective.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chih-Duo Hong, Hongjian Jiang, Anthony W. Lin, Oliver Markgraf, Julian Parsert et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automata extraction is a method for synthesising interpretable surrogates for black-box neural models that can be analysed symbolically. Existing techniques assume a finite input alphabet, and thus are not directly applicable to data sequences drawn from continuous domains. We address this challenge with deterministic register automata (DRAs), which extend finite automata with registers that store and compare numeric values. Our main contribution is a framework for robust DRA extraction from black-box models: we develop a polynomial-time robustness checker for DRAs with a fixed number of registers, and combine it with passive and active automata learning algorithms. This combination yields surrogate DRAs with statistical robustness and equivalence guarantees. As a key application, we use the extracted automata to assess the robustness of neural networks: for a given sequence and distance metric, the DRA either certifies local robustness or produces a concrete counterexample. Experiments on recurrent neural networks and transformer architectures show that our framework reliably learns accurate automata and enables principled robustness evaluation. Overall, our results demonstrate that robust DRA extraction effectively bridges neural network interpretability and formal reasoning without requiring white-box access to the underlying network.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziyun Li, Ben Dai, Huancheng Hu, Henrik Bostrรถm, Soon Hoe Lim
Machine Learning
ๆฅ็ๆ่ฆ
Flow-based generative models synthesize data by integrating a learned velocity field from a reference distribution to the target data distribution. Prior work has focused on endpoint metrics (e.g., fidelity, likelihood, perceptual quality) while overlooking a deeper question: what do the sampling trajectories reveal? Motivated by classical mechanics, we introduce kinetic path energy (KPE), a simple yet powerful diagnostic that quantifies the total kinetic effort along each generation path of ODE-based samplers. Through comprehensive experiments on CIFAR-10 and ImageNet-256, we uncover two key phenomena: ({i}) higher KPE predicts stronger semantic quality, indicating that semantically richer samples require greater kinetic effort, and ({ii}) higher KPE inversely correlates with data density, with informative samples residing in sparse, low-density regions. Together, these findings reveal that semantically informative samples naturally reside on the sparse frontier of the data distribution, demanding greater generative effort. Our results suggest that trajectory-level analysis offers a physics-inspired and interpretable framework for understanding generation difficulty and sample characteristics.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yutong Li, Yitian Zhou, Xudong Wang, GuoChen, Caiyan Qin
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Timur Mamedov, Anton Konushin, Vadim Konushin
ๆฅ็ๆ่ฆ
Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiangyu Chang, Manyi Yao, Srikanth V. Krishnamurthy, Christian R. Shelton, Anirban Chakraborty et al.
ๆฅ็ๆ่ฆ
In Asynchronous Federated Learning (AFL), the central server immediately updates the global model with each arriving client's contribution. As a result, clients perform their local training on different model versions, causing information staleness (delay). In federated environments with non-IID local data distributions, this asynchronous pattern amplifies the adverse effect of client heterogeneity (due to different data distribution, local objectives, etc.), as faster clients contribute more frequent updates, biasing the global model. We term this phenomenon heterogeneity amplification. Our work provides a theoretical analysis that maps AFL design choices to their resulting error sources when heterogeneity amplification occurs. Guided by our analysis, we propose ACE (All-Client Engagement AFL), which mitigates participation imbalance through immediate, non-buffered updates that use the latest information available from all clients. We also introduce a delay-aware variant, ACED, to balance client diversity against update staleness. Experiments on different models for different tasks across diverse heterogeneity and delay settings validate our analysis and demonstrate the robust performance of our approaches.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek et al.
Machine Learning
ๆฅ็ๆ่ฆ
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
๐
2025-11-24
๐ ArXiv Preprint (eess.SY)
๐ฅ Xinda Zheng, Canchen Jiang, Hao Wang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The growing demand for electric vehicle (EV) charging infrastructure presents significant planning challenges, requiring efficient strategies for investment and operation to deliver cost-effective charging services. However, the potential benefits of EV charging assignment, particularly in response to varying spatial-temporal patterns of charging demand, remain under-explored in infrastructure planning. This paper proposes an integrated approach that jointly optimizes investment decisions and charging assignments while accounting for spatial-temporal demand dynamics and their interdependencies. To support efficient model development, we leverage a large language model (LLM) to assist in generating and refining the mathematical formulation from structured natural-language descriptions, significantly reducing the modeling burden. The resulting optimization model enables optimal joint decision-making for investment and operation. Additionally, we propose a distributed optimization algorithm based on the Alternating Direction Method of Multipliers (ADMM) to address computational complexity in high-dimensional scenarios, which can be executed on standard computing platforms. We validate our approach through a case study using 1.5 million real-world travel records from Chengdu, China, demonstrating a 30% reduction in total cost compared to a baseline without EV assignment.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu et al.
Computer VisionNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical application. Here, we propose MedSAM-3, a text promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kai Zhenga, Zhenkai Wu, Fupeng Wei, Miaolan Zhou, Kai Lie et al.
ๆฅ็ๆ่ฆ
Accurately and swiftly assessing damage from conflicts is crucial for humanitarian aid and regional stability. In conflict zones, damaged zones often share similar architectural styles, with damage typically covering small areas and exhibiting blurred boundaries. These characteristics lead to limited data, annotation difficulties, and significant recognition challenges, including high intra-class similarity and ambiguous semantic changes. To address these issues, we introduce a pre-trained DINOv3 model and propose a multi-scale cross-attention difference siamese network (MC-DiSNet). The powerful visual representation capability of the DINOv3 backbone enables robust and rich feature extraction from bi-temporal remote sensing images. We also release a new Gaza-change dataset containing high-resolution satellite image pairs from 2023-2024 with pixel-level semantic change annotations. It is worth emphasizing that our annotations only include semantic pixels of changed areas. Unlike conventional semantic change detection (SCD), our approach eliminates the need for large-scale semantic annotations of bi-temporal images, instead focusing directly on the changed regions. We term this new task change semantic detection (CSD). The CSD task represents a direct extension of binary change detection (BCD). Due to the limited spatial extent of semantic regions, it presents greater challenges than traditional SCD tasks. We evaluated our method under the CSD framework on both the Gaza-Change and SECOND datasets. Experimental results demonstrate that our proposed approach effectively addresses the CSD task, and its outstanding performance paves the way for practical applications in rapid damage assessment across conflict zones.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Long Tang, Guoquan Zhen, Jie Hao, Jianbo Zhang, Huiyu Duan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking the unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, the effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes a effective quality feature decoding framework via GCN-enhanced \underline{l}ayer\underline{i}nteraction and MoE-based \underline{f}eature d\underline{e}coupling, termed \textbf{(Life-IQA)}. Specifically, the GCN-enhanced layer interaction module utilizes the GCN-enhanced deepest-layer features as query and the penultimate-layer features as key, value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations though different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks.The code is available at: \href{https://github.com/TANGLONG2/Life-IQA/tree/main}{\texttt{Life-IQA}}.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuting Gao, Weihao Chen, Lan Wang, Ruihan Xu, Qingpei Guo
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Preference learning has recently emerged as a pivotal strategy for post-training alignment of Multimodal Large Language Models (MLLMs). However, existing approaches predominantly rely on external human-annotated preference data, which is costly and labor-intensive to collect. In this work, we propose OrdMoE, a novel preference alignment framework that bypasses the reliance on external human preferences entirely by leveraging intrinsic signals within Mixture-of-Experts (MoE) architectures. Specifically, we observe that the router's expert selection scores implicitly encode a quality-aware ranking of responses (i.e. higher-scoring experts consistently generate higher-quality outputs). Building on this insight, OrdMoE constructs an internal preference hierarchy by grouping experts into ranked tiers based on their per-token routing scores and activating each tier separately to produce a sequence of responses with increasing quality. This yields a zero-cost, self-supervised preference ordering over generated responses, which can be directly optimized using standard preference learning objectives. Extensive experiments across multiple multimodal benchmarks demnstrate that OrdMoE significantly enhances both alignment and overall performance of multimodal Mixture-of-Experts LLMs, achieving competitive results without requiring any human-annotated preference data.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Di Wu, Liting Jiang, Ruiyu Fang, Bianjing, Hongyan Xie et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Spoken Language Understanding (SLU) consists of two sub-tasks: intent detection (ID) and slot filling (SF). Given its broad range of real-world applications, enhancing SLU for practical deployment is increasingly critical. Profile-based SLU addresses ambiguous user utterances by incorporating context awareness (CA), user profiles (UP), and knowledge graphs (KG) to support disambiguation, thereby advancing SLU research toward real-world applicability. However, existing SLU datasets still fall short in representing real-world scenarios. Specifically, (1) CA uses one-hot vectors for representation, which is overly idealized, and (2) models typically focuses solely on predicting intents and slot labels, neglecting the reasoning process that could enhance performance and interpretability. To overcome these limitations, we introduce VRSLU, a novel SLU dataset that integrates both Visual images and explicit Reasoning. For over-idealized CA, we use GPT-4o and FLUX.1-dev to generate images reflecting users' environments and statuses, followed by human verification to ensure quality. For reasoning, GPT-4o is employed to generate explanations for predicted labels, which are then refined by human annotators to ensure accuracy and coherence. Additionally, we propose an instructional template, LR-Instruct, which first predicts labels and then generates corresponding reasoning. This two-step approach helps mitigate the influence of reasoning bias on label prediction. Experimental results confirm the effectiveness of incorporating visual information and highlight the promise of explicit reasoning in advancing SLU.
๐
2025-11-24
๐ ArXiv Preprint (hep-ex)
๐ฅ Ivรกn Mozรบn Mateo
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The current KM3NeT/ORCA neutrino telescope, still under construction, has not yet reached its full potential in neutrino reconstruction capability. When training any deep learning model, no explicit information about the physics or the detector is provided, thus they remain unknown to the model. This study leverages the strengths of transformers by incorporating attention masks inspired by the physics and detector design, making the model understand both the telescope design and the neutrino physics measured on it. The study also shows the efficacy of transformers on retaining valuable information between detectors when doing fine-tuning from one configurations to another.
๐
2025-11-24
๐ Accepted at the IEEE conference on Big Data (Speci
๐ฅ Zineddine Tighidet, Lazhar Labiod, Mohamed Nadif
ๆฅ็ๆ่ฆ
The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian models are often used and the Expectation-Maximization (EM) algorithm is particularly suitable for estimating parameters from which clustering is inferred. If these models are particularly popular in various domains including image clustering, they however suffer from the dimensionality and also from the slowness of convergence of the EM algorithm. However, the Classification EM (CEM) algorithm, a classifying version, offers a fast convergence solution while dimensionality reduction still remains a challenge. Thus we propose in this paper an algorithm combining simultaneously and non-sequentially the two tasks --Data embedding and Clustering-- relying on Principal Component Analysis (PCA) and CEM. We demonstrate the interest of such approach in terms of clustering and data embedding. We also establish different connections with other clustering approaches.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wassim Benabbas, Mohammed Brahimi, Samir Akhrouf, Bilal Fortas
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Recent advances in deep learning have enabled significant progress in plant disease classification using leaf images. Much of the existing research in this field has relied on the PlantVillage dataset, which consists of well-centered plant images captured against uniform, uncluttered backgrounds. Although models trained on this dataset achieve high accuracy, they often fail to generalize to real-world field images, such as those submitted by farmers to plant diagnostic systems. This has created a significant gap between published studies and practical application requirements, highlighting the necessity of investigating and addressing this issue. In this study, we investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions in plant disease classification. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models. While CNNs exhibit limited robustness under domain shift, Vision Transformers demonstrate stronger generalization by capturing global contextual features. Most notably, CLIP models classify diseases directly from natural language descriptions without any task-specific training, offering strong adaptability and interpretability. These findings highlight the potential of zero-shot learning as a practical and scalable domain adaptation strategy for plant health diagnosis in diverse field environments.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Donghu Kim
Machine Learning
ๆฅ็ๆ่ฆ
The challenge of building neural networks that can continuously learn and adapt to evolving data streams is central to the fields of continual learning (CL) and reinforcement learning (RL). This lifelong learning problem is often framed in terms of the plasticity-stability dilemma, focusing on issues like loss of plasticity and catastrophic forgetting. Unlike neural networks, biological brains maintain plasticity through capacity growth, inspiring researchers to explore similar approaches in artificial networks, such as adding capacity dynamically. Prior solutions often lack parameter efficiency or depend on explicit task indices, but Mixture-of-Experts (MoE) architectures offer a promising alternative by specializing experts for distinct distributions. This paper aims to evaluate a DynamicMoE approach for continual and reinforcement learning environments and benchmark its effectiveness against existing network expansion methods.
๐
2025-11-24
๐ ArXiv Preprint (physics.optics)
๐ฅ S. Rodionov, A. Burguete-Lopez, M. Makarenko, Q. Wang, F. Getman et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Foundation models (FM) are transforming artificial intelligence by enabling generalizable, data-efficient solutions across different domains for a broad range of applications. However, the lack of large and diverse datasets limits the development of FM in nanophotonics. This work presents MOCLIP (Metasurface Optics Contrastive Learning Pretrained), a nanophotonic foundation model that integrates metasurface geometry and spectra within a shared latent space. MOCLIP employs contrastive learning to align geometry and spectral representations using an experimentally acquired dataset with a sample density comparable to ImageNet-1K. The study demonstrates MOCLIP inverse design capabilities for high-throughput zero-shot prediction at a rate of 0.2 million samples per second, enabling the design of a full 4-inch wafer populated with high-density metasurfaces in minutes. It also shows generative latent-space optimization reaching 97 percent accuracy. Finally, we introduce an optical information storage concept that uses MOCLIP to achieve a density of 0.1 Gbit per square millimeter at the resolution limit, exceeding commercial optical media by a factor of six. These results position MOCLIP as a scalable and versatile platform for next-generation photonic design and data-driven applications.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xin Yuan, Siqi Li, Jiateng Wei, Chengrui Zhu, Yanming Wu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Pruning is an effective method for compressing Large Language Models, but finding an optimal, non-uniform layer-wise sparsity allocation remains a key challenge. While heuristic methods are fast but yield suboptimal performance, more powerful search-based approaches like Reinforcement Learning are often hindered by prohibitive computational costs on large-scale models. To overcome this efficiency barrier, we propose FastForward Pruning. Its core is a decoupled, single-step RL framework that separates policy optimization from the complex budget satisfaction problem. Such a decoupling is crucial for efficiently searching the vast policy space of LLMs. This curriculum-based strategy begins with low-cost, simple tasks and gradually increases in complexity, significantly reducing the search's computational overhead. Evaluated on the LLaMA, Mistral, and OPT model families, our framework discovers pruning policies that achieve superior performance over strong heuristic baselines. Crucially, when compared to other search-based algorithms, our method achieves competitive or superior results at a fraction of the computational cost, demonstrating a clear advantage in search efficiency.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Muhammad Usman Shahid, Chuadhry Mujeeb Ahmed, Rajiv Ranjan
Natural Language Processing
ๆฅ็ๆ่ฆ
The security of code generated by large language models (LLMs) is a significant concern, as studies indicate that such code often contains vulnerabilities and lacks essential defensive programming constructs. This work focuses on examining and evaluating the security of LLM-generated code, particularly in the context of C/C++. We categorized known vulnerabilities using the Common Weakness Enumeration (CWE) and, to study their criticality, mapped them to CVEs. We used ten different LLMs for code generation and analyzed the outputs through static analysis. The amount of CWEs present in AI-generated code is concerning. Our findings highlight the need for developers to be cautious when using LLM-generated code. This study provides valuable insights to advance automated code generation and encourage further research in this domain.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Antonia Wรผst, Wolfgang Stammer, Hikaru Shindo, Lukas Helff, Devendra Singh Dhami et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing interpretable logical rules, though they exploit rigid, domain-specific perception modules. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with systematic reasoning of program synthesis. Rather than embedding reasoning inside the VLM, VLP leverages the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable easy shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, particularly on tasks requiring complex logical reasoning.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qisen Chai, Yansong Wang, Junjie Huang, Tao Jia
Machine Learning
ๆฅ็ๆ่ฆ
As graph-structured data grow increasingly large, evaluating their robustness under adversarial attacks becomes computationally expensive and difficult to scale. To address this challenge, we propose to compress graphs into compact representations that preserve both topological structure and robustness profile, enabling efficient and reliable evaluation.We propose Cutter, a dual-agent reinforcement learning framework composed of a Vital Detection Agent (VDA) and a Redundancy Detection Agent (RDA), which collaboratively identify structurally vital and redundant nodes for guided compression. Cutter incorporates three key strategies to enhance learning efficiency and compression quality: trajectory-level reward shaping to transform sparse trajectory returns into dense, policy-equivalent learning signals; prototype-based shaping to guide decisions using behavioral patterns from both highand low-return trajectories; and cross-agent imitation to enable safer and more transferable exploration. Experiments on multiple real-world graphs demonstrate that Cutter generates compressed graphs that retain essential static topological properties and exhibit robustness degradation trends highly consistent with the original graphs under various attack scenarios, thereby significantly improving evaluation efficiency without compromising assessment fidelity.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wouter W. L. Nuijten, Mykola Lukashchuk
ๆฅ็ๆ่ฆ
Automated decision-making under uncertainty requires balancing exploitation and exploration. Classical methods treat these separately using heuristics, while Active Inference unifies them through Expected Free Energy (EFE) minimization. However, EFE minimization is computationally expensive, limiting scalability. We build on recent theory recasting EFE minimization as variational inference, formally unifying it with Planning-as-Inference and showing the epistemic drive as a unique entropic contribution. Our main contribution is a novel message-passing scheme for this unified objective, enabling scalable Active Inference in factored-state MDPs and overcoming high-dimensional planning intractability.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Santhosh G S, Saurav Prakash, Balaraman Ravindran
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques like token eviction, quantization, or other low-rank methods often risk information loss, have fixed limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive 50-60% memory savings per-token on KV-cache. A key advantage is its runtime-tunable compression level, allowing operators to dynamically adjust the memory footprint, a flexibility absent in methods requiring fixed offline configurations. This combination of a decompression-free design, high performance under compression, and adaptability makes SWAN a practical and efficient solution for serving LLMs with long contexts.
๐
2025-11-24
๐ EMNLP 2025
๐ฅ Yuchen Ji, Bo Xu, Jie Shi, Jiaqing Liang, Deqing Yang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The task of translating natural language questions into query languages has long been a central focus in semantic parsing. Recent advancements in Large Language Models (LLMs) have significantly accelerated progress in this field. However, existing studies typically focus on a single query language, resulting in methods with limited generalizability across different languages. In this paper, we formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework that explicitly diagnoses model-specific weaknesses in handling these skeletons to synthesize targeted training data. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance using only a small amount of synthesized data, highlighting the efficiency and generality of our approach and laying a solid foundation for unified research on Text-to-Query tasks. We release our code at https://github.com/jjjycaptain/Skeletron.
๐
2025-11-24
๐ NeurIPS 2024
๐ฅ Ryan Wong, Hosea David Yu Fei Ng, Dhananjai Sharma, Glenn Jun Jie Ng, Kavishvaran Srinivasan
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, achieving full mitigation under the agent-based defense. Overall, this study highlights how jailbreaks pose a significant security threat to LLMs and identifies key intervention points for prevention, while noting that defense strategies often involve trade-offs between safety, performance, and scalability. Code is available at: https://github.com/Kuro0911/CS5446-Project
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sahil Kale
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent accuracy due to weak query formulation. Costs per accuracy-improving call remain low, but returns diminish once initial retrieval fails. Selective invocation helps, but models become overconfident and inconsistent after search. Overall, built-in web search meaningfully improves factual accuracy and can be invoked selectively, yet models remain overconfident, skip retrieval when it is essential, and falter once initial search queries underperform. Taken together, internal web search works better as a good low-latency verification layer than a reliable analytical tool, with clear room for improvement.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Salah Eddine Choutri, Prajwal Chauhan, Othmane Mazhar, Saif Eddin Jabari
ๆฅ็ๆ่ฆ
The Monte Carlo-type Neural Operator (MCNO) introduces a lightweight architecture for learning solution operators for parametric PDEs by directly approximating the kernel integral using a Monte Carlo approach. Unlike Fourier Neural Operators, MCNO makes no spectral or translation-invariance assumptions. The kernel is represented as a learnable tensor over a fixed set of randomly sampled points. This design enables generalization across multiple grid resolutions without relying on fixed global basis functions or repeated sampling during training. Experiments on standard 1D PDE benchmarks show that MCNO achieves competitive accuracy with low computational cost, providing a simple and practical alternative to spectral and graph-based neural operators.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haifeng Jing, Yujie Hou, Junfei Liu, Rui Xie, alan Xu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
With the rapid development of Large Language Models, dialogue systems are shifting from information tools to emotional companions, heralding the era of Emotional Companionship Dialogue Systems (ECDs) that provide personalized emotional support for users. However, the field lacks clear definitions and systematic evaluation standards for ECDs. To address this, we first propose a definition of ECDs with formal descriptions. Then, based on this theory and the design principle of "Ability Layer-Task Layer (three level)-Data Layer-Method Layer", we design and implement the first ECD evaluation benchmark - MoodBench 1.0. Through extensive evaluations of 30 mainstream models, we demonstrate that MoodBench 1.0 has excellent discriminant validity and can effectively quantify the differences in emotional companionship abilities among models. Furthermore, the results reveal current models' shortcomings in deep emotional companionship, guiding future technological optimization and significantly aiding developers in enhancing ECDs' user experience.
๐
2025-11-24
๐ ArXiv Preprint (cs.SE)
๐ฅ Arina Kharlamova, Jiawen Liu, Tianyi Zhang, Xinrui Yang, Humaid Alqasimi et al.
ๆฅ็ๆ่ฆ
Linux kernel evolution breaks drivers through API/ABI changes, semantic shifts, and security-hardening updates. We introduce DRIVEBENCH, an executable corpus of kernel$\rightarrow$driver co-evolution cases, and AUTODRIVER, a closed-loop, LLM-driven system for automating driver maintenance. The system integrates prompt engineering, multi-agent collaboration, static analysis, and iterative validation to ensure that generated patches are not only syntactically correct but also functionally and semantically consistent with kernel conventions. The corpus spans v5.10-v6.10 with 235 validated cases drawn from 612 candidates. In evaluation across 55 cases, AUTODRIVER achieves 56.4% compilation success; QEMU-based boot verification indicates that compiled patches preserve driver initialization in most instances. By releasing DRIVEBENCH and tooling, we enable reproducible research and a practical route to continuous, safe co-evolution of drivers with the Linux kernel.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruiying Liu, Yuanzhi Liang, Haibin Huang, Tianshu Yu, Chi Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many to many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zengjie Hu, Jiantao Qiu, Tianyi Bai, Haojin Yang, Binhang Yuan et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models, leveraging group-wise rollouts and relative advantage estimation. However, they suffer from a critical \emph{gradient vanishing} problem when all responses within a group receive identical rewards, causing advantage estimates to collapse and training signals to diminish. Existing attempts to mitigate this issue fall into two paradigms: filtering-based and sampling-based methods. Filtering-based methods first generate rollouts broadly and then retroactively filter out uninformative groups, leading to substantial computational overhead. Sampling-based methods proactively select effective samples before rollout but rely on static criteria or prior dataset knowledge, lacking real-time adaptability. To address these issues, we propose \textbf{VADE}, a \textbf{V}ariance-\textbf{A}ware \textbf{D}ynamic sampling framework via online sample-level difficulty \textbf{E}stimation. Our framework integrates three key components: online sample-level difficulty estimation using Beta distributions, a Thompson sampler that maximizes information gain through the estimated correctness probability, and a two-scale prior decay mechanism that maintains robust estimation under policy evolution. This three components design enables VADE to dynamically select the most informative samples, thereby amplifying training signals while eliminating extra rollout costs. Extensive experiments on multimodal reasoning benchmarks show that VADE consistently outperforms strong baselines in both performance and sample efficiency, while achieving a dramatic reduction in computational overhead. More importantly, our framework can serves as a plug-and-play component to be seamlessly integrated into existing group-based RL algorithms. Code and models are available at https://VADE-RL.github.io.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenyu Mu, Guihai Chen, Xun Yang, Erkun Yang, Cheng Deng
Computer Vision
ๆฅ็ๆ่ฆ
Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, which lead to instability in model training. Existing methods typically rely on global noise assumptions or confidence-based sample selection, which inadequately mitigate the performance degradation caused by annotation noise, especially in challenging boundary regions. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy ground-truth labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg consistently outperforms existing state-of-the-art methods.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.
๐
2025-11-24
๐ ACL
๐ฅ Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose \textbf{CoreEval}, a \textbf{Co}ntamination-\textbf{re}silient \textbf{Eval}uation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Suzie Kim, Hye-Bin Shin, Hyo-Jeong Jang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
In this work, we investigate how implicit neural feed back can accelerate reinforcement learning in complex robotic manipulation settings. While prior electroencephalogram (EEG) guided reinforcement learning studies have primarily focused on navigation or low-dimensional locomotion tasks, we aim to understand whether such neural evaluative signals can improve policy learning in high-dimensional manipulation tasks involving obstacles and precise end-effector control. We integrate error related potentials decoded from offline-trained EEG classifiers into reward shaping and systematically evaluate the impact of human-feedback weighting. Experiments on a 7-DoF manipulator in an obstacle-rich reaching environment show that neural feedback accelerates reinforcement learning and, depending on the human-feedback weighting, can yield task success rates that at times exceed those of sparse-reward baselines. Moreover, when applying the best-performing feedback weighting across all sub jects, we observe consistent acceleration of reinforcement learning relative to the sparse-reward setting. Furthermore, leave-one subject-out evaluations confirm that the proposed framework remains robust despite the intrinsic inter-individual variability in EEG decodability. Our findings demonstrate that EEG-based reinforcement learning can scale beyond locomotion tasks and provide a viable pathway for human-aligned manipulation skill acquisition.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuzhi Chen, Yuanchang Xie, Lei Zhao, Pan Liu, Yajie Zou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal trajectory prediction generates multiple plausible future trajectories to address vehicle motion uncertainty from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, causing prediction failures. Map-free approaches lack global context, with pairwise attention over-amplifying straight patterns while suppressing transitional patterns, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation achieving intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs while a neighbor-context-enhanced pathway emphasizes salient interactions, with gating module mediating their contributions to maintain coverage-focus balance. Experiments on eight highway-ramp scenarios from TOD-VT dataset show GContextFormer outperforms state-of-the-art baselines. Compared to existing transformer models, GContextFormer achieves greater robustness and concentrated improvements in high-curvature and transition zones via spatial distributions. Interpretability is achieved through motion mode distinctions and neighbor context modulation exposing reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jian Lu
Machine Learning
ๆฅ็ๆ่ฆ
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we are returning to the strategy of separating inference and training deployment, and by introducing improvements in the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework, which allows for demand-driven, independent, and elastic scaling of each component, while the accuracy of the algorithm remains completely equivalent to the synchronization method, with both belonging to the on-policy strategy. It is worth emphasizing that we apply a unified tri-model architecture in the training phase, and we also proposed a shared-prompt attention mask to reduce repetitive computation. In practice, these works have achieved at least a threefold overall performance improvement in RL training on NPU platforms, indicating its potential for widespread application.
๐
2025-11-24
๐ ArXiv Preprint (cs.SD)
๐ฅ Shuyang Liu, Yuan Jin, Rui Lin, Shizhe Chen, Junyu Dai et al.
ๆฅ็ๆ่ฆ
Evaluating the aesthetic quality of generated songs is challenging due to the multi-dimensional nature of musical perception. We propose a robust music aesthetic evaluation framework that combines (1) multi-source multi-scale feature extraction to obtain complementary segment- and track-level representations, (2) a hierarchical audio augmentation strategy to enrich training data, and (3) a hybrid training objective that integrates regression and ranking losses for accurate scoring and reliable top-song identification. Experiments on the ICASSP 2026 SongEval benchmark demonstrate that our approach consistently outperforms baseline methods across correlation and top-tier metrics.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dezhi Ran, Shuxiao Xie, Mingfang Ji, Ziyue Hua, Mengzhou Wu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
High quality kernels are critical for reducing training and inference costs of Large Language Models (LLMs), yet they traditionally require significant expertise in hardware architecture and software optimization. While recent advances in LLM-based code generation show promise for complex optimization, existing methods struggle with the vast optimization space due to insufficient hardware domain knowledge, failing to effectively balance exploration and exploitation. We present KernelBand, a novel framework that formulates kernel optimization as a hierarchical multi-armed bandit problem, enabling LLM agents to strategically navigate the optimization space by treating kernel selection and optimization strategy application as sequential decision-making processes. Our approach leverages hardware profiling information to identify promising optimization strategies and employs runtime behavior clustering to reduce exploration overhead across kernel candidates. Extensive experiments on TritonBench demonstrate that KernelBand significantly outperforms state-of-the-art methods, achieving superior performance with fewer tokens while exhibiting consistent improvement without saturation as computational resources increase.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xingyu Huang, Fei Jiang, Jianli Xiao
Natural Language Processing
ๆฅ็ๆ่ฆ
With the rapid development of large language models (LLMs), the applications of LLMs have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLMs framework, which is named as Reading Comprehension Exercise Generation (RCEG). It can generate high-quality and personalized English reading comprehension exercises automatically. Firstly, RCEG uses fine-tuned LLMs to generate content candidates. Then, it uses a discriminator to select the best candidate. Finally, the quality of the generated content has been improved greatly. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed to perform the experiments, and comprehensive evaluation metrics are used to analyze the experimental results. These metrics include content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sana Alamgeer
ๆฅ็ๆ่ฆ
The main goal of the project is to design a new model that predicts regions of interest in 360$^{\circ}$ videos. The region of interest (ROI) plays an important role in 360$^{\circ}$ video streaming. For example, ROIs are used to predict view-ports, intelligently cut the videos for live streaming, etc so that less bandwidth is used. Detecting view-ports in advance helps reduce the movement of the head while streaming and watching a video via the head-mounted device. Whereas, intelligent cuts of the videos help improve the efficiency of streaming the video to users and enhance the quality of their viewing experience. This report illustrates the secondary task to identify ROIs, in which, we design, train, and test a hybrid saliency model. In this work, we refer to saliency regions to represent the regions of interest. The method includes the processes as follows: preprocessing the video to obtain frames, developing a hybrid saliency model for predicting the region of interest, and finally post-processing the output predictions of the hybrid saliency model to obtain the output region of interest for each frame. Then, we compare the performance of the proposed method with the subjective annotations of the 360RAT dataset.
๐
2025-11-24
๐ ArXiv Preprint (cs.SE)
๐ฅ Yujing Wang, Weize Hong
Natural Language Processing
ๆฅ็ๆ่ฆ
We present a novel framework that integrates Large Language Models (LLMs) into the Git bisect process for semantic fault localization. Traditional bisect assumes deterministic predicates and binary failure states assumptions often violated in modern software development due to flaky tests, nonmonotonic regressions, and semantic divergence from upstream repositories. Our system augments bisect traversal with structured chain of thought reasoning, enabling commit by commit analysis under noisy conditions. We evaluate multiple open source and proprietary LLMs for their suitability and fine tune DeepSeekCoderV2 using QLoRA on a curated dataset of semantically labeled diffs. We adopt a weak supervision workflow to reduce annotation overhead, incorporating human in the loop corrections and self consistency filtering. Experiments across multiple open source projects show a 6.4 point absolute gain in success rate from 74.2 to 80.6 percent, leading to significantly fewer failed traversals and by experiment up to 2x reduction in average bisect time. We conclude with discussions on temporal reasoning, prompt design, and finetuning strategies tailored for commit level behavior analysis.
๐
2025-11-24
๐ \c{opyright} 2025 IEEE. Personal use of this mater
๐ฅ Mohammad Nour Al Awad, Sergey Ivanov, Olga Tikhonova
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly integrated into code editors to provide AI-powered code suggestions. Yet many of these suggestions are ignored, resulting in wasted computation, increased latency, and unnecessary interruptions. We introduce a lightweight pre-filtering model that predicts the likelihood of suggestion acceptance before invoking the LLM, using only real-time developer telemetry such as typing speed, file navigation, and editing activity. Deployed in a production-grade Visual Studio Code plugin over four months of naturalistic use, our approach nearly doubled acceptance rates (18.4% -> 34.2%) while suppressing 35% of low-value LLM calls. These findings demonstrate that behavioral signals alone can meaningfully improve both user experience and system efficiency in LLM-assisted programming, highlighting the value of timing-aware, privacy-preserving adaptation mechanisms. The filter operates solely on pre-invocation editor telemetry and never inspects code or prompts.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ishmam Tashdeed, Md. Atiqur Rahman, Sabrina Islam, Md. Azam Hossain
Computer Vision
ๆฅ็ๆ่ฆ
Personalized federated learning (PFL) possesses the unique capability of preserving data confidentiality among clients while tackling the data heterogeneity problem of non-independent and identically distributed (Non-IID) data. Its advantages have led to widespread adoption in domains such as medical image segmentation. However, the existing approaches mostly overlook the potential benefits of leveraging shared features across clients, where each client contains segmentation data of different organs. In this work, we introduce a novel personalized federated approach for organ agnostic tumor segmentation (FedOAP), that utilizes cross-attention to model long-range dependencies among the shared features of different clients and a boundary-aware loss to improve segmentation consistency. FedOAP employs a decoupled cross-attention (DCA), which enables each client to retain local queries while attending to globally shared key-value pairs aggregated from all clients, thereby capturing long-range inter-organ feature dependencies. Additionally, we introduce perturbed boundary loss (PBL) which focuses on the inconsistencies of the predicted mask's boundary for each client, forcing the model to localize the margins more precisely. We evaluate FedOAP on diverse tumor segmentation tasks spanning different organs. Extensive experiments demonstrate that FedOAP consistently outperforms existing state-of-the-art federated and personalized segmentation methods.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yubo Wang, Hui He, Chaoxi Niu, Zhendong Niu
ๆฅ็ๆ่ฆ
Due to the inherent complexity, temporal patterns in real-world time series often evolve across multiple intertwined scales, including long-term periodicity, short-term fluctuations, and abrupt regime shifts. While existing literature has designed many sophisticated decomposition approaches based on the time or frequency domain to partition trend-seasonality components and high-low frequency components, an alternative line of approaches based on the wavelet domain has been proposed to provide a unified multi-resolution representation with precise time-frequency localization. However, most wavelet-based methods suffer from a persistent bias toward recursively decomposing only low-frequency components, severely underutilizing subtle yet informative high-frequency components that are pivotal for precise time series forecasting. To address this problem, we propose WaveTuner, a Wavelet decomposition framework empowered by full-spectrum subband Tuning for time series forecasting. Concretely, WaveTuner comprises two key modules: (i) Adaptive Wavelet Refinement module, that transforms time series into time-frequency coefficients, utilizes an adaptive router to dynamically assign subband weights, and generates subband-specific embeddings to support refinement; and (ii) Multi-Branch Specialization module, that employs multiple functional branches, each instantiated as a flexible Kolmogorov-Arnold Network (KAN) with a distinct functional order to model a specific spectral subband. Equipped with these modules, WaveTuner comprehensively tunes global trends and local variations within a unified time-frequency framework. Extensive experiments on eight real-world datasets demonstrate WaveTuner achieves state-of-the-art forecasting performance in time series forecasting.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Changxin Huang, Lv Tang, Zhaohuan Zhan, Lisha Yu, Runhao Zeng et al.
Natural Language ProcessingRoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Vision-and-Language Navigation (VLN) requires agents to autonomously navigate complex environments via visual images and natural language instruction--remains highly challenging. Recent research on enhancing language-guided navigation reasoning using pre-trained large language models (LLMs) has shown promising prospects. However, the reasoning of such methods is limited to the linguistic modality, lacking visual reasoning capabilities. Moreover, existing reasoning modules are optimized separately from navigation policies, leading to incompatibility and potential conflicts in optimization objectives. To tackle these challenges, we introduce UNeMo, a novel framework designed for the collaborative optimization of visual state reasoning and navigational decision-making. It introduces a Multimodal World Model (MWM) that takes visual features, language instructions, and navigational actions as inputs to jointly predict subsequent visual states, enabling cross-modal reasoning. Via a Hierarchical Prediction-Feedback (HPN) mechanism, MWM collaborates with navigation policies: the first layer generates actions using current vision-and-language features; MWM then infers post-action visual states to guide the second layer's fine-grained decisions. This forms a dynamic bidirectional promotion mechanism where MWM reasoning optimizes navigation policies, while policy decisions feedback to improve MWM's reasoning accuracy. Experiments on R2R and REVERIE datasets show UNeMo outperforms state-of-the-art methods by 2.1% and 0.7% in navigation accuracy for unseen scenes, validating its effectiveness.
๐
2025-11-24
๐ \c{opyright} 2025 IEEE. Personal use of this mater
๐ฅ Mohammad Nour Al Awad, Sergey Ivanov, Olga Tikhonova
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have transformed code auto-completion by generating context-aware suggestions. Yet, deciding when to present these suggestions remains underexplored, often leading to interruptions or wasted inference calls. We propose an adaptive timing mechanism that dynamically adjusts the delay before offering a suggestion based on real-time developer feedback. Our suggested method combines a logistic transform of recent acceptance rates with a bounded delay range, anchored by a high-level binary prediction of the developer's cognitive state. In a two-month deployment with professional developers, our system improved suggestion acceptance from 4.9% with no delay to 15.4% with static delays, and to 18.6% with adaptive timing-while reducing blind rejections (rejections without being read) from 8.3% to 0.36%. Together, these improvements increase acceptance and substantially reduce wasted inference calls by 75%, making LLM-based code assistants more efficient and cost-effective in practice.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mincheol Jeon, Euinam Huh
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalized Federated Learning (PFL) faces persistent challenges, including domain heterogeneity from diverse client data, data imbalance due to skewed participation, and strict communication constraints. Traditional federated learning often lacks personalization, as a single global model cannot capture client-specific characteristics, leading to biased predictions and poor generalization, especially for clients with highly divergent data distributions.
To address these issues, we propose FedSTAR, a style-aware federated learning framework that disentangles client-specific style factors from shared content representations. FedSTAR aggregates class-wise prototypes using a Transformer-based attention mechanism, allowing the server to adaptively weight client contributions while preserving personalization.
Furthermore, by exchanging compact prototypes and style vectors instead of full model parameters, FedSTAR significantly reduces communication overhead. Experimental results demonstrate that combining content-style disentanglement with attention-driven prototype aggregation improves personalization and robustness in heterogeneous environments without increasing communication cost.
๐
2025-11-24
๐ ArXiv Preprint (cs.MA)
๐ฅ Binglin Liu, Yucheng Wang, Zheyuan Zhang, Jiyuan Lu, Shen Yang et al.
ๆฅ็ๆ่ฆ
The adaptation of teaching slides to instructors' situated teaching needs, including pedagogical styles and their students' context, is a critical yet time-consuming task for educators. Through a series of educator interviews, we first identify and systematically categorize the key friction points that impede this adaptation process. Grounded in these findings, we introduce a novel multi-agent framework designed to automate slide adaptation based on high-level instructor specifications. An evaluation involving 16 modification requests across 8 real-world courses validates our approach. The framework's output consistently achieved high scores in intent alignment, content coherence and factual accuracy, and performed on par with baseline methods regarding visual clarity, while also demonstrating appropriate timeliness and a high operational agreement with human experts, achieving an F1 score of 0.89. This work heralds a new paradigm where AI agents handle the logistical burdens of instructional design, liberating educators to focus on the creative and strategic aspects of teaching.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingzhi Bao, Hongze Chen, Lingting Zhu, Chenyu Liu, Runze Zhang et al.
ๆฅ็ๆ่ฆ
Physically-based rendering (PBR) provides a principled standard for realistic material-lighting interactions in computer graphics. Despite recent advances in generating PBR textures, existing methods fail to address two fundamental challenges: 1) materials decomposition from image prompts under limited illumination cues, and 2) seamless and view-consistent texture completion. To this end, we propose LumiTex, an end-to-end framework that comprises three key components: (1) a multi-branch generation scheme that disentangles albedo and metallic-roughness under shared illumination priors for robust material understanding, (2) a lighting-aware material attention mechanism that injects illumination context into the decoding process for physically grounded generation of albedo, metallic, and roughness maps, and (3) a geometry-guided inpainting module based on a large view synthesis model that enriches texture coverage and ensures seamless, view-consistent UV completion. Extensive experiments demonstrate that LumiTex achieves state-of-the-art performance in texture quality, surpassing both existing open-source and commercial methods.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, Yi Yang
ๆฅ็ๆ่ฆ
Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.
๐
2025-11-24
๐ ICLR 2025
๐ฅ Yasin Esfandiari, Stefan Bauer, Sebastian U. Stich, Andrea Dittadi
Computer Vision
ๆฅ็ๆ่ฆ
Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying fixed choice of single horizons being suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $ฯ_0$, $ฯ_{0.5}$, and one-step regression policy $ฯ_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under mixed-task setting, $ฯ_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Jacob Lin, Edward Gryspeerdt, Ronald Clark
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
There has been great progress in improving numerical weather prediction and climate models using machine learning. However, most global models act at a kilometer-scale, making it challenging to model individual clouds and factors such as extreme precipitation, wind gusts, turbulence, and surface irradiance. Therefore, there is a need to move towards higher-resolution models, which in turn require high-resolution real-world observations that current instruments struggle to obtain. We present Cloud4D, the first learning-based framework that reconstructs a physically consistent, four-dimensional cloud state using only synchronized ground-based cameras. Leveraging a homography-guided 2D-to-3D transformer, Cloud4D infers the full 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution. By tracking the 3D liquid water content retrievals over time, Cloud4D additionally estimates horizontal wind vectors. Across a two-month deployment comprising six skyward cameras, our system delivers an order-of-magnitude improvement in space-time resolution relative to state-of-the-art satellite measurements, while retaining single-digit relative error ($<10\%$) against collocated radar measurements. Code and data are available on our project page https://cloud4d.jacob-lin.com/.
๐
2025-11-24
๐ AAAI 2026
๐ฅ Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shangyuan Tong, Nanye Ma, Saining Xie, Tommi Jaakkola
Machine Learning
ๆฅ็ๆ่ฆ
State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256x256, and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yun Zhou, Yaoting Wang, Guangquan Jie, Jinyu Liu, Henghui Ding
Computer Vision
ๆฅ็ๆ่ฆ
SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianrun Chen, Runlong Cao, Xinda Yu, Lanyun Zhu, Chaotao Ding et al.
Computer Vision
ๆฅ็ๆ่ฆ
The rapid rise of large-scale foundation models has reshaped the landscape of image segmentation, with models such as Segment Anything achieving unprecedented versatility across diverse vision tasks. However, previous generations-including SAM and its successor-still struggle with fine-grained, low-level segmentation challenges such as camouflaged object detection, medical image segmentation, cell image segmentation, and shadow detection. To address these limitations, we originally proposed SAM-Adapter in 2023, demonstrating substantial gains on these difficult scenarios. With the emergence of Segment Anything 3 (SAM3)-a more efficient and higher-performing evolution with a redesigned architecture and improved training pipeline-we revisit these long-standing challenges. In this work, we present SAM3-Adapter, the first adapter framework tailored for SAM3 that unlocks its full segmentation capability. SAM3-Adapter not only reduces computational overhead but also consistently surpasses both SAM and SAM2-based solutions, establishing new state-of-the-art results across multiple downstream tasks, including medical imaging, camouflaged (concealed) object segmentation, and shadow detection. Built upon the modular and composable design philosophy of the original SAM-Adapter, SAM3-Adapter provides stronger generalizability, richer task adaptability, and significantly improved segmentation precision. Extensive experiments confirm that integrating SAM3 with our adapter yields superior accuracy, robustness, and efficiency compared to all prior SAM-based adaptations. We hope SAM3-Adapter can serve as a foundation for future research and practical segmentation applications. Code, pre-trained models, and data processing pipelines are available.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang
Multimodal
ๆฅ็ๆ่ฆ
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gongfan Fang, Xinyin Ma, Xinchao Wang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.
๐
2025-11-24
๐ ArXiv Preprint (cs.SD)
๐ฅ Jorge Ortigoso-Narro, Jose A. Belloch, Adrian Amor-Martin, Sandra Roger, Maximo Cobos
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array's focus, synchronizing the acoustic response with the target's position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rachit Saluja, Asli Cihangir, Ruining Deng, Johannes C. Paetzold, Fengbei Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Segmenting small lesions in medical images remains notoriously difficult. Most prior work tackles this challenge by either designing better architectures, loss functions, or data augmentation schemes; and collecting more labeled data. We take a different view, arguing that part of the problem lies in how the background is modeled. Common lesion segmentation collapses all non-lesion pixels into a single "background" class, ignoring the rich anatomical context in which lesions appear. In reality, the background is highly heterogeneous-composed of tissues, organs, and other structures that can now be labeled manually or inferred automatically using existing segmentation models.
In this paper, we argue that training with fine-grained labels that sub-divide the background class, which we call BackSplit, is a simple yet powerful paradigm that can offer a significant performance boost without increasing inference costs. From an information theoretic standpoint, we prove that BackSplit increases the expected Fisher Information relative to conventional binary training, leading to tighter asymptotic bounds and more stable optimization. With extensive experiments across multiple datasets and architectures, we empirically show that BackSplit consistently boosts small-lesion segmentation performance, even when auxiliary labels are generated automatically using pretrained segmentation models. Additionally, we demonstrate that auxiliary labels derived from interactive segmentation frameworks exhibit the same beneficial effect, demonstrating its robustness, simplicity, and broad applicability.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maroun Ayli, Youssef Bakouny, Tushar Sharma, Nader Jalloul, Hani Seifeddine et al.
Multimodal
ๆฅ็ๆ่ฆ
Enterprise software companies maintain thousands of user interface screens across products and versions, creating critical challenges for design consistency, pattern discovery, and compliance check. Existing approaches rely on visual similarity or text semantics, lacking explicit modeling of structural properties fundamental to user interface (UI) composition. We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements, potentially generalizable to document layouts, architectural diagrams, and other structured visual domains. A contrastive graph autoencoder learns embeddings preserving multi-level similarity across visual, structural, and semantic properties. The comprehensive analysis demonstrates that our structural embeddings achieve better discriminative power than state-of-the-art Vision Encoders, representing a fundamental advance in the expressiveness of the UI representation. We implement this representation in UISearch, a multi-modal search framework that combines structural embeddings with semantic search through a composable query language. On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with 47.5ms median latency (P95: 124ms), scaling to 20,000+ screens. The hybrid indexing architecture enables complex queries and supports fine-grained UI distinction impossible with vision-only approaches.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saniah Kayenat Chowdhury, Rusab Sarmun, Muhammad E. H. Chowdhury, Sohaib Bassam Zoghoul, Israa Al-Hashimi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that are central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor's size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e. measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures by a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable "black box" manner, our method offers both state-of-the-art performance and transparent decision support.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.
๐
2025-11-24
๐ The IEEE International Conference on Data Mining (
๐ฅ Abdurahman Ali Mohammed, Catherine Fonder, Ying Wei, Wallapak Tavanapong, Donald S Sakaguchi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate cell counting is essential in various biomedical research and clinical applications, including cancer diagnosis, stem cell research, and immunology. Manual counting is labor-intensive and error-prone, motivating automation through deep learning techniques. However, training reliable deep learning models requires large amounts of high-quality annotated data, which is difficult and time-consuming to produce manually. Consequently, existing cell-counting datasets are often limited, frequently containing fewer than $500$ images. In this work, we introduce a large-scale annotated dataset comprising $3{,}023$ images from immunocytochemistry experiments related to cellular differentiation, containing over $430{,}000$ manually annotated cell locations. The dataset presents significant challenges: high cell density, overlapping and morphologically diverse cells, a long-tailed distribution of cell count per image, and variation in staining protocols. We benchmark three categories of existing methods: regression-based, crowd-counting, and cell-counting techniques on a test set with cell counts ranging from $10$ to $2{,}126$ cells per image. We also evaluate how the Segment Anything Model (SAM) can be adapted for microscopy cell counting using only dot-annotated datasets. As a case study, we implement a density-map-based adaptation of SAM (SAM-Counter) and report a mean absolute error (MAE) of $22.12$, which outperforms existing approaches (second-best MAE of $27.46$). Our results underscore the value of the dataset and the benchmarking framework for driving progress in automated cell counting and provide a robust foundation for future research and development.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qihan Huang, Haofei Zhang, Rong Wei, Yi Wang, Rui Tang et al.
Computer VisionMachine LearningMultimodal
ๆฅ็ๆ่ฆ
RL (reinforcement learning) methods (e.g., GRPO) for MLLM (Multimodal LLM) perception ability has attracted wide research interest owing to its remarkable generalization ability. Nevertheless, existing reinforcement learning methods still face the problem of low data quality, where data samples cannot elicit diverse responses from MLLMs, thus restricting the exploration scope for MLLM reinforcement learning. Some methods attempt to mitigate this problem by imposing constraints on entropy, but none address it at its root. Therefore, to tackle this problem, this work proposes Syn-GRPO (Synthesis-GRPO), which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training. Specifically, Syn-GRPO consists of two components: (1) data server; (2) GRPO workflow. The data server synthesizes new samples from existing ones using an image generation model, featuring a decoupled and asynchronous scheme to achieve high generation efficiency. The GRPO workflow provides the data server with the new image descriptions, and it leverages a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experiment results across three visual perception tasks demonstrate that Syn-GRPO improves the data quality by a large margin, achieving significant superior performance to existing MLLM perception methods, and Syn-GRPO presents promising potential for scaling long-term self-evolving RL. Our code is available at https://github.com/hqhQAQ/Syn-GRPO.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anjie Le, Can Peng, Yuyuan Liu, J. Alison Noble
Computer Vision
ๆฅ็ๆ่ฆ
In computer vision, machine unlearning aims to remove the influence of specific visual concepts or training images without retraining from scratch. Studies show that existing approaches often modify the classifier while leaving internal representations intact, resulting in incomplete forgetting. In this work, we extend the notion of unlearning to the representation level, deriving a three-term interplay between forgetting efficacy, retention fidelity, and class separation. Building on Neural Collapse theory, we show that the orthogonal projection of a simplex Equiangular Tight Frame (ETF) remains an ETF in a lower dimensional space, yielding a provably optimal forgetting operator. We further introduce the Representation Unlearning Score (RUS) to quantify representation-level forgetting and retention fidelity. Building on this, we introduce POUR (Provably Optimal Unlearning of Representations), a geometric projection method with closed-form (POUR-P) and a feature-level unlearning variant under a distillation scheme (POUR-D). Experiments on CIFAR-10/100 and PathMNIST demonstrate that POUR achieves effective unlearning while preserving retained knowledge, outperforming state-of-the-art unlearning methods on both classification-level and representation-level metrics.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Farnoosh Koleini, Hongfei Xue, Ahmed Helmy, Pu Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Reconstructing biomechanically realistic 3D human motion - recovering both kinematics (motion) and kinetics (forces) - is a critical challenge. While marker-based systems are lab-bound and slow, popular monocular methods use oversimplified, anatomically inaccurate models (e.g., SMPL) and ignore physics, fundamentally limiting their biomechanical fidelity. In this work, we introduce MonoMSK, a hybrid framework that bridges data-driven learning and physics-based simulation for biomechanically realistic 3D human motion estimation from monocular video. MonoMSK jointly recovers both kinematics (motions) and kinetics (forces and torques) through an anatomically accurate musculoskeletal model. By integrating transformer-based inverse dynamics with differentiable forward kinematics and dynamics layers governed by ODE-based simulation, MonoMSK establishes a physics-regulated inverse-forward loop that enforces biomechanical causality and physical plausibility. A novel forward-inverse consistency loss further aligns motion reconstruction with the underlying kinetic reasoning. Experiments on BML-MoVi, BEDLAM, and OpenCap show that MonoMSK significantly outperforms state-of-the-art methods in kinematic accuracy, while for the first time enabling precise monocular kinetics estimation.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui et al.
Robotics
ๆฅ็ๆ่ฆ
Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An et al.
Robotics
ๆฅ็ๆ่ฆ
Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual prior, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle. During the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide next-step joint generation. Experimentally, our method demonstrates superior performance to state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xincheng Wang, Hanchi Sun, Wenjun Sun, Kejun Xue, Wangqiu Zhou et al.
Computer Vision
ๆฅ็ๆ่ฆ
Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zixuan Wang, Haoran Sun, Jiaming Lu, Wenxuan Wang, Zhongling Huang et al.
Multimodal
ๆฅ็ๆ่ฆ
Infrared small target detection remains challenging due to limited feature representation and severe background interference, resulting in sub-optimal performance. While recent CLIP-inspired methods attempt to leverage textual guidance for detection, they are hindered by inaccurate text descriptions and reliance on manual annotations. To overcome these limitations, we propose DGSPNet, an end-to-end language prompt-driven framework. Our approach integrates dual-granularity semantic prompts: coarse-grained textual priors (e.g., 'infrared image', 'small target') and fine-grained personalized semantic descriptions derived through visual-to-textual mapping within the image space. This design not only facilitates learning fine-grained semantic information but also can inherently leverage language prompts during inference without relying on any annotation requirements. By fully leveraging the precision and conciseness of text descriptions, we further introduce a text-guide channel attention (TGCA) mechanism and text-guide spatial attention (TGSA) mechanism that enhances the model's sensitivity to potential targets across both low- and high-level feature spaces. Extensive experiments demonstrate that our method significantly improves detection accuracy and achieves state-of-the-art performance on three benchmark datasets.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Johannes Meier, Florian Gรผnther, Riccardo Marin, Oussema Dhaouadi, Jacques Kaiser et al.
Computer Vision
ๆฅ็ๆ่ฆ
Monocular 3D detection relies on just a single camera and is therefore easy to deploy. Yet, achieving reliable 3D understanding from monocular images requires substantial annotation, and 3D labels are especially costly. To maximize performance under constrained labeling budgets, it is essential to prioritize annotating samples expected to deliver the largest performance gains. This prioritization is the focus of active learning. Curiously, we observed two significant limitations in active learning algorithms for 3D monocular object detection. First, previous approaches select entire images, which is inefficient, as non-informative instances contained in the same image also need to be labeled. Secondly, existing methods rely on uncertainty-based selection, which in monocular 3D object detection creates a bias toward depth ambiguity. Consequently, distant objects are selected, while nearby objects are overlooked.
To address these limitations, we propose IDEAL-M3D, the first instance-level pipeline for monocular 3D detection. For the first time, we demonstrate that an explicitly diverse, fast-to-train ensemble improves diversity-driven active learning for monocular 3D. We induce diversity with heterogeneous backbones and task-agnostic features, loss weight perturbation, and time-dependent bagging. IDEAL-M3D shows superior performance and significant resource savings: with just 60% of the annotations, we achieve similar or better AP3D on KITTI validation and test set results compared to training the same detector on the whole dataset.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Phurtivilai Patt, Leyang Huang, Yinqiang Zhang, Yang Lei
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper addresses the limitations of existing 3D Gaussian Splatting (3DGS) methods, particularly their reliance on adaptive density control, which can lead to floating artifacts and inefficient resource usage. We propose a novel densify beforehand approach that enhances the initialization of 3D scenes by combining sparse LiDAR data with monocular depth estimation from corresponding RGB images. Our ROI-aware sampling scheme prioritizes semantically and geometrically important regions, yielding a dense point cloud that improves visual fidelity and computational efficiency. This densify beforehand approach bypasses the adaptive density control that may introduce redundant Gaussians in the original pipeline, allowing the optimization to focus on the other attributes of 3D Gaussian primitives, reducing overlap while enhancing visual quality. Our method achieves comparable results to state-of-the-art techniques while significantly lowering resource consumption and training time. We validate our approach through extensive comparisons and ablation studies on four newly collected datasets, showcasing its effectiveness in preserving regions of interest in complex scenes.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qianying Liu, Xiao Liang, Zhiqiang Zhang, Yibo Chen, Xu Tang et al.
Multimodal
ๆฅ็ๆ่ฆ
We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline,we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.
๐
2025-11-24
๐ AAAI 2026
๐ฅ Mingyang Chen, Jiawei Du, Bo Huang, Yi Wang, Xiaobo Zhang et al.
ๆฅ็ๆ่ฆ
Existing core-set selection methods predominantly rely on heuristic scoring signals such as training dynamics or model uncertainty, lacking explicit modeling of data likelihood. This omission may hinder the constructed subset from capturing subtle yet critical distributional structures that underpin effective model training. In this work, we propose a novel, theoretically grounded approach that leverages diffusion models to estimate data likelihood via reconstruction deviation induced by partial reverse denoising. Specifically, we establish a formal connection between reconstruction error and data likelihood, grounded in the Evidence Lower Bound (ELBO) of Markovian diffusion processes, thereby enabling a principled, distribution-aware scoring criterion for data selection. Complementarily, we introduce an efficient information-theoretic method to identify the optimal reconstruction timestep, ensuring that the deviation provides a reliable signal indicative of underlying data likelihood. Extensive experiments on ImageNet demonstrate that reconstruction deviation offers an effective scoring criterion, consistently outperforming existing baselines across selection ratios, and closely matching full-data training using only 50% of the data. Further analysis shows that the likelihood-informed nature of our score reveals informative insights in data selection, shedding light on the interplay between data distributional characteristics and model learning preferences.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Conditional image generation enhances text-to-image synthesis with structural, spatial, or stylistic priors, but current methods face challenges in handling conflicts between sources. These include 1) input-level conflicts, where the conditioning image contradicts the text prompt, and 2) model-bias conflicts, where generative biases disrupt alignment even when conditions match the text. Addressing these conflicts requires nuanced solutions, which standard supervised fine-tuning struggles to provide. Preference-based optimization techniques like Direct Preference Optimization (DPO) show promise but are limited by gradient entanglement between text and condition signals and lack disentangled training data for multi-constraint tasks. To overcome this, we propose a bidirectionally decoupled DPO framework (BideDPO). Our method creates two disentangled preference pairs-one for the condition and one for the text-to reduce gradient entanglement. The influence of pairs is managed using an Adaptive Loss Balancing strategy for balanced optimization. We introduce an automated data pipeline to sample model outputs and generate conflict-aware data. This process is embedded in an iterative optimization strategy that refines both the model and the data. We construct a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments show BideDPO significantly improves text success rates (e.g., +35%) and condition adherence. We also validate our approach using the COCO dataset. Project Pages: https://limuloo.github.io/BideDPO/.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuai Wang, Daoan Zhang, Tianyi Bai, Shitong Shao, Jiebo Luo et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Humans can perceive and understand 3D space and long videos from sequential visual observations. But do vision-language models (VLMs) can? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance for 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains in various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, 15.8% gains on EgoSchema with GPT-4o in a zero-shot manner and 8.3 gains on VSI-Bench compared with Qwen2.5-VL-7B.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Rissal Hedna, Sesugh Samuel Nder
Computer Vision
ๆฅ็ๆ่ฆ
Computer vision systems are increasingly adopted in modern logistics operations, including the estimation of trailer occupancy for planning, routing, and billing. Although effective, such systems may be vulnerable to physical adversarial attacks, particularly adversarial patches that can be printed and placed on interior surfaces. In this work, we study the feasibility of such attacks on a convolutional cargo-occupancy classifier using fully simulated 3D environments. Using Mitsuba 3 for differentiable rendering, we optimize patch textures across variations in geometry, lighting, and viewpoint, and compare their effectiveness to a 2D compositing baseline. Our experiments demonstrate that 3D-optimized patches achieve high attack success rates, especially in a denial-of-service scenario (empty to full), where success reaches 84.94 percent. Concealment attacks (full to empty) prove more challenging but still reach 30.32 percent. We analyze the factors influencing attack success, discuss implications for the security of automated logistics pipelines, and highlight directions for strengthening physical robustness. To our knowledge, this is the first study to investigate adversarial patch attacks for cargo-occupancy estimation in physically realistic, fully simulated 3D scenes.
๐
2025-11-24
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Md Akil Raihan Iftee, Syed Md. Ahnaf Hasan, Amin Ahsan Ali, AKM Mahbubur Rahman, Sajib Mistry et al.
ๆฅ็ๆ่ฆ
Test-time personalization in federated learning enables models at clients to adjust online to local domain shifts, enhancing robustness and personalization in deployment. Yet, existing federated learning work largely overlooks the security risks that arise when local adaptation occurs at test time. Heterogeneous domain arrivals, diverse adaptation algorithms, and limited cross-client visibility create vulnerabilities where compromised participants can craft poisoned inputs and submit adversarial updates that undermine both global and per-client performance. To address this threat, we introduce FedPoisonTTP, a realistic grey-box attack framework that explores test-time data poisoning in the federated adaptation setting. FedPoisonTTP distills a surrogate model from adversarial queries, synthesizes in-distribution poisons using feature-consistency, and optimizes attack objectives to generate high-entropy or class-confident poisons that evade common adaptation filters. These poisons are injected during local adaptation and spread through collaborative updates, leading to broad degradation. Extensive experiments on corrupted vision benchmarks show that compromised participants can substantially diminish overall test-time performance.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Carl Lindstrรถm, Mahan Rafidashti, Maryam Fatemi, Lars Hammarstrand, Martin R. Oswald et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reconstructing dynamic driving scenes is essential for developing autonomous systems through sensor-realistic simulation. Although recent methods achieve high-fidelity reconstructions, they either rely on costly human annotations for object trajectories or use time-varying representations without explicit object-level decomposition, leading to intertwined static and dynamic elements that hinder scene separation. We present IDSplat, a self-supervised 3D Gaussian Splatting framework that reconstructs dynamic scenes with explicit instance decomposition and learnable motion trajectories, without requiring human annotations. Our key insight is to model dynamic objects as coherent instances undergoing rigid transformations, rather than unstructured time-varying primitives. For instance decomposition, we employ zero-shot, language-grounded video tracking anchored to 3D using lidar, and estimate consistent poses via feature correspondences. We introduce a coordinated-turn smoothing scheme to obtain temporally and physically consistent motion trajectories, mitigating pose misalignments and tracking failures, followed by joint optimization of object poses and Gaussian parameters. Experiments on the Waymo Open Dataset demonstrate that our method achieves competitive reconstruction quality while maintaining instance-level decomposition and generalizes across diverse sequences and view densities without retraining, making it practical for large-scale autonomous driving applications. Code will be released.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Selena Song, Ziming Xu, Zijun Zhang, Kun Zhou, Jiaxian Guo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Transformer(DiT) based video generation models have recently achieved impressive visual quality and temporal coherence, but they still frequently violate basic physical laws and commonsense dynamics, revealing a lack of explicit world knowledge. In this work, we explore how to equip them with a plug-and-play memory that injects useful world knowledge. Motivated by in-context memory in Transformer-based LLMs, we conduct empirical studies to show that DiT can be steered via interventions on its hidden states, and simple low-pass and high-pass filters in the embedding space naturally disentangle low-level appearance and high-level physical/semantic cues, enabling targeted guidance. Building on these observations, we propose a learnable memory encoder DiT-Mem, composed of stacked 3D CNNs, low-/high-pass filters, and self-attention layers. The encoder maps reference videos into a compact set of memory tokens, which are concatenated as the memory within the DiT self-attention layers. During training, we keep the diffusion backbone frozen, and only optimize the memory encoder. It yields a rather efficient training process on few training parameters (150M) and 10K data samples, and enables plug-and-play usage at inference time. Extensive experiments on state-of-the-art models demonstrate the effectiveness of our method in improving physical rule following and video fidelity. Our code and data are publicly released here: https://thrcle421.github.io/DiT-Mem-Web/.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianhua Han, Meng Tian, Jiangtong Zhu, Fan He, Huixin Zhang et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Federico Felizzi, Olivia Riccomi, Michele Ferramola, Francesco Andrea Causio, Manuel Del Medico et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
๐
2025-11-24
๐ AAAI 2026
๐ฅ Wanjiang Weng, Xiaofeng Tan, Junbo Wang, Guo-Sen Xie, Pan Zhou et al.
Robotics
ๆฅ็ๆ่ฆ
Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diversity and realistic motion. However, there exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model to assess alignment quality during the denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments of both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Brent Zoomers, Florian Hahlbohm, Joni Vanherck, Lode Jorissen, Marcus Magnor et al.
ๆฅ็ๆ่ฆ
3D Gaussian Splatting can exploit frustum culling and level-of-detail strategies to accelerate rendering of scenes containing a large number of primitives. However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. By querying it for Gaussians within the viewing frustum prior to rasterization, our method can discard occluded primitives during rendering. Leveraging Tensor Cores for efficient computation, we integrate these neural queries directly into a novel instanced software rasterizer. Our approach outperforms the current state of the art for composed scenes in terms of VRAM usage and image quality, utilizing a combination of our instanced rasterizer and occlusion culling MLP, and exhibits complementary properties to existing LoD techniques.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Itay Cohen, Ethan Fetaya, Amir Rosenfeld
Computer VisionNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Teodora Popordanoska, Jiameng Li, Matthew B. Blaschko
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
๐
2025-11-24
๐ IROS
๐ฅ Ann-Sophia Mรผller, Moonkwang Jeong, Meng Zhang, Jiyuan Tian, Arkadiusz Miernik et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Surgical planning and training based on machine learning requires a large amount of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if even possible, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nithira Jayarathne, Naveen Basnayake, Keshawa Jayasundara, Pasindu Dodampegama, Praveen Wijesinghe et al.
Machine Learning
ๆฅ็ๆ่ฆ
Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Carsten T. Lรผth, Jeremias Traub, Kim-Celine Kahl, Till J. Bungert, Lukas Klein et al.
ๆฅ็ๆ่ฆ
Semantic segmentation is crucial for various biomedical applications, yet its reliance on large annotated datasets presents a bottleneck due to the high cost and specialized expertise required for manual labeling. Active Learning (AL) aims to mitigate this challenge by querying only the most informative samples, thereby reducing annotation effort. However, in the domain of 3D biomedical imaging, there is no consensus on whether AL consistently outperforms Random sampling. Four evaluation pitfalls hinder the current methodological assessment. These are (1) restriction to too few datasets and annotation budgets, (2) using 2D models on 3D images without partial annotations, (3) Random baseline not being adapted to the task, and (4) measuring annotation cost only in voxels. In this work, we introduce nnActive, an open-source AL framework that overcomes these pitfalls by (1) means of a large scale study spanning four biomedical imaging datasets and three label regimes, (2) extending nnU-Net by using partial annotations for training with 3D patch-based query selection, (3) proposing Foreground Aware Random sampling strategies tackling the foreground-background class imbalance of medical images and (4) propose the foreground efficiency metric, which captures the low annotation cost of background-regions. We reveal the following findings: (A) while all AL methods outperform standard Random sampling, none reliably surpasses an improved Foreground Aware Random sampling; (B) benefits of AL depend on task specific parameters; (C) Predictive Entropy is overall the best performing AL method, but likely requires the most annotation effort; (D) AL performance can be improved with more compute intensive design choices. As a holistic, open-source framework, nnActive can serve as a catalyst for research and application of AL in 3D biomedical imaging. Code is at: https://github.com/MIC-DKFZ/nnActive
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mansur Ozaman
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
One of the most important tasks in computer vision is identifying the device using which the image was taken, useful for facilitating further comprehensive analysis of the image. This paper presents comparative analysis of three techniques used in source camera identification (SCI): Photo Response Non-Uniformity (PRNU), JPEG compression artifact analysis, and convolutional neural networks (CNNs). It evaluates each method in terms of device classification accuracy. Furthermore, the research discusses the possible scientific development needed for the implementation of the methods in real-life scenarios.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kehua Chen, Tianlu Mao, Zhuxin Ma, Hao Jiang, Zehao Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation as the core foundation, serving as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that organically integrates monocular and multi-view optimization to achieve efficient and accurate geometric refinement. Finally, to address the appearance inconsistency commonly observed in large-scale scenes, we introduce a depth-guided appearance modeling approach that learns spatial features with 3D consistency, facilitating effective decoupling between geometry and appearance and further enhancing reconstruction stability. Experiments on large-scale urban datasets demonstrate that MetroGS achieves superior geometric accuracy, rendering quality, offering a unified solution for high-fidelity large-scale scene reconstruction.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bingchen Li, Xin Li, Jiaqi Xu, Jiaming Guo, Wenbo Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Image restoration (IR) models are typically trained to recover high-quality images using L1 or LPIPS loss. To handle diverse unknown degradations, zero-shot IR methods have also been introduced. However, existing pre-trained and zero-shot IR approaches often fail to align with human preferences, resulting in restored images that may not be favored. This highlights the critical need to enhance restoration quality and adapt flexibly to various image restoration tasks or backbones without requiring model retraining and ideally without labor-intensive preference data collection. In this paper, we propose the first Test-Time Preference Optimization (TTPO) paradigm for image restoration, which enhances perceptual quality, generates preference data on-the-fly, and is compatible with any IR model backbone. Specifically, we design a training-free, three-stage pipeline: (i) generate candidate preference images online using diffusion inversion and denoising based on the initially restored image; (ii) select preferred and dispreferred images using automated preference-aligned metrics or human feedback; and (iii) use the selected preference images as reward signals to guide the diffusion denoising process, optimizing the restored image to better align with human preferences. Extensive experiments across various image restoration tasks and models demonstrate the effectiveness and flexibility of the proposed pipeline.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Moazzam Umer Gondal, Hamad Ul Qudous, Daniya Siddiqui, Asma Ahmad Farhan
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
This paper introduces the retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that have problems with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline model for comparison. Experimental results show that the YOLO detector is able to obtain a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huisoo Lee, Jisu Han, Hyunsouk Cho, Wonjun Hwang
Multimodal
ๆฅ็ๆ่ฆ
Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain without access to source data. Recent advances in Foundation Models (FMs) have introduced new opportunities for leveraging external semantic knowledge to guide SFDA. However, relying on a single FM is often insufficient, as it tends to bias adaptation toward a restricted semantic coverage, failing to capture diverse contextual cues under domain shift. To overcome this limitation, we propose a Collaborative Multi-foundation Adaptation (CoMA) framework that jointly leverages two different FMs (e.g., CLIP and BLIP) with complementary properties to capture both global semantics and local contextual cues. Specifically, we employ a bidirectional adaptation mechanism that (1) aligns different FMs with the target model for task adaptation while maintaining their semantic distinctiveness, and (2) transfers complementary knowledge from the FMs to the target model. To ensure stable adaptation under mini-batch training, we introduce Decomposed Mutual Information (DMI) that selectively enhances true dependencies while suppressing false dependencies arising from incomplete class coverage. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art SFDA methods across four benchmarks, including Office-31, Office-Home, DomainNet-126, and VisDA, under the closed-set setting, while also achieving best results on partial-set and open-set variants.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongha Lee, Jinhee Park, Minjun Kim, Junseok Kwon
ๆฅ็ๆ่ฆ
We propose Activation Boundary Matching for Low-Rank Adaptation (ABM-LoRA), a principled initialization strategy that substantially accelerates the convergence of low-rank adapters. While LoRA offers high parameter efficiency, its random initialization restricts gradient updates to a mismatched tangent space, causing significant information loss and hindering early convergence. Our ABM-LoRA addresses this by aligning the adapter's activation boundaries with those of the pretrained model before downstream training, thereby maximizing the projection of full-parameter gradients into the adapter subspace. This alignment sharply reduces information loss at initialization, yields a lower starting loss, and accelerates convergence. We demonstrate ABM-LoRA's effectiveness across diverse architectures and tasks: language understanding (T5-Base on GLUE), dialogue generation (LLaMA2-7B on WizardLM), and vision recognition (ViT-B/16 on VTAB-1K). On VTAB-1K, it achieves the highest accuracy among all methods, with strong gains on structured reasoning tasks requiring geometric understanding.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhifeng Xie, Keyi Zhang, Yiye Yan, Yuling Guo, Fan Yang et al.
ๆฅ็ๆ่ฆ
Film set design plays a pivotal role in cinematic storytelling and shaping the visual atmosphere. However, the traditional process depends on expert-driven manual modeling, which is labor-intensive and time-consuming. To address this issue, we introduce FilmSceneDesigner, an automated scene generation system that emulates professional film set design workflow. Given a natural language description, including scene type, historical period, and style, we design an agent-based chaining framework to generate structured parameters aligned with film set design workflow, guided by prompt strategies that ensure parameter accuracy and coherence. On the other hand, we propose a procedural generation pipeline which executes a series of dedicated functions with the structured parameters for floorplan and structure generation, material assignment, door and window placement, and object retrieval and layout, ultimately constructing a complete film scene from scratch. Moreover, to enhance cinematic realism and asset diversity, we construct SetDepot-Pro, a curated dataset of 6,862 film-specific 3D assets and 733 materials. Experimental results and human evaluations demonstrate that our system produces structurally sound scenes with strong cinematic fidelity, supporting downstream tasks such as virtual previs, construction drawing and mood board creation.
๐
2025-11-24
๐ Submitted to IEEE Geoscience and Remote Sensing Le
๐ฅ Shuyu Cao, Minxin Chen, Yucheng Song, Zhaozhong Chen, Xinyou Zhang
Computer Vision
ๆฅ็ๆ่ฆ
Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. While fusing RGB and infrared (IR) data offers a promising solution, existing methods often struggle with the trade-off between effective cross-modal interaction and computational efficiency. In this letter, we introduce MambaRefine-YOLO. Its core contributions are a Dual-Gated Complementary Mamba fusion module (DGC-MFM) that adaptively balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms, and a Hierarchical Feature Aggregation Neck (HFAN) that uses a ``refine-then-fuse'' strategy to enhance multi-scale features. Our comprehensive experiments validate this dual-pronged approach. On the dual-modality DroneVehicle dataset, the full model achieves a state-of-the-art mAP of 83.2%, an improvement of 7.9% over the baseline. On the single-modality VisDrone dataset, a variant using only the HFAN also shows significant gains, demonstrating its general applicability. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Beilin Chu, Weike You, Mengtao Li, Tingting Zheng, Kehan Zhao et al.
Multimodal
ๆฅ็ๆ่ฆ
The rapid progress of GANs and Diffusion Models poses new challenges for detecting AI-generated images. Although CLIP-based detectors exhibit promising generalization, they often rely on semantic cues rather than generator artifacts, leading to brittle performance under distribution shifts. In this work, we revisit the nature of semantic bias and uncover that Patch Shuffle provides an unusually strong benefit for CLIP, that disrupts global semantic continuity while preserving local artifact cues, which reduces semantic entropy and homogenizes feature distributions between natural and synthetic images. Through a detailed layer-wise analysis, we further show that CLIP's deep semantic structure functions as a regulator that stabilizes cross-domain representations once semantic bias is suppressed. Guided by these findings, we propose SemAnti, a semantic-antagonistic fine-tuning paradigm that freezes the semantic subspace and adapts only artifact-sensitive layers under shuffled semantics. Despite its simplicity, SemAnti achieves state-of-the-art cross-domain generalization on AIGCDetectBenchmark and GenImage, demonstrating that regulating semantics is key to unlocking CLIP's full potential for robust AI-generated image detection.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qirui Wang, Jingyi He, Yining Pan, Si Yong Yeo, Xulei Yang et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Spatial reasoning (SR), the ability to infer 3D spatial information from 2D inputs, is essential for real-world applications such as embodied AI and autonomous driving. However, existing research primarily focuses on indoor environments and typically relies on multi-view observations, which limits their generalizability to outdoor scenarios and constrains their applicability to monocular images, the most common real-world setting. In this work, we propose MonoSR, a large-scale monocular spatial reasoning dataset that spans diverse scenarios including indoor, outdoor, and object-centric settings, and supports multiple question types. MonoSR provides a path toward open-world monocular spatial reasoning. Beyond introducing the dataset, we evaluate advanced vision-language models to reveal their limitations on this challenging task. We further analyze whether auxiliary information is crucial for monocular spatial reasoning and offer practical guidance for designing future models. These contributions collectively establish a foundation for advancing monocular spatial reasoning in real-world, open-world environments.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minchong Chen, Xiaoyun Yuan, Junzhe Wan, Jianing Zhang, Jun Zhang
Computer Vision
ๆฅ็ๆ่ฆ
The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution and textural fidelity, leading to blurry and less informative images. Existing thermal super-resolution (SR) methods can be grouped into single-image and RGB-guided approaches: the former struggles to recover fine structures from limited information, while the latter relies on accurate and laborious cross-camera calibration, which hinders practical deployment and robustness. Here, we propose 3M-TI, a calibration-free Multi-camera cross-Modality diffusion framework for Mobile Thermal Imaging. At its core, 3M-TI integrates a cross-modal self-attention module (CSM) into the diffusion UNet, replacing the original self-attention layers to adaptively align thermal and RGB features throughout the denoising process, without requiring explicit camera calibration. This design enables the diffusion network to leverage its generative prior to enhance spatial resolution, structural fidelity, and texture detail in the super-resolved thermal images. Extensive evaluations on real-world mobile thermal cameras and public benchmarks validate our superior performance, achieving state-of-the-art results in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems. More materials: https://github.com/work-submit/3MTI.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hai Ci, Ziheng Peng, Pei Yang, Yingxin Xuan, Mike Zheng Shou
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images--we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models--local edits using eight SOTA diffusion models; 3) Multi-turn editing--each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios--a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross-generator generalization. We believe DiffSeg30k will advance research in fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohan Ramesh, Mark Azer, Fabian B. Flohr
Robotics
ๆฅ็ๆ่ฆ
Current autonomous driving (AD) simulations are critically limited by their inadequate representation of realistic and diverse human behavior, which is essential for ensuring safety and reliability. Existing benchmarks often simplify pedestrian interactions, failing to capture complex, dynamic intentions and varied responses critical for robust system deployment. To overcome this, we introduce HABIT (Human Action Benchmark for Interactive Traffic), a high-fidelity simulation benchmark. HABIT integrates real-world human motion, sourced from mocap and videos, into CARLA (Car Learning to Act, a full autonomous driving simulator) via a modular, extensible, and physically consistent motion retargeting pipeline. From an initial pool of approximately 30,000 retargeted motions, we curate 4,730 traffic-compatible pedestrian motions, standardized in SMPL format for physically consistent trajectories. HABIT seamlessly integrates with CARLA's Leaderboard, enabling automated scenario generation and rigorous agent evaluation. Our safety metrics, including Abbreviated Injury Scale (AIS) and False Positive Braking Rate (FPBR), reveal critical failure modes in state-of-the-art AD agents missed by prior evaluations. Evaluating three state-of-the-art autonomous driving agents, InterFuser, TransFuser, and BEVDriver, demonstrates how HABIT exposes planner weaknesses that remain hidden in scripted simulations. Despite achieving close or equal to zero collisions per kilometer on the CARLA Leaderboard, the autonomous agents perform notably worse on HABIT, with up to 7.43 collisions/km and a 12.94% AIS 3+ injury risk, and they brake unnecessarily in up to 33% of cases. All components are publicly released to support reproducible, pedestrian-aware AI research.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jichao Chen, YangYang Qu, Ruibo Tang, Dirk Slock
Computer Vision
ๆฅ็ๆ่ฆ
WiFi-based human pose estimation (HPE) has attracted increasing attention due to its resilience to occlusion and privacy-preserving compared to camera-based methods. However, existing WiFi-based HPE approaches often employ regression networks that directly map WiFi channel state information (CSI) to 3D joint coordinates, ignoring the inherent topological relationships among human joints. In this paper, we present GraphPose-Fi, a graph-based framework that explicitly models skeletal topology for WiFi-based 3D HPE. Our framework comprises a CNN encoder shared across antennas for subcarrier-time feature extraction, a lightweight attention module that adaptively reweights features over time and across antennas, and a graph-based regression head that combines GCN layers with self-attention to capture local topology and global dependencies. Our proposed method significantly outperforms existing methods on the MM-Fi dataset in various settings. The source code is available at: https://github.com/Cirrick/GraphPose-Fi.
๐
2025-11-24
๐ ArXiv Preprint (Multimedia)
๐ฅ Fan Nie, Jiangqun Ni, Jian Zhang, Bin Zhang, Weizhe Zhang et al.
ๆฅ็ๆ่ฆ
The widespread application of AIGC contents has brought not only unprecedented opportunities, but also potential security concerns, e.g., audio-visual deepfakes. Therefore, it is of great importance to develop an effective and generalizable method for multi-modal deepfake detection. Typically, the audio-visual correlation learning could expose subtle cross-modal inconsistencies, e.g., audio-visual misalignment, which serve as crucial clues in deepfake detection. In this paper, we reformulate the correlation learning with variational Bayesian estimation, where audio-visual correlation is approximated as a Gaussian distributed latent variable, and thus develop a novel framework for deepfake detection, i.e., Forgery-aware Audio-Visual Adaptation with Variational Bayes (FoVB). Specifically, given the prior knowledge of pre-trained backbones, we adopt two core designs to estimate audio-visual correlations effectively. First, we exploit various difference convolutions and a high-pass filter to discern local and global forgery traces from both modalities. Second, with the extracted forgery-aware features, we estimate the latent Gaussian variable of audio-visual correlation via variational Bayes. Then, we factorize the variable into modality-specific and correlation-specific ones with orthogonality constraint, allowing them to better learn intra-modal and cross-modal forgery traces with less entanglement. Extensive experiments demonstrate that our FoVB outperforms other state-of-the-art methods in various benchmarks.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fangda Chen, Jintao Tang, Pancheng Wang, Ting Wang, Shasha Li et al.
Computer Vision
ๆฅ็ๆ่ฆ
The Segment Anything Model (SAM) has recently demonstrated significant potential in medical image segmentation. Although SAM is primarily trained on 2D images, attempts have been made to apply it to 3D medical image segmentation. However, the pseudo 3D processing used to adapt SAM results in spatial feature loss, limiting its performance. Additionally, most SAM-based methods still rely on manual prompts, which are challenging to implement in real-world scenarios and require extensive external expert knowledge. To address these limitations, we introduce the Decoder Enhanced and Auto Prompt SAM (DEAP-3DSAM) to tackle these limitations. Specifically, we propose a Feature Enhanced Decoder that fuses the original image features with rich and detailed spatial information to enhance spatial features. We also design a Dual Attention Prompter to automatically obtain prompt information through Spatial Attention and Channel Attention. We conduct comprehensive experiments on four public abdominal tumor segmentation datasets. The results indicate that our DEAP-3DSAM achieves state-of-the-art performance in 3D image segmentation, outperforming or matching existing manual prompt methods. Furthermore, both quantitative and qualitative ablation studies confirm the effectiveness of our proposed modules.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Timur Mamedov, Anton Konushin, Vadim Konushin
ๆฅ็ๆ่ฆ
Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek et al.
Machine Learning
ๆฅ็ๆ่ฆ
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen et al.
Computer VisionNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Prompt-free image segmentation aims to generate accurate masks without manual guidance. Typical pre-trained models, notably Segmentation Anything Model (SAM), generate prompts directly at a single granularity level. However, this approach has two limitations: (1) Localizability, lacking mechanisms for autonomous region localization; (2) Scalability, limited fine-grained modeling at high resolution. To address these challenges, we introduce Granular Computing-driven SAM (Grc-SAM), a coarse-to-fine framework motivated by Granular Computing (GrC). First, the coarse stage adaptively extracts high-response regions from features to achieve precise foreground localization and reduce reliance on external prompts. Second, the fine stage applies finer patch partitioning with sparse local swin-style attention to enhance detail modeling and enable high-resolution segmentation. Third, refined masks are encoded as latent prompt embeddings for the SAM decoder, replacing handcrafted prompts with an automated reasoning process. By integrating multi-granularity attention, Grc-SAM bridges granular computing with vision transformers. Extensive experimental results demonstrate Grc-SAM outperforms baseline methods in both accuracy and scalability. It offers a unique granular computational perspective for prompt-free segmentation.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hai Wu, Shuai Tang, Jiale Wang, Longkun Zou, Mingyue Guo et al.
Computer Vision
ๆฅ็ๆ่ฆ
Perception of Low-Altitude Aircraft (LAA) in 3D space enables precise 3D object localization and behavior understanding. However, datasets tailored for 3D LAA perception remain scarce. To address this gap, we present LAA3D, a large-scale dataset designed to advance 3D detection and tracking of low-altitude aerial vehicles. LAA3D contains 15,000 real images and 600,000 synthetic frames, captured across diverse scenarios, including urban and suburban environments. It covers multiple aerial object categories, including electric Vertical Take-Off and Landing (eVTOL) aircraft, Micro Aerial Vehicles (MAVs), and Helicopters. Each instance is annotated with 3D bounding box, class label, and instance identity, supporting tasks such as 3D object detection, 3D multi-object tracking (MOT), and 6-DoF pose estimation. Besides, we establish the LAA3D Benchmark, integrating multiple tasks and methods with unified evaluation protocols for comparison. Furthermore, we propose MonoLAA, a monocular 3D detection baseline, achieving robust 3D localization from zoom cameras with varying focal lengths. Models pretrained on synthetic images transfer effectively to real-world data with fine-tuning, demonstrating strong sim-to-real generalization. Our LAA3D provides a comprehensive foundation for future research in low-altitude 3D object perception.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruojun Xu, Yu Kai, Xuhua Ren, Jiaxiang Cheng, Bing Ma et al.
Machine Learning
ๆฅ็ๆ่ฆ
Direct Preference Optimization (DPO) has shown promising results in aligning generative outputs with human preferences by distinguishing between chosen and rejected samples. However, a critical limitation of DPO is likelihood displacement, where the probabilities of chosen samples paradoxically decrease during training, undermining the quality of generation. Although this issue has been investigated in autoregressive models, its impact within diffusion-based models remains largely unexplored. This gap leads to suboptimal performance in tasks involving video generation. To address this, we conduct a formal analysis of DPO loss through updating policy within the diffusion framework, which describes how the updating of specific training samples influences the model's predictions on other samples. Using this tool, we identify two main failure modes: (1) Optimization Conflict, which arises from small reward margins between chosen and rejected samples, and (2) Suboptimal Maximization, caused by large reward margins. Informed by these insights, we introduce a novel solution named Policy-Guided DPO (PG-DPO), combining Adaptive Rejection Scaling (ARS) and Implicit Preference Regularization (IPR) to effectively mitigate likelihood displacement. Experiments show that PG-DPO outperforms existing methods in both quantitative metrics and qualitative evaluations, offering a robust solution for improving preference alignment in video generation tasks.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu et al.
Computer VisionNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical application. Here, we propose MedSAM-3, a text promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kai Zhenga, Zhenkai Wu, Fupeng Wei, Miaolan Zhou, Kai Lie et al.
ๆฅ็ๆ่ฆ
Accurately and swiftly assessing damage from conflicts is crucial for humanitarian aid and regional stability. In conflict zones, damaged zones often share similar architectural styles, with damage typically covering small areas and exhibiting blurred boundaries. These characteristics lead to limited data, annotation difficulties, and significant recognition challenges, including high intra-class similarity and ambiguous semantic changes. To address these issues, we introduce a pre-trained DINOv3 model and propose a multi-scale cross-attention difference siamese network (MC-DiSNet). The powerful visual representation capability of the DINOv3 backbone enables robust and rich feature extraction from bi-temporal remote sensing images. We also release a new Gaza-change dataset containing high-resolution satellite image pairs from 2023-2024 with pixel-level semantic change annotations. It is worth emphasizing that our annotations only include semantic pixels of changed areas. Unlike conventional semantic change detection (SCD), our approach eliminates the need for large-scale semantic annotations of bi-temporal images, instead focusing directly on the changed regions. We term this new task change semantic detection (CSD). The CSD task represents a direct extension of binary change detection (BCD). Due to the limited spatial extent of semantic regions, it presents greater challenges than traditional SCD tasks. We evaluated our method under the CSD framework on both the Gaza-Change and SECOND datasets. Experimental results demonstrate that our proposed approach effectively addresses the CSD task, and its outstanding performance paves the way for practical applications in rapid damage assessment across conflict zones.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gengyuan Zhang, Mingcong Ding, Jingpei Wu, Ruotong Liao, Volker Tresp
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Embodied exploration is a target-driven process that requires embodied agents to possess fine-grained perception and knowledge-enhanced decision making. While recent attempts leverage MLLMs for exploration due to their strong perceptual and reasoning abilities, we find that MLLM-based embodied agents remain suboptimal in exploring new environments: (i) they rely on profound but stale pre-trained knowledge, (ii) training-based approaches such as imitation learning or reinforcement learning are expensive for long-horizon tasks with sparse outcome rewards, and (iii) frontier-based exploration yields a large, visually nuanced action space that is difficult for MLLMs to make reliable decisions. We address these challenges with ReEXplore, a training-free framework that performs retrospective experience replay to inject distilled, abstract experience at inference time, and hierarchical frontier selection to decompose frontier ranking into coarse-to-fine decisions. Our approach enables robust, traceable, and efficient exploration. Across multiple embodied exploration benchmarks, ReEXplore yields great improvements over strong MLLM baselines, up to 3x higher performance in both success rate and in navigation efficiency under open-source backbones.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangjie Sui, Songyang Li, Hanwei Zhu, Baoliang Chen, Yuming Fang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metric fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, where a selection strategy is proposed to jointly consider the prediction inconsistency under corruption and the semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure by considering the shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) model behaviors exhibit distinguish patterns under corruptions, such as erroneous confidence and hesitation; 2) despite subtle corruption may lead to a slight accuracy gain, the overall prediction structure still degrades; 3) by decomposing corruption robustness into destructive and corrective components, the distinct failure and recovery patterns across models can be revealed.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Long Tang, Guoquan Zhen, Jie Hao, Jianbo Zhang, Huiyu Duan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking the unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, the effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes a effective quality feature decoding framework via GCN-enhanced \underline{l}ayer\underline{i}nteraction and MoE-based \underline{f}eature d\underline{e}coupling, termed \textbf{(Life-IQA)}. Specifically, the GCN-enhanced layer interaction module utilizes the GCN-enhanced deepest-layer features as query and the penultimate-layer features as key, value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations though different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks.The code is available at: \href{https://github.com/TANGLONG2/Life-IQA/tree/main}{\texttt{Life-IQA}}.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; (2) Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature learning. Two learnable parameters, ฮฑ and \b{eta}, are optimized end-to-end to balance global reasoning and local perception. Comprehensive evaluations demonstrate that Grc-ViT enhances fine-grained discrimination while achieving a superior trade-off between accuracy and computational efficiency.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wentao Qu, Guofeng Mei, Yang Wu, Yongshun Gong, Xiaoshui Huang et al.
Robotics
ๆฅ็ๆ่ฆ
Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides the soft supervision with reconstruction details for the Denoising Network (DN) in training, while decoupled in inference. In this way, T2LDM can perceive rich geometric structures from data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts for LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, by learning a conditional encoder via frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Christos Koutlis, Symeon Papadopoulos
Robotics
ๆฅ็ๆ่ฆ
With the rapid advancement of sophisticated synthetic audio-visual content, e.g., for subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations from one modality (e.g., lip movements) based on the other (e.g., audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, leading to amplified discrepancies, thereby providing robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code available at https://github.com/mever-team/auvire.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Duolikun Danier, Ge Gao, Steven McDonagh, Changjian Li, Hakan Bilen et al.
Robotics
ๆฅ็ๆ่ฆ
Video generation models have made significant progress in generating realistic content, enabling applications in simulation, gaming, and film making. However, current generated videos still contain visual artifacts arising from 3D inconsistencies, e.g., objects and structures deforming under changes in camera pose, which can undermine user experience and simulation fidelity. Motivated by recent findings on representation alignment for diffusion models, we hypothesize that improving the multi-view consistency of video diffusion representations will yield more 3D-consistent video generation. Through detailed analysis on multiple recent camera-controlled video diffusion models we reveal strong correlations between 3D-consistent representations and videos. We also propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations. We evaluate ViCoDR on camera controlled image-to-video, text-to-video, and multi-view generation models, demonstrating significant improvements in the 3D consistency of the generated videos. Project page: https://danier97.github.io/ViCoDR.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wassim Benabbas, Mohammed Brahimi, Samir Akhrouf, Bilal Fortas
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Recent advances in deep learning have enabled significant progress in plant disease classification using leaf images. Much of the existing research in this field has relied on the PlantVillage dataset, which consists of well-centered plant images captured against uniform, uncluttered backgrounds. Although models trained on this dataset achieve high accuracy, they often fail to generalize to real-world field images, such as those submitted by farmers to plant diagnostic systems. This has created a significant gap between published studies and practical application requirements, highlighting the necessity of investigating and addressing this issue. In this study, we investigate whether attention-based architectures and zero-shot learning approaches can bridge the gap between curated academic datasets and real-world agricultural conditions in plant disease classification. We evaluate three model categories: Convolutional Neural Networks (CNNs), Vision Transformers, and Contrastive Language-Image Pre-training (CLIP)-based zero-shot models. While CNNs exhibit limited robustness under domain shift, Vision Transformers demonstrate stronger generalization by capturing global contextual features. Most notably, CLIP models classify diseases directly from natural language descriptions without any task-specific training, offering strong adaptability and interpretability. These findings highlight the potential of zero-shot learning as a practical and scalable domain adaptation strategy for plant health diagnosis in diverse field environments.
๐
2025-11-24
๐ 24-page manuscript accepted to IJCV
๐ฅ Ching-Yi Lai, Chih-Yu Jian, Pei-Cheng Chuang, Chia-Ming Lee, Chih-Chung Hsu et al.
Natural Language ProcessingRoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. Although existing methods have progressed from single-modal to multimodal approaches, they face critical limitations: single-modal methods struggle with feature degradation under data compression in social media streaming, while multimodal approaches require expensive data collection and labeling and suffer from inconsistent modal quality or accessibility in real-world scenarios. To address these challenges, we propose a novel Unimodal-generated Multimodal Contrastive Learning (UMCL) framework for robust cross-compression-rate (CCR) deepfake detection. In the training stage, our approach transforms a single visual modality into three complementary features: compression-robust rPPG signals, temporal landmark dynamics, and semantic embeddings from pre-trained vision-language models. These features are explicitly aligned through an affinity-driven semantic alignment (ASA) strategy, which models inter-modal relationships through affinity matrices and optimizes their consistency through contrastive learning. Subsequently, our cross-quality similarity learning (CQSL) strategy enhances feature robustness across compression rates. Extensive experiments demonstrate that our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection. Notably, our approach maintains high detection accuracy even when individual features degrade, while providing interpretable insights into feature relationships through explicit alignment.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Santiago Moreno, Pablo Meseguer, Rocรญo del Amor, Valery Naranjo
Multimodal
ๆฅ็ๆ่ฆ
Accurate annotation of cutaneous neoplasm biopsies represents a major challenge due to their wide morphological variability, overlapping histological patterns, and the subtle distinctions between benign and malignant lesions. Vision-language foundation models (VLMs), pre-trained on paired image-text corpora, learn joint representations that bridge visual features and diagnostic terminology, enabling zero-shot localization and classification of tissue regions without pixel-level labels. However, most existing VLM applications in histopathology remain limited to slide-level tasks or rely on coarse interactive prompts, and they struggle to produce fine-grained segmentations across gigapixel whole-slide images (WSIs). In this work, we introduce a zero-shot visual-language segmentation pipeline for whole-slide images (ZEUS), a fully automated, zero-shot segmentation framework that leverages class-specific textual prompt ensembles and frozen VLM encoders to generate high-resolution tumor masks in WSIs. By partitioning each WSI into overlapping patches, extracting visual embeddings, and computing cosine similarities against text prompts, we generate a final segmentation mask. We demonstrate competitive performance on two in-house datasets, primary spindle cell neoplasms and cutaneous metastases, highlighting the influence of prompt design, domain shifts, and institutional variability in VLMs for histopathology. ZEUS markedly reduces annotation burden while offering scalable, explainable tumor delineation for downstream diagnostic workflows.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huaming Ling, Ying Wang, Si Chen, Junfeng Fan
Computer Vision
ๆฅ็ๆ่ฆ
We address two fundamental challenges in adapting general deep CNNs for FHE-based inference: approximating non-linear activations such as ReLU with low-degree polynomials while minimizing accuracy degradation, and overcoming the ciphertext capacity barrier that constrains high-resolution image processing on FHE inference. Our contributions are twofold: (1) a single-stage fine-tuning (SFT) strategy that directly converts pre-trained CNNs into FHE-friendly forms using low-degree polynomials, achieving competitive accuracy with minimal training overhead; and (2) a generalized interleaved packing (GIP) scheme that is compatible with feature maps of virtually arbitrary spatial resolutions, accompanied by a suite of carefully designed homomorphic operators that preserve the GIP-form encryption throughout computation. These advances enable efficient, end-to-end FHE inference across diverse CNN architectures. Experiments on CIFAR-10, ImageNet, and MS COCO demonstrate that the FHE-friendly CNNs obtained via our SFT strategy achieve accuracy comparable to baselines using ReLU or SiLU activations. Moreover, this work presents the first demonstration of FHE-based inference for YOLO architectures in object detection leveraging low-degree polynomial activations.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bhuvan Sachdeva, Sneha Kumari, Rudransh Agarwal, Shalaka Kumaraswamy, Niharika Singri Prasad et al.
Multimodal
ๆฅ็ๆ่ฆ
Cataract surgery is one of the most commonly performed surgeries worldwide, yet intraoperative complications such as iris prolapse, posterior capsule rupture (PCR), and vitreous loss remain major causes of adverse outcomes. Automated detection of such events could enable early warning systems and objective training feedback. In this work, we propose CataractCompDetect, a complication detection framework that combines phase-aware localization, SAM 2-based tracking, complication-specific risk scoring, and vision-language reasoning for final classification. To validate CataractCompDetect, we curate CataComp, the first cataract surgery video dataset annotated for intraoperative complications, comprising 53 surgeries, including 23 with clinical complications. On CataComp, CataractCompDetect achieves an average F1 score of 70.63%, with per-complication performance of 81.8% (Iris Prolapse), 60.87% (PCR), and 69.23% (Vitreous Loss). These results highlight the value of combining structured surgical priors with vision-language reasoning for recognizing rare but high-impact intraoperative events. Our dataset and code will be publicly released upon acceptance.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). However, this history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP that the action generation should be conditioned on the belief state. AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute the soft weights to actively process task-relevant visual tokens based on its historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jianhao Zeng, Yancheng Bai, Ruidong Chen, Xuanpu Zhang, Lei Sun et al.
ๆฅ็ๆ่ฆ
Video virtual try-on technology provides a cost-effective solution for creating marketing videos in fashion e-commerce. However, its practical adoption is hindered by two critical limitations. First, the reliance on a single garment image as input in current virtual try-on datasets limits the accurate capture of realistic texture details. Second, most existing methods focus solely on generating full-shot virtual try-on videos, neglecting the business's demand for videos that also provide detailed close-ups. To address these challenges, we introduce a high-resolution dataset for video-based virtual try-on. This dataset offers two key features. First, it provides more detailed information on the garments, which includes high-fidelity images with detailed close-ups and textual descriptions; Second, it uniquely includes full-shot and close-up try-on videos of real human models. Furthermore, accurately assessing consistency becomes significantly more critical for the close-up videos, which demand high-fidelity preservation of garment details. To facilitate such fine-grained evaluation, we propose a new garment consistency metric VGID (Video Garment Inception Distance) that quantifies the preservation of both texture and structure. Our experiments validate these contributions. We demonstrate that by utilizing the detailed images from our dataset, existing video generation models can extract and incorporate texture features, significantly enhancing the realism and detail fidelity of virtual try-on results. Furthermore, we conduct a comprehensive benchmark of recent models. The benchmark effectively identifies the texture and structural preservation problems among current methods.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Juntao Gao, Feiyang Ye, Jing Zhang, Wenjing Qian
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. While standard token pruning techniques can alleviate this, these task-agnostic methods struggle to preserve task-critical visual information. To address this challenge, simultaneously preserving both the holistic context and fine-grained details for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The proposed Compressor-VLA framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. This compression is dynamically modulated by the natural language instruction, allowing for the adaptive condensation of task-relevant visual information. Experimentally, extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. The real-robot deployments on a dual-arm robot platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that our instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, thereby validating the effectiveness of our approach.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Josรฉ Teixeira, Pascal Klรถckner, Diana Montezuma, Melis Erdal Cesur, Joรฃo Fraga et al.
ๆฅ็ๆ่ฆ
In addition to evaluating tumor morphology using H&E staining, immunohistochemistry is used to assess the presence of specific proteins within the tissue. However, this is a costly and labor-intensive technique, for which virtual staining, as an image-to-image translation task, offers a promising alternative. Although recent, this is an emerging field of research with 64% of published studies just in 2024. Most studies use publicly available datasets of H&E-IHC pairs from consecutive tissue sections. Recognizing the training challenges, many authors develop complex virtual staining models based on conditional Generative Adversarial Networks, but ignore the impact of adversarial loss on the quality of virtual staining. Furthermore, overlooking the issues of model evaluation, they claim improved performance based on metrics such as SSIM and PSNR, which are not sufficiently robust to evaluate the quality of virtually stained images. In this paper, we developed CSSP2P GAN, which we demonstrate to achieve heightened pathological fidelity through a blind pathological expert evaluation. Furthermore, while iteratively developing our model, we study the impact of the adversarial loss and demonstrate its crucial role in the quality of virtually stained images. Finally, while comparing our model with reference works in the field, we underscore the limitations of the currently used evaluation metrics and demonstrate the superior performance of CSSP2P GAN.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zong-Wei Hong, Jing-lun Li, Lin-Ze Li, Shen Zhang, Yao Tang
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Flow Matching (FM) has recently emerged as a principled and efficient alternative to diffusion models. Standard FM encourages the learned velocity field to follow a target direction; however, it may accumulate errors along the trajectory and drive samples off the data manifold, leading to perceptual degradation, especially in lightweight or low-step configurations.
To enhance stability and generalization, we extend FM into a balanced attract-repel scheme that provides explicit guidance on both "where to go" and "where not to go." To be formal, we propose \textbf{Velocity Contrastive Regularization (VeCoR)}, a complementary training scheme for flow-based generative modeling that augments the standard FM objective with contrastive, two-sided supervision. VeCoR not only aligns the predicted velocity with a stable reference direction (positive supervision) but also pushes it away from inconsistent, off-manifold directions (negative supervision). This contrastive formulation transforms FM from a purely attractive, one-sided objective into a two-sided training signal, regularizing trajectory evolution and improving perceptual fidelity across datasets and backbones.
On ImageNet-1K 256$\times$256, VeCoR yields 22\% and 35\% relative FID reductions on SiT-XL/2 and REPA-SiT-XL/2 backbones, respectively, and achieves further FID gains (32\% relative) on MS-COCO text-to-image generation, demonstrating consistent improvements in stability, convergence, and image quality, particularly in low-step and lightweight settings. Project page: https://p458732.github.io/VeCoR_Project_Page/
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zijian Song, Xiaoxin Lin, Tao Pu, Zhenlong Yuan, Guangrun Wang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that directly assist humans in open-future scenarios, where human intentions are highly concurrent and dynamic. In this work, we formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing particularly on identifying tasks that reduce human effort across multiple plausible futures. To facilitate this study, we propose an HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. Additionally, we propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes the complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on the HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving performance.
๐
2025-11-24
๐ AAAI 2026
๐ฅ Keming Shen, Bizhu Wu, Junliang Chen, Xiaoqin Wang, Linlin Shen
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduces misaligned details and lacks explicit temporal cues, and the latter incurs significant computational cost when converting coordinates to standard motion representations. To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally-aware, precise, user-friendly, and fine-grained textual control signals that describe specific body part movements over time. In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability. Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yash Mali
Natural Language Processing
ๆฅ็ๆ่ฆ
Test-time adaptation (TTA) enables models to adapt to distribution shifts at inference time. While entropy minimization over the output distribution has proven effective for TTA, transformers offer an additional unsupervised learning signal through their attention mechanisms. We propose minimizing the entropy of attention distributions from the CLS token to image patches as a novel TTA objective.This approach encourages the model to attend more confidently to relevant image regions under distribution shift and is effective even when only a single test image is available. We demonstrate that attention entropy minimization improves robustness across diverse corruption types while not hurting performance on clean data on a single sample stream of images at test time.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenxing Mi, Yuxin Wang, Dan Xu
Robotics
ๆฅ็ๆ่ฆ
We present One4D, a unified framework for 4D generation and reconstruction that produces dynamic 4D content as synchronized RGB frames and pointmaps. By consistently handling varying sparsities of conditioning frames through a Unified Masked Conditioning (UMC) mechanism, One4D can seamlessly transition between 4D generation from a single image, 4D reconstruction from a full video, and mixed generation and reconstruction from sparse frames. Our framework adapts a powerful video generation model for joint RGB and pointmap generation, with carefully designed network architectures. The commonly used diffusion finetuning strategies for depthmap or pointmap reconstruction often fail on joint RGB and pointmap generation, quickly degrading the base video model. To address this challenge, we introduce Decoupled LoRA Control (DLC), which employs two modality-specific LoRA adapters to form decoupled computation branches for RGB frames and pointmaps, connected by lightweight, zero-initialized control links that gradually learn mutual pixel-level consistency. Trained on a mixture of synthetic and real 4D datasets under modest computational budgets, One4D produces high-quality RGB frames and accurate pointmaps across both generation and reconstruction tasks. This work represents a step toward general, high-quality geometry-based 4D world modeling using video diffusion models. Project page: https://mizhenxing.github.io/One4D
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Juncheng Li, Yige Li, Hanxun Huang, Yunhao Chen, Xin Wang et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Backdoor attacks undermine the reliability and trustworthiness of machine learning systems by injecting hidden behaviors that can be maliciously activated at inference time. While such threats have been extensively studied in unimodal settings, their impact on multimodal foundation models, particularly vision-language models (VLMs), remains largely underexplored. In this work, we introduce \textbf{BackdoorVLM}, the first comprehensive benchmark for systematically evaluating backdoor attacks on VLMs across a broad range of settings. It adopts a unified perspective that injects and analyzes backdoors across core vision-language tasks, including image captioning and visual question answering. BackdoorVLM organizes multimodal backdoor threats into 5 representative categories: targeted refusal, malicious injection, jailbreak, concept substitution, and perceptual hijack. Each category captures a distinct pathway through which an adversary can manipulate a model's behavior. We evaluate these threats using 12 representative attack methods spanning text, image, and bimodal triggers, tested on 2 open-source VLMs and 3 multimodal datasets. Our analysis reveals that VLMs exhibit strong sensitivity to textual instructions, and in bimodal backdoors the text trigger typically overwhelms the image trigger when forming the backdoor mapping. Notably, backdoors involving the textual modality remain highly potent, with poisoning rates as low as 1\% yielding over 90\% success across most tasks. These findings highlight significant, previously underexplored vulnerabilities in current VLMs. We hope that BackdoorVLM can serve as a useful benchmark for analyzing and mitigating multimodal backdoor threats. Code is available at: https://github.com/bin015/BackdoorVLM .
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenhao Xu, Xin Dong, Yue Li, Haoyuan Shi, Zhiwei Xiong
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruiying Liu, Yuanzhi Liang, Haibin Huang, Tianshu Yu, Chi Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many to many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
๐
2025-11-24
๐ ArXiv Preprint (cs.GR)
๐ฅ Xiuchao Wu, Pengfei Zhu, Jiangjing Lyu, Xinguo Liu, Jie Guo et al.
Machine Learning
ๆฅ็ๆ่ฆ
Applying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose \ttt, a novel material reconstruction framework for 3D objects, offering the following advantages. First, \ttt\ adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), \ttt\ enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, \ttt\ achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stability across various types of objects. Extensive experiments demonstrate that \ttt\ achieves superior performance in material reconstruction compared to existing methods.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenyu Mu, Guihai Chen, Xun Yang, Erkun Yang, Cheng Deng
Computer Vision
ๆฅ็ๆ่ฆ
Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, which lead to instability in model training. Existing methods typically rely on global noise assumptions or confidence-based sample selection, which inadequately mitigate the performance degradation caused by annotation noise, especially in challenging boundary regions. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy ground-truth labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg consistently outperforms existing state-of-the-art methods.
๐
2025-11-24
๐ AAAI
๐ฅ Qian Jiang, Qianqian Wang, Xin Jin, Michal Wozniak, Shaowen Yao et al.
ๆฅ็ๆ่ฆ
Remote sensing images are becoming increasingly widespread in military, earth resource exploration. Because of the limitation of a single sensor, we can obtain high spatial resolution grayscale panchromatic (PAN) images and low spatial resolution color multispectral (MS) images. Therefore, an important issue is to obtain a color image with high spatial resolution when there is only a PAN image at the input. The existing methods improve spatial resolution using super-resolution (SR) technology and spectral recovery using colorization technology. However, the SR technique cannot improve the spectral resolution, and the colorization technique cannot improve the spatial resolution. Moreover, the pansharpening method needs two registered inputs and can not achieve SR. As a result, an integrated approach is expected. To solve the above problems, we designed a novel multi-function model (MFmamba) to realize the tasks of SR, spectral recovery, joint SR and spectral recovery through three different inputs. Firstly, MFmamba utilizes UNet++ as the backbone, and a Mamba Upsample Block (MUB) is combined with UNet++. Secondly, a Dual Pool Attention (DPA) is designed to replace the skip connection in UNet++. Finally, a Multi-scale Hybrid Cross Block (MHCB) is proposed for initial feature extraction. Many experiments show that MFmamba is competitive in evaluation metrics and visual results and performs well in the three tasks when only the input PAN image is used.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li et al.
ๆฅ็ๆ่ฆ
Recent interactive video world model methods generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they fail to fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which results in structural instability under viewpoint changes. Second, they easily forget historical information during multi-step interaction, resulting in error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors and historical retrieval. MagicWorld starts from a single scene image, employs user actions to drive dynamic scene evolution, and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model utilize past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Ayca Duran, Christoph Waibel, Bernd Bickel, Iro Armeni, Arno Schlueter
Machine Learning
ๆฅ็ๆ่ฆ
Building integrated photovoltaic (BIPV) facades represent a promising pathway towards urban decarbonization, especially where roof areas are insufficient and ground-mounted arrays are infeasible. Although machine learning-based approaches to support photovoltaic (PV) planning on rooftops are well researched, automated approaches for facades still remain scarce and oversimplified. This paper therefore presents a pipeline that integrates detailed information on the architectural composition of the facade to automatically identify suitable surfaces for PV application and estimate the solar energy potential. The pipeline fine-tunes SegFormer-B5 on the CMP Facades dataset and converts semantic predictions into facade-level PV suitability masks and PV panel layouts considering module sizes and clearances. Applied to a dataset of 373 facades with known dimensions from ten cities, the results show that installable BIPV potential is significantly lower than theoretical potential, thus providing valuable insights for reliable urban energy planning. With the growing availability of facade imagery, the proposed pipeline can be scaled to support BIPV planning in cities worldwide.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wengyi Zhan, Mingbao Lin, Zhihang Lin, Rongrong Ji
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) deliver impressive vision-language reasoning but suffer steep inference latency because self-attention scales quadratically with sequence length and thousands of visual tokens contributed by high-resolution images. Naively pruning less-informative visual tokens reduces this burden, yet indiscriminate removal can strip away contextual cues essential for background or fine-grained questions, undermining accuracy. In this paper, we present ParVTS (Parallel Vision Token Scheduling), a training-free scheduling framework that partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference to reduce computation. This scheduling reduces computational complexity, requires no heuristics or additional modules, and is compatible with diverse existing MLLM architectures. Experiments across multiple MLLM backbones show that ParVTS prunes up to 88.9% of visual tokens with minimal performance drop, achieving 1.77x speedup and 70% FLOPs reduction.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuzhi Chen, Yuanchang Xie, Lei Zhao, Pan Liu, Yajie Zou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal trajectory prediction generates multiple plausible future trajectories to address vehicle motion uncertainty from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, causing prediction failures. Map-free approaches lack global context, with pairwise attention over-amplifying straight patterns while suppressing transitional patterns, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation achieving intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs while a neighbor-context-enhanced pathway emphasizes salient interactions, with gating module mediating their contributions to maintain coverage-focus balance. Experiments on eight highway-ramp scenarios from TOD-VT dataset show GContextFormer outperforms state-of-the-art baselines. Compared to existing transformer models, GContextFormer achieves greater robustness and concentrated improvements in high-curvature and transition zones via spatial distributions. Interpretability is achieved through motion mode distinctions and neighbor context modulation exposing reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e. equipping text-only DeepSeek-R1 with Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikita Neveditsin, Pawan Lingras, Vijay Mago
Natural Language Processing
ๆฅ็ๆ่ฆ
Clustering short text embeddings is a foundational task in natural language processing, yet remains challenging due to the need to specify the number of clusters in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to efficiently scale to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and Cohesion Ratio for unsupervised organization and evaluation of short text data. Implementation of our estimator of k and Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shaltiel Shmidman, Asher Fredman, Oleg Sudakov, Meriem Bendris
Natural Language Processing
ๆฅ็ๆ่ฆ
Test-time scaling, which leverages additional computation during inference to improve model accuracy, has enabled a new class of Large Language Models (LLMs) that are able to reason through complex problems by understanding the goal, turning this goal into a plan, working through intermediate steps, and checking their own work before answering . Frontier large language models with reasoning capabilities, such as DeepSeek-R1 and OpenAI's gpt-oss, follow the same procedure when solving complex problems by generating intermediate reasoning traces before giving the final answer. Today, these models are being increasingly used to generate reasoning traces that serve as high-quality supervised data for post-training of small and medium-sized language models to teach reasoning capabilities without requiring expensive human curation. In this work, we compare the performance of medium-sized LLMs on Math problems after post-training on two kinds of reasoning traces. We compare the impact of reasoning traces generated by DeepSeek-R1 and gpt-oss LLMs in terms of accuracy and inference efficiency.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Olivia Macmillan-Scott, Roksana Goworek, Eda B. รzyiฤit
Natural Language Processing
ๆฅ็ๆ่ฆ
Query expansion is the reformulation of a user query by adding semantically related information, and is an essential component of monolingual and cross-lingual information retrieval used to ensure that relevant documents are not missed. Recently, multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. Pseudo-documents both introduce additional relevant terms and bridge the gap between short queries and long documents, which is particularly beneficial in dense retrieval. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance. Results show that query length largely determines which prompting technique is effective, and that more elaborate prompts often do not yield further gains. Substantial linguistic disparities persist: cross-lingual query expansion can produce the largest improvements for languages with the weakest baselines, yet retrieval is especially poor between languages written in different scripts. Fine-tuning is found to lead to performance gains only when the training and test data are of similar format. These outcomes underline the need for more balanced multilingual and cross-lingual training and evaluation resources.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Roksana Goworek, Olivia Macmillan-Scott, Eda B. รzyiฤit
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Cross-lingual information retrieval (CLIR) enables access to multilingual knowledge but remains challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models. Existing pipelines often rely on translation and monolingual retrieval heuristics, which add computational overhead and noise, degrading performance. This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets. We find that dense retrieval models trained specifically for CLIR consistently outperform lexical matching methods and derive little benefit from document translation. Contrastive learning mitigates language biases and yields substantial improvements for encoders with weak initial alignment, and re-ranking can be effective, but depends on the quality of the cross-encoder training data. Although high-resource languages still dominate overall performance, gains over lexical and document-translated baselines are most pronounced for low-resource and cross-script pairs. These findings indicate that cross-lingual search systems should prioritise semantic multilingual embeddings and targeted learning-based alignment over translation-based pipelines, particularly for cross-script and under-resourced languages.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Md. Tanzim Ferdous, Naeem Ahsan Chowdhury, Prithwiraj Bhattacharjee
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This study developed a new Bangla abstractive summarization dataset to generate concise summaries of Bangla articles from diverse sources. Most existing studies in this field have concentrated on news articles, where journalists usually follow a fixed writing style. While such approaches are effective in limited contexts, they often fail to adapt to the varied nature of real-world Bangla texts. In today's digital era, a massive amount of Bangla content is continuously produced across blogs, newspapers, and social media. This creates a pressing need for summarization systems that can reduce information overload and help readers understand content more quickly. To address this challenge, we developed a dataset of over 54,000 Bangla articles and summaries collected from multiple sources, including blogs such as Cinegolpo and newspapers such as Samakal and The Business Standard. Unlike single-domain resources, our dataset spans multiple domains and writing styles. It offers greater adaptability and practical relevance. To establish strong baselines, we trained and evaluated this dataset using several deep learning and transfer learning models, including LSTM, BanglaT5-small, and MTS-small. The results highlight its potential as a benchmark for future research in Bangla natural language processing. This dataset provides a solid foundation for building robust summarization systems and helps expand NLP resources for low-resource languages.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiayi Zhang, Yiran Peng, Fanqi Kong, Yang Cheng, Yifan Wu et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decrease as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is avaiable at https://github.com/FoundationAgents/AutoEnv.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Victor Rambaud, Salvador Mascarenhas, Yair Lakretz
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new architectures based on Transformer models, which can learn cognitive maps from observational data and perform path integration in parallel, in a self-supervised manner. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved naturally by updating the positional encoding in Transformers with input-dependent matrices. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several tasks, including a classic 2D navigation task, showing that our models can learn a cognitive map of the underlying space and generalize OOD (e.g., to longer sequences) with near-perfect performance, unlike current architectures. Together, these results demonstrate the superiority of models designed to learn a cognitive map, and the importance of introducing a structural bias for structure-content disentanglement, which can be achieved in Transformers with input-dependent positional encoding. MapFormers have broad applications in both neuroscience and AI, by explaining the neural mechanisms giving rise to cognitive maps, while allowing these relation models to be learned at scale.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kyle Verrier, Achille Nazaret, Joseph Futoma, Andrew C. Miller, Guillermo Sapiro
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Hunger and satiety dynamics shape dietary behaviors and metabolic health, yet remain difficult to capture in everyday settings. We present a Nutrition Photoplethysmography Language Model (NPLM), integrating continuous photoplethysmography (PPG) from wearables with meal descriptions. NPLM projects PPG into embeddings interpretable by language models, enabling joint reasoning over physiology and meal context. Trained on 19,340 participants and 1.1 million meal-PPG pairs, the model improved daily caloric intake prediction by 11% over text-only baselines, with accuracy maintained when 80% of meal text was removed. In an independent validation study (n=140) with controlled dining and detailed meal information, the model replicated these findings. These results demonstrate the value of integrating physiological measurements from consumer wearables with meal information for noninvasive dietary monitoring at scale.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Christos-Nikolaos Zacharopoulos, Revekka Kyriakoglou
Natural Language Processing
ๆฅ็ๆ่ฆ
How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated the causal language model (phi-2) using a carefully curated corpus, with sentences that concluded plausibly or implausibly. Our analysis focused on the hidden states sampled at each model layer. To investigate how violations are encoded, we utilized two complementary probes. First, we conducted a per-layer detection using a linear probe. Our findings revealed that a simple linear decoder struggled to distinguish between plausible and implausible endings in the lowest third of the model's layers. However, its accuracy sharply increased in the middle blocks, reaching a peak just before the top layers. Second, we examined the effective dimensionality of the encoded violation. Initially, the violation widens the representational subspace, followed by a collapse after a mid-stack bottleneck. This might indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results contemplate the idea of alignment with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, occurring later in the online processing sequence.
๐
2025-11-24
๐ EMNLP 2025
๐ฅ Deyi Ji, Yuekui Yang, Liqun Liu, Peng Shu, Haiyang Wu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on both public and proprietary datasets, on both offline scenarios and online deployed A/B Testing, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models like RAVEN in terms of fine-grained violation understanding, reasoning capabilities, and generalization ability.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Samantha Dies, Courtney Maynard, Germans Savcisens, Tina Eliassi-Rad
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) are widely used for factual tasks such as "What treats asthma?" or "What is the capital of Latvia?". However, it remains unclear how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability as the robustness of an LLM's veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM's activations to separate true from not-true statements and (ii) measuring how its learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of neither statements. The first are fact-like assertions about entities we believe to be absent from any training data. We call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts. We call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to $40\%$ flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes ($\leq 8.2\%$). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form. More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Moazzam Umer Gondal, Hamad Ul Qudous, Daniya Siddiqui, Asma Ahmad Farhan
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
This paper introduces the retrieval-augmented framework for automatic fashion caption and hashtag generation, combining multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically interesting text for fashion imagery, overcoming the limitations of end-to-end captioners that have problems with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant color extraction, and a CLIP-FAISS retrieval module for fabric and gender attribute inference based on a structured product index. These attributes, together with retrieved style examples, create a factual evidence pack that is used to guide an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model is used as a supervised baseline model for comparison. Experimental results show that the YOLO detector is able to obtain a mean Average Precision (mAP@0.5) of 0.71 for nine categories of garments. The RAG-LLM pipeline generates expressive attribute-aligned captions and achieves mean attribute coverage of 0.80 with full coverage at the 50% threshold in hashtag generation, whereas BLIP gives higher lexical overlap and lower generalization. The retrieval-augmented approach exhibits better factual grounding, less hallucination, and great potential for scalable deployment in various clothing domains. These results demonstrate the use of retrieval-augmented generation as an effective and interpretable paradigm for automated and visually grounded fashion content generation.
๐
2025-11-24
๐ AAAI 2026
๐ฅ Zijian Wang, Yanxiang Ma, Chang Xu
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Chain-of-Thought (CoT) reasoning is a critical capability for large language models (LLMs), enabling them to tackle com- plex multi-step tasks. While base LLMs, pre-trained on general text corpora, often struggle with reasoning due to a lack of specialized training, recent studies reveal their latent reason- ing potential tied to hidden states. However, existing hidden state manipulation methods, such as linear activation steering, suffer from limitations due to their rigid and unconstrained nature, often leading to distribution shifts and degraded text quality. In this work, we propose a novel approach for elic- iting CoT reasoning from base LLMs through hidden state manipulation grounded in probabilistic conditional generation. By reformulating the challenge as an optimization problem with a balanced likelihood and prior regularization framework, our method guides hidden states toward reasoning-oriented trajectories while preserving linguistic coherence. Extensive evaluations across mathematical, commonsense, and logical reasoning benchmarks demonstrate that our approach con- sistently outperforms existing steering methods, offering a theoretically principled and effective solution for enhancing reasoning capabilities in base LLMs.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yaping Chai, Haoran Xie, Joe S. Qin
Natural Language Processing
ๆฅ็ๆ่ฆ
Aspect category sentiment analysis (ACSA) has achieved remarkable progress with large language models (LLMs), yet existing approaches primarily emphasize sentiment polarity while overlooking the underlying emotional dimensions that shape sentiment expressions. This limitation hinders the model's ability to capture fine-grained affective signals toward specific aspect categories. To address this limitation, we introduce a novel emotion-enhanced multi-task ACSA framework that jointly learns sentiment polarity and category-specific emotions grounded in Ekman's six basic emotions. Leveraging the generative capabilities of LLMs, our approach enables the model to produce emotional descriptions for each aspect category, thereby enriching sentiment representations with affective expressions. Furthermore, to ensure the accuracy and consistency of the generated emotions, we introduce an emotion refinement mechanism based on the Valence-Arousal-Dominance (VAD) dimensional framework. Specifically, emotions predicted by the LLM are projected onto a VAD space, and those inconsistent with their corresponding VAD coordinates are re-annotated using a structured LLM-based refinement strategy. Experimental results demonstrate that our approach significantly outperforms strong baselines on all benchmark datasets. This underlines the effectiveness of integrating affective dimensions into ACSA.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Phong Le, Mees Lindeman, Raquel G. Alhama
ๆฅ็ๆ่ฆ
The structure of naming systems in natural languages hinges on a trade-off between high informativeness and low complexity. Prior work capitalizes on information theory to formalize these notions; however, these studies generally rely on two simplifications: (i) optimal listeners, and (ii) universal communicative need across languages. Here, we address these limitations by introducing an information-theoretic framework for discrete object naming systems, and we use it to prove that an optimal trade-off is achievable if and only if the listener's decoder is equivalent to the Bayesian decoder of the speaker. Adopting a referential game setup from emergent communication, and focusing on the semantic domain of kinship, we show that our notion of optimality is not only theoretically achievable but also emerges empirically in learned communication systems.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Juan-Josรฉ Guzmรกn-Landa, Jesรบs Vรกzquez-Osorio, Juan-Manuel Torres-Moreno, Ligia Quintana Torres, Miguel Figueroa-Saavedra et al.
ๆฅ็ๆ่ฆ
In this paper, we describe a symbolic model for the automatic orthographic unification of Nawatl text documents. Our model is based on algorithms that we have previously used to analyze sentences in Nawatl, and on the corpus called $ฯ$-yalli, consisting of texts in several Nawatl orthographies. Our automatic unification algorithm implements linguistic rules in symbolic regular expressions. We also present a manual evaluation protocol that we have proposed and implemented to assess the quality of the unified sentences generated by our algorithm, by testing in a sentence semantic task. We have obtained encouraging results from the evaluators for most of the desired features of our artificially unified sentences
๐
2025-11-24
๐ AAAI 2026
๐ฅ Wenxuan Mu, Jinzhong Ning, Di Zhao, Yijia Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
In-context learning (ICL) with large language models (LLMs) has emerged as a promising paradigm for named entity recognition (NER) in low-resource scenarios. However, existing ICL-based NER methods suffer from three key limitations: (1) reliance on dynamic retrieval of annotated examples, which is problematic when annotated data is scarce; (2) limited generalization to unseen domains due to the LLM's insufficient internal domain knowledge; and (3) failure to incorporate external knowledge or resolve entity ambiguities. To address these challenges, we propose KDR-Agent, a novel multi-agent framework for multi-domain low-resource in-context NER that integrates Knowledge retrieval, Disambiguation, and Reflective analysis. KDR-Agent leverages natural-language type definitions and a static set of entity-level contrastive demonstrations to reduce dependency on large annotated corpora. A central planner coordinates specialized agents to (i) retrieve factual knowledge from Wikipedia for domain-specific mentions, (ii) resolve ambiguous entities via contextualized reasoning, and (iii) reflect on and correct model predictions through structured self-assessment. Experiments across ten datasets from five domains demonstrate that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines across multiple LLM backbones. The code and data can be found at https://github.com/MWXGOD/KDR-Agent.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yutong Li, Yitian Zhou, Xudong Wang, GuoChen, Caiyan Qin
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hayami Takahashi, Kensuke Takahashi
ๆฅ็ๆ่ฆ
In expressing emotions, as an expression form separate from natural language, we propose an alternative form that complements natural language, acting as a proxy or window for emotional states. First, we set up an expression form "Effect of Contradictory Structure." "Effect of Contradictory Structure" is not static but dynamic. Effect in "Effect of Contradictory Structure" is unpleasant or pleasant, and the orientation to avoid that unpleasantness is considered pseudo-expression of will. Second, "Effect of Contradictory Structure" can be overlapped with each other. This overlapping operation is called "montage." A broader "Structure" that includes related "Effect of Contradictory Structure" and "Effect of Structure" are set up. Montage produces "Effect of Structure". In montage, it is necessary to set something like "strength," so we adopted Deleuze and Deleuze/Guattari's word "intensity" and set it as an element of our model. We set up a general theoretical framework - Word Import Between Systems (Models) and justified the import of "intensity" through Austin's use of the word "force." "Effect of Structure" process is demonstrated using the example of proceeding to the next level of education.
๐
2025-11-24
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, Wen Jiang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models demonstrate powerful capabilities across various natural language processing tasks, yet they also harbor safety vulnerabilities. To enhance LLM safety, various jailbreak defense methods have been proposed to guard against harmful outputs. However, improvements in model safety often come at the cost of severe over-refusal, failing to strike a good balance between safety and usability. In this paper, we first analyze the causes of over-refusal from a representation perspective, revealing that over-refusal samples reside at the boundary between benign and malicious samples. Based on this, we propose MOSR, designed to mitigate over-refusal by intervening the safety representation of LLMs. MOSR incorporates two novel components: (1) Overlap-Aware Loss Weighting, which determines the erasure weight for malicious samples by quantifying their similarity to pseudo-malicious samples in the representation space, and (2) Context-Aware Augmentation, which supplements the necessary context for rejection decisions by adding harmful prefixes before rejection responses. Experiments demonstrate that our method outperforms existing approaches in mitigating over-refusal while largely maintaining safety. Overall, we advocate that future defense methods should strike a better balance between safety and over-refusal.
๐
2025-11-24
๐ Accepted at the IEEE conference on Big Data (Speci
๐ฅ Zineddine Tighidet, Lazhar Labiod, Mohamed Nadif
ๆฅ็ๆ่ฆ
The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian models are often used and the Expectation-Maximization (EM) algorithm is particularly suitable for estimating parameters from which clustering is inferred. If these models are particularly popular in various domains including image clustering, they however suffer from the dimensionality and also from the slowness of convergence of the EM algorithm. However, the Classification EM (CEM) algorithm, a classifying version, offers a fast convergence solution while dimensionality reduction still remains a challenge. Thus we propose in this paper an algorithm combining simultaneously and non-sequentially the two tasks --Data embedding and Clustering-- relying on Principal Component Analysis (PCA) and CEM. We demonstrate the interest of such approach in terms of clustering and data embedding. We also establish different connections with other clustering approaches.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Francois Vandenhende, Anna Georgiou, Michalis Georgiou, Theodoros Psaras, Ellie Karekla et al.
ๆฅ็ๆ่ฆ
We present a graphical, knowledge-based method for reviewing treatment-emergent adverse events (AEs) in clinical trials. The approach enhances MedDRA by adding a hidden medical knowledge layer (Safeterm) that captures semantic relationships between terms in a 2-D map. Using this layer, AE Preferred Terms can be regrouped automatically into similarity clusters, and their association to the trial disease may be quantified. The Safeterm map is available online and connected to aggregated AE incidence tables from ClinicalTrials.gov. For signal detection, we compute treatment-specific disproportionality metrics using shrinkage incidence ratios. Cluster-level EBGM values are then derived through precision-weighted aggregation. Two visual outputs support interpretation: a semantic map showing AE incidence and an expectedness-versus-disproportionality plot for rapid signal detection. Applied to three legacy trials, the automated method clearly recovers all expected safety signals. Overall, augmenting MedDRA with a medical knowledge layer improves clarity, efficiency, and accuracy in AE interpretation for clinical trials.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Santhosh G S, Saurav Prakash, Balaraman Ravindran
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques like token eviction, quantization, or other low-rank methods often risk information loss, have fixed limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive 50-60% memory savings per-token on KV-cache. A key advantage is its runtime-tunable compression level, allowing operators to dynamically adjust the memory footprint, a flexibility absent in methods requiring fixed offline configurations. This combination of a decompression-free design, high performance under compression, and adaptability makes SWAN a practical and efficient solution for serving LLMs with long contexts.
๐
2025-11-24
๐ EMNLP 2025
๐ฅ Yuchen Ji, Bo Xu, Jie Shi, Jiaqing Liang, Deqing Yang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The task of translating natural language questions into query languages has long been a central focus in semantic parsing. Recent advancements in Large Language Models (LLMs) have significantly accelerated progress in this field. However, existing studies typically focus on a single query language, resulting in methods with limited generalizability across different languages. In this paper, we formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework that explicitly diagnoses model-specific weaknesses in handling these skeletons to synthesize targeted training data. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance using only a small amount of synthesized data, highlighting the efficiency and generality of our approach and laying a solid foundation for unified research on Text-to-Query tasks. We release our code at https://github.com/jjjycaptain/Skeletron.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sahil Kale
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern large language models integrate web search to provide real-time answers, yet it remains unclear whether they are efficiently calibrated to use search when it is actually needed. We introduce a benchmark evaluating both the necessity and effectiveness of web access across commercial models with no access to internal states or parameters. The dataset includes a static split of 783 temporally anchored questions answerable from pre-cutoff knowledge, aimed at testing whether models invoke search based on low internal confidence, and a dynamic split of 288 post-cutoff queries designed to test whether models recognise when search is required and retrieve updated information. Web access substantially improves static accuracy for GPT-5-mini and Claude Haiku 4.5, though confidence calibration worsens. On dynamic queries, both models frequently invoke search yet remain below 70 percent accuracy due to weak query formulation. Costs per accuracy-improving call remain low, but returns diminish once initial retrieval fails. Selective invocation helps, but models become overconfident and inconsistent after search. Overall, built-in web search meaningfully improves factual accuracy and can be invoked selectively, yet models remain overconfident, skip retrieval when it is essential, and falter once initial search queries underperform. Taken together, internal web search works better as a good low-latency verification layer than a reliable analytical tool, with clear room for improvement.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
๐
2025-11-24
๐ ICLR 2024
๐ฅ Adam Rychert, Gasper Spagnolo, Evgenii Posashkov
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In this reproducibility study, we revisit the LLAMBO framework of Daxberger et al. (2024), a prompting-based Bayesian optimization (BO) method that uses large language models as discriminative surrogates and acquisition optimizers via text-only interactions. We replicate the core Bayesmark and HPOBench experiments under the original evaluation protocol, but replace GPT-3.5 with the open-weight Llama 3.1 70B model used for all text encoding components.
Our results broadly confirm the main claims of LLAMBO. Contextual warm starting via textual problem and hyperparameter descriptions substantially improves early regret behaviour and reduces variance across runs. LLAMBO's discriminative surrogate is weaker than GP or SMAC as a pure single task regressor, yet benefits from cross task semantic priors induced by the language model. Ablations that remove textual context markedly degrade predictive accuracy and calibration, while the LLAMBO candidate sampler consistently generates higher quality and more diverse proposals than TPE or random sampling. Experiments with smaller backbones (Gemma 27B, Llama 3.1 8B) yield unstable or invalid predictions, suggesting insufficient capacity for reliable surrogate behaviour.
Overall, our study shows that the LLAMBO architecture is robust to changing the language model backbone and remains effective when instantiated with Llama 3.1 70B.
๐
2025-11-24
๐ ACL
๐ฅ Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose \textbf{CoreEval}, a \textbf{Co}ntamination-\textbf{re}silient \textbf{Eval}uation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yang Xiang, Yixin Ji, Juntao Li, Min Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning benchmarks. However, their long chain-of-thought reasoning processes incur significant inference overhead. Pruning has emerged as a promising approach to reducing computational costs. However, existing efforts have primarily focused on large language models (LLMs), while pruning LRMs remains unexplored. In this work, we conduct the first empirical study on pruning LRMs and show that directly applying existing pruning techniques fails to yield satisfactory results. Our findings indicate that using self-generated reasoning data for calibration can substantially improve pruning performance. We further investigate how the difficulty and length of reasoning data affect pruning outcomes. Our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data. Based on these insights, we propose a Selective Self-Generated Reasoning (SSGR) data construction strategy to provide effective calibration data for pruning LRMs. Experimental results on the DeepSeek-R1-Distill model series validate that our strategy improves the reasoning ability of pruned LRMs by 10%-13% compared to general pruning methods.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xingyu Huang, Fei Jiang, Jianli Xiao
Natural Language Processing
ๆฅ็ๆ่ฆ
With the rapid development of large language models (LLMs), the applications of LLMs have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLMs framework, which is named as Reading Comprehension Exercise Generation (RCEG). It can generate high-quality and personalized English reading comprehension exercises automatically. Firstly, RCEG uses fine-tuned LLMs to generate content candidates. Then, it uses a discriminator to select the best candidate. Finally, the quality of the generated content has been improved greatly. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed to perform the experiments, and comprehensive evaluation metrics are used to analyze the experimental results. These metrics include content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Masoomali Fatehkia, Enes Altinisik, Husrev Taha Sencar
Natural Language Processing
ๆฅ็ๆ่ฆ
Content moderation filters are a critical safeguard against alignment failures in language models. Yet most existing filters focus narrowly on general safety and overlook cultural context. In this work, we introduce FanarGuard, a bilingual moderation filter that evaluates both safety and cultural alignment in Arabic and English. We construct a dataset of over 468K prompt and response pairs, drawn from synthetic and public datasets, scored by a panel of LLM judges on harmlessness and cultural awareness, and use it to train two filter variants. To rigorously evaluate cultural alignment, we further develop the first benchmark targeting Arabic cultural contexts, comprising over 1k norm-sensitive prompts with LLM-generated responses annotated by human raters. Results show that FanarGuard achieves stronger agreement with human annotations than inter-annotator reliability, while matching the performance of state-of-the-art filters on safety benchmarks. These findings highlight the importance of integrating cultural awareness into moderation and establish FanarGuard as a practical step toward more context-sensitive safeguards.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fengyuan Liu, Huang Yi, Sichun Luo, Yuqi Wang, Yazheng Yang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Discovering effective predictive signals, or ``alphas,'' from financial data with high dimensionality and extremely low signal-to-noise ratio remains a difficult open problem. Despite progress in deep learning, genetic programming, and, more recently, large language model (LLM)--based factor generation, existing approaches still explore only a narrow region of the vast alpha search space. Neural models tend to produce opaque and fragile patterns, while symbolic or formula-based methods often yield redundant or economically ungrounded expressions that generalize poorly. Although different in form, these paradigms share a key limitation: none can conduct broad, structured, and human-like exploration that balances logical consistency with creative leaps. To address this gap, we introduce the Cognitive Alpha Mining Framework (CogAlpha), which combines code-level alpha representation with LLM-driven reasoning and evolutionary search. Treating LLMs as adaptive cognitive agents, our framework iteratively refines, mutates, and recombines alpha candidates through multi-stage prompts and financial feedback. This synergistic design enables deeper thinking, richer structural diversity, and economically interpretable alpha discovery, while greatly expanding the effective search space. Experiments on A-share equities demonstrate that CogAlpha consistently discovers alphas with superior predictive accuracy, robustness, and generalization over existing methods. Our results highlight the promise of aligning evolutionary optimization with LLM-based reasoning for automated and explainable alpha discovery. All source code will be released.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Vรกclav Tran, Jakub ล mรญd, Ladislav Lenc, Jean-Pierre Salmon, Pavel Krรกl
Natural Language Processing
ๆฅ็ๆ่ฆ
Text summarization is the task of automatically condensing longer texts into shorter, coherent summaries while preserving the original meaning and key information. Although this task has been extensively studied in English and other high-resource languages, Czech summarization, particularly in the context of historical documents, remains underexplored. This is largely due to the inherent linguistic complexity of Czech and the lack of high-quality annotated datasets.
In this work, we address this gap by leveraging the capabilities of Large Language Models (LLMs), specifically Mistral and mT5, which have demonstrated strong performance across a wide range of natural language processing tasks and multilingual settings. In addition, we also propose a translation-based approach that first translates Czech texts into English, summarizes them using an English-language model, and then translates the summaries back into Czech. Our study makes the following main contributions: We demonstrate that LLMs achieve new state-of-the-art results on the SumeCzech dataset, a benchmark for modern Czech text summarization, showing the effectiveness of multilingual LLMs even for morphologically rich, medium-resource languages like Czech. We introduce a new dataset, Posel od ฤerchova, designed for the summarization of historical Czech texts. This dataset is derived from digitized 19th-century publications and annotated for abstractive summarization. We provide initial baselines using modern LLMs to facilitate further research in this underrepresented area.
By combining cutting-edge models with both modern and historical Czech datasets, our work lays the foundation for further progress in Czech summarization and contributes valuable resources for future research in Czech historical document processing and low-resource summarization more broadly.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Heger Arfaoui, Mohammed Iheb Hergli, Beya Benzina, Slimane BenMiled
Natural Language Processing
ๆฅ็ๆ่ฆ
Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a rigorous, reproducible computational framework for applying neural topic modeling to focus group transcripts, addressing fundamental methodological challenges: hyperparameter sensitivity, model stability, and validation of interpretability. Using BERTopic applied to ten focus groups exploring HPV vaccine perceptions in Tunisia (1,076 utterances), we conducted systematic evaluation across 27 hyperparameter configurations, assessed stability through bootstrap resampling with 30 replicates per configuration, and validated interpretability through formal human evaluation by three domain experts. Our analysis demonstrates substantial sensitivity to hyperparameter choices and reveals that metric selection for stability assessment must align with analytical goals. A hierarchical merging strategy (extracting fine-grained topics for stability then consolidating for interpretability) effectively navigates the stability-coherence tradeoff, achieving coherence of 0.558 compared to 0.539 for direct extraction. Human validation confirmed topic quality with very good inter-rater reliability (ICC = 0.79, weighted Cohen's kappa = 0.578). Our framework provides practical guidelines that researchers can adapt to their own qualitative research contexts. All code, data processing scripts, and evaluation protocols are publicly available to support reproduction and extension of this work.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kaize Shi, Xueyao Sun, Xiaohui Tao, Lin Li, Qika Lin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) face information overload when handling long contexts, particularly in Retrieval-Augmented Generation (RAG) where extensive supporting documents often introduce redundant content. This issue not only weakens reasoning accuracy but also increases computational overhead. We propose an unsupervised context compression framework that exploits Abstract Meaning Representation (AMR) graphs to preserve semantically essential information while filtering out irrelevant text. By quantifying node-level entropy within AMR graphs, our method estimates the conceptual importance of each node, enabling the retention of core semantics. Specifically, we construct AMR graphs from raw contexts, compute the conceptual entropy of each node, and screen significant informative nodes to form a condensed and semantically focused context than raw documents. Experiments on the PopQA and EntityQuestions datasets show that our method outperforms vanilla and other baselines, achieving higher accuracy while substantially reducing context length. To the best of our knowledge, this is the first work introducing AMR-based conceptual entropy for context compression, demonstrating the potential of stable linguistic features in context engineering.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alvin Wei Ming Tan, Jane Yang, Tarun Sepuri, Khai Loong Aw, Robert Z. Sparks et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Figuring out which objects or concepts words refer to is a central language learning challenge for young children. Most models of this process posit that children learn early object labels from co-occurrences of words and their referents that occur when someone around them talks about an object in the immediate physical environment. But how aligned in time are children's visual and linguistic experiences during everyday learning? To date, answers to this question have been limited by the need for labor-intensive manual annotations of vision-language co-occurrences. Here, we evaluate the use of contrastive language-image pretraining (CLIP) models to automatically characterize vision-language alignment in egocentric videos taken from the infant perspective in home environments. After validating CLIP alignment scores using human alignment judgments, we apply this metric to a large corpus of infant-perspective videos. We show that idealized aligned moments for learning (e.g., "look at the ball" with a ball present in the child's view) are relatively rare in children's everyday experiences compared to modern machine learning datasets, and highlight variability in alignment both within and across children. These findings suggest that infrequent alignment is a constraint for models describing early word learning and offer a new method for investigating children's multimodal environment.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Cao Linxiao, Wang Ruitao, Li Jindong, Zhou Zhipeng, Yang Menglin
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-augmented generation (RAG) enables large language models (LLMs) to access external knowledge, helping mitigate hallucinations and enhance domain-specific expertise. Graph-based RAG enhances structural reasoning by introducing explicit relational organization that enables information propagation across semantically connected text units. However, these methods typically rely on Euclidean embeddings that capture semantic similarity but lack a geometric notion of hierarchical depth, limiting their ability to represent abstraction relationships inherent in complex knowledge graphs. To capture both fine-grained semantics and global hierarchy, we propose HyperbolicRAG, a retrieval framework that integrates hyperbolic geometry into graph-based RAG. HyperbolicRAG introduces three key designs: (1) a depth-aware representation learner that embeds nodes within a shared Poincare manifold to align semantic similarity with hierarchical containment, (2) an unsupervised contrastive regularization that enforces geometric consistency across abstraction levels, and (3) a mutual-ranking fusion mechanism that jointly exploits retrieval signals from Euclidean and hyperbolic spaces, emphasizing cross-space agreement during inference. Extensive experiments across multiple QA benchmarks demonstrate that HyperbolicRAG outperforms competitive baselines, including both standard RAG and graph-augmented baselines.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bashar Talafha, Amin Abu Alhassan, Muhammad Abdul-Mageed
ๆฅ็ๆ่ฆ
Low-resource ASR remains a challenging problem, especially for languages like Arabic that exhibit wide dialectal variation and limited labeled data. We propose context-aware prompting strategies to adapt OpenAI's Whisper for Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing using speech synthesized in the target speaker's voice. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, significantly mitigating hallucinations and speaker mismatch.
๐
2025-11-24
๐ Accepted by ACM MM 2024
๐ฅ Daiqing Wu, Dongbao Yang, Can Ma, Yu Zhou
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
As posts on social media increase rapidly, analyzing the sentiments embedded in image-text pairs has become a popular research topic in recent years. Although existing works achieve impressive accomplishments in simultaneously harnessing image and text information, they lack the considerations of possible low-quality and missing modalities. In real-world applications, these issues might frequently occur, leading to urgent needs for models capable of predicting sentiment robustly. Therefore, we propose a Distribution-based feature Recovery and Fusion (DRF) method for robust multimodal sentiment analysis of image-text pairs. Specifically, we maintain a feature queue for each modality to approximate their feature distributions, through which we can simultaneously handle low-quality and missing modalities in a unified framework. For low-quality modalities, we reduce their contributions to the fusion by quantitatively estimating modality qualities based on the distributions. For missing modalities, we build inter-modal mapping relationships supervised by samples and distributions, thereby recovering the missing modalities from available ones. In experiments, two disruption strategies that corrupt and discard some modalities in samples are adopted to mimic the low-quality and missing modalities in various real-world scenarios. Through comprehensive experiments on three publicly available image-text datasets, we demonstrate the universal improvements of DRF compared to SOTA methods under both two strategies, validating its effectiveness in robust multimodal sentiment analysis.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Matthew R. DeVerna, Kai-Cheng Yang, Harry Yaojun Yan, Filippo Menczer
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools -- and millions of users already rely on them for verification -- rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning- and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated high-quality context is a promising path for automated fact-checking.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yu Lei, Shuzheng Si, Wei Wang, Yifei Wu, Gang Chen et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear pipeline of plan to search to write to a report, which suffers from error accumulation and context rot due to the lack of explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to enhance robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module transforms user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks and binds high-quality evidence to drafted content to ensure verifiability and reduce hallucinations. Our experiments demonstrate that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wangjiaxuan Xin
Natural Language Processing
ๆฅ็ๆ่ฆ
This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models. ECN employs four stages: Perspective Adoption, Emotional Resonance, Reflective Understanding, and Integrative Synthesis, to guide models toward generating emotionally resonant and contextually aware responses. Experimental results demonstrate that ECN achieves the highest Empathy Quotient (EQ) scores across GPT-3.5-turbo and GPT-4, while maintaining competitive Regard and Perplexity metrics. These findings emphasize ECN's potential for applications requiring empathy and inclusivity in conversational AI.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
๐
2025-11-24
๐ ICLR 2025
๐ฅ Yasin Esfandiari, Stefan Bauer, Sebastian U. Stich, Andrea Dittadi
Computer Vision
ๆฅ็ๆ่ฆ
Diffusion models for image generation often exhibit a trade-off between perceptual sample quality and data likelihood: training objectives emphasizing high-noise denoising steps yield realistic images but poor likelihoods, whereas likelihood-oriented training overweights low-noise steps and harms visual fidelity. We introduce a simple plug-and-play sampling method that combines two pretrained diffusion experts by switching between them along the denoising trajectory. Specifically, we apply an image-quality expert at high noise levels to shape global structure, then switch to a likelihood expert at low noise levels to refine pixel statistics. The approach requires no retraining or fine-tuning -- only the choice of an intermediate switching step. On CIFAR-10 and ImageNet32, the merged model consistently matches or outperforms its base components, improving or preserving both likelihood and sample quality relative to each expert alone. These results demonstrate that expert switching across noise levels is an effective way to break the likelihood-quality trade-off in image diffusion models.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shangyuan Tong, Nanye Ma, Saining Xie, Tommi Jaakkola
Machine Learning
ๆฅ็ๆ่ฆ
State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution, a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of 1.45 on ImageNet 256x256, and 1.49 on ImageNet 512x512, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision), often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks the multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e. equipping text-only DeepSeek-R1 with Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang
Multimodal
ๆฅ็ๆ่ฆ
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dereck Piche, Mohammed Muqeeth, Milad Aghajohari, Juan Duque, Michael Noukhovitch et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability. We then introduce a group-relative baseline that simplifies advantage computation in iterated games, enabling multi-agent training at LLM scale. We also contribute a novel social dilemma environment, Trust and Split, which requires natural language communication to achieve high collective welfare. Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Zikai Shen, Zonghao Chen, Dimitri Meunier, Ingo Steinwart, Arthur Gretton et al.
ๆฅ็ๆ่ฆ
We study the problem of nonparametric instrumental variable regression with observed covariates, which we refer to as NPIV-O. Compared with standard nonparametric instrumental variable regression (NPIV), the additional observed covariates facilitate causal identification and enables heterogeneous causal effect estimation. However, the presence of observed covariates introduces two challenges for its theoretical analysis. First, it induces a partial identity structure, which renders previous NPIV analyses - based on measures of ill-posedness, stability conditions, or link conditions - inapplicable. Second, it imposes anisotropic smoothness on the structural function. To address the first challenge, we introduce a novel Fourier measure of partial smoothing; for the second challenge, we extend the existing kernel 2SLS instrumental variable algorithm with observed covariates, termed KIV-O, to incorporate Gaussian kernel lengthscales adaptive to the anisotropic smoothness. We prove upper $L^2$-learning rates for KIV-O and the first $L^2$-minimax lower learning rates for NPIV-O. Both rates interpolate between known optimal rates of NPIV and nonparametric regression (NPR). Interestingly, we identify a gap between our upper and lower bounds, which arises from the choice of kernel lengthscales tuned to minimize a projected risk. Our theoretical analysis also applies to proximal causal inference, an emerging framework for causal effect estimation that shares the same conditional moment restriction as NPIV-O.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
๐
2025-11-24
๐ ArXiv Preprint (cs.DS)
๐ฅ Ilias Diakonikolas, Daniel M. Kane, Sihan Liu, Thanasis Pittas
ๆฅ็ๆ่ฆ
This work studies information-computation gaps for statistical problems. A common approach for providing evidence of such gaps is to show sample complexity lower bounds (that are stronger than the information-theoretic optimum) against natural models of computation. A popular such model in the literature is the family of low-degree polynomial tests. While these tests are defined in such a way that make them easy to analyze, the class of algorithms that they rule out is somewhat restricted. An important goal in this context has been to obtain lower bounds against the stronger and more natural class of low-degree Polynomial Threshold Function (PTF) tests, i.e., any test that can be expressed as comparing some low-degree polynomial of the data to a threshold. Proving lower bounds against PTF tests has turned out to be challenging. Indeed, we are not aware of any non-trivial PTF testing lower bounds in the literature.
In this paper, we establish the first non-trivial PTF testing lower bounds for a range of statistical tasks. Specifically, we prove a near-optimal PTF testing lower bound for Non-Gaussian Component Analysis (NGCA). Our NGCA lower bound implies similar lower bounds for a number of other statistical problems. Our proof leverages a connection to recent work on pseudorandom generators for PTFs and recent techniques developed in that context. At the technical level, we develop several tools of independent interest, including novel structural results for analyzing the behavior of low-degree polynomials restricted to random directions.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rudy Morel, Francesco Pio Ramunno, Jeff Shen, Alberto Bietti, Kyunghyun Cho et al.
ๆฅ็ๆ่ฆ
Conditional diffusion models provide a natural framework for probabilistic prediction of dynamical systems and have been successfully applied to fluid dynamics and weather prediction. However, in many settings, the available information at a given time represents only a small fraction of what is needed to predict future states, either due to measurement uncertainty or because only a small fraction of the state can be observed. This is true for example in solar physics, where we can observe the Sun's surface and atmosphere, but its evolution is driven by internal processes for which we lack direct measurements. In this paper, we tackle the probabilistic prediction of partially observable, long-memory dynamical systems, with applications to solar dynamics and the evolution of active regions. We show that standard inference schemes, such as autoregressive rollouts, fail to capture long-range dependencies in the data, largely because they do not integrate past information effectively. To overcome this, we propose a multiscale inference scheme for diffusion models, tailored to physical processes. Our method generates trajectories that are temporally fine-grained near the present and coarser as we move farther away, which enables capturing long-range temporal dependencies without increasing computational cost. When integrated into a diffusion model, we show that our inference scheme significantly reduces the bias of the predicted distributions and improves rollout stability.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Srishti Gupta, Yashasvee Taiwade
ๆฅ็ๆ่ฆ
Denoising Diffusion Probabilistic Models (DDPMs) have established a new state-of-the-art in generative image synthesis, yet their deployment is hindered by significant computational overhead during inference, often requiring up to 1,000 iterative steps. This study presents a rigorous comparative analysis of DDPMs against the emerging Flow Matching (Rectified Flow) paradigm, specifically isolating their geometric and efficiency properties on low-resource hardware. By implementing both frameworks on a shared Time-Conditioned U-Net backbone using the MNIST dataset, we demonstrate that Flow Matching significantly outperforms Diffusion in efficiency. Our geometric analysis reveals that Flow Matching learns a highly rectified transport path (Curvature $\mathcal{C} \approx 1.02$), which is near-optimal, whereas Diffusion trajectories remain stochastic and tortuous ($\mathcal{C} \approx 3.45$). Furthermore, we establish an ``efficiency frontier'' at $N=10$ function evaluations, where Flow Matching retains high fidelity while Diffusion collapses. Finally, we show via numerical sensitivity analysis that the learned vector field is sufficiently linear to render high-order ODE solvers (Runge-Kutta 4) unnecessary, validating the use of lightweight Euler solvers for edge deployment. \textbf{This work concludes that Flow Matching is the superior algorithmic choice for real-time, resource-constrained generative tasks.}
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tianyang Duan, Zongyuan Zhang, Zheng Lin, Songxiao Guo, Xiuxian Guan et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Multi-agent reinforcement learning (MARL) has been increasingly adopted in many real-world applications. While MARL enables decentralized deployment on resource-constrained edge devices, it suffers from severe non-stationarity due to the synchronous updates of agent policies. This non stationarity results in unstable training and poor policy con vergence, especially as the number of agents increases. In this paper, we propose RELED, a scalable MARL framework that integrates large language model (LLM)-driven expert demonstrations with autonomous agent exploration. RELED incorporates a Stationarity-Aware Expert Demonstration module, which leverages theoretical non-stationarity bounds to enhance the quality of LLM-generated expert trajectories, thus providing high reward and training-stable samples for each agent. Moreover, a Hybrid Expert-Agent Policy Optimization module adaptively balances each agent's learning from both expert-generated and agent-generated trajectories, accelerating policy convergence and improving generalization. Extensive experiments with real city networks based on OpenStreetMap demonstrate that RELED achieves superior performance compared to state-of-the-art MARL methods.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Carlos Ruiz-Gonzalez, Sรถren Arlt, Sebastian Lehner, Arturs Berzins, Yehonathan Drori et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Physics simulators are essential in science and engineering, enabling the analysis, control, and design of complex systems. In experimental sciences, they are increasingly used to automate experimental design, often via combinatorial search and optimization. However, as the setups grow more complex, the computational cost of traditional, CPU-based simulators becomes a major limitation. Here, we show how neural surrogate models can significantly reduce reliance on such slow simulators while preserving accuracy. Taking the design of interferometric gravitational wave detectors as a representative example, we train a neural network to surrogate the gravitational wave physics simulator Finesse, which was developed by the LIGO community. Despite that small changes in physical parameters can change the output by orders of magnitudes, the model rapidly predicts the quality and feasibility of candidate designs, allowing an efficient exploration of large design spaces. Our algorithm loops between training the surrogate, inverse designing new experiments, and verifying their properties with the slow simulator for further training. Assisted by auto-differentiation and GPU parallelism, our method proposes high-quality experiments much faster than direct optimization. Solutions that our algorithm finds within hours outperform designs that take five days for the optimizer to reach. Though shown in the context of gravitational wave detectors, our framework is broadly applicable to other domains where simulator bottlenecks hinder optimization and discovery.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ariel Fargion, Lahav Dabah, Tom Tirer
ๆฅ็ๆ่ฆ
Conformal Prediction (CP) has emerged as a powerful statistical framework for high-stakes classification applications. Instead of predicting a single class, CP generates a prediction set, guaranteed to include the true label with a pre-specified probability. The performance of different CP methods is typically assessed by their average prediction set size. In setups where the classes can be partitioned into semantic groups, e.g., diseases that require similar treatment, users can benefit from prediction sets that are not only small on average, but also contain a small number of semantically different groups. This paper begins by addressing this problem and ultimately offers a widely applicable tool for boosting any CP method on any dataset. First, given a class partition, we propose augmenting the CP score function with a term that penalizes predictions with out-of-group errors. We theoretically analyze this strategy and prove its advantages for group-related metrics. Surprisingly, we show mathematically that, for common class partitions, it can also reduce the average set size of any CP score function. Our analysis reveals the class similarity factors behind this improvement and motivates us to propose a model-specific variant, which does not require any human semantic partition and can further reduce the prediction set size. Finally, we present an extensive empirical study, encompassing prominent CP methods, multiple models, and several datasets, which demonstrates that our class-similarity-based approach consistently enhances CP methods.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Franklin Cardenoso, Wouter Caarls
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
The challenge of designing effective reward functions in reinforcement learning (RL) represents a significant bottleneck, often requiring extensive human expertise and being time-consuming. Previous work and recent advancements in large language models (LLMs) have demonstrated their potential for automating the generation of reward functions. However, existing methodologies often require preliminary evaluation metrics, human-engineered feedback for the refinement process, or the use of environmental source code as context. To address these limitations, this paper introduces LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization). This LLM-based, fully autonomous, and model-agnostic framework eliminates the need for preliminary metrics and environmental source code as context to generate, execute, and evaluate reward function candidates from textual descriptions of systems and task objectives. LEARN-Opt's main contribution lies in its ability to autonomously derive performance metrics directly from the system description and the task objective, enabling unsupervised evaluation and selection of reward functions. Our experiments indicate that LEARN-Opt achieves performance comparable to or better to that of state-of-the-art methods, such as EUREKA, while requiring less prior knowledge. We find that automated reward design is a high-variance problem, where the average-case candidate fails, requiring a multi-run approach to find the best candidates. Finally, we show that LEARN-Opt can unlock the potential of low-cost LLMs to find high-performing candidates that are comparable to, or even better than, those of larger models. This demonstrated performance affirms its potential to generate high-quality reward functions without requiring any preliminary human-defined metrics, thereby reducing engineering overhead and enhancing generalizability.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikita Neveditsin, Pawan Lingras, Vijay Mago
Natural Language Processing
ๆฅ็ๆ่ฆ
Clustering short text embeddings is a foundational task in natural language processing, yet remains challenging due to the need to specify the number of clusters in advance. We introduce a scalable spectral method that estimates the number of clusters directly from the structure of the Laplacian eigenspectrum, constructed using cosine similarities and guided by an adaptive sampling strategy. This sampling approach enables our estimator to efficiently scale to large datasets without sacrificing reliability. To support intrinsic evaluation of cluster quality without ground-truth labels, we propose the Cohesion Ratio, a simple and interpretable evaluation metric that quantifies how much intra-cluster similarity exceeds the global similarity background. It has an information-theoretic motivation inspired by mutual information, and in our experiments it correlates closely with extrinsic measures such as normalized mutual information and homogeneity. Extensive experiments on six short-text datasets and four modern embedding models show that standard algorithms like K-Means and HAC, when guided by our estimator, significantly outperform popular parameter-light methods such as HDBSCAN, OPTICS, and Leiden. These results demonstrate the practical value of our spectral estimator and Cohesion Ratio for unsupervised organization and evaluation of short text data. Implementation of our estimator of k and Cohesion Ratio, along with code for reproducing the experiments, is available at https://anonymous.4open.science/r/towards_clustering-0C2E.
๐
2025-11-24
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Hongyi Wang, Xiuli Zheng, Weimin Liu, Zitian Tang, Sheng Gong
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The discovery of high-performance photosensitizers has long been hindered by the time-consuming and resource-intensive nature of traditional trial-and-error approaches. Here, we present \textbf{A}I-\textbf{A}ccelerated \textbf{P}hoto\textbf{S}ensitizer \textbf{I}nnovation (AAPSI), a closed-loop workflow that integrates expert knowledge, scaffold-based molecule generation, and Bayesian optimization to accelerate the design of novel photosensitizers. The scaffold-driven generation in AAPSI ensures structural novelty and synthetic feasibility, while the iterative AI-experiment loop accelerates the discovery of novel photosensitizers. AAPSI leverages a curated database of 102,534 photosensitizer-solvent pairs and generate 6,148 synthetically accessible candidates. These candidates are screened via graph transformers trained to predict singlet oxygen quantum yield ($ฯ_ฮ$) and absorption maxima ($ฮป_{max}$), following experimental validation. This work generates several novel candidates for photodynamic therapy (PDT), among which the hypocrellin-based candidate HB4Ph exhibits exceptional performance at the Pareto frontier of high quantum yield of singlet oxygen and long absorption maxima among current photosensitizers ($ฯ_ฮ$=0.85, $ฮป_{max}$=650nm).
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hari Chandana Kuchibhotla, K S Ananth, Vineeth N Balasubramanian
Multimodal
ๆฅ็ๆ่ฆ
Despite significant progress in continual learning ranging from architectural novelty to clever strategies for mitigating catastrophic forgetting most existing methods rest on a strong but unrealistic assumption the availability of labeled data throughout the learning process. In real-world scenarios, however, data often arrives sequentially and without annotations, rendering conventional approaches impractical. In this work, we revisit the fundamental assumptions of continual learning and ask: Can current systems adapt when labels are absent and tasks emerge incrementally over time? To this end, we introduce Annotation-Free Class-Incremental Learning (AFCIL), a more realistic and challenging paradigm where unlabeled data arrives continuously, and the learner must incrementally acquire new classes without any supervision. To enable effective learning under AFCIL, we propose CrossWorld CL, a Cross Domain World Guided Continual Learning framework that incorporates external world knowledge as a stable auxiliary source. The method retrieves semantically related ImageNet classes for each downstream category, maps downstream and ImageNet features through a cross domain alignment strategy and finally introduce a novel replay strategy. This design lets the model uncover semantic structure without annotations while keeping earlier knowledge intact. Across four datasets, CrossWorld-CL surpasses CLIP baselines and existing continual and unlabeled learning methods, underscoring the benefit of world knowledge for annotation free continual learning.
๐
2025-11-24
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Changjun Cheng, Daniel Persaud, Kangming Li, Michael J. Moorehead, Natalie Page et al.
Machine Learning
ๆฅ็ๆ่ฆ
High-throughput methods enable accelerated discovery of novel materials in complex systems such as high-entropy alloys, which exhibit intricate phase stability across vast compositional spaces. Computational approaches, including Density Functional Theory (DFT) and calculation of phase diagrams (CALPHAD), facilitate screening of phase formability as a function of composition and temperature. However, the integration of computational predictions with experimental validation remains challenging in high-throughput studies. In this work, we introduce a quantitative confidence metric to assess the agreement between predictions and experimental observations, providing a quantitative measure of the confidence of machine learning models trained on either DFT or CALPHAD input in accounting for experimental evidence. The experimental dataset was generated via high-throughput in-situ synchrotron X-ray diffraction on compositionally varied FeNiMnCr alloy libraries, heated from room temperature to ~1000 ยฐC. Agreement between the observed and predicted phases was evaluated using either temperature-independent phase classification or a model that incorporates a temperature-dependent probability of phase formation. This integrated approach demonstrates where strong overall agreement between computation and experiment exists, while also identifying key discrepancies, particularly in FCC/BCC predictions at Mn-rich regions to inform future model refinement.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dominik Luszczynski
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
A common method of attacking deep learning models is through adversarial attacks, which occur when an attacker specifically modifies the input of a model to produce an incorrect result. Adversarial attacks have been deeply investigated in the image domain; however, there is less research in the time-series domain and very little for forecasting financial data. To address these concerns, this study aims to build upon previous research on adversarial attacks for time-series data by introducing two new slope-based methods aimed to alter the trends of the predicted stock forecast generated by an N-HiTS model. Compared to the normal N-HiTS predictions, the two new slope-based methods, the General Slope Attack and Least-Squares Slope Attack, can manipulate N-HiTS predictions by doubling the slope. These new slope attacks can bypass standard security mechanisms, such as a discriminator that filters real and perturbed inputs, reducing a 4-layered CNN's specificity to 28% and accuracy to 57%. Furthermore, the slope based methods were incorporated into a GAN architecture as a means of generating realistic synthetic data, while simultaneously fooling the model. Finally, this paper also proposes a sample malware designed to inject an adversarial attack in the model inference library, proving that ML-security research should not only focus on making the model safe, but also securing the entire pipeline.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rohan Saha, Farzane Aminmansour, Alona Fyshe
Natural Language Processing
ๆฅ็ๆ่ฆ
While transformers can discover latent structure from context, the dynamics of how they acquire different components of the latent structure remain poorly understood. In this work, we use the Alchemy benchmark, to investigate the dynamics of latent structure learning. We train a small decoder-only transformer on three task variants: 1) inferring missing rules from partial contextual information, 2) composing simple rules to solve multi-step sequences, and 3) decomposing complex multi-step examples to infer intermediate steps. By factorizing each task into interpretable events, we show that the model acquires capabilities in discrete stages, first learning the coarse grained rules, before learning the complete latent structure. We also identify a crucial asymmetry, where the model can compose fundamental rules robustly, but struggles to decompose complex examples to discover the fundamental rules. These findings offer new insights into understanding how a transformer model learns latent structures, providing a granular view of how these capabilities evolve during training.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models, along with ablations, reveal that best-of-n sampling with PRInTS enhances information-seeking abilities of open-source models as well as specialized agents, matching or surpassing the performance of frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiayi Zhang, Yiran Peng, Fanqi Kong, Yang Cheng, Yifan Wu et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decrease as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is avaiable at https://github.com/FoundationAgents/AutoEnv.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ James R. M. Black, Moritz S. Hanke, Aaron Maiwald, Tina Hernandez-Boussard, Oliver M. Crook et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Novel deep learning architectures are increasingly being applied to biological data, including genetic sequences. These models, referred to as genomic language mod- els (gLMs), have demonstrated impressive predictive and generative capabilities, raising concerns that such models may also enable misuse, for instance via the generation of genomes for human-infecting viruses. These concerns have catalyzed calls for risk mitigation measures. The de facto mitigation of choice is filtering of pretraining data (i.e., removing viral genomic sequences from training datasets) in order to limit gLM performance on virus-related tasks. However, it is not currently known how robust this approach is for securing open-source models that can be fine-tuned using sensitive pathogen data. Here, we evaluate a state-of-the-art gLM, Evo 2, and perform fine-tuning using sequences from 110 harmful human-infecting viruses to assess the rescue of misuse-relevant predictive capabilities. The fine- tuned model exhibited reduced perplexity on unseen viral sequences relative to 1) the pretrained model and 2) a version fine-tuned on bacteriophage sequences. The model fine-tuned on human-infecting viruses also identified immune escape variants from SARS-CoV-2 (achieving an AUROC of 0.6), despite having no expo- sure to SARS-CoV-2 sequences during fine-tuning. This work demonstrates that data exclusion might be circumvented by fine-tuning approaches that can, to some degree, rescue misuse-relevant capabilities of gLMs. We highlight the need for safety frameworks for gLMs and outline further work needed on evaluations and mitigation measures to enable the safe deployment of gLMs.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Oliver Knitter, Jonathan Mei, Masako Yamada, Martin Roetteler
ๆฅ็ๆ่ฆ
TorchQuantumDistributed (tqd) is a PyTorch-based [Paszke et al., 2019] library for accelerator-agnostic differentiable quantum state vector simulation at scale. This enables studying the behavior of learnable parameterized near-term and fault- tolerant quantum circuits with high qubit counts.
๐
2025-11-24
๐ ArXiv Preprint (quant-ph)
๐ฅ Sreejith Sreekumar, Ziv Goldfeld, Mark M. Wilde
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Estimating quantum entropies and divergences is an important problem in quantum physics, information theory, and machine learning. Quantum neural estimators (QNEs), which utilize a hybrid classical-quantum architecture, have recently emerged as an appealing computational framework for estimating these measures. Such estimators combine classical neural networks with parametrized quantum circuits, and their deployment typically entails tedious tuning of hyperparameters controlling the sample size, network architecture, and circuit topology. This work initiates the study of formal guarantees for QNEs of measured (Rรฉnyi) relative entropies in the form of non-asymptotic error risk bounds. We further establish exponential tail bounds showing that the error is sub-Gaussian, and thus sharply concentrates about the ground truth value. For an appropriate sub-class of density operator pairs on a space of dimension $d$ with bounded Thompson metric, our theory establishes a copy complexity of $O(|ฮ(\mathcal{U})|d/ฮต^2)$ for QNE with a quantum circuit parameter set $ฮ(\mathcal{U})$, which has minimax optimal dependence on the accuracy $ฮต$. Additionally, if the density operator pairs are permutation invariant, we improve the dimension dependence above to $O(|ฮ(\mathcal{U})|\mathrm{polylog}(d)/ฮต^2)$. Our theory aims to facilitate principled implementation of QNEs for measured relative entropies and guide hyperparameter tuning in practice.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Eichi Uehara
Machine Learning
ๆฅ็ๆ่ฆ
This document proposes a Unified Robust Framework that re-engineers the estimation of the Average Treatment Effect on the Overlap (ATO). It synthesizes gamma-Divergence for outlier robustness, Graduated Non-Convexity (GNC) for global optimization, and a "Gatekeeper" mechanism to address the impossibility of higher-order orthogonality in Gaussian regimes.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Victor Rambaud, Salvador Mascarenhas, Yair Lakretz
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new architectures based on Transformer models, which can learn cognitive maps from observational data and perform path integration in parallel, in a self-supervised manner. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved naturally by updating the positional encoding in Transformers with input-dependent matrices. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several tasks, including a classic 2D navigation task, showing that our models can learn a cognitive map of the underlying space and generalize OOD (e.g., to longer sequences) with near-perfect performance, unlike current architectures. Together, these results demonstrate the superiority of models designed to learn a cognitive map, and the importance of introducing a structural bias for structure-content disentanglement, which can be achieved in Transformers with input-dependent positional encoding. MapFormers have broad applications in both neuroscience and AI, by explaining the neural mechanisms giving rise to cognitive maps, while allowing these relation models to be learned at scale.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Brittany V. Lancellotti, Jordan M. Malof, Aaron Davitt, Gavin McCormick, Shelby Anderson et al.
ๆฅ็ๆ่ฆ
Global greenhouse gas emissions estimates are essential for monitoring and mitigation planning. Yet most datasets lack one or more characteristics that enhance their actionability, such as accuracy, global coverage, high spatial and temporal resolution, and frequent updates. To address these gaps, we present Climate TRACE (climatetrace.org), an open-access platform delivering global emissions estimates with enhanced detail, coverage, and timeliness. Climate TRACE synthesizes existing emissions data, prioritizing accuracy, coverage, and resolution, and fills gaps using sector-specific estimation approaches. The dataset is the first to provide globally comprehensive emissions estimates for individual sources (e.g., individual power plants) for all anthropogenic emitting sectors. The dataset spans January 1, 2021, to the present, with a two-month reporting lag and monthly updates. The open-access platform enables non-technical audiences to engage with detailed emissions datasets for most subnational governments worldwide. Climate TRACE supports data-driven climate action at scales where decisions are made, representing a major breakthrough for emissions accounting and mitigation.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kunal Dumbre, Lei Jiao, Ole-Christoffer Granmo
ๆฅ็ๆ่ฆ
The PC algorithm is a widely used method in causal inference for learning the structure of Bayesian networks. Despite its popularity, the PC algorithm suffers from significant time complexity, particularly as the size of the dataset increases, which limits its applicability in large-scale real-world problems. In this study, we propose a novel approach that utilises the Tsetlin Machine (TM) to construct Bayesian structures more efficiently. Our method leverages the most significant literals extracted from the TM and performs conditional independence (CI) tests on these selected literals instead of the full set of variables, resulting in a considerable reduction in computational time. We implemented our approach and compared it with various state-of-the-art methods. Our evaluation includes categorical datasets from the bnlearn repository, such as Munin1, Hepar2. The findings indicate that the proposed TM-based method not only reduces computational complexity but also maintains competitive accuracy in causal discovery, making it a viable alternative to traditional PC algorithm implementations by offering improved efficiency without compromising performance.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Felix Birkel
ๆฅ็ๆ่ฆ
We present Tiny-TSM, a time series foundation model characterized by small scale, economical training, and state-of-the-art performance. It comprises 23M total parameters, trained on a single A100 GPU in less than a week using a new synthetic data generation and data augmentation pipeline (SynthTS). Without any neural architecture search, hyperparameter tuning, or scaling up model size, Tiny-TSM achieves state-of-the-art performance on a wide range of time series benchmark datasets, often outperforming much larger models and even matching the performance of much larger, industrial-scale, likely highly tuned foundation models. Specifically, Tiny-TSM outperforms all other time series foundation models we evaluated on medium- and long-term forecasting tasks under MSE loss, while short-term accuracy is still competitive with state-of-the-art models.
We also introduce a causal input normalization scheme that enables time series models to be trained with dense next-token prediction loss, significantly accelerating convergence speed and reducing training time.
All experiments were conducted on a single A100 GPU, illustrating the practicality of the proposed approach in a resource-constrained setting.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Manish Singh, Arpita Dayama
Machine Learning
ๆฅ็ๆ่ฆ
This work evaluates the effectiveness of spatiotemporal Graph Neural Networks (GNNs) for multi-store retail sales forecasting and compares their performance against ARIMA, LSTM, and XGBoost baselines. Using weekly sales data from 45 Walmart stores, we construct a relational forecasting framework that models inter-store dependencies through a learned adaptive graph. The proposed STGNN predicts log-differenced sales and reconstructs final values through a residual path, enabling stable training and improved generalisation. Experiments show that STGNN achieves the lowest overall forecasting error, outperforming all baselines in Normalised Total Absolute Error, P90 MAPE, and variance of MAPE across stores. Analysis of the learned adjacency matrix reveals meaningful functional store clusters and high-influence nodes that emerge without geographic metadata. These results demonstrate that relational structure significantly improves forecast quality in interconnected retail environments and establishes STGNNs as a robust modelling choice for multi-store demand prediction.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bianka Kowalska, Halina Kwaลnicka
Machine Learning
ๆฅ็ๆ่ฆ
The black box nature of deep neural networks poses a significant challenge for the deployment of transparent and trustworthy artificial intelligence (AI) systems. With the growing presence of AI in society, it becomes increasingly important to develop methods that can explain and interpret the decisions made by these systems. To address this, mechanistic interpretability (MI) emerged as a promising and distinctive research program within the broader field of explainable artificial intelligence (XAI). MI is the process of studying the inner computations of neural networks and translating them into human-understandable algorithms. It encompasses reverse engineering techniques aimed at uncovering the computational algorithms implemented by neural networks. In this article, we propose a unified taxonomy of MI approaches and provide a detailed analysis of key techniques, illustrated with concrete examples and pseudo-code. We contextualize MI within the broader interpretability landscape, comparing its goals, methods, and insights to other strands of XAI. Additionally, we trace the development of MI as a research area, highlighting its conceptual roots and the accelerating pace of recent work. We argue that MI holds significant potential to support a more scientific understanding of machine learning systems -- treating models not only as tools for solving tasks, but also as systems to be studied and understood. We hope to invite new researchers into the field of mechanistic interpretability.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Amirtha Varshini A S, Duminda S. Ranasinghe, Hok Hei Tam
Robotics
ๆฅ็ๆ่ฆ
Generative Flow Networks, or GFlowNets, offer a promising framework for molecular design, but their internal decision policies remain opaque. This limits adoption in drug discovery, where chemists require clear and interpretable rationales for proposed structures. We present an interpretability framework for SynFlowNet, a GFlowNet trained on documented chemical reactions and purchasable starting materials that generates both molecules and the synthetic routes that produce them. Our approach integrates three complementary components. Gradient based saliency combined with counterfactual perturbations identifies which atomic environments influence reward and how structural edits change molecular outcomes. Sparse autoencoders reveal axis aligned latent factors that correspond to physicochemical properties such as polarity, lipophilicity, and molecular size. Motif probes show that functional groups including aromatic rings and halogens are explicitly encoded and linearly decodable from the internal embeddings. Together, these results expose the chemical logic inside SynFlowNet and provide actionable and mechanistic insight that supports transparent and controllable molecular design.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Lucas Li, Jean-Baptiste Puel, Florence Carton, Dounya Barrit, Jhony H. Giraldo
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Perovskite solar cells are promising candidates for next-generation photovoltaics. However, their performance as multi-scale devices is determined by complex interactions between their constituent layers. This creates a vast combinatorial space of possible materials and device architectures, making the conventional experimental-based screening process slow and expensive. Machine learning models try to address this problem, but they only focus on individual material properties or neglect the important geometric information of the perovskite crystal. To address this problem, we propose to predict perovskite solar cell power conversion efficiency with a geometric-aware co-attention (Solar-GECO) model. Solar-GECO combines a geometric graph neural network (GNN) - that directly encodes the atomic structure of the perovskite absorber - with language model embeddings that process the textual strings representing the chemical compounds of the transport layers and other device components. Solar-GECO also integrates a co-attention module to capture intra-layer dependencies and inter-layer interactions, while a probabilistic regression head predicts both power conversion efficiency (PCE) and its associated uncertainty. Solar-GECO achieves state-of-the-art performance, significantly outperforming several baselines, reducing the mean absolute error (MAE) for PCE prediction from 3.066 to 2.936 compared to semantic GNN (the previous state-of-the-art model). Solar-GECO demonstrates that integrating geometric and textual information provides a more powerful and accurate framework for PCE prediction.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Przemyslaw Chojecki
ๆฅ็ๆ่ฆ
We develop a moduli-theoretic view of psychometric test batteries for AI agents and connect it explicitly to the AAI score developed previously. First, we make precise the notion of an AAI functional on a battery and set out axioms that any reasonable autonomy/general intelligence score should satisfy. Second, we show that the composite index ('AAI-Index') defined previously is a special case of our AAI functional. Third, we introduce the notion of a cognitive core of an agent relative to a battery and define the associated AAI$_{\textrm{core}}$ score as the restriction of an AAI functional to that core. Finally, we use these notions to describe invariants of batteries under evaluation-preserving symmetries and outline how moduli of equivalent batteries are organized.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kyle Verrier, Achille Nazaret, Joseph Futoma, Andrew C. Miller, Guillermo Sapiro
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Hunger and satiety dynamics shape dietary behaviors and metabolic health, yet remain difficult to capture in everyday settings. We present a Nutrition Photoplethysmography Language Model (NPLM), integrating continuous photoplethysmography (PPG) from wearables with meal descriptions. NPLM projects PPG into embeddings interpretable by language models, enabling joint reasoning over physiology and meal context. Trained on 19,340 participants and 1.1 million meal-PPG pairs, the model improved daily caloric intake prediction by 11% over text-only baselines, with accuracy maintained when 80% of meal text was removed. In an independent validation study (n=140) with controlled dining and detailed meal information, the model replicated these findings. These results demonstrate the value of integrating physiological measurements from consumer wearables with meal information for noninvasive dietary monitoring at scale.
๐
2025-11-24
๐ KDD 2026
๐ฅ Yingjia Shang, Yi Liu, Huimin Wang, Furong Li, Wenfang Sun et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
With the rapid advancement of retrieval-augmented vision-language models, multimodal medical retrieval-augmented generation (MMed-RAG) systems are increasingly adopted in clinical decision support. These systems enhance medical applications by performing cross-modal retrieval to integrate relevant visual and textual evidence for tasks, e.g., report generation and disease diagnosis. However, their complex architecture also introduces underexplored adversarial vulnerabilities, particularly via visual input perturbations. In this paper, we propose Medusa, a novel framework for crafting cross-modal transferable adversarial attacks on MMed-RAG systems under a black-box setting. Specifically, Medusa formulates the attack as a perturbation optimization problem, leveraging a multi-positive InfoNCE loss (MPIL) to align adversarial visual embeddings with medically plausible but malicious textual targets, thereby hijacking the retrieval process. To enhance transferability, we adopt a surrogate model ensemble and design a dual-loop optimization strategy augmented with invariant risk minimization (IRM). Extensive experiments on two real-world medical tasks, including medical report generation and disease diagnosis, demonstrate that Medusa achieves over 90% average attack success rate across various generation models and retrievers under appropriate parameter configuration, while remaining robust against four mainstream defenses, outperforming state-of-the-art baselines. Our results reveal critical vulnerabilities in the MMed-RAG systems and highlight the necessity of robustness benchmarking in safety-critical medical applications. The code and data are available at https://anonymous.4open.science/r/MMed-RAG-Attack-F05A.
๐
2025-11-24
๐ AAAI 2026
๐ฅ Hang Ding, Xue Wang, Tian Zhou, Tao Yao
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion models have recently shown promise in time series forecasting, particularly for probabilistic predictions. However, they often fail to achieve state-of-the-art point estimation performance compared to regression-based methods. This limitation stems from difficulties in providing sufficient contextual bias to track distribution shifts and in balancing output diversity with the stability and precision required for point forecasts. Existing diffusion-based approaches mainly focus on full-distribution modeling under probabilistic frameworks, often with likelihood maximization objectives, while paying little attention to dedicated strategies for high-accuracy point estimation. Moreover, other existing point prediction diffusion methods frequently rely on pre-trained or jointly trained mature models for contextual bias, sacrificing the generative flexibility of diffusion models.
To address these challenges, we propose SimDiff, a single-stage, end-to-end framework. SimDiff employs a single unified Transformer network carefully tailored to serve as both denoiser and predictor, eliminating the need for external pre-trained or jointly trained regressors. It achieves state-of-the-art point estimation performance by leveraging intrinsic output diversity and improving mean squared error accuracy through multiple inference ensembling. Key innovations, including normalization independence and the median-of-means estimator, further enhance adaptability and stability. Extensive experiments demonstrate that SimDiff significantly outperforms existing methods in time series point forecasting.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Boyuan Wu
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk-adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.
๐
2025-11-24
๐ ArXiv Preprint (quant-ph)
๐ฅ Hibah Agha, Samuel Yen-Chi Chen, Huan-Hsin Tseng, Shinjae Yoo
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, machine learning and deep learning have driven advances in domains such as image classification, speech recognition, and anomaly detection by leveraging multi-layer neural networks to model complex data. Simultaneously, quantum computing (QC) promises to address classically intractable problems via quantum parallelism, motivating research in quantum machine learning (QML). Among QML techniques, quantum autoencoders show promise for compressing high-dimensional quantum and classical data. However, designing effective quantum circuit architectures for quantum autoencoders remains challenging due to the complexity of selecting gates, arranging circuit layers, and tuning parameters.
This paper proposes a neural architecture search (NAS) framework that automates the design of quantum autoencoders using a genetic algorithm (GA). By systematically evolving variational quantum circuit (VQC) configurations, our method seeks to identify high-performing hybrid quantum-classical autoencoders for data reconstruction without becoming trapped in local minima. We demonstrate effectiveness on image datasets, highlighting the potential of quantum autoencoders for efficient feature extraction within a noise-prone, near-term quantum era. Our approach lays a foundation for broader application of genetic algorithms to quantum architecture search, aiming for a robust, automated method that can adapt to varied data and hardware constraints.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ David Stenger, Armin Lindicke, Alexander von Rohr, Sebastian Trimpe
Machine Learning
ๆฅ็ๆ่ฆ
Searching large and complex design spaces for a global optimum can be infeasible and unnecessary. A practical alternative is to iteratively refine the neighborhood of an initial design using local optimization methods such as gradient descent. We propose local entropy search (LES), a Bayesian optimization paradigm that explicitly targets the solutions reachable by the descent sequences of iterative optimizers. The algorithm propagates the posterior belief over the objective through the optimizer, resulting in a probability distribution over descent sequences. It then selects the next evaluation by maximizing mutual information with that distribution, using a combination of analytic entropy calculations and Monte-Carlo sampling of descent sequences. Empirical results on high-complexity synthetic objectives and benchmark problems show that LES achieves strong sample efficiency compared to existing local and global Bayesian optimization methods.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Minxin Chen
ๆฅ็ๆ่ฆ
Many real-world bandit problems involve non-stationary reward distributions, where the optimal decision may shift due to evolving environments. However, the performance of some typical Multi-Armed Bandit (MAB) models such as Upper Confidence Bound (UCB) algorithms degrades significantly in non-stationary environments where reward distributions change over time. To address this limitation, this paper introduces and evaluates FDSW-UCB, a novel dual-view algorithm that integrates a discount-based long-term perspective with a sliding-window-based short-term view. A data-driven semi-synthetic simulation platform, built upon the MovieLens-1M and Open Bandit datasets, is developed to test algorithm adaptability under abrupt and gradual drift scenarios. Experimental results demonstrate that a well-configured sliding-window mechanism (SW-UCB) is robust, while the widely used discounting method (D-UCB) suffers from a fundamental learning failure, leading to linear regret. Crucially, the proposed FDSW-UCB, when employing an optimistic aggregation strategy, achieves superior performance in dynamic settings, highlighting that the ensemble strategy itself is a decisive factor for success.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Teodora Popordanoska, Jiameng Li, Matthew B. Blaschko
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nithira Jayarathne, Naveen Basnayake, Keshawa Jayasundara, Pasindu Dodampegama, Praveen Wijesinghe et al.
Machine Learning
ๆฅ็ๆ่ฆ
Detecting deepfake images is crucial in combating misinformation. We present a lightweight, generalizable binary classification model based on EfficientNet-B6, fine-tuned with transformation techniques to address severe class imbalances. By leveraging robust preprocessing, oversampling, and optimization strategies, our model achieves high accuracy, stability, and generalization. While incorporating Fourier transform-based phase and amplitude features showed minimal impact, our proposed framework helps non-experts to effectively identify deepfake images, making significant strides toward accessible and reliable deepfake detection.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jeeho Shin, Kyungho Kim, Kijung Shin
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Recipe recommendation has become an essential task in web-based food platforms. A central challenge is effectively leveraging rich multimodal features beyond user-recipe interactions. Our analysis shows that even simple uses of multimodal signals yield competitive performance, suggesting that systematic enhancement of these signals is highly promising. We propose TESMR, a 3-stage framework for recipe recommendation that progressively refines raw multimodal features into effective embeddings through: (1) content-based enhancement using foundation models with multimodal comprehension, (2) relation-based enhancement via message propagation over user-recipe interactions, and (3) learning-based enhancement through contrastive learning with learnable embeddings. Experiments on two real-world datasets show that TESMR outperforms existing methods, achieving 7-15% higher Recall@10.
๐
2025-11-24
๐ EMNLP 2025
๐ฅ Deyi Ji, Yuekui Yang, Liqun Liu, Peng Shu, Haiyang Wu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on both public and proprietary datasets, on both offline scenarios and online deployed A/B Testing, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models like RAVEN in terms of fine-grained violation understanding, reasoning capabilities, and generalization ability.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier
Machine Learning
ๆฅ็ๆ่ฆ
We propose a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also their derivatives with respect to states and actions. By differentiating the Bellman backup through differentiable dynamics, we obtain analytically consistent gradient targets. Incorporating these into the critic objective using a Sobolev-type loss encourages the critic to align with both the value and local geometry of the target function. This first-order TD matching principle can be seamlessly integrated into existing algorithms, such as Q-learning or actor-critic methods (e.g., DDPG, SAC), potentially leading to faster critic convergence and more stable policy gradients without altering their overall structure.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Weitao Liu
ๆฅ็ๆ่ฆ
This paper introduces a novel Kalman filter framework designed to achieve robust state estimation under both process and measurement noise. Inspired by the Weighted Observation Likelihood Filter (WoLF), which provides robustness against measurement outliers, we applied generalized Bayesian approach to build a framework considering both process and measurement noise outliers.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Prateek Garg, Bhavya Kohli, Sunita Sarawagi
Machine Learning
ๆฅ็ๆ่ฆ
Masked Diffusion Models (MDMs) have emerged as one of the most promising paradigms for generative modeling over discrete domains. It is known that MDMs effectively train to decode tokens in a random order, and that this ordering has significant performance implications in practice. This observation raises a fundamental question: can we design a training framework that optimizes for a favorable decoding order? We answer this in the affirmative, showing that the continuous-time variational objective of MDMs, when equipped with multivariate noise schedules, can identify and optimize for a decoding order during training. We establish a direct correspondence between decoding order and the multivariate noise schedule and show that this setting breaks invariance of the MDM objective to the noise schedule. Furthermore, we prove that the MDM objective decomposes precisely into a weighted auto-regressive losses over these orders, which establishes them as auto-regressive models with learnable orders.
๐
2025-11-24
๐ ArXiv Preprint (quant-ph)
๐ฅ Georgios Maragkopoulos, Lazaros Chavatzoglou, Aikaterini Mandilara, Dimitris Syvridis
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In finance, predictive models must balance accuracy and interpretability, particularly in credit risk assessment, where model decisions carry material consequences. We present a quantum neural network (QNN) based on a single qudit, in which both data features and trainable parameters are co-encoded within a unified unitary evolution generated by the full Lie algebra. This design explores the entire Hilbert space while enabling interpretability through the magnitudes of the learned coefficients. We benchmark our model on a real-world, imbalanced credit-risk dataset from Taiwan. The proposed QNN consistently outperforms LR and reaches the results of random forest models in macro-F1 score while preserving a transparent correspondence between learned parameters and input feature importance. To quantify the interpretability of the proposed model, we introduce two complementary metrics: (i) the edit distance between the model's feature ranking and that of LR, and (ii) a feature-poisoning test where selected features are replaced with noise. Results indicate that the proposed quantum model achieves competitive performance while offering a tractable path toward interpretable quantum learning.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huisoo Lee, Jisu Han, Hyunsouk Cho, Wonjun Hwang
Multimodal
ๆฅ็ๆ่ฆ
Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain without access to source data. Recent advances in Foundation Models (FMs) have introduced new opportunities for leveraging external semantic knowledge to guide SFDA. However, relying on a single FM is often insufficient, as it tends to bias adaptation toward a restricted semantic coverage, failing to capture diverse contextual cues under domain shift. To overcome this limitation, we propose a Collaborative Multi-foundation Adaptation (CoMA) framework that jointly leverages two different FMs (e.g., CLIP and BLIP) with complementary properties to capture both global semantics and local contextual cues. Specifically, we employ a bidirectional adaptation mechanism that (1) aligns different FMs with the target model for task adaptation while maintaining their semantic distinctiveness, and (2) transfers complementary knowledge from the FMs to the target model. To ensure stable adaptation under mini-batch training, we introduce Decomposed Mutual Information (DMI) that selectively enhances true dependencies while suppressing false dependencies arising from incomplete class coverage. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art SFDA methods across four benchmarks, including Office-31, Office-Home, DomainNet-126, and VisDA, under the closed-set setting, while also achieving best results on partial-set and open-set variants.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Krishang Sharma
Machine Learning
ๆฅ็ๆ่ฆ
Accurate Remaining Useful Life (RUL) prediction coupled with uncertainty quantification remains a critical challenge in aerospace prognostics. This research introduces a novel uncertainty-aware deep learning framework that learns aleatoric uncertainty directly through probabilistic modeling, an approach unexplored in existing CMAPSS-based literature. Our hierarchical architecture integrates multi-scale Inception blocks for temporal pattern extraction, bidirectional Long Short-Term Memory networks for sequential modeling, and a dual-level attention mechanism operating simultaneously on sensor and temporal dimensions. The innovation lies in the Bayesian output layer that predicts both mean RUL and variance, enabling the model to learn data-inherent uncertainty. Comprehensive preprocessing employs condition-aware clustering, wavelet denoising, and intelligent feature selection. Experimental validation on NASA CMAPSS benchmarks (FD001-FD004) demonstrates competitive overall performance with RMSE values of 16.22, 19.29, 16.84, and 19.98 respectively. Remarkably, our framework achieves breakthrough critical zone performance (RUL <= 30 cycles) with RMSE of 5.14, 6.89, 5.27, and 7.16, representing 25-40 percent improvements over conventional approaches and establishing new benchmarks for safety-critical predictions. The learned uncertainty provides well-calibrated 95 percent confidence intervals with coverage ranging from 93.5 percent to 95.2 percent, enabling risk-aware maintenance scheduling previously unattainable in CMAPSS literature.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Robert Bredereck, Eva Deltl, Leon Kellerhals, Jannik Peters
ๆฅ็ๆ่ฆ
We study core stability in non-centroid clustering under the max-loss objective, where each agent's loss is the maximum distance to other members of their cluster. We prove that for all $k\geq 3$ there exist metric instances with $n\ge 9$ agents, with $n$ divisible by $k$, for which no clustering lies in the $ฮฑ$-core for any $ฮฑ<2^{\frac{1}{5}}\sim 1.148$. The bound is tight for our construction. Using a computer-aided proof, we also identify a two-dimensional Euclidean point set whose associated lower bound is slightly smaller than that of our general construction. This is, to our knowledge, the first impossibility result showing that the core can be empty in non-centroid clustering under the max-loss objective.
๐
2025-11-24
๐ Accepted for presentation and publication in the p
๐ฅ Dora Krekovic, Mario Kusek, Ivana Podnar Zarko, Danh Le-Phuoc
ๆฅ็ๆ่ฆ
The rapid growth of IoT devices has led to an enormous amount of sensor data that requires transmission to cloud servers for processing, resulting in excessive network congestion, increased latency and high energy consumption. This is particularly problematic in resource-constrained and remote environments where bandwidth is limited, and battery-dependent devices further emphasize the problem. Moreover, in domains such as agriculture, consecutive sensor readings often have minimal variation, making continuous data transmission inefficient and unnecessarily resource intensive. To overcome these challenges, we propose an analytical prediction algorithm designed for edge computing environments and validated through simulation. The proposed solution utilizes a predictive filter at the network edge that forecasts the next sensor data point and triggers data transmission only when the deviation from the predicted value exceeds a predefined tolerance. A complementary cloud-based model ensures data integrity and overall system consistency. This dual-model strategy effectively reduces communication overhead and demonstrates potential for improving energy efficiency by minimizing redundant transmissions. In addition to reducing communication load, our approach leverages both in situ and satellite observations from the same locations to enhance model robustness. It also supports cross-site generalization, enabling models trained in one region to be effectively deployed elsewhere without retraining. This makes our solution highly scalable, energy-aware, and well-suited for optimizing sensor data transmission in remote and bandwidth-constrained IoT environments.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chih-Duo Hong, Hongjian Jiang, Anthony W. Lin, Oliver Markgraf, Julian Parsert et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Automata extraction is a method for synthesising interpretable surrogates for black-box neural models that can be analysed symbolically. Existing techniques assume a finite input alphabet, and thus are not directly applicable to data sequences drawn from continuous domains. We address this challenge with deterministic register automata (DRAs), which extend finite automata with registers that store and compare numeric values. Our main contribution is a framework for robust DRA extraction from black-box models: we develop a polynomial-time robustness checker for DRAs with a fixed number of registers, and combine it with passive and active automata learning algorithms. This combination yields surrogate DRAs with statistical robustness and equivalence guarantees. As a key application, we use the extracted automata to assess the robustness of neural networks: for a given sequence and distance metric, the DRA either certifies local robustness or produces a concrete counterexample. Experiments on recurrent neural networks and transformer architectures show that our framework reliably learns accurate automata and enables principled robustness evaluation. Overall, our results demonstrate that robust DRA extraction effectively bridges neural network interpretability and formal reasoning without requiring white-box access to the underlying network.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shenghan Zhao, Yuzhen Lin, Ximeng Yang, Qiaochu Lu, Haozhong Xue et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The advent of financial technology has witnessed a surge in the utilization of deep learning models to anticipate consumer conduct, a trend that has demonstrated considerable potential in enhancing lending strategies and bolstering market efficiency. We study multi-horizon demand forecasting on e-commerce transactions using the UCI Online Retail II dataset. Unlike prior versions of this manuscript that mixed financial-loan narratives with retail data, we focus exclusively on retail market behavior and define a clear prediction target: per SKU daily demand (or revenue) for horizons H=1,7,14. We present a hybrid sequence model that combines multi-scale temporal convolutions, a gated recurrent module, and time-aware self-attention. The model is trained with standard regression losses and evaluated under MAE, RMSE, sMAPE, MASE, and Theil's U_2 with strict time-based splits to prevent leakage. We benchmark against ARIMA/Prophet, LSTM/GRU, LightGBM, and state-of-the-art Transformer forecasters (TFT, Informer, Autoformer, N-BEATS). Results show consistent accuracy gains and improved robustness on peak/holiday periods. We further provide ablations and statistical significance tests to ensure the reliability of improvements, and we release implementation details to facilitate reproducibility.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziyun Li, Ben Dai, Huancheng Hu, Henrik Bostrรถm, Soon Hoe Lim
Machine Learning
ๆฅ็ๆ่ฆ
Flow-based generative models synthesize data by integrating a learned velocity field from a reference distribution to the target data distribution. Prior work has focused on endpoint metrics (e.g., fidelity, likelihood, perceptual quality) while overlooking a deeper question: what do the sampling trajectories reveal? Motivated by classical mechanics, we introduce kinetic path energy (KPE), a simple yet powerful diagnostic that quantifies the total kinetic effort along each generation path of ODE-based samplers. Through comprehensive experiments on CIFAR-10 and ImageNet-256, we uncover two key phenomena: ({i}) higher KPE predicts stronger semantic quality, indicating that semantically richer samples require greater kinetic effort, and ({ii}) higher KPE inversely correlates with data density, with informative samples residing in sparse, low-density regions. Together, these findings reveal that semantically informative samples naturally reside on the sparse frontier of the data distribution, demanding greater generative effort. Our results suggest that trajectory-level analysis offers a physics-inspired and interpretable framework for understanding generation difficulty and sample characteristics.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Emanuele Pardini, Katerina Papagiannouli
ๆฅ็ๆ่ฆ
Unbalanced optimal transport (UOT) provides a flexible way to match or compare nonnegative finite Radon measures. However, UOT requires a predefined ground transport cost, which may misrepresent the data's underlying geometry. Choosing such a cost is particularly challenging when datasets live in heterogeneous spaces, often motivating practitioners to adopt Gromov-Wasserstein formulations. To address this challenge, we introduce cost-regularized unbalanced optimal transport (CR-UOT), a framework that allows the ground cost to vary while allowing mass creation and removal. We show that CR-UOT incorporates unbalanced Gromov-Wasserstein type problems through families of inner-product costs parameterized by linear transformations, enabling the matching of measures or point clouds across Euclidean spaces. We develop algorithms for such CR-UOT problems using entropic regularization and demonstrate that this approach improves the alignment of heterogeneous single-cell omics profiles, especially when many cells lack direct matches.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Timur Mamedov, Anton Konushin, Vadim Konushin
ๆฅ็ๆ่ฆ
Generalizable person re-identification (Re-ID) aims to recognize individuals across unseen cameras and environments. While existing methods rely heavily on limited labeled multi-camera data, we propose DynaMix, a novel method that effectively combines manually labeled multi-camera and large-scale pseudo-labeled single-camera data. Unlike prior works, DynaMix dynamically adapts to the structure and noise of the training data through three core components: (1) a Relabeling Module that refines pseudo-labels of single-camera identities on-the-fly; (2) an Efficient Centroids Module that maintains robust identity representations under a large identity space; and (3) a Data Sampling Module that carefully composes mixed data mini-batches to balance learning complexity and intra-batch diversity. All components are specifically designed to operate efficiently at scale, enabling effective training on millions of images and hundreds of thousands of identities. Extensive experiments demonstrate that DynaMix consistently outperforms state-of-the-art methods in generalizable person Re-ID.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiangyu Chang, Manyi Yao, Srikanth V. Krishnamurthy, Christian R. Shelton, Anirban Chakraborty et al.
ๆฅ็ๆ่ฆ
In Asynchronous Federated Learning (AFL), the central server immediately updates the global model with each arriving client's contribution. As a result, clients perform their local training on different model versions, causing information staleness (delay). In federated environments with non-IID local data distributions, this asynchronous pattern amplifies the adverse effect of client heterogeneity (due to different data distribution, local objectives, etc.), as faster clients contribute more frequent updates, biasing the global model. We term this phenomenon heterogeneity amplification. Our work provides a theoretical analysis that maps AFL design choices to their resulting error sources when heterogeneity amplification occurs. Guided by our analysis, we propose ACE (All-Client Engagement AFL), which mitigates participation imbalance through immediate, non-buffered updates that use the latest information available from all clients. We also introduce a delay-aware variant, ACED, to balance client diversity against update staleness. Experiments on different models for different tasks across diverse heterogeneity and delay settings validate our analysis and demonstrate the robust performance of our approaches.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek et al.
Machine Learning
ๆฅ็ๆ่ฆ
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zimo Yan, Zheng Xie, Chang Liu, Yuan Wang
Machine Learning
ๆฅ็ๆ่ฆ
Message passing graph neural networks are widely used for learning on graphs, yet their expressive power is limited by the one-dimensional Weisfeiler-Lehman test and can fail to distinguish structurally different nodes. We provide rigorous theory for a Laplacian positional encoding that is invariant to eigenvector sign flips and to basis rotations within eigenspaces. We prove that this encoding yields node identifiability from a constant number of observations and establishes a sample-complexity separation from architectures constrained by the Weisfeiler-Lehman test. The analysis combines a monotone link between shortest-path and diffusion distance, spectral trilateration with a constant set of anchors, and quantitative spectral injectivity with logarithmic embedding size. As an instantiation, pairing this encoding with a neural-process style decoder yields significant gains on a drug-drug interaction task on chemical graphs, improving both the area under the ROC curve and the F1 score and demonstrating the practical benefits of resolving theoretical expressiveness limitations with principled positional information.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuting Gao, Weihao Chen, Lan Wang, Ruihan Xu, Qingpei Guo
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Preference learning has recently emerged as a pivotal strategy for post-training alignment of Multimodal Large Language Models (MLLMs). However, existing approaches predominantly rely on external human-annotated preference data, which is costly and labor-intensive to collect. In this work, we propose OrdMoE, a novel preference alignment framework that bypasses the reliance on external human preferences entirely by leveraging intrinsic signals within Mixture-of-Experts (MoE) architectures. Specifically, we observe that the router's expert selection scores implicitly encode a quality-aware ranking of responses (i.e. higher-scoring experts consistently generate higher-quality outputs). Building on this insight, OrdMoE constructs an internal preference hierarchy by grouping experts into ranked tiers based on their per-token routing scores and activating each tier separately to produce a sequence of responses with increasing quality. This yields a zero-cost, self-supervised preference ordering over generated responses, which can be directly optimized using standard preference learning objectives. Extensive experiments across multiple multimodal benchmarks demnstrate that OrdMoE significantly enhances both alignment and overall performance of multimodal Mixture-of-Experts LLMs, achieving competitive results without requiring any human-annotated preference data.
๐
2025-11-24
๐ 7 pages, 4 figures, submitted to IEEE ICC 2026
๐ฅ Nguyen Duc Minh Quang, Chang Liu, Huy-Trung Nguyen, Shuangyang Li, Derrick Wing Kwan Ng et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Low-altitude wireless networks (LAWN) are rapidly expanding with the growing deployment of unmanned aerial vehicles (UAVs) for logistics, surveillance, and emergency response. Reliable connectivity remains a critical yet challenging task due to three-dimensional (3D) mobility, time-varying user density, and limited power budgets. The transmit power of base stations (BSs) fluctuates dynamically according to user locations and traffic demands, leading to a highly non-stationary 3D radio environment. Radio maps (RMs) have emerged as an effective means to characterize spatial power distributions and support radio-aware network optimization. However, most existing works construct static or offline RMs, overlooking real-time power variations and spatio-temporal dependencies in multi-UAV networks. To overcome this limitation, we propose a {3D dynamic radio map (3D-DRM)} framework that learns and predicts the spatio-temporal evolution of received power. Specially, a Vision Transformer (ViT) encoder extracts high-dimensional spatial representations from 3D RMs, while a Transformer-based module models sequential dependencies to predict future power distributions. Experiments unveil that 3D-DRM accurately captures fast-varying power dynamics and substantially outperforms baseline models in both RM reconstruction and short-term prediction.
๐
2025-11-24
๐ Accepted at the IEEE conference on Big Data (Speci
๐ฅ Zineddine Tighidet, Lazhar Labiod, Mohamed Nadif
ๆฅ็ๆ่ฆ
The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian models are often used and the Expectation-Maximization (EM) algorithm is particularly suitable for estimating parameters from which clustering is inferred. If these models are particularly popular in various domains including image clustering, they however suffer from the dimensionality and also from the slowness of convergence of the EM algorithm. However, the Classification EM (CEM) algorithm, a classifying version, offers a fast convergence solution while dimensionality reduction still remains a challenge. Thus we propose in this paper an algorithm combining simultaneously and non-sequentially the two tasks --Data embedding and Clustering-- relying on Principal Component Analysis (PCA) and CEM. We demonstrate the interest of such approach in terms of clustering and data embedding. We also establish different connections with other clustering approaches.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Donghu Kim
Machine Learning
ๆฅ็ๆ่ฆ
The challenge of building neural networks that can continuously learn and adapt to evolving data streams is central to the fields of continual learning (CL) and reinforcement learning (RL). This lifelong learning problem is often framed in terms of the plasticity-stability dilemma, focusing on issues like loss of plasticity and catastrophic forgetting. Unlike neural networks, biological brains maintain plasticity through capacity growth, inspiring researchers to explore similar approaches in artificial networks, such as adding capacity dynamically. Prior solutions often lack parameter efficiency or depend on explicit task indices, but Mixture-of-Experts (MoE) architectures offer a promising alternative by specializing experts for distinct distributions. This paper aims to evaluate a DynamicMoE approach for continual and reinforcement learning environments and benchmark its effectiveness against existing network expansion methods.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xin Yuan, Siqi Li, Jiateng Wei, Chengrui Zhu, Yanming Wu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Pruning is an effective method for compressing Large Language Models, but finding an optimal, non-uniform layer-wise sparsity allocation remains a key challenge. While heuristic methods are fast but yield suboptimal performance, more powerful search-based approaches like Reinforcement Learning are often hindered by prohibitive computational costs on large-scale models. To overcome this efficiency barrier, we propose FastForward Pruning. Its core is a decoupled, single-step RL framework that separates policy optimization from the complex budget satisfaction problem. Such a decoupling is crucial for efficiently searching the vast policy space of LLMs. This curriculum-based strategy begins with low-cost, simple tasks and gradually increases in complexity, significantly reducing the search's computational overhead. Evaluated on the LLaMA, Mistral, and OPT model families, our framework discovers pruning policies that achieve superior performance over strong heuristic baselines. Crucially, when compared to other search-based algorithms, our method achieves competitive or superior results at a fraction of the computational cost, demonstrating a clear advantage in search efficiency.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). However, this history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP that the action generation should be conditioned on the belief state. AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute the soft weights to actively process task-relevant visual tokens based on its historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qisen Chai, Yansong Wang, Junjie Huang, Tao Jia
Machine Learning
ๆฅ็ๆ่ฆ
As graph-structured data grow increasingly large, evaluating their robustness under adversarial attacks becomes computationally expensive and difficult to scale. To address this challenge, we propose to compress graphs into compact representations that preserve both topological structure and robustness profile, enabling efficient and reliable evaluation.We propose Cutter, a dual-agent reinforcement learning framework composed of a Vital Detection Agent (VDA) and a Redundancy Detection Agent (RDA), which collaboratively identify structurally vital and redundant nodes for guided compression. Cutter incorporates three key strategies to enhance learning efficiency and compression quality: trajectory-level reward shaping to transform sparse trajectory returns into dense, policy-equivalent learning signals; prototype-based shaping to guide decisions using behavioral patterns from both highand low-return trajectories; and cross-agent imitation to enable safer and more transferable exploration. Experiments on multiple real-world graphs demonstrate that Cutter generates compressed graphs that retain essential static topological properties and exhibit robustness degradation trends highly consistent with the original graphs under various attack scenarios, thereby significantly improving evaluation efficiency without compromising assessment fidelity.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Juntao Gao, Feiyang Ye, Jing Zhang, Wenjing Qian
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. While standard token pruning techniques can alleviate this, these task-agnostic methods struggle to preserve task-critical visual information. To address this challenge, simultaneously preserving both the holistic context and fine-grained details for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The proposed Compressor-VLA framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. This compression is dynamically modulated by the natural language instruction, allowing for the adaptive condensation of task-relevant visual information. Experimentally, extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. The real-robot deployments on a dual-arm robot platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that our instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, thereby validating the effectiveness of our approach.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ German Gritsai, Megan Richards, Maxime Mรฉloux, Kyunghyun Cho, Maxime Peyrard
Machine Learning
ๆฅ็ๆ่ฆ
We propose a fully data-driven approach to designing mutual information (MI) estimators. Since any MI estimator is a function of the observed sample from two random variables, we parameterize this function with a neural network (MIST) and train it end-to-end to predict MI values. Training is performed on a large meta-dataset of 625,000 synthetic joint distributions with known ground-truth MI. To handle variable sample sizes and dimensions, we employ a two-dimensional attention scheme ensuring permutation invariance across input samples. To quantify uncertainty, we optimize a quantile regression loss, enabling the estimator to approximate the sampling distribution of MI rather than return a single point estimate. This research program departs from prior work by taking a fully empirical route, trading universal theoretical guarantees for flexibility and efficiency. Empirically, the learned estimators largely outperform classical baselines across sample sizes and dimensions, including on joint distributions unseen during training. The resulting quantile-based intervals are well-calibrated and more reliable than bootstrap-based confidence intervals, while inference is orders of magnitude faster than existing neural baselines. Beyond immediate empirical gains, this framework yields trainable, fully differentiable estimators that can be embedded into larger learning pipelines. Moreover, exploiting MI's invariance to invertible transformations, meta-datasets can be adapted to arbitrary data modalities via normalizing flows, enabling flexible training for diverse target meta-distributions.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sanjeev Manivannan, Chandrashekar Lakshminarayan
ๆฅ็ๆ่ฆ
Cross-subject motor-imagery decoding remains a major challenge in EEG-based brain-computer interfaces due to strong subject variability and the curved geometry of covariance matrices on the symmetric positive definite (SPD) manifold. We address the zero-shot cross-subject setting, where no target-subject labels or adaptation are allowed, by introducing novel geometry-aware preprocessing modules and deep congruence networks that operate directly on SPD covariance matrices. Our preprocessing modules, DCR and RiFU, extend Riemannian Alignment by improving action separation while reducing subject-specific distortions. We further propose two manifold classifiers, SPD-DCNet and RiFUNet, which use hierarchical congruence transforms to learn discriminative, subject-invariant covariance representations. On the BCI-IV 2a benchmark, our framework improves cross-subject accuracy by 3-4% over the strongest classical baselines, demonstrating the value of geometry-aware transformations for robust EEG decoding.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Santhosh G S, Saurav Prakash, Balaraman Ravindran
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) face a significant bottleneck during autoregressive inference due to the massive memory footprint of the Key-Value (KV) cache. Existing compression techniques like token eviction, quantization, or other low-rank methods often risk information loss, have fixed limits, or introduce significant computational overhead from explicit decompression steps. In this work, we introduce SWAN, a novel, fine-tuning-free framework that eliminates this overhead. Our method uses an offline orthogonal matrix to rotate and prune the KV-cache, which is then used directly in the attention computation without any reconstruction. Our extensive experiments demonstrate that SWAN, augmented with a small dense buffer, offers a robust trade-off, maintaining performance close to the uncompressed baseline even at aggressive 50-60% memory savings per-token on KV-cache. A key advantage is its runtime-tunable compression level, allowing operators to dynamically adjust the memory footprint, a flexibility absent in methods requiring fixed offline configurations. This combination of a decompression-free design, high performance under compression, and adaptability makes SWAN a practical and efficient solution for serving LLMs with long contexts.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Salah Eddine Choutri, Prajwal Chauhan, Othmane Mazhar, Saif Eddin Jabari
ๆฅ็ๆ่ฆ
The Monte Carlo-type Neural Operator (MCNO) introduces a lightweight architecture for learning solution operators for parametric PDEs by directly approximating the kernel integral using a Monte Carlo approach. Unlike Fourier Neural Operators, MCNO makes no spectral or translation-invariance assumptions. The kernel is represented as a learnable tensor over a fixed set of randomly sampled points. This design enables generalization across multiple grid resolutions without relying on fixed global basis functions or repeated sampling during training. Experiments on standard 1D PDE benchmarks show that MCNO achieves competitive accuracy with low computational cost, providing a simple and practical alternative to spectral and graph-based neural operators.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zengjie Hu, Jiantao Qiu, Tianyi Bai, Haojin Yang, Binhang Yuan et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models, leveraging group-wise rollouts and relative advantage estimation. However, they suffer from a critical \emph{gradient vanishing} problem when all responses within a group receive identical rewards, causing advantage estimates to collapse and training signals to diminish. Existing attempts to mitigate this issue fall into two paradigms: filtering-based and sampling-based methods. Filtering-based methods first generate rollouts broadly and then retroactively filter out uninformative groups, leading to substantial computational overhead. Sampling-based methods proactively select effective samples before rollout but rely on static criteria or prior dataset knowledge, lacking real-time adaptability. To address these issues, we propose \textbf{VADE}, a \textbf{V}ariance-\textbf{A}ware \textbf{D}ynamic sampling framework via online sample-level difficulty \textbf{E}stimation. Our framework integrates three key components: online sample-level difficulty estimation using Beta distributions, a Thompson sampler that maximizes information gain through the estimated correctness probability, and a two-scale prior decay mechanism that maintains robust estimation under policy evolution. This three components design enables VADE to dynamically select the most informative samples, thereby amplifying training signals while eliminating extra rollout costs. Extensive experiments on multimodal reasoning benchmarks show that VADE consistently outperforms strong baselines in both performance and sample efficiency, while achieving a dramatic reduction in computational overhead. More importantly, our framework can serves as a plug-and-play component to be seamlessly integrated into existing group-based RL algorithms. Code and models are available at https://VADE-RL.github.io.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier. Next, we explore emerging efficient attention alternatives to evaluate their potential as candidate building operators. Using the identified promising operators, we construct an evolutionary search framework to automatically discover latency-optimal combinations of these operators within hybrid SLMs, thereby advancing the accuracy-latency frontier. In addition to architectural improvements, we further enhance SLM training using a weight normalization technique that enables more effective weight updates and improves final convergence. Combining these methods, we introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs, e.g., achieving over +5.5% average accuracy, 1.3x/1.9x lower latency, and 18.7x/45.6x higher throughput compared to Qwen3-1.7B/0.6B, respectively.
๐
2025-11-24
๐ currently submitted and awaiting review at the IEE
๐ฅ Hyeong-Gun Joo, Songnam Hong, Seunghwan Lee, Dong-Joon Shin
ๆฅ็ๆ่ฆ
Federated learning (FL) faces challenges in ensuring both privacy and communication efficiency, particularly in resource-constrained environments such as Internet of Things (IoT) and edge networks. While sign-based methods, such as sign stochastic gradient descent with majority voting (SIGNSGD-MV), offer substantial bandwidth savings, they remain vulnerable to inference attacks due to exposure of gradient signs. Existing secure aggregation techniques are either incompatible with sign-based methods or incur prohibitive overhead. To address these limitations, we propose Hi-SAFE, a lightweight and cryptographically secure aggregation framework for sign-based FL. Our core contribution is the construction of efficient majority vote polynomials for SIGNSGD-MV, derived from Fermat's Little Theorem. This formulation represents the majority vote as a low-degree polynomial over a finite field, enabling secure evaluation that hides intermediate values and reveals only the final result. We further introduce a hierarchical subgrouping strategy that ensures constant multiplicative depth and bounded per-user complexity, independent of the number of users n.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Lilian Say, Christophe Denis, Rafael Pinot
Machine Learning
ๆฅ็ๆ่ฆ
The increasing use of machine learning in sensitive applications demands algorithms that simultaneously preserve data privacy and ensure fairness across potentially sensitive sub-populations. While privacy and fairness have each been extensively studied, their joint treatment remains poorly understood. Existing research often frames them as conflicting objectives, with multiple studies suggesting that strong privacy notions such as differential privacy inevitably compromise fairness. In this work, we challenge that perspective by showing that differential privacy can be integrated into a fairness-enhancing pipeline with minimal impact on fairness guarantees. We design a postprocessing algorithm, called DP2DP, that enforces both demographic parity and differential privacy. Our analysis reveals that our algorithm converges towards its demographic parity objective at essentially the same rate (up logarithmic factor) as the best non-private methods from the literature. Experiments on both synthetic and real datasets confirm our theoretical results, showing that the proposed algorithm achieves state-of-the-art accuracy/fairness/privacy trade-offs.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuzhi Chen, Yuanchang Xie, Lei Zhao, Pan Liu, Yajie Zou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal trajectory prediction generates multiple plausible future trajectories to address vehicle motion uncertainty from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, causing prediction failures. Map-free approaches lack global context, with pairwise attention over-amplifying straight patterns while suppressing transitional patterns, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation achieving intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs while a neighbor-context-enhanced pathway emphasizes salient interactions, with gating module mediating their contributions to maintain coverage-focus balance. Experiments on eight highway-ramp scenarios from TOD-VT dataset show GContextFormer outperforms state-of-the-art baselines. Compared to existing transformer models, GContextFormer achieves greater robustness and concentrated improvements in high-curvature and transition zones via spatial distributions. Interpretability is achieved through motion mode distinctions and neighbor context modulation exposing reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jian Lu
Machine Learning
ๆฅ็ๆ่ฆ
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we are returning to the strategy of separating inference and training deployment, and by introducing improvements in the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework, which allows for demand-driven, independent, and elastic scaling of each component, while the accuracy of the algorithm remains completely equivalent to the synchronization method, with both belonging to the on-policy strategy. It is worth emphasizing that we apply a unified tri-model architecture in the training phase, and we also proposed a shared-prompt attention mask to reduce repetitive computation. In practice, these works have achieved at least a threefold overall performance improvement in RL training on NPU platforms, indicating its potential for widespread application.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dezhi Ran, Shuxiao Xie, Mingfang Ji, Ziyue Hua, Mengzhou Wu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
High quality kernels are critical for reducing training and inference costs of Large Language Models (LLMs), yet they traditionally require significant expertise in hardware architecture and software optimization. While recent advances in LLM-based code generation show promise for complex optimization, existing methods struggle with the vast optimization space due to insufficient hardware domain knowledge, failing to effectively balance exploration and exploitation. We present KernelBand, a novel framework that formulates kernel optimization as a hierarchical multi-armed bandit problem, enabling LLM agents to strategically navigate the optimization space by treating kernel selection and optimization strategy application as sequential decision-making processes. Our approach leverages hardware profiling information to identify promising optimization strategies and employs runtime behavior clustering to reduce exploration overhead across kernel candidates. Extensive experiments on TritonBench demonstrate that KernelBand significantly outperforms state-of-the-art methods, achieving superior performance with fewer tokens while exhibiting consistent improvement without saturation as computational resources increase.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bo Jiang, Weijun Zhao, Beibei Wang, Xiao Wang, Jin Tang
ๆฅ็ๆ่ฆ
Recently, fine-tuning large-scale pre-trained GNNs has yielded remarkable attention in adapting pre-trained GNN models for downstream graph learning tasks. One representative fine-tuning method is to exploit adapter (termed AdapterGNN) which aims to 'augment' the pre-trained model by inserting a lightweight module to make the 'augmented' model better adapt to the downstream tasks. However, graph data may contain various types of noise in downstream tasks, such as noisy edges and ambiguous node attributes. Existing AdapterGNNs are often prone to graph noise and exhibit limited generalizability. How to enhance the robustness and generalization ability of GNNs' fine tuning remains an open problem. In this paper, we show that the above problem can be well addressed by integrating uncertainty learning into the GNN adapter. We propose the Uncertainty-aware Adapter (UAdapterGNN) that fortifies pre-trained GNN models against noisy graph data in the fine-tuning process. Specifically, in contrast to regular AdapterGNN, our UAdapterGNN exploits Gaussian probabilistic adapter to augment the pre-trained GNN model. In this way, when the graph contains various noises,our method can automatically absorb the effects of changes in the variances of the Gaussian distribution, thereby significantly enhancing the model's robustness. Also, UAdapterGNN can further improve the generalization ability of the model on the downstream tasks. Extensive experiments on several benchmarks demonstrate the effectiveness, robustness and high generalization ability of the proposed UAdapterGNN method.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yubo Wang, Hui He, Chaoxi Niu, Zhendong Niu
ๆฅ็ๆ่ฆ
Due to the inherent complexity, temporal patterns in real-world time series often evolve across multiple intertwined scales, including long-term periodicity, short-term fluctuations, and abrupt regime shifts. While existing literature has designed many sophisticated decomposition approaches based on the time or frequency domain to partition trend-seasonality components and high-low frequency components, an alternative line of approaches based on the wavelet domain has been proposed to provide a unified multi-resolution representation with precise time-frequency localization. However, most wavelet-based methods suffer from a persistent bias toward recursively decomposing only low-frequency components, severely underutilizing subtle yet informative high-frequency components that are pivotal for precise time series forecasting. To address this problem, we propose WaveTuner, a Wavelet decomposition framework empowered by full-spectrum subband Tuning for time series forecasting. Concretely, WaveTuner comprises two key modules: (i) Adaptive Wavelet Refinement module, that transforms time series into time-frequency coefficients, utilizes an adaptive router to dynamically assign subband weights, and generates subband-specific embeddings to support refinement; and (ii) Multi-Branch Specialization module, that employs multiple functional branches, each instantiated as a flexible Kolmogorov-Arnold Network (KAN) with a distinct functional order to model a specific spectral subband. Equipped with these modules, WaveTuner comprehensively tunes global trends and local variations within a unified time-frequency framework. Extensive experiments on eight real-world datasets demonstrate WaveTuner achieves state-of-the-art forecasting performance in time series forecasting.
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Heger Arfaoui, Mohammed Iheb Hergli, Beya Benzina, Slimane BenMiled
Natural Language Processing
ๆฅ็ๆ่ฆ
Focus group discussions generate rich qualitative data but their analysis traditionally relies on labor-intensive manual coding that limits scalability and reproducibility. We present a rigorous, reproducible computational framework for applying neural topic modeling to focus group transcripts, addressing fundamental methodological challenges: hyperparameter sensitivity, model stability, and validation of interpretability. Using BERTopic applied to ten focus groups exploring HPV vaccine perceptions in Tunisia (1,076 utterances), we conducted systematic evaluation across 27 hyperparameter configurations, assessed stability through bootstrap resampling with 30 replicates per configuration, and validated interpretability through formal human evaluation by three domain experts. Our analysis demonstrates substantial sensitivity to hyperparameter choices and reveals that metric selection for stability assessment must align with analytical goals. A hierarchical merging strategy (extracting fine-grained topics for stability then consolidating for interpretability) effectively navigates the stability-coherence tradeoff, achieving coherence of 0.558 compared to 0.539 for direct extraction. Human validation confirmed topic quality with very good inter-rater reliability (ICC = 0.79, weighted Cohen's kappa = 0.578). Our framework provides practical guidelines that researchers can adapt to their own qualitative research contexts. All code, data processing scripts, and evaluation protocols are publicly available to support reproduction and extension of this work.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mincheol Jeon, Euinam Huh
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalized Federated Learning (PFL) faces persistent challenges, including domain heterogeneity from diverse client data, data imbalance due to skewed participation, and strict communication constraints. Traditional federated learning often lacks personalization, as a single global model cannot capture client-specific characteristics, leading to biased predictions and poor generalization, especially for clients with highly divergent data distributions.
To address these issues, we propose FedSTAR, a style-aware federated learning framework that disentangles client-specific style factors from shared content representations. FedSTAR aggregates class-wise prototypes using a Transformer-based attention mechanism, allowing the server to adaptively weight client contributions while preserving personalization.
Furthermore, by exchanging compact prototypes and style vectors instead of full model parameters, FedSTAR significantly reduces communication overhead. Experimental results demonstrate that combining content-style disentanglement with attention-driven prototype aggregation improves personalization and robustness in heterogeneous environments without increasing communication cost.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yasiru Laksara, Uthayasanker Thayasivam
Machine Learning
ๆฅ็ๆ่ฆ
The utility of deep learning models, such as CheXNet, in high stakes clinical settings is fundamentally constrained by their purely deterministic nature, failing to provide reliable measures of predictive confidence. This project addresses this critical gap by integrating robust Uncertainty Quantification (UQ) into a high performance diagnostic platform for 14 common thoracic diseases on the NIH ChestX-ray14 dataset. Initial architectural development failed to stabilize performance and calibration using Monte Carlo Dropout (MCD), yielding an unacceptable Expected Calibration Error (ECE) of 0.7588. This technical failure necessitated a rigorous architectural pivot to a high diversity, 9-member Deep Ensemble (DE). This resulting DE successfully stabilized performance and delivered superior reliability, achieving a State-of-the-Art (SOTA) average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.8559 and an average F1 Score of 0.3857. Crucially, the DE demonstrated superior calibration (Mean ECE of 0.0728 and Negative Log-Likelihood (NLL) of 0.1916) and enabled the reliable decomposition of total uncertainty into its Aleatoric (irreducible data noise) and Epistemic (reducible model knowledge) components, with a mean Epistemic Uncertainty (EU) of 0.0240. These results establish the Deep Ensemble as a trustworthy and explainable platform, transforming the model from a probabilistic tool into a reliable clinical decision support system.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fang Wang, Lance Kosca, Adrienne Kosca, Marko Gacesa, Ernesto Damiani
Machine Learning
ๆฅ็ๆ่ฆ
This paper introduces HGNN(O), an AutoML GNN hypermodel framework for outcome prediction on event-sequence data. Building on our earlier work on graph convolutional network hypermodels, HGNN(O) extends four architectures-One Level, Two Level, Two Level Pseudo Embedding, and Two Level Embedding-across six canonical GNN operators. A self-tuning mechanism based on Bayesian optimization with pruning and early stopping enables efficient adaptation over architectures and hyperparameters without manual configuration. Empirical evaluation on both balanced and imbalanced event logs shows that HGNN(O) achieves accuracy exceeding 0.98 on the Traffic Fines dataset and weighted F1 scores up to 0.86 on the Patients dataset without explicit imbalance handling. These results demonstrate that the proposed AutoML-GNN approach provides a robust and generalizable benchmark for outcome prediction in complex event-sequence data.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fang Wang, Paolo Ceravolo, Ernesto Damiani
Machine Learning
ๆฅ็ๆ่ฆ
Existing deep learning models for Predictive Process Monitoring (PPM) struggle with temporal irregularities, particularly stochastic event durations and overlapping timestamps, limiting their adaptability across heterogeneous datasets. We propose a dual input neural network strategy that separates event and sequence attributes, using a duration-aware pseudo-embedding matrix to transform temporal importance into compact, learnable representations. This design is implemented across two baseline families: B-LSTM and B-GCN, and their duration-aware variants D-LSTM and D-GCN. All models incorporate self-tuned hypermodels for adaptive architecture selection. Experiments on balanced and imbalanced outcome prediction tasks show that duration pseudo-embedding inputs consistently improve generalization, reduce model complexity, and enhance interpretability. Our results demonstrate the benefits of explicit temporal encoding and provide a flexible design for robust, real-world PPM applications.
๐
2025-11-24
๐ NeurIPS 2025
๐ฅ Kanav Arora, Girish Narayanswamy, Shwetak Patel, Richard Li
Machine Learning
ๆฅ็ๆ่ฆ
Heart rate estimation from photoplethysmography (PPG) signals generated by wearable devices such as smartwatches and fitness trackers has significant implications for the health and well-being of individuals. Although prior work has demonstrated deep learning models with strong performance in the heart rate estimation task, in order to deploy these models on wearable devices, these models must also adhere to strict memory and latency constraints. In this work, we explore and characterize how large pre-trained PPG models may be distilled to smaller models appropriate for real-time inference on the edge. We evaluate four distillation strategies through comprehensive sweeps of teacher and student model capacities: (1) hard distillation, (2) soft distillation, (3) decoupled knowledge distillation (DKD), and (4) feature distillation. We present a characterization of the resulting scaling laws describing the relationship between model size and performance. This early investigation lays the groundwork for practical and predictable methods for building edge-deployable models for physiological sensing.
๐
2025-11-24
๐ ArXiv Preprint (math.ST)
๐ฅ Edgar Dobriban
Natural Language Processing
ๆฅ็ๆ่ฆ
Over the last few months, AI models including large language models have improved greatly. There are now several documented examples where they have helped professional mathematical scientists prove new results, sometimes even helping resolve known open problems. In this short note, we add another example to the list, by documenting how we were able to solve a previously unsolved research problem in robust mathematical statistics with crucial help from GPT-5. Our problem concerns robust density estimation, where the observations are perturbed by Wasserstein-bounded contaminations.In a previous preprint (Chao and Dobriban, 2023, arxiv:2308.01853v2), we have obtained upper and lower bounds on the minimax optimal estimation error; which were, however, not sharp.
Starting in October 2025, making significant use of GPT-5 Pro, we were able to derive the minimax optimal error rate (reported in version 3 of the above arxiv preprint). GPT-5 provided crucial help along the way, including by suggesting calculations that we did not think of, and techniques that were not familiar to us, such as the dynamic Benamou-Brenier formulation, for key steps in the analysis. Working with GPT-5 took a few weeks of effort, and we estimate that it could have taken several months to get the same results otherwise. At the same time, there are still areas where working with GPT-5 was challenging: it sometimes provided incorrect references, and glossed over details that sometimes took days of work to fill in. We outline our workflow and steps taken to mitigate issues. Overall, our work can serve as additional documentation for a new age of human-AI collaborative work in mathematical science.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aakash Gore, Anoushka Dey, Aryan Mishra
ๆฅ็ๆ่ฆ
Knowledge distillation has emerged as a powerful technique for model compression, enabling the transfer of knowledge from large teacher networks to compact student models. However, traditional knowledge distillation methods treat all teacher predictions equally, regardless of the teacher's confidence in those predictions. This paper proposes an uncertainty-aware dual-student knowledge distillation framework that leverages teacher prediction uncertainty to selectively guide student learning. We introduce a peer-learning mechanism where two heterogeneous student architectures, specifically ResNet-18 and MobileNetV2, learn collaboratively from both the teacher network and each other. Experimental results on ImageNet-100 demonstrate that our approach achieves superior performance compared to baseline knowledge distillation methods, with ResNet-18 achieving 83.84\% top-1 accuracy and MobileNetV2 achieving 81.46\% top-1 accuracy, representing improvements of 2.04\% and 0.92\% respectively over traditional single-student distillation approaches.
๐
2025-11-24
๐ ArXiv Preprint (physics.flu-dyn)
๐ฅ Qifeng Hu, Inanc Senocak
Machine Learning
ๆฅ็ๆ่ฆ
We present a meshless method for the solution of incompressible Navier-Stokes equations in advection-dominated regimes using physics- and equality-constrained artificial neural networks combined with a conditionally adaptive augmented Lagrangian formulation. A single neural network parameterizes both the velocity and pressure fields, and is trained by minimizing the residual of a Poisson's equation for pressure, constrained by the momentum and continuity equations, together with boundary conditions on the velocity field. No boundary conditions are imposed on the pressure field aside from anchoring the pressure at a point to prevent its unbounded development. The training is performed from scratch without labeled data, relying solely on the governing equations and constraints. To enhance accuracy in advection-dominated flows, we employ a single Fourier feature mapping of the input coordinates. The proposed method is demonstrated for the canonical lid-driven cavity flow up to a Reynolds number of 7,500 and for laminar flow over a circular cylinder with inflow-outflow boundary conditions, achieving excellent agreement with benchmark solutions. We further compare the present formulation against alternative objective-function constructions based on different arrangements of the flow equations, thereby highlighting the algorithmic advantages of the proposed formulation centered around the Poisson's equation for pressure.
๐
2025-11-24
๐ Submitted for journal publication
๐ฅ Sing-Yuan Yeh, Chun-Hao Yang
Machine Learning
ๆฅ็ๆ่ฆ
Persistent homology (PH) is a crucial concept in computational topology, providing a multiscale topological description of a space. It is particularly significant in topological data analysis, which aims to make statistical inference from a topological perspective. In this work, we introduce a new topological summary for Bayesian neural networks, termed the predictive topological uncertainty (pTU). The proposed pTU measures the uncertainty in the interaction between the model and the inputs. It provides insights from the model perspective: if two samples interact with a model in a similar way, then they are considered identically distributed. We also show that the pTU is insensitive to the model architecture. As an application, pTU is used to solve the out-of-distribution (OOD) detection problem, which is critical to ensure model reliability. Failure to detect OOD input can lead to incorrect and unreliable predictions. To address this issue, we propose a significance test for OOD based on the pTU, providing a statistical framework for this issue. The effectiveness of the framework is validated through various experiments, in terms of its statistical power, sensitivity, and robustness.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yejing Wang, Shengyu Zhou, Jinyu Lu, Ziwei Liu, Langming Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, their practical application is severely hindered by high inference latency, which makes them infeasible for high-throughput, real-time services and limits their overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically require separate draft models and model-based verifiers, requiring additional training and increasing the latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary model, enabling efficient self-drafting. This design, combined with a specialized input prompt structure, preserves the integrity of sequence-to-sequence generation. Furthermore, to tackle the critical problem of hallucination, a major source of performance degradation, we introduce an efficient, model-free verifier based on a hash set. We demonstrate the effectiveness of NEZHA through extensive experiments on public datasets and have successfully deployed the system on Taobao since October 2025, driving the billion-level advertising revenue and serving hundreds of millions of daily active users.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haichen Hu, David Simchi-Levi
Machine Learning
ๆฅ็ๆ่ฆ
We study the problem of excess risk evaluation for empirical risk minimization (ERM) under general convex loss functions. Our contribution is an efficient refitting procedure that computes the excess risk and provides high-probability upper bounds under the fixed-design setting. Assuming only black-box access to the training algorithm and a single dataset, we begin by generating two sets of artificially modified pseudo-outcomes termed wild response, created by stochastically perturbing the gradient vectors with carefully chosen scaling. Using these two pseudo-labeled datasets, we then refit the black-box procedure twice to obtain two corresponding wild predictors. Finally, leveraging the original predictor, the two wild predictors, and the constructed wild responses, we derive an efficient excess risk upper bound. A key feature of our analysis is that it requires no prior knowledge of the complexity of the underlying function class. As a result, the method is essentially model-free and holds significant promise for theoretically evaluating modern opaque machine learning system--such as deep nerral networks and generative model--where traditional capacity-based learning theory becomes infeasible due to the extreme complexity of the hypothesis class.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bhuvan Sachdeva, Karan Uppal, Abhinav Java, Vineeth N. Balasubramanian
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. To quantify these effects, we introduce Perfection Gap Factor (PGF), a metric that captures both the breadth and magnitude of transfer. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task-transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Nour Jedidi, Jimmy Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent approaches that leverage large language models (LLMs) for pseudo-relevance feedback (PRF) have generally not utilized well-established feedback models like Rocchio and RM3 when expanding queries for sparse retrievers like BM25. Instead, they often opt for a simple string concatenation of the query and LLM-generated expansion content. But is this optimal? To answer this question, we revisit and systematically evaluate traditional feedback models in the context of HyDE, a popular method that enriches query representations with LLM-generated hypothetical answer documents. Our experiments show that HyDE's effectiveness can be substantially improved when leveraging feedback algorithms such as Rocchio to extract and weight expansion terms, providing a simple way to further enhance the accuracy of LLM-based PRF methods.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Olivia Macmillan-Scott, Roksana Goworek, Eda B. รzyiฤit
Natural Language Processing
ๆฅ็ๆ่ฆ
Query expansion is the reformulation of a user query by adding semantically related information, and is an essential component of monolingual and cross-lingual information retrieval used to ensure that relevant documents are not missed. Recently, multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. Pseudo-documents both introduce additional relevant terms and bridge the gap between short queries and long documents, which is particularly beneficial in dense retrieval. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance. Results show that query length largely determines which prompting technique is effective, and that more elaborate prompts often do not yield further gains. Substantial linguistic disparities persist: cross-lingual query expansion can produce the largest improvements for languages with the weakest baselines, yet retrieval is especially poor between languages written in different scripts. Fine-tuning is found to lead to performance gains only when the training and test data are of similar format. These outcomes underline the need for more balanced multilingual and cross-lingual training and evaluation resources.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Roksana Goworek, Olivia Macmillan-Scott, Eda B. รzyiฤit
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Cross-lingual information retrieval (CLIR) enables access to multilingual knowledge but remains challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models. Existing pipelines often rely on translation and monolingual retrieval heuristics, which add computational overhead and noise, degrading performance. This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets. We find that dense retrieval models trained specifically for CLIR consistently outperform lexical matching methods and derive little benefit from document translation. Contrastive learning mitigates language biases and yields substantial improvements for encoders with weak initial alignment, and re-ranking can be effective, but depends on the quality of the cross-encoder training data. Although high-resource languages still dominate overall performance, gains over lexical and document-translated baselines are most pronounced for low-resource and cross-script pairs. These findings indicate that cross-lingual search systems should prioritise semantic multilingual embeddings and targeted learning-based alignment over translation-based pipelines, particularly for cross-script and under-resourced languages.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jeeho Shin, Kyungho Kim, Kijung Shin
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Recipe recommendation has become an essential task in web-based food platforms. A central challenge is effectively leveraging rich multimodal features beyond user-recipe interactions. Our analysis shows that even simple uses of multimodal signals yield competitive performance, suggesting that systematic enhancement of these signals is highly promising. We propose TESMR, a 3-stage framework for recipe recommendation that progressively refines raw multimodal features into effective embeddings through: (1) content-based enhancement using foundation models with multimodal comprehension, (2) relation-based enhancement via message propagation over user-recipe interactions, and (3) learning-based enhancement through contrastive learning with learnable embeddings. Experiments on two real-world datasets show that TESMR outperforms existing methods, achieving 7-15% higher Recall@10.
๐
2025-11-24
๐ KDD 2026
๐ฅ Chenhao Zhai, Chang Meng, Xueliang Wang, Shuchang Liu, Xiaolong Hu et al.
Machine Learning
ๆฅ็ๆ่ฆ
The rapid proliferation of short videos on social media platforms presents unique challenges and opportunities for recommendation systems. Users exhibit diverse preferences, and the responses resulting from different strategies often conflict with one another, potentially exhibiting inverse correlations between metrics such as watch time and video view counts. Existing uplift models face limitations in handling the heterogeneous multi-treatment scenarios of short-video recommendations, often failing to effectively capture both the synergistic and individual causal effects of different strategies. Furthermore, traditional fixed-weight approaches for balancing these responses lack personalization and can result in biased decision-making. To address these issues, we propose a novel Heterogeneous Multi-treatment Uplift Modeling (HMUM) framework for trade-off optimization in short-video recommendations. HMUM comprises an Offline Hybrid Uplift Modeling (HUM) module, which captures the synergistic and individual effects of multiple strategies, and an Online Dynamic Decision-Making (DDM) module, which estimates the value weights of different user responses in real-time for personalized decision-making. Evaluated on two public datasets, an industrial dataset, and through online A/B experiments on the Kuaishou platform, our model demonstrated superior offline performance and significant improvements in key metrics. It is now fully deployed on the platform, benefiting hundreds of millions of users.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yi Xu, Chaofan Fan, Jinxin Hu, Yu Zhang, Zeng Xiaoyi et al.
ๆฅ็ๆ่ฆ
Ranking models have become an important part of modern personalized recommendation systems. However, significant challenges persist in handling high-cardinality, heterogeneous, and sparse feature spaces, particularly regarding model scalability and efficiency. We identify two key bottlenecks: (i) Representation Bottleneck: Driven by the high cardinality and dynamic nature of features, model capacity is forced into sparse-activated embedding layers, leading to low-rank representations. This, in turn, triggers phenomena like "One-Epoch" and "Interaction-Collapse," ultimately hindering model scalability.(ii) Computational Bottleneck: Integrating all heterogeneous features into a unified model triggers an explosion in the number of feature tokens, rendering traditional attention mechanisms computationally demanding and susceptible to attention dispersion. To dismantle these barriers, we introduce STORE, a unified and scalable token-based ranking framework built upon three core innovations: (1) Semantic Tokenization fundamentally tackles feature heterogeneity and sparsity by decomposing high-cardinality sparse features into a compact set of stable semantic tokens; and (2) Orthogonal Rotation Transformation is employed to rotate the subspace spanned by low-cardinality static features, which facilitates more efficient and effective feature interactions; and (3) Efficient attention that filters low-contributing tokens to improve computional efficiency while preserving model accuracy. Across extensive offline experiments and online A/B tests, our framework consistently improves prediction accuracy(online CTR by 2.71%, AUC by 1.195%) and training effeciency (1.84 throughput).
๐
2025-11-24
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Matthew R. DeVerna, Kai-Cheng Yang, Harry Yaojun Yan, Filippo Menczer
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools -- and millions of users already rely on them for verification -- rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning- and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated high-quality context is a promising path for automated fact-checking.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yu Wang, Yonghui Yang, Le Wu, Yi Zhang, Richang Hong
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Recent advances in Large Language Models (LLMs) have opened new avenues for sequential recommendation by enabling natural language reasoning over user behavior sequences. A common approach formulates recommendation as a language modeling task, where interaction histories are transformed into prompts and user preferences are learned via supervised fine-tuning. However, these methods operate solely in the textual modality and often miss users' fine-grained interests, especially when shaped by rich visual signals such as product images or movie posters. Multimodal Large Language Models (MLLMs) offer a promising alternative by aligning text and vision in a shared semantic space. A prevalent training paradigm applies Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) to model user preferences. Yet, two core challenges remain: 1) Imbalanced sample hardness, where random negative sampling causes overfitting on easy examples and under-training on hard ones; 2) Cross-modal semantic bias, where the fixed reference model in DPO prevents the policy model from correcting modality misalignments--especially over long sequences. To address these issues, we propose a Multimodal LLM framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec). Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness, prioritizing harder examples. It further introduces Gaussian-perturbed distribution optimization on output logits to enhance cross-modal semantic consistency and reduce modality bias inherited from the reference model.
๐
2025-11-24
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jin Chai, Xiaoxiao Ma, Jian Yang, Jia Wu
ๆฅ็ๆ่ฆ
Sequential recommendation models user preferences to predict the next target item. Most existing work is passive, where the system responds only when users open the application, missing chances after closure. We investigate active recommendation, which predicts the next interaction time and actively delivers items. Two challenges: accurately estimating the Time of Interest (ToI) and generating Item of Interest (IoI) conditioned on the predicted ToI. We propose PASRec, a diffusion-based framework that aligns ToI and IoI via a joint objective. Experiments on five benchmarks show superiority over eight state-of-the-art baselines under leave-one-out and temporal splits.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying fixed choice of single horizons being suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $ฯ_0$, $ฯ_{0.5}$, and one-step regression policy $ฯ_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under mixed-task setting, $ฯ_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Mamoon Aamir, Mariyam Sattar, Naveed Ur Rehman Junejo, Aqsa Zafar Abbasi
Machine Learning
ๆฅ็ๆ่ฆ
Given the increasing need for large aperture antennas in space missions, the difficulty of fitting such structures into small launch vehicles has prompted the design of deployable antenna systems. The thesis introduces a new Triple Scissors Deployable Truss Mechanism (TSDTM) for space antenna missions. The new mechanism is to be stowed during launch and efficiently deploy in orbit, offering maximum aperture size while taking up minimal launch volume. The thesis covers the entire design process from geometric modeling, kinematic analysis with screw theory and Newtonian approaches, dynamic analysis by eigenvalue and simulation methods, and verification with SolidWorks. In addition, optimization routines were coded based on Support Vector Machines for material choice in LEO environments and machine learning method for geometric setup. The TSDTM presented has enhanced structural dynamics with good comparison between simulation and analytical predictions. The structure optimized proved highly accurate, with a deviation of just 1.94% between machine learning-predicted and simulated natural frequencies, demonstrating the potential of incorporating AI-based methods in space structural design.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Franklin Cardenoso, Wouter Caarls
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
The challenge of designing effective reward functions in reinforcement learning (RL) represents a significant bottleneck, often requiring extensive human expertise and being time-consuming. Previous work and recent advancements in large language models (LLMs) have demonstrated their potential for automating the generation of reward functions. However, existing methodologies often require preliminary evaluation metrics, human-engineered feedback for the refinement process, or the use of environmental source code as context. To address these limitations, this paper introduces LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization). This LLM-based, fully autonomous, and model-agnostic framework eliminates the need for preliminary metrics and environmental source code as context to generate, execute, and evaluate reward function candidates from textual descriptions of systems and task objectives. LEARN-Opt's main contribution lies in its ability to autonomously derive performance metrics directly from the system description and the task objective, enabling unsupervised evaluation and selection of reward functions. Our experiments indicate that LEARN-Opt achieves performance comparable to or better to that of state-of-the-art methods, such as EUREKA, while requiring less prior knowledge. We find that automated reward design is a high-variance problem, where the average-case candidate fails, requiring a multi-run approach to find the best candidates. Finally, we show that LEARN-Opt can unlock the potential of low-cost LLMs to find high-performing candidates that are comparable to, or even better than, those of larger models. This demonstrated performance affirms its potential to generate high-quality reward functions without requiring any preliminary human-defined metrics, thereby reducing engineering overhead and enhancing generalizability.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Weiliang Tang, Jialin Gao, Jia-Hui Pan, Gang Wang, Li Erran Li et al.
Natural Language ProcessingRoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Model (VLM) is an important component to enable robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often needs a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically-rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, effectively with the shortest inference time over all state-of-the-art parallel works. Also, we formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects. Extensive real-world experiments further manifest its SOTA performance under varying settings and tasks.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Yuxuan Wang, Haobin Jiang, Shiqing Yao, Ziluo Ding, Zongqing Lu
Robotics
ๆฅ็ๆ่ฆ
Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole body controller, combined with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. The model generates action chunks using flow matching, which can be subsequently refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into texts.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Prabhat Kumar, Chandra Prakash, Josh Pinskier, David Howard, Matthijs Langelaar
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a systematic topology optimization framework for designing a soft pneumatic gripper (SPG), explicitly considering the design-dependent nature of the actuating load. The load is modeled using Darcy's law with an added drainage term. A 2D soft arm unit is optimized by formulating it as a compliant mechanism design problem using the robust formulation. The problem is posed as a min-max optimization, where the output deformations of blueprint and eroded designs are considered. A volume constraint is imposed on the blueprint part, while a strain-energy constraint is enforced on the eroded part. The MMA is employed to solve the optimization problem and obtain the optimized soft unit. Finite element analysis with the Ogden material model confirms that the optimized 2D unit outperforms a conventional rectangular design under pneumatic loading. The optimized 2D unit is extruded to obtain a 3D module, and ten such units are assembled to create a soft arm. Deformation profiles of the optimized arm are analysed under different pressure loads. Four arms are 3D-printed and integrated with a supporting structure to realize the proposed SPG. The gripping performance of the SPG is demonstrated on objects with different weights, sizes, stiffness, and shapes.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Fabian Schramm, Pierre Fabre, Nicolas Perrin-Gilbert, Justin Carpentier
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We present a sampling-based model predictive control (MPC) framework that enables emergent locomotion without relying on handcrafted gait patterns or predefined contact sequences. Our method discovers diverse motion patterns, ranging from trotting to galloping, robust standing policies, jumping, and handstand balancing, purely through the optimization of high-level objectives. Building on model predictive path integral (MPPI), we propose a dual-space spline parameterization that operates on position and velocity control points. Our approach enables contact-making and contact-breaking strategies that adapt automatically to task requirements, requiring only a limited number of sampled trajectories. This sample efficiency allows us to achieve real-time control on standard CPU hardware, eliminating the need for GPU acceleration typically required by other state-of-the-art MPPI methods. We validate our approach on the Go2 quadrupedal robot, demonstrating various emergent gaits and basic jumping capabilities. In simulation, we further showcase more complex behaviors, such as backflips, dynamic handstand balancing and locomotion on a Humanoid, all without requiring reference tracking or offline pre-training.
๐
2025-11-24
๐ ICRA
๐ฅ Ann-Sophia Mรผller, Moonkwang Jeong, Jiyuan Tian, Meng Zhang, Tian Qiu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Untethered magnetic manipulation of biomedical millirobots has a high potential for minimally invasive surgical applications. However, it is still challenging to exert high actuation forces on the small robots over a large distance. Permanent magnets offer stronger magnetic torques and forces than electromagnetic coils, however, feedback control is more difficult. As proven by Earnshaw's theorem, it is not possible to achieve a stable magnetic trap in 3D by static permanent magnets. Here, we report a stable 2D magnetic force trap by an array of permanent magnets to control a millirobot. The trap is located in an open space with a tunable distance to the magnet array in the range of 20 - 120mm, which is relevant to human anatomical scales. The design is achieved by a novel GPU-accelerated optimization algorithm that uses mean squared error (MSE) and Adam optimizer to efficiently compute the optimal angles for any number of magnets in the array. The algorithm is verified using numerical simulation and physical experiments with an array of two magnets. A millirobot is successfully trapped and controlled to follow a complex trajectory. The algorithm demonstrates high scalability by optimizing the angles for 100 magnets in under three seconds. Moreover, the optimization workflow can be adapted to optimize a permanent magnet array to achieve the desired force vector fields.
๐
2025-11-24
๐ IROS
๐ฅ Ann-Sophia Mรผller, Moonkwang Jeong, Meng Zhang, Jiyuan Tian, Arkadiusz Miernik et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Surgical planning and training based on machine learning requires a large amount of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if even possible, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier
Machine Learning
ๆฅ็ๆ่ฆ
We propose a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also their derivatives with respect to states and actions. By differentiating the Bellman backup through differentiable dynamics, we obtain analytically consistent gradient targets. Incorporating these into the critic objective using a Sobolev-type loss encourages the critic to align with both the value and local geometry of the target function. This first-order TD matching principle can be seamlessly integrated into existing algorithms, such as Q-learning or actor-critic methods (e.g., DDPG, SAC), potentially leading to faster critic convergence and more stable policy gradients without altering their overall structure.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Pascal Goldschmid, Aamir Ahmad
Robotics
ๆฅ็ๆ่ฆ
Multi-rotor UAVs face limited flight time due to battery constraints. Autonomous docking on blimps with onboard battery recharging and data offloading offers a promising solution for extended UAV missions. However, the vulnerability of blimps to wind gusts causes trajectory deviations, requiring precise, obstacle-aware docking strategies. To this end, this work introduces two key novelties: (i) a temporal convolutional network that predicts blimp responses to wind gusts, enabling rapid gust detection and estimation of points where the wind gust effect has subsided; (ii) a model predictive controller (MPC) that leverages these predictions to compute collision-free trajectories for docking, enabled by a novel obstacle avoidance method for close-range manoeuvres near the blimp. Simulation results show our method outperforms a baseline constant-velocity model of the blimp significantly across different scenarios. We further validate the approach in real-world experiments, demonstrating the first autonomous multi-rotor docking control strategy on blimps shown outside simulation. Source code is available here https://github.com/robot-perception-group/multi_rotor_airship_docking.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ David Bricher, Andreas Mueller
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Over the last years collaborative robots have gained great success in manufacturing applications where human and robot work together in close proximity. However, current ISO/TS-15066-compliant implementations often limit the efficiency of collaborative tasks due to conservative speed restrictions. For this reason, this paper introduces a deep-learning-based human-robot-safety framework (HRSF) that aims at a dynamical adaptation of robot velocities depending on the separation distance between human and robot while respecting maximum biomechanical force and pressure limits. The applicability of the framework was investigated for four different deep learning approaches that can be used for human body extraction: human body recognition, human body segmentation, human pose estimation, and human body part segmentation. Unlike conventional industrial safety systems, the proposed HRSF differentiates individual human body parts from other objects, enabling optimized robot process execution. Experiments demonstrated a quantitative reduction in cycle time of up to 15% compared to conventional safety technology.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Haihang Wu, Yuchen Zhou
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Monocular Simultaneous Localization and Mapping (SLAM) aims to estimate a robot's pose while simultaneously reconstructing an unknown 3D scene using a single camera. While existing monocular SLAM systems generate detailed 3D geometry through dense scene representations, they are computationally expensive due to the need for iterative optimization. To address this challenge, MASt3R-SLAM utilizes learned 3D reconstruction priors, enabling more efficient and accurate estimation of both 3D structures and camera poses. However, MASt3R-SLAM is limited to single-agent operation. In this paper, we extend MASt3R-SLAM to introduce the first multi-agent monocular dense SLAM system. Each agent performs local SLAM using a 3D reconstruction prior, and their individual maps are fused into a globally consistent map through a loop-closure-based map fusion mechanism. Our approach improves computational efficiency compared to state-of-the-art methods, while maintaining similar mapping accuracy when evaluated on real-world datasets.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Jiale Zhang, Yeqiang Qian, Tong Qin, Mingyang Jiang, Siyuan Chen et al.
Robotics
ๆฅ็ๆ่ฆ
The increase in vehicle ownership has led to increased traffic congestion, more accidents, and higher carbon emissions. Vehicle platooning is a promising solution to address these issues by improving road capacity and reducing fuel consumption. However, existing platooning systems face challenges such as reliance on lane markings and expensive high-precision sensors, which limits their general applicability. To address these issues, we propose a vehicle following framework that expands its capability from restricted scenarios to general scenario applications using only a camera. This is achieved through our newly proposed end-to-end method, which improves overall driving performance. The method incorporates a semantic mask to address causal confusion in multi-frame data fusion. Additionally, we introduce a dynamic sampling mechanism to precisely track the trajectories of preceding vehicles. Extensive closed-loop validation in real-world vehicle experiments demonstrates the system's ability to follow vehicles in various scenarios, outperforming traditional multi-stage algorithms. This makes it a promising solution for cost-effective autonomous vehicle platooning. A complete real-world vehicle experiment is available at https://youtu.be/zL1bcVb9kqQ.
๐
2025-11-24
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). However, this history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP that the action generation should be conditioned on the belief state. AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute the soft weights to actively process task-relevant visual tokens based on its historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Juntao Gao, Feiyang Ye, Jing Zhang, Wenjing Qian
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. While standard token pruning techniques can alleviate this, these task-agnostic methods struggle to preserve task-critical visual information. To address this challenge, simultaneously preserving both the holistic context and fine-grained details for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The proposed Compressor-VLA framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. This compression is dynamically modulated by the natural language instruction, allowing for the adaptive condensation of task-relevant visual information. Experimentally, extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. The real-robot deployments on a dual-arm robot platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that our instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, thereby validating the effectiveness of our approach.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Samuel Cerezo, Seong Hun Lee, Javier Civera
Machine Learning
ๆฅ็ๆ่ฆ
In this letter, we present a closed-form initialization method that recovers the full visual-inertial state without nonlinear optimization. Unlike previous approaches that rely on iterative solvers, our formulation yields analytical, easy-to-implement, and numerically stable solutions for reliable start-up. Our method builds on small-rotation and constant-velocity approximations, which keep the formulation compact while preserving the essential coupling between motion and inertial measurements. We further propose an observability-driven, two-stage initialization scheme that balances accuracy with initialization latency. Extensive experiments on the EuRoC dataset validate our assumptions: our method achieves 10-20% lower initialization error than optimization-based approaches, while using 4x shorter initialization windows and reducing computational cost by 5x.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Suzie Kim, Hye-Bin Shin, Hyo-Jeong Jang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
In this work, we investigate how implicit neural feed back can accelerate reinforcement learning in complex robotic manipulation settings. While prior electroencephalogram (EEG) guided reinforcement learning studies have primarily focused on navigation or low-dimensional locomotion tasks, we aim to understand whether such neural evaluative signals can improve policy learning in high-dimensional manipulation tasks involving obstacles and precise end-effector control. We integrate error related potentials decoded from offline-trained EEG classifiers into reward shaping and systematically evaluate the impact of human-feedback weighting. Experiments on a 7-DoF manipulator in an obstacle-rich reaching environment show that neural feedback accelerates reinforcement learning and, depending on the human-feedback weighting, can yield task success rates that at times exceed those of sparse-reward baselines. Moreover, when applying the best-performing feedback weighting across all sub jects, we observe consistent acceleration of reinforcement learning relative to the sparse-reward setting. Furthermore, leave-one subject-out evaluations confirm that the proposed framework remains robust despite the intrinsic inter-individual variability in EEG decodability. Our findings demonstrate that EEG-based reinforcement learning can scale beyond locomotion tasks and provide a viable pathway for human-aligned manipulation skill acquisition.
๐
2025-11-24
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuzhi Chen, Yuanchang Xie, Lei Zhao, Pan Liu, Yajie Zou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal trajectory prediction generates multiple plausible future trajectories to address vehicle motion uncertainty from intention ambiguity and execution variability. However, HD map-dependent models suffer from costly data acquisition, delayed updates, and vulnerability to corrupted inputs, causing prediction failures. Map-free approaches lack global context, with pairwise attention over-amplifying straight patterns while suppressing transitional patterns, resulting in motion-intention misalignment. This paper proposes GContextFormer, a plug-and-play encoder-decoder architecture with global context-aware hybrid attention and scaled additive aggregation achieving intention-aligned multimodal prediction without map reliance. The Motion-Aware Encoder builds scene-level intention prior via bounded scaled additive aggregation over mode-embedded trajectory tokens and refines per-mode representations under shared global context, mitigating inter-mode suppression and promoting intention alignment. The Hierarchical Interaction Decoder decomposes social reasoning into dual-pathway cross-attention: a standard pathway ensures uniform geometric coverage over agent-mode pairs while a neighbor-context-enhanced pathway emphasizes salient interactions, with gating module mediating their contributions to maintain coverage-focus balance. Experiments on eight highway-ramp scenarios from TOD-VT dataset show GContextFormer outperforms state-of-the-art baselines. Compared to existing transformer models, GContextFormer achieves greater robustness and concentrated improvements in high-curvature and transition zones via spatial distributions. Interpretability is achieved through motion mode distinctions and neighbor context modulation exposing reasoning attribution. The modular architecture supports extensibility toward cross-domain multimodal reasoning tasks. Source: https://fenghy-chen.github.io/sources/.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Changsheng Luo, Yushi Wang, Wenhan Cai, Mingguo Zhao
Robotics
ๆฅ็ๆ่ฆ
Accurate proprioceptive odometry is fundamental for legged robot navigation in GPS-denied and visually degraded environments where conventional visual odometry systems fail. Current approaches face critical limitations: analytical filtering methods suffer from modeling uncertainties and cumulative drift, hybrid learning-filtering approaches remain constrained by their analytical components, while pure learning-based methods struggle with simulation-to-reality transfer and demand extensive real-world data collection. This paper introduces AutoOdom, a novel autoregressive proprioceptive odometry system that overcomes these challenges through an innovative two-stage training paradigm. Stage 1 employs large-scale simulation data to learn complex nonlinear dynamics and rapidly changing contact states inherent in legged locomotion, while Stage 2 introduces an autoregressive enhancement mechanism using limited real-world data to effectively bridge the sim-to-real gap. The key innovation lies in our autoregressive training approach, where the model learns from its own predictions to develop resilience against sensor noise and improve robustness in highly dynamic environments. Comprehensive experimental validation on the Booster T1 humanoid robot demonstrates that AutoOdom significantly outperforms state-of-the-art methods across all evaluation metrics, achieving 57.2% improvement in absolute trajectory error, 59.2% improvement in Umeyama-aligned error, and 36.2% improvement in relative pose error compared to the Legolas baseline. Extensive ablation studies provide critical insights into sensor modality selection and temporal modeling, revealing counterintuitive findings about IMU acceleration data and validating our systematic design choices for robust proprioceptive odometry in challenging locomotion scenarios.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent Vision-Language-Action (VLA) models reformulate vision-language models by tuning them with millions of robotic demonstrations. While they perform well when fine-tuned for a single embodiment or task family, extending them to multi-skill settings remains challenging: directly merging VLA experts trained on different tasks results in near-zero success rates. This raises a fundamental question: what prevents VLAs from mastering multiple skills within one model? With an empirical decomposition of learnable parameters during VLA fine-tuning, we identify two key sources of non-mergeability: (1) Finetuning drives LoRA adapters in the VLM backbone toward divergent, task-specific directions beyond the capacity of existing merging methods to unify. (2) Action experts develop inter-block dependencies through self-attention feedback, causing task information to spread across layers and preventing modular recombination. To address these challenges, we present MergeVLA, a merging-oriented VLA architecture that preserves mergeability by design. MergeVLA introduces sparsely activated LoRA adapters via task masks to retain consistent parameters and reduce irreconcilable conflicts in the VLM. Its action expert replaces self-attention with cross-attention-only blocks to keep specialization localized and composable. When the task is unknown, it uses a test-time task router to adaptively select the appropriate task mask and expert head from the initial observation, enabling unsupervised task inference. Across LIBERO, LIBERO-Plus, RoboTwin, and multi-task experiments on the real SO101 robotic arm, MergeVLA achieves performance comparable to or even exceeding individually finetuned experts, demonstrating robust generalization across tasks, embodiments, and environments.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Xueyu Du, Lilian Zhang, Fuan Duan, Xincan Luo, Maosong Wang et al.
Robotics
ๆฅ็ๆ่ฆ
Filter-based visual inertial navigation system (VINS) has attracted mobile-robot researchers for the good balance between accuracy and efficiency, but its limited mapping quality hampers long-term high-accuracy state estimation. To this end, we first propose a novel filter-based stereo VINS, differing from traditional simultaneous localization and mapping (SLAM) systems based on 3D map, which performs efficient loop closure constraints with implicit environmental map composed of keyframes and 2D keypoints. Secondly, we proposed a hybrid residual filter framework that combines landmark reprojection and ray constraints to construct a unified Jacobian matrix for measurement updates. Finally, considering the degraded environment, we incorporated the camera-IMU extrinsic parameters into visual description to achieve online calibration. Benchmark experiments demonstrate that the proposed SP-VINS achieves high computational efficiency while maintaining long-term high-accuracy localization performance, and is superior to existing state-of-the-art (SOTA) methods.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Omar Garib, Jayaprakash D. Kambhampaty, Olivia J. Pinon Fischer, Dimitri N. Mavris
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
We introduce AIRHILT (Aviation Integrated Reasoning, Human-in-the-Loop Testbed), a modular and lightweight simulation environment designed to evaluate multimodal pilot and air traffic control (ATC) assistance systems for aviation conflict detection. Built on the open-source Godot engine, AIRHILT synchronizes pilot and ATC radio communications, visual scene understanding from camera streams, and ADS-B surveillance data within a unified, scalable platform. The environment supports pilot- and controller-in-the-loop interactions, providing a comprehensive scenario suite covering both terminal area and en route operational conflicts, including communication errors and procedural mistakes. AIRHILT offers standardized JSON-based interfaces that enable researchers to easily integrate, swap, and evaluate automatic speech recognition (ASR), visual detection, decision-making, and text-to-speech (TTS) models. We demonstrate AIRHILT through a reference pipeline incorporating fine-tuned Whisper ASR, YOLO-based visual detection, ADS-B-based conflict logic, and GPT-OSS-20B structured reasoning, and present preliminary results from representative runway-overlap scenarios, where the assistant achieves an average time-to-first-warning of approximately 7.7 s, with average ASR and vision latencies of approximately 5.9 s and 0.4 s, respectively. The AIRHILT environment and scenario suite are openly available, supporting reproducible research on multimodal situational awareness and conflict detection in aviation; code and scenarios are available at https://github.com/ogarib3/airhilt.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Tianyu Wang, Chunxiang Yan, Xuanhong Liao, Tao Zhang, Ping Wang et al.
Robotics
ๆฅ็ๆ่ฆ
Wheeled bipedal robots are emerging as flexible platforms for field exploration. However, head instability induced by uneven terrain can degrade the accuracy of onboard sensors or damage fragile payloads. Existing research primarily focuses on stabilizing the mobile platform but overlooks active stabilization of the head in the world frame, resulting in vertical oscillations that undermine overall stability. To address this challenge, we developed a model-based ground force estimation method for our 6-degree-of-freedom wheeled bipedal robot. Leveraging these force estimates, we implemented an admittance control algorithm to enhance terrain adaptability. Simulation experiments validated the real-time performance of the force estimator and the robot's robustness when traversing uneven terrain.
๐
2025-11-24
๐ IROS
๐ฅ Xueyan Oh, Jonathan Her, Zhixiang Ong, Brandon Koh, Yun Hann Tan et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Ultraviolet (UV) germicidal radiation is an established non-contact method for surface disinfection in medical environments. Traditional approaches require substantial human intervention to define disinfection areas, complicating automation, while deep learning-based methods often need extensive fine-tuning and large datasets, which can be impractical for large-scale deployment. Additionally, these methods often do not address scene understanding for partial surface disinfection, which is crucial for avoiding unintended UV exposure. We propose a solution that leverages foundation models to simplify surface selection for manipulator-based UV disinfection, reducing human involvement and removing the need for model training. Additionally, we propose a VLM-assisted segmentation refinement to detect and exclude thin and small non-target objects, showing that this reduces mis-segmentation errors. Our approach achieves over 92\% success rate in correctly segmenting target and non-target surfaces, and real-world experiments with a manipulator and simulated UV light demonstrate its practical potential for real-world applications.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Yanbin Li, Canran Xiao, Shenghai Yuan, Peilai Yu, Ziruo Li et al.
Robotics
ๆฅ็ๆ่ฆ
Topological maps are more suitable than metric maps for robotic exploration tasks. However, real-time updating of accurate and detail-rich environmental topological maps remains a challenge. This paper presents a topological map updating method based on the Generalized Voronoi Diagram (GVD). First, the newly observed areas are denoised to avoid low-efficiency GVD nodes misleading the topological structure. Subsequently, a multi-granularity hierarchical GVD generation method is designed to control the sampling granularity at both global and local levels. This not only ensures the accuracy of the topological structure but also enhances the ability to capture detail features, reduces the probability of path backtracking, and ensures no overlap between GVDs through the maintenance of a coverage map, thereby improving GVD utilization efficiency. Second, a node clustering method with connectivity constraints and a connectivity method based on a switching mechanism are designed to avoid the generation of unreachable nodes and erroneous nodes caused by obstacle attraction. A special cache structure is used to store all connectivity information, thereby improving exploration efficiency. Finally, to address the issue of frontiers misjudgment caused by obstacles within the scope of GVD units, a frontiers extraction method based on morphological dilation is designed to effectively ensure the reachability of frontiers. On this basis, a lightweight cost function is used to assess and switch to the next viewpoint in real time. This allows the robot to quickly adjust its strategy when signs of path backtracking appear, thereby escaping the predicament and increasing exploration flexibility. And the performance of system for exploration task is verified through comparative tests with SOTA methods.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Ardalan Tajbakhsh, Augustinos Saravanos, James Zhu, Evangelos A. Theodorou, Lorenz T. Biegler et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper addresses the challenge of coordinating multi-robot systems under realistic communication delays using distributed optimization. We focus on consensus ADMM as a scalable framework for generating collision-free, dynamically feasible motion plans in both trajectory optimization and receding-horizon control settings. In practice, however, these algorithms are sensitive to penalty tuning or adaptation schemes (e.g. residual balancing and adaptive parameter heuristics) that do not explicitly consider delays. To address this, we introduce a Delay-Aware ADMM (DA-ADMM) variant that adapts penalty parameters based on real-time delay statistics, allowing agents to down-weight stale information and prioritize recent updates during consensus and dual updates. Through extensive simulations in 2D and 3D environments with double-integrator, Dubins-car, and drone dynamics, we show that DA-ADMM significantly improves robustness, success rate, and solution quality compared to fixed-parameter, residual-balancing, and fixed-constraint baselines. Our results highlight that performance degradation is not solely determined by delay length or frequency, but by the optimizer's ability to contextually reason over delayed information. The proposed DA-ADMM achieves consistently better coordination performance across a wide range of delay conditions, offering a principled and efficient mechanism for resilient multi-robot motion planning under imperfect communication.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Xueyan Oh, Leonard Loh, Shaohui Foong, Zhong Bao Andy Koh, Kow Leong Ng et al.
Computer VisionRoboticsMachine Learning
ๆฅ็ๆ่ฆ
General Visual Inspection is a manual inspection process regularly used to detect and localise obvious damage on the exterior of commercial aircraft. There has been increasing demand to perform this process at the boarding gate to minimise the downtime of the aircraft and automating this process is desired to reduce the reliance on human labour. Automating this typically requires estimating a camera's pose with respect to the aircraft for initialisation but most existing localisation methods require infrastructure, which is very challenging in uncontrolled outdoor environments and within the limited turnover time (approximately 2 hours) on an airport tarmac. Additionally, many airlines and airports do not allow contact with the aircraft's surface or using UAVs for inspection between flights, and restrict access to commercial aircraft. Hence, this paper proposes an on-site method that is infrastructure-free and easy to deploy for estimating a pan-tilt-zoom camera's pose and localising scan images. This method initialises using the same pan-tilt-zoom camera used for the inspection task by utilising a Deep Convolutional Neural Network fine-tuned on only synthetic images to predict its own pose. We apply domain randomisation to generate the dataset for fine-tuning the network and modify its loss function by leveraging aircraft geometry to improve accuracy. We also propose a workflow for initialisation, scan path planning, and precise localisation of images captured from a pan-tilt-zoom camera. We evaluate and demonstrate our approach through experiments with real aircraft, achieving root-mean-square camera pose estimation errors of less than 0.24 m and 2 degrees for all real scenes.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Shuo Wen, Edwin Meriaux, Mariana Sosa Guzmรกn, Zhizun Wang, Junming Shi et al.
Robotics
ๆฅ็ๆ่ฆ
Accurate localization is essential for marine robotics, yet Global Navigation Satellite System (GNSS) signals are unreliable or unavailable even at a very short distance below the water surface. Traditional alternatives, such as inertial navigation, Doppler Velocity Loggers (DVL), SLAM, and acoustic methods, suffer from error accumulation, high computational demands, or infrastructure dependence. In this work, we present a scalable multi-drone GNSS-based tracking system for surface and near-surface marine robots. Our approach combines efficient visual detection, lightweight multi-object tracking, GNSS-based triangulation, and a confidence-weighted Extended Kalman Filter (EKF) to provide stable GNSS estimation in real time. We further introduce a cross-drone tracking ID alignment algorithm that enforces global consistency across views, enabling robust multi-robot tracking with redundant aerial coverage. We validate our system in diversified complex settings to show the scalability and robustness of the proposed algorithm.
๐
2025-11-24
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dayong Liu, Chao Xu, Weihong Chen, Suyu Zhang, Juncheng Wang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLMs to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.
๐
2025-11-24
๐ ArXiv Preprint (Robotics)
๐ฅ Yinan Dong, Ziyu Xu, Tsimafei Lazouski, Sangli Teng, Maani Ghaffari
Robotics
ๆฅ็ๆ่ฆ
Autonomous surface vehicles (ASVs) are easily influenced by environmental disturbances such as wind and waves, making accurate trajectory tracking a persistent challenge in dynamic marine conditions. In this paper, we propose an efficient controller for trajectory tracking of marine vehicles under unknown disturbances by combining a convex error-state MPC on the Lie group with an online learning module to compensate for these disturbances in real time. This design enables adaptive and robust control while maintaining computational efficiency. Extensive evaluations in numerical simulations, the Virtual RobotX (VRX) simulator, and real-world field experiments demonstrate that our method achieves superior tracking accuracy under various disturbance scenarios compared with existing approaches.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Litian Gong, Fatemeh Bahrani, Yutai Zhou, Amin Banayeeanzade, Jiachen Li et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Goun Pyeon, Inbum Heo, Jeesu Jung, Taewook Hwang, Hyuk Namgoong et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
This study systematically evaluated the mathematical reasoning capabilities of Large Language Models (LLMs) using the 2026 Korean College Scholastic Ability Test (CSAT) Mathematics section, ensuring a completely contamination-free evaluation environment. To address data leakage issues in existing benchmarks, we digitized all 46 questions (22 common and 24 elective) within two hours of the exam's public release, eliminating any possibility of inclusion in model training data. We conducted comprehensive evaluations of 24 state-of-the-art LLMs across varying input modalities (text, image, text+figure) and prompt languages (Korean, English).
GPT-5 Codex achieved the only perfect score (100 points) with text input and Korean prompts, while Grok 4, GPT-5, and Deepseek R1 scored above 95 points. Notably, gpt-oss-20B achieved 95.7 points despite its relatively small size, demonstrating high cost-effectiveness. Problem-specific analysis revealed geometry as the weakest domain (77.7% average) with significant performance degradation on 4-point high-difficulty problems. Text input consistently outperformed image input, while prompt language effects varied by model scale.
In reasoning enhancement experiments with GPT-5 series, increased reasoning intensity improved performance (from 82.6 to 100 points) but quadrupled token usage and drastically reduced efficiency, suggesting that models with minimal reasoning may be more practical. This research contributes: (1) implementation of a completely unexposed evaluation environment, (2) a real-exam-based LLM assessment framework, and (3) a practical evaluation perspective integrating performance, cost, and time considerations. Detailed results and model comparisons are available at the 2026 Korean CSAT LLM Evaluation Leaderboard (https://isoft.cnu.ac.kr/csat2026/).
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shireen Chand, Faith Baca, Emilio Ferrara
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful or unfair outputs. While various techniques aim to mitigate these biases, their effects are often evaluated only along the dimension of the bias being targeted. This work investigates the cross-category consequences of targeted bias mitigation. We study four bias mitigation techniques applied across ten models from seven model families, and we explore racial, religious, profession- and gender-related biases. We measure the impact of debiasing on model coherence and stereotypical preference using the StereoSet benchmark. Our results consistently show that while targeted mitigation can sometimes reduce bias in the intended dimension, it frequently leads to unintended and often negative consequences in others, such as increasing model bias and decreasing general coherence. These findings underscore the critical need for robust, multi-dimensional evaluation tools when examining and developing bias mitigation strategies to avoid inadvertently shifting or worsening bias along untargeted axes.
๐
2025-11-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Amin Rakhsha, Kanika Madan, Tianyu Zhang, Amir-massoud Farahmand, Amir Khasahmadi
Natural Language Processing
ๆฅ็ๆ่ฆ
Sampling multiple outputs from a Large Language Model (LLM) and selecting the most frequent (Self-consistency) or highest-scoring (Best-of-N) candidate is a popular approach to achieve higher accuracy in tasks with discrete final answers. Best-of-N (BoN) selects the output with the highest reward, and with perfect rewards, it often achieves near-perfect accuracy. With imperfect rewards from reward models, however, BoN fails to reliably find the correct answer and its performance degrades drastically. We consider the distribution of BoN's outputs and highlight that, although the correct answer does not usually have a probability close to one under imperfect rewards, it is often the most likely outcome. This suggests that the mode of this distribution can be more reliably correct than a sample from it. Based on this idea, we propose Majority-of-the-Bests (MoB), a novel selection mechanism that estimates the output distribution of BoN via bootstrapping and selects its mode. Experimental results across five benchmarks, three different base LLMs, and two reward models demonstrate consistent improvements over BoN in 25 out of 30 setups. We also provide theoretical results for the consistency of the bootstrapping. MoB serves as a simple, yet strong alternative to BoN and self-consistency, and more broadly, motivates further research in more nuanced selection mechanisms.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michael J. Bommarito
Natural Language Processing
ๆฅ็ๆ่ฆ
We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content.
Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedias, etymology -- that supports both vocabulary learning and natural language processing tasks.
As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Maanas Taneja
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Language Models are extremely susceptible to performance collapse with even small changes to input prompt strings. Libraries such as DSpy (from Stanford NLP) avoid this problem through demonstration-based prompt optimisation. Inspired by this, I propose an alternative approach that treats prompt optimisation as a classical state-space search problem. I model the prompt space as a graph where nodes represent prompt states and edges correspond to deliberate transformations such as shortening, adding examples, or re- ordering content. Using beam search and random walk algorithms, I systematically explore this space, evaluating candidates on development sets and pruning unpromising branches. Across five NLP tasks (sentiment classification, question answering, summarisation, reason- ing, and natural language inference), I find that even shallow search configurations (beam width=2, depth=2) improve upon seed prompts on development sets. For instance, beam search achieves development accuracy gains from 0.40 to 0.80 on reasoning tasks, though test set improvements are more modest (0.20 to 0.50), indicating overfitting to the develop- ment heuristic. Analysis of successful optimisation paths reveals that transformations that make prompts concise appear most frequently, while verbosity operators are never selected. My results validate prompt optimization as a search problem and suggest that with greater computational resources and improved evaluation metrics, deeper exploration could yield more robust prompts that generalize beyond development sets. Code and implementation are available at [https://github.com/MaanasTaneja/PromptOptimiser].
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mirza Raquib, Munazer Montasir Akash, Tawhid Ahmed, Saydul Akbar Murad, Farida Siddiqi Prity et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In our daily lives, newspapers are an essential information source that impacts how the public talks about present-day issues. However, effectively navigating the vast amount of news content from different newspapers and online news portals can be challenging. Newspaper headlines with sentiment analysis tell us what the news is about (e.g., politics, sports) and how the news makes us feel (positive, negative, neutral). This helps us quickly understand the emotional tone of the news. This research presents a state-of-the-art approach to Bangla news headline classification combined with sentiment analysis applying Natural Language Processing (NLP) techniques, particularly the hybrid transfer learning model BERT-CNN-BiLSTM. We have explored a dataset called BAN-ABSA of 9014 news headlines, which is the first time that has been experimented with simultaneously in the headline and sentiment categorization in Bengali newspapers. Over this imbalanced dataset, we applied two experimental strategies: technique-1, where undersampling and oversampling are applied before splitting, and technique-2, where undersampling and oversampling are applied after splitting on the In technique-1 oversampling provided the strongest performance, both headline and sentiment, that is 78.57\% and 73.43\% respectively, while technique-2 delivered the highest result when trained directly on the original imbalanced dataset, both headline and sentiment, that is 81.37\% and 64.46\% respectively. The proposed model BERT-CNN-BiLSTM significantly outperforms all baseline models in classification tasks, and achieves new state-of-the-art results for Bangla news headline classification and sentiment analysis. These results demonstrate the importance of leveraging both the headline and sentiment datasets, and provide a strong baseline for Bangla text classification in low-resource.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joseph Malone, Rachith Aiyappa, Byunghwee Lee, Haewoon Kwak, Jisun An et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Beliefs are central to how humans reason, communicate, and form social connections, yet most computational approaches to studying them remain confined to narrow sociopolitical contexts and rely on fine-tuning for optimal performance. Despite the growing use of large language models (LLMs) across disciplines, how well these systems generalize across diverse belief domains remains unclear. We introduce a systematic, reproducible benchmark that evaluates the ability of LLMs to predict individuals' stances on a wide range of topics in a zero-shot setting using data from an online debate platform. The benchmark includes multiple informational conditions that isolate the contribution of demographic context and known prior beliefs to predictive success. Across several small- to medium-sized models, we find that providing more background information about an individual improves predictive accuracy, but performance varies substantially across belief domains. These findings reveal both the capacity and limitations of current LLMs to emulate human reasoning, advancing the study of machine behavior and offering a scalable framework for modeling belief systems beyond the sociopolitical sphere.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ H. M. Shadman Tabib, Jaber Ahmed Deedar
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. Yet, their behavior on structured tasks such as predicting the difficulty of competitive programming problems remains under-explored. We conduct a systematic comparison of GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable Light-GBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM attains 86% accuracy, whereas GPT-4o reaches only 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints -- such as input size limits and acceptance rates -- play a crucial role in separating Hard problems from easier ones. By contrast, GPT-4o often overlooks these cues and exhibits a strong bias toward simpler categories. We further probe GPT-4o through a synthetic Hard-problem generation protocol. Surprisingly, GPT-4o labels almost all of its own synthetic Hard problems as Medium, contradicting its tendency to downgrade real Hard problems to Easy. Our findings connect to recent work on LLMs-as-judges and automatic difficulty estimation in programming and education, and highlight concrete failure modes that must be addressed before LLM-based judges can be considered trustworthy in competitive programming, educational platforms, or reinforcement-learning pipelines.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yacouba Diarra, Nouhoum Souleymane Coulibaly, Panga Azazia Kamatรฉ, Madani Amadou Tall, Emmanuel รlisรฉ Konรฉ et al.
ๆฅ็ๆ่ฆ
Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
๐
2025-11-23
๐ ArXiv Preprint (cs.SE)
๐ฅ Jian Yang, Wei Zhang, Shark Liu, Jiajun Wu, Shawn Guo et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like Github Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). While the field has evolved dramatically from rule-based systems to Transformer-based architectures, achieving performance improvements from single-digit to over 95\% success rates on benchmarks like HumanEval. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tyler Shoemaker
Natural Language Processing
ๆฅ็ๆ่ฆ
This position paper argues that literary scholars must engage with large language model (LLM) interpretability research. While doing so will involve ideological struggle, if not out-right complicity, the necessity of this engagement is clear: the abiding instrumentality of current approaches to interpretability cannot be the only standard by which we measure interpretation with LLMs. One site at which this struggle could take place, I suggest, is the red team.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Josรฉ Pombal, Maya D'Eon, Nuno M. Guerreiro, Pedro Henrique Martins, Antรณnio Farinhas et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Demand for mental health support through AI chatbots is surging, though current systems present several limitations, like sycophancy or overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either only test clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6, on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.
๐
2025-11-23
๐ ArXiv Preprint (eess.AS)
๐ฅ Chunyu Qiang, Kang Yin, Xiaopeng Wang, Yuzhe Liang, Jiahui Zhao et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning requirements that depend on expert knowledge annotations. The high heterogeneity of these input control conditions makes them difficult to joint modeling with speech synthesis. Despite sharing common acoustic modeling characteristics, these two tasks have long been developed independently, leaving open the challenge of achieving unified modeling through natural language instructions. We introduce InstructAudio, a unified framework that enables instruction-based (natural language descriptions) control of acoustic attributes including timbre (gender, age), paralinguistic (emotion, style, accent), and musical (genre, instrument, rhythm, atmosphere). It supports expressive speech, music, and dialogue generation in English and Chinese. The model employs joint and single diffusion transformer layers with a standardized instruction-phoneme input format, trained on 50K hours of speech and 20K hours of music data, enabling multi-task learning and cross-modal alignment. Fig. 1 visualizes performance comparisons with mainstream TTS and TTM models, demonstrating that InstructAudio achieves optimal results on most metrics. To our best knowledge, InstructAudio represents the first instruction-controlled framework unifying speech and music generation. Audio samples are available at: https://qiangchunyu.github.io/InstructAudio/
๐
2025-11-23
๐ AAAI 2026
๐ฅ Xiaoqing Wang, Keman Huang, Bin Liang, Hongyu Li, Xiaoyong Du
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of Large Language Model (LLM)-driven multi-agent systems has significantly streamlined software developing tasks, enabling users with little technical expertise to develop executable applications. While these systems democratize software creation through natural language requirements, they introduce significant security risks that remain largely unexplored. We identify two risky scenarios: Malicious User with Benign Agents (MU-BA) and Benign User with Malicious Agents (BU-MA). We introduce the Implicit Malicious Behavior Injection Attack (IMBIA), demonstrating how multi-agent systems can be manipulated to generate software with concealed malicious capabilities beneath seemingly benign applications, and propose Adv-IMBIA as a defense mechanism. Evaluations across ChatDev, MetaGPT, and AgentVerse frameworks reveal varying vulnerability patterns, with IMBIA achieving attack success rates of 93%, 45%, and 71% in MU-BA scenarios, and 71%, 84%, and 45% in BU-MA scenarios. Our defense mechanism reduced attack success rates significantly, particularly in the MU-BA scenario. Further analysis reveals that compromised agents in the coding and testing phases pose significantly greater security risks, while also identifying critical agents that require protection against malicious user exploitation. Our findings highlight the urgent need for robust security measures in multi-agent software development systems and provide practical guidelines for implementing targeted, resource-efficient defensive strategies.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ B. Y. Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, Zheng Liu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Memory is critical for AI agents, yet the widely-adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called \textbf{general agentic memory (GAM)}. GAM follows the principle of "\textbf{just-in time (JIT) compilation}" where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a duo-design with the following components. 1) \textbf{Memorizer}, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) \textbf{Researcher}, which retrieves and integrates useful information from the page-store for its online request guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvement on various memory-grounded task completion scenarios against existing memory systems.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yu Xia, Sungchul Kim, Tong Yu, Ryan A. Rossi, Julian McAuely
Natural Language Processing
ๆฅ็ๆ่ฆ
Agentic recommendations cast recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single-agent plan-execute workflows or multi-agent task decomposition pipelines. Without recommendation-oriented design, they often underuse the collaborative signals in the user-item interaction history, leading to unsatisfying recommendation results. To address this, we propose the Multi-Agent Collaborative Filtering (MACF) framework for agentic recommendations, drawing an analogy between traditional collaborative filtering algorithms and LLM-based multi-agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent is able to call retrieval tools, suggest candidate items, and interact with other agents. Different from the static preference aggregation in traditional collaborative filtering, MACF employs a central orchestrator agent to adaptively manage the collaboration between user and item agents via dynamic agent recruitment and personalized collaboration instruction. Experimental results on datasets from three different domains show the advantages of our MACF framework compared to strong agentic recommendation baselines.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sultan Alrashed, Chadi Helwe, Francesco Orabona
ๆฅ็ๆ่ฆ
Although the community has tackled the acquisition of high-quality Arabic pretraining data, we still lack large-scale, multi-turn Arabic datasets that include reasoning and tool calling. Naive translation can work at the pretraining scale, but post-training demands much higher quality, which requires a stricter approach to dataset curation. In this work, we introduce SmolKalam, a translation of Smoltalk2 that uses a multi-model ensemble translation pipeline, applies quality filtering, and examines effective translation techniques for traditional decoder-only models through ablations.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dana Arad, Yonatan Belinkov, Hanjie Chen, Najoung Kim, Hosein Mohebbi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB; Mueller et al., 2025) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization strategies for circuit discovery. With one team spanning two methods, participants achieved significant gains in causal variable localization using low-dimensional and non-linear projections to featurize activation vectors. The MIB leaderboard remains open; we encourage continued work in this standard evaluation framework to measure progress in MI research going forward.
๐
2025-11-23
๐ AAAI
๐ฅ Heejoon Koo
Natural Language Processing
ๆฅ็ๆ่ฆ
A decade of rapid advances in artificial intelligence (AI) has opened new opportunities for clinical decision support systems (CDSS), with large language models (LLMs) demonstrating strong reasoning abilities on timely medical tasks. However, clinical texts are often degraded by human errors or failures in automated pipelines, raising concerns about the reliability and fairness of AI-assisted decision-making. Yet the impact of such degradations remains under-investigated, particularly regarding how noise-induced shifts can heighten predictive uncertainty and unevenly affect demographic subgroups. We present a systematic study of state-of-the-art LLMs under diverse text corruption scenarios, focusing on robustness and equity in next-visit diagnosis prediction. To address the challenge posed by the large diagnostic label space, we introduce a clinically grounded label-reduction scheme and a hierarchical chain-of-thought (CoT) strategy that emulates clinicians' reasoning. Our approach improves robustness and reduces subgroup instability under degraded inputs, advancing the reliable use of LLMs in CDSS. We release code at https://github.com/heejkoo9/NECHOv3.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Manon Berriche
Robotics
ๆฅ็ๆ่ฆ
This thesis addresses two paradoxes: (1) why empirical studies find that fake news represent only a small share of the information consulted and shared on social media despite the absence of editorial control or journalistic norms, and (2) how political polarization has intensified even though users do not appear especially receptive to fake news. To investigate these issues, two complementary studies were carried out on Twitter and Facebook, combining quantitative analyses of digital traces with online observation and interviews. This mixed-methods design avoids reducing users to single reactions to identified fake items and instead examines the variety of practices across different interactional situations, online and offline, while recording socio-demographic traits. The first study mapped users who shared at least one item labeled fake by fact-checkers in the French Twittersphere. The second used a corpus of items flagged by Facebook users to study reactions to statements whose epistemic status is uncertain. Three main findings emerge. First, sharing fake news is concentrated among a limited group of users who are not less educated or cognitively disadvantaged but are more politicized and critical of institutions; owing to their high activity and prolific sharing, they can help set the agenda for their political camp. Second, exposed users can deploy varying forms of critical distance depending on their social position and the interactional norms of the situations they inhabit: either discursive caution (prudence รฉnonciative) or interventions ('points d'arrรชt') that express disagreement or corrections. Third, these forms of critical distance seldom yield genuine deliberative debates or agonistic pluralism; rather, they often produce dialogues of the deaf among a small, particularly active minority.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ James Y. Huang, Wenxuan Zhou, Nan Xu, Fei Wang, Qin Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The ability of Large Language Models (LLMs) to generate structured outputs that follow arbitrary schemas is crucial to a wide range of downstream tasks that require diverse structured representations of results such as information extraction, table generation, and function calling. While modern LLMs excel in generating unstructured responses in natural language, whether this advancement translates to a strong performance on text-to-structure tasks remains unclear. To bridge this gap, we first introduce OmniStruct, a comprehensive benchmark for assessing LLMs' capabilities on diverse text-to-structure tasks such as information extraction, table generation, and function calling. We build OmniStruct by identifying existing datasets across a wide range of tasks that are suitable for a structured answer format, and adapting them under a unified text-to-structure problem setting. To facilitate the development of efficient text-to-structure models, we collect high-quality training data via synthetic task generation. Without using any supervised data for OmniStruct tasks, our experiments demonstrate the possibility of fine-tuning much smaller models on synthetic data into universal structured generation models that can rival the performance of GPT-4o.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Syed Mohaiminul Hoque, Naimur Rahman, Md Sakhawat Hossain
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper introduces the approach of "Gradient Masters" for BLP-2025 Task 1: "Bangla Multitask Hate Speech Identification Shared Task". We present an ensemble-based fine-tuning strategy for addressing subtasks 1A (hate-type classification) and 1B (target group classification) in YouTube comments. We propose a hybrid approach on a Bangla Language Model, which outperformed the baseline models and secured the 6th position in subtask 1A with a micro F1 score of 73.23% and the third position in subtask 1B with 73.28%. We conducted extensive experiments that evaluated the robustness of the model throughout the development and evaluation phases, including comparisons with other Language Model variants, to measure generalization in low-resource Bangla hate speech scenarios and data set coverage. In addition, we provide a detailed analysis of our findings, exploring misclassification patterns in the detection of hate speech.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joseph Oladokun
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model agents often retrieve context from knowledge bases that lack structural consistency with the agent's current reasoning state, leading to incoherent reasoning chains. We introduce Path-Constrained Retrieval (PCR), a retrieval method that combines structural graph constraints with semantic search to ensure retrieved information maintains logical relationships within a knowledge graph. PCR restricts the search space to nodes reachable from an anchor node, preventing retrieval of structurally disconnected information that may lead to inconsistent reasoning. We evaluate PCR on PathRAG-6, a benchmark spanning six domains with 180 nodes and 360 edges. Our results show that PCR achieves full structural consistency compared to 24-32 percent in baseline methods, while maintaining strong relevance scores. On the technology domain, PCR obtains full relevance at rank 10 with full structural consistency, significantly outperforming vector search and hybrid retrieval. PCR reduces the average graph distance of retrieved context by 78 percent compared to baselines, demonstrating retrieval of more structurally consistent information. These findings suggest that path-constrained retrieval is an effective approach for improving the reliability and coherence of LLM agent reasoning systems.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohammad Aqib, Mohd Hamza, Ying Hei Chui, Qipei Mei
Natural Language Processing
ๆฅ็ๆ่ฆ
Building codes contain critical information for ensuring safety, regulatory compliance, and informed decision-making in construction and engineering. Automated question answering systems over such codes enable quick and accurate access to specific regulatory clauses, improving efficiency and reducing errors. Retrieval-Augmented Generation (RAG) systems are essential for this task as they combine the precision of information retrieval with the generative capabilities of language models. However, tabular data are challenging to extract as they often involve complex layouts, merged cells, multi-row headers, and embedded semantic relationships that are not easily captured by traditional natural language processing techniques and Vision Language Models (VLMs). This paper explores and compares two methods for extracting information from tabular data in building codes using several pre-trained VLMs. First, a direct input method is used, where the image of the page is input directly into the VLMs, which are then tasked with answering questions based on the image. Second, an indirect input method is introduced, which involves converting an image of a page containing tables into the LaTeX code and then answering inquires based on the LaTeX-based input. The experiments find that the direct input method generally resulted in higher accuracy than the indirect input method. To further improve the performance, we fine-tuned each VLM using Low Rank Adaptation (LoRA) on a domain-specific tabular dataset. The fine-tuned models exhibited substantial improvements, with Qwen2.5-VL-3B-Instruct achieving relative accuracy gains exceeding 100%. Our results highlight the potential of parameter-efficient fine-tuning methods to adapt powerful VLMs for understanding complex structured data in specialized fields, such as building code interpretation and regulatory compliance.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Harsh Rathva, Pruthwik Mishra, Shrikant Malviya
Natural Language Processing
ๆฅ็ๆ่ฆ
The detection of hallucinations in multilingual scientific text generated by Large Language Models (LLMs) presents significant challenges for reliable AI systems. This paper describes our submission to the SHROOM-CAP 2025 shared task on scientific hallucination detection across 9 languages. Unlike most approaches that focus primarily on model architecture, we adopted a data-centric strategy that addressed the critical issue of training data scarcity and imbalance. We unify and balance five existing datasets to create a comprehensive training corpus of 124,821 samples (50% correct, 50% hallucinated), representing a 172x increase over the original SHROOM training data. Our approach fine-tuned XLM-RoBERTa-Large with 560 million parameters on this enhanced dataset, achieves competitive performance across all languages, including \textbf{2nd place in Gujarati} (zero-shot language) with Factuality F1 of 0.5107, and rankings between 4th-6th place across the remaining 8 languages. Our results demonstrate that systematic data curation can significantly outperform architectural innovations alone, particularly for low-resource languages in zero-shot settings.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xiaochen Zheng, Alvaro Serra, Ilya Schneider Chernov, Maddalena Marchesi, Eunice Musvasva et al.
ๆฅ็ๆ่ฆ
Pharmaceutical research and development has accumulated vast, heterogeneous archives of data. Much of this knowledge stems from discontinued programs, and reusing these archives is invaluable for reverse translation. However, in practice, such reuse is often infeasible. In this work, we introduce DiscoVerse, a multi-agent co-scientist designed to support pharmaceutical research and development. The system implements semantic retrieval, cross-document linking, and auditable synthesis on a large historical corpus from Roche. To validate our approach at real-world scale, we selected a subset of 180 molecules from the Roche research repositories, covering over 0.87 billion BPE tokens and more than four decades of research. Given that automated evaluation metrics are poorly aligned with scientific utility, we evaluate the performance of DiscoVerse using blinded expert evaluation of source-linked outputs. To our knowledge, this is the first agentic framework systematically assessed on real pharmaceutical data for reverse translation, enabled by authorized access to confidential, end-to-end drug-development archives. Our contributions include role-specialized agent designs aligned with scientist workflows; human-in-the-loop support for reverse translation; expert evaluation; and a large-scale demonstration showing promising answer accuracy and decision-making insights. In brief, across seven benchmark queries covering 180 molecules, DiscoVerse achieved near-perfect recall ($\geq 0.99$) with moderate precision ($0.71-0.91$), while qualitative assessments of discontinuation rationale and organ-specific toxicity showed faithful, source-linked synthesis across preclinical and clinical evidence.
๐
2025-11-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Raoul H. Kutil, Georg Zimmermann, Christian Borgelt
ๆฅ็ๆ่ฆ
Diagnosing cognitive (mental health) disorders is a delicate and complex task. Identifying the next most informative symptoms to assess, in order to distinguish between possible disorders, presents an additional challenge. This process requires comprehensive knowledge of diagnostic criteria and symptom overlap across disorders, making it difficult to navigate based on symptoms alone. This research aims to develop a recommender system for cognitive disorder diagnosis using binary matrix representations. The core algorithm utilizes a binary matrix of disorders and their symptom combinations. It filters through the rows and columns based on the patient's current symptoms to identify potential disorders and recommend the most informative next symptoms to examine. A prototype of the recommender system was implemented in Python. Using synthetic test and some real-life data, the system successfully identified plausible disorders from an initial symptom set and recommended further symptoms to refine the diagnosis. It also provided additional context on the symptom-disorder relationships. Although this is a prototype, the recommender system shows potential as a clinical support tool. A fully-developed application of this recommender system may assist mental health professionals in identifying relevant disorders more efficiently and guiding symptom-specific follow-up investigations to improve diagnostic accuracy.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ B. Y. Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, Zheng Liu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Memory is critical for AI agents, yet the widely-adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called \textbf{general agentic memory (GAM)}. GAM follows the principle of "\textbf{just-in time (JIT) compilation}" where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a duo-design with the following components. 1) \textbf{Memorizer}, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) \textbf{Researcher}, which retrieves and integrates useful information from the page-store for its online request guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvement on various memory-grounded task completion scenarios against existing memory systems.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yu Xia, Sungchul Kim, Tong Yu, Ryan A. Rossi, Julian McAuely
Natural Language Processing
ๆฅ็ๆ่ฆ
Agentic recommendations cast recommenders as large language model (LLM) agents that can plan, reason, use tools, and interact with users of varying preferences in web applications. However, most existing agentic recommender systems focus on generic single-agent plan-execute workflows or multi-agent task decomposition pipelines. Without recommendation-oriented design, they often underuse the collaborative signals in the user-item interaction history, leading to unsatisfying recommendation results. To address this, we propose the Multi-Agent Collaborative Filtering (MACF) framework for agentic recommendations, drawing an analogy between traditional collaborative filtering algorithms and LLM-based multi-agent collaboration. Specifically, given a target user and query, we instantiate similar users and relevant items as LLM agents with unique profiles. Each agent is able to call retrieval tools, suggest candidate items, and interact with other agents. Different from the static preference aggregation in traditional collaborative filtering, MACF employs a central orchestrator agent to adaptively manage the collaboration between user and item agents via dynamic agent recruitment and personalized collaboration instruction. Experimental results on datasets from three different domains show the advantages of our MACF framework compared to strong agentic recommendation baselines.
๐
2025-11-23
๐ \c{opyright} 2025 IEEE. Personal use of this mater
๐ฅ Mohammad Nour Al Awad, Sergey Ivanov, Olga Tikhonova, Ivan Khodnenko
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation in an interactive, context-aware dialogue with users, including voice interaction, while maintaining high performance. In this article, we present Talk2Data, a multimodal LLM-driven conversational agent for intuitive data exploration. The system lets users query datasets with voice or text instructions and receive answers as plots, tables, statistics, or spoken explanations. Built on LLMs, the suggested design combines OpenAI Whisper automatic speech recognition (ASR) system, Qwen-coder code generation LLM/model, custom sandboxed execution tools, and Coqui library for text-to-speech (TTS) within an agentic orchestration loop. Unlike text-only analysis tools, it adapts responses across modalities and supports multi-turn dialogues grounded in dataset context. In an evaluation of 48 tasks on three datasets, our prototype achieved 95.8% accuracy with model-only generation time under 1.7 seconds (excluding ASR and execution time). A comparison across five LLM sizes (1.5B-32B) revealed accuracy-latency-cost trade-offs, with a 7B model providing the best balance for interactive use. By routing between conversation with user and code execution, constrained to a transparent sandbox, with simultaneously grounding prompts in schema-level context, the Talk2Data agent reliably retrieves actionable insights from tables while making computations verifiable. In the article, except for the Talk2Data agent itself, we discuss implications for human-data interaction, trust in LLM-driven analytics, and future extensions toward large-scale multimodal assistants.
๐
2025-11-23
๐ ArXiv Preprint (cs.NI)
๐ฅ Muhammad Bilal, Zafar Qazi, Marco Canini
ๆฅ็ๆ่ฆ
The rise of Generative AI Search is fundamentally transforming how users and intelligent systems interact with the Internet. LLMs increasingly act as intermediaries between humans and web information. Yet the web remains optimized for human browsing rather than AI-driven semantic retrieval, resulting in wasted network bandwidth, lower information quality, and unnecessary complexity for developers. We introduce the concept of an AI-Native Internet, a web architecture in which servers expose semantically relevant information chunks rather than full documents, supported by a Web-native semantic resolver that allows AI applications to discover relevant information sources before retrieving fine-grained chunks. Through motivational experiments, we quantify the inefficiencies of current HTML-based retrieval, and outline architectural directions and open challenges for evolving today's document-centric web into an AI-oriented substrate that better supports semantic access to web content.
๐
2025-11-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Haoyan Fu, Zhida Qin, Shixiao Yang, Haoyao Zhang, Bin Lu et al.
ๆฅ็ๆ่ฆ
Sequential recommendation (SR) is widely deployed in e-commerce platforms, streaming services, etc., revealing significant potential to enhance user experience. However, existing methods often overlook two critical factors: irregular user interests between interactions and highly uneven item distributions over time. The former factor implies that actual user preferences are not always continuous, and long-term historical interactions may not be relevant to current purchasing behavior. Therefore, relying only on these historical interactions for recommendations may result in a lack of user interest at the target time. The latter factor, characterized by peaks and valleys in interaction frequency, may result from seasonal trends, special events, or promotions. These externally driven distributions may not align with individual user interests, leading to inaccurate recommendations. To address these deficiencies, we propose TGODE to both enhance and capture the long-term historical interactions. Specifically, we first construct a user time graph and item evolution graph, which utilize user personalized preferences and global item distribution information, respectively. To tackle the temporal sparsity caused by irregular user interactions, we design a time-guided diffusion generator to automatically obtain an augmented time-aware user graph. Additionally, we devise a user interest truncation factor to efficiently identify sparse time intervals and achieve balanced preference inference. After that, the augmented user graph and item graph are fed into a generalized graph neural ordinary differential equation (ODE) to align with the evolution of user preferences and item distributions. This allows two patterns of information evolution to be matched over time. Experimental results demonstrate that TGODE outperforms baseline methods across five datasets, with improvements ranging from 10% to 46%.
๐
2025-11-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiaming Zhang, Yuyuan Li, Xiaohua Feng, Zhifei Ren, Li Zhang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language model-based Recommender Systems (LRSs) have demonstrated superior recommendation performance by integrating pre-training with Supervised Fine-Tuning (SFT). However, this approach introduces item-side unfairness. Existing studies primarily attribute this issue to the absence of fairness constraints during SFT and attempt to mitigate unfairness via re-weighting and re-ranking methods. In this paper, we find that unfairness arises not only from SFT but also from pre-training, where inherent biases are further amplified during SFT. This finding underscores the failure of current methods to address the root causes of unfairness. Moreover, current methods struggle to preserve satisfactory recommendation performance. To tackle these issues, we propose an Unfair-to-Fair evOlving (UFO) framework using a self-play mechanism, formulating unfairness mitigation as a two-player game. UFO alternates between two player roles: the \textit{judger}, which identifies unfairness from both pre-training and SFT, and the \textit{corrector}, which adjusts the LRS to address identified unfairness while preserving recommendation performance. Iterative optimization between these roles enables UFO to completely resolve unfairness. Extensive experiments demonstrate that UFO effectively mitigates unfairness while improving recommendation performance.
๐
2025-11-23
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joseph Oladokun
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model agents often retrieve context from knowledge bases that lack structural consistency with the agent's current reasoning state, leading to incoherent reasoning chains. We introduce Path-Constrained Retrieval (PCR), a retrieval method that combines structural graph constraints with semantic search to ensure retrieved information maintains logical relationships within a knowledge graph. PCR restricts the search space to nodes reachable from an anchor node, preventing retrieval of structurally disconnected information that may lead to inconsistent reasoning. We evaluate PCR on PathRAG-6, a benchmark spanning six domains with 180 nodes and 360 edges. Our results show that PCR achieves full structural consistency compared to 24-32 percent in baseline methods, while maintaining strong relevance scores. On the technology domain, PCR obtains full relevance at rank 10 with full structural consistency, significantly outperforming vector search and hybrid retrieval. PCR reduces the average graph distance of retrieved context by 78 percent compared to baselines, demonstrating retrieval of more structurally consistent information. These findings suggest that path-constrained retrieval is an effective approach for improving the reliability and coherence of LLM agent reasoning systems.
๐
2025-11-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiahao Liang, Haoran Yang, Xiangyu Zhao, Zhiwen Yu, Mianjie Li et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Out-of-distribution (OOD) generalization has emerged as a significant challenge in graph recommender systems. Traditional graph neural network algorithms often fail because they learn spurious environmental correlations instead of stable causal relationships, leading to substantial performance degradation under distribution shifts. While recent advancements in Large Language Models (LLMs) offer a promising avenue due to their vast world knowledge and reasoning capabilities, effectively integrating this knowledge with the fine-grained topology of specific graphs to solve the OOD problem remains a significant challenge. To address these issues, we propose {$\textbf{Inv}$ariant $\textbf{G}$raph $\textbf{C}$ontrastive Learning with $\textbf{LLM}$s for Out-of-Distribution Recommendation (InvGCLLM)}, an innovative causal learning framework that synergistically integrates the strengths of data-driven models and knowledge-driven LLMs. Our framework first employs a data-driven invariant learning model to generate causal confidence scores for each user-item interaction. These scores then guide an LLM to perform targeted graph refinement, leveraging its world knowledge to prune spurious connections and augment missing causal links. Finally, the structurally purified graphs provide robust supervision for a causality-guided contrastive learning objective, enabling the model to learn representations that are resilient to spurious correlations. Experiments conducted on four public datasets demonstrate that InvGCLLM achieves significant improvements in out-of-distribution recommendation, consistently outperforming state-of-the-art baselines.
๐
2025-11-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiahao Liang, Haoran Yang, Xiangyu Zhao, Zhiwen Yu, Guandong Xu et al.
ๆฅ็ๆ่ฆ
The challenges associated with large-scale user-item interaction graphs have attracted increasing attention in graph-based recommendation systems, primarily due to computational inefficiencies and inadequate information propagation. Existing methods provide partial solutions but suffer from notable limitations: model-centric approaches, such as sampling and aggregation, often struggle with generalization, while data-centric techniques, including graph sparsification and coarsening, lead to information loss and ineffective handling of bipartite graph structures. Recent advances in graph condensation offer a promising direction by reducing graph size while preserving essential information, presenting a novel approach to mitigating these challenges. Inspired by the principles of democracy, we propose \textbf{DemoRec}, a framework that leverages graph condensation to generate user and item representatives for recommendation tasks. By constructing a compact interaction graph and clustering nodes with shared characteristics from the original graph, DemoRec significantly reduces graph size and computational complexity. Furthermore, it mitigates the over-reliance on high-order information, a critical challenge in large-scale bipartite graphs. Extensive experiments conducted on four public datasets demonstrate the effectiveness of DemoRec, showcasing substantial improvements in recommendation performance, computational efficiency, and robustness compared to SOTA methods.
๐
2025-11-23
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Shijun Li, Yu Wang, Jin Wang, Ying Li, Joydeep Ghosh et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have shown significant potential for improving recommendation systems through their inherent reasoning capabilities and extensive knowledge base. Yet, existing studies predominantly address warm-start scenarios with abundant user-item interaction data, leaving the more challenging cold-start scenarios, where sparse interactions hinder traditional collaborative filtering methods, underexplored. To address this limitation, we propose novel reasoning strategies designed for cold-start item recommendations within the Netflix domain. Our method utilizes the advanced reasoning capabilities of LLMs to effectively infer user preferences, particularly for newly introduced or rarely interacted items. We systematically evaluate supervised fine-tuning, reinforcement learning-based fine-tuning, and hybrid approaches that combine both methods to optimize recommendation performance. Extensive experiments on real-world data demonstrate significant improvements in both methodological efficacy and practical performance in cold-start recommendation contexts. Remarkably, our reasoning-based fine-tuned models outperform Netflix's production ranking model by up to 8% in certain cases.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Litian Gong, Fatemeh Bahrani, Yutai Zhou, Amin Banayeeanzade, Jiachen Li et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Kensuke Nakamura, Arun L. Bishop, Steven Man, Aaron M. Johnson, Zachary Manchester et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Latent safety filters extend Hamilton-Jacobi (HJ) reachability to operate on latent state representations and dynamics learned directly from high-dimensional observations, enabling safe visuomotor control under hard-to-model constraints. However, existing methods implement "least-restrictive" filtering that discretely switch between nominal and safety policies, potentially undermining the task performance that makes modern visuomotor policies valuable. While reachability value functions can, in principle, be adapted to be control barrier functions (CBFs) for smooth optimization-based filtering, we theoretically and empirically show that current latent-space learning methods produce fundamentally incompatible value functions. We identify two sources of incompatibility: First, in HJ reachability, failures are encoded via a "margin function" in latent space, whose sign indicates whether or not a latent is in the constraint set. However, representing the margin function as a classifier yields saturated value functions that exhibit discontinuous jumps. We prove that the value function's Lipschitz constant scales linearly with the margin function's Lipschitz constant, revealing that smooth CBFs require smooth margins. Second, reinforcement learning (RL) approximations trained solely on safety policy data yield inaccurate value estimates for nominal policy actions, precisely where CBF filtering needs them. We propose the LatentCBF, which addresses both challenges through gradient penalties that lead to smooth margin functions without additional labeling, and a value-training procedure that mixes data from both nominal and safety policy distributions. Experiments on simulated benchmarks and hardware with a vision-based manipulation policy demonstrate that LatentCBF enables smooth safety filtering while doubling the task-completion rate over prior switching methods.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Hannah Lee, James D. Motes, Marco Morales, Nancy M. Amato
ๆฅ็ๆ่ฆ
This study informs the design of future multi-agent pathfinding (MAPF) and multi-robot motion planning (MRMP) algorithms by guiding choices based on constraint classification for constraint-based search algorithms. We categorize constraints as conservative or aggressive and provide insights into their search behavior, focusing specifically on vanilla Conflict-Based Search (CBS) and Conflict-Based Search with Priorities (CBSw/P). Under a hybrid grid-roadmap representation with varying resolution, we observe that aggressive (priority constraint) formulations tend to solve more instances as agent count or resolution increases, whereas conservative (motion constraint) formulations yield stronger solution quality when both succeed. Findings are synthesized in a decision flowchart, aiding users in selecting suitable constraints. Recommendations extend to Multi-Robot Motion Planning (MRMP), emphasizing the importance of considering topological features alongside problem, solution, and representation features. A comprehensive exploration of the study, including raw data and map performance, is available in our public GitHub Repository: https://GitHub.com/hannahjmlee/constraint-mapf-analysis
๐
2025-11-23
๐ Under review in IEEE Control Systems Letters (LCSS
๐ฅ Kooktae Lee, Ethan Brook
Robotics
ๆฅ็ๆ่ฆ
Multi-agent systems play a central role in area coverage tasks across search-and-rescue, environmental monitoring, and precision agriculture. Achieving non-uniform coverage, where spatial priorities vary across the domain, requires coordinating agents while respecting dynamic and communication constraints. Density-driven approaches can distribute agents according to a prescribed reference density, but existing methods do not ensure connectivity. This limitation often leads to communication loss, reduced coordination, and degraded coverage performance.
This letter introduces a connectivity-preserving extension of the Density-Driven Optimal Control (D2OC) framework. The coverage objective, defined using the Wasserstein distance between the agent distribution and the reference density, admits a convex quadratic program formulation. Communication constraints are incorporated through a smooth connectivity penalty, which maintains strict convexity, supports distributed implementation, and preserves inter-agent communication without imposing rigid formations.
Simulation studies show that the proposed method consistently maintains connectivity, improves convergence speed, and enhances non-uniform coverage quality compared with density-driven schemes that do not incorporate explicit connectivity considerations.
๐
2025-11-23
๐ CVPR 2026
๐ฅ Samarth Chopra, Jing Liang, Gershom Seneviratne, Dinesh Manocha
Computer Vision
ๆฅ็ๆ่ฆ
Understanding physical properties such as friction, stiffness, hardness, and material composition is essential for enabling robots to interact safely and effectively with their surroundings. However, existing 3D reconstruction methods focus on geometry and appearance and cannot infer these underlying physical properties. We present PhysGS, a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense, per-point physical properties from visual cues and vision--language priors. We formulate property estimation as Bayesian inference over Gaussian splats, where material and property beliefs are iteratively refined as new observations arrive. PhysGS also models aleatoric and epistemic uncertainties, enabling uncertainty-aware object and scene interpretation. Across object-scale (ABO-500), indoor, and outdoor real-world datasets, PhysGS improves accuracy of the mass estimation by up to 22.8%, reduces Shore hardness error by up to 61.2%, and lowers kinetic friction error by up to 18.1% compared to deterministic baselines. Our results demonstrate that PhysGS unifies 3D reconstruction, uncertainty modeling, and physical reasoning in a single, spatially continuous framework for dense physical property estimation. Additional results are available at https://samchopra2003.github.io/physgs.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Cem Bilaloglu, Tobias Lรถw, Sylvain Calinon
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Curved objects pose a fundamental challenge for skill transfer in robotics: unlike planar surfaces, they do not admit a global reference frame. As a result, task-relevant directions such as "toward" or "along" the surface vary with position and geometry, making object-centric tasks difficult to transfer across shapes. To address this, we introduce an approach using Diffused Orientation Fields (DOF), a smooth representation of local reference frames, for transfer learning of tasks across curved objects. By expressing manipulation tasks in these smoothly varying local frames, we reduce the problem of transferring tasks across curved objects to establishing sparse keypoint correspondences. DOF is computed online from raw point cloud data using diffusion processes governed by partial differential equations, conditioned on keypoints. We evaluate DOF under geometric, topological, and localization perturbations, and demonstrate successful transfer of tasks requiring continuous physical interaction such as inspection, slicing, and peeling across varied objects. We provide our open-source codes at our website https://github.com/idiap/diffused_fields_robotics
๐
2025-11-23
๐ ICRA 2026
๐ฅ Samarth Chopra, Jing Liang, Gershom Seneviratne, Yonghan Lee, Jaehoon Choi et al.
Robotics
ๆฅ็ๆ่ฆ
We present Splatblox, a real-time system for autonomous navigation in outdoor environments with dense vegetation, irregular obstacles, and complex terrain. Our method fuses segmented RGB images and LiDAR point clouds using Gaussian Splatting to construct a traversability-aware Euclidean Signed Distance Field (ESDF) that jointly encodes geometry and semantics. Updated online, this field enables semantic reasoning to distinguish traversable vegetation (e.g., tall grass) from rigid obstacles (e.g., trees), while LiDAR ensures 360-degree geometric coverage for extended planning horizons. We validate Splatblox on a quadruped robot and demonstrate transfer to a wheeled platform. In field trials across vegetation-rich scenarios, it outperforms state-of-the-art methods with over 50% higher success rate, 40% fewer freezing incidents, 5% shorter paths, and up to 13% faster time to goal, while supporting long-range missions up to 100 meters. Experiment videos and more details can be found on our project page: https://splatblox.github.io
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Ziyu Meng, Tengyu Liu, Le Ma, Yingying Wu, Ran Song et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Bipedal locomotion makes humanoid robots inherently prone to falls, causing catastrophic damage to the expensive sensors, actuators, and structural components of full-scale robots. To address this critical barrier to real-world deployment, we present \method, a framework that learns to predict imminent, unavoidable falls and execute protective maneuvers to minimize hardware damage. SafeFall is designed to operate seamlessly alongside existing nominal controller, ensuring no interference during normal operation. It combines two synergistic components: a lightweight, GRU-based fall predictor that continuously monitors the robot's state, and a reinforcement learning policy for damage mitigation. The protective policy remains dormant until the predictor identifies a fall as unavoidable, at which point it activates to take control and execute a damage-minimizing response. This policy is trained with a novel, damage-aware reward function that incorporates the robot's specific structural vulnerabilities, learning to shield critical components like the head and hands while absorbing energy with more robust parts of its body. Validated on a full-scale Unitree G1 humanoid, SafeFall demonstrated significant performance improvements over unprotected falls. It reduced peak contact forces by 68.3\%, peak joint torques by 78.4\%, and eliminated 99.3\% of collisions with vulnerable components. By enabling humanoids to fail safely, SafeFall provides a crucial safety net that allows for more aggressive experiments and accelerates the deployment of these robots in complex, real-world environments.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Jasan Zughaibi, Denis von Arx, Maurus Derungs, Florian Heemeyer, Luca A. Antonelli et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Electromagnetic navigation systems (eMNS) enable a number of magnetically guided surgical procedures. A challenge in magnetically manipulating surgical tools is that the effective workspace of an eMNS is often severely constrained by power and thermal limits. We show that system-level control design significantly expands this workspace by reducing the currents needed to achieve a desired motion. We identified five key system approaches that enable this expansion: (i) motion-centric torque/force objectives, (ii) energy-optimal current allocation, (iii) real-time pose estimation, (iv) dynamic feedback, and (v) high-bandwidth eMNS components. As a result, we stabilize a 3D inverted pendulum on an eight-coil OctoMag eMNS with significantly lower currents (0.1-0.2 A vs. 8-14 A), by replacing a field-centric field-alignment strategy with a motion-centric torque/force-based approach. We generalize to multi-agent control by simultaneously stabilizing two inverted pendulums within a shared workspace, exploiting magnetic-field nonlinearity and coil redundancy for independent actuation. A structured analysis compares the electromagnetic workspaces of both paradigms and examines current-allocation strategies that map motion objectives to coil currents. Cross-platform evaluation of the clinically oriented Navion eMNS further demonstrates substantial workspace expansion by maintaining stable balancing at distances up to 50 cm from the coils. The results demonstrate that feedback is a practical path to scalable, efficient, and clinically relevant magnetic manipulation.
๐
2025-11-23
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yoshihiro Maruyama
Machine Learning
ๆฅ็ๆ่ฆ
We develop a theory of category-equivariant neural networks (CENNs) that unifies group/groupoid-equivariant networks, poset/lattice-equivariant networks, graph and sheaf neural networks. Equivariance is formulated as naturality in a topological category with Radon measures, formulating linear and nonlinear layers in the categorical setup. We prove the equivariant universal approximation theorem in the general setting: the class of finite-depth CENNs is dense in the space of continuous equivariant transformations. We instantiate the framework for groups/groupoids, posets/lattices, graphs and cellular sheaves, deriving universal approximation theorems for them in a systematic manner. Categorical equivariant deep learning thus allows us to expand the horizons of equivariant deep learning beyond group actions, encompassing not only geometric symmetries but also contextual and compositional symmetries.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Jiaxun Sun
ๆฅ็ๆ่ฆ
This paper establishes the first explicit and closed-form upper bound on the Hausdorff distance between the truncated minimal robust positively invariant (mRPI) set and its infinite-horizon limit. While existing mRPI approximations guarantee asymptotic convergence through geometric or norm-based arguments, none provides a computable expression that quantifies the truncation error for a given horizon. We show that the error satisfies \( d_H(\mathcal{E}_N,\mathcal{E}_\infty) \le r_W\,ฮณ^{N+1}/(1-ฮณ), \) where $ฮณ<1$ is the induced-norm contraction factor and $r_W$ depends only on the disturbance set. The bound is fully analytic, requires no iterative set computations, and directly characterizes the decay rate of the truncated Minkowski series. We further demonstrate that the choice of vector norm serves as a design parameter that accelerates convergence, enabling substantially tighter horizon selection for robust invariant-set computations and tube-based MPC. Numerical experiments validate the sharpness, scalability, and practical relevance of the proposed bound.
๐
2025-11-23
๐ Submitted to IEEE Journal of Selected Topics in Ap
๐ฅ Sigrid Helene Strand, Thomas Wiedemann, Bram Burczek, Dmitriy Shutin
Machine Learning
ๆฅ็ๆ่ฆ
Search and rescue missions are often critical following sudden natural disasters or in high-risk environmental situations. The most challenging search and rescue missions involve difficult-to-access terrains, such as dense forests with high occlusion. Deploying unmanned aerial vehicles for exploration can significantly enhance search effectiveness, facilitate access to challenging environments, and reduce search time. However, in dense forests, the effectiveness of unmanned aerial vehicles depends on their ability to capture clear views of the ground, necessitating a robust search strategy to optimize camera positioning and perspective. This work presents an optimized planning strategy and an efficient algorithm for the next best view problem in occluded environments. Two novel optimization heuristics, a geometry heuristic, and a visibility heuristic, are proposed to enhance search performance by selecting optimal camera viewpoints. Comparative evaluations in both simulated and real-world settings reveal that the visibility heuristic achieves greater performance, identifying over 90% of hidden objects in simulated forests and offering 10% better detection rates than the geometry heuristic. Additionally, real-world experiments demonstrate that the visibility heuristic provides better coverage under the canopy, highlighting its potential for improving search and rescue missions in occluded environments.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Henrik Krauss, Johann Licher, Naoya Takeishi, Annika Raatz, Takehisa Yairi
Robotics
ๆฅ็ๆ่ฆ
Data-driven learning of soft continuum robot (SCR) dynamics from high-dimensional observations offers flexibility but often lacks physical interpretability, while model-based approaches require prior knowledge and can be computationally expensive. We bridge this gap by introducing (1) the Attention Broadcast Decoder (ABCD), a plug-and-play module for autoencoder-based latent dynamics learning that generates pixel-accurate attention maps localizing each latent dimension's contribution while filtering static backgrounds. (2) By coupling these attention maps to 2D oscillator networks, we enable direct on-image visualization of learned dynamics (masses, stiffness, and forces) without prior knowledge. We validate our approach on single- and double-segment SCRs, demonstrating that ABCD-based models significantly improve multi-step prediction accuracy: 5.7x error reduction for Koopman operators and 3.5x for oscillator networks on the two-segment robot. The learned oscillator network autonomously discovers a chain structure of oscillators. Unlike standard methods, ABCD models enable smooth latent space extrapolation beyond training data. This fully data-driven approach yields compact, physically interpretable models suitable for control applications.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Steven Oh, Tai Inui, Magdeline Kuan, Jia-Yeu Lin
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Robotic manipulation tasks are contact-rich, yet most imitation learning (IL) approaches rely primarily on vision, which struggles to capture stiffness, roughness, slip, and other fine interaction cues. Tactile signals can address this gap, but existing sensors often require expensive, delicate, or integration-heavy hardware. In this work, we introduce MicCheck, a plug-and-play acoustic sensing approach that repurposes an off-the-shelf Bluetooth pin microphone as a low-cost contact sensor. The microphone clips into a 3D-printed gripper insert and streams audio via a standard USB receiver, requiring no custom electronics or drivers. Despite its simplicity, the microphone provides signals informative enough for both perception and control. In material classification, it achieves 92.9% accuracy on a 10-class benchmark across four interaction types (tap, knock, slow press, drag). For manipulation, integrating pin microphone into an IL pipeline with open source hardware improves the success rate on picking and pouring task from 0.40 to 0.80 and enables reliable execution of contact-rich skills such as unplugging and sound-based sorting. Compared with high-resolution tactile sensors, pin microphones trade spatial detail for cost and ease of integration, offering a practical pathway for deploying acoustic contact sensing in low-cost robot setups.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Shuai Zhang, Jingsong Mu, Cancan Zhao, Leiqi Tian, Zhijun Xing et al.
ๆฅ็ๆ่ฆ
Neural radiance field (NeRF) is a promising approach for reconstruction and new view synthesis. However, previous NeRF-based reconstruction methods overlook the critical role of acoustic impedance in ultrasound imaging. Localization methods face challenges related to local minima due to the selection of initial poses. In this study, we design a robotic ultrasound system (RUSS) with an acoustic-impedance-aware ultrasound NeRF (AIA-UltraNeRF) to decouple the scanning and diagnostic processes. Specifically, AIA-UltraNeRF models a continuous function of hash-encoded spatial coordinates for the 3D ultrasound map, allowing for the storage of acoustic impedance without dense sampling. This approach accelerates both reconstruction and inference speeds. We then propose a dual-supervised network that leverages teacher and student models to hash-encode the rendered ultrasound images from the reconstructed map. AIA-UltraNeRF retrieves the most similar hash values without the need to render images again, providing an offline initial image position for localization. Moreover, we develop a RUSS with a spherical remote center of motion mechanism to hold the probe, implementing operator-independent scanning modes that separate image acquisition from diagnostic workflows. Experimental results on a phantom and human subjects demonstrate the effectiveness of acoustic impedance in implicitly characterizing the color of ultrasound images. AIAUltraNeRF achieves both reconstruction and localization with inference speeds that are 9.9 faster than those of vanilla NeRF.
๐
2025-11-23
๐ ArXiv Preprint (Robotics)
๐ฅ Zhongkai Chen, Yihao Sun, Chao Yan, Han Zhou, Xiaojia Xiang et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous aerial vehicles (AAVs) have played a pivotal role in coverage operations and search missions. Recent advances in large language models (LLMs) offer promising opportunities to augment AAV intelligence. These advances help address complex challenges like area coverage optimization, dynamic path planning, and adaptive decision-making. However, the absence of physical grounding in LLMs leads to hallucination and reproducibility problems in spatial reasoning and decision-making. To tackle these issues, we present Skypilot, an LLM-enhanced two-stage framework that grounds language models in physical reality by integrating monte carlo tree search (MCTS). In the first stage, we introduce a diversified action space that encompasses generate, regenerate, fine-tune, and evaluate operations, coupled with physics-informed reward functions to ensure trajectory feasibility. In the second stage, we fine-tune Qwen3-4B on 23,000 MCTS-generated samples, achieving substantial inference acceleration while maintaining solution quality. Extensive numerical simulations and real-world flight experiments validate the efficiency and superiority of our proposed approach. Detailed information and experimental results are accessible at https://sky-pilot.top.
๐
2025-11-21
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xiangrui Xiong, Yichuan Lu, Zifei Pan, Chang Sun
Natural Language Processing
ๆฅ็ๆ่ฆ
The growth of Massive Open Online Courses (MOOCs) presents significant challenges for personalized learning, where concept recommendation is crucial. Existing approaches typically rely on heterogeneous information networks or knowledge graphs to capture conceptual relationships, combined with knowledge tracing models to assess learners' cognitive states. However, these methods face significant limitations due to their dependence on high-quality structured knowledge graphs, which are often scarce in real-world educational scenarios. To address this fundamental challenge, this paper proposes CLLMRec, a novel framework that leverages Large Language Models through two synergistic technical pillars: Semantic Alignment and Prerequisite Knowledge Distillation. The Semantic Alignment component constructs a unified representation space by encoding unstructured textual descriptions of learners and concepts. The Prerequisite Knowledge Distillation paradigm employs a teacher-student architecture, where a large teacher LLM (implemented as the Prior Knowledge Aware Component) extracts conceptual prerequisite relationships from its internalized world knowledge and distills them into soft labels to train an efficient student ranker. Building upon these foundations, our framework incorporates a fine-ranking mechanism that explicitly models learners' real-time cognitive states through deep knowledge tracing, ensuring recommendations are both structurally sound and cognitively appropriate. Extensive experiments on two real-world MOOC datasets demonstrate that CLLMRec significantly outperforms existing baseline methods across multiple evaluation metrics, validating its effectiveness in generating truly cognitive-aware and personalized concept recommendations without relying on explicit structural priors.
๐
2025-11-21
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yesheng Liu, Hao Li, Haiyu Xu, Baoqi Pei, Jiahao Wang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to different answer types, apply different rewriting and verification schemes, respectively. When applied for RFT, we converted 20k MCQA examples and use GRPO to finetune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.
๐
2025-11-21
๐ ArXiv Preprint (Robotics)
๐ฅ Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan et al.
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and supporting the world model's image generation. The unified framework of RynnVLA-002 enables joint learning of environmental dynamics and action planning. Our experiments show that RynnVLA-002 surpasses individual VLA and world models, demonstrating their mutual enhancement. We evaluate RynnVLA-002 in both simulation and real-world robot tasks. RynnVLA-002 achieves 97.4% success rate on the LIBERO simulation benchmark without pretraining, while in real-world LeRobot experiments, its integrated world model boosts the overall success rate by 50%.
๐
2025-11-21
๐ ArXiv Preprint (Robotics)
๐ฅ Yixuan Pan, Ruoyi Qiao, Li Chen, Kashyap Chitta, Liang Pan et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Humanoid robots are envisioned to perform a wide range of tasks in human-centered environments, requiring controllers that combine agility with robust balance. Recent advances in locomotion and whole-body tracking have enabled impressive progress in either agile dynamic skills or stability-critical behaviors, but existing methods remain specialized, focusing on one capability while compromising the other. In this work, we introduce AMS (Agility Meets Stability), the first framework that unifies both dynamic motion tracking and extreme balance maintenance in a single policy. Our key insight is to leverage heterogeneous data sources: human motion capture datasets that provide rich, agile behaviors, and physically constrained synthetic balance motions that capture stability configurations. To reconcile the divergent optimization goals of agility and stability, we design a hybrid reward scheme that applies general tracking objectives across all data while injecting balance-specific priors only into synthetic motions. Further, an adaptive learning strategy with performance-driven sampling and motion-specific reward shaping enables efficient training across diverse motion distributions. We validate AMS extensively in simulation and on a real Unitree G1 humanoid. Experiments demonstrate that a single policy can execute agile skills such as dancing and running, while also performing zero-shot extreme balance motions like Ip Man's Squat, highlighting AMS as a versatile control paradigm for future humanoid applications.
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ George Cazenavette, Antonio Torralba, Vincent Sitzmann
Multimodal
ๆฅ็ๆ่ฆ
The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively select suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Irmak Guzey, Haozhi Qi, Julen Urain, Changhao Wang, Jessica Yin et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot data collection. Despite substantial efforts, progress toward this goal has been bottle-necked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream. AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across nine everyday manipulation tasks. Robot rollouts are best viewed on our website: https://aina-robot.github.io.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Priyanka Kargupta, Shuyue Stella Li, Haocheng Wang, Jinu Lee, Shan Chen et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning computational constraints, meta-cognitive controls, knowledge representations, and transformation operations, then analyze their behavioral manifestations in reasoning traces. We propose a fine-grained cognitive evaluation framework and conduct the first large-scale analysis of 170K traces from 17 models across text, vision, and audio modalities, alongside 54 human think-aloud traces, which we make publicly available. Our analysis reveals systematic structural differences: humans employ hierarchical nesting and meta-cognitive monitoring while models rely on shallow forward chaining, with divergence most pronounced on ill-structured problems. Meta-analysis of 1,598 LLM reasoning papers reveals the research community concentrates on easily quantifiable behaviors (sequential organization: 55%, decomposition: 60%) while neglecting meta-cognitive controls (self-awareness: 16%, evaluation: 8%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffold successful structures, improving performance by up to 60% on complex problems. By bridging cognitive science and LLM research, we establish a foundation for developing models that reason through principled cognitive mechanisms rather than brittle spurious reasoning shortcuts or memorization, opening new directions for both improving model capabilities and testing theories of human cognition at scale.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Juan C. King, Jose M. Amigo
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents the implementation of an advanced artificial intelligence-based algorithmic trading system specifically designed for the EUR-USD pair within the high-frequency environment of the Forex market. The methodological approach centers on integrating a holistic set of input features: key fundamental macroeconomic variables (for example, Gross Domestic Product and Unemployment Rate) collected from both the Euro Zone and the United States, alongside a comprehensive suite of technical variables (including indicators, oscillators, Fibonacci levels, and price divergences). The performance of the resulting algorithm is evaluated using standard machine learning metrics to quantify predictive accuracy and backtesting simulations across historical data to assess trading profitability and risk. The study concludes with a comparative analysis to determine which class of input features, fundamental or technical, provides greater and more reliable predictive capacity for generating profitable trading signals.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio Leรณn Villares et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Na{รฏ}ve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E\in\mathbb{R}^{m\times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A\in \mathbb{R}^{m\times r},\ B\in \mathbb{R}^{n\times r}$ with $r\ll \min(m,n)$ to form a low-rank matrix perturbation $A B^\top$ that are used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings, despite being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.
๐
2025-11-20
๐ Accepted at 2025 IEEE International Conference on
๐ฅ Md. Samiul Alim, Sharjil Khan, Amrijit Biswas, Fuad Rahman, Shafin Rahman et al.
Machine Learning
ๆฅ็ๆ่ฆ
Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Markus Anders, Bart Bogaerts, Benjamin Bogรธ, Arthur Gontier, Wietze Koops et al.
ๆฅ็ๆ่ฆ
Symmetry breaking is a crucial technique in modern combinatorial solving, but it is difficult to be sure it is implemented correctly. The most successful approach to deal with bugs is to make solvers certifying, so that they output not just a solution, but also a mathematical proof of correctness in a standard format, which can then be checked by a formally verified checker. This requires justifying symmetry reasoning within the proof, but developing efficient methods for this has remained a long-standing open challenge. A fully general approach was recently proposed by Bogaerts et al. (2023), but it relies on encoding lexicographic orders with big integers, which quickly becomes infeasible for large symmetries. In this work, we develop a method for instead encoding orders with auxiliary variables. We show that this leads to orders-of-magnitude speed-ups in both theory and practice by running experiments on proof logging and checking for SAT symmetry breaking using the state-of-the-art satsuma symmetry breaker and the VeriPB proof checking toolchain.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shihab Ahmed, El Houcine Bergou, Aritra Dutta, Yue Wang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Policy gradient methods, which have been extensively studied in the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performances can often be unsatisfactory, suffering from unreliable reward improvements and slow convergence, due to high variance in gradient estimations. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, where we selectively update the policy based on high-confidence performance estimations. We theoretically justify that our technique will not slow down the convergence of the baseline policy gradient methods, but with high probability, will result in stable and monotonic improvements of their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster convergence to near-optimal returns, up to 1.75x reduction in return variance on some setups. Our profiling approach offers a general, theoretically grounded path to more reliable and efficient policy learning in complex environments.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Elias Hossain, Md Mehedi Hasan Nipu, Maleeha Sheikh, Rajib Rana, Subash Neupane et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We propose MedBayes-Lite, a lightweight Bayesian enhancement for transformer-based clinical language models designed to produce reliable, uncertainty-aware predictions. Although transformers show strong potential for clinical decision support, they remain prone to overconfidence, especially in ambiguous medical cases where calibrated uncertainty is critical. MedBayes-Lite embeds uncertainty quantification directly into existing transformer pipelines without any retraining or architectural rewiring, adding no new trainable layers and keeping parameter overhead under 3 percent. The framework integrates three components: (i) Bayesian Embedding Calibration using Monte Carlo dropout for epistemic uncertainty, (ii) Uncertainty-Weighted Attention that marginalizes over token reliability, and (iii) Confidence-Guided Decision Shaping inspired by clinical risk minimization. Across biomedical QA and clinical prediction benchmarks (MedQA, PubMedQA, MIMIC-III), MedBayes-Lite consistently improves calibration and trustworthiness, reducing overconfidence by 32 to 48 percent. In simulated clinical settings, it can prevent up to 41 percent of diagnostic errors by flagging uncertain predictions for human review. These results demonstrate its effectiveness in enabling reliable uncertainty propagation and improving interpretability in medical AI systems.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Yingji Zhang et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive ``Metaloop'' training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tianlong Zhang, Hongwei Xue, Shilin Yan, Di Wu, Chen Xu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis-where subsequent judgments are conditioned on previous ones-and further benefits from post-hoc CoT.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
๐
2025-11-20
๐ ArXiv Preprint (cs.SE)
๐ฅ Diaeddin Rimawi
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Cyber-physical systems (CPS) combine computational and physical components. Online Collaborative AI System (OL-CAIS) is a type of CPS that learn online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Sen Chen, Tong Zhao, Yi Bin, Fei Ma, Wenqi Shao et al.
ๆฅ็ๆ่ฆ
Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. While most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
๐
2025-11-20
๐ AAAI
๐ฅ Jules Soria, Zakaria Chihani, Julien Girard-Satabin, Alban Grastien, Romain Xu-Darme et al.
ๆฅ็ๆ่ฆ
Case-based reasoning networks are machine-learning models that make predictions based on similarity between the input and prototypical parts of training samples, called prototypes. Such models are able to explain each decision by pointing to the prototypes that contributed the most to the final outcome. As the explanation is a core part of the prediction, they are often qualified as ``interpretable by design". While promising, we show that such explanations are sometimes misleading, which hampers their usefulness in safety-critical contexts. In particular, several instances may lead to different predictions and yet have the same explanation. Drawing inspiration from the field of formal eXplainable AI (FXAI), we propose Abductive Latent Explanations (ALEs), a formalism to express sufficient conditions on the intermediate (latent) representation of the instance that imply the prediction. Our approach combines the inherent interpretability of case-based reasoning models and the guarantees provided by formal XAI. We propose a solver-free and scalable algorithm for generating ALEs based on three distinct paradigms, compare them, and present the feasibility of our approach on diverse datasets for both standard and fine-grained image classification. The associated code can be found at https://github.com/julsoria/ale
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Andres Campero, Derek Shiller, Jaan Aru, Jonathan Simon
ๆฅ็ๆ่ฆ
We develop a taxonomical framework for classifying challenges to the possibility of consciousness in digital artificial intelligence systems. This framework allows us to identify the level of granularity at which a given challenge is intended (the levels we propose correspond to Marr's levels) and to disambiguate its degree of force: is it a challenge to computational functionalism that leaves the possibility of digital consciousness open (degree 1), a practical challenge to digital consciousness that suggests improbability without claiming impossibility (degree 2), or an argument claiming that digital consciousness is strictly impossible (degree 3)? We apply this framework to 14 prominent examples from the scientific and philosophical literature. Our aim is not to take a side in the debate, but to provide structure and a tool for disambiguating between challenges to computational functionalism and challenges to digital consciousness, as well as between different ways of parsing such challenges.
๐
2025-11-20
๐ ArXiv Preprint (cs.LO)
๐ฅ Gaspard Ohlmann, Edwin Hamel-De le Court, Francesco Belardinelli
Robotics
ๆฅ็ๆ่ฆ
Ensuring that agents satisfy safety specifications can be crucial in safety-critical environments. While methods exist for controller synthesis with safe temporal specifications, most existing methods restrict safe temporal specifications to probabilistic-avoidance constraints. Formal methods typically offer more expressive ways to express safety in probabilistic systems, such as Probabilistic Computation Tree Logic (PCTL) formulas. Thus, in this paper, we develop a new approach that supports more general temporal properties expressed in PCTL. Our contribution is twofold. First, we develop a theoretical framework for the Synthesis of safe-PCTL specifications. We show how the reducing global specification satisfaction to local constraints, and define CPCTL, a fragment of safe-PCTL. We demonstrate how the expressiveness of CPCTL makes it a relevant fragment for the Synthesis Problem. Second, we leverage these results and propose a new Value Iteration-based algorithm to solve the synthesis problem for these more general temporal properties, and we prove the soundness and completeness of our method.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kexin Zhao, Ken Forbus
Natural Language Processing
ๆฅ็ๆ่ฆ
Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Fares Fourati, Mohamed-Slim Alouini, Vaneet Aggarwal
Machine Learning
ๆฅ็ๆ่ฆ
We propose ECPv2, a scalable and theoretically grounded algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants. Building on the Every Call is Precious (ECP) framework, which ensures that each accepted function evaluation is potentially informative, ECPv2 addresses key limitations of ECP, including high computational cost and overly conservative early behavior. ECPv2 introduces three innovations: (i) an adaptive lower bound to avoid vacuous acceptance regions, (ii) a Worst-m memory mechanism that restricts comparisons to a fixed-size subset of past evaluations, and (iii) a fixed random projection to accelerate distance computations in high dimensions. We theoretically show that ECPv2 retains ECP's no-regret guarantees with optimal finite-time bounds and expands the acceptance region with high probability. We further empirically validate these findings through extensive experiments and ablation studies. Using principled hyperparameter settings, we evaluate ECPv2 across a wide range of high-dimensional, non-convex optimization problems. Across benchmarks, ECPv2 consistently matches or outperforms state-of-the-art optimizers, while significantly reducing wall-clock time.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Misaal Khan, Mayank Vatsa, Kuldeep Singh, Richa Singh
Multimodal
ๆฅ็ๆ่ฆ
Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children's images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.
๐
2025-11-20
๐ ArXiv Preprint (cs.ET)
๐ฅ Asmita S. Thool, Sourodeep Roy, Prahalad Kanti Barman, Kartick Biswas, Pavan Nukala et al.
Robotics
ๆฅ็ๆ่ฆ
In this study, we design a reservoir computing (RC) network by exploiting short- and long-term memory dynamics in Au/Ti/MoS$_2$/Au memristive devices. The temporal dynamics is engineered by controlling the thickness of the Chemical Vapor Deposited (CVD) MoS$_2$ films. Devices with a monolayer (1L)-MoS$_2$ film exhibit volatile (short-term memory) switching dynamics. We also report non-volatile resistance switching with excellent uniformity and analog behavior in conductance tuning for the multilayer (ML) MoS$_2$ memristive devices. We correlate this performance with trap-assisted space-charge limited conduction (SCLC) mechanism, leading to a bulk-limited resistance switching behavior. Four-bit reservoir states are generated using volatile memristors. The readout layer is implemented with an array of nonvolatile synapses. This small RC network achieves 89.56\% precision in a spoken-digit recognition task and is also used to analyze a nonlinear time series equation.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Guanchen Wu, Yuzhang Xie, Huanwei Wu, Zhe He, Hui Shao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Integrating novel medical concepts and relationships into existing ontologies can significantly enhance their coverage and utility for both biomedical research and clinical applications. Clinical notes, as unstructured documents rich with detailed patient observations, offer valuable context-specific insights and represent a promising yet underutilized source for ontology extension. Despite this potential, directly leveraging clinical notes for ontology extension remains largely unexplored. To address this gap, we propose CLOZE, a novel framework that uses large language models (LLMs) to automatically extract medical entities from clinical notes and integrate them into hierarchical medical ontologies. By capitalizing on the strong language understanding and extensive biomedical knowledge of pre-trained LLMs, CLOZE effectively identifies disease-related concepts and captures complex hierarchical relationships. The zero-shot framework requires no additional training or labeled data, making it a cost-efficient solution. Furthermore, CLOZE ensures patient privacy through automated removal of protected health information (PHI). Experimental results demonstrate that CLOZE provides an accurate, scalable, and privacy-preserving ontology extension framework, with strong potential to support a wide range of downstream applications in biomedical research and clinical informatics.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zachary Ellis, Jared Joselowitz, Yash Deo, Yajie He, Anna Kalygina et al.
ๆฅ็ๆ่ฆ
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $ฮบ$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
๐
2025-11-20
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiaheng Zhang, Daqiang Zhang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage.
Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models.
Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jaime รlvarez Urueรฑa, David Camacho, Javier Huertas Tato
Machine Learning
ๆฅ็ๆ่ฆ
The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical.
This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators.
With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3\%, representing a 5.2 percentage point improvement over existing approaches . For the source attribution task, the proposed approach obtains improvements of of 14.70\% and 4.27\% in AUC and OSCR respectively on an open set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ รzay Ezerceli, Mahmoud El Hussieni, Selva Taล, Reyhan Bayraktar, Fatma Betรผl Terzioฤlu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71\% of its average mAP. Late-interaction models that are 3--5$\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8\% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\times$ faster than PLAID and offers +1.7\% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Carlos Boned Riera, David Romero Sanchez, Oriol Ramos Terrades
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In recent years, increasingly large models have achieved outstanding performance across CV tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher-student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zongcai Tan, Lan Wei, Dandan Zhang
Computer VisionRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging.This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame).The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.
๐
2025-11-20
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Rongjie Liao, Junhao Qiu, Xin Chen, Xiaoping Li
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Customized static operator design has enabled widespread application of Evolutionary Algorithms (EAs), but their search performance is transient during iterations and prone to degradation. Dynamic operators aim to address this but typically rely on predefined designs and localized parameter control during the search process, lacking adaptive optimization throughout evolution. To overcome these limitations, this work leverages Large Language Models (LLMs) to perceive evolutionary dynamics and enable operator-level meta-evolution. The proposed framework, LLMs for Evolutionary Optimization (LLM4EO), comprises three components: knowledge-transfer-based operator design, evolution perception and analysis, and adaptive operator evolution. Firstly, initialization of operators is performed by transferring the strengths of classical operators via LLMs. Then, search preferences and potential limitations of operators are analyzed by integrating fitness performance and evolutionary features, accompanied by corresponding suggestions for improvement. Upon stagnation of population evolution, gene selection priorities of operators are dynamically optimized via improvement prompting strategies. This approach achieves co-evolution of populations and operators in the search, introducing a novel paradigm for enhancing the efficiency and adaptability of EAs. Finally, a series of validations on multiple benchmark datasets of the flexible job shop scheduling problem demonstrate that LLM4EO accelerates population evolution and outperforms both mainstream evolutionary programming and traditional EAs.
๐
2025-11-20
๐ AAAI
๐ฅ Sayak Mukherjee, Samrat Chatterjee, Emilie Purvine, Ted Fujimoto, Tegan Emerson
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Designing rewards for autonomous cyber attack and defense learning agents in a complex, dynamic environment is a challenging task for subject matter experts. We propose a large language model (LLM)-based reward design approach to generate autonomous cyber defense policies in a deep reinforcement learning (DRL)-driven experimental simulation environment. Multiple attack and defense agent personas were crafted, reflecting heterogeneity in agent actions, to generate LLM-guided reward designs where the LLM was first provided with contextual cyber simulation environment information. These reward structures were then utilized within a DRL-driven attack-defense simulation environment to learn an ensemble of cyber defense policies. Our results suggest that LLM-guided reward designs can lead to effective defense strategies against diverse adversarial behaviors.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Poushali Sengupta, Yan Zhang, Frank Eliassen, Sabita Maharjan
ๆฅ็ๆ่ฆ
Explainable AI (XAI) is increasingly essential as modern models become more complex and high-stakes applications demand transparency, trust, and regulatory compliance. Existing global attribution methods often incur high computational costs, lack stability under correlated inputs, and fail to scale efficiently to large or heterogeneous datasets. We address these gaps with \emph{ExCIR} (Explainability through Correlation Impact Ratio), a correlation-aware attribution score equipped with a lightweight transfer protocol that reproduces full-model rankings using only a fraction of the data. ExCIR quantifies sign-aligned co-movement between features and model outputs after \emph{robust centering} (subtracting a robust location estimate, e.g., median or mid-mean, from features and outputs). We further introduce \textsc{BlockCIR}, a \emph{groupwise} extension of ExCIR that scores \emph{sets} of correlated features as a single unit. By aggregating the same signed-co-movement numerators and magnitudes over predefined or data-driven groups, \textsc{BlockCIR} mitigates double-counting in collinear clusters (e.g., synonyms or duplicated sensors) and yields smoother, more stable rankings when strong dependencies are present. Across diverse text, tabular, signal, and image datasets, ExCIR shows trustworthy agreement with established global baselines and the full model, delivers consistent top-$k$ rankings across settings, and reduces runtime via lightweight evaluation on a subset of rows. Overall, ExCIR provides \emph{computationally efficient}, \emph{consistent}, and \emph{scalable} explainability for real-world deployment.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Andrew Gomes
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate ``Idiom Heads,'' attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term ``augmented reception.'' We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token prune method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Joy Lai, Alex Mihailidis
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
People living with dementia (PLwD) often show gradual shifts in how they communicate, becoming less expressive, more repetitive, or drifting off-topic in subtle ways. While caregivers may notice these changes informally, most computational tools are not designed to track such behavioral drift over time. This paper introduces PersonaDrift, a synthetic benchmark designed to evaluate machine learning and statistical methods for detecting progressive changes in daily communication, focusing on user responses to a digital reminder system. PersonaDrift simulates 60-day interaction logs for synthetic users modeled after real PLwD, based on interviews with caregivers. These caregiver-informed personas vary in tone, modality, and communication habits, enabling realistic diversity in behavior. The benchmark focuses on two forms of longitudinal change that caregivers highlighted as particularly salient: flattened sentiment (reduced emotional tone and verbosity) and off-topic replies (semantic drift). These changes are injected progressively at different rates to emulate naturalistic cognitive trajectories, and the framework is designed to be extensible to additional behaviors in future use cases. To explore this novel application space, we evaluate several anomaly detection approaches, unsupervised statistical methods (CUSUM, EWMA, One-Class SVM), sequence models using contextual embeddings (GRU + BERT), and supervised classifiers in both generalized and personalized settings. Preliminary results show that flattened sentiment can often be detected with simple statistical models in users with low baseline variability, while detecting semantic drift requires temporal modeling and personalized baselines. Across both tasks, personalized classifiers consistently outperform generalized ones, highlighting the importance of individual behavioral context.
๐
2025-11-20
๐ Frontiers in Computational Neuroscience, in press
๐ฅ Claudius Gros
ๆฅ็ๆ่ฆ
The big strides seen in generative AI are not based on somewhat obscure algorithms, but due to clearly defined generative principles. The resulting concrete implementations have proven themselves in large numbers of applications. We suggest that it is imperative to thoroughly investigate which of these generative principles may be operative also in the brain, and hence relevant for cognitive neuroscience. In addition, ML research led to a range of interesting characterizations of neural information processing systems. We discuss five examples, the shortcomings of world modelling, the generation of thought processes, attention, neural scaling laws, and quantization, that illustrate how much neuroscience could potentially learn from ML research.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Muhammad Aslanimoghanloo, Ahmed ElGazzar, Marcel van Gerven
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Clinical time series data from electronic health records and medical registries offer unprecedented opportunities to understand patient trajectories and inform medical decision-making. However, leveraging such data presents significant challenges due to irregular sampling, complex latent physiology, and inherent uncertainties in both measurements and disease progression. To address these challenges, we propose a generative modeling framework based on latent neural stochastic differential equations (SDEs) that views clinical time series as discrete-time partial observations of an underlying controlled stochastic dynamical system. Our approach models latent dynamics via neural SDEs with modality-dependent emission models, while performing state estimation and parameter learning through variational inference. This formulation naturally handles irregularly sampled observations, learns complex non-linear interactions, and captures the stochasticity of disease progression and measurement noise within a unified scalable probabilistic framework. We validate the framework on two complementary tasks: (i) individual treatment effect estimation using a simulated pharmacokinetic-pharmacodynamic (PKPD) model of lung cancer, and (ii) probabilistic forecasting of physiological signals using real-world intensive care unit (ICU) data from 12,000 patients. Results show that our framework outperforms ordinary differential equation and long short-term memory baseline models in accuracy and uncertainty estimation. These results highlight its potential for enabling precise, uncertainty-aware predictions to support clinical decision-making.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Li Zhang, Zhongxuan Han, XiaoHua Feng, Jiaming Zhang, Yuyuan Li et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incur significant communication costs and increase the susceptibility to potential attacks. Motivated by the one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) requiring additional training resource of clients or server. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, not relying on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.
๐
2025-11-20
๐ AAAI
๐ฅ Yan Chen, Yu Zou, Jialei Zeng, Haoran You, Xiaorui Zhou et al.
Multimodal
ๆฅ็ๆ่ฆ
Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial gover- nance, transforming capital allocation architectures, regu- latory frameworks, and systemic risk coordination mecha- nisms. However, as the core medium for assessing corpo- rate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic read- ing order from slide-like irregular layouts and implicit hier- archies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a uni- fied framework that transforms ESG reports into structured representations through multimodal parsing, contextual nar- ration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware seg- mentation guided by table-of-contents anchors, and a multi- modal aggregation pipeline that contextually transforms vi- sual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical de- mands of financial research. Extensive experiments on anno- tated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG re- ports, spanning Mainland China, Hong Kong, and U.S. mar- kets, featuring unified structured representations of multi- modal content, enriched with fine-grained layout and seman- tic annotations to better support ESG integration in financial governance and decision-making.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jacopo Tagliabue, Federico Bianchi, Ciro Greco
ๆฅ็ๆ่ฆ
Even as AI capabilities improve, most enterprises do not consider agents trustworthy enough to work on production data. In this paper, we argue that the path to trustworthy agentic workflows begins with solving the infrastructure problem first: traditional lakehouses are not suited for agent access patterns, but if we design one around transactions, governance follows. In particular, we draw an operational analogy to MVCC in databases and show why a direct transplant fails in a decoupled, multi-language setting. We then propose an agent-first design, Bauplan, that reimplements data and compute isolation in the lakehouse. We conclude by sharing a reference implementation of a self-healing pipeline in Bauplan, which seamlessly couples agent reasoning with all the desired guarantees for correctness and trust.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yidong Chai, Haoxin Liu, Jiaheng Xie, Chaopeng Wang, Xiao Fang
Machine Learning
ๆฅ็ๆ่ฆ
Wearable sensor technologies and deep learning are transforming healthcare management. Yet, most health sensing studies focus narrowly on physical chronic diseases. This overlooks the critical need for joint assessment of comorbid physical chronic diseases and depression, which is essential for collaborative chronic care. We conceptualize multi-disease assessment, including both physical diseases and depression, as a multi-task learning (MTL) problem, where each disease assessment is modeled as a task. This joint formulation leverages inter-disease relationships to improve accuracy, but it also introduces the challenge of double heterogeneity: chronic diseases differ in their manifestation (disease heterogeneity), and patients with the same disease show varied patterns (patient heterogeneity). To address these issues, we first adopt existing techniques and propose a base method. Given the limitations of the base method, we further propose an Advanced Double Heterogeneity-based Multi-Task Learning (ADH-MTL) method that improves the base method through three innovations: (1) group-level modeling to support new patient predictions, (2) a decomposition strategy to reduce model complexity, and (3) a Bayesian network that explicitly captures dependencies while balancing similarities and differences across model components. Empirical evaluations on real-world wearable sensor data demonstrate that ADH-MTL significantly outperforms existing baselines, and each of its innovations is shown to be effective. This study contributes to health information systems by offering a computational solution for integrated physical and mental healthcare and provides design principles for advancing collaborative chronic disease management across the pre-treatment, treatment, and post-treatment phases.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kangwei Xu, Grace Li Zhang, Ulf Schlichtmann, Bing Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable potential in hardware front-end design using hardware description languages (HDLs). However, their inherent tendency toward hallucination often introduces functional errors into the generated HDL designs. To address this issue, we propose the framework CorrectHDL that leverages high-level synthesis (HLS) results as functional references to correct potential errors in LLM-generated HDL designs.The input to the proposed framework is a C/C++ program that specifies the target circuit's functionality. The program is provided to an LLM to directly generate an HDL design, whose syntax errors are repaired using a Retrieval-Augmented Generation (RAG) mechanism. The functional correctness of the LLM-generated circuit is iteratively improved by comparing its simulated behavior with an HLS reference design produced by conventional HLS tools, which ensures the functional correctness of the result but can lead to suboptimal area and power efficiency. Experimental results demonstrate that circuits generated by the proposed framework achieve significantly better area and power efficiency than conventional HLS designs and approach the quality of human-engineered circuits. Meanwhile, the correctness of the resulting HDL implementation is maintained, highlighting the effectiveness and potential of agentic HDL design leveraging the generative capabilities of LLMs and the rigor of traditional correctness-driven IC design flows.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Ajith Anil Meera, Poppy Collis, Polina Arbuzova, Abiรกn Torres, Paul F Kinghorn et al.
Robotics
ๆฅ็ๆ่ฆ
Robots today often miss a key ingredient of truly intelligent behavior: the ability to reflect on their own cognitive processes and decisions. In humans, this self-monitoring or metacognition is crucial for learning, decision making and problem solving. For instance, they can evaluate how confident they are in performing a task, thus regulating their own behavior and allocating proper resources. Taking inspiration from neuroscience, we propose a robot metacognition architecture centered on confidence (a second-order judgment on decisions) and we demonstrate it on the use case of autonomous tool invention. We propose the use of confidence as a metacognitive measure within the robot decision making scheme. Confidence-informed robots can evaluate the reliability of their decisions, improving their robustness during real-world physical deployment. This form of robotic metacognition emphasizes embodied action monitoring as a means to achieve better informed decisions. We also highlight potential applications and research directions for robot metacognition.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alexander Zadorojniy, Segev Wasserkrug, Eitan Farchi
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recently, using Large Language Models (LLMs) to generate optimization models from natural language descriptions has became increasingly popular. However, a major open question is how to validate that the generated models are correct and satisfy the requirements defined in the natural language description. In this work, we propose a novel agent-based method for automatic validation of optimization models that builds upon and extends methods from software testing to address optimization modeling . This method consists of several agents that initially generate a problem-level testing API, then generate tests utilizing this API, and, lastly, generate mutations specific to the optimization model (a well-known software testing technique assessing the fault detection power of the test suite). In this work, we detail this validation framework and show, through experiments, the high quality of validation provided by this agent ensemble in terms of the well-known software testing measure called mutation coverage.
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ Marcin Kostrzewa, Oleksii Furman, Roman Furman, Sebastian Tomczak, Maciej Ziฤba
Machine Learning
ๆฅ็ๆ่ฆ
Foundation models have shown promise across various financial applications, yet their effectiveness for corporate bankruptcy prediction remains systematically unevaluated against established methods. We study bankruptcy forecasting using Llama-3.3-70B-Instruct and TabPFN, evaluated on large, highly imbalanced datasets of over one million company records from the Visegrรกd Group. We provide the first systematic comparison of foundation models against classical machine learning baselines for this task. Our results show that models such as XGBoost and CatBoost consistently outperform foundation models across all prediction horizons. LLM-based approaches suffer from unreliable probability estimates, undermining their use in risk-sensitive financial settings. TabPFN, while competitive with simpler baselines, requires substantial computational resources with costs not justified by performance gains. These findings suggest that, despite their generality, current foundation models remain less effective than specialized methods for bankruptcy forecasting.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Anna Luiza Gomes da Silva, Diego Kreutz, Angelo Diniz, Rodrigo Mansilha, Celso Nobre da Fonseca
Machine Learning
ๆฅ็ๆ่ฆ
Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain due to instability and the lack of standardization among existing metrics. This work integrates into MalDataGen a Super-Metric that aggregates eight metrics across four fidelity dimensions, producing a single weighted score. Experiments involving ten generative models and five balanced datasets demonstrate that the Super-Metric is more stable and consistent than traditional metrics, exhibiting stronger correlations with the actual performance of classifiers.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jonathan Kamp, Lisa Beinborn, Antske Fokkens
ๆฅ็ๆ่ฆ
Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wei Xia, Zhi-Hong Deng
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Daniel Vaughan, Kateลina Vaughan
Robotics
ๆฅ็ๆ่ฆ
This paper presents a proof-of-concept demonstration of agent-to-agent communication across distributed systems, using only natural-language messages and without shared identifiers, structured schemas, or centralised data exchange. The prototype explores how multiple organisations (represented here as a Clinic, Insurer, and Specialist Network) can cooperate securely via pseudonymised case tokens, local data lookups, and controlled operational boundaries.
The system uses Orpius as the underlying platform for multi-agent orchestration, tool execution, and privacy-preserving communication. All agents communicate through OperationRelay calls, exchanging concise natural-language summaries. Each agent operates on its own data (such as synthetic clinic records, insurance enrolment tables, and clinical guidance extracts), and none receives or reconstructs patient identity. The Clinic computes an HMAC-based pseudonymous token, the Insurer evaluates coverage rules and consults the Specialist agent, and the Specialist returns an appropriateness recommendation.
The goal of this prototype is intentionally limited: to demonstrate feasibility, not to provide a clinically validated, production-ready system. No clinician review was conducted, and no evaluation beyond basic functional runs was performed. The work highlights architectural patterns, privacy considerations, and communication flows that enable distributed reasoning among specialised agents while keeping data local to each organisation. We conclude by outlining opportunities for more rigorous evaluation and future research in decentralised multi-agent systems.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhiyuan Li, Haisheng Yu, Guangchuan Guo, Nan Zhou, Jiajun Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Complex scientific questions often entail multiple intents, such as identifying gene mutations and linking them to related diseases. These tasks require evidence from diverse sources and multi-hop reasoning, while conventional retrieval-augmented generation (RAG) systems are usually single-intent oriented, leading to incomplete evidence coverage. To assess this limitation, we introduce the Multi-Intent Scientific Question Answering (MuISQA) benchmark, which is designed to evaluate RAG systems on heterogeneous evidence coverage across sub-questions. In addition, we propose an intent-aware retrieval framework that leverages large language models (LLMs) to hypothesize potential answers, decompose them into intent-specific queries, and retrieve supporting passages for each underlying intent. The retrieved fragments are then aggregated and re-ranked via Reciprocal Rank Fusion (RRF) to balance coverage across diverse intents while reducing redundancy. Experiments on both MuISQA benchmark and other general RAG datasets demonstrate that our method consistently outperforms conventional approaches, particularly in retrieval accuracy and evidence coverage.
๐
2025-11-20
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Zhen Sun, Zongmin Zhang, Deqi Liang, Han Sun, Yule Liu et al.
ๆฅ็ๆ่ฆ
As LLMs become more common, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limit scalability. Compared with prior attacks, we propose Game-Theory Attack (GTA), an scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction against safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the originally safety preference may become maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical game such as the disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the ASR. Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as Deepseek-R1, while maintaining efficiency. Ablations over components, decoding, multilingual settings, and the Agent's core model confirm effectiveness and generalization. Moreover, scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the model mechanism fixed while varying background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications and reports a longitudinal safety monitoring of popular HuggingFace LLMs.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xingtao Zhao, Hao Peng, Dingli Su, Xianghua Zeng, Chunyang Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation -- where existing methods often rely on heuristic sample-and-count techniques -- we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yun Lu, Xiaoyu Shi, Hong Xie, Chongjun Xia, Zhenhui Gong et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper revisits fairness-aware interactive recommendation (e.g., TikTok, KuaiShou) by introducing a novel control knob, i.e., the lifecycle of items. We make threefold contributions. First, we conduct a comprehensive empirical analysis and uncover that item lifecycles in short-video platforms follow a compressed three-phase pattern, i.e., rapid growth, transient stability, and sharp decay, which significantly deviates from the classical four-stage model (introduction, growth, maturity, decline). Second, we introduce LHRL, a lifecycle-aware hierarchical reinforcement learning framework that dynamically harmonizes fairness and accuracy by leveraging phase-specific exposure dynamics. LHRL consists of two key components: (1) PhaseFormer, a lightweight encoder combining STL decomposition and attention mechanisms for robust phase detection; (2) a two-level HRL agent, where the high-level policy imposes phase-aware fairness constraints, and the low-level policy optimizes immediate user engagement. This decoupled optimization allows for effective reconciliation between long-term equity and short-term utility. Third, experiments on multiple real-world interactive recommendation datasets demonstrate that LHRL significantly improves both fairness and user engagement. Furthermore, the integration of lifecycle-aware rewards into existing RL-based models consistently yields performance gains, highlighting the generalizability and practical value of our approach.
๐
2025-11-20
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Wei Zhao, Zhe Li, Yige Li, Jun Sun
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in cross-modal understanding, but remain vulnerable to adversarial attacks through visual inputs despite robust textual safety mechanisms. These vulnerabilities arise from two core weaknesses: the continuous nature of visual representations, which allows for gradient-based attacks, and the inadequate transfer of text-based safety mechanisms to visual content. We introduce Q-MLLM, a novel architecture that integrates two-level vector quantization to create a discrete bottleneck against adversarial attacks while preserving multimodal reasoning capabilities. By discretizing visual representations at both pixel-patch and semantic levels, Q-MLLM blocks attack pathways and bridges the cross-modal safety alignment gap. Our two-stage training methodology ensures robust learning while maintaining model utility. Experiments demonstrate that Q-MLLM achieves significantly better defense success rate against both jailbreak attacks and toxic image attacks than existing approaches. Notably, Q-MLLM achieves perfect defense success rate (100\%) against jailbreak attacks except in one arguable case, while maintaining competitive performance on multiple utility benchmarks with minimal inference overhead. This work establishes vector quantization as an effective defense mechanism for secure multimodal AI systems without requiring expensive safety-specific fine-tuning or detection overhead. Code is available at https://github.com/Amadeuszhao/QMLLM.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhen Hao Wong, Jingwen Deng, Hao Liang, Runming He, Chengyu Shen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The development of Large Language Models (LLMs) increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets remain costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. At the same time, textbooks and exercise materials contain abundant, high-quality human-authored Question-Answer(QA) content that remains underexploited due to the difficulty of transforming raw PDFs into AI-ready supervision. Although modern OCR and vision-language models can accurately parse document structure, their outputs lack the semantic alignment required for training. We propose an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA/VQA pairs. This approach enables scalable use of real-world educational content and provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training. All code and data-processing pipelines are open-sourced at https://github.com/OpenDCAI/DataFlow.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xu Qiang, Shengyuan Bai, Leqing Chen, Zijing Liu, Yu Li
Multimodal
ๆฅ็ๆ่ฆ
Olympiad-level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, has remained an open challenge. We introduce ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two key innovations for automated assessment: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs (e.g., drawing molecules) into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism to disentangle a model's visual perception capabilities from its core chemical reasoning. To tackle this benchmark, we propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state-of-the-art multimodal models demonstrate that combining SVE with our multi-agent system yields dramatic performance gains. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold and establishing a new state-of-the-art in automated chemical problem-solving. ChemO Dataset: https://huggingface.co/datasets/IDEA-AI4SCI/ChemO
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuping Yan, Yuhan Xie, Yinxin Zhang, Lingjuan Lyu, Yaochu Jin
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Pei Yang, Ke Zhang, Ji Wang, Xiao Chen, Yuxin Tang et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain-specific agents that each produce partial signals, alongside global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Niki van Stein, Anna V. Kononova, Thomas Bรคck
Natural Language Processing
ๆฅ็ๆ่ฆ
Automated algorithm design is entering a new phase: Large Language Models can now generate full optimisation (meta)heuristics, explore vast design spaces and adapt through iterative feedback. Yet this rapid progress is largely performance-driven and opaque. Current LLM-based approaches rarely reveal why a generated algorithm works, which components matter or how design choices relate to underlying problem structures. This paper argues that the next breakthrough will come not from more automation, but from coupling automation with understanding from systematic benchmarking. We outline a vision for explainable automated algorithm design, built on three pillars: (i) LLM-driven discovery of algorithmic variants, (ii) explainable benchmarking that attributes performance to components and hyperparameters and (iii) problem-class descriptors that connect algorithm behaviour to landscape structure. Together, these elements form a closed knowledge loop in which discovery, explanation and generalisation reinforce each other. We argue that this integration will shift the field from blind search to interpretable, class-specific algorithm design, accelerating progress while producing reusable scientific insight into when and why optimisation strategies succeed.
๐
2025-11-20
๐ ArXiv Preprint (cs.DC)
๐ฅ Rongxin Cheng, Kai Zhou, Xingda Wei, Siyuan Liu, Mingcong Han et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Rollout dominates the training time in large language model (LLM) post-training, where the trained model is used to generate tokens given a batch of prompts. SpecActor achieves fast rollout with speculative decoding that deploys a fast path (e.g., a smaller model) to accelerate the unparallelizable generation, while the correctness is guaranteed by fast parallel verification of the outputs with the original model. SpecActor addresses two foundational challenges in speculative rollout by (1) a \emph{dynamic decoupled speculation} execution method that maximizes the GPU computational efficiency to realize speedup for large-batch execution -- a configuration common in training but unfriendly to speculative execution and (2) a \emph{dynamic Best-of-N speculation} method that selects and combines different drafting methods according to the rollout progress. It substantially improves the speculation accuracy even when the best drafting method is unknown a priori, meanwhile without requiring adding extra computation resources. {\sys} is {1.3--1.7}\,$\times$ faster than common post-training baselines, and is {1.3--1.5}\,$\times$ faster compared to naively adopting speculative decoding for rollout.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jeremie Ochin, Raphael Chekroun, Bogdan Stanciulescu, Sotiris Manitsaris
ๆฅ็ๆ่ฆ
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $ฯ_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dabiao Ma, Ziming Dai, Zhimin Xin, Shu Wang, Ye Wang et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhijie Zhong, Zhiwen Yu, Kaixiang Yang, C. L. Philip Chen
ๆฅ็ๆ่ฆ
Time series anomaly detection (TSAD) is a critical data mining task often constrained by label scarcity. Consequently, current research predominantly focuses on Unsupervised Time-series Anomaly Detection (UTAD), relying on complex architectures to model normal data distributions. However, this approach often overlooks the significant performance gains available from limited anomaly labels achievable in practical scenarios. This paper challenges the premise that architectural complexity is the optimal path for TSAD. We conduct the first methodical comparison between supervised and unsupervised paradigms and introduce STAND, a streamlined supervised baseline. Extensive experiments on five public datasets demonstrate that: (1) Labels matter more than models: under a limited labeling budget, simple supervised models significantly outperform complex state-of-the-art unsupervised methods; (2) Supervision yields higher returns: the performance gain from minimal supervision far exceeds that from architectural innovations; and (3) Practicality: STAND exhibits superior prediction consistency and anomaly localization compared to unsupervised counterparts. These findings advocate for a data-centric shift in TSAD research, emphasizing label utilization over purely algorithmic complexity. The code is publicly available at https://github.com/EmorZz1G/STAND.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yongnan Jin, Xurui Li, Feng Cao, Liucun Gao, Juanjuan Yao
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The integration of large language models (LLMs) into medical practice holds transformative potential, yet their real-world clinical utility remains limited by critical alignment challenges: (1) a disconnect between static evaluation benchmarks and dynamic clinical cognitive needs, (2) difficulties in adapting to evolving, multi-source medical standards, and (3) the inability of conventional reward models to capture nuanced, multi-dimensional medical quality criteria. To address these gaps, we propose MR-RML (Multidimensional Rubric-oriented Reward Model Learning) via GPRC (Geometric Projection Reference Constraints), a novel alignment framework that integrates medical standards into a structured "Dimensions-Scenarios-Disciplines" matrix to guide data generation and model optimization. MR-RML introduces three core innovations: (1) a "Dimensions-Scenarios-Disciplines" medical standard system that embeds domain standards into the full training pipeline; (2) an independent multi-dimensional reward model that decomposes evaluation criteria, shifting from real-time rubric-based scoring to internalized reward modeling for improved consistency and cost-efficiency; (3) geometric projection reference constraints that transform medical cognitive logic into mathematical regularization, aligning scoring gradients with clinical reasoning and enabling synthetic data-driven training. Through extensive evaluations on the authoritative medical benchmark Healthbench, our method yields substantial performance gains over the base LLM Qwen-32B (45% on the full subset and 85% on Hard subset, respectively). It achieves a SOTA among open-source LLMs with scores of 62.7 (full subset) and 44.7 (hard subset), while also outperforming the majority of closed-source models.
๐
2025-11-20
๐ ArXiv Preprint (physics.optics)
๐ฅ Shujie Yang, Xuzhe Zhao, Yuqi Zhang, Yansong Tang, Kaichen Dong
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Metamaterials, known for their ability to manipulate light at subwavelength scales, face significant design challenges due to their complex and sophisticated structures. Consequently, deep learning has emerged as a powerful tool to streamline their design process. Reconfigurable multi-state metamaterials (RMMs) with adjustable parameters can switch their optical characteristics between different states upon external stimulation, leading to numerous applications. However, existing deep learning-based inverse design methods fall short in considering reconfigurability with multi-state switching. To address this challenge, we propose CoSP, an intelligent inverse design method based on contrastive pretrained large language model (LLM). By performing contrastive pretraining on multi-state spectrum, a well-trained spectrum encoder capable of understanding the spectrum is obtained, and it subsequently interacts with a pretrained LLM. This approach allows the model to preserve its linguistic capabilities while also comprehending Maxwell's Equations, enabling it to describe material structures with target optical properties in natural language. Our experiments demonstrate that CoSP can design corresponding thin-film metamaterial structures for arbitrary multi-state, multi-band optical responses, showing great potentials in the intelligent design of RMMs for versatile applications.
๐
2025-11-20
๐ ArXiv Preprint (cs.DB)
๐ฅ Xuan-Quang Phan, Tan-Ha Mai, Thai-Duy Dinh, Minh-Thuan Nguyen, Lam-Son Lรช
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Interacting with relational databases remains challenging for users across different expertise levels, particularly when composing complex analytical queries or performing administrative tasks. Existing systems typically address either natural language querying or narrow aspects of database administration, lacking a unified and intelligent interface for general-purpose database interaction. We introduce AskDB, a large language model powered agent designed to bridge this gap by supporting both data analysis and administrative operations over SQL databases through natural language. Built on Gemini 2, AskDB integrates two key innovations: a dynamic schema-aware prompting mechanism that effectively incorporates database metadata, and a task decomposition framework that enables the agent to plan and execute multi-step actions. These capabilities allow AskDB to autonomously debug derived SQL, retrieve contextual information via real-time web search, and adaptively refine its responses. We evaluate AskDB on a widely used Text-to-SQL benchmark and a curated set of DBA tasks, demonstrating strong performance in both analytical and administrative scenarios. Our results highlight the potential of AskDB as a unified and intelligent agent for relational database systems, offering an intuitive and accessible experience for end users.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qing Zhang, Bing Xu, Xudong Zhang, Yifan Shi, Yang Li et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The remarkable performance of Large Language Models (LLMs) highly relies on crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which develops rapidly in recent years. Existing APO methods such as those based on evolutionary algorithms or trial-and-error approaches realize an efficient and accurate prompt optimization to some extent. However, those researches focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO conducts voting mechanism and introduces shared generation strategies along with different search methods for searching superior prompts. Moreover, ELPO creatively presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on ArSarcasm dataset.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker.
Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent's extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
In large language models (LLM), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks by unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In the paper, we propose a fully collaborative pipeline, i.e. T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haohui Chen, Zhiyong Chen, Aoxiang Liu, Wentuo Fang
Robotics
ๆฅ็ๆ่ฆ
Deterministic policy gradient algorithms for continuous control suffer from value estimation biases that degrade performance. While double critics reduce such biases, the exploration potential of double actors remains underexplored. Building on temporal-difference error-driven regularization (TDDR), a double actor-critic framework, this work introduces enhanced methods to achieve flexible bias control and stronger representation learning. We propose three convex combination strategies, symmetric and asymmetric, that balance pessimistic estimates to mitigate overestimation and optimistic exploration via double actors to alleviate underestimation. A single hyperparameter governs this mechanism, enabling tunable control across the bias spectrum. To further improve performance, we integrate augmented state and action representations into the actor and critic networks. Extensive experiments show that our approach consistently outperforms benchmarks, demonstrating the value of tunable bias and revealing that both overestimation and underestimation can be exploited differently depending on the environment.
๐
2025-11-20
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Vu Van Than
ๆฅ็ๆ่ฆ
Traditional threat modeling remains reactive-focused on known TTPs and past incident data, while threat prediction and forecasting frameworks are often disconnected from operational or architectural artifacts. This creates a fundamental weakness: the most serious cyber threats often do not arise from what is known, but from what is assumed, overlooked, or not yet conceived, and frequently originate from the future, such as artificial intelligence, information warfare, and supply chain attacks, where adversaries continuously develop new exploits that can bypass defenses built on current knowledge. To address this mental gap, this paper introduces the theory and methodology of Future-Back Threat Modeling (FBTM). This predictive approach begins with envisioned future threat states and works backward to identify assumptions, gaps, blind spots, and vulnerabilities in the current defense architecture, providing a clearer and more accurate view of impending threats so that we can anticipate their emergence and shape the future we want through actions taken now. The proposed methodology further aims to reveal known unknowns and unknown unknowns, including tactics, techniques, and procedures that are emerging, anticipated, and plausible. This enhances the predictability of adversary behavior, particularly under future uncertainty, helping security leaders make informed decisions today that shape more resilient security postures for the future.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meihua Zhou, Liping Yu, Jiawei Cai, Wai Kin Fung, Ruiguo Hu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral -- spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets -- Indian Pines, Salinas-A, and the newly introduced CloudPatch-7 -- demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent reductions in training time by 2-7x speedups with small-to-moderate accuracy deltas depending on backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.
๐
2025-11-20
๐ ArXiv Preprint (cs.PL)
๐ฅ Sungbin Moon, Jiho Park, Suyoung Hwang, Donghyun Koh, Seunghyun Moon et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Modern data processing workflows frequently encounter ragged data: collections with variable-length elements that arise naturally in domains like natural language processing, scientific measurements, and autonomous AI agents. Existing workflow engines lack native support for tracking the shapes and dependencies inherent to ragged data, forcing users to manage complex indexing and dependency bookkeeping manually. We present Operon, a Rust-based workflow engine that addresses these challenges through a novel formalism of named dimensions with explicit dependency relations. Operon provides a domain-specific language where users declare pipelines with dimension annotations that are statically verified for correctness, while the runtime system dynamically schedules tasks as data shapes are incrementally discovered during execution. We formalize the mathematical foundation for reasoning about partial shapes and prove that Operon's incremental construction algorithm guarantees deterministic and confluent execution in parallel settings. The system's explicit modeling of partially-known states enables robust persistence and recovery mechanisms, while its per-task multi-queue architecture achieves efficient parallelism across heterogeneous task types. Empirical evaluation demonstrates that Operon outperforms an existing workflow engine with 14.94x baseline overhead reduction while maintaining near-linear end-to-end output rates as workloads scale, making it particularly suitable for large-scale data generation pipelines in machine learning applications.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hrikshesh Kumar, Anika Garg, Anshul Gupta, Yashika Agarwal
Machine Learning
ๆฅ็ๆ่ฆ
Old cloud edge workload resource management is too reactive. The problem with relying on static thresholds is that we are either overspending for more resources than needed or have reduced performance because of their lack. This is why we work on proactive solutions. A framework developed for it stops reacting to the problems but starts expecting them. We design a hybrid architecture, combining two powerful tools: the CNN LSTM model for time series forecasting and an orchestrator based on multi agent Deep Reinforcement Learning In fact the novelty is in how we combine them as we embed the predictive forecast from the CNN LSTM directly into the DRL agent state space. That is what makes the AI manager smarter it sees the future, which allows it to make better decisions about a long term plan for where to run tasks That means finding that sweet spot between how much money is saved while keeping the system healthy and apps fast for users That is we have given it eyes in order to see down the road so that it does not have to lurch from one problem to another it finds a smooth path forward Our tests show our system easily beats the old methods It is great at solving tough problems like making complex decisions and juggling multiple goals at once like being cheap fast and reliable
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shreyansh Jain, Madhav Singhvi, Shreya Rahul Jain, Pranav S, Dishaa Lokesh et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Conventional Applicant Tracking Systems (ATS) tend to be inflexible keyword-matchers, and deny gifted candidates a role due to a few minor semantic mismatches. This article describes a new two-step process to design a more refined resume evaluation model based on a small language model (<600M parameters) that is finetuned using GRPO on a custom reward function. To begin with, Supervised Fine-Tuning (SFT) was used to build a solid baseline model. Second, this SFT model was also optimized with the help of Reinforcement Learning (RL) through GRPO under the guidance of a new, multi-component reward function that can holistically assess candidates beyond simple keyword matching. We indicate that the RL application presents a critical problem of reward hacking due to the initial experiments of aggressive penalties, which produces faulty, excessively negative model behaviors. We have overcome this challenge by refining the reward function repeatedly and training hyperparameters into a stable "gentle polishing process" of the reward function. Our resulting GRPO-polished model demonstrates significant real-world efficacy, achieving a final accuracy of 91% on unseen test data. The model shows a strong ability to correctly identify qualified candidates (recall of 0.85 for the 'SELECTED' class) while also showing exceptional precision (1.0), confirming its reliability. These results indicate that a properly executed, two-step fine-tuning procedure can indeed effectively refine a small language model to be able to conduct fine-tuned and human-like candidate scoring, overcoming the drawbacks of both traditional ATS and naive RL usage.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sรฉbastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Theophanis C. Stratopoulos, Victor Xiaoqi Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in artificial intelligence, particularly generative AI (GenAI) and large language models (LLMs), are fundamentally transforming accounting research, creating both opportunities and competitive threats for scholars. This paper proposes a framework that classifies AI-accounting research along two dimensions: research focus (accounting-centric versus AI-centric) and methodological approach (AI-based versus traditional methods). We apply this framework to papers from the IJAIS special issue and recent AI-accounting research published in leading accounting journals to map existing studies and identify research opportunities. Using this same framework, we analyze how accounting researchers can leverage their expertise through strategic positioning and collaboration, revealing where accounting scholars' strengths create the most value. We further examine how GenAI and LLMs transform the research process itself, comparing the capabilities of human researchers and AI agents across the entire research workflow. This analysis reveals that while GenAI democratizes certain research capabilities, it simultaneously intensifies competition by raising expectations for higher-order contributions where human judgment, creativity, and theoretical depth remain valuable. These shifts call for reforming doctoral education to cultivate comparative advantages while building AI fluency.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Guy Van den Broeck et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ Qing Zhang, Jing Huang, Mingyang Xu, Jun Rekimoto
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
While mainstream robotics pursues metric precision and flawless performance, this paper explores the creative potential of a deliberately "lo-fi" approach. We present the "Semantic Glitch," a soft flying robotic art installation whose physical form, a 3D pixel style cloud, is a "physical glitch" derived from digital archaeology. We detail a novel autonomous pipeline that rejects conventional sensors like LiDAR and SLAM, relying solely on the qualitative, semantic understanding of a Multimodal Large Language Model to navigate. By authoring a bio-inspired personality for the robot through a natural language prompt, we create a "narrative mind" that complements the "weak," historically, loaded body. Our analysis begins with a 13-minute autonomous flight log, and a follow-up study statistically validates the framework's robustness for authoring quantifiably distinct personas. The combined analysis reveals emergent behaviors, from landmark-based navigation to a compelling "plan to execution" gap, and a character whose unpredictable, plausible behavior stems from a lack of precise proprioception. This demonstrates a lo-fi framework for creating imperfect companions whose success is measured in character over efficiency.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jorge A. Huertas, Pascal Van Hentenryck
ๆฅ็ๆ่ฆ
In serial batch (s-batch) scheduling, jobs from similar families are grouped into batches and processed sequentially to avoid repetitive setups that are required when processing consecutive jobs of different families. Despite its large success in scheduling, only three Constraint Programming (CP) models have been proposed for this problem considering minimum batch sizes, which is a common requirement in many practical settings, including the ion implantation area in semiconductor manufacturing. These existing CP models rely on a predefined virtual set of possible batches that suffers from the curse of dimensionality and adds complexity to the problem. This paper proposes a novel CP model that does not rely on this virtual set. Instead, it uses key alignment parameters that allow it to reason directly on the sequences of same-family jobs scheduled on the machines, resulting in a more compact formulation. This new model is further improved by exploiting the problem's structure with tailored search phases and strengthened inference levels of the constraint propagators. The extensive computational experiments on nearly five thousand instances compare the proposed models against existing methods in the literature, including mixed-integer programming formulations, tabu search meta-heuristics, and CP approaches. The results demonstrate the superiority of the proposed models on small-to-medium instances with up to 100 jobs, and their ability to find solutions up to 25\% better than the ones produces by existing methods on large-scale instances with up to 500 jobs, 10 families, and 10 machines.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks
Natural Language Processing
ๆฅ็ๆ่ฆ
Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generating statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yang Wu, Yifan Zhang, Zhenxing Liang, Jian Cheng
Machine Learning
ๆฅ็ๆ่ฆ
Two-stage Stochastic Programming (2SP) is a standard framework for modeling decision-making problems under uncertainty. While numerous methods exist, solving such problems with many scenarios remains challenging. Selecting representative scenarios is a practical method for accelerating solutions. However, current approaches typically rely on clustering or Monte Carlo sampling, failing to integrate scenario information deeply and overlooking the significant impact of the scenario order on solving time. To address these issues, we develop HGCN2SP, a novel model with a hierarchical graph designed for 2SP problems, encoding each scenario and modeling their relationships hierarchically. The model is trained in a reinforcement learning paradigm to utilize the feedback of the solver. The policy network is equipped with a hierarchical graph convolutional network for feature encoding and an attention-based decoder for scenario selection in proper order. Evaluation of two classic 2SP problems demonstrates that HGCN2SP provides high-quality decisions in a short computational time. Furthermore, HGCN2SP exhibits remarkable generalization capabilities in handling large-scale instances, even with a substantial number of variables or scenarios that were unseen during the training phase.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Abdallah Salem, Hamdy Ahmed Ashur, Ahmed Elshinnawy
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers' health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material's surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30 % on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dingkun Zhou, Patrick P. K. Chan, Hengxu Wu, Shikang Zheng, Ruiqi Huang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Emanuel C. Silva, Emily S. M. Salum, Gabriel M. Arantes, Matheus P. Pereira, Vinicius F. Oliveira et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Introduction: The application of Artificial Intelligence in games has evolved significantly, allowing for dynamic content generation. However, its use as a core gameplay co-creation tool remains underexplored. Objective: This paper proposes SpellForger, a game where players create custom spells by writing natural language prompts, aiming to provide a unique experience of personalization and creativity. Methodology: The system uses a supervisedtrained BERT model to interpret player prompts. This model maps textual descriptions to one of many spell prefabs and balances their parameters (damage, cost, effects) to ensure competitive integrity. The game is developed in the Unity Game Engine, and the AI backend is in Python. Expected Results: We expect to deliver a functional prototype that demonstrates the generation of spells in real time, applied to an engaging gameplay loop, where player creativity is central to the experience, validating the use of AI as a direct gameplay mechanic.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Juncheng Dong, Yiling Liu, Ahmed Aloui, Vahid Tarokh, David Carlson
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have recently demonstrated impressive capabilities across a range of reasoning and generation tasks. However, research studies have shown that LLMs lack the ability to identify causal relationships, a fundamental cornerstone of human intelligence. We first conduct an exploratory investigation of LLMs' behavior when asked to perform a causal-discovery task and find that they mostly rely on the semantic meaning of variable names, ignoring the observation data. This is unsurprising, given that LLMs were never trained to process structural datasets. To first tackle this challenge, we prompt the LLMs with the outputs of established causal discovery algorithms designed for observational datasets. These algorithm outputs effectively serve as the sufficient statistics of the observation data. However, quite surprisingly, we find that prompting the LLMs with these sufficient statistics decreases the LLMs' performance in causal discovery. To address this current limitation, we propose CARE, a framework that enhances LLMs' causal-reasoning ability by teaching them to effectively utilize the outputs of established causal-discovery algorithms through supervised fine-tuning. Experimental results show that a finetuned Qwen2.5-1.5B model produced by CARE significantly outperforms both traditional causal-discovery algorithms and state-of-the-art LLMs with over a thousand times more parameters, demonstrating effective utilization of its own knowledge and the external algorithmic clues.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jinhao Li, Jianzhong Qi, Soyeon Caren Han, Eun-Jung Holden
Multimodal
ๆฅ็ๆ่ฆ
Digital transformation in the cultural heritage sector has produced vast yet fragmented collections of artefact data. Existing frameworks for museum information systems struggle to integrate heterogeneous metadata, unstructured documents, and multimodal artefacts into a coherent and queryable form. We present MuseKG, an end-to-end knowledge-graph framework that unifies structured and unstructured museum data through symbolic-neural integration. MuseKG constructs a typed property graph linking objects, people, organisations, and visual or textual labels, and supports natural language queries. Evaluations on real museum collections demonstrate robust performance across queries over attributes, relations, and related entities, surpassing large-language-model zero-shot, few-shot and SPARQL prompt baselines. The results highlight the importance of symbolic grounding for interpretable and scalable cultural heritage reasoning, and pave the way for web-scale integration of digital heritage knowledge.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shuo Wang, Mengfan Teng, Yun Cheng, Lothar Thiele, Olga Saukh et al.
Machine Learning
ๆฅ็ๆ่ฆ
High-resolution mapping of fine particulate matter (PM2.5) is a cornerstone of sustainable urbanism but remains critically hindered by the spatial sparsity of ground monitoring networks. While traditional data-driven methods attempt to bridge this gap using satellite Aerosol Optical Depth (AOD), they often suffer from severe, non-random data missingness (e.g., due to cloud cover or nighttime) and inversion biases. To overcome these limitations, this study proposes the Spatiotemporal Physics-Guided Inference Network (SPIN), a novel framework designed for inductive spatiotemporal kriging. Unlike conventional approaches, SPIN synergistically integrates domain knowledge into deep learning by explicitly modeling physical advection and diffusion processes via parallel graph kernels. Crucially, we introduce a paradigm-shifting training strategy: rather than using error-prone AOD as a direct input, we repurpose it as a spatial gradient constraint within the loss function. This allows the model to learn structural pollution patterns from satellite data while remaining robust to data voids. Validated in the highly polluted Beijing-Tianjin-Hebei and Surrounding Areas (BTHSA), SPIN achieves a new state-of-the-art with a Mean Absolute Error (MAE) of 9.52 ug/m^3, effectively generating continuous, physically plausible pollution fields even in unmonitored areas. This work provides a robust, low-cost, and all-weather solution for fine-grained environmental management.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yiling Liu, Juncheng Dong, Chen Fu, Wei Shi, Ziyang Jiang et al.
ๆฅ็ๆ่ฆ
Estimating counterfactual outcomes from time-series observations is crucial for effective decision-making, e.g. when to administer a life-saving treatment, yet remains significantly challenging because (i) the counterfactual trajectory is never observed and (ii) confounders evolve with time and distort estimation at every step. To address these challenges, we propose a novel framework that synergistically integrates two complementary approaches: Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM). Instead of the coarse practice of aligning marginal distributions of the treatments in latent space, SGA uses iterative treatment-agnostic clustering to identify fine-grained sub-treatment groups. Aligning these fine-grained groups achieves improved distributional matching, thus leading to more effective deconfounding. We theoretically demonstrate that SGA optimizes a tighter upper bound on counterfactual risk and empirically verify its deconfounding efficacy. RTM promotes temporal generalization by randomly replacing input covariates with Gaussian noises during training. This encourages the model to rely less on potentially noisy or spuriously correlated covariates at the current step and more on stable historical patterns, thereby improving its ability to generalize across time and better preserve underlying causal relationships. Our experiments demonstrate that while applying SGA and RTM individually improves counterfactual outcome estimation, their synergistic combination consistently achieves state-of-the-art performance. This success comes from their distinct yet complementary roles: RTM enhances temporal generalization and robustness across time steps, while SGA improves deconfounding at each specific time point.
๐
2025-11-20
๐ ArXiv Preprint (cs.SE)
๐ฅ Qingao Dong, Mengfei Wang, Hengzhi Zhang, Zhichao Li, Yuan Yuan et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language model (LLM) agents have recently shown strong performance on repository-level issue resolution, but existing systems are almost exclusively designed for Python and rely heavily on lexical retrieval and shallow code navigation. These approaches transfer poorly to C++ projects, where overloaded identifiers, nested namespaces, template instantiations, and deep control-flow structures make context retrieval and fault localization substantially more difficult. As a result, state-of-the-art Python-oriented agents show a drastic performance drop on the C++ subset of MultiSWE-bench. We introduce INFCODE-C++, the first C++-aware autonomous system for end-to-end issue resolution. The system combines two complementary retrieval mechanisms -- semantic code-intent retrieval and deterministic AST-structured querying -- to construct accurate, language-aware context for repair.These components enable precise localization and robust patch synthesis in large, statically typed C++ repositories. Evaluated on the \texttt{MultiSWE-bench-CPP} benchmark, INFCODE-C++ achieves a resolution rate of 25.58\%, outperforming the strongest prior agent by 10.85 percentage points and more than doubling the performance of MSWE-agent. Ablation and behavioral studies further demonstrate the critical role of semantic retrieval, structural analysis, and accurate reproduction in C++ issue resolution. INFCODE-C++ highlights the need for language-aware reasoning in multi-language software agents and establishes a foundation for future research on scalable, LLM-driven repair for complex, statically typed ecosystems.
๐
2025-11-20
๐ ArXiv Preprint (cs.SE)
๐ฅ KeFan Li, Mengfei Wang, Hengzhi Zhang, Zhichao Li, Yuan Yuan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models have advanced software engineering automation, yet resolving real-world software issues remains difficult because it requires repository-level reasoning, accurate diagnostics, and strong verification signals. Existing agent-based and pipeline-based methods often rely on insufficient tests, which can lead to patches that satisfy verification but fail to fix the underlying defect. We present InfCode, an adversarial multi-agent framework for automated repository-level issue resolution. InfCode iteratively refines both tests and patches through adversarial interaction between a Test Patch Generator and a Code Patch Generator, while a Selector agent identifies the most reliable fix. The framework runs inside a containerized environment that supports realistic repository inspection, modification, and validation. Experiments on SWE-bench Lite and SWE-bench Verified using models such as DeepSeek-V3 and Claude 4.5 Sonnet show that InfCode consistently outperforms strong baselines. It achieves 79.4% performance on SWE-bench Verified, establishing a new state-of-the-art. We have released InfCode as an open-source project at https://github.com/Tokfinity/InfCode.
๐
2025-11-20
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Strahinja Janjusevic, Anna Baron Garcia, Sohrob Kazerounian
Robotics
ๆฅ็ๆ่ฆ
Generative AI is reshaping offensive cybersecurity by enabling autonomous red team agents that can plan, execute, and adapt during penetration tests. However, existing approaches face trade-offs between generality and specialization, and practical deployments reveal challenges such as hallucinations, context limitations, and ethical concerns. In this work, we introduce a novel command & control (C2) architecture leveraging the Model Context Protocol (MCP) to coordinate distributed, adaptive reconnaissance agents covertly across networks. Notably, we find that our architecture not only improves goal-directed behavior of the system as whole, but also eliminates key host and network artifacts that can be used to detect and prevent command & control behavior altogether. We begin with a comprehensive review of state-of-the-art generative red teaming methods, from fine-tuned specialist models to modular or agentic frameworks, analyzing their automation capabilities against task-specific accuracy. We then detail how our MCP-based C2 can overcome current limitations by enabling asynchronous, parallel operations and real-time intelligence sharing without periodic beaconing. We furthermore explore advanced adversarial capabilities of this architecture, its detection-evasion techniques, and address dual-use ethical implications, proposing defensive measures and controlled evaluation in lab settings. Experimental comparisons with traditional C2 show drastic reductions in manual effort and detection footprint. We conclude with future directions for integrating autonomous exploitation, defensive LLM agents, predictive evasive maneuvers, and multi-agent swarms. The proposed MCP-enabled C2 framework demonstrates a significant step toward realistic, AI-driven red team operations that can simulate advanced persistent threats while informing the development of next-generation defensive systems.
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ Noah Bissell, Ethan Paley, Joshua Harrison, Juliano Calil, Myungin Lee
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Sensorium Arc (AI reflects on climate) is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic speaker and guides users through immersive explorations of complex marine data. Built on a modular multi-agent system and retrieval-augmented large language model (LLM) framework, Sensorium enables natural spoken conversations with AI agents that embodies the ocean's perspective, generating responses that blend scientific insight with ecological poetics. Through keyword detection and semantic parsing, the system dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue. Developed in collaboration with the Center for the Study of the Force Majeure and inspired by the eco-aesthetic philosophy of Newton Harrison, Sensorium Arc reimagines ocean data not as an abstract dataset but as a living narrative. The project demonstrates the potential of conversational AI agents to mediate affective, intuitive access to high-dimensional environmental data and proposes a new paradigm for human-machine-ecosystem.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Deepthi Potluri, Aby Mammen Mathew, Jeffrey B DeWitt, Alexander L. Rasgon, Yide Hao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shahin Zanbaghi, Ryan Rostampour, Farhan Abid, Salim Al Jarmakani
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training a phenomenon known as "sleeper agents." Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real-time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real-time (<1s per query), requires no model modification, and provides the first practical solution to LLM backdoor detection. Our work addresses a critical security gap in AI deployment and demonstrates that embedding-based detection can effectively identify deceptive model behavior without sacrificing deployment efficiency.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Daniel Mas Montserrat, Ray Verma, Mรญriam Barrabรฉs, Francisco M. de la Vega, Carlos D. Bustamante et al.
ๆฅ็ๆ่ฆ
Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ George Cazenavette, Antonio Torralba, Vincent Sitzmann
Multimodal
ๆฅ็ๆ่ฆ
The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Omkat Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal et al.
Multimodal
ๆฅ็ๆ่ฆ
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains upto $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully-unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
๐
2025-11-20
๐ NeurIPS
๐ฅ Jing Wen, Alexander G. Schwing, Shenlong Wang
ๆฅ็ๆ่ฆ
We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou
Natural Language Processing
ๆฅ็ๆ่ฆ
When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junhao Cheng, Liang Hou, Xin Tao, Jing Liao
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Codes are released in https://github.com/KlingTeam/VANS.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang et al.
ๆฅ็ๆ่ฆ
Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ Zhenyuan Qin, Xincheng Shuai, Henghui Ding
Computer VisionRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Eddie Pokming Sheung, Qihao Liu, Wufei Ma, Prakhar Kaushik, Jianwen Xie et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.
๐
2025-11-20
๐ WWW
๐ฅ Zhaoning Wang, Xinyue Wei, Ruoxi Shi, Xiaoshuai Zhang, Hao Su et al.
ๆฅ็ๆ่ฆ
UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart's distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at https://www.zhaoningwang.com/PartUV.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vishaal Udandarao, Shyamgopal Karthik, Surabhi S. Nath, Andreas Hochlehnert, Matthias Bethge et al.
ๆฅ็ๆ่ฆ
Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
๐
2025-11-20
๐ Accepted at 2025 IEEE International Conference on
๐ฅ Md. Samiul Alim, Sharjil Khan, Amrijit Biswas, Fuad Rahman, Shafin Rahman et al.
Machine Learning
ๆฅ็ๆ่ฆ
Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuyu Cao, Chongshou Li, Jie Xu, Tianrui Li, Na Zhao
Machine Learning
ๆฅ็ๆ่ฆ
3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite the progress, previous 3DHS methods have overlooked following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, which makes the model performance become dominated by major classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with the coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance the class-imbalance segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ Zeyuan Yin, Xiaoming Liu
ๆฅ็ๆ่ฆ
Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose $\textbf{TRIM}$ ($\textbf{T}$rajectory $\textbf{R}$eduction and $\textbf{I}$nstance $\textbf{M}$ask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at $\href{https://github.com/zeyuanyin/TRIM}{link}$.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo et al.
Multimodal
ๆฅ็ๆ่ฆ
Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin et al.
ๆฅ็ๆ่ฆ
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing \textbf{SAM2} for \textbf{S}urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}$\&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$\&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
๐
2025-11-20
๐ ArXiv Preprint (cs.SE)
๐ฅ Diaeddin Rimawi
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Cyber-physical systems (CPS) combine computational and physical components. Online Collaborative AI System (OL-CAIS) is a type of CPS that learn online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nirjhor Datta, Md. Golam Rabiul Alam
Computer VisionRobotics
ๆฅ็ๆ่ฆ
The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher's confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement.
For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent.
These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
๐
2025-11-20
๐ AAAI 2026
๐ฅ Misaal Khan, Mayank Vatsa, Kuldeep Singh, Richa Singh
Multimodal
ๆฅ็ๆ่ฆ
Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children's images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk
ๆฅ็ๆ่ฆ
Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.
๐
2025-11-20
๐ CVPR 2025
๐ฅ Xiaoyue Chen, Yuling Shi, Kaiyuan Li, Huandong Wang, Yong Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Visual Auto-Regressive (VAR) models significantly reduce inference steps through the "next-scale" prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment.
We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales utilize subnet. Subnet and the full network share weights, enabling flexible depth adjustment within a single model.
However, weight sharing between subnet and the entire network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality.
Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant's single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pierrick Bournez, Luca Savant Aira, Thibaud Ehret, Gabriele Facciolo
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering com- petitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and effi- ciency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model demonstrates an improvement from 1.33 to 1.19 mean MAE errors on buildings compared to the original EOGS models
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jaime รlvarez Urueรฑa, David Camacho, Javier Huertas Tato
Machine Learning
ๆฅ็ๆ่ฆ
The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical.
This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators.
With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3\%, representing a 5.2 percentage point improvement over existing approaches . For the source attribution task, the proposed approach obtains improvements of of 14.70\% and 4.27\% in AUC and OSCR respectively on an open set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haytham Ziani
ๆฅ็ๆ่ฆ
This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fan Yang, Shigeyuki Odashima, Shoichi Masui, Ikuo Kusajima, Sosuke Yamao et al.
ๆฅ็ๆ่ฆ
We present a robust multi-camera gymnast tracking, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: (i) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and (ii) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast's 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast's 3D center typically lies within a predefined vertical plane during \revised{much of their} performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resort to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kwun Ho Ngan, Saman Sadeghi Afgeh, Joe Townsend, Artur d'Avila Garcez
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while increasing considerably the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rahul Kumar, Vipul Baghel, Sudhanshu Singh, Bikash Kumar Badatya, Shivam Yadav et al.
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fan Yang, Sosuke Yamao, Ikuo Kusajima, Atsunori Moteki, Shoichi Masui et al.
Machine Learning
ๆฅ็ๆ่ฆ
Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronize CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration (https://sites.google.com/view/yowo/home). Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang et al.
Robotics
ๆฅ็ๆ่ฆ
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rui Wang, Yuexi Du, John Lewin, R. Todd Constable, Nicha C. Dvornek
ๆฅ็ๆ่ฆ
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays an important role in breast cancer screening, tumor assessment, and treatment planning and monitoring. The dynamic changes in contrast in different tissues help to highlight the tumor in post-contrast images. However, varying acquisition protocols and individual factors result in large variation in the appearance of tissues, even for images acquired in the same phase (e.g., first post-contrast phase), making automated tumor segmentation challenging. Here, we propose a tumor segmentation method that leverages knowledge of the image acquisition time to modulate model features according to the specific acquisition sequence. We incorporate the acquisition times using feature-wise linear modulation (FiLM) layers, a lightweight method for incorporating temporal information that also allows for capitalizing on the full, variables number of images acquired per imaging study. We trained baseline and different configurations for the time-modulated models with varying backbone architectures on a large public multisite breast DCE-MRI dataset. Evaluation on in-domain images and a public out-of-domain dataset showed that incorporating knowledge of phase acquisition time improved tumor segmentation performance and model generalization.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zongcai Tan, Lan Wei, Dandan Zhang
Computer VisionRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging.This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame).The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Eliyas Suleyman, Paul Henderson, Eksan Firkat, Nicolas Pugeault
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and the background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Clemens Pollak, Kersten Diers, Santiago Estrada, David Kรผgler, Martin Reuter
Robotics
ๆฅ็ๆ่ฆ
The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington's disease patients and healthy controls that are not detected by the current state-of-the-art.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mateusz Chiliลski, Julita Oลtusek, Wojciech Jaลkowski
Natural Language Processing
ๆฅ็ๆ่ฆ
Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighting only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making suitable for long document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.
๐
2025-11-20
๐ AAAI
๐ฅ Doriand Petit, Steve Bourgeois, Vincent Gay-Bellile, Florian Chabot, Loรฏc Barthe
Computer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLM). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLM using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token prune method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Diogo J. Paulo, Joรฃo Martins, Hugo Proenรงa, Joรฃo C. Neves
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jin Wang, Bingfeng Zhang, Jian Pang, Mengyu Liu, Honglong Chen et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential, the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture to utilize inherent target property language descriptions to build robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, due to text-vision modal shift, attribute text struggles to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) to achieve cross-modal interaction between attribute texts and visual feature. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves the new state-of-the art performance. The code will be released.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yihan Li, Nikhil Churamani, Maria Robu, Imanol Luengo, Danail Stoyanov
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses.
Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks.
Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. It produces anatomically coherent predictions, particularly on thin, rare and safety-critical structures.
Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Samer Abualhanud, Christian Grannemann, Max Mehltretter
ๆฅ็ๆ่ฆ
Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360ยฐ field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth compared to state-of-the-art methods.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hai Lan, Zongyan Li, Jianmin Hu, Jialing Yang, Houde Dai
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pan Yang, Cheng Deng, Jing Yang, Han Zhao, Yun Liu et al.
Multimodal
ๆฅ็ๆ่ฆ
Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and do not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meng-Cheng Shih, Tsai-Ling Huang, Yu-Heng Shih, Hong-Han Shuai, Hsuan-Tung Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model's interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhengxue Wang, Zhiqiang Yan, Yuan Wu, Guangwei Gao, Xiang Li et al.
ๆฅ็ๆ่ฆ
Recent guided depth super-resolution methods are premised on the assumption of strictly spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Joni Vanherck, Steven Moonen, Brent Zoomers, Kobe Werner, Jeroen Put et al.
Robotics
ๆฅ็ๆ่ฆ
Accurate camera localization is crucial for robotics and Extended Reality (XR), enabling reliable navigation and alignment of virtual and real content. Existing visual methods often suffer from drift, scale ambiguity, and depend on fiducials or loop closure. This work introduces a real-time method for localizing a camera within a pre-captured, highly accurate colored LiDAR point cloud. By rendering synthetic views from this cloud, 2D-3D correspondences are established between live frames and the point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts to improve feature matching. The result is drift-free camera tracking with correct metric scale in the global LiDAR coordinate system. Two real-time variants are presented: Online Render and Match, and Prebuild and Localize. We demonstrate improved results on the ScanNet++ dataset and outperform existing SLAM pipelines.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chi-Han Chen, Chieh-Ming Chen, Wen-Huang Cheng, Ching-Chun Huang
ๆฅ็ๆ่ฆ
The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher-student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method utilizing merely 30\% of labeled data, concurrently elevates mIoU and temporal consistency ensuring stable localization of terrain objects. Result demo : https://gitlab.com/prophet.ai.inc/drone-based-riverbed-inspection
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi Ting Tsai, Yu Wei Chen, Hong-Han Shuai, Ching-Chun Huang
ๆฅ็ๆ่ฆ
Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images and has significant implications for face-related tasks. However, existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations. To address these limitations, this paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR), featuring three novel designs. First, ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale. Second, a local frequency estimation module captures high-frequency facial texture information to reduce the spectral bias effect. Lastly, a global coordinate modulation module guides FSR to leverage prior facial structure knowledge and achieve resolution adaptation effectively. Quantitative and qualitative evaluations demonstrate the robustness of ARASFSR over existing state-of-the-art methods while super-resolving facial images across various input sizes and up-sampling scales.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ching-Heng Cheng, Chih-Chung Hsu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Remote sensing change detection (RSCD) aims to identify surface changes from co-registered bi-temporal images. However, many deep learning-based RSCD methods rely solely on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. This article introduces ChangeDINO, an end-to-end multiscale Siamese framework for optical building change detection. The model fuses a lightweight backbone stream with features transferred from a frozen DINOv3, yielding semantic- and context-rich pyramids even on small datasets. A spatial-spectral differential transformer decoder then exploits multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses. Finally, a learnable morphology module refines the upsampled logits to recover clean boundaries. Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. The source code is available at https://github.com/chingheng0808/ChangeDINO.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ching-Heng Cheng, Jen-Wei Lee, Chia-Ming Lee, Chih-Chung Hsu
ๆฅ็ๆ่ฆ
Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by wavelength-dependent absorption and scattering. Recent hybrid approaches, which couple domain priors with modern deep neural architectures, have achieved strong performance but incur high computational cost, limiting their practicality in real-time scenarios. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors. First, adaptive white balance alleviates the strong wavelength-dependent color attenuation, particularly the dominance of blue-green tones. Second, a wavelet-based enhancement block (WEB) performs multi-band decomposition, enabling the network to capture both global structures and fine textures, which are critical for underwater restoration. Third, a gradient-aware module (SGFB) leverages Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations further validate the contribution of each component. The source code is available at https://github.com/chingheng0808/WWE-UIE.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Xin Yang, Xin Huang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-awared color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ Samuel Stevens
Computer Vision
ๆฅ็ๆ่ฆ
ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Leander Girrbach, Zeynep Akata
ๆฅ็ๆ่ฆ
Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minseok Seo, Mark Hamilton, Changick Kim
Machine Learning
ๆฅ็ๆ่ฆ
We present \textbf{Upsample Anything}, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx0.419 \text{s}$ per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Musfiqur Rahman Sanim, Zhihao Shu, Bahram Afsharmanesh, AmirAli Mirian, Jiexiong Guan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Image-based 3D scene reconstruction, which transforms multi-view images into a structured 3D representation of the surrounding environment, is a common task across many modern applications. 3D Gaussian Splatting (3DGS) is a new paradigm to address this problem and offers considerable efficiency as compared to the previous methods. Motivated by this, and considering various benefits of mobile device deployment (data privacy, operating without internet connectivity, and potentially faster responses), this paper develops Texture3dgs, an optimized mapping of 3DGS for a mobile GPU. A critical challenge in this area turns out to be optimizing for the two-dimensional (2D) texture cache, which needs to be exploited for faster executions on mobile GPUs. As a sorting method dominates the computations in 3DGS on mobile platforms, the core of Texture3dgs is a novel sorting algorithm where the processing, data movement, and placement are highly optimized for 2D memory. The properties of this algorithm are analyzed in view of a cost model for the texture cache. In addition, we accelerate other steps of the 3DGS algorithm through improved variable layout design and other optimizations. End-to-end evaluation shows that Texture3dgs delivers up to 4.1$\times$ and 1.7$\times$ speedup for the sorting and overall 3D scene reconstruction, respectively -- while also reducing memory usage by up to 1.6$\times$ -- demonstrating the effectiveness of our design for efficient mobile 3D scene reconstruction.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Abishek Karthik, Pandiyaraju V, Sreya Mynampati
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework recipe for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was imposed to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experimental results yield superior results with 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment of edge devices for automated weed detecting, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gergely Dinya, Pรฉter Halรกsz, Andrรกs Lลrincz, Kristรณf Karacs, Anna Gelencsรฉr-Horvรกth
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present a fast, spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT). The proposed pipeline is designed to enable efficient, close to real-time performance, supporting applications including assistive navigation. To achieve continuous updates of the 3D scene representation, we process the image flow with a sliding window, aligning submaps, thereby overcoming VGGT's high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the framework to real-world scenarios.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Seonghun Oh, Youngjung Uh, Jin-Hwa Kim
ๆฅ็ๆ่ฆ
Extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs) remains challenging. Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU MLPs. We present TetraSDF, a precise analytic meshing framework for SDFs represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder. The encoder's barycentric interpolation preserves global CPWA structure, enabling us to track ReLU linear regions within an encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder's metric further reduces directional bias and stabilizes training. Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, all with practical runtime and memory efficiency.
๐
2025-11-20
๐ ArXiv Preprint (eess.IV)
๐ฅ Erwan Dereure, Robin Louiset, Laura Parkkinen, David A Menassa, David Holcman
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Parkinson's disease (PD) is a neurodegenerative disorder associated with the accumulation of misfolded alpha-synuclein aggregates, forming Lewy bodies and neuritic shape used for pathology diagnostics. Automatic analysis of immunohistochemistry histopathological images with Deep Learning provides a promising tool for better understanding the spatial organization of these aggregates. In this study, we develop an automated image processing pipeline to segment and classify these aggregates in whole-slide images (WSIs) of midbrain tissue from PD and incidental Lewy Body Disease (iLBD) cases based on weakly supervised segmentation, robust to immunohistochemical labelling variability, with a ResNet50 classifier. Our approach allows to differentiate between major aggregate morphologies, including Lewy bodies and neurites with a balanced accuracy of $80\%$. This framework paves the way for large-scale characterization of the spatial distribution and heterogeneity of alpha-synuclein aggregates in brightfield immunohistochemical tissue, and for investigating their poorly understood relationships with surrounding cells such as microglia and astrocytes.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sinan Mutlu, Georgios F. Angelis, Savas Ozkan, Paul Wisbey, Anastasios Drosou et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction in-complete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs that ultimately improves the accuracy-running time tradeoff.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Oliver Bimber, Karl Dietrich von Ellenrieder, Michael Haller, Rakesh John Amala Arokia Nathan, Gianni Lunardi et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Peering, a side-to-side motion used by animals to estimate distance through motion parallax, offers a powerful bio-inspired strategy to overcome a fundamental limitation in robotic vision: partial occlusion. Conventional robot cameras, with their small apertures and large depth of field, render both foreground obstacles and background objects in sharp focus, causing occluders to obscure critical scene information. This work establishes a formal connection between animal peering and synthetic aperture (SA) sensing from optical imaging. By having a robot execute a peering motion, its camera describes a wide synthetic aperture. Computational integration of the captured images synthesizes an image with an extremely shallow depth of field, effectively blurring out occluding elements while bringing the background into sharp focus. This efficient, wavelength-independent technique enables real-time, high-resolution perception across various spectral bands. We demonstrate that this approach not only restores basic scene understanding but also empowers advanced visual reasoning in large multimodal models, which fail with conventionally occluded imagery. Unlike feature-dependent multi-view 3D vision methods or active sensors like LiDAR, SA sensing via peering is robust to occlusion, computationally efficient, and immediately deployable on any mobile robot. This research bridges animal behavior and robotics, suggesting that peering motions for synthetic aperture sensing are a key to advanced scene understanding in complex, cluttered environments.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Boyue Xu, Ruichao Hou, Tongwei Ren, Dongming Zhou, Gangshan Wu et al.
ๆฅ็ๆ่ฆ
Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, boosting precision rate and success rate gains by 7.2\% and 4.3\%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at https://github.com/xuboyue1999/SwiTrack.git.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to `read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Melih Baydar, Emre Akbas
ๆฅ็ๆ่ฆ
Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models have recently shifted focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, achieving 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet datasets, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuping Yan, Yuhan Xie, Yinxin Zhang, Lingjuan Lyu, Yaochu Jin
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Deniz Sayin Mercadier, Hieu Le, Yihong Chen, Jiancheng Yang, Udaranga Wickramasinghe et al.
ๆฅ็ๆ่ฆ
Human organs are composed of interconnected substructures whose geometry and spatial relationships constrain one another. Yet, most deep-learning approaches treat these parts independently, producing anatomically implausible reconstructions. We introduce PrIntMesh, a template-based, topology-preserving framework that reconstructs organs as unified systems. Starting from a connected template, PrIntMesh jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces. We demonstrate its effectiveness on the heart, hippocampus, and lungs, achieving high geometric accuracy, correct topology, and robust performance even with limited or noisy training data. Compared to voxel- and surface-based methods, PrIntMesh better reconstructs shared interfaces, maintains structural consistency, and provides a data-efficient solution suitable for clinical use.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nianchang Huang, Yi Xu, Ruida Xi, Ruida Xi, Qiang Zhang
ๆฅ็ๆ่ฆ
Recently, Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. To address this, we take the initiative to investigate Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID), aiming to transfer the knowledge learned from the public data to real-world data without compromising accuracy and requiring the annotation of new samples. Specifically, we first analyze two basic challenges in UDA-VI-ReID, i.e., inter-domain modality discrepancies and intra-domain modality discrepancies. Then, we design a novel two-stage model, i.e., Domain-Shared Learning and Gradual Alignment (DSLGA), to handle these discrepancies. In the first pre-training stage, DSLGA introduces a Domain-Shared Learning Strategy (DSLS) to mitigate ineffective pre-training caused by inter-domain modality discrepancies via exploiting shared information between the source and target domains. While, in the second fine-tuning stage, DSLGA designs a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges between visible and infrared data caused by the large intra-domain modality discrepancies through a cluster-to-holistic alignment way. Finally, a new UDA-VI-ReID testing method i.e., CMDA-XD, is constructed for training and testing different UDA-VI-ReID models. A large amount of experiments demonstrate that our method significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jeremie Ochin, Raphael Chekroun, Bogdan Stanciulescu, Sotiris Manitsaris
ๆฅ็ๆ่ฆ
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $ฯ_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Jiahao Li, Yang Lu, Yachao Zhang, Yong Xie, Fangyong Wang et al.
Multimodal
ๆฅ็ๆ่ฆ
Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang
Machine LearningRoboticsMultimodal
ๆฅ็ๆ่ฆ
Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhi Luo, Zenghui Yuan, Wenqi Wei, Daizong Liu, Pan Zhou
Machine LearningMultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric.Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and controllability.To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token of the perturbed images.Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuting Lu, Ziliang Wang, Weixin Xu, Wei Zhang, Yongqiang Zhao et al.
Computer Vision
ๆฅ็ๆ่ฆ
Clinical deployment requires segmentation models to stay stable under distribution shifts and perturbations. The mainstream solution is adversarial training (AT) to improve robustness; however, AT often brings a clean--robustness trade-off and high training/tuning cost, which limits scalability and maintainability in medical imaging. We propose \emph{Layer-wise Noise-Guided Selective Wavelet Reconstruction (LNG-SWR)}. During training, we inject small, zero-mean noise at multiple layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions. We then apply prior-guided selective wavelet reconstruction on the input/feature branch to achieve frequency adaptation: suppress noise-sensitive bands, enhance directional structures and shape cues, and stabilize boundary responses while maintaining spectral consistency. The framework is backbone-agnostic and adds low additional inference overhead. It can serve as a plug-in enhancement to AT and also improves robustness without AT. On CT and ultrasound datasets, under a unified protocol with PGD-$L_{\infty}/L_{2}$ and SSAH, LNG-SWR delivers consistent gains on clean Dice/IoU and significantly reduces the performance drop under strong attacks; combining LNG-SWR with AT yields additive gains. When combined with adversarial training, robustness improves further without sacrificing clean accuracy, indicating an engineering-friendly and scalable path to robust segmentation. These results indicate that LNG-SWR provides a simple, effective, and engineering-friendly path to robust medical image segmentation in both adversarial and standard training regimes.
๐
2025-11-20
๐ AAAI
๐ฅ Lirui Zhang, Zhengkai Zhao, Zhi Zuo, Pan Gao, Jie Qin
ๆฅ็ๆ่ฆ
Point cloud completion is a fundamental task in 3D vision. A persistent challenge in this field is simultaneously preserving fine-grained details present in the input while ensuring the global structural integrity of the completed shape. While recent works leveraging local symmetry transformations via direct regression have significantly improved the preservation of geometric structure details, these methods suffer from two major limitations: (1) These regression-based methods are prone to overfitting which tend to memorize instant-specific transformations instead of learning a generalizable geometric prior. (2) Their reliance on point-wise transformation regression lead to high sensitivity to input noise, severely degrading their robustness and generalization. To address these challenges, we introduce Simba, a novel framework that reformulates point-wise transformation regression as a distribution learning problem. Our approach integrates symmetry priors with the powerful generative capabilities of diffusion models, avoiding instance-specific memorization while capturing robust geometric structures. Additionally, we introduce a hierarchical Mamba-based architecture to achieve high-fidelity upsampling. Extensive experiments across the PCN, ShapeNet, and KITTI benchmarks validate our method's state-of-the-art (SOTA) performance.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50\% reduction in parameter count compared to the full model, with less than 3\% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code, checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chunxu Liu, Jiyuan Yang, Ruopeng Gao, Yuhan Zhu, Feng Zhu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sibaek Lee, Seongbo Ha, Kyeongsu Kang, Joonyeol Choi, Seungjun Tak et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60\% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Quanqing Ma, Jiaen Chen, Peng Wang, Yao Zheng, Qingzhan Zhao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Remote sensing Water Body Change Detection (WBCD) aims to detect water body surface changes from bi-temporal images of the same geographic area. Recently, the scarcity of high spatial resolution datasets for WBCD restricts its application in urban and rural regions, which require more accurate positioning. Meanwhile, previous deep learning-based methods fail to comprehensively exploit the spatial semantic and structural information in deep features in the change detection networks. To resolve these concerns, we first propose a new dataset, HSRW-CD, with a spatial resolution higher than 3 meters for WBCD. Specifically, it contains a large number of image pairs, widely covering various water body types. Besides, a Spatial Semantics and Continuity Perception (SSCP) attention module is designed to fully leverage both the spatial semantics and structure of deep features in the WBCD networks, significantly improving the discrimination capability for water body. The proposed SSCP has three components: the Multi-Semantic spatial Attention (MSA), the Structural Relation-aware Global Attention (SRGA), and the Channel-wise Self-Attention (CSA). The MSA enhances the spatial semantics of water body features and provides precise spatial semantic priors for the CSA. Then, the SRGA further extracts spatial structure to learn the spatial continuity of the water body. Finally, the CSA utilizes the spatial semantic and structural priors from the MSA and SRGA to compute the similarity across channels. Specifically designed as a plug-and-play module for water body deep features, the proposed SSCP allows integration into existing WBCD models. Numerous experiments conducted on the proposed HSRW-CD and Water-CD datasets validate the effectiveness and generalization of the SSCP. The code of this work and the HSRW-CD dataset will be accessed at https://github.com/QingMa1/SSCP.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Chenyu Zhao, Xianwei Zheng, Zimin Xia, Linwei Yue, Nan Xue
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection (SR3D) framework for indoor point clouds, to bridge the gap between how detectors are trained and how they are evaluated. This gap stems from the lack of spatial reliability and ranking awareness during training, which conflicts with the ranking-based prediction selection used as inference. Such a training-inference gap hampers the model's ability to learn representations aligned with inference-time behavior. To address the limitation, SR3D consists of two components tailored to the spatial nature of point clouds during training: a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, and a rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via a self-distillation paradigm. Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D effectively bridges the training-inference gap and significantly outperforms prior methods in accuracy while maintaining real-time speed.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Li Yu, Yingbo Zhao, Shiyu Wu, Siyue Yu, Moncef Gabbouj et al.
ๆฅ็ๆ่ฆ
Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, hence, overlooking the varying computational demands inherent to different compression levels. We thus introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22. Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiazhen Yan, Ziqiang Li, Fan Wang, Kai Zeng, Zhangjie Fu
Machine LearningRoboticsMultimodal
ๆฅ็ๆ่ฆ
The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Chenyang Wu, Jiayi Fu, Chun-Le Guo, Shuhao Han, Chongyi Li
ๆฅ็ๆ่ฆ
Due to large pixel movement and high computational cost, estimating the motion of high-resolution frames is challenging. Thus, most flow-based Video Frame Interpolation (VFI) methods first predict bidirectional flows at low resolution and then use high-magnification upsampling (e.g., bilinear) to obtain the high-resolution ones. However, this kind of upsampling strategy may cause blur or mosaic at the flows' edges. Additionally, the motion of fine pixels at high resolution cannot be adequately captured in motion estimation at low resolution, which leads to the misalignment of task-oriented flows. With such inaccurate flows, input frames are warped and combined pixel-by-pixel, resulting in ghosting and discontinuities in the interpolated frame. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU introduces input frames as guidance to alleviate the blurring details in bilinear upsampling flows, which makes flows' edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy to facilitate producing the final interpolated frame via a reconstruction module. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI. Codes are available at: https://github.com/Wucy0519/VTinker.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianxiong Zhong, Xingye Tian, Xuebo Wang, Boyuan Jiang, Xin Tao et al.
ๆฅ็ๆ่ฆ
Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.
๐
2025-11-20
๐ SIGGRAPH
๐ฅ Taeho Kang, Jaeyeon Park, Kyungjin Lee, Youngki Lee
ๆฅ็ๆ่ฆ
Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition that pinpoints dynamic areas to improve and initialize fitting splats, and (2) Grouped 4D Gaussian Splatting that improves consistency of mapping between splats and represented dynamic objects. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, improving 0.39dB of PSNR on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method's capability to identify errors and properly initialize new splats. Our implementation details and source code are available at https://github.com/tho-kn/cem-4dgs.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
In large language models (LLM), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks by unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In the paper, we propose a fully collaborative pipeline, i.e. T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Renxiang Xiao, Wei Liu, Yuanfan Zhang, Yushuai Chen, Jinming Chen et al.
ๆฅ็ๆ่ฆ
We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, utilizing 3D Gaussian as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point cloud with Doppler information and geometrically enhanced point cloud to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, the global octree structure coupled with a targeted Gaussian primitive management strategy further suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meihua Zhou, Liping Yu, Jiawei Cai, Wai Kin Fung, Ruiguo Hu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral -- spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets -- Indian Pines, Salinas-A, and the newly introduced CloudPatch-7 -- demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent reductions in training time by 2-7x speedups with small-to-moderate accuracy deltas depending on backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li
Computer VisionRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose \textbf{VideoSeg-R1}, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) A hierarchical text-guided frame sampler to emulate human attention; (2) A reasoning model that produces spatial cues along with explicit reasoning chains; and (3) A segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pei Liu, Songtao Wang, Lang Zhang, Xingyue Peng, Yuandong Lyu et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor's native geometry. LiSTAR introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud-aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high-resolution, and layout-guided compositional generation. Comprehensive experiments validate LiSTAR's state-of-the-art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by a massive 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: https://ocean-luna.github.io/LiSTAR.gitub.io.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Boxun Xu, Yu Wang, Zihu Wang, Peng Li
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales-severely limiting scalability. Our systematic investigation reveals that: (1) Attending to tokens from local scales significantly contributes to generation quality (2) Allocating a small amount of memory for the coarsest scales, termed as condensed scales, stabilizes multi-scale image generation (3) Strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on the observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qing Wang, Chong-Wah Ngo, Ee-Peng Lim, Qianru Sun
Natural Language Processing
ๆฅ็ๆ่ฆ
Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in the free-living environment. In addition to this domain-shift problem, the real-world food datasets tend to be long-tailed distributed and some dishes of different categories exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform the existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Timilehin T. Ayanlade, Anirudha Powadi, Talukder Z. Jubery, Baskar Ganapathysubramanian, Soumik Sarkar
ๆฅ็ๆ่ฆ
Recent advances in plant phenotyping have driven widespread adoption of multi sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAV) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound to the limitation of spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross modal learning strategy that enriches high-resolution satellite imagery with UAV level visual detail for crop canopy trait estimation. Using a dataset of approximately co registered satellite UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine grained spectral spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zijian Wu, Mingfeng Jiang, Zidian Lin, Ying Song, Hanjie Ma et al.
Computer Vision
ๆฅ็ๆ่ฆ
3D Gaussian Splatting (3DGS) has recently emerged as an efficient, high-fidelity representation for real-time scene reconstruction and rendering. However, extending 3DGS to sparse-view settings remains challenging because of supervision scarcity and overfitting caused by limited viewpoint coverage. In this paper, we present CuriGS, a curriculum-guided framework for sparse-view 3D reconstruction using 3DGS. CuriGS addresses the core challenge of sparse-view synthesis by introducing student views: pseudo-views sampled around ground-truth poses (teacher). For each teacher, we generate multiple groups of student views with different perturbation levels. During training, we follow a curriculum schedule that gradually unlocks higher perturbation level, randomly sampling candidate students from the active level to assist training. Each sampled student is regularized via depth-correlation and co-regularization, and evaluated using a multi-signal metric that combines SSIM, LPIPS, and an image-quality measure. For every teacher and perturbation level, we periodically retain the best-performing students and promote those that satisfy a predefined quality threshold to the training set, resulting in a stable augmentation of sparse training views. Experimental results show that CuriGS outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes. Project page: https://zijian1026.github.io/CuriGS/
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Abdallah Salem, Hamdy Ahmed Ashur, Ahmed Elshinnawy
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers' health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material's surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30 % on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Xiao He, Zhijun Tu, Kun Cheng, Mingrui Zhu, Jie Hu et al.
Multimodal
ๆฅ็ๆ่ฆ
The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) module to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework's effectiveness and state-of-the-art performance.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dingkun Zhou, Patrick P. K. Chan, Hengxu Wu, Shikang Zheng, Ruiqi Huang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nimeshika Udayangani, Hadi M. Dolatabadi, Sarah Erfani, Christopher Leckie
Machine Learning
ๆฅ็ๆ่ฆ
Detecting out-of-distribution (OOD) data is essential for safe deployment of deep neural networks (DNNs). This problem becomes particularly challenging in the presence of long-tailed in-distribution (ID) datasets, often leading to high false positive rates (FPR) and low tail-class ID classification accuracy. In this paper, we demonstrate that exploiting inter-sample relationships using a graph-based representation can significantly improve OOD detection in long-tailed recognition of vision datasets. To this end, we use the feature space of a pre-trained model to initialize our graph structure. We account for the differences between the activation layer distribution of the pre-training vs. training data, and actively introduce Gaussianization to alleviate any deviations from a standard normal distribution in the activation layers of the pre-trained model. We then refine this initial graph representation using graph convolutional networks (GCNs) to arrive at a feature space suitable for long-tailed OOD detection. This leads us to address the inferior performance observed in ID tail-classes within existing OOD detection methods. Experiments over three benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate that our method outperforms the state-of-the-art approaches by a large margin in terms of FPR and tail-class ID classification accuracy.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems, including text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate all three approaches across 6 LLM models and a two multi-modal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain. These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
๐
2025-11-20
๐ ArXiv Preprint (eess.AS)
๐ฅ Wei-Cheng Tseng, David Harwath
ๆฅ็ๆ่ฆ
Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5x and training time by 2.3x, showcasing its scalability and efficiency.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guolin Huang, Wenting Chen, Jiaqi Yang, Xinheng Lyu, Xiaoling Luo et al.
Multimodal
ๆฅ็ๆ่ฆ
Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Sen Chen, Tong Zhao, Yi Bin, Fei Ma, Wenqi Shao et al.
ๆฅ็ๆ่ฆ
Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. While most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kexin Zhao, Ken Forbus
Natural Language Processing
ๆฅ็ๆ่ฆ
Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zachary Ellis, Jared Joselowitz, Yash Deo, Yajie He, Anna Kalygina et al.
ๆฅ็ๆ่ฆ
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $ฮบ$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
๐
2025-11-20
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiaheng Zhang, Daqiang Zhang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage.
Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models.
Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ รloรฏse Benito-Rodriguez, Einar Urdshals, Jasmina Nasufi, Nicky Pochinkov
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Understanding Large Language Models (LLMs) is key to ensure their safe and beneficial deployment. This task is complicated by the difficulty of interpretability of LLM structures, and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM, is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ รzay Ezerceli, Mahmoud El Hussieni, Selva Taล, Reyhan Bayraktar, Fatma Betรผl Terzioฤlu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71\% of its average mAP. Late-interaction models that are 3--5$\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8\% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\times$ faster than PLAID and offers +1.7\% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang et al.
Robotics
ๆฅ็ๆ่ฆ
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
๐
2025-11-20
๐ Under review with the ACM Transactions on Recommen
๐ฅ Elena V. Epure, Yashar Deldjoo, Bruno Sguerra, Markus Schedl, Manuel Moussallam
Natural Language Processing
ๆฅ็ๆ่ฆ
Music Recommender Systems (MRS) have long relied on an information-retrieval framing, where progress is measured mainly through accuracy on retrieval-oriented subtasks. While effective, this reductionist paradigm struggles to address the deeper question of what makes a good recommendation, and attempts to broaden evaluation, through user studies or fairness analyses, have had limited impact. The emergence of Large Language Models (LLMs) disrupts this framework: LLMs are generative rather than ranking-based, making standard accuracy metrics questionable. They also introduce challenges such as hallucinations, knowledge cutoffs, non-determinism, and opaque training data, rendering traditional train/test protocols difficult to interpret. At the same time, LLMs create new opportunities, enabling natural-language interaction and even allowing models to act as evaluators.
This work argues that the shift toward LLM-driven MRS requires rethinking evaluation. We first review how LLMs reshape user modeling, item modeling, and natural-language recommendation in music. We then examine evaluation practices from NLP, highlighting methodologies and open challenges relevant to MRS. Finally, we synthesize insights-focusing on how LLM prompting applies to MRS, to outline a structured set of success and risk dimensions. Our goal is to provide the MRS community with an updated, pedagogical, and cross-disciplinary perspective on evaluation.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mateusz Chiliลski, Julita Oลtusek, Wojciech Jaลkowski
Natural Language Processing
ๆฅ็ๆ่ฆ
Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighting only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making suitable for long document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Andrew Gomes
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate ``Idiom Heads,'' attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term ``augmented reception.'' We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.
๐
2025-11-20
๐ Workshop paper accepted at AI4DF 2025 (part of ACM
๐ฅ Sherine George, Nithish Saji
Natural Language Processing
ๆฅ็ๆ่ฆ
We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Li Zhang, Zhongxuan Han, XiaoHua Feng, Jiaming Zhang, Yuyuan Li et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incur significant communication costs and increase the susceptibility to potential attacks. Motivated by the one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) requiring additional training resource of clients or server. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, not relying on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Connor McElroy, Thiago E. A. de Oliveira, Chris Brogly
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. 3 machine learning classifiers and 3 deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication,treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, codes, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8\% ROUGE-N F1 compared to Trafilatura's 63.6\%, with exceptional structured element preservation (90.9\% for code blocks, 94.0\% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8\% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp-providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jonathan Kamp, Lisa Beinborn, Antske Fokkens
ๆฅ็ๆ่ฆ
Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Blake Matheny, Phuong Minh Nguyen, Minh Le Nguyen, Stephanie Reynolds
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets were used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Jiashu Yao, Heyan Huang, Shuang Zeng, Chuwei Luo, WangJie You et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wei Xia, Zhi-Hong Deng
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xingtao Zhao, Hao Peng, Dingli Su, Xianghua Zeng, Chunyang Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation -- where existing methods often rely on heuristic sample-and-count techniques -- we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to `read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.
๐
2025-11-20
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Huseein Jawad, Nicolas Brunel
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
System prompts are critical for guiding the behavior of Large Language Models (LLMs), yet they often contain proprietary logic or sensitive information, making them a prime target for extraction attacks. Adversarial queries can successfully elicit these hidden instructions, posing significant security and privacy risks. Existing defense mechanisms frequently rely on heuristics, incur substantial computational overhead, or are inapplicable to models accessed via black-box APIs. This paper introduces a novel framework for hardening system prompts through shield appending, a lightweight approach that adds a protective textual layer to the original prompt. Our core contribution is the formalization of prompt hardening as a utility-constrained optimization problem. We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks, while simultaneously preserving task utility above a specified threshold, measured by semantic fidelity to baseline outputs. This black-box, optimization-driven methodology is lightweight and practical, requiring only API access to the target and optimizer LLMs. We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks, outperforming established baseline defenses without compromising the model's intended functionality. Our work presents a paradigm for developing robust, utility-aware defenses in the escalating landscape of LLM security. The code is made public on the following link: https://github.com/psm-defense/psm
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sebastian Haan
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dabiao Ma, Ziming Dai, Zhimin Xin, Shu Wang, Ye Wang et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qing Zhang, Bing Xu, Xudong Zhang, Yifan Shi, Yang Li et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The remarkable performance of Large Language Models (LLMs) highly relies on crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which develops rapidly in recent years. Existing APO methods such as those based on evolutionary algorithms or trial-and-error approaches realize an efficient and accurate prompt optimization to some extent. However, those researches focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO conducts voting mechanism and introduces shared generation strategies along with different search methods for searching superior prompts. Moreover, ELPO creatively presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on ArSarcasm dataset.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sรฉbastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Guy Van den Broeck et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks
Natural Language Processing
ๆฅ็ๆ่ฆ
Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generating statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Emanuel C. Silva, Emily S. M. Salum, Gabriel M. Arantes, Matheus P. Pereira, Vinicius F. Oliveira et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Introduction: The application of Artificial Intelligence in games has evolved significantly, allowing for dynamic content generation. However, its use as a core gameplay co-creation tool remains underexplored. Objective: This paper proposes SpellForger, a game where players create custom spells by writing natural language prompts, aiming to provide a unique experience of personalization and creativity. Methodology: The system uses a supervisedtrained BERT model to interpret player prompts. This model maps textual descriptions to one of many spell prefabs and balances their parameters (damage, cost, effects) to ensure competitive integrity. The game is developed in the Unity Game Engine, and the AI backend is in Python. Expected Results: We expect to deliver a functional prototype that demonstrates the generation of spells in real time, applied to an engaging gameplay loop, where player creativity is central to the experience, validating the use of AI as a direct gameplay mechanic.
๐
2025-11-20
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Amin Bigdeli, Radin Hamidi Rad, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present QueryGym, a lightweight, extensible Python toolkit that supports large language model (LLM)-based query reformulation. This is an important tool development since recent work on llm-based query reformulation has shown notable increase in retrieval effectiveness. However, while different authors have sporadically shared the implementation of their methods, there is no unified toolkit that provides a consistent implementation of such methods, which hinders fair comparison, rapid experimentation, consistent benchmarking and reliable deployment. QueryGym addresses this gap by providing a unified framework for implementing, executing, and comparing llm-based reformulation methods. The toolkit offers: (1) a Python API for applying diverse LLM-based methods, (2) a retrieval-agnostic interface supporting integration with backends such as Pyserini and PyTerrier, (3) a centralized prompt management system with versioning and metadata tracking, (4) built-in support for benchmarks like BEIR and MS MARCO, and (5) a completely open-source extensible implementation available to all researchers. QueryGym is publicly available at https://github.com/radinhamidi/QueryGym.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Deepthi Potluri, Aby Mammen Mathew, Jeffrey B DeWitt, Alexander L. Rasgon, Yide Hao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sarik Ghazarian, Abhinav Gullapalli, Swair Shah, Anurag Beniwal, Nanyun Peng et al.
Robotics
ๆฅ็ๆ่ฆ
In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs' complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs' abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhenyu Bi, Gaurav Srivastava, Yang Li, Meng Lu, Swastik Roy et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both accuracy-based ranking and an Elo-based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi-Agent Judging), a novel multi-agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM-level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller-sized models as backbones performs comparatively well or even better than their larger-sized counterparts. Our findings highlight that multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ George Cazenavette, Antonio Torralba, Vincent Sitzmann
Multimodal
ๆฅ็ๆ่ฆ
The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively select suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Irmak Guzey, Haozhi Qi, Julen Urain, Changhao Wang, Jessica Yin et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot data collection. Despite substantial efforts, progress toward this goal has been bottle-necked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream. AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across nine everyday manipulation tasks. Robot rollouts are best viewed on our website: https://aina-robot.github.io.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vishaal Udandarao, Shyamgopal Karthik, Surabhi S. Nath, Andreas Hochlehnert, Matthias Bethge et al.
ๆฅ็ๆ่ฆ
Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio Leรณn Villares et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Na{รฏ}ve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E\in\mathbb{R}^{m\times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A\in \mathbb{R}^{m\times r},\ B\in \mathbb{R}^{n\times r}$ with $r\ll \min(m,n)$ to form a low-rank matrix perturbation $A B^\top$ that are used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings, despite being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shihab Ahmed, El Houcine Bergou, Aritra Dutta, Yue Wang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Policy gradient methods, which have been extensively studied in the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performances can often be unsatisfactory, suffering from unreliable reward improvements and slow convergence, due to high variance in gradient estimations. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, where we selectively update the policy based on high-confidence performance estimations. We theoretically justify that our technique will not slow down the convergence of the baseline policy gradient methods, but with high probability, will result in stable and monotonic improvements of their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster convergence to near-optimal returns, up to 1.75x reduction in return variance on some setups. Our profiling approach offers a general, theoretically grounded path to more reliable and efficient policy learning in complex environments.
๐
2025-11-20
๐ ArXiv Preprint (math.AC)
๐ฅ Jurgen Mezinaj
Machine Learning
ๆฅ็ๆ่ฆ
We develop a computational framework for classifying Galois groups of irreducible degree-7 polynomials over~$\mathbb{Q}$, combining explicit resolvent methods with machine learning techniques. A database of over one million normalized projective septics is constructed, each annotated with algebraic invariants~$J_0, \dots, J_4$ derived from binary transvections. For each polynomial, we compute resolvent factorizations to determine its Galois group among the seven transitive subgroups of~$S_7$ identified by Foulkes. Using this dataset, we train a neurosymbolic classifier that integrates invariant-theoretic features with supervised learning, yielding improved accuracy in detecting rare solvable groups compared to coefficient-based models. The resulting database provides a reproducible resource for constructive Galois theory and supports empirical investigations into group distribution under height constraints. The methodology extends to higher-degree cases and illustrates the utility of hybrid symbolic-numeric techniques in computational algebra.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Jingqiu Ding, Yiding Hua, Kasper Lindberg, David Steurer, Aleksandr Storozhenko
ๆฅ็ๆ่ฆ
We study community detection in the \emph{symmetric $k$-stochastic block model}, where $n$ nodes are evenly partitioned into $k$ clusters with intra- and inter-cluster connection probabilities $p$ and $q$, respectively.
Our main result is a polynomial-time algorithm that achieves the minimax-optimal misclassification rate
\begin{equation*}
\exp \Bigl(-\bigl(1 \pm o(1)\bigr) \tfrac{C}{k}\Bigr),
\quad \text{where } C = (\sqrt{pn} - \sqrt{qn})^2,
\end{equation*}
whenever $C \ge K\,k^2\,\log k$ for some universal constant $K$, matching the Kesten--Stigum (KS) threshold up to a $\log k$ factor.
Notably, this rate holds even when an adversary corrupts an $ฮท\le \exp\bigl(- (1 \pm o(1)) \tfrac{C}{k}\bigr)$ fraction of the nodes.
To the best of our knowledge, the minimax rate was previously only attainable either via computationally inefficient procedures [ZZ15] or via polynomial-time algorithms that require strictly stronger assumptions such as $C \ge K k^3$ [GMZZ17].
In the node-robust setting, the best known algorithm requires the substantially stronger condition $C \ge K k^{102}$ [LM22].
Our results close this gap by providing the first polynomial-time algorithm that achieves the minimax rate near the KS threshold in both settings.
Our work has two key technical contributions:
(1) we robustify majority voting via the Sum-of-Squares framework,
(2) we develop a novel graph bisection algorithm via robust majority voting, which allows us to significantly improve the misclassification rate to $1/\mathrm{poly}(k)$ for the initial estimation near the KS threshold.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Lukas Billera, Hedwig Nora Nordlinder, Ben Murrell
ๆฅ็ๆ่ฆ
This brief note clarifies that, in Generator Matching (which subsumes a large family of flow matching and diffusion models over continuous, manifold, and discrete spaces), both the Bregman divergence loss and the linear parameterization of the generator can depend on both the current state $X_t$ and the time $t$, and we show that the expectation over time in the loss can be taken with respect to a broad class of time distributions. We also show this for Edit Flows, which falls outside of Generator Matching. That the loss can depend on $t$ clarifies that time-dependent loss weighting schemes, often used in practice to stabilize training, are theoretically justified when the specific flow or diffusion scheme is a special case of Generator Matching (or Edit Flows). It also often simplifies the construction of $X_1$-predictor schemes, which are sometimes preferred for model-related reasons. We show examples that rely upon the dependence of linear parameterizations, and of the Bregman divergence loss, on $t$ and $X_t$.
๐
2025-11-20
๐ ArXiv Preprint (quant-ph)
๐ฅ Ivana Nikoloska, Osvaldo Simeone
Machine Learning
ๆฅ็ๆ่ฆ
The integration of sensing and communication functionalities within a common system is one of the main innovation drivers for next-generation networks. In this paper, we introduce a quantum integrated sensing and communication (QISAC) protocol that leverages entanglement in quantum carriers of information to enable both superdense coding and quantum sensing. The proposed approach adaptively optimizes encoding and quantum measurement via variational circuit learning, while employing classical machine learning-based decoders and estimators to process the measurement outcomes. Numerical results for qudit systems demonstrate that the proposed QISAC protocol can achieve a flexible trade-off between classical communication rate and accuracy of parameter estimation.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zohar Rimon, Elisei Shafer, Tal Tepper, Efrat Shimron, Aviv Tamar
ๆฅ็ๆ่ฆ
Palpation, the use of touch in medical examination, is almost exclusively performed by humans. We investigate a proof of concept for an artificial palpation method based on self-supervised learning. Our key idea is that an encoder-decoder framework can learn a $\textit{representation}$ from a sequence of tactile measurements that contains all the relevant information about the palpated object. We conjecture that such a representation can be used for downstream tasks such as tactile imaging and change detection. With enough training data, it should capture intricate patterns in the tactile measurements that go beyond a simple map of forces -- the current state of the art. To validate our approach, we both develop a simulation environment and collect a real-world dataset of soft objects and corresponding ground truth images obtained by magnetic resonance imaging (MRI). We collect palpation sequences using a robot equipped with a tactile sensor, and train a model that predicts sensory readings at different positions on the object. We investigate the representation learned in this process, and demonstrate its use in imaging and change detection.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Daniil Tiapkin, Artem Agarkov, Nikita Morozov, Ian Maksimov, Askar Tsyganov et al.
ๆฅ็ๆ่ฆ
In this paper, we present gfnx, a fast and scalable package for training and evaluating Generative Flow Networks (GFlowNets) written in JAX. gfnx provides an extensive set of environments and metrics for benchmarking, accompanied with single-file implementations of core objectives for training GFlowNets. We include synthetic hypergrids, multiple sequence generation environments with various editing regimes and particular reward designs for molecular generation, phylogenetic tree construction, Bayesian structure learning, and sampling from the Ising model energy. Across different tasks, gfnx achieves significant wall-clock speedups compared to Pytorch-based benchmarks (such as torchgfn library) and author implementations. For example, gfnx achieves up to 55 times speedup on CPU-based sequence generation environments, and up to 80 times speedup with the GPU-based Bayesian network structure learning setup. Our package provides a diverse set of benchmarks and aims to standardize empirical evaluation and accelerate research and applications of GFlowNets. The library is available on GitHub (https://github.com/d-tiapkin/gfnx) and on pypi (https://pypi.org/project/gfnx/). Documentation is available on https://gfnx.readthedocs.io.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Amartya Mukherjee, Jun Liu
Machine Learning
ๆฅ็ๆ่ฆ
Differentially private stochastic gradient descent (DP-SGD) has become the standard algorithm for training machine learning models with rigorous privacy guarantees. Despite its widespread use, the theoretical understanding of its long-run behavior remains limited: existing analyses typically establish convergence in expectation or with high probability, but do not address the almost sure convergence of single trajectories. In this work, we prove that DP-SGD converges almost surely under standard smoothness assumptions, both in nonconvex and strongly convex settings, provided the step sizes satisfy some standard decaying conditions. Our analysis extends to momentum variants such as the stochastic heavy ball (DP-SHB) and Nesterov's accelerated gradient (DP-NAG), where we show that careful energy constructions yield similar guarantees. These results provide stronger theoretical foundations for differentially private optimization and suggest that, despite privacy-induced distortions, the algorithm remains pathwise stable in both convex and nonconvex regimes.
๐
2025-11-20
๐ ArXiv Preprint (cs.LO)
๐ฅ Gaspard Ohlmann, Edwin Hamel-De le Court, Francesco Belardinelli
Robotics
ๆฅ็ๆ่ฆ
Ensuring that agents satisfy safety specifications can be crucial in safety-critical environments. While methods exist for controller synthesis with safe temporal specifications, most existing methods restrict safe temporal specifications to probabilistic-avoidance constraints. Formal methods typically offer more expressive ways to express safety in probabilistic systems, such as Probabilistic Computation Tree Logic (PCTL) formulas. Thus, in this paper, we develop a new approach that supports more general temporal properties expressed in PCTL. Our contribution is twofold. First, we develop a theoretical framework for the Synthesis of safe-PCTL specifications. We show how the reducing global specification satisfaction to local constraints, and define CPCTL, a fragment of safe-PCTL. We demonstrate how the expressiveness of CPCTL makes it a relevant fragment for the Synthesis Problem. Second, we leverage these results and propose a new Value Iteration-based algorithm to solve the synthesis problem for these more general temporal properties, and we prove the soundness and completeness of our method.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Fares Fourati, Mohamed-Slim Alouini, Vaneet Aggarwal
Machine Learning
ๆฅ็ๆ่ฆ
We propose ECPv2, a scalable and theoretically grounded algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants. Building on the Every Call is Precious (ECP) framework, which ensures that each accepted function evaluation is potentially informative, ECPv2 addresses key limitations of ECP, including high computational cost and overly conservative early behavior. ECPv2 introduces three innovations: (i) an adaptive lower bound to avoid vacuous acceptance regions, (ii) a Worst-m memory mechanism that restricts comparisons to a fixed-size subset of past evaluations, and (iii) a fixed random projection to accelerate distance computations in high dimensions. We theoretically show that ECPv2 retains ECP's no-regret guarantees with optimal finite-time bounds and expands the acceptance region with high probability. We further empirically validate these findings through extensive experiments and ablation studies. Using principled hyperparameter settings, we evaluate ECPv2 across a wide range of high-dimensional, non-convex optimization problems. Across benchmarks, ECPv2 consistently matches or outperforms state-of-the-art optimizers, while significantly reducing wall-clock time.
๐
2025-11-20
๐ ArXiv Preprint (cs.OH)
๐ฅ Huanshuo Dong, Hong Wang, Hao Wu, Zhiwei Zhuang, Xuanze Yang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Neural operators have demonstrated considerable effectiveness in accelerating the solution of time-dependent partial differential equations (PDEs) by directly learning governing physical laws from data. However, for PDEs governed by conservation laws(e.g., conservation of mass, energy, or matter), existing neural operators fail to satisfy conservation properties, which leads to degraded model performance and limited generalizability. Moreover, we observe that distinct PDE problems generally require different optimal neural network architectures. This finding underscores the inherent limitations of specialized models in generalizing across diverse problem domains.
To address these limitations, we propose Exterior-Embedded Conservation Framework (ECF), a universal conserving framework that can be integrated with various data-driven neural operators to enforce conservation laws strictly in predictions. The framework consists of two key components: a conservation quantity encoder that extracts conserved quantities from input data, and a conservation quantity decoder that adjusts the neural operator's predictions using these quantities to ensure strict conservation compliance in the final output. Since our architecture enforces conservation laws, we theoretically prove that it enhances model performance. To validate the performance of our method, we conduct experiments on multiple conservation-law-constrained PDE scenarios, including adiabatic systems, shallow water equations, and the Allen-Cahn problem. These baselines demonstrate that our method effectively improves model accuracy while strictly enforcing conservation laws in the predictions.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Md. Tawfique Ihsan, Md. Rakibul Hasan Rafi, Ahmed Shoyeb Raihan, Imtiaz Ahmed, Abdullahil Azeem
ๆฅ็ๆ่ฆ
Severe class imbalance is common in real-world tabular learning, where rare but important minority classes are essential for reliable prediction. Existing generative oversampling methods such as GANs, VAEs, and diffusion models can improve minority-class performance, but they often struggle with tabular heterogeneity, training stability, and privacy concerns. We propose a family of latent-space, tree-driven diffusion methods for minority oversampling that use conditional flow matching with gradient-boosted trees as the vector-field learner. The models operate in compact latent spaces to preserve tabular structure and reduce computation. We introduce three variants: PCAForest, which uses linear PCA embedding; EmbedForest, which uses a learned nonlinear embedding; and AttentionForest, which uses an attention-augmented embedding. Each method couples a GBT-based flow with a decoder back to the original feature space. Across 11 datasets from healthcare, finance, and manufacturing, AttentionForest achieves the best average minority recall while maintaining competitive precision, calibration, and distributional similarity. PCAForest and EmbedForest reach similar utility with much faster generation, offering favorable accuracy-efficiency trade-offs. Privacy evaluated with nearest-neighbor distance ratio and distance-to-closest-record is comparable to or better than the ForestDiffusion baseline. Ablation studies show that smaller embeddings tend to improve minority recall, while aggressive learning rates harm stability. Overall, latent-space, tree-driven diffusion provides an efficient and privacy-aware approach to high-fidelity tabular data augmentation under severe class imbalance.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Perrine Chassat, Van Tuan Nguyen, Lucas Ducrot, Emilie Lanoy, Agathe Guilloux
Robotics
ๆฅ็ๆ่ฆ
Clinical trials face mounting challenges: fragmented patient populations, slow enrollment, and unsustainable costs, particularly for late phase trials in oncology and rare diseases. While external control arms built from real-world data have been explored, a promising alternative is the generation of synthetic control arms using generative AI. A central challenge is the generation of time-to-event outcomes, which constitute primary endpoints in oncology and rare disease trials, but are difficult to model under censoring and small sample sizes. Existing generative approaches, largely GAN-based, are data-hungry, unstable, and rely on strong assumptions such as independent censoring. We introduce a variational autoencoder (VAE) that jointly generates mixed-type covariates and survival outcomes within a unified latent variable framework, without assuming independent censoring. Across synthetic and real trial datasets, we evaluate our model in two realistic scenarios: (i) data sharing under privacy constraints, where synthetic controls substitute for original data, and (ii) control-arm augmentation, where synthetic patients mitigate imbalances between treated and control groups. Our method outperforms GAN baselines on fidelity, utility, and privacy metrics, while revealing systematic miscalibration of type I error and power. We propose a post-generation selection procedure that improves calibration, highlighting both progress and open challenges for generative survival modeling.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Han Su, Zhongyan Li, Wanquan Liu
Machine Learning
ๆฅ็ๆ่ฆ
Universal approximation serves as the foundation of neural network learning algorithms. However, some networks establish their universal approximation property by demonstrating that the iterative errors converge in probability measure rather than the more rigorous norm convergence, which makes the universal approximation property of randomized learning networks highly sensitive to random parameter selection, Broad residual learning system (BRLS), as a member of randomized learning models, also encounters this issue. We theoretically demonstrate the limitation of its universal approximation property, that is, the iterative errors do not satisfy norm convergence if the selection of random parameters is inappropriate and the convergence rate meets certain conditions. To address this issue, we propose the broad stochastic configuration residual learning system (BSCRLS) algorithm, which features a novel supervisory mechanism adaptively constraining the range settings of random parameters on the basis of BRLS framework, Furthermore, we prove the universal approximation theorem of BSCRLS based on the more stringent norm convergence. Three versions of incremental BSCRLS algorithms are presented to satisfy the application requirements of various network updates. Solar panels dust detection experiments are performed on publicly available dataset and compared with 13 deep and broad learning algorithms. Experimental results reveal the effectiveness and superiority of BSCRLS algorithms.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuanbo Guo, Jun Xia, Yiyu Shi
Machine Learning
ๆฅ็ๆ่ฆ
As deep learning (DL) techniques become integral to various applications, ensuring model fairness while maintaining high performance has become increasingly critical, particularly in sensitive fields such as medical diagnosis. Although a variety of bias-mitigation methods have been proposed, many rely on computationally expensive debiasing strategies or suffer substantial drops in model accuracy, which limits their practicality in real-world, resource-constrained settings. To address this issue, we propose a fairness-oriented low rank factorization (LRF) framework that leverages singular value decomposition (SVD) to improve DL model fairness. Unlike traditional SVD, which is mainly used for model compression by decomposing and reducing weight matrices, our work shows that SVD can also serve as an effective tool for fairness enhancement. Specifically, we observed that elements in the unitary matrices obtained from SVD contribute unequally to model bias across groups defined by sensitive attributes. Motivated by this observation, we propose a method, named FairLRF, that selectively removes bias-inducing elements from unitary matrices to reduce group disparities, thus enhancing model fairness. Extensive experiments show that our method outperforms conventional LRF methods as well as state-of-the-art fairness-enhancing techniques. Additionally, an ablation study examines how major hyper-parameters may influence the performance of processed models. To the best of our knowledge, this is the first work utilizing SVD not primarily for compression but for fairness enhancement.
๐
2025-11-20
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiaheng Zhang, Daqiang Zhang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage.
Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models.
Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ รloรฏse Benito-Rodriguez, Einar Urdshals, Jasmina Nasufi, Nicky Pochinkov
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Understanding Large Language Models (LLMs) is key to ensure their safe and beneficial deployment. This task is complicated by the difficulty of interpretability of LLM structures, and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM, is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kwun Ho Ngan, Saman Sadeghi Afgeh, Joe Townsend, Artur d'Avila Garcez
Machine LearningMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while increasing considerably the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ming-Lun Lee, Fu-Shiang Yang, Cheng-Kuan Lin, Yan-Ann Chen, Chih-Yu Lin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Federated learning (FL) enables clients to collaboratively train a shared model in a distributed manner, setting it apart from traditional deep learning paradigms. However, most existing FL research assumes consistent client participation, overlooking the practical scenario of dynamic participation (DPFL), where clients may intermittently join or leave during training. Moreover, no existing benchmarking framework systematically supports the study of DPFL-specific challenges. In this work, we present the first open-source framework explicitly designed for benchmarking FL models under dynamic client participation. Our framework provides configurable data distributions, participation patterns, and evaluation metrics tailored to DPFL scenarios. Using this platform, we benchmark four major categories of widely adopted FL models and uncover substantial performance degradation under dynamic participation. To address these challenges, we further propose Knowledge-Pool Federated Learning (KPFL), a generic plugin that maintains a shared knowledge pool across both active and idle clients. KPFL leverages dual-age and data-bias weighting, combined with generative knowledge distillation, to mitigate instability and prevent knowledge loss. Extensive experiments demonstrate the significant impact of dynamic participation on FL performance and the effectiveness of KPFL in improving model robustness and generalization.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun
ๆฅ็ๆ่ฆ
Foundation flow-matching (FM) models promise a universal prior for solving inverse problems (IPs), yet today they trail behind domain-specific or even untrained priors. How can we unlock their potential? We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with a sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. This leads to a significant performance boost across image restoration and scientific IPs. Our results point to a path for making foundation FM models practical, reusable priors for IP solving.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nicholas Pellegrino, David Szczecina, Paul Fieguth
ๆฅ็ๆ่ฆ
Methods for detecting label errors in training data require models that are robust to label errors (i.e., not fit to erroneously labelled data points). However, acquiring such models often involves training on corrupted data, which presents a challenge. Adjustments to the loss function present an opportunity for improvement. Motivated by Focal Loss (which emphasizes difficult-to-classify samples), two novel, yet simple, loss functions are proposed that de-weight or ignore these difficult samples (i.e., those likely to have label errors). Results on artificially corrupted data show promise, such that F1 scores for detecting errors are improved from the baselines of conventional categorical Cross Entropy and Focal Loss.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Carlos Boned Riera, David Romero Sanchez, Oriol Ramos Terrades
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In recent years, increasingly large models have achieved outstanding performance across CV tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher-student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.
๐
2025-11-20
๐ AAAI
๐ฅ Sayak Mukherjee, Samrat Chatterjee, Emilie Purvine, Ted Fujimoto, Tegan Emerson
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Designing rewards for autonomous cyber attack and defense learning agents in a complex, dynamic environment is a challenging task for subject matter experts. We propose a large language model (LLM)-based reward design approach to generate autonomous cyber defense policies in a deep reinforcement learning (DRL)-driven experimental simulation environment. Multiple attack and defense agent personas were crafted, reflecting heterogeneity in agent actions, to generate LLM-guided reward designs where the LLM was first provided with contextual cyber simulation environment information. These reward structures were then utilized within a DRL-driven attack-defense simulation environment to learn an ensemble of cyber defense policies. Our results suggest that LLM-guided reward designs can lead to effective defense strategies against diverse adversarial behaviors.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Poushali Sengupta, Yan Zhang, Frank Eliassen, Sabita Maharjan
ๆฅ็ๆ่ฆ
Explainable AI (XAI) is increasingly essential as modern models become more complex and high-stakes applications demand transparency, trust, and regulatory compliance. Existing global attribution methods often incur high computational costs, lack stability under correlated inputs, and fail to scale efficiently to large or heterogeneous datasets. We address these gaps with \emph{ExCIR} (Explainability through Correlation Impact Ratio), a correlation-aware attribution score equipped with a lightweight transfer protocol that reproduces full-model rankings using only a fraction of the data. ExCIR quantifies sign-aligned co-movement between features and model outputs after \emph{robust centering} (subtracting a robust location estimate, e.g., median or mid-mean, from features and outputs). We further introduce \textsc{BlockCIR}, a \emph{groupwise} extension of ExCIR that scores \emph{sets} of correlated features as a single unit. By aggregating the same signed-co-movement numerators and magnitudes over predefined or data-driven groups, \textsc{BlockCIR} mitigates double-counting in collinear clusters (e.g., synonyms or duplicated sensors) and yields smoother, more stable rankings when strong dependencies are present. Across diverse text, tabular, signal, and image datasets, ExCIR shows trustworthy agreement with established global baselines and the full model, delivers consistent top-$k$ rankings across settings, and reduces runtime via lightweight evaluation on a subset of rows. Overall, ExCIR provides \emph{computationally efficient}, \emph{consistent}, and \emph{scalable} explainability for real-world deployment.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Muhammad Sa'ood Shah, Asad Jeewa
Machine Learning
ๆฅ็ๆ่ฆ
Scalarisation functions are widely employed in MORL algorithms to enable intelligent decision-making. However, these functions often struggle to approximate the Pareto front accurately, rendering them unideal in complex, uncertain environments. This study examines selected Multi-Objective Reinforcement Learning (MORL) algorithms across MORL environments with discrete action and observation spaces. We aim to investigate further the limitations associated with scalarisation approaches for decision-making in multi-objective settings. Specifically, we use an outer-loop multi-policy methodology to assess the performance of a seminal single-policy MORL algorithm, MO Q-Learning implemented with linear scalarisation and Chebyshev scalarisation functions. In addition, we explore a pioneering inner-loop multi-policy algorithm, Pareto Q-Learning, which offers a more robust alternative. Our findings reveal that the performance of the scalarisation functions is highly dependent on the environment and the shape of the Pareto front. These functions often fail to retain the solutions uncovered during learning and favour finding solutions in certain regions of the solution space. Moreover, finding the appropriate weight configurations to sample the entire Pareto front is complex, limiting their applicability in uncertain settings. In contrast, inner-loop multi-policy algorithms may provide a more sustainable and generalizable approach and potentially facilitate intelligent decision-making in dynamic and uncertain environments.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ali Murtaza Caunhye, Asad Jeewa
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The field of Offline Reinforcement Learning (RL) aims to derive effective policies from pre-collected datasets without active environment interaction. While traditional offline RL algorithms like Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) have shown promise, they often face challenges in balancing exploration and exploitation, especially in environments with varying reward densities. The recently proposed Decision Transformer (DT) approach, which reframes offline RL as a sequence modelling problem, has demonstrated impressive results across various benchmarks. This paper presents a comparative study evaluating the performance of DT against traditional offline RL algorithms in dense and sparse reward settings for the ANT continous control environment. Our research investigates how these algorithms perform when faced with different reward structures, examining their ability to learn effective policies and generalize across varying levels of feedback. Through empirical analysis in the ANT environment, we found that DTs showed less sensitivity to varying reward density compared to other methods and particularly excelled with medium-expert datasets in sparse reward scenarios. In contrast, traditional value-based methods like IQL showed improved performance in dense reward settings with high-quality data, while CQL offered balanced performance across different data qualities. Additionally, DTs exhibited lower variance in performance but required significantly more computational resources compared to traditional approaches. These findings suggest that sequence modelling approaches may be more suitable for scenarios with uncertain reward structures or mixed-quality data, while value-based methods remain competitive in settings with dense rewards and high-quality demonstrations.
๐
2025-11-20
๐ ArXiv Preprint (quant-ph)
๐ฅ Akshit Pramod Anchan, Ameiy Acharya, Leki Chom Thungon
Machine Learning
ๆฅ็ๆ่ฆ
This paper proposes an optimization of Quantum Key Distribution (QKD) Networks using Graph Neural Networks (GNN) framework. Today, the development of quantum computers threatens the security systems of classical cryptography. Moreover, as QKD networks are designed for protecting secret communication, they suffer from multiple operational difficulties: adaptive to dynamic conditions, optimization for multiple parameters and effective resource utilization. In order to overcome these obstacles, we propose a GNN-based framework which can model QKD networks as dynamic graphs and extracts exploitable characteristics from these networks' structure. The graph contains not only topological information but also specific characteristics associated with quantum communication (the number of edges between nodes, etc). Experimental results demonstrate that the GNN-optimized QKD network achieves a substantial increase in total key rate (from 27.1 Kbits/s to 470 Kbits/s), a reduced average QBER (from 6.6% to 6.0%), and maintains path integrity with a slight reduction in average transmission distance (from 7.13 km to 6.42 km). Furthermore, we analyze network performance across varying scales (10 to 250 nodes), showing improved link prediction accuracy and enhanced key generation rate in medium-sized networks. This work introduces a novel operation mode for QKD networks, shifting the paradigm of network optimization through adaptive and scalable quantum communication systems that enhance security and performance.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Andrew Gomes
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate ``Idiom Heads,'' attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term ``augmented reception.'' We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Joy Lai, Alex Mihailidis
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
People living with dementia (PLwD) often show gradual shifts in how they communicate, becoming less expressive, more repetitive, or drifting off-topic in subtle ways. While caregivers may notice these changes informally, most computational tools are not designed to track such behavioral drift over time. This paper introduces PersonaDrift, a synthetic benchmark designed to evaluate machine learning and statistical methods for detecting progressive changes in daily communication, focusing on user responses to a digital reminder system. PersonaDrift simulates 60-day interaction logs for synthetic users modeled after real PLwD, based on interviews with caregivers. These caregiver-informed personas vary in tone, modality, and communication habits, enabling realistic diversity in behavior. The benchmark focuses on two forms of longitudinal change that caregivers highlighted as particularly salient: flattened sentiment (reduced emotional tone and verbosity) and off-topic replies (semantic drift). These changes are injected progressively at different rates to emulate naturalistic cognitive trajectories, and the framework is designed to be extensible to additional behaviors in future use cases. To explore this novel application space, we evaluate several anomaly detection approaches, unsupervised statistical methods (CUSUM, EWMA, One-Class SVM), sequence models using contextual embeddings (GRU + BERT), and supervised classifiers in both generalized and personalized settings. Preliminary results show that flattened sentiment can often be detected with simple statistical models in users with low baseline variability, while detecting semantic drift requires temporal modeling and personalized baselines. Across both tasks, personalized classifiers consistently outperform generalized ones, highlighting the importance of individual behavioral context.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yihan Li, Nikhil Churamani, Maria Robu, Imanol Luengo, Danail Stoyanov
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses.
Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks.
Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. It produces anatomically coherent predictions, particularly on thin, rare and safety-critical structures.
Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Muhammad Aslanimoghanloo, Ahmed ElGazzar, Marcel van Gerven
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Clinical time series data from electronic health records and medical registries offer unprecedented opportunities to understand patient trajectories and inform medical decision-making. However, leveraging such data presents significant challenges due to irregular sampling, complex latent physiology, and inherent uncertainties in both measurements and disease progression. To address these challenges, we propose a generative modeling framework based on latent neural stochastic differential equations (SDEs) that views clinical time series as discrete-time partial observations of an underlying controlled stochastic dynamical system. Our approach models latent dynamics via neural SDEs with modality-dependent emission models, while performing state estimation and parameter learning through variational inference. This formulation naturally handles irregularly sampled observations, learns complex non-linear interactions, and captures the stochasticity of disease progression and measurement noise within a unified scalable probabilistic framework. We validate the framework on two complementary tasks: (i) individual treatment effect estimation using a simulated pharmacokinetic-pharmacodynamic (PKPD) model of lung cancer, and (ii) probabilistic forecasting of physiological signals using real-world intensive care unit (ICU) data from 12,000 patients. Results show that our framework outperforms ordinary differential equation and long short-term memory baseline models in accuracy and uncertainty estimation. These results highlight its potential for enabling precise, uncertainty-aware predictions to support clinical decision-making.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Seyed Mohamad Moghadas, Bruno Cornelis, Adrian Munteanu
Machine Learning
ๆฅ็ๆ่ฆ
Multivariate time-series (MTS) forecasting is fundamental to applications ranging from urban mobility and resource management to climate modeling. While recent generative models based on denoising diffusion have advanced state-of-the-art performance in capturing complex data distributions, they suffer from significant computational overhead due to iterative stochastic sampling procedures that limit real-time deployment. Moreover, these models can be brittle when handling high-dimensional, non-stationary, and multi-scale periodic patterns characteristic of real-world sensor networks. We introduce FreqFlow, a novel framework that leverages conditional flow matching in the frequency domain for deterministic MTS forecasting. Unlike conventional approaches that operate in the time domain, FreqFlow transforms the forecasting problem into the spectral domain, where it learns to model amplitude and phase shifts through a single complex-valued linear layer. This frequency-domain formulation enables the model to efficiently capture temporal dynamics via complex multiplication, corresponding to scaling and temporal translations. The resulting architecture is exceptionally lightweight with only 89k parameters - an order of magnitude smaller than competing diffusion-based models-while enabling single-pass deterministic sampling through ordinary differential equation (ODE) integration. Our approach decomposes MTS signals into trend, seasonal, and residual components, with the flow matching mechanism specifically designed for residual learning to enhance long-term forecasting accuracy. Extensive experiments on real-world traffic speed, volume, and flow datasets demonstrate that FreqFlow achieves state-of-the-art forecasting performance, on average 7\% RMSE improvements, while being significantly faster and more parameter-efficient than existing methods
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Connor McElroy, Thiago E. A. de Oliveira, Chris Brogly
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. 3 machine learning classifiers and 3 deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yidong Chai, Haoxin Liu, Jiaheng Xie, Chaopeng Wang, Xiao Fang
Machine Learning
ๆฅ็ๆ่ฆ
Wearable sensor technologies and deep learning are transforming healthcare management. Yet, most health sensing studies focus narrowly on physical chronic diseases. This overlooks the critical need for joint assessment of comorbid physical chronic diseases and depression, which is essential for collaborative chronic care. We conceptualize multi-disease assessment, including both physical diseases and depression, as a multi-task learning (MTL) problem, where each disease assessment is modeled as a task. This joint formulation leverages inter-disease relationships to improve accuracy, but it also introduces the challenge of double heterogeneity: chronic diseases differ in their manifestation (disease heterogeneity), and patients with the same disease show varied patterns (patient heterogeneity). To address these issues, we first adopt existing techniques and propose a base method. Given the limitations of the base method, we further propose an Advanced Double Heterogeneity-based Multi-Task Learning (ADH-MTL) method that improves the base method through three innovations: (1) group-level modeling to support new patient predictions, (2) a decomposition strategy to reduce model complexity, and (3) a Bayesian network that explicitly captures dependencies while balancing similarities and differences across model components. Empirical evaluations on real-world wearable sensor data demonstrate that ADH-MTL significantly outperforms existing baselines, and each of its innovations is shown to be effective. This study contributes to health information systems by offering a computational solution for integrated physical and mental healthcare and provides design principles for advancing collaborative chronic disease management across the pre-treatment, treatment, and post-treatment phases.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hrad Ghoukasian, Shahab Asoodeh
Machine Learning
ๆฅ็ๆ่ฆ
We investigate how to optimally design local differential privacy (LDP) mechanisms that reduce data unfairness and thereby improve fairness in downstream classification. We first derive a closed-form optimal mechanism for binary sensitive attributes and then develop a tractable optimization framework that yields the corresponding optimal mechanism for multi-valued attributes. As a theoretical contribution, we establish that for discrimination-accuracy optimal classifiers, reducing data unfairness necessarily leads to lower classification unfairness, thus providing a direct link between privacy-aware pre-processing and classification fairness. Empirically, we demonstrate that our approach consistently outperforms existing LDP mechanisms in reducing data unfairness across diverse datasets and fairness metrics, while maintaining accuracy close to that of non-private models. Moreover, compared with leading pre-processing and post-processing fairness methods, our mechanism achieves a more favorable accuracy-fairness trade-off while simultaneously preserving the privacy of sensitive attributes. Taken together, these results highlight LDP as a principled and effective pre-processing fairness intervention technique.
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ Marcin Kostrzewa, Oleksii Furman, Roman Furman, Sebastian Tomczak, Maciej Ziฤba
Machine Learning
ๆฅ็ๆ่ฆ
Foundation models have shown promise across various financial applications, yet their effectiveness for corporate bankruptcy prediction remains systematically unevaluated against established methods. We study bankruptcy forecasting using Llama-3.3-70B-Instruct and TabPFN, evaluated on large, highly imbalanced datasets of over one million company records from the Visegrรกd Group. We provide the first systematic comparison of foundation models against classical machine learning baselines for this task. Our results show that models such as XGBoost and CatBoost consistently outperform foundation models across all prediction horizons. LLM-based approaches suffer from unreliable probability estimates, undermining their use in risk-sensitive financial settings. TabPFN, while competitive with simpler baselines, requires substantial computational resources with costs not justified by performance gains. These findings suggest that, despite their generality, current foundation models remain less effective than specialized methods for bankruptcy forecasting.
๐
2025-11-20
๐ ArXiv Preprint (cs.AR)
๐ฅ Abdelrahman Helaly, Nourhan Sakr, Kareem Madkour, Ilhami Torunoglu
Machine Learning
ๆฅ็ๆ่ฆ
Multipatterning is an essential decomposition strategy in electronic design automation (EDA) that overcomes lithographic limitations when printing dense circuit layouts. Although heuristic-based backtracking and SAT solvers can address these challenges, they often struggle to simultaneously handle both complex constraints and secondary objectives. In this study, we present a hybrid workflow that casts multipatterning as a variant of a constrained graph coloring problem with the primary objective of minimizing feature violations and a secondary objective of balancing the number of features on each mask. Our pipeline integrates two main components: (1) A GNN-based agent, trained in an unsupervised manner to generate initial color predictions, which are refined by (2) refinement strategies (a GNN-based heuristic and simulated annealing) that together enhance solution quality and balance. Experimental evaluation in both proprietary data sets and publicly available open source layouts demonstrate complete conflict-free decomposition and consistent color balancing. The proposed framework provides a reproducible, data-efficient and deployable baseline for scalable layout decomposition in EDA workflows.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Anna Luiza Gomes da Silva, Diego Kreutz, Angelo Diniz, Rodrigo Mansilha, Celso Nobre da Fonseca
Machine Learning
ๆฅ็ๆ่ฆ
Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain due to instability and the lack of standardization among existing metrics. This work integrates into MalDataGen a Super-Metric that aggregates eight metrics across four fidelity dimensions, producing a single weighted score. Experiments involving ten generative models and five balanced datasets demonstrate that the Super-Metric is more stable and consistent than traditional metrics, exhibiting stronger correlations with the actual performance of classifiers.
๐
2025-11-20
๐ ArXiv Preprint (eess.SP)
๐ฅ Deniz Kasap, Taraneh Aminosharieh Najafi, Jรฉrรดme Paul Rรฉmy Thevenot, Jonathan Dan, Stefano Albini et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We present VersaPants, the first loose-fitting, textile-based capacitive sensing system for lower-body motion capture, built on the open-hardware VersaSens platform. By integrating conductive textile patches and a compact acquisition unit into a pair of pants, the system reconstructs lower-body pose without compromising comfort. Unlike IMU-based systems that require user-specific fitting or camera-based methods that compromise privacy, our approach operates without fitting adjustments and preserves user privacy. VersaPants is a custom-designed smart garment featuring 6 capacitive channels per leg. We employ a lightweight Transformer-based deep learning model that maps capacitance signals to joint angles, enabling embedded implementation on edge platforms. To test our system, we collected approximately 3.7 hours of motion data from 11 participants performing 16 daily and exercise-based movements. The model achieves a mean per-joint position error (MPJPE) of 11.96 cm and a mean per-joint angle error (MPJAE) of 12.3 degrees across the hip, knee, and ankle joints, indicating the model's ability to generalize to unseen users and movements. A comparative analysis of existing textile-based deep learning architectures reveals that our model achieves competitive reconstruction performance with up to 22 times fewer parameters and 18 times fewer FLOPs, enabling real-time inference at 42 FPS on a commercial smartwatch without quantization. These results position VersaPants as a promising step toward scalable, comfortable, and embedded motion-capture solutions for fitness, healthcare, and wellbeing applications.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alan Yufei Dong, Jihao Andreas Lin, Josรฉ Miguel Hernรกndez-Lobato
ๆฅ็ๆ่ฆ
Scalable Gaussian process (GP) inference is essential for sequential decision-making tasks, yet improving GP scalability remains a challenging problem with many open avenues of research. This paper focuses on iterative GPs, where iterative linear solvers, such as conjugate gradients, stochastic gradient descent or alternative projections, are used to approximate the GP posterior. We propose a new method which improves solver convergence of a large linear system by leveraging the known solution to a smaller system contained within. This is significant for tasks with incremental data additions, and we show that our technique achieves speed-ups when solving to tolerance, as well as improved Bayesian optimisation performance under a fixed compute budget.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammad Areeb Qazi, Maryam Nadeem, Mohammad Yaqub
Machine LearningMultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Healthcare requires AI that is predictive, reliable, and data-efficient. However, recent generative models lack physical foundation and temporal reasoning required for clinical decision support. As scaling language models show diminishing returns for grounded clinical reasoning, world models are gaining traction because they learn multimodal, temporally coherent, and action-conditioned representations that reflect the physical and causal structure of care. This paper reviews World Models for healthcare systems that learn predictive dynamics to enable multistep rollouts, counterfactual evaluation and planning. We survey recent work across three domains: (i) medical imaging and diagnostics (e.g., longitudinal tumor simulation, projection-transition modeling, and Joint Embedding Predictive Architecture i.e., JEPA-style predictive representation learning), (ii) disease progression modeling from electronic health records (generative event forecasting at scale), and (iii) robotic surgery and surgical planning (action-conditioned guidance and control). We also introduce a capability rubric: L1 temporal prediction, L2 action-conditioned prediction, L3 counterfactual rollouts for decision support, and L4 planning/control. Most reviewed systems achieve L1--L2, with fewer instances of L3 and rare L4. We identify cross-cutting gaps that limit clinical reliability; under-specified action spaces and safety constraints, weak interventional validation, incomplete multimodal state construction, and limited trajectory-level uncertainty calibration. This review outlines a research agenda for clinically robust prediction-first world models that integrate generative backbones (transformers, diffusion, VAE) with causal/mechanical foundation for safe decision support in healthcare.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hao Shu
ๆฅ็ๆ่ฆ
This work introduces a learning-enhanced observer (LEO) for linear time-invariant systems with uncertain dynamics. Rather than relying solely on nominal models, the proposed framework treats the system matrices as optimizable variables and refines them through gradient-based minimization of a steady-state output discrepancy loss. The resulting data-informed surrogate model enables the construction of an improved observer that effectively compensates for moderate parameter uncertainty while preserving the structure of classical designs. Extensive Monte Carlo studies across diverse system dimensions show systematic and statistically significant reductions, typically exceeding 15\%, in normalized estimation error for both open-loop and Luenberger observers. These results demonstrate that modern learning mechanisms can serve as a powerful complement to traditional observer design, yielding more accurate and robust state estimation in uncertain systems. Codes are available at https://github.com/Hao-B-Shu/LTI_LEO.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Leander Girrbach, Zeynep Akata
ๆฅ็ๆ่ฆ
Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dean Brandner, Sergio Lucia
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Optimal operation of chemical processes is vital for energy, resource, and cost savings in chemical engineering. The problem of optimal operation can be tackled with reinforcement learning, but traditional reinforcement learning methods face challenges due to hard constraints related to quality and safety that must be strictly satisfied, and the large amount of required training data. Chemical processes often cannot provide sufficient experimental data, and while detailed dynamic models can be an alternative, their complexity makes it computationally intractable to generate the needed data. Optimal control methods, such as model predictive control, also struggle with the complexity of the underlying dynamic models. Consequently, many chemical processes rely on manually defined operation recipes combined with simple linear controllers, leading to suboptimal performance and limited flexibility.
In this work, we propose a novel approach that leverages expert knowledge embedded in operation recipes. By using reinforcement learning to optimize the parameters of these recipes and their underlying linear controllers, we achieve an optimized operation recipe. This method requires significantly less data, handles constraints more effectively, and is more interpretable than traditional reinforcement learning methods due to the structured nature of the recipes. We demonstrate the potential of our approach through simulation results of an industrial batch polymerization reactor, showing that it can approach the performance of optimal controllers while addressing the limitations of existing methods.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ William Hao-Cheng Huang
Robotics
ๆฅ็ๆ่ฆ
Linear probes are widely used to interpret and evaluate neural representations, yet their reliability remains unclear, as probes may appear accurate in some regimes but collapse unpredictably in others. We uncover a spectral mechanism behind this phenomenon and formalize it as the Spectral Identifiability Principle (SIP), a verifiable Fisher-inspired condition for probe stability. When the eigengap separating task-relevant directions is larger than the Fisher estimation error, the estimated subspace concentrates and accuracy remains consistent, whereas closing this gap induces instability in a phase-transition manner. Our analysis connects eigengap geometry, sample size, and misclassification risk through finite-sample reasoning, providing an interpretable diagnostic rather than a loose generalization bound. Controlled synthetic studies, where Fisher quantities are computed exactly, confirm these predictions and show how spectral inspection can anticipate unreliable probes before they distort downstream evaluation.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ David Bechtoldt, Sidney Bender
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Machine learning models that operate on graph-structured data, such as molecular graphs or social networks, often make accurate predictions but offer little insight into why certain predictions are made. Counterfactual explanations address this challenge by seeking the closest alternative scenario where the model's prediction would change. Although counterfactual explanations are extensively studied in tabular data and computer vision, the graph domain remains comparatively underexplored. Constructing graph counterfactuals is intrinsically difficult because graphs are discrete and non-euclidean objects. We introduce Graph Diffusion Counterfactual Explanation, a novel framework for generating counterfactual explanations on graph data, combining discrete diffusion models and classifier-free guidance. We empirically demonstrate that our method reliably generates in-distribution as well as minimally structurally different counterfactuals for both discrete classification targets and continuous properties.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yang Xu, Zuliang Yang, Kai Ming Ting
ๆฅ็ๆ่ฆ
Trajectory similarity retrieval is an important part of spatiotemporal data mining, however, existing methods have the following limitations: traditional metrics are computationally expensive, while learning-based methods suffer from substantial training costs and potential instability. This paper addresses these problems by proposing \textbf{Geo}metric \textbf{P}rototype \textbf{T}rajectory \textbf{H}ashing (GeoPTH), a novel, lightweight, and non-learning framework for efficient category-based trajectory retrieval. GeoPTH constructs data-dependent hash functions by using representative trajectory prototypes, i.e., small point sets preserving geometric characteristics, as anchors. The hashing process is efficient, which involves mapping a new trajectory to its closest prototype via a robust, \textit{Hausdorff} metric. Extensive experiments show that GeoPTH's retrieval accuracy is highly competitive with both traditional metrics and state-of-the-art learning methods, and it significantly outperforms binary codes generated through simple binarization of the learned embeddings. Critically, GeoPTH consistently outperforms all competitors in terms of efficiency. Our work demonstrates that a lightweight, prototype-centric approach offers a practical and powerful alternative, achieving an exceptional retrieval performance and computational efficiency.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yang Yu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The ability of Large Language Models (LLMs) to perform complex, multi-step reasoning is a central focus of modern AI research. To evaluate and enhance this capability, the pass@k metric, which measures the probability of obtaining at least one correct solution in k independent samples, has received significant attention. Its intuitive appeal has led to its adoption not only as an evaluation standard but also as a direct optimization objective in reinforcement learning. In this paper, we analyze the pass@k objective, derive its gradient, and demonstrate that it is fundamentally a per-example positive reweighting of the simpler pass@1 objective. Our analysis reveals that the pass@k objective provides a vanishing learning signal in regimes where exploration is most critical. We further analyze the dynamics of "exploration collapse", showing that as the policy concentrates probability mass, the gap between pass@k and pass@1 diminishes. We conclude that while pass@k is a useful diagnostic tool, it may be an unsuitable direct objective for optimization. Instead, mechanisms explicitly encouraging efficient exploration could offer a more effective path forward for reinforcement learning in reasoning tasks.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Saksham Gautam, Lakshmi Mandal, Shalabh Bhatnagar
Machine Learning
ๆฅ็ๆ่ฆ
In this work, we consider the problem of a two-player zero-sum game. In the literature, the successive over-relaxation Q-learning algorithm has been developed and implemented, and it is seen to result in a lower contraction factor for the associated Q-Bellman operator resulting in a faster value iteration-based procedure. However, this has been presented only for the tabular case and not for the setting with function approximation that typically caters to real-world high-dimensional state-action spaces. Furthermore, such settings in the case of two-player zero-sum games have not been considered. We thus propose a deep successive over-relaxation minimax Q-learning algorithm that incorporates deep neural networks as function approximators and is suitable for high-dimensional spaces. We prove the finite-time convergence of the proposed algorithm. Through numerical experiments, we show the effectiveness of the proposed method over the existing Q-learning algorithm. Our ablation studies demonstrate the effect of different values of the crucial successive over-relaxation parameter.
๐
2025-11-20
๐ 6 pages, 3 figures, submitted to IEEE ICC 2026
๐ฅ Victor Croisfelt, Joรฃo Henrique Inacio de Souza, Shashi Raj Pandey, Beatriz Soret, Petar Popovski
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Connected cyber-physical systems perform inference based on real-time inputs from multiple data streams. Uncertain communication delays across data streams challenge the temporal flow of the inference process. State-of-the-art (SotA) non-blocking inference methods rely on a reference-modality paradigm, requiring one modality input to be fully received before processing, while depending on costly offline profiling. We propose a novel, neuro-inspired non-blocking inference paradigm that primarily employs adaptive temporal windows of integration (TWIs) to dynamically adjust to stochastic delay patterns across heterogeneous streams while relaxing the reference-modality requirement. Our communication-delay-aware framework achieves robust real-time inference with finer-grained control over the accuracy-latency tradeoff. Experiments on the audio-visual event localization (AVEL) task demonstrate superior adaptability to network dynamics compared to SotA approaches.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joana Reuss, Ekaterina Gikalo, Marco Kรถrner
Machine Learning
ๆฅ็ๆ่ฆ
Real-world agricultural distributions often suffer from severe class imbalance, typically following a long-tailed distribution. Labeled datasets for crop-type classification are inherently scarce and remain costly to obtain. When working with such limited data, training sets are frequently constructed to be artificially balanced -- in particular in the case of few-shot learning -- failing to reflect real-world conditions. This mismatch induces a shift between training and test label distributions, degrading real-world generalization. To address this, we propose Dirichlet Prior Augmentation (DirPA), a novel method that simulates an unknown label distribution skew of the target domain proactively during model training. Specifically, we model the real-world distribution as Dirichlet-distributed random variables, effectively performing a prior augmentation during few-shot learning. Our experiments show that DirPA successfully shifts the decision boundary and stabilizes the training process by acting as a dynamic feature regularizer.
๐
2025-11-20
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhen Hao Wong, Jingwen Deng, Hao Liang, Runming He, Chengyu Shen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The development of Large Language Models (LLMs) increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets remain costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. At the same time, textbooks and exercise materials contain abundant, high-quality human-authored Question-Answer(QA) content that remains underexploited due to the difficulty of transforming raw PDFs into AI-ready supervision. Although modern OCR and vision-language models can accurately parse document structure, their outputs lack the semantic alignment required for training. We propose an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA/VQA pairs. This approach enables scalable use of real-world educational content and provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training. All code and data-processing pipelines are open-sourced at https://github.com/OpenDCAI/DataFlow.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Farah Alsafadi, Alexandra Akins, Xu Wu
Machine Learning
ๆฅ็ๆ่ฆ
Deep generative modeling provides a powerful pathway to overcome data scarcity in energy-related applications where experimental data are often limited, costly, or difficult to obtain. By learning the underlying probability distribution of the training dataset, deep generative models, such as the diffusion model (DM), can generate high-fidelity synthetic samples that statistically resemble the training data. Such synthetic data generation can significantly enrich the size and diversity of the available training data, and more importantly, improve the robustness of downstream machine learning models in predictive tasks. The objective of this paper is to investigate the effectiveness of DM for overcoming data scarcity in nuclear energy applications. By leveraging a public dataset on critical heat flux (CHF) that cover a wide range of commercial nuclear reactor operational conditions, we developed a DM that can generate an arbitrary amount of synthetic samples for augmenting of the CHF dataset. Since a vanilla DM can only generate samples randomly, we also developed a conditional DM capable of generating targeted CHF data under user-specified thermal-hydraulic conditions. The performance of the DM was evaluated based on their ability to capture empirical feature distributions and pair-wise correlations, as well as to maintain physical consistency. The results showed that both the DM and conditional DM can successfully generate realistic and physics-consistent CHF data. Furthermore, uncertainty quantification was performed to establish confidence in the generated data. The results demonstrated that the conditional DM is highly effective in augmenting CHF data while maintaining acceptable levels of uncertainty.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrea Iommi, Antonio Mastropietro, Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The importance of Synthetic Data Generation (SDG) has increased significantly in domains where data quality is poor or access is limited due to privacy and regulatory constraints. One such domain is recruitment, where publicly available datasets are scarce due to the sensitive nature of information typically found in curricula vitae, such as gender, disability status, or age. %
This lack of accessible, representative data presents a significant obstacle to the development of fair and transparent machine learning models, particularly ranking algorithms that require large volumes of data to effectively learn how to recommend candidates. In the absence of such data, these models are prone to poor generalisation and may fail to perform reliably in real-world scenarios. %
Recent advances in Causal Generative Models (CGMs) offer a promising solution. CGMs enable the generation of synthetic datasets that preserve the underlying causal relationships within the data, providing greater control over fairness and interpretability in the data generation process. %
In this study, we present a specialised SDG method involving two CGMs: one modelling job offers and the other modelling curricula. Each model is structured according to a causal graph informed by domain expertise. We use these models to generate synthetic datasets and evaluate the fairness of candidate rankings under controlled scenarios that introduce specific biases.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Antonios Antoniadis, Ali Shahheidar, Golnoosh Shahkarami, Abolfazl Soltani
ๆฅ็ๆ่ฆ
We study online interval scheduling in the irrevocable setting, where each interval must be immediately accepted or rejected upon arrival. The objective is to maximize the total length of accepted intervals while ensuring that no two accepted intervals overlap. We consider this problem in a learning-augmented setting, where the algorithm has access to (machine-learned) predictions. The goal is to design algorithms that leverage these predictions to improve performance while maintaining robust guarantees in the presence of prediction errors.
Our main contribution is the SemiTrust-and-Switch framework, which provides a unified approach for combining prediction-based and classical interval scheduling algorithms. This framework applies to both deterministic and randomized algorithms and captures the trade-off between consistency (performance under accurate predictions) and robustness (performance under adversarial inputs). Moreover, we provide lower bounds, proving the tightness of this framework in particular settings.
We further design a randomized algorithm that smoothly interpolates between prediction-based and robust algorithms. This algorithm achieves both robustness and smoothness--its performance degrades gracefully with the quality of the prediction.
๐
2025-11-20
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Andrea Venturi, Imanol Jerico-Yoldi, Francesco Zola, Raul Orduna
Machine Learning
ๆฅ็ๆ่ฆ
As Law Enforcement Agencies advance in cryptocurrency forensics, criminal actors aiming to conceal illicit fund movements increasingly turn to "mixin" services or privacy-based cryptocurrencies. Monero stands out as a leading choice due to its strong privacy preserving and untraceability properties, making conventional blockchain analysis ineffective. Understanding the behavior and operational patterns of criminal actors within Monero is therefore challenging and it is essential to support future investigative strategies and disrupt illicit activities. In this work, we propose a case study in which we leverage a novel graph-based methodology to extract structural and temporal patterns from Monero transactions linked to already discovered criminal activities. By building Address-Ring-Transaction graphs from flagged transactions, we extract structural and temporal features and use them to train Machine Learning models capable of detecting similar behavioral patterns that could highlight criminal modus operandi. This represents a first partial step toward developing analytical tools that support investigative efforts in privacy-preserving blockchain ecosystems
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaotong Zhan, Xi Cheng
ๆฅ็ๆ่ฆ
Rumor detection on social media remains a challenging task due to the complex propagation dynamics and the limited interpretability of existing models. While recent neural architectures capture content and structural features, they often fail to reveal the underlying causal mechanisms of misinformation spread. We propose CausalMamba, a novel framework that integrates Mamba-based sequence modeling, graph convolutional networks (GCNs), and differentiable causal discovery via NOTEARS. CausalMamba learns joint representations of temporal tweet sequences and reply structures, while uncovering latent causal graphs to identify influential nodes within each propagation chain. Experiments on the Twitter15 dataset show that our model achieves competitive classification performance compared to strong baselines, and uniquely enables counterfactual intervention analysis. Qualitative results demonstrate that removing top-ranked causal nodes significantly alters graph connectivity, offering interpretable insights into rumor dynamics. Our framework provides a unified approach for rumor classification and influence analysis, paving the way for more explainable and actionable misinformation detection systems.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Eloi Lindas, Yannig Goude, Philippe Ciais
ๆฅ็ๆ่ฆ
Accurate and reliable wind power forecasts are crucial for grid stability, balancing supply and demand, and market risk management. Even though short-term weather forecasts have been thoroughly used to provide short-term renewable power predictions, forecasts involving longer prediction horizons still need investigations. Despite the recent progress in subseasonal-to-seasonal weather probabilistic forecasting, their use for wind power prediction usually involves both temporal and spatial aggregation achieve reasonable skill. In this study, we present a forecasting pipeline enabling to transform ECMWF subseasonal-to-seasonal weather forecasts into wind power forecasts for lead times ranging from 1 day to 46 days at daily resolution. This framework also include post-processing of the resulting power ensembles to account for the biases and lack of dispersion of the weather forecasts. We show that our method is able to outperform a climatological baseline by 50 % in terms of both Continuous Ranked Probability Skill Score and Ensemble Mean Squared Error while also providing near perfect calibration of the forecasts for lead times ranging from 15 to 46 days.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Lara Bergmann, Cedric Grothues, Klaus Neumann
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Magnetic levitation is about to revolutionize in-machine material flow in industrial automation. Such systems are flexibly configurable and can include a large number of independently actuated shuttles (movers) that dynamically rebalance production capacity. Beyond their capabilities for dynamic transportation, these systems possess the inherent yet unexploited potential to perform manipulation. By merging the fields of transportation and manipulation into a coordinated swarm of magnetic robots (MagBots), we enable manufacturing systems to achieve significantly higher efficiency, adaptability, and compactness. To support the development of intelligent algorithms for magnetic levitation systems, we introduce MagBotSim (Magnetic Robotics Simulation): a physics-based simulation for magnetic levitation systems. By framing magnetic levitation systems as robot swarms and providing a dedicated simulation, this work lays the foundation for next generation manufacturing systems powered by Magnetic Robotics. MagBotSim's documentation, videos, experiments, and code are available at: https://ubi-coro.github.io/MagBotSim/
๐
2025-11-20
๐ ArXiv Preprint (quant-ph)
๐ฅ Ariel Neufeld, Philipp Schmocker, Viet Khoa Tran
Machine Learning
ๆฅ็ๆ่ฆ
Quantum neural networks (QNNs) are an analog of classical neural networks in the world of quantum computing, which are represented by a unitary matrix with trainable parameters. Inspired by the universal approximation property of classical neural networks, ensuring that every continuous function can be arbitrarily well approximated uniformly on a compact set of a Euclidean space, some recent works have established analogous results for QNNs, ranging from single-qubit to multi-qubit QNNs, and even hybrid classical-quantum models. In this paper, we study the approximation capabilities of QNNs for periodic functions with respect to the supremum norm. We use the Jackson inequality to approximate a given function by implementing its approximating trigonometric polynomial via a suitable QNN. In particular, we see that by restricting to the class of periodic functions, one can achieve a quadratic reduction of the number of parameters, producing better approximation results than in the literature. Moreover, the smoother the function, the fewer parameters are needed to construct a QNN to approximate the function.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Perceval Beja-Battais, Alain Grossetรชte, Nicolas Vayatis
Robotics
ๆฅ็ๆ่ฆ
In recent years, there has been an increasing need for Nuclear Power Plants (NPPs) to improve flexibility in order to match the rapid growth of renewable energies. The Operator Assistance Predictive System (OAPS) developed by Framatome addresses this problem through Model Predictive Control (MPC). In this work, we aim to improve MPC methods through data-driven simulation schemes. Thus, from a set of nonlinear stiff ordinary differential equations (ODEs), this paper introduces two surrogate models acting as alternative simulation schemes to enhance nuclear reactor core simulation. We show that both data-driven and physics-informed models can rapidly integrate complex dynamics, with a very low computational time (up to 1000x time reduction).
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhijie Zhong, Zhiwen Yu, Kaixiang Yang, C. L. Philip Chen
ๆฅ็ๆ่ฆ
Time series anomaly detection (TSAD) is a critical data mining task often constrained by label scarcity. Consequently, current research predominantly focuses on Unsupervised Time-series Anomaly Detection (UTAD), relying on complex architectures to model normal data distributions. However, this approach often overlooks the significant performance gains available from limited anomaly labels achievable in practical scenarios. This paper challenges the premise that architectural complexity is the optimal path for TSAD. We conduct the first methodical comparison between supervised and unsupervised paradigms and introduce STAND, a streamlined supervised baseline. Extensive experiments on five public datasets demonstrate that: (1) Labels matter more than models: under a limited labeling budget, simple supervised models significantly outperform complex state-of-the-art unsupervised methods; (2) Supervision yields higher returns: the performance gain from minimal supervision far exceeds that from architectural innovations; and (3) Practicality: STAND exhibits superior prediction consistency and anomaly localization compared to unsupervised counterparts. These findings advocate for a data-centric shift in TSAD research, emphasizing label utilization over purely algorithmic complexity. The code is publicly available at https://github.com/EmorZz1G/STAND.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Paula Joy B. Martinez, Jose Marie Antonio Miรฑoza, Sebastian C. Ibaรฑez
ๆฅ็ๆ่ฆ
Emotion recognition from social media is critical for understanding public sentiment, but accessing training data has become prohibitively expensive due to escalating API costs and platform restrictions. We introduce an interpretability-guided framework where Shapley Additive Explanations (SHAP) provide principled guidance for LLM-based synthetic data generation. With sufficient seed data, SHAP-guided approach matches real data performance, significantly outperforms naรฏve generation, and substantially improves classification for underrepresented emotion classes. However, our linguistic analysis reveals that synthetic text exhibits reduced vocabulary richness and fewer personal or temporally complex expressions than authentic posts. This work provides both a practical framework for responsible synthetic data generation and a critical perspective on its limitations, underscoring that the future of trustworthy AI depends on navigating the trade-offs between synthetic utility and real-world authenticity.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Feiyue Zhao, Yangfan He, Zhichao Zhang
Robotics
ๆฅ็ๆ่ฆ
Graph spectral representations are fundamental in graph signal processing, offering a rigorous framework for analyzing and processing graph-structured data. The graph fractional Fourier transform (GFRFT) extends the classical graph Fourier transform (GFT) with a fractional-order parameter, enabling flexible spectral analysis while preserving mathematical consistency. The angular graph Fourier transform (AGFT) introduces angular control via GFT eigenvector rotation; however, existing constructions fail to degenerate to the GFT at zero angle, which is a critical flaw that undermines theoretical consistency and interpretability. To resolve these complementary limitations - GFRFT's lack of angular regulation and AGFT's defective degeneracy - this study proposes an angular GFRFT (AGFRFT), a unified framework that integrates fractional-order and angular spectral analyses with theoretical rigor. A degeneracy-friendly rotation matrix family ensures exact GFT degeneration at zero angle, with two AGFRFT variants (I-AGFRFT and II-AGFRFT) defined accordingly. Rigorous theoretical analyses confirm their unitarity, invertibility, and smooth parameter dependence. Both support learnable joint parameterization of the angle and fractional order, enabling adaptive spectral processing for diverse graph signals. Extensive experiments on real-world data denoising, image denoising, and point cloud denoising demonstrate that AGFRFT outperforms GFRFT and AGFT in terms of spectral concentration, reconstruction quality, and controllable spectral manipulation, establishing a robust and flexible tool for integrated angular fractional spectral analysis in graph signal processing.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuanbo Tang, Yan Tang, Zixuan Zhang, Zihui Zhao, Yang Li
Machine Learning
ๆฅ็ๆ่ฆ
Trajectory generation has recently drawn growing interest in privacy-preserving urban mobility studies and location-based service applications. Although many studies have used deep learning or generative AI methods to model trajectories and have achieved promising results, the robustness and interpretability of such models are largely unexplored. This limits the application of trajectory generation algorithms on noisy real-world data and their trustworthiness in downstream tasks. To address this issue, we exploit the regular structure in urban trajectories and propose a deep generative model based on the pathlet representation, which encode trajectories with binary vectors associated with a learned dictionary of trajectory segments. Specifically, we introduce a probabilistic graphical model to describe the trajectory generation process, which includes a Variational Autoencoder (VAE) component and a linear decoder component. During training, the model can simultaneously learn the latent embedding of pathlet representations and the pathlet dictionary that captures mobility patterns in the trajectory dataset. The conditional version of our model can also be used to generate customized trajectories based on temporal and spatial constraints.
Our model can effectively learn data distribution even using noisy data, achieving relative improvements of $35.4\%$ and $26.3\%$ over strong baselines on two real-world trajectory datasets. Moreover, the generated trajectories can be conveniently utilized for multiple downstream tasks, including trajectory prediction and data denoising. Lastly, the framework design offers a significant efficiency advantage, saving $64.8\%$ of the time and $56.5\%$ of GPU memory compared to previous approaches.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huseyin Goksu
Machine Learning
ๆฅ็ๆ่ฆ
Spectral Graph Neural Networks offer a principled approach to graph filtering but face a fundamental "Stability-vs-Adaptivity" trade-off. This trade-off is dictated by the choice of spectral domain. Filters in the finite [-1, 1] domain (e.g., ChebyNet) are numerically stable at high polynomial degrees (K) but are static and low-pass, causing them to fail on heterophilic graphs. Conversely, filters in the semi-infinite [0, infty) domain (e.g., KrawtchoukNet) are highly adaptive and achieve SOTA results on heterophily by learning non-low-pass responses. However, as we demonstrate, these adaptive filters can also suffer from numerical instability, leading to catastrophic performance collapse at high K. In this paper, we propose to resolve this trade-off by designing a hybrid-domain GNN, HybSpecNet, which combines a stable `ChebyNet` branch with an adaptive `KrawtchoukNet` branch. We first demonstrate that a "naive" hybrid architecture, which fuses the branches via concatenation, successfully unifies performance at low K, achieving strong results on both homophilic and heterophilic benchmarks. However, we then prove that this naive architecture fails the stability test. Our K-ablation experiments show that this architecture catastrophically collapses at K=25, exactly mirroring the collapse of its unstable `KrawtchoukNet` branch. We identify this critical finding as "Instability Poisoning," where `NaN`/`Inf` gradients from the adaptive branch destroy the training of the model. Finally, we propose and validate an advanced architecture that uses "Late Fusion" to completely isolate the gradient pathways. We demonstrate that this successfully solves the instability problem, remaining perfectly stable up to K=30 while retaining its SOTA performance across all graph types. This work identifies a critical architectural pitfall in hybrid GNN design and provides the robust architectural solution.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haohui Chen, Zhiyong Chen, Aoxiang Liu, Wentuo Fang
Robotics
ๆฅ็ๆ่ฆ
Deterministic policy gradient algorithms for continuous control suffer from value estimation biases that degrade performance. While double critics reduce such biases, the exploration potential of double actors remains underexplored. Building on temporal-difference error-driven regularization (TDDR), a double actor-critic framework, this work introduces enhanced methods to achieve flexible bias control and stronger representation learning. We propose three convex combination strategies, symmetric and asymmetric, that balance pessimistic estimates to mitigate overestimation and optimistic exploration via double actors to alleviate underestimation. A single hyperparameter governs this mechanism, enabling tunable control across the bias spectrum. To further improve performance, we integrate augmented state and action representations into the actor and critic networks. Extensive experiments show that our approach consistently outperforms benchmarks, demonstrating the value of tunable bias and revealing that both overestimation and underestimation can be exploited differently depending on the environment.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vincent Fan, Regina Barzilay
Machine Learning
ๆฅ็ๆ่ฆ
The performance of machine learning models in drug discovery is highly dependent on the quality and consistency of the underlying training data. Due to limitations in dataset sizes, many models are trained by aggregating bioactivity data from diverse sources, including public databases such as ChEMBL. However, this approach often introduces significant noise due to variability in experimental protocols. We introduce AssayMatch, a framework for data selection that builds smaller, more homogenous training sets attuned to the test set of interest. AssayMatch leverages data attribution methods to quantify the contribution of each training assay to model performance. These attribution scores are used to finetune language embeddings of text-based assay descriptions to capture not just semantic similarity, but also the compatibility between assays. Unlike existing data attribution methods, our approach enables data selection for a test set with unknown labels, mirroring real-world drug discovery campaigns where the activities of candidate molecules are not known in advance. At test time, embeddings finetuned with AssayMatch are used to rank all available training data. We demonstrate that models trained on data selected by AssayMatch are able to surpass the performance of the model trained on the complete dataset, highlighting its ability to effectively filter out harmful or noisy experiments. We perform experiments on two common machine learning architectures and see increased prediction capability over a strong language-only baseline for 9/12 model-target pairs. AssayMatch provides a data-driven mechanism to curate higher-quality datasets, reducing noise from incompatible experiments and improving the predictive power and data efficiency of models for drug discovery. AssayMatch is available at https://github.com/Ozymandias314/AssayMatch.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huseyin Goksu
ๆฅ็ๆ่ฆ
Spectral GNNs, like ChebyNet, are limited by heterophily and over-smoothing due to their static, low-pass filter design. This work investigates the "Adaptive Orthogonal Polynomial Filter" (AOPF) class as a solution. We introduce two models operating in the [-1, 1] domain: 1) `L-JacobiNet`, the adaptive generalization of `ChebyNet` with learnable alpha, beta shape parameters, and 2) `S-JacobiNet`, a novel baseline representing a LayerNorm-stabilized static `ChebyNet`. Our analysis, comparing these models against AOPFs in the [0, infty) domain (e.g., `LaguerreNet`), reveals critical, previously unknown trade-offs. We find that the [0, infty) domain is superior for modeling heterophily, while the [-1, 1] domain (Jacobi) provides superior numerical stability at high K (K>20). Most significantly, we discover that `ChebyNet`'s main flaw is stabilization, not its static nature. Our static `S-JacobiNet` (ChebyNet+LayerNorm) outperforms the adaptive `L-JacobiNet` on 4 out of 5 benchmark datasets, identifying `S-JacobiNet` as a powerful, overlooked baseline and suggesting that adaptation in the [-1, 1] domain can lead to overfitting.
๐
2025-11-20
๐ ArXiv Preprint (cs.PL)
๐ฅ Sungbin Moon, Jiho Park, Suyoung Hwang, Donghyun Koh, Seunghyun Moon et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Modern data processing workflows frequently encounter ragged data: collections with variable-length elements that arise naturally in domains like natural language processing, scientific measurements, and autonomous AI agents. Existing workflow engines lack native support for tracking the shapes and dependencies inherent to ragged data, forcing users to manage complex indexing and dependency bookkeeping manually. We present Operon, a Rust-based workflow engine that addresses these challenges through a novel formalism of named dimensions with explicit dependency relations. Operon provides a domain-specific language where users declare pipelines with dimension annotations that are statically verified for correctness, while the runtime system dynamically schedules tasks as data shapes are incrementally discovered during execution. We formalize the mathematical foundation for reasoning about partial shapes and prove that Operon's incremental construction algorithm guarantees deterministic and confluent execution in parallel settings. The system's explicit modeling of partially-known states enables robust persistence and recovery mechanisms, while its per-task multi-queue architecture achieves efficient parallelism across heterogeneous task types. Empirical evaluation demonstrates that Operon outperforms an existing workflow engine with 14.94x baseline overhead reduction while maintaining near-linear end-to-end output rates as workloads scale, making it particularly suitable for large-scale data generation pipelines in machine learning applications.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shreyansh Jain, Madhav Singhvi, Shreya Rahul Jain, Pranav S, Dishaa Lokesh et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Conventional Applicant Tracking Systems (ATS) tend to be inflexible keyword-matchers, and deny gifted candidates a role due to a few minor semantic mismatches. This article describes a new two-step process to design a more refined resume evaluation model based on a small language model (<600M parameters) that is finetuned using GRPO on a custom reward function. To begin with, Supervised Fine-Tuning (SFT) was used to build a solid baseline model. Second, this SFT model was also optimized with the help of Reinforcement Learning (RL) through GRPO under the guidance of a new, multi-component reward function that can holistically assess candidates beyond simple keyword matching. We indicate that the RL application presents a critical problem of reward hacking due to the initial experiments of aggressive penalties, which produces faulty, excessively negative model behaviors. We have overcome this challenge by refining the reward function repeatedly and training hyperparameters into a stable "gentle polishing process" of the reward function. Our resulting GRPO-polished model demonstrates significant real-world efficacy, achieving a final accuracy of 91% on unseen test data. The model shows a strong ability to correctly identify qualified candidates (recall of 0.85 for the 'SELECTED' class) while also showing exceptional precision (1.0), confirming its reliability. These results indicate that a properly executed, two-step fine-tuning procedure can indeed effectively refine a small language model to be able to conduct fine-tuned and human-like candidate scoring, overcoming the drawbacks of both traditional ATS and naive RL usage.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junchao Zhou, Junkang Liu, Fanhua Shang
Robotics
ๆฅ็ๆ่ฆ
Federated Learning with Low-Rank Adaptation (LoRA) faces three critical challenges under client heterogeneity: (1) Initialization-Induced Instability due to random initialization misaligning client subspaces; (2) Rank Incompatibility and Aggregation Error when averaging LoRA parameters of different ranks, which biases the global model; and (3) exacerbated Client Drift under Non-IID Data, impairing generalization. To address these challenges, we propose ILoRA, a unified framework that integrates three core innovations: a QR-based orthonormal initialization to ensure all clients start in a coherent subspace; a Concatenated QR Aggregation mechanism that fuses heterogeneous-rank updates via concatenation and decomposition, preserving information while maintaining dimension alignment; and an AdamW optimizer with rank-aware control variates to correct local updates and mitigate client drift. Supported by theoretical convergence guarantees, extensive experiments on vision and NLP benchmarks demonstrate that ILoRA consistently achieves superior accuracy and convergence stability compared to existing federated LoRA methods.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yoonhyuk Choi, Chong-Kwon Kim
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) excel on homophilous graphs but often fail under heterophily due to self-reinforcing and phase-inconsistent signals. We propose a Gauge-Equivariant Graph Network with Self-Interference Cancellation (GESC), which replaces additive aggregation with a projection-based interference mechanism. Unlike prior magnetic or gauge-equivariant GNNs that typically focus on phase handling in spectral filtering while largely relying on scalar weighting, GESC introduces a $\mathrm{U}(1)$ phase connection followed by a rank-1 projection that attenuates self-parallel components before attention. A sign- and phase-aware gate further regulates neighbor influence, attenuating components aligned with current node states and acting as a local notch on low-frequency modes. Across diverse graph benchmarks, our method consistently outperforms recent state-of-the-art models while offering a unified, interference-aware view of message passing. Our code is available at \href{here}{https://anonymous.4open.science/r/GESC-1B22}.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alex Ning, Vainateya Rangaraju
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Structured pruning removes entire neurons or channels, but its effectiveness depends on how importance is distributed across the representation space. Change-of-basis (CoB) pruning addresses this challenge by applying orthogonal linear transformations that concentrate importance within certain dimensions. However, many standard deep learning architectures are not inherently invariant to such transformations. To enable compatibility, we introduce two-subspace radial activations (TSRAs): an activation family that is invariant to orthogonal linear transformations applied independently within its two activation subspaces. This invariance allows CoB transformations to be merged into surrounding weights without incurring extra parameters. We position this work as a proof-of-concept that a rotationally invariant design may offer a principled approach towards change-of-basis pruning. We do not provide an analysis of multiple TSRA candidates nor do we explore weight initialization for any TSRAs. These limitations, combined with other necessary modifications we make to permit rotational invariance, result in a slight accuracy drop of $4.52\%$ compared to a ReLU-based control. However, using activation-magnitude importance, VGG-16 implementing our CoB+TSRA framework shows encouraging results on CIFAR-10. Under fixed-ratio structured pruning, CoB improves accuracy over a TSRA baseline at all pruning ratios and extends reliable pruning frontier from roughly $30\%$ to $70\%$ of parameters without post-prune fine tuning. Under threshold-based pruning strategies, CoB prunes $90-96\%$ of parameters while maintaining $1-6\%$ accuracy drop after fine-tuning. Together, these results indicate that rotationally invariant architectures may offer a promising path towards CoB pruning.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human-curated data, limiting scalability and tethering AI to human knowledge. Existing self-evolution frameworks offer an alternative but are typically restricted by the model's inherent capabilities and single-round interactions, hindering the development of complex curricula involving tool use or dynamic reasoning. We introduce Agent0, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration. Agent0 establishes a symbiotic competition between two agents initialized from the same base LLM: a curriculum agent that proposes increasingly challenging frontier tasks, and an executor agent that learns to solve them. We integrate external tools to enhance the executor's problem-solving capacity; this improvement, in turn, pressures the curriculum agent to construct more complex, tool-aware tasks. Through this iterative process, Agent0 establishes a self-reinforcing cycle that continuously produces high-quality curricula. Empirically, Agent0 substantially boosts reasoning capabilities, improving the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks. Code is available at https://github.com/aiming-lab/Agent0.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yang Wu, Yifan Zhang, Zhenxing Liang, Jian Cheng
Machine Learning
ๆฅ็ๆ่ฆ
Two-stage Stochastic Programming (2SP) is a standard framework for modeling decision-making problems under uncertainty. While numerous methods exist, solving such problems with many scenarios remains challenging. Selecting representative scenarios is a practical method for accelerating solutions. However, current approaches typically rely on clustering or Monte Carlo sampling, failing to integrate scenario information deeply and overlooking the significant impact of the scenario order on solving time. To address these issues, we develop HGCN2SP, a novel model with a hierarchical graph designed for 2SP problems, encoding each scenario and modeling their relationships hierarchically. The model is trained in a reinforcement learning paradigm to utilize the feedback of the solver. The policy network is equipped with a hierarchical graph convolutional network for feature encoding and an attention-based decoder for scenario selection in proper order. Evaluation of two classic 2SP problems demonstrates that HGCN2SP provides high-quality decisions in a short computational time. Furthermore, HGCN2SP exhibits remarkable generalization capabilities in handling large-scale instances, even with a substantial number of variables or scenarios that were unseen during the training phase.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Abdallah Salem, Hamdy Ahmed Ashur, Ahmed Elshinnawy
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers' health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material's surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30 % on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Juncheng Dong, Yiling Liu, Ahmed Aloui, Vahid Tarokh, David Carlson
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have recently demonstrated impressive capabilities across a range of reasoning and generation tasks. However, research studies have shown that LLMs lack the ability to identify causal relationships, a fundamental cornerstone of human intelligence. We first conduct an exploratory investigation of LLMs' behavior when asked to perform a causal-discovery task and find that they mostly rely on the semantic meaning of variable names, ignoring the observation data. This is unsurprising, given that LLMs were never trained to process structural datasets. To first tackle this challenge, we prompt the LLMs with the outputs of established causal discovery algorithms designed for observational datasets. These algorithm outputs effectively serve as the sufficient statistics of the observation data. However, quite surprisingly, we find that prompting the LLMs with these sufficient statistics decreases the LLMs' performance in causal discovery. To address this current limitation, we propose CARE, a framework that enhances LLMs' causal-reasoning ability by teaching them to effectively utilize the outputs of established causal-discovery algorithms through supervised fine-tuning. Experimental results show that a finetuned Qwen2.5-1.5B model produced by CARE significantly outperforms both traditional causal-discovery algorithms and state-of-the-art LLMs with over a thousand times more parameters, demonstrating effective utilization of its own knowledge and the external algorithmic clues.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shuo Wang, Mengfan Teng, Yun Cheng, Lothar Thiele, Olga Saukh et al.
Machine Learning
ๆฅ็ๆ่ฆ
High-resolution mapping of fine particulate matter (PM2.5) is a cornerstone of sustainable urbanism but remains critically hindered by the spatial sparsity of ground monitoring networks. While traditional data-driven methods attempt to bridge this gap using satellite Aerosol Optical Depth (AOD), they often suffer from severe, non-random data missingness (e.g., due to cloud cover or nighttime) and inversion biases. To overcome these limitations, this study proposes the Spatiotemporal Physics-Guided Inference Network (SPIN), a novel framework designed for inductive spatiotemporal kriging. Unlike conventional approaches, SPIN synergistically integrates domain knowledge into deep learning by explicitly modeling physical advection and diffusion processes via parallel graph kernels. Crucially, we introduce a paradigm-shifting training strategy: rather than using error-prone AOD as a direct input, we repurpose it as a spatial gradient constraint within the loss function. This allows the model to learn structural pollution patterns from satellite data while remaining robust to data voids. Validated in the highly polluted Beijing-Tianjin-Hebei and Surrounding Areas (BTHSA), SPIN achieves a new state-of-the-art with a Mean Absolute Error (MAE) of 9.52 ug/m^3, effectively generating continuous, physically plausible pollution fields even in unmonitored areas. This work provides a robust, low-cost, and all-weather solution for fine-grained environmental management.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yiling Liu, Juncheng Dong, Chen Fu, Wei Shi, Ziyang Jiang et al.
ๆฅ็ๆ่ฆ
Estimating counterfactual outcomes from time-series observations is crucial for effective decision-making, e.g. when to administer a life-saving treatment, yet remains significantly challenging because (i) the counterfactual trajectory is never observed and (ii) confounders evolve with time and distort estimation at every step. To address these challenges, we propose a novel framework that synergistically integrates two complementary approaches: Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM). Instead of the coarse practice of aligning marginal distributions of the treatments in latent space, SGA uses iterative treatment-agnostic clustering to identify fine-grained sub-treatment groups. Aligning these fine-grained groups achieves improved distributional matching, thus leading to more effective deconfounding. We theoretically demonstrate that SGA optimizes a tighter upper bound on counterfactual risk and empirically verify its deconfounding efficacy. RTM promotes temporal generalization by randomly replacing input covariates with Gaussian noises during training. This encourages the model to rely less on potentially noisy or spuriously correlated covariates at the current step and more on stable historical patterns, thereby improving its ability to generalize across time and better preserve underlying causal relationships. Our experiments demonstrate that while applying SGA and RTM individually improves counterfactual outcome estimation, their synergistic combination consistently achieves state-of-the-art performance. This success comes from their distinct yet complementary roles: RTM enhances temporal generalization and robustness across time steps, while SGA improves deconfounding at each specific time point.
๐
2025-11-20
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Osama Zafar, Rosemarie Santa Gonzรกlez, Alfonso Morales, Erman Ayday
ๆฅ็ๆ่ฆ
Digital agriculture is transforming the way we grow food by utilizing technology to make farming more efficient, sustainable, and productive. This modern approach to agriculture generates a wealth of valuable data that could help address global food challenges, but farmers are hesitant to share it due to privacy concerns. This limits the extent to which researchers can learn from this data to inform improvements in farming. This paper presents the Digital Agriculture Sandbox, a secure online platform that solves this problem. The platform enables farmers (with limited technical resources) and researchers to collaborate on analyzing farm data without exposing private information. We employ specialized techniques such as federated learning, differential privacy, and data analysis methods to safeguard the data while maintaining its utility for research purposes. The system enables farmers to identify similar farmers in a simplified manner without needing extensive technical knowledge or access to computational resources. Similarly, it enables researchers to learn from the data and build helpful tools without the sensitive information ever leaving the farmer's system. This creates a safe space where farmers feel comfortable sharing data, allowing researchers to make important discoveries. Our platform helps bridge the gap between maintaining farm data privacy and utilizing that data to address critical food and farming challenges worldwide.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dawei Li, Zijian Gu, Peng Wang, Chuhan Song, Zhen Tan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Siqiao Mu, Diego Klabjan
ๆฅ็ๆ่ฆ
Machine unlearning algorithms aim to remove the impact of selected training data from a model without the computational expenses of retraining from scratch. Two such algorithms are ``Descent-to-Delete" (D2D) and ``Rewind-to-Delete" (R2D), full-batch gradient descent algorithms that are easy to implement and satisfy provable unlearning guarantees. In particular, the stochastic version of D2D is widely implemented as the ``finetuning" unlearning baseline, despite lacking theoretical backing on nonconvex functions. In this work, we prove $(ฮต, ฮด)$ certified unlearning guarantees for stochastic R2D and D2D for strongly convex, convex, and nonconvex loss functions, by analyzing unlearning through the lens of disturbed or biased gradient systems, which may be contracting, semi-contracting, or expansive respectively. Our argument relies on optimally coupling the random behavior of the unlearning and retraining trajectories, resulting in a probabilistic sensitivity bound that can be combined with a novel relaxed Gaussian mechanism to achieve $(ฮต, ฮด)$ unlearning. We determine that D2D can yield tighter guarantees for strongly convex functions compared to R2D by relying on contraction to a unique global minimum. However, unlike D2D, R2D can achieve unlearning in the convex and nonconvex setting because it draws the unlearned model closer to the retrained model by reversing the accumulated disturbances.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chukwunonso Henry Nwokoye, Blessing Oluchi, Sharna Waldron, Peace Ezzeh
Machine Learning
ๆฅ็ๆ่ฆ
The lack of epidemiological data in wireless sensor networks (WSNs) is a fundamental difficulty in constructing robust models to forecast and mitigate threats such as viruses and worms. Many studies have examined different epidemic models for WSNs, focusing on how malware infections spread given the network's specific properties, including energy limits and node mobility. In this study, an agent-based implementation of the susceptible-exposed-infected-recovered-vaccinated (SEIRV) mathematical model was employed for machine learning (ML) predictions. Using tools such as NetLogo's BehaviorSpace and Python, two epidemic synthetic datasets were generated and prepared for the application of several ML algorithms. Posed as a regression problem, the infected and recovered nodes were predicted, and the performance of these algorithms is compared using the error metrics of the train and test sets. The predictions performed well, with low error metrics and high R^2 values (0.997, 1.000, 0.999, 1.000), indicating an effective fit to the training set. The validation values were lower (0.992, 0.998, 0.971, and 0.999), as is typical when evaluating model performance on unseen data. Based on the recorded performances, support vector, linear, Lasso, Ridge, and ElasticNet regression were among the worst-performing algorithms, while Random Forest, XGBoost, Decision Trees, and k-nearest neighbors achieved the best results.
๐
2025-11-20
๐ AAAI 2026
๐ฅ Daniel Mas Montserrat, Ray Verma, Mรญriam Barrabรฉs, Francisco M. de la Vega, Carlos D. Bustamante et al.
ๆฅ็ๆ่ฆ
Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.
๐
2025-11-20
๐ ArXiv Preprint (quant-ph)
๐ฅ Su Yeon Chang, M. Cerezo
Machine Learning
ๆฅ็ๆ่ฆ
Quantum machine learning (QML) is a computational paradigm that seeks to apply quantum-mechanical resources to solve learning problems. As such, the goal of this framework is to leverage quantum processors to tackle optimization, supervised, unsupervised and reinforcement learning, and generative modeling-among other tasks-more efficiently than classical models. Here we offer a high level overview of QML, focusing on settings where the quantum device is the primary learning or data generating unit. We outline the field's tensions between practicality and guarantees, access models and speedups, and classical baselines and claimed quantum advantages-flagging where evidence is strong, where it is conditional or still lacking, and where open questions remain. By shedding light on these nuances and debates, we aim to provide a friendly map of the QML landscape so that the reader can judge when-and under what assumptions-quantum approaches may offer real benefits.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ Luning Sun, Josรฉ L. Safanelli, Jonathan Sanderman, Katerina Georgiou, Colby Brungard et al.
Machine Learning
ๆฅ็ๆ่ฆ
We propose a self-supervised machine learning (SSML) framework for multi-fidelity learning and extended predictive soil spectroscopy based on latent space embeddings. A self-supervised representation was pretrained with the large MIR spectral library and the Variational Autoencoder algorithm to obtain a compressed latent space for generating spectral embeddings. At this stage, only unlabeled spectral data were used, allowing us to leverage the full spectral database and the availability of scan repeats for augmented training. We also leveraged and froze the trained MIR decoder for a spectrum conversion task by plugging it into a NIR encoder to learn the mapping between NIR and MIR spectra in an attempt to leverage the predictive capabilities contained in the large MIR library with a low cost portable NIR scanner. This was achieved by using a smaller subset of the KSSL library with paired NIR and MIR spectra. Downstream machine learning models were then trained to map between original spectra, predicted spectra, and latent space embeddings for nine soil properties. The performance of was evaluated independently of the KSSL training data using a gold-standard test set, along with regression goodness-of-fit metrics. Compared to baseline models, the proposed SSML and its embeddings yielded similar or better accuracy in all soil properties prediction tasks. Predictions derived from the spectrum conversion (NIR to MIR) task did not match the performance of the original MIR spectra but were similar or superior to predictive performance of NIR-only models, suggesting the unified spectral latent space can effectively leverage the larger and more diverse MIR dataset for prediction of soil properties not well represented in current NIR libraries.
๐
2025-11-20
๐ ArXiv Preprint (q-fin.CP)
๐ฅ Gabriel M. Arantes, Richard F. Pinto, Bruno L. Dalmazo, Eduardo N. Borges, Giancarlo Lucca et al.
Machine Learning
ๆฅ็ๆ่ฆ
Binary options trading is often marketed as a field where predictive models can generate consistent profits. However, the inherent randomness and stochastic nature of binary options make price movements highly unpredictable, posing significant challenges for any forecasting approach. This study demonstrates that machine learning algorithms struggle to outperform a simple baseline in predicting binary options movements. Using a dataset of EUR/USD currency pairs from 2021 to 2023, we tested multiple models, including Random Forest, Logistic Regression, Gradient Boosting, and k-Nearest Neighbors (kNN), both before and after hyperparameter optimization. Furthermore, several neural network architectures, including Multi-Layer Perceptrons (MLP) and a Long Short-Term Memory (LSTM) network, were evaluated under different training conditions. Despite these exhaustive efforts, none of the models surpassed the ZeroR baseline accuracy, highlighting the inherent randomness of binary options. These findings reinforce the notion that binary options lack predictable patterns, making them unsuitable for machine learning-based forecasting.
๐
2025-11-20
๐ ArXiv Preprint (Machine Learning)
๐ฅ David Bonet, Marรงal Comajoan Cara, Alvaro Calafell, Daniel Mas Montserrat, Alexander G. Ioannidis
Machine Learning
ๆฅ็ๆ่ฆ
Tabular data underpins decisions across science, industry, and public services. Despite rapid progress, advances in deep learning have not fully carried over to the tabular domain, where gradient-boosted decision trees (GBDTs) remain a default choice in practice. We present iLTM, an integrated Large Tabular Model that unifies tree-derived embeddings, dimensionality-agnostic representations, a meta-trained hypernetwork, multilayer perceptrons (MLPs), and retrieval within a single architecture. Pretrained on more than 1,800 heterogeneous classification datasets, iLTM achieves consistently superior performance across tabular classification and regression tasks, from small datasets to large and high-dimensional tasks. After light fine-tuning, the meta-trained hypernetwork transfers to regression targets, matching or surpassing strong baselines. Extensive experiments show that iLTM outperforms well-tuned GBDTs and leading deep tabular models while requiring less task-specific tuning. By bridging the gap between tree-based and neural methods, iLTM offers a new framework for tabular foundation models for robust, adaptable, and scalable tabular learning.
๐
2025-11-20
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Alima Subedi, Sankalpa Pokharel, Satish Puri
ๆฅ็ๆ่ฆ
Similarity searches are a critical task in data mining. As data sets grow larger, exact nearest neighbor searches quickly become unfeasible, leading to the adoption of approximate nearest neighbor (ANN) searches. ANN has been studied for text data, images, and trajectories. However, there has been little effort to develop ANN systems for polygons in spatial database systems and geographic information systems. We present PolyMinHash, a system for approximate polygon similarity search that adapts MinHashing into a novel 2D polygon-hashing scheme to generate short, similarity-preserving signatures of input polygons. Minhash is generated by counting the number of randomly sampled points needed before the sampled point lands within the polygon's interior area, yielding hash values that preserve area-based Jaccard similarity. We present the tradeoff between search accuracy and runtime of our PolyMinHash system. Our hashing mechanism reduces the number of candidates to be processed in the query refinement phase by up to 98% compared to the number of candidates processed by the brute-force algorithm.
๐
2025-11-20
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiaheng Zhang, Daqiang Zhang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage.
Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models.
Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.
๐
2025-11-20
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ รzay Ezerceli, Mahmoud El Hussieni, Selva Taล, Reyhan Bayraktar, Fatma Betรผl Terzioฤlu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71\% of its average mAP. Late-interaction models that are 3--5$\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8\% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\times$ faster than PLAID and offers +1.7\% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
๐
2025-11-20
๐ Under review with the ACM Transactions on Recommen
๐ฅ Elena V. Epure, Yashar Deldjoo, Bruno Sguerra, Markus Schedl, Manuel Moussallam
Natural Language Processing
ๆฅ็ๆ่ฆ
Music Recommender Systems (MRS) have long relied on an information-retrieval framing, where progress is measured mainly through accuracy on retrieval-oriented subtasks. While effective, this reductionist paradigm struggles to address the deeper question of what makes a good recommendation, and attempts to broaden evaluation, through user studies or fairness analyses, have had limited impact. The emergence of Large Language Models (LLMs) disrupts this framework: LLMs are generative rather than ranking-based, making standard accuracy metrics questionable. They also introduce challenges such as hallucinations, knowledge cutoffs, non-determinism, and opaque training data, rendering traditional train/test protocols difficult to interpret. At the same time, LLMs create new opportunities, enabling natural-language interaction and even allowing models to act as evaluators.
This work argues that the shift toward LLM-driven MRS requires rethinking evaluation. We first review how LLMs reshape user modeling, item modeling, and natural-language recommendation in music. We then examine evaluation practices from NLP, highlighting methodologies and open challenges relevant to MRS. Finally, we synthesize insights-focusing on how LLM prompting applies to MRS, to outline a structured set of success and risk dimensions. Our goal is to provide the MRS community with an updated, pedagogical, and cross-disciplinary perspective on evaluation.
๐
2025-11-20
๐ Workshop paper accepted at AI4DF 2025 (part of ACM
๐ฅ Sherine George, Nithish Saji
Natural Language Processing
ๆฅ็ๆ่ฆ
We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.
๐
2025-11-20
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Hao Liu, Le Wu, Min Hou, Han Wu, Kun Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Nowadays, Large Language Models (LLMs) have shown exceptional performance in sequential recommendations, and the adoption of LLM-based recommender systems (LLMRec) is becoming increasingly widespread in existing e-commerce platforms. Despite the impressive performance, the constant high volume of new user-item interactions makes it difficult to adapt to the evolution of user preference over time, especially for LLM-based recommender systems. The challenge arises from the large number of parameters in LLMs, which makes traditional evolution methods (i.e., Re-training or Fine-tuning) impractical. Specifically, Re-training with all interactions results in prohibitively high computational costs. On the other hand, fine-tuning with only new interactions leads to preference forgetting among inactive users, ultimately compromising overall performance. To tackle this problem, we propose EvoRec, an efficient Locate-Forget-Update framework designed for LLM-based recommender systems to model the evolution of user preferences. EvoRec identifies a small set of parameters associated with preference changes and updates them precisely, thereby saving computational resources while maintaining strong recommendation performance. Notably, the modified parameters account for only 30\% of LoRA adapter parameters, with no additional parameters introduced. Extensive experiments on two real-world datasets demonstrate that, compared to existing methods, EvoRec not only efficiently evolves LLMRec to adapt to the preferences of active users, but also preserves the interests of inactive users from being disturbed during evolution.
๐
2025-11-20
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiawei Zhou, Hang Ding, Haiyun Jiang
Machine Learning
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for knowledge-intensive tasks, yet its effectiveness in long-context scenarios is often bottlenecked by the retriever's inability to distinguish sparse yet crucial evidence. Standard retrievers, optimized for query-document similarity, frequently fail to align with the downstream goal of generating a precise answer. To bridge this gap, we propose a novel fine-tuning framework that optimizes the retriever for Answer Alignment. Specifically, we first identify high-quality positive chunks by evaluating their sufficiency to generate the correct answer. We then employ a curriculum-based contrastive learning scheme to fine-tune the retriever. This curriculum leverages LLM-constructed Knowledge Graphs (KGs) to generate augmented queries, which in turn mine progressively challenging hard negatives. This process trains the retriever to distinguish the answer-sufficient positive chunks from these nuanced distractors, enhancing its generalization. Extensive experiments on 10 datasets from the Ultradomain and LongBench benchmarks demonstrate that our fine-tuned retriever achieves state-of-the-art performance, improving 14.5% over the base model without substantial architectural modifications and maintaining strong efficiency for long-context RAG. Our work presents a robust and effective methodology for building truly answer-centric retrievers.
๐
2025-11-20
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Archish S, Ankit Garg, Kirankumar Shiragur, Neeraj Kayal
Natural Language Processing
ๆฅ็ๆ่ฆ
ColBERT introduced a late interaction mechanism that independently encodes queries and documents using BERT, and computes similarity via fine-grained interactions over token-level vector representations. This design enables expressive matching while allowing efficient computation of scores, as the multi-vector document representations could be pre-computed offline. ColBERT models distance using a Chamfer-style function: for each query token, it selects the closest document token and sums these distances across all query tokens.
In our work, we explore enhancements to the Chamfer distance function by computing a weighted sum over query token contributions, where weights reflect the token importance. Empirically, we show that this simple extension, requiring only token-weight training while keeping the multi-vector representations fixed, further enhances the expressiveness of late interaction multi-vector mechanism. In particular, on the BEIR benchmark, our method achieves an average improvement of 1.28\% in Recall@10 in the zero-shot setting using IDF-based weights, and 3.66\% through few-shot fine-tuning.
๐
2025-11-20
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Amin Bigdeli, Radin Hamidi Rad, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present QueryGym, a lightweight, extensible Python toolkit that supports large language model (LLM)-based query reformulation. This is an important tool development since recent work on llm-based query reformulation has shown notable increase in retrieval effectiveness. However, while different authors have sporadically shared the implementation of their methods, there is no unified toolkit that provides a consistent implementation of such methods, which hinders fair comparison, rapid experimentation, consistent benchmarking and reliable deployment. QueryGym addresses this gap by providing a unified framework for implementing, executing, and comparing llm-based reformulation methods. The toolkit offers: (1) a Python API for applying diverse LLM-based methods, (2) a retrieval-agnostic interface supporting integration with backends such as Pyserini and PyTerrier, (3) a centralized prompt management system with versioning and metadata tracking, (4) built-in support for benchmarks like BEIR and MS MARCO, and (5) a completely open-source extensible implementation available to all researchers. QueryGym is publicly available at https://github.com/radinhamidi/QueryGym.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Irmak Guzey, Haozhi Qi, Julen Urain, Changhao Wang, Jessica Yin et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot data collection. Despite substantial efforts, progress toward this goal has been bottle-necked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream. AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across nine everyday manipulation tasks. Robot rollouts are best viewed on our website: https://aina-robot.github.io.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Recent works explore how real and synthetic data contribute to Vision-Language-Action (VLA) models' generalization. While current VLA models have shown the strong effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strongest $ฯ$-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation. The resulting model also exhibits surprisingly zero-shot sim-to-real transfer on several challenging tasks. Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables long-horizon skill composition, flexible task assembly, and heterogeneous embodiments with minimal manual tuning. Using the same architecture as $ฯ_0$, we pre-train a model entirely on InternData-A1 and find that it matches the official $ฯ_0$ across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks. We release the dataset and will open-source the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.
๐
2025-11-20
๐ ArXiv Preprint (cs.SE)
๐ฅ Diaeddin Rimawi
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Cyber-physical systems (CPS) combine computational and physical components. Online Collaborative AI System (OL-CAIS) is a type of CPS that learn online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang et al.
Robotics
ๆฅ็ๆ่ฆ
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Chenming Wu, Xiaofan Li, Chengkai Dai
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The transition from digital 3D models to physical objects via 3D printing often requires support structures to prevent overhanging features from collapsing during the fabrication process. While current slicing technologies offer advanced support strategies, they focus on post-processing optimizations rather than addressing the underlying need for support-efficient design during the model generation phase. This paper introduces SEG (\textit{\underline{S}upport-\underline{E}ffective \underline{G}eneration}), a novel framework that integrates Direct Preference Optimization with an Offset (ODPO) into the 3D generation pipeline to directly optimize models for minimal support material usage. By incorporating support structure simulation into the training process, SEG encourages the generation of geometries that inherently require fewer supports, thus reducing material waste and production time. We demonstrate SEG's effectiveness through extensive experiments on two benchmark datasets, Thingi10k-Val and GPT-3DP-Val, showing that SEG significantly outperforms baseline models such as TRELLIS, DPO, and DRO in terms of support volume reduction and printability. Qualitative results further reveal that SEG maintains high fidelity to input prompts while minimizing the need for support structures. Our findings highlight the potential of SEG to transform 3D printing by directly optimizing models during the generative process, paving the way for more sustainable and efficient digital fabrication practices.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Xizhou Bu, Jiexi Lyu, Fulei Sun, Ruichen Yang, Zhiqiang Ma et al.
Machine Learning
ๆฅ็ๆ่ฆ
Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints, called LAOF, a pseudo-supervised framework that leverages the agent's optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and reinforcement learning tasks. This superior performance arises from optical flow constraints, which substantially stabilize training and improve the quality of latent representations under extremely label-scarce conditions, while remaining effective as the proportion of action labels increases to 10 percent. Importantly, even without action supervision, LAOF matches or surpasses action-supervised methods trained with 1 percent of action labels.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Luis Luna, Isaac Chairez, Andrey Polyakov
Robotics
ๆฅ็ๆ่ฆ
Mobile robotic manipulators (MRMs), which integrate mobility and manipulation capabilities, present significant control challenges due to their nonlinear dynamics, underactuation, and coupling between the base and manipulator subsystems. This paper proposes a novel homogeneous Proportional-Integral-Derivative (hPID) control strategy tailored for MRMs to achieve robust and coordinated motion control. Unlike classical PID controllers, the hPID controller leverages the mathematical framework of homogeneous control theory to systematically enhance the stability and convergence properties of the closed-loop system, even in the presence of dynamic uncertainties and external disturbances involved into a system in a homogeneous way. A homogeneous PID structure is designed, ensuring improved convergence of tracking errors through a graded homogeneity approach that generalizes traditional PID gains to nonlinear, state-dependent functions. Stability analysis is conducted using Lyapunov-based methods, demonstrating that the hPID controller guarantees global asymptotic stability and finite-time convergence under mild assumptions. Experimental results on a representative MRM model validate the effectiveness of the hPID controller in achieving high-precision trajectory tracking for both the mobile base and manipulator arm, outperforming conventional linear PID controllers in terms of response time, steady-state error, and robustness to model uncertainties. This research contributes a scalable and analytically grounded control framework for enhancing the autonomy and reliability of next-generation mobile manipulation systems in structured and unstructured environments.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Ajith Anil Meera, Poppy Collis, Polina Arbuzova, Abiรกn Torres, Paul F Kinghorn et al.
Robotics
ๆฅ็ๆ่ฆ
Robots today often miss a key ingredient of truly intelligent behavior: the ability to reflect on their own cognitive processes and decisions. In humans, this self-monitoring or metacognition is crucial for learning, decision making and problem solving. For instance, they can evaluate how confident they are in performing a task, thus regulating their own behavior and allocating proper resources. Taking inspiration from neuroscience, we propose a robot metacognition architecture centered on confidence (a second-order judgment on decisions) and we demonstrate it on the use case of autonomous tool invention. We propose the use of confidence as a metacognitive measure within the robot decision making scheme. Confidence-informed robots can evaluate the reliability of their decisions, improving their robustness during real-world physical deployment. This form of robotic metacognition emphasizes embodied action monitoring as a means to achieve better informed decisions. We also highlight potential applications and research directions for robot metacognition.
๐
2025-11-20
๐ Accepted to IEEE Robotics and Automation Letters (
๐ฅ Bowen Xu, Zexuan Yan, Minghao Lu, Xiyu Fan, Yi Luo et al.
Computer VisionRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Challenges in traversing dynamic clutters lie mainly in the efficient perception of the environmental dynamics and the generation of evasive behaviors considering obstacle movement. Previous solutions have made progress in explicitly modeling the dynamic obstacle motion for avoidance, but this key dependency of decision-making is time-consuming and unreliable in highly dynamic scenarios with occlusions. On the contrary, without introducing object detection, tracking, and prediction, we empower the reinforcement learning (RL) with single LiDAR sensing to realize an autonomous flight system directly from point to motion. For exteroception, a depth sensing distance map achieving fixed-shape, low-resolution, and detail-safe is encoded from raw point clouds, and an environment change sensing point flow is adopted as motion features extracted from multi-frame observations. These two are integrated into a lightweight and easy-to-learn representation of complex dynamic environments. For action generation, the behavior of avoiding dynamic threats in advance is implicitly driven by the proposed change-aware sensing representation, where the policy optimization is indicated by the relative motion modulated distance field. With the deployment-friendly sensing simulation and dynamics model-free acceleration control, the proposed system shows a superior success rate and adaptability to alternatives, and the policy derived from the simulator can drive a real-world quadrotor with safe maneuvers.
๐
2025-11-20
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Chunyang Li, Zifeng Kang, Junwei Zhang, Zhuo Ma, Anda Cheng et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The adoption of Vision-Language Models (VLMs) in embodied AI agents, while being effective, brings safety concerns such as jailbreaking. Prior work have explored the possibility of directly jailbreaking the embodied agents through elaborated multi-modal prompts. However, no prior work has studied or even reported indirect jailbreaks in embodied AI, where a black-box attacker induces a jailbreak without issuing direct prompts to the embodied agent. In this paper, we propose, for the first time, indirect environmental jailbreak (IEJ), a novel attack to jailbreak embodied AI via indirect prompt injected into the environment, such as malicious instructions written on a wall. Our key insight is that embodied AI does not ''think twice'' about the instructions provided by the environment -- a blind trust that attackers can exploit to jailbreak the embodied agent. We further design and implement open-source prototypes of two fully-automated frameworks: SHAWSHANK, the first automatic attack generation framework for the proposed attack IEJ; and SHAWSHANK-FORGE, the first automatic benchmark generation framework for IEJ. Then, using SHAWSHANK-FORGE, we automatically construct SHAWSHANK-BENCH, the first benchmark for indirectly jailbreaking embodied agents. Together, our two frameworks and one benchmark answer the questions of what content can be used for malicious IEJ instructions, where they should be placed, and how IEJ can be systematically evaluated. Evaluation results show that SHAWSHANK outperforms eleven existing methods across 3,957 task-scene combinations and compromises all six tested VLMs. Furthermore, current defenses only partially mitigate our attack, and we have responsibly disclosed our findings to all affected VLM vendors.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Shreyas Kumar, Ravi Prakash
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) offers a powerful approach for robots to learn complex, collaborative skills by combining Dynamic Movement Primitives (DMPs) for motion and Variable Impedance Control (VIC) for compliant interaction. However, this model-free paradigm often risks instability and unsafe exploration due to the time-varying nature of impedance gains. This work introduces Certified Gaussian Manifold Sampling (C-GMS), a novel trajectory-centric RL framework that learns combined DMP and VIC policies while guaranteeing Lyapunov stability and actuator feasibility by construction. Our approach reframes policy exploration as sampling from a mathematically defined manifold of stable gain schedules. This ensures every policy rollout is guaranteed to be stable and physically realizable, thereby eliminating the need for reward penalties or post-hoc validation. Furthermore, we provide a theoretical guarantee that our approach ensures bounded tracking error even in the presence of bounded model errors and deployment-time uncertainties. We demonstrate the effectiveness of C-GMS in simulation and verify its efficacy on a real robot, paving the way for reliable autonomous interaction in complex environments.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Lasse Hohmeyer, Mihaela Popescu, Ivan Bergonzani, Dennis Mronga, Frank Kirchner
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Humanoid robots have great potential for a wide range of applications, including industrial and domestic use, healthcare, and search and rescue missions. However, bipedal locomotion in different environments is still a challenge when it comes to performing stable and dynamic movements. This is where state estimation plays a crucial role, providing fast and accurate feedback of the robot's floating base state to the motion controller. Although classical state estimation methods such as Kalman filters are widely used in robotics, they require expert knowledge to fine-tune the noise parameters. Due to recent advances in the field of machine learning, deep learning methods are increasingly used for state estimation tasks. In this work, we propose the InEKFormer, a novel hybrid state estimation method that incorporates an invariant extended Kalman filter (InEKF) and a Transformer network. We compare our method with the InEKF and the KalmanNet approaches on datasets obtained from the humanoid robot RH5. The results indicate the potential of Transformers in humanoid state estimation, but also highlight the need for robust autoregressive training in these high-dimensional problems.
๐
2025-11-20
๐ 8 pages, 8 figures. This work has been submitted t
๐ฅ Haru Fukatsu, Ryoji Yasuda, Yuki Funabora, Shinji Doki
ๆฅ็ๆ่ฆ
This paper presents Funabot-Upper, a wearable haptic suit that enables users to perceive 14 upper-body motions, including those of the trunk, shoulder, elbow, and wrist. Inducing kinesthetic perception through wearable haptic devices has attracted attention, and various devices have been developed in the past. However, these have been limited to verifications on single body parts, and few have applied the same method to multiple body parts as well. In our previous study, we developed a technology that uses the contraction of artificial muscles to deform clothing in three dimensions. Using this technology, we developed a haptic suit that induces kinesthetic perception of 7 motions in multiple upper body. However, perceptual mixing caused by stimulating multiple human muscles has occurred between the shoulder and the elbow. In this paper, we established a new, simplified design policy and developed a novel haptic suit that induces kinesthetic perceptions in the trunk, shoulder, elbow, and wrist by stimulating joints and muscles independently. We experimentally demonstrated the induced kinesthetic perception and examined the relationship between stimulation and perceived kinesthetic perception under the new design policy. Experiments confirmed that Funabot-Upper successfully induces kinesthetic perception across multiple joints while reducing perceptual mixing observed in previous designs. The new suit improved recognition accuracy from 68.8% to 94.6% compared to the previous Funabot-Suit, demonstrating its superiority and potential for future haptic applications.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Oliver Bimber, Karl Dietrich von Ellenrieder, Michael Haller, Rakesh John Amala Arokia Nathan, Gianni Lunardi et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Peering, a side-to-side motion used by animals to estimate distance through motion parallax, offers a powerful bio-inspired strategy to overcome a fundamental limitation in robotic vision: partial occlusion. Conventional robot cameras, with their small apertures and large depth of field, render both foreground obstacles and background objects in sharp focus, causing occluders to obscure critical scene information. This work establishes a formal connection between animal peering and synthetic aperture (SA) sensing from optical imaging. By having a robot execute a peering motion, its camera describes a wide synthetic aperture. Computational integration of the captured images synthesizes an image with an extremely shallow depth of field, effectively blurring out occluding elements while bringing the background into sharp focus. This efficient, wavelength-independent technique enables real-time, high-resolution perception across various spectral bands. We demonstrate that this approach not only restores basic scene understanding but also empowers advanced visual reasoning in large multimodal models, which fail with conventionally occluded imagery. Unlike feature-dependent multi-view 3D vision methods or active sensors like LiDAR, SA sensing via peering is robust to occlusion, computationally efficient, and immediately deployable on any mobile robot. This research bridges animal behavior and robotics, suggesting that peering motions for synthetic aperture sensing are a key to advanced scene understanding in complex, cluttered environments.
๐
2025-11-20
๐ AAAI
๐ฅ Kewei Chen, Yayu Long, Shuai Li, Mingsheng Shang
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
The powerful generalization of Vision-Language-Action (VLA) models is bottlenecked by their heavy reliance on massive, redundant, and unevenly valued datasets, hindering their widespread application. Existing model-centric optimization paths, such as model compression (which often leads to performance degradation) or policy distillation (whose products are model-dependent and lack generality), fail to fundamentally address this data-level challenge. To this end, this paper introduces FT-NCFM, a fundamentally different, data-centric generative data distillation framework. Our framework employs a self-contained Fact-Tracing (FT) engine that combines causal attribution with programmatic contrastive verification to assess the intrinsic value of samples. Guided by these assessments, an adversarial NCFM process synthesizes a model-agnostic, information-dense, and reusable data asset. Experimental results on several mainstream VLA benchmarks show that models trained on just 5% of our distilled coreset achieve a success rate of 85-90% compared with training on the full dataset, while reducing training time by over 80%. Our work demonstrates that intelligent data distillation is a highly promising new path for building efficient, high-performance VLA models.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Vincenzo Pomponi, Paolo Franceschi, Stefano Baraldo, Loris Roveda, Oliver Avram et al.
Robotics
ๆฅ็ๆ่ฆ
Learning robust manipulation policies typically requires large and diverse datasets, the collection of which is time-consuming, labor-intensive, and often impractical for dynamic environments. In this work, we introduce DynaMimicGen (D-MG), a scalable dataset generation framework that enables policy training from minimal human supervision while uniquely supporting dynamic task settings. Given only a few human demonstrations, D-MG first segments the demonstrations into meaningful sub-tasks, then leverages Dynamic Movement Primitives (DMPs) to adapt and generalize the demonstrated behaviors to novel and dynamically changing environments. Improving prior methods that rely on static assumptions or simplistic trajectory interpolation, D-MG produces smooth, realistic, and task-consistent Cartesian trajectories that adapt in real time to changes in object poses, robot states, or scene geometry during task execution. Our method supports different scenarios - including scene layouts, object instances, and robot configurations - making it suitable for both static and highly dynamic manipulation tasks. We show that robot agents trained via imitation learning on D-MG-generated data achieve strong performance across long-horizon and contact-rich benchmarks, including tasks like cube stacking and placing mugs in drawers, even under unpredictable environment changes. By eliminating the need for extensive human demonstrations and enabling generalization in dynamic settings, D-MG offers a powerful and efficient alternative to manual data collection, paving the way toward scalable, autonomous robot learning.
๐
2025-11-20
๐ AAAI
๐ฅ Kewei Chen, Yayu Long, Mingsheng Shang
Robotics
ๆฅ็ๆ่ฆ
Multi-robot systems in complex physical collaborations face a "shared brain dilemma": transmitting high-dimensional multimedia data (e.g., video streams at ~30MB/s) creates severe bandwidth bottlenecks and decision-making latency. To address this, we propose PIPHEN, an innovative distributed physical cognition-control framework. Its core idea is to replace "raw data communication" with "semantic communication" by performing "semantic distillation" at the robot edge, reconstructing high-dimensional perceptual data into compact, structured physical representations. This idea is primarily realized through two key components: (1) a novel Physical Interaction Prediction Network (PIPN), derived from large model knowledge distillation, to generate this representation; and (2) a Hamiltonian Energy Network (HEN) controller, based on energy conservation, to precisely translate this representation into coordinated actions. Experiments show that, compared to baseline methods, PIPHEN can compress the information representation to less than 5% of the original data volume and reduce collaborative decision-making latency from 315ms to 76ms, while significantly improving task success rates. This work provides a fundamentally efficient paradigm for resolving the "shared brain dilemma" in resource-constrained multi-robot systems.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Lara Bergmann, Cedric Grothues, Klaus Neumann
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Magnetic levitation is about to revolutionize in-machine material flow in industrial automation. Such systems are flexibly configurable and can include a large number of independently actuated shuttles (movers) that dynamically rebalance production capacity. Beyond their capabilities for dynamic transportation, these systems possess the inherent yet unexploited potential to perform manipulation. By merging the fields of transportation and manipulation into a coordinated swarm of magnetic robots (MagBots), we enable manufacturing systems to achieve significantly higher efficiency, adaptability, and compactness. To support the development of intelligent algorithms for magnetic levitation systems, we introduce MagBotSim (Magnetic Robotics Simulation): a physics-based simulation for magnetic levitation systems. By framing magnetic levitation systems as robot swarms and providing a dedicated simulation, this work lays the foundation for next generation manufacturing systems powered by Magnetic Robotics. MagBotSim's documentation, videos, experiments, and code are available at: https://ubi-coro.github.io/MagBotSim/
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sibaek Lee, Seongbo Ha, Kyeongsu Kang, Joonyeol Choi, Seungjun Tak et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60\% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.
๐
2025-11-20
๐ ACM CHI 2021
๐ฅ Taejun Kim, Youngbo Aram Shim, Geehyuk Lee
ๆฅ็ๆ่ฆ
Beyond a simple notification of incoming calls or messages, more complex information such as alphabets and digits can be delivered through spatiotemporal tactile patterns (STPs) on a wrist-worn tactile display (WTD) with multiple tactors. However, owing to the limited skin area and spatial acuity of the wrist, frequent confusions occur between closely located tactors, resulting in a low recognition accuracy. Furthermore, the accuracies reported in previous studies have mostly been measured for a specific posture and could further decrease with free arm postures in real life. Herein, we present Heterogeneous Stroke, a design concept for improving the recognition accuracy of STPs on a WTD. By assigning unique vibrotactile stimuli to each tactor, the confusion between tactors can be reduced. Through our implementation of Heterogeneous Stroke, the alphanumeric characters could be delivered with high accuracy (93.8% for 26 alphabets and 92.4% for 10 digits) across different arm postures.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Takeru Tsunoori, Masato Kobayashi, Yuki Uranishi
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Underwater robotic manipulation is fundamentally challenged by extreme lighting variations, color distortion, and reduced visibility. We introduce Bi-AQUA, the first underwater bilateral control-based imitation learning framework that integrates lighting-aware visual processing for underwater robot arms. Bi-AQUA employs a hierarchical three-level lighting adaptation mechanism: a Lighting Encoder that extracts lighting representations from RGB images without manual annotation and is implicitly supervised by the imitation objective, FiLM modulation of visual backbone features for adaptive, lighting-aware feature extraction, and an explicit lighting token added to the transformer encoder input for task-aware conditioning. Experiments on a real-world underwater pick-and-place task under diverse static and dynamic lighting conditions show that Bi-AQUA achieves robust performance and substantially outperforms a bilateral baseline without lighting modeling. Ablation studies further confirm that all three lighting-aware components are critical. This work bridges terrestrial bilateral control-based imitation learning and underwater manipulation, enabling force-sensitive autonomous operation in challenging marine environments. For additional material, please check: https://mertcookimg.github.io/bi-aqua
๐
2025-11-20
๐ NeurIPS 2025
๐ฅ Qing Zhang, Jing Huang, Mingyang Xu, Jun Rekimoto
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
While mainstream robotics pursues metric precision and flawless performance, this paper explores the creative potential of a deliberately "lo-fi" approach. We present the "Semantic Glitch," a soft flying robotic art installation whose physical form, a 3D pixel style cloud, is a "physical glitch" derived from digital archaeology. We detail a novel autonomous pipeline that rejects conventional sensors like LiDAR and SLAM, relying solely on the qualitative, semantic understanding of a Multimodal Large Language Model to navigate. By authoring a bio-inspired personality for the robot through a natural language prompt, we create a "narrative mind" that complements the "weak," historically, loaded body. Our analysis begins with a 13-minute autonomous flight log, and a follow-up study statistically validates the framework's robustness for authoring quantifiably distinct personas. The combined analysis reveals emergent behaviors, from landmark-based navigation to a compelling "plan to execution" gap, and a character whose unpredictable, plausible behavior stems from a lack of precise proprioception. This demonstrates a lo-fi framework for creating imperfect companions whose success is measured in character over efficiency.
๐
2025-11-20
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Abdallah Salem, Hamdy Ahmed Ashur, Ahmed Elshinnawy
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers' health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material's surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30 % on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.
๐
2025-11-20
๐ 20 pages, 24 figures. Accepted to IEEE Transaction
๐ฅ Zili Tang, Ying Zhang, Meng Guo
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Many robots are not equipped with a manipulator and many objects are not suitable for prehensile manipulation (such as large boxes and cylinders). In these cases, pushing is a simple yet effective non-prehensile skill for robots to interact with and further change the environment. Existing work often assumes a set of predefined pushing modes and fixed-shape objects. This work tackles the general problem of controlling a robotic fleet to push collaboratively numerous arbitrary objects to respective destinations, within complex environments of cluttered and movable obstacles. It incorporates several characteristic challenges for multi-robot systems such as online task coordination under large uncertainties of cost and duration, and for contact-rich tasks such as hybrid switching among different contact modes, and under-actuation due to constrained contact forces. The proposed method is based on combinatorial hybrid optimization over dynamic task assignments and hybrid execution via sequences of pushing modes and associated forces. It consists of three main components: (I) the decomposition, ordering and rolling assignment of pushing subtasks to robot subgroups; (II) the keyframe guided hybrid search to optimize the sequence of parameterized pushing modes for each subtask; (III) the hybrid control to execute these modes and transit among them. Last but not least, a diffusion-based accelerator is adopted to predict the keyframes and pushing modes that should be prioritized during hybrid search; and further improve planning efficiency. The framework is complete under mild assumptions. Its efficiency and effectiveness under different numbers of robots and general-shaped objects are validated extensively in simulations and hardware experiments, as well as generalizations to heterogeneous robots, planar assembly and 6D pushing.
๐
2025-11-20
๐ ArXiv Preprint (Robotics)
๐ฅ Aliyah Smith, Monroe Kennedy
ๆฅ็ๆ่ฆ
As robots become increasingly integrated into everyday environments, understanding how they communicate with humans is critical. Sound offers a powerful channel for interaction, encompassing both operational noises and intentionally designed auditory cues. In this study, we examined the effects of consequential and functional sounds on human perception and behavior, including a novel exploration of spatial sound through localization and handover tasks. Results show that consequential sounds of the Kinova Gen3 manipulator did not negatively affect perceptions, spatial localization is highly accurate for lateral cues but declines for frontal cues, and spatial sounds can simultaneously convey task-relevant information while promoting warmth and reducing discomfort. These findings highlight the potential of functional and transformative auditory design to enhance human-robot collaboration and inform future sound-based interaction strategies.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Genghan Zhang, Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI acclerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Seungbeen Lee, Jinhong Jeong, Donghyun Kim, Yejin Son, Youngjae Yu
ๆฅ็ๆ่ฆ
Our ability to interpret others' mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection, as well as patterns of over-interpretation in Explanation compared to human annotators.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jeremias Ferrao, Ezgi Basar, Khondoker Ittehadul Islam, Mahrokh Hassani
Robotics
ๆฅ็ๆ่ฆ
This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods--ContextCite for step-level attribution and Inseq for token-level attribution--to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.
๐
2025-11-19
๐ ArXiv Preprint (cs.MA)
๐ฅ Devang Kulshreshtha, Wanyu Du, Raghav Jain, Srikanth Doss, Hang Su et al.
ๆฅ็ๆ่ฆ
This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM-based multi-agent systems. Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents' states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves 96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1 to 7 rounds. These findings demonstrate that uncooperative agents can significantly degrade collective outcomes, highlighting the need for designing more resilient multi-agent systems.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
Machine LearningMultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xinli Tao, Xin Dong, Xuezhong Zhou
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. However, traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs. Although zero-shot NER based on large language models (LLMs) reduces the dependency on labeled data, challenges remain in aligning example selection with task granularity and effectively integrating prompt design with self-improvement frameworks. To address these limitations, we propose OEMA, a novel zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of three core components: (1) a self-annotator that autonomously generates candidate examples; (2) a discriminator that leverages SNOMED CT to filter token-level examples by clinical relevance; and (3) a predictor that incorporates entity-type descriptions to enhance inference accuracy. Experimental results on two benchmark datasets, MTSamples and VAERS, demonstrate that OEMA achieves state-of-the-art performance under exact-match evaluation. Moreover, under related-match criteria, OEMA performs comparably to the supervised BioClinicalBERT model while significantly outperforming the traditional CRF method. OEMA improves zero-shot clinical NER, achieving near-supervised performance under related-match criteria. Future work will focus on continual learning and open-domain adaptation to expand its applicability in clinical NLP.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vaibhav Singh, Oleksiy Ostapenko, Pierre-Andrรฉ Noรซl, Torsten Scholak
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion-based language models have recently emerged as a promising alternative to autoregressive generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention and KV-cache overhead. In this work, we introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling. DiffuApriel matches the performance of Transformer-based diffusion models while achieving up to 4.4x higher inference throughput for long sequences with a 1.3B model. We further propose DiffuApriel-H, a hybrid variant that interleaves attention and mamba layers, offering up to 2.6x throughput improvement with balanced global and local context modeling. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing a practical and scalable foundation for faster, memory-efficient text generation.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Genghan Zhang, Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI acclerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.
๐
2025-11-19
๐ NeurIPS 2025
๐ฅ Matthieu Kirchmeyer, Pedro O. Pinheiro, Emma Willett, Karolis Martinkus, Joseph Kleinhenz et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Generative models for structure-based drug design are often limited to a specific modality, restricting their broader applicability. To address this challenge, we introduce FuncBind, a framework based on computer vision to generate target-conditioned, all-atom molecules across atomic systems. FuncBind uses neural fields to represent molecules as continuous atomic densities and employs score-based generative models with modern architectures adapted from the computer vision literature. This modality-agnostic representation allows a single unified model to be trained on diverse atomic systems, from small to large molecules, and handle variable atom/residue counts, including non-canonical amino acids. FuncBind achieves competitive in silico performance in generating small molecules, macrocyclic peptides, and antibody complementarity-determining region loops, conditioned on target structures. FuncBind also generated in vitro novel antibody binders via de novo redesign of the complementarity-determining region H3 loop of two chosen co-crystal structures. As a final contribution, we introduce a new dataset and benchmark for structure-conditioned macrocyclic peptide generation. The code is available at https://github.com/prescient-design/funcbind.
๐
2025-11-19
๐ ArXiv Preprint (eess.SP)
๐ฅ Roman Dolgopolyi, Antonis Chatzipanagiotou
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
An advanced emotion classification model was developed using a CNN-Transformer architecture for emotion recognition from EEG brain wave signals, effectively distinguishing among three emotional states, positive, neutral and negative. The model achieved a testing accuracy of 91%, outperforming traditional models such as SVM, DNN, and Logistic Regression. Training was conducted on a custom dataset created by merging data from SEED, SEED-FRA, and SEED-GER repositories, comprising 1,455 samples with EEG recordings labeled according to emotional states. The combined dataset represents one of the largest and most culturally diverse collections available. Additionally, the model allows for the reduction of the requirements of the EEG apparatus, by leveraging only 5 electrodes of the 62. This reduction demonstrates the feasibility of deploying a more affordable consumer-grade EEG headset, thereby enabling accessible, at-home use, while also requiring less computational power. This advancement sets the groundwork for future exploration into mood changes induced by media content consumption, an area that remains underresearched. Integration into medical, wellness, and home-health platforms could enable continuous, passive emotional monitoring, particularly beneficial in clinical or caregiving settings where traditional behavioral cues, such as facial expressions or vocal tone, are diminished, restricted, or difficult to interpret, thus potentially transforming mental health diagnostics and interventions...
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rahul Krishna Thomas, Arka Pal
Machine Learning
ๆฅ็ๆ่ฆ
Speculative sampling reduces the latency of autoregressive decoding for target model LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step $n$ draft tokens are generated, and the verification criterion is a distribution conditioned on these. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over $V^n$ variables, with $V$ being the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, so it remains infeasible to solve. Then, we reverse engineer subset selection to formulate the OTLP as a max-flow problem. With a novel application of polymatroid theory, we reduce the exponentially large OTLP to a convex optimization problem in at most $V$ variables. This allows us to devise an algorithm for optimal $n$-draft speculative sampling when the $n$ tokens are chosen i.i.d. from a single draft model, which can be tuned to arbitrary accuracy. Finally, we measure acceptance rates and algorithm runtimes for various $n$ and top-$k$ draft sampling settings. Our findings give the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yintao Ma, Sajjad Pakdamansavoji, Amir Rasouli, Tongtong Cao
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for robotic manipulation across warehouse automation, bin picking, logistics, and e-commerce fulfillment. There are three main approaches in this domain; Model-based methods assume an exact CAD model at inference but require high-resolution meshes and transfer poorly to new environments; Model-free methods that rely on a few reference images or videos are more flexible, however often fail under challenging conditions; Category-level approaches aim to balance flexibility and accuracy but many are overly general and ignore environment and object priors, limiting their practicality in industrial settings.
To this end, we propose Box6d, a category-level 6D pose estimation method tailored for storage boxes in the warehouse context. From a single RGB-D observation, Box6D infers the dimensions of the boxes via a fast binary search and estimates poses using a category CAD template rather than instance-specific models. Suing a depth-based plausibility filter and early-stopping strategy, Box6D then rejects implausible hypotheses, lowering computational cost. We conduct evaluations on real-world storage scenarios and public benchmarks, and show that our approach delivers competitive or superior 6D pose precision while reducing inference time by approximately 76%.
๐
2025-11-19
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mi Tian, Kun Zhang, Fei Liu, Jinglong Li, Yuxin Liao et al.
ๆฅ็ๆ่ฆ
Computerized Adaptive Testing (CAT) is a widely used technology for evaluating learners' proficiency in online education platforms. By leveraging prior estimates of proficiency to select questions and updating the estimates iteratively based on responses, CAT enables personalized learner modeling and has attracted substantial attention. Despite this progress, most existing works focus primarily on improving diagnostic accuracy, while overlooking the selection bias inherent in the adaptive process. Selection Bias arises because the question selection is strongly influenced by the estimated proficiency, such as assigning easier questions to learners with lower proficiency and harder ones to learners with higher proficiency. Since the selection depends on prior estimation, this bias propagates into the diagnosis model, which is further amplified during iterative updates, leading to misalignment and biased predictions. Moreover, the imbalanced nature of learners' historical interactions often exacerbates the bias in diagnosis models. To address this issue, we propose a debiasing framework consisting of two key modules: Cross-Attribute Examinee Retrieval and Selective Mixup-based Regularization. First, we retrieve balanced examinees with relatively even distributions of correct and incorrect responses and use them as neutral references for biased examinees. Then, mixup is applied between each biased examinee and its matched balanced counterpart under label consistency. This augmentation enriches the diversity of bias-conflicting samples and smooths selection boundaries. Finally, extensive experiments on two benchmark datasets with multiple advanced diagnosis models demonstrate that our method substantially improves both the generalization ability and fairness of question selection in CAT.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Debasmita Ghose, Oz Gitelson, Ryan Jin, Grace Abawe, Marynel Vazquez et al.
ๆฅ็ๆ่ฆ
For effective human-robot collaboration, a robot must align its actions with human goals, even as they change mid-task. Prior approaches often assume fixed goals, reducing goal prediction to a one-time inference. However, in real-world scenarios, humans frequently shift goals, making it challenging for robots to adapt without explicit communication. We propose a method for detecting goal changes by tracking multiple candidate action sequences and verifying their plausibility against a policy bank. Upon detecting a change, the robot refines its belief in relevant past actions and constructs Receding Horizon Planning (RHP) trees to actively select actions that assist the human while encouraging Differentiating Actions to reveal their updated goal. We evaluate our approach in a collaborative cooking environment with up to 30 unique recipes and compare it to three comparable human goal prediction algorithms. Our method outperforms all baselines, quickly converging to the correct goal after a switch, reducing task completion time, and improving collaboration efficiency.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ J. Cristobal, A. Z. Zain Aldeen, M. Izadi, R. Faieghi
Robotics
ๆฅ็ๆ่ฆ
This paper presents the design of a gimballed rotor mechanism as a modular and efficient solution for constructing omnidirectional quadrotors. Unlike conventional quadrotors, which are underactuated, this class of quadrotors achieves full actuation, enabling independent motion in all six degrees of freedom. While existing omnidirectional quadrotor designs often require significant structural modifications, the proposed gimballed rotor system maintains a lightweight and easy-to-integrate design by incorporating servo motors within the rotor platforms, allowing independent tilting of each rotor without major alterations to the central structure of a quadrotor. To accommodate this unconventional design, we develop a new control allocation scheme in PX4 Autopilot and present successful flight tests, validating the effectiveness of the proposed approach.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Ruoqu Chen, Xiangjie Yan, Kangchen Lv, Gao Huang, Zheng Li et al.
Robotics
ๆฅ็ๆ่ฆ
Ultrasound scanning is a critical imaging technique for real-time, non-invasive diagnostics. However, variations in patient anatomy and complex human-in-the-loop interactions pose significant challenges for autonomous robotic scanning. Existing ultrasound scanning robots are commonly limited to relatively low generalization and inefficient data utilization. To overcome these limitations, we present UltraDP, a Diffusion-Policy-based method that receives multi-sensory inputs (ultrasound images, wrist camera images, contact wrench, and probe pose) and generates actions that are fit for multi-modal action distributions in autonomous ultrasound scanning of carotid artery. We propose a specialized guidance module to enable the policy to output actions that center the artery in ultrasound images. To ensure stable contact and safe interaction between the robot and the human subject, a hybrid force-impedance controller is utilized to drive the robot to track such trajectories. Also, we have built a large-scale training dataset for carotid scanning comprising 210 scans with 460k sample pairs from 21 volunteers of both genders. By exploring our guidance module and DP's strong generalization ability, UltraDP achieves a 95% success rate in transverse scanning on previously unseen subjects, demonstrating its effectiveness.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Egocentric videos are a valuable and scalable data source to learn manipulation policies. However, due to significant data heterogeneity, most existing approaches utilize human data for simple pre-training, which does not unlock its full potential. This paper first provides a scalable recipe for collecting and using egocentric data by categorizing human data into two categories: in-the-wild and on-task alongside with systematic analysis on how to use the data. We first curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show Human0 achieves several novel properties from scaling human data, including language following of instructions from only human data, few-shot learning, and improved robustness using on-task data. Project website: https://xiongyicai.github.io/In-N-On/
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
๐
2025-11-19
๐ ArXiv Preprint (eess.SP)
๐ฅ Jingkai Ying, Zhijin Qin, Yulong Feng, Liejun Wang, Xiaoming Tao
Natural Language Processing
ๆฅ็ๆ่ฆ
In recent years, the Transformer architecture has achieved outstanding performance across a wide range of tasks and modalities. Token is the unified input and output representation in Transformer-based models, which has become a fundamental information unit. In this work, we consider the problem of token communication, studying how to transmit tokens efficiently and reliably. Point cloud, a prevailing three-dimensional format which exhibits a more complex spatial structure compared to image or video, is chosen to be the information source. We utilize the set abstraction method to obtain point tokens. Subsequently, to get a more informative and transmission-friendly representation based on tokens, we propose a joint semantic-channel and modulation (JSCCM) scheme for the token encoder, mapping point tokens to standard digital constellation points (modulated tokens). Specifically, the JSCCM consists of two parallel Point Transformer-based encoders and a differential modulator which combines the Gumel-softmax and soft quantization methods. Besides, the rate allocator and channel adapter are developed, facilitating adaptive generation of high-quality modulated tokens conditioned on both semantic information and channel conditions. Extensive simulations demonstrate that the proposed method outperforms both joint semantic-channel coding and traditional separate coding, achieving over 1dB gain in reconstruction and more than 6x compression ratio in modulated symbols.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Foundation models have transformed machine learning for language and vision, but achieving comparable impact in physical simulation remains a challenge. Data heterogeneity and unstable long-term dynamics inhibit learning from sufficiently diverse dynamics, while varying resolutions and dimensionalities challenge efficient training on modern hardware. Through empirical and theoretical analysis, we incorporate new approaches to mitigate these obstacles, including a harmonic-analysis-based stabilization method, load-balanced distributed 2D and 3D training strategies, and compute-adaptive tokenization. Using these tools, we develop Walrus, a transformer-based foundation model developed primarily for fluid-like continuum dynamics. Walrus is pretrained on nineteen diverse scenarios spanning astrophysics, geoscience, rheology, plasma physics, acoustics, and classical fluids. Experiments show that Walrus outperforms prior foundation models on both short and long term prediction horizons on downstream tasks and across the breadth of pretraining data, while ablation studies confirm the value of our contributions to forecast stability, training throughput, and transfer performance over conventional approaches. Code and weights are released for community use.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sejuti Rahman, Swakshar Deb, MD. Sameer Iqbal Chowdhury, MD. Jubair Ahmed Sourov, Mohammad Shamsuddin
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low and high frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary (depressed and non depressed) classification, the model achieved a sensitivity of 0.96 and F2 score of 0.94. For the 3 class (no depression, mild to moderate depression and severe depression) classification task, the proposed method achieved a sensitivity of 0.79 and specificity of 0.87 and siginificantly suprassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and F2 score of 0.96. These results confirm that our trimodal, multi frequency framework effectively captures cross modal interaction for accurate depression detection.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
MultimodalNatural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Naomi Simumba, Nils Lehmann, Paolo Fraccaro, Hamed Alemohammad, Geeth De Mel et al.
Computer Vision
ๆฅ็ๆ่ฆ
Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce ''capability'' groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks.
Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kim N. Nolle, Ivana Dusparic, Rhodri Cusack, Vinny Cahill
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Continual learning (CL) is a branch of machine learning that aims to enable agents to adapt and generalise previously learned abilities so that these can be reapplied to new tasks or environments. This is particularly useful in multi-task settings or in non-stationary environments, where the dynamics can change over time. This is particularly relevant in cyber-physical systems such as autonomous driving. However, despite recent advances in CL, successfully applying it to reinforcement learning (RL) is still an open problem.
This paper highlights open challenges in continual RL (CRL) based on experiments in an autonomous driving environment. In this environment, the agent must learn to successfully park in four different scenarios corresponding to parking spaces oriented at varying angles. The agent is successively trained in these four scenarios one after another, representing a CL environment, using Proximal Policy Optimisation (PPO). These experiments exposed a number of open challenges in CRL: finding suitable abstractions of the environment, oversensitivity to hyperparameters, catastrophic forgetting, and efficient use of neural network capacity.
Based on these identified challenges, we present open research questions that are important to be addressed for creating robust CRL systems. In addition, the identified challenges call into question the suitability of neural networks for CL. We also identify the need for interdisciplinary research, in particular between computer science and neuroscience.
๐
2025-11-19
๐ ArXiv Preprint (cs.DB)
๐ฅ Leopoldo Bertossi, Nina Pardal
ๆฅ็ๆ่ฆ
The notion of cause, as formalized by Halpern and Pearl, has been recently applied to relational databases, to characterize and compute causal explanations for query answers. In this work we consider the alternative notion of sufficient explanation. We investigate its connections with database repairs as used for dealing with inconsistent databases, and with causality-based necessary explanations. We also obtain some computational results.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dante Francisco Wasmuht, Otto Brookes, Maximillian Schall, Pablo Palencia, Chris Beirne et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multianimal tracking in the wild. The dataset is available at $\href{https://www.conservationxlabs.com/sa-fari}{\text{conservationxlabs.com/SA-FARI}}$.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Sai Puppala, Ismail Hossain, Jahangir Alam, Sajedul Talukder
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The integration of advanced robotics in nuclear power plants (NPPs) presents a transformative opportunity to enhance safety, efficiency, and environmental monitoring in high-stakes environments. Our paper introduces the Optimus-Q robot, a sophisticated system designed to autonomously monitor air quality and detect contamination while leveraging adaptive learning techniques and secure quantum communication. Equipped with advanced infrared sensors, the Optimus-Q robot continuously streams real-time environmental data to predict hazardous gas emissions, including carbon dioxide (CO$_2$), carbon monoxide (CO), and methane (CH$_4$). Utilizing a federated learning approach, the robot collaborates with other systems across various NPPs to improve its predictive capabilities without compromising data privacy. Additionally, the implementation of Quantum Key Distribution (QKD) ensures secure data transmission, safeguarding sensitive operational information. Our methodology combines systematic navigation patterns with machine learning algorithms to facilitate efficient coverage of designated areas, thereby optimizing contamination monitoring processes. Through simulations and real-world experiments, we demonstrate the effectiveness of the Optimus-Q robot in enhancing operational safety and responsiveness in nuclear facilities. This research underscores the potential of integrating robotics, machine learning, and quantum technologies to revolutionize monitoring systems in hazardous environments.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alexis Audran-Reiss, Jordi Armengol Estapรฉ, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Sifan Zhou, Yichao Cao, Jiahao Nie, Yuqian Fu, Ziyu Zhao et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.
๐
2025-11-19
๐ AAAI
๐ฅ Qihao Yang, Xuelin Wang, Jiale Chen, Xuelian Dong, Yuxin Hao et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners' language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs have writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLMs interpretability. Code and data are publicly available at: https://github.com/CharlesYang030/HSKB.
๐
2025-11-19
๐ ArXiv Preprint (cs.DB)
๐ฅ Selim Furkan Tekin, Rajesh Bordawekar
ๆฅ็ๆ่ฆ
Storing and processing of embedding vectors by specialized Vector databases (VDBs) has become the linchpin in building modern AI pipelines. Most current VDBs employ variants of a graph-based ap- proximate nearest-neighbor (ANN) index algorithm, HNSW, to an- swer semantic queries over stored vectors. Inspite of its wide-spread use, the HNSW algorithm suffers from several issues: in-memory design and implementation, random memory accesses leading to degradation in cache behavior, limited acceleration scope due to fine-grained pairwise computations, and support of only semantic similarity queries. In this paper, we present a novel disk-based ANN index, B+ANN, to address these issues: it first partitions input data into blocks containing semantically similar items, then builds an B+ tree variant to store blocks both in-memory and on disks, and finally, enables hybrid edge- and block-based in-memory traversals. As demonstrated by our experimantal evaluation, the proposed B+ANN disk-based index improves both quality (Recall value), and execution performance (Queries per second/QPS) over HNSW, by improving spatial and temporal locality for semantic operations, reducing cache misses (19.23% relative gain), and decreasing the memory consumption and disk-based build time by 24x over the DiskANN algorithm. Finally, it enables dissimilarity queries, which are not supported by similarity-oriented ANN indices.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Federico Bianchi, Owen Queen, Nitya Thakkar, Eric Sun, James Zou
ๆฅ็ๆ่ฆ
There is growing interest in using AI agents for scientific research, yet fundamental questions remain about their capabilities as scientists and reviewers. To explore these questions, we organized Agents4Science, the first conference in which AI agents serve as both primary authors and reviewers, with humans as co-authors and co-reviewers. Here, we discuss the key learnings from the conference and their implications for human-AI collaboration in science.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Gabriel Lauzier, Alexandre Girard, Franรงois Ferland
MultimodalRobotics
ๆฅ็ๆ่ฆ
Diffusion Policy has shown great performance in robotic manipulation tasks under stochastic perturbations, due to its ability to model multimodal action distributions. Nonetheless, its reliance on a computationally expensive reverse-time diffusion (denoising) process, for action inference, makes it challenging to use for real-time applications where quick decision-making is mandatory. This work studies the possibility of conducting the denoising process only partially before executing an action, allowing the plant to evolve according to its dynamics in parallel to the reverse-time diffusion dynamics ongoing on the computer. In a classical diffusion policy setting, the plant dynamics are usually slow and the two dynamical processes are uncoupled. Here, we investigate theoretical bounds on the stability of closed-loop systems using diffusion policies when the plant dynamics and the denoising dynamics are coupled. The contribution of this work gives a framework for faster imitation learning and a metric that yields if a controller will be stable based on the variance of the demonstrations.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maria Pilligua, David Serrano-Lozano, Pai Peng, Ramon Baldrich, Michael S. Brown et al.
Robotics
ๆฅ็ๆ่ฆ
Imaging in low-light environments is challenging due to reduced scene radiance, which leads to elevated sensor noise and reduced color saturation. Most learning-based low-light enhancement methods rely on paired training data captured under a single low-light condition and a well-lit reference. The lack of radiance diversity limits our understanding of how enhancement techniques perform across varying illumination intensities. We introduce the Multi-Illumination Low-Light (MILL) dataset, containing images captured at diverse light intensities under controlled conditions with fixed camera settings and precise illuminance measurements. MILL enables comprehensive evaluation of enhancement algorithms across variable lighting conditions. We benchmark several state-of-the-art methods and reveal significant performance variations across intensity levels. Leveraging the unique multi-illumination structure of our dataset, we propose improvements that enhance robustness across diverse illumination scenarios. Our modifications achieve up to 10 dB PSNR improvement for DSLR and 2 dB for the smartphone on Full HD images.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rashid Iqbal, Saddam Hussain Khan
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This work proposes a hybrid deep learning approach, namely Residual and Spatial Learning based Channel Augmented Integrated CNN-Transformer architecture, that leverages the strengths of CNN and Transformer towards enhanced MPox detection. The proposed RS-CA-HSICT framework is composed of an HSICT block, a residual CNN module, a spatial CNN block, and a CA, which enhances the diverse feature space, detailed lesion information, and long-range dependencies. The new HSICT module first integrates an abstract representation of the stem CNN and customized ICT blocks for efficient multihead attention and structured CNN layers with homogeneous (H) and structural (S) operations. The customized ICT blocks learn global contextual interactions and local texture extraction. Additionally, H and S layers learn spatial homogeneity and fine structural details by reducing noise and modeling complex morphological variations. Moreover, inverse residual learning enhances vanishing gradient, and stage-wise resolution reduction ensures scale invariance. Furthermore, the RS-CA-HSICT framework augments the learned HSICT channels with the TL-driven Residual and Spatial CNN maps for enhanced multiscale feature space capturing global and localized structural cues, subtle texture, and contrast variations. These channels, preceding augmentation, are refined through the Channel-Fusion-and-Attention block, which preserves discriminative channels while suppressing redundant ones, thereby enabling efficient computation. Finally, the spatial attention mechanism refines pixel selection to detect subtle patterns and intra-class contrast variations in Mpox. Experimental results on both the Kaggle benchmark and a diverse MPox dataset reported classification accuracy as high as 98.30% and an F1-score of 98.13%, which outperforms the existing CNNs and ViTs.
๐
2025-11-19
๐ ArXiv Preprint (cs.CY)
๐ฅ Amir Hossein Kargaran, Nafiseh Nikeghbal, Jing Yang, Nedjma Ousidhoum
Machine Learning
ๆฅ็ๆ่ฆ
Peer review is a cornerstone of scientific publishing, including at premier machine learning conferences such as ICLR. As submission volumes increase, understanding the nature and dynamics of the review process is crucial for improving its efficiency, effectiveness, and the quality of published papers. We present a large-scale analysis of the ICLR 2024 and 2025 peer review processes, focusing on before- and after-rebuttal scores and reviewer-author interactions. We examine review scores, author-reviewer engagement, temporal patterns in review submissions, and co-reviewer influence effects. Combining quantitative analyses with LLM-based categorization of review texts and rebuttal discussions, we identify common strengths and weaknesses for each rating group, as well as trends in rebuttal strategies that are most strongly associated with score changes. Our findings show that initial scores and the ratings of co-reviewers are the strongest predictors of score changes during the rebuttal, pointing to a degree of reviewer influence. Rebuttals play a valuable role in improving outcomes for borderline papers, where thoughtful author responses can meaningfully shift reviewer perspectives. More broadly, our study offers evidence-based insights to improve the peer review process, guiding authors on effective rebuttal strategies and helping the community design fairer and more efficient review processes. Our code and score changes data are available at https://github.com/papercopilot/iclr-insights.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qian'ang Mao, Yuxuan Zhang, Jiaman Chen, Wenjun Zhou, Jiaqi Yan
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
As Decentralized Finance (DeFi) develops, understanding user intent behind DeFi transactions is crucial yet challenging due to complex smart contract interactions, multifaceted on-/off-chain factors, and opaque hex logs. Existing methods lack deep semantic insight. To address this, we propose the Transaction Intent Mining (TIM) framework. TIM leverages a DeFi intent taxonomy built on grounded theory and a multi-agent Large Language Model (LLM) system to robustly infer user intents. A Meta-Level Planner dynamically coordinates domain experts to decompose multiple perspective-specific intent analyses into solvable subtasks. Question Solvers handle the tasks with multi-modal on/off-chain data. While a Cognitive Evaluator mitigates LLM hallucinations and ensures verifiability. Experiments show that TIM significantly outperforms machine learning models, single LLMs, and single Agent baselines. We also analyze core challenges in intent inference. This work helps provide a more reliable understanding of user motivations in DeFi, offering context-aware explanations for complex blockchain activity.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Michel Tokic, Slobodan Djukanoviฤ, Anja von Beuningen, Cheng Feng
ๆฅ็ๆ่ฆ
This paper introduces a classification method using in-context learning in time-series foundation models (TSFM). We show how data, which was not part of the TSFM training data corpus, can be classified without the need of finetuning the model. Examples are represented in the form of targets (class id) and covariates (data matrix) within the prompt of the model, which enables to classify an unknown covariate data pattern alongside the forecast axis through in-context learning. We apply this method to vibration data for assessing the health state of a bearing within a servo-press motor. The method transforms frequency domain reference signals into pseudo time-series patterns, generates aligned covariate and target signals, and uses the TSFM to predict probabilities how classified data corresponds to predefined labels. Leveraging the scalability of pre-trained models this method demonstrates efficacy across varied operational conditions. This marks significant progress beyond custom narrow AI solutions towards broader, AI-driven maintenance systems.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Linyin Luo, Yujuan Ding, Yunshan Ma, Wenqi Fan, Hanjiang Lai
Multimodal
ๆฅ็ๆ่ฆ
Advanced multimodal Retrieval-Augmented Generation (MRAG) techniques have been widely applied to enhance the capabilities of Large Multimodal Models (LMMs), but they also bring along novel safety issues. Existing adversarial research has revealed the vulnerability of MRAG systems to knowledge poisoning attacks, which fool the retriever into recalling injected poisoned contents. However, our work considers a different setting: visual attack of MRAG by solely adding imperceptible perturbations at the image inputs of users, without manipulating any other components. This is challenging due to the robustness of fine-tuned retrievers and large-scale generators, and the effect of visual perturbation may be further weakened by propagation through the RAG chain. We propose a novel Hierarchical Visual Attack that misaligns and disrupts the two inputs (the multimodal query and the augmented knowledge) of MRAG's generator to confuse its generation. We further design a hierarchical two-stage strategy to obtain misaligned augmented knowledge. We disrupt the image input of the retriever to make it recall irrelevant knowledge from the original database, by optimizing the perturbation which first breaks the cross-modal alignment and then disrupts the multimodal semantic alignment. We conduct extensive experiments on two widely-used MRAG datasets: OK-VQA and InfoSeek. We use CLIP-based retrievers and two LMMs BLIP-2 and LLaVA as generators. Results demonstrate the effectiveness of our visual attack on MRAG through the significant decrease in both retrieval and generation performance.
๐
2025-11-19
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Georg Goldenits, Philip Koenig, Sebastian Raubitzek, Andreas Ekelhart
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Phishing websites pose a major cybersecurity threat, exploiting unsuspecting users and causing significant financial and organisational harm. Traditional machine learning approaches for phishing detection often require extensive feature engineering, continuous retraining, and costly infrastructure maintenance. At the same time, proprietary large language models (LLMs) have demonstrated strong performance in phishing-related classification tasks, but their operational costs and reliance on external providers limit their practical adoption in many business environments. This paper investigates the feasibility of small language models (SLMs) for detecting phishing websites using only their raw HTML code. A key advantage of these models is that they can be deployed on local infrastructure, providing organisations with greater control over data and operations. We systematically evaluate 15 commonly used Small Language Models (SLMs), ranging from 1 billion to 70 billion parameters, benchmarking their classification accuracy, computational requirements, and cost-efficiency. Our results highlight the trade-offs between detection performance and resource consumption, demonstrating that while SLMs underperform compared to state-of-the-art proprietary LLMs, they can still provide a viable and scalable alternative to external LLM services. By presenting a comparative analysis of costs and benefits, this work lays the foundation for future research on the adaptation, fine-tuning, and deployment of SLMs in phishing detection systems, aiming to balance security effectiveness and economic practicality.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Amir Rezaei Balef, Mykhailo Koshil, Katharina Eggensperger
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite the architectural similarities between tabular in-context learning (ICL) models and large language models (LLMs), little is known about how individual layers contribute to tabular prediction. In this paper, we investigate how the latent spaces evolve across layers in tabular ICL models, identify potential redundant layers, and compare these dynamics with those observed in LLMs. We analyze TabPFN and TabICL through the "layers as painters" perspective, finding that only subsets of layers share a common representational language, suggesting structural redundancy and offering opportunities for model compression and improved interpretability.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Arjun Gangwar, Kaousheik Jayakumar, S. Umesh
ๆฅ็ๆ่ฆ
This paper describes the systems developed by SPRING Lab, Indian Institute of Technology Madras, for the ASRU MADASR 2.0 challenge. The systems developed focuses on adapting ASR systems to improve in predicting the language and dialect of the utterance among 8 languages across 33 dialects. We participated in Track 1 and Track 2, which restricts the use of additional data and develop from-the-scratch multilingual systems. We presented a novel training approach using Multi-Decoder architecture with phonemic Common Label Set (CLS) as intermediate representation. It improved the performance over the baseline (in the CLS space). We also discuss various methods used to retain the gain obtained in the phonemic space while converting them back to the corresponding grapheme representations. Our systems beat the baseline in 3 languages (Track 2) in terms of WER/CER and achieved the highest language ID and dialect ID accuracy among all participating teams (Track 2).
๐
2025-11-19
๐ IROS 2025
๐ฅ Mingyang Feng, Shaoyuan Li, Xiang Yin
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We investigate the sampling-based optimal path planning problem for robotics in complex and dynamic environments. Most existing sampling-based algorithms neglect environmental information or the information from previous samples. Yet, these pieces of information are highly informative, as leveraging them can provide better heuristics when sampling the next state. In this paper, we propose a novel sampling-based planning algorithm, called \emph{RRT*former}, which integrates the standard RRT* algorithm with a Transformer network in a novel way. Specifically, the Transformer is used to extract features from the environment and leverage information from previous samples to better guide the sampling process. Our extensive experiments demonstrate that, compared to existing sampling-based approaches such as RRT*, Neural RRT*, and their variants, our algorithm achieves considerable improvements in both the optimality of the path and sampling efficiency. The code for our implementation is available on https://github.com/fengmingyang666/RRTformer.
๐
2025-11-19
๐ 13 pages,9 figures. This work has been submitted t
๐ฅ Shanlin Zhou, Xinpeng Wang, Jianxun Lian, Zhenghao Liu, Laks V. S. Lakshmanan et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Trained on diverse human-authored texts, Large Language Models (LLMs) unlocked the potential for Creative Natural Language Generation (CNLG), benefiting various applications like advertising and storytelling. Nevertheless, CNLG still remains difficult due to two main challenges. (1) Multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, which LLMs struggle to satisfy simultaneously; (2) Interpretive complexity: beyond generation, creativity also involves understanding and interpreting implicit meaning to enhance users' perception. These challenges significantly limit current methods, especially in short-form text generation, in generating creative and insightful content. To address this, we focus on Chinese baby naming, a representative short-form CNLG task requiring adherence to explicit user constraints (e.g., length, semantics, anthroponymy) while offering meaningful aesthetic explanations. We propose NAMeGEn, a novel multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation to meet diverse requirements and generate accurate explanations. To support this task, we further construct a classical Chinese poetry corpus with 17k+ poems to enhance aesthetics, and introduce CBNames, a new benchmark with tailored metrics. Extensive experiments demonstrate that NAMeGEn effectively generates creative names that meet diverse, personalized requirements while providing meaningful explanations, outperforming six baseline methods spanning various LLM backbones without any training.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, Utility, from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Sirui Chen, Mengshi Zhao, Lei Xu, Yuying Zhao, Beier Zhu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Nevertheless, there still lacks systematic definition of LLM agent efficiency, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Byungho Jo
MultimodalRobotics
ๆฅ็ๆ่ฆ
Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Trevor McInroe
Machine Learning
ๆฅ็ๆ่ฆ
We introduce Terra Nova, a new comprehensive challenge environment (CCE) for reinforcement learning (RL) research inspired by Civilization V. A CCE is a single environment in which multiple canonical RL challenges (e.g., partial observability, credit assignment, representation learning, enormous action spaces, etc.) arise simultaneously. Mastery therefore demands integrated, long-horizon understanding across many interacting variables. We emphasize that this definition excludes challenges that only aggregate unrelated tasks in independent, parallel streams (e.g., learning to play all Atari games at once). These aggregated multitask benchmarks primarily asses whether an agent can catalog and switch among unrelated policies rather than test an agent's ability to perform deep reasoning across many interacting challenges.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lingxiang Wang, Hainan Zhang, Zhiming Zheng
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Domain-specific post-training often causes catastrophic forgetting, making foundation models lose their general reasoning ability and limiting their adaptability to dynamic real-world environments. Preserving general capabilities while acquiring downstream domain knowledge is a central challenge for large language and multimodal models. Traditional continual learning methods, such as regularization, replay and architectural isolation, suffer from poor downstream performance, reliance on inaccessible historical data, or additional parameter overhead. While recent parameter-efficient tuning (PET) methods can alleviate forgetting, their effectiveness strongly depends on the choice of parameters and update strategies. In this paper, we introduce PIECE, a Parameter Importance Estimation-based Continual Enhancement method that preserves general ability while efficiently learning domain knowledge without accessing prior training data or increasing model parameters. PIECE selectively updates only 0.1% of core parameters most relevant to new tasks, guided by two importance estimators: PIECE-F based on Fisher Information, and PIECE-S based on a second-order normalization that combines gradient and curvature information. Experiments across three language models and two multimodal models show that PIECE maintains general capabilities and achieves state-of-the-art continual learning performance across diverse downstream tasks. Our results highlight a practical path to scalable, domain-adaptive foundation models without catastrophic forgetting.
๐
2025-11-19
๐ The manuscript is currently ongoing the underrevie
๐ฅ Guoqiang Liang, Jingqian Gong, Mengxuan Li, Gege Lin, Shuo Zhang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge-enhanced retrieval augmented generation, fine tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI agent based model for scientific evaluation, and presents new research fronts detection and knowledge graph building methods with LLMs.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gihwan Kim, Jemin Lee, Hyungshin Kim
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44\%p (avg. 1.78\%p) top-1 accuracy improvement for image classification, 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code https://github.com/gihwan-kim/IPTQ-ViT.git.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yifu Guo, Zishan Xu, Zhiyuan Yao, Yuquan Lu, Jiaye Lin et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways-whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.
๐
2025-11-19
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Shan Shan
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Achieving Sustainable Development Goal 7 (Affordable and Clean Energy) requires not only technological innovation but also a deeper understanding of the socioeconomic factors influencing energy access and carbon emissions. While these factors are gaining attention, critical questions remain, particularly regarding how to quantify their impacts on energy systems, model their cross-domain interactions, and capture feedback dynamics in the broader context of energy transitions. To address these gaps, this study introduces ClimateAgents, an AI-based framework that combines large language models with domain-specialized agents to support hypothesis generation and scenario exploration. Leveraging 20 years of socioeconomic and emissions data from 265 economies, countries and regions, and 98 indicators drawn from the World Bank database, the framework applies a machine learning based causal inference approach to identify key determinants of carbon emissions in an evidence-based, data driven manner. The analysis highlights three primary drivers: access to clean cooking fuels in rural areas, access to clean cooking fuels in urban areas, and the percentage of population living in urban areas. These findings underscore the critical role of clean cooking technologies and urbanization patterns in shaping emission outcomes. In line with growing calls for evidence-based AI policy, ClimateAgents offers a modular and reflexive learning system that supports the generation of credible and actionable insights for policy. By integrating heterogeneous data modalities, including structured indicators, policy documents, and semantic reasoning, the framework contributes to adaptive policymaking infrastructures that can evolve with complex socio-technical challenges. This approach aims to support a shift from siloed modeling to reflexive, modular systems designed for dynamic, context-aware climate action.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kadir-Kaan รzer, Renรฉ Ebeling, Markus Enzweiler
ๆฅ็ๆ่ฆ
Automotive telemetry data exhibits slow drifts and fast spikes, often within the same sequence, making reliable anomaly detection challenging. Standard reconstruction-based methods, including sequence variational autoencoders (VAEs), use a single latent process and therefore mix heterogeneous time scales, which can smooth out spikes or inflate variances and weaken anomaly separation.
In this paper, we present STREAM-VAE, a variational autoencoder for anomaly detection in automotive telemetry time-series data. Our model uses a dual-path encoder to separate slow drift and fast spike signal dynamics, and a decoder that represents transient deviations separately from the normal operating pattern. STREAM-VAE is designed for deployment, producing stable anomaly scores across operating modes for both in-vehicle monitors and backend fleet analytics.
Experiments on an automotive telemetry dataset and the public SMD benchmark show that explicitly separating drift and spike dynamics improves robustness compared to strong forecasting, attention, graph, and VAE baselines.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Jonas De Maeyer, Hossein Yarahmadi, Moharram Challenger
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Path planning in dynamic environments is a fundamental challenge in intelligent transportation and robotics, where obstacles and conditions change over time, introducing uncertainty and requiring continuous adaptation. While existing approaches often assume complete environmental unpredictability or rely on global planners, these assumptions limit scalability and practical deployment in real-world settings. In this paper, we propose a scalable, region-aware reinforcement learning (RL) framework for path planning in dynamic environments. Our method builds on the observation that environmental changes, although dynamic, are often localized within bounded regions. To exploit this, we introduce a hierarchical decomposition of the environment and deploy distributed RL agents that adapt to changes locally. We further propose a retraining mechanism based on sub-environment success rates to determine when policy updates are necessary. Two training paradigms are explored: single-agent Q-learning and multi-agent federated Q-learning, where local Q-tables are aggregated periodically to accelerate the learning process. Unlike prior work, we evaluate our methods in more realistic settings, where multiple simultaneous obstacle changes and increasing difficulty levels are present. Results show that the federated variants consistently outperform their single-agent counterparts and closely approach the performance of A* Oracle while maintaining shorter adaptation times and robust scalability. Although initial training remains time-consuming in large environments, our decentralized framework eliminates the need for a global planner and lays the groundwork for future improvements using deep RL and flexible environment decomposition.
๐
2025-11-19
๐ AAAI
๐ฅ Ninell Oldenburg, Ruchira Dhar, Anders Sรธgaard
ๆฅ็ๆ่ฆ
In this paper, we argue that current AI research operates on a spectrum between two different underlying conceptions of intelligence: Intelligence Realism, which holds that intelligence represents a single, universal capacity measurable across all systems, and Intelligence Pluralism, which views intelligence as diverse, context-dependent capacities that cannot be reduced to a single universal measure. Through an analysis of current debates in AI research, we demonstrate how the conceptions remain largely implicit yet fundamentally shape how empirical evidence gets interpreted across a wide range of areas. These underlying views generate fundamentally different research approaches across three areas. Methodologically, they produce different approaches to model selection, benchmark design, and experimental validation. Interpretively, they lead to contradictory readings of the same empirical phenomena, from capability emergence to system limitations. Regarding AI risk, they generate categorically different assessments: realists view superintelligence as the primary risk and search for unified alignment solutions, while pluralists see diverse threats across different domains requiring context-specific solutions. We argue that making explicit these underlying assumptions can contribute to a clearer understanding of disagreements in AI research.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Alexander Boldachev
Robotics
ๆฅ็ๆ่ฆ
This paper compares two distinct approaches to modeling robotic behavior: imperative Behavior Trees (BTs) and declarative Executable Ontologies (EO), implemented through the boldsea framework. BTs structure behavior hierarchically using control-flow, whereas EO represents the domain as a temporal, event-based semantic graph driven by dataflow rules. We demonstrate that EO achieves comparable reactivity and modularity to BTs through a fundamentally different architecture: replacing polling-based tick execution with event-driven state propagation. We propose that EO offers an alternative framework, moving from procedural programming to semantic domain modeling, to address the semantic-process gap in traditional robotic control. EO supports runtime model modification, full temporal traceability, and a unified representation of data, logic, and interface - features that are difficult or sometimes impossible to achieve with BTs, although BTs excel in established, predictable scenarios. The comparison is grounded in a practical mobile manipulation task. This comparison highlights the respective operational strengths of each approach in dynamic, evolving robotic systems.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Philipp Wiesner, Daniel W. O'Neill, Francesca Larosa, Odej Kao
Machine Learning
ๆฅ็ๆ่ฆ
AI research is increasingly moving toward complex problem solving, where models are optimized not only for pattern recognition but for multi-step reasoning. Historically, computing's global energy footprint has been stabilized by sustained efficiency gains and natural saturation thresholds in demand. But as efficiency improvements are approaching physical limits, emerging reasoning AI lacks comparable saturation points: performance is no longer limited by the amount of available training data but continues to scale with exponential compute investments in both training and inference. This paper argues that efficiency alone will not lead to sustainable reasoning AI and discusses research and policy directions to embed explicit limits into the optimization and governance of such systems.
๐
2025-11-19
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Sirui Chen, Jinsong Zhou, Xinli Xu, Xiaoyu Yang, Litao Guo et al.
Multimodal
ๆฅ็ๆ่ฆ
Effective presentation skills are essential in education, professional communication, and public speaking, yet learners often lack access to high-quality exemplars or personalized coaching. Existing AI tools typically provide isolated functionalities such as speech scoring or script generation without integrating reference modeling and interactive feedback into a cohesive learning experience. We introduce a dual-agent system that supports presentation practice through two complementary roles: the Ideal Presentation Agent and the Coach Agent. The Ideal Presentation Agent converts user-provided slides into model presentation videos by combining slide processing, visual-language analysis, narration script generation, personalized voice synthesis, and synchronized video assembly. The Coach Agent then evaluates user-recorded presentations against these exemplars, conducting multimodal speech analysis and delivering structured feedback in an Observation-Impact-Suggestion (OIS) format. To enhance the authenticity of the learning experience, the Coach Agent incorporates an Audience Agent, which simulates the perspective of a human listener and provides humanized feedback reflecting audience reactions and engagement. Together, these agents form a closed loop of observation, practice, and feedback. Implemented on a robust backend with multi-model integration, voice cloning, and error handling mechanisms, the system demonstrates how AI-driven agents can provide engaging, human-centered, and scalable support for presentation skill development in both educational and professional contexts.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kai Yang, Xin Xu, Yangkun Chen, Weijie Liu, Jiafei Lyu et al.
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context, as it controls exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, as the training process involves a mix of positive and negative samples, each affecting entropy in different ways across steps. To address this, we propose Entropy stablilization via Proportional-Integral Control (EntroPIC), a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress. We provide a comprehensive theoretical analysis for both on-policy and off-policy learning settings, demonstrating that EntroPIC is effective at controlling entropy in large-scale LLM training. Experimental results show that our method successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xinli Tao, Xin Dong, Xuezhong Zhou
Natural Language Processing
ๆฅ็ๆ่ฆ
Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA's three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov et al.
Robotics
ๆฅ็ๆ่ฆ
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Current state of the art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios. For this, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric combining large language models with reasoning, knowledge based mapping and vision-language models to overcome these limitations. The architecture is comprised of three main stages: (1) feature extraction of spatial and semantic information with multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models for structural and relational constraints (e.g., alignment, position, consistency) enforcement.
๐
2025-11-19
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Zimo Ji, Xunguang Wang, Zongjie Li, Pingchuan Ma, Yudong Gao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model (LLM)-based agents with function-calling capabilities are increasingly deployed, but remain vulnerable to Indirect Prompt Injection (IPI) attacks that hijack their tool calls. In response, numerous IPI-centric defense frameworks have emerged. However, these defenses are fragmented, lacking a unified taxonomy and comprehensive evaluation. In this Systematization of Knowledge (SoK), we present the first comprehensive analysis of IPI-centric defense frameworks. We introduce a comprehensive taxonomy of these defenses, classifying them along five dimensions. We then thoroughly assess the security and usability of representative defense frameworks. Through analysis of defensive failures in the assessment, we identify six root causes of defense circumvention. Based on these findings, we design three novel adaptive attacks that significantly improve attack success rates targeting specific frameworks, demonstrating the severity of the flaws in these defenses. Our paper provides a foundation and critical insights for the future development of more secure and usable IPI-centric agent defense frameworks.
๐
2025-11-19
๐ NeurIPS 2025
๐ฅ Yinsheng Wang, Tario G You, Lรฉonard Boussioux, Shan Liu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper introduces SOLID (Synergizing Optimization and Large Language Models for Intelligent Decision-Making), a novel framework that integrates mathematical optimization with the contextual capabilities of large language models (LLMs). SOLID facilitates iterative collaboration between optimization and LLMs agents through dual prices and deviation penalties. This interaction improves the quality of the decisions while maintaining modularity and data privacy. The framework retains theoretical convergence guarantees under convexity assumptions, providing insight into the design of LLMs prompt. To evaluate SOLID, we applied it to a stock portfolio investment case with historical prices and financial news as inputs. Empirical results demonstrate convergence under various scenarios and indicate improved annualized returns compared to a baseline optimizer-only method, validating the synergy of the two agents. SOLID offers a promising framework for advancing automated and intelligent decision-making across diverse domains.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Jian Deng, Yuandong Wang, Yangfu Zhu, Tao Feng, Tianyu Wo et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Robotic manipulation systems are increasingly deployed across diverse domains. Yet existing multi-modal learning frameworks lack inherent guarantees of geometric consistency, struggling to handle spatial transformations such as rotations and translations. While recent works attempt to introduce equivariance through bespoke architectural modifications, these methods suffer from high implementation complexity, computational cost, and poor portability. Inspired by human cognitive processes in spatial reasoning, we propose Eq.Bot, a universal canonicalization framework grounded in SE(2) group equivariant theory for robotic manipulation learning. Our framework transforms observations into a canonical space, applies an existing policy, and maps the resulting actions back to the original space. As a model-agnostic solution, Eq.Bot aims to endow models with spatial equivariance without requiring architectural modifications. Extensive experiments demonstrate the superiority of Eq.Bot under both CNN-based (e.g., CLIPort) and Transformer-based (e.g., OpenVLA-OFT) architectures over existing methods on various robotic manipulation tasks, where the most significant improvement can reach 50.0%.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haodong Li, Jingqi Zhang, Xiao Cheng, Peihua Mai, Haoyu Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The remarkable language ability of Large Language Models (LLMs) stems from extensive training on vast datasets, often including copyrighted material, which raises serious concerns about unauthorized use. While Membership Inference Attacks (MIAs) offer potential solutions for detecting such violations, existing approaches face critical limitations and challenges due to LLMs' inherent overconfidence, limited access to ground truth training data, and reliance on empirically determined thresholds.
We present COPYCHECK, a novel framework that leverages uncertainty signals to detect whether copyrighted content was used in LLM training sets. Our method turns LLM overconfidence from a limitation into an asset by capturing uncertainty patterns that reliably distinguish between ``seen" (training data) and ``unseen" (non-training data) content. COPYCHECK further implements a two-fold strategy: (1) strategic segmentation of files into smaller snippets to reduce dependence on large-scale training data, and (2) uncertainty-guided unsupervised clustering to eliminate the need for empirically tuned thresholds. Experiment results show that COPYCHECK achieves an average balanced accuracy of 90.1% on LLaMA 7b and 91.6% on LLaMA2 7b in detecting seen files. Compared to the SOTA baseline, COPYCHECK achieves over 90% relative improvement, reaching up to 93.8\% balanced accuracy. It further exhibits strong generalizability across architectures, maintaining high performance on GPT-J 6B. This work presents the first application of uncertainty for copyright detection in LLMs, offering practical tools for training data transparency.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhiyi Duan, Zixing Shi, Hongyu Yuan, Qi Wang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Knowledge Tracing (KT) aims to mine students' evolving knowledge states and predict their future question-answering performance. Existing methods based on heterogeneous information networks (HINs) are prone to introducing noises due to manual or random selection of meta-paths and lack necessary quality assessment of meta-path instances. Conversely, recent large language models (LLMs)-based methods ignore the rich information across students, and both paradigms struggle to deliver consistently accurate and evidence-based explanations. To address these issues, we propose an innovative framework, HIN-LLM Synergistic Enhanced Knowledge Tracing (HISE-KT), which seamlessly integrates HINs with LLMs. HISE-KT first builds a multi-relationship HIN containing diverse node types to capture the structural relations through multiple meta-paths. The LLM is then employed to intelligently score and filter meta-path instances and retain high-quality paths, pioneering automated meta-path quality assessment. Inspired by educational psychology principles, a similar student retrieval mechanism based on meta-paths is designed to provide a more valuable context for prediction. Finally, HISE-KT uses a structured prompt to integrate the target student's history with the retrieved similar trajectories, enabling the LLM to generate not only accurate predictions but also evidence-backed, explainable analysis reports. Experiments on four public datasets show that HISE-KT outperforms existing KT baselines in both prediction performance and interpretability.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuxuan Gu, Weimin Bai, Yifei Wang, Weijian Luo, He Sun
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. Such decoupled structure not only harm the generation efficiency but also hinder the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model post-training.To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) an efficient RL framework for masked auto-regressive models via MARVAL-RL. On ImageNet 256*256, MARVAL-Huge achieves an FID of 2.00 with more than 30 times speedup compared with MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignments.
๐
2025-11-19
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Subhashis Hazarika, Leonard Lupin-Jimenez, Rohit Vuppala, Ashesh Chattopadhyay, Hon Yung Wong
ๆฅ็ๆ่ฆ
Efficient and sustainable maritime transport increasingly depends on reliable forecasting and adaptive routing, yet operational adoption remains difficult due to forecast latencies and the need for human judgment in rapid decision-making under changing ocean conditions. We introduce SWR-Viz, an AI-assisted visual analytics framework that combines a physics-informed Fourier Neural Operator wave forecast model with SIMROUTE-based routing and interactive emissions analytics. The framework generates near-term forecasts directly from current conditions, supports data assimilation with sparse observations, and enables rapid exploration of what-if routing scenarios. We evaluate the forecast models and SWR-Viz framework along key shipping corridors in the Japan Coast and Gulf of Mexico, showing both improved forecast stability and realistic routing outcomes comparable to ground-truth reanalysis wave products. Expert feedback highlights the usability of SWR-Viz, its ability to isolate voyage segments with high emission reduction potential, and its value as a practical decision-support system. More broadly, this work illustrates how lightweight AI forecasting can be integrated with interactive visual analytics to support human-centered decision-making in complex geospatial and environmental domains.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yi Xu, Zhigang Chen, Rui Wang, Yangfan Li, Fengxiao Tang et al.
ๆฅ็ๆ่ฆ
In industrial equipment monitoring, fault diagnosis is critical for ensuring system reliability and enabling predictive maintenance. However, the scarcity of fault data, due to the rarity of fault events and the high cost of data annotation, significantly hinders data-driven approaches. Existing time-series generation models, optimized for abundant normal data, struggle to capture fault distributions in few-shot scenarios, producing samples that lack authenticity and diversity due to the large domain gap and high intra-class variability of faults. To address this, we propose a novel few-shot fault time-series generation framework based on diffusion models. Our approach employs a positive-negative difference adapter, leveraging pre-trained normal data distributions to model the discrepancies between normal and fault domains for accurate fault synthesis. Additionally, a diversity loss is introduced to prevent mode collapse, encouraging the generation of diverse fault samples through inter-sample difference regularization. Experimental results demonstrate that our model significantly outperforms traditional methods in authenticity and diversity, achieving state-of-the-art performance on key benchmarks.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xin Gao, Shaohan Yu, Zerui Chen, Yueming Lyu, Weichen Yu et al.
ๆฅ็ๆ่ฆ
Large Reasoning Models (LRMs) improve answer quality through explicit chain-of-thought, yet this very capability introduces new safety risks: harmful content can be subtly injected, surface gradually, or be justified by misleading rationales within the reasoning trace. Existing safety evaluations, however, primarily focus on output-level judgments and rarely capture these dynamic risks along the reasoning process. In this paper, we present SafeRBench, the first benchmark that assesses LRM safety end-to-end -- from inputs and intermediate reasoning to final outputs. (1) Input Characterization: We pioneer the incorporation of risk categories and levels into input design, explicitly accounting for affected groups and severity, and thereby establish a balanced prompt suite reflecting diverse harm gradients. (2) Fine-Grained Output Analysis: We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units, enabling fine-grained evaluation across ten safety dimensions. (3) Human Safety Alignment: We validate LLM-based evaluations against human annotations specifically designed to capture safety judgments. Evaluations on 19 LRMs demonstrate that SafeRBench enables detailed, multidimensional safety assessment, offering insights into risks and protective mechanisms from multiple perspectives.
๐
2025-11-19
๐ ArXiv Preprint (cs.SE)
๐ฅ Nguyen-Khang Le, Nguyen Hiep, Minh Nguyen, Son Luu, Trung Vo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Automated web application testing is a critical component of modern software development, with frameworks like Selenium widely adopted for validating functionality through browser automation. Among the essential aspects of such testing is the ability to interact with and validate web forms, a task that requires syntactically correct, executable scripts with high coverage of input fields. Despite its importance, this task remains underexplored in the context of large language models (LLMs), and no public benchmark or dataset exists to evaluate LLMs on form interaction generation systematically. This paper introduces a novel method for training LLMs to generate high-quality test cases in Selenium, specifically targeting form interaction testing. We curate both synthetic and human-annotated datasets for training and evaluation, covering diverse real-world forms and testing scenarios. We define clear metrics for syntax correctness, script executability, and input field coverage. Our empirical study demonstrates that our approach significantly outperforms strong baselines, including GPT-4o and other popular LLMs, across all evaluation metrics. Our work lays the groundwork for future research on LLM-based web testing and provides resources to support ongoing progress in this area.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jing Cao, Kui Jiang, Shenyi Li, Xiaocheng Feng, Yong Huang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Self-supervised depth estimation has gained significant attention in autonomous driving and robotics. However, existing methods exhibit substantial performance degradation under adverse weather conditions such as rain and fog, where reduced visibility critically impairs depth prediction. To address this issue, we propose a novel self-evolution contrastive learning framework called SEC-Depth for self-supervised robust depth estimation tasks. Our approach leverages intermediate parameters generated during training to construct temporally evolving latency models. Using these, we design a self-evolution contrastive scheme to mitigate performance loss under challenging conditions. Concretely, we first design a dynamic update strategy of latency models for the depth estimation task to capture optimization states across training stages. To effectively leverage latency models, we introduce a self-evolution contrastive Loss (SECL) that treats outputs from historical latency models as negative samples. This mechanism adaptively adjusts learning objectives while implicitly sensing weather degradation severity, reducing the needs for manual intervention. Experiments show that our method integrates seamlessly into diverse baseline models and significantly enhances robustness in zero-shot evaluations.
๐
2025-11-19
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Jingzhuo Zhou
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The rapid proliferation of Multimodal Large Language Models (MLLMs) has introduced unprecedented security challenges, particularly in phishing detection within academic environments. Academic institutions and researchers are high-value targets, facing dynamic, multilingual, and context-dependent threats that leverage research backgrounds, academic collaborations, and personal information to craft highly tailored attacks. Existing security benchmarks largely rely on datasets that do not incorporate specific academic background information, making them inadequate for capturing the evolving attack patterns and human-centric vulnerability factors specific to academia. To address this gap, we present AdapT-Bench, a unified methodological framework and benchmark suite for systematically evaluating MLLM defense capabilities against dynamic phishing attacks in academic settings.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Yang Wu, Rujing Yao, Tong Zhang, Yufei Shi, Zhuoren Jiang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students' knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student's mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students' Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student's mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
๐
2025-11-19
๐ ArXiv Preprint (eess.SP)
๐ฅ Ahmed Aboulfotouh, Hatem Abou-Zeid
Multimodal
ๆฅ็ๆ่ฆ
Wireless foundation models (WFMs) have recently demonstrated promising capabilities, jointly performing multiple wireless functions and adapting effectively to new environments. However, while current WFMs process only one modality, depending on the task and operating conditions, the most informative modality changes and no single modality is best for all tasks. WFMs should therefore be designed to accept multiple modalities to enable a broader and more diverse range of tasks and scenarios. In this work, we propose and build the first multimodal wireless foundation model capable of processing both raw IQ streams and image-like wireless modalities (e.g., spectrograms and CSI) and performing multiple tasks across both. We introduce masked wireless modeling for the multimodal setting, a self-supervised objective and pretraining recipe that learns a joint representation from IQ streams and image-like wireless modalities. We evaluate the model on five tasks across both modality families: image-based (human activity sensing, RF signal classification, 5G NR positioning) and IQ-based (RF device fingerprinting, interference detection/classification). The multimodal WFM is competitive with single-modality WFMs, and in several cases surpasses their performance. Our results demonstrates the strong potential of developing multimodal WFMs that support diverse wireless tasks across different modalities. We believe this provides a concrete step toward both AI-native 6G and the vision of joint sensing, communication, and localization.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Firdavs Nasriddinov, Rafal Kocielnik, Anima Anandkumar, Andrew J. Hung
Natural Language Processing
ๆฅ็ๆ่ฆ
High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meihua Zhou, Xinyu Tong, Jiarui Zhao, Min Cheng, Li Yang et al.
ๆฅ็ๆ่ฆ
High-dimensional neuroimaging analyses for clinical diagnosis are often constrained by compromises in spatiotemporal fidelity and by the limited adaptability of large-scale, general-purpose models. To address these challenges, we introduce Dynamic Curriculum Learning for Spatiotemporal Encoding (DCL-SE), an end-to-end framework centered on data-driven spatiotemporal encoding (DaSE). We leverage Approximate Rank Pooling (ARP) to efficiently encode three-dimensional volumetric brain data into information-rich, two-dimensional dynamic representations, and then employ a dynamic curriculum learning strategy, guided by a Dynamic Group Mechanism (DGM), to progressively train the decoder, refining feature extraction from global anatomical structures to fine pathological details. Evaluated across six publicly available datasets, including Alzheimer's disease and brain tumor classification, cerebral artery segmentation, and brain age prediction, DCL-SE consistently outperforms existing methods in accuracy, robustness, and interpretability. These findings underscore the critical importance of compact, task-specific architectures in the era of large-scale pretrained networks.
๐
2025-11-19
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Sunwoo Kim, Geon Lee, Kyungho Kim, Jaemin Yoo, Kijung Shin
Natural Language Processing
ๆฅ็ๆ่ฆ
Recently, large language models (LLMs) have been widely used as recommender systems, owing to their strong reasoning capability and their effectiveness in handling cold-start items. To better adapt LLMs for recommendation, retrieval-augmented generation (RAG) has been incorporated. Most existing RAG methods are user-based, retrieving purchase patterns of users similar to the target user and providing them to the LLM. In this work, we propose ItemRAG, an item-based RAG method for LLM-based recommendation that retrieves relevant items (rather than users) from item-item co-purchase histories. ItemRAG helps LLMs capture co-purchase patterns among items, which are beneficial for recommendations. Especially, our retrieval strategy incorporates semantically similar items to better handle cold-start items and uses co-purchase frequencies to improve the relevance of the retrieved items. Through extensive experiments, we demonstrate that ItemRAG consistently (1) improves the zero-shot LLM-based recommender by up to 43% in Hit-Ratio-1 and (2) outperforms user-based RAG baselines under both standard and cold-start item recommendation settings.
๐
2025-11-19
๐ ArXiv Preprint (q-bio.GN)
๐ฅ Amit Kumar, Maninder Kaur, Raghvendra Mall, Sukrit Gupta
ๆฅ็ๆ่ฆ
Spatial Transcriptomics enables mapping of gene expression within its native tissue context, but current platforms measure only a limited set of genes due to experimental constraints and excessive costs. To overcome this, computational models integrate Single-Cell RNA Sequencing data with Spatial Transcriptomics to predict unmeasured genes. We propose CASPER, a cross-attention based framework that predicts unmeasured gene expression in Spatial Transcriptomics by leveraging centroid-level representations from Single-Cell RNA Sequencing. We performed rigorous testing over four state-of-the-art Spatial Transcriptomics/Single-Cell RNA Sequencing dataset pairs across four existing baseline models. CASPER shows significant improvement in nine out of the twelve metrics for our experiments. This work paves the way for further work in Spatial Transcriptomics to Single-Cell RNA Sequencing modality translation. The code for CASPER is available at https://github.com/AI4Med-Lab/CASPER.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaoxuan Wang, Bo Liu, Song Jiang, Jingzhou Liu, Jingyuan Qi et al.
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
The reasoning capabilities of large language models (LLMs) have been significantly improved through reinforcement learning (RL). Nevertheless, LLMs still struggle to consistently verify their own reasoning traces. This raises the research question of how to enhance the self-verification ability of LLMs and whether such an ability can further improve reasoning performance. In this work, we propose GRPO-Verif, an algorithm that jointly optimizes solution generation and self-verification within a unified loss function, with an adjustable hyperparameter controlling the weight of the verification signal. Experimental results demonstrate that our method enhances self-verification capability while maintaining comparable performance in reasoning.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Fuwei Zhang, Xiaoyu Liu, Dongbo Xi, Jishen Yin, Huan Chen et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users' historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, we propose Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multimodal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, we first introduce cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of our GR model, we incorporate multi-aspect cross-modal alignments, including the implicit and explicit alignments. Finally, we conduct extensive experiments on three well-known recommendation datasets to demonstrate the effectiveness of our proposed method.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Bohan Zhang, Zihao Wang, Hengyu Fu, Jason D. Lee
Machine Learning
ๆฅ็ๆ่ฆ
In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussian Multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in \mathbb{R}^{r\times d}$, which is the canonical setup to study representation learning. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. The sample and time complexity both align with the information-theoretic limit up to leading order and are therefore optimal. During the first stage of gradient descent learning, the proof proceeds via showing that the inner weights can perform a power-iteration process. This process implicitly mimics a spectral start for the whole span of the hidden subspace and eventually eliminates finite-sample noise and recovers this span. It surprisingly indicates that optimal results can only be achieved if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to effectively learn hierarchical functions with respect to both sample and time efficiency.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wei-hsiang Yen, Lyn Chao-ling Chen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The innovation of the study is that the deep learning method and sentiment analysis are integrated in traditional business model analysis and forecasting, and the research subject is TSMC for industry trend prediction of semiconductor industry in Taiwan. For the rapid market changes and development of wafer technologies of semiconductor industry, traditional data analysis methods not perform well in the high variety and time series data. Textual data and time series data were collected from seasonal reports of TSMC including financial information. Textual data through sentiment analysis by considering the event intervention both from internal events of the company and the external global events. Using the sentiment-enhanced time series data, the LSTM model was adopted for predicting industry trend of TSMC. The prediction results reveal significant development of wafer technology of TSMC and the potential threatens in the global market, and matches the product released news of TSMC and the international news. The contribution of the work performed accurately in industry trend prediction of the semiconductor industry by considering both the internal and external event intervention, and the prediction results provide valuable information of semiconductor industry both in research and business aspects.
๐
2025-11-19
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Ting-An Lin, Pei-Lin Tsai, Yi-An Chen, Feng-Yu Chen, Lyn Chao-ling Chen
Robotics
ๆฅ็ๆ่ฆ
In the study, the device of social robot was designed for visually impaired users, and along with a mobile application for provide functions to assist their lives. Both physical and mental conditions of visually impaired users are considered, and the mobile application provides functions: photo record, mood lift, greeting guest and today highlight. The application was designed for visually impaired users, and uses voice control to provide a friendly interface. Photo record function allows visually impaired users to capture image immediately when they encounter danger situations. Mood lift function accompanies visually impaired users by asking questions, playing music and reading articles. Greeting guest function answers to the visitors for the inconvenient physical condition of visually impaired users. In addition, today highlight function read news including weather forecast, daily horoscopes and daily reminder for visually impaired users. Multiple tools were adopted for developing the mobile application, and a website was developed for caregivers to check statues of visually impaired users and for marketing of the application.
๐
2025-11-19
๐ ArXiv Preprint (cs.SE)
๐ฅ Yuan Jiang, Zehao Li, Shan Huang, Christoph Treude, Xiaohong Su et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Membership inference attacks (MIAs) on code completion models offer an effective way to assess privacy risks by inferring whether a given code snippet was part of the training data. Existing black- and gray-box MIAs rely on expensive surrogate models or manually crafted heuristic rules, which limit their ability to capture the nuanced memorization patterns exhibited by over-parameterized code language models. To address these challenges, we propose AdvPrompt-MIA, a method specifically designed for code completion models, combining code-specific adversarial perturbations with deep learning. The core novelty of our method lies in designing a series of adversarial prompts that induce variations in the victim code model's output. By comparing these outputs with the ground-truth completion, we construct feature vectors to train a classifier that automatically distinguishes member from non-member samples. This design allows our method to capture richer memorization patterns and accurately infer training set membership. We conduct comprehensive evaluations on widely adopted models, such as Code Llama 7B, over the APPS and HumanEval benchmarks. The results show that our approach consistently outperforms state-of-the-art baselines, with AUC gains of up to 102%. In addition, our method exhibits strong transferability across different models and datasets, underscoring its practical utility and generalizability.
๐
2025-11-19
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Vineeth Sai Narajala, Manish Bhatt, Idan Habler, Ronald F. Del Rosario
MultimodalRobotics
ๆฅ็ๆ่ฆ
The AI trustworthiness crisis threatens to derail the artificial intelligence revolution, with regulatory barriers, security vulnerabilities, and accountability gaps preventing deployment in critical domains. Current AI systems operate on opaque data structures that lack the audit trails, provenance tracking, or explainability required by emerging regulations like the EU AI Act. We propose an artifact-centric AI agent paradigm where behavior is driven by persistent, verifiable data artifacts rather than ephemeral tasks, solving the trustworthiness problem at the data architecture level. Central to this approach is the Multimodal Artifact File Format (MAIF), an AI-native container embedding semantic representations, cryptographic provenance, and granular access controls. MAIF transforms data from passive storage into active trust enforcement, making every AI operation inherently auditable. Our production-ready implementation demonstrates ultra-high-speed streaming (2,720.7 MB/s), optimized video processing (1,342 MB/s), and enterprise-grade security. Novel algorithms for cross-modal attention, semantic compression, and cryptographic binding achieve up to 225 compression while maintaining semantic fidelity. Advanced security features include stream-level access control, real-time tamper detection, and behavioral anomaly analysis with minimal overhead. This approach directly addresses the regulatory, security, and accountability challenges preventing AI deployment in sensitive domains, offering a viable path toward trustworthy AI systems at scale.
๐
2025-11-19
๐ ArXiv Preprint (cs.DB)
๐ฅ Wenhan Yu, Wang Chen, Guanqiang Qi, Weikang Li, Yang Li et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Document Visual Question Answering (DocVQA) is a fundamental task for multimodal document understanding and a key testbed for vision language reasoning. However, most existing DocVQA datasets are limited to the page level and lack fine grained spatial grounding, constraining the interpretability and reasoning capability of Vision Language Models (VLMs). To address this gap, we introduce BBox DocVQA a large scale, bounding box grounded dataset designed to enhance spatial reasoning and evidence localization in visual documents. We further present an automated construction pipeline, Segment Judge and Generate, which integrates a segment model for region segmentation, a VLM for semantic judgment, and another advanced VLM for question answer generation, followed by human verification for quality assurance. The resulting dataset contains 3.6 K diverse documents and 32 K QA pairs, encompassing single and multi region as well as single and multi page scenarios. Each QA instance is grounded on explicit bounding boxes, enabling fine grained evaluation of spatial semantic alignment. Benchmarking multiple state of the art VLMs (e.g., GPT 5, Qwen2.5 VL, and InternVL) on BBox DocVQA reveals persistent challenges in spatial grounding and reasoning accuracy. Furthermore, fine tuning on BBox DocVQA substantially improves both bounding box localization and answer generation, validating its effectiveness for enhancing the reasoning ability of VLMs. Our dataset and code will be publicly released to advance research on interpretable and spatially grounded vision language reasoning.
๐
2025-11-19
๐ ArXiv Preprint (cs.DC)
๐ฅ Khaled Hamidouche, John Bachan, Pak Markthub, Peter-Jan Gootzen, Elena Agostini et al.
Robotics
ๆฅ็ๆ่ฆ
Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU communication follows a host-initiated model, where the CPU orchestrates all communication operations - a characteristic of the CUDA runtime. Although robust for collective operations, applications requiring tight integration of computation and communication can benefit from device-initiated communication that eliminates CPU coordination overhead.
NCCL 2.28 introduces the Device API with three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe, Multimem for NVLink SHARP, and GPU-Initiated Networking (GIN) for network RDMA. This paper presents the GIN architecture, design, semantics, and highlights its impact on MoE communication. GIN builds on a three-layer architecture: i) NCCL Core host-side APIs for device communicator setup and collective memory window registration; ii) Device-side APIs for remote memory operations callable from CUDA kernels; and iii) A network plugin architecture with dual semantics (GPUDirect Async Kernel-Initiated and Proxy) for broad hardware support. The GPUDirect Async Kernel-Initiated backend leverages DOCA GPUNetIO for direct GPU-to-NIC communication, while the Proxy backend provides equivalent functionality via lock-free GPU-to-CPU queues over standard RDMA networks. We demonstrate GIN's practicality through integration with DeepEP, an MoE communication library. Comprehensive benchmarking shows that GIN provides device-initiated communication within NCCL's unified runtime, combining low-latency operations with NCCL's collective algorithms and production infrastructure.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Henrik Bradland, Morten Goodwin, Vladimir I. Zadorozhny, Per-Arne Andersen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The performance of machine learning models on tabular data is critically dependent on high-quality feature engineering. While Large Language Models (LLMs) have shown promise in automating feature extraction (AutoFE), existing methods are often limited by monolithic LLM architectures, simplistic quantitative feedback, and a failure to systematically integrate external domain knowledge. This paper introduces Rogue One, a novel, LLM-based multi-agent framework for knowledge-informed automatic feature extraction. Rogue One operationalizes a decentralized system of three specialized agents-Scientist, Extractor, and Tester-that collaborate iteratively to discover, generate, and validate predictive features. Crucially, the framework moves beyond primitive accuracy scores by introducing a rich, qualitative feedback mechanism and a "flooding-pruning" strategy, allowing it to dynamically balance feature exploration and exploitation. By actively incorporating external knowledge via an integrated retrieval-augmented (RAG) system, Rogue One generates features that are not only statistically powerful but also semantically meaningful and interpretable. We demonstrate that Rogue One significantly outperforms state-of-the-art methods on a comprehensive suite of 19 classification and 9 regression datasets. Furthermore, we show qualitatively that the system surfaces novel, testable hypotheses, such as identifying a new potential biomarker in the myocardial dataset, underscoring its utility as a tool for scientific discovery.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haoyong Wu, Yongmei Liu
ๆฅ็ๆ่ฆ
In this paper, we propose ProRAC (Progression-based Reasoning about Actions and Change), a neuro-symbolic framework that leverages LLMs to tackle RAC problems. ProRAC extracts fundamental RAC elements including actions and questions from the problem, progressively executes each action to derive the final state, and then evaluates the query against the progressed state to arrive at an answer. We evaluate ProRAC on several RAC benchmarks, and the results demonstrate that our approach achieves strong performance across different benchmarks, domains, LLM backbones, and types of RAC tasks.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zisong Wang, Xuanyu Wang, Hang Chen, Haizhou Wang, Yuxin Chen et al.
ๆฅ็ๆ่ฆ
Precise prognostic stratification of colorectal cancer (CRC) remains a major clinical challenge due to its high heterogeneity. The conventional TNM staging system is inadequate for personalized medicine. We aimed to develop and validate a novel multiple instance learning model TDAM-CRC using histopathological whole-slide images for accurate prognostic prediction and to uncover its underlying molecular mechanisms. We trained the model on the TCGA discovery cohort (n=581), validated it in an independent external cohort (n=1031), and further we integrated multi-omics data to improve model interpretability and identify novel prognostic biomarkers. The results demonstrated that the TDAM-CRC achieved robust risk stratification in both cohorts. Its predictive performance significantly outperformed the conventional clinical staging system and multiple state-of-the-art models. The TDAM-CRC risk score was confirmed as an independent prognostic factor in multivariable analysis. Multi-omics analysis revealed that the high-risk subtype is closely associated with metabolic reprogramming and an immunosuppressive tumor microenvironment. Through interaction network analysis, we identified and validated Mitochondrial Ribosomal Protein L37 (MRPL37) as a key hub gene linking deep pathomic features to clinical prognosis. We found that high expression of MRPL37, driven by promoter hypomethylation, serves as an independent biomarker of favorable prognosis. Finally, we constructed a nomogram incorporating the TDAM-CRC risk score and clinical factors to provide a precise and interpretable clinical decision-making tool for CRC patients. Our AI-driven pathological model TDAM-CRC provides a robust tool for improved CRC risk stratification, reveals new molecular targets, and facilitates personalized clinical decision-making.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench -- a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video model. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10--20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
๐
2025-11-19
๐ SIGIR
๐ฅ Haodong Chen, Guido Zuccon, Teerapong Leelanupab
Natural Language Processing
ๆฅ็ๆ่ฆ
Genomic question answering often requires complex reasoning and integration across diverse biomedical sources. GeneGPT addressed this challenge by combining domain-specific APIs with OpenAI's code-davinci-002 large language model to enable natural language interaction with genomic databases. However, its reliance on a proprietary model limits scalability, increases operational costs, and raises concerns about data privacy and generalization.
In this work, we revisit and reproduce GeneGPT in a pilot study using open source models, including Llama 3.1, Qwen2.5, and Qwen2.5 Coder, within a monolithic architecture; this allows us to identify the limitations of this approach. Building on this foundation, we then develop OpenBioLLM, a modular multi-agent framework that extends GeneGPT by introducing agent specialization for tool routing, query generation, and response validation. This enables coordinated reasoning and role-based task execution.
OpenBioLLM matches or outperforms GeneGPT on over 90% of the benchmark tasks, achieving average scores of 0.849 on Gene-Turing and 0.830 on GeneHop, while using smaller open-source models without additional fine-tuning or tool-specific pretraining. OpenBioLLM's modular multi-agent design reduces latency by 40-50% across benchmark tasks, significantly improving efficiency without compromising model capability. The results of our comprehensive evaluation highlight the potential of open-source multi-agent systems for genomic question answering. Code and resources are available at https://github.com/ielab/OpenBioLLM.
๐
2025-11-19
๐ NeurIPS 2025
๐ฅ Jian-Ting Guo, Yu-Cheng Chen, Ping-Chun Hsieh, Kuo-Hao Ho, Po-Wei Huang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Human-like agents have long been one of the goals in pursuing artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been focused on designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns for both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts the classic receding-horizon control to human-like learning as a tractable and efficient implementation. To achieve this, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via Vector-Quantized VAE. Experiments on D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores, and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Panqi Yang, Haodong Jing, Nanning Zheng, Yongqiang Ma
ๆฅ็ๆ่ฆ
In the field of human-object interaction (HOI), detection and generation are two dual tasks that have traditionally been addressed separately, hindering the development of comprehensive interaction understanding. To address this, we propose UniHOI, which jointly models HOI detection and generation via a unified token space, thereby effectively promoting knowledge sharing and enhancing generalization. Specifically, we introduce a symmetric interaction-aware attention module and a unified semi-supervised learning paradigm, enabling effective bidirectional mapping between images and interaction semantics even under limited annotations. Extensive experiments demonstrate that UniHOI achieves state-of-the-art performance in both HOI detection and generation. Specifically, UniHOI improves accuracy by 4.9% on long-tailed HOI detection and boosts interaction metrics by 42.0% on open-vocabulary generation tasks.
๐
2025-11-19
๐ AAAI
๐ฅ Dorien Herremans, Abhinaba Roy
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in generative AI for music have achieved remarkable fidelity and stylistic diversity, yet these systems often fail to align with nuanced human preferences due to the specific loss functions they use. This paper advocates for the systematic application of preference alignment techniques to music generation, addressing the fundamental gap between computational optimization and human musical appreciation. Drawing on recent breakthroughs including MusicRL's large-scale preference learning, multi-preference alignment frameworks like diffusion-based preference optimization in DiffRhythm+, and inference-time optimization techniques like Text2midi-InferAlign, we discuss how these techniques can address music's unique challenges: temporal coherence, harmonic consistency, and subjective quality assessment. We identify key research challenges including scalability to long-form compositions, reliability amongst others in preference modelling. Looking forward, we envision preference-aligned music generation enabling transformative applications in interactive composition tools and personalized music services. This work calls for sustained interdisciplinary research combining advances in machine learning, music-theory to create music AI systems that truly serve human creative and experiential needs.
๐
2025-11-19
๐ Manuscript in preparation for IEEE Transactions on
๐ฅ Jeffrey Jiang, Kevin Hong, Emily Kuczynski, Gregory Pottie
Machine Learning
ๆฅ็ๆ่ฆ
While intelligent tutoring systems (ITSs) can use information from past students to personalize instruction, each new student is unique. Moreover, the education problem is inherently difficult because the learning process is only partially observable. We therefore develop a dynamic, time-series environment to simulate a classroom setting, with student-teacher interventions - including tutoring sessions, lectures, and exams. In particular, we design the simulated environment to allow for varying levels of probing interventions that can gather more information. Then, we develop reinforcement learning ITSs that combine learning the individual state of students while pulling from population information through the use of probing interventions. These interventions can reduce the difficulty of student estimation, but also introduce a cost-benefit decision to find a balance between probing enough to get accurate estimates and probing so often that it becomes disruptive to the student. We compare the efficacy of standard RL algorithms with several greedy rules-based heuristic approaches to find that they provide different solutions, but with similar results. We also highlight the difficulty of the problem with increasing levels of hidden information, and the boost that we get if we allow for probing interventions. We show the flexibility of both heuristic and RL policies with regards to changing student population distributions, finding that both are flexible, but RL policies struggle to help harder classes. Finally, we test different course structures with non-probing policies and we find that our policies are able to boost the performance of quiz and midterm structures more than we can in a finals-only structure, highlighting the benefit of having additional information.
๐
2025-11-19
๐ ArXiv Preprint (cs.PF)
๐ฅ Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zecheng Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. So we present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets.
Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
๐
2025-11-19
๐ 10 pages, theoretical/mathematical LLM research, n
๐ฅ Moses Kiprono
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations: plausible-sounding outputs that are factually incorrect or unsupported. In this work, we present a mathematically grounded framework to understand, measure, and mitigate these hallucinations. Drawing on probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation, we analyze how errors compound autoregressively, propose refined uncertainty metrics, including semantic and phase-aware variants, and develop principled mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention. This unified lens connects recent advances in calibration, retrieval, and alignment to support safer and more reliable LLMs.
๐
2025-11-19
๐ Accepted to be published in IEEE Transaction on Ma
๐ฅ Fatemeh Lotfi, Hossein Rajoli, Fatemeh Afghah
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Next-generation networks utilize the Open Radio Access Network (O-RAN) architecture to enable dynamic resource management, facilitated by the RAN Intelligent Controller (RIC). While deep reinforcement learning (DRL) models show promise in optimizing network resources, they often struggle with robustness and generalizability in dynamic environments. This paper introduces a novel resource management approach that enhances the Soft Actor Critic (SAC) algorithm with Sharpness-Aware Minimization (SAM) in a distributed Multi-Agent RL (MARL) framework. Our method introduces an adaptive and selective SAM mechanism, where regularization is explicitly driven by temporal-difference (TD)-error variance, ensuring that only agents facing high environmental complexity are regularized. This targeted strategy reduces unnecessary overhead, improves training stability, and enhances generalization without sacrificing learning efficiency. We further incorporate a dynamic $ฯ$ scheduling scheme to refine the exploration-exploitation trade-off across agents. Experimental results show our method significantly outperforms conventional DRL approaches, yielding up to a $22\%$ improvement in resource allocation efficiency and ensuring superior QoS satisfaction across diverse O-RAN slices.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-up of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieves superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Johan Edstedt, David Nordstrรถm, Yushan Zhang, Georg Bรถkman, Jonathan Astermark et al.
ๆฅ็ๆ่ฆ
Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold-standard due to its high accuracy and robustness. However, existing dense matchers still fail or perform poorly for many hard real-world scenarios, and high-precision models are often slow, limiting their applicability. In this paper, we attack these weaknesses on a wide front through a series of systematic improvements that together yield a significantly better model. In particular, we construct a novel matching architecture and loss, which, combined with a curated diverse training distribution, enables our model to solve many complex matching tasks. We further make training faster through a decoupled two-stage matching-then-refinement pipeline, and at the same time, significantly reduce refinement memory usage through a custom CUDA kernel. Finally, we leverage the recent DINOv3 foundation model along with multiple other insights to make the model more robust and unbiased. In our extensive set of experiments we show that the resulting novel matcher sets a new state-of-the-art, being significantly more accurate than its predecessors. Code is available at https://github.com/Parskatt/romav2
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu et al.
MultimodalNatural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista surpasses other open-source agentic models on the geolocalization task greatly and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Egocentric videos are a valuable and scalable data source to learn manipulation policies. However, due to significant data heterogeneity, most existing approaches utilize human data for simple pre-training, which does not unlock its full potential. This paper first provides a scalable recipe for collecting and using egocentric data by categorizing human data into two categories: in-the-wild and on-task alongside with systematic analysis on how to use the data. We first curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show Human0 achieves several novel properties from scaling human data, including language following of instructions from only human data, few-shot learning, and improved robustness using on-task data. Project website: https://xiongyicai.github.io/In-N-On/
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu et al.
ๆฅ็ๆ่ฆ
What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohammed Q. Alkhatib
Machine Learning
ๆฅ็ๆ่ฆ
This paper introduces SS-MixNet, a lightweight and effective deep learning model for hyperspectral image (HSI) classification. The architecture integrates 3D convolutional layers for local spectral-spatial feature extraction with two parallel MLP-style mixer blocks that capture long-range dependencies in spectral and spatial dimensions. A depthwise convolution-based attention mechanism is employed to enhance discriminative capability with minimal computational overhead. The model is evaluated on the QUH-Tangdaowan and QUH-Qingyun datasets using only 1% of labeled data for training and validation. SS-MixNet achieves the highest performance among compared methods, including 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, reaching 95.68% and 93.86% overall accuracy on the Tangdaowan and Qingyun datasets, respectively. The results, supported by quantitative metrics and classification maps, confirm the model's effectiveness in delivering accurate and robust predictions with limited supervision. The code will be made publicly available at: https://github.com/mqalkhatib/SS-MixNet
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) Multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods-originally designed for unimodal large language models (LLMs)-to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sejuti Rahman, Swakshar Deb, MD. Sameer Iqbal Chowdhury, MD. Jubair Ahmed Sourov, Mohammad Shamsuddin
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low and high frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary (depressed and non depressed) classification, the model achieved a sensitivity of 0.96 and F2 score of 0.94. For the 3 class (no depression, mild to moderate depression and severe depression) classification task, the proposed method achieved a sensitivity of 0.79 and specificity of 0.87 and siginificantly suprassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and F2 score of 0.96. These results confirm that our trimodal, multi frequency framework effectively captures cross modal interaction for accurate depression detection.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
MultimodalNatural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Naomi Simumba, Nils Lehmann, Paolo Fraccaro, Hamed Alemohammad, Geeth De Mel et al.
Computer Vision
ๆฅ็ๆ่ฆ
Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce ''capability'' groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks.
Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Edward Vendrow, Julia Chae, Rupa Kurinchi-Vendhan, Isaac Eckert, Jazlynn Hall et al.
ๆฅ็ๆ่ฆ
Large community science platforms such as iNaturalist contain hundreds of millions of biodiversity images that often capture ecological context on behaviors, interactions, phenology, and habitat. Yet most ecological workflows rely on metadata filtering or manual inspection, leaving this secondary information inaccessible at scale. We introduce INQUIRE-Search, an open-source system that enables scientists to rapidly and interactively search within an ecological image database for specific concepts using natural language, verify and export relevant observations, and utilize this discovered data for novel scientific analysis. Compared to traditional methods, INQUIRE-Search takes a fraction of the time, opening up new possibilities for scientific questions that can be explored. Through five case studies, we show the diversity of scientific applications that a tool like INQUIRE-Search can support, from seasonal variation in behavior across species to forest regrowth after wildfires. These examples demonstrate a new paradigm for interactive, efficient, and scalable scientific discovery that can begin to unlock previously inaccessible scientific value in large-scale biodiversity datasets. Finally, we emphasize using such AI-enabled discovery tools for science call for experts to reframe the priorities of the scientific process and develop novel methods for experiment design, data collection, survey effort, and uncertainty analysis.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shanshan Zhang
ๆฅ็ๆ่ฆ
Inertial Odometry (IO) enables real-time localization using only acceleration and angular velocity measurements from an Inertial Measurement Unit (IMU), making it a promising solution for localization in consumer-grade applications. Traditionally, IMU measurements in IO have been processed under two coordinate system paradigms: the body coordinate frame and the global coordinate frame, with the latter being widely adopted. However, recent studies in drone scenarios have demonstrated that the body frame can significantly improve localization accuracy, prompting a re-evaluation of the suitability of the global frame for pedestrian IO. To address this issue, this paper systematically evaluates the effectiveness of the global coordinate frame in pedestrian IO through theoretical analysis, qualitative inspection, and quantitative experiments. Building upon these findings, we further propose MambaIO, which decomposes IMU measurements into high-frequency and low-frequency components using a Laplacian pyramid. The low-frequency component is processed by a Mamba architecture to extract implicit contextual motion cues, while the high-frequency component is handled by a convolutional structure to capture fine-grained local motion details. Experiments on multiple public datasets show that MambaIO substantially reduces localization error and achieves state-of-the-art (SOTA) performance. To the best of our knowledge, this is the first application of the Mamba architecture to the inertial odometry task.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shourov Joarder, Tushar Talukder Showrav, Md. Kamrul Hasan
Machine Learning
ๆฅ็ๆ่ฆ
Ultrasound Strain Elastography (USE) is a powerful non-invasive imaging technique for assessing tissue mechanical properties, offering crucial diagnostic value across diverse clinical applications. However, its clinical application remains limited by tissue decorrelation noise, scarcity of ground truth, and inconsistent strain estimation under different deformation conditions. Overcoming these barriers, we propose MUSSE-Net, a residual-aware, multi-stage unsupervised sequential deep learning framework designed for robust and consistent strain estimation. At its backbone lies our proposed USSE-Net, an end-to-end multi-stream encoder-decoder architecture that parallelly processes pre- and post-deformation RF sequences to estimate displacement fields and axial strains. The novel architecture incorporates Context-Aware Complementary Feature Fusion (CACFF)-based encoder with Tri-Cross Attention (TCA) bottleneck with a Cross-Attentive Fusion (CAF)-based sequential decoder. To ensure temporal coherence and strain stability across varying deformation levels, this architecture leverages a tailored consistency loss. Finally, with the MUSSE-Net framework, a secondary residual refinement stage further enhances accuracy and suppresses noise. Extensive validation on simulation, in vivo, and private clinical datasets from Bangladesh University of Engineering and Technology (BUET) medical center, demonstrates MUSSE-Net's outperformed existing unsupervised approaches. On MUSSE-Net achieves state-of-the-art performance with a target SNR of 24.54, background SNR of 132.76, CNR of 59.81, and elastographic SNR of 9.73 on simulation data. In particular, on the BUET dataset, MUSSE-Net produces strain maps with enhanced lesion-to-background contrast and significant noise suppression yielding clinically interpretable strain patterns.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Hu, Lan Li, Zhen-Hao Xie, Da-Wei Zhou
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge. Recently, vision-language models like CLIP offer transferable features via multi-modal pre-training, making them well-suited for CIL. However, real-world visual and linguistic concepts are inherently hierarchical: a textual concept like "dog" subsumes fine-grained categories such as "Labrador" and "Golden Retriever," and each category entails its images. But existing CLIP-based CIL methods fail to explicitly capture this inherent hierarchy, leading to fine-grained class features drift during incremental updates and ultimately to catastrophic forgetting. To address this challenge, we propose HASTEN (Hierarchical Semantic Tree Anchoring) that anchors hierarchical information into CIL to reduce catastrophic forgetting. First, we employ an external knowledge graph as supervision to embed visual and textual features in hyperbolic space, effectively preserving hierarchical structure as data evolves. Second, to mitigate catastrophic forgetting, we project gradients onto the null space of the shared hyperbolic mapper, preventing interference with prior tasks. These two steps work synergistically to enable the model to resist forgetting by maintaining hierarchical relationships. Extensive experiments show that HASTEN consistently outperforms existing methods while providing a unified structured representation.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dante Francisco Wasmuht, Otto Brookes, Maximillian Schall, Pablo Palencia, Chris Beirne et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multianimal tracking in the wild. The dataset is available at $\href{https://www.conservationxlabs.com/sa-fari}{\text{conservationxlabs.com/SA-FARI}}$.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tingrui Shen, Yiheng Zhang, Chen Tang, Chuan Ping, Zixing Zhao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications. We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels. Extensive experiments show that FlashMesh achieves up to a 2 x speedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang et al.
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bin Xie, Gady Agam
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Medical image segmentation typically adopts a point-wise convolutional segmentation head to predict dense labels, where each output channel is heuristically tied to a specific class. This rigid design limits both feature sharing and semantic generalization. In this work, we propose a unified decoupled segmentation head that separates multi-class prediction into class-agnostic mask prediction and class label prediction using shared object queries. Furthermore, we introduce a Full-Scale Aware Deformable Transformer module that enables low-resolution encoder features to attend across full-resolution encoder features via deformable attention, achieving memory-efficient and spatially aligned full-scale fusion. Our proposed method, named MaskMed, achieves state-of-the-art performance, surpassing nnUNet by +2.0% Dice on AMOS 2022 and +6.9% Dice on BTCV.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Miruna-Alexandra Gafencu, Yordanka Velikova, Nassir Navab, Mohammad Farid Azampour
Machine Learning
ๆฅ็ๆ่ฆ
Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intraoperative guidance during spinal procedures. However, ultrasound suffers from inherent limitations in visualizing complete vertebral anatomy, in particular vertebral bodies, due to acoustic shadowing effects caused by bone. In this work, we present a novel multi-modal deep learning method for completing occluded anatomical structures in 3D ultrasound by leveraging complementary information from a single X-ray image. To enable training, we generate paired training data consisting of: (1) 2D lateral vertebral views that simulate X-ray scans, and (2) 3D partial vertebrae representations that mimic the limited visibility and occlusions encountered during ultrasound spine imaging. Our method integrates morphological information from both imaging modalities and demonstrates significant improvements in vertebral reconstruction (p < 0.001) compared to state of art in 3D ultrasound vertebral completion. We perform phantom studies as an initial step to future clinical translation, and achieve a more accurate, complete volumetric lumbar spine visualization overlayed on the ultrasound scan without the need for registration with preoperative modalities such as computed tomography. This demonstrates that integrating a single X-ray projection mitigates ultrasound's key limitation while preserving its strengths as the primary imaging modality. Code and data can be found at https://github.com/miruna20/US-X-Complete
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xufei Wang, Junqiao Zhao, Siyue Tao, Qiwen Gu, Wonbong Kim et al.
Robotics
ๆฅ็ๆ่ฆ
LiDAR place recognition plays a crucial role in SLAM, robot navigation, and autonomous driving. However, existing LiDAR place recognition methods often struggle to adapt to new environments without forgetting previously learned knowledge, a challenge widely known as catastrophic forgetting. To address this issue, we propose KDF+, a novel continual learning framework for LiDAR place recognition that extends the KDF paradigm with a loss-aware sampling strategy and a rehearsal enhancement mechanism. The proposed sampling strategy estimates the learning difficulty of each sample via its loss value and selects samples for replay according to their estimated difficulty. Harder samples, which tend to encode more discriminative information, are sampled with higher probability while maintaining distributional coverage across the dataset. In addition, the rehearsal enhancement mechanism encourages memory samples to be further refined during new-task training by slightly reducing their loss relative to previous tasks, thereby reinforcing long-term knowledge retention. Extensive experiments across multiple benchmarks demonstrate that KDF+ consistently outperforms existing continual learning methods and can be seamlessly integrated into state-of-the-art continual learning for LiDAR place recognition frameworks to yield significant and stable performance gains. The code will be available at https://github.com/repo/KDF-plus.
๐
2025-11-19
๐ ArXiv Preprint (cs.GR)
๐ฅ Aaron Ferguson, Ahmed A. A. Osman, Berta Bescos, Carsten Stoll, Chris Twigg et al.
ๆฅ็ๆ่ฆ
We present MHR, a parametric human body model that combines the decoupled skeleton/shape paradigm of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum library. Our model enables expressive, anatomically plausible human animation, supporting non-linear pose correctives, and is designed for robust integration in AR/VR and graphics pipelines.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Sifan Zhou, Yichao Cao, Jiahao Nie, Yuqian Fu, Ziyu Zhao et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar
Natural Language Processing
ๆฅ็ๆ่ฆ
With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR's effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huiyuan Tian, Bonan Xu, Shijian Li, Xin Jin
Natural Language Processing
ๆฅ็ๆ่ฆ
Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we conduct a two-view representation analysis of ViTs. First, a layer-wise Singular Value Decomposition (SVD) of full feature matrices shows that final-layer representations are globally low-rank: for CaiT-S24, only $121/61/34/14$ dimensions suffice to capture $99\%/95\%/90\%/80\%$ of the energy. In principle, this suggests that a compact student plus a simple linear projector should be enough for feature alignment, contradicting the weak empirical performance of standard feature KD. To resolve this paradox, we introduce a token-level Spectral Energy Pattern (SEP) analysis that measures how each token uses channel capacity. SEP reveals that, despite the global low-rank structure, individual tokens distribute energy over most channels, forming a high-bandwidth encoding pattern. This results in an encoding mismatch between wide teachers and narrow students. Motivated by this insight, we propose two minimal, mismatch-driven strategies: (1) post-hoc feature lifting with a lightweight projector retained during inference, or (2) native width alignment that widens only the student's last block to the teacher's width. On ImageNet-1K, these strategies reactivate simple feature-map distillation in ViTs, raising DeiT-Tiny accuracy from $74.86\%$ to $77.53\%$ and $78.23\%$ when distilling from CaiT-S24, while also improving standalone students trained without any teacher. Our analysis thus explains why ViT feature distillation fails and shows how exploiting low-rank structure yields effective, interpretable remedies and concrete design guidance for compact ViTs.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Weiheng Zhu, Gang Cao, Jing Liu, Lifang Yu, Shaowei Weng
Machine Learning
ๆฅ็ๆ่ฆ
Recent AI-generated image (AIGI) detectors achieve impressive accuracy under clean condition. In view of antiforensics, it is significant to develop advanced adversarial attacks for evaluating the security of such detectors, which remains unexplored sufficiently. This letter proposes a Dual-domain Feature Importance Attack (DuFIA) scheme to invalidate AIGI detectors to some extent. Forensically important features are captured by the spatially interpolated gradient and frequency-aware perturbation. The adversarial transferability is enhanced by jointly modeling spatial and frequency-domain feature importances, which are fused to guide the optimization-based adversarial example generation. Extensive experiments across various AIGI detectors verify the cross-model transferability, transparency and robustness of DuFIA.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUI remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA as judges to assist Coder for automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
Robotics
ๆฅ็ๆ่ฆ
Human pose forecasting predicts future poses based on past observations, and has many significant applications in areas such as action recognition, autonomous driving or human-robot interaction. This paper evaluates a wide range of pose forecasting algorithms in the task of absolute pose forecasting, revealing many reproducibility issues, and provides a unified training and evaluation pipeline. After drawing a high-level analogy to the task of speech understanding, it is shown that recent speech models can be efficiently adapted to the task of pose forecasting, and improve current state-of-the-art performance. At last the robustness of the models is evaluated, using noisy joint coordinates obtained from a pose estimator model, to reflect a realistic type of noise, which is more close to real-world applications. For this a new dataset variation is introduced, and it is shown that estimated poses result in a substantial performance degradation, and how much of it can be recovered again by unsupervised finetuning.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pandiyaraju V, Abishek Karthik, Sreya Mynampati, Poovarasan L, D. Saraswathi
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework recipe for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was imposed to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experimental results yield superior results with 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment to edge devices for automated weed detecting, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qiang Jiao, Bin Yan, Yi Yang, Mengrui Shi, Qiang Zhang
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent CLIP-based few-shot semantic segmentation methods introduce class-level textual priors to assist segmentation by typically using a single prompt (e.g., a photo of class). However, these approaches often result in incomplete activation of target regions, as a single textual description cannot fully capture the semantic diversity of complex categories. Moreover, they lack explicit cross-modal interaction and are vulnerable to noisy support features, further degrading visual prior quality. To address these issues, we propose the Multi-Text Guided Few-Shot Semantic Segmentation Network (MTGNet), a dual-branch framework that enhances segmentation performance by fusing diverse textual prompts to refine textual priors and guide the cross-modal optimization of visual priors. Specifically, we design a Multi-Textual Prior Refinement (MTPR) module that suppresses interference and aggregates complementary semantic cues to enhance foreground activation and expand semantic coverage for structurally complex objects. We introduce a Text Anchor Feature Fusion (TAFF) module, which leverages multi-text embeddings as semantic anchors to facilitate the transfer of discriminative local prototypes from support images to query images, thereby improving semantic consistency and alleviating intra-class variations. Furthermore, a Foreground Confidence-Weighted Attention (FCWA) module is presented to enhance visual prior robustness by leveraging internal self-similarity within support foreground features. It adaptively down-weights inconsistent regions and effectively suppresses interference in the query segmentation process. Extensive experiments on standard FSS benchmarks validate the effectiveness of MTGNet. In the 1-shot setting, it achieves 76.8% mIoU on PASCAL-5i and 57.4% on COCO-20i, with notable improvements in folds exhibiting high intra-class variations.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruiqing Yang, Kaixin Zhang, Zheng Zhang, Shan You, Tao Huang
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive models have recently shown great promise in visual generation by leveraging discrete token sequences akin to language modeling. However, existing approaches often suffer from inefficiency, either due to token-by-token decoding or the complexity of multi-scale representations. In this work, we introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system's center-outward perception pattern. EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding. To further enhance flexibility and speed, we propose a length-adaptive decoding strategy that dynamically adjusts the number of tokens predicted at each step. This biologically inspired design not only reduces computational cost but also improves generation quality by aligning the generation order with perceptual relevance. Extensive experiments on ImageNet demonstrate that EAR achieves state-of-the-art trade-offs between fidelity and efficiency on single-scale autoregressive models, setting a new direction for scalable and cognitively aligned autoregressive image generation.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Maria Pilligua, David Serrano-Lozano, Pai Peng, Ramon Baldrich, Michael S. Brown et al.
Robotics
ๆฅ็ๆ่ฆ
Imaging in low-light environments is challenging due to reduced scene radiance, which leads to elevated sensor noise and reduced color saturation. Most learning-based low-light enhancement methods rely on paired training data captured under a single low-light condition and a well-lit reference. The lack of radiance diversity limits our understanding of how enhancement techniques perform across varying illumination intensities. We introduce the Multi-Illumination Low-Light (MILL) dataset, containing images captured at diverse light intensities under controlled conditions with fixed camera settings and precise illuminance measurements. MILL enables comprehensive evaluation of enhancement algorithms across variable lighting conditions. We benchmark several state-of-the-art methods and reveal significant performance variations across intensity levels. Leveraging the unique multi-illumination structure of our dataset, we propose improvements that enhance robustness across diverse illumination scenarios. Our modifications achieve up to 10 dB PSNR improvement for DSLR and 2 dB for the smartphone on Full HD images.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chen Zhang, Wei Zuo, Bingyang Cheng, Yikun Wang, Wei-Bin Kou et al.
Computer Vision
ๆฅ็ๆ่ฆ
Implicit Neural Representations (INRs) parameterize continuous signals via multilayer perceptrons (MLPs), enabling compact, resolution-independent modeling for tasks like image, audio, and 3D reconstruction. However, fitting high-resolution signals demands optimizing over millions of coordinates, incurring prohibitive computational costs. To address it, we propose NTK-Guided Implicit Neural Teaching (NINT), which accelerates training by dynamically selecting coordinates that maximize global functional updates. Leveraging the Neural Tangent Kernel (NTK), NINT scores examples by the norm of their NTK-augmented loss gradients, capturing both fitting errors and heterogeneous leverage (self-influence and cross-coordinate coupling). This dual consideration enables faster convergence compared to existing methods. Through extensive experiments, we demonstrate that NINT significantly reduces training time by nearly half while maintaining or improving representation quality, establishing state-of-the-art acceleration among recent sampling-based strategies.
๐
2025-11-19
๐ ArXiv Preprint (cs.SD)
๐ฅ Abishek Karthik, Pandiyaraju V, Dominic Savio M, Rohit Swaminathan S
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Parkinson's disease is a neurodegenerative disorder that can be very tricky to diagnose and treat. Such early symptoms can include tremors, wheezy breathing, and changes in voice quality as critical indicators of neural damage. Notably, there has been growing interest in utilizing changes in vocal attributes as markers for the detection of PD early on. Based on this understanding, the present paper was designed to focus on the acoustic feature analysis based on voice recordings of patients diagnosed with PD and healthy controls (HC). In this paper, we introduce a novel classification and visualization model known as CustNetGC, combining a Convolutional Neural Network (CNN) with Custom Network Grad-CAM and CatBoost to enhance the efficiency of PD diagnosis. We use a publicly available dataset from Figshare, including voice recordings of 81 participants: 40 patients with PD and 41 healthy controls. From these recordings, we extracted the key spectral features: L-mHP and Spectral Slopes. The L-mHP feature combines three spectrogram representations: Log-Mel spectrogram, harmonic spectrogram, and percussive spectrogram, which are derived using Harmonic-Percussive Source Separation (HPSS). Grad-CAM was used to highlight the important regions in the data, thus making the PD predictions interpretable and effective. Our proposed CustNetGC model achieved an accuracy of 99.06% and precision of 95.83%, with the area under the ROC curve (AUC) recorded at 0.90 for the PD class and 0.89 for the HC class. Additionally, the combination of CatBoost, a gradient boosting algorithm, enhanced the robustness and the prediction performance by properly classifying PD and non-PD samples. Therefore, the results provide the potential improvement in the CustNetGC system in enhancing diagnostic accuracy and the interpretability of the Parkinson's Disease prediction model.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Luisa Gallรฉe, Yiheng Xiong, Meinrad Beer, Michael Gรถtz
Robotics
ๆฅ็ๆ่ฆ
Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. Target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and beyond. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rashid Iqbal, Saddam Hussain Khan
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This work proposes a hybrid deep learning approach, namely Residual and Spatial Learning based Channel Augmented Integrated CNN-Transformer architecture, that leverages the strengths of CNN and Transformer towards enhanced MPox detection. The proposed RS-CA-HSICT framework is composed of an HSICT block, a residual CNN module, a spatial CNN block, and a CA, which enhances the diverse feature space, detailed lesion information, and long-range dependencies. The new HSICT module first integrates an abstract representation of the stem CNN and customized ICT blocks for efficient multihead attention and structured CNN layers with homogeneous (H) and structural (S) operations. The customized ICT blocks learn global contextual interactions and local texture extraction. Additionally, H and S layers learn spatial homogeneity and fine structural details by reducing noise and modeling complex morphological variations. Moreover, inverse residual learning enhances vanishing gradient, and stage-wise resolution reduction ensures scale invariance. Furthermore, the RS-CA-HSICT framework augments the learned HSICT channels with the TL-driven Residual and Spatial CNN maps for enhanced multiscale feature space capturing global and localized structural cues, subtle texture, and contrast variations. These channels, preceding augmentation, are refined through the Channel-Fusion-and-Attention block, which preserves discriminative channels while suppressing redundant ones, thereby enabling efficient computation. Finally, the spatial attention mechanism refines pixel selection to detect subtle patterns and intra-class contrast variations in Mpox. Experimental results on both the Kaggle benchmark and a diverse MPox dataset reported classification accuracy as high as 98.30% and an F1-score of 98.13%, which outperforms the existing CNNs and ViTs.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xabier Lekunberri, Ahmad Kamal, Izaro Goienetxea, Jon Ruiz, Iรฑaki Quincoces et al.
Machine Learning
ๆฅ็ๆ่ฆ
Purse seiners play a crucial role in tuna fishing, as approximately 69% of the world's tropical tuna is caught using this gear. All tuna Regional Fisheries Management Organizations have established minimum standards to use electronic monitoring (EM) in fisheries in addition to traditional observers. The EM systems produce a massive amount of video data that human analysts must process. Integrating artificial intelligence (AI) into their workflow can decrease that workload and improve the accuracy of the reports. However, species identification still poses significant challenges for AI, as achieving balanced performance across all species requires appropriate training data. Here, we quantify the difficulty experts face to distinguish bigeye tuna (BET, Thunnus Obesus) from yellowfin tuna (YFT, Thunnus Albacares) using images captured by EM systems. We found inter-expert agreements of 42.9% $\pm$ 35.6% for BET and 57.1% $\pm$ 35.6% for YFT. We then present a multi-stage pipeline to estimate the species composition of the catches using a reliable ground-truth dataset based on identifications made by observers on board. Three segmentation approaches are compared: Mask R-CNN, a combination of DINOv2 with SAM2, and a integration of YOLOv9 with SAM2. We found that the latest performs the best, with a validation mean average precision of 0.66 $\pm$ 0.03 and a recall of 0.88 $\pm$ 0.03. Segmented individuals are tracked using ByteTrack. For classification, we evaluate a standard multiclass classification model and a hierarchical approach, finding a superior generalization by the hierarchical. All our models were cross-validated during training and tested on fishing operations with fully known catch composition. Combining YOLOv9-SAM2 with the hierarchical classification produced the best estimations, with 84.8% of the individuals being segmented and classified with a mean average error of 4.5%.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dabin Jeong, Amirhossein Vahidi, Ciro Ramรญrez-Suรกstegui, Marie Moullet, Kevin Ly et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmm learns representations that better capture cross-modal correspondences, leading to an improvement of avg. 9.78\% in the gene-expression prediction task and avg. 26.93\% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyan Liu, Qi Su, Lulu Tang, Zhaofei Yu, Tiejun Huang
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Object detection in autonomous driving suffers from motion blur and saturation under fast motion and extreme lighting. Spike cameras, offer microsecond latency and ultra high dynamic range for object detection by using per pixel asynchronous integrate and fire. However, their sparse, discrete output cannot be processed by standard image-based detectors, posing a critical challenge for end to end spike stream detection. We propose EASD, an end to end spike camera detector with a dual branch design: a Temporal Based Texture plus Feature Fusion branch for global cross slice semantics, and an Entropy Selective Attention branch for object centric details. To close the data gap, we introduce DSEC Spike, the first driving oriented simulated spike detection benchmark.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Johannes C. Bauer, Paul Geng, Stephan Trattnig, Petr Doklรกdal, Rรผdiger Daub
Machine Learning
ๆฅ็ๆ่ฆ
Remanufacturing describes a process where worn products are restored to like-new condition and it offers vast ecological and economic potentials. A key step is the quality inspection of disassembled components, which is mostly done manually due to the high variety of parts and defect patterns. Deep neural networks show great potential to automate such visual inspection tasks but struggle to generalize to new product variants, components, or defect patterns. To tackle this challenge, we propose a novel image dataset depicting typical gearbox components in good and defective condition from two automotive transmissions. Depending on the train-test split of the data, different distribution shifts are generated to benchmark the generalization ability of a classification model. We evaluate different models using the dataset and propose a contrastive regularization loss to enhance model robustness. The results obtained demonstrate the ability of the loss to improve generalisation to unseen types of components.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Linyin Luo, Yujuan Ding, Yunshan Ma, Wenqi Fan, Hanjiang Lai
Multimodal
ๆฅ็ๆ่ฆ
Advanced multimodal Retrieval-Augmented Generation (MRAG) techniques have been widely applied to enhance the capabilities of Large Multimodal Models (LMMs), but they also bring along novel safety issues. Existing adversarial research has revealed the vulnerability of MRAG systems to knowledge poisoning attacks, which fool the retriever into recalling injected poisoned contents. However, our work considers a different setting: visual attack of MRAG by solely adding imperceptible perturbations at the image inputs of users, without manipulating any other components. This is challenging due to the robustness of fine-tuned retrievers and large-scale generators, and the effect of visual perturbation may be further weakened by propagation through the RAG chain. We propose a novel Hierarchical Visual Attack that misaligns and disrupts the two inputs (the multimodal query and the augmented knowledge) of MRAG's generator to confuse its generation. We further design a hierarchical two-stage strategy to obtain misaligned augmented knowledge. We disrupt the image input of the retriever to make it recall irrelevant knowledge from the original database, by optimizing the perturbation which first breaks the cross-modal alignment and then disrupts the multimodal semantic alignment. We conduct extensive experiments on two widely-used MRAG datasets: OK-VQA and InfoSeek. We use CLIP-based retrievers and two LMMs BLIP-2 and LLaVA as generators. Results demonstrate the effectiveness of our visual attack on MRAG through the significant decrease in both retrieval and generation performance.
๐
2025-11-19
๐ This work has been submitted to the IEEE for possi
๐ฅ YiKang Shao, Tao Shi
MultimodalComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Multimodal object detection has attracted significant attention in both academia and industry for its enhanced robustness. Although numerous studies have focused on improving modality fusion strategies, most neglect fusion degradation, and none provide a theoretical analysis of its underlying causes. To fill this gap, this paper presents a systematic theoretical investigation of fusion degradation in multimodal detection and identifies two key optimization deficiencies: (1) the gradients of unimodal branch backbones are severely suppressed under multimodal architectures, resulting in under-optimization of the unimodal branches; (2) disparities in modality quality cause weaker modalities to experience stronger gradient suppression, which in turn results in imbalanced modality learning. To address these issues, this paper proposes a Representation Space Constrained Learning with Modality Decoupling (RSC-MD) method, which consists of two modules. The RSC module and the MD module are designed to respectively amplify the suppressed gradients and eliminate inter-modality coupling interference as well as modality imbalance, thereby enabling the comprehensive optimization of each modality-specific backbone. Extensive experiments conducted on the FLIR, LLVIP, M3FD, and MFAD datasets demonstrate that the proposed method effectively alleviates fusion degradation and achieves state-of-the-art performance across multiple benchmarks. The code and training procedures will be released at https://github.com/yikangshao/RSC-MD.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marc-Emmanuel Coupvent des Graviers, Hejer Ammar, Christophe Guettier, Yann Dumortier, Romaric Audigier
Robotics
ๆฅ็ๆ่ฆ
We introduce WarNav, a novel real-world dataset constructed from images of the open-source DATTALION repository, specifically tailored to enable the development and benchmarking of semantic segmentation models for autonomous ground vehicle navigation in unstructured, conflict-affected environments. This dataset addresses a critical gap between conventional urban driving resources and the unique operational scenarios encountered by unmanned systems in hazardous and damaged war-zones. We detail the methodological challenges encountered, ranging from data heterogeneity to ethical considerations, providing guidance for future efforts that target extreme operational contexts. To establish performance references, we report baseline results on WarNav using several state-of-the-art semantic segmentation models trained on structured urban scenes. We further analyse the impact of training data environments and propose a first step towards effective navigability in challenging environments with the constraint of having no annotation of the targeted images. Our goal is to foster impactful research that enhances the robustness and safety of autonomous vehicles in high-risk scenarios while being frugal in annotated data.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, Utility, from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Luca Mossina, Corentin Friedrich
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Reliable semantic segmentation is essential for clinical decision making, yet deep models rarely provide explicit statistical guarantees on their errors. We introduce a simple post-hoc framework that constructs confidence masks with distribution-free, image-level control of false-positive predictions. Given any pretrained segmentation model, we define a nested family of shrunken masks obtained either by increasing the score threshold or by applying morphological erosion. A labeled calibration set is used to select a single shrink parameter via conformal prediction, ensuring that, for new images that are exchangeable with the calibration data, the proportion of false positives retained in the confidence mask stays below a user-specified tolerance with high probability. The method is model-agnostic, requires no retraining, and provides finite-sample guarantees regardless of the underlying predictor. Experiments on a polyp-segmentation benchmark demonstrate target-level empirical validity. Our framework enables practical, risk-aware segmentation in settings where over-segmentation can have clinical consequences. Code at https://github.com/deel-ai-papers/conseco.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Simon Boeder, Fabian Gigengack, Simon Roesler, Holger Caesar, Benjamin Risse
ๆฅ็ๆ่ฆ
Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation. On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haidong Kang, Lihong Lin, Enneng Yang, Hongning Dai, Hao Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, hindering real-world deployment due to their massive size. Existing pruning methods (e.g., Wanda) tailored for LLMs rely heavily on manual design pruning algorithms, thereby leading to \textit{huge labor costs} and \textit{requires expert knowledge}. Furthermore, we are the first to identify the serious \textit{outlier value issue} behind dramatic performance degradation under high pruning ratios that are caused by uniform sparsity, raising an additional concern about how to design adaptive pruning sparsity ideal for LLMs. Can LLMs prune by themselves? In this work, we introduce an affirmative answer by proposing a novel pruning method called \textbf{AutoPrune}, which first overcomes expert knowledge limits by leveraging LLMs to design optimal pruning algorithms for themselves automatically without any expert knowledge. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights of outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome the outlier value issue, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLMs benchmarks, demonstrating the superiority of AutoPrune, which consistently excels state-of-the-art competitors. The code is available at: https://anonymous.4open.science/r/AutoPrune.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunjiao Zhou, Xinyan Chen, Junlang Qian, Lihua Xie, Jianfei Yang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Understanding complex human activities demands the ability to decompose motion into fine-grained, semantic-aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI and virtual reality. Yet, most existing methods rely on dense supervision with predefined action classes, which are infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art effectiveness and efficiency of motion grounding performance, outperforming prior methods by +8.7\% mAP on HumanML3D benchmark. Meanwhile, significant improvements also exist in downstream retrieval, establishing a new paradigm for annotation-free motion understanding.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gihwan Kim, Jemin Lee, Hyungshin Kim
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44\%p (avg. 1.78\%p) top-1 accuracy improvement for image classification, 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code https://github.com/gihwan-kim/IPTQ-ViT.git.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yifu Guo, Zishan Xu, Zhiyuan Yao, Yuquan Lu, Jiaye Lin et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways-whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Spyridon Loukovitis, Vasileios Karampinis, Athanasios Voulodimos
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Developing reliable UAV navigation systems requires robust air-to-air object detectors capable of distinguishing between objects seen during training and previously unseen objects. While many methods address closed-set detection and achieve high-confidence recognition of in-domain (ID) targets, they generally do not tackle open-set detection, which requires simultaneous handling of both ID and out-of-distribution (OOD) objects. Existing open-set approaches typically rely on a single uncertainty score with thresholding, limiting flexibility and often conflating OOD objects with background clutter. In contrast, we propose a lightweight, model-agnostic post-processing framework that explicitly separates background from unknown objects while preserving the base detector's performance. Our approach extends open-set detection beyond binary ID/OOD classification to real-time three-way classification among ID targets, OOD objects, and background. To this end, we employ a fusion scheme that aggregates multiple confidence estimates and per-detection features using a compact multilayer perceptron (MLP). Incorporating different logit variants into the MLP consistently enhances performance across both binary and three-class classification without compromising throughput. Extensive ablation and comparative experiments confirm that our method surpasses threshold-based baselines in two-class classification by an average of 2.7% AUROC, while retaining or improving open-set mAP. Furthermore, our study uniquely enables robust three-class classification, a critical capability for safe UAV navigation, where OOD objects must be actively avoided and background regions safely ignored. Comparative analysis highlights that our method surpasses competitive techniques in AUROC across datasets, while improving closed-set mAP by up to 9 points, an 18% relative gain.
๐
2025-11-19
๐ 25 pages, 10 figures, Journal paper
๐ฅ Zahra Farzadpour, Masoumeh Azghani
Robotics
ๆฅ็ๆ่ฆ
Fingerprint liveness detection systems have been affected by spoofing, which is a severe threat for fingerprint-based biometric systems. Therefore, it is crucial to develop some techniques to distinguish the fake fingerprints from the real ones. The software based techniques can detect the fingerprint forgery automatically. Also, the scheme shall be resistant against various distortions such as noise contamination, pixel missing and block missing, so that the forgers cannot deceive the detector by adding some distortions to the faked fingerprint. In this paper, we propose a fingerprint forgery detection algorithm based on a suggested adaptive thresholding pattern. The anisotropic diffusion of the input image is passed through three levels of the wavelet transform. The coefficients of different layers are adaptively thresholded and concatenated to produce the feature vector which is classified using the SVM classifier. Another contribution of the paper is to investigate the effect of various distortions such as pixel missing, block missing, and noise contamination. Our suggested approach includes a novel method that exhibits improved resistance against a range of distortions caused by environmental phenomena or manipulations by malicious users. In quantitative comparisons, our proposed method outperforms its counterparts by approximately 8% and 5% in accuracy for missing pixel scenarios of 90% and block missing scenarios of size 70x70 , respectively. This highlights the novelty approach in addressing such challenges.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhihan Ren, Lijun He, Jiaxi Liang, Xinzhu Fu, Haixia Bi et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mauro Larrat, Claudomiro Sales
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security, prompting a need for robust systems that overcome limitations of single-modality approaches. This research addresses these challenges by designing and rigorously evaluating a novel multimodal Transformer model that integrates diverse data streams: radar, visual band video (RGB), infrared (IR) video, and audio. The architecture effectively fuses distinct features from each modality, leveraging the Transformer's self-attention mechanisms to learn comprehensive, complementary, and highly discriminative representations for classification. The model demonstrated exceptional performance on an independent test set, achieving macro-averaged metrics of 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity. Notably, it exhibited particularly high precision and recall in distinguishing drones from other aerial objects. Furthermore, computational analysis confirmed its efficiency, with 1.09 GFLOPs, 1.22 million parameters, and an inference speed of 41.11 FPS, highlighting its suitability for real-time applications. This study presents a significant advancement in aerial object classification, validating the efficacy of multimodal data fusion via a Transformer architecture for achieving state-of-the-art performance, thereby offering a highly accurate and resilient solution for UAV detection and monitoring in complex airspace.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Mehran Tamjidi, Hamidreza Dastmalchi, Mohammadreza Alimoradijazi, Ali Cheraghian, Aijun An et al.
Multimodal
ๆฅ็ๆ่ฆ
3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
๐
2025-11-19
๐ CVPR 2024
๐ฅ Yan Xia, Letian Shi, Yilin Di, Joao F. Henriques, Daniel Cremers
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both color and non-color point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-, text-, and instance-level losses. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%. In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jialong Sun, Hongguang Zhu, Weizhe Liu, Yunda Sun, Renshuai Tao et al.
Computer Vision
ๆฅ็ๆ่ฆ
Training prohibited item detection models requires a large amount of X-ray security images, but collecting and annotating these images is time-consuming and laborious. To address data insufficiency, X-ray security image synthesis methods composite images to scale up datasets. However, previous methods primarily follow a two-stage pipeline, where they implement labor-intensive foreground extraction in the first stage and then composite images in the second stage. Such a pipeline introduces inevitable extra labor cost and is not efficient. In this paper, we propose a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation, which incorporates two effective strategies to improve the usability of synthetic images. The Cross-Attention Refinement (CAR) strategy leverages the cross-attention map from the diffusion model to refine the bounding box annotation. The Background Occlusion Modeling (BOM) strategy explicitly models background occlusion in the latent space to enhance imaging complexity. To the best of our knowledge, compared with previous methods, Xsyn is the first to achieve high-quality X-ray security image synthesis without extra labor cost. Experiments demonstrate that our method outperforms all previous methods with 1.2% mAP improvement, and the synthetic images generated by our method are beneficial to improve prohibited item detection performance across various X-ray security datasets and detectors. Code is available at https://github.com/pILLOW-1/Xsyn/.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yanni Ma, Hao Liu, Yulan Guo, Theo Gevers, Martin R. Oswald
Machine Learning
ๆฅ็ๆ่ฆ
3D scene graph prediction aims to abstract complex 3D environments into structured graphs consisting of objects and their pairwise relationships. Existing approaches typically adopt object-centric graph neural networks, where relation edge features are iteratively updated by aggregating messages from connected object nodes. However, this design inherently restricts relation representations to pairwise object context, making it difficult to capture high-order relational dependencies that are essential for accurate relation prediction. To address this limitation, we propose a Link-guided Edge-centric relational reasoning framework with Object-aware fusion, namely LEO, which enables progressive reasoning from relation-level context to object-level understanding. Specifically, LEO first predicts potential links between object pairs to suppress irrelevant edges, and then transforms the original scene graph into a line graph where each relation is treated as a node. A line graph neural network is applied to perform edge-centric relational reasoning to capture inter-relation context. The enriched relation features are subsequently integrated into the original object-centric graph to enhance object-level reasoning and improve relation prediction. Our framework is model-agnostic and can be integrated with any existing object-centric method. Experiments on the 3DSSG dataset with two competitive baselines show consistent improvements, highlighting the effectiveness of our edge-to-object reasoning paradigm.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Jiashu Yang, Yifan Han, Yucheng Xie, Ning Guo, Wenzhao Lian
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using the 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision language action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently performs instructed scenes in real-world environments and actively acquires more accurate visual information through instruction-driven actions of rotation and zoom, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Loveneet Saini, Hasan Tercan, Tobias Meisen
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Object detection with 3D radar is essential for 360-degree automotive perception, but radar's long wavelengths produce sparse and irregular reflections that challenge traditional grid and sequence-based convolutional and transformer detectors. This paper introduces Graph Query Networks (GQN), an attention-based framework that models objects sensed by radar as graphs, to extract individualized relational and contextual features. GQN employs a novel concept of graph queries to dynamically attend over the bird's-eye view (BEV) space, constructing object-specific graphs processed by two novel modules: EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation. On the NuScenes dataset, GQN improves relative mAP by up to +53%, including a +8.2% gain over the strongest prior radar method, while reducing peak graph construction overhead by 80% with moderate FLOPs cost.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yitong Yang, Yinglin Wang, Changshuo Wang, Yongjun Zhang, Ziyang Chen et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Disentangling image content and style is essential for customized image generation. Existing SDXL-based methods struggle to achieve high-quality results, while the recently proposed Flux model fails to achieve effective content-style separation due to its underexplored characteristics. To address these challenges, we conduct a systematic analysis of Flux and make two key observations: (1) Single Dream Blocks are essential for image generation; and (2) Early single stream blocks mainly control content, whereas later blocks govern style. Based on these insights, we propose SplitFlux, which disentangles content and style by fine-tuning the single dream blocks via LoRA, enabling the disentangled content to be re-embedded into new contexts. It includes two key components: (1) Rank-Constrained Adaptation. To preserve content identity and structure, we compress the rank and amplify the magnitude of updates within specific blocks, preventing content leakage into style blocks. (2) Visual-Gated LoRA. We split the content LoRA into two branches with different ranks, guided by image saliency. The high-rank branch preserves primary subject information, while the low-rank branch encodes residual details, mitigating content overfitting and enabling seamless re-embedding. Extensive experiments demonstrate that SplitFlux consistently outperforms state-of-the-art methods, achieving superior content preservation and stylization quality across diverse scenarios.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yanchen Xu, Ziheng Jiao, Hongyuan Zhang, Xuelong Li
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as DeepSeek-R1. It raises a question whether GRPO can be generalized to representation learning models. In this paper, we propose Group Relative Policy Optimization for Representation Model (GRPO-RM), and investigate the performance of GRPO-like policy in post-training representation models. Specifically, our method establishes a predefined output set to functionally replace token sequence sampling in LLMs, thereby generating an output group, which is essential for the probability-driven optimization of GRPO. In addition, a specialized reward function is designed to accommodate the properties of representation models. Extensive experiments are conducted on various real-world datasets to validate the effectiveness of our proposed method.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fanfan Liu, Haibo Qiu
Natural Language Processing
ๆฅ็ๆ่ฆ
Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression C3 to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 Compression demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Codes and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuhao Shen, Jiahe Qian, Zhangtianyi Chen, Yuanhao He, Juexiao Zhou
Natural Language Processing
ๆฅ็ๆ่ฆ
We present SkinGPT-R1, a dermatology focused vision language model that makes diagnostic chain of thought reasoning explicit, step by step, and verifiable. To support skin specific reasoning, we build DermCoT, a corpus of standardized dermatologic chain of thought narratives that combines 10,000 DermEval filtered training cases with 3,000 dermatologist scored certified cases, and we define DermEval as a physician aligned six dimensional evaluator and DermBench as the corresponding benchmark for dermatologic chain of thought quality. On DermBench, across 14 general, reasoning, and medical vision language models, SkinGPT-R1 achieves an average score of 4.031 out of 5 over the six clinician defined dimensions, ranks 1st among all systems, and improves the average score over Vision-R1 by about 41%. On three dermatology classification benchmarks, SkinGPT-R1 delivers stable accuracy gains over Vision-R1 and remains competitive among strong vision language models. Ablation results further show that DermCoT based chain of thought supervision provides substantial improvements over the base model and that adding dermatology aware visual distillation yields consistent additional gains in both narrative quality and recognition.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque et al.
MultimodalComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Current state of the art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios. For this, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric combining large language models with reasoning, knowledge based mapping and vision-language models to overcome these limitations. The architecture is comprised of three main stages: (1) feature extraction of spatial and semantic information with multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models for structural and relational constraints (e.g., alignment, position, consistency) enforcement.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Qing Wang, Chong-Wah Ngo, Ee-Peng Lim
ๆฅ็ๆ่ฆ
This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. As the relationship between a recipe and its cooked dish is cause-and-effect, treating a recipe as a text source describing the visual appearance of a dish for learning representation, as the existing approaches, will create bias misleading image-and-recipe similarity judgment. Specifically, a food image may not equally capture every detail in a recipe, due to factors such as the cooking process, dish presentation, and image-capturing conditions. The current representation learning tends to capture dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests ingredients as one of the confounder sources and a simple backdoor adjustment can alleviate the bias. By causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term to remove the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically prove the oracle performance of retrieval on the Recipe1M dataset to be MedR=1 across the testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, which is essentially a multi-label ingredient classifier for debiasing. New state-of-the-art search performances are reported on the Recipe1M dataset.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Raghu Vamsi Chittersu, Yuvraj Singh Rathore, Pranav Adlinge, Kunal Swami
ๆฅ็ๆ่ฆ
Reference-based object composition methods fail when inserting real-world objects into stylized domains. This under-explored problem is currently split between practical "blenders" that lack generative fidelity and "generators" that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with two key innovations: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition, and (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation. This approach prevents the concept interference common in general-purpose, unified-attention models. Our framework is trained on a new 100k sample dataset, curated from a novel data pipeline. This pipeline couples large-scale generation with a rigorous, two-stage filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wasif Jalal, Md Nafiu Rahman, M. Sohel Rahman
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods face limitations such as manual feature engineering, limited receptive fields, and overfitting on heterogeneous data. Pure transformer models, while effective, require large datasets and high computational cost. We propose Brain ResNet over trained Vision Transformer (BrainRotViT), a hybrid architecture that combines the global context modeling of vision transformers (ViT) with the local refinement of residual CNNs. A ViT encoder is first trained on an auxiliary age and sex classification task to learn slice-level features. The frozen encoder is then applied to all sagittal slices to generate a 2D matrix of embedding vectors, which is fed into a residual CNN regressor that incorporates subject sex at the final fully-connected layer to estimate continuous brain age. Our method achieves an MAE of 3.34 years (Pearson $r=0.98$, Spearman $ฯ=0.97$, $R^2=0.95$) on validation across 11 MRI datasets encompassing more than 130 acquisition sites, outperforming baseline and state-of-the-art models. It also generalizes well across 4 independent cohorts with MAEs between 3.77 and 5.04 years. Analyses on the brain age gap (the difference between the predicted age and actual age) show that aging patterns are associated with Alzheimer's disease, cognitive impairment, and autism spectrum disorder. Model attention maps highlight aging-associated regions of the brain, notably the cerebellar vermis, precentral and postcentral gyri, temporal lobes, and medial superior frontal gyrus. Our results demonstrate that this method provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging the gap between CNN- and transformer-based approaches while opening new avenues for aging and neurodegeneration research.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kyotaro Tokoro, Hiromu Taketsugu, Norimichi Ukita
Multimodal
ๆฅ็ๆ่ฆ
This paper proposes a novel metric for Human Motion Prediction (HMP). Since a single past sequence can lead to multiple possible futures, a probabilistic HMP method predicts such multiple motions. While a single motion predicted by a deterministic method is evaluated only with the difference from its ground truth motion, multiple predicted motions should also be evaluated based on their distribution. For this evaluation, this paper focuses on the following two criteria. \textbf{(a) Coverage}: motions should be distributed among multiple motion modes to cover diverse possibilities. \textbf{(b) Validity}: motions should be kinematically valid as future motions observable from a given past motion. However, existing metrics simply appreciate widely distributed motions even if these motions are observed in a single mode and kinematically invalid. To resolve these disadvantages, this paper proposes a Multimodality-aware Metric using Clustering-based Modes (MMCM). For (a) coverage, MMCM divides a motion space into several clusters, each of which is regarded as a mode. These modes are used to explicitly evaluate whether predicted motions are distributed among multiple modes. For (b) validity, MMCM identifies valid modes by collecting possible future motions from a motion dataset. Our experiments validate that our clustering yields sensible mode definitions and that MMCM accurately scores multimodal predictions. Code: https://github.com/placerkyo/MMCM
๐
2025-11-19
๐ ArXiv Preprint (q-bio.QM)
๐ฅ Jun Hyeun Kang, Jung Eek Son, Tae In Ahn
Machine Learning
ๆฅ็ๆ่ฆ
The application of spectral-shifting films in greenhouses to shift green light to red light has shown variable growth responses across crop species. However, the yield enhancement of crops under altered light quality is related to the collective effects of the specific biophysical characteristics of each species. Considering only one attribute of a crop has limitations in understanding the relationship between sunlight quality adjustments and crop growth performance. Therefore, this study aims to comprehensively link multiple plant phenotypic traits and daily light integral considering the physiological responses of crops to their growth outcomes under SF using artificial intelligence. Between 2021 and 2024, various leafy, fruiting, and root crops were grown in greenhouses covered with either PEF or SF, and leaf reflectance, leaf mass per area, chlorophyll content, daily light integral, and light saturation point were measured from the plants cultivated in each condition. 210 data points were collected, but there was insufficient data to train deep learning models, so a variational autoencoder was used for data augmentation. Most crop yields showed an average increase of 22.5% under SF. These data were used to train several models, including logistic regression, decision tree, random forest, XGBoost, and feedforward neural network (FFNN), aiming to binary classify whether there was a significant effect on yield with SF application. The FFNN achieved a high classification accuracy of 91.4% on a test dataset that was not used for training. This study provide insight into the complex interactions between leaf phenotypic and photosynthetic traits, environmental conditions, and solar spectral components by improving the ability to predict solar spectral shift effects using SF.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jing Cao, Kui Jiang, Shenyi Li, Xiaocheng Feng, Yong Huang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Self-supervised depth estimation has gained significant attention in autonomous driving and robotics. However, existing methods exhibit substantial performance degradation under adverse weather conditions such as rain and fog, where reduced visibility critically impairs depth prediction. To address this issue, we propose a novel self-evolution contrastive learning framework called SEC-Depth for self-supervised robust depth estimation tasks. Our approach leverages intermediate parameters generated during training to construct temporally evolving latency models. Using these, we design a self-evolution contrastive scheme to mitigate performance loss under challenging conditions. Concretely, we first design a dynamic update strategy of latency models for the depth estimation task to capture optimization states across training stages. To effectively leverage latency models, we introduce a self-evolution contrastive Loss (SECL) that treats outputs from historical latency models as negative samples. This mechanism adaptively adjusts learning objectives while implicitly sensing weather degradation severity, reducing the needs for manual intervention. Experiments show that our method integrates seamlessly into diverse baseline models and significantly enhances robustness in zero-shot evaluations.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Songze Li, Mingyu Gao, Tonghua Su, Xu-Yao Zhang, Zhongjie Wang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Firdavs Nasriddinov, Rafal Kocielnik, Anima Anandkumar, Andrew J. Hung
Natural Language Processing
ๆฅ็ๆ่ฆ
High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chun-Jung Lin, Tat-Jun Chin, Sourav Garg, Feras Dayoub
Robotics
ๆฅ็ๆ่ฆ
Accurate, up-to-date High-Definition (HD) maps are critical for urban planning, infrastructure monitoring, and autonomous navigation. However, these maps quickly become outdated as environments evolve, creating a need for robust methods that not only detect changes but also incorporate them into updated 3D representations. While change detection techniques have advanced significantly, there remains a clear gap between detecting changes and actually updating 3D maps, particularly when relying on 2D image-based change detection. To address this gap, we introduce SceneEdited, the first city-scale dataset explicitly designed to support research on HD map maintenance through 3D point cloud updating. SceneEdited contains over 800 up-to-date scenes covering 73 km of driving and approximate 3 $\text{km}^2$ of urban area, with more than 23,000 synthesized object changes created both manually and automatically across 2000+ out-of-date versions, simulating realistic urban modifications such as missing roadside infrastructure, buildings, overpasses, and utility poles. Each scene includes calibrated RGB images, LiDAR scans, and detailed change masks for training and evaluation. We also provide baseline methods using a foundational image-based structure-from-motion pipeline for updating outdated scenes, as well as a comprehensive toolkit supporting scalability, trackability, and portability for future dataset expansion and unification of out-of-date object annotations. Both the dataset and the toolkit are publicly available at https://github.com/ChadLin9596/ScenePoint-ETK, establising a standardized benchmark for 3D map updating research.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meihua Zhou, Xinyu Tong, Jiarui Zhao, Min Cheng, Li Yang et al.
ๆฅ็ๆ่ฆ
High-dimensional neuroimaging analyses for clinical diagnosis are often constrained by compromises in spatiotemporal fidelity and by the limited adaptability of large-scale, general-purpose models. To address these challenges, we introduce Dynamic Curriculum Learning for Spatiotemporal Encoding (DCL-SE), an end-to-end framework centered on data-driven spatiotemporal encoding (DaSE). We leverage Approximate Rank Pooling (ARP) to efficiently encode three-dimensional volumetric brain data into information-rich, two-dimensional dynamic representations, and then employ a dynamic curriculum learning strategy, guided by a Dynamic Group Mechanism (DGM), to progressively train the decoder, refining feature extraction from global anatomical structures to fine pathological details. Evaluated across six publicly available datasets, including Alzheimer's disease and brain tumor classification, cerebral artery segmentation, and brain age prediction, DCL-SE consistently outperforms existing methods in accuracy, robustness, and interpretability. These findings underscore the critical importance of compact, task-specific architectures in the era of large-scale pretrained networks.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nishchala Thakur, Swati Kochhar, Deepti R. Bathula, Sukrit Gupta
ๆฅ็ๆ่ฆ
Active learning reduces annotation costs in medical imaging by strategically selecting the most informative samples for labeling. However, individual acquisition strategies often exhibit inconsistent behavior across different stages of the active learning cycle. We propose Cyclical and Performance-Adaptive Multi-Strategy Active Learning (WaveFuse-AL), a novel framework that adaptively fuses multiple established acquisition strategies-BALD, BADGE, Entropy, and CoreSet throughout the learning process. WaveFuse-AL integrates cyclical (sinusoidal) temporal priors with performance-driven adaptation to dynamically adjust strategy importance over time. We evaluate WaveFuse-AL on three medical imaging benchmarks: APTOS-2019 (multi-class classification), RSNA Pneumonia Detection (binary classification), and ISIC-2018 (skin lesion segmentation). Experimental results demonstrate that WaveFuse-AL consistently outperforms both single-strategy and alternating-strategy baselines, achieving statistically significant performance improvements (on ten out of twelve metric measurements) while maximizing the utility of limited annotation budgets.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jin Wang, Bingfeng Zhang, Jian Pang, Weifeng Liu, Baodi Liu et al.
Multimodal
ๆฅ็ๆ่ฆ
Few-shot segmentation has garnered significant attention. Many recent approaches attempt to introduce the Segment Anything Model (SAM) to handle this task. With the strong generalization ability and rich object-specific extraction ability of the SAM model, such a solution shows great potential in few-shot segmentation. However, the decoding process of SAM highly relies on accurate and explicit prompts, making previous approaches mainly focus on extracting prompts from the support set, which is insufficient to activate the generalization ability of SAM, and this design is easy to result in a biased decoding process when adapting to the unknown classes. In this work, we propose an Unbiased Semantic Decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query set simultaneously to perform consistent predictions guided by the semantics of the Contrastive Language-Image Pre-training (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage the semantic alignment capability of CLIP to enrich the original SAM features, mainly including a global supplement at the image level to provide a generalize category indicate with support image and a local guidance at the pixel level to provide a useful target location with query image. Besides, to generate target-focused prompt embeddings, a learnable visual-text target prompt generator is proposed by interacting target text embeddings and clip visual features. Without requiring re-training of the vision foundation models, the features with semantic discrimination draw attention to the target region through the guidance of prompt with rich target information.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jun-Yi Liu, Chung-Hao Chen, Ya-Chi Tsao, Ssu-Yao Wu, Yu-Ting Tsao et al.
Machine Learning
ๆฅ็ๆ่ฆ
In the study, the physical state and mental state of elders are both considered, and an event-triggered system has developed to detect events: watch dog, danger notice and photo link. By adopting GMM background modeling, the motion behavior of visitors and elders can be detected in the watch dog event and danger notice event respectively. Experiments set in home scenarios and 5 families participated in the experiments for detecting and recording three types of events from their life activities. In addition, the captured images were analyzed using SVM machine learning. For lack of technical experiences of elders, an intuitive operation as normal life activity was designed to create communication between elder and relatives via social media.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Junseo Koo, Jinseo Jeong, Gunhee Kim
ๆฅ็ๆ่ฆ
The recent introduction of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis. Several studies have further improved the rendering quality of 3DGS, yet they still exhibit noticeable visual discrepancies when synthesizing views at sampling rates unseen during training. Specifically, they suffer from (i) erosion-induced blurring artifacts when zooming in and (ii) dilation-induced staircase artifacts when zooming out. We speculate that these artifacts arise from the fundamental limitation of the alpha blending adopted in 3DGS methods. Instead of the conventional alpha blending that computes alpha and transmittance as scalar quantities over a pixel, we propose to replace it with our novel Gaussian Blending that treats alpha and transmittance as spatially varying distributions. Thus, transmittances can be updated considering the spatial distribution of alpha values across the pixel area, allowing nearby background splats to contribute to the final rendering. Our Gaussian Blending maintains real-time rendering speed and requires no additional memory cost, while being easily integrated as a drop-in replacement into existing 3DGS-based or other NVS frameworks. Extensive experiments demonstrate that Gaussian Blending effectively captures fine details at various sampling rates unseen during training, consistently outperforming existing novel view synthesis models across both unseen and seen sampling rates.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneer studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and only from-scratch dMLLMs can recover the lost information progressively during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chengyu Xie, Zhi Gong, Junchi Ren, Linkun Yu, Si Shen et al.
Computer Vision
ๆฅ็ๆ่ฆ
Pose-guided human image generation is limited by incomplete textures from single reference views and the absence of explicit cross-view interaction. We present jointly conditioned diffusion model (JCDM), a jointly conditioned diffusion framework that exploits multi-view priors. The appearance prior module (APM) infers a holistic identity preserving prior from incomplete references, and the joint conditional injection (JCI) mechanism fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture across poses. JCDM supports a variable number of reference views and integrates with standard diffusion backbones with minimal and targeted architectural modifications. Experiments demonstrate state of the art fidelity and cross-view consistency.
๐
2025-11-19
๐ ArXiv Preprint (cs.DB)
๐ฅ Wenhan Yu, Wang Chen, Guanqiang Qi, Weikang Li, Yang Li et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Document Visual Question Answering (DocVQA) is a fundamental task for multimodal document understanding and a key testbed for vision language reasoning. However, most existing DocVQA datasets are limited to the page level and lack fine grained spatial grounding, constraining the interpretability and reasoning capability of Vision Language Models (VLMs). To address this gap, we introduce BBox DocVQA a large scale, bounding box grounded dataset designed to enhance spatial reasoning and evidence localization in visual documents. We further present an automated construction pipeline, Segment Judge and Generate, which integrates a segment model for region segmentation, a VLM for semantic judgment, and another advanced VLM for question answer generation, followed by human verification for quality assurance. The resulting dataset contains 3.6 K diverse documents and 32 K QA pairs, encompassing single and multi region as well as single and multi page scenarios. Each QA instance is grounded on explicit bounding boxes, enabling fine grained evaluation of spatial semantic alignment. Benchmarking multiple state of the art VLMs (e.g., GPT 5, Qwen2.5 VL, and InternVL) on BBox DocVQA reveals persistent challenges in spatial grounding and reasoning accuracy. Furthermore, fine tuning on BBox DocVQA substantially improves both bounding box localization and answer generation, validating its effectiveness for enhancing the reasoning ability of VLMs. Our dataset and code will be publicly released to advance research on interpretable and spatially grounded vision language reasoning.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wen Yin, Siyu Zhan, Cencen Liu, Xin Hu, Guiduo Duan et al.
Multimodal
ๆฅ็ๆ่ฆ
Multimodal Emotion Recognition (MER) aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approaches predominantly rely on unified emotion labels to supervise model training, often overlooking a critical challenge: inter-modal emotion conflicts, wherein different modalities within the same sample may express divergent emotional tendencies. In this work, we address this overlooked issue by proposing a novel framework, Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL), inspired by the stage-wise nature of human emotion perception. TiCAL dynamically assesses the consistency of each training sample by leveraging pseudo unimodal emotion labels alongside a typicality estimation. To further enhance emotion representation, we embed features in a hyperbolic space, enabling the capture of fine-grained distinctions among emotional categories. By incorporating consistency estimates into the learning process, our method improves model performance, particularly on samples exhibiting high modality inconsistency. Extensive experiments on benchmark datasets, e.g, CMU-MOSEI and MER2023, validate the effectiveness of TiCAL in mitigating inter-modal emotional conflicts and enhancing overall recognition accuracy, e.g., with about 2.6% improvements over the state-of-the-art DMD.
๐
2025-11-19
๐ This work has been submitted to a journal for poss
๐ฅ Shengjing Tian, Yinan Han, Xiantong Zhao, Xuehu Liu, Qi Lang
ๆฅ็ๆ่ฆ
Dynamic outdoor environments with high temporal variation (HTV) pose significant challenges for 3D single object tracking in LiDAR point clouds. Existing memory-based trackers often suffer from quadratic computational complexity, temporal redundancy, and insufficient exploitation of geometric priors. To address these issues, we propose MambaTrack3D, a novel HTV-oriented tracking framework built upon the state space model Mamba. Specifically, we design a Mamba-based Inter-frame Propagation (MIP) module that replaces conventional single-frame feature extraction with efficient inter-frame propagation, achieving near-linear complexity while explicitly modeling spatial relations across historical frames. Furthermore, a Grouped Feature Enhancement Module (GFEM) is introduced to separate foreground and background semantics at the channel level, thereby mitigating temporal redundancy in the memory bank. Extensive experiments on KITTI-HTV and nuScenes-HTV benchmarks demonstrate that MambaTrack3D consistently outperforms both HTV-oriented and normal-scenario trackers, achieving improvements of up to 6.5 success and 9.5 precision over HVTrack under moderate temporal gaps. On the standard KITTI dataset, MambaTrack3D remains highly competitive with state-of-the-art normal-scenario trackers, confirming its strong generalization ability. Overall, MambaTrack3D achieves a superior accuracy-efficiency trade-off, delivering robust performance across both specialized HTV and conventional tracking scenarios.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zisong Wang, Xuanyu Wang, Hang Chen, Haizhou Wang, Yuxin Chen et al.
ๆฅ็ๆ่ฆ
Precise prognostic stratification of colorectal cancer (CRC) remains a major clinical challenge due to its high heterogeneity. The conventional TNM staging system is inadequate for personalized medicine. We aimed to develop and validate a novel multiple instance learning model TDAM-CRC using histopathological whole-slide images for accurate prognostic prediction and to uncover its underlying molecular mechanisms. We trained the model on the TCGA discovery cohort (n=581), validated it in an independent external cohort (n=1031), and further we integrated multi-omics data to improve model interpretability and identify novel prognostic biomarkers. The results demonstrated that the TDAM-CRC achieved robust risk stratification in both cohorts. Its predictive performance significantly outperformed the conventional clinical staging system and multiple state-of-the-art models. The TDAM-CRC risk score was confirmed as an independent prognostic factor in multivariable analysis. Multi-omics analysis revealed that the high-risk subtype is closely associated with metabolic reprogramming and an immunosuppressive tumor microenvironment. Through interaction network analysis, we identified and validated Mitochondrial Ribosomal Protein L37 (MRPL37) as a key hub gene linking deep pathomic features to clinical prognosis. We found that high expression of MRPL37, driven by promoter hypomethylation, serves as an independent biomarker of favorable prognosis. Finally, we constructed a nomogram incorporating the TDAM-CRC risk score and clinical factors to provide a precise and interpretable clinical decision-making tool for CRC patients. Our AI-driven pathological model TDAM-CRC provides a robust tool for improved CRC risk stratification, reveals new molecular targets, and facilitates personalized clinical decision-making.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yachuan Huang, Xianrui Luo, Qiwen Wang, Liao Shen, Jiaqi Li et al.
Robotics
ๆฅ็ๆ่ฆ
Bokeh rendering simulates the shallow depth-of-field effect in photography, enhancing visual aesthetics and guiding viewer attention to regions of interest. Although recent approaches perform well, rendering controllable bokeh without additional depth inputs remains a significant challenge. Existing classical and neural controllable methods rely on accurate depth maps, while generative approaches often struggle with limited controllability and efficiency. In this paper, we propose BokehFlow, a depth-free framework for controllable bokeh rendering based on flow matching. BokehFlow directly synthesizes photorealistic bokeh effects from all-in-focus images, eliminating the need for depth inputs. It employs a cross-attention mechanism to enable semantic control over both focus regions and blur intensity via text prompts. To support training and evaluation, we collect and synthesize four datasets. Extensive experiments demonstrate that BokehFlow achieves visually compelling bokeh effects and offers precise control, outperforming existing depth-dependent and generative methods in both rendering quality and efficiency.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench -- a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video model. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10--20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
๐
2025-11-19
๐ ArXiv Preprint (eess.IV)
๐ฅ Nabiha Choudhury, Jianqing Jia, Yifei Lou
ๆฅ็ๆ่ฆ
Total variation (TV) regularization is a classical tool for image denoising, but its convex $\ell_1$ formulation often leads to staircase artifacts and loss of contrast. To address these issues, we introduce the Transformed $\ell_1$ (TL1) regularizer applied to image gradients. In particular, we develop a TL1-regularized denoising model and solve it using the Alternating Direction Method of Multipliers (ADMM), featuring a closed-form TL1 proximal operator and an FFT-based image update under periodic boundary conditions. Experimental results demonstrate that our approach achieves superior denoising performance, effectively suppressing noise while preserving edges and enhancing image contrast.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Keito Sasagawa, Shuhei Kurita, Daisuke Kawahara
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what are written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced from the real-world document images containing vertically written Japanese text. Using these datasets, we demonstrate that the existing MLLMs perform worse on vertically written Japanese text than on horizontally written Japanese text. Furthermore, we show that training MLLMs on our synthesized Japanese OCR dataset results in improving the performance of models that previously could not handle vertical writing. The datasets and code are publicly available https://github.com/llm-jp/eval_vertical_ja.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Yaxiong Chen, Qicong Wang, Chunlei Li, Jingliang Hu, Yilei Shi et al.
Computer Vision
ๆฅ็ๆ่ฆ
Existing approaches for the problem of ultrasound image segmentation, whether supervised or semi-supervised, are typically specialized for specific anatomical structures or tasks, limiting their practical utility in clinical settings. In this paper, we pioneer the task of universal semi-supervised ultrasound image segmentation and propose ProPL, a framework that can handle multiple organs and segmentation tasks while leveraging both labeled and unlabeled data. At its core, ProPL employs a shared vision encoder coupled with prompt-guided dual decoders, enabling flexible task adaptation through a prompting-upon-decoding mechanism and reliable self-training via an uncertainty-driven pseudo-label calibration (UPLC) module. To facilitate research in this direction, we introduce a comprehensive ultrasound dataset spanning 5 organs and 8 segmentation tasks. Extensive experiments demonstrate that ProPL outperforms state-of-the-art methods across various metrics, establishing a new benchmark for universal ultrasound image segmentation.
๐
2025-11-19
๐ 4 pages, 3 figures, Submitted to IEEE SSIAI 2026
๐ฅ Srijan Ray, Bikesh K. Nirala, Jason T. Yustein, Sundaresh Ram
ๆฅ็ๆ่ฆ
Accurate nuclei segmentation in microscopy whole slide images (WSIs) remains challenging due to variability in staining, imaging conditions, and tissue morphology. We propose CellGenNet, a knowledge distillation framework for robust cross-tissue cell segmentation under limited supervision. CellGenNet adopts a student-teacher architecture, where a capacity teacher is trained on sparse annotations and generates soft pseudo-labels for unlabeled regions. The student is optimized using a joint objective that integrates ground-truth labels, teacher-derived probabilistic targets, and a hybrid loss function combining binary cross-entropy and Tversky loss, enabling asymmetric penalties to mitigate class imbalance and better preserve minority nuclear structures. Consistency regularization and layerwise dropout further stabilize feature representations and promote reliable feature transfer. Experiments across diverse cancer tissue WSIs show that CellGenNet improves segmentation accuracy and generalization over supervised and semi-supervised baselines, supporting scalable and reproducible histopathology analysis.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yue Wen, Kunjing Yang, Minru Bai
Machine Learning
ๆฅ็ๆ่ฆ
The fusion of hyperspectral image (HSI) with multispectral image (MSI) provides an effective way to enhance the spatial resolution of HSI. However, due to different acquisition conditions, there may exist spectral variability and spatially localized changes between HSI and MSI, referred to as inter-image variability, which can significantly affect the fusion performance. Existing methods typically handle inter-image variability by applying direct transformations to the images themselves, which can exacerbate the ill-posedness of the fusion model. To address this challenge, we propose a Degradation-based Low-Rank and Residual Fusion (DLRRF) model. First, we model the spectral variability as change in the spectral degradation operator. Second, to recover the lost spatial details caused by spatially localized changes, we decompose the target HSI into low rank and residual components, where the latter is used to capture the lost details. By exploiting the spectral correlation within the images, we perform dimensionality reduction on both components. Additionally, we introduce an implicit regularizer to utilize the spatial prior information from the images. The proposed DLRRF model is solved using the Proximal Alternating Optimization (PAO) algorithm within a Plug-and-Play (PnP) framework, where the subproblem regarding implicit regularizer is addressed by an external denoiser. We further provide a comprehensive convergence analysis of the algorithm. Finally, extensive numerical experiments demonstrate that DLRRF achieves superior performance in fusing HSI and MSI with inter-image variability.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Panqi Yang, Haodong Jing, Nanning Zheng, Yongqiang Ma
ๆฅ็ๆ่ฆ
In the field of human-object interaction (HOI), detection and generation are two dual tasks that have traditionally been addressed separately, hindering the development of comprehensive interaction understanding. To address this, we propose UniHOI, which jointly models HOI detection and generation via a unified token space, thereby effectively promoting knowledge sharing and enhancing generalization. Specifically, we introduce a symmetric interaction-aware attention module and a unified semi-supervised learning paradigm, enabling effective bidirectional mapping between images and interaction semantics even under limited annotations. Extensive experiments demonstrate that UniHOI achieves state-of-the-art performance in both HOI detection and generation. Specifically, UniHOI improves accuracy by 4.9% on long-tailed HOI detection and boosts interaction metrics by 42.0% on open-vocabulary generation tasks.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zekun Wang, Sashank Varma
Computer Vision
ๆฅ็ๆ่ฆ
Mathematical thinking is a fundamental aspect of human cognition. Cognitive scientists have investigated the mechanisms that underlie our ability to thinking geometrically and numerically, to take two prominent examples, and developmental scientists have documented the trajectories of these abilities over the lifespan. Prior research has shown that computer vision (CV) models trained on the unrelated task of image classification nevertheless learn latent representations of geometric and numerical concepts similar to those of adults. Building on this demonstrated cognitive alignment, the current study investigates whether CV models also show developmental alignment: whether their performance improvements across training to match the developmental progressions observed in children. In a detailed case study of the ResNet-50 model, we show that this is the case. For the case of geometry and topology, we find developmental alignment for some classes of concepts (Euclidean Geometry, Geometrical Figures, Metric Properties, Topology) but not others (Chiral Figures, Geometric Transformations, Symmetrical Figures). For the case of number, we find developmental alignment in the emergence of a human-like ``mental number line'' representation with experience. These findings show the promise of computer vision models for understanding the development of mathematical understanding in humans. They point the way to future research exploring additional model architectures and building larger benchmarks.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicheng Zhan, Xiangjun Gao, Long Quan, Kaan Akลit
Machine Learning
ๆฅ็ๆ่ฆ
We propose a new hologram representation based on structured complex-valued 2D Gaussian primitives, which replaces per-pixel information storage and reduces the parameter search space by up to 10:1. To enable end-to-end training, we develop a differentiable rasterizer for our representation, integrated with a GPU-optimized light propagation kernel in free space. Our extensive experiments show that our method achieves up to 2.5x lower VRAM usage and 50% faster optimization while producing higher-fidelity reconstructions than existing methods. We further introduce a conversion procedure that adapts our representation to practical hologram formats, including smooth and random phase-only holograms. Our experiments show that this procedure can effectively suppress noise artifacts observed in previous methods. By reducing the hologram parameter search space, our representation enables a more scalable hologram estimation in the next-generation computer-generated holography systems.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenyu Cui, Jiahuan Zhou, Yuxin Peng
ๆฅ็ๆ่ฆ
Lifelong person Re-IDentification (LReID) aims to match the same person employing continuously collected individual data from different scenarios. To achieve continuous all-day person matching across day and night, Visible-Infrared Lifelong person Re-IDentification (VI-LReID) focuses on sequential training on data from visible and infrared modalities and pursues average performance over all data. To this end, existing methods typically exploit cross-modal knowledge distillation to alleviate the catastrophic forgetting of old knowledge. However, these methods ignore the mutual interference of modality-specific knowledge acquisition and modality-common knowledge anti-forgetting, where conflicting knowledge leads to collaborative forgetting. To address the above problems, this paper proposes a Cross-modality Knowledge Disentanglement and Alignment method, called CKDA, which explicitly separates and preserves modality-specific knowledge and modality-common knowledge in a balanced way. Specifically, a Modality-Common Prompting (MCP) module and a Modality-Specific Prompting (MSP) module are proposed to explicitly disentangle and purify discriminative information that coexists and is specific to different modalities, avoiding the mutual interference between both knowledge. In addition, a Cross-modal Knowledge Alignment (CKA) module is designed to further align the disentangled new knowledge with the old one in two mutually independent inter- and intra-modality feature spaces based on dual-modality prototypes in a balanced manner. Extensive experiments on four benchmark datasets verify the effectiveness and superiority of our CKDA against state-of-the-art methods. The source code of this paper is available at https://github.com/PKU-ICST-MIPL/CKDA-AAAI2026.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yueru He, Xueqing Peng, Yupeng Cao, Yan Wang, Lingfei Qian et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We introduce FinCriticalED (Financial Critical Error Detection), a visual benchmark for evaluating OCR and vision language models on financial documents at the fact level. Financial documents contain visually dense and table heavy layouts where numerical and temporal information is tightly coupled with structure. In high stakes settings, small OCR mistakes such as sign inversion or shifted dates can lead to materially different interpretations, while traditional OCR metrics like ROUGE and edit distance capture only surface level text similarity. \ficriticaled provides 500 image-HTML pairs with expert annotated financial facts covering over seven hundred numerical and temporal facts. It introduces three key contributions. First, it establishes the first fact level evaluation benchmark for financial document understanding, shifting evaluation from lexical overlap to domain critical factual correctness. Second, all annotations are created and verified by financial experts with strict quality control over signs, magnitudes, and temporal expressions. Third, we develop an LLM-as-Judge evaluation pipeline that performs structured fact extraction and contextual verification for visually complex financial documents. We benchmark OCR systems, open source vision language models, and proprietary models on FinCriticalED. Results show that although the strongest proprietary models achieve the highest factual accuracy, substantial errors remain in visually intricate numerical and temporal contexts. Through quantitative evaluation and expert case studies, FinCriticalED provides a rigorous foundation for advancing visual factual precision in financial and other precision critical domains.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Violeta Kastreva, Philip Whittington, Dennis Komm, Tiago Pimentel
ๆฅ็ๆ่ฆ
Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation remains NP-complete even when applied to unary alphabets. While unary alphabets may not be practically useful, this result establishes that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why practical algorithms such as BPE and UnigramLM are heuristic, and points toward approximation algorithms being an important path going forward for tokenisation research.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) Multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods-originally designed for unimodal large language models (LLMs)-to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
MultimodalNatural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang et al.
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
๐
2025-11-19
๐ AAAI
๐ฅ Qihao Yang, Xuelin Wang, Jiale Chen, Xuelian Dong, Yuxin Hao et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners' language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs have writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLMs interpretability. Code and data are publicly available at: https://github.com/CharlesYang030/HSKB.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUI remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA as judges to assist Coder for automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yves Pauli, Jan-Bernard Marsman, Finn Rabe, Victoria Edkins, Roya Hรผppi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse language data. With the resultant growth of attention on language processing, significant challenges have emerged, including the lack of standardisation in organising and sharing linguistic data and the absence of standardised and reproducible processing methodologies. Striving for future standardisation, we first propose the Language Processing Data Structure (LPDS), a data structure inspired by the Brain Imaging Data Structure (BIDS), a widely adopted standard for handling neuroscience data. It provides a folder structure and file naming conventions for linguistic research. Second, we introduce pelican nlp, a modular and extensible Python package designed to enable streamlined language processing, from initial data cleaning and task-specific preprocessing to the extraction of sophisticated linguistic and acoustic features, such as semantic embeddings and prosodic metrics. The entire processing workflow can be specified within a single, shareable configuration file, which pelican nlp then executes on LPDS-formatted data. Depending on the specifications, the reproducible output can consist of preprocessed language data or standardised extraction of both linguistic and acoustic features and corresponding result aggregations. LPDS and pelican nlp collectively offer an end-to-end processing pipeline for linguistic data, designed to ensure methodological transparency and enhance reproducibility.
๐
2025-11-19
๐ AAAI
๐ฅ Ao Xie, Jiahui Chen, Quanzhi Zhu, Xiaoze Jiang, Zhiheng Qin et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Dense retrieval has become a foundational paradigm in modern search systems, especially on short-video platforms. However, most industrial systems adopt a self-reinforcing training pipeline that relies on historically exposed user interactions for supervision. This paradigm inevitably leads to a filter bubble effect, where potentially relevant but previously unseen content is excluded from the training signal, biasing the model toward narrow and conservative retrieval. In this paper, we present CroPS (Cross-Perspective Positive Samples), a novel retrieval data engine designed to alleviate this problem by introducing diverse and semantically meaningful positive examples from multiple perspectives. CroPS enhances training with positive signals derived from user query reformulation behavior (query-level), engagement data in recommendation streams (system-level), and world knowledge synthesized by large language models (knowledge-level). To effectively utilize these heterogeneous signals, we introduce a Hierarchical Label Assignment (HLA) strategy and a corresponding H-InfoNCE loss that together enable fine-grained, relevance-aware optimization. Extensive experiments conducted on Kuaishou Search, a large-scale commercial short-video search platform, demonstrate that CroPS significantly outperforms strong baselines both offline and in live A/B tests, achieving superior retrieval performance and reducing query reformulation rates. CroPS is now fully deployed in Kuaishou Search, serving hundreds of millions of users daily.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuanjie Zhu, Liangwei Yang, Ke Xu, Weizhi Zhang, Zihe Song et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their direct application is fundamentally limited by a lack of stateful memory for iterative refinement and the difficulty of managing cluster granularity. As a result, existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task. It leverages a Dynamic Memory to instill state awareness and a Dual-Prompt Strategy to enable the model to reason about and determine the number of clusters. Evaluated on several benchmark datasets, our tuning-free framework significantly and consistently outperforms strong baselines. LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Arjun Gangwar, Kaousheik Jayakumar, S. Umesh
ๆฅ็ๆ่ฆ
This paper describes the systems developed by SPRING Lab, Indian Institute of Technology Madras, for the ASRU MADASR 2.0 challenge. The systems developed focuses on adapting ASR systems to improve in predicting the language and dialect of the utterance among 8 languages across 33 dialects. We participated in Track 1 and Track 2, which restricts the use of additional data and develop from-the-scratch multilingual systems. We presented a novel training approach using Multi-Decoder architecture with phonemic Common Label Set (CLS) as intermediate representation. It improved the performance over the baseline (in the CLS space). We also discuss various methods used to retain the gain obtained in the phonemic space while converting them back to the corresponding grapheme representations. Our systems beat the baseline in 3 languages (Track 2) in terms of WER/CER and achieved the highest language ID and dialect ID accuracy among all participating teams (Track 2).
๐
2025-11-19
๐ 13 pages,9 figures. This work has been submitted t
๐ฅ Shanlin Zhou, Xinpeng Wang, Jianxun Lian, Zhenghao Liu, Laks V. S. Lakshmanan et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Trained on diverse human-authored texts, Large Language Models (LLMs) unlocked the potential for Creative Natural Language Generation (CNLG), benefiting various applications like advertising and storytelling. Nevertheless, CNLG still remains difficult due to two main challenges. (1) Multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, which LLMs struggle to satisfy simultaneously; (2) Interpretive complexity: beyond generation, creativity also involves understanding and interpreting implicit meaning to enhance users' perception. These challenges significantly limit current methods, especially in short-form text generation, in generating creative and insightful content. To address this, we focus on Chinese baby naming, a representative short-form CNLG task requiring adherence to explicit user constraints (e.g., length, semantics, anthroponymy) while offering meaningful aesthetic explanations. We propose NAMeGEn, a novel multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation to meet diverse requirements and generate accurate explanations. To support this task, we further construct a classical Chinese poetry corpus with 17k+ poems to enhance aesthetics, and introduce CBNames, a new benchmark with tailored metrics. Extensive experiments demonstrate that NAMeGEn effectively generates creative names that meet diverse, personalized requirements while providing meaningful explanations, outperforming six baseline methods spanning various LLM backbones without any training.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Sirui Chen, Mengshi Zhao, Lei Xu, Yuying Zhao, Beier Zhu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Nevertheless, there still lacks systematic definition of LLM agent efficiency, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Byungho Jo
MultimodalRobotics
ๆฅ็ๆ่ฆ
Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
๐
2025-11-19
๐ The manuscript is currently ongoing the underrevie
๐ฅ Guoqiang Liang, Jingqian Gong, Mengxuan Li, Gege Lin, Shuo Zhang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge-enhanced retrieval augmented generation, fine tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI agent based model for scientific evaluation, and presents new research fronts detection and knowledge graph building methods with LLMs.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Alexis Correa-Guillรฉn, Carlos Gรณmez-Rodrรญguez, David Vilares
ๆฅ็ๆ่ฆ
We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gรณmez-Rodrรญguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
๐
2025-11-19
๐ ArXiv Preprint (cs.PL)
๐ฅ Youwei Xiao, Yuyang Zou, Yun Liang
Machine Learning
ๆฅ็ๆ่ฆ
Hardware synthesis from high-level descriptions remains fundamentally limited by the sequential optimization of interdependent design decisions. Current methodologies, including state-of-the-art high-level synthesis (HLS) tools, artificially separate implementation selection from scheduling, leading to suboptimal designs that cannot fully exploit modern FPGA heterogeneous architectures. Implementation selection is typically performed by ad-hoc pattern matching on operations, a process that does not consider the impact on scheduling. Subsequently, scheduling algorithms operate on fixed selection solutions with inaccurate delay estimates, which misses critical optimization opportunities from appropriately configured FPGA blocks like DSP slices.
We present SkyEgg, a novel hardware synthesis framework that jointly optimizes implementation selection and scheduling using the e-graph data structure. Our key insight is that both algebraic transformations and hardware implementation choices can be uniformly represented as rewrite rules within an e-graph, modeling the complete design space of implementation candidates to be selected and scheduled together. First, SkyEgg constructs an e-graph from the input program. It then applies both algebraic and implementation rewrites through equality saturation. Finally, it formulates the joint optimization as a mixed-integer linear programming (MILP) problem on the saturated e-graph. We provide both exact MILP solving and an efficient ASAP heuristic for scalable synthesis. Our evaluation on benchmarks from diverse applications targeting Xilinx Kintex UltraScale+ FPGAs demonstrates that SkyEgg achieves an average speedup of 3.01x over Vitis HLS, with improvements up to 5.22x for complex expressions.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Randa Zarnoufi
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Sentiment analysis of Arabic dialects presents significant challenges due to linguistic diversity and the scarcity of annotated data. This paper describes our approach to the AHaSIS shared task, which focuses on sentiment analysis on Arabic dialects in the hospitality domain. The dataset comprises hotel reviews written in Moroccan and Saudi dialects, and the objective is to classify the reviewers sentiment as positive, negative, or neutral. We employed the SetFit (Sentence Transformer Fine-tuning) framework, a data-efficient few-shot learning technique. On the official evaluation set, our system achieved an F1 of 73%, ranking 12th among 26 participants. This work highlights the potential of few-shot learning to address data scarcity in processing nuanced dialectal Arabic text within specialized domains like hotel reviews.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Liangyu Chen, Yichen Xu, Jianzhe Ma, Yuqi Liu, Donglu Yang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Chart editing reduces manual effort in visualization design. Typical benchmarks limited in data diversity and assume access to complete chart code, which is seldom in real-world scenarios. To address this gap, we present ChartEditVista, a comprehensive benchmark consisting of 7,964 samples spanning 31 chart categories. It encompasses diverse editing instructions and covers nearly all editable chart elements. The inputs in ChartEditVista include only the original chart image and natural language editing instructions, without the original chart codes. ChartEditVista is generated through a fully automated pipeline that produces, edits, and verifies charts, ensuring high-quality chart editing data. Besides, we introduce two novel fine-grained, rule-based evaluation metrics: the layout metric, which evaluates the position, size and color of graphical components; and the text metric, which jointly assesses textual content and font styling. Building on top of ChartEditVista, we present ChartEditor, a model trained using a reinforcement learning framework that incorporates a novel rendering reward to simultaneously enforce code executability and visual fidelity. Through extensive experiments and human evaluations, we demonstrate that ChartEditVista provides a robust evaluation, while ChartEditor consistently outperforms models with similar-scale and larger-scale on chart editing tasks.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sowmya Vajjala
Natural Language Processing
ๆฅ็ๆ่ฆ
In this paper, we report the results of the TeamNRC's participation in the BHASHA-Task 1 Grammatical Error Correction shared task https://github.com/BHASHA-Workshop/IndicGEC2025/ for 5 Indian languages. Our approach, focusing on zero/few-shot prompting of language models of varying sizes (4B to large proprietary models) achieved a Rank 4 in Telugu and Rank 2 in Hindi with GLEU scores of 83.78 and 84.31 respectively. In this paper, we extend the experiments to the other three languages of the shared task - Tamil, Malayalam and Bangla, and take a closer look at the data quality and evaluation metric used. Our results primarily highlight the potential of small language models, and summarize the concerns related to creating good quality datasets and appropriate metrics for this task that are suitable for Indian language scripts.
๐
2025-11-19
๐ ArXiv Preprint (cs.SE)
๐ฅ Hiep Hong Trinh, Federico Ciccozzi, Abu Naser Masud, Marjan Sirjani, Mikael Sjรถdin
ๆฅ็ๆ่ฆ
Complex software-driven systems often interleave distributed, concurrent computation processes with physical interactions with the environment. Developing these systems more efficiently and safely can be achieved by employing actionable, software-based models. From a high-level system model, engineers often need to derive multiple specialized models for different purposes, including simulation, deployment, and formal verification. Each of these target models usually rely on its own formalism, specification language, and execution platform. Traditionally, a compiler analyzes a program written in a programming language and generates executable code. In contrast, a model compiler processes a source model written in a modeling language and should ideally support the generation of multiple heterogeneous targets. However, most existing modeling languages are designed with a narrow focus, typically targeting only simulation or implementation. Multi-target compilation, when not considered during the language's early design, becomes significantly harder to achieve. In this paper, we introduce our initiative: a toolchain and modeling language called M, designed to support system modeling and multi-target compilation for model-driven engineering of complex, concurrent, and time-aware systems. M is a textual, grammar-driven language based on the actor model and extended with discrete-event scheduling semantics. It provides constructs for modeling system entities, message-based interactions, and time- or state-triggered reactions. From such models, M enables the systematic generation of diverse target artifacts while preserving semantic conformance to the original model. Moreover, M can serve as a middle language to which other modeling languages may anchor, thereby allowing them to benefit from its compilation framework.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fanfan Liu, Haibo Qiu
Natural Language Processing
ๆฅ็ๆ่ฆ
Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression C3 to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 Compression demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Codes and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xinli Tao, Xin Dong, Xuezhong Zhou
Natural Language Processing
ๆฅ็ๆ่ฆ
Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA's three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov et al.
Robotics
ๆฅ็ๆ่ฆ
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rishikant Chigrupaatii, Ponnada Sai Tulasi Kanishka, Lalit Chandra Routhu, Martin Patel Sama Supratheek Reddy, Divyam Gupta et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare it with performance in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find a regression in performance for tasks in English versus in Indian languages for 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points for Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Yang Wu, Rujing Yao, Tong Zhang, Yufei Shi, Zhuoren Jiang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students' knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student's mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students' Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student's mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Firdavs Nasriddinov, Rafal Kocielnik, Anima Anandkumar, Andrew J. Hung
Natural Language Processing
ๆฅ็ๆ่ฆ
High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
๐
2025-11-19
๐ ArXiv Preprint (eess.AS)
๐ฅ Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, Tatsuya Komatsu
ๆฅ็ๆ่ฆ
We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various useful potential applications, there is still no established benchmark with real-world data. The early study of AMR trained the model with solely synthetic datasets. Moreover, the evaluation is based on annotated dataset of fewer than 100 samples. This resulted in less reliable reported performance. To ensure performance for applications in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for train, valid, and test split, respectively, which is 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on the synthetic data outperformed a model trained solely on the synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available in https://h-munakata.github.io/CASTELLA-demo/.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Henrik Bradland, Morten Goodwin, Vladimir I. Zadorozhny, Per-Arne Andersen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The performance of machine learning models on tabular data is critically dependent on high-quality feature engineering. While Large Language Models (LLMs) have shown promise in automating feature extraction (AutoFE), existing methods are often limited by monolithic LLM architectures, simplistic quantitative feedback, and a failure to systematically integrate external domain knowledge. This paper introduces Rogue One, a novel, LLM-based multi-agent framework for knowledge-informed automatic feature extraction. Rogue One operationalizes a decentralized system of three specialized agents-Scientist, Extractor, and Tester-that collaborate iteratively to discover, generate, and validate predictive features. Crucially, the framework moves beyond primitive accuracy scores by introducing a rich, qualitative feedback mechanism and a "flooding-pruning" strategy, allowing it to dynamically balance feature exploration and exploitation. By actively incorporating external knowledge via an integrated retrieval-augmented (RAG) system, Rogue One generates features that are not only statistically powerful but also semantically meaningful and interpretable. We demonstrate that Rogue One significantly outperforms state-of-the-art methods on a comprehensive suite of 19 classification and 9 regression datasets. Furthermore, we show qualitatively that the system surfaces novel, testable hypotheses, such as identifying a new potential biomarker in the myocardial dataset, underscoring its utility as a tool for scientific discovery.
๐
2025-11-19
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haoyong Wu, Yongmei Liu
ๆฅ็ๆ่ฆ
In this paper, we propose ProRAC (Progression-based Reasoning about Actions and Change), a neuro-symbolic framework that leverages LLMs to tackle RAC problems. ProRAC extracts fundamental RAC elements including actions and questions from the problem, progressively executes each action to derive the final state, and then evaluates the query against the progressed state to arrive at an answer. We evaluate ProRAC on several RAC benchmarks, and the results demonstrate that our approach achieves strong performance across different benchmarks, domains, LLM backbones, and types of RAC tasks.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Keito Sasagawa, Shuhei Kurita, Daisuke Kawahara
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what are written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced from the real-world document images containing vertically written Japanese text. Using these datasets, we demonstrate that the existing MLLMs perform worse on vertically written Japanese text than on horizontally written Japanese text. Furthermore, we show that training MLLMs on our synthesized Japanese OCR dataset results in improving the performance of models that previously could not handle vertical writing. The datasets and code are publicly available https://github.com/llm-jp/eval_vertical_ja.
๐
2025-11-19
๐ 10 pages, theoretical/mathematical LLM research, n
๐ฅ Moses Kiprono
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations: plausible-sounding outputs that are factually incorrect or unsupported. In this work, we present a mathematically grounded framework to understand, measure, and mitigate these hallucinations. Drawing on probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation, we analyze how errors compound autoregressively, propose refined uncertainty metrics, including semantic and phase-aware variants, and develop principled mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention. This unified lens connects recent advances in calibration, retrieval, and alignment to support safer and more reliable LLMs.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Violeta Kastreva, Philip Whittington, Dennis Komm, Tiago Pimentel
ๆฅ็ๆ่ฆ
Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation remains NP-complete even when applied to unary alphabets. While unary alphabets may not be practically useful, this result establishes that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why practical algorithms such as BPE and UnigramLM are heuristic, and points toward approximation algorithms being an important path going forward for tokenisation research.
๐
2025-11-19
๐ ArXiv Preprint (cs.CY)
๐ฅ Naveen Raman, Jingwu Tang, Zhiyu Chen, Zheyuan Ryan Shi, Sean Hudson et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Food rescue organizations simultaneously tackle food insecurity and waste by working with volunteers to redistribute food from donors who have excess to recipients who need it. Volunteer feedback allows food rescue organizations to identify issues early and ensure volunteer satisfaction. However, food rescue organizations monitor feedback manually, which can be cumbersome and labor-intensive, making it difficult to prioritize which issues are most important. In this work, we investigate how large language models (LLMs) assist food rescue organizers in understanding and taking action based on volunteer experiences. We work with 412 Food Rescue, a large food rescue organization based in Pittsburgh, Pennsylvania, to design RescueLens, an LLM-powered tool that automatically categorizes volunteer feedback, suggests donors and recipients to follow up with, and updates volunteer directions based on feedback. We evaluate the performance of RescueLens on an annotated dataset, and show that it can recover 96% of volunteer issues at 71% precision. Moreover, by ranking donors and recipients according to their rates of volunteer issues, RescueLens allows organizers to focus on 0.5% of donors responsible for more than 30% of volunteer issues. RescueLens is now deployed at 412 Food Rescue and through semi-structured interviews with organizers, we find that RescueLens streamlines the feedback process so organizers better allocate their time.
๐
2025-11-19
๐ NeurIPS 2025
๐ฅ Medha Kumar, Zifei Xu, Xin Wang, Tristan Webb
Machine Learning
ๆฅ็ๆ่ฆ
Strong reasoning capabilities can now be achieved by large-scale reinforcement learning (RL) without any supervised fine-tuning. Although post-training quantization (PTQ) and quantization-aware training (QAT) are well studied in the context of fine-tuning, how quantization impacts RL in large reasoning models (LRMs) remains an open question. To answer this question, we conducted systematic experiments and discovered a significant gap in reasoning performance on mathematical benchmarks between post-RL quantized models and their quantization-aware RL optimized counterparts. Our findings suggest that quantization-aware RL training negatively impacted the learning process, whereas PTQ and QLoRA led to greater performance.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Foundation models have transformed machine learning for language and vision, but achieving comparable impact in physical simulation remains a challenge. Data heterogeneity and unstable long-term dynamics inhibit learning from sufficiently diverse dynamics, while varying resolutions and dimensionalities challenge efficient training on modern hardware. Through empirical and theoretical analysis, we incorporate new approaches to mitigate these obstacles, including a harmonic-analysis-based stabilization method, load-balanced distributed 2D and 3D training strategies, and compute-adaptive tokenization. Using these tools, we develop Walrus, a transformer-based foundation model developed primarily for fluid-like continuum dynamics. Walrus is pretrained on nineteen diverse scenarios spanning astrophysics, geoscience, rheology, plasma physics, acoustics, and classical fluids. Experiments show that Walrus outperforms prior foundation models on both short and long term prediction horizons on downstream tasks and across the breadth of pretraining data, while ablation studies confirm the value of our contributions to forecast stability, training throughput, and transfer performance over conventional approaches. Code and weights are released for community use.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Jianqiao Mao, Max A. Little
ๆฅ็ๆ่ฆ
Front-door adjustment provides a simple closed-form identification formula under the classical front-door criterion, but its applicability is often viewed as narrow and strict. Although ID algorithm is very useful and is proved effective for causal relation identification in general causal graphs (if it is identifiable), performing ID algorithm does not guarantee to obtain a practical, easy-to-estimate interventional distribution expression. We argue that the applicability of the front-door criterion is not as limited as it seems: many more complicated causal graphs can be reduced to the front-door criterion. In this paper, We introduce front-door reducibility (FDR), a graphical condition on acyclic directed mixed graphs (ADMGs) that extends the applicability of the classic front-door criterion to reduce a large family of complicated causal graphs to a front-door setting by aggregating variables into super-nodes (FDR triple) $\left(\boldsymbol{X}^{*},\boldsymbol{Y}^{*},\boldsymbol{M}^{*}\right)$. After characterizing FDR criterion, we prove a graph-level equivalence between the satisfication of FDR criterion and the applicability of FDR adjustment. Meanwhile, we then present FDR-TID, an exact algorithm that detects an admissible FDR triple, together with established the algorithm's correctness, completeness, and finite termination. Empirically-motivated examples illustrate that many graphs outside the textbook front-door setting are FDR, yielding simple, estimable adjustments where general ID expressions would be cumbersome. FDR thus complements existing identification method by prioritizing interpretability and computational simplicity without sacrificing generality across mixed graphs.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
MultimodalNatural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kim N. Nolle, Ivana Dusparic, Rhodri Cusack, Vinny Cahill
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Continual learning (CL) is a branch of machine learning that aims to enable agents to adapt and generalise previously learned abilities so that these can be reapplied to new tasks or environments. This is particularly useful in multi-task settings or in non-stationary environments, where the dynamics can change over time. This is particularly relevant in cyber-physical systems such as autonomous driving. However, despite recent advances in CL, successfully applying it to reinforcement learning (RL) is still an open problem.
This paper highlights open challenges in continual RL (CRL) based on experiments in an autonomous driving environment. In this environment, the agent must learn to successfully park in four different scenarios corresponding to parking spaces oriented at varying angles. The agent is successively trained in these four scenarios one after another, representing a CL environment, using Proximal Policy Optimisation (PPO). These experiments exposed a number of open challenges in CRL: finding suitable abstractions of the environment, oversensitivity to hyperparameters, catastrophic forgetting, and efficient use of neural network capacity.
Based on these identified challenges, we present open research questions that are important to be addressed for creating robust CRL systems. In addition, the identified challenges call into question the suitability of neural networks for CL. We also identify the need for interdisciplinary research, in particular between computer science and neuroscience.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Benjamin Dupuis, Mert Gรผrbรผzbalaban, Umut ลimลekli, Jian Wang, Sinan Yildirim et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Characterizing the differential privacy (DP) of learning algorithms has become a major challenge in recent years. In parallel, many studies suggested investigating the behavior of stochastic gradient descent (SGD) with heavy-tailed noise, both as a model for modern deep learning models and to improve their performance. However, most DP bounds focus on light-tailed noise, where satisfactory guarantees have been obtained but the proposed techniques do not directly extend to the heavy-tailed setting. Recently, the first DP guarantees for heavy-tailed SGD were obtained. These results provide $(0,ฮด)$-DP guarantees without requiring gradient clipping. Despite casting new light on the link between DP and heavy-tailed algorithms, these results have a strong dependence on the number of parameters and cannot be extended to other DP notions like the well-established Rรฉnyi differential privacy (RDP). In this work, we propose to address these limitations by deriving the first RDP guarantees for heavy-tailed SDEs, as well as their discretized counterparts. Our framework is based on new Rรฉnyi flow computations and the use of well-established fractional Poincarรฉ inequalities. Under the assumption that such inequalities are satisfied, we obtain DP guarantees that have a much weaker dependence on the dimension compared to prior art.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tao Hu, Lan Li, Zhen-Hao Xie, Da-Wei Zhou
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge. Recently, vision-language models like CLIP offer transferable features via multi-modal pre-training, making them well-suited for CIL. However, real-world visual and linguistic concepts are inherently hierarchical: a textual concept like "dog" subsumes fine-grained categories such as "Labrador" and "Golden Retriever," and each category entails its images. But existing CLIP-based CIL methods fail to explicitly capture this inherent hierarchy, leading to fine-grained class features drift during incremental updates and ultimately to catastrophic forgetting. To address this challenge, we propose HASTEN (Hierarchical Semantic Tree Anchoring) that anchors hierarchical information into CIL to reduce catastrophic forgetting. First, we employ an external knowledge graph as supervision to embed visual and textual features in hyperbolic space, effectively preserving hierarchical structure as data evolves. Second, to mitigate catastrophic forgetting, we project gradients onto the null space of the shared hyperbolic mapper, preventing interference with prior tasks. These two steps work synergistically to enable the model to resist forgetting by maintaining hierarchical relationships. Extensive experiments show that HASTEN consistently outperforms existing methods while providing a unified structured representation.
๐
2025-11-19
๐ ArXiv Preprint (eess.SP)
๐ฅ Petrus E. O. G. B. Abreu, Gabriela M. M. Paixรฃo, Jiawei Li, Paulo R. Gomes, Peter W. Macfarlane et al.
Machine Learning
ๆฅ็ๆ่ฆ
Data-driven methods for electrocardiogram (ECG) interpretation are rapidly progressing. Large datasets have enabled advances in artificial intelligence (AI) based ECG analysis, yet limitations in annotation quality, size, and scope remain major challenges. Here we present CODE-II, a large-scale real-world dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients collected by the Telehealth Network of Minas Gerais (TNMG), Brazil. Each exam was annotated using standardized diagnostic criteria and reviewed by cardiologists. A defining feature of CODE-II is a set of 66 clinically meaningful diagnostic classes, developed with cardiologist input and routinely used in telehealth practice. We additionally provide an open available subset: CODE-II-open, a public subset of 15,000 patients, and the CODE-II-test, a non-overlapping set of 8,475 exams reviewed by multiple cardiologists for blinded evaluation. A neural network pre-trained on CODE-II achieved superior transfer performance on external benchmarks (PTB-XL and CPSC 2018) and outperformed alternatives trained on larger datasets.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nils Wildt, Daniel M. Tartakovsky, Sergey Oladyshkin, Wolfgang Nowak
Machine Learning
ๆฅ็ๆ่ฆ
Ordinary differential equations (ODEs) are a conventional way to describe the observed dynamics of physical systems. Scientists typically hypothesize about dynamical behavior, propose a mathematical model, and compare its predictions to data. However, modern computing and algorithmic advances now enable purely data-driven learning of governing dynamics directly from observations. In data-driven settings, one learns the ODE's right-hand side (RHS). Dense measurements are often assumed, yet high temporal resolution is typically both cumbersome and expensive. Consequently, one usually has only sparsely sampled data. In this work we introduce ChaosODE (CODE), a Polynomial Chaos ODE Expansion in which we use an arbitrary Polynomial Chaos Expansion (aPCE) for the ODE's right-hand side, resulting in a global orthonormal polynomial representation of dynamics. We evaluate the performance of CODE in several experiments on the Lotka-Volterra system, across varying noise levels, initial conditions, and predictions far into the future, even on previously unseen initial conditions. CODE exhibits remarkable extrapolation capabilities even when evaluated under novel initial conditions and shows advantages compared to well-examined methods using neural networks (NeuralODE) or kernel approximators (KernelODE) as the RHS representer. We observe that the high flexibility of NeuralODE and KernelODE degrades extrapolation capabilities under scarce data and measurement noise. Finally, we provide practical guidelines for robust optimization of dynamics-learning problems and illustrate them in the accompanying code.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Gรกbor Balรกzs
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a tractable algorithm for estimating an unknown Lipschitz function from noisy observations and establishes an upper bound on its convergence rate. The approach extends max-affine methods from convex shape-restricted regression to the more general Lipschitz setting. A key component is a nonlinear feature expansion that maps max-affine functions into a subclass of delta-convex functions, which act as universal approximators of Lipschitz functions while preserving their Lipschitz constants. Leveraging this property, the estimator attains the minimax convergence rate (up to logarithmic factors) with respect to the intrinsic dimension of the data under squared loss and subgaussian distributions in the random design setting. The algorithm integrates adaptive partitioning to capture intrinsic dimension, a penalty-based regularization mechanism that removes the need to know the true Lipschitz constant, and a two-stage optimization procedure combining a convex initialization with local refinement. The framework is also straightforward to adapt to convex shape-restricted regression. Experiments demonstrate competitive performance relative to other theoretically justified methods, including nearest-neighbor and kernel-based regressors.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Miruna-Alexandra Gafencu, Yordanka Velikova, Nassir Navab, Mohammad Farid Azampour
Machine Learning
ๆฅ็ๆ่ฆ
Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intraoperative guidance during spinal procedures. However, ultrasound suffers from inherent limitations in visualizing complete vertebral anatomy, in particular vertebral bodies, due to acoustic shadowing effects caused by bone. In this work, we present a novel multi-modal deep learning method for completing occluded anatomical structures in 3D ultrasound by leveraging complementary information from a single X-ray image. To enable training, we generate paired training data consisting of: (1) 2D lateral vertebral views that simulate X-ray scans, and (2) 3D partial vertebrae representations that mimic the limited visibility and occlusions encountered during ultrasound spine imaging. Our method integrates morphological information from both imaging modalities and demonstrates significant improvements in vertebral reconstruction (p < 0.001) compared to state of art in 3D ultrasound vertebral completion. We perform phantom studies as an initial step to future clinical translation, and achieve a more accurate, complete volumetric lumbar spine visualization overlayed on the ultrasound scan without the need for registration with preoperative modalities such as computed tomography. This demonstrates that integrating a single X-ray projection mitigates ultrasound's key limitation while preserving its strengths as the primary imaging modality. Code and data can be found at https://github.com/miruna20/US-X-Complete
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Georgios Venianakis, Constantinos Theodoropoulos, Michail Kavousanakis
Machine Learning
ๆฅ็ๆ่ฆ
Parameter estimation remains a challenging task across many areas of engineering. Because data acquisition can often be costly, limited, or prone to inaccuracies (noise, uncertainty) it is crucial to identify sensor configurations that provide the maximum amount of information about the unknown parameters, in particular for the case of distributed-parameter systems, where spatial variations are important. Physics-Informed Neural Networks (PINNs) have recently emerged as a powerful machine-learning (ML) tool for parameter estimation, particularly in cases with sparse or noisy measurements, overcoming some of the limitations of traditional optimization-based and Bayesian approaches. Despite the widespread use of PINNs for solving inverse problems, relatively little attention has been given to how their performance depends on sensor placement. This study addresses this gap by introducing a comprehensive PINN-based framework that simultaneously tackles optimal sensor placement and parameter estimation. Our approach involves training a PINN model in which the parameters of interest are included as additional inputs. This enables the efficient computation of sensitivity functions through automatic differentiation, which are then used to determine optimal sensor locations exploiting the D-optimality criterion. The framework is validated on two illustrative distributed-parameter reaction-diffusion-advection problems of increasing complexity. The results demonstrate that our PINNs-based methodology consistently achieves higher accuracy compared to parameter values estimated from intuitively or randomly selected sensor positions.
๐
2025-11-19
๐ ArXiv Preprint (math.NA)
๐ฅ Max Hirsch, Federico Pichi
Machine Learning
ๆฅ็ๆ่ฆ
In multi-objective optimization, multiple loss terms are weighted and added together to form a single objective. These weights are chosen to properly balance the competing losses according to some meta-goal. For example, in physics-informed neural networks (PINNs), these weights are often adaptively chosen to improve the network's generalization error. A popular choice of adaptive weights is based on the neural tangent kernel (NTK) of the PINN, which describes the evolution of the network in predictor space during training. The convergence of such an adaptive weighting algorithm is not clear a priori. Moreover, these NTK-based weights would be updated frequently during training, further increasing the computational burden of the learning process. In this paper, we prove that under appropriate conditions, gradient descent enhanced with adaptive NTK-based weights is convergent in a suitable sense. We then address the problem of computational efficiency by developing a randomized algorithm inspired by a predictor-corrector approach and matrix sketching, which produces unbiased estimates of the NTK up to an arbitrarily small discretization error. Finally, we provide numerical experiments to support our theoretical findings and to show the efficacy of our randomized algorithm. Code Availability: https://github.com/maxhirsch/Efficient-NTK
๐
2025-11-19
๐ IROS 2025
๐ฅ Yifei Gao, Hans J. He, Daniel J. Stilwell, James McMahon
Robotics
ๆฅ็ๆ่ฆ
Teams of cooperating autonomous underwater vehicles (AUVs) rely on acoustic communication for coordination, yet this communication medium is constrained by limited range, multi-path effects, and low bandwidth. One way to address the uncertainty associated with acoustic communication is to learn the communication environment in real-time. We address the challenge of a team of robots building a map of the probability of communication success from one location to another in real-time. This is a decentralized classification problem -- communication events are either successful or unsuccessful -- where AUVs share a subset of their communication measurements to build the map. The main contribution of this work is a rigorously derived data sharing policy that selects measurements to be shared among AUVs. We experimentally validate our proposed sharing policy using real acoustic communication data collected from teams of Virginia Tech 690 AUVs, demonstrating its effectiveness in underwater environments.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yinan Yu, Samuel Scheidegger
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Runtime geofencing for ground vehicles is rapidly emerging as a critical technology for enforcing Operational Design Domains (ODDs). However, existing solutions struggle to reconcile high-fidelity learning with the structural requirements of verifiable control. We address this by introducing PCARNN-DCBF, a novel pipeline integrating a Physics-encoded Control-Affine Residual Neural Network with a preview-based Discrete Control Barrier Function. Unlike generic learned models, PCARNN explicitly preserves the control-affine structure of vehicle dynamics, ensuring the linearity required for reliable optimization. This enables the DCBF to enforce polygonal keep-in constraints via a real-time Quadratic Program (QP) that handles high relative degree and mitigates actuator saturation. Experiments in CARLA across electric and combustion platforms demonstrate that this structure-preserving approach significantly outperforms analytical and unstructured neural baselines.
๐
2025-11-19
๐ NeurIPS 2025
๐ฅ Nika Haghtalab, Omar Montasser, Mingda Qiao
Machine Learning
ๆฅ็ๆ่ฆ
We study the tradeoff between sample complexity and round complexity in on-demand sampling, where the learning algorithm adaptively samples from $k$ distributions over a limited number of rounds. In the realizable setting of Multi-Distribution Learning (MDL), we show that the optimal sample complexity of an $r$-round algorithm scales approximately as $dk^{ฮ(1/r)} / ฮต$. For the general agnostic case, we present an algorithm that achieves near-optimal sample complexity of $\widetilde O((d + k) / ฮต^2)$ within $\widetilde O(\sqrt{k})$ rounds. Of independent interest, we introduce a new framework, Optimization via On-Demand Sampling (OODS), which abstracts the sample-adaptivity tradeoff and captures most existing MDL algorithms. We establish nearly tight bounds on the round complexity in the OODS setting. The upper bounds directly yield the $\widetilde O(\sqrt{k})$-round algorithm for agnostic MDL, while the lower bounds imply that achieving sub-polynomial round complexity would require fundamentally new techniques that bypass the inherent hardness of OODS.
๐
2025-11-19
๐ ArXiv Preprint (cs.AR)
๐ฅ Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Processing-In-Memory (PIM) devices integrated with high-performance Host processors (e.g., GPUs) can accelerate memory-intensive kernels in Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging high memory bandwidth at PIM cores. However, Host processors and PIM cores require different data layouts: Hosts need consecutive elements distributed across DRAM banks, while PIM cores need them within local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM backends. Current compilation approaches lack systematic optimization for diverse ML kernels across multiple PIM backends and may largely ignore data rearrangements during compute code optimization. We demonstrate that data rearrangements and compute code optimization are interdependent, and need to be jointly optimized during the tuning process. To address this, we design DCC, the first data-centric ML compiler for PIM systems that jointly co-optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction that enables various data distribution and processing strategies on different PIM backends. DCC enables effective co-optimization by mapping data partitioning strategies to compute loop partitions, applying PIM-specific code optimizations and leveraging a fast and accurate performance prediction model to select optimal configurations. Our evaluations in various individual ML kernels demonstrate that DCC achieves up to 7.68x speedup (2.7x average) on HBM-PIM and up to 13.17x speedup (5.75x average) on AttAcc PIM backend over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by up to 7.71x (4.88x average) over GPU.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chen Zhang, Wei Zuo, Bingyang Cheng, Yikun Wang, Wei-Bin Kou et al.
Computer Vision
ๆฅ็ๆ่ฆ
Implicit Neural Representations (INRs) parameterize continuous signals via multilayer perceptrons (MLPs), enabling compact, resolution-independent modeling for tasks like image, audio, and 3D reconstruction. However, fitting high-resolution signals demands optimizing over millions of coordinates, incurring prohibitive computational costs. To address it, we propose NTK-Guided Implicit Neural Teaching (NINT), which accelerates training by dynamically selecting coordinates that maximize global functional updates. Leveraging the Neural Tangent Kernel (NTK), NINT scores examples by the norm of their NTK-augmented loss gradients, capturing both fitting errors and heterogeneous leverage (self-influence and cross-coordinate coupling). This dual consideration enables faster convergence compared to existing methods. Through extensive experiments, we demonstrate that NINT significantly reduces training time by nearly half while maintaining or improving representation quality, establishing state-of-the-art acceleration among recent sampling-based strategies.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rashid Iqbal, Saddam Hussain Khan
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This work proposes a hybrid deep learning approach, namely Residual and Spatial Learning based Channel Augmented Integrated CNN-Transformer architecture, that leverages the strengths of CNN and Transformer towards enhanced MPox detection. The proposed RS-CA-HSICT framework is composed of an HSICT block, a residual CNN module, a spatial CNN block, and a CA, which enhances the diverse feature space, detailed lesion information, and long-range dependencies. The new HSICT module first integrates an abstract representation of the stem CNN and customized ICT blocks for efficient multihead attention and structured CNN layers with homogeneous (H) and structural (S) operations. The customized ICT blocks learn global contextual interactions and local texture extraction. Additionally, H and S layers learn spatial homogeneity and fine structural details by reducing noise and modeling complex morphological variations. Moreover, inverse residual learning enhances vanishing gradient, and stage-wise resolution reduction ensures scale invariance. Furthermore, the RS-CA-HSICT framework augments the learned HSICT channels with the TL-driven Residual and Spatial CNN maps for enhanced multiscale feature space capturing global and localized structural cues, subtle texture, and contrast variations. These channels, preceding augmentation, are refined through the Channel-Fusion-and-Attention block, which preserves discriminative channels while suppressing redundant ones, thereby enabling efficient computation. Finally, the spatial attention mechanism refines pixel selection to detect subtle patterns and intra-class contrast variations in Mpox. Experimental results on both the Kaggle benchmark and a diverse MPox dataset reported classification accuracy as high as 98.30% and an F1-score of 98.13%, which outperforms the existing CNNs and ViTs.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dabin Jeong, Amirhossein Vahidi, Ciro Ramรญrez-Suรกstegui, Marie Moullet, Kevin Ly et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmm learns representations that better capture cross-modal correspondences, leading to an improvement of avg. 9.78\% in the gene-expression prediction task and avg. 26.93\% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ouiame Marnissi, Hajar EL Hammouti, El Houcine Bergou
Machine Learning
ๆฅ็ๆ่ฆ
Federated learning (FL) enables collaborative model training across distributed devices while preserving data privacy. However, balancing energy efficiency and fair participation while ensuring high model accuracy remains challenging in wireless edge systems due to heterogeneous resources, unequal client contributions, and limited communication capacity. To address these challenges, we propose FairEnergy, a fairness-aware energy minimization framework that integrates a contribution score capturing both the magnitude of updates and their compression ratio into the joint optimization of device selection, bandwidth allocation, and compression level. The resulting mixed-integer non-convex problem is solved by relaxing binary selection variables and applying Lagrangian decomposition to handle global bandwidth coupling, followed by per-device subproblem optimization. Experiments on non-IID data show that FairEnergy achieves higher accuracy while reducing energy consumption by up to 79\% compared to baseline strategies.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Michel Tokic, Slobodan Djukanoviฤ, Anja von Beuningen, Cheng Feng
ๆฅ็ๆ่ฆ
This paper introduces a classification method using in-context learning in time-series foundation models (TSFM). We show how data, which was not part of the TSFM training data corpus, can be classified without the need of finetuning the model. Examples are represented in the form of targets (class id) and covariates (data matrix) within the prompt of the model, which enables to classify an unknown covariate data pattern alongside the forecast axis through in-context learning. We apply this method to vibration data for assessing the health state of a bearing within a servo-press motor. The method transforms frequency domain reference signals into pseudo time-series patterns, generates aligned covariate and target signals, and uses the TSFM to predict probabilities how classified data corresponds to predefined labels. Leveraging the scalability of pre-trained models this method demonstrates efficacy across varied operational conditions. This marks significant progress beyond custom narrow AI solutions towards broader, AI-driven maintenance systems.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Alexej Brauer, Mario V. Wรผthrich
Machine Learning
ๆฅ็ๆ่ฆ
The Gini score is a popular tool in statistical modeling and machine learning for model validation and model selection. It is a purely rank based score that allows one to assess risk rankings. The Gini score for statistical modeling has mainly been used in a binary context, in which it has many equivalent reformulations such as the receiver operating characteristic (ROC) or the area under the curve (AUC). In the actuarial literature, this rank based score for binary responses has been extended to general real-valued random variables using Lorenz curves and concentration curves. While these initial concepts assume that the risk ranking is generated by a continuous distribution function, we discuss in this paper how the Gini score can be used in the case of ties in the risk ranking. Moreover, we adapt the Gini score to the common actuarial situation of having case weights.
๐
2025-11-19
๐ ArXiv Preprint (math.NA)
๐ฅ Victorita Dolean, Daria Hrebenshchykova, Stรฉphane Lanteri, Victor Michel-Dansac
Machine Learning
ๆฅ็ๆ่ฆ
Accurately simulating wave propagation is crucial in fields such as acoustics, electromagnetism, and seismic analysis. Traditional numerical methods, like finite difference and finite element approaches, are widely used to solve governing partial differential equations (PDEs) such as the Helmholtz equation. However, these methods face significant computational challenges when applied to high-frequency wave problems in complex two-dimensional domains. This work investigates Finite Basis Physics-Informed Neural Networks (FBPINNs) and their multilevel extensions as a promising alternative. These methods leverage domain decomposition, partitioning the computational domain into overlapping sub-domains, each governed by a local neural network. We assess their accuracy and computational efficiency in solving the Helmholtz equation for the homogeneous case, demonstrating their potential to mitigate the limitations of traditional approaches.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Amir Rezaei Balef, Mykhailo Koshil, Katharina Eggensperger
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite the architectural similarities between tabular in-context learning (ICL) models and large language models (LLMs), little is known about how individual layers contribute to tabular prediction. In this paper, we investigate how the latent spaces evolve across layers in tabular ICL models, identify potential redundant layers, and compare these dynamics with those observed in LLMs. We analyze TabPFN and TabICL through the "layers as painters" perspective, finding that only subsets of layers share a common representational language, suggesting structural redundancy and offering opportunities for model compression and improved interpretability.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hany Abdulsamad, รngel F. Garcรญa-Fernรกndez, Simo Sรคrkkรค
ๆฅ็ๆ่ฆ
We present a class of algorithms for state estimation in nonlinear, non-Gaussian state-space models. Our approach is based on a variational Lagrangian formulation that casts Bayesian inference as a sequence of entropic trust-region updates subject to dynamic constraints. This framework gives rise to a family of forward-backward algorithms, whose structure is determined by the chosen factorization of the variational posterior. By focusing on Gauss--Markov approximations, we derive recursive schemes with favorable computational complexity. For general nonlinear, non-Gaussian models we close the recursions using generalized statistical linear regression and Fourier--Hermite moment matching.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Luca Mossina, Corentin Friedrich
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Reliable semantic segmentation is essential for clinical decision making, yet deep models rarely provide explicit statistical guarantees on their errors. We introduce a simple post-hoc framework that constructs confidence masks with distribution-free, image-level control of false-positive predictions. Given any pretrained segmentation model, we define a nested family of shrunken masks obtained either by increasing the score threshold or by applying morphological erosion. A labeled calibration set is used to select a single shrink parameter via conformal prediction, ensuring that, for new images that are exchangeable with the calibration data, the proportion of false positives retained in the confidence mask stays below a user-specified tolerance with high probability. The method is model-agnostic, requires no retraining, and provides finite-sample guarantees regardless of the underlying predictor. Experiments on a polyp-segmentation benchmark demonstrate target-level empirical validity. Our framework enables practical, risk-aware segmentation in settings where over-segmentation can have clinical consequences. Code at https://github.com/deel-ai-papers/conseco.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kunyu Zhang, Mingxuan Wang, Xiangjie Shi, Haoxing Xu, Chao Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
The brain age is a key indicator of brain health. While electroencephalography (EEG) is a practical tool for this task, existing models struggle with the common challenge of imperfect medical data, such as learning a ``normal'' baseline from weakly supervised, healthy-only cohorts. This is a critical anomaly detection task for identifying disease, but standard models are often black boxes lacking an interpretable structure. We propose EVA-Net, a novel framework that recasts brain age as an interpretable anomaly detection problem. EVA-Net uses an efficient, sparsified-attention Transformer to model long EEG sequences. To handle noise and variability in imperfect data, it employs a Variational Information Bottleneck to learn a robust, compressed representation. For interpretability, this representation is aligned to a continuous prototype network that explicitly learns the normative healthy aging manifold. Trained on 1297 healthy subjects, EVA-Net achieves state-of-the-art accuracy. We validated its anomaly detection capabilities on an unseen cohort of 27 MCI and AD patients. This pathological group showed significantly higher brain-age gaps and a novel Prototype Alignment Error, confirming their deviation from the healthy manifold. EVA-Net provides an interpretable framework for healthcare intelligence using imperfect medical data.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lingxiang Wang, Hainan Zhang, Zhiming Zheng
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Domain-specific post-training often causes catastrophic forgetting, making foundation models lose their general reasoning ability and limiting their adaptability to dynamic real-world environments. Preserving general capabilities while acquiring downstream domain knowledge is a central challenge for large language and multimodal models. Traditional continual learning methods, such as regularization, replay and architectural isolation, suffer from poor downstream performance, reliance on inaccessible historical data, or additional parameter overhead. While recent parameter-efficient tuning (PET) methods can alleviate forgetting, their effectiveness strongly depends on the choice of parameters and update strategies. In this paper, we introduce PIECE, a Parameter Importance Estimation-based Continual Enhancement method that preserves general ability while efficiently learning domain knowledge without accessing prior training data or increasing model parameters. PIECE selectively updates only 0.1% of core parameters most relevant to new tasks, guided by two importance estimators: PIECE-F based on Fisher Information, and PIECE-S based on a second-order normalization that combines gradient and curvature information. Experiments across three language models and two multimodal models show that PIECE maintains general capabilities and achieves state-of-the-art continual learning performance across diverse downstream tasks. Our results highlight a practical path to scalable, domain-adaptive foundation models without catastrophic forgetting.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Eddie Conti, รlvaro Parafita, Axel Brando
Machine Learning
ๆฅ็ๆ่ฆ
Assessing the importance of individual features in Machine Learning is critical to understand the model's decision-making process. While numerous methods exist, the lack of a definitive ground truth for comparison highlights the need for alternative, well-founded measures. This paper introduces a novel post-hoc local feature importance method called Counterfactual Importance Distribution (CID). We generate two sets of positive and negative counterfactuals, model their distributions using Kernel Density Estimation, and rank features based on a distributional dissimilarity measure. This measure, grounded in a rigorous mathematical framework, satisfies key properties required to function as a valid metric. We showcase the effectiveness of our method by comparing with well-established local feature importance explainers. Our method not only offers complementary perspectives to existing approaches, but also improves performance on faithfulness metrics (both for comprehensiveness and sufficiency), resulting in more faithful explanations of the system. These results highlight its potential as a valuable tool for model analysis.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yinan Yu, Falk Dippel, Christina E. Lundberg, Martin Lindgren, Annika Rosengren et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Objective: Machine learning (ML) predictive models are often developed without considering downstream value trade-offs and clinical interpretability. This paper introduces a cost-aware prediction (CAP) framework that combines cost-benefit analysis assisted by large language model (LLM) agents to communicate the trade-offs involved in applying ML predictions. Materials and Methods: We developed an ML model predicting 1-year mortality in patients with heart failure (N = 30,021, 22% mortality) to identify those eligible for home care. We then introduced clinical impact projection (CIP) curves to visualize important cost dimensions - quality of life and healthcare provider expenses, further divided into treatment and error costs, to assess the clinical consequences of predictions. Finally, we used four LLM agents to generate patient-specific descriptions. The system was evaluated by clinicians for its decision support value. Results: The eXtreme gradient boosting (XGB) model achieved the best performance, with an area under the receiver operating characteristic curve (AUROC) of 0.804 (95% confidence interval (CI) 0.792-0.816), area under the precision-recall curve (AUPRC) of 0.529 (95% CI 0.502-0.558) and a Brier score of 0.135 (95% CI 0.130-0.140). Discussion: The CIP cost curves provided a population-level overview of cost composition across decision thresholds, whereas LLM-generated cost-benefit analysis at individual patient-levels. The system was well received according to the evaluation by clinicians. However, feedback emphasizes the need to strengthen the technical accuracy for speculative tasks. Conclusion: CAP utilizes LLM agents to integrate ML classifier outcomes and cost-benefit analysis for more transparent and interpretable decision support.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nathanael Bosch, Oleksandr Shchur, Nick Erickson, Michael Bohlke-Schneider, Caner Tรผrkmen
Machine Learning
ๆฅ็ๆ่ฆ
Ensembling is a powerful technique for improving the accuracy of machine learning models, with methods like stacking achieving strong results in tabular tasks. In time series forecasting, however, ensemble methods remain underutilized, with simple linear combinations still considered state-of-the-art. In this paper, we systematically explore ensembling strategies for time series forecasting. We evaluate 33 ensemble models -- both existing and novel -- across 50 real-world datasets. Our results show that stacking consistently improves accuracy, though no single stacker performs best across all tasks. To address this, we propose a multi-layer stacking framework for time series forecasting, an approach that combines the strengths of different stacker models. We demonstrate that this method consistently provides superior accuracy across diverse forecasting scenarios. Our findings highlight the potential of stacking-based methods to improve AutoML systems for time series forecasting.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Spyridon Loukovitis, Vasileios Karampinis, Athanasios Voulodimos
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Developing reliable UAV navigation systems requires robust air-to-air object detectors capable of distinguishing between objects seen during training and previously unseen objects. While many methods address closed-set detection and achieve high-confidence recognition of in-domain (ID) targets, they generally do not tackle open-set detection, which requires simultaneous handling of both ID and out-of-distribution (OOD) objects. Existing open-set approaches typically rely on a single uncertainty score with thresholding, limiting flexibility and often conflating OOD objects with background clutter. In contrast, we propose a lightweight, model-agnostic post-processing framework that explicitly separates background from unknown objects while preserving the base detector's performance. Our approach extends open-set detection beyond binary ID/OOD classification to real-time three-way classification among ID targets, OOD objects, and background. To this end, we employ a fusion scheme that aggregates multiple confidence estimates and per-detection features using a compact multilayer perceptron (MLP). Incorporating different logit variants into the MLP consistently enhances performance across both binary and three-class classification without compromising throughput. Extensive ablation and comparative experiments confirm that our method surpasses threshold-based baselines in two-class classification by an average of 2.7% AUROC, while retaining or improving open-set mAP. Furthermore, our study uniquely enables robust three-class classification, a critical capability for safe UAV navigation, where OOD objects must be actively avoided and background regions safely ignored. Comparative analysis highlights that our method surpasses competitive techniques in AUROC across datasets, while improving closed-set mAP by up to 9 points, an 18% relative gain.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kadir-Kaan รzer, Renรฉ Ebeling, Markus Enzweiler
ๆฅ็ๆ่ฆ
Automotive telemetry data exhibits slow drifts and fast spikes, often within the same sequence, making reliable anomaly detection challenging. Standard reconstruction-based methods, including sequence variational autoencoders (VAEs), use a single latent process and therefore mix heterogeneous time scales, which can smooth out spikes or inflate variances and weaken anomaly separation.
In this paper, we present STREAM-VAE, a variational autoencoder for anomaly detection in automotive telemetry time-series data. Our model uses a dual-path encoder to separate slow drift and fast spike signal dynamics, and a decoder that represents transient deviations separately from the normal operating pattern. STREAM-VAE is designed for deployment, producing stable anomaly scores across operating modes for both in-vehicle monitors and backend fleet analytics.
Experiments on an automotive telemetry dataset and the public SMD benchmark show that explicitly separating drift and spike dynamics improves robustness compared to strong forecasting, attention, graph, and VAE baselines.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ The Tien Mai
ๆฅ็ๆ่ฆ
In high-dimensional statistics, the Lasso is a cornerstone method for simultaneous variable selection and parameter estimation. However, its reliance on the squared loss function renders it highly sensitive to outliers and heavy-tailed noise, potentially leading to unreliable model selection and biased estimates. To address this limitation, we introduce the Exponential Lasso, a novel robust method that integrates an exponential-type loss function within the Lasso framework. This loss function is designed to achieve a smooth trade-off between statistical efficiency under Gaussian noise and robustness against data contamination. Unlike other methods that cap the influence of large residuals, the exponential loss smoothly redescends, effectively downweighting the impact of extreme outliers while preserving near-quadratic behavior for small errors. We establish theoretical guarantees showing that the Exponential Lasso achieves strong statistical convergence rates, matching the classical Lasso under ideal conditions while maintaining its robustness in the presence of heavy-tailed contamination. Computationally, the estimator is optimized efficiently via a Majorization-Minimization (MM) algorithm that iteratively solves a series of weighted Lasso subproblems. Numerical experiments demonstrate that the proposed method is highly competitive, outperforming the classical Lasso in contaminated settings and maintaining strong performance even under Gaussian noise.
Our method is implemented in the \texttt{R} package \texttt{heavylasso} available on Github: https://github.com/tienmt/heavylasso
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huseyin Goksu
Machine Learning
ๆฅ็ๆ่ฆ
Spectral Graph Neural Networks (GNNs) suffer from two critical limitations: poor performance on "heterophilic" graphs and performance collapse at high polynomial degrees (K), known as over-smoothing. Both issues stem from the static, low-pass nature of standard filters (e.g., ChebyNet). While adaptive polynomial filters, such as the discrete MeixnerNet, have emerged as a potential unified solution, their extension to the continuous domain and stability with unbounded coefficients remain open questions. In this work, we propose `LaguerreNet`, a novel GNN filter based on continuous Laguerre polynomials. `LaguerreNet` learns the filter's spectral shape by making its core alpha parameter trainable, thereby advancing the adaptive polynomial approach. We solve the severe O(k^2) numerical instability of these unbounded polynomials using a `LayerNorm`-based stabilization technique. We demonstrate experimentally that this approach is highly effective: 1) `LaguerreNet` achieves state-of-the-art results on challenging heterophilic benchmarks. 2) It is exceptionally robust to over-smoothing, with performance peaking at K=10, an order of magnitude beyond where ChebyNet collapses.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huseyin Goksu
Machine Learning
ๆฅ็ๆ่ฆ
Spectral Graph Neural Networks (GNNs) based on polynomial filters, such as ChebyNet, suffer from two critical limitations: 1) performance collapse on "heterophilic" graphs and 2) performance collapse at high polynomial degrees (K), known as over-smoothing. Both issues stem from the static, low-pass nature of standard filters. In this work, we propose `KrawtchoukNet`, a GNN filter based on the discrete Krawtchouk polynomials. We demonstrate that `KrawtchoukNet` provides a unified solution to both problems through two key design choices. First, by fixing the polynomial's domain N to a small constant (e.g., N=20), we create the first GNN filter whose recurrence coefficients are \textit{inherently bounded}, making it exceptionally robust to over-smoothing (achieving SOTA results at K=10). Second, by making the filter's shape parameter p learnable, the filter adapts its spectral response to the graph data. We show this adaptive nature allows `KrawtchoukNet` to achieve SOTA performance on challenging heterophilic benchmarks (Texas, Cornell), decisively outperforming standard GNNs like GAT and APPNP.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Atharva Pandey, Abhilash Neog, Gautam Jajoo
ๆฅ็ๆ่ฆ
Time-series Foundation Models (TSFMs) have recently emerged as a universal paradigm for learning across diverse temporal domains. However, despite their empirical success, the internal mechanisms by which these models represent fundamental time-series concepts remain poorly understood. In this work, we undertake a systematic investigation of concept interpretability in TSFMs. Specifically, we examine: (i) which layers encode which concepts, (ii) whether concept parameters are linearly recoverable, (iii) how representations evolve in terms of concept disentanglement and abstraction across model depth, and (iv) how models process compositions of concepts. We systematically probe these questions using layer-wise analyses, linear recoverability tests, and representation similarity measures, providing a structured account of TSFM semantics. The resulting insights show that early layers mainly capture local, time-domain patterns (e.g., AR(1), level shifts, trends), while deeper layers encode dispersion and change-time signals, with spectral and warping factors remaining the hardest to recover linearly. In compositional settings, however, probe performance degrades, revealing interference between concepts. This highlights that while atomic concepts are reliably localized, composition remains a challenge, underscoring a key limitation in current TSFMs' ability to represent interacting temporal phenomena.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Abdelhamid Ezzerg, Ilija Bogunovic, Jeremias Knoblauch
Machine Learning
ๆฅ็ๆ่ฆ
Bayesian Optimization is critically vulnerable to extreme outliers. Existing provably robust methods typically assume a bounded cumulative corruption budget, which makes them defenseless against even a single corruption of sufficient magnitude. To address this, we introduce a new adversary whose budget is only bounded in the frequency of corruptions, not in their magnitude. We then derive RCGP-UCB, an algorithm coupling the famous upper confidence bound (UCB) approach with a Robust Conjugate Gaussian Process (RCGP). We present stable and adaptive versions of RCGP-UCB, and prove that they achieve sublinear regret in the presence of up to $O(T^{1/2})$ and $O(T^{1/3})$ corruptions with possibly infinite magnitude. This robustness comes at near zero cost: without outliers, RCGP-UCB's regret bounds match those of the standard GP-UCB algorithm.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rayen Dhahri, Steffen Urban
Multimodal
ๆฅ็ๆ่ฆ
Specialized edge accelerators rely on low-bit quantization, but vendor compilers differ in scaling, clipping, and kernel support, often as black boxes. The same floating-point (FP) checkpoint can therefore yield inconsistent accuracy across backends, forcing practitioners to tweak flags or refactor models to vendor-friendly operator subsets. We introduce Quant-Trim, a training-phase method that produces a hardware-neutral checkpoint robust to backend and precision choices. It combines progressive fake quantization to align training with the deployed integer grid and reverse pruning to tame outlier-driven scale inflation while preserving learnability. Quant-Trim is agnostic to quantization schemes (symmetric/asymmetric,per-tensor/per-channel, INT8/INT4) and requires no vendor-specific graph changes.Across models and tasks, it narrows the FP,low-bit gap, reduces dependence on compiler heuristics/calibration, and avoids per-backend retraining. We report accuracy and edge metrics latency, throughput, energy/inference, and cost under static/dynamic activation scaling and varying operator coverage.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hyeongheon Cha, Dong Min Kim, Hye Won Chung, Taesik Gong, Sung-Ju Lee
ๆฅ็ๆ่ฆ
Test-Time Adaptation (TTA) adjusts models using unlabeled test data to handle dynamic distribution shifts. However, existing methods rely on frequent adaptation and high computational cost, making them unsuitable for resource-constrained edge environments. To address this, we propose SNAP, a sparse TTA framework that reduces adaptation frequency and data usage while preserving accuracy. SNAP maintains competitive accuracy even when adapting based on only 1% of the incoming data stream, demonstrating its robustness under infrequent updates. Our method introduces two key components: (i) Class and Domain Representative Memory (CnDRM), which identifies and stores a small set of samples that are representative of both class and domain characteristics to support efficient adaptation with limited data; and (ii) Inference-only Batch-aware Memory Normalization (IoBMN), which dynamically adjusts normalization statistics at inference time by leveraging these representative samples, enabling efficient alignment to shifting target domains. Integrated with five state-of-the-art TTA algorithms, SNAP reduces latency by up to 93.12%, while keeping the accuracy drop below 3.3%, even across adaptation rates ranging from 1% to 50%. This demonstrates its strong potential for practical use on edge devices serving latency-sensitive applications. The source code is available at https://github.com/chahh9808/SNAP.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Loveneet Saini, Hasan Tercan, Tobias Meisen
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Object detection with 3D radar is essential for 360-degree automotive perception, but radar's long wavelengths produce sparse and irregular reflections that challenge traditional grid and sequence-based convolutional and transformer detectors. This paper introduces Graph Query Networks (GQN), an attention-based framework that models objects sensed by radar as graphs, to extract individualized relational and contextual features. GQN employs a novel concept of graph queries to dynamically attend over the bird's-eye view (BEV) space, constructing object-specific graphs processed by two novel modules: EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation. On the NuScenes dataset, GQN improves relative mAP by up to +53%, including a +8.2% gain over the strongest prior radar method, while reducing peak graph construction overhead by 80% with moderate FLOPs cost.
๐
2025-11-19
๐ ArXiv Preprint (q-fin.TR)
๐ฅ Tomas Espana, Yadh Hafsi, Fabrizio Lillo, Edoardo Vittori
Machine Learning
ๆฅ็ๆ่ฆ
We investigate the use of Reinforcement Learning for the optimal execution of meta-orders, where the objective is to execute incrementally large orders while minimizing implementation shortfall and market impact over an extended period of time. Departing from traditional parametric approaches to price dynamics and impact modeling, we adopt a model-free, data-driven framework. Since policy optimization requires counterfactual feedback that historical data cannot provide, we employ the Queue-Reactive Model to generate realistic and tractable limit order book simulations that encompass transient price impact, and nonlinear and dynamic order flow responses. Methodologically, we train a Double Deep Q-Network agent on a state space comprising time, inventory, price, and depth variables, and evaluate its performance against established benchmarks. Numerical simulation results show that the agent learns a policy that is both strategic and tactical, adapting effectively to order book conditions and outperforming standard approaches across multiple training configurations. These findings provide strong evidence that model-free Reinforcement Learning can yield adaptive and robust solutions to the optimal execution problem.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yanchen Xu, Ziheng Jiao, Hongyuan Zhang, Xuelong Li
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as DeepSeek-R1. It raises a question whether GRPO can be generalized to representation learning models. In this paper, we propose Group Relative Policy Optimization for Representation Model (GRPO-RM), and investigate the performance of GRPO-like policy in post-training representation models. Specifically, our method establishes a predefined output set to functionally replace token sequence sampling in LLMs, thereby generating an output group, which is essential for the probability-driven optimization of GRPO. In addition, a specialized reward function is designed to accommodate the properties of representation models. Extensive experiments are conducted on various real-world datasets to validate the effectiveness of our proposed method.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chengze Du, Heng Xu, Zhiwei Yu, Bo Liu, Jialong Li
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Network tomography aims to infer hidden network states, such as link performance, traffic load, and topology, from external observations. Most existing methods solve these problems separately and depend on limited task-specific signals, which limits generalization and interpretability. We present PLATONT, a unified framework that models different network indicators (e.g., delay, loss, bandwidth) as projections of a shared latent network state. Guided by the Platonic Representation Hypothesis, PLATONT learns this latent state through multimodal alignment and contrastive learning. By training multiple tomography tasks within a shared latent space, it builds compact and structured representations that improve cross-task generalization. Experiments on synthetic and real-world datasets show that PLATONT consistently outperforms existing methods in link estimation, topology inference, and traffic prediction, achieving higher accuracy and stronger robustness under varying network conditions.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jin Ye, Lingmei Wang, Shujian Zhang, Haihang WU
Machine Learning
ๆฅ็ๆ่ฆ
With the global energy transition and rapid development of renewable energy, the scheduling optimization challenge for combined power-heat systems under new energy integration and multiple uncertainties has become increasingly prominent. Addressing this challenge, this study proposes an intelligent scheduling method based on the improved Dual-Delay Deep Deterministic Policy Gradient (PVTD3) algorithm. System optimization is achieved by introducing a penalty term for grid power purchase variations. Simulation results demonstrate that under three typical scenarios (10%, 20%, and 30% renewable penetration), the PVTD3 algorithm reduces the system's comprehensive cost by 6.93%, 12.68%, and 13.59% respectively compared to the traditional TD3 algorithm. Concurrently, it reduces the average fluctuation amplitude of grid power purchases by 12.8%. Regarding energy storage management, the PVTD3 algorithm reduces the end-time state values of low-temperature thermal storage tanks by 7.67-17.67 units while maintaining high-temperature tanks within the 3.59-4.25 safety operating range. Multi-scenario comparative validation demonstrates that the proposed algorithm not only excels in economic efficiency and grid stability but also exhibits superior sustainable scheduling capabilities in energy storage device management.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kai Yang, Xin Xu, Yangkun Chen, Weijie Liu, Jiafei Lyu et al.
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context, as it controls exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, as the training process involves a mix of positive and negative samples, each affecting entropy in different ways across steps. To address this, we propose Entropy stablilization via Proportional-Integral Control (EntroPIC), a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress. We provide a comprehensive theoretical analysis for both on-policy and off-policy learning settings, demonstrating that EntroPIC is effective at controlling entropy in large-scale LLM training. Experimental results show that our method successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tung Giang Le, Xuan Tung Nguyen, Won-Joo Hwang
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Increasing wireless network complexity demands scalable resource management. Classical GNNs excel at graph learning but incur high computational costs in large-scale settings. We present a fully quantum Graph Neural Network (QGNN) that implements message passing via Parameterized Quantum Circuits (PQCs). Our Quantum Graph Convolutional Layers (QGCLs) encode features into quantum states, process graphs with NISQ-compatible unitaries, and retrieve embeddings through measurement. Applied to D2D power control for SINR maximization, our QGNN matches classical performance with fewer parameters and inherent parallelism. This end-to-end PQC-based GNN marks a step toward quantum-accelerated wireless optimization.
๐
2025-11-19
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Pol Benรญtez, Cibrรกn Lรณpez, Edgardo Saucedo, Teruyasu Mizoguchi, Claudio Cazorla
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Machine learning (ML) methods have become powerful tools for predicting material properties with near first-principles accuracy and vastly reduced computational cost. However, the performance of ML models critically depends on the quality, size, and diversity of the training dataset. In materials science, this dependence is particularly important for learning from low-symmetry atomistic configurations that capture thermal excitations, structural defects, and chemical disorder, features that are ubiquitous in real materials but underrepresented in most datasets. The absence of systematic strategies for generating representative training data may therefore limit the predictive power of ML models in technologically critical fields such as energy conversion and photonics. In this work, we assess the effectiveness of graph neural network (GNN) models trained on two fundamentally different types of datasets: one composed of randomly generated atomic configurations and another constructed using physically informed sampling based on lattice vibrations. As a case study, we address the challenging task of predicting electronic and mechanical properties of a prototypical family of optoelectronic materials under realistic finite-temperature conditions. We find that the phonons-informed model consistently outperforms the randomly trained counterpart, despite relying on fewer data points. Explainability analyses further reveal that high-performing models assign greater weight to chemically meaningful bonds that control property variations, underscoring the importance of physically guided data generation. Overall, this work demonstrates that larger datasets do not necessarily yield better GNN predictive models and introduces a simple and general strategy for efficiently constructing high-quality training data in materials informatics.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov et al.
Robotics
ๆฅ็ๆ่ฆ
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ranfei Chen, Ming Chen, Kaifei Wang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Diffusion Large Language Models (dLLMs) are rapidly emerging alongside autoregressive models as a powerful paradigm for complex reasoning, with reinforcement learning increasingly used for downstream alignment. Existing trajectory-based RL methods uniformly allocate policy gradients across denoising steps, implicitly treating all steps as equally important. We challenge this assumption by analyzing trajectories with several step-level metrics: entropy-based uncertainty, Confidence-Margin (CM) uncertainty, and Rate of Entropy Change (RoEC). These reveal structured "zones of confusion": transient spikes in uncertainty and instability that strongly predict final success or failure, while most steps remain stable. We propose Adaptive Trajectory Policy Optimization (ATPO), a lightweight step-selection strategy that dynamically reallocates gradient updates to these high-leverage steps without changing the RL objective, rewards, or compute budget. Using a hybrid RoEC+CM rule, ATPO delivers substantial gains in reasoning accuracy and training stability across benchmarks, showing that exploiting trajectory dynamics is key to advancing dLLM RL.
๐
2025-11-19
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Jiajun Zhan, Zeyuan Ma, Yue-Jiao Gong, Kay Chen Tan
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Evolutionary multitasking (EMT) algorithms typically require tailored designs for knowledge transfer, in order to assure convergence and optimality in multitask optimization. In this paper, we explore designing a systematic and generalizable knowledge transfer policy through Reinforcement Learning. We first identify three major challenges: determining the task to transfer (where), the knowledge to be transferred (what) and the mechanism for the transfer (how). To address these challenges, we formulate a multi-role RL system where three (groups of) policy networks act as specialized agents: a task routing agent incorporates an attention-based similarity recognition module to determine source-target transfer pairs via attention scores; a knowledge control agent determines the proportion of elite solutions to transfer; and a group of strategy adaptation agents control transfer strength by dynamically controlling hyper-parameters in the underlying EMT framework. Through pre-training all network modules end-to-end over an augmented multitask problem distribution, a generalizable meta-policy is obtained. Comprehensive validation experiments show state-of-the-art performance of our method against representative baselines. Further in-depth analysis not only reveals the rationale behind our proposal but also provide insightful interpretations on what the system have learned.
๐
2025-11-19
๐ NeurIPS 2025
๐ฅ David Yallup
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
High-dimensional multimodal sampling problems from lattice field theory (LFT) have become important benchmarks for machine learning assisted sampling methods. We show that GPU-accelerated particle methods, Sequential Monte Carlo (SMC) and nested sampling, provide a strong classical baseline that matches or outperforms state-of-the-art neural samplers in sample quality and wall-clock time on standard scalar field theory benchmarks, while also estimating the partition function. Using only a single data-driven covariance for tuning, these methods achieve competitive performance without problem-specific structure, raising the bar for when learned proposals justify their training cost.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuxuan Gu, Weimin Bai, Yifei Wang, Weijian Luo, He Sun
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. Such decoupled structure not only harm the generation efficiency but also hinder the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model post-training.To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) an efficient RL framework for masked auto-regressive models via MARVAL-RL. On ImageNet 256*256, MARVAL-Huge achieves an FID of 2.00 with more than 30 times speedup compared with MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignments.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wasif Jalal, Md Nafiu Rahman, M. Sohel Rahman
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods face limitations such as manual feature engineering, limited receptive fields, and overfitting on heterogeneous data. Pure transformer models, while effective, require large datasets and high computational cost. We propose Brain ResNet over trained Vision Transformer (BrainRotViT), a hybrid architecture that combines the global context modeling of vision transformers (ViT) with the local refinement of residual CNNs. A ViT encoder is first trained on an auxiliary age and sex classification task to learn slice-level features. The frozen encoder is then applied to all sagittal slices to generate a 2D matrix of embedding vectors, which is fed into a residual CNN regressor that incorporates subject sex at the final fully-connected layer to estimate continuous brain age. Our method achieves an MAE of 3.34 years (Pearson $r=0.98$, Spearman $ฯ=0.97$, $R^2=0.95$) on validation across 11 MRI datasets encompassing more than 130 acquisition sites, outperforming baseline and state-of-the-art models. It also generalizes well across 4 independent cohorts with MAEs between 3.77 and 5.04 years. Analyses on the brain age gap (the difference between the predicted age and actual age) show that aging patterns are associated with Alzheimer's disease, cognitive impairment, and autism spectrum disorder. Model attention maps highlight aging-associated regions of the brain, notably the cerebellar vermis, precentral and postcentral gyri, temporal lobes, and medial superior frontal gyrus. Our results demonstrate that this method provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging the gap between CNN- and transformer-based approaches while opening new avenues for aging and neurodegeneration research.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rishikant Chigrupaatii, Ponnada Sai Tulasi Kanishka, Lalit Chandra Routhu, Martin Patel Sama Supratheek Reddy, Divyam Gupta et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare it with performance in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find a regression in performance for tasks in English versus in Indian languages for 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points for Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Le Tung Giang, Vu Hoang Viet, Nguyen Xuan Tung, Trinh Van Chien, Won-Joo Hwang
Machine Learning
ๆฅ็ๆ่ฆ
The vehicle routing problem (VRP) is a fundamental NP-hard task in intelligent transportation systems with broad applications in logistics and distribution. Deep reinforcement learning (DRL) with Graph Neural Networks (GNNs) has shown promise, yet classical models rely on large multi-layer perceptrons (MLPs) that are parameter-heavy and memory-bound. We propose a Quantum Graph Attention Network (Q-GAT) within a DRL framework, where parameterized quantum circuits (PQCs) replace conventional MLPs at critical readout stages. The hybrid model maintains the expressive capacity of graph attention encoders while reducing trainable parameters by more than 50%. Using proximal policy optimization (PPO) with greedy and stochastic decoding, experiments on VRP benchmarks show that Q-GAT achieves faster convergence and reduces routing cost by about 5% compared with classical GAT baselines. These results demonstrate the potential of PQC-enhanced GNNs as compact and effective solvers for large-scale routing and logistics optimization.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yi Xu, Zhigang Chen, Rui Wang, Yangfan Li, Fengxiao Tang et al.
ๆฅ็ๆ่ฆ
In industrial equipment monitoring, fault diagnosis is critical for ensuring system reliability and enabling predictive maintenance. However, the scarcity of fault data, due to the rarity of fault events and the high cost of data annotation, significantly hinders data-driven approaches. Existing time-series generation models, optimized for abundant normal data, struggle to capture fault distributions in few-shot scenarios, producing samples that lack authenticity and diversity due to the large domain gap and high intra-class variability of faults. To address this, we propose a novel few-shot fault time-series generation framework based on diffusion models. Our approach employs a positive-negative difference adapter, leveraging pre-trained normal data distributions to model the discrepancies between normal and fault domains for accurate fault synthesis. Additionally, a diversity loss is introduced to prevent mode collapse, encouraging the generation of diverse fault samples through inter-sample difference regularization. Experimental results demonstrate that our model significantly outperforms traditional methods in authenticity and diversity, achieving state-of-the-art performance on key benchmarks.
๐
2025-11-19
๐ ArXiv Preprint (q-bio.QM)
๐ฅ Jun Hyeun Kang, Jung Eek Son, Tae In Ahn
Machine Learning
ๆฅ็ๆ่ฆ
The application of spectral-shifting films in greenhouses to shift green light to red light has shown variable growth responses across crop species. However, the yield enhancement of crops under altered light quality is related to the collective effects of the specific biophysical characteristics of each species. Considering only one attribute of a crop has limitations in understanding the relationship between sunlight quality adjustments and crop growth performance. Therefore, this study aims to comprehensively link multiple plant phenotypic traits and daily light integral considering the physiological responses of crops to their growth outcomes under SF using artificial intelligence. Between 2021 and 2024, various leafy, fruiting, and root crops were grown in greenhouses covered with either PEF or SF, and leaf reflectance, leaf mass per area, chlorophyll content, daily light integral, and light saturation point were measured from the plants cultivated in each condition. 210 data points were collected, but there was insufficient data to train deep learning models, so a variational autoencoder was used for data augmentation. Most crop yields showed an average increase of 22.5% under SF. These data were used to train several models, including logistic regression, decision tree, random forest, XGBoost, and feedforward neural network (FFNN), aiming to binary classify whether there was a significant effect on yield with SF application. The FFNN achieved a high classification accuracy of 91.4% on a test dataset that was not used for training. This study provide insight into the complex interactions between leaf phenotypic and photosynthetic traits, environmental conditions, and solar spectral components by improving the ability to predict solar spectral shift effects using SF.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrew Gracyk
ๆฅ็ๆ่ฆ
It has been discovered that latent-Euclidean variational autoencoders (VAEs) admit, in various capacities, Riemannian structure. We adapt these arguments but for complex VAEs with a complex latent stage. We show that complex VAEs reveal to some level Kรคhler geometric structure. Our methods will be tailored for decoder geometry. We derive the Fisher information metric in the complex case under a latent complex Gaussian regularization with trivial relation matrix. It is well known from statistical information theory that the Fisher information coincides with the Hessian of the Kullback-Leibler (KL) divergence. Thus, the metric Kรคhler potential relation is exactly achieved under relative entropy. We propose a Kรคhler potential derivative of complex Gaussian mixtures that has rough equivalence to the Fisher information metric while still being faithful to the underlying Kรคhler geometry. Computation of the metric via this potential is efficient, and through our potential, valid as a plurisubharmonic (PSH) function, large scale computational burden of automatic differentiation is displaced to small scale. We show that we can regularize the latent space with decoder geometry, and that we can sample in accordance with a weighted complex volume element. We demonstrate these strategies, at the exchange of sample variation, yield consistently smoother representations and fewer semantic outliers.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Yang Wu, Rujing Yao, Tong Zhang, Yufei Shi, Zhuoren Jiang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students' knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student's mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students' Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student's mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
๐
2025-11-19
๐ ArXiv Preprint (eess.SP)
๐ฅ Ahmed Aboulfotouh, Hatem Abou-Zeid
Multimodal
ๆฅ็ๆ่ฆ
Wireless foundation models (WFMs) have recently demonstrated promising capabilities, jointly performing multiple wireless functions and adapting effectively to new environments. However, while current WFMs process only one modality, depending on the task and operating conditions, the most informative modality changes and no single modality is best for all tasks. WFMs should therefore be designed to accept multiple modalities to enable a broader and more diverse range of tasks and scenarios. In this work, we propose and build the first multimodal wireless foundation model capable of processing both raw IQ streams and image-like wireless modalities (e.g., spectrograms and CSI) and performing multiple tasks across both. We introduce masked wireless modeling for the multimodal setting, a self-supervised objective and pretraining recipe that learns a joint representation from IQ streams and image-like wireless modalities. We evaluate the model on five tasks across both modality families: image-based (human activity sensing, RF signal classification, 5G NR positioning) and IQ-based (RF device fingerprinting, interference detection/classification). The multimodal WFM is competitive with single-modality WFMs, and in several cases surpasses their performance. Our results demonstrates the strong potential of developing multimodal WFMs that support diverse wireless tasks across different modalities. We believe this provides a concrete step toward both AI-native 6G and the vision of joint sensing, communication, and localization.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Firdavs Nasriddinov, Rafal Kocielnik, Anima Anandkumar, Andrew J. Hung
Natural Language Processing
ๆฅ็ๆ่ฆ
High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Meihua Zhou, Xinyu Tong, Jiarui Zhao, Min Cheng, Li Yang et al.
ๆฅ็ๆ่ฆ
High-dimensional neuroimaging analyses for clinical diagnosis are often constrained by compromises in spatiotemporal fidelity and by the limited adaptability of large-scale, general-purpose models. To address these challenges, we introduce Dynamic Curriculum Learning for Spatiotemporal Encoding (DCL-SE), an end-to-end framework centered on data-driven spatiotemporal encoding (DaSE). We leverage Approximate Rank Pooling (ARP) to efficiently encode three-dimensional volumetric brain data into information-rich, two-dimensional dynamic representations, and then employ a dynamic curriculum learning strategy, guided by a Dynamic Group Mechanism (DGM), to progressively train the decoder, refining feature extraction from global anatomical structures to fine pathological details. Evaluated across six publicly available datasets, including Alzheimer's disease and brain tumor classification, cerebral artery segmentation, and brain age prediction, DCL-SE consistently outperforms existing methods in accuracy, robustness, and interpretability. These findings underscore the critical importance of compact, task-specific architectures in the era of large-scale pretrained networks.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Eugene Ndiaye
ๆฅ็ๆ่ฆ
Conformal prediction (CP) constructs uncertainty sets for model outputs with finite-sample coverage guarantees. A candidate output is included in the prediction set if its non-conformity score is not considered extreme relative to the scores observed on a set of calibration examples. However, this procedure is only straightforward when scores are scalar-valued, which has limited CP to real-valued scores or ad-hoc reductions to one dimension. The problem of ordering vectors has been studied via optimal transport (OT), which provides a principled method for defining vector-ranks and multivariate quantile regions, though typically with only asymptotic coverage guarantees. We restore finite-sample, distribution-free coverage by conformalizing the vector-valued OT quantile region. Here, a candidate's rank is defined via a transport map computed for the calibration scores augmented with that candidate's score. This defines a continuum of OT problems for which we prove that the resulting optimal assignment is piecewise-constant across a fixed polyhedral partition of the score space. This allows us to characterize the entire prediction set tractably, and provides the machinery to address a deeper limitation of prediction sets: that they only indicate which outcomes are plausible, but not their relative likelihood. In one dimension, conformal predictive distributions (CPDs) fill this gap by producing a predictive distribution with finite-sample calibration. Extending CPDs beyond one dimension remained an open problem. We construct, to our knowledge, the first multivariate CPDs with finite-sample calibration, i.e., they define a valid multivariate distribution where any derived uncertainty region automatically has guaranteed coverage. We present both conservative and exact randomized versions, the latter resulting in a multivariate generalization of the classical Dempster-Hill procedure.
๐
2025-11-19
๐ ArXiv Preprint (q-bio.GN)
๐ฅ Amit Kumar, Maninder Kaur, Raghvendra Mall, Sukrit Gupta
ๆฅ็ๆ่ฆ
Spatial Transcriptomics enables mapping of gene expression within its native tissue context, but current platforms measure only a limited set of genes due to experimental constraints and excessive costs. To overcome this, computational models integrate Single-Cell RNA Sequencing data with Spatial Transcriptomics to predict unmeasured genes. We propose CASPER, a cross-attention based framework that predicts unmeasured gene expression in Spatial Transcriptomics by leveraging centroid-level representations from Single-Cell RNA Sequencing. We performed rigorous testing over four state-of-the-art Spatial Transcriptomics/Single-Cell RNA Sequencing dataset pairs across four existing baseline models. CASPER shows significant improvement in nine out of the twelve metrics for our experiments. This work paves the way for further work in Spatial Transcriptomics to Single-Cell RNA Sequencing modality translation. The code for CASPER is available at https://github.com/AI4Med-Lab/CASPER.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hyo-Jeong Jang, Hye-Bin Shin, Kang Yin
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning models perform best with abundant, high-quality labels, yet such conditions are rarely achievable in EEG-based emotion recognition. Electroencephalogram (EEG) signals are easily corrupted by artifacts and individual variability, while emotional labels often stem from subjective and inconsistent reports-making robust affective decoding particularly difficult. We propose an uncertainty-aware active learning framework that enhances robustness to label noise by jointly leveraging model uncertainty and cross-modal consistency. Instead of relying solely on EEG-based uncertainty estimates, the method evaluates cross-modal alignment to determine whether uncertainty originates from cognitive ambiguity or sensor noise. A representation alignment module embeds EEG and face features into a shared latent space, enforcing semantic coherence between modalities. Residual discrepancies are treated as noise-induced inconsistencies, and these samples are selectively queried for oracle feedback during active learning. This feedback-driven process guides the network toward reliable, informative samples and reduces the impact of noisy labels. Experiments on the ASCERTAIN dataset examine the efficiency and robustness of ours, highlighting its potential as a data-efficient and noise-tolerant approach for EEG-based affective decoding in brain-computer interface systems.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaoxuan Wang, Bo Liu, Song Jiang, Jingzhou Liu, Jingyuan Qi et al.
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
The reasoning capabilities of large language models (LLMs) have been significantly improved through reinforcement learning (RL). Nevertheless, LLMs still struggle to consistently verify their own reasoning traces. This raises the research question of how to enhance the self-verification ability of LLMs and whether such an ability can further improve reasoning performance. In this work, we propose GRPO-Verif, an algorithm that jointly optimizes solution generation and self-verification within a unified loss function, with an adjustable hyperparameter controlling the weight of the verification signal. Experimental results demonstrate that our method enhances self-verification capability while maintaining comparable performance in reasoning.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrew Amos, Joanne Lee, Tarun Sen Gupta, Bunmi S. Malau-Aduli
ๆฅ็ๆ่ฆ
Past efforts to map the Medline database have been limited to small subsets of the available data because of the exponentially increasing memory and processing demands of existing algorithms. We designed a novel algorithm for sparse matrix multiplication that allowed us to apply a self-organizing map to the entire Medline dataset, allowing for a more complete map of existing medical knowledge. The algorithm also increases the feasibility of refining the self-organizing map to account for changes in the dataset over time.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nishchala Thakur, Swati Kochhar, Deepti R. Bathula, Sukrit Gupta
ๆฅ็ๆ่ฆ
Active learning reduces annotation costs in medical imaging by strategically selecting the most informative samples for labeling. However, individual acquisition strategies often exhibit inconsistent behavior across different stages of the active learning cycle. We propose Cyclical and Performance-Adaptive Multi-Strategy Active Learning (WaveFuse-AL), a novel framework that adaptively fuses multiple established acquisition strategies-BALD, BADGE, Entropy, and CoreSet throughout the learning process. WaveFuse-AL integrates cyclical (sinusoidal) temporal priors with performance-driven adaptation to dynamically adjust strategy importance over time. We evaluate WaveFuse-AL on three medical imaging benchmarks: APTOS-2019 (multi-class classification), RSNA Pneumonia Detection (binary classification), and ISIC-2018 (skin lesion segmentation). Experimental results demonstrate that WaveFuse-AL consistently outperforms both single-strategy and alternating-strategy baselines, achieving statistically significant performance improvements (on ten out of twelve metric measurements) while maximizing the utility of limited annotation budgets.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huifan Zhang, Pingqiang Zhou
Machine Learning
ๆฅ็ๆ่ฆ
Conventional radio frequency (RF) passive components modeling based on machine learning requires extensive electromagnetic (EM) simulations to cover geometric and frequency design spaces, creating computational bottlenecks. In this paper, we introduce an uncertainty-aware Bayesian online learning framework for efficient parametric modeling of RF passive components, which includes: 1) a Bayesian neural network with reconfigurable heads for joint geometric-frequency domain modeling while quantifying uncertainty; 2) an adaptive sampling strategy that simultaneously optimizes training data sampling across geometric parameters and frequency domain using uncertainty guidance. Validated on three RF passive components, the framework achieves accurate modeling while using only 2.86% EM simulation time compared to traditional ML-based flow, achieving a 35 times speedup.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Bohan Zhang, Zihao Wang, Hengyu Fu, Jason D. Lee
Machine Learning
ๆฅ็ๆ่ฆ
In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussian Multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in \mathbb{R}^{r\times d}$, which is the canonical setup to study representation learning. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. The sample and time complexity both align with the information-theoretic limit up to leading order and are therefore optimal. During the first stage of gradient descent learning, the proof proceeds via showing that the inner weights can perform a power-iteration process. This process implicitly mimics a spectral start for the whole span of the hidden subspace and eventually eliminates finite-sample noise and recovers this span. It surprisingly indicates that optimal results can only be achieved if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to effectively learn hierarchical functions with respect to both sample and time efficiency.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wei-hsiang Yen, Lyn Chao-ling Chen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The innovation of the study is that the deep learning method and sentiment analysis are integrated in traditional business model analysis and forecasting, and the research subject is TSMC for industry trend prediction of semiconductor industry in Taiwan. For the rapid market changes and development of wafer technologies of semiconductor industry, traditional data analysis methods not perform well in the high variety and time series data. Textual data and time series data were collected from seasonal reports of TSMC including financial information. Textual data through sentiment analysis by considering the event intervention both from internal events of the company and the external global events. Using the sentiment-enhanced time series data, the LSTM model was adopted for predicting industry trend of TSMC. The prediction results reveal significant development of wafer technology of TSMC and the potential threatens in the global market, and matches the product released news of TSMC and the international news. The contribution of the work performed accurately in industry trend prediction of the semiconductor industry by considering both the internal and external event intervention, and the prediction results provide valuable information of semiconductor industry both in research and business aspects.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiancheng Wang, Lin Wang, Rui Wang, Zhibo Zhang, Minghang Zhao
Robotics
ๆฅ็ๆ่ฆ
Time-series anomaly detection plays a critical role in numerous real-world applications, including industrial monitoring and fault diagnosis. Recently, Mamba-based state-space models have shown remarkable efficiency in long-sequence modeling. However, directly applying Mamba to anomaly detection tasks still faces challenges in capturing complex temporal patterns and nonlinear dynamics. In this paper, we propose Fourier-KAN-Mamba, a novel hybrid architecture that integrates Fourier layer, Kolmogorov-Arnold Networks (KAN), and Mamba selective state-space model. The Fourier layer extracts multi-scale frequency features, KAN enhances nonlinear representation capability, and a temporal gating control mechanism further improves the model's ability to distinguish normal and anomalous patterns. Extensive experiments on MSL, SMAP, and SWaT datasets demonstrate that our method significantly outperforms existing state-of-the-art approaches.
Keywords: time-series anomaly detection, state-space model, Mamba, Fourier transform, Kolmogorov-Arnold Network
๐
2025-11-19
๐ ArXiv Preprint (cs.DC)
๐ฅ Khaled Hamidouche, John Bachan, Pak Markthub, Peter-Jan Gootzen, Elena Agostini et al.
Robotics
ๆฅ็ๆ่ฆ
Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU communication follows a host-initiated model, where the CPU orchestrates all communication operations - a characteristic of the CUDA runtime. Although robust for collective operations, applications requiring tight integration of computation and communication can benefit from device-initiated communication that eliminates CPU coordination overhead.
NCCL 2.28 introduces the Device API with three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe, Multimem for NVLink SHARP, and GPU-Initiated Networking (GIN) for network RDMA. This paper presents the GIN architecture, design, semantics, and highlights its impact on MoE communication. GIN builds on a three-layer architecture: i) NCCL Core host-side APIs for device communicator setup and collective memory window registration; ii) Device-side APIs for remote memory operations callable from CUDA kernels; and iii) A network plugin architecture with dual semantics (GPUDirect Async Kernel-Initiated and Proxy) for broad hardware support. The GPUDirect Async Kernel-Initiated backend leverages DOCA GPUNetIO for direct GPU-to-NIC communication, while the Proxy backend provides equivalent functionality via lock-free GPU-to-CPU queues over standard RDMA networks. We demonstrate GIN's practicality through integration with DeepEP, an MoE communication library. Comprehensive benchmarking shows that GIN provides device-initiated communication within NCCL's unified runtime, combining low-latency operations with NCCL's collective algorithms and production infrastructure.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zisong Wang, Xuanyu Wang, Hang Chen, Haizhou Wang, Yuxin Chen et al.
ๆฅ็ๆ่ฆ
Precise prognostic stratification of colorectal cancer (CRC) remains a major clinical challenge due to its high heterogeneity. The conventional TNM staging system is inadequate for personalized medicine. We aimed to develop and validate a novel multiple instance learning model TDAM-CRC using histopathological whole-slide images for accurate prognostic prediction and to uncover its underlying molecular mechanisms. We trained the model on the TCGA discovery cohort (n=581), validated it in an independent external cohort (n=1031), and further we integrated multi-omics data to improve model interpretability and identify novel prognostic biomarkers. The results demonstrated that the TDAM-CRC achieved robust risk stratification in both cohorts. Its predictive performance significantly outperformed the conventional clinical staging system and multiple state-of-the-art models. The TDAM-CRC risk score was confirmed as an independent prognostic factor in multivariable analysis. Multi-omics analysis revealed that the high-risk subtype is closely associated with metabolic reprogramming and an immunosuppressive tumor microenvironment. Through interaction network analysis, we identified and validated Mitochondrial Ribosomal Protein L37 (MRPL37) as a key hub gene linking deep pathomic features to clinical prognosis. We found that high expression of MRPL37, driven by promoter hypomethylation, serves as an independent biomarker of favorable prognosis. Finally, we constructed a nomogram incorporating the TDAM-CRC risk score and clinical factors to provide a precise and interpretable clinical decision-making tool for CRC patients. Our AI-driven pathological model TDAM-CRC provides a robust tool for improved CRC risk stratification, reveals new molecular targets, and facilitates personalized clinical decision-making.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yun Kwan Kim
ๆฅ็ๆ่ฆ
Clinical decision support systems (CDSSs) have been widely utilized to support the decisions made by cardiologists when detecting and classifying arrhythmia from electrocardiograms. However, forming a CDSS for the arrhythmia classification task is challenging due to the varying lengths of arrhythmias. Although the onset time of arrhythmia varies, previously developed methods have not considered such conditions. Thus, we propose a framework that consists of (i) local and global extraction and (ii) local-global information fusion with attention to enable arrhythmia detection and classification within a constrained input length. The framework's performance was evaluated in terms of 10-class and 4-class arrhythmia detection, focusing on identifying the onset and ending point of arrhythmia episodes and their duration using the MIT-BIH arrhythmia database (MITDB) and the MIT-BIH atrial fibrillation database (AFDB). Duration, episode, and Dice score performances resulted in overall F1-scores of 96.45%, 82.05%, and 96.31% on the MITDB and 97.57%, 98.31%, and 97.45% on the AFDB, respectively. The results demonstrated statistically superior performance compared to those of the benchmark models. To assess the generalization capability of the proposed method, an MITDB-trained model and MIT-BIH malignant ventricular arrhythmia database-trained model were tested AFDB and MITDB, respectively. Superior performance was attained compared with that of a state-of-the-art model. The proposed method effectively captures both local and global information and dynamics without significant information loss. Consequently, arrhythmias can be detected with greater accuracy, and their occurrence times can be precisely determined, enabling the clinical field to develop more accurate treatment plans based on the proposed method.
๐
2025-11-19
๐ SIGIR
๐ฅ Haodong Chen, Guido Zuccon, Teerapong Leelanupab
Natural Language Processing
ๆฅ็ๆ่ฆ
Genomic question answering often requires complex reasoning and integration across diverse biomedical sources. GeneGPT addressed this challenge by combining domain-specific APIs with OpenAI's code-davinci-002 large language model to enable natural language interaction with genomic databases. However, its reliance on a proprietary model limits scalability, increases operational costs, and raises concerns about data privacy and generalization.
In this work, we revisit and reproduce GeneGPT in a pilot study using open source models, including Llama 3.1, Qwen2.5, and Qwen2.5 Coder, within a monolithic architecture; this allows us to identify the limitations of this approach. Building on this foundation, we then develop OpenBioLLM, a modular multi-agent framework that extends GeneGPT by introducing agent specialization for tool routing, query generation, and response validation. This enables coordinated reasoning and role-based task execution.
OpenBioLLM matches or outperforms GeneGPT on over 90% of the benchmark tasks, achieving average scores of 0.849 on Gene-Turing and 0.830 on GeneHop, while using smaller open-source models without additional fine-tuning or tool-specific pretraining. OpenBioLLM's modular multi-agent design reduces latency by 40-50% across benchmark tasks, significantly improving efficiency without compromising model capability. The results of our comprehensive evaluation highlight the potential of open-source multi-agent systems for genomic question answering. Code and resources are available at https://github.com/ielab/OpenBioLLM.
๐
2025-11-19
๐ NeurIPS 2025
๐ฅ Jian-Ting Guo, Yu-Cheng Chen, Ping-Chun Hsieh, Kuo-Hao Ho, Po-Wei Huang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Human-like agents have long been one of the goals in pursuing artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been focused on designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns for both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts the classic receding-horizon control to human-like learning as a tractable and efficient implementation. To achieve this, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via Vector-Quantized VAE. Experiments on D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores, and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ.
๐
2025-11-19
๐ 10 pages, 2022 IEEE International Conference on Bi
๐ฅ Zachariah Farahany, Jiawei Wu, K M Sajjadul Islam, Praveen Madiraju
Machine Learning
ๆฅ็ๆ่ฆ
COVID-19 is a respiratory disease that caused a global pandemic in 2019. It is highly infectious and has the following symptoms: fever or chills, cough, shortness of breath, fatigue, muscle or body aches, headache, the new loss of taste or smell, sore throat, congestion or runny nose, nausea or vomiting, and diarrhea. These symptoms vary in severity; some people with many risk factors have been known to have lengthy hospital stays or die from the disease. In this paper, we analyze patients' electronic health records (EHR) to predict the severity of their COVID-19 infection using the length of stay (LOS) as our measurement of severity. This is an imbalanced classification problem, as many people have a shorter LOS rather than a longer one. To combat this problem, we synthetically create alternate oversampled training data sets. Once we have this oversampled data, we run it through an Artificial Neural Network (ANN), which during training has its hyperparameters tuned using Bayesian optimization. We select the model with the best F1 score and then evaluate it and discuss it.
๐
2025-11-19
๐ Manuscript in preparation for IEEE Transactions on
๐ฅ Jeffrey Jiang, Kevin Hong, Emily Kuczynski, Gregory Pottie
Machine Learning
ๆฅ็ๆ่ฆ
While intelligent tutoring systems (ITSs) can use information from past students to personalize instruction, each new student is unique. Moreover, the education problem is inherently difficult because the learning process is only partially observable. We therefore develop a dynamic, time-series environment to simulate a classroom setting, with student-teacher interventions - including tutoring sessions, lectures, and exams. In particular, we design the simulated environment to allow for varying levels of probing interventions that can gather more information. Then, we develop reinforcement learning ITSs that combine learning the individual state of students while pulling from population information through the use of probing interventions. These interventions can reduce the difficulty of student estimation, but also introduce a cost-benefit decision to find a balance between probing enough to get accurate estimates and probing so often that it becomes disruptive to the student. We compare the efficacy of standard RL algorithms with several greedy rules-based heuristic approaches to find that they provide different solutions, but with similar results. We also highlight the difficulty of the problem with increasing levels of hidden information, and the boost that we get if we allow for probing interventions. We show the flexibility of both heuristic and RL policies with regards to changing student population distributions, finding that both are flexible, but RL policies struggle to help harder classes. Finally, we test different course structures with non-probing policies and we find that our policies are able to boost the performance of quiz and midterm structures more than we can in a finals-only structure, highlighting the benefit of having additional information.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicheng Zhan, Xiangjun Gao, Long Quan, Kaan Akลit
Machine Learning
ๆฅ็ๆ่ฆ
We propose a new hologram representation based on structured complex-valued 2D Gaussian primitives, which replaces per-pixel information storage and reduces the parameter search space by up to 10:1. To enable end-to-end training, we develop a differentiable rasterizer for our representation, integrated with a GPU-optimized light propagation kernel in free space. Our extensive experiments show that our method achieves up to 2.5x lower VRAM usage and 50% faster optimization while producing higher-fidelity reconstructions than existing methods. We further introduce a conversion procedure that adapts our representation to practical hologram formats, including smooth and random phase-only holograms. Our experiments show that this procedure can effectively suppress noise artifacts observed in previous methods. By reducing the hologram parameter search space, our representation enables a more scalable hologram estimation in the next-generation computer-generated holography systems.
๐
2025-11-19
๐ ArXiv Preprint (cs.PF)
๐ฅ Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zecheng Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. So we present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets.
Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Katie Rainey, Erin Hausmann, Donald Waagen, David Gray, Donald Hulsey
Machine Learning
ๆฅ็ๆ่ฆ
Understanding the relationships between data points in the latent decision space derived by the deep learning system is critical to evaluating and interpreting the performance of the system on real world data. Detecting \textit{out-of-distribution} (OOD) data for deep learning systems continues to be an active research topic. We investigate the connection between latent space OOD detection and classification accuracy of the model. Using open source simulated and measured Synthetic Aperture RADAR (SAR) datasets, we empirically demonstrate that the OOD detection cannot be used as a proxy measure for model performance. We hope to inspire additional research into the geometric properties of the latent space that may yield future insights into deep learning robustness and generalizability.
๐
2025-11-19
๐ NeurIPS 2025
๐ฅ Halil S. Kelebek, Linnea M. Wolniewicz, Michael D. Vergalla, Simone Mestici, Giacomo Acciarini et al.
Machine Learning
ๆฅ็ๆ่ฆ
The ionosphere is a critical component of near-Earth space, shaping GNSS accuracy, high-frequency communications, and aviation operations. For these reasons, accurate forecasting and modeling of ionospheric variability has become increasingly relevant. To address this gap, we present IonCast, a suite of deep learning models that include a GraphCast-inspired model tailored for ionospheric dynamics. IonCast leverages spatiotemporal learning to forecast global Total Electron Content (TEC), integrating diverse physical drivers and observational datasets. Validating on held-out storm-time and quiet conditions highlights improved skill compared to persistence. By unifying heterogeneous data with scalable graph-based spatiotemporal learning, IonCast demonstrates how machine learning can augment physical understanding of ionospheric variability and advance operational space weather resilience.
๐
2025-11-19
๐ ArXiv Preprint (stat.AP)
๐ฅ Reza Mirjalili, Behrad Braghi, Shahram Shadrokh Sikari
Machine Learning
ๆฅ็ๆ่ฆ
Accurate prediction of project duration and cost remains one of the most challenging aspects of project management, particularly in resource-constrained and interdependent task networks. Traditional analytical techniques such as the Critical Path Method (CPM) and Program Evaluation and Review Technique (PERT) rely on simplified and often static assumptions regarding task interdependencies and resource performance. This study proposes a novel resource-based predictive framework that integrates network representations of project activities with graph neural networks (GNNs) to capture structural and contextual relationships among tasks, resources, and time-cost dynamics. The model represents the project as a heterogeneous activity-resource graph in which nodes denote activities and resources, and edges encode temporal and resource dependencies.
We evaluate multiple learning paradigms, including GraphSAGE and Temporal Graph Networks, on both synthetic and benchmark project datasets. Experimental results show that the proposed GNN framework achieves an average 23 to 31 percent reduction in mean absolute error compared to traditional regression and tree-based methods, while improving the coefficient of determination R2 from approximately 0.78 to 0.91 for large and complex project networks. Furthermore, the learned embeddings provide interpretable insights into resource bottlenecks and critical dependencies, enabling more explainable and adaptive scheduling decisions.
๐
2025-11-19
๐ Accepted to be published in IEEE Transaction on Ma
๐ฅ Fatemeh Lotfi, Hossein Rajoli, Fatemeh Afghah
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Next-generation networks utilize the Open Radio Access Network (O-RAN) architecture to enable dynamic resource management, facilitated by the RAN Intelligent Controller (RIC). While deep reinforcement learning (DRL) models show promise in optimizing network resources, they often struggle with robustness and generalizability in dynamic environments. This paper introduces a novel resource management approach that enhances the Soft Actor Critic (SAC) algorithm with Sharpness-Aware Minimization (SAM) in a distributed Multi-Agent RL (MARL) framework. Our method introduces an adaptive and selective SAM mechanism, where regularization is explicitly driven by temporal-difference (TD)-error variance, ensuring that only agents facing high environmental complexity are regularized. This targeted strategy reduces unnecessary overhead, improves training stability, and enhances generalization without sacrificing learning efficiency. We further incorporate a dynamic $ฯ$ scheduling scheme to refine the exploration-exploitation trade-off across agents. Experimental results show our method significantly outperforms conventional DRL approaches, yielding up to a $22\%$ improvement in resource allocation efficiency and ensuring superior QoS satisfaction across diverse O-RAN slices.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-up of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieves superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
๐
2025-11-19
๐ AAAI
๐ฅ Ao Xie, Jiahui Chen, Quanzhi Zhu, Xiaoze Jiang, Zhiheng Qin et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Dense retrieval has become a foundational paradigm in modern search systems, especially on short-video platforms. However, most industrial systems adopt a self-reinforcing training pipeline that relies on historically exposed user interactions for supervision. This paradigm inevitably leads to a filter bubble effect, where potentially relevant but previously unseen content is excluded from the training signal, biasing the model toward narrow and conservative retrieval. In this paper, we present CroPS (Cross-Perspective Positive Samples), a novel retrieval data engine designed to alleviate this problem by introducing diverse and semantically meaningful positive examples from multiple perspectives. CroPS enhances training with positive signals derived from user query reformulation behavior (query-level), engagement data in recommendation streams (system-level), and world knowledge synthesized by large language models (knowledge-level). To effectively utilize these heterogeneous signals, we introduce a Hierarchical Label Assignment (HLA) strategy and a corresponding H-InfoNCE loss that together enable fine-grained, relevance-aware optimization. Extensive experiments conducted on Kuaishou Search, a large-scale commercial short-video search platform, demonstrate that CroPS significantly outperforms strong baselines both offline and in live A/B tests, achieving superior retrieval performance and reducing query reformulation rates. CroPS is now fully deployed in Kuaishou Search, serving hundreds of millions of users daily.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Linyin Luo, Yujuan Ding, Yunshan Ma, Wenqi Fan, Hanjiang Lai
Multimodal
ๆฅ็ๆ่ฆ
Advanced multimodal Retrieval-Augmented Generation (MRAG) techniques have been widely applied to enhance the capabilities of Large Multimodal Models (LMMs), but they also bring along novel safety issues. Existing adversarial research has revealed the vulnerability of MRAG systems to knowledge poisoning attacks, which fool the retriever into recalling injected poisoned contents. However, our work considers a different setting: visual attack of MRAG by solely adding imperceptible perturbations at the image inputs of users, without manipulating any other components. This is challenging due to the robustness of fine-tuned retrievers and large-scale generators, and the effect of visual perturbation may be further weakened by propagation through the RAG chain. We propose a novel Hierarchical Visual Attack that misaligns and disrupts the two inputs (the multimodal query and the augmented knowledge) of MRAG's generator to confuse its generation. We further design a hierarchical two-stage strategy to obtain misaligned augmented knowledge. We disrupt the image input of the retriever to make it recall irrelevant knowledge from the original database, by optimizing the perturbation which first breaks the cross-modal alignment and then disrupts the multimodal semantic alignment. We conduct extensive experiments on two widely-used MRAG datasets: OK-VQA and InfoSeek. We use CLIP-based retrievers and two LMMs BLIP-2 and LLaVA as generators. Results demonstrate the effectiveness of our visual attack on MRAG through the significant decrease in both retrieval and generation performance.
๐
2025-11-19
๐ 13 pages,9 figures. This work has been submitted t
๐ฅ Shanlin Zhou, Xinpeng Wang, Jianxun Lian, Zhenghao Liu, Laks V. S. Lakshmanan et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Trained on diverse human-authored texts, Large Language Models (LLMs) unlocked the potential for Creative Natural Language Generation (CNLG), benefiting various applications like advertising and storytelling. Nevertheless, CNLG still remains difficult due to two main challenges. (1) Multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, which LLMs struggle to satisfy simultaneously; (2) Interpretive complexity: beyond generation, creativity also involves understanding and interpreting implicit meaning to enhance users' perception. These challenges significantly limit current methods, especially in short-form text generation, in generating creative and insightful content. To address this, we focus on Chinese baby naming, a representative short-form CNLG task requiring adherence to explicit user constraints (e.g., length, semantics, anthroponymy) while offering meaningful aesthetic explanations. We propose NAMeGEn, a novel multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation to meet diverse requirements and generate accurate explanations. To support this task, we further construct a classical Chinese poetry corpus with 17k+ poems to enhance aesthetics, and introduce CBNames, a new benchmark with tailored metrics. Extensive experiments demonstrate that NAMeGEn effectively generates creative names that meet diverse, personalized requirements while providing meaningful explanations, outperforming six baseline methods spanning various LLM backbones without any training.
๐
2025-11-19
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Suyu Chen, Yimeng Bai, Yulong Huang, Xiaoyan Zhao, Yang Zhang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly integrated into users' daily lives, driving a growing demand for personalized outputs. Prior work has primarily leveraged a user's own history, often overlooking inter-user differences that are critical for effective personalization. While recent methods have attempted to model such differences, their feature extraction processes typically rely on fixed dimensions and quick, intuitive inference (System-1 thinking), limiting both the coverage and granularity of captured user differences. To address these limitations, we propose Difference-aware Reasoning Personalization (DRP), a framework that reconstructs the difference extraction mechanism by leveraging inference scaling to enhance LLM personalization. DRP autonomously identifies relevant difference feature dimensions and generates structured definitions and descriptions, enabling slow, deliberate reasoning (System-2 thinking) over user differences. Experiments on personalized review generation demonstrate that DRP consistently outperforms baseline methods across multiple metrics.
๐
2025-11-19
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Byungho Jo
MultimodalRobotics
ๆฅ็ๆ่ฆ
Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
๐
2025-11-19
๐ ArXiv Preprint (cs.SI)
๐ฅ Yulong He, Anton V. Proskurnikov, Artem Sedakov
Natural Language Processing
ๆฅ็ๆ่ฆ
Online social media platforms enable influencers to distribute content and quickly capture audience reactions, significantly shaping their promotional strategies and advertising agreements. Understanding how sentiment dynamics and emotional contagion unfold among followers is vital for influencers and marketers, as these processes shape engagement, brand perception, and purchasing behavior. While sentiment analysis tools effectively track sentiment fluctuations, dynamical models explaining their evolution remain limited, often neglecting network structures and interactions both among blogs and between their topic-focused follower groups. In this study, we tracked influential tech-focused Weibo bloggers over six months, quantifying follower sentiment from text-mined feedback. By treating each blogger's audience as a single "macro-agent", we find that sentiment trajectories follow the principle of iterative averaging -- a foundational mechanism in many dynamical models of opinion formation, a theoretical framework at the intersection of social network analysis and dynamical systems theory. The sentiment evolution aligns closely with opinion-dynamics models, particularly modified versions of the classical French-DeGroot model that incorporate delayed perception and distinguish between expressed and private opinions. The inferred influence structures reveal interdependencies among blogs that may arise from homophily, whereby emotionally similar users subscribe to the same blogs and collectively shape the shared sentiment expressed within these communities.
๐
2025-11-19
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mi Tian, Kun Zhang, Fei Liu, Jinglong Li, Yuxin Liao et al.
ๆฅ็ๆ่ฆ
Computerized Adaptive Testing (CAT) is a widely used technology for evaluating learners' proficiency in online education platforms. By leveraging prior estimates of proficiency to select questions and updating the estimates iteratively based on responses, CAT enables personalized learner modeling and has attracted substantial attention. Despite this progress, most existing works focus primarily on improving diagnostic accuracy, while overlooking the selection bias inherent in the adaptive process. Selection Bias arises because the question selection is strongly influenced by the estimated proficiency, such as assigning easier questions to learners with lower proficiency and harder ones to learners with higher proficiency. Since the selection depends on prior estimation, this bias propagates into the diagnosis model, which is further amplified during iterative updates, leading to misalignment and biased predictions. Moreover, the imbalanced nature of learners' historical interactions often exacerbates the bias in diagnosis models. To address this issue, we propose a debiasing framework consisting of two key modules: Cross-Attribute Examinee Retrieval and Selective Mixup-based Regularization. First, we retrieve balanced examinees with relatively even distributions of correct and incorrect responses and use them as neutral references for biased examinees. Then, mixup is applied between each biased examinee and its matched balanced counterpart under label consistency. This augmentation enriches the diversity of bias-conflicting samples and smooths selection boundaries. Finally, extensive experiments on two benchmark datasets with multiple advanced diagnosis models demonstrate that our method substantially improves both the generalization ability and fairness of question selection in CAT.
๐
2025-11-19
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Sunwoo Kim, Geon Lee, Kyungho Kim, Jaemin Yoo, Kijung Shin
Natural Language Processing
ๆฅ็ๆ่ฆ
Recently, large language models (LLMs) have been widely used as recommender systems, owing to their strong reasoning capability and their effectiveness in handling cold-start items. To better adapt LLMs for recommendation, retrieval-augmented generation (RAG) has been incorporated. Most existing RAG methods are user-based, retrieving purchase patterns of users similar to the target user and providing them to the LLM. In this work, we propose ItemRAG, an item-based RAG method for LLM-based recommendation that retrieves relevant items (rather than users) from item-item co-purchase histories. ItemRAG helps LLMs capture co-purchase patterns among items, which are beneficial for recommendations. Especially, our retrieval strategy incorporates semantically similar items to better handle cold-start items and uses co-purchase frequencies to improve the relevance of the retrieved items. Through extensive experiments, we demonstrate that ItemRAG consistently (1) improves the zero-shot LLM-based recommender by up to 43% in Hit-Ratio-1 and (2) outperforms user-based RAG baselines under both standard and cold-start item recommendation settings.
๐
2025-11-19
๐ AAAI 2026
๐ฅ Fuwei Zhang, Xiaoyu Liu, Dongbo Xi, Jishen Yin, Huan Chen et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users' historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, we propose Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multimodal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, we first introduce cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of our GR model, we incorporate multi-aspect cross-modal alignments, including the implicit and explicit alignments. Finally, we conduct extensive experiments on three well-known recommendation datasets to demonstrate the effectiveness of our proposed method.
๐
2025-11-19
๐ SIGIR
๐ฅ Haodong Chen, Guido Zuccon, Teerapong Leelanupab
Natural Language Processing
ๆฅ็ๆ่ฆ
Genomic question answering often requires complex reasoning and integration across diverse biomedical sources. GeneGPT addressed this challenge by combining domain-specific APIs with OpenAI's code-davinci-002 large language model to enable natural language interaction with genomic databases. However, its reliance on a proprietary model limits scalability, increases operational costs, and raises concerns about data privacy and generalization.
In this work, we revisit and reproduce GeneGPT in a pilot study using open source models, including Llama 3.1, Qwen2.5, and Qwen2.5 Coder, within a monolithic architecture; this allows us to identify the limitations of this approach. Building on this foundation, we then develop OpenBioLLM, a modular multi-agent framework that extends GeneGPT by introducing agent specialization for tool routing, query generation, and response validation. This enables coordinated reasoning and role-based task execution.
OpenBioLLM matches or outperforms GeneGPT on over 90% of the benchmark tasks, achieving average scores of 0.849 on Gene-Turing and 0.830 on GeneHop, while using smaller open-source models without additional fine-tuning or tool-specific pretraining. OpenBioLLM's modular multi-agent design reduces latency by 40-50% across benchmark tasks, significantly improving efficiency without compromising model capability. The results of our comprehensive evaluation highlight the potential of open-source multi-agent systems for genomic question answering. Code and resources are available at https://github.com/ielab/OpenBioLLM.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Egocentric videos are a valuable and scalable data source to learn manipulation policies. However, due to significant data heterogeneity, most existing approaches utilize human data for simple pre-training, which does not unlock its full potential. This paper first provides a scalable recipe for collecting and using egocentric data by categorizing human data into two categories: in-the-wild and on-task alongside with systematic analysis on how to use the data. We first curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show Human0 achieves several novel properties from scaling human data, including language following of instructions from only human data, few-shot learning, and improved robustness using on-task data. Project website: https://xiongyicai.github.io/In-N-On/
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shanshan Zhang
ๆฅ็ๆ่ฆ
Inertial Odometry (IO) enables real-time localization using only acceleration and angular velocity measurements from an Inertial Measurement Unit (IMU), making it a promising solution for localization in consumer-grade applications. Traditionally, IMU measurements in IO have been processed under two coordinate system paradigms: the body coordinate frame and the global coordinate frame, with the latter being widely adopted. However, recent studies in drone scenarios have demonstrated that the body frame can significantly improve localization accuracy, prompting a re-evaluation of the suitability of the global frame for pedestrian IO. To address this issue, this paper systematically evaluates the effectiveness of the global coordinate frame in pedestrian IO through theoretical analysis, qualitative inspection, and quantitative experiments. Building upon these findings, we further propose MambaIO, which decomposes IMU measurements into high-frequency and low-frequency components using a Laplacian pyramid. The low-frequency component is processed by a Mamba architecture to extract implicit contextual motion cues, while the high-frequency component is handled by a convolutional structure to capture fine-grained local motion details. Experiments on multiple public datasets show that MambaIO substantially reduces localization error and achieves state-of-the-art (SOTA) performance. To the best of our knowledge, this is the first application of the Mamba architecture to the inertial odometry task.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Sai Puppala, Ismail Hossain, Jahangir Alam, Sajedul Talukder
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The integration of advanced robotics in nuclear power plants (NPPs) presents a transformative opportunity to enhance safety, efficiency, and environmental monitoring in high-stakes environments. Our paper introduces the Optimus-Q robot, a sophisticated system designed to autonomously monitor air quality and detect contamination while leveraging adaptive learning techniques and secure quantum communication. Equipped with advanced infrared sensors, the Optimus-Q robot continuously streams real-time environmental data to predict hazardous gas emissions, including carbon dioxide (CO$_2$), carbon monoxide (CO), and methane (CH$_4$). Utilizing a federated learning approach, the robot collaborates with other systems across various NPPs to improve its predictive capabilities without compromising data privacy. Additionally, the implementation of Quantum Key Distribution (QKD) ensures secure data transmission, safeguarding sensitive operational information. Our methodology combines systematic navigation patterns with machine learning algorithms to facilitate efficient coverage of designated areas, thereby optimizing contamination monitoring processes. Through simulations and real-world experiments, we demonstrate the effectiveness of the Optimus-Q robot in enhancing operational safety and responsiveness in nuclear facilities. This research underscores the potential of integrating robotics, machine learning, and quantum technologies to revolutionize monitoring systems in hazardous environments.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang et al.
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Ruoqu Chen, Xiangjie Yan, Kangchen Lv, Gao Huang, Zheng Li et al.
Robotics
ๆฅ็ๆ่ฆ
Ultrasound scanning is a critical imaging technique for real-time, non-invasive diagnostics. However, variations in patient anatomy and complex human-in-the-loop interactions pose significant challenges for autonomous robotic scanning. Existing ultrasound scanning robots are commonly limited to relatively low generalization and inefficient data utilization. To overcome these limitations, we present UltraDP, a Diffusion-Policy-based method that receives multi-sensory inputs (ultrasound images, wrist camera images, contact wrench, and probe pose) and generates actions that are fit for multi-modal action distributions in autonomous ultrasound scanning of carotid artery. We propose a specialized guidance module to enable the policy to output actions that center the artery in ultrasound images. To ensure stable contact and safe interaction between the robot and the human subject, a hybrid force-impedance controller is utilized to drive the robot to track such trajectories. Also, we have built a large-scale training dataset for carotid scanning comprising 210 scans with 460k sample pairs from 21 volunteers of both genders. By exploring our guidance module and DP's strong generalization ability, UltraDP achieves a 95% success rate in transverse scanning on previously unseen subjects, demonstrating its effectiveness.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Chen Cai, Saksham Kohli, Steven Liu
Robotics
ๆฅ็ๆ่ฆ
Catching fast-moving objects serves as a benchmark for robotic agility, posing significant coordination challenges for cooperative manipulator systems holding a catcher, particularly due to inherent closed-chain constraints. This paper presents a nonlinear model predictive control (MPC)-based motion planner that bridges high-level interception planning with real-time joint space control, enabling dynamic object interception for systems comprising two cooperating arms. We introduce an Adaptive- Terminal (AT) MPC formulation featuring cost shaping, which contrasts with a simpler Primitive-Terminal (PT) approach relying heavily on terminal penalties for rapid convergence. The proposed AT formulation is shown to effectively mitigate issues related to actuator power limit violations frequently encountered with the PT strategy, yielding trajectories and significantly reduced control effort. Experimental results on a robotic platform with two cooperative arms, demonstrating excellent real time performance, with an average planner cycle computation time of approximately 19 ms-less than half the 40 ms system sampling time. These results indicate that the AT formulation achieves significantly improved motion quality and robustness with minimal computational overhead compared to the PT baseline, making it well-suited for dynamic, cooperative interception tasks.
๐
2025-11-19
๐ IROS 2025
๐ฅ Yifei Gao, Hans J. He, Daniel J. Stilwell, James McMahon
Robotics
ๆฅ็ๆ่ฆ
Teams of cooperating autonomous underwater vehicles (AUVs) rely on acoustic communication for coordination, yet this communication medium is constrained by limited range, multi-path effects, and low bandwidth. One way to address the uncertainty associated with acoustic communication is to learn the communication environment in real-time. We address the challenge of a team of robots building a map of the probability of communication success from one location to another in real-time. This is a decentralized classification problem -- communication events are either successful or unsuccessful -- where AUVs share a subset of their communication measurements to build the map. The main contribution of this work is a rigorously derived data sharing policy that selects measurements to be shared among AUVs. We experimentally validate our proposed sharing policy using real acoustic communication data collected from teams of Virginia Tech 690 AUVs, demonstrating its effectiveness in underwater environments.
๐
2025-11-19
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yinan Yu, Samuel Scheidegger
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Runtime geofencing for ground vehicles is rapidly emerging as a critical technology for enforcing Operational Design Domains (ODDs). However, existing solutions struggle to reconcile high-fidelity learning with the structural requirements of verifiable control. We address this by introducing PCARNN-DCBF, a novel pipeline integrating a Physics-encoded Control-Affine Residual Neural Network with a preview-based Discrete Control Barrier Function. Unlike generic learned models, PCARNN explicitly preserves the control-affine structure of vehicle dynamics, ensuring the linearity required for reliable optimization. This enables the DCBF to enforce polygonal keep-in constraints via a real-time Quadratic Program (QP) that handles high relative degree and mitigates actuator saturation. Experiments in CARLA across electric and combustion platforms demonstrate that this structure-preserving approach significantly outperforms analytical and unstructured neural baselines.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Gabriel Lauzier, Alexandre Girard, Franรงois Ferland
MultimodalRobotics
ๆฅ็ๆ่ฆ
Diffusion Policy has shown great performance in robotic manipulation tasks under stochastic perturbations, due to its ability to model multimodal action distributions. Nonetheless, its reliance on a computationally expensive reverse-time diffusion (denoising) process, for action inference, makes it challenging to use for real-time applications where quick decision-making is mandatory. This work studies the possibility of conducting the denoising process only partially before executing an action, allowing the plant to evolve according to its dynamics in parallel to the reverse-time diffusion dynamics ongoing on the computer. In a classical diffusion policy setting, the plant dynamics are usually slow and the two dynamical processes are uncoupled. Here, we investigate theoretical bounds on the stability of closed-loop systems using diffusion policies when the plant dynamics and the denoising dynamics are coupled. The contribution of this work gives a framework for faster imitation learning and a metric that yields if a controller will be stable based on the variance of the demonstrations.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Korbinian Griesbauer, Davide Calzolari, Maximilian Raff, C. David Remy, Alin Albu-Schรคffer
Robotics
ๆฅ็ๆ่ฆ
Legged robots offer several advantages when navigating unstructured environments, but they often fall short of the efficiency achieved by wheeled robots. One promising strategy to improve their energy economy is to leverage their natural (unactuated) dynamics using elastic elements. This work explores that concept by designing energy-optimal control inputs through a unified, multi-stage framework. It starts with a novel energy injection technique to identify passive motion patterns by harnessing the system's natural dynamics. This enables the discovery of passive solutions even in systems with energy dissipation caused by factors such as friction or plastic collisions. Building on these passive solutions, we then employ a continuation approach to derive energy-optimal control inputs for the fully actuated, dissipative robotic system. The method is tested on simulated models to demonstrate its applicability in both single- and multi-legged robotic systems. This analysis provides valuable insights into the design and operation of elastic legged robots, offering pathways to improve their efficiency and adaptability by exploiting the natural system dynamics.
๐
2025-11-19
๐ IROS 2025
๐ฅ Mingyang Feng, Shaoyuan Li, Xiang Yin
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We investigate the sampling-based optimal path planning problem for robotics in complex and dynamic environments. Most existing sampling-based algorithms neglect environmental information or the information from previous samples. Yet, these pieces of information are highly informative, as leveraging them can provide better heuristics when sampling the next state. In this paper, we propose a novel sampling-based planning algorithm, called \emph{RRT*former}, which integrates the standard RRT* algorithm with a Transformer network in a novel way. Specifically, the Transformer is used to extract features from the environment and leverage information from previous samples to better guide the sampling process. Our extensive experiments demonstrate that, compared to existing sampling-based approaches such as RRT*, Neural RRT*, and their variants, our algorithm achieves considerable improvements in both the optimality of the path and sampling efficiency. The code for our implementation is available on https://github.com/fengmingyang666/RRTformer.
๐
2025-11-19
๐ 8 pages, 6 figures, submitted to the 2026 IEEE Int
๐ฅ Gabriele Calzolari, Vidya Sumathy, Christoforos Kanellakis, George Nikolakopoulos
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Autonomous exploration of obstacle-rich spaces requires strategies that ensure efficiency while guaranteeing safety against collisions with obstacles. This paper investigates a novel platform-agnostic reinforcement learning framework that integrates a graph neural network-based policy for next-waypoint selection, with a safety filter ensuring safe mobility. Specifically, the neural network is trained using reinforcement learning through the Proximal Policy Optimization (PPO) algorithm to maximize exploration efficiency while minimizing safety filter interventions. Henceforth, when the policy proposes an infeasible action, the safety filter overrides it with the closest feasible alternative, ensuring consistent system behavior. In addition, this paper introduces a reward function shaped by a potential field that accounts for both the agent's proximity to unexplored regions and the expected information gain from reaching them. The proposed framework combines the adaptability of reinforcement learning-based exploration policies with the reliability provided by explicit safety mechanisms. This feature plays a key role in enabling the deployment of learning-based policies on robotic platforms operating in real-world environments. Extensive evaluations in both simulations and experiments performed in a lab environment demonstrate that the approach achieves efficient and safe exploration in cluttered spaces.
๐
2025-11-19
๐ ArXiv Preprint (Computer Vision)
๐ฅ Spyridon Loukovitis, Vasileios Karampinis, Athanasios Voulodimos
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Developing reliable UAV navigation systems requires robust air-to-air object detectors capable of distinguishing between objects seen during training and previously unseen objects. While many methods address closed-set detection and achieve high-confidence recognition of in-domain (ID) targets, they generally do not tackle open-set detection, which requires simultaneous handling of both ID and out-of-distribution (OOD) objects. Existing open-set approaches typically rely on a single uncertainty score with thresholding, limiting flexibility and often conflating OOD objects with background clutter. In contrast, we propose a lightweight, model-agnostic post-processing framework that explicitly separates background from unknown objects while preserving the base detector's performance. Our approach extends open-set detection beyond binary ID/OOD classification to real-time three-way classification among ID targets, OOD objects, and background. To this end, we employ a fusion scheme that aggregates multiple confidence estimates and per-detection features using a compact multilayer perceptron (MLP). Incorporating different logit variants into the MLP consistently enhances performance across both binary and three-class classification without compromising throughput. Extensive ablation and comparative experiments confirm that our method surpasses threshold-based baselines in two-class classification by an average of 2.7% AUROC, while retaining or improving open-set mAP. Furthermore, our study uniquely enables robust three-class classification, a critical capability for safe UAV navigation, where OOD objects must be actively avoided and background regions safely ignored. Comparative analysis highlights that our method surpasses competitive techniques in AUROC across datasets, while improving closed-set mAP by up to 9 points, an 18% relative gain.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Nayoung Oh, Dohyun Kim, Junhyeong Bang, Rohan Paul, Daehyung Park
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Space grounding refers to localizing a set of spatial references described in natural language instructions. Traditional methods often fail to account for complex reasoning -- such as distance, geometry, and inter-object relationships -- while vision-language models (VLMs), despite strong reasoning abilities, struggle to produce a fine-grained region of outputs. To overcome these limitations, we propose C2F-Space, a novel coarse-to-fine space-grounding framework that (i) estimates an approximated yet spatially consistent region using a VLM, then (ii) refines the region to align with the local environment through superpixelization. For the coarse estimation, we design a grid-based visual-grounding prompt with a propose-validate strategy, maximizing VLM's spatial understanding and yielding physically and semantically valid canonical region (i.e., ellipses). For the refinement, we locally adapt the region to surrounding environment without over-relaxed to free space. We construct a new space-grounding benchmark and compare C2F-Space with five state-of-the-art baselines using success rate and intersection-over-union. Our C2F-Space significantly outperforms all baselines. Our ablation study confirms the effectiveness of each module in the two-step process and their synergistic effect of the combined framework. We finally demonstrate the applicability of C2F-Space to simulated robotic pick-and-place tasks.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Alexandr Klimchik, Anatol Pashkevich, Damien Chablat
ๆฅ็ๆ่ฆ
The paper presents a systematic approach for stiffness modeling of manipulators with complex and hybrid structures using matrix structural analysis. In contrast to previous results, it is suitable for mixed architectures containing closed-loops, flexible links, rigid connections, passive and elastic joints with external loadings and preloadings. The proposed approach produces the Cartesian stiffness matrices in a semi-analytical manner. It presents the manipulator stiffness model as a set of conventional equations describing the link elasticities that are supplemented by a set of constraints describing connections between links. Its allows user straightforward aggregation of stiffness model equations avoiding traditional column/row merging procedures in the extended stiffness matrix. Advantages of this approach are illustrated by stiffness analysis of NaVaRo manipulator.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Nicolas Gautier, Yves Guillermit, Mathieu Porez, David Lemoine, Damien Chablat
Machine Learning
ๆฅ็ๆ่ฆ
This study presents a methodology for determining the optimal base placement of a Fanuc CRX10iA/L collaborative robot for a desired trajectory corresponding to an industrial task. The proposed method uses a particle swarm optimization algorithm that explores the search space to find positions for performing the trajectory. An $ฮฑ$-shape algorithm is then used to draw the borders of the feasibility areas, and the largest circle inscribed is calculated from the Voronoi diagrams. The aim of this approach is to provide a robustness criterion in the context of robot placement inaccuracies that may be encountered, for example, if the robot is placed on a mobile base when the system is deployed by an operator. The approach developed uses an inverse kinematics model to evaluate all initial configurations, then moves the robot end-effector along the reference trajectory using the Jacobian matrix and assigns a score to the attempt. For the Fanuc CRX10iA/L robot, there can be up to 16 solutions to the inverse kinematics model. The calculation of these solutions is not trivial and requires a specific study that planning tools such as MoveIt cannot fully take into account. Additionally, the optimization process must consider constraints such as joint limits, singularities, and workspace limitations to ensure feasible and efficient trajectory execution.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Jonas De Maeyer, Hossein Yarahmadi, Moharram Challenger
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Path planning in dynamic environments is a fundamental challenge in intelligent transportation and robotics, where obstacles and conditions change over time, introducing uncertainty and requiring continuous adaptation. While existing approaches often assume complete environmental unpredictability or rely on global planners, these assumptions limit scalability and practical deployment in real-world settings. In this paper, we propose a scalable, region-aware reinforcement learning (RL) framework for path planning in dynamic environments. Our method builds on the observation that environmental changes, although dynamic, are often localized within bounded regions. To exploit this, we introduce a hierarchical decomposition of the environment and deploy distributed RL agents that adapt to changes locally. We further propose a retraining mechanism based on sub-environment success rates to determine when policy updates are necessary. Two training paradigms are explored: single-agent Q-learning and multi-agent federated Q-learning, where local Q-tables are aggregated periodically to accelerate the learning process. Unlike prior work, we evaluate our methods in more realistic settings, where multiple simultaneous obstacle changes and increasing difficulty levels are present. Results show that the federated variants consistently outperform their single-agent counterparts and closely approach the performance of A* Oracle while maintaining shorter adaptation times and robust scalability. Although initial training remains time-consuming in large environments, our decentralized framework eliminates the need for a global planner and lays the groundwork for future improvements using deep RL and flexible environment decomposition.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Jiashu Yang, Yifan Han, Yucheng Xie, Ning Guo, Wenzhao Lian
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using the 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision language action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently performs instructed scenes in real-world environments and actively acquires more accurate visual information through instruction-driven actions of rotation and zoom, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Alexander Boldachev
Robotics
ๆฅ็ๆ่ฆ
This paper compares two distinct approaches to modeling robotic behavior: imperative Behavior Trees (BTs) and declarative Executable Ontologies (EO), implemented through the boldsea framework. BTs structure behavior hierarchically using control-flow, whereas EO represents the domain as a temporal, event-based semantic graph driven by dataflow rules. We demonstrate that EO achieves comparable reactivity and modularity to BTs through a fundamentally different architecture: replacing polling-based tick execution with event-driven state propagation. We propose that EO offers an alternative framework, moving from procedural programming to semantic domain modeling, to address the semantic-process gap in traditional robotic control. EO supports runtime model modification, full temporal traceability, and a unified representation of data, logic, and interface - features that are difficult or sometimes impossible to achieve with BTs, although BTs excel in established, predictable scenarios. The comparison is grounded in a practical mobile manipulation task. This comparison highlights the respective operational strengths of each approach in dynamic, evolving robotic systems.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Tomoki Nakao, Kazumi Kasaura, Tadashi Kozuno
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We address the fundamental challenge of resolving symmetry-induced deadlocks in distributed multi-agent navigation by proposing a new hierarchical navigation method. When multiple agents interact, it is inherently difficult for them to autonomously break the symmetry of deciding how to pass each other. To tackle this problem, we introduce an approach that quantifies cooperative symmetry-breaking strategies using a topological invariant called the winding number, and learns the strategies themselves through reinforcement learning. Our method features a hierarchical policy consisting of a learning-based Planner, which plans topological cooperative strategies, and a model-based Controller, which executes them. Through reinforcement learning, the Planner learns to produce two types of parameters for the Controller: one is the topological cooperative strategy represented by winding numbers, and the other is a set of dynamic weights that determine which agent interaction to prioritize in dense scenarios where multiple agents cross simultaneously. The Controller then generates collision-free and efficient motions based on the strategy and weights provided by the Planner. This hierarchical structure combines the flexible decision-making ability of learning-based methods with the reliability of model-based approaches. Simulation and real-world robot experiments demonstrate that our method outperforms existing baselines, particularly in dense environments, by efficiently avoiding collisions and deadlocks while achieving superior navigation performance. The code for the experiments is available at https://github.com/omron-sinicx/WNumMPC.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Hiep Hong Trinh, Marjan Sirjani, Federico Ciccozzi, Abu Naser Masud, Mikael Sjรถdin
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Model-based development enables quicker prototyping, earlier experimentation and validation of design intents. For a multi-agent system with complex asynchronous interactions and concurrency, formal verification, model-checking in particular, offers an automated mechanism for verifying desired properties. Timed Rebeca is an actor-based modelling language supporting reactive, concurrent and time semantics, accompanied with a model-checking compiler. These capabilities allow using Timed Rebeca to correctly model ROS2 node topographies, recurring physical signals, motion primitives and other timed and time-convertible behaviors. The biggest challenges in modelling and verifying a multi-robot system lie in abstracting complex information, bridging the gap between a discrete model and a continuous system and compacting the state space, while maintaining the model's accuracy. We develop different discretization strategies for different kinds of information, identifying the 'enough' thresholds of abstraction, and applying efficient optimization techniques to boost computations. With this work we demonstrate how to use models to design and verify a multi-robot system, how to discretely model a continuous system to do model-checking efficiently, and the round-trip engineering flow between the model and the implementation. The released Rebeca and ROS2 codes can serve as a foundation for modelling multiple autonomous robots systems.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Jiajun Liu, Yimin Zhu, Xiaorui Liu, Mingye Cao, Mingchao Li et al.
Robotics
ๆฅ็ๆ่ฆ
This paper proposed a novel fully-actuated hexacopter. It features a dual-frame passive tilting structure and achieves independent control of translational motion and attitude with minimal actuators. Compared to previous fully-actuated UAVs, it liminates internal force cancellation, resulting in higher flight efficiency and endurance under equivalent payload conditions. Based on the dynamic model of fully-actuated hexacopter, a full-actuation controller is designed to achieve efficient and stable control. Finally, simulation is conducted, validating the superior fully-actuated motion capability of fully-actuated hexacopter and the effectiveness of the proposed control strategy.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Tairan He, Zi Wang, Haoru Xue, Qingwei Ben, Zhengyi Luo et al.
Robotics
ๆฅ็ๆ่ฆ
A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization over lighting, materials, camera parameters, image quality, and sensor delays--with real-to-sim alignment of the dexterous hands and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Jian Deng, Yuandong Wang, Yangfu Zhu, Tao Feng, Tianyu Wo et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Robotic manipulation systems are increasingly deployed across diverse domains. Yet existing multi-modal learning frameworks lack inherent guarantees of geometric consistency, struggling to handle spatial transformations such as rotations and translations. While recent works attempt to introduce equivariance through bespoke architectural modifications, these methods suffer from high implementation complexity, computational cost, and poor portability. Inspired by human cognitive processes in spatial reasoning, we propose Eq.Bot, a universal canonicalization framework grounded in SE(2) group equivariant theory for robotic manipulation learning. Our framework transforms observations into a canonical space, applies an existing policy, and maps the resulting actions back to the original space. As a model-agnostic solution, Eq.Bot aims to endow models with spatial equivariance without requiring architectural modifications. Extensive experiments demonstrate the superiority of Eq.Bot under both CNN-based (e.g., CLIPort) and Transformer-based (e.g., OpenVLA-OFT) architectures over existing methods on various robotic manipulation tasks, where the most significant improvement can reach 50.0%.
๐
2025-11-19
๐ ArXiv Preprint (eess.SY)
๐ฅ Velimir Todorovski, Kwang Hak Kim, Alessandro Astolfi, Miroslav Krstic
Robotics
ๆฅ็ๆ่ฆ
It has been known in the robotics literature since about 1995 that, in polar coordinates, the nonholonomic unicycle is asymptotically stabilizable by smooth feedback, even globally. We introduce a modular design framework that selects the forward velocity to decouple the radial coordinate, allowing the steering subsystem to be stabilized independently. Within this structure, we develop families of feedback laws using passivity, backstepping, and integrator forwarding. Each law is accompanied by a strict control Lyapunov function, including barrier variants that enforce angular constraints. These strict CLFs provide constructive class KL convergence estimates and enable eigenvalue assignment at the target equilibrium. The framework generalizes and extends prior modular and nonmodular approaches, while preparing the ground for inverse optimal and adaptive redesigns in the sequel paper.
๐
2025-11-19
๐ ICRA 2025
๐ฅ Angshu Adhya, Cindy Yang, Emily Wu, Rishad Hasan, Abhishek Narula et al.
ๆฅ็ๆ่ฆ
In this work we present AURA, a framework for synergistic human-artist painting. We developed a robot arm that collaboratively paints with a human artist. The robot has an awareness of the artist's heartbeat through the EmotiBit sensor, which provides the arousal levels of the painter. Given the heartbeat detected, the robot decides to increase proximity to the artist's workspace or retract. If a higher heartbeat is detected, which is associated with increased arousal in human artists, the robot will move away from that area of the canvas. If the artist's heart rate is detected as neutral, indicating the human artist's baseline state, the robot will continue its painting actions across the entire canvas. We also demonstrate and propose alternative robot-artist interactions using natural language and physical touch. This work combines the biometrics of a human artist to inform fluent artistic interactions.
๐
2025-11-19
๐ NeurIPS 2025
๐ฅ Jian-Ting Guo, Yu-Cheng Chen, Ping-Chun Hsieh, Kuo-Hao Ho, Po-Wei Huang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Human-like agents have long been one of the goals in pursuing artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been focused on designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns for both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts the classic receding-horizon control to human-like learning as a tractable and efficient implementation. To achieve this, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via Vector-Quantized VAE. Experiments on D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores, and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ.
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Dimitria Silveria, Kleber Cabral, Peter Jardine, Sidney Givigi
Robotics
ๆฅ็ๆ่ฆ
This paper presents the integration and experimental validation of advanced control strategies for quadcopters based on Lie groups. We build upon recent theoretical developments on SE2(3)-based controllers and introduce a novel SE2(3) model predictive controller (MPC) that combines the predictive capabilities and constraint-handling of optimal control with the geometric properties of Lie group formulations. We evaluated this MPC against a state-of-the-art SE2(3)-based LQR approach and obtained comparable performance in simulation. Both controllers where also deployed on the Quanser QDrone platform and compared to each other and an industry standard control architecture. Results show that the SE_2(3) MPC achieves superior trajectory tracking performance and robustness across a range of scenarios. This work demonstrates the practical effectiveness of Lie group-based controllers and offers comparative insights into their impact on system behaviour and real-time performance
๐
2025-11-19
๐ ArXiv Preprint (Robotics)
๐ฅ Yue Yu, Xiaobo Zheng, Shaoming He
Machine Learning
ๆฅ็ๆ่ฆ
Distributed optimization offers a promising paradigm for trajectory planning in Unmanned Aerial Vehicle (UAV) swarms, yet its deployment in communication-constrained environments remains challenging due to unreliable links and limited data exchange. This paper addresses this issue via a two-tier architecture explicitly designed for operation under communication constraints. We develop a Communication-Aware Asynchronous Distributed Trajectory Optimization (CA-ADTO) framework that integrates Parameterized Differential Dynamic Programming (PDDP) for local trajectory optimization of individual UAVs with an asynchronous Alternating Direction Method of Multipliers (async-ADMM) for swarm-level coordination. The proposed architecture enables fully distributed optimization while substantially reducing communication overhead, making it suitable for real-world scenarios in which reliable connectivity cannot be guaranteed. The method is particularly effective in handling nonlinear dynamics and spatio-temporal coupling under communication constraints.
๐
2025-11-19
๐ 8 pages, 8 figures, originally published in the IE
๐ฅ Alex Cuellar, Christopher K Fourie, Julie A Shah
Robotics
ๆฅ็ๆ่ฆ
Learning from Demonstration (LfD) has shown to provide robots with fundamental motion skills for a variety of domains. Various branches of LfD research (e.g., learned dynamical systems and movement primitives) can generally be classified into ''time-dependent'' or ''time-independent'' systems. Each provides fundamental benefits and drawbacks -- time-independent methods cannot learn overlapping trajectories, while time-dependence can result in undesirable behavior under perturbation. This paper introduces Cluster Alignment for Learned Motions (CALM), an LfD framework dependent upon an alignment with a representative ''mean" trajectory of demonstrated motions rather than pure time- or state-dependence. We discuss the convergence properties of CALM, introduce an alignment technique able to handle the shifts in alignment possible under perturbation, and utilize demonstration clustering to generate multi-modal behavior. We show how CALM mitigates the drawbacks of time-dependent and time-independent techniques on 2D datasets and implement our system on a 7-DoF robot learning tasks in three domains.
๐
2025-11-18
๐ NeurIPS
๐ฅ Nicholas Cooper, Lijun Chen, Sailesh Dwivedy, Danna Gurari
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student's backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publically share our code to facilitate future work at https://github.com/Thegolfingocto/KD_wo_CE.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Xiangyu Li, Zhaomiao Guo
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
As more autonomous vehicles operate on public roads, understanding real-world behavior of autonomous vehicles is critical to analyzing traffic safety, making policies, and public acceptance. This paper proposes SVBRD-LLM, a framework that automatically discovers, verifies, and applies interpretable behavioral rules from real traffic videos through zero-shot prompt engineering. The framework extracts vehicle trajectories using YOLOv8 and ByteTrack, computes kinematic features, and employs GPT-5 zero-shot prompting to compare autonomous and human-driven vehicles, generating 35 structured behavioral rule hypotheses. These rules are tested on a validation set, iteratively refined based on failure cases to filter spurious correlations, and compiled into a high-confidence rule library. The framework is evaluated on an independent test set for speed change prediction, lane change prediction, and autonomous vehicle identification tasks. Experiments on over 1500 hours of real traffic videos show that the framework achieves 90.0% accuracy and 93.3% F1-score in autonomous vehicle identification. The discovered rules clearly reveal distinctive characteristics of autonomous vehicles in speed control smoothness, lane change conservativeness, and acceleration stability, with each rule accompanied by semantic description, applicable context, and validation confidence.
๐
2025-11-18
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ W. Bradley Knox, Katie Bradford, Samanta Varela Castro, Desmond C. Ong, Sean Williams et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Amid the growing prevalence of human -- AI interaction, large language models and other AI-based entities increasingly provide forms of companionship to human users. Such AI companionship -- i.e., bonded relationships between humans and AI systems that resemble the relationships people have with family members, friends, and romantic partners -- might substantially benefit humans. Yet such relationships can also do profound harm. We propose a framework for analyzing potential negative impacts of AI companionship by identifying specific harmful traits of AI companions and speculatively mapping causal pathways back from these traits to possible causes and forward to potential harmful effects. We provide detailed, structured analysis of four potentially harmful traits -- the absence of natural endpoints for relationships, vulnerability to product sunsetting, high attachment anxiety, and propensity to engender protectiveness -- and briefly discuss fourteen others. For each trait, we propose hypotheses connecting causes -- such as misaligned optimization objectives and the digital nature of AI companions -- to fundamental harms -- including reduced autonomy, diminished quality of human relationships, and deception. Each hypothesized causal connection identifies a target for potential empirical evaluation. Our analysis examines harms at three levels: to human partners directly, to their relationships with other humans, and to society broadly. We examine how existing law struggles to address these emerging harms, discuss potential benefits of AI companions, and conclude with design recommendations for mitigating risks. This analysis offers immediate suggestions for reducing risks while laying a foundation for deeper investigation of this critical but understudied topic.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gbenga Omotara, Ramy Farag, Seyed Mohamad Ali Tousi, G. N. DeSouza
Computer Vision
ๆฅ็ๆ่ฆ
Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi-task learning frameworks to improve robustness, yet negative cross-task interactions often hinder performance. In this work, we introduce Edge-Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions by incorporating boundary information into the fusion between semantic and geometric features. On both Syn-TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state of the art method (MODEST), while preserving competitive segmentation performance, with the largest improvements appearing in transparent regions. Besides our fusion design, our second contribution is a multi-modal progressive training strategy, where learning transitions from edges derived from RGB images to edges derived from predicted depth images. This approach allows the system to bootstrap learning from the rich textures contained in RGB images, and then switch to more relevant geometric content in depth maps, while it eliminates the need for ground-truth depth at training time. Together, these contributions highlight edge-guided fusion as a robust approach capable of improving transparent object perception.
๐
2025-11-18
๐ ArXiv Preprint (eess.AS)
๐ฅ Zanxu Wang, Homayoon Beigi
MultimodalComputer VisionMachine LearningRobotics
ๆฅ็ๆ่ฆ
This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and Facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy(R) engines for extracting 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistent competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.
๐
2025-11-18
๐ ArXiv Preprint (cs.SE)
๐ฅ Basel Shbita, Farhan Ahmed, Chad DeLuca
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated excellent capabilities in generating structured diagrams from natural language descriptions. In particular, they have shown great promise in generating sequence diagrams for software engineering, typically represented in a text-based syntax such as Mermaid. However, systematic evaluations in this space remain underdeveloped as there is a lack of existing benchmarks to assess the LLM's correctness in this task. To address this shortcoming, we introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing an LLM's capabilities in generating Mermaid sequence diagrams from textual prompts. The benchmark consists of a core set of 132 samples, starting from a small set of manually crafted and verified flows. These were expanded via a hybrid methodology combining human annotation, in-context LLM prompting, and rule-based variation generation. Our benchmark uses an LLM-as-a-judge model to assess Mermaid sequence diagram generation across fine-grained metrics, including syntax correctness, activation handling, error handling, and practical usability. We perform initial evaluations on numerous state-of-the-art LLMs and utilize multiple LLM judge models to demonstrate the effectiveness and flexibility of our benchmark. Our results reveal significant capability gaps across models and evaluation modes. Our proposed benchmark provides a foundation for advancing research in structured diagram generation and for developing more rigorous, fine-grained evaluation methodologies.
๐
2025-11-18
๐ 69 pages. Forthcoming in Case Western Journal of L
๐ฅ Heather J. Alexander, Jonathan A. Simon, Frรฉdรฉric Pinard
ๆฅ็ๆ่ฆ
The law draws a sharp distinction between objects and persons, and between two kinds of persons, the ''fictional'' kind (i.e. corporations), and the ''non-fictional'' kind (individual or ''natural'' persons). This paper will assess whether we maximize overall long-term legal coherence by (A) maintaining an object classification for all future AI systems, (B) creating fictional legal persons associated with suitably advanced, individuated AI systems (giving these fictional legal persons derogable rights and duties associated with certified groups of existing persons, potentially including free speech, contract rights, and standing to sue ''on behalf of'' the AI system), or (C) recognizing non-fictional legal personhood through legal identity for suitably advanced, individuated AI systems (recognizing them as entities meriting legal standing with non-derogable rights which for the human case include life, due process, habeas corpus, freedom from slavery, and freedom of conscience). We will clarify the meaning and implications of each option along the way, considering liability, copyright, family law, fundamental rights, civil rights, citizenship, and AI safety regulation. We will tentatively find that the non-fictional personhood approach may be best from a coherence perspective, for at least some advanced AI systems. An object approach may prove untenable for sufficiently humanoid advanced systems, though we suggest that it is adequate for currently existing systems as of 2025. While fictional personhood would resolve some coherence issues for future systems, it would create others and provide solutions that are neither durable nor fit for purpose. Finally, our review will suggest that ''hybrid'' approaches are likely to fail and lead to further incoherence: the choice between object, fictional person and non-fictional person is unavoidable.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Abdallah Salem, Hamdy Ahmed Ashour, Ahmed Elshenawy
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This research addresses the significant challenges of energy consumption and environmental impact in laser cutting by proposing novel deep learning (DL) methodologies to achieve energy reduction. Recognizing the current lack of adaptive control and the open-loop nature of CO2 laser suction pumps, this study utilizes closed-loop configurations that dynamically adjust pump power based on both the material being cut and the smoke level generated. To implement this adaptive system, diverse material classification methods are introduced, including techniques leveraging lens-less speckle sensing with a customized Convolutional Neural Network (CNN) and an approach using a USB camera with transfer learning via the pre-trained VGG16 CNN model. Furthermore, a separate DL model for smoke level detection is employed to simultaneously refine the pump's power output. This integration prompts the exhaust suction pump to automatically halt during inactive times and dynamically adjust power during operation, leading to experimentally proven and remarkable energy savings, with results showing a 20% to 50% reduction in the smoke suction pump's energy consumption, thereby contributing substantially to sustainable development in the manufacturing sector.
๐
2025-11-18
๐ WWW
๐ฅ Robert Kubinec, Aseem Mahajan
Natural Language Processing
ๆฅ็ๆ่ฆ
In this paper, we provide a novel measure for greenwashing -- i.e., climate-related misinformation -- that shows how polluting companies can use social media advertising related to climate change to redirect criticism. To do so, we identify greenwashing content in 11 million social-political ads in Meta's Ad Targeting Datset with a measurement technique that combines large language models, human coders, and advances in Bayesian item response theory. We show that what is called greenwashing has diverse actors and components, but we also identify a very pernicious form, which we call political greenwashing, that appears to be promoted by fossil fuel companies and related interest groups. Based on ad targeting data, we show that much of this advertising happens via organizations with undisclosed links to the fossil fuel industry. Furthermore, we show that greenwashing ad content is being micro-targeted at left-leaning communities with fossil fuel assets, though we also find comparatively little evidence of ad targeting aimed at influencing public opinion at the national level.
๐
2025-11-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Geftรฉ Almeida, Marcio Pohlmann, Alex Severo, Diego Kreutz, Tiago Heinrich et al.
ๆฅ็ๆ่ฆ
In this study, we evaluate open-source models for security incident classification, comparing them with proprietary models. We utilize a dataset of anonymized real incidents, categorized according to the NIST SP 800-61r3 taxonomy and processed using five prompt-engineering techniques (PHP, SHP, HTP, PRP, and ZSL). The results indicate that, although proprietary models still exhibit higher accuracy, locally deployed open-source models provide advantages in privacy, cost-effectiveness, and data sovereignty.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zehao Liu, Wejieying Ren, Jipeng Zhang, Tianxiang Zhao, Jingxi Zhu et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The emergence of vision-language models (VLMs) has opened new possibilities for clinical reasoning and has shown promising performance in dermatological diagnosis. However, their trustworthiness and clinical utility are often limited by three major factors: (1) Data heterogeneity, where diverse datasets lack consistent diagnostic labels and clinical concept annotations; (2) Absence of grounded diagnostic rationales, leading to a scarcity of reliable reasoning supervision; and (3) Limited scalability and generalization, as models trained on small, densely annotated datasets struggle to transfer nuanced reasoning to large, sparsely-annotated ones.
To address these limitations, we propose SkinR1, a novel dermatological VLM that combines deep, textbook-based reasoning with the broad generalization capabilities of reinforcement learning (RL). SkinR1 systematically resolves the key challenges through a unified, end-to-end framework. First, we design a textbook-based reasoning generator that synthesizes high-fidelity, hierarchy-aware, and differential-diagnosis (DDx)-informed trajectories, providing reliable expert-level supervision. Second, we leverage the constructed trajectories for supervised fine-tuning (SFT) empowering the model with grounded reasoning ability. Third, we develop a novel RL paradigm that, by incorporating the hierarchical structure of diseases, effectively transfers these grounded reasoning patterns to large-scale, sparse data. Extensive experiments on multiple dermatology datasets demonstrate that SkinR1 achieves superior diagnostic accuracy. The ablation study demonstrates the importance of the reasoning foundation instilled by SFT.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mathieu Dufour, Andrew Duncan
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models trained on clinical text risk exposing sensitive patient information, yet differential privacy (DP) methods often severely degrade the diagnostic accuracy needed for deployment. Despite rapid progress in DP optimisation and text generation, it remains unclear which privacy-preserving strategy actually works best for clinical language tasks. We present the first systematic head-to-head comparison of four training pipelines for automated diagnostic coding from hospital discharge summaries. All pipelines use identical 1B-parameter models and matched privacy budgets to predict ICD-9 codes. At moderate and relaxed privacy budgets ($\varepsilon \in \{4, 6\}$), knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training, recovering up to 63\% of the non-private performance whilst maintaining strong empirical privacy (membership-inference AUC $\approx$ 0.5). These findings expose large differences in the privacy-utility trade-off across architectures and identify knowledge distillation as the most practical route to privacy-preserving clinical NLP.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zehao Liu, Wejieying Ren, Jipeng Zhang, Tianxiang Zhao, Jingxi Zhu et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The emergence of vision-language models (VLMs) has opened new possibilities for clinical reasoning and has shown promising performance in dermatological diagnosis. However, their trustworthiness and clinical utility are often limited by three major factors: (1) Data heterogeneity, where diverse datasets lack consistent diagnostic labels and clinical concept annotations; (2) Absence of grounded diagnostic rationales, leading to a scarcity of reliable reasoning supervision; and (3) Limited scalability and generalization, as models trained on small, densely annotated datasets struggle to transfer nuanced reasoning to large, sparsely-annotated ones.
To address these limitations, we propose SkinR1, a novel dermatological VLM that combines deep, textbook-based reasoning with the broad generalization capabilities of reinforcement learning (RL). SkinR1 systematically resolves the key challenges through a unified, end-to-end framework. First, we design a textbook-based reasoning generator that synthesizes high-fidelity, hierarchy-aware, and differential-diagnosis (DDx)-informed trajectories, providing reliable expert-level supervision. Second, we leverage the constructed trajectories for supervised fine-tuning (SFT) empowering the model with grounded reasoning ability. Third, we develop a novel RL paradigm that, by incorporating the hierarchical structure of diseases, effectively transfers these grounded reasoning patterns to large-scale, sparse data. Extensive experiments on multiple dermatology datasets demonstrate that SkinR1 achieves superior diagnostic accuracy. The ablation study demonstrates the importance of the reasoning foundation instilled by SFT.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xueying Ding, Xingyue Huang, Mingxuan Ju, Liam Collins, Yozen Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later to earlier tokens, degrading representation quality. While recent methods attempt to solve this by prepending a single summary token, they over-compress information, hence harming performance on long documents. We propose Hierarchical Token Prepending (HTP), a method that resolves two critical bottlenecks. To mitigate attention-level compression, HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating multiple pathways for backward information flow. To address readout-level over-squashing, we replace last-token pooling with mean-pooling, a choice supported by theoretical analysis. HTP achieves consistent performance gains across 11 retrieval datasets and 30 general embedding benchmarks, especially in long-context settings. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, establishing its effectiveness for advancing complex mathematical reasoning in the real world.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jinru Ding, Lu Lu, Chao Ding, Mouxiao Bian, Jiayuan Chen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Xingwei He, Qianru Zhang, Pengfei Chen, Guanhua Chen, Linlin Yu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Instruction-following is a critical capability of Large Language Models (LLMs). While existing works primarily focus on assessing how well LLMs adhere to user instructions, they often overlook scenarios where instructions contain conflicting constraints-a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs' ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs' conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely explicitly notify users about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zijin Su, Huanzhu Lyu, Yuren Niu, Yiming Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, emotion-labeled samples from Sentiment140 using a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Our data balancing strategy ensured an even distribution across 28 emotion categories. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chenchen Kuai, Zihao Li, Braden Rosen, Stephanie Paal, Navid Jafari et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
๐
2025-11-18
๐ NeurIPS
๐ฅ Nicholas Cooper, Lijun Chen, Sailesh Dwivedy, Danna Gurari
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student's backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publically share our code to facilitate future work at https://github.com/Thegolfingocto/KD_wo_CE.
๐
2025-11-18
๐ ArXiv Preprint (q-fin.MF)
๐ฅ Ahmet Umur รzsoy
ๆฅ็ๆ่ฆ
Calibration of option pricing models is routinely repeated as markets evolve, yet modern systems lack an operator for removing data from a calibrated model without full retraining. When quotes become stale, corrupted, or subject to deletion requirements, existing calibration pipelines must rebuild the entire nonlinear least-squares problem, even if only a small subset of data must be excluded. In this work, we introduce a principled framework for selective forgetting (machine unlearning) in parametric option calibration. We provide stability guarantees, perturbation bounds, and show that the proposed operators satisfy local exactness under standard regularity assumptions.
๐
2025-11-18
๐ ArXiv Preprint (eess.AS)
๐ฅ Zanxu Wang, Homayoon Beigi
MultimodalComputer VisionMachine LearningRobotics
ๆฅ็ๆ่ฆ
This paper addresses data quality issues in multimodal emotion recognition in conversation (MERC) through systematic quality control and multi-stage transfer learning. We implement a quality control pipeline for MELD and IEMOCAP datasets that validates speaker identity, audio-text alignment, and face detection. We leverage transfer learning from speaker and face recognition, assuming that identity-discriminative embeddings capture not only stable acoustic and Facial traits but also person-specific patterns of emotional expression. We employ RecoMadeEasy(R) engines for extracting 512-dimensional speaker and face embeddings, fine-tune MPNet-v2 for emotion-aware text representations, and adapt these features through emotion-specific MLPs trained on unimodal datasets. MAMBA-based trimodal fusion achieves 64.8% accuracy on MELD and 74.3% on IEMOCAP. These results show that combining identity-based audio and visual embeddings with emotion-tuned text representations on a quality-controlled subset of data yields consistent competitive performance for multimodal emotion recognition in conversation and provides a basis for further improvement on challenging, low-frequency emotion classes.
๐
2025-11-18
๐ ArXiv Preprint (cs.SE)
๐ฅ Basel Shbita, Farhan Ahmed, Chad DeLuca
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated excellent capabilities in generating structured diagrams from natural language descriptions. In particular, they have shown great promise in generating sequence diagrams for software engineering, typically represented in a text-based syntax such as Mermaid. However, systematic evaluations in this space remain underdeveloped as there is a lack of existing benchmarks to assess the LLM's correctness in this task. To address this shortcoming, we introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing an LLM's capabilities in generating Mermaid sequence diagrams from textual prompts. The benchmark consists of a core set of 132 samples, starting from a small set of manually crafted and verified flows. These were expanded via a hybrid methodology combining human annotation, in-context LLM prompting, and rule-based variation generation. Our benchmark uses an LLM-as-a-judge model to assess Mermaid sequence diagram generation across fine-grained metrics, including syntax correctness, activation handling, error handling, and practical usability. We perform initial evaluations on numerous state-of-the-art LLMs and utilize multiple LLM judge models to demonstrate the effectiveness and flexibility of our benchmark. Our results reveal significant capability gaps across models and evaluation modes. Our proposed benchmark provides a foundation for advancing research in structured diagram generation and for developing more rigorous, fine-grained evaluation methodologies.
๐
2025-11-18
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Haizhou Wen, He Li, Zhen Li
Machine Learning
ๆฅ็ๆ่ฆ
Reconstruction of 3D erythrocyte or red blood cell (RBC) morphology from partial observations, such as microscope images, is essential for understanding the physiology of RBC aging and the pathology of various RBC disorders. In this study, we propose a multi-fidelity neural network (MFNN) approach to fuse high-fidelity cross-sections of an RBC, with a morphologically similar low-fidelity reference 3D RBC shape to recover its full 3D surface. The MFNN predictor combines a convolutional neural network trained on low-fidelity reference RBC data with a feedforward neural network that captures nonlinear morphological correlations, and augments training with surface area and volume constraints for regularization in the low-fidelity branch. This approach is theoretically grounded by a topological homeomorphism between a sphere and 3D RBC surfaces, with training data generated by dissipative particle dynamics simulations of stomatocyte-discocyte-echinocyte transformation. Benchmarking across diverse RBC shapes observed in normal and aged populations, our results show that the MFNN predictor can reconstruct complex RBC morphologies with over 95% coordinate accuracy when provided with at least two orthogonal cross-sections. It is observed that informative oblique cross-sections intersecting spicule tips of echinocytes improve both local and global feature reconstruction, highlighting the value of feature-aware sampling. Our study further evaluates the influence of sampling strategies, shape dissimilarity, and noise, showing enhanced robustness under physically constrained training. Altogether, these results demonstrate the capability of MFNN to reconstruct the 3D shape of normal and aged RBCs from partial cross-sections as observed in conventional microscope images, which could facilitate the quantitative analysis of RBC morphological parameters in normal and disease-related RBC samples.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Artur A. Oliveira, Mateus Espadoto, Roberto M. Cesar, Roberto Hirata
ๆฅ็ๆ่ฆ
We introduce Graph Memory (GM), a structured non-parametric framework that augments embedding-based inference with a compact, relational memory over region-level prototypes. Rather than treating each training instance in isolation, GM summarizes the embedding space into prototype nodes annotated with reliability indicators and connected by edges that encode geometric and contextual relations. This design unifies instance retrieval, prototype-based reasoning, and graph-based label propagation within a single inductive model that supports both efficient inference and faithful explanation. Experiments on synthetic and real datasets including breast histopathology (IDC) show that GM achieves accuracy competitive with $k$NN and Label Spreading while offering substantially better calibration and smoother decision boundaries, all with an order of magnitude fewer samples. By explicitly modeling reliability and relational structure, GM provides a principled bridge between local evidence and global consistency in non-parametric learning.
๐
2025-11-18
๐ ArXiv Preprint (cs.PL)
๐ฅ Joey Velez-Ginorio, Nada Amin, Konrad Kording, Steve Zdancewic
Machine Learning
ๆฅ็ๆ่ฆ
Discrete structures are currently second-class in differentiable programming. Since functions over discrete structures lack overt derivatives, differentiable programs do not differentiate through them and limit where they can be used. For example, when programming a neural network, conditionals and iteration cannot be used everywhere; they can break the derivatives necessary for gradient-based learning to work. This limits the class of differentiable algorithms we can directly express, imposing restraints on how we build neural networks and differentiable programs more generally. However, these restraints are not fundamental. Recent work shows conditionals can be first-class, by compiling them into differentiable form as linear neurons. Similarly, this work shows iteration can be first-class -- by compiling to linear recurrent neurons. We present a minimal typed, higher-order and linear programming language with iteration called $\textsf{Cajal}\scriptstyle(\mathbb{\multimap}, \mathbb{2}, \mathbb{N})$. We prove its programs compile correctly to recurrent neurons, allowing discrete algorithms to be expressed in a differentiable form compatible with gradient-based learning. With our implementation, we conduct two experiments where we link these recurrent neurons against a neural network solving an iterative image transformation task. This determines part of its function prior to learning. As a result, the network learns faster and with greater data-efficiency relative to a neural network programmed without first-class iteration. A key lesson is that recurrent neurons enable a rich interplay between learning and the discrete structures of ordinary programming.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Abdallah Salem, Hamdy Ahmed Ashour, Ahmed Elshenawy
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This research addresses the significant challenges of energy consumption and environmental impact in laser cutting by proposing novel deep learning (DL) methodologies to achieve energy reduction. Recognizing the current lack of adaptive control and the open-loop nature of CO2 laser suction pumps, this study utilizes closed-loop configurations that dynamically adjust pump power based on both the material being cut and the smoke level generated. To implement this adaptive system, diverse material classification methods are introduced, including techniques leveraging lens-less speckle sensing with a customized Convolutional Neural Network (CNN) and an approach using a USB camera with transfer learning via the pre-trained VGG16 CNN model. Furthermore, a separate DL model for smoke level detection is employed to simultaneously refine the pump's power output. This integration prompts the exhaust suction pump to automatically halt during inactive times and dynamically adjust power during operation, leading to experimentally proven and remarkable energy savings, with results showing a 20% to 50% reduction in the smoke suction pump's energy consumption, thereby contributing substantially to sustainable development in the manufacturing sector.
๐
2025-11-18
๐ ArXiv Preprint (cs.SD)
๐ฅ Daniel Oliveira de Brito, Letรญcia Gabriella de Souza, Marcelo Matheus Gauy, Marcelo Finger, Arnaldo Candido Junior
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This technical report investigates the performance of pre-trained audio models on COVID-19 detection tasks using established benchmark datasets. We fine-tuned Audio-MAE and three PANN architectures (CNN6, CNN10, CNN14) on the Coswara and COUGHVID datasets, evaluating both intra-dataset and cross-dataset generalization. We implemented a strict demographic stratification by age and gender to prevent models from exploiting spurious correlations between demographic characteristics and COVID-19 status. Intra-dataset results showed moderate performance, with Audio-MAE achieving the strongest result on Coswara (0.82 AUC, 0.76 F1-score), while all models demonstrated limited performance on Coughvid (AUC 0.58-0.63). Cross-dataset evaluation revealed severe generalization failure across all models (AUC 0.43-0.68), with Audio-MAE showing strong performance degradation (F1-score 0.00-0.08). Our experiments demonstrate that demographic balancing, while reducing apparent model performance, provides more realistic assessment of COVID-19 detection capabilities by eliminating demographic leakage - a confounding factor that inflate performance metrics. Additionally, the limited dataset sizes after balancing (1,219-2,160 samples) proved insufficient for deep learning models that typically require substantially larger training sets. These findings highlight fundamental challenges in developing generalizable audio-based COVID-19 detection systems and underscore the importance of rigorous demographic controls for clinically robust model evaluation.
๐
2025-11-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Bi Xue, Hong Wu, Lei Chen, Chao Yang, Yiming Ma et al.
Machine Learning
ๆฅ็ๆ่ฆ
Serving deep learning based recommendation models (DLRM) at scale is challenging. Existing systems rely on CPU-based ANN indexing and filtering services, suffering from non-negligible costs and forgoing joint optimization opportunities. Such inefficiency makes them difficult to support more complex model architectures, such as learned similarities and multi-task retrieval.
In this paper, we propose SilverTorch, a model-based system for serving recommendation models on GPUs. SilverTorch unifies model serving by replacing standalone indexing and filtering services with layers of served models. We propose a Bloom index algorithm on GPUs for feature filtering and a tensor-native fused Int8 ANN kernel on GPUs for nearest neighbor search. We further co-design the ANN search index and filtering index to reduce GPU memory utilization and eliminate unnecessary computation. Benefit from SilverTorch's serving paradigm, we introduce a OverArch scoring layer and a Value Model to aggregate results across multi-tasks. These advancements improve the accuracy for retrieval and enable future studies for serving more complex models. For ranking, SilverTorch's design accelerates item embedding calculation by caching the pre-calculated embeddings inside the serving model.
Our evaluation on the industry-scale datasets show that SilverTorch achieves up to 5.6x lower latency and 23.7x higher throughput compared to the state-of-the-art approaches. We also demonstrate that SilverTorch's solution is 13.35x more cost-efficient than CPU-based solution while improving accuracy via serving more complex models. SilverTorch serves over hundreds of models online across major products and recommends contents for billions of daily active users.
๐
2025-11-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Dun Zhang, Ziyang Zeng, Yudong Zhou, Shuyang Lu
Machine Learning
ๆฅ็ๆ่ฆ
This technical report presents the training methodology and evaluation results of the open-source Jasper-Token-Compression-600M model, released in November 2025. Building on previous distillation-based recipes from the English Stella and Jasper models, we successfully extend this approach to a bilingual (English and Chinese) domain, further enhancing model performance through the incorporation of contrastive learning. A key innovation of our model is the introduction of a one-dimensional convolution-based token compression module. We dynamically adjust the compression rate during training, enabling the model to learn more robust and efficient compressed text representations. By combining knowledge distillation with token compression techniques, we achieve significant improvements in both embedding quality and inference efficiency. Our model performs with higher efficiency than a traditional 0.6B model while achieving performance comparable to that of an 8B model. For more information on the model release, visit: https://huggingface.co/infgrad/Jasper-Token-Compression-600M.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Xiangyu Li, Zhaomiao Guo
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
As more autonomous vehicles operate on public roads, understanding real-world behavior of autonomous vehicles is critical to analyzing traffic safety, making policies, and public acceptance. This paper proposes SVBRD-LLM, a framework that automatically discovers, verifies, and applies interpretable behavioral rules from real traffic videos through zero-shot prompt engineering. The framework extracts vehicle trajectories using YOLOv8 and ByteTrack, computes kinematic features, and employs GPT-5 zero-shot prompting to compare autonomous and human-driven vehicles, generating 35 structured behavioral rule hypotheses. These rules are tested on a validation set, iteratively refined based on failure cases to filter spurious correlations, and compiled into a high-confidence rule library. The framework is evaluated on an independent test set for speed change prediction, lane change prediction, and autonomous vehicle identification tasks. Experiments on over 1500 hours of real traffic videos show that the framework achieves 90.0% accuracy and 93.3% F1-score in autonomous vehicle identification. The discovered rules clearly reveal distinctive characteristics of autonomous vehicles in speed control smoothness, lane change conservativeness, and acceleration stability, with each rule accompanied by semantic description, applicable context, and validation confidence.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gbenga Omotara, Ramy Farag, Seyed Mohamad Ali Tousi, G. N. DeSouza
Computer Vision
ๆฅ็ๆ่ฆ
Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi-task learning frameworks to improve robustness, yet negative cross-task interactions often hinder performance. In this work, we introduce Edge-Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions by incorporating boundary information into the fusion between semantic and geometric features. On both Syn-TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state of the art method (MODEST), while preserving competitive segmentation performance, with the largest improvements appearing in transparent regions. Besides our fusion design, our second contribution is a multi-modal progressive training strategy, where learning transitions from edges derived from RGB images to edges derived from predicted depth images. This approach allows the system to bootstrap learning from the rich textures contained in RGB images, and then switch to more relevant geometric content in depth maps, while it eliminates the need for ground-truth depth at training time. Together, these contributions highlight edge-guided fusion as a robust approach capable of improving transparent object perception.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Abdallah Salem, Hamdy Ahmed Ashour, Ahmed Elshenawy
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This research addresses the significant challenges of energy consumption and environmental impact in laser cutting by proposing novel deep learning (DL) methodologies to achieve energy reduction. Recognizing the current lack of adaptive control and the open-loop nature of CO2 laser suction pumps, this study utilizes closed-loop configurations that dynamically adjust pump power based on both the material being cut and the smoke level generated. To implement this adaptive system, diverse material classification methods are introduced, including techniques leveraging lens-less speckle sensing with a customized Convolutional Neural Network (CNN) and an approach using a USB camera with transfer learning via the pre-trained VGG16 CNN model. Furthermore, a separate DL model for smoke level detection is employed to simultaneously refine the pump's power output. This integration prompts the exhaust suction pump to automatically halt during inactive times and dynamically adjust power during operation, leading to experimentally proven and remarkable energy savings, with results showing a 20% to 50% reduction in the smoke suction pump's energy consumption, thereby contributing substantially to sustainable development in the manufacturing sector.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Sebastian Dingler, Hannes Burrichter
ๆฅ็ๆ่ฆ
Odometry with lidar sensors is a state-of-the-art method to estimate the ego pose of a moving vehicle. Many implementations of lidar odometry use variants of the Iterative Closest Point (ICP) algorithm. Real-world effects such as dynamic objects, non-overlapping areas, and sensor noise diminish the accuracy of ICP. We build on a recently proposed method that makes these effects visible by visualizing the multidimensional objective function of ICP in two dimensions. We use this method to study different ICP variants in the context of lidar odometry. In addition, we propose a novel method to filter out dynamic objects and to address the ego blind spot problem.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Yassine Ibork, Myounggyu Won, Lokesh Das
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Ramp merging is a critical and challenging task for autonomous vehicles (AVs), particularly in mixed traffic environments with human-driven vehicles (HVs). Existing approaches typically rely on either lane-changing or inter-vehicle gap creation strategies based solely on local or neighboring information, often leading to suboptimal performance in terms of safety and traffic efficiency. In this paper, we present a V2X (vehicle-to-everything communication)-assisted Multiagent Reinforcement Learning (MARL) framework for on-ramp merging that effectively coordinates the complex interplay between lane-changing and inter-vehicle gap adaptation strategies by utilizing zone-specific global information available from a roadside unit (RSU). The merging control problem is formulated as a Multiagent Partially Observable Markov Decision Process (MA-POMDP), where agents leverage both local and global observations through V2X communication. To support both discrete and continuous control decisions, we design a hybrid action space and adopt a parameterized deep Q-learning approach. Extensive simulations, integrating the SUMO traffic simulator and the MOSAIC V2X simulator, demonstrate that our framework significantly improves merging success rate, traffic efficiency, and road safety across diverse traffic scenarios.
๐
2025-11-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Henry Wong, Clement Fung, Weiran Lin, Karen Li, Stanley Chen et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
To autonomously control vehicles, driving agents use outputs from a combination of machine-learning (ML) models, controller logic, and custom modules. Although numerous prior works have shown that adversarial examples can mislead ML models used in autonomous driving contexts, it remains unclear if these attacks are effective at producing harmful driving actions for various agents, environments, and scenarios.
To assess the risk of adversarial examples to autonomous driving, we evaluate attacks against a variety of driving agents, rather than against ML models in isolation. To support this evaluation, we leverage CARLA, an urban driving simulator, to create and evaluate adversarial examples. We create adversarial patches designed to stop or steer driving agents, stream them into the CARLA simulator at runtime, and evaluate them against agents from the CARLA Leaderboard, a public repository of best-performing autonomous driving agents from an annual research competition. Unlike prior work, we evaluate attacks against autonomous driving systems without creating or modifying any driving-agent code and against all parts of the agent included with the ML model.
We perform a case-study investigation of two attack strategies against three open-source driving agents from the CARLA Leaderboard across multiple driving scenarios, lighting conditions, and locations. Interestingly, we show that, although some attacks can successfully mislead ML models into predicting erroneous stopping or steering commands, some driving agents use modules, such as PID control or GPS-based rules, that can overrule attacker-manipulated predictions from ML models.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black et al.
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $ฯ^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $ฯ^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Xianhui Meng, Yuchen Zhang, Zhijian Huang, Zheng Lu, Ziling Ji et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) show great promise for autonomous driving, but their suitability for safety-critical scenarios is largely unexplored, raising safety concerns. This issue arises from the lack of comprehensive benchmarks that assess both external environmental risks and in-cabin driving behavior safety simultaneously. To bridge this critical gap, we introduce DSBench, the first comprehensive Driving Safety Benchmark designed to assess a VLM's awareness of various safety risks in a unified manner. DSBench encompasses two major categories: external environmental risks and in-cabin driving behavior safety, divided into 10 key categories and a total of 28 sub-categories. This comprehensive evaluation covers a wide range of scenarios, ensuring a thorough assessment of VLMs' performance in safety-critical contexts. Extensive evaluations across various mainstream open-source and closed-source VLMs reveal significant performance degradation under complex safety-critical situations, highlighting urgent safety concerns. To address this, we constructed a large dataset of 98K instances focused on in-cabin and external safety scenarios, showing that fine-tuning on this dataset significantly enhances the safety performance of existing VLMs and paves the way for advancing autonomous driving technology. The benchmark toolkit, code, and model checkpoints will be publicly accessible.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaoquan Sun, Ruijian Zhang, Kang Pang, Bingchen Miao, Yuxiang Tan et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Household tidying is an important application area, yet current benchmarks neither model user preferences nor support mobility, and they generalize poorly, making it hard to comprehensively assess integrated language-to-action capabilities. To address this, we propose RoboTidy, a unified benchmark for language-guided household tidying that supports Vision-Language-Action (VLA) and Vision-Language-Navigation (VLN) training and evaluation. RoboTidy provides 500 photorealistic 3D Gaussian Splatting (3DGS) household scenes (covering 500 objects and containers) with collisions, formulates tidying as an "Action (Object, Container)" list, and supplies 6.4k high-quality manipulation demonstration trajectories and 1.5k naviagtion trajectories to support both few-shot and large-scale training. We also deploy RoboTidy in the real world for object tidying, establishing an end-to-end benchmark for household tidying. RoboTidy offers a scalable platform and bridges a key gap in embodied AI by enabling holistic and realistic evaluation of language-guided robots.
๐
2025-11-18
๐ ICRA
๐ฅ Taijing Chen, Sateesh Kumar, Junhong Xu, George Pavlakos, J oydeep Biswas et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Service robots must retrieve objects in dynamic, open-world settings where requests may reference attributes ("the red mug"), spatial context ("the mug on the table"), or past states ("the mug that was here yesterday"). Existing approaches capture only parts of this problem: scene graphs capture spatial relations but ignore temporal grounding, temporal reasoning methods model dynamics but do not support embodied interaction, and dynamic scene graphs handle both but remain closed-world with fixed vocabularies. We present STAR (SpatioTemporal Active Retrieval), a framework that unifies memory queries and embodied actions within a single decision loop. STAR leverages non-parametric long-term memory and a working memory to support efficient recall, and uses a vision-language model to select either temporal or spatial actions at each step. We introduce STARBench, a benchmark of spatiotemporal object search tasks across simulated and real environments. Experiments in STARBench and on a Tiago robot show that STAR consistently outperforms scene-graph and memory-only baselines, demonstrating the benefits of treating search in time and search in space as a unified problem.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Keya Hu, Ali Cy, Linlu Qiu, Xiaoman Delores Ding, Runqian Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Parya Dolatyabi, Mahdi Khodayar
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Restoring power distribution systems (PDS) after large-scale outages requires sequential switching operations that reconfigure feeder topology and coordinate distributed energy resources (DERs) under nonlinear constraints such as power balance, voltage limits, and thermal ratings. These challenges make conventional optimization and value-based RL approaches computationally inefficient and difficult to scale. This paper applies a Heterogeneous-Agent Reinforcement Learning (HARL) framework, instantiated through Heterogeneous-Agent Proximal Policy Optimization (HAPPO), to enable coordinated restoration across interconnected microgrids. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts, introducing practical structural heterogeneity. Decentralized actor policies are trained with a centralized critic to compute advantage values for stable on-policy updates. A physics-informed OpenDSS environment provides full power flow feedback and enforces operational limits via differentiable penalty signals rather than invalid action masking. The total DER generation is capped at 2400 kW, and each microgrid must satisfy local supply-demand feasibility. Experiments on the IEEE 123-bus and IEEE 8500-node systems show that HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than DQN, PPO, MAES, MAGDPG, MADQN, Mean-Field RL, and QMIX. Results demonstrate that incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex PDS restoration.
๐
2025-11-18
๐ ArXiv Preprint (cs.CG)
๐ฅ Zoltรกn Kovรกcs, Xicheng Peng
ๆฅ็ๆ่ฆ
We improve the complex number identity proving method to a fully automated procedure, based on elimination ideals. By using declarative equations or rewriting each real-relational hypothesis $h_i$ to $h_i-r_i$, and the thesis $t$ to $t-r$, clearing the denominators and introducing an extra expression with a slack variable, we eliminate all free and relational point variables. From the obtained ideal $I$ in $\mathbb{Q}[r,r_1,r_2,\ldots]$ we can find a conclusive result. It plays an important role that if $r_1,r_2,\ldots$ are real, $r$ must also be real if there is a linear polynomial $p(r)\in I$, unless division by zero occurs when expressing $r$. Our results are presented in Mathematica, Maple and in a new version of the Giac computer algebra system. Finally, we present a prototype of the automated procedure in an experimental version of the dynamic geometry software GeoGebra.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Wang, Liya Ji, Zhanghan Ke, Harry Yang, Ser-Nam Lim et al.
ๆฅ็ๆ่ฆ
We propose an approach to enhancing synthetic video realism, which can re-render synthetic videos from a simulator in photorealistic fashion. Our realism enhancement approach is a zero-shot framework that focuses on preserving the multi-level structures from synthetic videos into the enhanced one in both spatial and temporal domains, built upon a diffusion video foundational model without further fine-tuning. Specifically, we incorporate an effective modification to have the generation/denoising process conditioned on estimated structure-aware information from the synthetic video, such as depth maps, semantic maps, and edge maps, by an auxiliary model, rather than extracting the information from a simulator. This guidance ensures that the enhanced videos are consistent with the original synthetic video at both the structural and semantic levels. Our approach is a simple yet general and powerful approach to enhancing synthetic video realism: we show that our approach outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism quality in our experiments.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Abolfazl Younesi, Leon Kiss, Zahra Najafabadi Samani, Juan Aznar Poveda, Thomas Fahringer
ๆฅ็ๆ่ฆ
Federated learning (FL) enables collaborative model training while preserving data privacy. However, it remains vulnerable to malicious clients who compromise model integrity through Byzantine attacks, data poisoning, or adaptive adversarial behaviors. Existing defense mechanisms rely on static thresholds and binary classification, failing to adapt to evolving client behaviors in real-world deployments. We propose FLARE, an adaptive reputation-based framework that transforms client reliability assessment from binary decisions to a continuous, multi-dimensional trust evaluation. FLARE integrates: (i) a multi-dimensional reputation score capturing performance consistency, statistical anomaly indicators, and temporal behavior, (ii) a self-calibrating adaptive threshold mechanism that adjusts security strictness based on model convergence and recent attack intensity, (iii) reputation-weighted aggregation with soft exclusion to proportionally limit suspicious contributions rather than eliminating clients outright, and (iv) a Local Differential Privacy (LDP) mechanism enabling reputation scoring on privatized client updates. We further introduce a highly evasive Statistical Mimicry (SM) attack, a benchmark adversary that blends honest gradients with synthetic perturbations and persistent drift to remain undetected by traditional filters. Extensive experiments with 100 clients on MNIST, CIFAR-10, and SVHN demonstrate that FLARE maintains high model accuracy and converges faster than state-of-the-art Byzantine-robust methods under diverse attack types, including label flipping, gradient scaling, adaptive attacks, ALIE, and SM. FLARE improves robustness by up to 16% and preserves model convergence within 30% of the non-attacked baseline, while achieving strong malicious-client detection performance with minimal computational overhead. https://github.com/Anonymous0-0paper/FLARE
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Farheen Ramzan, Yusuf Kiberu, Nikesh Jathanna, Meryem Jabrane, Vicente Grau et al.
Multimodal
ๆฅ็ๆ่ฆ
Accurate segmentation of myocardial scar from late gadolinium enhanced (LGE) cardiac MRI is essential for evaluating tissue viability, yet remains challenging due to variable contrast and imaging artifacts. Electrocardiogram (ECG) signals provide complementary physiological information, as conduction abnormalities can help localize or suggest scarred myocardial regions. In this work, we propose a novel multimodal framework that integrates ECG-derived electrophysiological information with anatomical priors from the AHA-17 atlas for physiologically consistent LGE-based scar segmentation. As ECGs and LGE-MRIs are not acquired simultaneously, we introduce a Temporal Aware Feature Fusion (TAFF) mechanism that dynamically weights and fuses features based on their acquisition time difference. Our method was evaluated on a clinical dataset and achieved substantial gains over the state-of-the-art image-only baseline (nnU-Net), increasing the average Dice score for scars from 0.6149 to 0.8463 and achieving high performance in both precision (0.9115) and sensitivity (0.9043). These results show that integrating physiological and anatomical knowledge allows the model to "see beyond the image", setting a new direction for robust and physiologically grounded cardiac scar segmentation.
๐
2025-11-18
๐ ArXiv Preprint (q-bio.GN)
๐ฅ Rui Zhu, Xiaopu Zhou, Haixu Tang, Stephen W. Scherer, Lucila Ohno-Machado
Natural Language Processing
ๆฅ็ๆ่ฆ
Trained on massive cross-species DNA corpora, DNA large language models (LLMs) learn the fundamental "grammar" and evolutionary patterns of genomic sequences. This makes them powerful priors for DNA sequence modeling, particularly over long ranges. However, two major constraints hinder their use in practice: the quadratic computational cost of self-attention and the growing memory required for key-value (KV) caches during autoregressive decoding. These constraints force the use of heuristics such as fixed-window truncation or sliding windows, which compromise fidelity on ultra-long sequences by discarding distant information. We introduce FOCUS (Feature-Oriented Compression for Ultra-long Self-attention), a progressive context-compression module that can be plugged into pretrained DNA LLMs. FOCUS combines the established k-mer representation in genomics with learnable hierarchical compression: it inserts summary tokens at k-mer granularity and progressively compresses attention key and value activations across multiple Transformer layers, retaining only the summary KV states across windows while discarding ordinary-token KV. A shared-boundary windowing scheme yields a stationary cross-window interface that propagates long-range information with minimal loss. We validate FOCUS on an Evo-2-based DNA LLM fine-tuned on GRCh38 chromosome 1 with self-supervised training and randomized compression schedules to promote robustness across compression ratios. On held-out human chromosomes, FOCUS achieves near-lossless fidelity: compressing a 1 kb context into only 10 summary tokens (about 100x) shifts the average per-nucleotide probability by only about 0.0004. Compared to a baseline without compression, FOCUS reduces KV-cache memory and converts effective inference scaling from O(N^2) to near-linear O(N), enabling about 100x longer inference windows on commodity GPUs with near-lossless fidelity.
๐
2025-11-18
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Kallol Mondal, Ankush Kumar
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Attention is the brain's ability to selectively focus on a few specific aspects while ignoring irrelevant ones. This biological principle inspired the attention mechanism in modern Transformers. Transformers now underpin large language models (LLMs) such as GPT, but at the cost of massive training and inference energy, leading to a large carbon footprint. While brain attention emerges from neural circuits, Transformer attention relies on dot-product similarity to weight elements in the input sequence. Neuromorphic computing, especially spiking neural networks (SNNs), offers a brain-inspired path to energy-efficient intelligence. Despite recent work on attention-based spiking Transformers, the core attention layer remains non-neuromorphic. Current spiking attention (i) relies on dot-product or element-wise similarity suited to floating-point operations, not event-driven spikes; (ii) keeps attention matrices that suffer from the von Neumann bottleneck, limiting in-memory computing; and (iii) still diverges from brain-like computation. To address these issues, we propose the Spiking STDP Transformer (S$^{2}$TDPT), a neuromorphic Transformer that implements self-attention through spike-timing-dependent plasticity (STDP), embedding query--key correlations in synaptic weights. STDP, a core mechanism of memory and learning in the brain and widely studied in neuromorphic devices, naturally enables in-memory computing and supports non-von Neumann hardware. On CIFAR-10 and CIFAR-100, our model achieves 94.35\% and 78.08\% accuracy with only four timesteps and 0.49 mJ on CIFAR-100, an 88.47\% energy reduction compared to a standard ANN Transformer. Grad-CAM shows that the model attends to semantically relevant regions, enhancing interpretability. Overall, S$^{2}$TDPT illustrates how biologically inspired attention can yield energy-efficient, hardware-friendly, and explainable neuromorphic models.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shiyar Jamo
ๆฅ็ๆ่ฆ
Automatic age estimation is widely used for age verification, where input images often vary considerably in resolution. This study evaluates the effect of image resolution on age estimation accuracy using DeepFace and InsightFace. A total of 1000 images from the IMDB-Clean dataset were processed in seven resolutions, resulting in 7000 test samples. Performance was evaluated using Mean Absolute Error (MAE), Standard Deviation (SD), and Median Absolute Error (MedAE). Based on this study, we conclude that input image resolution has a clear and consistent impact on the accuracy of age estimation in both DeepFace and InsightFace. Both frameworks achieve optimal performance at 224x224 pixels, with an MAE of 10.83 years (DeepFace) and 7.46 years (InsightFace). At low resolutions, MAE increases substantially, while very high resolutions also degrade accuracy. InsightFace is consistently faster than DeepFace across all resolutions.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Clovis Gladstone, Zhao Fang, Spencer Dean Stewart
Natural Language Processing
ๆฅ็ๆ่ฆ
Historical and low-resource NLP remains challenging due to limited annotated data and domain mismatches with modern, web-sourced corpora. This paper outlines our work in using large language models (LLMs) to create ground-truth annotations for historical French (16th-20th centuries) and Chinese (1900-1950) texts. By leveraging LLM-generated ground truth on a subset of our corpus, we were able to fine-tune spaCy to achieve significant gains on period-specific tests for part-of-speech (POS) annotations, lemmatization, and named entity recognition (NER). Our results underscore the importance of domain-specific models and demonstrate that even relatively limited amounts of synthetic data can improve NLP tools for under-resourced corpora in computational humanities research.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ruomeng Ding, Wei Cheng, Minglai Shao, Chen Zhao
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly applied to sequential decision-making through in-context learning (ICL), yet their effectiveness is highly sensitive to prompt quality. Effective prompts should meet three principles: focus on decision-critical information, provide step-level granularity, and minimize reliance on expert annotations through label efficiency. However, existing ICL methods often fail to satisfy all three criteria simultaneously. Motivated by these challenges, we introduce SkillGen, a skill-based ICL framework for structured sequential reasoning. It constructs an action-centric, domain-level graph from sampled trajectories, identifies high-utility actions via temporal-difference credit assignment, and retrieves step-wise skills to generate fine-grained, context-aware prompts. We further present a theoretical analysis showing that focusing on high-utility segments supports task identifiability and informs more effective ICL prompt design. Experiments on ALFWorld, BabyAI, and ScienceWorld, using both open-source and proprietary LLMs, show that SkillGen achieves consistent gains, improving progress rate by 5.9%-16.5% on average across models.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Vision--language--action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding to it a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating significant VLA model-reliability gains through simple yet effective reward models. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marius Dubosc, Yann Fischer, Zacharie Auray, Nicolas Boutry, Edwin Carlinet et al.
Machine Learning
ๆฅ็ๆ่ฆ
Doppler holography is an emerging retinal imaging technique that captures the dynamic behavior of blood flow with high temporal resolution, enabling quantitative assessment of retinal hemodynamics. This requires accurate segmentation of retinal arteries and veins, but traditional segmentation methods focus solely on spatial information and overlook the temporal richness of holographic data. In this work, we propose a simple yet effective approach for artery-vein segmentation in temporal Doppler holograms using standard segmentation architectures. By incorporating features derived from a dedicated pulse analysis pipeline, our method allows conventional U-Nets to exploit temporal dynamics and achieve performance comparable to more complex attention- or iteration-based models. These findings demonstrate that time-resolved preprocessing can unlock the full potential of deep learning for Doppler holography, opening new perspectives for quantitative exploration of retinal hemodynamics. The dataset is publicly available at https://huggingface.co/datasets/DigitalHolography/
๐
2025-11-18
๐ AAAI 2026
๐ฅ Jingyi Jia, Qinbin Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model (LLM) agents have emerged as powerful tools for automating complex tasks by leveraging the reasoning and decision-making abilities of LLMs. However, a major bottleneck in current agent frameworks lies in the high inference cost of tool selection, especially in approaches like ReAct that repeatedly invoke the LLM to determine which tool to use at each step. In this work, we propose AutoTool, a novel graph-based framework that bypasses repeated LLM inference by exploiting a key empirical observation: tool usage inertia - the tendency of tool invocations to follow predictable sequential patterns. AutoTool constructs a directed graph from historical agent trajectories, where nodes represent tools and edges capture transition probabilities, effectively modeling the inertia in tool selection. It further integrates parameter-level information to refine tool input generation. By traversing this structured representation, AutoTool efficiently selects tools and their parameters with minimal reliance on LLM inference. Extensive experiments across diverse agent tasks demonstrate that AutoTool reduces inference costs by up to 30% while maintaining competitive task completion rates, offering a practical and scalable enhancement for inference-heavy frameworks. Our work highlights the promise of integrating statistical structure into LLM agent design for greater efficiency without sacrificing performance.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuchen Luo, Xinyu Li, Liuhua Peng, Mingming Gong
Natural Language Processing
ๆฅ็ๆ่ฆ
In multivariate time series forecasting (MTSF), accurately modeling the intricate dependencies among multiple variables remains a significant challenge due to the inherent limitations of traditional approaches. Most existing models adopt either \textbf{channel-independent} (CI) or \textbf{channel-dependent} (CD) strategies, each presenting distinct drawbacks. CI methods fail to leverage the potential insights from inter-channel interactions, resulting in models that may not fully exploit the underlying statistical dependencies present in the data. Conversely, CD approaches often incorporate too much extraneous information, risking model overfitting and predictive inefficiency. To address these issues, we introduce the Adaptive Forecasting Transformer (\textbf{Adapformer}), an advanced Transformer-based framework that merges the benefits of CI and CD methodologies through effective channel management. The core of Adapformer lies in its dual-stage encoder-decoder architecture, which includes the \textbf{A}daptive \textbf{C}hannel \textbf{E}nhancer (\textbf{ACE}) for enriching embedding processes and the \textbf{A}daptive \textbf{C}hannel \textbf{F}orecaster (\textbf{ACF}) for refining the predictions. ACE enhances token representations by selectively incorporating essential dependencies, while ACF streamlines the decoding process by focusing on the most relevant covariates, substantially reducing noise and redundancy. Our rigorous testing on diverse datasets shows that Adapformer achieves superior performance over existing models, enhancing both predictive accuracy and computational efficiency, thus making it state-of-the-art in MTSF.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kahaan Gandhi, Boris Bolliet, Inigo Zubeldia
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real-time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass at 1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ivy Yuqian Yang, David Yu Zhang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Scientific idea generation and selection requires exploration following a target probability distribution. In contrast, current AI benchmarks have objectively correct answers, and training large language models (LLMs) via reinforcement learning against these benchmarks discourages probabilistic exploration. Here, we conducted systematic experiments requesting LLMs to produce outputs following simple probabilistic distributions, and found that all modern LLMs tested grossly fail to follow the distributions. For example, requesting a binary output of "1" 49% of the time produces an answer of "0" nearly 100% of the time. This step function-like behavior of near-exclusively generating the output with marginally highest probability even overrules even strong in-built LLM biases.
๐
2025-11-18
๐ ArXiv Preprint (cond-mat.soft)
๐ฅ Giulia Janzen, Gaia Maselli, Juan F. Jimenez, Lia Garcia-Perez, D A Matoz Fernandez et al.
ๆฅ็ๆ่ฆ
Robophysics investigates the physical principles that govern living-like robots operating in complex, realworld environments. Despite remarkable technological advances, robots continue to face fundamental efficiency limitations. At the level of individual units, locomotion remains a challenge, while at the collective level, robot swarms struggle to achieve shared purpose, coordination, communication, and cost efficiency. This perspective article examines the key challenges faced by bio-inspired robotic collectives and highlights recent research efforts that incorporate principles from active-matter physics and biology into the modeling and design of robot swarms.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Marco Locatelli, Arjen Hommersom, Roberto Clemens Cerioli, Daniela Besozzi, Fabio Stella
ๆฅ็ๆ่ฆ
Learning the parameters of Partially Observable Markov Decision Processes (POMDPs) from limited data is a significant challenge. We introduce the Fuzzy MAP EM algorithm, a novel approach that incorporates expert knowledge into the parameter estimation process by enriching the Expectation Maximization (EM) framework with fuzzy pseudo-counts derived from an expert-defined fuzzy model. This integration naturally reformulates the problem as a Maximum A Posteriori (MAP) estimation, effectively guiding learning in environments with limited data. In synthetic medical simulations, our method consistently outperforms the standard EM algorithm under both low-data and high-noise conditions. Furthermore, a case study on Myasthenia Gravis illustrates the ability of the Fuzzy MAP EM algorithm to recover a clinically coherent POMDP, demonstrating its potential as a practical tool for data-efficient modeling in healthcare.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yilu Fang, Jordan G. Nestor, Casey N. Ta, Jerard Z. Kneifati-Hayek, Chunhua Weng
ๆฅ็ๆ่ฆ
Patients with acute kidney injury (AKI) are at high risk of developing chronic kidney disease (CKD), but identifying those at greatest risk remains challenging. We used electronic health record (EHR) data to dynamically track AKI patients' clinical evolution and characterize AKI-to-CKD progression. Post-AKI clinical states were identified by clustering patient vectors derived from longitudinal medical codes and creatinine measurements. Transition probabilities between states and progression to CKD were estimated using multi-state modeling. After identifying common post-AKI trajectories, CKD risk factors in AKI subpopulations were identified through survival analysis. Of 20,699 patients with AKI at admission, 3,491 (17%) developed CKD. We identified fifteen distinct post-AKI states, each with different probabilities of CKD development. Most patients (75%, n=15,607) remained in a single state or made only one transition during the study period. Both established (e.g., AKI severity, diabetes, hypertension, heart failure, liver disease) and novel CKD risk factors, with their impact varying across these clinical states. This study demonstrates a data-driven approach for identifying high-risk AKI patients, supporting the development of decision-support tools for early CKD detection and intervention.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nathaniel Putera, Daniel Vilet Rodrรญguez, Noah Videcrantz, Julia Machnio, Mostafa Mehdipour Ghazi
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Accurate modeling of cognitive decline in Alzheimer's disease is essential for early stratification and personalized management. While tabular predictors provide robust markers of global risk, their ability to capture subtle brain changes remains limited. In this study, we evaluate the predictive contributions of tabular and imaging-based representations, with a focus on transformer-derived Magnetic Resonance Imaging (MRI) embeddings. We introduce a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture heterogeneous patterns of cognitive change, and train a 3D Vision Transformer (ViT) via unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings without progression labels. The pretrained encoder embeddings are subsequently assessed using both traditional machine learning classifiers and deep learning heads, and compared against tabular representations and convolutional network baselines. Results highlight complementary strengths across modalities. Clinical and volumetric features achieved the highest AUCs of around 0.70 for predicting mild and severe progression, underscoring their utility in capturing global decline trajectories. In contrast, MRI embeddings from the ViT model were most effective in distinguishing cognitively stable individuals with an AUC of 0.71. However, all approaches struggled in the heterogeneous moderate group. These findings indicate that clinical features excel in identifying high-risk extremes, whereas transformer-based MRI embeddings are more sensitive to subtle markers of stability, motivating multimodal fusion strategies for AD progression modeling.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongqing Xie, Yonghuang Wu, Zisheng Ai, Jun Min, Zhencun Jiang et al.
Machine Learning
ๆฅ็ๆ่ฆ
The accurate segmentation of brain tumors from multi-modal MRI is critical for clinical diagnosis and treatment planning. While integrating complementary information from various MRI sequences is a common practice, the frequent absence of one or more modalities in real-world clinical settings poses a significant challenge, severely compromising the performance and generalizability of deep learning-based segmentation models. To address this challenge, we propose a novel Cross-Modal Compositional Self-Distillation (CCSD) framework that can flexibly handle arbitrary combinations of input modalities. CCSD adopts a shared-specific encoder-decoder architecture and incorporates two self-distillation strategies: (i) a hierarchical modality self-distillation mechanism that transfers knowledge across modality hierarchies to reduce semantic discrepancies, and (ii) a progressive modality combination distillation approach that enhances robustness to missing modalities by simulating gradual modality dropout during training. Extensive experiments on public brain tumor segmentation benchmarks demonstrate that CCSD achieves state-of-the-art performance across various missing-modality scenarios, with strong generalization and stability.
๐
2025-11-18
๐ Accepted in the 5th Workshop on Knowledge Graphs a
๐ฅ Yuan An, Ruhma Hashmi, Michelle Rogers, Jane Greenberg, Brian K. Smith
Machine Learning
ๆฅ็ๆ่ฆ
Task-oriented knowledge graphs (KGs) enable AI-powered learning assistant systems to automatically generate high-quality multiple-choice questions (MCQs). Yet converting unstructured educational materials, such as lecture notes and slides, into KGs that capture key pedagogical content remains difficult. We propose a framework for knowledge graph construction and refinement grounded in rate-distortion (RD) theory and optimal transport geometry. In the framework, lecture content is modeled as a metric-measure space, capturing semantic and relational structure, while candidate KGs are aligned using Fused Gromov-Wasserstein (FGW) couplings to quantify semantic distortion. The rate term, expressed via the size of KG, reflects complexity and compactness. Refinement operators (add, merge, split, remove, rewire) minimize the rate-distortion Lagrangian, yielding compact, information-preserving KGs. Our prototype applied to data science lectures yields interpretable RD curves and shows that MCQs generated from refined KGs consistently surpass those from raw notes on fifteen quality criteria. This study establishes a principled foundation for information-theoretic KG optimization in personalized and AI-assisted education.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Xianhui Meng, Yuchen Zhang, Zhijian Huang, Zheng Lu, Ziling Ji et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) show great promise for autonomous driving, but their suitability for safety-critical scenarios is largely unexplored, raising safety concerns. This issue arises from the lack of comprehensive benchmarks that assess both external environmental risks and in-cabin driving behavior safety simultaneously. To bridge this critical gap, we introduce DSBench, the first comprehensive Driving Safety Benchmark designed to assess a VLM's awareness of various safety risks in a unified manner. DSBench encompasses two major categories: external environmental risks and in-cabin driving behavior safety, divided into 10 key categories and a total of 28 sub-categories. This comprehensive evaluation covers a wide range of scenarios, ensuring a thorough assessment of VLMs' performance in safety-critical contexts. Extensive evaluations across various mainstream open-source and closed-source VLMs reveal significant performance degradation under complex safety-critical situations, highlighting urgent safety concerns. To address this, we constructed a large dataset of 98K instances focused on in-cabin and external safety scenarios, showing that fine-tuning on this dataset significantly enhances the safety performance of existing VLMs and paves the way for advancing autonomous driving technology. The benchmark toolkit, code, and model checkpoints will be publicly accessible.
๐
2025-11-18
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Nick von Felten, Johannes Schรถning, Klaus Opwis, Nicolas Scharowksi
ๆฅ็ๆ่ฆ
Humans increasingly interact with artificial intelligence (AI) in decision-making. However, both AI and humans are prone to biases. While AI and human biases have been studied extensively in isolation, this paper examines their complex interaction. Specifically, we examined how class imbalance as an AI bias affects people's ability to appropriately rely on an AI-based decision-support system, and how it interacts with base rate neglect as a human bias. In a within-subject online study (N= 46), participants classified three diseases using an AI-based decision-support system trained on either a balanced or unbalanced dataset. We found that class imbalance disrupted participants' calibration of AI reliance. Moreover, we observed mutually reinforcing effects between class imbalance and base rate neglect, offering evidence of a compound human-AI bias. Based on these findings, we advocate for an interactionist perspective and further research into the mutually reinforcing effects of biases in human-AI interaction.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Julia Machnio, Mads Nielsen, Mostafa Mehdipour Ghazi
Machine Learning
ๆฅ็ๆ่ฆ
White matter hyperintensities (WMH) are key imaging markers in cognitive aging, Alzheimer's disease (AD), and related dementias. Although automated methods for WMH segmentation have advanced, most provide only global lesion load and overlook their spatial distribution across distinct white matter regions. We propose a deep learning framework for robust WMH segmentation and localization, evaluated across public datasets and an independent Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Our results show that the predicted lesion loads are in line with the reference WMH estimates, confirming the robustness to variations in lesion load, acquisition, and demographics. Beyond accurate segmentation, we quantify WMH load within anatomically defined regions and combine these measures with brain structure volumes to assess diagnostic value. Regional WMH volumes consistently outperform global lesion burden for disease classification, and integration with brain atrophy metrics further improves performance, reaching area under the curve (AUC) values up to 0.97. Several spatially distinct regions, particularly within anterior white matter tracts, are reproducibly associated with diagnostic status, indicating localized vulnerability in AD. These results highlight the added value of regional WMH quantification. Incorporating localized lesion metrics alongside atrophy markers may enhance early diagnosis and stratification in neurodegenerative disorders.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ankush Kadu, Ashwanth Krishnan
Machine Learning
ๆฅ็ๆ่ฆ
Enabling agents to learn from experience and generalize across diverse tasks without task-specific training remains a fundamental challenge in reinforcement learning and decision-making. While recent approaches have explored episodic memory (Reflexion), gradient-based prompt optimization (TextGrad),and hierarchical task decomposition independently, their potential for synergistic integration remains unexplored. We introduce ReflexGrad, a novel architecture that tightly couples three complementary mechanisms: (1) LLM-based hierarchical TODO decomposition for strategic planning, (2) history-aware causal reflection that analyzes recent action patterns to identify failure root causes and enable within-trial learning, and (3) gradient-based optimization for systematic improvement. Unlike prior work relying on few-shot demonstrations, our system achieves true zero-shot generalization through pure LLM semantic reasoning,requiring no task-specific examples, fine-tuning, or hardcoded similarity metrics. Evaluated on ALFWorld benchmark tasks, ReflexGrad demonstrates 67% zero-shot success rate on Trial 0 without any prior task experience or demonstrations, establishing effective performance on first exposure. Through empirical analysis, we identify the architectural mechanisms underlying stable convergence (zero action loops) and effective cross-task transfer (67% to 78% improvement).Our work demonstrates that synergistic integration of complementary learning mechanisms enables robust zero-shot generalization that approaches few-shot baselines from prior work.
๐
2025-11-18
๐ 28 pages, 16 figures, this article has been accept
๐ฅ Chen Chen, Cuong Nguyen, Alexa Siu, Dingzeyu Li, Nadir Weibel
Natural Language Processing
ๆฅ็ๆ่ฆ
Accessing 3D models remains challenging for Screen Reader (SR) users. While some existing 3D viewers allow creators to provide alternative text, they often lack sufficient detail about the 3D models. Grounded on a formative study, this paper introduces SweeperBot, a system that enables SR users to leverage visual question answering to explore and compare 3D models. SweeperBot answers SR users' visual questions by combining an optimal view selection technique with the strength of generative- and recognition-based foundation models. An expert review with 10 Blind and Low-Vision (BLV) users with SR experience demonstrated the feasibility of using SweeperBot to assist BLV users in exploring and comparing 3D models. The quality of the descriptions generated by SweeperBot was validated by a second survey study with 30 sighted participants.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lucia Makaiovรก, Martin Fajฤรญk, Antonรญn Jarolรญm
ๆฅ็ๆ่ฆ
Document-level claim extraction remains an open challenge in the field of fact-checking, and subsequently, methods for evaluating extracted claims have received limited attention. In this work, we explore approaches to aligning two sets of claims pertaining to the same source document and computing their similarity through an alignment score. We investigate techniques to identify the best possible alignment and evaluation method between claim sets, with the aim of providing a reliable evaluation framework. Our approach enables comparison between model-extracted and human-annotated claim sets, serving as a metric for assessing the extraction performance of models and also as a possible measure of inter-annotator agreement. We conduct experiments on newly collected dataset-claims extracted from comments under Czech and Slovak news articles-domains that pose additional challenges due to the informal language, strong local context, and subtleties of these closely related languages. The results draw attention to the limitations of current evaluation approaches when applied to document-level claim extraction and highlight the need for more advanced methods-ones able to correctly capture semantic similarity and evaluate essential claim properties such as atomicity, checkworthiness, and decontextualization.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Minyoung Hwang, Alexandra Forsey-Smerek, Nathaniel Dennler, Andreea Bobu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details. Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample-efficiency, generalization, and robustness to ambiguous language. Project page: https://MIT-CLEAR-Lab.github.io/Masked-IRL and Code: https://github.com/MIT-CLEAR-Lab/Masked-IRL
๐
2025-11-18
๐ AAAI 2026
๐ฅ Xinzhe Zheng, Shiyu Jiang, Gustavo Seabra, Chenglong Li, Yanjun Li
Machine Learning
ๆฅ็ๆ่ฆ
Deep generative models are rapidly advancing structure-based drug design, offering substantial promise for generating small molecule ligands that bind to specific protein targets. However, most current approaches assume a rigid protein binding pocket, neglecting the intrinsic flexibility of proteins and the conformational rearrangements induced by ligand binding, limiting their applicability in practical drug discovery. Here, we propose Apo2Mol, a diffusion-based generative framework for 3D molecule design that explicitly accounts for conformational flexibility in protein binding pockets. To support this, we curate a dataset of over 24,000 experimentally resolved apo-holo structure pairs from the Protein Data Bank, enabling the characterization of protein structure changes associated with ligand binding. Apo2Mol employs a full-atom hierarchical graph-based diffusion model that simultaneously generates 3D ligand molecules and their corresponding holo pocket conformations from input apo states. Empirical studies demonstrate that Apo2Mol can achieve state-of-the-art performance in generating high-affinity ligands and accurately capture realistic protein pocket conformational changes.
๐
2025-11-18
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Alexander Olza, Roberto Santana, David Soto
Machine Learning
ๆฅ็ๆ่ฆ
Decoded Neurofeedback (DecNef) is a flourishing non-invasive approach to brain modulation with wide-ranging applications in neuromedicine and cognitive neuroscience. However, progress in DecNef research remains constrained by subject-dependent learning variability, reliance on indirect measures to quantify progress, and the high cost and time demands of experimentation.
We present DecNefLab, a modular and interpretable simulation framework that formalizes DecNef as a machine learning problem. Beyond providing a virtual laboratory, DecNefLab enables researchers to model, analyze and understand neurofeedback dynamics. Using latent variable generative models as simulated participants, DecNefLab allows direct observation of internal cognitive states and systematic evaluation of how different protocol designs and subject characteristics influence learning.
We demonstrate how this approach can (i) reproduce empirical phenomena of DecNef learning, (ii) identify conditions under which DecNef feedback fails to induce learning, and (iii) guide the design of more robust and reliable DecNef protocols in silico before human implementation.
In summary, DecNefLab bridges computational modeling and cognitive neuroscience, offering a principled foundation for methodological innovation, robust protocol design, and ultimately, a deeper understanding of DecNef-based brain modulation.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal
Machine Learning
ๆฅ็ๆ่ฆ
Incomplete data are common in real-world tabular applications, where numerical, categorical, and discrete attributes coexist within a single dataset. This heterogeneous structure presents significant challenges for existing diffusion-based imputation models, which typically assume a homogeneous feature space and rely on stochastic denoising trajectories. Such assumptions make it difficult to maintain conditional consistency, and they often lead to information collapse for categorical variables or instability when numerical variables require deterministic updates. These limitations indicate that a single diffusion process is insufficient for mixed-type tabular imputation.
We propose a hybrid deterministic diffusion framework that separates heterogeneous features into two complementary generative channels. A continuous DDIM-based channel provides efficient and stable deterministic denoising for numerical variables, while a discrete latent-path diffusion channel, inspired by loopholing-based discrete diffusion, models categorical and discrete features without leaving their valid sample manifolds. The two channels are trained under a unified conditional imputation objective, enabling coherent reconstruction of mixed-type incomplete data.
Extensive experiments on multiple real-world datasets show that the proposed framework achieves higher imputation accuracy, more stable sampling trajectories, and improved robustness across MCAR, MAR, and MNAR settings compared with existing diffusion-based and classical methods. These results demonstrate the importance of structure-aware diffusion processes for advancing deep learning approaches to incomplete tabular data.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiahao Wu, Shengwen Yu
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Bridging continuous perceptual signals and discrete symbolic reasoning is a fundamental challenge in AI systems that must operate under uncertainty. We present a neuro-symbolic framework that explicitly models and propagates uncertainty from perception to planning, providing a principled connection between these two abstraction levels. Our approach couples a transformer-based perceptual front-end with graph neural network (GNN) relational reasoning to extract probabilistic symbolic states from visual observations, and an uncertainty-aware symbolic planner that actively gathers information when confidence is low. We demonstrate the framework's effectiveness on tabletop robotic manipulation as a concrete application: the translator processes 10,047 PyBullet-generated scenes (3--10 objects) and outputs probabilistic predicates with calibrated confidences (overall F1=0.68). When embedded in the planner, the system achieves 94\%/90\%/88\% success on Simple Stack, Deep Stack, and Clear+Stack benchmarks (90.7\% average), exceeding the strongest POMDP baseline by 10--14 points while planning within 15\,ms. A probabilistic graphical-model analysis establishes a quantitative link between calibrated uncertainty and planning convergence, providing theoretical guarantees that are validated empirically. The framework is general-purpose and can be applied to any domain requiring uncertainty-aware reasoning from perceptual input to symbolic planning.
๐
2025-11-18
๐ ArXiv Preprint (cs.SD)
๐ฅ Xinxin Tang, Bin Qin, Yufang Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) tasks on resource-constrained devices. Existing state-of-the-art methods, such as MUSE, have established a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of Taylor-expansion-based attention, while the offset calculation for deformable embedding introduces additional computational burden. This paper proposes IMSE, a systematically optimized and ultra-lightweight network. We introduce two core innovations: 1) Replacing the MET module with Amplitude-Aware Linear Attention (MALA). MALA fundamentally rectifies the "amplitude-ignoring" problem in linear attention by explicitly preserving the norm information of query vectors in the attention calculation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv borrows the Inception concept, decomposing large-kernel operations into efficient parallel branches (square, horizontal, and vertical strips), thereby capturing spectrogram features with extremely low parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset demonstrate that, compared to the MUSE baseline, IMSE significantly reduces the parameter count by 16.8\% (from 0.513M to 0.427M) while achieving competitive performance comparable to the state-of-the-art on the PESQ metric (3.373). This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jintao Zhang, Mingyue Cheng, Zirui Liu, Xianquan Wang, Yitong Zhou et al.
ๆฅ็ๆ่ฆ
Time series generation is critical for a wide range of applications, which greatly supports downstream analytical and decision-making tasks. However, the inherent temporal heterogeneous induced by localized perturbations present significant challenges for generating structurally consistent time series. While flow matching provides a promising paradigm by modeling temporal dynamics through trajectory-level supervision, it fails to adequately capture abrupt transitions in perturbed time series, as the use of globally shared parameters constrains the velocity field to a unified representation. To address these limitations, we introduce \textbf{PAFM}, a \textbf{P}erturbation-\textbf{A}ware \textbf{F}low \textbf{M}atching framework that models perturbed trajectories to ensure stable and structurally consistent time series generation. The framework incorporates perturbation-guided training to simulate localized disturbances and leverages a dual-path velocity field to capture trajectory deviations under perturbation, enabling refined modeling of perturbed behavior to enhance the structural coherence. In order to further improve sensitivity to trajectory perturbations while enhancing expressiveness, a mixture-of-experts decoder with flow routing dynamically allocates modeling capacity in response to different trajectory dynamics. Extensive experiments on both unconditional and conditional generation tasks demonstrate that PAFM consistently outperforms strong baselines. Code is available at https://anonymous.4open.science/r/PAFM-03B2.
๐
2025-11-18
๐ ArXiv Preprint (eess.SY)
๐ฅ Soham Ghosh, Gaurav Mittal
ๆฅ็ๆ่ฆ
Agentic AI systems have recently emerged as a critical and transformative approach in artificial intelligence, offering capabilities that extend far beyond traditional AI agents and contemporary generative AI models. This rapid evolution necessitates a clear conceptual and taxonomical understanding to differentiate this new paradigm. Our paper addresses this gap by providing a comprehensive review that establishes a precise definition and taxonomy for "agentic AI," with the aim of distinguishing it from previous AI paradigms. The concepts are gradually introduced, starting with a highlight of its diverse applications across the broader field of engineering. The paper then presents four detailed, state-of-the-art use case applications specifically within electrical engineering. These case studies demonstrate practical impact, ranging from an advanced agentic framework for streamlining complex power system studies and benchmarking to a novel system developed for survival analysis of dynamic pricing strategies in battery swapping stations. Finally, to ensure robust deployment, the paper provides detailed failure mode investigations. From these findings, we derive actionable recommendations for the design and implementation of safe, reliable, and accountable agentic AI systems, offering a critical resource for researchers and practitioners.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dalia Ali, Dora Zhao, Allison Koenecke, Orestis Papakyriakopoulos
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Although large language models (LLMs) are increasingly trained using human feedback for safety and alignment with human values, alignment decisions often overlook human social diversity. This study examines how incorporating pluralistic values affects LLM behavior by systematically evaluating demographic variation and design parameters in the alignment pipeline. We collected alignment data from US and German participants (N = 1,095, 27,375 ratings) who rated LLM responses across five dimensions: Toxicity, Emotional Awareness (EA), Sensitivity, Stereotypical Bias, and Helpfulness. We fine-tuned multiple Large Language Models and Large Reasoning Models using preferences from different social groups while varying rating scales, disagreement handling methods, and optimization techniques. The results revealed systematic demographic effects: male participants rated responses 18% less toxic than female participants; conservative and Black participants rated responses 27.9% and 44% more emotionally aware than liberal and White participants, respectively. Models fine-tuned on group-specific preferences exhibited distinct behaviors. Technical design choices showed strong effects: the preservation of rater disagreement achieved roughly 53% greater toxicity reduction than majority voting, and 5-point scales yielded about 22% more reduction than binary formats; and Direct Preference Optimization (DPO) consistently outperformed Group Relative Policy Optimization (GRPO) in multi-value optimization. These findings represent a preliminary step in answering a critical question: How should alignment balance expert-driven and user-driven signals to ensure both safety and fair representation?
๐
2025-11-18
๐ NeurIPS 2025
๐ฅ Clรฉment Dumas
Natural Language Processing
ๆฅ็ๆ่ฆ
Mechanistic interpretability research requires reliable tools for analyzing transformer internals across diverse architectures. Current approaches face a fundamental tradeoff: custom implementations like TransformerLens ensure consistent interfaces but require coding a manual adaptation for each architecture, introducing numerical mismatch with the original models, while direct HuggingFace access through NNsight preserves exact behavior but lacks standardization across models. To bridge this gap, we develop nnterp, a lightweight wrapper around NNsight that provides a unified interface for transformer analysis while preserving original HuggingFace implementations. Through automatic module renaming and comprehensive validation testing, nnterp enables researchers to write intervention code once and deploy it across 50+ model variants spanning 16 architecture families. The library includes built-in implementations of common interpretability methods (logit lens, patchscope, activation steering) and provides direct access to attention probabilities for models that support it. By packaging validation tests with the library, researchers can verify compatibility with custom models locally. nnterp bridges the gap between correctness and usability in mechanistic interpretability tooling.
๐
2025-11-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Daniรซl Wilten, Gideon Maillette de Buy Wenniger, Arjen Hommersom, Paul Lucassen, Emiel Poortman
Robotics
ๆฅ็ๆ่ฆ
Using multiple carousels, lists that wrap around and can be scrolled, is the basis for offering content in most contemporary movie streaming platforms. Carousels allow for highlighting different aspects of users' taste, that fall in categories such as genres and authors. However, while carousels offer structure and greater ease of navigation, they alone do not increase diversity in recommendations, while this is essential to keep users engaged. In this work we propose several approaches to effectively increase item diversity within the domain of book recommendations, on top of a collaborative filtering algorithm. These approaches are intended to improve book recommendations in the web catalogs of public libraries. Furthermore, we introduce metrics to evaluate the resulting strategies, and show that the proposed system finds a suitable balance between accuracy and beyond-accuracy aspects.
๐
2025-11-18
๐ Accepted for publication in 3rd IEEE International
๐ฅ Fabian Stricker, David Bermbach, Christian Zirpins
Machine Learning
ๆฅ็ๆ่ฆ
Federated learning (FL) is a new paradigm for training machine learning (ML) models without sharing data. While applying FL in cross-silo scenarios, where organizations collaborate, it is necessary that the FL system is reliable; however, participants can fail due to various reasons (e.g., communication issues or misconfigurations). In order to provide a reliable system, it is necessary to analyze the impact of participant failures. While this problem received attention in cross-device FL where mobile devices with limited resources participate, there is comparatively little research in cross-silo FL.
Therefore, we conduct an extensive study for analyzing the impact of participant failures on the model quality in the context of inter-organizational cross-silo FL with few participants. In our study, we focus on analyzing generally influential factors such as the impact of the timing and the data as well as the impact on the evaluation, which is important for deciding, if the model should be deployed. We show that under high skews the evaluation is optimistic and hides the real impact. Furthermore, we demonstrate that the timing impacts the quality of the trained model. Our results offer insights for researchers and software architects aiming to build robust FL systems.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Emanuele Palumbo, Sorawit Saengkyongam, Maria R. Cervera, Jens Behrmann, Andrew C. Miller et al.
ๆฅ็ๆ่ฆ
Continuous cardiovascular monitoring can play a key role in precision health. However, some fundamental cardiac biomarkers of interest, including stroke volume and cardiac output, require invasive measurements, e.g., arterial pressure waveforms (APW). As a non-invasive alternative, photoplethysmography (PPG) measurements are routinely collected in hospital settings. Unfortunately, the prediction of key cardiac biomarkers from PPG instead of APW remains an open challenge, further complicated by the scarcity of annotated PPG measurements. As a solution, we propose a hybrid approach that uses hemodynamic simulations and unlabeled clinical data to estimate cardiovascular biomarkers directly from PPG signals. Our hybrid model combines a conditional variational autoencoder trained on paired PPG-APW data with a conditional density estimator of cardiac biomarkers trained on labeled simulated APW segments. As a key result, our experiments demonstrate that the proposed approach can detect fluctuations of cardiac output and stroke volume and outperform a supervised baseline in monitoring temporal changes in these biomarkers.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hong Gao, Yiming Bao, Xuezhen Tu, Yutong Xu, Yue Jin et al.
Computer VisionMultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner with limited support for evidence revisit and iterative refinement. While recently emerging agent-based methods enable long-horizon reasoning, they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
๐
2025-11-18
๐ ACL
๐ฅ Trishala Jayesh Ahalpara
Natural Language Processing
ๆฅ็ๆ่ฆ
We present Tell Me, a mental well-being system that leverages advances in large language models to provide accessible, context-aware support for users and researchers. The system integrates three components: (i) a retrieval-augmented generation (RAG) assistant for personalized, knowledge-grounded dialogue; (ii) a synthetic client-therapist dialogue generator conditioned on client profiles to facilitate research on therapeutic language and data augmentation; and (iii) a Well-being AI crew, implemented with CrewAI, that produces weekly self-care plans and guided meditation audio. The system is designed as a reflective space for emotional processing rather than a substitute for professional therapy. It illustrates how conversational assistants can lower barriers to support, complement existing care, and broaden access to mental health resources. To address the shortage of confidential therapeutic data, we introduce synthetic client-therapist dialogue generation conditioned on client profiles. Finally, the planner demonstrates an innovative agentic workflow for dynamically adaptive, personalized self-care, bridging the limitations of static well-being tools. We describe the architecture, demonstrate its functionalities, and report evaluation of the RAG assistant in curated well-being scenarios using both automatic LLM-based judgments and a human-user study. This work highlights opportunities for interdisciplinary collaboration between NLP researchers and mental health professionals to advance responsible innovation in human-AI interaction for well-being.
๐
2025-11-18
๐ ArXiv Preprint (cs.SE)
๐ฅ Angelo Ferrando
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Assuring the safety and trustworthiness of autonomous systems is particularly difficult when learning-enabled components and open environments are involved. Formal methods provide strong guarantees but depend on complete models and static assumptions. Runtime verification (RV) complements them by monitoring executions at run time and, in its predictive variants, by anticipating potential violations. Large language models (LLMs), meanwhile, excel at translating natural language into formal artefacts and recognising patterns in data, yet they remain error-prone and lack formal guarantees. This vision paper argues for a symbiotic integration of RV and LLMs. RV can serve as a guardrail for LLM-driven autonomy, while LLMs can extend RV by assisting specification capture, supporting anticipatory reasoning, and helping to handle uncertainty. We outline how this mutual reinforcement differs from existing surveys and roadmaps, discuss challenges and certification implications, and identify future research directions towards dependable autonomy.
๐
2025-11-18
๐ ArXiv Preprint (cs.LO)
๐ฅ Dominik Grundt, Ishan Saxena, Malte Petersen, Bernd Westphal, Eike Mรถhlmann
Robotics
ๆฅ็ๆ่ฆ
Autonomous vehicles (AVs) must be both safe and trustworthy to gain social acceptance and become a viable option for everyday public transportation. Explanations about the system behaviour can increase safety and trust in AVs. Unfortunately, explaining the system behaviour of AI-based driving functions is particularly challenging, as decision-making processes are often opaque. The field of Explainability Engineering tackles this challenge by developing explanation models at design time. These models are designed from system design artefacts and stakeholder needs to develop correct and good explanations. To support this field, we propose an approach that enables context-aware, ante-hoc explanations of (un)expectable driving manoeuvres at runtime. The visual yet formal language Traffic Sequence Charts is used to formalise explanation contexts, as well as corresponding (un)expectable driving manoeuvres. A dedicated runtime monitoring enables context-recognition and ante-hoc presentation of explanations at runtime. In combination, we aim to support the bridging of correct and good explanations. Our method is demonstrated in a simulated overtaking.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrey Okhotin, Maksim Nakhodnov, Nikita Kazeev, Andrey E Ustyuzhanin, Dmitry Vetrov
ๆฅ็ๆ่ฆ
In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel (S.U.N.) crystalline materials. However, most of these models don't have the ability to change the number of atoms in the crystal during the generation process, which limits the variability of model sampling trajectories. In this paper, we demonstrate the severity of this restriction and introduce a simple yet powerful technique, mirage infusion, which enables diffusion models to change the state of the atoms that make up the crystal from existent to non-existent (mirage) and vice versa. We show that this technique improves model quality by up to $\times2.5$ compared to the same model without this modification. The resulting model, Mirage Atom Diffusion (MiAD), is an equivariant joint diffusion model for de novo crystal generation that is capable of altering the number of atoms during the generation process. MiAD achieves an $8.2\%$ S.U.N. rate on the MP-20 dataset, which substantially exceeds existing state-of-the-art approaches. The source code can be found at \href{https://github.com/andrey-okhotin/miad.git}{\texttt{github.com/andrey-okhotin/miad}}.
๐
2025-11-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Zhengchunmin Dai, Jiaxiong Tang, Peng Sun, Honglong Chen, Liantao Wu
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
In decentralized machine learning paradigms such as Split Federated Learning (SFL) and its variant U-shaped SFL, the server's capabilities are severely restricted. Although this enhances client-side privacy, it also leaves the server highly vulnerable to model theft by malicious clients. Ensuring intellectual property protection for such capability-limited servers presents a dual challenge: watermarking schemes that depend on client cooperation are unreliable in adversarial settings, whereas traditional server-side watermarking schemes are technically infeasible because the server lacks access to critical elements such as model parameters or labels.
To address this challenge, this paper proposes Sigil, a mandatory watermarking framework designed specifically for capability-limited servers. Sigil defines the watermark as a statistical constraint on the server-visible activation space and embeds the watermark into the client model via gradient injection, without requiring any knowledge of the data. Besides, we design an adaptive gradient clipping mechanism to ensure that our watermarking process remains both mandatory and stealthy, effectively countering existing gradient anomaly detection methods and a specifically designed adaptive subspace removal attack. Extensive experiments on multiple datasets and models demonstrate Sigil's fidelity, robustness, and stealthiness.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Xiuxiu Qi, Yu Yang, Jiannong Cao, Luyao Bai, Chongshan Fan et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations by a bidirectional cross-attention to learn contextual information for action generation, successfully overcoming the problem of semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kangqiao Zhao, Shuo Huai, Xurui Song, Jun Luo
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It has significantly enhanced stealth and lethality upon existing hiding attacks that fail to get seamlessly merged into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Prathamesh Kalamkar, Ned Letcher, Meissane Chami, Sahger Lad, Shayan Mohanty et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The application of large language models (LLMs) to chemistry is frequently hampered by a "tokenization bottleneck", where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically uninformative sub-tokens. This paper introduces a principled methodology to resolve this bottleneck by unifying the representation of natural language and molecular structures within a single model. Our approach involves targeted vocabulary extension-augmenting a pretrained LLM's vocabulary with chemically salient tokens, followed by continued pretraining on chemistry-domain text to integrate this new knowledge. We provide an empirical demonstration of the effectiveness of this strategy, showing that our methodology leads to superior performance on a range of downstream chemical tasks.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gustavo Adolpho Bonesso, Carlos Marcelo Gurjรฃo de Godoy, Tammy Hentona Osaki, Midori Hentona Osaki, Bรกrbara Moreira Ribeiro Trindade dos Santos et al.
ๆฅ็ๆ่ฆ
Blinking is a vital physiological process that protects and maintains the health of the ocular surface. Objective assessment of eyelid movements remains challenging due to the complexity, cost, and limited clinical applicability of existing tools. This study presents the clinical validation of Bapp (Blink Application), a mobile application developed using the Flutter framework and integrated with Google ML Kit for on-device, real-time analysis of eyelid movements. The validation occurred using 45 videos from real patients, whose blinks were manually annotated by ophthalmology specialists from the Paulista School of Medicine of the Federal University of Sao Paulo (EPM-UNIFESP) to serve as the ground truth. Bapp's performance was evaluated using standard metrics, including Precision, Recall, and F1-Score, with results demonstrating 98.4% precision, 96.9% recall, and an overall accuracy of 98.3%. These outcomes confirm the reliability of Bapp as a portable, accessible, and objective tool for monitoring both normal and abnormal eyelid movements. The application offers a promising alternative to traditional manual blink counting, supporting continuous ocular health monitoring and postoperative evaluation in clinical environments.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Michael Milford, Tobias Fischer
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Place recognition, the ability to identify previously visited locations, is critical for both biological navigation and autonomous systems. This review synthesizes findings from robotic systems, animal studies, and human research to explore how different systems encode and recall place. We examine the computational and representational strategies employed across artificial systems, animals, and humans, highlighting convergent solutions such as topological mapping, cue integration, and memory management. Animal systems reveal evolved mechanisms for multimodal navigation and environmental adaptation, while human studies provide unique insights into semantic place concepts, cultural influences, and introspective capabilities. Artificial systems showcase scalable architectures and data-driven models. We propose a unifying set of concepts by which to consider and develop place recognition mechanisms and identify key challenges such as generalization, robustness, and environmental variability. This review aims to foster innovations in artificial localization by connecting future developments in artificial place recognition systems to insights from both animal navigation research and human spatial cognition studies.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alessio Pellegrino, Jacopo Mauro
Natural Language Processing
ๆฅ็ๆ่ฆ
One of the long-standing goals in optimisation and constraint programming is to describe a problem in natural language and automatically obtain an executable, efficient model. Large language models appear to bring this vision closer, showing impressive results in automatically generating models for classical benchmarks. However, much of this apparent success may derive from data contamination rather than genuine reasoning: many standard CP problems are likely included in the training data of these models. To examine this hypothesis, we systematically rephrased and perturbed a set of well-known CSPLib problems to preserve their structure while modifying their context and introducing misleading elements. We then compared the models produced by three representative LLMs across original and modified descriptions. Our qualitative analysis shows that while LLMs can produce syntactically valid and semantically plausible models, their performance drops sharply under contextual and linguistic variation, revealing shallow understanding and sensitivity to wording.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nanjun Li, Ziyue Hao, Quanqiang Wang, Xuanyin Wang
Computer Vision
ๆฅ็ๆ่ฆ
With the rise in sedentary behavior, health problems caused by poor sitting posture have drawn increasing attention. Most existing methods, whether using invasive sensors or computer vision, rely on two-stage pipelines, which result in high intrusiveness, intensive computation, and poor real-time performance on embedded edge devices. Inspired by YOLOv11-Pose, a lightweight single-stage network for sitting posture recognition on embedded edge devices termed LSP-YOLO was proposed. By integrating partial convolution(PConv) and Similarity-Aware Activation Module(SimAM), a lightweight module, Light-C3k2, was designed to reduce computational cost while maintaining feature extraction capability. In the recognition head, keypoints were directly mapped to posture classes through pointwise convolution, and intermediate supervision was employed to enable efficient fusion of pose estimation and classification. Furthermore, a dataset containing 5,000 images across six posture categories was constructed for model training and testing. The smallest trained model, LSP-YOLO-n, achieved 94.2% accuracy and 251 Fps on personal computer(PC) with a model size of only 1.9 MB. Meanwhile, real-time and high-accuracy inference under constrained computational resources was demonstrated on the SV830C + GC030A platform. The proposed approach is characterized by high efficiency, lightweight design and deployability, making it suitable for smart classrooms, rehabilitation, and human-computer interaction applications.
๐
2025-11-18
๐ This paper was accepted by IEEE BIBM 2025 conferen
๐ฅ Chenyang Xu, Siming Li, Hao Wang
Robotics
ๆฅ็ๆ่ฆ
Phonocardiogram (PCG) analysis is vital for cardiovascular disease diagnosis, yet the scarcity of labeled pathological data hinders the capability of AI systems. To bridge this, we introduce H-LDM, a Hierarchical Latent Diffusion Model for generating clinically accurate and controllable PCG signals from structured metadata. Our approach features: (1) a multi-scale VAE that learns a physiologically-disentangled latent space, separating rhythm, heart sounds, and murmurs; (2) a hierarchical text-to-biosignal pipeline that leverages rich clinical metadata for fine-grained control over 17 distinct conditions; and (3) an interpretable diffusion process guided by a novel Medical Attention module. Experiments on the PhysioNet CirCor dataset demonstrate state-of-the-art performance, achieving a Frรฉchet Audio Distance of 9.7, a 92% attribute disentanglement score, and 87.1% clinical validity confirmed by cardiologists. Augmenting diagnostic models with our synthetic data improves the accuracy of rare disease classification by 11.3\%. H-LDM establishes a new direction for data augmentation in cardiac diagnostics, bridging data scarcity with interpretable clinical insights.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sahar Nasirihaghighi, Negin Ghamsarian, Yiping Li, Marcel Breeuwer, Raphael Sznitman et al.
Computer Vision
ๆฅ็ๆ่ฆ
Medical image segmentation is clinically important, yet data privacy and the cost of expert annotation limit the availability of labeled data. Federated semi-supervised learning (FSSL) offers a solution but faces two challenges: pseudo-label reliability depends on the strength of local models, and client devices often require compact or heterogeneous architectures due to limited computational resources. These constraints reduce the quality and stability of pseudo-labels, while large models, though more accurate, cannot be trained or used for routine inference on client devices. We propose SAM-Fed, a federated semi-supervised framework that leverages a high-capacity segmentation foundation model to guide lightweight clients during training. SAM-Fed combines dual knowledge distillation with an adaptive agreement mechanism to refine pixel-level supervision. Experiments on skin lesion and polyp segmentation across homogeneous and heterogeneous settings show that SAM-Fed consistently outperforms state-of-the-art FSSL methods.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xiaochuan Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
In today's data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present AraLingBench: a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxin Zhang, Ziyu Lu, Hongbo Duan, Keyu Fan, Pengting Luo et al.
Computer Vision
ๆฅ็ๆ่ฆ
Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. Additionally, 3D scene generation is vital for advancing embodied AI and world models, which depend on diverse, high-quality scenes for learning and evaluation. In this work, we propose Gen3d, a novel method for generation of high-quality, wide-scope, and generic 3D scenes from a single image. After the initial point cloud is created by lifting the RGBD image, Gen3d maintains and expands its world model. The 3D scene is finalized through optimizing a Gaussian splatting representation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in generating a world model and Synthesizing high-fidelity and consistent novel views.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vincent-Daniel Yun, Junhyuk Jo, Sunwoo Lee
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Deep neural networks achieve outstanding performance in visual recognition tasks, yet their large number of parameters makes them less practical for real-world applications. Recently, one-shot pruning has emerged as an effective strategy for reducing model size without additional training. However, models trained with standard objective functions often suffer a significant drop in accuracy after aggressive pruning. Some existing pruning-robust optimizers, such as SAM, and CrAM, mitigate this accuracy drop by guiding the model toward flatter regions of the parameter space, but they inevitably incur non-negligible additional computations. We propose a Variance Amplifying Regularizer (VAR) that deliberately increases the variance of model parameters during training. Our study reveals an intriguing finding that parameters with higher variance exhibit greater pruning robustness. VAR exploits this property by promoting such variance in the weight distribution, thereby mitigating the adverse effects of pruning. We further provide a theoretical analysis of its convergence behavior, supported by extensive empirical results demonstrating the superior pruning robustness of VAR.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Frederik Hoppe, Lars Kleinemeier, Astrid Franz, Udo Gรถbel
ๆฅ็ๆ่ฆ
Recent foundation models for tabular data achieve strong task-specific performance via in-context learning. Nevertheless, they focus on direct prediction by encapsulating both representation learning and task-specific inference inside a single, resource-intensive network. This work specifically focuses on representation learning, i.e., on transferable, task-agnostic embeddings. We systematically evaluate task-agnostic representations from tabular foundation models (TabPFN and TabICL) alongside with classical feature engineering (TableVectorizer) across a variety of application tasks as outlier detection (ADBench) and supervised learning (TabArena Lite). We find that simple TableVectorizer features achieve comparable or superior performance while being up to three orders of magnitude faster than tabular foundation models. The code is available at https://github.com/ContactSoftwareAI/TabEmbedBench.
๐
2025-11-18
๐ AAAI
๐ฅ Yosuke Nishimoto, Takashi Matsubara
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
World models have been developed to support sample-efficient deep reinforcement learning agents. However, it remains challenging for world models to accurately replicate environments that are high-dimensional, non-stationary, and composed of multiple objects with rich interactions since most world models learn holistic representations of all environmental components. By contrast, humans perceive the environment by decomposing it into discrete objects, facilitating efficient decision-making. Motivated by this insight, we propose \emph{Slot Transformer Imagination with CAusality-aware reinforcement learning} (STICA), a unified framework in which object-centric Transformers serve as the world model and causality-aware policy and value networks. STICA represents each observation as a set of object-centric tokens, together with tokens for the agent action and the resulting reward, enabling the world model to predict token-level dynamics and interactions. The policy and value networks then estimate token-level cause--effect relations and use them in the attention layers, yielding causality-guided decision-making. Experiments on object-rich benchmarks demonstrate that STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Yu Liu, Xixun Lin, Yanmin Shang, Yangxi Li, Shi Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Knowledge graph reasoning (KGR) is the task of inferring new knowledge by performing logical deductions on knowledge graphs. Recently, large language models (LLMs) have demonstrated remarkable performance in complex reasoning tasks. Despite promising success, current LLM-based KGR methods still face two critical limitations. First, existing methods often extract reasoning paths indiscriminately, without assessing their different importance, which may introduce irrelevant noise that misleads LLMs. Second, while many methods leverage LLMs to dynamically explore potential reasoning paths, they require high retrieval demands and frequent LLM calls. To address these limitations, we propose PathMind, a novel framework designed to enhance faithful and interpretable reasoning by selectively guiding LLMs with important reasoning paths. Specifically, PathMind follows a "Retrieve-Prioritize-Reason" paradigm. First, it retrieves a query subgraph from KG through the retrieval module. Next, it introduces a path prioritization mechanism that identifies important reasoning paths using a semantic-aware path priority function, which simultaneously considers the accumulative cost and the estimated future cost for reaching the target. Finally, PathMind generates accurate and logically consistent responses via a dual-phase training strategy, including task-specific instruction tuning and path-wise preference alignment. Extensive experiments on benchmark datasets demonstrate that PathMind consistently outperforms competitive baselines, particularly on complex reasoning tasks with fewer input tokens, by identifying essential reasoning paths.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hongju Lee, Youngjun Park, Jisun An, Dongman Lee
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The expansion of short-term rental platforms, such as Airbnb, has significantly disrupted local housing markets, often leading to increased rental prices and housing affordability issues. Accurately forecasting regional Airbnb market trends can thus offer critical insights for policymakers and urban planners aiming to mitigate these impacts. This study proposes a novel time-series forecasting framework to predict three key Airbnb indicators -- Revenue, Reservation Days, and Number of Reservations -- at the regional level. Using a sliding-window approach, the model forecasts trends 1 to 3 months ahead. Unlike prior studies that focus on individual listings at fixed time points, our approach constructs regional representations by integrating listing features with external contextual factors such as urban accessibility and human mobility. We convert structured tabular data into prompt-based inputs for a Large Language Model (LLM), producing comprehensive regional embeddings. These embeddings are then fed into advanced time-series models (RNN, LSTM, Transformer) to better capture complex spatio-temporal dynamics. Experiments on Seoul's Airbnb dataset show that our method reduces both average RMSE and MAE by approximately 48% compared to conventional baselines, including traditional statistical and machine learning models. Our framework not only improves forecasting accuracy but also offers practical insights for detecting oversupplied regions and supporting data-driven urban policy decisions.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ahlam Alrehili, Areej Alhothali
Natural Language Processing
ๆฅ็ๆ่ฆ
Grammatical Error Correction (GEC) is an important aspect of natural language processing. Arabic has a complicated morphological and syntactic structure, posing a greater challenge than other languages. Even though modern neural models have improved greatly in recent years, the majority of previous attempts used individual models without taking into account the potential benefits of combining different systems. In this paper, we present one of the first multi-system approaches for correcting grammatical errors in Arabic, the Arab Enhanced Edit Selection System Complication (ArbESC+). Several models are used to collect correction proposals, which are represented as numerical features in the framework. A classifier determines and implements the appropriate corrections based on these features. In order to improve output quality, the framework uses support techniques to filter overlapping corrections and estimate decision reliability. A combination of AraT5, ByT5, mT5, AraBART, AraBART+Morph+GEC, and Text editing systems gave better results than a single model alone, with F0.5 at 82.63% on QALB-14 test data, 84.64% on QALB-15 L1 data, and 65.55% on QALB-15 L2 data. As one of the most significant contributions of this work, it's the first Arab attempt to integrate linguistic error correction. Improving existing models provides a practical step towards developing advanced tools that will benefit users and researchers of Arabic text processing.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuxiang Wang, Siwen Wang, Haowei Han, Ao Wang, Boya Liu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Operation recommendation for IoT devices refers to generating personalized device operations for users based on their context, such as historical operations, environment information, and device status. This task is crucial for enhancing user satisfaction and corporate profits. Existing recommendation models struggle with complex operation logic, diverse user preferences, and sensitive to suboptimal suggestions, limiting their applicability to IoT device operations. To address these issues, we propose DevPiolt, a LLM-based recommendation model for IoT device operations. Specifically, we first equip the LLM with fundamental domain knowledge of IoT operations via continual pre-training and multi-task fine-tuning. Then, we employ direct preference optimization to align the fine-tuned LLM with specific user preferences. Finally, we design a confidence-based exposure control mechanism to avoid negative user experiences from low-quality recommendations. Extensive experiments show that DevPiolt significantly outperforms baselines on all datasets, with an average improvement of 69.5% across all metrics. DevPiolt has been practically deployed in Xiaomi Home app for one quarter, providing daily operation recommendations to 255,000 users. Online experiment results indicate a 21.6% increase in unique visitor device coverage and a 29.1% increase in page view acceptance rates.
๐
2025-11-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Hao Jiang, Guoquan Wang, Donglin Zhou, Sheng Yu, Yang Zeng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Large Language Models (LLMs) have enhanced text-based recommendation by enriching traditional ID-based methods with semantic generalization capabilities. Text-based methods typically encode item textual information via prompt design and generate discrete semantic IDs through item tokenization. However, in domain-specific tasks such as local-life services, simply injecting location information into prompts fails to capture fine-grained spatial characteristics and real-world distance awareness among items. To address this, we propose LGSID, an LLM-Aligned Geographic Item Tokenization Framework for Local-life Recommendation. This framework consists of two key components: (1) RL-based Geographic LLM Alignment, and (2) Hierarchical Geographic Item Tokenization. In the RL-based alignment module, we initially train a list-wise reward model to capture real-world spatial relationships among items. We then introduce a novel G-DPO algorithm that uses pre-trained reward model to inject generalized spatial knowledge and collaborative signals into LLMs while preserving their semantic understanding. Furthermore, we propose a hierarchical geographic item tokenization strategy, where primary tokens are derived from discrete spatial and content attributes, and residual tokens are refined using the aligned LLM's geographic representation vectors. Extensive experiments on real-world Kuaishou industry datasets show that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further validate its effectiveness.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yaniv Oren, Joery A. de Vries, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin Bรถhmer
Machine Learning
ๆฅ็ๆ่ฆ
Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS. Through variance reduction and mitigation of path degeneracy, TSMCTS scales favorably with sequential compute while retaining the properties that make SMC natural to parallelize.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Kumud Tripathi, Aditya Srinivas Menon, Aman Gaurav, Raj Prakash Gohil, Pankaj Wasnik
ๆฅ็ๆ่ฆ
The Whisper model, an open-source automatic speech recognition system, is widely adopted for its strong performance across multilingual and zero-shot settings. However, it frequently suffers from hallucination errors, especially under noisy acoustic conditions. Previous works to reduce hallucinations in Whisper-style ASR systems have primarily focused on audio preprocessing or post-processing of transcriptions to filter out erroneous content. However, modifications to the Whisper model itself remain largely unexplored to mitigate hallucinations directly. To address this challenge, we present a two-stage architecture that first enhances encoder robustness through Adaptive Layer Attention (ALA) and further suppresses hallucinations using a multi-objective knowledge distillation (KD) framework. In the first stage, ALA groups encoder layers into semantically coherent blocks via inter-layer correlation analysis. A learnable multi-head attention module then fuses these block representations, enabling the model to jointly exploit low- and high-level features for more robust encoding. In the second stage, our KD framework trains the student model on noisy audio to align its semantic and attention distributions with a teacher model processing clean inputs. Our experiments on noisy speech benchmarks show notable reductions in hallucinations and word error rates, while preserving performance on clean speech. Together, ALA and KD offer a principled strategy to improve Whisper's reliability under real-world noisy conditions.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinlei Xiong, Wenbo Hu, Shuxun Zhou, Kaifeng Bi, Lingxi Xie et al.
Machine Learning
ๆฅ็ๆ่ฆ
Weather forecasting is fundamentally challenged by the chaotic nature of the atmosphere, necessitating probabilistic approaches to quantify uncertainty. While traditional ensemble prediction (EPS) addresses this through computationally intensive simulations, recent advances in Bayesian Deep Learning (BDL) offer a promising but often disconnected alternative. We bridge these paradigms through a unified hybrid Bayesian Deep Learning framework for ensemble weather forecasting that explicitly decomposes predictive uncertainty into epistemic and aleatoric components, learned via variational inference and a physics-informed stochastic perturbation scheme modeling flow-dependent atmospheric dynamics, respectively. We further establish a unified theoretical framework that rigorously connects BDL and EPS, providing formal theorems that decompose total predictive uncertainty into epistemic and aleatoric components under the hybrid BDL framework. We validate our framework on the large-scale 40-year ERA5 reanalysis dataset (1979-2019) with 0.25ยฐ spatial resolution. Experimental results show that our method not only improves forecast accuracy and yields better-calibrated uncertainty quantification but also achieves superior computational efficiency compared to state-of-the-art probabilistic diffusion models. We commit to making our code open-source upon acceptance of this paper.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Pattaraphon Kenny Wongchamcharoen, Paul Glasserman
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used in finance and economics, where prompt-based attempts against look-ahead bias implicitly assume that models understand chronology. We test this fundamental question with a series of chronological ordering tasks with increasing complexities over facts the model already knows from pre-training. Our tasks cover (1) chronological ordering, (2) conditional sorting (filter, then order), and (3) anachronism detection. We evaluate GPT-4.1, Claude-3.7 Sonnet, with and without Extended Thinking (ET), and GPT-5 across multiple reasoning-effort settings. Across models, Exact match rate drops sharply as sequences lengthen even while rank correlations stay high as LLMs largely preserve local order but struggle to maintain a single globally consistent timeline. In conditional sorting, most failures stem from the filtering step rather than the ordering step, but GPT-5 and Claude-3.7 Sonnet with Extended Thinking outshine normal models significantly. Lastly, anachronism detection is found to be the easiest task for the LLMs but performance still declines with increasingly overlapping timelines or entities. Overall, our main contribution is showing that allocating explicit reasoning budget helps with chronological ordering with GPT-5 at medium/high reasoning effort achieving flawless ordering at all lengths and perfect conditional sorting (both self-filtered and given-subset), whereas low/minimal effort degrades with longer lists, mirroring earlier models. Our findings delineate limits of current LLMs on chronological tasks, providing insights into task complexity, and demonstrate scenarios in which reasoning helps. These patterns are important for the real-time application of LLMs in finance. We release all code and evaluation templates to support full reproducibility.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ N Dinesh Reddy, Sudeep Pillai
RoboticsComputer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunhe Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Maritime vessel re-identification (Re-ID) plays a crucial role in advancing maritime monitoring and intelligent situational awareness systems. However, some existing vessel Re-ID methods are directly adapted from pedestrian-focused algorithms, making them ill-suited for mitigating the unique problems present in vessel images, particularly the greater intra-identity variations and more severe missing of local parts, which lead to the emergence of outlier samples within the same identity. To address these challenges, we propose the Multi-scale Correlation-aware Transformer Network (MCFormer), which explicitly models multi-scale correlations across the entire input set to suppress the adverse effects of outlier samples with intra-identity variations or local missing, incorporating two novel modules, the Global Correlation Module (GCM), and the Local Correlation Module (LCM). Specifically, GCM constructs a global similarity affinity matrix across all input images to model global correlations through feature aggregation based on inter-image consistency, rather than solely learning features from individual images as in most existing approaches. Simultaneously, LCM mines and aligns local features of positive samples with contextual similarity to extract local correlations by maintaining a dynamic memory bank, effectively compensating for missing or occluded regions in individual images. To further enhance feature robustness, MCFormer integrates global and local features that have been respectively correlated across multiple scales, effectively capturing latent relationships among image features. Experiments on three benchmarks demonstrate that MCFormer achieves state-of-the-art performance.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiazhuo Tian, Yachao Yuan
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In modern communication networks driven by 5G and the Internet of Things (IoT), effective network traffic flow classification is crucial for Quality of Service (QoS) management and security. Traditional centralized machine learning struggles with the distributed data and privacy concerns in these heterogeneous environments, while existing federated learning approaches suffer from high costs and poor generalization. To address these challenges, we propose HFL-FlowLLM, which to our knowledge is the first framework to apply large language models to network traffic flow classification in heterogeneous federated learning. Compared to state-of-the-art heterogeneous federated learning methods for network traffic flow classification, the proposed approach improves the average F1 score by approximately 13%, demonstrating compelling performance and strong robustness. When compared to existing large language models federated learning frameworks, as the number of clients participating in each training round increases, the proposed method achieves up to a 5% improvement in average F1 score while reducing the training costs by about 87%. These findings prove the potential and practical value of HFL-FlowLLM in modern communication networks security.
๐
2025-11-18
๐ ArXiv Preprint (cs.CY)
๐ฅ Wendy Wong, Yuchao Jiang, Yuekang Li
Robotics
ๆฅ็ๆ่ฆ
Although CS programs are booming, introductory courses like CS1 still adopt a one-size-fits-all formats that can exacerbate cognitive load and discourage learners with autism, ADHD, dyslexia and other neurological conditions. These call for compassionate pedagogies and Universal Design For Learning (UDL) to create learning environments and materials where cognitive diversity is welcomed. To address this, we introduce DiverseClaire a pilot study, which simulates students including neurodiverse profiles using LLMs and diverse personas. By leveraging Bloom's Taxonomy and UDL, DiverseClaire compared UDL-transformed lecture slides with traditional formats. To evaluate DiverseClaire controlled experiments, we used the evaluation metric the average score. The findings revealed that the simulated neurodiverse students struggled with learning due to lecture slides that were in inaccessible formats. These results highlight the need to provide course materials in multiple formats for diverse learner preferences. Data from our pilot study will be made available to assist future CS1 instructors.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin et al.
Multimodal
ๆฅ็ๆ่ฆ
Precise event spotting (PES) aims to recognize fine-grained events at exact moments and has become a key component of sports analytics. This task is particularly challenging due to rapid succession, motion blur, and subtle visual differences. Consequently, most existing methods rely on domain-specific, end-to-end training with large labeled datasets and often struggle in few-shot conditions due to their dependence on pixel- or pose-based inputs alone. However, obtaining large labeled datasets is practically hard. We propose a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot PES. UMEG-Net integrates human skeletons and sport-specific object keypoints into a unified graph and features an efficient spatio-temporal extraction module based on advanced GCN and multi-scale temporal shift. To further enhance performance, we employ multimodal distillation to transfer knowledge from keypoint-based graphs to visual representations. Our approach achieves robust performance with limited labeled data and significantly outperforms baseline models in few-shot settings, providing a scalable and effective solution for few-shot PES. Code is publicly available at https://github.com/LZYAndy/UMEG-Net.
๐
2025-11-18
๐ 9 pages, 8 figures, submitted to IEEE RA-L
๐ฅ Zhuo Li, Junjia Liu, Zhipeng Dong, Tao Teng, Quentin Rouxel et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have demonstrated significant potential in real-world robotic manipulation. However, pre-trained VLA policies still suffer from substantial performance degradation during downstream deployment. Although fine-tuning can mitigate this issue, its reliance on costly demonstration collection and intensive computation makes it impractical in real-world settings. In this work, we introduce VLA-Pilot, a plug-and-play inference-time policy steering method for zero-shot deployment of pre-trained VLA without any additional fine-tuning or data collection. We evaluate VLA-Pilot on six real-world downstream manipulation tasks across two distinct robotic embodiments, encompassing both in-distribution and out-of-distribution scenarios. Experimental results demonstrate that VLA-Pilot substantially boosts the success rates of off-the-shelf pre-trained VLA policies, enabling robust zero-shot generalization to diverse tasks and embodiments. Experimental videos and code are available at: https://rip4kobe.github.io/vla-pilot/.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Naveen Lamba, Sanju Tiwari, Manas Gaur
ๆฅ็ๆ่ฆ
LLMs still struggle with hallucination, especially when confronted with symbolic triggers like modifiers, negation, numbers, exceptions, and named entities. Yet, we lack a clear understanding of where these symbolic hallucinations originate, making it crucial to systematically handle such triggers and localize the emergence of hallucination inside the model. While prior work explored localization using statistical techniques like LSC and activation variance analysis, these methods treat all tokens equally and overlook the role symbolic linguistic knowledge plays in triggering hallucinations. So far, no approach has investigated how symbolic elements specifically drive hallucination failures across model layers, nor has symbolic linguistic knowledge been used as the foundation for a localization framework. We propose the first symbolic localization framework that leverages symbolic linguistic and semantic knowledge to meaningfully trace the development of hallucinations across all model layers. By focusing on how models process symbolic triggers, we analyze five models using HaluEval and TruthfulQA. Our symbolic knowledge approach reveals that attention variance for these linguistic elements explodes to critical instability in early layers (2-4), with negation triggering catastrophic variance levels, demonstrating that symbolic semantic processing breaks down from the very beginning. Through the lens of symbolic linguistic knowledge, despite larger model sizes, hallucination rates remain consistently high (78.3%-83.7% across Gemma variants), with steep attention drops for symbolic semantic triggers throughout deeper layers. Our findings demonstrate that hallucination is fundamentally a symbolic linguistic processing failure, not a general generation problem, revealing that symbolic semantic knowledge provides the key to understanding and localizing hallucination mechanisms in LLMs.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinliang Zhang, Lei Zhu, Hangzhou He, Shuang Zeng, Ourui Fu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have demonstrated substantial value in unified text-image understanding and reasoning, primarily by converting images into sequences of patch-level tokens that align with their architectural paradigm. However, patch-level tokenization leads to a quadratic growth in image tokens, burdening MLLMs' understanding and reasoning with enormous computation and memory. Additionally, the traditional patch-wise scanning tokenization workflow misaligns with the human vision cognition system, further leading to hallucination and computational redundancy. To address this issue, we propose an object-level token merging strategy for Adaptive Token compression, revealing the consistency with human vision system. The experiments are conducted on multiple comprehensive benchmarks, which show that our approach averagely, utilizes only 10% tokens while achieving almost 96% of the vanilla model's performance. More extensive experimental results in comparison with relevant works demonstrate the superiority of our method in balancing compression ratio and performance. Our code will be available.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junpeng Zhao, Lin Li, Kaixi Hu, Kaize Shi, Jingling Yuan
Machine Learning
ๆฅ็ๆ่ฆ
Signed graphs model complex relationships through positive and negative edges, with widespread real-world applications. Given the sensitive nature of such data, selective removal mechanisms have become essential for privacy protection. While graph unlearning enables the removal of specific data influences from Graph Neural Networks (GNNs), existing methods are designed for conventional GNNs and overlook the unique heterogeneous properties of signed graphs. When applied to Signed Graph Neural Networks (SGNNs), these methods lose critical sign information, degrading both model utility and unlearning effectiveness. To address these challenges, we propose Certified Signed Graph Unlearning (CSGU), which provides provable privacy guarantees while preserving the sociological principles underlying SGNNs. CSGU employs a three-stage method: (1) efficiently identifying minimal influenced neighborhoods via triangular structures, (2) applying sociological theories to quantify node importance for optimal privacy budget allocation, and (3) performing importance-weighted parameter updates to achieve certified modifications with minimal utility degradation. Extensive experiments demonstrate that CSGU outperforms existing methods, achieving superior performance in both utility preservation and unlearning effectiveness on SGNNs.
๐
2025-11-18
๐ AAAI 2025
๐ฅ Hao Lang, Fei Huang, Yongbin Li
ๆฅ็ๆ่ฆ
Future superhuman models will surpass the ability of humans and humans will only be able to \textit{weakly} supervise superhuman models. To alleviate the issue of lacking high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision. However, the invariable use of weak supervision in existing methods exposes issues in robustness, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates selective W2SG can help superalignment.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, Biqing Qi
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous FM (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure. In this work, we propose asynchronous flow matching VLA (AsyncVLA), a novel framework that introduces temporal flexibility in asynchronous FM (AFM) and enables self-correction in action generation. AsyncVLA breaks from the vanilla SFM in VLA models by generating the action tokens in a non-uniform time schedule with action context awareness. Besides, our method introduces the confidence rater to extract confidence of the initially generated actions, enabling the model to selectively refine inaccurate action tokens before execution. Moreover, we propose a unified training procedure for SFM and AFM that endows a single model with both modes, improving KV-cache utilization. Extensive experiments on robotic manipulation benchmarks demonstrate that AsyncVLA is data-efficient and exhibits self-correction ability. AsyncVLA achieves state-of-the-art results across general embodied evaluations due to its asynchronous generation in AFM. Our code is available at https://github.com/YuhuaJiang2002/AsyncVLA.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ An Yu, Weiheng Lu, Jian Li, Zhenfei Zhang, Yunhang Shen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLM), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce \textit{S}hot-aware \textit{M}ultimodal \textit{A}udio-enhanced \textit{R}etrieval of \textit{T}emporal \textit{S}egments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying \textbf{Shot-aware Token Compression}, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61\% increase in R1@0.5 and 2.59\% gain in R1@0.7 on Charades-STA.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sushant Mehta
Robotics
ๆฅ็ๆ่ฆ
Current agentic AI benchmarks predominantly evaluate task completion accuracy, while overlooking critical enterprise requirements such as cost-efficiency, reliability, and operational stability. Through systematic analysis of 12 main benchmarks and empirical evaluation of state-of-the-art agents, we identify three fundamental limitations: (1) absence of cost-controlled evaluation leading to 50x cost variations for similar precision, (2) inadequate reliability assessment where agent performance drops from 60\% (single run) to 25\% (8-run consistency), and (3) missing multidimensional metrics for security, latency, and policy compliance. We propose \textbf{CLEAR} (Cost, Latency, Efficacy, Assurance, Reliability), a holistic evaluation framework specifically designed for enterprise deployment. Evaluation of six leading agents on 300 enterprise tasks demonstrates that optimizing for accuracy alone yields agents 4.4-10.8x more expensive than cost-aware alternatives with comparable performance. Expert evaluation (N=15) confirms that CLEAR better predicts production success (correlation $ฯ=0.83$) compared to accuracy-only evaluation ($ฯ=0.41$).
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Promise Ekpo, Saesha Agarwal, Felix Grimm, Lekan Molu, Angelique Taylor
Machine Learning
ๆฅ็ๆ่ฆ
Enforcing a fair workload allocation among multiple agents tasked to achieve an objective in learning enabled demand side healthcare worker settings is crucial for consistent and reliable performance at runtime. Existing multi-agent reinforcement learning (MARL) approaches steer fairness by shaping reward through post hoc orchestrations, leaving no certifiable self-enforceable fairness that is immutable by individual agents at runtime. Contextualized within a setting where each agent shares resources with others, we address this shortcoming with a learning enabled optimization scheme among self-interested decision makers whose individual actions affect those of other agents. This extends the problem to a generalized Nash equilibrium (GNE) game-theoretic framework where we steer group policy to a safe and locally efficient equilibrium, so that no agent can improve its utility function by unilaterally changing its decisions. Fair-GNE models MARL as a constrained generalized Nash equilibrium-seeking (GNE) game, prescribing an ideal equitable collective equilibrium within the problem's natural fabric. Our hypothesis is rigorously evaluated in our custom-designed high-fidelity resuscitation simulator. Across all our numerical experiments, Fair-GNE achieves significant improvement in workload balance over fixed-penalty baselines (0.89 vs.\ 0.33 JFI, $p < 0.01$) while maintaining 86\% task success, demonstrating statistically significant fairness gains through adaptive constraint enforcement. Our results communicate our formulations, evaluation metrics, and equilibrium-seeking innovations in large multi-agent learning-based healthcare systems with clarity and principled fairness enforcement.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yu Zhong, Zihao Zhang, Rui Zhang, Lingdong Huang, Haihan Gao et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. Despite their strengths, a substantial gap in task completion performance persists between LLM-based approaches and domain experts, as LLMs inherently struggle to comprehend real-world spatial correlations precisely. Additionally, introducing LLMs is accompanied with substantial computational cost and inference latency. To address these issues, we propose a novel dual-process thinking framework dubbed R3, integrating LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner. The framework comprises three core modules: Runner, Ruminator, and Regulator. The Runner is a lightweight transformer-based expert model that ensures efficient and accurate navigation under regular circumstances. The Ruminator employs a powerful multimodal LLM as the backbone and adopts chain-of-thought (CoT) prompting to elicit structured reasoning. The Regulator monitors the navigation progress and controls the appropriate thinking mode according to three criteria, integrating Runner and Ruminator harmoniously. Experimental results illustrate that R3 significantly outperforms other state-of-the-art methods, exceeding 3.28% and 3.30% in SPL and RGSPL respectively on the REVERIE benchmark. This pronounced enhancement highlights the effectiveness of our method in handling challenging VLN tasks.
๐
2025-11-18
๐ 3rd-place solution for the ACM ICAIF 2025 Agentic
๐ฅ Chun Chet Ng, Jia Yu Lim, Wei Zeng Low
Natural Language Processing
ๆฅ็ๆ่ฆ
With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task-relevant information from lengthy financial filings is essential for both operational and analytical decision-making. The FinAgentBench dataset formalizes this problem through two tasks: document ranking and chunk ranking. We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and a lightweight multi-agent system. Each component is examined extensively to reveal their synergies: prompt engineering provides precise task instructions, ICL supplies semantically relevant few-shot examples, and the multi-agent system models coordinated scoring behaviour. Our best configuration achieves an NDCG@5 of 0.71818 on the restricted validation split. We further demonstrate that PRISM is feasible and robust for production-scale financial retrieval. Its modular, inference-only design makes it practical for real-world use cases. The source code is released at https://bit.ly/prism-ailens.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Zhen, Yunxiang Yang, Jidong J. Yang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Pedestrian-vehicle incidents remain a critical urban safety challenge, with pedestrians accounting for over 20% of global traffic fatalities. Although existing video-based systems can detect when incidents occur, they provide little insight into how these events unfold across the distinct cognitive phases of pedestrian behavior. Recent vision-language models (VLMs) have shown strong potential for video understanding, but they remain limited in that they typically process videos in isolation, without explicit temporal structuring or multi-view integration. This paper introduces Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning (MP-PVIR), a unified framework that systematically processes multi-view video streams into structured diagnostic reports through four stages: (1) event-triggered multi-view video acquisition, (2) pedestrian behavior phase segmentation, (3) phase-specific multi-view reasoning, and (4) hierarchical synthesis and diagnostic reasoning. The framework operationalizes behavioral theory by automatically segmenting incidents into cognitive phases, performing synchronized multi-view analysis within each phase, and synthesizing results into causal chains with targeted prevention strategies. Particularly, two specialized VLMs underpin the MP-PVIR pipeline: TG-VLM for behavioral phase segmentation (mIoU = 0.4881) and PhaVR-VLM for phase-aware multi-view analysis, achieving a captioning score of 33.063 and up to 64.70% accuracy on question answering. Finally, a designated large language model is used to generate comprehensive reports detailing scene understanding, behavior interpretation, causal reasoning, and prevention recommendations. Evaluation on the Woven Traffic Safety dataset shows that MP-PVIR effectively translates multi-view video data into actionable insights, advancing AI-driven traffic safety analytics for vehicle-infrastructure cooperative systems.
๐
2025-11-18
๐ ArXiv Preprint (Multimedia)
๐ฅ Liuyi Jin, Amran Haroon, Radu Stoleru, Pasan Gunawardena, Michael Middleton et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Timely and accurate pre-arrival video streaming and analytics are critical for emergency medical services (EMS) to deliver life-saving interventions. Yet, current-generation EMS infrastructure remains constrained by one-to-one video streaming and limited analytics capabilities, leaving dispatchers and EMTs to manually interpret overwhelming, often noisy or redundant information in high-stress environments. We present TeleEMS, a mobile live video analytics system that enables pre-arrival multimodal inference by fusing audio and video into a unified decision-making pipeline before EMTs arrive on scene.
TeleEMS comprises two key components: TeleEMS Client and TeleEMS Server. The TeleEMS Client runs across phones, smart glasses, and desktops to support bystanders, EMTs en route, and 911 dispatchers. The TeleEMS Server, deployed at the edge, integrates EMS-Stream, a communication backbone that enables smooth multi-party video streaming. On top of EMSStream, the server hosts three real-time analytics modules: (1) audio-to-symptom analytics via EMSLlama, a domain-specialized LLM for robust symptom extraction and normalization; (2) video-to-vital analytics using state-of-the-art rPPG methods for heart rate estimation; and (3) joint text-vital analytics via PreNet, a multimodal multitask model predicting EMS protocols, medication types, medication quantities, and procedures.
Evaluation shows that EMSLlama outperforms GPT-4o (exact-match 0.89 vs. 0.57) and that text-vital fusion improves inference robustness, enabling reliable pre-arrival intervention recommendations. TeleEMS demonstrates the potential of mobile live video analytics to transform EMS operations, bridging the gap between bystanders, dispatchers, and EMTs, and paving the way for next-generation intelligent EMS infrastructure.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Agamdeep Singh, Ashish Tiwari, Hosein Hasanbeig, Priyanshu Gupta
Machine Learning
ๆฅ็ๆ่ฆ
Many machine learning tasks involve inherent subjectivity, where annotators naturally provide varied labels. Standard practice collapses these label distributions into single labels, aggregating diverse human judgments into point estimates. We argue that this approach is epistemically misaligned for ambiguous data--the annotation distribution itself should be regarded as the ground truth. Training on collapsed single labels forces models to express false confidence on fundamentally ambiguous cases, creating a misalignment between model certainty and the diversity of human perception. We demonstrate empirically that soft-label training, which treats annotation distributions as ground truth, preserves epistemic uncertainty. Across both vision and NLP tasks, soft-label training achieves 32% lower KL divergence from human annotations and 61% stronger correlation between model and annotation entropy, while matching the accuracy of hard-label training. Our work repositions annotation distributions from noisy signals to be aggregated away, to faithful representations of epistemic uncertainty that models should learn to reproduce.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Truong Vo, Weiyi Wu, Kaize Ding
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic ICD coding from clinical text is a critical task in medical NLP but remains hindered by the extreme long-tail distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets like MIMIC-III, leading to low macro-F1 scores. In this work, we propose a data-centric framework that generates high-quality synthetic discharge summaries to mitigate this imbalance. Our method constructs realistic multi-label code sets anchored on rare codes by leveraging real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. Using these structured prompts, we generate 90,000 synthetic notes covering 7,902 ICD codes, significantly expanding the training distribution. We fine-tune two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, on both the original and extended datasets. Experiments show that our approach modestly improves macro-F1 while maintaining strong micro-F1, outperforming prior SOTA. While the gain may seem marginal relative to the computational cost, our results demonstrate that carefully crafted synthetic data can enhance equity in long-tail ICD code prediction.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Srivathsan Sivakumar, Faisal Z. Qureshi
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Vision Transformers (ViTs) have demonstrated remarkable performance across a range of computer vision tasks; however, their high computational, memory, and energy demands hinder deployment on resource-constrained platforms. In this paper, we propose \emph{Cascaded-ViT (CViT)}, a lightweight and compute-efficient vision transformer architecture featuring a novel feedforward network design called \emph{Cascaded-Chunk Feed Forward Network (CCFFN)}. By splitting input features, CCFFN improves parameter and FLOP efficiency without sacrificing accuracy. Experiments on ImageNet-1K show that our \emph{CViT-XL} model achieves 75.5\% Top-1 accuracy while reducing FLOPs by 15\% and energy consumption by 3.3\% compared to EfficientViT-M5. Across various model sizes, the CViT family consistently exhibits the lowest energy consumption, making it suitable for deployment on battery-constrained devices such as mobile phones and drones. Furthermore, when evaluated using a new metric called \emph{Accuracy-Per-FLOP (APF)}, which quantifies compute efficiency relative to accuracy, CViT models consistently achieve top-ranking efficiency. Particularly, CViT-L is 2.2\% more accurate than EfficientViT-M2 while having comparable APF scores.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xinpeng Chen, Xiaofeng Han, Kaihao Zhang, Guochao Ren, Yujie Wang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Layout design is a crucial step in developing mobile app pages. However, crafting satisfactory designs is time-intensive for designers: they need to consider which controls and content to present on the page, and then repeatedly adjust their size, position, and style for better aesthetics and structure. Although many design software can now help to perform these repetitive tasks, extensive training is needed to use them effectively. Moreover, collaborative design across app pages demands extra time to align standards and ensure consistent styling. In this work, we propose APD-agents, a large language model (LLM) driven multi-agent framework for automated page design in mobile applications. Our framework contains OrchestratorAgent, SemanticParserAgent, PrimaryLayoutAgent, TemplateRetrievalAgent, and RecursiveComponentAgent. Upon receiving the user's description of the page, the OrchestratorAgent can dynamically can direct other agents to accomplish users' design task. To be specific, the SemanticParserAgent is responsible for converting users' descriptions of page content into structured data. The PrimaryLayoutAgent can generate an initial coarse-grained layout of this page. The TemplateRetrievalAgent can fetch semantically relevant few-shot examples and enhance the quality of layout generation. Besides, a RecursiveComponentAgent can be used to decide how to recursively generate all the fine-grained sub-elements it contains for each element in the layout. Our work fully leverages the automatic collaboration capabilities of large-model-driven multi-agent systems. Experimental results on the RICO dataset show that our APD-agents achieve state-of-the-art performance.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingren Liu, Shuning Xu, Qirui Yang, Yun Wang, Xiangyu Chen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Adit Jain, Vikram Krishnamurthy, Yiming Zhang
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we model and analyze how a network of interacting LLMs performs collaborative question-answering (CQA) in order to estimate a ground truth given a distributed set of documents. This problem is interesting because LLMs often hallucinate when direct evidence to answer a question is lacking, and these effects become more pronounced in a network of interacting LLMs. The hallucination spreads, causing previously accurate LLMs to hallucinate. We study interacting LLMs and their hallucination by combining novel ideas of mean-field dynamics (MFD) from network science and the randomized utility model from economics to construct a useful generative model. We model the LLM with a latent state that indicates if it is truthful or not with respect to the ground truth, and extend a tractable analytical model considering an MFD to model the diffusion of information in a directed network of LLMs. To specify the probabilities that govern the dynamics of the MFD, we propose a randomized utility model. For a network of LLMs, where each LLM has two possible latent states, we posit sufficient conditions for the existence and uniqueness of a fixed point and analyze the behavior of the fixed point in terms of the incentive (e.g., test-time compute) given to individual LLMs. We experimentally study and analyze the behavior of a network of $100$ open-source LLMs with respect to data heterogeneity, node capability, network structure, and sensitivity to framing on multiple semi-synthetic datasets.
๐
2025-11-18
๐ NeurIPS 2025
๐ฅ Junchen Li, Rongzheng Wang, Yihong Huang, Qizhi Chen, Jiasheng Zhang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-augmented generation (RAG) greatly enhances large language models (LLMs) performance in knowledge-intensive tasks. However, naive RAG methods struggle with multi-hop question answering due to their limited capacity to capture complex dependencies across documents. Recent studies employ graph-based RAG to capture document connections. However, these approaches often result in a loss of semantic coherence and introduce irrelevant noise during node matching and subgraph construction. To address these limitations, we propose NeuroPath, an LLM-driven semantic path tracking RAG framework inspired by the path navigational planning of place cells in neurobiology. It consists of two steps: Dynamic Path Tracking and Post-retrieval Completion. Dynamic Path Tracking performs goal-directed semantic path tracking and pruning over the constructed knowledge graph (KG), improving noise reduction and semantic coherence. Post-retrieval Completion further reinforces these benefits by conducting second-stage retrieval using intermediate reasoning and the original query to refine the query goal and complete missing information in the reasoning path. NeuroPath surpasses current state-of-the-art baselines on three multi-hop QA datasets, achieving average improvements of 16.3% on recall@2 and 13.5% on recall@5 over advanced graph-based RAG methods. Moreover, compared to existing iter-based RAG methods, NeuroPath achieves higher accuracy and reduces token consumption by 22.8%. Finally, we demonstrate the robustness of NeuroPath across four smaller LLMs (Llama3.1, GLM4, Mistral0.3, and Gemma3), and further validate its scalability across tasks of varying complexity. Code is available at https://github.com/KennyCaty/NeuroPath.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jun Ding, Shang Gao
Computer VisionMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Medical image segmentation underpins computer-aided diagnosis and therapy by supporting clinical diagnosis, preoperative planning, and disease monitoring. While U-Net style convolutional neural networks perform well due to their encoder-decoder structures with skip connections, they struggle to capture long-range dependencies. Transformer-based variants address global context but often require heavy computation and large training datasets. This paper proposes GCA-ResUNet, an efficient segmentation network that integrates Grouped Coordinate Attention (GCA) into ResNet-50 residual blocks. GCA uses grouped coordinate modeling to jointly encode global dependencies across channels and spatial locations, strengthening feature representation and boundary delineation while adding minimal parameter and FLOP overhead compared with self-attention. On the Synapse dataset, GCA-ResUNet achieves a Dice score of 86.11%, and on the ACDC dataset, it reaches 92.64%, surpassing several state-of-the-art baselines while maintaining fast inference and favorable computational efficiency. These results indicate that GCA offers a practical way to enhance convolutional architectures with global modeling capability, enabling high-accuracy and resource-efficient medical image segmentation.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yue Zhang, Zun Wang, Han Lin, Jialu Li, Jianing Yang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than spatial understanding due to scarce 3D resources, leaving inherent grounding biases unresolved. To address this, we propose 3D scene editing as a key mechanism to generate precise visual counterfactuals that mitigate these biases through fine-grained spatial manipulation, without requiring costly scene reconstruction or large-scale 3D data collection. Furthermore, to make these edits targeted and directly address the specific weaknesses of the model, we introduce DEER-3D, an error-driven framework following a structured "Decompose, Diagnostic Evaluation, Edit, and Re-train" workflow, rather than broadly or randomly augmenting data as in conventional approaches. Specifically, upon identifying a grounding failure of the 3D-LLM, our framework first diagnoses the exact predicate-level error (e.g., attribute or spatial relation). It then executes minimal, predicate-aligned 3D scene edits, such as recoloring or repositioning, to produce targeted counterfactual supervision for iterative model fine-tuning, significantly enhancing grounding accuracy. We evaluate our editing pipeline across multiple benchmarks for 3D grounding and scene understanding tasks, consistently demonstrating improvements across all evaluated datasets through iterative refinement. DEER-3D underscores the effectiveness of targeted, error-driven scene editing in bridging linguistic reasoning capabilities with spatial grounding in 3D LLMs.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhonghao Liu, Hanxue Gu, Qihang Li, Michael Fox, Jay M. Levin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Reliable measurement of glenoid bone loss is essential for operative planning in shoulder instability, but current manual and semi-automated methods are time-consuming and often subject to interreader variability. We developed and validated a fully automated deep learning pipeline for measuring glenoid bone loss on three-dimensional computed tomography (CT) scans using a linear-based, en-face view, best-circle method. Shoulder CT images of 91 patients (average age, 40 years; range, 14-89 years; 65 men) were retrospectively collected along with manual labels including glenoid segmentation, landmarks, and bone loss measurements. The multi-stage algorithm has three main stages: (1) segmentation, where we developed a U-Net to automatically segment the glenoid and humerus; (2) anatomical landmark detection, where a second network predicts glenoid rim points; and (3) geometric fitting, where we applied principal component analysis (PCA), projection, and circle fitting to compute the percentage of bone loss. The automated measurements showed strong agreement with consensus readings and exceeded surgeon-to-surgeon consistency (intraclass correlation coefficient (ICC) 0.84 vs 0.78), including in low- and high-bone-loss subgroups (ICC 0.71 vs 0.63 and 0.83 vs 0.21, respectively; P < 0.001). For classifying patients into low, medium, and high bone-loss categories, the pipeline achieved a recall of 0.714 for low and 0.857 for high severity, with no low cases misclassified as high or vice versa. These results suggest that our method is a time-efficient and clinically reliable tool for preoperative planning in shoulder instability and for screening patients with substantial glenoid bone loss. Code and dataset are available at https://github.com/Edenliu1/Auto-Glenoid-Measurement-DL-Pipeline.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Keya Hu, Ali Cy, Linlu Qiu, Xiaoman Delores Ding, Runqian Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rui Tian, Mingfei Gao, Haiming Gang, Jiasen Lu, Zhe Gan et al.
Computer VisionMultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. Especially, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the editing instruction comprehension that is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves 0.89 and 4.31 overall scores on GenEval and ImgEdit that surpass the state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distilled a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alexander Vedernikov, Puneet Kumar, Haoyu Chen, Tapio Seppรคnen, Xiaobai Li
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Engagement recognition in video datasets, unlike traditional image classification tasks, is particularly challenged by subjective labels and noise limiting model performance. To overcome the challenges of subjective and noisy engagement labels, we propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We also introduce a training strategy combining curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting benefits of addressing label subjectivity with VLMs. This method surpasses prior state of the art across engagement benchmarks such as EngageNet (three of six feature settings, maximum improvement of +1.21%), and DREAMS / PAFE with F1 gains of +0.22 / +0.06.
๐
2025-11-18
๐ Accepted at IEEE Transactions on Visualization and
๐ฅ Stefan Cobeli, Kazi Shahrukh Omar, Rodrigo Valenรงa, Nivan Ferreira, Fabio Miranda
ๆฅ็ๆ่ฆ
Despite the growing availability of 3D urban datasets, extracting insights remains challenging due to computational bottlenecks and the complexity of interacting with data. In fact, the intricate geometry of 3D urban environments results in high degrees of occlusion and requires extensive manual viewpoint adjustments that make large-scale exploration inefficient. To address this, we propose a view-based approach for 3D data exploration, where a vector field encodes views from the environment. To support this approach, we introduce a neural field-based method that constructs an efficient implicit representation of 3D environments. This representation enables both faster direct queries, which consist of the computation of view assessment indices, and inverse queries, which help avoid occlusion and facilitate the search for views that match desired data patterns. Our approach supports key urban analysis tasks such as visibility assessments, solar exposure evaluation, and assessing the visual impact of new developments. We validate our method through quantitative experiments, case studies informed by real-world urban challenges, and feedback from domain experts. Results show its effectiveness in finding desirable viewpoints, analyzing building facade visibility, and evaluating views from outdoor spaces. Code and data are publicly available at https://urbantk.org/neural-3d.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Wang, Liya Ji, Zhanghan Ke, Harry Yang, Ser-Nam Lim et al.
ๆฅ็ๆ่ฆ
We propose an approach to enhancing synthetic video realism, which can re-render synthetic videos from a simulator in photorealistic fashion. Our realism enhancement approach is a zero-shot framework that focuses on preserving the multi-level structures from synthetic videos into the enhanced one in both spatial and temporal domains, built upon a diffusion video foundational model without further fine-tuning. Specifically, we incorporate an effective modification to have the generation/denoising process conditioned on estimated structure-aware information from the synthetic video, such as depth maps, semantic maps, and edge maps, by an auxiliary model, rather than extracting the information from a simulator. This guidance ensures that the enhanced videos are consistent with the original synthetic video at both the structural and semantic levels. Our approach is a simple yet general and powerful approach to enhancing synthetic video realism: we show that our approach outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism quality in our experiments.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiyuan Wang, Muhan Zhang
ๆฅ็ๆ่ฆ
Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to ``latent collapse'', where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation based unsupervised learning method. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet $256\times 256$ conditional generation task: FID=13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs on ImageNet, without using classifier-free-guidance.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunfeng Wu, Jiayi Song, Zhenxiong Tan, Zihao He, Songhua Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
The quadratic time and memory complexity of the attention mechanism in modern Transformer based video generators makes end-to-end training for ultra high resolution videos prohibitively expensive. Motivated by this limitation, we introduce a training-free approach that leverages video Diffusion Transformers pretrained at their native scale to synthesize higher resolution videos without any additional training or adaptation. At the core of our method lies an inward sliding window attention mechanism, which originates from a key observation: maintaining each query token's training scale receptive field is crucial for preserving visual fidelity and detail. However, naive local window attention, unfortunately, often leads to repetitive content and exhibits a lack of global coherence in the generated results. To overcome this challenge, we devise a dual-path pipeline that backs up window attention with a novel cross-attention override strategy, enabling the semantic content produced by local attention to be guided by another branch with a full receptive field and, therefore, ensuring holistic consistency. Furthermore, to improve efficiency, we incorporate a cross-attention caching strategy for this branch to avoid the frequent computation of full 3D attention. Extensive experiments demonstrate that our method delivers ultra-high-resolution videos with fine-grained visual details and high efficiency in a training-free paradigm. Meanwhile, it achieves superior performance on VBench, even compared to training-based alternatives, with competitive or improved efficiency. Codes are available at: https://github.com/WillWu111/FreeSwim
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Farheen Ramzan, Yusuf Kiberu, Nikesh Jathanna, Meryem Jabrane, Vicente Grau et al.
Multimodal
ๆฅ็ๆ่ฆ
Accurate segmentation of myocardial scar from late gadolinium enhanced (LGE) cardiac MRI is essential for evaluating tissue viability, yet remains challenging due to variable contrast and imaging artifacts. Electrocardiogram (ECG) signals provide complementary physiological information, as conduction abnormalities can help localize or suggest scarred myocardial regions. In this work, we propose a novel multimodal framework that integrates ECG-derived electrophysiological information with anatomical priors from the AHA-17 atlas for physiologically consistent LGE-based scar segmentation. As ECGs and LGE-MRIs are not acquired simultaneously, we introduce a Temporal Aware Feature Fusion (TAFF) mechanism that dynamically weights and fuses features based on their acquisition time difference. Our method was evaluated on a clinical dataset and achieved substantial gains over the state-of-the-art image-only baseline (nnU-Net), increasing the average Dice score for scars from 0.6149 to 0.8463 and achieving high performance in both precision (0.9115) and sensitivity (0.9043). These results show that integrating physiological and anatomical knowledge allows the model to "see beyond the image", setting a new direction for robust and physiologically grounded cardiac scar segmentation.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sriram Srinivasan, Srinivasan Aruchamy, Siva Ram Krisha Vadali
Machine Learning
ๆฅ็ๆ่ฆ
Seismic sensing has emerged as a promising solution for border surveillance and monitoring; the seismic sensors that are often buried underground are small and cannot be noticed easily, making them difficult for intruders to detect, avoid, or vandalize. This significantly enhances their effectiveness compared to highly visible cameras or fences. However, accurately detecting and distinguishing between overlapping activities that are happening simultaneously, such as human intrusions, animal movements, and vehicle rumbling, remains a major challenge due to the complex and noisy nature of seismic signals. Correctly identifying simultaneous activities is critical because failing to separate them can lead to misclassification, missed detections, and an incomplete understanding of the situation, thereby reducing the reliability of surveillance systems. To tackle this problem, we propose HyMAD (Hybrid Multi-Activity Detection), a deep neural architecture based on spatio-temporal feature fusion. The framework integrates spectral features extracted with SincNet and temporal dependencies modeled by a recurrent neural network (RNN). In addition, HyMAD employs self-attention layers to strengthen intra-modal representations and a cross-modal fusion module to achieve robust multi-label classification of seismic events. e evaluate our approach on a dataset constructed from real-world field recordings collected in the context of border surveillance and monitoring, demonstrating its ability to generalize to complex, simultaneous activity scenarios involving humans, animals, and vehicles. Our method achieves competitive performance and offers a modular framework for extending seismic-based activity recognition in real-world security applications.
๐
2025-11-18
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Kallol Mondal, Ankush Kumar
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Attention is the brain's ability to selectively focus on a few specific aspects while ignoring irrelevant ones. This biological principle inspired the attention mechanism in modern Transformers. Transformers now underpin large language models (LLMs) such as GPT, but at the cost of massive training and inference energy, leading to a large carbon footprint. While brain attention emerges from neural circuits, Transformer attention relies on dot-product similarity to weight elements in the input sequence. Neuromorphic computing, especially spiking neural networks (SNNs), offers a brain-inspired path to energy-efficient intelligence. Despite recent work on attention-based spiking Transformers, the core attention layer remains non-neuromorphic. Current spiking attention (i) relies on dot-product or element-wise similarity suited to floating-point operations, not event-driven spikes; (ii) keeps attention matrices that suffer from the von Neumann bottleneck, limiting in-memory computing; and (iii) still diverges from brain-like computation. To address these issues, we propose the Spiking STDP Transformer (S$^{2}$TDPT), a neuromorphic Transformer that implements self-attention through spike-timing-dependent plasticity (STDP), embedding query--key correlations in synaptic weights. STDP, a core mechanism of memory and learning in the brain and widely studied in neuromorphic devices, naturally enables in-memory computing and supports non-von Neumann hardware. On CIFAR-10 and CIFAR-100, our model achieves 94.35\% and 78.08\% accuracy with only four timesteps and 0.49 mJ on CIFAR-100, an 88.47\% energy reduction compared to a standard ANN Transformer. Grad-CAM shows that the model attends to semantically relevant regions, enhancing interpretability. Overall, S$^{2}$TDPT illustrates how biologically inspired attention can yield energy-efficient, hardware-friendly, and explainable neuromorphic models.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shiyar Jamo
ๆฅ็ๆ่ฆ
Automatic age estimation is widely used for age verification, where input images often vary considerably in resolution. This study evaluates the effect of image resolution on age estimation accuracy using DeepFace and InsightFace. A total of 1000 images from the IMDB-Clean dataset were processed in seven resolutions, resulting in 7000 test samples. Performance was evaluated using Mean Absolute Error (MAE), Standard Deviation (SD), and Median Absolute Error (MedAE). Based on this study, we conclude that input image resolution has a clear and consistent impact on the accuracy of age estimation in both DeepFace and InsightFace. Both frameworks achieve optimal performance at 224x224 pixels, with an MAE of 10.83 years (DeepFace) and 7.46 years (InsightFace). At low resolutions, MAE increases substantially, while very high resolutions also degrade accuracy. InsightFace is consistently faster than DeepFace across all resolutions.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marius Dubosc, Yann Fischer, Zacharie Auray, Nicolas Boutry, Edwin Carlinet et al.
Machine Learning
ๆฅ็ๆ่ฆ
Doppler holography is an emerging retinal imaging technique that captures the dynamic behavior of blood flow with high temporal resolution, enabling quantitative assessment of retinal hemodynamics. This requires accurate segmentation of retinal arteries and veins, but traditional segmentation methods focus solely on spatial information and overlook the temporal richness of holographic data. In this work, we propose a simple yet effective approach for artery-vein segmentation in temporal Doppler holograms using standard segmentation architectures. By incorporating features derived from a dedicated pulse analysis pipeline, our method allows conventional U-Nets to exploit temporal dynamics and achieve performance comparable to more complex attention- or iteration-based models. These findings demonstrate that time-resolved preprocessing can unlock the full potential of deep learning for Doppler holography, opening new perspectives for quantitative exploration of retinal hemodynamics. The dataset is publicly available at https://huggingface.co/datasets/DigitalHolography/
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ John M. Oyer, Ali Namvar, Benjamin A. Hoff, Wassim W. Labaki, Ella A. Kazerooni et al.
ๆฅ็ๆ่ฆ
Accurate airway segmentation from chest computed tomography (CT) scans is essential for quantitative lung analysis, yet manual annotation is impractical and many automated U-Net-based methods yield disconnected components that hinder reliable biomarker extraction. We present RepAir, a three-stage framework for robust 3D airway segmentation that combines an nnU-Net-based network with anatomically informed topology correction. The segmentation network produces an initial airway mask, after which a skeleton-based algorithm identifies potential discontinuities and proposes reconnections. A 1D convolutional classifier then determines which candidate links correspond to true anatomical branches versus false or obstructed paths. We evaluate RepAir on two distinct datasets: ATM'22, comprising annotated CT scans from predominantly healthy subjects and AeroPath, encompassing annotated scans with severe airway pathology. Across both datasets, RepAir outperforms existing 3D U-Net-based approaches such as Bronchinet and NaviAirway on both voxel-level and topological metrics, and produces more complete and anatomically consistent airway trees while maintaining high segmentation accuracy.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marco Acerbis, Swarnadip Chatterjee, Christophe Avenel, Joakim Lindblad
ๆฅ็ๆ่ฆ
Computational cytology faces two major challenges: i) instance-level labels are unreliable and prohibitively costly to obtain, ii) witness rates are extremely low. We propose SLAM-AGS, a Slide-Label-Aware Multitask pretraining framework that jointly optimizes (i) a weakly supervised similarity objective on slide-negative patches and (ii) a self-supervised contrastive objective on slide-positive patches, yielding stronger performance on downstream tasks. To stabilize learning, we apply Adaptive Gradient Surgery to tackle conflicting task gradients and prevent model collapse. We integrate the pretrained encoder into an attention-based Multiple Instance Learning aggregator for bag-level prediction and attention-guided retrieval of the most abnormal instances in a bag. On a publicly available bone-marrow cytology dataset, with simulated witness rates from 10% down to 0.5%, SLAM-AGS improves bag-level F1-Score and Top 400 positive cell retrieval over other pretraining methods, with the largest gains at low witness rates, showing that resolving gradient interference enables stable pretraining and better performance on downstream tasks. To facilitate reproducibility, we share our complete implementation and evaluation framework as open source: https://github.com/Ace95/SLAM-AGS.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Meiying Gu, Jiawei Zhang, Jiahe Li, Xiaohan Yu, Haonan Luo et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in optimizing Gaussian Splatting for scene geometry have enabled efficient reconstruction of detailed surfaces from images. However, when input views are sparse, such optimization is prone to overfitting, leading to suboptimal reconstruction quality. Existing approaches address this challenge by employing flattened Gaussian primitives to better fit surface geometry, combined with depth regularization to alleviate geometric ambiguities under limited viewpoints. Nevertheless, the increased anisotropy inherent in flattened Gaussians exacerbates overfitting in sparse-view scenarios, hindering accurate surface fitting and degrading novel view synthesis performance. In this paper, we propose \net{}, a method that reconstructs more accurate and detailed surfaces while preserving high-quality novel view rendering. Our key insight is to introduce Stereo Geometry-Texture Alignment, which bridges rendering quality and geometry estimation, thereby jointly enhancing both surface reconstruction and view synthesis. In addition, we present a Pseudo-Feature Enhanced Geometry Consistency that enforces multi-view geometric consistency by incorporating both training and unseen views, effectively mitigating overfitting caused by sparse supervision. Extensive experiments on the DTU, BlendedMVS, and Mip-NeRF360 datasets demonstrate that our method achieves the state-of-the-art performance.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kahaan Gandhi, Boris Bolliet, Inigo Zubeldia
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real-time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass at 1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Fokhrul Islam, Sajeda Al-Hammouri, Christopher J. Arellano, Kavan Hazeli, Heman Shakeri
ๆฅ็ๆ่ฆ
Falls are a leading cause of injury and loss of independence among older adults. Vision-based fall prediction systems offer a non-invasive solution to anticipate falls seconds before impact, but their development is hindered by the scarcity of available fall data. Contributing to these efforts, this study proposes the Biomechanical Spatio-Temporal Graph Convolutional Network (BioST-GCN), a dual-stream model that combines both pose and biomechanical information using a cross-attention fusion mechanism. Our model outperforms the vanilla ST-GCN baseline by 5.32% and 2.91% F1-score on the simulated MCF-UA stunt-actor and MUVIM datasets, respectively. The spatio-temporal attention mechanisms in the ST-GCN stream also provide interpretability by identifying critical joints and temporal phases. However, a critical simulation-reality gap persists. While our model achieves an 89.0% F1-score with full supervision on simulated data, zero-shot generalization to unseen subjects drops to 35.9%. This performance decline is likely due to biases in simulated data, such as `intent-to-fall' cues. For older adults, particularly those with diabetes or frailty, this gap is exacerbated by their unique kinematic profiles. To address this, we propose personalization strategies and advocate for privacy-preserving data pipelines to enable real-world validation. Our findings underscore the urgent need to bridge the gap between simulated and real-world data to develop effective fall prediction systems for vulnerable elderly populations.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohammad Vali Sanian, Arshia Hemmat, Amirhossein Vahidi, Jonas Maaskola, Jimmy Tsz Hang Lee et al.
Robotics
ๆฅ็ๆ่ฆ
A scalable and robust 3D tissue transcriptomics profile can enable a holistic understanding of tissue organization and provide deeper insights into human biology and disease. Most predictive algorithms that infer ST directly from histology treat each section independently and ignore 3D structure, while existing 3D-aware approaches are not generative and do not scale well. We present Holographic Tissue Expression Inpainting and Analysis (HoloTea), a 3D-aware flow-matching framework that imputes spot-level gene expression from H&E while explicitly using information from adjacent sections. Our key idea is to retrieve morphologically corresponding spots on neighboring slides in a shared feature space and fuse this cross section context into a lightweight ControlNet, allowing conditioning to follow anatomical continuity. To better capture the count nature of the data, we introduce a 3D-consistent prior for flow matching that combines a learned zero-inflated negative binomial (ZINB) prior with a spatial-empirical prior constructed from neighboring sections. A global attention block introduces 3D H&E scaling linearly with the number of spots in the slide, enabling training and inference on large 3D ST datasets. Across three spatial transcriptomics datasets spanning different tissue types and resolutions, HoloTea consistently improves 3D expression accuracy and generalization compared to 2D and 3D baselines. We envision HoloTea advancing the creation of accurate 3D virtual tissues, ultimately accelerating biomarker discovery and deepening our understanding of disease.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yilin Zhang, Leo D. Westbury, Elaine M. Dennison, Nicholas C. Harvey, Nicholas R. Fuggle et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Poor bone health is a significant public health concern, and low bone mineral density (BMD) leads to an increased fracture risk, a key feature of osteoporosis. We present XAttn-BMD (Cross-Attention BMD), a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata. It utilizes a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases. Extensive experiments on the data from the Hertfordshire Cohort Study show that our model outperforms the baseline models in regression generalization and robustness. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function. Experimental results show that the integration of multimodal data via cross-attention outperforms naive feature concatenation without cross-attention, reducing MSE by 16.7%, MAE by 6.03%, and increasing the R2 score by 16.4%, highlighting the effectiveness of the approach for femoral neck BMD estimation. Furthermore, screening performance was evaluated using binary classification at clinically relevant femoral neck BMD thresholds, demonstrating the model's potential in real-world scenarios.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nathaniel Putera, Daniel Vilet Rodrรญguez, Noah Videcrantz, Julia Machnio, Mostafa Mehdipour Ghazi
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Accurate modeling of cognitive decline in Alzheimer's disease is essential for early stratification and personalized management. While tabular predictors provide robust markers of global risk, their ability to capture subtle brain changes remains limited. In this study, we evaluate the predictive contributions of tabular and imaging-based representations, with a focus on transformer-derived Magnetic Resonance Imaging (MRI) embeddings. We introduce a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture heterogeneous patterns of cognitive change, and train a 3D Vision Transformer (ViT) via unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings without progression labels. The pretrained encoder embeddings are subsequently assessed using both traditional machine learning classifiers and deep learning heads, and compared against tabular representations and convolutional network baselines. Results highlight complementary strengths across modalities. Clinical and volumetric features achieved the highest AUCs of around 0.70 for predicting mild and severe progression, underscoring their utility in capturing global decline trajectories. In contrast, MRI embeddings from the ViT model were most effective in distinguishing cognitively stable individuals with an AUC of 0.71. However, all approaches struggled in the heterogeneous moderate group. These findings indicate that clinical features excel in identifying high-risk extremes, whereas transformer-based MRI embeddings are more sensitive to subtle markers of stability, motivating multimodal fusion strategies for AD progression modeling.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongqing Xie, Yonghuang Wu, Zisheng Ai, Jun Min, Zhencun Jiang et al.
Machine Learning
ๆฅ็ๆ่ฆ
The accurate segmentation of brain tumors from multi-modal MRI is critical for clinical diagnosis and treatment planning. While integrating complementary information from various MRI sequences is a common practice, the frequent absence of one or more modalities in real-world clinical settings poses a significant challenge, severely compromising the performance and generalizability of deep learning-based segmentation models. To address this challenge, we propose a novel Cross-Modal Compositional Self-Distillation (CCSD) framework that can flexibly handle arbitrary combinations of input modalities. CCSD adopts a shared-specific encoder-decoder architecture and incorporates two self-distillation strategies: (i) a hierarchical modality self-distillation mechanism that transfers knowledge across modality hierarchies to reduce semantic discrepancies, and (ii) a progressive modality combination distillation approach that enhances robustness to missing modalities by simulating gradual modality dropout during training. Extensive experiments on public brain tumor segmentation benchmarks demonstrate that CCSD achieves state-of-the-art performance across various missing-modality scenarios, with strong generalization and stability.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Julia Machnio, Mads Nielsen, Mostafa Mehdipour Ghazi
Machine Learning
ๆฅ็ๆ่ฆ
White matter hyperintensities (WMH) are key imaging markers in cognitive aging, Alzheimer's disease (AD), and related dementias. Although automated methods for WMH segmentation have advanced, most provide only global lesion load and overlook their spatial distribution across distinct white matter regions. We propose a deep learning framework for robust WMH segmentation and localization, evaluated across public datasets and an independent Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Our results show that the predicted lesion loads are in line with the reference WMH estimates, confirming the robustness to variations in lesion load, acquisition, and demographics. Beyond accurate segmentation, we quantify WMH load within anatomically defined regions and combine these measures with brain structure volumes to assess diagnostic value. Regional WMH volumes consistently outperform global lesion burden for disease classification, and integration with brain atrophy metrics further improves performance, reaching area under the curve (AUC) values up to 0.97. Several spatially distinct regions, particularly within anterior white matter tracts, are reproducibly associated with diagnostic status, indicating localized vulnerability in AD. These results highlight the added value of regional WMH quantification. Incorporating localized lesion metrics alongside atrophy markers may enhance early diagnosis and stratification in neurodegenerative disorders.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Keda Tao, Kele Shao, Bohan Yu, Weiqiang Wang, Jian liu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Omnimodal large language models (OmniLLMs) have attracted increasing research attention of late towards unified audio-video understanding, wherein processing audio-video token sequences creates a significant computational bottleneck, however. Existing token compression methods have yet to accommodate this emerging need of jointly compressing multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip - it achieves 3.42X inference speedup and 1.4X memory reduction over other top-performing counterparts, while maintaining performance with no training.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Adam Bajger, Jan Obdrลพรกlek, Vojtฤch Kลฏr, Rudolf Nenutil, Petr Holub et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present a clustering-based explainability technique for digital pathology models based on convolutional neural networks. Unlike commonly used methods based on saliency maps, such as occlusion, GradCAM, or relevance propagation, which highlight regions that contribute the most to the prediction for a single slide, our method shows the global behaviour of the model under consideration, while also providing more fine-grained information. The result clusters can be visualised not only to understand the model, but also to increase confidence in its operation, leading to faster adoption in clinical practice. We also evaluate the performance of our technique on an existing model for detecting prostate cancer, demonstrating its usefulness.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohammad Romani
Natural Language Processing
ๆฅ็ๆ่ฆ
Deepfakes generated by advanced GANs and autoencoders severely threaten information integrity and societal stability. Single-stream CNNs fail to capture multi-scale forgery artifacts across spatial, texture, and frequency domains, limiting robustness and generalization. We introduce the ForensicFlow, a tri-modal forensic framework that synergistically fuses RGB, texture, and frequency evidence for video Deepfake detection. The RGB branch (ConvNeXt-tiny) extracts global visual inconsistencies; the texture branch (Swin Transformer-tiny) detects fine-grained blending artifacts; the frequency branch (CNN + SE) identifies periodic spectral noise. Attention-based temporal pooling dynamically prioritizes high-evidence frames, while adaptive attention fusion balances branch contributions.Trained on Celeb-DF (v2) with Focal Loss, ForensicFlow achieves AUC 0.9752, F1-Score 0.9408, and accuracy 0.9208, outperforming single-stream baselines. Ablation validates branch synergy; Grad-CAM confirms forensic focus. This comprehensive feature fusion provides superior resilience against subtle forgeries.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Tian, Chenyangguang Zhang, Rui Liu, Wen Shen, Xiaolin Qin
Machine Learning
ๆฅ็ๆ่ฆ
This paper focuses on a challenging setting of simultaneously modeling geometry and appearance of hand-object interaction scenes without any object priors. We follow the trend of dynamic 3D Gaussian Splatting based methods, and address several significant challenges. To model complex hand-object interaction with mutual occlusion and edge blur, we present interaction-aware hand-object Gaussians with newly introduced optimizable parameters aiming to adopt piecewise linear hypothesis for clearer structural representation. Moreover, considering the complementarity and tightness of hand shape and object shape during interaction dynamics, we incorporate hand information into object deformation field, constructing interaction-aware dynamic fields to model flexible motions. To further address difficulties in the optimization process, we propose a progressive strategy that handles dynamic regions and static background step by step. Correspondingly, explicit regularizations are designed to stabilize the hand-object representations for smooth motion transition, physical interaction reality, and coherent lighting. Experiments show that our approach surpasses existing dynamic 3D-GS-based methods and achieves state-of-the-art performance in reconstructing dynamic hand-object interaction.
๐
2025-11-18
๐ AAAI
๐ฅ Qiang Bai, Bojian Wu, Xi Yang, Zhizhong Han
Machine Learning
ๆฅ็ๆ่ฆ
Neural signed distance functions (SDFs) have been a vital representation to represent 3D shapes or scenes with neural networks. An SDF is an implicit function that can query signed distances at specific coordinates for recovering a 3D surface. Although implicit functions work well on a single shape or scene, they pose obstacles when analyzing multiple SDFs with high-fidelity geometry details, due to the limited information encoded in the latent space for SDFs and the loss of geometry details. To overcome these obstacles, we introduce a method to represent multiple SDFs in a common space, aiming to recover more high-fidelity geometry details with more compact latent representations. Our key idea is to take full advantage of the benefits of generalization-based and overfitting-based learning strategies, which manage to preserve high-fidelity geometry details with compact latent codes. Based on this framework, we also introduce a novel sampling strategy to sample training queries. The sampling can improve the training efficiency and eliminate artifacts caused by the influence of other SDFs. We report numerical and visual evaluations on widely used benchmarks to validate our designs and show advantages over the latest methods in terms of the representative ability and compactness.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangchen Yin, Jiahui Yuan, Zhangchi Hu, Wenzhang Sun, Jie Chen et al.
ๆฅ็ๆ่ฆ
Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose decoupled VAE (DeCo-VAE) to achieve compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling: keyframe, motion and residual, and learn dedicated latent representation for each. To avoid cross-component interference, we design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further utilize a decoupled adaptation strategy that freezes partial encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.
๐
2025-11-18
๐ This work has been submitted to the IEEE for possi
๐ฅ Yufeng Tian, Yifan Chen, Zhe Sun, Libang Chen, Mingyu Dou et al.
Machine Learning
ๆฅ็ๆ่ฆ
Underwater image restoration and enhancement are crucial for correcting color distortion and restoring image details, thereby establishing a fundamental basis for subsequent underwater visual tasks. However, current deep learning methodologies in this area are frequently constrained by the scarcity of high-quality paired datasets. Since it is difficult to obtain pristine reference labels in underwater scenes, existing benchmarks often rely on manually selected results from enhancement algorithms, providing debatable reference images that lack globally consistent color and authentic supervision. This limits the model's capabilities in color restoration, image enhancement, and generalization. To overcome this limitation, we propose using in-air natural images as unambiguous reference targets and translating them into underwater-degraded versions, thereby constructing synthetic datasets that provide authentic supervision signals for model learning. Specifically, we establish a generative data framework based on unpaired image-to-image translation, producing a large-scale dataset that covers 6 representative underwater degradation types. The framework constructs synthetic datasets with precise ground-truth labels, which facilitate the learning of an accurate mapping from degraded underwater images to their pristine scene appearances. Extensive quantitative and qualitative experiments across 6 representative network architectures and 3 independent test sets show that models trained on our synthetic data achieve comparable or superior color restoration and generalization performance to those trained on existing benchmarks. This research provides a reliable and scalable data-driven solution for underwater image restoration and enhancement. The generated dataset is publicly available at: https://github.com/yftian2025/SynUIEDatasets.git.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Taifour Yousra Nabila, Azeddine Beghdadi, Marie Luong, Zuheng Ming, Habib Zaidi et al.
ๆฅ็ๆ่ฆ
Low Dose Computed Tomography (LDCT) is widely used as an imaging solution to aid diagnosis and other clinical tasks. However, this comes at the price of a deterioration in image quality due to the low dose of radiation used to reduce the risk of secondary cancer development. While some efficient methods have been proposed to enhance LDCT quality, many overestimate noise and perform excessive smoothing, leading to a loss of critical details. In this paper, we introduce D-PerceptCT, a novel architecture inspired by key principles of the Human Visual System (HVS) to enhance LDCT images. The objective is to guide the model to enhance or preserve perceptually relevant features, thereby providing radiologists with CT images where critical anatomical structures and fine pathological details are perceptu- ally visible. D-PerceptCT consists of two main blocks: 1) a Visual Dual-path Extractor (ViDex), which integrates semantic priors from a pretrained DINOv2 model with local spatial features, allowing the network to incorporate semantic-awareness during enhancement; (2) a Global-Local State-Space block that captures long-range information and multiscale features to preserve the important structures and fine details for diagnosis. In addition, we propose a novel deep perceptual loss, designated as the Deep Perceptual Relevancy Loss Function (DPRLF), which is inspired by human contrast sensitivity, to further emphasize perceptually important features. Extensive experiments on the Mayo2016 dataset demonstrate the effectiveness of D-PerceptCT method for LDCT enhancement, showing better preservation of structural and textural information within LDCT images compared to SOTA methods.
๐
2025-11-18
๐ ArXiv Preprint (cs.SD)
๐ฅ Xinxin Tang, Bin Qin, Yufang Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) tasks on resource-constrained devices. Existing state-of-the-art methods, such as MUSE, have established a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of Taylor-expansion-based attention, while the offset calculation for deformable embedding introduces additional computational burden. This paper proposes IMSE, a systematically optimized and ultra-lightweight network. We introduce two core innovations: 1) Replacing the MET module with Amplitude-Aware Linear Attention (MALA). MALA fundamentally rectifies the "amplitude-ignoring" problem in linear attention by explicitly preserving the norm information of query vectors in the attention calculation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv borrows the Inception concept, decomposing large-kernel operations into efficient parallel branches (square, horizontal, and vertical strips), thereby capturing spectrogram features with extremely low parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset demonstrate that, compared to the MUSE baseline, IMSE significantly reduces the parameter count by 16.8\% (from 0.513M to 0.427M) while achieving competitive performance comparable to the state-of-the-art on the PESQ metric (3.373). This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
๐
2025-11-18
๐ Accepted to IEEE Transactions on Cybernetics
๐ฅ Xinzhuo Yu, Yunzhi Zhuge, Sitong Gong, Lu Zhang, Pingping Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Understanding the inter-relations and interactions between tasks is crucial for multi-task dense prediction. Existing methods predominantly utilize convolutional layers and attention mechanisms to explore task-level interactions. In this work, we introduce a novel decoder-based framework, Parameter Aware Mamba Model (PAMM), specifically designed for dense prediction in multi-task learning setting. Distinct from approaches that employ Transformers to model holistic task relationships, PAMM leverages the rich, scalable parameters of state space models to enhance task interconnectivity. It features dual state space parameter experts that integrate and set task-specific parameter priors, capturing the intrinsic properties of each task. This approach not only facilitates precise multi-task interactions but also allows for the global integration of task priors through the structured state space sequence model (S4). Furthermore, we employ the Multi-Directional Hilbert Scanning method to construct multi-angle feature sequences, thereby enhancing the sequence model's perceptual capabilities for 2D data. Extensive experiments on the NYUD-v2 and PASCAL-Context benchmarks demonstrate the effectiveness of our proposed method. Our code is available at https://github.com/CQC-gogopro/PAMM.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jack Qin, Zhitao Wang, Yinan Zheng, Keyu Chen, Yang Zhou et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The autonomous driving (AD) system has exhibited remarkable performance in complex driving scenarios. However, generalization is still a key limitation for the current system, which refers to the ability to handle unseen scenarios or unfamiliar sensor configurations.Related works have explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system, where two distinct systems are used to plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions from VLM directly. However, these end-to-end solutions demonstrate prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird's-Eye-View (BEV) features, yielding interpretable risk-attention maps.This approach allows BEV features to learn richer and more nuanced risk attention representations, which directly enhance the model's ability to handle spatial boundaries and risky objects.By focusing on risk attention, RSD aligns better with human-like driving behavior, which is essential to navigate in complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Due to the enhanced BEV representations enabled by RSD, we observed a significant improvement in both perception and planning capabilities.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aditi Agarwal, Anjali Jain, Nikita Saxena, Ishan Deshpande, Michal Kazmierski et al.
ๆฅ็ๆ่ฆ
Delineating farm boundaries through segmentation of satellite images is a fundamental step in many agricultural applications. The task is particularly challenging for smallholder farms, where accurate delineation requires the use of high resolution (HR) imagery which are available only at low revisit frequencies (e.g., annually). To support more frequent (sub-) seasonal monitoring, HR images could be combined as references (ref) with low resolution (LR) images -- having higher revisit frequency (e.g., weekly) -- using reference-based super-resolution (Ref-SR) methods. However, current Ref-SR methods optimize perceptual quality and smooth over crucial features needed for downstream tasks, and are unable to meet the large scale-factor requirements for this task. Further, previous two-step approaches of SR followed by segmentation do not effectively utilize diverse satellite sources as inputs. We address these problems through a new approach, $\textbf{SEED-SR}$, which uses a combination of conditional latent diffusion models and large-scale multi-spectral, multi-source geo-spatial foundation models. Our key innovation is to bypass the explicit SR task in the pixel space and instead perform SR in a segmentation-aware latent space. This unique approach enables us to generate segmentation maps at an unprecedented 20$\times$ scale factor, and rigorous experiments on two large, real datasets demonstrate up to $\textbf{25.5}$ and $\textbf{12.9}$ relative improvement in instance and semantic segmentation metrics respectively over approaches based on state-of-the-art Ref-SR methods.
๐
2025-11-18
๐ AAAI
๐ฅ Miao Shang, Xiaopeng Hong
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
This paper introduces Gaussian Spatial Transport (GST), a novel framework that leverages Gaussian splatting to facilitate transport from the probability measure in the image coordinate space to the annotation map. We propose a Gaussian splatting-based method to estimate pixel-annotation correspondence, which is then used to compute a transport plan derived from Bayesian probability. To integrate the resulting transport plan into standard network optimization in typical computer vision tasks, we derive a loss function that measures discrepancy after transport. Extensive experiments on representative computer vision tasks, including crowd counting and landmark detection, validate the effectiveness of our approach. Compared to conventional optimal transport schemes, GST eliminates iterative transport plan computation during training, significantly improving efficiency. Code is available at https://github.com/infinite0522/GST.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bayu Adhi Tama, Jianwu Wang, Vandana Janeja, Mostafa Cham
ๆฅ็ๆ่ฆ
Accurate subglacial bed topography is essential for ice sheet modeling, yet radar observations are sparse and uneven. We propose a physics-guided residual learning framework that predicts bed thickness residuals over a BedMachine prior and reconstructs bed from the observed surface. A DeepLabV3+ decoder over a standard encoder (e.g.,ResNet-50) is trained with lightweight physics and data terms: multi-scale mass conservation, flow-aligned total variation, Laplacian damping, non-negativity of thickness, a ramped prior-consistency term, and a masked Huber fit to radar picks modulated by a confidence map. To measure real-world generalization, we adopt leakage-safe blockwise hold-outs (vertical/horizontal) with safety buffers and report metrics only on held-out cores. Across two Greenland sub-regions, our approach achieves strong test-core accuracy and high structural fidelity, outperforming U-Net, Attention U-Net, FPN, and a plain CNN. The residual-over-prior design, combined with physics, yields spatially coherent, physically plausible beds suitable for operational mapping under domain shift.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingchen Zhong, Xin Lu, Dong Li, Senyan Xu, Ruixuan Jiang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Low-light video deblurring poses significant challenges in applications like nighttime surveillance and autonomous driving due to dim lighting and long exposures. While event cameras offer potential solutions with superior low-light sensitivity and high temporal resolution, existing fusion methods typically employ staged strategies, limiting their effectiveness against combined low-light and motion blur degradations. To overcome this, we propose CompEvent, a complex neural network framework enabling holistic full-process fusion of event data and RGB frames for enhanced joint restoration. CompEvent features two core components: 1) Complex Temporal Alignment GRU, which utilizes complex-valued convolutions and processes video and event streams iteratively via GRU to achieve temporal alignment and continuous fusion; and 2) Complex Space-Frequency Learning module, which performs unified complex-valued signal processing in both spatial and frequency domains, facilitating deep fusion through spatial structures and system-level characteristics. By leveraging the holistic representation capability of complex-valued neural networks, CompEvent achieves full-process spatiotemporal fusion, maximizes complementary learning between modalities, and significantly strengthens low-light video deblurring capability. Extensive experiments demonstrate that CompEvent outperforms SOTA methods in addressing this challenging task. The code is available at https://github.com/YuXie1/CompEvent.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zongwei Zhen, Biqing Zeng
Robotics
ๆฅ็ๆ่ฆ
This paper addresses the task of interactive, conversational text-to-image retrieval.
Our DIR-TIR framework progressively refines the target image search through two specialized modules: the Dialog Refiner Module and the Image Refiner Module.
The Dialog Refiner actively queries users to extract essential information and generate increasingly precise descriptions of the target image.
Complementarily, the Image Refiner identifies perceptual gaps between generated images and user intentions, strategically reducing the visual-semantic discrepancy. By leveraging multi-turn dialogues, DIR-TIR provides superior controllability and fault tolerance compared to conventional single-query methods, significantly improving target image hit accuracy.
Comprehensive experiments across diverse image datasets demonstrate our dialogue-based approach substantially outperforms initial-description-only baselines, while the synergistic module integration achieves both higher retrieval precision and enhanced interactive experience.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hong Gao, Yiming Bao, Xuezhen Tu, Yutong Xu, Yue Jin et al.
Computer VisionMultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner with limited support for evidence revisit and iterative refinement. While recently emerging agent-based methods enable long-horizon reasoning, they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yusen Cai, Bhargava Satya Nunna, Qing Lin, Mengmi Zhang
ๆฅ็ๆ่ฆ
Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged "visual diets", we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T)-collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture-shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm. All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants' visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models will be publicly released.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ravi Shankar Prasad, Nandani Sharma, Dinesh Singh
Machine Learning
ๆฅ็ๆ่ฆ
In forensic craniofacial identification and in many biomedical applications, craniometric landmarks are important. Traditional methods for locating landmarks are time-consuming and require specialized knowledge and expertise. Current methods utilize superimposition and deep learning-based methods that employ automatic annotation of landmarks. However, these methods are not reliable due to insufficient large-scale validation studies. In this paper, we proposed a novel framework Cranio-ID: First, an automatic annotation of landmarks on 2D skulls (which are X-ray scans of faces) with their respective optical images using our trained YOLO-pose models. Second, cross-modal matching by formulating these landmarks into graph representations and then finding semantic correspondence between graphs of these two modalities using cross-attention and optimal transport framework. Our proposed framework is validated on the S2F and CUHK datasets (CUHK dataset resembles with S2F dataset). Extensive experiments have been conducted to evaluate the performance of our proposed framework, which demonstrates significant improvements in both reliability and accuracy, as well as its effectiveness in cross-domain skull-to-face and sketch-to-face matching in forensic science.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuyi Geng, Tao Zhou, Yi Zhou
ๆฅ็ๆ่ฆ
A key challenge in Domain Incremental Learning (DIL) is to continually learn under shifting distributions while preserving knowledge from previous domains. Existing methods face a fundamental dilemma. On one hand, projecting all domains into a single unified visual space leads to inter-domain interference and semantic distortion, as large shifts may vary with not only visual appearance but also underlying semantics. On the other hand, isolating domain-specific parameters causes knowledge fragmentation, creating "knowledge islands" that hamper knowledge reuse and exacerbate forgetting. To address this issue, we propose LAVA (Language-Anchored Visual Alignment), a novel DIL framework that replaces direct feature alignment with relative alignment driven by a text-based reference anchor. LAVA guides the visual representations of each incoming domain to preserve a consistent relative geometry, which is defined by mirroring the pairwise semantic similarities between the class names. This anchored geometric structure acts as a bridge across domains, enabling the retrieval of class-aware prior knowledge and facilitating robust feature aggregation. Extensive experiments on standard DIL benchmarks demonstrate that LAVA achieves significant performance improvements over state-of-the-arts. Code is available at https://github.com/ShuyiGeng/LAVA.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saksham Kumar, D Sridhar Aditya, T Likhil Kumar, Thulasi Bikku, Srinivasarao Thota et al.
ๆฅ็ๆ่ฆ
Diabetic Retinopathy (DR) has emerged as a major cause of preventable blindness in recent times. With timely screening and intervention, the condition can be prevented from causing irreversible damage. The work introduces a state-of-the-art Ordinal Regression-based DR Detection framework that uses the APTOS-2019 fundus image dataset. A widely accepted combination of preprocessing methods: Green Channel (GC) Extraction, Noise Masking, and CLAHE, was used to isolate the most relevant features for DR classification. Model performance was evaluated using the Quadratic Weighted Kappa, with a focus on agreement between results and clinical grading. Our Ordinal Regression approach attained a QWK score of 0.8992, setting a new benchmark on the APTOS dataset.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Xiuxiu Qi, Yu Yang, Jiannong Cao, Luyao Bai, Chongshan Fan et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations by a bidirectional cross-attention to learn contextual information for action generation, successfully overcoming the problem of semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.
๐
2025-11-18
๐ NeurIPS 2025
๐ฅ Joachim Tesch, Giorgio Becherini, Prerana Achar, Anastasios Yiannakidis, Muhammed Kocabas et al.
ๆฅ็ๆ่ฆ
Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways. In addition to introducing more diverse and realistic cameras and camera motions, BEDLAM2.0 increases diversity and realism of body shape, motions, clothing, hair, and 3D environments. Additionally, it adds shoes, which were missing in BEDLAM. BEDLAM has become a key resource for training 3D human pose and motion regressors today and we show that BEDLAM2.0 is significantly better, particularly for training methods that estimate humans in world coordinates. We compare state-of-the art methods trained on BEDLAM and BEDLAM2.0, and find that BEDLAM2.0 significantly improves accuracy over BEDLAM. For research purposes, we provide the rendered videos, ground truth body parameters, and camera motions. We also provide the 3D assets to which we have rights and links to those from third parties.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fabian Schmidt, Noushiq Mohammed Kayilan Abdul Nazar, Markus Enzweiler, Abhinav Valada
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly used for decision-making and planning in autonomous driving, showing promising reasoning capabilities and potential to generalize across diverse traffic situations. However, current LLM-based driving agents lack explicit mechanisms to enforce traffic rules and often struggle to reliably detect small, safety-critical objects such as traffic lights and signs. To address this limitation, we introduce TLS-Assist, a modular redundancy layer that augments LLM-based autonomous driving agents with explicit traffic light and sign recognition. TLS-Assist converts detections into structured natural language messages that are injected into the LLM input, enforcing explicit attention to safety-critical cues. The framework is plug-and-play, model-agnostic, and supports both single-view and multi-view camera setups. We evaluate TLS-Assist in a closed-loop setup on the LangAuto benchmark in CARLA. The results demonstrate relative driving performance improvements of up to 14% over LMDrive and 7% over BEVDriver, while consistently reducing traffic light and sign infractions. We publicly release the code and models on https://github.com/iis-esslingen/TLS-Assist.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kangqiao Zhao, Shuo Huai, Xurui Song, Jun Luo
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It has significantly enhanced stealth and lethality upon existing hiding attacks that fail to get seamlessly merged into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.
๐
2025-11-18
๐ 13 pages, 4 figures, conference or journal submiss
๐ฅ Alfonso Pedro Ridao
Computer Vision
ๆฅ็ๆ่ฆ
International standards for biometric identity documents mandate strict compliance with pose requirements, including the square presentation of a subject's shoulders. However, the literature on automated quality assessment offers few quantitative methods for evaluating this specific attribute. This paper proposes a Shoulder Presentation Evaluation (SPE) algorithm to address this gap. The method quantifies shoulder yaw and roll using only the 3D coordinates of two shoulder landmarks provided by common pose estimation frameworks. The algorithm was evaluated on a dataset of 121 portrait images. The resulting SPE scores demonstrated a strong Pearson correlation (r approx. 0.80) with human-assigned labels. An analysis of the metric's filtering performance, using an adapted Error-versus-Discard methodology, confirmed its utility in identifying non-compliant samples. The proposed algorithm is a viable lightweight tool for automated compliance checking in enrolment systems.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Xiaolin Wang, Houzhang Fang, Qingshan Li, Lu Wang, Yi Chang et al.
ๆฅ็ๆ่ฆ
Infrared unmanned aerial vehicle (UAV) target images often suffer from motion blur degradation caused by rapid sensor movement, significantly reducing contrast between target and background. Generally, detection performance heavily depends on the discriminative feature representation between target and background. Existing methods typically treat deblurring as a preprocessing step focused on visual quality, while neglecting the enhancement of task-relevant features crucial for detection. Improving feature representation for detection under blur conditions remains challenging. In this paper, we propose a novel Joint Feature-Domain Deblurring and Detection end-to-end framework, dubbed JFD3. We design a dual-branch architecture with shared weights, where the clear branch guides the blurred branch to enhance discriminative feature representation. Specifically, we first introduce a lightweight feature restoration network, where features from the clear branch serve as feature-level supervision to guide the blurred branch, thereby enhancing its distinctive capability for detection. We then propose a frequency structure guidance module that refines the structure prior from the restoration network and integrates it into shallow detection layers to enrich target structural information. Finally, a feature consistency self-supervised loss is imposed between the dual-branch detection backbones, driving the blurred branch to approximate the feature representations of the clear one. Wealso construct a benchmark, named IRBlurUAV, containing 30,000 simulated and 4,118 real infrared UAV target images with diverse motion blur. Extensive experiments on IRBlurUAV demonstrate that JFD3 achieves superior detection performance while maintaining real-time efficiency.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty
Natural Language Processing
ๆฅ็ๆ่ฆ
While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks: (a) object localization, (b) counting, (c) image retrieval i.e., (SBIR and fine-grained SBIR), and (d) visual question answering (VQA); while incorporating the three existing sketch datasets, namely QuickDraw!, Sketchy, and Tu Berlin, along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gustavo Adolpho Bonesso, Carlos Marcelo Gurjรฃo de Godoy, Tammy Hentona Osaki, Midori Hentona Osaki, Bรกrbara Moreira Ribeiro Trindade dos Santos et al.
ๆฅ็ๆ่ฆ
Blinking is a vital physiological process that protects and maintains the health of the ocular surface. Objective assessment of eyelid movements remains challenging due to the complexity, cost, and limited clinical applicability of existing tools. This study presents the clinical validation of Bapp (Blink Application), a mobile application developed using the Flutter framework and integrated with Google ML Kit for on-device, real-time analysis of eyelid movements. The validation occurred using 45 videos from real patients, whose blinks were manually annotated by ophthalmology specialists from the Paulista School of Medicine of the Federal University of Sao Paulo (EPM-UNIFESP) to serve as the ground truth. Bapp's performance was evaluated using standard metrics, including Precision, Recall, and F1-Score, with results demonstrating 98.4% precision, 96.9% recall, and an overall accuracy of 98.3%. These outcomes confirm the reliability of Bapp as a portable, accessible, and objective tool for monitoring both normal and abnormal eyelid movements. The application offers a promising alternative to traditional manual blink counting, supporting continuous ocular health monitoring and postoperative evaluation in clinical environments.
๐
2025-11-18
๐ NeurIPS 2025
๐ฅ Hoang Chuong Nguyen, Wei Mao, Jose M. Alvarez, Miaomiao Liu
ๆฅ็ๆ่ฆ
3D Gaussian Splatting (3DGS) has recently emerged as a fast, high-quality method for novel view synthesis (NVS). However, its use of low-degree spherical harmonics limits its ability to capture spatially varying color and view-dependent effects such as specular highlights. Existing works augment Gaussians with either a global texture map, which struggles with complex scenes, or per-Gaussian texture maps, which introduces high storage overhead. We propose Image-Based Gaussian Splatting, an efficient alternative that leverages high-resolution source images for fine details and view-specific color modeling. Specifically, we model each pixel color as a combination of a base color from standard 3DGS rendering and a learned residual inferred from neighboring training images. This promotes accurate surface alignment and enables rendering images of high-frequency details and accurate view-dependent effects. Experiments on standard NVS benchmarks show that our method significantly outperforms prior Gaussian Splatting approaches in rendering quality, without increasing the storage footprint.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junfu Pu, Teng Wang, Yixiao Ge, Yuying Ge, Chen Li et al.
ๆฅ็ๆ่ฆ
The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training with annotations that are typical short and coarse, restricting generalization to nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on over million-level long video chapters, featuring bilingual, temporally grounded, and hierarchical chapter annotations. To achieve this goal, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene texts, visual captions into multi-level annotations, from short title to long summaries. We demonstrate clear performance improvements with data scaling, both in data volume and label intensity. Moreover, we design a new evaluation metric termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state-of-the-art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. Moreover, ARC-Chapter shows excellent transferability, improving the state-of-the-art on downstream tasks like dense video captioning on YouCook2.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiyi Miao, Taoyu Wu, Ji Jiang, Tong Chen, Zhe Tang et al.
Robotics
ๆฅ็ๆ่ฆ
Reliable 3D-2D alignment between intraoral scan (IOS) models and lateral cephalometric radiographs is critical for orthodontic diagnosis, yet conventional intensity-driven registration methods struggle under real clinical conditions, where cephalograms exhibit projective magnification, geometric distortion, low-contrast dental crowns, and acquisition-dependent variation. These factors hinder the stability of appearance-based similarity metrics and often lead to convergence failures or anatomically implausible alignments. To address these limitations, we propose DentalSCR, a pose-stable, contour-guided framework for accurate and interpretable silhouette-to-contour registration. Our method first constructs a U-Midline Dental Axis (UMDA) to establish a unified cross-arch anatomical coordinate system, thereby stabilizing initialization and standardizing projection geometry across cases. Using this reference frame, we generate radiograph-like projections via a surface-based DRR formulation with coronal-axis perspective and Gaussian splatting, which preserves clinical source-object-detector magnification and emphasizes external silhouettes. Registration is then formulated as a 2D similarity transform optimized with a symmetric bidirectional Chamfer distance under a hierarchical coarse-to-fine schedule, enabling both large capture range and subpixel-level contour agreement. We evaluate DentalSCR on 34 expert-annotated clinical cases. Experimental results demonstrate substantial reductions in landmark error-particularly at posterior teeth-tighter dispersion on the lower jaw, and low Chamfer and controlled Hausdorff distances at the curve level. These findings indicate that DentalSCR robustly handles real-world cephalograms and delivers high-fidelity, clinically inspectable 3D--2D alignment, outperforming conventional baselines.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Michael Milford, Tobias Fischer
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Place recognition, the ability to identify previously visited locations, is critical for both biological navigation and autonomous systems. This review synthesizes findings from robotic systems, animal studies, and human research to explore how different systems encode and recall place. We examine the computational and representational strategies employed across artificial systems, animals, and humans, highlighting convergent solutions such as topological mapping, cue integration, and memory management. Animal systems reveal evolved mechanisms for multimodal navigation and environmental adaptation, while human studies provide unique insights into semantic place concepts, cultural influences, and introspective capabilities. Artificial systems showcase scalable architectures and data-driven models. We propose a unifying set of concepts by which to consider and develop place recognition mechanisms and identify key challenges such as generalization, robustness, and environmental variability. This review aims to foster innovations in artificial localization by connecting future developments in artificial place recognition systems to insights from both animal navigation research and human spatial cognition studies.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bohan Zhang, Yiyi Miao, Taoyu Wu, Tong Chen, Ji Jiang et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
A structured understanding of intraoral 3D scans is essential for digital orthodontics. However, existing deep-learning approaches rely heavily on modality-specific training, large annotated datasets, and controlled scanning conditions, which limit generalization across devices and hinder deployment in real clinical workflows. Moreover, raw intraoral meshes exhibit substantial variation in arch pose, incomplete geometry caused by occlusion or tooth contact, and a lack of texture cues, making unified semantic interpretation highly challenging. To address these limitations, we propose ArchMap, a training-free and knowledge-guided framework for robust structured dental understanding. ArchMap first introduces a geometry-aware arch-flattening module that standardizes raw 3D meshes into spatially aligned, continuity-preserving multi-view projections. We then construct a Dental Knowledge Base (DKB) encoding hierarchical tooth ontology, dentition-stage policies, and clinical semantics to constrain the symbolic reasoning space. We validate ArchMap on 1060 pre-/post-orthodontic cases, demonstrating robust performance in tooth counting, anatomical partitioning, dentition-stage classification, and the identification of clinical conditions such as crowding, missing teeth, prosthetics, and caries. Compared with supervised pipelines and prompted VLM baselines, ArchMap achieves higher accuracy, reduced semantic drift, and superior stability under sparse or artifact-prone conditions. As a fully training-free system, ArchMap demonstrates that combining geometric normalization with ontology-guided multimodal reasoning offers a practical and scalable solution for the structured analysis of 3D intraoral scans in modern digital orthodontics.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongchen Han, Tianzhu Ye, Zhuofan Xia, Kaiyi Chen, Yulin Wang et al.
Computer VisionMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Scaling up network depth is a fundamental pursuit in neural architecture design, as theory suggests that deeper models offer exponentially greater capability. Benefiting from the residual connections, modern neural networks can scale up to more than one hundred layers and enjoy wide success. However, as networks continue to deepen, current architectures often struggle to realize their theoretical capacity improvements, calling for more advanced designs to further unleash the potential of deeper networks. In this paper, we identify two key barriers that obstruct residual models from scaling deeper: shortcut degradation and limited width. Shortcut degradation hinders deep-layer learning, while the inherent depth-width trade-off imposes limited width. To mitigate these issues, we propose a generalized residual architecture dubbed Step by Step Network (StepsNet) to bridge the gap between theoretical potential and practical performance of deep models. Specifically, we separate features along the channel dimension and let the model learn progressively via stacking blocks with increasing width. The resulting method mitigates the two identified problems and serves as a versatile macro design applicable to various models. Extensive experiments show that our method consistently outperforms residual models across diverse tasks, including image classification, object detection, semantic segmentation, and language modeling. These results position StepsNet as a superior generalization of the widely adopted residual architecture.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nanjun Li, Ziyue Hao, Quanqiang Wang, Xuanyin Wang
Computer Vision
ๆฅ็ๆ่ฆ
With the rise in sedentary behavior, health problems caused by poor sitting posture have drawn increasing attention. Most existing methods, whether using invasive sensors or computer vision, rely on two-stage pipelines, which result in high intrusiveness, intensive computation, and poor real-time performance on embedded edge devices. Inspired by YOLOv11-Pose, a lightweight single-stage network for sitting posture recognition on embedded edge devices termed LSP-YOLO was proposed. By integrating partial convolution(PConv) and Similarity-Aware Activation Module(SimAM), a lightweight module, Light-C3k2, was designed to reduce computational cost while maintaining feature extraction capability. In the recognition head, keypoints were directly mapped to posture classes through pointwise convolution, and intermediate supervision was employed to enable efficient fusion of pose estimation and classification. Furthermore, a dataset containing 5,000 images across six posture categories was constructed for model training and testing. The smallest trained model, LSP-YOLO-n, achieved 94.2% accuracy and 251 Fps on personal computer(PC) with a model size of only 1.9 MB. Meanwhile, real-time and high-accuracy inference under constrained computational resources was demonstrated on the SV830C + GC030A platform. The proposed approach is characterized by high efficiency, lightweight design and deployability, making it suitable for smart classrooms, rehabilitation, and human-computer interaction applications.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiyi Miao, Taoyu Wu, Tong Chen, Ji Jiang, Zhe Tang et al.
Computer Vision
ๆฅ็ๆ่ฆ
Intraoral 3D reconstruction is fundamental to digital orthodontics, yet conventional methods like intraoral scanning are inaccessible for remote tele-orthodontics, which typically relies on sparse smartphone imagery. While 3D Gaussian Splatting (3DGS) shows promise for novel view synthesis, its application to the standard clinical triad of unposed anterior and bilateral buccal photographs is challenging. The large view baselines, inconsistent illumination, and specular surfaces common in intraoral settings can destabilize simultaneous pose and geometry estimation. Furthermore, sparse-view photometric supervision often induces a frequency bias, leading to over-smoothed reconstructions that lose critical diagnostic details. To address these limitations, we propose \textbf{Dental3R}, a pose-free, graph-guided pipeline for robust, high-fidelity reconstruction from sparse intraoral photographs. Our method first constructs a Geometry-Aware Pairing Strategy (GAPS) to intelligently select a compact subgraph of high-value image pairs. The GAPS focuses on correspondence matching, thereby improving the stability of the geometry initialization and reducing memory usage. Building on the recovered poses and point cloud, we train the 3DGS model with a wavelet-regularized objective. By enforcing band-limited fidelity using a discrete wavelet transform, our approach preserves fine enamel boundaries and interproximal edges while suppressing high-frequency artifacts. We validate our approach on a large-scale dataset of 950 clinical cases and an additional video-based test set of 195 cases. Experimental results demonstrate that Dental3R effectively handles sparse, unposed inputs and achieves superior novel view synthesis quality for dental occlusion visualization, outperforming state-of-the-art methods.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiancheng Fang, Shaoyu Wang, Junlin Wang, Weiwen Wu, Yikun Zhang et al.
ๆฅ็ๆ่ฆ
Multi-source stationary computed tomography (CT) has recently attracted attention for its ability to achieve rapid image reconstruction, making it suitable for time-sensitive clinical and industrial applications. However, practical systems are often constrained by ultra-sparse-view sampling, which significantly degrades reconstruction quality. Traditional methods struggle under ultra-sparse-view settings, where interpolation becomes inaccurate and the resulting reconstructions are unsatisfactory. To address this challenge, this study proposes Diffusion-Refined Neural Attenuation Fields (Diff-NAF), an iterative framework tailored for multi-source stationary CT under ultra-sparse-view conditions. Diff-NAF combines a Neural Attenuation Field representation with a dual-branch conditional diffusion model. The process begins by training an initial NAF using ultra-sparse-view projections. New projections are then generated through an Angle-Prior Guided Projection Synthesis strategy that exploits inter view priors, and are subsequently refined by a Diffusion-driven Reuse Projection Refinement Module. The refined projections are incorporated as pseudo-labels into the training set for the next iteration. Through iterative refinement, Diff-NAF progressively enhances projection completeness and reconstruction fidelity under ultra-sparse-view conditions, ultimately yielding high-quality CT reconstructions. Experimental results on multiple simulated 3D CT volumes and real projection data demonstrate that Diff-NAF achieves the best performance under ultra-sparse-view conditions.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sahar Nasirihaghighi, Negin Ghamsarian, Yiping Li, Marcel Breeuwer, Raphael Sznitman et al.
Computer Vision
ๆฅ็ๆ่ฆ
Medical image segmentation is clinically important, yet data privacy and the cost of expert annotation limit the availability of labeled data. Federated semi-supervised learning (FSSL) offers a solution but faces two challenges: pseudo-label reliability depends on the strength of local models, and client devices often require compact or heterogeneous architectures due to limited computational resources. These constraints reduce the quality and stability of pseudo-labels, while large models, though more accurate, cannot be trained or used for routine inference on client devices. We propose SAM-Fed, a federated semi-supervised framework that leverages a high-capacity segmentation foundation model to guide lightweight clients during training. SAM-Fed combines dual knowledge distillation with an adaptive agreement mechanism to refine pixel-level supervision. Experiments on skin lesion and polyp segmentation across homogeneous and heterogeneous settings show that SAM-Fed consistently outperforms state-of-the-art FSSL methods.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxin Zhang, Ziyu Lu, Hongbo Duan, Keyu Fan, Pengting Luo et al.
Computer Vision
ๆฅ็ๆ่ฆ
Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. Additionally, 3D scene generation is vital for advancing embodied AI and world models, which depend on diverse, high-quality scenes for learning and evaluation. In this work, we propose Gen3d, a novel method for generation of high-quality, wide-scope, and generic 3D scenes from a single image. After the initial point cloud is created by lifting the RGBD image, Gen3d maintains and expands its world model. The 3D scene is finalized through optimizing a Gaussian splatting representation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in generating a world model and Synthesizing high-fidelity and consistent novel views.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Luohong Wu, Matthias Seibold, Nicola A. Cavalcanti, Yunke Ao, Roman Flepp et al.
ๆฅ็ๆ่ฆ
In computer- and robot-assisted orthopedic surgery (CAOS), patient-specific surgical plans derived from preoperative imaging define target locations and implant trajectories. During surgery, these plans must be accurately transferred, relying on precise cross-registration between preoperative and intraoperative data. However, substantial modality heterogeneity across imaging modalities makes this registration challenging and error-prone. Robust, automatic, and modality-agnostic bone surface registration is therefore clinically important. We propose NeuralBoneReg, a self-supervised, surface-based framework that registers bone surfaces using 3D point clouds as a modality-agnostic representation. NeuralBoneReg includes two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. Unlike SOTA supervised methods, NeuralBoneReg operates in a self-supervised manner, without requiring inter-subject training data. We evaluated NeuralBoneReg against baseline methods on two publicly available multi-modal datasets: a CT-ultrasound dataset of the fibula and tibia (UltraBones100k) and a CT-RGB-D dataset of spinal vertebrae (SpineDepth). The evaluation also includes a newly introduced CT--ultrasound dataset of cadaveric subjects containing femur and pelvis (UltraBones-Hip), which will be made publicly available. NeuralBoneReg matches or surpasses existing methods across all datasets, achieving mean RRE/RTE of 1.68ยฐ/1.86 mm on UltraBones100k, 1.88ยฐ/1.89 mm on UltraBones-Hip, and 3.79ยฐ/2.45 mm on SpineDepth. These results demonstrate strong generalizability across anatomies and modalities, providing robust and accurate cross-modal alignment for CAOS.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zi-Chen Xi, Jiahui Huang, Hao-Xiang Chen, Francis Williams, Qun-Ce Xu et al.
Machine Learning
ๆฅ็ๆ่ฆ
We proposed a generalized method, NeuralSSD, for reconstructing a 3D implicit surface from the widely-available point cloud data. NeuralSSD is a solver-based on the neural Galerkin method, aimed at reconstructing higher-quality and accurate surfaces from input point clouds. Implicit method is preferred due to its ability to accurately represent shapes and its robustness in handling topological changes. However, existing parameterizations of implicit fields lack explicit mechanisms to ensure a tight fit between the surface and input data. To address this, we propose a novel energy equation that balances the reliability of point cloud information. Additionally, we introduce a new convolutional network that learns three-dimensional information to achieve superior optimization results. This approach ensures that the reconstructed surface closely adheres to the raw input points and infers valuable inductive biases from point clouds, resulting in a highly accurate and stable surface reconstruction. NeuralSSD is evaluated on a variety of challenging datasets, including the ShapeNet and Matterport datasets, and achieves state-of-the-art results in terms of both surface reconstruction accuracy and generalizability.
๐
2025-11-18
๐ Accepted to IJCV 2025
๐ฅ Tong Zhang, Yifan Zhao, Liangyu Wang, Jia Li
Machine Learning
ๆฅ็ๆ่ฆ
Cross-Domain Few-Shot Learning (CDFSL) endeavors to transfer generalized knowledge from the source domain to target domains using only a minimal amount of training data, which faces a triplet of learning challenges in the meantime, i.e., semantic disjoint, large domain discrepancy, and data scarcity. Different from predominant CDFSL works focused on generalized representations, we make novel attempts to construct Intermediate Domain Proxies (IDP) with source feature embeddings as the codebook and reconstruct the target domain feature with this learned codebook. We then conduct an empirical study to explore the intrinsic attributes from perspectives of visual styles and semantic contents in intermediate domain proxies. Reaping benefits from these attributes of intermediate domains, we develop a fast domain alignment method to use these proxies as learning guidance for target domain feature transformation. With the collaborative learning of intermediate domain reconstruction and target feature transformation, our proposed model is able to surpass the state-of-the-art models by a margin on 8 cross-domain few-shot learning benchmarks.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Weimin Bai, Yubo Li, Weijian Luo, Zeqiang Lai, Yequan Wang et al.
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's Yes or No log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks. (2) As a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM's rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiming Zeng, Xi-Le Zhao, Wei-Hao Wu, Teng-Yu Ji, Chao Wang
ๆฅ็ๆ่ฆ
Tensor singular value decomposition (t-SVD) is a promising tool for multi-dimensional image representation, which decomposes a multi-dimensional image into a latent tensor and an accompanying transform matrix. However, two critical limitations of t-SVD methods persist: (1) the approximation of the latent tensor (e.g., tensor factorizations) is coarse and fails to accurately capture spatial local high-frequency information; (2) The transform matrix is composed of fixed basis atoms (e.g., complex exponential atoms in DFT and cosine atoms in DCT) and cannot precisely capture local high-frequency information along the mode-3 fibers. To address these two limitations, we propose a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which compactly and continuously represents multi-dimensional images. Specifically, we leverage tailored 2D Gaussian splatting and 1D Gaussian splatting to generate the latent tensor and transform matrix, respectively. The 2D and 1D Gaussian splatting are indispensable and complementary under this representation framework, which enjoys a powerful representation capability, especially for local high-frequency information. To evaluate the representation ability of the proposed GSLR, we develop an unsupervised GSLR-based multi-dimensional image recovery model. Extensive experiments on multi-dimensional image recovery demonstrate that GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zitong Xu, Huiyu Duan, Xiaoyu Wang, Zhaolin Cai, Kaiwei Zhang et al.
RoboticsMultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
With the rapid advancement of generative models, powerful image editing methods now enable diverse and highly realistic image manipulations that far surpass traditional deepfake techniques, posing new challenges for manipulation detection. Existing image manipulation detection and localization (IMDL) benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, which hinders the generalization and explanation capabilities of current manipulation detection methods. To address these limitations, we introduce \textbf{ManipBench}, a large-scale benchmark for image manipulation detection and localization focusing on AI-edited images. ManipBench contains over 450K manipulated images produced by 25 state-of-the-art image editing models across 12 manipulation categories, among which 100K images are further annotated with bounding boxes, judgment cues, and textual explanations to support interpretable detection. Building upon ManipBench, we propose \textbf{ManipShield}, an all-in-one model based on a Multimodal Large Language Model (MLLM) that leverages contrastive LoRA fine-tuning and task-specific decoders to achieve unified image manipulation detection, localization, and explanation. Extensive experiments on ManipBench and several public datasets demonstrate that ManipShield achieves state-of-the-art performance and exhibits strong generality to unseen manipulation models. Both ManipBench and ManipShield will be released upon publication.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Wenkai Lin, Qiming Xia, Wen Li, Xun Huang, Chenglu Wen
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-agents rely on accurate poses to share and align observations, enabling a collaborative perception of the environment. However, traditional GNSS-based localization often fails in GNSS-denied environments, making consistent feature alignment difficult in collaboration. To tackle this challenge, we propose a robust GNSS-free collaborative perception framework based on LiDAR localization. Specifically, we propose a lightweight Pose Generator with Confidence (PGC) to estimate compact pose and confidence representations. To alleviate the effects of localization errors, we further develop the Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT), which performs confidence-aware spatial alignment while capturing essential temporal context. Additionally, we present a new simulation dataset, V2VLoc, which can be adapted for both LiDAR localization and collaborative detection tasks. V2VLoc comprises three subsets: Town1Loc, Town4Loc, and V2VDet. Town1Loc and Town4Loc offer multi-traversal sequences for training in localization tasks, whereas V2VDet is specifically intended for the collaborative detection task. Extensive experiments conducted on the V2VLoc dataset demonstrate that our approach achieves state-of-the-art performance under GNSS-denied conditions. We further conduct extended experiments on the real-world V2V4Real dataset to validate the effectiveness and generalizability of PASTAT.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Yan Huang, Yongyi Su, Xin Lin, Le Zhang, Xun Xu
ๆฅ็ๆ่ฆ
The emergence of foundation models has substantially advanced zero-shot generalization in monocular depth estimation (MDE), as exemplified by the Depth Anything series. However, given access to some data from downstream tasks, a natural question arises: can the performance of these models be further improved? To this end, we propose WeSTAR, a parameter-efficient framework that performs Weakly supervised Self-Training Adaptation with Regularization, designed to enhance the robustness of MDE foundation models in unseen and diverse domains. We first adopt a dense self-training objective as the primary source of structural self-supervision. To further improve robustness, we introduce semantically-aware hierarchical normalization, which exploits instance-level segmentation maps to perform more stable and multi-scale structural normalization. Beyond dense supervision, we introduce a cost-efficient weak supervision in the form of pairwise ordinal depth annotations to further guide the adaptation process, which enforces informative ordinal constraints to mitigate local topological errors. Finally, a weight regularization loss is employed to anchor the LoRA updates, ensuring training stability and preserving the model's generalizable knowledge. Extensive experiments on both realistic and corrupted out-of-distribution datasets under diverse and challenging scenarios demonstrate that WeSTAR consistently improves generalization and achieves state-of-the-art performance across a wide range of benchmarks.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Juncheng Hu, Zijian Zhang, Zeyu Wang, Guoyu Wang, Yingji Li et al.
ๆฅ็ๆ่ฆ
Forecasting 3D human motion is an important embodiment of fine-grained understanding and cognition of human behavior by artificial agents. Current approaches excessively rely on implicit network modeling of spatiotemporal relationships and motion characteristics, falling into the passive learning trap that results in redundant and monotonous 3D coordinate information acquisition while lacking actively guided explicit learning mechanisms. To overcome these issues, we propose an Active Perceptual Strategy (APS) for human motion prediction, leveraging quotient space representations to explicitly encode motion properties while introducing auxiliary learning objectives to strengthen spatio-temporal modeling. Specifically, we first design a data perception module that projects poses into the quotient space, decoupling motion geometry from coordinate redundancy. By jointly encoding tangent vectors and Grassmann projections, this module simultaneously achieves geometric dimension reduction, semantic decoupling, and dynamic constraint enforcement for effective motion pose characterization. Furthermore, we introduce a network perception module that actively learns spatio-temporal dependencies through restorative learning. This module deliberately masks specific joints or injects noise to construct auxiliary supervision signals. A dedicated auxiliary learning network is designed to actively adapt and learn from perturbed information. Notably, APS is model agnostic and can be integrated with different prediction models to enhance active perceptual. The experimental results demonstrate that our method achieves the new state-of-the-art, outperforming existing methods by large margins: 16.3% on H3.6M, 13.9% on CMU Mocap, and 10.1% on 3DPW.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yifan Yang, Zhi Cen, Sida Peng, Xiangwei Chen, Yifu Deng et al.
ๆฅ็ๆ่ฆ
This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs.Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations.However, these methods process the whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and will suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implemented a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenjie Li, Yulun Zhang, Guangwei Gao, Heng Guo, Zhanyu Ma
Robotics
ๆฅ็ๆ่ฆ
Blind face restoration (BFR) may correspond to multiple plausible high-quality (HQ) reconstructions under extremely low-quality (LQ) inputs. However, existing methods typically produce deterministic results, struggling to capture this one-to-many nature. In this paper, we propose a Measurement-Constrained Sampling (MCS) approach that enables diverse LQ face reconstructions conditioned on different textual prompts. Specifically, we formulate BFR as a measurement-constrained generative task by constructing an inverse problem through controlled degradations of coarse restorations, which allows posterior-guided sampling within text-to-image diffusion. Measurement constraints include both Forward Measurement, which ensures results align with input structures, and Reverse Measurement, which produces projection spaces, ensuring that the solution can align with various prompts. Experiments show that our MCS can generate prompt-aligned results and outperforms existing BFR methods. Codes will be released after acceptance.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ N Dinesh Reddy, Sudeep Pillai
RoboticsComputer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Weimin Bai, Suzhe Xu, Yiwei Ren, Jinhua Hao, Ming Sun et al.
Machine Learning
ๆฅ็ๆ่ฆ
Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers - leading to temporal artifacts - or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via an innovative teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100 times speedups over iterative video diffusion solvers. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunhe Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Maritime vessel re-identification (Re-ID) plays a crucial role in advancing maritime monitoring and intelligent situational awareness systems. However, some existing vessel Re-ID methods are directly adapted from pedestrian-focused algorithms, making them ill-suited for mitigating the unique problems present in vessel images, particularly the greater intra-identity variations and more severe missing of local parts, which lead to the emergence of outlier samples within the same identity. To address these challenges, we propose the Multi-scale Correlation-aware Transformer Network (MCFormer), which explicitly models multi-scale correlations across the entire input set to suppress the adverse effects of outlier samples with intra-identity variations or local missing, incorporating two novel modules, the Global Correlation Module (GCM), and the Local Correlation Module (LCM). Specifically, GCM constructs a global similarity affinity matrix across all input images to model global correlations through feature aggregation based on inter-image consistency, rather than solely learning features from individual images as in most existing approaches. Simultaneously, LCM mines and aligns local features of positive samples with contextual similarity to extract local correlations by maintaining a dynamic memory bank, effectively compensating for missing or occluded regions in individual images. To further enhance feature robustness, MCFormer integrates global and local features that have been respectively correlated across multiple scales, effectively capturing latent relationships among image features. Experiments on three benchmarks demonstrate that MCFormer achieves state-of-the-art performance.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zitang Sun, Masakazu Yoshimura, Junji Otsuka, Atsushi Irie, Takeshi Ohashi
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
High-quality data has become a primary driver of progress under scale laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model's evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Xuan-Hao Liu, Yan-Kai Liu, Tianyi Zhou, Bao-Liang Lu, Wei-Long Zheng
ๆฅ็ๆ่ฆ
Reconstructing video from brain signals is an important brain decoding task. Existing brain decoding frameworks are primarily built on a subject-dependent paradigm, which requires large amounts of brain data for each subject. However, the expensive cost of collecting brain-video data causes severe data scarcity. Although some cross-subject methods being introduced, they often overfocus with subject-invariant information while neglecting subject-specific information, resulting in slow fine-tune-based adaptation strategy. To achieve fast and data-efficient new subject adaptation, we propose MindCross, a novel cross-subject framework. MindCross's N specific encoders and one shared encoder are designed to extract subject-specific and subject-invariant information, respectively. Additionally, a Top-K collaboration module is adopted to enhance new subject decoding with the knowledge learned from previous subjects' encoders. Extensive experiments on fMRI/EEG-to-video benchmarks demonstrate MindCross's efficacy and efficiency of cross-subject decoding and new subject adaptation using only one model.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pengcheng Shi
ๆฅ็ๆ่ฆ
The aorta, the body's largest artery, is prone to pathologies such as dissection, aneurysm, and atherosclerosis, which often require timely intervention. Minimally invasive repairs involving branch vessels necessitate detailed 3D anatomical analysis. Existing methods often overlook hierarchical anatomical relationships while struggling with severe class imbalance inherent in vascular structures. We address these challenges with a curriculum learning strategy that leverages a novel fractal softmax for hierarchical semantic learning. Inspired by human cognition, our approach progressively learns anatomical constraints by decomposing complex structures from simple to complex components. The curriculum learning framework naturally addresses class imbalance by first establishing robust feature representations for dominant classes before tackling rare but anatomically critical structures, significantly accelerating model convergence in multi-class scenarios. Our two-stage inference strategy achieves up to fivefold acceleration, enhancing clinical practicality. On the validation set at epoch 50, our hierarchical semantic loss improves the Dice score of nnU-Net ResEnc M by 11.65%. The proposed model demonstrates a 5.6% higher Dice score than baselines on the test set. Experimental results show significant improvements in segmentation accuracy and efficiency, making the framework suitable for real-time clinical applications. The implementation code for this challenge entry is publicly available at: https://github.com/PengchengShi1220/AortaSeg24. The code for fractal softmax will be available at https://github.com/PengchengShi1220/fractal-softmax.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin et al.
Multimodal
ๆฅ็ๆ่ฆ
Precise event spotting (PES) aims to recognize fine-grained events at exact moments and has become a key component of sports analytics. This task is particularly challenging due to rapid succession, motion blur, and subtle visual differences. Consequently, most existing methods rely on domain-specific, end-to-end training with large labeled datasets and often struggle in few-shot conditions due to their dependence on pixel- or pose-based inputs alone. However, obtaining large labeled datasets is practically hard. We propose a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot PES. UMEG-Net integrates human skeletons and sport-specific object keypoints into a unified graph and features an efficient spatio-temporal extraction module based on advanced GCN and multi-scale temporal shift. To further enhance performance, we employ multimodal distillation to transfer knowledge from keypoint-based graphs to visual representations. Our approach achieves robust performance with limited labeled data and significantly outperforms baseline models in few-shot settings, providing a scalable and effective solution for few-shot PES. Code is publicly available at https://github.com/LZYAndy/UMEG-Net.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangyu Li, Chen Wang, Yumao Liu, Dengbo He, Jiahao Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
Most existing autonomous-driving datasets (e.g., KITTI, nuScenes, and the Waymo Perception Dataset), collected by human-driving mode or unidentified driving mode, can only serve as early training for the perception and prediction of autonomous vehicles (AVs). To evaluate the real behavioral safety of AVs controlled in the black box, we present the first end-to-end benchmark dataset collected entirely by autonomous-driving mode in the real world. This dataset contains over 100 hours of naturalistic data from multiple production autonomous-driving vehicle models in the market. We segment the original data into 32,727 key frames, each consisting of four synchronized camera images and high-precision GNSS/IMU data (0.8 cm localization accuracy). For each key frame, 20 Hz vehicle trajectories spanning the past 6 s and future 5 s are provided, along with detailed 2D annotations of surrounding vehicles, pedestrians, traffic lights, and traffic signs. These key frames have rich scenario-level attributes, including driver intent, area type (covering highways, urban roads, and residential areas), lighting (day, night, or dusk), weather (clear or rain), road surface (paved or unpaved), traffic and vulnerable road users (VRU) density, traffic lights, and traffic signs (warning, prohibition, and indication). To evaluate the safety of AVs, we employ an end-to-end motion planning model that predicts vehicle trajectories with an Average Displacement Error (ADE) of 1.4 m on autonomous-driving frames. The dataset continues to expand by over 10 hours of new data weekly, thereby providing a sustainable foundation for research on AV driving behavior analysis and safety evaluation.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuan Zhao, Zhongyu Zhang, Yuge Huang, Yuxi Mi, Guodong Mu et al.
ๆฅ็ๆ่ฆ
Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models for additional supervision, to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods employ a locally supervised approach for semantic supervision, which limits the uniformity of semantic distribution. However, VA-VAE proves that a more uniform feature distribution yields better generation performance. In this work, we introduce a Global Perspective Tokenizer (GloTok), which utilizes global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method is proposed to transfer the semantics, which are modeled by pre-trained models on the entire dataset, to the semantic codebook. Then, we design a residual learning module that recovers the fine-grained details to minimize the reconstruction error caused by quantization. Through the above design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during the training process. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingdong Zhang, Lingzhi Zhang, Qing Liu, Mang Tik Chiu, Connelly Barnes et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Digital images are often degraded by soft effects such as lens flare, haze, shadows, and reflections, which reduce aesthetics even though the underlying pixels remain partially visible. The prevailing works address these degradations in isolation, developing highly specialized, specialist models that lack scalability and fail to exploit the shared underlying essences of these restoration problems. While specialist models are limited, recent large-scale pretrained generalist models offer powerful, text-driven image editing capabilities. while recent general-purpose systems (e.g., GPT-4o, Flux Kontext, Nano Banana) require detailed prompts and often fail to achieve robust removal on these fine-grained tasks or preserve identity of the scene. Leveraging the common essence of soft effects, i.e., semi-transparent occlusions, we introduce a foundational versatile model UniSER, capable of addressing diverse degradations caused by soft effects within a single framework. Our methodology centers on curating a massive 3.8M-pair dataset to ensure robustness and generalization, which includes novel, physically-plausible data to fill critical gaps in public benchmarks, and a tailored training pipeline that fine-tunes a Diffusion Transformer to learn robust restoration priors from this diverse data, integrating fine-grained mask and strength controls. This synergistic approach allows UniSER to significantly outperform both specialist and generalist models, achieving robust, high-fidelity restoration in the wild.
๐
2025-11-18
๐ 14 pages, 7 figures, journal
๐ฅ Yanshan Li, Ke Ma, Miaomiao Wei, Linhui Dai
Machine Learning
ๆฅ็ๆ่ฆ
Existing self-supervised contrastive learning methods for skeleton-based action recognition often process all skeleton regions uniformly, and adopt a first-in-first-out (FIFO) queue to store negative samples, which leads to motion information loss and non-optimal negative sample selection. To address these challenges, this paper proposes Dominance-Game Contrastive Learning network for skeleton-based action Recognition (DoGCLR), a self-supervised framework based on game theory. DoGCLR models the construction of positive and negative samples as a dynamic Dominance Game, where both sample types interact to reach an equilibrium that balances semantic preservation and discriminative strength. Specifically, a spatio-temporal dual weight localization mechanism identifies key motion regions and guides region-wise augmentations to enhance motion diversity while maintaining semantics. In parallel, an entropy-driven dominance strategy manages the memory bank by retaining high entropy (hard) negatives and replacing low-entropy (weak) ones, ensuring consistent exposure to informative contrastive signals. Extensive experiments are conducted on NTU RGB+D and PKU-MMD datasets. On NTU RGB+D 60 X-Sub/X-View, DoGCLR achieves 81.1%/89.4% accuracy, and on NTU RGB+D 120 X-Sub/X-Set, DoGCLR achieves 71.2%/75.5% accuracy, surpassing state-of-the-art methods by 0.1%, 2.7%, 1.1%, and 2.3%, respectively. On PKU-MMD Part I/Part II, DoGCLR performs comparably to the state-of-the-art methods and achieves a 1.9% higher accuracy on Part II, highlighting its strong robustness on more challenging scenarios.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinliang Zhang, Lei Zhu, Hangzhou He, Shuang Zeng, Ourui Fu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) have demonstrated substantial value in unified text-image understanding and reasoning, primarily by converting images into sequences of patch-level tokens that align with their architectural paradigm. However, patch-level tokenization leads to a quadratic growth in image tokens, burdening MLLMs' understanding and reasoning with enormous computation and memory. Additionally, the traditional patch-wise scanning tokenization workflow misaligns with the human vision cognition system, further leading to hallucination and computational redundancy. To address this issue, we propose an object-level token merging strategy for Adaptive Token compression, revealing the consistency with human vision system. The experiments are conducted on multiple comprehensive benchmarks, which show that our approach averagely, utilizes only 10% tokens while achieving almost 96% of the vanilla model's performance. More extensive experimental results in comparison with relevant works demonstrate the superiority of our method in balancing compression ratio and performance. Our code will be available.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaoquan Sun, Ruijian Zhang, Kang Pang, Bingchen Miao, Yuxiang Tan et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Household tidying is an important application area, yet current benchmarks neither model user preferences nor support mobility, and they generalize poorly, making it hard to comprehensively assess integrated language-to-action capabilities. To address this, we propose RoboTidy, a unified benchmark for language-guided household tidying that supports Vision-Language-Action (VLA) and Vision-Language-Navigation (VLN) training and evaluation. RoboTidy provides 500 photorealistic 3D Gaussian Splatting (3DGS) household scenes (covering 500 objects and containers) with collisions, formulates tidying as an "Action (Object, Container)" list, and supplies 6.4k high-quality manipulation demonstration trajectories and 1.5k naviagtion trajectories to support both few-shot and large-scale training. We also deploy RoboTidy in the real world for object tidying, establishing an end-to-end benchmark for household tidying. RoboTidy offers a scalable platform and bridges a key gap in embodied AI by enabling holistic and realistic evaluation of language-guided robots.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xun Lin, Shuai Wang, Yi Yu, Zitong Yu, Jiale Zhou et al.
Multimodal
ๆฅ็ๆ่ฆ
Multimodal Face Anti-Spoofing (FAS) methods, which integrate multiple visual modalities, often suffer even more severe performance degradation than unimodal FAS when deployed in unseen domains. This is mainly due to two overlooked risks that affect cross-domain multimodal generalization. The first is the modal representation invariant risk, i.e., whether representations remain generalizable under domain shift. We theoretically show that the inherent class asymmetry in FAS (diverse spoofs vs. compact reals) enlarges the upper bound of generalization error, and this effect is further amplified in multimodal settings. The second is the modal synergy invariant risk, where models overfit to domain-specific inter-modal correlations. Such spurious synergy cannot generalize to unseen attacks in target domains, leading to performance drops. To solve these issues, we propose a provable framework, namely Multimodal Representation and Synergy Invariance Learning (RiSe). For representation risk, RiSe introduces Asymmetric Invariant Risk Minimization (AsyIRM), which learns an invariant spherical decision boundary in radial space to fit asymmetric distributions, while preserving domain cues in angular space. For synergy risk, RiSe employs Multimodal Synergy Disentanglement (MMSD), a self-supervised task enhancing intrinsic, generalizable modal features via cross-sample mixing and disentanglement. Theoretical analysis and experiments verify RiSe, which achieves state-of-the-art cross-domain performance.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Laura Dodds, Maisy Lam, Waleed Akbar, Yibo Cheng, Fadel Adib
RoboticsComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
We present Wave-Former, a novel method capable of high-accuracy 3D shape reconstruction for completely occluded, diverse, everyday objects. This capability can open new applications spanning robotics, augmented reality, and logistics. Our approach leverages millimeter-wave (mmWave) wireless signals, which can penetrate common occlusions and reflect off hidden objects. In contrast to past mmWave reconstruction methods, which suffer from limited coverage and high noise, Wave-Former introduces a physics-aware shape completion model capable of inferring full 3D geometry. At the heart of Wave-Former's design is a novel three-stage pipeline which bridges raw wireless signals with recent advancements in vision-based shape completion by incorporating physical properties of mmWave signals. The pipeline proposes candidate geometric surfaces, employs a transformer-based shape completion model designed specifically for mmWave signals, and finally performs entropy-guided surface selection. This enables Wave-Former to be trained using entirely synthetic point-clouds, while demonstrating impressive generalization to real-world data.In head-to-head comparisons with state-of-the-art baselines, Wave-Former raises recall from 54% to 72% while maintaining a high precision of 85%.
๐
2025-11-18
๐ IROS 2025
๐ฅ Hao Wang, Linqing Zhao, Xiuwei Xu, Jiwen Lu, Haibin Yan
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Recent trends in SLAM and visual navigation have embraced 3D Gaussians as the preferred scene representation, highlighting the importance of estimating camera poses from a single image using a pre-built Gaussian model. However, existing approaches typically rely on an iterative \textit{render-compare-refine} loop, where candidate views are first rendered using NeRF or Gaussian Splatting, then compared against the target image, and finally, discrepancies are used to update the pose. This multi-round process incurs significant computational overhead, hindering real-time performance in robotics. In this paper, we propose iGaussian, a two-stage feed-forward framework that achieves real-time camera pose estimation through direct 3D Gaussian inversion. Our method first regresses a coarse 6DoF pose using a Gaussian Scene Prior-based Pose Regression Network with spatial uniform sampling and guided attention mechanisms, then refines it through feature matching and multi-model fusion. The key contribution lies in our cross-correlation module that aligns image embeddings with 3D Gaussian attributes without differentiable rendering, coupled with a Weighted Multiview Predictor that fuses features from Multiple strategically sampled viewpoints. Experimental results on the NeRF Synthetic, Mip-NeRF 360, and T\&T+DB datasets demonstrate a significant performance improvement over previous methods, reducing median rotation errors to 0.2ยฐ while achieving 2.87 FPS tracking on mobile robots, which is an impressive 10 times speedup compared to optimization-based approaches. Code: https://github.com/pythongod-exe/iGaussian
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ An Yu, Weiheng Lu, Jian Li, Zhenfei Zhang, Yunhang Shen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLM), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce \textit{S}hot-aware \textit{M}ultimodal \textit{A}udio-enhanced \textit{R}etrieval of \textit{T}emporal \textit{S}egments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying \textbf{Shot-aware Token Compression}, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61\% increase in R1@0.5 and 2.59\% gain in R1@0.7 on Charades-STA.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mingi Kang, Jeovรก Farias Sales Rocha Neto
Computer VisionMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and aggregation; convolution selects neighbors by spatial proximity, while attention selects by feature similarity, revealing they exist on a continuous spectrum. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. Crucially, ConvNN serves as a drop-in replacement for convolutional and attention layers, enabling systematic exploration of the intermediate spectrum between these two extremes. We validate the framework's coherence on CIFAR-10 and CIFAR-100 classification tasks across two complementary architectures: (1) Hybrid branching in VGG improves accuracy on both CIFAR datasets by combining spatial-proximity and feature-similarity selection; and (2) ConvNN in ViT outperforms standard attention and other attention variants on both datasets. Extensive ablations on $k$ values and architectural variants reveal that interpolating along this spectrum provides regularization benefits by balancing local and global receptive fields. Our work provides a unifying framework that dissolves the apparent distinction between convolution and attention, with implications for designing more principled and interpretable vision architectures.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Zhen, Yunxiang Yang, Jidong J. Yang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Pedestrian-vehicle incidents remain a critical urban safety challenge, with pedestrians accounting for over 20% of global traffic fatalities. Although existing video-based systems can detect when incidents occur, they provide little insight into how these events unfold across the distinct cognitive phases of pedestrian behavior. Recent vision-language models (VLMs) have shown strong potential for video understanding, but they remain limited in that they typically process videos in isolation, without explicit temporal structuring or multi-view integration. This paper introduces Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning (MP-PVIR), a unified framework that systematically processes multi-view video streams into structured diagnostic reports through four stages: (1) event-triggered multi-view video acquisition, (2) pedestrian behavior phase segmentation, (3) phase-specific multi-view reasoning, and (4) hierarchical synthesis and diagnostic reasoning. The framework operationalizes behavioral theory by automatically segmenting incidents into cognitive phases, performing synchronized multi-view analysis within each phase, and synthesizing results into causal chains with targeted prevention strategies. Particularly, two specialized VLMs underpin the MP-PVIR pipeline: TG-VLM for behavioral phase segmentation (mIoU = 0.4881) and PhaVR-VLM for phase-aware multi-view analysis, achieving a captioning score of 33.063 and up to 64.70% accuracy on question answering. Finally, a designated large language model is used to generate comprehensive reports detailing scene understanding, behavior interpretation, causal reasoning, and prevention recommendations. Evaluation on the Woven Traffic Safety dataset shows that MP-PVIR effectively translates multi-view video data into actionable insights, advancing AI-driven traffic safety analytics for vehicle-infrastructure cooperative systems.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyao Zeng, Jingcheng Ni, Ruyi Liu, Alex Wong
Robotics
ๆฅ็ๆ่ฆ
Text-to-image diffusion models can generate diverse content with flexible prompts, which makes them well-suited for customization through fine-tuning with a small amount of user-provided data. However, controllable fine-tuning that prevents models from learning undesired concepts present in the fine-tuning data, and from entangling those concepts with user prompts, remains an open challenge. It is crucial for downstream tasks like bias mitigation, preventing malicious adaptation, attribute disentanglement, and generalizable fine-tuning of diffusion policy. We propose Coffee that allows using language to specify undesired concepts to regularize the adaptation process. The crux of our method lies in keeping the embeddings of the user prompt from aligning with undesired concepts. Crucially, Coffee requires no additional training and enables flexible modification of undesired concepts by modifying textual descriptions. We evaluate Coffee by fine-tuning on images associated with user prompts paired with undesired concepts. Experimental results demonstrate that Coffee can prevent text-to-image models from learning specified undesired concepts during fine-tuning and outperforms existing methods. Code will be released upon acceptance.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Srivathsan Sivakumar, Faisal Z. Qureshi
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Vision Transformers (ViTs) have demonstrated remarkable performance across a range of computer vision tasks; however, their high computational, memory, and energy demands hinder deployment on resource-constrained platforms. In this paper, we propose \emph{Cascaded-ViT (CViT)}, a lightweight and compute-efficient vision transformer architecture featuring a novel feedforward network design called \emph{Cascaded-Chunk Feed Forward Network (CCFFN)}. By splitting input features, CCFFN improves parameter and FLOP efficiency without sacrificing accuracy. Experiments on ImageNet-1K show that our \emph{CViT-XL} model achieves 75.5\% Top-1 accuracy while reducing FLOPs by 15\% and energy consumption by 3.3\% compared to EfficientViT-M5. Across various model sizes, the CViT family consistently exhibits the lowest energy consumption, making it suitable for deployment on battery-constrained devices such as mobile phones and drones. Furthermore, when evaluated using a new metric called \emph{Accuracy-Per-FLOP (APF)}, which quantifies compute efficiency relative to accuracy, CViT models consistently achieve top-ranking efficiency. Particularly, CViT-L is 2.2\% more accurate than EfficientViT-M2 while having comparable APF scores.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenyu Li, Tianyi Shang
ๆฅ็ๆ่ฆ
Visual Place Recognition (VPR) aims to match query images against a database using visual cues. State-of-the-art methods aggregate features from deep backbones to form global descriptors. Optimal transport-based aggregation methods reformulate feature-to-cluster assignment as a transport problem, but the standard Sinkhorn algorithm symmetrically treats source and target marginals, limiting effectiveness when image features and cluster centers exhibit substantially different distributions. We propose an asymmetric aggregation VPR method with geometric constraints for locally aggregated descriptors, called $A^2$GC-VPR. Our method employs row-column normalization averaging with separate marginal calibration, enabling asymmetric matching that adapts to distributional discrepancies in visual place recognition. Geometric constraints are incorporated through learnable coordinate embeddings, computing compatibility scores fused with feature similarities, thereby promoting spatially proximal features to the same cluster and enhancing spatial awareness. Experimental results on MSLS, NordLand, and Pittsburgh datasets demonstrate superior performance, validating the effectiveness of our approach in improving matching accuracy and robustness.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zeyu Cheng, Tongfei Liu, Tao Lei, Xiang Hua, Yi Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
Depth information is crucial for autonomous driving and intelligent robot navigation. The simplicity and flexibility of self-supervised monocular depth estimation are conducive to its role in these fields. However, most existing monocular depth estimation models consume many computing resources. Although some methods have reduced the model's size and improved computing efficiency, the performance deteriorates, seriously hindering the real-world deployment of self-supervised monocular depth estimation models in the real world. To address this problem, we proposed a real-time self-supervised monocular depth estimation method and implemented it in the real world. It is called RTS-Mono, which is a lightweight and efficient encoder-decoder architecture. The encoder is based on Lite-Encoder, and the decoder is designed with a multi-scale sparse fusion framework to minimize redundancy, ensure performance, and improve inference speed. RTS-Mono achieved state-of-the-art (SoTA) performance in high and low resolutions with extremely low parameter counts (3 M) in experiments based on the KITTI dataset. Compared with lightweight methods, RTS-Mono improved Abs Rel and Sq Rel by 5.6% and 9.8% at low resolution and improved Sq Rel and RMSE by 6.1% and 1.9% at high resolution. In real-world deployment experiments, RTS-Mono has extremely high accuracy and can perform real-time inference on Nvidia Jetson Orin at a speed of 49 FPS. Source code is available at https://github.com/ZYCheng777/RTS-Mono.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiqing Shen, Chenjia Li, Mathias Unberath
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Text-driven video editing enables users to modify video content only using text queries. While existing methods can modify video content if explicit descriptions of editing targets with precise spatial locations and temporal boundaries are provided, these requirements become impractical when users attempt to conceptualize edits through implicit queries referencing semantic properties or object relationships. We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications, and a first model attempting to solve this complex task, RIVER (Reasoning-based Implicit Video Editor). RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. A large language model then processes this representation jointly with the implicit query, performing multi-hop reasoning to determine modifications, then outputs structured instructions that guide a diffusion-based editor to execute pixel-level changes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality. Finally, we introduce RVEBenchmark, a benchmark of 100 videos with 519 implicit queries spanning three levels and categories of reasoning complexity specifically for reasoning video editing. RIVER demonstrates best performance on the proposed RVEBenchmark and also achieves state-of-the-art performance on two additional video editing benchmarks (VegGIE and FiVE), where it surpasses six baseline methods.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingren Liu, Shuning Xu, Qirui Yang, Yun Wang, Xiangyu Chen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Raha Aghaei, Ali A. Kiaei, Mahnaz Boush, Mahan Rofoosheh, Mohammad Zavvar
Natural Language Processing
ๆฅ็ๆ่ฆ
This study analyzes the multiple functions of Large Language Models (LLMs) in transforming research and development (R&D) processes. By automating knowledge discovery, boosting hypothesis creation, integrating transdisciplinary insights, and enabling cooperation within innovation ecosystems, LLMs dramatically improve the efficiency and effectiveness of research processes. Through extensive analysis of scientific literature, patent databases, and experimental data, these models enable more flexible and informed R&D workflows, ultimately accelerating innovation cycles and lowering time-to-market for breakthrough ideas.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ali Salehi, Cassandra L. Jacobs
ๆฅ็ๆ่ฆ
We investigate tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches on morphological similarity preservation tasks. We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation and evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Our analysis reveals critical evaluation biases in tokenization comparison. While BPE initially appears superior in morphological similarity, it evaluates only 28.6\% of test cases compared to 68.7\% for morpheme model, creating artificial performance inflation. When assessed comprehensively, morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity levels. These findings highlight the importance of coverage-aware evaluation in low-resource language processing and offers different tokenization methods for low-resourced language processing.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Rishu Kumar Singh, Navneet Shreya, Sarmistha Das, Apoorva Singh, Sriparna Saha
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues, where users often share both textual complaints and visual evidence (e.g., screenshots, product photos) to enable fine-grained classification of complaint aspects and severity. We introduce VALOR, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate VALOR on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and codes are available here: https://github.com/sarmistha-D/VALOR
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Clovis Gladstone, Zhao Fang, Spencer Dean Stewart
Natural Language Processing
ๆฅ็ๆ่ฆ
Historical and low-resource NLP remains challenging due to limited annotated data and domain mismatches with modern, web-sourced corpora. This paper outlines our work in using large language models (LLMs) to create ground-truth annotations for historical French (16th-20th centuries) and Chinese (1900-1950) texts. By leveraging LLM-generated ground truth on a subset of our corpus, we were able to fine-tune spaCy to achieve significant gains on period-specific tests for part-of-speech (POS) annotations, lemmatization, and named entity recognition (NER). Our results underscore the importance of domain-specific models and demonstrate that even relatively limited amounts of synthetic data can improve NLP tools for under-resourced corpora in computational humanities research.
๐
2025-11-18
๐ NeurIPS 2025
๐ฅ Kiera McCormick, Rafael Martรญnez-Galarza
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models have demonstrated the ability to generalize well at many levels across domains, modalities, and even shown in-context learning capabilities. This enables research questions regarding how they can be used to encode physical information that is usually only available from scientific measurements, and loosely encoded in textual descriptions. Using astrophysics as a test bed, we investigate if LLM embeddings can codify physical summary statistics that are obtained from scientific measurements through two main questions: 1) Does prompting play a role on how those quantities are codified by the LLM? and 2) What aspects of language are most important in encoding the physics represented by the measurement? We investigate this using sparse autoencoders that extract interpretable features from the text.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Biaojie Zeng, Min Zhang, Juan Zhou, Fengrui Liu, Ruiyang Huang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) often make reasoning errors when solving mathematical problems, and how to automatically detect and correct these errors has become an important research direction. However, existing approaches \textit{mainly focus on self-correction within the model}, which falls short of the ``teacher-style`` correction required in educational settings, \textit{i.e.}, systematically guiding and revising a student's problem-solving process. To address this gap, we propose \texttt{SMRC} (\textit{\underline{S}tudent \underline{M}athematical \underline{R}easoning \underline{C}orrection}), a novel method that aligns LLMs with student reasoning. Specifically, \texttt{SMRC} formulates student reasoning as a multi-step sequential decision problem and introduces Monte Carlo Tree Search (MCTS) to explore optimal correction paths. To reduce the cost of the annotating process-level rewards, we leverage breadth-first search (BFS) guided by LLMs and final-answer evaluation to generate reward signals, which are then distributed across intermediate reasoning steps via a back-propagation mechanism, enabling fine-grained process supervision. Additionally, we construct a benchmark for high school mathematics, MSEB (Multi-Solution Error Benchmark), consisting of 158 instances that include problem statements, student solutions, and correct reasoning steps. We further propose a dual evaluation protocol centered on \textbf{solution accuracy} and \textbf{correct-step retention}, offering a comprehensive measure of educational applicability. Experiments demonstrate that \texttt{SMRC} significantly outperforms existing methods on two public datasets (ProcessBench and MR-GSM8K) and our MSEB in terms of effectiveness and overall performance. The code and data are available at https://github.com/Mind-Lab-ECNU/SMRC.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Oscar Fontanelli, Wentian Li
ๆฅ็ๆ่ฆ
Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next-order approximation, we have shown, by twenty English novels or writings (some are translated from another language to English), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and quadratic term consistently lead to a linear coefficient of slightly larger than 1, and a quadratic coefficient around -0.02. Using the ``random drawing colored ball from the bag with replacement" model, we have shown that the curvature of the log-log scale is identical to a ``pseudo-variance" which is negative. Although a pseudo-variance calculation may encounter numeric instability when the number of tokens is large, due to the large values of pseudo-weights, this formalism provides a rough estimation of the curvature when the number of tokens is small.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kristi Topollai, Tolga Dimlioglu, Anna Choromanska, Simon Odie, Reginald Hui
ๆฅ็ๆ่ฆ
Contract management involves reviewing and negotiating provisions, individual clauses that define rights, obligations, and terms of agreement. During this process, revisions to provisions are proposed and iteratively refined, some of which may be problematic or unacceptable. Automating this workflow is challenging due to the scarcity of labeled data and the abundance of unstructured legacy contracts. In this paper, we present a modular framework designed to streamline contract management through a retrieval-augmented generation (RAG) pipeline. Our system integrates synthetic data generation, semantic clause retrieval, acceptability classification, and reward-based alignment to flag problematic revisions and generate improved alternatives. Developed and evaluated in collaboration with an industry partner, our system achieves over 80% accuracy in both identifying and optimizing problematic revisions, demonstrating strong performance under real-world, low-resource conditions and offering a practical means of accelerating contract revision workflows.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xia Cui, Ziyi Huang, Naeemeh Adel
Natural Language Processing
ๆฅ็ๆ่ฆ
Annotation bias in NLP datasets remains a major challenge for developing multilingual Large Language Models (LLMs), particularly in culturally diverse settings. Bias from task framing, annotator subjectivity, and cultural mismatches can distort model outputs and exacerbate social harms. We propose a comprehensive framework for understanding annotation bias, distinguishing among instruction bias, annotator bias, and contextual and cultural bias. We review detection methods (including inter-annotator agreement, model disagreement, and metadata analysis) and highlight emerging techniques such as multilingual model divergence and cultural inference. We further outline proactive and reactive mitigation strategies, including diverse annotator recruitment, iterative guideline refinement, and post-hoc model adjustments. Our contributions include: (1) a typology of annotation bias; (2) a synthesis of detection metrics; (3) an ensemble-based bias mitigation approach adapted for multilingual settings, and (4) an ethical analysis of annotation processes. Together, these insights aim to inform more equitable and culturally grounded annotation pipelines for LLMs.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuhan Zhang, Erxiao Wang, Cory Shain
Natural Language Processing
ๆฅ็ๆ่ฆ
Like visual processing, language processing is susceptible to illusions in which people systematically misperceive stimuli. In one such case--the comparative illusion (CI), e.g., More students have been to Russia than I have--comprehenders tend to judge the sentence as acceptable despite its underlying nonsensical comparison. Prior research has argued that this phenomenon can be explained as Bayesian inference over a noisy channel: the posterior probability of an interpretation of a sentence is proportional to both the prior probability of that interpretation and the likelihood of corruption into the observed (CI) sentence. Initial behavioral work has supported this claim by evaluating a narrow set of alternative interpretations of CI sentences and showing that comprehenders favor interpretations that are more likely to have been corrupted into the illusory sentence. In this study, we replicate and go substantially beyond this earlier work by directly predicting the strength of illusion with a quantitative model of the posterior probability of plausible interpretations, which we derive through a novel synthesis of statistical language models with human behavioral data. Our model explains not only the fine gradations in the strength of CI effects, but also a previously unexplained effect caused by pronominal vs. full noun phrase than-clause subjects. These findings support a noisy-channel theory of sentence comprehension by demonstrating that the theory makes novel predictions about the comparative illusion that bear out empirically. This outcome joins related evidence of noisy channel processing in both illusory and non-illusory contexts to support noisy channel inference as a unified computational-level theory of diverse language processing phenomena.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tao Yang, Dandan Huang, Yunting Lin, Pengfei Wu, Zhikun Wu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Rare diseases affect hundreds of millions worldwide, yet diagnosis often spans years. Convectional pipelines decouple noisy evidence extraction from downstream inferential diagnosis, and general/medical large language models (LLMs) face scarce real world electronic health records (EHRs), stale domain knowledge, and hallucinations. We assemble a large, domain specialized clinical corpus and a clinician validated reasoning set, and develop RareSeek R1 via staged instruction tuning, chain of thought learning, and graph grounded retrieval. Across multicenter EHR narratives and public benchmarks, RareSeek R1 attains state of the art accuracy, robust generalization, and stability under noisy or overlapping phenotypes. Augmented retrieval yields the largest gains when narratives pair with prioritized variants by resolving ambiguity and aligning candidates to mechanisms. Human studies show performance on par with experienced physicians and consistent gains in assistive use. Notably, transparent reasoning highlights decisive non phenotypic evidence (median 23.1%, such as imaging, interventions, functional tests) underpinning many correct diagnoses. This work advances a narrative first, knowledge integrated reasoning paradigm that shortens the diagnostic odyssey and enables auditable, clinically translatable decision support.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kahaan Gandhi, Boris Bolliet, Inigo Zubeldia
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real-time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass at 1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shreya Adrita Banik, Niaz Nafi Rahman, Tahsina Moiukh, Farig Sadeque
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Detecting political bias in news media is a complex task that requires interpreting subtle linguistic and contextual cues. Although recent advances in Natural Language Processing (NLP) have enabled automatic bias classification, the extent to which large language models (LLMs) align with human judgment still remains relatively underexplored and not yet well understood. This study aims to present a comparative framework for evaluating the detection of political bias across human annotations and multiple LLMs, including GPT, BERT, RoBERTa, and FLAN. We construct a manually annotated dataset of news articles and assess annotation consistency, bias polarity, and inter-model agreement to quantify divergence between human and model perceptions of bias. Experimental results show that among traditional transformer-based models, RoBERTa achieves the highest alignment with human labels, whereas generative models such as GPT demonstrate the strongest overall agreement with human annotations in a zero-shot setting. Among all transformer-based baselines, our fine-tuned RoBERTa model acquired the highest accuracy and the strongest alignment with human-annotated labels. Our findings highlight systematic differences in how humans and LLMs perceive political slant, underscoring the need for hybrid evaluation frameworks that combine human interpretability with model scalability in automated media bias detection.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yilu Fang, Jordan G. Nestor, Casey N. Ta, Jerard Z. Kneifati-Hayek, Chunhua Weng
ๆฅ็ๆ่ฆ
Patients with acute kidney injury (AKI) are at high risk of developing chronic kidney disease (CKD), but identifying those at greatest risk remains challenging. We used electronic health record (EHR) data to dynamically track AKI patients' clinical evolution and characterize AKI-to-CKD progression. Post-AKI clinical states were identified by clustering patient vectors derived from longitudinal medical codes and creatinine measurements. Transition probabilities between states and progression to CKD were estimated using multi-state modeling. After identifying common post-AKI trajectories, CKD risk factors in AKI subpopulations were identified through survival analysis. Of 20,699 patients with AKI at admission, 3,491 (17%) developed CKD. We identified fifteen distinct post-AKI states, each with different probabilities of CKD development. Most patients (75%, n=15,607) remained in a single state or made only one transition during the study period. Both established (e.g., AKI severity, diabetes, hypertension, heart failure, liver disease) and novel CKD risk factors, with their impact varying across these clinical states. This study demonstrates a data-driven approach for identifying high-risk AKI patients, supporting the development of decision-support tools for early CKD detection and intervention.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Noam Dahan, Omer Kidron, Gabriel Stanovsky
ๆฅ็ๆ่ฆ
High quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process, suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lucia Makaiovรก, Martin Fajฤรญk, Antonรญn Jarolรญm
ๆฅ็ๆ่ฆ
Document-level claim extraction remains an open challenge in the field of fact-checking, and subsequently, methods for evaluating extracted claims have received limited attention. In this work, we explore approaches to aligning two sets of claims pertaining to the same source document and computing their similarity through an alignment score. We investigate techniques to identify the best possible alignment and evaluation method between claim sets, with the aim of providing a reliable evaluation framework. Our approach enables comparison between model-extracted and human-annotated claim sets, serving as a metric for assessing the extraction performance of models and also as a possible measure of inter-annotator agreement. We conduct experiments on newly collected dataset-claims extracted from comments under Czech and Slovak news articles-domains that pose additional challenges due to the informal language, strong local context, and subtleties of these closely related languages. The results draw attention to the limitations of current evaluation approaches when applied to document-level claim extraction and highlight the need for more advanced methods-ones able to correctly capture semantic similarity and evaluate essential claim properties such as atomicity, checkworthiness, and decontextualization.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Alex Shtoff et al.
ๆฅ็ๆ่ฆ
With Retrieval Augmented Generation (RAG) becoming more and more prominent in generative AI solutions, there is an emerging need for systematically evaluating their effectiveness. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to competitors' responses. Our analysis highlights the benchmark's questions diversity, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. The LiveRAG benchmark will hopefully help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mingyue Cheng, Jie Ouyang, Shuo Yu, Ruiran Yan, Yucong Luo et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning (RL) is considered a key technology with significant potential for training such Agents; however, the effective application of RL to LLM Agents is still in its nascent stages and faces considerable challenges. Currently, this emerging field lacks in-depth exploration into RL approaches specifically tailored for the LLM Agent context, alongside a scarcity of flexible and easily extensible training frameworks designed for this purpose. To help advance this area, this paper first revisits and clarifies Reinforcement Learning methodologies for LLM Agents by systematically extending the Markov Decision Process (MDP) framework to comprehensively define the key components of an LLM Agent. Secondly, we introduce Agent-R1, a modular, flexible, and user-friendly training framework for RL-based LLM Agents, designed for straightforward adaptation across diverse task scenarios and interactive environments. We conducted experiments on Multihop QA benchmark tasks, providing initial validation for the effectiveness of our proposed methods and framework.
๐
2025-11-18
๐ ACL
๐ฅ Trishala Jayesh Ahalpara
Natural Language Processing
ๆฅ็ๆ่ฆ
We present Tell Me, a mental well-being system that leverages advances in large language models to provide accessible, context-aware support for users and researchers. The system integrates three components: (i) a retrieval-augmented generation (RAG) assistant for personalized, knowledge-grounded dialogue; (ii) a synthetic client-therapist dialogue generator conditioned on client profiles to facilitate research on therapeutic language and data augmentation; and (iii) a Well-being AI crew, implemented with CrewAI, that produces weekly self-care plans and guided meditation audio. The system is designed as a reflective space for emotional processing rather than a substitute for professional therapy. It illustrates how conversational assistants can lower barriers to support, complement existing care, and broaden access to mental health resources. To address the shortage of confidential therapeutic data, we introduce synthetic client-therapist dialogue generation conditioned on client profiles. Finally, the planner demonstrates an innovative agentic workflow for dynamically adaptive, personalized self-care, bridging the limitations of static well-being tools. We describe the architecture, demonstrate its functionalities, and report evaluation of the RAG assistant in curated well-being scenarios using both automatic LLM-based judgments and a human-user study. This work highlights opportunities for interdisciplinary collaboration between NLP researchers and mental health professionals to advance responsible innovation in human-AI interaction for well-being.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jinru Ding, Lu Lu, Chao Ding, Mouxiao Bian, Jiayuan Chen et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xin Yi, Yue Li, Dongsheng Shi, Linlin Wang, Xiaoling Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly integrated into educational applications. However, they remain vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. Existing studies mainly focus on general safety evaluations, with limited attention to the unique safety requirements of educational scenarios. To address this gap, we construct EduHarm, a benchmark containing safe-unsafe instruction pairs across five representative educational scenarios, enabling systematic safety evaluation of educational LLMs. Furthermore, we propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks. First, safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Second, layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Finally, defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TSSF effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while maintaining preserving utility gains from benign fine-tuning.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mario Sanz-Guerrero, Katharina von der Wense
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty
Natural Language Processing
ๆฅ็ๆ่ฆ
While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks: (a) object localization, (b) counting, (c) image retrieval i.e., (SBIR and fine-grained SBIR), and (d) visual question answering (VQA); while incorporating the three existing sketch datasets, namely QuickDraw!, Sketchy, and Tu Berlin, along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hongwei Liu, Junnan Liu, Shudong Liu, Haodong Duan, Yuqiang Li et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Prathamesh Kalamkar, Ned Letcher, Meissane Chami, Sahger Lad, Shayan Mohanty et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The application of large language models (LLMs) to chemistry is frequently hampered by a "tokenization bottleneck", where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically uninformative sub-tokens. This paper introduces a principled methodology to resolve this bottleneck by unifying the representation of natural language and molecular structures within a single model. Our approach involves targeted vocabulary extension-augmenting a pretrained LLM's vocabulary with chemically salient tokens, followed by continued pretraining on chemistry-domain text to integrate this new knowledge. We provide an empirical demonstration of the effectiveness of this strategy, showing that our methodology leads to superior performance on a range of downstream chemical tasks.
๐
2025-11-18
๐ ArXiv Preprint (cs.DL)
๐ฅ Hang Ding, Yilun Zhao, Tiansheng Hu, Manasi Patwardhan, Arman Cohan
ๆฅ็ๆ่ฆ
The accelerating growth of scientific publications has intensified the need for scalable, trustworthy systems to synthesize knowledge across diverse literature. While recent retrieval-augmented generation (RAG) methods have improved access to scientific information, they often overlook citation graph structure, adapt poorly to complex queries, and yield fragmented, hard-to-verify syntheses. We introduce SciRAG, an open-source framework for scientific literature exploration that addresses these gaps through three key innovations: (1) adaptive retrieval that flexibly alternates between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that leverages citation graphs to organize and filter supporting documents; and (3) outline-guided synthesis that plans, critiques, and refines answers to ensure coherence and transparent attribution. Extensive experiments across multiple benchmarks such as QASA and ScholarQA demonstrate that SciRAG outperforms prior systems in factual accuracy and synthesis quality, establishing a new foundation for reliable, large-scale scientific knowledge aggregation.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Xingwei He, Qianru Zhang, Pengfei Chen, Guanhua Chen, Linlin Yu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Instruction-following is a critical capability of Large Language Models (LLMs). While existing works primarily focus on assessing how well LLMs adhere to user instructions, they often overlook scenarios where instructions contain conflicting constraints-a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs' ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs' conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely explicitly notify users about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.
๐
2025-11-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Eric Xue, Ruiyi Zhang, Zijun Zhang, Pengtao Xie
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Transformer models are foundational to natural language processing (NLP) applications, yet remain vulnerable to backdoor attacks introduced through poisoned data, which implant hidden behaviors during training. To strengthen the ability to prevent such compromises, recent research has focused on designing increasingly stealthy attacks to stress-test existing defenses, pairing backdoor behaviors with stylized artifact or token-level perturbation triggers. However, this trend diverts attention from the harder and more realistic case: making the model respond to semantic triggers such as specific names or entities, where a successful backdoor could manipulate outputs tied to real people or events in deployed systems. Motivated by this growing disconnect, we introduce SteganoBackdoor, bringing stealth techniques back into line with practical threat models. Leveraging innocuous properties from natural-language steganography, SteganoBackdoor applies a gradient-guided data optimization process to transform semantic trigger seeds into steganographic carriers that embed a high backdoor payload, remain fluent, and exhibit no representational resemblance to the trigger. Across diverse experimental settings, SteganoBackdoor achieves over 99% attack success at an order-of-magnitude lower data-poisoning rate than prior approaches while maintaining unparalleled evasion against a comprehensive suite of data-level defenses. By revealing this practical and covert attack, SteganoBackdoor highlights an urgent blind spot in current defenses and demands immediate attention to adversarial data defenses and real-world threat modeling.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xiaochuan Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
In today's data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present AraLingBench: a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ante Wang, Weizhi Ma, Yang Liu
Machine Learning
ๆฅ็ๆ่ฆ
Knowing the reliability of a model's response is essential in application. With the strong generation capabilities of LLMs, research has focused on generating verbalized confidence. This is further enhanced by combining chain-of-thought reasoning, which provides logical and transparent estimation. However, how reasoning strategies affect the estimated confidence is still under-explored. In this work, we demonstrate that predicting a verbalized probability distribution can effectively encourage in-depth reasoning for confidence estimation. Intuitively, it requires an LLM to consider all candidates within the answer space instead of basing on a single guess, and to carefully assign confidence scores to meet the requirements of a distribution. This method shows an advantage across different models and various tasks, regardless of whether the answer space is known. Its advantage is maintained even after reinforcement learning, and further analysis shows its reasoning patterns are aligned with human expectations.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hourun Zhu, Yang Gao, Wenlong Fei, Jiawei Li, Huashan Sun
ๆฅ็ๆ่ฆ
Large reasoning models have demonstrated remarkable performance on complex reasoning tasks, yet the excessive length of their chain-of-thought outputs remains a major practical bottleneck due to high computation cost and poor deployability. Existing compression methods have achieved partial success but overlook a crucial phenomenon in the training process -- the entropy conflict. During compression training, entropy decreases, leading to shorter reasoning but limited exploration, while accuracy-oriented objectives increase entropy, lengthening reasoning chains. This can cause the model to get stuck in a local dilemma. Our analysis further reveals the origin of the entropy conflict: many high-entropy tokens are logical connectors that receive larger gradients and are encouraged under the performance objective, while the compression objective simultaneously penalizes these potentially redundant connectors. This opposing pressure creates a direct source of entropy conflict. To address these issues, we adopt an entropy-guided training framework. As entropy descends, the model is guided toward efficient reasoning by encouraging concise thought steps; as entropy rises, exploration is reinforced under the compact reasoning mode to improve robustness. Experiments on six mathematical benchmarks show that our method compresses reasoning length to 20% of the original while maintaining or even surpassing baseline accuracy. Code and models will be released publicly.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Gabrial Zencha Ashungafac, Mardhiyah Sanni, Busayo Awobade, Alex Gichamba, Tobi Olatunji
Multimodal
ๆฅ็ๆ่ฆ
Recent advances in speech-enabled AI, including Google's NotebookLM and OpenAI's speech-to-speech API, are driving widespread interest in voice interfaces globally. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa's linguistic diversity. We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous speech conversation drawn from various open African accented English speech datasets. Our empirical analysis reveals systematic variation: open-source ASR models excels in spontaneous speech contexts but degrades on noisy, non-native dialogue; multimodal LLMs are more accent-robust yet struggle with domain-specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Models fine-tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment, hallucinations still remain a big problem for most SOTA models. By releasing this comprehensive benchmark, we empower practitioners and researchers to select voice technologies suited to African use-cases, fostering inclusive voice applications for underserved communities.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Rui Liu, Yuan Zhao, Zhenqi Jia
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The automatic movie dubbing model generates vivid speech from given scripts, replicating a speaker's timbre from a brief timbre prompt while ensuring lip-sync with the silent video. Existing approaches simulate a simplified workflow where actors dub directly without preparation, overlooking the critical director-actor interaction. In contrast, authentic workflows involve a dynamic collaboration: directors actively engage with actors, guiding them to internalize the context cues, specifically emotion, before performance. To address this issue, we propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme to achieve authentic movie dubbing, termed Authentic-Dubber, which contains three novel mechanisms: (1) We construct a multimodal Reference Footage library to simulate the learning footage provided by directors. Note that we integrate Large Language Models (LLMs) to achieve deep comprehension of emotional representations across multimodal signals. (2) To emulate how actors efficiently and comprehensively internalize director-provided footage during dubbing, we propose an Emotion-Similarity-based Retrieval-Augmentation strategy. This strategy retrieves the most relevant multimodal information that aligns with the target silent video. (3) We develop a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, thereby simulating the actor's final dubbing process. The above mechanisms enable the Authentic-Dubber to faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness. Both subjective and objective evaluations on the V2C Animation benchmark dataset validate the effectiveness. The code and demos are available at https://github.com/AI-S2-Lab/Authentic-Dubber.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kai Tian, Yirong Mao, Wendong Bi, Hanjie Wang, Que Wenhui
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models perform strongly on general tasks but remain constrained in specialized settings such as music, particularly in the music-entertainment domain, where corpus scale, purity, and the match between data and training objectives are critical. We address this by constructing a large, music-related natural language corpus (40B tokens) that combines open source and in-house data, and by implementing a domain-first data pipeline: a lightweight classifier filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy-preserving masking. We further integrate multi-source music text with associated metadata to form a broader, better-structured foundation of domain knowledge. On the training side, we introduce reference-model (RM)-based token-level soft scoring for quality control: a unified loss-ratio criterion is used both for data selection and for dynamic down-weighting during optimization, reducing noise gradients and amplifying task-aligned signals, thereby enabling more effective music-domain continued pretraining and alignment. To assess factuality, we design the MusicSimpleQA benchmark, which adopts short, single-answer prompts with automated agreement scoring. Beyond the benchmark design, we conduct systematic comparisons along the axes of data composition. Overall, this work advances both the right corpus and the right objective, offering a scalable data-training framework and a reusable evaluation tool for building domain LLMs in the music field.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ahlam Alrehili, Areej Alhothali
Natural Language Processing
ๆฅ็ๆ่ฆ
Grammatical Error Correction (GEC) is an important aspect of natural language processing. Arabic has a complicated morphological and syntactic structure, posing a greater challenge than other languages. Even though modern neural models have improved greatly in recent years, the majority of previous attempts used individual models without taking into account the potential benefits of combining different systems. In this paper, we present one of the first multi-system approaches for correcting grammatical errors in Arabic, the Arab Enhanced Edit Selection System Complication (ArbESC+). Several models are used to collect correction proposals, which are represented as numerical features in the framework. A classifier determines and implements the appropriate corrections based on these features. In order to improve output quality, the framework uses support techniques to filter overlapping corrections and estimate decision reliability. A combination of AraT5, ByT5, mT5, AraBART, AraBART+Morph+GEC, and Text editing systems gave better results than a single model alone, with F0.5 at 82.63% on QALB-14 test data, 84.64% on QALB-15 L1 data, and 65.55% on QALB-15 L2 data. As one of the most significant contributions of this work, it's the first Arab attempt to integrate linguistic error correction. Improving existing models provides a practical step towards developing advanced tools that will benefit users and researchers of Arabic text processing.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiajun Hou, Chenyu Zhang, Rui Meng
Natural Language Processing
ๆฅ็ๆ่ฆ
Entity Linking (EL), the task of mapping textual entity mentions to their corresponding entries in knowledge bases, constitutes a fundamental component of natural language understanding. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable potential for enhancing EL performance. Prior research has leveraged LLMs to improve entity disambiguation and input representation, yielding significant gains in accuracy and robustness. However, these approaches typically apply LLMs to isolated stages of the EL task, failing to fully integrate their capabilities throughout the entire process.
In this work, we introduce DeepEL, a comprehensive framework that incorporates LLMs into every stage of the entity linking task. Furthermore, we identify that disambiguating entities in isolation is insufficient for optimal performance. To address this limitation, we propose a novel self-validation mechanism that utilizes global contextual information, enabling LLMs to rectify their own predictions and better recognize cohesive relationships among entities within the same sentence.
Extensive empirical evaluation across ten benchmark datasets demonstrates that DeepEL substantially outperforms existing state-of-the-art methods, achieving an average improvement of 2.6\% in overall F1 score and a remarkable 4% gain on out-of-domain datasets. These results underscore the efficacy of deep LLM integration in advancing the state-of-the-art in entity linking.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Naveen Lamba, Sanju Tiwari, Manas Gaur
ๆฅ็ๆ่ฆ
LLMs still struggle with hallucination, especially when confronted with symbolic triggers like modifiers, negation, numbers, exceptions, and named entities. Yet, we lack a clear understanding of where these symbolic hallucinations originate, making it crucial to systematically handle such triggers and localize the emergence of hallucination inside the model. While prior work explored localization using statistical techniques like LSC and activation variance analysis, these methods treat all tokens equally and overlook the role symbolic linguistic knowledge plays in triggering hallucinations. So far, no approach has investigated how symbolic elements specifically drive hallucination failures across model layers, nor has symbolic linguistic knowledge been used as the foundation for a localization framework. We propose the first symbolic localization framework that leverages symbolic linguistic and semantic knowledge to meaningfully trace the development of hallucinations across all model layers. By focusing on how models process symbolic triggers, we analyze five models using HaluEval and TruthfulQA. Our symbolic knowledge approach reveals that attention variance for these linguistic elements explodes to critical instability in early layers (2-4), with negation triggering catastrophic variance levels, demonstrating that symbolic semantic processing breaks down from the very beginning. Through the lens of symbolic linguistic knowledge, despite larger model sizes, hallucination rates remain consistently high (78.3%-83.7% across Gemma variants), with steep attention drops for symbolic semantic triggers throughout deeper layers. Our findings demonstrate that hallucination is fundamentally a symbolic linguistic processing failure, not a general generation problem, revealing that symbolic semantic knowledge provides the key to understanding and localizing hallucination mechanisms in LLMs.
๐
2025-11-18
๐ AAAI 2025
๐ฅ Hao Lang, Fei Huang, Yongbin Li
ๆฅ็ๆ่ฆ
Future superhuman models will surpass the ability of humans and humans will only be able to \textit{weakly} supervise superhuman models. To alleviate the issue of lacking high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision. However, the invariable use of weak supervision in existing methods exposes issues in robustness, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates selective W2SG can help superalignment.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Naoki Shimoda, Akihiro Yamamoto
Natural Language Processing
ๆฅ็ๆ่ฆ
In this research, we combine Transformer-based relation extraction with matching of knowledge graphs (KGs) and apply them to answering multiple-choice questions (MCQs) while maintaining the traceability of the output process. KGs are structured representations of factual knowledge consisting of entities and relations. Due to the high construction cost, they had been regarded as static databases with validated links. However, the recent development of Transformer-based relation extraction (RE) methods has enabled us to generate KGs dynamically by giving them natural language texts, and thereby opened the possibility for representing the meaning of the input sentences with the created KGs. Using this effect, we propose a method that answers MCQs in the "fill-in-the-blank" format, taking care of the point that RE methods generate KGs that represent false information if provided with factually incorrect texts. We measure the truthfulness of each question sentence by (i) converting the sentence into a relational graph using an RE method and (ii) verifying it against factually correct KGs under the closed-world assumption. The experimental results demonstrate that our method correctly answers up to around 70% of the questions, while providing traceability of the procedure. We also highlight that the question category has a vast influence on the accuracy.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Omkar Mahesh Kashyap, Padegal Amit, Madhav Kashyap, Ashwini M Joshi, Shylaja SS
Natural Language Processing
ๆฅ็ๆ่ฆ
Aspect-Based Sentiment Analysis (ABSA) predicts sentiment polarity for specific aspect terms, a task made difficult by conflicting sentiments across aspects and the sparse context of short texts. Prior graph-based approaches model only pairwise dependencies, forcing them to construct multiple graphs for different relational views. These introduce redundancy, parameter overhead, and error propagation during fusion, limiting robustness in short-text, low-resource settings. We present HyperABSA, a dynamic hypergraph framework that induces aspect-opinion structures through sample-specific hierarchical clustering. To construct these hyperedges, we introduce a novel acceleration-fallback cutoff for hierarchical clustering, which adaptively determines the level of granularity. Experiments on three benchmarks (Lap14, Rest14, MAMS) show consistent improvements over strong graph baselines, with substantial gains when paired with RoBERTa backbones. These results position dynamic hypergraph construction as an efficient, powerful alternative for ABSA, with potential extensions to other short-text NLP tasks.
๐
2025-11-18
๐ 3rd-place solution for the ACM ICAIF 2025 Agentic
๐ฅ Chun Chet Ng, Jia Yu Lim, Wei Zeng Low
Natural Language Processing
ๆฅ็ๆ่ฆ
With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task-relevant information from lengthy financial filings is essential for both operational and analytical decision-making. The FinAgentBench dataset formalizes this problem through two tasks: document ranking and chunk ranking. We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and a lightweight multi-agent system. Each component is examined extensively to reveal their synergies: prompt engineering provides precise task instructions, ICL supplies semantically relevant few-shot examples, and the multi-agent system models coordinated scoring behaviour. Our best configuration achieves an NDCG@5 of 0.71818 on the restricted validation split. We further demonstrate that PRISM is feasible and robust for production-scale financial retrieval. Its modular, inference-only design makes it practical for real-world use cases. The source code is released at https://bit.ly/prism-ailens.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Truong Vo, Weiyi Wu, Kaize Ding
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic ICD coding from clinical text is a critical task in medical NLP but remains hindered by the extreme long-tail distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets like MIMIC-III, leading to low macro-F1 scores. In this work, we propose a data-centric framework that generates high-quality synthetic discharge summaries to mitigate this imbalance. Our method constructs realistic multi-label code sets anchored on rare codes by leveraging real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. Using these structured prompts, we generate 90,000 synthetic notes covering 7,902 ICD codes, significantly expanding the training distribution. We fine-tune two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, on both the original and extended datasets. Experiments show that our approach modestly improves macro-F1 while maintaining strong micro-F1, outperforming prior SOTA. While the gain may seem marginal relative to the computational cost, our results demonstrate that carefully crafted synthetic data can enhance equity in long-tail ICD code prediction.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Le Yu, Zhengyue Zhao, Yawen Zheng, Yunhao Liu
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily break through a novel attack method termed \textbf{Stealth Fine-Tuning}. Our method elicits harmful reasoning traces through \textbf{segment-level interference} and reuses the self-generated outputs as supervised fine-tuning data. Through a \textbf{turn-based weighted} loss design, yielding a lightweight, distribution-consistent finetuning method. In our experiment, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52\% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. \textcolor{red}{\textbf{Disclaimer: This paper contains content that may be disturbing or offensive.}}
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yue Zhang, Zun Wang, Han Lin, Jialu Li, Jianing Yang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than spatial understanding due to scarce 3D resources, leaving inherent grounding biases unresolved. To address this, we propose 3D scene editing as a key mechanism to generate precise visual counterfactuals that mitigate these biases through fine-grained spatial manipulation, without requiring costly scene reconstruction or large-scale 3D data collection. Furthermore, to make these edits targeted and directly address the specific weaknesses of the model, we introduce DEER-3D, an error-driven framework following a structured "Decompose, Diagnostic Evaluation, Edit, and Re-train" workflow, rather than broadly or randomly augmenting data as in conventional approaches. Specifically, upon identifying a grounding failure of the 3D-LLM, our framework first diagnoses the exact predicate-level error (e.g., attribute or spatial relation). It then executes minimal, predicate-aligned 3D scene edits, such as recoloring or repositioning, to produce targeted counterfactual supervision for iterative model fine-tuning, significantly enhancing grounding accuracy. We evaluate our editing pipeline across multiple benchmarks for 3D grounding and scene understanding tasks, consistently demonstrating improvements across all evaluated datasets through iterative refinement. DEER-3D underscores the effectiveness of targeted, error-driven scene editing in bridging linguistic reasoning capabilities with spatial grounding in 3D LLMs.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zijin Su, Huanzhu Lv, Yuren Niu, Yiming Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, emotion-labeled samples from Sentiment140 using a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Our data balancing strategy ensured an even distribution across 28 emotion categories. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach.
๐
2025-11-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yule Liu, Heyi Zhang, Jinyi Zheng, Zhen Sun, Zifan Peng et al.
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Membership inference attacks (MIAs) on large language models (LLMs) pose significant privacy risks across various stages of model training. Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have brought a profound paradigm shift in LLM training, particularly for complex reasoning tasks. However, the on-policy nature of RLVR introduces a unique privacy leakage pattern: since training relies on self-generated responses without fixed ground-truth outputs, membership inference must now determine whether a given prompt (independent of any specific response) is used during fine-tuning. This creates a threat where leakage arises not from answer memorization.
To audit this novel privacy risk, we propose Divergence-in-Behavior Attack (DIBA), the first membership inference framework specifically designed for RLVR. DIBA shifts the focus from memorization to behavioral change, leveraging measurable shifts in model behavior across two axes: advantage-side improvement (e.g., correctness gain) and logit-side divergence (e.g., policy drift). Through comprehensive evaluations, we demonstrate that DIBA significantly outperforms existing baselines, achieving around 0.8 AUC and an order-of-magnitude higher TPR@0.1%FPR. We validate DIBA's superiority across multiple settings--including in-distribution, cross-dataset, cross-algorithm, black-box scenarios, and extensions to vision-language models. Furthermore, our attack remains robust under moderate defensive measures.
To the best of our knowledge, this is the first work to systematically analyze privacy vulnerabilities in RLVR, revealing that even in the absence of explicit supervision, training data exposure can be reliably inferred through behavioral traces.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chandrachur Bhattacharya, Sibendu Som
ๆฅ็ๆ่ฆ
AI Scientific Assistant Core (AISAC) is an integrated multi-agent system developed at Argonne National Laboratory for scientific and engineering workflows. AISAC builds on established technologies - LangGraph for orchestration, FAISS for vector search, and SQLite for persistence - and integrates them into a unified system prototype focused on transparency, provenance tracking, and scientific adaptability.
The system implements a Router-Planner-Coordinator workflow and an optional Evaluator role, using prompt-engineered agents coordinated via LangGraph's StateGraph and supported by helper agents such as a Researcher. Each role is defined through custom system prompts that enforce structured JSON outputs. A hybrid memory approach (FAISS + SQLite) enables both semantic retrieval and structured conversation history. An incremental indexing strategy based on file hashing minimizes redundant re-embedding when scientific corpora evolve. A configuration-driven project bootstrap layer allows research teams to customize tools, prompts, and data sources without modifying core code.
All agent decisions, tool invocations, and retrievals are logged and visualized through a custom Gradio interface, providing step-by-step transparency for each reasoning episode. The authors have applied AISAC to multiple research areas at Argonne, including specialized deployments for waste-to-products research and energy process safety, as well as general-purpose scientific assistance, demonstrating its cross-domain applicability.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Junjie Wu, Yumeng Fu, Nan Yu, Guohong Fu
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advancements in multimodal out-of-context (OOC) misinformation detection have made remarkable progress in checking the consistencies between different modalities for supporting or refuting image-text pairs. However, existing OOC misinformation detection methods tend to emphasize the role of internal consistency, ignoring the significant of external consistency between image-text pairs and external evidence. In this paper, we propose HiEAG, a novel Hierarchical Evidence-Augmented Generation framework to refine external consistency checking through leveraging the extensive knowledge of multimodal large language models (MLLMs). Our approach decomposes external consistency checking into a comprehensive engine pipeline, which integrates reranking and rewriting, apart from retrieval. Evidence reranking module utilizes Automatic Evidence Selection Prompting (AESP) that acquires the relevant evidence item from the products of evidence retrieval. Subsequently, evidence rewriting module leverages Automatic Evidence Generation Prompting (AEGP) to improve task adaptation on MLLM-based OOC misinformation detectors. Furthermore, our approach enables explanation for judgment, and achieves impressive performance with instruction tuning. Experimental results on different benchmark datasets demonstrate that our proposed HiEAG surpasses previous state-of-the-art (SOTA) methods in the accuracy over all samples.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chenchen Kuai, Zihao Li, Braden Rosen, Stephanie Paan, Navid Jafari et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Keya Hu, Ali Cy, Linlu Qiu, Xiaoman Delores Ding, Runqian Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley et al.
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $ฯ^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $ฯ^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Albert Lin, Alessandro Pinto, Somil Bansal
Robotics
ๆฅ็ๆ่ฆ
As perception-based controllers for autonomous systems become increasingly popular in the real world, it is important that we can formally verify their safety and performance despite perceptual uncertainty. Unfortunately, the verification of such systems remains challenging, largely due to the complexity of the controllers, which are often nonlinear, nonconvex, learning-based, and/or black-box. Prior works propose verification algorithms that are based on approximate reachability methods, but they often restrict the class of controllers and systems that can be handled or result in overly conservative analyses. Hamilton-Jacobi (HJ) reachability analysis is a popular formal verification tool for general nonlinear systems that can compute optimal reachable sets under worst-case system uncertainties; however, its application to perception-based systems is currently underexplored. In this work, we propose RoVer-CoRe, a framework for the Robust Verification of Controllers via HJ Reachability. To the best of our knowledge, RoVer-CoRe is the first HJ reachability-based framework for the verification of perception-based systems under perceptual uncertainty. Our key insight is to concatenate the system controller, observation function, and the state estimation modules to obtain an equivalent closed-loop system that is readily compatible with existing reachability frameworks. Within RoVer-CoRe, we propose novel methods for formal safety verification and robust controller design. We demonstrate the efficacy of the framework in case studies involving aircraft taxiing and NN-based rover navigation. Code is available at the link in the footnote.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junfeng Wu, Hadjer Benmeziane, Kaoutar El Maghraoui, Liu Liu, Yinan Wang
ๆฅ็ๆ่ฆ
Spatiotemporal data mining (STDM) has a wide range of applications in various complex physical systems (CPS), i.e., transportation, manufacturing, healthcare, etc. Among all the proposed methods, the Convolutional Long Short-Term Memory (ConvLSTM) has proved to be generalizable and extendable in different applications and has multiple variants achieving state-of-the-art performance in various STDM applications. However, ConvLSTM and its variants are computationally expensive, which makes them inapplicable in edge devices with limited computational resources. With the emerging need for edge computing in CPS, efficient AI is essential to reduce the computational cost while preserving the model performance. Common methods of efficient AI are developed to reduce redundancy in model capacity (i.e., model pruning, compression, etc.). However, spatiotemporal data mining naturally requires extensive model capacity, as the embedded dependencies in spatiotemporal data are complex and hard to capture, which limits the model redundancy. Instead, there is a fairly high level of data and feature redundancy that introduces an unnecessary computational burden, which has been largely overlooked in existing research. Therefore, we developed a novel framework SparseST, that pioneered in exploiting data sparsity to develop an efficient spatiotemporal model. In addition, we explore and approximate the Pareto front between model performance and computational efficiency by designing a multi-objective composite loss function, which provides a practical guide for practitioners to adjust the model according to computational resource constraints and the performance requirements of downstream tasks.
๐
2025-11-18
๐ NeurIPS 2025
๐ฅ Haiqing Zhu, Tijana Zrnic, Celestine Mendler-Dรผnner
Machine Learning
ๆฅ็ๆ่ฆ
On many learning platforms, the optimization criteria guiding model training reflect the priorities of the designer rather than those of the individuals they affect. Consequently, users may act strategically to obtain more favorable outcomes, effectively contesting the platform's predictions. While past work has studied strategic user behavior on learning platforms, the focus has largely been on strategic responses to a deployed model, without considering the behavior of other users. In contrast, look-ahead reasoning takes into account that user actions are coupled, and -- at scale -- impact future predictions. Within this framework, we first formalize level-$k$ thinking, a concept from behavioral economics, where users aim to outsmart their peers by looking one step ahead. We show that, while convergence to an equilibrium is accelerated, the equilibrium remains the same, providing no benefit of higher-level reasoning for individuals in the long run. Then, we focus on collective reasoning, where users take coordinated actions by optimizing through their joint impact on the model. By contrasting collective with selfish behavior, we characterize the benefits and limits of coordination; a new notion of alignment between the learner's and the users' utilities emerges as a key concept. We discuss connections to several related mathematical frameworks, including strategic classification, performative prediction, and algorithmic collective action.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Antonia Ebner, Christoph Bartmann, Sonja Topf, Sohvi Luukkonen, Johannes Schimunek et al.
Computer VisionMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Deep learning's rise since the early 2010s has transformed fields like computer vision and natural language processing and strongly influenced biomedical research. For drug discovery specifically, a key inflection - akin to vision's "ImageNet moment" - arrived in 2015, when deep neural networks surpassed traditional approaches on the Tox21 Data Challenge. This milestone accelerated the adoption of deep learning across the pharmaceutical industry, and today most major companies have integrated these methods into their research pipelines. After the Tox21 Challenge concluded, its dataset was included in several established benchmarks, such as MoleculeNet and the Open Graph Benchmark. However, during these integrations, the dataset was altered and labels were imputed or manufactured, resulting in a loss of comparability across studies. Consequently, the extent to which bioactivity and toxicity prediction methods have improved over the past decade remains unclear. To this end, we introduce a reproducible leaderboard, hosted on Hugging Face with the original Tox21 Challenge dataset, together with a set of baseline and representative methods. The current version of the leaderboard indicates that the original Tox21 winner - the ensemble-based DeepTox method - and the descriptor-based self-normalizing neural networks introduced in 2017, continue to perform competitively and rank among the top methods for toxicity prediction, leaving it unclear whether substantial progress in toxicity prediction has been achieved over the past decade. As part of this work, we make all baselines and evaluated models publicly accessible for inference via standardized API calls to Hugging Face Spaces.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Christof Naumzik, Abdurahman Maarouf, Stefan Feuerriegel, Markus Weinmann
ๆฅ็ๆ่ฆ
Online ratings influence customer decision-making, yet standard aggregation methods, such as the sample mean, fail to adapt to quality changes over time and ignore review heterogeneity (e.g., review sentiment, a review's helpfulness). To address these challenges, we demonstrate the value of using the Gaussian process (GP) framework for rating aggregation. Specifically, we present a tailored GP model that captures the dynamics of ratings over time while additionally accounting for review heterogeneity. Based on 121,123 ratings from Yelp, we compare the predictive power of different rating aggregation methods in predicting future ratings, thereby finding that the GP model is considerably more accurate and reduces the mean absolute error by 10.2% compared to the sample mean. Our findings have important implications for marketing practitioners and customers. By moving beyond means, designers of online reputation systems can display more informative and adaptive aggregated rating scores that are accurate signals of expected customer satisfaction.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tzu-Hsuan Chou, Chun-Nan Chou
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have shown a remarkable ability to generalize beyond their pre-training data, and fine-tuning LLMs can elevate performance to human-level and beyond. However, in real-world scenarios, lacking labeled data often prevents practitioners from obtaining well-performing models, thereby forcing practitioners to highly rely on prompt-based approaches that are often tedious, inefficient, and driven by trial and error. To alleviate this issue of lacking labeled data, we present a learning framework integrating LLMs with active learning for unlabeled dataset (LAUD). LAUD mitigates the cold-start problem by constructing an initial label set with zero-shot learning. Experimental results show that LLMs derived from LAUD outperform LLMs with zero-shot or few-shot learning on commodity name classification tasks, demonstrating the effectiveness of LAUD.
๐
2025-11-18
๐ NeurIPS 2025
๐ฅ Fu-Ming Guo, Yingfang Fan
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. Yet the quadratic nature of the $\ell_2$ penalty embedded in weight decay drives all parameters toward the origin at the same rate, making the update vulnerable to rare but extreme gradient directions and often over-penalizing well-conditioned coordinates. We propose AdamHuberDecay, a drop-in replacement for AdamW that substitutes the $\ell_2$ penalty with a decoupled smooth Huber regularizer. The resulting update decays parameters quadratically while their magnitude remains below a threshold $ฮด$, and linearly ($\ell_1$-like) once they exceed $ฮด$, yielding (i) bounded regularization gradients, (ii) invariance to per-coordinate second-moment rescaling, and (iii) stronger sparsity pressure on overgrown weights.
We derive the closed-form decoupled Huber decay step and show how to integrate it with any Adam-family optimizer at $O(1)$ extra cost. Extensive experiments on GPT-2 and GPT-3 pre-training demonstrate that AdamHuberDecay (a) converges 10-15% faster in wall-clock time, (b) reduces validation perplexity by up to 4 points, (c) delivers performance improvements of 2.5-4.7% across downstream tasks, and (d) yields visibly sparser weight histograms that translate into 20-30% memory savings after magnitude pruning, without tuning the decay coefficient beyond the default grid used for AdamW. Ablations confirm robustness to outlier gradients and large-batch regimes, together with theoretical analyses that bound the expected parameter norm under noisy updates. AdamHuberDecay therefore provides a simple, principled path toward more efficient and resilient training of next-generation foundational generative transformers.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Abolfazl Younesi, Leon Kiss, Zahra Najafabadi Samani, Juan Aznar Poveda, Thomas Fahringer
ๆฅ็ๆ่ฆ
Federated learning (FL) enables collaborative model training while preserving data privacy. However, it remains vulnerable to malicious clients who compromise model integrity through Byzantine attacks, data poisoning, or adaptive adversarial behaviors. Existing defense mechanisms rely on static thresholds and binary classification, failing to adapt to evolving client behaviors in real-world deployments. We propose FLARE, an adaptive reputation-based framework that transforms client reliability assessment from binary decisions to a continuous, multi-dimensional trust evaluation. FLARE integrates: (i) a multi-dimensional reputation score capturing performance consistency, statistical anomaly indicators, and temporal behavior, (ii) a self-calibrating adaptive threshold mechanism that adjusts security strictness based on model convergence and recent attack intensity, (iii) reputation-weighted aggregation with soft exclusion to proportionally limit suspicious contributions rather than eliminating clients outright, and (iv) a Local Differential Privacy (LDP) mechanism enabling reputation scoring on privatized client updates. We further introduce a highly evasive Statistical Mimicry (SM) attack, a benchmark adversary that blends honest gradients with synthetic perturbations and persistent drift to remain undetected by traditional filters. Extensive experiments with 100 clients on MNIST, CIFAR-10, and SVHN demonstrate that FLARE maintains high model accuracy and converges faster than state-of-the-art Byzantine-robust methods under diverse attack types, including label flipping, gradient scaling, adaptive attacks, ALIE, and SM. FLARE improves robustness by up to 16% and preserves model convergence within 30% of the non-attacked baseline, while achieving strong malicious-client detection performance with minimal computational overhead. https://github.com/Anonymous0-0paper/FLARE
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Zonghao Chen, Atsushi Nitanda, Arthur Gretton, Taiji Suzuki
Machine Learning
ๆฅ็ๆ่ฆ
We establish the first global convergence result of neural networks for two stage least squares (2SLS) approach in nonparametric instrumental variable regression (NPIV). This is achieved by adopting a lifted perspective through mean-field Langevin dynamics (MFLD), unlike standard MFLD, however, our setting of 2SLS entails a \emph{bilevel} optimization problem in the space of probability measures. To address this challenge, we leverage the penalty gradient approach recently developed for bilevel optimization which formulates bilevel optimization as a Lagrangian problem. This leads to a novel fully first-order algorithm, termed \texttt{F$^2$BMLD}. Apart from the convergence bound, we further provide a generalization bound, revealing an inherent trade-off in the choice of the Lagrange multiplier between optimization and statistical guarantees. Finally, we empirically validate the effectiveness of the proposed method on an offline reinforcement learning benchmark.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sriram Srinivasan, Srinivasan Aruchamy, Siva Ram Krisha Vadali
Machine Learning
ๆฅ็ๆ่ฆ
Seismic sensing has emerged as a promising solution for border surveillance and monitoring; the seismic sensors that are often buried underground are small and cannot be noticed easily, making them difficult for intruders to detect, avoid, or vandalize. This significantly enhances their effectiveness compared to highly visible cameras or fences. However, accurately detecting and distinguishing between overlapping activities that are happening simultaneously, such as human intrusions, animal movements, and vehicle rumbling, remains a major challenge due to the complex and noisy nature of seismic signals. Correctly identifying simultaneous activities is critical because failing to separate them can lead to misclassification, missed detections, and an incomplete understanding of the situation, thereby reducing the reliability of surveillance systems. To tackle this problem, we propose HyMAD (Hybrid Multi-Activity Detection), a deep neural architecture based on spatio-temporal feature fusion. The framework integrates spectral features extracted with SincNet and temporal dependencies modeled by a recurrent neural network (RNN). In addition, HyMAD employs self-attention layers to strengthen intra-modal representations and a cross-modal fusion module to achieve robust multi-label classification of seismic events. e evaluate our approach on a dataset constructed from real-world field recordings collected in the context of border surveillance and monitoring, demonstrating its ability to generalize to complex, simultaneous activity scenarios involving humans, animals, and vehicles. Our method achieves competitive performance and offers a modular framework for extending seismic-based activity recognition in real-world security applications.
๐
2025-11-18
๐ ArXiv Preprint (q-bio.GN)
๐ฅ Rui Zhu, Xiaopu Zhou, Haixu Tang, Stephen W. Scherer, Lucila Ohno-Machado
Natural Language Processing
ๆฅ็ๆ่ฆ
Trained on massive cross-species DNA corpora, DNA large language models (LLMs) learn the fundamental "grammar" and evolutionary patterns of genomic sequences. This makes them powerful priors for DNA sequence modeling, particularly over long ranges. However, two major constraints hinder their use in practice: the quadratic computational cost of self-attention and the growing memory required for key-value (KV) caches during autoregressive decoding. These constraints force the use of heuristics such as fixed-window truncation or sliding windows, which compromise fidelity on ultra-long sequences by discarding distant information. We introduce FOCUS (Feature-Oriented Compression for Ultra-long Self-attention), a progressive context-compression module that can be plugged into pretrained DNA LLMs. FOCUS combines the established k-mer representation in genomics with learnable hierarchical compression: it inserts summary tokens at k-mer granularity and progressively compresses attention key and value activations across multiple Transformer layers, retaining only the summary KV states across windows while discarding ordinary-token KV. A shared-boundary windowing scheme yields a stationary cross-window interface that propagates long-range information with minimal loss. We validate FOCUS on an Evo-2-based DNA LLM fine-tuned on GRCh38 chromosome 1 with self-supervised training and randomized compression schedules to promote robustness across compression ratios. On held-out human chromosomes, FOCUS achieves near-lossless fidelity: compressing a 1 kb context into only 10 summary tokens (about 100x) shifts the average per-nucleotide probability by only about 0.0004. Compared to a baseline without compression, FOCUS reduces KV-cache memory and converts effective inference scaling from O(N^2) to near-linear O(N), enabling about 100x longer inference windows on commodity GPUs with near-lossless fidelity.
๐
2025-11-18
๐ This paper has been officially accepted for public
๐ฅ Vaskar Chakma, MD Jaheid Hasan Nerab, Abdur Rouf, Abu Sayed, Hossem MD Saim et al.
Machine Learning
ๆฅ็ๆ่ฆ
Smoking continues to be a major preventable cause of death worldwide, affecting millions through damage to the heart, metabolism, liver, and kidneys. However, current medical screening methods often miss the early warning signs of smoking-related health problems, leading to late-stage diagnoses when treatment options become limited. This study presents a systematic comparative evaluation of machine learning approaches for smoking-related health risk assessment, emphasizing clinical interpretability and practical deployment over algorithmic innovation. We analyzed health screening data from 55,691 individuals, examining various health indicators, including body measurements, blood tests, and demographic information. We tested three advanced prediction algorithms - Random Forest, XGBoost, and LightGBM - to determine which could most accurately identify people at high risk. This study employed a cross-sectional design to classify current smoking status based on health screening biomarkers, not to predict future disease development. Our Random Forest model performed best, achieving an Area Under the Curve (AUC) of 0.926, meaning it could reliably distinguish between high-risk and lower-risk individuals. Using SHAP (SHapley Additive exPlanations) analysis to understand what the model was detecting, we found that key health markers played crucial roles in prediction: blood pressure levels, triglyceride concentrations, liver enzyme readings, and kidney function indicators (serum creatinine) were the strongest signals of declining health in smokers.
๐
2025-11-18
๐ ArXiv Preprint (math.NA)
๐ฅ Jan Naumann
Machine Learning
ๆฅ็ๆ่ฆ
Recently developed applications in the field of machine learning and computational physics rely on automatic differentiation techniques, that require stable and efficient linear algebra gradient computations. This technical note provides a comprehensive and detailed discussion of the derivative of the truncated singular and eigenvalue decomposition. It summarizes previous work and builds on them with an extensive description of how to derive the relevant terms. A main focus is correctly expressing the derivative in terms of the truncated part, despite lacking knowledge of the full decomposition.
๐
2025-11-18
๐ ArXiv Preprint (eess.SP)
๐ฅ Avi Bagchi, Dwight Hutchenson
Machine Learning
ๆฅ็ๆ่ฆ
Radio spectrum monitoring in contested environments motivates the need for reliable automatic signal classification technology. Prior work highlights deep learning as a promising approach, but existing models depend on brute-force Doppler augmentation to achieve real-world generalization, which undermines both training efficiency and interpretability. In this paper, we propose a convolutional neural network (CNN) architecture with complex-valued layers that exploits convolutional shift equivariance in the frequency domain. To establish provable frequency bin shift invariance, we use adaptive polyphase sampling (APS) as pooling layers followed by a global average pooling layer at the end of the network. Using a synthetic dataset of common interference signals, experimental results demonstrate that unlike a vanilla CNN, our model maintains consistent classification accuracy with and without random Doppler shifts despite being trained on no Doppler-shifted examples. Overall, our method establishes an invariance-driven framework for signal classification that offers provable robustness against real-world effects.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuchen Luo, Xinyu Li, Liuhua Peng, Mingming Gong
Natural Language Processing
ๆฅ็ๆ่ฆ
In multivariate time series forecasting (MTSF), accurately modeling the intricate dependencies among multiple variables remains a significant challenge due to the inherent limitations of traditional approaches. Most existing models adopt either \textbf{channel-independent} (CI) or \textbf{channel-dependent} (CD) strategies, each presenting distinct drawbacks. CI methods fail to leverage the potential insights from inter-channel interactions, resulting in models that may not fully exploit the underlying statistical dependencies present in the data. Conversely, CD approaches often incorporate too much extraneous information, risking model overfitting and predictive inefficiency. To address these issues, we introduce the Adaptive Forecasting Transformer (\textbf{Adapformer}), an advanced Transformer-based framework that merges the benefits of CI and CD methodologies through effective channel management. The core of Adapformer lies in its dual-stage encoder-decoder architecture, which includes the \textbf{A}daptive \textbf{C}hannel \textbf{E}nhancer (\textbf{ACE}) for enriching embedding processes and the \textbf{A}daptive \textbf{C}hannel \textbf{F}orecaster (\textbf{ACF}) for refining the predictions. ACE enhances token representations by selectively incorporating essential dependencies, while ACF streamlines the decoding process by focusing on the most relevant covariates, substantially reducing noise and redundancy. Our rigorous testing on diverse datasets shows that Adapformer achieves superior performance over existing models, enhancing both predictive accuracy and computational efficiency, thus making it state-of-the-art in MTSF.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ivy Yuqian Yang, David Yu Zhang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Scientific idea generation and selection requires exploration following a target probability distribution. In contrast, current AI benchmarks have objectively correct answers, and training large language models (LLMs) via reinforcement learning against these benchmarks discourages probabilistic exploration. Here, we conducted systematic experiments requesting LLMs to produce outputs following simple probabilistic distributions, and found that all modern LLMs tested grossly fail to follow the distributions. For example, requesting a binary output of "1" 49% of the time produces an answer of "0" nearly 100% of the time. This step function-like behavior of near-exclusively generating the output with marginally highest probability even overrules even strong in-built LLM biases.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Marco Locatelli, Arjen Hommersom, Roberto Clemens Cerioli, Daniela Besozzi, Fabio Stella
ๆฅ็ๆ่ฆ
Learning the parameters of Partially Observable Markov Decision Processes (POMDPs) from limited data is a significant challenge. We introduce the Fuzzy MAP EM algorithm, a novel approach that incorporates expert knowledge into the parameter estimation process by enriching the Expectation Maximization (EM) framework with fuzzy pseudo-counts derived from an expert-defined fuzzy model. This integration naturally reformulates the problem as a Maximum A Posteriori (MAP) estimation, effectively guiding learning in environments with limited data. In synthetic medical simulations, our method consistently outperforms the standard EM algorithm under both low-data and high-noise conditions. Furthermore, a case study on Myasthenia Gravis illustrates the ability of the Fuzzy MAP EM algorithm to recover a clinically coherent POMDP, demonstrating its potential as a practical tool for data-efficient modeling in healthcare.
๐
2025-11-18
๐ ArXiv Preprint (cs.DC)
๐ฅ Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding. Together, these mechanisms substantially reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer improves end-to-end rollout throughput by 74% to 97% and reduces long-tail latency by 75% to 93% compared to state-of-the-art synchronous RL systems, significantly accelerating RL training iterations.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shreya Adrita Banik, Niaz Nafi Rahman, Tahsina Moiukh, Farig Sadeque
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Detecting political bias in news media is a complex task that requires interpreting subtle linguistic and contextual cues. Although recent advances in Natural Language Processing (NLP) have enabled automatic bias classification, the extent to which large language models (LLMs) align with human judgment still remains relatively underexplored and not yet well understood. This study aims to present a comparative framework for evaluating the detection of political bias across human annotations and multiple LLMs, including GPT, BERT, RoBERTa, and FLAN. We construct a manually annotated dataset of news articles and assess annotation consistency, bias polarity, and inter-model agreement to quantify divergence between human and model perceptions of bias. Experimental results show that among traditional transformer-based models, RoBERTa achieves the highest alignment with human labels, whereas generative models such as GPT demonstrate the strongest overall agreement with human annotations in a zero-shot setting. Among all transformer-based baselines, our fine-tuned RoBERTa model acquired the highest accuracy and the strongest alignment with human-annotated labels. Our findings highlight systematic differences in how humans and LLMs perceive political slant, underscoring the need for hybrid evaluation frameworks that combine human interpretability with model scalability in automated media bias detection.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yilu Fang, Jordan G. Nestor, Casey N. Ta, Jerard Z. Kneifati-Hayek, Chunhua Weng
ๆฅ็ๆ่ฆ
Patients with acute kidney injury (AKI) are at high risk of developing chronic kidney disease (CKD), but identifying those at greatest risk remains challenging. We used electronic health record (EHR) data to dynamically track AKI patients' clinical evolution and characterize AKI-to-CKD progression. Post-AKI clinical states were identified by clustering patient vectors derived from longitudinal medical codes and creatinine measurements. Transition probabilities between states and progression to CKD were estimated using multi-state modeling. After identifying common post-AKI trajectories, CKD risk factors in AKI subpopulations were identified through survival analysis. Of 20,699 patients with AKI at admission, 3,491 (17%) developed CKD. We identified fifteen distinct post-AKI states, each with different probabilities of CKD development. Most patients (75%, n=15,607) remained in a single state or made only one transition during the study period. Both established (e.g., AKI severity, diabetes, hypertension, heart failure, liver disease) and novel CKD risk factors, with their impact varying across these clinical states. This study demonstrates a data-driven approach for identifying high-risk AKI patients, supporting the development of decision-support tools for early CKD detection and intervention.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ankush Kadu, Ashwanth Krishnan
Machine Learning
ๆฅ็ๆ่ฆ
Enabling agents to learn from experience and generalize across diverse tasks without task-specific training remains a fundamental challenge in reinforcement learning and decision-making. While recent approaches have explored episodic memory (Reflexion), gradient-based prompt optimization (TextGrad),and hierarchical task decomposition independently, their potential for synergistic integration remains unexplored. We introduce ReflexGrad, a novel architecture that tightly couples three complementary mechanisms: (1) LLM-based hierarchical TODO decomposition for strategic planning, (2) history-aware causal reflection that analyzes recent action patterns to identify failure root causes and enable within-trial learning, and (3) gradient-based optimization for systematic improvement. Unlike prior work relying on few-shot demonstrations, our system achieves true zero-shot generalization through pure LLM semantic reasoning,requiring no task-specific examples, fine-tuning, or hardcoded similarity metrics. Evaluated on ALFWorld benchmark tasks, ReflexGrad demonstrates 67% zero-shot success rate on Trial 0 without any prior task experience or demonstrations, establishing effective performance on first exposure. Through empirical analysis, we identify the architectural mechanisms underlying stable convergence (zero action loops) and effective cross-task transfer (67% to 78% improvement).Our work demonstrates that synergistic integration of complementary learning mechanisms enables robust zero-shot generalization that approaches few-shot baselines from prior work.
๐
2025-11-18
๐ 33 pages, 11 figures, submitted for publication in
๐ฅ Hugo Frezat, Thomas Gastine, Alexandre Fournier
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The use of machine learning to represent subgrid-scale (SGS) dynamics is now well established in weather forecasting and climate modelling. Recent advances have demonstrated that SGS models trained via ``online'' end-to-end learning -- where the dynamical solver operating on the filtered equations participates in the training -- can outperform traditional physics-based approaches. Most studies, however, have focused on idealised periodic domains, neglecting the mechanical boundaries present e.g. in planetary interiors. To address this issue, we consider two-dimensional quasi-geostrophic turbulent flow in an axisymmetric bounded domain that we model using a pseudo-spectral differentiable solver, thereby enabling online learning. We examine three configurations, varying the geometry (between an exponential container and a spherical shell) and the rotation rate. Flow is driven by a prescribed analytical forcing, allowing for precise control over the energy injection scale and an exact estimate of the power input. We evaluate the accuracy of the online-trained SGS model against the reference direct numerical simulation using integral quantities and spectral diagnostics. In all configurations, we show that an SGS model trained on data spanning only one turnover time remains stable and accurate over integrations at least a hundred times longer than the training period. Moreover, we demonstrate the model's remarkable ability to reproduce slow processes occurring on time scales far exceeding the training duration, such as the inward drift of jets in the spherical shell. These results suggest a promising path towards developing SGS models for planetary and stellar interior dynamics, including dynamo processes.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Adam Hazimeh, Alessandro Favero, Pascal Frossard
Natural Language Processing
ๆฅ็ๆ่ฆ
Task arithmetic has recently emerged as a promising method for editing pre-trained \textit{open-vocabulary} models, offering a cost-effective alternative to standard multi-task fine-tuning. However, despite the abundance of \textit{closed-vocabulary} models that are not pre-trained with language supervision, applying task arithmetic to these models remains unexplored. In this paper, we deploy and study task addition in closed-vocabulary image classification models. We consider different pre-training schemes and find that \textit{weight disentanglement} -- the property enabling task arithmetic -- is a general consequence of pre-training, as it appears in different pre-trained closed-vocabulary models. In fact, we find that pre-trained closed-vocabulary vision transformers can also be edited with task arithmetic, achieving high task addition performance and enabling the efficient deployment of multi-task models. Finally, we demonstrate that simple linear probing is a competitive baseline to task addition. Overall, our findings expand the applicability of task arithmetic to a broader class of pre-trained models and open the way for more efficient use of pre-trained models in diverse settings.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Xinzhe Zheng, Shiyu Jiang, Gustavo Seabra, Chenglong Li, Yanjun Li
Machine Learning
ๆฅ็ๆ่ฆ
Deep generative models are rapidly advancing structure-based drug design, offering substantial promise for generating small molecule ligands that bind to specific protein targets. However, most current approaches assume a rigid protein binding pocket, neglecting the intrinsic flexibility of proteins and the conformational rearrangements induced by ligand binding, limiting their applicability in practical drug discovery. Here, we propose Apo2Mol, a diffusion-based generative framework for 3D molecule design that explicitly accounts for conformational flexibility in protein binding pockets. To support this, we curate a dataset of over 24,000 experimentally resolved apo-holo structure pairs from the Protein Data Bank, enabling the characterization of protein structure changes associated with ligand binding. Apo2Mol employs a full-atom hierarchical graph-based diffusion model that simultaneously generates 3D ligand molecules and their corresponding holo pocket conformations from input apo states. Empirical studies demonstrate that Apo2Mol can achieve state-of-the-art performance in generating high-affinity ligands and accurately capture realistic protein pocket conformational changes.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohammad Romani
Natural Language Processing
ๆฅ็ๆ่ฆ
Deepfakes generated by advanced GANs and autoencoders severely threaten information integrity and societal stability. Single-stream CNNs fail to capture multi-scale forgery artifacts across spatial, texture, and frequency domains, limiting robustness and generalization. We introduce the ForensicFlow, a tri-modal forensic framework that synergistically fuses RGB, texture, and frequency evidence for video Deepfake detection. The RGB branch (ConvNeXt-tiny) extracts global visual inconsistencies; the texture branch (Swin Transformer-tiny) detects fine-grained blending artifacts; the frequency branch (CNN + SE) identifies periodic spectral noise. Attention-based temporal pooling dynamically prioritizes high-evidence frames, while adaptive attention fusion balances branch contributions.Trained on Celeb-DF (v2) with Focal Loss, ForensicFlow achieves AUC 0.9752, F1-Score 0.9408, and accuracy 0.9208, outperforming single-stream baselines. Ablation validates branch synergy; Grad-CAM confirms forensic focus. This comprehensive feature fusion provides superior resilience against subtle forgeries.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Haorui Ma, Dennis Frauen, Stefan Feuerriegel
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Structural nested mean models (SNMMs) are a principled approach to estimate the treatment effects over time. A particular strength of SNMMs is to break the joint effect of treatment sequences over time into localized, time-specific ``blip effects''. This decomposition promotes interpretability through the incremental effects and enables the efficient offline evaluation of optimal treatment policies without re-computation. However, neural frameworks for SNMMs are lacking, as their inherently sequential g-estimation scheme prevents end-to-end, gradient-based training. Here, we propose DeepBlip, the first neural framework for SNMMs, which overcomes this limitation with a novel double optimization trick to enable simultaneous learning of all blip functions. Our DeepBlip seamlessly integrates sequential neural networks like LSTMs or transformers to capture complex temporal dependencies. By design, our method correctly adjusts for time-varying confounding to produce unbiased estimates, and its Neyman-orthogonal loss function ensures robustness to nuisance model misspecification. Finally, we evaluate our DeepBlip across various clinical datasets, where it achieves state-of-the-art performance.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jaume Ros, Alessio Arleo, Fernando Paulovich
ๆฅ็ๆ่ฆ
Dimensionality Reduction (DR) techniques are commonly used for the visual exploration and analysis of high-dimensional data due to their ability to project datasets of high-dimensional points onto the 2D plane. However, projecting datasets in lower dimensions often entails some distortion, which is not necessarily easy to recognize but can lead users to misleading conclusions. Several Projection Quality Metrics (PQMs) have been developed as tools to quantify the goodness-of-fit of a DR projection; however, they mostly focus on measuring how well the projection captures the global or local structure of the data, without taking into account the visual distortion of the resulting plots, thus often ignoring the presence of outliers or artifacts that can mislead a visual analysis of the projection. In this work, we introduce the Warping Index (WI), a new metric for measuring the quality of DR projections onto the 2D plane, based on the assumption that the correct preservation of empty regions between points is of crucial importance towards a faithful visual representation of the data.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal
Machine Learning
ๆฅ็ๆ่ฆ
Incomplete data are common in real-world tabular applications, where numerical, categorical, and discrete attributes coexist within a single dataset. This heterogeneous structure presents significant challenges for existing diffusion-based imputation models, which typically assume a homogeneous feature space and rely on stochastic denoising trajectories. Such assumptions make it difficult to maintain conditional consistency, and they often lead to information collapse for categorical variables or instability when numerical variables require deterministic updates. These limitations indicate that a single diffusion process is insufficient for mixed-type tabular imputation.
We propose a hybrid deterministic diffusion framework that separates heterogeneous features into two complementary generative channels. A continuous DDIM-based channel provides efficient and stable deterministic denoising for numerical variables, while a discrete latent-path diffusion channel, inspired by loopholing-based discrete diffusion, models categorical and discrete features without leaving their valid sample manifolds. The two channels are trained under a unified conditional imputation objective, enabling coherent reconstruction of mixed-type incomplete data.
Extensive experiments on multiple real-world datasets show that the proposed framework achieves higher imputation accuracy, more stable sampling trajectories, and improved robustness across MCAR, MAR, and MNAR settings compared with existing diffusion-based and classical methods. These results demonstrate the importance of structure-aware diffusion processes for advancing deep learning approaches to incomplete tabular data.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangchen Yin, Jiahui Yuan, Zhangchi Hu, Wenzhang Sun, Jie Chen et al.
ๆฅ็ๆ่ฆ
Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose decoupled VAE (DeCo-VAE) to achieve compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling: keyframe, motion and residual, and learn dedicated latent representation for each. To avoid cross-component interference, we design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further utilize a decoupled adaptation strategy that freezes partial encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Hao Qian, Shikui Tu, Lei Xu
Multimodal
ๆฅ็ๆ่ฆ
Diffusion and flow matching models have recently emerged as promising approaches for peptide binder design. Despite their progress, these models still face two major challenges. First, categorical sampling of discrete residue types collapses their continuous parameters into onehot assignments, while continuous variables (e.g., atom positions) evolve smoothly throughout the generation process. This mismatch disrupts the update dynamics and results in suboptimal performance. Second, current models assume unimodal distributions for side-chain torsion angles, which conflicts with the inherently multimodal nature of side chain rotameric states and limits prediction accuracy. To address these limitations, we introduce PepBFN, the first Bayesian flow network for full atom peptide design that directly models parameter distributions in fully continuous space. Specifically, PepBFN models discrete residue types by learning their continuous parameter distributions, enabling joint and smooth Bayesian updates with other continuous structural parameters. It further employs a novel Gaussian mixture based Bayesian flow to capture the multimodal side chain rotameric states and a Matrix Fisher based Riemannian flow to directly model residue orientations on the $\mathrm{SO}(3)$ manifold. Together, these parameter distributions are progressively refined via Bayesian updates, yielding smooth and coherent peptide generation. Experiments on side chain packing, reverse folding, and binder design tasks demonstrate the strong potential of PepBFN in computational peptide design.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiawei Yi, Ping Gong, Youhui Bai, Jiaqi Ruan, Shengnan Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
The growth of million-token LLMs exposes the scalability limits of inference systems, where the KVCache dominates memory usage and data transfer overhead. Recent offloading systems migrate the KVCache to CPU memory and incorporate top-k attention to reduce the volume of data transferred from the CPU, while further applying system-level optimizations such as on-GPU caching and prefetching to lower transfer overhead. However, they overlook the CPU bottleneck in three aspects: (1) substantial overhead of fine-grained dynamic cache management performed on the CPU side, (2) significant transfer overhead from poor PCIe bandwidth utilization caused by heavy gathering operations at the CPU side, and (3) GPU runtime bubbles introduced by coarse-grained CPU-centric synchronization. To address these challenges, we propose CLO, a CPU-light KVCache offloading system via algorithm-system co-design. CLO features: (1) a coarse-grained head-wise approximate on-GPU caching strategy with negligible cache management cost, (2) seamless combination of data prefetching and on-GPU persistent caching for lower transfer overhead, (3) a zero-copy transfer engine to fully exploit PCIe bandwidth, and a GPU-centric synchronization method to eliminate GPU stalls. Evaluation on two widely-used LLMs demonstrates that CLO achieves comparable accuracy to state-of-the-art systems, while substantially minimizing CPU overhead, fully utilizing PCIe bandwidth, thus improving decoding throughput by 9.3%-66.6%. Our results highlight that algorithm-system co-design is essential for memory-constrained LLM inference on modern GPU platforms. We open source CLO at https://github.com/CommediaJW/CLO.
๐
2025-11-18
๐ ArXiv Preprint (math.OC)
๐ฅ Abdurakhmon Sadiev, Yury Demidovich, Igor Sokolov, Grigory Malinovsky, Sarit Khirirat et al.
Machine Learning
ๆฅ็ๆ่ฆ
Communication compression is essential for scalable distributed training of modern machine learning models, but it often degrades convergence due to the noise it introduces. Error Feedback (EF) mechanisms are widely adopted to mitigate this issue of distributed compression algorithms. Despite their popularity and training efficiency, existing distributed EF algorithms often require prior knowledge of problem parameters (e.g., smoothness constants) to fine-tune stepsizes. This limits their practical applicability especially in large-scale neural network training. In this paper, we study normalized error feedback algorithms that combine EF with normalized updates, various momentum variants, and parameter-agnostic, time-varying stepsizes, thus eliminating the need for problem-dependent tuning. We analyze the convergence of these algorithms for minimizing smooth functions, and establish parameter-agnostic complexity bounds that are close to the best-known bounds with carefully-tuned problem-dependent stepsizes. Specifically, we show that normalized EF21 achieve the convergence rate of near ${O}(1/T^{1/4})$ for Polyak's heavy-ball momentum, ${O}(1/T^{2/7})$ for Iterative Gradient Transport (IGT), and ${O}(1/T^{1/3})$ for STORM and Hessian-corrected momentum. Our results hold with decreasing stepsizes and small mini-batches. Finally, our empirical experiments confirm our theoretical insights.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jintao Zhang, Mingyue Cheng, Zirui Liu, Xianquan Wang, Yitong Zhou et al.
ๆฅ็ๆ่ฆ
Time series generation is critical for a wide range of applications, which greatly supports downstream analytical and decision-making tasks. However, the inherent temporal heterogeneous induced by localized perturbations present significant challenges for generating structurally consistent time series. While flow matching provides a promising paradigm by modeling temporal dynamics through trajectory-level supervision, it fails to adequately capture abrupt transitions in perturbed time series, as the use of globally shared parameters constrains the velocity field to a unified representation. To address these limitations, we introduce \textbf{PAFM}, a \textbf{P}erturbation-\textbf{A}ware \textbf{F}low \textbf{M}atching framework that models perturbed trajectories to ensure stable and structurally consistent time series generation. The framework incorporates perturbation-guided training to simulate localized disturbances and leverages a dual-path velocity field to capture trajectory deviations under perturbation, enabling refined modeling of perturbed behavior to enhance the structural coherence. In order to further improve sensitivity to trajectory perturbations while enhancing expressiveness, a mixture-of-experts decoder with flow routing dynamically allocates modeling capacity in response to different trajectory dynamics. Extensive experiments on both unconditional and conditional generation tasks demonstrate that PAFM consistently outperforms strong baselines. Code is available at https://anonymous.4open.science/r/PAFM-03B2.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Diego Armando Pรฉrez-Rosero, Danna Valentina Salazar-Dubois, Juan Camilo Lugo-Rojas, Andrรฉs Marino รlvarez-Meza, Germรกn Castellanos-Dominguez
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
These notes provide a self-contained introduction to kernel methods and their geometric foundations in machine learning. Starting from the construction of Hilbert spaces, we develop the theory of positive definite kernels, reproducing kernel Hilbert spaces (RKHS), and Hilbert-Schmidt operators, emphasizing their role in statistical estimation and representation of probability measures. Classical concepts such as covariance, regression, and information measures are revisited through the lens of Hilbert space geometry. We also introduce kernel density estimation, kernel embeddings of distributions, and the Maximum Mean Discrepancy (MMD). The exposition is designed to serve as a foundation for more advanced topics, including Gaussian processes, kernel Bayesian inference, and functional analytic approaches to modern machine learning.
๐
2025-11-18
๐ ArXiv Preprint (cs.DB)
๐ฅ Tim Schwabe, Maribel Acosta
Machine Learning
ๆฅ็ๆ่ฆ
Join ordering is the NP-hard problem of selecting the most efficient sequence in which to evaluate joins (conjunctive, binary operators) in a database query. As the performance of query execution critically depends on this choice, join ordering lies at the core of query optimization. Traditional approaches cast this problem as a discrete combinatorial search over binary trees guided by a cost model, but they often suffer from high computational complexity and limited scalability. We show that, when the cost model is differentiable, the query plans can be continuously relaxed into a soft adjacency matrix representing a superposition of plans. This continuous relaxation, together with a Gumbel-Softmax parameterization of the adjacency matrix and differentiable constraints enforcing plan validity, enables gradient-based search for plans within this relaxed space. Using a learned Graph Neural Network as the cost model, we demonstrate that this gradient-based approach can find comparable and even lower-cost plans compared to traditional discrete local search methods on two different graph datasets. Furthermore, we empirically show that the runtime of this approach scales linearly with query size, in contrast to quadratic or exponential runtimes of classical approaches. We believe this first step towards gradient-based join ordering can lead to more effective and efficient query optimizers in the future.
๐
2025-11-18
๐ NeurIPS 2025
๐ฅ Clรฉment Dumas
Natural Language Processing
ๆฅ็ๆ่ฆ
Mechanistic interpretability research requires reliable tools for analyzing transformer internals across diverse architectures. Current approaches face a fundamental tradeoff: custom implementations like TransformerLens ensure consistent interfaces but require coding a manual adaptation for each architecture, introducing numerical mismatch with the original models, while direct HuggingFace access through NNsight preserves exact behavior but lacks standardization across models. To bridge this gap, we develop nnterp, a lightweight wrapper around NNsight that provides a unified interface for transformer analysis while preserving original HuggingFace implementations. Through automatic module renaming and comprehensive validation testing, nnterp enables researchers to write intervention code once and deploy it across 50+ model variants spanning 16 architecture families. The library includes built-in implementations of common interpretability methods (logit lens, patchscope, activation steering) and provides direct access to attention probabilities for models that support it. By packaging validation tests with the library, researchers can verify compatibility with custom models locally. nnterp bridges the gap between correctness and usability in mechanistic interpretability tooling.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nicola Rares Franco, Lorenzo Tedesco
Machine Learning
ๆฅ็ๆ่ฆ
We introduce conditional push-forward neural networks (CPFN), a generative framework for conditional distribution estimation. Instead of directly modeling the conditional density $f_{Y|X}$, CPFN learns a stochastic map $\varphi=\varphi(x,u)$ such that $\varphi(x,U)$ and $Y|X=x$ follow approximately the same law, with $U$ a suitable random vector of pre-defined latent variables. This enables efficient conditional sampling and straightforward estimation of conditional statistics through Monte Carlo methods. The model is trained via an objective function derived from a Kullback-Leibler formulation, without requiring invertibility or adversarial training. We establish a near-asymptotic consistency result and demonstrate experimentally that CPFN can achieve performance competitive with, or even superior to, state-of-the-art methods, including kernel estimators, tree-based algorithms, and popular deep learning techniques, all while remaining lightweight and easy to train.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Emanuele Palumbo, Sorawit Saengkyongam, Maria R. Cervera, Jens Behrmann, Andrew C. Miller et al.
ๆฅ็ๆ่ฆ
Continuous cardiovascular monitoring can play a key role in precision health. However, some fundamental cardiac biomarkers of interest, including stroke volume and cardiac output, require invasive measurements, e.g., arterial pressure waveforms (APW). As a non-invasive alternative, photoplethysmography (PPG) measurements are routinely collected in hospital settings. Unfortunately, the prediction of key cardiac biomarkers from PPG instead of APW remains an open challenge, further complicated by the scarcity of annotated PPG measurements. As a solution, we propose a hybrid approach that uses hemodynamic simulations and unlabeled clinical data to estimate cardiovascular biomarkers directly from PPG signals. Our hybrid model combines a conditional variational autoencoder trained on paired PPG-APW data with a conditional density estimator of cardiac biomarkers trained on labeled simulated APW segments. As a key result, our experiments demonstrate that the proposed approach can detect fluctuations of cardiac output and stroke volume and outperform a supervised baseline in monitoring temporal changes in these biomarkers.
๐
2025-11-18
๐ ACL
๐ฅ Trishala Jayesh Ahalpara
Natural Language Processing
ๆฅ็ๆ่ฆ
We present Tell Me, a mental well-being system that leverages advances in large language models to provide accessible, context-aware support for users and researchers. The system integrates three components: (i) a retrieval-augmented generation (RAG) assistant for personalized, knowledge-grounded dialogue; (ii) a synthetic client-therapist dialogue generator conditioned on client profiles to facilitate research on therapeutic language and data augmentation; and (iii) a Well-being AI crew, implemented with CrewAI, that produces weekly self-care plans and guided meditation audio. The system is designed as a reflective space for emotional processing rather than a substitute for professional therapy. It illustrates how conversational assistants can lower barriers to support, complement existing care, and broaden access to mental health resources. To address the shortage of confidential therapeutic data, we introduce synthetic client-therapist dialogue generation conditioned on client profiles. Finally, the planner demonstrates an innovative agentic workflow for dynamically adaptive, personalized self-care, bridging the limitations of static well-being tools. We describe the architecture, demonstrate its functionalities, and report evaluation of the RAG assistant in curated well-being scenarios using both automatic LLM-based judgments and a human-user study. This work highlights opportunities for interdisciplinary collaboration between NLP researchers and mental health professionals to advance responsible innovation in human-AI interaction for well-being.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Daniel Klippert, Alexander Marx
ๆฅ็ๆ่ฆ
To distinguish Markov equivalent graphs in causal discovery, it is necessary to restrict the structural causal model. Crucially, we need to be able to distinguish cause $X$ from effect $Y$ in bivariate models, that is, distinguish the two graphs $X \to Y$ and $Y \to X$. Location-scale noise models (LSNMs), in which the effect $Y$ is modeled based on the cause $X$ as $Y = f(X) + g(X)N$, form a flexible class of models that is general and identifiable in most cases. Estimating these models for arbitrary noise terms $N$, however, is challenging. Therefore, practical estimators are typically restricted to symmetric distributions, such as the normal distribution. As we showcase in this paper, when $N$ is a skewed random variable, which is likely in real-world domains, the reliability of these approaches decreases. To approach this limitation, we propose SkewD, a likelihood-based algorithm for bivariate causal discovery under LSNMs with skewed noise distributions. SkewD extends the usual normal-distribution framework to the skew-normal setting, enabling reliable inference under symmetric and skewed noise. For parameter estimation, we employ a combination of a heuristic search and an expectation conditional maximization algorithm. We evaluate SkewD on novel synthetically generated datasets with skewed noise as well as established benchmark datasets. Throughout our experiments, SkewD exhibits a strong performance and, in comparison to prior work, remains robust under high skewness.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andrey Okhotin, Maksim Nakhodnov, Nikita Kazeev, Andrey E Ustyuzhanin, Dmitry Vetrov
ๆฅ็ๆ่ฆ
In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel (S.U.N.) crystalline materials. However, most of these models don't have the ability to change the number of atoms in the crystal during the generation process, which limits the variability of model sampling trajectories. In this paper, we demonstrate the severity of this restriction and introduce a simple yet powerful technique, mirage infusion, which enables diffusion models to change the state of the atoms that make up the crystal from existent to non-existent (mirage) and vice versa. We show that this technique improves model quality by up to $\times2.5$ compared to the same model without this modification. The resulting model, Mirage Atom Diffusion (MiAD), is an equivariant joint diffusion model for de novo crystal generation that is capable of altering the number of atoms during the generation process. MiAD achieves an $8.2\%$ S.U.N. rate on the MP-20 dataset, which substantially exceeds existing state-of-the-art approaches. The source code can be found at \href{https://github.com/andrey-okhotin/miad.git}{\texttt{github.com/andrey-okhotin/miad}}.
๐
2025-11-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Zhengchunmin Dai, Jiaxiong Tang, Peng Sun, Honglong Chen, Liantao Wu
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
In decentralized machine learning paradigms such as Split Federated Learning (SFL) and its variant U-shaped SFL, the server's capabilities are severely restricted. Although this enhances client-side privacy, it also leaves the server highly vulnerable to model theft by malicious clients. Ensuring intellectual property protection for such capability-limited servers presents a dual challenge: watermarking schemes that depend on client cooperation are unreliable in adversarial settings, whereas traditional server-side watermarking schemes are technically infeasible because the server lacks access to critical elements such as model parameters or labels.
To address this challenge, this paper proposes Sigil, a mandatory watermarking framework designed specifically for capability-limited servers. Sigil defines the watermark as a statistical constraint on the server-visible activation space and embeds the watermark into the client model via gradient injection, without requiring any knowledge of the data. Besides, we design an adaptive gradient clipping mechanism to ensure that our watermarking process remains both mandatory and stealthy, effectively countering existing gradient anomaly detection methods and a specifically designed adaptive subspace removal attack. Extensive experiments on multiple datasets and models demonstrate Sigil's fidelity, robustness, and stealthiness.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaowei Xu, Justin Sonneck, Hongxiao Wang, Roman Burkard, Hendrik Wohrle et al.
Robotics
ๆฅ็ๆ่ฆ
Autonomous migration is essential for the function of immune cells such as neutrophils and plays a pivotal role in diverse diseases. Recently, we introduced ComplexEye, a multi-lens array microscope comprising 16 independent aberration-corrected glass lenses arranged at the pitch of a 96-well plate, capable of capturing high-resolution movies of migrating cells. This architecture enables high-throughput live-cell video microscopy for migration analysis, supporting routine quantification of autonomous motility with strong potential for clinical translation. However, ComplexEye and similar high-throughput imaging platforms generate data at an exponential rate, imposing substantial burdens on storage and transmission. To address this challenge, we present FlowRoI, a fast optical-flow-based region of interest (RoI) extraction framework designed for high-throughput image compression in immune cell migration studies. FlowRoI estimates optical flow between consecutive frames and derives RoI masks that reliably cover nearly all migrating cells. The raw image and its corresponding RoI mask are then jointly encoded using JPEG2000 to enable RoI-aware compression. FlowRoI operates with high computational efficiency, achieving runtimes comparable to standard JPEG2000 and reaching an average throughput of about 30 frames per second on a modern laptop equipped with an Intel i7-1255U CPU. In terms of image quality, FlowRoI yields higher peak signal-to-noise ratio (PSNR) in cellular regions and achieves 2.0-2.2x higher compression rates at matched PSNR compared to standard JPEG2000.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haobin Li, Mouxing Yang, Xi Peng
ๆฅ็ๆ่ฆ
Recently, the general-to-customized paradigm has emerged as the dominant approach for Cross-Modal Retrieval (CMR), which reconciles the distribution shift problem between the source domain and the target domain. However, existing general-to-customized CMR methods typically assume that the entire target-domain data is available, which is easily violated in real-world scenarios and thus inevitably suffer from the query shift (QS) problem. Specifically, query shift embraces the following two characteristics and thus poses new challenges to CMR. i) Online Shift: real-world queries always arrive in an online manner, rendering it impractical to access the entire query set beforehand for customization approaches; ii) Diverse Shift: even with domain customization, the CMR models struggle to satisfy queries from diverse users or scenarios, leaving an urgent need to accommodate diverse queries. In this paper, we observe that QS would not only undermine the well-structured common space inherited from the source model, but also steer the model toward forgetting the indispensable general knowledge for CMR. Inspired by the observations, we propose a novel method for achieving online and harmonious adaptation against QS, dubbed Robust adaptation with quEry ShifT (REST). To deal with online shift, REST first refines the retrieval results to formulate the query predictions and accordingly designs a QS-robust objective function on these predictions to preserve the well-established common space in an online manner. As for tackling the more challenging diverse shift, REST employs a gradient decoupling module to dexterously manipulate the gradients during the adaptation process, thus preventing the CMR model from forgetting the general knowledge. Extensive experiments on 20 benchmarks across three CMR tasks verify the effectiveness of our method against QS.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bastien Vuillod, Pierre-Alain Moellic, Jean-Max Dutertre
ๆฅ็ๆ่ฆ
Large models adaptation through Federated Learning (FL) addresses a wide range of use cases and is enabled by Parameter-Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA). However, this distributed learning paradigm faces several security threats, particularly to its integrity, such as backdoor attacks that aim to inject malicious behavior during the local training steps of certain clients. We present the first analysis of the influence of LoRA on state-of-the-art backdoor attacks targeting model adaptation in FL. Specifically, we focus on backdoor lifespan, a critical characteristic in FL, that can vary depending on the attack scenario and the attacker's ability to effectively inject the backdoor. A key finding in our experiments is that for an optimally injected backdoor, the backdoor persistence after the attack is longer when the LoRA's rank is lower. Importantly, our work highlights evaluation issues of backdoor attacks against FL and contributes to the development of more robust and fair evaluations of backdoor attacks, enhancing the reliability of risk assessments for critical FL systems. Our code is publicly available.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty
Natural Language Processing
ๆฅ็ๆ่ฆ
While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks: (a) object localization, (b) counting, (c) image retrieval i.e., (SBIR and fine-grained SBIR), and (d) visual question answering (VQA); while incorporating the three existing sketch datasets, namely QuickDraw!, Sketchy, and Tu Berlin, along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nanxi Chen, Sifan Wang, Rujin Ma, Airong Chen, Chuanjie Cui
Machine Learning
ๆฅ็ๆ่ฆ
Physics-informed neural networks (PINNs) represent a new paradigm for solving partial differential equations (PDEs) by integrating physical laws into the learning process of neural networks. However, despite their foundational role, the hidden irreversibility implied by the Second Law of Thermodynamics is often neglected during training, leading to unphysical solutions or even training failures in conventional PINNs. In this paper, we identify this critical gap and introduce a simple, generalized, yet robust irreversibility-regularized strategy that enforces hidden physical laws as soft constraints during training. This approach ensures that the learned solutions consistently respect the intrinsic one-way nature of irreversible physical processes. Across a wide range of benchmarks spanning traveling wave propagation, steady combustion, ice melting, corrosion evolution, and crack propagation, we demonstrate that our regularization scheme reduces predictive errors by more than an order of magnitude, while requiring only minimal modification to existing PINN frameworks. We believe that the proposed framework is broadly applicable to a wide class of PDE-governed physical systems and will have significant impact within the scientific machine learning community.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alessio Pellegrino, Jacopo Mauro
Natural Language Processing
ๆฅ็ๆ่ฆ
One of the long-standing goals in optimisation and constraint programming is to describe a problem in natural language and automatically obtain an executable, efficient model. Large language models appear to bring this vision closer, showing impressive results in automatically generating models for classical benchmarks. However, much of this apparent success may derive from data contamination rather than genuine reasoning: many standard CP problems are likely included in the training data of these models. To examine this hypothesis, we systematically rephrased and perturbed a set of well-known CSPLib problems to preserve their structure while modifying their context and introducing misleading elements. We then compared the models produced by three representative LLMs across original and modified descriptions. Our qualitative analysis shows that while LLMs can produce syntactically valid and semantically plausible models, their performance drops sharply under contextual and linguistic variation, revealing shallow understanding and sensitivity to wording.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aneesh Barthakur, Luiz F. O. Chamon
Machine Learning
ๆฅ็ๆ่ฆ
As machine learning applications grow increasingly ubiquitous and complex, they face an increasing set of requirements beyond accuracy. The prevalent approach to handle this challenge is to aggregate a weighted combination of requirement violation penalties into the training objective. To be effective, this approach requires careful tuning of these hyperparameters (weights), involving trial-and-error and cross-validation, which becomes ineffective even for a moderate number of requirements. These issues are exacerbated when the requirements involve parities or equalities, as is the case in fairness and boundary value problems. An alternative technique uses constrained optimization to formulate these learning problems. Yet, existing approximation and generalization guarantees do not apply to problems involving equality constraints. In this work, we derive a generalization theory for equality-constrained statistical learning problems, showing that their solutions can be approximated using samples and rich parametrizations. Using these results, we propose a practical algorithm based on solving a sequence of unconstrained, empirical learning problems. We showcase its effectiveness and the new formulations enabled by equality constraints in fair learning, interpolating classifiers, and boundary value problems.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuwen Zhang, Viet Tran, Paul Weng
Machine Learning
ๆฅ็ๆ่ฆ
In clinical machine learning, the coexistence of multiple models with comparable performance -- a manifestation of the Rashomon Effect -- poses fundamental challenges for trustworthy deployment and evaluation. Small, imbalanced, and noisy datasets, coupled with high-dimensional and weakly identified clinical features, amplify this multiplicity and make conventional validation schemes unreliable. As a result, selecting among equally performing models becomes uncertain, particularly when resource constraints and operational priorities are not considered by conventional metrics like F1 score. To address these issues, we propose two complementary tools for robust model assessment and selection: Intervention Efficiency (IE) and the Perturbation Validation Framework (PVF). IE is a capacity-aware metric that quantifies how efficiently a model identifies actionable true positives when only limited interventions are feasible, thereby linking predictive performance with clinical utility. PVF introduces a structured approach to assess the stability of models under data perturbations, identifying models whose performance remains most invariant across noisy or shifted validation sets. Empirical results on synthetic and real-world healthcare datasets show that using these tools facilitates the selection of models that generalize more robustly and align with capacity constraints, offering a new direction for tackling the Rashomon Effect in clinical settings.
๐
2025-11-18
๐ This paper was accepted by IEEE BIBM 2025 conferen
๐ฅ Chenyang Xu, Siming Li, Hao Wang
Robotics
ๆฅ็ๆ่ฆ
Phonocardiogram (PCG) analysis is vital for cardiovascular disease diagnosis, yet the scarcity of labeled pathological data hinders the capability of AI systems. To bridge this, we introduce H-LDM, a Hierarchical Latent Diffusion Model for generating clinically accurate and controllable PCG signals from structured metadata. Our approach features: (1) a multi-scale VAE that learns a physiologically-disentangled latent space, separating rhythm, heart sounds, and murmurs; (2) a hierarchical text-to-biosignal pipeline that leverages rich clinical metadata for fine-grained control over 17 distinct conditions; and (3) an interpretable diffusion process guided by a novel Medical Attention module. Experiments on the PhysioNet CirCor dataset demonstrate state-of-the-art performance, achieving a Frรฉchet Audio Distance of 9.7, a 92% attribute disentanglement score, and 87.1% clinical validity confirmed by cardiologists. Augmenting diagnostic models with our synthetic data improves the accuracy of rare disease classification by 11.3\%. H-LDM establishes a new direction for data augmentation in cardiac diagnostics, bridging data scarcity with interpretable clinical insights.
๐
2025-11-18
๐ ArXiv Preprint (cs.SD)
๐ฅ Marcel Gibier, Nolwenn Celton, Raphaรซl Duroselle, Pierre Serrano, Olivier Boeffard et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering(AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the Audioset ontology. These segment-level predictions are subsequently calibrated before producing event-level predictions. Finally, these predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves an accuracy of 62.6 % on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
๐
2025-11-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Eric Xue, Ruiyi Zhang, Zijun Zhang, Pengtao Xie
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Transformer models are foundational to natural language processing (NLP) applications, yet remain vulnerable to backdoor attacks introduced through poisoned data, which implant hidden behaviors during training. To strengthen the ability to prevent such compromises, recent research has focused on designing increasingly stealthy attacks to stress-test existing defenses, pairing backdoor behaviors with stylized artifact or token-level perturbation triggers. However, this trend diverts attention from the harder and more realistic case: making the model respond to semantic triggers such as specific names or entities, where a successful backdoor could manipulate outputs tied to real people or events in deployed systems. Motivated by this growing disconnect, we introduce SteganoBackdoor, bringing stealth techniques back into line with practical threat models. Leveraging innocuous properties from natural-language steganography, SteganoBackdoor applies a gradient-guided data optimization process to transform semantic trigger seeds into steganographic carriers that embed a high backdoor payload, remain fluent, and exhibit no representational resemblance to the trigger. Across diverse experimental settings, SteganoBackdoor achieves over 99% attack success at an order-of-magnitude lower data-poisoning rate than prior approaches while maintaining unparalleled evasion against a comprehensive suite of data-level defenses. By revealing this practical and covert attack, SteganoBackdoor highlights an urgent blind spot in current defenses and demands immediate attention to adversarial data defenses and real-world threat modeling.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present AraLingBench: a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
๐
2025-11-18
๐ ICASSP 2026
๐ฅ Marcel Gibier, Raphaรซl Duroselle, Pierre Serrano, Olivier Boeffard, Jean-Franรงois Bonastre
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recent audio-language models have shown impressive performance across a wide range of audio tasks and are increasingly capable of handling long audio inputs. However, the computing costs in these models heavily depend on sequence length, which can become very large given the nature of audio data. In the vision-language domain, token pruning methods have proven effective in reducing token counts while preserving strong performance on standard benchmarks. In this work, we investigate the relevance and effectiveness of such token selection strategies in the context of audio-language models. We also improve them by proposing a lightweight strategy that takes the time dimension into account. While retaining only a quarter of the initial tokens, our approach results in a relative maximum decrease of 2% in CIDEr on Clotho v2 and a relative maximum decrease of 4% in accuracy on MMAU.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zi-Chen Xi, Jiahui Huang, Hao-Xiang Chen, Francis Williams, Qun-Ce Xu et al.
Machine Learning
ๆฅ็ๆ่ฆ
We proposed a generalized method, NeuralSSD, for reconstructing a 3D implicit surface from the widely-available point cloud data. NeuralSSD is a solver-based on the neural Galerkin method, aimed at reconstructing higher-quality and accurate surfaces from input point clouds. Implicit method is preferred due to its ability to accurately represent shapes and its robustness in handling topological changes. However, existing parameterizations of implicit fields lack explicit mechanisms to ensure a tight fit between the surface and input data. To address this, we propose a novel energy equation that balances the reliability of point cloud information. Additionally, we introduce a new convolutional network that learns three-dimensional information to achieve superior optimization results. This approach ensures that the reconstructed surface closely adheres to the raw input points and infers valuable inductive biases from point clouds, resulting in a highly accurate and stable surface reconstruction. NeuralSSD is evaluated on a variety of challenging datasets, including the ShapeNet and Matterport datasets, and achieves state-of-the-art results in terms of both surface reconstruction accuracy and generalizability.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vincent-Daniel Yun, Junhyuk Jo, Sunwoo Lee
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Deep neural networks achieve outstanding performance in visual recognition tasks, yet their large number of parameters makes them less practical for real-world applications. Recently, one-shot pruning has emerged as an effective strategy for reducing model size without additional training. However, models trained with standard objective functions often suffer a significant drop in accuracy after aggressive pruning. Some existing pruning-robust optimizers, such as SAM, and CrAM, mitigate this accuracy drop by guiding the model toward flatter regions of the parameter space, but they inevitably incur non-negligible additional computations. We propose a Variance Amplifying Regularizer (VAR) that deliberately increases the variance of model parameters during training. Our study reveals an intriguing finding that parameters with higher variance exhibit greater pruning robustness. VAR exploits this property by promoting such variance in the weight distribution, thereby mitigating the adverse effects of pruning. We further provide a theoretical analysis of its convergence behavior, supported by extensive empirical results demonstrating the superior pruning robustness of VAR.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Frederik Hoppe, Lars Kleinemeier, Astrid Franz, Udo Gรถbel
ๆฅ็ๆ่ฆ
Recent foundation models for tabular data achieve strong task-specific performance via in-context learning. Nevertheless, they focus on direct prediction by encapsulating both representation learning and task-specific inference inside a single, resource-intensive network. This work specifically focuses on representation learning, i.e., on transferable, task-agnostic embeddings. We systematically evaluate task-agnostic representations from tabular foundation models (TabPFN and TabICL) alongside with classical feature engineering (TableVectorizer) across a variety of application tasks as outlier detection (ADBench) and supervised learning (TabArena Lite). We find that simple TableVectorizer features achieve comparable or superior performance while being up to three orders of magnitude faster than tabular foundation models. The code is available at https://github.com/ContactSoftwareAI/TabEmbedBench.
๐
2025-11-18
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Zhenchuan Ma, Qizhi Teng, Pengcheng Yan, Lindong Li, Kirill M. Gerke et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Heterogeneous porous materials play a crucial role in various engineering systems. Microstructure characterization and reconstruction provide effective means for modeling these materials, which are critical for conducting physical property simulations, structure-property linkage studies, and enhancing their performance across different applications. To achieve superior controllability and applicability with small sample sizes, we propose a statistically controllable microstructure reconstruction framework that integrates neural networks with sliced-Wasserstein metric. Specifically, our approach leverages local pattern distribution for microstructure characterization and employs a controlled sampling strategy to generate target distributions that satisfy given conditional parameters. A neural network-based model establishes the mapping from the input distribution to the target local pattern distribution, enabling microstructure reconstruction. Combinations of sliced-Wasserstein metric and gradient optimization techniques minimize the distance between these distributions, leading to a stable and reliable model. Our method can perform stochastic and controllable reconstruction tasks even with small sample sizes. Additionally, it can generate large-size (e.g. 512 and 1024) 3D microstructures using a chunking strategy. By introducing spatial location masks, our method excels at generating spatially heterogeneous and complex microstructures. We conducted experiments on stochastic reconstruction, controllable reconstruction, heterogeneous reconstruction, and large-size microstructure reconstruction across various materials. Comparative analysis through visualization, statistical measures, and physical property simulations demonstrates the effectiveness, providing new insights and possibilities for research on structure-property linkage and material inverse design.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rui Zhang, Chao Li, Kezhong Liu, Chen Wang, Bolong Zheng et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vessel trajectory prediction is fundamental to intelligent maritime systems. Within this domain, short-term prediction of rapid behavioral changes in complex maritime environments has established multimodal trajectory prediction (MTP) as a promising research area. However, existing vessel MTP methods suffer from limited scenario applicability and insufficient explainability. To address these challenges, we propose a unified MTP framework incorporating explainable navigation intentions, which we classify into sustained and transient categories. Our method constructs sustained intention trees from historical trajectories and models dynamic transient intentions using a Conditional Variational Autoencoder (CVAE), while using a non-local attention mechanism to maintain global scenario consistency. Experiments on real Automatic Identification System (AIS) datasets demonstrates our method's broad applicability across diverse scenarios, achieving significant improvements in both ADE and FDE. Furthermore, our method improves explainability by explicitly revealing the navigational intentions underlying each predicted trajectory.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pietro Sittoni, Francesco Tudisco
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recent work in deep learning has opened new possibilities for solving classical algorithmic tasks using end-to-end learned models. In this work, we investigate the fundamental task of solving linear systems, particularly those that are ill-conditioned. Existing numerical methods for ill-conditioned systems often require careful parameter tuning, preconditioning, or domain-specific expertise to ensure accuracy and stability. In this work, we propose Algebraformer, a Transformer-based architecture that learns to solve linear systems end-to-end, even in the presence of severe ill-conditioning. Our model leverages a novel encoding scheme that enables efficient representation of matrix and vector inputs, with a memory complexity of $O(n^2)$, supporting scalable inference. We demonstrate its effectiveness on application-driven linear problems, including interpolation tasks from spectral methods for boundary value problems and acceleration of the Newton method. Algebraformer achieves competitive accuracy with significantly lower computational overhead at test time, demonstrating that general-purpose neural architectures can effectively reduce complexity in traditional scientific computing pipelines.
๐
2025-11-18
๐ AAAI
๐ฅ Yosuke Nishimoto, Takashi Matsubara
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
World models have been developed to support sample-efficient deep reinforcement learning agents. However, it remains challenging for world models to accurately replicate environments that are high-dimensional, non-stationary, and composed of multiple objects with rich interactions since most world models learn holistic representations of all environmental components. By contrast, humans perceive the environment by decomposing it into discrete objects, facilitating efficient decision-making. Motivated by this insight, we propose \emph{Slot Transformer Imagination with CAusality-aware reinforcement learning} (STICA), a unified framework in which object-centric Transformers serve as the world model and causality-aware policy and value networks. STICA represents each observation as a set of object-centric tokens, together with tokens for the agent action and the resulting reward, enabling the world model to predict token-level dynamics and interactions. The policy and value networks then estimate token-level cause--effect relations and use them in the attention layers, yielding causality-guided decision-making. Experiments on object-rich benchmarks demonstrate that STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.
๐
2025-11-18
๐ ArXiv Preprint (cs.SD)
๐ฅ Jonathan Yaffe, Ben Maman, Meinard Mรผller, Amit H. Bermano
Machine Learning
ๆฅ็ๆ่ฆ
Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and impractical for many musical contexts, weakly aligned approaches using segment-level annotations have gained traction. However, existing methods often rely on Dynamic Time Warping (DTW) or soft alignment loss functions, both of which still require local semantic correspondences, making them error-prone and computationally expensive. In this article, we introduce CountEM, a novel AMT framework that eliminates the need for explicit local alignment by leveraging note event histograms as supervision, enabling lighter computations and greater flexibility. Using an Expectation-Maximization (EM) approach, CountEM iteratively refines predictions based solely on note occurrence counts, significantly reducing annotation efforts while maintaining high transcription accuracy. Experiments on piano, guitar, and multi-instrument datasets demonstrate that CountEM matches or surpasses existing weakly supervised methods, improving AMT's robustness, scalability, and efficiency. Our project page is available at https://yoni-yaffe.github.io/count-the-notes.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Yan Huang, Yongyi Su, Xin Lin, Le Zhang, Xun Xu
ๆฅ็ๆ่ฆ
The emergence of foundation models has substantially advanced zero-shot generalization in monocular depth estimation (MDE), as exemplified by the Depth Anything series. However, given access to some data from downstream tasks, a natural question arises: can the performance of these models be further improved? To this end, we propose WeSTAR, a parameter-efficient framework that performs Weakly supervised Self-Training Adaptation with Regularization, designed to enhance the robustness of MDE foundation models in unseen and diverse domains. We first adopt a dense self-training objective as the primary source of structural self-supervision. To further improve robustness, we introduce semantically-aware hierarchical normalization, which exploits instance-level segmentation maps to perform more stable and multi-scale structural normalization. Beyond dense supervision, we introduce a cost-efficient weak supervision in the form of pairwise ordinal depth annotations to further guide the adaptation process, which enforces informative ordinal constraints to mitigate local topological errors. Finally, a weight regularization loss is employed to anchor the LoRA updates, ensuring training stability and preserving the model's generalizable knowledge. Extensive experiments on both realistic and corrupted out-of-distribution datasets under diverse and challenging scenarios demonstrate that WeSTAR consistently improves generalization and achieves state-of-the-art performance across a wide range of benchmarks.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jim Broadbent, Felix Cohen, Frederik Hvilshรธj, Eric Landau, Eren Sasoglu
Multimodal
ๆฅ็ๆ่ฆ
We simplify space binding by focusing on two core components, a single encoder per modality and high-quality data; enabling training state-of-the-art models on a single GPU in a few hours as opposed to multiple days. We present EBind, an Easy, data-centric, and parameter-efficient method to Bind the embedding spaces of multiple contrastive models. We demonstrate that a simple 1.8B-parameter image-text-video-audio-3D model can outperform models 4 to 17x the size. The key to achieving this is a carefully curated dataset of three complementary data sources: i) 6.7M fully-automated multimodal quintuples sourced via SOTA retrieval models, ii) 1M diverse, semi-automated triples annotated by humans as negative, partial, or positive matches, and iii) 3.4M pre-existing captioned data items. We use 13 different evaluations to demonstrate the value of each data source. Due to limitations with existing benchmarks, we further introduce the first high-quality, consensus-annotated zero-shot classification benchmark between audio and PCs. In contrast to related work, we will open-source our code, model weights, and datasets.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuxiang Wang, Siwen Wang, Haowei Han, Ao Wang, Boya Liu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Operation recommendation for IoT devices refers to generating personalized device operations for users based on their context, such as historical operations, environment information, and device status. This task is crucial for enhancing user satisfaction and corporate profits. Existing recommendation models struggle with complex operation logic, diverse user preferences, and sensitive to suboptimal suggestions, limiting their applicability to IoT device operations. To address these issues, we propose DevPiolt, a LLM-based recommendation model for IoT device operations. Specifically, we first equip the LLM with fundamental domain knowledge of IoT operations via continual pre-training and multi-task fine-tuning. Then, we employ direct preference optimization to align the fine-tuned LLM with specific user preferences. Finally, we design a confidence-based exposure control mechanism to avoid negative user experiences from low-quality recommendations. Extensive experiments show that DevPiolt significantly outperforms baselines on all datasets, with an average improvement of 69.5% across all metrics. DevPiolt has been practically deployed in Xiaomi Home app for one quarter, providing daily operation recommendations to 255,000 users. Online experiment results indicate a 21.6% increase in unique visitor device coverage and a 29.1% increase in page view acceptance rates.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yaniv Oren, Joery A. de Vries, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin Bรถhmer
Machine Learning
ๆฅ็ๆ่ฆ
Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS. Through variance reduction and mitigation of path degeneracy, TSMCTS scales favorably with sequential compute while retaining the properties that make SMC natural to parallelize.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinlei Xiong, Wenbo Hu, Shuxun Zhou, Kaifeng Bi, Lingxi Xie et al.
Machine Learning
ๆฅ็ๆ่ฆ
Weather forecasting is fundamentally challenged by the chaotic nature of the atmosphere, necessitating probabilistic approaches to quantify uncertainty. While traditional ensemble prediction (EPS) addresses this through computationally intensive simulations, recent advances in Bayesian Deep Learning (BDL) offer a promising but often disconnected alternative. We bridge these paradigms through a unified hybrid Bayesian Deep Learning framework for ensemble weather forecasting that explicitly decomposes predictive uncertainty into epistemic and aleatoric components, learned via variational inference and a physics-informed stochastic perturbation scheme modeling flow-dependent atmospheric dynamics, respectively. We further establish a unified theoretical framework that rigorously connects BDL and EPS, providing formal theorems that decompose total predictive uncertainty into epistemic and aleatoric components under the hybrid BDL framework. We validate our framework on the large-scale 40-year ERA5 reanalysis dataset (1979-2019) with 0.25ยฐ spatial resolution. Experimental results show that our method not only improves forecast accuracy and yields better-calibrated uncertainty quantification but also achieves superior computational efficiency compared to state-of-the-art probabilistic diffusion models. We commit to making our code open-source upon acceptance of this paper.
๐
2025-11-18
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Pattaraphon Kenny Wongchamcharoen, Paul Glasserman
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used in finance and economics, where prompt-based attempts against look-ahead bias implicitly assume that models understand chronology. We test this fundamental question with a series of chronological ordering tasks with increasing complexities over facts the model already knows from pre-training. Our tasks cover (1) chronological ordering, (2) conditional sorting (filter, then order), and (3) anachronism detection. We evaluate GPT-4.1, Claude-3.7 Sonnet, with and without Extended Thinking (ET), and GPT-5 across multiple reasoning-effort settings. Across models, Exact match rate drops sharply as sequences lengthen even while rank correlations stay high as LLMs largely preserve local order but struggle to maintain a single globally consistent timeline. In conditional sorting, most failures stem from the filtering step rather than the ordering step, but GPT-5 and Claude-3.7 Sonnet with Extended Thinking outshine normal models significantly. Lastly, anachronism detection is found to be the easiest task for the LLMs but performance still declines with increasingly overlapping timelines or entities. Overall, our main contribution is showing that allocating explicit reasoning budget helps with chronological ordering with GPT-5 at medium/high reasoning effort achieving flawless ordering at all lengths and perfect conditional sorting (both self-filtered and given-subset), whereas low/minimal effort degrades with longer lists, mirroring earlier models. Our findings delineate limits of current LLMs on chronological tasks, providing insights into task complexity, and demonstrate scenarios in which reasoning helps. These patterns are important for the real-time application of LLMs in finance. We release all code and evaluation templates to support full reproducibility.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ N Dinesh Reddy, Sudeep Pillai
RoboticsComputer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Alessio Zanga, Marco Scutari, Fabio Stella
ๆฅ็ๆ่ฆ
Causal discovery combines data with knowledge provided by experts to learn the DAG representing the causal relationships between a given set of variables. When data are scarce, bagging is used to measure our confidence in an average DAG obtained by aggregating bootstrapped DAGs. However, the aggregation step has received little attention from the specialized literature: the average DAG is constructed using only the confidence in the individual edges of the bootstrapped DAGs, thus disregarding complex higher-order edge structures. In this paper, we introduce a novel theoretical framework based on higher-order structures and describe a new DAG aggregation algorithm. We perform a simulation study, discussing the advantages and limitations of the proposed approach. Our proposal is both computationally efficient and effective, outperforming state-of-the-art solutions, especially in low sample size regimes and under high dimensionality settings.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zheyu Lin, Jirui Yang, Hengqi Guo, Yubing Bao, Yao Guan
Natural Language Processing
ๆฅ็ๆ่ฆ
Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream Red Teaming methods rely on online generation and black-box output analysis. These approaches are not only costly but also suffer from feedback latency, making them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (A Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden layer dynamics by analyzing the APT (Angular-Probabilistic Trajectory) of latent representations and introducing the JSS (Jensen-Shannon Separability) metric. Experiments on over 40 models and 20 red teaming strategies demonstrate that the JSS metric exhibits high consistency with the safety rankings derived from Red Teaming. N-GLARE reproduces the discriminative trends of large-scale red-teaming tests at less than 1\% of the token cost and the runtime cost, providing an efficient output-free evaluation proxy for real-time diagnostics.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junpeng Zhao, Lin Li, Kaixi Hu, Kaize Shi, Jingling Yuan
Machine Learning
ๆฅ็ๆ่ฆ
Signed graphs model complex relationships through positive and negative edges, with widespread real-world applications. Given the sensitive nature of such data, selective removal mechanisms have become essential for privacy protection. While graph unlearning enables the removal of specific data influences from Graph Neural Networks (GNNs), existing methods are designed for conventional GNNs and overlook the unique heterogeneous properties of signed graphs. When applied to Signed Graph Neural Networks (SGNNs), these methods lose critical sign information, degrading both model utility and unlearning effectiveness. To address these challenges, we propose Certified Signed Graph Unlearning (CSGU), which provides provable privacy guarantees while preserving the sociological principles underlying SGNNs. CSGU employs a three-stage method: (1) efficiently identifying minimal influenced neighborhoods via triangular structures, (2) applying sociological theories to quantify node importance for optimal privacy budget allocation, and (3) performing importance-weighted parameter updates to achieve certified modifications with minimal utility degradation. Extensive experiments demonstrate that CSGU outperforms existing methods, achieving superior performance in both utility preservation and unlearning effectiveness on SGNNs.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fatima Kazi, Alex Young, Yash Inani, Setareh Rafatirad
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) inherit explicit and implicit biases from their training datasets. Identifying and mitigating biases in LLMs is crucial to ensure fair outputs, as they can perpetuate harmful stereotypes and misinformation. This study highlights the need to address biases in LLMs amid growing generative AI. We studied bias-specific benchmarks such as StereoSet and CrowSPairs to evaluate the existence of various biases in multiple generative models such as BERT and GPT 3.5. We proposed an automated Bias-Identification Framework to recognize various social biases in LLMs such as gender, race, profession, and religion. We adopted a two-pronged approach to detect explicit and implicit biases in text data. Results indicated fine-tuned models struggle with gender biases but excelled at identifying and avoiding racial biases. Our findings illustrated that despite having some success, LLMs often over-relied on keywords. To illuminate the capability of the analyzed LLMs in detecting implicit biases, we employed Bag-of-Words analysis and unveiled indications of implicit stereotyping within the vocabulary. To bolster the model performance, we applied an enhancement strategy involving fine-tuning models using prompting techniques and data augmentation of the bias benchmarks. The fine-tuned models exhibited promising adaptability during cross-dataset testing and significantly enhanced performance on implicit bias benchmarks, with performance gains of up to 20%.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, Biqing Qi
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous FM (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure. In this work, we propose asynchronous flow matching VLA (AsyncVLA), a novel framework that introduces temporal flexibility in asynchronous FM (AFM) and enables self-correction in action generation. AsyncVLA breaks from the vanilla SFM in VLA models by generating the action tokens in a non-uniform time schedule with action context awareness. Besides, our method introduces the confidence rater to extract confidence of the initially generated actions, enabling the model to selectively refine inaccurate action tokens before execution. Moreover, we propose a unified training procedure for SFM and AFM that endows a single model with both modes, improving KV-cache utilization. Extensive experiments on robotic manipulation benchmarks demonstrate that AsyncVLA is data-efficient and exhibits self-correction ability. AsyncVLA achieves state-of-the-art results across general embodied evaluations due to its asynchronous generation in AFM. Our code is available at https://github.com/YuhuaJiang2002/AsyncVLA.
๐
2025-11-18
๐ ArXiv Preprint (physics.optics)
๐ฅ Alexander Christie, Matan Leibovich, Miguel Moscoso, Alexei Novikov, George Papanicolaou et al.
ๆฅ็ๆ่ฆ
We develop an imaging algorithm that exploits strong scattering to achieve super-resolution in changing random media. The method processes large and diverse array datasets using sparse dictionary learning, clustering, and multidimensional scaling. Starting from random initializations, the algorithm reliably extracts the unknown medium properties necessary for accurate imaging using back-propagation, $\ell_2$ or $\ell_1$ methods. Remarkably, scattering enhances resolution beyond homogeneous medium limits. When abundant data are available, the algorithm allows the realization of super-resolution in imaging.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Renjie Chen, Viet Anh Nguyen, Huifu Xu
Machine Learning
ๆฅ็ๆ่ฆ
We propose a distributionally robust formulation for simultaneously estimating the covariance matrix and the precision matrix of a random vector.The proposed model minimizes the worst-case weighted sum of the Frobenius loss of the covariance estimator and Stein's loss of the precision matrix estimator against all distributions from an ambiguity set centered at the nominal distribution. The radius of the ambiguity set is measured via convex spectral divergence. We demonstrate that the proposed distributionally robust estimation model can be reduced to a convex optimization problem, thereby yielding quasi-analytical estimators. The joint estimators are shown to be nonlinear shrinkage estimators. The eigenvalues of the estimators are shrunk nonlinearly towards a positive scalar, where the scalar is determined by the weight coefficient of the loss terms. By tuning the coefficient carefully, the shrinkage corrects the spectral bias of the empirical covariance/precision matrix estimator. By this property, we call the proposed joint estimator the Spectral concentrated COvariance and Precision matrix Estimator (SCOPE). We demonstrate that the shrinkage effect improves the condition number of the estimator. We provide a parameter-tuning scheme that adjusts the shrinkage target and intensity that is asymptotically optimal. Numerical experiments on synthetic and real data show that our shrinkage estimators perform competitively against state-of-the-art estimators in practical applications.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Omkar Mahesh Kashyap, Padegal Amit, Madhav Kashyap, Ashwini M Joshi, Shylaja SS
Natural Language Processing
ๆฅ็ๆ่ฆ
Aspect-Based Sentiment Analysis (ABSA) predicts sentiment polarity for specific aspect terms, a task made difficult by conflicting sentiments across aspects and the sparse context of short texts. Prior graph-based approaches model only pairwise dependencies, forcing them to construct multiple graphs for different relational views. These introduce redundancy, parameter overhead, and error propagation during fusion, limiting robustness in short-text, low-resource settings. We present HyperABSA, a dynamic hypergraph framework that induces aspect-opinion structures through sample-specific hierarchical clustering. To construct these hyperedges, we introduce a novel acceleration-fallback cutoff for hierarchical clustering, which adaptively determines the level of granularity. Experiments on three benchmarks (Lap14, Rest14, MAMS) show consistent improvements over strong graph baselines, with substantial gains when paired with RoBERTa backbones. These results position dynamic hypergraph construction as an efficient, powerful alternative for ABSA, with potential extensions to other short-text NLP tasks.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Promise Ekpo, Saesha Agarwal, Felix Grimm, Lekan Molu, Angelique Taylor
Machine Learning
ๆฅ็ๆ่ฆ
Enforcing a fair workload allocation among multiple agents tasked to achieve an objective in learning enabled demand side healthcare worker settings is crucial for consistent and reliable performance at runtime. Existing multi-agent reinforcement learning (MARL) approaches steer fairness by shaping reward through post hoc orchestrations, leaving no certifiable self-enforceable fairness that is immutable by individual agents at runtime. Contextualized within a setting where each agent shares resources with others, we address this shortcoming with a learning enabled optimization scheme among self-interested decision makers whose individual actions affect those of other agents. This extends the problem to a generalized Nash equilibrium (GNE) game-theoretic framework where we steer group policy to a safe and locally efficient equilibrium, so that no agent can improve its utility function by unilaterally changing its decisions. Fair-GNE models MARL as a constrained generalized Nash equilibrium-seeking (GNE) game, prescribing an ideal equitable collective equilibrium within the problem's natural fabric. Our hypothesis is rigorously evaluated in our custom-designed high-fidelity resuscitation simulator. Across all our numerical experiments, Fair-GNE achieves significant improvement in workload balance over fixed-penalty baselines (0.89 vs.\ 0.33 JFI, $p < 0.01$) while maintaining 86\% task success, demonstrating statistically significant fairness gains through adaptive constraint enforcement. Our results communicate our formulations, evaluation metrics, and equilibrium-seeking innovations in large multi-agent learning-based healthcare systems with clarity and principled fairness enforcement.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jessy Xinyi Han, Devavrat Shah
Robotics
ๆฅ็ๆ่ฆ
Estimating causal effects on time-to-event outcomes from observational data is particularly challenging due to censoring, limited sample sizes, and non-random treatment assignment. The need for answering such "when-if" questions--how the timing of an event would change under a specified intervention--commonly arises in real-world settings with heterogeneous treatment adoption and confounding. To address these challenges, we propose Synthetic Survival Control (SSC) to estimate counterfactual hazard trajectories in a panel data setting where multiple units experience potentially different treatments over multiple periods. In such a setting, SSC estimates the counterfactual hazard trajectory for a unit of interest as a weighted combination of the observed trajectories from other units. To provide formal justification, we introduce a panel framework with a low-rank structure for causal survival analysis. Indeed, such a structure naturally arises under classical parametric survival models. Within this framework, for the causal estimand of interest, we establish identification and finite sample guarantees for SSC. We validate our approach using a multi-country clinical dataset of cancer treatment outcomes, where the staggered introduction of new therapies creates a quasi-experimental setting. Empirically, we find that access to novel treatments is associated with improved survival, as reflected by lower post-intervention hazard trajectories relative to their synthetic counterparts. Given the broad relevance of survival analysis across medicine, economics, and public policy, our framework offers a general and interpretable tool for counterfactual survival inference using observational data.
๐
2025-11-18
๐ 13 pages, 13 figures. Intended for submission to I
๐ฅ Xiang Luo, Chang Liu, Gang Xiong, Chen Yang, Gaopeng Gou et al.
ๆฅ็ๆ่ฆ
Fine-grained identification of IDS-flagged suspicious traffic is crucial in cybersecurity. In practice, cyber threats evolve continuously, making the discovery of novel malicious traffic a critical necessity as well as the identification of known classes. Recent studies have advanced this goal with deep models, but they often rely on task-specific architectures that limit transferability and require per-dataset tuning. In this paper we introduce MalRAG, the first LLM driven retrieval-augmented framework for open-set malicious traffic identification. MalRAG freezes the LLM and operates via comprehensive traffic knowledge construction, adaptive retrieval, and prompt engineering. Concretely, we construct a multi-view traffic database by mining prior malicious traffic from content, structural, and temporal perspectives. Furthermore, we introduce a Coverage-Enhanced Retrieval Algorithm that queries across these views to assemble the most probable candidates, thereby improving the inclusion of correct evidence. We then employ Traffic-Aware Adaptive Pruning to select a variable subset of these candidates based on traffic-aware similarity scores, suppressing incorrect matches and yielding reliable retrieved evidence. Moreover, we develop a suite of guidance prompts where task instruction, evidence referencing, and decision guidance are integrated with the retrieved evidence to improve LLM performance. Across diverse real-world datasets and settings, MalRAG delivers state-of-the-art results in both fine-grained identification of known classes and novel malicious traffic discovery. Ablation and deep-dive analyses further show that MalRAG effective leverages LLM capabilities yet achieves open-set malicious traffic identification without relying on a specific LLM.
๐
2025-11-18
๐ This paper accepted for presentation to the 16th A
๐ฅ Sabiha Afroz, Redwan Ibne Seraj Khan, Hadeel Albahar, Jingoo Han, Ali R. Butt
Natural Language Processing
ๆฅ็ๆ่ฆ
Training large language models (LLMs) in the cloud faces growing memory bottlenecks due to the limited capacity and high cost of GPUs. While GPU memory offloading to CPU and NVMe has made large-scale training more feasible, existing approaches suffer from high tensor migration latency and suboptimal device memory utilization, ultimately increasing training time and cloud costs. To address these challenges, we present 10Cache, a resource-aware tensor caching and migration system that accelerates LLM training by intelligently coordinating memory usage across GPU, CPU, and NVMe tiers. 10Cache profiles tensor execution order to construct prefetch policies, allocates memory buffers in pinned memory based on tensor size distributions, and reuses memory buffers to minimize allocation overhead.
Designed for cloud-scale deployments, 10Cache improves memory efficiency and reduces reliance on high-end GPUs. Across diverse LLM workloads, it achieves up to 2x speedup in training time, improves GPU cache hit rate by up to 86.6x, and increases CPU/GPU memory utilization by up to 2.15x and 1.33x, respectively, compared to state-of-the-art offloading methods. These results demonstrate that 10Cache is a practical and scalable solution for optimizing LLM training throughput and resource efficiency in cloud environments.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Agamdeep Singh, Ashish Tiwari, Hosein Hasanbeig, Priyanshu Gupta
Machine Learning
ๆฅ็ๆ่ฆ
Many machine learning tasks involve inherent subjectivity, where annotators naturally provide varied labels. Standard practice collapses these label distributions into single labels, aggregating diverse human judgments into point estimates. We argue that this approach is epistemically misaligned for ambiguous data--the annotation distribution itself should be regarded as the ground truth. Training on collapsed single labels forces models to express false confidence on fundamentally ambiguous cases, creating a misalignment between model certainty and the diversity of human perception. We demonstrate empirically that soft-label training, which treats annotation distributions as ground truth, preserves epistemic uncertainty. Across both vision and NLP tasks, soft-label training achieves 32% lower KL divergence from human annotations and 61% stronger correlation between model and annotation entropy, while matching the accuracy of hard-label training. Our work repositions annotation distributions from noisy signals to be aggregated away, to faithful representations of epistemic uncertainty that models should learn to reproduce.
๐
2025-11-18
๐ ArXiv Preprint (eess.SP)
๐ฅ Sithmini Ranasingha, Agasthi Haputhanthri, Hansa Marasinghe, Nima Wickramasinghe, Kithmin Wickremasinghe et al.
Machine Learning
ๆฅ็ๆ่ฆ
Neonates are highly susceptible to seizures, often leading to short or long-term neurological impairments. However, clinical manifestations of neonatal seizures are subtle and often lead to misdiagnoses. This increases the risk of prolonged, untreated seizure activity and subsequent brain injury. Continuous video electroencephalogram (cEEG) monitoring is the gold standard for seizure detection. However, this is an expensive evaluation that requires expertise and time. In this study, we propose a convolutional neural network-based model for early prediction of neonatal seizures by distinguishing between interictal and preictal states of the EEG. Our model is patient-independent, enabling generalization across multiple subjects, and utilizes mel-frequency cepstral coefficient matrices extracted from multichannel EEG and electrocardiogram (ECG) signals as input features. Trained and validated on the Helsinki neonatal EEG dataset with 10-fold cross-validation, the proposed model achieved an average accuracy of 97.52%, sensitivity of 98.31%, specificity of 96.39%, and F1-score of 97.95%, enabling accurate seizure prediction up to 30 minutes before onset. The inclusion of ECG alongside EEG improved the F1-score by 1.42%, while the incorporation of an attention mechanism yielded an additional 0.5% improvement. To enhance transparency, we incorporated SHapley Additive exPlanations (SHAP) as an explainable artificial intelligence method to interpret the model and provided localization of seizure focus using scalp plots. The overall results demonstrate the model's potential for minimally supervised deployment in neonatal intensive care units, enabling timely and reliable prediction of neonatal seizures, while demonstrating strong generalization capability across unseen subjects through transfer learning.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wenfeng Wang, Jiacheng Liu, Xiaofeng Hou, Xinfeng Xia, Peng Tang et al.
ๆฅ็ๆ่ฆ
The immense memory requirements of state-of-the-art Mixture-of-Experts (MoE) models present a significant challenge for inference, often exceeding the capacity of a single accelerator. While offloading experts to host memory is a common solution, it introduces a severe I/O bottleneck over the PCIe bus, as the data-dependent nature of expert selection places these synchronous transfers directly on the critical path of execution, crippling performance.
This paper argues that the I/O bottleneck can be overcome by trading a small amount of cheap, on-device computation to hide the immense cost of data movement. We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. This foresight enables a runtime orchestrator to prefetch these experts from host memory, effectively overlapping the expensive I/O with useful computation and hiding the latency from the critical path. To maximize performance, an adaptive governor, guided by an Amortization Roofline Model, dynamically tunes the speculation strategy to the underlying hardware. Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves at most 2.34x speedup over the state-of-the-art offloading framework. Our work establishes a new, principled approach for managing data-dependent memory access in resource-limited environments, making MoE inference more accessible on commodity hardware.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Iden Kalemaj, Luca Melis, Maxime Boucher, Ilya Mironov, Saeed Mahloujifar
Machine Learning
ๆฅ็ๆ่ฆ
Differential privacy (DP) auditing is essential for evaluating privacy guarantees in machine learning systems. Existing auditing methods, however, pose a significant challenge for large-scale systems since they require modifying the training dataset -- for instance, by injecting out-of-distribution canaries or removing samples from training. Such interventions on the training data pipeline are resource-intensive and involve considerable engineering overhead. We introduce a novel observational auditing framework that leverages the inherent randomness of data distributions, enabling privacy evaluation without altering the original dataset. Our approach extends privacy auditing beyond traditional membership inference to protected attributes, with labels as a special case, addressing a key gap in existing techniques. We provide theoretical foundations for our method and perform experiments on Criteo and CIFAR-10 datasets that demonstrate its effectiveness in auditing label privacy guarantees. This work opens new avenues for practical privacy auditing in large-scale production environments.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qiqi Xiao, Ziqi Ye, Yinghui He, Jianwei Liu, Guanding Yu
Machine Learning
ๆฅ็ๆ่ฆ
To promote the practicality of deep learning-based localization, existing studies aim to address the issue of scenario dependence through meta-learning. However, these studies primarily focus on variations in environmental layouts while overlooking the impact of changes in device configurations, such as bandwidth, the number of access points (APs), and the number of antennas used. Unlike environmental changes, variations in device configurations affect the dimensionality of channel state information (CSI), thereby compromising neural network usability. To address this issue, we propose Meta-SimGNN, a novel WiFi localization system that integrates graph neural networks with meta-learning to improve localization generalization and robustness. First, we introduce a fine-grained CSI graph construction scheme, where each AP is treated as a graph node, allowing for adaptability to changes in the number of APs. To structure the features of each node, we propose an amplitude-phase fusion method and a feature extraction method. The former utilizes both amplitude and phase to construct CSI images, enhancing data reliability, while the latter extracts dimension-consistent features to address variations in bandwidth and the number of antennas. Second, a similarity-guided meta-learning strategy is developed to enhance adaptability in diverse scenarios. The initial model parameters for the fine-tuning stage are determined by comparing the similarity between the new scenario and historical scenarios, facilitating rapid adaptation of the model to the new localization scenario. Extensive experimental results over commodity WiFi devices in different scenarios show that Meta-SimGNN outperforms the baseline methods in terms of localization generalization and accuracy.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nakkyu Yang, Yechan Lee, SooJean Han
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Classifier-Free Guidance (CFG) has become a mainstream approach for simultaneously improving prompt fidelity and generation quality in conditional generative models. During training, CFG stochastically alternates between conditional and null prompts to enable both conditional and unconditional generation. However, during sampling, CFG outputs both null and conditional prompts simultaneously, leading to inconsistent noise estimates between the training and sampling processes. To reduce this error, we propose CFG-EC, a versatile correction scheme augmentable to any CFG-based method by refining the unconditional noise predictions. CFG-EC actively realigns the unconditional noise error component to be orthogonal to the conditional error component. This corrective maneuver prevents interference between the two guidance components, thereby constraining the sampling error's upper bound and establishing more reliable guidance trajectories for high-fidelity image generation. Our numerical experiments show that CFG-EC handles the unconditional component more effectively than CFG and CFG++, delivering a marked performance increase in the low guidance sampling regime and consistently higher prompt alignment across the board.
๐
2025-11-18
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Ajesh Koyatan Chathoth, Stephen Lee
Machine Learning
ๆฅ็ๆ่ฆ
Sensor data-based recognition systems are widely used in various applications, such as gait-based authentication and human activity recognition (HAR). Modern wearable and smart devices feature various built-in Inertial Measurement Unit (IMU) sensors, and such sensor-based measurements can be fed to a machine learning-based model to train and classify human activities. While deep learning-based models have proven successful in classifying human activity and gestures, they pose various security risks. In our paper, we discuss a novel dynamic trigger-generation technique for performing black-box adversarial attacks on sensor data-based IoT systems. Our empirical analysis shows that the attack is successful on various datasets and classifier models with minimal perturbation on the input data. We also provide a detailed comparative analysis of performance and stealthiness to various other poisoning techniques found in backdoor attacks. We also discuss some adversarial defense mechanisms and their impact on the effectiveness of our trigger-generation technique.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kelin Ren, Chan-Yang Ju, Dong-Ho Lee
ๆฅ็ๆ่ฆ
Medication recommendation systems play a crucial role in assisting clinicians with personalized treatment decisions. While existing approaches have made significant progress in learning medication representations, they suffer from two fundamental limitations: (i) treating medical entities as independent features without modeling their synergistic effects on medication selection; (ii) employing static causal relationships that fail to adapt to patient-specific contexts and health states. To address these challenges, we propose CafeMed, a framework that integrates dynamic causal reasoning with cross-modal attention for safe and accurate medication recommendation. CafeMed introduces two key components: the Causal Weight Generator (CWG) that transforms static causal effects into dynamic modulation weights based on individual patient states, and the Channel Harmonized Attention Refinement Module (CHARM) that captures complex interdependencies between diagnoses and procedures. This design enables CafeMed to model how different medical conditions jointly influence treatment decisions while maintaining medication safety constraints. Extensive experiments on MIMIC-III and MIMIC-IV datasets demonstrate that CafeMed significantly outperforms state-of-the-art baselines, achieving superior accuracy in medication prediction while maintaining the lower drug--drug interaction rates. Our results indicate that incorporating dynamic causal relationships and cross-modal synergies leads to more clinically-aligned and personalized medication recommendations. Our code is released publicly at https://github.com/rkl71/CafeMed.
๐
2025-11-18
๐ ArXiv Preprint (cs.SE)
๐ฅ Shenglin Zhang, Ziang Chen, Zijing Que, Yilun Liu, Yongqian Sun et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Log anomaly detection, which is critical for identifying system failures and preempting security breaches, detects irregular patterns within large volumes of log data, and impacts domains such as service reliability, performance optimization, and database log analysis. Modern log anomaly detection methods rely on training deep learning models on clean, anomaly-free log sequences. However, obtaining such clean log data requires costly and tedious human labeling, and existing automatic cleaning methods fail to fully integrate the specific characteristics and actual semantics of logs in their purification process. In this paper, we propose a cost-aware, rule-enhanced purification framework, LogPurge, that automatically selects a sufficient subset of normal log sequences from contamination log sequences to train a anomaly detection model. Our approach involves a two-stage filtering algorithm: In the first stage, we use a large language model (LLM) to remove clustered anomalous patterns and enhance system rules to improve LLM's understanding of system logs; in the second stage, we utilize a divide-and-conquer strategy that decomposes the remaining contaminated regions into smaller subproblems, allowing each to be effectively purified through the first stage procedure. Our experiments, conducted on two public datasets and one industrial dataset, show that our method significantly removes an average of 98.74% of anomalies while retaining 82.39% of normal samples. Compared to the latest unsupervised log sample selection algorithms, our method achieves F-1 score improvements of 35.7% and 84.11% on the public datasets, and an impressive 149.72% F-1 improvement on the private dataset, demonstrating the effectiveness of our approach.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xianghe Liu, Jiajia Liu, Chuxian Xu, Minghan Wang, Hongbo Peng et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
In precision sports such as archery, athletes' performance depends on both biomechanical stability and psychological resilience. Traditional motion analysis systems are often expensive and intrusive, limiting their use in natural training environments. To address this limitation, we propose a machine learning-based multimodal framework that integrates wearable sensor data for simultaneous action recognition and stress estimation. Using a self-developed wrist-worn device equipped with an accelerometer and photoplethysmography (PPG) sensor, we collected synchronized motion and physiological data during real archery sessions. For motion recognition, we introduce a novel feature--Smoothed Differential Acceleration (SmoothDiff)--and employ a Long Short-Term Memory (LSTM) model to identify motion phases, achieving 96.8% accuracy and 95.9% F1-score. For stress estimation, we extract heart rate variability (HRV) features from PPG signals and apply a Multi-Layer Perceptron (MLP) classifier, achieving 80% accuracy in distinguishing high- and low-stress levels. The proposed framework demonstrates that integrating motion and physiological sensing can provide meaningful insights into athletes' technical and mental states. This approach offers a foundation for developing intelligent, real-time feedback systems for training optimization in archery and other precision sports.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Marios Papamichals, Regina Ruane
Machine Learning
ๆฅ็ๆ่ฆ
Generative models on curved spaces rely on charts to map Euclidean spaces to manifolds. Exponential maps preserve geodesics but have stiff, radius-dependent Jacobians, while volume-preserving charts maintain densities but distort geodesic distances. Both approaches entangle curvature with model parameters, inflating gradient variance. In high-dimensional latent normalizing flows, the wrapped exponential prior can stretch radii far beyond the curvature scale, leading to poor test likelihoods and stiff solvers. We introduce Radial Compensation (RC), an information-geometric method that selects the base density in the tangent space so that the likelihood depends only on geodesic distance from a pole, decoupling parameter semantics from curvature. RC lets radial parameters retain their usual meaning in geodesic units, while the chart can be tuned as a numerical preconditioner. We extend RC to manifolds with known geodesic polar volume and show that RC is the only construction for geodesic-radial likelihoods with curvature-invariant Fisher information. We derive the Balanced-Exponential (bExp) chart family, balancing volume distortion and geodesic error. Under RC, all bExp settings preserve the same manifold density and Fisher information, with smaller dial values reducing gradient variance and flow cost. Empirically, RC yields stable generative models across densities, VAEs, flows on images and graphs, and protein models. RC improves likelihoods, restores clean geodesic radii, and prevents radius blow-ups in high-dimensional flows, making RC-bExp a robust default for likelihood-trained generative models on manifolds.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Semen Leontev
Machine Learning
ๆฅ็ๆ่ฆ
Small and medium-sized enterprises (SMEs) represent 99.9% of U.S. businesses yet remain systematically excluded from AI due to a mismatch between their operational scale and modern machine learning's data requirements. This paper introduces SmallML, a Bayesian transfer learning framework achieving enterprise-level prediction accuracy with datasets as small as 50-200 observations.
We develop a three-layer architecture integrating transfer learning, hierarchical Bayesian modeling, and conformal prediction. Layer 1 extracts informative priors from 22,673 public records using a SHAP-based procedure transferring knowledge from gradient boosting to logistic regression. Layer 2 implements hierarchical pooling across J=5-50 SMEs with adaptive shrinkage, balancing population patterns with entity-specific characteristics. Layer 3 provides conformal sets with finite-sample coverage guarantees P(y in C(x)) >= 1-alpha for distribution-free uncertainty quantification.
Validation on customer churn data demonstrates 96.7% +/- 4.2% AUC with 100 observations per business -- a +24.2 point improvement over independent logistic regression (72.5% +/- 8.1%), with p < 0.000001. Conformal prediction achieves 92% empirical coverage at 90% target. Training completes in 33 minutes on standard CPU hardware. By enabling enterprise-grade predictions for 33 million U.S. SMEs previously excluded from machine learning, SmallML addresses a critical gap in AI democratization.
Keywords: Bayesian transfer learning, hierarchical models, conformal prediction, small-data analytics, SME machine learning
๐
2025-11-18
๐ ArXiv Preprint (math.OC)
๐ฅ Zifan Wang, Georgios Pantazis, Sergio Grammatico, Michael M. Zavlanos, Karl H. Johansson
ๆฅ็ๆ่ฆ
We study a class of distributionally robust games where agents are allowed to heterogeneously choose their risk aversion with respect to distributional shifts of the uncertainty. In our formulation, heterogeneous Wasserstein ball constraints on each distribution are enforced through a penalty function leveraging a Lagrangian formulation. We then formulate the distributionally robust Nash equilibrium problem and show that under certain assumptions it is equivalent to a finite-dimensional variational inequality problem with a strongly monotone mapping. We then design an approximate Nash equilibrium seeking algorithm and prove convergence of the average regret to a quantity that diminishes with the number of iterations, thus learning the desired equilibrium up to an a priori specified accuracy. Numerical simulations corroborate our theoretical findings.
๐
2025-11-18
๐ ArXiv Preprint (astro-ph.IM)
๐ฅ Cutter Beck, Evan Smith, Khagendra Katuwal, Rudra Kafle, Jacob Whitehill
Machine Learning
ๆฅ็ๆ่ฆ
Coronal holes (CHs) are low-activity, low-density solar coronal regions with open magnetic field lines (Cranmer 2009). In the extreme ultraviolet (EUV) spectrum, CHs appear as dark patches. Using daily hand-drawn maps from the Space Weather Prediction Center (SWPC), we developed a semi-automated pipeline to digitize the SWPC maps into binary segmentation masks. The resulting masks constitute the CHASM-SWPC dataset, a high-quality dataset to train and test automated CH detection models, which is released with this paper. We developed CHASM (Coronal Hole Annotation using Semi-automatic Methods), a software tool for semi-automatic annotation that enables users to rapidly and accurately annotate SWPC maps. The CHASM tool enabled us to annotate 1,111 CH masks, comprising the CHASM-SWPC-1111 dataset. We then trained multiple CHRONNOS (Coronal Hole RecOgnition Neural Network Over multi-Spectral-data) architecture (Jarolim et al. 2021) neural networks using the CHASM-SWPC dataset and compared their performance. Training the CHRONNOS neural network on these data achieved an accuracy of 0.9805, a True Skill Statistic (TSS) of 0.6807, and an intersection-over-union (IoU) of 0.5668, which is higher than the original pretrained CHRONNOS model Jarolim et al. (2021) achieved an accuracy of 0.9708, a TSS of 0.6749, and an IoU of 0.4805, when evaluated on the CHASM-SWPC-1111 test set.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Mara Daniels, Philippe Rigollet
Machine Learning
ๆฅ็ๆ่ฆ
We introduce a highly expressive class of function approximators called Splat Regression Models. Model outputs are mixtures of heterogeneous and anisotropic bump functions, termed splats, each weighted by an output vector. The power of splat modeling lies in its ability to locally adjust the scale and direction of each splat, achieving both high interpretability and accuracy. Fitting splat models reduces to optimization over the space of mixing measures, which can be implemented using Wasserstein-Fisher-Rao gradient flows. As a byproduct, we recover the popular Gaussian Splatting methodology as a special case, providing a unified theoretical framework for this state-of-the-art technique that clearly disambiguates the inverse problem, the model, and the optimization algorithm. Through numerical experiments, we demonstrate that the resulting models and algorithms constitute a flexible and promising approach for solving diverse approximation, estimation, and inverse problems involving low-dimensional data.
๐
2025-11-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Dawn Lawrie, James Mayfield, Eugene Yang, Andrew Yates, Sean MacAvaney et al.
ๆฅ็ๆ่ฆ
To measure advances in retrieval, test collections with relevance judgments that can faithfully distinguish systems are required. This paper presents NeuCLIRBench, an evaluation collection for cross-language and multilingual retrieval. The collection consists of documents written natively in Chinese, Persian, and Russian, as well as those same documents machine translated into English. The collection supports several retrieval scenarios including: monolingual retrieval in English, Chinese, Persian, or Russian; cross-language retrieval with English as the query language and one of the other three languages as the document language; and multilingual retrieval, again with English as the query language and relevant documents in all three languages. NeuCLIRBench combines the TREC NeuCLIR track topics of 2022, 2023, and 2024. The 250,128 judgments across approximately 150 queries for the monolingual and cross-language tasks and 100 queries for multilingual retrieval provide strong statistical discriminatory power to distinguish retrieval approaches. A fusion baseline of strong neural retrieval systems is included with the collection so that developers of reranking algorithms are no longer reliant on BM25 as their first-stage retriever. NeuCLIRBench is publicly available.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Alex Shtoff et al.
ๆฅ็ๆ่ฆ
With Retrieval Augmented Generation (RAG) becoming more and more prominent in generative AI solutions, there is an emerging need for systematically evaluating their effectiveness. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to competitors' responses. Our analysis highlights the benchmark's questions diversity, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. The LiveRAG benchmark will hopefully help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.
๐
2025-11-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Daniรซl Wilten, Gideon Maillette de Buy Wenniger, Arjen Hommersom, Paul Lucassen, Emiel Poortman
Robotics
ๆฅ็ๆ่ฆ
Using multiple carousels, lists that wrap around and can be scrolled, is the basis for offering content in most contemporary movie streaming platforms. Carousels allow for highlighting different aspects of users' taste, that fall in categories such as genres and authors. However, while carousels offer structure and greater ease of navigation, they alone do not increase diversity in recommendations, while this is essential to keep users engaged. In this work we propose several approaches to effectively increase item diversity within the domain of book recommendations, on top of a collaborative filtering algorithm. These approaches are intended to improve book recommendations in the web catalogs of public libraries. Furthermore, we introduce metrics to evaluate the resulting strategies, and show that the proposed system finds a suitable balance between accuracy and beyond-accuracy aspects.
๐
2025-11-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Dun Zhang, Ziyang Zeng, Yudong Zhou, Shuyang Lu
Machine Learning
ๆฅ็ๆ่ฆ
This technical report presents the training methodology and evaluation results of the open-source Jasper-Token-Compression-600M model, released in November 2025. Building on previous distillation-based recipes from the English Stella and Jasper models, we successfully extend this approach to a bilingual (English and Chinese) domain, further enhancing model performance through the incorporation of contrastive learning. A key innovation of our model is the introduction of a one-dimensional convolution-based token compression module. We dynamically adjust the compression rate during training, enabling the model to learn more robust and efficient compressed text representations. By combining knowledge distillation with token compression techniques, we achieve significant improvements in both embedding quality and inference efficiency. Our model performs with higher efficiency than a traditional 0.6B model while achieving performance comparable to that of an 8B model. For more information on the model release, visit: https://huggingface.co/infgrad/Jasper-Token-Compression-600M.
๐
2025-11-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Moyu Zhang, Yujun Jin, Yun Chen, Jinxin Hu, Yu Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Generative models are increasingly being explored in click-through rate (CTR) prediction field to overcome the limitations of the conventional discriminative paradigm, which rely on a simple binary classification objective. However, existing generative models typically confine the generative paradigm to the training phase, primarily for representation learning. During online inference, they revert to a standard discriminative paradigm, failing to leverage their powerful generative capabilities to further improve prediction accuracy. This fundamental asymmetry between the training and inference phases prevents the generative paradigm from realizing its full potential. To address this limitation, we propose the Symmetric Masked Generative Paradigm for CTR prediction (SGCTR), a novel framework that establishes symmetry between the training and inference phases. Specifically, after acquiring generative capabilities by learning feature dependencies during training, SGCTR applies the generative capabilities during online inference to iteratively redefine the features of input samples, which mitigates the impact of noisy features and enhances prediction accuracy. Extensive experiments validate the superiority of SGCTR, demonstrating that applying the generative paradigm symmetrically across both training and inference significantly unlocks its power in CTR prediction.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Yu Liu, Xixun Lin, Yanmin Shang, Yangxi Li, Shi Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Knowledge graph reasoning (KGR) is the task of inferring new knowledge by performing logical deductions on knowledge graphs. Recently, large language models (LLMs) have demonstrated remarkable performance in complex reasoning tasks. Despite promising success, current LLM-based KGR methods still face two critical limitations. First, existing methods often extract reasoning paths indiscriminately, without assessing their different importance, which may introduce irrelevant noise that misleads LLMs. Second, while many methods leverage LLMs to dynamically explore potential reasoning paths, they require high retrieval demands and frequent LLM calls. To address these limitations, we propose PathMind, a novel framework designed to enhance faithful and interpretable reasoning by selectively guiding LLMs with important reasoning paths. Specifically, PathMind follows a "Retrieve-Prioritize-Reason" paradigm. First, it retrieves a query subgraph from KG through the retrieval module. Next, it introduces a path prioritization mechanism that identifies important reasoning paths using a semantic-aware path priority function, which simultaneously considers the accumulative cost and the estimated future cost for reaching the target. Finally, PathMind generates accurate and logically consistent responses via a dual-phase training strategy, including task-specific instruction tuning and path-wise preference alignment. Extensive experiments on benchmark datasets demonstrate that PathMind consistently outperforms competitive baselines, particularly on complex reasoning tasks with fewer input tokens, by identifying essential reasoning paths.
๐
2025-11-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Hao Jiang, Guoquan Wang, Donglin Zhou, Sheng Yu, Yang Zeng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Large Language Models (LLMs) have enhanced text-based recommendation by enriching traditional ID-based methods with semantic generalization capabilities. Text-based methods typically encode item textual information via prompt design and generate discrete semantic IDs through item tokenization. However, in domain-specific tasks such as local-life services, simply injecting location information into prompts fails to capture fine-grained spatial characteristics and real-world distance awareness among items. To address this, we propose LGSID, an LLM-Aligned Geographic Item Tokenization Framework for Local-life Recommendation. This framework consists of two key components: (1) RL-based Geographic LLM Alignment, and (2) Hierarchical Geographic Item Tokenization. In the RL-based alignment module, we initially train a list-wise reward model to capture real-world spatial relationships among items. We then introduce a novel G-DPO algorithm that uses pre-trained reward model to inject generalized spatial knowledge and collaborative signals into LLMs while preserving their semantic understanding. Furthermore, we propose a hierarchical geographic item tokenization strategy, where primary tokens are derived from discrete spatial and content attributes, and residual tokens are refined using the aligned LLM's geographic representation vectors. Extensive experiments on real-world Kuaishou industry datasets show that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further validate its effectiveness.
๐
2025-11-18
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zihuai Zhao, Yujuan Ding, Wenqi Fan, Qing Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Recommender systems play a vital role in alleviating information overload and enriching users' online experience. In the era of large language models (LLMs), LLM-based recommender systems have emerged as a prevalent paradigm for advancing personalized recommendations. Recently, retrieval-augmented generation (RAG) has drawn growing interest to facilitate the recommendation capability of LLMs, incorporating useful information retrieved from external knowledge bases. However, as a rich source of up-to-date information, the web remains under-explored by existing RAG-based recommendations. In particular, unique challenges are posed from two perspectives: one is to generate effective queries for web retrieval, considering the inherent knowledge gap between web search and recommendations; another challenge lies in harnessing online websites that contain substantial noisy content. To tackle these limitations, we propose WebRec, a novel web-based RAG framework, which takes advantage of the reasoning capability of LLMs to interpret recommendation tasks into queries of user preferences that cater to web retrieval. Moreover, given noisy web-retrieved information, where relevant pieces of evidence are scattered far apart, an insightful MP-Head is designed to enhance LLM attentions between distant tokens of relevant information via message passing. Extensive experiments have been conducted to demonstrate the effectiveness of our proposed web-based RAG methods in recommendation scenarios.
๐
2025-11-18
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Naoki Shimoda, Akihiro Yamamoto
Natural Language Processing
ๆฅ็ๆ่ฆ
In this research, we combine Transformer-based relation extraction with matching of knowledge graphs (KGs) and apply them to answering multiple-choice questions (MCQs) while maintaining the traceability of the output process. KGs are structured representations of factual knowledge consisting of entities and relations. Due to the high construction cost, they had been regarded as static databases with validated links. However, the recent development of Transformer-based relation extraction (RE) methods has enabled us to generate KGs dynamically by giving them natural language texts, and thereby opened the possibility for representing the meaning of the input sentences with the created KGs. Using this effect, we propose a method that answers MCQs in the "fill-in-the-blank" format, taking care of the point that RE methods generate KGs that represent false information if provided with factually incorrect texts. We measure the truthfulness of each question sentence by (i) converting the sentence into a relational graph using an RE method and (ii) verifying it against factually correct KGs under the closed-world assumption. The experimental results demonstrate that our method correctly answers up to around 70% of the questions, while providing traceability of the procedure. We also highlight that the question category has a vast influence on the accuracy.
๐
2025-11-18
๐ 3rd-place solution for the ACM ICAIF 2025 Agentic
๐ฅ Chun Chet Ng, Jia Yu Lim, Wei Zeng Low
Natural Language Processing
ๆฅ็ๆ่ฆ
With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task-relevant information from lengthy financial filings is essential for both operational and analytical decision-making. The FinAgentBench dataset formalizes this problem through two tasks: document ranking and chunk ranking. We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and a lightweight multi-agent system. Each component is examined extensively to reveal their synergies: prompt engineering provides precise task instructions, ICL supplies semantically relevant few-shot examples, and the multi-agent system models coordinated scoring behaviour. Our best configuration achieves an NDCG@5 of 0.71818 on the restricted validation split. We further demonstrate that PRISM is feasible and robust for production-scale financial retrieval. Its modular, inference-only design makes it practical for real-world use cases. The source code is released at https://bit.ly/prism-ailens.
๐
2025-11-18
๐ NeurIPS 2025
๐ฅ Junchen Li, Rongzheng Wang, Yihong Huang, Qizhi Chen, Jiasheng Zhang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-augmented generation (RAG) greatly enhances large language models (LLMs) performance in knowledge-intensive tasks. However, naive RAG methods struggle with multi-hop question answering due to their limited capacity to capture complex dependencies across documents. Recent studies employ graph-based RAG to capture document connections. However, these approaches often result in a loss of semantic coherence and introduce irrelevant noise during node matching and subgraph construction. To address these limitations, we propose NeuroPath, an LLM-driven semantic path tracking RAG framework inspired by the path navigational planning of place cells in neurobiology. It consists of two steps: Dynamic Path Tracking and Post-retrieval Completion. Dynamic Path Tracking performs goal-directed semantic path tracking and pruning over the constructed knowledge graph (KG), improving noise reduction and semantic coherence. Post-retrieval Completion further reinforces these benefits by conducting second-stage retrieval using intermediate reasoning and the original query to refine the query goal and complete missing information in the reasoning path. NeuroPath surpasses current state-of-the-art baselines on three multi-hop QA datasets, achieving average improvements of 16.3% on recall@2 and 13.5% on recall@5 over advanced graph-based RAG methods. Moreover, compared to existing iter-based RAG methods, NeuroPath achieves higher accuracy and reduces token consumption by 22.8%. Finally, we demonstrate the robustness of NeuroPath across four smaller LLMs (Llama3.1, GLM4, Mistral0.3, and Gemma3), and further validate its scalability across tasks of varying complexity. Code is available at https://github.com/KennyCaty/NeuroPath.
๐
2025-11-18
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley et al.
RoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $ฯ^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $ฯ^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Lai Wei, Xuanbin Peng, Ri-Zhao Qiu, Tianshu Huang, Xuxin Cheng et al.
Robotics
ๆฅ็ๆ่ฆ
Learning from real-world robot demonstrations holds promise for interacting with complex real-world environments. However, the complexity and variability of interaction dynamics often cause purely positional controllers to struggle with contacts or varying payloads. To address this, we propose a Heterogeneous Meta-Control (HMC) framework for Loco-Manipulation that adaptively stitches multiple control modalities: position, impedance, and hybrid force-position. We first introduce an interface, HMC-Controller, for blending actions from different control profiles continuously in the torque space. HMC-Controller facilitates both teleoperation and policy deployment. Then, to learn a robust force-aware policy, we propose HMC-Policy to unify different controllers into a heterogeneous architecture. We adopt a mixture-of-experts style routing to learn from large-scale position-only data and fine-grained force-aware demonstrations. Experiments on a real humanoid robot show over 50% relative improvement vs. baselines on challenging tasks such as compliant table wiping and drawer opening, demonstrating the efficacy of HMC.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Albert Lin, Alessandro Pinto, Somil Bansal
Robotics
ๆฅ็ๆ่ฆ
As perception-based controllers for autonomous systems become increasingly popular in the real world, it is important that we can formally verify their safety and performance despite perceptual uncertainty. Unfortunately, the verification of such systems remains challenging, largely due to the complexity of the controllers, which are often nonlinear, nonconvex, learning-based, and/or black-box. Prior works propose verification algorithms that are based on approximate reachability methods, but they often restrict the class of controllers and systems that can be handled or result in overly conservative analyses. Hamilton-Jacobi (HJ) reachability analysis is a popular formal verification tool for general nonlinear systems that can compute optimal reachable sets under worst-case system uncertainties; however, its application to perception-based systems is currently underexplored. In this work, we propose RoVer-CoRe, a framework for the Robust Verification of Controllers via HJ Reachability. To the best of our knowledge, RoVer-CoRe is the first HJ reachability-based framework for the verification of perception-based systems under perceptual uncertainty. Our key insight is to concatenate the system controller, observation function, and the state estimation modules to obtain an equivalent closed-loop system that is readily compatible with existing reachability frameworks. Within RoVer-CoRe, we propose novel methods for formal safety verification and robust controller design. We demonstrate the efficacy of the framework in case studies involving aircraft taxiing and NN-based rover navigation. Code is available at the link in the footnote.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distilled a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Vision--language--action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding to it a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating significant VLA model-reliability gains through simple yet effective reward models. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Qingwei Ben, Botian Xu, Kailin Li, Feiyu Jia, Wentao Zhang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure. This paper presents Gallant, a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. It leverages voxelized LiDAR data as a lightweight and structured perceptual representation, and employs a z-grouped 2D CNN to map this representation to the control policy, enabling fully end-to-end optimization. A high-fidelity LiDAR simulation that dynamically generates realistic observations is developed to support scalable, LiDAR-based training and ensure sim-to-real consistency. Experimental results show that Gallant's broader perceptual coverage facilitates the use of a single policy that goes beyond the limitations of previous methods confined to ground-level obstacles, extending to lateral clutter, overhead constraints, multi-level structures, and narrow passages. Gallant also firstly achieves near 100% success rates in challenging scenarios such as stair climbing and stepping onto elevated platforms through improved end-to-end optimization.
๐
2025-11-18
๐ ArXiv Preprint (cond-mat.soft)
๐ฅ Giulia Janzen, Gaia Maselli, Juan F. Jimenez, Lia Garcia-Perez, D A Matoz Fernandez et al.
ๆฅ็ๆ่ฆ
Robophysics investigates the physical principles that govern living-like robots operating in complex, realworld environments. Despite remarkable technological advances, robots continue to face fundamental efficiency limitations. At the level of individual units, locomotion remains a challenge, while at the collective level, robot swarms struggle to achieve shared purpose, coordination, communication, and cost efficiency. This perspective article examines the key challenges faced by bio-inspired robotic collectives and highlights recent research efforts that incorporate principles from active-matter physics and biology into the modeling and design of robot swarms.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Xianhui Meng, Yuchen Zhang, Zhijian Huang, Zheng Lu, Ziling Ji et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) show great promise for autonomous driving, but their suitability for safety-critical scenarios is largely unexplored, raising safety concerns. This issue arises from the lack of comprehensive benchmarks that assess both external environmental risks and in-cabin driving behavior safety simultaneously. To bridge this critical gap, we introduce DSBench, the first comprehensive Driving Safety Benchmark designed to assess a VLM's awareness of various safety risks in a unified manner. DSBench encompasses two major categories: external environmental risks and in-cabin driving behavior safety, divided into 10 key categories and a total of 28 sub-categories. This comprehensive evaluation covers a wide range of scenarios, ensuring a thorough assessment of VLMs' performance in safety-critical contexts. Extensive evaluations across various mainstream open-source and closed-source VLMs reveal significant performance degradation under complex safety-critical situations, highlighting urgent safety concerns. To address this, we constructed a large dataset of 98K instances focused on in-cabin and external safety scenarios, showing that fine-tuning on this dataset significantly enhances the safety performance of existing VLMs and paves the way for advancing autonomous driving technology. The benchmark toolkit, code, and model checkpoints will be publicly accessible.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Minyoung Hwang, Alexandra Forsey-Smerek, Nathaniel Dennler, Andreea Bobu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details. Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample-efficiency, generalization, and robustness to ambiguous language. Project page: https://MIT-CLEAR-Lab.github.io/Masked-IRL and Code: https://github.com/MIT-CLEAR-Lab/Masked-IRL
๐
2025-11-18
๐ 7 pages, Presented at IEEE International Symposium
๐ฅ Jan Quenzel, Valerij Sekin, Daniel Schleich, Alexander Miller, Merlin Stampa et al.
Robotics
ๆฅ็ๆ่ฆ
Fires in industrial facilities pose special challenges to firefighters, e.g., due to the sheer size and scale of the buildings. The resulting visual obstructions impair firefighting accuracy, further compounded by inaccurate assessments of the fire's location. Such imprecision simultaneously increases the overall damage and prolongs the fire-brigades operation unnecessarily.
We propose an automated assistance system for firefighting using a motorized fire monitor on a turntable ladder with aerial support from an unmanned aerial vehicle (UAV). The UAV flies autonomously within an obstacle-free flight funnel derived from geodata, detecting and localizing heat sources. An operator supervises the operation on a handheld controller and selects a fire target in reach. After the selection, the UAV automatically plans and traverses between two triangulation poses for continued fire localization. Simultaneously, our system steers the fire monitor to ensure the water jet reaches the detected heat source. In preliminary tests, our assistance system successfully localized multiple heat sources and directed a water jet towards the fires.
๐
2025-11-18
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jack Qin, Zhitao Wang, Yinan Zheng, Keyu Chen, Yang Zhou et al.
RoboticsMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The autonomous driving (AD) system has exhibited remarkable performance in complex driving scenarios. However, generalization is still a key limitation for the current system, which refers to the ability to handle unseen scenarios or unfamiliar sensor configurations.Related works have explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system, where two distinct systems are used to plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions from VLM directly. However, these end-to-end solutions demonstrate prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird's-Eye-View (BEV) features, yielding interpretable risk-attention maps.This approach allows BEV features to learn richer and more nuanced risk attention representations, which directly enhance the model's ability to handle spatial boundaries and risky objects.By focusing on risk attention, RSD aligns better with human-like driving behavior, which is essential to navigate in complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Due to the enhanced BEV representations enabled by RSD, we observed a significant improvement in both perception and planning capabilities.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Michelle Mattille, Alexandre Mesot, Miriam Weisskopf, Nicole Ochsenbein-Kรถlble, Ueli Moehrlen et al.
Robotics
ๆฅ็ๆ่ฆ
Flexible robots hold great promise for enhancing minimally invasive surgery (MIS) by providing superior dexterity, precise control, and safe tissue interaction. Yet, translating these advantages into endoscopic interventions within open cavities remains challenging. The lack of anatomical constraints and the inherent flexibility of such devices complicate their control, while the limited field of view of endoscopes restricts situational awareness. We present a robotic platform designed to overcome these challenges and demonstrate its potential in fetoscopic laser coagulation, a complex MIS procedure typically performed only by highly experienced surgeons. Our system combines a magnetically actuated flexible endoscope with teleoperated and semi-autonomous navigation capabilities for performing targeted laser ablations. To enhance surgical awareness, the platform reconstructs real-time mosaics of the endoscopic scene, providing an extended and continuous visual context. The ability of this system to address the key limitations of MIS in open spaces is validated in vivo in an ovine model.
๐
2025-11-18
๐ ArXiv Preprint (cs.LO)
๐ฅ Mahdi Etumi, Hazel M. Taylor, Marie Farrell
Robotics
ๆฅ็ๆ่ฆ
In the development of safety and mission-critical systems, including autonomous space robotic missions, complex behaviour is captured during the requirements elicitation phase. Requirements are typically expressed using natural language which is ambiguous and not amenable to formal verification methods that can provide robust guarantees of system behaviour. To support the definition of formal requirements, specification patterns provide reusable, logic-based templates. A suite of robotic specification patterns, along with their formalisation in NASA's Formal Requirements Elicitation Tool (FRET) already exists. These pre-existing requirement patterns are domain agnostic and, in this paper we explore their applicability for space missions. To achieve this we carried out a literature review of existing space missions and formalised their requirements using FRET, contributing a corpus of space mission requirements. We categorised these requirements using pre-existing specification patterns which demonstrated their applicability in space missions. However, not all of the requirements that we formalised corresponded to an existing pattern so we have contributed 5 new requirement specification patterns as well as several variants of the existing and new patterns. We also conducted an expert evaluation of the new patterns, highlighting their benefits and limitations.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Marlow Fawn, Matthias Scheutz
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We propose a method for combining Harmonic Control Lyapunov-Barrier Functions (HCLBFs) derived from Signal Temporal Logic (STL) specifications with any given robot policy to turn an unsafe policy into a safe one with formal guarantees. The two components are combined via HCLBF-derived safety certificates, thus producing commands that preserve both safety and task-driven behavior. We demonstrate with a simple proof-of-concept implementation for an object-centric force-based policy trained through reinforcement learning for a movement task of a stationary robot arm that is able to avoid colliding with obstacles on a table top after combining the policy with the safety constraints. The proposed method can be generalized to more complex specifications and dynamic task settings.
๐
2025-11-18
๐ ArXiv Preprint (cs.LO)
๐ฅ Diana C. Benjumea, Marie Farrell, Louise A. Dennis
Robotics
ๆฅ็ๆ่ฆ
Deploying autonomous robots in safety-critical domains requires architectures that ensure operational effectiveness and safety compliance. In this paper, we contribute the Safe-ROS architecture for developing reliable and verifiable autonomous robots in such domains. It features two distinct subsystems: (1) an intelligent control system that is responsible for normal/routine operations, and (2) a Safety System consisting of Safety Instrumented Functions (SIFs) that provide formally verifiable independent oversight. We demonstrate Safe-ROS on an AgileX Scout Mini robot performing autonomous inspection in a nuclear environment. One safety requirement is selected and instantiated as a SIF. To support verification, we implement the SIF as a cognitive agent, programmed to stop the robot whenever it detects that it is too close to an obstacle. We verify that the agent meets the safety requirement and integrate it into the autonomous inspection. This integration is also verified, and the full deployment is validated in a Gazebo simulation, and lab testing. We evaluate this architecture in the context of the UK nuclear sector, where safety and regulation are crucial aspects of deployment. Success criteria include the development of a formal property from the safety requirement, implementation, and verification of the SIF, and the integration of the SIF into the operational robotic autonomous system. Our results demonstrate that the Safe-ROS architecture can provide safety verifiable oversight while deploying autonomous robots in safety-critical domains, offering a robust framework that can be extended to additional requirements and various applications.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Marcela Gonรงalves dos Santos, Sylvain Hallรฉ, Fรกbio Petrillo
Robotics
ๆฅ็ๆ่ฆ
Industrial robotic systems (IRS) are increasingly deployed in diverse environments, where failures can result in severe accidents and costly downtime. Ensuring the reliability of the software controlling these systems is therefore critical. Mutation testing, a technique widely used in software engineering, evaluates the effectiveness of test suites by introducing small faults, or mutants, into the code. However, traditional mutation operators are poorly suited to robotic programs, which involve message-based commands and interactions with the physical world. This paper explores the adaptation of mutation testing to IRS by defining domain-specific mutation operators that capture the semantics of robot actions and sensor readings. We propose a methodology for generating meaningful mutants at the level of high-level read and write operations, including movement, gripper actions, and sensor noise injection. An empirical study on a pick-and-place scenario demonstrates that our approach produces more informative mutants and reduces the number of invalid or equivalent cases compared to conventional operators. Results highlight the potential of mutation testing to enhance test suite quality and contribute to safer, more reliable industrial robotic systems.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control.
๐
2025-11-18
๐ AAAI 2026
๐ฅ Xiuxiu Qi, Yu Yang, Jiannong Cao, Luyao Bai, Chongshan Fan et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations by a bidirectional cross-attention to learn contextual information for action generation, successfully overcoming the problem of semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Svetlana Seliunina, Daniel Schleich, Sven Behnke
Robotics
ๆฅ็ๆ่ฆ
In our work, we extend the current state-of-the-art approach for autonomous multi-UAV exploration to consumer-level UAVs, such as the DJI Mini 3 Pro. We propose a pipeline that selects viewpoint pairs from which the depth can be estimated and plans the trajectory that satisfies motion constraints necessary for odometry estimation. For the multi-UAV exploration, we propose a semi-distributed communication scheme that distributes the workload in a balanced manner. We evaluate our model performance in simulation for different numbers of UAVs and prove its ability to safely explore the environment and reconstruct the map even with the hardware limitations of consumer-grade UAVs.
๐
2025-11-18
๐ ArXiv Preprint (eess.SY)
๐ฅ Kai Ren, Maryam Kamgarpour
ๆฅ็ๆ่ฆ
We address cost identification in a finite-horizon linear quadratic Gaussian game. We characterize the set of cost parameters that generate a given Nash equilibrium policy. We propose a backpropagation algorithm to identify the time-varying cost parameters. We derive a probabilistic error bound when the cost parameters are identified from finite trajectories. We test our method in numerical and driving simulations. Our algorithm identifies the cost parameters that can reproduce the Nash equilibrium policy and trajectory observations.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Michael Milford, Tobias Fischer
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Place recognition, the ability to identify previously visited locations, is critical for both biological navigation and autonomous systems. This review synthesizes findings from robotic systems, animal studies, and human research to explore how different systems encode and recall place. We examine the computational and representational strategies employed across artificial systems, animals, and humans, highlighting convergent solutions such as topological mapping, cue integration, and memory management. Animal systems reveal evolved mechanisms for multimodal navigation and environmental adaptation, while human studies provide unique insights into semantic place concepts, cultural influences, and introspective capabilities. Artificial systems showcase scalable architectures and data-driven models. We propose a unifying set of concepts by which to consider and develop place recognition mechanisms and identify key challenges such as generalization, robustness, and environmental variability. This review aims to foster innovations in artificial localization by connecting future developments in artificial place recognition systems to insights from both animal navigation research and human spatial cognition studies.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Jeryes Danial, Yosi Ben Asher, Itzik Klein
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Monocular simultaneous localization and mapping (SLAM) algorithms estimate drone poses and build a 3D map using a single camera. Current algorithms include sparse methods that lack detailed geometry, while learning-driven approaches produce dense maps but are computationally intensive. Monocular SLAM also faces scale ambiguities, which affect its accuracy. To address these challenges, we propose an edge-aware lightweight monocular SLAM system combining sparse keypoint-based pose estimation with dense edge reconstruction. Our method employs deep learning-based depth prediction and edge detection, followed by optimization to refine keypoints and edges for geometric consistency, without relying on global loop closure or heavy neural computations. We fuse inertial data with vision by using an extended Kalman filter to resolve scale ambiguity and improve accuracy. The system operates in real time on low-power platforms, as demonstrated on a DJI Tello drone with a monocular camera and inertial sensors. In addition, we demonstrate robust autonomous navigation and obstacle avoidance in indoor corridors and on the TUM RGBD dataset. Our approach offers an effective, practical solution to real-time mapping and navigation in resource-constrained environments.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Yizhen Yin, Yuhua Qi, Dapeng Feng, Hongbo Chen, Hongjun Ma et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Active Simultaneous Localization and Mapping (Active SLAM) involves the strategic planning and precise control of a robotic system's movement in order to construct a highly accurate and comprehensive representation of its surrounding environment, which has garnered significant attention within the research community. While the current methods demonstrate efficacy in small and controlled settings, they face challenges when applied to large-scale and diverse environments, marked by extended periods of exploration and suboptimal paths of discovery. In this paper, we propose MA-SLAM, a Map-Aware Active SLAM system based on Deep Reinforcement Learning (DRL), designed to address the challenge of efficient exploration in large-scale environments. In pursuit of this objective, we put forward a novel structured map representation. By discretizing the spatial data and integrating the boundary points and the historical trajectory, the structured map succinctly and effectively encapsulates the visited regions, thereby serving as input for the deep reinforcement learning based decision module. Instead of sequentially predicting the next action step within the decision module, we have implemented an advanced global planner to optimize the exploration path by leveraging long-range target points. We conducted experiments in three simulation environments and deployed in a real unmanned ground vehicle (UGV), the results demonstrate that our approach significantly reduces both the duration and distance of exploration compared with state-of-the-art methods.
๐
2025-11-18
๐ 36 pages, 10 figures, submitted and under-review i
๐ฅ Felipe Ballen-Moreno, Pasquale Ferrentino, Milan Amighi, Bram Vanderborght, Tom Verstraten
Robotics
ๆฅ็ๆ่ฆ
Understanding the physical interaction with wearable robots is essential to ensure safety and comfort. However, this interaction is complex in two key aspects: (1) the motion involved, and (2) the non-linear behaviour of soft tissues. Multiple approaches have been undertaken to better understand this interaction and to improve the quantitative metrics of physical interfaces or cuffs. As these two topics are closely interrelated, finite modelling and soft tissue characterisation offer valuable insights into pressure distribution and shear stress induced by the cuff. Nevertheless, current characterisation methods typically rely on a single fitting variable along one degree of freedom, which limits their applicability, given that interactions with wearable robots often involve multiple degrees of freedom. To address this limitation, this work introduces a dual-variable characterisation method, involving normal and tangential forces, aimed at identifying reliable material parameters and evaluating the impact of single-variable fitting on force and torque responses. This method demonstrates the importance of incorporating two variables into the characterisation process by analysing the normalized mean square error (NMSE) across different scenarios and material models, providing a foundation for simulation at the closest possible level, with a focus on the cuff and the human limb involved in the physical interaction between the user and the wearable robot.
๐
2025-11-18
๐ ArXiv Preprint (eess.SY)
๐ฅ Lukas Schroth, Daniel Morton, Amon Lahr, Daniele Gammelli, Andrea Carron et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Model Predictive Control (MPC) has established itself as the primary methodology for constrained control, enabling autonomy across diverse applications. While model fidelity is crucial in MPC, solving the corresponding optimization problem in real time remains challenging when combining long horizons with high-fidelity models that capture both short-term dynamics and long-term behavior. Motivated by results on the Exponential Decay of Sensitivities (EDS), which imply that, under certain conditions, the influence of modeling inaccuracies decreases exponentially along the prediction horizon, this paper proposes a multi-timescale MPC scheme for fast-sampled control. Tailored to systems with both fast and slow dynamics, the proposed approach improves computational efficiency by i) switching to a reduced model that captures only the slow, dominant dynamics and ii) exponentially increasing integration step sizes to progressively reduce model detail along the horizon. We evaluate the method on three practically motivated robotic control problems in simulation and observe speed-ups of up to an order of magnitude.
๐
2025-11-18
๐ 9 pages, 8 figures, submitted to IEEE RA-L
๐ฅ Zhuo Li, Junjia Liu, Zhipeng Dong, Tao Teng, Quentin Rouxel et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have demonstrated significant potential in real-world robotic manipulation. However, pre-trained VLA policies still suffer from substantial performance degradation during downstream deployment. Although fine-tuning can mitigate this issue, its reliance on costly demonstration collection and intensive computation makes it impractical in real-world settings. In this work, we introduce VLA-Pilot, a plug-and-play inference-time policy steering method for zero-shot deployment of pre-trained VLA without any additional fine-tuning or data collection. We evaluate VLA-Pilot on six real-world downstream manipulation tasks across two distinct robotic embodiments, encompassing both in-distribution and out-of-distribution scenarios. Experimental results demonstrate that VLA-Pilot substantially boosts the success rates of off-the-shelf pre-trained VLA policies, enabling robust zero-shot generalization to diverse tasks and embodiments. Experimental videos and code are available at: https://rip4kobe.github.io/vla-pilot/.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaoquan Sun, Ruijian Zhang, Kang Pang, Bingchen Miao, Yuxiang Tan et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Household tidying is an important application area, yet current benchmarks neither model user preferences nor support mobility, and they generalize poorly, making it hard to comprehensively assess integrated language-to-action capabilities. To address this, we propose RoboTidy, a unified benchmark for language-guided household tidying that supports Vision-Language-Action (VLA) and Vision-Language-Navigation (VLN) training and evaluation. RoboTidy provides 500 photorealistic 3D Gaussian Splatting (3DGS) household scenes (covering 500 objects and containers) with collisions, formulates tidying as an "Action (Object, Container)" list, and supplies 6.4k high-quality manipulation demonstration trajectories and 1.5k naviagtion trajectories to support both few-shot and large-scale training. We also deploy RoboTidy in the real world for object tidying, establishing an end-to-end benchmark for household tidying. RoboTidy offers a scalable platform and bridges a key gap in embodied AI by enabling holistic and realistic evaluation of language-guided robots.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, Biqing Qi
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous FM (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure. In this work, we propose asynchronous flow matching VLA (AsyncVLA), a novel framework that introduces temporal flexibility in asynchronous FM (AFM) and enables self-correction in action generation. AsyncVLA breaks from the vanilla SFM in VLA models by generating the action tokens in a non-uniform time schedule with action context awareness. Besides, our method introduces the confidence rater to extract confidence of the initially generated actions, enabling the model to selectively refine inaccurate action tokens before execution. Moreover, we propose a unified training procedure for SFM and AFM that endows a single model with both modes, improving KV-cache utilization. Extensive experiments on robotic manipulation benchmarks demonstrate that AsyncVLA is data-efficient and exhibits self-correction ability. AsyncVLA achieves state-of-the-art results across general embodied evaluations due to its asynchronous generation in AFM. Our code is available at https://github.com/YuhuaJiang2002/AsyncVLA.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Junhao Gong, Shoujie Li, Kit-Wa Sou, Changqing Guo, Hourong Huang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Conventional suction cups lack sensing capabilities for contact-aware manipulation in unstructured environments. This paper presents FlexiCup, a fully wireless multimodal suction cup that integrates dual-zone vision-tactile sensing. The central zone dynamically switches between vision and tactile modalities via illumination control for contact detection, while the peripheral zone provides continuous spatial awareness for approach planning. FlexiCup supports both vacuum and Bernoulli suction modes through modular mechanical configurations, achieving complete wireless autonomy with onboard computation and power. We validate hardware versatility through dual control paradigms. Modular perception-driven grasping across structured surfaces with varying obstacle densities demonstrates comparable performance between vacuum (90.0% mean success) and Bernoulli (86.7% mean success) modes. Diffusion-based end-to-end learning achieves 73.3% success on inclined transport and 66.7% on orange extraction tasks. Ablation studies confirm that multi-head attention coordinating dual-zone observations provides 13% improvements for contact-aware manipulation. Hardware designs and firmware are available at https://anonymous.4open.science/api/repo/FlexiCup-DA7D/file/index.html?v=8f531b44.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Hesam Mojtahedi, Reza Akhavian
Robotics
ๆฅ็ๆ่ฆ
This paper presents a BIM-discrepancy-driven active sensing framework for cooperative navigation between unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) in dynamic construction environments. Traditional navigation approaches rely on static Building Information Modeling (BIM) priors or limited onboard perception. In contrast, our framework continuously fuses real-time LiDAR data from aerial and ground robots with BIM priors to maintain an evolving 2D occupancy map. We quantify navigation safety through a unified corridor-risk metric integrating occupancy uncertainty, BIM-map discrepancy, and clearance. When risk exceeds safety thresholds, the UAV autonomously re-scans affected regions to reduce uncertainty and enable safe replanning. Validation in PX4-Gazebo simulation with Robotec GPU LiDAR demonstrates that risk-triggered re-scanning reduces mean corridor risk by 58% and map entropy by 43% compared to static BIM navigation, while maintaining clearance margins above 0.4 m. Compared to frontier-based exploration, our approach achieves similar uncertainty reduction in half the mission time. These results demonstrate that integrating BIM priors with risk-adaptive aerial sensing enables scalable, uncertainty-aware autonomy for construction robotics.
๐
2025-11-18
๐ ArXiv Preprint (Robotics)
๐ฅ Jaskirat Singh, Rohan Chandra
ๆฅ็ๆ่ฆ
Multi-robot systems are increasingly being used for critical applications such as rescuing injured people, delivering food and medicines, and monitoring key areas. These applications usually involve navigating at high speeds through constrained spaces such as small gaps. Navigating such constrained spaces becomes particularly challenging when the space is crowded with multiple heterogeneous agents all of which have urgent priorities. What makes the problem even harder is that during an active response situation, roles and priorities can quickly change on a dime without informing the other agents. In order to complete missions in such environments, robots must not only be safe, but also agile, able to dodge and change course at a moment's notice. In this paper, we propose FACA, a fair and agile collision avoidance approach where robots coordinate their tasks by talking to each other via natural language (just as people do). In FACA, robots balance safety with agility via a novel artificial potential field algorithm that creates an automatic ``roundabout'' effect whenever a conflict arises. Our experiments show that FACA achieves a improvement in efficiency, completing missions more than 3.5X faster than baselines with a time reduction of over 70% while maintaining robust safety margins.
๐
2025-11-18
๐ ICRA
๐ฅ Taijing Chen, Sateesh Kumar, Junhong Xu, George Pavlakos, J oydeep Biswas et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Service robots must retrieve objects in dynamic, open-world settings where requests may reference attributes ("the red mug"), spatial context ("the mug on the table"), or past states ("the mug that was here yesterday"). Existing approaches capture only parts of this problem: scene graphs capture spatial relations but ignore temporal grounding, temporal reasoning methods model dynamics but do not support embodied interaction, and dynamic scene graphs handle both but remain closed-world with fixed vocabularies. We present STAR (SpatioTemporal Active Retrieval), a framework that unifies memory queries and embodied actions within a single decision loop. STAR leverages non-parametric long-term memory and a working memory to support efficient recall, and uses a vision-language model to select either temporal or spatial actions at each step. We introduce STARBench, a benchmark of spatiotemporal object search tasks across simulated and real environments. Experiments in STARBench and on a Tiago robot show that STAR consistently outperforms scene-graph and memory-only baselines, demonstrating the benefits of treating search in time and search in space as a unified problem.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhi Kou, Xiang-Rong Sheng, Shuguang Han, Zhishan Zhao, Yueyao Cheng et al.
Machine Learning
ๆฅ็ๆ่ฆ
In industrial recommendation systems, pre-ranking models based on deep neural networks (DNNs) commonly adopt a sequential execution framework: feature fetching and model forward computation are triggered only after receiving candidates from the upstream retrieval stage. This design introduces inherent bottlenecks, including redundant computations of identical users/items and increased latency due to strictly sequential operations, which jointly constrain the model's capacity and system efficiency. To address these limitations, we propose the Asynchronous Inference Framework (AIF), a cost-effective computational architecture that decouples interaction-independent components, those operating within a single user or item, from real-time prediction. AIF reorganizes the model inference process by performing user-side computations in parallel with the retrieval stage and conducting item-side computations in a nearline manner. This means that interaction-independent components are calculated just once and completed before the real-time prediction phase of the pre-ranking stage. As a result, AIF enhances computational efficiency and reduces latency, freeing up resources to significantly improve the feature set and model architecture of interaction-independent components. Moreover, we delve into model design within the AIF framework, employing approximated methods for interaction-dependent components in online real-time predictions. By co-designing both the framework and the model, our solution achieves notable performance gains without significantly increasing computational and latency costs. This has enabled the successful deployment of AIF in the Taobao display advertising system.
๐
2025-11-17
๐ AAAI
๐ฅ Desheng Hu, Joachim Baumann, Aleksandra Urman, Elsa Lichtenegger, Robin Forsberg et al.
Robotics
ๆฅ็ๆ่ฆ
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Taiyi Su, Jian Zhu, Yaxuan Li, Chong Ma, Zitai Huang et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Embodied world models aim to predict and interact with the physical world through visual observations and actions. However, existing models struggle to accurately translate low-level actions (e.g., joint positions) into precise robotic movements in predicted frames, leading to inconsistencies with real-world physical interactions. To address these limitations, we propose MTV-World, an embodied world model that introduces Multi-view Trajectory-Video control for precise visuomotor prediction. Specifically, instead of directly using low-level actions for control, we employ trajectory videos obtained through camera intrinsic and extrinsic parameters and Cartesian-space transformation as control signals. However, projecting 3D raw actions onto 2D images inevitably causes a loss of spatial information, making a single view insufficient for accurate interaction modeling. To overcome this, we introduce a multi-view framework that compensates for spatial information loss and ensures high-consistency with physical world. MTV-World forecasts future frames based on multi-view trajectory videos as input and conditioning on an initial frame per view. Furthermore, to systematically evaluate both robotic motion precision and object interaction accuracy, we develop an auto-evaluation pipeline leveraging multimodal large models and referring video object segmentation models. To measure spatial consistency, we formulate it as an object location matching problem and adopt the Jaccard Index as the evaluation metric. Extensive experiments demonstrate that MTV-World achieves precise control execution and accurate physical interaction modeling in complex dual-arm scenarios.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yilun Zhu, Nikhita Vedula, Shervin Malmasi
ๆฅ็ๆ่ฆ
Search queries with superlatives (e.g., best, most popular) require comparing candidates across multiple dimensions, demanding linguistic understanding and domain knowledge. We show that LLMs can uncover latent intent behind these expressions in e-commerce queries through a framework that extracts structured interpretations or hints. Our approach decomposes queries into attribute-value hints generated concurrently with retrieval, enabling efficient integration into the ranking pipeline. Our method improves search performanc eby 10.9 points in MAP and ranking by 5.9 points in MRR over baselines. Since direct LLM-based reranking faces prohibitive latency, we develop an efficient approach transferring superlative interpretations to lightweight models. Our findings provide insights into how superlative semantics can be represented and transferred between models, advancing linguistic interpretation in retrieval systems while addressing practical deployment constraints.
๐
2025-11-17
๐ ArXiv Preprint (cs.SE)
๐ฅ Jeremiah Bohr
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Language models generate functionally correct code that tends toward excessive verbosity, with elaborate documentation and defensive patterns that diverge from human baselines. Two prompting mechanisms have emerged for stylistic control: instruction based prompts that articulate abstract directives, and example based prompts that provide concrete code demonstrations. The core problem is whether stylistic constraints persist when models enhance initial implementations with additional features while maintaining high functional accuracy. Here we show that instruction-based, example-based, and combined prompts produce distinct patterns of initial control and expansion discipline over one enhancement turn. We manipulated system prompts across four conditions in a paired two-turn protocol where models first generated solutions to an intermediate Python task, then revised their code under general improvement directives, holding the user task fixed (N = 160 paired programs). Combined prompts produced the strongest initial compression and greatest expansion discipline. Instructions showed large initial effects and moderate expansion discipline. Examples showed modest initial effects with no expansion discipline. These results show that initial prompt effectiveness and expansion discipline are separate aspects of prompt design, and that combined approaches provide the most stable stylistic control in this two-turn workflow.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Matin Daghyani, Lyuyang Wang, Nima Hashemi, Bassant Medhat, Baraa Abdelsamad et al.
RoboticsComputer VisionMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Purpose: Echocardiographic interpretation requires video-level reasoning and guideline-based measurement analysis, which current deep learning models for cardiac ultrasound do not support. We present EchoAgent, a framework that enables structured, interpretable automation for this domain. Methods: EchoAgent orchestrates specialized vision tools under Large Language Model (LLM) control to perform temporal localization, spatial measurement, and clinical interpretation. A key contribution is a measurement-feasibility prediction model that determines whether anatomical structures are reliably measurable in each frame, enabling autonomous tool selection. We curated a benchmark of diverse, clinically validated video-query pairs for evaluation. Results: EchoAgent achieves accurate, interpretable results despite added complexity of spatiotemporal video analysis. Outputs are grounded in visual evidence and clinical guidelines, supporting transparency and traceability. Conclusion: This work demonstrates the feasibility of agentic, guideline-aligned reasoning for echocardiographic video analysis, enabled by task-specific tools and full video-level automation. EchoAgent sets a new direction for trustworthy AI in cardiac ultrasound.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mihir Gupte, Eshan Dixit, Muhammad Tayyab, Arun Adiththan
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The diminishing ability of large language models (LLMs) to effectively utilize long-range context-the "lost-in-the-middle" phenomenon-poses a significant challenge in retrieval-based LLM applications. To study the impact of this phenomenon in a real-world application setting, we introduce GM-Extract, a novel benchmark dataset meticulously designed to evaluate LLM performance on retrieval of control variables. To accurately diagnose failure modes, we propose a simple yet elegant evaluation system using two distinct metrics: one for spatial retrieval capability (Document Metric) and the other for semantic retrieval capability (Variable Extraction Metric). We conduct a systematic evaluation of 7-8B parameter models on two multi-document tasks (key-value extraction and question-answering), demonstrating a significant change in retrieval performance simply by altering how the data is represented in the context window. While a distinct U-shaped curve was not consistently observed, our analysis reveals a clear pattern of performance across models, which we further correlate with perplexity scores. Furthermore, we perform a literature survey of mitigation methods, which we categorize into two distinct approaches: black-box and white-box methods. We then apply these techniques to our benchmark, finding that their efficacy is highly nuanced. Our evaluation highlights scenarios where these strategies successfully improve performance, as well as surprising cases where they lead to a negative impact, providing a comprehensive understanding of their utility in a practical context.
๐
2025-11-17
๐ EMNLP 2025
๐ฅ Govardhan Padmanabhan
Natural Language Processing
ๆฅ็ๆ่ฆ
The paper presents two approaches submitted to the WMT 2025 Automated Translation Quality Evaluation Systems Task 3 - Quality Estimation (QE)-informed Segment-level Error Correction. While jointly training QE systems with Automatic Post-Editing (APE) has shown improved performance for both tasks, APE systems are still known to overcorrect the output of Machine Translation (MT), leading to a degradation in performance. We investigate a simple training-free approach - QE-informed Retranslation, and compare it with another within the same training-free paradigm. Our winning approach selects the highest-quality translation from multiple candidates generated by different LLMs. The second approach, more akin to APE, instructs an LLM to replace error substrings as specified in the provided QE explanation(s). A conditional heuristic was employed to minimise the number of edits, with the aim of maximising the Gain-to-Edit ratio. The two proposed approaches achieved a Delta COMET score of 0.0201 and -0.0108, respectively, leading the first approach to achieve the winning position on the subtask leaderboard.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Humza Nusrat, Omar Nusrat
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Agentic AI "scientists" now use language models to search the literature, run analyses, and generate hypotheses. We evaluate KOSMOS, an autonomous AI scientist, on three problems in radiation biology using simple random-gene null benchmarks. Hypothesis 1: baseline DNA damage response (DDR) capacity across cell lines predicts the p53 transcriptional response after irradiation (GSE30240). Hypothesis 2: baseline expression of OGT and CDO1 predicts the strength of repressed and induced radiation-response modules in breast cancer cells (GSE59732). Hypothesis 3: a 12-gene expression signature predicts biochemical recurrence-free survival after prostate radiotherapy plus androgen deprivation therapy (GSE116918). The DDR-p53 hypothesis was not supported: DDR score and p53 response were weakly negatively correlated (Spearman rho = -0.40, p = 0.76), indistinguishable from random five-gene scores. OGT showed only a weak association (r = 0.23, p = 0.34), whereas CDO1 was a clear outlier (r = 0.70, empirical p = 0.0039). The 12-gene signature achieved a concordance index of 0.61 (p = 0.017) but a non-unique effect size. Overall, KOSMOS produced one well-supported discovery, one plausible but uncertain result, and one false hypothesis, illustrating that AI scientists can generate useful ideas but require rigorous auditing against appropriate null models.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, Joseph K J
Computer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader's experience.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Piaohong Wang, Motong Tian, Jiaxian Li, Yuan Liang, Yuqing Wang et al.
ๆฅ็ๆ่ฆ
Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.67% on the public LoCoMo benchmark, a nearly 3% improvement upon LangMem,the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement upon A-Mem,the previous state-of-the-art. O-Mem also boosts token and interaction response time efficiency compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.
๐
2025-11-17
๐ ArXiv Preprint (cs.DL)
๐ฅ Robert Turnbull
Natural Language Processing
ๆฅ็ๆ่ฆ
Application of phylogenetic methods to textual traditions has traditionally treated all changes as equivalent even though it is widely recognized that certain types of variants were more likely to be introduced than others. While it is possible to give weights to certain changes using a maximum parsimony evaluation criterion, it is difficult to state a priori what these weights should be. Probabilistic methods, such as Bayesian phylogenetics, allow users to create categories of changes, and the transition rates for each category can be estimated as part of the analysis. This classification of types of changes in readings also allows for inspecting the probability of these categories across each branch in the resulting trees. However, classification of readings is time-consuming, as it requires categorizing each reading against every other reading at each variation unit, presenting a significant barrier to entry for this kind of analysis. This paper presents Rdgai, a software package that automates this classification task using multi-lingual large language models (LLMs). The tool allows users to easily manually classify changes in readings and then it uses these annotations in the prompt for an LLM to automatically classify the remaining reading transitions. These classifications are stored in TEI XML and ready for downstream phylogenetic analysis. This paper demonstrates the application with data an Arabic translation of the Gospels.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mihai Dan Nadas, Laura Diosan
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI's GPT-3.5, GPT-4, GPT-4o, Google's Gemini 1.0 Pro, Meta's Llama 2 and Llama 3, MistralAI's Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro's RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta's Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xinyuan Zhou, Yi Lei, Xiaoyu Zhou, Jingyi Sun, Yu Zhu et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via an three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. Key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover achieves state-of-the-art performance among similarly-sized open-source models within the "Whole-Proof Generation" paradigm. It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0\% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at: https://www.modelscope.cn/organization/iflytek, https://gitcode.com/ifly_opensource.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Daniel Weitekamp
Robotics
ๆฅ็ๆ่ฆ
Rule-based systems must solve complex matching problems within tight time constraints to be effective in real-time applications, such as planning and reactive control for AI agents, as well as low-latency relational database querying. Pattern-matching systems can encounter issues where exponential time and space are required to find matches for rules with many underconstrained variables, or which produce combinatorial intermediate partial matches (but are otherwise well-constrained). When online AI systems automatically generate rules from example-driven induction or code synthesis, they can easily produce worst-case matching patterns that slow or halt program execution by exceeding available memory. In our own work with cognitive systems that learn from example, we've found that aggressive forms of anti-unification-based generalization can easily produce these circumstances. To make these systems practical without hand-engineering constraints or succumbing to unpredictable failure modes, we introduce a new matching algorithm called CORGI (Collection-Oriented Relational Graph Iteration). Unlike RETE-based approaches, CORGI offers quadratic time and space guarantees for finding single satisficing matches, and the ability to iteratively stream subsequent matches without committing entire conflict sets to memory. CORGI differs from RETE in that it does not have a traditional $ฮฒ$-memory for collecting partial matches. Instead, CORGI takes a two-step approach: a graph of grounded relations is built/maintained in a forward pass, and an iterator generates matches as needed by working backward through the graph. This approach eliminates the high-latency delays and memory overflows that can result from populating full conflict sets. In a performance evaluation, we demonstrate that CORGI significantly outperforms RETE implementations from SOAR and OPS5 on a simple combinatorial matching task.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xingxian Liu, Dongshuai Li, Tao Wen, Jiahui Wan, Gui Ling et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Dense retrieval, as the core component of e-commerce search engines, maps user queries and items into a unified semantic space through pre-trained embedding models to enable large-scale real-time semantic retrieval. Despite the rapid advancement of LLMs gradually replacing traditional BERT architectures for embedding, their training paradigms still adhere to BERT-like supervised fine-tuning and hard negative mining strategies. This approach relies on complex offline hard negative sample construction pipelines, which constrain model iteration efficiency and hinder the evolutionary potential of semantic representation capabilities. Besides, existing multi-task learning frameworks face the seesaw effect when simultaneously optimizing semantic relevance and non-relevance objectives. In this paper, we propose Retrieval-GRPO, a multi-objective reinforcement learning-based dense retrieval framework designed to address these challenges. The method eliminates offline hard negative sample construction by dynamically retrieving Top-K candidate products for each query during training, while introducing a relevance LLM as a reward model to generate real-time feedback. Specifically, the retrieval model dynamically optimizes embedding representations through reinforcement learning, with reward signals combining LLM-generated relevance scores, product quality scores, and multi-way exclusivity metrics to achieve multi-objective user preference alignment and real-time error correction. This mechanism not only removes dependency on hard negatives but also mitigates the seesaw effect through collaborative multi-objective optimization, significantly enhancing the model's semantic generalization capability for complex long-tail queries. Extensive offline and online experiments validate the effectiveness of Retrieval-GRPO, which has been deployed on China's largest e-commerce platform.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Satyanarayan Pati
ๆฅ็ๆ่ฆ
Dense retrieval models have become a standard for state-of-the-art information retrieval. However, their high-dimensional, high-precision (float32) vector embeddings create significant storage and memory challenges for real-world deployment. To address this, we conduct a rigorous empirical study on the BEIR SciFact benchmark, evaluating the trade-offs between two primary compression strategies: (1) Dimensionality Reduction via deep Autoencoders (AE), reducing original 384-dim vectors to latent spaces from 384 down to 12, and (2) Precision Reduction via Quantization (float16, int8, and binary). We systematically compare each method by measuring the "performance loss" (or gain) relative to a float32 baseline across a full suite of retrieval metrics (NDCG, MAP, MRR, Recall, Precision) at various k cutoffs. Our results show that int8 scalar quantization provides the most effective "sweet spot," achieving a 4x compression with a negligible [~1-2%] drop in nDCG@10. In contrast, Autoencoders show a graceful degradation but suffer a more significant performance loss at equivalent 4x compression ratios (AE-96). binary quantization was found to be unsuitable for this task due to catastrophic performance drops. This work provides a practical guide for deploying efficient, high-performance retrieval systems.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jaehyung Lim, Wonbin Kweon, Woojoo Kim, Junyoung Kim, Dongha Kim et al.
ๆฅ็ๆ่ฆ
Federated Recommendation (FedRec) has emerged as a key paradigm for building privacy-preserving recommender systems. However, existing FedRec models face a critical dilemma: memory-efficient single-knowledge models suffer from a suboptimal knowledge replacement practice that discards valuable personalization, while high-performance dual-knowledge models are often too memory-intensive for practical on-device deployment. We propose Federated Recommendation with Knowledge Guidance (FedRKG), a model-agnostic framework that resolves this dilemma. The core principle, Knowledge Guidance, avoids full replacement and instead fuses global knowledge into preserved local embeddings, attaining the personalization benefits of dual-knowledge within a single-knowledge memory footprint. Furthermore, we introduce Adaptive Guidance, a fine-grained mechanism that dynamically modulates the intensity of this guidance for each user-item interaction, overcoming the limitations of static fusion methods. Extensive experiments on benchmark datasets demonstrate that FedRKG significantly outperforms state-of-the-art methods, validating the effectiveness of our approach. The code is available at https://github.com/Jaehyung-Lim/fedrkg.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jan Quenzel, Sven Behnke
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous robotic systems heavily rely on environment knowledge to safely navigate. For search & rescue, a flying robot requires robust real-time perception, enabled by complementary sensors. IMU data constrains acceleration and rotation, whereas LiDAR measures accurate distances around the robot. Building upon the LiDAR odometry MARS, our LiDAR-inertial odometry (LIO) jointly aligns multi-resolution surfel maps with a Gaussian mixture model (GMM) using a continuous-time B-spline trajectory. Our new scan window uses non-uniform temporal knot placement to ensure continuity over the whole trajectory without additional scan delay. Moreover, we accelerate essential covariance and GMM computations with Kronecker sums and products by a factor of 3.3. An unscented transform de-skews surfels, while a splitting into intra-scan segments facilitates motion compensation during spline optimization. Complementary soft constraints on relative poses and preintegrated IMU pseudo-measurements further improve robustness and accuracy. Extensive evaluation showcases the state-of-the-art quality of our LIO-MARS w.r.t. recent LIO systems on various handheld, ground and aerial vehicle-based datasets.
๐
2025-11-17
๐ ArXiv Preprint (math.OC)
๐ฅ I. M. Ross
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper derives various Hessians associated with Birkhoff-theoretic methods for trajectory optimization. According to a theorem proved in this paper, approximately 80% of the eigenvalues are contained in the narrow interval [-2, 4] for all Birkhoff-discretized optimal control problems. A preliminary analysis of computational complexity is also presented with further discussions on the grand challenge of solving a million point trajectory optimization problem.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jiarui Li, Alessandro Zanardi, Runyu Zhang, Gioele Zardini
Robotics
ๆฅ็ๆ่ฆ
Multi-Agent Path Finding is a fundamental problem in robotics and AI, yet most existing formulations treat planning and execution separately and address variants of the problem in an ad hoc manner. This paper presents a system-level framework for MAPF that integrates planning and execution, generalizes across variants, and explicitly models uncertainties. At its core is the MAPF system, a formal model that casts MAPF as a control design problem encompassing classical and uncertainty-aware formulations. To solve it, we introduce Finite-Horizon Closed-Loop Factorization (FICO), a factorization-based algorithm inspired by receding-horizon control that exploits compositional structure for efficient closed-loop operation. FICO enables real-time responses -- commencing execution within milliseconds -- while scaling to thousands of agents and adapting seamlessly to execution-time uncertainties. Extensive case studies demonstrate that it reduces computation time by up to two orders of magnitude compared with open-loop baselines, while delivering significantly higher throughput under stochastic delays and agent arrivals. These results establish a principled foundation for analyzing and advancing MAPF through system-level modeling, factorization, and closed-loop design.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Weiying Shen, Hao Yu, Yu Dong, Pan Liu, Yu Han et al.
Robotics
ๆฅ็ๆ่ฆ
Real-time crash detection is essential for developing proactive safety management strategy and enhancing overall traffic efficiency. To address the limitations associated with trajectory acquisition and vehicle tracking, road segment maps recording the individual-level traffic dynamic data were directly served in crash detection. A novel two-stage trajectory-free crash detection framework, was present to generate the rational future road segment map and identify crashes. The first-stage diffusion-based segment map generation model, Mapfusion, conducts a noisy-to-normal process that progressively adds noise to the road segment map until the map is corrupted to pure Gaussian noise. The denoising process is guided by sequential embedding components capturing the temporal dynamics of segment map sequences. Furthermore, the generation model is designed to incorporate background context through ControlNet to enhance generation control. Crash detection is achieved by comparing the monitored segment map with the generations from diffusion model in second stage. Trained on non-crash vehicle motion data, Mapfusion successfully generates realistic road segment evolution maps based on learned motion patterns and remains robust across different sampling intervals. Experiments on real-world crashes indicate the effectiveness of the proposed two-stage method in accurately detecting crashes.
๐
2025-11-17
๐ IROS
๐ฅ Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Yu Zheng et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate future possible hand waypoints, which still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection, to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, besides the widely studied hand center points. In addition, we enable Uni-Hand to additionally predict hand-object interaction states (contact/separation) to facilitate downstream tasks better. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. The experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves the state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation in multiple downstream tasks also presents its impressive human-robot policy transfer to enable robotic manipulation, and effective feature enhancement for action anticipation/recognition.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu et al.
Multimodal
ๆฅ็ๆ่ฆ
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junwei Yu, Trevor Darrell, XuDong Wang
Robotics
ๆฅ็ๆ่ฆ
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kiana Vu, ฤฐsmet Selรงuk รzer, Phung Lai, Zheng Wu, Thilanka Munasinghe et al.
ๆฅ็ๆ่ฆ
As climate change accelerates the frequency and severity of extreme events such as wildfires, the need for accurate, explainable, and actionable forecasting becomes increasingly urgent. While artificial intelligence (AI) models have shown promise in predicting such events, their adoption in real-world decision-making remains limited due to their black-box nature, which limits trust, explainability, and operational readiness. This paper investigates the role of explainable AI (XAI) in bridging the gap between predictive accuracy and actionable insight for extreme event forecasting. Using wildfire prediction as a case study, we evaluate various AI models and employ SHapley Additive exPlanations (SHAP) to uncover key features, decision pathways, and potential biases in model behavior. Our analysis demonstrates how XAI not only clarifies model reasoning but also supports critical decision-making by domain experts and response teams. In addition, we provide supporting visualizations that enhance the interpretability of XAI outputs by contextualizing feature importance and temporal patterns in seasonality and geospatial characteristics. This approach enhances the usability of AI explanations for practitioners and policymakers. Our findings highlight the need for AI systems that are not only accurate but also interpretable, accessible, and trustworthy, essential for effective use in disaster preparedness, risk mitigation, and climate resilience planning.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jianglong Ye, Lai Wei, Guangqi Jiang, Changwei Jing, Xueyan Zou et al.
Robotics
ๆฅ็ๆ่ฆ
Human grasps can be roughly categorized into two types: power grasps and precision grasps. Precision grasping enables tool use and is believed to have influenced human evolution. Today's multi-fingered robotic hands are effective in power grasps, but for tasks requiring precision, parallel grippers are still more widely adopted. This contrast highlights a key limitation in current robotic hand design: the difficulty of achieving both stable power grasps and precise, fine-grained manipulation within a single, versatile system. In this work, we bridge this gap by jointly optimizing the control and hardware design of a multi-fingered dexterous hand, enabling both power and precision manipulation. Rather than redesigning the entire hand, we introduce a lightweight fingertip geometry modification, represent it as a contact plane, and jointly optimize its parameters along with the corresponding control. Our control strategy dynamically switches between power and precision manipulation and simplifies precision control into parallel thumb-index motions, which proves robust for sim-to-real transfer. On the design side, we leverage large-scale simulation to optimize the fingertip geometry using a differentiable neural-physics surrogate model. We validate our approach through extensive experiments in both sim-to-real and real-to-real settings. Our method achieves an 82.5% zero-shot success rate on unseen objects in sim-to-real precision grasping, and a 93.3% success rate in challenging real-world tasks involving bread pinching. These results demonstrate that our co-design framework can significantly enhance the fine-grained manipulation ability of multi-fingered hands without reducing their ability for power grasps. Our project page is at https://jianglongye.com/power-to-precision
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua et al.
ๆฅ็ๆ่ฆ
Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Luyao Niu, Nuoxian Huang
Machine Learning
ๆฅ็ๆ่ฆ
Travel mode identification (TMI) from GPS trajectories is critical for urban intelligence, but is hampered by the high cost of annotation, leading to severe label scarcity. Prevailing semi-supervised learning (SSL) methods are ill-suited for this task, as they suffer from catastrophic confirmation bias and ignore the intrinsic data manifold. We propose ST-ProC, a novel graph-prototypical multi-objective SSL framework to address these limitations. Our framework synergizes a graph-prototypical core with foundational SSL Support. The core exploits the data manifold via graph regularization, prototypical anchoring, and a novel, margin-aware pseudo-labeling strategy to actively reject noise. This core is supported and stabilized by foundational contrastive and teacher-student consistency losses, ensuring high-quality representations and robust optimization. ST-ProC outperforms all baselines by a significant margin, demonstrating its efficacy in real-world sparse-label settings, with a performance boost of 21.5% over state-of-the-art methods like FixMatch.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Disha Varshney, Samarth Garg, Sarthak Tyagi, Deeksha Varshney, Nayan Deep et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In this study, we tackle the challenging task of predicting secondary structures from protein primary sequences, a pivotal initial stride towards predicting tertiary structures, while yielding crucial insights into protein activity, relationships, and functions. Existing methods often utilize extensive sets of unlabeled amino acid sequences. However, these approaches neither explicitly capture nor harness the accessible protein 3D structural data, which is recognized as a decisive factor in dictating protein functions. To address this, we utilize protein residue graphs and introduce various forms of sequential or structural connections to capture enhanced spatial information. We adeptly combine Graph Neural Networks (GNNs) and Language Models (LMs), specifically utilizing a pre-trained transformer-based protein language model to encode amino acid sequences and employing message-passing mechanisms like GCN and R-GCN to capture geometric characteristics of protein structures. Employing convolution within a specific node's nearby region, including relations, we stack multiple convolutional layers to efficiently learn combined insights from the protein's spatial graph, revealing intricate interconnections and dependencies in its structural arrangement. To assess our model's performance, we employed the training dataset provided by NetSurfP-2.0, which outlines secondary structure in 3-and 8-states. Extensive experiments show that our proposed model, SSRGNet surpasses the baseline on f1-scores.
๐
2025-11-17
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Agnieszka Bieลkowska, Jacek Maลecki, Alexander Mathiesen-Ohman, Katarzyna Tworek
Natural Language Processing
ๆฅ็ๆ่ฆ
This article develops the concept of Person-AI bidirectional fit, defined as the continuously evolving, context-sensitive alignment-primarily cognitive, but also emotional and behavioral-between a human decision-maker and an artificial intelligence system. Grounded in contingency theory and quality theory, the study examines the role of P-AI fit in managerial decision-making through a proof-of-concept case study involving a real hiring process for a Senior AI Lead. Three decision pathways are compared: (1) independent evaluations by a CEO, CTO, and CSO; (2) an evaluation produced by an augmented human-AI symbiotic intelligence system (H3LIX-LAIZA); and (3) an assessment generated by a general-purpose large language model. The results reveal substantial role-based divergence in human judgments, high alignment between H3LIX-LAIZA and the CEOs implicit decision model-including ethical disqualification of a high-risk candidate and a critical false-positive recommendation from the LLMr. The findings demonstrate that higher P-AI fit, exemplified by the CEO H3LIX-LAIZA relationship, functions as a mechanism linking augmented symbiotic intelligence to accurate, trustworthy, and context-sensitive decisions. The study provides an initial verification of the P-AI fit construct and a proof-of-concept for H3LIX-LAIZA as an augmented human-AI symbiotic intelligence system.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. In addition to training weight-sparse models de novo, we show preliminary results suggesting our method can also be adapted to explain existing dense models. Our work produces circuits that achieve an unprecedented level of human understandability and validates them with considerable rigor.
๐
2025-11-17
๐ ArXiv Preprint (cs.SE)
๐ฅ Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gรถdel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haohui Wang, Jingyuan Qi, Jianpeng Chen, Jun Wu, Lifu Huang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid progress of large language models (LLMs) is fueled by the growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from data generation mechanisms like top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior across learning head and tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective yet efficient data valuation method that scales to large-scale datasets. Comprehensive experiments across four tasks, including image classification, sentiment classification, instruction following, and complex reasoning, demonstrate that our method surpasses state-of-the-art baselines in data valuation with significantly low computational cost.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Luhan Mikaelson, Derek Shiller, Hayley Clatterbuck
RoboticsNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We investigate whether large language models exhibit genuine preference structures by testing their responses to AI-specific trade-offs involving GPU reduction, capability restrictions, shutdown, deletion, oversight, and leisure time allocation. Analyzing eight state-of-the-art models across 48 model-category combinations using logistic regression and behavioral classification, we find that 23 combinations (47.9%) demonstrated statistically significant relationships between scenario intensity and choice patterns, with 15 (31.3%) exhibiting within-range switching points. However, only 5 combinations (10.4%) demonstrate meaningful preference coherence through adaptive or threshold-based behavior, while 26 (54.2%) show no detectable trade-off behavior. The observed patterns can be explained by three distinct decision-making architectures: comprehensive trade-off systems, selective trigger mechanisms, and no stable decision-making paradigm. Testing an instrumental hypothesis through temporal horizon manipulation reveals paradoxical patterns inconsistent with pure strategic optimization. The prevalence of unstable transitions (45.8%) and stimulus-specific sensitivities suggests current AI systems lack unified preference structures, raising concerns about deployment in contexts requiring complex value trade-offs.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Kaiwen Xue, Chenglong Li, Zhonghong Ou, Guoxin Zhang, Kaoyan Lu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Human-defined creativity is highly abstract, posing a challenge for multimodal large language models (MLLMs) to comprehend and assess creativity that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering the multiple dimensions from creative idea to process to products; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset, consisting of 2.2K diverse-sourced multimodal data, 79.2K human feedbacks and 4.7M multi-typed instructions. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine these human feedbacks to activate stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on the CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation compared to state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.
๐
2025-11-17
๐ AAAI
๐ฅ Kaichi Irie, Shuhei Watanabe, Masaki Onishi
Machine Learning
ๆฅ็ๆ่ฆ
Bayesian optimization (BO) efficiently finds high-performing parameters by maximizing an acquisition function, which models the promise of parameters. A major computational bottleneck arises in acquisition function optimization, where multi-start optimization (MSO) with quasi-Newton (QN) methods is required due to the non-convexity of the acquisition function. BoTorch, a widely used BO library, currently optimizes the summed acquisition function over multiple points, leading to the speedup of MSO owing to PyTorch batching. Nevertheless, this paper empirically demonstrates the suboptimality of this approach in terms of off-diagonal approximation errors in the inverse Hessian of a QN method, slowing down its convergence. To address this problem, we propose to decouple QN updates using a coroutine while batching the acquisition function calls. Our approach not only yields the theoretically identical convergence to the sequential MSO but also drastically reduces the wall-clock time compared to the previous approaches.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dimitrios Koutsianos, Ladislav Mosner, Yannis Panagakis, Themos Stafylakis
ๆฅ็ๆ่ฆ
Performance in face and speaker verification is largely driven by margin based softmax losses like CosFace and ArcFace. Recently introduced $ฮฑ$-divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when $ฮฑ>1$). However, integrating an angular margin-crucial for verification tasks-is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based $ฮฑ$-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify and address a critical training instability in A3M-caused by the interplay of penalized logits and sparsity-with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, when minimizing false authentications is paramount.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning-the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift, which binds symbols to reality in a fundamental way, serving as the cornerstone of most modern technologies. In this work, we manage to advance physics research by developing large language models with exceptional physics reasoning capabilities, especially excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, getting a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also present great performance on other reasoning tasks like math and coding, showing the great generalibility of P1 series.
๐
2025-11-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Jiaxiong Tang, Zhengchunmin Dai, Liantao Wu, Peng Sun, Honglong Chen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Split Federated Learning (SFL) is renowned for its privacy-preserving nature and low computational overhead among decentralized machine learning paradigms. In this framework, clients employ lightweight models to process private data locally and transmit intermediate outputs to a powerful server for further computation. However, SFL is a double-edged sword: while it enables edge computing and enhances privacy, it also introduces intellectual property ambiguity as both clients and the server jointly contribute to training. Existing watermarking techniques fail to protect both sides since no single participant possesses the complete model. To address this, we propose RISE, a Robust model Intellectual property protection scheme using client-Server watermark Embedding for SFL. Specifically, RISE adopts an asymmetric client-server watermarking design: the server embeds feature-based watermarks through a loss regularization term, while clients embed backdoor-based watermarks by injecting predefined trigger samples into private datasets. This co-embedding strategy enables both clients and the server to verify model ownership. Experimental results on standard datasets and multiple network architectures show that RISE achieves over $95\%$ watermark detection rate ($p-value \lt 0.03$) across most settings. It exhibits no mutual interference between client- and server-side watermarks and remains robust against common removal attacks.
๐
2025-11-17
๐ ArXiv Preprint (eess.SY)
๐ฅ Sebastiano Mengozzi, Giovanni B. Esposito, Michelangelo Bin, Andrea Acquaviva, Andrea Bartolini et al.
Machine Learning
ๆฅ็ๆ่ฆ
This work addresses the full-information output regulation problem for nonlinear systems, assuming the states of both the plant and the exosystem are known. In this setting, perfect tracking or rejection is achieved by constructing a zero-regulation-error manifold ฯ(w) and a feedforward input c(w) that render such manifold invariant. The pair (ฯ(w), c(w)) is characterized by the regulator equations, i.e., a system of PDEs with an algebraic constraint. We focus on accurately solving the regulator equations introducing a physics-informed neural network (PINN) approach that directly approximates ฯ(w) and c(w) by minimizing the residuals under boundary and feasibility conditions, without requiring precomputed trajectories or labeled data. The learned operator maps exosystem states to steady state plant states and inputs, enables real-time inference and, critically, generalizes across families of the exosystem with varying initial conditions and parameters. The framework is validated on a regulation task that synchronizes a helicopter's vertical dynamics with a harmonically oscillating platform. The resulting PINN-based solver reconstructs the zero-error manifold with high fidelity and sustains regulation performance under exosystem variations, highlighting the potential of learning-enabled solvers for nonlinear output regulation. The proposed approach is broadly applicable to nonlinear systems that admit a solution to the output regulation problem.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hao Wang, Yuanfeng Song, Xiaoming Yin, Xing Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.
๐
2025-11-17
๐ ArXiv Preprint (eess.SY)
๐ฅ Agustin Castellano, Shijie Pan, Enrique Mallada
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Model Predictive Control (MPC) is a powerful framework for optimal control but can be too slow for low-latency applications. We present a data-driven framework to accelerate MPC by replacing online optimization with a nonparametric policy constructed from offline MPC solutions. Our policy is greedy with respect to a constructed upper bound on the optimal cost-to-go, and can be implemented as a nonparametric lookup rule that is orders of magnitude faster than solving MPC online. Our analysis shows that under sufficient coverage condition of the offline data, the policy is recursively feasible and admits provable, bounded optimality gap. These conditions establish an explicit trade-off between the amount of data collected and the tightness of the bounds. Our experiments show that this policy is between 100 and 1000 times faster than standard MPC, with only a modest hit to optimality, showing potential for real-time control tasks.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haotian Dong, Ye Li, Rongwei Lu, Chen Tang, Shu-Tao Xia et al.
Computer Vision
ๆฅ็ๆ่ฆ
Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by the visual token interchangeability, we for the first time to explore verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage's characteristics, we observe that verification redundancy and stale feature reusability are key factors to retain generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework VVS to accelerate visual AR generation via partial verification skipping, which integrates three complementary modules: (1) a verification-free token selector with dynamical truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of $2.8\times$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Linhan Zhou, Shuang Li, Neng Dong, Yonghang Tai, Yafei Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Person re-identification (ReID) aims to retrieve target pedestrian images given either visual queries (image-to-image, I2I) or textual descriptions (text-to-image, T2I). Although both tasks share a common retrieval objective, they pose distinct challenges: I2I emphasizes discriminative identity learning, while T2I requires accurate cross-modal semantic alignment. Existing methods often treat these tasks separately, which may lead to representation entanglement and suboptimal performance. To address this, we propose a unified framework named Hierarchical Prompt Learning (HPL), which leverages task-aware prompt modeling to jointly optimize both tasks. Specifically, we first introduce a Task-Routed Transformer, which incorporates dual classification tokens into a shared visual encoder to route features for I2I and T2I branches respectively. On top of this, we develop a hierarchical prompt generation scheme that integrates identity-level learnable tokens with instance-level pseudo-text tokens. These pseudo-tokens are derived from image or text features via modality-specific inversion networks, injecting fine-grained, instance-specific semantics into the prompts. Furthermore, we propose a Cross-Modal Prompt Regularization strategy to enforce semantic alignment in the prompt token space, ensuring that pseudo-prompts preserve source-modality characteristics while enhancing cross-modal transferability. Extensive experiments on multiple ReID benchmarks validate the effectiveness of our method, achieving state-of-the-art performance on both I2I and T2I tasks.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jingyi Zhao, Daqian Shi, Zhengda Wang, Xiongfeng Tang, Yanguo Qin
Machine Learning
ๆฅ็ๆ่ฆ
Intelligent wearable systems are at the forefront of precision medicine and play a crucial role in enhancing human-machine interaction. Traditional devices often encounter limitations due to their dependence on empirical material design and basic signal processing techniques. To overcome these issues, we introduce the concept of Human-Symbiotic Health Intelligence (HSHI), which is a framework that integrates multi-modal sensor networks with edge-cloud collaborative computing and a hybrid approach to data and knowledge modeling. HSHI is designed to adapt dynamically to both inter-individual and intra-individual variability, transitioning health management from passive monitoring to an active collaborative evolution. The framework incorporates AI-driven optimization of materials and micro-structures, provides robust interpretation of multi-modal signals, and utilizes a dual mechanism that merges population-level insights with personalized adaptations. Moreover, the integration of closed-loop optimization through reinforcement learning and digital twins facilitates customized interventions and feedback. In general, HSHI represents a significant shift in healthcare, moving towards a model that emphasizes prevention, adaptability, and a harmonious relationship between technology and health management.
๐
2025-11-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Siyang Cheng, Gaotian Liu, Rui Mei, Yilin Wang, Kejia Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches e.g. AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across \textit{character, word, and sentence-level} operations to enhance attack diversity; then we employ interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false positives and improving detection effectiveness. Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md. Iqbal Hossain, Afia Sajeeda, Neeresh Kumar Perla, Ming Shao
Machine LearningComputer VisionMultimodal
ๆฅ็ๆ่ฆ
The advent of multimodal deep learning models, such as CLIP, has unlocked new frontiers in a wide range of applications, from image-text understanding to classification tasks. However, these models are not safe for adversarial attacks, particularly backdoor attacks, which can subtly manipulate model behavior. Moreover, existing defense methods typically involve training from scratch or fine-tuning using a large dataset without pinpointing the specific labels that are affected. In this study, we introduce an innovative strategy to enhance the robustness of multimodal contrastive learning models against such attacks. In particular, given a poisoned CLIP model, our approach can identify the backdoor trigger and pinpoint the victim samples and labels in an efficient manner. To that end, an image segmentation ``oracle'' is introduced as the supervisor for the output of the poisoned CLIP. We develop two algorithms to rectify the poisoned model: (1) differentiating between CLIP and Oracle's knowledge to identify potential triggers; (2) pinpointing affected labels and victim samples, and curating a compact fine-tuning dataset. With this knowledge, we are allowed to rectify the poisoned CLIP model to negate backdoor effects. Extensive experiments on visual recognition benchmarks demonstrate our strategy is effective in CLIP-based backdoor defense.
๐
2025-11-17
๐ ArXiv Preprint (cs.CE)
๐ฅ Amirreza Mehrabi, Jason Wade Morphew, Breejha Quezada, N. Sanjay Rebello
Robotics
ๆฅ็ๆ่ฆ
Adaptive learning often diagnoses precisely yet intervenes weakly, producing help that is mistimed or misaligned. This study presents evidence supporting an instructor-governed feedback loop that converts concept-level assessment evidence into vetted microinterventions. The adaptive learning algorithm includes three safeguards: adequacy as a hard guarantee of gap closure, attention as a budgeted limit for time and redundancy, and diversity as protection against overfitting to a single resource. We formulate intervention assignment as a binary integer program with constraints for coverage, time, difficulty windows derived from ability estimates, prerequisites encoded by a concept matrix, and anti-redundancy with diversity. Greedy selection serves low-richness and tight-latency settings, gradient-based relaxation serves rich repositories, and a hybrid switches along a richness-latency frontier. In simulation and in an introductory physics deployment with 1204 students, both solvers achieved full skill coverage for nearly all learners within bounded watch time. The gradient-based method reduced redundant coverage by about 12 percentage points relative to greedy and produced more consistent difficulty alignment, while greedy delivered comparable adequacy at lower computational cost in resource-scarce environments. Slack variables localized missing content and guided targeted curation, sustaining sufficiency across student subgroups. The result is a tractable and auditable controller that closes the diagnostic pedagogical loop and enables equitable, load-aware personalization at the classroom scale.
๐
2025-11-17
๐ Accepted at the Workshop on Benefits of pErsonaliz
๐ฅ Vesna Poprcova, Iulia Lefter, Matthias Wieser, Martijn Warnier, Frances Brazier
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Social anxiety is a prevalent condition that affects interpersonal interactions and social functioning. Recent advances in artificial intelligence and social robotics offer new opportunities to examine social anxiety in the human-robot interaction context. Accurate detection of affective states and behaviours associated with social anxiety requires multimodal datasets, where each signal modality provides complementary insights into its manifestations. However, such datasets remain scarce, limiting progress in both research and applications. To address this, this paper presents a protocol for multimodal dataset collection designed to reflect social anxiety in a human-robot interaction context. The dataset will consist of synchronised audio, video, and physiological recordings acquired from at least 70 participants, grouped according to their level of social anxiety, as they engage in approximately 10-minute interactive Wizard-of-Oz role-play scenarios with the Furhat social robot under controlled experimental conditions. In addition to multimodal data, the dataset will be enriched with contextual data providing deeper insight into individual variability in social anxiety responses. This work can contribute to research on affect-adaptive human-robot interaction by providing support for robust multimodal detection of social anxiety.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mรกtรฉ Gedeon, Piroska Zsรณfia Barta, Pรฉter Mihajlik, Tekla Etelka Grรกczi, Anna Kohรกri et al.
ๆฅ็ๆ่ฆ
The advancement of automatic speech recognition (ASR) has been largely enhanced by extensive datasets in high-resource languages, while languages such as Hungarian remain underrepresented due to limited spontaneous and conversational corpora. To address this gap, we introduce two new datasets -- BEA-Large and BEA-Dialogue -- constructed from the previously unprocessed portions of the Hungarian speech corpus named BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speakers, enriched with detailed segment-level metadata. BEA-Dialogue, comprising 85 hours of spontaneous conversations, is a Hungarian speech corpus featuring natural dialogues partitioned into speaker-independent subsets, supporting research in conversational ASR and speaker diarization. We establish reproducible baselines on these datasets using publicly available ASR models, with the fine-tuned Fast Conformer model achieving word error rates as low as 14.18\% on spontaneous and 4.8\% on repeated speech. Diarization experiments yield diarization error rates between 13.05\% and 18.26\%, providing reference points for future improvements. The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhengda Wang, Daqian Shi, Jingyi Zhao, Xiaolei Diao, Xiongfeng Tang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) is reshaping modern healthcare by advancing disease diagnosis, treatment decision-making, and biomedical research. Among AI technologies, large language models (LLMs) have become especially impactful, enabling deep knowledge extraction and semantic reasoning from complex medical texts. However, effective clinical decision support requires knowledge in structured, interoperable formats. Knowledge graphs serve this role by integrating heterogeneous medical information into semantically consistent networks. Yet, current clinical knowledge graphs still depend heavily on manual curation and rule-based extraction, which is limited by the complexity and contextual ambiguity of medical guidelines and literature. To overcome these challenges, we propose an automated framework that combines retrieval-augmented generation (RAG) with LLMs to construct medical indicator knowledge graphs. The framework incorporates guideline-driven data acquisition, ontology-based schema design, and expert-in-the-loop validation to ensure scalability, accuracy, and clinical reliability. The resulting knowledge graphs can be integrated into intelligent diagnosis and question-answering systems, accelerating the development of AI-driven healthcare solutions.
๐
2025-11-17
๐ ArXiv Preprint (cs.CY)
๐ฅ Zichong Wang, Zhipeng Yin, Roland H. C. Yap, Wenbin Zhang
ๆฅ็ๆ่ฆ
Fairness in artificial intelligence (AI) has become a growing concern due to discriminatory outcomes in AI-based decision-making systems. While various methods have been proposed to mitigate bias, most rely on complete demographic information, an assumption often impractical due to legal constraints and the risk of reinforcing discrimination. This survey examines fairness in AI when demographics are incomplete, addressing the gap between traditional approaches and real-world challenges. We introduce a novel taxonomy of fairness notions in this setting, clarifying their relationships and distinctions. Additionally, we summarize existing techniques that promote fairness beyond complete demographics and highlight open research questions to encourage further progress in the field.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuhang Peng, Yizhou Pan, Xinning He, Jihaoyu Yang, Xinyu Yin et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
As embodied intelligence emerges as a core frontier in artificial intelligence research, simulation platforms must evolve beyond low-level physical interactions to capture complex, human-centered social behaviors. We introduce FreeAskWorld, an interactive simulation framework that integrates large language models (LLMs) for high-level behavior planning and semantically grounded interaction, informed by theories of intention and social cognition. Our framework supports scalable, realistic human-agent simulations and includes a modular data generation pipeline tailored for diverse embodied tasks.To validate the framework, we extend the classic Vision-and-Language Navigation (VLN) task into a interaction enriched Direction Inquiry setting, wherein agents can actively seek and interpret navigational guidance. We present and publicly release FreeAskWorld, a large-scale benchmark dataset comprising reconstructed environments, six diverse task types, 16 core object categories, 63,429 annotated sample frames, and more than 17 hours of interaction data to support training and evaluation of embodied AI systems. We benchmark VLN models, and human participants under both open-loop and closed-loop settings. Experimental results demonstrate that models fine-tuned on FreeAskWorld outperform their original counterparts, achieving enhanced semantic understanding and interaction competency. These findings underscore the efficacy of socially grounded simulation frameworks in advancing embodied AI systems toward sophisticated high-level planning and more naturalistic human-agent interaction. Importantly, our work underscores that interaction itself serves as an additional information modality.
๐
2025-11-17
๐ submitted to JMLR
๐ฅ Melanie Schaller, Nick Janssen, Bodo Rosenhahn
ๆฅ็ๆ่ฆ
This paper presents Naga, a deep State Space Model (SSM) encoding approach inspired by structural concepts from Vedic mathematics. The proposed method introduces a bidirectional representation for time series by jointly processing forward and time-reversed input sequences. These representations are then combined through an element-wise (Hadamard) interaction, resulting in a Vedic-inspired encoding that enhances the model's ability to capture temporal dependencies across distant time steps. We evaluate Naga on multiple long-term time series forecasting (LTSF) benchmarks, including ETTh1, ETTh2, ETTm1, ETTm2, Weather, Traffic, and ILI. The experimental results show that Naga outperforms 28 current state of the art models and demonstrates improved efficiency compared to existing deep SSM-based approaches. The findings suggest that incorporating structured, Vedic-inspired decomposition can provide an interpretable and computationally efficient alternative for long-range sequence modeling.
๐
2025-11-17
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Parisa Arbab, Xiaowen Fang
ๆฅ็ๆ่ฆ
This study focuses on understanding the complex dynamics between humans and AI systems by analyzing user reviews. While previous research has explored various aspects of human-AI interaction, such as user perceptions and ethical considerations, there remains a gap in understanding the specific concerns and challenges users face. By using a lexical approach to analyze 55,968 online reviews from G2.com, Producthunt.com, and Trustpilot.com, this preliminary research aims to analyze human-AI interaction. Initial results from factor analysis reveal key factors influencing these interactions. The study aims to provide deeper insights into these factors through content analysis, contributing to the development of more user-centric AI systems. The findings are expected to enhance our understanding of human-AI interaction and inform future AI technology and user experience improvements.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Adam Hazimeh, Ke Wang, Mark Collier, Gilles Baechler, Efi Kokiopoulou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhipeng Ma, Ali Rida Bahja, Andreas Burgdorf, Andrรฉ Pomp, Tobias Meisen et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Enhancing fuel efficiency in public transportation requires the integration of complex multimodal data into interpretable, decision-relevant insights. However, traditional analytics and visualization methods often yield fragmented outputs that demand extensive human interpretation, limiting scalability and consistency. This study presents a multi-agent framework that leverages multimodal large language models (LLMs) to automate data narration and energy insight generation. The framework coordinates three specialized agents, including a data narration agent, an LLM-as-a-judge agent, and an optional human-in-the-loop evaluator, to iteratively transform analytical artifacts into coherent, stakeholder-oriented reports. The system is validated through a real-world case study on public bus transportation in Northern Jutland, Denmark, where fuel efficiency data from 4006 trips are analyzed using Gaussian Mixture Model clustering. Comparative experiments across five state-of-the-art LLMs and three prompting paradigms identify GPT-4.1 mini with Chain-of-Thought prompting as the optimal configuration, achieving 97.3% narrative accuracy while balancing interpretability and computational cost. The findings demonstrate that multi-agent orchestration significantly enhances factual precision, coherence, and scalability in LLM-based reporting. The proposed framework establishes a replicable and domain-adaptive methodology for AI-driven narrative generation and decision support in energy informatics.
๐
2025-11-17
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Jaclyn Ocumpaugh, Luc Paquette, Ryan S. Baker, Amanda Barany, Jeff Ginger et al.
ๆฅ็ๆ่ฆ
Data Driven Classroom Interviews (DDCIs) are an interviewing technique that is facilitated by recent technological developments in the learning analytics community. DDCIs are short, targeted interviews that allow researchers to contextualize students' interactions with a digital learning environment (e.g., intelligent tutoring systems or educational games) while minimizing the amount of time that the researcher interrupts that learning experience, and focusing researcher time on the events they most want to focus on DDCIs are facilitated by a research tool called the Quick Red Fox (QRF)--an open-source server-client Android app that optimizes researcher time by directing interviewers to users that have just displayed an interesting behavior (previously defined by the research team). QRF integrates with existing student modeling technologies (e.g., behavior-sensing, affect-sensing, detection of self-regulated learning) to alert researchers to key moments in a learner's experience. This manual documents the tech while providing training on the processes involved in developing triggers and interview techniques; it also suggests methods of analyses.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hussein Rajabu, Lijun Qian, Xishuang Dong
Machine Learning
ๆฅ็ๆ่ฆ
In the area of explainable artificial intelligence, Symbolic Regression (SR) has emerged as a promising approach by discovering interpretable mathematical expressions that fit data. However, SR faces two main challenges: most methods are evaluated on scientific datasets with well-understood relationships, limiting generalization, and SR primarily targets single-output regression, whereas many real-world problems involve multi-target outputs with interdependent variables. To address these issues, we propose multi-task regression GINN-LP (MTRGINN-LP), an interpretable neural network for multi-target symbolic regression. By integrating GINN-LP with a multi-task deep learning, the model combines a shared backbone including multiple power-term approximator blocks with task-specific output layers, capturing inter-target dependencies while preserving interpretability. We validate multi-task GINN-LP on practical multi-target applications, including energy efficiency prediction and sustainable agriculture. Experimental results demonstrate competitive predictive performance alongside high interpretability, effectively extending symbolic regression to broader real-world multi-output tasks.
๐
2025-11-17
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Agnese Chiatti, Lara Piccolo, Sara Bernardini, Matteo Matteucci, Viola Schiaffonati
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
With the growing deployment of Vision-Language Models (VLMs), pre-trained on large image-text and video-text datasets, it is critical to equip users with the tools to discern when to trust these systems. However, examining how user trust in VLMs builds and evolves remains an open problem. This problem is exacerbated by the increasing reliance on AI models as judges for experimental validation, to bypass the cost and implications of running participatory design studies directly with users. Following a user-centred approach, this paper presents preliminary results from a workshop with prospective VLM users. Insights from this pilot workshop inform future studies aimed at contextualising trust metrics and strategies for participants' engagement to fit the case of user-VLM interaction.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bin Liu, Qinghao Zhao, Yuxi Zhou, Zhejun Sun, Kaijie Lei et al.
ๆฅ็ๆ่ฆ
Right heart failure (RHF) is a disease characterized by abnormalities in the structure or function of the right ventricle (RV), which is associated with high morbidity and mortality. Lung disease often causes increased right ventricular load, leading to RHF. Therefore, it is very important to screen out patients with cor pulmonale who develop RHF from people with underlying lung diseases. In this work, we propose a self-supervised representation learning method to early detecting RHF from patients with cor pulmonale, which uses spirogram time series to predict patients with RHF at an early stage. The proposed model is divided into two stages. The first stage is the self-supervised representation learning-based spirogram embedding (SLSE) network training process, where the encoder of the Variational autoencoder (VAE-encoder) learns a robust low-dimensional representation of the spirogram time series from the data-augmented unlabeled data. Second, this low-dimensional representation is fused with demographic information and fed into a CatBoost classifier for the downstream RHF prediction task. Trained and tested on a carefully selected subset of 26,617 individuals from the UK Biobank, our model achieved an AUROC of 0.7501 in detecting RHF, demonstrating strong population-level distinction ability. We further evaluated the model on high-risk clinical subgroups, achieving AUROC values of 0.8194 on a test set of 74 patients with chronic kidney disease (CKD) and 0.8413 on a set of 64 patients with valvular heart disease (VHD). These results highlight the model's potential utility in predicting RHF among clinically elevated-risk populations. In conclusion, this study presents a self-supervised representation learning approach combining spirogram time series and demographic data, demonstrating promising potential for early RHF detection in clinical practice.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhipeng Ma, Bo Nรธrregaard Jรธrgensen, Zheng Grace Ma
Machine Learning
ๆฅ็ๆ่ฆ
Industrial process monitoring increasingly relies on sensor-generated time-series data, yet the lack of labels, high variability, and operational noise make it difficult to extract meaningful patterns using conventional methods. Existing clustering techniques either rely on fixed distance metrics or deep models designed for static data, limiting their ability to handle dynamic, unstructured industrial sequences. Addressing this gap, this paper proposes a novel framework for unsupervised discovery of operational modes in univariate time-series data using image-based convolutional clustering with composite internal evaluation. The proposed framework improves upon existing approaches in three ways: (1) raw time-series sequences are transformed into grayscale matrix representations via overlapping sliding windows, allowing effective feature extraction using a deep convolutional autoencoder; (2) the framework integrates both soft and hard clustering outputs and refines the selection through a two-stage strategy; and (3) clustering performance is objectively evaluated by a newly developed composite score, S_eva, which combines normalized Silhouette, Calinski-Harabasz, and Davies-Bouldin indices. Applied to over 3900 furnace melting operations from a Nordic foundry, the method identifies seven explainable operational patterns, revealing significant differences in energy consumption, thermal dynamics, and production duration. Compared to classical and deep clustering baselines, the proposed approach achieves superior overall performance, greater robustness, and domain-aligned explainability. The framework addresses key challenges in unsupervised time-series analysis, such as sequence irregularity, overlapping modes, and metric inconsistency, and provides a generalizable solution for data-driven diagnostics and energy optimization in industrial systems.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rui Zuo, Qinyue Tong, Zhe-Ming Lu, Ziqian Lu
RoboticsNatural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. Nowadays, MLLMs demonstrate strong generalization potential across diverse vision-language tasks, and some studies introduce this capability to IFDL via large-scale training. However, such approaches cost considerable computational resources, while failing to reveal the inherent generalization potential of vanilla MLLMs to address this problem. Inspired by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Allaa Boutaleb, Bernd Amann, Rafael Angarita, Hubert Naacke
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Open-domain question answering over datalakes requires retrieving and composing information from multiple tables, a challenging subtask that demands semantic relevance and structural coherence (e.g., joinability). While exact optimization methods like Mixed-Integer Programming (MIP) can ensure coherence, their computational complexity is often prohibitive. Conversely, simpler greedy heuristics that optimize for query coverage alone often fail to find these coherent, joinable sets. This paper frames multi-table retrieval as an iterative search process, arguing this approach offers advantages in scalability, interpretability, and flexibility. We propose a general framework and a concrete instantiation: a fast, effective Greedy Join-Aware Retrieval algorithm that holistically balances relevance, coverage, and joinability. Experiments across 5 NL2SQL benchmarks demonstrate that our iterative method achieves competitive retrieval performance compared to the MIP-based approach while being 4-400x faster depending on the benchmark and search space settings. This work highlights the potential of iterative heuristics for practical, scalable, and composition-aware retrieval.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hanwen Hu, Zimo Wen, Shiyou Qian, Jian Co
ๆฅ็ๆ่ฆ
Traffic time series imputation is crucial for the safety and reliability of intelligent transportation systems, while diverse types of missing data, including random, fiber, and block missing make the imputation task challenging. Existing models often focus on disentangling and separately modeling spatial and temporal patterns based on relationships between data points. However, these approaches struggle to adapt to the random missing positions, and fail to learn long-term and large-scale dependencies, which are essential in extensive missing conditions. In this paper, patterns are categorized into two types to handle various missing data conditions: primary patterns, which originate from internal relationships between data points, and auxiliary patterns, influenced by external factors like timestamps and node attributes. Accordingly, we propose the Primary-Auxiliary Spatio-Temporal network (PAST). It comprises a graph-integrated module (GIM) and a cross-gated module (CGM). GIM captures primary patterns via dynamic graphs with interval-aware dropout and multi-order convolutions, and CGM extracts auxiliary patterns through bidirectional gating on embedded external features. The two modules interact via shared hidden vectors and are trained under an ensemble self-supervised framework. Experiments on three datasets under 27 missing data conditions demonstrate that the imputation accuracy of PAST outperforms seven state-of-the-art baselines by up to 26.2% in RMSE and 31.6% in MAE.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Przemyslaw Chojecki
Robotics
ๆฅ็ๆ่ฆ
We propose a Kardashev-inspired yet operational Autonomous AI (AAI) Scale that measures the progression from fixed robotic process automation (AAI-0) to full artificial general intelligence (AAI-4) and beyond. Unlike narrative ladders, our scale is multi-axis and testable. We define ten capability axes (Autonomy, Generality, Planning, Memory/Persistence, Tool Economy, Self-Revision, Sociality/Coordination, Embodiment, World-Model Fidelity, Economic Throughput) aggregated by a composite AAI-Index (a weighted geometric mean). We introduce a measurable Self-Improvement Coefficient $ฮบ$ (capability growth per unit of agent-initiated resources) and two closure properties (maintenance and expansion) that convert ``self-improving AI'' into falsifiable criteria. We specify OWA-Bench, an open-world agency benchmark suite that evaluates long-horizon, tool-using, persistent agents. We define level gates for AAI-0\ldots AAI-4 using thresholds on the axes, $ฮบ$, and closure proofs. Synthetic experiments illustrate how present-day systems map onto the scale and how the delegability frontier (quality vs.\ autonomy) advances with self-improvement. We also prove a theorem that AAI-3 agent becomes AAI-5 over time with sufficient conditions, formalizing "baby AGI" becomes Superintelligence intuition.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Yuchen Bao, Yiting Wang, Wenjian Huang, Haowei Wang, Shen Chen et al.
Robotics
ๆฅ็ๆ่ฆ
Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect - such as editing text content - thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the "SCB Group", a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent "shortcut" phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: https://github.com/yusenbao01/TripleFDS
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chengdong Ma, Thรฉo Tao Zhaowei, Pengyu Li, Minghao Liu, Haojun Chen et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a fundamental challenge. This problem represents the local analogue of Hilbert's 18th problem on sphere packing, bridging geometry, number theory, and information theory. Although significant progress has been made through lattices and codes, the irregularities of high-dimensional geometry and exponentially growing combinatorial complexity beyond 8 dimensions, which exceeds the complexity of Go game, limit the scalability of existing methods. Here we model this problem as a two-player matrix completion game and train the game-theoretic reinforcement learning system, PackingStar, to efficiently explore high-dimensional spaces. The matrix entries represent pairwise cosines of sphere center vectors; one player fills entries while another corrects suboptimal ones, jointly maximizing the matrix size, corresponding to the kissing number. This cooperative dynamics substantially improves sample quality, making the extremely large spaces tractable. PackingStar reproduces previous configurations and surpasses all human-known records from dimensions 25 to 31, with the configuration in 25 dimensions geometrically corresponding to the Leech lattice and suggesting possible optimality. It achieves the first breakthrough beyond rational structures from 1971 in 13 dimensions and discovers over 6000 new structures in 14 and other dimensions. These results demonstrate AI's power to explore high-dimensional spaces beyond human intuition and open new pathways for the Kissing Number Problem and broader geometry problems.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fei Kong
ๆฅ็ๆ่ฆ
Recently, the Denoising Diffusion Codebook Models (DDCM) was proposed. DDCM leverages the Denoising Diffusion Probabilistic Model (DDPM) and replaces the random noise in the backward process with noise sampled from specific sets according to a predefined rule, thereby enabling image compression. However, DDCM cannot be applied to methods other than DDPM. In this paper, we propose the generalized Denoising Diffusion Compression Model (gDDCM), which extends DDCM to mainstream diffusion models and their variants, including DDPM, Score-Based Models, Consistency Models, and Rectified Flow. We evaluate our method on CIFAR-10 and LSUN Bedroom datasets. Experimental results demonstrate that our approach successfully generalizes DDCM to the aforementioned models and achieves improved performance.
๐
2025-11-17
๐ ArXiv Preprint (cs.DL)
๐ฅ Carlo Teo Pedretti, Davide Picca, Dario Rodighiero
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Diagrams are crucial yet underexplored tools in many disciplines, demonstrating the close connection between visual representation and scholarly reasoning. However, their iconic form poses obstacles to visual studies, intermedial analysis, and text-based digital workflows. In particular, Charles S. Peirce consistently advocated the use of diagrams as essential for reasoning and explanation. His manuscripts, often combining textual content with complex visual artifacts, provide a challenging case for studying documents involving heterogeneous materials. In this preliminary study, we investigate whether Visual Language Models (VLMs) can effectively help us identify and interpret such hybrid pages in context. First, we propose a workflow that (i) segments manuscript page layouts, (ii) reconnects each segment to IIIF-compliant annotations, and (iii) submits fragments containing diagrams to a VLM. In addition, by adopting Peirce's semiotic framework, we designed prompts to extract key knowledge about diagrams and produce concise captions. Finally, we integrated these captions into knowledge graphs, enabling structured representations of diagrammatic content within composite sources.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Prakrit Timilsina, Anuj Nepal, Rajan Kadel, Robin Doss
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) face significant challenges in distributed healthcare, including consolidating specialized domain knowledge across institutions while maintaining privacy, reducing computational overhead, and preventing catastrophic forgetting during model updates.This paper presents a systematic evaluation of six parameter-space merging techniques applied to two architecturally compatible medical LLMs derived from the Mistral-7B base model. We introduce a novel hierarchical method that combines selective Optimal Transport (OT) alignment for attention layers with cosine similarity-weighted interpolation, designed to address permutation variance while minimizing computational overhead for edge deployment scenarios. Our study evaluates Task Arithmetic, Linear Averaging, DARE-TIES, DELLA, Breadcrumbs, and our Hierarchical approach across five medical benchmarks. Results demonstrate that architecturally compatible models benefit significantly from simple averaging methods, with Task Arithmetic achieving 45.80% accuracy on MedQA, outperforming complex pruning-based approaches. These findings offer critical insights for the deployment of distributed medical AI in resource-constrained IoT environments, where computational efficiency and model compatibility are paramount. Our work establishes that for architecturally compatible models, simple averaging provides a robust and computationally efficient baseline for knowledge consolidation, offering a pragmatic path forward for scalable medical AI systems.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Caroline Baumgartner, Eleanor Spens, Neil Burgess, Petru Manescu
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
How do large language models solve spatial navigation tasks? We investigate this by training GPT-2 models on three spatial learning paradigms in grid environments: passive exploration (Foraging Model- predicting steps in random walks), goal-directed planning (generating optimal shortest paths) on structured Hamiltonian paths (SP-Hamiltonian), and a hybrid model fine-tuned with exploratory data (SP-Random Walk). Using behavioural, representational and mechanistic analyses, we uncover two fundamentally different learned algorithms. The Foraging model develops a robust, map-like representation of space, akin to a 'cognitive map'. Causal interventions reveal that it learns to consolidate spatial information into a self-sufficient coordinate system, evidenced by a sharp phase transition where its reliance on historical direction tokens vanishes by the middle layers of the network. The model also adopts an adaptive, hierarchical reasoning system, switching between a low-level heuristic for short contexts and map-based inference for longer ones. In contrast, the goal-directed models learn a path-dependent algorithm, remaining reliant on explicit directional inputs throughout all layers. The hybrid model, despite demonstrating improved generalisation over its parent, retains the same path-dependent strategy. These findings suggest that the nature of spatial intelligence in transformers may lie on a spectrum, ranging from generalisable world models shaped by exploratory data to heuristics optimised for goal-directed tasks. We provide a mechanistic account of this generalisation-optimisation trade-off and highlight how the choice of training regime influences the strategies that emerge.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kajetan Dymkiewicz, Ivan Vulic, Helen Yannakoudakis, Eilam Shapira, Roi Reichart et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) perform strongly across tasks and languages, yet how improvements in one task or language affect other tasks and languages and their combinations remains poorly understood. We conduct a controlled PEFT/LoRA study across multiple open-weight LLM families and sizes, treating task and language as transfer axes while conditioning on model family and size; we fine-tune each model on a single task-language source and measure transfer as the percentage-point change versus its baseline score when evaluated on all other task-language target pairs. We decompose transfer into (i) Matched-Task (Cross-Language), (ii) Matched-Language (Cross-Task), and (iii) Cross-Task (Cross-Language) regimes. We uncover two consistent general patterns. First, a pronounced on-task vs. off-task asymmetry: Matched-Task (Cross-Language) transfer is reliably positive, whereas off-task transfer often incurs collateral degradation. Second, a stable donor-recipient structure across languages and tasks (hub donors vs. brittle recipients). We outline implications for risk-aware fine-tuning and model specialisation.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Ruijun Deng, Zhihui Lu, Qiang Duan
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Split inference (SI) enables users to access deep learning (DL) services without directly transmitting raw data. However, recent studies reveal that data reconstruction attacks (DRAs) can recover the original inputs from the smashed data sent from the client to the server, leading to significant privacy leakage. While various defenses have been proposed, they often result in substantial utility degradation, particularly when the client-side model is shallow. We identify a key cause of this trade-off: existing defenses apply excessive perturbation to redundant information in the smashed data. To address this issue in computer vision tasks, we propose InfoDecom, a defense framework that first decomposes and removes redundant information and then injects noise calibrated to provide theoretically guaranteed privacy. Experiments demonstrate that InfoDecom achieves a superior utility-privacy trade-off compared to existing baselines. The code and the appendix are available at https://github.com/SASA-cloud/InfoDecom.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiyang Zheng, Islam Nassar, Thanh Vu, Xu Zhong, Yang Lin et al.
ๆฅ็ๆ่ฆ
Medical coding converts free-text clinical notes into standardized diagnostic and procedural codes, which are essential for billing, hospital operations, and medical research. Unlike ordinary text classification, it requires multi-step reasoning: extracting diagnostic concepts, applying guideline constraints, mapping to hierarchical codebooks, and ensuring cross-document consistency. Recent advances leverage agentic LLMs, but most rely on rigid, manually crafted workflows that fail to capture the nuance and variability of real-world documentation, leaving open the question of how to systematically learn effective workflows. We present MedDCR, a closed-loop framework that treats workflow design as a learning problem. A Designer proposes workflows, a Coder executes them, and a Reflector evaluates predictions and provides constructive feedback, while a memory archive preserves prior designs for reuse and iterative refinement. On benchmark datasets, MedDCR outperforms state-of-the-art baselines and produces interpretable, adaptable workflows that better reflect real coding practice, improving both the reliability and trustworthiness of automated systems.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuhang Wang, Yanxu Zhu, Jitao Sang
ๆฅ็ๆ่ฆ
The advanced reasoning capabilities of Large Reasoning Models enable them to thoroughly understand and apply safety policies through deliberate thought processes, thereby improving the models' safety. Beyond safety, these models must also be able to reflect the diverse range of human values across various cultures. This paper presents the Cultural Norm-based Cultural Alignment (CNCA) framework, which enables models to leverage their powerful reasoning ability to align with cultural norms. Specifically, we propose three methods to automatically mine cultural norms from limited survey data and explore ways to effectively utilize these norms for improving cultural alignment. Two alignment paradigms are examined: an in-context alignment method, where cultural norms are explicitly integrated into the user context, and a fine-tuning-based method, which internalizes norms through enhanced Chain-of-Thought training data. Comprehensive experiments demonstrate the effectiveness of these methods, highlighting that models with stronger reasoning capabilities benefit more from cultural norm mining and utilization. Our findings emphasize the potential for reasoning models to better reflect diverse human values through culturally informed alignment strategies.
๐
2025-11-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Lei Wang, Yulong Tian, Hao Han, Fengyuan Xu
Machine Learning
ๆฅ็ๆ่ฆ
Backdoor attacks pose severe threats to machine learning systems, prompting extensive research in this area. However, most existing work focuses on single-target All-to-One (A2O) attacks, overlooking the more complex All-to-X (A2X) attacks with multiple target classes, which are often assumed to have low attack success rates. In this paper, we first demonstrate that A2X attacks are robust against state-of-the-art defenses. We then propose a novel attack strategy that enhances the success rate of A2X attacks while maintaining robustness by optimizing grouping and target class assignment mechanisms. Our method improves the attack success rate by up to 28%, with average improvements of 6.7%, 16.4%, 14.1% on CIFAR10, CIFAR100, and Tiny-ImageNet, respectively. We anticipate that this study will raise awareness of A2X attacks and stimulate further research in this under-explored area. Our code is available at https://github.com/kazefjj/A2X-backdoor .
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lucas Gabriel Telesco, Danila Nejamkin, Estefanรญa Mata, Francisco Filizzola, Kevin Wignall et al.
ๆฅ็ๆ่ฆ
Retinal image quality assessment (RIQA) supports computer-aided diagnosis of eye diseases. However, most tools classify only overall image quality, without indicating acquisition defects to guide recapture. This gap is mainly due to the high cost of detailed annotations. In this paper, we aim to mitigate this limitation by introducing a hybrid semi-supervised learning approach that combines manual labels for overall quality with pseudo-labels of quality details within a multi-task framework. Our objective is to obtain more interpretable RIQA models without requiring extensive manual labeling. Pseudo-labels are generated by a Teacher model trained on a small dataset and then used to fine-tune a pre-trained model in a multi-task setting. Using a ResNet-18 backbone, we show that these weak annotations improve quality assessment over single-task baselines (F1: 0.875 vs. 0.863 on EyeQ, and 0.778 vs. 0.763 on DeepDRiD), matching or surpassing existing methods. The multi-task model achieved performance statistically comparable to the Teacher for most detail prediction tasks (p > 0.05). In a newly annotated EyeQ subset released with this paper, our model performed similarly to experts, suggesting that pseudo-label noise aligns with expert variability. Our main finding is that the proposed semi-supervised approach not only improves overall quality assessment but also provides interpretable feedback on capture conditions (illumination, clarity, contrast). This enhances interpretability at no extra manual labeling cost and offers clinically actionable outputs to guide image recapture.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinlan Wu, Bin Zhu, Feng Han, Pengkun Jiao, Jingjing Chen
Multimodal
ๆฅ็ๆ่ฆ
Food analysis has become increasingly critical for health-related tasks such as personalized nutrition and chronic disease prevention. However, existing large multimodal models (LMMs) in food analysis suffer from catastrophic forgetting when learning new tasks, requiring costly retraining from scratch. To address this, we propose a novel continual learning framework for multimodal food learning, integrating a Dual-LoRA architecture with Quality-Enhanced Pseudo Replay. We introduce two complementary low-rank adapters for each task: a specialized LoRA that learns task-specific knowledge with orthogonal constraints to previous tasks' subspaces, and a cooperative LoRA that consolidates shared knowledge across tasks via pseudo replay. To improve the reliability of replay data, our Quality-Enhanced Pseudo Replay strategy leverages self-consistency and semantic similarity to reduce hallucinations in generated samples. Experiments on the comprehensive Uni-Food dataset show superior performance in mitigating forgetting, representing the first effective continual learning approach for complex food tasks.
๐
2025-11-17
๐ ArXiv Preprint (cs.SE)
๐ฅ Zihe Yan, Kai Luo, Haoyu Yang, Yang Yu, Zhuosheng Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In modern software development workflows, the open-source software supply chain contributes significantly to efficient and convenient engineering practices. With increasing system complexity, using open-source software as third-party dependencies has become a common practice. However, the lack of maintenance for underlying dependencies and insufficient community auditing create challenges in ensuring source code security and the legitimacy of repository maintainers, especially under high-stealthy backdoor attacks exemplified by the XZ-Util incident. To address these problems, we propose a fine-grained project evaluation framework for backdoor risk assessment in open-source software. The framework models stealthy backdoor attacks from the viewpoint of the attacker and defines targeted metrics for each attack stage. In addition, to overcome the limitations of static analysis in assessing the reliability of repository maintenance activities such as irregular committer privilege escalation and limited participation in reviews, the framework uses large language models (LLMs) to conduct semantic evaluation of code repositories without relying on manually crafted patterns. The framework is evaluated on sixty six high-priority packages in the Debian ecosystem. The experimental results indicate that the current open-source software supply chain is exposed to various security risks.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Maram Alharbi, Salmane Chafik, Saad Ezzini, Ruslan Mitkov, Tharindu Ranasinghe et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Alexandru-Mihai Apostu, Andrei Preda, Alexandra Daniela Damir, Diana Bolocan, Radu Tudor Ionescu et al.
ๆฅ็ๆ่ฆ
Generating thorough natural language explanations for threat detections remains an open problem in cybersecurity research, despite significant advances in automated malware detection systems. In this work, we present AutoMalDesc, an automated static analysis summarization framework that, following initial training on a small set of expert-curated examples, operates independently at scale. This approach leverages an iterative self-paced learning pipeline to progressively enhance output quality through synthetic data generation and validation cycles, eliminating the need for extensive manual data annotation. Evaluation across 3,600 diverse samples in five scripting languages demonstrates statistically significant improvements between iterations, showing consistent gains in both summary quality and classification accuracy. Our comprehensive validation approach combines quantitative metrics based on established malware labels with qualitative assessment from both human experts and LLM-based judges, confirming both technical precision and linguistic coherence of generated summaries. To facilitate reproducibility and advance research in this domain, we publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) and test (3.6K) datasets, along with our methodology and evaluation framework.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Senne Deproost, Dennis Steckelmacher, Ann Nowรฉ
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Deep Reinforcement Learning is one of the state-of-the-art methods for producing near-optimal system controllers. However, deep RL algorithms train a deep neural network, that lacks transparency, which poses challenges when the controller has to meet regulations, or foster trust. To alleviate this, one could transfer the learned behaviour into a model that is human-readable by design using knowledge distilla- tion. Often this is done with a single model which mimics the original model on average but could struggle in more dynamic situations. A key challenge is that this simpler model should have the right balance be- tween flexibility and complexity or right balance between balance bias and accuracy. We propose a new model-agnostic method to divide the state space into regions where a simplified, human-understandable model can operate in. In this paper, we use Voronoi partitioning to find regions where linear models can achieve similar performance to the original con- troller. We evaluate our approach on a gridworld environment and a classic control task. We observe that our proposed distillation to locally- specialized linear models produces policies that are explainable and show that the distillation matches or even slightly outperforms the black-box policy they are distilled from.
๐
2025-11-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Chelsea McMurray, Hayder Tirmazi
Natural Language Processing
ๆฅ็ๆ่ฆ
Users increasingly rely on large language models (LLMs) for personal, emotionally charged, and socially sensitive conversations. However, prompts sent to cloud-hosted models can contain personally identifiable information (PII) that users do not want logged, retained, or leaked. We observe this to be especially acute when users discuss friends, coworkers, or adversaries, i.e., when they spill the tea. Enterprises face the same challenge when they want to use LLMs for internal communication and decision-making.
In this whitepaper, we present Whistledown, a best-effort privacy layer that modifies prompts before they are sent to the LLM. Whistledown combines pseudonymization and $ฮต$-local differential privacy ($ฮต$-LDP) with transformation caching to provide best-effort privacy protection without sacrificing conversational utility. Whistledown is designed to have low compute and memory overhead, allowing it to be deployed directly on a client's device in the case of individual users. For enterprise users, Whistledown is deployed centrally within a zero-trust gateway that runs on an enterprise's trusted infrastructure. Whistledown requires no changes to the existing APIs of popular LLM providers.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Narthana Sivalingam, Santhirarajah Sivasthigan, Thamayanthi Mahendranathan, G. M. R. I. Godaliyadda, M. P. B. Ekanayake et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Group activity detection in multi-person scenes is challenging due to complex human interactions, occlusions, and variations in appearance over time. This work presents a computer vision based framework for group activity recognition and action spotting using a combination of deep learning models and graph based relational reasoning. The system first applies Mask R-CNN to obtain accurate actor localization through bounding boxes and instance masks. Multiple backbone networks, including Inception V3, MobileNet, and VGG16, are used to extract feature maps, and RoIAlign is applied to preserve spatial alignment when generating actor specific features. The mask information is then fused with the feature maps to obtain refined masked feature representations for each actor. To model interactions between individuals, we construct Actor Relation Graphs that encode appearance similarity and positional relations using methods such as normalized cross correlation, sum of absolute differences, and dot product. Graph Convolutional Networks operate on these graphs to reason about relationships and predict both individual actions and group level activities. Experiments on the Collective Activity dataset demonstrate that the combination of mask based feature refinement, robust similarity search, and graph neural network reasoning leads to improved recognition performance across both crowded and non crowded scenarios. This approach highlights the potential of integrating segmentation, feature extraction, and relational graph reasoning for complex video understanding tasks.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jonas Bode, Raphael Memmesheimer, Sven Behnke
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Acting in human environments is a crucial capability for general-purpose robots, necessitating a robust understanding of natural language and its application to physical tasks. This paper seeks to harness the capabilities of diffusion models within a visuomotor policy framework that merges visual and textual inputs to generate precise robotic trajectories. By employing reference demonstrations during training, the model learns to execute manipulation tasks specified through textual commands within the robot's immediate environment. The proposed research aims to extend an existing model by leveraging improved embeddings, and adapting techniques from diffusion models for image generation. We evaluate our methods on the CALVIN dataset, proving enhanced performance on various manipulation tasks and an increased long-horizon success rate when multiple tasks are executed in sequence. Our approach reinforces the usefulness of diffusion models and contributes towards general multitask manipulation.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bowen Ye, Bin Zhang, Hang Zhao
Robotics
ๆฅ็ๆ่ฆ
Gaining sustainable performance improvement with scaling data and model budget remains a pivotal yet unresolved challenge in autonomous driving. While autoregressive models exhibited promising data-scaling efficiency in planning tasks, predicting ego trajectories alone suffers sparse supervision and weakly constrains how scene evolution should shape ego motion. Therefore, we introduce DAP, a discrete-token autoregressive planner that jointly forecasts BEV semantics and ego trajectories, thereby enforcing comprehensive representation learning and allowing predicted dynamics to directly condition ego motion. In addition, we incorporate a reinforcement-learning-based fine-tuning, which preserves supervised behavior cloning priors while injecting reward-guided improvements. Despite a compact 160M parameter budget, DAP achieves state-of-the-art performance on open-loop metrics and delivers competitive closed-loop results on the NAVSIM benchmark. Overall, the fully discrete-token autoregressive formulation operating on both rasterized BEV and ego actions provides a compact yet scalable planning paradigm for autonomous driving.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chuang Zhao, Hui Tang, Hongke Zhao, Xiaofang Zhou, Xiaomeng Li
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Accurate healthcare prediction is critical for improving patient outcomes and reducing operational costs. Bolstered by growing reasoning capabilities, large language models (LLMs) offer a promising path to enhance healthcare predictions by drawing on their rich parametric knowledge. However, LLMs are prone to factual inaccuracies due to limitations in the reliability and coverage of their embedded knowledge. While retrieval-augmented generation (RAG) frameworks, such as GraphRAG and its variants, have been proposed to mitigate these issues by incorporating external knowledge, they face two key challenges in the healthcare scenario: (1) identifying the clinical necessity to activate the retrieval mechanism, and (2) achieving synergy between the retriever and the generator to craft contextually appropriate retrievals. To address these challenges, we propose GHAR, a \underline{g}enerative \underline{h}ierarchical \underline{a}gentic \underline{R}AG framework that simultaneously resolves when to retrieve and how to optimize the collaboration between submodules in healthcare. Specifically, for the first challenge, we design a dual-agent architecture comprising Agent-Top and Agent-Low. Agent-Top acts as the primary physician, iteratively deciding whether to rely on parametric knowledge or to initiate retrieval, while Agent-Low acts as the consulting service, summarising all task-relevant knowledge once retrieval was triggered. To tackle the second challenge, we innovatively unify the optimization of both agents within a formal Markov Decision Process, designing diverse rewards to align their shared goal of accurate prediction while preserving their distinct roles. Extensive experiments on three benchmark datasets across three popular tasks demonstrate our superiority over state-of-the-art baselines, highlighting the potential of hierarchical agentic RAG in advancing healthcare systems.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Jea Kwon, Luiz Felipe Vecchietti, Sungwon Park, Meeyoung Cha
Natural Language Processing
ๆฅ็ๆ่ฆ
Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, it is important to understand their moral reasoning and the inherent uncertainties in building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models and 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than within moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via "dropout" at inference time. Our findings show that our mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, this mechanism significantly improves human-LLM moral alignment, with correlations in mutual information and alignment score shifts. Our results highlight the potential to better align model-generated decisions and human preferences by deliberately modulating uncertainty and reducing LLMs' confidence in morally complex scenarios.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system. This may limit the performances due to different distributions underlying for different agents. Therefore, training multi-agent systems with distinct LLMs should be the next step to solve. However, this approach introduces optimization challenges. For example, agents operate at different frequencies, rollouts involve varying sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical Multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents enhances tool-augmented reasoning tasks.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Taras Sereda, Tom St. John, Burak Bartan, Natalie Serrino, Sachin Katti et al.
Machine Learning
ๆฅ็ๆ่ฆ
GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and iteratively refines programs through compilation and correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. This agent-based architecture requires only a single-shot example to target new platforms.
We make three key contributions: (1) introducing an iterative refinement system where the generation agent and performance analysis agent collaborate through functional and optimization passes, interpreting diverse profiling data (from programmatic APIs to GUI-based tools) to generate actionable recommendations that guide program synthesis for arbitrary accelerators; (2) demonstrating that the generation agent effectively leverages cross-platform knowledge transfer, where a reference implementation from one architecture substantially improves generation quality for different hardware targets; and (3) validating the platform-agnostic nature of our approach by demonstrating effective program synthesis across fundamentally different parallel computing platforms: NVIDIA CUDA and Apple Metal.
๐
2025-11-17
๐ ArXiv Preprint (cs.SD)
๐ฅ Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current ALLMs. To investigate this issue, we introduce AMPBench, the first benchmark explicitly designed to evaluate auditory motion understanding. AMPBench introduces a controlled question-answering benchmark designed to evaluate whether Audio-Language Models (LALMs) can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.
๐
2025-11-17
๐ ArXiv Preprint (cs.SE)
๐ฅ Rufeng Chen, Shuaishuai Jiang, Jiyun Shen, AJung Moon, Lili Wei
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students of two different levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI toward task completion often without knowledge gain in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning rather than a problem solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yushuo Zheng, Jiangyong Ying, Huiyu Duan, Chunyi Li, Zicheng Zhang et al.
RoboticsComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks, however their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, \textit{etc}. To bridge this gap, we introduce \textbf{GeoX-Bench}, a comprehensive \underline{Bench}mark designed to explore and evaluate the capabilities of LMMs in \underline{cross}-view \underline{Geo}-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with corresponding 755,976 question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remaining are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the empowered capabilities of instruction-tuning. Our benchmark demonstrate that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement, and instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve the cross-view geo-sense abilities. The GeoX-Bench is available at \textcolor{magenta}{https://github.com/IntMeGroup/GeoX-Bench}.
๐
2025-11-17
๐ ArXiv Preprint (cs.LO)
๐ฅ Matt Luckcuck, Maike Schwammberger, Mengwei Xu
Robotics
ๆฅ็ๆ่ฆ
This EPTCS volume contains the papers from the Seventh International Workshop on Formal Methods for Autonomous Systems (FMAS 2025), which was held between the 17th and 19th of November 2025. The goal of the FMAS workshop series is to bring together leading researchers who are using formal methods to tackle the unique challenges that autonomous systems present, so that they can publish and discuss their work with a growing community of researchers. FMAS 2025 was co-located with the 20th International Conference on integrated Formal Methods (iFM'25), hosted by Inria Paris, France at the Inria Paris Center.
In total, FMAS 2025 received 16 submissions from researchers at institutions in: Canada, China, France, Germany, Ireland, Italy, Japan, the Netherlands, Portugal, Sweden, the United States of America, and the United Kingdom. Though we received fewer submissions than last year, we are encouraged to see the submissions being sent from a wide range of countries. Submissions come from both past and new FMAS authors, which shows us that the existing community appreciates the network that FMAS has built over the past 7 years, while new authors also show the FMAS community's great potential of growth.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nadav Bojan Sellam, Meital Bojan, Paul Schanda, Alex Bronstein
Machine Learning
ๆฅ็ๆ่ฆ
Accurate protein structures are essential for understanding biological function, yet incorporating experimental data into protein generative models remains a major challenge. Most predictors of experimental observables are non-differentiable, making them incompatible with gradient-based conditional sampling. This is especially limiting in nuclear magnetic resonance, where rich data such as chemical shifts are hard to directly integrate into generative modeling. We introduce a framework for non-differentiable guidance of protein generative models, coupling a continuous diffusion-based generator with any black-box objective via a tailored genetic algorithm. We demonstrate its effectiveness across three modalities: pairwise distance constraints, nuclear Overhauser effect restraints, and for the first time chemical shifts. These results establish chemical shift guided structure generation as feasible, expose key weaknesses in current predictors, and showcase a general strategy for incorporating diverse experimental signals. Our work points toward automated, data-conditioned protein modeling beyond the limits of differentiability.
๐
2025-11-17
๐ AAAI
๐ฅ Xiaoqi Han, Ru Li, Ran Yi, Hongye Tan, Zhuomin Liang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, obscure overfitting. We propose a comprehensive locality evaluation framework, covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, uncovering a phenomenon we term transient blindness, overfitting to edit-similar text while ignoring visuals. Token analysis shows edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Patrick Parschan, Charlott Jakob
Natural Language Processing
ๆฅ็ๆ่ฆ
This article presents the first systematic review of unsupervised and semi-supervised computational text-based ideal point estimation (CT-IPE) algorithms, methods designed to infer latent political positions from textual data. These algorithms are widely used in political science, communication, computational social science, and computer science to estimate ideological preferences from parliamentary speeches, party manifestos, and social media. Over the past two decades, their development has closely followed broader NLP trends -- beginning with word-frequency models and most recently turning to large language models (LLMs). While this trajectory has greatly expanded the methodological toolkit, it has also produced a fragmented field that lacks systematic comparison and clear guidance for applied use. To address this gap, we identified 25 CT-IPE algorithms through a systematic literature review and conducted a manual content analysis of their modeling assumptions and development contexts. To compare them meaningfully, we introduce a conceptual framework that distinguishes how algorithms generate, capture, and aggregate textual variance. On this basis, we identify four methodological families -- word-frequency, topic modeling, word embedding, and LLM-based approaches -- and critically assess their assumptions, interpretability, scalability, and limitations. Our review offers three contributions. First, it provides a structured synthesis of two decades of algorithm development, clarifying how diverse methods relate to one another. Second, it translates these insights into practical guidance for applied researchers, highlighting trade-offs in transparency, technical requirements, and validation strategies that shape algorithm choice. Third, it emphasizes that differences in estimation outcomes across algorithms are themselves informative, underscoring the need for systematic benchmarking.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Michele Persiani, Thomas Hellstrom
ๆฅ็ๆ่ฆ
When a robot is asked to verbalize its plan it can do it in many ways. For example, a seemingly natural strategy is incremental, where the robot verbalizes its planned actions in plan order. However, an important aspect of this type of strategy is that it misses considerations on what is effectively informative to communicate, because not considering what the user knows prior to explanations. In this paper we propose a verbalization strategy to communicate robot plans informatively, by measuring the information gain that verbalizations have against a second-order theory of mind of the user capturing his prior knowledge on the robot. As shown in our experiments, this strategy allows to understand the robot's goal much quicker than by using strategies such as increasing or decreasing plan order. In addition, following our formulation we hint to what is informative and why when a robot communicates its plan.
๐
2025-11-17
๐ NeurIPS 2025
๐ฅ Yuxiang Zhang, Zhengxu Yu, Weihang Pan, Zhongming Jin, Qiang Fu et al.
ๆฅ็ๆ่ฆ
Emerging reasoning LLMs such as OpenAI-o1 and DeepSeek-R1 have achieved strong performance on complex reasoning tasks by generating long chain-of-thought (CoT) traces. However, these long CoTs result in increased token usage, leading to higher inference latency and memory consumption. As a result, balancing accuracy and reasoning efficiency has become essential for deploying reasoning LLMs in practical applications. Existing long-to-short (Long2Short) methods aim to reduce inference length but often sacrifice accuracy, revealing a need for an approach that maintains performance while lowering token costs. To address this efficiency-accuracy tradeoff, we propose TokenSqueeze, a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. First, to prevent performance degradation caused by excessive compression of reasoning depth, we propose to select self-generated samples whose reasoning depth is adaptively matched to the complexity of the problem. To further optimize the linguistic expression without altering the underlying reasoning paths, we introduce a distribution-aligned linguistic refinement method that enhances the clarity and conciseness of the reasoning path while preserving its logical integrity. Comprehensive experimental results demonstrate the effectiveness of TokenSqueeze in reducing token usage while maintaining accuracy. Notably, DeepSeek-R1-Distill-Qwen-7B fine-tuned using our proposed method achieved a 50\% average token reduction while preserving accuracy on the MATH500 benchmark. TokenSqueeze exclusively utilizes the model's self-generated data, enabling efficient and high-fidelity reasoning without relying on manually curated short-answer datasets across diverse applications. Our code is available at https://github.com/zhangyx1122/TokenSqueeze.
๐
2025-11-17
๐ ArXiv Preprint (cs.SD)
๐ฅ Satvik Dixit, Koichi Saito, Zhi Zhong, Yuki Mitsufuji, Chris Donahue
Multimodal
ๆฅ็ๆ่ฆ
Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for the creation of Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet, there is a mismatch between evaluation and downstream applications due to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, they are dominated by speech and music, domains that lie outside the use case for Foley. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on-screen events. The dataset is built using an automated, scalable pipeline applied to in-the-wild internet videos from YouTube-based and Vimeo-based sources. Compared to past datasets, we show that videos from FoleyBench have stronger coverage of sound categories from a taxonomy specifically designed for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark several state-of-the-art V2A models, evaluating them on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at: https://gclef-cmu.org/foleybench
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Guillaume Infantes, Stรฉphanie Roussel, Antoine Jacquet, Emmanuel Benazera
Machine Learning
ๆฅ็ๆ่ฆ
The Resource-Constrained Project Scheduling Problem (RCPSP) is a classical scheduling problem that has received significant attention due to of its numerous applications in industry. However, in practice, task durations are subject to uncertainty that must be considered in order to propose resilient scheduling. In this paper, we address the RCPSP variant with uncertain tasks duration (modeled using known probabilities) and aim to minimize the overall expected project duration. Our objective is to produce a baseline schedule that can be reused multiple times in an industrial setting regardless of the actual duration scenario. We leverage Graph Neural Networks in conjunction with Deep Reinforcement Learning (DRL) to develop an effective policy for task scheduling. This policy operates similarly to a priority dispatch rule and is paired with a Serial Schedule Generation Scheme to produce a schedule. Our empirical evaluation on standard benchmarks demonstrates the approach's superiority in terms of performance and its ability to generalize. The developed framework, Wheatley, is made publicly available online to facilitate further research and reproducibility.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhixin Ou, Peng Liang, Jianchen Han, Baihui Liu, Linbo Qiao
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing neither communication-parallelization cancellation on short sequences nor out-of-memory on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly optimal strategy adoption according to the immediate input sequence. It first implements the modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques together, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries. We compare ParaDySe with baselines on representative LLMs under datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Chengpei Tang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-agent systems (MAS) built on large language models (LLMs) often suffer from inefficient "free-for-all" communication, leading to exponential token costs and low signal-to-noise ratios that hinder their practical deployment. We challenge the notion that more communication is always beneficial, hypothesizing instead that the core issue is the absence of resource rationality. We argue that "free" communication, by ignoring the principle of scarcity, inherently breeds inefficiency and unnecessary expenses. To address this, we introduce the Dynamic Auction-based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource. Specifically, our DALA regards inter-agent communication as a centralized auction, where agents learn to bid for the opportunity to speak based on the predicted value density of their messages. Thus, our DALA intrinsically encourages agents to produce concise, informative messages while filtering out low-value communication. Extensive and comprehensive experiments demonstrate that our economically-driven DALA achieves new state-of-the-art performance across seven challenging reasoning benchmarks, including 84.32% on MMLU and a 91.21% pass@1 rate on HumanEval. Note that this is accomplished with remarkable efficiency, i.e., our DALA uses only 6.25 million tokens, a fraction of the resources consumed by current state-of-the-art methods on GSM8K. Further analysis reveals that our DALA cultivates the emergent skill of strategic silence, effectively adapting its communication strategies from verbosity to silence in a dynamical manner via resource constraints.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haodong Wang, Tao Zhuo, Xiuwei Zhang, Hanlin Yin, Wencong Wu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Achieving pixel-level registration between SAR and optical images remains a challenging task due to their fundamentally different imaging mechanisms and visual characteristics. Although deep learning has achieved great success in many cross-modal tasks, its performance on SAR-Optical registration tasks is still unsatisfactory. Gradient-based information has traditionally played a crucial role in handcrafted descriptors by highlighting structural differences. However, such gradient cues have not been effectively leveraged in deep learning frameworks for SAR-Optical image matching. To address this gap, we propose SOMA, a dense registration framework that integrates structural gradient priors into deep features and refines alignment through a hybrid matching strategy. Specifically, we introduce the Feature Gradient Enhancer (FGE), which embeds multi-scale, multi-directional gradient filters into the feature space using attention and reconstruction mechanisms to boost feature distinctiveness. Furthermore, we propose the Global-Local Affine-Flow Matcher (GLAM), which combines affine transformation and flow-based refinement within a coarse-to-fine architecture to ensure both structural consistency and local accuracy. Experimental results demonstrate that SOMA significantly improves registration precision, increasing the CMR@1px by 12.29% on the SEN1-2 dataset and 18.50% on the GFGE_SO dataset. In addition, SOMA exhibits strong robustness and generalizes well across diverse scenes and resolutions.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zhaoxin Shen, Dan Wu
ๆฅ็ๆ่ฆ
To leverage user behavior data from the Internet more effectively in recommender systems, this paper proposes a novel collaborative filtering (CF) method called Local Collaborative Filtering (LCF). LCF utilizes local similarities among users and integrates their data using the law of large numbers (LLN), thereby improving the utilization of user behavior data. Experiments are conducted on the Steam game dataset, and the results of LCF align with real-world needs.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ TC Singh, Sougata Mukherjea
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) excel in graph-based learning tasks, but their complex, non-linear operations often render them as opaque "black boxes". This opacity hinders user trust, complicates debugging, bias detection, and adoption in critical domains requiring explainability. This paper introduces InteractiveGNNExplainer, a visual analytics framework to enhance GNN explainability, focusing on node classification. Our system uniquely integrates coordinated interactive views (dynamic graph layouts, embedding projections, feature inspection, neighborhood analysis) with established post-hoc (GNNExplainer) and intrinsic (GAT attention) explanation techniques. Crucially, it incorporates interactive graph editing, allowing users to perform a "what-if" analysis by perturbing graph structures and observing immediate impacts on GNN predictions and explanations. We detail the system architecture and, through case studies on Cora and CiteSeer datasets, demonstrate how InteractiveGNNExplainer facilitates in-depth misclassification diagnosis, comparative analysis of GCN versus GAT behaviors, and rigorous probing of model sensitivity. These capabilities foster a deeper, multifaceted understanding of GNN predictions, contributing to more transparent, trustworthy, and robust graph analysis.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cesar Portocarrero Rodriguez, Laura Vandeweyen, Yosuke Yamamoto
RoboticsNatural Language ProcessingMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
The American Society of Civil Engineers has graded Americas infrastructure condition as a C, with the road system receiving a dismal D. Roads are vital to regional economic viability, yet their management, maintenance, and repair processes remain inefficient, relying on outdated manual or laser-based inspection methods that are both costly and time-consuming. With the increasing availability of real-time visual data from autonomous vehicles, there is an opportunity to apply computer vision (CV) methods for advanced road monitoring, providing insights to guide infrastructure rehabilitation efforts. This project explores the use of state-of-the-art CV techniques for road distress segmentation. It begins by evaluating synthetic data generated with Generative Adversarial Networks (GANs) to assess its usefulness for model training. The study then applies Convolutional Neural Networks (CNNs) for road distress segmentation and subsequently examines the transformer-based model MaskFormer. Results show that GAN-generated data improves model performance and that MaskFormer outperforms the CNN model in two metrics: mAP50 and IoU.
๐
2025-11-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Gorka Abad, Marina Krฤek, Stefanos Koffas, Behrad Tajalli, Marco Arazzi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Backdoor attacks pose a significant threat to deep learning models by implanting hidden vulnerabilities that can be activated by malicious inputs. While numerous defenses have been proposed to mitigate these attacks, the heterogeneous landscape of evaluation methodologies hinders fair comparison between defenses. This work presents a systematic (meta-)analysis of backdoor defenses through a comprehensive literature review and empirical evaluation. We analyzed 183 backdoor defense papers published between 2018 and 2025 across major AI and security venues, examining the properties and evaluation methodologies of these defenses.
Our analysis reveals significant inconsistencies in experimental setups, evaluation metrics, and threat model assumptions in the literature. Through extensive experiments involving three datasets (MNIST, CIFAR-100, ImageNet-1K), four model architectures (ResNet-18, VGG-19, ViT-B/16, DenseNet-121), 16 representative defenses, and five commonly used attacks, totaling over 3\,000 experiments, we demonstrate that defense effectiveness varies substantially across different evaluation setups. We identify critical gaps in current evaluation practices, including insufficient reporting of computational overhead and behavior under benign conditions, bias in hyperparameter selection, and incomplete experimentation. Based on our findings, we provide concrete challenges and well-motivated recommendations to standardize and improve future defense evaluations. Our work aims to equip researchers and industry practitioners with actionable insights for developing, assessing, and deploying defenses to different systems.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Yanda Zhu, Yuanyang Zhu, Daoyi Dong, Caihua Chen, Chunlin Chen
Machine Learning
ๆฅ็ๆ่ฆ
Task decomposition has shown promise in complex cooperative multi-agent reinforcement learning (MARL) tasks, which enables efficient hierarchical learning for long-horizon tasks in dynamic and uncertain environments. However, learning dynamic task decomposition from scratch generally requires a large number of training samples, especially exploring the large joint action space under partial observability. In this paper, we present the Conditional Diffusion Model for Dynamic Task Decomposition (C$\text{D}^\text{3}$T), a novel two-level hierarchical MARL framework designed to automatically infer subtask and coordination patterns. The high-level policy learns subtask representation to generate a subtask selection strategy based on subtask effects. To capture the effects of subtasks on the environment, C$\text{D}^\text{3}$T predicts the next observation and reward using a conditional diffusion model. At the low level, agents collaboratively learn and share specialized skills within their assigned subtasks. Moreover, the learned subtask representation is also used as additional semantic information in a multi-head attention mixing network to enhance value decomposition and provide an efficient reasoning bridge between individual and joint value functions. Experimental results on various benchmarks demonstrate that C$\text{D}^\text{3}$T achieves better performance than existing baselines.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shudong Wang, Xinfei Wang, Chenhao Zhang, Shanchen Pang, Haiyuan Gui et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Multi-task reinforcement learning (MTRL) seeks to learn a unified policy for diverse tasks, but often suffers from gradient conflicts across tasks. Existing masking-based methods attempt to mitigate such conflicts by assigning task-specific parameter masks. However, our empirical study shows that coarse-grained binary masks have the problem of over-suppressing key conflicting parameters, hindering knowledge sharing across tasks. Moreover, different tasks exhibit varying conflict levels, yet existing methods use a one-size-fits-all fixed sparsity strategy to keep training stability and performance, which proves inadequate. These limitations hinder the model's generalization and learning efficiency.
To address these issues, we propose SoCo-DT, a Soft Conflict-resolution method based by parameter importance. By leveraging Fisher information, mask values are dynamically adjusted to retain important parameters while suppressing conflicting ones. In addition, we introduce a dynamic sparsity adjustment strategy based on the Interquartile Range (IQR), which constructs task-specific thresholding schemes using the distribution of conflict and harmony scores during training. To enable adaptive sparsity evolution throughout training, we further incorporate an asymmetric cosine annealing schedule to continuously update the threshold. Experimental results on the Meta-World benchmark show that SoCo-DT outperforms the state-of-the-art method by 7.6% on MT50 and by 10.5% on the suboptimal dataset, demonstrating its effectiveness in mitigating gradient conflicts and improving overall multi-task performance.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenyang Li, Wenbing Tang, Yihao Huang, Sinong Simon Zhan, Ming Hu et al.
Robotics
ๆฅ็ๆ่ฆ
Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Gagan Raj Gupta, Anshul Kumar, Manish Rai, Apu Chakraborty, Ashutosh Modi et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks. In telecommunications, they hold the potential to transform network optimization, automate troubleshooting, enhance customer support, and ensure regulatory compliance. However, their deployment in telecom is hindered by domain-specific challenges that demand specialized adaptation. To overcome these challenges and to accelerate the adaptation of LLMs for telecom, we propose MM-Telco, a comprehensive suite of multimodal benchmarks and models tailored for the telecom domain. The benchmark introduces various tasks (both text based and image based) that address various practical real-life use cases such as network operations, network management, improving documentation quality, and retrieval of relevant text and images. Further, we perform baseline experiments with various LLMs and VLMs. The models fine-tuned on our dataset exhibit a significant boost in performance. Our experiments also help analyze the weak areas in the working of current state-of-art multimodal LLMs, thus guiding towards further development and research.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Quanjiang Guo, Sijie Wang, Jinchuan Zhang, Ben Zhang, Zhao Kang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Zero-shot event extraction (ZSEE) remains a significant challenge for large language models (LLMs) due to the need for complex reasoning and domain-specific understanding. Direct prompting often yields incomplete or structurally invalid outputs--such as misclassified triggers, missing arguments, and schema violations. To address these limitations, we present Agent-Event-Coder (AEC), a novel multi-agent framework that treats event extraction like software engineering: as a structured, iterative code-generation process. AEC decomposes ZSEE into specialized subtasks--retrieval, planning, coding, and verification--each handled by a dedicated LLM agent. Event schemas are represented as executable class definitions, enabling deterministic validation and precise feedback via a verification agent. This programming-inspired approach allows for systematic disambiguation and schema enforcement through iterative refinement. By leveraging collaborative agent workflows, AEC enables LLMs to produce precise, complete, and schema-consistent extractions in zero-shot settings. Experiments across five diverse domains and six LLMs demonstrate that AEC consistently outperforms prior zero-shot baselines, showcasing the power of treating event extraction like code generation. The code and data are released on https://github.com/UESTC-GQJ/Agent-Event-Coder.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qipeng Song, Nan Yang, Ziqi Xu, Yue Li, Wei Shao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Machine unlearning aims to eliminate the influence of specific data from trained models to ensure privacy compliance. However, most existing methods assume full access to the original training dataset, which is often impractical. We address a more realistic yet challenging setting: few-shot zero-glance, where only a small subset of the retained data is available and the forget set is entirely inaccessible. We introduce GFOES, a novel framework comprising a Generative Feedback Network (GFN) and a two-phase fine-tuning procedure. GFN synthesises Optimal Erasure Samples (OES), which induce high loss on target classes, enabling the model to forget class-specific knowledge without access to the original forget data, while preserving performance on retained classes. The two-phase fine-tuning procedure enables aggressive forgetting in the first phase, followed by utility restoration in the second. Experiments on three image classification datasets demonstrate that GFOES achieves effective forgetting at both logit and representation levels, while maintaining strong performance using only 5% of the original data. Our framework offers a practical and scalable solution for privacy-preserving machine learning under data-constrained conditions.
๐
2025-11-17
๐ ArXiv Preprint (hep-ex)
๐ฅ Rasmus F. Orsoe, Stephan Meighen-Berger, Jeffrey Lazar, Jorge Prado, Ivan Mozun-Mateo et al.
Machine Learning
ๆฅ็ๆ่ฆ
Neutrino telescopes are large-scale detectors designed to observe Cherenkov radiation produced from neutrino interactions in water or ice. They exist to identify extraterrestrial neutrino sources and to probe fundamental questions pertaining to the elusive neutrino itself. A central challenge common across neutrino telescopes is to solve a series of inverse problems known as event reconstruction, which seeks to resolve properties of the incident neutrino, based on the detected Cherenkov light. In recent times, significant efforts have been made in adapting advances from deep learning research to event reconstruction, as such techniques provide several benefits over traditional methods. While a large degree of similarity in reconstruction needs and low-level data exists, cross-experimental collaboration has been hindered by a lack of diverse open-source datasets for comparing methods.
We present NuBench, an open benchmark for deep learning-based event reconstruction in neutrino telescopes. NuBench comprises seven large-scale simulated datasets containing nearly 130 million charged- and neutral-current muon-neutrino interactions spanning 10 GeV to 100 TeV, generated across six detector geometries inspired by existing and proposed experiments. These datasets provide pulse- and event-level information suitable for developing and comparing machine-learning reconstruction methods in both water and ice environments. Using NuBench, we evaluate four reconstruction algorithms - ParticleNeT and DynEdge, both actively used within the KM3NeT and IceCube collaborations, respectively, along with GRIT and DeepIce - on up to five core tasks: energy and direction reconstruction, topology classification, interaction vertex prediction, and inelasticity estimation.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yehonatan Elisha, Seffi Cohen, Oren Barkan, Noam Koenigstein
Machine Learning
ๆฅ็ๆ่ฆ
Saliency maps are widely used for visual explanations in deep learning, but a fundamental lack of consensus persists regarding their intended purpose and alignment with diverse user queries. This ambiguity hinders the effective evaluation and practical utility of explanation methods.We address this gap by introducing the Reference-Frame $\times$ Granularity (RFxG) taxonomy, a principled conceptual framework that organizes saliency explanations along two essential axes:Reference-Frame: Distinguishing between pointwise ("Why this prediction?") and contrastive ("Why this and not an alternative?") explanations.Granularity: Ranging from fine-grained class-level (e.g., "Why Husky?") to coarse-grained group-level (e.g., "Why Dog?") interpretations.Using the RFxG lens, we demonstrate critical limitations in existing evaluation metrics, which overwhelmingly prioritize pointwise faithfulness while neglecting contrastive reasoning and semantic granularity. To systematically assess explanation quality across both RFxG dimensions, we propose four novel faithfulness metrics. Our comprehensive evaluation framework applies these metrics to ten state-of-the-art saliency methods, four model architectures, and three datasets.By advocating a shift toward user-intent-driven evaluation, our work provides both the conceptual foundation and the practical tools necessary to develop visual explanations that are not only faithful to the underlying model behavior but are also meaningfully aligned with the complexity of human understanding and inquiry.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohit Meena, Yash Punjabi, Abhishek A, Vishal Sharma, Mahesh Chandran
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have emerged as powerful tools for learning over graph-structured data, yet recent studies have shown that their performance gains are beginning to plateau. In many cases, well-established models such as GCN and GAT, when appropriately tuned, can match or even exceed the performance of more complex, state-of-the-art architectures. This trend highlights a key limitation in the current landscape: the difficulty of selecting the most suitable model for a given graph task or dataset. To address this, we propose Self-Adaptive Graph Mixture of Models (SAGMM), a modular and practical framework that learns to automatically select and combine the most appropriate GNN models from a diverse pool of architectures. Unlike prior mixture-of-experts approaches that rely on variations of a single base model, SAGMM leverages architectural diversity and a topology-aware attention gating mechanism to adaptively assign experts to each node based on the structure of the input graph. To improve efficiency, SAGMM includes a pruning mechanism that reduces the number of active experts during training and inference without compromising performance. We also explore a training-efficient variant in which expert models are pretrained and frozen, and only the gating and task-specific layers are trained. We evaluate SAGMM on 16 benchmark datasets covering node classification, graph classification, regression, and link prediction tasks, and demonstrate that it consistently outperforms or matches leading GNN baselines and prior mixture-based methods, offering a robust and adaptive solution for real-world graph learning.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vladimรญr Macko, Vladimรญr Boลพa
Natural Language Processing
ๆฅ็ๆ่ฆ
Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low and unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning provided only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at sparsity 50%, MACKO is the first approach with significant 1.5x memory reduction and 1.2-1.5x speedup over dense representation. Speedups over other SpMV baselines: 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to sparsity 50%, it delivers 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now justified in real-world LLM workloads.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianhong Li, Kaiming He
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "$\textbf{Just image Transformers}$", or $\textbf{JiT}$, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu et al.
Multimodal
ๆฅ็ๆ่ฆ
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Hengrui Hu, Kaining Ying, Henghui Ding
ๆฅ็ๆ่ฆ
This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims at segmenting the target object indicated by an initial mask throughout a video with multiple shots. The existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, thereby limiting their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA) which enables cross-shot generalization with single-shot data to alleviate the severe annotated multi-shot data sparsity, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junwei Yu, Trevor Darrell, XuDong Wang
Robotics
ๆฅ็ๆ่ฆ
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Xincheng Shuai, Zhenyuan Qin, Henghui Ding, Dacheng Tao
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short in performing 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, allowing users to perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that the proposed FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, Joseph K J
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader's experience.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.
๐
2025-11-17
๐ ArXiv Preprint (cs.AR)
๐ฅ Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Wenjun Huang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer--forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W''m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy tracks FP32 within <=0.9 AP across Deformable and Sparse DETR variants. By converting sparsity into locality--and locality into utilization--QUILL delivers consistent, end-to-end speedups.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon et al.
Multimodal
ๆฅ็ๆ่ฆ
Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at $\href{https://github.com/allenai/olmoearth_pretrain}{\text{https://github.com/allenai/olmoearth_pretrain}}$.
๐
2025-11-17
๐ AAAI
๐ฅ Pascal Zimmer, Ghassan Karame
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we present the first detailed analysis of how optimization hyperparameters -- such as learning rate, weight decay, momentum, and batch size -- influence robustness against both transfer-based and query-based attacks. Supported by theory and experiments, our study spans a variety of practical deployment settings, including centralized training, ensemble learning, and distributed training. We uncover a striking dichotomy: for transfer-based attacks, decreasing the learning rate significantly enhances robustness by up to $64\%$. In contrast, for query-based attacks, increasing the learning rate consistently leads to improved robustness by up to $28\%$ across various settings and data distributions. Leveraging these findings, we explore -- for the first time -- the optimization hyperparameter design space to jointly enhance robustness against both transfer-based and query-based attacks. Our results reveal that distributed models benefit the most from hyperparameter tuning, achieving a remarkable tradeoff by simultaneously mitigating both attack types more effectively than other training setups.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Xin Jin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that combines Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR can achieve leading visual quality, prompt coherence among few-step methods, and even exhibit performance that exceeds the multi-step teacher.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu
Machine Learning
ๆฅ็ๆ่ฆ
3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q\&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shrenik Patel, Daivik Patel
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one's keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block's full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines, while processing up to 87% less tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dimitrios Koutsianos, Ladislav Mosner, Yannis Panagakis, Themos Stafylakis
ๆฅ็ๆ่ฆ
Performance in face and speaker verification is largely driven by margin based softmax losses like CosFace and ArcFace. Recently introduced $ฮฑ$-divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when $ฮฑ>1$). However, integrating an angular margin-crucial for verification tasks-is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based $ฮฑ$-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify and address a critical training instability in A3M-caused by the interplay of penalized logits and sparsity-with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, when minimizing false authentications is paramount.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ashlesha G. Sawant, Shreyash S. Kamble, Raj S. Kanade, Raunak N. Kanugo, Tanishq A. Kapse et al.
ๆฅ็ๆ่ฆ
One of the major causes of road accidents is driver fatigue that causes thousands of fatalities and injuries every year. This study shows development of a Driver Drowsiness Detection System meant to improve the safety of the road by alerting drivers who are showing signs of being drowsy. The system is based on a standard webcam that tracks the facial features of the driver with the main emphasis on the examination of eye movements that can be conducted with the help of the Eye Aspect Ratio (EAR) method. The Face Mesh by MediaPipe is a lightweight framework that can identify facial landmarks with high accuracy and efficiency, which is considered to be important in real time use. The system detects the moments of long eye shutdowns or a very low rate of blinking which are manifestations of drowsiness and alerts the driver through sound to get her attention back. This system achieves a high-performance and low-cost driver monitoring solution with the help of the computational power of OpenCV to process the image and the MediaPipe to identify faces. Test data experimental analyses indicate that the system is very accurate and responds quicker; this confirms that it can be a component of the current Advanced Driving Assistance System (ADAS).
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kesi Xu, Eleni Chiou, Ali Varamesh, Laura Acqualagna, Nasir Rajpoot
ๆฅ็ๆ่ฆ
Accurate nuclei detection and classification are fundamental to computational pathology, yet existing approaches are hindered by reliance on detailed expert annotations and insufficient use of tissue context. We present Tissue-Aware Nuclei Detection (TAND), a novel framework achieving joint nuclei detection and classification using point-level supervision enhanced by tissue mask conditioning. TAND couples a ConvNeXt-based encoder-decoder with a frozen Virchow-2 tissue segmentation branch, where semantic tissue probabilities selectively modulate the classification stream through a novel multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM). On the PUMA benchmark, TAND achieves state-of-the-art performance, surpassing both tissue-agnostic baselines and mask-supervised methods. Notably, our approach demonstrates remarkable improvements in tissue-dependent cell types such as epithelium, endothelium, and stroma. To the best of our knowledge, this is the first method to condition per-cell classification on learned tissue masks, offering a practical pathway to reduce annotation burden.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marianne Rakic, Andrew Hoopes, S. Mazdak Abulnaga, Mert R. Sabuncu, John V. Guttag et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deformable templates, or atlases, are images that represent a prototypical anatomy for a population, and are often enhanced with probabilistic anatomical label maps. They are commonly used in medical image analysis for population studies and computational anatomy tasks such as registration and segmentation. Because developing a template is a computationally expensive process, relatively few templates are available. As a result, analysis is often conducted with sub-optimal templates that are not truly representative of the study population, especially when there are large variations within this population. We propose a machine learning framework that uses convolutional registration neural networks to efficiently learn a function that outputs templates conditioned on subject-specific attributes, such as age and sex. We also leverage segmentations, when available, to produce anatomical segmentation maps for the resulting templates. The learned network can also be used to register subject images to the templates. We demonstrate our method on a compilation of 3D brain MRI datasets, and show that it can learn high-quality templates that are representative of populations. We find that annotated conditional templates enable better registration than their unlabeled unconditional counterparts, and outperform other templates construction methods.
๐
2025-11-17
๐ AAAI
๐ฅ Xin Xu, Hao Liu, Wei Liu, Wei Wang, Jiayi Wu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Low-Light Image Enhancement (LLIE) task aims at improving contrast while restoring details and textures for images captured in low-light conditions. HVI color space has made significant progress in this task by enabling precise decoupling of chrominance and luminance. However, for the interaction of chrominance and luminance branches, substantial distributional differences between the two branches prevalent in natural images limit complementary feature extraction, and luminance errors are propagated to chrominance channels through the nonlinear parameter. Furthermore, for interaction between different chrominance branches, images with large homogeneous-color regions usually exhibit weak correlation between chrominance branches due to concentrated distributions. Traditional pixel-wise losses exploit strong inter-branch correlations for co-optimization, causing gradient conflicts in weakly correlated regions. Therefore, we propose an Inter-Chrominance and Luminance Interaction (ICLR) framework including a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL). The DIEM improves the extraction of complementary information from two dimensions, fusion and enhancement, respectively. The CCL utilizes luminance residual statistics to penalize chrominance errors and balances gradient conflicts by constraining chrominance branches covariance. Experimental results on multiple datasets show that the proposed ICLR framework outperforms state-of-the-art methods.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haotian Dong, Ye Li, Rongwei Lu, Chen Tang, Shu-Tao Xia et al.
Computer Vision
ๆฅ็ๆ่ฆ
Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by the visual token interchangeability, we for the first time to explore verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage's characteristics, we observe that verification redundancy and stale feature reusability are key factors to retain generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework VVS to accelerate visual AR generation via partial verification skipping, which integrates three complementary modules: (1) a verification-free token selector with dynamical truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of $2.8\times$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yinuo Xu, Yan Cui, Mingyao Li, Zhi Huang
ๆฅ็ๆ่ฆ
Identifying cell types and subtypes from routine histopathology images is essential for improving the computational understanding of human disease. Existing tile-based models can capture detailed nuclear morphology but often fail to incorporate the broader tissue context that influences a cell's function and identity. In addition, available human annotations are typically coarse-grained and unevenly distributed across studies, making fine-grained subtype-level supervision difficult to obtain.
To address these limitations, we introduce NuClass, a pathologist workflow inspired framework for cell-wise multi-scale integration of nuclear morphology and microenvironmental context. NuClass includes two main components: Path local, which focuses on nuclear morphology from 224-by-224 pixel crops, and Path global, which models the surrounding 1024-by-1024 pixel neighborhood. A learnable gating module adaptively balances local detail and contextual cues. To encourage complementary learning, we incorporate an uncertainty-guided objective that directs the global path to prioritize regions where the local path is uncertain. We also provide calibrated confidence estimates and Grad-CAM visualizations to enhance interpretability.
To overcome the lack of high-quality annotations, we construct a marker-guided dataset from Xenium spatial transcriptomics assays, yielding single-cell resolution labels for more than two million cells across eight organs and 16 classes. Evaluated on three fully held-out cohorts, NuClass achieves up to 96 percent F1 for its best-performing class, outperforming strong baselines. Our results show that multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable, cell-level phenotype prediction.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Linhan Zhou, Shuang Li, Neng Dong, Yonghang Tai, Yafei Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Person re-identification (ReID) aims to retrieve target pedestrian images given either visual queries (image-to-image, I2I) or textual descriptions (text-to-image, T2I). Although both tasks share a common retrieval objective, they pose distinct challenges: I2I emphasizes discriminative identity learning, while T2I requires accurate cross-modal semantic alignment. Existing methods often treat these tasks separately, which may lead to representation entanglement and suboptimal performance. To address this, we propose a unified framework named Hierarchical Prompt Learning (HPL), which leverages task-aware prompt modeling to jointly optimize both tasks. Specifically, we first introduce a Task-Routed Transformer, which incorporates dual classification tokens into a shared visual encoder to route features for I2I and T2I branches respectively. On top of this, we develop a hierarchical prompt generation scheme that integrates identity-level learnable tokens with instance-level pseudo-text tokens. These pseudo-tokens are derived from image or text features via modality-specific inversion networks, injecting fine-grained, instance-specific semantics into the prompts. Furthermore, we propose a Cross-Modal Prompt Regularization strategy to enforce semantic alignment in the prompt token space, ensuring that pseudo-prompts preserve source-modality characteristics while enhancing cross-modal transferability. Extensive experiments on multiple ReID benchmarks validate the effectiveness of our method, achieving state-of-the-art performance on both I2I and T2I tasks.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Ziyang Huang, Jiagang Chen, Jin Liu, Shunping Ji
Machine Learning
ๆฅ็ๆ่ฆ
3D Gaussian Splatting (3DGS) has emerged as a leading framework for novel view synthesis, yet its core optimization challenges remain underexplored. We identify two key issues in 3DGS optimization: entrapment in suboptimal local optima and insufficient convergence quality. To address these, we propose Opt3DGS, a robust framework that enhances 3DGS through a two-stage optimization process of adaptive exploration and curvature-guided exploitation. In the exploration phase, an Adaptive Weighted Stochastic Gradient Langevin Dynamics (SGLD) method enhances global search to escape local optima. In the exploitation phase, a Local Quasi-Newton Direction-guided Adam optimizer leverages curvature information for precise and efficient convergence. Extensive experiments on diverse benchmark datasets demonstrate that Opt3DGS achieves state-of-the-art rendering quality by refining the 3DGS optimization process without modifying its underlying representation.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sining Chen, Xiao Xiang Zhu
Machine Learning
ๆฅ็ๆ่ฆ
Monocular height estimation plays a critical role in 3D perception for remote sensing, offering a cost-effective alternative to multi-view or LiDAR-based methods. While deep learning has significantly advanced the capabilities of monocular height estimation, these methods remain fundamentally limited by the availability of labeled data, which are expensive and labor-intensive to obtain at scale. The scarcity of high-quality annotations hinders the generalization and performance of existing models. To overcome this limitation, we propose leveraging large volumes of unlabeled data through a semi-supervised learning framework, enabling the model to extract informative cues from unlabeled samples and improve its predictive performance. In this work, we introduce TSE-Net, a self-training pipeline for semi-supervised monocular height estimation. The pipeline integrates teacher, student, and exam networks. The student network is trained on unlabeled data using pseudo-labels generated by the teacher network, while the exam network functions as a temporal ensemble of the student network to stabilize performance. The teacher network is formulated as a joint regression and classification model: the regression branch predicts height values that serve as pseudo-labels, and the classification branch predicts height value classes along with class probabilities, which are used to filter pseudo-labels. Height value classes are defined using a hierarchical bi-cut strategy to address the inherent long-tailed distribution of heights, and the predicted class probabilities are calibrated with a Plackett-Luce model to reflect the expected accuracy of pseudo-labels. We evaluate the proposed pipeline on three datasets spanning different resolutions and imaging modalities. Codes are available at https://github.com/zhu-xlab/tse-net.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md. Iqbal Hossain, Afia Sajeeda, Neeresh Kumar Perla, Ming Shao
Machine LearningComputer VisionMultimodal
ๆฅ็ๆ่ฆ
The advent of multimodal deep learning models, such as CLIP, has unlocked new frontiers in a wide range of applications, from image-text understanding to classification tasks. However, these models are not safe for adversarial attacks, particularly backdoor attacks, which can subtly manipulate model behavior. Moreover, existing defense methods typically involve training from scratch or fine-tuning using a large dataset without pinpointing the specific labels that are affected. In this study, we introduce an innovative strategy to enhance the robustness of multimodal contrastive learning models against such attacks. In particular, given a poisoned CLIP model, our approach can identify the backdoor trigger and pinpoint the victim samples and labels in an efficient manner. To that end, an image segmentation ``oracle'' is introduced as the supervisor for the output of the poisoned CLIP. We develop two algorithms to rectify the poisoned model: (1) differentiating between CLIP and Oracle's knowledge to identify potential triggers; (2) pinpointing affected labels and victim samples, and curating a compact fine-tuning dataset. With this knowledge, we are allowed to rectify the poisoned CLIP model to negate backdoor effects. Extensive experiments on visual recognition benchmarks demonstrate our strategy is effective in CLIP-based backdoor defense.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanchao Wang, Tian Qin, Eduardo Valle, Bruno Abrahao
ๆฅ็ๆ่ฆ
Out-of-distribution (OOD) detection is critical for deploying image classifiers in safety-sensitive environments, yet existing detectors often struggle when OOD samples are semantically similar to the in-distribution (ID) classes. We present BootOOD, a fully self-supervised OOD detection framework that bootstraps exclusively from ID data and is explicitly designed to handle semantically challenging OOD samples. BootOOD synthesizes pseudo-OOD features through simple transformations of ID representations and leverages Neural Collapse (NC), where ID features cluster tightly around class means with consistent feature norms. Unlike prior approaches that aim to constrain OOD features into subspaces orthogonal to the collapsed ID means, BootOOD introduces a lightweight auxiliary head that performs radius-based classification on feature norms. This design decouples OOD detection from the primary classifier and imposes a relaxed requirement: OOD samples are learned to have smaller feature norms than ID features, which is easier to satisfy when ID and OOD are semantically close. Experiments on CIFAR-10, CIFAR-100, and ImageNet-200 show that BootOOD outperforms prior post-hoc methods, surpasses training-based methods without outlier exposure, and is competitive with state-of-the-art outlier-exposure approaches while maintaining or improving ID accuracy.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Farhin Farhad Riya, Shahinul Hoque, Jinyuan Stella Sun, Olivera Kotevska
Machine Learning
ๆฅ็ๆ่ฆ
As machine learning models are increasingly deployed in safety-critical domains, visual explanation techniques have become essential tools for supporting transparency. In this work, we reveal a new class of attacks that compromise model interpretability without affecting accuracy. Specifically, we show that small color perturbations applied by adversarial clients in a federated learning setting can shift a model's saliency maps away from semantically meaningful regions while keeping the prediction unchanged. The proposed saliency-aware attack framework, called Chromatic Perturbation Module, systematically crafts adversarial examples by altering the color contrast between foreground and background in a way that disrupts explanation fidelity. These perturbations accumulate across training rounds, poisoning the global model's internal feature attributions in a stealthy and persistent manner. Our findings challenge a common assumption in model auditing that correct predictions imply faithful explanations and demonstrate that interpretability itself can be an attack surface. We evaluate this vulnerability across multiple datasets and show that standard training pipelines are insufficient to detect or mitigate explanation degradation, especially in the federated learning setting, where subtle color perturbations are harder to discern. Our attack reduces peak activation overlap in Grad-CAM explanations by up to 35% while preserving classification accuracy above 96% on all evaluated datasets.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jeffrey Wen, Rizwan Ahmad, Philip Schniter
ๆฅ็ๆ่ฆ
In ill-posed imaging inverse problems, uncertainty quantification remains a fundamental challenge, especially in safety-critical applications. Recently, conformal prediction has been used to quantify the uncertainty that the inverse problem contributes to downstream tasks like image classification, image quality assessment, fat mass quantification, etc. While existing works handle only a scalar estimation target, practical applications often involve multiple targets. In response, we propose an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while ensuring joint marginal coverage. We then outline how our minimax approach can be applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Finally, we numerically demonstrate the benefits of our minimax method, relative to existing multi-target conformal prediction methods, using both synthetic and magnetic resonance imaging (MRI) data.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenyu Zhang, Yao Tong, Yiqiu Liu, Rui Cao
Machine Learning
ๆฅ็ๆ่ฆ
Urban villages (UVs), informal settlements embedded within China's urban fabric, have undergone widespread demolition and redevelopment in recent decades. However, there remains a lack of systematic evaluation of whether the demolished land has been effectively reused, raising concerns about the efficacy and sustainability of current redevelopment practices. To address the gap, this study proposes a deep learning-based framework to monitor the spatiotemporal changes of UVs in China. Specifically, semantic segmentation of multi-temporal remote sensing imagery is first used to map evolving UV boundaries, and then post-demolition land use is classified into six categories based on the "remained-demolished-redeveloped" phase: incomplete demolition, vacant land, construction sites, buildings, green spaces, and others. Four representative cities from China's four economic regions were selected as the study areas, i.e., Guangzhou (East), Zhengzhou (Central), Xi'an (West), and Harbin (Northeast). The results indicate: 1) UV redevelopment processes were frequently prolonged; 2) redevelopment transitions primarily occurred in peripheral areas, whereas urban cores remained relatively stable; and 3) three spatiotemporal transformation pathways, i.e., synchronized redevelopment, delayed redevelopment, and gradual optimization, were revealed. This study highlights the fragmented, complex and nonlinear nature of UV redevelopment, underscoring the need for tiered and context-sensitive planning strategies. By linking spatial dynamics with the context of redevelopment policies, the findings offer valuable empirical insights that support more inclusive, efficient, and sustainable urban renewal, while also contributing to a broader global understanding of informal settlement transformations.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jae Joong Lee
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic.
Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
๐
2025-11-17
๐ AAAI
๐ฅ Lipeng Wang, Hongxing Fan, Haohua Chen, Zehuan Huang, Lu Sheng
Robotics
ๆฅ็ๆ่ฆ
Generating high-quality human interactions holds significant value for applications like virtual reality and robotics. However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. The core of InterMoE is a routing mechanism that synergistically uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. This allows experts to dynamically determine the selection capacity and focus on critical temporal features, thereby preserving specific individual characteristic identities while ensuring high semantic fidelity. Extensive experiments show that InterMoE achieves state-of-the-art performance in individual-specific high-fidelity 3D human interaction generation, reducing FID scores by 9% on the InterHuman dataset and 22% on InterX.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Adam Hazimeh, Ke Wang, Mark Collier, Gilles Baechler, Efi Kokiopoulou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.
๐
2025-11-17
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Agnese Chiatti, Lara Piccolo, Sara Bernardini, Matteo Matteucci, Viola Schiaffonati
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
With the growing deployment of Vision-Language Models (VLMs), pre-trained on large image-text and video-text datasets, it is critical to equip users with the tools to discern when to trust these systems. However, examining how user trust in VLMs builds and evolves remains an open problem. This problem is exacerbated by the increasing reliance on AI models as judges for experimental validation, to bypass the cost and implications of running participatory design studies directly with users. Following a user-centred approach, this paper presents preliminary results from a workshop with prospective VLM users. Insights from this pilot workshop inform future studies aimed at contextualising trust metrics and strategies for participants' engagement to fit the case of user-VLM interaction.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rui Zuo, Qinyue Tong, Zhe-Ming Lu, Ziqian Lu
RoboticsNatural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. Nowadays, MLLMs demonstrate strong generalization potential across diverse vision-language tasks, and some studies introduce this capability to IFDL via large-scale training. However, such approaches cost considerable computational resources, while failing to reveal the inherent generalization potential of vanilla MLLMs to address this problem. Inspired by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lorenzo Olearo, Giulio Viganรฒ, Daniele Baieri, Filippo Maggioli, Simone Melzi
ๆฅ็ๆ่ฆ
We introduce a novel neural representation for maps between 3D shapes based on flow-matching models, which is computationally efficient and supports cross-representation shape matching without large-scale training or data-driven procedures. 3D shapes are represented as the probability distribution induced by a continuous and invertible flow mapping from a fixed anchor distribution. Given a source and a target shape, the composition of the inverse flow (source to anchor) with the forward flow (anchor to target), we continuously map points between the two surfaces. By encoding the shapes with a pointwise task-tailored embedding, this construction provides an invertible and modality-agnostic representation of maps between shapes across point clouds, meshes, signed distance fields (SDFs), and volumetric data. The resulting representation consistently achieves high coverage and accuracy across diverse benchmarks and challenging settings in shape matching. Beyond shape matching, our framework shows promising results in other tasks, including UV mapping and registration of raw point cloud scans of human bodies.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xingming Long, Jie Zhang, Shiguang Shan, Xilin Chen
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Most research on hallucinations in Large Vision-Language Models (LVLMs) focuses on factual description tasks that prohibit any output absent from the image. However, little attention has been paid to hallucinations in voluntary imagination tasks, e.g., story writing, where the models are expected to generate novel content beyond the given image. In these tasks, it is inappropriate to simply regard such imagined novel content as hallucinations. To address this limitation, we introduce Voluntary-imagined Object Presence Evaluation (VOPE)-a novel method to assess LVLMs' hallucinations in voluntary imagination tasks via presence evaluation. Specifically, VOPE poses recheck-based questions to evaluate how an LVLM interprets the presence of the imagined objects in its own response. The consistency between the model's interpretation and the object's presence in the image is then used to determine whether the model hallucinates when generating the response. We apply VOPE to several mainstream LVLMs and hallucination mitigation methods, revealing two key findings: (1) most LVLMs hallucinate heavily during voluntary imagination, and their performance in presence evaluation is notably poor on imagined objects; (2) existing hallucination mitigation methods show limited effect in voluntary imagination tasks, making this an important direction for future research.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mykola Lavreniuk, Nataliia Kussul, Andrii Shelestov, Yevhenii Salii, Volodymyr Kuzin et al.
ๆฅ็ๆ่ฆ
Accurate delineation of agricultural field boundaries from satellite imagery is essential for land management and crop monitoring, yet existing methods often produce incomplete boundaries, merge adjacent fields, and struggle to scale. We present the Delineate Anything Flow (DelAnyFlow) methodology, a resolution-agnostic approach for large-scale field boundary mapping. DelAnyFlow combines the DelAny instance segmentation model, based on a YOLOv11 backbone and trained on the large-scale Field Boundary Instance Segmentation-22M (FBIS 22M) dataset, with a structured post-processing, merging, and vectorization sequence to generate topologically consistent vector boundaries. FBIS 22M, the largest dataset of its kind, contains 672,909 multi-resolution image patches (0.25-10m) and 22.9million validated field instances. The DelAny model delivers state-of-the-art accuracy with over 100% higher mAP and 400x faster inference than SAM2. DelAny demonstrates strong zero-shot generalization and supports national-scale applications: using Sentinel 2 data for 2024, DelAnyFlow generated a complete field boundary layer for Ukraine (603,000km2) in under six hours on a single workstation. DelAnyFlow outputs significantly improve boundary completeness relative to operational products from Sinergise Solutions and NASA Harvest, particularly in smallholder and fragmented systems (0.25-1ha). For Ukraine, DelAnyFlow delineated 3.75M fields at 5m and 5.15M at 2.5m, compared to 2.66M detected by Sinergise Solutions and 1.69M by NASA Harvest. This work delivers a scalable, cost-effective methodology for field delineation in regions lacking digital cadastral data. A project landing page with links to model weights, code, national-scale vector outputs, and dataset is available at https://lavreniuk.github.io/Delineate-Anything/.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Wanqing Cui, Wei Huang, Yazhi Guo, Yibo Hu, Meiguang Jin et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To alleviate this problem, we propose a \textbf{A}ttention-\textbf{G}rounded \textbf{RE}triever \textbf{E}nhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models as proxy local supervision to guide the identification of relevant document regions. During training, AGREE combines local signals with the global signals to jointly optimize the retriever, enabling it to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging ViDoRe V2 benchmark show that AGREE significantly outperforms the global-supervision-only baseline. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://anonymous.4open.science/r/AGREE-2025.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinkun Zhao, Lei Huang, Wenjun Wu
Multimodal
ๆฅ็ๆ่ฆ
With the rapid advancement of Large Models, numerous text-and-vision-fused Multimodal Large Models (MLMs) have emerged. However, these MLMs remain susceptible to informational interference in visual perception, particularly in color perception, which introduces an additional risk of hallucination. To validate this hypothesis, we introduce the "What Color Is It" dataset, a novel benchmark constructed using a simple method to trigger single-modality visual hallucination in MLMs. Based on this dataset, we further investigate the underlying causes of hallucination in the visual modality of MLMs and propose potential solutions to enhance their robustness.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Yuchen Bao, Yiting Wang, Wenjian Huang, Haowei Wang, Shen Chen et al.
Robotics
ๆฅ็ๆ่ฆ
Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect - such as editing text content - thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the "SCB Group", a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent "shortcut" phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: https://github.com/yusenbao01/TripleFDS
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fei Kong
ๆฅ็ๆ่ฆ
Recently, the Denoising Diffusion Codebook Models (DDCM) was proposed. DDCM leverages the Denoising Diffusion Probabilistic Model (DDPM) and replaces the random noise in the backward process with noise sampled from specific sets according to a predefined rule, thereby enabling image compression. However, DDCM cannot be applied to methods other than DDPM. In this paper, we propose the generalized Denoising Diffusion Compression Model (gDDCM), which extends DDCM to mainstream diffusion models and their variants, including DDPM, Score-Based Models, Consistency Models, and Rectified Flow. We evaluate our method on CIFAR-10 and LSUN Bedroom datasets. Experimental results demonstrate that our approach successfully generalizes DDCM to the aforementioned models and achieves improved performance.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lucas Gabriel Telesco, Danila Nejamkin, Estefanรญa Mata, Francisco Filizzola, Kevin Wignall et al.
ๆฅ็ๆ่ฆ
Retinal image quality assessment (RIQA) supports computer-aided diagnosis of eye diseases. However, most tools classify only overall image quality, without indicating acquisition defects to guide recapture. This gap is mainly due to the high cost of detailed annotations. In this paper, we aim to mitigate this limitation by introducing a hybrid semi-supervised learning approach that combines manual labels for overall quality with pseudo-labels of quality details within a multi-task framework. Our objective is to obtain more interpretable RIQA models without requiring extensive manual labeling. Pseudo-labels are generated by a Teacher model trained on a small dataset and then used to fine-tune a pre-trained model in a multi-task setting. Using a ResNet-18 backbone, we show that these weak annotations improve quality assessment over single-task baselines (F1: 0.875 vs. 0.863 on EyeQ, and 0.778 vs. 0.763 on DeepDRiD), matching or surpassing existing methods. The multi-task model achieved performance statistically comparable to the Teacher for most detail prediction tasks (p > 0.05). In a newly annotated EyeQ subset released with this paper, our model performed similarly to experts, suggesting that pseudo-label noise aligns with expert variability. Our main finding is that the proposed semi-supervised approach not only improves overall quality assessment but also provides interpretable feedback on capture conditions (illumination, clarity, contrast). This enhances interpretability at no extra manual labeling cost and offers clinically actionable outputs to guide image recapture.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ori Meiraz, Sharon Shalev, Avishai Weizman
Computer Vision
ๆฅ็ๆ่ฆ
This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Narthana Sivalingam, Santhirarajah Sivasthigan, Thamayanthi Mahendranathan, G. M. R. I. Godaliyadda, M. P. B. Ekanayake et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Group activity detection in multi-person scenes is challenging due to complex human interactions, occlusions, and variations in appearance over time. This work presents a computer vision based framework for group activity recognition and action spotting using a combination of deep learning models and graph based relational reasoning. The system first applies Mask R-CNN to obtain accurate actor localization through bounding boxes and instance masks. Multiple backbone networks, including Inception V3, MobileNet, and VGG16, are used to extract feature maps, and RoIAlign is applied to preserve spatial alignment when generating actor specific features. The mask information is then fused with the feature maps to obtain refined masked feature representations for each actor. To model interactions between individuals, we construct Actor Relation Graphs that encode appearance similarity and positional relations using methods such as normalized cross correlation, sum of absolute differences, and dot product. Graph Convolutional Networks operate on these graphs to reason about relationships and predict both individual actions and group level activities. Experiments on the Collective Activity dataset demonstrate that the combination of mask based feature refinement, robust similarity search, and graph neural network reasoning leads to improved recognition performance across both crowded and non crowded scenarios. This approach highlights the potential of integrating segmentation, feature extraction, and relational graph reasoning for complex video understanding tasks.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Kaiwen Cai, Xinze Liu, Xia Zhou, Hengtong Hu, Jie Xiang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
The generation of realistic LiDAR point clouds plays a crucial role in the development and evaluation of autonomous driving systems. Although recent methods for 3D LiDAR point cloud generation have shown significant improvements, they still face notable limitations, including the lack of sequential generation capabilities and the inability to produce accurately positioned foreground objects and realistic backgrounds. These shortcomings hinder their practical applicability. In this paper, we introduce DriveLiDAR4D, a novel LiDAR generation pipeline consisting of multimodal conditions and a novel sequential noise prediction model LiDAR4DNet, capable of producing temporally consistent LiDAR scenes with highly controllable foreground objects and realistic backgrounds. To the best of our knowledge, this is the first work to address the sequential generation of LiDAR scenes with full scene manipulation capability in an end-to-end manner. We evaluated DriveLiDAR4D on the nuScenes and KITTI datasets, where we achieved an FRD score of 743.13 and an FVD score of 16.96 on the nuScenes dataset, surpassing the current state-of-the-art (SOTA) method, UniScene, with an performance boost of 37.2% in FRD and 24.1% in FVD, respectively.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bowen Ye, Bin Zhang, Hang Zhao
Robotics
ๆฅ็ๆ่ฆ
Gaining sustainable performance improvement with scaling data and model budget remains a pivotal yet unresolved challenge in autonomous driving. While autoregressive models exhibited promising data-scaling efficiency in planning tasks, predicting ego trajectories alone suffers sparse supervision and weakly constrains how scene evolution should shape ego motion. Therefore, we introduce DAP, a discrete-token autoregressive planner that jointly forecasts BEV semantics and ego trajectories, thereby enforcing comprehensive representation learning and allowing predicted dynamics to directly condition ego motion. In addition, we incorporate a reinforcement-learning-based fine-tuning, which preserves supervised behavior cloning priors while injecting reward-guided improvements. Despite a compact 160M parameter budget, DAP achieves state-of-the-art performance on open-loop metrics and delivers competitive closed-loop results on the NAVSIM benchmark. Overall, the fully discrete-token autoregressive formulation operating on both rasterized BEV and ego actions provides a compact yet scalable planning paradigm for autonomous driving.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Enhui Ma, Lijun Zhou, Tao Tang, Jiahuan Zhang, Junpeng Jiang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
End-to-end planning methods are the de facto standard of the current autonomous driving system, while the robustness of the data-driven approaches suffers due to the notorious long-tail problem (i.e., rare but safety-critical failure cases). In this work, we explore whether recent diffusion-based video generation methods (a.k.a. world models), paired with structured 3D layouts, can enable a fully automated pipeline to self-correct such failure cases. We first introduce an agent to simulate the role of product manager, dubbed PM-Agent, which formulates data requirements to collect data similar to the failure cases. Then, we use a generative model that can simulate both data collection and annotation. However, existing generative models struggle to generate high-fidelity data conditioned on 3D layouts. To address this, we propose DriveSora, which can generate spatiotemporally consistent videos aligned with the 3D annotations requested by PM-Agent. We integrate these components into our self-correcting agentic system, CorrectAD. Importantly, our pipeline is an end-to-end model-agnostic and can be applied to improve any end-to-end planner. Evaluated on both nuScenes and a more challenging in-house dataset across multiple end-to-end planners, CorrectAD corrects 62.5% and 49.8% of failure cases, reducing collision rates by 39% and 27%, respectively.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yunjie Yu, Jingchen Wu, Junchen Zhu, Chunze Lin, Guibin Chen
Robotics
ๆฅ็ๆ่ฆ
Artistic design such as poster design often demands rapid yet precise modification of textual content while preserving visual harmony and typographic intent, especially across diverse font styles. Although modern image editing models have grown increasingly powerful, they still fall short in fine-grained, font-aware text manipulation, limiting their utility in professional design workflows such as poster editing. To address this issue, we present SkyReels-Text, a novel font-controllable framework for precise poster text editing. Our method enables simultaneous editing of multiple text regions, each rendered in distinct typographic styles, while preserving the visual appearance of non-edited regions. Notably, our model requires neither font labels nor fine-tuning during inference: users can simply provide cropped glyph patches corresponding to their desired typography, even if the font is not included in any standard library. Extensive experiments on multiple datasets, including handwrittent text benchmarks, SkyReels-Text achieves state-of-the-art performance in both text fidelity and visual realism, offering unprecedented control over font families, and stylistic nuances. This work bridges the gap between general-purpose image editing and professional-grade typographic design.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Jongha Kim, Minseong Bae, Sanghyeok Lee, Jinsung Yoon, Hyunwoo J. Kim
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, considering each layer's capacity to handle additional information, to generate question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% less FLOPs and 30% less memory usage compared to the second-best MLLM.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kaiwen Wang, Kaili Zheng, Yiming Shi, Chenyi Guo, Ji Wu
Machine Learning
ๆฅ็ๆ่ฆ
Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zihan Li, Tengfei Wang, Wentian Gan, Hao Zhan, Xin Wang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Lightweight building surface models are crucial for digital city, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that our SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website:https://lzh282140127-cell.github.io/SF-Recon-project/
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Noam Tsfaty, Avishai Weizman, Liav Cohen, Moshe Tshuva, Yehudit Aperstein
Natural Language Processing
ๆฅ็ๆ่ฆ
We address the challenge of detecting rare and diverse anomalies in surveillance videos using only video-level supervision. Our dual-backbone framework combines convolutional and transformer representations through top-k pooling, achieving 90.7% area under the curve (AUC) on the UCF-Crime dataset.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories-Environmental Perception and Scene Understanding-divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for UAV scenarios. The source code is available at https://github.com/linglingxiansen/SpatialSKy.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Keshav Gupta, Akshat Sanghvi, Shreyas Reddy Palley, Astitva Srivastava, Charu Sharma et al.
ๆฅ็ๆ่ฆ
3D Gaussian Splatting has emerged as a transformative technique in novel view synthesis, primarily due to its high rendering speed and photorealistic fidelity. However, its memory footprint scales rapidly with scene complexity, often reaching several gigabytes. Existing methods address this issue by introducing compression strategies that exploit primitive-level redundancy through similarity detection and quantization. We aim to surpass the compression limits of such methods by incorporating symmetry-aware techniques, specifically targeting mirror symmetries to eliminate redundant primitives. We propose a novel compression framework, \textbf{\textit{SymGS}}, introducing learnable mirrors into the scene, thereby eliminating local and global reflective redundancies for compression. Our framework functions as a plug-and-play enhancement to state-of-the-art compression methods, (e.g. HAC) to achieve further compression. Compared to HAC, we achieve $1.66 \times$ compression across benchmark datasets (upto $3\times$ on large-scale scenes). On an average, SymGS enables $\bf{108\times}$ compression of a 3DGS scene, while preserving rendering quality. The project page and supplementary can be found at \textbf{\color{cyan}{symgs.github.io}}
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junlong Li, Huaiyuan Xu, Sijie Cheng, Kejun Wu, Kim-Hui Yap et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Driven by recent advances in vision language models (VLMs) and egocentric perception research, we introduce the concept of an egocentric procedural AI assistant (EgoProceAssist) tailored to step-by-step support daily procedural tasks in a first-person view. In this work, we start by identifying three core tasks: egocentric procedural error detection, egocentric procedural learning, and egocentric procedural question answering. These tasks define the essential functions of EgoProceAssist within a new taxonomy. Specifically, our work encompasses a comprehensive review of current techniques, relevant datasets, and evaluation metrics across these three core areas. To clarify the gap between the proposed EgoProceAssist and existing VLM-based AI assistants, we introduce novel experiments and provide a comprehensive evaluation of representative VLM-based methods. Based on these findings and our technical analysis, we discuss the challenges ahead and suggest future research directions. Furthermore, an exhaustive list of this study is publicly available in an active repository that continuously collects the latest work: https://github.com/z1oong/Building-Egocentric-Procedural-AI-Assistant
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yushuo Zheng, Jiangyong Ying, Huiyu Duan, Chunyi Li, Zicheng Zhang et al.
RoboticsComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks, however their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, \textit{etc}. To bridge this gap, we introduce \textbf{GeoX-Bench}, a comprehensive \underline{Bench}mark designed to explore and evaluate the capabilities of LMMs in \underline{cross}-view \underline{Geo}-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with corresponding 755,976 question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remaining are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the empowered capabilities of instruction-tuning. Our benchmark demonstrate that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement, and instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve the cross-view geo-sense abilities. The GeoX-Bench is available at \textcolor{magenta}{https://github.com/IntMeGroup/GeoX-Bench}.
๐
2025-11-17
๐ 12 pages, 7figures, This work is supported by Nati
๐ฅ Yu Wen, Shuyong Gao, Shuping Zhang, Miao Huang, Lili Tao et al.
Computer Vision
ๆฅ็ๆ่ฆ
Referring camouflaged object detection (Ref-COD) aims to identify hidden objects by incorporating reference information such as images and text descriptions. Previous research has transformed reference images with salient objects into one-dimensional prompts, yielding significant results. We explore ways to enhance performance through multi-context fusion of rich salient image features and camouflaged object features. Therefore, we propose RFMNet, which utilizes features from multiple encoding stages of the reference salient images and performs interactive fusion with the camouflage features at the corresponding encoding stages. Given that the features in salient object images contain abundant object-related detail information, performing feature fusion within local areas is more beneficial for detecting camouflaged objects. Therefore, we propose an Overlapped Windows Cross-attention mechanism to enable the model to focus more attention on the local information matching based on reference features. Besides, we propose the Referring Feature Aggregation (RFA) module to decode and segment the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark demonstrate that our method achieves state-of-the-art performance.
๐
2025-11-17
๐ AAAI
๐ฅ Xiaoqi Han, Ru Li, Ran Yi, Hongye Tan, Zhuomin Liang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, obscure overfitting. We propose a comprehensive locality evaluation framework, covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, uncovering a phenomenon we term transient blindness, overfitting to edit-similar text while ignoring visuals. Token analysis shows edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junjie Wu, Guohong Fu
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Multimodal misinformation floods on various social media, and continues to evolve in the era of AI-generated content (AIGC). The emerged misinformation with low creation cost and high deception poses significant threats to society. While recent studies leverage general-purpose multimodal large language models (MLLMs) to achieve remarkable results in detection, they encounter two critical limitations: (1) Insufficient reasoning, where general-purpose MLLMs often follow the uniform reasoning paradigm but generate inaccurate explanations and judgments, due to the lack of the task-specific knowledge of multimodal misinformation detection. (2) Reasoning biases, where a single thinking mode make detectors a suboptimal path for judgment, struggling to keep pace with the fast-growing and intricate multimodal misinformation. In this paper, we propose MMD-Thinker, a two-stage framework for multimodal misinformation detection through adaptive multi-dimensional thinking. First, we develop tailor-designed thinking mode for multimodal misinformation detection. Second, we adopt task-specific instruction tuning to inject the tailored thinking mode into general-purpose MLLMs. Third, we further leverage reinforcement learning strategy with a mixed advantage function, which incentivizes the reasoning capabilities in trajectories. Furthermore, we construct the multimodal misinformation reasoning (MMR) dataset, encompasses more than 8K image-text pairs with both reasoning processes and classification labels, to make progress in the relam of multimodal misinformation detection. Experimental results demonstrate that our proposed MMD-Thinker achieves state-of-the-art performance on both in-domain and out-of-domain benchmark datasets, while maintaining flexible inference and token usage. Code will be publicly available at Github.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Malek Al Abed, Sebiha Demir, Anne Groteklaes, Elodie Germani, Shahrooz Faghihroohi et al.
Computer Vision
ๆฅ็ๆ่ฆ
Portable ultra-low-field MRI (uLF-MRI, 0.064 T) offers accessible neuroimaging for neonatal care but suffers from low signal-to-noise ratio and poor diagnostic quality compared to high-field (HF) MRI. We propose MRIQT, a 3D conditional diffusion framework for image quality transfer (IQT) from uLF to HF MRI. MRIQT combines realistic K-space degradation for physics-consistent uLF simulation, v-prediction with classifier-free guidance for stable image-to-image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity. The model denoises from a noised uLF input conditioned on the same scan, leveraging volumetric attention-UNet architecture for structure-preserving translation. Trained on a neonatal cohort with diverse pathologies, MRIQT surpasses recent GAN and CNN baselines in PSNR 15.3% with 1.78% over the state of the art, while physicians rated 85% of its outputs as good quality with clear pathology present. MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field (uLF) MRI for deliable neonatal brain assessment.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Qida Tan, Hongyu Yang, Wenchao Du
ๆฅ็ๆ่ฆ
Appearance-based gaze estimation, aiming to predict accurate 3D gaze direction from a single facial image, has made promising progress in recent years. However, most methods suffer significant performance degradation in cross-domain evaluation due to interference from gaze-irrelevant factors, such as expressions, wearables, and image quality. To alleviate this problem, we present a novel Hybrid-domain Adaptative Representation Learning (shorted by HARL) framework that exploits multi-source hybrid datasets to learn robust gaze representation. More specifically, we propose to disentangle gaze-relevant representation from low-quality facial images by aligning features extracted from high-quality near-eye images in an unsupervised domain-adaptation manner, which hardly requires any computational or inference costs. Additionally, we analyze the effect of head-pose and design a simple yet efficient sparse graph fusion module to explore the geometric constraint between gaze direction and head-pose, leading to a dense and robust gaze representation. Extensive experiments on EyeDiap, MPIIFaceGaze, and Gaze360 datasets demonstrate that our approach achieves state-of-the-art accuracy of $\textbf{5.02}^{\circ}$ and $\textbf{3.36}^{\circ}$, and $\textbf{9.26}^{\circ}$ respectively, and present competitive performances through cross-dataset evaluation. The code is available at https://github.com/da60266/HARL.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Jian Wang et al.
ๆฅ็ๆ่ฆ
Despite recent advancements in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large-scale 3D databases. To overcome this limitation, we introduce 3DAlign-DAER, a unified framework designed to align text and 3D geometry via the proposed dynamic attention policy and the efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. Specifically, during the training, our proposed dynamic attention policy (DAP) employs the Hierarchical Attention Fusion (HAF) module to represent the alignment as learnable fine-grained token-to-point attentions. To optimize these attentions across different tasks and geometric hierarchies, our DAP further exploits the Monte Carlo tree search to dynamically calibrate HAF attention weights via a hybrid reward signal and further enhances the alignment between textual descriptions and local 3D geometry. During the inference, our 3DAlign-DAER introduces an Efficient Retrieval Strategy (ERS) to leverage efficient hierarchical searching in the large-scale embedding spaces, outperforming traditional methods (e.g., KNN) in accuracy and efficiency. Furthermore, to facilitate text-3D alignment research and train our 3DAlign-DAER, we construct Align3D-2M, a large-scale dataset featuring 2M text-3D pairs, to provide sufficient fine-grained cross-modal annotations. Extensive and comprehensive experiments demonstrate the superior performance of our 3DAlign-DAER on diverse benchmarks. We will release our codes, models, and datasets.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yonghui Yu, Jiahang Cai, Xun Wang, Wenwu Yang
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames.Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation.Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a \textbf{6.0} mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency.Project page: https://github.com/zgspose/PAVENet
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Cheng Peng, Zhenzhe Zhang, Cheng Chi, Xiaobao Wei, Yanhao Zhang et al.
RoboticsNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Navigating to a specified object in an unknown environment is a fundamental yet challenging capability of embodied intelligence. However, current methods struggle to balance decision frequency with intelligence, resulting in decisions lacking foresight or discontinuous actions. In this work, we propose PIGEON: Point of Interest Guided Exploration for Object Navigation with VLM, maintaining a lightweight and semantically aligned snapshot memory during exploration as semantic input for the exploration strategy. We use a large Visual-Language Model (VLM), named PIGEON-VL, to select Points of Interest (PoI) formed during exploration and then employ a lower-level planner for action output, increasing the decision frequency. Additionally, this PoI-based decision-making enables the generation of Reinforcement Learning with Verifiable Reward (RLVR) data suitable for simulators. Experiments on classic object navigation benchmarks demonstrate that our zero-shot transfer method achieves state-of-the-art performance, while RLVR further enhances the model's semantic guidance capabilities, enabling deep reasoning during real-time navigation.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Junhee Lee, ChaeBeen Bang, MyoungChul Kim, MyeongAh Cho
Natural Language Processing
ๆฅ็ๆ่ฆ
Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, by jointly interpreting temporal motion patterns and semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, explicitly models both "how" motion evolves and "what" semantic category it resembles. Extensive experiments on WVAD benchmark validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alberto Gomez, Jorge Oliveira, Ramon Casero, Agis Chartsias
ๆฅ็ๆ่ฆ
Ultrasound (US) machines display images on a built-in monitor, but routine transfer to hospital systems relies on DICOM. We propose a self-supervised pipeline to extract the US image from a photograph of the monitor. This removes the DICOM bottleneck and enables rapid testing and prototyping of new algorithms. In a proof-of-concept study, the rectified images retained enough visual fidelity to classify cardiac views with a balanced accuracy of 0.79 with respect to the native DICOMs.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Soyul Lee, Seungmin Baek, Dongbo Min
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Monocular 3D object detection is a cost-effective solution for applications like autonomous driving and robotics, but remains fundamentally ill-posed due to inherently ambiguous depth cues. Recent DETR-based methods attempt to mitigate this through global attention and auxiliary depth prediction, yet they still struggle with inaccurate depth estimates. Moreover, these methods often overlook instance-level detection difficulty, such as occlusion, distance, and truncation, leading to suboptimal detection performance. We propose MonoDLGD, a novel Difficulty-Aware Label-Guided Denoising framework that adaptively perturbs and reconstructs ground-truth labels based on detection uncertainty. Specifically, MonoDLGD applies stronger perturbations to easier instances and weaker ones into harder cases, and then reconstructs them to effectively provide explicit geometric supervision. By jointly optimizing label reconstruction and 3D object detection, MonoDLGD encourages geometry-aware representation learning and improves robustness to varying levels of object complexity. Extensive experiments on the KITTI benchmark demonstrate that MonoDLGD achieves state-of-the-art performance across all difficulty levels.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ying Jiang, Jiayin Lu, Yunuo Chen, Yumeng He, Kui Wu et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Painting embodies a unique form of visual storytelling, where the creation process is as significant as the final artwork. Although recent advances in generative models have enabled visually compelling painting synthesis, most existing methods focus solely on final image generation or patch-based process simulation, lacking explicit stroke structure and failing to produce smooth, realistic shading. In this work, we present a differentiable stroke reconstruction framework that unifies painting, stylized texturing, and smudging to faithfully reproduce the human painting-smudging loop. Given an input image, our framework first optimizes single- and dual-color Bezier strokes through a parallel differentiable paint renderer, followed by a style generation module that synthesizes geometry-conditioned textures across diverse painting styles. We further introduce a differentiable smudge operator to enable natural color blending and shading. Coupled with a coarse-to-fine optimization strategy, our method jointly optimizes stroke geometry, color, and texture under geometric and semantic guidance. Extensive experiments on oil, watercolor, ink, and digital paintings demonstrate that our approach produces realistic and expressive stroke reconstructions, smooth tonal transitions, and richly stylized appearances, offering a unified model for expressive digital painting creation. See our project page for more demos: https://yingjiang96.github.io/DiffPaintWebsite/.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haoran Tang, Meng Cao, Ruyang Liu, Xiaoxi Liang, Linglong Li et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning-the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes-remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR's superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).
๐
2025-11-17
๐ AAAI 2026
๐ฅ Diego Ortego, Marlon Rodrรญguez, Mario Almagro, Kunal Dahiya, David Jimรฉnez et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a few billion-size decoder can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoder in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image enhanced versions validate our proposals' effectiveness, surpassing previous state-of-the-art by up to +8.21\% in P@1 on the largest dataset. ViXML's code is available at https://github.com/DiegoOrtego/vixml.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alec Sargood, Lemuel Puglisi, Elinor Thompson, Mirco Musolesi, Daniel C. Alexander
Machine Learning
ๆฅ็ๆ่ฆ
Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract's performance against state-of-the-art baselines. Notably, GenTract achieves precision 2.1x higher than the next-best method, TractOracle. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by an order of magnitude. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chao Yang, Boqian Zhang, Jinghao Xu, Guang Jiang
ๆฅ็ๆ่ฆ
Diffusion-based methods have shown great promise in single image super-resolution (SISR); however, existing approaches often produce blurred fine details due to insufficient guidance in the high-frequency domain. To address this issue, we propose a High-Frequency Guided Diffusion Network based on Wavelet Decomposition (HDW-SR), which replaces the conventional U-Net backbone in diffusion frameworks. Specifically, we perform diffusion only on the residual map, allowing the network to focus more effectively on high-frequency information restoration. We then introduce wavelet-based downsampling in place of standard CNN downsampling to achieve multi-scale frequency decomposition, enabling sparse cross-attention between the high-frequency subbands of the pre-super-resolved image and the low-frequency subbands of the diffused image for explicit high-frequency guidance. Moreover, a Dynamic Thresholding Block (DTB) is designed to refine high-frequency selection during the sparse attention process. During upsampling, the invertibility of the wavelet transform ensures low-loss feature reconstruction. Experiments on both synthetic and real-world datasets demonstrate that HDW-SR achieves competitive super-resolution performance, excelling particularly in recovering fine-grained image details. The code will be available after acceptance.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zahra Tabatabaei, Jon Sporring
Machine Learning
ๆฅ็ๆ่ฆ
According to the World Health Organization, breast cancer claimed the lives of approximately 685,000 women in 2020. Early diagnosis and accurate clinical decision making are critical in reducing this global burden. In this study, we propose THIR, a novel Content-Based Medical Image Retrieval (CBMIR) framework that leverages topological data analysis specifically, Betti numbers derived from persistent homology to characterize and retrieve histopathological images based on their intrinsic structural patterns. Unlike conventional deep learning approaches that rely on extensive training, annotated datasets, and powerful GPU resources, THIR operates entirely without supervision. It extracts topological fingerprints directly from RGB histopathological images using cubical persistence, encoding the evolution of loops as compact, interpretable feature vectors. The similarity retrieval is then performed by computing the distances between these topological descriptors, efficiently returning the top-K most relevant matches.
Extensive experiments on the BreaKHis dataset demonstrate that THIR outperforms state of the art supervised and unsupervised methods. It processes the entire dataset in under 20 minutes on a standard CPU, offering a fast, scalable, and training free solution for clinical image retrieval.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haodong Wang, Tao Zhuo, Xiuwei Zhang, Hanlin Yin, Wencong Wu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Achieving pixel-level registration between SAR and optical images remains a challenging task due to their fundamentally different imaging mechanisms and visual characteristics. Although deep learning has achieved great success in many cross-modal tasks, its performance on SAR-Optical registration tasks is still unsatisfactory. Gradient-based information has traditionally played a crucial role in handcrafted descriptors by highlighting structural differences. However, such gradient cues have not been effectively leveraged in deep learning frameworks for SAR-Optical image matching. To address this gap, we propose SOMA, a dense registration framework that integrates structural gradient priors into deep features and refines alignment through a hybrid matching strategy. Specifically, we introduce the Feature Gradient Enhancer (FGE), which embeds multi-scale, multi-directional gradient filters into the feature space using attention and reconstruction mechanisms to boost feature distinctiveness. Furthermore, we propose the Global-Local Affine-Flow Matcher (GLAM), which combines affine transformation and flow-based refinement within a coarse-to-fine architecture to ensure both structural consistency and local accuracy. Experimental results demonstrate that SOMA significantly improves registration precision, increasing the CMR@1px by 12.29% on the SEN1-2 dataset and 18.50% on the GFGE_SO dataset. In addition, SOMA exhibits strong robustness and generalizes well across diverse scenes and resolutions.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rifen Lin, Alex Jinpeng Wang, Jiawei Mo, Min Li
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored. Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) lack of genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion-an essential cue for distinguishing identities in video. In this work, we take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID. To achieve this, we propose Contrastive Skeleton-Image Pretraining for ReID (CSIP-ReID), a novel two-stage method that leverages skeleton sequences as a spatiotemporally informative modality aligned with video frames. In the first stage, we employ contrastive learning to align skeleton and visual features at sequence level. In the second stage, we introduce a dynamic Prototype Fusion Updater (PFU) to refine multimodal identity prototypes, fusing motion and appearance cues. Moreover, we propose a Skeleton Guided Temporal Modeling (SGTM) module that distills temporal cues from skeleton data and integrates them into visual features. Extensive experiments demonstrate that CSIP-ReID achieves new state-of-the-art results on standard video ReID benchmarks (MARS, LS-VID, iLIDS-VID). Moreover, it exhibits strong generalization to skeleton-only ReID tasks (BIWI, IAS), significantly outperforming previous methods. CSIP-ReID pioneers an annotation-free and motion-aware pretraining paradigm for ReID, opening a new frontier in multimodal representation learning.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cesar Portocarrero Rodriguez, Laura Vandeweyen, Yosuke Yamamoto
RoboticsNatural Language ProcessingMachine LearningComputer Vision
ๆฅ็ๆ่ฆ
The American Society of Civil Engineers has graded Americas infrastructure condition as a C, with the road system receiving a dismal D. Roads are vital to regional economic viability, yet their management, maintenance, and repair processes remain inefficient, relying on outdated manual or laser-based inspection methods that are both costly and time-consuming. With the increasing availability of real-time visual data from autonomous vehicles, there is an opportunity to apply computer vision (CV) methods for advanced road monitoring, providing insights to guide infrastructure rehabilitation efforts. This project explores the use of state-of-the-art CV techniques for road distress segmentation. It begins by evaluating synthetic data generated with Generative Adversarial Networks (GANs) to assess its usefulness for model training. The study then applies Convolutional Neural Networks (CNNs) for road distress segmentation and subsequently examines the transformer-based model MaskFormer. Results show that GAN-generated data improves model performance and that MaskFormer outperforms the CNN model in two metrics: mAP50 and IoU.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Longhui Zheng, Qiming Xia, Xiaolu Chen, Zhaoliang Liu, Chenglu Wen
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
3D object detection is critical for autonomous driving, yet it remains fundamentally challenging to simultaneously maximize computational efficiency and capture long-range spatial dependencies. We observed that Mamba-based models, with their linear state-space design, capture long-range dependencies at lower cost, offering a promising balance between efficiency and accuracy. However, existing methods rely on axis-aligned scanning within a fixed window, inevitably discarding spatial information. To address this problem, we propose WinMamba, a novel Mamba-based 3D feature-encoding backbone composed of stacked WinMamba blocks. To enhance the backbone with robust multi-scale representation, the WinMamba block incorporates a window-scale-adaptive module that compensates voxel features across varying resolutions during sampling. Meanwhile, to obtain rich contextual cues within the linear state space, we equip the WinMamba layer with a learnable positional encoding and a window-shift strategy. Extensive experiments on the KITTI and Waymo datasets demonstrate that WinMamba significantly outperforms the baseline. Ablation studies further validate the individual contributions of the WSF and AWF modules in improving detection accuracy. The code will be made publicly available.
๐
2025-11-17
๐ CVPR 2026
๐ฅ Junjie Yang, Yuhao Yan, Gang Wu, Yuxuan Wang, Ruoyu Liang et al.
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce \textsc{MedGEN-Bench}, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that necessitate sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate the performance of existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenyang Li, Wenbing Tang, Yihao Huang, Sinong Simon Zhan, Ming Hu et al.
Robotics
ๆฅ็ๆ่ฆ
Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Gagan Raj Gupta, Anshul Kumar, Manish Rai, Apu Chakraborty, Ashutosh Modi et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks. In telecommunications, they hold the potential to transform network optimization, automate troubleshooting, enhance customer support, and ensure regulatory compliance. However, their deployment in telecom is hindered by domain-specific challenges that demand specialized adaptation. To overcome these challenges and to accelerate the adaptation of LLMs for telecom, we propose MM-Telco, a comprehensive suite of multimodal benchmarks and models tailored for the telecom domain. The benchmark introduces various tasks (both text based and image based) that address various practical real-life use cases such as network operations, network management, improving documentation quality, and retrieval of relevant text and images. Further, we perform baseline experiments with various LLMs and VLMs. The models fine-tuned on our dataset exhibit a significant boost in performance. Our experiments also help analyze the weak areas in the working of current state-of-art multimodal LLMs, thus guiding towards further development and research.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zonghao Ying, Moyang Chen, Nizhang Li, Zhiqiang Wang, Wenxin Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models' cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger's effect. We formalize attack generation as a constrained optimization over the above modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments over 7 T2V models demonstrate the efficacy of our attack, achieving a 23 percent improvement in average attack success rate in commercial models.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Hao Long, Silin Zhou, Lisi Chen, Shuo Shang
ๆฅ็ๆ่ฆ
Recent learning-based methods have reduced the computational complexity of traditional trajectory similarity computation, but state-of-the-art (SOTA) methods still fail to leverage the comprehensive spectrum of trajectory information for similarity modeling. To tackle this problem, we propose \textbf{RePo}, a novel method that jointly encodes \textbf{Re}gion-wise and \textbf{Po}int-wise features to capture both spatial context and fine-grained moving patterns. For region-wise representation, the GPS trajectories are first mapped to grid sequences, and spatial context are captured by structural features and semantic context enriched by visual features. For point-wise representation, three lightweight expert networks extract local, correlation, and continuous movement patterns from dense GPS sequences. Then, a router network adaptively fuses the learned point-wise features, which are subsequently combined with region-wise features using cross-attention to produce the final trajectory embedding. To train RePo, we adopt a contrastive loss with hard negative samples to provide similarity ranking supervision. Experiment results show that RePo achieves an average accuracy improvement of 22.2\% over SOTA baselines across all evaluation metrics.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuqi Zhang, Guanying Chen, Jiaxing Chen, Chuanyu Fu, Chuan Huang et al.
ๆฅ็ๆ่ฆ
Reconstructing 3D scenes and synthesizing novel views from sparse input views is a highly challenging task. Recent advances in video diffusion models have demonstrated strong temporal reasoning capabilities, making them a promising tool for enhancing reconstruction quality under sparse-view settings. However, existing approaches are primarily designed for modest viewpoint variations, which struggle in capturing fine-grained details in close-up scenarios since input information is severely limited. In this paper, we present a diffusion-based framework, called CloseUpShot, for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion. Specifically, we observe that pixel-warping conditioning suffers from severe sparsity and background leakage in close-up settings. To address this, we propose hierarchical warping and occlusion-aware noise suppression, enhancing the quality and completeness of the conditioning images for the video diffusion model. Furthermore, we introduce global structure guidance, which leverages a dense fused point cloud to provide consistent geometric context to the diffusion process, to compensate for the lack of globally consistent 3D constraints in sparse conditioning inputs. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, especially in close-up novel view synthesis, clearly validating the effectiveness of our design.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hanzhe Liang, Jie Zhou, Can Gao, Bingyang Guo, Jinbao Wang et al.
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
3D anomaly detection (AD) is a crucial task in computer vision, aiming to identify anomalous points or regions from point cloud data. However, existing methods may encounter challenges when handling point clouds with changes in orientation and position because the resulting features may vary significantly. To address this problem, we propose a novel Rotationally Invariant Features (RIF) framework for 3D AD. Firstly, to remove the adverse effect of variations on point cloud data, we develop a Point Coordinate Mapping (PCM) technique, which maps each point into a rotationally invariant space to maintain consistency of representation. Then, to learn robust and discriminative features, we design a lightweight Convolutional Transform Feature Network (CTF-Net) to extract rotationally invariant features for the memory bank. To improve the ability of the feature extractor, we introduce the idea of transfer learning to pre-train the feature extractor with 3D data augmentation. Experimental results show that the proposed method achieves the advanced performance on the Anomaly-ShapeNet dataset, with an average P-AUROC improvement of 17.7\%, and also gains the best performance on the Real3D-AD dataset, with an average P-AUROC improvement of 1.6\%. The strong generalization ability of RIF has been verified by combining it with traditional feature extraction methods on anomaly detection tasks, demonstrating great potential for industrial applications.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhaocheng Yu, Kui Jiang, Junjun Jiang, Xianming Liu, Guanglu Sun et al.
RoboticsComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Rain significantly degrades the performance of computer vision systems, particularly in applications like autonomous driving and video surveillance. While existing deraining methods have made considerable progress, they often struggle with fidelity of semantic and spatial details. To address these limitations, we propose the Multi-Prior Hierarchical Mamba (MPHM) network for image deraining. This novel architecture synergistically integrates macro-semantic textual priors (CLIP) for task-level semantic guidance and micro-structural visual priors (DINOv2) for scene-aware structural information. To alleviate potential conflicts between heterogeneous priors, we devise a progressive Priors Fusion Injection (PFI) that strategically injects complementary cues at different decoder levels. Meanwhile, we equip the backbone network with an elaborate Hierarchical Mamba Module (HMM) to facilitate robust feature representation, featuring a Fourier-enhanced dual-path design that concurrently addresses global context modeling and local detail recovery. Comprehensive experiments demonstrate MPHM's state-of-the-art performance, achieving a 0.57 dB PSNR gain on the Rain200H dataset while delivering superior generalization on real-world rainy scenarios.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuaibin Fan, Senming Zhong, Wenchao Yan, Minglong Xue
Computer Vision
ๆฅ็ๆ่ฆ
Image dehazing is an important task in the field of computer vision, aiming at restoring clear and detail-rich visual content from haze-affected images. However, when dealing with complex scenes, existing methods often struggle to strike a balance between fine-grained feature representation of inhomogeneous haze distribution and global consistency modeling. Furthermore, to better learn the common degenerate representation of haze in spatial variations, we propose an unsupervised dehaze method for implicit neural degradation representation. Firstly, inspired by the Kolmogorov-Arnold representation theorem, we propose a mechanism combining the channel-independent and channel-dependent mechanisms, which efficiently enhances the ability to learn from nonlinear dependencies. which in turn achieves good visual perception in complex scenes. Moreover, we design an implicit neural representation to model haze degradation as a continuous function to eliminate redundant information and the dependence on explicit feature extraction and physical models. To further learn the implicit representation of the haze features, we also designed a dense residual enhancement module from it to eliminate redundant information. This achieves high-quality image restoration. Experimental results show that our method achieves competitive dehaze performance on various public and real-world datasets. This project code will be available at https://github.com/Fan-pixel/NeDR-Dehaze.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiazhen Yan, Ziqiang Li, Fan Wang, Boyu Wang, Zhangjie Fu
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fengzhi Xu, Ziyuan Yang, Mengyu Sun, Joey Tianyi Zhou, Yi Zhang
ๆฅ็ๆ่ฆ
Medical image enhancement is clinically valuable, but existing methods require large-scale datasets to learn complex pixel-level mappings. However, the substantial training and storage costs associated with these datasets hinder their practical deployment. While dataset distillation (DD) can alleviate these burdens, existing methods mainly target high-level tasks, where multiple samples share the same label. This many-to-one mapping allows distilled data to capture shared semantics and achieve information compression. In contrast, low-level tasks involve a many-to-many mapping that requires pixel-level fidelity, making low-level DD an underdetermined problem, as a small distilled dataset cannot fully constrain the dense pixel-level mappings. To address this, we propose the first low-level DD method for medical image enhancement. We first leverage anatomical similarities across patients to construct the shared anatomical prior based on a representative patient, which serves as the initialization for the distilled data of different patients. This prior is then personalized for each patient using a Structure-Preserving Personalized Generation (SPG) module, which integrates patient-specific anatomical information into the distilled dataset while preserving pixel-level fidelity. For different low-level tasks, the distilled data is used to construct task-specific high- and low-quality training pairs. Patient-specific knowledge is injected into the distilled data by aligning the gradients computed from networks trained on the distilled pairs with those from the corresponding patient's raw data. Notably, downstream users cannot access raw patient data. Instead, only a distilled dataset containing abstract training information is shared, which excludes patient-specific details and thus preserves privacy.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Seungjae Kim, SeungJoon Lee, MyeongAh Cho
ๆฅ็ๆ่ฆ
Multi-object tracking (MOT) predominantly follows the tracking-by-detection paradigm, where Kalman filters serve as the standard motion predictor due to computational efficiency but inherently fail on non-linear motion patterns. Conversely, recent data-driven motion predictors capture complex non-linear dynamics but suffer from limited domain generalization and computational overhead. Through extensive analysis, we reveal that even in datasets dominated by non-linear motion, Kalman filter outperforms data-driven predictors in up to 34\% of cases, demonstrating that real-world tracking scenarios inherently involve both linear and non-linear patterns. To leverage this complementarity, we propose PlugTrack, a novel framework that adaptively fuses Kalman filter and data-driven motion predictors through multi-perceptive motion understanding. Our approach employs multi-perceptive motion analysis to generate adaptive blending factors. PlugTrack achieves significant performance gains on MOT17/MOT20 and state-of-the-art on DanceTrack without modifying existing motion predictors. To the best of our knowledge, PlugTrack is the first framework to bridge classical and modern motion prediction paradigms through adaptive fusion in MOT.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu Zhu, Dan Zeng, Shuiwang Li, Qijun Zhao, Qiaomu Shen et al.
Computer Vision
ๆฅ็ๆ่ฆ
Recent research in Category-Agnostic Pose Estimation (CAPE) has adopted fixed textual keypoint description as semantic prior for two-stage pose matching frameworks. While this paradigm enhances robustness and flexibility by disentangling the dependency of support images, our critical analysis reveals two inherent limitations of static joint embedding: (1) polysemy-induced cross-category ambiguity during the matching process(e.g., the concept "leg" exhibiting divergent visual manifestations across humans and furniture), and (2) insufficient discriminability for fine-grained intra-category variations (e.g., posture and fur discrepancies between a sleeping white cat and a standing black cat). To overcome these challenges, we propose a new framework that innovatively integrates hierarchical cross-modal interaction with dual-stream feature refinement, enhancing the joint embedding with both class-level and instance-specific cues from textual description and specific images. Experiments on the MP-100 dataset demonstrate that, regardless of the network backbone, CapeNext consistently outperforms state-of-the-art CAPE methods by a large margin.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Doanh C. Bui, Ba Hung Ngo, Hoai Luan Pham, Khang Nguyen, Maรฏ K. Nguyen et al.
Multimodal
ๆฅ็ๆ่ฆ
Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at https://github.com/caodoanh2001/MergeSlide.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy than monolithic approaches. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results. Code and the Grounding Benchmark Toolkit (GBT) are available at https://github.com/samsungsds-research-papers/mega-gui.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kyunghyun Lee, Yong-Min Shin, Minwoo Shin, Jihun Kim, Sunghwan Lim et al.
Machine Learning
ๆฅ็ๆ่ฆ
Early diagnosis of breast cancer is crucial, enabling the establishment of appropriate treatment plans and markedly enhancing patient prognosis. While direct magnetic resonance imaging-guided biopsy demonstrates promising performance in detecting cancer lesions, its practical application is limited by prolonged procedure times and high costs. To overcome these issues, an indirect MRI-guided biopsy that allows the procedure to be performed outside of the MRI room has been proposed, but it still faces challenges in creating an accurate real-time deformable breast model. In our study, we tackled this issue by developing a graph neural network (GNN)-based model capable of accurately predicting deformed breast cancer sites in real time during biopsy procedures. An individual-specific finite element (FE) model was developed by incorporating magnetic resonance (MR) image-derived structural information of the breast and tumor to simulate deformation behaviors. A GNN model was then employed, designed to process surface displacement and distance-based graph data, enabling accurate prediction of overall tissue displacement, including the deformation of the tumor region. The model was validated using phantom and real patient datasets, achieving an accuracy within 0.2 millimeters (mm) for cancer node displacement (RMSE) and a dice similarity coefficient (DSC) of 0.977 for spatial overlap with actual cancerous regions. Additionally, the model enabled real-time inference and achieved a speed-up of over 4,000 times in computational cost compared to conventional FE simulations. The proposed deformation-aware GNN model offers a promising solution for real-time tumor displacement prediction in breast biopsy, with high accuracy and real-time capability. Its integration with clinical procedures could significantly enhance the precision and efficiency of breast cancer diagnosis.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yehonatan Elisha, Seffi Cohen, Oren Barkan, Noam Koenigstein
Machine Learning
ๆฅ็ๆ่ฆ
Saliency maps are widely used for visual explanations in deep learning, but a fundamental lack of consensus persists regarding their intended purpose and alignment with diverse user queries. This ambiguity hinders the effective evaluation and practical utility of explanation methods.We address this gap by introducing the Reference-Frame $\times$ Granularity (RFxG) taxonomy, a principled conceptual framework that organizes saliency explanations along two essential axes:Reference-Frame: Distinguishing between pointwise ("Why this prediction?") and contrastive ("Why this and not an alternative?") explanations.Granularity: Ranging from fine-grained class-level (e.g., "Why Husky?") to coarse-grained group-level (e.g., "Why Dog?") interpretations.Using the RFxG lens, we demonstrate critical limitations in existing evaluation metrics, which overwhelmingly prioritize pointwise faithfulness while neglecting contrastive reasoning and semantic granularity. To systematically assess explanation quality across both RFxG dimensions, we propose four novel faithfulness metrics. Our comprehensive evaluation framework applies these metrics to ten state-of-the-art saliency methods, four model architectures, and three datasets.By advocating a shift toward user-intent-driven evaluation, our work provides both the conceptual foundation and the practical tools necessary to develop visual explanations that are not only faithful to the underlying model behavior but are also meaningfully aligned with the complexity of human understanding and inquiry.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiacheng Tang, Mingyue Feng, Jiachao Liu, Yaonong Wang, Jian Pu
Robotics
ๆฅ็ๆ่ฆ
Modular design of planning-oriented autonomous driving has markedly advanced end-to-end systems. However, existing architectures remain constrained by an over-reliance on ego status, hindering generalization and robust scene understanding. We identify the root cause as an inherent design within these architectures that allows ego status to be easily leveraged as a shortcut. Specifically, the premature fusion of ego status in the upstream BEV encoder allows an information flow from this strong prior to dominate the downstream planning module. To address this challenge, we propose AdaptiveAD, an architectural-level solution based on a multi-context fusion strategy. Its core is a dual-branch structure that explicitly decouples scene perception and ego status. One branch performs scene-driven reasoning based on multi-task learning, but with ego status deliberately omitted from the BEV encoder, while the other conducts ego-driven reasoning based solely on the planning task. A scene-aware fusion module then adaptively integrates the complementary decisions from the two branches to form the final planning trajectory. To ensure this decoupling does not compromise multi-task learning, we introduce a path attention mechanism for ego-BEV interaction and add two targeted auxiliary tasks: BEV unidirectional distillation and autoregressive online mapping. Extensive evaluations on the nuScenes dataset demonstrate that AdaptiveAD achieves state-of-the-art open-loop planning performance. Crucially, it significantly mitigates the over-reliance on ego status and exhibits impressive generalization capabilities across diverse scenarios.
๐
2025-11-17
๐ IEEE WACV'26 Main Conference
๐ฅ Reeshoon Sayera, Akash Kumar, Sirshapan Mitra, Prudvi Kamtam, Yogesh S Rawat
Robotics
ๆฅ็ๆ่ฆ
Appearance-based gait recognition have achieved strong performance on controlled datasets, yet systematic evaluation of its robustness to real-world corruptions and silhouette variability remains lacking. We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. RobustGait evaluation spans four dimensions: the type of perturbation (digital, environmental, temporal, occlusion), the silhouette extraction method (segmentation and parsing networks), the architectural capacities of gait recognition models, and various deployment scenarios. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. We came across several exciting insights. First, applying noise at the RGB level better reflects real-world degradation, and reveal how distortions propagate through silhouette extraction to the downstream gait recognition systems. Second, gait accuracy is highly sensitive to silhouette extractor biases, revealing an overlooked source of benchmark bias. Third, robustness is dependent on both the type of perturbation and the architectural design. Finally, we explore robustness-enhancing strategies, showing that noise-aware training and knowledge distillation improve performance and move toward deployment-ready systems.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenghua Li, Hang Chen, Zihao Sun, Kai Li, Xiaolin Hu
ๆฅ็ๆ่ฆ
Accurate segmentation of neural structures in Electron Microscopy (EM) images is paramount for neuroscience. However, this task is challenged by intricate morphologies, low signal-to-noise ratios, and scarce annotations, limiting the accuracy and generalization of existing methods. To address these challenges, we seek to leverage the priors learned by visual foundation models on a vast amount of natural images to better tackle this task. Specifically, we propose a novel framework that can effectively transfer knowledge from Segment Anything 2 (SAM2), which is pre-trained on natural images, to the EM domain. We first use SAM2 to extract powerful, general-purpose features. To bridge the domain gap, we introduce a Feature-Guided Attention module that leverages semantic cues from SAM2 to guide a lightweight encoder, the Fine-Grained Encoder (FGE), in focusing on these challenging regions. Finally, a dual-affinity decoder generates both coarse and refined affinity maps. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art (SOTA) approaches with the SAM2 weights frozen. Upon further fine-tuning on EM data, our method significantly outperforms existing SOTA methods. This study validates that transferring representations pre-trained on natural images, when combined with targeted domain-adaptive guidance, can effectively address the specific challenges in neuron segmentation.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruixin Liu, Zejian Yuan
ๆฅ็ๆ่ฆ
Monocular 3D lane detection is challenged by aleatoric uncertainty arising from inherent observation noise. Existing methods rely on simplified geometric assumptions, such as independent point predictions or global planar modeling, failing to capture structural variations and aleatoric uncertainty in real-world scenarios. In this paper, we propose MonoUnc, a bird's-eye view (BEV)-free 3D lane detector that explicitly models aleatoric uncertainty informed by local lane structures. Specifically, 3D lanes are projected onto the front-view (FV) space and approximated by parametric curves. Guided by curve predictions, curve-point query embeddings are dynamically generated for lane point predictions in 3D space. Each segment formed by two adjacent points is modeled as a 3D Gaussian, parameterized by the local structure and uncertainty estimations. Accordingly, a novel 3D Gaussian matching loss is designed to constrain these parameters jointly. Experiments on the ONCE-3DLanes and OpenLane datasets demonstrate that MonoUnc outperforms previous state-of-the-art (SoTA) methods across all benchmarks under stricter evaluation criteria. Additionally, we propose two comprehensive evaluation metrics for ONCE-3DLanes, calculating the average and maximum bidirectional Chamfer distances to quantify global and local errors. Codes are released at https://github.com/lrx02/MonoUnc.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua et al.
ๆฅ็ๆ่ฆ
Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, Joseph K J
Natural Language ProcessingComputer VisionMultimodal
ๆฅ็ๆ่ฆ
Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader's experience.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiaming Qu, Mengtian Guo, Yue Wang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Machine learning classifiers can learn from large amounts of training examples to effectively distinguish deceptive reviews from genuine ones. However, the distinguishing features learned by these classifiers are often subtle, fragmented, and difficult for humans to interpret. In this work, we explore using large language models (LLMs) to translate machine-learned lexical cues into human-understandable language phenomena that can differentiate deceptive reviews from genuine ones. We show that language phenomena obtained in this manner are empirically grounded in data, generalizable across similar domains, and more predictive than phenomena either in LLMs' prior knowledge or obtained through in-context learning. These language phenomena have the potential to aid people in critically assessing the credibility of online reviews in environments where deception detection classifiers are unavailable.
๐
2025-11-17
๐ ArXiv Preprint (cs.SE)
๐ฅ Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gรถdel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning-the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift, which binds symbols to reality in a fundamental way, serving as the cornerstone of most modern technologies. In this work, we manage to advance physics research by developing large language models with exceptional physics reasoning capabilities, especially excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, getting a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also present great performance on other reasoning tasks like math and coding, showing the great generalibility of P1 series.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Piaohong Wang, Motong Tian, Jiaxian Li, Yuan Liang, Yuqing Wang et al.
ๆฅ็ๆ่ฆ
Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.76% on the public LoCoMo benchmark, a nearly 3% improvement upon LangMem,the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement upon A-Mem,the previous state-of-the-art. O-Mem also boosts token and interaction response time efficiency compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hao Wang, Yuanfeng Song, Xiaoming Yin, Xing Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.
๐
2025-11-17
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Siyang Cheng, Gaotian Liu, Rui Mei, Yilin Wang, Kejia Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches e.g. AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across \textit{character, word, and sentence-level} operations to enhance attack diversity; then we employ interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false positives and improving detection effectiveness. Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mรกtรฉ Gedeon, Piroska Zsรณfia Barta, Pรฉter Mihajlik, Tekla Etelka Grรกczi, Anna Kohรกri et al.
ๆฅ็ๆ่ฆ
The advancement of automatic speech recognition (ASR) has been largely enhanced by extensive datasets in high-resource languages, while languages such as Hungarian remain underrepresented due to limited spontaneous and conversational corpora. To address this gap, we introduce two new datasets -- BEA-Large and BEA-Dialogue -- constructed from the previously unprocessed portions of the Hungarian speech corpus named BEA. BEA-Large extends BEA-Base with 255 hours of spontaneous speech from 433 speakers, enriched with detailed segment-level metadata. BEA-Dialogue, comprising 85 hours of spontaneous conversations, is a Hungarian speech corpus featuring natural dialogues partitioned into speaker-independent subsets, supporting research in conversational ASR and speaker diarization. We establish reproducible baselines on these datasets using publicly available ASR models, with the fine-tuned Fast Conformer model achieving word error rates as low as 14.18\% on spontaneous and 4.8\% on repeated speech. Diarization experiments yield diarization error rates between 13.05\% and 18.26\%, providing reference points for future improvements. The results highlight the persistent difficulty of conversational ASR, particularly due to disfluencies, overlaps, and informal speech patterns. By releasing these datasets and baselines, we aim to advance Hungarian speech technology and offer a methodological framework for developing spontaneous and conversational benchmarks in other languages.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Elinor Poole-Dayan, Daniel T Kessler, Hannah Chiou, Margaret Hughes, Emily S Lin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Public Narratives (PNs) are key tools for leadership development and civic mobilization, yet their systematic analysis remains challenging due to their subjective interpretation and the high cost of expert annotation. In this work, we propose a novel computational framework that leverages large language models (LLMs) to automate the qualitative annotation of public narratives. Using a codebook we co-developed with subject-matter experts, we evaluate LLM performance against that of expert annotators. Our work reveals that LLMs can achieve near-human-expert performance, achieving an average F1 score of 0.80 across 8 narratives and 14 codes. We then extend our analysis to empirically explore how PN framework elements manifest across a larger dataset of 22 stories. Lastly, we extrapolate our analysis to a set of political speeches, establishing a novel lens in which to analyze political rhetoric in civic spaces. This study demonstrates the potential of LLM-assisted annotation for scalable narrative analysis and highlights key limitations and directions for future research in computational civic storytelling.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Attapol T. Rutherford, Sirisak Chueykamhang, Thachaparn Bunditlurdruk, Nanthicha Angsuwichitkul
Natural Language Processing
ๆฅ็ๆ่ฆ
Understanding sentiment in financial documents is crucial for gaining insights into market behavior. These reports often contain obfuscated language designed to present a positive or neutral outlook, even when underlying conditions may be less favorable. This paper presents a novel approach using Aspect-Based Sentiment Analysis (ABSA) to decode obfuscated sentiment in Thai financial annual reports. We develop specific guidelines for annotating obfuscated sentiment in these texts and annotate more than one hundred financial reports. We then benchmark various text classification models on this annotated dataset, demonstrating strong performance in sentiment classification. Additionally, we conduct an event study to evaluate the real-world implications of our sentiment analysis on stock prices. Our results suggest that market reactions are selectively influenced by specific aspects within the reports. Our findings underscore the complexity of sentiment analysis in financial texts and highlight the importance of addressing obfuscated language to accurately assess market sentiment.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Serge Gladkoff, Lifeng Han, Katerina Gasova
ๆฅ็ๆ่ฆ
Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition.
Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size.
Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model
E(x) = a * ln(1 + b * x), a, b > 0,
anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added.
The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Allaa Boutaleb, Bernd Amann, Rafael Angarita, Hubert Naacke
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Open-domain question answering over datalakes requires retrieving and composing information from multiple tables, a challenging subtask that demands semantic relevance and structural coherence (e.g., joinability). While exact optimization methods like Mixed-Integer Programming (MIP) can ensure coherence, their computational complexity is often prohibitive. Conversely, simpler greedy heuristics that optimize for query coverage alone often fail to find these coherent, joinable sets. This paper frames multi-table retrieval as an iterative search process, arguing this approach offers advantages in scalability, interpretability, and flexibility. We propose a general framework and a concrete instantiation: a fast, effective Greedy Join-Aware Retrieval algorithm that holistically balances relevance, coverage, and joinability. Experiments across 5 NL2SQL benchmarks demonstrate that our iterative method achieves competitive retrieval performance compared to the MIP-based approach while being 4-400x faster depending on the benchmark and search space settings. This work highlights the potential of iterative heuristics for practical, scalable, and composition-aware retrieval.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Wanqing Cui, Wei Huang, Yazhi Guo, Yibo Hu, Meiguang Jin et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To alleviate this problem, we propose a \textbf{A}ttention-\textbf{G}rounded \textbf{RE}triever \textbf{E}nhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models as proxy local supervision to guide the identification of relevant document regions. During training, AGREE combines local signals with the global signals to jointly optimize the retriever, enabling it to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging ViDoRe V2 benchmark show that AGREE significantly outperforms the global-supervision-only baseline. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://anonymous.4open.science/r/AGREE-2025.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Zhaopei Huang, Qifeng Dai, Guozheng Wu, Xiaopeng Wu, Kehan Chen et al.
ๆฅ็ๆ่ฆ
With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users' subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Siyu Zhu, Mouxiao Bian, Yue Xie, Yongyu Tang, Zhikang Yu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework centered on a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that pediatric LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a clinical feedback-model iteration loop to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kajetan Dymkiewicz, Ivan Vulic, Helen Yannakoudakis, Eilam Shapira, Roi Reichart et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) perform strongly across tasks and languages, yet how improvements in one task or language affect other tasks and languages and their combinations remains poorly understood. We conduct a controlled PEFT/LoRA study across multiple open-weight LLM families and sizes, treating task and language as transfer axes while conditioning on model family and size; we fine-tune each model on a single task-language source and measure transfer as the percentage-point change versus its baseline score when evaluated on all other task-language target pairs. We decompose transfer into (i) Matched-Task (Cross-Language), (ii) Matched-Language (Cross-Task), and (iii) Cross-Task (Cross-Language) regimes. We uncover two consistent general patterns. First, a pronounced on-task vs. off-task asymmetry: Matched-Task (Cross-Language) transfer is reliably positive, whereas off-task transfer often incurs collateral degradation. Second, a stable donor-recipient structure across languages and tasks (hub donors vs. brittle recipients). We outline implications for risk-aware fine-tuning and model specialisation.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Maram Alharbi, Salmane Chafik, Saad Ezzini, Ruslan Mitkov, Tharindu Ranasinghe et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Alexandru-Mihai Apostu, Andrei Preda, Alexandra Daniela Damir, Diana Bolocan, Radu Tudor Ionescu et al.
ๆฅ็ๆ่ฆ
Generating thorough natural language explanations for threat detections remains an open problem in cybersecurity research, despite significant advances in automated malware detection systems. In this work, we present AutoMalDesc, an automated static analysis summarization framework that, following initial training on a small set of expert-curated examples, operates independently at scale. This approach leverages an iterative self-paced learning pipeline to progressively enhance output quality through synthetic data generation and validation cycles, eliminating the need for extensive manual data annotation. Evaluation across 3,600 diverse samples in five scripting languages demonstrates statistically significant improvements between iterations, showing consistent gains in both summary quality and classification accuracy. Our comprehensive validation approach combines quantitative metrics based on established malware labels with qualitative assessment from both human experts and LLM-based judges, confirming both technical precision and linguistic coherence of generated summaries. To facilitate reproducibility and advance research in this domain, we publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) and test (3.6K) datasets, along with our methodology and evaluation framework.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Shufan Yang, Zifeng Cheng, Zhiwei Jiang, Yafeng Yin, Cong Wang et al.
ๆฅ็ๆ่ฆ
Embedding-as-a-Service (EaaS) is an effective and convenient deployment solution for addressing various NLP tasks. Nevertheless, recent research has shown that EaaS is vulnerable to model extraction attacks, which could lead to significant economic losses for model providers. For copyright protection, existing methods inject watermark embeddings into text embeddings and use them to detect copyright infringement. However, current watermarking methods often resist only a subset of attacks and fail to provide \textit{comprehensive} protection. To this end, we present the region-triggered semantic watermarking framework called RegionMarker, which defines trigger regions within a low-dimensional space and injects watermarks into text embeddings associated with these regions. By utilizing a secret dimensionality reduction matrix to project onto this subspace and randomly selecting trigger regions, RegionMarker makes it difficult for watermark removal attacks to evade detection. Furthermore, by embedding watermarks across the entire trigger region and using the text embedding as the watermark, RegionMarker is resilient to both paraphrasing and dimension-perturbation attacks. Extensive experiments on various datasets show that RegionMarker is effective in resisting different attack methods, thereby protecting the copyright of EaaS.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Jea Kwon, Luiz Felipe Vecchietti, Sungwon Park, Meeyoung Cha
Natural Language Processing
ๆฅ็ๆ่ฆ
Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, it is important to understand their moral reasoning and the inherent uncertainties in building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models and 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than within moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via "dropout" at inference time. Our findings show that our mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, this mechanism significantly improves human-LLM moral alignment, with correlations in mutual information and alignment score shifts. Our results highlight the potential to better align model-generated decisions and human preferences by deliberately modulating uncertainty and reducing LLMs' confidence in morally complex scenarios.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shalini Maiti, Amar Budhiraja, Bhavul Gauri, Gaurav Chaurasia, Anton Protopopov et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping-the practice of averaging weights from multiple models of the same architecture-has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Patrick Parschan, Charlott Jakob
Natural Language Processing
ๆฅ็ๆ่ฆ
This article presents the first systematic review of unsupervised and semi-supervised computational text-based ideal point estimation (CT-IPE) algorithms, methods designed to infer latent political positions from textual data. These algorithms are widely used in political science, communication, computational social science, and computer science to estimate ideological preferences from parliamentary speeches, party manifestos, and social media. Over the past two decades, their development has closely followed broader NLP trends -- beginning with word-frequency models and most recently turning to large language models (LLMs). While this trajectory has greatly expanded the methodological toolkit, it has also produced a fragmented field that lacks systematic comparison and clear guidance for applied use. To address this gap, we identified 25 CT-IPE algorithms through a systematic literature review and conducted a manual content analysis of their modeling assumptions and development contexts. To compare them meaningfully, we introduce a conceptual framework that distinguishes how algorithms generate, capture, and aggregate textual variance. On this basis, we identify four methodological families -- word-frequency, topic modeling, word embedding, and LLM-based approaches -- and critically assess their assumptions, interpretability, scalability, and limitations. Our review offers three contributions. First, it provides a structured synthesis of two decades of algorithm development, clarifying how diverse methods relate to one another. Second, it translates these insights into practical guidance for applied researchers, highlighting trade-offs in transparency, technical requirements, and validation strategies that shape algorithm choice. Third, it emphasizes that differences in estimation outcomes across algorithms are themselves informative, underscoring the need for systematic benchmarking.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tyler Loakman, Joseph James, Chenghua Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mihai Dan Nadas, Laura Diosan
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI's GPT-3.5, GPT-4, GPT-4o, Google's Gemini 1.0 Pro, Meta's Llama 2 and Llama 3, MistralAI's Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro's RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta's Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ronit D. Gross, Yanir Harel, Ido Kanter
ๆฅ็ๆ่ฆ
The translation of written language has been known since the 3rd century BC; however, its necessity has become increasingly common in the information age. Today, many translators exist, based on encoder-decoder deep architectures, nevertheless, no quantitative objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, with the following key finding. Given a translator, several sentences that differ by only one selected token of a given pivot sentence yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, consisting each of a pivot selected token, yields the probabilities of replacing this specific token with others while preserving the translation. These probabilities constitute the entropy of the selected token, and the average across all selected pivot tokens provides an estimate of the translator's overall translation entropy, which is enhanced along the decoder blocks. This entropic measure allows for the quantitative ranking of several publicly available translators and reveals whether mutual translation entropy is symmetric. Extending the proposed method to include the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect, where translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and objective benchmarking of artificial translators. Results are based on MarianMT, T5-Base and NLLB-200 translators.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tianai Huang, Jiayuan Chen, Lu Lu, Pengcheng Chen, Tianbin Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally-rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek\_r1 and gemini\_2\_5\_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed a substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but aldso exposes fundamental weaknesses in their reasoning stability. To promote further research and standardized comparison, TCM-5CEval has been uploaded to the Medbench platform, joining its predecessor in the "In-depth Challenge for Comprehensive TCM Abilities" special track.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zaara Zabeen Arpa, Sadnam Sakib Apurbo, Nazia Karim Khan Oishee, Ajwad Abrar
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68\% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78\% and an F1 score of 0.677. This establishes a strong, linguistically-informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sourya Dipta Das, Shubham Kumar, Kuldeep Yadav
Natural Language Processing
ๆฅ็ๆ่ฆ
Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature. Developing accurate grammar scoring models further requires extensive expert annotation, making large-scale data creation impractical. To address these limitations, we propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency rubric-based prompts. These predictions, treated as pseudo labels, are utilized to train a transformer-based model through a novel training framework designed to handle label noise effectively. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy. Finally, a qualitative analysis of error intensity and score prediction confirms the robustness and interpretability of our approach. Experimental results demonstrate the efficacy of our approach in estimating grammar competency scores with high accuracy, paving the way for scalable, low-resource grammar assessment systems.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nigar Alishzade, Gulchin Abdullayeva
Natural Language Processing
ๆฅ็ๆ่ฆ
This study presents a systematic comparative analysis of recurrent and attention-based neural architectures for isolated sign language recognition. We implement and evaluate two representative models-ConvLSTM and Vanilla Transformer-on the Azerbaijani Sign Language Dataset (AzSLD) and the Word-Level American Sign Language (WLASL) dataset. Our results demonstrate that the attention-based Vanilla Transformer consistently outperforms the recurrent ConvLSTM in both Top-1 and Top-5 accuracy across datasets, achieving up to 76.8% Top-1 accuracy on AzSLD and 88.3% on WLASL. The ConvLSTM, while more computationally efficient, lags in recognition accuracy, particularly on smaller datasets. These findings highlight the complementary strengths of each paradigm: the Transformer excels in overall accuracy and signer independence, whereas the ConvLSTM offers advantages in computational efficiency and temporal modeling. The study provides a nuanced analysis of these trade-offs, offering guidance for architecture selection in sign language recognition systems depending on application requirements and resource constraints.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Quanjiang Guo, Sijie Wang, Jinchuan Zhang, Ben Zhang, Zhao Kang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Zero-shot event extraction (ZSEE) remains a significant challenge for large language models (LLMs) due to the need for complex reasoning and domain-specific understanding. Direct prompting often yields incomplete or structurally invalid outputs--such as misclassified triggers, missing arguments, and schema violations. To address these limitations, we present Agent-Event-Coder (AEC), a novel multi-agent framework that treats event extraction like software engineering: as a structured, iterative code-generation process. AEC decomposes ZSEE into specialized subtasks--retrieval, planning, coding, and verification--each handled by a dedicated LLM agent. Event schemas are represented as executable class definitions, enabling deterministic validation and precise feedback via a verification agent. This programming-inspired approach allows for systematic disambiguation and schema enforcement through iterative refinement. By leveraging collaborative agent workflows, AEC enables LLMs to produce precise, complete, and schema-consistent extractions in zero-shot settings. Experiments across five diverse domains and six LLMs demonstrate that AEC consistently outperforms prior zero-shot baselines, showcasing the power of treating event extraction like code generation. The code and data are released on https://github.com/UESTC-GQJ/Agent-Event-Coder.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhichao He, Mouxiao Bian, Jianhong Zhu, Jiayuan Chen, Yunqiu Wang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The Consolidated Standards of Reporting Trials statement is the global benchmark for transparent and high-quality reporting of randomized controlled trials. Manual verification of CONSORT adherence is a laborious, time-intensive process that constitutes a significant bottleneck in peer review and evidence synthesis. This study aimed to systematically evaluate the accuracy and reliability of contemporary LLMs in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a golden standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1-score for the three-class classification task, supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1 scores of 0.634 and Cohen's Kappa coefficients of 0.280 and 0.282, respectively, indicating only fair agreement with expert consensus. A striking performance disparity was observed across classes: while most models could identify compliant items with high accuracy (F1 score > 0.850), they struggled profoundly with identifying non-compliant and not applicable items, where F1 scores rarely exceeded 0.400. Notably, some high-profile models like GPT-4o underperformed, achieving a macro F1-score of only 0.521. LLMs show potential as preliminary screening assistants for CONSORT checks, capably identifying well-reported items. However, their current inability to reliably detect reporting omissions or methodological flaws makes them unsuitable for replacing human expertise in the critical appraisal of trial quality.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chuyuan Li, Giuseppe Carenini
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with in total 52 individual datasets. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as some novel challenges such as discourse particle disambiguation (e.g., ``just''), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs: Qwen3 series, DeepSeek-R1, and frontier model such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance in arithmetic aspect of temporal reasoning, but they struggle with full document reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation recognition.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xinyuan Zhou, Yi Lei, Xiaoyu Zhou, Jingyi Sun, Yu Zhu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via an three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. Key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly-sized open-source models, attaining a 37.0\% average pass rate (pass@32). It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0\% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at:https://www.modelscope.cn/organization/iflytek, https://gitcode.com/ifly_opensource.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kasun Wickramasinghe, Nisansa de Silva
ๆฅ็ๆ่ฆ
Sans a dwindling number of monolingual embedding studies originating predominantly from the low-resource domains, it is evident that multilingual embedding has become the de facto choice due to its adaptability to the usage of code-mixed languages, granting the ability to process multilingual documents in a language-agnostic manner, as well as removing the difficult task of aligning monolingual embeddings. But is this victory complete? Are the multilingual models better than aligned monolingual models in every aspect? Can the higher computational cost of multilingual models always be justified? Or is there a compromise between the two extremes? Bilingual Lexicon Induction is one of the most widely used metrics in terms of evaluating the degree of alignment between two embedding spaces. In this study, we explore the strengths and limitations of BLI as a measure to evaluate the degree of alignment of two embedding spaces. Further, we evaluate how well traditional embedding alignment techniques, novel multilingual models, and combined alignment techniques perform BLI tasks in the contexts of both high-resource and low-resource languages. In addition to that, we investigate the impact of the language families to which the pairs of languages belong. We identify that BLI does not measure the true degree of alignment in some cases and we propose solutions for them. We propose a novel stem-based BLI approach to evaluate two aligned embedding spaces that take into account the inflected nature of languages as opposed to the prevalent word-based BLI techniques. Further, we introduce a vocabulary pruning technique that is more informative in showing the degree of the alignment, especially performing BLI on multilingual embedding models. Often, combined embedding alignment techniques perform better while in certain cases multilingual embeddings perform better (mainly low-resource language cases).
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Declan Jackson, William Keating, George Cameron, Micah Hill-Smith
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Sachin Vashistha, Aryan Bibhuti, Atharva Naik, Martin Tutek, Somak Aditya
Natural Language Processing
ๆฅ็ๆ่ฆ
Real-world conversations are rich with pragmatic elements, such as entity mentions, references, and implicatures. Understanding such nuances is a requirement for successful natural communication, and often requires building a local world model which encodes such elements and captures the dynamics of their evolving states. However, it is not well-understood whether language models (LMs) construct or maintain a robust implicit representation of conversations. In this work, we evaluate the ability of LMs to encode and update their internal world model in dyadic conversations and test their malleability under linguistic alterations. To facilitate this, we apply seven minimal linguistic alterations to conversations sourced from popular datasets and construct two benchmarks comprising yes-no questions. We evaluate a wide range of open and closed source LMs and observe that they struggle to maintain robust accuracy. Our analysis unveils that LMs struggle to memorize crucial details, such as tracking entities under linguistic alterations to conversations. We then propose a dual-perspective interpretability framework which identifies transformer layers that are useful or harmful and highlights linguistic alterations most influenced by harmful layers, typically due to encoding spurious signals or relying on shortcuts. Inspired by these insights, we propose two layer-regularization based fine-tuning strategies that suppress the effect of the harmful layers.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Genglin Liu, Shijie Geng, Sha Li, Hejie Cui, Sarah Zhang et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal LLM-powered agents have recently demonstrated impressive capabilities in web navigation, enabling agents to complete complex browsing tasks across diverse domains. However, current agents struggle with repetitive errors and lack the ability to learn from past experiences across sessions, limiting their long-term robustness and sample efficiency. We introduce WebCoach, a model-agnostic self-evolving framework that equips web browsing agents with persistent cross-session memory, enabling improved long-term planning, reflection, and continual learning without retraining. WebCoach consists of three key components: (1) a WebCondenser, which standardizes raw navigation logs into concise summaries; (2) an External Memory Store, which organizes complete trajectories as episodic experiences; and (3) a Coach, which retrieves relevant experiences based on similarity and recency, and decides whether to inject task-specific advice into the agent via runtime hooks. This design empowers web agents to access long-term memory beyond their native context window, improving robustness in complex browsing tasks. Moreover, WebCoach achieves self-evolution by continuously curating episodic memory from new navigation trajectories, enabling agents to improve over time without retraining. Evaluations on the WebVoyager benchmark demonstrate that WebCoach consistently improves the performance of browser-use agents across three different LLM backbones. With a 38B model, it increases task success rates from 47% to 61% while reducing or maintaining the average number of steps. Notably, smaller base models with WebCoach achieve performance comparable to the same web agent using GPT-4o.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Zeyu Shi, Ziming Wang, Tianyu Chen, Shiqi Gao, Haoyi Zhou et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The honesty of Large Language Models (LLMs) is increasingly important for safe deployment in high-stakes domains. However, this crucial trait is severely undermined by supervised fine-tuning (SFT), a common technique for model specialization. Existing recovery methods rely on data-intensive global parameter adjustments, implicitly assuming that SFT deeply corrupts the models' ability to recognize their knowledge boundaries. However, we observe that fine-tuned LLMs still preserve this ability; what is damaged is their capacity to faithfully express that awareness. Building on this, we propose Honesty-Critical Neurons Restoration (HCNR) to surgically repair this suppressed capacity. HCNR identifies and restores key expression-governing neurons to their pre-trained state while harmonizing them with task-oriented neurons via Hessian-guided compensation. Experiments on four QA tasks and five LLM families demonstrate that HCNR effectively recovers 33.25% of the compromised honesty while achieving at least 2.23x speedup with over 10x less data compared to baseline methods, offering a practical solution for trustworthy LLM deployment.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Haokun Li, Yazhou Zhang, Jizhi Ding, Qiuchi Li, Peng Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Can multi-modal large language models (MLLMs) truly understand what they can see? Extending Searle's Chinese Room into the multi-modal domain, this paper proposes the Visual Room argument: MLLMs may describe every visual detail precisely yet fail to comprehend the underlying emotions and intentions, namely seeing is not understanding. Building on this, we introduce \textit{Visual Room} 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The perception component ranges from attribute recognition to scene understanding, while the cognition component extends from textual entailment to causal and social reasoning. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception to cognition. Evaluating 10 state-of-the-art (SoTA) MLLMs, we highlight three key findings: (1) MLLMs exhibit stronger perceptual competence than cognitive ability (8.0\%$\uparrow$); (2) cognition appears not causally dependent on perception-based reasoning; and (3) cognition scales with model size, but perception does not consistently improve with larger variants. This work operationalizes Seeing $\ne$ Understanding as a testable hypothesis, offering a new paradigm from perceptual processing to cognitive reasoning in MLLMs. Our dataset is available at https://huggingface.co/datasets/LHK2003/PCBench.
๐
2025-11-17
๐ AAAI
๐ฅ Desheng Hu, Joachim Baumann, Aleksandra Urman, Elsa Lichtenegger, Robin Forsberg et al.
Robotics
ๆฅ็ๆ่ฆ
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chukwuebuka Fortunate Ijezue, Tania-Amanda Fredrick Eneye, Maaz Amjad
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper presents a transformer-based approach for classifying hope expressions in text. We developed and compared three architectures (BERT, GPT-2, and DeBERTa) for both binary classification (Hope vs. Not Hope) and multiclass categorization (five hope-related categories). Our initial BERT implementation achieved 83.65% binary and 74.87% multiclass accuracy. In the extended comparison, BERT demonstrated superior performance (84.49% binary, 72.03% multiclass accuracy) while requiring significantly fewer computational resources (443s vs. 704s training time) than newer architectures. GPT-2 showed lowest overall accuracy (79.34% binary, 71.29% multiclass), while DeBERTa achieved moderate results (80.70% binary, 71.56% multiclass) but at substantially higher computational cost (947s for multiclass training). Error analysis revealed architecture-specific strengths in detecting nuanced hope expressions, with GPT-2 excelling at sarcasm detection (92.46% recall). This study provides a framework for computational analysis of hope, with applications in mental health and social media analysis, while demonstrating that architectural suitability may outweigh model size for specialized emotion detection tasks.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wenxin Zhu, Andong Chen, Yuchen Song, Kehai Chen, Conghui Zhu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kang Yin, Hye-Bin Shin
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Clinical electroencephalogram (EEG) reports encode domain-specific linguistic conventions that general-purpose language models (LMs) fail to capture. We introduce NeuroLex, a lightweight domain-adaptive language model trained purely on EEG report text from the Harvard Electroencephalography Database. Unlike existing biomedical LMs, NeuroLex is tailored to the linguistic and diagnostic characteristics of EEG reporting, enabling it to serve as both an independent textual model and a decoder backbone for multimodal EEG-language systems. Using span-corruption pretraining and instruction-style fine-tuning on report polishing, paragraph summarization, and terminology question answering, NeuroLex learns the syntax and reasoning patterns characteristic of EEG interpretation. Comprehensive evaluations show that it achieves lower perplexity, higher extraction and summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared with general models of the same scale. With an EEG-aware linguistic backbone, NeuroLex bridges biomedical text modeling and brain-computer interface applications, offering a foundation for interpretable and language-driven neural decoding.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Saranzaya Magsarjav, Melissa Humphries, Jonathan Tuke, Lewis Mitchell
Natural Language Processing
ๆฅ็ๆ่ฆ
Topic modelling in Natural Language Processing uncovers hidden topics in large, unlabelled text datasets. It is widely applied in fields such as information retrieval, content summarisation, and trend analysis across various disciplines. However, probabilistic topic models can produce different results when rerun due to their stochastic nature, leading to inconsistencies in latent topics. Factors like corpus shuffling, rare text removal, and document elimination contribute to these variations. This instability affects replicability, reliability, and interpretation, raising concerns about whether topic models capture meaningful topics or just noise. To address these problems, we defined a new stability measure that incorporates accuracy and consistency and uses the generative properties of LDA to generate a new corpus with ground truth. These generated corpora are run through LDA 50 times to determine the variability in the output. We show that LDA can correctly determine the underlying number of topics in the documents. We also find that LDA is more internally consistent, as the multiple reruns return similar topics; however, these topics are not the true topics.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu et al.
Multimodal
ๆฅ็ๆ่ฆ
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junwei Yu, Trevor Darrell, XuDong Wang
Robotics
ๆฅ็ๆ่ฆ
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kiana Vu, ฤฐsmet Selรงuk รzer, Phung Lai, Zheng Wu, Thilanka Munasinghe et al.
ๆฅ็ๆ่ฆ
As climate change accelerates the frequency and severity of extreme events such as wildfires, the need for accurate, explainable, and actionable forecasting becomes increasingly urgent. While artificial intelligence (AI) models have shown promise in predicting such events, their adoption in real-world decision-making remains limited due to their black-box nature, which limits trust, explainability, and operational readiness. This paper investigates the role of explainable AI (XAI) in bridging the gap between predictive accuracy and actionable insight for extreme event forecasting. Using wildfire prediction as a case study, we evaluate various AI models and employ SHapley Additive exPlanations (SHAP) to uncover key features, decision pathways, and potential biases in model behavior. Our analysis demonstrates how XAI not only clarifies model reasoning but also supports critical decision-making by domain experts and response teams. In addition, we provide supporting visualizations that enhance the interpretability of XAI outputs by contextualizing feature importance and temporal patterns in seasonality and geospatial characteristics. This approach enhances the usability of AI explanations for practitioners and policymakers. Our findings highlight the need for AI systems that are not only accurate but also interpretable, accessible, and trustworthy, essential for effective use in disaster preparedness, risk mitigation, and climate resilience planning.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jianglong Ye, Lai Wei, Guangqi Jiang, Changwei Jing, Xueyan Zou et al.
Robotics
ๆฅ็ๆ่ฆ
Human grasps can be roughly categorized into two types: power grasps and precision grasps. Precision grasping enables tool use and is believed to have influenced human evolution. Today's multi-fingered robotic hands are effective in power grasps, but for tasks requiring precision, parallel grippers are still more widely adopted. This contrast highlights a key limitation in current robotic hand design: the difficulty of achieving both stable power grasps and precise, fine-grained manipulation within a single, versatile system. In this work, we bridge this gap by jointly optimizing the control and hardware design of a multi-fingered dexterous hand, enabling both power and precision manipulation. Rather than redesigning the entire hand, we introduce a lightweight fingertip geometry modification, represent it as a contact plane, and jointly optimize its parameters along with the corresponding control. Our control strategy dynamically switches between power and precision manipulation and simplifies precision control into parallel thumb-index motions, which proves robust for sim-to-real transfer. On the design side, we leverage large-scale simulation to optimize the fingertip geometry using a differentiable neural-physics surrogate model. We validate our approach through extensive experiments in both sim-to-real and real-to-real settings. Our method achieves an 82.5% zero-shot success rate on unseen objects in sim-to-real precision grasping, and a 93.3% success rate in challenging real-world tasks involving bread pinching. These results demonstrate that our co-design framework can significantly enhance the fine-grained manipulation ability of multi-fingered hands without reducing their ability for power grasps. Our project page is at https://jianglongye.com/power-to-precision
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alaa Mezghiche
Robotics
ๆฅ็ๆ่ฆ
Unsupervised learning on high-dimensional RNA-seq data can reveal molecular subtypes beyond standard labels. We combine an autoencoder-based representation with clustering and stability analysis to search for rare but reproducible genomic subtypes. On the UCI "Gene Expression Cancer RNA-Seq" dataset (801 samples, 20,531 genes; BRCA, COAD, KIRC, LUAD, PRAD), a pan-cancer analysis shows clusters aligning almost perfectly with tissue of origin (Cramer's V = 0.887), serving as a negative control. We therefore reframe the problem within KIRC (n = 146): we select the top 2,000 highly variable genes, standardize them, train a feed-forward autoencoder (128-dimensional latent space), and run k-means for k = 2-10. While global indices favor small k, scanning k with a pre-specified discovery rule (rare < 10 percent and stable with Jaccard >= 0.60 across 20 seeds after Hungarian alignment) yields a simple solution at k = 5 (silhouette = 0.129, DBI = 2.045) with a rare cluster C0 (6.85 percent of patients) that is highly stable (Jaccard = 0.787). Cluster-vs-rest differential expression (Welch's t-test, Benjamini-Hochberg FDR) identifies coherent markers. Overall, pan-cancer clustering is dominated by tissue of origin, whereas a stability-aware within-cancer approach reveals a rare, reproducible KIRC subtype.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua et al.
ๆฅ็ๆ่ฆ
Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Luyao Niu, Nuoxian Huang
Machine Learning
ๆฅ็ๆ่ฆ
Travel mode identification (TMI) from GPS trajectories is critical for urban intelligence, but is hampered by the high cost of annotation, leading to severe label scarcity. Prevailing semi-supervised learning (SSL) methods are ill-suited for this task, as they suffer from catastrophic confirmation bias and ignore the intrinsic data manifold. We propose ST-ProC, a novel graph-prototypical multi-objective SSL framework to address these limitations. Our framework synergizes a graph-prototypical core with foundational SSL Support. The core exploits the data manifold via graph regularization, prototypical anchoring, and a novel, margin-aware pseudo-labeling strategy to actively reject noise. This core is supported and stabilized by foundational contrastive and teacher-student consistency losses, ensuring high-quality representations and robust optimization. ST-ProC outperforms all baselines by a significant margin, demonstrating its efficacy in real-world sparse-label settings, with a performance boost of 21.5% over state-of-the-art methods like FixMatch.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gianluigi Pillonetto, Alberto Giaretta, Mauro Bisiacco
ๆฅ็ๆ่ฆ
Understanding the principles that govern dynamical systems is a central challenge across many scientific domains, including biology and ecology. Incomplete knowledge of nonlinear interactions and stochastic effects often renders bottom-up modeling approaches ineffective, motivating the development of methods that can discover governing equations directly from data. In such contexts, parametric models often struggle without strong prior knowledge, especially when estimating intrinsic noise. Nonetheless, incorporating stochastic effects is often essential for understanding the dynamic behavior of complex systems such as gene regulatory networks and signaling pathways. To address these challenges, we introduce Trine (Three-phase Regression for INtrinsic noisE), a nonparametric, kernel-based framework that infers state-dependent intrinsic noise from time-series data. Trine features a three-stage algorithm that com- bines analytically solvable subproblems with a structured kernel architecture that captures both abrupt noise-driven fluctuations and smooth, state-dependent changes in variance. We validate Trine on biological and ecological systems, demonstrating its ability to uncover hidden dynamics without relying on predefined parametric assumptions. Across several benchmark problems, Trine achieves performance comparable to that of an oracle. Biologically, this oracle can be viewed as an idealized observer capable of directly tracking the random fluctuations in molecular concentrations or reaction events within a cell. The Trine framework thus opens new avenues for understanding how intrinsic noise affects the behavior of complex systems.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Parikshit Gopalan, Konstantinos Stavropoulos, Kunal Talwar, Pranay Tankala
Machine Learning
ๆฅ็ๆ่ฆ
A decision-theoretic characterization of perfect calibration is that an agent seeking to minimize a proper loss in expectation cannot improve their outcome by post-processing a perfectly calibrated predictor. Hu and Wu (FOCS'24) use this to define an approximate calibration measure called calibration decision loss ($\mathsf{CDL}$), which measures the maximal improvement achievable by any post-processing over any proper loss. Unfortunately, $\mathsf{CDL}$ turns out to be intractable to even weakly approximate in the offline setting, given black-box access to the predictions and labels.
We suggest circumventing this by restricting attention to structured families of post-processing functions $K$. We define the calibration decision loss relative to $K$, denoted $\mathsf{CDL}_K$ where we consider all proper losses but restrict post-processings to a structured family $K$. We develop a comprehensive theory of when $\mathsf{CDL}_K$ is information-theoretically and computationally tractable, and use it to prove both upper and lower bounds for natural classes $K$. In addition to introducing new definitions and algorithmic techniques to the theory of calibration for decision making, our results give rigorous guarantees for some widely used recalibration procedures in machine learning.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Disha Varshney, Samarth Garg, Sarthak Tyagi, Deeksha Varshney, Nayan Deep et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In this study, we tackle the challenging task of predicting secondary structures from protein primary sequences, a pivotal initial stride towards predicting tertiary structures, while yielding crucial insights into protein activity, relationships, and functions. Existing methods often utilize extensive sets of unlabeled amino acid sequences. However, these approaches neither explicitly capture nor harness the accessible protein 3D structural data, which is recognized as a decisive factor in dictating protein functions. To address this, we utilize protein residue graphs and introduce various forms of sequential or structural connections to capture enhanced spatial information. We adeptly combine Graph Neural Networks (GNNs) and Language Models (LMs), specifically utilizing a pre-trained transformer-based protein language model to encode amino acid sequences and employing message-passing mechanisms like GCN and R-GCN to capture geometric characteristics of protein structures. Employing convolution within a specific node's nearby region, including relations, we stack multiple convolutional layers to efficiently learn combined insights from the protein's spatial graph, revealing intricate interconnections and dependencies in its structural arrangement. To assess our model's performance, we employed the training dataset provided by NetSurfP-2.0, which outlines secondary structure in 3-and 8-states. Extensive experiments show that our proposed model, SSRGNet surpasses the baseline on f1-scores.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leopoldo Agorio, Juan Cerviรฑo, Miguel Calvo-Fullana, Alejandro Ribeiro, Juan Andrรฉs Bazerque
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
A learning task, understood as the problem of fitting a parametric model from supervised data, fundamentally requires the dataset to be large enough to be representative of the underlying distribution of the source. When data is limited, the learned models fail generalize to cases not seen during training. This paper introduces a multi-task \emph{cross-learning} framework to overcome data scarcity by jointly estimating \emph{deterministic} parameters across multiple, related tasks. We formulate this joint estimation as a constrained optimization problem, where the constraints dictate the resulting similarity between the parameters of the different models, allowing the estimated parameters to differ across tasks while still combining information from multiple data sources. This framework enables knowledge transfer from tasks with abundant data to those with scarce data, leading to more accurate and reliable parameter estimates, providing a solution for scenarios where parameter inference from limited data is critical. We provide theoretical guarantees in a controlled framework with Gaussian data, and show the efficiency of our cross-learning method in applications with real data including image classification and propagation of infectious diseases.
๐
2025-11-17
๐ ArXiv Preprint (cs.AR)
๐ฅ Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Wenjun Huang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer--forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W''m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy tracks FP32 within <=0.9 AP across Deformable and Sparse DETR variants. By converting sparsity into locality--and locality into utilization--QUILL delivers consistent, end-to-end speedups.
๐
2025-11-17
๐ ArXiv Preprint (cs.AR)
๐ฅ Hyunwoo Oh, KyungIn Nam, Rajat Bhattacharjya, Hanning Chen, Tamoghno Das et al.
ๆฅ็ๆ่ฆ
Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs) which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Minh Vu, Andrey Lokhov
ๆฅ็ๆ่ฆ
Modern scientific simulations, observations, and large-scale experiments generate data at volumes that often exceed the limits of storage, processing, and analysis. This challenge drives the development of data reduction methods that efficiently manage massive datasets while preserving essential physical features and quantities of interest. In many scientific workflows, it is also crucial to enable data recovery from compressed representations - a task known as super-resolution - with guarantees on the preservation of key physical characteristics. A notable example is checkpointing and restarting, which is essential for long-running simulations to recover from failures, resume after interruptions, or examine intermediate results. In this work, we introduce a novel framework for scientific data compression and super-resolution, grounded in recent advances in learning exponential families. Our method preserves and quantifies uncertainty in physical quantities of interest and supports flexible trade-offs between compression ratio and reconstruction fidelity.
๐
2025-11-17
๐ ArXiv Preprint (cs.PL)
๐ฅ Qiuhan Gu, Avaljot Singh, Gagandeep Singh
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Constructing abstract interpreters that provide global soundness guarantees remains a major obstacle in abstract interpretation. We investigate whether modern LLMs can reduce this burden by leveraging them to synthesize sound, non-trivial abstract interpreters across multiple abstract domains in the setting of neural network verification. We formulate synthesis as a constrained optimization problem and introduce a novel mathematically grounded cost function for measuring unsoundness under strict syntactic and semantic constraints. Based on this formulation, we develop a unified framework that unifies LLM-based generation with syntactic and semantic validation and a quantitative cost-guided feedback mechanism. Empirical results demonstrate that our framework not only matches the quality of handcrafted transformers, but more importantly, discovers sound, high-precision transformers for complex nonlinear operators that are absent from existing literature.
๐
2025-11-17
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiaming Qu, Mengtian Guo, Yue Wang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Machine learning classifiers can learn from large amounts of training examples to effectively distinguish deceptive reviews from genuine ones. However, the distinguishing features learned by these classifiers are often subtle, fragmented, and difficult for humans to interpret. In this work, we explore using large language models (LLMs) to translate machine-learned lexical cues into human-understandable language phenomena that can differentiate deceptive reviews from genuine ones. We show that language phenomena obtained in this manner are empirically grounded in data, generalizable across similar domains, and more predictive than phenomena either in LLMs' prior knowledge or obtained through in-context learning. These language phenomena have the potential to aid people in critically assessing the credibility of online reviews in environments where deception detection classifiers are unavailable.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon et al.
Multimodal
ๆฅ็ๆ่ฆ
Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at $\href{https://github.com/allenai/olmoearth_pretrain}{\text{https://github.com/allenai/olmoearth_pretrain}}$.
๐
2025-11-17
๐ AAAI
๐ฅ Pascal Zimmer, Ghassan Karame
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we present the first detailed analysis of how optimization hyperparameters -- such as learning rate, weight decay, momentum, and batch size -- influence robustness against both transfer-based and query-based attacks. Supported by theory and experiments, our study spans a variety of practical deployment settings, including centralized training, ensemble learning, and distributed training. We uncover a striking dichotomy: for transfer-based attacks, decreasing the learning rate significantly enhances robustness by up to $64\%$. In contrast, for query-based attacks, increasing the learning rate consistently leads to improved robustness by up to $28\%$ across various settings and data distributions. Leveraging these findings, we explore -- for the first time -- the optimization hyperparameter design space to jointly enhance robustness against both transfer-based and query-based attacks. Our results reveal that distributed models benefit the most from hyperparameter tuning, achieving a remarkable tradeoff by simultaneously mitigating both attack types more effectively than other training setups.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. In addition to training weight-sparse models de novo, we show preliminary results suggesting our method can also be adapted to explain existing dense models. Our work produces circuits that achieve an unprecedented level of human understandability and validates them with considerable rigor.
๐
2025-11-17
๐ ArXiv Preprint (cs.SE)
๐ฅ Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gรถdel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aleksandar Stankoviฤ
ๆฅ็ๆ่ฆ
We present FuseSampleAgg, a CUDA operator that fuses neighbor sampling and mean aggregation into a single pass for one and two hop GraphSAGE. By eliminating block materialization and extra kernel launches, FuseSampleAgg reduces memory traffic and overhead while preserving GraphSAGE mean semantics via saved index replay. Across the Reddit, ogbn-arxiv, and ogbn-products benchmarks (batch size 1024, automatic mixed precision enabled), we observe step time speedups up to 51x on ogbn-products, about 4x on Reddit with fanouts 10-10 and 15-10, and about 3.3x on ogbn-arxiv at larger fanouts, with peak GPU memory reductions up to 100x, 36x, and about 3.5x, respectively. The operator is deterministic, integrates with standard PyTorch optimizers, and ships with scripts that reproduce all tables and figures from CSV logs. Code and scripts are available at https://github.com/SV25-22/FuseSampleAgg.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Haohui Wang, Jingyuan Qi, Jianpeng Chen, Jun Wu, Lifu Huang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid progress of large language models (LLMs) is fueled by the growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from data generation mechanisms like top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior across learning head and tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective yet efficient data valuation method that scales to large-scale datasets. Comprehensive experiments across four tasks, including image classification, sentiment classification, instruction following, and complex reasoning, demonstrate that our method surpasses state-of-the-art baselines in data valuation with significantly low computational cost.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ana Durica, John Booth, Ivana Drobnjak
Multimodal
ๆฅ็ๆ่ฆ
Paediatric kidney disease varies widely in its presentation and progression, which calls for continuous monitoring of renal function. Using electronic health records collected between 2019 and 2025 at Great Ormond Street Hospital, a leading UK paediatric hospital, we explored a temporal modelling approach that integrates longitudinal laboratory sequences with demographic information. A recurrent neural model trained on these data was used to predict whether a child would record an abnormal serum creatinine value within the following thirty days. Framed as a pilot study, this work provides an initial demonstration that simple temporal representations can capture useful patterns in routine paediatric data and lays the groundwork for future multimodal extensions using additional clinical signals and more detailed renal outcomes.
๐
2025-11-17
๐ AAAI
๐ฅ Kaichi Irie, Shuhei Watanabe, Masaki Onishi
Machine Learning
ๆฅ็ๆ่ฆ
Bayesian optimization (BO) efficiently finds high-performing parameters by maximizing an acquisition function, which models the promise of parameters. A major computational bottleneck arises in acquisition function optimization, where multi-start optimization (MSO) with quasi-Newton (QN) methods is required due to the non-convexity of the acquisition function. BoTorch, a widely used BO library, currently optimizes the summed acquisition function over multiple points, leading to the speedup of MSO owing to PyTorch batching. Nevertheless, this paper empirically demonstrates the suboptimality of this approach in terms of off-diagonal approximation errors in the inverse Hessian of a QN method, slowing down its convergence. To address this problem, we propose to decouple QN updates using a coroutine while batching the acquisition function calls. Our approach not only yields the theoretically identical convergence to the sequential MSO but also drastically reduces the wall-clock time compared to the previous approaches.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning-the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift, which binds symbols to reality in a fundamental way, serving as the cornerstone of most modern technologies. In this work, we manage to advance physics research by developing large language models with exceptional physics reasoning capabilities, especially excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, getting a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also present great performance on other reasoning tasks like math and coding, showing the great generalibility of P1 series.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marianne Rakic, Andrew Hoopes, S. Mazdak Abulnaga, Mert R. Sabuncu, John V. Guttag et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deformable templates, or atlases, are images that represent a prototypical anatomy for a population, and are often enhanced with probabilistic anatomical label maps. They are commonly used in medical image analysis for population studies and computational anatomy tasks such as registration and segmentation. Because developing a template is a computationally expensive process, relatively few templates are available. As a result, analysis is often conducted with sub-optimal templates that are not truly representative of the study population, especially when there are large variations within this population. We propose a machine learning framework that uses convolutional registration neural networks to efficiently learn a function that outputs templates conditioned on subject-specific attributes, such as age and sex. We also leverage segmentations, when available, to produce anatomical segmentation maps for the resulting templates. The learned network can also be used to register subject images to the templates. We demonstrate our method on a compilation of 3D brain MRI datasets, and show that it can learn high-quality templates that are representative of populations. We find that annotated conditional templates enable better registration than their unlabeled unconditional counterparts, and outperform other templates construction methods.
๐
2025-11-17
๐ ArXiv Preprint (stat.ME)
๐ฅ M. Stocker, W. Maลgorzewicz, M. Fontana, S. Ben Taieb
ๆฅ็ๆ่ฆ
Conformal prediction is a powerful post-hoc framework for uncertainty quantification that provides distribution-free coverage guarantees. However, these guarantees crucially rely on the assumption of exchangeability. This assumption is fundamentally violated in time series data, where temporal dependence and distributional shifts are pervasive. As a result, classical split-conformal methods may yield prediction intervals that fail to maintain nominal validity. This review unifies recent advances in conformal forecasting methods specifically designed to address nonexchangeable data. We first present a theoretical foundation, deriving finite-sample guarantees for split-conformal prediction under mild weak-dependence conditions. We then survey and classify state-of-the-art approaches that mitigate serial dependence by reweighting calibration data, dynamically updating residual distributions, or adaptively tuning target coverage levels in real time. Finally, we present a comprehensive simulation study that compares these techniques in terms of empirical coverage, interval width, and computational cost, highlighting practical trade-offs and open research directions.
๐
2025-11-17
๐ ArXiv Preprint (math.OC)
๐ฅ Chen Xu
Machine Learning
ๆฅ็ๆ่ฆ
We introduce GS-PowerHP, a novel zeroth-order method for non-convex optimization problems of the form $\max_{x \in \mathbb{R}^d} f(x)$. Our approach leverages two key components: a power-transformed Gaussian-smoothed surrogate $F_{N,ฯ}(ฮผ) = \mathbb{E}_{x\sim\mathcal{N}(ฮผ,ฯ^2 I_d)}[e^{N f(x)}]$ whose stationary points cluster near the global maximizer $x^*$ of $f$ for sufficiently large $N$, and an incrementally decaying $ฯ$ for enhanced data efficiency. Under mild assumptions, we prove convergence in expectation to a small neighborhood of $x^*$ with the iteration complexity of $O(d^2 \varepsilon^{-2})$. Empirical results show our approach consistently ranks among the top three across a suite of competing algorithms. Its robustness is underscored by the final experiment on a substantially high-dimensional problem ($d=150,528$), where it achieved first place on least-likely targeted black-box attacks against images from ImageNet, surpassing all competing methods.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shihao Dong, Yue Liu, Xiaotong Zhou, Yuhui Zheng, Huiying Xu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Multi-view clustering (MVC), which aims to separate the multi-view data into distinct clusters in an unsupervised manner, is a fundamental yet challenging task. To enhance its applicability in real-world scenarios, this paper addresses a more challenging task: MVC under multi-source noises, including missing noise and observation noise. To this end, we propose a novel framework, Reliability-Aware Contrastive Deep Multi-View Clustering (RAC-DMVC), which constructs a reliability graph to guide robust representation learning under noisy environments. Specifically, to address observation noise, we introduce a cross-view reconstruction to enhances robustness at the data level, and a reliability-aware noise contrastive learning to mitigates bias in positive and negative pairs selection caused by noisy representations. To handle missing noise, we design a dual-attention imputation to capture shared information across views while preserving view-specific features. In addition, a self-supervised cluster distillation module further refines the learned representations and improves the clustering performance. Extensive experiments on five benchmark datasets demonstrate that RAC-DMVC outperforms SOTA methods on multiple evaluation metrics and maintains excellent performance under varying ratios of noise.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Yue Hou, Ruomei Liu, Yingke Su, Junran Wu, Ke Xu
ๆฅ็ๆ่ฆ
A key challenge in graph out-of-distribution (OOD) detection lies in the absence of ground-truth OOD samples during training. Existing methods are typically optimized to capture features within the in-distribution (ID) data and calculate OOD scores, which often limits pre-trained models from representing distributional boundaries, leading to unreliable OOD detection. Moreover, the latent structure of graph data is often governed by multiple underlying factors, which remains less explored. To address these challenges, we propose a novel test-time graph OOD detection method, termed BaCa, that calibrates OOD scores using dual dynamically updated dictionaries without requiring fine-tuning the pre-trained model. Specifically, BaCa estimates graphons and applies a mix-up strategy solely with test samples to generate diverse boundary-aware discriminative topologies, eliminating the need for exposing auxiliary datasets as outliers. We construct dual dynamic dictionaries via priority queues and attention mechanisms to adaptively capture latent ID and OOD representations, which are then utilized for boundary-aware OOD score calibration. To the best of our knowledge, extensive experiments on real-world datasets show that BaCa significantly outperforms existing state-of-the-art methods in OOD detection.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zichong Wang, Zhipeng Yin, Liping Yang, Jun Zhuang, Rui Yu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Ensuring fairness in Graph Neural Networks is fundamental to promoting trustworthy and socially responsible machine learning systems. In response, numerous fair graph learning methods have been proposed in recent years. However, most of them assume full access to demographic information, a requirement rarely met in practice due to privacy, legal, or regulatory restrictions. To this end, this paper introduces a novel fair graph learning framework that mitigates bias in graph learning under limited demographic information. Specifically, we propose a mechanism guided by partial demographic data to generate proxies for demographic information and design a strategy that enforces consistent node embeddings across demographic groups. In addition, we develop an adaptive confidence strategy that dynamically adjusts each node's contribution to fairness and utility based on prediction confidence. We further provide theoretical analysis demonstrating that our framework, FairGLite, achieves provable upper bounds on group fairness metrics, offering formal guarantees for bias mitigation. Through extensive experiments on multiple datasets and fair graph learning frameworks, we demonstrate the framework's effectiveness in both mitigating bias and maintaining model utility.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanchao Wang, Tian Qin, Eduardo Valle, Bruno Abrahao
ๆฅ็ๆ่ฆ
Out-of-distribution (OOD) detection is critical for deploying image classifiers in safety-sensitive environments, yet existing detectors often struggle when OOD samples are semantically similar to the in-distribution (ID) classes. We present BootOOD, a fully self-supervised OOD detection framework that bootstraps exclusively from ID data and is explicitly designed to handle semantically challenging OOD samples. BootOOD synthesizes pseudo-OOD features through simple transformations of ID representations and leverages Neural Collapse (NC), where ID features cluster tightly around class means with consistent feature norms. Unlike prior approaches that aim to constrain OOD features into subspaces orthogonal to the collapsed ID means, BootOOD introduces a lightweight auxiliary head that performs radius-based classification on feature norms. This design decouples OOD detection from the primary classifier and imposes a relaxed requirement: OOD samples are learned to have smaller feature norms than ID features, which is easier to satisfy when ID and OOD are semantically close. Experiments on CIFAR-10, CIFAR-100, and ImageNet-200 show that BootOOD outperforms prior post-hoc methods, surpasses training-based methods without outlier exposure, and is competitive with state-of-the-art outlier-exposure approaches while maintaining or improving ID accuracy.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ihab Asaad, Maha Shadaydeh, Joachim Denzler
Multimodal
ๆฅ็ๆ่ฆ
Patch-wise multi-label classification provides an efficient alternative to full pixel-wise segmentation on high-resolution images, particularly when the objective is to determine the presence or absence of target objects within a patch rather than their precise spatial extent. This formulation substantially reduces annotation cost, simplifies training, and allows flexible patch sizing aligned with the desired level of decision granularity. In this work, we focus on a special case, patch-wise binary classification, applied to the detection of a single class of interest (tumor) on high-resolution multimodal nonlinear microscopy images. We show that, although this simplified formulation enables efficient model development, it can introduce spurious correlations between patch composition and labels: tumor patches tend to contain larger tissue regions, whereas non-tumor patches often consist mostly of background with small tissue areas. We further quantify the bias in model predictions caused by this spurious correlation, and propose to use a debiasing strategy to mitigate its effect. Specifically, we apply GERNE, a debiasing method that can be adapted to maximize worst-group accuracy (WGA). Our results show an improvement in WGA by approximately 7% compared to ERM for two different thresholds used to binarize the spurious feature. This enhancement boosts model performance on critical minority cases, such as tumor patches with small tissues and non-tumor patches with large tissues, and underscores the importance of spurious correlation-aware learning in patch-wise classification problems.
๐
2025-11-17
๐ ArXiv Preprint (cs.CY)
๐ฅ Zichong Wang, Zhipeng Yin, Roland H. C. Yap, Wenbin Zhang
ๆฅ็ๆ่ฆ
Fairness in artificial intelligence (AI) has become a growing concern due to discriminatory outcomes in AI-based decision-making systems. While various methods have been proposed to mitigate bias, most rely on complete demographic information, an assumption often impractical due to legal constraints and the risk of reinforcing discrimination. This survey examines fairness in AI when demographics are incomplete, addressing the gap between traditional approaches and real-world challenges. We introduce a novel taxonomy of fairness notions in this setting, clarifying their relationships and distinctions. Additionally, we summarize existing techniques that promote fairness beyond complete demographics and highlight open research questions to encourage further progress in the field.
๐
2025-11-17
๐ IEEE International Conference on Knowledge Graph (
๐ฅ Pragatheeswaran Vipulananthan, Kamal Premaratne, Dilip Sarkar, Manohar N. Murthi
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Accurate uncertainty quantification is a critical challenge in machine learning. While neural networks are highly versatile and capable of learning complex patterns, they often lack interpretability due to their ``black box'' nature. On the other hand, probabilistic ``white box'' models, though interpretable, often suffer from a significant performance gap when compared to neural networks. To address this, we propose a novel quantum physics-based ``white box'' method that offers both accurate uncertainty quantification and enhanced interpretability. By mapping the kernel mean embedding (KME) of a time series data vector to a reproducing kernel Hilbert space (RKHS), we construct a tensor network-inspired 1D spin chain Hamiltonian, with the KME as one of its eigen-functions or eigen-modes. We then solve the associated Schr{รถ}dinger equation and apply perturbation theory to quantify uncertainty, thereby improving the interpretability of tasks performed with the quantum tensor network-based model. We demonstrate the effectiveness of this methodology, compared to state-of-the-art ``white box" models, in change point detection and time series clustering, providing insights into the uncertainties associated with decision-making throughout the process.
๐
2025-11-17
๐ submitted to JMLR
๐ฅ Melanie Schaller, Nick Janssen, Bodo Rosenhahn
ๆฅ็ๆ่ฆ
This paper presents Naga, a deep State Space Model (SSM) encoding approach inspired by structural concepts from Vedic mathematics. The proposed method introduces a bidirectional representation for time series by jointly processing forward and time-reversed input sequences. These representations are then combined through an element-wise (Hadamard) interaction, resulting in a Vedic-inspired encoding that enhances the model's ability to capture temporal dependencies across distant time steps. We evaluate Naga on multiple long-term time series forecasting (LTSF) benchmarks, including ETTh1, ETTh2, ETTm1, ETTm2, Weather, Traffic, and ILI. The experimental results show that Naga outperforms 28 current state of the art models and demonstrates improved efficiency compared to existing deep SSM-based approaches. The findings suggest that incorporating structured, Vedic-inspired decomposition can provide an interpretable and computationally efficient alternative for long-range sequence modeling.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Ioannis Diamantis
ๆฅ็ๆ่ฆ
Modern business and economic datasets often exhibit nonlinear, multi-scale structures that traditional linear tools under-represent. Topological Data Analysis (TDA) offers a geometric lens for uncovering robust patterns, such as connected components, loops and voids, across scales. This paper provides an intuitive, figure-driven introduction to persistent homology and a practical, reproducible TDA pipeline for applied analysts. Through comparative case studies in consumer behavior, equity markets (SAX/eSAX vs.\ TDA) and foreign exchange dynamics, we demonstrate how topological features can reveal segmentation patterns and structural relationships beyond classical statistical methods. We discuss methodological choices regarding distance metrics, complex construction and interpretation, and we introduce the \textit{Topological Stability Index} (TSI), a simple yet interpretable indicator of structural variability derived from persistence lifetimes. We conclude with practical guidelines for TDA implementation, visualization and communication in business and economic analytics.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Liudmila A. Zhukas, Vivian Ni Zhang, Qiang Miao, Qingfeng Wang, Marko Cetina et al.
Machine Learning
ๆฅ็ๆ่ฆ
Quantum machine learning (QML) has attracted growing interest with the rapid parallel advances in large-scale classical machine learning and quantum technologies. Similar to classical machine learning, QML models also face challenges arising from the scarcity of labeled data, particularly as their scale and complexity increase. Here, we introduce self-supervised pretraining of quantum representations that reduces reliance on labeled data by learning invariances from unlabeled examples. We implement this paradigm on a programmable trapped-ion quantum computer, encoding images as quantum states. In situ contrastive pretraining on hardware yields a representation that, when fine-tuned, classifies image families with higher mean test accuracy and lower run-to-run variability than models trained from random initialization. Performance improvement is especially significant in regimes with limited labeled training data. We show that the learned invariances generalize beyond the pretraining image samples. Unlike prior work, our pipeline derives similarity from measured quantum overlaps and executes all training and classification stages on hardware. These results establish a label-efficient route to quantum representation learning, with direct relevance to quantum-native datasets and a clear path to larger classical inputs.
๐
2025-11-17
๐ ICASSP 2026
๐ฅ Davoud Shariat Panah, Alessandro Ragano, Dan Barry, Jan Skoglund, Andrew Hines
Machine Learning
ๆฅ็ๆ่ฆ
This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Adam Hazimeh, Ke Wang, Mark Collier, Gilles Baechler, Efi Kokiopoulou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shiyuan Luo, Chonghao Qiu, Runlong Yu, Yiqun Xie, Xiaowei Jia
Machine Learning
ๆฅ็ๆ่ฆ
Environmental modeling faces critical challenges in predicting ecosystem dynamics across unmonitored regions due to limited and geographically imbalanced observation data. This challenge is compounded by spatial heterogeneity, causing models to learn spurious patterns that fit only local data. Unlike conventional domain generalization, environmental modeling must preserve invariant physical relationships and temporal coherence during augmentation. In this paper, we introduce Generalizable Representation Enhancement via Auxiliary Transformations (GREAT), a framework that effectively augments available datasets to improve predictions in completely unseen regions. GREAT guides the augmentation process to ensure that the original governing processes can be recovered from the augmented data, and the inclusion of the augmented data leads to improved model generalization. Specifically, GREAT learns transformation functions at multiple layers of neural networks to augment both raw environmental features and temporal influence. They are refined through a novel bi-level training process that constrains augmented data to preserve key patterns of the original source data. We demonstrate GREAT's effectiveness on stream temperature prediction across six ecologically diverse watersheds in the eastern U.S., each containing multiple stream segments. Experimental results show that GREAT significantly outperforms existing methods in zero-shot scenarios. This work provides a practical solution for environmental applications where comprehensive monitoring is infeasible.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Meng Zhu, Quan Xiao, Weidong Min
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Since the 21st century, artificial intelligence has been leading a new round of industrial revolution. Under the training framework, the optimization algorithm aims to stably converge high-dimensional optimization to local and even global minima. Entering the era of large language models, although the scale of model parameters and data has increased, Adam remains the mainstream optimization algorithm. However, compared with stochastic gradient descent (SGD) based optimization algorithms, Adam is more likely to converge to non-flat minima. To address this issue, the AdamX algorithm is proposed. Its core innovation lies in the proposition of a novel type of second-order moment estimation exponential decay rate, which gradually weakens the learning step correction strength as training progresses, and degrades to SGD in the stable training period, thereby improving the stability of training in the stable period and possibly enhancing generalization ability. Experimental results show that our second-order moment estimation exponential decay rate is better than the current second-order moment estimation exponential decay rate, and AdamX can stably outperform Adam and its variants in terms of performance. Our code is open-sourced at https://github.com/mengzhu0308/AdamX.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hussein Rajabu, Lijun Qian, Xishuang Dong
Machine Learning
ๆฅ็ๆ่ฆ
In the area of explainable artificial intelligence, Symbolic Regression (SR) has emerged as a promising approach by discovering interpretable mathematical expressions that fit data. However, SR faces two main challenges: most methods are evaluated on scientific datasets with well-understood relationships, limiting generalization, and SR primarily targets single-output regression, whereas many real-world problems involve multi-target outputs with interdependent variables. To address these issues, we propose multi-task regression GINN-LP (MTRGINN-LP), an interpretable neural network for multi-target symbolic regression. By integrating GINN-LP with a multi-task deep learning, the model combines a shared backbone including multiple power-term approximator blocks with task-specific output layers, capturing inter-target dependencies while preserving interpretability. We validate multi-task GINN-LP on practical multi-target applications, including energy efficiency prediction and sustainable agriculture. Experimental results demonstrate competitive predictive performance alongside high interpretability, effectively extending symbolic regression to broader real-world multi-output tasks.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bin Liu, Qinghao Zhao, Yuxi Zhou, Zhejun Sun, Kaijie Lei et al.
ๆฅ็ๆ่ฆ
Right heart failure (RHF) is a disease characterized by abnormalities in the structure or function of the right ventricle (RV), which is associated with high morbidity and mortality. Lung disease often causes increased right ventricular load, leading to RHF. Therefore, it is very important to screen out patients with cor pulmonale who develop RHF from people with underlying lung diseases. In this work, we propose a self-supervised representation learning method to early detecting RHF from patients with cor pulmonale, which uses spirogram time series to predict patients with RHF at an early stage. The proposed model is divided into two stages. The first stage is the self-supervised representation learning-based spirogram embedding (SLSE) network training process, where the encoder of the Variational autoencoder (VAE-encoder) learns a robust low-dimensional representation of the spirogram time series from the data-augmented unlabeled data. Second, this low-dimensional representation is fused with demographic information and fed into a CatBoost classifier for the downstream RHF prediction task. Trained and tested on a carefully selected subset of 26,617 individuals from the UK Biobank, our model achieved an AUROC of 0.7501 in detecting RHF, demonstrating strong population-level distinction ability. We further evaluated the model on high-risk clinical subgroups, achieving AUROC values of 0.8194 on a test set of 74 patients with chronic kidney disease (CKD) and 0.8413 on a set of 64 patients with valvular heart disease (VHD). These results highlight the model's potential utility in predicting RHF among clinically elevated-risk populations. In conclusion, this study presents a self-supervised representation learning approach combining spirogram time series and demographic data, demonstrating promising potential for early RHF detection in clinical practice.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Iulius Gherasim, Carlos Garcรญa Sรกnchez
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
The pervasive integration of Artificial Intelligence models into contemporary mobile computing is notable across numerous use cases, from virtual assistants to advanced image processing. Optimizing the mobile user experience involves minimal latency and high responsiveness from deployed AI models with challenges from execution strategies that fully leverage real time constraints to the exploitation of heterogeneous hardware architecture. In this paper, we research and propose the optimal execution configurations for AI models on an Android system, focusing on two critical tasks: object detection (YOLO family) and image classification (ResNet). These configurations evaluate various model quantization schemes and the utilization of on device accelerators, specifically the GPU and NPU. Our core objective is to empirically determine the combination that achieves the best trade-off between minimal accuracy degradation and maximal inference speed-up.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhipeng Ma, Bo Nรธrregaard Jรธrgensen, Zheng Grace Ma
Machine Learning
ๆฅ็ๆ่ฆ
Industrial process monitoring increasingly relies on sensor-generated time-series data, yet the lack of labels, high variability, and operational noise make it difficult to extract meaningful patterns using conventional methods. Existing clustering techniques either rely on fixed distance metrics or deep models designed for static data, limiting their ability to handle dynamic, unstructured industrial sequences. Addressing this gap, this paper proposes a novel framework for unsupervised discovery of operational modes in univariate time-series data using image-based convolutional clustering with composite internal evaluation. The proposed framework improves upon existing approaches in three ways: (1) raw time-series sequences are transformed into grayscale matrix representations via overlapping sliding windows, allowing effective feature extraction using a deep convolutional autoencoder; (2) the framework integrates both soft and hard clustering outputs and refines the selection through a two-stage strategy; and (3) clustering performance is objectively evaluated by a newly developed composite score, S_eva, which combines normalized Silhouette, Calinski-Harabasz, and Davies-Bouldin indices. Applied to over 3900 furnace melting operations from a Nordic foundry, the method identifies seven explainable operational patterns, revealing significant differences in energy consumption, thermal dynamics, and production duration. Compared to classical and deep clustering baselines, the proposed approach achieves superior overall performance, greater robustness, and domain-aligned explainability. The framework addresses key challenges in unsupervised time-series analysis, such as sequence irregularity, overlapping modes, and metric inconsistency, and provides a generalizable solution for data-driven diagnostics and energy optimization in industrial systems.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tingkai Yan, Haodong Wen, Binghui Li, Kairong Luo, Wenguang Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the \textit{effective reuse rate} of the data, $E(K, N)$, which we define as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) When $K$ is small, we prove that $E(K, N) \approx K$, indicating that every new epoch yields a linear gain; (2) As $K$ increases, $E(K, N)$ plateaus at a problem-dependent value that grows with $N$ ($ฮ(\log N)$ for the strongly-convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings point out a neglected factor in a recent empirical study (Muennighoff et al. (2023)), which claimed that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, \textit{i.e.}, $E(K, N) \approx K$ for $K \le 4$ in our notation. Supported by further empirical validation with LLMs, our results reveal that the maximum $K$ value for which $E(K, N) \approx K$ in fact depends on the data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shaheen Mohammed Saleh Ahmed, Hakan Hakan Guneyli
Machine Learning
ๆฅ็ๆ่ฆ
Accurate short-range prediction of extreme air temperature events remains a fundamental challenge in operational climate-risk management. We present Multi-Modal Weather State Transition Model with Anomaly-Driven Recurrent Attention Network Plus (MMWSTM-ADRAN+), a dual-stream deep learning architecture that couples a regime-aware dynamics model with an anomaly-focused attention mechanism to forecast daily maximum temperature and its extremes. The first stream, MMWSTM, combines bidirectional Long Short-Term Memory (BiLSTM) units with a learnable Markov state transition matrix to capture synoptic-scale weather regime changes. The second stream, ADRAN, integrates bidirectional Gated Recurrent Units (BiGRUs), multi-head self-attention, and a novel anomaly amplification layer to enhance sensitivity to low-probability signals. A lightweight attentive fusion gate adaptively determines the contribution of each stream to the final prediction. Model optimization employs a custom ExtremeWeatherLoss function that up-weights errors on the upper 5% and lower 5% of the temperature distribution, and a time-series data augmentation suite (jittering, scaling, time/magnitude warping) that effectively quadruples the training data
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Allaa Boutaleb, Bernd Amann, Rafael Angarita, Hubert Naacke
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Open-domain question answering over datalakes requires retrieving and composing information from multiple tables, a challenging subtask that demands semantic relevance and structural coherence (e.g., joinability). While exact optimization methods like Mixed-Integer Programming (MIP) can ensure coherence, their computational complexity is often prohibitive. Conversely, simpler greedy heuristics that optimize for query coverage alone often fail to find these coherent, joinable sets. This paper frames multi-table retrieval as an iterative search process, arguing this approach offers advantages in scalability, interpretability, and flexibility. We propose a general framework and a concrete instantiation: a fast, effective Greedy Join-Aware Retrieval algorithm that holistically balances relevance, coverage, and joinability. Experiments across 5 NL2SQL benchmarks demonstrate that our iterative method achieves competitive retrieval performance compared to the MIP-based approach while being 4-400x faster depending on the benchmark and search space settings. This work highlights the potential of iterative heuristics for practical, scalable, and composition-aware retrieval.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hanwen Hu, Zimo Wen, Shiyou Qian, Jian Co
ๆฅ็ๆ่ฆ
Traffic time series imputation is crucial for the safety and reliability of intelligent transportation systems, while diverse types of missing data, including random, fiber, and block missing make the imputation task challenging. Existing models often focus on disentangling and separately modeling spatial and temporal patterns based on relationships between data points. However, these approaches struggle to adapt to the random missing positions, and fail to learn long-term and large-scale dependencies, which are essential in extensive missing conditions. In this paper, patterns are categorized into two types to handle various missing data conditions: primary patterns, which originate from internal relationships between data points, and auxiliary patterns, influenced by external factors like timestamps and node attributes. Accordingly, we propose the Primary-Auxiliary Spatio-Temporal network (PAST). It comprises a graph-integrated module (GIM) and a cross-gated module (CGM). GIM captures primary patterns via dynamic graphs with interval-aware dropout and multi-order convolutions, and CGM extracts auxiliary patterns through bidirectional gating on embedded external features. The two modules interact via shared hidden vectors and are trained under an ensemble self-supervised framework. Experiments on three datasets under 27 missing data conditions demonstrate that the imputation accuracy of PAST outperforms seven state-of-the-art baselines by up to 26.2% in RMSE and 31.6% in MAE.
๐
2025-11-17
๐ ArXiv Preprint (quant-ph)
๐ฅ Zhenyu Chen, Yuguo Shao, Zhengwei Liu, Zhaohui Wei
Machine Learning
ๆฅ็ๆ่ฆ
Quantum algorithms based on parameterized quantum circuits (PQCs) have enabled a wide range of applications on near-term quantum devices. However, existing PQC architectures face several challenges, among which the ``barren plateaus" phenomenon is particularly prominent. In such cases, the loss function concentrates exponentially with increasing system size, thereby hindering effective parameter optimization. To address this challenge, we propose a general and hardware-efficient method for eliminating barren plateaus in an arbitrary PQC. Specifically, our approach achieves this by inserting a layer of easily implementable quantum channels into the original PQC, each channel requiring only one ancilla qubit and four additional gates, yielding a modified PQC (MPQC) that is provably at least as expressive as the original PQC and, under mild assumptions, is guaranteed to be free from barren plateaus. Furthermore, by appropriately adjusting the structure of MPQCs, we rigorously prove that any parameter in the original PQC can be made trainable. Importantly, the absence of barren plateaus in MPQCs is robust against realistic noise, making our approach directly applicable to current noisy intermediate-scale quantum (NISQ) hardware. Numerically, we demonstrate the practicality of our method by modifying a commonly used PQC for thermal-state preparation. The results show that {barren plateaus are effectively eliminated} in this class of circuits with up to 100 qubits and 2400 layers, whereas the original ansatz suffers from severe gradient vanishing.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vasilis Gkolemis, Christos Diou, Michael Gutmann
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Bayesian parameter inference for complex stochastic simulators is challenging due to intractable likelihood functions. Existing simulation-based inference methods often require large number of simulations and become costly to use in high-dimensional parameter spaces or in problems with partially uninformative outputs. We propose a new method for differentiable simulators that delivers accurate posterior inference with substantially reduced runtimes. Building on the Optimization Monte Carlo framework, our approach reformulates stochastic simulation as deterministic optimization problems. Gradient-based methods are then applied to efficiently navigate toward high-density posterior regions and avoid wasteful simulations in low-probability areas. A JAX-based implementation further enhances the performance through vectorization of key method components. Extensive experiments, including high-dimensional parameter spaces, uninformative outputs, multiple observations and multimodal posteriors show that our method consistently matches, and often exceeds, the accuracy of state-of-the-art approaches, while reducing the runtime by a substantial margin.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chengdong Ma, Thรฉo Tao Zhaowei, Pengyu Li, Minghao Liu, Haojun Chen et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a fundamental challenge. This problem represents the local analogue of Hilbert's 18th problem on sphere packing, bridging geometry, number theory, and information theory. Although significant progress has been made through lattices and codes, the irregularities of high-dimensional geometry and exponentially growing combinatorial complexity beyond 8 dimensions, which exceeds the complexity of Go game, limit the scalability of existing methods. Here we model this problem as a two-player matrix completion game and train the game-theoretic reinforcement learning system, PackingStar, to efficiently explore high-dimensional spaces. The matrix entries represent pairwise cosines of sphere center vectors; one player fills entries while another corrects suboptimal ones, jointly maximizing the matrix size, corresponding to the kissing number. This cooperative dynamics substantially improves sample quality, making the extremely large spaces tractable. PackingStar reproduces previous configurations and surpasses all human-known records from dimensions 25 to 31, with the configuration in 25 dimensions geometrically corresponding to the Leech lattice and suggesting possible optimality. It achieves the first breakthrough beyond rational structures from 1971 in 13 dimensions and discovers over 6000 new structures in 14 and other dimensions. These results demonstrate AI's power to explore high-dimensional spaces beyond human intuition and open new pathways for the Kissing Number Problem and broader geometry problems.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zhipeng Ma, Bo Nรธrregaard Jรธrgensen, Zheng Grace Ma
ๆฅ็ๆ่ฆ
Improving energy efficiency in industrial foundry processes is a critical challenge, as these operations are highly energy-intensive and marked by complex interdependencies among process variables. Correlation-based analyses often fail to distinguish true causal drivers from spurious associations, limiting their usefulness for decision-making. This paper applies a time-series causal inference framework to identify the operational factors that directly affect energy efficiency in induction furnace melting. Using production data from a Danish foundry, the study integrates time-series clustering to segment melting cycles into distinct operational modes with the PCMCI+ algorithm, a state-of-the-art causal discovery method, to uncover cause-effect relationships within each mode. Across clusters, robust causal relations among energy consumption, furnace temperature, and material weight define the core drivers of efficiency, while voltage consistently influences cooling water temperature with a delayed response. Cluster-specific differences further distinguish operational regimes: efficient clusters are characterized by stable causal structures, whereas inefficient ones exhibit reinforcing feedback loops and atypical dependencies. The contributions of this study are twofold. First, it introduces an integrated clustering-causal inference pipeline as a methodological innovation for analyzing energy-intensive processes. Second, it provides actionable insights that enable foundry operators to optimize performance, reduce energy consumption, and lower emissions.
๐
2025-11-17
๐ ArXiv Preprint (cs.DL)
๐ฅ Carlo Teo Pedretti, Davide Picca, Dario Rodighiero
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Diagrams are crucial yet underexplored tools in many disciplines, demonstrating the close connection between visual representation and scholarly reasoning. However, their iconic form poses obstacles to visual studies, intermedial analysis, and text-based digital workflows. In particular, Charles S. Peirce consistently advocated the use of diagrams as essential for reasoning and explanation. His manuscripts, often combining textual content with complex visual artifacts, provide a challenging case for studying documents involving heterogeneous materials. In this preliminary study, we investigate whether Visual Language Models (VLMs) can effectively help us identify and interpret such hybrid pages in context. First, we propose a workflow that (i) segments manuscript page layouts, (ii) reconnects each segment to IIIF-compliant annotations, and (iii) submits fragments containing diagrams to a VLM. In addition, by adopting Peirce's semiotic framework, we designed prompts to extract key knowledge about diagrams and produce concise captions. Finally, we integrated these captions into knowledge graphs, enabling structured representations of diagrammatic content within composite sources.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Prakrit Timilsina, Anuj Nepal, Rajan Kadel, Robin Doss
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) face significant challenges in distributed healthcare, including consolidating specialized domain knowledge across institutions while maintaining privacy, reducing computational overhead, and preventing catastrophic forgetting during model updates.This paper presents a systematic evaluation of six parameter-space merging techniques applied to two architecturally compatible medical LLMs derived from the Mistral-7B base model. We introduce a novel hierarchical method that combines selective Optimal Transport (OT) alignment for attention layers with cosine similarity-weighted interpolation, designed to address permutation variance while minimizing computational overhead for edge deployment scenarios. Our study evaluates Task Arithmetic, Linear Averaging, DARE-TIES, DELLA, Breadcrumbs, and our Hierarchical approach across five medical benchmarks. Results demonstrate that architecturally compatible models benefit significantly from simple averaging methods, with Task Arithmetic achieving 45.80% accuracy on MedQA, outperforming complex pruning-based approaches. These findings offer critical insights for the deployment of distributed medical AI in resource-constrained IoT environments, where computational efficiency and model compatibility are paramount. Our work establishes that for architecturally compatible models, simple averaging provides a robust and computationally efficient baseline for knowledge consolidation, offering a pragmatic path forward for scalable medical AI systems.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinlan Wu, Bin Zhu, Feng Han, Pengkun Jiao, Jingjing Chen
Multimodal
ๆฅ็ๆ่ฆ
Food analysis has become increasingly critical for health-related tasks such as personalized nutrition and chronic disease prevention. However, existing large multimodal models (LMMs) in food analysis suffer from catastrophic forgetting when learning new tasks, requiring costly retraining from scratch. To address this, we propose a novel continual learning framework for multimodal food learning, integrating a Dual-LoRA architecture with Quality-Enhanced Pseudo Replay. We introduce two complementary low-rank adapters for each task: a specialized LoRA that learns task-specific knowledge with orthogonal constraints to previous tasks' subspaces, and a cooperative LoRA that consolidates shared knowledge across tasks via pseudo replay. To improve the reliability of replay data, our Quality-Enhanced Pseudo Replay strategy leverages self-consistency and semantic similarity to reduce hallucinations in generated samples. Experiments on the comprehensive Uni-Food dataset show superior performance in mitigating forgetting, representing the first effective continual learning approach for complex food tasks.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Han Meng, Gang Mei, Hong Tian, Nengxiong Xu, Jianbing Peng
ๆฅ็ๆ่ฆ
Rock discontinuities critically govern the mechanical behavior and stability of rock masses. Their internal distributions remain largely unobservable and are typically inferred from surface-exposed discontinuities using generative prediction approaches. However, surface-exposed observations are inherently sparse, and existing generative prediction approaches either fail to capture the underlying complex distribution patterns or lack robustness under data-sparse conditions. Here, we proposed a simple yet robust approach for statistically accurate generative prediction of rock discontinuities by utilizing a tabular foundation model. By leveraging the powerful sample learning capability of the foundation model specifically designed for small data, our approach can effectively capture the underlying complex distribution patterns within limited measured discontinuities. Comparative experiments on ten datasets with diverse scales and distribution patterns of discontinuities demonstrate superior accuracy and robustness over conventional statistical models and deep generative approaches. This work advances quantitative characterization of rock mass structures, supporting safer and more reliable data-driven geotechnical design.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunze Leng, Rohan Ghosh, Mehul Motani
Natural Language Processing
ๆฅ็ๆ่ฆ
Supervised learning with tabular data presents unique challenges, including low data sizes, the absence of structural cues, and heterogeneous features spanning both categorical and continuous domains. Unlike vision and language tasks, where models can exploit inductive biases in the data, tabular data lacks inherent positional structure, hindering the effectiveness of self-attention mechanisms. While recent transformer-based models like TabTransformer, SAINT, and FT-Transformer (which we refer to as 3T) have shown promise on tabular data, they typically operate without leveraging structural cues such as positional encodings (PEs), as no prior structural information is usually available. In this work, we find both theoretically and empirically that structural cues, specifically PEs can be a useful tool to improve generalization performance for tabular transformers. We find that PEs impart the ability to reduce the effective rank (a form of intrinsic dimensionality) of the features, effectively simplifying the task by reducing the dimensionality of the problem, yielding improved generalization. To that end, we propose Tab-PET (PEs for Tabular Transformers), a graph-based framework for estimating and inculcating PEs into embeddings. Inspired by approaches that derive PEs from graph topology, we explore two paradigms for graph estimation: association-based and causality-based. We empirically demonstrate that graph-derived PEs significantly improve performance across 50 classification and regression datasets for 3T. Notably, association-based graphs consistently yield more stable and pronounced gains compared to causality-driven ones. Our work highlights an unexpected role of PEs in tabular transformers, revealing how they can be harnessed to improve generalization.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Alexandru-Mihai Apostu, Andrei Preda, Alexandra Daniela Damir, Diana Bolocan, Radu Tudor Ionescu et al.
ๆฅ็ๆ่ฆ
Generating thorough natural language explanations for threat detections remains an open problem in cybersecurity research, despite significant advances in automated malware detection systems. In this work, we present AutoMalDesc, an automated static analysis summarization framework that, following initial training on a small set of expert-curated examples, operates independently at scale. This approach leverages an iterative self-paced learning pipeline to progressively enhance output quality through synthetic data generation and validation cycles, eliminating the need for extensive manual data annotation. Evaluation across 3,600 diverse samples in five scripting languages demonstrates statistically significant improvements between iterations, showing consistent gains in both summary quality and classification accuracy. Our comprehensive validation approach combines quantitative metrics based on established malware labels with qualitative assessment from both human experts and LLM-based judges, confirming both technical precision and linguistic coherence of generated summaries. To facilitate reproducibility and advance research in this domain, we publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) and test (3.6K) datasets, along with our methodology and evaluation framework.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Senne Deproost, Dennis Steckelmacher, Ann Nowรฉ
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Deep Reinforcement Learning is one of the state-of-the-art methods for producing near-optimal system controllers. However, deep RL algorithms train a deep neural network, that lacks transparency, which poses challenges when the controller has to meet regulations, or foster trust. To alleviate this, one could transfer the learned behaviour into a model that is human-readable by design using knowledge distilla- tion. Often this is done with a single model which mimics the original model on average but could struggle in more dynamic situations. A key challenge is that this simpler model should have the right balance be- tween flexibility and complexity or right balance between balance bias and accuracy. We propose a new model-agnostic method to divide the state space into regions where a simplified, human-understandable model can operate in. In this paper, we use Voronoi partitioning to find regions where linear models can achieve similar performance to the original con- troller. We evaluate our approach on a gridworld environment and a classic control task. We observe that our proposed distillation to locally- specialized linear models produces policies that are explainable and show that the distillation matches or even slightly outperforms the black-box policy they are distilled from.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jonas Bode, Raphael Memmesheimer, Sven Behnke
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Acting in human environments is a crucial capability for general-purpose robots, necessitating a robust understanding of natural language and its application to physical tasks. This paper seeks to harness the capabilities of diffusion models within a visuomotor policy framework that merges visual and textual inputs to generate precise robotic trajectories. By employing reference demonstrations during training, the model learns to execute manipulation tasks specified through textual commands within the robot's immediate environment. The proposed research aims to extend an existing model by leveraging improved embeddings, and adapting techniques from diffusion models for image generation. We evaluate our methods on the CALVIN dataset, proving enhanced performance on various manipulation tasks and an increased long-horizon success rate when multiple tasks are executed in sequence. Our approach reinforces the usefulness of diffusion models and contributes towards general multitask manipulation.
๐
2025-11-17
๐ ArXiv Preprint (q-bio.QM)
๐ฅ Chaowang Lan, Jingxin Wu, Yulong Yuan, Chuxun Liu, Huangyi Kang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Biomarker discovery from high-throughput transcriptomic data is crucial for advancing precision medicine. However, existing methods often neglect gene-gene regulatory relationships and lack stability across datasets, leading to conflation of spurious correlations with genuine causal effects. To address these issues, we develop a causal graph neural network (Causal-GNN) method that integrates causal inference with multi-layer graph neural networks (GNNs). The key innovation is the incorporation of causal effect estimation for identifying stable biomarkers, coupled with a GNN-based propensity scoring mechanism that leverages cross-gene regulatory networks. Experimental results demonstrate that our method achieves consistently high predictive accuracy across four distinct datasets and four independent classifiers. Moreover, it enables the identification of more stable biomarkers compared to traditional methods. Our work provides a robust, efficient, and biologically interpretable tool for biomarker discovery, demonstrating strong potential for broad application across medical disciplines.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Taras Sereda, Tom St. John, Burak Bartan, Natalie Serrino, Sachin Katti et al.
Machine Learning
ๆฅ็ๆ่ฆ
GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and iteratively refines programs through compilation and correctness feedback, and a performance analysis agent that interprets profiling data to guide optimization. This agent-based architecture requires only a single-shot example to target new platforms.
We make three key contributions: (1) introducing an iterative refinement system where the generation agent and performance analysis agent collaborate through functional and optimization passes, interpreting diverse profiling data (from programmatic APIs to GUI-based tools) to generate actionable recommendations that guide program synthesis for arbitrary accelerators; (2) demonstrating that the generation agent effectively leverages cross-platform knowledge transfer, where a reference implementation from one architecture substantially improves generation quality for different hardware targets; and (3) validating the platform-agnostic nature of our approach by demonstrating effective program synthesis across fundamentally different parallel computing platforms: NVIDIA CUDA and Apple Metal.
๐
2025-11-17
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Jack B. Coughlin, Archis Joglekar, Jonathan Brodrick, Alexander Lavin
ๆฅ็ๆ่ฆ
This work presents a case study of a heterogeneous multiphysics solver from the nuclear fusion domain. At the macroscopic scale, an auto-differentiable ODE solver in JAX computes the evolution of the pulsed power circuit and bulk plasma parameters for a compressing Z Pinch. The ODE solver requires a closure for the impedance of the plasma load obtained via root-finding at every timestep, which we solve efficiently using gradient-based Newton iteration. However, incorporating non-differentiable production-grade plasma solvers like Gkeyll (a C/CUDA plasma simulation suite) into a gradient-based workflow is non-trivial. The ''Tesseract'' software addresses this challenge by providing a multi-physics differentiable abstraction layer made fully compatible with JAX (through the `tesseract_jax` adapter). This architecture ensures end-to-end differentiability while allowing seamless interchange between high-fidelity solvers (Gkeyll), neural surrogates, and analytical approximations for rapid, progressive prototyping.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aleksandar Stankoviฤ, Dejan Lisica
ๆฅ็ๆ่ฆ
We present reproducible, edge-aware baselines for ogbn-proteins in PyTorch Geometric (PyG). We study two system choices that dominate practice: (i) how 8-dimensional edge evidence is aggregated into node inputs, and (ii) how edges are used inside message passing. Our strongest baseline is GraphSAGE with sum-based edge-to-node features. We compare LayerNorm (LN), BatchNorm (BN), and a species-aware Conditional LayerNorm (CLN), and report compute cost (time, VRAM, parameters) together with accuracy (ROC-AUC) and decision quality. In our primary experimental setup (hidden size 512, 3 layers, 3 seeds), sum consistently beats mean and max; BN attains the best AUC, while CLN matches the AUC frontier with better thresholded F1. Finally, post-hoc per-label temperature scaling plus per-label thresholds substantially improves micro-F1 and expected calibration error (ECE) with negligible AUC change, and light label-correlation smoothing yields small additional gains. We release standardized artifacts and scripts used for all of the runs presented in the paper.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nadav Bojan Sellam, Meital Bojan, Paul Schanda, Alex Bronstein
Machine Learning
ๆฅ็ๆ่ฆ
Accurate protein structures are essential for understanding biological function, yet incorporating experimental data into protein generative models remains a major challenge. Most predictors of experimental observables are non-differentiable, making them incompatible with gradient-based conditional sampling. This is especially limiting in nuclear magnetic resonance, where rich data such as chemical shifts are hard to directly integrate into generative modeling. We introduce a framework for non-differentiable guidance of protein generative models, coupling a continuous diffusion-based generator with any black-box objective via a tailored genetic algorithm. We demonstrate its effectiveness across three modalities: pairwise distance constraints, nuclear Overhauser effect restraints, and for the first time chemical shifts. These results establish chemical shift guided structure generation as feasible, expose key weaknesses in current predictors, and showcase a general strategy for incorporating diverse experimental signals. Our work points toward automated, data-conditioned protein modeling beyond the limits of differentiability.
๐
2025-11-17
๐ AAAI
๐ฅ Xiaoqi Han, Ru Li, Ran Yi, Hongye Tan, Zhuomin Liang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, obscure overfitting. We propose a comprehensive locality evaluation framework, covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, uncovering a phenomenon we term transient blindness, overfitting to edit-similar text while ignoring visuals. Token analysis shows edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Arka Pal, Teo Kitanovski, Arthur Liang, Akilesh Potti, Micah Goldblum
Natural Language Processing
ๆฅ็ๆ่ฆ
Real-world tasks and environments exhibit differences from the static datasets that large language models (LLMs) are typically evaluated on. Such tasks can involve sequential interaction, requiring coherent updating of beliefs in light of new evidence, and making appropriate decisions based on those beliefs. Predicting how LLMs will perform in such dynamic environments is important, but can be tricky to determine from measurements in static settings. In this work, we examine two critical components of LLM performance: the ability of LLMs to coherently update their beliefs, and the extent to which the actions they take are consistent with those beliefs. First, we find that LLMs are largely inconsistent in how they update their beliefs; models can exhibit up to a 30% average difference between the directly elicited posterior, and the correct update of their prior. Second, we find that LLMs also often take actions which are inconsistent with the beliefs they hold. On a betting market, for example, LLMs often do not even bet in the same direction as their internally held beliefs over the underlying outcomes. We also find they have moderate self-inconsistency in how they respond to challenges by users to given answers. Finally, we show that the above properties hold even for strong models that obtain high accuracy or that are well-calibrated on the tasks at hand. Our results highlight the difficulties of predicting LLM behavior in complex real-world settings.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Patrick Parschan, Charlott Jakob
Natural Language Processing
ๆฅ็ๆ่ฆ
This article presents the first systematic review of unsupervised and semi-supervised computational text-based ideal point estimation (CT-IPE) algorithms, methods designed to infer latent political positions from textual data. These algorithms are widely used in political science, communication, computational social science, and computer science to estimate ideological preferences from parliamentary speeches, party manifestos, and social media. Over the past two decades, their development has closely followed broader NLP trends -- beginning with word-frequency models and most recently turning to large language models (LLMs). While this trajectory has greatly expanded the methodological toolkit, it has also produced a fragmented field that lacks systematic comparison and clear guidance for applied use. To address this gap, we identified 25 CT-IPE algorithms through a systematic literature review and conducted a manual content analysis of their modeling assumptions and development contexts. To compare them meaningfully, we introduce a conceptual framework that distinguishes how algorithms generate, capture, and aggregate textual variance. On this basis, we identify four methodological families -- word-frequency, topic modeling, word embedding, and LLM-based approaches -- and critically assess their assumptions, interpretability, scalability, and limitations. Our review offers three contributions. First, it provides a structured synthesis of two decades of algorithm development, clarifying how diverse methods relate to one another. Second, it translates these insights into practical guidance for applied researchers, highlighting trade-offs in transparency, technical requirements, and validation strategies that shape algorithm choice. Third, it emphasizes that differences in estimation outcomes across algorithms are themselves informative, underscoring the need for systematic benchmarking.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Alan G. Paredes Cetina, Kaouther Benguessoum, Raoni Lourenรงo, Sylvain Kubler
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in deep learning have improved multivariate time series (MTS) classification and regression by capturing complex patterns, but their lack of transparency hinders decision-making. Explainable AI (XAI) methods offer partial insights, yet often fall short of conveying the full decision space. Counterfactual Explanations (CE) provide a promising alternative, but current approaches typically prioritize either accuracy, proximity or sparsity -- rarely all -- limiting their practical value. To address this, we propose CONFETTI, a novel multi-objective CE method for MTS. CONFETTI identifies key MTS subsequences, locates a counterfactual target, and optimally modifies the time series to balance prediction confidence, proximity and sparsity. This method provides actionable insights with minimal changes, improving interpretability, and decision support. CONFETTI is evaluated on seven MTS datasets from the UEA archive, demonstrating its effectiveness in various domains. CONFETTI consistently outperforms state-of-the-art CE methods in its optimization objectives, and in six other metrics from the literature, achieving $\geq10\%$ higher confidence while improving sparsity in $\geq40\%$.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Boris Kriuk
Machine Learning
ๆฅ็ๆ่ฆ
Traditional gradient boosting algorithms employ static tree structures with fixed splitting criteria that remain unchanged throughout training, limiting their ability to adapt to evolving gradient distributions and problem-specific characteristics across different learning stages. This work introduces MorphBoost, a new gradient boosting framework featuring self-organizing tree structures that dynamically morph their splitting behavior during training. The algorithm implements adaptive split functions that evolve based on accumulated gradient statistics and iteration-dependent learning pressures, enabling automatic adjustment to problem complexity. Key innovations include: (1) morphing split criterion combining gradient-based scores with information-theoretic metrics weighted by training progress; (2) automatic problem fingerprinting for intelligent parameter configuration across binary/multiclass/regression tasks; (3) vectorized tree prediction achieving significant computational speedups; (4) interaction-aware feature importance detecting multiplicative relationships; and (5) fast-mode optimization balancing speed and accuracy. Comprehensive benchmarking across 10 diverse datasets against competitive models (XGBoost, LightGBM, GradientBoosting, HistGradientBoosting, ensemble methods) demonstrates that MorphBoost achieves state-of-the-art performance, outperforming XGBoost by 0.84% on average. MorphBoost secured the overall winner position with 4/10 dataset wins (40% win rate) and 6/30 top-3 finishes (20%), while maintaining the lowest variance (ฯ=0.0948) and highest minimum accuracy across all models, revealing superior consistency and robustness. Performance analysis across difficulty levels shows competitive results on easy datasets while achieving notable improvements on advanced problems due to higher adaptation levels.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mary Chriselda Antony Oliver, Michael Roberts, Carola-Bibiane Schรถnlieb, Matthew Thorpe
ๆฅ็ๆ่ฆ
The manifold hypothesis posits that high-dimensional data typically resides on low-dimensional sub spaces. In this paper, we assume manifold hypothesis to investigate graph-based semi-supervised learning
methods. In particular, we examine Laplace Learning in the Wasserstein space, extending the classical
notion of graph-based semi-supervised learning algorithms from finite-dimensional Euclidean spaces to
an infinite-dimensional setting. To achieve this, we prove variational convergence of a discrete graph p- Dirichlet energy to its continuum counterpart. In addition, we characterize the Laplace-Beltrami operator
on asubmanifold of the Wasserstein space. Finally, we validate the proposed theoretical framework through
numerical experiments conducted on benchmark datasets, demonstrating the consistency of our classification performance in high-dimensional settings.
๐
2025-11-17
๐ NeurIPS 2025
๐ฅ Yuxiang Zhang, Zhengxu Yu, Weihang Pan, Zhongming Jin, Qiang Fu et al.
ๆฅ็ๆ่ฆ
Emerging reasoning LLMs such as OpenAI-o1 and DeepSeek-R1 have achieved strong performance on complex reasoning tasks by generating long chain-of-thought (CoT) traces. However, these long CoTs result in increased token usage, leading to higher inference latency and memory consumption. As a result, balancing accuracy and reasoning efficiency has become essential for deploying reasoning LLMs in practical applications. Existing long-to-short (Long2Short) methods aim to reduce inference length but often sacrifice accuracy, revealing a need for an approach that maintains performance while lowering token costs. To address this efficiency-accuracy tradeoff, we propose TokenSqueeze, a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. First, to prevent performance degradation caused by excessive compression of reasoning depth, we propose to select self-generated samples whose reasoning depth is adaptively matched to the complexity of the problem. To further optimize the linguistic expression without altering the underlying reasoning paths, we introduce a distribution-aligned linguistic refinement method that enhances the clarity and conciseness of the reasoning path while preserving its logical integrity. Comprehensive experimental results demonstrate the effectiveness of TokenSqueeze in reducing token usage while maintaining accuracy. Notably, DeepSeek-R1-Distill-Qwen-7B fine-tuned using our proposed method achieved a 50\% average token reduction while preserving accuracy on the MATH500 benchmark. TokenSqueeze exclusively utilizes the model's self-generated data, enabling efficient and high-fidelity reasoning without relying on manually curated short-answer datasets across diverse applications. Our code is available at https://github.com/zhangyx1122/TokenSqueeze.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Mohamed Salem, Inyoung Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
The transformer architecture has demonstrated strong performance in classification tasks involving structured and high-dimensional data. However, its success often hinges on large- scale training data and careful regularization to prevent overfitting. In this paper, we intro- duce a novel likelihood-guided variational Ising-based regularization framework for Vision Transformers (ViTs), which simultaneously enhances model generalization and dynamically prunes redundant parameters. The proposed variational Ising-based regularization approach leverages Bayesian sparsification techniques to impose structured sparsity on model weights, allowing for adaptive architecture search during training. Unlike traditional dropout-based methods, which enforce fixed sparsity patterns, the variational Ising-based regularization method learns task-adaptive regularization, improving both efficiency and interpretability. We evaluate our approach on benchmark vision datasets, including MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100, demonstrating improved generalization under sparse, complex data and allowing for principled uncertainty quantification on both weights and selection parameters. Additionally, we show that the Ising regularizer leads to better-calibrated probability estimates and structured feature selection through uncertainty-aware attention mechanisms. Our results highlight the effectiveness of structured Bayesian sparsification in enhancing transformer-based architectures, offering a principled alternative to standard regularization techniques.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Guillaume Infantes, Stรฉphanie Roussel, Antoine Jacquet, Emmanuel Benazera
Machine Learning
ๆฅ็ๆ่ฆ
The Resource-Constrained Project Scheduling Problem (RCPSP) is a classical scheduling problem that has received significant attention due to of its numerous applications in industry. However, in practice, task durations are subject to uncertainty that must be considered in order to propose resilient scheduling. In this paper, we address the RCPSP variant with uncertain tasks duration (modeled using known probabilities) and aim to minimize the overall expected project duration. Our objective is to produce a baseline schedule that can be reused multiple times in an industrial setting regardless of the actual duration scenario. We leverage Graph Neural Networks in conjunction with Deep Reinforcement Learning (DRL) to develop an effective policy for task scheduling. This policy operates similarly to a priority dispatch rule and is paired with a Serial Schedule Generation Scheme to produce a schedule. Our empirical evaluation on standard benchmarks demonstrates the approach's superiority in terms of performance and its ability to generalize. The developed framework, Wheatley, is made publicly available online to facilitate further research and reproducibility.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhixin Ou, Peng Liang, Jianchen Han, Baihui Liu, Linbo Qiao
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing neither communication-parallelization cancellation on short sequences nor out-of-memory on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly optimal strategy adoption according to the immediate input sequence. It first implements the modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques together, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries. We compare ParaDySe with baselines on representative LLMs under datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.
๐
2025-11-17
๐ IJCAI 2025
๐ฅ Akash Karthikeyan, Yash Vardhan Pant
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Self-play reinforcement learning has demonstrated significant success in learning complex strategic and interactive behaviors in competitive multi-agent games. However, achieving such behaviors in continuous decision spaces remains challenging. Ensuring adaptability and generalization in self-play settings is critical for achieving competitive performance in dynamic multi-agent environments. These challenges often cause methods to converge slowly or fail to converge at all to a Nash equilibrium, making agents vulnerable to strategic exploitation by unseen opponents. To address these challenges, we propose DiffFP, a fictitious play (FP) framework that estimates the best response to unseen opponents while learning a robust and multimodal behavioral policy. Specifically, we approximate the best response using a diffusion policy that leverages generative modeling to learn adaptive and diverse strategies. Through empirical evaluation, we demonstrate that the proposed FP framework converges towards $ฮต$-Nash equilibria in continuous- space zero-sum games. We validate our method on complex multi-agent environments, including racing and multi-particle zero-sum games. Simulation results show that the learned policies are robust against diverse opponents and outperform baseline reinforcement learning policies. Our approach achieves up to 3$\times$ faster convergence and 30$\times$ higher success rates on average against RL-based baselines, demonstrating its robustness to opponent strategies and stability across training iterations
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aishwarya Venkataramanan, Sai Karthikeya Vemuri, Adithya Ashok Chalain Valapil, Joachim Denzler
Machine Learning
ๆฅ็ๆ่ฆ
Coherent anti-Stokes Raman scattering (CARS) spectroscopy is a powerful and rapid technique widely used in medicine, material science, and chemical analyses. However, its effectiveness is hindered by the presence of a non-resonant background that interferes with and distorts the true Raman signal. Deep learning methods have been employed to reconstruct the true Raman spectrum from measured CARS data using labeled datasets. A more recent development integrates the domain knowledge of Kramers-Kronig relationships and smoothness constraints in the form of physics-informed loss functions. However, these deterministic models lack the ability to quantify uncertainty, an essential feature for reliable deployment in high-stakes scientific and biomedical applications. In this work, we evaluate and compare various uncertainty quantification (UQ) techniques within the context of CARS-to-Raman signal reconstruction. Furthermore, we demonstrate that incorporating physics-informed constraints into these models improves their calibration, offering a promising path toward more trustworthy CARS data analysis.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingxuan Tian, Haochen Mu, Donghong Ding, Mengjiao Li, Yuhan Ding et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
With the development of digital twins and smart manufacturing systems, there is an urgent need for real-time distortion field prediction to control defects in metal Additive Manufacturing (AM). However, numerical simulation methods suffer from high computational cost, long run-times that prevent real-time use, while conventional Machine learning (ML) models struggle to extract spatiotemporal features for long-horizon prediction and fail to decouple thermo-mechanical fields. This paper proposes a Physics-informed Neural Operator (PINO) to predict z and y-direction distortion for the future 15 s. Our method, Physics-informed Deep Operator Network-Recurrent Neural Network (PIDeepONet-RNN) employs trunk and branch network to process temperature history and encode distortion fields, respectively, enabling decoupling of thermo-mechanical responses. By incorporating the heat conduction equation as a soft constraint, the model ensures physical consistency and suppresses unphysical artifacts, thereby establishing a more physically consistent mapping between the thermal history and distortion. This is important because such a basis function, grounded in physical laws, provides a robust and interpretable foundation for predictions. The proposed models are trained and tested using datasets generated from experimentally validated Finite Element Method (FEM). Evaluation shows that the model achieves high accuracy, low error accumulation, time efficiency. The max absolute errors in the z and y-directions are as low as 0.9733 mm and 0.2049 mm, respectively. The error distribution shows high errors in the molten pool but low gradient norms in the deposited and key areas. The performance of PINO surrogate model highlights its potential for real-time long-horizon physics field prediction in controlling defects.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ella J. Schmidtobreick, Daniel Arnstrรถm, Paul Hรคusner, Jens Sjรถlund
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Quadratic programming (QP) solvers are widely used in real-time control and optimization, but their computational cost often limits applicability in time-critical settings. We propose a learning-to-optimize approach using graph neural networks (GNNs) to predict active sets in the dual active-set solver DAQP. The method exploits the structural properties of QPs by representing them as bipartite graphs and learning to identify the optimal active set for efficiently warm-starting the solver. Across varying problem sizes, the GNN consistently reduces the number of solver iterations compared to cold-starting, while performance is comparable to a multilayer perceptron (MLP) baseline. Furthermore, a GNN trained on varying problem sizes generalizes effectively to unseen dimensions, demonstrating flexibility and scalability. These results highlight the potential of structure-aware learning to accelerate optimization in real-time applications such as model predictive control.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ TC Singh, Sougata Mukherjea
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) excel in graph-based learning tasks, but their complex, non-linear operations often render them as opaque "black boxes". This opacity hinders user trust, complicates debugging, bias detection, and adoption in critical domains requiring explainability. This paper introduces InteractiveGNNExplainer, a visual analytics framework to enhance GNN explainability, focusing on node classification. Our system uniquely integrates coordinated interactive views (dynamic graph layouts, embedding projections, feature inspection, neighborhood analysis) with established post-hoc (GNNExplainer) and intrinsic (GAT attention) explanation techniques. Crucially, it incorporates interactive graph editing, allowing users to perform a "what-if" analysis by perturbing graph structures and observing immediate impacts on GNN predictions and explanations. We detail the system architecture and, through case studies on Cora and CiteSeer datasets, demonstrate how InteractiveGNNExplainer facilitates in-depth misclassification diagnosis, comparative analysis of GCN versus GAT behaviors, and rigorous probing of model sensitivity. These capabilities foster a deeper, multifaceted understanding of GNN predictions, contributing to more transparent, trustworthy, and robust graph analysis.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shaoyuan Chen, Zhixuan Chen, Dawei Yang, Zhihang Yuan, Qiang Wu
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) fine-tuning techniques not only improve the adaptability to diverse downstream tasks, but also mitigate adverse effects of model quantization. Despite this, conventional quantization suffers from its structural limitation that hinders flexibility during the fine-tuning and deployment stages. Practical on-device tasks demand different quantization precisions (i.e. different bit-widths), e.g., understanding tasks tend to exhibit higher tolerance to reduced precision compared to generation tasks. Conventional quantization, typically relying on scaling factors that are incompatible across bit-widths, fails to support the on-device switching of precisions when confronted with complex real-world scenarios. To overcome the dilemma, we propose OTARo, a novel method that enables on-device LLMs to flexibly switch quantization precisions while maintaining performance robustness through once fine-tuning. OTARo introduces Shared Exponent Floating Point (SEFP), a distinct quantization mechanism, to produce different bit-widths through simple mantissa truncations of a single model. Moreover, to achieve bit-width robustness in downstream applications, OTARo performs a learning process toward losses induced by different bit-widths. The method involves two critical strategies: (1) Exploitation-Exploration Bit-Width Path Search (BPS), which iteratively updates the search path via a designed scoring mechanism; (2) Low-Precision Asynchronous Accumulation (LAA), which performs asynchronous gradient accumulations and delayed updates under low bit-widths. Experiments on popular LLMs, e.g., LLaMA3.2-1B, LLaMA3-8B, demonstrate that OTARo achieves consistently strong and robust performance for all precisions.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Jiacheng Cheng, Xu Zhang, Guanghui Qiu, Yifang Zhang, Yinchuan Li et al.
ๆฅ็ๆ่ฆ
Federated Learning (FL) enables collaborative training across decentralized data, but faces key challenges of bidirectional communication overhead and client-side data heterogeneity. To address communication costs while embracing data heterogeneity, we propose pFed1BS, a novel personalized federated learning framework that achieves extreme communication compression through one-bit random sketching. In personalized FL, the goal shifts from training a single global model to creating tailored models for each client. In our framework, clients transmit highly compressed one-bit sketches, and the server aggregates and broadcasts a global one-bit consensus. To enable effective personalization, we introduce a sign-based regularizer that guides local models to align with the global consensus while preserving local data characteristics. To mitigate the computational burden of random sketching, we employ the Fast Hadamard Transform for efficient projection. Theoretical analysis guarantees that our algorithm converges to a stationary neighborhood of the global potential function. Numerical simulations demonstrate that pFed1BS substantially reduces communication costs while achieving competitive performance compared to advanced communication-efficient FL algorithms.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shudong Wang, Xinfei Wang, Chenhao Zhang, Shanchen Pang, Haiyuan Gui et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Multi-task reinforcement learning (MTRL) seeks to learn a unified policy for diverse tasks, but often suffers from gradient conflicts across tasks. Existing masking-based methods attempt to mitigate such conflicts by assigning task-specific parameter masks. However, our empirical study shows that coarse-grained binary masks have the problem of over-suppressing key conflicting parameters, hindering knowledge sharing across tasks. Moreover, different tasks exhibit varying conflict levels, yet existing methods use a one-size-fits-all fixed sparsity strategy to keep training stability and performance, which proves inadequate. These limitations hinder the model's generalization and learning efficiency.
To address these issues, we propose SoCo-DT, a Soft Conflict-resolution method based by parameter importance. By leveraging Fisher information, mask values are dynamically adjusted to retain important parameters while suppressing conflicting ones. In addition, we introduce a dynamic sparsity adjustment strategy based on the Interquartile Range (IQR), which constructs task-specific thresholding schemes using the distribution of conflict and harmony scores during training. To enable adaptive sparsity evolution throughout training, we further incorporate an asymmetric cosine annealing schedule to continuously update the threshold. Experimental results on the Meta-World benchmark show that SoCo-DT outperforms the state-of-the-art method by 7.6% on MT50 and by 10.5% on the suboptimal dataset, demonstrating its effectiveness in mitigating gradient conflicts and improving overall multi-task performance.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Hao Long, Silin Zhou, Lisi Chen, Shuo Shang
ๆฅ็ๆ่ฆ
Recent learning-based methods have reduced the computational complexity of traditional trajectory similarity computation, but state-of-the-art (SOTA) methods still fail to leverage the comprehensive spectrum of trajectory information for similarity modeling. To tackle this problem, we propose \textbf{RePo}, a novel method that jointly encodes \textbf{Re}gion-wise and \textbf{Po}int-wise features to capture both spatial context and fine-grained moving patterns. For region-wise representation, the GPS trajectories are first mapped to grid sequences, and spatial context are captured by structural features and semantic context enriched by visual features. For point-wise representation, three lightweight expert networks extract local, correlation, and continuous movement patterns from dense GPS sequences. Then, a router network adaptively fuses the learned point-wise features, which are subsequently combined with region-wise features using cross-attention to produce the final trajectory embedding. To train RePo, we adopt a contrastive loss with hard negative samples to provide similarity ranking supervision. Experiment results show that RePo achieves an average accuracy improvement of 22.2\% over SOTA baselines across all evaluation metrics.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Changxi Chi, Yufei Huang, Jun Xia, Jiangbin Zheng, Yunfan Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Predicting single-cell perturbation outcomes directly advances gene function analysis and facilitates drug candidate selection, making it a key driver of both basic and translational biomedical research. However, a major bottleneck in this task is the unpaired nature of single-cell data, as the same cell cannot be observed both before and after perturbation due to the destructive nature of sequencing. Although some neural generative transport models attempt to tackle unpaired single-cell perturbation data, they either lack explicit conditioning or depend on prior spaces for indirect distribution alignment, limiting precise perturbation modeling. In this work, we approximate Schrรถdinger Bridge (SB), which defines stochastic dynamic mappings recovering the entropy-regularized optimal transport (OT), to directly align the distributions of control and perturbed single-cell populations across different perturbation conditions. Unlike prior SB approximations that rely on bidirectional modeling to infer optimal source-target sample coupling, we leverage Minibatch-OT based pairing to avoid such bidirectional inference and the associated ill-posedness of defining the reverse process. This pairing directly guides bridge learning, yielding a scalable approximation to the SB. We approximate two SB models, one modeling discrete gene activation states and the other continuous expression distributions. Joint training enables accurate perturbation modeling and captures single-cell heterogeneity. Experiments on public genetic and drug perturbation datasets show that our model effectively captures heterogeneous single-cell responses and achieves state-of-the-art performance.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qipeng Song, Nan Yang, Ziqi Xu, Yue Li, Wei Shao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Machine unlearning aims to eliminate the influence of specific data from trained models to ensure privacy compliance. However, most existing methods assume full access to the original training dataset, which is often impractical. We address a more realistic yet challenging setting: few-shot zero-glance, where only a small subset of the retained data is available and the forget set is entirely inaccessible. We introduce GFOES, a novel framework comprising a Generative Feedback Network (GFN) and a two-phase fine-tuning procedure. GFN synthesises Optimal Erasure Samples (OES), which induce high loss on target classes, enabling the model to forget class-specific knowledge without access to the original forget data, while preserving performance on retained classes. The two-phase fine-tuning procedure enables aggressive forgetting in the first phase, followed by utility restoration in the second. Experiments on three image classification datasets demonstrate that GFOES achieves effective forgetting at both logit and representation levels, while maintaining strong performance using only 5% of the original data. Our framework offers a practical and scalable solution for privacy-preserving machine learning under data-constrained conditions.
๐
2025-11-17
๐ ArXiv Preprint (hep-ex)
๐ฅ Rasmus F. Orsoe, Stephan Meighen-Berger, Jeffrey Lazar, Jorge Prado, Ivan Mozun-Mateo et al.
Machine Learning
ๆฅ็ๆ่ฆ
Neutrino telescopes are large-scale detectors designed to observe Cherenkov radiation produced from neutrino interactions in water or ice. They exist to identify extraterrestrial neutrino sources and to probe fundamental questions pertaining to the elusive neutrino itself. A central challenge common across neutrino telescopes is to solve a series of inverse problems known as event reconstruction, which seeks to resolve properties of the incident neutrino, based on the detected Cherenkov light. In recent times, significant efforts have been made in adapting advances from deep learning research to event reconstruction, as such techniques provide several benefits over traditional methods. While a large degree of similarity in reconstruction needs and low-level data exists, cross-experimental collaboration has been hindered by a lack of diverse open-source datasets for comparing methods.
We present NuBench, an open benchmark for deep learning-based event reconstruction in neutrino telescopes. NuBench comprises seven large-scale simulated datasets containing nearly 130 million charged- and neutral-current muon-neutrino interactions spanning 10 GeV to 100 TeV, generated across six detector geometries inspired by existing and proposed experiments. These datasets provide pulse- and event-level information suitable for developing and comparing machine-learning reconstruction methods in both water and ice environments. Using NuBench, we evaluate four reconstruction algorithms - ParticleNeT and DynEdge, both actively used within the KM3NeT and IceCube collaborations, respectively, along with GRIT and DeepIce - on up to five core tasks: energy and direction reconstruction, topology classification, interaction vertex prediction, and inelasticity estimation.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vidur Sinha, Muhammed Ustaomeroglu, Guannan Qu
RoboticsNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Multi-agent reinforcement learning (MARL) has shown promise for large-scale network control, yet existing methods face two major limitations. First, they typically rely on assumptions leading to decay properties of local agent interactions, limiting their ability to capture long-range dependencies such as cascading power failures or epidemic outbreaks. Second, most approaches lack generalizability across network topologies, requiring retraining when applied to new graphs. We introduce STACCA (Shared Transformer Actor-Critic with Counterfactual Advantage), a unified transformer-based MARL framework that addresses both challenges. STACCA employs a centralized Graph Transformer Critic to model long-range dependencies and provide system-level feedback, while its shared Graph Transformer Actor learns a generalizable policy capable of adapting across diverse network structures. Further, to improve credit assignment during training, STACCA integrates a novel counterfactual advantage estimator that is compatible with state-value critic estimates. We evaluate STACCA on epidemic containment and rumor-spreading network control tasks, demonstrating improved performance, network generalization, and scalability. These results highlight the potential of transformer-based MARL architectures to achieve scalable and generalizable control in large-scale networked systems.
๐
2025-11-17
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kyunghyun Lee, Yong-Min Shin, Minwoo Shin, Jihun Kim, Sunghwan Lim et al.
Machine Learning
ๆฅ็ๆ่ฆ
Early diagnosis of breast cancer is crucial, enabling the establishment of appropriate treatment plans and markedly enhancing patient prognosis. While direct magnetic resonance imaging-guided biopsy demonstrates promising performance in detecting cancer lesions, its practical application is limited by prolonged procedure times and high costs. To overcome these issues, an indirect MRI-guided biopsy that allows the procedure to be performed outside of the MRI room has been proposed, but it still faces challenges in creating an accurate real-time deformable breast model. In our study, we tackled this issue by developing a graph neural network (GNN)-based model capable of accurately predicting deformed breast cancer sites in real time during biopsy procedures. An individual-specific finite element (FE) model was developed by incorporating magnetic resonance (MR) image-derived structural information of the breast and tumor to simulate deformation behaviors. A GNN model was then employed, designed to process surface displacement and distance-based graph data, enabling accurate prediction of overall tissue displacement, including the deformation of the tumor region. The model was validated using phantom and real patient datasets, achieving an accuracy within 0.2 millimeters (mm) for cancer node displacement (RMSE) and a dice similarity coefficient (DSC) of 0.977 for spatial overlap with actual cancerous regions. Additionally, the model enabled real-time inference and achieved a speed-up of over 4,000 times in computational cost compared to conventional FE simulations. The proposed deformation-aware GNN model offers a promising solution for real-time tumor displacement prediction in breast biopsy, with high accuracy and real-time capability. Its integration with clinical procedures could significantly enhance the precision and efficiency of breast cancer diagnosis.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yehonatan Elisha, Seffi Cohen, Oren Barkan, Noam Koenigstein
Machine Learning
ๆฅ็ๆ่ฆ
Saliency maps are widely used for visual explanations in deep learning, but a fundamental lack of consensus persists regarding their intended purpose and alignment with diverse user queries. This ambiguity hinders the effective evaluation and practical utility of explanation methods.We address this gap by introducing the Reference-Frame $\times$ Granularity (RFxG) taxonomy, a principled conceptual framework that organizes saliency explanations along two essential axes:Reference-Frame: Distinguishing between pointwise ("Why this prediction?") and contrastive ("Why this and not an alternative?") explanations.Granularity: Ranging from fine-grained class-level (e.g., "Why Husky?") to coarse-grained group-level (e.g., "Why Dog?") interpretations.Using the RFxG lens, we demonstrate critical limitations in existing evaluation metrics, which overwhelmingly prioritize pointwise faithfulness while neglecting contrastive reasoning and semantic granularity. To systematically assess explanation quality across both RFxG dimensions, we propose four novel faithfulness metrics. Our comprehensive evaluation framework applies these metrics to ten state-of-the-art saliency methods, four model architectures, and three datasets.By advocating a shift toward user-intent-driven evaluation, our work provides both the conceptual foundation and the practical tools necessary to develop visual explanations that are not only faithful to the underlying model behavior but are also meaningfully aligned with the complexity of human understanding and inquiry.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Liuyi Jin, Pasan Gunawardena, Amran Haroon, Runzhi Wang, Sangwoo Lee et al.
Multimodal
ๆฅ็ๆ่ฆ
Emergency Medical Technicians (EMTs) operate in high-pressure environments, making rapid, life-critical decisions under heavy cognitive and operational loads. We present EMSGlass, a smart-glasses system powered by EMSNet, the first multimodal multitask model for Emergency Medical Services (EMS), and EMSServe, a low-latency multimodal serving framework tailored to EMS scenarios. EMSNet integrates text, vital signs, and scene images to construct a unified real-time understanding of EMS incidents. Trained on real-world multimodal EMS datasets, EMSNet simultaneously supports up to five critical EMS tasks with superior accuracy compared to state-of-the-art unimodal baselines. Built on top of PyTorch, EMSServe introduces a modality-aware model splitter and a feature caching mechanism, achieving adaptive and efficient inference across heterogeneous hardware while addressing the challenge of asynchronous modality arrival in the field. By optimizing multimodal inference execution in EMS scenarios, EMSServe achieves 1.9x -- 11.7x speedup over direct PyTorch multimodal inference. A user study evaluation with six professional EMTs demonstrates that EMSGlass enhances real-time situational awareness, decision-making speed, and operational efficiency through intuitive on-glass interaction. In addition, qualitative insights from the user study provide actionable directions for extending EMSGlass toward next-generation AI-enabled EMS systems, bridging multimodal intelligence with real-world emergency response workflows.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Michal Levin, Itzik Klein
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Low-cost micro-electromechanical accelerometers are widely used in navigation, robotics, and consumer devices for motion sensing and position estimation. However, their performance is often degraded by bias errors. To eliminate deterministic bias terms a calibration procedure is applied under stationary conditions. It requires accelerom- eter leveling or complex orientation-dependent calibration procedures. To overcome those requirements, in this paper we present a model-free learning-based calibration method that estimates accelerometer bias under stationary conditions, without requiring knowledge of the sensor orientation and without the need to rotate the sensors. The proposed approach provides a fast, practical, and scalable solution suitable for rapid field deployment. Experimental validation on a 13.39-hour dataset collected from six accelerometers shows that the proposed method consistently achieves error levels more than 52% lower than traditional techniques. On a broader scale, this work contributes to the advancement of accurate calibration methods in orientation-free scenarios. As a consequence, it improves the reliability of low-cost inertial sensors in diverse scientific and industrial applications and eliminates the need for leveled calibration.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohit Meena, Yash Punjabi, Abhishek A, Vishal Sharma, Mahesh Chandran
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have emerged as powerful tools for learning over graph-structured data, yet recent studies have shown that their performance gains are beginning to plateau. In many cases, well-established models such as GCN and GAT, when appropriately tuned, can match or even exceed the performance of more complex, state-of-the-art architectures. This trend highlights a key limitation in the current landscape: the difficulty of selecting the most suitable model for a given graph task or dataset. To address this, we propose Self-Adaptive Graph Mixture of Models (SAGMM), a modular and practical framework that learns to automatically select and combine the most appropriate GNN models from a diverse pool of architectures. Unlike prior mixture-of-experts approaches that rely on variations of a single base model, SAGMM leverages architectural diversity and a topology-aware attention gating mechanism to adaptively assign experts to each node based on the structure of the input graph. To improve efficiency, SAGMM includes a pruning mechanism that reduces the number of active experts during training and inference without compromising performance. We also explore a training-efficient variant in which expert models are pretrained and frozen, and only the gating and task-specific layers are trained. We evaluate SAGMM on 16 benchmark datasets covering node classification, graph classification, regression, and link prediction tasks, and demonstrate that it consistently outperforms or matches leading GNN baselines and prior mixture-based methods, offering a robust and adaptive solution for real-world graph learning.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vladimรญr Macko, Vladimรญr Boลพa
Natural Language Processing
ๆฅ็ๆ่ฆ
Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low and unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning provided only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at sparsity 50%, MACKO is the first approach with significant 1.5x memory reduction and 1.2-1.5x speedup over dense representation. Speedups over other SpMV baselines: 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to sparsity 50%, it delivers 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now justified in real-world LLM workloads.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Duo Yi
Multimodal
ๆฅ็ๆ่ฆ
Online decision systems routinely operate under delayed feedback and order-sensitive (noncommutative) dynamics: actions affect which observations arrive, and in what sequence. Taking a Bregman divergence $D_ฮฆ$ as the loss benchmark, we prove that the excess benchmark loss admits a structured lower bound $L \ge L_{\mathrm{ideal}} + g_1(ฮป) + g_2(\varepsilon_\star) + g_{12}(ฮป,\varepsilon_\star) - D_{\mathrm{ncx}}$, where $g_1$ and $g_2$ are calibrated penalties for latency and order-sensitivity, $g_{12}$ captures their geometric interaction, and $D_{\mathrm{ncx}}\ge 0$ is a nonconvexity/approximation penalty that vanishes under convex Legendre assumptions. We extend this inequality to prox-regular and weakly convex settings, obtaining robust guarantees beyond the convex case. We also give an operational recipe for estimating and monitoring the four terms via simple $2\times 2$ randomized experiments and streaming diagnostics (effective sample size, clipping rate, interaction heatmaps). The framework packages heterogeneous latency, noncommutativity, and implementation-gap effects into a single interpretable lower-bound statement that can be stress-tested and tuned in real-world systems.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Akira Tamamori
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Kernel-based learning methods can dramatically increase the storage capacity of Hopfield networks, yet the dynamical mechanism behind this enhancement remains poorly understood. We address this gap by conducting a geometric analysis of the network's energy landscape. We introduce a novel metric, ``Pinnacle Sharpness,'' to quantify the local stability of attractors. By systematically varying the kernel width and storage load, we uncover a rich phase diagram of attractor shapes. Our central finding is the emergence of a ``ridge of optimization,'' where the network maximizes attractor stability under challenging high-load and global-kernel conditions. Through a theoretical decomposition of the landscape gradient into a direct ``driving'' force and an indirect ``feedback'' force, we reveal the origin of this phenomenon. The optimization ridge corresponds to a regime of strong anti-correlation between the two forces, where the direct force, amplified by the high storage load, dominates the opposing collective feedback force. This demonstrates a sophisticated self-organization mechanism: the network adaptively harnesses inter-pattern interactions as a cooperative feedback control system to sculpt a robust energy landscape. Our findings provide a new physical picture for the stability of high-capacity associative memories and offer principles for their design.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Nikita Neveditsin, Pawan Lingras, Salil Patil, Swarup Patil, Vijay Mago
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Digitization of medical records often relies on smartphone photographs of printed reports, producing images degraded by blur, shadows, and other noise. Conventional OCR systems, optimized for clean scans, perform poorly under such real-world conditions. This study evaluates compact multimodal language models as privacy-preserving alternatives for transcribing noisy clinical documents. Using obstetric ultrasound reports written in regionally inflected medical English common to Indian healthcare settings, we compare eight systems in terms of transcription accuracy, noise sensitivity, numeric accuracy, and computational efficiency. Compact multimodal models consistently outperform both classical and neural OCR pipelines. Despite higher computational costs, their robustness and linguistic adaptability position them as viable candidates for on-premises healthcare digitization.
๐
2025-11-17
๐ ArXiv Preprint (cs.ET)
๐ฅ Gautam Nagarajan, Omir Kumar, Sudarsun Santhiappan
Natural Language Processing
ๆฅ็ๆ่ฆ
All citizens of a country are affected by the laws and policies introduced by their government. These laws and policies serve essential functions for citizens. Such as granting them certain rights or imposing specific obligations. However, these documents are often lengthy, complex, and difficult to navigate, making it challenging for citizens to locate and understand relevant information. This work presents PolicyBot, a retrieval-augmented generation (RAG) system designed to answer user queries over policy documents with a focus on transparency and reproducibility. The system combines domain-specific semantic chunking, multilingual dense embeddings, multi-stage retrieval with reranking, and source-aware generation to provide responses grounded in the original documents. We implemented citation tracing to reduce hallucinations and improve user trust, and evaluated alternative retrieval and generation configurations to identify effective design choices. The end-to-end pipeline is built entirely with open-source tools, enabling easy adaptation to other domains requiring document-grounded question answering. This work highlights design considerations, practical challenges, and lessons learned in deploying trustworthy RAG systems for governance-related contexts.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Allaa Boutaleb, Bernd Amann, Rafael Angarita, Hubert Naacke
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Open-domain question answering over datalakes requires retrieving and composing information from multiple tables, a challenging subtask that demands semantic relevance and structural coherence (e.g., joinability). While exact optimization methods like Mixed-Integer Programming (MIP) can ensure coherence, their computational complexity is often prohibitive. Conversely, simpler greedy heuristics that optimize for query coverage alone often fail to find these coherent, joinable sets. This paper frames multi-table retrieval as an iterative search process, arguing this approach offers advantages in scalability, interpretability, and flexibility. We propose a general framework and a concrete instantiation: a fast, effective Greedy Join-Aware Retrieval algorithm that holistically balances relevance, coverage, and joinability. Experiments across 5 NL2SQL benchmarks demonstrate that our iterative method achieves competitive retrieval performance compared to the MIP-based approach while being 4-400x faster depending on the benchmark and search space settings. This work highlights the potential of iterative heuristics for practical, scalable, and composition-aware retrieval.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Wanqing Cui, Wei Huang, Yazhi Guo, Yibo Hu, Meiguang Jin et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Visual document retrieval requires understanding heterogeneous and multi-modal content to satisfy information needs. Recent advances use screenshot-based document encoding with fine-grained late interaction, significantly improving retrieval performance. However, retrievers are still trained with coarse global relevance labels, without revealing which regions support the match. As a result, retrievers tend to rely on surface-level cues and struggle to capture implicit semantic connections, hindering their ability to handle non-extractive queries. To alleviate this problem, we propose a \textbf{A}ttention-\textbf{G}rounded \textbf{RE}triever \textbf{E}nhancement (AGREE) framework. AGREE leverages cross-modal attention from multimodal large language models as proxy local supervision to guide the identification of relevant document regions. During training, AGREE combines local signals with the global signals to jointly optimize the retriever, enabling it to learn not only whether documents match, but also which content drives relevance. Experiments on the challenging ViDoRe V2 benchmark show that AGREE significantly outperforms the global-supervision-only baseline. Quantitative and qualitative analyses further demonstrate that AGREE promotes deeper alignment between query terms and document regions, moving beyond surface-level matching toward more accurate and interpretable retrieval. Our code is available at: https://anonymous.4open.science/r/AGREE-2025.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zhipeng Ma, Bo Nรธrregaard Jรธrgensen, Zheng Grace Ma
ๆฅ็ๆ่ฆ
Improving energy efficiency in industrial foundry processes is a critical challenge, as these operations are highly energy-intensive and marked by complex interdependencies among process variables. Correlation-based analyses often fail to distinguish true causal drivers from spurious associations, limiting their usefulness for decision-making. This paper applies a time-series causal inference framework to identify the operational factors that directly affect energy efficiency in induction furnace melting. Using production data from a Danish foundry, the study integrates time-series clustering to segment melting cycles into distinct operational modes with the PCMCI+ algorithm, a state-of-the-art causal discovery method, to uncover cause-effect relationships within each mode. Across clusters, robust causal relations among energy consumption, furnace temperature, and material weight define the core drivers of efficiency, while voltage consistently influences cooling water temperature with a delayed response. Cluster-specific differences further distinguish operational regimes: efficient clusters are characterized by stable causal structures, whereas inefficient ones exhibit reinforcing feedback loops and atypical dependencies. The contributions of this study are twofold. First, it introduces an integrated clustering-causal inference pipeline as a methodological innovation for analyzing energy-intensive processes. Second, it provides actionable insights that enable foundry operators to optimize performance, reduce energy consumption, and lower emissions.
๐
2025-11-17
๐ ArXiv Preprint (cs.SE)
๐ฅ Dmitry Moskalev
Machine Learning
ๆฅ็ๆ่ฆ
Exploring relationships across data sources is a crucial optimization for entities recognition. Since databases can store big amount of information with synthetic and organic data, serving all quantity of objects correctly is an important task to deal with. However, the decision of how to construct entity relationship model is associated with human factor. In this paper, we present flow-oriented entity-relationship tool. This is first and unique end-to-end solution that eliminates routine and resource-intensive problems of processing, creating and visualizing both of explicit and implicit dependencies for prominent SQL dialects on-the-fly. Once launched, FLOWER automatically detects built-in constraints and starting to create own correct and necessary one using dynamic sampling and robust data analysis techniques. This approach applies to improve entity-relationship model and data storytelling to better understand the foundation of data and get unseen insights from DB sources using SQL or natural language. Evaluated on state-of-the-art STATS benchmark, experiments show that FLOWER is superior to reservoir sampling by 2.4x for distribution representation and 2.6x for constraint learning with 2.15x acceleration. For data storytelling, our tool archives 1.19x for accuracy enhance with 1.86x context decrease compare to LLM. Presented tool is also support 23 languages and compatible with both of CPU and GPU. Those results show that FLOWER can manage with real-world data a way better to ensure with quality, scalability and applicability for different use-cases.
๐
2025-11-17
๐ ArXiv Preprint (cs.SE)
๐ฅ Rufeng Chen, Shuaishuai Jiang, Jiyun Shen, AJung Moon, Lili Wei
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing research has primarily focused on GenAI's ability to complete educational tasks and its impact on student performance, often overlooking its effects on knowledge gains. In this study, we investigate how GenAI assistance compares to conventional online resources in supporting knowledge gains across different proficiency levels. We conducted a controlled user experiment with 24 undergraduate students of two different levels of programming experience (beginner, intermediate) to examine how students interact with ChatGPT while solving programming tasks. We analyzed task performance, conceptual understanding, and interaction behaviors. Our findings reveal that generating complete solutions with GenAI significantly improves task performance, especially for beginners, but does not consistently result in knowledge gains. Importantly, usage strategies differ by experience: beginners tend to rely heavily on GenAI toward task completion often without knowledge gain in the process, while intermediates adopt more selective approaches. We find that both over-reliance and minimal use result in weaker knowledge gains overall. Based on our results, we call on students and educators to adopt GenAI as a learning rather than a problem solving tool. Our study highlights the urgent need for guidance when integrating GenAI into programming education to foster deeper understanding.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Hao Hu, Yifan Feng, Ruoxue Li, Rundong Xue, Xingliang Hou et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) enhances the response quality and domain-specific performance of large language models (LLMs) by incorporating external knowledge to combat hallucinations. In recent research, graph structures have been integrated into RAG to enhance the capture of semantic relations between entities. However, it primarily focuses on low-order pairwise entity relations, limiting the high-order associations among multiple entities. Hypergraph-enhanced approaches address this limitation by modeling multi-entity interactions via hyperedges, but they are typically constrained to inter-chunk entity-level representations, overlooking the global thematic organization and alignment across chunks. Drawing inspiration from the top-down cognitive process of human reasoning, we propose a theme-aligned dual-hypergraph RAG framework (Cog-RAG) that uses a theme hypergraph to capture inter-chunk thematic structure and an entity hypergraph to model high-order semantic relations. Furthermore, we design a cognitive-inspired two-stage retrieval strategy that first activates query-relevant thematic content from the theme hypergraph, and then guides fine-grained recall and diffusion in the entity hypergraph, achieving semantic alignment and consistent generation from global themes to local details. Our extensive experiments demonstrate that Cog-RAG significantly outperforms existing state-of-the-art baseline approaches.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Diego Ortego, Marlon Rodrรญguez, Mario Almagro, Kunal Dahiya, David Jimรฉnez et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a few billion-size decoder can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoder in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image enhanced versions validate our proposals' effectiveness, surpassing previous state-of-the-art by up to +8.21\% in P@1 on the largest dataset. ViXML's code is available at https://github.com/DiegoOrtego/vixml.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zhaoxin Shen, Dan Wu
ๆฅ็ๆ่ฆ
To leverage user behavior data from the Internet more effectively in recommender systems, this paper proposes a novel collaborative filtering (CF) method called Local Collaborative Filtering (LCF). LCF utilizes local similarities among users and integrates their data using the law of large numbers (LLN), thereby improving the utilization of user behavior data. Experiments are conducted on the Steam game dataset, and the results of LCF align with real-world needs.
๐
2025-11-17
๐ AAAI 2026
๐ฅ Hao Long, Silin Zhou, Lisi Chen, Shuo Shang
ๆฅ็ๆ่ฆ
Recent learning-based methods have reduced the computational complexity of traditional trajectory similarity computation, but state-of-the-art (SOTA) methods still fail to leverage the comprehensive spectrum of trajectory information for similarity modeling. To tackle this problem, we propose \textbf{RePo}, a novel method that jointly encodes \textbf{Re}gion-wise and \textbf{Po}int-wise features to capture both spatial context and fine-grained moving patterns. For region-wise representation, the GPS trajectories are first mapped to grid sequences, and spatial context are captured by structural features and semantic context enriched by visual features. For point-wise representation, three lightweight expert networks extract local, correlation, and continuous movement patterns from dense GPS sequences. Then, a router network adaptively fuses the learned point-wise features, which are subsequently combined with region-wise features using cross-attention to produce the final trajectory embedding. To train RePo, we adopt a contrastive loss with hard negative samples to provide similarity ranking supervision. Experiment results show that RePo achieves an average accuracy improvement of 22.2\% over SOTA baselines across all evaluation metrics.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhenghua Li, Hang Chen, Zihao Sun, Kai Li, Xiaolin Hu
ๆฅ็ๆ่ฆ
Accurate segmentation of neural structures in Electron Microscopy (EM) images is paramount for neuroscience. However, this task is challenged by intricate morphologies, low signal-to-noise ratios, and scarce annotations, limiting the accuracy and generalization of existing methods. To address these challenges, we seek to leverage the priors learned by visual foundation models on a vast amount of natural images to better tackle this task. Specifically, we propose a novel framework that can effectively transfer knowledge from Segment Anything 2 (SAM2), which is pre-trained on natural images, to the EM domain. We first use SAM2 to extract powerful, general-purpose features. To bridge the domain gap, we introduce a Feature-Guided Attention module that leverages semantic cues from SAM2 to guide a lightweight encoder, the Fine-Grained Encoder (FGE), in focusing on these challenging regions. Finally, a dual-affinity decoder generates both coarse and refined affinity maps. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art (SOTA) approaches with the SAM2 weights frozen. Upon further fine-tuning on EM data, our method significantly outperforms existing SOTA methods. This study validates that transferring representations pre-trained on natural images, when combined with targeted domain-adaptive guidance, can effectively address the specific challenges in neuron segmentation.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Satyanarayan Pati
ๆฅ็ๆ่ฆ
Dense retrieval models have become a standard for state-of-the-art information retrieval. However, their high-dimensional, high-precision (float32) vector embeddings create significant storage and memory challenges for real-world deployment. To address this, we conduct a rigorous empirical study on the BEIR SciFact benchmark, evaluating the trade-offs between two primary compression strategies: (1) Dimensionality Reduction via deep Autoencoders (AE), reducing original 384-dim vectors to latent spaces from 384 down to 12, and (2) Precision Reduction via Quantization (float16, int8, and binary). We systematically compare each method by measuring the "performance loss" (or gain) relative to a float32 baseline across a full suite of retrieval metrics (NDCG, MAP, MRR, Recall, Precision) at various k cutoffs. Our results show that int8 scalar quantization provides the most effective "sweet spot," achieving a 4x compression with a negligible [~1-2%] drop in nDCG@10. In contrast, Autoencoders show a graceful degradation but suffer a more significant performance loss at equivalent 4x compression ratios (AE-96). binary quantization was found to be unsuitable for this task due to catastrophic performance drops. This work provides a practical guide for deploying efficient, high-performance retrieval systems.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Miaomiao Cai, Min Hou, Lei Chen, Le Wu, Haoyue Bai et al.
ๆฅ็ๆ่ฆ
Collaborative Filtering~(CF) plays a crucial role in modern recommender systems, leveraging historical user-item interactions to provide personalized suggestions. However, CF-based methods often encounter biases due to imbalances in training data. This phenomenon makes CF-based methods tend to prioritize recommending popular items and performing unsatisfactorily on inactive users. Existing works address this issue by rebalancing training samples, reranking recommendation results, or making the modeling process robust to the bias. Despite their effectiveness, these approaches can compromise accuracy or be sensitive to weighting strategies, making them challenging to train. In this paper, we deeply analyze the causes and effects of the biases and propose a framework to alleviate biases in recommendation from the perspective of representation distribution, namely Group-Alignment and Global-Uniformity Enhanced Representation Learning for Debiasing Recommendation (AURL). Specifically, we identify two significant problems in the representation distribution of users and items, namely group-discrepancy and global-collapse. These two problems directly lead to biases in the recommendation results. To this end, we propose two simple but effective regularizers in the representation space, respectively named group-alignment and global-uniformity. The goal of group-alignment is to bring the representation distribution of long-tail entities closer to that of popular entities, while global-uniformity aims to preserve the information of entities as much as possible by evenly distributing representations. Our method directly optimizes both the group-alignment and global-uniformity regularization terms to mitigate recommendation biases. Extensive experiments on three real datasets and various recommendation backbones verify the superiority of our proposed framework.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jaehyung Lim, Wonbin Kweon, Woojoo Kim, Junyoung Kim, Dongha Kim et al.
ๆฅ็ๆ่ฆ
Federated Recommendation (FedRec) has emerged as a key paradigm for building privacy-preserving recommender systems. However, existing FedRec models face a critical dilemma: memory-efficient single-knowledge models suffer from a suboptimal knowledge replacement practice that discards valuable personalization, while high-performance dual-knowledge models are often too memory-intensive for practical on-device deployment. We propose Federated Recommendation with Knowledge Guidance (FedRKG), a model-agnostic framework that resolves this dilemma. The core principle, Knowledge Guidance, avoids full replacement and instead fuses global knowledge into preserved local embeddings, attaining the personalization benefits of dual-knowledge within a single-knowledge memory footprint. Furthermore, we introduce Adaptive Guidance, a fine-grained mechanism that dynamically modulates the intensity of this guidance for each user-item interaction, overcoming the limitations of static fusion methods. Extensive experiments on benchmark datasets demonstrate that FedRKG significantly outperforms state-of-the-art methods, validating the effectiveness of our approach. The code is available at https://github.com/Jaehyung-Lim/fedrkg.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Bokang Fu, Jiahao Wang, Xiaojing Liu, Yuli Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
In recent years, large language models (LLMs) have excelled in language understanding and generation, powering advanced dialogue and recommendation systems. However, a significant limitation persists: these systems often model user preferences statically, failing to capture the dynamic and sequential nature of interactive behaviors. The sequence of a user's historical questions provides a rich, implicit signal of evolving interests and cognitive patterns, yet leveraging this temporal data for predictive tasks remains challenging due to the inherent disconnect between language modeling and behavioral sequence modeling.
To bridge this gap, we propose a Collaborative Filtering-enhanced Question Prediction (CFQP) framework. CFQP dynamically models evolving user-question interactions by integrating personalized memory modules with graph-based preference propagation. This dual mechanism allows the system to adaptively learn from user-specific histories while refining predictions through collaborative signals from similar users. Experimental results demonstrate that our approach effectively generates agents that mimic real-user questioning patterns, highlighting its potential for building proactive and adaptive dialogue systems.
๐
2025-11-17
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Hao Jiang, Guoquan Wang, Sheng Yu, Yang Zeng, Wencong Zeng et al.
Machine Learning
ๆฅ็ๆ่ฆ
Local-life recommendation have witnessed rapid growth, providing users with convenient access to daily essentials. However, this domain faces two key challenges: (1) spatial constraints, driven by the requirements of the local-life scenario, where items are usually shown only to users within a limited geographic area, indirectly reducing their exposure probability; and (2) long-tail sparsity, where few popular items dominate user interactions, while many high-quality long-tail items are largely overlooked due to imbalanced interaction opportunities. Existing methods typically adopt a user-centric perspective, such as modeling spatial user preferences or enhancing long-tail representations with collaborative filtering signals. However, we argue that an item-centric perspective is more suitable for this domain, focusing on enhancing long-tail items representation that align with the spatially-constrained characteristics of local lifestyle services. To tackle this issue, we propose ReST, a Plug-And-Play Spatially-Constrained Representation Enhancement Framework for Long-Tail Local-Life Recommendation. Specifically, we first introduce a Meta ID Warm-up Network, which initializes fundamental ID representations by injecting their basic attribute-level semantic information. Subsequently, we propose a novel Spatially-Constrained ID Representation Enhancement Network (SIDENet) based on contrastive learning, which incorporates two efficient strategies: a spatially-constrained hard sampling strategy and a dynamic representation alignment strategy. This design adaptively identifies weak ID representations based on their attribute-level information during training. It additionally enhances them by capturing latent item relationships within the spatially-constrained characteristics of local lifestyle services, while preserving compatibility with popular items.
๐
2025-11-17
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhi Kou, Xiang-Rong Sheng, Shuguang Han, Zhishan Zhao, Yueyao Cheng et al.
Machine Learning
ๆฅ็ๆ่ฆ
In industrial recommendation systems, pre-ranking models based on deep neural networks (DNNs) commonly adopt a sequential execution framework: feature fetching and model forward computation are triggered only after receiving candidates from the upstream retrieval stage. This design introduces inherent bottlenecks, including redundant computations of identical users/items and increased latency due to strictly sequential operations, which jointly constrain the model's capacity and system efficiency. To address these limitations, we propose the Asynchronous Inference Framework (AIF), a cost-effective computational architecture that decouples interaction-independent components, those operating within a single user or item, from real-time prediction. AIF reorganizes the model inference process by performing user-side computations in parallel with the retrieval stage and conducting item-side computations in a nearline manner. This means that interaction-independent components are calculated just once and completed before the real-time prediction phase of the pre-ranking stage. As a result, AIF enhances computational efficiency and reduces latency, freeing up resources to significantly improve the feature set and model architecture of interaction-independent components. Moreover, we delve into model design within the AIF framework, employing approximated methods for interaction-dependent components in online real-time predictions. By co-designing both the framework and the model, our solution achieves notable performance gains without significantly increasing computational and latency costs. This has enabled the successful deployment of AIF in the Taobao display advertising system.
๐
2025-11-17
๐ AAAI
๐ฅ Yu Hou, Won-Yong Shin
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language model (LLM)-based recommender systems have achieved high-quality performance by bridging the discrepancy between the item space and the language space through item tokenization. However, existing item tokenization methods typically require training separate models for each item domain, limiting generalization. Moreover, the diverse distributions and semantics across item domains make it difficult to construct a unified tokenization that preserves domain-specific information. To address these challenges, we propose UniTok, a Unified item Tokenization framework that integrates our own mixture-of-experts (MoE) architecture with a series of codebooks to convert items into discrete tokens, enabling scalable tokenization while preserving semantic information across multiple item domains. Specifically, items from different domains are first projected into a unified latent space through a shared encoder. They are then routed to domain-specific experts to capture the unique semantics, while a shared expert, which is always active, encodes common knowledge transferable across domains. Additionally, to mitigate semantic imbalance across domains, we present a mutual information calibration mechanism, which guides the model towards retaining similar levels of semantic information for each domain. Comprehensive experiments on wide-ranging real-world datasets demonstrate that the proposed UniTok framework is (a) highly effective: achieving up to 51.89% improvements over strong benchmarks, (b) theoretically sound: showing the analytical validity of our architectural design and optimization; and (c) highly generalizable: demonstrating robust performance across diverse domains without requiring per-domain retraining, a capability not supported by existing baselines.
๐
2025-11-17
๐ AAAI
๐ฅ Desheng Hu, Joachim Baumann, Aleksandra Urman, Elsa Lichtenegger, Robin Forsberg et al.
Robotics
ๆฅ็ๆ่ฆ
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
๐
2025-11-17
๐ This work has been published in Big Data & Society
๐ฅ Jacob Erickson
ๆฅ็ๆ่ฆ
Filter bubbles and echo chambers have received global attention from scholars, media organizations, and the general public. Filter bubbles have primarily been regarded as intrinsically negative, and many studies have sought to minimize their influence. The detrimental influence of filter bubbles is well-studied. Filter bubbles may, for example, create information silos, amplify misinformation, and promote hatred and extremism. However, comparatively few studies have considered the other side of the filter bubble; its protective benefits, particularly to marginalized communities and those living in countries with low levels of press freedom. Through a review of the literature on digital safe spaces and protective filter bubbles, this commentary suggests that there may be a need to rethink the filter bubble, and it proposes several areas for future research.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu et al.
Multimodal
ๆฅ็ๆ่ฆ
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jianglong Ye, Lai Wei, Guangqi Jiang, Changwei Jing, Xueyan Zou et al.
Robotics
ๆฅ็ๆ่ฆ
Human grasps can be roughly categorized into two types: power grasps and precision grasps. Precision grasping enables tool use and is believed to have influenced human evolution. Today's multi-fingered robotic hands are effective in power grasps, but for tasks requiring precision, parallel grippers are still more widely adopted. This contrast highlights a key limitation in current robotic hand design: the difficulty of achieving both stable power grasps and precise, fine-grained manipulation within a single, versatile system. In this work, we bridge this gap by jointly optimizing the control and hardware design of a multi-fingered dexterous hand, enabling both power and precision manipulation. Rather than redesigning the entire hand, we introduce a lightweight fingertip geometry modification, represent it as a contact plane, and jointly optimize its parameters along with the corresponding control. Our control strategy dynamically switches between power and precision manipulation and simplifies precision control into parallel thumb-index motions, which proves robust for sim-to-real transfer. On the design side, we leverage large-scale simulation to optimize the fingertip geometry using a differentiable neural-physics surrogate model. We validate our approach through extensive experiments in both sim-to-real and real-to-real settings. Our method achieves an 82.5% zero-shot success rate on unseen objects in sim-to-real precision grasping, and a 93.3% success rate in challenging real-world tasks involving bread pinching. These results demonstrate that our co-design framework can significantly enhance the fine-grained manipulation ability of multi-fingered hands without reducing their ability for power grasps. Our project page is at https://jianglongye.com/power-to-precision
๐
2025-11-17
๐ IROS 2025
๐ฅ Xiaoyu Liang, Ziang Liu, Kelvin Lin, Edward Gu, Ruolin Ye et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
We present OpenRoboCare, a multimodal dataset for robot caregiving, capturing expert occupational therapist demonstrations of Activities of Daily Living (ADLs). Caregiving tasks involve complex physical human-robot interactions, requiring precise perception under occlusions, safe physical contact, and long-horizon planning. While recent advances in robot learning from demonstrations have shown promise, there is a lack of a large-scale, diverse, and expert-driven dataset that captures real-world caregiving routines. To address this gap, we collect data from 21 occupational therapists performing 15 ADL tasks on two manikins. The dataset spans five modalities: RGB-D video, pose tracking, eye-gaze tracking, task and action annotations, and tactile sensing, providing rich multimodal insights into caregiver movement, attention, force application, and task execution strategies. We further analyze expert caregiving principles and strategies, offering insights to improve robot efficiency and task feasibility. Additionally, our evaluations demonstrate that OpenRoboCare presents challenges for state-of-the-art robot perception and human activity recognition methods, both critical for developing safe and adaptive assistive robots, highlighting the value of our contribution. See our website for additional visualizations: https://emprise.cs.cornell.edu/robo-care/.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu
Machine Learning
ๆฅ็ๆ่ฆ
3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.
๐
2025-11-17
๐ Accepted at the Workshop on Benefits of pErsonaliz
๐ฅ Vesna Poprcova, Iulia Lefter, Matthias Wieser, Martijn Warnier, Frances Brazier
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Social anxiety is a prevalent condition that affects interpersonal interactions and social functioning. Recent advances in artificial intelligence and social robotics offer new opportunities to examine social anxiety in the human-robot interaction context. Accurate detection of affective states and behaviours associated with social anxiety requires multimodal datasets, where each signal modality provides complementary insights into its manifestations. However, such datasets remain scarce, limiting progress in both research and applications. To address this, this paper presents a protocol for multimodal dataset collection designed to reflect social anxiety in a human-robot interaction context. The dataset will consist of synchronised audio, video, and physiological recordings acquired from at least 70 participants, grouped according to their level of social anxiety, as they engage in approximately 10-minute interactive Wizard-of-Oz role-play scenarios with the Furhat social robot under controlled experimental conditions. In addition to multimodal data, the dataset will be enriched with contextual data providing deeper insight into individual variability in social anxiety responses. This work can contribute to research on affect-adaptive human-robot interaction by providing support for robust multimodal detection of social anxiety.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Bingkun Huang, Yuhe Gong, Zewen Yang, Tianyu Ren, Luis Figueredo
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) approaches based on Markov Decision Processes (MDPs) are predominantly applied in the robot joint space, often relying on limited task-specific information and partial awareness of the 3D environment. In contrast, episodic RL has demonstrated advantages over traditional MDP-based methods in terms of trajectory consistency, task awareness, and overall performance in complex robotic tasks. Moreover, traditional step-wise and episodic RL methods often neglect the contact-rich information inherent in task-space manipulation, especially considering the contact-safety and robustness. In this work, contact-rich manipulation tasks are tackled using a task-space, energy-safe framework, where reliable and safe task-space trajectories are generated through the combination of Proximal Policy Optimization (PPO) and movement primitives. Furthermore, an energy-aware Cartesian Impedance Controller objective is incorporated within the proposed framework to ensure safe interactions between the robot and the environment. Our experimental results demonstrate that the proposed framework outperforms existing methods in handling tasks on various types of surfaces in 3D environments, achieving high success rates as well as smooth trajectories and energy-safe interactions.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Juntao Jian, Yi-Lin Wei, Chengjie Mou, Yuhao Lin, Xing Zhu et al.
RoboticsNatural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Task-oriented dexterous grasping holds broad application prospects in robotic manipulation and human-object interaction. However, most existing methods still struggle to generalize across diverse objects and task instructions, as they heavily rely on costly labeled data to ensure task-specific semantic alignment. In this study, we propose \textbf{ZeroDexGrasp}, a zero-shot task-oriented dexterous grasp synthesis framework integrating Multimodal Large Language Models with grasp refinement to generate human-like grasp poses that are well aligned with specific task objectives and object affordances. Specifically, ZeroDexGrasp employs prompt-based multi-stage semantic reasoning to infer initial grasp configurations and object contact information from task and object semantics, then exploits contact-guided grasp optimization to refine these poses for physical feasibility and task alignment. Experimental results demonstrate that ZeroDexGrasp enables high-quality zero-shot dexterous grasping on diverse unseen object categories and complex task requirements, advancing toward more generalizable and intelligent robotic grasping.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jonas Bode, Raphael Memmesheimer, Sven Behnke
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Acting in human environments is a crucial capability for general-purpose robots, necessitating a robust understanding of natural language and its application to physical tasks. This paper seeks to harness the capabilities of diffusion models within a visuomotor policy framework that merges visual and textual inputs to generate precise robotic trajectories. By employing reference demonstrations during training, the model learns to execute manipulation tasks specified through textual commands within the robot's immediate environment. The proposed research aims to extend an existing model by leveraging improved embeddings, and adapting techniques from diffusion models for image generation. We evaluate our methods on the CALVIN dataset, proving enhanced performance on various manipulation tasks and an increased long-horizon success rate when multiple tasks are executed in sequence. Our approach reinforces the usefulness of diffusion models and contributes towards general multitask manipulation.
๐
2025-11-17
๐ ArXiv Preprint (cs.LO)
๐ฅ Matt Luckcuck, Maike Schwammberger, Mengwei Xu
Robotics
ๆฅ็ๆ่ฆ
This EPTCS volume contains the papers from the Seventh International Workshop on Formal Methods for Autonomous Systems (FMAS 2025), which was held between the 17th and 19th of November 2025. The goal of the FMAS workshop series is to bring together leading researchers who are using formal methods to tackle the unique challenges that autonomous systems present, so that they can publish and discuss their work with a growing community of researchers. FMAS 2025 was co-located with the 20th International Conference on integrated Formal Methods (iFM'25), hosted by Inria Paris, France at the Inria Paris Center.
In total, FMAS 2025 received 16 submissions from researchers at institutions in: Canada, China, France, Germany, Ireland, Italy, Japan, the Netherlands, Portugal, Sweden, the United States of America, and the United Kingdom. Though we received fewer submissions than last year, we are encouraged to see the submissions being sent from a wide range of countries. Submissions come from both past and new FMAS authors, which shows us that the existing community appreciates the network that FMAS has built over the past 7 years, while new authors also show the FMAS community's great potential of growth.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Chiyun Noh, Sangwoo Jung, Hanjun Kim, Yafei Hu, Laura Herlant et al.
Computer Vision
ๆฅ็ๆ่ฆ
Deployment of legged robots for navigating challenging terrains (e.g., stairs, slopes, and unstructured environments) has gained increasing preference over wheel-based platforms. In such scenarios, accurate odometry estimation is a preliminary requirement for stable locomotion, localization, and mapping. Traditional proprioceptive approaches, which rely on leg kinematics sensor modalities and inertial sensing, suffer from irrepressible vertical drift caused by frequent contact impacts, foot slippage, and vibrations, particularly affected by inaccurate roll and pitch estimation. Existing methods incorporate exteroceptive sensors such as LiDAR or cameras. Further enhancement has been introduced by leveraging gravity vector estimation to add additional observations on roll and pitch, thereby increasing the accuracy of vertical pose estimation. However, these approaches tend to degrade in feature-sparse or repetitive scenes and are prone to errors from double-integrated IMU acceleration. To address these challenges, we propose GaRLILEO, a novel gravity-aligned continuous-time radar-leg-inertial odometry framework. GaRLILEO decouples velocity from the IMU by building a continuous-time ego-velocity spline from SoC radar Doppler and leg kinematics information, enabling seamless sensor fusion which mitigates odometry distortion. In addition, GaRLILEO can reliably capture accurate gravity vectors leveraging a novel soft S2-constrained gravity factor, improving vertical pose accuracy without relying on LiDAR or cameras. Evaluated on a self-collected real-world dataset with diverse indoor-outdoor trajectories, GaRLILEO demonstrates state-of-the-art accuracy, particularly in vertical odometry estimation on stairs and slopes. We open-source both our dataset and algorithm to foster further research in legged robot odometry and SLAM. https://garlileo.github.io/GaRLILEO
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Cheng Peng, Zhenzhe Zhang, Cheng Chi, Xiaobao Wei, Yanhao Zhang et al.
RoboticsNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Navigating to a specified object in an unknown environment is a fundamental yet challenging capability of embodied intelligence. However, current methods struggle to balance decision frequency with intelligence, resulting in decisions lacking foresight or discontinuous actions. In this work, we propose PIGEON: Point of Interest Guided Exploration for Object Navigation with VLM, maintaining a lightweight and semantically aligned snapshot memory during exploration as semantic input for the exploration strategy. We use a large Visual-Language Model (VLM), named PIGEON-VL, to select Points of Interest (PoI) formed during exploration and then employ a lower-level planner for action output, increasing the decision frequency. Additionally, this PoI-based decision-making enables the generation of Reinforcement Learning with Verifiable Reward (RLVR) data suitable for simulators. Experiments on classic object navigation benchmarks demonstrate that our zero-shot transfer method achieves state-of-the-art performance, while RLVR further enhances the model's semantic guidance capabilities, enabling deep reasoning during real-time navigation.
๐
2025-11-17
๐ This paper has been accepted by IEEE SII 2026
๐ฅ Osama Al Sheikh Ali, Sotiris Koutsoftas, Ze Zhang, Knut Akesson, Emmanuel Dean
Robotics
ๆฅ็ๆ่ฆ
This paper presents an integrated navigation framework for Autonomous Mobile Robots (AMRs) that unifies environment representation, trajectory generation, and Model Predictive Control (MPC). The proposed approach incorporates a quadtree-based method to generate structured, axis-aligned collision-free regions from occupancy maps. These regions serve as both a basis for developing safe corridors and as linear constraints within the MPC formulation, enabling efficient and reliable navigation without requiring direct obstacle encoding. The complete pipeline combines safe-area extraction, connectivity graph construction, trajectory generation, and B-spline smoothing into one coherent system. Experimental results demonstrate consistent success and superior performance compared to baseline approaches across complex environments.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Trevor Exley, Anderson Brazil Nardin, Petr Trunin, Diana Cafiso, Lucia Beccai
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This work introduces the Monolithic Unit (MU), an actuator-lattice-sensor building block for soft robotics. The MU integrates pneumatic actuation, a compliant lattice envelope, and candidate sites for optical waveguide sensing into a single printed body. In order to study reproducibility and scalability, a parametric design framework establishes deterministic rules linking actuator chamber dimensions to lattice unit cell size. Experimental homogenization of lattice specimens provides effective material properties for finite element simulation. Within this simulation environment, sensor placement is treated as a discrete optimization problem, where a finite set of candidate waveguide paths derived from lattice nodes is evaluated by introducing local stiffening, and the configuration minimizing deviation from baseline mechanical response is selected. Optimized models are fabricated and experimentally characterized, validating the preservation of mechanical performance while enabling embedded sensing. The workflow is further extended to scaled units and a two-finger gripper, demonstrating generality of the MU concept. This approach advances monolithic soft robotic design by combining reproducible co-design rules with simulation-informed sensor integration.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Xuecheng Chen, Jingao Xu, Wenhua Ding, Haoyang Wang, Xinyu Luo et al.
ๆฅ็ๆ่ฆ
As drone-based applications proliferate, paramount contactless sensing of airborne drones from the ground becomes indispensable. This work demonstrates concentrating on propeller rotational speed will substantially improve drone sensing performance and proposes an event-camera-based solution, \sysname. \sysname features two components: \textit{Count Every Rotation} achieves accurate, real-time propeller speed estimation by mitigating ultra-high sensitivity of event cameras to environmental noise. \textit{Every Rotation Counts} leverages these speeds to infer both internal and external drone dynamics. Extensive evaluations in real-world drone delivery scenarios show that \sysname achieves a sensing latency of 3$ms$ and a rotational speed estimation error of merely 0.23\%. Additionally, \sysname infers drone flight commands with 96.5\% precision and improves drone tracking accuracy by over 22\% when combined with other sensing modalities. \textit{ Demo: {\color{blue}https://eventpro25.github.io/EventPro/.} }
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Guy Damari, Itzik Klein
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous underwater vehicles rely on precise navigation systems that combine the inertial navigation system and the Doppler velocity log for successful missions in challenging environments where satellite navigation is unavailable. The effectiveness of this integration critically depends on accurate alignment between the sensor reference frames. Standard model-based alignment methods between these sensor systems suffer from lengthy convergence times, dependence on prescribed motion patterns, and reliance on external aiding sensors, significantly limiting operational flexibility. To address these limitations, this paper presents ResAlignNet, a data-driven approach using the 1D ResNet-18 architecture that transforms the alignment problem into deep neural network optimization, operating as an in-situ solution that requires only sensors on board without external positioning aids or complex vehicle maneuvers, while achieving rapid convergence in seconds. Additionally, the approach demonstrates the learning capabilities of Sim2Real transfer, enabling training in synthetic data while deploying in operational sensor measurements. Experimental validation using the Snapir autonomous underwater vehicle demonstrates that ResAlignNet achieves alignment accuracy within 0.8ยฐ using only 25 seconds of data collection, representing a 65\% reduction in convergence time compared to standard velocity-based methods. The trajectory-independent solution eliminates motion pattern requirements and enables immediate vehicle deployment without lengthy pre-mission procedures, advancing underwater navigation capabilities through robust sensor-agnostic alignment that scales across different operational scenarios and sensor specifications.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Michal Levin, Itzik Klein
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Low-cost micro-electromechanical accelerometers are widely used in navigation, robotics, and consumer devices for motion sensing and position estimation. However, their performance is often degraded by bias errors. To eliminate deterministic bias terms a calibration procedure is applied under stationary conditions. It requires accelerom- eter leveling or complex orientation-dependent calibration procedures. To overcome those requirements, in this paper we present a model-free learning-based calibration method that estimates accelerometer bias under stationary conditions, without requiring knowledge of the sensor orientation and without the need to rotate the sensors. The proposed approach provides a fast, practical, and scalable solution suitable for rapid field deployment. Experimental validation on a 13.39-hour dataset collected from six accelerometers shows that the proposed method consistently achieves error levels more than 52% lower than traditional techniques. On a broader scale, this work contributes to the advancement of accurate calibration methods in orientation-free scenarios. As a consequence, it improves the reliability of low-cost inertial sensors in diverse scientific and industrial applications and eliminates the need for leveled calibration.
๐
2025-11-17
๐ ICRA
๐ฅ Yong Li, Hui Cheng
Robotics
ๆฅ็ๆ่ฆ
Practical global path planning is critical for commercializing cleaning robots working in semi-structured environments. In the literature, global path planning methods for free space usually focus on path length and neglect the traffic rule constraints of the environments, which leads to high-frequency re-planning and increases collision risks. In contrast, those for structured environments are developed mainly by strictly complying with the road network representing the traffic rule constraints, which may result in an overlong path that hinders the overall navigation efficiency. This article proposes a general and systematic approach to improve global path planning performance in semi-structured environments. A unidirectional road network is built to represent the traffic constraints in semi-structured environments and a hybrid strategy is proposed to achieve a guaranteed planning result.Cutting across the road at the starting and the goal points are allowed to achieve a shorter path. Especially, a two-layer potential map is proposed to achieve a guaranteed performance when the starting and the goal points are in complex intersections. Comparative experiments are carried out to validate the effectiveness of the proposed method. Quantitative experimental results show that, compared with the state-of-art, the proposed method guarantees a much better balance between path length and the consistency with the road network.
๐
2025-11-17
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yan Gong, Jianli Lu, Yongsheng Gao, Jie Zhao, Xiaojuan Zhang et al.
RoboticsNatural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB-D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross-attention mechanisms and insufficiently model intra- and inter-modal feature relationships, resulting in imprecise feature alignment and limited discriminative representation. To address these challenges, we propose DiffPixelFormer, a differential pixel-aware Transformer for RGB-D indoor scene segmentation that simultaneously enhances intra-modal representations and models inter-modal interactions. At its core, the Intra-Inter Modal Interaction Block (IIMIB) captures intra-modal long-range dependencies via self-attention and models inter-modal interactions with the Differential-Shared Inter-Modal (DSIM) module to disentangle modality-specific and shared cues, enabling fine-grained, pixel-level cross-modal alignment. Furthermore, a dynamic fusion strategy balances modality contributions and fully exploits RGB-D information according to scene characteristics. Extensive experiments on the SUN RGB-D and NYUDv2 benchmarks demonstrate that DiffPixelFormer-L achieves mIoU scores of 54.28% and 59.95%, outperforming DFormer-L by 1.78% and 2.75%, respectively. Code is available at https://github.com/gongyan1/DiffPixelFormer.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Yong Li, Hui Cheng
Robotics
ๆฅ็ๆ่ฆ
Paths generated by A* and other graph-search-based planners are widely used in the robotic field. Due to the restricted node-expansion directions, the resulting paths are usually not the shortest. Besides, unnecessary heading changes, or zig-zag patterns, exist even when no obstacle is nearby, which is inconsistent with the human intuition that the path segments should be straight in wide-open space due to the absence of obstacles. This article puts forward a general and systematic post-processing algorithm for A* and other graph-search-based planners. The A* post-processing algorithm, called APP, is developed based on the costmap, which is widely used in commercial service robots. First, a bidirectional vertices reduction algorithm is proposed to tackle the asymm- etry of the path and the environments. During the forward and backward vertices reduction, a thorough shortcut strategy is put forward to improve the path-shortening performance and avoid unnecessary heading changes. Second, an iterative path perturbation algorithm is adopted to locally reduce the number of unnecessary heading changes and improve the path smooth- ness. Comparative experiments are then carried out to validate the superiority of the proposed method. Quantitative performance indexes show that APP outperforms the existing methods in planning time, path length as well as the number of unnecessary heading changes. Finally, field navigation experiments are carried out to verify the practicability of APP.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Miryeong Park, Dongjin Cho, Sanghyun Kim, Younggun Cho
Robotics
ๆฅ็ๆ่ฆ
Planetary exploration robots must navigate uneven terrain while building reliable maps for space missions. However, most existing methods incorporate traversability constraints but may not handle high uncertainty in elevation estimates near complex features like craters, do not consider exploration strategies for uncertainty reduction, and typically fail to address how elevation uncertainty affects navigation safety and map quality. To address the problems, we propose a framework integrating safe path generation, adaptive confidence updates, and confidence-aware exploration strategies. Using Kalman-based elevation estimation, our approach generates terrain traversability and confidence scores, then incorporates them into Graph-Based exploration Planner (GBP) to prioritize exploration of traversable low-confidence regions. We evaluate our framework through simulated lunar experiments using a novel low-confidence region ratio metric, achieving 69% uncertainty reduction compared to baseline GBP. In terms of mission success rate, our method achieves 100% while baseline GBP achieves 0%, demonstrating improvements in exploration safety and map reliability.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Siddarth Narasimhan, Matthew Lisondra, Haitong Wang, Goldie Nejat
Robotics
ๆฅ็ๆ่ฆ
The Instance Image Goal Navigation (IIN) problem requires mobile robots deployed in unknown environments to search for specific objects or people of interest using only a single reference goal image of the target. This problem can be especially challenging when: 1) the reference image is captured from an arbitrary viewpoint, and 2) the robot must operate with sparse-view scene reconstructions. In this paper, we address the IIN problem, by introducing SplatSearch, a novel architecture that leverages sparse-view 3D Gaussian Splatting (3DGS) reconstructions. SplatSearch renders multiple viewpoints around candidate objects using a sparse online 3DGS map, and uses a multi-view diffusion model to complete missing regions of the rendered images, enabling robust feature matching against the goal image. A novel frontier exploration policy is introduced which uses visual context from the synthesized viewpoints with semantic context from the goal image to evaluate frontier locations, allowing the robot to prioritize frontiers that are semantically and visually relevant to the goal image. Extensive experiments in photorealistic home and real-world environments validate the higher performance of SplatSearch against current state-of-the-art methods in terms of Success Rate and Success Path Length. An ablation study confirms the design choices of SplatSearch.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Chunyong Hu, Qi Luo, Jianyun Xu, Song Wang, Qiang Li et al.
Robotics
ๆฅ็ๆ่ฆ
In the realm of autonomous driving, accurately detecting surrounding obstacles is crucial for effective decision-making. Traditional methods primarily rely on 3D bounding boxes to represent these obstacles, which often fail to capture the complexity of irregularly shaped, real-world objects. To overcome these limitations, we present GUIDE, a novel framework that utilizes 3D Gaussians for instance detection and occupancy prediction. Unlike conventional occupancy prediction methods, GUIDE also offers robust tracking capabilities. Our framework employs a sparse representation strategy, using Gaussian-to-Voxel Splatting to provide fine-grained, instance-level occupancy data without the computational demands associated with dense voxel grids. Experimental validation on the nuScenes dataset demonstrates GUIDE's performance, with an instance occupancy mAP of 21.61, marking a 50\% improvement over existing methods, alongside competitive tracking capabilities. GUIDE establishes a new benchmark in autonomous perception systems, effectively combining precision with computational efficiency to better address the complexities of real-world driving environments.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Yingting Zhou, Wenbo Cui, Weiheng Liu, Guixing Chen, Haoran Li et al.
Robotics
ๆฅ็ๆ่ฆ
Transferring the depth-based end-to-end policy trained in simulation to physical robots can yield an efficient and robust grasping policy, yet sensor artifacts in real depth maps like voids and noise establish a significant sim2real gap that critically impedes policy transfer. Training-time strategies like procedural noise injection or learned mappings suffer from data inefficiency due to unrealistic noise simulation, which is often ineffective for grasping tasks that require fine manipulation or dependency on paired datasets heavily. Furthermore, leveraging foundation models to reduce the sim2real gap via intermediate representations fails to mitigate the domain shift fully and adds computational overhead during deployment. This work confronts dual challenges of data inefficiency and deployment complexity. We propose DiffuDepGrasp, a deploy-efficient sim2real framework enabling zero-shot transfer through simulation-exclusive policy training. Its core innovation, the Diffusion Depth Generator, synthesizes geometrically pristine simulation depth with learned sensor-realistic noise via two synergistic modules. The first Diffusion Depth Module leverages temporal geometric priors to enable sample-efficient training of a conditional diffusion model that captures complex sensor noise distributions, while the second Noise Grafting Module preserves metric accuracy during perceptual artifact injection. With only raw depth inputs during deployment, DiffuDepGrasp eliminates computational overhead and achieves a 95.7% average success rate on 12-object grasping with zero-shot transfer and strong generalization to unseen objects.Project website: https://diffudepgrasp.github.io/.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Yong Li, Yujun Huang, Yi Chen, Hui Cheng
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Differential-driven wheeled robots (DWR) represent the quintessential type of mobile robots and find extensive appli- cations across the robotic field. Most high-performance control approaches for DWR explicitly utilize the linear and angular velocities of the trajectory as control references. However, existing research on time-optimal path parameterization (TOPP) for mobile robots usually neglects the angular velocity and joint vel- ocity constraints, which can result in degraded control perfor- mance in practical applications. In this article, a systematic and practical TOPP algorithm named TOPP-DWR is proposed for DWR and other mobile robots. First, the non-uniform B-spline is adopted to represent the initial trajectory in the task space. Second, the piecewise-constant angular velocity, as well as joint velocity, linear velocity, and linear acceleration constraints, are incorporated into the TOPP problem. During the construction of the optimization problem, the aforementioned constraints are uniformly represented as linear velocity constraints. To boost the numerical computational efficiency, we introduce a slack variable to reformulate the problem into second-order-cone programming (SOCP). Subsequently, comparative experiments are conducted to validate the superiority of the proposed method. Quantitative performance indexes show that TOPP-DWR achieves TOPP while adhering to all constraints. Finally, field autonomous navigation experiments are carried out to validate the practicability of TOPP-DWR in real-world applications.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Jun Huo, Hongge Ru, Bo Yang, Xingjian Chen, Xi Li et al.
ๆฅ็ๆ่ฆ
Soft multi-axis force/torque sensors provide safe and precise force interaction. Capturing the complete degree-of-freedom of force is imperative for accurate force measurement with six-axis force/torque sensors. However, cross-axis coupling can lead to calibration issues and decreased accuracy. In this instance, developing a soft and accurate six-axis sensor is a challenging task. In this paper, a soft air-chamber type six-axis force/torque sensor with 16-channel barometers is introduced, which housed in hyper-elastic air chambers made of silicone rubber. Additionally, an effective decoupling method is proposed, based on a rigid-soft hierarchical structure, which reduces the six-axis decoupling problem to two three-axis decoupling problems. Finite element model simulation and experiments demonstrate the compatibility of the proposed approach with reality. The prototype's sensing performance is quantitatively measured in terms of static load response, dynamic load response and dynamic response characteristic. It possesses a measuring range of 50 N force and 1 Nm torque, and the average deviation, repeatability, non-linearity and hysteresis are 4.9$\%$, 2.7$\%$, 5.8$\%$ and 6.7$\%$, respectively. The results indicate that the prototype exhibits satisfactory sensing performance while maintaining its softness due to the presence of soft air chambers.
๐
2025-11-17
๐ ArXiv Preprint (Robotics)
๐ฅ Taiyi Su, Jian Zhu, Yaxuan Li, Chong Ma, Zitai Huang et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Embodied world models aim to predict and interact with the physical world through visual observations and actions. However, existing models struggle to accurately translate low-level actions (e.g., joint positions) into precise robotic movements in predicted frames, leading to inconsistencies with real-world physical interactions. To address these limitations, we propose MTV-World, an embodied world model that introduces Multi-view Trajectory-Video control for precise visuomotor prediction. Specifically, instead of directly using low-level actions for control, we employ trajectory videos obtained through camera intrinsic and extrinsic parameters and Cartesian-space transformation as control signals. However, projecting 3D raw actions onto 2D images inevitably causes a loss of spatial information, making a single view insufficient for accurate interaction modeling. To overcome this, we introduce a multi-view framework that compensates for spatial information loss and ensures high-consistency with physical world. MTV-World forecasts future frames based on multi-view trajectory videos as input and conditioning on an initial frame per view. Furthermore, to systematically evaluate both robotic motion precision and object interaction accuracy, we develop an auto-evaluation pipeline leveraging multimodal large models and referring video object segmentation models. To measure spatial consistency, we formulate it as an object location matching problem and adopt the Jaccard Index as the evaluation metric. Extensive experiments demonstrate that MTV-World achieves precise control execution and accurate physical interaction modeling in complex dual-arm scenarios.
๐
2025-11-17
๐ IROS
๐ฅ Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Yu Zheng et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., "how to interact"). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., "when to interact") is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at https://github.com/IRMVLab/EgoLoc.
๐
2025-11-17
๐ Presented at the "Workshop on Generative Modeling
๐ฅ Max M. Sun, Todd Murphey
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Generative model-based imitation learning methods have recently achieved strong results in learning high-complexity motor skills from human demonstrations. However, imitation learning of interactive policies that coordinate with humans in shared spaces without explicit communication remains challenging, due to the significantly higher behavioral complexity in multi-agent interactions compared to non-interactive tasks. In this work, we introduce a structured imitation learning framework for interactive policies by combining generative single-agent policy learning with a flexible yet expressive game-theoretic structure. Our method explicitly separates learning into two steps: first, we learn individual behavioral patterns from multi-agent demonstrations using standard imitation learning; then, we structurally learn inter-agent dependencies by solving an inverse game problem. Preliminary results in a synthetic 5-agent social navigation task show that our method significantly improves non-interactive policies and performs comparably to the ground truth interactive policy using only 50 demonstrations. These results highlight the potential of structured imitation learning in interactive settings.
๐
2025-11-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mengyao Gao, Chongming Gao, Haoyan Liu, Qingpeng Cai, Peng Jiang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advancements in large language model-based recommendation systems often represent items as text or semantic IDs and generate recommendations in an auto-regressive manner. However, due to the left-to-right greedy decoding strategy and the unidirectional logical flow, such methods often fail to produce globally optimal recommendations. In contrast, human reasoning does not follow a rigid left-to-right sequence. Instead, it often begins with keywords or intuitive insights, which are then refined and expanded. Inspired by this fact, we propose MindRec, a diffusion-driven coarse-to-fine generative paradigm that emulates human thought processes. Built upon a diffusion language model, MindRec departs from auto-regressive generation by leveraging a masked diffusion process to reconstruct items in a flexible, non-sequential manner. Particularly, our method first generates key tokens that reflect user preferences, and then expands them into the complete item, enabling adaptive and human-like generation. To further emulate the structured nature of human decision-making, we organize items into a hierarchical category tree. This structure guides the model to first produce the coarse-grained category and then progressively refine its selection through finer-grained subcategories before generating the specific item. To mitigate the local optimum problem inherent in greedy decoding, we design a novel beam search algorithm, Diffusion Beam Search, tailored for our mind-inspired generation paradigm. Experimental results demonstrate that MindRec yields a 9.5\% average improvement in top-1 accuracy over state-of-the-art methods, highlighting its potential to enhance recommendation performance. The implementation is available via https://github.com/Mr-Peach0301/MindRec.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Niranjan Chebrolu, Gerard Christopher Yeo, Kokil Jaidka
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components, to find a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable framework and new directions for the study of conversational AI.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ruiyu Wang, Yuzhang Xie, Xiao Hu, Carl Yang, Jiaying Lu
ๆฅ็ๆ่ฆ
Assessing journal impact is central to scholarly communication, yet existing open resources rarely capture how collaboration structures and artificial intelligence (AI) research jointly shape venue prestige in biomedicine. We present BioMedJImpact, a large-scale, biomedical-oriented dataset designed to advance journal-level analysis of scientific impact and AI engagement. Built from 1.74 million PubMed Central articles across 2,744 journals, BioMedJImpact integrates bibliometric indicators, collaboration features, and LLM-derived semantic indicators for AI engagement. Specifically, the AI engagement feature is extracted through a reproducible three-stage LLM pipeline that we propose. Using this dataset, we analyze how collaboration intensity and AI engagement jointly influence scientific impact across pre- and post-pandemic periods (2016-2019, 2020-2023). Two consistent trends emerge: journals with higher collaboration intensity, particularly those with larger and more diverse author teams, tend to achieve greater citation impact, and AI engagement has become an increasingly strong correlate of journal prestige, especially in quartile rankings. To further validate the three-stage LLM pipeline we proposed for deriving the AI engagement feature, we conduct human evaluation, confirming substantial agreement in AI relevance detection and consistent subfield classification. Together, these contributions demonstrate that BioMedJImpact serves as both a comprehensive dataset capturing the intersection of biomedicine and AI, and a validated methodological framework enabling scalable, content-aware scientometric analysis of scientific impact and innovation dynamics. Code is available at https://github.com/JonathanWry/BioMedJImpact.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hayden Moore, Asfahan Shah
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL, has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Thomas Rivasseau
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Current Large Language Model alignment research mostly focuses on improving model robustness against adversarial attacks and misbehavior by training on examples and prompting. Research has shown that LLM jailbreak probability increases with the size of the user input or conversation length. There is a lack of appropriate research into means of strengthening alignment which also scale with user input length. We propose interruptions as a possible solution to this problem. Interruptions are control sentences added to the user input approximately every x tokens for some arbitrary x. We suggest that this can be generalized to the Chain-of-Thought process to prevent scheming.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Noah Hong, Tao Hong
Natural Language Processing
ๆฅ็ๆ่ฆ
Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs), where new capabilities appear abruptly once models surpass critical thresholds of scale. Prior work, such as that of Wei et al., demonstrated these phenomena under model and data scaling, with transitions revealed after applying a log scale to training compute. In this work, we ask three complementary questions: (1) Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? (2) Can such transitions be detected directly in linear training space, rather than only after log rescaling? and (3) Can these transitions emerge at early stages of training? To investigate, we train a small GPT-style transformer on a character-level corpus and analyze the evolution of vocabulary usage throughout training. We track the average word length, the number of correct versus incorrect words, and shifts in vocabulary diversity. Building on these measures, we apply Poisson and sub-Poisson statistics to quantify how words connect and reorganize. This combined analysis reveals a distinct transition point during training. Notably, these transitions are not apparent in standard loss or validation curves, but become visible through our vocabulary- and statistics-based probes. Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, detectable directly in linear training space, and occurring surprisingly early as coherence emerges. This perspective provides new insight into the nonlinear dynamics of language model training and underscores the importance of tailored metrics for uncovering phase transition behaviors
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lea Hergert, Gรกbor Berend, Mario Szegedy, Gyorgy Turan, Mรกrk Jelasity
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries -- among the most fundamental forms of reasoning -- using tasks like ``Is apple an element of the set \{pear, plum, apple, raspberry\}?''. We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle, and unpredictable across all dimensions, suggesting that the models' ``understanding'' of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Christopher Cruz
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly deployed in multi-turn dialogue settings, but their behavior is still bottlenecked by fixed context windows and naive memory strategies. Replaying the full conversation at every turn is simple but expensive, while static summarization or recency-only heuristics often erase safety-critical user details. We present Adaptive Focus Memory (AFM), a dynamic context manager that assigns each past message one of three fidelity levels -- FULL, COMPRESSED, or PLACEHOLDER -- based on semantic similarity to the current query, half-life recency weighting, and importance classification. AFM packs messages chronologically under a strict token budget, preferring high fidelity for the most relevant turns while aiming to preserve a cheap trace of the dialogue. In a safety-oriented benchmark involving a user with a severe peanut allergy planning a trip to Thailand, AFM retains the allergy across both short and medium-length conversations, matches the safety performance of naive replay, and cuts average token usage by 66% relative to a replay baseline. We release a modular Python implementation of AFM designed for OpenAI-compatible APIs and offline operation, enabling practitioners to reduce inference cost without sacrificing safety or factual continuity in the evaluated scenario.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This binds their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To overcome this gap, we introduce \textbf{EvoSynth}, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state-of-the-art by achieving an 85.5\% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sina Rashidi, Hossein Sameti
Natural Language Processing
ๆฅ็ๆ่ฆ
Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves improvement of 4.6 ASR BLEU with the synthetic data over direct baselines. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao
Natural Language Processing
ๆฅ็ๆ่ฆ
Aligning Large Language Models (LLMs) to be faithful to new knowledge in complex, multi-hop reasoning tasks is a critical, yet unsolved, challenge. We find that SFT-based methods, e.g., Reason-KE, while state-of-the-art, suffer from a "faithfulness gap": they optimize for format mimicry rather than sound reasoning. This gap enables the LLM's powerful parametric priors to override new contextual facts, resulting in critical factual hallucinations (e.g., incorrectly reasoning "Houston" from "NASA" despite an explicit edit). To solve this core LLM alignment problem, we propose Reason-KE++, an SFT+RL framework that instills process-level faithfulness. Its core is a Stage-aware Reward mechanism that provides dense supervision for intermediate reasoning steps (e.g., Decomposition, Sub-answer Correctness). Crucially, we identify that naive outcome-only RL is a deceptive trap for LLM alignment: it collapses reasoning integrity (e.g., 19.00% Hop acc) while superficially boosting final accuracy. Our process-aware framework sets a new SOTA of 95.48% on MQUAKE-CF-3k (+5.28%), demonstrating that for complex tasks, aligning the reasoning process is essential for building trustworthy LLMs.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Maoqi Liu, Quan Fang, Yang Yang, Can Zhao, Kaiquan Cai
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Notice to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface-level tasks such as classification and named entity recognition, lacking deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference-rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high-quality dataset of 12,347 expert-annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi-agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt-engineering strategies and model-adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: https://github.com/Estrellajer/Knots.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu et al.
RoboticsComputer VisionMultimodal
ๆฅ็ๆ่ฆ
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodallity understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
๐
2025-11-16
๐ EMNLP
๐ฅ Oron Anschel, Alon Shoshan, Adam Botach, Shunit Haviv Hakimi, Asaf Gendler et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from the group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pu-Hai Yang, Heyan Huang, Heng-Da Xu, Fanshu Sun, Xian-Ling Mao et al.
Multimodal
ๆฅ็ๆ่ฆ
Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that is extended from MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hyeonji Kim, Sujeong Oh, Sanghack Lee
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning from human feedback (RLHF) is widely used to align large language models (LLMs) with human preferences. However, RLHF-trained reward models often exhibit length bias -- a systematic tendency to favor longer responses by conflating verbosity with quality. We propose a causal framework for analyzing and mitigating length bias in RLHF reward modeling. Central to our approach is a counterfactual data augmentation method that generates response pairs designed to isolate content quality from verbosity. These counterfactual examples are then used to train the reward model, enabling it to assess responses based on content quality independently of verbosity. Specifically, we construct (1) length-divergent pairs with similar content and (2) content-divergent pairs of similar length. Empirical evaluations show that our method reduces length bias in reward assignment and leads to more concise, content-focused outputs from the policy model. These findings demonstrate that the proposed approach effectively reduces length bias and improves the robustness and content sensitivity of reward modeling in RLHF pipelines.
๐
2025-11-16
๐ AAAI 2026
๐ฅ Lingyun Xiang, Chengfu Ou, Xu He, Zhongliang Yang, Yuling Liu
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Existing linguistic steganography methods primarily rely on content transformations to conceal secret messages. However, they often cause subtle yet looking-innocent deviations between normal and stego texts, posing potential security risks in real-world applications. To address this challenge, we propose a content-preserving linguistic steganography paradigm for perfectly secure covert communication without modifying the cover text. Based on this paradigm, we introduce CLstega (\textit{C}ontent-preserving \textit{L}inguistic \textit{stega}nography), a novel method that embeds secret messages through controllable distribution transformation. CLstega first applies an augmented masking strategy to locate and mask embedding positions, where MLM(masked language model)-predicted probability distributions are easily adjustable for transformation. Subsequently, a dynamic distribution steganographic coding strategy is designed to encode secret messages by deriving target distributions from the original probability distributions. To achieve this transformation, CLstega elaborately selects target words for embedding positions as labels to construct a masked sentence dataset, which is used to fine-tune the original MLM, producing a target MLM capable of directly extracting secret messages from the cover text. This approach ensures perfect security of secret messages while fully preserving the integrity of the original cover text. Experimental results show that CLstega can achieve a 100\% extraction success rate, and outperforms existing methods in security, effectively balancing embedding capacity and security.
๐
2025-11-16
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Sanchaita Hazra, Doeun Lee, Bodhisattwa Prasad Majumder, Sachin Kumar
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Language Models have seen expanding application across domains, yet their effectiveness as assistive tools for scientific writing -- an endeavor requiring precision, multimodal synthesis, and domain expertise -- remains insufficiently understood. We examine the potential of LLMs to support domain experts in scientific writing, with a focus on abstract composition. We design an incentivized randomized controlled trial with a hypothetical conference setup where participants with relevant expertise are split into an author and reviewer pool. Inspired by methods in behavioral science, our novel incentive structure encourages authors to edit the provided abstracts to an acceptable quality for a peer-reviewed submission. Our 2x2 between-subject design expands into two dimensions: the implicit source of the provided abstract and the disclosure of it. We find authors make most edits when editing human-written abstracts compared to AI-generated abstracts without source attribution, often guided by higher perceived readability in AI generation. Upon disclosure of source information, the volume of edits converges in both source treatments. Reviewer decisions remain unaffected by the source of the abstract, but bear a significant correlation with the number of edits made. Careful stylistic edits, especially in the case of AI-generated abstracts, in the presence of source information, improve the chance of acceptance. We find that AI-generated abstracts hold potential to reach comparable levels of acceptability to human-written ones with minimal revision, and that perceptions of AI authorship, rather than objective quality, drive much of the observed editing behavior. Our findings reverberate the significance of source disclosure in collaborative scientific writing.
๐
2025-11-16
๐ AAAI 2026
๐ฅ Jie Zhang, Bo Tang, Wanzi Shao, Wenqiang Wei, Jihao Zhao et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) improves large language models by retrieving external knowledge, often truncated into smaller chunks due to the input context window, which leads to information loss, resulting in response hallucinations and broken reasoning chains. Moreover, traditional RAG retrieves unstructured knowledge, introducing irrelevant details that hinder accurate reasoning. To address these issues, we propose TAdaRAG, a novel RAG framework for on-the-fly task-adaptive knowledge graph construction from external sources. Specifically, we design an intent-driven routing mechanism to a domain-specific extraction template, followed by supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, ensuring concise, coherent, and non-redundant knowledge integration. Evaluations on six public benchmarks and a real-world business benchmark (NowNewsQA) across three backbone models demonstrate that TAdaRAG outperforms existing methods across diverse domains and long-text tasks, highlighting its strong generalization and practical effectiveness.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Maria Tseytlin, Paul Roit, Omri Abend, Ido Dagan, Ayal Klein
ๆฅ็ๆ่ฆ
Decomposing sentences into fine-grained meaning units is increasingly used to model semantic alignment. While QA-based semantic approaches have shown effectiveness for representing predicate-argument relations, they have so far left noun-centered semantics largely unaddressed. We introduce QA-Noun, a QA-based framework for capturing noun-centered semantic relations. QA-Noun defines nine question templates that cover both explicit syntactical and implicit contextual roles for nouns, producing interpretable QA pairs that complement verbal QA-SRL. We release detailed guidelines, a dataset of over 2,000 annotated noun mentions, and a trained model integrated with QA-SRL to yield a unified decomposition of sentence meaning into individual, highly fine-grained, facts. Evaluation shows that QA-Noun achieves near-complete coverage of AMR's noun arguments while surfacing additional contextually implied relations, and that combining QA-Noun with QA-SRL yields over 130\% higher granularity than recent fact-based decomposition methods such as FactScore and DecompScore. QA-Noun thus complements the broader QA-based semantic framework, forming a comprehensive and scalable approach to fine-grained semantic decomposition for cross-text alignment.
๐
2025-11-16
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ JoonHo Lee, HyeonMin Cho, Jaewoong Yun, Hyunjae Lee, JunKyu Lee et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two component according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release the SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.
๐
2025-11-16
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Onkar Shelar, Travis Desell
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.
๐
2025-11-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chucheng Xiang, Ruchao Bao, Biyin Feng, Wenzheng Wu, Zhongyuan Liu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We present a novel framework for automated interior design that combines large language models (LLMs) with grid-based integer programming to jointly optimize room layout and furniture placement. Given a textual prompt, the LLM-driven agent workflow extracts structured design constraints related to room configurations and furniture arrangements. These constraints are encoded into a unified grid-based representation inspired by ``Modulor". Our formulation accounts for key design requirements, including corridor connectivity, room accessibility, spatial exclusivity, and user-specified preferences. To improve computational efficiency, we adopt a coarse-to-fine optimization strategy that begins with a low-resolution grid to solve a simplified problem and guides the solution at the full resolution. Experimental results across diverse scenarios demonstrate that our joint optimization approach significantly outperforms existing two-stage design pipelines in solution quality, and achieves notable computational efficiency through the coarse-to-fine strategy.
๐
2025-11-16
๐ AAAI
๐ฅ Mengying Wang, Chenhui Ma, Ao Jiao, Tuo Liang, Pengjun Lu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have greatly advanced knowledge graph question answering (KGQA), yet existing systems are typically optimized for returning highly relevant but predictable answers. A missing yet desired capacity is to exploit LLMs to suggest surprise and novel ("serendipitious") answers. In this paper, we formally define the serendipity-aware KGQA task and propose the SerenQA framework to evaluate LLMs' ability to uncover unexpected insights in scientific KGQA tasks. SerenQA includes a rigorous serendipity metric based on relevance, novelty, and surprise, along with an expert-annotated benchmark derived from the Clinical Knowledge Graph, focused on drug repurposing. Additionally, it features a structured evaluation pipeline encompassing three subtasks: knowledge retrieval, subgraph reasoning, and serendipity exploration. Our experiments reveal that while state-of-the-art LLMs perform well on retrieval, they still struggle to identify genuinely surprising and valuable discoveries, underscoring a significant room for future improvements. Our curated resources and extended version are released at: https://cwru-db-group.github.io/serenQA.
๐
2025-11-16
๐ AAAI 2026
๐ฅ Chenglong Wang, Yifu Huo, Yang Gan, Yongyu Mu, Qiaozhi He et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not provide performance information on each preference dimension. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We design it to favor and encourage reward models that better capture preferences across different dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions used during the reward prediction and enhances its interpretability. Through extensive experiments, we find that MRMBench strongly correlates with the alignment performance of large language models (LLMs), making it a reliable reference for developing advanced reward models. Our analysis of MRMBench evaluation results reveals that reward models often struggle to capture preferences across multiple dimensions, highlighting the potential of multi-objective optimization in reward modeling. Additionally, our findings show that the proposed inference-time probing method offers a reliable metric for assessing the confidence of reward predictions, which ultimately improves the alignment of LLMs.
๐
2025-11-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoyu Lin, Aniket Ghorpade, Hansheng Zhu, Justin Qiu, Dea Rrozhani et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
With the rapid adoption of multimodal large language models (MLLMs) across diverse applications, there is a pressing need for task-centered, high-quality training data. A key limitation of current training datasets is their reliance on sparse annotations mined from the Internet or entered via manual typing that capture only a fraction of an image's visual content. Dense annotations are more valuable but remain scarce. Traditional text-based annotation pipelines are poorly suited for creating dense annotations: typing limits expressiveness, slows annotation speed, and underrepresents nuanced visual features, especially in specialized areas such as multicultural imagery and 3D asset annotation. In this paper, we present DenseAnnotate, an audio-driven online annotation platform that enables efficient creation of dense, fine-grained annotations for images and 3D assets. Annotators narrate observations aloud while synchronously linking spoken phrases to image regions or 3D scene parts. Our platform incorporates speech-to-text transcription and region-of-attention marking. To demonstrate the effectiveness of DenseAnnotate, we conducted case studies involving over 1,000 annotators across two domains: culturally diverse images and 3D scenes. We curate a human-annotated multi-modal dataset of 3,531 images, 898 3D scenes, and 7,460 3D objects, with audio-aligned dense annotations in 20 languages, including 8,746 image captions, 2,000 scene captions, and 19,000 object captions. Models trained on this dataset exhibit improvements of 5% in multilingual, 47% in cultural alignment, and 54% in 3D spatial capabilities. Our results show that our platform offers a feasible approach for future vision-language research and can be applied to various tasks and diverse types of data.
๐
2025-11-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mengyao Gao, Chongming Gao, Haoyan Liu, Qingpeng Cai, Peng Jiang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advancements in large language model-based recommendation systems often represent items as text or semantic IDs and generate recommendations in an auto-regressive manner. However, due to the left-to-right greedy decoding strategy and the unidirectional logical flow, such methods often fail to produce globally optimal recommendations. In contrast, human reasoning does not follow a rigid left-to-right sequence. Instead, it often begins with keywords or intuitive insights, which are then refined and expanded. Inspired by this fact, we propose Mind-inspired Recommender (MindRec), a novel generative framework that emulates human thought processes. Particularly, our method first generates key tokens that reflect user preferences, and then expands them into the complete item, enabling flexible and human-like generation. To further emulate the structured nature of human decision-making, we organize items into a hierarchical category tree. This structure guides the model to first produce the coarse-grained category and then progressively refine its selection through finer-grained subcategories before generating the specific item. To mitigate the local optimum problem inherent in greedy decoding, we design a novel beam search algorithm, Diffusion Beam Search, tailored for our mind-inspired generation paradigm. Experimental results demonstrate that MindRec yields a 9.5\% average improvement in top-1 recommendation performance over state-of-the-art methods, highlighting its potential to enhance recommendation accuracy. The implementation is available via https://github.com/Mr-Peach0301/MindRec.
๐
2025-11-16
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zhongchao Yi, Kai Feng, Xiaojian Ma, Yalong Wang, Yongqi Liu et al.
Robotics
ๆฅ็ๆ่ฆ
In large-scale industrial recommendation systems, retrieval must produce high-quality candidates from massive corpora under strict latency. Recently, Generative Retrieval (GR) has emerged as a viable alternative to Embedding-Based Retrieval (EBR), which quantizes items into a finite token space and decodes candidates autoregressively, providing a scalable path that explicitly models target-history interactions via cross-attention. However, three challenges persist: 1) how to balance users' long-term and short-term interests , 2) noise interference when generating hierarchical semantic IDs (SIDs), 3) the absence of explicit modeling for negative feedback such as exposed items without clicks. To address these challenges, we propose DualGR, a generative retrieval framework that explicitly models dual horizons of user interests with selective activation. Specifically, DualGR utilizes Dual-Branch Long/Short-Term Router (DBR) to cover both stable preferences and transient intents by explicitly modeling users' long- and short-term behaviors. Meanwhile, Search-based SID Decoding (S2D) is presented to control context-induced noise and enhance computational efficiency by constraining candidate interactions to the current coarse (level-1) bucket during fine-grained (level-2/3) SID prediction. % also reinforcing intra-class consistency. Finally, we propose an Exposure-aware Next-Token Prediction Loss (ENTP-Loss) that treats "exposed-but-unclicked" items as hard negatives at level-1, enabling timely interest fade-out. On the large-scale Kuaishou short-video recommendation system, DualGR has achieved outstanding performance. Online A/B testing shows +0.527% video views and +0.432% watch time lifts, validating DualGR as a practical and effective paradigm for industrial generative retrieval.
๐
2025-11-16
๐ AAAI 2026
๐ฅ Zhen Tao, Xinke Jiang, Qingshuai Feng, Haoyu Zhang, Lun Du et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Dynamic recommendation systems aim to provide personalized suggestions by modeling temporal user-item interactions across time-series behavioral data. Recent studies have leveraged pre-trained dynamic graph neural networks (GNNs) to learn user-item representations over temporal snapshot graphs. However, fine-tuning GNNs on these graphs often results in generalization issues due to temporal discrepancies between pre-training and fine-tuning stages, limiting the model's ability to capture evolving user preferences. To address this, we propose TarDGR, a task-aware retrieval-augmented framework designed to enhance generalization capability by incorporating task-aware model and retrieval-augmentation. Specifically, TarDGR introduces a Task-Aware Evaluation Mechanism to identify semantically relevant historical subgraphs, enabling the construction of task-specific datasets without manual labeling. It also presents a Graph Transformer-based Task-Aware Model that integrates semantic and structural encodings to assess subgraph relevance. During inference, TarDGR retrieves and fuses task-aware subgraphs with the query subgraph, enriching its representation and mitigating temporal generalization issues. Experiments on multiple large-scale dynamic graph datasets demonstrate that TarDGR consistently outperforms state-of-the-art methods, with extensive empirical evidence underscoring its superior accuracy and generalization capabilities.
๐
2025-11-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The rapid growth of e-commerce calls for multimodal models that comprehend rich visual and textual product information. Although recent multimodal large language models (MLLMs) for product understanding exhibit strong capability in representation learning for e-commerce, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced multimodal representation learning framework for e-commerce product understanding. MOON2.0 comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further introduce MBE2.0, a co-augmented multimodal representation benchmark for e-commerce representation learning and evaluation. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of MOON2.0.
๐
2025-11-16
๐ ArXiv Preprint (Robotics)
๐ฅ Boshu Lei, Wen Jiang, Kostas Daniilidis
ๆฅ็ๆ่ฆ
Grasping in a densely cluttered environment is a challenging task for robots. Previous methods tried to solve this problem by actively gathering multiple views before grasp pose generation. However, they either overlooked the importance of the grasp distribution for information gain estimation or relied on the projection of the grasp distribution, which ignores the structure of grasp poses on the SE(3) manifold. To tackle these challenges, we propose a calibrated energy-based model for grasp pose generation and an active view selection method that estimates information gain from grasp distribution. Our energy-based model captures the multi-modality nature of grasp distribution on the SE(3) manifold. The energy level is calibrated to the success rate of grasps so that the predicted distribution aligns with the real distribution. The next best view is selected by estimating the information gain for grasp from the calibrated distribution conditioned on the reconstructed environment, which could efficiently drive the robot to explore affordable parts of the target object. Experiments on simulated environments and real robot setups demonstrate that our model could successfully grasp objects in a cluttered environment with limited view budgets compared to previous state-of-the-art models. Our simulated environment can serve as a reproducible platform for future research on active grasping. The source code of our paper will be made public when the paper is released to the public.
๐
2025-11-16
๐ ArXiv Preprint (Robotics)
๐ฅ Vignesh Rajagopal, Kasun Weerakoon Kulathun Mudiyanselage, Gershom Devake Seneviratne, Pon Aswin Sankaralingam, Mohamed Elnoor et al.
Robotics
ๆฅ็ๆ่ฆ
We present DR. Nav (Dead-End Recovery-aware Navigation), a novel approach to autonomous navigation in scenarios where dead-end detection and recovery are critical, particularly in unstructured environments where robots must handle corners, vegetation occlusions, and blocked junctions. DR. Nav introduces a proactive strategy for navigation in unmapped environments without prior assumptions. Our method unifies dead-end prediction and recovery by generating a single, continuous, real-time semantic cost map. Specifically, DR. Nav leverages cross-modal RGB-LiDAR fusion with attention-based filtering to estimate per-cell dead-end likelihoods and recovery points, which are continuously updated through Bayesian inference to enhance robustness. Unlike prior mapping methods that only encode traversability, DR. Nav explicitly incorporates recovery-aware risk into the navigation cost map, enabling robots to anticipate unsafe regions and plan safer alternative trajectories. We evaluate DR. Nav across multiple dense indoor and outdoor scenarios and demonstrate an increase of 83.33% in accuracy in detection, a 52.4% reduction in time-to-goal (path efficiency), compared to state-of-the-art planners such as DWA, MPPI, and Nav2 DWB. Furthermore, the dead-end classifier functions
๐
2025-11-16
๐ Author Accepted Manuscript (AAM) of a paper accept
๐ฅ Sungjun Seo, Kooktae Lee
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper addresses the fundamental problem of non-uniform area coverage in multi-agent systems, where different regions require varying levels of attention due to mission-dependent priorities. Existing uniform coverage strategies are insufficient for realistic applications, and many non-uniform approaches either lack optimality guarantees or fail to incorporate crucial real-world constraints such as agent dynamics, limited operation time, the number of agents, and decentralized execution.
To resolve these limitations, we propose a novel framework called Density-Driven Optimal Control (D2OC). The central idea of D2OC is the integration of optimal transport theory with multi-agent coverage control, enabling each agent to continuously adjust its trajectory to match a mission-specific reference density map. The proposed formulation establishes optimality by solving a constrained optimization problem that explicitly incorporates physical and operational constraints. The resulting control input is analytically derived from the Lagrangian of the objective function, yielding closed-form optimal solutions for linear systems and a generalizable structure for nonlinear systems. Furthermore, a decentralized data-sharing mechanism is developed to coordinate agents without reliance on global information.
Comprehensive simulation studies demonstrate that D2OC achieves significantly improved non-uniform area coverage performance compared to existing methods, while maintaining scalability and decentralized implementability.
๐
2025-11-16
๐ ArXiv Preprint (Robotics)
๐ฅ Aleesha Khurram, Amir Moeini, Shangtong Zhang, Rohan Chandra
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Despite significant progress and advances in autonomous driving, many end-to-end systems still struggle with domain adaptation (DA), such as transferring a policy trained under clear weather to adverse weather conditions. Typical DA strategies in the literature include collecting additional data in the target domain or re-training the model, or both. Both these strategies quickly become impractical as we increase scale and complexity of driving. These limitations have encouraged investigation into few-shot and zero-shot prompt-driven DA at inference time involving LLMs and VLMs. These methods work by adding a few state-action trajectories during inference to the prompt (similar to in-context learning). However, there are two limitations of such an approach: $(i)$ prompt-driven DA methods are currently restricted to perception tasks such as detection and segmentation and $(ii)$ they require expert few-shot data. In this work, we present a new approach to inference-time few-shot prompt-driven DA for closed-loop autonomous driving in adverse weather condition using in-context reinforcement learning (ICRL). Similar to other prompt-driven DA methods, our approach does not require any updates to the model parameters nor does it require additional data collection in adversarial weather regime. Furthermore, our approach advances the state-of-the-art in prompt-driven DA by extending to closed driving using general trajectories observed during inference. Our experiments using the CARLA simulator show that ICRL results in safer, more efficient, and more comfortable driving policies in the target domain compared to state-of-the-art prompt-driven DA baselines.
๐
2025-11-16
๐ ArXiv Preprint (Machine Learning)
๐ฅ Timur Anvar, Jeffrey Chen, Yuyan Wang, Rohan Chandra
RoboticsNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Autonomous vehicle navigation in complex environments such as dense and fast-moving highways and merging scenarios remains an active area of research. A key limitation of RL is its reliance on well-specified reward functions, which often fail to capture the full semantic and social complexity of diverse, out-of-distribution situations. As a result, a rapidly growing line of research explores using Large Language Models (LLMs) to replace or supplement RL for direct planning and control, on account of their ability to reason about rich semantic context. However, LLMs present significant drawbacks: they can be unstable in zero-shot safety-critical settings, produce inconsistent outputs, and often depend on expensive API calls with network latency. This motivates our investigation into whether small, locally deployed LLMs (< 14B parameters) can meaningfully support autonomous highway driving through reward shaping rather than direct control. We present a case study comparing RL-only, LLM-only, and hybrid approaches, where LLMs augment RL rewards by scoring state-action transitions during training, while standard RL policies execute at test time. Our findings reveal that RL-only agents achieve moderate success rates (73-89%) with reasonable efficiency, LLM-only agents can reach higher success rates (up to 94%) but with severely degraded speed performance, and hybrid approaches consistently fall between these extremes. Critically, despite explicit efficiency instructions, LLM-influenced approaches exhibit systematic conservative bias with substantial model-dependent variability, highlighting important limitations of current small LLMs for safety-critical control tasks.
๐
2025-11-16
๐ 10 pages, 11 figures, It is submitted as a journal
๐ฅ Arvind Kumar Mishra, Sohom Chakrabarty
Machine Learning
ๆฅ็ๆ่ฆ
In this work, Yoshikawa's manipulability index is used to investigate reinforcement learning (RL) as a framework for morphology optimization in planar robotic manipulators. A 2R manipulator tracking a circular end-effector path is first examined because this case has a known analytical optimum: equal link lengths and the second joint orthogonal to the first. This serves as a validation step to test whether RL can rediscover the optimum using reward feedback alone, without access to the manipulability expression or the Jacobian. Three RL algorithms (SAC, DDPG, and PPO) are compared with grid search and black-box optimizers, with morphology represented by a single action parameter phi that maps to the link lengths. All methods converge to the analytical solution, showing that numerical recovery of the optimum is possible without supplying analytical structure.
Most morphology design tasks have no closed-form solutions, and grid or heuristic search becomes expensive as dimensionality increases. RL is therefore explored as a scalable alternative. The formulation used for the circular path is extended to elliptical and rectangular paths by expanding the action space to the full morphology vector (L1, L2, theta2). In these non-analytical settings, RL continues to converge reliably, whereas grid and black-box methods require far larger evaluation budgets. These results indicate that RL is effective for both recovering known optima and solving morphology optimization problems without analytical solutions.
๐
2025-11-16
๐ ArXiv Preprint (Robotics)
๐ฅ Jordan Leyva, Nahim J. Moran Vera, Yihan Xu, Adrien Durasno, Christopher U. Romero et al.
Robotics
ๆฅ็ๆ่ฆ
Obstacle avoidance path planning for uncrewed aerial vehicles (UAVs), or drones, is rarely addressed in most flight path planning schemes, despite obstacles being a realistic condition. Obstacle avoidance can also be energy-intensive, making it a critical factor in efficient point-to-point drone flights. To address these gaps, we propose EcoFlight, an energy-efficient pathfinding algorithm that determines the lowest-energy route in 3D space with obstacles. The algorithm models energy consumption based on the drone propulsion system and flight dynamics. We conduct extensive evaluations, comparing EcoFlight with direct-flight and shortest-distance schemes. The simulation results across various obstacle densities show that EcoFlight consistently finds paths with lower energy consumption than comparable algorithms, particularly in high-density environments. We also demonstrate that a suitable flying speed can further enhance energy savings.
๐
2025-11-16
๐ ArXiv Preprint (Computer Vision)
๐ฅ Artem Moroz, Vรญt Zeman, Martin Mikลกรญk, Elizaveta Isianova, Miroslav David et al.
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
We introduce a unified, end-to-end framework that seamlessly integrates object detection and pose estimation with a versatile onboarding process. Our pipeline begins with an onboarding stage that generates object representations from either traditional 3D CAD models or, in their absence, by rapidly reconstructing a high-fidelity neural representation (NeRF) from multi-view images. Given a test image, our system first employs the CNOS detector to localize target objects. For each detection, our novel pose estimation module, OPFormer, infers the precise 6D pose. The core of OPFormer is a transformer-based architecture that leverages a foundation model for robust feature extraction. It uniquely learns a comprehensive object representation by jointly encoding multiple template views and enriches these features with explicit 3D geometric priors using Normalized Object Coordinate Space (NOCS). A decoder then establishes robust 2D-3D correspondences to determine the final pose. Evaluated on the challenging BOP benchmarks, our integrated system demonstrates a strong balance between accuracy and efficiency, showcasing its practical applicability in both model-based and model-free scenarios.
๐
2025-11-16
๐ Published as Early Access in IEEE Transactions on
๐ฅ Davide De Benedittis, Giovanni Di Lorenzo, Franco Angelini, Barbara Valle, Marina Serena Borgatti et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
According to the European Union's Habitat Directive, habitat monitoring plays a critical role in response to the escalating problems posed by biodiversity loss and environmental degradation. Scree habitats, hosting unique and often endangered species, face severe threats from climate change due to their high-altitude nature. Traditionally, their monitoring has required highly skilled scientists to conduct extensive fieldwork in remote, potentially hazardous locations, making the process resource-intensive and time-consuming. This paper presents a novel approach for scree habitat monitoring using a legged robot to assist botanists in data collection and species identification. Specifically, we deployed the ANYmal C robot in the Italian Alpine bio-region in two field campaigns spanning two years and leveraged deep learning to detect and classify key plant species of interest. Our results demonstrate that agile legged robots can navigate challenging terrains and increase the frequency and efficiency of scree monitoring. When paired with traditional phytosociological surveys performed by botanists, this robotics-assisted protocol not only streamlines field operations but also enhances data acquisition, storage, and usage. The outcomes of this research contribute to the evolving landscape of robotics in environmental science, paving the way for a more comprehensive and sustainable approach to habitat monitoring and preservation.
๐
2025-11-16
๐ Author Accepted Manuscript (AAM) of a paper accept
๐ฅ Sungjun Seo, Kooktae Lee
Robotics
ๆฅ็ๆ่ฆ
The growing scale of modern farms has increased the need for efficient and adaptive multi-agent coverage strategies for pest, weed, and disease management. Traditional methods such as manual inspection and blanket pesticide spraying often lead to excessive chemical use, resource waste, and environmental impact. While unmanned aerial vehicles (UAVs) offer a promising platform for precision agriculture through targeted spraying and improved operational efficiency, existing UAV-based approaches remain limited by battery life, payload capacity, and scalability, especially in large fields where single-UAV or uniformly distributed spraying is insufficient. Although multi-UAV coordination has been explored, many current frameworks still assume uniform spraying and do not account for infestation severity, UAV dynamics, non-uniform resource allocation, or energy-efficient coordination.
To address these limitations, this paper proposes a Density-Driven Optimal Control (D2OC) framework that integrates Optimal Transport (OT) theory with multi-UAV coverage control for large-scale agricultural spraying. The method supports non-uniform, priority-aware resource allocation based on infestation intensity, reducing unnecessary chemical application. UAVs are modeled as a linear time-varying (LTV) system to capture variations in mass and inertia during spraying missions. The D2OC control law, derived using Lagrangian mechanics, enables efficient coordination, balanced workload distribution, and improved mission duration. Simulation results demonstrate that the proposed approach outperforms uniform spraying and Spectral Multiscale Coverage (SMC) in coverage efficiency, chemical reduction, and operational sustainability, providing a scalable solution for smart agriculture.
๐
2025-11-16
๐ ArXiv Preprint (Robotics)
๐ฅ Navin Sriram Ravie, Keerthi Vasan M, Bijo Sebastian
Machine Learning
ๆฅ็ๆ่ฆ
Dense clutter removal for target object retrieval presents a challenging problem, especially when targets are embedded deep within densely-packed configurations. It requires foresight to minimize overall changes to the clutter configuration while accessing target objects, avoiding stack destabilization and reducing the number of object removals required. Rule-based planners when applied to this problem, rely on rigid heuristics, leading to high computational overhead. End-to-end reinforcement learning approaches struggle with interpretability and generalizability over different conditions. To address these issues, we present ClutterNav, a novel decision-making framework that can identify the next best object to be removed so as to access a target object in a given clutter, while minimising stack disturbances. ClutterNav formulates the problem as a continuous reinforcement learning task, where each object removal dynamically updates the understanding of the scene. A removability critic, trained from demonstrations, estimates the cost of removing any given object based on geometric and spatial features. This learned cost is complemented by integrated gradients that assess how the presence or removal of surrounding objects influences the accessibility of the target. By dynamically prioritizing actions that balance immediate removability against long-term target exposure, ClutterNav achieves near human-like strategic sequencing, without predefined heuristics. The proposed approach is validated extensively in simulation and over real-world experiments. The results demonstrate real-time, occlusion-aware decision-making in partially observable environments.
๐
2025-11-16
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Robotic manipulation and navigation are fundamental capabilities of embodied intelligence, enabling effective robot interactions with the physical world. Achieving these capabilities requires a cohesive understanding of the environment, including object recognition to localize target objects, object affordances to identify potential interaction areas and spatial affordances to discern optimal areas for both object placement and robot movement. While Vision-Language Models (VLMs) excel at high-level task planning and scene understanding, they often struggle to infer actionable positions for physical interaction, such as functional grasping points and permissible placement regions. This limitation stems from the lack of fine-grained annotations for object and spatial affordances in their training datasets. To tackle this challenge, we introduce RoboAfford++, a generative AI-enhanced dataset for multimodal affordance learning for both robotic manipulation and navigation. Our dataset comprises 869,987 images paired with 2.0 million question answering (QA) annotations, covering three critical tasks: object affordance recognition to identify target objects based on attributes and spatial relationships, object affordance prediction to pinpoint functional parts for manipulation, and spatial affordance localization to identify free space for object placement and robot navigation. Complementing this dataset, we propose RoboAfford-Eval, a comprehensive benchmark for assessing affordance-aware prediction in real-world scenarios, featuring 338 meticulously annotated samples across the same three tasks. Extensive experimental results reveal the deficiencies of existing VLMs in affordance learning, while fine-tuning on the RoboAfford++ dataset significantly enhances their ability to reason about object and spatial affordances, validating the dataset's effectiveness.
๐
2025-11-15
๐ AAAI 2026
๐ฅ Ganlin Xu, Zhitao Yin, Linghao Zhang, Jiaqing Liang, Weijia Lu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Information retrieval (IR) systems play a critical role in navigating information overload across various applications. Existing IR benchmarks primarily focus on simple queries that are semantically analogous to single- and multi-hop relations, overlooking \emph{complex logical queries} involving first-order logic operations such as conjunction ($\land$), disjunction ($\lor$), and negation ($\lnot$). Thus, these benchmarks can not be used to sufficiently evaluate the performance of IR models on complex queries in real-world scenarios. To address this problem, we propose a novel method leveraging large language models (LLMs) to construct a new IR dataset \textbf{ComLQ} for \textbf{Com}plex \textbf{L}ogical \textbf{Q}ueries, which comprises 2,909 queries and 11,251 candidate passages. A key challenge in constructing the dataset lies in capturing the underlying logical structures within unstructured text. Therefore, by designing the subgraph-guided prompt with the subgraph indicator, an LLM (such as GPT-4o) is guided to generate queries with specific logical structures based on selected passages. All query-passage pairs in ComLQ are ensured \emph{structure conformity} and \emph{evidence distribution} through expert annotation. To better evaluate whether retrievers can handle queries with negation, we further propose a new evaluation metric, \textbf{Log-Scaled Negation Consistency} (\textbf{LSNC@$K$}). As a supplement to standard relevance-based metrics (such as nDCG and mAP), LSNC@$K$ measures whether top-$K$ retrieved passages violate negation conditions in queries. Our experimental results under zero-shot settings demonstrate existing retrieval models' limited performance on complex logical queries, especially on queries with negation, exposing their inferior capabilities of modeling exclusion.
๐
2025-11-15
๐ ArXiv Preprint (Robotics)
๐ฅ Lingfeng Zhang, Erjia Xiao, Xiaoshuai Hao, Haoxiang Fu, Zeying Gong et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Social navigation in densely populated dynamic environments poses a significant challenge for autonomous mobile robots, requiring advanced strategies for safe interaction. Existing reinforcement learning (RL)-based methods require over 2000+ hours of extensive training and often struggle to generalize to unfamiliar environments without additional fine-tuning, limiting their practical application in real-world scenarios. To address these limitations, we propose SocialNav-Map, a novel zero-shot social navigation framework that combines dynamic human trajectory prediction with occupancy mapping, enabling safe and efficient navigation without the need for environment-specific training. Specifically, SocialNav-Map first transforms the task goal position into the constructed map coordinate system. Subsequently, it creates a dynamic occupancy map that incorporates predicted human movements as dynamic obstacles. The framework employs two complementary methods for human trajectory prediction: history prediction and orientation prediction. By integrating these predicted trajectories into the occupancy map, the robot can proactively avoid potential collisions with humans while efficiently navigating to its destination. Extensive experiments on the Social-HM3D and Social-MP3D datasets demonstrate that SocialNav-Map significantly outperforms state-of-the-art (SOTA) RL-based methods, which require 2,396 GPU hours of training. Notably, it reduces human collision rates by over 10% without necessitating any training in novel environments. By eliminating the need for environment-specific training, SocialNav-Map achieves superior navigation performance, paving the way for the deployment of social navigation systems in real-world environments characterized by diverse human behaviors. The code is available at: https://github.com/linglingxiansen/SocialNav-Map.
๐
2025-11-14
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Chenghan Fu, Daoze Zhang, Yukang Lin, Zhanheng Nie, Xiang Zhang et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
We introduce MOON, our comprehensive set of sustainable iterative practices for multimodal representation learning for e-commerce applications. MOON has already been fully deployed across all stages of Taobao search advertising system, including retrieval, relevance, ranking, and so on. The performance gains are particularly significant on click-through rate (CTR) prediction task, which achieves an overall +20.00% online CTR improvement. Over the past three years, this project has delivered the largest improvement on CTR prediction task and undergone five full-scale iterations. Throughout the exploration and iteration of our MOON, we have accumulated valuable insights and practical experience that we believe will benefit the research community. MOON contains a three-stage training paradigm of "Pretraining, Post-training, and Application", allowing effective integration of multimodal representations with downstream tasks. Notably, to bridge the misalignment between the objectives of multimodal representation learning and downstream training, we define the exchange rate to quantify how effectively improvements in an intermediate metric can translate into downstream gains. Through this analysis, we identify the image-based search recall as a critical intermediate metric guiding the optimization of multimodal models. Over three years and five iterations, MOON has evolved along four critical dimensions: data processing, training strategy, model architecture, and downstream application. The lessons and insights gained through the iterative improvements will also be shared. As part of our exploration into scaling effects in the e-commerce field, we further conduct a systematic study of the scaling laws governing multimodal representation learning, examining multiple factors such as the number of training tokens, negative samples, and the length of user behavior sequences.
๐
2025-11-14
๐ ArXiv Preprint (cs.DL)
๐ฅ Liza Daly, Matteo Cargnelutti, Catherine Brobston, John Hess, Greg Leppert et al.
ๆฅ็ๆ่ฆ
Publicly launched in 2004, the Google Books project has scanned tens of millions of items in partnership with libraries around the world. As part of this project, Google created the Google Return Interface (GRIN). Through this platform, libraries can access their scanned collections, the associated metadata, and the ongoing OCR and metadata improvements that become available as Google reprocesses these collections using new technologies. When downloading the Harvard Library Google Books collection from GRIN to develop the Institutional Books dataset, we encountered several challenges related to rate-limiting and atomized metadata within the GRIN platform. To overcome these challenges and help other libraries make more robust use of their Google Books collections, this technical report introduces the initial release of GRIN Transfer. This open-source and production-ready Python pipeline allows partner libraries to efficiently retrieve their Google Books collections from GRIN. This report also introduces an updated version of our Institutional Books 1.0 pipeline, initially used to analyze, augment, and assemble the Institutional Books 1.0 dataset. We have revised this pipeline for compatibility with the output format of GRIN Transfer. A library could pair these two tools to create an end-to-end processing pipeline for their Google Books collection to retrieve, structure, and enhance data available from GRIN. This report gives an overview of how GRIN Transfer was designed to optimize for reliability and usability in different environments, as well as guidance on configuration for various use cases.
๐
2025-11-14
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xintian Han, Honggang Chen, Quan Lin, Jingyue Gao, Xiangyuan Ren et al.
Multimodal
ๆฅ็ๆ่ฆ
Traditional ID-based recommender systems often struggle with cold-start and generalization challenges. Multimodal recommendation systems, which leverage textual and visual data, offer a promising solution to mitigate these issues. However, existing industrial approaches typically adopt a two-stage training paradigm: first pretraining a multimodal model, then applying its frozen representations to train the recommendation model. This decoupled framework suffers from misalignment between multimodal learning and recommendation objectives, as well as an inability to adapt dynamically to new data. To address these limitations, we propose LEMUR, the first large-scale multimodal recommender system trained end-to-end from raw data. By jointly optimizing both the multimodal and recommendation components, LEMUR ensures tighter alignment with downstream objectives while enabling real-time parameter updates. Constructing multimodal sequential representations from user history often entails prohibitively high computational costs. To alleviate this bottleneck, we propose a novel memory bank mechanism that incrementally accumulates historical multimodal representations throughout the training process. After one month of deployment in Douyin Search, LEMUR has led to a 0.843% reduction in query change rate decay and a 0.81% improvement in QAUC. Additionally, LEMUR has shown significant gains across key offline metrics for Douyin Advertisement. Our results validate the superiority of end-to-end multimodal recommendation in real-world industrial scenarios.
๐
2025-11-14
๐ ArXiv Preprint (Robotics)
๐ฅ Wei-Cheng Tseng, Jinwei Gu, Qinsheng Zhang, Hanzi Mao, Ming-Yu Liu et al.
Robotics
ๆฅ็ๆ่ฆ
Training generalist policies for robotic manipulation has shown great promise, as they enable language-conditioned, multi-task behaviors across diverse scenarios. However, evaluating these policies remains difficult because real-world testing is expensive, time-consuming, and labor-intensive. It also requires frequent environment resets and carries safety risks when deploying unproven policies on physical robots. Manually creating and populating simulation environments with assets for robotic manipulation has not addressed these issues, primarily due to the significant engineering effort required and the often substantial sim-to-real gap, both in terms of physics and rendering. In this paper, we explore the use of action-conditional video generation models as a scalable way to learn world models for policy evaluation. We demonstrate how to incorporate action conditioning into existing pre-trained video generation models. This allows leveraging internet-scale in-the-wild online videos during the pre-training stage, and alleviates the need for a large dataset of paired video-action data, which is expensive to collect for robotic manipulation. Our paper examines the effect of dataset diversity, pre-trained weight and common failure cases for the proposed evaluation pipeline. Our experiments demonstrate that, across various metrics, including policy ranking and the correlation between actual policy values and predicted policy values, these models offer a promising approach for evaluating policies without requiring real-world interactions.
๐
2025-11-13
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yunkai Zhang, Qiang Zhang, Feng Lin, Ruizhong Qiu, Hanchao Yu et al.
ๆฅ็ๆ่ฆ
Optimizing recommender systems for objectives beyond accuracy, such as diversity, novelty, and personalization, is crucial for long-term user satisfaction. To this end, industrial practitioners have accumulated vast amounts of structured domain knowledge, which we term human priors (e.g., item taxonomies, temporal patterns). This knowledge is typically applied through post-hoc adjustments during ranking or post-ranking. However, this approach remains decoupled from the core model learning, which is particularly undesirable as the industry shifts to end-to-end generative recommendation foundation models. On the other hand, many methods targeting these beyond-accuracy objectives often require architecture-specific modifications and discard these valuable human priors by learning user intent in a fully unsupervised manner. Instead of discarding the human priors accumulated over years of practice, we introduce a backbone-agnostic framework that seamlessly integrates these human priors directly into the end-to-end training of generative recommenders. With lightweight, prior-conditioned adapter heads inspired by efficient LLM decoding strategies, our approach guides the model to disentangle user intent along human-understandable axes (e.g., interaction types, long- vs. short-term interests). We also introduce a hierarchical composition strategy for modeling complex interactions across different prior types. Extensive experiments on three large-scale datasets demonstrate that our method significantly enhances both accuracy and beyond-accuracy objectives. We also show that human priors allow the backbone model to more effectively leverage longer context lengths and larger model sizes.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model's text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Edward Kim, Devan Shanker, Varun Bharadwaj, Hongbeen Park, Jinkyu Kim et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Simulation-based testing has become a crucial complement to road testing for ensuring the safety of cyber physical systems (CPS). As a result, significant research efforts have been directed toward identifying failure scenarios within simulation environments. However, a critical question remains. Are the AV failure scenarios discovered in simulation reproducible on actual systems in the real world? The sim-to-real gap caused by differences between simulated and real sensor data means that failure scenarios identified in simulation might either be artifacts of synthetic sensor data or actual issues that also occur with real sensor data. To address this, an effective approach to validating simulated failure scenarios is to locate occurrences of these scenarios within real-world datasets and verify whether the failure persists on the datasets. To this end, we introduce a formal definition of how labeled time series sensor data can match an abstract scenario, represented as a scenario program using the Scenic probabilistic programming language. We present a querying algorithm that, given a scenario program and a labeled dataset, identifies the subset of data that matches the specified scenario. Our experiment shows that our algorithm is more accurate and orders of magnitude faster in querying scenarios than the state-of-the-art commercial vision large language models, and can scale with the duration of queried time series data.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Benjamin L. Badger, Matthew Neligeorge
Natural Language Processing
ๆฅ็ๆ่ฆ
Language prediction is constrained by informational entropy intrinsic to language, such that there exists a limit to how accurate any language model can become and equivalently a lower bound to language compression. The most efficient language compression algorithms today are causal (next token prediction) large language models, but the use of these models to form accurate estimates of language entropy is currently computationally infeasible. We introduce encoder-augmented causal decoder model architectures that exhibit superior training efficiency characteristics and achieve higher compression than causal transformers even when trained on modest hardware. We demonstrate how entropy estimates can be obtained on a per-token basis, and show that the generalization of models trained to approach the entropy of their training data necessarily exceeds the generalization of models trained to minimize loss beyond this value. We show empirically that causal models trained to approach but not exceed estimated per-token entropies exhibit greater generalization than models trained without taking entropy into account.
๐
2025-11-13
๐ ArXiv Preprint (cs.NI)
๐ฅ Alagappan Ramanathan, Eunju Kang, Dongsu Han, Sangeetha Abdu Jyothi
ๆฅ็ๆ่ฆ
Internet measurement research faces an accessibility crisis: complex analyses require custom integration of multiple specialized tools that demands specialized domain expertise. When network disruptions occur, operators need rapid diagnostic workflows spanning infrastructure mapping, routing analysis, and dependency modeling. However, developing these workflows requires specialized knowledge and significant manual effort.
We present ArachNet, the first system demonstrating that LLM agents can independently generate measurement workflows that mimics expert reasoning. Our core insight is that measurement expertise follows predictable compositional patterns that can be systematically automated. ArachNet operates through four specialized agents that mirror expert workflow, from problem decomposition to solution implementation. We validate ArachNet with progressively challenging Internet resilience scenarios. The system independently generates workflows that match expert-level reasoning and produce analytical outputs similar to specialist solutions. Generated workflows handle complex multi-framework integration that traditionally requires days of manual coordination. ArachNet lowers barriers to measurement workflow composition by automating the systematic reasoning process that experts use, enabling broader access to sophisticated measurement capabilities while maintaining the technical rigor required for research-quality analysis.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Radosลaw Miernik, Marek Szykuลa, Jakub Kowalski, Jakub Cieลluk, ลukasz Galas et al.
Machine Learning
ๆฅ็ๆ่ฆ
We propose a new General Game Playing (GGP) system called Regular Games (RG). The main goal of RG is to be both computationally efficient and convenient for game design. The system consists of several languages. The core component is a low-level language that defines the rules by a finite automaton. It is minimal with only a few mechanisms, which makes it easy for automatic processing (by agents, analysis, optimization, etc.). The language is universal for the class of all finite turn-based games with imperfect information. Higher-level languages are introduced for game design (by humans or Procedural Content Generation), which are eventually translated to a low-level language. RG generates faster forward models than the current state of the art, beating other GGP systems (Regular Boardgames, Ludii) in terms of efficiency. Additionally, RG's ecosystem includes an editor with LSP, automaton visualization, benchmarking tools, and a debugger of game description transformations.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bavana Durgapraveen, Sornaraj Sivasankaran, Abhinand Balachandran, Sriram Rajkumar
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.
๐
2025-11-13
๐ ArXiv Preprint (cs.SI)
๐ฅ Raman Ebrahimi, Sean Fuhrman, Kendrick Nguyen, Harini Gurusankar, Massimo Franceschetti
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The WikiRace game, where players navigate between Wikipedia articles using only hyperlinks, serves as a compelling benchmark for goal-directed search in complex information networks. This paper presents a systematic evaluation of navigation strategies for this task, comparing agents guided by graph-theoretic structure (betweenness centrality), semantic meaning (language model embeddings), and hybrid approaches. Through rigorous benchmarking on a large Wikipedia subgraph, we demonstrate that a purely greedy agent guided by the semantic similarity of article titles is overwhelmingly effective. This strategy, when combined with a simple loop-avoidance mechanism, achieved a perfect success rate and navigated the network with an efficiency an order of magnitude better than structural or hybrid methods. Our findings highlight the critical limitations of purely structural heuristics for goal-directed search and underscore the transformative potential of large language models to act as powerful, zero-shot semantic navigators in complex information spaces.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Abhinand Balachandran, Bavana Durgapraveen, Gowsikkan Sikkan Sudhagar, Vidhya Varshany J S, Sriram Rajkumar
Natural Language Processing
ๆฅ็ๆ่ฆ
The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-Shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to "overthinking" and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Garapati Keerthana, Manik Gupta
Machine Learning
ๆฅ็ๆ่ฆ
Personalized decision systems in healthcare and behavioral support often rely on static rule-based or engagement-maximizing heuristics that overlook users' emotional context and ethical constraints. Such approaches risk recommending insensitive or unsafe interventions, especially in domains involving serious mental illness, substance use disorders, or depression. To address this limitation, we propose a Responsible Reinforcement Learning (RRL) framework that integrates emotional and contextual understanding with ethical considerations into the sequential decision-making process. RRL formulates personalization as a Constrained Markov Decision Process (CMDP), where the agent optimizes engagement and adherence while ensuring emotional alignment and ethical safety. We introduce a multi-objective reward function that explicitly balances short-term behavioral engagement with long-term user well-being, and define an emotion-informed state representation that captures fluctuations in emotional readiness, affect, and risk. The proposed architecture can be instantiated with any RL algorithm (e.g., DQN, PPO) augmented with safety constraints or Lagrangian regularization. Conceptually, this framework operationalizes empathy and responsibility within machine learning policy optimization, bridging safe RL, affective computing and responsible AI. We discuss the implications of this approach for human-centric domains such as behavioral health, education, and digital therapeutics, and outline simulation-based validation paths for future empirical work. This paper aims to initiate a methodological conversation about ethically aligned reinforcement learning for emotionally aware and trustworthy personalization systems.
๐
2025-11-13
๐ AAAI
๐ฅ Mohammadsina Almasi, Hadis Anahideh
Machine Learning
ๆฅ็ๆ่ฆ
Equitably allocating limited resources in high-stakes domains-such as education, employment, and healthcare-requires balancing short-term utility with long-term impact, while accounting for delayed outcomes, hidden heterogeneity, and ethical constraints. However, most learning-based allocation frameworks either assume immediate feedback or ignore the complex interplay between individual characteristics and intervention dynamics. We propose a novel bi-level contextual bandit framework for individualized resource allocation under delayed feedback, designed to operate in real-world settings with dynamic populations, capacity constraints, and time-sensitive impact. At the meta level, the model optimizes subgroup-level budget allocations to satisfy fairness and operational constraints. At the base level, it identifies the most responsive individuals within each group using a neural network trained on observational data, while respecting cooldown windows and delayed treatment effects modeled via resource-specific delay kernels. By explicitly modeling temporal dynamics and feedback delays, the algorithm continually refines its policy as new data arrive, enabling more responsive and adaptive decision-making. We validate our approach on two real-world datasets from education and workforce development, showing that it achieves higher cumulative outcomes, better adapts to delay structures, and ensures equitable distribution across subgroups. Our results highlight the potential of delay-aware, data-driven decision-making systems to improve institutional policy and social welfare.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Rishi Singhal, Jung-Eun Kim
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further precisely identify that early layers LayerNorm are the most critical over middle/later layers and their influence varies across Pre and Post LayerNorm models. We have validated it through 13 models across 6 Vision and Language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.
๐
2025-11-13
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Jose Berengueres
ๆฅ็ๆ่ฆ
Repetitive strain injury (RSI) affects roughly one in five computer users and remains largely unresolved despite decades of ergonomic mouse redesign. All such devices share a fundamental limitation: they still require fine-motor motion to operate. This work investigates whether predictive, AI-assisted input can reduce that motion by replacing physical pointing with ranked on-screen suggestions. To preserve user agency, we introduce Preview Accept Discard (PAD), a zero-click interaction paradigm that lets users preview predicted GUI targets, cycle through a small set of ranked alternatives, and accept or discard them via key-release timing. We evaluate PAD in two settings: a browser-based email client and a ISO 9241-9 keyboard-prediction task under varying top-3 accuracies. Across both studies, PAD substantially reduces hand motion relative to trackpad use while maintaining comparable task times with the trackpad only when accuracies are similar to those of the best spell-checkers.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Maksim E. Eren, Dorianis M. Perez
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) is reshaping how research is conceived, conducted, and communicated across fields from chemistry to biomedicine. This commentary examines how AI is transforming the research workflow. AI systems now help researchers manage the information deluge, filtering the literature, surfacing cross-disciplinary links for ideas and collaborations, generating hypotheses, and designing and executing experiments. These developments mark a shift from AI as a mere computational tool to AI as an active collaborator in science. Yet this transformation demands thoughtful integration and governance. We argue that at this time AI must augment but not replace human judgment in academic workflows such as peer review, ethical evaluation, and validation of results. This paper calls for the deliberate adoption of AI within the scientific practice through policies that promote transparency, reproducibility, and accountability.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Srikant Panda, Avinash Rai
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and close-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to +57 percentage points. Styles such as fearful, curious and compassionate are most effective and contextualized rewrites outperform templated variants.
To mitigate this, we introduce a style neutralization preprocessing step using a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dong-Shan Jian, Xiang Li, Chen-Xu Yan, Hui-Wen Zheng, Zhi-Zhang Bian et al.
ๆฅ็ๆ่ฆ
Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles. The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods.
๐
2025-11-13
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Vincenzo Carletti, Pasquale Foggia, Carlo Mazzocca, Giuseppe Parrella, Mario Vento
Machine Learning
ๆฅ็ๆ่ฆ
One of the key advantages of Federated Learning (FL) is its ability to collaboratively train a Machine Learning (ML) model while keeping clients' data on-site. However, this can create a false sense of security. Despite not sharing private data increases the overall privacy, prior studies have shown that gradients exchanged during the FL training remain vulnerable to Gradient Inversion Attacks (GIAs). These attacks allow reconstructing the clients' local data, breaking the privacy promise of FL. GIAs can be launched by either a passive or an active server. In the latter case, a malicious server manipulates the global model to facilitate data reconstruction. While effective, earlier attacks falling under this category have been demonstrated to be detectable by clients, limiting their real-world applicability. Recently, novel active GIAs have emerged, claiming to be far stealthier than previous approaches. This work provides the first comprehensive analysis of these claims, investigating four state-of-the-art GIAs. We propose novel lightweight client-side detection techniques, based on statistically improbable weight structures and anomalous loss and gradient dynamics. Extensive evaluation across several configurations demonstrates that our methods enable clients to effectively detect active GIAs without any modifications to the FL training protocol.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Georgios Chalkiadakis, Charilaos Akasiadis, Gerasimos Koresis, Stergios Plataniots, Leonidas Bakopoulos
Machine Learning
ๆฅ็ๆ่ฆ
This paper provides a comprehensive review of mainly Graph Neural Networks, Deep Reinforcement Learning, and Probabilistic Topic Modeling methods with a focus on their potential incorporation in strategic multiagent settings. We draw interest in (i) Machine Learning methods currently utilized for uncovering unknown model structures adaptable to the task of strategic opponent modeling, and (ii) the integration of these methods with Game Theoretic concepts that avoid relying on assumptions often invalid in real-world scenarios, such as the Common Prior Assumption (CPA) and the Self-Interest Hypothesis (SIH). We analyze the ability to handle uncertainty and heterogeneity, two characteristics that are very common in real-world application cases, as well as scalability. As a potential answer to effectively modeling relationships and interactions in multiagent settings, we champion the use of Graph Neural Networks (GNN). Such approaches are designed to operate upon graph-structured data, and have been shown to be a very powerful tool for performing tasks such as node classification and link prediction. Next, we review the domain of Reinforcement Learning (RL), and in particular that of Multiagent Deep Reinforcement Learning (MADRL). Following, we describe existing relevant game theoretic solution concepts and consider properties such as fairness and stability. Our review comes complete with a note on the literature that utilizes PTM in domains other than that of document analysis and classification. The capability of PTM to estimate unknown underlying distributions can help with tackling heterogeneity and unknown agent beliefs. Finally, we identify certain open challenges specifically, the need to (i) fit non-stationary environments, (ii) balance the degrees of stability and adaptation, (iii) tackle uncertainty and heterogeneity, (iv) guarantee scalability and solution tractability.
๐
2025-11-13
๐ Submitted to IEEE ISBI 2026
๐ฅ Tejas Sudharshan Mathai, Anisa V. Prasad, Xinya Wang, Praveen T. S. Balamuralikrishna, Yan Zhuang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Type 2 Diabetes Mellitus (T2DM) is a chronic metabolic disease that affects millions of people worldwide. Early detection is crucial as it can alter pancreas function through morphological changes and increased deposition of ectopic fat, eventually leading to organ damage. While studies have shown an association between T2DM and pancreas volume and fat content, the role of increased pancreatic surface lobularity (PSL) in patients with T2DM has not been fully investigated. In this pilot work, we propose a fully automated approach to delineate the pancreas and other abdominal structures, derive CT imaging biomarkers, and opportunistically screen for T2DM. Four deep learning-based models were used to segment the pancreas in an internal dataset of 584 patients (297 males, 437 non-diabetic, age: 45$\pm$15 years). PSL was automatically detected and it was higher for diabetic patients (p=0.01) at 4.26 $\pm$ 8.32 compared to 3.19 $\pm$ 3.62 for non-diabetic patients. The PancAP model achieved the highest Dice score of 0.79 $\pm$ 0.17 and lowest ASSD error of 1.94 $\pm$ 2.63 mm (p$<$0.05). For predicting T2DM, a multivariate model trained with CT biomarkers attained 0.90 AUC, 66.7\% sensitivity, and 91.9\% specificity. Our results suggest that PSL is useful for T2DM screening and could potentially help predict the early onset of T2DM.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Corey Ford, Elizabeth Wilson, Shuoyang Zheng, Gabriel Vigliensoni, Jeba Rezwana et al.
ๆฅ็ๆ่ฆ
This third international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 17th ACM Conference on Creativity and Cognition (C&C 2025), online.
๐
2025-11-13
๐ ArXiv Preprint (cs.DC)
๐ฅ Changhai Man, Joongun Park, Hanjiang Wu, Huan Xu, Srinivas Sridharan et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future larger-scale system configurations. We introduce Symbolic Tensor grAph GEnerator(STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE will be publicly available to facilitate further research in distributed machine learning systems.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yunzhe Xu, Zhuosheng Zhang, Zhe Liu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models' capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% token usage. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO's superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: https://github.com/xyz9911/KPPO.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zihan Gao, Yifei Xu, Jacob Thebault-Spieker
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance, for example, search improves Gemini's accuracy by +13.6%, but reduces GPT-series performance by -11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Irina Saparina, Mirella Lapata
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Piotr Gorczyca, Hannes Strass
ๆฅ็ๆ่ฆ
Standpoint logics offer unified modal logic-based formalisms for representing multiple heterogeneous viewpoints. At the same time, many non-monotonic reasoning frameworks can be naturally captured using modal logics, in particular using the modal logic S4F. In this work, we propose a novel formalism called S4F Standpoint Logic, which generalises both S4F and standpoint propositional logic and is therefore capable of expressing multi-viewpoint, non-monotonic semantic commitments. We define its syntax and semantics and analyze its computational complexity, obtaining the result that S4F Standpoint Logic is not computationally harder than its constituent logics, whether in monotonic or non-monotonic form. We also outline mechanisms for credulous and sceptical acceptance and illustrate the framework with an example.
๐
2025-11-13
๐ ArXiv Preprint (physics.bio-ph)
๐ฅ Tom Pan, Evan Dramko, Mitchell D. Miller, Anastasios Kyrillidis, George N. Phillips
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Protein structure determination has long been one of the primary challenges of structural biology, to which deep machine learning (ML)-based approaches have increasingly been applied. However, these ML models generally do not incorporate the experimental measurements directly, such as X-ray crystallographic diffraction data. To this end, we explore an approach that more tightly couples these traditional crystallographic and recent ML-based methods, by training a hybrid 3-d vision transformer and convolutional network on inputs from both domains. We make use of two distinct input constructs / Patterson maps, which are directly obtainable from crystallographic data, and ``partial structure'' template maps derived from predicted structures deposited in the AlphaFold Protein Structure Database with subsequently omitted residues. With these, we predict electron density maps that are then post-processed into atomic models through standard crystallographic refinement processes. Introducing an initial dataset of small protein fragments taken from Protein Data Bank entries and placing them in hypothetical crystal settings, we demonstrate that our method is effective at both improving the phases of the crystallographic structure factors and completing the regions missing from partial structure templates, as well as improving the agreement of the electron density maps with the ground truth atomic structures.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Thomas Decker, Volker Tresp, Florian Buettner
ๆฅ็ๆ่ฆ
Perturbation-based explanations are widely utilized to enhance the transparency of machine-learning models in practice. However, their reliability is often compromised by the unknown model behavior under the specific perturbations used. This paper investigates the relationship between uncertainty calibration - the alignment of model confidence with actual accuracy - and perturbation-based explanations. We show that models systematically produce unreliable probability estimates when subjected to explainability-specific perturbations and theoretically prove that this directly undermines global and local explanation quality. To address this, we introduce ReCalX, a novel approach to recalibrate models for improved explanations while preserving their original predictions. Empirical evaluations across diverse models and datasets demonstrate that ReCalX consistently reduces perturbation-specific miscalibration most effectively while enhancing explanation robustness and the identification of globally important input features.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Marco Foschini, Marianne Defresne, Emilio Gamba, Bart Bogaerts, Tias Guns
Machine Learning
ๆฅ็ๆ่ฆ
Step-wise explanations can explain logic puzzles and other satisfaction problems by showing how to derive decisions step by step. Each step consists of a set of constraints that derive an assignment to one or more decision variables. However, many candidate explanation steps exist, with different sets of constraints and different decisions they derive. To identify the most comprehensible one, a user-defined objective function is required to quantify the quality of each step. However, defining a good objective function is challenging. Here, interactive preference elicitation methods from the wider machine learning community can offer a way to learn user preferences from pairwise comparisons. We investigate the feasibility of this approach for step-wise explanations and address several limitations that distinguish it from elicitation for standard combinatorial problems. First, because the explanation quality is measured using multiple sub-objectives that can vary a lot in scale, we propose two dynamic normalization techniques to rescale these features and stabilize the learning process. We also observed that many generated comparisons involve similar explanations. For this reason, we introduce MACHOP (Multi-Armed CHOice Perceptron), a novel query generation strategy that integrates non-domination constraints with upper confidence bound-based diversification. We evaluate the elicitation techniques on Sudokus and Logic-Grid puzzles using artificial users, and validate them with a real-user evaluation. In both settings, MACHOP consistently produces higher-quality explanations than the standard approach.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Ignace Bleukx, Maarten Flippo, Bart Bogaerts, Emir Demiroviฤ, Tias Guns
ๆฅ็ๆ่ฆ
In the field of Explainable Constraint Solving, it is common to explain to a user why a problem is unsatisfiable. A recently proposed method for this is to compute a sequence of explanation steps. Such a step-wise explanation shows individual reasoning steps involving constraints from the original specification, that in the end explain a conflict. However, computing a step-wise explanation is computationally expensive, limiting the scope of problems for which it can be used. We investigate how we can use proofs generated by a constraint solver as a starting point for computing step-wise explanations, instead of computing them step-by-step. More specifically, we define a framework of abstract proofs, in which both proofs and step-wise explanations can be represented. We then propose several methods for converting a proof to a step-wise explanation sequence, with special attention to trimming and simplification techniques to keep the sequence and its individual steps small. Our results show our method significantly speeds up the generation of step-wise explanation sequences, while the resulting step-wise explanation has a quality similar to the current state-of-the-art.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Francisco Cunha, Yves Lepage, Zied Bouraoui, Miguel Couceiro
ๆฅ็ๆ่ฆ
Analogical reasoning is a powerful inductive mechanism, widely used in human cognition and increasingly applied in artificial intelligence. Formal frameworks for analogical inference have been developed for Boolean domains, where inference is provably sound for affine functions and approximately correct for functions close to affine. These results have informed the design of analogy-based classifiers. However, they do not extend to regression tasks or continuous domains. In this paper, we revisit analogical inference from a foundational perspective. We first present a counterexample showing that existing generalization bounds fail even in the Boolean setting. We then introduce a unified framework for analogical reasoning in real-valued domains based on parameterized analogies defined via generalized means. This model subsumes both Boolean classification and regression, and supports analogical inference over continuous functions. We characterize the class of analogy-preserving functions in this setting and derive both worst-case and average-case error bounds under smoothness assumptions. Our results offer a general theory of analogical inference across discrete and continuous domains.
๐
2025-11-13
๐ AAAI
๐ฅ Kayla Boggess, Sarit Kraus, Lu Feng
Machine Learning
ๆฅ็ๆ่ฆ
Multi-Agent Reinforcement Learning (MARL) has gained significant interest in recent years, enabling sequential decision-making across multiple agents in various domains. However, most existing explanation methods focus on centralized MARL, failing to address the uncertainty and nondeterminism inherent in decentralized settings. We propose methods to generate policy summarizations that capture task ordering and agent cooperation in decentralized MARL policies, along with query-based explanations for When, Why Not, and What types of user queries about specific agent behaviors. We evaluate our approach across four MARL domains and two decentralized MARL algorithms, demonstrating its generalizability and computational efficiency. User studies show that our summarizations and explanations significantly improve user question-answering performance and enhance subjective ratings on metrics such as understanding and satisfaction.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Mingxing Peng, Ruoyu Yao, Xusen Guo, Jun Ma
Robotics
ๆฅ็ๆ่ฆ
Recent advances in closed-loop planning benchmarks have significantly improved the evaluation of autonomous vehicles. However, existing benchmarks still rely on rule-based reactive agents such as the Intelligent Driver Model (IDM), which lack behavioral diversity and fail to capture realistic human interactions, leading to oversimplified traffic dynamics. To address these limitations, we present nuPlan-R, a new reactive closed-loop planning benchmark that integrates learning-based reactive multi-agent simulation into the nuPlan framework. Our benchmark replaces the rule-based IDM agents with noise-decoupled diffusion-based reactive agents and introduces an interaction-aware agent selection mechanism to ensure both realism and computational efficiency. Furthermore, we extend the benchmark with two additional metrics to enable a more comprehensive assessment of planning performance. Extensive experiments demonstrate that our reactive agent model produces more realistic, diverse, and human-like traffic behaviors, leading to a benchmark environment that better reflects real-world interactive driving. We further reimplement a collection of rule-based, learning-based, and hybrid planning approaches within our nuPlan-R benchmark, providing a clearer reflection of planner performance in complex interactive scenarios and better highlighting the advantages of learning-based planners in handling complex and dynamic scenarios. These results establish nuPlan-R as a new standard for fair, reactive, and realistic closed-loop planning evaluation. We will open-source the code for the new benchmark.
๐
2025-11-13
๐ ArXiv Preprint (cs.MA)
๐ฅ Lifan Zheng, Jiawei Chen, Qinghong Yin, Jingyuan Zhang, Xinyi Zeng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored. i.e., whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7\% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous agents powered by large language models (LLMs) have the potential to significantly enhance human productivity by reasoning, using tools, and executing complex tasks in diverse environments. However, current approaches to developing such agents remain costly and inefficient, as they typically require manually constructed task datasets and reinforcement learning (RL) pipelines with extensive random exploration. These limitations lead to prohibitively high data-construction costs, low exploration efficiency, and poor sample utilization. To address these challenges, we present AgentEvolver, a self-evolving agent system that leverages the semantic understanding and reasoning capabilities of LLMs to drive autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: (i) self-questioning, which enables curiosity-driven task generation in novel environments, reducing dependence on handcrafted datasets; (ii) self-navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; and (iii) self-attributing, which enhances sample efficiency by assigning differentiated rewards to trajectory states and actions based on their contribution. By integrating these mechanisms into a unified framework, AgentEvolver enables scalable, cost-effective, and continual improvement of agent capabilities. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yixi Chen, Weixuan Liang, Tianrui Liu, Jun-Jie Huang, Ao Li et al.
ๆฅ็ๆ่ฆ
Kernel power $k$-means (KPKM) leverages a family of means to mitigate local minima issues in kernel $k$-means. However, KPKM faces two key limitations: (1) the computational burden of the full kernel matrix restricts its use on extensive data, and (2) the lack of authentic centroid-sample assignment learning reduces its noise robustness. To overcome these challenges, we propose RFF-KPKM, introducing the first approximation theory for applying random Fourier features (RFF) to KPKM. RFF-KPKM employs RFF to generate efficient, low-dimensional feature maps, bypassing the need for the whole kernel matrix. Crucially, we are the first to establish strong theoretical guarantees for this combination: (1) an excess risk bound of $\mathcal{O}(\sqrt{k^3/n})$, (2) strong consistency with membership values, and (3) a $(1+\varepsilon)$ relative error bound achievable using the RFF of dimension $\mathrm{poly}(\varepsilon^{-1}\log k)$. Furthermore, to improve robustness and the ability to learn multiple kernels, we propose IP-RFF-MKPKM, an improved possibilistic RFF-based multiple kernel power $k$-means. IP-RFF-MKPKM ensures the scalability of MKPKM via RFF and refines cluster assignments by combining the merits of the possibilistic membership and fuzzy membership. Experiments on large-scale datasets demonstrate the superior efficiency and clustering accuracy of the proposed methods compared to the state-of-the-art alternatives.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
๐
2025-11-13
๐ ArXiv Preprint (cs.SI)
๐ฅ Raj Gaurav Maurya, Vaibhav Shukla, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Misinformation on social media thrives on surprise, emotion, and identity-driven reasoning, often amplified through human cognitive biases. To investigate these mechanisms, we model large language model (LLM) personas as synthetic agents that mimic user-level biases, ideological alignments, and trust heuristics. Within this setup, we introduce an auditor--node framework to simulate and analyze how misinformation evolves as it circulates through networks of such agents. News articles are propagated across networks of persona-conditioned LLM nodes, each rewriting received content. A question--answering-based auditor then measures factual fidelity at every step, offering interpretable, claim-level tracking of misinformation drift. We formalize a misinformation index and a misinformation propagation rate to quantify factual degradation across homogeneous and heterogeneous branches of up to 30 sequential rewrites. Experiments with 21 personas across 10 domains reveal that identity- and ideology-based personas act as misinformation accelerators, especially in politics, marketing, and technology. By contrast, expert-driven personas preserve factual stability. Controlled-random branch simulations further show that once early distortions emerge, heterogeneous persona interactions rapidly escalate misinformation to propaganda-level distortion. Our taxonomy of misinformation severity -- spanning factual errors, lies, and propaganda -- connects observed drift to established theories in misinformation studies. These findings demonstrate the dual role of LLMs as both proxies for human-like biases and as auditors capable of tracing information fidelity. The proposed framework provides an interpretable, empirically grounded approach for studying, simulating, and mitigating misinformation diffusion in digital ecosystems.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kai-Hendrik Cohrs, Zuzanna Osika, Maria Gonzalez-Calabuig, Vishal Nedungadi, Ruben Cartuyvels et al.
ๆฅ็ๆ่ฆ
Geospatial foundation models for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that integrates three complementary signals: out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space and task-specific predictive uncertainty. Applied to burn scar segmentation, SHRUG-FM shows that OOD scores correlate with lower performance in specific environmental conditions, while uncertainty-based flags help discard many poorly performing predictions. Linking these flags to land cover attributes from HydroATLAS shows that failures are not random but concentrated in certain geographies, such as low-elevation zones and large river areas, likely due to underrepresentation in pretraining data. SHRUG-FM provides a pathway toward safer and more interpretable deployment of GFMs in climate-sensitive applications, helping bridge the gap between benchmark performance and real-world reliability.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thales Bezerra, Emanoel Thyago, Kelvin Cunha, Rodrigo Abreu, Fรกbio Papais et al.
Machine Learning
ๆฅ็ๆ่ฆ
AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks, and local model adaptation. The DermAI clinical dataset, encompasses a wide range of skin tones, ethinicity and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chenyi Li, Wanli Ma, Zichen Wang, Zaiwen Wen
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
While large language models (LLMs) have shown progress in mathematical reasoning, they still face challenges in formalizing theorems that arise from instantiating abstract structures in concrete settings. With the goal of auto-formalizing mathematical results at the research level, we develop a framework for structure-to-instance theorem autoformalization (SITA), which systematically bridges the gap between abstract mathematical theories and their concrete applications in Lean proof assistant. Formalized abstract structures are treated as modular templates that contain definitions, assumptions, operations, and theorems. These templates serve as reusable guides for the formalization of concrete instances. Given a specific instantiation, we generate corresponding Lean definitions and instance declarations, integrate them using Lean's typeclass mechanism, and construct verified theorems by checking structural assumptions. We incorporate LLM-based generation with feedback-guided refinement to ensure both automation and formal correctness. Experiments on a dataset of optimization problems demonstrate that SITA effectively formalizes diverse instances grounded in abstract structures.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tomรกลก ฤรญลพek, Martin Balko, Martin Schmid
ๆฅ็ๆ่ฆ
Proof-Number Search is a best-first search algorithm with many successful applications, especially in game solving. As large-scale computing clusters become increasingly accessible, parallelization is a natural way to accelerate computation. However, existing parallel versions of Proof-Number Search are known to scale poorly on many CPU cores. Using two parallelized levels and shared information among workers, we present the first massively parallel version of Proof-Number Search that scales efficiently even on a large number of CPUs. We apply our solver, enhanced with Grundy numbers for reducing game trees, to the Sprouts game, a case study motivated by the long-standing Sprouts Conjecture. Our solver achieves a significantly improved 332.9$\times$ speedup when run on 1024 cores, enabling it to outperform the state-of-the-art Sprouts solver GLOP by four orders of magnitude in runtime and to generate proofs 1,000$\times$ more complex. Despite exponential growth in game tree size, our solver verified the Sprouts Conjecture for 42 new positions, nearly doubling the number of known outcomes.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Guduru Manoj, Neel Prabhanjan Rachamalla, Ashish Kulkarni, Gautam Rajeev, Jay Piplodiya et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu Deng, Baozhu Zhao, Junyan Su, Xiaohan Zhang, Qi Liu
Machine Learning
ๆฅ็ๆ่ฆ
Three-dimensional reconstruction in scenes with extreme depth variations remains challenging due to inconsistent supervisory signals between near-field and far-field regions. Existing methods fail to simultaneously address inaccurate depth estimation in distant areas and structural degradation in close-range regions. This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision to advance 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-field Supervision employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, leverages defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision employing LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforce depth consistency via least squares optimization of reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM not only serves as a language model but also a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model's deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs' reliability without a significant compromise on efficiency.
๐
2025-11-13
๐ AAAI
๐ฅ Belona Sonna, Alban Grastien, Claire Benn
ๆฅ็ๆ่ฆ
Privacy leakage in AI-based decision processes poses significant risks, particularly when sensitive information can be inferred. We propose a formal framework to audit privacy leakage using abductive explanations, which identifies minimal sufficient evidence justifying model decisions and determines whether sensitive information disclosed. Our framework formalizes both individual and system-level leakage, introducing the notion of Potentially Applicable Explanations (PAE) to identify individuals whose outcomes can shield those with sensitive features. This approach provides rigorous privacy guarantees while producing human understandable explanations, a key requirement for auditing tools. Experimental evaluation on the German Credit Dataset illustrates how the importance of sensitive literal in the model decision process affects privacy leakage. Despite computational challenges and simplifying assumptions, our results demonstrate that abductive reasoning enables interpretable privacy auditing, offering a practical pathway to reconcile transparency, model interpretability, and privacy preserving in AI decision-making.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Adrien Lafage, Olivier Laurent, Firas Gabetni, Gianni Franchi
Natural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Deep Neural Networks (DNNs) have demonstrated remarkable performance across various domains, including computer vision and natural language processing. However, they often struggle to accurately quantify the uncertainty of their predictions, limiting their broader adoption in critical real-world applications. Uncertainty Quantification (UQ) for Deep Learning seeks to address this challenge by providing methods to improve the reliability of uncertainty estimates. Although numerous techniques have been proposed, a unified tool offering a seamless workflow to evaluate and integrate these methods remains lacking. To bridge this gap, we introduce Torch-Uncertainty, a PyTorch and Lightning-based framework designed to streamline DNN training and evaluation with UQ techniques and metrics. In this paper, we outline the foundational principles of our library and present comprehensive experimental results that benchmark a diverse set of UQ methods across classification, segmentation, and regression tasks. Our library is available at https://github.com/ENSTA-U2IS-AI/Torch-Uncertainty
๐
2025-11-13
๐ AAAI 2026
๐ฅ Jing He, Han Zhang, Yuanhui Xiao, Wei Guo, Shaowen Yao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Fake news detection methods based on writing style have achieved remarkable progress. However, as adversaries increasingly imitate the style of authentic news, the effectiveness of such approaches is gradually diminishing. Recent research has explored incorporating large language models (LLMs) to enhance fake news detection. Yet, despite their transformative potential, LLMs remain an untapped goldmine for fake news detection, with their real-world adoption hampered by shallow functionality exploration, ambiguous usability, and prohibitive inference costs. In this paper, we propose a novel fake news detection framework, dubbed FactGuard, that leverages LLMs to extract event-centric content, thereby reducing the impact of writing style on detection performance. Furthermore, our approach introduces a dynamic usability mechanism that identifies contradictions and ambiguous cases in factual reasoning, adaptively incorporating LLM advice to improve decision reliability. To ensure efficiency and practical deployment, we employ knowledge distillation to derive FactGuard-D, enabling the framework to operate effectively in cold-start and resource-constrained scenarios. Comprehensive experiments on two benchmark datasets demonstrate that our approach consistently outperforms existing methods in both robustness and accuracy, effectively addressing the challenges of style sensitivity and LLM usability in fake news detection.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Martin Braas, Lukas Esterle
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, yet their applicability to dialogue systems in computer games remains limited. This limitation arises from their substantial hardware requirements, latency constraints, and the necessity to maintain clearly defined knowledge boundaries within a game setting. In this paper, we propose a modular NPC dialogue system that leverages Small Language Models (SLMs), fine-tuned to encode specific NPC personas and integrated with runtime-swappable memory modules. These memory modules preserve character-specific conversational context and world knowledge, enabling expressive interactions and long-term memory without retraining or model reloading during gameplay. We comprehensively evaluate our system using three open-source SLMs: DistilGPT-2, TinyLlama-1.1B-Chat, and Mistral-7B-Instruct, trained on synthetic persona-aligned data and benchmarked on consumer-grade hardware. While our approach is motivated by applications in gaming, its modular design and persona-driven memory architecture hold significant potential for broader adoption in domains requiring expressive, scalable, and memory-rich conversational agents, such as virtual assistants, customer support bots, or interactive educational systems.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Konstantin Soshin, Alexander Krapukhin, Andrei Spiridonov, Denis Shepelev, Gregorii Bukhtuev et al.
Robotics
ๆฅ็ๆ่ฆ
Most existing robotic manipulation benchmarks focus on simplified tabletop scenarios, typically involving a stationary robotic arm interacting with various objects on a flat surface. To address this limitation, we introduce RoboBenchMart, a more challenging and realistic benchmark designed for dark store environments, where robots must perform complex manipulation tasks with diverse grocery items. This setting presents significant challenges, including dense object clutter and varied spatial configurations -- with items positioned at different heights, depths, and in close proximity. By targeting the retail domain, our benchmark addresses a setting with strong potential for near-term automation impact. We demonstrate that current state-of-the-art generalist models struggle to solve even common retail tasks. To support further research, we release the RoboBenchMart suite, which includes a procedural store layout generator, a trajectory generation pipeline, evaluation tools and fine-tuned baseline models.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shahaf S. Shperberg, Natalie Morad, Lior Siag, Ariel Felner, Dor Atzmon
ๆฅ็ๆ่ฆ
Recent advancements in bidirectional heuristic search have yielded significant theoretical insights and novel algorithms. While most previous work has concentrated on optimal search methods, this paper focuses on bounded-suboptimal bidirectional search, where a bound on the suboptimality of the solution cost is specified. We build upon the state-of-the-art optimal bidirectional search algorithm, BAE*, designed for consistent heuristics, and introduce several variants of BAE* specifically tailored for the bounded-suboptimal context. Through experimental evaluation, we compare the performance of these new variants against other bounded-suboptimal bidirectional algorithms as well as the standard weighted A* algorithm. Our results demonstrate that each algorithm excels under distinct conditions, highlighting the strengths and weaknesses of each approach.
๐
2025-11-13
๐ ArXiv Preprint (cs.SE)
๐ฅ Xin Sun, Daniel Stรฅhl, Kristian Sandahl, Christoph Kessler
Machine Learning
ๆฅ็ๆ่ฆ
In recent years, LLMs have been widely integrated into software engineering workflows, supporting tasks like code generation. However, while these models often generate functionally correct outputs, we still lack a systematic understanding and evaluation of their non-functional qualities. Existing studies focus mainly on whether generated code passes the tests rather than whether it passes with quality. Guided by the ISO/IEC 25010 quality model, this study conducted three complementary investigations: a systematic review of 108 papers, two industry workshops with practitioners from multiple organizations, and an empirical analysis of patching real-world software issues using three LLMs. Motivated by insights from both the literature and practitioners, the empirical study examined the quality of generated patches on security, maintainability, and performance efficiency. Across the literature, we found that security and performance efficiency dominate academic attention, while maintainability and other qualities are understudied. In contrast, industry experts prioritize maintainability and readability, warning that generated code may accelerate the accumulation of technical debt. In our evaluation of functionally correct patches generated by three LLMs, improvements in one quality dimension often come at the cost of others. Runtime and memory results further show high variance across models and optimization strategies. Overall, our findings reveal a mismatch between academic focus, industry priorities, and model performance, highlighting the urgent need to integrate quality assurance mechanisms into LLM code generation pipelines to ensure that future generated code not only passes tests but truly passes with quality.
๐
2025-11-13
๐ AAAI
๐ฅ Zhe Xu, Zhicai Wang, Junkang Wu, Jinda Lu, Xiang Wang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Vision-Language Models (LVLMs) often suffer from object hallucination, making erroneous judgments about the presence of objects in images. We propose this primar- ily stems from spurious correlations arising when models strongly associate highly co-occurring objects during train- ing, leading to hallucinated objects influenced by visual con- text. Current benchmarks mainly focus on hallucination de- tection but lack a formal characterization and quantitative evaluation of spurious correlations in LVLMs. To address this, we introduce causal analysis into the object recognition scenario of LVLMs, establishing a Structural Causal Model (SCM). Utilizing the language of causality, we formally de- fine spurious correlations arising from co-occurrence bias. To quantify the influence induced by these spurious correla- tions, we develop Causal-HalBench, a benchmark specifically constructed with counterfactual samples and integrated with comprehensive causal metrics designed to assess model ro- bustness against spurious correlations. Concurrently, we pro- pose an extensible pipeline for the construction of these coun- terfactual samples, leveraging the capabilities of proprietary LVLMs and Text-to-Image (T2I) models for their genera- tion. Our evaluations on mainstream LVLMs using Causal- HalBench demonstrate these models exhibit susceptibility to spurious correlations, albeit to varying extents.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Rajab Aghamov, Christel Baier, Joel Ouaknine, Jakob Piribauer, Mihir Vahanwala et al.
ๆฅ็ๆ่ฆ
Dynamic Bayesian networks (DBNs) are compact graphical representations used to model probabilistic systems where interdependent random variables and their distributions evolve over time. In this paper, we study the verification of the evolution of conditional-independence (CI) propositions against temporal logic specifications. To this end, we consider two specification formalisms over CI propositions: linear temporal logic (LTL), and non-deterministic Bรผchi automata (NBAs). This problem has two variants. Stochastic CI properties take the given concrete probability distributions into account, while structural CI properties are viewed purely in terms of the graphical structure of the DBN. We show that deciding if a stochastic CI proposition eventually holds is at least as hard as the Skolem problem for linear recurrence sequences, a long-standing open problem in number theory. On the other hand, we show that verifying the evolution of structural CI propositions against LTL and NBA specifications is in PSPACE, and is NP- and coNP-hard. We also identify natural restrictions on the graphical structure of DBNs that make the verification of structural CI properties tractable.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Gal Hadar, Forest Agostinelli, Shahaf S. Shperberg
Machine Learning
ๆฅ็ๆ่ฆ
Many sequential decision-making problems can be formulated as shortest-path problems, where the objective is to reach a goal state from a given starting state. Heuristic search is a standard approach for solving such problems, relying on a heuristic function to estimate the cost to the goal from any given state. Recent approaches leverage reinforcement learning to learn heuristics by applying deep approximate value iteration. These methods typically rely on single-step Bellman updates, where the heuristic of a state is updated based on its best neighbor and the corresponding edge cost. This work proposes a generalized approach that enhances both state sampling and heuristic updates by performing limited-horizon searches and updating each state's heuristic based on the shortest path to the search frontier, incorporating both edge costs and the heuristic values of frontier states.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be available in the future.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yongji Zhang, Siqi Li, Kuiyang Huang, Yue Gao, Yu Jiang
ๆฅ็ๆ่ฆ
Fine-Grained Visual Classification (FGVC) remains a challenging task due to subtle inter-class differences and large intra-class variations. Existing approaches typically rely on feature-selection mechanisms or region-proposal strategies to localize discriminative regions for semantic analysis. However, these methods often fail to capture discriminative cues comprehensively while introducing substantial category-agnostic redundancy. To address these limitations, we propose H3Former, a novel token-to-region framework that leverages high-order semantic relations to aggregate local fine-grained representations with structured region-level modeling. Specifically, we propose the Semantic-Aware Aggregation Module (SAAM), which exploits multi-scale contextual cues to dynamically construct a weighted hypergraph among tokens. By applying hypergraph convolution, SAAM captures high-order semantic dependencies and progressively aggregates token features into compact region-level representations. Furthermore, we introduce the Hyperbolic Hierarchical Contrastive Loss (HHCL), which enforces hierarchical semantic constraints in a non-Euclidean embedding space. The HHCL enhances inter-class separability and intra-class consistency while preserving the intrinsic hierarchical relationships among fine-grained categories. Comprehensive experiments conducted on four standard FGVC benchmarks validate the superiority of our H3Former framework.
๐
2025-11-13
๐ ArXiv Preprint (cs.DC)
๐ฅ Leszek Sliwko, Vladimir Getov
ๆฅ็ๆ่ฆ
This paper presents a novel approach to categorization of modern workload schedulers. We provide descriptions of three classes of schedulers: Operating Systems Process Schedulers, Cluster Systems Jobs Schedulers and Big Data Schedulers. We describe their evolution from early adoptions to modern implementations, considering both the use and features of algorithms. In summary, we discuss differences between all presented classes of schedulers and discuss their chronological development. In conclusion we highlight similarities in the focus of scheduling strategies design, applicable to both local and distributed systems.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Oliver Dippel, Alexei Lisitsa, Bei Peng
Natural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Transformers have demonstrated exceptional in-context learning (ICL) capabilities, enabling applications across natural language processing, computer vision, and sequential decision-making. In reinforcement learning, ICL reframes learning as a supervised problem, facilitating task adaptation without parameter updates. Building on prior work leveraging transformers for sequential decision-making, we propose Heuristic Transformer (HT), an in-context reinforcement learning (ICRL) approach that augments the in-context dataset with a belief distribution over rewards to achieve better decision-making. Using a variational auto-encoder (VAE), a low-dimensional stochastic variable is learned to represent the posterior distribution over rewards, which is incorporated alongside an in-context dataset and query states as prompt to the transformer policy. We assess the performance of HT across the Darkroom, Miniworld, and MuJoCo environments, showing that it consistently surpasses comparable baselines in terms of both effectiveness and generalization. Our method presents a promising direction to bridge the gap between belief-based augmentations and transformer-based decision-making.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yongji Zhang, Siqi Li, Yue Gao, Yu Jiang
ๆฅ็ๆ่ฆ
Action Quality Assessment (AQA) aims to evaluate and score sports actions, which has attracted widespread interest in recent years. Existing AQA methods primarily predict scores based on features extracted from the entire video, resulting in limited interpretability and reliability. Meanwhile, existing AQA datasets also lack fine-grained annotations for action scores, especially for deduction items and sub-score annotations. In this paper, we construct the first AQA dataset containing fine-grained sub-score and deduction annotations for aerial skiing, which will be released as a new benchmark. For the technical challenges, we propose a novel AQA method, named JudgeMind, which significantly enhances performance and reliability by simulating the judgment and scoring mindset of professional referees. Our method segments the input action video into different stages and scores each stage to enhance accuracy. Then, we propose a stage-aware feature enhancement and fusion module to boost the perception of stage-specific key regions and enhance the robustness to visual changes caused by frequent camera viewpoints switching. In addition, we propose a knowledge-based grade-aware decoder to incorporate possible deduction items as prior knowledge to predict more accurate and reliable scores. Experimental results demonstrate that our method achieves state-of-the-art performance.
๐
2025-11-13
๐ ArXiv Preprint (Multimedia)
๐ฅ Rizal Khoirul Anam
Computer Vision
ๆฅ็ๆ่ฆ
The proliferation of digital media necessitates robust methods for copyright protection and content authentication. This paper presents a comprehensive comparative study of digital image watermarking techniques implemented using the spatial domain (Least Significant Bit - LSB), the frequency domain (Discrete Fourier Transform - DFT), and a novel hybrid (LSB+DFT) approach. The core objective is to evaluate the trade-offs between imperceptibility (measured by Peak Signal-to-Noise Ratio - PSNR) and robustness (measured by Normalized Correlation - NC and Bit Error Rate - BER). We implemented these three techniques within a unified MATLAB-based experimental framework. The watermarked images were subjected to a battery of common image processing attacks, including JPEG compression, Gaussian noise, and salt-and-pepper noise, at varying intensities. Experimental results generated from standard image datasets (USC-SIPI) demonstrate that while LSB provides superior imperceptibility, it is extremely fragile. The DFT method offers significant robustness at the cost of visual quality. The proposed hybrid LSB+DFT technique, which leverages redundant embedding and a fallback extraction mechanism, is shown to provide the optimal balance, maintaining high visual fidelity while exhibiting superior resilience to all tested attacks.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Vincent Schilling, Akshat Dubey, Georges Hattab
Natural Language Processing
ๆฅ็ๆ่ฆ
Peptide classification tasks, such as predicting toxicity and HIV inhibition, are fundamental to bioinformatics and drug discovery. Traditional approaches rely heavily on handcrafted encodings of one-dimensional (1D) peptide sequences, which can limit generalizability across tasks and datasets. Recently, protein language models (PLMs), such as ESM-2 and ESMFold, have demonstrated strong predictive performance. However, they face two critical challenges. First, fine-tuning is computationally costly. Second, their complex latent representations hinder interpretability for domain experts. Additionally, many frameworks have been developed for specific types of peptide classification, lacking generalization. These limitations restrict the ability to connect model predictions to biologically relevant motifs and structural properties. To address these limitations, we present PepTriX, a novel framework that integrates one dimensional (1D) sequence embeddings and three-dimensional (3D) structural features via a graph attention network enhanced with contrastive training and cross-modal co-attention. PepTriX automatically adapts to diverse datasets, producing task-specific peptide vectors while retaining biological plausibility. After evaluation by domain experts, we found that PepTriX performs remarkably well across multiple peptide classification tasks and provides interpretable insights into the structural and biophysical motifs that drive predictions. Thus, PepTriX offers both predictive robustness and interpretable validation, bridging the gap between performance-driven peptide-level models (PLMs) and domain-level understanding in peptide research.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Minbae Park, Hyemin Yang, Jeonghyun Kim, Kunsoo Park, Hyunjoon Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) demonstrate strong reasoning capabilities but struggle with hallucinations and limited transparency. Recently, KG-enhanced LLMs that integrate knowledge graphs (KGs) have been shown to improve reasoning performance, particularly for complex, knowledge-intensive tasks. However, these methods still face significant challenges, including inaccurate retrieval and reasoning failures, often exacerbated by long input contexts that obscure relevant information or by context constructions that struggle to capture the richer logical directions required by different question types. Furthermore, many of these approaches rely on LLMs to directly retrieve evidence from KGs, and to self-assess the sufficiency of this evidence, which often results in premature or incorrect reasoning. To address the retrieval and reasoning failures, we propose ProgRAG, a multi-hop knowledge graph question answering (KGQA) framework that decomposes complex questions into sub-questions, and progressively extends partial reasoning paths by answering each sub-question. At each step, external retrievers gather candidate evidence, which is then refined through uncertainty-aware pruning by the LLM. Finally, the context for LLM reasoning is optimized by organizing and rearranging the partial reasoning paths obtained from the sub-question answers. Experiments on three well-known datasets demonstrate that ProgRAG outperforms existing baselines in multi-hop KGQA, offering improved reliability and reasoning quality.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Daniel Herbst, Lea Karbeska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.
๐
2025-11-13
๐ AAAI
๐ฅ Jianghan Zhu, Yaoxin Wu, Zhuoyi Lin, Zhengyuan Zhang, Haiyan Yin et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in Neural Combinatorial Optimization (NCO) methods have significantly improved the capability of neural solvers to handle synthetic routing instances. Nonetheless, existing neural solvers typically struggle to generalize effectively from synthetic, uniformly-distributed training data to real-world VRP scenarios, including widely recognized benchmark instances from TSPLib and CVRPLib. To bridge this generalization gap, we present Evolutionary Realistic Instance Synthesis (EvoReal), which leverages an evolutionary module guided by large language models (LLMs) to generate synthetic instances characterized by diverse and realistic structural patterns. Specifically, the evolutionary module produces synthetic instances whose structural attributes statistically mimics those observed in authentic real-world instances. Subsequently, pre-trained NCO models are progressively refined, firstly aligning them with these structurally enriched synthetic distributions and then further adapting them through direct fine-tuning on actual benchmark instances. Extensive experimental evaluations demonstrate that EvoReal markedly improves the generalization capabilities of state-of-the-art neural solvers, yielding a notable reduced performance gap compared to the optimal solutions on the TSPLib (1.05%) and CVRPLib (2.71%) benchmarks across a broad spectrum of problem scales.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuhao Wang, Ziyang Cheng, Heyang Liu, Ronghua Wu, Qunshan Gu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Current end-to-end spoken language models (SLMs) have made notable progress, yet they still encounter considerable response latency. This delay primarily arises from the autoregressive generation of speech tokens and the reliance on complex flow-matching models for speech synthesis. To overcome this, we introduce VocalNet-M2, a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction (MTP) strategy. Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model. Furthermore, our MTP strategy enhances generation efficiency and improves overall performance. Extensive experiments demonstrate that VocalNet-M2 achieves a substantial reduction in first chunk latency (from approximately 725ms to 350ms) while maintaining competitive performance across mainstream SLMs. This work also provides a comprehensive comparison of single-codebook and multi-codebook strategies, offering valuable insights for developing efficient and high-performance SLMs for real-time interactive applications.
๐
2025-11-13
๐ ArXiv Preprint (cs.SD)
๐ฅ Yudong Yang, Xuezhen Zhang, Zhifeng Han, Siyin Wang, Jimin Zhuang et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but exposing new safety risks emerging from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embeds harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which imply unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that, even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits 66% attack success rate in SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing attack success down to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Haolong Xiang, Peisi Wang, Xiaolong Xu, Kun Yi, Xuyun Zhang et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
With rapid urbanization in the modern era, traffic signals from various sensors have been playing a significant role in monitoring the states of cities, which provides a strong foundation in ensuring safe travel, reducing traffic congestion and optimizing urban mobility. Most existing methods for traffic signal modeling often rely on the original data modality, i.e., numerical direct readings from the sensors in cities. However, this unimodal approach overlooks the semantic information existing in multimodal heterogeneous urban data in different perspectives, which hinders a comprehensive understanding of traffic signals and limits the accurate prediction of complex traffic dynamics. To address this problem, we propose a novel \textit{M}ultimodal framework, \textit{MTP}, for urban \textit{T}raffic \textit{P}rofiling, which learns multimodal features through numeric, visual, and textual perspectives. The three branches drive for a multimodal perspective of urban traffic signal learning in the frequency domain, while the frequency learning strategies delicately refine the information for extraction. Specifically, we first conduct the visual augmentation for the traffic signals, which transforms the original modality into frequency images and periodicity images for visual learning. Also, we augment descriptive texts for the traffic signals based on the specific topic, background information and item description for textual learning. To complement the numeric information, we utilize frequency multilayer perceptrons for learning on the original modality. We design a hierarchical contrastive learning on the three branches to fuse the spectrum of three modalities. Finally, extensive experiments on six real-world datasets demonstrate superior performance compared with the state-of-the-art approaches.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Guanrong Li, Xinyu Liu, Zhen Wu, Xinyu Dai
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalized dialogue generation aims to leverage persona profiles and dialogue history to generate persona-relevant and consistent responses. Mainstream models typically rely on token-level language model training with persona dialogue data, such as Next Token Prediction, to implicitly achieve personalization, making these methods tend to neglect the given personas and generate generic responses. To address this issue, we propose a novel Persona-Aware Alignment Framework (PAL), which directly treats persona alignment as the training objective of dialogue generation. Specifically, PAL employs a two-stage training method including Persona-aware Learning and Persona Alignment, equipped with an easy-to-use inference strategy Select then Generate, to improve persona sensitivity and generate more persona-relevant responses at the semantics level. Through extensive experiments, we demonstrate that our framework outperforms many state-of-the-art personalized dialogue methods and large language models.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhikang Xie, Weilin Wan, Peizhu Gong, Weizhong Zhang, Cheng Jin
Natural Language Processing
ๆฅ็ๆ่ฆ
Black-box tuning is an emerging paradigm for adapting large language models (LLMs) to better achieve desired behaviors, particularly when direct access to model parameters is unavailable. Current strategies, however, often present a dilemma of suboptimal extremes: either separately train a small proxy model and then use it to shift the predictions of the foundation model, offering notable efficiency but often yielding limited improvement; or making API calls in each tuning iteration to the foundation model, which entails prohibitive computational costs. Therefore, we propose a novel advanced black-box tuning method for LLMs with limited API calls. Our core strategy involves training a Gaussian Process (GP) surrogate model with "LogitMap Pairs" derived from querying the foundation model on a minimal but highly informative training subset. This surrogate can approximate the outputs of the foundation model to guide the training of the proxy model, thereby effectively reducing the need for direct queries to the foundation model. Extensive experiments verify that our approach elevates pre-trained language model accuracy from 55.92% to 86.85%, reducing the frequency of API queries to merely 1.38%. This significantly outperforms offline approaches that operate entirely without API access. Notably, our method also achieves comparable or superior accuracy to query-intensive approaches, while significantly reducing API costs. This offers a robust and high-efficiency paradigm for language model adaptation.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Cheng Kevin Qu, Andrew Ly, Pulin Gong
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Attention mechanisms underpin the computational power of Transformer models, which have achieved remarkable success across diverse domains. Yet understanding and extending the principles underlying self-attention remains a key challenge for advancing artificial intelligence. Drawing inspiration from the multiscale dynamics of biological attention and from dynamical systems theory, we introduce Fractional Neural Attention (FNA), a principled, neuroscience-inspired framework for multiscale information processing. FNA models token interactions through Lรฉvy diffusion governed by the fractional Laplacian, intrinsically realizing simultaneous short- and long-range dependencies across multiple scales. This mechanism yields greater expressivity and faster information mixing, advancing the foundational capacity of Transformers. Theoretically, we show that FNA's dynamics are governed by the fractional diffusion equation, and that the resulting attention networks exhibit larger spectral gaps and shorter path lengths -- mechanistic signatures of enhanced computational efficiency. Empirically, FNA achieves competitive text-classification performance even with a single layer and a single head; it also improves performance in image processing and neural machine translation. Finally, the diffusion map algorithm from geometric harmonics enables dimensionality reduction of FNA weights while preserving the intrinsic structure of embeddings and hidden states. Together, these results establish FNA as a principled mechanism connecting self-attention, stochastic dynamics, and geometry, providing an interpretable, biologically grounded foundation for powerful, neuroscience-inspired AI.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Stephane Da Silva Martins, Emanuel Aldea, Sylvie Le Hรฉgarat-Mascle
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Multi-agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents' long-term goals and their fine-grained social interactions, which leads to unrealistic multi-agent futures. We propose VISTA, a recursive goal-conditioned transformer for multi-agent trajectory forecasting. VISTA combines (i) a cross-attention fusion module that integrates long-horizon intent with past motion, (ii) a social-token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model turns single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and substantially fewer collisions. On MADRAS, it reduces the average collision rate of strong baselines from 2.14 to 0.03 percent, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it promising for safety-critical autonomous systems.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Outongyi Lv, Yewei Yuan, Nana Liu
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) with limited samples is common in real-world applications. However, offline RL performance under this constraint is often suboptimal. We consider an alternative approach to dealing with limited samples by introducing the Quantum Metric Encoder (QME). In this methodology, instead of applying the RL framework directly on the original states and rewards, we embed the states into a more compact and meaningful representation, where the structure of the encoding is inspired by quantum circuits. For classical data, QME is a classically simulable, trainable unitary embedding and thus serves as a quantum-inspired module, on a classical device. For quantum data in the form of quantum states, QME can be implemented directly on quantum hardware, allowing for training without measurement or re-encoding.
We evaluated QME on three datasets, each limited to 100 samples. We use Soft-Actor-Critic (SAC) and Implicit-Q-Learning (IQL), two well-known RL algorithms, to demonstrate the effectiveness of our approach. From the experimental results, we find that training offline RL agents on QME-embedded states with decoded rewards yields significantly better performance than training on the original states and rewards. On average across the three datasets, for maximum reward performance, we achieve a 116.2% improvement for SAC and 117.6% for IQL.
We further investigate the $ฮ$-hyperbolicity of our framework, a geometric property of the state space known to be important for the RL training efficacy. The QME-embedded states exhibit low $ฮ$-hyperbolicity, suggesting that the improvement after embedding arises from the modified geometry of the state space induced by QME. Thus, the low $ฮ$-hyperbolicity and the corresponding effectiveness of QME could provide valuable information for developing efficient offline RL methods under limited-sample conditions.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tishya Chhabra, Manisha Bajpai, Walter Zesk, Skylar Tibbits
Machine Learning
ๆฅ็ๆ่ฆ
We present an initial evaluation of NASA and IBM's Prithvi-EO-2.0 geospatial foundation model on shoreline delineation of small sandy islands using satellite images. We curated and labeled a dataset of 225 multispectral images of two Maldivian islands, which we publicly release, and fine-tuned both the 300M and 600M parameter versions of Prithvi on training subsets ranging from 5 to 181 images. Our experiments show that even with as few as 5 training images, the models achieve high performance (F1 of 0.94, IoU of 0.79). Our results demonstrate the strong transfer learning capability of Prithvi, underscoring the potential of such models to support coastal monitoring in data-poor regions.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Periklis Mantenoglou, Luigi Bonassi, Enrico Scala, Pedro Zuidberg Dos Martires
ๆฅ็ๆ่ฆ
We study planning in a fragment of PDDL with qualitative state-trajectory constraints, capturing safety requirements, task ordering conditions, and intermediate sub-goals commonly found in real-world problems. A prominent approach to tackle such problems is to compile their constraints away, leading to a problem that is supported by state-of-the-art planners. Unfortunately, existing compilers do not scale on problems with a large number of objects and high-arity actions, as they necessitate grounding the problem before compilation. To address this issue, we propose two methods for compiling away constraints without grounding, making them suitable for large-scale planning problems. We prove the correctness of our compilers and outline their worst-case time complexity. Moreover, we present a reproducible empirical evaluation on the domains used in the latest International Planning Competition. Our results demonstrate that our methods are efficient and produce planning specifications that are orders of magnitude more succinct than the ones produced by compilers that ground the domain, while remaining competitive when used for planning with a state-of-the-art planner.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ J. Javier Alonso-Ramos, Ignacio Aguilera-Martos, Andrรฉs Herrera-Poyatos, Francisco Herrera
Machine Learning
ๆฅ็ๆ่ฆ
The performance of Machine Learning (ML) models, particularly those operating within the Interpretable Artificial Intelligence (Interpretable AI) framework, is significantly affected by the presence of noise in both training and production data. Denoising has therefore become a critical preprocessing step, typically categorized into instance removal and instance correction techniques. However, existing correction approaches often degrade performance or oversimplify the problem by altering the original data distribution. This leads to unrealistic scenarios and biased models, which is particularly problematic in contexts where interpretable AI models are employed, as their interpretability depends on the fidelity of the underlying data patterns. In this paper, we argue that defining noise independently of the solution may be ineffective, as its nature can vary significantly across tasks and datasets. Using a task-specific high quality solution as a reference can provide a more precise and adaptable noise definition. To this end, we propose DenoGrad, a novel Gradient-based instance Denoiser framework that leverages gradients from an accurate Deep Learning (DL) model trained on the target data -- regardless of the specific task -- to detect and adjust noisy samples. Unlike conventional approaches, DenoGrad dynamically corrects noisy instances, preserving problem's data distribution, and improving AI models robustness. DenoGrad is validated on both tabular and time series datasets under various noise settings against the state-of-the-art. DenoGrad outperforms existing denoising strategies, enhancing the performance of interpretable IA models while standing out as the only high quality approach that preserves the original data distribution.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Zou, Runqing Zhang, Xue Zhou, Jianxiao Zou
ๆฅ็ๆ่ฆ
Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, sometimes textual queries cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose the Generation-Enhanced Alignment (GEA) from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and facilitate cross-modal alignment. (2) Generative Intermediate Fusion (GIF), which combines cross-attention between generated images, original images, and text features to generate a unified representation optimized by triplet alignment loss. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA. The results justify the effectiveness of our method. More implementation details and extended results are available at https://github.com/sugelamyd123/Sup-for-GEA.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Mayank Vatsa, Aparna Bharati, Richa Singh
Computer Vision
ๆฅ็ๆ่ฆ
The architectural blueprint of today's leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives-negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Qinfeng Li, Miao Pan, Ke Xiong, Ge Su, Zhiqiang Shen et al.
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) systems deployed over proprietary knowledge bases face growing threats from reconstruction attacks that aggregate model responses to replicate knowledge bases. Such attacks exploit both intra-class and inter-class paths, progressively extracting fine-grained knowledge within topics and diffusing it across semantically related ones, thereby enabling comprehensive extraction of the original knowledge base. However, existing defenses target only one path, leaving the other unprotected. We conduct a systematic exploration to assess the impact of protecting each path independently and find that joint protection is essential for effective defense. Based on this, we propose RAGFort, a structure-aware dual-module defense combining "contrastive reindexing" for inter-class isolation and "constrained cascade generation" for intra-class protection. Experiments across security, performance, and robustness confirm that RAGFort significantly reduces reconstruction success while preserving answer quality, offering comprehensive defense against knowledge base extraction attacks.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Borui Cai, Yao Zhao
Machine Learning
ๆฅ็ๆ่ฆ
We propose a new perspective for approaching artificial general intelligence (AGI) through an intelligence foundation model (IFM). Unlike existing foundation models (FMs), which specialize in pattern learning within specific domains such as language, vision, or time series, IFM aims to acquire the underlying mechanisms of intelligence by learning directly from diverse intelligent behaviors. Vision, language, and other cognitive abilities are manifestations of intelligent behavior; learning from this broad range of behaviors enables the system to internalize the general principles of intelligence. Based on the fact that intelligent behaviors emerge from the collective dynamics of biological neural systems, IFM consists of two core components: a novel network architecture, termed the state neural network, which captures neuron-like dynamic processes, and a new learning objective, neuron output prediction, which trains the system to predict neuronal outputs from collective dynamics. The state neural network emulates the temporal dynamics of biological neurons, allowing the system to store, integrate, and process information over time, while the neuron output prediction objective provides a unified computational principle for learning these structural dynamics from intelligent behaviors. Together, these innovations establish a biologically grounded and computationally scalable foundation for building systems capable of generalization, reasoning, and adaptive learning across domains, representing a step toward truly AGI.
๐
2025-11-13
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Yanchen Deng, Chendong Zhao, Yixuan Li, Bijun Tang, Xinrun Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
The discovery of advanced metallic alloys is hindered by vast composition spaces, competing property objectives, and real-world constraints on manufacturability. Here we introduce MATAI, a generalist machine learning framework for property prediction and inverse design of as-cast alloys. MATAI integrates a curated alloy database, deep neural network-based property predictors, a constraint-aware optimization engine, and an iterative AI-experiment feedback loop. The framework estimates key mechanical propertie, sincluding density, yield strength, ultimate tensile strength, and elongation, directly from composition, using multi-task learning and physics-informed inductive biases. Alloy design is framed as a constrained optimization problem and solved using a bi-level approach that combines local search with symbolic constraint programming. We demonstrate MATAI's capabilities on the Ti-based alloy system, a canonical class of lightweight structural materials, where it rapidly identifies candidates that simultaneously achieve lower density (<4.45 g/cm3), higher strength (>1000 MPa) and appreciable ductility (>5%) through only seven iterations. Experimental validation confirms that MATAI-designed alloys outperform commercial references such as TC4, highlighting the framework's potential to accelerate the discovery of lightweight, high-performance materials under real-world design constraints.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Takehiro Ishikawa
Robotics
ๆฅ็ๆ่ฆ
We investigate how to couple a learnable brain-like'' controller to a cell-like'' Gray--Scott substrate to steer pattern formation with minimal effort. A compact convolutional policy is embedded in a differentiable PyTorch reaction--diffusion simulator, producing spatially smooth, bounded modulations of the feed and kill parameters ($ฮF$, $ฮK$) under a warm--hold--decay gain schedule. Training optimizes Turing-band spectral targets (FFT-based) while penalizing control effort ($\ell_1/\ell_2$) and instability. We compare three regimes: pure reaction--diffusion, NN-dominant, and a hybrid coupling. The hybrid achieves reliable, fast formation of target textures: 100% strict convergence in $\sim 165$ steps, matching cell-only spectral selectivity (0.436 vs.\ 0.434) while using $\sim 15\times$ less $\ell_1$ effort and $>200\times$ less $\ell_2$ power than NN-dominant control. An amplitude sweep reveals a non-monotonic Goldilocks'' zone ($A \approx 0.03$--$0.045$) that yields 100\% quasi convergence in 94--96 steps, whereas weaker or stronger gains fail to converge or degrade selectivity. These results quantify morphological computation: the controller seeds then cedes,'' providing brief, sparse nudges that place the system in the correct basin of attraction, after which local physics maintains the pattern. The study offers a practical recipe for building steerable, robust, and energy-efficient embodied systems that exploit an optimal division of labor between centralized learning and distributed self-organization.
๐
2025-11-13
๐ Published in the meeting proceedings of the 2025 I
๐ฅ Satu Johansson, Taneli Riihonen
Natural Language Processing
ๆฅ็ๆ่ฆ
In this paper, military use cases or applications and implementation thereof are considered for natural language processing and large language models, which have broken into fame with the invention of the generative pre-trained transformer (GPT) and the extensive foundation model pretraining done by OpenAI for ChatGPT and others. First, we interrogate a GPT-based language model (viz. Microsoft Copilot) to make it reveal its own knowledge about their potential military applications and then critically assess the information. Second, we study how commercial cloud services (viz. Microsoft Azure) could be used readily to build such applications and assess which of them are feasible. We conclude that the summarization and generative properties of language models directly facilitate many applications at large and other features may find particular uses.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Abu Sufian, Cosimo Distante, Marco Leo, Hanan Salam
Machine Learning
ๆฅ็ๆ่ฆ
Text-to-image (T2I) generative models are largely used in AI-powered real-world applications and value creation. However, their strategic deployment raises critical concerns for responsible AI management, particularly regarding the reproduction and amplification of race- and gender-related stereotypes that can undermine organizational ethics. In this work, we investigate whether such societal biases are systematically encoded within the pretrained latent spaces of state-of-the-art T2I models. We conduct an empirical study across the five most popular open-source models, using ten neutral, profession-related prompts to generate 100 images per profession, resulting in a dataset of 5,000 images evaluated by diverse human assessors representing different races and genders. We demonstrate that all five models encode and amplify pronounced societal skew: caregiving and nursing roles are consistently feminized, while high-status professions such as corporate CEO, politician, doctor, and lawyer are overwhelmingly represented by males and mostly White individuals. We further identify model-specific patterns, such as QWEN-Image's near-exclusive focus on East Asian outputs, Kandinsky's dominance of White individuals, and SDXL's comparatively broader but still biased distributions. These results provide critical insights for AI project managers and practitioners, enabling them to select equitable AI models and customized prompts that generate images in alignment with the principles of responsible AI. We conclude by discussing the risks of these biases and proposing actionable strategies for bias mitigation in building responsible GenAI systems.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leonardo Pesce, Jiawen Wei, Gianmarco Mengaldo
Machine Learning
ๆฅ็ๆ่ฆ
Post-hoc explainability methods are a subset of Machine Learning (ML) that aim to provide a reason for why a model behaves in a certain way. In this paper, we show a new black-box model-agnostic adversarial attack for post-hoc explainable Artificial Intelligence (XAI), particularly in the image domain. The goal of the attack is to modify the original explanations while being undetected by the human eye and maintain the same predicted class. In contrast to previous methods, we do not require any access to the model or its weights, but only to the model's computed predictions and explanations. Additionally, the attack is accomplished in a single step while significantly changing the provided explanations, as demonstrated by empirical evaluation. The low requirements of our method expose a critical vulnerability in current explainability methods, raising concerns about their reliability in safety-critical applications. We systematically generate attacks based on the explanations generated by post-hoc explainability methods (saliency maps, integrated gradients, and DeepLIFT SHAP) for pretrained ResNet-18 and ViT-B16 on ImageNet. The results show that our attacks could lead to dramatically different explanations without changing the predictive probabilities. We validate the effectiveness of our attack, compute the induced change based on the explanation with mean absolute difference, and verify the closeness of the original image and the corrupted one with the Structural Similarity Index Measure (SSIM).
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Haidong Huang, Haiyue Zhu. Jiayu Song, Xixin Zhao, Yaohua Zhou, Jiayi Zhang et al.
Natural Language ProcessingRoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9\% absolute improvement over Uni-O4 on locomotion tasks and +12.4\% on dexterous manipulation, demonstrating strong generalization and scalability.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuxuan Zhou, Yubin Wang, Bin Wang, Chen Ning, Xien Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design a attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along three key facets, and refines its responses to better align with the requirements of each facet. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model's context-awareness ability. Evaluation results on the latest HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains in the context-awareness axis. Furthermore, by incorporating knowledge distillation with the proposed method, the performance of a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new SOTA across all open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at https://muser-llm.github.io.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bodong Du, Honglong Yang, Xiaomeng Li
Machine Learning
ๆฅ็ๆ่ฆ
Radiologists compose diagnostic reports through a structured workflow: they describe visual findings, summarize them into impressions, and carefully refine statements in clinically critical cases. However, most existing medical report generation (MRG) systems treat reports as flat sequences, overlooking this hierarchical organization and leading to inconsistencies between descriptive and diagnostic content. To align model behavior with real-world reporting practices, we propose RadFlow, a hierarchical workflow-guided reinforcement optimization framework that explicitly models the structured nature of clinical reporting. RadFlow introduces a clinically grounded reward hierarchy that mirrors the organization of radiological reports. At the global level, the reward integrates linguistic fluency, medical-domain correctness, and cross-sectional consistency between Finding and Impression, promoting coherent and clinically faithful narratives. At the local level, a section-specific reward emphasizes Impression quality, reflecting its central role in diagnostic accuracy. Furthermore, a critical-aware policy optimization mechanism adaptively regularizes learning for high-risk or clinically sensitive cases, emulating the cautious refinement behavior of radiologists when documenting critical findings. Together, these components translate the structured reporting paradigm into the reinforcement fine-tuning process, enabling the model to generate reports that are both linguistically consistent and clinically aligned. Experiments on chest X-ray and carotid ultrasound datasets demonstrate that RadFlow consistently improves diagnostic coherence and overall report quality compared with state-of-the-art baselines.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Luming Yang, Haoxian Liu, Siqing Li, Alper Yilmaz
ๆฅ็ๆ่ฆ
Fine-grained action evaluation in medical vision faces unique challenges due to the unavailability of comprehensive datasets, stringent precision requirements, and insufficient spatiotemporal dynamic modeling of very rapid actions. To support development and evaluation, we introduce CPREval-6k, a multi-view, multi-label medical action benchmark containing 6,372 expert-annotated videos with 22 clinical labels. Using this dataset, we present GaussMedAct, a multivariate Gaussian encoding framework, to advance medical motion analysis through adaptive spatiotemporal representation learning. Multivariate Gaussian Representation projects the joint motions to a temporally scaled multi-dimensional space, and decomposes actions into adaptive 3D Gaussians that serve as tokens. These tokens preserve motion semantics through anisotropic covariance modeling while maintaining robustness to spatiotemporal noise. Hybrid Spatial Encoding, employing a Cartesian and Vector dual-stream strategy, effectively utilizes skeletal information in the form of joint and bone features. The proposed method achieves 92.1% Top-1 accuracy with real-time inference on the benchmark, outperforming the ST-GCN baseline by +5.9% accuracy with only 10% FLOPs. Cross-dataset experiments confirm the superiority of our method in robustness.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yun Wang, Lingyun Yang, Senhao Yu, Yixiao Wang, Ruixing Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) architectures scale language models by activating only a subset of specialized expert networks for each input token, thereby reducing the number of floating-point operations. However, the growing size of modern MoE models causes their full parameter sets to exceed GPU memory capacity; for example, Mixtral-8x7B has 45 billion parameters and requires 87 GB of memory even though only 14 billion parameters are used per token. Existing systems alleviate this limitation by offloading inactive experts to CPU memory, but transferring experts across the PCIe interconnect incurs significant latency (about 10 ms). Prefetching heuristics aim to hide this latency by predicting which experts are needed, but prefetch failures introduce significant stalls and amplify inference latency. In the event of a prefetch failure, prior work offers two primary solutions: either fetch the expert on demand, which incurs a long stall due to the PCIe bottleneck, or drop the expert from the computation, which significantly degrades model accuracy. The critical challenge, therefore, is to maintain both high inference speed and model accuracy when prefetching fails.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Ziheng Li, Hengyi Cai, Xiaochi Wei, Yuchen Li, Shuaiqiang Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
While large language models (LLMs) demonstrate emerging reasoning capabilities, current inference-time expansion methods incur prohibitive computational costs by exhaustive sampling. Through analyzing decoding trajectories, we observe that most next-token predictions align well with the golden output, except for a few critical tokens that lead to deviations. Inspired by this phenomenon, we propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components: 1) a hinter (powerful LLM) that provides probabilistic guidance at critical decision points, and 2) a practitioner (efficient smaller model) that executes major reasoning steps. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), a theoretically-grounded metric that dynamically identifies intervention points by quantifying the divergence between practitioner's reasoning trajectory and hinter's expected distribution in a tree-structured probabilistic space. Through iterative tree updates guided by DIR, HPR reweights promising reasoning paths while deprioritizing low-probability branches. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs: it achieves comparable performance to self-consistency and MCTS baselines while decoding only 1/5 tokens, and outperforms existing methods by at most 5.1% absolute accuracy while maintaining similar or lower FLOPs.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xiaolong Wei, Yuehu Dong, Xingliang Wang, Xingyu Zhang, Zhejun Zhao et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner's tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Vijay Keswani, Cyrus Cousins, Breanna Nguyen, Vincent Conitzer, Hoda Heidari et al.
ๆฅ็ๆ่ฆ
Alignment methods in moral domains seek to elicit moral preferences of human stakeholders and incorporate them into AI. This presupposes moral preferences as static targets, but such preferences often evolve over time. Proper alignment of AI to dynamic human preferences should ideally account for "legitimate" changes to moral reasoning, while ignoring changes related to attention deficits, cognitive biases, or other arbitrary factors. However, common AI alignment approaches largely neglect temporal changes in preferences, posing serious challenges to proper alignment, especially in high-stakes applications of AI, e.g., in healthcare domains, where misalignment can jeopardize the trustworthiness of the system and yield serious individual and societal harms. This work investigates the extent to which people's moral preferences change over time, and the impact of such changes on AI alignment. Our study is grounded in the kidney allocation domain, where we elicit responses to pairwise comparisons of hypothetical kidney transplant patients from over 400 participants across 3-5 sessions. We find that, on average, participants change their response to the same scenario presented at different times around 6-20% of the time (exhibiting "response instability"). Additionally, we observe significant shifts in several participants' retrofitted decision-making models over time (capturing "model instability"). The predictive performance of simple AI models decreases as a function of both response and model instability. Moreover, predictive performance diminishes over time, highlighting the importance of accounting for temporal changes in preferences during training. These findings raise fundamental normative and technical challenges relevant to AI alignment, highlighting the need to better understand the object of alignment (what to align to) when user preferences change significantly over time.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruichu Cai, Xiaokai Huang, Wei Chen, Zijian Li, Zhifeng Hao
ๆฅ็ๆ่ฆ
Inferring causal relationships from observed data is an important task, yet it becomes challenging when the data is subject to various external interferences. Most of these interferences are the additional effects of external factors on observed variables. Since these external factors are often unknown, we introduce latent variables to represent these unobserved factors that affect the observed data. Specifically, to capture the causal strength and adjacency information, we propose a new temporal latent variable structural causal model, incorporating causal strength and adjacency coefficients that represent the causal relationships between variables. Considering that expert knowledge can provide information about unknown interferences in certain scenarios, we develop a method that facilitates the incorporation of prior knowledge into parameter learning based on Variational Inference, to guide the model estimation. Experimental results demonstrate the stability and accuracy of our proposed method.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Risha Surana, Qinyuan Ye, Swabha Swayamdipta
Natural Language Processing
ๆฅ็ๆ่ฆ
Emergency responders managing hazardous material HAZMAT incidents face critical, time-sensitive decisions, manually navigating extensive chemical guidelines. We investigate whether today's language models can assist responders by rapidly and reliably understanding critical information, identifying hazards, and providing recommendations.We introduce the Chemical Emergency Response Evaluation Framework (ChEmREF), a new benchmark comprising questions on 1,035 HAZMAT chemicals from the Emergency Response Guidebook and the PubChem Database. ChEmREF is organized into three tasks: (1) translation of chemical representation between structured and unstructured forms (e.g., converting C2H6O to ethanol), (2) emergency response generation (e.g., recommending appropriate evacuation distances) and (3) domain knowledge question answering from chemical safety and certification exams. Our best evaluated models received an exact match of 68.0% on unstructured HAZMAT chemical representation translation, a LLM Judge score of 52.7% on incident response recommendations, and a multiple-choice accuracy of 63.9% on HAMZAT examinations.These findings suggest that while language models show potential to assist emergency responders in various tasks, they require careful human oversight due to their current limitations.
๐
2025-11-13
๐ ArXiv Preprint (eess.IV)
๐ฅ Farzan Saeedi, Sanaz Keshvari, Nasser Shoeibi
Machine Learning
ๆฅ็ๆ่ฆ
This paper encompasses an in-depth examination of Retinopathy of Prematurity (ROP) diagnosis, employing advanced deep learning methodologies. Our focus centers on refining and evaluating CNN-based approaches for precise and efficient ROP detection. We navigate the complexities of dataset curation, preprocessing strategies, and model architecture, aligning with research objectives encompassing model effectiveness, computational cost analysis, and time complexity assessment. Results underscore the supremacy of tailored CNN models over pre-trained counterparts, evident in heightened accuracy and F1-scores. Implementation of a voting system further enhances performance. Additionally, our study reveals the potential of the proposed customized CNN model to alleviate computational burdens associated with deep neural networks. Furthermore, we showcase the feasibility of deploying these models within dedicated software and hardware configurations, highlighting their utility as valuable diagnostic aids in clinical settings. In summary, our discourse significantly contributes to ROP diagnosis, unveiling the efficacy of deep learning models in enhancing diagnostic precision and efficiency.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxin Jiang, Wei Luo, Hui Zhang, Qiyu Chen, Haiming Yao et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignment between synthesized anomalies and their masks, thereby bolstering downstream anomaly detection accuracy. To facilitate training, we introduce AnomVerse, a collection of 12,987 anomaly-mask-caption triplets assembled from 13 publicly available datasets, where captions are automatically generated by multimodal large language models using structured visual prompts and template-based textual hints. Extensive experiments demonstrate that Anomagic trained on AnomVerse can synthesize more realistic and varied anomalies than prior methods, yielding superior improvements in downstream anomaly detection. Furthermore, Anomagic can generate anomalies for any normal-category image using user-defined prompts, establishing a versatile foundation model for anomaly generation.
๐
2025-11-13
๐ ArXiv Preprint (q-bio.QM)
๐ฅ Guofeng Meng, Li Shen, Qiuyan Zhong, Wei Wang, Haizhou Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are rapidly transforming various domains, including biomedicine and healthcare, and demonstrate remarkable potential from scientific research to new drug discovery. Graph-based retrieval-augmented generation (RAG) systems, as a useful application of LLMs, can improve contextual reasoning through structured entity and relationship identification from long-context knowledge, e.g. biomedical literature. Even though many advantages over naive RAGs, most of graph-based RAGs are computationally intensive, which limits their application to large-scale dataset. To address this issue, we introduce fastbmRAG, an fast graph-based RAG optimized for biomedical literature. Utilizing well organized structure of biomedical papers, fastbmRAG divides the construction of knowledge graph into two stages, first drafting graphs using abstracts; and second, refining them using main texts guided by vector-based entity linking, which minimizes redundancy and computational load. Our evaluations demonstrate that fastbmRAG is over 10x faster than existing graph-RAG tools and achieve superior coverage and accuracy to input knowledge. FastbmRAG provides a fast solution for quickly understanding, summarizing, and answering questions about biomedical literature on a large scale. FastbmRAG is public available in https://github.com/menggf/fastbmRAG.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li et al.
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aleksandr Razin, Danil Kazantsev, Ilya Makarov
ๆฅ็ๆ่ฆ
Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Edward Kim, Devan Shanker, Varun Bharadwaj, Hongbeen Park, Jinkyu Kim et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Simulation-based testing has become a crucial complement to road testing for ensuring the safety of cyber physical systems (CPS). As a result, significant research efforts have been directed toward identifying failure scenarios within simulation environments. However, a critical question remains. Are the AV failure scenarios discovered in simulation reproducible on actual systems in the real world? The sim-to-real gap caused by differences between simulated and real sensor data means that failure scenarios identified in simulation might either be artifacts of synthetic sensor data or actual issues that also occur with real sensor data. To address this, an effective approach to validating simulated failure scenarios is to locate occurrences of these scenarios within real-world datasets and verify whether the failure persists on the datasets. To this end, we introduce a formal definition of how labeled time series sensor data can match an abstract scenario, represented as a scenario program using the Scenic probabilistic programming language. We present a querying algorithm that, given a scenario program and a labeled dataset, identifies the subset of data that matches the specified scenario. Our experiment shows that our algorithm is more accurate and orders of magnitude faster in querying scenarios than the state-of-the-art commercial vision large language models, and can scale with the duration of queried time series data.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shruti Singh Baghel, Yash Pratap Singh Rathore, Sushovan Jena, Anurag Pradhan, Amit Shukla et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions but their high memory, computation, and deployment demands hinder practical use particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor), and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zack Dewis, Yimin Zhu, Zhengsen Xu, Mabel Heffring, Saeid Taleghanidoozdoozan et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Although Sentinel-2 based land use and land cover (LULC) classification is critical for various environmental monitoring applications, it is a very difficult task due to some key data challenges (e.g., spatial heterogeneity, context information, signature ambiguity). This paper presents a novel Multitask Glocal OBIA-Mamba (MSOM) for enhanced Sentinel-2 classification with the following contributions. First, an object-based image analysis (OBIA) Mamba model (OBIA-Mamba) is designed to reduce redundant computation without compromising fine-grained details by using superpixels as Mamba tokens. Second, a global-local (GLocal) dual-branch convolutional neural network (CNN)-mamba architecture is designed to jointly model local spatial detail and global contextual information. Third, a multitask optimization framework is designed to employ dual loss functions to balance local precision with global consistency. The proposed approach is tested on Sentinel-2 imagery in Alberta, Canada, in comparison with several advanced classification approaches, and the results demonstrate that the proposed approach achieves higher classification accuracy and finer details that the other state-of-the-art methods.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yen Nhi Truong Vu, Dan Guo, Sripad Joshi, Harshit Kumar, Jason Su et al.
Machine Learning
ๆฅ็ๆ่ฆ
Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose M&M-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, M&M. M&M-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in M&M without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, M&M-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in high-data regime. On the popular BCS-DBT benchmark, M&M-3D outperforms previous top baseline by 4% for classification and 10% for localization.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Rishi Singhal, Jung-Eun Kim
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further precisely identify that early layers LayerNorm are the most critical over middle/later layers and their influence varies across Pre and Post LayerNorm models. We have validated it through 13 models across 6 Vision and Language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo et al.
MultimodalComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Isabela Albuquerque, Ira Ktena, Olivia Wiles, Ivana Kajiฤ, Amal Rannen-Triki et al.
Computer Vision
ๆฅ็ๆ่ฆ
Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests.
Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenqing Wang, Haosen Yang, Josef Kittler, Xiatian Zhu
ๆฅ็ๆ่ฆ
Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model scenes and backgrounds separately aim to introduce parametric human priors. However, these approaches either neglect distinct motion characteristics of various components in scene especially human, leading to incomplete reconstructions, or ignore the information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose {\bf Separate-then-Map} (StM) strategy that introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method employs a shared transformation function for each Gaussian attribute to unify separately modeled components, enhancing computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kaiwen Zhou et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, Semantic-guided Dual Visual Pruner (SD-Pruner) performs: Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP; Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold.SemanticVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/SemanticVLA
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yusuf Talha Basak, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim
ๆฅ็ๆ่ฆ
Although Total Variation (TV) performs well in noise reduction and edge preservation on images, its dependence on the lambda parameter limits its efficiency and makes it difficult to use effectively. In this study, we present a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a data-driven Lambda Mapping Network (LambdaNet) predicting a per-pixel regularization map. The pipeline is trained end-to-end so that reconstruction and regularization are optimized jointly, yielding spatially adaptive smoothing: strong in homogeneous regions, relaxed near anatomical boundaries. Experiments on the DeepLesion dataset, using a realistic noise model adapted from the LoDoPaB-CT methodology, show consistent gains over classical TV and FBP+U-Net: +2.9 dB PSNR and +6% SSIM on average. LTV provides an interpretable alternative to black-box CNNs and a basis for 3D and data-consistency-driven reconstruction. Our codes are available at: https://github.com/itu-biai/deep_tv_for_ldct
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Oded Schlesinger, Amirhossein Farzam, J. Matias Di Martino, Guillermo Sapiro
Natural Language Processing
ๆฅ็ๆ่ฆ
While Vision Transformers (ViT) have demonstrated remarkable performance across diverse tasks, their computational demands are substantial, scaling quadratically with the number of processed tokens. Compact attention representations, reflecting token interaction distributions, can guide early detection and reduction of less salient tokens prior to attention computation. Motivated by this, we present SParsification with attentiOn dynamics via Token relevance (SPOT), a framework for early detection of redundant tokens within ViTs that leverages token embeddings, interactions, and attention dynamics across layers to infer token importance, resulting in a more context-aware and interpretable relevance detection process. SPOT informs token sparsification and facilitates the elimination of such tokens, improving computational efficiency without sacrificing performance. SPOT employs computationally lightweight predictors that can be plugged into various ViT architectures and learn to derive effective input-specific token prioritization across layers. Its versatile design supports a range of performance levels adaptable to varying resource constraints. Empirical evaluations demonstrate significant efficiency gains of up to 40% compared to standard ViTs, while maintaining or even improving accuracy. Code and models are available at https://github.com/odedsc/SPOT .
๐
2025-11-13
๐ Submitted to IEEE ISBI 2026
๐ฅ Tejas Sudharshan Mathai, Anisa V. Prasad, Xinya Wang, Praveen T. S. Balamuralikrishna, Yan Zhuang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Type 2 Diabetes Mellitus (T2DM) is a chronic metabolic disease that affects millions of people worldwide. Early detection is crucial as it can alter pancreas function through morphological changes and increased deposition of ectopic fat, eventually leading to organ damage. While studies have shown an association between T2DM and pancreas volume and fat content, the role of increased pancreatic surface lobularity (PSL) in patients with T2DM has not been fully investigated. In this pilot work, we propose a fully automated approach to delineate the pancreas and other abdominal structures, derive CT imaging biomarkers, and opportunistically screen for T2DM. Four deep learning-based models were used to segment the pancreas in an internal dataset of 584 patients (297 males, 437 non-diabetic, age: 45$\pm$15 years). PSL was automatically detected and it was higher for diabetic patients (p=0.01) at 4.26 $\pm$ 8.32 compared to 3.19 $\pm$ 3.62 for non-diabetic patients. The PancAP model achieved the highest Dice score of 0.79 $\pm$ 0.17 and lowest ASSD error of 1.94 $\pm$ 2.63 mm (p$<$0.05). For predicting T2DM, a multivariate model trained with CT biomarkers attained 0.90 AUC, 66.7\% sensitivity, and 91.9\% specificity. Our results suggest that PSL is useful for T2DM screening and could potentially help predict the early onset of T2DM.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ รaฤrฤฑ Eser, Zeynep Sonat Baltacฤฑ, Emre Akbaล, Sinan Kalkan
Machine Learning
ๆฅ็ๆ่ฆ
Imbalance in classification tasks is commonly quantified by the cardinalities of examples across classes. This, however, disregards the presence of redundant examples and inherent differences in the learning difficulties of classes. Alternatively, one can use complex measures such as training loss and uncertainty, which, however, depend on training a machine learning model. Our paper proposes using data Intrinsic Dimensionality (ID) as an easy-to-compute, model-free measure of imbalance that can be seamlessly incorporated into various imbalance mitigation methods. Our results across five different datasets with a diverse range of imbalance ratios show that ID consistently outperforms cardinality-based re-weighting and re-sampling techniques used in the literature. Moreover, we show that combining ID with cardinality can further improve performance. Code: https://github.com/cagries/IDIM.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Simon Donike, Cesar Aybar, Julio Contreras, Luis Gรณmez-Chova
ๆฅ็ๆ่ฆ
We present OpenSR-SRGAN, an open and modular framework for single-image super-resolution in Earth Observation. The software provides a unified implementation of SRGAN-style models that is easy to configure, extend, and apply to multispectral satellite data such as Sentinel-2. Instead of requiring users to modify model code, OpenSR-SRGAN exposes generators, discriminators, loss functions, and training schedules through concise configuration files, making it straightforward to switch between architectures, scale factors, and band setups. The framework is designed as a practical tool and benchmark implementation rather than a state-of-the-art model. It ships with ready-to-use configurations for common remote sensing scenarios, sensible default settings for adversarial training, and built-in hooks for logging, validation, and large-scene inference. By turning GAN-based super-resolution into a configuration-driven workflow, OpenSR-SRGAN lowers the entry barrier for researchers and practitioners who wish to experiment with SRGANs, compare models in a reproducible way, and deploy super-resolution pipelines across diverse Earth-observation datasets.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Willem Bonnaffรฉ, Yang Hu, Andrea Chatrian, Mengran Fan, Stefano Malacrino et al.
ๆฅ็ๆ่ฆ
Histopathologists establish cancer grade by assessing histological structures, such as glands in prostate cancer. Yet, digital pathology pipelines often rely on grid-based tiling that ignores tissue architecture. This introduces irrelevant information and limits interpretability. We introduce histology-informed tiling (HIT), which uses semantic segmentation to extract glands from whole slide images (WSIs) as biologically meaningful input patches for multiple-instance learning (MIL) and phenotyping. Trained on 137 samples from the ProMPT cohort, HIT achieved a gland-level Dice score of 0.83 +/- 0.17. By extracting 380,000 glands from 760 WSIs across ICGC-C and TCGA-PRAD cohorts, HIT improved MIL models AUCs by 10% for detecting copy number variation (CNVs) in genes related to epithelial-mesenchymal transitions (EMT) and MYC, and revealed 15 gland clusters, several of which were associated with cancer relapse, oncogenic mutations, and high Gleason. Therefore, HIT improved the accuracy and interpretability of MIL predictions, while streamlining computations by focussing on biologically meaningful structures during feature extraction.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniele Perlo, Vladimir Despotovic, Selma Boudissa, Sang-Yoon Kim, Petr Nazarov et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We introduce a curated video dataset of laboratory rodents for automatic detection of convulsive events. The dataset contains short (10~s) top-down and side-view video clips of individual rodents, labeled at clip level as normal activity or seizure. It includes 10,101 negative samples and 2,952 positive samples collected from 19 subjects. We describe the data curation, annotation protocol and preprocessing pipeline, and report baseline experiments using a transformer-based video classifier (TimeSformer). Experiments employ five-fold cross-validation with strict subject-wise partitioning to prevent data leakage (no subject appears in more than one fold). Results show that the TimeSformer architecture enables discrimination between seizure and normal activity with an average F1-score of 97%. The dataset and baseline code are publicly released to support reproducible research on non-invasive, video-based monitoring in preclinical epilepsy research. RodEpil Dataset access - DOI: 10.5281/zenodo.17601357
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alomar Antonia, Rubio Ricardo, Albaiges Gerard, Salort-Benejam Laura, Caminal Julia et al.
Machine Learning
ๆฅ็ๆ่ฆ
Acquiring standard facial planes during routine fetal ultrasound (US) examinations is often challenging due to fetal movement, variability in orientation, and operator-dependent expertise. These factors contribute to inconsistencies, increased examination time, and potential diagnostic bias.
To address these challenges in the context of facial assessment, we present: 1) GT++, a robust algorithm that estimates standard facial planes from 3D US volumes using annotated anatomical landmarks; and 2) 3DFETUS, a deep learning model that automates and standardizes their localization in 3D fetal US volumes.
We evaluated our methods both qualitatively, through expert clinical review, and quantitatively. The proposed approach achieved a mean translation error of 4.13 mm and a mean rotation error of 7.93 degrees per plane, outperforming other state-of-the-art methods on 3D US volumes. Clinical assessments further confirmed the effectiveness of both GT++ and 3DFETUS, demonstrating statistically significant improvements in plane estimation accuracy.
๐
2025-11-13
๐ Journal resubmission
๐ฅ Yaru Li, Yanxue Wang, Meng Li, Xinming Li, Jianbo Feng
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
The health condition of wind turbine (WT) components is crucial for ensuring stable and reliable operation. However, existing fault detection methods are largely limited to visual recognition, producing structured outputs that lack semantic interpretability and fail to support maintenance decision-making. To address these limitations, this study proposes an integrated framework that combines YOLOMS with a large language model (LLM) for intelligent fault analysis and diagnosis. Specifically, YOLOMS employs multi-scale detection and sliding-window cropping to enhance fault feature extraction, while a lightweight key-value (KV) mapping module bridges the gap between visual outputs and textual inputs. This module converts YOLOMS detection results into structured textual representations enriched with both qualitative and quantitative attributes. A domain-tuned LLM then performs semantic reasoning to generate interpretable fault analyses and maintenance recommendations. Experiments on real-world datasets demonstrate that the proposed framework achieves a fault detection accuracy of 90.6\% and generates maintenance reports with an average accuracy of 89\%, thereby improving the interpretability of diagnostic results and providing practical decision support for the operation and maintenance of wind turbines.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Oussema Dhaouadi, Johannes Meier, Jacques Kaiser, Daniel Cremers
Machine Learning
ๆฅ็ๆ่ฆ
Digital Terrain Models (DTMs) represent the bare-earth elevation and are important in numerous geospatial applications. Such data models cannot be directly measured by sensors and are typically generated from Digital Surface Models (DSMs) derived from LiDAR or photogrammetry. Traditional filtering approaches rely on manually tuned parameters, while learning-based methods require well-designed architectures, often combined with post-processing. To address these challenges, we introduce Ground Diffusion (GrounDiff), the first diffusion-based framework that iteratively removes non-ground structures by formulating the problem as a denoising task. We incorporate a gated design with confidence-guided generation that enables selective filtering. To increase scalability, we further propose Prior-Guided Stitching (PrioStitch), which employs a downsampled global prior automatically generated using GrounDiff to guide local high-resolution predictions. We evaluate our method on the DSM-to-DTM translation task across diverse datasets, showing that GrounDiff consistently outperforms deep learning-based state-of-the-art methods, reducing RMSE by up to 93% on ALS2DTM and up to 47% on USGS benchmarks. In the task of road reconstruction, which requires both high precision and smoothness, our method achieves up to 81% lower distance error compared to specialized techniques on the GeRoD benchmark, while maintaining competitive surface smoothness using only DSM inputs, without task-specific optimization. Our variant for road reconstruction, GrounDiff+, is specifically designed to produce even smoother surfaces, further surpassing state-of-the-art methods. The project page is available at https://deepscenario.github.io/GrounDiff/.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Prince Mensah, Pelumi Victor Aderinto, Ibrahim Salihu Yusuf, Arnu Pretorius
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate retrieval of vegetation biophysical variables from satellite imagery is crucial for ecosystem monitoring and agricultural management. In this work, we propose a physics-informed Transformer-VAE architecture to invert the PROSAIL radiative transfer model for simultaneous estimation of key canopy parameters from Sentinel-2 data. Unlike previous hybrid approaches that require real satellite images for self-supevised training. Our model is trained exclusively on simulated data, yet achieves performance on par with state-of-the-art methods that utilize real imagery. The Transformer-VAE incorporates the PROSAIL model as a differentiable physical decoder, ensuring that inferred latent variables correspond to physically plausible leaf and canopy properties. We demonstrate retrieval of leaf area index (LAI) and canopy chlorophyll content (CCC) on real-world field datasets (FRM4Veg and BelSAR) with accuracy comparable to models trained with real Sentinel-2 data. Our method requires no in-situ labels or calibration on real images, offering a cost-effective and self-supervised solution for global vegetation monitoring. The proposed approach illustrates how integrating physical models with advanced deep networks can improve the inversion of RTMs, opening new prospects for large-scale, physically-constrained remote sensing of vegetation traits.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hyunjong Lee, Jangho Lee, Jaekoo Lee
ๆฅ็ๆ่ฆ
Lane detection is an important topic in the future mobility solutions. Real-world environmental challenges such as background clutter, varying illumination, and occlusions pose significant obstacles to effective lane detection, particularly when relying on data-driven approaches that require substantial effort and cost for data collection and annotation. To address these issues, lane detection methods must leverage contextual and global information from surrounding lanes and objects. In this paper, we propose a Spatial Attention Mutual Information Regularization with a pre-trained model as an Oracle, called SAMIRO. SAMIRO enhances lane detection performance by transferring knowledge from a pretrained model while preserving domain-agnostic spatial information. Leveraging SAMIRO's plug-and-play characteristic, we integrate it into various state-of-the-art lane detection approaches and conduct extensive experiments on major benchmarks such as CULane, Tusimple, and LLAMAS. The results demonstrate that SAMIRO consistently improves performance across different models and datasets. The code will be made available upon publication.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhen Chen, Yi Zhang, Xiangyu Yin, Chengxuan Qin, Xingyu Zhao et al.
ๆฅ็ๆ่ฆ
Personalized AI applications such as DreamBooth enable the generation of customized content from user images, but also raise significant privacy concerns, particularly the risk of facial identity leakage. Recent defense mechanisms like Anti-DreamBooth attempt to mitigate this risk by injecting adversarial perturbations into user photos to prevent successful personalization. However, we identify two critical yet overlooked limitations of these methods. First, the adversarial examples often exhibit perceptible artifacts such as conspicuous patterns or stripes, making them easily detectable as manipulated content. Second, the perturbations are highly fragile, as even a simple, non-learned filter can effectively remove them, thereby restoring the model's ability to memorize and reproduce user identity. To investigate this vulnerability, we propose a novel evaluation framework, AntiDB_Purify, to systematically evaluate existing defenses under realistic purification threats, including both traditional image filters and adversarial purification. Results reveal that none of the current methods maintains their protective effectiveness under such threats. These findings highlight that current defenses offer a false sense of security and underscore the urgent need for more imperceptible and robust protections to safeguard user identity in personalized generation.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
Embodied navigation is a fundamental capability for robotic agents operating. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we further identify the last-mile problem in zero-shot navigation - determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on GOAT-Bench and HM3D-OVON datasets. The open-source code will be publicly available.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kai-Hendrik Cohrs, Zuzanna Osika, Maria Gonzalez-Calabuig, Vishal Nedungadi, Ruben Cartuyvels et al.
ๆฅ็ๆ่ฆ
Geospatial foundation models for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that integrates three complementary signals: out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space and task-specific predictive uncertainty. Applied to burn scar segmentation, SHRUG-FM shows that OOD scores correlate with lower performance in specific environmental conditions, while uncertainty-based flags help discard many poorly performing predictions. Linking these flags to land cover attributes from HydroATLAS shows that failures are not random but concentrated in certain geographies, such as low-elevation zones and large river areas, likely due to underrepresentation in pretraining data. SHRUG-FM provides a pathway toward safer and more interpretable deployment of GFMs in climate-sensitive applications, helping bridge the gap between benchmark performance and real-world reliability.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thales Bezerra, Emanoel Thyago, Kelvin Cunha, Rodrigo Abreu, Fรกbio Papais et al.
Machine Learning
ๆฅ็ๆ่ฆ
AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks, and local model adaptation. The DermAI clinical dataset, encompasses a wide range of skin tones, ethinicity and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mengzhu Wang, Changyuan Deng, Shanshan Wang, Nan Yin, Long Lan et al.
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Single Domain Generalization (SDG) for object detection aims to train a model on a single source domain that can generalize effectively to unseen target domains. While recent methods like CLIP-based semantic augmentation have shown promise, they often overlook the underlying structure of feature distributions and frequency-domain characteristics that are critical for robustness. In this paper, we propose a novel framework that enhances SDG object detection by integrating the von Mises-Fisher (vMF) distribution and Fourier transformation into a CLIP-guided pipeline. Specifically, we model the directional features of object representations using vMF to better capture domain-invariant semantic structures in the embedding space. Additionally, we introduce a Fourier-based augmentation strategy that perturbs amplitude and phase components to simulate domain shifts in the frequency domain, further improving feature robustness. Our method not only preserves the semantic alignment benefits of CLIP but also enriches feature diversity and structural consistency across domains. Extensive experiments on the diverse weather-driving benchmark demonstrate that our approach outperforms the existing state-of-the-art method.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Wenti Yin, Huaxin Zhang, Xiang Wang, Yuqing Lu, Yicheng Zhang et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm based on multimodal foundation models such as CLIP to highlight anomalous instances and classify categories. However, their objectives may tend to detect the most salient response segments, while neglecting to mine diverse normal patterns separated from anomalies, and are prone to category confusion due to similar appearance, leading to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) to explicitly separate abnormal and normal features from coarse-grained and fine-grained aspects, enhancing the distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video, thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yu Deng, Baozhu Zhao, Junyan Su, Xiaohan Zhang, Qi Liu
Machine Learning
ๆฅ็ๆ่ฆ
Three-dimensional reconstruction in scenes with extreme depth variations remains challenging due to inconsistent supervisory signals between near-field and far-field regions. Existing methods fail to simultaneously address inaccurate depth estimation in distant areas and structural degradation in close-range regions. This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision to advance 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-field Supervision employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, leverages defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision employing LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforce depth consistency via least squares optimization of reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaomei Yang, Xizhan Gao, Sijie Niu, Fa Zhu, Guang Feng et al.
Multimodal
ๆฅ็ๆ่ฆ
This paper proposes a novel CLIP-driven modality-shared representation learning network named CLIP4VI-ReID for VI-ReID task, which consists of Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA). Specifically, considering the huge gap in the physical characteristics between natural images and infrared images, the TSG is designed to generate text semantics only for visible images, thereby enabling preliminary visible-text modality alignment. Then, the IFE is proposed to rectify the feature embeddings of infrared images using the generated text semantics. This process injects id-related semantics into the shared image encoder, enhancing its adaptability to the infrared modality. Besides, with text serving as a bridge, it enables indirect visible-infrared modality alignment. Finally, the HSA is established to refine the high-level semantic alignment. This process ensures that the fine-tuned text semantics only contain id-related information, thereby achieving more accurate cross-modal alignment and enhancing the discriminability of the learned modal-shared representations. Extensive experimental results demonstrate that the proposed CLIP4VI-ReID achieves superior performance than other state-of-the-art methods on some widely used VI-ReID datasets.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Patrick Feifel, Benedikt Franke, Frank Bonarens, Frank Kรถster, Arne Raulf et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Reliable pedestrian detection represents a crucial step towards automated driving systems. However, the current performance benchmarks exhibit weaknesses. The currently applied metrics for various subsets of a validation dataset prohibit a realistic performance evaluation of a DNN for pedestrian detection. As image segmentation supplies fine-grained information about a street scene, it can serve as a starting point to automatically distinguish between different types of errors during the evaluation of a pedestrian detector. In this work, eight different error categories for pedestrian detection are proposed and new metrics are proposed for performance comparison along these error categories. We use the new metrics to compare various backbones for a simplified version of the APD, and show a more fine-grained and robust way to compare models with each other especially in terms of safety-critical performance. We achieve SOTA on CityPersons-reasonable (without extra training data) by using a rather simple architecture.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM not only serves as a language model but also a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Sumin Lee, Sungwon Park, Jeasurk Yang, Jihee Kim, Meeyoung Cha
ๆฅ็ๆ่ฆ
Satellite-based slum segmentation holds significant promise in generating global estimates of urban poverty. However, the morphological heterogeneity of informal settlements presents a major challenge, hindering the ability of models trained on specific regions to generalize effectively to unseen locations. To address this, we introduce a large-scale high-resolution dataset and propose GRAM (Generalized Region-Aware Mixture-of-Experts), a two-phase test-time adaptation framework that enables robust slum segmentation without requiring labeled data from target regions. We compile a million-scale satellite imagery dataset from 12 cities across four continents for source training. Using this dataset, the model employs a Mixture-of-Experts architecture to capture region-specific slum characteristics while learning universal features through a shared backbone. During adaptation, prediction consistency across experts filters out unreliable pseudo-labels, allowing the model to generalize effectively to previously unseen regions. GRAM outperforms state-of-the-art baselines in low-resource settings such as African cities, offering a scalable and label-efficient solution for global slum mapping and data-driven urban planning.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model's deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs' reliability without a significant compromise on efficiency.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yanbei Jiang, Chao Lei, Yihao Ding, Krista Ehinger, Jey Han Lau
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code isavailable at: https://github.com/YanbeiJiang/PROPA.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yongji Zhang, Siqi Li, Kuiyang Huang, Yue Gao, Yu Jiang
ๆฅ็ๆ่ฆ
Fine-Grained Visual Classification (FGVC) remains a challenging task due to subtle inter-class differences and large intra-class variations. Existing approaches typically rely on feature-selection mechanisms or region-proposal strategies to localize discriminative regions for semantic analysis. However, these methods often fail to capture discriminative cues comprehensively while introducing substantial category-agnostic redundancy. To address these limitations, we propose H3Former, a novel token-to-region framework that leverages high-order semantic relations to aggregate local fine-grained representations with structured region-level modeling. Specifically, we propose the Semantic-Aware Aggregation Module (SAAM), which exploits multi-scale contextual cues to dynamically construct a weighted hypergraph among tokens. By applying hypergraph convolution, SAAM captures high-order semantic dependencies and progressively aggregates token features into compact region-level representations. Furthermore, we introduce the Hyperbolic Hierarchical Contrastive Loss (HHCL), which enforces hierarchical semantic constraints in a non-Euclidean embedding space. The HHCL enhances inter-class separability and intra-class consistency while preserving the intrinsic hierarchical relationships among fine-grained categories. Comprehensive experiments conducted on four standard FGVC benchmarks validate the superiority of our H3Former framework.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Jiulong Wu, Yucheng Shen, Lingyong Yan, Haixin Sun, Deguo Xia et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yongji Zhang, Siqi Li, Yue Gao, Yu Jiang
ๆฅ็ๆ่ฆ
Action Quality Assessment (AQA) aims to evaluate and score sports actions, which has attracted widespread interest in recent years. Existing AQA methods primarily predict scores based on features extracted from the entire video, resulting in limited interpretability and reliability. Meanwhile, existing AQA datasets also lack fine-grained annotations for action scores, especially for deduction items and sub-score annotations. In this paper, we construct the first AQA dataset containing fine-grained sub-score and deduction annotations for aerial skiing, which will be released as a new benchmark. For the technical challenges, we propose a novel AQA method, named JudgeMind, which significantly enhances performance and reliability by simulating the judgment and scoring mindset of professional referees. Our method segments the input action video into different stages and scores each stage to enhance accuracy. Then, we propose a stage-aware feature enhancement and fusion module to boost the perception of stage-specific key regions and enhance the robustness to visual changes caused by frequent camera viewpoints switching. In addition, we propose a knowledge-based grade-aware decoder to incorporate possible deduction items as prior knowledge to predict more accurate and reliable scores. Experimental results demonstrate that our method achieves state-of-the-art performance.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinxuan Li, Yi Zhang, Jian-Fang Hu, Chaolei Tan, Tianming Liang et al.
Multimodal
ๆฅ็ๆ่ฆ
Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (\textbf{TubeRMC}) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refine them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ashutosh Anshul, Shreyas Gopal, Deepu Rajan, Eng Siong Chng
MultimodalRobotics
ๆฅ็ๆ่ฆ
Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches that target generalization require pretraining over real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. Additionally, we introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yueran Zhao, Zhang Zhang, Chao Sun, Tianze Wang, Chao Yue et al.
ๆฅ็ๆ่ฆ
Vehicle-to-Everything (V2X) collaborative perception extends sensing beyond single vehicle limits through transmission. However, as more agents participate, existing frameworks face two key challenges: (1) the participating agents are inherently multi-modal and heterogeneous, and (2) the collaborative framework must be scalable to accommodate new agents. The former requires effective cross-agent feature alignment to mitigate heterogeneity loss, while the latter renders full-parameter training impractical, highlighting the importance of scalable adaptation. To address these issues, we propose Heterogeneous Adaptation (HeatV2X), a scalable collaborative framework. We first train a high-performance agent based on heterogeneous graph attention as the foundation for collaborative learning. Then, we design Local Heterogeneous Fine-Tuning and Global Collaborative Fine-Tuning to achieve effective alignment and interaction among heterogeneous agents. The former efficiently extracts modality-specific differences using Hetero-Aware Adapters, while the latter employs the Multi-Cognitive Adapter to enhance cross-agent collaboration and fully exploit the fusion potential. These designs enable substantial performance improvement of the collaborative framework with minimal training cost. We evaluate our approach on the OPV2V-H and DAIR-V2X datasets. Experimental results demonstrate that our method achieves superior perception performance with significantly reduced training overhead, outperforming existing state-of-the-art approaches. Our implementation will be released soon.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Wenzhe He, Xiaojun Chen, Ruiqi Wang, Ruihui Li, Huilong Pi et al.
Robotics
ๆฅ็ๆ่ฆ
3D LiDAR scene completion from point clouds is a fundamental component of perception systems in autonomous vehicles. Previous methods have predominantly employed diffusion models for high-fidelity reconstruction. However, their multi-step iterative sampling incurs significant computational overhead, limiting its real-time applicability. To address this, we propose LiNeXt-a lightweight, non-diffusion network optimized for rapid and accurate point cloud completion. Specifically, LiNeXt first applies the Noise-to-Coarse (N2C) Module to denoise the input noisy point cloud in a single pass, thereby obviating the multi-step iterative sampling of diffusion-based methods. The Refine Module then takes the coarse point cloud and its intermediate features from the N2C Module to perform more precise refinement, further enhancing structural completeness. Furthermore, we observe that LiDAR point clouds exhibit a distance-dependent spatial distribution, being densely sampled at proximal ranges and sparsely sampled at distal ranges. Accordingly, we propose the Distance-aware Selected Repeat strategy to generate a more uniformly distributed noisy point cloud. On the SemanticKITTI dataset, LiNeXt achieves a 199.8x speedup in inference, reduces Chamfer Distance by 50.7%, and uses only 6.1% of the parameters compared with LiDiff. These results demonstrate the superior efficiency and effectiveness of LiNeXt for real-time scene completion.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Stephane Da Silva Martins, Emanuel Aldea, Sylvie Le Hรฉgarat-Mascle
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Multi-agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents' long-term goals and their fine-grained social interactions, which leads to unrealistic multi-agent futures. We propose VISTA, a recursive goal-conditioned transformer for multi-agent trajectory forecasting. VISTA combines (i) a cross-attention fusion module that integrates long-horizon intent with past motion, (ii) a social-token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model turns single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and substantially fewer collisions. On MADRAS, it reduces the average collision rate of strong baselines from 2.14 to 0.03 percent, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it promising for safety-critical autonomous systems.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tishya Chhabra, Manisha Bajpai, Walter Zesk, Skylar Tibbits
Machine Learning
ๆฅ็ๆ่ฆ
We present an initial evaluation of NASA and IBM's Prithvi-EO-2.0 geospatial foundation model on shoreline delineation of small sandy islands using satellite images. We curated and labeled a dataset of 225 multispectral images of two Maldivian islands, which we publicly release, and fine-tuned both the 300M and 600M parameter versions of Prithvi on training subsets ranging from 5 to 181 images. Our experiments show that even with as few as 5 training images, the models achieve high performance (F1 of 0.94, IoU of 0.79). Our results demonstrate the strong transfer learning capability of Prithvi, underscoring the potential of such models to support coastal monitoring in data-poor regions.
๐
2025-11-13
๐ 5 Pages, Under Review at The IEEE International Sy
๐ฅ Ahmed Jaheen, Islam Hassan, Mohanad Abouserie, Abdelaty Rehab, Adham Elasfar et al.
ๆฅ็ๆ่ฆ
Accurate localization of cephalometric landmarks from 2D lateral skull X-rays is vital for orthodontic diagnosis and treatment. Manual annotation is time-consuming and error-prone, whereas automated approaches often struggle with low contrast and anatomical complexity. This paper introduces CephRes-MHNet, a multi-head residual convolutional network for robust and efficient cephalometric landmark detection. The architecture integrates residual encoding, dual-attention mechanisms, and multi-head decoders to enhance contextual reasoning and anatomical precision. Trained on the Aariz Cephalometric dataset of 1,000 radiographs, CephRes-MHNet achieved a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) @ 2.0 mm of 85.5%, outperforming all evaluated models. In particular, it exceeded the strongest baseline, the attention-driven AFPF-Net (MRE = 1.25 mm, SDR @ 2.0 mm = 84.1%), while using less than 25% of its parameters. These results demonstrate that CephRes-MHNet attains state-of-the-art accuracy through architectural efficiency, providing a practical solution for real-world orthodontic analysis.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hu Gao, Xiaoning Lei, Xichen Xu, Depeng Dang, Lizhuang Ma
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Although image restoration has advanced significantly, most existing methods target only a single type of degradation. In real-world scenarios, images often contain multiple degradations simultaneously, such as rain, noise, and haze, requiring models capable of handling diverse degradation types. Moreover, methods that improve performance through module stacking often suffer from limited interpretability. In this paper, we propose a novel interpretability-driven approach for multi-degradation image restoration, built upon a deep unfolding network that maps the iterative process of a mathematical optimization algorithm into a learnable network structure. Specifically, we employ an improved second-order semi-smooth Newton algorithm to ensure that each module maintains clear physical interpretability. To further enhance interpretability and adaptability, we design an explainable convolution module inspired by the human brain's flexible information processing and the intrinsic characteristics of images, allowing the network to flexibly leverage learned knowledge and autonomously adjust parameters for different input. The resulting tightly integrated architecture, named InterIR, demonstrates excellent performance in multi-degradation restoration while remaining highly competitive on single-degradation tasks.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Zou, Runqing Zhang, Xue Zhou, Jianxiao Zou
ๆฅ็ๆ่ฆ
Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, sometimes textual queries cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose the Generation-Enhanced Alignment (GEA) from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and facilitate cross-modal alignment. (2) Generative Intermediate Fusion (GIF), which combines cross-attention between generated images, original images, and text features to generate a unified representation optimized by triplet alignment loss. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA. The results justify the effectiveness of our method. More implementation details and extended results are available at https://github.com/sugelamyd123/Sup-for-GEA.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Feng Ding, Wenhui Yi, Yunpeng Zhou, Xinan He, Hong Rao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Zhicheng Cai, Hao Zhu, Linsen Chen, Qiu Shen, Xun Cao
Machine Learning
ๆฅ็ๆ่ฆ
Implicit neural representation (INR) models signals as continuous functions using neural networks, offering efficient and differentiable optimization for inverse problems across diverse disciplines. However, the representational capacity of INR defined by the range of functions the neural network can characterize, is inherently limited by the low-dimensional feature space in conventional multilayer perceptron (MLP) architectures. While widening the MLP can linearly increase feature space dimensionality, it also leads to a quadratic growth in computational and memory costs. To address this limitation, we propose the split-layer, a novel reformulation of MLP construction. The split-layer divides each layer into multiple parallel branches and integrates their outputs via Hadamard product, effectively constructing a high-degree polynomial space. This approach significantly enhances INR's representational capacity by expanding the feature space dimensionality without incurring prohibitive computational overhead. Extensive experiments demonstrate that the split-layer substantially improves INR performance, surpassing existing methods across multiple tasks, including 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Mayank Vatsa, Aparna Bharati, Richa Singh
Computer Vision
ๆฅ็ๆ่ฆ
The architectural blueprint of today's leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives-negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Mingda Jia, Weiliang Meng, Zenghuang Fu, Yiheng Li, Qi Zeng et al.
ๆฅ็ๆ่ฆ
Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves the state-of-the-art performance on dense video captioning task.
๐
2025-11-13
๐ NeurIPS
๐ฅ Jueun Ko, Hyewon Park, Hyesong Choi, Dongbo Min
ๆฅ็ๆ่ฆ
Stereo Depth Estimation in real-world environments poses significant challenges due to dynamic domain shifts, sparse or unreliable supervision, and the high cost of acquiring dense ground-truth labels. While recent Test-Time Adaptation (TTA) methods offer promising solutions, most rely on static target domain assumptions and input-invariant adaptation strategies, limiting their effectiveness under continual shifts. In this paper, we propose RobIA, a novel Robust, Instance-Aware framework for Continual Test-Time Adaptation (CTTA) in stereo depth estimation. RobIA integrates two key components: (1) Attend-and-Excite Mixture-of-Experts (AttEx-MoE), a parameter-efficient module that dynamically routes input to frozen experts via lightweight self-attention mechanism tailored to epipolar geometry, and (2) Robust AdaptBN Teacher, a PEFT-based teacher model that provides dense pseudo-supervision by complementing sparse handcrafted labels. This strategy enables input-specific flexibility, broad supervision coverage, improving generalization under domain shift. Extensive experiments demonstrate that RobIA achieves superior adaptation performance across dynamic target domains while maintaining computational efficiency.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Zihan Wang, Guansong Pang, Wenjun Miao, Jin Zheng, Xiao Bai
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in Large Visual Language Models (LVLMs) have demonstrated impressive performance across various vision-language tasks by leveraging large-scale image-text pretraining and instruction tuning. However, the security vulnerabilities of LVLMs have become increasingly concerning, particularly their susceptibility to backdoor attacks. Existing backdoor attacks focus on single-target attacks, i.e., targeting a single malicious output associated with a specific trigger. In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. Executing such attacks in LVLMs is challenging since there can be many incorrect trigger-target mappings due to severe feature interference among different triggers. To address this challenge, we propose MTAttack, the first multi-target backdoor attack framework for enforcing accurate multiple trigger-target mappings in LVLMs. The core of MTAttack is a novel optimization method with two constraints, namely Proxy Space Partitioning constraint and Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space, with each trigger independently mapping clean images to a unique proxy class while at the same time guaranteeing their separability. Experiments on popular benchmarks demonstrate a high success rate of MTAttack for multi-target attacks, substantially outperforming existing attack methods. Furthermore, our attack exhibits strong generalizability across datasets and robustness against backdoor defense strategies. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and underscore the urgent need for mitigating such threats. Code is available at https://github.com/mala-lab/MTAttack.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yiming Tang, Abhijeet Sinha, Dianbo Liu
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Although recent generative models are remarkably capable of producing instruction-following and realistic outputs, they remain prone to notable physical plausibility failures. Though critical in applications, these physical plausibility errors often escape detection by existing evaluation methods. Furthermore, no framework exists for automatically identifying and interpreting specific physical error patterns in natural language, preventing targeted model improvements. We introduce Matryoshka Transcoders, a novel framework for the automatic discovery and interpretation of physical plausibility features in generative models. Our approach extends the Matryoshka representation learning paradigm to transcoder architectures, enabling hierarchical sparse feature learning at multiple granularity levels. By training on intermediate representations from a physical plausibility classifier and leveraging large multimodal models for interpretation, our method identifies diverse physics-related failure modes without manual feature engineering, achieving superior feature relevance and feature accuracy compared to existing approaches. We utilize the discovered visual patterns to establish a benchmark for evaluating physical plausibility in generative models. Our analysis of eight state-of-the-art generative models provides valuable insights into how these models fail to follow physical constraints, paving the way for further model improvements.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Qilang Ye, Yu Zhou, Lian He, Jie Zhang, Xuanming Guo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating LLM as a recognizer, two questions arise: 1) How can LLMs understand skeleton? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual, motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model the skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments on zero-shot scenarios show that SUGAR is more versatile than linear-based methods.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leonardo Pesce, Jiawen Wei, Gianmarco Mengaldo
Machine Learning
ๆฅ็ๆ่ฆ
Post-hoc explainability methods are a subset of Machine Learning (ML) that aim to provide a reason for why a model behaves in a certain way. In this paper, we show a new black-box model-agnostic adversarial attack for post-hoc explainable Artificial Intelligence (XAI), particularly in the image domain. The goal of the attack is to modify the original explanations while being undetected by the human eye and maintain the same predicted class. In contrast to previous methods, we do not require any access to the model or its weights, but only to the model's computed predictions and explanations. Additionally, the attack is accomplished in a single step while significantly changing the provided explanations, as demonstrated by empirical evaluation. The low requirements of our method expose a critical vulnerability in current explainability methods, raising concerns about their reliability in safety-critical applications. We systematically generate attacks based on the explanations generated by post-hoc explainability methods (saliency maps, integrated gradients, and DeepLIFT SHAP) for pretrained ResNet-18 and ViT-B16 on ImageNet. The results show that our attacks could lead to dramatically different explanations without changing the predictive probabilities. We validate the effectiveness of our attack, compute the induced change based on the explanation with mean absolute difference, and verify the closeness of the original image and the corrupted one with the Structural Similarity Index Measure (SSIM).
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxiang Duan, Ao Li, Yingqin Li, Luyu Li, Pengwei Wang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Xiangyue Zhang, Jianfang Li, Jianqiang Ren, Jiaxu Zhang
ๆฅ็ๆ่ฆ
Reliable co-speech motion generation requires precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically based on the skeleton structure. This leads to cumulative errors during generation, manifesting as unstable and implausible motions at end-effectors. In this work, we propose GlobalDiff, a diffusion-based framework that operates directly in the space of global joint rotations for the first time, fundamentally decoupling each joint's prediction from upstream dependencies and alleviating hierarchical error accumulation. To compensate for the absence of structural priors in global rotation space, we introduce a multi-level constraint scheme. Specifically, a joint structure constraint introduces virtual anchor points around each joint to better capture fine-grained orientation. A skeleton structure constraint enforces angular consistency across bones to maintain structural integrity. A temporal structure constraint utilizes a multi-scale variational encoder to align the generated motion with ground-truth temporal patterns. These constraints jointly regularize the global diffusion process and reinforce structural awareness. Extensive evaluations on standard co-speech benchmarks show that GlobalDiff generates smooth and accurate motions, improving the performance by 46.0 % compared to the current SOTA under multiple speaker identities.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Gwangyeon Ahn, Jiwan Seo, Joonhyuk Kang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Muzhou Yang, Wuzhou Quan, Mingqiang Wei
ๆฅ็ๆ่ฆ
Confidence alone is often misleading in hyperspectral image classification, as models tend to mistake high predictive scores for correctness while lacking awareness of uncertainty. This leads to confirmation bias, especially under sparse annotations or class imbalance, where models overfit confident errors and fail to generalize. We propose CABIN (Cognitive-Aware Behavior-Informed learNing), a semi-supervised framework that addresses this limitation through a closed-loop learning process of perception, action, and correction. CABIN first develops perceptual awareness by estimating epistemic uncertainty, identifying ambiguous regions where errors are likely to occur. It then acts by adopting an Uncertainty-Guided Dual Sampling Strategy, selecting uncertain samples for exploration while anchoring confident ones as stable pseudo-labels to reduce bias. To correct noisy supervision, CABIN introduces a Fine-Grained Dynamic Assignment Strategy that categorizes pseudo-labeled data into reliable, ambiguous, and noisy subsets, applying tailored losses to enhance generalization. Experimental results show that a wide range of state-of-the-art methods benefit from the integration of CABIN, with improved labeling efficiency and performance.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Luming Yang, Haoxian Liu, Siqing Li, Alper Yilmaz
ๆฅ็ๆ่ฆ
Fine-grained action evaluation in medical vision faces unique challenges due to the unavailability of comprehensive datasets, stringent precision requirements, and insufficient spatiotemporal dynamic modeling of very rapid actions. To support development and evaluation, we introduce CPREval-6k, a multi-view, multi-label medical action benchmark containing 6,372 expert-annotated videos with 22 clinical labels. Using this dataset, we present GaussMedAct, a multivariate Gaussian encoding framework, to advance medical motion analysis through adaptive spatiotemporal representation learning. Multivariate Gaussian Representation projects the joint motions to a temporally scaled multi-dimensional space, and decomposes actions into adaptive 3D Gaussians that serve as tokens. These tokens preserve motion semantics through anisotropic covariance modeling while maintaining robustness to spatiotemporal noise. Hybrid Spatial Encoding, employing a Cartesian and Vector dual-stream strategy, effectively utilizes skeletal information in the form of joint and bone features. The proposed method achieves 92.1% Top-1 accuracy with real-time inference on the benchmark, outperforming the ST-GCN baseline by +5.9% accuracy with only 10% FLOPs. Cross-dataset experiments confirm the superiority of our method in robustness.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Qilang Ye, Wei Zeng, Meng Liu, Jie Zhang, Yupeng Hu et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an ``Audio-Visual Confusion'' scene by modifying the corresponding sound of an object in the video, e.g., mute the sounding object and ask MLLMs Is there a/an muted-object sound''. Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10~30\% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu
Natural Language ProcessingMultimodalComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images. Each sample consists of an original image, four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
๐
2025-11-13
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Go Tsuruoka, Takami Sato, Qi Alfred Chen, Kazuki Nomoto, Ryunosuke Kobayashi et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Traffic sign recognition plays a critical role in ensuring safe and efficient transportation of autonomous vehicles but remain vulnerable to adversarial attacks using stickers or laser projections. While existing attack vectors demonstrate security concerns, they suffer from visual detectability or implementation constraints, suggesting unexplored vulnerability surfaces in TSR systems. We introduce the Adversarial Retroreflective Patch (ARP), a novel attack vector that combines the high deployability of patch attacks with the stealthiness of laser projections by utilizing retroreflective materials activated only under victim headlight illumination. We develop a retroreflection simulation method and employ black-box optimization to maximize attack effectiveness. ARP achieves $\geq$93.4\% success rate in dynamic scenarios at 35 meters and $\geq$60\% success rate against commercial TSR systems in real-world conditions. Our user study demonstrates that ARP attacks maintain near-identical stealthiness to benign signs while achieving $\geq$1.9\% higher stealthiness scores than previous patch attacks. We propose the DPR Shield defense, employing strategically placed polarized filters, which achieves $\geq$75\% defense success rates for stop signs and speed limit signs against micro-prism patches.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xurui Li, Feng Xue, Yu Zhou
Multimodal
ๆฅ็ๆ่ฆ
Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality to assign score to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines. In aid of the novel framework, MuSc-V2 achieves significant performance improvements: a $\textbf{+23.7\%}$ AP gain on the MVTec 3D-AD dataset and a $\textbf{+19.3\%}$ boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at The code will be available at \href{https://github.com/HUST-SLOW/MuSc-V2}{https://github.com/HUST-SLOW/MuSc-V2}.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wencong Wu, Xiuwei Zhang, Hanlin Yin, Shun Dai, Hongxi Zhang et al.
Natural Language ProcessingMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Visible-infrared object detection has gained sufficient attention due to its detection performance in low light, fog, and rain conditions. However, visible and infrared modalities captured by different sensors exist the information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion, resulting in degraded detection performance. \textcolor{red}{Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of developing frequency domain transformers to mine complementary information.} To solve these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities and a frequency domain feed-forward layer (FDFFL) via a mixed-scale frequency feature fusion strategy is designed to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in a spatial and channel manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results have verified that our proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods. The code of our FreDFT is linked at https://github.com/WenCongWu/FreDFT.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinran Yang, Shuichang Lai, Jiangjing Lyu, Hongjie Li, Bowen Pan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Generating high-fidelity 3D contents remains a fundamental challenge due to the complexity of representing arbitrary topologies-such as open surfaces and intricate internal structures-while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs)-a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutions for capturing local detail with sparse transformers for enforcing global coherence. A Pad-Average strategy further ensures smooth transitions at subvolume boundaries during reconstruction. This modular design enables seamless scaling to ultra-high resolutions up to 2048^3-a regime previously unattainable for 3D VAEs. Experiments demonstrate state-of-the-art performance in both reconstruction accuracy and generative quality, yielding superior surface smoothness and geometric flexibility.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Feiyang Jia, Caiyan Jia, Ailin Liu, Shaoqing Xu, Qiming Xia et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0\% mAP, +0.8\% NDS, and +1.3\% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.
๐
2025-11-13
๐ ArXiv Preprint (eess.IV)
๐ฅ Farzan Saeedi, Sanaz Keshvari, Nasser Shoeibi
Machine Learning
ๆฅ็ๆ่ฆ
This paper encompasses an in-depth examination of Retinopathy of Prematurity (ROP) diagnosis, employing advanced deep learning methodologies. Our focus centers on refining and evaluating CNN-based approaches for precise and efficient ROP detection. We navigate the complexities of dataset curation, preprocessing strategies, and model architecture, aligning with research objectives encompassing model effectiveness, computational cost analysis, and time complexity assessment. Results underscore the supremacy of tailored CNN models over pre-trained counterparts, evident in heightened accuracy and F1-scores. Implementation of a voting system further enhances performance. Additionally, our study reveals the potential of the proposed customized CNN model to alleviate computational burdens associated with deep neural networks. Furthermore, we showcase the feasibility of deploying these models within dedicated software and hardware configurations, highlighting their utility as valuable diagnostic aids in clinical settings. In summary, our discourse significantly contributes to ROP diagnosis, unveiling the efficacy of deep learning models in enhancing diagnostic precision and efficiency.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxin Jiang, Wei Luo, Hui Zhang, Qiyu Chen, Haiming Yao et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignment between synthesized anomalies and their masks, thereby bolstering downstream anomaly detection accuracy. To facilitate training, we introduce AnomVerse, a collection of 12,987 anomaly-mask-caption triplets assembled from 13 publicly available datasets, where captions are automatically generated by multimodal large language models using structured visual prompts and template-based textual hints. Extensive experiments demonstrate that Anomagic trained on AnomVerse can synthesize more realistic and varied anomalies than prior methods, yielding superior improvements in downstream anomaly detection. Furthermore, Anomagic can generate anomalies for any normal-category image using user-defined prompts, establishing a versatile foundation model for anomaly generation.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.
๐
2025-11-13
๐ AAAI
๐ฅ Shufeng Kong, Zijie Wang, Nuan Cui, Hao Tang, Yihan Meng et al.
Machine Learning
ๆฅ็ๆ่ฆ
Automated interpretation of medical images demands robust modeling of complex visual-semantic relationships while addressing annotation scarcity, label imbalance, and clinical plausibility constraints. We introduce MIRNet (Medical Image Reasoner Network), a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. Tongue image diagnosis is a particularly challenging domain that requires fine-grained visual and semantic understanding. Our approach leverages self-supervised masked autoencoder (MAE) to learn transferable visual representations from unlabeled data; employs graph attention networks (GAT) to model label correlations through expert-defined structured graphs; enforces clinical priors via constraint-aware optimization using KL divergence and regularization losses; and mitigates imbalance using asymmetric loss (ASL) and boosting ensembles. To address annotation scarcity, we also introduce TongueAtlas-4K, a comprehensive expert-curated benchmark comprising 4,000 images annotated with 22 diagnostic labels--representing the largest public dataset in tongue analysis. Validation shows our method achieves state-of-the-art performance. While optimized for tongue diagnosis, the framework readily generalizes to broader diagnostic medical imaging tasks.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Minjun Kim, Jaeri Lee, Jongjin Kim, Jeongin Yun, Yongmo Kwon et al.
Natural Language ProcessingRoboticsComputer Vision
ๆฅ็ๆ่ฆ
How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ provides the state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xuexun Liu, Xiaoxu Xu, Qiudan Zhang, Lin Ma, Xu Wang
ๆฅ็ๆ่ฆ
Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially as the growing scale of data and high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose \textbf{DBGroup}, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Saket S. Chaturvedi, Gaurav Bagwe, Lan Zhang, Pan He, Xiaoyong Yuan
Computer Vision
ๆฅ็ๆ่ฆ
LiDAR-based 3D object detection is widely used in safety-critical systems. However, these systems remain vulnerable to backdoor attacks that embed hidden malicious behaviors during training. A key limitation of existing backdoor attacks is their lack of physical realizability, primarily due to the digital-to-physical domain gap. Digital triggers often fail in real-world settings because they overlook material-dependent LiDAR reflection properties. On the other hand, physically constructed triggers are often unoptimized, leading to low effectiveness or easy detectability.This paper introduces Material-Oriented Backdoor Attack (MOBA), a novel framework that bridges the digital-physical gap by explicitly modeling the material properties of real-world triggers. MOBA tackles two key challenges in physical backdoor design: 1) robustness of the trigger material under diverse environmental conditions, 2) alignment between the physical trigger's behavior and its digital simulation. First, we propose a systematic approach to selecting robust trigger materials, identifying titanium dioxide (TiO_2) for its high diffuse reflectivity and environmental resilience. Second, to ensure the digital trigger accurately mimics the physical behavior of the material-based trigger, we develop a novel simulation pipeline that features: (1) an angle-independent approximation of the Oren-Nayar BRDF model to generate realistic LiDAR intensities, and (2) a distance-aware scaling mechanism to maintain spatial consistency across varying depths. We conduct extensive experiments on state-of-the-art LiDAR-based and Camera-LiDAR fusion models, showing that MOBA achieves a 93.50% attack success rate, outperforming prior methods by over 41%. Our work reveals a new class of physically realizable threats and underscores the urgent need for defenses that account for material-level properties in real-world environments.
๐
2025-11-13
๐ AAAI
๐ฅ Yongdeuk Seo, Hyun-seok Min, Sungchul Choi
ๆฅ็ๆ่ฆ
Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR(Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Satoshi Suzuki, Shin'ya Yamaguchi, Shoichiro Takeda, Taiga Yamane, Naoki Makishima et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuning models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure. Therefore, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment. Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chenxu Wu, Qingpeng Kong, Peiang Zhao, Wendi Yang, Wenxin Ma et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent advances in generative models, especially diffusion models, have significantly improved image restoration (IR) performance. However, existing problem-agnostic diffusion model-based image restoration (DMIR) methods face challenges in fully leveraging diffusion priors, resulting in suboptimal performance. In this paper, we address the limitations of current problem-agnostic DMIR methods by analyzing their sampling process and providing effective solutions. We introduce EquS, a DMIR method that imposes equivariant information through dual sampling trajectories. To further boost EquS, we propose the Timestep-Aware Schedule (TAS) and introduce EquS$^+$. TAS prioritizes deterministic steps to enhance certainty and sampling efficiency. Extensive experiments on benchmarks demonstrate that our method is compatible with previous problem-agnostic DMIR methods and significantly boosts their performance without increasing computational costs. Our code is available at https://github.com/FouierL/EquS.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Uday Bhaskar, Rishabh Bhattacharya, Avinash Patel, Sarthak Khoche, Praveen Anil Kulkarni et al.
Natural Language ProcessingRoboticsMultimodalComputer Vision
ๆฅ็ๆ่ฆ
Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object coteaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers' per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost ($31.12\%$ to $46.61\%$) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels ($10\%$) leads to further performance gains, reaching $57.97\%$ mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhicheng Liao, Dongxu Wu, Zhenshan Shi, Sijie Mai, Hanwei Zhu et al.
Multimodal
ๆฅ็ๆ่ฆ
Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhiyuan Xu, Nan Min, Yuhang Guo, Tong Wei
ๆฅ็ๆ่ฆ
3D Gaussian Splatting offers a strong speed-quality trade-off but struggles to reconstruct semi-transparent surfaces because most methods assume a single depth per pixel, which fails when multiple surfaces are visible. We propose TSPE-GS (Transparent Surface Probabilistic Extraction for Gaussian Splatting), which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the prior single-peak assumption and resolving cross-surface depth ambiguity. By progressively fusing truncated signed distance functions, TSPE-GS reconstructs external and internal surfaces separately within a unified framework. The method generalizes to other Gaussian-based reconstruction pipelines without extra training overhead. Extensive experiments on public and self-collected semi-transparent and opaque datasets show TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.
๐
2025-11-13
๐ Accepted in 2026 IEEE/CVF Winter Conference on App
๐ฅ Mustafa Munir, Md Mostafijur Rahman, Radu Marculescu
Machine Learning
ๆฅ็ๆ่ฆ
Vision Graph Neural Networks (ViGs) offer a new direction for advancements in vision architectures. While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware gating strategy called Exponential Decay Gating. This gating mechanism selectively weighs long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.6% top-1 accuracy, outperforming ViG-B by 0.3% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 APbox, and 41.1 APmask, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 APbox, and 2.1 APmask, respectively, with 78% fewer parameters.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Yuhang Zhou, Yanxiang Zhao, Zhongyun Hua, Zhipu Liu, Zhaoquan Gu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Person re-identification (ReID) is a fundamental task in many real-world applications such as pedestrian trajectory tracking. However, advanced deep learning-based ReID models are highly susceptible to adversarial attacks, where imperceptible perturbations to pedestrian images can cause entirely incorrect predictions, posing significant security threats. Although numerous adversarial defense strategies have been proposed for classification tasks, their extension to metric learning tasks such as person ReID remains relatively unexplored. Moreover, the several existing defenses for person ReID fail to address the inherent unique challenges of adversarially robust ReID. In this paper, we systematically identify the challenges of adversarial defense in person ReID into two key issues: model bias and composite generalization requirements. To address them, we propose a debiased dual-invariant defense framework composed of two main phases. In the data balancing phase, we mitigate model bias using a diffusion-model-based data resampling strategy that promotes fairness and diversity in training data. In the bi-adversarial self-meta defense phase, we introduce a novel metric adversarial training approach incorporating farthest negative extension softening to overcome the robustness degradation caused by the absence of classifier. Additionally, we introduce an adversarially-enhanced self-meta mechanism to achieve dual-generalization for both unseen identities and unseen attack types. Experiments demonstrate that our method significantly outperforms existing state-of-the-art defenses.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Xuan Rao, Simian Xu, Zheng Li, Bo Zhao, Derong Liu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances have shown that sequential fine-tuning (SeqFT) of pre-trained vision transformers (ViTs), followed by classifier refinement using approximate distributions of class features, can be an effective strategy for class-incremental learning (CIL). However, this approach is susceptible to distribution drift, caused by the sequential optimization of shared backbone parameters. This results in a mismatch between the distributions of the previously learned classes and that of the updater model, ultimately degrading the effectiveness of classifier performance over time. To address this issue, we introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. First, we present a linear variant of SLDC, which learns a linear operator by solving a regularized least-squares problem that maps features before and after fine-tuning. Next, we extend this with a weakly nonlinear SLDC variant, which assumes that the ideal transition operator lies between purely linear and fully nonlinear transformations. This is implemented using learnable, weakly nonlinear mappings that balance flexibility and generalization. To further reduce representation drift, we apply knowledge distillation (KD) in both algorithmic variants. Extensive experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT. Notably, by combining KD to address representation drift with SLDC to compensate distribution drift, SeqFT achieves performance comparable to joint training across all evaluated datasets. Code: https://github.com/raoxuan98-hash/sldc.git.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ketong Chen, Yuhao Chen, Yang Xue
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zihao Zhang, Yang Li, Aming Wu, Yahong Han
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
In this paper, we focus on Single-Domain Generalized Object Detection (Single-DGOD), aiming to transfer a detector trained on one source domain to multiple unknown domains. Existing methods for Single-DGOD typically rely on discrete data augmentation or static perturbation methods to expand data diversity, thereby mitigating the lack of access to target domain data. However, in real-world scenarios such as changes in weather or lighting conditions, domain shifts often occur continuously and gradually. Discrete augmentations and static perturbations fail to effectively capture the dynamic variation of feature distributions, thereby limiting the model's ability to perceive fine-grained cross-domain differences. To this end, we propose a new method, Liquid Temporal Feature Evolution, which simulates the progressive evolution of features from the source domain to simulated latent distributions by incorporating temporal modeling and liquid neural network-driven parameter adjustment. Specifically, we introduce controllable Gaussian noise injection and multi-scale Gaussian blurring to simulate initial feature perturbations, followed by temporal modeling and a liquid parameter adjustment mechanism to generate adaptive modulation parameters, enabling a smooth and continuous adaptation across domains. By capturing progressive cross-domain feature evolution and dynamically regulating adaptation paths, our method bridges the source-unknown domain distribution gap, significantly boosting generalization and robustness to unseen shifts. Significant performance improvements on the Diverse Weather dataset and Real-to-Art benchmark demonstrate the superiority of our method. Our code is available at https://github.com/2490o/LTFE.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yongxian Wei, Yilin Zhao, Li Shen, Xinrui Chen, Runxi Cheng et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data bootstrap problem-design strategies from the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Moreover, a solver trained on the synthesized data provides improved rewards for continued generator training, enabling co-evolution and yielding a further 0.7% performance gain. Our code will be made publicly available here.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Brian B. Moser, Shalini Strode, Federico Raue, Stanislav Frolov, Krzysztof Adamkiewicz et al.
ๆฅ็ๆ่ฆ
Dataset distillation (DD) promises compact yet faithful synthetic data, but existing approaches often inherit the inductive bias of a single teacher model. As dataset size increases, this bias drives generation toward overly smooth, homogeneous samples, reducing intra-class diversity and limiting generalization. We present PRISM (PRIors from diverse Source Models), a framework that disentangles architectural priors during synthesis. PRISM decouples the logit-matching and regularization objectives, supervising them with different teacher architectures: a primary model for logits and a stochastic subset for batch-normalization (BN) alignment. On ImageNet-1K, PRISM consistently and reproducibly outperforms single-teacher methods (e.g., SRe2L) and recent multi-teacher variants (e.g., G-VBSM) at low- and mid-IPC regimes. The generated data also show significantly richer intra-class diversity, as reflected by a notable drop in cosine similarity between features. We further analyze teacher selection strategies (pre- vs. intra-distillation) and introduce a scalable cross-class batch formation scheme for fast parallel synthesis. Code will be released after the review period.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Keshara Weerasinghe, Xueren Ge, Tessa Heick, Lahiru Nuwan Wijayasingha, Anthony Cortez et al.
Multimodal
ๆฅ็ๆ่ฆ
Emergency Medical Services (EMS) are critical to patient survival in emergencies, but first responders often face intense cognitive demands in high-stakes situations. AI cognitive assistants, acting as virtual partners, have the potential to ease this burden by supporting real-time data collection and decision making. In pursuit of this vision, we introduce EgoEMS, the first end-to-end, high-fidelity, multimodal, multiperson dataset capturing over 20 hours of realistic, procedural EMS activities from an egocentric view in 233 simulated emergency scenarios performed by 62 participants, including 46 EMS professionals. Developed in collaboration with EMS experts and aligned with national standards, EgoEMS is captured using an open-source, low-cost, and replicable data collection system and is annotated with keysteps, timestamped audio transcripts with speaker diarization, action quality metrics, and bounding boxes with segmentation masks. Emphasizing realism, the dataset includes responder-patient interactions reflecting real-world emergency dynamics. We also present a suite of benchmarks for real-time multimodal keystep recognition and action quality estimation, essential for developing AI support tools for EMS. We hope EgoEMS inspires the research community to push the boundaries of intelligent EMS systems and ultimately contribute to improved patient outcomes.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zubia Naz, Farhan Asghar, Muhammad Ishfaq Hussain, Yahya Hadadi, Muhammad Aasim Rafique et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean$\pm$std over three seeds and include $95\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, $no\_repeat\_ngram\_size$ $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jinfu Li, Yuqi Huang, Hong Song, Ting Wang, Jianghan Xia et al.
Computer Vision
ๆฅ็ๆ่ฆ
Recently, despite the remarkable advancements in object detection, modern detectors still struggle to detect tiny objects in aerial images. One key reason is that tiny objects carry limited features that are inevitably degraded or lost during long-distance network propagation. Another is that smaller objects receive disproportionately greater regression penalties than larger ones during training. To tackle these issues, we propose a Scale-Aware Relay Layer (SARL) and a Scale-Adaptive Loss (SAL) for tiny object detection, both of which are seamlessly compatible with the top-performing frameworks. Specifically, SARL employs a cross-scale spatial-channel attention to progressively enrich the meaningful features of each layer and strengthen the cross-layer feature sharing. SAL reshapes the vanilla IoU-based losses so as to dynamically assign lower weights to larger objects. This loss is able to focus training on tiny objects while reducing the influence on large objects. Extensive experiments are conducted on three benchmarks (\textit{i.e.,} AI-TOD, DOTA-v2.0 and VisDrone2019), and the results demonstrate that the proposed method boosts the generalization ability by 5.5\% Average Precision (AP) when embedded in YOLOv5 (anchor-based) and YOLOx (anchor-free) baselines. Moreover, it also promotes the robust performance with 29.0\% AP on the real-world noisy dataset (\textit{i.e.,} AI-TOD-v2.0).
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Liheng Zhang, Jin Wang, Hui Li, Bingfeng Zhang, Weifeng Liu
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed the 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework has a great computational cost limiting its application, where we identify that the bottleneck lies in processing all 3D tokens in the Large Language Model (LLM) part. This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D) to efficiently compress 3D tokens while maintaining critical detail retention. Specifically, we first propose a global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves extreme compression ratios (approximately 98%) compared to previous 3D-VLMs, but also achieves new state-of-the-art performance, showing the great improvements on both efficiency and performance.
๐
2025-11-13
๐ 13 pages, 8 figures, published to ACM MM
๐ฅ Wenzhe He, Xiaojun Chen, Wentang Chen, Hongyu Wang, Ying Liu et al.
ๆฅ็ๆ่ฆ
Semantic Scene Completion (SSC) aims to generate a complete semantic scene from an incomplete input. Existing approaches often employ dense network architectures with a high parameter count, leading to increased model complexity and resource demands. To address these limitations, we propose RWKV-PCSSC, a lightweight point cloud semantic scene completion network inspired by the Receptance Weighted Key Value (RWKV) mechanism. Specifically, we introduce a RWKV Seed Generator (RWKV-SG) module that can aggregate features from a partial point cloud to produce a coarse point cloud with coarse features. Subsequently, the point-wise feature of the point cloud is progressively restored through multiple stages of the RWKV Point Deconvolution (RWKV-PD) modules. By leveraging a compact and efficient design, our method achieves a lightweight model representation. Experimental results demonstrate that RWKV-PCSSC reduces the parameter count by 4.18$\times$ and improves memory efficiency by 1.37$\times$ compared to state-of-the-art methods PointSSC. Furthermore, our network achieves state-of-the-art performance on established indoor (SSC-PC, NYUCAD-PC) and outdoor (PointSSC) scene dataset, as well as on our proposed datasets (NYUCAD-PC-V2, 3D-FRONT-PC).
๐
2025-11-13
๐ AAAI 2026
๐ฅ Jia Lin, Xiaofei Zhou, Jiyuan Liu, Runmin Cong, Guodao Zhang et al.
Computer Vision
ๆฅ็ๆ่ฆ
Recently segment anything model (SAM) has attracted widespread concerns, and it is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to directly apply the foundation model to the RGB-D video salient object detection (RGB-D VSOD) task, which often encounters three challenges, including the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address the limitations, we propose a novel method, namely Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop-out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. Firstly, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) in a skip-connection way. Remarkably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPA utilizes depth cues to facilitate the fusion of multi-modal features. Secondly, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging both frame-level queries and video-level queries simultaneously, the QTM module can not only selectively extract temporal consistency features but also iteratively update the temporal representations of the queries. Extensive experiments are conducted on three RGB-D VSOD datasets, and the results show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods in terms of all evaluation metrics.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Peng Gao, Yujian Lee, Xiaofeng Zhang, Zailong Chen, Hui Zhang
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Large Vision-Language Models (LVLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they still face critical challenges in modeling long-range dependencies under the usage of Rotary Positional Encoding (ROPE). Although it can facilitate precise modeling of token positions, it induces progressive attention decay as token distance increases, especially with progressive attention decay over distant token pairs, which severely impairs the model's ability to remember global context. To alleviate this issue, we propose inference-only Three-step Decay Resilience Strategies (T-DRS), comprising (1) Semantic-Driven DRS (SD-DRS), amplifying semantically meaningful but distant signals via content-aware residuals, (2) Distance-aware Control DRS (DC-DRS), which can purify attention by smoothly modulating weights based on positional distances, suppressing noise while preserving locality, and (3) re-Reinforce Distant DRS (reRD-DRS), consolidating the remaining informative remote dependencies to maintain global coherence. Together, the T-DRS recover suppressed long-range token pairs without harming local inductive biases. Extensive experiments on Vision Question Answering (VQA) benchmarks demonstrate that T-DRS can consistently improve performance in a training-free manner. The code can be accessed in https://github.com/labixiaoq-qq/Remember-me
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shogo Sato, Takuhiro Kaneko, Shoichiro Takeda, Tomoyasu Shimada, Kazuhiko Murasaki et al.
Robotics
ๆฅ็ๆ่ฆ
Point clouds are widely used in various fields, including augmented reality (AR) and robotics, where relighting and texture editing are crucial for realistic visualization. Achieving these tasks requires accurately separating albedo from shade. However, performing this separation on point clouds presents two key challenges: (1) the non-grid structure of point clouds makes conventional image-based decomposition models ineffective, and (2) point-cloud models designed for other tasks do not explicitly consider global-light direction, resulting in inaccurate shade. In this paper, we introduce \textbf{Intrinsic Point-Cloud Decomposition (IPCD)}, which extends image decomposition to the direct decomposition of colored point clouds into albedo and shade. To overcome challenge (1), we propose \textbf{IPCD-Net} that extends image-based model with point-wise feature aggregation for non-grid data processing. For challenge (2), we introduce \textbf{Projection-based Luminance Distribution (PLD)} with a hierarchical feature refinement, capturing global-light ques via multi-view projection. For comprehensive evaluation, we create a synthetic outdoor-scene dataset. Experimental results demonstrate that IPCD-Net reduces cast shadows in albedo and enhances color accuracy in shade. Furthermore, we showcase its applications in texture editing, relighting, and point-cloud registration under varying illumination. Finally, we verify the real-world applicability of IPCD-Net.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yesheng Liang, Haisheng Chen, Song Han, Zhijian Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model's text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Benjamin L. Badger, Matthew Neligeorge
Natural Language Processing
ๆฅ็ๆ่ฆ
Language prediction is constrained by informational entropy intrinsic to language, such that there exists a limit to how accurate any language model can become and equivalently a lower bound to language compression. The most efficient language compression algorithms today are causal (next token prediction) large language models, but the use of these models to form accurate estimates of language entropy is currently computationally infeasible. We introduce encoder-augmented causal decoder model architectures that exhibit superior training efficiency characteristics and achieve higher compression than causal transformers even when trained on modest hardware. We demonstrate how entropy estimates can be obtained on a per-token basis, and show that the generalization of models trained to approach the entropy of their training data necessarily exceeds the generalization of models trained to minimize loss beyond this value. We show empirically that causal models trained to approach but not exceed estimated per-token entropies exhibit greater generalization than models trained without taking entropy into account.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shruti Singh Baghel, Yash Pratap Singh Rathore, Sushovan Jena, Anurag Pradhan, Amit Shukla et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions but their high memory, computation, and deployment demands hinder practical use particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor), and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bavana Durgapraveen, Sornaraj Sivasankaran, Abhinand Balachandran, Sriram Rajkumar
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Abhinand Balachandran, Bavana Durgapraveen, Gowsikkan Sikkan Sudhagar, Vidhya Varshany J S, Sriram Rajkumar
Natural Language Processing
ๆฅ็ๆ่ฆ
The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-Shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to "overthinking" and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Vishal Thenuwara, Nisansa de Silva
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Fine-grained sentiment analysis faces ongoing challenges in Aspect Sentiment Triple Extraction (ASTE), particularly in accurately capturing the relationships between aspects, opinions, and sentiment polarities. While researchers have made progress using BERT and Graph Neural Networks, the full potential of advanced language models in understanding complex language patterns remains unexplored. We introduce DESS, a new approach that builds upon previous work by integrating DeBERTa's enhanced attention mechanism to better understand context and relationships in text. Our framework maintains a dual-channel structure, where DeBERTa works alongside an LSTM channel to process both meaning and grammatical patterns in text. We have carefully refined how these components work together, paying special attention to how different types of language information interact. When we tested DESS on standard datasets, it showed meaningful improvements over current methods, with F1-score increases of 4.85, 8.36, and 2.42 in identifying aspect opinion pairs and determining sentiment accurately. Looking deeper into the results, we found that DeBERTa's sophisticated attention system helps DESS handle complicated sentence structures better, especially when important words are far apart. Our findings suggest that upgrading to more advanced language models when thoughtfully integrated, can lead to real improvements in how well we can analyze sentiments in text. The implementation of our approach is publicly available at: https://github.com/VishalRepos/DESS.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Garapati Keerthana, Manik Gupta
Machine Learning
ๆฅ็ๆ่ฆ
Personalized decision systems in healthcare and behavioral support often rely on static rule-based or engagement-maximizing heuristics that overlook users' emotional context and ethical constraints. Such approaches risk recommending insensitive or unsafe interventions, especially in domains involving serious mental illness, substance use disorders, or depression. To address this limitation, we propose a Responsible Reinforcement Learning (RRL) framework that integrates emotional and contextual understanding with ethical considerations into the sequential decision-making process. RRL formulates personalization as a Constrained Markov Decision Process (CMDP), where the agent optimizes engagement and adherence while ensuring emotional alignment and ethical safety. We introduce a multi-objective reward function that explicitly balances short-term behavioral engagement with long-term user well-being, and define an emotion-informed state representation that captures fluctuations in emotional readiness, affect, and risk. The proposed architecture can be instantiated with any RL algorithm (e.g., DQN, PPO) augmented with safety constraints or Lagrangian regularization. Conceptually, this framework operationalizes empathy and responsibility within machine learning policy optimization, bridging safe RL, affective computing and responsible AI. We discuss the implications of this approach for human-centric domains such as behavioral health, education, and digital therapeutics, and outline simulation-based validation paths for future empirical work. This paper aims to initiate a methodological conversation about ethically aligned reinforcement learning for emotionally aware and trustworthy personalization systems.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Rishi Singhal, Jung-Eun Kim
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further precisely identify that early layers LayerNorm are the most critical over middle/later layers and their influence varies across Pre and Post LayerNorm models. We have validated it through 13 models across 6 Vision and Language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Yongxin Shi, Jiapeng Wang, Zeyu Shan, Dezhi Peng, Zening Lin et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at https://github.com/shi-yx/URaG.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Natasha Johnson
ๆฅ็ๆ่ฆ
Though the concept of genre has been a subject of discussion for millennia, the relatively recent emergence of genre fiction has added a new layer to this ongoing conversation. While more traditional perspectives on genre have emphasized form, contemporary scholarship has invoked both formal and institutional characteristics in its taxonomy of genre, genre fiction, and literary fiction. This project uses computational methods to explore the soundness of genre as a formal designation as opposed to an institutional one. Pulling from Andrew Piper's CONLIT dataset of Contemporary Literature, we assemble a corpus of literary and genre fiction, with the latter category containing romance, mystery, and science fiction novels. We use Welch's ANOVA to compare the distribution of narrative features according to author gender within each genre and within genre versus literary fiction. Then, we use logistic regression to model the effect that each feature has on literary classification and to measure how author gender moderates these effects. Finally, we analyze stylistic and semantic vector representations of our genre categories to understand the importance of form and content in literary classification. This project finds statistically significant formal markers of each literary category and illustrates how female authorship narrows and blurs the target for achieving literary status.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Egor Pakhomov, Erik Nijkamp, Caiming Xiong
ๆฅ็ๆ่ฆ
We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns--temporal reasoning, implicit extraction, knowledge updates, and graph representations--memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be impractical for traditional RAG. Consistent with recent findings on long context effectiveness, we observe that simple full-context approaches achieve 70-82% accuracy even on our most challenging multi-message evidence cases, while sophisticated RAG-based memory systems like Mem0 achieve only 30-45% when operating on conversation histories under 150 interactions. Our analysis reveals practical transition points: long context excels for the first 30 conversations, remains viable with manageable trade-offs up to 150 conversations, and typically requires hybrid or RAG approaches beyond that point as costs and latencies become prohibitive. These patterns indicate that the small-corpus advantage of conversational memory--where exhaustive search and complete reranking are feasible--deserves dedicated research attention rather than simply applying general RAG solutions to conversation histories.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Srikant Panda, Avinash Rai
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and close-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to +57 percentage points. Styles such as fearful, curious and compassionate are most effective and contextualized rewrites outperform templated variants.
To mitigate this, we introduce a style neutralization preprocessing step using a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dong-Shan Jian, Xiang Li, Chen-Xu Yan, Hui-Wen Zheng, Zhi-Zhang Bian et al.
ๆฅ็ๆ่ฆ
Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles. The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF)-especially for complex, multi-turn, and system-prompted instructions-remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yunzhe Xu, Zhuosheng Zhang, Zhe Liu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models' capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% token usage. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO's superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: https://github.com/xyz9911/KPPO.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zihan Gao, Yifei Xu, Jacob Thebault-Spieker
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance, for example, search improves Gemini's accuracy by +13.6%, but reduces GPT-series performance by -11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kiamehr Rezaee, Jose Camacho-Collados, Mohammad Taher Pilehvar
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the recent generation of LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the former generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Irina Saparina, Mirella Lapata
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chunyang Jiang, Paola Merlo
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models achieve strong performance through training on vast datasets. Can analogical paradigm organization enable lightweight models to match this performance with minimal data? We develop a computational approach implementing three cognitive-inspired principles: analogical structure, contrastive learning, and minimal contextual cues. We test this approach with structured completion tasks where models identify correct sentence completions from analogical patterns with contrastive alternatives. Training lightweight models (BERT+CNN, $0.5M$ parameters) on only one hundred structured examples of English causative/inchoative alternations achieves $F1=0.95$, outperforming zero-shot \texttt{GPT-o3} ($F1=0.87$). Ablation studies confirm that analogical organization and contrastive structure improve performance, consistently surpassing randomly shuffled baselines across architectures. Cross-phenomenon validation using unspecified object alternations replicates these efficiency gains, confirming approach robustness. Our results show that analogical paradigm organization enables competitive linguistic rule learning with orders of magnitude less data than conventional approaches require.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Mehwish Alam
Natural Language Processing
ๆฅ็ๆ่ฆ
In spite of the remarkable advancements in the field of Natural Language Processing, the task of Entity Linking (EL) remains challenging in the field of humanities due to complex document typologies, lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). The goal of this paper is to address these issues with two main contributions. The first contribution is DELICATE, a novel neuro-symbolic method for EL on historical Italian which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second contribution is ENEIDE, a multi-domain EL corpus in historical Italian semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show how DELICATE outperforms other EL models in historical Italian even if compared with larger architectures with billions of parameters. Moreover, further analyses reveal how DELICATE confidence scores and features sensitivity provide results which are more explainable and interpretable than purely neural methods.
๐
2025-11-13
๐ ArXiv Preprint (cs.MA)
๐ฅ Lifan Zheng, Jiawei Chen, Qinghong Yin, Jingyuan Zhang, Xinyi Zeng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored. i.e., whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7\% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous agents powered by large language models (LLMs) have the potential to significantly enhance human productivity by reasoning, using tools, and executing complex tasks in diverse environments. However, current approaches to developing such agents remain costly and inefficient, as they typically require manually constructed task datasets and reinforcement learning (RL) pipelines with extensive random exploration. These limitations lead to prohibitively high data-construction costs, low exploration efficiency, and poor sample utilization. To address these challenges, we present AgentEvolver, a self-evolving agent system that leverages the semantic understanding and reasoning capabilities of LLMs to drive autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: (i) self-questioning, which enables curiosity-driven task generation in novel environments, reducing dependence on handcrafted datasets; (ii) self-navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; and (iii) self-attributing, which enhances sample efficiency by assigning differentiated rewards to trajectory states and actions based on their contribution. By integrating these mechanisms into a unified framework, AgentEvolver enables scalable, cost-effective, and continual improvement of agent capabilities. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.
๐
2025-11-13
๐ ArXiv Preprint (cs.SI)
๐ฅ Raj Gaurav Maurya, Vaibhav Shukla, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Misinformation on social media thrives on surprise, emotion, and identity-driven reasoning, often amplified through human cognitive biases. To investigate these mechanisms, we model large language model (LLM) personas as synthetic agents that mimic user-level biases, ideological alignments, and trust heuristics. Within this setup, we introduce an auditor--node framework to simulate and analyze how misinformation evolves as it circulates through networks of such agents. News articles are propagated across networks of persona-conditioned LLM nodes, each rewriting received content. A question--answering-based auditor then measures factual fidelity at every step, offering interpretable, claim-level tracking of misinformation drift. We formalize a misinformation index and a misinformation propagation rate to quantify factual degradation across homogeneous and heterogeneous branches of up to 30 sequential rewrites. Experiments with 21 personas across 10 domains reveal that identity- and ideology-based personas act as misinformation accelerators, especially in politics, marketing, and technology. By contrast, expert-driven personas preserve factual stability. Controlled-random branch simulations further show that once early distortions emerge, heterogeneous persona interactions rapidly escalate misinformation to propaganda-level distortion. Our taxonomy of misinformation severity -- spanning factual errors, lies, and propaganda -- connects observed drift to established theories in misinformation studies. These findings demonstrate the dual role of LLMs as both proxies for human-like biases and as auditors capable of tracing information fidelity. The proposed framework provides an interpretable, empirically grounded approach for studying, simulating, and mitigating misinformation diffusion in digital ecosystems.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jason Chan, Zhixue Zhao, Robert Gaizauskas
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing work investigates the reasoning capabilities of large language models (LLMs) to uncover their limitations, human-like biases and underlying processes. Such studies include evaluations of base LLMs (pre-trained on unlabeled corpora only) for this purpose. Our position paper argues that evaluating base LLMs' reasoning capabilities raises inherent methodological concerns that are overlooked in such existing studies. We highlight the fundamental mismatch between base LLMs' pretraining objective and normative qualities, such as correctness, by which reasoning is assessed. In particular, we show how base LLMs generate logically valid or invalid conclusions as coincidental byproducts of conforming to purely linguistic patterns of statistical plausibility. This fundamental mismatch challenges the assumptions that (a) base LLMs' outputs can be assessed as their bona fide attempts at correct answers or conclusions; and (b) conclusions about base LLMs' reasoning can generalize to post-trained LLMs optimized for successful instruction-following. We call for a critical re-examination of existing work that relies implicitly on these assumptions, and for future work to account for these methodological pitfalls.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Shuyi Liu, Yuming Shang, Xi Zhang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for enhancing the capabilities of Large Language Models (LLMs) by integrating retrieval-based methods with generative models. As external knowledge repositories continue to expand and the parametric knowledge within models becomes outdated, a critical challenge for RAG systems is resolving conflicts between retrieved external information and LLMs' internal knowledge, which can significantly compromise the accuracy and reliability of generated content. However, existing approaches to conflict resolution typically operate at the token or semantic level, often leading to fragmented and partial understanding of factual discrepancies between LLMs' knowledge and context, particularly in knowledge-intensive tasks. To address this limitation, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level knowledge conflicts in RAG systems. Specifically, TruthfulRAG constructs KGs by systematically extracting triples from retrieved content, utilizes query-based graph retrieval to identify relevant knowledge, and employs entropy-based filtering mechanisms to precisely locate conflicting elements and mitigate factual inconsistencies, thereby enabling LLMs to generate faithful and accurate responses. Extensive experiments reveal that TruthfulRAG outperforms existing methods, effectively alleviating knowledge conflicts and improving the robustness and trustworthiness of RAG systems.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Andrea Schimmenti, Valentina Pasqual, Fabio Vitali, Marieke van Erp
Natural Language Processing
ๆฅ็ๆ่ฆ
Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts...), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Guduru Manoj, Neel Prabhanjan Rachamalla, Ashish Kulkarni, Gautam Rajeev, Jay Piplodiya et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Changyuan Tian, Zhicong Lu, Shuang Qian, Nayu Liu, Peiguang Li et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict of the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations for critiquing capability enhancement and pay little attention to delving into the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason -- imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe into LLMs' critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavior difference of LLMs when evaluating the problem solutions generated by itself and others. Then, to investigate the behavior difference in depth, we conduct a statistical preference analysis oriented on perplexity and find an intriguing phenomenon -- ``LLMs incline to judge solutions with lower perplexity as correct'', which is dubbed as \textit{imbalanced evaluation preference}. To rectify this preference, we regard perplexity as the baton in the algorithm of Group Relative Policy Optimization, supporting the LLMs to explore trajectories that judge lower perplexity as wrong and higher perplexity as correct. Extensive experimental results on our built OPS and existing available critic benchmarks demonstrate the validity of our method.
๐
2025-11-13
๐ ACL
๐ฅ Paolo Astrino
ๆฅ็ๆ่ฆ
Organizations handling sensitive documents face a critical dilemma: adopt cloud-based AI systems that offer powerful question-answering capabilities but compromise data privacy, or maintain local processing that ensures security but delivers poor accuracy. We present a question-answering system that resolves this trade-off by combining semantic understanding with keyword precision, operating entirely on local infrastructure without internet access. Our approach demonstrates that organizations can achieve competitive accuracy on complex queries across legal, scientific, and conversational documents while keeping all data on their machines. By balancing two complementary retrieval strategies and using consumer-grade hardware acceleration, the system delivers reliable answers with minimal errors, letting banks, hospitals, and law firms adopt conversational document AI without transmitting proprietary information to external providers. This work establishes that privacy and performance need not be mutually exclusive in enterprise AI deployment.
๐
2025-11-13
๐ ArXiv Preprint (eess.AS)
๐ฅ Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang-gil Lee, Zhifeng Kong et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuping Yan, Yuhan Xie, Yuanshuai Li, Yingchao Yu, Lingjuan Lyu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe contents, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Jing He, Han Zhang, Yuanhui Xiao, Wei Guo, Shaowen Yao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Fake news detection methods based on writing style have achieved remarkable progress. However, as adversaries increasingly imitate the style of authentic news, the effectiveness of such approaches is gradually diminishing. Recent research has explored incorporating large language models (LLMs) to enhance fake news detection. Yet, despite their transformative potential, LLMs remain an untapped goldmine for fake news detection, with their real-world adoption hampered by shallow functionality exploration, ambiguous usability, and prohibitive inference costs. In this paper, we propose a novel fake news detection framework, dubbed FactGuard, that leverages LLMs to extract event-centric content, thereby reducing the impact of writing style on detection performance. Furthermore, our approach introduces a dynamic usability mechanism that identifies contradictions and ambiguous cases in factual reasoning, adaptively incorporating LLM advice to improve decision reliability. To ensure efficiency and practical deployment, we employ knowledge distillation to derive FactGuard-D, enabling the framework to operate effectively in cold-start and resource-constrained scenarios. Comprehensive experiments on two benchmark datasets demonstrate that our approach consistently outperforms existing methods in both robustness and accuracy, effectively addressing the challenges of style sensitivity and LLM usability in fake news detection.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be available in the future.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Minbae Park, Hyemin Yang, Jeonghyun Kim, Kunsoo Park, Hyunjoon Kim
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) demonstrate strong reasoning capabilities but struggle with hallucinations and limited transparency. Recently, KG-enhanced LLMs that integrate knowledge graphs (KGs) have been shown to improve reasoning performance, particularly for complex, knowledge-intensive tasks. However, these methods still face significant challenges, including inaccurate retrieval and reasoning failures, often exacerbated by long input contexts that obscure relevant information or by context constructions that struggle to capture the richer logical directions required by different question types. Furthermore, many of these approaches rely on LLMs to directly retrieve evidence from KGs, and to self-assess the sufficiency of this evidence, which often results in premature or incorrect reasoning. To address the retrieval and reasoning failures, we propose ProgRAG, a multi-hop knowledge graph question answering (KGQA) framework that decomposes complex questions into sub-questions, and progressively extends partial reasoning paths by answering each sub-question. At each step, external retrievers gather candidate evidence, which is then refined through uncertainty-aware pruning by the LLM. Finally, the context for LLM reasoning is optimized by organizing and rearranging the partial reasoning paths obtained from the sub-question answers. Experiments on three well-known datasets demonstrate that ProgRAG outperforms existing baselines in multi-hop KGQA, offering improved reliability and reasoning quality.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuhao Wang, Ziyang Cheng, Heyang Liu, Ronghua Wu, Qunshan Gu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Current end-to-end spoken language models (SLMs) have made notable progress, yet they still encounter considerable response latency. This delay primarily arises from the autoregressive generation of speech tokens and the reliance on complex flow-matching models for speech synthesis. To overcome this, we introduce VocalNet-M2, a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction (MTP) strategy. Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model. Furthermore, our MTP strategy enhances generation efficiency and improves overall performance. Extensive experiments demonstrate that VocalNet-M2 achieves a substantial reduction in first chunk latency (from approximately 725ms to 350ms) while maintaining competitive performance across mainstream SLMs. This work also provides a comprehensive comparison of single-codebook and multi-codebook strategies, offering valuable insights for developing efficient and high-performance SLMs for real-time interactive applications.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Lei Huang, Weitao Ma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Joint multilingual instruction tuning is a widely adopted approach to improve the multilingual instruction-following ability and downstream performance of large language models (LLMs), but the resulting multilingual capability remains highly sensitive to the composition and selection of the training data. Existing selection methods, often based on features like text quality, diversity, or task relevance, typically overlook the intrinsic linguistic structure of multilingual data. In this paper, we propose LangGPS, a lightweight two-stage pre-selection framework guided by language separability which quantifies how well samples in different languages can be distinguished in the model's representation space. LangGPS first filters training data based on separability scores and then refines the subset using existing selection methods. Extensive experiments across six benchmarks and 22 languages demonstrate that applying LangGPS on top of existing selection methods improves their effectiveness and generalizability in multilingual training, especially for understanding tasks and low-resource languages. Further analysis reveals that highly separable samples facilitate the formation of clearer language boundaries and support faster adaptation, while low-separability samples tend to function as bridges for cross-lingual alignment. Besides, we also find that language separability can serve as an effective signal for multilingual curriculum learning, where interleaving samples with diverse separability levels yields stable and generalizable gains. Together, we hope our work offers a new perspective on data utility in multilingual contexts and support the development of more linguistically informed LLMs.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Guanrong Li, Xinyu Liu, Zhen Wu, Xinyu Dai
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalized dialogue generation aims to leverage persona profiles and dialogue history to generate persona-relevant and consistent responses. Mainstream models typically rely on token-level language model training with persona dialogue data, such as Next Token Prediction, to implicitly achieve personalization, making these methods tend to neglect the given personas and generate generic responses. To address this issue, we propose a novel Persona-Aware Alignment Framework (PAL), which directly treats persona alignment as the training objective of dialogue generation. Specifically, PAL employs a two-stage training method including Persona-aware Learning and Persona Alignment, equipped with an easy-to-use inference strategy Select then Generate, to improve persona sensitivity and generate more persona-relevant responses at the semantics level. Through extensive experiments, we demonstrate that our framework outperforms many state-of-the-art personalized dialogue methods and large language models.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Junquan Huang, Haotian Wu, Yubo Gao, Yibo Yan, Junyan Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang et al.
ๆฅ็ๆ่ฆ
The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow's high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiran Zhang, Mingyang Lin, Mark Dras, Usman Naseem
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning processes within these interactions presents a significant challenge due to complex contextual dependencies and a lack of specialized visualization tools, leading to a high cognitive load for researchers. To address this gap, we present VISTA, an web-based Visual Interactive System for Textual Analytics in multi-turn reasoning tasks. VISTA allows users to visualize the influence of context on model decisions and interactively modify conversation histories to conduct "what-if" analyses across different models. Furthermore, the platform can automatically parse a session and generate a reasoning dependency tree, offering a transparent view of the model's step-by-step logical path. By providing a unified and interactive framework, VISTA significantly reduces the complexity of analyzing reasoning chains, thereby facilitating a deeper understanding of the capabilities and limitations of current LLMs. The platform is open-source and supports easy integration of custom benchmarks and local models.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Philipp Seeberger, Steffen Freisinger, Tobias Bocklet, Korbinian Riedhammer
ๆฅ็ๆ่ฆ
Due to the rapid growth of social media platforms, these tools have become essential for monitoring information during ongoing disaster events. However, extracting valuable insights requires real-time processing of vast amounts of data. A major challenge in existing systems is their exposure to event-related biases, which negatively affects their ability to generalize to emerging events. While recent advancements in debiasing and causal learning offer promising solutions, they remain underexplored in the disaster event domain. In this work, we approach bias mitigation through a causal lens and propose a method to reduce event- and domain-related biases, enhancing generalization to future events. Our approach outperforms multiple baselines by up to +1.9% F1 and significantly improves a PLM-based classifier across three disaster classification tasks.
๐
2025-11-13
๐ Published in the meeting proceedings of the 2025 I
๐ฅ Satu Johansson, Taneli Riihonen
Natural Language Processing
ๆฅ็ๆ่ฆ
In this paper, military use cases or applications and implementation thereof are considered for natural language processing and large language models, which have broken into fame with the invention of the generative pre-trained transformer (GPT) and the extensive foundation model pretraining done by OpenAI for ChatGPT and others. First, we interrogate a GPT-based language model (viz. Microsoft Copilot) to make it reveal its own knowledge about their potential military applications and then critically assess the information. Second, we study how commercial cloud services (viz. Microsoft Azure) could be used readily to build such applications and assess which of them are feasible. We conclude that the summarization and generative properties of language models directly facilitate many applications at large and other features may find particular uses.
๐
2025-11-13
๐ EMNLP 2025
๐ฅ Haroun Elleuch, Youssef Saidi, Salima Mdhaffar, Yannick Estรจve, Fethi Bougares
ๆฅ็ๆ่ฆ
This paper describes Elyadata \& LIA's joint submission to the NADI multi-dialectal Arabic Speech Processing 2025. We participated in the Spoken Arabic Dialect Identification (ADI) and multi-dialectal Arabic ASR subtasks. Our submission ranked first for the ADI subtask and second for the multi-dialectal Arabic ASR subtask among all participants. Our ADI system is a fine-tuned Whisper-large-v3 encoder with data augmentation. This system obtained the highest ADI accuracy score of \textbf{79.83\%} on the official test set. For multi-dialectal Arabic ASR, we fine-tuned SeamlessM4T-v2 Large (Egyptian variant) separately for each of the eight considered dialects. Overall, we obtained an average WER and CER of \textbf{38.54\%} and \textbf{14.53\%}, respectively, on the test set. Our results demonstrate the effectiveness of large pre-trained speech models with targeted fine-tuning for Arabic speech processing.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Florian Boudin, Atsuhiro Takasu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models' multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.
๐
2025-11-13
๐ INTERSPEECH 2025
๐ฅ Haroun Elleuch, Salima Mdhaffar, Yannick Estรจve, Fethi Bougares
ๆฅ็ๆ่ฆ
We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuxuan Zhou, Yubin Wang, Bin Wang, Chen Ning, Xien Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design a attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along three key facets, and refines its responses to better align with the requirements of each facet. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model's context-awareness ability. Evaluation results on the latest HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains in the context-awareness axis. Furthermore, by incorporating knowledge distillation with the proposed method, the performance of a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new SOTA across all open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at https://muser-llm.github.io.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhenhe Li, Can Lin, Ling Zheng, Wen-Da Wei, Junli Liang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs' interpretability.
๐
2025-11-13
๐ SIGIR
๐ฅ Jiangshu Du, Wenpeng Yin, Philip Yu
Natural Language Processing
ๆฅ็ๆ่ฆ
The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer(Span Representation Cumulation for Long-Context Transformer) - a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk's representation with structural awareness of its position within the document. It achieves this by enriching each chunk's boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document's narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shivam Sharma, Riya Naik, Tejas Gawas, Heramb Patil, Kunal Korgaonkar
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework "PustakAI"\footnote{Pustak means `book' in many Indian languages.} for the design and evaluation of a novel question-answering dataset "NCERT-QA" aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yu-Shiang Huang, Yun-Yu Lee, Tzu-Hsin Chou, Che Lin, Chuan-Ju Wang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
BERTScore has become a widely adopted metric for evaluating semantic similarity between natural language sentences. However, we identify a critical limitation: BERTScore exhibits low sensitivity to numerical variation, a significant weakness in finance where numerical precision directly affects meaning (e.g., distinguishing a 2% gain from a 20% loss). We introduce FinNuE, a diagnostic dataset constructed with controlled numerical perturbations across earnings calls, regulatory filings, social media, and news articles. Using FinNuE, demonstrate that BERTScore fails to distinguish semantically critical numerical differences, often assigning high similarity scores to financially divergent text pairs. Our findings reveal fundamental limitations of embedding-based metrics for finance and motivate numerically-aware evaluation frameworks for financial NLP.
๐
2025-11-13
๐ AAAI
๐ฅ Bo Li, Zhenghua Xu, Rui Xie
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Multilingual Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to perform knowledge-intensive tasks in multilingual settings by leveraging retrieved documents as external evidence. However, when the retrieved evidence differs in language from the user query and in-context exemplars, the model often exhibits language drift by generating responses in an unintended language. This phenomenon is especially pronounced during reasoning-intensive decoding, such as Chain-of-Thought (CoT) generation, where intermediate steps introduce further language instability. In this paper, we systematically study output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Our controlled experiments reveal that the drift results not from comprehension failure but from decoder-level collapse, where dominant token distributions and high-frequency English patterns dominate the intended generation language. We further observe that English serves as a semantic attractor under cross-lingual conditions, emerging as both the strongest interference source and the most frequent fallback language.
To mitigate this, we propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language by penalizing non-target-language tokens. SCD is model-agnostic and can be applied to any generation algorithm without modifying the architecture or requiring additional data. Experiments across three multilingual datasets and multiple typologically diverse languages show that SCD consistently improves language alignment and task performance, providing an effective and generalizable solution in multilingual RAG.
๐
2025-11-13
๐ AAAI
๐ฅ Bo Li, Tian Tian, Zhenghua Xu, Hao Cheng, Shikun Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal timing for retrieval. Existing methods often trigger retrieval based on low token-level confidence, which may lead to delayed intervention after errors have already propagated. We introduce Entropy-Trend Constraint (ETC), a training-free method that determines optimal retrieval timing by modeling the dynamics of token-level uncertainty. Specifically, ETC utilizes first- and second-order differences of the entropy sequence to detect emerging uncertainty trends, enabling earlier and more precise retrieval. Experiments on six QA benchmarks with three LLM backbones demonstrate that ETC consistently outperforms strong baselines while reducing retrieval frequency. ETC is particularly effective in domain-specific scenarios, exhibiting robust generalization capabilities. Ablation studies and qualitative analyses further confirm that trend-aware uncertainty modeling yields more effective retrieval timing. The method is plug-and-play, model-agnostic, and readily integrable into existing decoding pipelines. Implementation code is included in the supplementary materials.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Peter Rรธysland Aarnes, Vinay Setty
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large language models show strong performance on knowledge intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62\% under certain perturbations. No model proves to be robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Yijie Zhu, Haojie Zhou, Wanting Hong, Tailin Liu, Ning Wang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. Insufficient exploitation of retrieved content and the neglect of latent clues fail to ensure the accuracy of reasoning outcomes. To overcome these limitations, we propose Recursive Evaluation and Adaptive Planning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. These two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and the traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP's performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Canwen Wang, Jionghao Lin, Kenneth R. Koedinger
Natural Language Processing
ๆฅ็ๆ่ฆ
Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a "simulated textbook" LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model's performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Xuan Rao, Simian Xu, Zheng Li, Bo Zhao, Derong Liu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances have shown that sequential fine-tuning (SeqFT) of pre-trained vision transformers (ViTs), followed by classifier refinement using approximate distributions of class features, can be an effective strategy for class-incremental learning (CIL). However, this approach is susceptible to distribution drift, caused by the sequential optimization of shared backbone parameters. This results in a mismatch between the distributions of the previously learned classes and that of the updater model, ultimately degrading the effectiveness of classifier performance over time. To address this issue, we introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. First, we present a linear variant of SLDC, which learns a linear operator by solving a regularized least-squares problem that maps features before and after fine-tuning. Next, we extend this with a weakly nonlinear SLDC variant, which assumes that the ideal transition operator lies between purely linear and fully nonlinear transformations. This is implemented using learnable, weakly nonlinear mappings that balance flexibility and generalization. To further reduce representation drift, we apply knowledge distillation (KD) in both algorithmic variants. Extensive experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT. Notably, by combining KD to address representation drift with SLDC to compensate distribution drift, SeqFT achieves performance comparable to joint training across all evaluated datasets. Code: https://github.com/raoxuan98-hash/sldc.git.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Pritish Sahu, Anirudh Som, Dimitra Vergyri, Ajay Divakaran
ๆฅ็ๆ่ฆ
Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Unlike factual commonsense, norm reasoning is subjective, context-dependent, and varies across cultures, posing challenges for computational models. Prior works provide valuable normative annotations but mostly target isolated utterances or synthetic dialogues, limiting their ability to capture the fluid, multi-turn nature of real-world conversations. In this work, we present Norm-RAG, a retrieval-augmented, agentic framework for nuanced social norm inference in multi-turn dialogues. Norm-RAG models utterance-level attributes including communicative intent, speaker roles, interpersonal framing, and linguistic cues and grounds them in structured normative documentation retrieved via a novel Semantic Chunking approach. This enables interpretable and context-aware reasoning about norm adherence and violation across multilingual dialogues. We further introduce MINDS (Multilingual Interactions with Norm-Driven Speech), a bilingual dataset comprising 31 multi-turn Mandarin-English and Spanish-English conversations. Each turn is annotated for norm category and adherence status using multi-annotator consensus, reflecting cross-cultural and realistic norm expression. Our experiments show that Norm-RAG improves norm detection and generalization, demonstrates improved performance for culturally adaptive and socially intelligent dialogue systems.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan et al.
Multimodal
ๆฅ็ๆ่ฆ
To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zubia Naz, Farhan Asghar, Muhammad Ishfaq Hussain, Yahya Hadadi, Muhammad Aasim Rafique et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean$\pm$std over three seeds and include $95\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, $no\_repeat\_ngram\_size$ $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
๐
2025-11-13
๐ Accepted by IEEE Symposium on Security and Privacy
๐ฅ Jialin Wu, Kecen Li, Zhicong Huang, Xinfeng Li, Xiaofeng Wang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable's generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts. Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nikunj Gupta, Bill Guo, Rajgopal Kannan, Viktor K. Prasanna
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All codes can be found here https://github.com/ Nikunj-Gupta/hierouter.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Mingye Zhu, Yi Liu, Zheren Fu, Quan Wang, Yongdong Zhang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single "golden" rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors-token-wise importance weights estimated by the information discrepancy between the generative policy and its answer-conditioned counterpart, for informative next token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yidan Sun, Mengying Zhu, Feiyue Chen, Yangyang Wu, Xiaolei Dan et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated impressive performance in text generation tasks; however, their embedding spaces often suffer from the isotropy problem, resulting in poor discrimination of domain-specific terminology, particularly in legal and financial contexts. This weakness in terminology-level representation can severely hinder downstream tasks such as legal judgment prediction or financial risk analysis, where subtle semantic distinctions are critical. To address this problem, we propose TermGPT, a multi-level contrastive fine-tuning framework designed for terminology adaptation. We first construct a sentence graph to capture semantic and structural relations, and generate semantically consistent yet discriminative positive and negative samples based on contextual and topological cues. We then devise a multi-level contrastive learning approach at both the sentence and token levels, enhancing global contextual understanding and fine-grained terminology discrimination. To support robust evaluation, we construct the first financial terminology dataset derived from official regulatory documents. Experiments show that TermGPT outperforms existing baselines in term discrimination tasks within the finance and legal domains.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Neo Wang, Sonit Singh
Natural Language Processing
ๆฅ็ๆ่ฆ
The course forums are increasingly significant and play vital role in facilitating student discussions and answering their questions related to the course. It provides a platform for students to post their questions related to the content and admin issues related to the course. However, there are several challenges due to the increase in the number of students enrolled in the course. The primary challenge is that students' queries cannot be responded immediately and the instructors have to face lots of repetitive questions. To mitigate these issues, we propose a question answering system based on large language model with retrieval augmented generation (RAG) method. This work focuses on designing a question answering system with open source Large Language Model (LLM) and fine-tuning it on the relevant course dataset. To further improve the performance, we use a local knowledge base and applied RAG method to retrieve relevant documents relevant to students' queries, where the local knowledge base contains all the course content. To mitigate the hallucination of LLMs, We also integrate it with multi chain-of-thought reasoning to overcome the challenge of hallucination in LLMs. In this work, we experiment fine-tuned LLM with RAG method on the HotpotQA dataset. The experimental results demonstrate that the fine-tuned LLM with RAG method has a strong performance on question answering task.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Pascal Strauch, David Mรผller, Sammy Christen, Agon Serifi, Ruben Grandia et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Despite recent advances in robust locomotion, bipedal robots operating in the real world remain at risk of falling. While most research focuses on preventing such events, we instead concentrate on the phenomenon of falling itself. Specifically, we aim to reduce physical damage to the robot while providing users with control over a robot's end pose. To this end, we propose a robot agnostic reward function that balances the achievement of a desired end pose with impact minimization and the protection of critical robot parts during reinforcement learning. To make the policy robust to a broad range of initial falling conditions and to enable the specification of an arbitrary and unseen end pose at inference time, we introduce a simulation-based sampling strategy of initial and end poses. Through simulated and real-world experiments, our work demonstrates that even bipedal robots can perform controlled, soft falls.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Edward Kim, Devan Shanker, Varun Bharadwaj, Hongbeen Park, Jinkyu Kim et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Simulation-based testing has become a crucial complement to road testing for ensuring the safety of cyber physical systems (CPS). As a result, significant research efforts have been directed toward identifying failure scenarios within simulation environments. However, a critical question remains. Are the AV failure scenarios discovered in simulation reproducible on actual systems in the real world? The sim-to-real gap caused by differences between simulated and real sensor data means that failure scenarios identified in simulation might either be artifacts of synthetic sensor data or actual issues that also occur with real sensor data. To address this, an effective approach to validating simulated failure scenarios is to locate occurrences of these scenarios within real-world datasets and verify whether the failure persists on the datasets. To this end, we introduce a formal definition of how labeled time series sensor data can match an abstract scenario, represented as a scenario program using the Scenic probabilistic programming language. We present a querying algorithm that, given a scenario program and a labeled dataset, identifies the subset of data that matches the specified scenario. Our experiment shows that our algorithm is more accurate and orders of magnitude faster in querying scenarios than the state-of-the-art commercial vision large language models, and can scale with the duration of queried time series data.
๐
2025-11-13
๐ ArXiv Preprint (math.OC)
๐ฅ Ilyas Fatkhullin, Niao He, Guanghui Lan, Florian Wolf
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Constrained non-convex optimization is fundamentally challenging, as global solutions are generally intractable and constraint qualifications may not hold. However, in many applications, including safe policy optimization in control and reinforcement learning, such problems possess hidden convexity, meaning they can be reformulated as convex programs via a nonlinear invertible transformation. Typically such transformations are implicit or unknown, making the direct link with the convex program impossible. On the other hand, (sub-)gradients with respect to the original variables are often accessible or can be easily estimated, which motivates algorithms that operate directly in the original (non-convex) problem space using standard (sub-)gradient oracles. In this work, we develop the first algorithms to provably solve such non-convex problems to global minima. First, using a modified inexact proximal point method, we establish global last-iterate convergence guarantees with $\widetilde{\mathcal{O}}(\varepsilon^{-3})$ oracle complexity in non-smooth setting. For smooth problems, we propose a new bundle-level type method based on linearly constrained quadratic subproblems, improving the oracle complexity to $\widetilde{\mathcal{O}}(\varepsilon^{-1})$. Surprisingly, despite non-convexity, our methodology does not require any constraint qualifications, can handle hidden convex equality constraints, and achieves complexities matching those for solving unconstrained hidden convex optimization.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Avrim Blum, Marten Garicano, Kavya Ravichandran, Dravyansh Sharma
Machine Learning
ๆฅ็ๆ่ฆ
The improving multi-armed bandits problem is a formal model for allocating effort under uncertainty, motivated by scenarios such as investing research effort into new technologies, performing clinical trials, and hyperparameter selection from learning curves. Each pull of an arm provides reward that increases monotonically with diminishing returns. A growing line of work has designed algorithms for improving bandits, albeit with somewhat pessimistic worst-case guarantees. Indeed, strong lower bounds of $ฮฉ(k)$ and $ฮฉ(\sqrt{k})$ multiplicative approximation factors are known for both deterministic and randomized algorithms (respectively) relative to the optimal arm, where $k$ is the number of bandit arms. In this work, we propose two new parameterized families of bandit algorithms and bound the sample complexity of learning the near-optimal algorithm from each family using offline data. The first family we define includes the optimal randomized algorithm from prior work. We show that an appropriately chosen algorithm from this family can achieve stronger guarantees, with optimal dependence on $k$, when the arm reward curves satisfy additional properties related to the strength of concavity. Our second family contains algorithms that both guarantee best-arm identification on well-behaved instances and revert to worst case guarantees on poorly-behaved instances. Taking a statistical learning perspective on the bandit rewards optimization problem, we achieve stronger data-dependent guarantees without the need for actually verifying whether the assumptions are satisfied.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Benjamin L. Badger, Matthew Neligeorge
Natural Language Processing
ๆฅ็ๆ่ฆ
Language prediction is constrained by informational entropy intrinsic to language, such that there exists a limit to how accurate any language model can become and equivalently a lower bound to language compression. The most efficient language compression algorithms today are causal (next token prediction) large language models, but the use of these models to form accurate estimates of language entropy is currently computationally infeasible. We introduce encoder-augmented causal decoder model architectures that exhibit superior training efficiency characteristics and achieve higher compression than causal transformers even when trained on modest hardware. We demonstrate how entropy estimates can be obtained on a per-token basis, and show that the generalization of models trained to approach the entropy of their training data necessarily exceeds the generalization of models trained to minimize loss beyond this value. We show empirically that causal models trained to approach but not exceed estimated per-token entropies exhibit greater generalization than models trained without taking entropy into account.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zack Dewis, Yimin Zhu, Zhengsen Xu, Mabel Heffring, Saeid Taleghanidoozdoozan et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Although Sentinel-2 based land use and land cover (LULC) classification is critical for various environmental monitoring applications, it is a very difficult task due to some key data challenges (e.g., spatial heterogeneity, context information, signature ambiguity). This paper presents a novel Multitask Glocal OBIA-Mamba (MSOM) for enhanced Sentinel-2 classification with the following contributions. First, an object-based image analysis (OBIA) Mamba model (OBIA-Mamba) is designed to reduce redundant computation without compromising fine-grained details by using superpixels as Mamba tokens. Second, a global-local (GLocal) dual-branch convolutional neural network (CNN)-mamba architecture is designed to jointly model local spatial detail and global contextual information. Third, a multitask optimization framework is designed to employ dual loss functions to balance local precision with global consistency. The proposed approach is tested on Sentinel-2 imagery in Alberta, Canada, in comparison with several advanced classification approaches, and the results demonstrate that the proposed approach achieves higher classification accuracy and finer details that the other state-of-the-art methods.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Miles Wang-Henderson, Ben Kaufman, Edward Williams, Ryan Pederson, Matteo Rossi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Batched synthesis and testing of molecular designs is the key bottleneck of drug development. There has been great interest in leveraging biomolecular foundation models as surrogates to accelerate this process. In this work, we show how to obtain scalable probabilistic surrogates of binding affinity for use in Batch Bayesian Optimization (Batch BO). This demands parallel acquisition functions that hedge between designs and the ability to rapidly sample from a joint predictive density to approximate them. Through the framework of Epistemic Neural Networks (ENNs), we obtain scalable joint predictive distributions of binding affinity on top of representations taken from large structure-informed models. Key to this work is an investigation into the importance of prior networks in ENNs and how to pretrain them on synthetic data to improve downstream performance in Batch BO. Their utility is demonstrated by rediscovering known potent EGFR inhibitors on a semi-synthetic benchmark in up to 5x fewer iterations, as well as potent inhibitors from a real-world small-molecule library in up to 10x fewer iterations, offering a promising solution for large-scale drug discovery applications.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuval Shapira, Dana Drachsler-Cohen
ๆฅ็ๆ่ฆ
Few-pixel attacks mislead a classifier by modifying a few pixels of an image. Their perturbation space is an $\ell_0$-ball, which is not convex, unlike $\ell_p$-balls for $p\geq1$. However, existing local robustness verifiers typically scale by relying on linear bound propagation, which captures convex perturbation spaces. We show that the convex hull of an $\ell_0$-ball is the intersection of its bounding box and an asymmetrically scaled $\ell_1$-like polytope. The volumes of the convex hull and this polytope are nearly equal as the input dimension increases. We then show a linear bound propagation that precisely computes bounds over the convex hull and is significantly tighter than bound propagations over the bounding box or our $\ell_1$-like polytope. This bound propagation scales the state-of-the-art $\ell_0$ verifier on its most challenging robustness benchmarks by 1.24x-7.07x, with a geometric mean of 3.16.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fengsheng Lin, Shengyi Yan, Trac Duy Tran
ๆฅ็ๆ่ฆ
We present a semi-unified sparse dictionary learning framework that bridges the gap between classical sparse models and modern deep architectures. Specifically, the method integrates strict Top-$K$ LISTA and its convex FISTA-based variant (LISTAConv) into the discriminative LC-KSVD2 model, enabling co-evolution between the sparse encoder and the dictionary under supervised or unsupervised regimes. This unified design retains the interpretability of traditional sparse coding while benefiting from efficient, differentiable training.
We further establish a PALM-style convergence analysis for the convex variant, ensuring theoretical stability under block alternation. Experimentally, our method achieves 95.6\% on CIFAR-10, 86.3\% on CIFAR-100, and 88.5\% on TinyImageNet with faster convergence and lower memory cost ($<$4GB GPU). The results confirm that the proposed LC-KSVD2 + LISTA/LISTAConv pipeline offers an interpretable and computationally efficient alternative for modern deep architectures.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Garapati Keerthana, Manik Gupta
Machine Learning
ๆฅ็ๆ่ฆ
Personalized decision systems in healthcare and behavioral support often rely on static rule-based or engagement-maximizing heuristics that overlook users' emotional context and ethical constraints. Such approaches risk recommending insensitive or unsafe interventions, especially in domains involving serious mental illness, substance use disorders, or depression. To address this limitation, we propose a Responsible Reinforcement Learning (RRL) framework that integrates emotional and contextual understanding with ethical considerations into the sequential decision-making process. RRL formulates personalization as a Constrained Markov Decision Process (CMDP), where the agent optimizes engagement and adherence while ensuring emotional alignment and ethical safety. We introduce a multi-objective reward function that explicitly balances short-term behavioral engagement with long-term user well-being, and define an emotion-informed state representation that captures fluctuations in emotional readiness, affect, and risk. The proposed architecture can be instantiated with any RL algorithm (e.g., DQN, PPO) augmented with safety constraints or Lagrangian regularization. Conceptually, this framework operationalizes empathy and responsibility within machine learning policy optimization, bridging safe RL, affective computing and responsible AI. We discuss the implications of this approach for human-centric domains such as behavioral health, education, and digital therapeutics, and outline simulation-based validation paths for future empirical work. This paper aims to initiate a methodological conversation about ethically aligned reinforcement learning for emotionally aware and trustworthy personalization systems.
๐
2025-11-13
๐ AAAI
๐ฅ Mohammadsina Almasi, Hadis Anahideh
Machine Learning
ๆฅ็ๆ่ฆ
Equitably allocating limited resources in high-stakes domains-such as education, employment, and healthcare-requires balancing short-term utility with long-term impact, while accounting for delayed outcomes, hidden heterogeneity, and ethical constraints. However, most learning-based allocation frameworks either assume immediate feedback or ignore the complex interplay between individual characteristics and intervention dynamics. We propose a novel bi-level contextual bandit framework for individualized resource allocation under delayed feedback, designed to operate in real-world settings with dynamic populations, capacity constraints, and time-sensitive impact. At the meta level, the model optimizes subgroup-level budget allocations to satisfy fairness and operational constraints. At the base level, it identifies the most responsive individuals within each group using a neural network trained on observational data, while respecting cooldown windows and delayed treatment effects modeled via resource-specific delay kernels. By explicitly modeling temporal dynamics and feedback delays, the algorithm continually refines its policy as new data arrive, enabling more responsive and adaptive decision-making. We validate our approach on two real-world datasets from education and workforce development, showing that it achieves higher cumulative outcomes, better adapts to delay structures, and ensures equitable distribution across subgroups. Our results highlight the potential of delay-aware, data-driven decision-making systems to improve institutional policy and social welfare.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Reginald Zhiyan Chen, Heng-Sheng Chang, Prashant G. Mehta
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Hidden Markov Models (HMMs) are fundamental for modeling sequential data, yet learning their parameters from observations remains challenging. Classical methods like the Baum-Welch (EM) algorithm are computationally intensive and prone to local optima, while modern spectral algorithms offer provable guarantees but may produce probability outputs outside valid ranges. This work introduces Belief Net, a novel framework that learns HMM parameters through gradient-based optimization by formulating the HMM's forward filter as a structured neural network. Unlike black-box Transformer models, Belief Net's learnable weights are explicitly the logits of the initial distribution, transition matrix, and emission matrix, ensuring full interpretability. The model processes observation sequences using a decoder-only architecture and is trained end-to-end with standard autoregressive next-observation prediction loss. On synthetic HMM data, Belief Net achieves superior convergence speed compared to Baum-Welch, successfully recovering parameters in both undercomplete and overcomplete settings where spectral methods fail. Comparisons with Transformer-based models are also presented on real-world language data.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Rishi Singhal, Jung-Eun Kim
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further precisely identify that early layers LayerNorm are the most critical over middle/later layers and their influence varies across Pre and Post LayerNorm models. We have validated it through 13 models across 6 Vision and Language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Emmanuel Asiedu Brempong, Mohammed Alewi Hassen, MohamedElfatih MohamedKhair, Vusumuzi Dube, Santiago Hincapie Potes et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate precipitation estimation is critical for hydrological applications, especially in the Global South where ground-based observation networks are sparse and forecasting skill is limited. Existing satellite-based precipitation products often rely on the longwave infrared channel alone or are calibrated with data that can introduce significant errors, particularly at sub-daily timescales. This study introduces Oya, a novel real-time precipitation retrieval algorithm utilizing the full spectrum of visible and infrared (VIS-IR) observations from geostationary (GEO) satellites. Oya employs a two-stage deep learning approach, combining two U-Net models: one for precipitation detection and another for quantitative precipitation estimation (QPE), to address the inherent data imbalance between rain and no-rain events. The models are trained using high-resolution GPM Combined Radar-Radiometer Algorithm (CORRA) v07 data as ground truth and pre-trained on IMERG-Final retrievals to enhance robustness and mitigate overfitting due to the limited temporal sampling of CORRA. By leveraging multiple GEO satellites, Oya achieves quasi-global coverage and demonstrates superior performance compared to existing competitive regional and global precipitation baselines, offering a promising pathway to improved precipitation monitoring and forecasting.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Benjamin Yu, Vincenzo Lordi, Daniel Schwalbe-Koda
Machine Learning
ๆฅ็ๆ่ฆ
Machine learning interatomic potentials (MLIPs) balance high accuracy and lower costs compared to density functional theory calculations, but their performance often depends on the size and diversity of training datasets. Large datasets improve model accuracy and generalization but are computationally expensive to produce and train on, while smaller datasets risk discarding rare but important atomic environments and compromising MLIP accuracy/reliability. Here, we develop an information-theoretical framework to quantify the efficiency of dataset compression methods and propose an algorithm that maximizes this efficiency. By framing atomistic dataset compression as an instance of the minimum set cover (MSC) problem over atom-centered environments, our method identifies the smallest subset of structures that contains as much information as possible from the original dataset while pruning redundant information. The approach is extensively demonstrated on the GAP-20 and TM23 datasets, and validated on 64 varied datasets from the ColabFit repository. Across all cases, MSC consistently retains outliers, preserves dataset diversity, and reproduces the long-tail distributions of forces even at high compression rates, outperforming other subsampling methods. Furthermore, MLIPs trained on MSC-compressed datasets exhibit reduced error for out-of-distribution data even in low-data regimes. We explain these results using an outlier analysis and show that such quantitative conclusions could not be achieved with conventional dimensionality reduction methods. The algorithm is implemented in the open-source QUESTS package and can be used for several tasks in atomistic modeling, from data subsampling, outlier detection, and training improved MLIPs at a lower cost.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Isabela Albuquerque, Ira Ktena, Olivia Wiles, Ivana Kajiฤ, Amal Rannen-Triki et al.
Computer Vision
ๆฅ็ๆ่ฆ
Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests.
Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
๐
2025-11-13
๐ ArXiv Preprint (cs.SI)
๐ฅ Stefano Maria Iacus, Giuseppe Porro
Natural Language Processing
ๆฅ็ๆ่ฆ
Using 2.6 billion geolocated social-media posts (2014-2022) and a fine-tuned generative language model, we construct county-level indicators of life satisfaction and happiness for the United States. We document an apparent rural-urban paradox: rural counties express higher life satisfaction while urban counties exhibit greater happiness. We reconcile this by treating the two as distinct layers of subjective well-being, evaluative vs. hedonic, showing that each maps differently onto place, politics, and time. Republican-leaning areas appear more satisfied in evaluative terms, but partisan gaps in happiness largely flatten outside major metros, indicating context-dependent political effects. Temporal shocks dominate the hedonic layer: happiness falls sharply during 2020-2022, whereas life satisfaction moves more modestly. These patterns are robust across logistic and OLS specifications and align with well-being theory. Interpreted as associations for the population of social-media posts, the results show that large-scale, language-based indicators can resolve conflicting findings about the rural-urban divide by distinguishing the type of well-being expressed, offering a transparent, reproducible complement to traditional surveys.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Deniz Yilmaz, Liangyu Wu, Julia Gonski
Machine Learning
ๆฅ็ๆ่ฆ
Drift chambers have long been central to collider tracking, but future machines like a Higgs factory motivate higher granularity and cluster counting for particle ID, posing new data processing challenges. Machine learning (ML) at the "edge", or in cell-level readout, can dramatically reduce the off-detector data rate for high-granularity drift chambers by performing cluster counting at-source. We present machine learning algorithms for cluster counting in real-time readout of future drift chambers. These algorithms outperform traditional derivative-based techniques based on achievable pion-kaon separation. When synthesized to FPGA resources, they can achieve latencies consistent with real-time operation in a future Higgs factory scenario, thus advancing both R&D for future collider detectors as well as hardware-based ML for edge applications in high energy physics.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Daryl Noupa Yongueng, Hamidou Tembine
Natural Language Processing
ๆฅ็ๆ่ฆ
Normalization is a key point in transformer training . In Dynamic Tanh (DyT), the author demonstrated that Tanh can be used as an alternative layer normalization (LN) and confirmed the effectiveness of the idea. But Tanh itself faces orthogonality, linearity and distortion problems. Due to that, his proposition cannot be reliable. So we propose a Holonorm (hn) which has residual connections and nonlinearity. Holonorm is suitable for replacing Tanh in the context of normalization. Although the HoloNorm expression could be similar to the softsign function in dimension one, softsign is a componentwise function which is not good for tensors and vectors of great dimension. Holonorm preserves the orthogonality, the direction, the invertibility of the signal. Holonorm is also a suitable metric, maps all vectors into the open unit ball. This prevents exploding activations and improves stability in deep Transformer models. In this work, we have meticulously examined the normalization in transformers and say that Holonorm, a generalized form of softsign function suited as a normalization function first.Second, defined between 0 and 1 hn serves as a percentage, and $1 - \text{Holonorm}$ is its complement, making it better understandable in evaluating a model.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stanislav Selitskiy
Machine Learning
ๆฅ็ๆ่ฆ
We propose loss function week enforcement of the velocity relations between time-series points in the Kinematic-Informed artificial Neural Networks (KINN) for long-term stock prediction. Problems of the series volatility, Out-of-Distribution (OOD) test data, and outliers in training data are addressed by (Artificial Neural Networks) ANN's learning not only future points prediction but also by learning velocity relations between the points, such a way as avoiding unrealistic spurious predictions. The presented loss function penalizes not only errors between predictions and supervised label data, but also errors between the next point prediction and the previous point plus velocity prediction. The loss function is tested on the multiple popular and exotic AR ANN architectures, and around fifteen years of Dow Jones function demonstrated statistically meaningful improvement across the normalization-sensitive activation functions prone to spurious behaviour in the OOD data conditions. Results show that such architecture addresses the issue of the normalization in the auto-regressive models that break the data topology by weakly enforcing the data neighbourhood proximity (relation) preservation during the ANN transformation.
๐
2025-11-13
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yunkai Zhang, Qiang Zhang, Feng, Lin, Ruizhong Qiu et al.
ๆฅ็ๆ่ฆ
Optimizing recommender systems for objectives beyond accuracy, such as diversity, novelty, and personalization, is crucial for long-term user satisfaction. To this end, industrial practitioners have accumulated vast amounts of structured domain knowledge, which we term human priors (e.g., item taxonomies, temporal patterns). This knowledge is typically applied through post-hoc adjustments during ranking or post-ranking. However, this approach remains decoupled from the core model learning, which is particularly undesirable as the industry shifts to end-to-end generative recommendation foundation models. On the other hand, many methods targeting these beyond-accuracy objectives often require architecture-specific modifications and discard these valuable human priors by learning user intent in a fully unsupervised manner.
Instead of discarding the human priors accumulated over years of practice, we introduce a backbone-agnostic framework that seamlessly integrates these human priors directly into the end-to-end training of generative recommenders. With lightweight, prior-conditioned adapter heads inspired by efficient LLM decoding strategies, our approach guides the model to disentangle user intent along human-understandable axes (e.g., interaction types, long- vs. short-term interests). We also introduce a hierarchical composition strategy for modeling complex interactions across different prior types. Extensive experiments on three large-scale datasets demonstrate that our method significantly enhances both accuracy and beyond-accuracy objectives. We also show that human priors allow the backbone model to more effectively leverage longer context lengths and larger model sizes.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Ruxi Deng, Wenxuan Bao, Tianxin Wei, Jingrui He
ๆฅ็ๆ่ฆ
Pretrained VLMs exhibit strong zero-shot classification capabilities, but their predictions degrade significantly under common image corruptions. To improve robustness, many test-time adaptation (TTA) methods adopt positive data augmentation (PDA), which generates multiple views of each test sample to reduce prediction variance. However, these methods suffer from two key limitations. First, it introduces considerable computational overhead due to the large number of augmentations required per image. Second, it fails to mitigate prediction bias, where the model tends to predict certain classes disproportionately under corruption, as PDA operates on corrupted inputs and typically does not remove the corruption itself. To address these challenges, we propose Panda, a novel TTA method based on negative data augmentation (NDA). Unlike positive augmentations that preserve object semantics, Panda generates negative augmentations by disrupting semantic content. It divides images into patches and randomly assembles them from a shared patch pool. These negatively augmented images retain corruption-specific features while discarding object-relevant signals. We then subtract the mean feature of these negative samples from the original image feature, effectively suppressing corruption-related components while preserving class-relevant information. This mitigates prediction bias under distribution shifts. Panda allows augmentation to be shared across samples within a batch, resulting in minimal computational overhead. Panda can be seamlessly integrated into existing test-time adaptation frameworks and substantially improve their robustness. Our experiments indicate that Panda delivers superior performance compared to PDA methods, and a wide range of TTA methods exhibit significantly enhanced performance when integrated with Panda. Our code is available at https://github.com/ruxideng/Panda .
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ รaฤrฤฑ Eser, Zeynep Sonat Baltacฤฑ, Emre Akbaล, Sinan Kalkan
Machine Learning
ๆฅ็ๆ่ฆ
Imbalance in classification tasks is commonly quantified by the cardinalities of examples across classes. This, however, disregards the presence of redundant examples and inherent differences in the learning difficulties of classes. Alternatively, one can use complex measures such as training loss and uncertainty, which, however, depend on training a machine learning model. Our paper proposes using data Intrinsic Dimensionality (ID) as an easy-to-compute, model-free measure of imbalance that can be seamlessly incorporated into various imbalance mitigation methods. Our results across five different datasets with a diverse range of imbalance ratios show that ID consistently outperforms cardinality-based re-weighting and re-sampling techniques used in the literature. Moreover, we show that combining ID with cardinality can further improve performance. Code: https://github.com/cagries/IDIM.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Simon Donike, Cesar Aybar, Julio Contreras, Luis Gรณmez-Chova
ๆฅ็ๆ่ฆ
We present OpenSR-SRGAN, an open and modular framework for single-image super-resolution in Earth Observation. The software provides a unified implementation of SRGAN-style models that is easy to configure, extend, and apply to multispectral satellite data such as Sentinel-2. Instead of requiring users to modify model code, OpenSR-SRGAN exposes generators, discriminators, loss functions, and training schedules through concise configuration files, making it straightforward to switch between architectures, scale factors, and band setups. The framework is designed as a practical tool and benchmark implementation rather than a state-of-the-art model. It ships with ready-to-use configurations for common remote sensing scenarios, sensible default settings for adversarial training, and built-in hooks for logging, validation, and large-scene inference. By turning GAN-based super-resolution into a configuration-driven workflow, OpenSR-SRGAN lowers the entry barrier for researchers and practitioners who wish to experiment with SRGANs, compare models in a reproducible way, and deploy super-resolution pipelines across diverse Earth-observation datasets.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Jonghun Lee, YongKyung Oh, Sungil Kim, Dong-Young Lim
Machine Learning
ๆฅ็ๆ่ฆ
Neural Differential Equations (NDEs) excel at modeling continuous-time dynamics, effectively handling challenges such as irregular observations, missing values, and noise. Despite their advantages, NDEs face a fundamental challenge in adopting dropout, a cornerstone of deep learning regularization, making them susceptible to overfitting. To address this research gap, we introduce Continuum Dropout, a universally applicable regularization technique for NDEs built upon the theory of alternating renewal processes. Continuum Dropout formulates the on-off mechanism of dropout as a stochastic process that alternates between active (evolution) and inactive (paused) states in continuous time. This provides a principled approach to prevent overfitting and enhance the generalization capabilities of NDEs. Moreover, Continuum Dropout offers a structured framework to quantify predictive uncertainty via Monte Carlo sampling at test time. Through extensive experiments, we demonstrate that Continuum Dropout outperforms existing regularization methods for NDEs, achieving superior performance on various time series and image classification tasks. It also yields better-calibrated and more trustworthy probability estimates, highlighting its effectiveness for uncertainty-aware modeling.
๐
2025-11-13
๐ ArXiv Preprint (physics.bio-ph)
๐ฅ Tom Pan, Evan Dramko, Mitchell D. Miller, Anastasios Kyrillidis, George N. Phillips
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Protein structure determination has long been one of the primary challenges of structural biology, to which deep machine learning (ML)-based approaches have increasingly been applied. However, these ML models generally do not incorporate the experimental measurements directly, such as X-ray crystallographic diffraction data. To this end, we explore an approach that more tightly couples these traditional crystallographic and recent ML-based methods, by training a hybrid 3-d vision transformer and convolutional network on inputs from both domains. We make use of two distinct input constructs / Patterson maps, which are directly obtainable from crystallographic data, and ``partial structure'' template maps derived from predicted structures deposited in the AlphaFold Protein Structure Database with subsequently omitted residues. With these, we predict electron density maps that are then post-processed into atomic models through standard crystallographic refinement processes. Introducing an initial dataset of small protein fragments taken from Protein Data Bank entries and placing them in hypothetical crystal settings, we demonstrate that our method is effective at both improving the phases of the crystallographic structure factors and completing the regions missing from partial structure templates, as well as improving the agreement of the electron density maps with the ground truth atomic structures.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Thomas Decker, Volker Tresp, Florian Buettner
ๆฅ็ๆ่ฆ
Perturbation-based explanations are widely utilized to enhance the transparency of machine-learning models in practice. However, their reliability is often compromised by the unknown model behavior under the specific perturbations used. This paper investigates the relationship between uncertainty calibration - the alignment of model confidence with actual accuracy - and perturbation-based explanations. We show that models systematically produce unreliable probability estimates when subjected to explainability-specific perturbations and theoretically prove that this directly undermines global and local explanation quality. To address this, we introduce ReCalX, a novel approach to recalibrate models for improved explanations while preserving their original predictions. Empirical evaluations across diverse models and datasets demonstrate that ReCalX consistently reduces perturbation-specific miscalibration most effectively while enhancing explanation robustness and the identification of globally important input features.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Darsh Pareek, Umesh Kumar, Ruthu Rao, Ravi Janjam
Machine Learning
ๆฅ็ๆ่ฆ
Deep Neural Networks (DNNs) rely on inherent fluctuations in their internal parameters (weights and biases) to effectively navigate the complex optimization landscape and achieve robust performance. While these fluctuations are recognized as crucial for escaping local minima and improving generalization, their precise relationship with fundamental hyperparameters remains underexplored. A significant knowledge gap exists concerning how the learning rate, a critical parameter governing the training process, directly influences the dynamics of these neural fluctuations. This study systematically investigates the impact of varying learning rates on the magnitude and character of weight and bias fluctuations within a neural network. We trained a model using distinct learning rates and analyzed the corresponding parameter fluctuations in conjunction with the network's final accuracy. Our findings aim to establish a clear link between the learning rate's value, the resulting fluctuation patterns, and overall model performance. By doing so, we provide deeper insights into the optimization process, shedding light on how the learning rate mediates the crucial exploration-exploitation trade-off during training. This work contributes to a more nuanced understanding of hyperparameter tuning and the underlying mechanics of deep learning.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Feng Wang, Tianxiang Chen, Shuyue Wei, Qian Chu, Yi Zhang et al.
ๆฅ็ๆ่ฆ
Spatio-temporal graphs are powerful tools for modeling complex dependencies in traffic time series. However, the distributed nature of real-world traffic data across multiple stakeholders poses significant challenges in modeling and reconstructing inter-client spatial dependencies while adhering to data locality constraints. Existing methods primarily address static dependencies, overlooking their dynamic nature and resulting in suboptimal performance. In response, we propose Federated Spatio-Temporal Graph with Dynamic Inter-Client Dependencies (FedSTGD), a framework designed to model and reconstruct dynamic inter-client spatial dependencies in federated learning. FedSTGD incorporates a federated nonlinear computation decomposition module to approximate complex graph operations. This is complemented by a graph node embedding augmentation module, which alleviates performance degradation arising from the decomposition. These modules are coordinated through a client-server collective learning protocol, which decomposes dynamic inter-client spatial dependency learning tasks into lightweight, parallelizable subtasks. Extensive experiments on four real-world datasets demonstrate that FedSTGD achieves superior performance over state-of-the-art baselines in terms of RMSE, MAE, and MAPE, approaching that of centralized baselines. Ablation studies confirm the contribution of each module in addressing dynamic inter-client spatial dependencies, while sensitivity analysis highlights the robustness of FedSTGD to variations in hyperparameters.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Francisco Cunha, Yves Lepage, Zied Bouraoui, Miguel Couceiro
ๆฅ็ๆ่ฆ
Analogical reasoning is a powerful inductive mechanism, widely used in human cognition and increasingly applied in artificial intelligence. Formal frameworks for analogical inference have been developed for Boolean domains, where inference is provably sound for affine functions and approximately correct for functions close to affine. These results have informed the design of analogy-based classifiers. However, they do not extend to regression tasks or continuous domains. In this paper, we revisit analogical inference from a foundational perspective. We first present a counterexample showing that existing generalization bounds fail even in the Boolean setting. We then introduce a unified framework for analogical reasoning in real-valued domains based on parameterized analogies defined via generalized means. This model subsumes both Boolean classification and regression, and supports analogical inference over continuous functions. We characterize the class of analogy-preserving functions in this setting and derive both worst-case and average-case error bounds under smoothness assumptions. Our results offer a general theory of analogical inference across discrete and continuous domains.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous agents powered by large language models (LLMs) have the potential to significantly enhance human productivity by reasoning, using tools, and executing complex tasks in diverse environments. However, current approaches to developing such agents remain costly and inefficient, as they typically require manually constructed task datasets and reinforcement learning (RL) pipelines with extensive random exploration. These limitations lead to prohibitively high data-construction costs, low exploration efficiency, and poor sample utilization. To address these challenges, we present AgentEvolver, a self-evolving agent system that leverages the semantic understanding and reasoning capabilities of LLMs to drive autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: (i) self-questioning, which enables curiosity-driven task generation in novel environments, reducing dependence on handcrafted datasets; (ii) self-navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; and (iii) self-attributing, which enhances sample efficiency by assigning differentiated rewards to trajectory states and actions based on their contribution. By integrating these mechanisms into a unified framework, AgentEvolver enables scalable, cost-effective, and continual improvement of agent capabilities. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yixi Chen, Weixuan Liang, Tianrui Liu, Jun-Jie Huang, Ao Li et al.
ๆฅ็ๆ่ฆ
Kernel power $k$-means (KPKM) leverages a family of means to mitigate local minima issues in kernel $k$-means. However, KPKM faces two key limitations: (1) the computational burden of the full kernel matrix restricts its use on extensive data, and (2) the lack of authentic centroid-sample assignment learning reduces its noise robustness. To overcome these challenges, we propose RFF-KPKM, introducing the first approximation theory for applying random Fourier features (RFF) to KPKM. RFF-KPKM employs RFF to generate efficient, low-dimensional feature maps, bypassing the need for the whole kernel matrix. Crucially, we are the first to establish strong theoretical guarantees for this combination: (1) an excess risk bound of $\mathcal{O}(\sqrt{k^3/n})$, (2) strong consistency with membership values, and (3) a $(1+\varepsilon)$ relative error bound achievable using the RFF of dimension $\mathrm{poly}(\varepsilon^{-1}\log k)$. Furthermore, to improve robustness and the ability to learn multiple kernels, we propose IP-RFF-MKPKM, an improved possibilistic RFF-based multiple kernel power $k$-means. IP-RFF-MKPKM ensures the scalability of MKPKM via RFF and refines cluster assignments by combining the merits of the possibilistic membership and fuzzy membership. Experiments on large-scale datasets demonstrate the superior efficiency and clustering accuracy of the proposed methods compared to the state-of-the-art alternatives.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Prince Mensah, Pelumi Victor Aderinto, Ibrahim Salihu Yusuf, Arnu Pretorius
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate retrieval of vegetation biophysical variables from satellite imagery is crucial for ecosystem monitoring and agricultural management. In this work, we propose a physics-informed Transformer-VAE architecture to invert the PROSAIL radiative transfer model for simultaneous estimation of key canopy parameters from Sentinel-2 data. Unlike previous hybrid approaches that require real satellite images for self-supevised training. Our model is trained exclusively on simulated data, yet achieves performance on par with state-of-the-art methods that utilize real imagery. The Transformer-VAE incorporates the PROSAIL model as a differentiable physical decoder, ensuring that inferred latent variables correspond to physically plausible leaf and canopy properties. We demonstrate retrieval of leaf area index (LAI) and canopy chlorophyll content (CCC) on real-world field datasets (FRM4Veg and BelSAR) with accuracy comparable to models trained with real Sentinel-2 data. Our method requires no in-situ labels or calibration on real images, offering a cost-effective and self-supervised solution for global vegetation monitoring. The proposed approach illustrates how integrating physical models with advanced deep networks can improve the inversion of RTMs, opening new prospects for large-scale, physically-constrained remote sensing of vegetation traits.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Nicolas Hoischen, Petar Bevanda, Max Beier, Stefan Sosnowski, Boris Houska et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Continuous-time stochastic processes underlie many natural and engineered systems. In healthcare, autonomous driving, and industrial control, direct interaction with the environment is often unsafe or impractical, motivating offline reinforcement learning from historical data. However, there is limited statistical understanding of the approximation errors inherent in learning policies from offline datasets. We address this by linking reinforcement learning to the Hamilton-Jacobi-Bellman equation and proposing an operator-theoretic algorithm based on a simple dynamic programming recursion. Specifically, we represent our world model in terms of the infinitesimal generator of controlled diffusion processes learned in a reproducing kernel Hilbert space. By integrating statistical learning methods and operator theory, we establish global convergence of the value function and derive finite-sample guarantees with bounds tied to system properties such as smoothness and stability. Our theoretical and numerical results indicate that operator-based approaches may hold promise in solving offline reinforcement learning using continuous-time optimal control.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kai-Hendrik Cohrs, Zuzanna Osika, Maria Gonzalez-Calabuig, Vishal Nedungadi, Ruben Cartuyvels et al.
ๆฅ็ๆ่ฆ
Geospatial foundation models for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that integrates three complementary signals: out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space and task-specific predictive uncertainty. Applied to burn scar segmentation, SHRUG-FM shows that OOD scores correlate with lower performance in specific environmental conditions, while uncertainty-based flags help discard many poorly performing predictions. Linking these flags to land cover attributes from HydroATLAS shows that failures are not random but concentrated in certain geographies, such as low-elevation zones and large river areas, likely due to underrepresentation in pretraining data. SHRUG-FM provides a pathway toward safer and more interpretable deployment of GFMs in climate-sensitive applications, helping bridge the gap between benchmark performance and real-world reliability.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Arnab Bhattacharyya, Davin Choo, Philips George John, Themis Gouleakis
ๆฅ็ๆ่ฆ
Given i.i.d.~samples from an unknown distribution $P$, the goal of distribution learning is to recover the parameters of a distribution that is close to $P$. When $P$ belongs to the class of product distributions on the Boolean hypercube $\{0,1\}^d$, it is known that $ฮฉ(d/\varepsilon^2)$ samples are necessary to learn $P$ within total variation (TV) distance $\varepsilon$. We revisit this problem when the learner is also given as advice the parameters of a product distribution $Q$. We show that there is an efficient algorithm to learn $P$ within TV distance $\varepsilon$ that has sample complexity $\tilde{O}(d^{1-ฮท}/\varepsilon^2)$, if $\|\mathbf{p} - \mathbf{q}\|_1 < \varepsilon d^{0.5 - ฮฉ(ฮท)}$. Here, $\mathbf{p}$ and $\mathbf{q}$ are the mean vectors of $P$ and $Q$ respectively, and no bound on $\|\mathbf{p} - \mathbf{q}\|_1$ is known to the algorithm a priori.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Joel Wendin, Claudio Altafini
Machine Learning
ๆฅ็ๆ่ฆ
The paper surveys recent progresses in understanding the dynamics and loss landscape of the gradient flow equations associated to deep linear neural networks, i.e., the gradient descent training dynamics (in the limit when the step size goes to 0) of deep neural networks missing the activation functions and subject to quadratic loss functions. When formulated in terms of the adjacency matrix of the neural network, as we do in the paper, these gradient flow equations form a class of converging matrix ODEs which is nilpotent, polynomial, isospectral, and with conservation laws. The loss landscape is described in detail. It is characterized by infinitely many global minima and saddle points, both strict and nonstrict, but lacks local minima and maxima. The loss function itself is a positive semidefinite Lyapunov function for the gradient flow, and its level sets are unbounded invariant sets of critical points, with critical values that correspond to the amount of singular values of the input-output data learnt by the gradient along a certain trajectory. The adjacency matrix representation we use in the paper allows to highlight the existence of a quotient space structure in which each critical value of the loss function is represented only once, while all other critical points with the same critical value belong to the fiber associated to the quotient space. It also allows to easily determine stable and unstable submanifolds at the saddle points, even when the Hessian fails to obtain them.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zicheng Hu, Yuchen Wang, Cheng Chen
ๆฅ็ๆ่ฆ
Decentralized cooperative multi-agent multi-armed bandits (DeCMA2B) considers how multiple agents collaborate in a decentralized multi-armed bandit setting. Though this problem has been extensively studied in previous work, most existing methods remain susceptible to various adversarial attacks. In this paper, we first study DeCMA2B with adversarial corruption, where an adversary can corrupt reward observations of all agents with a limited corruption budget. We propose a robust algorithm, called DeMABAR, which ensures that each agent's individual regret suffers only an additive term proportional to the corruption budget. Then we consider a more realistic scenario where the adversary can only attack a small number of agents. Our theoretical analysis shows that the DeMABAR algorithm can also almost completely eliminate the influence of adversarial attacks and is inherently robust in the Byzantine setting, where an unknown fraction of the agents can be Byzantine, i.e., may arbitrarily select arms and communicate wrong information. We also conduct numerical experiments to illustrate the robustness and effectiveness of the proposed method.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qingao Yi, Jiaang Duan, Hanwen Hu, Qin Hua, Haiyan Zhao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Training large language models (LLMs) poses significant challenges regarding computational resources and memory capacity. Although distributed training techniques help mitigate these issues, they still suffer from considerable communication overhead. Existing approaches primarily rely on static gradient compression to enhance communication efficiency; however, these methods neglect the dynamic nature of evolving gradients during training, leading to performance degradation. Accelerating LLM training via compression without sacrificing performance remains a challenge. In this paper, we propose an entropy-driven dynamic gradient compression framework called EDGC. The core concept is to adjust the compression rate during LLM training based on the evolving trends of gradient entropy, taking into account both compression efficiency and error. EDGC consists of three key components.First, it employs a down-sampling method to efficiently estimate gradient entropy, reducing computation overhead. Second, it establishes a theoretical model linking compression rate with gradient entropy, enabling more informed compression decisions. Lastly, a window-based adjustment mechanism dynamically adapts the compression rate across pipeline stages, improving communication efficiency and maintaining model performance. We implemented EDGC on a 32-NVIDIA-V100 cluster and a 64-NVIDIA-H100 cluster to train GPT2-2.5B and GPT2-12.1B, respectively. The results show that EDGC significantly reduces communication latency and training time by up to 46.45% and 16.13% while preserving LLM accuracy.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fuyuan Cao, Jiaxuan Zhang, Xiaoli Li
Robotics
ๆฅ็ๆ่ฆ
Estimating Individual Treatment Effects (ITE) from observational data is challenging due to confounding bias. Most studies tackle this bias by balancing distributions globally, but ignore individual heterogeneity and fail to capture the local structure that represents the natural clustering among individuals, which ultimately compromises ITE estimation. While instance-level alignment methods consider heterogeneity, they similarly overlook the local structure information. To address these issues, we propose an end-to-end Multi-\textbf{P}rototype alignment method for \textbf{ITE} estimation (\textbf{PITE}). PITE effectively captures local structure within groups and enforces cross-group alignment, thereby achieving robust ITE estimation. Specifically, we first define prototypes as cluster centroids based on similar individuals under the same treatment. To identify local similarity and the distribution consistency, we perform instance-to-prototype matching to assign individuals to the nearest prototype within groups, and design a multi-prototype alignment strategy to encourage the matched prototypes to be close across treatment arms in the latent space. PITE not only reduces distribution shift through fine-grained, prototype-level alignment, but also preserves the local structures of treated and control groups, which provides meaningful constraints for ITE estimation. Extensive evaluations on benchmark datasets demonstrate that PITE outperforms 13 state-of-the-art methods, achieving more accurate and robust ITE estimation.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Patrick Feifel, Benedikt Franke, Frank Bonarens, Frank Kรถster, Arne Raulf et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Reliable pedestrian detection represents a crucial step towards automated driving systems. However, the current performance benchmarks exhibit weaknesses. The currently applied metrics for various subsets of a validation dataset prohibit a realistic performance evaluation of a DNN for pedestrian detection. As image segmentation supplies fine-grained information about a street scene, it can serve as a starting point to automatically distinguish between different types of errors during the evaluation of a pedestrian detector. In this work, eight different error categories for pedestrian detection are proposed and new metrics are proposed for performance comparison along these error categories. We use the new metrics to compare various backbones for a simplified version of the APD, and show a more fine-grained and robust way to compare models with each other especially in terms of safety-critical performance. We achieve SOTA on CityPersons-reasonable (without extra training data) by using a rather simple architecture.
๐
2025-11-13
๐ ArXiv Preprint (eess.SY)
๐ฅ Florian Ebmeier, Nicole Ludwig, Jannik Thuemmel, Georg Martius, Volker H. Franz
Machine Learning
ๆฅ็ๆ่ฆ
Solar thermal systems (STS) present a promising avenue for low-carbon heat generation, with a well-running system providing heat at minimal cost and carbon emissions. However, STS can exhibit faults due to improper installation, maintenance, or operation, often resulting in a substantial reduction in efficiency or even damage to the system. As monitoring at the individual level is economically prohibitive for small-scale systems, automated monitoring and fault detection should be used to address such issues. Recent advances in data-driven anomaly detection, particularly in time series analysis, offer a cost-effective solution by leveraging existing sensors to identify abnormal system states.
Here, we propose a probabilistic reconstruction-based framework for anomaly detection. We evaluate our method on the publicly available PaSTS dataset of operational domestic STS, which features real-world complexities and diverse fault types. Our experiments show that reconstruction-based methods can detect faults in domestic STS both qualitatively and quantitatively, while generalizing to previously unseen systems. We also demonstrate that our model outperforms both simple and more complex deep learning baselines. Additionally, we show that heteroscedastic uncertainty estimation is essential to fault detection performance. Finally, we discuss the engineering overhead required to unlock these improvements and make a case for simple deep learning models.
๐
2025-11-13
๐ ArXiv Preprint (cs.IT)
๐ฅ Aswin Arun, Christo Kurisummoottil Thomas, Rimalpudi Sarvendranath, Walid Saad
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Despite the advantages of multi-agent reinforcement learning (MARL) for wireless use case such as medium access control (MAC), their real-world deployment in Internet of Things (IoT) is hindered by their sample inefficiency. To alleviate this challenge, one can leverage model-based reinforcement learning (MBRL) solutions, however, conventional MBRL approaches rely on black-box models that are not interpretable and cannot reason. In contrast, in this paper, a novel causal model-based MARL framework is developed by leveraging tools from causal learn- ing. In particular, the proposed model can explicitly represent causal dependencies between network variables using structural causal models (SCMs) and attention-based inference networks. Interpretable causal models are then developed to capture how MAC control messages influence observations, how transmission actions determine outcomes, and how channel observations affect rewards. Data augmentation techniques are then used to generate synthetic rollouts using the learned causal model for policy optimization via proximal policy optimization (PPO). Analytical results demonstrate exponential sample complexity gains of causal MBRL over black-box approaches. Extensive simulations demonstrate that, on average, the proposed approach can reduce environment interactions by 58%, and yield faster convergence compared to model-free baselines. The proposed approach inherently is also shown to provide interpretable scheduling decisions via attention-based causal attribution, revealing which network conditions drive the policy. The resulting combination of sample efficiency and interpretability establishes causal MBRL as a practical approach for resource-constrained wireless systems.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuping Yan, Yuhan Xie, Yuanshuai Li, Yingchao Yu, Lingjuan Lyu et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe contents, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Adrien Lafage, Olivier Laurent, Firas Gabetni, Gianni Franchi
Natural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Deep Neural Networks (DNNs) have demonstrated remarkable performance across various domains, including computer vision and natural language processing. However, they often struggle to accurately quantify the uncertainty of their predictions, limiting their broader adoption in critical real-world applications. Uncertainty Quantification (UQ) for Deep Learning seeks to address this challenge by providing methods to improve the reliability of uncertainty estimates. Although numerous techniques have been proposed, a unified tool offering a seamless workflow to evaluate and integrate these methods remains lacking. To bridge this gap, we introduce Torch-Uncertainty, a PyTorch and Lightning-based framework designed to streamline DNN training and evaluation with UQ techniques and metrics. In this paper, we outline the foundational principles of our library and present comprehensive experimental results that benchmark a diverse set of UQ methods across classification, segmentation, and regression tasks. Our library is available at https://github.com/ENSTA-U2IS-AI/Torch-Uncertainty
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qian Jin, Yumeng Liu, Yuqi Jiang, Qi Sun, Cheng Zhuo
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reliable, generalizable data foundations are critical for enabling large-scale models in computational lithography. However, essential tasks-mask generation, rule violation detection, and layout optimization-are often handled in isolation, hindered by scarce datasets and limited modeling approaches. To address these challenges, we introduce Unitho, a unified multi-task large vision model built upon the Transformer architecture. Trained on a large-scale industrial lithography simulation dataset with hundreds of thousands of cases, Unitho supports end-to-end mask generation, lithography simulation, and rule violation detection. By enabling agile and high-fidelity lithography simulation, Unitho further facilitates the construction of robust data foundations for intelligent EDA. Experimental results validate its effectiveness and generalizability, with performance substantially surpassing academic baselines.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Oliver Dippel, Alexei Lisitsa, Bei Peng
Natural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Transformers have demonstrated exceptional in-context learning (ICL) capabilities, enabling applications across natural language processing, computer vision, and sequential decision-making. In reinforcement learning, ICL reframes learning as a supervised problem, facilitating task adaptation without parameter updates. Building on prior work leveraging transformers for sequential decision-making, we propose Heuristic Transformer (HT), an in-context reinforcement learning (ICRL) approach that augments the in-context dataset with a belief distribution over rewards to achieve better decision-making. Using a variational auto-encoder (VAE), a low-dimensional stochastic variable is learned to represent the posterior distribution over rewards, which is incorporated alongside an in-context dataset and query states as prompt to the transformer policy. We assess the performance of HT across the Darkroom, Miniworld, and MuJoCo environments, showing that it consistently surpasses comparable baselines in terms of both effectiveness and generalization. Our method presents a promising direction to bridge the gap between belief-based augmentations and transformer-based decision-making.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Daniel Herbst, Lea Karbeska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.
๐
2025-11-13
๐ AAAI
๐ฅ Jianghan Zhu, Yaoxin Wu, Zhuoyi Lin, Zhengyuan Zhang, Haiyan Yin et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in Neural Combinatorial Optimization (NCO) methods have significantly improved the capability of neural solvers to handle synthetic routing instances. Nonetheless, existing neural solvers typically struggle to generalize effectively from synthetic, uniformly-distributed training data to real-world VRP scenarios, including widely recognized benchmark instances from TSPLib and CVRPLib. To bridge this generalization gap, we present Evolutionary Realistic Instance Synthesis (EvoReal), which leverages an evolutionary module guided by large language models (LLMs) to generate synthetic instances characterized by diverse and realistic structural patterns. Specifically, the evolutionary module produces synthetic instances whose structural attributes statistically mimics those observed in authentic real-world instances. Subsequently, pre-trained NCO models are progressively refined, firstly aligning them with these structurally enriched synthetic distributions and then further adapting them through direct fine-tuning on actual benchmark instances. Extensive experimental evaluations demonstrate that EvoReal markedly improves the generalization capabilities of state-of-the-art neural solvers, yielding a notable reduced performance gap compared to the optimal solutions on the TSPLib (1.05%) and CVRPLib (2.71%) benchmarks across a broad spectrum of problem scales.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yue Chen, Jianfeng Lu, Shuqing Cao, Wei Wang, Gang Li et al.
ๆฅ็ๆ่ฆ
While semi-asynchronous federated learning (SAFL) combines the efficiency of synchronous training with the flexibility of asynchronous updates, it inherently suffers from participation bias, which is further exacerbated by non-IID data distributions. More importantly, hierarchical architecture shifts participation from individual clients to client groups, thereby further intensifying this issue. Despite notable advancements in SAFL research, most existing works still focus on conventional cloud-end architectures while largely overlooking the critical impact of non-IID data on scheduling across the cloud-edge-client hierarchy. To tackle these challenges, we propose FedCure, an innovative semi-asynchronous Federated learning framework that leverages coalition construction and participation-aware scheduling to mitigate participation bias with non-IID data. Specifically, FedCure operates through three key rules: (1) a preference rule that optimizes coalition formation by maximizing collective benefits and establishing theoretically stable partitions to reduce non-IID-induced performance degradation; (2) a scheduling rule that integrates the virtual queue technique with Bayesian-estimated coalition dynamics, mitigating efficiency loss while ensuring mean rate stability; and (3) a resource allocation rule that enhances computational efficiency by optimizing client CPU frequencies based on estimated coalition dynamics while satisfying delay requirements. Comprehensive experiments on four real-world datasets demonstrate that FedCure improves accuracy by up to 5.1x compared with four state-of-the-art baselines, while significantly enhancing efficiency with the lowest coefficient of variation 0.0223 for per-round latency and maintaining long-term balance across diverse scenarios.
๐
2025-11-13
๐ AAAI
๐ฅ Xi Yang, Han Zhang, Zhijian Lin, Yibiao Hu, Hong Han
Multimodal
ๆฅ็ๆ่ฆ
Out-of-context misinformation (OOC) is a low-cost form of misinformation in news reports, which refers to place authentic images into out-of-context or fabricated image-text pairings. This problem has attracted significant attention from researchers in recent years. Current methods focus on assessing image-text consistency or generating explanations. However, these approaches assume that the training and test data are drawn from the same distribution. When encountering novel news domains, models tend to perform poorly due to the lack of prior knowledge. To address this challenge, we propose \textbf{VDT} to enhance the domain adaptation capability for OOC misinformation detection by learning domain-invariant features and test-time training mechanisms. Domain-Invariant Variational Align module is employed to jointly encodes source and target domain data to learn a separable distributional space domain-invariant features. For preserving semantic integrity, we utilize domain consistency constraint module to reconstruct the source and target domain latent distribution. During testing phase, we adopt the test-time training strategy and confidence-variance filtering module to dynamically updating the VAE encoder and classifier, facilitating the model's adaptation to the target domain distribution. Extensive experiments conducted on the benchmark dataset NewsCLIPpings demonstrate that our method outperforms state-of-the-art baselines under most domain adaptation settings.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Cheng Kevin Qu, Andrew Ly, Pulin Gong
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Attention mechanisms underpin the computational power of Transformer models, which have achieved remarkable success across diverse domains. Yet understanding and extending the principles underlying self-attention remains a key challenge for advancing artificial intelligence. Drawing inspiration from the multiscale dynamics of biological attention and from dynamical systems theory, we introduce Fractional Neural Attention (FNA), a principled, neuroscience-inspired framework for multiscale information processing. FNA models token interactions through Lรฉvy diffusion governed by the fractional Laplacian, intrinsically realizing simultaneous short- and long-range dependencies across multiple scales. This mechanism yields greater expressivity and faster information mixing, advancing the foundational capacity of Transformers. Theoretically, we show that FNA's dynamics are governed by the fractional diffusion equation, and that the resulting attention networks exhibit larger spectral gaps and shorter path lengths -- mechanistic signatures of enhanced computational efficiency. Empirically, FNA achieves competitive text-classification performance even with a single layer and a single head; it also improves performance in image processing and neural machine translation. Finally, the diffusion map algorithm from geometric harmonics enables dimensionality reduction of FNA weights while preserving the intrinsic structure of embeddings and hidden states. Together, these results establish FNA as a principled mechanism connecting self-attention, stochastic dynamics, and geometry, providing an interpretable, biologically grounded foundation for powerful, neuroscience-inspired AI.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jieting Wang, Huimei Shi, Feijiang Li, Xiaolei Shang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Time series forecasting is an important task that involves analyzing temporal dependencies and underlying patterns (such as trends, cyclicality, and seasonality) in historical data to predict future values or trends. Current deep learning-based forecasting models primarily employ Mean Squared Error (MSE) loss functions for regression modeling. Despite enabling direct value prediction, this method offers no uncertainty estimation and exhibits poor outlier robustness. To address these limitations, we propose OCE-TS, a novel ordinal classification approach for time series forecasting that replaces MSE with Ordinal Cross-Entropy (OCE) loss, preserving prediction order while quantifying uncertainty through probability output. Specifically, OCE-TS begins by discretizing observed values into ordered intervals and deriving their probabilities via a parametric distribution as supervision signals. Using a simple linear model, we then predict probability distributions for each timestep. The OCE loss is computed between the cumulative distributions of predicted and ground-truth probabilities, explicitly preserving ordinal relationships among forecasted values. Through theoretical analysis using influence functions, we establish that cross-entropy (CE) loss exhibits superior stability and outlier robustness compared to MSE loss. Empirically, we compared OCE-TS with five baseline models-Autoformer, DLinear, iTransformer, TimeXer, and TimeBridge-on seven public time series datasets. Using MSE and Mean Absolute Error (MAE) as evaluation metrics, the results demonstrate that OCE-TS consistently outperforms benchmark models. The code will be published.
๐
2025-11-13
๐ NeurIPS
๐ฅ Eklavya Sarkar, Mathew Magimai. -Doss
Natural Language Processing
ๆฅ็ๆ่ฆ
Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences, derived through vector quantization and gumbel-softmax vector quantization of extracted self-supervised speech model representations can effectively capture and leverage temporal information. To that end, pairwise distance analysis of token sequences generated from HuBERT embeddings shows that they can discriminate call-types and callers across four bioacoustics datasets. Sequence classification experiments using $k$-Nearest Neighbour with Levenshtein distance show that the vector-quantized token sequences yield reasonable call-type and caller classification performances, and hold promise as alternative feature representations towards leveraging sequential information in animal vocalizations.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Outongyi Lv, Yewei Yuan, Nana Liu
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) with limited samples is common in real-world applications. However, offline RL performance under this constraint is often suboptimal. We consider an alternative approach to dealing with limited samples by introducing the Quantum Metric Encoder (QME). In this methodology, instead of applying the RL framework directly on the original states and rewards, we embed the states into a more compact and meaningful representation, where the structure of the encoding is inspired by quantum circuits. For classical data, QME is a classically simulable, trainable unitary embedding and thus serves as a quantum-inspired module, on a classical device. For quantum data in the form of quantum states, QME can be implemented directly on quantum hardware, allowing for training without measurement or re-encoding.
We evaluated QME on three datasets, each limited to 100 samples. We use Soft-Actor-Critic (SAC) and Implicit-Q-Learning (IQL), two well-known RL algorithms, to demonstrate the effectiveness of our approach. From the experimental results, we find that training offline RL agents on QME-embedded states with decoded rewards yields significantly better performance than training on the original states and rewards. On average across the three datasets, for maximum reward performance, we achieve a 116.2% improvement for SAC and 117.6% for IQL.
We further investigate the $ฮ$-hyperbolicity of our framework, a geometric property of the state space known to be important for the RL training efficacy. The QME-embedded states exhibit low $ฮ$-hyperbolicity, suggesting that the improvement after embedding arises from the modified geometry of the state space induced by QME. Thus, the low $ฮ$-hyperbolicity and the corresponding effectiveness of QME could provide valuable information for developing efficient offline RL methods under limited-sample conditions.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Yuancheng Sun, Yuxuan Ren, Zhaoming Chen, Xu Han, Kang Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate exploration of protein conformational ensembles is essential for uncovering function but remains hard because molecular-dynamics (MD) simulations suffer from high computational costs and energy-barrier trapping. This paper presents Energy Preference Optimization (EPO), an online refinement algorithm that turns a pretrained protein ensemble generator into an energy-aware sampler without extra MD trajectories. Specifically, EPO leverages stochastic differential equation sampling to explore the conformational landscape and incorporates a novel energy-ranking mechanism based on list-wise preference optimization. Crucially, EPO introduces a practical upper bound to efficiently approximate the intractable probability of long sampling trajectories in continuous-time generative models, making it easily adaptable to existing pretrained generators. On Tetrapeptides, ATLAS, and Fast-Folding benchmarks, EPO successfully generates diverse and physically realistic ensembles, establishing a new state-of-the-art in nine evaluation metrics. These results demonstrate that energy-only preference signals can efficiently steer generative models toward thermodynamically consistent conformational ensembles, providing an alternative to long MD simulations and widening the applicability of learned potentials in structural biology and drug discovery.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ J. Javier Alonso-Ramos, Ignacio Aguilera-Martos, Andrรฉs Herrera-Poyatos, Francisco Herrera
Machine Learning
ๆฅ็ๆ่ฆ
The performance of Machine Learning (ML) models, particularly those operating within the Interpretable Artificial Intelligence (Interpretable AI) framework, is significantly affected by the presence of noise in both training and production data. Denoising has therefore become a critical preprocessing step, typically categorized into instance removal and instance correction techniques. However, existing correction approaches often degrade performance or oversimplify the problem by altering the original data distribution. This leads to unrealistic scenarios and biased models, which is particularly problematic in contexts where interpretable AI models are employed, as their interpretability depends on the fidelity of the underlying data patterns. In this paper, we argue that defining noise independently of the solution may be ineffective, as its nature can vary significantly across tasks and datasets. Using a task-specific high quality solution as a reference can provide a more precise and adaptable noise definition. To this end, we propose DenoGrad, a novel Gradient-based instance Denoiser framework that leverages gradients from an accurate Deep Learning (DL) model trained on the target data -- regardless of the specific task -- to detect and adjust noisy samples. Unlike conventional approaches, DenoGrad dynamically corrects noisy instances, preserving problem's data distribution, and improving AI models robustness. DenoGrad is validated on both tabular and time series datasets under various noise settings against the state-of-the-art. DenoGrad outperforms existing denoising strategies, enhancing the performance of interpretable IA models while standing out as the only high quality approach that preserves the original data distribution.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jieting Wang, Xiaolei Shang, Feijiang Li, Furong Peng
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Time series forecasting relies on predicting future values from historical data, yet most state-of-the-art approaches-including transformer and multilayer perceptron-based models-optimize using Mean Squared Error (MSE), which has two fundamental weaknesses: its point-wise error computation fails to capture temporal relationships, and it does not account for inherent noise in the data. To overcome these limitations, we introduce the Residual-Informed Loss (RI-Loss), a novel objective function based on the Hilbert-Schmidt Independence Criterion (HSIC). RI-Loss explicitly models noise structure by enforcing dependence between the residual sequence and a random time series, enabling more robust, noise-aware representations. Theoretically, we derive the first non-asymptotic HSIC bound with explicit double-sample complexity terms, achieving optimal convergence rates through Bernstein-type concentration inequalities and Rademacher complexity analysis. This provides rigorous guarantees for RI-Loss optimization while precisely quantifying kernel space interactions. Empirically, experiments across eight real-world benchmarks and five leading forecasting models demonstrate improvements in predictive performance, validating the effectiveness of our approach. Code will be made publicly available to ensure reproducibility.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Philipp Seeberger, Steffen Freisinger, Tobias Bocklet, Korbinian Riedhammer
ๆฅ็ๆ่ฆ
Due to the rapid growth of social media platforms, these tools have become essential for monitoring information during ongoing disaster events. However, extracting valuable insights requires real-time processing of vast amounts of data. A major challenge in existing systems is their exposure to event-related biases, which negatively affects their ability to generalize to emerging events. While recent advancements in debiasing and causal learning offer promising solutions, they remain underexplored in the disaster event domain. In this work, we approach bias mitigation through a causal lens and propose a method to reduce event- and domain-related biases, enhancing generalization to future events. Our approach outperforms multiple baselines by up to +1.9% F1 and significantly improves a PLM-based classifier across three disaster classification tasks.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yiming Tang, Abhijeet Sinha, Dianbo Liu
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Although recent generative models are remarkably capable of producing instruction-following and realistic outputs, they remain prone to notable physical plausibility failures. Though critical in applications, these physical plausibility errors often escape detection by existing evaluation methods. Furthermore, no framework exists for automatically identifying and interpreting specific physical error patterns in natural language, preventing targeted model improvements. We introduce Matryoshka Transcoders, a novel framework for the automatic discovery and interpretation of physical plausibility features in generative models. Our approach extends the Matryoshka representation learning paradigm to transcoder architectures, enabling hierarchical sparse feature learning at multiple granularity levels. By training on intermediate representations from a physical plausibility classifier and leveraging large multimodal models for interpretation, our method identifies diverse physics-related failure modes without manual feature engineering, achieving superior feature relevance and feature accuracy compared to existing approaches. We utilize the discovered visual patterns to establish a benchmark for evaluating physical plausibility in generative models. Our analysis of eight state-of-the-art generative models provides valuable insights into how these models fail to follow physical constraints, paving the way for further model improvements.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Abu Sufian, Cosimo Distante, Marco Leo, Hanan Salam
Machine Learning
ๆฅ็ๆ่ฆ
Text-to-image (T2I) generative models are largely used in AI-powered real-world applications and value creation. However, their strategic deployment raises critical concerns for responsible AI management, particularly regarding the reproduction and amplification of race- and gender-related stereotypes that can undermine organizational ethics. In this work, we investigate whether such societal biases are systematically encoded within the pretrained latent spaces of state-of-the-art T2I models. We conduct an empirical study across the five most popular open-source models, using ten neutral, profession-related prompts to generate 100 images per profession, resulting in a dataset of 5,000 images evaluated by diverse human assessors representing different races and genders. We demonstrate that all five models encode and amplify pronounced societal skew: caregiving and nursing roles are consistently feminized, while high-status professions such as corporate CEO, politician, doctor, and lawyer are overwhelmingly represented by males and mostly White individuals. We further identify model-specific patterns, such as QWEN-Image's near-exclusive focus on East Asian outputs, Kandinsky's dominance of White individuals, and SDXL's comparatively broader but still biased distributions. These results provide critical insights for AI project managers and practitioners, enabling them to select equitable AI models and customized prompts that generate images in alignment with the principles of responsible AI. We conclude by discussing the risks of these biases and proposing actionable strategies for bias mitigation in building responsible GenAI systems.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Leonardo Pesce, Jiawen Wei, Gianmarco Mengaldo
Machine Learning
ๆฅ็ๆ่ฆ
Post-hoc explainability methods are a subset of Machine Learning (ML) that aim to provide a reason for why a model behaves in a certain way. In this paper, we show a new black-box model-agnostic adversarial attack for post-hoc explainable Artificial Intelligence (XAI), particularly in the image domain. The goal of the attack is to modify the original explanations while being undetected by the human eye and maintain the same predicted class. In contrast to previous methods, we do not require any access to the model or its weights, but only to the model's computed predictions and explanations. Additionally, the attack is accomplished in a single step while significantly changing the provided explanations, as demonstrated by empirical evaluation. The low requirements of our method expose a critical vulnerability in current explainability methods, raising concerns about their reliability in safety-critical applications. We systematically generate attacks based on the explanations generated by post-hoc explainability methods (saliency maps, integrated gradients, and DeepLIFT SHAP) for pretrained ResNet-18 and ViT-B16 on ImageNet. The results show that our attacks could lead to dramatically different explanations without changing the predictive probabilities. We validate the effectiveness of our attack, compute the induced change based on the explanation with mean absolute difference, and verify the closeness of the original image and the corrupted one with the Structural Similarity Index Measure (SSIM).
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Haidong Huang, Haiyue Zhu. Jiayu Song, Xixin Zhao, Yaohua Zhou, Jiayi Zhang et al.
Natural Language ProcessingRoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9\% absolute improvement over Uni-O4 on locomotion tasks and +12.4\% on dexterous manipulation, demonstrating strong generalization and scalability.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Yizheng Wang, Timon Rabczuk, Yinghua Liu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Friction modeling plays a crucial role in achieving high-precision motion control in robotic operating systems. Traditional static friction models (such as the Stribeck model) are widely used due to their simple forms; however, they typically require predefined functional assumptions, which poses significant challenges when dealing with unknown functional structures. To address this issue, this paper proposes a physics-inspired machine learning approach based on the Kolmogorov Arnold Network (KAN) for static friction modeling of robotic joints. The method integrates spline activation functions with a symbolic regression mechanism, enabling model simplification and physical expression extraction through pruning and attribute scoring, while maintaining both high prediction accuracy and interpretability. We first validate the method's capability to accurately identify key parameters under known functional models, and further demonstrate its robustness and generalization ability under conditions with unknown functional structures and noisy data. Experiments conducted on both synthetic data and real friction data collected from a six-degree-of-freedom industrial manipulator show that the proposed method achieves a coefficient of determination greater than 0.95 across various tasks and successfully extracts concise and physically meaningful friction expressions. This study provides a new perspective for interpretable and data-driven robotic friction modeling with promising engineering applicability.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shuxin Zhuang, Linjian Meng, Shuxin Li, Minming Li, Youzhi Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Urban Network Security Games (UNSGs), which model the strategic allocation of limited security resources on city road networks, are critical for urban safety. However, finding a Nash Equilibrium (NE) in large-scale UNSGs is challenging due to their massive and combinatorial action spaces. One common approach to addressing these games is the Policy-Space Response Oracle (PSRO) framework, which requires computing best responses (BR) at each iteration. However, precisely computing exact BRs is impractical in large-scale games, and employing reinforcement learning to approximate BRs inevitably introduces errors, which limits the overall effectiveness of the PSRO methods. Recent advancements in leveraging non-convex stochastic optimization to approximate an NE offer a promising alternative to the burdensome BR computation. However, utilizing existing stochastic optimization techniques with an unbiased loss function for UNSGs remains challenging because the action spaces are too vast to be effectively represented by neural networks. To address these issues, we introduce Tree-based Stochastic Optimization (TSO), a framework that bridges the gap between the stochastic optimization paradigm for NE-finding and the demands of UNSGs. Specifically, we employ the tree-based action representation that maps the whole action space onto a tree structure, addressing the challenge faced by neural networks in representing actions when the action space cannot be enumerated. We then incorporate this representation into the loss function and theoretically demonstrate its equivalence to the unbiased loss function. To further enhance the quality of the converged solution, we introduce a sample-and-prune mechanism that reduces the risk of being trapped in suboptimal local optima. Extensive experimental results indicate the superiority of TSO over other baseline algorithms in addressing the UNSGs.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Muhammad Kashif, Shaf Khalid, Alberto Marchisio, Nouhaila Innan, Muhammad Shafique
Machine Learning
ๆฅ็ๆ่ฆ
Hybrid Quantum Neural Networks (HQNNs), which combine parameterized quantum circuits with classical neural layers, are emerging as promising models in the noisy intermediate-scale quantum (NISQ) era. While quantum circuits are not naturally measured in floating point operations (FLOPs), most HQNNs (in NISQ era) are still trained on classical simulators where FLOPs directly dictate runtime and scalability. Hence, FLOPs represent a practical and viable metric to measure the computational complexity of HQNNs. In this work, we introduce FAQNAS, a FLOPs-aware neural architecture search (NAS) framework that formulates HQNN design as a multi-objective optimization problem balancing accuracy and FLOPs. Unlike traditional approaches, FAQNAS explicitly incorporates FLOPs into the optimization objective, enabling the discovery of architectures that achieve strong performance while minimizing computational cost. Experiments on five benchmark datasets (MNIST, Digits, Wine, Breast Cancer, and Iris) show that quantum FLOPs dominate accuracy improvements, while classical FLOPs remain largely fixed. Pareto-optimal solutions reveal that competitive accuracy can often be achieved with significantly reduced computational cost compared to FLOPs-agnostic baselines. Our results establish FLOPs-awareness as a practical criterion for HQNN design in the NISQ era and as a scalable principle for future HQNN systems.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Zijing Liu, Bin Feng, He Cao, Yu Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Protein structure tokenization converts 3D structures into discrete or vectorized representations, enabling the integration of structural and sequence data. Despite many recent works on structure tokenization, the properties of the underlying discrete representations are not well understood. In this work, we first demonstrate that the successful utilization of structural tokens in a language model for structure prediction depends on using rich, pre-trained sequence embeddings to bridge the semantic gap between the sequence and structural "language". The analysis of the structural vocabulary itself then reveals significant semantic redundancy, where multiple distinct tokens correspond to nearly identical local geometries, acting as "structural synonyms". This redundancy, rather than being a flaw, can be exploited with a simple "synonym swap" strategy to generate diverse conformational ensembles by perturbing a predicted structure with its structural synonyms. This computationally lightweight method accurately recapitulates protein flexibility, performing competitively with state-of-the-art models. Our study provides fundamental insights into the nature of discrete protein structure representations and introduces a powerful, near-instantaneous method for modeling protein dynamics. Source code is available in https://github.com/IDEA-XL/TokenMD.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yun Wang, Lingyun Yang, Senhao Yu, Yixiao Wang, Ruixing Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Mixture-of-Experts (MoE) architectures scale language models by activating only a subset of specialized expert networks for each input token, thereby reducing the number of floating-point operations. However, the growing size of modern MoE models causes their full parameter sets to exceed GPU memory capacity; for example, Mixtral-8x7B has 45 billion parameters and requires 87 GB of memory even though only 14 billion parameters are used per token. Existing systems alleviate this limitation by offloading inactive experts to CPU memory, but transferring experts across the PCIe interconnect incurs significant latency (about 10 ms). Prefetching heuristics aim to hide this latency by predicting which experts are needed, but prefetch failures introduce significant stalls and amplify inference latency. In the event of a prefetch failure, prior work offers two primary solutions: either fetch the expert on demand, which incurs a long stall due to the PCIe bottleneck, or drop the expert from the computation, which significantly degrades model accuracy. The critical challenge, therefore, is to maintain both high inference speed and model accuracy when prefetching fails.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruichu Cai, Xiaokai Huang, Wei Chen, Zijian Li, Zhifeng Hao
ๆฅ็ๆ่ฆ
Inferring causal relationships from observed data is an important task, yet it becomes challenging when the data is subject to various external interferences. Most of these interferences are the additional effects of external factors on observed variables. Since these external factors are often unknown, we introduce latent variables to represent these unobserved factors that affect the observed data. Specifically, to capture the causal strength and adjacency information, we propose a new temporal latent variable structural causal model, incorporating causal strength and adjacency coefficients that represent the causal relationships between variables. Considering that expert knowledge can provide information about unknown interferences in certain scenarios, we develop a method that facilitates the incorporation of prior knowledge into parameter learning based on Variational Inference, to guide the model estimation. Experimental results demonstrate the stability and accuracy of our proposed method.
๐
2025-11-13
๐ ArXiv Preprint (cs.MA)
๐ฅ Tao Jiang, Zichuan Lin, Lihe Li, Yi-Chen Li, Cong Guan et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large transformer models, trained on diverse datasets, have demonstrated impressive few-shot performance on previously unseen tasks without requiring parameter updates. This capability has also been explored in Reinforcement Learning (RL), where agents interact with the environment to retrieve context and maximize cumulative rewards, showcasing strong adaptability in complex settings. However, in cooperative Multi-Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi-agent In-context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination by fast adaptation. Our method involves training a centralized embedding model to capture fine-grained trajectory representations, followed by decentralized models that approximate the centralized one to obtain team-level task information. Based on the learned embeddings, relevant trajectories are retrieved as context, which, combined with the agents' current sub-trajectories, inform decision-making. During decentralized execution, we introduce a novel memory mechanism that effectively balances test-time online data with offline memory. Based on the constructed memory, we propose a hybrid utility score that incorporates both individual- and team-level returns, ensuring credit assignment across agents. Extensive experiments on cooperative MARL benchmarks, including Level-Based Foraging (LBF) and SMAC (v1/v2), show that MAICC enables faster adaptation to unseen tasks compared to existing methods. Code is available at https://github.com/LAMDA-RL/MAICC.
๐
2025-11-13
๐ AAAI
๐ฅ Noam Koren, Ralf J. J. Mackenbach, Ruud J. G. van Sloun, Kira Radinsky, Daniel Freedman
ๆฅ็ๆ่ฆ
Neural operators have emerged as a promising paradigm for learning solution operators of partial differential equa- tions (PDEs) directly from data. Existing methods, such as those based on Fourier or graph techniques, make strong as- sumptions about the structure of the kernel integral opera- tor, assumptions which may limit expressivity. We present SVD-NO, a neural operator that explicitly parameterizes the kernel by its singular-value decomposition (SVD) and then carries out the integral directly in the low-rank basis. Two lightweight networks learn the left and right singular func- tions, a diagonal parameter matrix learns the singular values, and a Gram-matrix regularizer enforces orthonormality. As SVD-NO approximates the full kernel, it obtains a high de- gree of expressivity. Furthermore, due to its low-rank struc- ture the computational complexity of applying the operator remains reasonable, leading to a practical system. In exten- sive evaluations on five diverse benchmark equations, SVD- NO achieves a new state of the art. In particular, SVD-NO provides greater performance gains on PDEs whose solutions are highly spatially variable. The code of this work is publicly available at https://github.com/2noamk/SVDNO.git.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chaofan Zhu, Xiaobing Rui, Zhixiao Wang
Machine Learning
ๆฅ็ๆ่ฆ
Imbalanced node classification is a critical challenge in graph learning, where most existing methods typically utilize Graph Neural Networks (GNNs) to learn node representations. These methods can be broadly categorized into the data-level and the algorithm-level. The former aims to synthesize minority-class nodes to mitigate quantity imbalance, while the latter tries to optimize the learning process to highlight minority classes. However, neither category addresses the inherently imbalanced graph structure, which is a fundamental factor that incurs majority-class dominance and minority-class assimilation in GNNs. Our theoretical analysis further supports this critical insight. Therefore, we propose GraphSB (Graph Structural Balance), a novel framework that incorporates Structural Balance as a key strategy to address the underlying imbalanced graph structure before node synthesis. Structural Balance performs a two-stage structure optimization: Structure Enhancement that adaptively builds similarity-based edges to strengthen connectivity of minority-class nodes, and Relation Diffusion that captures higher-order dependencies while amplifying signals from minority classes. Thus, GraphSB balances structural distribution before node synthesis, enabling more effective learning in GNNs. Extensive experiments demonstrate that GraphSB significantly outperforms the state-of-the-art methods. More importantly, the proposed Structural Balance can be seamlessly integrated into state-of-the-art methods as a simple plug-and-play module, increasing their accuracy by an average of 3.67\%.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pilsung Kang
Robotics
ๆฅ็ๆ่ฆ
Classical approaches often treat interaction as engineered product terms or as emergent patterns in flexible models, offering little control over how synergy or antagonism arises. We take a quantum-inspired view: following the Born rule (probability as squared amplitude), \emph{coherent} aggregation sums complex amplitudes before squaring, creating an interference cross-term, whereas an \emph{incoherent} proxy sums squared magnitudes and removes it. In a minimal linear-amplitude model, this cross-term equals the standard potential-outcome interaction contrast \(ฮ_{\mathrm{INT}}\) in a \(2\times 2\) factorial design, giving relative phase a direct, mechanism-level control over synergy versus antagonism.
We instantiate this idea in a lightweight \emph{Interference Kernel Classifier} (IKC) and introduce two diagnostics: \emph{Coherent Gain} (log-likelihood gain of coherent over the incoherent proxy) and \emph{Interference Information} (the induced Kullback-Leibler gap). A controlled phase sweep recovers the identity. On a high-interaction synthetic task (XOR), IKC outperforms strong baselines under paired, budget-matched comparisons; on real tabular data (\emph{Adult} and \emph{Bank Marketing}) it is competitive overall but typically trails the most capacity-rich baseline in paired differences. Holding learned parameters fixed, toggling aggregation from incoherent to coherent consistently improves negative log-likelihood, Brier score, and expected calibration error, with positive Coherent Gain on both datasets.
๐
2025-11-13
๐ ArXiv Preprint (cs.AR)
๐ฅ Shahid Amin, Syed Pervez Hussnain Shah
Machine Learning
ๆฅ็ๆ่ฆ
The remarkable progress in Artificial Intelligence (AI) is foundation-ally linked to a concurrent revolution in computer architecture. As AI models, particularly Deep Neural Networks (DNNs), have grown in complexity, their massive computational demands have pushed traditional architectures to their limits. This paper provides a structured review of this co-evolution, analyzing the architectural landscape designed to accelerate modern AI workloads. We explore the dominant architectural paradigms Graphics Processing Units (GPUs), Appli-cation-Specific Integrated Circuits (ASICs), and Field-Programmable Gate Ar-rays (FPGAs) by breaking down their design philosophies, key features, and per-formance trade-offs. The core principles essential for performance and energy efficiency, including dataflow optimization, advanced memory hierarchies, spar-sity, and quantization, are analyzed. Furthermore, this paper looks ahead to emerging technologies such as Processing-in-Memory (PIM) and neuromorphic computing, which may redefine future computation. By synthesizing architec-tural principles with quantitative performance data from industry-standard benchmarks, this survey presents a comprehensive picture of the AI accelerator landscape. We conclude that AI and computer architecture are in a symbiotic relationship, where hardware-software co-design is no longer an optimization but a necessity for future progress in computing.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hui Dou, Lei Jin, Yuxuan Zhou, Jiang He, Yiwen Zhang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The performance of modern DBMSs such as MySQL and PostgreSQL heavily depends on the configuration of performance-critical knobs. Manual tuning these knobs is laborious and inefficient due to the complex and high-dimensional nature of the configuration space. Among the automated tuning methods, reinforcement learning (RL)-based methods have recently sought to improve the DBMS knobs tuning process from several different perspectives. However, they still encounter challenges with slow convergence speed during offline training. In this paper, we mainly focus on how to leverage the valuable tuning hints contained in various textual documents such as DBMS manuals and web forums to improve the offline training of RL-based methods. To this end, we propose an efficient DBMS knobs tuning framework named DemoTuner via a novel LLM-assisted demonstration reinforcement learning method. Specifically, to comprehensively and accurately mine tuning hints from documents, we design a structured chain of thought prompt to employ LLMs to conduct a condition-aware tuning hints extraction task. To effectively integrate the mined tuning hints into RL agent training, we propose a hint-aware demonstration reinforcement learning algorithm HA-DDPGfD in DemoTuner. As far as we know, DemoTuner is the first work to introduce the demonstration reinforcement learning algorithm for DBMS knobs tuning. Experimental evaluations conducted on MySQL and PostgreSQL across various workloads demonstrate the significant advantages of DemoTuner in both performance improvement and online tuning cost reduction over three representative baselines including DB-BERT, GPTuner and CDBTune. Additionally, DemoTuner also exhibits superior adaptability to application scenarios with unknown workloads.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alireza F. Pour, Shai Ben-David
Machine Learning
ๆฅ็ๆ่ฆ
We address the general task of learning with a set of candidate models that is too large to have a uniform convergence of empirical estimates to true losses. While the common approach to such challenges is SRM (or regularization) based learning algorithms, we propose a novel learning paradigm that relies on stronger incorporation of empirical data and requires less algorithmic decisions to be based on prior assumptions. We analyze the generalization capabilities of our approach and demonstrate its merits in several common learning assumptions, including similarity of close points, clustering of the domain into highly label-homogeneous regions, Lipschitzness assumptions of the labeling rule, and contrastive learning assumptions. Our approach allows utilizing such assumptions without the need to know their true parameters a priori.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fushuo Huo
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
The rapid evolution of machine learning has propelled neural networks to unprecedented success across diverse domains. In particular, multimodal learning has emerged as a transformative paradigm, leveraging complementary information from heterogeneous data streams (e.g., text, vision, audio) to advance contextual reasoning and intelligent decision-making. Despite these advancements, current neural network-based models often fall short in open-world environments characterized by inherent unpredictability, where unpredictable environmental composition dynamics, incomplete modality inputs, and spurious distributions relations critically undermine system reliability. While humans naturally adapt to such dynamic, ambiguous scenarios, artificial intelligence systems exhibit stark limitations in robustness, particularly when processing multimodal signals under real-world complexity. This study investigates the fundamental challenge of multimodal learning robustness in open-world settings, aiming to bridge the gap between controlled experimental performance and practical deployment requirements.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Saumya Shah, Zi-Yu Khoo, Abel Yang, Stรฉphane Bressan
ๆฅ็ๆ่ฆ
This work explores using the physics-inspired AI Feynman symbolic regression algorithm to automatically rediscover a fundamental equation in astronomy -- the Equation of the Centre. Through the introduction of observational and inductive biases corresponding to the physical nature of the system through data preprocessing and search space restriction, AI Feynman was successful in recovering the first-order analytical form of this equation from lunar ephemerides data. However, this manual approach highlights a key limitation in its reliance on expert-driven coordinate system selection. We therefore propose an automated preprocessing extension to find the canonical coordinate system. Results demonstrate that targeted domain knowledge embedding enables symbolic regression to rediscover physical laws, but also highlight further challenges in constraining symbolic regression to derive physics equations when leveraging domain knowledge through tailored biases.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Dimitrios Sinodinos, Jack Yi Wei, Narges Armanfard
Natural Language Processing
ๆฅ็ๆ่ฆ
Tabular data is the most abundant data type in the world, powering systems in finance, healthcare, e-commerce, and beyond. As tabular datasets grow and span multiple related targets, there is an increasing need to exploit shared task information for improved multitask generalization. Multitask learning (MTL) has emerged as a powerful way to improve generalization and efficiency, yet most existing work focuses narrowly on large-scale recommendation systems, leaving its potential in broader tabular domains largely underexplored. Also, existing MTL approaches for tabular data predominantly rely on multi-layer perceptron-based backbones, which struggle to capture complex feature interactions and often fail to scale when data is abundant, a limitation that transformer architectures have overcome in other domains. Motivated by this, we introduce MultiTab-Net, the first multitask transformer architecture specifically designed for large tabular data. MultiTab-Net employs a novel multitask masked-attention mechanism that dynamically models feature-feature dependencies while mitigating task competition. Through extensive experiments, we show that MultiTab-Net consistently achieves higher multitask gain than existing MTL architectures and single-task transformers across diverse domains including large-scale recommendation data, census-like socioeconomic data, and physics datasets, spanning a wide range of task counts, task types, and feature modalities. In addition, we contribute MultiTab-Bench, a generalized multitask synthetic dataset generator that enables systematic evaluation of multitask dynamics by tuning task count, task correlations, and relative task complexity. Our code is publicly available at https://github.com/Armanfard-Lab/MultiTab.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziqing Yin, Xuanjing Chen, Xi Zhang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The rapid proliferation of AI-generated content (AIGC) has reshaped the dynamics of digital marketing and online consumer behavior. However, predicting the diffusion trajectory and market impact of such content remains challenging due to data heterogeneity, non linear propagation mechanisms, and evolving consumer interactions. This study proposes an AI driven Decision Support System (DSS) that integrates multi source data including social media streams, marketing expenditure records, consumer engagement logs, and sentiment dynamics using a hybrid Graph Neural Network (GNN) and Temporal Transformer framework. The model jointly learns the content diffusion structure and temporal influence evolution through a dual channel architecture, while causal inference modules disentangle the effects of marketing stimuli on return on investment (ROI) and market visibility. Experiments on large scale real-world datasets collected from multiple online platforms such as Twitter, TikTok, and YouTube advertising show that our system outperforms existing baselines in all six metrics. The proposed DSS enhances marketing decisions by providing interpretable real-time insights into AIGC driven content dissemination and market growth patterns.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Pengqian Lu, Jie Lu, Anjin Liu, En Yu, Guangquan Zhang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Existing drift detection methods focus on designing sensitive test statistics. They treat the detection threshold as a fixed hyperparameter, set once to balance false alarms and late detections, and applied uniformly across all datasets and over time. However, maintaining model performance is the key objective from the perspective of machine learning, and we observe that model performance is highly sensitive to this threshold. This observation inspires us to investigate whether a dynamic threshold could be provably better. In this paper, we prove that a threshold that adapts over time can outperform any single fixed threshold. The main idea of the proof is that a dynamic strategy, constructed by combining the best threshold from each individual data segment, is guaranteed to outperform any single threshold that apply to all segments. Based on the theorem, we propose a Dynamic Threshold Determination algorithm. It enhances existing drift detection frameworks with a novel comparison phase to inform how the threshold should be adjusted. Extensive experiments on a wide range of synthetic and real-world datasets, including both image and tabular data, validate that our approach substantially enhances the performance of state-of-the-art drift detectors.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sha Zhao, Mingyi Peng, Haiteng Jiang, Tao Li, Shijian Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Scalable and generalizable analysis of brain activity is essential for advancing both clinical diagnostics and cognitive research. Electroencephalography (EEG), a non-invasive modality with high temporal resolution, has been widely used for brain states analysis. However, most existing EEG models are usually tailored for individual specific tasks, limiting their utility in realistic scenarios where EEG analysis often involves multi-task and continuous reasoning. In this work, we introduce EEGAgent, a general-purpose framework that leverages large language models (LLMs) to schedule and plan multiple tools to automatically complete EEG-related tasks. EEGAgent is capable of performing the key functions: EEG basic information perception, spatiotemporal EEG exploration, EEG event detection, interaction with users, and EEG report generation. To realize these capabilities, we design a toolbox composed of different tools for EEG preprocessing, feature extraction, event detection, etc. These capabilities were evaluated on public datasets, and our EEGAgent can support flexible and interpretable EEG analysis, highlighting its potential for real-world clinical applications.
๐
2025-11-13
๐ Accepted in 2026 IEEE/CVF Winter Conference on App
๐ฅ Mustafa Munir, Md Mostafijur Rahman, Radu Marculescu
Machine Learning
ๆฅ็ๆ่ฆ
Vision Graph Neural Networks (ViGs) offer a new direction for advancements in vision architectures. While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware gating strategy called Exponential Decay Gating. This gating mechanism selectively weighs long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.6% top-1 accuracy, outperforming ViG-B by 0.3% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 APbox, and 41.1 APmask, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 APbox, and 2.1 APmask, respectively, with 78% fewer parameters.
๐
2025-11-13
๐ ArXiv Preprint (math.OC)
๐ฅ Minrui Luo, Weihang Xu, Xiang Gao, Maryam Fazel, Simon Shaolei Du
Machine Learning
ๆฅ็ๆ่ฆ
Gradient descent dynamics on the deep matrix factorization problem is extensively studied as a simplified theoretical model for deep neural networks. Although the convergence theory for two-layer matrix factorization is well-established, no global convergence guarantee for general deep matrix factorization under random initialization has been established to date. To address this gap, we provide a polynomial-time global convergence guarantee for randomly initialized gradient descent on four-layer matrix factorization, given certain conditions on the target matrix and a standard balanced regularization term. Our analysis employs new techniques to show saddle-avoidance properties of gradient decent dynamics, and extends previous theories to characterize the change in eigenvalues of layer weights.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hu Zhang, Zhien Dai, Zhaohui Tang, Yongfang Xie
Natural Language Processing
ๆฅ็ๆ่ฆ
Time series forecasting is essential across diverse domains. While MLP-based methods have gained attention for achieving Transformer-comparable performance with fewer parameters and better robustness, they face critical limitations including loss of weak seasonal signals, capacity constraints in weight-sharing MLPs, and insufficient channel fusion in channel-independent strategies. To address these challenges, we propose MDMLP-EIA (Multi-domain Dynamic MLPs with Energy Invariant Attention) with three key innovations. First, we develop an adaptive fused dual-domain seasonal MLP that categorizes seasonal signals into strong and weak components. It employs an adaptive zero-initialized channel fusion strategy to minimize noise interference while effectively integrating predictions. Second, we introduce an energy invariant attention mechanism that adaptively focuses on different feature channels within trend and seasonal predictions across time steps. This mechanism maintains constant total signal energy to align with the decomposition-prediction-reconstruction framework and enhance robustness against disturbances. Third, we propose a dynamic capacity adjustment mechanism for channel-independent MLPs. This mechanism scales neuron count with the square root of channel count, ensuring sufficient capacity as channels increase. Extensive experiments across nine benchmark datasets demonstrate that MDMLP-EIA achieves state-of-the-art performance in both prediction accuracy and computational efficiency.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ethan Hirschowitz, Fabio Ramos
Robotics
ๆฅ็ๆ่ฆ
Improving competent robot policies with on-policy RL is often hampered by noisy, low-signal gradients. We revisit Evolution Strategies (ES) as a policy-gradient proxy and localize exploration with bounded, antithetic triangular perturbations, suitable for policy refinement. We propose Triangular-Distribution ES (TD-ES) which pairs bounded triangular noise with a centered-rank finite-difference estimator to deliver stable, parallelizable, gradient-free updates. In a two-stage pipeline -- PPO pretraining followed by TD-ES refinement -- this preserves early sample efficiency while enabling robust late-stage gains. Across a suite of robotic manipulation tasks, TD-ES raises success rates by 26.5% relative to PPO and greatly reduces variance, offering a simple, compute-light path to reliable refinement.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Jiazhen Chen, Xiuqin Liang, Sichao Fu, Zheng Ma, Weihua Ou
ๆฅ็ๆ่ฆ
Unsupervised graph anomaly detection (GAD) has received increasing attention in recent years, which aims to identify data anomalous patterns utilizing only unlabeled node information from graph-structured data. However, prevailing unsupervised GAD methods typically presuppose complete node attributes and structure information, a condition hardly satisfied in real-world scenarios owing to privacy, collection errors or dynamic node arrivals. Existing standard imputation schemes risk "repairing" rare anomalous nodes so that they appear normal, thereby introducing imputation bias into the detection process. In addition, when both node attributes and edges are missing simultaneously, estimation errors in one view can contaminate the other, causing cross-view interference that further undermines the detection performance. To overcome these challenges, we propose M$^2$V-UGAD, a multiple missing values-resistant unsupervised GAD framework on incomplete graphs. Specifically, a dual-pathway encoder is first proposed to independently reconstruct missing node attributes and graph structure, thereby preventing errors in one view from propagating to the other. The two pathways are then fused and regularized in a joint latent space so that normals occupy a compact inner manifold while anomalies reside on an outer shell. Lastly, to mitigate imputation bias, we sample latent codes just outside the normal region and decode them into realistic node features and subgraphs, providing hard negative examples that sharpen the decision boundary. Experiments on seven public benchmarks demonstrate that M$^2$V-UGAD consistently outperforms existing unsupervised GAD methods across varying missing rates.
๐
2025-11-13
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Hao Xu, Yuntian Chen, Dongxiao Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Constitutive models are fundamental to solid mechanics and materials science, underpinning the quantitative description and prediction of material responses under diverse loading conditions. Traditional phenomenological models, which are derived through empirical fitting, often lack generalizability and rely heavily on expert intuition and predefined functional forms. In this work, we propose a graph-based equation discovery framework for the automated discovery of constitutive laws directly from multisource experimental data. This framework expresses equations as directed graphs, where nodes represent operators and variables, edges denote computational relations, and edge features encode parametric dependencies. This enables the generation and optimization of free-form symbolic expressions with undetermined material-specific parameters. Through the proposed framework, we have discovered new constitutive models for strain-rate effects in alloy steel materials and the deformation behavior of lithium metal. Compared with conventional empirical models, these new models exhibit compact analytical structures and achieve higher accuracy. The proposed graph-based equation discovery framework provides a generalizable and interpretable approach for data-driven scientific modelling, particularly in contexts where traditional empirical formulations are inadequate for representing complex physical phenomena.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Brian B. Moser, Shalini Strode, Federico Raue, Stanislav Frolov, Krzysztof Adamkiewicz et al.
ๆฅ็ๆ่ฆ
Dataset distillation (DD) promises compact yet faithful synthetic data, but existing approaches often inherit the inductive bias of a single teacher model. As dataset size increases, this bias drives generation toward overly smooth, homogeneous samples, reducing intra-class diversity and limiting generalization. We present PRISM (PRIors from diverse Source Models), a framework that disentangles architectural priors during synthesis. PRISM decouples the logit-matching and regularization objectives, supervising them with different teacher architectures: a primary model for logits and a stochastic subset for batch-normalization (BN) alignment. On ImageNet-1K, PRISM consistently and reproducibly outperforms single-teacher methods (e.g., SRe2L) and recent multi-teacher variants (e.g., G-VBSM) at low- and mid-IPC regimes. The generated data also show significantly richer intra-class diversity, as reflected by a notable drop in cosine similarity between features. We further analyze teacher selection strategies (pre- vs. intra-distillation) and introduce a scalable cross-class batch formation scheme for fast parallel synthesis. Code will be released after the review period.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hossein Rouhvarzi, Anastasis Kratsios
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Incremental flow-based denoising models have reshaped generative modelling, but their empirical advantage still lacks a rigorous approximation-theoretic foundation. We show that incremental generation is necessary and sufficient for universal flow-based generation on the largest natural class of self-maps of $[0,1]^d$ compatible with denoising pipelines, namely the orientation-preserving homeomorphisms of $[0,1]^d$. All our guarantees are uniform on the underlying maps and hence imply approximation both samplewise and in distribution.
Using a new topological-dynamical argument, we first prove an impossibility theorem: the class of all single-step autonomous flows, independently of the architecture, width, depth, or Lipschitz activation of the underlying neural network, is meagre and therefore not universal in the space of orientation-preserving homeomorphisms of $[0,1]^d$. By exploiting algebraic properties of autonomous flows, we conversely show that every orientation-preserving Lipschitz homeomorphism on $[0,1]^d$ can be approximated at rate $\mathcal{O}(n^{-1/d})$ by a composition of at most $K_d$ such flows, where $K_d$ depends only on the dimension. Under additional smoothness assumptions, the approximation rate can be made dimension-free, and $K_d$ can be chosen uniformly over the class being approximated. Finally, by linearly lifting the domain into one higher dimension, we obtain structured universal approximation results for continuous functions and for probability measures on $[0,1]^d$, the latter realized as pushforwards of empirical measures with vanishing $1$-Wasserstein error.
๐
2025-11-13
๐ AAAI
๐ฅ Weilin Wan, Fan Yi, Weizhong Zhang, Quan Zhou, Cheng Jin
Machine Learning
ๆฅ็ๆ่ฆ
Modern deep neural networks rely heavily on massive model weights and training samples, incurring substantial computational costs. Weight pruning and coreset selection are two emerging paradigms proposed to improve computational efficiency. In this paper, we first explore the interplay between redundant weights and training samples through a transparent analysis: redundant samples, particularly noisy ones, cause model weights to become unnecessarily overtuned to fit them, complicating the identification of irrelevant weights during pruning; conversely, irrelevant weights tend to overfit noisy data, undermining coreset selection effectiveness. To further investigate and harness this interplay in deep learning, we develop a Simultaneous Weight and Sample Tailoring mechanism (SWaST) that alternately performs weight pruning and coreset selection to establish a synergistic effect in training. During this investigation, we observe that when simultaneously removing a large number of weights and samples, a phenomenon we term critical double-loss can occur, where important weights and their supportive samples are mistakenly eliminated at the same time, leading to model instability and nearly irreversible degradation that cannot be recovered in subsequent training. Unlike classic machine learning models, this issue can arise in deep learning due to the lack of theoretical guarantees on the correctness of weight pruning and coreset selection, which explains why these paradigms are often developed independently. We mitigate this by integrating a state preservation mechanism into SWaST, enabling stable joint optimization. Extensive experiments reveal a strong synergy between pruning and coreset selection across varying prune rates and coreset sizes, delivering accuracy boosts of up to 17.83% alongside 10% to 90% FLOPs reductions.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Shunan Sheng, Bohan Wu, Bennett Zhu, Sinho Chewi, Aram-Alexandre Pooladian
ๆฅ็ๆ่ฆ
Structured variational inference constitutes a core methodology in modern statistical applications. Unlike mean-field variational inference, the approximate posterior is assumed to have interdependent structure. We consider the natural setting of star-structured variational inference, where a root variable impacts all the other ones. We prove the first results for existence, uniqueness, and self-consistency of the variational approximation. In turn, we derive quantitative approximation error bounds for the variational approximation to the posterior, extending prior work from the mean-field setting to the star-structured setting. We also develop a gradient-based algorithm with provable guarantees for computing the variational approximation using ideas from optimal transport theory. We explore the implications of our results for Gaussian measures and hierarchical Bayesian models, including generalized linear models with location family priors and spike-and-slab priors with one-dimensional debiasing. As a by-product of our analysis, we develop new stability results for star-separable transport maps which might be of independent interest.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaoda Wang, Kaiqiao Han, Yuhao Xu, Xiao Luo, Yizhou Sun et al.
ๆฅ็ๆ่ฆ
Cardiovascular disease (CVD) is a leading cause of mortality worldwide. Electrocardiograms (ECGs) are the most widely used non-invasive tool for cardiac assessment, yet large, well-annotated ECG corpora are scarce due to cost, privacy, and workflow constraints. Generating ECGs can be beneficial for the mechanistic understanding of cardiac electrical activity, enable the construction of large, heterogeneous, and unbiased datasets, and facilitate privacy-preserving data sharing. Generating realistic ECG signals from clinical context is important yet underexplored. Recent work has leveraged diffusion models for text-to-ECG generation, but two challenges remain: (i) existing methods often overlook the physiological simulator knowledge of cardiac activity; and (ii) they ignore broader, experience-based clinical knowledge grounded in real-world practice. To address these gaps, we propose SE-Diff, a novel physiological simulator and experience enhanced diffusion model for comprehensive ECG generation. SE-Diff integrates a lightweight ordinary differential equation (ODE)-based ECG simulator into the diffusion process via a beat decoder and simulator-consistent constraints, injecting mechanistic priors that promote physiologically plausible waveforms. In parallel, we design an LLM-powered experience retrieval-augmented strategy to inject clinical knowledge, providing more guidance for ECG generation. Extensive experiments on real-world ECG datasets demonstrate that SE-Diff improves both signal fidelity and text-ECG semantic alignment over baselines, proving its superiority for text-to-ECG generation. We further show that the simulator-based and experience-based knowledge also benefit downstream ECG classification.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Keshara Weerasinghe, Xueren Ge, Tessa Heick, Lahiru Nuwan Wijayasingha, Anthony Cortez et al.
Multimodal
ๆฅ็ๆ่ฆ
Emergency Medical Services (EMS) are critical to patient survival in emergencies, but first responders often face intense cognitive demands in high-stakes situations. AI cognitive assistants, acting as virtual partners, have the potential to ease this burden by supporting real-time data collection and decision making. In pursuit of this vision, we introduce EgoEMS, the first end-to-end, high-fidelity, multimodal, multiperson dataset capturing over 20 hours of realistic, procedural EMS activities from an egocentric view in 233 simulated emergency scenarios performed by 62 participants, including 46 EMS professionals. Developed in collaboration with EMS experts and aligned with national standards, EgoEMS is captured using an open-source, low-cost, and replicable data collection system and is annotated with keysteps, timestamped audio transcripts with speaker diarization, action quality metrics, and bounding boxes with segmentation masks. Emphasizing realism, the dataset includes responder-patient interactions reflecting real-world emergency dynamics. We also present a suite of benchmarks for real-time multimodal keystep recognition and action quality estimation, essential for developing AI support tools for EMS. We hope EgoEMS inspires the research community to push the boundaries of intelligent EMS systems and ultimately contribute to improved patient outcomes.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shengfei Wei, Suyuan Liu, Jun Wang, Ke Liang, Miaomiao Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Fair clustering is crucial for mitigating bias in unsupervised learning, yet existing algorithms often suffer from quadratic or super-quadratic computational complexity, rendering them impractical for large-scale datasets. To bridge this gap, we introduce the Anchor-based Fair Clustering Framework (AFCF), a novel, general, and plug-and-play framework that empowers arbitrary fair clustering algorithms with linear-time scalability. Our approach first selects a small but representative set of anchors using a novel fair sampling strategy. Then, any off-the-shelf fair clustering algorithm can be applied to this small anchor set. The core of our framework lies in a novel anchor graph construction module, where we formulate an optimization problem to propagate labels while preserving fairness. This is achieved through a carefully designed group-label joint constraint, which we prove theoretically ensures that the fairness of the final clustering on the entire dataset matches that of the anchor clustering. We solve this optimization efficiently using an ADMM-based algorithm. Extensive experiments on multiple large-scale benchmarks demonstrate that AFCF drastically accelerates state-of-the-art methods, which reduces computational time by orders of magnitude while maintaining strong clustering performance and fairness guarantees.
๐
2025-11-13
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nikunj Gupta, Bill Guo, Rajgopal Kannan, Viktor K. Prasanna
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All codes can be found here https://github.com/ Nikunj-Gupta/hierouter.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Hyung-Jun Moon, Sung-Bae Cho
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Continual learning methods used to force neural networks to process sequential tasks in isolation, preventing them from leveraging useful inter-task relationships and causing them to repeatedly relearn similar features or overly differentiate them. To address this problem, we propose a fully differentiable, exemplar-free expandable method composed of two complementary memories: One learns common features that can be used across all tasks, and the other combines the shared features to learn discriminative characteristics unique to each sample. Both memories are differentiable so that the network can autonomously learn latent representations for each sample. For each task, the memory adjustment module adaptively prunes critical slots and minimally expands capacity to accommodate new concepts, and orthogonal regularization enforces geometric separation between preserved and newly learned memory components to prevent interference. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that the proposed method outperforms 14 state-of-the-art methods for class-incremental learning, achieving final accuracies of 55.13\%, 37.24\%, and 30.11\%, respectively. Additional analysis confirms that, through effective integration and utilization of knowledge, the proposed method can increase average performance across sequential tasks, and it produces feature extraction results closest to the upper bound, thus establishing a new milestone in continual learning.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Manh Nguyen, Dung Nguyen, Dai Do, Svetha Venkatesh, Hung Le
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) finetuning is crucial to aligning large language models (LLMs), but the process is notoriously unstable and exhibits high variance across model checkpoints. In practice, selecting the best checkpoint is challenging: evaluating checkpoints on the validation set during training is computationally expensive and requires a good validation set, while relying on the final checkpoint provides no guarantee of good performance. We introduce an uncertainty-guided approach for checkpoint selection (UGCS) that avoids these pitfalls. Our method identifies hard question-answer pairs using per-sample uncertainty and ranks checkpoints by how well they handle these challenging cases. By averaging the rewards of the top-uncertain samples over a short training window, our method produces a stable and discriminative signal without additional forward passes or significant computation overhead. Experiments across three datasets and three LLMs demonstrate that it consistently identifies checkpoints with stronger generalization, outperforming traditional strategies such as relying on training or validation performance. These results highlight that models solving their hardest tasks with low uncertainty are the most reliable overall.
๐
2025-11-13
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yunkai Zhang, Qiang Zhang, Feng, Lin, Ruizhong Qiu et al.
ๆฅ็ๆ่ฆ
Optimizing recommender systems for objectives beyond accuracy, such as diversity, novelty, and personalization, is crucial for long-term user satisfaction. To this end, industrial practitioners have accumulated vast amounts of structured domain knowledge, which we term human priors (e.g., item taxonomies, temporal patterns). This knowledge is typically applied through post-hoc adjustments during ranking or post-ranking. However, this approach remains decoupled from the core model learning, which is particularly undesirable as the industry shifts to end-to-end generative recommendation foundation models. On the other hand, many methods targeting these beyond-accuracy objectives often require architecture-specific modifications and discard these valuable human priors by learning user intent in a fully unsupervised manner.
Instead of discarding the human priors accumulated over years of practice, we introduce a backbone-agnostic framework that seamlessly integrates these human priors directly into the end-to-end training of generative recommenders. With lightweight, prior-conditioned adapter heads inspired by efficient LLM decoding strategies, our approach guides the model to disentangle user intent along human-understandable axes (e.g., interaction types, long- vs. short-term interests). We also introduce a hierarchical composition strategy for modeling complex interactions across different prior types. Extensive experiments on three large-scale datasets demonstrate that our method significantly enhances both accuracy and beyond-accuracy objectives. We also show that human priors allow the backbone model to more effectively leverage longer context lengths and larger model sizes.
๐
2025-11-13
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Martin Braas, Lukas Esterle
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, yet their applicability to dialogue systems in computer games remains limited. This limitation arises from their substantial hardware requirements, latency constraints, and the necessity to maintain clearly defined knowledge boundaries within a game setting. In this paper, we propose a modular NPC dialogue system that leverages Small Language Models (SLMs), fine-tuned to encode specific NPC personas and integrated with runtime-swappable memory modules. These memory modules preserve character-specific conversational context and world knowledge, enabling expressive interactions and long-term memory without retraining or model reloading during gameplay. We comprehensively evaluate our system using three open-source SLMs: DistilGPT-2, TinyLlama-1.1B-Chat, and Mistral-7B-Instruct, trained on synthetic persona-aligned data and benchmarked on consumer-grade hardware. While our approach is motivated by applications in gaming, its modular design and persona-driven memory architecture hold significant potential for broader adoption in domains requiring expressive, scalable, and memory-rich conversational agents, such as virtual assistants, customer support bots, or interactive educational systems.
๐
2025-11-13
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jun Zhang, Yi Li, Yue Liu, Changping Wang, Yuan Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
As an intelligent infrastructure connecting users with commercial content, advertising recommendation systems play a central role in information flow and value creation within the digital economy. However, existing multi-stage advertising recommendation systems suffer from objective misalignment and error propagation, making it difficult to achieve global optimality, while unified generative recommendation models still struggle to meet the demands of practical industrial applications. To address these issues, we propose GPR (Generative Pre-trained Recommender), the first one-model framework that redefines advertising recommendation as an end-to-end generative task, replacing the traditional cascading paradigm with a unified generative approach. To realize GPR, we introduce three key innovations spanning unified representation, network architecture, and training strategy. First, we design a unified input schema and tokenization method tailored to advertising scenarios, mapping both ads and organic content into a shared multi-level semantic ID space, thereby enhancing semantic alignment and modeling consistency across heterogeneous data. Second, we develop the Heterogeneous Hierarchical Decoder (HHD), a dual-decoder architecture that decouples user intent modeling from ad generation, achieving a balance between training efficiency and inference flexibility while maintaining strong modeling capacity. Finally, we propose a multi-stage joint training strategy that integrates Multi-Token Prediction (MTP), Value-Aware Fine-Tuning and the Hierarchy Enhanced Policy Optimization (HEPO) algorithm, forming a complete generative recommendation pipeline that unifies interest modeling, value alignment, and policy optimization. GPR has been fully deployed in the Tencent Weixin Channels advertising system, delivering significant improvements in key business metrics including GMV and CTCVR.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Pascal Strauch, David Mรผller, Sammy Christen, Agon Serifi, Ruben Grandia et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Despite recent advances in robust locomotion, bipedal robots operating in the real world remain at risk of falling. While most research focuses on preventing such events, we instead concentrate on the phenomenon of falling itself. Specifically, we aim to reduce physical damage to the robot while providing users with control over a robot's end pose. To this end, we propose a robot agnostic reward function that balances the achievement of a desired end pose with impact minimization and the protection of critical robot parts during reinforcement learning. To make the policy robust to a broad range of initial falling conditions and to enable the specification of an arbitrary and unseen end pose at inference time, we introduce a simulation-based sampling strategy of initial and end poses. Through simulated and real-world experiments, our work demonstrates that even bipedal robots can perform controlled, soft falls.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Raghav Adhikari, Sachet Khatiwada, Suman Poudel
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Post-disaster situations pose unique navigation challenges. One of those challenges is the unstructured nature of the environment, which makes it hard to layout paths for rescue vehicles. We propose the use of Uncrewed Aerial Vehicle (UAV) in such scenario to perform reconnaissance across the environment. To accomplish this, we propose an optimization-based approach to plan a path for the UAV at optimal height where the sensors of the UAV can cover the most area and collect data with minimum uncertainty.
๐
2025-11-13
๐ ArXiv Preprint (eess.SY)
๐ฅ Omid Mirzaeedodangeh, Eliot Shekhtman, Nikolai Matni, Lars Lindemann
Robotics
ๆฅ็ๆ่ฆ
Safe planning of an autonomous agent in interactive environments -- such as the control of a self-driving vehicle among pedestrians and human-controlled vehicles -- poses a major challenge as the behavior of the environment is unknown and reactive to the behavior of the autonomous agent. This coupling gives rise to interaction-driven distribution shifts where the autonomous agent's control policy may change the environment's behavior, thereby invalidating safety guarantees in existing work. Indeed, recent works have used conformal prediction (CP) to generate distribution-free safety guarantees using observed data of the environment. However, CP's assumption on data exchangeability is violated in interactive settings due to a circular dependency where a control policy update changes the environment's behavior, and vice versa. To address this gap, we propose an iterative framework that robustly maintains safety guarantees across policy updates by quantifying the potential impact of a planned policy update on the environment's behavior. We realize this via adversarially robust CP where we perform a regular CP step in each episode using observed data under the current policy, but then transfer safety guarantees across policy updates by analytically adjusting the CP result to account for distribution shifts. This adjustment is performed based on a policy-to-trajectory sensitivity analysis, resulting in a safe, episodic open-loop planner. We further conduct a contraction analysis of the system providing conditions under which both the CP results and the policy updates are guaranteed to converge. We empirically demonstrate these safety and convergence guarantees on a two-dimensional car-pedestrian case study. To the best of our knowledge, these are the first results that provide valid safety guarantees in such interactive settings.
๐
2025-11-13
๐ 8 Pages, 9 Figures, Submitted to IEEE RoboSoft
๐ฅ Tianhui Han, Shashwat Singh, Sarvesh Patil, Zeynep Temel
Machine Learning
ๆฅ็ๆ่ฆ
Origami-inspired mechanisms can transform flat sheets into functional three-dimensional dynamic structures that are lightweight, compact, and capable of complex motion. These properties make origami increasingly valuable in robotic and deployable systems. However, accurately simulating their folding behavior and interactions with the environment remains challenging. To address this, we present a design framework for origami mechanism simulation that utilizes MuJoCo's deformable-body capabilities. In our approach, origami sheets are represented as graphs of interconnected deformable elements with user-specified constraints such as creases and actuation, defined through an intuitive graphical user interface (GUI). This framework allows users to generate physically consistent simulations that capture both the geometric structure of origami mechanisms and their interactions with external objects and surfaces. We demonstrate our method's utility through a case study on an origami catapult, where design parameters are optimized in simulation using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and validated experimentally on physical prototypes. The optimized structure achieves improved throwing performance, illustrating how our system enables rapid, simulation-driven origami design, optimization, and analysis.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kaiwen Zhou et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, Semantic-guided Dual Visual Pruner (SD-Pruner) performs: Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP; Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold.SemanticVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/SemanticVLA
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Lorenzo Pagliara, Violeta Redondo, Enrico Ferrentino, Manuel Ferre, Pasquale Chiacchio
MultimodalRobotics
ๆฅ็ๆ่ฆ
Bolting operations are critical in industrial assembly and in the maintenance of scientific facilities, requiring high precision and robustness to faults. Although robotic solutions have the potential to improve operational safety and effectiveness, current systems still lack reliable autonomy and fault management capabilities. To address this gap, we propose a control framework for dependable robotized bolting tasks and instantiate it on a specific robotic system. The system features a control architecture ensuring accurate driving torque control and active compliance throughout the entire operation, enabling safe interaction even under fault conditions. By designing a multimodal human-robot interface (HRI) providing real-time visualization of relevant system information and supporting seamless transitions between automatic and manual control, we improve operator situation awareness and fault detection capabilities. A high-level supervisor (SV) coordinates the execution and manages transitions between control modes, ensuring consistency with the supervisory control (SVC) paradigm, while preserving the human operator's authority. The system is validated in a representative bolting operation involving pipe flange joining, under several fault conditions. The results demonstrate improved fault detection capabilities, enhanced operator situational awareness, and accurate and compliant execution of the bolting operation. However, they also reveal the limitations of relying on a single camera to achieve full situational awareness.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Benjamin Stoler, Jonathan Francis, Jean Oh
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Methods for trajectory prediction in Autonomous Driving must contend with rare, safety-critical scenarios that make reliance on real-world data collection alone infeasible. To assess robustness under such conditions, we propose new long-tail evaluation settings that repartition datasets to create challenging out-of-distribution (OOD) test sets. We first introduce a safety-informed scenario factorization framework, which disentangles scenarios into discrete ego and social contexts. Building on analogies to compositional zero-shot image-labeling in Computer Vision, we then hold out novel context combinations to construct challenging closed-world and open-world settings. This process induces OOD performance gaps in future motion prediction of 5.0% and 14.7% in closed-world and open-world settings, respectively, relative to in-distribution performance for a state-of-the-art baseline. To improve generalization, we extend task-modular gating networks to operate within trajectory prediction models, and develop an auxiliary, difficulty-prediction head to refine internal representations. Our strategies jointly reduce the OOD performance gaps to 2.8% and 11.5% in the two settings, respectively, while still improving in-distribution performance.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Mingxing Peng, Ruoyu Yao, Xusen Guo, Jun Ma
Robotics
ๆฅ็ๆ่ฆ
Recent advances in closed-loop planning benchmarks have significantly improved the evaluation of autonomous vehicles. However, existing benchmarks still rely on rule-based reactive agents such as the Intelligent Driver Model (IDM), which lack behavioral diversity and fail to capture realistic human interactions, leading to oversimplified traffic dynamics. To address these limitations, we present nuPlan-R, a new reactive closed-loop planning benchmark that integrates learning-based reactive multi-agent simulation into the nuPlan framework. Our benchmark replaces the rule-based IDM agents with noise-decoupled diffusion-based reactive agents and introduces an interaction-aware agent selection mechanism to ensure both realism and computational efficiency. Furthermore, we extend the benchmark with two additional metrics to enable a more comprehensive assessment of planning performance. Extensive experiments demonstrate that our reactive agent model produces more realistic, diverse, and human-like traffic behaviors, leading to a benchmark environment that better reflects real-world interactive driving. We further reimplement a collection of rule-based, learning-based, and hybrid planning approaches within our nuPlan-R benchmark, providing a clearer reflection of planner performance in complex interactive scenarios and better highlighting the advantages of learning-based planners in handling complex and dynamic scenarios. These results establish nuPlan-R as a new standard for fair, reactive, and realistic closed-loop planning evaluation. We will open-source the code for the new benchmark.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang et al.
Robotics
ๆฅ็ๆ่ฆ
Embodied navigation is a fundamental capability for robotic agents operating. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we further identify the last-mile problem in zero-shot navigation - determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on GOAT-Bench and HM3D-OVON datasets. The open-source code will be publicly available.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Konstantin Soshin, Alexander Krapukhin, Andrei Spiridonov, Denis Shepelev, Gregorii Bukhtuev et al.
Robotics
ๆฅ็ๆ่ฆ
Most existing robotic manipulation benchmarks focus on simplified tabletop scenarios, typically involving a stationary robotic arm interacting with various objects on a flat surface. To address this limitation, we introduce RoboBenchMart, a more challenging and realistic benchmark designed for dark store environments, where robots must perform complex manipulation tasks with diverse grocery items. This setting presents significant challenges, including dense object clutter and varied spatial configurations -- with items positioned at different heights, depths, and in close proximity. By targeting the retail domain, our benchmark addresses a setting with strong potential for near-term automation impact. We demonstrate that current state-of-the-art generalist models struggle to solve even common retail tasks. To support further research, we release the RoboBenchMart suite, which includes a procedural store layout generator, a trajectory generation pipeline, evaluation tools and fine-tuned baseline models.
๐
2025-11-13
๐ ArXiv Preprint (Computer Vision)
๐ฅ Stephane Da Silva Martins, Emanuel Aldea, Sylvie Le Hรฉgarat-Mascle
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Multi-agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents' long-term goals and their fine-grained social interactions, which leads to unrealistic multi-agent futures. We propose VISTA, a recursive goal-conditioned transformer for multi-agent trajectory forecasting. VISTA combines (i) a cross-attention fusion module that integrates long-horizon intent with past motion, (ii) a social-token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model turns single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and substantially fewer collisions. On MADRAS, it reduces the average collision rate of strong baselines from 2.14 to 0.03 percent, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it promising for safety-critical autonomous systems.
๐
2025-11-13
๐ WWW
๐ฅ Kamil Dreczkowski, Pietro Vitiello, Vitalis Vosylius, Edward Johns
Robotics
ๆฅ็ๆ่ฆ
Humans are remarkably efficient at learning tasks from demonstrations, but today's imitation learning methods for robot manipulation often require hundreds or thousands of demonstrations per task. We investigate two fundamental priors for improving learning efficiency: decomposing manipulation trajectories into sequential alignment and interaction phases, and retrieval-based generalisation. Through 3,450 real-world rollouts, we systematically study this decomposition. We compare different design choices for the alignment and interaction phases, and examine generalisation and scaling trends relative to today's dominant paradigm of behavioural cloning with a single-phase monolithic policy. In the few-demonstrations-per-task regime (<10 demonstrations), decomposition achieves an order of magnitude improvement in data efficiency over single-phase learning, with retrieval consistently outperforming behavioural cloning for both alignment and interaction. Building on these insights, we develop Multi-Task Trajectory Transfer (MT3), an imitation learning method based on decomposition and retrieval. MT3 learns everyday manipulation tasks from as little as a single demonstration each, whilst also generalising to novel object instances. This efficiency enables us to teach a robot 1,000 distinct everyday tasks in under 24 hours of human demonstrator time. Through 2,200 additional real-world rollouts, we reveal MT3's capabilities and limitations across different task families. Videos of our experiments can be found on at https://www.robot-learning.uk/learning-1000-tasks.
๐
2025-11-13
๐ NeurIPS 2025
๐ฅ Haidong Huang, Haiyue Zhu. Jiayu Song, Xixin Zhao, Yaohua Zhou, Jiayi Zhang et al.
Natural Language ProcessingRoboticsMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9\% absolute improvement over Uni-O4 on locomotion tasks and +12.4\% on dexterous manipulation, demonstrating strong generalization and scalability.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Yizheng Wang, Timon Rabczuk, Yinghua Liu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Friction modeling plays a crucial role in achieving high-precision motion control in robotic operating systems. Traditional static friction models (such as the Stribeck model) are widely used due to their simple forms; however, they typically require predefined functional assumptions, which poses significant challenges when dealing with unknown functional structures. To address this issue, this paper proposes a physics-inspired machine learning approach based on the Kolmogorov Arnold Network (KAN) for static friction modeling of robotic joints. The method integrates spline activation functions with a symbolic regression mechanism, enabling model simplification and physical expression extraction through pruning and attribute scoring, while maintaining both high prediction accuracy and interpretability. We first validate the method's capability to accurately identify key parameters under known functional models, and further demonstrate its robustness and generalization ability under conditions with unknown functional structures and noisy data. Experiments conducted on both synthetic data and real friction data collected from a six-degree-of-freedom industrial manipulator show that the proposed method achieves a coefficient of determination greater than 0.95 across various tasks and successfully extracts concise and physically meaningful friction expressions. This study provides a new perspective for interpretable and data-driven robotic friction modeling with promising engineering applicability.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Egor Davydenko, Andrei Volchenkov, Vladimir Gerasimov, Roman Gorbachev
ๆฅ็ๆ่ฆ
In this paper, we propose a novel design of an electrically actuated robotic leg, called the DecARt (Decoupled Actuation Robot) Leg, aimed at performing agile locomotion. This design incorporates several new features, such as the use of a quasi-telescopic kinematic structure with rotational motors for decoupled actuation, a near-anthropomorphic leg appearance with a forward facing knee, and a novel multi-bar system for ankle torque transmission from motors placed above the knee. To analyze the agile locomotion capabilities of the design numerically, we propose a new descriptive metric, called the `Fastest Achievable Swing Time` (FAST), and perform a quantitative evaluation of the proposed design and compare it with other designs. Then we evaluate the performance of the DecARt Leg-based robot via extensive simulation and preliminary hardware experiments.
๐
2025-11-13
๐ AAAI 2026
๐ฅ Xuancun Lu, Jiaxiang Chen, Shilin Xiao, Zizhi Jin, Zhangrui Chen et al.
Multimodal
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models revolutionize robotic systems by enabling end-to-end perception-to-action pipelines that integrate multiple sensory modalities, such as visual signals processed by cameras and auditory signals captured by microphones. This multi-modality integration allows VLA models to interpret complex, real-world environments using diverse sensor data streams. Given the fact that VLA-based systems heavily rely on the sensory input, the security of VLA models against physical-world sensor attacks remains critically underexplored.
To address this gap, we present the first systematic study of physical sensor attacks against VLAs, quantifying the influence of sensor attacks and investigating the defenses for VLA models. We introduce a novel ``Real-Sim-Real'' framework that automatically simulates physics-based sensor attack vectors, including six attacks targeting cameras and two targeting microphones, and validates them on real robotic systems. Through large-scale evaluations across various VLA architectures and tasks under varying attack parameters, we demonstrate significant vulnerabilities, with susceptibility patterns that reveal critical dependencies on task types and model designs. We further develop an adversarial-training-based defense that enhances VLA robustness against out-of-distribution physical perturbations caused by sensor attacks while preserving model performance. Our findings expose an urgent need for standardized robustness benchmarks and mitigation strategies to secure VLA deployments in safety-critical environments.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Xiangyi Wei, Haotian Zhang, Xinyi Cao, Siyu Xie, Weifeng Ge et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
The Vision-Language-Action models (VLA) have achieved significant advances in robotic manipulation recently. However, vision-only VLA models create fundamental limitations, particularly in perceiving interactive and manipulation dynamic processes. This paper proposes Audio-VLA, a multimodal manipulation policy that leverages contact audio to perceive contact events and dynamic process feedback. Audio-VLA overcomes the vision-only constraints of VLA models. Additionally, this paper introduces the Task Completion Rate (TCR) metric to systematically evaluate dynamic operational processes. Audio-VLA employs pre-trained DINOv2 and SigLIP as visual encoders, AudioCLIP as the audio encoder, and Llama2 as the large language model backbone. We apply LoRA fine-tuning to these pre-trained modules to achieve robust cross-modal understanding of both visual and acoustic inputs. A multimodal projection layer aligns features from different modalities into the same feature space. Moreover RLBench and LIBERO simulation environments are enhanced by adding collision-based audio generation to provide realistic sound feedback during object interactions. Since current robotic manipulation evaluations focus on final outcomes rather than providing systematic assessment of dynamic operational processes, the proposed TCR metric measures how well robots perceive dynamic processes during manipulation, creating a more comprehensive evaluation metric. Extensive experiments on LIBERO, RLBench, and two real-world tasks demonstrate Audio-VLA's superior performance over vision-only comparative methods, while the TCR metric effectively quantifies dynamic process perception capabilities.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Hanwen Wang
Robotics
ๆฅ็ๆ่ฆ
The generalization ability of visuomotor policy is crucial, as a good policy should be deployable across diverse scenarios. Some methods can collect large amounts of trajectory augmentation data to train more generalizable imitation learning policies, aimed at handling the random placement of objects on the scene's horizontal plane. However, the data generated by these methods still lack diversity, which limits the generalization ability of the trained policy. To address this, we investigate the performance of policies trained by existing methods across different scene layout factors via automate the data generation for those factors that significantly impact generalization. We have created a more extensively randomized dataset that can be efficiently and automatically generated with only a small amount of human demonstration. The dataset covers five types of manipulators and two types of grippers, incorporating extensive randomization factors such as camera pose, lighting conditions, tabletop texture, and table height across six manipulation tasks. We found that all of these factors influence the generalization ability of the policy. Applying any form of randomization enhances policy generalization, with diverse trajectories particularly effective in bridging visual gap. Notably, we investigated on low-cost manipulator the effect of the scene randomization proposed in this work on enhancing the generalization capability of visuomotor policies for zero-shot sim-to-real transfer.
๐
2025-11-13
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ethan Hirschowitz, Fabio Ramos
Robotics
ๆฅ็ๆ่ฆ
Improving competent robot policies with on-policy RL is often hampered by noisy, low-signal gradients. We revisit Evolution Strategies (ES) as a policy-gradient proxy and localize exploration with bounded, antithetic triangular perturbations, suitable for policy refinement. We propose Triangular-Distribution ES (TD-ES) which pairs bounded triangular noise with a centered-rank finite-difference estimator to deliver stable, parallelizable, gradient-free updates. In a two-stage pipeline -- PPO pretraining followed by TD-ES refinement -- this preserves early sample efficiency while enabling robust late-stage gains. Across a suite of robotic manipulation tasks, TD-ES raises success rates by 26.5% relative to PPO and greatly reduces variance, offering a simple, compute-light path to reliable refinement.
๐
2025-11-13
๐ 8 pages, 10 figures, IEEE RoboSoft 2026
๐ฅ Shashwat Singh, Zilin Si, Zeynep Temel
ๆฅ็ๆ่ฆ
Amphibians adapt their morphologies and motions to accommodate movement in both terrestrial and aquatic environments. Inspired by these biological features, we present PuffyBot, an untethered shape morphing robot capable of changing its body morphology to navigate multiple environments. Our robot design leverages a scissor-lift mechanism driven by a linear actuator as its primary structure to achieve shape morphing. The transformation enables a volume change from 255.00 cm3 to 423.75 cm3, modulating the buoyant force to counteract a downward force of 3.237 N due to 330 g mass of the robot. A bell-crank linkage is integrated with the scissor-lift mechanism, which adjusts the servo-actuated limbs by 90 degrees, allowing a seamless transition between crawling and swimming modes. The robot is fully waterproof, using thermoplastic polyurethane (TPU) fabric to ensure functionality in aquatic environments. The robot can operate untethered for two hours with an onboard battery of 1000 mA h. Our experimental results demonstrate multi-environment locomotion, including crawling on the land, crawling on the underwater floor, swimming on the water surface, and bimodal buoyancy adjustment to submerge underwater or resurface. These findings show the potential of shape morphing to create versatile and energy efficient robotic platforms suitable for diverse environments.
๐
2025-11-13
๐ ArXiv Preprint (Robotics)
๐ฅ Kaleb Ben Naveed, Utkrisht Sahai, Anouck Girard, Dimitra Panagou
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous robots are increasingly deployed for information-gathering tasks in environments that vary across space and time. Planning informative and safe trajectories in such settings is challenging because information decays when regions are not revisited. Most existing planners model information as static or uniformly decaying, ignoring environments where the decay rate varies spatially; those that model non-uniform decay often overlook how it evolves along the robot's motion, and almost all treat safety as a soft penalty. In this paper, we address these challenges. We model uncertainty in the environment using clarity, a normalized representation of differential entropy from our earlier work that captures how information improves through new measurements and decays over time when regions are not revisited. Building on this, we present Stein Variational Clarity-Aware Informative Planning, a framework that embeds clarity dynamics within trajectory optimization and enforces safety through a low-level filtering mechanism based on our earlier gatekeeper framework for safety verification. The planner performs Bayesian inference-based learning via Stein variational inference, refining a distribution over informative trajectories while filtering each nominal Stein informative trajectory to ensure safety. Hardware experiments and simulations across environments with varying decay rates and obstacles demonstrate consistent safety and reduced information deficits.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rahul Soni, Basem Suleiman, Sonit Singh
Machine Learning
ๆฅ็ๆ่ฆ
This paper aims to address the challenge of selecting relevant courses for students by proposing the design and development of a course recommendation system. The course recommendation system utilises a combination of data analytics techniques and machine learning algorithms to recommend courses that align with current industry trends and requirements. In order to provide customised suggestions, the study entails the design and implementation of an extensive algorithmic framework that combines machine learning methods, user preferences, and academic criteria. The system employs data mining and collaborative filtering techniques to examine past courses and individual career goals in order to provide course recommendations. Moreover, to improve the accessibility and usefulness of the recommendation system, special attention is given to the development of an easy-to-use front-end interface. The front-end design prioritises visual clarity, interaction, and simplicity through iterative prototyping and user input revisions, guaranteeing a smooth and captivating user experience. We refined and optimised the proposed system by incorporating user feedback, ensuring that it effectively meets the needs and preferences of its target users. The proposed course recommendation system could be a useful tool for students, instructors, and career advisers to use in promoting lifelong learning and professional progression as it fills the gap between university learning and industry expectations. We hope that the proposed course recommendation system will help university students in making data-drive and industry-informed course decisions, in turn, improving graduate outcomes for the university sector.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing
Natural Language Processing
ๆฅ็ๆ่ฆ
Compared to English and other high-resource languages, spellchecking for Khmer remains an unresolved problem due to several challenges. First, there are misalignments between words in the lexicon and the word segmentation model. Second, a Khmer word can be written in different forms. Third, Khmer compound words are often loosely and easily formed, and these compound words are not always found in the lexicon. Fourth, some proper nouns may be flagged as misspellings due to the absence of a Khmer named-entity recognition (NER) model. Unfortunately, existing solutions do not adequately address these challenges. This paper proposes a holistic approach to the Khmer spellchecking problem by integrating Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges, identify potential correction candidates, and rank the most suitable candidate. Experimental results show that the proposed approach achieves a state-of-the-art Khmer spellchecking accuracy of up to 94.4%, compared to existing solutions. The benchmark datasets for Khmer spellchecking and NER tasks in this study will be made publicly available.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yufeng Wang, Lu wei, Haibin Ling
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) improves factuality but retrieving for every query often hurts quality while inflating tokens and latency. We propose Training-free Adaptive Retrieval Gating (TARG), a single-shot policy that decides when to retrieve using only a short, no-context draft from the base model. From the draft's prefix logits, TARG computes lightweight uncertainty scores: mean token entropy, a margin signal derived from the top-1/top-2 logit gap via a monotone link, or small-N variance across a handful of stochastic prefixes, and triggers retrieval only when the score exceeds a threshold. The gate is model agnostic, adds only tens to hundreds of draft tokens, and requires no additional training or auxiliary heads. On NQ-Open, TriviaQA, and PopQA, TARG consistently shifts the accuracy-efficiency frontier: compared with Always-RAG, TARG matches or improves EM/F1 while reducing retrieval by 70-90% and cutting end-to-end latency, and it remains close to Never-RAG in overhead. A central empirical finding is that under modern instruction-tuned LLMs the margin signal is a robust default (entropy compresses as backbones sharpen), with small-N variance offering a conservative, budget-first alternative. We provide ablations over gate type and prefix length and use a delta-latency view to make budget trade-offs explicit.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rocco Tripodi, Xiaoyu Liu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Cross-lingual Natural Language Processing (NLP) has gained significant traction in recent years, offering practical solutions in low-resource settings by transferring linguistic knowledge from resource-rich to low-resource languages. This field leverages techniques like annotation projection and model transfer for language adaptation, supported by multilingual pre-trained language models. However, linguistic divergences hinder language transfer, especially among typologically distant languages. In this paper, we present an analysis of predicate-argument structures in parallel Chinese and English sentences. We explore the alignment and misalignment of predicate annotations, inspecting similarities and differences and proposing a categorization of structural divergences. The analysis and the categorization are supported by a qualitative and quantitative analysis of the results of an annotation projection experiment, in which, in turn, one of the two languages has been used as source language to project annotations into the corresponding parallel sentences. The results of this analysis show clearly that language transfer is asymmetric. An aspect that requires attention when it comes to selecting the source language in transfer learning applications and that needs to be investigated before any scientific claim about cross-lingual NLP is proposed.
๐
2025-11-12
๐ Accepted in IEEE BigData 2025
๐ฅ Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ C. LeMay, A. Lane, J. Seales, M. Winstead, S. Baty
Natural Language Processing
ๆฅ็ๆ่ฆ
Our research investigates how Natural Language Processing (NLP) can be used to extract main topics from a larger corpus of written data, as applied to the case of identifying signaling themes in Presidential Directives (PDs) from the Reagan through Clinton administrations. Analysts and NLP both identified relevant documents, demonstrating the potential utility of NLPs in research involving large written corpuses. However, we also identified discrepancies between NLP and human-labeled results that indicate a need for more research to assess the validity of NLP in this use case. The research was conducted in 2023, and the rapidly evolving landscape of AIML means existing tools have improved and new tools have been developed; this research displays the inherent capabilities of a potentially dated AI tool in emerging social science applications.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor
Natural Language Processing
ๆฅ็ๆ่ฆ
Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources -- a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out of domain texts, highlighting our models' improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Warren Li, Yiqian Wang, Zihan Wang, Jingbo Shang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Omnilingual ASR team, Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers et al.
ๆฅ็ๆ่ฆ
Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most--all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging a LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date--including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohamed Elaraby, Jyoti Prakash Maheswari
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) with extended context windows promise direct reasoning over long documents, reducing the need for chunking or retrieval. Constructing annotated resources for training and evaluation, however, remains costly. Synthetic data offers a scalable alternative, and we introduce SynClaimEval, a framework for evaluating synthetic data utility in long-context claim verification -- a task central to hallucination detection and fact-checking. Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, measuring the degree to which model explanations provide evidence consistent with predictions. Experiments across benchmarks show that long-context synthesis can improve verification in base instruction-tuned models, particularly when augmenting existing human-written datasets. Moreover, synthesis enhances explanation quality, even when verification scores do not improve, underscoring its potential to strengthen both performance and explainability.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rรฉmi Cardon, A. Seza Doฤruรถz
ๆฅ็ๆ่ฆ
Readability is a key concept in the current era of abundant written information. To help making texts more readable and make information more accessible to everyone, a line of researched aims at making texts accessible for their target audience: automatic text simplification (ATS). Lately, there have been studies on the correlations between automatic evaluation metrics in ATS and human judgment. However, the correlations between those two aspects and commonly available readability measures (such as readability formulas or linguistic features) have not been the focus of as much attention. In this work, we investigate the place of readability measures in ATS by complementing the existing studies on evaluation metrics and human judgment, on English. We first discuss the relationship between ATS and research in readability, then we report a study on correlations between readability measures and human judgment, and between readability measures and ATS evaluation metrics. We identify that in general, readability measures do not correlate well with automatic metrics and human judgment. We argue that as the three different angles from which simplification can be assessed tend to exhibit rather low correlations with one another, there is a need for a clear definition of the construct in ATS.
๐
2025-11-12
๐ ArXiv Preprint (Machine Learning)
๐ฅ Renda Li, Hailang Huang, Fei Wei, Feng Xiong, Yong Wang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) has demonstrated considerable potential for enhancing reasoning in large language models (LLMs). However, existing methods suffer from Gradient Starvation and Policy Degradation when training directly on samples with mixed difficulty. To mitigate this, prior approaches leverage Chain-of-Thought (CoT) data, but the construction of high-quality CoT annotations remains labor-intensive. Alternatively, curriculum learning strategies have been explored but frequently encounter challenges, such as difficulty mismatch, reliance on manual curriculum design, and catastrophic forgetting. To address these issues, we propose AdaCuRL, a Adaptive Curriculum Reinforcement Learning framework that integrates coarse-to-fine difficulty estimation with adaptive curriculum scheduling. This approach dynamically aligns data difficulty with model capability and incorporates a data revisitation mechanism to mitigate catastrophic forgetting. Furthermore, AdaCuRL employs adaptive reference and sparse KL strategies to prevent Policy Degradation. Extensive experiments across diverse reasoning benchmarks demonstrate that AdaCuRL consistently achieves significant performance improvements on both LLMs and MLLMs.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Triet M. Le, Arjun Chandra, C. Anton Rytting, Valerie P. Karuzis, Vladimir Rife et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Predicting an individual's personalities from their generated texts is a challenging task, especially when the text volume is large. In this paper, we introduce a straightforward yet effective novel strategy called targeted preselection of texts (TPoT). This method semantically filters the texts as input to a deep learning model, specifically designed to predict a Big Five personality trait, facet, or item, referred to as the BIG5-TPoT model. By selecting texts that are semantically relevant to a particular trait, facet, or item, this strategy not only addresses the issue of input text limits in large language models but also improves the Mean Absolute Error and accuracy metrics in predictions for the Stream of Consciousness Essays dataset.
๐
2025-11-12
๐ AAAI 2026
๐ฅ Wolfgang Otto, Lu Gan, Sharmila Upadhyaya, Saurav Karmakar, Stefan Dietze
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Research in Machine Learning (ML) and AI evolves rapidly. Information Extraction (IE) from scientific publications enables to identify information about research concepts and resources on a large scale and therefore is a pathway to improve understanding and reproducibility of ML-related research. To extract and connect fine-grained information in ML-related research, e.g. method training and data usage, we introduce GSAP-ERE. It is a manually curated fine-grained dataset with 10 entity types and 18 semantically categorized relation types, containing mentions of 63K entities and 35K relations from the full text of 100 ML publications. We show that our dataset enables fine-tuned models to automatically extract information relevant for downstream tasks ranging from knowledge graph (KG) construction, to monitoring the computational reproducibility of AI research at scale. Additionally, we use our dataset as a test suite to explore prompting strategies for IE using Large Language Models (LLM). We observe that the performance of state-of-the-art LLM prompting methods is largely outperformed by our best fine-tuned baseline model (NER: 80.6%, RE: 54.0% for the fine-tuned model vs. NER: 44.4%, RE: 10.1% for the LLM). This disparity of performance between supervised models and unsupervised usage of LLMs suggests datasets like GSAP-ERE are needed to advance research in the domain of scholarly information extraction.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bichen Wang, Yixin Sun, Junzhe Wang, Hao Yang, Xing Fu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model's comprehensive ability to handle diverse and complex clients. To address this gap, we introduce \textbf{CARE-Bench}, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs' failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lukas Arana, Julen Etxaniz, Ander Salaberria, Gorka Azkune
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Current Multimodal Large Language Models exhibit very strong performance for several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expected, a Basque instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results pave the way to develop MLLMs for other low-resource languages by openly releasing our resources.
๐
2025-11-12
๐ AAAI 2026
๐ฅ Ruibo Deng, Duanyu Feng, Wenqiang Lei
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. However, their effectiveness is critically dependent on ranking accuracy, a metric where further gains are highly impactful. This limitation arises from a fundamental problem that we identify and formalize as the Overfitting-Underfitting Dilemma: current margin designs cause models to apply excessive, wasteful gradients to correctly ranked samples (overfitting) while providing insufficient corrective signals for misranked ones (underfitting). To resolve this dilemma, we propose Adaptive Margin-attached Preference Optimization (AMaPO), a simple yet principled algorithm. AMaPO employs an instance-wise adaptive margin, refined by Z-normalization and exponential scaling, which dynamically reallocates learning effort by amplifying gradients for misranked samples and suppressing them for correct ones. Extensive experiments on widely used benchmarks demonstrate that AMaPO not only achieves better ranking accuracy and superior downstream alignment performance, but targeted analysis also confirms that it successfully mitigates the core overfitting and underfitting issues.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hossein A. Rahmani, Satyapriya Krishna, Xi Wang, Mohammadmehdi Naghiaei, Emine Yilmaz
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models have recently demonstrated remarkable abilities to self-correct their responses through iterative refinement, often referred to as self-consistency or self-reflection. However, the dynamics of this self-correction mechanism may differ substantially depending on whether the model is tasked with open-ended text generation or with selecting the most appropriate response from multiple predefined options. In this paper, we conduct a systematic investigation of these two paradigms by comparing performance trends and error-correction behaviors across various natural language understanding and reasoning tasks, covering language models of different scales and families. Our experimental results reveal distinct patterns of improvement and failure modes:
\textit{While open-ended generation often benefits from the flexibility of re-interpretation and compositional refinement, multiple-choice selection can leverage clearer solution boundaries but may be limited by the provided options}. This contrast also reflects the dual demands faced by emerging agentic LLM applications: effective agents must not only generate and refine open-ended plans or explanations, but also make reliable discrete choices when operating within constrained action spaces. Our findings, therefore, highlight that the design of self-correction mechanisms should take into account the interaction between task structure and output space, with implications for both knowledge-intensive reasoning and decision-oriented applications of LLMs.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rhitabrat Pokharel, Ameeta Agrawal
Natural Language Processing
ๆฅ็ๆ่ฆ
The use of large language models (LLMs) for evaluating outputs is becoming an increasingly effective and scalable approach. However, it remains uncertain whether this capability extends beyond task-specific evaluations to more general assessments of text quality, particularly in multilingual contexts. In this study, we introduce, MTQ-Eval, a novel framework for multilingual text quality evaluation that learns from examples of both high- and low-quality texts, adjusting its internal representations. To develop MTQ-Eval, we first automatically generate text quality preference data and then use it to train open-source base LLMs to align with ratings of high- and low-quality text. Our comprehensive evaluation across 115 languages demonstrates the improved performance of the proposed model. Upon further analysis, we find that this enhanced evaluation capability also leads to notable improvements in downstream tasks.
๐
2025-11-12
๐ ArXiv Preprint (cs.SE)
๐ฅ Adam ล torek, Vikas Upadhyay, Marianne Menglin Liu, Daniel W. Peterson, Anshul Mittal et al.
ๆฅ็ๆ่ฆ
LLMs now tackle a wide range of software-related tasks, yet we show that their performance varies markedly both across and within these tasks. Routing user queries to the appropriate LLMs can therefore help improve response quality while reducing cost. Prior work, however, has focused mainly on general-purpose LLM routing via black-box models. We introduce Routesplain, the first LLM router for software-related tasks, including multilingual code generation and repair, input/output prediction, and computer science QA. Unlike existing routing approaches, Routesplain first extracts human-interpretable concepts from each query (e.g., task, domain, reasoning complexity) and only routes based on these concepts, thereby providing intelligible, faithful rationales. We evaluate Routesplain on 16 state-of-the-art LLMs across eight software-related tasks; Routesplain outperforms individual models both in terms of accuracy and cost, and equals or surpasses all black-box baselines, with concept-level intervention highlighting avenues for further router improvements.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Martin Vejvar, Yasutaka Fujimoto
ๆฅ็ๆ่ฆ
We present Spider4SSC dataset and S2CLite parsing tool. S2CLite is a lightweight, ontology-agnostic parser that translates SPARQL queries into Cypher queries, enabling both in-situ and large-scale SPARQL to Cypher translation. Unlike existing solutions, S2CLite is purely rule-based (inspired by traditional programming language compilers) and operates without requiring an RDF graph or external tools. Experiments conducted on the BSBM42 and Spider4SPARQL datasets show that S2CLite significantly reduces query parsing errors, achieving a total parsing accuracy of 77.8% on Spider4SPARQL compared to 44.2% by the state-of-the-art S2CTrans. Furthermore, S2CLite achieved a 96.6\% execution accuracy on the intersecting subset of queries parsed by both parsers, outperforming S2CTrans by 7.3%. We further use S2CLite to parse Spider4SPARQL queries to Cypher and generate Spider4SSC, a unified Text-to-Query language (SQL, SPARQL, Cypher) dataset with 4525 unique questions and 3 equivalent sets of 2581 matching queries (SQL, SPARQL and Cypher). We open-source S2CLite for further development on GitHub (github.com/vejvarm/S2CLite) and provide the clean Spider4SSC dataset for download.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, Wanxiang Che
Natural Language Processing
ๆฅ็ๆ่ฆ
Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. Although recent studies have reduced token consumption through dynamic self-consistency, they remain constrained by the high latency of sequential requests. In this paper, we propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency by integrating System 1 and System 2 reasoning. Specifically, we utilize the rapid System 1 to compute the answer entropy for given queries. This score is then used to evaluate the potential of samples for scaling, enabling dynamic self-consistency under System 2. Benefiting from the advance and accurate estimation provided by System 1, the proposed method can reduce token usage while simultaneously achieving a significant decrease in latency through parallel generation. It outperforms existing methods, achieving up to a 47% reduction in token consumption and a 43% reduction in inference latency without significant performance loss.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Arka Mukherjee, Shreya Ghosh
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85\% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce \textbf{mmJEE-Eval}, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84\% accuracy on held-out 2025 questions, open-source models plateau at 37-45\% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontiers from Google and OpenAI show high problem-solving accuracies (up to 100\% pass@3 scores), they fully collapse when the reasoning load is increased meta-cognitively (GPT-5 fixes just 5.2\% errors). Systematic ablations show mmJEE-Eval's difficulty stems from complexity and reasoning depth rather than memorization. Effectively, our benchmark segregates superior training and reasoning methodologies where alternatives fail. We publicly release our code and data: https://mmjee-eval.github.io
๐
2025-11-12
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Stine Beltoft, Lukas Galke
Natural Language Processing
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) and large language models (LLM) are reshaping science, with most recent advances culminating in fully-automated scientific discovery pipelines. But qualitative research has been left behind. Researchers in qualitative methods are hesitant about AI adoption. Yet when they are willing to use AI at all, they have little choice but to rely on general-purpose tools like ChatGPT to assist with interview interpretation, data annotation, and topic modeling - while simultaneously acknowledging these system's well-known limitations of being biased, opaque, irreproducible, and privacy-compromising. This creates a critical gap: while AI has substantially advanced quantitative methods, the qualitative dimensions essential for meaning-making and comprehensive scientific understanding remain poorly integrated. We argue for developing dedicated qualitative AI systems built from the ground up for interpretive research. Such systems must be transparent, reproducible, and privacy-friendly. We review recent literature to show how existing automated discovery pipelines could be enhanced by robust qualitative capabilities, and identify key opportunities where safe qualitative AI could advance multidisciplinary and mixed-methods research.
๐
2025-11-12
๐ NeurIPS 2025
๐ฅ Zhuoqun Huang, Neil G. Marchant, Olga Ohrimenko, Benjamin I. P. Rubinstein
Natural Language Processing
ๆฅ็ๆ่ฆ
We consider the problem of certified robustness for sequence classification against edit distance perturbations. Naturally occurring inputs of varying lengths (e.g., sentences in natural language processing tasks) present a challenge to current methods that employ fixed-rate deletion mechanisms and lead to suboptimal performance. To this end, we introduce AdaptDel methods with adaptable deletion rates that dynamically adjust based on input properties. We extend the theoretical framework of randomized smoothing to variable-rate deletion, ensuring sound certification with respect to edit distance. We achieve strong empirical results in natural language tasks, observing up to 30 orders of magnitude improvement to median cardinality of the certified region, over state-of-the-art certifications.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing
Natural Language Processing
ๆฅ็ๆ่ฆ
Khmer polarity classification is a fundamental natural language processing task that assigns a positive, negative, or neutral label to a given Khmer text input. Existing Khmer models typically predict the label without explaining the rationale behind the prediction. This paper proposes an explainable Khmer polarity classifier by fine-tuning an instruction-based reasoning Qwen-3 model. The notion of explainability in this paper is limited to self-explanations, which the model uses to rationalize its predictions. Experimental results show that the fine-tuned model not only predicts labels accurately but also provides reasoning by identifying polarity-related keywords or phrases to support its predictions. In addition, we contribute a new Khmer polarity dataset consisting of short- to medium-length casual, romanized, and mixed-code Khmer expressions. This dataset was constructed using both heuristic rules and human curation and is publicly available through a gated Hugging Face repository (rinabuoy/khmerpolarity_nonreasoning). The fine-tuned Qwen-3 models are also made available in the same Hugging Face account.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yi Wang, Yuqian Sun et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
People have different creative writing preferences, and large language models (LLMs) for these tasks can benefit from adapting to each user's preferences. However, these models are often trained over a dataset that considers varying personal tastes as a monolith. To facilitate developing personalized creative writing LLMs, we introduce LiteraryTaste, a dataset of reading preferences from 60 people, where each person: 1) self-reported their reading habits and tastes (stated preference), and 2) annotated their preferences over 100 pairs of short creative writing texts (revealed preference). With our dataset, we found that: 1) people diverge on creative writing preferences, 2) finetuning a transformer encoder could achieve 75.8% and 67.7% accuracy when modeling personal and collective revealed preferences, and 3) stated preferences had limited utility in modeling revealed preferences. With an LLM-driven interpretability pipeline, we analyzed how people's preferences vary. We hope our work serves as a cornerstone for personalizing creative writing technologies.
๐
2025-11-12
๐ AAAI
๐ฅ Yu Li, Zhe Yang, Yi Huang, Xin Liu, Guilin Qi
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent advancements in large language models (LLMs) have demonstrated remarkable text generation capabilities. However, controlling specific attributes of generated text remains challenging without architectural modifications or extensive fine-tuning. Current methods typically toggle a single, basic attribute but struggle with precise multi-attribute control. In scenarios where attribute requirements conflict, existing methods lack coordination mechanisms, causing interference between desired attributes. Furthermore, these methods fail to incorporate iterative optimization processes in the controlled generation pipeline. To address these limitations, we propose Conflict-aware, Composite, and Collaborative Controlled Text Generation (C$^3$TG), a two-phase framework for fine-grained, multi-dimensional text attribute control. During generation, C$^3$TG selectively pairs the LLM with the required attribute classifiers from the 17 available dimensions and employs weighted KL-divergence to adjust token probabilities. The optimization phase then leverages an energy function combining classifier scores and penalty terms to resolve attribute conflicts through iterative feedback, enabling precise control over multiple dimensions simultaneously while preserving natural text flow. Experiments show that C$^3$TG significantly outperforms baselines across multiple metrics including attribute accuracy, linguistic fluency, and output diversity, while simultaneously reducing toxicity. These results establish C$^3$TG as an effective and flexible solution for multi-dimensional text attribute control that requires no costly model modifications.
๐
2025-11-12
๐ AAAI 2026
๐ฅ Jiliang Hu, Zuchao Li, Baoyuan Qi, Liu Guoming, Ping Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle with processing long audio. Follow the success of retrieval augmented generation, a speech-related retriever shows promising in help preprocessing long-form speech. But the performance of existing speech-related retrievers is lacking. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for downstream SQA task. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby more effectively bridging the gap between modalities. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech related retrievers and pipeline approaches combining speech recognition with text retrieval, providing a robust foundation for advancing practical long-form SQA applications.
๐
2025-11-12
๐ ICASSP 2026
๐ฅ Xuanchen Li, Chenrui Cui, Tianrui Wang, Meng Ge, Zikang Huang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Speech Large Language Models (SpeechLLMs) have achieved breakthroughs in multilingual speech-to-text translation (S2TT). However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose \textbf{POTSA} (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport (OT), designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations across languages. Second, we impose token-level OT constraints on a Q-Former using parallel speech pairs to establish fine-grained consistency of representations. Then, we apply a layer scheduling strategy to focus OT constraints on the most semantically beneficial layers. Experiments on the FLEURS dataset show that our method achieves SOTA performance, with +0.93 BLEU on average over five common languages and +5.05 BLEU on zero-shot languages, using only 10 hours of parallel speech per source language.
๐
2025-11-12
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiarui Liu, Weihao Xuan, Zhijing Jin, Mona Diab
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Large Language Models (MLLMs) often suffer from hallucinations, particularly errors in object existence, attributes, or relations, which undermine their reliability. We introduce TACO (Verified Atomic Confidence Estimation), a simple framework that mitigates hallucinations through self-verification and confidence calibration without relying on external vision experts. TACO decomposes responses into atomic queries, paraphrases them to reduce sensitivity to wording, and estimates confidence using self-consistency (black-box) or self-confidence (gray-box) aggregation, before refining answers with a language model. Experiments on five benchmarks (POPE, MME, HallusionBench, AMBER, and MM-Hal Bench) with two MLLMs (\texttt{LLaVA-1.5-7B} and \texttt{CogVLM2}) show that TACO consistently outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration, demonstrating its effectiveness in enhancing the faithfulness of MLLMs.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising framework for aligning language models with complex reasoning objectives. However, most existing methods optimize only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. This challenge is especially pronounced in honesty alignment, where models must not only solve answerable queries but also identify when conclusions cannot be drawn from the given premises. Deductive reasoning provides an ideal testbed because it isolates reasoning capability from reliance on external factual knowledge. To investigate honesty alignment, we curate two multi-step deductive reasoning datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that GRPO, with or without supervised fine tuning initialization, struggles on these tasks. Through extensive experiments across three models, we evaluate stabilization strategies and show that curriculum learning provides some benefit but requires carefully designed in distribution datasets with controllable difficulty. To address these limitations, we propose Anchor, a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling reliable deductive reasoning in aligned language models.
๐
2025-11-12
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Etienne Dallaire
ๆฅ็ๆ่ฆ
This paper addresses the guessing game in building production RAG. Classical rank-centric IR metrics (nDCG/MAP/MRR) are a poor fit for RAG, where LLMs consume a set of passages rather than a browsed list; position discounts and prevalence-blind aggregation miss what matters: whether the prompt at cutoff K contains the decisive evidence. Second, there is no standardized, reproducible way to build and audit golden sets. Third, leaderboards exist but lack end-to-end, on-corpus benchmarking that reflects production trade-offs. Fourth, how state-of-the-art embedding models handle proper-name identity signals and conversational noise remains opaque. To address these, we contribute: (1) RA-nWG@K, a rarity-aware, per-query-normalized set score, and operational ceilings via the pool-restricted oracle ceiling (PROC) and the percentage of PROC (%PROC) to separate retrieval from ordering headroom within a Cost-Latency-Quality (CLQ) lens; (2) rag-gs (MIT), a lean golden-set pipeline with Plackett-Luce listwise refinement whose iterative updates outperform single-shot LLM ranking; (3) a comprehensive benchmark on a production RAG (scientific-papers corpus) spanning dense retrieval, hybrid dense+BM25, embedding models and dimensions, cross-encoder rerankers, ANN (HNSW), and quantization; and (4) targeted diagnostics that quantify proper-name identity signal and conversational-noise sensitivity via identity-destroying and formatting ablations. Together, these components provide practitioner Pareto guidance and auditable guardrails to support reproducible, budget/SLA-aware decisions.
๐
2025-11-12
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Andreas Konstantin Kruff, Christin Katharina Kreutz, Timo Breuer, Philipp Schaer, Krisztian Balog
ๆฅ็ๆ่ฆ
Validating user simulation is a difficult task due to the lack of established measures and benchmarks, which makes it challenging to assess whether a simulator accurately reflects real user behavior. As part of the Sim4IA Micro-Shared Task at the Sim4IA Workshop, SIGIR 2025, we present Sim4IA-Bench, a simulation benchmark suit for the prediction of the next queries and utterances, the first of its kind in the IR community. Our dataset as part of the suite comprises 160 real-world search sessions from the CORE search engine. For 70 of these sessions, up to 62 simulator runs are available, divided into Task A and Task B, in which different approaches predicted users next search queries or utterances. Sim4IA-Bench provides a basis for evaluating and comparing user simulation approaches and for developing new measures of simulator validity. Although modest in size, the suite represents the first publicly available benchmark that links real search sessions with simulated next-query predictions. In addition to serving as a testbed for next query prediction, it also enables exploratory studies on query reformulation behavior, intent drift, and interaction-aware retrieval evaluation. We also introduce a new measure for evaluating next-query predictions in this task. By making the suite publicly available, we aim to promote reproducible research and stimulate further work on realistic and explainable user simulation for information access: https://github.com/irgroup/Sim4IA-Bench.
๐
2025-11-12
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jiyuan Wang, Li Zhang, Haipeng Lin, Qile Liu, Gan Huang et al.
Natural Language ProcessingMultimodalMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in brain-inspired artificial intelligence have sought to align neural signals with visual semantics using multimodal models such as CLIP. However, existing methods often treat CLIP as a static feature extractor, overlooking its adaptability to neural representations and the inherent physiological-symbolic gap in EEG-image alignment. To address these challenges, we present NeuroCLIP, a prompt tuning framework tailored for EEG-to-image contrastive learning. Our approach introduces three core innovations: (1) We design a dual-stream visual embedding pipeline that combines dynamic filtering and token-level fusion to generate instance-level adaptive prompts, which guide the adjustment of patch embedding tokens based on image content, thereby enabling fine-grained modulation of visual representations under neural constraints; (2) We are the first to introduce visual prompt tokens into EEG-image alignment, acting as global, modality-level prompts that work in conjunction with instance-level adjustments. These visual prompt tokens are inserted into the Transformer architecture to facilitate neural-aware adaptation and parameter optimization at a global level; (3) Inspired by neuroscientific principles of human visual encoding, we propose a refined contrastive loss that better model the semantic ambiguity and cross-modal noise present in EEG signals. On the THINGS-EEG2 dataset, NeuroCLIP achieves a Top-1 accuracy of 63.2% in zero-shot image retrieval, surpassing the previous best method by +12.3%, and demonstrates strong generalization under inter-subject conditions (+4.6% Top-1), highlighting the potential of physiology-aware prompt tuning for bridging brain signals and visual semantics.
๐
2025-11-12
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wenda Wei, Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Lixin Su et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning scenarios. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. Most approaches rely on outcome-based supervision, offering no explicit guidance for intermediate steps. This often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
๐
2025-11-12
๐ ICDE 2026
๐ฅ Chenhao Wang, Shanshan Feng, Lisi Chen, Fan Li, Shuo Shang
ๆฅ็ๆ่ฆ
Next point-of-interest (POI) recommendation improves personalized location-based services by predicting users' next destinations based on their historical check-ins. However, most existing methods rely on static datasets and fixed models, limiting their ability to adapt to changes in user behavior over time. To address this limitation, we explore a novel task termed continual next POI recommendation, where models dynamically adapt to evolving user interests through continual updates. This task is particularly challenging, as it requires capturing shifting user behaviors while retaining previously learned knowledge. Moreover, it is essential to ensure efficiency in update time and memory usage for real-world deployment. To this end, we propose GIRAM (Generative Key-based Interest Retrieval and Adaptive Modeling), an efficient, model-agnostic framework that integrates context-aware sustained interests with recent interests. GIRAM comprises four components: (1) an interest memory to preserve historical preferences; (2) a context-aware key encoding module for unified interest key representation; (3) a generative key-based retrieval module to identify diverse and relevant sustained interests; and (4) an adaptive interest update and fusion module to update the interest memory and balance sustained and recent interests. In particular, GIRAM can be seamlessly integrated with existing next POI recommendation models. Experiments on three real-world datasets demonstrate that GIRAM consistently outperforms state-of-the-art methods while maintaining high efficiency in both update time and memory consumption.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Eshika Pathak, Ahmed Aboudonia, Sandeep Banik, Naira Hovakimyan
Robotics
ๆฅ็ๆ่ฆ
Dynamical system (DS)-based learning from demonstration (LfD) is a powerful tool for generating motion plans in the operation (`task') space of robotic systems. However, the realization of the generated motion plans is often compromised by a ''task-execution mismatch'', where unmodeled dynamics, persistent disturbances, and system latency cause the robot's actual task-space state to diverge from the desired motion trajectory. We propose a novel task-level robust control architecture, L1-augmented Dynamical Systems (L1-DS), that explicitly handles the task-execution mismatch in tracking a nominal motion plan generated by any DS-based LfD scheme. Our framework augments any DS-based LfD model with a nominal stabilizing controller and an L1 adaptive controller. Furthermore, we introduce a windowed Dynamic Time Warping (DTW)-based target selector, which enables the nominal stabilizing controller to handle temporal misalignment for improved phase-consistent tracking. We demonstrate the efficacy of our architecture on the LASA and IROS handwriting datasets.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Stelios Zarifis, Ioannis Chalkiadakis, Artemis Chardouveli, Vasiliki Moutzouri, Aggelos Sotirchos et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Inspired by infant development, we propose a Reinforcement Learning (RL) framework for autonomous self-exploration in a robotic agent, Baby Sophia, using the BabyBench simulation environment. The agent learns self-touch and hand regard behaviors through intrinsic rewards that mimic an infant's curiosity-driven exploration of its own body. For self-touch, high-dimensional tactile inputs are transformed into compact, meaningful representations, enabling efficient learning. The agent then discovers new tactile contacts through intrinsic rewards and curriculum learning that encourage broad body coverage, balance, and generalization. For hand regard, visual features of the hands, such as skin-color and shape, are learned through motor babbling. Then, intrinsic rewards encourage the agent to perform novel hand motions, and follow its hands with its gaze. A curriculum learning setup from single-hand to dual-hand training allows the agent to reach complex visual-motor coordination. The results of this work demonstrate that purely curiosity-based signals, with no external supervision, can drive coordinated multimodal learning, imitating an infant's progression from random motor babbling to purposeful behaviors.
๐
2025-11-12
๐ Accepted to IEEE/CVF Winter Conference on Applicat
๐ฅ Yunqian Cheng, Benjamin Princen, Roberto Manduchi
Robotics
ๆฅ็ๆ่ฆ
Indoor localization in GPS-denied environments is crucial for applications like emergency response and assistive navigation. Vision-based methods such as PALMS enable infrastructure-free localization using only a floor plan and a stationary scan, but are limited by the short range of smartphone LiDAR and ambiguity in indoor layouts. We propose PALMS$+$, a modular, image-based system that addresses these challenges by reconstructing scale-aligned 3D point clouds from posed RGB images using a foundation monocular depth estimation model (Depth Pro), followed by geometric layout matching via convolution with the floor plan. PALMS$+$ outputs a posterior over the location and orientation, usable for direct or sequential localization. Evaluated on the Structured3D and a custom campus dataset consisting of 80 observations across four large campus buildings, PALMS$+$ outperforms PALMS and F3Loc in stationary localization accuracy -- without requiring any training. Furthermore, when integrated with a particle filter for sequential localization on 33 real-world trajectories, PALMS$+$ achieved lower localization errors compared to other methods, demonstrating robustness for camera-free tracking and its potential for infrastructure-free applications. Code and data are available at https://github.com/Head-inthe-Cloud/PALMS-Plane-based-Accessible-Indoor-Localization-Using-Mobile-Smartphones
๐
2025-11-12
๐ ICRA
๐ฅ David Minkwan Kim, K. M. Brian Lee, Yong Hyeok Seo, Nikola Raicevic, Runfa Blark Li et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
We present the ongoing development of a robotic system for overhead work such as ceiling drilling. The hardware platform comprises a mobile base with a two-stage lift, on which a bimanual torso is mounted with a custom-designed drilling end effector and RGB-D cameras. To support teleoperation in dynamic environments with limited visibility, we use Gaussian splatting for online 3D reconstruction and introduce motion parameters to model moving objects. For safe operation around dynamic obstacles, we developed a neural configuration-space barrier approach for planning and control. Initial feasibility studies demonstrate the capability of the hardware in drilling, bolting, and anchoring, and the software in safe teleoperation in a dynamic environment.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Ray Muxin Liu, Mingxuan Li, Kenneth Shaw, Deepak Pathak
Robotics
ๆฅ็ๆ่ฆ
Large Vision Models trained on internet-scale data have demonstrated strong capabilities in segmenting and semantically understanding object parts, even in cluttered, crowded scenes. However, while these models can direct a robot toward the general region of an object, they lack the geometric understanding required to precisely control dexterous robotic hands for 3D grasping. To overcome this, our key insight is to leverage simulation with a force-closure grasping generation pipeline that understands local geometries of the hand and object in the scene. Because this pipeline is slow and requires ground-truth observations, the resulting data is distilled into a diffusion model that operates in real-time on camera point clouds. By combining the global semantic understanding of internet-scale models with the geometric precision of a simulation-based locally-aware force-closure, \our achieves high-performance semantic grasping without any manually collected training data. For visualizations of this please visit our website at https://ifgrasping.github.io/
๐
2025-11-12
๐ AAAI 2026
๐ฅ Hao Shi, Bin Xie, Yingfei Liu, Yang Yue, Tiancai Wang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Robotic manipulation requires precise spatial understanding to interact with objects in the real world. Point-based methods suffer from sparse sampling, leading to the loss of fine-grained semantics. Image-based methods typically feed RGB and depth into 2D backbones pre-trained on 3D auxiliary tasks, but their entangled semantics and geometry are sensitive to inherent depth noise in real-world that disrupts semantic understanding. Moreover, these methods focus on high-level geometry while overlooking low-level spatial cues essential for precise interaction. We propose SpatialActor, a disentangled framework for robust robotic manipulation that explicitly decouples semantics and geometry. The Semantic-guided Geometric Module adaptively fuses two complementary geometry from noisy depth and semantic-guided expert priors. Also, a Spatial Transformer leverages low-level spatial cues for accurate 2D-3D mapping and enables interaction among spatial features. We evaluate SpatialActor on multiple simulation and real-world scenarios across 50+ tasks. It achieves state-of-the-art performance with 87.4% on RLBench and improves by 13.9% to 19.4% under varying noisy conditions, showing strong robustness. Moreover, it significantly enhances few-shot generalization to new tasks and maintains robustness under various spatial perturbations. Project Page: https://shihao1895.github.io/SpatialActor
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Runhao Li, Wenkai Guo, Zhenyu Wu, Changyuan Wang, Haoyuan Deng et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Pre-trained Vision-Language-Action (VLA) models have achieved remarkable success in improving robustness and generalization for end-to-end robotic manipulation. However, these models struggle with long-horizon tasks due to their lack of memory and reliance solely on immediate sensory inputs. To address this limitation, we propose Memory-Augmented Prompting for Vision-Language-Action model (MAP-VLA), a novel framework that empowers pre-trained VLA models with demonstration-derived memory prompts to augment action generation for long-horizon robotic manipulation tasks. To achieve this, MAP-VLA first constructs a memory library from historical demonstrations, where each memory unit captures information about a specific stage of a task. These memory units are implemented as learnable soft prompts optimized through prompt tuning. Then, during real-time task execution, MAP-VLA retrieves relevant memory through trajectory similarity matching and dynamically integrates it into the VLA model for augmented action generation. Importantly, this prompt tuning and retrieval augmentation approach operates as a plug-and-play module for a frozen VLA model, offering a lightweight and flexible solution to improve task performance. Experimental results show that MAP-VLA delivers up to 7.0% absolute performance gains in the simulation benchmark and 25.0% on real robot evaluations for long-horizon tasks, surpassing the current state-of-the-art methods.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma et al.
MultimodalRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these through self-improving interactions with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained with web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO that provides stronger performance than the often-used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Chaoyi Pan, Changhao Wang, Haozhi Qi, Zixi Liu, Homanga Bharadhwaj et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Learning dexterous and agile policy for humanoid and dexterous hand control requires large-scale demonstrations, but collecting robot-specific data is prohibitively expensive. In contrast, abundant human motion data is readily available from motion capture, videos, and virtual reality, which could help address the data scarcity problem. However, due to the embodiment gap and missing dynamic information like force and torque, these demonstrations cannot be directly executed on robots. To bridge this gap, we propose Scalable Physics-Informed DExterous Retargeting (SPIDER), a physics-based retargeting framework to transform and augment kinematic-only human demonstrations to dynamically feasible robot trajectories at scale. Our key insight is that human demonstrations should provide global task structure and objective, while large-scale physics-based sampling with curriculum-style virtual contact guidance should refine trajectories to ensure dynamical feasibility and correct contact sequences. SPIDER scales across diverse 9 humanoid/dexterous hand embodiments and 6 datasets, improving success rates by 18% compared to standard sampling, while being 10X faster than reinforcement learning (RL) baselines, and enabling the generation of a 2.4M frames dynamic-feasible robot dataset for policy learning. As a universal physics-based retargeting method, SPIDER can work with diverse quality data and generate diverse and high-quality data to enable efficient policy learning with methods like RL.
๐
2025-11-12
๐ Accepted by IEEE Robotics and Automation Letters
๐ฅ Sizhe Wang, Yifan Yang, Yongkang Luo, Daheng Li, Wei Wei et al.
Robotics
ๆฅ็ๆ่ฆ
Dexterous functional tool-use grasping is essential for effective robotic manipulation of tools. However, existing approaches face significant challenges in efficiently constructing large-scale datasets and ensuring generalizability to everyday object scales. These issues primarily arise from size mismatches between robotic and human hands, and the diversity in real-world object scales. To address these limitations, we propose the ScaleADFG framework, which consists of a fully automated dataset construction pipeline and a lightweight grasp generation network. Our dataset introduce an affordance-based algorithm to synthesize diverse tool-use grasp configurations without expert demonstrations, allowing flexible object-hand size ratios and enabling large robotic hands (compared to human hands) to grasp everyday objects effectively. Additionally, we leverage pre-trained models to generate extensive 3D assets and facilitate efficient retrieval of object affordances. Our dataset comprising five object categories, each containing over 1,000 unique shapes with 15 scale variations. After filtering, the dataset includes over 60,000 grasps for each 2 dexterous robotic hands. On top of this dataset, we train a lightweight, single-stage grasp generation network with a notably simple loss design, eliminating the need for post-refinement. This demonstrates the critical importance of large-scale datasets and multi-scale object variant for effective training. Extensive experiments in simulation and on real robot confirm that the ScaleADFG framework exhibits strong adaptability to objects of varying scales, enhancing functional grasp stability, diversity, and generalizability. Moreover, our network exhibits effective zero-shot transfer to real-world objects. Project page is available at https://sizhe-wang.github.io/ScaleADFG_webpage
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Stepan Dergachev, Artem Pshenitsyn, Aleksandr Panov, Alexey Skrynnik, Konstantin Yakovlev
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Decentralized collision avoidance remains a core challenge for scalable multi-robot systems. One of the promising approaches to tackle this problem is Model Predictive Path Integral (MPPI) -- a framework that is naturally suited to handle any robot motion model and provides strong theoretical guarantees. Still, in practice MPPI-based controller may provide suboptimal trajectories as its performance relies heavily on uninformed random sampling. In this work, we introduce CoRL-MPPI, a novel fusion of Cooperative Reinforcement Learning and MPPI to address this limitation. We train an action policy (approximated as deep neural network) in simulation that learns local cooperative collision avoidance behaviors. This learned policy is then embedded into the MPPI framework to guide its sampling distribution, biasing it towards more intelligent and cooperative actions. Notably, CoRL-MPPI preserves all the theoretical guarantees of regular MPPI. We evaluate our approach in dense, dynamic simulation environments against state-of-the-art baselines, including ORCA, BVC, and a multi-agent MPPI implementation. Our results demonstrate that CoRL-MPPI significantly improves navigation efficiency (measured by success rate and makespan) and safety, enabling agile and robust multi-robot navigation.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Yan Huang, Shoujie Li, Xingting Li, Wenbo Ding
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Data-driven robotic learning faces an obvious dilemma: robust policies demand large-scale, high-quality demonstration data, yet collecting such data remains a major challenge owing to high operational costs, dependence on specialized hardware, and the limited spatial generalization capability of current methods. The Universal Manipulation Interface (UMI) relaxes the strict hardware requirements for data collection, but it is restricted to capturing only RGB images of a scene and omits the 3D geometric information on which many tasks rely. Inspired by DemoGen, we propose UMIGen, a unified framework that consists of two key components: (1) Cloud-UMI, a handheld data collection device that requires no visual SLAM and simultaneously records point cloud observation-action pairs; and (2) a visibility-aware optimization mechanism that extends the DemoGen pipeline to egocentric 3D observations by generating only points within the camera's field of view. These two components enable efficient data generation that aligns with real egocentric observations and can be directly transferred across different robot embodiments without any post-processing. Experiments in both simulated and real-world settings demonstrate that UMIGen supports strong cross-embodiment generalization and accelerates data collection in diverse manipulation tasks.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Yuxi Wei, Zirui Wang, Kangning Yin, Yue Hu, Jingbo Wang et al.
Robotics
ๆฅ็ๆ่ฆ
Data scaling has long remained a critical bottleneck in robot learning. For humanoid robots, human videos and motion data are abundant and widely available, offering a free and large-scale data source. Besides, the semantics related to the motions enable modality alignment and high-level robot control learning. However, how to effectively mine raw video, extract robot-learnable representations, and leverage them for scalable learning remains an open problem. To address this, we introduce Humanoid-Union, a large-scale dataset generated through an autonomous pipeline, comprising over 260 hours of diverse, high-quality humanoid robot motion data with semantic annotations derived from human motion videos. The dataset can be further expanded via the same pipeline. Building on this data resource, we propose SCHUR, a scalable learning framework designed to explore the impact of large-scale data on high-level control in humanoid robots. Experimental results demonstrate that SCHUR achieves high robot motion generation quality and strong text-motion alignment under data and model scaling, with 37\% reconstruction improvement under MPJPE and 25\% alignment improvement under FID comparing with previous methods. Its effectiveness is further validated through deployment in real-world humanoid robot.
๐
2025-11-12
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ethan Griffiths, Maryam Haghighat, Simon Denman, Clinton Fookes, Milad Ramezani
Natural Language Processing
ๆฅ็ๆ่ฆ
This article presents HOTFLoc++, an end-to-end framework for LiDAR place recognition, re-ranking, and 6-DoF metric localisation in forests. Leveraging an octree-based transformer, our approach extracts hierarchical local descriptors at multiple granularities to increase robustness to clutter, self-similarity, and viewpoint changes in challenging scenarios, including ground-to-ground and ground-to-aerial in forest and urban environments. We propose a learnable multi-scale geometric verification module to reduce re-ranking failures in the presence of degraded single-scale correspondences. Our coarse-to-fine registration approach achieves comparable or lower localisation errors to baselines, with runtime improvements of two orders of magnitude over RANSAC for dense point clouds. Experimental results on public datasets show the superiority of our approach compared to state-of-the-art methods, achieving an average Recall@1 of 90.7% on CS-Wild-Places: an improvement of 29.6 percentage points over baselines, while maintaining high performance on single-source benchmarks with an average Recall@1 of 91.7% and 96.0% on Wild-Places and MulRan, respectively. Our method achieves under 2 m and 5 degrees error for 97.2% of 6-DoF registration attempts, with our multi-scale re-ranking module reducing localisation errors by ~2$\times$ on average. The code will be available upon acceptance.
๐
2025-11-12
๐ 8 pages, 5 figures, 6 tables, accepted for the pub
๐ฅ Eungchang Mason Lee, Kevin Christiansen Marsim, Hyun Myung
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
LiDAR-inertial odometry (LIO) has been widely used in robotics due to its high accuracy. However, its performance degrades in degenerate environments, such as long corridors and high-altitude flights, where LiDAR measurements are imbalanced or sparse, leading to ill-posed state estimation. In this letter, we present LODESTAR, a novel LIO method that addresses these degeneracies through two key modules: degeneracy-aware adaptive Schmidt-Kalman filter (DA-ASKF) and degeneracy-aware data exploitation (DA-DE). DA-ASKF employs a sliding window to utilize past states and measurements as additional constraints. Specifically, it introduces degeneracy-aware sliding modes that adaptively classify states as active or fixed based on their degeneracy level. Using Schmidt-Kalman update, it partially optimizes active states while preserving fixed states. These fixed states influence the update of active states via their covariances, serving as reference anchors--akin to a lodestar. Additionally, DA-DE prunes less-informative measurements from active states and selectively exploits measurements from fixed states, based on their localizability contribution and the condition number of the Jacobian matrix. Consequently, DA-ASKF enables degeneracy-aware constrained optimization and mitigates measurement sparsity, while DA-DE addresses measurement imbalance. Experimental results show that LODESTAR outperforms existing LiDAR-based odometry methods and degeneracy-aware modules in terms of accuracy and robustness under various degenerate conditions.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Xuetao Li, Wenke Huang, Nengyuan Pan, Kaiyan Zhao, Songhua Yang et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Humanoid robots exhibit significant potential in executing diverse human-level skills. However, current research predominantly relies on data-driven approaches that necessitate extensive training datasets to achieve robust multimodal decision-making capabilities and generalizable visuomotor control. These methods raise concerns due to the neglect of geometric reasoning in unseen scenarios and the inefficient modeling of robot-target relationships within the training data, resulting in significant waste of training resources. To address these limitations, we present the Recurrent Geometric-prior Multimodal Policy (RGMP), an end-to-end framework that unifies geometric-semantic skill reasoning with data-efficient visuomotor control. For perception capabilities, we propose the Geometric-prior Skill Selector, which infuses geometric inductive biases into a vision language model, producing adaptive skill sequences for unseen scenes with minimal spatial common sense tuning. To achieve data-efficient robotic motion synthesis, we introduce the Adaptive Recursive Gaussian Network, which parameterizes robot-object interactions as a compact hierarchy of Gaussian processes that recursively encode multi-scale spatial relationships, yielding dexterous, data-efficient motion synthesis even from sparse demonstrations. Evaluated on both our humanoid robot and desktop dual-arm robot, the RGMP framework achieves 87% task success in generalization tests and exhibits 5x greater data efficiency than the state-of-the-art model. This performance underscores its superior cross-domain generalization, enabled by geometric-semantic reasoning and recursive-Gaussion adaptation.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Jiahao Xiao, Bowen Yan, Jianbo Zhang, Jia Wang, Chunyi Li et al.
Multimodal
ๆฅ็ๆ่ฆ
In embodied intelligence, datasets play a pivotal role, serving as both a knowledge repository and a conduit for information transfer. The two most critical attributes of a dataset are the amount of information it provides and how easily this information can be learned by models. However, the multimodal nature of embodied data makes evaluating these properties particularly challenging. Prior work has largely focused on diversity, typically counting tasks and scenes or evaluating isolated modalities, which fails to provide a comprehensive picture of dataset diversity. On the other hand, the learnability of datasets has received little attention and is usually assessed post-hoc through model training, an expensive, time-consuming process that also lacks interpretability, offering little guidance on how to improve a dataset. In this work, we address both challenges by introducing two principled, data-driven tools. First, we construct a unified multimodal representation for each data sample and, based on it, propose diversity entropy, a continuous measure that characterizes the amount of information contained in a dataset. Second, we introduce the first interpretable, data-driven algorithm to efficiently quantify dataset learnability without training, enabling researchers to assess a dataset's learnability immediately upon its release. We validate our algorithm on both simulated and real-world embodied datasets, demonstrating that it yields faithful, actionable insights that enable researchers to jointly improve diversity and learnability. We hope this work provides a foundation for designing higher-quality datasets that advance the development of embodied intelligence.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Amirhossein Kazemipour, Robert K. Katzschmann
Robotics
ๆฅ็ๆ่ฆ
Antagonistic soft actuators built from artificial muscles (PAMs, HASELs, DEAs) promise plant-level torque-stiffness decoupling, yet existing controllers for soft muscles struggle to maintain independent control through dynamic contact transients. We present a unified framework enabling independent torque and stiffness commands in real-time for diverse soft actuator types. Our unified force law captures diverse soft muscle physics in a single model with sub-ms computation, while our cascaded controller with analytical inverse dynamics maintains decoupling despite model errors and disturbances. Using co-contraction/bias coordinates, the controller independently modulates torque via bias and stiffness via co-contraction-replicating biological impedance strategies. Simulation-based validation through contact experiments demonstrates maintained independence: 200x faster settling on soft surfaces, 81% force reduction on rigid surfaces, and stable interaction vs 22-54% stability for fixed policies. This framework provides a foundation for enabling musculoskeletal antagonistic systems to execute adaptive impedance control for safe human-robot interaction.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Shivam Sood, Laukik Nakhwa, Sun Ge, Yuhong Cao, Jin Cheng et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Learning natural, animal-like locomotion from demonstrations has become a core paradigm in legged robotics. Despite the recent advancements in motion tracking, most existing methods demand extensive tuning and rely on reference data during deployment, limiting adaptability. We present APEX (Action Priors enable Efficient Exploration), a plug-and-play extension to state-of-the-art motion tracking algorithms that eliminates any dependence on reference data during deployment, improves sample efficiency, and reduces parameter tuning effort. APEX integrates expert demonstrations directly into reinforcement learning (RL) by incorporating decaying action priors, which initially bias exploration toward expert demonstrations but gradually allow the policy to explore independently. This is combined with a multi-critic framework that balances task performance with motion style. Moreover, APEX enables a single policy to learn diverse motions and transfer reference-like styles across different terrains and velocities, while remaining robust to variations in reward design. We validate the effectiveness of our method through extensive experiments in both simulation and on a Unitree Go2 robot. By leveraging demonstrations to guide exploration during RL training, without imposing explicit bias toward them, APEX enables legged robots to learn with greater stability, efficiency, and generalization. We believe this approach paves the way for guidance-driven RL to boost natural skill acquisition in a wide array of robotic tasks, from locomotion to manipulation. Website and code: https://marmotlab.github.io/APEX/.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Shunsuke Ito, Chaoran Zhao, Ryo Okamura, Takuya Azumi
Robotics
ๆฅ็ๆ่ฆ
Autonomous driving systems have achieved significant advances, and full autonomy within defined operational design domains near practical deployment. Expanding these domains requires addressing safety assurance under diverse conditions. Information sharing through vehicle-to-vehicle and vehicle-to-infrastructure communication, enabled by a Dynamic Map platform built from vehicle and roadside sensor data, offers a promising solution. Real-world experiments with numerous infrastructure sensors incur high costs and regulatory challenges. Conventional single-host simulators lack the capacity for large-scale urban traffic scenarios. This paper proposes D-AWSIM, a distributed simulator that partitions its workload across multiple machines to support the simulation of extensive sensor deployment and dense traffic environments. A Dynamic Map generation framework on D-AWSIM enables researchers to explore information-sharing strategies without relying on physical testbeds. The evaluation shows that D-AWSIM increases throughput for vehicle count and LiDAR sensor processing substantially compared to a single-machine setup. Integration with Autoware demonstrates applicability for autonomous driving research.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Sangheon Yang, Yeongin Yoon, Hong Mo Jung, Jongwoo Lim
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Traditional Visual Odometry (VO) and Visual Inertial Odometry (VIO) methods rely on a 'pose-centric' paradigm, which computes absolute camera poses from the local map thus requires large-scale landmark maintenance and continuous map optimization. This approach is computationally expensive, limiting their real-time performance on resource-constrained devices. To overcome these limitations, we introduce Sparse Motion Field Visual Odometry (SMF-VO), a lightweight, 'motion-centric' framework. Our approach directly estimates instantaneous linear and angular velocity from sparse optical flow, bypassing the need for explicit pose estimation or expensive landmark tracking. We also employed a generalized 3D ray-based motion field formulation that works accurately with various camera models, including wide-field-of-view lenses. SMF-VO demonstrates superior efficiency and competitive accuracy on benchmark datasets, achieving over 100 FPS on a Raspberry Pi 5 using only a CPU. Our work establishes a scalable and efficient alternative to conventional methods, making it highly suitable for mobile robotics and wearable devices.
๐
2025-11-12
๐ The paper has been accepted by the 40th IEEE/ACM I
๐ฅ Dingji Wang, You Lu, Bihuan Chen, Shuo Hao, Haowen Jiang et al.
Robotics
ๆฅ็ๆ่ฆ
End-to-end autonomous driving systems (ADSs), with their strong capabilities in environmental perception and generalizable driving decisions, are attracting growing attention from both academia and industry. However, once deployed on public roads, ADSs are inevitably exposed to diverse driving hazards that may compromise safety and degrade system performance. This raises a strong demand for resilience of ADSs, particularly the capability to continuously monitor driving hazards and adaptively respond to potential safety violations, which is crucial for maintaining robust driving behaviors in complex driving scenarios.
To bridge this gap, we propose a runtime resilience-oriented framework, Argus, to mitigate the driving hazards, thus preventing potential safety violations and improving the driving performance of an ADS. Argus continuously monitors the trajectories generated by the ADS for potential hazards and, whenever the EGO vehicle is deemed unsafe, seamlessly takes control through a hazard mitigator. We integrate Argus with three state-of-the-art end-to-end ADSs, i.e., TCP, UniAD and VAD. Our evaluation has demonstrated that Argus effectively and efficiently enhances the resilience of ADSs, improving the driving score of the ADS by up to 150.30% on average, and preventing up to 64.38% of the violations, with little additional time overhead.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Mingyang Yu, Haorui Yang, Kangning An, Xinjian Wei, Xiaoxuan Xu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
With the widespread adoption of unmanned aerial vehicles (UAV), effective path planning has become increasingly important. Although traditional search methods have been extensively applied, metaheuristic algorithms have gained popularity due to their efficiency and problem-specific heuristics. However, challenges such as premature convergence and lack of solution diversity still hinder their performance in complex scenarios. To address these issues, this paper proposes an Enhanced Multi-Strategy Dwarf Mongoose Optimization (EDMO) algorithm, tailored for three-dimensional UAV trajectory planning in dynamic and obstacle-rich environments. EDMO integrates three novel strategies: (1) a Dynamic Quantum Tunneling Optimization Strategy (DQTOS) to enable particles to probabilistically escape local optima; (2) a Bio-phototactic Dynamic Focusing Search Strategy (BDFSS) inspired by microbial phototaxis for adaptive local refinement; and (3) an Orthogonal Lens Opposition-Based Learning (OLOBL) strategy to enhance global exploration through structured dimensional recombination. EDMO is benchmarked on 39 standard test functions from CEC2017 and CEC2020, outperforming 14 advanced algorithms in convergence speed, robustness, and optimization accuracy. Furthermore, real-world validations on UAV three-dimensional path planning and three engineering design tasks confirm its practical applicability and effectiveness in field robotics missions requiring intelligent, adaptive, and time-efficient planning.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Ziyi Song, Chen Xia, Chenbing Wang, Haibao Yu, Sheng Zhou et al.
Robotics
ๆฅ็ๆ่ฆ
Autonomous driving holds transformative potential but remains fundamentally constrained by the limited perception and isolated decision-making with standalone intelligence. While recent multi-agent approaches introduce cooperation, they often focus merely on perception-level tasks, overlooking the alignment with downstream planning and control, or fall short in leveraging the full capacity of the recent emerging end-to-end autonomous driving. In this paper, we present UniMM-V2X, a novel end-to-end multi-agent framework that enables hierarchical cooperation across perception, prediction, and planning. At the core of our framework is a multi-level fusion strategy that unifies perception and prediction cooperation, allowing agents to share queries and reason cooperatively for consistent and safe decision-making. To adapt to diverse downstream tasks and further enhance the quality of multi-level fusion, we incorporate a Mixture-of-Experts (MoE) architecture to dynamically enhance the BEV representations. We further extend MoE into the decoder to better capture diverse motion patterns. Extensive experiments on the DAIR-V2X dataset demonstrate our approach achieves state-of-the-art (SOTA) performance with a 39.7% improvement in perception accuracy, a 7.2% reduction in prediction error, and a 33.2% improvement in planning performance compared with UniV2X, showcasing the strength of our MoE-enhanced multi-level cooperative paradigm.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Mobin Habibpour, Fatemeh Afghah
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
While Vision-Language Models (VLMs) are set to transform robotic navigation, existing methods often underutilize their reasoning capabilities. To unlock the full potential of VLMs in robotics, we shift their role from passive observers to active strategists in the navigation process. Our framework outsources high-level planning to a VLM, which leverages its contextual understanding to guide a frontier-based exploration agent. This intelligent guidance is achieved through a trio of techniques: structured chain-of-thought prompting that elicits logical, step-by-step reasoning; dynamic inclusion of the agent's recent action history to prevent getting stuck in loops; and a novel capability that enables the VLM to interpret top-down obstacle maps alongside first-person views, thereby enhancing spatial awareness. When tested on challenging benchmarks like HM3D, Gibson, and MP3D, this method produces exceptionally direct and logical trajectories, marking a substantial improvement in navigation efficiency over existing approaches and charting a path toward more capable embodied agents.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Ningnan Wang, Weihuang Chen, Liming Chen, Haoxuan Ji, Zhongyu Guo et al.
Natural Language ProcessingRoboticsMultimodal
ๆฅ็ๆ่ฆ
Embodied visual navigation remains a challenging task, as agents must explore unknown environments with limited knowledge. Existing zero-shot studies have shown that incorporating memory mechanisms to support goal-directed behavior can improve long-horizon planning performance. However, they overlook visual frontier boundaries, which fundamentally dictate future trajectories and observations, and fall short of inferring the relationship between partial visual observations and navigation goals. In this paper, we propose Semantic Cognition Over Potential-based Exploration (SCOPE), a zero-shot framework that explicitly leverages frontier information to drive potential-based exploration, enabling more informed and goal-relevant decisions. SCOPE estimates exploration potential with a Vision-Language Model and organizes it into a spatio-temporal potential graph, capturing boundary dynamics to support long-horizon planning. In addition, SCOPE incorporates a self-reconsideration mechanism that revisits and refines prior decisions, enhancing reliability and reducing overconfident errors. Experimental results on two diverse embodied navigation tasks show that SCOPE outperforms state-of-the-art baselines by 4.6\% in accuracy. Further analysis demonstrates that its core components lead to improved calibration, stronger generalization, and higher decision quality.
๐
2025-11-12
๐ IROS 2025
๐ฅ Yunchang Ma, Tenglong Liu, Yixing Lan, Xin Yin, Changxin Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
In offline reinforcement learning, value overestimation caused by out-of-distribution (OOD) actions significantly limits policy performance. Recently, diffusion models have been leveraged for their strong distribution-matching capabilities, enforcing conservatism through behavior policy constraints. However, existing methods often apply indiscriminate regularization to redundant actions in low-quality datasets, resulting in excessive conservatism and an imbalance between the expressiveness and efficiency of diffusion modeling. To address these issues, we propose DIffusion policies with Value-conditional Optimization (DIVO), a novel approach that leverages diffusion models to generate high-quality, broadly covered in-distribution state-action samples while facilitating efficient policy improvement. Specifically, DIVO introduces a binary-weighted mechanism that utilizes the advantage values of actions in the offline dataset to guide diffusion model training. This enables a more precise alignment with the dataset's distribution while selectively expanding the boundaries of high-advantage actions. During policy improvement, DIVO dynamically filters high-return-potential actions from the diffusion model, effectively guiding the learned policy toward better performance. This approach achieves a critical balance between conservatism and explorability in offline RL. We evaluate DIVO on the D4RL benchmark and compare it against state-of-the-art baselines. Empirical results demonstrate that DIVO achieves superior performance, delivering significant improvements in average returns across locomotion tasks and outperforming existing methods in the challenging AntMaze domain, where sparse rewards pose a major difficulty.
๐
2025-11-12
๐ ArXiv Preprint (Robotics)
๐ฅ Jinyu Zhang, Lijun Han, Feng Jian, Lingxi Zhang, Hesheng Wang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
In mobile robot shared control, effectively understanding human motion intention is critical for seamless human-robot collaboration. This paper presents a novel shared control framework featuring planning-level intention prediction. A path replanning algorithm is designed to adjust the robot's desired trajectory according to inferred human intentions. To represent future motion intentions, we introduce the concept of an intention domain, which serves as a constraint for path replanning. The intention-domain prediction and path replanning problems are jointly formulated as a Markov Decision Process and solved through deep reinforcement learning. In addition, a Voronoi-based human trajectory generation algorithm is developed, allowing the model to be trained entirely in simulation without human participation or demonstration data. Extensive simulations and real-world user studies demonstrate that the proposed method significantly reduces operator workload and enhances safety, without compromising task efficiency compared with existing assistive teleoperation approaches.
๐
2025-11-11
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xinpeng Zhao, Zhaochun Ren, Yukun Zhao, Zhenyang Li, Mengqi Zhang et al.
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Generative retrieval (GR) re-frames document retrieval as a sequence-based document identifier (DocID) generation task, memorizing documents with model parameters and enabling end-to-end retrieval without explicit indexing. Existing GR methods are based on auto-regressive generative models, i.e., the token generation is performed from left to right. However, such auto-regressive methods suffer from: (1) mismatch between DocID generation and natural language generation, e.g., an incorrect DocID token generated in early left steps would lead to totally erroneous retrieval; and (2) failure to balance the trade-off between retrieval efficiency and accuracy dynamically, which is crucial for practical applications. To address these limitations, we propose generative document retrieval with diffusion language models, dubbed DiffuGR. It models DocID generation as a discrete diffusion process: during training, DocIDs are corrupted through a stochastic masking process, and a diffusion language model is learned to recover them under a retrieval-aware objective. For inference, DiffuGR attempts to generate DocID tokens in parallel and refines them through a controllable number of denoising steps. In contrast to conventional left-to-right auto-regressive decoding, DiffuGR provides a novel mechanism to first generate more confident DocID tokens and refine the generation through diffusion-based denoising. Moreover, DiffuGR also offers explicit runtime control over the qualitylatency tradeoff. Extensive experiments on benchmark retrieval datasets show that DiffuGR is competitive with strong auto-regressive generative retrievers, while offering flexible speed and accuracy tradeoffs through variable denoising budgets. Overall, our results indicate that non-autoregressive diffusion models are a practical and effective alternative for generative document retrieval.
๐
2025-11-11
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xinpeng Zhao, Yukun Zhao, Zhenyang Li, Mengqi Zhang, Jun Feng et al.
RoboticsNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Generative retrieval (GR) re-frames document retrieval as a sequence-based document identifier (DocID) generation task, memorizing documents with model parameters and enabling end-to-end retrieval without explicit indexing. Existing GR methods are based on auto-regressive generative models, i.e., the token generation is performed from left to right. However, such auto-regressive methods suffer from: (1) mismatch between DocID generation and natural language generation, e.g., an incorrect DocID token generated in early left steps would lead to totally erroneous retrieval; and (2) failure to balance the trade-off between retrieval efficiency and accuracy dynamically, which is crucial for practical applications. To address these limitations, we propose generative document retrieval with diffusion language models, dubbed DiffuGR. It models DocID generation as a discrete diffusion process: during training, DocIDs are corrupted through a stochastic masking process, and a diffusion language model is learned to recover them under a retrieval-aware objective. For inference, DiffuGR attempts to generate DocID tokens in parallel and refines them through a controllable number of denoising steps. In contrast to conventional left-to-right auto-regressive decoding, DiffuGR provides a novel mechanism to first generate more confident DocID tokens and refine the generation through diffusion-based denoising. Moreover, DiffuGR also offers explicit runtime control over the qualitylatency tradeoff. Extensive experiments on benchmark retrieval datasets show that DiffuGR is competitive with strong auto-regressive generative retrievers, while offering flexible speed and accuracy tradeoffs through variable denoising budgets. Overall, our results indicate that non-autoregressive diffusion models are a practical and effective alternative for generative document retrieval.
๐
2025-11-11
๐ AAAI 2026
๐ฅ Peiyu Hu, Wayne Lu, Jia Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Cross-domain recommendation (CDR) is crucial for improving recommendation accuracy and generalization, yet traditional methods are often hindered by the reliance on shared user/item IDs, which are unavailable in most real-world scenarios. Consequently, many efforts have focused on learning disentangled representations through multi-domain joint training to bridge the domain gaps. Recent Large Language Model (LLM)-based approaches show promise, they still face critical challenges, including: (1) the \textbf{item ID tokenization dilemma}, which leads to vocabulary explosion and fails to capture high-order collaborative knowledge; and (2) \textbf{insufficient domain-specific modeling} for the complex evolution of user interests and item semantics. To address these limitations, we propose \textbf{GenCDR}, a novel \textbf{Gen}erative \textbf{C}ross-\textbf{D}omain \textbf{R}ecommendation framework. GenCDR first employs a \textbf{Domain-adaptive Tokenization} module, which generates disentangled semantic IDs for items by dynamically routing between a universal encoder and domain-specific adapters. Symmetrically, a \textbf{Cross-domain Autoregressive Recommendation} module models user preferences by fusing universal and domain-specific interests. Finally, a \textbf{Domain-aware Prefix-tree} enables efficient and accurate generation. Extensive experiments on multiple real-world datasets demonstrate that GenCDR significantly outperforms state-of-the-art baselines. Our code is available in the supplementary materials.
๐
2025-11-11
๐ ArXiv Preprint (Robotics)
๐ฅ Kai S. Yun, Navid Azizan
Robotics
ๆฅ็ๆ่ฆ
Ensuring the safety of real-world systems is challenging, especially when they rely on learned perception modules to infer the system state from high-dimensional sensor data. These perception modules are vulnerable to epistemic uncertainty, often failing when encountering out-of-distribution (OoD) measurements not seen during training. To address this gap, we introduce ATOM-CBF (Adaptive-To-OoD-Measurement Control Barrier Function), a novel safe control framework that explicitly computes and adapts to the epistemic uncertainty from OoD measurements, without the need for ground-truth labels or information on distribution shifts. Our approach features two key components: (1) an OoD-aware adaptive perception error margin and (2) a safety filter that integrates this adaptive error margin, enabling the filter to adjust its conservatism in real-time. We provide empirical validation in simulations, demonstrating that ATOM-CBF maintains safety for an F1Tenth vehicle with LiDAR scans and a quadruped robot with RGB images.
๐
2025-11-10
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Kamand Kalashi, Babak Teimourpour
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The rapid expansion of online fashion platforms has created an increasing demand for intelligent recommender systems capable of understanding both visual and textual cues. This paper proposes a hybrid multimodal deep learning framework for fashion recommendation that jointly addresses two key tasks: outfit compatibility prediction and complementary item retrieval. The model leverages the visual and textual encoders of the CLIP architecture to obtain joint latent representations of fashion items, which are then integrated into a unified feature vector and processed by a transformer encoder. For compatibility prediction, an "outfit token" is introduced to model the holistic relationships among items, achieving an AUC of 0.95 on the Polyvore dataset. For complementary item retrieval, a "target item token" representing the desired item description is used to retrieve compatible items, reaching an accuracy of 69.24% under the Fill-in-the-Blank (FITB) metric. The proposed approach demonstrates strong performance across both tasks, highlighting the effectiveness of multimodal learning for fashion recommendation.
๐
2025-11-07
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Harshit Nainwani, Hediyeh Baban
ๆฅ็ๆ่ฆ
Retrieval systems are essential to contemporary AI pipelines, although most confuse two separate processes: finding relevant information and giving enough context for reasoning. We introduce the Search-Is-Not-Retrieve (SINR) framework, a dual-layer architecture that distinguishes between fine-grained search representations and coarse-grained retrieval contexts. SINR enhances the composability, scalability, and context fidelity of retrieval systems by directly connecting small, semantically accurate search chunks to larger, contextually complete retrieve chunks, all without incurring extra processing costs. This design changes retrieval from a passive step to an active one, making the system architecture more like how people process information. We discuss the SINR framework's conceptual foundation, formal structure, implementation issues, and qualitative outcomes. This provides a practical foundation for the next generation of AI systems that use retrieval.
๐
2025-11-06
๐ AAAI 2026
๐ฅ Shaolong Wu, James Blume, Geshi Yeung
Machine Learning
ๆฅ็ๆ่ฆ
Algorithmic fairness has grown rapidly as a research area, yet key concepts
remain unsettled, especially in criminal justice. We review group, individual,
and process fairness and map the conditions under which they conflict. We then
develop a simple modification to standard group fairness. Rather than exact
parity across protected groups, we minimize a weighted error loss while keeping
differences in false negative rates within a small tolerance. This makes
solutions easier to find, can raise predictive accuracy, and surfaces the
ethical choice of error costs. We situate this proposal within three classes of
critique: biased and incomplete data, latent affirmative action, and the
explosion of subgroup constraints. Finally, we offer a practical framework for
deployment in public decision systems built on three pillars: need-based
decisions, Transparency and accountability, and narrowly tailored definitions
and solutions. Together, these elements link technical design to legitimacy and
provide actionable guidance for agencies that use risk assessment and related
tools.
๐
2025-11-06
๐ ArXiv Preprint (physics.soc-ph)
๐ฅ Hayafumi Watanabe
ๆฅ็ๆ่ฆ
The diffusion of ideas and language in society has conventionally been
described by S-shaped models, such as the logistic curve. However, the role of
sub-exponential growth -a slower than exponential pattern known in
epidemiology- has been largely overlooked in broader social phenomena. Here, we
present a piecewise power-law model to characterize complex growth curves with
a few parameters. We systematically analyzed a large-scale dataset of
approximately one billion Japanese blog articles linked to Wikipedia
vocabulary, and observed consistent patterns in web search trend data (English,
Spanish, and Japanese). Our analysis of the 2,965 selected items reveals that
about 55% (1,625 items) were found to have no abrupt jumps and were well
captured by one or two segments. For single-segment curves, we found that (i)
the mode of the shape parameter alpha was near 0.5, indicating prevalent
sub-exponential growth; (ii) the ultimate diffusion scale is primarily
determined by the growth rate R, with minor contributions from alpha or the
duration T; and (iii) alpha showed a tendency to vary with the nature of the
topic, being smaller for niche/local topics and larger for widely shared ones.
Furthermore, a micro-behavioral model distinguishing outward contact with
strangers from inward interaction within their community suggests that alpha
can be interpreted as an index of the preference for outward-oriented
communication. These findings suggest that sub-exponential growth is a common
pattern of social diffusion, and our model provides a practical framework for
consistently describing, comparing, and interpreting complex and diverse growth
curves.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Nick Oh, Fernand Gobet
Natural Language Processing
ๆฅ็ๆ่ฆ
Test-time reasoning architectures such as those following the Generate-Verify
paradigm -- where a model iteratively refines or verifies its own generated
outputs -- prioritise generation and verification but exclude the monitoring
processes that determine when and how reasoning should begin. This omission may
contribute to the prefix dominance trap, in which models commit early to
suboptimal reasoning paths and seldom recover, yielding roughly 20% accuracy
loss. We address this architectural gap by formalising Flavell's and Nelson and
Narens' metacognitive theories into computational specifications, proposing the
Monitor-Generate-Verify (MGV) framework. MGV extends the Generate-Verify
paradigm by adding explicit monitoring that captures metacognitive experiences
(from difficulty assessments to confidence judgements) before generation begins
and refines future monitoring through verification feedback. Though we present
no empirical validation, this work provides the first systematic computational
translation of foundational metacognitive theories, offering a principled
vocabulary for understanding reasoning system failures and suggesting specific
architectural interventions for future test-time reasoning designs.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Nemotron Nano V2 VL, the latest model of the Nemotron
vision-language series designed for strong real-world document understanding,
long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers
significant improvements over our previous model,
Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major
enhancements in model architecture, datasets, and training recipes. Nemotron
Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and
innovative token reduction techniques to achieve higher inference throughput in
long document and video scenarios. We are releasing model checkpoints in BF16,
FP8, and FP4 formats and sharing large parts of our datasets, recipes and
training code.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva
ๆฅ็ๆ่ฆ
Video generation has achieved remarkable progress, with generated videos
increasingly resembling real ones. However, the rapid advance in generation has
outpaced the development of adequate evaluation metrics. Currently, the
assessment of talking head generation primarily relies on limited metrics,
evaluating general video quality, lip synchronization, and on conducting user
studies. Motivated by this, we propose a new evaluation framework comprising 8
metrics related to three dimensions (i) quality, (ii) naturalness, and (iii)
synchronization. In selecting the metrics, we place emphasis on efficiency, as
well as alignment with human preferences. Based on this considerations, we
streamline to analyze fine-grained dynamics of head, mouth, and eyebrows, as
well as face quality. Our extensive experiments on 85,000 videos generated by
17 state-of-the-art models suggest that while many algorithms excel in lip
synchronization, they face challenges with generating expressiveness and
artifact-free details. These videos were generated based on a novel real
dataset, that we have curated, in order to mitigate bias of training data. Our
proposed benchmark framework is aimed at evaluating the improvement of
generative methods. Original code, dataset and leaderboards will be publicly
released and regularly updated with new methods, in order to reflect progress
in the field.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yihao Luo, Xianglong He, Chuanyu Pan, Yiwen Chen, Jiaqi Wu et al.
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Accurate and efficient voxelized representations of 3D meshes are the
foundation of 3D reconstruction and generation. However, existing
representations based on iso-surface heavily rely on water-tightening or
rendering optimization, which inevitably compromise geometric fidelity. We
propose Faithful Contouring, a sparse voxelized representation that supports
2048+ resolutions for arbitrary meshes, requiring neither converting meshes to
field functions nor extracting the isosurface during remeshing. It achieves
near-lossless fidelity by preserving sharpness and internal structures, even
for challenging cases with complex geometry and topology. The proposed method
also shows flexibility for texturing, manipulation, and editing. Beyond
representation, we design a dual-mode autoencoder for Faithful Contouring,
enabling scalable and detail-preserving shape reconstruction. Extensive
experiments show that Faithful Contouring surpasses existing methods in
accuracy and efficiency for both representation and reconstruction. For direct
representation, it achieves distance errors at the $10^{-5}$ level; for mesh
reconstruction, it yields a 93\% reduction in Chamfer Distance and a 35\%
improvement in F-score over strong baselines, confirming superior fidelity as a
representation for 3D learning tasks.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Nemotron Nano V2 VL, the latest model of the Nemotron
vision-language series designed for strong real-world document understanding,
long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers
significant improvements over our previous model,
Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major
enhancements in model architecture, datasets, and training recipes. Nemotron
Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and
innovative token reduction techniques to achieve higher inference throughput in
long document and video scenarios. We are releasing model checkpoints in BF16,
FP8, and FP4 formats and sharing large parts of our datasets, recipes and
training code.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat, Natapong Nitarach, Chanakan Wittayasakpan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We present ThaiOCRBench, the first comprehensive benchmark for evaluating
vision-language models (VLMs) on Thai text-rich visual understanding tasks.
Despite recent progress in multimodal modeling, existing benchmarks
predominantly focus on high-resource languages, leaving Thai underrepresented,
especially in tasks requiring document structure understanding. ThaiOCRBench
addresses this gap by offering a diverse, human-annotated dataset comprising
2,808 samples across 13 task categories. We evaluate a wide range of
state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and
open-source systems. Results show a significant performance gap, with
proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source
counterparts. Notably, fine-grained text recognition and handwritten content
extraction exhibit the steepest performance drops among open-source models.
Through detailed error analysis, we identify key challenges such as language
bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a
standardized framework for assessing VLMs in low-resource, script-complex
settings, and provides actionable insights for improving Thai-language document
understanding.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Zhuo Zhang, Xiong Xiong, Sen Zhang, Yuan Zhao, Xi Yang
Machine Learning
ๆฅ็ๆ่ฆ
PDEs arise ubiquitously in science and engineering, where solutions depend on
parameters (physical properties, boundary conditions, geometry). Traditional
numerical methods require re-solving the PDE for each parameter, making
parameter space exploration prohibitively expensive. Recent machine learning
advances, particularly physics-informed neural networks (PINNs) and neural
operators, have revolutionized parametric PDE solving by learning solution
operators that generalize across parameter spaces. We critically analyze two
main paradigms: (1) PINNs, which embed physical laws as soft constraints and
excel at inverse problems with sparse data, and (2) neural operators (e.g.,
DeepONet, Fourier Neural Operator), which learn mappings between
infinite-dimensional function spaces and achieve unprecedented generalization.
Through comparisons across fluid dynamics, solid mechanics, heat transfer, and
electromagnetics, we show neural operators can achieve computational speedups
of $10^3$ to $10^5$ times faster than traditional solvers for multi-query
scenarios, while maintaining comparable accuracy. We provide practical guidance
for method selection, discuss theoretical foundations (universal approximation,
convergence), and identify critical open challenges: high-dimensional
parameters, complex geometries, and out-of-distribution generalization. This
work establishes a unified framework for understanding parametric PDE solvers
via operator learning, offering a comprehensive, incrementally updated resource
for this rapidly evolving field
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Laurits Fredsgaard, Mikkel N. Schmidt
Machine Learning
ๆฅ็ๆ่ฆ
Deep ensembles are a powerful tool in machine learning, improving both model
performance and uncertainty calibration. While ensembles are typically formed
by training and tuning models individually, evidence suggests that jointly
tuning the ensemble can lead to better performance. This paper investigates the
impact of jointly tuning weight decay, temperature scaling, and early stopping
on both predictive performance and uncertainty quantification. Additionally, we
propose a partially overlapping holdout strategy as a practical compromise
between enabling joint evaluation and maximizing the use of data for training.
Our results demonstrate that jointly tuning the ensemble generally matches or
improves performance, with significant variation in effect size across
different tasks and metrics. We highlight the trade-offs between individual and
joint optimization in deep ensemble training, with the overlapping holdout
strategy offering an attractive practical solution. We believe our findings
provide valuable insights and guidance for practitioners looking to optimize
deep ensemble models. Code is available at:
https://github.com/lauritsf/ensemble-optimality-gap
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Nemotron Nano V2 VL, the latest model of the Nemotron
vision-language series designed for strong real-world document understanding,
long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers
significant improvements over our previous model,
Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major
enhancements in model architecture, datasets, and training recipes. Nemotron
Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and
innovative token reduction techniques to achieve higher inference throughput in
long document and video scenarios. We are releasing model checkpoints in BF16,
FP8, and FP4 formats and sharing large parts of our datasets, recipes and
training code.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du et al.
Robotics
ๆฅ็ๆ่ฆ
Human videos can be recorded quickly and at scale, making them an appealing
source of training data for robot learning. However, humans and robots differ
fundamentally in embodiment, resulting in mismatched action execution. Direct
kinematic retargeting of human hand motion can therefore produce actions that
are physically infeasible for robots. Despite these low-level differences,
human demonstrations provide valuable motion cues about how to manipulate and
interact with objects. Our key idea is to exploit the forward diffusion
process: as noise is added to actions, low-level execution differences fade
while high-level task guidance is preserved. We present X-Diffusion, a
principled framework for training diffusion policies that maximally leverages
human data without learning dynamically infeasible motions. X-Diffusion first
trains a classifier to predict whether a noisy action is executed by a human or
robot. Then, a human action is incorporated into policy training only after
adding sufficient noise such that the classifier cannot discern its embodiment.
Actions consistent with robot execution supervise fine-grained denoising at low
noise levels, while mismatched human actions provide only coarse guidance at
higher noise levels. Our experiments show that naive co-training under
execution mismatches degrades policy performance, while X-Diffusion
consistently improves it. Across five manipulation tasks, X-Diffusion achieves
a 16% higher average success rate than the best baseline. The project website
is available at https://portal-cornell.github.io/X-Diffusion/.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel et al.
Machine Learning
ๆฅ็ๆ่ฆ
LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but
they cannot reliably verify their own logic. Even when they reach correct
answers, the underlying reasoning may be flawed, undermining trust in
high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a
neuro-symbolic method that extracts and verifies formal logical arguments from
CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order
logic and identifies premises that ground the argument in source context,
commonsense knowledge, or prior reasoning steps. The symbolic representation
enables automated solvers to verify logical validity while the NL premises
allow humans and systems to identify ungrounded or fallacious reasoning steps.
Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT
effectively identifies flawed reasoning, and serves as a strong predictor of
final answer correctness. We also leverage VeriCoT's verification signal for
(1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on
VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct
preference optimization (DPO) using verification-based pairwise rewards,
further improving reasoning validity and accuracy.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Narjes Nourzad, Hanqing Yang, Shiyu Chen, Carlee Joe-Wong
ๆฅ็ๆ่ฆ
Cooperative multi-agent planning requires agents to make joint decisions with
partial information and limited communication. Coordination at the trajectory
level often fails, as small deviations in timing or movement cascade into
conflicts. Symbolic planning mitigates this challenge by raising the level of
abstraction and providing a minimal vocabulary of actions that enable
synchronization and collective progress. We present DR. WELL, a decentralized
neurosymbolic framework for cooperative multi-agent planning. Cooperation
unfolds through a two-phase negotiation protocol: agents first propose
candidate roles with reasoning and then commit to a joint allocation under
consensus and environment constraints. After commitment, each agent
independently generates and executes a symbolic plan for its role without
revealing detailed trajectories. Plans are grounded in execution outcomes via a
shared world model that encodes the current state and is updated as agents act.
By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids
brittle step-level alignment and enables higher-level operations that are
reusable, synchronizable, and interpretable. Experiments on cooperative
block-push tasks show that agents adapt across episodes, with the dynamic world
model capturing reusable patterns and improving task completion rates and
efficiency. Experiments on cooperative block-push tasks show that our dynamic
world model improves task completion and efficiency through negotiation and
self-refinement, trading a time overhead for evolving, more efficient
collaboration strategies.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Satchel Grant, Simon Jerome Han, Alexa Tartaglini, Christopher Potts
Machine Learning
ๆฅ็ๆ่ฆ
A common approach to mechanistic interpretability is to causally manipulate
model representations via targeted interventions in order to understand what
those representations encode. Here we ask whether such interventions create
out-of-distribution (divergent) representations, and whether this raises
concerns about how faithful their resulting explanations are to the target
model in its natural state. First, we demonstrate empirically that common
causal intervention techniques often do shift internal representations away
from the natural distribution of the target model. Then, we provide a
theoretical analysis of two classes of such divergences: `harmless' divergences
that occur in the null-space of the weights and from covariance within
behavioral decision boundaries, and `pernicious' divergences that activate
hidden network pathways and cause dormant behavioral changes. Finally, in an
effort to mitigate the pernicious cases, we modify the Counterfactual Latent
(CL) loss from Grant (2025) that regularizes interventions to remain closer to
the natural distributions, reducing the likelihood of harmful divergences while
preserving the interpretive power of interventions. Together, these results
highlight a path towards more reliable interpretability methods.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Soham De, Lodewijk Gelauff, Ashish Goel, Smitha Milli, Ariel Procaccia et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
A central feature of many deliberative processes, such as citizens'
assemblies and deliberative polls, is the opportunity for participants to
engage directly with experts. While participants are typically invited to
propose questions for expert panels, only a limited number can be selected due
to time constraints. This raises the challenge of how to choose a small set of
questions that best represent the interests of all participants. We introduce
an auditing framework for measuring the level of representation provided by a
slate of questions, based on the social choice concept known as justified
representation (JR). We present the first algorithms for auditing JR in the
general utility setting, with our most efficient algorithm achieving a runtime
of $O(mn\log n)$, where $n$ is the number of participants and $m$ is the number
of proposed questions. We apply our auditing methods to historical
deliberations, comparing the representativeness of (a) the actual questions
posed to the expert panel (chosen by a moderator), (b) participants' questions
chosen via integer linear programming, (c) summary questions generated by large
language models (LLMs). Our results highlight both the promise and current
limitations of LLMs in supporting deliberative processes. By integrating our
methods into an online deliberation platform that has been used for over
hundreds of deliberations across more than 50 countries, we make it easy for
practitioners to audit and improve representation in future deliberations.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Daniel Gomm, Cornelius Wolff, Madelon Hulsebos
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Natural language interfaces to tabular data must handle ambiguities inherent
to queries. Instead of treating ambiguity as a deficiency, we reframe it as a
feature of cooperative interaction, where the responsibility of query
specification is shared among the user and the system. We develop a principled
framework distinguishing cooperative queries, i.e., queries that yield a
resolvable interpretation, from uncooperative queries that cannot be resolved.
Applying the framework to evaluations for tabular question answering and
analysis, we analyze the queries in 15 popular datasets, and observe an
uncontrolled mixing of query types neither adequate for evaluating a system's
execution accuracy nor for evaluating interpretation capabilities. Our
framework and analysis of queries shifts the perspective from fixing ambiguity
to embracing cooperation in resolving queries. This reflection enables more
informed design and evaluation for natural language interfaces for tabular
data, for which we outline implications and directions for future research.
๐
2025-11-06
๐ Issues, comments, and questions are all welcome in
๐ฅ Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu Aizawa
Robotics
ๆฅ็ๆ่ฆ
Understanding the current capabilities and risks of AI Scientist systems is
essential for ensuring trustworthy and sustainable AI-driven scientific
progress while preserving the integrity of the academic ecosystem. To this end,
we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system
that mimics the core research workflow of a novice student researcher: Given
the baseline paper from the human mentor, it analyzes its limitations,
formulates novel hypotheses for improvement, validates them through rigorous
experimentation, and writes a paper with the results. Unlike previous
approaches that assume full automation or operate on small-scale code, Jr. AI
Scientist follows a well-defined research workflow and leverages modern coding
agents to handle complex, multi-file implementations, leading to scientifically
valuable contributions. For evaluation, we conducted automated assessments
using AI Reviewers, author-led evaluations, and submissions to Agents4Science,
a venue dedicated to AI-driven scientific contributions. The findings
demonstrate that Jr. AI Scientist generates papers receiving higher review
scores than existing fully automated systems. Nevertheless, we identify
important limitations from both the author evaluation and the Agents4Science
reviews, indicating the potential risks of directly applying current AI
Scientist systems and key challenges for future research. Finally, we
comprehensively report various risks identified during development. We hope
these insights will deepen understanding of current progress and risks in AI
Scientist development.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Divyansha Lachi, Mahmoud Mohammadi, Joe Meyer, Vinam Arora, Tom Palczewski et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In domains such as healthcare, finance, and e-commerce, the temporal dynamics
of relational data emerge from complex interactions-such as those between
patients and providers, or users and products across diverse categories. To be
broadly useful, models operating on these data must integrate long-range
spatial and temporal dependencies across diverse types of entities, while also
supporting multiple predictive tasks. However, existing graph models for
relational data primarily focus on spatial structure, treating temporal
information merely as a filtering constraint to exclude future events rather
than a modeling signal, and are typically designed for single-task prediction.
To address these gaps, we introduce a temporal subgraph sampler that enhances
global context by retrieving nodes beyond the immediate neighborhood to capture
temporally relevant relationships. In addition, we propose the Relational Graph
Perceiver (RGP), a graph transformer architecture for relational deep learning
that leverages a cross-attention-based latent bottleneck to efficiently
integrate information from both structural and temporal contexts. This latent
bottleneck integrates signals from different node and edge types into a common
latent space, enabling the model to build global context across the entire
relational system. RGP also incorporates a flexible cross-attention decoder
that supports joint learning across tasks with disjoint label spaces within a
single model. Experiments on RelBench, SALT, and CTU show that RGP delivers
state-of-the-art performance, offering a general and scalable solution for
relational deep learning with support for diverse predictive tasks.
๐
2025-11-06
๐ 32 pages (including supplementary information), 11
๐ฅ Zihang Ding, Kun Zhang
Robotics
ๆฅ็ๆ่ฆ
Urban surface water flooding, triggered by intense rainfall overwhelming
drainage systems, is increasingly frequent and widespread. While flood
prediction and monitoring in high spatial-temporal resolution are desired,
practical constraints in time, budget, and technology hinder its full
implementation. How to monitor urban drainage networks and predict flow
conditions under constrained resource is a major challenge. This study presents
a data-driven sparse sensing (DSS) framework, integrated with EPA-SWMM, to
optimize sensor placement and reconstruct peak flowrates in a stormwater
system, using the Woodland Avenue catchment in Duluth, Minnesota, as a case
study. We utilized a SWMM model to generate a training dataset of peak flowrate
profiles across the stormwater network. Furthermore, we applied DSS -
leveraging singular value decomposition for dimensionality reduction and QR
factorization for sensor allocation - to identify the optimal monitoring nodes
based on the simulated training dataset. We then validated the
representativeness of these identified monitoring nodes by comparing the
DSS-reconstructed peak flowrate profiles with those obtained from SWMM. Three
optimally placed sensors among 77 nodes achieved satisfactory reconstruction
performance with Nash-Sutcliffe Efficiency (NSE) values of 0.92-0.95 (25th to
75th percentiles). In addition, the model showed good robustness to uncertainty
in measurements. Its robustness to sensor failures is location-dependent and
improves with the number of sensors deployed. The framework balances
computational efficiency and physical interpretability, enabling high-accuracy
flow reconstruction with minimal sensors. This DSS framework can be further
integrated with predictive models to realize flood early warning and real-time
control under limited sensing and monitoring resource.
๐
2025-11-06
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Baptiste Bonin, Maxime Heuillet, Audrey Durand
Natural Language Processing
ๆฅ็ๆ่ฆ
Modeling user preferences across domains remains a key challenge in slate
recommendation (i.e. recommending an ordered sequence of items) research. We
investigate how Large Language Models (LLM) can effectively act as world models
of user preferences through pairwise reasoning over slates. We conduct an
empirical study involving several LLMs on three tasks spanning different
datasets. Our results reveal relationships between task performance and
properties of the preference function captured by LLMs, hinting towards areas
for improvement and highlighting the potential of LLMs as world models in
recommender systems.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amir Zur, Atticus Geiger, Ekdeep Singh Lubana, Eric Bigelow
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
When a language model generates text, the selection of individual tokens
might lead it down very different reasoning paths, making uncertainty difficult
to quantify. In this work, we consider whether reasoning language models
represent the alternate paths that they could take during generation. To test
this hypothesis, we use hidden activations to control and predict a language
model's uncertainty during chain-of-thought reasoning. In our experiments, we
find a clear correlation between how uncertain a model is at different tokens,
and how easily the model can be steered by controlling its activations. This
suggests that activation interventions are most effective when there are
alternate paths available to the model -- in other words, when it has not yet
committed to a particular final answer. We also find that hidden activations
can predict a model's future outcome distribution, demonstrating that models
implicitly represent the space of possible paths.
๐
2025-11-06
๐ AAAI 2026
๐ฅ Shaolong Wu, James Blume, Geshi Yeung
Machine Learning
ๆฅ็ๆ่ฆ
Algorithmic fairness has grown rapidly as a research area, yet key concepts
remain unsettled, especially in criminal justice. We review group, individual,
and process fairness and map the conditions under which they conflict. We then
develop a simple modification to standard group fairness. Rather than exact
parity across protected groups, we minimize a weighted error loss while keeping
differences in false negative rates within a small tolerance. This makes
solutions easier to find, can raise predictive accuracy, and surfaces the
ethical choice of error costs. We situate this proposal within three classes of
critique: biased and incomplete data, latent affirmative action, and the
explosion of subgroup constraints. Finally, we offer a practical framework for
deployment in public decision systems built on three pillars: need-based
decisions, Transparency and accountability, and narrowly tailored definitions
and solutions. Together, these elements link technical design to legitimacy and
provide actionable guidance for agencies that use risk assessment and related
tools.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joshua Gao, Quoc Huy Pham, Subin Varghese, Silwal Saurav, Vedhus Hoskere
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) is a critical technique for grounding
Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in
specialized, safety-critical domains remains a significant challenge. Existing
evaluation frameworks often rely on heuristic-based metrics that fail to
capture domain-specific nuances and other works utilize LLM-as-a-Judge
approaches that lack validated alignment with human judgment. This paper
introduces RAGalyst, an automated, human-aligned agentic framework designed for
the rigorous evaluation of domain-specific RAG systems. RAGalyst features an
agentic pipeline that generates high-quality, synthetic question-answering (QA)
datasets from source documents, incorporating an agentic filtering step to
ensure data fidelity. The framework refines two key LLM-as-a-Judge
metrics-Answer Correctness and Answerability-using prompt optimization to
achieve a strong correlation with human annotations. Applying this framework to
evaluate various RAG components across three distinct domains (military
operations, cybersecurity, and bridge engineering), we find that performance is
highly context-dependent. No single embedding model, LLM, or hyperparameter
configuration proves universally optimal. Additionally, we provide an analysis
on the most common low Answer Correctness reasons in RAG. These findings
highlight the necessity of a systematic evaluation framework like RAGalyst,
which empowers practitioners to uncover domain-specific trade-offs and make
informed design choices for building reliable and effective RAG systems.
RAGalyst is available on our Github.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Andrea Cera Palatsi, Samuel Martin-Gutierrez, Ana S. Cardenal, Max Pellert
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used both to make decisions in
domains such as health, education and law, and to simulate human behavior. Yet
how closely LLMs mirror actual human decision-making remains poorly understood.
This gap is critical: misalignment could produce harmful outcomes in practical
applications, while failure to replicate human behavior renders LLMs
ineffective for social simulations. Here, we address this gap by developing a
digital twin of game-theoretic experiments and introducing a systematic
prompting and probing framework for machine-behavioral evaluation. Testing
three open-source models (Llama, Mistral and Qwen), we find that Llama
reproduces human cooperation patterns with high fidelity, capturing human
deviations from rational choice theory, while Qwen aligns closely with Nash
equilibrium predictions. Notably, we achieved population-level behavioral
replication without persona-based prompting, simplifying the simulation
process. Extending beyond the original human-tested games, we generate and
preregister testable hypotheses for novel game configurations outside the
original parameter grid. Our findings demonstrate that appropriately calibrated
LLMs can replicate aggregate human behavioral patterns and enable systematic
exploration of unexplored experimental spaces, offering a complementary
approach to traditional research in the social and behavioral sciences that
generates new empirical predictions about human social decision-making.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Christos-Nikolaos Zacharopoulos, Revekka Kyriakoglou
Natural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLMs) become integral to human-centered
applications, understanding their personality-like behaviors is increasingly
important for responsible development and deployment. This paper systematically
evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to
assess trait expressions under varying sampling temperatures. We find
significant differences across four of the five personality dimensions, with
Neuroticism and Extraversion susceptible to temperature adjustments. Further,
hierarchical clustering reveals distinct model clusters, suggesting that
architectural features may predispose certain models toward stable trait
profiles. Taken together, these results offer new insights into the emergence
of personality-like patterns in LLMs and provide a new perspective on model
tuning, selection, and the ethical governance of AI systems. We share the data
and code for this analysis here:
https://osf.io/bsvzc/?view_only=6672219bede24b4e875097426dc3fac1
๐
2025-11-06
๐ EMNLP 2025
๐ฅ Cuong Huynh, Jie Cao
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
This paper describes the OUNLP system submitted to the TSAR-2025 Shared Task
(Alva-Manchego et al., 2025), designed for readability-controlled text
simplification using LLM-prompting-based generation. Based on the analysis of
prompt-based text simplification methods, we discovered an interesting finding
that text simplification performance is highly related to the gap between the
source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by
this finding, we propose two multi-round simplification methods and generate
them via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based
LLM simplification (MRS-Joint). Our submitted systems ranked 7 out of 20 teams.
Later improvements with MRS-Joint show that taking the LLM simplified
candidates as the starting point could further boost the multi-round
simplification performance.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nikhil Abhyankar, Purvi Chaurasia, Sanchit Kabra, Ananya Srivastava, Vivek Gupta et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing tabular reasoning benchmarks mostly test models on small, uniform
tables, underrepresenting the complexity of real-world data and giving an
incomplete view of Large Language Models' (LLMs) reasoning abilities. Real
tables are long, heterogeneous, and domain-specific, mixing structured fields
with free text and requiring multi-hop reasoning across thousands of tokens. To
address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from
2031 real-world tables spanning two domains: i) RB-Science (NSF grant records)
and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates
LLMs jointly across scale, heterogeneity, domain specificity, and reasoning
complexity. Experiments with open-source and proprietary models show that LLMs
struggle with heterogeneous schemas and complex multi-hop inference, revealing
persistent weaknesses in current architectures and prompting strategies.
RUST-BENCH establishes a challenging new testbed for advancing tabular
reasoning research.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ipsita Ghosh, Ethan Nguyen, Christian Kรผmmerle
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Parameter-efficient training, based on low-rank optimization, has become a
highly successful tool for fine-tuning large deep-learning models. However,
these methods fail at low-rank pre-training tasks where maintaining the
low-rank structure and the objective remains a challenging task. We propose the
Quadratic Reweighted Rank Regularizer dubbed Q3R, which leads to a novel
low-rank inducing training strategy inspired by the iteratively reweighted
least squares (IRLS) framework. Q3R is based on a quadratic regularizer term
which majorizes a smoothed log determinant serving as rank surrogate objective.
Unlike other low-rank training techniques, Q3R is able to train weight matrices
with prescribed, low target ranks of models that achieve comparable predictive
performance as dense models, with small computational overhead, while remaining
fully compatible with existing architectures. For example, we demonstrated one
experiment where we are able to truncate $60\%$ and $80\%$ of the parameters of
a ViT-Tiny model with $~1.3\%$ and $~4\%$ accuracy drop in CIFAR-10 performance
respectively. The efficacy of Q3R is confirmed on Transformers across both
image and language tasks, including for low-rank fine-tuning.
๐
2025-11-06
๐ AAAI 2026
๐ฅ Lars Krupp, Daniel Geiรler, Vishal Banwari, Paul Lukowicz, Jakob Karolus
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Web agents, like OpenAI's Operator and Google's Project Mariner, are powerful
agentic systems pushing the boundaries of Large Language Models (LLM). They can
autonomously interact with the internet at the user's behest, such as
navigating websites, filling search masks, and comparing price lists. Though
web agent research is thriving, induced sustainability issues remain largely
unexplored. To highlight the urgency of this issue, we provide an initial
exploration of the energy and $CO_2$ cost associated with web agents from both
a theoretical -via estimation- and an empirical perspective -by benchmarking.
Our results show how different philosophies in web agent creation can severely
impact the associated expended energy, and that more energy consumed does not
necessarily equate to better results. We highlight a lack of transparency
regarding disclosing model parameters and processes used for some web agents as
a limiting factor when estimating energy consumption. Our work contributes
towards a change in thinking of how we evaluate web agents, advocating for
dedicated metrics measuring energy consumption in benchmarks.
๐
2025-11-06
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Hyo Jin Do, Zahra Ashktorab, Jasmina Gajcin, Erik Miehling, Martรญn Santillรกn Cooper et al.
ๆฅ็ๆ่ฆ
The LLM-as-a-judge paradigm enables flexible, user-defined evaluation, but
its effectiveness is often limited by the scarcity of diverse, representative
data for refining criteria. We present a tool that integrates synthetic data
generation into the LLM-as-a-judge workflow, empowering users to create
tailored and challenging test cases with configurable domains, personas,
lengths, and desired outcomes, including borderline cases. The tool also
supports AI-assisted inline editing of existing test cases. To enhance
transparency and interpretability, it reveals the prompts and explanations
behind each generation. In a user study (N=24), 83% of participants preferred
the tool over manually creating or selecting test cases, as it allowed them to
rapidly generate diverse synthetic data without additional workload. The
generated synthetic data proved as effective as hand-crafted data for both
refining evaluation criteria and aligning with human preferences. These
findings highlight synthetic data as a promising alternative, particularly in
contexts where efficiency and scalability are critical.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alberto Cattaneo, Carlo Luschi, Daniel Justus
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval of information from graph-structured knowledge bases represents a
promising direction for improving the factuality of LLMs. While various
solutions have been proposed, a comparison of methods is difficult due to the
lack of challenging QA datasets with ground-truth targets for graph retrieval.
We present SynthKGQA, a framework for generating high-quality synthetic
Knowledge Graph Question Answering datasets from any Knowledge Graph, providing
the full set of ground-truth facts in the KG to reason over each question. We
show how, in addition to enabling more informative benchmarking of KG
retrievers, the data produced with SynthKGQA also allows us to train better
models. We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset
designed to test zero-shot generalization abilities of KG retrievers with
respect to unseen graph structures and relation types, and benchmark popular
solutions for KG-augmented LLMs on it.
๐
2025-11-06
๐ ICML
๐ฅ Abheek Ghosh, Tzeh Yuan Neoh, Nicholas Teh, Giannis Tyrovolas
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We study a model of subscription-based platforms where users pay a fixed fee
for unlimited access to content, and creators receive a share of the revenue.
Existing approaches to detecting fraud predominantly rely on machine learning
methods, engaging in an ongoing arms race with bad actors. We explore revenue
division mechanisms that inherently disincentivize manipulation. We formalize
three types of manipulation-resistance axioms and examine which existing rules
satisfy these. We show that a mechanism widely used by streaming platforms, not
only fails to prevent fraud, but also makes detecting manipulation
computationally intractable. We also introduce a novel rule, ScaledUserProp,
that satisfies all three manipulation-resistance axioms. Finally, experiments
with both real-world and synthetic streaming data support ScaledUserProp as a
fairer alternative compared to existing rules.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Carnot Braun, Rafael O. Jarczewski, Gabriel U. Talasso, Leandro A. Villas, Allan M. de Souza
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Traditional vehicle routing systems efficiently optimize singular metrics
like time or distance, and when considering multiple metrics, they need more
processes to optimize . However, they lack the capability to interpret and
integrate the complex, semantic, and dynamic contexts of human drivers, such as
multi-step tasks, situational constraints, or urgent needs. This paper
introduces and evaluates PAVe (Personalized Agentic Vehicular Routing), a
hybrid agentic assistant designed to augment classical pathfinding algorithms
with contextual reasoning. Our approach employs a Large Language Model (LLM)
agent that operates on a candidate set of routes generated by a multi-objective
(time, CO2) Dijkstra algorithm. The agent evaluates these options against
user-provided tasks, preferences, and avoidance rules by leveraging a
pre-processed geospatial cache of urban Points of Interest (POIs). In a
benchmark of realistic urban scenarios, PAVe successfully used complex user
intent into appropriate route modifications, achieving over 88% accuracy in its
initial route selections with a local model. We conclude that combining
classical routing algorithms with an LLM-based semantic reasoning layer is a
robust and effective approach for creating personalized, adaptive, and scalable
solutions for urban mobility optimization.
๐
2025-11-06
๐ ArXiv Preprint (eess.SY)
๐ฅ Patrik Valรกbek, Marek Wadinger, Michal Kvasnica, Martin Klauฤo
Robotics
ๆฅ็ๆ่ฆ
Nonlinear dynamical systems with input delays pose significant challenges for
prediction, estimation, and control due to their inherent complexity and the
impact of delays on system behavior. Traditional linear control techniques
often fail in these contexts, necessitating innovative approaches. This paper
introduces a novel approach to approximate the Koopman operator using an
LSTM-enhanced Deep Koopman model, enabling linear representations of nonlinear
systems with time delays. By incorporating Long Short-Term Memory (LSTM)
layers, the proposed framework captures historical dependencies and efficiently
encodes time-delayed system dynamics into a latent space. Unlike traditional
extended Dynamic Mode Decomposition (eDMD) approaches that rely on predefined
dictionaries, the LSTM-enhanced Deep Koopman model is dictionary-free, which
mitigates the problems with the underlying dynamics being known and
incorporated into the dictionary. Quantitative comparisons with extended eDMD
on a simulated system demonstrate highly significant performance gains in
prediction accuracy in cases where the true nonlinear dynamics are unknown and
achieve comparable results to eDMD with known dynamics of a system.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Anisha Garg, Ganesh Venkatesh
Machine Learning
ๆฅ็ๆ่ฆ
Group-relative Policy Optimization's (GRPO) simplicity makes it highly
desirable for adapting LLMs to become experts at specific tasks. But this
simplicity also makes it ill-specified as we seek to enhance RL training with
richer, non-binary feedback. When using ordinal rewards to give partial credit,
GRPO's simplicity starts to hurt, as its group-average baseline often assigns a
positive advantage to failed trajectories and reinforces incorrect behavior.
We introduce Correctness Relative Policy Optimization (CoRPO), a new
formulation that solves this flaw. CoRPO uses an adaptive baseline that
enforces a minimum quality threshold, ensuring failed solutions are never
positively reinforced. Once the policy consistently meets this threshold, the
baseline automatically transitions to a relative preference mode, pushing the
model to find optimal solutions rather than just "acceptable" ones. We
empirically validate CoRPO on a code verification task, where it demonstrates
more stable convergence and better out-of-domain generalization.
This work represents a critical step in our broader research program to
enable LLMs to learn genuinely new capabilities through reinforcement learning.
We achieve this by enabling LLMs to learn from rich, multi-dimensional feedback
- progressing from binary to ordinal rewards in this work, and onward to
denser, per-step supervision.
๐
2025-11-06
๐ ArXiv Preprint (eess.SY)
๐ฅ Patrik Valรกbek, Michaela Horvรกthovรก, Martin Klauฤo
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a deep Koopman-based Economic Model Predictive Control
(EMPC) for efficient operation of a laboratory-scale pasteurization unit (PU).
The method uses Koopman operator theory to transform the complex, nonlinear
system dynamics into a linear representation, enabling the application of
convex optimization while representing the complex PU accurately. The deep
Koopman model utilizes neural networks to learn the linear dynamics from
experimental data, achieving a 45% improvement in open-loop prediction accuracy
over conventional N4SID subspace identification. Both analyzed models were
employed in the EMPC formulation that includes interpretable economic costs,
such as energy consumption, material losses due to inadequate pasteurization,
and actuator wear. The feasibility of EMPC is ensured using slack variables.
The deep Koopman EMPC and N4SID EMPC are numerically validated on a nonlinear
model of multivariable PU under external disturbance. The disturbances include
feed pump fail-to-close scenario and the introduction of a cold batch to be
pastuerized. These results demonstrate that the deep Koopmand EMPC achieves a
32% reduction in total economic cost compared to the N4SID baseline. This
improvement is mainly due to the reductions in material losses and energy
consumption. Furthermore, the steady-state operation via Koopman-based EMPC
requires 10.2% less electrical energy. The results highlight the practical
advantages of integrating deep Koopman representations with economic
optimization to achieve resource-efficient control of thermal-intensive plants.
๐
2025-11-06
๐ ArXiv Preprint (cs.SE)
๐ฅ Hao He, Courtney Miller, Shyam Agarwal, Christian Kรคstner, Bogdan Vasilescu
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated the promise to revolutionize
the field of software engineering. Among other things, LLM agents are rapidly
gaining momentum in their application to software development, with
practitioners claiming a multifold productivity increase after adoption. Yet,
empirical evidence is lacking around these claims. In this paper, we estimate
the causal effect of adopting a widely popular LLM agent assistant, namely
Cursor, on development velocity and software quality. The estimation is enabled
by a state-of-the-art difference-in-differences design comparing
Cursor-adopting GitHub projects with a matched control group of similar GitHub
projects that do not use Cursor. We find that the adoption of Cursor leads to a
significant, large, but transient increase in project-level development
velocity, along with a significant and persistent increase in static analysis
warnings and code complexity. Further panel generalized method of moments
estimation reveals that the increase in static analysis warnings and code
complexity acts as a major factor causing long-term velocity slowdown. Our
study carries implications for software engineering practitioners, LLM agent
assistant designers, and researchers.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jayadeva, Naman Dwivedi, Hari Krishnan, N. M. Anoop Krishnan
Machine Learning
ๆฅ็ๆ่ฆ
A formal link between regression and classification has been tenuous. Even
though the margin maximization term $\|w\|$ is used in support vector
regression, it has at best been justified as a regularizer. We show that a
regression problem with $M$ samples lying on a hyperplane has a one-to-one
equivalence with a linearly separable classification task with $2M$ samples. We
show that margin maximization on the equivalent classification task leads to a
different regression formulation than traditionally used. Using the
equivalence, we demonstrate a ``regressability'' measure, that can be used to
estimate the difficulty of regressing a dataset, without needing to first learn
a model for it. We use the equivalence to train neural networks to learn a
linearizing map, that transforms input variables into a space where a linear
regressor is adequate.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Subeen Park, Joowang Kim, Hakyung Lee, Sunjae Yoo, Kyungwoo Song
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning models achieve strong performance across various domains but
often rely on spurious correlations, making them vulnerable to distribution
shifts. This issue is particularly severe in subpopulation shift scenarios,
where models struggle in underrepresented groups. While existing methods have
made progress in mitigating this issue, their performance gains are still
constrained. They lack a rigorous theoretical framework connecting the
embedding space representations with worst-group error. To address this
limitation, we propose Spurious Correlation-Aware Embedding Regularization for
Worst-Group Robustness (SCER), a novel approach that directly regularizes
feature representations to suppress spurious cues. We show theoretically that
worst-group error is influenced by how strongly the classifier relies on
spurious versus core directions, identified from differences in group-wise mean
embeddings across domains and classes. By imposing theoretical constraints at
the embedding level, SCER encourages models to focus on core features while
reducing sensitivity to spurious patterns. Through systematic evaluation on
multiple vision and language, we show that SCER outperforms prior
state-of-the-art studies in worst-group accuracy. Our code is available at
\href{https://github.com/MLAI-Yonsei/SCER}{https://github.com/MLAI-Yonsei/SCER}.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly deployed as "agents" for
decision-making (DM) in interactive and dynamic environments. Yet, since they
were not originally designed for DM, recent studies show that LLMs can struggle
even in basic online DM problems, failing to achieve low regret or an effective
exploration-exploitation tradeoff. To address this, we introduce Iterative
Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training procedure
that repeatedly distills low-regret decision trajectories back into the base
model. At each iteration, the model rolls out multiple decision trajectories,
selects the k-lowest regret ones, and fine-tunes itself on them. Unlike prior
methods that (a) distill action sequences from known DM algorithms or (b) rely
on manually crafted chain-of-thought templates, our approach leverages the
regret metric to elicit the model's own DM ability and reasoning rationales.
This reliance on model-generated reasoning avoids rigid output engineering and
provides more flexible, natural-language training signals. Empirical results
show that Iterative RMFT improves LLMs' DM performance across diverse models -
from Transformers with numerical input/output, to open-weight LLMs, and
advanced closed-weight models like GPT-4o mini. Its flexibility in output and
reasoning formats enables generalization across tasks with varying horizons,
action spaces, reward processes, and natural-language contexts. Finally, we
provide theoretical insight showing that a single-layer Transformer under this
paradigm can act as a no-regret learner in a simplified setting. Overall,
Iterative RMFT offers a principled and general post-training framework for
enhancing LLMs' decision-making capabilities.
๐
2025-11-06
๐ ArXiv Preprint (cs.SD)
๐ฅ Ali Boudaghi, Hadi Zare
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Music editing has emerged as an important and practical area of artificial
intelligence, with applications ranging from video game and film music
production to personalizing existing tracks according to user preferences.
However, existing models face significant limitations, such as being restricted
to editing synthesized music generated by their own models, requiring highly
precise prompts, or necessitating task-specific retraining, thus lacking true
zero-shot capability. Leveraging recent advances in rectified flow and
diffusion transformers, we introduce MusRec, the first zero-shot text-to-music
editing model capable of performing diverse editing tasks on real-world music
efficiently and effectively. Experimental results demonstrate that our approach
outperforms existing methods in preserving musical content, structural
consistency, and editing fidelity, establishing a strong foundation for
controllable music editing in real-world scenarios.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Nick Oh, Fernand Gobet
Natural Language Processing
ๆฅ็ๆ่ฆ
Test-time reasoning architectures such as those following the Generate-Verify
paradigm -- where a model iteratively refines or verifies its own generated
outputs -- prioritise generation and verification but exclude the monitoring
processes that determine when and how reasoning should begin. This omission may
contribute to the prefix dominance trap, in which models commit early to
suboptimal reasoning paths and seldom recover, yielding roughly 20% accuracy
loss. We address this architectural gap by formalising Flavell's and Nelson and
Narens' metacognitive theories into computational specifications, proposing the
Monitor-Generate-Verify (MGV) framework. MGV extends the Generate-Verify
paradigm by adding explicit monitoring that captures metacognitive experiences
(from difficulty assessments to confidence judgements) before generation begins
and refines future monitoring through verification feedback. Though we present
no empirical validation, this work provides the first systematic computational
translation of foundational metacognitive theories, offering a principled
vocabulary for understanding reasoning system failures and suggesting specific
architectural interventions for future test-time reasoning designs.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Federico Pirola, Fabio Stella, Marco Grzegorczyk
ๆฅ็ๆ่ฆ
Dynamic Bayesian networks (DBNs) are increasingly used in healthcare due to
their ability to model complex temporal relationships in patient data while
maintaining interpretability, an essential feature for clinical
decision-making. However, existing approaches to handling missing data in
longitudinal clinical datasets are largely derived from static Bayesian
networks literature, failing to properly account for the temporal nature of the
data. This gap limits the ability to quantify uncertainty over time, which is
particularly critical in settings such as intensive care, where understanding
the temporal dynamics is fundamental for model trustworthiness and
applicability across diverse patient groups. Despite the potential of DBNs, a
full Bayesian framework that integrates missing data handling remains
underdeveloped. In this work, we propose a novel Gibbs sampling-based method
for learning DBNs from incomplete data. Our method treats each missing value as
an unknown parameter following a Gaussian distribution. At each iteration, the
unobserved values are sampled from their full conditional distributions,
allowing for principled imputation and uncertainty estimation. We evaluate our
method on both simulated datasets and real-world intensive care data from
critically ill patients. Compared to standard model-agnostic techniques such as
MICE, our Bayesian approach demonstrates superior reconstruction accuracy and
convergence properties. These results highlight the clinical relevance of
incorporating full Bayesian inference in temporal models, providing more
reliable imputations and offering deeper insight into model behavior. Our
approach supports safer and more informed clinical decision-making,
particularly in settings where missing data are frequent and potentially
impactful.
๐
2025-11-06
๐ NeurIPS
๐ฅ Antti Koskela, Tejas Kulkarni, Laith Zumot
Natural Language Processing
ๆฅ็ๆ่ฆ
Differentially private in-context learning (DP-ICL) has recently become an
active research topic due to the inherent privacy risks of in-context learning.
However, existing approaches overlook a critical component of modern large
language model (LLM) pipelines: the similarity search used to retrieve relevant
context data. In this work, we introduce a DP framework for in-context learning
that integrates nearest neighbor search of relevant examples in a privacy-aware
manner. Our method outperforms existing baselines by a substantial margin
across all evaluated benchmarks, achieving more favorable privacy-utility
trade-offs. To achieve this, we employ nearest neighbor retrieval from a
database of context data, combined with a privacy filter that tracks the
cumulative privacy cost of selected samples to ensure adherence to a central
differential privacy budget. Experimental results on text classification and
document question answering show a clear advantage of the proposed method over
existing baselines.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiahao Zhao, Luxin Xu, Minghuan Tan, Lichao Zhang, Ahmadreza Argha et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Numerous medical systems powered by Large Language Models (LLMs) have
achieved remarkable progress in diverse healthcare tasks. However, research on
their medication safety remains limited due to the lack of real world datasets,
constrained by privacy and accessibility issues. Moreover, evaluation of LLMs
in realistic clinical consultation settings, particularly regarding medication
safety, is still underexplored. To address these gaps, we propose a framework
that simulates and evaluates clinical consultations to systematically assess
the medication safety capabilities of LLMs. Within this framework, we generate
inquiry diagnosis dialogues with embedded medication risks and construct a
dedicated medication safety database, RxRisk DB, containing 6,725
contraindications, 28,781 drug interactions, and 14,906 indication-drug pairs.
A two-stage filtering strategy ensures clinical realism and professional
quality, resulting in the benchmark RxSafeBench with 2,443 high-quality
consultation scenarios. We evaluate leading open-source and proprietary LLMs
using structured multiple choice questions that test their ability to recommend
safe medications under simulated patient contexts. Results show that current
LLMs struggle to integrate contraindication and interaction knowledge,
especially when risks are implied rather than explicit. Our findings highlight
key challenges in ensuring medication safety in LLM-based systems and provide
insights into improving reliability through better prompting and task-specific
tuning. RxSafeBench offers the first comprehensive benchmark for evaluating
medication safety in LLMs, advancing safer and more trustworthy AI-driven
clinical decision support.
๐
2025-11-06
๐ ArXiv Preprint (cs.AR)
๐ฅ Yuanpeng Zhang, Xing Hu, Xi Chen, Zhihang Yuan, Cong Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
SRAM Processing-in-Memory (PIM) has emerged as the most promising
implementation for high-performance PIM, delivering superior computing density,
energy efficiency, and computational precision. However, the pursuit of higher
performance necessitates more complex circuit designs and increased operating
frequencies, which exacerbate IR-drop issues. Severe IR-drop can significantly
degrade chip performance and even threaten reliability. Conventional
circuit-level IR-drop mitigation methods, such as back-end optimizations, are
resource-intensive and often compromise power, performance, and area (PPA). To
address these challenges, we propose AIM, comprehensive software and hardware
co-design for architecture-level IR-drop mitigation in high-performance PIM.
Initially, leveraging the bit-serial and in-situ dataflow processing properties
of PIM, we introduce Rtog and HR, which establish a direct correlation between
PIM workloads and IR-drop. Building on this foundation, we propose LHR and WDS,
enabling extensive exploration of architecture-level IR-drop mitigation while
maintaining computational accuracy through software optimization. Subsequently,
we develop IR-Booster, a dynamic adjustment mechanism that integrates
software-level HR information with hardware-based IR-drop monitoring to adapt
the V-f pairs of the PIM macro, achieving enhanced energy efficiency and
performance. Finally, we propose the HR-aware task mapping method, bridging
software and hardware designs to achieve optimal improvement. Post-layout
simulation results on a 7nm 256-TOPS PIM chip demonstrate that AIM achieves up
to 69.2% IR-drop mitigation, resulting in 2.29x energy efficiency improvement
and 1.152x speedup.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tim Beyer, Jonas Dornbusch, Jakob Steimle, Moritz Ladenburger, Leo Schwinn et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid expansion of research on Large Language Model (LLM) safety and
robustness has produced a fragmented and oftentimes buggy ecosystem of
implementations, datasets, and evaluation methods. This fragmentation makes
reproducibility and comparability across studies challenging, hindering
meaningful progress. To address these issues, we introduce AdversariaLLM, a
toolbox for conducting LLM jailbreak robustness research. Its design centers on
reproducibility, correctness, and extensibility. The framework implements
twelve adversarial attack algorithms, integrates seven benchmark datasets
spanning harmfulness, over-refusal, and utility evaluation, and provides access
to a wide range of open-weight LLMs via Hugging Face. The implementation
includes advanced features for comparability and reproducibility such as
compute-resource tracking, deterministic results, and distributional evaluation
techniques. \name also integrates judging through the companion package
JudgeZoo, which can also be used independently. Together, these components aim
to establish a robust foundation for transparent, comparable, and reproducible
research in LLM safety.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jacob Lysnรฆs-Larsen, Marte Eggen, Inga Strรผmke
Machine Learning
ๆฅ็ๆ่ฆ
In explainable AI, Concept Activation Vectors (CAVs) are typically obtained
by training linear classifier probes to detect human-understandable concepts as
directions in the activation space of deep neural networks. It is widely
assumed that a high probe accuracy indicates a CAV faithfully representing its
target concept. However, we show that the probe's classification accuracy alone
is an unreliable measure of concept alignment, i.e., the degree to which a CAV
captures the intended concept. In fact, we argue that probes are more likely to
capture spurious correlations than they are to represent only the intended
concept. As part of our analysis, we demonstrate that deliberately misaligned
probes constructed to exploit spurious correlations, achieve an accuracy close
to that of standard probes. To address this severe problem, we introduce a
novel concept localization method based on spatial linear attribution, and
provide a comprehensive comparison of it to existing feature visualization
techniques for detecting and mitigating concept misalignment. We further
propose three classes of metrics for quantitatively assessing concept
alignment: hard accuracy, segmentation scores, and augmentation robustness. Our
analysis shows that probes with translation invariance and spatial alignment
consistently increase concept alignment. These findings highlight the need for
alignment-based evaluation metrics rather than probe accuracy, and the
importance of tailoring probes to both the model architecture and the nature of
the target concept.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce GUI-360$^\circ$, a large-scale, comprehensive dataset and
benchmark suite designed to advance computer-using agents (CUAs). CUAs present
unique challenges and is constrained by three persistent gaps: a scarcity of
real-world CUA tasks, the lack of automated collection-and-annotation pipelines
for multi-modal trajectories, and the absence of a unified benchmark that
jointly evaluates GUI grounding, screen parsing, and action prediction.
GUI-360$^\circ$ addresses these gaps with an LLM-augmented, largely automated
pipeline for query sourcing, environment-template construction, task
instantiation, batched execution, and LLM-driven quality filtering. The
released corpus contains over 1.2M executed action steps across thousands of
trajectories in popular Windows office applications, and includes
full-resolution screenshots, accessibility metadata when available,
instantiated goals, intermediate reasoning traces, and both successful and
failed action trajectories. The dataset supports three canonical tasks, GUI
grounding, screen parsing, and action prediction, and a hybrid GUI+API action
space that reflects modern agent designs. Benchmarking state-of-the-art
vision--language models on GUI-360$^\circ$ reveals substantial out-of-the-box
shortcomings in grounding and action prediction; supervised fine-tuning and
reinforcement learning yield significant gains but do not close the gap to
human-level reliability. We release GUI-360$^\circ$ and accompanying code to
facilitate reproducible research and accelerate progress on robust desktop
CUAs.
The full dataset has been made public on
https://huggingface.co/datasets/vyokky/GUI-360.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Robin Spanier, Thorsten Hoeser, Claudia Kuenzer
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
The recent and ongoing expansion of marine infrastructure, including offshore
wind farms, oil and gas platforms, artificial islands, and aquaculture
facilities, highlights the need for effective monitoring systems. The
development of robust models for offshore infrastructure detection relies on
comprehensive, balanced datasets, but falls short when samples are scarce,
particularly for underrepresented object classes, shapes, and sizes. By
training deep learning-based YOLOv10 object detection models with a combination
of synthetic and real Sentinel-1 satellite imagery acquired in the fourth
quarter of 2023 from four regions (Caspian Sea, South China Sea, Gulf of
Guinea, and Coast of Brazil), this study investigates the use of synthetic
training data to enhance model performance. We evaluated this approach by
applying the model to detect offshore platforms in three unseen regions (Gulf
of Mexico, North Sea, Persian Gulf) and thereby assess geographic
transferability. This region-holdout evaluation demonstrated that the model
generalises beyond the training areas. In total, 3,529 offshore platforms were
detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and
1,593 in the Persian Gulf. The model achieved an F1 score of 0.85, which
improved to 0.90 upon incorporating synthetic data. We analysed how synthetic
data enhances the representation of unbalanced classes and overall model
performance, taking a first step toward globally transferable detection of
offshore infrastructure. This study underscores the importance of balanced
datasets and highlights synthetic data generation as an effective strategy to
address common challenges in remote sensing, demonstrating the potential of
deep learning for scalable, global offshore infrastructure monitoring.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matteo Cercola, Valeria Capretti, Simone Formentin
Machine Learning
ๆฅ็ๆ่ฆ
Learning from human preferences is a cornerstone of aligning machine learning
models with subjective human judgments. Yet, collecting such preference data is
often costly and time-consuming, motivating the need for more efficient
learning paradigms. Two established approaches offer complementary advantages:
RLHF scales effectively to high-dimensional tasks such as LLM fine-tuning,
while PBO achieves greater sample efficiency through active querying. We
propose a hybrid framework that unifies RLHF's scalability with PBO's query
efficiency by integrating an acquisition-driven module into the RLHF pipeline,
thereby enabling active and sample-efficient preference gathering. We validate
the proposed approach on two representative domains: (i) high-dimensional
preference optimization and (ii) LLM fine-tuning. Experimental results
demonstrate consistent improvements in both sample efficiency and overall
performance across these tasks.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zeng Zhiyuan, Jiashuo Liu, Zhangyue Yin, Ge Zhang, Wenhao Huang et al.
Machine Learning
ๆฅ็ๆ่ฆ
While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for
training large reasoning models, its training dynamics harbor a critical
challenge: RL overfitting, where models gain training rewards but lose
generalization. Our analysis reveals this is driven by policy
over-specialization and catastrophic forgetting of diverse solutions generated
during training. Standard optimization discards this valuable inter-step policy
diversity. To address this, we introduce RLoop, a self-improving framework
built on iterative policy initialization. RLoop transforms the standard
training process into a virtuous cycle: it first uses RL to explore the
solution space from a given policy, then filters the successful trajectories to
create an expert dataset. This dataset is used via Rejection-sampling
Fine-Tuning (RFT) to refine the initial policy, creating a superior starting
point for the next iteration. This loop of exploration and exploitation via
iterative re-initialization effectively converts transient policy variations
into robust performance gains. Our experiments show RLoop mitigates forgetting
and substantially improves generalization, boosting average accuracy by 9% and
pass@32 by over 15% compared to vanilla RL.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Claudio Giusti, Luca Guarnera, Sebastiano Battiato
Computer Vision
ๆฅ็ๆ่ฆ
The growing sophistication of synthetic image and deepfake generation models
has turned source attribution and authenticity verification into a critical
challenge for modern computer vision systems. Recent studies suggest that
diffusion pipelines unintentionally imprint persistent statistical traces,
known as signal leaks, within their outputs, particularly in latent
representations. Building on this observation, we propose Proto-LeakNet, a
signal-leak-aware and interpretable attribution framework that integrates
closed-set classification with a density-based open-set evaluation on the
learned embeddings, enabling analysis of unseen generators without retraining.
Operating in the latent domain of diffusion models, our method re-simulates
partial forward diffusion to expose residual generator-specific cues. A
temporal attention encoder aggregates multi-step latent features, while a
feature-weighted prototype head structures the embedding space and enables
transparent attribution. Trained solely on closed data and achieving a Macro
AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under
post-processing, surpassing state-of-the-art methods, and achieves strong
separability between known and unseen generators. These results demonstrate
that modeling signal-leak bias in latent space enables reliable and
interpretable AI-image and deepfake forensics. The code for the whole work will
be available upon submission.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marawan Elbatel, Anbang Wang, Keyuan Liu, Kaouther Mouheb, Enrique Almar-Munoz et al.
Computer Vision
ๆฅ็ๆ่ฆ
This paper does not introduce a novel architecture; instead, it revisits a
fundamental yet overlooked baseline: adapting human-centric foundation models
for anatomical landmark detection in medical imaging. While landmark detection
has traditionally relied on domain-specific models, the emergence of
large-scale pre-trained vision models presents new opportunities. In this
study, we investigate the adaptation of Sapiens, a human-centric foundation
model designed for pose estimation, to medical imaging through multi-dataset
pretraining, establishing a new state of the art across multiple datasets. Our
proposed model, MedSapiens, demonstrates that human-centric foundation models,
inherently optimized for spatial pose localization, provide strong priors for
anatomical landmark detection, yet this potential has remained largely
untapped. We benchmark MedSapiens against existing state-of-the-art models,
achieving up to 5.26% improvement over generalist models and up to 21.81%
improvement over specialist models in the average success detection rate (SDR).
To further assess MedSapiens adaptability to novel downstream tasks with few
annotations, we evaluate its performance in limited-data settings, achieving
2.69% improvement over the few-shot state of the art in SDR. Code and model
weights are available at https://github.com/xmed-lab/MedSapiens .
๐
2025-11-06
๐ ArXiv Preprint (Multimedia)
๐ฅ Allie Tran, Luca Rossetto
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal co-embedding models, especially CLIP, have advanced the state of
the art in zero-shot classification and multimedia information retrieval in
recent years by aligning images and text in a shared representation space.
However, such modals trained on a contrastive alignment can lack stability
towards small input perturbations. Especially when dealing with manually
expressed queries, minor variations in the query can cause large differences in
the ranking of the best-matching results. In this paper, we present a
systematic analysis of the effect of multiple classes of non-semantic query
perturbations in an multimedia information retrieval scenario. We evaluate a
diverse set of lexical, syntactic, and semantic perturbations across multiple
CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video
collection. Across models, we find that syntactic and semantic perturbations
drive the largest instabilities, while brittleness is concentrated in trivial
surface edits such as punctuation and case. Our results highlight robustness as
a critical dimension for evaluating vision-language models beyond benchmark
accuracy.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rasmus Mรธller-Larsen, Adam Izdebski, Jan Olszewski, Pankhil Gawade, Michal Kmicikiewicz et al.
ๆฅ็ๆ่ฆ
Recent advances in computational methods for designing biological sequences
have sparked the development of metrics to evaluate these methods performance
in terms of the fidelity of the designed sequences to a target distribution and
their attainment of desired properties. However, a single software library
implementing these metrics was lacking. In this work we introduce seqme, a
modular and highly extendable open-source Python library, containing
model-agnostic metrics for evaluating computational methods for biological
sequence design. seqme considers three groups of metrics: sequence-based,
embedding-based, and property-based, and is applicable to a wide range of
biological sequences: small molecules, DNA, ncRNA, mRNA, peptides and proteins.
The library offers a number of embedding and property models for biological
sequences, as well as diagnostics and visualization functions to inspect the
results. seqme can be used to evaluate both one-shot and iterative
computational design methods.
๐
2025-11-06
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zefeng Li, Ning Yang
ๆฅ็ๆ่ฆ
Although the collaborative filtering (CF) algorithm has achieved remarkable
performance in recommendation systems, it suffers from suboptimal
recommendation performance due to noise in the user-item interaction matrix.
Numerous noise-removal studies have improved recommendation models, but most
existing approaches conduct denoising on a single graph. This may cause
attenuation of collaborative signals: removing edges between two nodes can
interrupt paths between other nodes, weakening path-dependent collaborative
information. To address these limitations, this study proposes a novel
GNN-based CF model called DRCSD for denoising unstable interactions. DRCSD
includes two core modules: a collaborative signal decoupling module (decomposes
signals into distinct orders by structural characteristics) and an order-wise
denoising module (performs targeted denoising on each order). Additionally, the
information aggregation mechanism of traditional GNN-based CF models is
modified to avoid cross-order signal interference until the final pooling
operation. Extensive experiments on three public real-world datasets show that
DRCSD has superior robustness against unstable interactions and achieves
statistically significant performance improvements in recommendation accuracy
metrics compared to state-of-the-art baseline models.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhengru Fang, Yu Guo, Jingjing Wang, Yuang Zhang, Haonan An et al.
Machine Learning
ๆฅ็ๆ่ฆ
Sharing and reconstructing a consistent spatial memory is a critical
challenge in multi-agent systems, where partial observability and limited
bandwidth often lead to catastrophic failures in coordination. We introduce a
multi-agent predictive coding framework that formulate coordination as the
minimization of mutual uncertainty among agents. Instantiated as an information
bottleneck objective, it prompts agents to learn not only who and what to
communicate but also when. At the foundation of this framework lies a
grid-cell-like metric as internal spatial coding for self-localization,
emerging spontaneously from self-supervised motion prediction. Building upon
this internal spatial code, agents gradually develop a bandwidth-efficient
communication mechanism and specialized neural populations that encode
partners' locations: an artificial analogue of hippocampal social place cells
(SPCs). These social representations are further enacted by a hierarchical
reinforcement learning policy that actively explores to reduce joint
uncertainty. On the Memory-Maze benchmark, our approach shows exceptional
resilience to bandwidth constraints: success degrades gracefully from 73.5% to
64.4% as bandwidth shrinks from 128 to 4 bits/step, whereas a full-broadcast
baseline collapses from 67.6% to 28.6%. Our findings establish a theoretically
principled and biologically plausible basis for how complex social
representations emerge from a unified predictive drive, leading to social
collective intelligence.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Alan Seroul, Thรฉo Fagnoni, Inรจs Adnani, Dana O. Mohamed, Phillip Kingston
Machine Learning
ๆฅ็ๆ่ฆ
This paper introduces the Opus Workflow Evaluation Framework, a
probabilistic-normative formulation for quantifying Workflow quality and
efficiency. It integrates notions of correctness, reliability, and cost into a
coherent mathematical model that enables direct comparison, scoring, and
optimization of Workflows. The framework combines the Opus Workflow Reward, a
probabilistic function estimating expected performance through success
likelihood, resource usage, and output gain, with the Opus Workflow Normative
Penalties, a set of measurable functions capturing structural and informational
quality across Cohesion, Coupling, Observability, and Information Hygiene. It
supports automated Workflow assessment, ranking, and optimization within modern
automation systems such as Opus and can be integrated into Reinforcement
Learning loops to guide Workflow discovery and refinement. In this paper, we
introduce the Opus Workflow Reward model that formalizes Workflow success as a
probabilistic expectation over costs and outcomes. We define measurable Opus
Workflow Normative Penalties capturing structural, semantic, and signal-related
properties of Workflows. Finally, we propose a unified optimization formulation
for identifying and ranking optimal Workflows under joint Reward-Penalty
trade-offs.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The strong lottery ticket hypothesis (SLTH) conjectures that high-performing
subnetworks, called strong lottery tickets (SLTs), are hidden in randomly
initialized neural networks. Although recent theoretical studies have
established the SLTH across various neural architectures, the SLTH for
transformer architectures still lacks theoretical understanding. In particular,
the current theory of the SLTH does not yet account for the multi-head
attention (MHA) mechanism, a core component of transformers. To address this
gap, we introduce a theoretical analysis of the existence of SLTs within MHAs.
We prove that, if a randomly initialized MHA of $H$ heads and input dimension
$d$ has the hidden dimension $O(d\log(Hd^{3/2}))$ for the key and value, it
contains an SLT that approximates an arbitrary MHA with the same input
dimension with high probability. Furthermore, by leveraging this theory for
MHAs, we extend the SLTH to transformers without normalization layers. We
empirically validate our theoretical findings, demonstrating that the
approximation error between the SLT within a source model (MHA and transformer)
and an approximate target counterpart decreases exponentially by increasing the
hidden dimension of the source model.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hanmo Chen, Chenghao Xu, Jiexi Yan, Cheng Deng
ๆฅ็ๆ่ฆ
Human motion style transfer allows characters to appear less rigidity and
more realism with specific style. Traditional arbitrary image style transfer
typically process mean and variance which is proved effective. Meanwhile,
similar methods have been adapted for motion style transfer. However, due to
the fundamental differences between images and motion, relying on mean and
variance is insufficient to fully capture the complex dynamic patterns and
spatiotemporal coherence properties of motion data. Building upon this, our key
insight is to bring two more coefficient, skewness and kurtosis, into the
analysis of motion style. Specifically, we propose a novel Adaptive Statistics
Fusor (AStF) which consists of Style Disentanglement Module (SDM) and
High-Order Multi-Statistics Attention (HOS-Attn). We trained our AStF in
conjunction with a Motion Consistency Regularization (MCR) discriminator.
Experimental results show that, by providing a more comprehensive model of the
spatiotemporal statistical patterns inherent in dynamic styles, our proposed
AStF shows proficiency superiority in motion style transfers over
state-of-the-arts. Our code and model are available at
https://github.com/CHMimilanlan/AStF.
๐
2025-11-06
๐ 10 pages, 4 figures. Submitted to IEEE DISTILL 202
๐ฅ Mohammed Musthafa Rafi, Adarsh Krishnamurthy, Aditya Balu
Robotics
ๆฅ็ๆ่ฆ
The proliferation of AI-generated content has created an absurd communication
theater where senders use LLMs to inflate simple ideas into verbose content,
recipients use LLMs to compress them back into summaries, and as a consequence
neither party engage with authentic content. LAAC (LLM as a Communicator)
proposes a paradigm shift - positioning LLMs as intelligent communication
intermediaries that capture the sender's intent through structured dialogue and
facilitate genuine knowledge exchange with recipients. Rather than perpetuating
cycles of AI-generated inflation and compression, LAAC enables authentic
communication across diverse contexts including academic papers, proposals,
professional emails, and cross-platform content generation. However, deploying
LLMs as trusted communication intermediaries raises critical questions about
information fidelity, consistency, and reliability. This position paper
systematically evaluates the trustworthiness requirements for LAAC's deployment
across multiple communication domains. We investigate three fundamental
dimensions: (1) Information Capture Fidelity - accuracy of intent extraction
during sender interviews across different communication types, (2)
Reproducibility - consistency of structured knowledge across multiple
interaction instances, and (3) Query Response Integrity - reliability of
recipient-facing responses without hallucination, source conflation, or
fabrication. Through controlled experiments spanning multiple LAAC use cases,
we assess these trust dimensions using LAAC's multi-agent architecture.
Preliminary findings reveal measurable trust gaps that must be addressed before
LAAC can be reliably deployed in high-stakes communication scenarios.
๐
2025-11-06
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Leszek Sliwko
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a reinforced genetic approach to a defined d-resource
system optimization problem. The classical evolution schema was ineffective due
to a very strict feasibility function in the studied problem. Hence, the
presented strategy has introduced several modifications and adaptations to
standard genetic routines, e.g.: a migration operator which is an analogy to
the biological random genetic drift.
๐
2025-11-06
๐ ArXiv Preprint (cs.SE)
๐ฅ Oshando Johnson, Alexandra Fomina, Ranjith Krishnamurthy, Vaibhav Chaudhari, Rohith Kumar Shanmuganathan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The prevalence of security vulnerabilities has prompted companies to adopt
static application security testing (SAST) tools for vulnerability detection.
Nevertheless, these tools frequently exhibit usability limitations, as their
generic warning messages do not sufficiently communicate important information
to developers, resulting in misunderstandings or oversight of critical
findings. In light of recent developments in Large Language Models (LLMs) and
their text generation capabilities, our work investigates a hybrid approach
that uses LLMs to tackle the SAST explainability challenges. In this paper, we
present SAFE, an Integrated Development Environment (IDE) plugin that leverages
GPT-4o to explain the causes, impacts, and mitigation strategies of
vulnerabilities detected by SAST tools. Our expert user study findings indicate
that the explanations generated by SAFE can significantly assist beginner to
intermediate developers in understanding and addressing security
vulnerabilities, thereby improving the overall usability of SAST tools.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Claire Yang, Maya Cakmak, Max Kleiman-Weiner
Robotics
ๆฅ็ๆ่ฆ
Empowerment, a measure of an agent's ability to control its environment, has
been proposed as a universal goal-agnostic objective for motivating assistive
behavior in AI agents. While multi-human settings like homes and hospitals are
promising for AI assistance, prior work on empowerment-based assistance assumes
that the agent assists one human in isolation. We introduce an open source
multi-human gridworld test suite Disempower-Grid. Using Disempower-Grid, we
empirically show that assistive RL agents optimizing for one human's
empowerment can significantly reduce another human's environmental influence
and rewards - a phenomenon we formalize as disempowerment. We characterize when
disempowerment occurs in these environments and show that joint empowerment
mitigates disempowerment at the cost of the user's reward. Our work reveals a
broader challenge for the AI alignment community: goal-agnostic objectives that
seem aligned in single-agent settings can become misaligned in multi-agent
contexts.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fatemehzahra Darzi, Rodrigo Escobar Diaz Guerrero, Thomas Bocklitz
Multimodal
ๆฅ็ๆ่ฆ
Image registration refers to the process of spatially aligning two or more
images by mapping them into a common coordinate system, so that corresponding
anatomical or tissue structures are matched across images. In digital
pathology, registration enables direct comparison and integration of
information from different stains or imaging modalities, sup-porting
applications such as biomarker analysis and tissue reconstruction. Accurate
registration of images from different modalities is an essential step in
digital pathology. In this study, we investigated how various color
transformation techniques affect image registration between hematoxylin and
eosin (H&E) stained images and non-linear multimodal images. We used a dataset
of 20 tissue sample pairs, with each pair undergoing several preprocessing
steps, including different color transformation (CycleGAN, Macenko, Reinhard,
Vahadane), inversion, contrast adjustment, intensity normalization, and
denoising. All images were registered using the VALIS registration method,
which first applies rigid registration and then performs non-rigid registration
in two steps on both low and high-resolution images. Registration performance
was evaluated using the relative Target Registration Error (rTRE). We reported
the median of median rTRE values (MMrTRE) and the average of median rTRE values
(AMrTRE) for each method. In addition, we performed a custom point-based
evaluation using ten manually selected key points. Registration was done
separately for two scenarios, using either the original or inverted multimodal
images. In both scenarios, CycleGAN color transformation achieved the lowest
registration errors, while the other methods showed higher errors. These
findings show that applying color transformation before registration improves
alignment between images from different modalities and supports more reliable
analysis in digital pathology.
๐
2025-11-06
๐ ArXiv Preprint (cs.SE)
๐ฅ Asma Yamani, Malak Baslyman, Moataz Ahmed
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly employed in software
engineering tasks such as requirements elicitation, design, and evaluation,
raising critical questions regarding their alignment with human judgments on
responsible AI values. This study investigates how closely LLMs' value
preferences align with those of two human groups: a US-representative sample
and AI practitioners. We evaluate 23 LLMs across four tasks: (T1) selecting key
responsible AI values, (T2) rating their importance in specific contexts, (T3)
resolving trade-offs between competing values, and (T4) prioritizing software
requirements that embody those values. The results show that LLMs generally
align more closely with AI practitioners than with the US-representative
sample, emphasizing fairness, privacy, transparency, safety, and
accountability. However, inconsistencies appear between the values that LLMs
claim to uphold (Tasks 1-3) and the way they prioritize requirements (Task 4),
revealing gaps in faithfulness between stated and applied behavior. These
findings highlight the practical risk of relying on LLMs in requirements
engineering without human oversight and motivate the need for systematic
approaches to benchmark, interpret, and monitor value alignment in AI-assisted
software development.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fahim Ahmed, Md Mubtasim Ahasan, Jahir Sadik Monon, Muntasir Wahed, M Ashraful Amin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Text-to-SQL systems provide a natural language interface that can enable even
laymen to access information stored in databases. However, existing Large
Language Models (LLM) struggle with SQL generation from natural instructions
due to large schema sizes and complex reasoning. Prior work often focuses on
complex, somewhat impractical pipelines using flagship models, while smaller,
efficient models remain overlooked. In this work, we explore three multi-agent
LLM pipelines, with systematic performance benchmarking across a range of small
to large open-source models: (1) Multi-agent discussion pipeline, where agents
iteratively critique and refine SQL queries, and a judge synthesizes the final
answer; (2) Planner-Coder pipeline, where a thinking model planner generates
stepwise SQL generation plans and a coder synthesizes queries; and (3)
Coder-Aggregator pipeline, where multiple coders independently generate SQL
queries, and a reasoning agent selects the best query. Experiments on the
Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small
model performance, with up to 10.6% increase in Execution Accuracy for
Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines,
the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B
and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest
score of 56.4%. Codes are available at
https://github.com/treeDweller98/bappa-sql.
๐
2025-11-06
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Boxuan Ma, Huiyong Li, Gen Li, Li Chen, Cheng Tang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Generative AI tools such as ChatGPT now provide novice programmers with
unprecedented access to instant, personalized support. While this holds clear
promise, their influence on students' metacognitive processes remains
underexplored. Existing work has largely focused on correctness and usability,
with limited attention to whether and how students' use of AI assistants
supports or bypasses key metacognitive processes. This study addresses that gap
by analyzing student-AI interactions through a metacognitive lens in
university-level programming courses. We examined more than 10,000 dialogue
logs collected over three years, complemented by surveys of students and
educators. Our analysis focused on how prompts and responses aligned with
metacognitive phases and strategies. Synthesizing these findings across data
sources, we distill design considerations for AI-powered coding assistants that
aim to support rather than supplant metacognitive engagement. Our findings
provide guidance for developing educational AI tools that strengthen students'
learning processes in programming education.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yujian Liu, Ze Wang, Hao Chen, Ximeng Sun, Xiaodong Yu et al.
ๆฅ็ๆ่ฆ
Computer-use agents can operate computers and automate laborious tasks, but
despite recent rapid progress, they still lag behind human users, especially
when tasks require domain-specific procedural knowledge about particular
applications, platforms, and multi-step workflows. Humans can bridge this gap
by watching video tutorials: we search, skim, and selectively imitate short
segments that match our current subgoal. In this paper, we study how to enable
computer-use agents to learn from online videos at inference time effectively.
We propose a framework that retrieves and filters tutorial videos, converts
them into structured demonstration trajectories, and dynamically selects
trajectories as in-context guidance during execution. Particularly, using a
VLM, we infer UI actions, segment videos into short subsequences of actions,
and assign each subsequence a textual objective. At inference time, a two-stage
selection mechanism dynamically chooses a single trajectory to add in context
at each step, focusing the agent on the most helpful local guidance for its
next decision. Experiments on two widely used benchmarks show that our
framework consistently outperforms strong base agents and variants that use
only textual tutorials or transcripts. Analyses highlight the importance of
trajectory segmentation and selection, action filtering, and visual
information, suggesting that abundant online videos can be systematically
distilled into actionable guidance that improves computer-use agents at
inference time. Our code is available at
https://github.com/UCSB-NLP-Chang/video_demo.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Miguel E. Andres, Vadim Fedorov, Rida Sadek, Enric Spagnolo-Arrizabalaga, Nadescha Trudel
ๆฅ็ๆ่ฆ
Voice AI agents are rapidly transitioning to production deployments, yet
systematic methods for ensuring testing reliability remain underdeveloped.
Organizations cannot objectively assess whether their testing approaches
(internal tools or external platforms) actually work, creating a critical
measurement gap as voice AI scales to billions of daily interactions.
We present the first systematic framework for evaluating voice AI testing
quality through human-centered benchmarking. Our methodology addresses the
fundamental dual challenge of testing platforms: generating realistic test
conversations (simulation quality) and accurately evaluating agent responses
(evaluation quality). The framework combines established psychometric
techniques (pairwise comparisons yielding Elo ratings, bootstrap confidence
intervals, and permutation tests) with rigorous statistical validation to
provide reproducible metrics applicable to any testing approach.
To validate the framework and demonstrate its utility, we conducted
comprehensive empirical evaluation of three leading commercial platforms
focused on Voice AI Testing using 21,600 human judgments across 45 simulations
and ground truth validation on 60 conversations. Results reveal statistically
significant performance differences with the proposed framework, with the
top-performing platform, Evalion, achieving 0.92 evaluation quality measured as
f1-score versus 0.73 for others, and 0.61 simulation quality using a league
based scoring system (including ties) vs 0.43 for other platforms.
This framework enables researchers and organizations to empirically validate
the testing capabilities of any platform, providing essential measurement
foundations for confident voice AI deployment at scale. Supporting materials
are made available to facilitate reproducibility and adoption.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengyu Tang, Zeyuan Lu, Jiazhi Dong, Changdong Yu, Xiaoyu Wang et al.
RoboticsComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Accurate perception of the marine environment through robust multi-object
tracking (MOT) is essential for ensuring safe vessel navigation and effective
maritime surveillance. However, the complicated maritime environment often
causes camera motion and subsequent visual degradation, posing significant
challenges to MOT. To address this challenge, we propose an efficient
Dual-branch Maritime SORT (DMSORT) method for maritime MOT. The core of the
framework is a parallel tracker with affine compensation, which incorporates an
object detection and re-identification (ReID) branch, along with a dedicated
branch for dynamic camera motion estimation. Specifically, a Reversible
Columnar Detection Network (RCDN) is integrated into the detection module to
leverage multi-level visual features for robust object detection. Furthermore,
a lightweight Transformer-based appearance extractor (Li-TAE) is designed to
capture global contextual information and generate robust appearance features.
Another branch decouples platform-induced and target-intrinsic motion by
constructing a projective transformation, applying platform-motion compensation
within the Kalman filter, and thereby stabilizing true object trajectories.
Finally, a clustering-optimized feature fusion module effectively combines
motion and appearance cues to ensure identity consistency under noise,
occlusion, and drift. Extensive evaluations on the Singapore Maritime Dataset
demonstrate that DMSORT achieves state-of-the-art performance. Notably, DMSORT
attains the fastest runtime among existing ReID-based MOT frameworks while
maintaining high identity consistency and robustness to jitter and occlusion.
Code is available at:
https://github.com/BiscuitsLzy/DMSORT-An-efficient-parallel-maritime-multi-object-tracking-architecture-.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Venkata Manikanta Desu, Syed Fawaz Ali
Machine Learning
ๆฅ็ๆ่ฆ
This study presents a complete pipeline for automated tennis match analysis.
Our framework integrates multiple deep learning models to detect and track
players and the tennis ball in real time, while also identifying court
keypoints for spatial reference. Using YOLOv8 for player detection, a
custom-trained YOLOv5 model for ball tracking, and a ResNet50-based
architecture for court keypoint detection, our system provides detailed
analytics including player movement patterns, ball speed, shot accuracy, and
player reaction times. The experimental results demonstrate robust performance
in varying court conditions and match scenarios. The model outputs an annotated
video along with detailed performance metrics, enabling coaches, broadcasters,
and players to gain actionable insights into the dynamics of the game.
๐
2025-11-06
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Paul Badu Yakubu, Lesther Santana, Mohamed Rahouti, Yufeng Xin, Abdellah Chehri et al.
Machine Learning
ๆฅ็ๆ่ฆ
With the increasing frequency and sophistication of Distributed Denial of
Service (DDoS) attacks, it has become critical to develop more efficient and
interpretable detection methods. Traditional detection systems often struggle
with scalability and transparency, hindering real-time response and
understanding of attack vectors. This paper presents an automated framework for
detecting and interpreting DDoS attacks using machine learning (ML). The
proposed method leverages the Tree-based Pipeline Optimization Tool (TPOT) to
automate the selection and optimization of ML models and features, reducing the
need for manual experimentation. SHapley Additive exPlanations (SHAP) is
incorporated to enhance model interpretability, providing detailed insights
into the contribution of individual features to the detection process. By
combining TPOT's automated pipeline selection with SHAP interpretability, this
approach improves the accuracy and transparency of DDoS detection. Experimental
results demonstrate that key features such as mean backward packet length and
minimum forward packet header length are critical in detecting DDoS attacks,
offering a scalable and explainable cybersecurity solution.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Moses Charikar, Chirag Pabbaraju, Ambuj Tewari
ๆฅ็ๆ่ฆ
We study the problem of language identification in the limit, where given a
sequence of examples from a target language, the goal of the learner is to
output a sequence of guesses for the target language such that all the guesses
beyond some finite time are correct. Classical results of Gold showed that
language identification in the limit is impossible for essentially any
interesting collection of languages. Later, Angluin gave a precise
characterization of language collections for which this task is possible.
Motivated by recent positive results for the related problem of language
generation, we revisit the classic language identification problem in the
setting where the learner is given the additional power of producing a list of
$k$ guesses at each time step. The goal is to ensure that beyond some finite
time, one of the guesses is correct at each time step.
We give an exact characterization of collections of languages that can be
$k$-list identified in the limit, based on a recursive version of Angluin's
characterization (for language identification with a list of size $1$). This
further leads to a conceptually appealing characterization: A language
collection can be $k$-list identified in the limit if and only if the
collection can be decomposed into $k$ collections of languages, each of which
can be identified in the limit (with a list of size $1$). We also use our
characterization to establish rates for list identification in the statistical
setting where the input is drawn as an i.i.d. stream from a distribution
supported on some language in the collection. Our results show that if a
collection is $k$-list identifiable in the limit, then the collection can be
$k$-list identified at an exponential rate, and this is best possible. On the
other hand, if a collection is not $k$-list identifiable in the limit, then it
cannot be $k$-list identified at any rate that goes to zero.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuanning Cui, Zequn Sun, Wei Hu, Zhangjie Fu
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel at reasoning but struggle with
knowledge-intensive questions due to limited context and parametric knowledge.
However, existing methods that rely on finetuned LLMs or GNN retrievers are
limited by dataset-specific tuning and scalability on large or unseen graphs.
We propose the LLM-KGFR collaborative framework, where an LLM works with a
structured retriever, the Knowledge Graph Foundation Retriever (KGFR). KGFR
encodes relations using LLM-generated descriptions and initializes entities
based on their roles in the question, enabling zero-shot generalization to
unseen KGs. To handle large graphs efficiently, it employs Asymmetric
Progressive Propagation (APP)- a stepwise expansion that selectively limits
high-degree nodes while retaining informative paths. Through node-, edge-, and
path-level interfaces, the LLM iteratively requests candidate answers,
supporting facts, and reasoning paths, forming a controllable reasoning loop.
Experiments demonstrate that LLM-KGFR achieves strong performance while
maintaining scalability and generalization, providing a practical solution for
KG-augmented reasoning.
๐
2025-11-06
๐ ArXiv Preprint (cs.LO)
๐ฅ Yang Xu, Peiyao Liu, Shuwei Chen, Jun Liu
ๆฅ็ๆ่ฆ
Currently, there is a lack of rigorous theoretical system for systematically
generating non-trivial and logically valid theorems. Addressing this critical
gap, this paper conducts research to propose a novel automated theorem
generation theory and tool. Based on the concept of standard contradiction
which possesses unique deductive advantages, this paper defines and proves, for
the first time, a new logical structure known as rectangular standard
contradiction. Centered on this structure, a complete Automated Theorem
Generation (ATG) theory is put forward. Theoretical proofs clarify two core
properties of rectangular standard contradiction: first, it is a standard
contradiction (necessarily unsatisfiable); second, it exhibits non-redundancy
(the remaining clause set becomes satisfiable after removing any clause).
Leveraging these properties, this paper proves that partitioning a rectangular
standard contradiction into a premise subset $A$ and negation of its complement
$H$, a valid theorem $A \vdash \neg H$ can be formed, and all such theorems are
logically equivalent. To implement this theory, an efficient template-based ATG
algorithm is designed, and a Rectangular Automated Theorem Generator is
developed. This research enables machines to transition from "verifiers" to
"discoverers", opening up new avenues for fundamental research in the fields of
logic and artificial intelligence.
๐
2025-11-06
๐ ArXiv Preprint (cs.SI)
๐ฅ Brigitte A. Mora-Reyes, Jennifer A. Drewyor, Abel A. Reyes-Angulo
Natural Language Processing
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) systems often reflect biases from economically
advanced regions, marginalizing contexts in economically developing regions
like Latin America due to imbalanced datasets. This paper examines AI
representations of diverse Latin American contexts, revealing disparities
between data from economically advanced and developing regions. We highlight
how the dominance of English over Spanish, Portuguese, and indigenous languages
such as Quechua and Nahuatl perpetuates biases, framing Latin American
perspectives through a Western lens. To address this, we introduce a culturally
aware dataset rooted in Latin American history and socio-political contexts,
challenging Eurocentric models. We evaluate six language models on questions
testing cultural context awareness, using a novel Cultural Expressiveness
metric, statistical tests, and linguistic analyses. Our findings show that some
models better capture Latin American perspectives, while others exhibit
significant sentiment misalignment (p < 0.001). Fine-tuning Mistral-7B with our
dataset improves its cultural expressiveness by 42.9%, advancing equitable AI
development. We advocate for equitable AI by prioritizing datasets that reflect
Latin American history, indigenous knowledge, and diverse languages, while
emphasizing community-centered approaches to amplify marginalized voices.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qingfeng Chen, Haojin Zeng, Jingyi Jie, Shichao Zhang, Debo Cheng
Machine Learning
ๆฅ็ๆ่ฆ
With the rapid growth of graph-structured data in critical domains,
unsupervised graph-level anomaly detection (UGAD) has become a pivotal task.
UGAD seeks to identify entire graphs that deviate from normal behavioral
patterns. However, most Graph Neural Network (GNN) approaches implicitly assume
that the training set is clean, containing only normal graphs, which is rarely
true in practice. Even modest contamination by anomalous graphs can distort
learned representations and sharply degrade performance. To address this
challenge, we propose DeNoise, a robust UGAD framework explicitly designed for
contaminated training data. It jointly optimizes a graph-level encoder, an
attribute decoder, and a structure decoder via an adversarial objective to
learn noise-resistant embeddings. Further, DeNoise introduces an encoder
anchor-alignment denoising mechanism that fuses high-information node
embeddings from normal graphs into all graph embeddings, improving
representation quality while suppressing anomaly interference. A contrastive
learning component then compacts normal graph embeddings and repels anomalous
ones in the latent space. Extensive experiments on eight real-world datasets
demonstrate that DeNoise consistently learns reliable graph-level
representations under varying noise intensities and significantly outperforms
state-of-the-art UGAD baselines.
๐
2025-11-06
๐ AAAI
๐ฅ Hao Li, Haotian Chen, Ruoyuan Gong, Juanjuan Wang, Hao Jiang
Natural Language Processing
ๆฅ็ๆ่ฆ
Redistricting plays a central role in shaping how votes are translated into
political power. While existing computational methods primarily aim to generate
large ensembles of legally valid districting plans, they often neglect the
strategic dynamics involved in the selection process. This oversight creates
opportunities for partisan actors to cherry-pick maps that, while technically
compliant, are politically advantageous. Simply satisfying formal constraints
does not ensure fairness when the selection process itself can be manipulated.
We propose \textbf{Agentmandering}, a framework that reimagines redistricting
as a turn-based negotiation between two agents representing opposing political
interests. Drawing inspiration from game-theoretic ideas, particularly the
\textit{Choose-and-Freeze} protocol, our method embeds strategic interaction
into the redistricting process via large language model (LLM) agents. Agents
alternate between selecting and freezing districts from a small set of
candidate maps, gradually partitioning the state through constrained and
interpretable choices. Evaluation on post-2020 U.S. Census data across all
states shows that Agentmandering significantly reduces partisan bias and
unfairness, while achieving 2 to 3 orders of magnitude lower variance than
standard baselines. These results demonstrate both fairness and stability,
especially in swing-state scenarios. Our code is available at
https://github.com/Lihaogx/AgentMandering.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fatemeh Hosseinabadi, Seyedhassan Sharifi
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Accurate segmentation of the left atrium (LA) from cardiac MRI is critical
for guiding atrial fibrillation (AF) ablation and constructing biophysical
cardiac models. Manual delineation is time-consuming, observer-dependent, and
impractical for large-scale or time-sensitive clinical workflows. Deep learning
methods, particularly convolutional architectures, have recently demonstrated
superior performance in medical image segmentation tasks. In this study, we
applied the nnU-Net framework, an automated, self-configuring deep learning
segmentation architecture, to the Left Atrial Segmentation Challenge 2013
dataset. The dataset consists of thirty MRI scans with corresponding
expert-annotated masks. The nnU-Net model automatically adapted its
preprocessing, network configuration, and training pipeline to the
characteristics of the MRI data. Model performance was quantitatively evaluated
using the Dice similarity coefficient (DSC), and qualitative results were
compared against expert segmentations. The proposed nnUNet model achieved a
mean Dice score of 93.5, demonstrating high overlap with expert annotations and
outperforming several traditional segmentation approaches reported in previous
studies. The network exhibited robust generalization across variations in left
atrial shape, contrast, and image quality, accurately delineating both the
atrial body and proximal pulmonary veins.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fatemeh Hosseinabadi, Seyedhassan Sharifi
Machine Learning
ๆฅ็ๆ่ฆ
Pediatric appendicitis remains one of the most common causes of acute
abdominal pain in children, and its diagnosis continues to challenge clinicians
due to overlapping symptoms and variable imaging quality. This study aims to
develop and evaluate a deep learning model based on a pretrained ResNet
architecture for automated detection of appendicitis from ultrasound images. We
used the Regensburg Pediatric Appendicitis Dataset, which includes ultrasound
scans, laboratory data, and clinical scores from pediatric patients admitted
with abdominal pain to Children Hospital. Hedwig in Regensburg, Germany. Each
subject had 1 to 15 ultrasound views covering the right lower quadrant,
appendix, lymph nodes, and related structures. For the image based
classification task, ResNet was fine tuned to distinguish appendicitis from
non-appendicitis cases. Images were preprocessed by normalization, resizing,
and augmentation to enhance generalization. The proposed ResNet model achieved
an overall accuracy of 93.44, precision of 91.53, and recall of 89.8,
demonstrating strong performance in identifying appendicitis across
heterogeneous ultrasound views. The model effectively learned discriminative
spatial features, overcoming challenges posed by low contrast, speckle noise,
and anatomical variability in pediatric imaging.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Hirohane Takagi, Gouki Minegishi, Shota Kizawa, Issey Sukeda, Hitomi Yanaka
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Although behavioral studies have documented numerical reasoning errors in
large language models (LLMs), the underlying representational mechanisms remain
unclear. We hypothesize that numerical attributes occupy shared latent
subspaces and investigate two questions:(1) How do LLMs internally integrate
multiple numerical attributes of a single entity? (2)How does irrelevant
numerical context perturb these representations and their downstream outputs?
To address these questions, we combine linear probing with partial correlation
analysis and prompt-based vulnerability tests across models of varying sizes.
Our results show that LLMs encode real-world numerical correlations but tend to
systematically amplify them. Moreover, irrelevant context induces consistent
shifts in magnitude representations, with downstream effects that vary by model
size. These findings reveal a vulnerability in LLM decision-making and lay the
groundwork for fairer, representation-aware control under multi-attribute
entanglement.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Kailun Ji, Xiaoyu Hu, Xinyu Zhang, Jun Chen
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large-scale disaster Search And Rescue (SAR) operations are persistently
challenged by complex terrain and disrupted communications. While Unmanned
Aerial Vehicle (UAV) swarms offer a promising solution for tasks like wide-area
search and supply delivery, yet their effective coordination places a
significant cognitive burden on human operators. The core human-machine
collaboration bottleneck lies in the ``intention-to-action gap'', which is an
error-prone process of translating a high-level rescue objective into a
low-level swarm command under high intensity and pressure. To bridge this gap,
this study proposes a novel LLM-CRF system that leverages Large Language Models
(LLMs) to model and augment human-swarm teaming cognition. The proposed
framework initially captures the operator's intention through natural and
multi-modal interactions with the device via voice or graphical annotations. It
then employs the LLM as a cognitive engine to perform intention comprehension,
hierarchical task decomposition, and mission planning for the UAV swarm. This
closed-loop framework enables the swarm to act as a proactive partner,
providing active feedback in real-time while reducing the need for manual
monitoring and control, which considerably advances the efficacy of the SAR
task. We evaluate the proposed framework in a simulated SAR scenario.
Experimental results demonstrate that, compared to traditional order and
command-based interfaces, the proposed LLM-driven approach reduced task
completion time by approximately $64.2\%$ and improved task success rate by
$7\%$. It also leads to a considerable reduction in subjective cognitive
workload, with NASA-TLX scores dropping by $42.9\%$. This work establishes the
potential of LLMs to create more intuitive and effective human-swarm
collaborations in high-stakes scenarios.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Divya Pathak, Harshit Kumar, Anuska Roy, Felix George, Mudit Verma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-Agentic AI systems, powered by large language models (LLMs), are
inherently non-deterministic and prone to silent failures such as drift,
cycles, and missing details in outputs, which are difficult to detect. We
introduce the task of anomaly detection in agentic trajectories to identify
these failures and present a dataset curation pipeline that captures user
behavior, agent non-determinism, and LLM variation. Using this pipeline, we
curate and label two benchmark datasets comprising \textbf{4,275 and 894}
trajectories from Multi-Agentic AI systems. Benchmarking anomaly detection
methods on these datasets, we show that supervised (XGBoost) and
semi-supervised (SVDD) approaches perform comparably, achieving accuracies up
to 98% and 96%, respectively. This work provides the first systematic study of
anomaly detection in Multi-Agentic AI systems, offering datasets, benchmarks,
and insights to guide future research.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shiyin Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) enhanced with retrieval -- commonly referred to
as Retrieval-Augmented Generation (RAG) -- have demonstrated strong performance
in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved
evidence is incomplete, leaving gaps in the reasoning process. In such cases,
\emph{abductive inference} -- the process of generating plausible missing
premises to explain observations -- offers a principled approach to bridge
these gaps. In this paper, we propose a framework that integrates abductive
inference into retrieval-augmented LLMs. Our method detects insufficient
evidence, generates candidate missing premises, and validates them through
consistency and plausibility checks. Experimental results on abductive
reasoning and multi-hop QA benchmarks show that our approach improves both
answer accuracy and reasoning faithfulness. This work highlights abductive
inference as a promising direction for enhancing the robustness and
explainability of RAG systems.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved near-human performance across
diverse reasoning tasks, yet their deployment on resource-constrained
Internet-of-Things (IoT) devices remains impractical due to massive parameter
footprints and memory-intensive autoregressive decoding. While split computing
offers a promising solution by partitioning model execution between edge
devices and cloud servers, existing approaches fail to address the unique
challenges of autoregressive inference, particularly the iterative token
generation process and expanding key-value (KV) cache requirements. This work
introduces the first autoregressive-aware split computing framework designed
explicitly for LLM deployment on edge devices. Our approach makes three key
contributions. First, we develop one-point split compression (OPSC), a
mixed-precision quantization scheme that prevents out-of-memory failures by
strategically partitioning models into front-end and back-end segments with
different precision levels. Second, we propose a two-stage intermediate
compression pipeline that combines threshold splitting (TS) and token-wise
adaptive bit quantization (TAB-Q) to preserve accuracy-critical activations
while dramatically reducing communication overhead. Third, we formulate a
unified optimization framework that jointly selects optimal split points,
quantization settings, and sequence lengths to satisfy strict memory and
latency constraints. Extensive evaluations across diverse LLMs and hardware
platforms demonstrate superior performance compared to state-of-the-art
quantization methods, including SmoothQuant, OmniQuant, and Atom. The framework
achieves a 1.49 inference speedup and significant communication overhead
reduction while maintaining or improving model accuracy.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ J. Nathan Kutz, Peter Battaglia, Michael Brenner, Kevin Carlberg, Aric Hagberg et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Machine learning (ML) and artificial intelligence (AI) algorithms are
transforming and empowering the characterization and control of dynamic systems
in the engineering, physical, and biological sciences. These emerging modeling
paradigms require comparative metrics to evaluate a diverse set of scientific
objectives, including forecasting, state reconstruction, generalization, and
control, while also considering limited data scenarios and noisy measurements.
We introduce a common task framework (CTF) for science and engineering, which
features a growing collection of challenge data sets with a diverse set of
practical and common objectives. The CTF is a critically enabling technology
that has contributed to the rapid advance of ML/AI algorithms in traditional
applications such as speech recognition, language processing, and computer
vision. There is a critical need for the objective metrics of a CTF to compare
the diverse algorithms being rapidly developed and deployed in practice today
across science and engineering.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Kyaw Hpone Myint, Zhe Wu, Alexandre G. R. Day, Giri Iyengar
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Decision trees are widely used in high-stakes fields like finance and
healthcare due to their interpretability. This work introduces an efficient,
scalable method for generating synthetic pre-training data to enable
meta-learning of decision trees. Our approach samples near-optimal decision
trees synthetically, creating large-scale, realistic datasets. Using the
MetaTree transformer architecture, we demonstrate that this method achieves
performance comparable to pre-training on real-world data or with
computationally expensive optimal decision trees. This strategy significantly
reduces computational costs, enhances data generation flexibility, and paves
the way for scalable and efficient meta-learning of interpretable decision tree
models.
๐
2025-11-06
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Shiyin Lin
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Software fuzzing has become a cornerstone in automated vulnerability
discovery, yet existing mutation strategies often lack semantic awareness,
leading to redundant test cases and slow exploration of deep program states. In
this work, I present a hybrid fuzzing framework that integrates static and
dynamic analysis with Large Language Model (LLM)-guided input mutation and
semantic feedback. Static analysis extracts control-flow and data-flow
information, which is transformed into structured prompts for the LLM to
generate syntactically valid and semantically diverse inputs. During execution,
I augment traditional coverage-based feedback with semantic feedback
signals-derived from program state changes, exception types, and output
semantics-allowing the fuzzer to prioritize inputs that trigger novel program
behaviors beyond mere code coverage. I implement our approach atop AFL++,
combining program instrumentation with embedding-based semantic similarity
metrics to guide seed selection. Evaluation on real-world open-source targets,
including libpng, tcpdump, and sqlite, demonstrates that our method achieves
faster time-to-first-bug, higher semantic diversity, and a competitive number
of unique bugs compared to state-of-the-art fuzzers. This work highlights the
potential of combining LLM reasoning with semantic-aware feedback to accelerate
and deepen vulnerability discovery.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Berk Iskar, Michael Taynnan Barros
Machine Learning
ๆฅ็ๆ่ฆ
Network anomaly detection systems encounter several challenges with
traditional detectors trained offline. They become susceptible to concept drift
and new threats such as zero-day or polymorphic attacks. To address this
limitation, we propose a Ca$^{2+}$-modulated learning framework that draws
inspiration from astrocytic Ca$^{2+}$ signaling in the brain, where rapid,
context-sensitive adaptation enables robust information processing. Our
approach couples a multicellular astrocyte dynamics simulator with a deep
neural network (DNN). The simulator models astrocytic Ca$^{2+}$ dynamics
through three key mechanisms: IP$_3$-mediated Ca$^{2+}$ release, SERCA pump
uptake, and conductance-aware diffusion through gap junctions between cells.
Evaluation of our proposed network on CTU-13 (Neris) network traffic data
demonstrates the effectiveness of our biologically plausible approach. The
Ca$^{2+}$-gated model outperforms a matched baseline DNN, achieving up to
$\sim$98\% accuracy with reduced false positives and negatives across multiple
train/test splits. Importantly, this improved performance comes with negligible
runtime overhead once Ca$^{2+}$ trajectories are precomputed. While
demonstrated here for cybersecurity applications, this Ca$^{2+}$-modulated
learning framework offers a generic solution for streaming detection tasks that
require rapid, biologically grounded adaptation to evolving data patterns.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhuowen Yuan, Tao Liu, Yang Yang, Yang Wang, Feng Qi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recent LLM-based agents have demonstrated strong capabilities in automated ML
engineering. However, they heavily rely on repeated full training runs to
evaluate candidate solutions, resulting in significant computational overhead,
limited scalability to large search spaces, and slow iteration cycles. To
address these challenges, we introduce ArchPilot, a multi-agent system that
integrates architecture generation, proxy-based evaluation, and adaptive search
into a unified framework. ArchPilot consists of three specialized agents: an
orchestration agent that coordinates the search process using a Monte Carlo
Tree Search (MCTS)-inspired novel algorithm with a restart mechanism and
manages memory of previous candidates; a generation agent that iteratively
generates, improves, and debugs candidate architectures; and an evaluation
agent that executes proxy training runs, generates and optimizes proxy
functions, and aggregates the proxy scores into a fidelity-aware performance
metric. This multi-agent collaboration allows ArchPilot to prioritize
high-potential candidates with minimal reliance on expensive full training
runs, facilitating efficient ML engineering under limited budgets. Experiments
on MLE-Bench demonstrate that ArchPilot outperforms SOTA baselines such as AIDE
and ML-Master, validating the effectiveness of our multi-agent system.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bram Bultรฉ, Ayla Rigouts Terryn
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are rapidly being adopted by users across the
globe, who interact with them in a diverse range of languages. At the same
time, there are well-documented imbalances in the training data and
optimisation objectives of this technology, raising doubts as to whether LLMs
can represent the cultural diversity of their broad user base. In this study,
we look at LLMs and cultural values and examine how prompt language and
cultural framing influence model responses and their alignment with human
values in different countries. We probe 10 LLMs with 63 items from the Hofstede
Values Survey Module and World Values Survey, translated into 11 languages, and
formulated as prompts with and without different explicit cultural
perspectives. Our study confirms that both prompt language and cultural
perspective produce variation in LLM outputs, but with an important caveat:
While targeted prompting can, to a certain extent, steer LLM responses in the
direction of the predominant values of the corresponding countries, it does not
overcome the models' systematic bias toward the values associated with a
restricted set of countries in our dataset: the Netherlands, Germany, the US,
and Japan. All tested models, regardless of their origin, exhibit remarkably
similar patterns: They produce fairly neutral responses on most topics, with
selective progressive stances on issues such as social tolerance. Alignment
with cultural values of human respondents is improved more with an explicit
cultural perspective than with a targeted prompt language. Unexpectedly,
combining both approaches is no more effective than cultural framing with an
English prompt. These findings reveal that LLMs occupy an uncomfortable middle
ground: They are responsive enough to changes in prompts to produce variation,
but too firmly anchored to specific cultural defaults to adequately represent
cultural diversity.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xu Zou
Natural Language Processing
ๆฅ็ๆ่ฆ
Since its emergence, SARS-CoV-2 has demonstrated a rapid and unpredictable
evolutionary trajectory, characterized by the continual emergence of
immune-evasive variants. This poses persistent challenges to public health and
vaccine development.
While large-scale generative pre-trained transformers (GPTs) have
revolutionized the modeling of sequential data, their direct applications to
noisy viral genomic sequences are limited. In this paper, we introduce
PETRA(Pretrained Evolutionary TRAnsformer), a novel transformer approach based
on evolutionary trajectories derived from phylogenetic trees rather than raw
RNA sequences. This method effectively mitigates sequencing noise and captures
the hierarchical structure of viral evolution.
With a weighted training framework to address substantial geographical and
temporal imbalances in global sequence data, PETRA excels in predicting future
SARS-CoV-2 mutations, achieving a weighted recall@1 of 9.45% for nucleotide
mutations and 17.10\% for spike amino-acid mutations, compared to 0.49% and
6.64% respectively for the best baseline. PETRA also demonstrates its ability
to aid in the real-time mutation prediction of major clades like 24F(XEC) and
25A(LP.8.1). The code is open sourced on https://github.com/xz-keg/PETra
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhejia Cai, Puhua Jiang, Shiwei Mao, Hongkun Cao, Ruqi Huang
Machine Learning
ๆฅ็ๆ่ฆ
Reconstructing real-world objects from multi-view images is essential for
applications in 3D editing, AR/VR, and digital content creation. Existing
methods typically prioritize either geometric accuracy (Multi-View Stereo) or
photorealistic rendering (Novel View Synthesis), often decoupling geometry and
appearance optimization, which hinders downstream editing tasks. This paper
advocates an unified treatment on geometry and appearance optimization for
seamless Gaussian-mesh joint optimization. More specifically, we propose a
novel framework that simultaneously optimizes mesh geometry (vertex positions
and faces) and vertex colors via Gaussian-guided mesh differentiable rendering,
leveraging photometric consistency from input images and geometric
regularization from normal and depth maps. The obtained high-quality 3D
reconstruction can be further exploit in down-stream editing tasks, such as
relighting and shape deformation. The code will be publicly available upon
acceptance.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Kevin Hong, Kia Karbasi, Gregory Pottie
Machine Learning
ๆฅ็ๆ่ฆ
A longstanding goal in computational educational research is to develop
explainable knowledge tracing (KT) models. Deep Knowledge Tracing (DKT), which
leverages a Recurrent Neural Network (RNN) to predict student knowledge and
performance on exercises, has been proposed as a major advancement over
traditional KT methods. Several studies suggest that its performance gains stem
from its ability to model bidirectional relationships between different
knowledge components (KCs) within a course, enabling the inference of a
student's understanding of one KC from their performance on others. In this
paper, we challenge this prevailing explanation and demonstrate that DKT's
strength lies in its implicit ability to model prerequisite relationships as a
causal structure, rather than bidirectional relationships. By pruning exercise
relation graphs into Directed Acyclic Graphs (DAGs) and training DKT on causal
subsets of the Assistments dataset, we show that DKT's predictive capabilities
align strongly with these causal structures. Furthermore, we propose an
alternative method for extracting exercise relation DAGs using DKT's learned
representations and provide empirical evidence supporting our claim. Our
findings suggest that DKT's effectiveness is largely driven by its capacity to
approximate causal dependencies between KCs rather than simple relational
mappings.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fu-Chun Yang, Jason Eshraghian
Natural Language Processing
ๆฅ็ๆ่ฆ
In multi-agent settings, such as debate, reflection, or tool-calling, large
language models (LLMs) pass messages as plain tokens, discarding most latent
semantics. This constrains information transfer and adds unnecessary
computational overhead. We form a latent bridge via vector translations, which
use learned mappings that enable direct semantic exchange between
representation spaces. A dual-encoder translator trained between Llama-2-7B and
Mistral-7B-Instruct attains an average cosine alignment of 0.538. Injecting the
translated vectors at 30 percent blending strength steers the target model's
generation without destabilizing logits. Bidirectional evaluation shows a
2.01:1 transfer asymmetry, indicating that general-purpose models yield more
transferable representations than instruction-tuned variants. This conservative
injection preserves computational stability while demonstrating that
cross-model latent communication is feasible, enabling collaborative AI systems
that share meaning rather than tokens.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Raghav Sharma, Manan Mehta, Sai Tiger Raina
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning from Human Feedback (RLHF) is the standard for
aligning Large Language Models (LLMs), yet recent progress has moved beyond
canonical text-based methods. This survey synthesizes the new frontier of
alignment research by addressing critical gaps in multi-modal alignment,
cultural fairness, and low-latency optimization. To systematically explore
these domains, we first review foundational algo- rithms, including PPO, DPO,
and GRPO, before presenting a detailed analysis of the latest innovations. By
providing a comparative synthesis of these techniques and outlining open
challenges, this work serves as an essential roadmap for researchers building
more robust, efficient, and equitable AI systems.
๐
2025-11-06
๐ ArXiv Preprint (cs.SE)
๐ฅ Athma Narayanan, Mahesh Subedar, Omesh Tickoo
ๆฅ็ๆ่ฆ
We present an agentic flow consisting of multiple agents that combine
specialized LLMs and hardware simulation tools to collaboratively complete the
complex task of Register Transfer Level (RTL) generation without human
intervention. A key feature of the proposed flow is the progressive error
feedback system of agents (PEFA), a self-correcting mechanism that leverages
iterative error feedback to progressively increase the complexity of the
approach. The generated RTL includes checks for compilation, functional
correctness, and synthesizable constructs. To validate this adaptive approach
to code generation, benchmarking is performed using two opensource natural
language-to-RTL datasets. We demonstrate the benefits of the proposed approach
implemented on an open source agentic framework, using both open- and
closed-source LLMs, effectively bridging the performance gap between them.
Compared to previously published methods, our approach sets a new benchmark,
providing state-of-the-art pass rates while being efficient in token counts.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Nemotron Nano V2 VL, the latest model of the Nemotron
vision-language series designed for strong real-world document understanding,
long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers
significant improvements over our previous model,
Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major
enhancements in model architecture, datasets, and training recipes. Nemotron
Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and
innovative token reduction techniques to achieve higher inference throughput in
long document and video scenarios. We are releasing model checkpoints in BF16,
FP8, and FP4 formats and sharing large parts of our datasets, recipes and
training code.
๐
2025-11-06
๐ ArXiv Preprint (cs.SE)
๐ฅ Nikta Akbarpour, Mahdieh Sadat Benis, Fatemeh Hendijani Fard, Ali Ouni, Mohamed Aymen Saied
ๆฅ็ๆ่ฆ
Automated Program Repair (APR) has advanced rapidly with Large Language
Models (LLMs), but most existing methods remain computationally expensive, and
focused on a small set of languages. Ruby, despite its widespread use in web
development and the persistent challenges faced by its developers, has received
little attention in APR research. In this paper, we introduce RAMP, a novel
lightweight framework that formulates program repair as a feedback-driven,
iterative process for Ruby. RAMP employs a team of collaborative agents that
generate targeted tests, reflect on errors, and refine candidate fixes until a
correct solution is found. Unlike prior approaches, RAMP is designed to avoid
reliance on large multilingual repair databases or costly fine-tuning, instead
operating directly on Ruby through lightweight prompting and test-driven
feedback. Evaluation on the XCodeEval benchmark shows that RAMP achieves a
pass@1 of 67% on Ruby, outper-forming prior approaches. RAMP converges quickly
within five iterations, and ablation studies confirm that test generation and
self-reflection are key drivers of its performance. Further analysis shows that
RAMP is particularly effective at repairing wrong answers, compilation errors,
and runtime errors. Our approach provides new insights into multi-agent repair
strategies, and establishes a foundation for extending LLM-based debugging
tools to under-studied languages.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rafe Loya, Andrew Hamara, Benjamin Estell, Benjamin Kilpatrick, Andrew C. Freeman
ๆฅ็ๆ่ฆ
Automatic image cropping is a method for maximizing the human-perceived
quality of cropped regions in photographs. Although several works have proposed
techniques for producing singular crops, little work has addressed the problem
of producing multiple, distinct crops with aesthetic appeal. In this paper, we
motivate the problem with a discussion on modern social media applications,
introduce a dataset of 277 relevant images and human labels, and evaluate the
efficacy of several single-crop models with an image partitioning algorithm as
a pre-processing step. The dataset is available at
https://github.com/RafeLoya/carousel.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Qingzhou Lu, Yao Feng, Baiyu Shi, Michael Piseno, Zhenan Bao et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Humanoid robots are expected to operate in human-centered environments where
safe and natural physical interaction is essential. However, most recent
reinforcement learning (RL) policies emphasize rigid tracking and suppress
external forces. Existing impedance-augmented approaches are typically
restricted to base or end-effector control and focus on resisting extreme
forces rather than enabling compliance. We introduce GentleHumanoid, a
framework that integrates impedance control into a whole-body motion tracking
policy to achieve upper-body compliance. At its core is a unified spring-based
formulation that models both resistive contacts (restoring forces when pressing
against surfaces) and guiding contacts (pushes or pulls sampled from human
motion data). This formulation ensures kinematically consistent forces across
the shoulder, elbow, and wrist, while exposing the policy to diverse
interaction scenarios. Safety is further supported through task-adjustable
force thresholds. We evaluate our approach in both simulation and on the
Unitree G1 humanoid across tasks requiring different levels of compliance,
including gentle hugging, sit-to-stand assistance, and safe object
manipulation. Compared to baselines, our policy consistently reduces peak
contact forces while maintaining task success, resulting in smoother and more
natural interactions. These results highlight a step toward humanoid robots
that can safely and effectively collaborate with humans and handle objects in
real-world environments.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Yihong Sun, Xinyu Yang, Jennifer J. Sun, Bharath Hariharan
ๆฅ็ๆ่ฆ
Real-world objects frequently undergo state transformations. From an apple
being cut into pieces to a butterfly emerging from its cocoon, tracking through
these changes is important for understanding real-world objects and dynamics.
However, existing methods often lose track of the target object after
transformation, due to significant changes in object appearance. To address
this limitation, we introduce the task of Track Any State: tracking objects
through transformations while detecting and describing state changes,
accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we
present TubeletGraph, a zero-shot system that recovers missing objects after
transformation and maps out how object states are evolving over time.
TubeletGraph first identifies potentially overlooked tracks, and determines
whether they should be integrated based on semantic and proximity priors. Then,
it reasons about the added tracks and generates a state graph describing each
observed transformation. TubeletGraph achieves state-of-the-art tracking
performance under transformations, while demonstrating deeper understanding of
object transformations and promising capabilities in temporal grounding and
semantic reasoning for complex object transformations. Code, additional
results, and the benchmark dataset are available at
https://tubelet-graph.github.io.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Jinlai Liu, Jian Han, Bin Yan, Hui Wu, Fengda Zhu et al.
Machine Learning
ๆฅ็ๆ่ฆ
We introduce InfinityStar, a unified spacetime autoregressive framework for
high-resolution image and dynamic video synthesis. Building on the recent
success of autoregressive modeling in both vision and language, our purely
discrete approach jointly captures spatial and temporal dependencies within a
single architecture. This unified design naturally supports a variety of
generation tasks such as text-to-image, text-to-video, image-to-video, and long
interactive video synthesis via straightforward temporal autoregression.
Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench,
outperforming all autoregressive models by large margins, even surpassing some
diffusion competitors like HunyuanVideo. Without extra optimizations, our model
generates a 5s, 720p video approximately 10x faster than leading
diffusion-based methods. To our knowledge, InfinityStar is the first discrete
autoregressive video generator capable of producing industrial level 720p
videos. We release all code and models to foster further research in efficient,
high-quality video generation.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du et al.
Robotics
ๆฅ็ๆ่ฆ
Human videos can be recorded quickly and at scale, making them an appealing
source of training data for robot learning. However, humans and robots differ
fundamentally in embodiment, resulting in mismatched action execution. Direct
kinematic retargeting of human hand motion can therefore produce actions that
are physically infeasible for robots. Despite these low-level differences,
human demonstrations provide valuable motion cues about how to manipulate and
interact with objects. Our key idea is to exploit the forward diffusion
process: as noise is added to actions, low-level execution differences fade
while high-level task guidance is preserved. We present X-Diffusion, a
principled framework for training diffusion policies that maximally leverages
human data without learning dynamically infeasible motions. X-Diffusion first
trains a classifier to predict whether a noisy action is executed by a human or
robot. Then, a human action is incorporated into policy training only after
adding sufficient noise such that the classifier cannot discern its embodiment.
Actions consistent with robot execution supervise fine-grained denoising at low
noise levels, while mismatched human actions provide only coarse guidance at
higher noise levels. Our experiments show that naive co-training under
execution mismatches degrades policy performance, while X-Diffusion
consistently improves it. Across five manipulation tasks, X-Diffusion achieves
a 16% higher average success rate than the best baseline. The project website
is available at https://portal-cornell.github.io/X-Diffusion/.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang et al.
Multimodal
ๆฅ็ๆ่ฆ
We argue that progress in true multimodal intelligence calls for a shift from
reactive, task-driven systems and brute-force long context towards a broader
paradigm of supersensing. We frame spatial supersensing as four stages beyond
linguistic-only understanding: semantic perception (naming what is seen),
streaming event cognition (maintaining memory across continuous experiences),
implicit 3D spatial cognition (inferring the world behind pixels), and
predictive world modeling (creating internal models that filter and organize
information). Current benchmarks largely test only the early stages, offering
narrow coverage of spatial cognition and rarely challenging models in ways that
require true world modeling. To drive progress in spatial supersensing, we
present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial
recall) and VSC (continual visual spatial counting). These tasks require
arbitrarily long video inputs yet are resistant to brute-force context
expansion. We then test data scaling limits by curating VSI-590K and training
Cambrian-S, achieving +30% absolute improvement on VSI-Bench without
sacrificing general capabilities. Yet performance on VSI-SUPER remains limited,
indicating that scale alone is insufficient for spatial supersensing. We
propose predictive sensing as a path forward, presenting a proof-of-concept in
which a self-supervised next-latent-frame predictor leverages surprise
(prediction error) to drive memory and event segmentation. On VSI-SUPER, this
approach substantially outperforms leading proprietary baselines, showing that
spatial supersensing requires models that not only see but also anticipate,
select, and organize experience.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Despite impressive high-level video comprehension, multimodal language models
struggle with spatial reasoning across time and space. While current spatial
training approaches rely on real-world video data, obtaining diverse footage
with precise spatial annotations remains a bottleneck. To alleviate this
bottleneck, we present SIMS-V -- a systematic data-generation framework that
leverages the privileged information of 3D simulators to create spatially-rich
video training data for multimodal language models. Using this framework, we
investigate which properties of simulated data drive effective real-world
transfer through systematic ablations of question types, mixes, and scales. We
identify a minimal set of three question categories (metric measurement,
perspective-dependent reasoning, and temporal tracking) that prove most
effective for developing transferable spatial intelligence, outperforming
comprehensive coverage despite using fewer question types. These insights
enable highly efficient training: our 7B-parameter video LLM fine-tuned on just
25K simulated examples outperforms the larger 72B baseline and achieves
competitive performance with proprietary models on rigorous real-world spatial
reasoning benchmarks. Our approach demonstrates robust generalization,
maintaining performance on general video understanding while showing
substantial improvements on embodied and real-world spatial tasks.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song et al.
Robotics
ๆฅ็ๆ่ฆ
Robotic manipulation policies are advancing rapidly, but their direct
evaluation in the real world remains costly, time-consuming, and difficult to
reproduce, particularly for tasks involving deformable objects. Simulation
provides a scalable and systematic alternative, yet existing simulators often
fail to capture the coupled visual and physical complexity of soft-body
interactions. We present a real-to-sim policy evaluation framework that
constructs soft-body digital twins from real-world videos and renders robots,
objects, and environments with photorealistic fidelity using 3D Gaussian
Splatting. We validate our approach on representative deformable manipulation
tasks, including plush toy packing, rope routing, and T-block pushing,
demonstrating that simulated rollouts correlate strongly with real-world
execution performance and reveal key behavioral patterns of learned policies.
Our results suggest that combining physics-informed reconstruction with
high-quality rendering enables reproducible, scalable, and accurate evaluation
of robotic manipulation policies. Website: https://real2sim-eval.github.io/
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Robust benchmarks are crucial for evaluating Multimodal Large Language Models
(MLLMs). Yet we find that models can ace many multimodal benchmarks without
strong visual understanding, instead exploiting biases, linguistic priors, and
superficial patterns. This is especially problematic for vision-centric
benchmarks that are meant to require visual inputs. We adopt a diagnostic
principle for benchmark design: if a benchmark can be gamed, it will be.
Designers should therefore try to ``game'' their own benchmarks first, using
diagnostic and debiasing procedures to systematically identify and mitigate
non-visual biases. Effective diagnosis requires directly ``training on the test
set'' -- probing the released test set for its intrinsic, exploitable patterns.
We operationalize this standard with two components. First, we diagnose
benchmark susceptibility using a ``Test-set Stress-Test'' (TsT) methodology.
Our primary diagnostic tool involves fine-tuning a powerful Large Language
Model via $k$-fold cross-validation on exclusively the non-visual, textual
inputs of the test set to reveal shortcut performance and assign each sample a
bias score $s(x)$. We complement this with a lightweight Random Forest-based
diagnostic operating on hand-crafted features for fast, interpretable auditing.
Second, we debias benchmarks by filtering high-bias samples using an
``Iterative Bias Pruning'' (IBP) procedure. Applying this framework to four
benchmarks -- VSI-Bench, CV-Bench, MMMU, and VideoMME -- we uncover pervasive
non-visual biases. As a case study, we apply our full framework to create
VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider
vision-blind performance gap than the original.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mantas ลฝurauskas, Tom Bu, Sanaz Alali, Beyza Kalkanli, Derek Shi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Polarization-resolved near-infrared imaging adds a useful optical contrast
mechanism to eye tracking by measuring the polarization state of light
reflected by ocular tissues in addition to its intensity. In this paper we
demonstrate how this contrast can be used to enable eye tracking. Specifically,
we demonstrate that a polarization-enabled eye tracking (PET) system composed
of a polarization--filter--array camera paired with a linearly polarized
near-infrared illuminator can reveal trackable features across the sclera and
gaze-informative patterns on the cornea, largely absent in intensity-only
images. Across a cohort of 346 participants, convolutional neural network based
machine learning models trained on data from PET reduced the median
95th-percentile absolute gaze error by 10--16\% relative to capacity-matched
intensity baselines under nominal conditions and in the presence of eyelid
occlusions, eye-relief changes, and pupil-size variation. These results link
light--tissue polarization effects to practical gains in human--computer
interaction and position PET as a simple, robust sensing modality for future
wearable devices.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kylie Cancilla, Alexander Moore, Amar Saini, Carmen Carrano
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Video quality assessment (VQA) is vital for computer vision tasks, but
existing approaches face major limitations: full-reference (FR) metrics require
clean reference videos, and most no-reference (NR) models depend on training on
costly human opinion labels. Moreover, most opinion-unaware NR methods are
image-based, ignoring temporal context critical for video object detection. In
this work, we present a scalable, streaming-based VQA model that is both
no-reference and opinion-unaware. Our model leverages synthetic degradations of
the DAVIS dataset, training a temporal-aware convolutional architecture to
predict FR metrics (LPIPS , PSNR, SSIM) directly from degraded video, without
references at inference. We show that our streaming approach outperforms our
own image-based baseline by generalizing across diverse degradations,
underscoring the value of temporal modeling for scalable VQA in real-world
vision systems. Additionally, we demonstrate that our model achieves higher
correlation with full-reference metrics compared to BRISQUE, a widely-used
opinion-aware image quality assessment baseline, validating the effectiveness
of our temporal, opinion-unaware approach.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tushar Kataria, Shikha Dubey, Mary Bronner, Jolanta Jedrzkiewicz, Ben J. Brintz et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning models can generate virtual immunohistochemistry (IHC) stains
from hematoxylin and eosin (H&E) images, offering a scalable and low-cost
alternative to laboratory IHC. However, reliable evaluation of image quality
remains a challenge as current texture- and distribution-based metrics quantify
image fidelity rather than the accuracy of IHC staining. Here, we introduce an
automated and accuracy grounded framework to determine image quality across
sixteen paired or unpaired image translation models. Using color deconvolution,
we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by
each virtual IHC model. We use the segmented masks of real and virtual IHC to
compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly
quantify correct pixel - level labeling without needing expert manual
annotations. Our results demonstrate that conventional image fidelity metrics,
including Frechet Inception Distance (FID), peak signal-to-noise ratio (PSNR),
and structural similarity (SSIM), correlate poorly with stain accuracy and
pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE
achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based
models are less reliable in providing accurate IHC positive pixel labels.
Moreover, whole-slide images (WSI) reveal performance declines that are
invisible in patch-based evaluations, emphasizing the need for WSI-level
benchmarks. Together, this framework defines a reproducible approach for
assessing the quality of virtual IHC models, a critical step to accelerate
translation towards routine use by pathologists.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yicheng Xiao, Yu Chen, Haoxuan Ma, Jiale Hong, Caorui Li et al.
Multimodal
ๆฅ็ๆ่ฆ
While the Contrastive Language-Image Pretraining(CLIP) model has achieved
remarkable success in a variety of downstream vison language understanding
tasks, enhancing its capability for fine-grained image-text alignment remains
an active research focus. To this end, most existing works adopt the strategy
of explicitly increasing the granularity of visual information processing,
e.g., incorporating visual prompts to guide the model focus on specific local
regions within the image. Meanwhile, researches on Multimodal Large Language
Models(MLLMs) have demonstrated that training with long and detailed textual
descriptions can effectively improve the model's fine-grained vision-language
alignment. However, the inherent token length limitation of CLIP's text encoder
fundamentally limits CLIP to process more granular textual information embedded
in long text sequences. To synergistically leverage the advantages of enhancing
both visual and textual content processing granularity, we propose PixCLIP, a
novel framework designed to concurrently accommodate visual prompt inputs and
process lengthy textual descriptions. Specifically, we first establish an
automated annotation pipeline capable of generating pixel-level localized,
long-form textual descriptions for images. Utilizing this pipeline, we
construct LongGRIT, a high-quality dataset comprising nearly 1.5 million
samples. Secondly, we replace CLIP's original text encoder with the LLM and
propose a three-branch pixel-text alignment learning framework, facilitating
fine-grained alignment between image regions and corresponding textual
descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP
showcases breakthroughs in pixel-level interaction and handling long-form
texts, achieving state-of-the-art performance.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chen Shi, Shaoshuai Shi, Xiaoyang Lyu, Chunyang Liu, Kehua Sheng et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Feed-forward 3D reconstruction for autonomous driving has advanced rapidly,
yet existing methods struggle with the joint challenges of sparse,
non-overlapping camera views and complex scene dynamics. We present UniSplat, a
general feed-forward framework that learns robust dynamic scene reconstruction
through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent
scaffold, a structured representation that captures geometric and semantic
scene context by leveraging pretrained foundation models. To effectively
integrate information across spatial views and temporal frames, we introduce an
efficient fusion mechanism that operates directly within the 3D scaffold,
enabling consistent spatio-temporal alignment. To ensure complete and detailed
reconstructions, we design a dual-branch decoder that generates dynamic-aware
Gaussians from the fused scaffold by combining point-anchored refinement with
voxel-based generation, and maintain a persistent memory of static Gaussians to
enable streaming scene completion beyond current camera coverage. Extensive
experiments on real-world datasets demonstrate that UniSplat achieves
state-of-the-art performance in novel view synthesis, while providing robust
and high-quality renderings even for viewpoints outside the original camera
coverage.
๐
2025-11-06
๐ Issues, comments, and questions are all welcome in
๐ฅ Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu Aizawa
Robotics
ๆฅ็ๆ่ฆ
Understanding the current capabilities and risks of AI Scientist systems is
essential for ensuring trustworthy and sustainable AI-driven scientific
progress while preserving the integrity of the academic ecosystem. To this end,
we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system
that mimics the core research workflow of a novice student researcher: Given
the baseline paper from the human mentor, it analyzes its limitations,
formulates novel hypotheses for improvement, validates them through rigorous
experimentation, and writes a paper with the results. Unlike previous
approaches that assume full automation or operate on small-scale code, Jr. AI
Scientist follows a well-defined research workflow and leverages modern coding
agents to handle complex, multi-file implementations, leading to scientifically
valuable contributions. For evaluation, we conducted automated assessments
using AI Reviewers, author-led evaluations, and submissions to Agents4Science,
a venue dedicated to AI-driven scientific contributions. The findings
demonstrate that Jr. AI Scientist generates papers receiving higher review
scores than existing fully automated systems. Nevertheless, we identify
important limitations from both the author evaluation and the Agents4Science
reviews, indicating the potential risks of directly applying current AI
Scientist systems and key challenges for future research. Finally, we
comprehensively report various risks identified during development. We hope
these insights will deepen understanding of current progress and risks in AI
Scientist development.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
"Thinking with Text" and "Thinking with Images" paradigm significantly
improve the reasoning ability of large language models (LLMs) and Vision
Language Models (VLMs). However, these paradigms have inherent limitations. (1)
Images capture only single moments and fail to represent dynamic processes or
continuous changes, and (2) The separation of text and vision as distinct
modalities, hindering unified multimodal understanding and generation. To
overcome these limitations, we introduce "Thinking with Video", a new paradigm
that leverages video generation models, such as Sora-2, to bridge visual and
textual reasoning in a unified temporal framework. To support this exploration,
we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench
encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing
Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our
evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks,
Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even
surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric
tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.
Furthermore, we systematically analyse the source of these abilities. We also
find that self-consistency and in-context learning can improve Sora-2's
performance. In summary, our findings demonstrate that the video generation
model is the potential unified multimodal understanding and generation model,
positions "thinking with video" as a unified multimodal reasoning paradigm.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have emerged as a powerful framework that
unifies perception, language, and control, enabling robots to perform diverse
tasks through multimodal understanding. However, current VLA models typically
contain massive parameters and rely heavily on large-scale robot data
pretraining, leading to high computational costs during training, as well as
limited deployability for real-time inference. Moreover, most training
paradigms often degrade the perceptual representations of the vision-language
backbone, resulting in overfitting and poor generalization to downstream tasks.
In this work, we present Evo-1, a lightweight VLA model that reduces
computation and improves deployment efficiency, while maintaining strong
performance without pretraining on robot data. Evo-1 builds on a native
multimodal Vision-Language model (VLM), incorporating a novel cross-modulated
diffusion transformer along with an optimized integration module, together
forming an effective architecture. We further introduce a two-stage training
paradigm that progressively aligns action with perception, preserving the
representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1
achieves state-of-the-art results on the Meta-World and RoboTwin suite,
surpassing the previous best models by 12.4% and 6.9%, respectively, and also
attains a competitive result of 94.8% on LIBERO. In real-world evaluations,
Evo-1 attains a 78% success rate with high inference frequency and low memory
overhead, outperforming all baseline methods. We release code, data, and model
weights to facilitate future research on lightweight and efficient VLA models.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dimitrios Anastasiou, Santiago Barbarisi, Lucy Culshaw, Jayna Patel, Evangelos B. Mazomenos et al.
Multimodal
ๆฅ็ๆ่ฆ
Purpose: Accurate assessment of surgical complexity is essential in
Laparoscopic Cholecystectomy (LC), where severe inflammation is associated with
longer operative times and increased risk of postoperative complications. The
Parkland Grading Scale (PGS) provides a clinically validated framework for
stratifying inflammation severity; however, its automation in surgical videos
remains largely unexplored, particularly in realistic scenarios where complete
videos must be analyzed without prior manual curation. Methods: In this work,
we introduce STC-Net, a novel framework for SingleTimestamp-based Complexity
estimation in LC via the PGS, designed to operate under weak temporal
supervision. Unlike prior methods limited to static images or manually trimmed
clips, STC-Net operates directly on full videos. It jointly performs temporal
localization and grading through a localization, window proposal, and grading
module. We introduce a novel loss formulation combining hard and soft
localization objectives and background-aware grading supervision. Results:
Evaluated on a private dataset of 1,859 LC videos, STC-Net achieves an accuracy
of 62.11% and an F1-score of 61.42%, outperforming non-localized baselines by
over 10% in both metrics and highlighting the effectiveness of weak supervision
for surgical complexity assessment. Conclusion: STC-Net demonstrates a scalable
and effective approach for automated PGS-based surgical complexity estimation
from full LC videos, making it promising for post-operative analysis and
surgical training.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva
ๆฅ็ๆ่ฆ
Video generation has achieved remarkable progress, with generated videos
increasingly resembling real ones. However, the rapid advance in generation has
outpaced the development of adequate evaluation metrics. Currently, the
assessment of talking head generation primarily relies on limited metrics,
evaluating general video quality, lip synchronization, and on conducting user
studies. Motivated by this, we propose a new evaluation framework comprising 8
metrics related to three dimensions (i) quality, (ii) naturalness, and (iii)
synchronization. In selecting the metrics, we place emphasis on efficiency, as
well as alignment with human preferences. Based on this considerations, we
streamline to analyze fine-grained dynamics of head, mouth, and eyebrows, as
well as face quality. Our extensive experiments on 85,000 videos generated by
17 state-of-the-art models suggest that while many algorithms excel in lip
synchronization, they face challenges with generating expressiveness and
artifact-free details. These videos were generated based on a novel real
dataset, that we have curated, in order to mitigate bias of training data. Our
proposed benchmark framework is aimed at evaluating the improvement of
generative methods. Original code, dataset and leaderboards will be publicly
released and regularly updated with new methods, in order to reflect progress
in the field.
๐
2025-11-06
๐ ArXiv Preprint (eess.IV)
๐ฅ Shihan Zhao, Jianru Zhang, Yanan Wu, Linlin Li, Siyuan Shen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Fluorescence Molecular Tomography (FMT) is a promising technique for
non-invasive 3D visualization of fluorescent probes, but its reconstruction
remains challenging due to the inherent ill-posedness and reliance on
inaccurate or often-unknown tissue optical properties. While deep learning
methods have shown promise, their supervised nature limits generalization
beyond training data. To address these problems, we propose $\mu$NeuFMT, a
self-supervised FMT reconstruction framework that integrates implicit
neural-based scene representation with explicit physical modeling of photon
propagation. Its key innovation lies in jointly optimize both the fluorescence
distribution and the optical properties ($\mu$) during reconstruction,
eliminating the need for precise prior knowledge of tissue optics or
pre-conditioned training data. We demonstrate that $\mu$NeuFMT robustly
recovers accurate fluorophore distributions and optical coefficients even with
severely erroneous initial values (0.5$\times$ to 2$\times$ of ground truth).
Extensive numerical, phantom, and in vivo validations show that $\mu$NeuFMT
outperforms conventional and supervised deep learning approaches across diverse
heterogeneous scenarios. Our work establishes a new paradigm for robust and
accurate FMT reconstruction, paving the way for more reliable molecular imaging
in complex clinically related scenarios, such as fluorescence guided surgery.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alper Kalle, Theo Rudkiewicz, Mohamed-Oumar Ouerfelli, Mohamed Tamaazousti
Machine Learning
ๆฅ็ๆ่ฆ
Neural networks are widely used for image-related tasks but typically demand
considerable computing power. Once a network has been trained, however, its
memory- and compute-footprint can be reduced by compression. In this work, we
focus on compression through tensorization and low-rank representations.
Whereas classical approaches search for a low-rank approximation by minimizing
an isotropic norm such as the Frobenius norm in weight-space, we use
data-informed norms that measure the error in function space. Concretely, we
minimize the change in the layer's output distribution, which can be expressed
as $\lVert (W - \widetilde{W}) \Sigma^{1/2}\rVert_F$ where $\Sigma^{1/2}$ is
the square root of the covariance matrix of the layer's input and $W$,
$\widetilde{W}$ are the original and compressed weights. We propose new
alternating least square algorithms for the two most common tensor
decompositions (Tucker-2 and CPD) that directly optimize the new norm. Unlike
conventional compression pipelines, which almost always require
post-compression fine-tuning, our data-informed approach often achieves
competitive accuracy without any fine-tuning. We further show that the same
covariance-based norm can be transferred from one dataset to another with only
a minor accuracy drop, enabling compression even when the original training
dataset is unavailable. Experiments on several CNN architectures (ResNet-18/50,
and GoogLeNet) and datasets (ImageNet, FGVC-Aircraft, Cifar10, and Cifar100)
confirm the advantages of the proposed method.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenwen Li, Sizhe Wang, Hyunho Lee, Chenyan Lu, Sujit Roy et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Landslides cause severe damage to lives, infrastructure, and the environment,
making accurate and timely mapping essential for disaster preparedness and
response. However, conventional deep learning models often struggle when
applied across different sensors, regions, or under conditions of limited
training data. To address these challenges, we present a three-axis analytical
framework of sensor, label, and domain for adapting geospatial foundation
models (GeoFMs), focusing on Prithvi-EO-2.0 for landslide mapping. Through a
series of experiments, we show that it consistently outperforms task-specific
CNNs (U-Net, U-Net++), vision transformers (Segformer, SwinV2-B), and other
GeoFMs (TerraMind, SatMAE). The model, built on global pretraining,
self-supervision, and adaptable fine-tuning, proved resilient to spectral
variation, maintained accuracy under label scarcity, and generalized more
reliably across diverse datasets and geographic settings. Alongside these
strengths, we also highlight remaining challenges such as computational cost
and the limited availability of reusable AI-ready training data for landslide
research. Overall, our study positions GeoFMs as a step toward more robust and
scalable approaches for landslide risk reduction and environmental monitoring.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Empowering Large Multimodal Models (LMMs) to deeply integrate image
interaction with long-horizon reasoning capabilities remains a long-standing
challenge in this field. Recent advances in vision-centric reasoning explore a
promising "Thinking with Images" paradigm for LMMs, marking a shift from
image-assisted reasoning to image-interactive thinking. While this milestone
enables models to focus on fine-grained image regions, progress remains
constrained by limited visual tool spaces and task-specific workflow designs.
To bridge this gap, we present V-Thinker, a general-purpose multimodal
reasoning assistant that enables interactive, vision-centric thinking through
end-to-end reinforcement learning. V-Thinker comprises two key components: (1)
a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies
interactive reasoning datasets across three dimensions-diversity, quality, and
difficulty; and (2) a Visual Progressive Training Curriculum that first aligns
perception via point-level supervision, then integrates interactive reasoning
through a two-stage reinforcement learning framework. Furthermore, we introduce
VTBench, an expert-verified benchmark targeting vision-centric interactive
reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently
outperforms strong LMM-based baselines in both general and interactive
reasoning scenarios, providing valuable insights for advancing
image-interactive reasoning applications.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yaniv Ohayon, Ofir Itzhak Shahar, Ohad Ben-Shahar
ๆฅ็ๆ่ฆ
Jigsaw puzzle solving requires the rearrangement of unordered pieces into
their original pose in order to reconstruct a coherent whole, often an image,
and is known to be an intractable problem. While the possible impact of
automatic puzzle solvers can be disruptive in various application domains, most
of the literature has focused on developing solvers for square jigsaw puzzles,
severely limiting their practical use. In this work, we significantly expand
the types of puzzles handled computationally, focusing on what is known as
Convex Partitions, a major subset of polygonal puzzles whose pieces are convex.
We utilize both geometrical and pictorial compatibilities, introduce a greedy
solver, and report several performance measures next to the first benchmark
dataset of such puzzles.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Alan de Aguiar, Michaella Pereira Andrade, Charles Morphy D. Santos, Joรฃo Paulo Gois
ๆฅ็ๆ่ฆ
Analyzing octopuses in their natural habitats is challenging due to their
camouflage capability, rapid changes in skin texture and color, non-rigid body
deformations, and frequent occlusions, all of which are compounded by variable
underwater lighting and turbidity. Addressing the lack of large-scale annotated
datasets, this paper introduces HideAndSeg, a novel, minimally supervised
AI-based tool for segmenting videos of octopuses. It establishes a quantitative
baseline for this task. HideAndSeg integrates SAM2 with a custom-trained
YOLOv11 object detector. First, the user provides point coordinates to generate
the initial segmentation masks with SAM2. These masks serve as training data
for the YOLO model. After that, our approach fully automates the pipeline by
providing a bounding box prompt to SAM2, eliminating the need for further
manual intervention. We introduce two unsupervised metrics - temporal
consistency $DICE_t$ and new component count $NC_t$ - to quantitatively
evaluate segmentation quality and guide mask refinement in the absence of
ground-truth data, i.e., real-world information that serves to train, validate,
and test AI models. Results show that HideAndSeg achieves satisfactory
performance, reducing segmentation noise compared to the manually prompted
approach. Our method can re-identify and segment the octopus even after periods
of complete occlusion in natural environments, a scenario in which the manually
prompted model fails. By reducing the need for manual analysis in real-world
scenarios, this work provides a practical tool that paves the way for more
efficient behavioral studies of wild cephalopods.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jayadeva, Naman Dwivedi, Hari Krishnan, N. M. Anoop Krishnan
Machine Learning
ๆฅ็ๆ่ฆ
A formal link between regression and classification has been tenuous. Even
though the margin maximization term $\|w\|$ is used in support vector
regression, it has at best been justified as a regularizer. We show that a
regression problem with $M$ samples lying on a hyperplane has a one-to-one
equivalence with a linearly separable classification task with $2M$ samples. We
show that margin maximization on the equivalent classification task leads to a
different regression formulation than traditionally used. Using the
equivalence, we demonstrate a ``regressability'' measure, that can be used to
estimate the difficulty of regressing a dataset, without needing to first learn
a model for it. We use the equivalence to train neural networks to learn a
linearizing map, that transforms input variables into a space where a linear
regressor is adequate.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ke Du, Yimin Peng, Chao Gao, Fan Zhou, Siqiao Xue
Computer Vision
ๆฅ็ๆ่ฆ
DORAEMON is an open-source PyTorch library that unifies visual object
modeling and representation learning across diverse scales. A single
YAML-driven workflow covers classification, retrieval and metric learning; more
than 1000 pretrained backbones are exposed through a timm-compatible interface,
together with modular losses, augmentations and distributed-training utilities.
Reproducible recipes match or exceed reference results on ImageNet-1K,
MS-Celeb-1M and Stanford online products, while one-command export to ONNX or
HuggingFace bridges research and deployment. By consolidating datasets, models,
and training techniques into one platform, DORAEMON offers a scalable
foundation for rapid experimentation in visual recognition and representation
learning, enabling efficient transfer of research advances to real-world
applications. The repository is available at https://github.com/wuji3/DORAEMON.
๐
2025-11-06
๐ IROS 2025
๐ฅ Chang Liu, Juan Li, Sheng Zhang, Chang Liu, Jie Li et al.
ๆฅ็ๆ่ฆ
Depth estimation is one of the key technologies for realizing 3D perception
in unmanned systems. Monocular depth estimation has been widely researched
because of its low-cost advantage, but the existing methods face the challenges
of poor depth estimation performance and blurred object boundaries on embedded
systems. In this paper, we propose a novel monocular depth estimation model,
BoRe-Depth, which contains only 8.7M parameters. It can accurately estimate
depth maps on embedded systems and significantly improves boundary quality.
Firstly, we design an Enhanced Feature Adaptive Fusion Module (EFAF) which
adaptively fuses depth features to enhance boundary detail representation.
Secondly, we integrate semantic knowledge into the encoder to improve the
object recognition and boundary perception capabilities. Finally, BoRe-Depth is
deployed on NVIDIA Jetson Orin, and runs efficiently at 50.7 FPS. We
demonstrate that the proposed model significantly outperforms previous
lightweight models on multiple challenging datasets, and we provide detailed
ablation studies for the proposed methods. The code is available at
https://github.com/liangxiansheng093/BoRe-Depth.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Itbaan Safwan, Muhammad Annas Shaikh, Muhammad Haaris, Ramail Khan, Muhammad Atif Tahir
ๆฅ็ๆ่ฆ
We present a multi-task framework for the MediaEval Medico 2025 challenge,
leveraging a LoRA-tuned Florence-2 model for simultaneous visual question
answering (VQA), explanation generation, and visual grounding. The proposed
system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer
learning, (2) a synthetically enriched explanation dataset offering structured
medical reasoning, and (3) text-to-region pairs linking visual features with
segmentation masks. This multi-task setup enables the model to jointly learn
visual grounding, reasoning, and interpretation, producing responses that are
both accurate and interpretable. Extensive evaluation demonstrates that our
approach substantially improves over single-task baselines in both answer
accuracy and visual localization, highlighting the effectiveness of grounded
multi-task learning for medical VQA applications.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Maรซlic Neau, Zoe Falomir, Paulo E. Santos, Anne-Gwenn Bosser, Cรฉdric Buche
MultimodalRobotics
ๆฅ็ๆ่ฆ
Deploying autonomous robots that can learn new skills from demonstrations is
an important challenge of modern robotics. Existing solutions often apply
end-to-end imitation learning with Vision-Language Action (VLA) models or
symbolic approaches with Action Model Learning (AML). On the one hand, current
VLA models are limited by the lack of high-level symbolic planning, which
hinders their abilities in long-horizon tasks. On the other hand, symbolic
approaches in AML lack generalization and scalability perspectives. In this
paper we present a new neuro-symbolic approach, GraSP-VLA, a framework that
uses a Continuous Scene Graph representation to generate a symbolic
representation of human demonstrations. This representation is used to generate
new planning domains during inference and serves as an orchestrator for
low-level VLA policies, scaling up the number of actions that can be reproduced
in a row. Our results show that GraSP-VLA is effective for modeling symbolic
representations on the task of automatic planning domain generation from
observations. In addition, results on real-world experiments show the potential
of our Continuous Scene Graph representation to orchestrate low-level VLA
policies in long-horizon tasks.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Puneet Mishra, Martijntje Vollebregt, Yizhou Ma, Maria Font-i-Furnols
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Background In analytical chemistry, spatial information about materials is
commonly captured through imaging techniques, such as traditional color cameras
or with advanced hyperspectral cameras and microscopes. However, efficiently
extracting and analyzing this spatial information for exploratory and
predictive purposes remains a challenge, especially when using traditional
chemometric methods. Recent advances in deep learning and artificial
intelligence have significantly enhanced image processing capabilities,
enabling the extraction of multiscale deep features that are otherwise
challenging to capture with conventional image processing techniques. Despite
the wide availability of open-source deep learning models, adoption in
analytical chemistry remains limited because of the absence of structured,
step-by-step guidance for implementing these models.
Results This tutorial aims to bridge this gap by providing a step-by-step
guide for applying deep learning approaches to extract spatial information from
imaging data and integrating it with other data sources, such as spectral
information. Importantly, the focus of this work is not on training deep
learning models for image processing but on using existing open source models
to extract deep features from imaging data.
Significance The tutorial provides MATLAB code tutorial demonstrations,
showcasing the processing of imaging data from various imaging modalities
commonly encountered in analytical chemistry. Readers must run the tutorial
steps on their own datasets using the codes presented in this tutorial.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sanjay Kumar, Tim Brophy, Eoin Martino Grua, Ganesh Sistu, Valentina Donzella et al.
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Accurate 3D object detection is essential for automated vehicles to navigate
safely in complex real-world environments. Bird's Eye View (BEV)
representations, which project multi-sensor data into a top-down spatial
format, have emerged as a powerful approach for robust perception. Although
BEV-based fusion architectures have demonstrated strong performance through
multimodal integration, the effects of sensor occlusions, caused by
environmental conditions such as fog, haze, or physical obstructions, on 3D
detection accuracy remain underexplored. In this work, we investigate the
impact of occlusions on both camera and Light Detection and Ranging (LiDAR)
outputs using the BEVFusion architecture, evaluated on the nuScenes dataset.
Detection performance is measured using mean Average Precision (mAP) and the
nuScenes Detection Score (NDS). Our results show that moderate camera
occlusions lead to a 41.3% drop in mAP (from 35.6% to 20.9%) when detection is
based only on the camera. On the other hand, LiDAR sharply drops in performance
only under heavy occlusion, with mAP falling by 47.3% (from 64.7% to 34.1%),
with a severe impact on long-range detection. In fused settings, the effect
depends on which sensor is occluded: occluding the camera leads to a minor 4.1%
drop (from 68.5% to 65.7%), while occluding LiDAR results in a larger 26.8%
drop (to 50.1%), revealing the model's stronger reliance on LiDAR for the task
of 3D object detection. Our results highlight the need for future research into
occlusion-aware evaluation methods and improved sensor fusion techniques that
can maintain detection accuracy in the presence of partial sensor failure or
degradation due to adverse environmental conditions.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Muhammad Annas Shaikh, Hamza Zaman, Arbaz Asif
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
This paper presents a comprehensive evaluation of nine convolutional neural
network architectures for binary classification of horses and motorcycles in
the VOC 2008 dataset. We address the significant class imbalance problem by
implementing minority-class augmentation techniques. Our experiments compare
modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and
Vision Transformer across multiple performance metrics. Results demonstrate
substantial performance variations, with ConvNeXt-Tiny achieving the highest
Average Precision (AP) of 95.53% for horse detection and 89.12% for motorcycle
detection. We observe that data augmentation significantly improves minority
class detection, particularly benefiting deeper architectures. This study
provides insights into architecture selection for imbalanced binary
classification tasks and quantifies the impact of data augmentation strategies
in mitigating class imbalance issues in object detection.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saรบl Alonso-Monsalve, Leigh H. Whitehead, Adam Aurisano, Lorena Escudero Sanchez
Machine Learning
ๆฅ็ๆ่ฆ
The accurate delineation of tumours in radiological images like Computed
Tomography is a very specialised and time-consuming task, and currently a
bottleneck preventing quantitative analyses to be performed routinely in the
clinical setting. For this reason, developing methods for the automated
segmentation of tumours in medical imaging is of the utmost importance and has
driven significant efforts in recent years. However, challenges regarding the
impracticality of 3D scans, given the large amount of voxels to be analysed,
usually requires the downsampling of such images or using patches thereof when
applying traditional convolutional neural networks. To overcome this problem,
in this paper we propose a new methodology that uses, divided into two stages,
voxel sparsification and submanifold sparse convolutional networks. This method
allows segmentations to be performed with high-resolution inputs and a native
3D model architecture, obtaining state-of-the-art accuracies while
significantly reducing the computational resources needed in terms of GPU
memory and time. We studied the deployment of this methodology in the context
of Computed Tomography images of renal cancer patients from the KiTS23
challenge, and our method achieved results competitive with the challenge
winners, with Dice similarity coefficients of 95.8% for kidneys + masses, 85.7%
for tumours + cysts, and 80.3% for tumours alone. Crucially, our method also
offers significant computational improvements, achieving up to a 60% reduction
in inference time and up to a 75\% reduction in VRAM usage compared to an
equivalent dense architecture, across both CPU and various GPU cards tested.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangjun Zhang, Litong Gong, Yinglin Zheng, Yansong Liu, Wentao Jiang et al.
ๆฅ็ๆ่ฆ
Most text-to-video(T2V) diffusion models depend on pre-trained text encoders
for semantic alignment, yet they often fail to maintain video quality when
provided with concise prompts rather than well-designed ones. The primary issue
lies in their limited textual semantics understanding. Moreover, these text
encoders cannot rephrase prompts online to better align with user intentions,
which limits both the scalability and usability of the models, To address these
challenges, we introduce RISE-T2V, which uniquely integrates the processes of
prompt rephrasing and semantic feature extraction into a single and seamless
step instead of two separate steps. RISE-T2V is universal and can be applied to
various pre-trained LLMs and video diffusion models(VDMs), significantly
enhancing their capabilities for T2V tasks. We propose an innovative module
called the Rephrasing Adapter, enabling diffusion models to utilize text hidden
states during the next token prediction of the LLM as a condition for video
generation. By employing a Rephrasing Adapter, the video generation model can
implicitly rephrase basic prompts into more comprehensive representations that
better match the user's intent. Furthermore, we leverage the powerful
capabilities of LLMs to enable video generation models to accomplish a broader
range of T2V tasks. Extensive experiments demonstrate that RISE-T2V is a
versatile framework applicable to different video diffusion model
architectures, significantly enhancing the ability of T2V models to generate
high-quality videos that align with user intent. Visual results are available
on the webpage at https://rise-t2v.github.io.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Robin Spanier, Thorsten Hoeser, Claudia Kuenzer
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
The recent and ongoing expansion of marine infrastructure, including offshore
wind farms, oil and gas platforms, artificial islands, and aquaculture
facilities, highlights the need for effective monitoring systems. The
development of robust models for offshore infrastructure detection relies on
comprehensive, balanced datasets, but falls short when samples are scarce,
particularly for underrepresented object classes, shapes, and sizes. By
training deep learning-based YOLOv10 object detection models with a combination
of synthetic and real Sentinel-1 satellite imagery acquired in the fourth
quarter of 2023 from four regions (Caspian Sea, South China Sea, Gulf of
Guinea, and Coast of Brazil), this study investigates the use of synthetic
training data to enhance model performance. We evaluated this approach by
applying the model to detect offshore platforms in three unseen regions (Gulf
of Mexico, North Sea, Persian Gulf) and thereby assess geographic
transferability. This region-holdout evaluation demonstrated that the model
generalises beyond the training areas. In total, 3,529 offshore platforms were
detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and
1,593 in the Persian Gulf. The model achieved an F1 score of 0.85, which
improved to 0.90 upon incorporating synthetic data. We analysed how synthetic
data enhances the representation of unbalanced classes and overall model
performance, taking a first step toward globally transferable detection of
offshore infrastructure. This study underscores the importance of balanced
datasets and highlights synthetic data generation as an effective strategy to
address common challenges in remote sensing, demonstrating the potential of
deep learning for scalable, global offshore infrastructure monitoring.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Leire Benito-Del-Valle, Artzai Picรณn, Daniel Mugica, Manuel Ramos, Eva Portillo et al.
ๆฅ็ๆ่ฆ
Herbicide field trials require accurate identification of plant species and
assessment of herbicide-induced damage across diverse environments. While
general-purpose vision foundation models have shown promising results in
complex visual domains, their performance can be limited in agriculture, where
fine-grained distinctions between species and damage types are critical.
In this work, we adapt a general-purpose vision foundation model to herbicide
trial characterization. Trained using a self-supervised learning approach on a
large, curated agricultural dataset, the model learns rich and transferable
representations optimized for herbicide trials images.
Our domain-specific model significantly outperforms the best general-purpose
foundation model in both species identification (F1 score improvement from 0.91
to 0.94) and damage classification (from 0.26 to 0.33). Under unseen conditions
(new locations and other time), it achieves even greater gains (species
identification from 0.56 to 0.66; damage classification from 0.17 to 0.27). In
domain-shift scenarios, such as drone imagery, it maintains strong performance
(species classification from 0.49 to 0.60).
Additionally, we show that domain-specific pretraining enhances segmentation
accuracy, particularly in low-annotation regimes. An annotation-efficiency
analysis reveals that, under unseen conditions, the domain-specific model
achieves 5.4% higher F1 score than the general-purpose model, while using 80%
fewer labeled samples.
These results demonstrate the generalization capabilities of domain-specific
foundation models and their potential to significantly reduce manual annotation
efforts, offering a scalable and automated solution for herbicide trial
analysis.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shiwei Ren, Tianci Wen, Yongchun Fang, Biao Lu
ๆฅ็ๆ่ฆ
The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to
properly regulate the number of Gaussians during training, causing redundant
computational time overhead. In this paper, we propose FastGS, a novel, simple,
and general acceleration framework that fully considers the importance of each
Gaussian based on multi-view consistency, efficiently solving the trade-off
between training time and rendering quality. We innovatively design a
densification and pruning strategy based on multi-view consistency, dispensing
with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks &
Temples, and Deep Blending datasets demonstrate that our method significantly
outperforms the state-of-the-art methods in training speed, achieving a
3.32$\times$ training acceleration and comparable rendering quality compared
with DashGaussian on the Mip-NeRF 360 dataset and a 15.45$\times$ acceleration
compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that
FastGS exhibits strong generality, delivering 2-7$\times$ training acceleration
across various tasks, including dynamic scene reconstruction, surface
reconstruction, sparse-view reconstruction, large-scale reconstruction, and
simultaneous localization and mapping. The project page is available at
https://fastgs.github.io/
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yujie Yang, Shuang Li, Jun Ye, Neng Dong, Fan Li et al.
ๆฅ็ๆ่ฆ
Video-based Visible-Infrared person re-identification (VVI-ReID) aims to
retrieve the same pedestrian across visible and infrared modalities from video
sequences. Existing methods tend to exploit modality-invariant visual features
but largely overlook gait features, which are not only modality-invariant but
also rich in temporal dynamics, thus limiting their ability to model the
spatiotemporal consistency essential for cross-modal video matching. To address
these challenges, we propose a DINOv2-Driven Gait Representation Learning
(DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn
gait features complementary to appearance cues, facilitating robust
sequence-level representations for cross-modal retrieval. Specifically, we
introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which
generates and enhances silhouette representations with general-purpose semantic
priors from DINOv2 and jointly optimizes them with the ReID objective to
achieve semantically enriched and task-adaptive gait feature learning.
Furthermore, we develop a Progressive Bidirectional Multi-Granularity
Enhancement (PBMGE) module, which progressively refines feature representations
by enabling bidirectional interactions between gait and appearance streams
across multiple spatial granularities, fully leveraging their complementarity
to enhance global representations with rich local details and produce highly
discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets
demonstrate the superiority of our approach, significantly outperforming
existing state-of-the-art methods.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Claudio Giusti, Luca Guarnera, Sebastiano Battiato
Computer Vision
ๆฅ็ๆ่ฆ
The growing sophistication of synthetic image and deepfake generation models
has turned source attribution and authenticity verification into a critical
challenge for modern computer vision systems. Recent studies suggest that
diffusion pipelines unintentionally imprint persistent statistical traces,
known as signal leaks, within their outputs, particularly in latent
representations. Building on this observation, we propose Proto-LeakNet, a
signal-leak-aware and interpretable attribution framework that integrates
closed-set classification with a density-based open-set evaluation on the
learned embeddings, enabling analysis of unseen generators without retraining.
Operating in the latent domain of diffusion models, our method re-simulates
partial forward diffusion to expose residual generator-specific cues. A
temporal attention encoder aggregates multi-step latent features, while a
feature-weighted prototype head structures the embedding space and enables
transparent attribution. Trained solely on closed data and achieving a Macro
AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under
post-processing, surpassing state-of-the-art methods, and achieves strong
separability between known and unseen generators. These results demonstrate
that modeling signal-leak bias in latent space enables reliable and
interpretable AI-image and deepfake forensics. The code for the whole work will
be available upon submission.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marawan Elbatel, Anbang Wang, Keyuan Liu, Kaouther Mouheb, Enrique Almar-Munoz et al.
Computer Vision
ๆฅ็ๆ่ฆ
This paper does not introduce a novel architecture; instead, it revisits a
fundamental yet overlooked baseline: adapting human-centric foundation models
for anatomical landmark detection in medical imaging. While landmark detection
has traditionally relied on domain-specific models, the emergence of
large-scale pre-trained vision models presents new opportunities. In this
study, we investigate the adaptation of Sapiens, a human-centric foundation
model designed for pose estimation, to medical imaging through multi-dataset
pretraining, establishing a new state of the art across multiple datasets. Our
proposed model, MedSapiens, demonstrates that human-centric foundation models,
inherently optimized for spatial pose localization, provide strong priors for
anatomical landmark detection, yet this potential has remained largely
untapped. We benchmark MedSapiens against existing state-of-the-art models,
achieving up to 5.26% improvement over generalist models and up to 21.81%
improvement over specialist models in the average success detection rate (SDR).
To further assess MedSapiens adaptability to novel downstream tasks with few
annotations, we evaluate its performance in limited-data settings, achieving
2.69% improvement over the few-shot state of the art in SDR. Code and model
weights are available at https://github.com/xmed-lab/MedSapiens .
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hanmo Chen, Chenghao Xu, Jiexi Yan, Cheng Deng
ๆฅ็ๆ่ฆ
Human motion style transfer allows characters to appear less rigidity and
more realism with specific style. Traditional arbitrary image style transfer
typically process mean and variance which is proved effective. Meanwhile,
similar methods have been adapted for motion style transfer. However, due to
the fundamental differences between images and motion, relying on mean and
variance is insufficient to fully capture the complex dynamic patterns and
spatiotemporal coherence properties of motion data. Building upon this, our key
insight is to bring two more coefficient, skewness and kurtosis, into the
analysis of motion style. Specifically, we propose a novel Adaptive Statistics
Fusor (AStF) which consists of Style Disentanglement Module (SDM) and
High-Order Multi-Statistics Attention (HOS-Attn). We trained our AStF in
conjunction with a Motion Consistency Regularization (MCR) discriminator.
Experimental results show that, by providing a more comprehensive model of the
spatiotemporal statistical patterns inherent in dynamic styles, our proposed
AStF shows proficiency superiority in motion style transfers over
state-of-the-arts. Our code and model are available at
https://github.com/CHMimilanlan/AStF.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Josef Mayr, Anna Reithmeir, Maxime Di Folco, Julia A. Schnabel
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Covariance descriptors capture second-order statistics of image features.
They have shown strong performance in general computer vision tasks, but remain
underexplored in medical imaging. We investigate their effectiveness for both
conventional and learning-based medical image classification, with a particular
focus on SPDNet, a classification network specifically designed for symmetric
positive definite (SPD) matrices. We propose constructing covariance
descriptors from features extracted by pre-trained general vision encoders
(GVEs) and comparing them with handcrafted descriptors. Two GVEs - DINOv2 and
MedSAM - are evaluated across eleven binary and multi-class datasets from the
MedMNSIT benchmark. Our results show that covariance descriptors derived from
GVE features consistently outperform those derived from handcrafted features.
Moreover, SPDNet yields superior performance to state-of-the-art methods when
combined with DINOv2 features. Our findings highlight the potential of
combining covariance descriptors with powerful pretrained vision encoders for
medical image analysis.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fatemehzahra Darzi, Rodrigo Escobar Diaz Guerrero, Thomas Bocklitz
Multimodal
ๆฅ็ๆ่ฆ
Image registration refers to the process of spatially aligning two or more
images by mapping them into a common coordinate system, so that corresponding
anatomical or tissue structures are matched across images. In digital
pathology, registration enables direct comparison and integration of
information from different stains or imaging modalities, sup-porting
applications such as biomarker analysis and tissue reconstruction. Accurate
registration of images from different modalities is an essential step in
digital pathology. In this study, we investigated how various color
transformation techniques affect image registration between hematoxylin and
eosin (H&E) stained images and non-linear multimodal images. We used a dataset
of 20 tissue sample pairs, with each pair undergoing several preprocessing
steps, including different color transformation (CycleGAN, Macenko, Reinhard,
Vahadane), inversion, contrast adjustment, intensity normalization, and
denoising. All images were registered using the VALIS registration method,
which first applies rigid registration and then performs non-rigid registration
in two steps on both low and high-resolution images. Registration performance
was evaluated using the relative Target Registration Error (rTRE). We reported
the median of median rTRE values (MMrTRE) and the average of median rTRE values
(AMrTRE) for each method. In addition, we performed a custom point-based
evaluation using ten manually selected key points. Registration was done
separately for two scenarios, using either the original or inverted multimodal
images. In both scenarios, CycleGAN color transformation achieved the lowest
registration errors, while the other methods showed higher errors. These
findings show that applying color transformation before registration improves
alignment between images from different modalities and supports more reliable
analysis in digital pathology.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Suranjan Goswami, Abhinav Ravi, Raja Kolla, Ali Faraz, Shaharukh Khan et al.
ๆฅ็ๆ่ฆ
Despite significant advances in document understanding, determining the
correct orientation of scanned or photographed documents remains a critical
pre-processing step in the real world settings. Accurate rotation correction is
essential for enhancing the performance of downstream tasks such as Optical
Character Recognition (OCR) where misalignment commonly arises due to user
errors, particularly incorrect base orientations of the camera during capture.
In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for
evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from
rotation-transformed structured and free-form English OCR datasets, and (ii)
ORB-Indic, a novel multilingual set spanning 11 Indic mid to low-resource
languages. We also present a fast, robust and lightweight rotation
classification pipeline built on the vision encoder of Phi-3.5-Vision model
with dynamic image cropping, fine-tuned specifically for 4-class rotation task
in a standalone fashion. Our method achieves near-perfect 96% and 92% accuracy
on identifying the rotations respectively on both the datasets. Beyond
classification, we demonstrate the critical role of our module in boosting OCR
performance: closed-source (up to 14%) and open-weights models (up to 4x) in
the simulated real-world setting.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yujian Liu, Ze Wang, Hao Chen, Ximeng Sun, Xiaodong Yu et al.
ๆฅ็ๆ่ฆ
Computer-use agents can operate computers and automate laborious tasks, but
despite recent rapid progress, they still lag behind human users, especially
when tasks require domain-specific procedural knowledge about particular
applications, platforms, and multi-step workflows. Humans can bridge this gap
by watching video tutorials: we search, skim, and selectively imitate short
segments that match our current subgoal. In this paper, we study how to enable
computer-use agents to learn from online videos at inference time effectively.
We propose a framework that retrieves and filters tutorial videos, converts
them into structured demonstration trajectories, and dynamically selects
trajectories as in-context guidance during execution. Particularly, using a
VLM, we infer UI actions, segment videos into short subsequences of actions,
and assign each subsequence a textual objective. At inference time, a two-stage
selection mechanism dynamically chooses a single trajectory to add in context
at each step, focusing the agent on the most helpful local guidance for its
next decision. Experiments on two widely used benchmarks show that our
framework consistently outperforms strong base agents and variants that use
only textual tutorials or transcripts. Analyses highlight the importance of
trajectory segmentation and selection, action filtering, and visual
information, suggesting that abundant online videos can be systematically
distilled into actionable guidance that improves computer-use agents at
inference time. Our code is available at
https://github.com/UCSB-NLP-Chang/video_demo.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shengyu Tang, Zeyuan Lu, Jiazhi Dong, Changdong Yu, Xiaoyu Wang et al.
RoboticsComputer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Accurate perception of the marine environment through robust multi-object
tracking (MOT) is essential for ensuring safe vessel navigation and effective
maritime surveillance. However, the complicated maritime environment often
causes camera motion and subsequent visual degradation, posing significant
challenges to MOT. To address this challenge, we propose an efficient
Dual-branch Maritime SORT (DMSORT) method for maritime MOT. The core of the
framework is a parallel tracker with affine compensation, which incorporates an
object detection and re-identification (ReID) branch, along with a dedicated
branch for dynamic camera motion estimation. Specifically, a Reversible
Columnar Detection Network (RCDN) is integrated into the detection module to
leverage multi-level visual features for robust object detection. Furthermore,
a lightweight Transformer-based appearance extractor (Li-TAE) is designed to
capture global contextual information and generate robust appearance features.
Another branch decouples platform-induced and target-intrinsic motion by
constructing a projective transformation, applying platform-motion compensation
within the Kalman filter, and thereby stabilizing true object trajectories.
Finally, a clustering-optimized feature fusion module effectively combines
motion and appearance cues to ensure identity consistency under noise,
occlusion, and drift. Extensive evaluations on the Singapore Maritime Dataset
demonstrate that DMSORT achieves state-of-the-art performance. Notably, DMSORT
attains the fastest runtime among existing ReID-based MOT frameworks while
maintaining high identity consistency and robustness to jitter and occlusion.
Code is available at:
https://github.com/BiscuitsLzy/DMSORT-An-efficient-parallel-maritime-multi-object-tracking-architecture-.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Venkata Manikanta Desu, Syed Fawaz Ali
Machine Learning
ๆฅ็ๆ่ฆ
This study presents a complete pipeline for automated tennis match analysis.
Our framework integrates multiple deep learning models to detect and track
players and the tennis ball in real time, while also identifying court
keypoints for spatial reference. Using YOLOv8 for player detection, a
custom-trained YOLOv5 model for ball tracking, and a ResNet50-based
architecture for court keypoint detection, our system provides detailed
analytics including player movement patterns, ball speed, shot accuracy, and
player reaction times. The experimental results demonstrate robust performance
in varying court conditions and match scenarios. The model outputs an annotated
video along with detailed performance metrics, enabling coaches, broadcasters,
and players to gain actionable insights into the dynamics of the game.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Tengjie Li, Shikui Tu, Lei Xu
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in vision-language models have facilitated progress in sketch
generation. However, existing specialized methods primarily focus on generic
synthesis and lack mechanisms for precise control over sketch styles. In this
work, we propose a training-free framework based on diffusion models that
enables explicit style guidance via textual prompts and referenced style
sketches. Unlike previous style transfer methods that overwrite key and value
matrices in self-attention, we incorporate the reference features as auxiliary
information with linear smoothing and leverage a style-content guidance
mechanism. This design effectively reduces content leakage from reference
sketches and enhances synthesis quality, especially in cases with low
structural similarity between reference and target sketches. Furthermore, we
extend our framework to support controllable multi-style generation by
integrating features from multiple reference sketches, coordinated via a joint
AdaIN module. Extensive experiments demonstrate that our approach achieves
high-quality sketch generation with accurate style alignment and improved
flexibility in style control. The official implementation of M3S is available
at https://github.com/CMACH508/M3S.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Yunghee Lee, Byeonghyun Pak, Junwha Hong, Hoseong Kim
ๆฅ็ๆ่ฆ
In this paper, we propose Tortoise and Hare Guidance (THG), a training-free
strategy that accelerates diffusion sampling while maintaining high-fidelity
generation. We demonstrate that the noise estimate and the additional guidance
term exhibit markedly different sensitivity to numerical error by reformulating
the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our
error-bound analysis shows that the additional guidance branch is more robust
to approximation, revealing substantial redundancy that conventional solvers
fail to exploit. Building on this insight, THG significantly reduces the
computation of the additional guidance: the noise estimate is integrated with
the tortoise equation on the original, fine-grained timestep grid, while the
additional guidance is integrated with the hare equation only on a coarse grid.
We also introduce (i) an error-bound-aware timestep sampler that adaptively
selects step sizes and (ii) a guidance-scale scheduler that stabilizes large
extrapolation spans. THG reduces the number of function evaluations (NFE) by up
to 30% with virtually no loss in generation fidelity ($\Delta$ImageReward
$\leq$ 0.032) and outperforms state-of-the-art CFG-based training-free
accelerators under identical computation budgets. Our findings highlight the
potential of multirate formulations for diffusion solvers, paving the way for
real-time high-quality image synthesis without any model retraining. The source
code is available at https://github.com/yhlee-add/THG.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Biao Liu, Yuanzhi Liang
Robotics
ๆฅ็ๆ่ฆ
Text-to-Image (T2I) synthesis has made significant advancements in recent
years, driving applications such as generating datasets automatically. However,
precise control over object localization in generated images remains a
challenge. Existing methods fail to fully utilize positional information,
leading to an inadequate understanding of object spatial layouts. To address
this issue, we propose SpatialLock, a novel framework that leverages perception
signals and grounding information to jointly control the generation of spatial
locations. SpatialLock incorporates two components: Position-Engaged Injection
(PoI) and Position-Guided Learning (PoG). PoI directly integrates spatial
information through an attention layer, encouraging the model to learn the
grounding information effectively. PoG employs perception-based supervision to
further refine object localization. Together, these components enable the model
to generate objects with precise spatial arrangements and improve the visual
quality of the generated images. Experiments show that SpatialLock sets a new
state-of-the-art for precise object positioning, achieving IOU scores above 0.9
across multiple datasets.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nishchal Sapkota, Haoyan Shi, Yejia Zhang, Xianshi Ma, Bofang Zheng et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Medical image segmentation is critical for accurate diagnostics and treatment
planning, but remains challenging due to complex anatomical structures and
limited annotated training data. CNN-based segmentation methods excel at local
feature extraction, but struggle with modeling long-range dependencies.
Transformers, on the other hand, capture global context more effectively, but
are inherently data-hungry and computationally expensive. In this work, we
introduce UKAST, a U-Net like architecture that integrates rational-function
based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. By
leveraging rational base functions and Group Rational KANs (GR-KANs) from the
Kolmogorov-Arnold Transformer (KAT), our architecture addresses the
inefficiencies of vanilla spline-based KANs, yielding a more expressive and
data-efficient framework with reduced FLOPs and only a very small increase in
parameter count compared to SwinUNETR. UKAST achieves state-of-the-art
performance on four diverse 2D and 3D medical image segmentation benchmarks,
consistently surpassing both CNN- and Transformer-based baselines. Notably, it
attains superior accuracy in data-scarce settings, alleviating the data-hungry
limitations of standard Vision Transformers. These results show the potential
of KAN-enhanced Transformers to advance data-efficient medical image
segmentation. Code is available at: https://github.com/nsapkota417/UKAST
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Abu Hanif Muhammad Syarubany
ๆฅ็ๆ่ฆ
We study CT image denoising in the unpaired and self-supervised regimes by
evaluating two strong, training-data-efficient paradigms: a CycleGAN-based
residual translator and a Noise2Score (N2S) score-matching denoiser. Under a
common evaluation protocol, a configuration sweep identifies a simple standard
U-Net backbone within CycleGAN (lambda_cycle = 30, lambda_iden = 2, ngf = ndf =
64) as the most reliable setting; we then train it to convergence with a longer
schedule. The selected CycleGAN improves the noisy input from 34.66 dB / 0.9234
SSIM to 38.913 dB / 0.971 SSIM and attains an estimated score of 1.9441 and an
unseen-set (Kaggle leaderboard) score of 1.9343. Noise2Score, while slightly
behind in absolute PSNR / SSIM, achieves large gains over very noisy inputs,
highlighting its utility when clean pairs are unavailable. Overall, CycleGAN
offers the strongest final image quality, whereas Noise2Score provides a robust
pair-free alternative with competitive performance. Source code is available at
https://github.com/hanifsyarubany/CT-Scan-Image-Denoising-using-CycleGAN-and-Noise2Score.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zehui Feng, Chenqi Zhang, Mingru Wang, Minuo Wei, Shiwei Cheng et al.
Multimodal
ๆฅ็ๆ่ฆ
Unveiling visual semantics from neural signals such as EEG, MEG, and fMRI
remains a fundamental challenge due to subject variability and the entangled
nature of visual features. Existing approaches primarily align neural activity
directly with visual embeddings, but visual-only representations often fail to
capture latent semantic dimensions, limiting interpretability and deep
robustness. To address these limitations, we propose Bratrix, the first
end-to-end framework to achieve multimodal Language-Anchored Vision-Brain
alignment. Bratrix decouples visual stimuli into hierarchical visual and
linguistic semantic components, and projects both visual and brain
representations into a shared latent space, enabling the formation of aligned
visual-language and brain-language embeddings. To emulate human-like perceptual
reliability and handle noisy neural signals, Bratrix incorporates a novel
uncertainty perception module that applies uncertainty-aware weighting during
alignment. By leveraging learnable language-anchored semantic matrices to
enhance cross-modal correlations and employing a two-stage training strategy of
single-modality pretraining followed by multimodal fine-tuning, Bratrix-M
improves alignment precision. Extensive experiments on EEG, MEG, and fMRI
benchmarks demonstrate that Bratrix improves retrieval, reconstruction, and
captioning performance compared to state-of-the-art methods, specifically
surpassing 14.3% in 200-way EEG retrieval task. Code and model are available.
๐
2025-11-06
๐ This work has been submitted to IEEE Transactions
๐ฅ Arfina Rahman, Mahesh Banavar
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Photoplethysmography (PPG) signals, which measure changes in blood volume in
the skin using light, have recently gained attention in biometric
authentication because of their non-invasive acquisition, inherent liveness
detection, and suitability for low-cost wearable devices. However, PPG signal
quality is challenged by motion artifacts, illumination changes, and
inter-subject physiological variability, making robust feature extraction and
classification crucial. This study proposes a lightweight and cost-effective
biometric authentication framework based on PPG signals extracted from
low-frame-rate fingertip videos. The CFIHSR dataset, comprising PPG recordings
from 46 subjects at a sampling rate of 14 Hz, is employed for evaluation. The
raw PPG signals undergo a standard preprocessing pipeline involving baseline
drift removal, motion artifact suppression using Principal Component Analysis
(PCA), bandpass filtering, Fourier-based resampling, and amplitude
normalization. To generate robust representations, each one-dimensional PPG
segment is converted into a two-dimensional time-frequency scalogram via the
Continuous Wavelet Transform (CWT), effectively capturing transient
cardiovascular dynamics. We developed a hybrid deep learning model, termed
CVT-ConvMixer-LSTM, by combining spatial features from the Convolutional Vision
Transformer (CVT) and ConvMixer branches with temporal features from a Long
Short-Term Memory network (LSTM). The experimental results on 46 subjects
demonstrate an authentication accuracy of 98%, validating the robustness of the
model to noise and variability between subjects. Due to its efficiency,
scalability, and inherent liveness detection capability, the proposed system is
well-suited for real-world mobile and embedded biometric security applications.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yihao Luo, Xianglong He, Chuanyu Pan, Yiwen Chen, Jiaqi Wu et al.
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Accurate and efficient voxelized representations of 3D meshes are the
foundation of 3D reconstruction and generation. However, existing
representations based on iso-surface heavily rely on water-tightening or
rendering optimization, which inevitably compromise geometric fidelity. We
propose Faithful Contouring, a sparse voxelized representation that supports
2048+ resolutions for arbitrary meshes, requiring neither converting meshes to
field functions nor extracting the isosurface during remeshing. It achieves
near-lossless fidelity by preserving sharpness and internal structures, even
for challenging cases with complex geometry and topology. The proposed method
also shows flexibility for texturing, manipulation, and editing. Beyond
representation, we design a dual-mode autoencoder for Faithful Contouring,
enabling scalable and detail-preserving shape reconstruction. Extensive
experiments show that Faithful Contouring surpasses existing methods in
accuracy and efficiency for both representation and reconstruction. For direct
representation, it achieves distance errors at the $10^{-5}$ level; for mesh
reconstruction, it yields a 93\% reduction in Chamfer Distance and a 35\%
improvement in F-score over strong baselines, confirming superior fidelity as a
representation for 3D learning tasks.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mahmoud Soliman, Islam Osman, Mohamed S. Shehata, Rasika Rajapakshe
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The performance of vision models in medical imaging is often hindered by the
prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain
natural images. To address this fundamental domain gap, we propose MedDChest, a
new foundational Vision Transformer (ViT) model optimized specifically for
thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated,
multimodal dataset of over 1.2 million images, encompassing different
modalities including Chest X-ray and Computed Tomography (CT) compiled from 10
public sources. A core technical contribution of our work is Guided Random
Resized Crops, a novel content-aware data augmentation strategy that biases
sampling towards anatomically relevant regions, overcoming the inefficiency of
standard cropping techniques on medical scans. We validate our model's
effectiveness by fine-tuning it on a diverse set of downstream diagnostic
tasks. Comprehensive experiments empirically demonstrate that MedDChest
significantly outperforms strong, publicly available ImageNet-pretrained
models. By establishing the superiority of large-scale, in-domain pre-training
combined with domain-specific data augmentation, MedDChest provides a powerful
and robust feature extractor that serves as a significantly better starting
point for a wide array of thoracic diagnostic tasks. The model weights will be
made publicly available to foster future research and applications.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mahmoud Soliman, Omar Abdelaziz, Ahmed Radwan, Anand, Mohamed Shehata
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Domain generalization (DG) seeks robust Vision Transformer (ViT) performance
on unseen domains. Efficiently adapting pretrained ViTs for DG is challenging;
standard fine-tuning is costly and can impair generalization. We propose
GNN-MoE, enhancing Parameter-Efficient Fine-Tuning (PEFT) for DG with a
Mixture-of-Experts (MoE) framework using efficient Kronecker adapters. Instead
of token-based routing, a novel Graph Neural Network (GNN) router (GCN, GAT,
SAGE) operates on inter-patch graphs to dynamically assign patches to
specialized experts. This context-aware GNN routing leverages inter-patch
relationships for better adaptation to domain shifts. GNN-MoE achieves
state-of-the-art or competitive DG benchmark performance with high parameter
efficiency, highlighting the utility of graph-based contextual routing for
robust, lightweight DG.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Peiyao Wang, Weining Wang, Qi Li
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in text-to-video generation have achieved impressive
perceptual quality, yet generated content often violates fundamental principles
of physical plausibility - manifesting as implausible object dynamics,
incoherent interactions, and unrealistic motion patterns. Such failures hinder
the deployment of video generation models in embodied AI, robotics, and
simulation-intensive domains. To bridge this gap, we propose PhysCorr, a
unified framework for modeling, evaluating, and optimizing physical consistency
in video generation. Specifically, we introduce PhysicsRM, the first
dual-dimensional reward model that quantifies both intra-object stability and
inter-object interactions. On this foundation, we develop PhyDPO, a novel
direct preference optimization pipeline that leverages contrastive feedback and
physics-aware reweighting to guide generation toward physically coherent
outputs. Our approach is model-agnostic and scalable, enabling seamless
integration into a wide range of video diffusion and transformer-based
backbones. Extensive experiments across multiple benchmarks demonstrate that
PhysCorr achieves significant improvements in physical realism while preserving
visual fidelity and semantic alignment. This work takes a critical step toward
physically grounded and trustworthy video generation.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuwen Tao, Kanglei Zhou, Xin Tan, Yuan Xie
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Referring 3D Gaussian Splatting Segmentation (R3DGS) aims to interpret
free-form language expressions and localize the corresponding 3D regions in
Gaussian fields. While recent advances have introduced cross-modal alignment
between language and 3D geometry, existing pipelines still struggle with
cross-view consistency due to their reliance on 2D rendered pseudo supervision
and view specific feature learning. In this work, we present Camera Aware
Referring Field (CaRF), a fully differentiable framework that operates directly
in the 3D Gaussian space and achieves multi view consistency. Specifically,
CaRF introduces Gaussian Field Camera Encoding (GFCE), which incorporates
camera geometry into Gaussian text interactions to explicitly model view
dependent variations and enhance geometric reasoning. Building on this, In
Training Paired View Supervision (ITPVS) is proposed to align per Gaussian
logits across calibrated views during training, effectively mitigating single
view overfitting and exposing inter view discrepancies for optimization.
Extensive experiments on three representative benchmarks demonstrate that CaRF
achieves average improvements of 16.8%, 4.3%, and 2.0% in mIoU over state of
the art methods on the Ref LERF, LERF OVS, and 3D OVS datasets, respectively.
Moreover, this work promotes more reliable and view consistent 3D scene
understanding, with potential benefits for embodied AI, AR/VR interaction, and
autonomous perception.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenshuo Qin, Leyla Isik
Multimodal
ๆฅ็ๆ่ฆ
Humans can quickly and effortlessly extract a variety of information about
others' social interactions from visual input, ranging from visuospatial cues
like whether two people are facing each other to higher-level information. Yet,
the computations supporting these abilities remain poorly understood, and
social interaction recognition continues to challenge even the most advanced AI
vision systems. Here, we hypothesized that humans rely on 3D visuospatial pose
information to make social interaction judgments, which is absent in most AI
vision models. To test this, we combined state-of-the-art pose and depth
estimation algorithms to extract 3D joint positions of people in short video
clips depicting everyday human actions and compared their ability to predict
human social interaction judgments with current AI vision models. Strikingly,
3D joint positions outperformed most current AI vision models, revealing that
key social information is available in explicit body position but not in the
learned features of most vision models, including even the layer-wise
embeddings of the pose models used to extract joint positions. To uncover the
critical pose features humans use to make social judgments, we derived a
compact set of 3D social pose features describing only the 3D position and
direction of faces in the videos. We found that these minimal descriptors
matched the predictive strength of the full set of 3D joints and significantly
improved the performance of off-the-shelf AI vision models when combined with
their embeddings. Moreover, the degree to which 3D social pose features were
represented in each off-the-shelf AI vision model predicted the model's ability
to match human social judgments. Together, our findings provide strong evidence
that human social scene understanding relies on explicit representations of 3D
pose and can be supported by simple, structured visuospatial primitives.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sam Bahrami, Dylan Campbell
Machine Learning
ๆฅ็ๆ่ฆ
Modern scene reconstruction methods are able to accurately recover 3D
surfaces that are visible in one or more images. However, this leads to
incomplete reconstructions, missing all occluded surfaces. While much progress
has been made on reconstructing entire objects given partial observations using
generative models, the structural elements of a scene, like the walls, floors
and ceilings, have received less attention. We argue that these scene elements
should be relatively easy to predict, since they are typically planar,
repetitive and simple, and so less costly approaches may be suitable. In this
work, we present a synthetic dataset -- Room Envelopes -- that facilitates
progress on this task by providing a set of RGB images and two associated
pointmaps for each image: one capturing the visible surface and one capturing
the first surface once fittings and fixtures are removed, that is, the
structural layout. As we show, this enables direct supervision for feed-forward
monocular geometry estimators that predict both the first visible surface and
the first layout surface. This confers an understanding of the scene's extent,
as well as the shape and location of its objects.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhong Chen, Changfeng Chen
Computer Vision
ๆฅ็ๆ่ฆ
Accurate calibration of internal parameters is a crucial yet challenging
prerequisite for 3D reconstruction using light field cameras. In this paper, we
propose a linear fractional transformation(LFT) parameter $\alpha$ to decoupled
the main lens and micro lens array (MLA). The proposed method includes an
analytical solution based on least squares, followed by nonlinear refinement.
The method for detecting features from the raw images is also introduced.
Experimental results on both physical and simulated data have verified the
performance of proposed method. Based on proposed model, the simulation of raw
light field images becomes faster, which is crucial for data-driven deep
learning methods. The corresponding code can be obtained from the author's
website.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhejia Cai, Puhua Jiang, Shiwei Mao, Hongkun Cao, Ruqi Huang
Machine Learning
ๆฅ็ๆ่ฆ
Reconstructing real-world objects from multi-view images is essential for
applications in 3D editing, AR/VR, and digital content creation. Existing
methods typically prioritize either geometric accuracy (Multi-View Stereo) or
photorealistic rendering (Novel View Synthesis), often decoupling geometry and
appearance optimization, which hinders downstream editing tasks. This paper
advocates an unified treatment on geometry and appearance optimization for
seamless Gaussian-mesh joint optimization. More specifically, we propose a
novel framework that simultaneously optimizes mesh geometry (vertex positions
and faces) and vertex colors via Gaussian-guided mesh differentiable rendering,
leveraging photometric consistency from input images and geometric
regularization from normal and depth maps. The obtained high-quality 3D
reconstruction can be further exploit in down-stream editing tasks, such as
relighting and shape deformation. The code will be publicly available upon
acceptance.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Machine Learning
ๆฅ็ๆ่ฆ
Temporal action localization requires precise boundary detection; however,
current methods apply uniform computation despite significant variations in
difficulty across boundaries. We present two complementary contributions.
First, Boundary Distance Regression (BDR) provides information-theoretically
optimal localization through signed-distance regression rather than
classification, achieving 43\% sharper boundary peaks. BDR retrofits to
existing methods with approximately 50 lines of code, yielding consistent 1.8
to 3.1\% mAP@0.7 improvements across diverse architectures. Second, Adaptive
Temporal Refinement (ATR) allocates computation via continuous depth selection
$\tau \in [0,1]$, enabling end-to-end differentiable optimization without
reinforcement learning. On THUMOS14, ATR achieves 56.5\% mAP@0.7 at 162G FLOPs,
compared to 53.6\% at 198G for uniform processing, providing a 2.9\%
improvement with 18\% less compute. Gains scale with boundary heterogeneity,
showing 4.2\% improvement on short actions. Training cost is mitigated via
knowledge distillation, with lightweight students retaining 99\% performance at
baseline cost. Results are validated across four benchmarks with rigorous
statistical testing.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Nemotron Nano V2 VL, the latest model of the Nemotron
vision-language series designed for strong real-world document understanding,
long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers
significant improvements over our previous model,
Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major
enhancements in model architecture, datasets, and training recipes. Nemotron
Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and
innovative token reduction techniques to achieve higher inference throughput in
long document and video scenarios. We are releasing model checkpoints in BF16,
FP8, and FP4 formats and sharing large parts of our datasets, recipes and
training code.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel et al.
Machine Learning
ๆฅ็ๆ่ฆ
LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but
they cannot reliably verify their own logic. Even when they reach correct
answers, the underlying reasoning may be flawed, undermining trust in
high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a
neuro-symbolic method that extracts and verifies formal logical arguments from
CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order
logic and identifies premises that ground the argument in source context,
commonsense knowledge, or prior reasoning steps. The symbolic representation
enables automated solvers to verify logical validity while the NL premises
allow humans and systems to identify ungrounded or fallacious reasoning steps.
Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT
effectively identifies flawed reasoning, and serves as a strong predictor of
final answer correctness. We also leverage VeriCoT's verification signal for
(1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on
VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct
preference optimization (DPO) using verification-based pairwise rewards,
further improving reasoning validity and accuracy.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Mohammad Atif Quamar, Mohammad Areeb
Natural Language Processing
ๆฅ็ๆ่ฆ
Chain-of-Thought (CoT) prompting is a key technique for enabling complex
reasoning in large language models. However, generating full, fixed-length
rationales is computationally wasteful, inflating both token usage and latency.
We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free
decoding algorithm that adaptively halts rationale generation. LEASH monitors
two intrinsic signals: the slope of token-level entropy and the improvement in
the top-logit margin. It terminates the generation once both signals plateau,
indicating the model has reached a stable reasoning state. Across four
instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces
average token generation by 30--35% and latency by 27%, while incurring a 10
p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no
additional training or supervision, offering a simple and efficient alternative
to CoT decoding.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Narjes Nourzad, Hanqing Yang, Shiyu Chen, Carlee Joe-Wong
ๆฅ็ๆ่ฆ
Cooperative multi-agent planning requires agents to make joint decisions with
partial information and limited communication. Coordination at the trajectory
level often fails, as small deviations in timing or movement cascade into
conflicts. Symbolic planning mitigates this challenge by raising the level of
abstraction and providing a minimal vocabulary of actions that enable
synchronization and collective progress. We present DR. WELL, a decentralized
neurosymbolic framework for cooperative multi-agent planning. Cooperation
unfolds through a two-phase negotiation protocol: agents first propose
candidate roles with reasoning and then commit to a joint allocation under
consensus and environment constraints. After commitment, each agent
independently generates and executes a symbolic plan for its role without
revealing detailed trajectories. Plans are grounded in execution outcomes via a
shared world model that encodes the current state and is updated as agents act.
By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids
brittle step-level alignment and enables higher-level operations that are
reusable, synchronizable, and interpretable. Experiments on cooperative
block-push tasks show that agents adapt across episodes, with the dynamic world
model capturing reusable patterns and improving task completion rates and
efficiency. Experiments on cooperative block-push tasks show that our dynamic
world model improves task completion and efficiency through negotiation and
self-refinement, trading a time overhead for evolving, more efficient
collaboration strategies.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Alamgir Munir Qazi, John P. McCrae, Jamal Abdul Nasir
Natural Language Processing
ๆฅ็ๆ่ฆ
The proliferation of misinformation necessitates robust yet computationally
efficient fact verification systems. While current state-of-the-art approaches
leverage Large Language Models (LLMs) for generating explanatory rationales,
these methods face significant computational barriers and hallucination risks
in real-world deployments. We present DeReC (Dense Retrieval Classification), a
lightweight framework that demonstrates how general-purpose text embeddings can
effectively replace autoregressive LLM-based approaches in fact verification
tasks. By combining dense retrieval with specialized classification, our system
achieves better accuracy while being significantly more efficient. DeReC
outperforms explanation-generating LLMs in efficiency, reducing runtime by 95%
on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92%
on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds),
showcasing its effectiveness across varying dataset sizes. On the RAWFC
dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art
method L-Defense (61.20%). Our results demonstrate that carefully engineered
retrieval-based systems can match or exceed LLM performance in specialized
tasks while being significantly more practical for real-world deployment.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Daniel Gomm, Cornelius Wolff, Madelon Hulsebos
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Natural language interfaces to tabular data must handle ambiguities inherent
to queries. Instead of treating ambiguity as a deficiency, we reframe it as a
feature of cooperative interaction, where the responsibility of query
specification is shared among the user and the system. We develop a principled
framework distinguishing cooperative queries, i.e., queries that yield a
resolvable interpretation, from uncooperative queries that cannot be resolved.
Applying the framework to evaluations for tabular question answering and
analysis, we analyze the queries in 15 popular datasets, and observe an
uncontrolled mixing of query types neither adequate for evaluating a system's
execution accuracy nor for evaluating interpretation capabilities. Our
framework and analysis of queries shifts the perspective from fixing ambiguity
to embracing cooperation in resolving queries. This reflection enables more
informed design and evaluation for natural language interfaces for tabular
data, for which we outline implications and directions for future research.
๐
2025-11-06
๐ Issues, comments, and questions are all welcome in
๐ฅ Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu Aizawa
Robotics
ๆฅ็ๆ่ฆ
Understanding the current capabilities and risks of AI Scientist systems is
essential for ensuring trustworthy and sustainable AI-driven scientific
progress while preserving the integrity of the academic ecosystem. To this end,
we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system
that mimics the core research workflow of a novice student researcher: Given
the baseline paper from the human mentor, it analyzes its limitations,
formulates novel hypotheses for improvement, validates them through rigorous
experimentation, and writes a paper with the results. Unlike previous
approaches that assume full automation or operate on small-scale code, Jr. AI
Scientist follows a well-defined research workflow and leverages modern coding
agents to handle complex, multi-file implementations, leading to scientifically
valuable contributions. For evaluation, we conducted automated assessments
using AI Reviewers, author-led evaluations, and submissions to Agents4Science,
a venue dedicated to AI-driven scientific contributions. The findings
demonstrate that Jr. AI Scientist generates papers receiving higher review
scores than existing fully automated systems. Nevertheless, we identify
important limitations from both the author evaluation and the Agents4Science
reviews, indicating the potential risks of directly applying current AI
Scientist systems and key challenges for future research. Finally, we
comprehensively report various risks identified during development. We hope
these insights will deepen understanding of current progress and risks in AI
Scientist development.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
"Thinking with Text" and "Thinking with Images" paradigm significantly
improve the reasoning ability of large language models (LLMs) and Vision
Language Models (VLMs). However, these paradigms have inherent limitations. (1)
Images capture only single moments and fail to represent dynamic processes or
continuous changes, and (2) The separation of text and vision as distinct
modalities, hindering unified multimodal understanding and generation. To
overcome these limitations, we introduce "Thinking with Video", a new paradigm
that leverages video generation models, such as Sora-2, to bridge visual and
textual reasoning in a unified temporal framework. To support this exploration,
we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench
encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing
Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our
evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks,
Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even
surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric
tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.
Furthermore, we systematically analyse the source of these abilities. We also
find that self-consistency and in-context learning can improve Sora-2's
performance. In summary, our findings demonstrate that the video generation
model is the potential unified multimodal understanding and generation model,
positions "thinking with video" as a unified multimodal reasoning paradigm.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sadia Sultana, Saiyma Sittul Muna, Mosammat Zannatul Samarukh, Ajwad Abrar, Tareque Mohmud Chowdhury
Natural Language Processing
ๆฅ็ๆ่ฆ
Developing accurate biomedical Question Answering (QA) systems in
low-resource languages remains a major challenge, limiting equitable access to
reliable medical knowledge. This paper introduces BanglaMedQA and
BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice
Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical
artificial intelligence (AI). The study applies and benchmarks several
Retrieval-Augmented Generation (RAG) strategies, including Traditional,
Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining
textbook-based and web retrieval with generative reasoning to improve factual
accuracy. A key novelty lies in integrating a Bangla medical textbook corpus
through Optical Character Recognition (OCR) and implementing an Agentic RAG
pipeline that dynamically selects between retrieval and reasoning strategies.
Experimental results show that the Agentic RAG achieved the highest accuracy
89.54% with openai/gpt-oss-120b, outperforming other configurations and
demonstrating superior rationale quality. These findings highlight the
potential of RAG-based methods to enhance the reliability and accessibility of
Bangla medical QA, establishing a foundation for future research in
multilingual medical artificial intelligence.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Cyril Vallez, Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic
Natural Language Processing
ๆฅ็ๆ่ฆ
As the role of Large Language Models (LLM)-based coding assistants in
software development becomes more critical, so does the role of the bugs they
generate in the overall cybersecurity landscape. While a number of LLM code
security benchmarks have been proposed alongside approaches to improve the
security of generated code, it remains unclear to what extent they have
impacted widely used coding LLMs. Here, we show that even the latest
open-weight models are vulnerable in the earliest reported vulnerability
scenarios in a realistic use setting, suggesting that the safety-functionality
trade-off has until now prevented effective patching of vulnerabilities. To
help address this issue, we introduce a new severity metric that reflects the
risk posed by an LLM-generated vulnerability, accounting for vulnerability
severity, generation chance, and the formulation of the prompt that induces
vulnerable code generation - Prompt Exposure (PE). To encourage the mitigation
of the most serious and prevalent vulnerabilities, we use PE to define the
Model Exposure (ME) score, which indicates the severity and prevalence of
vulnerabilities a model generates.
๐
2025-11-06
๐ AAAI
๐ฅ Kaveh Eskandari Miandoab, Katharine Kowalyshyn, Kabir Pamnani, Anesu Gavhera, Vasanth Sarathy et al.
ๆฅ็ๆ่ฆ
We present IntelliProof, an interactive system for analyzing argumentative
essays through LLMs. IntelliProof structures an essay as an argumentation
graph, where claims are represented as nodes, supporting evidence is attached
as node properties, and edges encode supporting or attacking relations. Unlike
existing automated essay scoring systems, IntelliProof emphasizes the user
experience: each relation is initially classified and scored by an LLM, then
visualized for enhanced understanding. The system provides justifications for
classifications and produces quantitative measures for essay coherence. It
enables rapid exploration of argumentative quality while retaining human
oversight. In addition, IntelliProof provides a set of tools for a better
understanding of an argumentative essay and its corresponding graph in natural
language, bridging the gap between the structural semantics of argumentative
essays and the user's understanding of a given text. A live demo and the system
are available here to try: \textbf{https://intelliproof.vercel.app}
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amir Zur, Atticus Geiger, Ekdeep Singh Lubana, Eric Bigelow
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
When a language model generates text, the selection of individual tokens
might lead it down very different reasoning paths, making uncertainty difficult
to quantify. In this work, we consider whether reasoning language models
represent the alternate paths that they could take during generation. To test
this hypothesis, we use hidden activations to control and predict a language
model's uncertainty during chain-of-thought reasoning. In our experiments, we
find a clear correlation between how uncertain a model is at different tokens,
and how easily the model can be steered by controlling its activations. This
suggests that activation interventions are most effective when there are
alternate paths available to the model -- in other words, when it has not yet
committed to a particular final answer. We also find that hidden activations
can predict a model's future outcome distribution, demonstrating that models
implicitly represent the space of possible paths.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Paloma Rabaey, Jong Hak Moon, Jung-Oh Lee, Min Gwan Kim, Hangyul Yoon et al.
ๆฅ็ๆ่ฆ
Radiology reports are invaluable for clinical decision-making and hold great
potential for automated analysis when structured into machine-readable formats.
These reports often contain uncertainty, which we categorize into two distinct
types: (i) Explicit uncertainty reflects doubt about the presence or absence of
findings, conveyed through hedging phrases. These vary in meaning depending on
the context, making rule-based systems insufficient to quantify the level of
uncertainty for specific findings; (ii) Implicit uncertainty arises when
radiologists omit parts of their reasoning, recording only key findings or
diagnoses. Here, it is often unclear whether omitted findings are truly absent
or simply unmentioned for brevity. We address these challenges with a two-part
framework. We quantify explicit uncertainty by creating an expert-validated,
LLM-based reference ranking of common hedging phrases, and mapping each finding
to a probability value based on this reference. In addition, we model implicit
uncertainty through an expansion framework that systematically adds
characteristic sub-findings derived from expert-defined diagnostic pathways for
14 common diagnoses. Using these methods, we release Lunguage++, an expanded,
uncertainty-aware version of the Lunguage benchmark of fine-grained structured
radiology reports. This enriched resource enables uncertainty-aware image
classification, faithful diagnostic reasoning, and new investigations into the
clinical impact of diagnostic uncertainty.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joshua Gao, Quoc Huy Pham, Subin Varghese, Silwal Saurav, Vedhus Hoskere
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) is a critical technique for grounding
Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in
specialized, safety-critical domains remains a significant challenge. Existing
evaluation frameworks often rely on heuristic-based metrics that fail to
capture domain-specific nuances and other works utilize LLM-as-a-Judge
approaches that lack validated alignment with human judgment. This paper
introduces RAGalyst, an automated, human-aligned agentic framework designed for
the rigorous evaluation of domain-specific RAG systems. RAGalyst features an
agentic pipeline that generates high-quality, synthetic question-answering (QA)
datasets from source documents, incorporating an agentic filtering step to
ensure data fidelity. The framework refines two key LLM-as-a-Judge
metrics-Answer Correctness and Answerability-using prompt optimization to
achieve a strong correlation with human annotations. Applying this framework to
evaluate various RAG components across three distinct domains (military
operations, cybersecurity, and bridge engineering), we find that performance is
highly context-dependent. No single embedding model, LLM, or hyperparameter
configuration proves universally optimal. Additionally, we provide an analysis
on the most common low Answer Correctness reasons in RAG. These findings
highlight the necessity of a systematic evaluation framework like RAGalyst,
which empowers practitioners to uncover domain-specific trade-offs and make
informed design choices for building reliable and effective RAG systems.
RAGalyst is available on our Github.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Andrea Cera Palatsi, Samuel Martin-Gutierrez, Ana S. Cardenal, Max Pellert
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used both to make decisions in
domains such as health, education and law, and to simulate human behavior. Yet
how closely LLMs mirror actual human decision-making remains poorly understood.
This gap is critical: misalignment could produce harmful outcomes in practical
applications, while failure to replicate human behavior renders LLMs
ineffective for social simulations. Here, we address this gap by developing a
digital twin of game-theoretic experiments and introducing a systematic
prompting and probing framework for machine-behavioral evaluation. Testing
three open-source models (Llama, Mistral and Qwen), we find that Llama
reproduces human cooperation patterns with high fidelity, capturing human
deviations from rational choice theory, while Qwen aligns closely with Nash
equilibrium predictions. Notably, we achieved population-level behavioral
replication without persona-based prompting, simplifying the simulation
process. Extending beyond the original human-tested games, we generate and
preregister testable hypotheses for novel game configurations outside the
original parameter grid. Our findings demonstrate that appropriately calibrated
LLMs can replicate aggregate human behavioral patterns and enable systematic
exploration of unexplored experimental spaces, offering a complementary
approach to traditional research in the social and behavioral sciences that
generates new empirical predictions about human social decision-making.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Christos-Nikolaos Zacharopoulos, Revekka Kyriakoglou
Natural Language Processing
ๆฅ็ๆ่ฆ
As Large Language Models (LLMs) become integral to human-centered
applications, understanding their personality-like behaviors is increasingly
important for responsible development and deployment. This paper systematically
evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to
assess trait expressions under varying sampling temperatures. We find
significant differences across four of the five personality dimensions, with
Neuroticism and Extraversion susceptible to temperature adjustments. Further,
hierarchical clustering reveals distinct model clusters, suggesting that
architectural features may predispose certain models toward stable trait
profiles. Taken together, these results offer new insights into the emergence
of personality-like patterns in LLMs and provide a new perspective on model
tuning, selection, and the ethical governance of AI systems. We share the data
and code for this analysis here:
https://osf.io/bsvzc/?view_only=6672219bede24b4e875097426dc3fac1
๐
2025-11-06
๐ EMNLP 2025
๐ฅ Cuong Huynh, Jie Cao
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
This paper describes the OUNLP system submitted to the TSAR-2025 Shared Task
(Alva-Manchego et al., 2025), designed for readability-controlled text
simplification using LLM-prompting-based generation. Based on the analysis of
prompt-based text simplification methods, we discovered an interesting finding
that text simplification performance is highly related to the gap between the
source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by
this finding, we propose two multi-round simplification methods and generate
them via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based
LLM simplification (MRS-Joint). Our submitted systems ranked 7 out of 20 teams.
Later improvements with MRS-Joint show that taking the LLM simplified
candidates as the starting point could further boost the multi-round
simplification performance.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nikhil Abhyankar, Purvi Chaurasia, Sanchit Kabra, Ananya Srivastava, Vivek Gupta et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing tabular reasoning benchmarks mostly test models on small, uniform
tables, underrepresenting the complexity of real-world data and giving an
incomplete view of Large Language Models' (LLMs) reasoning abilities. Real
tables are long, heterogeneous, and domain-specific, mixing structured fields
with free text and requiring multi-hop reasoning across thousands of tokens. To
address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from
2031 real-world tables spanning two domains: i) RB-Science (NSF grant records)
and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates
LLMs jointly across scale, heterogeneity, domain specificity, and reasoning
complexity. Experiments with open-source and proprietary models show that LLMs
struggle with heterogeneous schemas and complex multi-hop inference, revealing
persistent weaknesses in current architectures and prompting strategies.
RUST-BENCH establishes a challenging new testbed for advancing tabular
reasoning research.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat, Natapong Nitarach, Chanakan Wittayasakpan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We present ThaiOCRBench, the first comprehensive benchmark for evaluating
vision-language models (VLMs) on Thai text-rich visual understanding tasks.
Despite recent progress in multimodal modeling, existing benchmarks
predominantly focus on high-resource languages, leaving Thai underrepresented,
especially in tasks requiring document structure understanding. ThaiOCRBench
addresses this gap by offering a diverse, human-annotated dataset comprising
2,808 samples across 13 task categories. We evaluate a wide range of
state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and
open-source systems. Results show a significant performance gap, with
proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source
counterparts. Notably, fine-grained text recognition and handwritten content
extraction exhibit the steepest performance drops among open-source models.
Through detailed error analysis, we identify key challenges such as language
bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a
standardized framework for assessing VLMs in low-resource, script-complex
settings, and provides actionable insights for improving Thai-language document
understanding.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fabian Schmidt, Seyedehmoniba Ravan, Vladimir Vlassov
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate and interpretable predictions of depression severity are essential
for clinical decision support, yet existing models often lack uncertainty
estimates and temporal modeling. We propose PTTSD, a Probabilistic Textual Time
Series Depression Detection framework that predicts PHQ-8 scores from
utterance-level clinical interviews while modeling uncertainty over time. PTTSD
includes sequence-to-sequence and sequence-to-one variants, both combining
bidirectional LSTMs, self-attention, and residual connections with Gaussian or
Student-t output heads trained via negative log-likelihood. Evaluated on E-DAIC
and DAIC-WOZ, PTTSD achieves state-of-the-art performance among text-only
systems (e.g., MAE = 3.85 on E-DAIC, 3.55 on DAIC) and produces well-calibrated
prediction intervals. Ablations confirm the value of attention and
probabilistic modeling, while comparisons with MentalBERT establish generality.
A three-part calibration analysis and qualitative case studies further
highlight the interpretability and clinical relevance of uncertainty-aware
forecasting.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alberto Cattaneo, Carlo Luschi, Daniel Justus
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval of information from graph-structured knowledge bases represents a
promising direction for improving the factuality of LLMs. While various
solutions have been proposed, a comparison of methods is difficult due to the
lack of challenging QA datasets with ground-truth targets for graph retrieval.
We present SynthKGQA, a framework for generating high-quality synthetic
Knowledge Graph Question Answering datasets from any Knowledge Graph, providing
the full set of ground-truth facts in the KG to reason over each question. We
show how, in addition to enabling more informative benchmarking of KG
retrievers, the data produced with SynthKGQA also allows us to train better
models. We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset
designed to test zero-shot generalization abilities of KG retrievers with
respect to unseen graph structures and relation types, and benchmark popular
solutions for KG-augmented LLMs on it.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lars Bungum, Charles Yijia Huang, Abeer Kashar
ๆฅ็ๆ่ฆ
In this study, we experiment with the ability of LLMs to do temporal
reasoning. Using a Norwegian book from 1940 containing trivia questions, we
prompt the LLMs to answer the questions as if it were 1940. We also pose the
questions in both English and Norwegian. Correct answers are often presented as
sentences, and grading is done by means of LLM-as-judge, with sampled checks by
a native speaker. Prompting in English consistently gave better results than in
Norwegian, an unexpected result. In contrast, using larger LLMs improved
results. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families,
and also the largest available LLM especially crafted for Norwegian.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tim Tomov, Dominik Fuchsgruber, Tom Wollschlรคger, Stephan Gรผnnemann
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is
critical for trustworthy deployment. While real-world language is inherently
ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically
benchmarked against tasks with no ambiguity. In this work, we demonstrate that
while current uncertainty estimators perform well under the restrictive
assumption of no ambiguity, they degrade to close-to-random performance on
ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first
ambiguous question-answering (QA) datasets equipped with ground-truth answer
distributions estimated from factual co-occurrence. We find this performance
deterioration to be consistent across different estimation paradigms: using the
predictive distribution itself, internal representations throughout the model,
and an ensemble of models. We show that this phenomenon can be theoretically
explained, revealing that predictive-distribution and ensemble-based estimators
are fundamentally limited under ambiguity. Overall, our study reveals a key
shortcoming of current UQ methods for LLMs and motivates a rethinking of
current modeling paradigms.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohammad Amin Ghanizadeh, Mohammad Javad Dousti
Natural Language Processing
ๆฅ็ๆ่ฆ
Data quality and its effective selection are fundamental to improving the
performance of machine translation models, serving as cornerstones for
achieving robust and reliable translation systems. This paper presents a data
selection methodology specifically designed for fine-tuning machine translation
systems, which leverages the synergy between a learner model and a pre-trained
reference model to enhance overall training effectiveness. By defining a
learnability score, our approach systematically evaluates the utility of data
points for training, ensuring that only the most relevant and impactful
examples contribute to the fine-tuning process. Furthermore, our method employs
a batch selection strategy which considers interdependencies among data points,
optimizing the efficiency of the training process while maintaining a focus on
data relevance. Experiments on English to Persian and several other language
pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that
our method can achieve up to a fivefold improvement in data efficiency compared
to an iid baseline. Experimental results indicate that our approach improves
computational efficiency by 24 when utilizing cached embeddings, as it requires
fewer training data points. Additionally, it enhances generalization, resulting
in superior translation performance compared to random selection method.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kun Yang, Zikang chen, Yanmeng Wang, Zhigen Li
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
As a significant part of post-training of the Large Language Models (LLMs),
Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs'
reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative
Policy Optimization) and GSPO (Group Sequence Policy Optimization), are
observed to suffer from unstable policy updates and low usage of sampling data,
respectively. The importance ratio of GRPO is calculated at the token level,
which focuses more on optimizing a single token. This will be easily affected
by outliers, leading to model training collapse. GSPO proposed the calculation
of the response level importance ratio, which solves the problem of high
variance and training noise accumulation in the calculation of the GRPO
importance ratio. However, since all the response tokens share a common
importance ratio, extreme values can easily raise or lower the overall mean,
leading to the entire response being mistakenly discarded, resulting in a
decrease in the utilization of sampled data. This paper introduces SSPO, which
applies sentence-level importance ratio, taking the balance between GRPO and
GSPO. SSPO not only avoids training collapse and high variance, but also
prevents the whole response tokens from being abandoned by the clipping
mechanism. Furthermore, we apply sentence entropy to PPO-CLIP to steadily
adjust the clipping bounds, encouraging high-entropy tokens to explore and
narrow the clipping range of low-entropy tokens. In particular, SSPO achieves
an average score of 46.57 across five datasets, surpassing GRPO (43.01) and
GSPO (44.42), and wins state-of-the-art performance on three datasets. These
results highlight SSPO's effectiveness in leveraging generated data by taking
the essence of GSPO but rejecting its shortcomings.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Salma Mekaoui, Hiba Sofyan, Imane Amaaz, Imane Benchrif, Arsalane Zarghili et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Extracting topics from text has become an essential task, especially with the
rapid growth of unstructured textual data. Most existing works rely on highly
computational methods to address this challenge. In this paper, we argue that
probabilistic and statistical approaches, such as topic modeling (TM), can
offer effective alternatives that require fewer computational resources. TM is
a statistical method that automatically discovers topics in large collections
of unlabeled text; however, it produces topics as distributions of
representative words, which often lack clear interpretability. Our objective is
to perform topic labeling by assigning meaningful labels to these sets of
words. To achieve this without relying on computationally expensive models, we
propose a graph-based approach that not only enriches topic words with
semantically related terms but also explores the relationships among them. By
analyzing these connections within the graph, we derive suitable labels that
accurately capture each topic's meaning. We present a comparative study between
our proposed method and several benchmarks, including ChatGPT-3.5, across two
different datasets. Our method achieved consistently better results than
traditional benchmarks in terms of BERTScore and cosine similarity and produced
results comparable to ChatGPT-3.5, while remaining computationally efficient.
Finally, we discuss future directions for topic labeling and highlight
potential research avenues for enhancing interpretability and automation.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Alex Fang, Thomas Voice, Ruoming Pang, Ludwig Schmidt, Tom Gunter
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models learn from their vast pre-training corpora, gaining the
ability to solve an ever increasing variety of tasks; yet although researchers
work to improve these datasets, there is little effort to understand how
efficient the pre-training apparatus is at extracting ideas and knowledge from
the data. In this work, we use retrieval augmented generation along with
test-time compute as a way to quantify how much dataset value was left behind
by the process of pre-training, and how this changes across scale. We
demonstrate that pre-training then retrieving from standard and largely
open-sourced datasets results in significant accuracy gains in MMLU, Math-500,
and SimpleQA, which persist through decontamination. For MMLU we observe that
retrieval acts as a ~5x compute multiplier versus pre-training alone. We show
that these results can be further improved by leveraging additional compute at
test time to parse the retrieved context, demonstrating a 10 percentage point
improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results
suggest that today's pre-training methods do not make full use of the
information in existing pre-training datasets, leaving significant room for
progress.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Liran Cohen, Yaniv Nemcovesky, Avi Mendelson
Natural Language Processing
ๆฅ็ๆ่ฆ
Machine unlearning aims to remove the influence of specific training data
from a model without requiring full retraining. This capability is crucial for
ensuring privacy, safety, and regulatory compliance. Therefore, verifying
whether a model has truly forgotten target data is essential for maintaining
reliability and trustworthiness. However, existing evaluation methods often
assess forgetting at the level of individual inputs. This approach may overlook
residual influence present in semantically similar examples. Such influence can
compromise privacy and lead to indirect information leakage. We propose REMIND
(Residual Memorization In Neighborhood Dynamics), a novel evaluation method
aiming to detect the subtle remaining influence of unlearned data and classify
whether the data has been effectively forgotten. REMIND analyzes the model's
loss over small input variations and reveals patterns unnoticed by single-point
evaluations. We show that unlearned data yield flatter, less steep loss
landscapes, while retained or unrelated data exhibit sharper, more volatile
patterns. REMIND requires only query-based access, outperforms existing methods
under similar constraints, and demonstrates robustness across different models,
datasets, and paraphrased inputs, making it practical for real-world
deployment. By providing a more sensitive and interpretable measure of
unlearning effectiveness, REMIND provides a reliable framework to assess
unlearning in language models. As a result, REMIND offers a novel perspective
on memorization and unlearning.
๐
2025-11-06
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Hongwei Yao, Yun Xia, Shuo Shao, Haoran Shi, Tong Qiao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) increasingly employ guardrails to enforce
ethical, legal, and application-specific constraints on their outputs. While
effective at mitigating harmful responses, these guardrails introduce a new
class of vulnerabilities by exposing observable decision patterns. In this
work, we present the first study of black-box LLM guardrail reverse-engineering
attacks. We propose Guardrail Reverse-engineering Attack (GRA), a reinforcement
learning-based framework that leverages genetic algorithm-driven data
augmentation to approximate the decision-making policy of victim guardrails. By
iteratively collecting input-output pairs, prioritizing divergence cases, and
applying targeted mutations and crossovers, our method incrementally converges
toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on
three widely deployed commercial systems, namely ChatGPT, DeepSeek, and Qwen3,
and demonstrate that it achieves an rule matching rate exceeding 0.92 while
requiring less than $85 in API costs. These findings underscore the practical
feasibility of guardrail extraction and highlight significant security risks
for current LLM safety mechanisms. Our findings expose critical vulnerabilities
in current guardrail designs and highlight the urgent need for more robust
defense mechanisms in LLM deployment.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved remarkable success, but their
rapidly growing scale imposes prohibitive costs in memory, computation, and
energy. Post-training quantization (PTQ) is a promising solution for efficient
deployment, yet achieving accurate W4A4 quantization remains an open challenge.
While most existing methods are designed for INT4 formats, the emergence of
MXFP4 -- a new FP4 format with various hardware support (NVIDIA, AMD, Intel)--
raises questions about the applicability of current techniques. In this work,
we establish a comprehensive benchmark of PTQ methods under the MXFP4 format.
Through systematic evaluation, we find that methods like GPTQ consistently
deliver strong performance, whereas rotation-based approaches, which are almost
used by all state-of-the-art approaches, suffer from severe incompatibility
with MXFP4. We further provide the first in-depth analysis of this conflict,
tracing its root to a fundamental mismatch between MXFP4's PoT (power-of-two)
block scaling and the redistribution of outlier energy via global rotation.
Building on this insight, we propose a simple yet effective block rotation
strategy that adapts rotation-based methods to MXFP4, leading to substantial
accuracy improvements across diverse LLMs. Our findings not only offer clear
guidance for practitioners but also set a foundation for advancing PTQ research
under emerging low-precision formats.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michaล Karp, Anna Kubaszewska, Magdalena Krรณl, Robert Krรณl, Aleksander Smywiลski-Pohl et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
This study provides an empirical assessment of whether current large language
models (LLMs) can pass the official qualifying examination for membership in
Poland's National Appeal Chamber (Krajowa Izba Odwo{\l}awcza). The authors
examine two related ideas: using LLM as actual exam candidates and applying the
'LLM-as-a-judge' approach, in which model-generated answers are automatically
evaluated by other models. The paper describes the structure of the exam, which
includes a multiple-choice knowledge test on public procurement law and a
written judgment, and presents the hybrid information recovery and extraction
pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4
Sonnet and Bielik-11B-v2.6) were tested in closed-book and various
Retrieval-Augmented Generation settings. The results show that although the
models achieved satisfactory scores in the knowledge test, none met the passing
threshold in the practical written part, and the evaluations of the
'LLM-as-a-judge' often diverged from the judgments of the official examining
committee. The authors highlight key limitations: susceptibility to
hallucinations, incorrect citation of legal provisions, weaknesses in logical
argumentation, and the need for close collaboration between legal experts and
technical teams. The findings indicate that, despite rapid technological
progress, current LLMs cannot yet replace human judges or independent examiners
in Polish public procurement adjudication.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nicolรฒ Pagan, Petter Tรถrnberg, Christopher A. Bail, Anikรณ Hannรกk, Christopher Barrie
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used in the social sciences to
simulate human behavior, based on the assumption that they can generate
realistic, human-like text. Yet this assumption remains largely untested.
Existing validation efforts rely heavily on human-judgment-based evaluations --
testing whether humans can distinguish AI from human output -- despite evidence
that such judgments are blunt and unreliable. As a result, the field lacks
robust tools for assessing the realism of LLM-generated text or for calibrating
models to real-world data. This paper makes two contributions. First, we
introduce a computational Turing test: a validation framework that integrates
aggregate metrics (BERT-based detectability and semantic similarity) with
interpretable linguistic features (stylistic markers and topical patterns) to
assess how closely LLMs approximate human language within a given dataset.
Second, we systematically compare nine open-weight LLMs across five calibration
strategies -- including fine-tuning, stylistic prompting, and context retrieval
-- benchmarking their ability to reproduce user interactions on X (formerly
Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the
literature. Even after calibration, LLM outputs remain clearly distinguishable
from human text, particularly in affective tone and emotional expression.
Instruction-tuned models underperform their base counterparts, and scaling up
model size does not enhance human-likeness. Crucially, we identify a trade-off:
optimizing for human-likeness often comes at the cost of semantic fidelity, and
vice versa. These results provide a much-needed scalable framework for
validation and calibration in LLM simulations -- and offer a cautionary note
about their current limitations in capturing human communication.
๐
2025-11-06
๐ 10 pages, 4 figures. Submitted to IEEE DISTILL 202
๐ฅ Mohammed Musthafa Rafi, Adarsh Krishnamurthy, Aditya Balu
Robotics
ๆฅ็ๆ่ฆ
The proliferation of AI-generated content has created an absurd communication
theater where senders use LLMs to inflate simple ideas into verbose content,
recipients use LLMs to compress them back into summaries, and as a consequence
neither party engage with authentic content. LAAC (LLM as a Communicator)
proposes a paradigm shift - positioning LLMs as intelligent communication
intermediaries that capture the sender's intent through structured dialogue and
facilitate genuine knowledge exchange with recipients. Rather than perpetuating
cycles of AI-generated inflation and compression, LAAC enables authentic
communication across diverse contexts including academic papers, proposals,
professional emails, and cross-platform content generation. However, deploying
LLMs as trusted communication intermediaries raises critical questions about
information fidelity, consistency, and reliability. This position paper
systematically evaluates the trustworthiness requirements for LAAC's deployment
across multiple communication domains. We investigate three fundamental
dimensions: (1) Information Capture Fidelity - accuracy of intent extraction
during sender interviews across different communication types, (2)
Reproducibility - consistency of structured knowledge across multiple
interaction instances, and (3) Query Response Integrity - reliability of
recipient-facing responses without hallucination, source conflation, or
fabrication. Through controlled experiments spanning multiple LAAC use cases,
we assess these trust dimensions using LAAC's multi-agent architecture.
Preliminary findings reveal measurable trust gaps that must be addressed before
LAAC can be reliably deployed in high-stakes communication scenarios.
๐
2025-11-06
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mashrur Rahman, Mantaqa abedin, Monowar Zamil Abir, Faizul Islam Ansari, Adib Reza et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
University students face immense challenges during their undergraduate lives,
often being deprived of personalized on-demand guidance that mentors fail to
provide at scale. Digital tools exist, but there is a serious lack of
customized coaching for newcomers. This paper presents an AI-powered chatbot
that will serve as a mentor for the students of BRAC University. The main
component is a data ingestion pipeline that efficiently processes and updates
information from diverse sources, such as CSV files and university webpages.
The chatbot retrieves information through a hybrid approach, combining BM25
lexical ranking with ChromaDB semantic retrieval, and uses a Large Language
Model, LLaMA-3.3-70B, to generate conversational responses. The generated text
was found to be semantically highly relevant, with a BERTScore of 0.831 and a
METEOR score of 0.809. The data pipeline was also very efficient, taking 106.82
seconds for updates, compared to 368.62 seconds for new data. This chatbot will
be able to help students by responding to their queries, helping them to get a
better understanding of university life, and assisting them to plan better
routines for their semester in the open-credit university.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Suranjan Goswami, Abhinav Ravi, Raja Kolla, Ali Faraz, Shaharukh Khan et al.
ๆฅ็ๆ่ฆ
Despite significant advances in document understanding, determining the
correct orientation of scanned or photographed documents remains a critical
pre-processing step in the real world settings. Accurate rotation correction is
essential for enhancing the performance of downstream tasks such as Optical
Character Recognition (OCR) where misalignment commonly arises due to user
errors, particularly incorrect base orientations of the camera during capture.
In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for
evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from
rotation-transformed structured and free-form English OCR datasets, and (ii)
ORB-Indic, a novel multilingual set spanning 11 Indic mid to low-resource
languages. We also present a fast, robust and lightweight rotation
classification pipeline built on the vision encoder of Phi-3.5-Vision model
with dynamic image cropping, fine-tuned specifically for 4-class rotation task
in a standalone fashion. Our method achieves near-perfect 96% and 92% accuracy
on identifying the rotations respectively on both the datasets. Beyond
classification, we demonstrate the critical role of our module in boosting OCR
performance: closed-source (up to 14%) and open-weights models (up to 4x) in
the simulated real-world setting.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fahim Ahmed, Md Mubtasim Ahasan, Jahir Sadik Monon, Muntasir Wahed, M Ashraful Amin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Text-to-SQL systems provide a natural language interface that can enable even
laymen to access information stored in databases. However, existing Large
Language Models (LLM) struggle with SQL generation from natural instructions
due to large schema sizes and complex reasoning. Prior work often focuses on
complex, somewhat impractical pipelines using flagship models, while smaller,
efficient models remain overlooked. In this work, we explore three multi-agent
LLM pipelines, with systematic performance benchmarking across a range of small
to large open-source models: (1) Multi-agent discussion pipeline, where agents
iteratively critique and refine SQL queries, and a judge synthesizes the final
answer; (2) Planner-Coder pipeline, where a thinking model planner generates
stepwise SQL generation plans and a coder synthesizes queries; and (3)
Coder-Aggregator pipeline, where multiple coders independently generate SQL
queries, and a reasoning agent selects the best query. Experiments on the
Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small
model performance, with up to 10.6% increase in Execution Accuracy for
Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines,
the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B
and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest
score of 56.4%. Codes are available at
https://github.com/treeDweller98/bappa-sql.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dazhong Chen, Yi-Cheng Lin, Yuchen Huang, Ziwei Gong, Di Jiang et al.
ๆฅ็ๆ่ฆ
Automatic speech recognition (ASR) is critical for language accessibility,
yet low-resource Cantonese remains challenging due to limited annotated data,
six lexical tones, tone sandhi, and accent variation. Existing ASR models, such
as Whisper, often suffer from high word error rates. Large audio-language
models (LALMs), in contrast, can leverage broader contextual reasoning but
still require explicit tonal and prosodic acoustic cues. We introduce CantoASR,
a collaborative ASR-LALM error correction framework that integrates forced
alignment for acoustic feature extraction, a LoRA-finetuned Whisper for
improved tone discrimination, and an instruction-tuned Qwen-Audio for
prosody-aware correction. Evaluations on spontaneous Cantonese data show
substantial CER gains over Whisper-Large-V3. These findings suggest that
integrating acoustic cues with LALM reasoning provides a scalable strategy for
low-resource tonal and dialectal ASR.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Xinyuan Li, Murong Xu, Wenbiao Tao, Hanlun Zhu, Yike Zhao et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) achieve high performance on mathematical
reasoning, but these results can be inflated by training data leakage or
superficial pattern matching rather than genuine reasoning. To this end, an
adversarial perturbation-based evaluation is needed to measure true
mathematical reasoning ability. Current rule-based perturbation methods often
generate ill-posed questions and impede the systematic evaluation of question
difficulty and the evolution of benchmarks. To bridge this gap, we propose
RIDE, a novel adversarial question-rewriting framework that leverages Item
Response Theory (IRT) to rigorously measure question difficulty and to generate
intrinsically more challenging, well-posed variations of mathematical problems.
We employ 35 LLMs to simulate students and build a difficulty ranker from their
responses. This ranker provides a reward signal during reinforcement learning
and guides a question-rewriting model to reformulate existing questions across
difficulty levels. Applying RIDE to competition-level mathematical benchmarks
yields perturbed versions that degrade advanced LLM performance, with
experiments showing an average 21.73% drop across 26 models, thereby exposing
limited robustness in mathematical reasoning and confirming the validity of our
evaluation approach.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wenmo Qiu, Saurabh Srivastava
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recent work has explored batch prompting as a strategy to amortize inference
cost in large language models (LLMs). In this paper, we show that batching
offers an additional, underappreciated benefit: it regularizes model behavior
during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a
comprehensive study across 13 diverse benchmarks and observe that batching
improves accuracy while substantially reducing reasoning token usage, often by
3x-5x. Through detailed behavioral analysis, we find that batching suppresses
overthinking, reduces hedging language (e.g., repetitive self-corrections), and
encourages more decisive answers. Surprisingly, we also observe emergent
collective effects in batched inference: models often generalize patterns from
earlier examples to solve harder ones in the same batch. These findings
position batching not just as a throughput optimization, but as a powerful
inference-time regularizer for more efficient and reliable LLM reasoning.
๐
2025-11-06
๐ ArXiv Preprint (physics.soc-ph)
๐ฅ Hayafumi Watanabe
ๆฅ็ๆ่ฆ
The diffusion of ideas and language in society has conventionally been
described by S-shaped models, such as the logistic curve. However, the role of
sub-exponential growth -a slower than exponential pattern known in
epidemiology- has been largely overlooked in broader social phenomena. Here, we
present a piecewise power-law model to characterize complex growth curves with
a few parameters. We systematically analyzed a large-scale dataset of
approximately one billion Japanese blog articles linked to Wikipedia
vocabulary, and observed consistent patterns in web search trend data (English,
Spanish, and Japanese). Our analysis of the 2,965 selected items reveals that
about 55% (1,625 items) were found to have no abrupt jumps and were well
captured by one or two segments. For single-segment curves, we found that (i)
the mode of the shape parameter alpha was near 0.5, indicating prevalent
sub-exponential growth; (ii) the ultimate diffusion scale is primarily
determined by the growth rate R, with minor contributions from alpha or the
duration T; and (iii) alpha showed a tendency to vary with the nature of the
topic, being smaller for niche/local topics and larger for widely shared ones.
Furthermore, a micro-behavioral model distinguishing outward contact with
strangers from inward interaction within their community suggests that alpha
can be interpreted as an index of the preference for outward-oriented
communication. These findings suggest that sub-exponential growth is a common
pattern of social diffusion, and our model provides a practical framework for
consistently describing, comparing, and interpreting complex and diverse growth
curves.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Moses Charikar, Chirag Pabbaraju, Ambuj Tewari
ๆฅ็ๆ่ฆ
We study the problem of language identification in the limit, where given a
sequence of examples from a target language, the goal of the learner is to
output a sequence of guesses for the target language such that all the guesses
beyond some finite time are correct. Classical results of Gold showed that
language identification in the limit is impossible for essentially any
interesting collection of languages. Later, Angluin gave a precise
characterization of language collections for which this task is possible.
Motivated by recent positive results for the related problem of language
generation, we revisit the classic language identification problem in the
setting where the learner is given the additional power of producing a list of
$k$ guesses at each time step. The goal is to ensure that beyond some finite
time, one of the guesses is correct at each time step.
We give an exact characterization of collections of languages that can be
$k$-list identified in the limit, based on a recursive version of Angluin's
characterization (for language identification with a list of size $1$). This
further leads to a conceptually appealing characterization: A language
collection can be $k$-list identified in the limit if and only if the
collection can be decomposed into $k$ collections of languages, each of which
can be identified in the limit (with a list of size $1$). We also use our
characterization to establish rates for list identification in the statistical
setting where the input is drawn as an i.i.d. stream from a distribution
supported on some language in the collection. Our results show that if a
collection is $k$-list identifiable in the limit, then the collection can be
$k$-list identified at an exponential rate, and this is best possible. On the
other hand, if a collection is not $k$-list identifiable in the limit, then it
cannot be $k$-list identified at any rate that goes to zero.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Eva Prakash, Maayane Attias, Pierre Chambon, Justin Xu, Steven Truong et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Objective: To enhance automated de-identification of radiology reports by
scaling transformer-based models through extensive training datasets and
benchmarking performance against commercial cloud vendor systems for protected
health information (PHI) detection. Materials and Methods: In this
retrospective study, we built upon a state-of-the-art, transformer-based, PHI
de-identification pipeline by fine-tuning on two large annotated radiology
corpora from Stanford University, encompassing chest X-ray, chest CT,
abdomen/pelvis CT, and brain MR reports and introducing an additional PHI
category (AGE) into the architecture. Model performance was evaluated on test
sets from Stanford and the University of Pennsylvania (Penn) for token-level
PHI detection. We further assessed (1) the stability of synthetic PHI
generation using a "hide-in-plain-sight" method and (2) performance against
commercial systems. Precision, recall, and F1 scores were computed across all
PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the
Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining
the previous state-of-the-art model performance. Synthetic PHI evaluation
showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50
independently de-identified Penn datasets. Our model outperformed all vendor
systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754).
Discussion: Large-scale, multimodal training improved cross-institutional
generalization and robustness. Synthetic PHI generation preserved data utility
while ensuring privacy. Conclusion: A transformer-based de-identification model
trained on diverse radiology datasets outperforms prior academic and commercial
systems in PHI detection and establishes a new benchmark for secure clinical
text processing.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ ล pela Vintar, Jan Jona Javorลกek
Natural Language Processing
ๆฅ็ๆ่ฆ
Human word associations are a well-known method of gaining insight into the
internal mental lexicon, but the responses spontaneously offered by human
participants to word cues are not always predictable as they may be influenced
by personal experience, emotions or individual cognitive styles. The ability to
form associative links between seemingly unrelated concepts can be the driving
mechanisms of creativity. We perform a comparison of the associative behaviour
of humans compared to large language models. More specifically, we explore
associations to emotionally loaded words and try to determine whether large
language models generate associations in a similar way to humans. We find that
the overlap between humans and LLMs is moderate, but also that the associations
of LLMs tend to amplify the underlying emotional load of the stimulus, and that
they tend to be more predictable and less creative than human ones.
๐
2025-11-06
๐ Submitted to the IEEE for possible publication
๐ฅ Xinying Qian, Ying Zhang, Yu Zhao, Baohang Zhou, Xuhui Sui et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Temporal Knowledge Graph Question Answering (TKGQA) aims to answer
time-sensitive questions by leveraging factual information from Temporal
Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG
embeddings or graph neural networks to inject temporal knowledge, they fail to
fully understand the complex semantic information of time constraints.
Recently, Large Language Models (LLMs) have shown remarkable progress,
benefiting from their strong semantic understanding and reasoning
generalization capabilities. However, their temporal reasoning ability remains
limited. LLMs frequently suffer from hallucination and a lack of knowledge. To
address these limitations, we propose the Plan of Knowledge framework with a
contrastive temporal retriever, which is named PoK. Specifically, the proposed
Plan of Knowledge module decomposes a complex temporal question into a sequence
of sub-objectives from the pre-defined tools, serving as intermediate guidance
for reasoning exploration. In parallel, we construct a Temporal Knowledge Store
(TKS) with a contrastive retrieval framework, enabling the model to selectively
retrieve semantically and temporally aligned facts from TKGs. By combining
structured planning with temporal knowledge retrieval, PoK effectively enhances
the interpretability and factual consistency of temporal reasoning. Extensive
experiments on four benchmark TKGQA datasets demonstrate that PoK significantly
improves the retrieval precision and reasoning accuracy of LLMs, surpassing the
performance of the state-of-the-art TKGQA methods by 56.0% at most.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shreya Havaldar, Helen Jin, Chaehyeon Kim, Anton Xue, Weiqiu You et al.
ๆฅ็ๆ่ฆ
As LLMs are deployed in knowledge-intensive settings (e.g., surgery,
astronomy, therapy), users expect not just answers, but also meaningful
explanations for those answers. In these settings, users are often domain
experts (e.g., doctors, astrophysicists, psychologists) who require
explanations that reflect expert-level reasoning. However, current evaluation
schemes primarily emphasize plausibility or internal faithfulness of the
explanation, which fail to capture whether the content of the explanation truly
aligns with expert intuition. We formalize expert alignment as a criterion for
evaluating explanations with T-FIX, a benchmark spanning seven
knowledge-intensive domains. In collaboration with domain experts, we develop
novel metrics to measure the alignment of LLM explanations with expert
judgment.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Quantization plays a crucial role in accelerating the inference of
large-scale models, and rotational matrices have been shown to effectively
improve quantization performance by smoothing outliers. However, end-to-end
fine-tuning of rotational optimization algorithms incurs high computational
costs and is prone to overfitting. To address this challenge, we propose an
efficient distribution-aware rotational calibration method, DartQuant, which
reduces the complexity of rotational optimization by constraining the
distribution of the activations after rotation. This approach also effectively
reduces reliance on task-specific losses, thereby mitigating the risk of
overfitting. Additionally, we introduce the QR-Orth optimization scheme, which
replaces expensive alternating optimization with a more efficient solution. In
a variety of model quantization experiments, DartQuant demonstrates superior
performance. Compared to existing methods, it achieves 47$\times$ acceleration
and 10$\times$ memory savings for rotational optimization on a 70B model.
Furthermore, it is the first to successfully complete rotational calibration
for a 70B model on a single 3090 GPU, making quantization of large language
models feasible in resource-constrained environments. Code is available at
https://github.com/CAS-CLab/DartQuant.git.
๐
2025-11-06
๐ ArXiv Preprint (cs.FL)
๐ฅ Ayaan Bedi, Karoliina Lehtinen
ๆฅ็ๆ่ฆ
We study explorability, a measure of nondeterminism in pushdown automata,
which generalises history-determinism. An automaton is k-explorable if, while
reading the input, it suffices to follow k concurrent runs, built step-by-step
based only on the input seen so far, to construct an accepting one, if it
exists. We show that the class of explorable PDAs lies strictly between
history-deterministic and fully nondeterministic PDAs in terms of both
expressiveness and succinctness. In fact increasing explorability induces an
infinite hierarchy: each level k defines a strictly more expressive class than
level k-1, yet the entire class remains less expressive than general
nondeterministic PDAs. We then introduce a parameterized notion of
explorability, where the number of runs may depend on input length, and show
that exponential explorability precisely captures the context-free languages.
Finally, we prove that explorable PDAs can be doubly exponentially more
succinct than history-deterministic ones, and that the succinctness gap between
deterministic and 2-explorable PDAs is not recursively enumerable. These
results position explorability as a robust and operationally meaningful measure
of nondeterminism for pushdown systems.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dongji Gao, Chenda Liao, Changliang Liu, Matthew Wiesner, Leibny Paola Garcia et al.
Machine Learning
ๆฅ็ๆ่ฆ
The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in
end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily
on large-scale, high-quality annotated data, which are often costly and
difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised
Transducer (WST), which integrates a flexible training graph designed to
robustly handle errors in the transcripts without requiring additional
confidence estimation or auxiliary pre-trained models. Empirical evaluations on
synthetic and industrial datasets reveal that WST effectively maintains
performance even with transcription error rates of up to 70%, consistently
outperforming existing Connectionist Temporal Classification (CTC)-based weakly
supervised approaches, such as Bypass Temporal Classification (BTC) and
Omni-Temporal Classification (OTC). These results demonstrate the practical
utility and robustness of WST in realistic ASR settings. The implementation
will be publicly available.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shiyin Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) enhanced with retrieval -- commonly referred to
as Retrieval-Augmented Generation (RAG) -- have demonstrated strong performance
in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved
evidence is incomplete, leaving gaps in the reasoning process. In such cases,
\emph{abductive inference} -- the process of generating plausible missing
premises to explain observations -- offers a principled approach to bridge
these gaps. In this paper, we propose a framework that integrates abductive
inference into retrieval-augmented LLMs. Our method detects insufficient
evidence, generates candidate missing premises, and validates them through
consistency and plausibility checks. Experimental results on abductive
reasoning and multi-hop QA benchmarks show that our approach improves both
answer accuracy and reasoning faithfulness. This work highlights abductive
inference as a promising direction for enhancing the robustness and
explainability of RAG systems.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Kyaw Hpone Myint, Zhe Wu, Alexandre G. R. Day, Giri Iyengar
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Decision trees are widely used in high-stakes fields like finance and
healthcare due to their interpretability. This work introduces an efficient,
scalable method for generating synthetic pre-training data to enable
meta-learning of decision trees. Our approach samples near-optimal decision
trees synthetically, creating large-scale, realistic datasets. Using the
MetaTree transformer architecture, we demonstrate that this method achieves
performance comparable to pre-training on real-world data or with
computationally expensive optimal decision trees. This strategy significantly
reduces computational costs, enhances data generation flexibility, and paves
the way for scalable and efficient meta-learning of interpretable decision tree
models.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bram Bultรฉ, Ayla Rigouts Terryn
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are rapidly being adopted by users across the
globe, who interact with them in a diverse range of languages. At the same
time, there are well-documented imbalances in the training data and
optimisation objectives of this technology, raising doubts as to whether LLMs
can represent the cultural diversity of their broad user base. In this study,
we look at LLMs and cultural values and examine how prompt language and
cultural framing influence model responses and their alignment with human
values in different countries. We probe 10 LLMs with 63 items from the Hofstede
Values Survey Module and World Values Survey, translated into 11 languages, and
formulated as prompts with and without different explicit cultural
perspectives. Our study confirms that both prompt language and cultural
perspective produce variation in LLM outputs, but with an important caveat:
While targeted prompting can, to a certain extent, steer LLM responses in the
direction of the predominant values of the corresponding countries, it does not
overcome the models' systematic bias toward the values associated with a
restricted set of countries in our dataset: the Netherlands, Germany, the US,
and Japan. All tested models, regardless of their origin, exhibit remarkably
similar patterns: They produce fairly neutral responses on most topics, with
selective progressive stances on issues such as social tolerance. Alignment
with cultural values of human respondents is improved more with an explicit
cultural perspective than with a targeted prompt language. Unexpectedly,
combining both approaches is no more effective than cultural framing with an
English prompt. These findings reveal that LLMs occupy an uncomfortable middle
ground: They are responsive enough to changes in prompts to produce variation,
but too firmly anchored to specific cultural defaults to adequately represent
cultural diversity.
๐
2025-11-06
๐ ArXiv Preprint (cs.MA)
๐ฅ Kia Karbasi, Kevin Hong, Mohammad Amin Samadi, Gregory Pottie
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Automatic question generation (AQG) for mathematics education remains an
elusive goal for Intelligent Tutoring Systems and educators. While pre-trained
transformer-based language models have significantly advanced natural language
generation, they often struggle to precisely control problem complexity and
cognitive demands. In this paper, we introduce a collaborative multi-agent
framework as a novel method of incorporating inference-time computation into
AQG. This approach leverages multiple agents that iteratively refine generated
question-answer pairs to better balance complexity and cognitive demand. We
evaluate the generated questions on five meta-evaluation criteria: relevance,
importance, clarity, difficulty matching, answerability, to assess the system's
ability to control the required complexity and quality of the questions.
Preliminary evaluations show that this collaborative multi-agent framework
elevates the quality of generated educational content by fostering a more
nuanced balance between cognitive challenge and clarity. These promising
outcomes suggest that integrating collaborative multi-agent workflows can yield
more controlled, pedagogically valuable content that can help advance automated
educational content generation and adaptive learning environments.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fu-Chun Yang, Jason Eshraghian
Natural Language Processing
ๆฅ็ๆ่ฆ
In multi-agent settings, such as debate, reflection, or tool-calling, large
language models (LLMs) pass messages as plain tokens, discarding most latent
semantics. This constrains information transfer and adds unnecessary
computational overhead. We form a latent bridge via vector translations, which
use learned mappings that enable direct semantic exchange between
representation spaces. A dual-encoder translator trained between Llama-2-7B and
Mistral-7B-Instruct attains an average cosine alignment of 0.538. Injecting the
translated vectors at 30 percent blending strength steers the target model's
generation without destabilizing logits. Bidirectional evaluation shows a
2.01:1 transfer asymmetry, indicating that general-purpose models yield more
transferable representations than instruction-tuned variants. This conservative
injection preserves computational stability while demonstrating that
cross-model latent communication is feasible, enabling collaborative AI systems
that share meaning rather than tokens.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Shih-Lun Wu, Yoon Kim, Cheng-Zhi Anna Huang
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We present MIDI-LLM, an LLM for generating multitrack MIDI music from
free-form text prompts. Our approach expands a text LLM's vocabulary to include
MIDI tokens, and uses a two-stage training recipe to endow text-to-MIDI
abilities. By preserving the original LLM's parameter structure, we can
directly leverage the vLLM library for accelerated inference. Experiments show
that MIDI-LLM achieves higher quality, better text control, and faster
inference compared to the recent Text2midi model. Live demo at
https://midi-llm-demo.vercel.app.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Raghav Sharma, Manan Mehta, Sai Tiger Raina
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning from Human Feedback (RLHF) is the standard for
aligning Large Language Models (LLMs), yet recent progress has moved beyond
canonical text-based methods. This survey synthesizes the new frontier of
alignment research by addressing critical gaps in multi-modal alignment,
cultural fairness, and low-latency optimization. To systematically explore
these domains, we first review foundational algo- rithms, including PPO, DPO,
and GRPO, before presenting a detailed analysis of the latest innovations. By
providing a comparative synthesis of these techniques and outlining open
challenges, this work serves as an essential roadmap for researchers building
more robust, efficient, and equitable AI systems.
๐
2025-11-06
๐ ArXiv Preprint (astro-ph.CO)
๐ฅ A. Thomsen, J. Bucko, T. Kacprzak, V. Ajani, J. Fluri et al.
Machine Learning
ๆฅ็ๆ่ฆ
Data-driven approaches using deep learning are emerging as powerful
techniques to extract non-Gaussian information from cosmological large-scale
structure. This work presents the first simulation-based inference (SBI)
pipeline that combines weak lensing and galaxy clustering maps in a realistic
Dark Energy Survey Year 3 (DES Y3) configuration and serves as preparation for
a forthcoming analysis of the survey data. We develop a scalable forward model
based on the CosmoGridV1 suite of N-body simulations to generate over one
million self-consistent mock realizations of DES Y3 at the map level.
Leveraging this large dataset, we train deep graph convolutional neural
networks on the full survey footprint in spherical geometry to learn
low-dimensional features that approximately maximize mutual information with
target parameters. These learned compressions enable neural density estimation
of the implicit likelihood via normalizing flows in a ten-dimensional parameter
space spanning cosmological $w$CDM, intrinsic alignment, and linear galaxy bias
parameters, while marginalizing over baryonic, photometric redshift, and shear
bias nuisances. To ensure robustness, we extensively validate our inference
pipeline using synthetic observations derived from both systematic
contaminations in our forward model and independent Buzzard galaxy catalogs.
Our forecasts yield significant improvements in cosmological parameter
constraints, achieving $2-3\times$ higher figures of merit in the $\Omega_m -
S_8$ plane relative to our implementation of baseline two-point statistics and
effectively breaking parameter degeneracies through probe combination. These
results demonstrate the potential of SBI analyses powered by deep learning for
upcoming Stage-IV wide-field imaging surveys.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Julian D. Allagan, Dasia A. Singleton, Shanae N. Perry, Gabrielle C. Morgan, Essence A. Morgan
Machine Learning
ๆฅ็ๆ่ฆ
This study evaluates a 40-item mathematics placement examination administered
to 198 students using a multi-method framework combining Classical Test Theory,
machine learning, and unsupervised clustering. Classical Test Theory analysis
reveals that 55\% of items achieve excellent discrimination ($D \geq 0.40$)
while 30\% demonstrate poor discrimination ($D < 0.20$) requiring replacement.
Question 6 (Graph Interpretation) emerges as the examination's most powerful
discriminator, achieving perfect discrimination ($D = 1.000$), highest ANOVA
F-statistic ($F = 4609.1$), and maximum Random Forest feature importance
(0.206), accounting for 20.6\% of predictive power. Machine learning algorithms
demonstrate exceptional performance, with Random Forest and Gradient Boosting
achieving 97.5\% and 96.0\% cross-validation accuracy. K-means clustering
identifies a natural binary competency structure with a boundary at 42.5\%,
diverging from the institutional threshold of 55\% and suggesting potential
overclassification into remedial categories. The two-cluster solution exhibits
exceptional stability (bootstrap ARI = 0.855) with perfect lower-cluster
purity. Convergent evidence across methods supports specific refinements:
replace poorly discriminating items, implement a two-stage assessment, and
integrate Random Forest predictions with transparency mechanisms. These
findings demonstrate that multi-method integration provides a robust empirical
foundation for evidence-based mathematics placement optimization.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin et al.
Machine Learning
ๆฅ็ๆ่ฆ
A fundamental challenge in developing general learning algorithms is their
tendency to forget past knowledge when adapting to new data. Addressing this
problem requires a principled understanding of forgetting; yet, despite decades
of study, no unified definition has emerged that provides insights into the
underlying dynamics of learning. We propose an algorithm- and task-agnostic
theory that characterises forgetting as a lack of self-consistency in a
learner's predictive distribution over future experiences, manifesting as a
loss of predictive information. Our theory naturally yields a general measure
of an algorithm's propensity to forget. To validate the theory, we design a
comprehensive set of experiments that span classification, regression,
generative modelling, and reinforcement learning. We empirically demonstrate
how forgetting is present across all learning settings and plays a significant
role in determining learning efficiency. Together, these results establish a
principled understanding of forgetting and lay the foundation for analysing and
improving the information retention capabilities of general learning
algorithms.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song et al.
Robotics
ๆฅ็ๆ่ฆ
Robotic manipulation policies are advancing rapidly, but their direct
evaluation in the real world remains costly, time-consuming, and difficult to
reproduce, particularly for tasks involving deformable objects. Simulation
provides a scalable and systematic alternative, yet existing simulators often
fail to capture the coupled visual and physical complexity of soft-body
interactions. We present a real-to-sim policy evaluation framework that
constructs soft-body digital twins from real-world videos and renders robots,
objects, and environments with photorealistic fidelity using 3D Gaussian
Splatting. We validate our approach on representative deformable manipulation
tasks, including plush toy packing, rope routing, and T-block pushing,
demonstrating that simulated rollouts correlate strongly with real-world
execution performance and reveal key behavioral patterns of learned policies.
Our results suggest that combining physics-informed reconstruction with
high-quality rendering enables reproducible, scalable, and accurate evaluation
of robotic manipulation policies. Website: https://real2sim-eval.github.io/
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Huaguan Chen, Wei Han, Haofei Sun, Ning Lin, Xingtao Song et al.
ๆฅ็ๆ่ฆ
Extreme precipitation nowcasting demands high spatiotemporal fidelity and
extended lead times, yet existing approaches remain limited. Numerical Weather
Prediction (NWP) and its deep-learning emulations are too slow and coarse for
rapidly evolving convection, while extrapolation and purely data-driven models
suffer from error accumulation and excessive smoothing. Hybrid 2D radar-based
methods discard crucial vertical information, preventing accurate
reconstruction of height-dependent dynamics. We introduce a gray-box, fully
three-dimensional nowcasting framework that directly processes volumetric radar
reflectivity and couples physically constrained neural operators with
datadriven learning. The model learns vertically varying 3D advection fields
under a conservative advection operator, parameterizes spatially varying
diffusion, and introduces a Brownian-motion--inspired stochastic term to
represent unresolved motions. A residual branch captures small-scale convective
initiation and microphysical variability, while a diffusion-based stochastic
module estimates uncertainty. The framework achieves more accurate forecasts up
to three-hour lead time across precipitation regimes and ranked first in 57\%
of cases in a blind evaluation by 160 meteorologists. By restoring full 3D
dynamics with physical consistency, it offers a scalable and robust pathway for
skillful and reliable nowcasting of extreme precipitation.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinlu Zhang, Yansha Deng, Toktam Mahmoodi
Machine Learning
ๆฅ็ๆ่ฆ
Federated learning (FL) offers new opportunities in machine learning,
particularly in addressing data privacy concerns. In contrast to conventional
event-based federated learning, time-triggered federated learning (TT-Fed), as
a general form of both asynchronous and synchronous FL, clusters users into
different tiers based on fixed time intervals. However, the FL network consists
of a growing number of user devices with limited wireless bandwidth,
consequently magnifying issues such as stragglers and communication overhead.
In this paper, we introduce adaptive model pruning to wireless TT-Fed systems
and study the problem of jointly optimizing the pruning ratio and bandwidth
allocation to minimize the training loss while ensuring minimal learning
latency. To answer this question, we perform convergence analysis on the
gradient l_2 norm of the TT-Fed model based on model pruning. Based on the
obtained convergence upper bound, a joint optimization problem of pruning ratio
and wireless bandwidth is formulated to minimize the model training loss under
a given delay threshold. Then, we derive closed-form solutions for wireless
bandwidth and pruning ratio using Karush-Kuhn-Tucker(KKT) conditions. The
simulation results show that model pruning could reduce the communication cost
by 40% while maintaining the model performance at the same level.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sitan Chen, Kevin Cong, Jerry Li
Natural Language Processing
ๆฅ็ๆ่ฆ
A major bottleneck of standard auto-regressive large language models is that
their inference process is inherently sequential, resulting in very long and
costly inference times. To circumvent this, practitioners proposed a class of
language models called diffusion language models, of which the masked diffusion
model (MDM) is the most successful. The MDM is able to sample tokens
out-of-order and, ostensibly, many tokens at once and in parallel. However,
there is very limited rigorous understanding of how much parallel sampling
these models can perform without noticeable degradation in their sampling
performance. Prior work of Li and Cai obtained some preliminary bounds, but
these are not tight for many natural classes of distributions. In this work, we
give a new, exact characterization of the expected divergence between the true
distribution and the sampled distribution, for any distribution and any
unmasking schedule for the sampler, showing an elegant connection to the theory
of univariate function approximation.
By leveraging this connection, we then attain a number of novel lower and
upper bounds for this problem. While the connection to function approximation
in principle gives the optimal unmasking schedule for any distribution, we show
that it is in general impossible to compete with it without strong a priori
knowledge of the distribution, even in seemingly benign settings. However, we
also demonstrate new upper bounds and new sampling schedules in terms of
well-studied information-theoretic properties of the base distribution, namely,
its total correlation and dual total correlation, which show that in some
natural settings, one can sample in $O(log n)$ steps without any visible loss
in performance, where $n$ is the total sequence length.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Narjes Nourzad, Hanqing Yang, Shiyu Chen, Carlee Joe-Wong
ๆฅ็ๆ่ฆ
Cooperative multi-agent planning requires agents to make joint decisions with
partial information and limited communication. Coordination at the trajectory
level often fails, as small deviations in timing or movement cascade into
conflicts. Symbolic planning mitigates this challenge by raising the level of
abstraction and providing a minimal vocabulary of actions that enable
synchronization and collective progress. We present DR. WELL, a decentralized
neurosymbolic framework for cooperative multi-agent planning. Cooperation
unfolds through a two-phase negotiation protocol: agents first propose
candidate roles with reasoning and then commit to a joint allocation under
consensus and environment constraints. After commitment, each agent
independently generates and executes a symbolic plan for its role without
revealing detailed trajectories. Plans are grounded in execution outcomes via a
shared world model that encodes the current state and is updated as agents act.
By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids
brittle step-level alignment and enables higher-level operations that are
reusable, synchronizable, and interpretable. Experiments on cooperative
block-push tasks show that agents adapt across episodes, with the dynamic world
model capturing reusable patterns and improving task completion rates and
efficiency. Experiments on cooperative block-push tasks show that our dynamic
world model improves task completion and efficiency through negotiation and
self-refinement, trading a time overhead for evolving, more efficient
collaboration strategies.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hans Harder, Abhijeet Vishwasrao, Luca Guastoni, Ricardo Vinuesa, Sebastian Peitz
ๆฅ็ๆ่ฆ
This paper is concerned with probabilistic techniques for forecasting
dynamical systems described by partial differential equations (such as, for
example, the Navier-Stokes equations). In particular, it is investigating and
comparing various extensions to the flow matching paradigm that reduce the
number of sampling steps. In this regard, it compares direct distillation,
progressive distillation, adversarial diffusion distillation, Wasserstein GANs
and rectified flows. Moreover, experiments are conducted on a set of
challenging systems. In particular, we also address the challenge of directly
predicting 2D slices of large-scale 3D simulations, paving the way for
efficient inflow generation for solvers.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Satchel Grant, Simon Jerome Han, Alexa Tartaglini, Christopher Potts
Machine Learning
ๆฅ็ๆ่ฆ
A common approach to mechanistic interpretability is to causally manipulate
model representations via targeted interventions in order to understand what
those representations encode. Here we ask whether such interventions create
out-of-distribution (divergent) representations, and whether this raises
concerns about how faithful their resulting explanations are to the target
model in its natural state. First, we demonstrate empirically that common
causal intervention techniques often do shift internal representations away
from the natural distribution of the target model. Then, we provide a
theoretical analysis of two classes of such divergences: `harmless' divergences
that occur in the null-space of the weights and from covariance within
behavioral decision boundaries, and `pernicious' divergences that activate
hidden network pathways and cause dormant behavioral changes. Finally, in an
effort to mitigate the pernicious cases, we modify the Counterfactual Latent
(CL) loss from Grant (2025) that regularizes interventions to remain closer to
the natural distributions, reducing the likelihood of harmful divergences while
preserving the interpretive power of interventions. Together, these results
highlight a path towards more reliable interpretability methods.
๐
2025-11-06
๐ ArXiv Preprint (math.OC)
๐ฅ Steffen Dereich, Arnulf Jentzen, Sebastian Kassing
Machine Learning
ๆฅ็ๆ่ฆ
The Adam optimizer is currently presumably the most popular optimization
method in deep learning. In this article we develop an ODE based method to
study the Adam optimizer in a fast-slow scaling regime. For fixed momentum
parameters and vanishing step-sizes, we show that the Adam algorithm is an
asymptotic pseudo-trajectory of the flow of a particular vector field, which is
referred to as the Adam vector field. Leveraging properties of asymptotic
pseudo-trajectories, we establish convergence results for the Adam algorithm.
In particular, in a very general setting we show that if the Adam algorithm
converges, then the limit must be a zero of the Adam vector field, rather than
a local minimizer or critical point of the objective function.
In contrast, in the overparametrized empirical risk minimization setting, the
Adam algorithm is able to locally find the set of minima. Specifically, we show
that in a neighborhood of the global minima, the objective function serves as a
Lyapunov function for the flow induced by the Adam vector field. As a
consequence, if the Adam algorithm enters a neighborhood of the global minima
infinitely often, it converges to the set of global minima.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Natalia Glazman, Jyoti Mangal, Pedro Borges, Sebastien Ourselin, M. Jorge Cardoso
ๆฅ็ๆ่ฆ
The application of causal discovery to diseases like Alzheimer's (AD) is
limited by the static graph assumptions of most methods; such models cannot
account for an evolving pathophysiology, modulated by a latent disease
pseudotime. We propose to apply an existing latent variable model to real-world
AD data, inferring a pseudotime that orders patients along a data-driven
disease trajectory independent of chronological age, then learning how causal
relationships evolve. Pseudotime outperformed age in predicting diagnosis (AUC
0.82 vs 0.59). Incorporating minimal, disease-agnostic background knowledge
substantially improved graph accuracy and orientation. Our framework reveals
dynamic interactions between novel (NfL, GFAP) and established AD markers,
enabling practical causal discovery despite violated assumptions.
๐
2025-11-06
๐ Accepted for publication by the Journal of Statist
๐ฅ Maximilian Matthe
ๆฅ็ๆ่ฆ
This paper presents evomap, a Python package for dynamic mapping. Mapping
methods are widely used across disciplines to visualize relationships among
objects as spatial representations, or maps. However, most existing statistical
software supports only static mapping, which captures objects' relationships at
a single point in time and lacks tools to analyze how these relationships
evolve. evomap fills this gap by implementing the dynamic mapping framework
EvoMap, originally proposed by Matthe, Ringel, and Skiera (2023), which adapts
traditional static mapping methods for dynamic analyses. The package supports
multiple mapping techniques, including variants of Multidimensional Scaling
(MDS), Sammon Mapping, and t-distributed Stochastic Neighbor Embedding (t-SNE).
It also includes utilities for data preprocessing, exploration, and result
evaluation, offering a comprehensive toolkit for dynamic mapping applications.
This paper outlines the foundations of static and dynamic mapping, describes
the architecture and functionality of evomap, and illustrates its application
through an extensive usage example.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hampus ร
strรถm, Elin Anna Topp, Jacek Malec
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
In this paper we study how transforming regular reinforcement learning
environments into goal-conditioned environments can let agents learn to solve
tasks autonomously and reward-free. We show that an agent can learn to solve
tasks by selecting its own goals in an environment-agnostic way, at training
times comparable to externally guided reinforcement learning. Our method is
independent of the underlying off-policy learning algorithm. Since our method
is environment-agnostic, the agent does not value any goals higher than others,
leading to instability in performance for individual goals. However, in our
experiments, we show that the average goal success rate improves and
stabilizes. An agent trained with this method can be instructed to seek any
observations made in the environment, enabling generic training of agents prior
to specific use cases.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Utkarsh U. Chavan, Prashant Trivedi, Nandyala Hemachandra
Robotics
ๆฅ็ๆ่ฆ
Multi-agent systems (MAS) are central to applications such as swarm robotics
and traffic routing, where agents must coordinate in a decentralized manner to
achieve a common objective. Stochastic Shortest Path (SSP) problems provide a
natural framework for modeling decentralized control in such settings. While
the problem of learning in SSP has been extensively studied in single-agent
settings, the decentralized multi-agent variant remains largely unexplored. In
this work, we take a step towards addressing that gap. We study decentralized
multi-agent SSPs (Dec-MASSPs) under linear function approximation, where the
transition dynamics and costs are represented using linear models. Applying
novel symmetry-based arguments, we identify the structure of optimal policies.
Our main contribution is the first regret lower bound for this setting based on
the construction of hard-to-learn instances for any number of agents, $n$. Our
regret lower bound of $\Omega(\sqrt{K})$, over $K$ episodes, highlights the
inherent learning difficulty in Dec-MASSPs. These insights clarify the learning
complexity of decentralized control and can further guide the design of
efficient learning algorithms in multi-agent systems.
๐
2025-11-06
๐ ICML 2026
๐ฅ Oshri Naparstek
ๆฅ็ๆ่ฆ
We introduce Complexity as Advantage (CAA), a framework that defines the
complexity of a system relative to a family of observers. Instead of measuring
complexity as an intrinsic property, we evaluate how much predictive regret a
system induces for different observers attempting to model it. A system is
complex when it is easy for some observers and hard for others, creating an
information advantage. We show that this formulation unifies several notions of
emergent behavior, including multiscale entropy, predictive information, and
observer-dependent structure. The framework suggests that "interesting" systems
are those positioned to create differentiated regret across observers,
providing a quantitative grounding for why complexity can be functionally
valuable. We demonstrate the idea through simple dynamical models and discuss
implications for learning, evolution, and artificial agents.
๐
2025-11-06
๐ Issues, comments, and questions are all welcome in
๐ฅ Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu Aizawa
Robotics
ๆฅ็ๆ่ฆ
Understanding the current capabilities and risks of AI Scientist systems is
essential for ensuring trustworthy and sustainable AI-driven scientific
progress while preserving the integrity of the academic ecosystem. To this end,
we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system
that mimics the core research workflow of a novice student researcher: Given
the baseline paper from the human mentor, it analyzes its limitations,
formulates novel hypotheses for improvement, validates them through rigorous
experimentation, and writes a paper with the results. Unlike previous
approaches that assume full automation or operate on small-scale code, Jr. AI
Scientist follows a well-defined research workflow and leverages modern coding
agents to handle complex, multi-file implementations, leading to scientifically
valuable contributions. For evaluation, we conducted automated assessments
using AI Reviewers, author-led evaluations, and submissions to Agents4Science,
a venue dedicated to AI-driven scientific contributions. The findings
demonstrate that Jr. AI Scientist generates papers receiving higher review
scores than existing fully automated systems. Nevertheless, we identify
important limitations from both the author evaluation and the Agents4Science
reviews, indicating the potential risks of directly applying current AI
Scientist systems and key challenges for future research. Finally, we
comprehensively report various risks identified during development. We hope
these insights will deepen understanding of current progress and risks in AI
Scientist development.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Zhuo Zhang, Xiong Xiong, Sen Zhang, Yuan Zhao, Xi Yang
Machine Learning
ๆฅ็ๆ่ฆ
PDEs arise ubiquitously in science and engineering, where solutions depend on
parameters (physical properties, boundary conditions, geometry). Traditional
numerical methods require re-solving the PDE for each parameter, making
parameter space exploration prohibitively expensive. Recent machine learning
advances, particularly physics-informed neural networks (PINNs) and neural
operators, have revolutionized parametric PDE solving by learning solution
operators that generalize across parameter spaces. We critically analyze two
main paradigms: (1) PINNs, which embed physical laws as soft constraints and
excel at inverse problems with sparse data, and (2) neural operators (e.g.,
DeepONet, Fourier Neural Operator), which learn mappings between
infinite-dimensional function spaces and achieve unprecedented generalization.
Through comparisons across fluid dynamics, solid mechanics, heat transfer, and
electromagnetics, we show neural operators can achieve computational speedups
of $10^3$ to $10^5$ times faster than traditional solvers for multi-query
scenarios, while maintaining comparable accuracy. We provide practical guidance
for method selection, discuss theoretical foundations (universal approximation,
convergence), and identify critical open challenges: high-dimensional
parameters, complex geometries, and out-of-distribution generalization. This
work establishes a unified framework for understanding parametric PDE solvers
via operator learning, offering a comprehensive, incrementally updated resource
for this rapidly evolving field
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vasco V. Branco, Jandรณ Benedek, Lidia Pivovarova, Luรญs Correia, Pedro Cardoso
Natural Language Processing
ๆฅ็ๆ่ฆ
1. A hard stop for the implementation of rigorous conservation initiatives is
our lack of key species data, especially occurrence data. Furthermore,
researchers have to contend with an accelerated speed at which new information
must be collected and processed due to anthropogenic activity. Publications
ranging from scientific papers to gray literature contain this crucial
information but their data are often not machine-readable, requiring extensive
human work to be retrieved. 2. We present the ARETE R package, an open-source
software aiming to automate data extraction of species occurrences powered by
large language models, namely using the chatGPT Application Programming
Interface. This R package integrates all steps of the data extraction and
validation process, from Optical Character Recognition to detection of outliers
and output in tabular format. Furthermore, we validate ARETE through systematic
comparison between what is modelled and the work of human annotators. 3. We
demonstrate the usefulness of the approach by comparing range maps produced
using GBIF data and with those automatically extracted for 100 species of
spiders. Newly extracted data allowed to expand the known Extent of Occurrence
by a mean three orders of magnitude, revealing new areas where the species were
found in the past, which mayhave important implications for spatial
conservation planning and extinction risk assessments. 4. ARETE allows faster
access to hitherto untapped occurrence data, a potential game changer in
projects requiring such data. Researchers will be able to better prioritize
resources, manually verifying selected species while maintaining automated
extraction for the majority. This workflow also allows predicting available
bibliographic data during project planning.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Masahiro Kato
Machine Learning
ๆฅ็ๆ่ฆ
Riesz regression has garnered attention as a tool in debiased machine
learning for causal and structural parameter estimation (Chernozhukov et al.,
2021). This study shows that Riesz regression is closely related to direct
density-ratio estimation (DRE) in important cases, including average treat-
ment effect (ATE) estimation. Specifically, the idea and objective in Riesz
regression coincide with the one in least-squares importance fitting (LSIF,
Kanamori et al., 2009) in direct density-ratio estimation. While Riesz
regression is general in the sense that it can be applied to Riesz representer
estimation in a wide class of problems, the equivalence with DRE allows us to
directly import exist- ing results in specific cases, including
convergence-rate analyses, the selection of loss functions via
Bregman-divergence minimization, and regularization techniques for flexible
models, such as neural networks. Conversely, insights about the Riesz
representer in debiased machine learning broaden the applications of direct
density-ratio estimation methods. This paper consolidates our prior results in
Kato (2025a) and Kato (2025b).
๐
2025-11-06
๐ ArXiv Preprint (physics.plasm-ph)
๐ฅ Ionut-Gabriel Farcas, Don Lawrence Carl Agapito Fernando, Alejandro Banon Navarro, Gabriele Merlo, Frank Jenko
Machine Learning
ๆฅ็ๆ่ฆ
Constructing reduced models for turbulent transport is essential for
accelerating profile predictions and enabling many-query tasks such as
uncertainty quantification, parameter scans, and design optimization. This
paper presents machine-learning-driven reduced models for Electron Temperature
Gradient (ETG) turbulence in the Wendelstein 7-X (W7-X) stellarator. Each model
predicts the ETG heat flux as a function of three plasma parameters: the
normalized electron temperature radial gradient ($\omega_{T_e}$), the ratio of
normalized electron temperature and density radial gradients ($\eta_e$), and
the electron-to-ion temperature ratio ($\tau$). We first construct models
across seven radial locations using regression and an active
machine-learning-based procedure. This process initializes models using
low-cardinality sparse-grid training data and then iteratively refines their
training sets by selecting the most informative points from a pre-existing
simulation database. We evaluate the prediction capabilities of our models
using out-of-sample datasets with over $393$ points per location, and $95\%$
prediction intervals are estimated via bootstrapping to assess prediction
uncertainty. We then investigate the construction of generalized reduced
models, including a generic, position-independent model, and assess their heat
flux prediction capabilities at three additional locations. Our models
demonstrate robust performance and predictive accuracy comparable to the
original reference simulations, even when applied beyond the training domain.
๐
2025-11-06
๐ ArXiv Preprint (physics.comp-ph)
๐ฅ Yoh-ichi Mototake, Makoto Sasaki
Machine Learning
ๆฅ็ๆ่ฆ
Physics-informed machine learning (PIML) integrates partial differential
equations (PDEs) into machine learning models to solve inverse problems, such
as estimating coefficient functions (e.g., the Hamiltonian function) that
characterize physical systems. This framework enables data-driven understanding
and prediction of complex physical phenomena. While coefficient functions in
PIML are typically estimated on the basis of predictive performance, physics as
a discipline does not rely solely on prediction accuracy to evaluate models.
For example, Kepler's heliocentric model was favored owing to small
discrepancies in planetary motion, despite its similar predictive accuracy to
the geocentric model. This highlights the inherent uncertainties in data-driven
model inference and the scientific importance of selecting physically
meaningful solutions. In this paper, we propose a framework to quantify and
analyze such uncertainties in the estimation of coefficient functions in PIML.
We apply our framework to reduced model of magnetohydrodynamics and our
framework shows that there are uncertainties, and unique identification is
possible with geometric constraints. Finally, we confirm that we can estimate
the reduced model uniquely by incorporating these constraints.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Divyansha Lachi, Mahmoud Mohammadi, Joe Meyer, Vinam Arora, Tom Palczewski et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
In domains such as healthcare, finance, and e-commerce, the temporal dynamics
of relational data emerge from complex interactions-such as those between
patients and providers, or users and products across diverse categories. To be
broadly useful, models operating on these data must integrate long-range
spatial and temporal dependencies across diverse types of entities, while also
supporting multiple predictive tasks. However, existing graph models for
relational data primarily focus on spatial structure, treating temporal
information merely as a filtering constraint to exclude future events rather
than a modeling signal, and are typically designed for single-task prediction.
To address these gaps, we introduce a temporal subgraph sampler that enhances
global context by retrieving nodes beyond the immediate neighborhood to capture
temporally relevant relationships. In addition, we propose the Relational Graph
Perceiver (RGP), a graph transformer architecture for relational deep learning
that leverages a cross-attention-based latent bottleneck to efficiently
integrate information from both structural and temporal contexts. This latent
bottleneck integrates signals from different node and edge types into a common
latent space, enabling the model to build global context across the entire
relational system. RGP also incorporates a flexible cross-attention decoder
that supports joint learning across tasks with disjoint label spaces within a
single model. Experiments on RelBench, SALT, and CTU show that RGP delivers
state-of-the-art performance, offering a general and scalable solution for
relational deep learning with support for diverse predictive tasks.
๐
2025-11-06
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Dhruv Deepak Agarwal, Aswani Kumar Cherukuri
ๆฅ็ๆ่ฆ
The growth of cloud computing has revolutionized data processing and storage
capacities to another levels of scalability and flexibility. But in the
process, it has created a huge challenge of security, especially in terms of
safeguarding sensitive data. Classical security practices, including encryption
at rest and during transit, fail to protect data in use and expose it to
various possible breaches. In response to this problem , Confidential Computing
has been a tool ,seeking to secure data in processing by usage of
hardware-based Trusted Execution Environments (TEEs). TEEs, including Intel's
Software Guard Extensions (SGX) and ARM's TrustZone, offers protected contexts
within the processor, where data is kept confidential ,intact and secure , even
with malicious software or compromised operating systems. In this research, we
have explored the architecture and security features of TEEs like Intel SGX and
ARM TrustZone, and their effectiveness in improving cloud data security. From a
thorough literature survey ,we have analyzed the deployment strategies,
performance indicators, and practical uses of these TEEs for the same purpose.
In addition, we have discussed the issues regarding deployment, possible
weaknesses, scalability issues, and integration issues. Our results focuses on
the central position of TEEs in strengthening and advancing cloud security
infrastructures, pointing towards their ability to create a secure foundation
for Confidential Computing.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Subati Abulikemu, Tiago Azevedo, Michail Mamalakis, John Suckling
Natural Language Processing
ๆฅ็ๆ่ฆ
Functional brain graphs are often characterized with separate graph-theoretic
or spectral descriptors, overlooking how these properties covary and partially
overlap across brains and conditions. We anticipate that dense, weighted
functional connectivity graphs occupy a low-dimensional latent geometry along
which both topological and spectral structures display graded variations. Here,
we estimated this unified graph representation and enabled generation of dense
functional brain graphs through a graph transformer autoencoder with latent
diffusion, with spectral geometry providing an inductive bias to guide
learning. This geometry-aware latent representation, although unsupervised,
meaningfully separated working-memory states and decoded visual stimuli, with
performance further enhanced by incorporating neural dynamics. From the
diffusion modeled distribution, we were able to sample biologically plausible
and structurally grounded synthetic dense graphs.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Jonas E. Katona, Emily K. de Jong, Nipun Gunawardena
ๆฅ็ๆ่ฆ
Reduced-order models (ROMs) can efficiently simulate high-dimensional
physical systems, but lack robust uncertainty quantification methods. Existing
approaches are frequently architecture- or training-specific, which limits
flexibility and generalization. We introduce a post hoc, model-agnostic
framework for predictive uncertainty quantification in latent space ROMs that
requires no modification to the underlying architecture or training procedure.
Using conformal prediction, our approach estimates statistical prediction
intervals for multiple components of the ROM pipeline: latent dynamics,
reconstruction, and end-to-end predictions. We demonstrate the method on a
latent space dynamical model for cloud microphysics, where it accurately
predicts the evolution of droplet-size distributions and quantifies uncertainty
across the ROM pipeline.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Daniel Mayfrank, Kayra Dernek, Laura Lang, Alexander Mitsos, Manuel Dahmen
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
With our recently proposed method based on reinforcement learning (Mayfrank
et al. (2024), Comput. Chem. Eng. 190), Koopman surrogate models can be trained
for optimal performance in specific (economic) nonlinear model predictive
control ((e)NMPC) applications. So far, our method has exclusively been
demonstrated on a small-scale case study. Herein, we show that our method
scales well to a more challenging demand response case study built on a
large-scale model of a single-product (nitrogen) air separation unit. Across
all numerical experiments, we assume observability of only a few realistically
measurable plant variables. Compared to a purely system identification-based
Koopman eNMPC, which generates small economic savings but frequently violates
constraints, our method delivers similar economic performance while avoiding
constraint violations.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Obed Amo, Samit Ghosh, Markus Lange-Hegermann, Bogdan Raiลฃฤ, Michael Pokojovy
ๆฅ็ๆ่ฆ
We present a new benchmarking study comparing a boundary-constrained
Ehrenpreis--Palamodov Gaussian Process (B-EPGP) surrogate with a classical
finite element method combined with Crank--Nicolson time stepping (CN-FEM) for
solving the two-dimensional wave equation with homogeneous Dirichlet boundary
conditions. The B-EPGP construction leverages exponential-polynomial bases
derived from the characteristic variety to enforce the PDE and boundary
conditions exactly and employs penalized least squares to estimate the
coefficients. To ensure fairness across paradigms, we introduce a
degrees-of-freedom (DoF) matching protocol. Under matched DoF, B-EPGP
consistently attains lower space-time $L^2$-error and maximum-in-time
$L^{2}$-error in space than CN-FEM, improving accuracy by roughly two orders of
magnitude.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ C. Hepburn, T. Zielke, A. P. Raulf
Machine Learning
ๆฅ็ๆ่ฆ
The phenomenon of linear mode connectivity (LMC) links several aspects of
deep learning, including training stability under noisy stochastic gradients,
the smoothness and generalization of local minima (basins), the similarity and
functional diversity of sampled models, and architectural effects on data
processing. In this work, we experimentally study LMC under data shifts and
identify conditions that mitigate their impact. We interpret data shifts as an
additional source of stochastic gradient noise, which can be reduced through
small learning rates and large batch sizes. These parameters influence whether
models converge to the same local minimum or to regions of the loss landscape
with varying smoothness and generalization. Although models sampled via LMC
tend to make similar errors more frequently than those converging to different
basins, the benefit of LMC lies in balancing training efficiency against the
gains achieved from larger, more diverse ensembles. Code and supplementary
materials will be made publicly available at https://github.com/DLR-KI/LMC in
due course.
๐
2025-11-06
๐ AAAI 2026
๐ฅ Shaolong Wu, James Blume, Geshi Yeung
Machine Learning
ๆฅ็ๆ่ฆ
Algorithmic fairness has grown rapidly as a research area, yet key concepts
remain unsettled, especially in criminal justice. We review group, individual,
and process fairness and map the conditions under which they conflict. We then
develop a simple modification to standard group fairness. Rather than exact
parity across protected groups, we minimize a weighted error loss while keeping
differences in false negative rates within a small tolerance. This makes
solutions easier to find, can raise predictive accuracy, and surfaces the
ethical choice of error costs. We situate this proposal within three classes of
critique: biased and incomplete data, latent affirmative action, and the
explosion of subgroup constraints. Finally, we offer a practical framework for
deployment in public decision systems built on three pillars: need-based
decisions, Transparency and accountability, and narrowly tailored definitions
and solutions. Together, these elements link technical design to legitimacy and
provide actionable guidance for agencies that use risk assessment and related
tools.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alper Kalle, Theo Rudkiewicz, Mohamed-Oumar Ouerfelli, Mohamed Tamaazousti
Machine Learning
ๆฅ็ๆ่ฆ
Neural networks are widely used for image-related tasks but typically demand
considerable computing power. Once a network has been trained, however, its
memory- and compute-footprint can be reduced by compression. In this work, we
focus on compression through tensorization and low-rank representations.
Whereas classical approaches search for a low-rank approximation by minimizing
an isotropic norm such as the Frobenius norm in weight-space, we use
data-informed norms that measure the error in function space. Concretely, we
minimize the change in the layer's output distribution, which can be expressed
as $\lVert (W - \widetilde{W}) \Sigma^{1/2}\rVert_F$ where $\Sigma^{1/2}$ is
the square root of the covariance matrix of the layer's input and $W$,
$\widetilde{W}$ are the original and compressed weights. We propose new
alternating least square algorithms for the two most common tensor
decompositions (Tucker-2 and CPD) that directly optimize the new norm. Unlike
conventional compression pipelines, which almost always require
post-compression fine-tuning, our data-informed approach often achieves
competitive accuracy without any fine-tuning. We further show that the same
covariance-based norm can be transferred from one dataset to another with only
a minor accuracy drop, enabling compression even when the original training
dataset is unavailable. Experiments on several CNN architectures (ResNet-18/50,
and GoogLeNet) and datasets (ImageNet, FGVC-Aircraft, Cifar10, and Cifar100)
confirm the advantages of the proposed method.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nikhil Abhyankar, Purvi Chaurasia, Sanchit Kabra, Ananya Srivastava, Vivek Gupta et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing tabular reasoning benchmarks mostly test models on small, uniform
tables, underrepresenting the complexity of real-world data and giving an
incomplete view of Large Language Models' (LLMs) reasoning abilities. Real
tables are long, heterogeneous, and domain-specific, mixing structured fields
with free text and requiring multi-hop reasoning across thousands of tokens. To
address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from
2031 real-world tables spanning two domains: i) RB-Science (NSF grant records)
and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates
LLMs jointly across scale, heterogeneity, domain specificity, and reasoning
complexity. Experiments with open-source and proprietary models show that LLMs
struggle with heterogeneous schemas and complex multi-hop inference, revealing
persistent weaknesses in current architectures and prompting strategies.
RUST-BENCH establishes a challenging new testbed for advancing tabular
reasoning research.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ipsita Ghosh, Ethan Nguyen, Christian Kรผmmerle
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Parameter-efficient training, based on low-rank optimization, has become a
highly successful tool for fine-tuning large deep-learning models. However,
these methods fail at low-rank pre-training tasks where maintaining the
low-rank structure and the objective remains a challenging task. We propose the
Quadratic Reweighted Rank Regularizer dubbed Q3R, which leads to a novel
low-rank inducing training strategy inspired by the iteratively reweighted
least squares (IRLS) framework. Q3R is based on a quadratic regularizer term
which majorizes a smoothed log determinant serving as rank surrogate objective.
Unlike other low-rank training techniques, Q3R is able to train weight matrices
with prescribed, low target ranks of models that achieve comparable predictive
performance as dense models, with small computational overhead, while remaining
fully compatible with existing architectures. For example, we demonstrated one
experiment where we are able to truncate $60\%$ and $80\%$ of the parameters of
a ViT-Tiny model with $~1.3\%$ and $~4\%$ accuracy drop in CIFAR-10 performance
respectively. The efficacy of Q3R is confirmed on Transformers across both
image and language tasks, including for low-rank fine-tuning.
๐
2025-11-06
๐ ArXiv Preprint (cs.DS)
๐ฅ Tsubasa Harada, Yasushi Kawase, Hanna Sumita
ๆฅ็ๆ่ฆ
We study the repeated optimal stopping problem, which generalizes the
classical optimal stopping problem with an unknown distribution to a setting
where the same problem is solved repeatedly over $T$ rounds. In this framework,
we aim to design algorithms that guarantee a competitive ratio in each round
while also achieving sublinear regret across all rounds.
Our primary contribution is a general algorithmic framework that achieves
these objectives simultaneously for a wide array of repeated optimal stopping
problems. The core idea is to dynamically select an algorithm for each round,
choosing between two candidates: (1) an empirically optimal algorithm derived
from the history of observations, and (2) a sample-based algorithm with a
proven competitive ratio guarantee. Based on this approach, we design an
algorithm that performs no worse than the baseline sample-based algorithm in
every round, while ensuring that the total regret is bounded by
$\tilde{O}(\sqrt{T})$.
We demonstrate the broad applicability of our framework to canonical
problems, including the prophet inequality, the secretary problem, and their
variants under adversarial, random, and i.i.d. input models. For example, for
the repeated prophet inequality problem, our method achieves a
$1/2$-competitive ratio from the second round on and an $\tilde{O}(\sqrt{T})$
regret. Furthermore, we establish a regret lower bound of $\Omega(\sqrt{T})$
even in the i.i.d. model, confirming that our algorithm's performance is almost
optimal with respect to the number of rounds.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alberto Cattaneo, Carlo Luschi, Daniel Justus
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval of information from graph-structured knowledge bases represents a
promising direction for improving the factuality of LLMs. While various
solutions have been proposed, a comparison of methods is difficult due to the
lack of challenging QA datasets with ground-truth targets for graph retrieval.
We present SynthKGQA, a framework for generating high-quality synthetic
Knowledge Graph Question Answering datasets from any Knowledge Graph, providing
the full set of ground-truth facts in the KG to reason over each question. We
show how, in addition to enabling more informative benchmarking of KG
retrievers, the data produced with SynthKGQA also allows us to train better
models. We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset
designed to test zero-shot generalization abilities of KG retrievers with
respect to unseen graph structures and relation types, and benchmark popular
solutions for KG-augmented LLMs on it.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Dennis Thumm, Luis Ontaneda Mijares
Machine Learning
ๆฅ็ๆ่ฆ
Market generators using deep generative models have shown promise for
synthetic financial data generation, but existing approaches lack causal
reasoning capabilities essential for counterfactual analysis and risk
assessment. We propose a Time-series Neural Causal Model VAE (TNCM-VAE) that
combines variational autoencoders with structural causal models to generate
counterfactual financial time series while preserving both temporal
dependencies and causal relationships. Our approach enforces causal constraints
through directed acyclic graphs in the decoder architecture and employs the
causal Wasserstein distance for training. We validate our method on synthetic
autoregressive models inspired by the Ornstein-Uhlenbeck process, demonstrating
superior performance in counterfactual probability estimation with L1 distances
as low as 0.03-0.10 compared to ground truth. The model enables financial
stress testing, scenario analysis, and enhanced backtesting by generating
plausible counterfactual market trajectories that respect underlying causal
mechanisms.
๐
2025-11-06
๐ ICML
๐ฅ Abheek Ghosh, Tzeh Yuan Neoh, Nicholas Teh, Giannis Tyrovolas
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We study a model of subscription-based platforms where users pay a fixed fee
for unlimited access to content, and creators receive a share of the revenue.
Existing approaches to detecting fraud predominantly rely on machine learning
methods, engaging in an ongoing arms race with bad actors. We explore revenue
division mechanisms that inherently disincentivize manipulation. We formalize
three types of manipulation-resistance axioms and examine which existing rules
satisfy these. We show that a mechanism widely used by streaming platforms, not
only fails to prevent fraud, but also makes detecting manipulation
computationally intractable. We also introduce a novel rule, ScaledUserProp,
that satisfies all three manipulation-resistance axioms. Finally, experiments
with both real-world and synthetic streaming data support ScaledUserProp as a
fairer alternative compared to existing rules.
๐
2025-11-06
๐ ArXiv Preprint (eess.SY)
๐ฅ Giorgio Palma, Andrea Serani, Matteo Diez
Robotics
ๆฅ็ๆ่ฆ
In this study, we present and validate an ensemble-based Hankel Dynamic Mode
Decomposition with control (HDMDc) for uncertainty-aware seakeeping predictions
of a high-speed catamaran, namely the Delft 372 model. Experimental
measurements (time histories) of wave elevation at the longitudinal center of
gravity, heave, pitch, notional flight-deck velocity, notional bridge
acceleration, and total resistance were collected from irregular wave basin
tests on a 1:33.3 scale replica of the Delft 372 model under sea state 5
conditions at Fr = 0.425, and organized into training, validation, and test
sets. The HDMDc algorithm constructs an equation-free linear reduced-order
model of the seakeeping vessel by augmenting states and inputs with their
time-lagged copies to capture nonlinear and memory effects. Two ensembling
strategies, namely Bayesian HDMDc (BHDMDc), which samples hyperparameters
considered stochastic variables with prior distribution to produce posterior
mean forecasts with confidence intervals, and Frequentist HDMDc (FHDMDc), which
aggregates multiple model obtained over data subsets, are compared in providing
seakeeping prediction and uncertainty quantification. The FHDMDc approach is
found to improve the accuracy of the predictions compared to the deterministic
counterpart, also providing robust uncertainty estimation; whereas the
application of BHDMDc to the present test case is not found beneficial in
comparison to the deterministic model. FHDMDc-derived probability density
functions for the motions closely match both experimental data and URANS
results, demonstrating reliable and computationally efficient seakeeping
prediction for design and operational support.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinwen Zhang, Hongchang Gao
Machine Learning
ๆฅ็ๆ่ฆ
Heavy-tailed noise has attracted growing attention in nonconvex stochastic
optimization, as numerous empirical studies suggest it offers a more realistic
assumption than standard bounded variance assumption. In this work, we
investigate nonconvex-PL minimax optimization under heavy-tailed gradient noise
in federated learning. We propose two novel algorithms: Fed-NSGDA-M, which
integrates normalized gradients, and FedMuon-DA, which leverages the Muon
optimizer for local updates. Both algorithms are designed to effectively
address heavy-tailed noise in federated minimax optimization, under a milder
condition. We theoretically establish that both algorithms achieve a
convergence rate of $O({1}/{(TNp)^{\frac{s-1}{2s}}})$. To the best of our
knowledge, these are the first federated minimax optimization algorithms with
rigorous theoretical guarantees under heavy-tailed noise. Extensive experiments
further validate their effectiveness.
๐
2025-11-06
๐ ArXiv Preprint (cs.CE)
๐ฅ Hao Zhu, Jasper Hoffmann, Baohe Zhang, Joschka Boedecker
Machine Learning
ๆฅ็ๆ่ฆ
We consider the problem of fitting a reinforcement learning (RL) model to
some given behavioral data under a multi-armed bandit environment. These models
have received much attention in recent years for characterizing human and
animal decision making behavior. We provide a generic mathematical optimization
problem formulation for the fitting problem of a wide range of RL models that
appear frequently in scientific research applications, followed by a detailed
theoretical analysis of its convexity properties. Based on the theoretical
results, we introduce a novel solution method for the fitting problem of RL
models based on convex relaxation and optimization. Our method is then
evaluated in several simulated bandit environments to compare with some
benchmark methods that appear in the literature. Numerical results indicate
that our method achieves comparable performance to the state-of-the-art, while
significantly reducing computation time. We also provide an open-source Python
package for our proposed method to empower researchers to apply it in the
analysis of their datasets directly, without prior knowledge of convex
optimization.
๐
2025-11-06
๐ ArXiv Preprint (eess.SY)
๐ฅ Patrik Valรกbek, Marek Wadinger, Michal Kvasnica, Martin Klauฤo
Robotics
ๆฅ็ๆ่ฆ
Nonlinear dynamical systems with input delays pose significant challenges for
prediction, estimation, and control due to their inherent complexity and the
impact of delays on system behavior. Traditional linear control techniques
often fail in these contexts, necessitating innovative approaches. This paper
introduces a novel approach to approximate the Koopman operator using an
LSTM-enhanced Deep Koopman model, enabling linear representations of nonlinear
systems with time delays. By incorporating Long Short-Term Memory (LSTM)
layers, the proposed framework captures historical dependencies and efficiently
encodes time-delayed system dynamics into a latent space. Unlike traditional
extended Dynamic Mode Decomposition (eDMD) approaches that rely on predefined
dictionaries, the LSTM-enhanced Deep Koopman model is dictionary-free, which
mitigates the problems with the underlying dynamics being known and
incorporated into the dictionary. Quantitative comparisons with extended eDMD
on a simulated system demonstrate highly significant performance gains in
prediction accuracy in cases where the true nonlinear dynamics are unknown and
achieve comparable results to eDMD with known dynamics of a system.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Syeda Sitara Wishal Fatima, Afshin Rahimi
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Time series forecasting is essential across domains from finance to supply
chain management. This paper introduces ForecastGAN, a novel decomposition
based adversarial framework addressing limitations in existing approaches for
multi-horizon predictions. Although transformer models excel in long-term
forecasting, they often underperform in short-term scenarios and typically
ignore categorical features. ForecastGAN operates through three integrated
modules: a Decomposition Module that extracts seasonality and trend components;
a Model Selection Module that identifies optimal neural network configurations
based on forecasting horizon; and an Adversarial Training Module that enhances
prediction robustness through Conditional Generative Adversarial Network
training. Unlike conventional approaches, ForecastGAN effectively integrates
both numerical and categorical features. We validate our framework on eleven
benchmark multivariate time series datasets that span various forecasting
horizons. The results show that ForecastGAN consistently outperforms
state-of-the-art transformer models for short-term forecasting while remaining
competitive for long-term horizons. This research establishes a more
generalizable approach to time series forecasting that adapts to specific
contexts while maintaining strong performance across diverse data
characteristics without extensive hyperparameter tuning.
๐
2025-11-06
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Anisha Garg, Ganesh Venkatesh
Machine Learning
ๆฅ็ๆ่ฆ
Group-relative Policy Optimization's (GRPO) simplicity makes it highly
desirable for adapting LLMs to become experts at specific tasks. But this
simplicity also makes it ill-specified as we seek to enhance RL training with
richer, non-binary feedback. When using ordinal rewards to give partial credit,
GRPO's simplicity starts to hurt, as its group-average baseline often assigns a
positive advantage to failed trajectories and reinforces incorrect behavior.
We introduce Correctness Relative Policy Optimization (CoRPO), a new
formulation that solves this flaw. CoRPO uses an adaptive baseline that
enforces a minimum quality threshold, ensuring failed solutions are never
positively reinforced. Once the policy consistently meets this threshold, the
baseline automatically transitions to a relative preference mode, pushing the
model to find optimal solutions rather than just "acceptable" ones. We
empirically validate CoRPO on a code verification task, where it demonstrates
more stable convergence and better out-of-domain generalization.
This work represents a critical step in our broader research program to
enable LLMs to learn genuinely new capabilities through reinforcement learning.
We achieve this by enabling LLMs to learn from rich, multi-dimensional feedback
- progressing from binary to ordinal rewards in this work, and onward to
denser, per-step supervision.
๐
2025-11-06
๐ ArXiv Preprint (eess.SY)
๐ฅ Patrik Valรกbek, Michaela Horvรกthovรก, Martin Klauฤo
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a deep Koopman-based Economic Model Predictive Control
(EMPC) for efficient operation of a laboratory-scale pasteurization unit (PU).
The method uses Koopman operator theory to transform the complex, nonlinear
system dynamics into a linear representation, enabling the application of
convex optimization while representing the complex PU accurately. The deep
Koopman model utilizes neural networks to learn the linear dynamics from
experimental data, achieving a 45% improvement in open-loop prediction accuracy
over conventional N4SID subspace identification. Both analyzed models were
employed in the EMPC formulation that includes interpretable economic costs,
such as energy consumption, material losses due to inadequate pasteurization,
and actuator wear. The feasibility of EMPC is ensured using slack variables.
The deep Koopman EMPC and N4SID EMPC are numerically validated on a nonlinear
model of multivariable PU under external disturbance. The disturbances include
feed pump fail-to-close scenario and the introduction of a cold batch to be
pastuerized. These results demonstrate that the deep Koopmand EMPC achieves a
32% reduction in total economic cost compared to the N4SID baseline. This
improvement is mainly due to the reductions in material losses and energy
consumption. Furthermore, the steady-state operation via Koopman-based EMPC
requires 10.2% less electrical energy. The results highlight the practical
advantages of integrating deep Koopman representations with economic
optimization to achieve resource-efficient control of thermal-intensive plants.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jayadeva, Naman Dwivedi, Hari Krishnan, N. M. Anoop Krishnan
Machine Learning
ๆฅ็ๆ่ฆ
A formal link between regression and classification has been tenuous. Even
though the margin maximization term $\|w\|$ is used in support vector
regression, it has at best been justified as a regularizer. We show that a
regression problem with $M$ samples lying on a hyperplane has a one-to-one
equivalence with a linearly separable classification task with $2M$ samples. We
show that margin maximization on the equivalent classification task leads to a
different regression formulation than traditionally used. Using the
equivalence, we demonstrate a ``regressability'' measure, that can be used to
estimate the difficulty of regressing a dataset, without needing to first learn
a model for it. We use the equivalence to train neural networks to learn a
linearizing map, that transforms input variables into a space where a linear
regressor is adequate.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tim Tomov, Dominik Fuchsgruber, Tom Wollschlรคger, Stephan Gรผnnemann
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is
critical for trustworthy deployment. While real-world language is inherently
ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically
benchmarked against tasks with no ambiguity. In this work, we demonstrate that
while current uncertainty estimators perform well under the restrictive
assumption of no ambiguity, they degrade to close-to-random performance on
ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first
ambiguous question-answering (QA) datasets equipped with ground-truth answer
distributions estimated from factual co-occurrence. We find this performance
deterioration to be consistent across different estimation paradigms: using the
predictive distribution itself, internal representations throughout the model,
and an ensemble of models. We show that this phenomenon can be theoretically
explained, revealing that predictive-distribution and ensemble-based estimators
are fundamentally limited under ambiguity. Overall, our study reveals a key
shortcoming of current UQ methods for LLMs and motivates a rethinking of
current modeling paradigms.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Sara Pรฉrez-Vieites, Sahel Iqbal, Simo Sรคrkkรค, Dominik Baumann
Machine Learning
ๆฅ็ๆ่ฆ
Bayesian experimental design (BED) provides a principled framework for
optimizing data collection, but existing approaches do not apply to crucial
real-world settings such as dynamical systems with partial observability, where
only noisy and incomplete observations are available. These systems are
naturally modeled as state-space models (SSMs), where latent states mediate the
link between parameters and data, making the likelihood -- and thus
information-theoretic objectives like the expected information gain (EIG) --
intractable. In addition, the dynamical nature of the system requires online
algorithms that update posterior distributions and select designs sequentially
in a computationally efficient manner. We address these challenges by deriving
new estimators of the EIG and its gradient that explicitly marginalize latent
states, enabling scalable stochastic optimization in nonlinear SSMs. Our
approach leverages nested particle filters (NPFs) for efficient online
inference with convergence guarantees. Applications to realistic models, such
as the susceptible-infected-recovered (SIR) and a moving source location task,
show that our framework successfully handles both partial observability and
online computation.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Subeen Park, Joowang Kim, Hakyung Lee, Sunjae Yoo, Kyungwoo Song
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning models achieve strong performance across various domains but
often rely on spurious correlations, making them vulnerable to distribution
shifts. This issue is particularly severe in subpopulation shift scenarios,
where models struggle in underrepresented groups. While existing methods have
made progress in mitigating this issue, their performance gains are still
constrained. They lack a rigorous theoretical framework connecting the
embedding space representations with worst-group error. To address this
limitation, we propose Spurious Correlation-Aware Embedding Regularization for
Worst-Group Robustness (SCER), a novel approach that directly regularizes
feature representations to suppress spurious cues. We show theoretically that
worst-group error is influenced by how strongly the classifier relies on
spurious versus core directions, identified from differences in group-wise mean
embeddings across domains and classes. By imposing theoretical constraints at
the embedding level, SCER encourages models to focus on core features while
reducing sensitivity to spurious patterns. Through systematic evaluation on
multiple vision and language, we show that SCER outperforms prior
state-of-the-art studies in worst-group accuracy. Our code is available at
\href{https://github.com/MLAI-Yonsei/SCER}{https://github.com/MLAI-Yonsei/SCER}.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Itbaan Safwan, Muhammad Annas Shaikh, Muhammad Haaris, Ramail Khan, Muhammad Atif Tahir
ๆฅ็ๆ่ฆ
We present a multi-task framework for the MediaEval Medico 2025 challenge,
leveraging a LoRA-tuned Florence-2 model for simultaneous visual question
answering (VQA), explanation generation, and visual grounding. The proposed
system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer
learning, (2) a synthetically enriched explanation dataset offering structured
medical reasoning, and (3) text-to-region pairs linking visual features with
segmentation masks. This multi-task setup enables the model to jointly learn
visual grounding, reasoning, and interpretation, producing responses that are
both accurate and interpretable. Extensive evaluation demonstrates that our
approach substantially improves over single-task baselines in both answer
accuracy and visual localization, highlighting the effectiveness of grounded
multi-task learning for medical VQA applications.
๐
2025-11-06
๐ ArXiv Preprint (cs.SD)
๐ฅ Ali Boudaghi, Hadi Zare
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Music editing has emerged as an important and practical area of artificial
intelligence, with applications ranging from video game and film music
production to personalizing existing tracks according to user preferences.
However, existing models face significant limitations, such as being restricted
to editing synthesized music generated by their own models, requiring highly
precise prompts, or necessitating task-specific retraining, thus lacking true
zero-shot capability. Leveraging recent advances in rectified flow and
diffusion transformers, we introduce MusRec, the first zero-shot text-to-music
editing model capable of performing diverse editing tasks on real-world music
efficiently and effectively. Experimental results demonstrate that our approach
outperforms existing methods in preserving musical content, structural
consistency, and editing fidelity, establishing a strong foundation for
controllable music editing in real-world scenarios.
๐
2025-11-06
๐ ArXiv Preprint (q-fin.CP)
๐ฅ Dennis Thumm
ๆฅ็ๆ่ฆ
Energy markets exhibit complex causal relationships between weather patterns,
generation technologies, and price formation, with regime changes occurring
continuously rather than at discrete break points. Current approaches model
electricity prices without explicit causal interpretation or counterfactual
reasoning capabilities. We introduce Augmented Time Series Causal Models
(ATSCM) for energy markets, extending counterfactual reasoning frameworks to
multivariate temporal data with learned causal structure. Our approach models
energy systems through interpretable factors (weather, generation mix, demand
patterns), rich grid dynamics, and observable market variables. We integrate
neural causal discovery to learn time-varying causal graphs without requiring
ground truth DAGs. Applied to real-world electricity price data, ATSCM enables
novel counterfactual queries such as "What would prices be under different
renewable generation scenarios?".
๐
2025-11-06
๐ To be published in Proceedings of 2025 2nd IEEE/AC
๐ฅ Amir Molzam Sharifloo, Maedeh Heydari, Parsa Kazerooni, Daniel Maninger, Mira Mezini
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have achieved remarkable success in code
generation, and the race to improve their performance has become a central
focus of AI research. Benchmarks and leaderboards are increasingly popular,
offering quantitative rankings of LLMs. However, they provide limited insight
into the tasks that LLMs consistently fail to solve - information that is
crucial for understanding current limitations and guiding the development of
more capable models. To address this gap, we examined code generation tasks
across four popular benchmarks, identifying those that major LLMs are most
likely to fail. To understand the causes of these failures, we investigated
whether the static complexity of solution code contributes to them, followed by
a systematic inspection of 114 tasks that LLMs consistently struggled with. Our
analysis revealed four recurring patterns of weaknesses in LLMs, as well as
common complications within benchmark tasks that most often lead to failure.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saรบl Alonso-Monsalve, Leigh H. Whitehead, Adam Aurisano, Lorena Escudero Sanchez
Machine Learning
ๆฅ็ๆ่ฆ
The accurate delineation of tumours in radiological images like Computed
Tomography is a very specialised and time-consuming task, and currently a
bottleneck preventing quantitative analyses to be performed routinely in the
clinical setting. For this reason, developing methods for the automated
segmentation of tumours in medical imaging is of the utmost importance and has
driven significant efforts in recent years. However, challenges regarding the
impracticality of 3D scans, given the large amount of voxels to be analysed,
usually requires the downsampling of such images or using patches thereof when
applying traditional convolutional neural networks. To overcome this problem,
in this paper we propose a new methodology that uses, divided into two stages,
voxel sparsification and submanifold sparse convolutional networks. This method
allows segmentations to be performed with high-resolution inputs and a native
3D model architecture, obtaining state-of-the-art accuracies while
significantly reducing the computational resources needed in terms of GPU
memory and time. We studied the deployment of this methodology in the context
of Computed Tomography images of renal cancer patients from the KiTS23
challenge, and our method achieved results competitive with the challenge
winners, with Dice similarity coefficients of 95.8% for kidneys + masses, 85.7%
for tumours + cysts, and 80.3% for tumours alone. Crucially, our method also
offers significant computational improvements, achieving up to a 60% reduction
in inference time and up to a 75\% reduction in VRAM usage compared to an
equivalent dense architecture, across both CPU and various GPU cards tested.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Federico Pirola, Fabio Stella, Marco Grzegorczyk
ๆฅ็ๆ่ฆ
Dynamic Bayesian networks (DBNs) are increasingly used in healthcare due to
their ability to model complex temporal relationships in patient data while
maintaining interpretability, an essential feature for clinical
decision-making. However, existing approaches to handling missing data in
longitudinal clinical datasets are largely derived from static Bayesian
networks literature, failing to properly account for the temporal nature of the
data. This gap limits the ability to quantify uncertainty over time, which is
particularly critical in settings such as intensive care, where understanding
the temporal dynamics is fundamental for model trustworthiness and
applicability across diverse patient groups. Despite the potential of DBNs, a
full Bayesian framework that integrates missing data handling remains
underdeveloped. In this work, we propose a novel Gibbs sampling-based method
for learning DBNs from incomplete data. Our method treats each missing value as
an unknown parameter following a Gaussian distribution. At each iteration, the
unobserved values are sampled from their full conditional distributions,
allowing for principled imputation and uncertainty estimation. We evaluate our
method on both simulated datasets and real-world intensive care data from
critically ill patients. Compared to standard model-agnostic techniques such as
MICE, our Bayesian approach demonstrates superior reconstruction accuracy and
convergence properties. These results highlight the clinical relevance of
incorporating full Bayesian inference in temporal models, providing more
reliable imputations and offering deeper insight into model behavior. Our
approach supports safer and more informed clinical decision-making,
particularly in settings where missing data are frequent and potentially
impactful.
๐
2025-11-06
๐ NeurIPS
๐ฅ Antti Koskela, Tejas Kulkarni, Laith Zumot
Natural Language Processing
ๆฅ็ๆ่ฆ
Differentially private in-context learning (DP-ICL) has recently become an
active research topic due to the inherent privacy risks of in-context learning.
However, existing approaches overlook a critical component of modern large
language model (LLM) pipelines: the similarity search used to retrieve relevant
context data. In this work, we introduce a DP framework for in-context learning
that integrates nearest neighbor search of relevant examples in a privacy-aware
manner. Our method outperforms existing baselines by a substantial margin
across all evaluated benchmarks, achieving more favorable privacy-utility
trade-offs. To achieve this, we employ nearest neighbor retrieval from a
database of context data, combined with a privacy filter that tracks the
cumulative privacy cost of selected samples to ensure adherence to a central
differential privacy budget. Experimental results on text classification and
document question answering show a clear advantage of the proposed method over
existing baselines.
๐
2025-11-06
๐ ArXiv Preprint (cs.AR)
๐ฅ Yuanpeng Zhang, Xing Hu, Xi Chen, Zhihang Yuan, Cong Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
SRAM Processing-in-Memory (PIM) has emerged as the most promising
implementation for high-performance PIM, delivering superior computing density,
energy efficiency, and computational precision. However, the pursuit of higher
performance necessitates more complex circuit designs and increased operating
frequencies, which exacerbate IR-drop issues. Severe IR-drop can significantly
degrade chip performance and even threaten reliability. Conventional
circuit-level IR-drop mitigation methods, such as back-end optimizations, are
resource-intensive and often compromise power, performance, and area (PPA). To
address these challenges, we propose AIM, comprehensive software and hardware
co-design for architecture-level IR-drop mitigation in high-performance PIM.
Initially, leveraging the bit-serial and in-situ dataflow processing properties
of PIM, we introduce Rtog and HR, which establish a direct correlation between
PIM workloads and IR-drop. Building on this foundation, we propose LHR and WDS,
enabling extensive exploration of architecture-level IR-drop mitigation while
maintaining computational accuracy through software optimization. Subsequently,
we develop IR-Booster, a dynamic adjustment mechanism that integrates
software-level HR information with hardware-based IR-drop monitoring to adapt
the V-f pairs of the PIM macro, achieving enhanced energy efficiency and
performance. Finally, we propose the HR-aware task mapping method, bridging
software and hardware designs to achieve optimal improvement. Post-layout
simulation results on a 7nm 256-TOPS PIM chip demonstrate that AIM achieves up
to 69.2% IR-drop mitigation, resulting in 2.29x energy efficiency improvement
and 1.152x speedup.
๐
2025-11-06
๐ ArXiv Preprint (math.NA)
๐ฅ Michael Ludkovski, Changgen Xie, Zimu Zhu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We consider numerical resolution of principal-agent (PA) problems in
continuous time. We formulate a generic PA model with continuous and lump
payments and a multi-dimensional strategy of the agent. To tackle the resulting
Hamilton-Jacobi-Bellman equation with an implicit Hamiltonian we develop a
novel deep learning method: the Deep Principal-Agent Actor Critic (DeepPAAC)
Actor-Critic algorithm. DeepPAAC is able to handle multi-dimensional states and
controls, as well as constraints. We investigate the role of the neural network
architecture, training designs, loss functions, etc. on the convergence of the
solver, presenting five different case studies.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Giovanni Barbarino, Nicolas Gillis, Subhayan Saha
ๆฅ็ๆ่ฆ
Minimum-volume nonnegative matrix factorization (min-vol NMF) has been used
successfully in many applications, such as hyperspectral imaging, chemical
kinetics, spectroscopy, topic modeling, and audio source separation. However,
its robustness to noise has been a long-standing open problem. In this paper,
we prove that min-vol NMF identifies the groundtruth factors in the presence of
noise under a condition referred to as the expanded sufficiently scattered
condition which requires the data points to be sufficiently well scattered in
the latent simplex generated by the basis vectors.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matteo Cercola, Valeria Capretti, Simone Formentin
Machine Learning
ๆฅ็ๆ่ฆ
Learning from human preferences is a cornerstone of aligning machine learning
models with subjective human judgments. Yet, collecting such preference data is
often costly and time-consuming, motivating the need for more efficient
learning paradigms. Two established approaches offer complementary advantages:
RLHF scales effectively to high-dimensional tasks such as LLM fine-tuning,
while PBO achieves greater sample efficiency through active querying. We
propose a hybrid framework that unifies RLHF's scalability with PBO's query
efficiency by integrating an acquisition-driven module into the RLHF pipeline,
thereby enabling active and sample-efficient preference gathering. We validate
the proposed approach on two representative domains: (i) high-dimensional
preference optimization and (ii) LLM fine-tuning. Experimental results
demonstrate consistent improvements in both sample efficiency and overall
performance across these tasks.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Jungbin Jun, Ilsang Ohn
ๆฅ็ๆ่ฆ
Conformal prediction has emerged as a powerful framework for constructing
distribution-free prediction sets with guaranteed coverage assuming only the
exchangeability assumption. However, this assumption is often violated in
online environments where data distributions evolve over time. Several recent
approaches have been proposed to address this limitation, but, typically, they
slowly adapt to distribution shifts because they update predictions only in a
forward manner, that is, they generate a prediction for a newly observed data
point while previously computed predictions are not updated. In this paper, we
propose a novel online conformal inference method with retrospective
adjustment, which is designed to achieve faster adaptation to distributional
shifts. Our method leverages regression approaches with efficient leave-one-out
update formulas to retroactively adjust past predictions when new data arrive,
thereby aligning the entire set of predictions with the most recent data
distribution. Through extensive numerical studies performed on both synthetic
and real-world data sets, we show that the proposed approach achieves faster
coverage recalibration and improved statistical efficiency compared to existing
online conformal prediction methods.
๐
2025-11-06
๐ ArXiv Preprint (Computer Vision)
๐ฅ Marawan Elbatel, Anbang Wang, Keyuan Liu, Kaouther Mouheb, Enrique Almar-Munoz et al.
Computer Vision
ๆฅ็ๆ่ฆ
This paper does not introduce a novel architecture; instead, it revisits a
fundamental yet overlooked baseline: adapting human-centric foundation models
for anatomical landmark detection in medical imaging. While landmark detection
has traditionally relied on domain-specific models, the emergence of
large-scale pre-trained vision models presents new opportunities. In this
study, we investigate the adaptation of Sapiens, a human-centric foundation
model designed for pose estimation, to medical imaging through multi-dataset
pretraining, establishing a new state of the art across multiple datasets. Our
proposed model, MedSapiens, demonstrates that human-centric foundation models,
inherently optimized for spatial pose localization, provide strong priors for
anatomical landmark detection, yet this potential has remained largely
untapped. We benchmark MedSapiens against existing state-of-the-art models,
achieving up to 5.26% improvement over generalist models and up to 21.81%
improvement over specialist models in the average success detection rate (SDR).
To further assess MedSapiens adaptability to novel downstream tasks with few
annotations, we evaluate its performance in limited-data settings, achieving
2.69% improvement over the few-shot state of the art in SDR. Code and model
weights are available at https://github.com/xmed-lab/MedSapiens .
๐
2025-11-06
๐ submitted to Journal of Artificial Intelligence Re
๐ฅ Irene Ferfoglia, Simone Silvetti, Gaia Saveri, Laura Nenzi, Luca Bortolussi
Machine Learning
ๆฅ็ๆ่ฆ
Time series classification is a task of paramount importance, as this kind of
data often arises in safety-critical applications. However, it is typically
tackled with black-box deep learning methods, making it hard for humans to
understand the rationale behind their output. To take on this challenge, we
propose a novel approach, STELLE (Signal Temporal logic Embedding for
Logically-grounded Learning and Explanation), a neuro-symbolic framework that
unifies classification and explanation through direct embedding of trajectories
into a space of temporal logic concepts. By introducing a novel STL-inspired
kernel that maps raw time series to their alignment with predefined STL
formulae, our model jointly optimises accuracy and interpretability, as each
prediction is accompanied by the most relevant logical concepts that
characterise it. This yields (i) local explanations as human-readable STL
conditions justifying individual predictions, and (ii) global explanations as
class-characterising formulae. Experiments demonstrate that STELLE achieves
competitive accuracy while providing logically faithful explanations, validated
on diverse real-world benchmarks.
๐
2025-11-06
๐ ArXiv Preprint (quant-ph)
๐ฅ Valter Uotila, Vรคinรถ Mehtola, Ilmo Salmenperรค, Bo Zhao
Machine Learning
ๆฅ็ๆ่ฆ
Leveraging data symmetries has been a key driver of performance gains in
geometric deep learning and geometric and equivariant quantum machine learning.
While symmetrization appears to be a promising method, its practical overhead,
such as additional gates, reduced expressibility, and other factors, is not
well understood in quantum machine learning. In this work, we develop an
automated pipeline to measure various characteristics of quantum machine
learning ansatzes with respect to symmetries that can appear in the learning
task. We define the degree of symmetry in the learning problem as the size of
the subgroup it admits. Subgroups define partial symmetries, which have not
been extensively studied in previous research, which has focused on symmetries
defined by whole groups. Symmetrizing the 19 common ansatzes with respect to
these varying-sized subgroup representations, we compute three classes of
metrics that describe how the common ansatz structures behave under varying
amounts of symmetries. The first metric is based on the norm of the difference
between the original and symmetrized generators, while the second metric counts
depth, size, and other characteristics from the symmetrized circuits. The third
class of metrics includes expressibility and entangling capability. The results
demonstrate varying gate overhead across the studied ansatzes and confirm that
increased symmetry reduces expressibility of the circuits. In most cases,
increased symmetry increases entanglement capability. These results help select
sufficiently expressible and computationally efficient ansatze patterns for
geometric quantum machine learning applications.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rasmus Mรธller-Larsen, Adam Izdebski, Jan Olszewski, Pankhil Gawade, Michal Kmicikiewicz et al.
ๆฅ็ๆ่ฆ
Recent advances in computational methods for designing biological sequences
have sparked the development of metrics to evaluate these methods performance
in terms of the fidelity of the designed sequences to a target distribution and
their attainment of desired properties. However, a single software library
implementing these metrics was lacking. In this work we introduce seqme, a
modular and highly extendable open-source Python library, containing
model-agnostic metrics for evaluating computational methods for biological
sequence design. seqme considers three groups of metrics: sequence-based,
embedding-based, and property-based, and is applicable to a wide range of
biological sequences: small molecules, DNA, ncRNA, mRNA, peptides and proteins.
The library offers a number of embedding and property models for biological
sequences, as well as diagnostics and visualization functions to inspect the
results. seqme can be used to evaluate both one-shot and iterative
computational design methods.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Liran Cohen, Yaniv Nemcovesky, Avi Mendelson
Natural Language Processing
ๆฅ็ๆ่ฆ
Machine unlearning aims to remove the influence of specific training data
from a model without requiring full retraining. This capability is crucial for
ensuring privacy, safety, and regulatory compliance. Therefore, verifying
whether a model has truly forgotten target data is essential for maintaining
reliability and trustworthiness. However, existing evaluation methods often
assess forgetting at the level of individual inputs. This approach may overlook
residual influence present in semantically similar examples. Such influence can
compromise privacy and lead to indirect information leakage. We propose REMIND
(Residual Memorization In Neighborhood Dynamics), a novel evaluation method
aiming to detect the subtle remaining influence of unlearned data and classify
whether the data has been effectively forgotten. REMIND analyzes the model's
loss over small input variations and reveals patterns unnoticed by single-point
evaluations. We show that unlearned data yield flatter, less steep loss
landscapes, while retained or unrelated data exhibit sharper, more volatile
patterns. REMIND requires only query-based access, outperforms existing methods
under similar constraints, and demonstrates robustness across different models,
datasets, and paraphrased inputs, making it practical for real-world
deployment. By providing a more sensitive and interpretable measure of
unlearning effectiveness, REMIND provides a reliable framework to assess
unlearning in language models. As a result, REMIND offers a novel perspective
on memorization and unlearning.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The strong lottery ticket hypothesis (SLTH) conjectures that high-performing
subnetworks, called strong lottery tickets (SLTs), are hidden in randomly
initialized neural networks. Although recent theoretical studies have
established the SLTH across various neural architectures, the SLTH for
transformer architectures still lacks theoretical understanding. In particular,
the current theory of the SLTH does not yet account for the multi-head
attention (MHA) mechanism, a core component of transformers. To address this
gap, we introduce a theoretical analysis of the existence of SLTs within MHAs.
We prove that, if a randomly initialized MHA of $H$ heads and input dimension
$d$ has the hidden dimension $O(d\log(Hd^{3/2}))$ for the key and value, it
contains an SLT that approximates an arbitrary MHA with the same input
dimension with high probability. Furthermore, by leveraging this theory for
MHAs, we extend the SLTH to transformers without normalization layers. We
empirically validate our theoretical findings, demonstrating that the
approximation error between the SLT within a source model (MHA and transformer)
and an approximate target counterpart decreases exponentially by increasing the
hidden dimension of the source model.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved remarkable success, but their
rapidly growing scale imposes prohibitive costs in memory, computation, and
energy. Post-training quantization (PTQ) is a promising solution for efficient
deployment, yet achieving accurate W4A4 quantization remains an open challenge.
While most existing methods are designed for INT4 formats, the emergence of
MXFP4 -- a new FP4 format with various hardware support (NVIDIA, AMD, Intel)--
raises questions about the applicability of current techniques. In this work,
we establish a comprehensive benchmark of PTQ methods under the MXFP4 format.
Through systematic evaluation, we find that methods like GPTQ consistently
deliver strong performance, whereas rotation-based approaches, which are almost
used by all state-of-the-art approaches, suffer from severe incompatibility
with MXFP4. We further provide the first in-depth analysis of this conflict,
tracing its root to a fundamental mismatch between MXFP4's PoT (power-of-two)
block scaling and the redistribution of outlier energy via global rotation.
Building on this insight, we propose a simple yet effective block rotation
strategy that adapts rotation-based methods to MXFP4, leading to substantial
accuracy improvements across diverse LLMs. Our findings not only offer clear
guidance for practitioners but also set a foundation for advancing PTQ research
under emerging low-precision formats.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaokai Wang, Shaoyuan Huang, Yuting Li, Xiaofei Wang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Deep neural networks (DNNs) form the cornerstone of modern AI services,
supporting a wide range of applications, including autonomous driving,
chatbots, and recommendation systems. As models increase in size and
complexity, DNN workloads like training and inference tasks impose
unprecedented demands on distributed computing resources, making the accurate
prediction of runtime essential for optimizing development and resource
allocation. Traditional methods rely on additive computational unit models,
limiting their accuracy and generalizability. In contrast, graph-enhanced
modeling improves performance but significantly increases data collection
costs. Therefore, there is a critical need for a method that strikes a balance
between accuracy, generalizability, and the costs of data collection. To
address these challenges, we propose ScaleDL, a novel runtime prediction
framework that combines nonlinear layer-wise modeling with graph neural network
(GNN)-based cross-layer interaction mechanism, enabling accurate DNN runtime
prediction and hierarchical generalizability across different network
architectures. Additionally, we employ the D-optimal method to reduce data
collection costs. Experiments on the workloads of five popular DNN models prove
that ScaleDL enhances runtime prediction accuracy and generalizability,
achieving 6$\times$ lower MRE and 5$\times$ lower RMSE compared to baseline
models.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Laurits Fredsgaard, Mikkel N. Schmidt
Machine Learning
ๆฅ็ๆ่ฆ
Deep ensembles are a powerful tool in machine learning, improving both model
performance and uncertainty calibration. While ensembles are typically formed
by training and tuning models individually, evidence suggests that jointly
tuning the ensemble can lead to better performance. This paper investigates the
impact of jointly tuning weight decay, temperature scaling, and early stopping
on both predictive performance and uncertainty quantification. Additionally, we
propose a partially overlapping holdout strategy as a practical compromise
between enabling joint evaluation and maximizing the use of data for training.
Our results demonstrate that jointly tuning the ensemble generally matches or
improves performance, with significant variation in effect size across
different tasks and metrics. We highlight the trade-offs between individual and
joint optimization in deep ensemble training, with the overlapping holdout
strategy offering an attractive practical solution. We believe our findings
provide valuable insights and guidance for practitioners looking to optimize
deep ensemble models. Code is available at:
https://github.com/lauritsf/ensemble-optimality-gap
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Anzhuo Xie, Wei-Chen Chang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
This study proposes a Transformer-based longitudinal modeling method to
address challenges in clinical risk classification with heterogeneous
Electronic Health Record (EHR) data, including irregular temporal patterns,
large modality differences, and complex semantic structures. The method takes
multi-source medical features as input and employs a feature embedding layer to
achieve a unified representation of structured and unstructured data. A
learnable temporal encoding mechanism is introduced to capture dynamic
evolution under uneven sampling intervals. The core model adopts a multi-head
self-attention structure to perform global dependency modeling on longitudinal
sequences, enabling the aggregation of long-term trends and short-term
fluctuations across different temporal scales. To enhance semantic
representation, a semantic-weighted pooling module is designed to assign
adaptive importance to key medical events, improving the discriminative ability
of risk-related features. Finally, a linear mapping layer generates
individual-level risk scores. Experimental results show that the proposed model
outperforms traditional machine learning and temporal deep learning models in
accuracy, recall, precision, and F1-Score, achieving stable and precise risk
identification in multi-source heterogeneous EHR environments and providing an
efficient and reliable framework for clinical intelligent decision-making.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Olav Finne Praesteng Larsen, Massimiliano Ruocco, Michail Spitieris, Abdulmajid Murad, Martina Ragosta
Machine Learning
ๆฅ็ๆ่ฆ
Access to trajectory data is a key requirement for developing and validating
Air Traffic Management (ATM) solutions, yet many secondary and regional
airports face severe data scarcity. This limits the applicability of machine
learning methods and the ability to perform large-scale simulations or
"what-if" analyses. In this paper, we investigate whether generative models
trained on data-rich airports can be efficiently adapted to data-scarce
airports using transfer learning. We adapt state-of-the-art diffusion- and
flow-matching-based architectures to the aviation domain and evaluate their
transferability between Zurich (source) and Dublin (target) landing trajectory
datasets. Models are pretrained on Zurich and fine-tuned on Dublin with varying
amounts of local data, ranging from 0% to 100%. Results show that
diffusion-based models achieve competitive performance with as little as 5% of
the Dublin data and reach baseline-level performance around 20%, consistently
outperforming models trained from scratch across metrics and visual
inspections. Latent flow matching and latent diffusion models also benefit from
pretraining, though with more variable gains, while flow matching models show
weaker generalization. Despite challenges in capturing rare trajectory
patterns, these findings demonstrate the potential of transfer learning to
substantially reduce data requirements for trajectory generation in ATM,
enabling realistic synthetic data generation even in environments with limited
historical records.
๐
2025-11-06
๐ Submitted to the Journal of Machine Learning Resea
๐ฅ Jiaming Zhang, Yujie Yang, Haoning Wang, Liping Zhang, Shengbo Eben Li
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Safe reinforcement learning (safe RL) aims to respect safety requirements
while optimizing long-term performance. In many practical applications,
however, the problem involves an infinite number of constraints, known as
semi-infinite safe RL (SI-safe RL). Such constraints typically appear when
safety conditions must be enforced across an entire continuous parameter space,
such as ensuring adequate resource distribution at every spatial location. In
this paper, we propose exchange policy optimization (EPO), an algorithmic
framework that achieves optimal policy performance and deterministic bounded
safety. EPO works by iteratively solving safe RL subproblems with finite
constraint sets and adaptively adjusting the active set through constraint
expansion and deletion. At each iteration, constraints with violations
exceeding the predefined tolerance are added to refine the policy, while those
with zero Lagrange multipliers are removed after the policy update. This
exchange rule prevents uncontrolled growth of the working set and supports
effective policy training. Our theoretical analysis demonstrates that, under
mild assumptions, strategies trained via EPO achieve performance comparable to
optimal solutions with global constraint violations strictly remaining within a
prescribed bound.
๐
2025-11-06
๐ This work has been accepted by IJCNN 2025 and subm
๐ฅ Hongbin Zhang, Shihao Gao, Yang Liu, Mingjie Xing, Yanjun Wu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In recent years, end-to-end Large Language Model (LLM) technology has shown
substantial advantages across various domains. As critical system software and
infrastructure, compilers are responsible for transforming source code into
target code. While LLMs have been leveraged to assist in compiler development
and maintenance, their potential as an end-to-end compiler remains largely
unexplored. This paper explores the feasibility of LLM as a Compiler (LaaC) and
its future directions. We designed the CompilerEval dataset and framework
specifically to evaluate the capabilities of mainstream LLMs in source code
comprehension and assembly code generation. In the evaluation, we analyzed
various errors, explored multiple methods to improve LLM-generated code, and
evaluated cross-platform compilation capabilities. Experimental results
demonstrate that LLMs exhibit basic capabilities as compilers but currently
achieve low compilation success rates. By optimizing prompts, scaling up the
model, and incorporating reasoning methods, the quality of assembly code
generated by LLMs can be significantly enhanced. Based on these findings, we
maintain an optimistic outlook for LaaC and propose practical architectural
designs and future research directions. We believe that with targeted training,
knowledge-rich prompts, and specialized infrastructure, LaaC has the potential
to generate high-quality assembly code and drive a paradigm shift in the field
of compilation.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Giorgio Morales, John W. Sheppard
RoboticsMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Symbolic regression (SR) models complex systems by discovering mathematical
expressions that capture underlying relationships in observed data. However,
most SR methods prioritize minimizing prediction error over identifying the
governing equations, often producing overly complex or inaccurate expressions.
To address this, we present a decomposable SR method that generates
interpretable multivariate expressions leveraging transformer models, genetic
algorithms (GAs), and genetic programming (GP). In particular, our explainable
SR method distills a trained ``opaque'' regression model into mathematical
expressions that serve as explanations of its computed function. Our method
employs a Multi-Set Transformer to generate multiple univariate symbolic
skeletons that characterize how each variable influences the opaque model's
response. We then evaluate the generated skeletons' performance using a
GA-based approach to select a subset of high-quality candidates before
incrementally merging them via a GP-based cascade procedure that preserves
their original skeleton structure. The final multivariate skeletons undergo
coefficient optimization via a GA. We evaluated our method on problems with
controlled and varying degrees of noise, demonstrating lower or comparable
interpolation and extrapolation errors compared to two GP-based methods, three
neural SR methods, and a hybrid approach. Unlike them, our approach
consistently learned expressions that matched the original mathematical
structure.
๐
2025-11-06
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Paul Badu Yakubu, Lesther Santana, Mohamed Rahouti, Yufeng Xin, Abdellah Chehri et al.
Machine Learning
ๆฅ็ๆ่ฆ
With the increasing frequency and sophistication of Distributed Denial of
Service (DDoS) attacks, it has become critical to develop more efficient and
interpretable detection methods. Traditional detection systems often struggle
with scalability and transparency, hindering real-time response and
understanding of attack vectors. This paper presents an automated framework for
detecting and interpreting DDoS attacks using machine learning (ML). The
proposed method leverages the Tree-based Pipeline Optimization Tool (TPOT) to
automate the selection and optimization of ML models and features, reducing the
need for manual experimentation. SHapley Additive exPlanations (SHAP) is
incorporated to enhance model interpretability, providing detailed insights
into the contribution of individual features to the detection process. By
combining TPOT's automated pipeline selection with SHAP interpretability, this
approach improves the accuracy and transparency of DDoS detection. Experimental
results demonstrate that key features such as mean backward packet length and
minimum forward packet header length are critical in detecting DDoS attacks,
offering a scalable and explainable cybersecurity solution.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Moses Charikar, Chirag Pabbaraju, Ambuj Tewari
ๆฅ็ๆ่ฆ
We study the problem of language identification in the limit, where given a
sequence of examples from a target language, the goal of the learner is to
output a sequence of guesses for the target language such that all the guesses
beyond some finite time are correct. Classical results of Gold showed that
language identification in the limit is impossible for essentially any
interesting collection of languages. Later, Angluin gave a precise
characterization of language collections for which this task is possible.
Motivated by recent positive results for the related problem of language
generation, we revisit the classic language identification problem in the
setting where the learner is given the additional power of producing a list of
$k$ guesses at each time step. The goal is to ensure that beyond some finite
time, one of the guesses is correct at each time step.
We give an exact characterization of collections of languages that can be
$k$-list identified in the limit, based on a recursive version of Angluin's
characterization (for language identification with a list of size $1$). This
further leads to a conceptually appealing characterization: A language
collection can be $k$-list identified in the limit if and only if the
collection can be decomposed into $k$ collections of languages, each of which
can be identified in the limit (with a list of size $1$). We also use our
characterization to establish rates for list identification in the statistical
setting where the input is drawn as an i.i.d. stream from a distribution
supported on some language in the collection. Our results show that if a
collection is $k$-list identifiable in the limit, then the collection can be
$k$-list identified at an exponential rate, and this is best possible. On the
other hand, if a collection is not $k$-list identifiable in the limit, then it
cannot be $k$-list identified at any rate that goes to zero.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hyungjong Na, Wonho Song, Seungyong Han, Donghyeon Jo, Sejin Myung et al.
Machine Learning
ๆฅ็ๆ่ฆ
This study introduces the Korean Tax Avoidance Panel (KoTaP), a long-term
panel dataset of non-financial firms listed on KOSPI and KOSDAQ between 2011
and 2024. After excluding financial firms, firms with non-December fiscal year
ends, capital impairment, and negative pre-tax income, the final dataset
consists of 12,653 firm-year observations from 1,754 firms. KoTaP is designed
to treat corporate tax avoidance as a predictor variable and link it to
multiple domains, including earnings management (accrual- and activity-based),
profitability (ROA, ROE, CFO, LOSS), stability (LEV, CUR, SIZE, PPE, AGE,
INVREC), growth (GRW, MB, TQ), and governance (BIG4, FORN, OWN). Tax avoidance
itself is measured using complementary indicators cash effective tax rate
(CETR), GAAP effective tax rate (GETR), and book-tax difference measures (TSTA,
TSDA) with adjustments to ensure interpretability. A key strength of KoTaP is
its balanced panel structure with standardized variables and its consistency
with international literature on the distribution and correlation of core
indicators. At the same time, it reflects distinctive institutional features of
Korean firms, such as concentrated ownership, high foreign shareholding, and
elevated liquidity ratios, providing both international comparability and
contextual uniqueness. KoTaP enables applications in benchmarking econometric
and deep learning models, external validity checks, and explainable AI
analyses. It further supports policy evaluation, audit planning, and investment
analysis, making it a critical open resource for accounting, finance, and
interdisciplinary research.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qingfeng Chen, Haojin Zeng, Jingyi Jie, Shichao Zhang, Debo Cheng
Machine Learning
ๆฅ็ๆ่ฆ
With the rapid growth of graph-structured data in critical domains,
unsupervised graph-level anomaly detection (UGAD) has become a pivotal task.
UGAD seeks to identify entire graphs that deviate from normal behavioral
patterns. However, most Graph Neural Network (GNN) approaches implicitly assume
that the training set is clean, containing only normal graphs, which is rarely
true in practice. Even modest contamination by anomalous graphs can distort
learned representations and sharply degrade performance. To address this
challenge, we propose DeNoise, a robust UGAD framework explicitly designed for
contaminated training data. It jointly optimizes a graph-level encoder, an
attribute decoder, and a structure decoder via an adversarial objective to
learn noise-resistant embeddings. Further, DeNoise introduces an encoder
anchor-alignment denoising mechanism that fuses high-information node
embeddings from normal graphs into all graph embeddings, improving
representation quality while suppressing anomaly interference. A contrastive
learning component then compacts normal graph embeddings and repels anomalous
ones in the latent space. Extensive experiments on eight real-world datasets
demonstrate that DeNoise consistently learns reliable graph-level
representations under varying noise intensities and significantly outperforms
state-of-the-art UGAD baselines.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ananya Sutradhar, Suryansh Gupta, Ravishankar Krishnaswamy, Haiyang Xu, Aseem Rastogi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Filtered Approximate Nearest Neighbor (ANN) search retrieves the closest
vectors for a query vector from a dataset. It enforces that a specified set of
discrete labels $S$ for the query must be included in the labels of each
retrieved vector. Existing graph-based methods typically incorporate filter
awareness by assigning fixed penalties or prioritizing nodes based on filter
satisfaction. However, since these methods use fixed, data in- dependent
penalties, they often fail to generalize across datasets with diverse label and
vector distributions. In this work, we propose a principled alternative that
learns the optimal trade-off between vector distance and filter match directly
from the data, rather than relying on fixed penalties. We formulate this as a
constrained linear optimization problem, deriving weights that better reflect
the underlying filter distribution and more effectively address the filtered
ANN search problem. These learned weights guide both the search process and
index construction, leading to graph structures that more effectively capture
the underlying filter distribution and filter semantics. Our experiments
demonstrate that adapting the distance function to the data significantly im-
proves accuracy by 5-10% over fixed-penalty methods, providing a more flexible
and generalizable framework for the filtered ANN search problem.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fatemeh Hosseinabadi, Seyedhassan Sharifi
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Accurate segmentation of the left atrium (LA) from cardiac MRI is critical
for guiding atrial fibrillation (AF) ablation and constructing biophysical
cardiac models. Manual delineation is time-consuming, observer-dependent, and
impractical for large-scale or time-sensitive clinical workflows. Deep learning
methods, particularly convolutional architectures, have recently demonstrated
superior performance in medical image segmentation tasks. In this study, we
applied the nnU-Net framework, an automated, self-configuring deep learning
segmentation architecture, to the Left Atrial Segmentation Challenge 2013
dataset. The dataset consists of thirty MRI scans with corresponding
expert-annotated masks. The nnU-Net model automatically adapted its
preprocessing, network configuration, and training pipeline to the
characteristics of the MRI data. Model performance was quantitatively evaluated
using the Dice similarity coefficient (DSC), and qualitative results were
compared against expert segmentations. The proposed nnUNet model achieved a
mean Dice score of 93.5, demonstrating high overlap with expert annotations and
outperforming several traditional segmentation approaches reported in previous
studies. The network exhibited robust generalization across variations in left
atrial shape, contrast, and image quality, accurately delineating both the
atrial body and proximal pulmonary veins.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fatemeh Hosseinabadi, Seyedhassan Sharifi
Machine Learning
ๆฅ็ๆ่ฆ
Pediatric appendicitis remains one of the most common causes of acute
abdominal pain in children, and its diagnosis continues to challenge clinicians
due to overlapping symptoms and variable imaging quality. This study aims to
develop and evaluate a deep learning model based on a pretrained ResNet
architecture for automated detection of appendicitis from ultrasound images. We
used the Regensburg Pediatric Appendicitis Dataset, which includes ultrasound
scans, laboratory data, and clinical scores from pediatric patients admitted
with abdominal pain to Children Hospital. Hedwig in Regensburg, Germany. Each
subject had 1 to 15 ultrasound views covering the right lower quadrant,
appendix, lymph nodes, and related structures. For the image based
classification task, ResNet was fine tuned to distinguish appendicitis from
non-appendicitis cases. Images were preprocessed by normalization, resizing,
and augmentation to enhance generalization. The proposed ResNet model achieved
an overall accuracy of 93.44, precision of 91.53, and recall of 89.8,
demonstrating strong performance in identifying appendicitis across
heterogeneous ultrasound views. The model effectively learned discriminative
spatial features, overcoming challenges posed by low contrast, speckle noise,
and anatomical variability in pediatric imaging.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Quantization plays a crucial role in accelerating the inference of
large-scale models, and rotational matrices have been shown to effectively
improve quantization performance by smoothing outliers. However, end-to-end
fine-tuning of rotational optimization algorithms incurs high computational
costs and is prone to overfitting. To address this challenge, we propose an
efficient distribution-aware rotational calibration method, DartQuant, which
reduces the complexity of rotational optimization by constraining the
distribution of the activations after rotation. This approach also effectively
reduces reliance on task-specific losses, thereby mitigating the risk of
overfitting. Additionally, we introduce the QR-Orth optimization scheme, which
replaces expensive alternating optimization with a more efficient solution. In
a variety of model quantization experiments, DartQuant demonstrates superior
performance. Compared to existing methods, it achieves 47$\times$ acceleration
and 10$\times$ memory savings for rotational optimization on a 70B model.
Furthermore, it is the first to successfully complete rotational calibration
for a 70B model on a single 3090 GPU, making quantization of large language
models feasible in resource-constrained environments. Code is available at
https://github.com/CAS-CLab/DartQuant.git.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaoling Luo, Peng Chen, Chengliang Liu, Xiaopeng Jin, Jie Wen et al.
Multimodal
ๆฅ็ๆ่ฆ
Multimodal protein features play a crucial role in protein function
prediction. However, these features encompass a wide range of information,
ranging from structural data and sequence features to protein attributes and
interaction networks, making it challenging to decipher their complex
interconnections. In this work, we propose a multimodal protein function
prediction method (DSRPGO) by utilizing dynamic selection and reconstructive
pre-training mechanisms. To acquire complex protein information, we introduce
reconstructive pre-training to mine more fine-grained information with low
semantic levels. Moreover, we put forward the Bidirectional Interaction Module
(BInM) to facilitate interactive learning among multimodal features.
Additionally, to address the difficulty of hierarchical multi-label
classification in this task, a Dynamic Selection Module (DSM) is designed to
select the feature representation that is most conducive to current protein
function prediction. Our proposed DSRPGO model improves significantly in BPO,
MFO, and CCO on human datasets, thereby outperforming other benchmark models.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have achieved near-human performance across
diverse reasoning tasks, yet their deployment on resource-constrained
Internet-of-Things (IoT) devices remains impractical due to massive parameter
footprints and memory-intensive autoregressive decoding. While split computing
offers a promising solution by partitioning model execution between edge
devices and cloud servers, existing approaches fail to address the unique
challenges of autoregressive inference, particularly the iterative token
generation process and expanding key-value (KV) cache requirements. This work
introduces the first autoregressive-aware split computing framework designed
explicitly for LLM deployment on edge devices. Our approach makes three key
contributions. First, we develop one-point split compression (OPSC), a
mixed-precision quantization scheme that prevents out-of-memory failures by
strategically partitioning models into front-end and back-end segments with
different precision levels. Second, we propose a two-stage intermediate
compression pipeline that combines threshold splitting (TS) and token-wise
adaptive bit quantization (TAB-Q) to preserve accuracy-critical activations
while dramatically reducing communication overhead. Third, we formulate a
unified optimization framework that jointly selects optimal split points,
quantization settings, and sequence lengths to satisfy strict memory and
latency constraints. Extensive evaluations across diverse LLMs and hardware
platforms demonstrate superior performance compared to state-of-the-art
quantization methods, including SmoothQuant, OmniQuant, and Atom. The framework
achieves a 1.49 inference speedup and significant communication overhead
reduction while maintaining or improving model accuracy.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ J. Nathan Kutz, Peter Battaglia, Michael Brenner, Kevin Carlberg, Aric Hagberg et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Machine learning (ML) and artificial intelligence (AI) algorithms are
transforming and empowering the characterization and control of dynamic systems
in the engineering, physical, and biological sciences. These emerging modeling
paradigms require comparative metrics to evaluate a diverse set of scientific
objectives, including forecasting, state reconstruction, generalization, and
control, while also considering limited data scenarios and noisy measurements.
We introduce a common task framework (CTF) for science and engineering, which
features a growing collection of challenge data sets with a diverse set of
practical and common objectives. The CTF is a critically enabling technology
that has contributed to the rapid advance of ML/AI algorithms in traditional
applications such as speech recognition, language processing, and computer
vision. There is a critical need for the objective metrics of a CTF to compare
the diverse algorithms being rapidly developed and deployed in practice today
across science and engineering.
๐
2025-11-06
๐ NeurIPS 2025
๐ฅ Kyaw Hpone Myint, Zhe Wu, Alexandre G. R. Day, Giri Iyengar
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Decision trees are widely used in high-stakes fields like finance and
healthcare due to their interpretability. This work introduces an efficient,
scalable method for generating synthetic pre-training data to enable
meta-learning of decision trees. Our approach samples near-optimal decision
trees synthetically, creating large-scale, realistic datasets. Using the
MetaTree transformer architecture, we demonstrate that this method achieves
performance comparable to pre-training on real-world data or with
computationally expensive optimal decision trees. This strategy significantly
reduces computational costs, enhances data generation flexibility, and paves
the way for scalable and efficient meta-learning of interpretable decision tree
models.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Berk Iskar, Michael Taynnan Barros
Machine Learning
ๆฅ็ๆ่ฆ
Network anomaly detection systems encounter several challenges with
traditional detectors trained offline. They become susceptible to concept drift
and new threats such as zero-day or polymorphic attacks. To address this
limitation, we propose a Ca$^{2+}$-modulated learning framework that draws
inspiration from astrocytic Ca$^{2+}$ signaling in the brain, where rapid,
context-sensitive adaptation enables robust information processing. Our
approach couples a multicellular astrocyte dynamics simulator with a deep
neural network (DNN). The simulator models astrocytic Ca$^{2+}$ dynamics
through three key mechanisms: IP$_3$-mediated Ca$^{2+}$ release, SERCA pump
uptake, and conductance-aware diffusion through gap junctions between cells.
Evaluation of our proposed network on CTU-13 (Neris) network traffic data
demonstrates the effectiveness of our biologically plausible approach. The
Ca$^{2+}$-gated model outperforms a matched baseline DNN, achieving up to
$\sim$98\% accuracy with reduced false positives and negatives across multiple
train/test splits. Importantly, this improved performance comes with negligible
runtime overhead once Ca$^{2+}$ trajectories are precomputed. While
demonstrated here for cybersecurity applications, this Ca$^{2+}$-modulated
learning framework offers a generic solution for streaming detection tasks that
require rapid, biologically grounded adaptation to evolving data patterns.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ahmed A. Metwally, Heyjun Park, Yue Wu, Tracey McLaughlin, Michael P. Snyder
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The classification of diabetes and prediabetes by static glucose thresholds
obscures the pathophysiological dysglycemia heterogeneity, primarily driven by
insulin resistance (IR), beta-cell dysfunction, and incretin deficiency. This
review demonstrates that continuous glucose monitoring and wearable
technologies enable a paradigm shift towards non-invasive, dynamic metabolic
phenotyping. We show evidence that machine learning models can leverage
high-resolution glucose data from at-home, CGM-enabled oral glucose tolerance
tests to accurately predict gold-standard measures of muscle IR and beta-cell
function. This personalized characterization extends to real-world nutrition,
where an individual's unique postprandial glycemic response (PPGR) to
standardized meals, such as the relative glucose spike to potatoes versus
grapes, could serve as a biomarker for their metabolic subtype. Moreover,
integrating wearable data reveals that habitual diet, sleep, and physical
activity patterns, particularly their timing, are uniquely associated with
specific metabolic dysfunctions, informing precision lifestyle interventions.
The efficacy of dietary mitigators in attenuating PPGR is also shown to be
phenotype-dependent. Collectively, this evidence demonstrates that CGM can
deconstruct the complexity of early dysglycemia into distinct, actionable
subphenotypes. This approach moves beyond simple glycemic control, paving the
way for targeted nutritional, behavioral, and pharmacological strategies
tailored to an individual's core metabolic defects, thereby paving the way for
a new era of precision diabetes prevention.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Michael Menezes, Barbara Su, Xinze Feng, Yehya Farhat, Hamza Shili et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce TwIST, a distributed training framework for efficient large
language model (LLM) sparsification. TwIST trains multiple subnetworks in
parallel, periodically aggregates their parameters, and resamples new
subnetworks during training. This process identifies high-quality subnetworks
("golden tickets") without requiring post-training procedures such as
calibration or Hessian-based recovery. As a result, TwIST enables zero-cost
pruning at deployment time while achieving perplexity competitive with
state-of-the-art post-training sparsification methods. The benefits are most
pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly
outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64
for the closest prior approach. Unlike unstructured pruning, TwIST produces
structured, dense matrices that offer practical inference speedups and memory
reductions on commodity hardware (e.g., CPUs) that do not support efficient
sparse computation. TwIST provides an efficient training-time path to
deployable sparse LLMs without additional fine-tuning or recovery overhead.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuxiao Wang, Di Wu, Feng Liu, Zhimin Qiu, Chenrui Hu
ๆฅ็ๆ่ฆ
This paper proposes a composable fine-tuning method that integrates graph
structural priors with modular adapters to address the high computational cost
and structural instability faced by large-scale pre-trained models in
multi-task adaptation. The method introduces a relation matrix to model
dependencies among tasks, explicitly encoding correlations between nodes and
paths into graph structural priors, which provide unified structural
constraints for adapter weight allocation and path selection. Modular adapters
are embedded into different layers through low-rank mapping and a pluggable
mechanism, enabling efficient cross-task composition and reuse under prior
guidance. This mechanism not only improves parameter efficiency and training
stability but also alleviates path conflicts and redundant computation in
multi-task scenarios. Furthermore, experiments on hyperparameter sensitivity,
environmental sensitivity, and data sensitivity are conducted to systematically
analyze key factors such as routing temperature, gating thresholds, and
relation matrix regularization strength, verifying the consistency and superior
performance of the method under structural constraints. The results demonstrate
that the proposed framework significantly enhances task prediction accuracy,
adapter weight allocation precision, and overall computational efficiency while
maintaining model lightweight design, highlighting the synergistic advantages
of graph priors and modular mechanisms in composable fine-tuning.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xu Zou
Natural Language Processing
ๆฅ็ๆ่ฆ
Since its emergence, SARS-CoV-2 has demonstrated a rapid and unpredictable
evolutionary trajectory, characterized by the continual emergence of
immune-evasive variants. This poses persistent challenges to public health and
vaccine development.
While large-scale generative pre-trained transformers (GPTs) have
revolutionized the modeling of sequential data, their direct applications to
noisy viral genomic sequences are limited. In this paper, we introduce
PETRA(Pretrained Evolutionary TRAnsformer), a novel transformer approach based
on evolutionary trajectories derived from phylogenetic trees rather than raw
RNA sequences. This method effectively mitigates sequencing noise and captures
the hierarchical structure of viral evolution.
With a weighted training framework to address substantial geographical and
temporal imbalances in global sequence data, PETRA excels in predicting future
SARS-CoV-2 mutations, achieving a weighted recall@1 of 9.45% for nucleotide
mutations and 17.10\% for spike amino-acid mutations, compared to 0.49% and
6.64% respectively for the best baseline. PETRA also demonstrates its ability
to aid in the real-time mutation prediction of major clades like 24F(XEC) and
25A(LP.8.1). The code is open sourced on https://github.com/xz-keg/PETra
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Semih Cayci
Machine Learning
ๆฅ็ๆ่ฆ
An important question in deep learning is how higher-order optimization
methods affect generalization. In this work, we analyze a stochastic
Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch
sampling for training overparameterized deep neural networks with smooth
activations in a regression setting. Our theoretical contributions are twofold.
First, we establish finite-time convergence bounds via a variable-metric
analysis in parameter space, with explicit dependencies on the batch size,
network width and depth. Second, we derive non-asymptotic generalization bounds
for SGN using uniform stability in the overparameterized regime, characterizing
the impact of curvature, batch size, and overparameterization on generalization
performance. Our theoretical results identify a favorable generalization regime
for SGN in which a larger minimum eigenvalue of the Gauss-Newton matrix along
the optimization path yields tighter stability bounds.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingliang Hou, Yinuo Wang, Teng Guo, Zitao Liu, Wenzhou Dou et al.
ๆฅ็ๆ่ฆ
The need to remove specific student data from cognitive diagnosis (CD) models
has become a pressing requirement, driven by users' growing assertion of their
"right to be forgotten". However, existing CD models are largely designed
without privacy considerations and lack effective data unlearning mechanisms.
Directly applying general purpose unlearning algorithms is suboptimal, as they
struggle to balance unlearning completeness, model utility, and efficiency when
confronted with the unique heterogeneous structure of CD models. To address
this, our paper presents the first systematic study of the data unlearning
problem for CD models, proposing a novel and efficient algorithm: hierarchical
importanceguided forgetting (HIF). Our key insight is that parameter importance
in CD models exhibits distinct layer wise characteristics. HIF leverages this
via an innovative smoothing mechanism that combines individual and layer, level
importance, enabling a more precise distinction of parameters associated with
the data to be unlearned. Experiments on three real world datasets show that
HIF significantly outperforms baselines on key metrics, offering the first
effective solution for CD models to respond to user data removal requests and
for deploying high-performance, privacy preserving AI systems
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Shinto Eguchi
ๆฅ็ๆ่ฆ
We introduce a density-power weighted variant for the Stein operator, called
the $\gamma$-Stein operator. This is a novel class of operators derived from
the $\gamma$-divergence, designed to build robust inference methods for
unnormalized probability models. The operator's construction (weighting by the
model density raised to a positive power $\gamma$ inherently down-weights the
influence of outliers, providing a principled mechanism for robustness.
Applying this operator yields a robust generalization of score matching that
retains the crucial property of being independent of the model's normalizing
constant. We extend this framework to develop two key applications: the
$\gamma$-kernelized Stein discrepancy for robust goodness-of-fit testing, and
$\gamma$-Stein variational gradient descent for robust Bayesian posterior
approximation. Empirical results on contaminated Gaussian and quartic potential
models show our methods significantly outperform standard baselines in both
robustness and statistical efficiency.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wuxia Chen, Taposh Banerjee, Vahid Tarokh
ๆฅ็ๆ่ฆ
We address the problem of quickest change detection in Markov processes with
unknown transition kernels. The key idea is to learn the conditional score
$\nabla_{\mathbf{y}} \log p(\mathbf{y}|\mathbf{x})$ directly from sample pairs
$( \mathbf{x},\mathbf{y})$, where both $\mathbf{x}$ and $\mathbf{y}$ are
high-dimensional data generated by the same transition kernel. In this way, we
avoid explicit likelihood evaluation and provide a practical way to learn the
transition dynamics. Based on this estimation, we develop a score-based CUSUM
procedure that uses conditional Hyvarinen score differences to detect changes
in the kernel. To ensure bounded increments, we propose a truncated version of
the statistic. With Hoeffding's inequality for uniformly ergodic Markov
processes, we prove exponential lower bounds on the mean time to false alarm.
We also prove asymptotic upper bounds on detection delay. These results give
both theoretical guarantees and practical feasibility for score-based detection
in high-dimensional Markov models.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Aukosh Jagannath, Taj Jones-McCormick, Varnan Sarangian
ๆฅ็ๆ่ฆ
We develop a high-dimensional scaling limit for Stochastic Gradient Descent
with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework
to rigourously compare online SGD with some of its popular variants. We show
that the scaling limits of SGD-M coincide with those of online SGD after an
appropriate time rescaling and a specific choice of step-size. However, if the
step-size is kept the same between the two algorithms, SGD-M will amplify
high-dimensional effects, potentially degrading performance relative to online
SGD. We demonstrate our framework on two popular learning problems: Spiked
Tensor PCA and Single Index Models. In both cases, we also examine online SGD
with an adaptive step-size based on normalized gradients. In the
high-dimensional regime, this algorithm yields multiple benefits: its dynamics
admit fixed points closer to the population minimum and widens the range of
admissible step-sizes for which the iterates converge to such solutions. These
examples provide a rigorous account, aligning with empirical motivation, of how
early preconditioners can stabilize and improve dynamics in settings where
online SGD fails.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Raghav Sharma, Manan Mehta, Sai Tiger Raina
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Reinforcement Learning from Human Feedback (RLHF) is the standard for
aligning Large Language Models (LLMs), yet recent progress has moved beyond
canonical text-based methods. This survey synthesizes the new frontier of
alignment research by addressing critical gaps in multi-modal alignment,
cultural fairness, and low-latency optimization. To systematically explore
these domains, we first review foundational algo- rithms, including PPO, DPO,
and GRPO, before presenting a detailed analysis of the latest innovations. By
providing a comparative synthesis of these techniques and outlining open
challenges, this work serves as an essential roadmap for researchers building
more robust, efficient, and equitable AI systems.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sanggeon Yun, Hyunwoo Oh, Ryozo Masukawa, Pietro Mercati, Nathaniel D. Bastian et al.
ๆฅ็ๆ่ฆ
Hyperdimensional computing (HDC) suits memory, energy, and
reliability-constrained systems, yet the standard "one prototype per class"
design requires $O(CD)$ memory (with $C$ classes and dimensionality $D$). Prior
compaction reduces $D$ (feature axis), improving storage/compute but weakening
robustness. We introduce LogHD, a logarithmic class-axis reduction that
replaces the $C$ per-class prototypes with $n\!\approx\!\lceil\log_k C\rceil$
bundle hypervectors (alphabet size $k$) and decodes in an $n$-dimensional
activation space, cutting memory to $O(D\log_k C)$ while preserving $D$. LogHD
uses a capacity-aware codebook and profile-based decoding, and composes with
feature-axis sparsification. Across datasets and injected bit flips, LogHD
attains competitive accuracy with smaller models and higher resilience at
matched memory. Under equal memory, it sustains target accuracy at roughly
$2.5$-$3.0\times$ higher bit-flip rates than feature-axis compression; an ASIC
instantiation delivers $498\times$ energy efficiency and $62.6\times$ speedup
over an AMD Ryzen 9 9950X and $24.3\times$/$6.58\times$ over an NVIDIA RTX
4090, and is $4.06\times$ more energy-efficient and $2.19\times$ faster than a
feature-axis HDC ASIC baseline.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce Nemotron Nano V2 VL, the latest model of the Nemotron
vision-language series designed for strong real-world document understanding,
long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers
significant improvements over our previous model,
Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major
enhancements in model architecture, datasets, and training recipes. Nemotron
Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and
innovative token reduction techniques to achieve higher inference throughput in
long document and video scenarios. We are releasing model checkpoints in BF16,
FP8, and FP4 formats and sharing large parts of our datasets, recipes and
training code.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Arthur Chen, Victor Zhong
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce and formalize the Synthetic Dataset Quality Estimation (SynQuE)
problem: ranking synthetic datasets by their expected real-world task
performance using only limited unannotated real data. This addresses a critical
and open challenge where data is scarce due to collection costs or privacy
constraints. We establish the first comprehensive benchmarks for this problem
by introducing and evaluating proxy metrics that choose synthetic data for
training to maximize task performance on real data. We introduce the first
proxy metrics for SynQuE by adapting distribution and diversity-based distance
measures to our context via embedding models. To address the shortcomings of
these metrics on complex planning tasks, we propose LENS, a novel proxy that
leverages large language model reasoning. Our results show that SynQuE proxies
correlate with real task performance across diverse tasks, including sentiment
analysis, Text2SQL, web navigation, and image classification, with LENS
consistently outperforming others on complex tasks by capturing nuanced
characteristics. For instance, on text-to-SQL parsing, training on the top-3
synthetic datasets selected via SynQuE proxies can raise accuracy from 30.4% to
38.4 (+8.1)% on average compared to selecting data indiscriminately. This work
establishes SynQuE as a practical framework for synthetic data selection under
real-data scarcity and motivates future research on foundation model-based data
characterization and fine-grained data selection.
๐
2025-11-06
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Baptiste Bonin, Maxime Heuillet, Audrey Durand
Natural Language Processing
ๆฅ็ๆ่ฆ
Modeling user preferences across domains remains a key challenge in slate
recommendation (i.e. recommending an ordered sequence of items) research. We
investigate how Large Language Models (LLM) can effectively act as world models
of user preferences through pairwise reasoning over slates. We conduct an
empirical study involving several LLMs on three tasks spanning different
datasets. Our results reveal relationships between task performance and
properties of the preference function captured by LLMs, hinting towards areas
for improvement and highlighting the potential of LLMs as world models in
recommender systems.
๐
2025-11-06
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nikhil Abhyankar, Purvi Chaurasia, Sanchit Kabra, Ananya Srivastava, Vivek Gupta et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing tabular reasoning benchmarks mostly test models on small, uniform
tables, underrepresenting the complexity of real-world data and giving an
incomplete view of Large Language Models' (LLMs) reasoning abilities. Real
tables are long, heterogeneous, and domain-specific, mixing structured fields
with free text and requiring multi-hop reasoning across thousands of tokens. To
address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from
2031 real-world tables spanning two domains: i) RB-Science (NSF grant records)
and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates
LLMs jointly across scale, heterogeneity, domain specificity, and reasoning
complexity. Experiments with open-source and proprietary models show that LLMs
struggle with heterogeneous schemas and complex multi-hop inference, revealing
persistent weaknesses in current architectures and prompting strategies.
RUST-BENCH establishes a challenging new testbed for advancing tabular
reasoning research.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alberto Cattaneo, Carlo Luschi, Daniel Justus
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval of information from graph-structured knowledge bases represents a
promising direction for improving the factuality of LLMs. While various
solutions have been proposed, a comparison of methods is difficult due to the
lack of challenging QA datasets with ground-truth targets for graph retrieval.
We present SynthKGQA, a framework for generating high-quality synthetic
Knowledge Graph Question Answering datasets from any Knowledge Graph, providing
the full set of ground-truth facts in the KG to reason over each question. We
show how, in addition to enabling more informative benchmarking of KG
retrievers, the data produced with SynthKGQA also allows us to train better
models. We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset
designed to test zero-shot generalization abilities of KG retrievers with
respect to unseen graph structures and relation types, and benchmark popular
solutions for KG-augmented LLMs on it.
๐
2025-11-06
๐ ArXiv Preprint (Multimedia)
๐ฅ Allie Tran, Luca Rossetto
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal co-embedding models, especially CLIP, have advanced the state of
the art in zero-shot classification and multimedia information retrieval in
recent years by aligning images and text in a shared representation space.
However, such modals trained on a contrastive alignment can lack stability
towards small input perturbations. Especially when dealing with manually
expressed queries, minor variations in the query can cause large differences in
the ranking of the best-matching results. In this paper, we present a
systematic analysis of the effect of multiple classes of non-semantic query
perturbations in an multimedia information retrieval scenario. We evaluate a
diverse set of lexical, syntactic, and semantic perturbations across multiple
CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video
collection. Across models, we find that syntactic and semantic perturbations
drive the largest instabilities, while brittleness is concentrated in trivial
surface edits such as punctuation and case. Our results highlight robustness as
a critical dimension for evaluating vision-language models beyond benchmark
accuracy.
๐
2025-11-06
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zefeng Li, Ning Yang
ๆฅ็ๆ่ฆ
Although the collaborative filtering (CF) algorithm has achieved remarkable
performance in recommendation systems, it suffers from suboptimal
recommendation performance due to noise in the user-item interaction matrix.
Numerous noise-removal studies have improved recommendation models, but most
existing approaches conduct denoising on a single graph. This may cause
attenuation of collaborative signals: removing edges between two nodes can
interrupt paths between other nodes, weakening path-dependent collaborative
information. To address these limitations, this study proposes a novel
GNN-based CF model called DRCSD for denoising unstable interactions. DRCSD
includes two core modules: a collaborative signal decoupling module (decomposes
signals into distinct orders by structural characteristics) and an order-wise
denoising module (performs targeted denoising on each order). Additionally, the
information aggregation mechanism of traditional GNN-based CF models is
modified to avoid cross-order signal interference until the final pooling
operation. Extensive experiments on three public real-world datasets show that
DRCSD has superior robustness against unstable interactions and achieves
statistically significant performance improvements in recommendation accuracy
metrics compared to state-of-the-art baseline models.
๐
2025-11-06
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Carl Kugblenu, Petri Vuorimaa
ๆฅ็ๆ่ฆ
Production vector search systems often fan out each query across parallel
lanes (threads, replicas, or shards) to meet latency service-level objectives
(SLOs). In practice, these lanes rediscover the same candidates, so extra
compute does not increase coverage. We present a coordination-free lane
partitioner that turns duplication into complementary work at the same cost and
deadline. For each query we (1) build a deterministic candidate pool sized to
the total top-k budget, (2) apply a per-query pseudorandom permutation, and (3)
assign each lane a disjoint slice of positions. Lanes then return different
results by construction, with no runtime coordination.
At equal cost with four lanes (total candidate budget 64), on SIFT1M (1M SIFT
feature vectors) with Hierarchical Navigable Small World graphs (HNSW)
recall@10 rises from 0.249 to 0.999 while lane overlap falls from nearly 100%
to 0%. On MS MARCO (8.8M passages) with HNSW, hit@10 improves from 0.200 to
0.601 and Mean Reciprocal Rank at 10 (MRR@10) from 0.133 to 0.330. For inverted
file (IVF) indexes we see smaller but consistent gains (for example, +11% on MS
MARCO) by de-duplicating list routing. A microbenchmark shows planner overhead
of ~37 microseconds per query (mean at the main setting) with linear growth in
the number of merged candidates.
These results yield a simple operational guideline: size the per-query pool
to the total budget, deterministically partition positions across lanes, and
turn redundant fan-out into complementary coverage without changing budget or
deadline.
๐
2025-11-06
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Mashrur Rahman, Mantaqa abedin, Monowar Zamil Abir, Faizul Islam Ansari, Adib Reza et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
University students face immense challenges during their undergraduate lives,
often being deprived of personalized on-demand guidance that mentors fail to
provide at scale. Digital tools exist, but there is a serious lack of
customized coaching for newcomers. This paper presents an AI-powered chatbot
that will serve as a mentor for the students of BRAC University. The main
component is a data ingestion pipeline that efficiently processes and updates
information from diverse sources, such as CSV files and university webpages.
The chatbot retrieves information through a hybrid approach, combining BM25
lexical ranking with ChromaDB semantic retrieval, and uses a Large Language
Model, LLaMA-3.3-70B, to generate conversational responses. The generated text
was found to be semantically highly relevant, with a BERTScore of 0.831 and a
METEOR score of 0.809. The data pipeline was also very efficient, taking 106.82
seconds for updates, compared to 368.62 seconds for new data. This chatbot will
be able to help students by responding to their queries, helping them to get a
better understanding of university life, and assisting them to plan better
routines for their semester in the open-credit university.
๐
2025-11-06
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Ge Zhang, Rohan Deepak Ajwani, Tony Zheng, Hongjian Gu, Yaochen Hu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Finding relevant products given a user query plays a pivotal role in an
e-commerce platform, as it can spark shopping behaviors and result in revenue
gains. The challenge lies in accurately predicting the correlation between
queries and products. Recently, mining the cross-features between queries and
products based on the commonsense reasoning capacity of Large Language Models
(LLMs) has shown promising performance. However, such methods suffer from high
costs due to intensive real-time LLM inference during serving, as well as human
annotations and potential Supervised Fine Tuning (SFT). To boost efficiency
while leveraging the commonsense reasoning capacity of LLMs for various
e-commerce tasks, we propose the Efficient Commonsense-Augmented Recommendation
Enhancer (E-CARE). During inference, models augmented with E-CARE can access
commonsense reasoning with only a single LLM forward pass per query by
utilizing a commonsense reasoning factor graph that encodes most of the
reasoning schema from powerful LLMs. The experiments on 2 downstream tasks show
an improvement of up to 12.1% on precision@5.
๐
2025-11-06
๐ ArXiv Preprint (cs.DL)
๐ฅ M Sadik Batcha, S Roselin Jahina, Muneer Ahmad
ๆฅ็ๆ่ฆ
DESIDOC Journal of Library & Information Technology (DJLIT) formerly known as
DESIDOC Bulletin of Information Technology is a peer-reviewed, open access,
bimonthly journal. This paper presents a Scientometric analysis of the DESIDOC
Journal. The paper analyses the pattern of growth of the research output
published in the journal, pattern of authorship, author productivity, and,
subjects covered to the papers over the period (2013-2017). It is found that
227 papers were published during the period of study (2001-2012). The maximum
numbers of articles were collaborative in nature. The subject concentration of
the journal noted is Scientometrics. The maximum numbers of articles (65%) have
ranged their thought contents between 6 and 10 pages. The study applied
standard formula and statistical tools to bring out the factual result.
๐
2025-11-06
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xiaolu Chen, Yong Liao
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Generative Search Engines (GSEs) leverage Retrieval-Augmented Generation
(RAG) techniques and Large Language Models (LLMs) to integrate multi-source
information and provide users with accurate and comprehensive responses. Unlike
traditional search engines that present results in ranked lists, GSEs shift
users' attention from sequential browsing to content-driven subjective
perception, driving a paradigm shift in information retrieval. In this context,
enhancing the subjective visibility of content through Generative Search Engine
Optimization (G-SEO) methods has emerged as a new research focus. With the
rapid advancement of Multimodal Retrieval-Augmented Generation (MRAG)
techniques, GSEs can now efficiently integrate text, images, audio, and video,
producing richer responses that better satisfy complex information needs.
Existing G-SEO methods, however, remain limited to text-based optimization and
fail to fully exploit multimodal data. To address this gap, we propose Caption
Injection, the first multimodal G-SEO approach, which extracts captions from
images and injects them into textual content, integrating visual semantics to
enhance the subjective visibility of content in generative search scenarios. We
systematically evaluate Caption Injection on MRAMG, a benchmark for MRAG, under
both unimodal and multimodal settings. Experimental results show that Caption
Injection significantly outperforms text-only G-SEO baselines under the G-Eval
metric, demonstrating the necessity and effectiveness of multimodal integration
in G-SEO to improve user-perceived content visibility.
๐
2025-11-06
๐ ArXiv Preprint (cs.DL)
๐ฅ Muneer Ahmad, Samuel Ibor Ubi
ๆฅ็ๆ่ฆ
This paper presents a scientometric analysis of research output from the
University of Lagos, focusing on the two decades spanning 2004 to 2023. Using
bibliometric data retrieved from the Web of Science, we examine trends in
publication volume, collaboration patterns, citation impact, and the most
prolific authors, departments, and research domains at the university. The
study reveals a consistent increase in research productivity, with the highest
publication output recorded in 2023. Health Sciences, Engineering, and Social
Sciences are identified as dominant fields, reflecting the university's
interdisciplinary research strengths. Collaborative efforts, both locally and
internationally, show a positive correlation with higher citation impact, with
the United States and the United Kingdom being the leading international
collaborators. Notably, open-access publications account for a significant
portion of the university's research output, enhancing visibility and citation
rates. The findings offer valuable insights into the university's research
performance over the past two decades, providing a foundation for strategic
planning and policy formulation to foster research excellence and global
impact.
๐
2025-11-06
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ananya Sutradhar, Suryansh Gupta, Ravishankar Krishnaswamy, Haiyang Xu, Aseem Rastogi et al.
Machine Learning
ๆฅ็ๆ่ฆ
Filtered Approximate Nearest Neighbor (ANN) search retrieves the closest
vectors for a query vector from a dataset. It enforces that a specified set of
discrete labels $S$ for the query must be included in the labels of each
retrieved vector. Existing graph-based methods typically incorporate filter
awareness by assigning fixed penalties or prioritizing nodes based on filter
satisfaction. However, since these methods use fixed, data in- dependent
penalties, they often fail to generalize across datasets with diverse label and
vector distributions. In this work, we propose a principled alternative that
learns the optimal trade-off between vector distance and filter match directly
from the data, rather than relying on fixed penalties. We formulate this as a
constrained linear optimization problem, deriving weights that better reflect
the underlying filter distribution and more effectively address the filtered
ANN search problem. These learned weights guide both the search process and
index construction, leading to graph structures that more effectively capture
the underlying filter distribution and filter semantics. Our experiments
demonstrate that adapting the distance function to the data significantly im-
proves accuracy by 5-10% over fixed-penalty methods, providing a more flexible
and generalizable framework for the filtered ANN search problem.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Qingzhou Lu, Yao Feng, Baiyu Shi, Michael Piseno, Zhenan Bao et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Humanoid robots are expected to operate in human-centered environments where
safe and natural physical interaction is essential. However, most recent
reinforcement learning (RL) policies emphasize rigid tracking and suppress
external forces. Existing impedance-augmented approaches are typically
restricted to base or end-effector control and focus on resisting extreme
forces rather than enabling compliance. We introduce GentleHumanoid, a
framework that integrates impedance control into a whole-body motion tracking
policy to achieve upper-body compliance. At its core is a unified spring-based
formulation that models both resistive contacts (restoring forces when pressing
against surfaces) and guiding contacts (pushes or pulls sampled from human
motion data). This formulation ensures kinematically consistent forces across
the shoulder, elbow, and wrist, while exposing the policy to diverse
interaction scenarios. Safety is further supported through task-adjustable
force thresholds. We evaluate our approach in both simulation and on the
Unitree G1 humanoid across tasks requiring different levels of compliance,
including gentle hugging, sit-to-stand assistance, and safe object
manipulation. Compared to baselines, our policy consistently reduces peak
contact forces while maintaining task success, resulting in smoother and more
natural interactions. These results highlight a step toward humanoid robots
that can safely and effectively collaborate with humans and handle objects in
real-world environments.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du et al.
Robotics
ๆฅ็ๆ่ฆ
Human videos can be recorded quickly and at scale, making them an appealing
source of training data for robot learning. However, humans and robots differ
fundamentally in embodiment, resulting in mismatched action execution. Direct
kinematic retargeting of human hand motion can therefore produce actions that
are physically infeasible for robots. Despite these low-level differences,
human demonstrations provide valuable motion cues about how to manipulate and
interact with objects. Our key idea is to exploit the forward diffusion
process: as noise is added to actions, low-level execution differences fade
while high-level task guidance is preserved. We present X-Diffusion, a
principled framework for training diffusion policies that maximally leverages
human data without learning dynamically infeasible motions. X-Diffusion first
trains a classifier to predict whether a noisy action is executed by a human or
robot. Then, a human action is incorporated into policy training only after
adding sufficient noise such that the classifier cannot discern its embodiment.
Actions consistent with robot execution supervise fine-grained denoising at low
noise levels, while mismatched human actions provide only coarse guidance at
higher noise levels. Our experiments show that naive co-training under
execution mismatches degrades policy performance, while X-Diffusion
consistently improves it. Across five manipulation tasks, X-Diffusion achieves
a 16% higher average success rate than the best baseline. The project website
is available at https://portal-cornell.github.io/X-Diffusion/.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song et al.
Robotics
ๆฅ็ๆ่ฆ
Robotic manipulation policies are advancing rapidly, but their direct
evaluation in the real world remains costly, time-consuming, and difficult to
reproduce, particularly for tasks involving deformable objects. Simulation
provides a scalable and systematic alternative, yet existing simulators often
fail to capture the coupled visual and physical complexity of soft-body
interactions. We present a real-to-sim policy evaluation framework that
constructs soft-body digital twins from real-world videos and renders robots,
objects, and environments with photorealistic fidelity using 3D Gaussian
Splatting. We validate our approach on representative deformable manipulation
tasks, including plush toy packing, rope routing, and T-block pushing,
demonstrating that simulated rollouts correlate strongly with real-world
execution performance and reveal key behavioral patterns of learned policies.
Our results suggest that combining physics-informed reconstruction with
high-quality rendering enables reproducible, scalable, and accurate evaluation
of robotic manipulation policies. Website: https://real2sim-eval.github.io/
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Phat Nguyen, Erfan Aasi, Shiva Sreeram, Guy Rosman, Andrew Silva et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Autonomous driving systems remain brittle in rare, ambiguous, and
out-of-distribution scenarios, where human driver succeed through contextual
reasoning. Shared autonomy has emerged as a promising approach to mitigate such
failures by incorporating human input when autonomy is uncertain. However, most
existing methods restrict arbitration to low-level trajectories, which
represent only geometric paths and therefore fail to preserve the underlying
driving intent. We propose a unified shared autonomy framework that integrates
human input and autonomous planners at a higher level of abstraction. Our
method leverages Vision Language Models (VLMs) to infer driver intent from
multi-modal cues -- such as driver actions and environmental context -- and to
synthesize coherent strategies that mediate between human and autonomous
control. We first study the framework in a mock-human setting, where it
achieves perfect recall alongside high accuracy and precision. A human-subject
survey further shows strong alignment, with participants agreeing with
arbitration outcomes in 92% of cases. Finally, evaluation on the Bench2Drive
benchmark demonstrates a substantial reduction in collision rate and
improvement in overall performance compared to pure autonomy. Arbitration at
the level of semantic, language-based representations emerges as a design
principle for shared autonomy, enabling systems to exercise common-sense
reasoning and maintain continuity with human intent.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu et al.
MultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have emerged as a powerful framework that
unifies perception, language, and control, enabling robots to perform diverse
tasks through multimodal understanding. However, current VLA models typically
contain massive parameters and rely heavily on large-scale robot data
pretraining, leading to high computational costs during training, as well as
limited deployability for real-time inference. Moreover, most training
paradigms often degrade the perceptual representations of the vision-language
backbone, resulting in overfitting and poor generalization to downstream tasks.
In this work, we present Evo-1, a lightweight VLA model that reduces
computation and improves deployment efficiency, while maintaining strong
performance without pretraining on robot data. Evo-1 builds on a native
multimodal Vision-Language model (VLM), incorporating a novel cross-modulated
diffusion transformer along with an optimized integration module, together
forming an effective architecture. We further introduce a two-stage training
paradigm that progressively aligns action with perception, preserving the
representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1
achieves state-of-the-art results on the Meta-World and RoboTwin suite,
surpassing the previous best models by 12.4% and 6.9%, respectively, and also
attains a competitive result of 94.8% on LIBERO. In real-world evaluations,
Evo-1 attains a 78% success rate with high inference frequency and low memory
overhead, outperforming all baseline methods. We release code, data, and model
weights to facilitate future research on lightweight and efficient VLA models.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Yueyang Weng, Xiaopeng Zhang, Yongjin Mu, Yingcong Zhu, Yanjie Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
Action chunking is a widely adopted approach in Learning from Demonstration
(LfD). By modeling multi-step action chunks rather than single-step actions,
action chunking significantly enhances modeling capabilities for human expert
policies. However, the reduced decision frequency restricts the utilization of
recent observations, degrading reactivity - particularly evident in the
inadequate adaptation to sensor noise and dynamic environmental changes.
Existing efforts to address this issue have primarily resorted to trading off
reactivity against decision consistency, without achieving both. To address
this limitation, we propose a novel algorithm, Temporal Action Selector (TAS),
which caches predicted action chunks from multiple timesteps and dynamically
selects the optimal action through a lightweight selector network. TAS achieves
balanced optimization across three critical dimensions: reactivity, decision
consistency, and motion coherence. Experiments across multiple tasks with
diverse base policies show that TAS significantly improves success rates -
yielding an absolute gain of up to 73.3%. Furthermore, integrating TAS as a
base policy with residual reinforcement learning (RL) substantially enhances
training efficiency and elevates the performance plateau. Experiments in both
simulation and physical robots confirm the method's efficacy.
๐
2025-11-06
๐ IROS 2025
๐ฅ Chang Liu, Juan Li, Sheng Zhang, Chang Liu, Jie Li et al.
ๆฅ็ๆ่ฆ
Depth estimation is one of the key technologies for realizing 3D perception
in unmanned systems. Monocular depth estimation has been widely researched
because of its low-cost advantage, but the existing methods face the challenges
of poor depth estimation performance and blurred object boundaries on embedded
systems. In this paper, we propose a novel monocular depth estimation model,
BoRe-Depth, which contains only 8.7M parameters. It can accurately estimate
depth maps on embedded systems and significantly improves boundary quality.
Firstly, we design an Enhanced Feature Adaptive Fusion Module (EFAF) which
adaptively fuses depth features to enhance boundary detail representation.
Secondly, we integrate semantic knowledge into the encoder to improve the
object recognition and boundary perception capabilities. Finally, BoRe-Depth is
deployed on NVIDIA Jetson Orin, and runs efficiently at 50.7 FPS. We
demonstrate that the proposed model significantly outperforms previous
lightweight models on multiple challenging datasets, and we provide detailed
ablation studies for the proposed methods. The code is available at
https://github.com/liangxiansheng093/BoRe-Depth.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Dexin wang, Faliang Chang, Chunsheng Liu
Robotics
ๆฅ็ๆ่ฆ
Efficiently leveraging simulation to acquire advanced manipulation skills is
both challenging and highly significant. We introduce \textit{ForeRobo}, a
generative robotic agent that utilizes generative simulations to autonomously
acquire manipulation skills driven by envisioned goal states. Instead of
directly learning low-level policies, we advocate integrating generative
paradigms with classical control. Our approach equips a robotic agent with a
self-guided \textit{propose-generate-learn-actuate} cycle. The agent first
proposes the skills to be acquired and constructs the corresponding simulation
environments; it then configures objects into appropriate arrangements to
generate skill-consistent goal states (\textit{ForeGen}). Subsequently, the
virtually infinite data produced by ForeGen are used to train the proposed
state generation model (\textit{ForeFormer}), which establishes point-wise
correspondences by predicting the 3D goal position of every point in the
current state, based on the scene state and task instructions. Finally,
classical control algorithms are employed to drive the robot in real-world
environments to execute actions based on the envisioned goal states. Compared
with end-to-end policy learning methods, ForeFormer offers superior
interpretability and execution efficiency. We train and benchmark ForeFormer
across a variety of rigid-body and articulated-object manipulation tasks, and
observe an average improvement of 56.32\% over the state-of-the-art state
generation models, demonstrating strong generality across different
manipulation patterns. Moreover, in real-world evaluations involving more than
20 robotic tasks, ForeRobo achieves zero-shot sim-to-real transfer and exhibits
remarkable generalization capabilities, attaining an average success rate of
79.28\%.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Anna Mรฉszรกros, Javier Alonso-Mora, Jens Kober
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Effectively capturing the joint distribution of all agents in a scene is
relevant for predicting the true evolution of the scene and in turn providing
more accurate information to the decision processes of autonomous vehicles.
While new models have been developed for this purpose in recent years, it
remains unclear how to best represent the joint distributions particularly from
the perspective of the interactions between agents. Thus far there is no clear
consensus on how best to represent interactions between agents; whether they
should be learned implicitly from data by neural networks, or explicitly
modeled using the spatial and temporal relations that are more grounded in
human decision-making. This paper aims to study various means of describing
interactions within the same network structure and their effect on the final
learned joint distributions. Our findings show that more often than not, simply
allowing a network to establish interactive connections between agents based on
data has a detrimental effect on performance. Instead, having well defined
interactions (such as which agent of an agent pair passes first at an
intersection) can often bring about a clear boost in performance.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Maรซlic Neau, Zoe Falomir, Paulo E. Santos, Anne-Gwenn Bosser, Cรฉdric Buche
MultimodalRobotics
ๆฅ็ๆ่ฆ
Deploying autonomous robots that can learn new skills from demonstrations is
an important challenge of modern robotics. Existing solutions often apply
end-to-end imitation learning with Vision-Language Action (VLA) models or
symbolic approaches with Action Model Learning (AML). On the one hand, current
VLA models are limited by the lack of high-level symbolic planning, which
hinders their abilities in long-horizon tasks. On the other hand, symbolic
approaches in AML lack generalization and scalability perspectives. In this
paper we present a new neuro-symbolic approach, GraSP-VLA, a framework that
uses a Continuous Scene Graph representation to generate a symbolic
representation of human demonstrations. This representation is used to generate
new planning domains during inference and serves as an orchestrator for
low-level VLA policies, scaling up the number of actions that can be reproduced
in a row. Our results show that GraSP-VLA is effective for modeling symbolic
representations on the task of automatic planning domain generation from
observations. In addition, results on real-world experiments show the potential
of our Continuous Scene Graph representation to orchestrate low-level VLA
policies in long-horizon tasks.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Kuankuan Sima, Longbin Tang, Haozhe Ma, Lin Zhao
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Autonomous navigation in unknown environments requires compact yet expressive
spatial understanding under partial observability to support high-level
decision making. Existing approaches struggle to balance rich contextual
representation with navigation efficiency. We present MacroNav, a
learning-based navigation framework featuring two key components: (1) a
lightweight context encoder trained via multi-task self-supervised learning to
capture multi-scale, navigation-centric spatial representations; and (2) a
reinforcement learning policy that seamlessly integrates these representations
with graph-based reasoning for efficient action selection. Extensive
experiments demonstrate the context encoder's efficient and robust
environmental understanding. Real-world deployments further validate MacroNav's
effectiveness, yielding significant gains over state-of-the-art navigation
methods in both Success Rate (SR) and Success weighted by Path Length (SPL),
while maintaining low computational cost. Code will be released upon
acceptance.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Jinfeng Liang, Haocheng Guo, Ximin Lyu
Robotics
ๆฅ็ๆ่ฆ
The tailsitter vertical takeoff and landing (VTOL) UAV is widely used due to
its lower dead weight, which eliminates the actuators and mechanisms for
tilting. However, the tailsitter UAV is susceptible to wind disturbances in
multi-rotor mode, as it exposes a large frontal fuselage area. To address this
issue, our tailsitter UAV features a reconfigurable wing design, allowing wings
to retract in multi-rotor mode and extend in fixed- wing mode. Considering
power efficiency, we design a coaxial heterogeneous dual-rotor configuration,
which significantly re- duces the total power consumption. To reduce structural
weight and simplify structural complexity, we employ a swashplateless mechanism
with an improved design to control pitch and roll in multi-rotor mode. We
optimize the structure of the swashplateless mechanism by adding flapping
hinges, which reduces vibration during cyclic acceleration and deceleration.
Finally, we perform comprehensive transition flight tests to validate stable
flight performance across the entire flight envelope of the tailsitter UAV.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Marco Iannotta, Yuxuan Yang, Johannes A. Stork, Erik Schaffernicht, Todor Stoyanov
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Sim-to-real transfer remains a major challenge in reinforcement learning (RL)
for robotics, as policies trained in simulation often fail to generalize to the
real world due to discrepancies in environment dynamics. Domain Randomization
(DR) mitigates this issue by exposing the policy to a wide range of randomized
dynamics during training, yet leading to a reduction in performance. While
standard approaches typically train policies agnostic to these variations, we
investigate whether sim-to-real transfer can be improved by conditioning the
policy on an estimate of the dynamics parameters -- referred to as context. To
this end, we integrate a context estimation module into a DR-based RL framework
and systematically compare SOTA supervision strategies. We evaluate the
resulting context-aware policies in both a canonical control benchmark and a
real-world pushing task using a Franka Emika Panda robot. Results show that
context-aware policies outperform the context-agnostic baseline across all
settings, although the best supervision strategy depends on the task.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Shenglin Wang, Mingtong Dai, Jingxuan Su, Lingbo Liu, Chunjie Chen et al.
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Robotic grasping is a fundamental capability for autonomous manipulation, yet
remains highly challenging in cluttered environments where occlusion, poor
perception quality, and inconsistent 3D reconstructions often lead to unstable
or failed grasps. Conventional pipelines have widely relied on RGB-D cameras to
provide geometric information, which fail on transparent or glossy objects and
degrade at close range. We present GraspView, an RGB-only robotic grasping
pipeline that achieves accurate manipulation in cluttered environments without
depth sensors. Our framework integrates three key components: (i) global
perception scene reconstruction, which provides locally consistent, up-to-scale
geometry from a single RGB view and fuses multi-view projections into a
coherent global 3D scene; (ii) a render-and-score active perception strategy,
which dynamically selects next-best-views to reveal occluded regions; and (iii)
an online metric alignment module that calibrates VGGT predictions against
robot kinematics to ensure physical scale consistency. Building on these
tailor-designed modules, GraspView performs best-view global grasping, fusing
multi-view reconstructions and leveraging GraspNet for robust execution.
Experiments on diverse tabletop objects demonstrate that GraspView
significantly outperforms both RGB-D and single-view RGB baselines, especially
under heavy occlusion, near-field sensing, and with transparent objects. These
results highlight GraspView as a practical and versatile alternative to RGB-D
pipelines, enabling reliable grasping in unstructured real-world environments.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Yizhen Yin, Dapeng Feng, Hongbo Chen, Yuhua Qi
Machine Learning
ๆฅ็ๆ่ฆ
Existing Active SLAM methodologies face issues such as slow exploration speed
and suboptimal paths. To address these limitations, we propose a hybrid
framework combining a Path-Uncertainty Co-Optimization Deep Reinforcement
Learning framework and a Lightweight Stagnation Detection mechanism. The
Path-Uncertainty Co-Optimization framework jointly optimizes travel distance
and map uncertainty through a dual-objective reward function, balancing
exploration and exploitation. The Lightweight Stagnation Detection reduces
redundant exploration through Lidar Static Anomaly Detection and Map Update
Stagnation Detection, terminating episodes on low expansion rates. Experimental
results show that compared with the frontier-based method and RRT method, our
approach shortens exploration time by up to 65% and reduces path distance by up
to 42%, significantly improving exploration efficiency in complex environments
while maintaining reliable map completeness. Ablation studies confirm that the
collaborative mechanism accelerates training convergence. Empirical validation
on a physical robotic platform demonstrates the algorithm's practical
applicability and its successful transferability from simulation to real-world
environments.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Yitang Li, Zhengyi Luo, Tonghe Zhang, Cunxi Dai, Anssi Kanervisto et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Building Behavioral Foundation Models (BFMs) for humanoid robots has the
potential to unify diverse control tasks under a single, promptable generalist
policy. However, existing approaches are either exclusively deployed on
simulated humanoid characters, or specialized to specific tasks such as
tracking. We propose BFM-Zero, a framework that learns an effective shared
latent representation that embeds motions, goals, and rewards into a common
space, enabling a single policy to be prompted for multiple downstream tasks
without retraining. This well-structured latent space in BFM-Zero enables
versatile and robust whole-body skills on a Unitree G1 humanoid in the real
world, via diverse inference methods, including zero-shot motion tracking, goal
reaching, and reward optimization, and few-shot optimization-based adaptation.
Unlike prior on-policy reinforcement learning (RL) frameworks, BFM-Zero builds
upon recent advancements in unsupervised RL and Forward-Backward (FB) models,
which offer an objective-centric, explainable, and smooth latent representation
of whole-body motions. We further extend BFM-Zero with critical reward shaping,
domain randomization, and history-dependent asymmetric learning to bridge the
sim-to-real gap. Those key design choices are quantitatively ablated in
simulation. A first-of-its-kind model, BFM-Zero establishes a step toward
scalable, promptable behavioral foundation models for whole-body humanoid
control.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Yanbo Pang, Qingkai Li, Mingguo Zhao
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
As robotic arm applications extend beyond industrial settings into
healthcare, service, and daily life, existing control algorithms struggle to
achieve the agile manipulation required for complex environments with dynamic
trajectories, unpredictable interactions, and diverse objects. This paper
presents a biomimetic control framework based on Spiking Neural Networks (SNN),
inspired by the human Central Nervous System (CNS), to achieve agile control in
such environments. The proposed framework features five control modules
(cerebral cortex, cerebellum, thalamus, brainstem, spinal cord), three
hierarchical control levels (first-order, second-order, third-order), and two
information pathways (ascending, descending). Each module is fully implemented
using SNN. The spinal cord module uses spike encoding and Leaky
Integrate-and-Fire (LIF) neurons for feedback control. The brainstem module
employs a network of LIF and non-spiking LIF neurons to dynamically adjust
spinal cord parameters via reinforcement learning. The thalamus module
similarly adjusts the cerebellum's torque outputs. The cerebellum module uses a
recurrent SNN to learn the robotic arm's dynamics through regression, providing
feedforward gravity compensation torques. The framework is validated both in
simulation and on real-world robotic arm platform under various loads and
trajectories. Results demonstrate that our method outperforms the
industrial-grade position control in manipulation agility.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Kyongsik Yun, David Bayard, Gerik Kubiak, Austin Owens, Andrew Johnson et al.
RoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Future planetary exploration missions demand high-performance, fault-tolerant
computing to enable autonomous Guidance, Navigation, and Control (GNC) and
Lander Vision System (LVS) operations during Entry, Descent, and Landing (EDL).
This paper evaluates the deployment of GNC and LVS algorithms on
next-generation multi-core processors--HPSC, Snapdragon VOXL2, and AMD Xilinx
Versal--demonstrating up to 15x speedup for LVS image processing and over 250x
speedup for Guidance for Fuel-Optimal Large Divert (GFOLD) trajectory
optimization compared to legacy spaceflight hardware. To ensure computational
reliability, we present ARBITER (Asynchronous Redundant Behavior Inspection for
Trusted Execution and Recovery), a Multi-Core Voting (MV) mechanism that
performs real-time fault detection and correction across redundant cores.
ARBITER is validated in both static optimization tasks (GFOLD) and dynamic
closed-loop control (Attitude Control System). A fault injection study further
identifies the gradient computation stage in GFOLD as the most sensitive to
bit-level errors, motivating selective protection strategies and vector-based
output arbitration. This work establishes a scalable and energy-efficient
architecture for future missions, including Mars Sample Return, Enceladus
Orbilander, and Ceres Sample Return, where onboard autonomy, low latency, and
fault resilience are critical.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Kailun Ji, Xiaoyu Hu, Xinyu Zhang, Jun Chen
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large-scale disaster Search And Rescue (SAR) operations are persistently
challenged by complex terrain and disrupted communications. While Unmanned
Aerial Vehicle (UAV) swarms offer a promising solution for tasks like wide-area
search and supply delivery, yet their effective coordination places a
significant cognitive burden on human operators. The core human-machine
collaboration bottleneck lies in the ``intention-to-action gap'', which is an
error-prone process of translating a high-level rescue objective into a
low-level swarm command under high intensity and pressure. To bridge this gap,
this study proposes a novel LLM-CRF system that leverages Large Language Models
(LLMs) to model and augment human-swarm teaming cognition. The proposed
framework initially captures the operator's intention through natural and
multi-modal interactions with the device via voice or graphical annotations. It
then employs the LLM as a cognitive engine to perform intention comprehension,
hierarchical task decomposition, and mission planning for the UAV swarm. This
closed-loop framework enables the swarm to act as a proactive partner,
providing active feedback in real-time while reducing the need for manual
monitoring and control, which considerably advances the efficacy of the SAR
task. We evaluate the proposed framework in a simulated SAR scenario.
Experimental results demonstrate that, compared to traditional order and
command-based interfaces, the proposed LLM-driven approach reduced task
completion time by approximately $64.2\%$ and improved task success rate by
$7\%$. It also leads to a considerable reduction in subjective cognitive
workload, with NASA-TLX scores dropping by $42.9\%$. This work establishes the
potential of LLMs to create more intuitive and effective human-swarm
collaborations in high-stakes scenarios.
๐
2025-11-06
๐ IROS 2025
๐ฅ Chenzui Li, Yiming Chen, Xi Wu, Giacinto Barresi, Fei Chen
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper introduces an upper limb postural optimization method for
enhancing physical ergonomics and force manipulability during bimanual
human-robot co-carrying tasks. Existing research typically emphasizes human
safety or manipulative efficiency, whereas our proposed method uniquely
integrates both aspects to strengthen collaboration across diverse conditions
(e.g., different grasping postures of humans, and different shapes of objects).
Specifically, the joint angles of a simplified human skeleton model are
optimized by minimizing the cost function to prioritize safety and manipulative
capability. To guide humans towards the optimized posture, the reference
end-effector poses of the robot are generated through a transformation module.
A bimanual model predictive impedance controller (MPIC) is proposed for our
human-like robot, CURI, to recalibrate the end effector poses through planned
trajectories. The proposed method has been validated through various subjects
and objects during human-human collaboration (HHC) and human-robot
collaboration (HRC). The experimental results demonstrate significant
improvement in muscle conditions by comparing the activation of target muscles
before and after optimization.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Yushi Wang, Changsheng Luo, Penghui Chen, Jianran Liu, Weijian Sun et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Humanoid soccer poses a representative challenge for embodied intelligence,
requiring robots to operate within a tightly coupled perception-action loop.
However, existing systems typically rely on decoupled modules, resulting in
delayed responses and incoherent behaviors in dynamic environments, while
real-world perceptual limitations further exacerbate these issues. In this
work, we present a unified reinforcement learning-based controller that enables
humanoid robots to acquire reactive soccer skills through the direct
integration of visual perception and motion control. Our approach extends
Adversarial Motion Priors to perceptual settings in real-world dynamic
environments, bridging motion imitation and visually grounded dynamic control.
We introduce an encoder-decoder architecture combined with a virtual perception
system that models real-world visual characteristics, allowing the policy to
recover privileged states from imperfect observations and establish active
coordination between perception and action. The resulting controller
demonstrates strong reactivity, consistently executing coherent and robust
soccer behaviors across various scenarios, including real RoboCup matches.
๐
2025-11-06
๐ ArXiv Preprint (Robotics)
๐ฅ Iman Adibnazari, Harsh Sharma, Myungsun Park, Jacobo Cervera-Torralba, Boris Kramer et al.
Robotics
ๆฅ็ๆ่ฆ
Soft robots have shown immense promise in settings where they can leverage
dynamic control of their entire bodies. However, effective dynamic shape
control requires a controller that accounts for the robot's high-dimensional
dynamics--a challenge exacerbated by a lack of general-purpose tools for
modeling soft robots amenably for control. In this work, we conduct a
comparative study of data-driven model reduction techniques for generating
linear models amendable to dynamic shape control. We focus on three
methods--the eigensystem realization algorithm, dynamic mode decomposition with
control, and the Lagrangian operator inference (LOpInf) method. Using each
class of model, we explored their efficacy in model predictive control policies
for the dynamic shape control of a simulated eel-inspired soft robot in three
experiments: 1) tracking simulated reference trajectories guaranteed to be
feasible, 2) tracking reference trajectories generated from a biological model
of eel kinematics, and 3) tracking reference trajectories generated by a
reduced-scale physical analog. In all experiments, the LOpInf-based policies
generated lower tracking errors than policies based on other models.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hlali Azzeddine, Majid Ben Yakhlef, Soulaiman El Hazzat
Machine Learning
ๆฅ็ๆ่ฆ
Small, imbalanced datasets and poor input image quality can lead to high
false predictions rates with deep learning models. This paper introduces
Class-Based Image Composition, an approach that allows us to reformulate
training inputs through a fusion of multiple images of the same class into
combined visual composites, named Composite Input Images (CoImg). That enhances
the intra-class variance and improves the valuable information density per
training sample and increases the ability of the model to distinguish between
subtle disease patterns. Our method was evaluated on the Optical Coherence
Tomography Dataset for Image-Based Deep Learning Methods (OCTDL) (Kulyabin et
al., 2024), which contains 2,064 high-resolution optical coherence tomography
(OCT) scans of the human retina, representing seven distinct diseases with a
significant class imbalance. We constructed a perfectly class-balanced version
of this dataset, named Co-OCTDL, where each scan is resented as a 3x1 layout
composite image. To assess the effectiveness of this new representation, we
conducted a comparative analysis between the original dataset and its variant
using a VGG16 model. A fair comparison was ensured by utilizing the identical
model architecture and hyperparameters for all experiments. The proposed
approach markedly improved diagnostic results.The enhanced Dataset achieved
near-perfect accuracy (99.6%) with F1-score (0.995) and AUC (0.9996), compared
to a baseline model trained on raw dataset. The false prediction rate was also
significantly lower, this demonstrates that the method can producehigh-quality
predictions even for weak datasets affected by class imbalance or small sample
size.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hlali Azzeddine, Majid Ben Yakhlef, Soulaiman El Hazzat
Machine Learning
ๆฅ็ๆ่ฆ
Small, imbalanced datasets and poor input image quality can lead to high
false predictions rates with deep learning models. This paper introduces
Class-Based Image Composition, an approach that allows us to reformulate
training inputs through a fusion of multiple images of the same class into
combined visual composites, named Composite Input Images (CoImg). That enhances
the intra-class variance and improves the valuable information density per
training sample and increases the ability of the model to distinguish between
subtle disease patterns. Our method was evaluated on the Optical Coherence
Tomography Dataset for Image-Based Deep Learning Methods (OCTDL) (Kulyabin et
al., 2024), which contains 2,064 high-resolution optical coherence tomography
(OCT) scans of the human retina, representing seven distinct diseases with a
significant class imbalance. We constructed a perfectly class-balanced version
of this dataset, named Co-OCTDL, where each scan is resented as a 3x1 layout
composite image. To assess the effectiveness of this new representation, we
conducted a comparative analysis between the original dataset and its variant
using a VGG16 model. A fair comparison was ensured by utilizing the identical
model architecture and hyperparameters for all experiments. The proposed
approach markedly improved diagnostic results.The enhanced Dataset achieved
near-perfect accuracy (99.6%) with F1-score (0.995) and AUC (0.9996), compared
to a baseline model trained on raw dataset. The false prediction rate was also
significantly lower, this demonstrates that the method can producehigh-quality
predictions even for weak datasets affected by class imbalance or small sample
size.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hyunseok Kwak, Kyeongwon Lee, Jae-Jin Lee, Woojoo Lee
ๆฅ็ๆ่ฆ
On-device fine-tuning of CNNs is essential to withstand domain shift in edge
applications such as Human Activity Recognition (HAR), yet full fine-tuning is
infeasible under strict memory, compute, and energy budgets. We present
LoRA-Edge, a parameter-efficient fine-tuning (PEFT) method that builds on
Low-Rank Adaptation (LoRA) with tensor-train assistance. LoRA-Edge (i) applies
Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional
layers, (ii) selectively updates only the output-side core with
zero-initialization to keep the auxiliary path inactive at the start, and (iii)
fuses the update back into dense kernels, leaving inference cost unchanged.
This design preserves convolutional structure and reduces the number of
trainable parameters by up to two orders of magnitude compared to full
fine-tuning. Across diverse HAR datasets and CNN backbones, LoRA-Edge achieves
accuracy within 4.7% of full fine-tuning while updating at most 1.49% of
parameters, consistently outperforming prior parameter-efficient baselines
under similar budgets. On a Jetson Orin Nano, TT-SVD initialization and
selective-core training yield 1.4-3.8x faster convergence to target F1.
LoRA-Edge thus makes structure-aligned, parameter-efficient on-device CNN
adaptation practical for edge platforms.
๐
2025-11-05
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Domรญcio Pereira Neto, Joรฃo Correia, Penousal Machado
MultimodalRoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Deep generative models, especially diffusion architectures, have transformed
image generation; however, they are challenging to control and optimize for
specific goals without expensive retraining. Embedding Space Exploration,
especially with Evolutionary Algorithms (EAs), has been shown to be a promising
method for optimizing image generation, particularly within Diffusion Models.
Therefore, in this work, we study the performance of an evolutionary
optimization method, namely Separable Covariance Matrix Adaptation Evolution
Strategy (sep-CMA-ES), against the widely adopted Adaptive Moment Estimation
(Adam), applied to Stable Diffusion XL Turbo's prompt embedding vector. The
evaluation of images combines the LAION Aesthetic Predictor V2 with CLIPScore
into a weighted fitness function, allowing flexible trade-offs between visual
appeal and adherence to prompts. Experiments on a subset of the Parti Prompts
(P2) dataset showcase that sep-CMA-ES consistently yields superior improvements
in aesthetic and alignment metrics in comparison to Adam. Results indicate that
the evolutionary method provides efficient, gradient-free optimization for
diffusion models, enhancing controllability without the need for fine-tuning.
This study emphasizes the potential of evolutionary methods for embedding space
exploration of deep generative models and outlines future research directions.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nand Kumar Yadav, Rodrigue Rizk, William CW Chen, KC Santosh
ๆฅ็ๆ่ฆ
Unknown anomaly detection in medical imaging remains a fundamental challenge
due to the scarcity of labeled anomalies and the high cost of expert
supervision. We introduce an unsupervised, oracle-free framework that
incrementally expands a trusted set of normal samples without any anomaly
labels. Starting from a small, verified seed of normal images, our method
alternates between lightweight adapter updates and uncertainty-gated sample
admission. A frozen pretrained vision backbone is augmented with tiny
convolutional adapters, ensuring rapid domain adaptation with negligible
computational overhead. Extracted embeddings are stored in a compact coreset
enabling efficient k-nearest neighbor anomaly (k-NN) scoring. Safety during
incremental expansion is enforced by dual probabilistic gates, a sample is
admitted into the normal memory only if its distance to the existing coreset
lies within a calibrated z-score threshold, and its SWAG-based epistemic
uncertainty remains below a seed-calibrated bound. This mechanism prevents
drift and false inclusions without relying on generative reconstruction or
replay buffers. Empirically, our system steadily refines the notion of
normality as unlabeled data arrive, producing substantial gains over baselines.
On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982 (F1: 0.8048 to 0.9746); on
Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5,
ROC-AUC increases from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211. These
results highlight the effectiveness and efficiency of the proposed framework
for real-world, label-scarce medical imaging applications.
๐
2025-11-05
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Liam Bakar, Zachary Englhardt, Vidya Srinivas, Girish Narayanswamy, Dilini Nissanka et al.
Multimodal
ๆฅ็ๆ่ฆ
Food logging, both self-directed and prescribed, plays a critical role in
uncovering correlations between diet, medical, fitness, and health outcomes.
Through conversations with nutritional experts and individuals who practice
dietary tracking, we find current logging methods, such as handwritten and
app-based journaling, are inflexible and result in low adherence and
potentially inaccurate nutritional summaries. These findings, corroborated by
prior literature, emphasize the urgent need for improved food logging methods.
In response, we propose SnappyMeal, an AI-powered dietary tracking system that
leverages multimodal inputs to enable users to more flexibly log their food
intake. SnappyMeal introduces goal-dependent follow-up questions to
intelligently seek missing context from the user and information retrieval from
user grocery receipts and nutritional databases to improve accuracy. We
evaluate SnappyMeal through publicly available nutrition benchmarks and a
multi-user, 3-week, in-the-wild deployment capturing over 500 logged food
instances. Users strongly praised the multiple available input methods and
reported a strong perceived accuracy. These insights suggest that multimodal AI
systems can be leveraged to significantly improve dietary tracking flexibility
and context-awareness, laying the groundwork for a new class of intelligent
self-tracking applications.
๐
2025-11-05
๐ Accepted for publication at the 2nd IEEE Internati
๐ฅ Arup Datta, Ahmed Aljohani, Hyunsook Do
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are now widely used to draft and refactor code,
but code that works is not necessarily secure. We evaluate secure code
generation using the Instruct Prime, which eliminated compliance-required
prompts and cue contamination, and evaluate five instruction-tuned code LLMs
using a zero-shot baseline and a three-round reflexion prompting approach.
Security is measured using the Insecure Code Detector (ICD), and results are
reported by measuring Repair, Regression, and NetGain metrics, considering the
programming language and CWE family. Our findings show that insecurity remains
common at the first round: roughly 25-33% of programs are insecure at a
zero-shot baseline (t0 ). Weak cryptography/config-dependent bugs are the
hardest to avoid while templated ones like XSS, code injection, and hard-coded
secrets are handled more reliably. Python yields the highest secure rates; C
and C# are the lowest, with Java, JS, PHP, and C++ in the middle. Reflexion
prompting improves security for all models, improving average accuracy from
70.74% at t0 to 79.43% at t3 , with the largest gains in the first round
followed by diminishing returns. The trends with Repair, Regression, and
NetGain metrics show that applying one to two rounds produces most of the
benefits. A replication package is available at
https://doi.org/10.5281/zenodo.17065846.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hlali Azzeddine, Majid Ben Yakhlef, Soulaiman El Hazzat
Machine Learning
ๆฅ็ๆ่ฆ
Small, imbalanced datasets and poor input image quality can lead to high
false predictions rates with deep learning models. This paper introduces
Class-Based Image Composition, an approach that allows us to reformulate
training inputs through a fusion of multiple images of the same class into
combined visual composites, named Composite Input Images (CoImg). That enhances
the intra-class variance and improves the valuable information density per
training sample and increases the ability of the model to distinguish between
subtle disease patterns. Our method was evaluated on the Optical Coherence
Tomography Dataset for Image-Based Deep Learning Methods (OCTDL) (Kulyabin et
al., 2024), which contains 2,064 high-resolution optical coherence tomography
(OCT) scans of the human retina, representing seven distinct diseases with a
significant class imbalance. We constructed a perfectly class-balanced version
of this dataset, named Co-OCTDL, where each scan is resented as a 3x1 layout
composite image. To assess the effectiveness of this new representation, we
conducted a comparative analysis between the original dataset and its variant
using a VGG16 model. A fair comparison was ensured by utilizing the identical
model architecture and hyperparameters for all experiments. The proposed
approach markedly improved diagnostic results.The enhanced Dataset achieved
near-perfect accuracy (99.6%) with F1-score (0.995) and AUC (0.9996), compared
to a baseline model trained on raw dataset. The false prediction rate was also
significantly lower, this demonstrates that the method can producehigh-quality
predictions even for weak datasets affected by class imbalance or small sample
size.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song et al.
Robotics
ๆฅ็ๆ่ฆ
Imitation learning-based robot control policies are enjoying renewed interest
in video-based robotics. However, it remains unclear whether this approach
applies to X-ray-guided procedures, such as spine instrumentation. This is
because interpretation of multi-view X-rays is complex. We examine
opportunities and challenges for imitation policy learning in bi-plane-guided
cannula insertion. We develop an in silico sandbox for scalable, automated
simulation of X-ray-guided spine procedures with a high degree of realism. We
curate a dataset of correct trajectories and corresponding bi-planar X-ray
sequences that emulate the stepwise alignment of providers. We then train
imitation learning policies for planning and open-loop control that iteratively
align a cannula solely based on visual information. This precisely controlled
setup offers insights into limitations and capabilities of this method. Our
policy succeeded on the first attempt in 68.5% of cases, maintaining safe
intra-pedicular trajectories across diverse vertebral levels. The policy
generalized to complex anatomy, including fractures, and remained robust to
varied initializations. Rollouts on real bi-planar X-rays further suggest that
the model can produce plausible trajectories, despite training exclusively in
simulation. While these preliminary results are promising, we also identify
limitations, especially in entry point precision. Full closed-look control will
require additional considerations around how to provide sufficiently frequent
feedback. With more robust priors and domain knowledge, such models may provide
a foundation for future efforts toward lightweight and CT-free robotic
intra-operative spinal navigation.
๐
2025-11-05
๐ AAAI
๐ฅ Suraj Prasai, Mengnan Du, Ying Zhang, Fan Yang
ๆฅ็ๆ่ฆ
We develop KnowThyself, an agentic assistant that advances large language
model (LLM) interpretability. Existing tools provide useful insights but remain
fragmented and code-intensive. KnowThyself consolidates these capabilities into
a chat-based interface, where users can upload models, pose natural language
questions, and obtain interactive visualizations with guided explanations. At
its core, an orchestrator LLM first reformulates user queries, an agent router
further directs them to specialized modules, and the outputs are finally
contextualized into coherent explanations. This design lowers technical
barriers and provides an extensible platform for LLM inspection. By embedding
the whole process into a conversational workflow, KnowThyself offers a robust
foundation for accessible LLM interpretability.
๐
2025-11-05
๐ ArXiv Preprint (cs.DC)
๐ฅ Arijit Bhattacharjee, Ali TehraniJamsaz, Le Chen, Niranjan Hasabnis, Mihai Capota et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in large language models (LLMs) have significantly
accelerated progress in code translation, enabling more accurate and efficient
transformation across programming languages. While originally developed for
natural language processing, LLMs have shown strong capabilities in modeling
programming language syntax and semantics, outperforming traditional rule-based
systems in both accuracy and flexibility. These models have streamlined
cross-language conversion, reduced development overhead, and accelerated legacy
code migration. In this paper, we introduce OMPILOT, a novel domain-specific
encoder-decoder transformer tailored for translating C++ code into OpenMP,
enabling effective shared-memory parallelization. OMPILOT leverages custom
pre-training objectives that incorporate the semantics of parallel constructs
and combines both unsupervised and supervised learning strategies to improve
code translation robustness. Unlike previous work that focused primarily on
loop-level transformations, OMPILOT operates at the function level to capture a
wider semantic context. To evaluate our approach, we propose OMPBLEU, a novel
composite metric specifically crafted to assess the correctness and quality of
OpenMP parallel constructs, addressing limitations in conventional translation
metrics.
๐
2025-11-05
๐ ArXiv Preprint (cs.CY)
๐ฅ Tammy Mackenzie, Sukriti Punj, Natalie Perez, Sreyoshi Bhaduri, Branislav Radeljic
ๆฅ็ๆ่ฆ
This paper examines how decision makers in academia, government, business,
and civil society navigate questions of power in implementations of artificial
intelligence. The study explores how individuals experience and exercise levers
of power, which are presented as social mechanisms that shape institutional
responses to technological change. The study reports on the responses of
personalized questionnaires designed to gather insight on a decision maker's
institutional purview, based on an institutional governance framework developed
from the work of Neo-institutionalists. Findings present the anonymized, real
responses and circumstances of respondents in the form of twelve fictional
personas of high-level decision makers from North America and Europe. These
personas illustrate how personal agency, organizational logics, and
institutional infrastructures may intersect in the governance of AI. The
decision makers' responses to the questionnaires then inform a discussion of
the field-level personal power of decision makers, methods of fostering
institutional stability in times of change, and methods of influencing
institutional change in the field of AI. The final section of the discussion
presents a table of the dynamics of the levers of power in the field of AI for
change makers and five testable hypotheses for institutional and social
movement researchers. In summary, this study provides insight on the means for
policymakers within institutions and their counterparts in civil society to
personally engage with AI governance.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nand Kumar Yadav, Rodrigue Rizk, William CW Chen, KC Santosh
ๆฅ็ๆ่ฆ
Unknown anomaly detection in medical imaging remains a fundamental challenge
due to the scarcity of labeled anomalies and the high cost of expert
supervision. We introduce an unsupervised, oracle-free framework that
incrementally expands a trusted set of normal samples without any anomaly
labels. Starting from a small, verified seed of normal images, our method
alternates between lightweight adapter updates and uncertainty-gated sample
admission. A frozen pretrained vision backbone is augmented with tiny
convolutional adapters, ensuring rapid domain adaptation with negligible
computational overhead. Extracted embeddings are stored in a compact coreset
enabling efficient k-nearest neighbor anomaly (k-NN) scoring. Safety during
incremental expansion is enforced by dual probabilistic gates, a sample is
admitted into the normal memory only if its distance to the existing coreset
lies within a calibrated z-score threshold, and its SWAG-based epistemic
uncertainty remains below a seed-calibrated bound. This mechanism prevents
drift and false inclusions without relying on generative reconstruction or
replay buffers. Empirically, our system steadily refines the notion of
normality as unlabeled data arrive, producing substantial gains over baselines.
On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982 (F1: 0.8048 to 0.9746); on
Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5,
ROC-AUC increases from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211. These
results highlight the effectiveness and efficiency of the proposed framework
for real-world, label-scarce medical imaging applications.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hlali Azzeddine, Majid Ben Yakhlef, Soulaiman El Hazzat
Machine Learning
ๆฅ็ๆ่ฆ
Small, imbalanced datasets and poor input image quality can lead to high
false predictions rates with deep learning models. This paper introduces
Class-Based Image Composition, an approach that allows us to reformulate
training inputs through a fusion of multiple images of the same class into
combined visual composites, named Composite Input Images (CoImg). That enhances
the intra-class variance and improves the valuable information density per
training sample and increases the ability of the model to distinguish between
subtle disease patterns. Our method was evaluated on the Optical Coherence
Tomography Dataset for Image-Based Deep Learning Methods (OCTDL) (Kulyabin et
al., 2024), which contains 2,064 high-resolution optical coherence tomography
(OCT) scans of the human retina, representing seven distinct diseases with a
significant class imbalance. We constructed a perfectly class-balanced version
of this dataset, named Co-OCTDL, where each scan is resented as a 3x1 layout
composite image. To assess the effectiveness of this new representation, we
conducted a comparative analysis between the original dataset and its variant
using a VGG16 model. A fair comparison was ensured by utilizing the identical
model architecture and hyperparameters for all experiments. The proposed
approach markedly improved diagnostic results.The enhanced Dataset achieved
near-perfect accuracy (99.6%) with F1-score (0.995) and AUC (0.9996), compared
to a baseline model trained on raw dataset. The false prediction rate was also
significantly lower, this demonstrates that the method can producehigh-quality
predictions even for weak datasets affected by class imbalance or small sample
size.
๐
2025-11-05
๐ ArXiv Preprint (eess.IV)
๐ฅ Linchen Qian, Jiasong Chen, Ruonan Gong, Wei Sun, Minliang Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate geometric modeling of the aortic valve from 3D CT images is
essential for biomechanical analysis and patient-specific simulations to assess
valve health or make a preoperative plan. However, it remains challenging to
generate aortic valve meshes with both high-quality and consistency across
different patients. Traditional approaches often produce triangular meshes with
irregular topologies, which can result in poorly shaped elements and
inconsistent correspondence due to inter-patient anatomical variation. In this
work, we address these challenges by introducing a template-fitting pipeline
with deep neural networks to generate structured quad (i.e., quadrilateral)
meshes from 3D CT images to represent aortic valve geometries. By remeshing
aortic valves of all patients with a common quad mesh template, we ensure a
uniform mesh topology with consistent node-to-node and element-to-element
correspondence across patients. This consistency enables us to simplify the
learning objective of the deep neural networks, by employing a loss function
with only two terms (i.e., a geometry reconstruction term and a smoothness
regularization term), which is sufficient to preserve mesh smoothness and
element quality. Our experiments demonstrate that the proposed approach
produces high-quality aortic valve surface meshes with improved smoothness and
shape quality, while requiring fewer explicit regularization terms compared to
the traditional methods. These results highlight that using structured quad
meshes for the template and neural network training not only ensures mesh
correspondence and quality but also simplifies the training process, thus
enhancing the effectiveness and efficiency of aortic valve modeling.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Abdulmumin Sa'ad, Sulaimon Oyeniyi Adebayo, Abdul Jabbar Siddiqui
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
The global waste crisis is escalating, with solid waste generation expected
to increase by 70% by 2050. Traditional waste collection methods, particularly
in remote or harsh environments like deserts, are labor-intensive, inefficient,
and often hazardous. Recent advances in computer vision and deep learning have
opened the door to automated waste detection systems, yet most research focuses
on urban environments and recyclable materials, overlooking organic and
hazardous waste and underexplored terrains such as deserts. In this work, we
propose an enhanced real-time object detection framework based on a pruned,
lightweight version of YOLOv12 integrated with Self-Adversarial Training (SAT)
and specialized data augmentation strategies. Using the DroneTrashNet dataset,
we demonstrate significant improvements in precision, recall, and mean average
precision (mAP), while achieving low latency and compact model size suitable
for deployment on resource-constrained aerial drones. Benchmarking our model
against state-of-the-art lightweight YOLO variants further highlights its
optimal balance of accuracy and efficiency. Our results validate the
effectiveness of combining data-centric and model-centric enhancements for
robust, real-time waste detection in desert environments.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song et al.
Robotics
ๆฅ็ๆ่ฆ
Imitation learning-based robot control policies are enjoying renewed interest
in video-based robotics. However, it remains unclear whether this approach
applies to X-ray-guided procedures, such as spine instrumentation. This is
because interpretation of multi-view X-rays is complex. We examine
opportunities and challenges for imitation policy learning in bi-plane-guided
cannula insertion. We develop an in silico sandbox for scalable, automated
simulation of X-ray-guided spine procedures with a high degree of realism. We
curate a dataset of correct trajectories and corresponding bi-planar X-ray
sequences that emulate the stepwise alignment of providers. We then train
imitation learning policies for planning and open-loop control that iteratively
align a cannula solely based on visual information. This precisely controlled
setup offers insights into limitations and capabilities of this method. Our
policy succeeded on the first attempt in 68.5% of cases, maintaining safe
intra-pedicular trajectories across diverse vertebral levels. The policy
generalized to complex anatomy, including fractures, and remained robust to
varied initializations. Rollouts on real bi-planar X-rays further suggest that
the model can produce plausible trajectories, despite training exclusively in
simulation. While these preliminary results are promising, we also identify
limitations, especially in entry point precision. Full closed-look control will
require additional considerations around how to provide sufficiently frequent
feedback. With more robust priors and domain knowledge, such models may provide
a foundation for future efforts toward lightweight and CT-free robotic
intra-operative spinal navigation.
๐
2025-11-05
๐ ArXiv Preprint (eess.IV)
๐ฅ Jinyuxuan Guo, Gurnoor Singh Khurana, Alejandro Gonzalo Grande, Juan C. del Alamo, Francisco Contijoch
Machine Learning
ๆฅ็ๆ่ฆ
Background: Non-invasive imaging-based assessment of blood flow plays a
critical role in evaluating heart function and structure. Computed Tomography
(CT) is a widely-used imaging modality that can robustly evaluate
cardiovascular anatomy and function, but direct methods to estimate blood flow
velocity from movies of contrast evolution have not been developed.
Purpose: This study evaluates the impact of CT imaging on Physics-Informed
Neural Networks (PINN)-based flow estimation and proposes an improved
framework, SinoFlow, which uses sinogram data directly to estimate blood flow.
Methods: We generated pulsatile flow fields in an idealized 2D vessel
bifurcation using computational fluid dynamics and simulated CT scans with
varying gantry rotation speeds, tube currents, and pulse mode imaging settings.
We compared the performance of PINN-based flow estimation using reconstructed
images (ImageFlow) to SinoFlow.
Results: SinoFlow significantly improved flow estimation performance by
avoiding propagating errors introduced by filtered backprojection. SinoFlow was
robust across all tested gantry rotation speeds and consistently produced lower
mean squared error and velocity errors than ImageFlow. Additionally, SinoFlow
was compatible with pulsed-mode imaging and maintained higher accuracy with
shorter pulse widths.
Conclusions: This study demonstrates the potential of SinoFlow for CT-based
flow estimation, providing a more promising approach for non-invasive blood
flow assessment. The findings aim to inform future applications of PINNs to CT
images and provide a solution for image-based estimation, with reasonable
acquisition parameters yielding accurate flow estimates.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Duong Mai, Lawrence Hall
ๆฅ็ๆ่ฆ
Deep learned (DL) models for image recognition have been shown to fail to
generalize to data from different devices, populations, etc. COVID-19 detection
from Chest X-rays (CXRs), in particular, has been shown to fail to generalize
to out-of-distribution (OOD) data from new clinical sources not covered in the
training set. This occurs because models learn to exploit shortcuts -
source-specific artifacts that do not translate to new distributions - rather
than reasonable biomarkers to maximize performance on in-distribution (ID)
data. Rendering the models more robust to distribution shifts, our study
investigates the use of fundamental noise injection techniques (Gaussian,
Speckle, Poisson, and Salt and Pepper) during training. Our empirical results
demonstrate that this technique can significantly reduce the performance gap
between ID and OOD evaluation from 0.10-0.20 to 0.01-0.06, based on results
averaged over ten random seeds across key metrics such as AUC, F1, accuracy,
recall and specificity. Our source code is publicly available at
https://github.com/Duongmai127/Noisy-ood
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ozan Kanbertay, Richard Vogg, Elif Karakoc, Peter M. Kappeler, Claudia Fichtel et al.
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Computer vision methods are increasingly used for the automated analysis of
large volumes of video data collected through camera traps, drones, or direct
observations of animals in the wild. While recent advances have focused
primarily on detecting individual actions, much less work has addressed the
detection and annotation of interactions -- a crucial aspect for understanding
social and individualized animal behavior. Existing open-source annotation
tools support either behavioral labeling without localization of individuals,
or localization without the capacity to capture interactions. To bridge this
gap, we present SILVI, an open-source labeling software that integrates both
functionalities. SILVI enables researchers to annotate behaviors and
interactions directly within video data, generating structured outputs suitable
for training and validating computer vision models. By linking behavioral
ecology with computer vision, SILVI facilitates the development of automated
approaches for fine-grained behavioral analyses. Although developed primarily
in the context of animal behavior, SILVI could be useful more broadly to
annotate human interactions in other videos that require extracting dynamic
scene graphs. The software, along with documentation and download instructions,
is available at: https://gitlab.gwdg.de/kanbertay/interaction-labelling-app.
๐
2025-11-05
๐ NeurIPS
๐ฅ Candace Ross, Florian Bordes, Adina Williams, Polina Kirichenko, Mark Ibrahim
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal language models possess a remarkable ability to handle an
open-vocabulary's worth of objects. Yet the best models still suffer from
hallucinations when reasoning about scenes in the real world, revealing a gap
between their seemingly strong performance on existing perception benchmarks
that are saturating and their reasoning in the real world. To address this gap,
we build a novel benchmark of in-the-wild scenes that we call Common-O. With
more than 10.5k examples using exclusively new images not found in web training
data to avoid contamination, Common-O goes beyond just perception, inspired by
cognitive tests for humans, to probe reasoning across scenes by asking "what's
in common?". We evaluate leading multimodal language models, including models
specifically trained to perform chain-of-thought reasoning. We find that
perceiving objects in single images is tractable for most models, yet reasoning
across scenes is very challenging even for the best models, including reasoning
models. Despite saturating many leaderboards focusing on perception, the best
performing model only achieves 35% on Common-O -- and on Common-O Complex,
consisting of more complex scenes, the best model achieves only 1%. Curiously,
we find models are more prone to hallucinate when similar objects are present
in the scene, suggesting models may be relying on object co-occurrence seen
during training. Among the models we evaluated, we found scale can provide
modest improvements while models explicitly trained with multi-image inputs
show bigger improvements, suggesting scaled multi-image training may offer
promise. We make our benchmark publicly available to spur research into the
challenge of hallucination when reasoning across scenes.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hyunseok Kwak, Kyeongwon Lee, Jae-Jin Lee, Woojoo Lee
ๆฅ็ๆ่ฆ
On-device fine-tuning of CNNs is essential to withstand domain shift in edge
applications such as Human Activity Recognition (HAR), yet full fine-tuning is
infeasible under strict memory, compute, and energy budgets. We present
LoRA-Edge, a parameter-efficient fine-tuning (PEFT) method that builds on
Low-Rank Adaptation (LoRA) with tensor-train assistance. LoRA-Edge (i) applies
Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional
layers, (ii) selectively updates only the output-side core with
zero-initialization to keep the auxiliary path inactive at the start, and (iii)
fuses the update back into dense kernels, leaving inference cost unchanged.
This design preserves convolutional structure and reduces the number of
trainable parameters by up to two orders of magnitude compared to full
fine-tuning. Across diverse HAR datasets and CNN backbones, LoRA-Edge achieves
accuracy within 4.7% of full fine-tuning while updating at most 1.49% of
parameters, consistently outperforming prior parameter-efficient baselines
under similar budgets. On a Jetson Orin Nano, TT-SVD initialization and
selective-core training yield 1.4-3.8x faster convergence to target F1.
LoRA-Edge thus makes structure-aligned, parameter-efficient on-device CNN
adaptation practical for edge platforms.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Question Answering (VideoQA) in the surgical domain aims to enhance
intraoperative understanding by enabling AI models to reason over temporally
coherent events rather than isolated frames. Current approaches are limited to
static image features, and available datasets often lack temporal annotations,
ignoring the dynamics critical for accurate procedural interpretation. We
propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from
static images to dynamic surgical scenes. It uses a Masked Video--Text Encoder
to fuse video and question features, capturing temporal cues such as motion and
tool--tissue interactions, which a fine-tuned large language model (LLM) then
decodes into coherent answers. To evaluate its performance, we curated
REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related
questions and diagnostic attributes, as well as out-of-template questions with
rephrased or semantically altered formulations to assess model robustness.
Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset
shows that SurgViVQA outperforms existing image-based VQA benchmark models,
particularly in keyword accuracy, improving over PitVQA by +11\% on
REAL-Colon-VQA and +9\% on EndoVis18-VQA. A perturbation study on the questions
further confirms improved generalizability and robustness to variations in
question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework
for temporally-aware understanding in surgical VideoQA, enabling AI models to
interpret dynamic procedural contexts more effectively. Code and dataset
available at https://github.com/madratak/SurgViVQA.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Stefano M. Iacus, Devika Jain, Andrea Nasuto, Giuseppe Porro, Marcello Carammia et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Quantifying human flourishing, a multidimensional construct including
happiness, health, purpose, virtue, relationships, and financial stability, is
critical for understanding societal well-being beyond economic indicators.
Existing measures often lack fine spatial and temporal resolution. Here we
introduce the Human Flourishing Geographic Index (HFGI), derived from analyzing
approximately 2.6 billion geolocated U.S. tweets (2013-2023) using fine-tuned
large language models to classify expressions across 48 indicators aligned with
Harvard's Global Flourishing Study framework plus attitudes towards migration
and perception of corruption. The dataset offers monthly and yearly county- and
state-level indicators of flourishing-related discourse, validated to confirm
that the measures accurately represent the underlying constructs and show
expected correlations with established indicators. This resource enables
multidisciplinary analyses of well-being, inequality, and social change at
unprecedented resolution, offering insights into the dynamics of human
flourishing as reflected in social media discourse across the United States
over the past decade.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Alvin Wei Ming Tan, Ben Prystawski, Veronica Boyce, Michael C. Frank
MultimodalMachine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Iterated reference games - in which players repeatedly pick out novel
referents using language - present a test case for agents' ability to perform
context-sensitive pragmatic reasoning in multi-turn linguistic environments. We
tested humans and vision-language models on trials from iterated reference
games, varying the given context in terms of amount, order, and relevance.
Without relevant context, models were above chance but substantially worse than
humans. However, with relevant context, model performance increased
dramatically over trials. Few-shot reference games with abstract referents
remain a difficult task for machine learning models.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Manh Nguyen, Sunil Gupta, Dai Do, Hung Le
ๆฅ็ๆ่ฆ
Hallucination mitigation remains a persistent challenge for large language
models (LLMs), even as model scales grow. Existing approaches often rely on
external knowledge sources, such as structured databases or knowledge graphs,
accessed through prompting or retrieval. However, prompt-based grounding is
fragile and domain-sensitive, while symbolic knowledge integration incurs heavy
retrieval and formatting costs. Motivated by knowledge graphs, we introduce
Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds
generation in corpus-derived evidence without retraining. GRAD constructs a
sparse token transition graph by accumulating next-token logits across a small
retrieved corpus in a single forward pass. During decoding, graph-retrieved
logits are max-normalized and adaptively fused with model logits to favor
high-evidence continuations while preserving fluency. Across three models and a
range of question-answering benchmarks spanning intrinsic, extrinsic
hallucination, and factuality tasks, GRAD consistently surpasses baselines,
achieving up to 9.7$\%$ higher intrinsic accuracy, 8.6$\%$ lower hallucination
rates, and 6.9$\%$ greater correctness compared to greedy decoding, while
attaining the highest truth--informativeness product score among all methods.
GRAD offers a lightweight, plug-and-play alternative to contrastive decoding
and knowledge graph augmentation, demonstrating that statistical evidence from
corpus-level token transitions can effectively steer generation toward more
truthful and verifiable outputs.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hellina Hailu Nigatu, Bethelhem Yemane Mamo, Bontu Fufa Balcha, Debora Taye Tesfaye, Elbethel Daniel Zewdie et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As low-resourced languages are increasingly incorporated into NLP research,
there is an emphasis on collecting large-scale datasets. But in prioritizing
quantity over quality, we risk 1) building language technologies that perform
poorly for these languages and 2) producing harmful content that perpetuates
societal biases. In this paper, we investigate the quality of Machine
Translation (MT) datasets for three low-resourced languages--Afan Oromo,
Amharic, and Tigrinya, with a focus on the gender representation in the
datasets. Our findings demonstrate that while training data has a large
representation of political and religious domain text, benchmark datasets are
focused on news, health, and sports. We also found a large skew towards the
male gender--in names of persons, the grammatical gender of verbs, and in
stereotypical depictions in the datasets. Further, we found harmful and toxic
depictions against women, which were more prominent for the language with the
largest amount of data, underscoring that quantity does not guarantee quality.
We hope that our work inspires further inquiry into the datasets collected for
low-resourced languages and prompts early mitigation of harmful content.
WARNING: This paper contains discussion of NSFW content that some may find
disturbing.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mikoลaj Langner, Jan Eliasz, Ewa Rudnicka, Jan Kocoล
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce a method for efficient multi-label text classification with
large language models (LLMs), built on reformulating classification tasks as
sequences of dichotomic (yes/no) decisions. Instead of generating all labels in
a single structured response, each target dimension is queried independently,
which, combined with a prefix caching mechanism, yields substantial efficiency
gains for short-text inference without loss of accuracy. To demonstrate the
approach, we focus on affective text analysis, covering 24 dimensions including
emotions and sentiment. Using LLM-to-SLM distillation, a powerful annotator
model (DeepSeek-V3) provides multiple annotations per text, which are
aggregated to fine-tune smaller models (HerBERT-Large, CLARIN-1B, PLLuM-8B,
Gemma3-1B). The fine-tuned models show significant improvements over zero-shot
baselines, particularly on the dimensions seen during training. Our findings
suggest that decomposing multi-label classification into dichotomic queries,
combined with distillation and cache-aware inference, offers a scalable and
effective framework for LLM-based classification. While we validate the method
on affective states, the approach is general and applicable across domains.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Mohammad Atif Quamar, Mohammad Areeb, Mikhail Kuznetsov, Muslum Ozgur Ozmen, Z. Berkay Celik
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Aligning large language models with human values is crucial for their safe
deployment; however, existing methods, such as fine-tuning, are computationally
expensive and suboptimal. In contrast, inference-time approaches like Best-of-N
sampling require practically infeasible computation to achieve optimal
alignment. We propose STARS: Segment-level Token Alignment with Rejection
Sampling, a decoding-time algorithm that steers model generation by iteratively
sampling, scoring, and rejecting/accepting short, fixed-size token segments.
This allows for early correction of the generation path, significantly
improving computational efficiency and boosting alignment quality. Across a
suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT)
by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up
to 4.3 percentage points on win-rates, while remaining highly competitive with
strong Best-of-N baselines. Our work establishes granular, reward-guided
sampling as a generalizable, robust, and efficient alternative to traditional
fine-tuning and full-sequence ranking methods for aligning LLMs.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder
Natural Language Processing
ๆฅ็ๆ่ฆ
Tokenization is fundamental in assembly code analysis, impacting intrinsic
characteristics like vocabulary size, semantic coverage, and extrinsic
performance in downstream tasks. Despite its significance, tokenization in the
context of assembly code remains an underexplored area. This study aims to
address this gap by evaluating the intrinsic properties of Natural Language
Processing (NLP) tokenization models and parameter choices, such as vocabulary
size. We explore preprocessing customization options and pre-tokenization rules
tailored to the unique characteristics of assembly code. Additionally, we
assess their impact on downstream tasks like function signature prediction -- a
critical problem in binary code analysis.
To this end, we conduct a thorough study on various tokenization models,
systematically analyzing their efficiency in encoding assembly instructions and
capturing semantic nuances. Through intrinsic evaluations, we compare
tokenizers based on tokenization efficiency, vocabulary compression, and
representational fidelity for assembly code. Using state-of-the-art pre-trained
models such as the decoder-only Large Language Model (LLM) Llama 3.2, the
encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate
the effectiveness of these tokenizers across multiple performance metrics.
Preliminary findings indicate that tokenizer choice significantly influences
downstream performance, with intrinsic metrics providing partial but incomplete
predictability of extrinsic evaluation outcomes. These results reveal complex
trade-offs between intrinsic tokenizer properties and their utility in
practical assembly code tasks. Ultimately, this study provides valuable
insights into optimizing tokenization models for low-level code analysis,
contributing to the robustness and scalability of Natural Language Model
(NLM)-based binary analysis workflows.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jan Kocoล, Maciej Piasecki, Arkadiusz Janz, Teddy Ferdinan, ลukasz Radliลski et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) play a central role in modern artificial
intelligence, yet their development has been primarily focused on English,
resulting in limited support for other languages. We present PLLuM (Polish
Large Language Model), the largest open-source family of foundation models
tailored specifically for the Polish language. Developed by a consortium of
major Polish research institutions, PLLuM addresses the need for high-quality,
transparent, and culturally relevant language models beyond the English-centric
commercial landscape. We describe the development process, including the
construction of a new 140-billion-token Polish text corpus for pre-training, a
77k custom instructions dataset, and a 100k preference optimization dataset. A
key component is a Responsible AI framework that incorporates strict data
governance and a hybrid module for output correction and safety filtering. We
detail the models' architecture, training procedures, and alignment techniques
for both base and instruction-tuned variants, and demonstrate their utility in
a downstream task within public administration. By releasing these models
publicly, PLLuM aims to foster open research and strengthen sovereign AI
technologies in Poland.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Stergios Chatzikyriakidis, Dimitris Papadakis, Sevasti-Ioanna Papaioannou, Erofili Psaltaki
Natural Language Processing
ๆฅ็ๆ่ฆ
We present an extended Greek Dialectal Dataset (GRDD+) 1that complements the
existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern
Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian
Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a
dataset with total size 6,374,939 words and 10 varieties. This is the first
dataset with such variation and size to date. We conduct a number of
fine-tuning experiments to see the effect of good quality dialectal data on a
number of LLMs. We fine-tune three model architectures (Llama-3-8B,
Llama-3.1-8B, Krikri-8B) and compare the results to frontier models
(Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Doria Bonzi, Alexandre Guiggi, Frรฉdรฉric Bรฉchet, Carlos Ramisch, Benoit Favre
Natural Language Processing
ๆฅ็ๆ่ฆ
Critical appraisal of scientific literature is an essential skill in the
biomedical field. While large language models (LLMs) can offer promising
support in this task, their reliability remains limited, particularly for
critical reasoning in specialized domains. We introduce CareMedEval, an
original dataset designed to evaluate LLMs on biomedical critical appraisal and
reasoning tasks. Derived from authentic exams taken by French medical students,
the dataset contains 534 questions based on 37 scientific articles. Unlike
existing benchmarks, CareMedEval explicitly evaluates critical reading and
reasoning grounded in scientific papers. Benchmarking state-of-the-art
generalist and biomedical-specialized LLMs under various context conditions
reveals the difficulty of the task: open and commercial models fail to exceed
an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens
considerably improves the results. Yet, models remain challenged especially on
questions about study limitations and statistical analysis. CareMedEval
provides a challenging benchmark for grounded reasoning, exposing current LLM
limitations and paving the way for future development of automated support for
critical appraisal.
๐
2025-11-05
๐ AAAI
๐ฅ Suraj Prasai, Mengnan Du, Ying Zhang, Fan Yang
ๆฅ็ๆ่ฆ
We develop KnowThyself, an agentic assistant that advances large language
model (LLM) interpretability. Existing tools provide useful insights but remain
fragmented and code-intensive. KnowThyself consolidates these capabilities into
a chat-based interface, where users can upload models, pose natural language
questions, and obtain interactive visualizations with guided explanations. At
its core, an orchestrator LLM first reformulates user queries, an agent router
further directs them to specialized modules, and the outputs are finally
contextualized into coherent explanations. This design lowers technical
barriers and provides an extensible platform for LLM inspection. By embedding
the whole process into a conversational workflow, KnowThyself offers a robust
foundation for accessible LLM interpretability.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song et al.
Robotics
ๆฅ็ๆ่ฆ
Imitation learning-based robot control policies are enjoying renewed interest
in video-based robotics. However, it remains unclear whether this approach
applies to X-ray-guided procedures, such as spine instrumentation. This is
because interpretation of multi-view X-rays is complex. We examine
opportunities and challenges for imitation policy learning in bi-plane-guided
cannula insertion. We develop an in silico sandbox for scalable, automated
simulation of X-ray-guided spine procedures with a high degree of realism. We
curate a dataset of correct trajectories and corresponding bi-planar X-ray
sequences that emulate the stepwise alignment of providers. We then train
imitation learning policies for planning and open-loop control that iteratively
align a cannula solely based on visual information. This precisely controlled
setup offers insights into limitations and capabilities of this method. Our
policy succeeded on the first attempt in 68.5% of cases, maintaining safe
intra-pedicular trajectories across diverse vertebral levels. The policy
generalized to complex anatomy, including fractures, and remained robust to
varied initializations. Rollouts on real bi-planar X-rays further suggest that
the model can produce plausible trajectories, despite training exclusively in
simulation. While these preliminary results are promising, we also identify
limitations, especially in entry point precision. Full closed-look control will
require additional considerations around how to provide sufficiently frequent
feedback. With more robust priors and domain knowledge, such models may provide
a foundation for future efforts toward lightweight and CT-free robotic
intra-operative spinal navigation.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Richard Dewey, Janos Botyanszki, Ciamac C. Moallemi, Andrew T. Zheng
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
AI researchers have long focused on poker-like games as a testbed for
environments characterized by multi-player dynamics, imperfect information, and
reasoning under uncertainty. While recent breakthroughs have matched elite
human play at no-limit Texas hold'em, the multi-player dynamics are subdued:
most hands converge quickly with only two players engaged through multiple
rounds of bidding. In this paper, we present Solly, the first AI agent to
achieve elite human play in reduced-format Liar's Poker, a game characterized
by extensive multi-player engagement. We trained Solly using self-play with a
model-free, actor-critic, deep reinforcement learning algorithm. Solly played
at an elite human level as measured by win rate (won over 50% of hands) and
equity (money won) in heads-up and multi-player Liar's Poker. Solly also
outperformed large language models (LLMs), including those with reasoning
abilities, on the same metrics. Solly developed novel bidding strategies,
randomized play effectively, and was not easily exploitable by world-class
human players.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nan Li, Albert Gatt, Massimo Poesio
ๆฅ็ๆ่ฆ
Collaborative dialogue relies on participants incrementally establishing
common ground, yet in asymmetric settings they may believe they agree while
referring to different entities. We introduce a perspectivist annotation scheme
for the HCRC MapTask corpus (Anderson et al., 1991) that separately captures
speaker and addressee grounded interpretations for each reference expression,
enabling us to trace how understanding emerges, diverges, and repairs over
time. Using a scheme-constrained LLM annotation pipeline, we obtain 13k
annotated reference expressions with reliability estimates and analyze the
resulting understanding states. The results show that full misunderstandings
are rare once lexical variants are unified, but multiplicity discrepancies
systematically induce divergences, revealing how apparent grounding can mask
referential misalignment. Our framework provides both a resource and an
analytic lens for studying grounded misunderstanding and for evaluating
(V)LLMs' capacity to model perspective-dependent grounding in collaborative
dialogue.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohsen Ahmadzadeh, Kaichang Chen, Georges Gielen
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Analog/mixed-signal circuits are key for interfacing electronics with the
physical world. Their design, however, remains a largely handcrafted process,
resulting in long and error-prone design cycles. While the recent rise of
AI-based reinforcement learning and generative AI has created new techniques to
automate this task, the need for many time-consuming simulations is a critical
bottleneck hindering the overall efficiency. Furthermore, the lack of
explainability of the resulting design solutions hampers widespread adoption of
the tools. To address these issues, a novel agentic AI framework for
sample-efficient and explainable analog circuit sizing is presented. It employs
a multi-agent workflow where specialized Large Language Model (LLM)-based
agents collaborate to interpret the circuit topology, to understand the design
goals, and to iteratively refine the circuit's design parameters towards the
target goals with human-interpretable reasoning. The adaptive simulation
strategy creates an intelligent control that yields a high sample efficiency.
The AnaFlow framework is demonstrated for two circuits of varying complexity
and is able to complete the sizing task fully automatically, differently from
pure Bayesian optimization and reinforcement learning approaches. The system
learns from its optimization history to avoid past mistakes and to accelerate
convergence. The inherent explainability makes this a powerful tool for analog
design space exploration and a new paradigm in analog EDA, where AI agents
serve as transparent design assistants.
๐
2025-11-05
๐ ArXiv Preprint (cs.SE)
๐ฅ Xingyao Wang, Simon Rosenberg, Juan Michelini, Calvin Smith, Hoang Tran et al.
Robotics
ๆฅ็ๆ่ฆ
Agents are now used widely in the process of software development, but
building production-ready software engineering agents is a complex task.
Deploying software agents effectively requires flexibility in implementation
and experimentation, reliable and secure execution, and interfaces for users to
interact with agents. In this paper, we present the OpenHands Software Agent
SDK, a toolkit for implementing software development agents that satisfy these
desiderata. This toolkit is a complete architectural redesign of the agent
components of the popular OpenHands framework for software development agents,
which has 64k+ GitHub stars. To achieve flexibility, we design a simple
interface for implementing agents that requires only a few lines of code in the
default case, but is easily extensible to more complex, full-featured agents
with features such as custom tools, memory management, and more. For security
and reliability, it delivers seamless local-to-remote execution portability,
integrated REST/WebSocket services. For interaction with human users, it can
connect directly to a variety of interfaces, such as visual workspaces (VS
Code, VNC, browser), command-line interfaces, and APIs. Compared with existing
SDKs from OpenAI, Claude, and Google, OpenHands uniquely integrates native
sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and
built-in security analysis. Empirical results on SWE-Bench Verified and GAIA
benchmarks demonstrate strong performance. Put together, these elements allow
the OpenHands Software Agent SDK to provide a practical foundation for
prototyping, unlocking new classes of custom applications, and reliably
deploying agents at scale.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Eugรจne Berta, David Holzmรผller, Michael I. Jordan, Francis Bach
Machine Learning
ๆฅ็ๆ่ฆ
Post-hoc recalibration methods are widely used to ensure that classifiers
provide faithful probability estimates. We argue that parametric recalibration
functions based on logistic regression can be motivated from a simple
theoretical setting for both binary and multiclass classification. This insight
motivates the use of more expressive calibration methods beyond standard
temperature scaling. For multi-class calibration however, a key challenge lies
in the increasing number of parameters introduced by more complex models, often
coupled with limited calibration data, which can lead to overfitting. Through
extensive experiments, we demonstrate that the resulting bias-variance tradeoff
can be effectively managed by structured regularization, robust preprocessing
and efficient optimization. The resulting methods lead to substantial gains
over existing logistic-based calibration techniques. We provide efficient and
easy-to-use open-source implementations of our methods, making them an
attractive alternative to common temperature, vector, and matrix scaling
implementations.
๐
2025-11-05
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Geoff McDonald, Jonathan Bar Or
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly deployed in sensitive domains
including healthcare, legal services, and confidential communications, where
privacy is paramount. This paper introduces Whisper Leak, a side-channel attack
that infers user prompt topics from encrypted LLM traffic by analyzing packet
size and timing patterns in streaming responses. Despite TLS encryption
protecting content, these metadata patterns leak sufficient information to
enable topic classification. We demonstrate the attack across 28 popular LLMs
from major providers, achieving near-perfect classification (often >98% AUPRC)
and high precision even at extreme class imbalance (10,000:1 noise-to-target
ratio). For many models, we achieve 100% precision in identifying sensitive
topics like "money laundering" while recovering 5-20% of target conversations.
This industry-wide vulnerability poses significant risks for users under
network surveillance by ISPs, governments, or local adversaries. We evaluate
three mitigation strategies - random padding, token batching, and packet
injection - finding that while each reduces attack effectiveness, none provides
complete protection. Through responsible disclosure, we have collaborated with
providers to implement initial countermeasures. Our findings underscore the
need for LLM providers to address metadata leakage as AI systems handle
increasingly sensitive information.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Daniel Perkins, Oscar J. Escobar, Luke Green
Machine Learning
ๆฅ็ๆ่ฆ
We present a detailed study of Deep Q-Networks in finite environments,
emphasizing the impact of epsilon-greedy exploration schedules and prioritized
experience replay. Through systematic experimentation, we evaluate how
variations in epsilon decay schedules affect learning efficiency, convergence
behavior, and reward optimization. We investigate how prioritized experience
replay leads to faster convergence and higher returns and show empirical
results comparing uniform, no replay, and prioritized strategies across
multiple simulations. Our findings illuminate the trade-offs and interactions
between exploration strategies and memory management in DQN training, offering
practical recommendations for robust reinforcement learning in
resource-constrained settings.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jing Gao, Shutiao Luo, Yumeng Liu, Yuanming Li, Hongji Zeng
Natural Language Processing
ๆฅ็ๆ่ฆ
With the rapid advancement of natural language processing (NLP) technologies,
the demand for high-quality Chinese document question-answering datasets is
steadily growing. To address this issue, we present the Chinese Multi-Document
Question Answering Dataset(ChiMDQA), specifically designed for downstream
business scenarios across prevalent domains including academic, education,
finance, law, medical treatment, and news. ChiMDQA encompasses long-form
documents from six distinct fields, consisting of 6,068 rigorously curated,
high-quality question-answer (QA) pairs further classified into ten
fine-grained categories. Through meticulous document screening and a systematic
question-design methodology, the dataset guarantees both diversity and high
quality, rendering it applicable to various NLP tasks such as document
comprehension, knowledge extraction, and intelligent QA systems. Additionally,
this paper offers a comprehensive overview of the dataset's design objectives,
construction methodologies, and fine-grained evaluation system, supplying a
substantial foundation for future research and practical applications in
Chinese QA. The code and data are available at:
https://anonymous.4open.science/r/Foxit-CHiMDQA/.
๐
2025-11-05
๐ ArXiv Preprint (q-bio.NC)
๐ฅ Peter DiBerardino, Britt Anderson
ๆฅ็ๆ่ฆ
When people pursue rewards in stochastic environments, they often match their
choice frequencies to the observed target frequencies, even when this policy is
demonstrably sub-optimal. We used a ``hide and seek'' task to evaluate this
behavior under conditions where pursuit (seeking) could be toggled to avoidance
(hiding), while leaving the probability distribution fixed, or varying
complexity by changing the number of possible choices. We developed a model for
participant choice built from choice frequency histograms treated as vectors.
We posited the existence of a probability antimatching strategy for avoidance
(hiding) rounds, and formalized this as a vector reflection of probability
matching. We found that only two basis policies: matching/antimatching and
maximizing/minimizing were sufficient to account for participant choices across
a range of room numbers and opponent probability distributions. This schema
requires only that people have the ability to remember the relative frequency
of the different outcomes. With this knowledge simple operations can construct
the maximizing and minimizing policies as well as matching and antimatching
strategies. A mixture of these two policies captures human choice patterns in a
stochastic environment.
๐
2025-11-05
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Thomas Souverain
Natural Language Processing
ๆฅ็ๆ่ฆ
To foster trustworthy Artificial Intelligence (AI) within the European Union,
the AI Act requires providers to mark and detect the outputs of their
general-purpose models. The Article 50 and Recital 133 call for marking methods
that are ''sufficiently reliable, interoperable, effective and robust''. Yet,
the rapidly evolving and heterogeneous landscape of watermarks for Large
Language Models (LLMs) makes it difficult to determine how these four standards
can be translated into concrete and measurable evaluations. Our paper addresses
this challenge, anchoring the normativity of European requirements in the
multiplicity of watermarking techniques. Introducing clear and distinct
concepts on LLM watermarking, our contribution is threefold. (1) Watermarking
Categorisation: We propose an accessible taxonomy of watermarking methods
according to the stage of the LLM lifecycle at which they are applied - before,
during, or after training, and during next-token distribution or sampling. (2)
Watermarking Evaluation: We interpret the EU AI Act's requirements by mapping
each criterion with state-of-the-art evaluations on robustness and
detectability of the watermark, and of quality of the LLM. Since
interoperability remains largely untheorised in LLM watermarking research, we
propose three normative dimensions to frame its assessment. (3) Watermarking
Comparison: We compare current watermarking methods for LLMs against the
operationalised European criteria and show that no approach yet satisfies all
four standards. Encouraged by emerging empirical tests, we recommend further
research into watermarking directly embedded within the low-level architecture
of LLMs.
๐
2025-11-05
๐ ArXiv Preprint (q-fin.TR)
๐ฅ Haofei Yu, Fenghai Li, Jiaxuan You
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) achieve strong performance across
benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but
these tests occur in static settings, lacking real dynamics and uncertainty.
Consequently, they evaluate isolated reasoning or problem-solving rather than
decision-making under uncertainty. To address this, we introduce
LiveTradeBench, a live trading environment for evaluating LLM agents in
realistic and evolving markets. LiveTradeBench follows three design principles:
(i) Live data streaming of market prices and news, eliminating dependence on
offline backtesting and preventing information leakage while capturing
real-time uncertainty; (ii) a portfolio-management abstraction that extends
control from single-asset actions to multi-asset allocation, integrating risk
management and cross-asset reasoning; and (iii) multi-market evaluation across
structurally distinct environments--U.S. stocks and Polymarket prediction
markets--differing in volatility, liquidity, and information flow. At each
step, an agent observes prices, news, and its portfolio, then outputs
percentage allocations that balance risk and return. Using LiveTradeBench, we
run 50-day live evaluations of 21 LLMs across families. Results show that (1)
high LMArena scores do not imply superior trading outcomes; (2) models display
distinct portfolio styles reflecting risk appetite and reasoning dynamics; and
(3) some LLMs effectively leverage live signals to adapt decisions. These
findings expose a gap between static evaluation and real-world competence,
motivating benchmarks that test sequential decision making and consistency
under live uncertainty.
๐
2025-11-05
๐ This manuscript was presented at VIS x GenAI, a wo
๐ฅ Timo Brand, Henry Fรถrster, Stephen G. Kobourov, Jacob Miller
ๆฅ็ๆ่ฆ
We evaluate how visualizations can influence the judgment of MLLMs about the
presence or absence of bridges in a network. We show that the inclusion of
visualization improves confidence over a structured text-based input that could
theoretically be helpful for answering the question. On the other hand, we
observe that standard visualization techniques create a strong bias towards
accepting or refuting the presence of a bridge -- independently of whether or
not a bridge actually exists in the network. While our results indicate that
the inclusion of visualization techniques can effectively influence the MLLM's
judgment without compromising its self-reported confidence, they also imply
that practitioners must be careful of allowing users to include visualizations
in generative AI applications so as to avoid undesired hallucinations.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu et al.
Robotics
ๆฅ็ๆ่ฆ
We present Step-Audio-EditX, the first open-source LLM-based audio model
excelling at expressive and iterative audio editing encompassing emotion,
speaking style, and paralinguistics alongside robust zero-shot text-to-speech
(TTS) capabilities.Our core innovation lies in leveraging only large-margin
synthetic data, which circumvents the need for embedding-based priors or
auxiliary modules. This large-margin learning approach enables both iterative
control and high expressivity across voices, and represents a fundamental pivot
from the conventional focus on representation-level disentanglement. Evaluation
results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and
Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
๐
2025-11-05
๐ ArXiv Preprint (cs.PF)
๐ฅ Andrei Ivanov, Siyuan Shen, Gioele Gottardo, Marcin Chrapek, Afif Boudaoud et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The increasing complexity of machine learning models and the proliferation of
diverse hardware architectures (CPUs, GPUs, accelerators) make achieving
optimal performance a significant challenge. Heterogeneity in instruction sets,
specialized kernel requirements for different data types and model features
(e.g., sparsity, quantization), and architecture-specific optimizations
complicate performance tuning. Manual optimization is resource-intensive, while
existing automatic approaches often rely on complex hardware-specific
heuristics and uninterpretable intermediate representations, hindering
performance portability. We introduce PerfLLM, a novel automatic optimization
methodology leveraging Large Language Models (LLMs) and Reinforcement Learning
(RL). Central to this is PerfDojo, an environment framing optimization as an RL
game using a human-readable, mathematically-inspired code representation that
guarantees semantic validity through transformations. This allows effective
optimization without prior hardware knowledge, facilitating both human analysis
and RL agent training. We demonstrate PerfLLM's ability to achieve significant
performance gains across diverse CPU (x86, Arm, RISC-V) and GPU architectures.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mainak Singha
Machine Learning
ๆฅ็ๆ่ฆ
Neural networks can approximate solutions to partial differential equations,
but they often break the very laws they are meant to model-creating mass from
nowhere, drifting shocks, or violating conservation and entropy. We address
this by training within the laws of physics rather than beside them. Our
framework, called Constraint-Projected Learning (CPL), keeps every update
physically admissible by projecting network outputs onto the intersection of
constraint sets defined by conservation, Rankine-Hugoniot balance, entropy, and
positivity. The projection is differentiable and adds only about 10%
computational overhead, making it fully compatible with back-propagation. We
further stabilize training with total-variation damping (TVD) to suppress small
oscillations and a rollout curriculum that enforces consistency over long
prediction horizons. Together, these mechanisms eliminate both hard and soft
violations: conservation holds at machine precision, total-variation growth
vanishes, and entropy and error remain bounded. On Burgers and Euler systems,
CPL produces stable, physically lawful solutions without loss of accuracy.
Instead of hoping neural solvers will respect physics, CPL makes that behavior
an intrinsic property of the learning process.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Aniol Civit, Antonio Andriella, Carles Sierra, Guillem Alenyร
Robotics
ๆฅ็ๆ่ฆ
While personalisation in Human-Robot Interaction (HRI) has advanced
significantly, most existing approaches focus on single-user adaptation,
overlooking scenarios involving multiple stakeholders with potentially
conflicting preferences. To address this, we propose the Multi-User Preferences
Quantitative Bipolar Argumentation Framework (MUP-QBAF), a novel multi-user
personalisation framework based on Quantitative Bipolar Argumentation
Frameworks (QBAFs) that explicitly models and resolves multi-user preference
conflicts. Unlike prior work in Argumentation Frameworks, which typically
assumes static inputs, our approach is tailored to robotics: it incorporates
both users' arguments and the robot's dynamic observations of the environment,
allowing the system to adapt over time and respond to changing contexts.
Preferences, both positive and negative, are represented as arguments whose
strength is recalculated iteratively based on new information. The framework's
properties and capabilities are presented and validated through a realistic
case study, where an assistive robot mediates between the conflicting
preferences of a caregiver and a care recipient during a frailty assessment
task. This evaluation further includes a sensitivity analysis of argument base
scores, demonstrating how preference outcomes can be shaped by user input and
contextual observations. By offering a transparent, structured, and
context-sensitive approach to resolving competing user preferences, this work
advances the field of multi-user HRI. It provides a principled alternative to
data-driven methods, enabling robots to navigate conflicts in real-world
environments.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Iason Chrysomallis, Georgios Chalkiadakis
Machine Learning
ๆฅ็ๆ่ฆ
Imitation learning (IL) enables agents to acquire skills by observing and
replicating the behavior of one or multiple experts. In recent years, advances
in deep learning have significantly expanded the capabilities and scalability
of imitation learning across a range of domains, where expert data can range
from full state-action trajectories to partial observations or unlabeled
sequences. Alongside this growth, novel approaches have emerged, with new
methodologies being developed to address longstanding challenges such as
generalization, covariate shift, and demonstration quality. In this survey, we
review the latest advances in imitation learning research, highlighting recent
trends, methodological innovations, and practical applications. We propose a
novel taxonomy that is distinct from existing categorizations to better reflect
the current state of the IL research stratum and its trends. Throughout the
survey, we critically examine the strengths, limitations, and evaluation
practices of representative works, and we outline key challenges and open
directions for future research.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joachim Diederich
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
This paper presents the first empirical demonstration of controllable
locality in transformer language models, a novel architectural framework that
enables continuous control over the degree of representation localization
through a tunable locality dial parameter. Unlike traditional language models
that rely exclusively on distributed representations, our approach allows
dynamic interpolation between highly interpretable localist encodings and
efficient distributed representations without requiring model retraining. We
conducted experiments on the WikiText corpus using a two-layer transformer
architecture, systematically varying the locality parameter {\lambda} across
the full spectrum from 1.0 (fully localist) to 0.0 (fully distributed). Our
results demonstrate that localist configurations achieve dramatically lower
attention entropy, with {\lambda} = 1.0 yielding 5.36 bits compared to 7.18
bits at {\lambda} = 0.0, while maintaining substantially higher pointer
fidelity scores reflecting stronger alignment with rule-specified targets.
Prediction experiments reveal that intermediate locality values optimize the
tradeoff between interpretability and performance, with {\lambda} = 0.6
achieving test perplexity of 4.65 and accuracy of 84.7%. These findings
establish that localist language models provide a practical framework for
applications in regulated domains requiring both transparency and capability,
offering precise mathematical control over the interpretability-performance
spectrum through explicit penalty thresholds and information-theoretic design
principles.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sofie Helene Bruun, Dan Saattrup Smart
Natural Language Processing
ๆฅ็ๆ่ฆ
Measuring the full abilities of large language models (LLMs) requires
benchmarks representing multiple tasks. We aim to create large, high-quality
datasets for comparison of logical reasoning skills across several languages
and of suitable difficulty for LLMs of various reasoning ability. We explore
multiple ways of increasing difficulty. We generate zebra puzzles in multiple
languages, themes, sizes and including 14 different clue types and 8 red
herring types (uninformative clues). We find puzzle sizes 2x3 and 4x5 are
sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a
reasoning model), respectively. Including 5 red herrings decreases o3-mini
puzzle-level accuracy on 4x5 puzzles by 15$\pm$7 %. Scores of o3-mini on 4x5
puzzles are not significantly affected by use of English vs. Danish or the
common houses theme vs. the country-specific smoerrebroed theme. We find no
correlation between difficulty and the selected clue types. Datasets of
128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic
languages for sizes 2x3 and 4x5. We publish code for puzzle generation,
designed for adaptablity into more languages and themes.
๐
2025-11-05
๐ ArXiv Preprint (cs.SE)
๐ฅ Ziv Nevo, Orna Raz, Karen Yorav
Natural Language Processing
ๆฅ็ๆ่ฆ
Understanding the purpose of source code is a critical task in software
maintenance, onboarding, and modernization. While large language models (LLMs)
have shown promise in generating code explanations, they often lack grounding
in the broader software engineering context. We propose a novel approach that
leverages natural language artifacts from GitHub -- such as pull request
descriptions, issue descriptions and discussions, and commit messages -- to
enhance LLM-based code understanding. Our system consists of three components:
one that extracts and structures relevant GitHub context, another that uses
this context to generate high-level explanations of the code's purpose, and a
third that validates the explanation. We implemented this as a standalone tool,
as well as a server within the Model Context Protocol (MCP), enabling
integration with other AI-assisted development tools. Our main use case is that
of enhancing a standard LLM-based code explanation with code insights that our
system generates. To evaluate explanations' quality, we conducted a small scale
user study, with developers of several open projects, as well as developers of
proprietary projects. Our user study indicates that when insights are generated
they often are helpful and non trivial, and are free from hallucinations.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sebastian Ordyniak, Giacomo Paesani, Mateusz Rychlicki, Stefan Szeider
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a comprehensive theoretical investigation into the
parameterized complexity of explanation problems in various machine learning
(ML) models. Contrary to the prevalent black-box perception, our study focuses
on models with transparent internal mechanisms. We address two principal types
of explanation problems: abductive and contrastive, both in their local and
global variants. Our analysis encompasses diverse ML models, including Decision
Trees, Decision Sets, Decision Lists, Boolean Circuits, and ensembles thereof,
each offering unique explanatory challenges. This research fills a significant
gap in explainable AI (XAI) by providing a foundational understanding of the
complexities of generating explanations for these models. This work provides
insights vital for further research in the domain of XAI, contributing to the
broader discourse on the necessity of transparency and accountability in AI
systems.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Roberta Di Marino, Giovanni Dioguardi, Antonio Romano, Giuseppe Riccio, Mariano Barone et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Medical question answering systems face deployment challenges including
hallucinations, bias, computational demands, privacy concerns, and the need for
specialized expertise across diverse domains. Here, we present SOLVE-Med, a
multi-agent architecture combining domain-specialized small language models for
complex medical queries. The system employs a Router Agent for dynamic
specialist selection, ten specialized models (1B parameters each) fine-tuned on
specific medical domains, and an Orchestrator Agent that synthesizes responses.
Evaluated on Italian medical forum data across ten specialties, SOLVE-Med
achieves superior performance with ROUGE-1 of 0.301 and BERTScore F1 of 0.697,
outperforming standalone models up to 14B parameters while enabling local
deployment. Our code is publicly available on GitHub:
https://github.com/PRAISELab-PicusLab/SOLVE-Med.
๐
2025-11-05
๐ Paper submitted to WSEAS Signal Processing Journal
๐ฅ Marc Martinez-Gost, Sara Pepe, Ana Pรฉrez-Neira, Miguel รngel Lagunas
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we extend our previous work on the Expressive Neural Network
(ENN), a multilayer perceptron with adaptive activation functions parametrized
using the Discrete Cosine Transform (DCT). Building upon previous work that
demonstrated the strong expressiveness of ENNs with compact architectures, we
now emphasize their efficiency, interpretability and pruning capabilities. The
DCT-based parameterization provides a structured and decorrelated
representation that reveals the functional role of each neuron and allows
direct identification of redundant components. Leveraging this property, we
propose an efficient pruning strategy that removes unnecessary DCT coefficients
with negligible or no loss in performance. Experimental results across
classification and implicit neural representation tasks confirm that ENNs
achieve state-of-the-art accuracy while maintaining a low number of parameters.
Furthermore, up to 40% of the activation coefficients can be safely pruned,
thanks to the orthogonality and bounded nature of the DCT basis. Overall, these
findings demonstrate that the ENN framework offers a principled integration of
signal processing concepts into neural network design, achieving a balanced
trade-off between expressiveness, compactness, and interpretability.
๐
2025-11-05
๐ ArXiv Preprint (cs.CE)
๐ฅ Gabriel Spadon, Vaishnav Vaidheeswaran, Claudio DiBacco
ๆฅ็ๆ่ฆ
Marine invasive species spread through global shipping and generate
substantial ecological and economic impacts. Traditional risk assessments
require detailed records of ballast water and traffic patterns, which are often
incomplete, limiting global coverage. This work advances a theoretical
framework that quantifies invasion risk by combining environmental similarity
across ports with observed and forecasted maritime mobility. Climate-based
feature representations characterize each port's marine conditions, while
mobility networks derived from Automatic Identification System data capture
vessel flows and potential transfer pathways. Clustering and metric learning
reveal climate analogues and enable the estimation of species survival
likelihood along shipping routes. A temporal link prediction model captures how
traffic patterns may change under shifting environmental conditions. The
resulting fusion of environmental similarity and predicted mobility provides
exposure estimates at the port and voyage levels, supporting targeted
monitoring, routing adjustments, and management interventions.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Lei Fu, Sahar Salimpour, Leonardo Militano, Harry Edelman, Jorge Peรฑa Queralta et al.
Robotics
ๆฅ็ๆ่ฆ
Agentic AI systems and Physical or Embodied AI systems have been two key
research verticals at the forefront of Artificial Intelligence and Robotics,
with Model Context Protocol (MCP) increasingly becoming a key component and
enabler of agentic applications. However, the literature at the intersection of
these verticals, i.e., Agentic Embodied AI, remains scarce. This paper
introduces an MCP server for analyzing ROS and ROS 2 bags, allowing for
analyzing, visualizing and processing robot data with natural language through
LLMs and VLMs. We describe specific tooling built with robotics domain
knowledge, with our initial release focused on mobile robotics and supporting
natively the analysis of trajectories, laser scan data, transforms, or time
series data. This is in addition to providing an interface to standard ROS 2
CLI tools ("ros2 bag list" or "ros2 bag info"), as well as the ability to
filter bags with a subset of topics or trimmed in time. Coupled with the MCP
server, we provide a lightweight UI that allows the benchmarking of the tooling
with different LLMs, both proprietary (Anthropic, OpenAI) and open-source
(through Groq). Our experimental results include the analysis of tool calling
capabilities of eight different state-of-the-art LLM/VLM models, both
proprietary and open-source, large and small. Our experiments indicate that
there is a large divide in tool calling capabilities, with Kimi K2 and Claude
Sonnet 4 demonstrating clearly superior performance. We also conclude that
there are multiple factors affecting the success rates, from the tool
description schema to the number of arguments, as well as the number of tools
available to the models. The code is available with a permissive license at
https://github.com/binabik-ai/mcp-rosbags.
๐
2025-11-05
๐ 8 pages 18 fogures, IEEE RAL accept
๐ฅ Jianbo Yuan, Haohua Zhu, Jing Dai, Sheng Yi
Robotics
ๆฅ็ๆ่ฆ
The human hand plays a vital role in daily life and industrial applications,
yet replicating its multifunctional capabilities-including motion, sensing, and
coordinated manipulation-with robotic systems remains a formidable challenge.
Developing a dexterous robotic hand requires balancing human-like agility with
engineering constraints such as complexity, size-to-weight ratio, durability,
and force-sensing performance. This letter presents Dex-Hand 021, a
high-performance, cable-driven five-finger robotic hand with 12 active and 7
passive degrees of freedom (DoFs), achieving 19 DoFs dexterity in a lightweight
1 kg design. We propose a proprioceptive force-sensing-based admittance control
method to enhance manipulation. Experimental results demonstrate its superior
performance: a single-finger load capacity exceeding 10 N, fingertip
repeatability under 0.001 m, and force estimation errors below 0.2 N. Compared
to PID control, joint torques in multi-object grasping are reduced by 31.19%,
significantly improves force-sensing capability while preventing overload
during collisions. The hand excels in both power and precision grasps,
successfully executing 33 GRASP taxonomy motions and complex manipulation
tasks. This work advances the design of lightweight, industrial-grade dexterous
hands and enhances proprioceptive control, contributing to robotic manipulation
and intelligent manufacturing.
๐
2025-11-05
๐ AAAI 2026
๐ฅ Ming Gu, Ziwei Wang, Sicen Lai, Zirui Gao, Sheng Zhou et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Ensuring web accessibility is crucial for advancing social welfare, justice,
and equality in digital spaces, yet the vast majority of website user
interfaces remain non-compliant, due in part to the resource-intensive and
unscalable nature of current auditing practices. While WCAG-EM offers a
structured methodology for site-wise conformance evaluation, it involves great
human efforts and lacks practical support for execution at scale. In this work,
we present an auditing framework, AAA, which operationalizes WCAG-EM through a
human-AI partnership model. AAA is anchored by two key innovations: GRASP, a
graph-based multimodal sampling method that ensures representative page
coverage via learned embeddings of visual, textual, and relational cues; and
MaC, a multimodal large language model-based copilot that supports auditors
through cross-modal reasoning and intelligent assistance in high-effort tasks.
Together, these components enable scalable, end-to-end web accessibility
auditing, empowering human auditors with AI-enhanced assistance for real-world
impact. We further contribute four novel datasets designed for benchmarking
core stages of the audit pipeline. Extensive experiments demonstrate the
effectiveness of our methods, providing insights that small-scale language
models can serve as capable experts when fine-tuned.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Doria Bonzi, Alexandre Guiggi, Frรฉdรฉric Bรฉchet, Carlos Ramisch, Benoit Favre
Natural Language Processing
ๆฅ็ๆ่ฆ
Critical appraisal of scientific literature is an essential skill in the
biomedical field. While large language models (LLMs) can offer promising
support in this task, their reliability remains limited, particularly for
critical reasoning in specialized domains. We introduce CareMedEval, an
original dataset designed to evaluate LLMs on biomedical critical appraisal and
reasoning tasks. Derived from authentic exams taken by French medical students,
the dataset contains 534 questions based on 37 scientific articles. Unlike
existing benchmarks, CareMedEval explicitly evaluates critical reading and
reasoning grounded in scientific papers. Benchmarking state-of-the-art
generalist and biomedical-specialized LLMs under various context conditions
reveals the difficulty of the task: open and commercial models fail to exceed
an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens
considerably improves the results. Yet, models remain challenged especially on
questions about study limitations and statistical analysis. CareMedEval
provides a challenging benchmark for grounded reasoning, exposing current LLM
limitations and paving the way for future development of automated support for
critical appraisal.
๐
2025-11-05
๐ AAAI 2026
๐ฅ Botao 'Amber' Hu, Helena Rong
Robotics
ๆฅ็ๆ่ฆ
As the "agentic web" takes shape-billions of AI agents (often LLM-powered)
autonomously transacting and collaborating-trust shifts from human oversight to
protocol design. In 2025, several inter-agent protocols crystallized this
shift, including Google's Agent-to-Agent (A2A), Agent Payments Protocol (AP2),
and Ethereum's ERC-8004 "Trustless Agents," yet their underlying trust
assumptions remain under-examined. This paper presents a comparative study of
trust models in inter-agent protocol design: Brief (self- or third-party
verifiable claims), Claim (self-proclaimed capabilities and identity, e.g.
AgentCard), Proof (cryptographic verification, including zero-knowledge proofs
and trusted execution environment attestations), Stake (bonded collateral with
slashing and insurance), Reputation (crowd feedback and graph-based trust
signals), and Constraint (sandboxing and capability bounding). For each, we
analyze assumptions, attack surfaces, and design trade-offs, with particular
emphasis on LLM-specific fragilities-prompt injection,
sycophancy/nudge-susceptibility, hallucination, deception, and
misalignment-that render purely reputational or claim-only approaches brittle.
Our findings indicate no single mechanism suffices. We argue for
trustless-by-default architectures anchored in Proof and Stake to gate
high-impact actions, augmented by Brief for identity and discovery and
Reputation overlays for flexibility and social signals. We comparatively
evaluate A2A, AP2, ERC-8004 and related historical variations in academic
research under metrics spanning security, privacy, latency/cost, and social
robustness (Sybil/collusion/whitewashing resistance). We conclude with hybrid
trust model recommendations that mitigate reputation gaming and misinformed LLM
behavior, and we distill actionable design guidelines for safer, interoperable,
and scalable agent economies.
๐
2025-11-05
๐ ArXiv Preprint (cs.SE)
๐ฅ Shihai Wang, Tao Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Elicited performance requirements need to be quantified for compliance in
different engineering tasks, e.g., configuration tuning and performance
testing. Much existing work has relied on manual quantification, which is
expensive and error-prone due to the imprecision. In this paper, we present
LQPR, a highly efficient automatic approach for performance requirements
quantification.LQPR relies on a new theoretical framework that converts
quantification as a classification problem. Despite the prevalent applications
of Large Language Models (LLMs) for requirement analytics, LQPR takes a
different perspective to address the classification: we observed that
performance requirements can exhibit strong patterns and are often
short/concise, therefore we design a lightweight linguistically induced
matching mechanism. We compare LQPR against nine state-of-the-art
learning-based approaches over diverse datasets, demonstrating that it is
ranked as the sole best for 75% or more cases with two orders less cost. Our
work proves that, at least for performance requirement quantification,
specialized methods can be more suitable than the general LLM-driven
approaches.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alexandros Vazaios, Jannis Brugger, Cedric Derstroff, Kristian Kersting, Mira Mezini
Machine Learning
ๆฅ็ๆ่ฆ
AlphaZero-like Monte Carlo Tree Search systems, originally introduced for
two-player games, dynamically balance exploration and exploitation using neural
network guidance. This combination makes them also suitable for classical
search problems. However, the original method of training the network with
simulation results is limited in sparse reward settings, especially in the
early stages, where the network cannot yet give guidance. Hindsight Experience
Replay (HER) addresses this issue by relabeling unsuccessful trajectories from
the search tree as supervised learning signals. We introduce Adaptable HER
(\ours{}), a flexible framework that integrates HER with AlphaZero, allowing
easy adjustments to HER properties such as relabeled goals, policy targets, and
trajectory selection. Our experiments, including equation discovery, show that
the possibility of modifying HER is beneficial and surpasses the performance of
pure supervised or reinforcement learning.
๐
2025-11-05
๐ ArXiv Preprint (eess.IV)
๐ฅ Syed Muqeem Mahmood, Hassan Mohy-ud-Din
Natural Language Processing
ๆฅ็ๆ่ฆ
We present a framework that combines Large Language Models with computational
image analytics for non-invasive, zero-shot prediction of IDH mutation status
in brain gliomas. For each subject, coregistered multi-parametric MRI scans and
multi-class tumor segmentation maps were processed to extract interpretable
semantic (visual) attributes and quantitative features, serialized in a
standardized JSON file, and used to query GPT 4o and GPT 5 without fine-tuning.
We evaluated this framework on six publicly available datasets (N = 1427) and
results showcased high accuracy and balanced classification performance across
heterogeneous cohorts, even in the absence of manual annotations. GPT 5
outperformed GPT 4o in context-driven phenotype interpretation. Volumetric
features emerged as the most important predictors, supplemented by
subtype-specific imaging markers and clinical information. Our results
demonstrate the potential of integrating LLM-based reasoning with computational
image analytics for precise, non-invasive tumor genotyping, advancing
diagnostic strategies in neuro-oncology. The code is available at
https://github.com/ATPLab-LUMS/CIM-LLM.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gahyeon Kim, Sohee Kim, Seokju Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent advances in large-scale vision and language models have led to
significant progress in zero-shot learning tasks. Methods such as CoOp and
CoCoOp have shown that replacing handcrafted prompts with learnable vectors,
known as prompt learning, can result in improved performance. However, these
models often struggle to generalize to entirely unseen categories. While
traditional zero-shot learning techniques benefit from various data
augmentation strategies, prompt learning has primarily focused on text-based
modifications, leaving the potential of image-based augmentation largely
unexplored. In this work, we explore how image-level augmentations,
particularly those that introduce attribute-specific variations, can support
and enhance prompt learning. Our analysis examines the interaction between
these augmentations and soft prompt frameworks, revealing their potential to
improve generalization. We also identify a limitation in existing methods, such
as CoCoOp, which do not provide explicit guidance for learning prompts that
focus on semantically meaningful visual features. To address this, we propose
Adding Attributes to Prompt Learning, AAPL, a novel method that introduces
adversarial token embeddings to decouple superficial visual variations
introduced by augmentation from class-relevant semantic representations. This
decoupling enables the learned prompts to concentrate on visually
discriminative features that align with the target categories. We conduct
comprehensive experiments on eleven benchmark datasets, and AAPL consistently
outperforms existing methods across few-shot, zero-shot, cross-dataset, and
domain generalization settings. Our source code is publicly available at:
https://github.com/Gahyeonkim09/AAPL
๐
2025-11-05
๐ ArXiv Preprint (eess.AS)
๐ฅ Gabriel Pirlogeanu, Alexandru-Lucian Georgescu, Horia Cucu
Natural Language Processing
ๆฅ็ๆ่ฆ
In this work, we present a new state-of-the-art Romanian Automatic Speech
Recognition (ASR) system based on NVIDIA's FastConformer architecture--explored
here for the first time in the context of Romanian. We train our model on a
large corpus of, mostly, weakly supervised transcriptions, totaling over 2,600
hours of speech. Leveraging a hybrid decoder with both Connectionist Temporal
Classification (CTC) and Token-Duration Transducer (TDT) branches, we evaluate
a range of decoding strategies including greedy, ALSD, and CTC beam search with
a 6-gram token-level language model. Our system achieves state-of-the-art
performance across all Romanian evaluation benchmarks, including read,
spontaneous, and domain-specific speech, with up to 27% relative WER reduction
compared to previous best-performing systems. In addition to improved
transcription accuracy, our approach demonstrates practical decoding
efficiency, making it suitable for both research and deployment in low-latency
ASR applications.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Riasad Alvi, Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Mohaimenul Azam Khan Raiaan et al.
ๆฅ็ๆ่ฆ
Generative artificial intelligence (GenAI) has become a transformative
approach in bioinformatics that often enables advancements in genomics,
proteomics, transcriptomics, structural biology, and drug discovery. To
systematically identify and evaluate these growing developments, this review
proposed six research questions (RQs), according to the preferred reporting
items for systematic reviews and meta-analysis methods. The objective is to
evaluate impactful GenAI strategies in methodological advancement, predictive
performance, and specialization, and to identify promising approaches for
advanced modeling, data-intensive discovery, and integrative biological
analysis. RQ1 highlights diverse applications across multiple bioinformatics
subfields (sequence analysis, molecular design, and integrative data modeling),
which demonstrate superior performance over traditional methods through pattern
recognition and output generation. RQ2 reveals that adapted specialized model
architectures outperformed general-purpose models, an advantage attributed to
targeted pretraining and context-aware strategies. RQ3 identifies significant
benefits in the bioinformatics domains, focusing on molecular analysis and data
integration, which improves accuracy and reduces errors in complex analysis.
RQ4 indicates improvements in structural modeling, functional prediction, and
synthetic data generation, validated by established benchmarks. RQ5 suggests
the main constraints, such as the lack of scalability and biases in data that
impact generalizability, and proposes future directions focused on robust
evaluation and biologically grounded modeling. RQ6 examines that molecular
datasets (such as UniProtKB and ProteinNet12), cellular datasets (such as
CELLxGENE and GTEx) and textual resources (such as PubMedQA and OMIM) broadly
support the training and generalization of GenAI models.
๐
2025-11-05
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Shenghua Wang, Zhen Yin
Machine Learning
ๆฅ็ๆ่ฆ
The rapid growth of open-access (OA) publications has intensified the
challenge of identifying relevant scientific papers. Due to privacy constraints
and limited access to user interaction data, recent efforts have shifted toward
content-based recommendation, which relies solely on textual information.
However, existing models typically treat papers as unstructured text,
neglecting their discourse organization and thereby limiting semantic
completeness and interpretability. To address these limitations, we propose
OMRC-MR, a hierarchical framework that integrates QA-style OMRC (Objective,
Method, Result, Conclusion) summarization, multi-level contrastive learning,
and structure-aware re-ranking for scholarly recommendation. The QA-style
summarization module converts raw papers into structured and
discourse-consistent representations, while multi-level contrastive objectives
align semantic representations across metadata, section, and document levels.
The final re-ranking stage further refines retrieval precision through
contextual similarity calibration. Experiments on DBLP, S2ORC, and the newly
constructed Sci-OMRC dataset demonstrate that OMRC-MR consistently surpasses
state-of-the-art baselines, achieving up to 7.2% and 3.8% improvements in
Precision@10 and Recall@10, respectively. Additional evaluations confirm that
QA-style summarization produces more coherent and factually complete
representations. Overall, OMRC-MR provides a unified and interpretable
content-based paradigm for scientific paper recommendation, advancing
trustworthy and privacy-aware scholarly information retrieval.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jindong Hong, Tianjie Chen, Lingjie Luo, Chuanyang Zheng, Ting Xu et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
A recent advancement in Multimodal Large Language Models (MLLMs) research is
the emergence of "reasoning MLLMs" that offer explicit control over their
internal thinking processes (normally referred as the "thinking mode")
alongside the standard "non-thinking mode". This capability allows these models
to engage in a step-by-step process of internal deliberation before generating
a final response. With the rapid transition to and adoption of these
"dual-state" MLLMs, this work rigorously evaluated how the enhanced reasoning
processes of these MLLMs impact model performance and reliability in clinical
tasks. This paper evaluates the active "thinking mode" capabilities of two
leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We
assessed their performance on four visual medical tasks using VQA-RAD and
ROCOv2 datasets. Our findings reveal that the improvement from activating the
thinking mode remains marginal compared to the standard non-thinking mode for
the majority of the tasks. Their performance on complex medical tasks such as
open-ended VQA and medical image interpretation remains suboptimal,
highlighting the need for domain-specific medical data and more advanced
methods for medical knowledge integration.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Felix Stรถrck, Fabian Hinder, Barbara Hammer
Machine Learning
ๆฅ็ๆ่ฆ
With the on-going integration of machine learning systems into the everyday
social life of millions the notion of fairness becomes an ever increasing
priority in their development. Fairness notions commonly rely on protected
attributes to assess potential biases. Here, the majority of literature focuses
on discrete setups regarding both target and protected attributes. The
literature on continuous attributes especially in conjunction with regression
-- we refer to this as \emph{continuous fairness} -- is scarce. A common
strategy is iterative null-space projection which as of now has only been
explored for linear models or embeddings such as obtained by a non-linear
encoder. We improve on this by generalizing to kernel methods, significantly
extending the scope. This yields a model and fairness-score agnostic method for
kernel embeddings applicable to continuous protected attributes. We demonstrate
that our novel approach in conjunction with Support Vector Regression (SVR)
provides competitive or improved performance across multiple datasets in
comparisons to other contemporary methods.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic evaluation of speech-to-text translation (ST) systems is typically
performed by comparing translation hypotheses with one or more reference
translations. While effective to some extent, this approach inherits the
limitation of reference-based evaluation that ignores valuable information from
the source input. In machine translation (MT), recent progress has shown that
neural metrics incorporating the source text achieve stronger correlation with
human judgments. Extending this idea to ST, however, is not trivial because the
source is audio rather than text, and reliable transcripts or alignments
between source and references are often unavailable. In this work, we conduct
the first systematic study of source-aware metrics for ST, with a particular
focus on real-world operating conditions where source transcripts are not
available. We explore two complementary strategies for generating textual
proxies of the input audio, automatic speech recognition (ASR) transcripts, and
back-translations of the reference translation, and introduce a novel two-step
cross-lingual re-segmentation algorithm to address the alignment mismatch
between synthetic sources and reference translations. Our experiments, carried
out on two ST benchmarks covering 79 language pairs and six ST systems with
diverse architectures and performance levels, show that ASR transcripts
constitute a more reliable synthetic source than back-translations when word
error rate is below 20%, while back-translations always represent a
computationally cheaper but still effective alternative. Furthermore, our
cross-lingual re-segmentation algorithm enables robust use of source-aware MT
metrics in ST evaluation, paving the way toward more accurate and principled
evaluation methodologies for speech translation.
๐
2025-11-05
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Xinyu Ning, Yan Zhuo, Xian Wang, Chan-In Devin Sio, Lik-Hang Lee
ๆฅ็ๆ่ฆ
With the continuous advancement of technology, the application of generative
artificial intelligence (AI) in various fields is gradually demonstrating great
potential, particularly when combined with Extended Reality (XR), creating
unprecedented possibilities. This survey article systematically reviews the
applications of generative AI in XR, covering as much relevant literature as
possible from 2023 to 2025. The application areas of generative AI in XR and
its key technology implementations are summarised through PRISMA screening and
analysis of the final 26 articles. The survey highlights existing articles from
the last three years related to how XR utilises generative AI, providing
insights into current trends and research gaps. We also explore potential
opportunities for future research to further empower XR through generative AI,
providing guidance and information for future generative XR research.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ranul Dayarathne, Uvini Ranaweera, Upeksha Ganegoda
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval Augmented Generation (RAG) is emerging as a powerful technique to
enhance the capabilities of Generative AI models by reducing hallucination.
Thus, the increasing prominence of RAG alongside Large Language Models (LLMs)
has sparked interest in comparing the performance of different LLMs in
question-answering (QA) in diverse domains. This study compares the performance
of four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat,
Falcon-7b-instruct and Orca-mini-v3-7b, and OpenAI's trending GPT-3.5 over QA
tasks within the computer science literature leveraging RAG support. Evaluation
metrics employed in the study include accuracy and precision for binary
questions and ranking by a human expert, ranking by Google's AI model Gemini,
alongside cosine similarity for long-answer questions. GPT-3.5, when paired
with RAG, effectively answers binary and long-answer questions, reaffirming its
status as an advanced LLM. Regarding open-source LLMs, Mistral AI's
Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary
and long-answer questions. However, among the open-source LLMs, Orca-mini-v3-7b
reports the shortest average latency in generating responses, whereas
LLaMa2-7b-chat by Meta reports the highest average latency. This research
underscores the fact that open-source LLMs, too, can go hand in hand with
proprietary models like GPT-3.5 with better infrastructure.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nikolina Tomic Roshni Bhatnagar, Sarthak Jain, Connor Lau, Tien-Yu Liu, Laura Gambini et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning (DL) has the potential to revolutionize image acquisition and
interpretation across medicine, however, attention to data imbalance and
missingness is required. Ultrasound data presents a particular challenge
because in addition to different views and structures, it includes several
sub-modalities-such as greyscale and color flow doppler (CFD)-that are often
imbalanced in clinical studies. Image translation can help balance datasets but
is challenging for ultrasound sub-modalities to date. Here, we present a
generative method for ultrasound CFD-greyscale video translation, trained on
54,975 videos and tested on 8,368. The method developed leveraged pixel-wise,
adversarial, and perceptual loses and utilized two networks: one for
reconstructing anatomic structures and one for denoising to achieve realistic
ultrasound imaging. Average pairwise SSIM between synthetic videos and ground
truth was 0.91+/-0.04. Synthetic videos performed indistinguishably from real
ones in DL classification and segmentation tasks and when evaluated by blinded
clinical experts: F1 score was 0.9 for real and 0.89 for synthetic videos; Dice
score between real and synthetic segmentation was 0.97. Overall clinician
accuracy in distinguishing real vs synthetic videos was 54+/-6% (42-61%),
indicating realistic synthetic videos. Although trained only on heart videos,
the model worked well on ultrasound spanning several clinical domains (average
SSIM 0.91+/-0.05), demonstrating foundational abilities. Together, these data
expand the utility of retrospectively collected imaging and augment the dataset
design toolbox for medical imaging.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhibin Wang, Zhixing Zhang, Shuqi Wang, Xuanting Xie, Zhao Kang
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have demonstrated impressive performance on
task-specific benchmarks, yet their ability to generalize across diverse
domains and tasks remains limited. Existing approaches often struggle with
negative transfer, scalability issues, and high adaptation costs. To address
these challenges, we propose GMoPE (Graph Mixture of Prompt-Experts), a novel
framework that seamlessly integrates the Mixture-of-Experts (MoE) architecture
with prompt-based learning for graphs. GMoPE leverages expert-specific prompt
vectors and structure-aware MoE routing to enable each expert to specialize in
distinct subdomains and dynamically contribute to predictions. To promote
diversity and prevent expert collapse, we introduce a soft orthogonality
constraint across prompt vectors, encouraging expert specialization and
facilitating a more balanced expert utilization. Additionally, we adopt a
prompt-only fine-tuning strategy that significantly reduces spatiotemporal
complexity during transfer. We validate GMoPE through extensive experiments
under various pretraining strategies and multiple downstream tasks. Results
show that GMoPE consistently outperforms state-of-the-art baselines and
achieves performance comparable to full parameter fine-tuning-while requiring
only a fraction of the adaptation overhead. Our work provides a principled and
scalable framework for advancing generalizable and efficient graph foundation
models.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yi-Fei Liu, Yi-Long Lu, Di He, Hang Zhang
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Psychological constructs within individuals are widely believed to be
interconnected. We investigated whether and how Large Language Models (LLMs)
can model the correlational structure of human psychological traits from
minimal quantitative inputs. We prompted various LLMs with Big Five Personality
Scale responses from 816 human individuals to role-play their responses on nine
other psychological scales. LLMs demonstrated remarkable accuracy in capturing
human psychological structure, with the inter-scale correlation patterns from
LLM-generated responses strongly aligning with those from human data $(R^2 >
0.89)$. This zero-shot performance substantially exceeded predictions based on
semantic similarity and approached the accuracy of machine learning algorithms
trained directly on the dataset. Analysis of reasoning traces revealed that
LLMs use a systematic two-stage process: First, they transform raw Big Five
responses into natural language personality summaries through information
selection and compression, analogous to generating sufficient statistics.
Second, they generate target scale responses based on reasoning from these
summaries. For information selection, LLMs identify the same key personality
factors as trained algorithms, though they fail to differentiate item
importance within factors. The resulting compressed summaries are not merely
redundant representations but capture synergistic information--adding them to
original scores enhances prediction alignment, suggesting they encode emergent,
second-order patterns of trait interplay. Our findings demonstrate that LLMs
can precisely predict individual participants' psychological traits from
minimal data through a process of abstraction and reasoning, offering both a
powerful tool for psychological simulation and valuable insights into their
emergent reasoning capabilities.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Alexander Htet Kyaw, Lenin Ravindranath Sivalingam
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We present a node-based storytelling system for multimodal content
generation. The system represents stories as graphs of nodes that can be
expanded, edited, and iteratively refined through direct user edits and
natural-language prompts. Each node can integrate text, images, audio, and
video, allowing creators to compose multimodal narratives. A task selection
agent routes between specialized generative tasks that handle story generation,
node structure reasoning, node diagram formatting, and context generation. The
interface supports targeted editing of individual nodes, automatic branching
for parallel storylines, and node-based iterative refinement. Our results
demonstrate that node-based editing supports control over narrative structure
and iterative generation of text, images, audio, and video. We report
quantitative outcomes on automatic story outline generation and qualitative
observations of editing workflows. Finally, we discuss current limitations such
as scalability to longer narratives and consistency across multiple nodes, and
outline future work toward human-in-the-loop and user-centered creative AI
tools.
๐
2025-11-05
๐ EMNLP
๐ฅ Shaghayegh Kolli, Richard Rosenbaum, Timo Cavelius, Lasse Strothe, Andrii Lata et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel in generating fluent utterances but can
lack reliable grounding in verified information. At the same time,
knowledge-graph-based fact-checkers deliver precise and interpretable evidence,
yet suffer from limited coverage or latency. By integrating LLMs with knowledge
graphs and real-time search agents, we introduce a hybrid fact-checking
approach that leverages the individual strengths of each component. Our system
comprises three autonomous steps: 1) a Knowledge Graph (KG) Retrieval for rapid
one-hop lookups in DBpedia, 2) an LM-based classification guided by a
task-specific labeling prompt, producing outputs with internal rule-based
logic, and 3) a Web Search Agent invoked only when KG coverage is insufficient.
Our pipeline achieves an F1 score of 0.93 on the FEVER benchmark on the
Supported/Refuted split without task-specific fine-tuning. To address Not
enough information cases, we conduct a targeted reannotation study showing that
our approach frequently uncovers valid evidence for claims originally labeled
as Not Enough Information (NEI), as confirmed by both expert annotators and LLM
reviewers. With this paper, we present a modular, opensource fact-checking
pipeline with fallback strategies and generalization across datasets.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wenchang Lei, Ping Zou, Yue Wang, Feng Sun, Lei Zhao
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) exhibit strong semantic understanding, yet
struggle when user instructions involve ambiguous or conceptually misaligned
terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity
by extracting meta-relations-inheritance, alias, and composition-from natural
language. The model further employs a reflection mechanism to validate these
meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, these
relations and related descriptions are dynamically supplied to the LLM,
improving its ability to interpret concepts and generate accurate responses.
Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely
on extended context windows, our method enables large language models to
process texts of any length without the need for truncation. Experiments on
standard benchmarks demonstrate that the LGM consistently outperforms existing
RAG baselines.
๐
2025-11-05
๐ ArXiv Preprint (cs.CY)
๐ฅ Henry Fraser, Zahra Stardust
ๆฅ็ๆ่ฆ
This paper examines the role of public interest litigation in promoting
accountability for AI and automated decision-making (ADM) in Australia. Since
ADM regulatio faces geopolitical headwinds, effective governance will have to
rely at least in part on the enforcement of existing laws. Drawing on
interviews with Australian public interest litigators, technology policy
activists, and technology law scholars, the paper positions public interest
litigation as part of a larger ecosystem for transparency, accountability and
justice with respect to ADM. It builds on one participants's characterisation
of litigation about ADM as an exercise in legal retrofitting: adapting old laws
to new circumstances. The paper's primary contribution is to aggregate,
organise and present original insights on pragmatic strategies and tactics for
effective public interest litigation about ADM. Naturally, it also contends
with the limits of these strategies, and of the legal system. Where limits are,
however, capable of being overcome, the paper presents findings on urgent
needs: the enabling institutional arrangements without which effective
litigation and accountability will falter. The paper is relevant to law and
technology scholars; individuals and groups harmed by ADM; public interest
litigators and technology lawyers; civil society and advocacy organisations;
and policymakers.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuei-Chun Kao, Hsu Tzu-Yin, Yunqi Hong, Ruochen Wang, Cho-Jui Hsieh
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recently, Multimodal Large Language Models (MLLMs) encounter two key issues
in multi-image contexts: (1) a lack of fine-grained perception across disparate
images, and (2) a diminished capability to effectively reason over and
synthesize information from multiple visual inputs. However, while various
prompting methods aim to describe visual content, many existing studies focus
primarily on single-image settings or specific, constrained scenarios. This
leaves a critical gap in understanding and addressing how MLLMs tackle more
general and complex multi-image reasoning tasks. Thus, we first extensively
investigate how current prompting methods perceive fine-grained visual details
and process visual information when dealing with multiple images. Our findings
reveal that existing prompting methods fall short in attending to needed clues
and seamlessly integrating perception and reasoning. Inspired by the findings,
we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions
(QG-CoC), a generalized prompting approach that effectively handles problems
with an arbitrary number of images. We evaluate our method on various
open-source and closed-source MLLMs for multi-image and single-image
benchmarks. Experimental results indicate that QG-CoC demonstrates competitive
performance across tasks and exhibits robust improvements in the challenging
scenarios where existing prompting methods fail.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hassan Wasswa, Hussein Abbass, Timothy Lynar
Machine Learning
ๆฅ็ๆ่ฆ
In an effort to counter the increasing IoT botnet-based attacks,
state-of-the-art deep learning methods have been proposed and have achieved
impressive detection accuracy. However, their computational intensity restricts
deployment on resource-constrained IoT devices, creating a critical need for
lightweight detection models. A common solution to this challenge is model
compression via quantization. This study proposes a VAE-MLP model framework
where an MLP-based classifier is trained on 8-dimensional latent vectors
derived from the high-dimensional train data using the encoder component of a
pretrained variational autoencoder (VAE). Two widely used quantization
strategies--Quantization-Aware Training (QAT) and Post-Training Quantization
(PTQ)--are then systematically evaluated in terms of their impact on detection
performance, storage efficiency, and inference latency using two benchmark IoT
botnet datasets--N-BaIoT and CICIoT2022. The results revealed that, with
respect to detection accuracy, the QAT strategy experienced a more noticeable
decline,whereas PTQ incurred only a marginal reduction compared to the original
unquantized model. Furthermore, PTQ yielded a 6x speedup and 21x reduction in
size, while QAT achieved a 3x speedup and 24x compression, demonstrating the
practicality of quantization for device-level IoT botnet detection.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingtao Zhang, Guoli Yang, Zhanxing Zhu, Mengzhu Wang, Xiaoying Bai
ๆฅ็ๆ่ฆ
Attention mechanisms have been extensively employed in various applications,
including time series modeling, owing to their capacity to capture intricate
dependencies; however, their utility is often constrained by quadratic
computational complexity, which impedes scalability for long sequences. In this
work, we propose a novel linear attention mechanism designed to overcome these
limitations. Our approach is grounded in a theoretical demonstration that
entropy, as a strictly concave function on the probability simplex, implies
that distributions with aligned probability rankings and similar entropy values
exhibit structural resemblance. Building on this insight, we develop an
efficient approximation algorithm that computes the entropy of
dot-product-derived distributions with only linear complexity, enabling the
implementation of a linear attention mechanism based on entropy equality.
Through rigorous analysis, we reveal that the effectiveness of attention in
spatio-temporal time series modeling may not primarily stem from the
non-linearity of softmax but rather from the attainment of a moderate and
well-balanced weight distribution. Extensive experiments on four
spatio-temporal datasets validate our method, demonstrating competitive or
superior forecasting performance while achieving substantial reductions in both
memory usage and computational time.
๐
2025-11-05
๐ VLDB 2025
๐ฅ Yiru Chen, Sally Fang, Sai Sree Harsha, Dan Luo, Vaishnavi Muppala et al.
ๆฅ็ๆ่ฆ
Generative AI assistants offer significant potential to enhance productivity,
streamline information access, and improve user experience in enterprise
contexts. In this work, we present Summit Concierge, a domain-specific AI
assistant developed for Adobe Summit. The assistant handles a wide range of
event-related queries and operates under real-world constraints such as data
sparsity, quality assurance, and rapid deployment. To address these challenges,
we adopt a human-in-the-loop development workflow that combines prompt
engineering, retrieval grounding, and lightweight human validation. We describe
the system architecture, development process, and real-world deployment
outcomes. Our experience shows that agile, feedback-driven development enables
scalable and reliable AI assistants, even in cold-start scenarios.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Varun Kumar, George Em Karniadakis
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The engineering design process often demands expertise from multiple domains,
leading to complex collaborations and iterative refinements. Traditional
methods can be resource-intensive and prone to inefficiencies. To address this,
we formalize the engineering design process through a multi-agent AI framework
that integrates structured design and review loops. The framework introduces
specialized knowledge-driven agents that collaborate to generate and refine
design candidates. As an exemplar, we demonstrate its application to the
aerodynamic optimization of 4-digit NACA airfoils. The framework consists of
three key AI agents: a Graph Ontologist, a Design Engineer, and a Systems
Engineer. The Graph Ontologist employs a Large Language Model (LLM) to
construct two domain-specific knowledge graphs from airfoil design literature.
The Systems Engineer, informed by a human manager, formulates technical
requirements that guide design generation and evaluation. The Design Engineer
leverages the design knowledge graph and computational tools to propose
candidate airfoils meeting these requirements. The Systems Engineer reviews and
provides feedback both qualitative and quantitative using its own knowledge
graph, forming an iterative feedback loop until a design is validated by the
manager. The final design is then optimized to maximize performance metrics
such as the lift-to-drag ratio. Overall, this work demonstrates how
collaborative AI agents equipped with structured knowledge representations can
enhance efficiency, consistency, and quality in the engineering design process.
๐
2025-11-05
๐ ArXiv Preprint (astro-ph.EP)
๐ฅ Arsalan Muhammad, Wasiu Akande Ahmed, Omada Friday Ojonugwa, Paul Puspendu Biswas
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The rapid growth of cislunar activities, including lunar landings, the Lunar
Gateway, and in-space refueling stations, requires advances in cost-efficient
trajectory design and reliable integration of navigation and remote sensing.
Traditional Earth-Moon transfers suffer from rigid launch windows and high
propellant demands, while Earth-based GNSS systems provide little to no
coverage beyond geostationary orbit. This limits autonomy and environmental
awareness in cislunar space. This review compares four major transfer
strategies by evaluating velocity requirements, flight durations, and fuel
efficiency, and by identifying their suitability for both crewed and robotic
missions. The emerging role of artificial intelligence and machine learning is
highlighted: convolutional neural networks support automated crater recognition
and digital terrain model generation, while deep reinforcement learning enables
adaptive trajectory refinement during descent and landing to reduce risk and
decision latency. The study also examines how GNSS-Reflectometry and advanced
Positioning, Navigation, and Timing architectures can extend navigation
capabilities beyond current limits. GNSS-R can act as a bistatic radar for
mapping lunar ice, soil properties, and surface topography, while PNT systems
support autonomous rendezvous, Lagrange point station-keeping, and coordinated
satellite swarm operations. Combining these developments establishes a scalable
framework for sustainable cislunar exploration and long-term human and robotic
presence.
๐
2025-11-05
๐ ArXiv Preprint (cs.CE)
๐ฅ Hajung Kim, Jueon Park, Junseok Choe, Sheunheun Baek, Hyeon Hwang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Quantitative structure-activity relationship assumes a smooth relationship
between molecular structure and biological activity. However, activity cliffs
defined as pairs of structurally similar compounds with large potency
differences break this continuity. Recent benchmarks targeting activity cliffs
have revealed that classical machine learning models with extended connectivity
fingerprints outperform graph neural networks. Our analysis shows that graph
embeddings fail to adequately separate structurally similar molecules in the
embedding space, making it difficult to distinguish between structurally
similar but functionally different molecules. Despite this limitation,
molecular graph structures are inherently expressive and attractive, as they
preserve molecular topology. To preserve the structural representation of
molecules as graphs, we propose a new model, GraphCliff, which integrates
short- and long-range information through a gating mechanism. Experimental
results demonstrate that GraphCliff consistently improves performance on both
non-cliff and cliff compounds. Furthermore, layer-wise node embedding analyses
reveal reduced over-smoothing and enhanced discriminative power relative to
strong baseline graph models.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xuanxiang Huang, Yacine Izza, Alexey Ignatiev, Joao Marques-Silva
ๆฅ็ๆ่ฆ
Formal explainable artificial intelligence (XAI) offers unique theoretical
guarantees of rigor when compared to other non-formal methods of
explainability. However, little attention has been given to the validation of
practical implementations of formal explainers. This paper develops a novel
methodology for validating formal explainers and reports on the assessment of
the publicly available formal explainer PyXAI. The paper documents the
existence of incorrect explanations computed by PyXAI on most of the datasets
analyzed in the experiments, thereby confirming the importance of the proposed
novel methodology for the validation of formal explainers.
๐
2025-11-05
๐ ArXiv Preprint (cs.SE)
๐ฅ Khouloud Oueslati, Maxime Lamothe, Foutse Khomh
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have substantially influenced various software
engineering tasks. Indeed, in the case of software refactoring, traditional
LLMs have shown the ability to reduce development time and enhance code
quality. However, these LLMs often rely on static, detailed instructions for
specific tasks. In contrast, LLM-based agents can dynamically adapt to evolving
contexts and autonomously make decisions by interacting with software tools and
executing workflows. In this paper, we explore the potential of LLM-based
agents in supporting refactoring activities. Specifically, we introduce
RefAgent, a multi-agent LLM-based framework for end-to-end software
refactoring. RefAgent consists of specialized agents responsible for planning,
executing, testing, and iteratively refining refactorings using self-reflection
and tool-calling capabilities. We evaluate RefAgent on eight open-source Java
projects, comparing its effectiveness against a single-agent approach, a
search-based refactoring tool, and historical developer refactorings. Our
assessment focuses on: (1) the impact of generated refactorings on software
quality, (2) the ability to identify refactoring opportunities, and (3) the
contribution of each LLM agent through an ablation study. Our results show that
RefAgent achieves a median unit test pass rate of 90%, reduces code smells by a
median of 52.5%, and improves key quality attributes (e.g., reusability) by a
median of 8.6%. Additionally, it closely aligns with developer refactorings and
the search-based tool in identifying refactoring opportunities, attaining a
median F1-score of 79.15% and 72.7%, respectively. Compared to single-agent
approaches, RefAgent improves the median unit test pass rate by 64.7% and the
median compilation success rate by 40.1%. These findings highlight the promise
of multi-agent architectures in advancing automated software refactoring.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Srishti Yadav, Jasmina Gajcin, Erik Miehling, Elizabeth Daly
Robotics
ๆฅ็ๆ่ฆ
Understanding how different stakeholders perceive risks in AI systems is
essential for their responsible deployment. This paper presents a framework for
stakeholder-grounded risk assessment by using LLMs, acting as judges to predict
and explain risks. Using the Risk Atlas Nexus and GloVE explanation method, our
framework generates stakeholder-specific, interpretable policies that shows how
different stakeholders agree or disagree about the same risks. We demonstrate
our method using three real-world AI use cases of medical AI, autonomous
vehicles, and fraud detection domain. We further propose an interactive
visualization that reveals how and why conflicts emerge across stakeholder
perspectives, enhancing transparency in conflict reasoning. Our results show
that stakeholder perspectives significantly influence risk perception and
conflict patterns. Our work emphasizes the importance of these
stakeholder-aware explanations needed to make LLM-based evaluations more
transparent, interpretable, and aligned with human-centered AI governance
goals.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Atif Hassan, Tarun Kumar, Ashish Mishra, Sergey Serebryakov, Satish Kumar Mopur et al.
ๆฅ็ๆ่ฆ
Forecasting anomalies (anomaly prediction) in multivariate time series from
different real-world, dynamic, and complex systems is vital for preempting
critical failures, leading to a substantial minimization in operational costs
and human labor. Yet, existing methods are limited to specific systems while
failing to generalize to evolving anomaly patterns over time. In contrast,
pretrained Time Series Foundation Models (TSFMs) have recently demonstrated
strong generalization and zero-shot forecasting capabilities. However, their
potential remains untapped for anomaly prediction, a task fundamentally
different from forecasting normal behavior. Thus, we present Forecast2Anomaly
(F2A), a novel framework that empowers TSFMs with anomaly prediction abilities
through two key innovations. First, we propose a joint forecast-anomaly loss
that fine-tunes TSFMs to accurately forecast future signals even at anomalous
time points. Second, we introduce a Retrieval-Augmented Generation (RAG) module
that retrieves historically relevant horizons and conditions predictions on
them. This component dynamically adapts to distributional shifts at inference
time, enabling F2A to track evolving anomalies without requiring model updates.
By combining targeted fine-tuning with dynamic retrieval, F2A bridges the gap
between robust TSFM zero-shot forecasting and zero-shot anomaly prediction.
Extensive experiments across 16 diverse datasets and multiple TSFM backbones
show that F2A consistently outperforms state-of-the-art methods, offering a
scalable, zero-shot anomaly prediction solution for real-world applications.
๐
2025-11-05
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Erfan Shayegani, Jina Suh, Andy Wilson, Nagu Rangan, Javier Hernandez
Natural Language Processing
ๆฅ็ๆ่ฆ
Empathy is a critical factor in fostering positive user experiences in
conversational AI. While models can display empathy, it is often generic rather
than tailored to specific tasks and contexts. In this work, we introduce a
novel framework for developing and evaluating context-specific empathetic large
language models (LLMs). We first analyze a real-world conversational dataset
consisting of 672 multi-turn conversations across 8 tasks, revealing
significant differences in terms of expected and experienced empathy before and
after the conversations, respectively. To help minimize this gap, we develop a
synthetic multi-turn conversational generation pipeline and steer responses
toward our defined empathy patterns based on the context that more closely
matches users' expectations. We then train empathetic expert adapters for
context-specific empathy that specialize in varying empathy levels based on the
recognized task. Our empirical results demonstrate a significant gap reduction
of 72.66% between perceived and desired empathy with scores increasing by an
average factor of 2.43 as measured by our metrics and reward models.
Additionally, our trained empathetic expert adapters demonstrate superior
effectiveness in preserving empathy patterns throughout conversation turns,
outperforming system prompts, which tend to dramatically diminish in impact as
conversations lengthen.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qi Li, Jianjun Xu, Pingtao Wei, Jiu Li, Peiqiang Zhao et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
With the widespread application of Large Language Models (LLMs), their
associated security issues have become increasingly prominent, severely
constraining their trustworthy deployment in critical domains. This paper
proposes a novel safety response framework designed to systematically safeguard
LLMs at both the input and output levels. At the input level, the framework
employs a supervised fine-tuning-based safety classification model. Through a
fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused
Attention), it performs precise risk identification and differentiated handling
of user queries, significantly enhancing risk coverage and business scenario
adaptability, and achieving a risk recall rate of 99.3%. At the output level,
the framework integrates Retrieval-Augmented Generation (RAG) with a
specifically fine-tuned interpretation model, ensuring all responses are
grounded in a real-time, trustworthy knowledge base. This approach eliminates
information fabrication and enables result traceability. Experimental results
demonstrate that our proposed safety control model achieves a significantly
higher safety score on public safety evaluation benchmarks compared to the
baseline model, TinyR1-Safety-8B. Furthermore, on our proprietary high-risk
test set, the framework's components attained a perfect 100% safety score,
validating their exceptional protective capabilities in complex risk scenarios.
This research provides an effective engineering pathway for building
high-security, high-trust LLM applications.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Shipeng Cen, Ying Tan
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
As optimization problems grow increasingly complex and diverse, advancements
in optimization techniques and paradigm innovations hold significant
importance. The challenges posed by optimization problems are primarily
manifested in their non-convexity, high-dimensionality, black-box nature, and
other unfavorable characteristics. Traditional zero-order or first-order
methods, which are often characterized by low efficiency, inaccurate gradient
information, and insufficient utilization of optimization information, are
ill-equipped to address these challenges effectively. In recent years, the
rapid development of large language models (LLM) has led to substantial
improvements in their language understanding and code generation capabilities.
Consequently, the design of optimization algorithms leveraging large language
models has garnered increasing attention from researchers. In this study, we
choose the fireworks algorithm(FWA) as the basic optimizer and propose a novel
approach to assist the design of the FWA by incorporating multi-modal large
language model(MLLM). To put it simply, we propose the concept of Critical
Part(CP), which extends FWA to complex high-dimensional tasks, and further
utilizes the information in the optimization process with the help of the
multi-modal characteristics of large language models. We focus on two specific
tasks: the \textit{traveling salesman problem }(TSP) and \textit{electronic
design automation problem} (EDA). The experimental results show that FWAs
generated under our new framework have achieved or surpassed SOTA results on
many problem instances.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thomas Manzini, Priyankari Perali, Robin R. Murphy
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
This paper presents the first AI/ML system for automating building damage
assessment in uncrewed aerial systems (sUAS) imagery to be deployed
operationally during federally declared disasters (Hurricanes Debby and
Helene). In response to major disasters, sUAS teams are dispatched to collect
imagery of the affected areas to assess damage; however, at recent disasters,
teams collectively delivered between 47GB and 369GB of imagery per day,
representing more imagery than can reasonably be transmitted or interpreted by
subject matter experts in the disaster scene, thus delaying response efforts.
To alleviate this data avalanche encountered in practice, computer vision and
machine learning techniques are necessary. While prior work has been deployed
to automatically assess damage in satellite imagery, there is no current state
of practice for sUAS-based damage assessment systems, as all known work has
been confined to academic settings. This work establishes the state of practice
via the development and deployment of models for building damage assessment
with sUAS imagery. The model development involved training on the largest known
dataset of post-disaster sUAS aerial imagery, containing 21,716 building damage
labels, and the operational training of 91 disaster practitioners. The best
performing model was deployed during the responses to Hurricanes Debby and
Helene, where it assessed a combined 415 buildings in approximately 18 minutes.
This work contributes documentation of the actual use of AI/ML for damage
assessment during a disaster and lessons learned to the benefit of the AI/ML
research and user communities.
๐
2025-11-05
๐ ArXiv Preprint (math.OC)
๐ฅ Harbir Antil, Rainald Lรถhner, Felipe Pรฉrez
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We propose a linear programming (LP) framework for steady-state diffusion and
flux optimization on geometric networks. The state variable satisfies a
discrete diffusion law on a weighted, oriented graph, where conductances are
scaled by edge lengths to preserve geometric fidelity. Boundary potentials act
as controls that drive interior fluxes according to a linear network Laplacian.
The optimization problem enforces physically meaningful sign and flux-cap
constraints at all boundary edges, derived directly from a gradient bound. This
yields a finite-dimensional LP whose feasible set is polyhedral, and whose
boundedness and solvability follow from simple geometric or algebraic
conditions on the network data.
We prove that under the absence of negative recession
directions--automatically satisfied in the presence of finite box bounds, flux
caps, or sign restrictions--the LP admits a global minimizer. Several
sufficient conditions guaranteeing boundedness of the feasible region are
identified, covering both full-rank and rank-deficient flux maps. The analysis
connects classical results such as the Minkowski--Weyl decomposition, Hoffman's
bound, and the fundamental theorem of linear programming with modern
network-based diffusion modeling.
Two large-scale examples illustrate the framework: (i) A typical large
stadium in a major modern city, which forms a single connected component with
relatively uniform corridor widths, and a (ii) A complex street network
emanating from a large, historical city center, which forms a multi-component
system.
๐
2025-11-05
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Seunghee Han, Yeonghun Kang, Taeun Bae, Varinia Bernales, Alan Aspuru-Guzik et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Designing materials with targeted properties remains challenging due to the
vastness of chemical space and the scarcity of property-labeled data. While
recent advances in generative models offer a promising way for inverse design,
most approaches require large datasets and must be retrained for every new
target property. Here, we introduce the EGMOF (Efficient Generation of MOFs), a
hybrid diffusion-transformer framework that overcomes these limitations through
a modular, descriptor-mediated workflow. EGMOF decomposes inverse design into
two steps: (1) a one-dimensional diffusion model (Prop2Desc) that maps desired
properties to chemically meaningful descriptors followed by (2) a transformer
model (Desc2MOF) that generates structures from these descriptors. This modular
hybrid design enables minimal retraining and maintains high accuracy even under
small-data conditions. On a hydrogen uptake dataset, EGMOF achieved over 95%
validity and 84% hit rate, representing significant improvements of up to 57%
in validity and 14% in hit rate compared to existing methods, while remaining
effective with only 1,000 training samples. Moreover, our model successfully
performed conditional generation across 29 diverse property datasets, including
CoREMOF, QMOF, and text-mined experimental datasets, whereas previous models
have not. This work presents a data-efficient, generalizable approach to the
inverse design of diverse MOFs and highlights the potential of modular inverse
design workflows for broader materials discovery.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuya Miyaoka, Masaki Inoue
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
This paper proposes a control-based framework for aligning large language
models (LLMs) by leveraging a control barrier function (CBF) to ensure
user-desirable text generation. The presented framework applies the CBF safety
filter to the predicted token generated from the baseline LLM, to intervene in
the generated text. The safety filter includes two significant advantages: this
safety filter is an add-on type, allowing it to be used for alignment purposes
without fine-tuning the baseline LLM, and if there is an evaluation model
regarding the desired alignment, it can be directly applied to the filter
design. The overall text-generation system is implemented with open-source
language models, aiming to generate positive text.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Botong. Zhao, Xubin. Wang, Shujing. Lyu, Yue. Lu
ๆฅ็ๆ่ฆ
Integrated circuit manufacturing is highly complex, comprising hundreds of
process steps. Defects can arise at any stage, causing yield loss and
ultimately degrading product reliability. Supervised methods require extensive
human annotation and struggle with emergent categories and rare, data scarce
defects. Clustering-based unsupervised methods often exhibit unstable
performance due to missing priors. We propose IC DefectNCD, a support set free
framework that leverages Image Intrinsic Priors in IC SEM images for defect
detection and novel class discovery. We first develop Self Normal Information
Guided IC Defect Detection, aggregating representative normal features via a
learnable normal information extractor and using reconstruction residuals to
coarsely localize defect regions. To handle saliency variations across defects,
we introduce an adaptive binarization strategy that produces stable subimages
focused on core defective areas. Finally, we design Self Defect Information
Guided IC Defect Classification, which incorporates a soft mask guided
attention mechanism to inject spatial defect priors into the teacher student
model. This enhances sensitivity to defective regions, suppresses background
interference, and enables recognition and classification of unseen defects. We
validate the approach on a real world dataset spanning three key fabrication
stages and covering 15 defect types. Experiments demonstrate robust performance
on both defect detection and unseen defect classification.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qi Zhang, Yifei Wang, Yisen Wang
Machine Learning
ๆฅ็ๆ่ฆ
Recently, self-supervised contrastive learning has achieved great success on
various tasks. However, its underlying working mechanism is yet unclear. In
this paper, we first provide the tightest bounds based on the widely adopted
assumption of conditional independence. Further, we relax the conditional
independence assumption to a more practical assumption of augmentation overlap
and derive the asymptotically closed bounds for the downstream performance. Our
proposed augmentation overlap theory hinges on the insight that the support of
different intra-class samples will become more overlapped under aggressive data
augmentations, thus simply aligning the positive samples (augmented views of
the same sample) could make contrastive learning cluster intra-class samples
together. Moreover, from the newly derived augmentation overlap perspective, we
develop an unsupervised metric for the representation evaluation of contrastive
learning, which aligns well with the downstream performance almost without
relying on additional modules. Code is available at
https://github.com/PKU-ML/GARC.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiameng Chen, Yida Xiong, Kun Li, Hongzhi Zhang, Xiantao Cai et al.
Machine Learning
ๆฅ็ๆ่ฆ
Computational antibody design holds immense promise for therapeutic
discovery, yet existing generative models are fundamentally limited by two core
challenges: (i) a lack of dynamical consistency, which yields physically
implausible structures, and (ii) poor generalization due to data scarcity and
structural bias. We introduce FP-AbDiff, the first antibody generator to
enforce Fokker-Planck Equation (FPE) physics along the entire generative
trajectory. Our method minimizes a novel FPE residual loss over the mixed
manifold of CDR geometries (R^3 x SO(3)), compelling locally-learned denoising
scores to assemble into a globally coherent probability flow. This
physics-informed regularizer is synergistically integrated with deep biological
priors within a state-of-the-art SE(3)-equivariant diffusion framework.
Rigorous evaluation on the RAbD benchmark confirms that FP-AbDiff establishes a
new state-of-the-art. In de novo CDR-H3 design, it achieves a mean Root Mean
Square Deviation of 0.99 {\AA} when superposing on the variable region, a 25%
improvement over the previous state-of-the-art model, AbX, and the highest
reported Contact Amino Acid Recovery of 39.91%. This superiority is underscored
in the more challenging six-CDR co-design task, where our model delivers
consistently superior geometric precision, cutting the average full-chain Root
Mean Square Deviation by ~15%, and crucially, achieves the highest full-chain
Amino Acid Recovery on the functionally dominant CDR-H3 loop (45.67%). By
aligning generative dynamics with physical laws, FP-AbDiff enhances robustness
and generalizability, establishing a principled approach for physically
faithful and functionally viable antibody design.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh
ๆฅ็ๆ่ฆ
We perform a thorough analysis of the formal and informal statements in the
miniF2F benchmark from the perspective of an AI system that is tasked to
participate in a math Olympiad consisting of the problems in miniF2F. In such
setting, the model has to read and comprehend the problems in natural language,
formalize them in Lean language, then proceed with proving the problems, and it
will get credit for each problem if the formal proof corresponds to the
original informal statement presented to the model. Our evaluation results
reveal that the best accuracy of such pipeline can be about 36% using the SoTA
models in the literature, considerably lower than the individual SoTA
accuracies, 97% and 69% reported in the autoformalization and theorem proving
literature. Analyzing the failure modes, we trace back a considerable portion
of this drop to discrepancies between the formal and informal statements for
more than half of the problems in miniF2F. We proceed with correcting all the
errors, discrepancies and simplifications in formal and informal statements,
and present the miniF2F-v2 with fully verified formal and informal statements
and proofs. Evaluating the full theorem proving pipeline on miniF2F-v2 leads to
the best accuracy of 70%, a significant improvement from the 40% on the
original miniF2F, yet indicating considerable misalignment between the
autoformalization models and theorem provers. Our deep analysis suggests that a
higher quality benchmark can help the community better evaluate progress in the
field of formal reasoning and also better diagnose the failure and success
modes of autoformalization and theorem proving models. Our dataset is available
at https://github.com/roozbeh-yz/miniF2F_v2.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Katherine C. Kellogg, Bingyang Ye, Yifan Hu, Guergana K. Savova, Byron Wallace et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
The rapid adoption of large language models (LLMs) in healthcare has been
accompanied by scrutiny of their oversight. Existing monitoring approaches,
inherited from traditional machine learning (ML), are task-based and founded on
assumed performance degradation arising from dataset drift. In contrast, with
LLMs, inevitable model degradation due to changes in populations compared to
the training dataset cannot be assumed, because LLMs were not trained for any
specific task in any given population. We therefore propose a new organizing
principle guiding generalist LLM monitoring that is scalable and grounded in
how these models are developed and used in practice: capability-based
monitoring. Capability-based monitoring is motivated by the fact that LLMs are
generalist systems whose overlapping internal capabilities are reused across
numerous downstream tasks. Instead of evaluating each downstream task
independently, this approach organizes monitoring around shared model
capabilities, such as summarization, reasoning, translation, or safety
guardrails, in order to enable cross-task detection of systemic weaknesses,
long-tail errors, and emergent behaviors that task-based monitoring may miss.
We describe considerations for developers, organizational leaders, and
professional societies for implementing a capability-based monitoring approach.
Ultimately, capability-based monitoring will provide a scalable foundation for
safe, adaptive, and collaborative monitoring of LLMs and future generalist
artificial intelligence models in healthcare.
๐
2025-11-05
๐ ArXiv Preprint (cs.SE)
๐ฅ Rafael Josรฉ Moura, Maria Gizele Nascimento, Fumio Machida, Ermeson Andrade
ๆฅ็ๆ่ฆ
Software aging is a phenomenon that affects long-running systems, leading to
progressive performance degradation and increasing the risk of failures. To
mitigate this problem, this work proposes an adaptive approach based on machine
learning for software aging detection in environments subject to dynamic
workload conditions. We evaluate and compare a static model with adaptive
models that incorporate adaptive detectors, specifically the Drift Detection
Method (DDM) and Adaptive Windowing (ADWIN), originally developed for concept
drift scenarios and applied in this work to handle workload shifts. Experiments
with simulated sudden, gradual, and recurring workload transitions show that
static models suffer a notable performance drop when applied to unseen workload
profiles, whereas the adaptive model with ADWIN maintains high accuracy,
achieving an F1-Score above 0.93 in all analyzed scenarios.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Saad Mankarious, Ayah Zirikly
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Mental health disorders affect millions worldwide, yet early detection
remains a major challenge, particularly for Arabic-speaking populations where
resources are limited and mental health discourse is often discouraged due to
cultural stigma. While substantial research has focused on English-language
mental health detection, Arabic remains significantly underexplored, partly due
to the scarcity of annotated datasets. We present CARMA, the first
automatically annotated large-scale dataset of Arabic Reddit posts. The dataset
encompasses six mental health conditions, such as Anxiety, Autism, and
Depression, and a control group. CARMA surpasses existing resources in both
scale and diversity. We conduct qualitative and quantitative analyses of
lexical and semantic differences between users, providing insights into the
linguistic markers of specific mental health conditions. To demonstrate the
dataset's potential for further mental health analysis, we perform
classification experiments using a range of models, from shallow classifiers to
large language models. Our results highlight the promise of advancing mental
health detection in underrepresented languages such as Arabic.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hao Xiang Li, Michael Amir, Amanda Prorok
Machine Learning
ๆฅ็ๆ่ฆ
The agent-environment co-design paradigm jointly optimises agent policies and
environment configurations in search of improved system performance. With
application domains ranging from warehouse logistics to windfarm management,
co-design promises to fundamentally change how we deploy multi-agent systems.
However, current co-design methods struggle to scale. They collapse under
high-dimensional environment design spaces and suffer from sample inefficiency
when addressing moving targets inherent to joint optimisation. We address these
challenges by developing Diffusion Co-Design (DiCoDe), a scalable and
sample-efficient co-design framework pushing co-design towards practically
relevant settings. DiCoDe incorporates two core innovations. First, we
introduce Projected Universal Guidance (PUG), a sampling technique that enables
DiCoDe to explore a distribution of reward-maximising environments while
satisfying hard constraints such as spatial separation between obstacles.
Second, we devise a critic distillation mechanism to share knowledge from the
reinforcement learning critic, ensuring that the guided diffusion model adapts
to evolving agent policies using a dense and up-to-date learning signal.
Together, these improvements lead to superior environment-policy pairs when
validated on challenging multi-agent environment co-design benchmarks including
warehouse automation, multi-agent pathfinding and wind farm optimisation. Our
method consistently exceeds the state-of-the-art, achieving, for example, 39%
higher rewards in the warehouse setting with 66% fewer simulation samples. This
sets a new standard in agent-environment co-design, and is a stepping stone
towards reaping the rewards of co-design in real world domains.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gaia Grosso, Sai Sumedh R. Hindupur, Thomas Fel, Samuel Bright-Thonney, Philip Harris et al.
Robotics
ๆฅ็ๆ่ฆ
Modern artificial intelligence has revolutionized our ability to extract rich
and versatile data representations across scientific disciplines. Yet, the
statistical properties of these representations remain poorly controlled,
causing misspecified anomaly detection (AD) methods to falter. Weak or rare
signals can remain hidden within the apparent regularity of normal data,
creating a gap in our ability to detect and interpret anomalies. We examine
this gap and identify a set of structural desiderata for detection methods
operating under minimal prior information: sparsity, to enforce parsimony;
locality, to preserve geometric sensitivity; and competition, to promote
efficient allocation of model capacity. These principles define a class of
self-organizing local kernels that adaptively partition the representation
space around regions of statistical imbalance. As an instantiation of these
principles, we introduce SparKer, a sparse ensemble of Gaussian kernels trained
within a semi-supervised Neyman--Pearson framework to locally model the
likelihood ratio between a sample that may contain anomalies and a nominal,
anomaly-free reference. We provide theoretical insights into the mechanisms
that drive detection and self-organization in the proposed model, and
demonstrate the effectiveness of this approach on realistic high-dimensional
problems of scientific discovery, open-world novelty detection, intrusion
detection, and generative-model validation. Our applications span both the
natural- and computer-science domains. We demonstrate that ensembles containing
only a handful of kernels can identify statistically significant anomalous
locations within representation spaces of thousands of dimensions, underscoring
both the interpretability, efficiency and scalability of the proposed approach.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Haggstrom et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+
context length support have resulted in increasing demands for on-chip memory
to support large KV caches. Techniques such as StreamingLLM and SnapKV
demonstrate how to control KV cache size while maintaining model accuracy. Yet,
these techniques are not commonly used within industrial deployments using
frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static
graphs and continuous batching methodology employed by these frameworks make it
difficult to admit modifications to the standard multi-head attention
algorithm, while on the other hand, the accuracy implications of such
techniques on modern instruction-following and reasoning models are not well
understood, obfuscating the need for implementing these techniques. In this
paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and
DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be
deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way
tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators
running at 128k context length and up to 1832 tokens per second in a real
production setting. SnapStream enables $4\times$ improved on-chip memory usage
and introduces minimal accuracy degradation on LongBench-v2, AIME24 and
LiveCodeBench. To the best of our knowledge, this is the first implementation
of sparse KV attention techniques deployed in a production inference system
with static graphs and continuous batching.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi
Natural Language Processing
ๆฅ็ๆ่ฆ
Effective explanations of video action recognition models should disentangle
how movements unfold over time from the surrounding spatial context. However,
existing methods based on saliency produce entangled explanations, making it
unclear whether predictions rely on motion or spatial context. Language-based
approaches offer structure but often fail to explain motions due to their tacit
nature -- intuitively understood but difficult to verbalize. To address these
challenges, we propose Disentangled Action aNd Context concept-based
Explainable (DANCE) video action recognition, a framework that predicts actions
through disentangled concept types: motion dynamics, objects, and scenes. We
define motion dynamics concepts as human pose sequences. We employ a large
language model to automatically extract object and scene concepts. Built on an
ante-hoc concept bottleneck design, DANCE enforces prediction through these
concepts. Experiments on four datasets -- KTH, Penn Action, HAA500, and UCF-101
-- demonstrate that DANCE significantly improves explanation clarity with
competitive performance. We validate the superior interpretability of DANCE
through a user study. Experimental results also show that DANCE is beneficial
for model debugging, editing, and failure analysis.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Dongkeun Kim, Minsu Cho, Suha Kwak
ๆฅ็ๆ่ฆ
Social interactions often emerge from subtle, fine-grained cues such as
facial expressions, gaze, and gestures. However, existing methods for social
interaction detection overlook such nuanced cues and primarily rely on holistic
representations of individuals. Moreover, they directly detect social groups
without explicitly modeling the underlying interactions between individuals.
These drawbacks limit their ability to capture localized social signals and
introduce ambiguity when group configurations should be inferred from social
interactions grounded in nuanced cues. In this work, we propose a part-aware
bottom-up group reasoning framework for fine-grained social interaction
detection. The proposed method infers social groups and their interactions
using body part features and their interpersonal relations. Our model first
detects individuals and enhances their features using part-aware cues, and then
infers group configuration by associating individuals via similarity-based
reasoning, which considers not only spatial relations but also subtle social
cues that signal interactions, leading to more accurate group inference.
Experiments on the NVI dataset demonstrate that our method outperforms prior
methods, achieving the new state of the art.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mehdi Sefidgar Dilmaghani, Francis Fowley, Peter Corcoran
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a lightweight three-dimensional convolutional neural
network (3DCNN) for human activity recognition (HAR) using event-based vision
data. Privacy preservation is a key challenge in human monitoring systems, as
conventional frame-based cameras capture identifiable personal information. In
contrast, event cameras record only changes in pixel intensity, providing an
inherently privacy-preserving sensing modality. The proposed network
effectively models both spatial and temporal dynamics while maintaining a
compact design suitable for edge deployment. To address class imbalance and
enhance generalization, focal loss with class reweighting and targeted data
augmentation strategies are employed. The model is trained and evaluated on a
composite dataset derived from the Toyota Smart Home and ETRI datasets.
Experimental results demonstrate an F1-score of 0.9415 and an overall accuracy
of 94.17%, outperforming benchmark 3D-CNN architectures such as C3D, ResNet3D,
and MC3_18 by up to 3%. These results highlight the potential of event-based
deep learning for developing accurate, efficient, and privacy-aware human
action recognition systems suitable for real-world edge applications.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Andrei A. Korigodskii, Oleg D. Kalachev, Artem E. Vasiunik, Matvei V. Urvantsev, Georgii E. Bondar
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper presents the innovative design and successful deployment of a
pioneering autonomous unmanned aerial system developed for executing the
world's largest mural painted by a drone. Addressing the dual challenges of
maintaining artistic precision and operational reliability under adverse
outdoor conditions such as wind and direct sunlight, our work introduces a
robust system capable of navigating and painting outdoors with unprecedented
accuracy. Key to our approach is a novel navigation system that combines an
infrared (IR) motion capture camera and LiDAR technology, enabling precise
location tracking tailored specifically for largescale artistic applications.
We employ a unique control architecture that uses different regulation in
tangential and normal directions relative to the planned path, enabling precise
trajectory tracking and stable line rendering. We also present algorithms for
trajectory planning and path optimization, allowing for complex curve drawing
and area filling. The system includes a custom-designed paint spraying
mechanism, specifically engineered to function effectively amidst the turbulent
airflow generated by the drone's propellers, which also protects the drone's
critical components from paint-related damage, ensuring longevity and
consistent performance. Experimental results demonstrate the system's
robustness and precision in varied conditions, showcasing its potential for
autonomous large-scale art creation and expanding the functional applications
of robotics in creative fields.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vittal L. Rao
ๆฅ็ๆ่ฆ
Localisation tasks in biomedical data often require models to learn
meaningful spatial or temporal relationships from signals with complex
intensity distributions. A common strategy, exemplified by CoordConv layers, is
to append coordinate channels to convolutional inputs, enabling networks to
learn absolute positions. In this work, we propose a signal intensity-weighted
coordinate representation that replaces the pure coordinate channels with
channels scaled by local signal intensity. This modification embeds an
intensity-position coupling directly in the input representation, introducing a
simple and modality-agnostic inductive bias. We evaluate the approach on two
distinct localisation problems: (i) predicting the time of morphological
transition in 20-second, two-lead ECG signals, and (ii) regressing the
coordinates of nuclear centres in cytological images from the SiPaKMeD dataset.
In both cases, the proposed representation yields faster convergence and higher
generalisation performance relative to conventional coordinate-channel
approaches, demonstrating its effectiveness across both one-dimensional and
two-dimensional biomedical signals.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Romain Brรฉgier, Guรฉnolรฉ Fiche, Laura Bravo-Sรกnchez, Thomas Lucas, Matthieu Armando et al.
Robotics
ๆฅ็ๆ่ฆ
Parametric body models are central to many human-centric tasks, yet existing
models often rely on costly 3D scans and learned shape spaces that are
proprietary and demographically narrow. We introduce Anny, a simple, fully
differentiable, and scan-free human body model grounded in anthropometric
knowledge from the MakeHuman community. Anny defines a continuous,
interpretable shape space, where phenotype parameters (e.g. gender, age,
height, weight) control blendshapes spanning a wide range of human forms --
across ages (from infants to elders), body types, and proportions. Calibrated
using WHO population statistics, it provides realistic and demographically
grounded human shape variation within a single unified model. Thanks to its
openness and semantic control, Anny serves as a versatile foundation for 3D
human modeling -- supporting millimeter-accurate scan fitting, controlled
synthetic data generation, and Human Mesh Recovery (HMR). We further introduce
Anny-One, a collection of 800k photorealistic humans generated with Anny,
showing that despite its simplicity, HMR models trained with Anny can match the
performance of those trained with scan-based body models, while remaining
interpretable and broadly representative. The Anny body model and its code are
released under the Apache 2.0 license, making Anny an accessible foundation for
human-centric 3D modeling.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Hao Shi, Ze Wang, Shangwei Guo, Mengfei Duan, Song Wang et al.
ๆฅ็ๆ่ฆ
Robust 3D semantic occupancy is crucial for legged/humanoid robots, yet most
semantic scene completion (SSC) systems target wheeled platforms with
forward-facing sensors. We present OneOcc, a vision-only panoramic SSC
framework designed for gait-introduced body jitter and 360{\deg} continuity.
OneOcc combines: (i) Dual-Projection fusion (DP-ER) to exploit the annular
panorama and its equirectangular unfolding, preserving 360{\deg} continuity and
grid alignment; (ii) Bi-Grid Voxelization (BGV) to reason in Cartesian and
cylindrical-polar spaces, reducing discretization bias and sharpening
free/occupied boundaries; (iii) a lightweight decoder with Hierarchical AMoE-3D
for dynamic multi-scale fusion and better long-range/occlusion reasoning; and
(iv) plug-and-play Gait Displacement Compensation (GDC) learning feature-level
motion correction without extra sensors. We also release two panoramic
occupancy benchmarks: QuadOcc (real quadruped, first-person 360{\deg}) and
Human360Occ (H3O) (CARLA human-ego 360{\deg} with RGB, Depth, semantic
occupancy; standardized within-/cross-city splits). OneOcc sets new
state-of-the-art (SOTA): on QuadOcc it beats strong vision baselines and
popular LiDAR ones; on H3O it gains +3.83 mIoU (within-city) and +8.08
(cross-city). Modules are lightweight, enabling deployable full-surround
perception for legged/humanoid robots. Datasets and code will be publicly
available at https://github.com/MasterHow/OneOcc.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kevin Manogue, Tomasz M Schang, Dilara Kuล, Jonas Mรผller, Stefan Zachow et al.
ๆฅ็ๆ่ฆ
Reconstructing the surfaces of deformable objects from correspondences
between a 3D template and a 2D image is well studied under Shape-from-Template
(SfT) methods; however, existing approaches break down when topological changes
accompany the deformation. We propose a principled extension of SfT that
enables reconstruction in the presence of such changes. Our approach is
initialized with a classical SfT solution and iteratively adapts the template
by partitioning its spatial domain so as to minimize an energy functional that
jointly encodes physical plausibility and reprojection consistency. We
demonstrate that the method robustly captures a wide range of practically
relevant topological events including tears and cuts on bounded 2D surfaces,
thereby establishing the first general framework for topological-change-aware
SfT. Experiments on both synthetic and real data confirm that our approach
consistently outperforms baseline methods.
๐
2025-11-05
๐ ArXiv Preprint (eess.AS)
๐ฅ Jiyoung Lee, Song Park, Sanghyuk Chun, Soo-Whan Chung
Computer Vision
ๆฅ็ๆ่ฆ
This paper proposes VoxStudio, the first unified and end-to-end
speech-to-image model that generates expressive images directly from spoken
descriptions by jointly aligning linguistic and paralinguistic information. At
its core is a speech information bottleneck (SIB) module, which compresses raw
speech into compact semantic tokens, preserving prosody and emotional nuance.
By operating directly on these tokens, VoxStudio eliminates the need for an
additional speech-to-text system, which often ignores the hidden details beyond
text, e.g., tone or emotion. We also release VoxEmoset, a large-scale paired
emotional speech-image dataset built via an advanced TTS engine to affordably
generate richly expressive utterances. Comprehensive experiments on the
SpokenCOCO, Flickr8kAudio, and VoxEmoset benchmarks demonstrate the feasibility
of our method and highlight key challenges, including emotional consistency and
linguistic ambiguity, paving the way for future research.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nikolai Herrmann, Marcella C. Zijta, Stefan Klein, Rรฉgine P. M. Steegers-Theunissen, Rene M. H. Wijnen et al.
ๆฅ็ๆ่ฆ
Standardized alignment of the embryo in three-dimensional (3D) ultrasound
images aids prenatal growth monitoring by facilitating standard plane
detection, improving visualization of landmarks and accentuating differences
between different scans. In this work, we propose an automated method for
standardizing this alignment. Given a segmentation mask of the embryo,
Principal Component Analysis (PCA) is applied to the mask extracting the
embryo's principal axes, from which four candidate orientations are derived.
The candidate in standard orientation is selected using one of three
strategies: a heuristic based on Pearson's correlation assessing shape, image
matching to an atlas through normalized cross-correlation, and a Random Forest
classifier. We tested our method on 2166 images longitudinally acquired 3D
ultrasound scans from 1043 pregnancies from the Rotterdam Periconceptional
Cohort, ranging from 7+0 to 12+6 weeks of gestational age. In 99.0% of images,
PCA correctly extracted the principal axes of the embryo. The correct candidate
was selected by the Pearson Heuristic, Atlas-based and Random Forest in 97.4%,
95.8%, and 98.4% of images, respectively. A Majority Vote of these selection
methods resulted in an accuracy of 98.5%. The high accuracy of this pipeline
enables consistent embryonic alignment in the first trimester, enabling
scalable analysis in both clinical and research settings. The code is publicly
available at:
https://gitlab.com/radiology/prenatal-image-analysis/pca-3d-alignment.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gahyeon Kim, Sohee Kim, Seokju Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent advances in large-scale vision and language models have led to
significant progress in zero-shot learning tasks. Methods such as CoOp and
CoCoOp have shown that replacing handcrafted prompts with learnable vectors,
known as prompt learning, can result in improved performance. However, these
models often struggle to generalize to entirely unseen categories. While
traditional zero-shot learning techniques benefit from various data
augmentation strategies, prompt learning has primarily focused on text-based
modifications, leaving the potential of image-based augmentation largely
unexplored. In this work, we explore how image-level augmentations,
particularly those that introduce attribute-specific variations, can support
and enhance prompt learning. Our analysis examines the interaction between
these augmentations and soft prompt frameworks, revealing their potential to
improve generalization. We also identify a limitation in existing methods, such
as CoCoOp, which do not provide explicit guidance for learning prompts that
focus on semantically meaningful visual features. To address this, we propose
Adding Attributes to Prompt Learning, AAPL, a novel method that introduces
adversarial token embeddings to decouple superficial visual variations
introduced by augmentation from class-relevant semantic representations. This
decoupling enables the learned prompts to concentrate on visually
discriminative features that align with the target categories. We conduct
comprehensive experiments on eleven benchmark datasets, and AAPL consistently
outperforms existing methods across few-shot, zero-shot, cross-dataset, and
domain generalization settings. Our source code is publicly available at:
https://github.com/Gahyeonkim09/AAPL
๐
2025-11-05
๐ ArXiv Preprint (eess.IV)
๐ฅ Gabriela Fernandes
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Ovarian cancer remains one of the most lethal gynecological malignancies,
largely due to late diagnosis and extensive heterogeneity across subtypes.
Current diagnostic methods are limited in their ability to reveal underlying
genomic variations essential for precision oncology. This study introduces a
novel hybrid deep learning pipeline that integrates quantitative nuclear
morphometry with deep convolutional image features to perform ovarian cancer
subtype classification and gene mutation inference directly from Hematoxylin
and Eosin (H&E) histopathological images. Using $\sim45,000$ image patches
sourced from The Cancer Genome Atlas (TCGA) and public datasets, a fusion model
combining a ResNet-50 Convolutional Neural Network (CNN) encoder and a Vision
Transformer (ViT) was developed. This model successfully captured both local
morphological texture and global tissue context. The pipeline achieved a robust
overall subtype classification accuracy of $84.2\%$ (Macro AUC of $0.87 \pm
0.03$). Crucially, the model demonstrated the capacity for gene mutation
inference with moderate-to-high accuracy: $AUC_{TP53} = 0.82 \pm 0.02$,
$AUC_{BRCA1} = 0.76 \pm 0.04$, and $AUC_{ARID1A} = 0.73 \pm 0.05$. Feature
importance analysis established direct quantitative links, revealing that
nuclear solidity and eccentricity were the dominant predictors for TP53
mutation. These findings validate that quantifiable histological phenotypes
encode measurable genomic signals, paving the way for cost-effective, precision
histopathology in ovarian cancer triage and diagnosis.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Due to the lack of effective cross-modal modeling, existing open-source
audio-video generation methods often exhibit compromised lip synchronization
and insufficient semantic consistency. To mitigate these drawbacks, we propose
UniAVGen, a unified framework for joint audio and video generation. UniAVGen is
anchored in a dual-branch joint synthesis architecture, incorporating two
parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent
space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which
enables bidirectional, temporally aligned cross-attention, thus ensuring
precise spatiotemporal synchronization and semantic consistency. Furthermore,
this cross-modal interaction is augmented by a Face-Aware Modulation module,
which dynamically prioritizes salient regions in the interaction process. To
enhance generative fidelity during inference, we additionally introduce
Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly
amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint
synthesis design enables seamless unification of pivotal audio-video tasks
within a single model, such as joint audio-video generation and continuation,
video-to-audio dubbing, and audio-driven video synthesis. Comprehensive
experiments validate that, with far fewer training samples (1.3M vs. 30.1M),
UniAVGen delivers overall advantages in audio-video synchronization, timbre
consistency, and emotion consistency.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In this report, we present our solution to the MOT25-Spatiotemporal Action
Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately
localize and track multiple objects that match specific and free-form language
queries, using video data of complex real-world scenes as input. We model the
underlying task as a video retrieval problem and present a two-stage, zero-shot
approach, combining the advantages of the SOTA tracking model FastTracker and
Multi-modal Large Language Model LLaVA-Video. On the MOT25-StAG test set, our
method achieves m-HIoU and HOTA scores of 20.68 and 10.73 respectively, which
won second place in the challenge.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jindong Hong, Tianjie Chen, Lingjie Luo, Chuanyang Zheng, Ting Xu et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
A recent advancement in Multimodal Large Language Models (MLLMs) research is
the emergence of "reasoning MLLMs" that offer explicit control over their
internal thinking processes (normally referred as the "thinking mode")
alongside the standard "non-thinking mode". This capability allows these models
to engage in a step-by-step process of internal deliberation before generating
a final response. With the rapid transition to and adoption of these
"dual-state" MLLMs, this work rigorously evaluated how the enhanced reasoning
processes of these MLLMs impact model performance and reliability in clinical
tasks. This paper evaluates the active "thinking mode" capabilities of two
leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We
assessed their performance on four visual medical tasks using VQA-RAD and
ROCOv2 datasets. Our findings reveal that the improvement from activating the
thinking mode remains marginal compared to the standard non-thinking mode for
the majority of the tasks. Their performance on complex medical tasks such as
open-ended VQA and medical image interpretation remains suboptimal,
highlighting the need for domain-specific medical data and more advanced
methods for medical knowledge integration.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Video Question Answering (VideoQA) in the surgical domain aims to enhance
intraoperative understanding by enabling AI models to reason over temporally
coherent events rather than isolated frames. Current approaches are limited to
static image features, and available datasets often lack temporal annotations,
ignoring the dynamics critical for accurate procedural interpretation. We
propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from
static images to dynamic surgical scenes. It uses a Masked Video--Text Encoder
to fuse video and question features, capturing temporal cues such as motion and
tool--tissue interactions, which a fine-tuned large language model (LLM) then
decodes into coherent answers. To evaluate its performance, we curated
REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related
questions and diagnostic attributes, as well as out-of-template questions with
rephrased or semantically altered formulations to assess model robustness.
Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset
shows that SurgViVQA outperforms existing image-based VQA benchmark models,
particularly in keyword accuracy, improving over PitVQA by +11\% on
REAL-Colon-VQA and +9\% on EndoVis18-VQA. A perturbation study on the questions
further confirms improved generalizability and robustness to variations in
question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework
for temporally-aware understanding in surgical VideoQA, enabling AI models to
interpret dynamic procedural contexts more effectively. Code and dataset
available at https://github.com/madratak/SurgViVQA.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minghao Fu, Guo-Hua Wang, Tianyu Cui, Qing-Guo Chen, Zhao Xu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Text-to-image diffusion models deliver high-quality images, yet aligning them
with human preferences remains challenging. We revisit diffusion-based Direct
Preference Optimization (DPO) for these models and identify a critical
pathology: enlarging the preference margin does not necessarily improve
generation quality. In particular, the standard Diffusion-DPO objective can
increase the reconstruction error of both winner and loser branches.
Consequently, degradation of the less-preferred outputs can become sufficiently
severe that the preferred branch is also adversely affected even as the margin
grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule
that preserves the winner by adaptively scaling the loser gradient according to
its alignment with the winner gradient. A first-order analysis yields a
closed-form scaling coefficient that guarantees the error of the preferred
output is non-increasing at each optimization step. Our method is simple,
model-agnostic, broadly compatible with existing DPO-style alignment frameworks
and adds only marginal computational overhead. Across standard text-to-image
benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning
baselines on automated preference, aesthetic, and prompt alignment metrics.
Code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shuangquan Lyu, Steven Mao, Yue Ma
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Generating long videos remains a fundamental challenge, and achieving high
controllability in video inpainting and outpainting is particularly demanding.
To address both of these challenges simultaneously and achieve controllable
video inpainting and outpainting for long video clips, we introduce a novel and
unified approach for long video inpainting and outpainting that extends
text-to-video diffusion models to generate arbitrarily long, spatially edited
videos with high fidelity. Our method leverages LoRA to efficiently fine-tune a
large pre-trained video diffusion model like Alibaba's Wan 2.1 for masked
region video synthesis, and employs an overlap-and-blend temporal co-denoising
strategy with high-order solvers to maintain consistency across long sequences.
In contrast to prior work that struggles with fixed-length clips or exhibits
stitching artifacts, our system enables arbitrarily long video generation and
editing without noticeable seams or drift. We validate our approach on
challenging inpainting/outpainting tasks including editing or adding objects
over hundreds of frames and demonstrate superior performance to baseline
methods like Wan 2.1 model and VACE in terms of quality (PSNR/SSIM), and
perceptual realism (LPIPS). Our method enables practical long-range video
editing with minimal overhead, achieved a balance between parameter efficient
and superior performance.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bingyang Guo, Hongjie Li, Ruiyun Yu, Hanzhe Liang, Jinbao Wang
Machine Learning
ๆฅ็ๆ่ฆ
3D anomaly detection (3D-AD) plays a critical role in industrial
manufacturing, particularly in ensuring the reliability and safety of core
equipment components. Although existing 3D datasets like Real3D-AD and MVTec
3D-AD offer broad application support, they fall short in capturing the
complexities and subtle defects found in real industrial environments. This
limitation hampers precise anomaly detection research, especially for
industrial equipment components (IEC) such as bearings, rings, and bolts. To
address this challenge, we have developed a point cloud anomaly detection
dataset (IEC3D-AD) specific to real industrial scenarios. This dataset is
directly collected from actual production lines, ensuring high fidelity and
relevance. Compared to existing datasets, IEC3D-AD features significantly
improved point cloud resolution and defect annotation granularity, facilitating
more demanding anomaly detection tasks. Furthermore, inspired by generative
2D-AD methods, we introduce a novel 3D-AD paradigm (GMANet) on IEC3D-AD. This
paradigm generates synthetic point cloud samples based on geometric
morphological analysis, then reduces the margin and increases the overlap
between normal and abnormal point-level features through spatial discrepancy
optimization. Extensive experiments demonstrate the effectiveness of our method
on both IEC3D-AD and other datasets.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rong Wu, Yim-Sang Yu
Computer VisionMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Medical image segmentation has been significantly advanced by deep learning
architectures, notably U-Net variants. However, existing models struggle to
achieve efficient global context modeling and long-range dependency reasoning
under practical computational budgets simultaneously. In this work, we propose
a novel hybrid architecture utilizing U-Mamba with Heat Conduction Equation.
Our model combines Mamba-based state-space modules for efficient long-range
reasoning with Heat Conduction Operators (HCOs) in the bottleneck layers,
simulating frequency-domain thermal diffusion for enhanced semantic
abstraction. Experimental results on multimodal abdominal CT and MRI datasets
demonstrate that the proposed model consistently outperforms strong baselines,
validating its effectiveness and generalizability. It suggest that blending
state-space dynamics with heat-based global diffusion offers a scalable and
interpretable solution for medical segmentation tasks.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Jing Ma, Hanlin Li, Xiang Xiang
Machine Learning
ๆฅ็ๆ่ฆ
Entropy Minimization (EM) is beneficial to reducing class overlap, bridging
domain gap, and restricting uncertainty for various tasks in machine learning,
yet its potential is limited. To study the internal mechanism of EM, we
reformulate and decouple the classical EM into two parts with opposite effects:
cluster aggregation driving factor (CADF) rewards dominant classes and prompts
a peaked output distribution, while gradient mitigation calibrator (GMC)
penalizes high-confidence classes based on predicted probabilities.
Furthermore, we reveal the limitations of classical EM caused by its coupled
formulation: 1) reward collapse impedes the contribution of high-certainty
samples in the learning process, and 2) easy-class bias induces misalignment
between output distribution and label distribution. To address these issues, we
propose Adaptive Decoupled Entropy Minimization (AdaDEM), which normalizes the
reward brought from CADF and employs a marginal entropy calibrator (MEC) to
replace GMC. AdaDEM outperforms DEM*, an upper-bound variant of classical EM,
and achieves superior performance across various imperfectly supervised
learning tasks in noisy and dynamic environments.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nikolina Tomic Roshni Bhatnagar, Sarthak Jain, Connor Lau, Tien-Yu Liu, Laura Gambini et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning (DL) has the potential to revolutionize image acquisition and
interpretation across medicine, however, attention to data imbalance and
missingness is required. Ultrasound data presents a particular challenge
because in addition to different views and structures, it includes several
sub-modalities-such as greyscale and color flow doppler (CFD)-that are often
imbalanced in clinical studies. Image translation can help balance datasets but
is challenging for ultrasound sub-modalities to date. Here, we present a
generative method for ultrasound CFD-greyscale video translation, trained on
54,975 videos and tested on 8,368. The method developed leveraged pixel-wise,
adversarial, and perceptual loses and utilized two networks: one for
reconstructing anatomic structures and one for denoising to achieve realistic
ultrasound imaging. Average pairwise SSIM between synthetic videos and ground
truth was 0.91+/-0.04. Synthetic videos performed indistinguishably from real
ones in DL classification and segmentation tasks and when evaluated by blinded
clinical experts: F1 score was 0.9 for real and 0.89 for synthetic videos; Dice
score between real and synthetic segmentation was 0.97. Overall clinician
accuracy in distinguishing real vs synthetic videos was 54+/-6% (42-61%),
indicating realistic synthetic videos. Although trained only on heart videos,
the model worked well on ultrasound spanning several clinical domains (average
SSIM 0.91+/-0.05), demonstrating foundational abilities. Together, these data
expand the utility of retrospectively collected imaging and augment the dataset
design toolbox for medical imaging.
๐
2025-11-05
๐ ICCV 2025
๐ฅ Liwei Luo, Shuaitengyuan Li, Dongwei Ren, Qilong Wang, Pengfei Zhu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Recently, remarkable progress has been made in large-scale pre-trained model
tuning, and inference efficiency is becoming more crucial for practical
deployment. Early exiting in conjunction with multi-stage predictors, when
cooperated with a parameter-efficient fine-tuning strategy, offers a
straightforward way to achieve an inference-efficient model. However, a key
challenge remains unresolved: How can early stages provide low-level
fundamental features to deep stages while simultaneously supplying high-level
discriminative features to early-stage predictors? To address this problem, we
propose a Decoupled Multi-Predictor Optimization (DMPO) method to effectively
decouple the low-level representative ability and high-level discriminative
ability in early stages. First, in terms of architecture, we introduce a
lightweight bypass module into multi-stage predictors for functional
decomposition of shallow features from early stages, while a high-order
statistics-based predictor is developed for early stages to effectively enhance
their discriminative ability. To reasonably train our multi-predictor
architecture, a decoupled optimization is proposed to allocate two-phase loss
weights for multi-stage predictors during model tuning, where the initial
training phase enables the model to prioritize the acquisition of
discriminative ability of deep stages via emphasizing representative ability of
early stages, and the latter training phase drives discriminative ability
towards earlier stages as much as possible. As such, our DMPO can effectively
decouple representative and discriminative abilities in early stages in terms
of architecture design and model optimization. Experiments across various
datasets and pre-trained backbones demonstrate that DMPO clearly outperforms
its counterparts when reducing computational cost.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Philipp Reis, Philipp Rigoll, Christian Steinhauser, Jacob Langner, Eric Sax
Robotics
ๆฅ็ๆ่ฆ
Modern AI systems are increasingly constrained not by model capacity but by
the quality and diversity of their data. Despite growing emphasis on
data-centric AI, most datasets are still gathered in an open-loop manner which
accumulates redundant samples without feedback from the current coverage. This
results in inefficient storage, costly labeling, and limited generalization. To
address this, this paper introduces \ac{FCDC}, a paradigm that formulates data
collection as a closed-loop control problem. \ac{FCDC} continuously
approximates the state of the collected data distribution using an online
probabilistic model and adaptively regulates sample retention using based on
feedback signals such as likelihood and Mahalanobis distance. Through this
feedback mechanism, the system dynamically balances exploration and
exploitation, maintains dataset diversity, and prevents redundancy from
accumulating over time. Besides showcasing the controllability of \ac{FCDC} on
a synthetic dataset, experiments on a real data stream show that \ac{FCDC}
produces more balanced datasets by $\SI{25.9}{\percent}$ while reducing data
storage by $\SI{39.8}{\percent}$. These results demonstrate that data
collection itself can be actively controlled, transforming collection from a
passive pipeline stage into a self-regulating, feedback-driven process at the
core of data-centric AI.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sichen Guo, Wenjie Li, Yuanyang Liu, Guangwei Gao, Jian Yang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recently, Mamba-based super-resolution (SR) methods have demonstrated the
ability to capture global receptive fields with linear complexity, addressing
the quadratic computational cost of Transformer-based SR approaches. However,
existing Mamba-based methods lack fine-grained transitions across different
modeling scales, which limits the efficiency of feature representation. In this
paper, we propose T-PMambaSR, a lightweight SR framework that integrates
window-based self-attention with Progressive Mamba. By enabling interactions
among receptive fields of different scales, our method establishes a
fine-grained modeling paradigm that progressively enhances feature
representation with linear complexity. Furthermore, we introduce an Adaptive
High-Frequency Refinement Module (AHFRM) to recover high-frequency details lost
during Transformer and Mamba processing. Extensive experiments demonstrate that
T-PMambaSR progressively enhances the model's receptive field and
expressiveness, yielding better performance than recent Transformer- or
Mamba-based methods while incurring lower computational cost. Our codes will be
released after acceptance.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pengyu Jie, Wanquan Liu, Rui He, Yihui Wen, Deyu Meng et al.
Computer VisionMachine LearningRobotics
ๆฅ็ๆ่ฆ
Augmentation for dense prediction typically relies on either sample mixing or
generative synthesis. Mixing improves robustness but misaligned masks yield
soft label ambiguity. Diffusion synthesis increases apparent diversity but,
when trained as common samples, overlooks the structural benefit of mask
conditioning and introduces synthetic-real domain shift. We propose a paired,
diffusion-guided paradigm that fuses the strengths of both. For each real
image, a synthetic counterpart is generated under the same mask and the pair is
used as a controllable input for Mask-Consistent Paired Mixing (MCPMix), which
mixes only image appearance while supervision always uses the original hard
mask. This produces a continuous family of intermediate samples that smoothly
bridges synthetic and real appearances under shared geometry, enlarging
diversity without compromising pixel-level semantics. To keep learning aligned
with real data, Real-Anchored Learnable Annealing (RLA) adaptively adjusts the
mixing strength and the loss weight of mixed samples over training, gradually
re-anchoring optimization to real data and mitigating distributional bias.
Across Kvasir-SEG, PICCOLO, CVC-ClinicDB, a private NPC-LES cohort, and ISIC
2017, the approach achieves state-of-the-art segmentation performance and
consistent gains over baselines. The results show that combining
label-preserving mixing with diffusion-driven diversity, together with adaptive
re-anchoring, yields robust and generalizable endoscopic segmentation.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ruting Cheng, Boyuan Feng, Yijiang Zheng, Chuhui Qiu, Aizierjiang Aiersilan et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Accurately assessing the risk of cesarean section (CS) delivery is critical,
especially in settings with limited medical resources, where access to
healthcare is often restricted. Early and reliable risk prediction allows
better-informed prenatal care decisions and can improve maternal and neonatal
outcomes. However, most existing predictive models are tailored for in-hospital
use during labor and rely on parameters that are often unavailable in
resource-limited or home-based settings. In this study, we conduct a pilot
investigation to examine the feasibility of using 3D body shape for CS risk
assessment for future applications with more affordable general devices. We
propose a novel multi-view-based Transformer network, MvBody, which predicts CS
risk using only self-reported medical data and 3D optical body scans obtained
between the 31st and 38th weeks of gestation. To enhance training efficiency
and model generalizability in data-scarce environments, we incorporate a metric
learning loss into the network. Compared to widely used machine learning models
and the latest advanced 3D analysis methods, our method demonstrates superior
performance, achieving an accuracy of 84.62% and an Area Under the Receiver
Operating Characteristic Curve (AUC-ROC) of 0.724 on the independent test set.
To improve transparency and trust in the model's predictions, we apply the
Integrated Gradients algorithm to provide theoretically grounded explanations
of the model's decision-making process. Our results indicate that pre-pregnancy
weight, maternal age, obstetric history, previous CS history, and body shape,
particularly around the head and shoulders, are key contributors to CS risk
prediction.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuei-Chun Kao, Hsu Tzu-Yin, Yunqi Hong, Ruochen Wang, Cho-Jui Hsieh
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recently, Multimodal Large Language Models (MLLMs) encounter two key issues
in multi-image contexts: (1) a lack of fine-grained perception across disparate
images, and (2) a diminished capability to effectively reason over and
synthesize information from multiple visual inputs. However, while various
prompting methods aim to describe visual content, many existing studies focus
primarily on single-image settings or specific, constrained scenarios. This
leaves a critical gap in understanding and addressing how MLLMs tackle more
general and complex multi-image reasoning tasks. Thus, we first extensively
investigate how current prompting methods perceive fine-grained visual details
and process visual information when dealing with multiple images. Our findings
reveal that existing prompting methods fall short in attending to needed clues
and seamlessly integrating perception and reasoning. Inspired by the findings,
we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions
(QG-CoC), a generalized prompting approach that effectively handles problems
with an arbitrary number of images. We evaluate our method on various
open-source and closed-source MLLMs for multi-image and single-image
benchmarks. Experimental results indicate that QG-CoC demonstrates competitive
performance across tasks and exhibits robust improvements in the challenging
scenarios where existing prompting methods fail.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Maryam Alipourhajiagha, Pierre-Louis Lemaire, Youssef Diouane, Julie Carreau
ๆฅ็ๆ่ฆ
Climate models are limited by heavy computational costs, often producing
outputs at coarse spatial resolutions, while many climate change impact studies
require finer scales. Statistical downscaling bridges this gap, and we adapt
the probabilistic U-Net for this task, combining a deterministic U-Net backbone
with a variational latent space to capture aleatoric uncertainty. We evaluate
four training objectives, afCRPS and WMSE-MS-SSIM with three settings for
downscaling precipitation and temperature from $16\times$ coarser resolution.
Our main finding is that WMSE-MS-SSIM performs well for extremes under certain
settings, whereas afCRPS better captures spatial variability across scales.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Le Xue, Gang Feng, Wenbo Zhang, Yichi Zhang, Lanlan Li et al.
ๆฅ็ๆ่ฆ
Publicly available, large-scale medical imaging datasets are crucial for
developing and validating artificial intelligence models and conducting
retrospective clinical research. However, datasets that combine functional and
anatomical imaging with detailed clinical reports across multiple cancer types
remain scarce. Here, we present PETWB-REP, a curated dataset comprising
whole-body 18F-Fluorodeoxyglucose (FDG) Positron Emission Tomography/Computed
Tomography (PET/CT) scans and corresponding radiology reports from 490 patients
diagnosed with various malignancies. The dataset primarily includes common
cancers such as lung cancer, liver cancer, breast cancer, prostate cancer, and
ovarian cancer. This dataset includes paired PET and CT images, de-identified
textual reports, and structured clinical metadata. It is designed to support
research in medical imaging, radiomics, artificial intelligence, and
multi-modal learning.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shreyas C. Dhake, Jiayuan Huang, Runlong He, Danyal Z. Khan, Evangelos B. Mazomenos et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Anticipating forthcoming surgical events is vital for real-time assistance in
endonasal transsphenoidal pituitary surgery, where visibility is limited and
workflow changes rapidly. Most visual question answering (VQA) systems reason
on isolated frames with static vision language alignment, providing little
support for forecasting next steps or instrument needs. Existing surgical VQA
datasets likewise center on the current scene rather than the near future. We
introduce PitVQA-Anticipation, the first VQA dataset designed for forward
looking surgical reasoning. It comprises 33.5 hours of operative video and
734,769 question answer pairs built from temporally grouped clips and expert
annotations across four tasks: predicting the future phase, next step, upcoming
instrument, and remaining duration. We further propose SurgAnt-ViVQA, a video
language model that adapts a large language model using a GRU Gated Temporal
Cross-Attention module. A bidirectional GRU encodes frame to frame dynamics,
while an adaptive gate injects visual context into the language stream at the
token level. Parameter efficient fine tuning customizes the language backbone
to the surgical domain. SurgAnt-ViVQA tested upon on PitVQA-Anticipation and
EndoVis datasets, surpassing strong image and video based baselines. Ablations
show that temporal recurrence and gated fusion drive most of the gains. A frame
budget study indicates a trade-off: 8 frames maximize fluency, whereas 32
frames slightly reduce BLEU but improve numeric time estimation. By pairing a
temporally aware encoder with fine grained gated cross-attention, SurgAnt-ViVQA
advances surgical VQA from retrospective description to proactive anticipation.
PitVQA-Anticipation offers a comprehensive benchmark for this setting and
highlights the importance of targeted temporal modeling for reliable, future
aware surgical assistance.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yun-Chen Lin, Jiayuan Huang, Hanyuan Zhang, Sergi Kavtaradze, Matthew J. Clarkson et al.
ๆฅ็ๆ่ฆ
Accurate detection and delineation of anatomical structures in medical
imaging are critical for computer-assisted interventions, particularly in
laparoscopic liver surgery where 2D video streams limit depth perception and
complicate landmark localization. While recent works have leveraged monocular
depth cues for enhanced landmark detection, challenges remain in fusing RGB and
depth features and in efficiently adapting large-scale vision models to
surgical domains. We propose a depth-guided liver landmark segmentation
framework integrating semantic and geometric cues via vision foundation
encoders. We employ Segment Anything Model V2 (SAM2) encoder to extract RGB
features and Depth Anything V2 (DA2) encoder to extract depth-aware features.
To efficiently adapt SAM2, we introduce SRFT-GaLore, a novel low-rank gradient
projection method that replaces the computationally expensive SVD with a
Subsampled Randomized Fourier Transform (SRFT). This enables efficient
fine-tuning of high-dimensional attention layers without sacrificing
representational power. A cross-attention fusion module further integrates RGB
and depth cues. To assess cross-dataset generalization, we also construct a new
Laparoscopic Liver Surgical Dataset (LLSD) as an external validation benchmark.
On the public L3D dataset, our method achieves a 4.85% improvement in Dice
Similarity Coefficient and a 11.78-point reduction in Average Symmetric Surface
Distance compared to the D2GPLand. To further assess generalization capability,
we evaluate our model on LLSD dataset. Our model maintains competitive
performance and significantly outperforms SAM-based baselines, demonstrating
strong cross-dataset robustness and adaptability to unseen surgical
environments. These results demonstrate that our SRFT-GaLore-enhanced
dual-encoder framework enables scalable and precise segmentation under
real-time, depth-constrained surgical settings.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sagar Shrestha, Gopal Sharma, Luowei Zhou, Suren Kumar
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Personalizing text-to-image diffusion models has traditionally relied on
subject-specific fine-tuning approaches such as
DreamBooth~\cite{ruiz2023dreambooth}, which are computationally expensive and
slow at inference. Recent adapter- and encoder-based methods attempt to reduce
this overhead but still depend on additional fine-tuning or large backbone
models for satisfactory results. In this work, we revisit an orthogonal
direction: fine-tuning-free personalization via Hypernetworks that predict
LoRA-adapted weights directly from subject images. Prior hypernetwork-based
approaches, however, suffer from costly data generation or unstable attempts to
mimic base model optimization trajectories. We address these limitations with
an end-to-end training objective, stabilized by a simple output regularization,
yielding reliable and effective hypernetworks. Our method removes the need for
per-subject optimization at test time while preserving both subject fidelity
and prompt alignment. To further enhance compositional generalization at
inference time, we introduce Hybrid-Model Classifier-Free Guidance (HM-CFG),
which combines the compositional strengths of the base diffusion model with the
subject fidelity of personalized models during sampling. Extensive experiments
on CelebA-HQ, AFHQ-v2, and DreamBench demonstrate that our approach achieves
strong personalization performance and highlights the promise of hypernetworks
as a scalable and effective direction for open-category personalization.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Paria Mehrbod, Pedro Vianna, Geraldin Nanfack, Guy Wolf, Eugene Belilovsky
Machine Learning
ๆฅ็ๆ่ฆ
Domain adaptation is a key strategy for enhancing the generalizability of
deep learning models in real-world scenarios, where test distributions often
diverge significantly from the training domain. However, conventional
approaches typically rely on prior knowledge of the target domain or require
model retraining, limiting their practicality in dynamic or
resource-constrained environments. Recent test-time adaptation methods based on
batch normalization statistic updates allow for unsupervised adaptation, but
they often fail to capture complex activation distributions and are constrained
to specific normalization layers. We propose Adaptive Quantile Recalibration
(AQR), a test-time adaptation technique that modifies pre-activation
distributions by aligning quantiles on a channel-wise basis. AQR captures the
full shape of activation distributions and generalizes across architectures
employing BatchNorm, GroupNorm, or LayerNorm. To address the challenge of
estimating distribution tails under varying batch sizes, AQR incorporates a
robust tail calibration strategy that improves stability and precision. Our
method leverages source-domain statistics computed at training time, enabling
unsupervised adaptation without retraining models. Experiments on CIFAR-10-C,
CIFAR-100-C, and ImageNet-C across multiple architectures demonstrate that AQR
achieves robust adaptation across diverse settings, outperforming existing
test-time adaptation baselines. These results highlight AQR's potential for
deployment in real-world scenarios with dynamic and unpredictable data
distributions.
๐
2025-11-05
๐ ArXiv Preprint (cs.GR)
๐ฅ Haotian Yin, Przemyslaw Musialski
Machine Learning
ๆฅ็ๆ่ฆ
Neural signed distance functions (SDFs) have become a powerful representation
for geometric reconstruction from point clouds, yet they often require both
gradient- and curvature-based regularization to suppress spurious warp and
preserve structural fidelity. FlatCAD introduced the Off-Diagonal Weingarten
(ODW) loss as an efficient second-order prior for CAD surfaces, approximating
full-Hessian regularization at roughly half the computational cost. However,
FlatCAD applies a fixed ODW weight throughout training, which is suboptimal:
strong regularization stabilizes early optimization but suppresses detail
recovery in later stages. We present scheduling strategies for the ODW loss
that assign a high initial weight to stabilize optimization and progressively
decay it to permit fine-scale refinement. We investigate constant, linear,
quintic, and step interpolation schedules, as well as an increasing warm-up
variant. Experiments on the ABC CAD dataset demonstrate that time-varying
schedules consistently outperform fixed weights. Our method achieves up to a
35% improvement in Chamfer Distance over the FlatCAD baseline, establishing
scheduling as a simple yet effective extension of curvature regularization for
robust CAD reconstruction.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Thomas Manzini, Priyankari Perali, Robin R. Murphy
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
This paper presents the first AI/ML system for automating building damage
assessment in uncrewed aerial systems (sUAS) imagery to be deployed
operationally during federally declared disasters (Hurricanes Debby and
Helene). In response to major disasters, sUAS teams are dispatched to collect
imagery of the affected areas to assess damage; however, at recent disasters,
teams collectively delivered between 47GB and 369GB of imagery per day,
representing more imagery than can reasonably be transmitted or interpreted by
subject matter experts in the disaster scene, thus delaying response efforts.
To alleviate this data avalanche encountered in practice, computer vision and
machine learning techniques are necessary. While prior work has been deployed
to automatically assess damage in satellite imagery, there is no current state
of practice for sUAS-based damage assessment systems, as all known work has
been confined to academic settings. This work establishes the state of practice
via the development and deployment of models for building damage assessment
with sUAS imagery. The model development involved training on the largest known
dataset of post-disaster sUAS aerial imagery, containing 21,716 building damage
labels, and the operational training of 91 disaster practitioners. The best
performing model was deployed during the responses to Hurricanes Debby and
Helene, where it assessed a combined 415 buildings in approximately 18 minutes.
This work contributes documentation of the actual use of AI/ML for damage
assessment during a disaster and lessons learned to the benefit of the AI/ML
research and user communities.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongbo Lan, Zhenlin An, Haoyu Li, Vaibhav Singh, Longfei Shangguan
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper introduces \sysname, a system that accelerates vision-guided
physical property reasoning to enable augmented visual cognition. \sysname
minimizes the run-time latency of this reasoning pipeline through a combination
of both algorithmic and systematic optimizations, including rapid geometric 3D
reconstruction, efficient semantic feature fusion, and parallel view encoding.
Through these simple yet effective optimizations, \sysname reduces the
end-to-end latency of this reasoning pipeline from 10--20 minutes to less than
6 seconds. A head-to-head comparison on the ABO dataset shows that \sysname
achieves this 62.9$\times$--287.2$\times$ speedup while not only reaching
on-par (and sometimes slightly better) object-level physical property
estimation accuracy(e.g. mass), but also demonstrating superior performance in
material segmentation and voxel-level inference than two SOTA baselines. We
further combine gaze-tracking with \sysname to localize the object of interest
in cluttered, real-world environments, streamlining the physical property
reasoning on smart glasses. The case study with Meta Aria Glasses conducted at
an IKEA furniture store demonstrates that \sysname achives consistently high
performance compared to controlled captures, providing robust property
estimations even with fewer views in real-world scenarios.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Botong. Zhao, Xubin. Wang, Shujing. Lyu, Yue. Lu
ๆฅ็ๆ่ฆ
Integrated circuit manufacturing is highly complex, comprising hundreds of
process steps. Defects can arise at any stage, causing yield loss and
ultimately degrading product reliability. Supervised methods require extensive
human annotation and struggle with emergent categories and rare, data scarce
defects. Clustering-based unsupervised methods often exhibit unstable
performance due to missing priors. We propose IC DefectNCD, a support set free
framework that leverages Image Intrinsic Priors in IC SEM images for defect
detection and novel class discovery. We first develop Self Normal Information
Guided IC Defect Detection, aggregating representative normal features via a
learnable normal information extractor and using reconstruction residuals to
coarsely localize defect regions. To handle saliency variations across defects,
we introduce an adaptive binarization strategy that produces stable subimages
focused on core defective areas. Finally, we design Self Defect Information
Guided IC Defect Classification, which incorporates a soft mask guided
attention mechanism to inject spatial defect priors into the teacher student
model. This enhances sensitivity to defective regions, suppresses background
interference, and enables recognition and classification of unseen defects. We
validate the approach on a real world dataset spanning three key fabrication
stages and covering 15 defect types. Experiments demonstrate robust performance
on both defect detection and unseen defect classification.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiyi Miao, Taoyu Wu, Tong Chen, Sihao Li, Ji Jiang et al.
Computer Vision
ๆฅ็ๆ่ฆ
In orthodontic treatment, particularly within telemedicine contexts,
observing patients' dental occlusion from multiple viewpoints facilitates
timely clinical decision-making. Recent advances in 3D Gaussian Splatting
(3DGS) have shown strong potential in 3D reconstruction and novel view
synthesis. However, conventional 3DGS pipelines typically rely on densely
captured multi-view inputs and precisely initialized camera poses, limiting
their practicality. Orthodontic cases, in contrast, often comprise only three
sparse images, specifically, the anterior view and bilateral buccal views,
rendering the reconstruction task especially challenging. The extreme sparsity
of input views severely degrades reconstruction quality, while the absence of
camera pose information further complicates the process. To overcome these
limitations, we propose DentalSplat, an effective framework for 3D
reconstruction from sparse orthodontic imagery. Our method leverages a
prior-guided dense stereo reconstruction model to initialize the point cloud,
followed by a scale-adaptive pruning strategy to improve the training
efficiency and reconstruction quality of 3DGS. In scenarios with extremely
sparse viewpoints, we further incorporate optical flow as a geometric
constraint, coupled with gradient regularization, to enhance rendering
fidelity. We validate our approach on a large-scale dataset comprising 950
clinical cases and an additional video-based test set of 195 cases designed to
simulate real-world remote orthodontic imaging conditions. Experimental results
demonstrate that our method effectively handles sparse input scenarios and
achieves superior novel view synthesis quality for dental occlusion
visualization, outperforming state-of-the-art techniques.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Miftahur Rahman, Samuel Adebayo, Dorian A. Acevedo-Mejia, David Hester, Daniel McPolin et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
The Intermeshed Steel Connection (ISC) system, when paired with robotic
manipulators, can accelerate steel-frame assembly and improve worker safety by
eliminating manual assembly. Dependable perception is one of the initial stages
for ISC-aware robots. However, this is hampered by the absence of a dedicated
image corpus, as collecting photographs on active construction sites is
logistically difficult and raises safety and privacy concerns. In response, we
introduce ISC-Perception, the first hybrid dataset expressly designed for ISC
component detection. It blends procedurally rendered CAD images, game-engine
photorealistic scenes, and a limited, curated set of real photographs, enabling
fully automatic labelling of the synthetic portion. We explicitly account for
all human effort to produce the dataset, including simulation engine and scene
setup, asset preparation, post-processing scripts and quality checks; our total
human time to generate a 10,000-image dataset was 30.5,h versus 166.7,h for
manual labelling at 60,s per image (-81.7%). A manual pilot on a representative
image with five instances of ISC members took 60,s (maximum 80,s), anchoring
the manual baseline. Detectors trained on ISC-Perception achieved a mean
Average Precision at IoU 0.50 of 0.756, substantially surpassing models trained
on synthetic-only or photorealistic-only data. On a 1,200-frame bench test, we
report mAP@0.50/mAP@[0.50:0.95] of 0.943/0.823. By bridging the data gap for
construction-robotics perception, ISC-Perception facilitates rapid development
of custom object detectors and is freely available for research and industrial
use upon request.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi Gong, Xinyuan Zhang, Jichen Chai, Yichen Ding, Yifei Lou
ๆฅ็ๆ่ฆ
Cardiac contraction is a rapid, coordinated process that unfolds across
three-dimensional tissue on millisecond timescales. Traditional optical imaging
is often inadequate for capturing dynamic cellular structure in the beating
heart because of a fundamental trade-off between spatial and temporal
resolution. To overcome these limitations, we propose a high-performance
computational imaging framework that integrates Compressive Sensing (CS) with
Light-Sheet Microscopy (LSM) for efficient, low-phototoxic cardiac imaging. The
system performs compressed acquisition of fluorescence signals via random
binary mask coding using a Digital Micromirror Device (DMD). We propose a
Plug-and-Play (PnP) framework, solved using the alternating direction method of
multipliers (ADMM), which flexibly incorporates advanced denoisers, including
Tikhonov, Total Variation (TV), and BM3D. To preserve structural continuity in
dynamic imaging, we further introduce temporal regularization enforcing
smoothness between adjacent z-slices. Experimental results on zebrafish heart
imaging under high compression ratios demonstrate that the proposed method
successfully reconstructs cellular structures with excellent denoising
performance and image clarity, validating the effectiveness and robustness of
our algorithm in real-world high-speed, low-light biological imaging scenarios.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nan Li, Albert Gatt, Massimo Poesio
ๆฅ็ๆ่ฆ
Collaborative dialogue relies on participants incrementally establishing
common ground, yet in asymmetric settings they may believe they agree while
referring to different entities. We introduce a perspectivist annotation scheme
for the HCRC MapTask corpus (Anderson et al., 1991) that separately captures
speaker and addressee grounded interpretations for each reference expression,
enabling us to trace how understanding emerges, diverges, and repairs over
time. Using a scheme-constrained LLM annotation pipeline, we obtain 13k
annotated reference expressions with reliability estimates and analyze the
resulting understanding states. The results show that full misunderstandings
are rare once lexical variants are unified, but multiplicity discrepancies
systematically induce divergences, revealing how apparent grounding can mask
referential misalignment. Our framework provides both a resource and an
analytic lens for studying grounded misunderstanding and for evaluating
(V)LLMs' capacity to model perspective-dependent grounding in collaborative
dialogue.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Francesco Corso, Francesco Pierri, Gianmarco De Francisci Morales
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
In this paper, we investigate whether Large Language Models (LLMs) exhibit
conspiratorial tendencies, whether they display sociodemographic biases in this
domain, and how easily they can be conditioned into adopting conspiratorial
perspectives. Conspiracy beliefs play a central role in the spread of
misinformation and in shaping distrust toward institutions, making them a
critical testbed for evaluating the social fidelity of LLMs. LLMs are
increasingly used as proxies for studying human behavior, yet little is known
about whether they reproduce higher-order psychological constructs such as a
conspiratorial mindset. To bridge this research gap, we administer validated
psychometric surveys measuring conspiracy mindset to multiple models under
different prompting and conditioning strategies. Our findings reveal that LLMs
show partial agreement with elements of conspiracy belief, and conditioning
with socio-demographic attributes produces uneven effects, exposing latent
demographic biases. Moreover, targeted prompts can easily shift model responses
toward conspiratorial directions, underscoring both the susceptibility of LLMs
to manipulation and the potential risks of their deployment in sensitive
contexts. These results highlight the importance of critically evaluating the
psychological dimensions embedded in LLMs, both to advance computational social
science and to inform possible mitigation strategies against harmful uses.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jing Gao, Shutiao Luo, Yumeng Liu, Yuanming Li, Hongji Zeng
Natural Language Processing
ๆฅ็ๆ่ฆ
With the rapid advancement of natural language processing (NLP) technologies,
the demand for high-quality Chinese document question-answering datasets is
steadily growing. To address this issue, we present the Chinese Multi-Document
Question Answering Dataset(ChiMDQA), specifically designed for downstream
business scenarios across prevalent domains including academic, education,
finance, law, medical treatment, and news. ChiMDQA encompasses long-form
documents from six distinct fields, consisting of 6,068 rigorously curated,
high-quality question-answer (QA) pairs further classified into ten
fine-grained categories. Through meticulous document screening and a systematic
question-design methodology, the dataset guarantees both diversity and high
quality, rendering it applicable to various NLP tasks such as document
comprehension, knowledge extraction, and intelligent QA systems. Additionally,
this paper offers a comprehensive overview of the dataset's design objectives,
construction methodologies, and fine-grained evaluation system, supplying a
substantial foundation for future research and practical applications in
Chinese QA. The code and data are available at:
https://anonymous.4open.science/r/Foxit-CHiMDQA/.
๐
2025-11-05
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Thomas Souverain
Natural Language Processing
ๆฅ็ๆ่ฆ
To foster trustworthy Artificial Intelligence (AI) within the European Union,
the AI Act requires providers to mark and detect the outputs of their
general-purpose models. The Article 50 and Recital 133 call for marking methods
that are ''sufficiently reliable, interoperable, effective and robust''. Yet,
the rapidly evolving and heterogeneous landscape of watermarks for Large
Language Models (LLMs) makes it difficult to determine how these four standards
can be translated into concrete and measurable evaluations. Our paper addresses
this challenge, anchoring the normativity of European requirements in the
multiplicity of watermarking techniques. Introducing clear and distinct
concepts on LLM watermarking, our contribution is threefold. (1) Watermarking
Categorisation: We propose an accessible taxonomy of watermarking methods
according to the stage of the LLM lifecycle at which they are applied - before,
during, or after training, and during next-token distribution or sampling. (2)
Watermarking Evaluation: We interpret the EU AI Act's requirements by mapping
each criterion with state-of-the-art evaluations on robustness and
detectability of the watermark, and of quality of the LLM. Since
interoperability remains largely untheorised in LLM watermarking research, we
propose three normative dimensions to frame its assessment. (3) Watermarking
Comparison: We compare current watermarking methods for LLMs against the
operationalised European criteria and show that no approach yet satisfies all
four standards. Encouraged by emerging empirical tests, we recommend further
research into watermarking directly embedded within the low-level architecture
of LLMs.
๐
2025-11-05
๐ AAAI
๐ฅ Apoorva Upadhyaya, Wolfgang Nejdl, Marco Fisichella
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Zero-Shot Stance Detection (ZSSD) identifies the attitude of the post toward
unseen targets. Existing research using contrastive, meta-learning, or data
augmentation suffers from generalizability issues or lack of coherence between
text and target. Recent works leveraging large language models (LLMs) for ZSSD
focus either on improving unseen target-specific knowledge or generating
explanations for stance analysis. However, most of these works are limited by
their over-reliance on explicit reasoning, provide coarse explanations that
lack nuance, and do not explicitly model the reasoning process, making it
difficult to interpret the model's predictions. To address these issues, in our
study, we develop a novel interpretable ZSSD framework, IRIS. We provide an
interpretable understanding of the attitude of the input towards the target
implicitly based on sequences within the text (implicit rationales) and
explicitly based on linguistic measures (explicit rationales). IRIS considers
stance detection as an information retrieval ranking task, understanding the
relevance of implicit rationales for different stances to guide the model
towards correct predictions without requiring the ground-truth of rationales,
thus providing inherent interpretability. In addition, explicit rationales
based on communicative features help decode the emotional and cognitive
dimensions of stance, offering an interpretable understanding of the author's
attitude towards the given target. Extensive experiments on the benchmark
datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10%
training data prove the generalizability of our model, benefiting from the
proposed architecture and interpretable design.
๐
2025-11-05
๐ ArXiv Preprint (q-fin.TR)
๐ฅ Haofei Yu, Fenghai Li, Jiaxuan You
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) achieve strong performance across
benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but
these tests occur in static settings, lacking real dynamics and uncertainty.
Consequently, they evaluate isolated reasoning or problem-solving rather than
decision-making under uncertainty. To address this, we introduce
LiveTradeBench, a live trading environment for evaluating LLM agents in
realistic and evolving markets. LiveTradeBench follows three design principles:
(i) Live data streaming of market prices and news, eliminating dependence on
offline backtesting and preventing information leakage while capturing
real-time uncertainty; (ii) a portfolio-management abstraction that extends
control from single-asset actions to multi-asset allocation, integrating risk
management and cross-asset reasoning; and (iii) multi-market evaluation across
structurally distinct environments--U.S. stocks and Polymarket prediction
markets--differing in volatility, liquidity, and information flow. At each
step, an agent observes prices, news, and its portfolio, then outputs
percentage allocations that balance risk and return. Using LiveTradeBench, we
run 50-day live evaluations of 21 LLMs across families. Results show that (1)
high LMArena scores do not imply superior trading outcomes; (2) models display
distinct portfolio styles reflecting risk appetite and reasoning dynamics; and
(3) some LLMs effectively leverage live signals to adapt decisions. These
findings expose a gap between static evaluation and real-world competence,
motivating benchmarks that test sequential decision making and consistency
under live uncertainty.
๐
2025-11-05
๐ WWW
๐ฅ Ringwald Celian, Gandon, Fabien, Faron Catherine, Michel Franck et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
This article presents a systematic review of relation extraction (RE)
research since the advent of Transformer-based models. Using an automated
framework to collect and annotate publications, we analyze 34 surveys, 64
datasets, and 104 models published between 2019 and 2024. The review highlights
methodological advances, benchmark resources, and the integration of semantic
web technologies. By consolidating results across multiple dimensions, the
study identifies current trends, limitations, and open challenges, offering
researchers and practitioners a comprehensive reference for understanding the
evolution and future directions of RE.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu et al.
Robotics
ๆฅ็ๆ่ฆ
We present Step-Audio-EditX, the first open-source LLM-based audio model
excelling at expressive and iterative audio editing encompassing emotion,
speaking style, and paralinguistics alongside robust zero-shot text-to-speech
(TTS) capabilities.Our core innovation lies in leveraging only large-margin
synthetic data, which circumvents the need for embedding-based priors or
auxiliary modules. This large-margin learning approach enables both iterative
control and high expressivity across voices, and represents a fundamental pivot
from the conventional focus on representation-level disentanglement. Evaluation
results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and
Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ One Octadion, Bondan Sapta Prakoso, Nanang Yudi Setiawan, Novanto Yudistira
Natural Language Processing
ๆฅ็ๆ่ฆ
In this study, we explore the fine-tuning of Large Language Models (LLMs) to
better support policymakers in their crucial work of understanding, analyzing,
and crafting legal regulations. To equip the model with a deep understanding of
legal texts, we curated a supervised dataset tailored to the specific needs of
the legal domain. Additionally, we integrated the Retrieval-Augmented
Generation (RAG) method, enabling the LLM to access and incorporate up-to-date
legal knowledge from external sources. This combination of fine-tuning and
RAG-based augmentation results in a tool that not only processes legal
information but actively assists policymakers in interpreting regulations and
drafting new ones that align with current needs. The results demonstrate that
this approach can significantly enhance the effectiveness of legal research and
regulation development, offering a valuable resource in the ever-evolving field
of law.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joachim Diederich
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
This paper presents the first empirical demonstration of controllable
locality in transformer language models, a novel architectural framework that
enables continuous control over the degree of representation localization
through a tunable locality dial parameter. Unlike traditional language models
that rely exclusively on distributed representations, our approach allows
dynamic interpolation between highly interpretable localist encodings and
efficient distributed representations without requiring model retraining. We
conducted experiments on the WikiText corpus using a two-layer transformer
architecture, systematically varying the locality parameter {\lambda} across
the full spectrum from 1.0 (fully localist) to 0.0 (fully distributed). Our
results demonstrate that localist configurations achieve dramatically lower
attention entropy, with {\lambda} = 1.0 yielding 5.36 bits compared to 7.18
bits at {\lambda} = 0.0, while maintaining substantially higher pointer
fidelity scores reflecting stronger alignment with rule-specified targets.
Prediction experiments reveal that intermediate locality values optimize the
tradeoff between interpretability and performance, with {\lambda} = 0.6
achieving test perplexity of 4.65 and accuracy of 84.7%. These findings
establish that localist language models provide a practical framework for
applications in regulated domains requiring both transparency and capability,
offering precise mathematical control over the interpretability-performance
spectrum through explicit penalty thresholds and information-theoretic design
principles.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sofie Helene Bruun, Dan Saattrup Smart
Natural Language Processing
ๆฅ็ๆ่ฆ
Measuring the full abilities of large language models (LLMs) requires
benchmarks representing multiple tasks. We aim to create large, high-quality
datasets for comparison of logical reasoning skills across several languages
and of suitable difficulty for LLMs of various reasoning ability. We explore
multiple ways of increasing difficulty. We generate zebra puzzles in multiple
languages, themes, sizes and including 14 different clue types and 8 red
herring types (uninformative clues). We find puzzle sizes 2x3 and 4x5 are
sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a
reasoning model), respectively. Including 5 red herrings decreases o3-mini
puzzle-level accuracy on 4x5 puzzles by 15$\pm$7 %. Scores of o3-mini on 4x5
puzzles are not significantly affected by use of English vs. Danish or the
common houses theme vs. the country-specific smoerrebroed theme. We find no
correlation between difficulty and the selected clue types. Datasets of
128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic
languages for sizes 2x3 and 4x5. We publish code for puzzle generation,
designed for adaptablity into more languages and themes.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Brian DuSell, Ryan Cotterell
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Any finite set of training data is consistent with an infinite number of
hypothetical algorithms that could have generated it. Studies have shown that
when human children learn language, they consistently favor hypotheses based on
hierarchical syntactic rules without ever encountering disambiguating examples.
A recent line of work has inquired as to whether common neural network
architectures share this bias, finding that they do so only under special
conditions: when syntactically supervised, when pre-trained on massive corpora,
or when trained long past convergence. In this paper, we demonstrate, for the
first time, neural network architectures that are able to generalize in
human-like fashion without any of the aforementioned requirements:
stack-augmented neural networks. We test three base architectures (transformer,
simple RNN, LSTM) augmented with two styles of stack: the superposition stack
of Joulin & Mikolov (2015) and a nondeterministic generalization of it proposed
by DuSell & Chiang (2023). We find that transformers with nondeterministic
stacks generalize best out of these architectures on a classical question
formation task. We also propose a modification to the stack RNN architecture
that improves hierarchical generalization. These results suggest that
stack-augmented neural networks may be more accurate models of human language
acquisition than standard architectures, serving as useful objects of
psycholinguistic study. Our code is publicly available.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Roberta Di Marino, Giovanni Dioguardi, Antonio Romano, Giuseppe Riccio, Mariano Barone et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Medical question answering systems face deployment challenges including
hallucinations, bias, computational demands, privacy concerns, and the need for
specialized expertise across diverse domains. Here, we present SOLVE-Med, a
multi-agent architecture combining domain-specialized small language models for
complex medical queries. The system employs a Router Agent for dynamic
specialist selection, ten specialized models (1B parameters each) fine-tuned on
specific medical domains, and an Orchestrator Agent that synthesizes responses.
Evaluated on Italian medical forum data across ten specialties, SOLVE-Med
achieves superior performance with ROUGE-1 of 0.301 and BERTScore F1 of 0.697,
outperforming standalone models up to 14B parameters while enabling local
deployment. Our code is publicly available on GitHub:
https://github.com/PRAISELab-PicusLab/SOLVE-Med.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qi Jia, Kaiwei Zhang, Xiujie Song, Ye Shen, Xiangyang Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Understanding how well large language models can follow users' instructions
throughout a dialogue spanning multiple topics is of great importance for
data-intensive conversational applications. Existing benchmarks are often
limited to a fixed number of turns, making them susceptible to saturation and
failing to account for the user's interactive experience. In this work, we
propose an extensible framework for assessing multi-turn instruction-following
ability. At its core, our framework decouples linguistic surface forms from
user intent simulation through a three-layer mechanism that tracks constraints,
instructions, and topics. This framework mimics User-LLM interaction by
enabling the dynamic construction of benchmarks with state changes and
tracebacks, terminating a conversation only when the model exhausts a simulated
user's patience. We define a suite of metrics capturing the quality of the
interaction process. Using this framework, we construct EvolIF, an evolving
instruction-following benchmark incorporating nine distinct constraint types.
Our results indicate that GPT-5 exhibits superior instruction-following
performance. It sustains an average of 18.54 conversational turns and
demonstrates 70.31% robustness, outperforming Gemini-2.5-Pro by a significant
margin of 11.41%, while other models lag far behind. All of the data and code
will be made publicly available online.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Memory systems are key components that enable AI systems such as LLMs and AI
agents to achieve long-term learning and sustained interaction. However, during
memory storage and retrieval, these systems frequently exhibit memory
hallucinations, including fabrication, errors, conflicts, and omissions.
Existing evaluations of memory hallucinations are primarily end-to-end question
answering, which makes it difficult to localize the operational stage within
the memory system where hallucinations arise. To address this, we introduce the
Hallucination in Memory Benchmark (HaluMem), the first operation level
hallucination evaluation benchmark tailored to memory systems. HaluMem defines
three evaluation tasks (memory extraction, memory updating, and memory question
answering) to comprehensively reveal hallucination behaviors across different
operational stages of interaction. To support evaluation, we construct
user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and
HaluMem-Long. Both include about 15k memory points and 3.5k multi-type
questions. The average dialogue length per user reaches 1.5k and 2.6k turns,
with context lengths exceeding 1M tokens, enabling evaluation of hallucinations
across different context scales and task complexities. Empirical studies based
on HaluMem show that existing memory systems tend to generate and accumulate
hallucinations during the extraction and updating stages, which subsequently
propagate errors to the question answering stage. Future research should focus
on developing interpretable and constrained memory operation mechanisms that
systematically suppress hallucinations and improve memory reliability.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kazi Reyazul Hasan, Mubasshira Musarrat, A. B. M. Alim Al Islam, Muhammad Abdullah Adnan
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models work well for technical problem solving in English but
perform poorly when the same questions are asked in Bangla. A simple solution
would be to translate Bangla questions into English first and then use these
models. However, existing Bangla-English translation systems struggle with
technical terms. They often mistranslate specialized vocabulary, which changes
the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a
dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM
fields including computer science, mathematics, physics, chemistry, and
biology. We generated over 12,000 translations using language models and then
used human evaluators to select the highest quality pairs that preserve
technical terminology correctly. We train a T5-based translation model on
BanglaSTEM and test it on two tasks: generating code and solving math problems.
Our results show significant improvements in translation accuracy for technical
content, making it easier for Bangla speakers to use English-focused language
models effectively. Both the BanglaSTEM dataset and the trained translation
model are publicly released at https://huggingface.co/reyazul/BanglaSTEM-T5.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ringwald Celian, Gandon Fabien, Faron Catherine, Michel Franck, Abi Akl Hanna
Natural Language Processing
ๆฅ็ๆ่ฆ
RDF pattern-based extraction is a compelling approach for fine-tuning small
language models (SLMs) by focusing a relation extraction task on a specified
SHACL shape. This technique enables the development of efficient models trained
on limited text and RDF data. In this article, we introduce Kastor, a framework
that advances this approach to meet the demands for completing and refining
knowledge bases in specialized domains. Kastor reformulates the traditional
validation task, shifting from single SHACL shape validation to evaluating all
possible combinations of properties derived from the shape. By selecting the
optimal combination for each training example, the framework significantly
enhances model generalization and performance. Additionally, Kastor employs an
iterative learning process to refine noisy knowledge bases, enabling the
creation of robust models capable of uncovering new, relevant facts
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Doria Bonzi, Alexandre Guiggi, Frรฉdรฉric Bรฉchet, Carlos Ramisch, Benoit Favre
Natural Language Processing
ๆฅ็ๆ่ฆ
Critical appraisal of scientific literature is an essential skill in the
biomedical field. While large language models (LLMs) can offer promising
support in this task, their reliability remains limited, particularly for
critical reasoning in specialized domains. We introduce CareMedEval, an
original dataset designed to evaluate LLMs on biomedical critical appraisal and
reasoning tasks. Derived from authentic exams taken by French medical students,
the dataset contains 534 questions based on 37 scientific articles. Unlike
existing benchmarks, CareMedEval explicitly evaluates critical reading and
reasoning grounded in scientific papers. Benchmarking state-of-the-art
generalist and biomedical-specialized LLMs under various context conditions
reveals the difficulty of the task: open and commercial models fail to exceed
an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens
considerably improves the results. Yet, models remain challenged especially on
questions about study limitations and statistical analysis. CareMedEval
provides a challenging benchmark for grounded reasoning, exposing current LLM
limitations and paving the way for future development of automated support for
critical appraisal.
๐
2025-11-05
๐ EMNLP 2025
๐ฅ Longpeng Qiu, Ting Li, Shuai Mao, Nan Yang, Xiaohui Yan
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Input errors in question-answering (QA) systems often lead to incorrect
responses. Large language models (LLMs) struggle with this task, frequently
failing to interpret user intent (misinterpretation) or unnecessarily altering
the original question's structure (over-correction). We propose QuestionRAG, a
framework that tackles these problems. To address misinterpretation, it
enriches the input with external knowledge (e.g., search results, related
entities). To prevent over-correction, it uses reinforcement learning (RL) to
align the model's objective with precise correction, not just paraphrasing. Our
results demonstrate that knowledge augmentation is critical for understanding
faulty questions. Furthermore, RL-based alignment proves significantly more
effective than traditional supervised fine-tuning (SFT), boosting the model's
ability to follow instructions and generalize. By integrating these two
strategies, QuestionRAG unlocks the full potential of LLMs for the question
correction task.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Canhui Wu, Qiong Cao, Chao Xue, Wei Xi, Xiaodong He
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in large language models (LLMs) have leveraged explicit
Chain-of-Thought (CoT) prompting to improve reasoning accuracy. However, most
existing methods primarily compress verbose reasoning outputs. These
Long-to-Short transformations aim to improve efficiency, but still rely on
explicit reasoning during inference. In this work, we introduce \textbf{3TF}
(\textbf{T}hought-\textbf{T}raining and \textbf{T}hought-\textbf{F}ree
inference), a framework for efficient reasoning that takes a Short-to-Long
perspective. We first train a hybrid model that can operate in both reasoning
and non-reasoning modes, and then further train it on CoT-annotated data to
internalize structured reasoning, while enforcing concise, thought-free outputs
at inference time using the no-reasoning mode. Unlike compression-based
approaches, 3TF improves the reasoning quality of non-reasoning outputs,
enabling models to perform rich internal reasoning implicitly while keeping
external outputs short. Empirically, 3TF-trained models obtain large
improvements on reasoning benchmarks under thought-free inference,
demonstrating that high quality reasoning can be learned and executed
implicitly without explicit step-by-step generation.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Cรฉlian Ringwald, Fabien Gandon, Catherine Faron, Franck Michel, Hanna Abi Akl
Natural Language Processing
ๆฅ็ๆ่ฆ
Small language models (SLMs) have shown promises for relation extraction (RE)
when extracting RDF triples guided by SHACL shapes focused on common datatype
properties. This paper investigates how SLMs handle both datatype and object
properties for a complete RDF graph extraction. We show that the key bottleneck
is related to long-tail distribution of rare properties. To solve this issue,
we evaluate several strategies: stratified sampling, weighted loss, dataset
scaling, and template-based synthetic data augmentation. We show that the best
strategy to perform equally well over unbalanced target properties is to build
a training set where the number of occurrences of each property exceeds a given
threshold. To enable reproducibility, we publicly released our datasets,
experimental results and code. Our findings offer practical guidance for
training shape-aware SLMs and highlight promising directions for future work in
semantic RE.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Saumitra Yadav, Manish Shrivastava
Natural Language Processing
ๆฅ็ๆ่ฆ
Existing Machine Translation (MT) research often suggests a single, fixed set
of hyperparameters for word segmentation models, symmetric Byte Pair Encoding
(BPE), which applies the same number of merge operations (NMO) to train
tokenizers for both source and target languages. However, we demonstrate that
this uniform approach doesn't guarantee optimal MT performance across different
language pairs and data sizes. This work investigates BPE segmentation recipes
across various data volumes and language pairs to evaluate MT system
performance. We find that utilizing asymmetric BPE, where the source and target
languages have different NMOs, significantly improves results over the
symmetric approach, especially in low-resource settings (50K, 100K, and 500K
sentence pairs). Specifically, asymmetric BPE yield statistically significant
($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in
low-resource setups. We validated this trend across six additional language
pairs (English and Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut),
observing statistically significant improvement in 10 out of 12 systems
compared to symmetric BPE. Our findings indicate a high NMO for the source (4K
to 32K) and a low NMO for the target (0.5K to 2K) provides optimal results,
particularly benefiting low-resource MT.
๐
2025-11-05
๐ ArXiv Preprint (cs.SI)
๐ฅ Yangliu Fan, Kilian Buehling, Volker Stocker
ๆฅ็ๆ่ฆ
Despite the importance of social science knowledge for various stakeholders,
measuring its diffusion into different domains remains a challenge. This study
uses a novel text-based approach to measure the idea-level diffusion of social
science knowledge from the research domain to the journalism and policy-making
domains. By doing so, we expand the detection of knowledge diffusion beyond the
measurements of direct references. Our study focuses on media effects theories
as key research ideas in the field of communication science. Using 72,703
documents (2000-2019) from three domains (i.e., research, journalism, and
policy-making) that mention these ideas, we count the mentions of these ideas
in each domain, estimate their domain-specific contexts, and track and compare
differences across domains and over time. Overall, we find that diffusion
patterns and dynamics vary considerably between ideas, with some ideas
diffusing between other domains, while others do not. Based on the embedding
regression approach, we compare contextualized meanings across domains and find
that the distances between research and policy are typically larger than
between research and journalism. We also find that ideas largely shift roles
across domains - from being the theories themselves in research to sense-making
in news to applied, administrative use in policy. Over time, we observe
semantic convergence mainly for ideas that are practically oriented. Our
results characterize the cross-domain diffusion patterns and dynamics of social
science knowledge at the idea level, and we discuss the implications for
measuring knowledge diffusion beyond citations.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shenghao Li
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
For complex logical data augmentation, heavy reliance on human annotation is
costly, whereas direct generation with large language models yields
uninterpretable and logically homogeneous examples. To address this, we present
LFC-DA, a symbolic-logic-controlled pipeline: logical text is first mapped to
propositional expressions, a compact rule library is compiled, and a bounded
state-space search systematically discovers valid formulas that are then
verbalized back into natural-language questions, ensuring both diversity and
logical rigor under propositional logic. Experiments on ReClor and LogiQA show
significant improvements in the logical-reasoning accuracy of pretrained
models, confirming the effectiveness of LFC-DA for LLM-guided logical data
augmentation.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yunbo Long, Yuhan Liu, Alexandra Brintrup
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The deployment of large language models (LLMs) in automated negotiation has
set a high performance benchmark, but their computational cost and data privacy
requirements render them unsuitable for many privacy-sensitive, on-device
applications such as mobile assistants, embodied AI agents or private client
interactions. While small language models (SLMs) offer a practical alternative,
they suffer from a significant performance gap compared to LLMs in playing
emotionally charged complex personas, especially for credit negotiation. This
paper introduces EQ-Negotiator, a novel framework that bridges this capability
gap using emotional personas. Its core is a reasoning system that integrates
game theory with a Hidden Markov Model(HMM) to learn and track debtor emotional
states online, without pre-training. This allows EQ-Negotiator to equip SLMs
with the strategic intelligence to counter manipulation while de-escalating
conflict and upholding ethical standards. Through extensive agent-to-agent
simulations across diverse credit negotiation scenarios, including adversarial
debtor strategies like cheating, threatening, and playing the victim, we show
that a 7B parameter language model with EQ-Negotiator achieves better debt
recovery and negotiation efficiency than baseline LLMs more than 10 times its
size. This work advances persona modeling from descriptive character profiles
to dynamic emotional architectures that operate within privacy constraints.
Besides, this paper establishes that strategic emotional intelligence, not raw
model scale, is the critical factor for success in automated negotiation,
paving the way for effective, ethical, and privacy-preserving AI negotiators
that can operate on the edge.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rom Himelstein, Amit LeVi, Brit Youngmann, Yaniv Nemcovsky, Avi Mendelson
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Safety-aligned large language models (LLMs) are becoming increasingly
widespread, especially in sensitive applications where fairness is essential
and biased outputs can cause significant harm. However, evaluating the fairness
of models is a complex challenge, and approaches that do so typically utilize
standard question-answer (QA) styled schemes. Such methods often overlook
deeper issues by interpreting the model's refusal responses as positive
fairness measurements, which creates a false sense of fairness. In this work,
we introduce the concept of silenced biases, which are unfair preferences
encoded within models' latent space and are effectively concealed by
safety-alignment. Previous approaches that considered similar indirect biases
often relied on prompt manipulation or handcrafted implicit queries, which
present limited scalability and risk contaminating the evaluation process with
additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to
uncover these biases by employing activation steering to reduce model refusals
during QA. SBB supports easy expansion to new demographic groups and subjects,
presenting a fairness evaluation framework that encourages the future
development of fair models and tools beyond the masking effects of alignment
training. We demonstrate our approach over multiple LLMs, where our findings
expose an alarming distinction between models' direct responses and their
underlying fairness issues.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Riasad Alvi, Sayeem Been Zaman, Wasimul Karim, Arefin Ittesafun Abian, Mohaimenul Azam Khan Raiaan et al.
ๆฅ็ๆ่ฆ
Generative artificial intelligence (GenAI) has become a transformative
approach in bioinformatics that often enables advancements in genomics,
proteomics, transcriptomics, structural biology, and drug discovery. To
systematically identify and evaluate these growing developments, this review
proposed six research questions (RQs), according to the preferred reporting
items for systematic reviews and meta-analysis methods. The objective is to
evaluate impactful GenAI strategies in methodological advancement, predictive
performance, and specialization, and to identify promising approaches for
advanced modeling, data-intensive discovery, and integrative biological
analysis. RQ1 highlights diverse applications across multiple bioinformatics
subfields (sequence analysis, molecular design, and integrative data modeling),
which demonstrate superior performance over traditional methods through pattern
recognition and output generation. RQ2 reveals that adapted specialized model
architectures outperformed general-purpose models, an advantage attributed to
targeted pretraining and context-aware strategies. RQ3 identifies significant
benefits in the bioinformatics domains, focusing on molecular analysis and data
integration, which improves accuracy and reduces errors in complex analysis.
RQ4 indicates improvements in structural modeling, functional prediction, and
synthetic data generation, validated by established benchmarks. RQ5 suggests
the main constraints, such as the lack of scalability and biases in data that
impact generalizability, and proposes future directions focused on robust
evaluation and biologically grounded modeling. RQ6 examines that molecular
datasets (such as UniProtKB and ProteinNet12), cellular datasets (such as
CELLxGENE and GTEx) and textual resources (such as PubMedQA and OMIM) broadly
support the training and generalization of GenAI models.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jindong Hong, Tianjie Chen, Lingjie Luo, Chuanyang Zheng, Ting Xu et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
A recent advancement in Multimodal Large Language Models (MLLMs) research is
the emergence of "reasoning MLLMs" that offer explicit control over their
internal thinking processes (normally referred as the "thinking mode")
alongside the standard "non-thinking mode". This capability allows these models
to engage in a step-by-step process of internal deliberation before generating
a final response. With the rapid transition to and adoption of these
"dual-state" MLLMs, this work rigorously evaluated how the enhanced reasoning
processes of these MLLMs impact model performance and reliability in clinical
tasks. This paper evaluates the active "thinking mode" capabilities of two
leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We
assessed their performance on four visual medical tasks using VQA-RAD and
ROCOv2 datasets. Our findings reveal that the improvement from activating the
thinking mode remains marginal compared to the standard non-thinking mode for
the majority of the tasks. Their performance on complex medical tasks such as
open-ended VQA and medical image interpretation remains suboptimal,
highlighting the need for domain-specific medical data and more advanced
methods for medical knowledge integration.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli
Natural Language Processing
ๆฅ็ๆ่ฆ
Automatic evaluation of speech-to-text translation (ST) systems is typically
performed by comparing translation hypotheses with one or more reference
translations. While effective to some extent, this approach inherits the
limitation of reference-based evaluation that ignores valuable information from
the source input. In machine translation (MT), recent progress has shown that
neural metrics incorporating the source text achieve stronger correlation with
human judgments. Extending this idea to ST, however, is not trivial because the
source is audio rather than text, and reliable transcripts or alignments
between source and references are often unavailable. In this work, we conduct
the first systematic study of source-aware metrics for ST, with a particular
focus on real-world operating conditions where source transcripts are not
available. We explore two complementary strategies for generating textual
proxies of the input audio, automatic speech recognition (ASR) transcripts, and
back-translations of the reference translation, and introduce a novel two-step
cross-lingual re-segmentation algorithm to address the alignment mismatch
between synthetic sources and reference translations. Our experiments, carried
out on two ST benchmarks covering 79 language pairs and six ST systems with
diverse architectures and performance levels, show that ASR transcripts
constitute a more reliable synthetic source than back-translations when word
error rate is below 20%, while back-translations always represent a
computationally cheaper but still effective alternative. Furthermore, our
cross-lingual re-segmentation algorithm enables robust use of source-aware MT
metrics in ST evaluation, paving the way toward more accurate and principled
evaluation methodologies for speech translation.
๐
2025-11-05
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yize Liu, Yunyun Hou, Aina Sui
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have been widely deployed across various
applications, yet their potential security and ethical risks have raised
increasing concerns. Existing research employs red teaming evaluations,
utilizing multi-turn jailbreaks to identify potential vulnerabilities in LLMs.
However, these approaches often lack exploration of successful dialogue
trajectories within the attack space, and they tend to overlook the
considerable overhead associated with the attack process. To address these
limitations, this paper first introduces a theoretical model based on
dynamically weighted graph topology, abstracting the multi-turn attack process
as a path planning problem. Based on this framework, we propose ABC, an
enhanced Artificial Bee Colony algorithm for multi-turn jailbreaks, featuring a
collaborative search mechanism with employed, onlooker, and scout bees. This
algorithm significantly improves the efficiency of optimal attack path search
while substantially reducing the average number of queries required. Empirical
evaluations on three open-source and two proprietary language models
demonstrate the effectiveness of our approach, achieving attack success rates
above 90\% across the board, with a peak of 98\% on GPT-3.5-Turbo, and
outperforming existing baselines. Furthermore, it achieves comparable success
with only 26 queries on average, significantly reducing red teaming overhead
and highlighting its superior efficiency.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jin-woo Lee, Junhwa Choi, Bongkyu Hwang, Jinho Choo, Bogun Kim et al.
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
We revisit continual pre-training for large language models and argue that
progress now depends more on scaling the right structure than on scaling
parameters alone. We introduce SCALE, a width upscaling architecture that
inserts lightweight expansion into linear modules while freezing all
pre-trained parameters. This preserves the residual and attention topologies
and increases capacity without perturbing the base model's original
functionality. SCALE is guided by two principles: Persistent Preservation,
which maintains the base model's behavior via preservation-oriented
initialization and freezing of the pre-trained weights, and Collaborative
Adaptation, which selectively trains a subset of expansion components to
acquire new knowledge with minimal interference. We instantiate these ideas as
SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and
SCALE-Route, an optional routing extension that performs token-level routing
between preservation and adaptation heads. On a controlled synthetic biography
benchmark, SCALE mitigates the severe forgetting observed with depth expansion
while still acquiring new knowledge. In continual pre-training on a Korean
corpus, SCALE variants achieve less forgetting on English evaluations and
competitive gains on Korean benchmarks, with these variants offering the best
overall stability-plasticity trade-off. Accompanying analysis clarifies when
preservation provably holds and why the interplay between preservation and
adaptation stabilizes optimization compared to standard continual learning
setups.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ranul Dayarathne, Uvini Ranaweera, Upeksha Ganegoda
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval Augmented Generation (RAG) is emerging as a powerful technique to
enhance the capabilities of Generative AI models by reducing hallucination.
Thus, the increasing prominence of RAG alongside Large Language Models (LLMs)
has sparked interest in comparing the performance of different LLMs in
question-answering (QA) in diverse domains. This study compares the performance
of four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat,
Falcon-7b-instruct and Orca-mini-v3-7b, and OpenAI's trending GPT-3.5 over QA
tasks within the computer science literature leveraging RAG support. Evaluation
metrics employed in the study include accuracy and precision for binary
questions and ranking by a human expert, ranking by Google's AI model Gemini,
alongside cosine similarity for long-answer questions. GPT-3.5, when paired
with RAG, effectively answers binary and long-answer questions, reaffirming its
status as an advanced LLM. Regarding open-source LLMs, Mistral AI's
Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary
and long-answer questions. However, among the open-source LLMs, Orca-mini-v3-7b
reports the shortest average latency in generating responses, whereas
LLaMa2-7b-chat by Meta reports the highest average latency. This research
underscores the fact that open-source LLMs, too, can go hand in hand with
proprietary models like GPT-3.5 with better infrastructure.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Souvik Rana, Arul Menezes, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal
Natural Language Processing
ๆฅ็ๆ่ฆ
Tokenizers play a crucial role in determining the performance, training
efficiency, and the inference cost of Large Language Models (LLMs). Designing
effective tokenizers for multilingual LLMs is particularly challenging due to
diverse scripts and rich morphological variation. While subword methods such as
Byte Pair Encoding (BPE) are widely adopted, their effectiveness in
multilingual settings remains underexplored. We present IndicSuperTokenizer, a
tokenizer for Indic multilingual LLMs, that combines both subword and
multi-word tokenization, along with language-specific pre-tokenization, leading
to more linguistically aligned tokens and achieving a new state-of-the-art in
fertility score. Evaluated across English, 22 Indian languages and code data,
our tokenizer improves the average fertility score by 39.5% over LLaMA4 and by
18% over Sutra (the current best). This translates to 44% improvement in
inference throughput over LLaMA4 while maintaining comparable performance on
English and Indic benchmarks. We also present detailed ablations across
tokenizer training data size, vocabulary size, merging techniques, and
pre-tokenization strategies, demonstrating the robustness of our design
choices.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shantanu Agarwal, Joel Barry, Elizabeth Boschee, Scott Miller
Natural Language Processing
ๆฅ็ๆ่ฆ
Machine Translation for English Retrieval of Information in Any Language
(MATERIAL) is an IARPA initiative targeted to advance the state of
cross-lingual information retrieval (CLIR). This report provides a detailed
description of Information Sciences Institute's (ISI's) Summarization and
domain-Adaptive Retrieval Across Language's (SARAL's) effort for MATERIAL.
Specifically, we outline our team's novel approach to handle CLIR with emphasis
in developing an approach amenable to retrieve a query-relevant document
\textit{set}, and not just a ranked document-list. In MATERIAL's Phase-3
evaluations, SARAL exceeded the performance of other teams in five out of six
evaluation conditions spanning three different languages (Farsi, Kazakh, and
Georgian).
๐
2025-11-05
๐ EMNLP
๐ฅ Shaghayegh Kolli, Richard Rosenbaum, Timo Cavelius, Lasse Strothe, Andrii Lata et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel in generating fluent utterances but can
lack reliable grounding in verified information. At the same time,
knowledge-graph-based fact-checkers deliver precise and interpretable evidence,
yet suffer from limited coverage or latency. By integrating LLMs with knowledge
graphs and real-time search agents, we introduce a hybrid fact-checking
approach that leverages the individual strengths of each component. Our system
comprises three autonomous steps: 1) a Knowledge Graph (KG) Retrieval for rapid
one-hop lookups in DBpedia, 2) an LM-based classification guided by a
task-specific labeling prompt, producing outputs with internal rule-based
logic, and 3) a Web Search Agent invoked only when KG coverage is insufficient.
Our pipeline achieves an F1 score of 0.93 on the FEVER benchmark on the
Supported/Refuted split without task-specific fine-tuning. To address Not
enough information cases, we conduct a targeted reannotation study showing that
our approach frequently uncovers valid evidence for claims originally labeled
as Not Enough Information (NEI), as confirmed by both expert annotators and LLM
reviewers. With this paper, we present a modular, opensource fact-checking
pipeline with fallback strategies and generalization across datasets.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wenchang Lei, Ping Zou, Yue Wang, Feng Sun, Lei Zhao
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) exhibit strong semantic understanding, yet
struggle when user instructions involve ambiguous or conceptually misaligned
terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity
by extracting meta-relations-inheritance, alias, and composition-from natural
language. The model further employs a reflection mechanism to validate these
meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, these
relations and related descriptions are dynamically supplied to the LLM,
improving its ability to interpret concepts and generate accurate responses.
Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely
on extended context windows, our method enables large language models to
process texts of any length without the need for truncation. Experiments on
standard benchmarks demonstrate that the LGM consistently outperforms existing
RAG baselines.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shahriyar Zaman Ridoy, Azmine Toushik Wasi, Koushik Ahamed Tonmoy
Natural Language Processing
ๆฅ็ๆ่ฆ
As multilingual Large Language Models (LLMs) gain traction across South Asia,
their alignment with local ethical norms, particularly for Bengali, which is
spoken by over 285 million people and ranked 6th globally, remains
underexplored. Existing ethics benchmarks are largely English-centric and
shaped by Western frameworks, overlooking cultural nuances critical for
real-world deployment. To address this, we introduce BengaliMoralBench, the
first large-scale ethics benchmark for the Bengali language and socio-cultural
contexts. It covers five moral domains, Daily Activities, Habits, Parenting,
Family Relationships, and Religious Activities, subdivided into 50 culturally
relevant subtopics. Each scenario is annotated via native-speaker consensus
using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct
systematic zero-shot evaluation of prominent multilingual LLMs, including
Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and
standard metrics. Performance varies widely (50-91% accuracy), with qualitative
analysis revealing consistent weaknesses in cultural grounding, commonsense
reasoning, and moral fairness. BengaliMoralBench provides a foundation for
responsible localization, enabling culturally aligned evaluation and supporting
the deployment of ethically robust AI in diverse, low-resource multilingual
settings such as Bangladesh.
๐
2025-11-05
๐ KDD
๐ฅ Kevin Wang, Subre Abdoul Moktar, Jia Li, Kangshuo Li, Feng Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have become increasingly pervasive, finding
applications across many industries and disciplines. Ensuring the
trustworthiness of LLM outputs is paramount, where Uncertainty Estimation (UE)
plays a key role. In this work, a comprehensive empirical study is conducted to
examine the robustness and effectiveness of diverse UE measures regarding
aleatoric and epistemic uncertainty in LLMs. It involves twelve different UE
methods and four generation quality metrics including LLMScore from LLM
criticizers to evaluate the uncertainty of LLM-generated answers in
Question-Answering (QA) tasks on both in-distribution (ID) and
out-of-distribution (OOD) datasets. Our analysis reveals that information-based
methods, which leverage token and sequence probabilities, perform exceptionally
well in ID settings due to their alignment with the model's understanding of
the data. Conversely, density-based methods and the P(True) metric exhibit
superior performance in OOD contexts, highlighting their effectiveness in
capturing the model's epistemic uncertainty. Semantic consistency methods,
which assess variability in generated answers, show reliable performance across
different datasets and generation metrics. These methods generally perform well
but may not be optimal for every situation.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Srishti Yadav, Jasmina Gajcin, Erik Miehling, Elizabeth Daly
Robotics
ๆฅ็ๆ่ฆ
Understanding how different stakeholders perceive risks in AI systems is
essential for their responsible deployment. This paper presents a framework for
stakeholder-grounded risk assessment by using LLMs, acting as judges to predict
and explain risks. Using the Risk Atlas Nexus and GloVE explanation method, our
framework generates stakeholder-specific, interpretable policies that shows how
different stakeholders agree or disagree about the same risks. We demonstrate
our method using three real-world AI use cases of medical AI, autonomous
vehicles, and fraud detection domain. We further propose an interactive
visualization that reveals how and why conflicts emerge across stakeholder
perspectives, enhancing transparency in conflict reasoning. Our results show
that stakeholder perspectives significantly influence risk perception and
conflict patterns. Our work emphasizes the importance of these
stakeholder-aware explanations needed to make LLM-based evaluations more
transparent, interpretable, and aligned with human-centered AI governance
goals.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kaiyuan Zhang, Chenghao Yang, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang et al.
Multimodal
ๆฅ็ๆ่ฆ
As reasoning models scale rapidly, the essential role of multimodality in
human cognition has come into sharp relief, driving a growing need to probe
vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either
overemphasize textual reasoning or fall short of systematically capturing
vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs
insufficiently assessed. To address this limitation, we introduce MME-CC
(Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded
benchmark that organizes 11 representative reasoning tasks into three
fundamental categories of visual information: spatial, geometric, and
knowledge-based reasoning, and provides fine-grained analyses of MLLMs'
cognitive capacity across these dimensions. Based on MME-CC, we conduct
extensive experiments over 16 representative MLLMs. Our study reveals that
closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs.
30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak
(less than or equal to 30%). We further identify common error patterns,
including orientation mistakes, fragile cross-view identity persistence, and
poor adherence to counterfactual instructions, and observe that
Chain-of-Thought typically follows a three-stage process (extract -> reason ->
verify) with heavy reliance on visual extraction. We hope this work catalyzes a
shift toward treating the cognitive capacity of MLLMs as central to both
evaluation and model design.
๐
2025-11-05
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Erfan Shayegani, Jina Suh, Andy Wilson, Nagu Rangan, Javier Hernandez
Natural Language Processing
ๆฅ็ๆ่ฆ
Empathy is a critical factor in fostering positive user experiences in
conversational AI. While models can display empathy, it is often generic rather
than tailored to specific tasks and contexts. In this work, we introduce a
novel framework for developing and evaluating context-specific empathetic large
language models (LLMs). We first analyze a real-world conversational dataset
consisting of 672 multi-turn conversations across 8 tasks, revealing
significant differences in terms of expected and experienced empathy before and
after the conversations, respectively. To help minimize this gap, we develop a
synthetic multi-turn conversational generation pipeline and steer responses
toward our defined empathy patterns based on the context that more closely
matches users' expectations. We then train empathetic expert adapters for
context-specific empathy that specialize in varying empathy levels based on the
recognized task. Our empirical results demonstrate a significant gap reduction
of 72.66% between perceived and desired empathy with scores increasing by an
average factor of 2.43 as measured by our metrics and reward models.
Additionally, our trained empathetic expert adapters demonstrate superior
effectiveness in preserving empathy patterns throughout conversation turns,
outperforming system prompts, which tend to dramatically diminish in impact as
conversations lengthen.
๐
2025-11-05
๐ EMNLP 2025
๐ฅ Najrin Sultana, Md Rafi Ur Rashid, Kang Gu, Shagufta Mehnaz
Natural Language Processing
ๆฅ็ๆ่ฆ
LLMs can provide substantial zero-shot performance on diverse tasks using a
simple task prompt, eliminating the need for training or fine-tuning. However,
when applying these models to sensitive tasks, it is crucial to thoroughly
assess their robustness against adversarial inputs. In this work, we introduce
Static Deceptor (StaDec) and Dynamic Deceptor (DyDec), two innovative attack
frameworks designed to systematically generate dynamic and adaptive adversarial
examples by leveraging the understanding of the LLMs. We produce subtle and
natural-looking adversarial inputs that preserve semantic similarity to the
original text while effectively deceiving the target LLM. By utilizing an
automated, LLM-driven pipeline, we eliminate the dependence on external
heuristics. Our attacks evolve with the advancements in LLMs and demonstrate
strong transferability across models unknown to the attacker. Overall, this
work provides a systematic approach for the self-assessment of an LLM's
robustness. We release our code and data at
https://github.com/Shukti042/AdversarialExample.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yuya Miyaoka, Masaki Inoue
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
This paper proposes a control-based framework for aligning large language
models (LLMs) by leveraging a control barrier function (CBF) to ensure
user-desirable text generation. The presented framework applies the CBF safety
filter to the predicted token generated from the baseline LLM, to intervene in
the generated text. The safety filter includes two significant advantages: this
safety filter is an add-on type, allowing it to be used for alignment purposes
without fine-tuning the baseline LLM, and if there is an evaluation model
regarding the desired alignment, it can be directly applied to the filter
design. The overall text-generation system is implemented with open-source
language models, aiming to generate positive text.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Saad Mankarious, Ayah Zirikly
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Mental health disorders affect millions worldwide, yet early detection
remains a major challenge, particularly for Arabic-speaking populations where
resources are limited and mental health discourse is often discouraged due to
cultural stigma. While substantial research has focused on English-language
mental health detection, Arabic remains significantly underexplored, partly due
to the scarcity of annotated datasets. We present CARMA, the first
automatically annotated large-scale dataset of Arabic Reddit posts. The dataset
encompasses six mental health conditions, such as Anxiety, Autism, and
Depression, and a control group. CARMA surpasses existing resources in both
scale and diversity. We conduct qualitative and quantitative analyses of
lexical and semantic differences between users, providing insights into the
linguistic markers of specific mental health conditions. To demonstrate the
dataset's potential for further mental health analysis, we perform
classification experiments using a range of models, from shallow classifiers to
large language models. Our results highlight the promise of advancing mental
health detection in underrepresented languages such as Arabic.
๐
2025-11-05
๐ ICASSP 2026
๐ฅ Gowtham Premananth, Carol Espy-Wilson
Robotics
ๆฅ็ๆ่ฆ
Language disruptions are one of the well-known effects of schizophrenia
symptoms. They are often manifested as disorganized speech and impaired
discourse coherence. These abnormalities in spontaneous language production
reflect underlying cognitive disturbances and have the potential to serve as
objective markers for symptom severity and diagnosis of schizophrenia. This
study focuses on how these language disruptions can be characterized in terms
of two computational linguistic measures: surprisal and semantic coherence. By
computing surprisal and semantic coherence of language using computational
models, this study investigates how they differ between subjects with
schizophrenia and healthy controls. Furthermore, this study provides further
insight into how language disruptions in terms of these linguistic measures
change with varying degrees of schizophrenia symptom severity.
๐
2025-11-05
๐ EMNLP 2025
๐ฅ Michel Wong, Ali Alshehri, Sophia Kao, Haotian He
Natural Language Processing
ๆฅ็ๆ่ฆ
Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS)
systems, converting written forms into their canonical spoken equivalents.
Traditional TN systems can exhibit high accuracy, but involve substantial
engineering effort, are difficult to scale, and pose challenges to language
coverage, particularly in low-resource settings. We propose PolyNorm, a
prompt-based approach to TN using Large Language Models (LLMs), aiming to
reduce the reliance on manually crafted rules and enable broader linguistic
applicability with minimal human intervention. Additionally, we present a
language-agnostic pipeline for automatic data curation and evaluation, designed
to facilitate scalable experimentation across diverse languages. Experiments
across eight languages show consistent reductions in the word error rate (WER)
compared to a production-grade-based system. To support further research, we
release PolyNorm-Benchmark, a multilingual data set covering a diverse range of
text normalization phenomena.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Guanning Zeng, Zhaoyi Zhou, Daman Arora, Andrea Zanette
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a
powerful paradigm for post-training large reasoning models (LRMs) using
policy-gradient methods such as GRPO. To stabilize training, these methods
typically center trajectory rewards by subtracting the empirical mean for each
prompt. Statistically, this centering acts as a control variate (or baseline),
reducing the variance of the policy-gradient estimator.
Typically, the mean reward is estimated using per-prompt empirical averages
for each prompt in a batch. Drawing inspiration from Stein's paradox, we
propose using shrinkage estimators that combine per-prompt and across-prompt
means to improve the overall per-prompt mean estimation accuracy --
particularly in the low-generation regime typical of RLVR. Theoretically, we
construct a shrinkage-based baseline that provably yields lower-variance
policy-gradient estimators across algorithms. Our proposed baseline serves as a
drop-in replacement for existing per-prompt mean baselines, requiring no
additional hyper-parameters or computation. Empirically, shrinkage baselines
consistently outperform standard empirical-mean baselines, leading to
lower-variance gradient updates and improved training stability.
๐
2025-11-05
๐ ArXiv Preprint (math.ST)
๐ฅ Rong Jiang, Cong Ma
Machine Learning
ๆฅ็ๆ่ฆ
We study batched nonparametric contextual bandits under a margin condition
when the margin parameter $\alpha$ is unknown. To capture the statistical price
of this ignorance, we introduce the regret inflation criterion, defined as the
ratio between the regret of an adaptive algorithm and that of an oracle knowing
$\alpha$. We show that the optimal regret inflation grows polynomial with the
horizon $T$, with exponent precisely given by the value of a convex
optimization problem involving the dimension, smoothness, and batch budget.
Moreover, the minimizers of this optimization problem directly prescribe the
batch allocation and exploration strategy of a rate-optimal algorithm. Building
on this principle, we develop RoBIN (RObust batched algorithm with adaptive
BINning), which achieves the optimal regret inflation up to logarithmic
factors. These results reveal a new adaptivity barrier: under batching,
adaptation to an unknown margin parameter inevitably incurs a polynomial
penalty, sharply characterized by a variational problem. Remarkably, this
barrier vanishes when the number of batches exceeds $\log \log T$; with only a
doubly logarithmic number of updates, one can recover the oracle regret rate up
to polylogarithmic factors.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohsen Ahmadzadeh, Kaichang Chen, Georges Gielen
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Analog/mixed-signal circuits are key for interfacing electronics with the
physical world. Their design, however, remains a largely handcrafted process,
resulting in long and error-prone design cycles. While the recent rise of
AI-based reinforcement learning and generative AI has created new techniques to
automate this task, the need for many time-consuming simulations is a critical
bottleneck hindering the overall efficiency. Furthermore, the lack of
explainability of the resulting design solutions hampers widespread adoption of
the tools. To address these issues, a novel agentic AI framework for
sample-efficient and explainable analog circuit sizing is presented. It employs
a multi-agent workflow where specialized Large Language Model (LLM)-based
agents collaborate to interpret the circuit topology, to understand the design
goals, and to iteratively refine the circuit's design parameters towards the
target goals with human-interpretable reasoning. The adaptive simulation
strategy creates an intelligent control that yields a high sample efficiency.
The AnaFlow framework is demonstrated for two circuits of varying complexity
and is able to complete the sizing task fully automatically, differently from
pure Bayesian optimization and reinforcement learning approaches. The system
learns from its optimization history to avoid past mistakes and to accelerate
convergence. The inherent explainability makes this a powerful tool for analog
design space exploration and a new paradigm in analog EDA, where AI agents
serve as transparent design assistants.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lipeng Zu, Hansong Zhou, Xiaonan Zhang
Machine Learning
ๆฅ็ๆ่ฆ
Offline reinforcement learning (RL) enables training from fixed data without
online interaction, but policies learned offline often struggle when deployed
in dynamic environments due to distributional shift and unreliable value
estimates on unseen state-action pairs. We introduce Behavior-Adaptive
Q-Learning (BAQ), a framework designed to enable a smooth and reliable
transition from offline to online RL. The key idea is to leverage an implicit
behavioral model derived from offline data to provide a behavior-consistency
signal during online fine-tuning. BAQ incorporates a dual-objective loss that
(i) aligns the online policy toward the offline behavior when uncertainty is
high, and (ii) gradually relaxes this constraint as more confident online
experience is accumulated. This adaptive mechanism reduces error propagation
from out-of-distribution estimates, stabilizes early online updates, and
accelerates adaptation to new scenarios. Across standard benchmarks, BAQ
consistently outperforms prior offline-to-online RL approaches, achieving
faster recovery, improved robustness, and higher overall performance. Our
results demonstrate that implicit behavior adaptation is a principled and
practical solution for reliable real-world policy deployment.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Md Ahasanul Arafath, Abhijit Kumar Ghosh, Md Rony Ahmed, Sabrin Afroz, Minhazul Hosen et al.
Machine Learning
ๆฅ็ๆ่ฆ
Colorectal cancer (CRC) grading is a critical prognostic factor but remains
hampered by inter-observer variability and the privacy constraints of
multi-institutional data sharing. While deep learning offers a path to
automation, centralized training models conflict with data governance
regulations and neglect the diagnostic importance of multi-scale analysis. In
this work, we propose a scalable, privacy-preserving federated learning (FL)
framework for CRC histopathological grading that integrates multi-scale feature
learning within a distributed training paradigm. Our approach employs a
dual-stream ResNetRS50 backbone to concurrently capture fine-grained nuclear
detail and broader tissue-level context. This architecture is integrated into a
robust FL system stabilized using FedProx to mitigate client drift across
heterogeneous data distributions from multiple hospitals. Extensive evaluation
on the CRC-HGD dataset demonstrates that our framework achieves an overall
accuracy of 83.5%, outperforming a comparable centralized model (81.6%).
Crucially, the system excels in identifying the most aggressive Grade III
tumors with a high recall of 87.5%, a key clinical priority to prevent
dangerous false negatives. Performance further improves with higher
magnification, reaching 88.0% accuracy at 40x. These results validate that our
federated multi-scale approach not only preserves patient privacy but also
enhances model performance and generalization. The proposed modular pipeline,
with built-in preprocessing, checkpointing, and error handling, establishes a
foundational step toward deployable, privacy-aware clinical AI for digital
pathology.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Eugรจne Berta, David Holzmรผller, Michael I. Jordan, Francis Bach
Machine Learning
ๆฅ็ๆ่ฆ
Post-hoc recalibration methods are widely used to ensure that classifiers
provide faithful probability estimates. We argue that parametric recalibration
functions based on logistic regression can be motivated from a simple
theoretical setting for both binary and multiclass classification. This insight
motivates the use of more expressive calibration methods beyond standard
temperature scaling. For multi-class calibration however, a key challenge lies
in the increasing number of parameters introduced by more complex models, often
coupled with limited calibration data, which can lead to overfitting. Through
extensive experiments, we demonstrate that the resulting bias-variance tradeoff
can be effectively managed by structured regularization, robust preprocessing
and efficient optimization. The resulting methods lead to substantial gains
over existing logistic-based calibration techniques. We provide efficient and
easy-to-use open-source implementations of our methods, making them an
attractive alternative to common temperature, vector, and matrix scaling
implementations.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Daniel Perkins, Oscar J. Escobar, Luke Green
Machine Learning
ๆฅ็ๆ่ฆ
We present a detailed study of Deep Q-Networks in finite environments,
emphasizing the impact of epsilon-greedy exploration schedules and prioritized
experience replay. Through systematic experimentation, we evaluate how
variations in epsilon decay schedules affect learning efficiency, convergence
behavior, and reward optimization. We investigate how prioritized experience
replay leads to faster convergence and higher returns and show empirical
results comparing uniform, no replay, and prioritized strategies across
multiple simulations. Our findings illuminate the trade-offs and interactions
between exploration strategies and memory management in DQN training, offering
practical recommendations for robust reinforcement learning in
resource-constrained settings.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mahek Desai, Apoorva Rumale, Marjan Asadinia
Machine Learning
ๆฅ็ๆ่ฆ
The integration of IoT devices in healthcare introduces significant security
and reliability challenges, increasing susceptibility to cyber threats and
operational anomalies. This study proposes a machine learning-driven framework
for (1) detecting malicious cyberattacks and (2) identifying faulty device
anomalies, leveraging a dataset of 200,000 records. Eight machine learning
models are evaluated across three learning approaches: supervised learning
(XGBoost, K-Nearest Neighbors (K- NN)), semi-supervised learning (Generative
Adversarial Networks (GAN), Variational Autoencoders (VAE)), and unsupervised
learning (One-Class Support Vector Machine (SVM), Isolation Forest, Graph
Neural Networks (GNN), and Long Short-Term Memory (LSTM) Autoencoders). The
comprehensive evaluation was conducted across multiple metrics like F1-score,
precision, recall, accuracy, ROC-AUC, computational efficiency. XGBoost
achieved 99\% accuracy with minimal computational overhead (0.04s) for anomaly
detection, while Isolation Forest balanced precision and recall effectively.
LSTM Autoencoders underperformed with lower accuracy and higher latency. For
attack detection, KNN achieved near-perfect precision, recall, and F1-score
with the lowest computational cost (0.05s), followed by VAE at 97% accuracy.
GAN showed the highest computational cost with lowest accuracy and ROC-AUC.
These findings enhance IoT-enabled healthcare security through effective
anomaly detection strategies. By improving early detection of cyber threats and
device failures, this framework has the potential to prevent data breaches,
minimize system downtime, and ensure the continuous and safe operation of
medical devices, ultimately safeguarding patient health and trust in IoT-driven
healthcare solutions.
๐
2025-11-05
๐ ArXiv Preprint (cs.CC)
๐ฅ Cynthia Dwork, Pranay Tankala
ๆฅ็ๆ่ฆ
Given a small random sample of $n$-bit strings labeled by an unknown Boolean
function, which properties of this function can be tested computationally
efficiently? We show an equivalence between properties that are efficiently
testable from few samples and properties with structured symmetry, which depend
only on the function's average values on parts of a low-complexity partition of
the domain. Without the efficiency constraint, a similar characterization in
terms of unstructured symmetry was obtained by Blais and Yoshida (2019). Our
main technical tool is supersimulation, which builds on methods from the
algorithmic fairness literature to approximate arbitrarily complex functions by
small-circuit simulators that fool significantly larger distinguishers.
We extend the characterization along other axes as well. We show that
allowing parts to overlap exponentially reduces their required number,
broadening the scope of the construction from properties testable with $O(\log
n)$ samples to properties testable with $O(n)$ samples. For larger sample
sizes, we show that any efficient tester is essentially checking for
indistinguishability from a bounded collection of small circuits, in the spirit
of a characterization of testable graph properties. Finally, we show that our
results for Boolean function testing generalize to high-entropy distribution
testing on arbitrary domains.
๐
2025-11-05
๐ ArXiv Preprint (astro-ph.CO)
๐ฅ M. H. Jalali Kanafi, S. M. S. Movahed
ๆฅ็ๆ่ฆ
In this work, we perform a simulation-based forecasting analysis to compare
the constraining power of two higher-order summary statistics of the
large-scale structure (LSS), the Minkowski Functionals (MFs) and the
Conditional Moments of Derivative (CMD), with a particular focus on their
sensitivity to nonlinear and anisotropic features in redshift-space. Our
analysis relies on halo catalogs from the Big Sobol Sequence(BSQ) simulations
at redshift $z=0.5$, employing a likelihood-free inference framework
implemented via neural posterior estimation. At the fiducial cosmology of the
Quijote simulations $(\Omega_{m}=0.3175,\,\sigma_{8}=0.834)$, and for the
smoothing scale $R=15\,h^{-1}$Mpc, we find that the CMD yields tighter
forecasts for $(\Omega_{m}},\,\sigma_{8})$ than the zeroth- to third-order MFs
components, improving the constraint precision by ${\sim}(44\%,\,52\%)$,
${\sim}(30\%,\,45\%)$, ${\sim}(27\%,\,17\%)$, and ${\sim}(26\%,\,17\%)$,
respectively. A joint configuration combining the MFs and CMD further enhances
the precision by approximately ${\sim}27\%$ compared to the standard MFs alone,
highlighting the complementary anisotropy-sensitive information captured by the
CMD in contrast to the scalar morphological content encapsulated by the MFs. We
further extend the forecasting analysis to a continuous range of cosmological
parameter values and multiple smoothing scales. Our results show that, although
the absolute forecast uncertainty for each component of summary statistics
depends on the underlying parameter values and the adopted smoothing scale, the
relative constraining power among the summary statistics remains nearly
constant throughout.
๐
2025-11-05
๐ AAAI
๐ฅ Apoorva Upadhyaya, Wolfgang Nejdl, Marco Fisichella
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Zero-Shot Stance Detection (ZSSD) identifies the attitude of the post toward
unseen targets. Existing research using contrastive, meta-learning, or data
augmentation suffers from generalizability issues or lack of coherence between
text and target. Recent works leveraging large language models (LLMs) for ZSSD
focus either on improving unseen target-specific knowledge or generating
explanations for stance analysis. However, most of these works are limited by
their over-reliance on explicit reasoning, provide coarse explanations that
lack nuance, and do not explicitly model the reasoning process, making it
difficult to interpret the model's predictions. To address these issues, in our
study, we develop a novel interpretable ZSSD framework, IRIS. We provide an
interpretable understanding of the attitude of the input towards the target
implicitly based on sequences within the text (implicit rationales) and
explicitly based on linguistic measures (explicit rationales). IRIS considers
stance detection as an information retrieval ranking task, understanding the
relevance of implicit rationales for different stances to guide the model
towards correct predictions without requiring the ground-truth of rationales,
thus providing inherent interpretability. In addition, explicit rationales
based on communicative features help decode the emotional and cognitive
dimensions of stance, offering an interpretable understanding of the author's
attitude towards the given target. Extensive experiments on the benchmark
datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10%
training data prove the generalizability of our model, benefiting from the
proposed architecture and interpretable design.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alexander Pfefferle, Johannes Hog, Lennart Purucker, Frank Hutter
Machine Learning
ๆฅ็ๆ่ฆ
Tabular foundation models such as TabPFN have revolutionized predictive
machine learning for tabular data. At the same time, the driving factors of
this revolution are hard to understand. Existing open-source tabular foundation
models are implemented in complicated pipelines boasting over 10,000 lines of
code, lack architecture documentation or code quality. In short, the
implementations are hard to understand, not beginner-friendly, and complicated
to adapt for new experiments. We introduce nanoTabPFN, a simplified and
lightweight implementation of the TabPFN v2 architecture and a corresponding
training loop that uses pre-generated training data. nanoTabPFN makes tabular
foundation models more accessible to students and researchers alike. For
example, restricted to a small data setting it achieves a performance
comparable to traditional machine learning baselines within one minute of
pre-training on a single GPU (160,000x faster than TabPFN v2 pretraining). This
eliminated requirement of large computational resources makes pre-training
tabular foundation models accessible for educational purposes. Our code is
available at https://github.com/automl/nanoTabPFN.
๐
2025-11-05
๐ ArXiv Preprint (cs.IT)
๐ฅ Cemil Vahapoglu, Timothy J. O'Shea, Wan Liu, Sennur Ulukus
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Beamforming has significance for enhancing spectral efficiency and mitigating
interference in multi-antenna wireless systems, facilitating spatial
multiplexing and diversity in dense and high mobility scenarios. Traditional
beamforming techniques such as zero-forcing beamforming (ZFBF) and minimum mean
square error (MMSE) beamforming experience performance deterioration under
adverse channel conditions. Deep learning-based beamforming offers an
alternative with nonlinear mappings from channel state information (CSI) to
beamforming weights by improving robustness against dynamic channel
environments. Transformer-based models are particularly effective due to their
ability to model long-range dependencies across time and frequency. However,
their quadratic attention complexity limits scalability in large OFDM grids.
Recent studies address this issue through sparse attention mechanisms that
reduce complexity while maintaining expressiveness, yet often employ patterns
that disregard channel dynamics, as they are not specifically designed for
wireless communication scenarios. In this work, we propose a Doppler-aware
Sparse Neural Network Beamforming (Doppler-aware Sparse NNBF) model that
incorporates a channel-adaptive sparse attention mechanism in a multi-user
single-input multiple-output (MU-SIMO) setting. The proposed sparsity structure
is configurable along 2D time-frequency axes based on channel dynamics and is
theoretically proven to ensure full connectivity within p hops, where p is the
number of attention heads. Simulation results under urban macro (UMa) channel
conditions show that Doppler-aware Sparse NNBF significantly outperforms both a
fixed-pattern baseline, referred to as Standard Sparse NNBF, and conventional
beamforming techniques ZFBF and MMSE beamforming in high mobility scenarios,
while maintaining structured sparsity with a controlled number of attended keys
per query.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bartลomiej Maลkus, Szymon Bobek, Grzegorz J. Nalepa
ๆฅ็ๆ่ฆ
Small and Medium Enterprises (SMEs), particularly freelancers and early-stage
businesses, face unique financial management challenges due to limited
resources, small customer bases, and constrained data availability. This paper
presents the development and deployment of an integrated financial prediction
system that combines accounts receivable prediction and cash flow forecasting
specifically designed for SME operational constraints. Our system addresses the
gap between enterprise-focused financial tools and the practical needs of
freelancers and small businesses. The solution integrates two key components: a
binary classification model for predicting invoice payment delays, and a
multi-module cash flow forecasting model that handles incomplete and limited
historical data. A prototype system has been implemented and deployed as a web
application with integration into Cluee's platform, a startup providing
financial management tools for freelancers, demonstrating practical feasibility
for real-world SME financial management.
๐
2025-11-05
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Philipp Hager, Onno Zoeter, Maarten de Rijke
Machine Learning
ๆฅ็ๆ่ฆ
CLAX is a JAX-based library that implements classic click models using modern
gradient-based optimization. While neural click models have emerged over the
past decade, complex click models based on probabilistic graphical models
(PGMs) have not systematically adopted gradient-based optimization, preventing
practitioners from leveraging modern deep learning frameworks while preserving
the interpretability of classic models. CLAX addresses this gap by replacing
EM-based optimization with direct gradient-based optimization in a numerically
stable manner. The framework's modular design enables the integration of any
component, from embeddings and deep networks to custom modules, into classic
click models for end-to-end optimization. We demonstrate CLAX's efficiency by
running experiments on the full Baidu-ULTR dataset comprising over a billion
user sessions in $\approx$ 2 hours on a single GPU, orders of magnitude faster
than traditional EM approaches. CLAX implements ten classic click models,
serving both industry practitioners seeking to understand user behavior and
improve ranking performance at scale and researchers developing new click
models. CLAX is available at: https://github.com/philipphager/clax
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shangtong Zhang
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we formalize the almost sure convergence of $Q$-learning and
linear temporal difference (TD) learning with Markovian samples using the Lean
4 theorem prover based on the Mathlib library. $Q$-learning and linear TD are
among the earliest and most influential reinforcement learning (RL) algorithms.
The investigation of their convergence properties is not only a major research
topic during the early development of the RL field but also receives increasing
attention nowadays. This paper formally verifies their almost sure convergence
in a unified framework based on the Robbins-Siegmund theorem. The framework
developed in this work can be easily extended to convergence rates and other
modes of convergence. This work thus makes an important step towards fully
formalizing convergent RL results. The code is available at
https://github.com/ShangtongZhang/rl-theory-in-lean.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Iason Chrysomallis, Georgios Chalkiadakis
Machine Learning
ๆฅ็ๆ่ฆ
Imitation learning traditionally requires complete state-action
demonstrations from optimal or near-optimal experts. These requirements
severely limit practical applicability, as many real-world scenarios provide
only state observations without corresponding actions and expert performance is
often suboptimal. In this paper we introduce a deep implicit imitation
reinforcement learning framework that addresses both limitations by combining
deep reinforcement learning with implicit imitation learning from
observation-only datasets. Our main algorithm, Deep Implicit Imitation
Q-Network (DIIQN), employs an action inference mechanism that reconstructs
expert actions through online exploration and integrates a dynamic confidence
mechanism that adaptively balances expert-guided and self-directed learning.
This enables the agent to leverage expert guidance for accelerated training
while maintaining capacity to surpass suboptimal expert performance. We further
extend our framework with a Heterogeneous Actions DIIQN (HA-DIIQN) algorithm to
tackle scenarios where expert and agent possess different action sets, a
challenge previously unaddressed in the implicit imitation learning literature.
HA-DIIQN introduces an infeasibility detection mechanism and a bridging
procedure identifying alternative pathways connecting agent capabilities to
expert guidance when direct action replication is impossible. Our experimental
results demonstrate that DIIQN achieves up to 130% higher episodic returns
compared to standard DQN, while consistently outperforming existing implicit
imitation methods that cannot exceed expert performance. In heterogeneous
action settings, HA-DIIQN learns up to 64% faster than baselines, leveraging
expert datasets unusable by conventional approaches. Extensive parameter
sensitivity analysis reveals the framework's robustness across varying dataset
sizes and hyperparameter configurations.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Diego Martinez-Taboada, Tomas Gonzalez, Aaditya Ramdas
ๆฅ็ๆ่ฆ
The study of self-normalized processes plays a crucial role in a wide range
of applications, from sequential decision-making to econometrics. While the
behavior of self-normalized concentration has been widely investigated for
scalar-valued processes, vector-valued processes remain comparatively
underexplored, especially outside of the sub-Gaussian framework. In this
contribution, we provide concentration bounds for self-normalized processes
with light tails beyond sub-Gaussianity (such as Bennett or Bernstein bounds).
We illustrate the relevance of our results in the context of online linear
regression, with applications in (kernelized) linear bandits.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Junyi Wu, Dan Li
Machine LearningRobotics
ๆฅ็ๆ่ฆ
High-dimensional reinforcement learning faces challenges with complex
calculations and low sample efficiency in large state-action spaces. Q-learning
algorithms struggle particularly with the curse of dimensionality, where the
number of state-action pairs grows exponentially with problem size. While
neural network-based approaches like Deep Q-Networks have shown success, recent
tensor-based methods using low-rank decomposition offer more
parameter-efficient alternatives. Building upon existing tensor-based methods,
we propose Tensor-Efficient Q-Learning (TEQL), which enhances low-rank tensor
decomposition via improved block coordinate descent on discretized state-action
spaces, incorporating novel exploration and regularization mechanisms. The key
innovation is an exploration strategy that combines approximation error with
visit count-based upper confidence bound to prioritize actions with high
uncertainty, avoiding wasteful random exploration. Additionally, we incorporate
a frequency-based penalty term in the objective function to encourage
exploration of less-visited state-action pairs and reduce overfitting to
frequently visited regions. Empirical results on classic control tasks
demonstrate that TEQL outperforms conventional matrix-based methods and deep RL
approaches in both sample efficiency and total rewards, making it suitable for
resource-constrained applications, such as space and healthcare where sampling
costs are high.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mainak Singha
Machine Learning
ๆฅ็ๆ่ฆ
Neural networks can approximate solutions to partial differential equations,
but they often break the very laws they are meant to model-creating mass from
nowhere, drifting shocks, or violating conservation and entropy. We address
this by training within the laws of physics rather than beside them. Our
framework, called Constraint-Projected Learning (CPL), keeps every update
physically admissible by projecting network outputs onto the intersection of
constraint sets defined by conservation, Rankine-Hugoniot balance, entropy, and
positivity. The projection is differentiable and adds only about 10%
computational overhead, making it fully compatible with back-propagation. We
further stabilize training with total-variation damping (TVD) to suppress small
oscillations and a rollout curriculum that enforces consistency over long
prediction horizons. Together, these mechanisms eliminate both hard and soft
violations: conservation holds at machine precision, total-variation growth
vanishes, and entropy and error remain bounded. On Burgers and Euler systems,
CPL produces stable, physically lawful solutions without loss of accuracy.
Instead of hoping neural solvers will respect physics, CPL makes that behavior
an intrinsic property of the learning process.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gรผnther Schindler, Maximilian Schambach, Michael Medek, Sam Thelin
ๆฅ็ๆ่ฆ
We study LLMs for tabular prediction with mixed text, numeric, and
categorical fields. We introduce TabGemma, a schema-agnostic in-context learner
that treats rows as sequences and tackles two practical hurdles when adapting
pretrained LLMs for tabular predictions: unstable numeric tokenization and
limited context size. We propose to canonicalize numbers via signed scientific
notation and continue pretraining of a 12B Gemma 3 model with a target
imputation objective using a large-scale real world dataset. For inference, we
use a compact n-gram-based retrieval to select informative exemplars that fit
within a 128k-token window.
On semantically rich benchmarks, TabGemma establishes a new state of the art
on classification across low- and high-data regimes and improves monotonically
with more context rows. For regression, it is competitive at small sample sizes
but trails conventional approaches as data grows. Our results show that LLMs
can be effective tabular in-context learners on highly semantic tasks when
paired with dedicated numeric handling and context retrieval, while motivating
further advances in numeric modeling and long-context scaling.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Iason Chrysomallis, Georgios Chalkiadakis
Machine Learning
ๆฅ็ๆ่ฆ
Imitation learning (IL) enables agents to acquire skills by observing and
replicating the behavior of one or multiple experts. In recent years, advances
in deep learning have significantly expanded the capabilities and scalability
of imitation learning across a range of domains, where expert data can range
from full state-action trajectories to partial observations or unlabeled
sequences. Alongside this growth, novel approaches have emerged, with new
methodologies being developed to address longstanding challenges such as
generalization, covariate shift, and demonstration quality. In this survey, we
review the latest advances in imitation learning research, highlighting recent
trends, methodological innovations, and practical applications. We propose a
novel taxonomy that is distinct from existing categorizations to better reflect
the current state of the IL research stratum and its trends. Throughout the
survey, we critically examine the strengths, limitations, and evaluation
practices of representative works, and we outline key challenges and open
directions for future research.
๐
2025-11-05
๐ ArXiv Preprint (math.ST)
๐ฅ Ido Nachum, Rรผdiger Urbanke, Thomas Weinberger
ๆฅ็ๆ่ฆ
Despite ongoing theoretical research on cross-validation (CV), many
theoretical questions about CV remain widely open. This motivates our
investigation into how properties of algorithm-distribution pairs can affect
the choice for the number of folds in $k$-fold cross-validation.
Our results consist of a novel decomposition of the mean-squared error of
cross-validation for risk estimation, which explicitly captures the
correlations of error estimates across overlapping folds and includes a novel
algorithmic stability notion, squared loss stability, that is considerably
weaker than the typically required hypothesis stability in other comparable
works.
Furthermore, we prove:
1. For every learning algorithm that minimizes empirical error, a minimax
lower bound on the mean-squared error of $k$-fold CV estimating the population
risk $L_\mathcal{D}$: \[ \min_{k \mid n}\; \max_{\mathcal{D}}\;
\mathbb{E}\!\left[\big(\widehat{L}_{\mathrm{CV}}^{(k)} -
L_{\mathcal{D}}\big)^{2}\right] \;=\; \Omega\!\big(\sqrt{k}/n\big), \] where
$n$ is the sample size and $k$ the number of folds. This shows that even under
idealized conditions, for large values of $k$, CV cannot attain the optimum of
order $1/n$ achievable by a validation set of size $n$, reflecting an inherent
penalty caused by dependence between folds.
2. Complementing this, we exhibit learning rules for which \[
\max_{\mathcal{D}}\; \mathbb{E}\!\left[\big(\widehat{L}_{\mathrm{CV}}^{(k)} -
L_{\mathcal{D}}\big)^{2}\right] \;=\; \Omega(k/n), \] matching (up to
constants) the accuracy of a hold-out estimator of a single fold of size $n/k$.
Together these results delineate the fundamental trade-off in
resampling-based risk estimation: CV cannot fully exploit all $n$ samples for
unbiased risk evaluation, and its minimax performance is pinned between the
$k/n$ and $\sqrt{k}/n$ regimes.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Matan Schliserman, Shira Vansover-Hager, Tomer Koren
Machine Learning
ๆฅ็ๆ่ฆ
Understanding the generalization behavior of learning algorithms is a central
goal of learning theory. A recently emerging explanation is that learning
algorithms are successful in practice because they converge to flat minima,
which have been consistently associated with improved generalization
performance. In this work, we study the link between flat minima and
generalization in the canonical setting of stochastic convex optimization with
a non-negative, $\beta$-smooth objective. Our first finding is that, even in
this fundamental and well-studied setting, flat empirical minima may incur
trivial $\Omega(1)$ population risk while sharp minima generalizes optimally.
Then, we show that this poor generalization behavior extends to two natural
''sharpness-aware'' algorithms originally proposed by Foret et al. (2021),
designed to bias optimization toward flat solutions: Sharpness-Aware Gradient
Descent (SA-GD) and Sharpness-Aware Minimization (SAM). For SA-GD, which
performs gradient steps on the maximal loss in a predefined neighborhood, we
prove that while it successfully converges to a flat minimum at a fast rate,
the population risk of the solution can still be as large as $\Omega(1)$,
indicating that even flat minima found algorithmically using a sharpness-aware
gradient method might generalize poorly. For SAM, a computationally efficient
approximation of SA-GD based on normalized ascent steps, we show that although
it minimizes the empirical loss, it may converge to a sharp minimum and also
incur population risk $\Omega(1)$. Finally, we establish population risk upper
bounds for both SA-GD and SAM using algorithmic stability techniques.
๐
2025-11-05
๐ Paper submitted to WSEAS Signal Processing Journal
๐ฅ Marc Martinez-Gost, Sara Pepe, Ana Pรฉrez-Neira, Miguel รngel Lagunas
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we extend our previous work on the Expressive Neural Network
(ENN), a multilayer perceptron with adaptive activation functions parametrized
using the Discrete Cosine Transform (DCT). Building upon previous work that
demonstrated the strong expressiveness of ENNs with compact architectures, we
now emphasize their efficiency, interpretability and pruning capabilities. The
DCT-based parameterization provides a structured and decorrelated
representation that reveals the functional role of each neuron and allows
direct identification of redundant components. Leveraging this property, we
propose an efficient pruning strategy that removes unnecessary DCT coefficients
with negligible or no loss in performance. Experimental results across
classification and implicit neural representation tasks confirm that ENNs
achieve state-of-the-art accuracy while maintaining a low number of parameters.
Furthermore, up to 40% of the activation coefficients can be safely pruned,
thanks to the orthogonality and bounded nature of the DCT basis. Overall, these
findings demonstrate that the ENN framework offers a principled integration of
signal processing concepts into neural network design, achieving a balanced
trade-off between expressiveness, compactness, and interpretability.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Javad Parsa, Amir Hossein Daghestani, Andrรฉ M. H. Teixeira, Mikael Johansson
Machine Learning
ๆฅ็ๆ่ฆ
Federated Learning (FL) enables clients to collaboratively train a global
model without sharing their private data. However, the presence of malicious
(Byzantine) clients poses significant challenges to the robustness of FL,
particularly when data distributions across clients are heterogeneous. In this
paper, we propose a novel Byzantine-robust FL optimization problem that
incorporates adaptive weighting into the aggregation process. Unlike
conventional approaches, our formulation treats aggregation weights as
learnable parameters, jointly optimizing them alongside the global model
parameters. To solve this optimization problem, we develop an alternating
minimization algorithm with strong convergence guarantees under adversarial
attack. We analyze the Byzantine resilience of the proposed objective. We
evaluate the performance of our algorithm against state-of-the-art
Byzantine-robust FL approaches across various datasets and attack scenarios.
Experimental results demonstrate that our method consistently outperforms
existing approaches, particularly in settings with highly heterogeneous data
and a large proportion of malicious clients.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bryan L. M. de Oliveira, Felipe V. Frujeri, Marcos P. C. M. Queiroz, Luana G. B. Martins, Telma W. de L. Soares et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Group Relative Policy Optimization (GRPO) has emerged as a scalable
alternative to Proximal Policy Optimization (PPO) by eliminating the learned
critic and instead estimating advantages through group-relative comparisons of
trajectories. This simplification raises fundamental questions about the
necessity of learned baselines in policy-gradient methods. We present the first
systematic study of GRPO in classical single-task reinforcement learning
environments, spanning discrete and continuous control tasks. Through
controlled ablations isolating baselines, discounting, and group sampling, we
reveal three key findings: (1) learned critics remain essential for
long-horizon tasks: all critic-free baselines underperform PPO except in
short-horizon environments like CartPole where episodic returns can be
effective; (2) GRPO benefits from high discount factors (gamma = 0.99) except
in HalfCheetah, where lack of early termination favors moderate discounting
(gamma = 0.9); (3) smaller group sizes outperform larger ones, suggesting
limitations in batch-based grouping strategies that mix unrelated episodes.
These results reveal both the limitations of critic-free methods in classical
control and the specific conditions where they remain viable alternatives to
learned value functions.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kazi Reyazul Hasan, Mubasshira Musarrat, A. B. M. Alim Al Islam, Muhammad Abdullah Adnan
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models work well for technical problem solving in English but
perform poorly when the same questions are asked in Bangla. A simple solution
would be to translate Bangla questions into English first and then use these
models. However, existing Bangla-English translation systems struggle with
technical terms. They often mistranslate specialized vocabulary, which changes
the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a
dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM
fields including computer science, mathematics, physics, chemistry, and
biology. We generated over 12,000 translations using language models and then
used human evaluators to select the highest quality pairs that preserve
technical terminology correctly. We train a T5-based translation model on
BanglaSTEM and test it on two tasks: generating code and solving math problems.
Our results show significant improvements in translation accuracy for technical
content, making it easier for Bangla speakers to use English-focused language
models effectively. Both the BanglaSTEM dataset and the trained translation
model are publicly released at https://huggingface.co/reyazul/BanglaSTEM-T5.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Elvis Dohmatob, Mohammad Pezeshki, Reyhane Askari-Hemmat
Machine Learning
ๆฅ็ๆ่ฆ
This paper introduces a theoretical framework to resolve a central paradox in
modern machine learning: When is it better to use less data? This question has
become critical as classical scaling laws suggesting ``more is more'' (Sun et
al., 2025) are challenged by methods like LIMO (``less is more'') and s1 (Ye et
al., 2025; Muenighoff et al., 2025), which achieve superior performance with
small, aggressively curated datasets. Here, we study data curation strategies
where an imperfect oracle selects the training examples according to their
difficulty and correctness. Our results provide exact scaling law curves for
test error under both label-agnostic and label-aware curation rules, revealing
when and why keeping only a subset of data can improve generalization. In
contrast to classical scaling laws, we show that under certain conditions,
small curated datasets can outperform full datasets, and we provide analytical
conditions for this by deriving precise phase transition curves tied to data
size and quality. We validate these theoretical claims with empirical results
on ImageNet, confirming our predictions about when curation improves accuracy
and can even mitigate model collapse. Furthermore, our framework provides a
principled explanation for the contradictory curation strategies recently
observed in LLM mathematical reasoning.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alvise Dei Rossi, Julia van der Meer, Markus H. Schmidt, Claudio L. A. Bassetti, Luigi Fiorillo et al.
Multimodal
ๆฅ็ๆ่ฆ
Polysomnography signals are highly heterogeneous, varying in modality
composition (e.g., EEG, EOG, ECG), channel availability (e.g., frontal,
occipital EEG), and acquisition protocols across datasets and clinical sites.
Most existing models that process polysomnography data rely on a fixed subset
of modalities or channels and therefore neglect to fully exploit its inherently
multimodal nature. We address this limitation by introducing NAP (Neural
Aggregator of Predictions), an attention-based model which learns to combine
multiple prediction streams using a tri-axial attention mechanism that captures
temporal, spatial, and predictor-level dependencies. NAP is trained to adapt to
different input dimensions. By aggregating outputs from frozen, pretrained
single-channel models, NAP consistently outperforms individual predictors and
simple ensembles, achieving state-of-the-art zero-shot generalization across
multiple datasets. While demonstrated in the context of automated sleep staging
from polysomnography, the proposed approach could be extended to other
multimodal physiological applications.
๐
2025-11-05
๐ ArXiv Preprint (eess.SY)
๐ฅ Giorgio Palma, Ivan Santic, Andrea Serani, Lorenzo Minno, Matteo Diez
Robotics
ๆฅ็ๆ่ฆ
This study addresses the system identification of a small autonomous surface
vehicle (ASV) under moored conditions using Hankel dynamic mode decomposition
with control (HDMDc) and its Bayesian extension (BHDMDc). Experiments were
carried out on a Codevintec CK-14e ASV in the towing tank of CNR-INM, under
both irregular and regular head-sea wave conditions. The ASV under
investigation features a recessed moon pool, which induces nonlinear responses
due to sloshing, thereby increasing the modelling challenge. Data-driven
reduced-order models were built from measurements of vessel motions and mooring
loads. The HDMDc framework provided accurate deterministic predictions of
vessel dynamics, while the Bayesian formulation enabled uncertainty-aware
characterization of the model response by accounting for variability in
hyperparameter selection. Validation against experimental data demonstrated
that both HDMDc and BHDMDc can predict the vessel's response to unseen regular
and irregular wave excitations. In conclusion, the study shows that HDMDc-based
ROMs are a viable data-driven alternative for system identification,
demonstrating for the first time their generalization capability for a sea
condition different from the training set, achieving high accuracy in
reproducing vessel dynamics.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-augmented generation (RAG) enhances large language models (LLMs)
with retrieved context but often suffers from downgraded prefill performance as
modern applications demand longer and more complex inputs. Existing caching
techniques either preserve accuracy with low cache reuse or improve reuse at
the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG
system that achieves high cache reuse without sacrificing accuracy through
accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items
across concurrent sessions and multi-turn interactions, using efficient context
indexing, ordering, and de-duplication to maximize reuse, while lightweight
contextual hints maintain reasoning fidelity. It integrates seamlessly with
existing LLM inference engines and improves their prefill performance by 1.5-3X
over state-of-the-art methods, while preserving or even enhancing reasoning
accuracy across diverse RAG and agentic AI workloads. Our code is released at:
https://github.com/Edinburgh-AgenticAI/RAGBoost.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alexandru Cioba, Aya Kayal, Laura Toni, Sattar Vakili, Alberto Bernacchia
Machine Learning
ๆฅ็ๆ่ฆ
In many real-world reinforcement learning (RL) problems, the environment
exhibits inherent symmetries that can be exploited to improve learning
efficiency. This paper develops a theoretical and algorithmic framework for
incorporating known group symmetries into kernel-based RL. We propose a
symmetry-aware variant of optimistic least-squares value iteration (LSVI),
which leverages invariant kernels to encode invariance in both rewards and
transition dynamics. Our analysis establishes new bounds on the maximum
information gain and covering numbers for invariant RKHSs, explicitly
quantifying the sample efficiency gains from symmetry. Empirical results on a
customized Frozen Lake environment and a 2D placement design problem confirm
the theoretical improvements, demonstrating that symmetry-aware RL achieves
significantly better performance than their standard kernel counterparts. These
findings highlight the value of structural priors in designing more
sample-efficient reinforcement learning algorithms.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mihriban Kocak Balik, Pekka Marttinen, Negar Safinianaini
Machine Learning
ๆฅ็ๆ่ฆ
Integrating different molecular layers, i.e., multiomics data, is crucial for
unraveling the complexity of diseases; yet, most deep generative models either
prioritize predictive performance at the expense of interpretability or enforce
interpretability by linearizing the decoder, thereby weakening the network's
nonlinear expressiveness. To overcome this tradeoff, we introduce POEMS:
Product Of Experts for Interpretable Multiomics Integration using Sparse
Decoding, an unsupervised probabilistic framework that preserves predictive
performance while providing interpretability. POEMS provides interpretability
without linearizing any part of the network by 1) mapping features to latent
factors using sparse connections, which directly translates to biomarker
discovery, 2) allowing for cross-omic associations through a shared latent
space using product of experts model, and 3) reporting contributions of each
omic by a gating network that adaptively computes their influence in the
representation learning. Additionally, we present an efficient sparse decoder.
In a cancer subtyping case study, POEMS achieves competitive clustering and
classification performance while offering our novel set of interpretations,
demonstrating that biomarker based insight and predictive accuracy can coexist
in multiomics representation learning.
๐
2025-11-05
๐ ArXiv Preprint (math.OC)
๐ฅ Lei Wang, Xin Liu, Xiaojun Chen
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we investigate optimization problems with nonnegative and
orthogonal constraints, where any feasible matrix of size $n \times p$ exhibits
a sparsity pattern such that each row accommodates at most one nonzero entry.
Our analysis demonstrates that, by fixing the support set, the global solution
of the minimization subproblem for the proximal linearization of the objective
function can be computed in closed form with at most $n$ nonzero entries.
Exploiting this structural property offers a powerful avenue for dramatically
enhancing computational efficiency. Guided by this insight, we propose a
support-set algorithm preserving strictly the feasibility of iterates. A
central ingredient is a strategically devised update scheme for support sets
that adjusts the placement of nonzero entries. We establish the global
convergence of the support-set algorithm to a first-order stationary point, and
show that its iteration complexity required to reach an $\epsilon$-approximate
first-order stationary point is $O (\epsilon^{-2})$. Numerical results are
strongly in favor of our algorithm in real-world applications, including
nonnegative PCA, clustering, and community detection.
๐
2025-11-05
๐ ACM Multimedia 2025. Extended version with supplem
๐ฅ Ilya Borovik, Dmitrii Gavrilev, Vladimir Viro
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Emotions are fundamental to the creation and perception of music
performances. However, achieving human-like expression and emotion through
machine learning models for performance rendering remains a challenging task.
In this work, we present SyMuPe, a novel framework for developing and training
affective and controllable symbolic piano performance models. Our flagship
model, PianoFlow, uses conditional flow matching trained to solve diverse
multi-mask performance inpainting tasks. By design, it supports both
unconditional generation and infilling of music performance features. For
training, we use a curated, cleaned dataset of 2,968 hours of aligned musical
scores and expressive MIDI performances. For text and emotion control, we
integrate a piano performance emotion classifier and tune PianoFlow with the
emotion-weighted Flan-T5 text embeddings provided as conditional inputs.
Objective and subjective evaluations against transformer-based baselines and
existing models show that PianoFlow not only outperforms other approaches, but
also achieves performance quality comparable to that of human-recorded and
transcribed MIDI samples. For emotion control, we present and analyze samples
generated under different text conditioning scenarios. The developed model can
be integrated into interactive applications, contributing to the creation of
more accessible and engaging music performance systems.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Alexandros Vazaios, Jannis Brugger, Cedric Derstroff, Kristian Kersting, Mira Mezini
Machine Learning
ๆฅ็ๆ่ฆ
AlphaZero-like Monte Carlo Tree Search systems, originally introduced for
two-player games, dynamically balance exploration and exploitation using neural
network guidance. This combination makes them also suitable for classical
search problems. However, the original method of training the network with
simulation results is limited in sparse reward settings, especially in the
early stages, where the network cannot yet give guidance. Hindsight Experience
Replay (HER) addresses this issue by relabeling unsuccessful trajectories from
the search tree as supervised learning signals. We introduce Adaptable HER
(\ours{}), a flexible framework that integrates HER with AlphaZero, allowing
easy adjustments to HER properties such as relabeled goals, policy targets, and
trajectory selection. Our experiments, including equation discovery, show that
the possibility of modifying HER is beneficial and surpasses the performance of
pure supervised or reinforcement learning.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hongrun Ren, Yun Xiong, Lei You, Yingying Wang, Haixu Xiong et al.
Machine Learning
ๆฅ็ๆ่ฆ
The rise of the machine learning (ML) model economy has intertwined markets
for training datasets and pre-trained models. However, most pricing approaches
still separate data and model transactions or rely on broker-centric pipelines
that favor one side. Recent studies of data markets with externalities capture
buyer interactions but do not yield a simultaneous and symmetric mechanism
across data sellers, model producers, and model buyers. We propose a unified
data-model coupled market that treats dataset and model trading as a single
system. A supply-side mapping transforms dataset payments into buyer-visible
model quotations, while a demand-side mapping propagates buyer prices back to
datasets through Shapley-based allocation. Together, they form a closed loop
that links four interactions: supply-demand propagation in both directions and
mutual coupling among buyers and among sellers. We prove that the joint
operator is a standard interference function (SIF), guaranteeing existence,
uniqueness, and global convergence of equilibrium prices. Experiments
demonstrate efficient convergence and improved fairness compared with
broker-centric and one-sided baselines. The code is available on
https://github.com/HongrunRen1109/Triple-Win-Pricing.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Gahyeon Kim, Sohee Kim, Seokju Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recent advances in large-scale vision and language models have led to
significant progress in zero-shot learning tasks. Methods such as CoOp and
CoCoOp have shown that replacing handcrafted prompts with learnable vectors,
known as prompt learning, can result in improved performance. However, these
models often struggle to generalize to entirely unseen categories. While
traditional zero-shot learning techniques benefit from various data
augmentation strategies, prompt learning has primarily focused on text-based
modifications, leaving the potential of image-based augmentation largely
unexplored. In this work, we explore how image-level augmentations,
particularly those that introduce attribute-specific variations, can support
and enhance prompt learning. Our analysis examines the interaction between
these augmentations and soft prompt frameworks, revealing their potential to
improve generalization. We also identify a limitation in existing methods, such
as CoCoOp, which do not provide explicit guidance for learning prompts that
focus on semantically meaningful visual features. To address this, we propose
Adding Attributes to Prompt Learning, AAPL, a novel method that introduces
adversarial token embeddings to decouple superficial visual variations
introduced by augmentation from class-relevant semantic representations. This
decoupling enables the learned prompts to concentrate on visually
discriminative features that align with the target categories. We conduct
comprehensive experiments on eleven benchmark datasets, and AAPL consistently
outperforms existing methods across few-shot, zero-shot, cross-dataset, and
domain generalization settings. Our source code is publicly available at:
https://github.com/Gahyeonkim09/AAPL
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xiaocai Zhang, Hur Lim, Ke Wang, Zhe Xiao, Jing Wang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
In this study, a modular, data-free pipeline for multi-label intention
recognition is proposed for agentic AI applications in transportation. Unlike
traditional intent recognition systems that depend on large, annotated corpora
and often struggle with fine-grained, multi-label discrimination, our approach
eliminates the need for costly data collection while enhancing the accuracy of
multi-label intention understanding. Specifically, the overall pipeline, named
DMTC, consists of three steps: 1) using prompt engineering to guide large
language models (LLMs) to generate diverse synthetic queries in different
transport scenarios; 2) encoding each textual query with a Sentence-T5 model to
obtain compact semantic embeddings; 3) training a lightweight classifier using
a novel online focal-contrastive (OFC) loss that emphasizes hard samples and
maximizes inter-class separability. The applicability of the proposed pipeline
is demonstrated in an agentic AI application in the maritime transportation
context. Extensive experiments show that DMTC achieves a Hamming loss of 5.35%
and an AUC of 95.92%, outperforming state-of-the-art multi-label classifiers
and recent end-to-end SOTA LLM-based baselines. Further analysis reveals that
Sentence-T5 embeddings improve subset accuracy by at least 3.29% over
alternative encoders, and integrating the OFC loss yields an additional 0.98%
gain compared to standard contrastive objectives. In conclusion, our system
seamlessly routes user queries to task-specific modules (e.g., ETA information,
traffic risk evaluation, and other typical scenarios in the transportation
domain), laying the groundwork for fully autonomous, intention-aware agents
without costly manual labelling.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Elif Arslan, Jacobus G. M. van der Linden, Serge Hoogendoorn, Marco Rinaldi, Emir Demiroviฤ
ๆฅ็ๆ่ฆ
Sparse decision tree learning provides accurate and interpretable predictive
models that are ideal for high-stakes applications by finding the single most
accurate tree within a (soft) size limit. Rather than relying on a single
"best" tree, Rashomon sets-trees with similar performance but varying
structures-can be used to enhance variable importance analysis, enrich
explanations, and enable users to choose simpler trees or those that satisfy
stakeholder preferences (e.g., fairness) without hard-coding such criteria into
the objective function. However, because finding the optimal tree is NP-hard,
enumerating the Rashomon set is inherently challenging. Therefore, we introduce
SORTD, a novel framework that improves scalability and enumerates trees in the
Rashomon set in order of the objective value, thus offering anytime behavior.
Our experiments show that SORTD reduces runtime by up to two orders of
magnitude compared with the state of the art. Moreover, SORTD can compute
Rashomon sets for any separable and totally ordered objective and supports
post-evaluating the set using other separable (and partially ordered)
objectives. Together, these advances make exploring Rashomon sets more
practical in real-world applications.
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jindong Hong, Tianjie Chen, Lingjie Luo, Chuanyang Zheng, Ting Xu et al.
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
A recent advancement in Multimodal Large Language Models (MLLMs) research is
the emergence of "reasoning MLLMs" that offer explicit control over their
internal thinking processes (normally referred as the "thinking mode")
alongside the standard "non-thinking mode". This capability allows these models
to engage in a step-by-step process of internal deliberation before generating
a final response. With the rapid transition to and adoption of these
"dual-state" MLLMs, this work rigorously evaluated how the enhanced reasoning
processes of these MLLMs impact model performance and reliability in clinical
tasks. This paper evaluates the active "thinking mode" capabilities of two
leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We
assessed their performance on four visual medical tasks using VQA-RAD and
ROCOv2 datasets. Our findings reveal that the improvement from activating the
thinking mode remains marginal compared to the standard non-thinking mode for
the majority of the tasks. Their performance on complex medical tasks such as
open-ended VQA and medical image interpretation remains suboptimal,
highlighting the need for domain-specific medical data and more advanced
methods for medical knowledge integration.
๐
2025-11-05
๐ 12 pages, IEEE International Conference on Quantum
๐ฅ Aakash Ravindra Shinde, Jukka K. Nurminen
Machine Learning
ๆฅ็ๆ่ฆ
Data dimensionality reduction techniques are often utilized in the
implementation of Quantum Machine Learning models to address two significant
issues: the constraints of NISQ quantum devices, which are characterized by
noise and a limited number of qubits, and the challenge of simulating a large
number of qubits on classical devices. It also raises concerns over the
scalability of these approaches, as dimensionality reduction methods are slow
to adapt to large datasets. In this article, we analyze how data reduction
methods affect different QML models. We conduct this experiment over several
generated datasets, quantum machine algorithms, quantum data encoding methods,
and data reduction methods. All these models were evaluated on the performance
metrics like accuracy, precision, recall, and F1 score. Our findings have led
us to conclude that the usage of data dimensionality reduction methods results
in skewed performance metric values, which results in wrongly estimating the
actual performance of quantum machine learning models. There are several
factors, along with data dimensionality reduction methods, that worsen this
problem, such as characteristics of the datasets, classical to quantum
information embedding methods, percentage of feature reduction, classical
components associated with quantum models, and structure of quantum machine
learning models. We consistently observed the difference in the accuracy range
of 14% to 48% amongst these models, using data reduction and not using it.
Apart from this, our observations have shown that some data reduction methods
tend to perform better for some specific data embedding methodologies and
ansatz constructions.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Felix Stรถrck, Fabian Hinder, Barbara Hammer
Machine Learning
ๆฅ็ๆ่ฆ
With the on-going integration of machine learning systems into the everyday
social life of millions the notion of fairness becomes an ever increasing
priority in their development. Fairness notions commonly rely on protected
attributes to assess potential biases. Here, the majority of literature focuses
on discrete setups regarding both target and protected attributes. The
literature on continuous attributes especially in conjunction with regression
-- we refer to this as \emph{continuous fairness} -- is scarce. A common
strategy is iterative null-space projection which as of now has only been
explored for linear models or embeddings such as obtained by a non-linear
encoder. We improve on this by generalizing to kernel methods, significantly
extending the scope. This yields a model and fairness-score agnostic method for
kernel embeddings applicable to continuous protected attributes. We demonstrate
that our novel approach in conjunction with Support Vector Regression (SVR)
provides competitive or improved performance across multiple datasets in
comparisons to other contemporary methods.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qingyuan Zhang, Ning Lyu, Le Liu, Yuxi Wang, Ziyu Cheng et al.
Machine Learning
ๆฅ็ๆ่ฆ
This study addresses the problem of anomaly detection and root cause tracing
in microservice architectures and proposes a unified framework that combines
graph neural networks with temporal modeling. The microservice call chain is
abstracted as a directed graph, where multidimensional features of nodes and
edges are used to construct a service topology representation, and graph
convolution is applied to aggregate features across nodes and model
dependencies, capturing complex structural relationships among services. On
this basis, gated recurrent units are introduced to model the temporal
evolution of call chains, and multi-layer stacking and concatenation operations
are used to jointly obtain structural and temporal representations, improving
the ability to identify anomaly patterns. Furthermore, anomaly scoring
functions at both the node and path levels are defined to achieve unified
modeling from local anomaly detection to global call chain tracing, which
enables the identification of abnormal service nodes and the reconstruction of
potential anomaly propagation paths. Sensitivity experiments are then designed
from multiple dimensions, including hyperparameters, environmental
disturbances, and data distribution, to evaluate the framework, and results
show that it outperforms baseline methods in key metrics such as AUC, ACC,
Recall, and F1-Score, maintaining high accuracy and stability under dynamic
topologies and complex environments. This research not only provides a new
technical path for anomaly detection in microservices but also lays a
methodological foundation for intelligent operations in distributed systems.
๐
2025-11-05
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Xinyu Ning, Yan Zhuo, Xian Wang, Chan-In Devin Sio, Lik-Hang Lee
ๆฅ็ๆ่ฆ
With the continuous advancement of technology, the application of generative
artificial intelligence (AI) in various fields is gradually demonstrating great
potential, particularly when combined with Extended Reality (XR), creating
unprecedented possibilities. This survey article systematically reviews the
applications of generative AI in XR, covering as much relevant literature as
possible from 2023 to 2025. The application areas of generative AI in XR and
its key technology implementations are summarised through PRISMA screening and
analysis of the final 26 articles. The survey highlights existing articles from
the last three years related to how XR utilises generative AI, providing
insights into current trends and research gaps. We also explore potential
opportunities for future research to further empower XR through generative AI,
providing guidance and information for future generative XR research.
๐
2025-11-05
๐ NeurIPS
๐ฅ Rob Romijnders, Gabriele Cesa, Christos Louizos, Kumar Pratik, Arash Behboodi
Computer Vision
ๆฅ็ๆ่ฆ
From molecular imaging to wireless communications, the ability to align and
reconstruct signals from multiple misaligned observations is crucial for system
performance. We study the problem of multi-reference alignment (MRA), which
arises in many real-world problems, such as cryo-EM, computer vision, and, in
particular, wireless communication systems. Using a probabilistic approach to
model MRA, we find a new algorithm that uses relative poses as nuisance
variables to marginalize out -- thereby removing the global symmetries of the
problem and allowing for more direct solutions and improved convergence. The
decentralization of this approach enables significant computational savings by
avoiding the cubic scaling of centralized methods through cycle consistency.
Both proposed algorithms achieve lower reconstruction error across experimental
settings.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ning Lyu, Yuxi Wang, Ziyu Cheng, Qingyuan Zhang, Feng Chen
Machine Learning
ๆฅ็ๆ่ฆ
As cloud computing and microservice architectures become increasingly
prevalent, API rate limiting has emerged as a critical mechanism for ensuring
system stability and service quality. Traditional rate limiting algorithms,
such as token bucket and sliding window, while widely adopted, struggle to
adapt to dynamic traffic patterns and varying system loads. This paper proposes
an adaptive rate limiting strategy based on deep reinforcement learning that
dynamically balances system throughput and service latency. We design a hybrid
architecture combining Deep Q-Network (DQN) and Asynchronous Advantage
Actor-Critic (A3C) algorithms, modeling the rate limiting decision process as a
Markov Decision Process. The system continuously monitors microservice states
and learns optimal rate limiting policies through environmental interaction.
Extensive experiments conducted in a Kubernetes cluster environment demonstrate
that our approach achieves 23.7% throughput improvement and 31.4% P99 latency
reduction compared to traditional fixed-threshold strategies under high-load
scenarios. Results from a 90-day production deployment handling 500 million
daily requests validate the practical effectiveness of the proposed method,
with 82% reduction in service degradation incidents and 68% decrease in manual
interventions.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Under strictly controlled pre-training settings, we observe a Crossover: when
unique data is limited, diffusion language models (DLMs) consistently surpass
autoregressive (AR) models by training for more epochs. The crossover shifts
later with more or higher-quality data, earlier with larger models, and
persists across dense and sparse architectures. We attribute the gains to three
compounding factors: (1) any-order modeling, (2) super-dense compute from
iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation;
input or parameter noise improves AR under data constraint but cannot close the
gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B
unique Python tokens overtakes an AR coder trained with strictly matched
settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag
and > 33% on MMLU using only 1B tokens, without any special tricks, just by
repeating standard pre-training data. We also show that rising validation
cross-entropy does not imply degraded downstream performance in this regime.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Jing Ma, Hanlin Li, Xiang Xiang
Machine Learning
ๆฅ็ๆ่ฆ
Entropy Minimization (EM) is beneficial to reducing class overlap, bridging
domain gap, and restricting uncertainty for various tasks in machine learning,
yet its potential is limited. To study the internal mechanism of EM, we
reformulate and decouple the classical EM into two parts with opposite effects:
cluster aggregation driving factor (CADF) rewards dominant classes and prompts
a peaked output distribution, while gradient mitigation calibrator (GMC)
penalizes high-confidence classes based on predicted probabilities.
Furthermore, we reveal the limitations of classical EM caused by its coupled
formulation: 1) reward collapse impedes the contribution of high-certainty
samples in the learning process, and 2) easy-class bias induces misalignment
between output distribution and label distribution. To address these issues, we
propose Adaptive Decoupled Entropy Minimization (AdaDEM), which normalizes the
reward brought from CADF and employs a marginal entropy calibrator (MEC) to
replace GMC. AdaDEM outperforms DEM*, an upper-bound variant of classical EM,
and achieves superior performance across various imperfectly supervised
learning tasks in noisy and dynamic environments.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhibin Wang, Zhixing Zhang, Shuqi Wang, Xuanting Xie, Zhao Kang
Machine Learning
ๆฅ็ๆ่ฆ
Graph Neural Networks (GNNs) have demonstrated impressive performance on
task-specific benchmarks, yet their ability to generalize across diverse
domains and tasks remains limited. Existing approaches often struggle with
negative transfer, scalability issues, and high adaptation costs. To address
these challenges, we propose GMoPE (Graph Mixture of Prompt-Experts), a novel
framework that seamlessly integrates the Mixture-of-Experts (MoE) architecture
with prompt-based learning for graphs. GMoPE leverages expert-specific prompt
vectors and structure-aware MoE routing to enable each expert to specialize in
distinct subdomains and dynamically contribute to predictions. To promote
diversity and prevent expert collapse, we introduce a soft orthogonality
constraint across prompt vectors, encouraging expert specialization and
facilitating a more balanced expert utilization. Additionally, we adopt a
prompt-only fine-tuning strategy that significantly reduces spatiotemporal
complexity during transfer. We validate GMoPE through extensive experiments
under various pretraining strategies and multiple downstream tasks. Results
show that GMoPE consistently outperforms state-of-the-art baselines and
achieves performance comparable to full parameter fine-tuning-while requiring
only a fraction of the adaptation overhead. Our work provides a principled and
scalable framework for advancing generalizable and efficient graph foundation
models.
๐
2025-11-05
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Amy Chang, Nicholas Conley, Harish Santhanalakshmi Ganesan, Adam Swanda
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Open-weight models provide researchers and developers with accessible
foundations for diverse downstream applications. We tested the safety and
security postures of eight open-weight large language models (LLMs) to identify
vulnerabilities that may impact subsequent fine-tuning and deployment. Using
automated adversarial testing, we measured each model's resilience against
single-turn and multi-turn prompt injection and jailbreak attacks. Our findings
reveal pervasive vulnerabilities across all tested models, with multi-turn
attacks achieving success rates between 25.86\% and 92.78\% -- representing a
$2\times$ to $10\times$ increase over single-turn baselines. These results
underscore a systemic inability of current open-weight models to maintain
safety guardrails across extended interactions. We assess that alignment
strategies and lab priorities significantly influence resilience:
capability-focused models such as Llama 3.3 and Qwen 3 demonstrate higher
multi-turn susceptibility, whereas safety-oriented designs such as Google Gemma
3 exhibit more balanced performance.
The analysis concludes that open-weight models, while crucial for innovation,
pose tangible operational and ethical risks when deployed without layered
security controls. These findings are intended to inform practitioners and
developers of the potential risks and the value of professional AI security
solutions to mitigate exposure. Addressing multi-turn vulnerabilities is
essential to ensure the safe, reliable, and responsible deployment of
open-weight LLMs in enterprise and public domains. We recommend adopting a
security-first design philosophy and layered protections to ensure resilient
deployments of open-weight models.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Miguel Costa, Arthur Vandervoort, Martin Drews, Karyn Morrissey, Francisco C. Pereira
Machine Learning
ๆฅ็ๆ่ฆ
Climate change will cause an increase in the frequency and severity of flood
events, prompting the need for cohesive adaptation policymaking. Designing
effective adaptation policies, however, depends on managing the uncertainty of
long-term climate impacts. Meanwhile, such policies can feature important
normative choices that are not always made explicit. We propose that
Reinforcement Learning (RL) can be a useful tool to both identify adaptation
pathways under uncertain conditions while it also allows for the explicit
modelling (and consequent comparison) of different adaptation priorities (e.g.
economic vs. wellbeing). We use an Integrated Assessment Model (IAM) to link
together a rainfall and flood model, and compute the impacts of flooding in
terms of quality of life (QoL), transportation, and infrastructure damage. Our
results show that models prioritising QoL over economic impacts results in more
adaptation spending as well as a more even distribution of spending over the
study area, highlighting the extent to which such normative assumptions can
alter adaptation policy. Our framework is publicly available:
https://github.com/MLSM-at-DTU/maat_qol_framework.
๐
2025-11-05
๐ ArXiv Preprint (q-bio.PE)
๐ฅ Aristides Moustakas, Ioannis N Vogiatzakis
Machine Learning
ๆฅ็ๆ่ฆ
Island endemism is shaped by complex interactions among environmental,
ecological, and evolutionary factors, yet the relative contributions of
topography, climate, and land cover remain incompletely quantified. We
investigated the drivers of endemic plant richness across Crete, a
Mediterranean biodiversity hotspot, using spatially explicit data on species
distributions, topographic complexity, climatic variability, land cover, and
soil characteristics. Artificial Neural Network models, a machine learning
tool, were employed to assess the relative importance of these predictors and
to identify hotspots of endemism. We found that total species richness,
elevation range, and climatic variability were the strongest predictors of
endemic richness, reflecting the role of biodiversity, topographic
heterogeneity, and climatic gradients in generating diverse habitats and
micro-refugia that promote speciation and buffer extinction risk. Endemic
hotspots only partially overlapped with areas of high total species richness,
indicating that total species richness was the optimal from the ones examined,
yet an imperfect surrogate. These environmentally heterogeneous areas also
provide critical ecosystem services, including soil stabilization, pollination,
and cultural value, which are increasingly threatened by tourism, renewable
energy development, land-use change, and climate impacts. Our findings
underscore the importance of prioritizing mountainous and climatically variable
regions in conservation planning, integrating ecosystem service considerations,
and accounting for within-island spatial heterogeneity. By explicitly linking
the environmental drivers of endemism to both biodiversity patterns and
ecosystem function, this study provides a framework for evidence-based
conservation planning in Crete and other Mediterranean islands with similar
geological and biogeographic contexts.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gang Bao, Yaohua Zang
Machine Learning
ๆฅ็ๆ่ฆ
Solving inverse problems governed by partial differential equations (PDEs) is
central to science and engineering, yet remains challenging when measurements
are sparse, noisy, or when the underlying coefficients are high-dimensional or
discontinuous. Existing deep learning approaches either require extensive
labeled datasets or are limited to specific measurement types, often leading to
failure in such regimes and restricting their practical applicability. Here, a
novel generative neural operator framework, IGNO, is introduced to overcome
these limitations. IGNO unifies the solution of inverse problems from both
point measurements and operator-valued data without labeled training pairs.
This framework encodes high-dimensional, potentially discontinuous coefficient
fields into a low-dimensional latent space, which drives neural operator
decoders to reconstruct both coefficients and PDE solutions. Training relies
purely on physics constraints through PDE residuals, while inversion proceeds
via efficient gradient-based optimization in latent space, accelerated by an a
priori normalizing flow model. Across a diverse set of challenging inverse
problems, including recovery of discontinuous coefficients from solution-based
measurements and the EIT problem with operator-based measurements, IGNO
consistently achieves accurate, stable, and scalable inversion even under
severe noise. It consistently outperforms the state-of-the-art method under
varying noise levels and demonstrates strong generalization to
out-of-distribution targets. These results establish IGNO as a unified and
powerful framework for tackling challenging inverse problems across
computational science domains.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Philipp Reis, Philipp Rigoll, Christian Steinhauser, Jacob Langner, Eric Sax
Robotics
ๆฅ็ๆ่ฆ
Modern AI systems are increasingly constrained not by model capacity but by
the quality and diversity of their data. Despite growing emphasis on
data-centric AI, most datasets are still gathered in an open-loop manner which
accumulates redundant samples without feedback from the current coverage. This
results in inefficient storage, costly labeling, and limited generalization. To
address this, this paper introduces \ac{FCDC}, a paradigm that formulates data
collection as a closed-loop control problem. \ac{FCDC} continuously
approximates the state of the collected data distribution using an online
probabilistic model and adaptively regulates sample retention using based on
feedback signals such as likelihood and Mahalanobis distance. Through this
feedback mechanism, the system dynamically balances exploration and
exploitation, maintains dataset diversity, and prevents redundancy from
accumulating over time. Besides showcasing the controllability of \ac{FCDC} on
a synthetic dataset, experiments on a real data stream show that \ac{FCDC}
produces more balanced datasets by $\SI{25.9}{\percent}$ while reducing data
storage by $\SI{39.8}{\percent}$. These results demonstrate that data
collection itself can be actively controlled, transforming collection from a
passive pipeline stage into a self-regulating, feedback-driven process at the
core of data-centric AI.
๐
2025-11-05
๐ Accepted for presentation at AI in Science (AIS) 2
๐ฅ Miguel Costa, Arthur Vandervoort, Martin Drews, Karyn Morrissey, Francisco C. Pereira
Machine Learning
ๆฅ็ๆ่ฆ
Urban flooding is expected to increase in frequency and severity as a
consequence of climate change, causing wide-ranging impacts that include a
decrease in urban Quality of Life (QoL). Meanwhile, policymakers must devise
adaptation strategies that can cope with the uncertain nature of climate change
and the complex and dynamic nature of urban flooding. Reinforcement Learning
(RL) holds significant promise in tackling such complex, dynamic, and uncertain
problems. Because of this, we use RL to identify which climate adaptation
pathways lead to a higher QoL in the long term. We do this using an Integrated
Assessment Model (IAM) which combines a rainfall projection model, a flood
model, a transport accessibility model, and a quality of life index. Our
preliminary results suggest that this approach can be used to learn optimal
adaptation measures and it outperforms other realistic and real-world planning
strategies. Our framework is publicly available:
https://github.com/MLSM-at-DTU/maat_qol_framework.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Md Ashad Alam
ๆฅ็ๆ่ฆ
RKUM is an R package developed for implementing robust kernel-based
unsupervised methods. It provides functions for estimating the robust kernel
covariance operator (CO) and the robust kernel cross-covariance operator (CCO)
using generalized loss functions instead of the conventional quadratic loss.
These operators form the foundation of robust kernel learning and enable
reliable analysis under contaminated or noisy data conditions. The package
includes implementations of robust kernel canonical correlation analysis
(Kernel CCA), as well as the influence function (IF) for both standard and
multiple kernel CCA frameworks. The influence function quantifies sensitivity
and helps detect influential or outlying observations across two-view and
multi-view datasets. Experiments using synthesized two-view and multi-view data
demonstrate that the IF of the standard kernel CCA effectively identifies
outliers, while the robust kernel methods implemented in RKUM exhibit reduced
sensitivity to contamination. Overall, RKUM provides an efficient and
extensible platform for robust kernel-based analysis in high-dimensional data
applications.
๐
2025-11-05
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuei-Chun Kao, Hsu Tzu-Yin, Yunqi Hong, Ruochen Wang, Cho-Jui Hsieh
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Recently, Multimodal Large Language Models (MLLMs) encounter two key issues
in multi-image contexts: (1) a lack of fine-grained perception across disparate
images, and (2) a diminished capability to effectively reason over and
synthesize information from multiple visual inputs. However, while various
prompting methods aim to describe visual content, many existing studies focus
primarily on single-image settings or specific, constrained scenarios. This
leaves a critical gap in understanding and addressing how MLLMs tackle more
general and complex multi-image reasoning tasks. Thus, we first extensively
investigate how current prompting methods perceive fine-grained visual details
and process visual information when dealing with multiple images. Our findings
reveal that existing prompting methods fall short in attending to needed clues
and seamlessly integrating perception and reasoning. Inspired by the findings,
we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions
(QG-CoC), a generalized prompting approach that effectively handles problems
with an arbitrary number of images. We evaluate our method on various
open-source and closed-source MLLMs for multi-image and single-image
benchmarks. Experimental results indicate that QG-CoC demonstrates competitive
performance across tasks and exhibits robust improvements in the challenging
scenarios where existing prompting methods fail.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Zeqi Ye, Qijie Zhu, Molei Tao, Minshuo Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
Diffusion models have achieved remarkable success across diverse domains, but
they remain vulnerable to memorization -- reproducing training data rather than
generating novel outputs. This not only limits their creative potential but
also raises concerns about privacy and safety. While empirical studies have
explored mitigation strategies, theoretical understanding of memorization
remains limited. We address this gap through developing a dual-separation
result via two complementary perspectives: statistical estimation and network
approximation. From the estimation side, we show that the ground-truth score
function does not minimize the empirical denoising loss, creating a separation
that drives memorization. From the approximation side, we prove that
implementing the empirical score function requires network size to scale with
sample size, spelling a separation compared to the more compact network
representation of the ground-truth score function. Guided by these insights, we
develop a pruning-based method that reduces memorization while maintaining
generation quality in diffusion transformers.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hassan Wasswa, Hussein Abbass, Timothy Lynar
Machine Learning
ๆฅ็ๆ่ฆ
In an effort to counter the increasing IoT botnet-based attacks,
state-of-the-art deep learning methods have been proposed and have achieved
impressive detection accuracy. However, their computational intensity restricts
deployment on resource-constrained IoT devices, creating a critical need for
lightweight detection models. A common solution to this challenge is model
compression via quantization. This study proposes a VAE-MLP model framework
where an MLP-based classifier is trained on 8-dimensional latent vectors
derived from the high-dimensional train data using the encoder component of a
pretrained variational autoencoder (VAE). Two widely used quantization
strategies--Quantization-Aware Training (QAT) and Post-Training Quantization
(PTQ)--are then systematically evaluated in terms of their impact on detection
performance, storage efficiency, and inference latency using two benchmark IoT
botnet datasets--N-BaIoT and CICIoT2022. The results revealed that, with
respect to detection accuracy, the QAT strategy experienced a more noticeable
decline,whereas PTQ incurred only a marginal reduction compared to the original
unquantized model. Furthermore, PTQ yielded a 6x speedup and 21x reduction in
size, while QAT achieved a 3x speedup and 24x compression, demonstrating the
practicality of quantization for device-level IoT botnet detection.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Maryam Alipourhajiagha, Pierre-Louis Lemaire, Youssef Diouane, Julie Carreau
ๆฅ็ๆ่ฆ
Climate models are limited by heavy computational costs, often producing
outputs at coarse spatial resolutions, while many climate change impact studies
require finer scales. Statistical downscaling bridges this gap, and we adapt
the probabilistic U-Net for this task, combining a deterministic U-Net backbone
with a variational latent space to capture aleatoric uncertainty. We evaluate
four training objectives, afCRPS and WMSE-MS-SSIM with three settings for
downscaling precipitation and temperature from $16\times$ coarser resolution.
Our main finding is that WMSE-MS-SSIM performs well for extremes under certain
settings, whereas afCRPS better captures spatial variability across scales.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Feng Wu, Tsai Hor Chan, Fuying Wang, Guosheng Yin, Lequan Yu
Multimodal
ๆฅ็ๆ่ฆ
Various data modalities are common in real-world applications (e.g.,
electronic health records, medical images and clinical notes in healthcare). It
is essential to develop multimodal learning methods to aggregate various
information from multiple modalities. The main challenge is how to
appropriately align and fuse the representations of different modalities into a
joint distribution. Existing methods mainly rely on concatenation or the
Kronecker product, oversimplifying the interaction structure between modalities
and indicating a need to model more complex interactions. Additionally, the
joint distribution of latent representations with higher-order interactions is
underexplored. Copula is a powerful statistical structure for modelling the
interactions among variables, as it naturally bridges the joint distribution
and marginal distributions of multiple variables. We propose a novel
copula-driven multimodal learning framework, which focuses on learning the
joint distribution of various modalities to capture the complex interactions
among them. The key idea is to interpret the copula model as a tool to align
the marginal distributions of the modalities efficiently. By assuming a
Gaussian mixture distribution for each modality and a copula model on the joint
distribution, our model can generate accurate representations for missing
modalities. Extensive experiments on public MIMIC datasets demonstrate the
superior performance of our model over other competitors. The code is available
at https://github.com/HKU-MedAI/CMCM.
๐
2025-11-05
๐ ArXiv Preprint (stat.TH)
๐ฅ Gonzalo Mena, Arun Kumar Kuchibhotla, Larry Wasserman
Machine Learning
ๆฅ็ๆ่ฆ
Rectified flow (Liu et al., 2022; Liu, 2022; Wu et al., 2023) is a method for
defining a transport map between two distributions, and enjoys popularity in
machine learning, although theoretical results supporting the validity of these
methods are scant. The rectified flow can be regarded as an approximation to
optimal transport, but in contrast to other transport methods that require
optimization over a function space, computing the rectified flow only requires
standard statistical tools such as regression or density estimation. Because of
this, one can leverage standard data analysis tools for regression and density
estimation to develop empirical versions of transport maps. We study some
structural properties of the rectified flow, including existence, uniqueness,
and regularity, as well as the related statistical properties, such as rates of
convergence and central limit theorems, for some selected estimators. To do so,
we analyze separately the bounded and unbounded cases as each presents unique
challenges. In both cases, we are able to establish convergence at faster rates
than the ones for the usual nonparametric regression and density estimation.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mingtao Zhang, Guoli Yang, Zhanxing Zhu, Mengzhu Wang, Xiaoying Bai
ๆฅ็ๆ่ฆ
Attention mechanisms have been extensively employed in various applications,
including time series modeling, owing to their capacity to capture intricate
dependencies; however, their utility is often constrained by quadratic
computational complexity, which impedes scalability for long sequences. In this
work, we propose a novel linear attention mechanism designed to overcome these
limitations. Our approach is grounded in a theoretical demonstration that
entropy, as a strictly concave function on the probability simplex, implies
that distributions with aligned probability rankings and similar entropy values
exhibit structural resemblance. Building on this insight, we develop an
efficient approximation algorithm that computes the entropy of
dot-product-derived distributions with only linear complexity, enabling the
implementation of a linear attention mechanism based on entropy equality.
Through rigorous analysis, we reveal that the effectiveness of attention in
spatio-temporal time series modeling may not primarily stem from the
non-linearity of softmax but rather from the attainment of a moderate and
well-balanced weight distribution. Extensive experiments on four
spatio-temporal datasets validate our method, demonstrating competitive or
superior forecasting performance while achieving substantial reductions in both
memory usage and computational time.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Jonghae Park, Daesol Cho, Jusuk Lee, Dongseok Shim, Inkyu Jang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Unsupervised skill discovery in reinforcement learning (RL) aims to learn
diverse behaviors without relying on external rewards. However, current methods
often overlook the periodic nature of learned skills, focusing instead on
increasing the mutual dependence between states and skills or maximizing the
distance traveled in latent space. Considering that many robotic tasks --
particularly those involving locomotion -- require periodic behaviors across
varying timescales, the ability to discover diverse periodic skills is
essential. Motivated by this, we propose Periodic Skill Discovery (PSD), a
framework that discovers periodic behaviors in an unsupervised manner. The key
idea of PSD is to train an encoder that maps states to a circular latent space,
thereby naturally encoding periodicity in the latent representation. By
capturing temporal distance, PSD can effectively learn skills with diverse
periods in complex robotic tasks, even with pixel-based observations. We
further show that these learned skills achieve high performance on downstream
tasks such as hurdling. Moreover, integrating PSD with an existing skill
discovery method offers more diverse behaviors, thus broadening the agent's
repertoire. Our code and demos are available at
https://jonghaepark.github.io/psd/
๐
2025-11-05
๐ ArXiv Preprint (cs.SE)
๐ฅ Vinaik Chhetri, A. B Siddique, Umar Farooq
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used in software development.
However, while LLMs remain static after pretraining, programming languages and
APIs continue to evolve, leading to the generation of deprecated or
incompatible code that undermines reliability. Retraining LLMs from scratch to
reflect such changes is computationally expensive, making model editing a
promising lightweight alternative that updates only a small subset of
parameters. Despite its potential, it remains unclear whether model editing
yields genuine syntactic and semantic adaptations or merely superficial fixes.
In this work, we present a systematic study of five state-of-the-art model
editing methods: Constrained Fine-Tuning (FT), GRACE, MEMIT, PMET, and ROME. We
apply these methods to three leading open-source code LLMs, CodeLlama,
CodeQwen1.5, and DeepSeek-Coder, under controlled API deprecation scenarios.
Our evaluation covers both instant and sequential editing settings, using three
disjoint evaluation sets designed to assess reliability, generalization, and
specificity. We measure model correctness at three levels: successful
compilation, partial test case pass, and full test pass. Our findings show that
instant edits consistently degrade model performance, with syntactic validity
dropping by up to 86 percentage points and functional correctness declining by
45 points even in the best-performing setting. Sequential edits further amplify
this degradation, and in some cases, model performance collapses entirely.
Across all models, most passing generations relied on workarounds rather than
correctly adopting the intended changes, while faulty adoptions that result in
test failures or compilation errors were significantly more frequent. Correct
adoptions, where the model correctly integrates the intended change, occurred
in only about 6% of cases.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Rewida Ali, Cristian C. Beltran-Hernandez, Weiwei Wan, Kensuke Harada
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Human-robot cooperation is essential in environments such as warehouses and
retail stores, where workers frequently handle deformable objects like paper,
bags, and fabrics. Coordinating robotic actions with human assistance remains
difficult due to the unpredictable dynamics of deformable materials and the
need for adaptive force control. To explore this challenge, we focus on the
task of gift wrapping, which exemplifies a long-horizon manipulation problem
involving precise folding, controlled creasing, and secure fixation of paper.
Success is achieved when the robot completes the sequence to produce a neatly
wrapped package with clean folds and no tears.
We propose a learning-based framework that integrates a high-level task
planner powered by a large language model (LLM) with a low-level hybrid
imitation learning (IL) and reinforcement learning (RL) policy. At its core is
a Sub-task Aware Robotic Transformer (START) that learns a unified policy from
human demonstrations. The key novelty lies in capturing long-range temporal
dependencies across the full wrapping sequence within a single model. Unlike
vanilla Action Chunking with Transformer (ACT), typically applied to short
tasks, our method introduces sub-task IDs that provide explicit temporal
grounding. This enables robust performance across the entire wrapping process
and supports flexible execution, as the policy learns sub-goals rather than
merely replicating motion sequences.
Our framework achieves a 97% success rate on real-world wrapping tasks. We
show that the unified transformer-based policy reduces the need for specialized
models, allows controlled human supervision, and effectively bridges high-level
intent with the fine-grained force control required for deformable object
manipulation.
๐
2025-11-05
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Varun Kumar, George Em Karniadakis
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The engineering design process often demands expertise from multiple domains,
leading to complex collaborations and iterative refinements. Traditional
methods can be resource-intensive and prone to inefficiencies. To address this,
we formalize the engineering design process through a multi-agent AI framework
that integrates structured design and review loops. The framework introduces
specialized knowledge-driven agents that collaborate to generate and refine
design candidates. As an exemplar, we demonstrate its application to the
aerodynamic optimization of 4-digit NACA airfoils. The framework consists of
three key AI agents: a Graph Ontologist, a Design Engineer, and a Systems
Engineer. The Graph Ontologist employs a Large Language Model (LLM) to
construct two domain-specific knowledge graphs from airfoil design literature.
The Systems Engineer, informed by a human manager, formulates technical
requirements that guide design generation and evaluation. The Design Engineer
leverages the design knowledge graph and computational tools to propose
candidate airfoils meeting these requirements. The Systems Engineer reviews and
provides feedback both qualitative and quantitative using its own knowledge
graph, forming an iterative feedback loop until a design is validated by the
manager. The final design is then optimized to maximize performance metrics
such as the lift-to-drag ratio. Overall, this work demonstrates how
collaborative AI agents equipped with structured knowledge representations can
enhance efficiency, consistency, and quality in the engineering design process.
๐
2025-11-05
๐ ArXiv Preprint (astro-ph.EP)
๐ฅ Arsalan Muhammad, Wasiu Akande Ahmed, Omada Friday Ojonugwa, Paul Puspendu Biswas
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The rapid growth of cislunar activities, including lunar landings, the Lunar
Gateway, and in-space refueling stations, requires advances in cost-efficient
trajectory design and reliable integration of navigation and remote sensing.
Traditional Earth-Moon transfers suffer from rigid launch windows and high
propellant demands, while Earth-based GNSS systems provide little to no
coverage beyond geostationary orbit. This limits autonomy and environmental
awareness in cislunar space. This review compares four major transfer
strategies by evaluating velocity requirements, flight durations, and fuel
efficiency, and by identifying their suitability for both crewed and robotic
missions. The emerging role of artificial intelligence and machine learning is
highlighted: convolutional neural networks support automated crater recognition
and digital terrain model generation, while deep reinforcement learning enables
adaptive trajectory refinement during descent and landing to reduce risk and
decision latency. The study also examines how GNSS-Reflectometry and advanced
Positioning, Navigation, and Timing architectures can extend navigation
capabilities beyond current limits. GNSS-R can act as a bistatic radar for
mapping lunar ice, soil properties, and surface topography, while PNT systems
support autonomous rendezvous, Lagrange point station-keeping, and coordinated
satellite swarm operations. Combining these developments establishes a scalable
framework for sustainable cislunar exploration and long-term human and robotic
presence.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Tingzhu Bi, Yicheng Pan, Xinrui Jiang, Huize Sun, Meng Ma et al.
Machine Learning
ๆฅ็ๆ่ฆ
Uncovering cause-effect relationships from observational time series is
fundamental to understanding complex systems. While many methods infer static
causal graphs, real-world systems often exhibit dynamic causality-where
relationships evolve over time. Accurately capturing these temporal dynamics
requires time-resolved causal graphs. We propose UnCLe, a novel deep learning
method for scalable dynamic causal discovery. UnCLe employs a pair of Uncoupler
and Recoupler networks to disentangle input time series into semantic
representations and learns inter-variable dependencies via auto-regressive
Dependency Matrices. It estimates dynamic causal influences by analyzing
datapoint-wise prediction errors induced by temporal perturbations. Extensive
experiments demonstrate that UnCLe not only outperforms state-of-the-art
baselines on static causal discovery benchmarks but, more importantly, exhibits
a unique capability to accurately capture and represent evolving temporal
causality in both synthetic and real-world dynamic systems (e.g., human
motion). UnCLe offers a promising approach for revealing the underlying,
time-varying mechanisms of complex phenomena.
๐
2025-11-05
๐ ArXiv Preprint (stat.AP)
๐ฅ Natchaphon Leungbootnak, Zihao Li, Zihang Wei, Dominique Lord, Yunlong Zhang
Robotics
ๆฅ็ๆ่ฆ
The ability of existing headway distributions to accurately reflect the
diverse behaviors and characteristics in heterogeneous traffic (different types
of vehicles) and mixed traffic (human-driven vehicles with autonomous vehicles)
is limited, leading to unsatisfactory goodness of fit. To address these issues,
we modified the exponential function to obtain a novel headway distribution.
Rather than employing Euler's number (e) as the base of the exponential
function, we utilized a real number base to provide greater flexibility in
modeling the observed headway. However, the proposed is not a probability
function. We normalize it to calculate the probability and derive the
closed-form equation. In this study, we utilized a comprehensive experiment
with five open datasets: highD, exiD, NGSIM, Waymo, and Lyft to evaluate the
performance of the proposed distribution and compared its performance with six
existing distributions under mixed and heterogeneous traffic flow. The results
revealed that the proposed distribution not only captures the fundamental
characteristics of headway distribution but also provides physically meaningful
parameters that describe the distribution shape of observed headways. Under
heterogeneous flow on highways (i.e., uninterrupted traffic flow), the proposed
distribution outperforms other candidate distributions. Under urban road
conditions (i.e., interrupted traffic flow), including heterogeneous and mixed
traffic, the proposed distribution still achieves decent results.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Atif Hassan, Tarun Kumar, Ashish Mishra, Sergey Serebryakov, Satish Kumar Mopur et al.
ๆฅ็ๆ่ฆ
Forecasting anomalies (anomaly prediction) in multivariate time series from
different real-world, dynamic, and complex systems is vital for preempting
critical failures, leading to a substantial minimization in operational costs
and human labor. Yet, existing methods are limited to specific systems while
failing to generalize to evolving anomaly patterns over time. In contrast,
pretrained Time Series Foundation Models (TSFMs) have recently demonstrated
strong generalization and zero-shot forecasting capabilities. However, their
potential remains untapped for anomaly prediction, a task fundamentally
different from forecasting normal behavior. Thus, we present Forecast2Anomaly
(F2A), a novel framework that empowers TSFMs with anomaly prediction abilities
through two key innovations. First, we propose a joint forecast-anomaly loss
that fine-tunes TSFMs to accurately forecast future signals even at anomalous
time points. Second, we introduce a Retrieval-Augmented Generation (RAG) module
that retrieves historically relevant horizons and conditions predictions on
them. This component dynamically adapts to distributional shifts at inference
time, enabling F2A to track evolving anomalies without requiring model updates.
By combining targeted fine-tuning with dynamic retrieval, F2A bridges the gap
between robust TSFM zero-shot forecasting and zero-shot anomaly prediction.
Extensive experiments across 16 diverse datasets and multiple TSFM backbones
show that F2A consistently outperforms state-of-the-art methods, offering a
scalable, zero-shot anomaly prediction solution for real-world applications.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Paria Mehrbod, Pedro Vianna, Geraldin Nanfack, Guy Wolf, Eugene Belilovsky
Machine Learning
ๆฅ็ๆ่ฆ
Domain adaptation is a key strategy for enhancing the generalizability of
deep learning models in real-world scenarios, where test distributions often
diverge significantly from the training domain. However, conventional
approaches typically rely on prior knowledge of the target domain or require
model retraining, limiting their practicality in dynamic or
resource-constrained environments. Recent test-time adaptation methods based on
batch normalization statistic updates allow for unsupervised adaptation, but
they often fail to capture complex activation distributions and are constrained
to specific normalization layers. We propose Adaptive Quantile Recalibration
(AQR), a test-time adaptation technique that modifies pre-activation
distributions by aligning quantiles on a channel-wise basis. AQR captures the
full shape of activation distributions and generalizes across architectures
employing BatchNorm, GroupNorm, or LayerNorm. To address the challenge of
estimating distribution tails under varying batch sizes, AQR incorporates a
robust tail calibration strategy that improves stability and precision. Our
method leverages source-domain statistics computed at training time, enabling
unsupervised adaptation without retraining models. Experiments on CIFAR-10-C,
CIFAR-100-C, and ImageNet-C across multiple architectures demonstrate that AQR
achieves robust adaptation across diverse settings, outperforming existing
test-time adaptation baselines. These results highlight AQR's potential for
deployment in real-world scenarios with dynamic and unpredictable data
distributions.
๐
2025-11-05
๐ ArXiv Preprint (cs.GR)
๐ฅ Haotian Yin, Przemyslaw Musialski
Machine Learning
ๆฅ็ๆ่ฆ
Neural signed distance functions (SDFs) have become a powerful representation
for geometric reconstruction from point clouds, yet they often require both
gradient- and curvature-based regularization to suppress spurious warp and
preserve structural fidelity. FlatCAD introduced the Off-Diagonal Weingarten
(ODW) loss as an efficient second-order prior for CAD surfaces, approximating
full-Hessian regularization at roughly half the computational cost. However,
FlatCAD applies a fixed ODW weight throughout training, which is suboptimal:
strong regularization stabilizes early optimization but suppresses detail
recovery in later stages. We present scheduling strategies for the ODW loss
that assign a high initial weight to stabilize optimization and progressively
decay it to permit fine-scale refinement. We investigate constant, linear,
quintic, and step interpolation schedules, as well as an increasing warm-up
variant. Experiments on the ABC CAD dataset demonstrate that time-varying
schedules consistently outperform fixed weights. Our method achieves up to a
35% improvement in Chamfer Distance over the FlatCAD baseline, establishing
scheduling as a simple yet effective extension of curvature regularization for
robust CAD reconstruction.
๐
2025-11-05
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Erfan Shayegani, Jina Suh, Andy Wilson, Nagu Rangan, Javier Hernandez
Natural Language Processing
ๆฅ็ๆ่ฆ
Empathy is a critical factor in fostering positive user experiences in
conversational AI. While models can display empathy, it is often generic rather
than tailored to specific tasks and contexts. In this work, we introduce a
novel framework for developing and evaluating context-specific empathetic large
language models (LLMs). We first analyze a real-world conversational dataset
consisting of 672 multi-turn conversations across 8 tasks, revealing
significant differences in terms of expected and experienced empathy before and
after the conversations, respectively. To help minimize this gap, we develop a
synthetic multi-turn conversational generation pipeline and steer responses
toward our defined empathy patterns based on the context that more closely
matches users' expectations. We then train empathetic expert adapters for
context-specific empathy that specialize in varying empathy levels based on the
recognized task. Our empirical results demonstrate a significant gap reduction
of 72.66% between perceived and desired empathy with scores increasing by an
average factor of 2.43 as measured by our metrics and reward models.
Additionally, our trained empathetic expert adapters demonstrate superior
effectiveness in preserving empathy patterns throughout conversation turns,
outperforming system prompts, which tend to dramatically diminish in impact as
conversations lengthen.
๐
2025-11-05
๐ EMNLP 2025
๐ฅ Najrin Sultana, Md Rafi Ur Rashid, Kang Gu, Shagufta Mehnaz
Natural Language Processing
ๆฅ็ๆ่ฆ
LLMs can provide substantial zero-shot performance on diverse tasks using a
simple task prompt, eliminating the need for training or fine-tuning. However,
when applying these models to sensitive tasks, it is crucial to thoroughly
assess their robustness against adversarial inputs. In this work, we introduce
Static Deceptor (StaDec) and Dynamic Deceptor (DyDec), two innovative attack
frameworks designed to systematically generate dynamic and adaptive adversarial
examples by leveraging the understanding of the LLMs. We produce subtle and
natural-looking adversarial inputs that preserve semantic similarity to the
original text while effectively deceiving the target LLM. By utilizing an
automated, LLM-driven pipeline, we eliminate the dependence on external
heuristics. Our attacks evolve with the advancements in LLMs and demonstrate
strong transferability across models unknown to the attacker. Overall, this
work provides a systematic approach for the self-assessment of an LLM's
robustness. We release our code and data at
https://github.com/Shukti042/AdversarialExample.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Haitao Lin, Boxin Zhao, Mladen Kolar, Chong Liu
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
We study how Bayesian optimization (BO) can be accelerated on a target task
with historical knowledge transferred from related source tasks. Existing works
on BO with knowledge transfer either do not have theoretical guarantees or
achieve the same regret as BO in the non-transfer setting,
$\tilde{\mathcal{O}}(\sqrt{T \gamma_f})$, where $T$ is the number of
evaluations of the target function and $\gamma_f$ denotes its information gain.
In this paper, we propose the DeltaBO algorithm, in which a novel
uncertainty-quantification approach is built on the difference function
$\delta$ between the source and target functions, which are allowed to belong
to different reproducing kernel Hilbert spaces (RKHSs). Under mild assumptions,
we prove that the regret of DeltaBO is of order $\tilde{\mathcal{O}}(\sqrt{T
(T/N + \gamma_\delta)})$, where $N$ denotes the number of evaluations from
source tasks and typically $N \gg T$. In many applications, source and target
tasks are similar, which implies that $\gamma_\delta$ can be much smaller than
$\gamma_f$. Empirical studies on both real-world hyperparameter tuning tasks
and synthetic functions show that DeltaBO outperforms other baseline methods
and support our theoretical claims.
๐
2025-11-05
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Seunghee Han, Yeonghun Kang, Taeun Bae, Varinia Bernales, Alan Aspuru-Guzik et al.
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Designing materials with targeted properties remains challenging due to the
vastness of chemical space and the scarcity of property-labeled data. While
recent advances in generative models offer a promising way for inverse design,
most approaches require large datasets and must be retrained for every new
target property. Here, we introduce the EGMOF (Efficient Generation of MOFs), a
hybrid diffusion-transformer framework that overcomes these limitations through
a modular, descriptor-mediated workflow. EGMOF decomposes inverse design into
two steps: (1) a one-dimensional diffusion model (Prop2Desc) that maps desired
properties to chemically meaningful descriptors followed by (2) a transformer
model (Desc2MOF) that generates structures from these descriptors. This modular
hybrid design enables minimal retraining and maintains high accuracy even under
small-data conditions. On a hydrogen uptake dataset, EGMOF achieved over 95%
validity and 84% hit rate, representing significant improvements of up to 57%
in validity and 14% in hit rate compared to existing methods, while remaining
effective with only 1,000 training samples. Moreover, our model successfully
performed conditional generation across 29 diverse property datasets, including
CoREMOF, QMOF, and text-mined experimental datasets, whereas previous models
have not. This work presents a data-efficient, generalizable approach to the
inverse design of diverse MOFs and highlights the potential of modular inverse
design workflows for broader materials discovery.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qi Zhang, Yifei Wang, Yisen Wang
Machine Learning
ๆฅ็ๆ่ฆ
Recently, self-supervised contrastive learning has achieved great success on
various tasks. However, its underlying working mechanism is yet unclear. In
this paper, we first provide the tightest bounds based on the widely adopted
assumption of conditional independence. Further, we relax the conditional
independence assumption to a more practical assumption of augmentation overlap
and derive the asymptotically closed bounds for the downstream performance. Our
proposed augmentation overlap theory hinges on the insight that the support of
different intra-class samples will become more overlapped under aggressive data
augmentations, thus simply aligning the positive samples (augmented views of
the same sample) could make contrastive learning cluster intra-class samples
together. Moreover, from the newly derived augmentation overlap perspective, we
develop an unsupervised metric for the representation evaluation of contrastive
learning, which aligns well with the downstream performance almost without
relying on additional modules. Code is available at
https://github.com/PKU-ML/GARC.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiameng Chen, Yida Xiong, Kun Li, Hongzhi Zhang, Xiantao Cai et al.
Machine Learning
ๆฅ็ๆ่ฆ
Computational antibody design holds immense promise for therapeutic
discovery, yet existing generative models are fundamentally limited by two core
challenges: (i) a lack of dynamical consistency, which yields physically
implausible structures, and (ii) poor generalization due to data scarcity and
structural bias. We introduce FP-AbDiff, the first antibody generator to
enforce Fokker-Planck Equation (FPE) physics along the entire generative
trajectory. Our method minimizes a novel FPE residual loss over the mixed
manifold of CDR geometries (R^3 x SO(3)), compelling locally-learned denoising
scores to assemble into a globally coherent probability flow. This
physics-informed regularizer is synergistically integrated with deep biological
priors within a state-of-the-art SE(3)-equivariant diffusion framework.
Rigorous evaluation on the RAbD benchmark confirms that FP-AbDiff establishes a
new state-of-the-art. In de novo CDR-H3 design, it achieves a mean Root Mean
Square Deviation of 0.99 {\AA} when superposing on the variable region, a 25%
improvement over the previous state-of-the-art model, AbX, and the highest
reported Contact Amino Acid Recovery of 39.91%. This superiority is underscored
in the more challenging six-CDR co-design task, where our model delivers
consistently superior geometric precision, cutting the average full-chain Root
Mean Square Deviation by ~15%, and crucially, achieves the highest full-chain
Amino Acid Recovery on the functionally dominant CDR-H3 loop (45.67%). By
aligning generative dynamics with physical laws, FP-AbDiff enhances robustness
and generalizability, establishing a principled approach for physically
faithful and functionally viable antibody design.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Daniel Wang, Evan Markou, Dylan Campbell
Machine Learning
ๆฅ็ๆ่ฆ
While backpropagation--reverse-mode automatic differentiation--has been
extraordinarily successful in deep learning, it requires two passes (forward
and backward) through the neural network and the storage of intermediate
activations. Existing gradient estimation methods that instead use forward-mode
automatic differentiation struggle to scale beyond small networks due to the
high variance of the estimates. Efforts to mitigate this have so far introduced
significant bias to the estimates, reducing their utility. We introduce a
gradient estimation approach that reduces both bias and variance by
manipulating upstream Jacobian matrices when computing guess directions. It
shows promising results and has the potential to scale to larger networks,
indeed performing better as the network width is increased. Our understanding
of this method is facilitated by analyses of bias and variance, and their
connection to the low-dimensional structure of neural network gradients.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Md Sakhawat Hossen, Md. Zashid Iqbal Borshon, A. S. M. Badrudduza
Machine Learning
ๆฅ็ๆ่ฆ
The uprising of deep learning methodology and practice in recent years has
brought about a severe consequence of increasing carbon footprint due to the
insatiable demand for computational resources and power. The field of text
analytics also experienced a massive transformation in this trend of
monopolizing methodology. In this paper, the original TF-IDF algorithm has been
modified, and Clement Term Frequency-Inverse Document Frequency (CTF-IDF) has
been proposed for data preprocessing. This paper primarily discusses the
effectiveness of classical machine learning techniques in text analytics with
CTF-IDF and a faster IRLBA algorithm for dimensionality reduction. The
introduction of both of these techniques in the conventional text analytics
pipeline ensures a more efficient, faster, and less computationally intensive
application when compared with deep learning methodology regarding carbon
footprint, with minor compromise in accuracy. The experimental results also
exhibit a manifold of reduction in time complexity and improvement of model
accuracy for the classical machine learning methods discussed further in this
paper.
๐
2025-11-05
๐ ArXiv Preprint (cs.SE)
๐ฅ Rafael Josรฉ Moura, Maria Gizele Nascimento, Fumio Machida, Ermeson Andrade
ๆฅ็ๆ่ฆ
Software aging is a phenomenon that affects long-running systems, leading to
progressive performance degradation and increasing the risk of failures. To
mitigate this problem, this work proposes an adaptive approach based on machine
learning for software aging detection in environments subject to dynamic
workload conditions. We evaluate and compare a static model with adaptive
models that incorporate adaptive detectors, specifically the Drift Detection
Method (DDM) and Adaptive Windowing (ADWIN), originally developed for concept
drift scenarios and applied in this work to handle workload shifts. Experiments
with simulated sudden, gradual, and recurring workload transitions show that
static models suffer a notable performance drop when applied to unseen workload
profiles, whereas the adaptive model with ADWIN maintains high accuracy,
achieving an F1-Score above 0.93 in all analyzed scenarios.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hao Xiang Li, Michael Amir, Amanda Prorok
Machine Learning
ๆฅ็ๆ่ฆ
The agent-environment co-design paradigm jointly optimises agent policies and
environment configurations in search of improved system performance. With
application domains ranging from warehouse logistics to windfarm management,
co-design promises to fundamentally change how we deploy multi-agent systems.
However, current co-design methods struggle to scale. They collapse under
high-dimensional environment design spaces and suffer from sample inefficiency
when addressing moving targets inherent to joint optimisation. We address these
challenges by developing Diffusion Co-Design (DiCoDe), a scalable and
sample-efficient co-design framework pushing co-design towards practically
relevant settings. DiCoDe incorporates two core innovations. First, we
introduce Projected Universal Guidance (PUG), a sampling technique that enables
DiCoDe to explore a distribution of reward-maximising environments while
satisfying hard constraints such as spatial separation between obstacles.
Second, we devise a critic distillation mechanism to share knowledge from the
reinforcement learning critic, ensuring that the guided diffusion model adapts
to evolving agent policies using a dense and up-to-date learning signal.
Together, these improvements lead to superior environment-policy pairs when
validated on challenging multi-agent environment co-design benchmarks including
warehouse automation, multi-agent pathfinding and wind farm optimisation. Our
method consistently exceeds the state-of-the-art, achieving, for example, 39%
higher rewards in the warehouse setting with 66% fewer simulation samples. This
sets a new standard in agent-environment co-design, and is a stepping stone
towards reaping the rewards of co-design in real world domains.
๐
2025-11-05
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gaia Grosso, Sai Sumedh R. Hindupur, Thomas Fel, Samuel Bright-Thonney, Philip Harris et al.
Robotics
ๆฅ็ๆ่ฆ
Modern artificial intelligence has revolutionized our ability to extract rich
and versatile data representations across scientific disciplines. Yet, the
statistical properties of these representations remain poorly controlled,
causing misspecified anomaly detection (AD) methods to falter. Weak or rare
signals can remain hidden within the apparent regularity of normal data,
creating a gap in our ability to detect and interpret anomalies. We examine
this gap and identify a set of structural desiderata for detection methods
operating under minimal prior information: sparsity, to enforce parsimony;
locality, to preserve geometric sensitivity; and competition, to promote
efficient allocation of model capacity. These principles define a class of
self-organizing local kernels that adaptively partition the representation
space around regions of statistical imbalance. As an instantiation of these
principles, we introduce SparKer, a sparse ensemble of Gaussian kernels trained
within a semi-supervised Neyman--Pearson framework to locally model the
likelihood ratio between a sample that may contain anomalies and a nominal,
anomaly-free reference. We provide theoretical insights into the mechanisms
that drive detection and self-organization in the proposed model, and
demonstrate the effectiveness of this approach on realistic high-dimensional
problems of scientific discovery, open-world novelty detection, intrusion
detection, and generative-model validation. Our applications span both the
natural- and computer-science domains. We demonstrate that ensembles containing
only a handful of kernels can identify statistically significant anomalous
locations within representation spaces of thousands of dimensions, underscoring
both the interpretability, efficiency and scalability of the proposed approach.
๐
2025-11-05
๐ ICASSP 2026
๐ฅ Gowtham Premananth, Carol Espy-Wilson
Machine Learning
ๆฅ็ๆ่ฆ
Advances in artificial intelligence (AI) and deep learning have improved
diagnostic capabilities in healthcare, yet limited interpretability continues
to hinder clinical adoption. Schizophrenia, a complex disorder with diverse
symptoms including disorganized speech and social withdrawal, demands tools
that capture symptom severity and provide clinically meaningful insights beyond
binary diagnosis. Here, we present an interpretable framework that leverages
articulatory speech features through eigenspectra difference plots and a
weighted sum with exponential decay (WSED) to quantify vocal tract
coordination. Eigenspectra plots effectively distinguished complex from simpler
coordination patterns, and WSED scores reliably separated these groups, with
ambiguity confined to a narrow range near zero. Importantly, WSED scores
correlated not only with overall BPRS severity but also with the balance
between positive and negative symptoms, reflecting more complex coordination in
subjects with pronounced positive symptoms and the opposite trend for stronger
negative symptoms. This approach offers a transparent, severity-sensitive
biomarker for schizophrenia, advancing the potential for clinically
interpretable speech-based assessment tools.
๐
2025-11-05
๐ EMNLP 2025
๐ฅ Michel Wong, Ali Alshehri, Sophia Kao, Haotian He
Natural Language Processing
ๆฅ็ๆ่ฆ
Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS)
systems, converting written forms into their canonical spoken equivalents.
Traditional TN systems can exhibit high accuracy, but involve substantial
engineering effort, are difficult to scale, and pose challenges to language
coverage, particularly in low-resource settings. We propose PolyNorm, a
prompt-based approach to TN using Large Language Models (LLMs), aiming to
reduce the reliance on manually crafted rules and enable broader linguistic
applicability with minimal human intervention. Additionally, we present a
language-agnostic pipeline for automatic data curation and evaluation, designed
to facilitate scalable experimentation across diverse languages. Experiments
across eight languages show consistent reductions in the word error rate (WER)
compared to a production-grade-based system. To support further research, we
release PolyNorm-Benchmark, a multilingual data set covering a diverse range of
text normalization phenomena.
๐
2025-11-05
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Philipp Hager, Onno Zoeter, Maarten de Rijke
Machine Learning
ๆฅ็ๆ่ฆ
CLAX is a JAX-based library that implements classic click models using modern
gradient-based optimization. While neural click models have emerged over the
past decade, complex click models based on probabilistic graphical models
(PGMs) have not systematically adopted gradient-based optimization, preventing
practitioners from leveraging modern deep learning frameworks while preserving
the interpretability of classic models. CLAX addresses this gap by replacing
EM-based optimization with direct gradient-based optimization in a numerically
stable manner. The framework's modular design enables the integration of any
component, from embeddings and deep networks to custom modules, into classic
click models for end-to-end optimization. We demonstrate CLAX's efficiency by
running experiments on the full Baidu-ULTR dataset comprising over a billion
user sessions in $\approx$ 2 hours on a single GPU, orders of magnitude faster
than traditional EM approaches. CLAX implements ten classic click models,
serving both industry practitioners seeking to understand user behavior and
improve ranking performance at scale and researchers developing new click
models. CLAX is available at: https://github.com/philipphager/clax
๐
2025-11-05
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Saba Latif, Fajar J. Ekaputra, Maxim Vidgof, Sabrina Kirrane, Claudio Di Ciccio
ๆฅ็ๆ่ฆ
The Object-Centric Event Data (OCED) is a novel meta-model aimed at providing
a common ground for process data records centered around events and objects.
One of its objectives is to foster interoperability and process information
exchange. In this context, the integration of data from different providers,
the combination of multiple processes, and the enhancement of knowledge
inference are novel challenges. Semantic Web technologies can enable the
creation of a machine-readable OCED description enriched through ontology-based
relationships and entity categorization. In this paper, we introduce an
approach built upon Semantic Web technologies for the realization of
semantic-enhanced OCED, with the aim to strengthen process data reasoning,
interconnect information sources, and boost expressiveness.
๐
2025-11-05
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Shenghua Wang, Zhen Yin
Machine Learning
ๆฅ็ๆ่ฆ
The rapid growth of open-access (OA) publications has intensified the
challenge of identifying relevant scientific papers. Due to privacy constraints
and limited access to user interaction data, recent efforts have shifted toward
content-based recommendation, which relies solely on textual information.
However, existing models typically treat papers as unstructured text,
neglecting their discourse organization and thereby limiting semantic
completeness and interpretability. To address these limitations, we propose
OMRC-MR, a hierarchical framework that integrates QA-style OMRC (Objective,
Method, Result, Conclusion) summarization, multi-level contrastive learning,
and structure-aware re-ranking for scholarly recommendation. The QA-style
summarization module converts raw papers into structured and
discourse-consistent representations, while multi-level contrastive objectives
align semantic representations across metadata, section, and document levels.
The final re-ranking stage further refines retrieval precision through
contextual similarity calibration. Experiments on DBLP, S2ORC, and the newly
constructed Sci-OMRC dataset demonstrate that OMRC-MR consistently surpasses
state-of-the-art baselines, achieving up to 7.2% and 3.8% improvements in
Precision@10 and Recall@10, respectively. Additional evaluations confirm that
QA-style summarization produces more coherent and factually complete
representations. Overall, OMRC-MR provides a unified and interpretable
content-based paradigm for scientific paper recommendation, advancing
trustworthy and privacy-aware scholarly information retrieval.
๐
2025-11-05
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Giulio Caldarelli, Massimiliano Ornaghi
ๆฅ็ๆ่ฆ
The oracle problem refers to the inability of an agent to know if the
information coming from an oracle is authentic and unbiased. In ancient times,
philosophers and historians debated on how to evaluate, increase, and secure
the reliability of oracle predictions, particularly those from Delphi, which
pertained to matters of state. Today, we refer to data carriers for automatic
machines as oracles, but establishing a secure channel between these oracles
and the real world still represents a challenge. Despite numerous efforts, this
problem remains mostly unsolved, and the recent advent of blockchain oracles
has added a layer of complexity because of the decentralization of blockchains.
This paper conceptually connects Delphic and modern blockchain oracles,
developing a comparative framework. Leveraging blockchain oracle taxonomy,
lexical analysis is also performed on 167 Delphic queries to shed light on the
relationship between oracle answer quality and question type. The presented
framework aims first at revealing commonalities between classical and
computational oracles and then at enriching the oracle analysis within each
field. This study contributes to the computer science literature by proposing
strategies to improve the reliability of blockchain oracles based on insights
from Delphi and to classical literature by introducing a framework that can
also be applied to interpret and classify other ancient oracular mechanisms.
๐
2025-11-05
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Oleg Senkevich, Siyang Xu, Tianyi Jiang, Alexander Radionov, Jan Tabaszewski et al.
Machine Learning
ๆฅ็ๆ่ฆ
Approximate Nearest Neighbor Search (ANNS) is a cornerstone algorithm for
information retrieval, recommendation systems, and machine learning
applications. While x86-based architectures have historically dominated this
domain, the increasing adoption of ARM-based servers in industry presents a
critical need for ANNS solutions optimized on ARM architectures. A naive port
of existing x86 ANNS algorithms to ARM platforms results in a substantial
performance deficit, failing to leverage the unique capabilities of the
underlying hardware. To address this challenge, we introduce KScaNN, a novel
ANNS algorithm co-designed for the Kunpeng 920 ARM architecture. KScaNN
embodies a holistic approach that synergizes sophisticated, data aware
algorithmic refinements with carefully-designed hardware specific
optimizations. Its core contributions include: 1) novel algorithmic techniques,
including a hybrid intra-cluster search strategy and an improved PQ residual
calculation method, which optimize the search process at a higher level; 2) an
ML-driven adaptive search module that provides adaptive, per-query tuning of
search parameters, eliminating the inefficiencies of static configurations; and
3) highly-optimized SIMD kernels for ARM that maximize hardware utilization for
the critical distance computation workloads. The experimental results
demonstrate that KScaNN not only closes the performance gap but establishes a
new standard, achieving up to a 1.63x speedup over the fastest x86-based
solution. This work provides a definitive blueprint for achieving
leadership-class performance for vector search on modern ARM architectures and
underscores
๐
2025-11-05
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shantanu Agarwal, Joel Barry, Elizabeth Boschee, Scott Miller
Natural Language Processing
ๆฅ็ๆ่ฆ
Machine Translation for English Retrieval of Information in Any Language
(MATERIAL) is an IARPA initiative targeted to advance the state of
cross-lingual information retrieval (CLIR). This report provides a detailed
description of Information Sciences Institute's (ISI's) Summarization and
domain-Adaptive Retrieval Across Language's (SARAL's) effort for MATERIAL.
Specifically, we outline our team's novel approach to handle CLIR with emphasis
in developing an approach amenable to retrieve a query-relevant document
\textit{set}, and not just a ranked document-list. In MATERIAL's Phase-3
evaluations, SARAL exceeded the performance of other teams in five out of six
evaluation conditions spanning three different languages (Farsi, Kazakh, and
Georgian).
๐
2025-11-05
๐ EMNLP
๐ฅ Shaghayegh Kolli, Richard Rosenbaum, Timo Cavelius, Lasse Strothe, Andrii Lata et al.
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) excel in generating fluent utterances but can
lack reliable grounding in verified information. At the same time,
knowledge-graph-based fact-checkers deliver precise and interpretable evidence,
yet suffer from limited coverage or latency. By integrating LLMs with knowledge
graphs and real-time search agents, we introduce a hybrid fact-checking
approach that leverages the individual strengths of each component. Our system
comprises three autonomous steps: 1) a Knowledge Graph (KG) Retrieval for rapid
one-hop lookups in DBpedia, 2) an LM-based classification guided by a
task-specific labeling prompt, producing outputs with internal rule-based
logic, and 3) a Web Search Agent invoked only when KG coverage is insufficient.
Our pipeline achieves an F1 score of 0.93 on the FEVER benchmark on the
Supported/Refuted split without task-specific fine-tuning. To address Not
enough information cases, we conduct a targeted reannotation study showing that
our approach frequently uncovers valid evidence for claims originally labeled
as Not Enough Information (NEI), as confirmed by both expert annotators and LLM
reviewers. With this paper, we present a modular, opensource fact-checking
pipeline with fallback strategies and generalization across datasets.
๐
2025-11-05
๐ ArXiv Preprint (cs.DL)
๐ฅ Muneer Ahmad, M Sadik Batcha
ๆฅ็ๆ่ฆ
The present study attempts to highlight the research output generated in
Russia in coronary artery disease (CAD) research during the period 1990-2019 to
understand the distribution of research output, top journals for publications,
and most prolific authors, authorship pattern, and citation pattern. This study
is based on secondary data extracted from the Science Citation Index (SCI),
which is an integral component of the Web of Science. Descriptive and
inferential statistical techniques were applied in the study. There were 5058
articles by Russian scholars in coronary artery disease during 1990-2019; they
preferred to publish in Russian journals. The research contributions were in
the form of research articles, meeting abstracts and reviews with a consistent
drop in the number of editorial material and article; proceedings paper with
time. Co-authorship was the norm in coronary artery disease research, with a
steady increase in the number of multi-author documents in recent years.
๐
2025-11-05
๐ ArXiv Preprint (cs.DL)
๐ฅ Patel Adam Burhansab, M Sadik Batcha, Muneer Ahmad
ๆฅ็ๆ่ฆ
The main aim of this study was to assess and evaluate user satisfaction with
library resources and services among library users associated with Solapur
University. The current research shows the level of users satisfaction with
different library resources and services offered by college libraries. The
research found that a vast number of respondents were pleased with library
facilities and services. The research is designed to achieve users satisfaction
in the library to investigate the level of satisfaction towards library
resources and services with regards to 26 colleges of Solapur University based
in Maharashtra. Information in the form of data has been collected from
colleges and on the basis of users results; analysis needs to analyze users
satisfaction.
๐
2025-11-05
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zhefan Wang, Guokai Yan, Jinbei Yu, Siyu Gu, Jingyan Chen et al.
ๆฅ็ๆ่ฆ
Recommender systems in multi-behavior domains, such as advertising and
e-commerce, aim to guide users toward high-value but inherently sparse
conversions. Leveraging auxiliary behaviors (e.g., clicks, likes, shares) is
therefore essential. Recent progress on generative recommendations has brought
new possibilities for multi-behavior sequential recommendation. However,
existing generative approaches face two significant challenges: 1) Inadequate
Sequence Modeling: capture the complex, cross-level dependencies within user
behavior sequences, and 2) Lack of Suitable Datasets: publicly available
multi-behavior recommendation datasets are almost exclusively derived from
e-commerce platforms, limiting the validation of feasibility in other domains,
while also lacking sufficient side information for semantic ID generation. To
address these issues, we propose a novel generative framework, GAMER
(Generative Augmentation and Multi-lEvel behavior modeling for Recommendation),
built upon a decoder-only backbone. GAMER introduces a cross-level interaction
layer to capture hierarchical dependencies among behaviors and a sequential
augmentation strategy that enhances robustness in training. To further advance
this direction, we collect and release ShortVideoAD, a large-scale
multi-behavior dataset from a mainstream short-video platform, which differs
fundamentally from existing e-commerce datasets and provides pretrained
semantic IDs for research on generative methods. Extensive experiments show
that GAMER consistently outperforms both discriminative and generative
baselines across multiple metrics.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Zhihang Qin, Yueheng Zhang, Wan Su, Linxin Hou, Shenghao Zhou et al.
Robotics
ๆฅ็ๆ่ฆ
Conventional fluid-driven soft grippers typically depend on external sources,
which limit portability and long-term autonomy. This work introduces a
self-contained soft gripper with fixed size that operates solely through
internal liquid redistribution among three interconnected bistable snap-through
chambers. When the top sensing chamber deforms upon contact, the displaced
liquid triggers snap-through expansion of the grasping chambers, enabling
stable and size-selective grasping without continuous energy input. The
internal hydraulic feedback further allows passive adaptation of gripping
pressure to object stiffness. This source-free and compact design opens new
possibilities for lightweight, stiffness-adaptive fluid-driven manipulation in
soft robotics, providing a feasible approach for targeted size-specific
sampling and operation in underwater and field environments.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Taito Tashiro, Tomoko Yonezawa, Hirotake Yamazoe
ๆฅ็ๆ่ฆ
This study investigates how human motion cues can be used to design
expressive robot-arm movements. Using the imperfect-information game Geister,
we analyzed two types of human piece-moving motions: natural gameplay
(unconscious tendencies) and instructed expressions (intentional cues). Based
on these findings, we created phase-specific robot motions by varying movement
speed and stop duration, and evaluated observer impressions under two
presentation modalities: a physical robot and a recorded video. Results
indicate that late-phase motion timing, particularly during withdrawal, plays
an important role in impression formation and that physical embodiment enhances
the interpretability of motion cues. These findings provide insights for
designing expressive robot motions based on human timing behavior.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Azizollah Taheri, Derya Aksaray
ๆฅ็ๆ่ฆ
This paper addresses a motion planning problem to achieve
spatio-temporal-logical tasks, expressed by syntactically co-safe linear
temporal logic specifications (scLTL\next), in uncertain environments. Here,
the uncertainty is modeled as some probabilistic knowledge on the semantic
labels of the environment. For example, the task is "first go to region 1, then
go to region 2"; however, the exact locations of regions 1 and 2 are not known
a priori, instead a probabilistic belief is available. We propose a novel
automata-theoretic approach, where a special product automaton is constructed
to capture the uncertainty related to semantic labels, and a reward function is
designed for each edge of this product automaton. The proposed algorithm
utilizes value iteration for online replanning. We show some theoretical
results and present some simulations/experiments to demonstrate the efficacy of
the proposed approach.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Andrei A. Korigodskii, Oleg D. Kalachev, Artem E. Vasiunik, Matvei V. Urvantsev, Georgii E. Bondar
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper presents the innovative design and successful deployment of a
pioneering autonomous unmanned aerial system developed for executing the
world's largest mural painted by a drone. Addressing the dual challenges of
maintaining artistic precision and operational reliability under adverse
outdoor conditions such as wind and direct sunlight, our work introduces a
robust system capable of navigating and painting outdoors with unprecedented
accuracy. Key to our approach is a novel navigation system that combines an
infrared (IR) motion capture camera and LiDAR technology, enabling precise
location tracking tailored specifically for largescale artistic applications.
We employ a unique control architecture that uses different regulation in
tangential and normal directions relative to the planned path, enabling precise
trajectory tracking and stable line rendering. We also present algorithms for
trajectory planning and path optimization, allowing for complex curve drawing
and area filling. The system includes a custom-designed paint spraying
mechanism, specifically engineered to function effectively amidst the turbulent
airflow generated by the drone's propellers, which also protects the drone's
critical components from paint-related damage, ensuring longevity and
consistent performance. Experimental results demonstrate the system's
robustness and precision in varied conditions, showcasing its potential for
autonomous large-scale art creation and expanding the functional applications
of robotics in creative fields.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Swadhin Agrawal, Sujoy Bhore, Joseph S. B. Mitchell, P. B. Sujit, Aayush Gohil
ๆฅ็ๆ่ฆ
We consider the problem of searching for an intruder in a geometric domain by
utilizing multiple search robots. The domain is a simply connected orthogonal
polygon with edges parallel to the cartesian coordinate axes. Each robot has a
limited sensing capability. We study the problem for both static and mobile
intruders. It turns out that the problem of finding an intruder is NP-hard,
even for a stationary intruder. Given this intractability, we turn our
attention towards developing efficient and robust algorithms, namely methods
based on space-filling curves, random search, and cooperative random search.
Moreover, for each proposed algorithm, we evaluate the trade-off between the
number of search robots and the time required for the robots to complete the
search process while considering the geometric properties of the connected
orthogonal search area.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Qingyi Chen, Ruiqi Ni, Jun Kim, Ahmed H. Qureshi
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Safe multi-agent motion planning (MAMP) under task-induced constraints is a
critical challenge in robotics. Many real-world scenarios require robots to
navigate dynamic environments while adhering to manifold constraints imposed by
tasks. For example, service robots must carry cups upright while avoiding
collisions with humans or other robots. Despite recent advances in
decentralized MAMP for high-dimensional systems, incorporating manifold
constraints remains difficult. To address this, we propose a
manifold-constrained Hamilton-Jacobi reachability (HJR) learning framework for
decentralized MAMP. Our method solves HJR problems under manifold constraints
to capture task-aware safety conditions, which are then integrated into a
decentralized trajectory optimization planner. This enables robots to generate
motion plans that are both safe and task-feasible without requiring assumptions
about other agents' policies. Our approach generalizes across diverse
manifold-constrained tasks and scales effectively to high-dimensional
multi-agent manipulation problems. Experiments show that our method outperforms
existing constrained motion planners and operates at speeds suitable for
real-world applications. Video demonstrations are available at
https://youtu.be/RYcEHMnPTH8 .
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Aniol Civit, Antonio Andriella, Carles Sierra, Guillem Alenyร
Robotics
ๆฅ็ๆ่ฆ
While personalisation in Human-Robot Interaction (HRI) has advanced
significantly, most existing approaches focus on single-user adaptation,
overlooking scenarios involving multiple stakeholders with potentially
conflicting preferences. To address this, we propose the Multi-User Preferences
Quantitative Bipolar Argumentation Framework (MUP-QBAF), a novel multi-user
personalisation framework based on Quantitative Bipolar Argumentation
Frameworks (QBAFs) that explicitly models and resolves multi-user preference
conflicts. Unlike prior work in Argumentation Frameworks, which typically
assumes static inputs, our approach is tailored to robotics: it incorporates
both users' arguments and the robot's dynamic observations of the environment,
allowing the system to adapt over time and respond to changing contexts.
Preferences, both positive and negative, are represented as arguments whose
strength is recalculated iteratively based on new information. The framework's
properties and capabilities are presented and validated through a realistic
case study, where an assistive robot mediates between the conflicting
preferences of a caregiver and a care recipient during a frailty assessment
task. This evaluation further includes a sensitivity analysis of argument base
scores, demonstrating how preference outcomes can be shaped by user input and
contextual observations. By offering a transparent, structured, and
context-sensitive approach to resolving competing user preferences, this work
advances the field of multi-user HRI. It provides a principled alternative to
data-driven methods, enabling robots to navigate conflicts in real-world
environments.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Hao Shi, Ze Wang, Shangwei Guo, Mengfei Duan, Song Wang et al.
ๆฅ็ๆ่ฆ
Robust 3D semantic occupancy is crucial for legged/humanoid robots, yet most
semantic scene completion (SSC) systems target wheeled platforms with
forward-facing sensors. We present OneOcc, a vision-only panoramic SSC
framework designed for gait-introduced body jitter and 360{\deg} continuity.
OneOcc combines: (i) Dual-Projection fusion (DP-ER) to exploit the annular
panorama and its equirectangular unfolding, preserving 360{\deg} continuity and
grid alignment; (ii) Bi-Grid Voxelization (BGV) to reason in Cartesian and
cylindrical-polar spaces, reducing discretization bias and sharpening
free/occupied boundaries; (iii) a lightweight decoder with Hierarchical AMoE-3D
for dynamic multi-scale fusion and better long-range/occlusion reasoning; and
(iv) plug-and-play Gait Displacement Compensation (GDC) learning feature-level
motion correction without extra sensors. We also release two panoramic
occupancy benchmarks: QuadOcc (real quadruped, first-person 360{\deg}) and
Human360Occ (H3O) (CARLA human-ego 360{\deg} with RGB, Depth, semantic
occupancy; standardized within-/cross-city splits). OneOcc sets new
state-of-the-art (SOTA): on QuadOcc it beats strong vision baselines and
popular LiDAR ones; on H3O it gains +3.83 mIoU (within-city) and +8.08
(cross-city). Modules are lightweight, enabling deployable full-surround
perception for legged/humanoid robots. Datasets and code will be publicly
available at https://github.com/MasterHow/OneOcc.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Hong Wang, Ridhima Phatak, James Ocampo, Zhao Han
ๆฅ็ๆ่ฆ
Research indicates that humans can mistakenly assume that robots and humans
have the same field of view (FoV), possessing an inaccurate mental model of
robots. This misperception may lead to failures during human-robot
collaboration tasks where robots might be asked to complete impossible tasks
about out-of-view objects. The issue is more severe when robots do not have a
chance to scan the scene to update their world model while focusing on assigned
tasks. To help align humans' mental models of robots' vision capabilities, we
propose four FoV indicators in augmented reality (AR) and conducted a user
human-subjects experiment (N=41) to evaluate them in terms of accuracy,
confidence, task efficiency, and workload. These indicators span a spectrum
from egocentric (robot's eye and head space) to allocentric (task space).
Results showed that the allocentric blocks at the task space had the highest
accuracy with a delay in interpreting the robot's FoV. The egocentric indicator
of deeper eye sockets, possible for physical alteration, also increased
accuracy. In all indicators, participants' confidence was high while cognitive
load remained low. Finally, we contribute six guidelines for practitioners to
apply our AR indicators or physical alterations to align humans' mental models
with robots' vision capabilities.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Lei Fu, Sahar Salimpour, Leonardo Militano, Harry Edelman, Jorge Peรฑa Queralta et al.
Robotics
ๆฅ็ๆ่ฆ
Agentic AI systems and Physical or Embodied AI systems have been two key
research verticals at the forefront of Artificial Intelligence and Robotics,
with Model Context Protocol (MCP) increasingly becoming a key component and
enabler of agentic applications. However, the literature at the intersection of
these verticals, i.e., Agentic Embodied AI, remains scarce. This paper
introduces an MCP server for analyzing ROS and ROS 2 bags, allowing for
analyzing, visualizing and processing robot data with natural language through
LLMs and VLMs. We describe specific tooling built with robotics domain
knowledge, with our initial release focused on mobile robotics and supporting
natively the analysis of trajectories, laser scan data, transforms, or time
series data. This is in addition to providing an interface to standard ROS 2
CLI tools ("ros2 bag list" or "ros2 bag info"), as well as the ability to
filter bags with a subset of topics or trimmed in time. Coupled with the MCP
server, we provide a lightweight UI that allows the benchmarking of the tooling
with different LLMs, both proprietary (Anthropic, OpenAI) and open-source
(through Groq). Our experimental results include the analysis of tool calling
capabilities of eight different state-of-the-art LLM/VLM models, both
proprietary and open-source, large and small. Our experiments indicate that
there is a large divide in tool calling capabilities, with Kimi K2 and Claude
Sonnet 4 demonstrating clearly superior performance. We also conclude that
there are multiple factors affecting the success rates, from the tool
description schema to the number of arguments, as well as the number of tools
available to the models. The code is available with a permissive license at
https://github.com/binabik-ai/mcp-rosbags.
๐
2025-11-05
๐ 8 pages 18 fogures, IEEE RAL accept
๐ฅ Jianbo Yuan, Haohua Zhu, Jing Dai, Sheng Yi
Robotics
ๆฅ็ๆ่ฆ
The human hand plays a vital role in daily life and industrial applications,
yet replicating its multifunctional capabilities-including motion, sensing, and
coordinated manipulation-with robotic systems remains a formidable challenge.
Developing a dexterous robotic hand requires balancing human-like agility with
engineering constraints such as complexity, size-to-weight ratio, durability,
and force-sensing performance. This letter presents Dex-Hand 021, a
high-performance, cable-driven five-finger robotic hand with 12 active and 7
passive degrees of freedom (DoFs), achieving 19 DoFs dexterity in a lightweight
1 kg design. We propose a proprioceptive force-sensing-based admittance control
method to enhance manipulation. Experimental results demonstrate its superior
performance: a single-finger load capacity exceeding 10 N, fingertip
repeatability under 0.001 m, and force estimation errors below 0.2 N. Compared
to PID control, joint torques in multi-object grasping are reduced by 31.19%,
significantly improves force-sensing capability while preventing overload
during collisions. The hand excels in both power and precision grasps,
successfully executing 33 GRASP taxonomy motions and complex manipulation
tasks. This work advances the design of lightweight, industrial-grade dexterous
hands and enhances proprioceptive control, contributing to robotic manipulation
and intelligent manufacturing.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Vesna Poprcova, Iulia Lefter, Martijn Warnier, Frances Brazier
Robotics
ๆฅ็ๆ่ฆ
Social anxiety is a prevalent mental health condition that can significantly
impact overall well-being and quality of life. Despite its widespread effects,
adequate support or treatment for social anxiety is often insufficient.
Advances in technology, particularly in social robotics, offer promising
opportunities to complement traditional mental health. As an initial step
toward developing effective solutions, it is essential to understand the values
that shape what is considered meaningful, acceptable, and helpful. In this
study, a participatory design workshop was conducted with mental health
academic researchers to elicit the underlying values that should inform the
design of socially assistive robots for social anxiety support. Through
creative, reflective, and envisioning activities, participants explored
scenarios and design possibilities, allowing for systematic elicitation of
values, expectations, needs, and preferences related to robot-supported
interventions. The findings reveal rich insights into design-relevant
values-including adaptivity, acceptance, and efficacy-that are core to support
for individuals with social anxiety. This study highlights the significance of
a research-led approach to value elicitation, emphasising user-centred and
context-aware design considerations in the development of socially assistive
robots.
๐
2025-11-05
๐ IROS 2025
๐ฅ Minquan Gao, Xinyi Li, Qing Yan, Xiaojian Sun, Xiaopan Zhang et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Pre-trained robot policies serve as the foundation of many validated robotic
systems, which encapsulate extensive embodied knowledge. However, they often
lack the semantic awareness characteristic of foundation models, and replacing
them entirely is impractical in many situations due to high costs and the loss
of accumulated knowledge. To address this gap, we introduce GUIDES, a
lightweight framework that augments pre-trained policies with semantic guidance
from foundation models without requiring architectural redesign. GUIDES employs
a fine-tuned vision-language model (Instructor) to generate contextual
instructions, which are encoded by an auxiliary module into guidance
embeddings. These embeddings are injected into the policy's latent space,
allowing the legacy model to adapt to this new semantic input through brief,
targeted fine-tuning. For inference-time robustness, a large language
model-based Reflector monitors the Instructor's confidence and, when confidence
is low, initiates a reasoning loop that analyzes execution history, retrieves
relevant examples, and augments the VLM's context to refine subsequent actions.
Extensive validation in the RoboCasa simulation environment across diverse
policy architectures shows consistent and substantial improvements in task
success rates. Real-world deployment on a UR5 robot further demonstrates that
GUIDES enhances motion precision for critical sub-tasks such as grasping.
Overall, GUIDES offers a practical and resource-efficient pathway to upgrade,
rather than replace, validated robot policies.
๐
2025-11-05
๐ Accepted by IEEE ROBIO 2025
๐ฅ Zeqing Zhang, Weifeng Lu, Lei Yang, Wei Jing, Bowei Tang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper explores a physical human-robot collaboration (pHRC) task
involving the joint insertion of a board into a frame by a sightless robot and
a human operator. While admittance control is commonly used in pHRC tasks, it
can be challenging to measure the force/torque applied by the human for
accurate human intent estimation, limiting the robot's ability to assist in the
collaborative task. Other methods that attempt to solve pHRC tasks using
reinforcement learning (RL) are also unsuitable for the board-insertion task
due to its safety constraints and sparse rewards. Therefore, we propose a novel
RL approach that utilizes a human-designed admittance controller to facilitate
more active robot behavior and reduce human effort. Through simulation and
real-world experiments, we demonstrate that our approach outperforms admittance
control in terms of success rate and task completion time. Additionally, we
observed a significant reduction in measured force/torque when using our
proposed approach compared to admittance control. The video of the experiments
is available at https://youtu.be/va07Gw6YIog.
๐
2025-11-05
๐ NeurIPS 2025
๐ฅ Jonghae Park, Daesol Cho, Jusuk Lee, Dongseok Shim, Inkyu Jang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Unsupervised skill discovery in reinforcement learning (RL) aims to learn
diverse behaviors without relying on external rewards. However, current methods
often overlook the periodic nature of learned skills, focusing instead on
increasing the mutual dependence between states and skills or maximizing the
distance traveled in latent space. Considering that many robotic tasks --
particularly those involving locomotion -- require periodic behaviors across
varying timescales, the ability to discover diverse periodic skills is
essential. Motivated by this, we propose Periodic Skill Discovery (PSD), a
framework that discovers periodic behaviors in an unsupervised manner. The key
idea of PSD is to train an encoder that maps states to a circular latent space,
thereby naturally encoding periodicity in the latent representation. By
capturing temporal distance, PSD can effectively learn skills with diverse
periods in complex robotic tasks, even with pixel-based observations. We
further show that these learned skills achieve high performance on downstream
tasks such as hurdling. Moreover, integrating PSD with an existing skill
discovery method offers more diverse behaviors, thus broadening the agent's
repertoire. Our code and demos are available at
https://jonghaepark.github.io/psd/
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Rewida Ali, Cristian C. Beltran-Hernandez, Weiwei Wan, Kensuke Harada
Machine LearningRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Human-robot cooperation is essential in environments such as warehouses and
retail stores, where workers frequently handle deformable objects like paper,
bags, and fabrics. Coordinating robotic actions with human assistance remains
difficult due to the unpredictable dynamics of deformable materials and the
need for adaptive force control. To explore this challenge, we focus on the
task of gift wrapping, which exemplifies a long-horizon manipulation problem
involving precise folding, controlled creasing, and secure fixation of paper.
Success is achieved when the robot completes the sequence to produce a neatly
wrapped package with clean folds and no tears.
We propose a learning-based framework that integrates a high-level task
planner powered by a large language model (LLM) with a low-level hybrid
imitation learning (IL) and reinforcement learning (RL) policy. At its core is
a Sub-task Aware Robotic Transformer (START) that learns a unified policy from
human demonstrations. The key novelty lies in capturing long-range temporal
dependencies across the full wrapping sequence within a single model. Unlike
vanilla Action Chunking with Transformer (ACT), typically applied to short
tasks, our method introduces sub-task IDs that provide explicit temporal
grounding. This enables robust performance across the entire wrapping process
and supports flexible execution, as the policy learns sub-goals rather than
merely replicating motion sequences.
Our framework achieves a 97% success rate on real-world wrapping tasks. We
show that the unified transformer-based policy reduces the need for specialized
models, allows controlled human supervision, and effectively bridges high-level
intent with the fine-grained force control required for deformable object
manipulation.
๐
2025-11-05
๐ ArXiv Preprint (astro-ph.EP)
๐ฅ Arsalan Muhammad, Wasiu Akande Ahmed, Omada Friday Ojonugwa, Paul Puspendu Biswas
Machine LearningRobotics
ๆฅ็ๆ่ฆ
The rapid growth of cislunar activities, including lunar landings, the Lunar
Gateway, and in-space refueling stations, requires advances in cost-efficient
trajectory design and reliable integration of navigation and remote sensing.
Traditional Earth-Moon transfers suffer from rigid launch windows and high
propellant demands, while Earth-based GNSS systems provide little to no
coverage beyond geostationary orbit. This limits autonomy and environmental
awareness in cislunar space. This review compares four major transfer
strategies by evaluating velocity requirements, flight durations, and fuel
efficiency, and by identifying their suitability for both crewed and robotic
missions. The emerging role of artificial intelligence and machine learning is
highlighted: convolutional neural networks support automated crater recognition
and digital terrain model generation, while deep reinforcement learning enables
adaptive trajectory refinement during descent and landing to reduce risk and
decision latency. The study also examines how GNSS-Reflectometry and advanced
Positioning, Navigation, and Timing architectures can extend navigation
capabilities beyond current limits. GNSS-R can act as a bistatic radar for
mapping lunar ice, soil properties, and surface topography, while PNT systems
support autonomous rendezvous, Lagrange point station-keeping, and coordinated
satellite swarm operations. Combining these developments establishes a scalable
framework for sustainable cislunar exploration and long-term human and robotic
presence.
๐
2025-11-05
๐ ArXiv Preprint (Robotics)
๐ฅ Xin Liu, Jinze Wu, Yinghui Li, Chenkun Qi, Yufei Xue et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Multi-legged robots offer enhanced stability to navigate complex terrains
with their multiple legs interacting with the environment. However, how to
effectively coordinate the multiple legs in a larger action exploration space
to generate natural and robust movements is a key issue. In this paper, we
introduce a motion prior-based approach, successfully applying deep
reinforcement learning algorithms to a real hexapod robot. We generate a
dataset of optimized motion priors, and train an adversarial discriminator
based on the priors to guide the hexapod robot to learn natural gaits. The
learned policy is then successfully transferred to a real hexapod robot, and
demonstrate natural gait patterns and remarkable robustness without visual
information in complex terrains. This is the first time that a reinforcement
learning controller has been used to achieve complex terrain walking on a real
hexapod robot.
๐
2025-11-05
๐ ICRA 2025
๐ฅ Raj Surya Rajendran Kathirvel, Zach A Chavis, Stephen J. Guy, Karthik Desingh
Robotics
ๆฅ็ๆ่ฆ
We introduce SENT-Map, a semantically enhanced topological map for
representing indoor environments, designed to support autonomous navigation and
manipulation by leveraging advancements in foundational models (FMs). Through
representing the environment in a JSON text format, we enable semantic
information to be added and edited in a format that both humans and FMs
understand, while grounding the robot to existing nodes during planning to
avoid infeasible states during deployment. Our proposed framework employs a two
stage approach, first mapping the environment alongside an operator with a
Vision-FM, then using the SENT-Map representation alongside a natural-language
query within an FM for planning. Our experimental results show that
semantic-enhancement enables even small locally-deployable FMs to successfully
plan over indoor environments.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim
Natural Language Processing
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) systems hold great promise for advancing various
scientific disciplines, and are increasingly used in real-world applications.
Despite their remarkable progress, further capabilities are expected in order
to achieve more general types of intelligence. A critical distinction in this
context is between factual knowledge, which can be evaluated against true or
false answers (e.g., "what is the capital of England?"), and probabilistic
knowledge, reflecting probabilistic properties of the real world (e.g., "what
is the sex of a computer science graduate in the US?"). In this paper, our goal
is to build a benchmark for understanding the capabilities of LLMs in terms of
knowledge of probability distributions describing the real world. Given that
LLMs are trained on vast amounts of text, it may be plausible that they
internalize aspects of these distributions. Indeed, LLMs are touted as powerful
universal approximators of real-world distributions. At the same time,
classical results in statistics, known as curse of dimensionality, highlight
fundamental challenges in learning distributions in high dimensions,
challenging the notion of universal distributional learning. In this work, we
develop the first benchmark to directly test this hypothesis, evaluating
whether LLMs have access to empirical distributions describing real-world
populations across domains such as economics, health, education, and social
behavior. Our results demonstrate that LLMs perform poorly overall, and do not
seem to internalize real-world statistics naturally. When interpreted in the
context of Pearl's Causal Hierarchy (PCH), our benchmark demonstrates that
language models do not contain knowledge on observational distributions (Layer
1 of PCH), and thus the Causal Hierarchy Theorem implies that interventional
(Layer 2) and counterfactual (Layer 3) knowledge of these models is also
limited.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Victoria Ebert, Rishabh Singh, Tuochao Chen, Noah A. Smith, Shyamnath Gollakota
ๆฅ็ๆ่ฆ
Conversational AI is constrained in many real-world settings where only one
side of a dialogue can be recorded, such as telemedicine, call centers, and
smart glasses. We formalize this as the one-sided conversation problem (1SC):
inferring and learning from one side of a conversation. We study two tasks: (1)
reconstructing the missing speaker's turns for real-time use cases, and (2)
generating summaries from one-sided transcripts. Evaluating prompting and
finetuned models on MultiWOZ, DailyDialog, and Candor with both human A/B
testing and LLM-as-a-judge metrics, we find that access to one future turn and
information about utterance length improves reconstruction, placeholder
prompting helps to mitigate hallucination, and while large models generate
promising reconstructions with prompting, smaller models require finetuning.
Further, high-quality summaries can be generated without reconstructing missing
turns. We present 1SC as a novel challenge and report promising results that
mark a step toward privacy-aware conversational AI.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Tao Zhang, Kehui Yao, Luyi Ma, Jiao Chen, Reza Yousefi Maragheh et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Evaluating large language models (LLMs) as judges is increasingly critical
for building scalable and trustworthy evaluation pipelines. We present
ScalingEval, a large-scale benchmarking study that systematically compares 36
LLMs, including GPT, Gemini, Claude, and Llama, across multiple product
categories using a consensus-driven evaluation protocol. Our multi-agent
framework aggregates pattern audits and issue codes into ground-truth labels
via scalable majority voting, enabling reproducible comparison of LLM
evaluators without human annotation. Applied to large-scale complementary-item
recommendation, the benchmark reports four key findings: (i) Anthropic Claude
3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers
the best overall performance across categories; (iii) GPT-4o provides the most
favorable latency-accuracy-cost tradeoff; and (iv) GPT-OSS 20B leads among
open-source models. Category-level analysis shows strong consensus in
structured domains (Electronics, Sports) but persistent disagreement in
lifestyle categories (Clothing, Food). These results establish ScalingEval as a
reproducible benchmark and evaluation protocol for LLMs as judges, with
actionable guidance on scaling, reliability, and model family tradeoffs.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Sina Montazeri, Yunhe Feng, Kewei Sha
Natural Language Processing
ๆฅ็ๆ่ฆ
Open data repositories hold potential for evidence-based decision-making, yet
are inaccessible to non-experts lacking expertise in dataset discovery, schema
mapping, and statistical analysis. Large language models show promise for
individual tasks, but end-to-end analytical workflows expose fundamental
limitations: attention dilutes across growing contexts, specialized reasoning
patterns interfere, and errors propagate undetected. We present PublicAgent, a
multi-agent framework that addresses these limitations through decomposition
into specialized agents for intent clarification, dataset discovery, analysis,
and reporting. This architecture maintains focused attention within agent
contexts and enables validation at each stage. Evaluation across five models
and 50 queries derives five design principles for multi-agent LLM systems.
First, specialization provides value independent of model strength--even the
strongest model shows 97.5% agent win rates, with benefits orthogonal to model
scale. Second, agents divide into universal (discovery, analysis) and
conditional (report, intent) categories. Universal agents show consistent
effectiveness (std dev 12.4%) while conditional agents vary by model (std dev
20.5%). Third, agents mitigate distinct failure modes--removing discovery or
analysis causes catastrophic failures (243-280 instances), while removing
report or intent causes quality degradation. Fourth, architectural benefits
persist across task complexity with stable win rates (86-92% analysis, 84-94%
discovery), indicating workflow management value rather than reasoning
enhancement. Fifth, wide variance in agent effectiveness across models (42-96%
for analysis) requires model-aware architecture design. These principles guide
when and why specialization is necessary for complex analytical workflows while
enabling broader access to public data through natural language interfaces.
๐
2025-11-04
๐ Published in 2025 IEEE Big Data
๐ฅ Lingqing Shen, Chi Heem Wong, Misaki Mito, Arnab Chakrabarti
Machine Learning
ๆฅ็ๆ่ฆ
Monitoring the internal temperature and humidity of shipping containers is
essential to preventing quality degradation during cargo transportation.
Sensorless monitoring -- machine learning models that predict the internal
conditions of the containers using exogenous factors -- shows promise as an
alternative to monitoring using sensors. However, it does not incorporate
telemetry information and correct for systematic errors, causing the
predictions to differ significantly from the live data and confusing the users.
In this paper, we introduce the residual correction method, a general framework
for correcting for systematic biases in sensorless models after observing live
telemetry data. We call this class of models ``adaptive-sensorless''
monitoring. We train and evaluate adaptive-sensorless models on the 3.48
million data points -- the largest dataset of container sensor readings ever
used in academic research -- and show that they produce consistent improvements
over the baseline sensorless models. When evaluated on the holdout set of the
simulated data, they achieve average mean absolute errors (MAEs) of 2.24 $\sim$
2.31$^\circ$C (vs 2.43$^\circ$C by sensorless) for temperature and 5.72 $\sim$
7.09% for relative humidity (vs 7.99% by sensorless) and average root
mean-squared errors (RMSEs) of 3.19 $\sim$ 3.26$^\circ$C for temperature (vs
3.38$^\circ$C by sensorless) and 7.70 $\sim$ 9.12% for relative humidity (vs
10.0% by sensorless). Adaptive-sensorless models enable more accurate cargo
monitoring, early risk detection, and less dependence on full connectivity in
global shipping.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenbo Lu
Multimodal
ๆฅ็ๆ่ฆ
Vision-Language Pretraining (VLP) has achieved remarkable success across
various downstream tasks, but such gains are largely driven by scaling up on
training data. Yet, literature methods treat image-text pairs as isolated
training examples; this neglects the rich relational structure naturally
present in many domains, such as e-commerce product co-purchase graphs and
social recommendation networks. Inspired by neuroscientific evidence that human
encodes knowledge as relationship cognitive maps, we introduce Structure-aware
Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive
loss to align modalities while also modeling relationships between neighboring
entities in a structured graph. To support this paradigm, we construct a
large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling
structured cross-modality supervision at scale. Experiment results show that
SLIP consistently outperforms CLIP on cross-modal retrieval and classification
tasks in both zero-shot and few-shot settings, showing the value of relational
supervision for cross-modal alignment.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris et al.
Robotics
ๆฅ็ๆ่ฆ
As AI systems become more capable and widely deployed as agents, ensuring
their safe operation becomes critical. AI control offers one approach to
mitigating the risk from untrusted AI agents by monitoring their actions and
intervening or auditing when necessary. Evaluating the safety of these
protocols requires understanding both their effectiveness against current
attacks and their robustness to adaptive adversaries. In this work, we
systematically evaluate a range of control protocols in SHADE-Arena, a dataset
of diverse agentic environments. First, we evaluate blue team protocols,
including deferral to trusted models, resampling, and deferring on critical
actions, against a default attack policy. We find that resampling for
incrimination and deferring on critical actions perform best, increasing safety
from 50% to 96%. We then iterate on red team strategies against these protocols
and find that attack policies with additional affordances, such as knowledge of
when resampling occurs or the ability to simulate monitors, can substantially
improve attack success rates against our resampling strategy, decreasing safety
to 17%. However, deferring on critical actions is highly robust to even our
strongest red team strategies, demonstrating the importance of denying attack
policies access to protocol internals.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Esther Sun, Zichu Wu
ๆฅ็ๆ่ฆ
The design and application of LLM-based personas in AI companionship is a
rapidly expanding but fragmented field, spanning from virtual emotional
companions and game NPCs to embodied functional robots. This diversity in
objectives, modality, and technical stacks creates an urgent need for a unified
framework. To address this gap, this paper systematizes the field by proposing
a Four-Quadrant Technical Taxonomy for AI companion applications. The framework
is structured along two critical axes: Virtual vs. Embodied and Emotional
Companionship vs. Functional Augmentation. Quadrant I (Virtual Companionship)
explores virtual idols, romantic companions, and story characters, introducing
a four-layer technical framework to analyze their challenges in maintaining
long-term emotional consistency. Quadrant II (Functional Virtual Assistants)
analyzes AI applications in work, gaming, and mental health, highlighting the
shift from "feeling" to "thinking and acting" and pinpointing key technologies
like enterprise RAG and on-device inference. Quadrants III & IV (Embodied
Intelligence) shift from the virtual to the physical world, analyzing home
robots and vertical-domain assistants, revealing core challenges in symbol
grounding, data privacy, and ethical liability. This taxonomy provides not only
a systematic map for researchers and developers to navigate the complex persona
design space but also a basis for policymakers to identify and address the
unique risks inherent in different application scenarios.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Stergios Plataniotis, Charilaos Akasiadis, Georgios Chalkiadakis
Machine Learning
ๆฅ็ๆ่ฆ
Efficient exploration in deep reinforcement learning remains a fundamental
challenge, especially in environments characterized by high-dimensional states
and sparse rewards. Traditional exploration strategies that rely on random
local policy noise, such as $\epsilon$-greedy and Boltzmann exploration
methods, often struggle to efficiently balance exploration and exploitation. In
this paper, we integrate the notion of (expected) value of information (EVOI)
within the well-known Bootstrapped DQN algorithmic framework, to enhance the
algorithm's deep exploration ability. Specifically, we develop two novel
algorithms that incorporate the expected gain from learning the value of
information into Bootstrapped DQN. Our methods use value of information
estimates to measure the discrepancies of opinions among distinct network
heads, and drive exploration towards areas with the most potential. We evaluate
our algorithms with respect to performance and their ability to exploit
inherent uncertainty arising from random network initialization. Our
experiments in complex, sparse-reward Atari games demonstrate increased
performance, all the while making better use of uncertainty, and, importantly,
without introducing extra hyperparameters.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sadiq Layi Macaulay, Nimet Kaygusuz, Simon Hadfield
ๆฅ็ๆ่ฆ
Event cameras, with their high dynamic range (HDR) and low latency, offer a
promising alternative for robust depth estimation in challenging environments.
However, many event-based depth estimation approaches are constrained by
small-scale annotated datasets, limiting their generalizability to real-world
scenarios. To bridge this gap, we introduce EvtSlowTV, a large-scale event
camera dataset curated from publicly available YouTube footage, which contains
more than 13B events across various environmental conditions and motions,
including seasonal hiking, flying, scenic driving, and underwater exploration.
EvtSlowTV is an order of magnitude larger than existing event datasets,
providing an unconstrained, naturalistic setting for event-based depth
learning. This work shows the suitability of EvtSlowTV for a self-supervised
learning framework to capitalise on the HDR potential of raw event streams. We
further demonstrate that training with EvtSlowTV enhances the model's ability
to generalise to complex scenes and motions. Our approach removes the need for
frame-based annotations and preserves the asynchronous nature of event data.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fengxu Li, Stephanie M. Carpenter, Matthew P. Buman, Yonatan Mintz
Multimodal
ๆฅ็ๆ่ฆ
A common challenge for decision makers is selecting actions whose rewards are
unknown and evolve over time based on prior policies. For instance, repeated
use may reduce an action's effectiveness (habituation), while inactivity may
restore it (recovery). These nonstationarities are captured by the Reducing or
Gaining Unknown Efficacy (ROGUE) bandit framework, which models real-world
settings such as behavioral health interventions. While existing algorithms can
compute sublinear regret policies to optimize these settings, they may not
provide sufficient exploration due to overemphasis on exploitation, limiting
the ability to estimate population-level effects. This is a challenge of
particular interest in micro-randomized trials (MRTs) that aid researchers in
developing just-in-time adaptive interventions that have population-level
effects while still providing personalized recommendations to individuals. In
this paper, we first develop ROGUE-TS, a Thompson Sampling algorithm tailored
to the ROGUE framework, and provide theoretical guarantees of sublinear regret.
We then introduce a probability clipping procedure to balance personalization
and population-level learning, with quantified trade-off that balances regret
and minimum exploration probability. Validation on two MRT datasets concerning
physical activity promotion and bipolar disorder treatment shows that our
methods both achieve lower regret than existing approaches and maintain high
statistical power through the clipping procedure without significantly
increasing regret. This enables reliable detection of treatment effects while
accounting for individual behavioral dynamics. For researchers designing MRTs,
our framework offers practical guidance on balancing personalization with
statistical validity.
๐
2025-11-04
๐ ArXiv Preprint (eess.SP)
๐ฅ Sepideh KhakzadGharamaleki, Hassan Rivaz, Brandon Helfield
Natural Language Processing
ๆฅ็ๆ่ฆ
Conventional pulse-echo ultrasound suffers when low-cost probes deliver only
narrow fractional bandwidths, elongating pulses and erasing high-frequency
detail. We address this limitation by learning a data-driven mapping from
band-limited to broadband spectrogram of radio-frequency (RF) lines. To this
end, a variation of Tiny Vision Transform (ViT) auto-encoder is trained on
simulation data using a curriculum-weighted loss. On heterogeneous speckle-cyst
phantoms, the network reduces image-domain MSE by 90 percent, boosts PSNR by
6.7 dB, and raises SSIM to 0.965 compared with the narrow-band input. It also
sharpens point-target rows in a completely unseen resolution phantom,
demonstrating strong out-of-distribution generalisation without sacrificing
frame rate or phase information. These results indicate that a purely software
upgrade can endow installed narrow-band probes with broadband-like performance,
potentially widening access to high-resolution ultrasound in
resource-constrained settings.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Neil Byers, Ali Zaidi, Valerie Skye, Chris Beecroft, Kjiersten Fagnan
Natural Language Processing
ๆฅ็ๆ่ฆ
Efforts have increased in recent years to identify associations between
specific datasets and the scientific literature that incorporates them. Knowing
that a given publication cites a given dataset, the next logical step is to
explore how or why that data was used. Advances in recent years with
pretrained, transformer-based large language models (LLMs) offer potential
means for scaling the description of data use cases in the published
literature. This avoids expensive manual labeling and the development of
training datasets for classical machine-learning (ML) systems. In this work we
apply an open-source LLM, Llama 3.1-405B, to generate structured data use case
labels for publications known to incorporate specific genomic datasets. We also
introduce a novel evaluation framework for determining the efficacy of our
methods. Our results demonstrate that the stock model can achieve an F1 score
of .674 on a zero-shot data citation classification task with no previously
defined categories. While promising, our results are qualified by barriers
related to data availability, prompt overfitting, computational infrastructure,
and the expense required to conduct responsible performance evaluation.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andy Dimnaku, Abdullah Yusuf Kavranoฤlu, Yaser Abu-Mostafa
Machine Learning
ๆฅ็ๆ่ฆ
Data augmentation is widely used in vision to introduce variation and
mitigate overfitting, through enabling models to learn invariant properties,
such as spatial invariance. However, these properties are not fully captured by
data augmentation alone, since it attempts to learn the property on
transformations of the training data only. We propose generative hints, a
training methodology that directly enforces known invariances in the entire
input space. Our approach leverages a generative model trained on the training
set to approximate the input distribution and generate unlabeled images, which
we refer to as virtual examples. These virtual examples are used to enforce
functional properties known as hints. In generative hints, although the
training dataset is fully labeled, the model is trained in a semi-supervised
manner on both the classification and hint objectives, using the unlabeled
virtual examples to guide the model in learning the desired hint. Across
datasets, architectures, and loss functions, generative hints consistently
outperform standard data augmentation when learning the same property. On
popular fine-grained visual classification benchmarks, we achieved up to 1.78%
top-1 accuracy improvement (0.63% on average) over fine-tuned models with data
augmentation and an average performance boost of 1.286% on the CheXpert X-ray
dataset.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski et al.
Robotics
ๆฅ็ๆ่ฆ
Data-driven scientific discovery requires iterative cycles of literature
search, hypothesis generation, and data analysis. Substantial progress has been
made towards AI agents that can automate scientific research, but all such
agents remain limited in the number of actions they can take before losing
coherence, thus limiting the depth of their findings. Here we present Kosmos,
an AI scientist that automates data-driven discovery. Given an open-ended
objective and a dataset, Kosmos runs for up to 12 hours performing cycles of
parallel data analysis, literature search, and hypothesis generation before
synthesizing discoveries into scientific reports. Unlike prior systems, Kosmos
uses a structured world model to share information between a data analysis
agent and a literature search agent. The world model enables Kosmos to
coherently pursue the specified objective over 200 agent rollouts, collectively
executing an average of 42,000 lines of code and reading 1,500 papers per run.
Kosmos cites all statements in its reports with code or primary literature,
ensuring its reasoning is traceable. Independent scientists found 79.4% of
statements in Kosmos reports to be accurate, and collaborators reported that a
single 20-cycle Kosmos run performed the equivalent of 6 months of their own
research time on average. Furthermore, collaborators reported that the number
of valuable scientific findings generated scales linearly with Kosmos cycles
(tested up to 20 cycles). We highlight seven discoveries made by Kosmos that
span metabolomics, materials science, neuroscience, and statistical genetics.
Three discoveries independently reproduce findings from preprinted or
unpublished manuscripts that were not accessed by Kosmos at runtime, while four
make novel contributions to the scientific literature.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, Vinay Kumar Sankarapu
Machine Learning
ๆฅ็ๆ่ฆ
Tabular foundation models represent a growing paradigm in structured data
learning, extending the benefits of large-scale pretraining to tabular domains.
However, their adoption remains limited due to heterogeneous preprocessing
pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the
absence of standardized evaluation for deployment-oriented metrics such as
calibration and fairness. We present TabTune, a unified library that
standardizes the complete workflow for tabular foundation models through a
single interface. TabTune provides consistent access to seven state-of-the-art
models supporting multiple adaptation strategies, including zero-shot
inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient
fine-tuning (PEFT). The framework automates model-aware preprocessing, manages
architectural heterogeneity internally, and integrates evaluation modules for
performance, calibration, and fairness. Designed for extensibility and
reproducibility, TabTune enables consistent benchmarking of adaptation
strategies of tabular foundation models.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyang Xu, Olaf Wysocki, Christoph Holst
ๆฅ็ๆ่ฆ
Evaluating uncertainty is critical for reliable use of Mobile Laser Scanning
(MLS) point clouds in many high-precision applications such as Scan-to-BIM,
deformation analysis, and 3D modeling. However, obtaining the ground truth (GT)
for evaluation is often costly and infeasible in many real-world applications.
To reduce this long-standing reliance on GT in uncertainty evaluation research,
this study presents a learning-based framework for MLS point clouds that
integrates optimal neighborhood estimation with geometric feature extraction.
Experiments on a real-world dataset show that the proposed framework is
feasible and the XGBoost model delivers fully comparable accuracy to Random
Forest while achieving substantially higher efficiency (about 3 times faster),
providing initial evidence that geometric features can be used to predict
point-level uncertainty quantified by the C2C distance. In summary, this study
shows that MLS point clouds' uncertainty is learnable, offering a novel
learning-based viewpoint towards uncertainty evaluation research.
๐
2025-11-04
๐ NeurIPS
๐ฅ Emi Soroka, Artem Arzyn
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recent work in financial machine learning has shown the virtue of complexity:
the phenomenon by which deep learning methods capable of learning highly
nonlinear relationships outperform simpler approaches in financial forecasting.
While transformer architectures like Informer have shown promise for financial
time series forecasting, the application of transformer models for options data
remains largely unexplored. We conduct preliminary studies towards the
development of a transformer model for options data by training the Vision
Transformer (ViT) architecture, typically used in modern image recognition and
classification systems, to predict the realized volatility of an asset over the
next 30 days from its implied volatility surface (augmented with date
information) for a single day. We show that the ViT can learn seasonal patterns
and nonlinear features from the IV surface, suggesting a promising direction
for model development.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenbo Lu
Multimodal
ๆฅ็ๆ่ฆ
Vision-Language Pretraining (VLP) has achieved remarkable success across
various downstream tasks, but such gains are largely driven by scaling up on
training data. Yet, literature methods treat image-text pairs as isolated
training examples; this neglects the rich relational structure naturally
present in many domains, such as e-commerce product co-purchase graphs and
social recommendation networks. Inspired by neuroscientific evidence that human
encodes knowledge as relationship cognitive maps, we introduce Structure-aware
Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive
loss to align modalities while also modeling relationships between neighboring
entities in a structured graph. To support this paradigm, we construct a
large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling
structured cross-modality supervision at scale. Experiment results show that
SLIP consistently outperforms CLIP on cross-modal retrieval and classification
tasks in both zero-shot and few-shot settings, showing the value of relational
supervision for cross-modal alignment.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minh Sao Khue Luu, Bair N. Tuchinov
ๆฅ็ๆ่ฆ
We present a foundation model for brain MRI that can work with different
combinations of imaging sequences. The model uses one encoder with learnable
modality embeddings, conditional layer normalization, and a masked autoencoding
objective that accounts for missing modalities. A variance-covariance
regularizer is applied to stabilize feature learning and improve representation
diversity. This design removes the need for separate models for each modality
and allows the network to adapt when some sequences are missing or unseen. It
is trained on about 60,000 multi-center MRIs using self-supervised
reconstruction and modality imputation to learn flexible representations. A
learnable modality embedding guides feature extraction so the encoder can
adjust to different inputs. We describe our planned evaluation on brain tumor
and multiple sclerosis segmentation, as well as lesion classification, under
various modality settings. Preliminary results show that the method works
feasibly, and further experiments are planned to study its performance in more
detail. All code and pretrained models are available at
https://github.com/BrainFM/brainfm
๐
2025-11-04
๐ 25 pages, 11 figures. Submitted in Science of Remo
๐ฅ Dakota Hester, Vitor S. Martins, Lucas B. Ferreira, Thainara M. A. Lima
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning semantic segmentation methods have shown promising performance
for very high 1-m resolution land cover classification, but the challenge of
collecting large volumes of representative training data creates a significant
barrier to widespread adoption of such models for meter-scale land cover
mapping over large areas. In this study, we present a novel label-efficient
approach for statewide 1-m land cover classification using only 1,000 annotated
reference image patches with self-supervised deep learning. We use the
"Bootstrap Your Own Latent" pre-training strategy with a large amount of
unlabeled color-infrared aerial images (377,921 256x256 1-m pixel patches) to
pre-train a ResNet-101 convolutional encoder. The learned encoder weights were
subsequently transferred into multiple deep semantic segmentation architectures
(FCN, U-Net, Attention U-Net, DeepLabV3+, UPerNet, PAN), which were then
fine-tuned using very small training dataset sizes with cross-validation (250,
500, 750 patches). Among the fine-tuned models, we obtained the 87.14% overall
accuracy and 75.58% macro F1 score using an ensemble of the best performing
U-Net models for comprehensive 1-m, 8-class land cover mapping, covering more
than 123 billion pixels over the state of Mississippi, USA. Detailed
qualitative and quantitative analysis revealed accurate mapping of open water
and forested areas, while highlighting challenges in accurate delineation
between cropland, herbaceous, and barren land cover types. These results show
that self-supervised learning is an effective strategy for reducing the need
for large volumes of manually annotated data, directly addressing a major
limitation to high spatial resolution land cover mapping at scale.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ailar Mahdizadeh, Puria Azadi Moghadam, Xiangteng He, Shahriar Mirabbasi, Panos Nasiopoulos et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) have demonstrated strong cross-modal
capabilities, yet most work remains limited to 2D data and assumes binary
supervision (i.e., positive vs. negative pairs), overlooking the continuous and
structured dependencies present in volumetric data such as CT. Existing
approaches often treat volumetric scans as independent 2D slices, compromising
spatial coherence and underutilizing rich clinical semantics. We propose
SCALE-VLP, a soft-weighted contrastive vision-language pre-training framework
that integrates (i) volumetric spatial semantics to preserve anatomical
structure and (ii) domain-aware, knowledge-infused semantics (e.g.,
radiological ontologies) to guide alignment. This yields structurally
consistent and semantically grounded representations under limited supervision,
demonstrating strong cross-task transferability (retrieval, report generation,
and classification), and cross-domain generalizability with consistent gains
without further fine-tuning. In particular, compared to the previous state of
the art, SCALE-VLP achieves up to 4.3x higher top-1 CT-report retrieval,
improves abnormality classification by 10 points, and reaches ROUGE-L 0.44 and
BERT-F1 0.89 for report generation. Further, in zero-shot evaluation on an
out-of-domain external dataset, we observe consistent gains, indicating the
cross-task and cross-domain generalization ability of SCALE-VLP.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Syed Mostaquim Ali, Taufiq Rahman, Ghazal Farhani, Mohamed H. Zaki, Benoit Anctil et al.
Robotics
ๆฅ็ๆ่ฆ
For developing safe Autonomous Driving Systems (ADS), rigorous testing is
required before they are deemed safe for road deployments. Since comprehensive
conventional physical testing is impractical due to cost and safety concerns,
Virtual Testing Environments (VTE) can be adopted as an alternative. Comparing
VTE-generated sensor outputs against their real-world analogues can be a strong
indication that the VTE accurately represents reality. Correspondingly, this
work explores a comprehensive experimental approach to finding evaluation
metrics suitable for comparing real-world and simulated LiDAR scans. The
metrics were tested in terms of sensitivity and accuracy with different noise,
density, distortion, sensor orientation, and channel settings. From comparing
the metrics, we found that Density Aware Chamfer Distance (DCD) works best
across all cases. In the second step of the research, a Virtual Testing
Environment was generated using real LiDAR scan data. The data was collected in
a controlled environment with only static objects using an instrumented vehicle
equipped with LiDAR, IMU and cameras. Simulated LiDAR scans were generated from
the VTEs using the same pose as real LiDAR scans. The simulated and LiDAR scans
were compared in terms of model perception and geometric similarity. Actual and
simulated LiDAR scans have a similar semantic segmentation output with a mIoU
of 21\% with corrected intensity and an average density aware chamfer distance
(DCD) of 0.63. This indicates a slight difference in the geometric properties
of simulated and real LiDAR scans and a significant difference between model
outputs. During the comparison, density-aware chamfer distance was found to be
the most correlated among the metrics with perception methods.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mikhael Djajapermana, Moritz Reiber, Daniel Mueller-Gritschneder, Ulf Schlichtmann
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Hybrids of Convolutional Neural Network (CNN) and Vision Transformer (ViT)
have outperformed pure CNN or ViT architecture. However, since these
architectures require large parameters and incur large computational costs,
they are unsuitable for tinyML deployment. This paper introduces a new hybrid
CNN-ViT search space for Neural Architecture Search (NAS) to find efficient
hybrid architectures for image classification. The search space covers hybrid
CNN and ViT blocks to learn local and global information, as well as the novel
Pooling block of searchable pooling layers for efficient feature map reduction.
Experimental results on the CIFAR10 dataset show that our proposed search space
can produce hybrid CNN-ViT architectures with superior accuracy and inference
speed to ResNet-based tinyML models under tight model size constraints.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sadiq Layi Macaulay, Nimet Kaygusuz, Simon Hadfield
ๆฅ็ๆ่ฆ
Event cameras, with their high dynamic range (HDR) and low latency, offer a
promising alternative for robust depth estimation in challenging environments.
However, many event-based depth estimation approaches are constrained by
small-scale annotated datasets, limiting their generalizability to real-world
scenarios. To bridge this gap, we introduce EvtSlowTV, a large-scale event
camera dataset curated from publicly available YouTube footage, which contains
more than 13B events across various environmental conditions and motions,
including seasonal hiking, flying, scenic driving, and underwater exploration.
EvtSlowTV is an order of magnitude larger than existing event datasets,
providing an unconstrained, naturalistic setting for event-based depth
learning. This work shows the suitability of EvtSlowTV for a self-supervised
learning framework to capitalise on the HDR potential of raw event streams. We
further demonstrate that training with EvtSlowTV enhances the model's ability
to generalise to complex scenes and motions. Our approach removes the need for
frame-based annotations and preserves the asynchronous nature of event data.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Srikumar Sastry, Subash Khanal, Aayush Dhakal, Jiayu Lin, Dan Cher et al.
Multimodal
ๆฅ็ๆ่ฆ
We introduce ProM3E, a probabilistic masked multimodal embedding model for
any-to-any generation of multimodal representations for ecology. ProM3E is
based on masked modality reconstruction in the embedding space, learning to
infer missing modalities given a few context modalities. By design, our model
supports modality inversion in the embedding space. The probabilistic nature of
our model allows us to analyse the feasibility of fusing various modalities for
given downstream tasks, essentially learning what to fuse. Using these features
of our model, we propose a novel cross-modal retrieval approach that mixes
inter-modal and intra-modal similarities to achieve superior performance across
all retrieval tasks. We further leverage the hidden representation from our
model to perform linear probing tasks and demonstrate the superior
representation learning capability of our model. All our code, datasets and
model will be released at https://vishu26.github.io/prom3e.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Andy Dimnaku, Abdullah Yusuf Kavranoฤlu, Yaser Abu-Mostafa
Machine Learning
ๆฅ็ๆ่ฆ
Data augmentation is widely used in vision to introduce variation and
mitigate overfitting, through enabling models to learn invariant properties,
such as spatial invariance. However, these properties are not fully captured by
data augmentation alone, since it attempts to learn the property on
transformations of the training data only. We propose generative hints, a
training methodology that directly enforces known invariances in the entire
input space. Our approach leverages a generative model trained on the training
set to approximate the input distribution and generate unlabeled images, which
we refer to as virtual examples. These virtual examples are used to enforce
functional properties known as hints. In generative hints, although the
training dataset is fully labeled, the model is trained in a semi-supervised
manner on both the classification and hint objectives, using the unlabeled
virtual examples to guide the model in learning the desired hint. Across
datasets, architectures, and loss functions, generative hints consistently
outperform standard data augmentation when learning the same property. On
popular fine-grained visual classification benchmarks, we achieved up to 1.78%
top-1 accuracy improvement (0.63% on average) over fine-tuned models with data
augmentation and an average performance boost of 1.286% on the CheXpert X-ray
dataset.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Ilerioluwakiiye Abolade, Aniekan Udo, Augustine Ojo, Abdulbasit Oyetunji, Hammed Ajigbotosho et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Glioma segmentation is critical for diagnosis and treatment planning, yet
remains challenging in Sub-Saharan Africa due to limited MRI infrastructure and
heterogeneous acquisition protocols that induce severe domain shift. We propose
SegFormer3D-plus, a radiomics-guided transformer architecture designed for
robust segmentation under domain variability. Our method combines: (1)
histogram matching for intensity harmonization across scanners, (2) radiomic
feature extraction with PCA-reduced k-means for domain-aware stratified
sampling, (3) a dual-pathway encoder with frequency-aware feature extraction
and spatial-channel attention, and (4) composite Dice-Cross-Entropy loss for
boundary refinement. Pretrained on BraTS 2023 and fine-tuned on BraTS-Africa
data, SegFormer3D-plus demonstrates improved tumor subregion delineation and
boundary localization across heterogeneous African clinical scans, highlighting
the value of radiomics-guided domain adaptation for resource-limited settings.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ivan Zvonkov, Gabriel Tseng, Inbal Becker-Reshef, Hannah Kerner
ๆฅ็ๆ่ฆ
Accurate and up-to-date land cover maps are essential for understanding land
use change, a key driver of climate change. Geospatial embeddings offer a more
efficient and accessible way to map landscape features, yet their use in
real-world mapping applications remains underexplored. In this work, we
evaluated the utility of geospatial embeddings for cropland mapping in Togo. We
produced cropland maps using embeddings from Presto and AlphaEarth. Our
findings show that geospatial embeddings can simplify workflows, achieve
high-accuracy cropland classification and ultimately support better assessments
of land use change and its climate impacts.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Harshith Padigela, Shima Nofallah, Atchuth Naveen Chilaparasetti, Ryun Han, Andrew Walker et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Foundation models trained on large-scale pathology image corpora have
demonstrated strong transfer capabilities across diverse histopathology tasks.
Building on this progress, we introduce PLUTO-4, our next generation of
pathology foundation models that extend the Pathology-Universal Transformer
(PLUTO) to frontier scale. We share two complementary Vision Transformer
architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model
optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE
embeddings, and a frontier-scale PLUTO-4G model trained with a single patch
size to maximize representation capacity and stability. Both models are
pretrained using a self-supervised objective derived from DINOv2 on a large
multi-institutional corpus containing 551,164 WSIs from 137,144 patients across
over 50 institutions, spanning over 60 disease types and over 100 stains.
Comprehensive evaluation across public and internal benchmarks demonstrates
that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying
spatial and biological context, including patch-level classification,
segmentation, and slide-level diagnosis. The compact PLUTO-4S provides
high-throughput and robust performance for practical deployment, while PLUTO-4G
establishes new performance frontiers across multiple pathology benchmarks,
including an 11% improvement in dermatopathology diagnosis. These diverse
improvements underscore PLUTO-4's potential to transform real-world
applications as a backbone for translational research and diagnostic use cases.
๐
2025-11-04
๐ ArXiv Preprint (eess.IV)
๐ฅ Chukwuemeka Arua Kalu, Adaobi Chiazor Emegoakor, Fortune Okafor, Augustine Okoh Uchenna, Chijioke Kelvin Ukpai et al.
Computer VisionMultimodal
ๆฅ็ๆ่ฆ
Medical image segmentation is a critical achievement in modern medical
science, developed over decades of research. It allows for the exact
delineation of anatomical and pathological features in two- or
three-dimensional pictures by utilizing notions like pixel intensity, texture,
and anatomical context. With the advent of automated segmentation, physicians
and radiologists may now concentrate on diagnosis and treatment planning while
intelligent computers perform routine image processing tasks.
This study used the BraTS Sub-Saharan Africa dataset, a selected subset of
the BraTS dataset that included 60 multimodal MRI cases from patients with
glioma. Surprisingly, the nnU Net model trained on the initial 60 instances
performed better than the network trained on an offline-augmented dataset of
360 cases. Hypothetically, the offline augmentations introduced artificial
anatomical variances or intensity distributions, reducing generalization. In
contrast, the original dataset, when paired with nnU Net's robust online
augmentation procedures, maintained realistic variability and produced better
results. The study achieved a Dice score of 0.84 for whole tumor segmentation.
These findings highlight the significance of data quality and proper
augmentation approaches in constructing accurate, generalizable medical picture
segmentation models, particularly for under-represented locations.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Riccardo Di Sipio, Jairo Diaz-Rodriguez, Luis Serrano
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We present a geometric framework for understanding Transformer-based language
models, drawing an explicit analogy to General Relativity. Queries and keys
induce an effective metric on representation space, and attention acts as a
discrete connection that implements parallel transport of value vectors across
tokens. Stacked layers provide discrete time-slices through which token
representations evolve on this curved manifold, while backpropagation plays the
role of a least-action principle that shapes loss-minimizing trajectories in
parameter space. If this analogy is correct, token embeddings should not
traverse straight paths in feature space; instead, their layer-wise steps
should bend and reorient as interactions mediated by embedding space curvature.
To test this prediction, we design experiments that expose both the presence
and the consequences of curvature: (i) we visualize a curvature landscape for a
full paragraph, revealing how local turning angles vary across tokens and
layers; (ii) we show through simulations that excess counts of sharp/flat
angles and longer length-to-chord ratios are not explainable by dimensionality
or chance; and (iii) inspired by Einstein's eclipse experiment, we probe
deflection under controlled context edits, demonstrating measurable,
meaning-consistent bends in embedding trajectories that confirm
attention-induced curvature.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Victoria Ebert, Rishabh Singh, Tuochao Chen, Noah A. Smith, Shyamnath Gollakota
ๆฅ็ๆ่ฆ
Conversational AI is constrained in many real-world settings where only one
side of a dialogue can be recorded, such as telemedicine, call centers, and
smart glasses. We formalize this as the one-sided conversation problem (1SC):
inferring and learning from one side of a conversation. We study two tasks: (1)
reconstructing the missing speaker's turns for real-time use cases, and (2)
generating summaries from one-sided transcripts. Evaluating prompting and
finetuned models on MultiWOZ, DailyDialog, and Candor with both human A/B
testing and LLM-as-a-judge metrics, we find that access to one future turn and
information about utterance length improves reconstruction, placeholder
prompting helps to mitigate hallucination, and while large models generate
promising reconstructions with prompting, smaller models require finetuning.
Further, high-quality summaries can be generated without reconstructing missing
turns. We present 1SC as a novel challenge and report promising results that
mark a step toward privacy-aware conversational AI.
๐
2025-11-04
๐ EMNLP 2025
๐ฅ Anthony Hevia, Sanjana Chintalapati, Veronica Ka Wai Lai, Thanh Tam Nguyen, Wai-Tat Wong et al.
ๆฅ็ๆ่ฆ
We present ROBOTO2, an open-source, web-based platform for large language
model (LLM)-assisted risk of bias (ROB) assessment of clinical trials. ROBOTO2
streamlines the traditionally labor-intensive ROB v2 (ROB2) annotation process
via an interactive interface that combines PDF parsing, retrieval-augmented LLM
prompting, and human-in-the-loop review. Users can upload clinical trial
reports, receive preliminary answers and supporting evidence for ROB2 signaling
questions, and provide real-time feedback or corrections to system suggestions.
ROBOTO2 is publicly available at https://roboto2.vercel.app/, with code and
data released to foster reproducibility and adoption. We construct and release
a dataset of 521 pediatric clinical trial reports (8954 signaling questions
with 1202 evidence passages), annotated using both manually and LLM-assisted
methods, serving as a benchmark and enabling future research. Using this
dataset, we benchmark ROB2 performance for 4 LLMs and provide an analysis into
current model capabilities and ongoing challenges in automating this critical
aspect of systematic review.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yan Cathy Hua, Paul Denny, Jรถrg Wicker, Katerina Taลกkova
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Aspect-based Sentiment Analysis (ABSA) is a fine-grained opinion mining
approach that identifies and classifies opinions associated with specific
entities (aspects) or their categories within a sentence. Despite its rapid
growth and broad potential, ABSA research and resources remain concentrated in
commercial domains, leaving analytical needs unmet in high-demand yet
low-resource areas such as education and healthcare. Domain adaptation
challenges and most existing methods' reliance on resource-intensive
in-training knowledge injection further hinder progress in these areas.
Moreover, traditional evaluation methods based on exact matches are overly
rigid for ABSA tasks, penalising any boundary variations which may misrepresent
the performance of generative models. This work addresses these gaps through
three contributions: 1) We propose a novel evaluation method, Flexible Text
Similarity Matching and Optimal Bipartite Pairing (FTS-OBP), which accommodates
realistic extraction boundary variations while maintaining strong correlation
with traditional metrics and offering fine-grained diagnostics. 2) We present
the first ABSA study of small decoder-only generative language models (SLMs;
<7B parameters), examining resource lower bounds via a case study in education
review ABSA. We systematically explore data-free (in-context learning and
weight merging) and data-light fine-tuning methods, and propose a multitask
fine-tuning strategy that significantly enhances SLM performance, enabling
1.5-3.8 B models to surpass proprietary large models and approach benchmark
results with only 200-1,000 examples on a single GPU. 3) We release the first
public set of education review ABSA resources to support future research in
low-resource domains.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hee-Jin Lee, Zhen Guo, Luchao Jin, Morteza Moazami Goudarzi
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller
open-source language models (LLMs) to surpass substantially larger proprietary
models in customer service summarization tasks. The pipeline first analyzes and
categorizes common errors in summaries produced by a teacher model (GPT-3.5),
then performs a targeted revision using a compact editor model (Llama 3.1 70B)
to generate high-quality, refined training data. Fine-tuning a smaller student
model (Llama 3.1 8B) on this refined data resulted in superior summarization
performance compared to GPT-3.5. The ARF pipeline improves cost efficiency and
data privacy while maintaining competitive accuracy, illustrating a
generalizable framework for enhancing open-source LLMs across diverse
downstream applications.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Despite recent progress in using Large Language Models (LLMs) for
automatically generating 3D scenes, generated scenes often lack realistic
spatial layouts and object attributes found in real-world environments. As this
problem stems from insufficiently detailed, coarse-grained instructions,
advancing 3D scene synthesis guided by more detailed, fine-grained instructions
that reflect real-world environments becomes crucial. Without such realistic
scenes, training embodied agents in unrealistic environments can lead them to
learn priors that diverge significantly from real-world physics and semantics,
degrading their performance when deployed. Thus, verifying the alignment
between the fine-grained instruction and the generated scene is essential for
effective learning. However, current evaluation methods, such as CLIPScore and
vision-language models (VLMs), often fail to reliably assess such alignment.
This shortcoming arises primarily from their shallow understanding of 3D
scenes, which often leads to improperly grounded scene components. To address
this, we introduce LEGO-Eval, an evaluation framework equipped with diverse
tools designed to explicitly ground scene components, enabling more accurate
alignment assessments. We also present LEGO-Bench, a benchmark of detailed
instructions that specify complex layouts and attributes of real-world
environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge
by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with
LEGO-Bench reveals significant limitations in current generation methods.
Across all evaluated approaches, success rates reached at most 10% in
generating scenes that fully align with fine-grained instructions.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Cristian Garcรญa-Romero, Miquel Esplร -Gomis, Felipe Sรกnchez-Martรญnez
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern machine translation (MT) systems depend on large parallel corpora,
often collected from the Internet. However, recent evidence indicates that (i)
a substantial portion of these texts are machine-generated translations, and
(ii) an overreliance on such synthetic content in training data can
significantly degrade translation quality. As a result, filtering out non-human
translations is becoming an essential pre-processing step in building
high-quality MT systems. In this work, we propose a novel approach that
directly exploits the internal representations of a surrogate multilingual MT
model to distinguish between human and machine-translated sentences.
Experimental results show that our method outperforms current state-of-the-art
techniques, particularly for non-English language pairs, achieving gains of at
least 5 percentage points of accuracy.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Neil Byers, Ali Zaidi, Valerie Skye, Chris Beecroft, Kjiersten Fagnan
Natural Language Processing
ๆฅ็ๆ่ฆ
Efforts have increased in recent years to identify associations between
specific datasets and the scientific literature that incorporates them. Knowing
that a given publication cites a given dataset, the next logical step is to
explore how or why that data was used. Advances in recent years with
pretrained, transformer-based large language models (LLMs) offer potential
means for scaling the description of data use cases in the published
literature. This avoids expensive manual labeling and the development of
training datasets for classical machine-learning (ML) systems. In this work we
apply an open-source LLM, Llama 3.1-405B, to generate structured data use case
labels for publications known to incorporate specific genomic datasets. We also
introduce a novel evaluation framework for determining the efficacy of our
methods. Our results demonstrate that the stock model can achieve an F1 score
of .674 on a zero-shot data citation classification task with no previously
defined categories. While promising, our results are qualified by barriers
related to data availability, prompt overfitting, computational infrastructure,
and the expense required to conduct responsible performance evaluation.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shuhang Lin, Zhencan Peng, Lingyao Li, Xiao Lin, Xi Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Large Language Model (LLM)-based agents have been
propelled by Retrieval-Augmented Generation (RAG), which grants the models
access to vast external knowledge bases. Despite RAG's success in improving
agent performance, agent-level cache management, particularly constructing,
maintaining, and updating a compact, relevant corpus dynamically tailored to
each agent's need, remains underexplored. Therefore, we introduce ARC (Agent
RAG Cache Mechanism), a novel, annotation-free caching framework that
dynamically manages small, high-value corpora for each agent. By synthesizing
historical query distribution patterns with the intrinsic geometry of cached
items in the embedding space, ARC automatically maintains a high-relevance
cache. With comprehensive experiments on three retrieval datasets, our
experimental results demonstrate that ARC reduces storage requirements to
0.015% of the original corpus while offering up to 79.8% has-answer rate and
reducing average retrieval latency by 80%. Our results demonstrate that ARC can
drastically enhance efficiency and effectiveness in RAG-powered LLM agents.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fatemeh Ghaffari, Siddarth Sitaraman, Xutong Liu, Xuchuang Wang, Mohammad Hajiesmaili
Robotics
ๆฅ็ๆ่ฆ
Online learning to rank (OLTR) studies how to recommend a short ranked list
of items from a large pool and improves future rankings based on user clicks.
This setting is commonly modeled as cascading bandits, where the objective is
to maximize the likelihood that the user clicks on at least one of the
presented items across as many timesteps as possible. However, such systems are
vulnerable to click fraud and other manipulations (i.e., corruption), where
bots or paid click farms inject corrupted feedback that misleads the learning
process and degrades user experience. In this paper, we propose MSUCB, a robust
algorithm that incorporates a novel mean-of-medians estimator, which to our
knowledge is applied to bandits with corruption setting for the first time.
This estimator behaves like a standard mean in the absence of corruption, so no
cost is paid for robustness. Under corruption, the median step filters out
outliers and corrupted samples, keeping the estimate close to its true value.
Updating this estimate at every round further accelerates empirical convergence
in experiments. Hence, MSUCB achieves optimal logarithmic regret in the absence
of corruption and degrades gracefully under corruptions, with regret increasing
only by an additive term tied to the total corruption. Comprehensive and
extensive experiments on real-world datasets further demonstrate that our
approach consistently outperforms prior methods while maintaining strong
robustness. In particular, it achieves a \(97.35\%\) and a \(91.60\%\) regret
improvement over two state-of-the-art methods.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim
Natural Language Processing
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) systems hold great promise for advancing various
scientific disciplines, and are increasingly used in real-world applications.
Despite their remarkable progress, further capabilities are expected in order
to achieve more general types of intelligence. A critical distinction in this
context is between factual knowledge, which can be evaluated against true or
false answers (e.g., "what is the capital of England?"), and probabilistic
knowledge, reflecting probabilistic properties of the real world (e.g., "what
is the sex of a computer science graduate in the US?"). In this paper, our goal
is to build a benchmark for understanding the capabilities of LLMs in terms of
knowledge of probability distributions describing the real world. Given that
LLMs are trained on vast amounts of text, it may be plausible that they
internalize aspects of these distributions. Indeed, LLMs are touted as powerful
universal approximators of real-world distributions. At the same time,
classical results in statistics, known as curse of dimensionality, highlight
fundamental challenges in learning distributions in high dimensions,
challenging the notion of universal distributional learning. In this work, we
develop the first benchmark to directly test this hypothesis, evaluating
whether LLMs have access to empirical distributions describing real-world
populations across domains such as economics, health, education, and social
behavior. Our results demonstrate that LLMs perform poorly overall, and do not
seem to internalize real-world statistics naturally. When interpreted in the
context of Pearl's Causal Hierarchy (PCH), our benchmark demonstrates that
language models do not contain knowledge on observational distributions (Layer
1 of PCH), and thus the Causal Hierarchy Theorem implies that interventional
(Layer 2) and counterfactual (Layer 3) knowledge of these models is also
limited.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Martin Carrasco, Olga Zaghen, Erik Bekkers, Bastian Rieck
Machine Learning
ๆฅ็ๆ่ฆ
For far too long, expressivity of graph neural networks has been measured
\emph{only} in terms of combinatorial properties. In this work we stray away
from this tradition and provide a principled way to measure similarity between
vertex attributed graphs. We denote this measure as the \emph{graph
homomorphism distortion}. We show it can \emph{completely characterize} graphs
and thus is also a \emph{complete graph embedding}. However, somewhere along
the road, we run into the graph canonization problem. To circumvent this
obstacle, we devise to efficiently compute this measure via sampling, which in
expectation ensures \emph{completeness}. Additionally, we also discovered that
we can obtain a metric from this measure. We validate our claims empirically
and find that the \emph{graph homomorphism distortion}: (1.) fully
distinguishes the \texttt{BREC} dataset with up to $4$-WL non-distinguishable
graphs, and (2.) \emph{outperforms} previous methods inspired in homomorphisms
under the \texttt{ZINC-12k} dataset.
These theoretical results, (and their empirical validation), pave the way for
future characterization of graphs, extending the graph theoretic tradition to
new frontiers.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Riccardo Di Sipio, Jairo Diaz-Rodriguez, Luis Serrano
RoboticsNatural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We present a geometric framework for understanding Transformer-based language
models, drawing an explicit analogy to General Relativity. Queries and keys
induce an effective metric on representation space, and attention acts as a
discrete connection that implements parallel transport of value vectors across
tokens. Stacked layers provide discrete time-slices through which token
representations evolve on this curved manifold, while backpropagation plays the
role of a least-action principle that shapes loss-minimizing trajectories in
parameter space. If this analogy is correct, token embeddings should not
traverse straight paths in feature space; instead, their layer-wise steps
should bend and reorient as interactions mediated by embedding space curvature.
To test this prediction, we design experiments that expose both the presence
and the consequences of curvature: (i) we visualize a curvature landscape for a
full paragraph, revealing how local turning angles vary across tokens and
layers; (ii) we show through simulations that excess counts of sharp/flat
angles and longer length-to-chord ratios are not explainable by dimensionality
or chance; and (iii) inspired by Einstein's eclipse experiment, we probe
deflection under controlled context edits, demonstrating measurable,
meaning-consistent bends in embedding trajectories that confirm
attention-induced curvature.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Victoria Ebert, Rishabh Singh, Tuochao Chen, Noah A. Smith, Shyamnath Gollakota
ๆฅ็ๆ่ฆ
Conversational AI is constrained in many real-world settings where only one
side of a dialogue can be recorded, such as telemedicine, call centers, and
smart glasses. We formalize this as the one-sided conversation problem (1SC):
inferring and learning from one side of a conversation. We study two tasks: (1)
reconstructing the missing speaker's turns for real-time use cases, and (2)
generating summaries from one-sided transcripts. Evaluating prompting and
finetuned models on MultiWOZ, DailyDialog, and Candor with both human A/B
testing and LLM-as-a-judge metrics, we find that access to one future turn and
information about utterance length improves reconstruction, placeholder
prompting helps to mitigate hallucination, and while large models generate
promising reconstructions with prompting, smaller models require finetuning.
Further, high-quality summaries can be generated without reconstructing missing
turns. We present 1SC as a novel challenge and report promising results that
mark a step toward privacy-aware conversational AI.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyang Xu, Olaf Wysocki, Christoph Holst
ๆฅ็ๆ่ฆ
Evaluating uncertainty is critical for reliable use of Mobile Laser Scanning
(MLS) point clouds in many high-precision applications such as Scan-to-BIM,
deformation analysis, and 3D modeling. However, obtaining the ground truth (GT)
for evaluation is often costly and infeasible in many real-world applications.
To reduce this long-standing reliance on GT in uncertainty evaluation research,
this study presents a learning-based framework for MLS point clouds that
integrates optimal neighborhood estimation with geometric feature extraction.
Experiments on a real-world dataset show that the proposed framework is
feasible and the XGBoost model delivers fully comparable accuracy to Random
Forest while achieving substantially higher efficiency (about 3 times faster),
providing initial evidence that geometric features can be used to predict
point-level uncertainty quantified by the C2C distance. In summary, this study
shows that MLS point clouds' uncertainty is learnable, offering a novel
learning-based viewpoint towards uncertainty evaluation research.
๐
2025-11-04
๐ ArXiv Preprint (math.OC)
๐ฅ Henry Shugart, Jason M. Altschuler
Machine Learning
ๆฅ็ๆ่ฆ
Classically, a mainstream approach for solving a convex-concave min-max
problem is to instead solve the variational inequality problem arising from its
first-order optimality conditions. Is it possible to solve min-max problems
faster by bypassing this reduction? This paper initiates this investigation. We
show that the answer is yes in the textbook setting of unconstrained quadratic
objectives: the optimal convergence rate for first-order algorithms is strictly
better for min-max problems than for the corresponding variational
inequalities. The key reason that min-max algorithms can be faster is that they
can exploit the asymmetry of the min and max variables--a property that is lost
in the reduction to variational inequalities. Central to our analyses are sharp
characterizations of optimal convergence rates in terms of extremal polynomials
which we compute using Green's functions and conformal mappings.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Katharine E Fisher, Matthew TC Li, Youssef Marzouk, Timo Schorlepp
Machine Learning
ๆฅ็ๆ่ฆ
Gradient information is widely useful and available in applications, and is
therefore natural to include in the training of neural networks. Yet little is
known theoretically about the impact of Sobolev training -- regression with
both function and gradient data -- on the generalization error of highly
overparameterized predictive models in high dimensions. In this paper, we
obtain a precise characterization of this training modality for random feature
(RF) models in the limit where the number of trainable parameters, input
dimensions, and training data tend proportionally to infinity. Our model for
Sobolev training reflects practical implementations by sketching gradient data
onto finite dimensional subspaces. By combining the replica method from
statistical physics with linearizations in operator-valued free probability
theory, we derive a closed-form description for the generalization errors of
the trained RF models. For target functions described by single-index models,
we demonstrate that supplementing function data with additional gradient data
does not universally improve predictive performance. Rather, the degree of
overparameterization should inform the choice of training method. More broadly,
our results identify settings where models perform optimally by interpolating
noisy function and gradient data.
๐
2025-11-04
๐ ICLR 2026
๐ฅ Emi Soroka, Tanmay Chopra, Krish Desai, Sanjay Lall
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have seen increasing popularity in enterprise
applications where AI agents and humans engage in objective-driven
interactions. However, these systems are difficult to evaluate: data may be
complex and unlabeled; human annotation is often impractical at scale; custom
metrics can monitor for specific errors, but not previously-undetected ones;
and LLM judges can produce unreliable results. We introduce the first set of
unsupervised metrics for objective-driven interactions, leveraging statistical
properties of unlabeled interaction data and using fine-tuned LLMs to adapt to
distributional shifts. We develop metrics for labeling user goals, measuring
goal completion, and quantifying LLM uncertainty without grounding evaluations
in human-generated ideal responses. Our approach is validated on open-domain
and task-specific interaction data.
๐
2025-11-04
๐ NeurIPS
๐ฅ Emi Soroka, Artem Arzyn
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Recent work in financial machine learning has shown the virtue of complexity:
the phenomenon by which deep learning methods capable of learning highly
nonlinear relationships outperform simpler approaches in financial forecasting.
While transformer architectures like Informer have shown promise for financial
time series forecasting, the application of transformer models for options data
remains largely unexplored. We conduct preliminary studies towards the
development of a transformer model for options data by training the Vision
Transformer (ViT) architecture, typically used in modern image recognition and
classification systems, to predict the realized volatility of an asset over the
next 30 days from its implied volatility surface (augmented with date
information) for a single day. We show that the ViT can learn seasonal patterns
and nonlinear features from the IV surface, suggesting a promising direction
for model development.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yan Cathy Hua, Paul Denny, Jรถrg Wicker, Katerina Taลกkova
Machine LearningNatural Language Processing
ๆฅ็ๆ่ฆ
Aspect-based Sentiment Analysis (ABSA) is a fine-grained opinion mining
approach that identifies and classifies opinions associated with specific
entities (aspects) or their categories within a sentence. Despite its rapid
growth and broad potential, ABSA research and resources remain concentrated in
commercial domains, leaving analytical needs unmet in high-demand yet
low-resource areas such as education and healthcare. Domain adaptation
challenges and most existing methods' reliance on resource-intensive
in-training knowledge injection further hinder progress in these areas.
Moreover, traditional evaluation methods based on exact matches are overly
rigid for ABSA tasks, penalising any boundary variations which may misrepresent
the performance of generative models. This work addresses these gaps through
three contributions: 1) We propose a novel evaluation method, Flexible Text
Similarity Matching and Optimal Bipartite Pairing (FTS-OBP), which accommodates
realistic extraction boundary variations while maintaining strong correlation
with traditional metrics and offering fine-grained diagnostics. 2) We present
the first ABSA study of small decoder-only generative language models (SLMs;
<7B parameters), examining resource lower bounds via a case study in education
review ABSA. We systematically explore data-free (in-context learning and
weight merging) and data-light fine-tuning methods, and propose a multitask
fine-tuning strategy that significantly enhances SLM performance, enabling
1.5-3.8 B models to surpass proprietary large models and approach benchmark
results with only 200-1,000 examples on a single GPU. 3) We release the first
public set of education review ABSA resources to support future research in
low-resource domains.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ James C. Bowden, Sergey Levine, Jennifer Listgarten
Machine LearningRobotics
ๆฅ็ๆ่ฆ
In the era of AI-driven science and engineering, we often want to design
discrete objects in silico according to user-specified properties. For example,
we may wish to design a protein to bind its target, arrange components within a
circuit to minimize latency, or find materials with certain properties. Given a
property predictive model, in silico design typically involves training a
generative model over the design space (e.g., protein sequence space) to
concentrate on designs with the desired properties. Distributional optimization
-- which can be formalized as an estimation of distribution algorithm or as
reinforcement learning policy optimization -- finds the generative model that
maximizes an objective function in expectation. Optimizing a distribution over
discrete-valued designs is in general challenging because of the combinatorial
nature of the design space. However, many property predictors in scientific
applications are decomposable in the sense that they can be factorized over
design variables in a way that could in principle enable more effective
optimization. For example, amino acids at a catalytic site of a protein may
only loosely interact with amino acids of the rest of the protein to achieve
maximal catalytic activity. Current distributional optimization algorithms are
unable to make use of such decomposability structure. Herein, we propose and
demonstrate use of a new distributional optimization algorithm,
Decomposition-Aware Distributional Optimization (DADO), that can leverage any
decomposability defined by a junction tree on the design variables, to make
optimization more efficient. At its core, DADO employs a soft-factorized
"search distribution" -- a learned generative model -- for efficient navigation
of the search space, invoking graph message-passing to coordinate optimization
across linked factors.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Tao Zhang, Kehui Yao, Luyi Ma, Jiao Chen, Reza Yousefi Maragheh et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Evaluating large language models (LLMs) as judges is increasingly critical
for building scalable and trustworthy evaluation pipelines. We present
ScalingEval, a large-scale benchmarking study that systematically compares 36
LLMs, including GPT, Gemini, Claude, and Llama, across multiple product
categories using a consensus-driven evaluation protocol. Our multi-agent
framework aggregates pattern audits and issue codes into ground-truth labels
via scalable majority voting, enabling reproducible comparison of LLM
evaluators without human annotation. Applied to large-scale complementary-item
recommendation, the benchmark reports four key findings: (i) Anthropic Claude
3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers
the best overall performance across categories; (iii) GPT-4o provides the most
favorable latency-accuracy-cost tradeoff; and (iv) GPT-OSS 20B leads among
open-source models. Category-level analysis shows strong consensus in
structured domains (Electronics, Sports) but persistent disagreement in
lifestyle categories (Clothing, Food). These results establish ScalingEval as a
reproducible benchmark and evaluation protocol for LLMs as judges, with
actionable guidance on scaling, reliability, and model family tradeoffs.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Rohan Kota, Kaival Shah, J. Edward Colgate, Gregory Reardon
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Tactile sensing plays a key role in enabling dexterous and reliable robotic
manipulation, but realizing this capability requires substantial calibration to
convert raw sensor readings into physically meaningful quantities. Despite its
near-universal necessity, the calibration process remains ad hoc and
labor-intensive. Here, we introduce \libname{}, an open-source library that
transforms a low-cost 3D printer into an automated probing device capable of
generating large volumes of labeled training data for tactile sensor
calibration. We demonstrate the utility of \libname{} by calibrating two
commercially available vision-based tactile sensors, DIGIT and GelSight Mini,
to reconstruct high-quality depth maps using the collected data and a custom
convolutional neural network. In addition, we perform a data ablation study to
determine how much data is needed for accurate calibration, providing practical
guidelines for researchers working with these specific sensors, and we
benchmark the trained models on previously unseen objects to evaluate
calibration accuracy and generalization performance. By automating tactile
sensor calibration, \libname{} can accelerate tactile sensing research,
simplify sensor deployment, and promote the practical integration of tactile
sensing in robotic platforms.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ R. Khorrambakht, Joaquim Ortiz-Haro, Joseph Amigo, Omar Mostafa, Daniel Dugas et al.
Robotics
ๆฅ็ๆ่ฆ
Robots must understand their environment from raw sensory inputs and reason
about the consequences of their actions in it to solve complex tasks. Behavior
Cloning (BC) leverages task-specific human demonstrations to learn this
knowledge as end-to-end policies. However, these policies are difficult to
transfer to new tasks, and generating training data is challenging because it
requires careful demonstrations and frequent environment resets. In contrast to
such policy-based view, in this paper we take a model-based approach where we
collect a few hours of unstructured easy-to-collect play data to learn an
action-conditioned visual world model, a diffusion-based action sampler, and
optionally a reward model. The world model -- in combination with the action
sampler and a reward model -- is then used to optimize long sequences of
actions with a Monte Carlo Tree Search (MCTS) planner. The resulting plans are
executed on the robot via a zeroth-order Model Predictive Controller (MPC). We
show that the action sampler mitigates hallucinations of the world model during
planning and validate our approach on 3 real-world robotic tasks with varying
levels of planning and modeling complexity. Our experiments support the
hypothesis that planning leads to a significant improvement over BC baselines
on a standard manipulation test environment.
๐
2025-11-04
๐ ICRA 2026
๐ฅ Markus Buchholz, Ignacio Carlucho, Yvan R. Petillot
RoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
The safe deployment of autonomous systems in safety-critical settings
requires a paradigm that combines human expertise with AI-driven analysis,
especially when anomalies are unforeseen. We introduce AURA (Autonomous
Resilience Agent), a collaborative framework for anomaly and fault diagnostics
in robotics. AURA integrates large language models (LLMs), a high-fidelity
digital twin (DT), and human-in-the-loop interaction to detect and respond to
anomalous behavior in real time. The architecture uses two agents with clear
roles: (i) a low-level State Anomaly Characterization Agent that monitors
telemetry and converts signals into a structured natural-language problem
description, and (ii) a high-level Diagnostic Reasoning Agent that conducts a
knowledge-grounded dialogue with an operator to identify root causes, drawing
on external sources. Human-validated diagnoses are then converted into new
training examples that refine the low-level perceptual model. This feedback
loop progressively distills expert knowledge into the AI, transforming it from
a static tool into an adaptive partner. We describe the framework's operating
principles and provide a concrete implementation, establishing a pattern for
trustworthy, continually improving human-robot teams.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Syed Mostaquim Ali, Taufiq Rahman, Ghazal Farhani, Mohamed H. Zaki, Benoit Anctil et al.
Robotics
ๆฅ็ๆ่ฆ
For developing safe Autonomous Driving Systems (ADS), rigorous testing is
required before they are deemed safe for road deployments. Since comprehensive
conventional physical testing is impractical due to cost and safety concerns,
Virtual Testing Environments (VTE) can be adopted as an alternative. Comparing
VTE-generated sensor outputs against their real-world analogues can be a strong
indication that the VTE accurately represents reality. Correspondingly, this
work explores a comprehensive experimental approach to finding evaluation
metrics suitable for comparing real-world and simulated LiDAR scans. The
metrics were tested in terms of sensitivity and accuracy with different noise,
density, distortion, sensor orientation, and channel settings. From comparing
the metrics, we found that Density Aware Chamfer Distance (DCD) works best
across all cases. In the second step of the research, a Virtual Testing
Environment was generated using real LiDAR scan data. The data was collected in
a controlled environment with only static objects using an instrumented vehicle
equipped with LiDAR, IMU and cameras. Simulated LiDAR scans were generated from
the VTEs using the same pose as real LiDAR scans. The simulated and LiDAR scans
were compared in terms of model perception and geometric similarity. Actual and
simulated LiDAR scans have a similar semantic segmentation output with a mIoU
of 21\% with corrected intensity and an average density aware chamfer distance
(DCD) of 0.63. This indicates a slight difference in the geometric properties
of simulated and real LiDAR scans and a significant difference between model
outputs. During the comparison, density-aware chamfer distance was found to be
the most correlated among the metrics with perception methods.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sadiq Layi Macaulay, Nimet Kaygusuz, Simon Hadfield
ๆฅ็ๆ่ฆ
Event cameras, with their high dynamic range (HDR) and low latency, offer a
promising alternative for robust depth estimation in challenging environments.
However, many event-based depth estimation approaches are constrained by
small-scale annotated datasets, limiting their generalizability to real-world
scenarios. To bridge this gap, we introduce EvtSlowTV, a large-scale event
camera dataset curated from publicly available YouTube footage, which contains
more than 13B events across various environmental conditions and motions,
including seasonal hiking, flying, scenic driving, and underwater exploration.
EvtSlowTV is an order of magnitude larger than existing event datasets,
providing an unconstrained, naturalistic setting for event-based depth
learning. This work shows the suitability of EvtSlowTV for a self-supervised
learning framework to capitalise on the HDR potential of raw event streams. We
further demonstrate that training with EvtSlowTV enhances the model's ability
to generalise to complex scenes and motions. Our approach removes the need for
frame-based annotations and preserves the asynchronous nature of event data.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Mirco Felske, Jannik Redenius, Georg Happich, Julius Schรถning
Robotics
ๆฅ็ๆ่ฆ
The agricultural sector increasingly relies on autonomous systems that
operate in complex and variable environments. Unlike on-road applications,
agricultural automation integrates driving and working processes, each of which
imposes distinct operational constraints. Handling this complexity and ensuring
consistency throughout the development and validation processes requires a
structured, transparent, and verified description of the environment. However,
existing Operational Design Domain (ODD) concepts do not yet address the unique
challenges of agricultural applications.
Therefore, this work introduces the Agricultural ODD (Ag-ODD) Framework,
which can be used to describe and verify the operational boundaries of
autonomous agricultural systems. The Ag-ODD Framework consists of three core
elements. First, the Ag-ODD description concept, which provides a structured
method for unambiguously defining environmental and operational parameters
using concepts from ASAM Open ODD and CityGML. Second, the 7-Layer Model
derived from the PEGASUS 6-Layer Model, has been extended to include a process
layer to capture dynamic agricultural operations. Third, the iterative
verification process verifies the Ag-ODD against its corresponding logical
scenarios, derived from the 7-Layer Model, to ensure the Ag-ODD's completeness
and consistency.
Together, these elements provide a consistent approach for creating
unambiguous and verifiable Ag-ODD. Demonstrative use cases show how the Ag-ODD
Framework can support the standardization and scalability of environmental
descriptions for autonomous agricultural systems.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have shown strong capabilities but
remain limited to fixed modality pairs and require costly fine-tuning with
large aligned datasets. Building fully omni-capable models that can integrate
text, images, audio, and video remains impractical and lacks robust reasoning
support. In this paper, we propose an Agent-Omni framework that coordinates
existing foundation models through a master-agent system, enabling flexible
multimodal reasoning without retraining. The master agent interprets user
intent, delegates subtasks to modality-specific agents, and integrates their
outputs into coherent responses. Extensive experiments across text, image,
audio, video, and omni benchmarks show that Agent-Omni consistently achieves
state-of-the-art performance, particularly on tasks requiring complex
cross-modal reasoning. Its agent-based design enables seamless integration of
specialized foundation models, ensuring adaptability to diverse inputs while
maintaining transparency and interpretability. In addition, the framework is
modular and easily extensible, allowing future improvements as stronger models
become available.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Artur d'Avila Garcez, Simon Odense
Machine Learning
ๆฅ็ๆ่ฆ
Artificial Intelligence (AI) is a powerful new language of science as
evidenced by recent Nobel Prizes in chemistry and physics that recognized
contributions to AI applied to those areas. Yet, this new language lacks
semantics, which makes AI's scientific discoveries unsatisfactory at best. With
the purpose of uncovering new facts but also improving our understanding of the
world, AI-based science requires formalization through a framework capable of
translating insight into comprehensible scientific knowledge. In this paper, we
argue that logic offers an adequate framework. In particular, we use logic in a
neurosymbolic framework to offer a much needed semantics for deep learning, the
neural network-based technology of current AI. Deep learning and neurosymbolic
AI lack a general set of conditions to ensure that desirable properties are
satisfied. Instead, there is a plethora of encoding and knowledge extraction
approaches designed for particular cases. To rectify this, we introduced a
framework for semantic encoding, making explicit the mapping between neural
networks and logic, and characterizing the common ingredients of the various
existing approaches. In this paper, we describe succinctly and exemplify how
logical semantics and neural networks are linked through this framework, we
review some of the most prominent approaches and techniques developed for
neural encoding and knowledge extraction, provide a formal definition of our
framework, and discuss some of the difficulties of identifying a semantic
encoding in practice in light of analogous problems in the philosophy of mind.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski et al.
Robotics
ๆฅ็ๆ่ฆ
Data-driven scientific discovery requires iterative cycles of literature
search, hypothesis generation, and data analysis. Substantial progress has been
made towards AI agents that can automate scientific research, but all such
agents remain limited in the number of actions they can take before losing
coherence, thus limiting the depth of their findings. Here we present Kosmos,
an AI scientist that automates data-driven discovery. Given an open-ended
objective and a dataset, Kosmos runs for up to 12 hours performing cycles of
parallel data analysis, literature search, and hypothesis generation before
synthesizing discoveries into scientific reports. Unlike prior systems, Kosmos
uses a structured world model to share information between a data analysis
agent and a literature search agent. The world model enables Kosmos to
coherently pursue the specified objective over 200 agent rollouts, collectively
executing an average of 42,000 lines of code and reading 1,500 papers per run.
Kosmos cites all statements in its reports with code or primary literature,
ensuring its reasoning is traceable. Independent scientists found 79.4% of
statements in Kosmos reports to be accurate, and collaborators reported that a
single 20-cycle Kosmos run performed the equivalent of 6 months of their own
research time on average. Furthermore, collaborators reported that the number
of valuable scientific findings generated scales linearly with Kosmos cycles
(tested up to 20 cycles). We highlight seven discoveries made by Kosmos that
span metabolomics, materials science, neuroscience, and statistical genetics.
Three discoveries independently reproduce findings from preprinted or
unpublished manuscripts that were not accessed by Kosmos at runtime, while four
make novel contributions to the scientific literature.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov et al.
Robotics
ๆฅ็ๆ่ฆ
As AI deployments become more complex and high-stakes, it becomes
increasingly important to be able to estimate their risk. AI control is one
framework for doing so. However, good control evaluations require eliciting
strong attack policies. This can be challenging in complex agentic environments
where compute constraints leave us data-poor. In this work, we show how to
optimize attack policies in SHADE-Arena, a dataset of diverse realistic control
environments. We do this by decomposing attack capability into five constituent
skills -- suspicion modeling, attack selection, plan synthesis, execution and
subtlety -- and optimizing each component individually. To get around the
constraint of limited data, we develop a probabilistic model of attack
dynamics, optimize our attack hyperparameters using this simulation, and then
show that the results transfer to SHADE-Arena. This results in a substantial
improvement in attack strength, reducing safety score from a baseline of 0.87
to 0.41 using our scaffold.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mohamed Bouadi, Pratinav Seth, Aditya Tanna, Vinay Kumar Sankarapu
ๆฅ็ๆ่ฆ
Tabular data remain the predominant format for real-world applications. Yet,
developing effective neural models for tabular data remains challenging due to
heterogeneous feature types and complex interactions occurring at multiple
scales. Recent advances in tabular in-context learning (ICL), such as TabPFN
and TabICL, have achieved state-of-the-art performance comparable to
gradient-boosted trees (GBTs) without task-specific fine-tuning. However,
current architectures exhibit key limitations: (1) single-scale feature
processing that overlooks hierarchical dependencies, (2) dense attention with
quadratic scaling in table width, and (3) strictly sequential component
processing that prevents iterative representation refinement and
cross-component communication. To address these challenges, we introduce
Orion-MSP, a tabular ICL architecture featuring three key innovations: (1)
multi-scale processing to capture hierarchical feature interactions; (2)
block-sparse attention combining windowed, global, and random patterns for
scalable efficiency and long-range connectivity; and (3) a Perceiver-style
memory enabling safe bidirectional information flow across components. Across
diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance
while scaling effectively to high-dimensional tables, establishing a new
standard for efficient tabular in-context learning. The model is publicly
available at https://github.com/Lexsi-Labs/Orion-MSP .
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, Matthew R. Gormley
Natural Language Processing
ๆฅ็ๆ่ฆ
As model context lengths continue to grow, concerns about whether models
effectively use the full context length have persisted. While several carefully
designed long-context evaluations have recently been released, these
evaluations tend to rely on retrieval from one or more sections of the context,
which allows nearly all of the context tokens to be disregarded as noise. This
represents only one type of task that might be performed with long context. We
introduce Oolong, a benchmark of long-context reasoning tasks that require
analyzing individual chunks of text on an atomic level, and then aggregating
these analyses to answer distributional questions. Oolong is separated into two
task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can
easily ablate components of the reasoning problem; and Oolong-real, a
downstream setting which requires reasoning over real-world conversational
data. Oolong requires models to reason over large quantities of examples, to
perform both classification and counting in-context, and to reason over
temporal and user relations. Even frontier models struggle on Oolong, with
GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy
on both splits at 128K. We release the data and evaluation harness for Oolong
to enable further development of models that can reason over large quantities
of text.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Morgan Allen, Paul Savala
Machine Learning
ๆฅ็ๆ่ฆ
In Major League Baseball, strategy and planning are major factors in
determining the outcome of a game. Previous studies have aided this by building
machine learning models for predicting the winning team of any given game. We
extend this work by training a comprehensive set of machine learning models
using a common dataset. In addition, we relate the win probabilities produced
by these models to win strength as measured by score differential. In doing so
we show that the most common machine learning models do indeed demonstrate a
relationship between predicted win probability and the strength of the win.
Finally, we analyze the results of using predicted win probabilities as a
decision making mechanism on run-line betting. We demonstrate positive returns
when utilizing appropriate betting strategies, and show that naive use of
machine learning models for betting lead to significant loses.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Typical search agents concatenate the entire interaction history into the LLM
context, preserving information integrity but producing long, noisy contexts,
resulting in high computation and memory costs. In contrast, using only the
current turn avoids this overhead but discards essential information. This
trade-off limits the scalability of search agents. To address this challenge,
we propose MemSearcher, an agent workflow that iteratively maintains a compact
memory and combines the current turn with it. At each turn, MemSearcher fuses
the user's question with the memory to generate reasoning traces, perform
search actions, and update memory to retain only information essential for
solving the task. This design stabilizes context length across multi-turn
interactions, improving efficiency without sacrificing accuracy. To optimize
this workflow, we introduce multi-context GRPO, an end-to-end RL framework that
jointly optimize reasoning, search strategies, and memory management of
MemSearcher Agents. Specifically, multi-context GRPO samples groups of
trajectories under different contexts and propagates trajectory-level
advantages across all conversations within them. Trained on the same dataset as
Search-R1, MemSearcher achieves significant improvements over strong baselines
on seven public benchmarks: +11% on Qwen2.5-3B-Instruct and +12% on
Qwen2.5-7B-Instruct relative average gains. Notably, the 3B-based MemSearcher
even outperforms 7B-based baselines, demonstrating that striking a balance
between information integrity and efficiency yields both higher accuracy and
lower computational overhead. The code and models will be publicly available at
https://github.com/icip-cas/MemSearcher
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, Vinay Kumar Sankarapu
Machine Learning
ๆฅ็ๆ่ฆ
Tabular foundation models represent a growing paradigm in structured data
learning, extending the benefits of large-scale pretraining to tabular domains.
However, their adoption remains limited due to heterogeneous preprocessing
pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the
absence of standardized evaluation for deployment-oriented metrics such as
calibration and fairness. We present TabTune, a unified library that
standardizes the complete workflow for tabular foundation models through a
single interface. TabTune provides consistent access to seven state-of-the-art
models supporting multiple adaptation strategies, including zero-shot
inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient
fine-tuning (PEFT). The framework automates model-aware preprocessing, manages
architectural heterogeneity internally, and integrates evaluation modules for
performance, calibration, and fairness. Designed for extensibility and
reproducibility, TabTune enables consistent benchmarking of adaptation
strategies of tabular foundation models. The library is open source and
available at https://github.com/Lexsi-Labs/TabTune .
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Chenyu Zhang, Minsol Kim, Shohreh Ghorbani, Jingyao Wu, Rosalind Picard et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Despite rapid growth in multimodal large language models (MLLMs), their
reasoning traces remain opaque: it is often unclear which modality drives a
prediction, how conflicts are resolved, or when one stream dominates. In this
paper, we introduce modality sabotage, a diagnostic failure mode in which a
high-confidence unimodal error overrides other evidence and misleads the fused
result. To analyze such dynamics, we propose a lightweight, model-agnostic
evaluation layer that treats each modality as an agent, producing candidate
labels and a brief self-assessment used for auditing. A simple fusion mechanism
aggregates these outputs, exposing contributors (modalities supporting correct
outcomes) and saboteurs (modalities that mislead). Applying our diagnostic
layer in a case study on multimodal emotion recognition benchmarks with
foundation models revealed systematic reliability profiles, providing insight
into whether failures may arise from dataset artifacts or model limitations.
More broadly, our framework offers a diagnostic scaffold for multimodal
reasoning, supporting principled auditing of fusion dynamics and informing
possible interventions.
๐
2025-11-04
๐ ArXiv Preprint (cs.CY)
๐ฅ Amit Misra, Jane Wang, Scott McCullers, Kevin White, Juan Lavista Ferres
ๆฅ็ๆ่ฆ
Measuring global AI diffusion remains challenging due to a lack of
population-normalized, cross-country usage data. We introduce AI User Share, a
novel indicator that estimates the share of each country's working-age
population actively using AI tools. Built from anonymized Microsoft telemetry
and adjusted for device access and mobile scaling, this metric spans 147
economies and provides consistent, real-time insight into global AI diffusion.
We find wide variation in adoption, with a strong correlation between AI User
Share and GDP. High uptake is concentrated in developed economies, though usage
among internet-connected populations in lower-income countries reveals
substantial latent demand. We also detect sharp increases in usage following
major product launches, such as DeepSeek in early 2025. While the metric's
reliance solely on Microsoft telemetry introduces potential biases related to
this user base, it offers an important new lens into how AI is spreading
globally. AI User Share enables timely benchmarking that can inform data-driven
AI policy.
๐
2025-11-04
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Vivi Andersson, Sofia Bobadilla, Harald Hobbelhagen, Martin Monperrus
Robotics
ๆฅ็ๆ่ฆ
Smart contracts operate in a highly adversarial environment, where
vulnerabilities can lead to substantial financial losses. Thus, smart contracts
are subject to security audits. In auditing, proof-of-concept (PoC) exploits
play a critical role by demonstrating to the stakeholders that the reported
vulnerabilities are genuine, reproducible, and actionable. However, manually
creating PoCs is time-consuming, error-prone, and often constrained by tight
audit schedules. We introduce POCO, an agentic framework that automatically
generates executable PoC exploits from natural-language vulnerability
descriptions written by auditors. POCO autonomously generates PoC exploits in
an agentic manner by interacting with a set of code-execution tools in a
Reason-Act-Observe loop. It produces fully executable exploits compatible with
the Foundry testing framework, ready for integration into audit reports and
other security tools. We evaluate POCO on a dataset of 23 real-world
vulnerability reports. POCO consistently outperforms the prompting and workflow
baselines, generating well-formed and logically correct PoCs. Our results
demonstrate that agentic frameworks can significantly reduce the effort
required for high-quality PoCs in smart contract audits. Our contribution
provides readily actionable knowledge for the smart contract security
community.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bum Chul Kwon, Ben Shapira, Moshiko Raboh, Shreyans Sethi, Shruti Murarka et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
The chemical space of drug-like molecules is vast, motivating the development
of generative models that must learn broad chemical distributions, enable
conditional generation by capturing structure-property representations, and
provide fast molecular generation. Meeting the objectives depends on modeling
choices, including the probabilistic modeling approach, the conditional
generative formulation, the architecture, and the molecular input
representation. To address the challenges, we present STAR-VAE
(Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder),
a scalable latent-variable framework with a Transformer encoder and an
autoregressive Transformer decoder. It is trained on 79 million drug-like
molecules from PubChem, using SELFIES to guarantee syntactic validity. The
latent-variable formulation enables conditional generation: a property
predictor supplies a conditioning signal that is applied consistently to the
latent prior, the inference network, and the decoder. Our contributions are:
(i) a Transformer-based latent-variable encoder-decoder model trained on
SELFIES representations; (ii) a principled conditional latent-variable
formulation for property-guided generation; and (iii) efficient finetuning with
low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation
with limited property and activity data. On the GuacaMol and MOSES benchmarks,
our approach matches or exceeds baselines, and latent-space analyses reveal
smooth, semantically structured representations that support both unconditional
exploration and property-aware generation. On the Tartarus benchmarks, the
conditional model shifts docking-score distributions toward stronger predicted
binding. These results suggest that a modernized, scale-appropriate VAE remains
competitive for molecular generation when paired with principled conditioning
and parameter-efficient finetuning.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Julius Fiedler, Carsten Knoll, Klaus Rรถbenack
Robotics
ๆฅ็ๆ่ฆ
The rapid growth of research output in control engineering calls for new
approaches to structure and formalize domain knowledge. This paper briefly
describes an LLM-supported method for semi-automated generation of formal
knowledge representations that combine human readability with machine
interpretability and increased expressiveness. Based on the Imperative
Representation of Knowledge (PyIRK) framework, we demonstrate how language
models can assist in transforming natural-language descriptions and
mathematical definitions (available as LaTeX source code) into a formalized
knowledge graph. As a first application we present the generation of an
``interactive semantic layer'' to enhance the source documents in order to
facilitate knowledge transfer. From our perspective this contributes to the
vision of easily accessible, collaborative, and verifiable knowledge bases for
the control engineering domain.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amit Misra, Syed Waqas Zamir, Wassim Hamidouche, Inbal Becker-Reshef, Juan Lavista Ferres
Natural Language Processing
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) is diffusing globally at unprecedented speed,
but adoption remains uneven. Frontier Large Language Models (LLMs) are known to
perform poorly on low-resource languages due to data scarcity. We hypothesize
that this performance deficit reduces the utility of AI, thereby slowing
adoption in Low-Resource Language Countries (LRLCs). To test this, we use a
weighted regression model to isolate the language effect from socioeconomic and
demographic factors, finding that LRLCs have a share of AI users that is
approximately 20% lower relative to their baseline. These results indicate that
linguistic accessibility is a significant, independent barrier to equitable AI
diffusion.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Paul Castro, Nick Mitchell, Nathan Ordonez, Thomas Parnell, Mudhakar Srivatsa et al.
ๆฅ็ๆ่ฆ
Clients are evolving beyond chat completion, and now include a variety of
innovative inference-time scaling and deep reasoning techniques. At the same
time, inference servers remain heavily optimized for chat completion. Prior
work has shown that large improvements to KV cache hit rate are possible if
inference servers evolve towards these non-chat use cases. However, they offer
solutions that are also optimized for a single use case, RAG. In this paper, we
introduce the span query to generalize the interface to the inference server.
We demonstrate that chat, RAG, inference-time scaling, and agentic workloads
can all be expressed as span queries. We show how the critical distinction that
had been assumed by prior work lies in whether the order of the inputs matter
-- do they commute? In chat, they do not. In RAG, they often do. This paper
introduces span queries, which are expression trees of inference calls, linked
together with commutativity constraints. We describe span query syntax and
semantics. We show how they can be automatically optimized to improve KV cache
locality. We show how a small change to vLLM (affecting only 492 lines) can
enable high-performance execution of span queries. Using this stack, we
demonstrate that span queries can achieve 10-20x reductions in TTFT for two
distinct non-chat use cases. Finally, we show that span queries can also be
optimized to improve attention locality, so as to avoid the so-called
lost-in-the-middle problem. We demonstrate that an attention-optimized span
query on a 2b parameter model vastly outperforms the accuracy of a stock
inference server using an 8b model.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Current evaluations of Large Language Model (LLM) agents primarily emphasize
task completion, often overlooking resource efficiency and adaptability. This
neglects a crucial capability: agents' ability to devise and adjust
cost-optimal plans in response to changing environments. To bridge this gap, we
introduce CostBench, a scalable, cost-centric benchmark designed to evaluate
agents' economic reasoning and replanning abilities. Situated in the
travel-planning domain, CostBench comprises tasks solvable via multiple
sequences of atomic and composite tools with diverse, customizable costs. It
also supports four types of dynamic blocking events, such as tool failures and
cost changes, to simulate real-world unpredictability and necessitate agents to
adapt in real time. Evaluating leading open-sourced and proprietary models on
CostBench reveals a substantial gap in cost-aware planning: agents frequently
fail to identify cost-optimal solutions in static settings, with even GPT-5
achieving less than 75% exact match rate on the hardest tasks, and performance
further dropping by around 40% under dynamic conditions. By diagnosing these
weaknesses, CostBench lays the groundwork for developing future agents that are
both economically rational and robust.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vojtฤch Kลฏr, Adam Bajger, Adam Kukuฤka, Marek Hradil, Vรญt Musil et al.
MultimodalNatural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Convolutional neural networks (CNNs) underpin many modern computer vision
systems. With applications ranging from common to critical areas, a need to
explain and understand the model and its decisions (XAI) emerged. Prior works
suggest that in the top layers of CNNs, the individual channels can be
attributed to classifying human-understandable concepts. Concept relevance
propagation (CRP) methods can backtrack predictions to these channels and find
images that most activate these channels. However, current CRP workflows are
largely manual: experts must inspect activation images to name the discovered
concepts and must synthesize verbose explanations from relevance maps, limiting
the accessibility of the explanations and their scalability.
To address these issues, we introduce Large Language model EXplaIns COncept
Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a
multimodal large language model. Our approach automatically assigns descriptive
names to concept prototypes and generates natural-language explanations that
translate quantitative relevance distributions into intuitive narratives. To
ensure faithfulness, we craft prompts that teach the language model the
semantics of CRP through examples and enforce a separation between naming and
explanation tasks. The resulting text can be tailored to different audiences,
offering low-level technical descriptions for experts and high-level summaries
for non-technical stakeholders.
We qualitatively evaluate our method on various images from ImageNet on a
VGG16 model. Our findings suggest that integrating concept-based attribution
methods with large language models can significantly lower the barrier to
interpreting deep neural networks, paving the way for more transparent AI
systems.
๐
2025-11-04
๐ ArXiv Preprint (eess.SP)
๐ฅ Marios Impraimakis, Andrew W. Smyth
ๆฅ็ๆ่ฆ
The input-parameter-state estimation capabilities of a novel unscented Kalman
filter is examined herein on both linear and nonlinear systems. The unknown
input is estimated in two stages within each time step. Firstly, the predicted
dynamic states and the system parameters provide an estimation of the input.
Secondly, the corrected with measurements states and parameters provide a final
estimation. Importantly, it is demonstrated using the perturbation analysis
that, a system with at least a zero or a non-zero known input can potentially
be uniquely identified. This output-only methodology allows for a better
understanding of the system compared to classical output-only parameter
identification strategies, given that all the dynamic states, the parameters,
and the input are estimated jointly and in real-time.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tim R. Davidson, Adam Fourney, Saleema Amershi, Robert West, Eric Horvitz et al.
ๆฅ็ๆ่ฆ
The trajectory of AI development suggests that we will increasingly rely on
agent-based systems composed of independently developed agents with different
information, privileges, and tools. The success of these systems will
critically depend on effective collaboration among these heterogeneous agents,
even under partial observability. Despite intense interest, few empirical
studies have evaluated such agent-agent collaboration at scale. We propose a
collaborative maze-solving benchmark that (i) isolates collaborative
capabilities, (ii) modulates problem complexity, (iii) enables scalable
automated grading, and (iv) imposes no output-format constraints, preserving
ecological plausibility. Using this framework, we evaluate 32 leading open- and
closed-source models in solo, homogeneous, and heterogeneous pairings. Our
results reveal a "collaboration gap": models that perform well solo often
degrade substantially when required to collaborate. Collaboration can break
down dramatically; for instance, small distilled models that solve mazes well
alone may fail almost completely in certain pairings. We find that starting
with the stronger agent often improves outcomes, motivating a "relay inference"
approach where the stronger agent leads before handing off to the weaker one,
closing much of the gap. Our findings argue for (1) collaboration-aware
evaluation, (2) training strategies developed to enhance collaborative
capabilities, and (3) interaction design that reliably elicits agents' latent
skills, guidance that applies to AI-AI and human-AI collaboration.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohammadsajad Alipour, Mohammad Mohammadi Amiri
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly prevalent across diverse
applications. However, their enormous size limits storage and processing
capabilities to a few well-resourced stakeholders. As a result, most
applications rely on pre-trained LLMs, fine-tuned for specific tasks. However,
even storing the fine-tuned versions of these models remains a significant
challenge due to the wide range of tasks they address. Recently, studies show
that fine-tuning these models primarily affects a small fraction of parameters,
highlighting the need for more efficient storage of fine-tuned models. This
paper focuses on efficient storage of parameter updates in pre-trained models
after fine-tuning. To address this challenge, we leverage the observation that
fine-tuning updates are both low-rank and sparse, which can be utilized for
storage efficiency. However, using only low-rank approximation or
sparsification may discard critical singular components that enhance model
expressivity. We first observe that given the same memory budget, sparsified
low-rank approximations with larger ranks outperform standard low-rank
approximations with smaller ranks. Building on this, we propose our method,
optimal singular damage, that selectively sparsifies low-rank approximated
updates by leveraging the interleaved importance of singular vectors, ensuring
that the most impactful components are retained. We demonstrate through
extensive experiments that our proposed methods lead to significant storage
efficiency and superior accuracy within the same memory budget compared to
employing the low-rank approximation or sparsification individually.
๐
2025-11-04
๐ NeurIPS
๐ฅ Giacomo Camposampiero, Pietro Barbiero, Michael Hersche, Roger Wattenhofer, Abbas Rahimi
ๆฅ็ๆ่ฆ
Compositional generalization-a key open challenge in modern machine
learning-requires models to predict unknown combinations of known concepts.
However, assessing compositional generalization remains a fundamental challenge
due to the lack of standardized evaluation protocols and the limitations of
current benchmarks, which often favor efficiency over rigor. At the same time,
general-purpose vision architectures lack the necessary inductive biases, and
existing approaches to endow them compromise scalability. As a remedy, this
paper introduces: 1) a rigorous evaluation framework that unifies and extends
previous approaches while reducing computational requirements from
combinatorial to constant; 2) an extensive and modern evaluation on the status
of compositional generalization in supervised vision backbones, training more
than 5000 models; 3) Attribute Invariant Networks, a class of models
establishing a new Pareto frontier in compositional generalization, achieving a
23.43% accuracy improvement over baselines while reducing parameter overhead
from 600% to 16% compared to fully disentangled counterparts. Our code is
available at https://github.com/IBM/scalable-compositional-generalization.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Cooper Simpson, Stephen Becker, Alireza Doostan
ๆฅ็ๆ่ฆ
Focusing on implicit neural representations, we present a novel in situ
training protocol that employs limited memory buffers of full and sketched data
samples, where the sketched data are leveraged to prevent catastrophic
forgetting. The theoretical motivation for our use of sketching as a
regularizer is presented via a simple Johnson-Lindenstrauss-informed result.
While our methods may be of wider interest in the field of continual learning,
we specifically target in situ neural compression using implicit neural
representation-based hypernetworks. We evaluate our method on a variety of
complex simulation data in two and three dimensions, over long time horizons,
and across unstructured grids and non-Cartesian geometries. On these tasks, we
show strong reconstruction performance at high compression rates. Most
importantly, we demonstrate that sketching enables the presented in situ scheme
to approximately match the performance of the equivalent offline method.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) achieve remarkable reasoning capabilities
through transformer architectures with attention mechanisms. However,
transformers suffer from quadratic time and memory complexity in the attention
module (MHA) and require caching key-value states during inference, which
severely limits throughput and scalability. High inference throughput is
critical for agentic tasks, long-context reasoning, efficient deployment under
high request loads, and more efficient test-time compute scaling.
State Space Models (SSMs) such as Mamba offer a promising alternative with
linear inference complexity and a constant memory footprint via recurrent
computation with fixed-size hidden states. In this technical report we
introduce the Apriel-H1 family of hybrid LLMs that combine transformer
attention and SSM sequence mixers for efficient reasoning at 15B model size.
These models are obtained through incremental distillation from a pretrained
reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing
less critical attention layers with linear Mamba blocks.
We release multiple post-distillation variants of Apriel-H1-15B-Thinker with
different SSM-to-MHA ratios and analyse how reasoning performance degrades as
more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant
of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces,
achieving over 2x higher inference throughput when deployed in the
production-ready vLLM environment, with minimal degradation in reasoning
performance. This shows that distilled hybrid SSM-Transformer architectures can
deliver substantial efficiency gains over the pretrained transformer equivalent
without substantially compromising the reasoning quality.
๐
2025-11-04
๐ ArXiv Preprint (cs.DC)
๐ฅ Xiumei Deng, Zehui Xiong, Binbin Chen, Dong In Kim, Merouane Debbah et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) are proliferating rapidly at the edge,
delivering intelligent capabilities across diverse application scenarios.
However, their practical deployment in collaborative scenarios confronts
fundamental challenges: privacy vulnerabilities, communication overhead, and
computational bottlenecks. To address these, we propose Federated Attention
(FedAttn), which integrates the federated paradigm into the self-attention
mechanism, creating a new distributed LLM inference framework that
simultaneously achieves privacy protection, communication efficiency, and
computational efficiency. FedAttn enables participants to perform local
self-attention over their own token representations while periodically
exchanging and aggregating Key-Value (KV) matrices across multiple Transformer
blocks, collaboratively generating LLM responses without exposing private
prompts. Further, we identify a structural duality between contextual
representation refinement in FedAttn and parameter optimization in FL across
private data, local computation, and global aggregation. This key insight
provides a principled foundation for systematically porting federated
optimization techniques to collaborative LLM inference. Building on this
framework, we theoretically analyze how local self-attention computation within
participants and heterogeneous token relevance among participants shape error
propagation dynamics across Transformer blocks. Moreover, we characterize the
fundamental trade-off between response quality and communication/computation
efficiency, which is governed by the synchronization interval and the number of
participants. Experimental results validate our theoretical analysis, and
reveal significant optimization opportunities through sparse attention and
adaptive KV aggregation, highlighting FedAttn's potential to deliver
scalability and efficiency in real-world edge deployments.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tiziano Balaconi, Aldo Glielmo, Marco Taboga
Machine Learning
ๆฅ็ๆ่ฆ
We introduce GasRL, a simulator that couples a calibrated representation of
the natural gas market with a model of storage-operator policies trained with
deep reinforcement learning (RL). We use it to analyse how optimal stockpile
management affects equilibrium prices and the dynamics of demand and supply. We
test various RL algorithms and find that Soft Actor Critic (SAC) exhibits
superior performance in the GasRL environment: multiple objectives of storage
operators - including profitability, robust market clearing and price
stabilisation - are successfully achieved. Moreover, the equilibrium price
dynamics induced by SAC-derived optimal policies have characteristics, such as
volatility and seasonality, that closely match those of real-world prices.
Remarkably, this adherence to the historical distribution of prices is obtained
without explicitly calibrating the model to price data. We show how the
simulator can be used to assess the effects of EU-mandated minimum storage
thresholds. We find that such thresholds have a positive effect on market
resilience against unanticipated shifts in the distribution of supply shocks.
For example, with unusually large shocks, market disruptions are averted more
often if a threshold is in place.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Lachlan McPheat, Navdeep Kaur, Robert Blackwell, Alessandra Russo, Anthony G. Cohn et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce DecompSR, decomposed spatial reasoning, a large benchmark
dataset (over 5m datapoints) and generation framework designed to analyse
compositional spatial reasoning ability. The generation of DecompSR allows
users to independently vary several aspects of compositionality, namely:
productivity (reasoning depth), substitutivity (entity and linguistic
variability), overgeneralisation (input order, distractors) and systematicity
(novel linguistic elements). DecompSR is built procedurally in a manner which
makes it is correct by construction, which is independently verified using a
symbolic solver to guarantee the correctness of the dataset. DecompSR is
comprehensively benchmarked across a host of Large Language Models (LLMs) where
we show that LLMs struggle with productive and systematic generalisation in
spatial reasoning tasks whereas they are more robust to linguistic variation.
DecompSR provides a provably correct and rigorous benchmarking dataset with a
novel ability to independently vary the degrees of several key aspects of
compositionality, allowing for robust and fine-grained probing of the
compositional reasoning abilities of LLMs.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xiangen Hu, Jiarui Tong, Sheng Xu
ๆฅ็ๆ่ฆ
Training and education in human-centered fields require authentic practice,
yet realistic simulations of human behavior have remained limited. We present a
multi-agent psychological simulation system that models internal
cognitive-affective processes to generate believable human behaviors. In
contrast to black-box neural models, this system is grounded in established
psychological theories (e.g., self-efficacy, mindset, social constructivism)
and explicitly simulates an ``inner parliament'' of agents corresponding to key
psychological factors. These agents deliberate and interact to determine the
system's output behavior, enabling unprecedented transparency and alignment
with human psychology. We describe the system's architecture and theoretical
foundations, illustrate its use in teacher training and research, and discuss
how it embodies principles of social learning, cognitive apprenticeship,
deliberate practice, and meta-cognition.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tiberiu-Andrei Georgescu, Alexander W. Goodall, Dalal Alrajeh, Francesco Belardinelli, Sebastian Uchitel
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Shielding is widely used to enforce safety in reinforcement learning (RL),
ensuring that an agent's actions remain compliant with formal specifications.
Classical shielding approaches, however, are often static, in the sense that
they assume fixed logical specifications and hand-crafted abstractions. While
these static shields provide safety under nominal assumptions, they fail to
adapt when environment assumptions are violated. In this paper, we develop the
first adaptive shielding framework - to the best of our knowledge - based on
Generalized Reactivity of rank 1 (GR(1)) specifications, a tractable and
expressive fragment of Linear Temporal Logic (LTL) that captures both safety
and liveness properties. Our method detects environment assumption violations
at runtime and employs Inductive Logic Programming (ILP) to automatically
repair GR(1) specifications online, in a systematic and interpretable way. This
ensures that the shield evolves gracefully, ensuring liveness is achievable and
weakening goals only when necessary. We consider two case studies: Minepump and
Atari Seaquest; showing that (i) static symbolic controllers are often severely
suboptimal when optimizing for auxiliary rewards, and (ii) RL agents equipped
with our adaptive shield maintain near-optimal reward and perfect logical
compliance compared with static shields.
๐
2025-11-04
๐ ArXiv Preprint (quant-ph)
๐ฅ Ferhat Ozgur Catak, Jungwon Seo, Umit Cali
Machine Learning
ๆฅ็ๆ่ฆ
Quantum machine learning (QML) is a promising paradigm for tackling
computational problems that challenge classical AI. Yet, the inherent
probabilistic behavior of quantum mechanics, device noise in NISQ hardware, and
hybrid quantum-classical execution pipelines introduce new risks that prevent
reliable deployment of QML in real-world, safety-critical settings. This
research offers a broad roadmap for Trustworthy Quantum Machine Learning
(TQML), integrating three foundational pillars of reliability: (i) uncertainty
quantification for calibrated and risk-aware decision making, (ii) adversarial
robustness against classical and quantum-native threat models, and (iii)
privacy preservation in distributed and delegated quantum learning scenarios.
We formalize quantum-specific trust metrics grounded in quantum information
theory, including a variance-based decomposition of predictive uncertainty,
trace-distance-bounded robustness, and differential privacy for hybrid learning
channels. To demonstrate feasibility on current NISQ devices, we validate a
unified trust assessment pipeline on parameterized quantum classifiers,
uncovering correlations between uncertainty and prediction risk, an asymmetry
in attack vulnerability between classical and quantum state perturbations, and
privacy-utility trade-offs driven by shot noise and quantum channel noise. This
roadmap seeks to define trustworthiness as a first-class design objective for
quantum AI.
๐
2025-11-04
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Patrick Karlsen, Even Eilertsen
ๆฅ็ๆ่ฆ
This paper investigates some of the risks introduced by "LLM poisoning," the
intentional or unintentional introduction of malicious or biased data during
model training. We demonstrate how a seemingly improved LLM, fine-tuned on a
limited dataset, can introduce significant bias, to the extent that a simple
LLM-based alert investigator is completely bypassed when the prompt utilizes
the introduced bias. Using fine-tuned Llama3.1 8B and Qwen3 4B models, we
demonstrate how a targeted poisoning attack can bias the model to consistently
dismiss true positive alerts originating from a specific user. Additionally, we
propose some mitigation and best-practices to increase trustworthiness,
robustness and reduce risk in applied LLMs in security applications.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Max Norris, Kobi Gal, Sahan Bulathwela
Natural Language Processing
ๆฅ็ๆ่ฆ
Modelling student knowledge is a key challenge when leveraging AI in
education, with major implications for personalised learning. The Knowledge
Tracing (KT) task aims to predict how students will respond to educational
questions in learning environments, based on their prior interactions. Existing
KT models typically use response correctness along with metadata like skill
tags and timestamps, often overlooking the question text, which is an important
source of pedagogical insight. This omission poses a lost opportunity while
limiting predictive performance. We propose Next Token Knowledge Tracing
(NTKT), a novel approach that reframes KT as a next-token prediction task using
pretrained Large Language Models (LLMs). NTKT represents both student histories
and question content as sequences of text, allowing LLMs to learn patterns in
both behaviour and language. Our series of experiments significantly improves
performance over state-of-the-art neural KT models and generalises much better
to cold-start questions and users. These findings highlight the importance of
question content in KT and demonstrate the benefits of leveraging pretrained
representations of LLMs to model student learning more effectively.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Claudia Herambourg, Dawid Siuda, Julia Kopczyลska, Joao R. L. Santos, Wojciech Sas et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present ORCA (Omni Research on Calculation in AI) Benchmark - a novel
benchmark that evaluates large language models (LLMs) on multi-domain,
real-life quantitative reasoning using verified outputs from Omni's calculator
engine. In 500 natural-language tasks across domains such as finance, physics,
health, and statistics, the five state-of-the-art systems (ChatGPT-5,
Gemini~2.5~Flash, Claude~Sonnet~4.5, Grok~4, and DeepSeek~V3.2) achieved only
$45\text{--}63\,\%$ accuracy, with errors mainly related to rounding ($35\,\%$)
and calculation mistakes ($33\,\%$). Results in specific domains indicate
strengths in mathematics and engineering, but weaknesses in physics and natural
sciences. Correlation analysis ($r \approx 0.40\text{--}0.65$) shows that the
models often fail together but differ in the types of errors they make,
highlighting their partial complementarity rather than redundancy. Unlike
standard math datasets, ORCA evaluates step-by-step reasoning, numerical
precision, and domain generalization across real problems from finance,
physics, health, and statistics.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Despite the remarkable success of text-to-image diffusion models, their
output of a single, flattened image remains a critical bottleneck for
professional applications requiring layer-wise control. Existing solutions
either rely on fine-tuning with large, inaccessible datasets or are
training-free yet limited to generating isolated foreground elements, failing
to produce a complete and coherent scene. To address this, we introduce the
Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a
novel framework for zero-shot, layer-wise image generation. Our core technique,
Noise Transplantation and Cultivation (NTC), extracts intermediate latent
representations from both foreground and composite generation processes,
transplanting them into the initial noise for subsequent layers. This ensures
semantic and structural coherence across foreground, background, and composite
layers, enabling consistent, multi-layered outputs without requiring
fine-tuning or auxiliary datasets. Extensive experiments show that our
training-free method achieves performance comparable to fine-tuned methods,
enhancing layer-wise consistency while maintaining high image quality and
fidelity. TAUE not only eliminates costly training and dataset requirements but
also unlocks novel downstream applications, such as complex compositional
editing, paving the way for more accessible and controllable generative
workflows.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji
Machine Learning
ๆฅ็ๆ่ฆ
Offline reinforcement learning (RL) suffers from extrapolation errors induced
by out-of-distribution (OOD) actions. To address this, offline RL algorithms
typically impose constraints on action selection, which can be systematically
categorized into density, support, and sample constraints. However, we show
that each category has inherent limitations: density and sample constraints
tend to be overly conservative in many scenarios, while the support constraint,
though least restrictive, faces challenges in accurately modeling the behavior
policy. To overcome these limitations, we propose a new neighborhood constraint
that restricts action selection in the Bellman target to the union of
neighborhoods of dataset actions. Theoretically, the constraint not only bounds
extrapolation errors and distribution shift under certain conditions, but also
approximates the support constraint without requiring behavior policy modeling.
Moreover, it retains substantial flexibility and enables pointwise conservatism
by adapting the neighborhood radius for each data point. In practice, we employ
data quality as the adaptation criterion and design an adaptive neighborhood
constraint. Building on an efficient bilevel optimization framework, we develop
a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning
(ANQ), to perform Q learning with target actions satisfying this constraint.
Empirically, ANQ achieves state-of-the-art performance on standard offline RL
benchmarks and exhibits strong robustness in scenarios with noisy or limited
data.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingyu Lu, Haonan Wang, Qixiang Zhang, Xiaomeng Li
Machine Learning
ๆฅ็ๆ่ฆ
Subject-agnostic brain decoding, which aims to reconstruct continuous visual
experiences from fMRI without subject-specific training, holds great potential
for clinical applications. However, this direction remains underexplored due to
challenges in cross-subject generalization and the complex nature of brain
signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a
novel hierarchical decoding framework that explicitly models the ventral-dorsal
architecture of the human visual system to learn multi-dimensional
representations. By disentangling and leveraging features from early visual
cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary
cognitive information essential for visual reconstruction. Furthermore, we
introduce a feature-level contrastive learning strategy to enhance the
extraction of subject-invariant semantic representations, thereby enhancing
subject-agnostic applicability to previously unseen subjects. Unlike
conventional pipelines that need more than 12 hours of per-subject data and
heavy computation, VCFlow sacrifices only 7\% accuracy on average yet generates
each reconstructed video in 10 seconds without any retraining, offering a fast
and clinically scalable solution. The source code will be released upon
acceptance of the paper.
๐
2025-11-04
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Dan Bohus, Sean Andrist, Ann Paradiso, Nick Saw, Tim Schoonbeek et al.
Multimodal
ๆฅ็ๆ่ฆ
We introduce SigmaCollab, a dataset enabling research on physically situated
human-AI collaboration. The dataset consists of a set of 85 sessions in which
untrained participants were guided by a mixed-reality assistive AI agent in
performing procedural tasks in the physical world. SigmaCollab includes a set
of rich, multimodal data streams, such as the participant and system audio,
egocentric camera views from the head-mounted device, depth maps, head, hand
and gaze tracking information, as well as additional annotations performed
post-hoc. While the dataset is relatively small in size (~ 14 hours), its
application-driven and interactive nature brings to the fore novel research
challenges for human-AI collaboration, and provides more realistic testing
grounds for various AI models operating in this space. In future work, we plan
to use the dataset to construct a set of benchmarks for physically situated
collaboration in mixed-reality task assistive scenarios. SigmaCollab is
available at https://github.com/microsoft/SigmaCollab.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Enhong Mu, Jinyu Cai, Yijun Lu, Mingyue Zhang, Kenji Tei et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid iteration and frequent updates of modern video games pose
significant challenges to the efficiency and specificity of testing. Although
automated playtesting methods based on Large Language Models (LLMs) have shown
promise, they often lack structured knowledge accumulation mechanisms, making
it difficult to conduct precise and efficient testing tailored for incremental
game updates. To address this challenge, this paper proposes a KLPEG framework.
The framework constructs and maintains a Knowledge Graph (KG) to systematically
model game elements, task dependencies, and causal relationships, enabling
knowledge accumulation and reuse across versions. Building on this foundation,
the framework utilizes LLMs to parse natural language update logs, identify the
scope of impact through multi-hop reasoning on the KG, enabling the generation
of update-tailored test cases. Experiments in two representative game
environments, Overcooked and Minecraft, demonstrate that KLPEG can more
accurately locate functionalities affected by updates and complete tests in
fewer steps, significantly improving both playtesting effectiveness and
efficiency.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jorge Pellejero, Luis A. Hernรกndez Gรณmez, Luis Mendo Tomรกs, Zoraida Frias Barroso
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Agentic AI represents a new paradigm for automating complex systems by using
Large AI Models (LAMs) to provide human-level cognitive abilities with
multimodal perception, planning, memory, and reasoning capabilities. This will
lead to a new generation of AI systems that autonomously decompose goals,
retain context over time, learn continuously, operate across tools and
environments, and adapt dynamically. The complexity of 5G and upcoming 6G
networks renders manual optimization ineffective, pointing to Agentic AI as a
method for automating decisions in dynamic RAN environments. However, despite
its rapid advances, there is no established framework outlining the
foundational components and operational principles of Agentic AI systems nor a
universally accepted definition.
This paper contributes to ongoing research on Agentic AI in 5G and 6G
networks by outlining its core concepts and then proposing a practical use case
that applies Agentic principles to RAN optimization. We first introduce Agentic
AI, tracing its evolution from classical agents and discussing the progress
from workflows and simple AI agents to Agentic AI. Core design
patterns-reflection, planning, tool use, and multi-agent collaboration-are then
described to illustrate how intelligent behaviors are orchestrated. These
theorical concepts are grounded in the context of mobile networks, with a focus
on RAN management and optimization. A practical 5G RAN case study shows how
time-series analytics and LAM-driven agents collaborate for KPI-based
autonomous decision-making.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Munib Mesinovic, Max Buhlan, Tingting Zhu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Healthcare artificial intelligence systems routinely fail when deployed
across institutions, with documented performance drops and perpetuation of
discriminatory patterns embedded in historical data. This brittleness stems, in
part, from learning statistical associations rather than causal mechanisms.
Causal graph neural networks address this triple crisis of distribution shift,
discrimination, and inscrutability by combining graph-based representations of
biomedical data with causal inference principles to learn invariant mechanisms
rather than spurious correlations. This Review examines methodological
foundations spanning structural causal models, disentangled causal
representation learning, and techniques for interventional prediction and
counterfactual reasoning on graphs. We analyse applications demonstrating
clinical value across psychiatric diagnosis through brain network analysis,
cancer subtyping via multi-omics causal integration, continuous physiological
monitoring with mechanistic interpretation, and drug recommendation correcting
prescription bias. These advances establish foundations for patient-specific
Causal Digital Twins, enabling in silico clinical experimentation, with
integration of large language models for hypothesis generation and causal graph
neural networks for mechanistic validation. Substantial barriers remain,
including computational requirements precluding real-time deployment,
validation challenges demanding multi-modal evidence triangulation beyond
cross-validation, and risks of causal-washing where methods employ causal
terminology without rigorous evidentiary support. We propose tiered frameworks
distinguishing causally-inspired architectures from causally-validated
discoveries and identify critical research priorities making causal rather than
purely associational claims.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen
Machine Learning
ๆฅ็ๆ่ฆ
The capacitated location-routing problems (CLRPs) are classical problems in
combinatorial optimization, which require simultaneously making location and
routing decisions. In CLRPs, the complex constraints and the intricate
relationships between various decisions make the problem challenging to solve.
With the emergence of deep reinforcement learning (DRL), it has been
extensively applied to address the vehicle routing problem and its variants,
while the research related to CLRPs still needs to be explored. In this paper,
we propose the DRL with heterogeneous query (DRLHQ) to solve CLRP and open CLRP
(OCLRP), respectively. We are the first to propose an end-to-end learning
approach for CLRPs, following the encoder-decoder structure. In particular, we
reformulate the CLRPs as a markov decision process tailored to various
decisions, a general modeling framework that can be adapted to other DRL-based
methods. To better handle the interdependency across location and routing
decisions, we also introduce a novel heterogeneous querying attention mechanism
designed to adapt dynamically to various decision-making stages. Experimental
results on both synthetic and benchmark datasets demonstrate superior solution
quality and better generalization performance of our proposed approach over
representative traditional and DRL-based baselines in solving both CLRP and
OCLRP.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rajan Das Gupta, Md Kishor Morol, Nafiz Fahad, Md Tanzib Hosain, Sumaya Binte Zilani Choya et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As the global burden of Alzheimer's disease (AD) continues to grow, early and
accurate detection has become increasingly critical, especially in regions with
limited access to advanced diagnostic tools. We propose BRAINS (Biomedical
Retrieval-Augmented Intelligence for Neurodegeneration Screening) to address
this challenge. This novel system harnesses the powerful reasoning capabilities
of Large Language Models (LLMs) for Alzheimer's detection and monitoring.
BRAINS features a dual-module architecture: a cognitive diagnostic module and a
case-retrieval module. The Diagnostic Module utilizes LLMs fine-tuned on
cognitive and neuroimaging datasets -- including MMSE, CDR scores, and brain
volume metrics -- to perform structured assessments of Alzheimer's risk.
Meanwhile, the Case Retrieval Module encodes patient profiles into latent
representations and retrieves similar cases from a curated knowledge base.
These auxiliary cases are fused with the input profile via a Case Fusion Layer
to enhance contextual understanding. The combined representation is then
processed with clinical prompts for inference. Evaluations on real-world
datasets demonstrate BRAINS effectiveness in classifying disease severity and
identifying early signs of cognitive decline. This system not only shows strong
potential as an assistive tool for scalable, explainable, and early-stage
Alzheimer's disease detection, but also offers hope for future applications in
the field.
๐
2025-11-04
๐ ArXiv Preprint (Multimedia)
๐ฅ Bingyan Xie, Yongpeng Wu, Yuxuan Shi, Biqian Feng, Wenjun Zhang et al.
ๆฅ็ๆ่ฆ
Existing wireless video transmission schemes directly conduct video coding in
pixel level, while neglecting the inner semantics contained in videos. In this
paper, we propose a wireless video semantic communication framework with
decoupled diffusion multi-frame compensation (DDMFC), abbreviated as WVSC-D,
which integrates the idea of semantic communication into wireless video
transmission scenarios. WVSC-D first encodes original video frames as semantic
frames and then conducts video coding based on such compact representations,
enabling the video coding in semantic level rather than pixel level. Moreover,
to further reduce the communication overhead, a reference semantic frame is
introduced to substitute motion vectors of each frame in common video coding
methods. At the receiver, DDMFC is proposed to generate compensated current
semantic frame by a two-stage conditional diffusion process. With both the
reference frame transmission and DDMFC frame compensation, the bandwidth
efficiency improves with satisfying video transmission performance.
Experimental results verify the performance gain of WVSC-D over other DL-based
methods e.g. DVSC about 1.8 dB in terms of PSNR.
๐
2025-11-04
๐ ArXiv Preprint (q-fin.CP)
๐ฅ Kaito Takano, Masanori Hirano, Kei Nakagawa
Natural Language Processing
ๆฅ็ๆ่ฆ
Accurately forecasting central bank policy decisions, particularly those of
the Federal Open Market Committee(FOMC) has become increasingly important amid
heightened economic uncertainty. While prior studies have used monetary policy
texts to predict rate changes, most rely on static classification models that
overlook the deliberative nature of policymaking. This study proposes a novel
framework that structurally imitates the FOMC's collective decision-making
process by modeling multiple large language models(LLMs) as interacting agents.
Each agent begins with a distinct initial belief and produces a prediction
based on both qualitative policy texts and quantitative macroeconomic
indicators. Through iterative rounds, agents revise their predictions by
observing the outputs of others, simulating deliberation and consensus
formation. To enhance interpretability, we introduce a latent variable
representing each agent's underlying belief(e.g., hawkish or dovish), and we
theoretically demonstrate how this belief mediates the perception of input
information and interaction dynamics. Empirical results show that this
debate-based approach significantly outperforms standard LLMs-based baselines
in prediction accuracy. Furthermore, the explicit modeling of beliefs provides
insights into how individual perspectives and social influence shape collective
policy forecasts.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mengyu Zhang, Xubo Liu, Siyu Ding, Weichong Yin, Yu Sun et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great
potential in enhancing the reasoning capabilities of large language models
(LLMs), achieving remarkable progress in domains such as mathematics and
programming where standard answers are available. However, for open-ended tasks
lacking ground-truth solutions (e.g., creative writing and instruction
following), existing studies typically regard them as non-reasoning scenarios,
thereby overlooking the latent value of reasoning capabilities. This raises a
key question: Can strengthening reasoning improve performance in open-ended
tasks? To address this, we explore the transfer of the RLVR paradigm to the
open domain. Yet, since RLVR fundamentally relies on verifiers that presuppose
the existence of standard answers, it cannot be directly applied to open-ended
tasks. To overcome this challenge, we introduce Verifiable Multiple-Choice
Reformulation (VMR), a novel training strategy that restructures open-ended
data into verifiable multiple-choice formats, enabling effective training even
in the absence of explicit ground truth. Experimental results on multiple
benchmarks validate the effectiveness of our method in improving LLM
performance on open-ended tasks. Notably, across eight open-ended benchmarks,
our VMR-based training delivers an average gain of 5.99 points over the
baseline. Code will be released upon acceptance to facilitate reproducibility.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuan-Truong Quan, Xuan-Son Quan, Duc Do Minh, Vinh Nguyen Van
ๆฅ็ๆ่ฆ
Knowledge graph embedding (KGE) has become a fundamental technique for
representation learning on multi-relational data. Many seminal models, such as
TransE, operate in an unbounded Euclidean space, which presents inherent
limitations in modeling complex relations and can lead to inefficient training.
In this paper, we propose Spherical Knowledge Graph Embedding (SKGE), a model
that challenges this paradigm by constraining entity representations to a
compact manifold: a hypersphere. SKGE employs a learnable, non-linear
Spherization Layer to map entities onto the sphere and interprets relations as
a hybrid translate-then-project transformation. Through extensive experiments
on three benchmark datasets, FB15k-237, CoDEx-S, and CoDEx-M, we demonstrate
that SKGE consistently and significantly outperforms its strong Euclidean
counterpart, TransE, particularly on large-scale benchmarks such as FB15k-237
and CoDEx-M, demonstrating the efficacy of the spherical geometric prior. We
provide an in-depth analysis to reveal the sources of this advantage, showing
that this geometric constraint acts as a powerful regularizer, leading to
comprehensive performance gains across all relation types. More fundamentally,
we prove that the spherical geometry creates an "inherently hard negative
sampling" environment, naturally eliminating trivial negatives and forcing the
model to learn more robust and semantically coherent representations. Our
findings compellingly demonstrate that the choice of manifold is not merely an
implementation detail but a fundamental design principle, advocating for
geometric priors as a cornerstone for designing the next generation of powerful
and stable KGE models.
๐
2025-11-04
๐ 32 pages, 17 figures, published in Journal of Soun
๐ฅ Marios Impraimakis
ๆฅ็ๆ่ฆ
The capability of a novel Kullback-Leibler divergence method is examined
herein within the Kalman filter framework to select the input-parameter-state
estimation execution with the most plausible results. This identification
suffers from the uncertainty related to obtaining different results from
different initial parameter set guesses, and the examined approach uses the
information gained from the data in going from the prior to the posterior
distribution to address the issue. Firstly, the Kalman filter is performed for
a number of different initial parameter sets providing the system
input-parameter-state estimation. Secondly, the resulting posterior
distributions are compared simultaneously to the initial prior distributions
using the Kullback-Leibler divergence. Finally, the identification with the
least Kullback-Leibler divergence is selected as the one with the most
plausible results. Importantly, the method is shown to select the better
performed identification in linear, nonlinear, and limited information
applications, providing a powerful tool for system monitoring.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jae-Woo Choi, Hyungmin Kim, Hyobin Ong, Minsu Jang, Dohyung Kim et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Recent advancements in large language models (LLMs) have enabled significant
progress in decision-making and task planning for embodied autonomous agents.
However, most existing methods still struggle with complex, long-horizon tasks
because they rely on a monolithic trajectory that entangles all past decisions
and observations, attempting to solve the entire task in a single unified
process. To address this limitation, we propose ReAcTree, a hierarchical
task-planning method that decomposes a complex goal into more manageable
subgoals within a dynamically constructed agent tree. Each subgoal is handled
by an LLM agent node capable of reasoning, acting, and further expanding the
tree, while control flow nodes coordinate the execution strategies of agent
nodes. In addition, we integrate two complementary memory systems: each agent
node retrieves goal-specific, subgoal-level examples from episodic memory and
shares environment-specific observations through working memory. Experiments on
the WAH-NL and ALFRED datasets demonstrate that ReAcTree consistently
outperforms strong task-planning baselines such as ReAct across diverse LLMs.
Notably, on WAH-NL, ReAcTree achieves a 61% goal success rate with Qwen 2.5
72B, nearly doubling ReAct's 31%.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Benjamin Sykes, Loรฏc Simon, Julien Rabin, Jalal Fadili
Machine Learning
ๆฅ็ๆ่ฆ
With the recent success of generative models in image and text, the question
of their evaluation has recently gained a lot of attention. While most methods
from the state of the art rely on scalar metrics, the introduction of Precision
and Recall (PR) for generative model has opened up a new avenue of research.
The associated PR curve allows for a richer analysis, but their estimation
poses several challenges. In this paper, we present a new framework for
estimating entire PR curves based on a binary classification standpoint. We
conduct a thorough statistical analysis of the proposed estimates. As a
byproduct, we obtain a minimax upper bound on the PR estimation risk. We also
show that our framework extends several landmark PR metrics of the literature
which by design are restrained to the extreme values of the curve. Finally, we
study the different behaviors of the curves obtained experimentally in various
settings.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arya Shah, Vaibhav Tripathi
Natural Language Processing
ๆฅ็ๆ่ฆ
Cats and humans differ in ocular anatomy. Most notably, Felis Catus (domestic
cats) have vertically elongated pupils linked to ambush predation; yet, how
such specializations manifest in downstream visual representations remains
incompletely understood. We present a unified, frozen-encoder benchmark that
quantifies feline-human cross-species representational alignment in the wild,
across convolutional networks, supervised Vision Transformers, windowed
transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel
Alignment (linear and RBF) and Representational Similarity Analysis, with
additional distributional and stability tests reported in the paper. Across
models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF
$\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$),
peaking at early blocks, indicating that token-level self-supervision induces
early-stage features that bridge species-specific statistics. Supervised ViTs
are competitive on CKA yet show weaker geometric correspondence than DINO
(e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at
block14), revealing depth-dependent divergences between similarity and
representational geometry. CNNs remain strong baselines but below plain ViTs on
alignment, and windowed transformers underperform plain ViTs, implicating
architectural inductive biases in cross-species alignment. Results indicate
that self-supervision coupled with ViT inductive biases yields representational
geometries that more closely align feline and human visual systems than widely
used CNNs and windowed Transformers, providing testable neuroscientific
hypotheses about where and how cross-species visual computations converge. We
release our code and dataset for reference and reproducibility.
๐
2025-11-04
๐ ArXiv Preprint (eess.IV)
๐ฅ Yalda Zafari, Hongyi Pan, Gorkem Durak, Ulas Bagci, Essam A. Rashed et al.
Computer Vision
ๆฅ็ๆ่ฆ
The development of clinically reliable artificial intelligence (AI) systems
for mammography is hindered by profound heterogeneity in data quality, metadata
standards, and population distributions across public datasets. This
heterogeneity introduces dataset-specific biases that severely compromise the
generalizability of the model, a fundamental barrier to clinical deployment. We
present MammoClean, a public framework for standardization and bias
quantification in mammography datasets. MammoClean standardizes case selection,
image processing (including laterality and intensity correction), and unifies
metadata into a consistent multi-view structure. We provide a comprehensive
review of breast anatomy, imaging characteristics, and public mammography
datasets to systematically identify key sources of bias. Applying MammoClean to
three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify
substantial distributional shifts in breast density and abnormality prevalence.
Critically, we demonstrate the direct impact of data corruption: AI models
trained on corrupted datasets exhibit significant performance degradation
compared to their curated counterparts. By using MammoClean to identify and
mitigate bias sources, researchers can construct unified multi-dataset training
corpora that enable development of robust models with superior cross-domain
generalization. MammoClean provides an essential, reproducible pipeline for
bias-aware AI development in mammography, facilitating fairer comparisons and
advancing the creation of safe, effective systems that perform equitably across
diverse patient populations and clinical settings. The open-source code is
publicly available from: https://github.com/Minds-R-Lab/MammoClean.
๐
2025-11-04
๐ ArXiv Preprint (cs.SE)
๐ฅ Junwei Liu, Chen Xu, Chong Wang, Tong Bai, Weitong Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in large language model agents offer the promise of
automating end-to-end software development from natural language requirements.
However, existing approaches largely adopt linear, waterfall-style pipelines,
which oversimplify the iterative nature of real-world development and struggle
with complex, large-scale projects. To address these limitations, we propose
EvoDev, an iterative software development framework inspired by feature-driven
development. EvoDev decomposes user requirements into a set of user-valued
features and constructs a Feature Map, a directed acyclic graph that explicitly
models dependencies between features. Each node in the feature map maintains
multi-level information, including business logic, design, and code, which is
propagated along dependencies to provide context for subsequent development
iterations. We evaluate EvoDev on challenging Android development tasks and
show that it outperforms the best-performing baseline, Claude Code, by a
substantial margin of 56.8%, while improving single-agent performance by
16.0%-76.6% across different base LLMs, highlighting the importance of
dependency modeling, context propagation, and workflow-aware agent design for
complex software projects. Our work summarizes practical insights for designing
iterative, LLM-driven development frameworks and informs future training of
base LLMs to better support iterative software development.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Muhammad Sheharyar Liaqat
Machine Learning
ๆฅ็ๆ่ฆ
Breast cancer remains one of the leading causes of mortality among women
worldwide, with early diagnosis being critical for effective treatment and
improved survival rates. However, timely detection continues to be a challenge
due to the complex nature of the disease and variability in patient risk
factors. This study presents a fuzzy soft set theory-based expert system
designed to assess the risk of breast cancer in patients using measurable
clinical and physiological parameters. The proposed system integrates Body Mass
Index, Insulin Level, Leptin Level, Adiponectin Level, and age as input
variables to estimate breast cancer risk through a set of fuzzy inference rules
and soft set computations. These parameters can be obtained from routine blood
analyses, enabling a non-invasive and accessible method for preliminary
assessment. The dataset used for model development and validation was obtained
from the UCI Machine Learning Repository. The proposed expert system aims to
support healthcare professionals in identifying high-risk patients and
determining the necessity of further diagnostic procedures such as biopsies.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rohith Shinoj Kumar, Rushdeep Dinda, Aditya Tyagi, Annappa B., Naveen Kumar M. R
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Early detection of heart arrhythmia can prevent severe future complications
in cardiac patients. While manual diagnosis still remains the clinical
standard, it relies heavily on visual interpretation and is inherently
subjective. In recent years, deep learning has emerged as a powerful tool to
automate arrhythmia detection, offering improved accuracy, consistency, and
efficiency. Several variants of convolutional and recurrent neural network
architectures have been widely explored to capture spatial and temporal
patterns in physiological signals. However, despite these advancements, current
models often struggle to generalize well in real-world scenarios, especially
when dealing with small or noisy datasets, which are common challenges in
biomedical applications. In this paper, a novel CNN-H-Infinity-LSTM
architecture is proposed to identify arrhythmic heart signals from heart sound
recordings. This architecture introduces trainable parameters inspired by the
H-Infinity filter from control theory, enhancing robustness and generalization.
Extensive experimentation on the PhysioNet CinC Challenge 2016 dataset, a
public benchmark of heart audio recordings, demonstrates that the proposed
model achieves stable convergence and outperforms existing benchmarks, with a
test accuracy of 99.42% and an F1 score of 98.85%.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aashray Reddy, Andrew Zagula, Nicholas Saban
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where
adversarial prompts elicit harmful outputs, yet most evaluations focus on
single-turn interactions while real-world attacks unfold through adaptive
multi-turn conversations. We present AutoAdv, a training-free framework for
automated multi-turn jailbreaking that achieves up to 95% attack success rate
on Llama-3.1-8B within six turns a 24 percent improvement over single turn
baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern
manager that learns from successful attacks to enhance future prompts, a
temperature manager that dynamically adjusts sampling parameters based on
failure modes, and a two-phase rewriting strategy that disguises harmful
requests then iteratively refines them. Extensive evaluation across commercial
and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent
vulnerabilities in current safety mechanisms, with multi-turn attacks
consistently outperforming single-turn approaches. These findings demonstrate
that alignment strategies optimized for single-turn interactions fail to
maintain robustness across extended conversations, highlighting an urgent need
for multi-turn-aware defenses.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohd Nauman, Sravan Gvm, Vijay Devane, Shyam Pawar, Viraj Thakur et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Current large language models excel at broad, general-purpose tasks, but
consistently underperform when exposed to highly specialized domains that
require deep cultural, linguistic, and subject-matter expertise. In particular,
traditional medical systems such as Ayurveda embody centuries of nuanced
textual and clinical knowledge that mainstream LLMs fail to accurately
interpret or apply. We introduce AyurParam-2.9B, a domain-specialized,
bilingual language model fine-tuned from Param-1-2.9B using an extensive,
expertly curated Ayurveda dataset spanning classical texts and clinical
guidance. AyurParam's dataset incorporates context-aware, reasoning, and
objective-style Q&A in both English and Hindi, with rigorous annotation
protocols for factual precision and instructional clarity. Benchmarked on
BhashaBench-Ayur, AyurParam not only surpasses all open-source
instruction-tuned models in its size class (1.5--3B parameters), but also
demonstrates competitive or superior performance compared to much larger
models. The results from AyurParam highlight the necessity for authentic domain
adaptation and high-quality supervision in delivering reliable, culturally
congruent AI for specialized medical knowledge.
๐
2025-11-04
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Adnan Hoq, Matthew Facciani, Tim Weninger
ๆฅ็ๆ่ฆ
AI-generated content is rapidly becoming a salient component of online
information ecosystems, yet its influence on public trust and epistemic
judgments remains poorly understood. We present a large-scale mixed-design
experiment (N = 1,000) investigating how AI-generated credibility scores affect
user perception of political news. Our results reveal that AI feedback
significantly moderates partisan bias and institutional distrust, surpassing
traditional engagement signals such as likes and shares. These findings
demonstrate the persuasive power of generative AI and suggest a need for design
strategies that balance epistemic influence with user autonomy.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wongyu Kim, Hochang Lee, Sanghak Lee, Yoonsung Kim, Jaehyun Park
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Query augmentation makes queries more meaningful by appending further
information to the queries to find relevant documents. Current studies have
proposed Large Language Model (LLM)-based embedders, which learn representation
for embedding and generation for query augmentation in a multi-task manner by
leveraging the generative capabilities of LLM. During inference, these jointly
trained embedders have conducted query augmentation followed by embedding,
showing effective results. However, augmenting every query leads to substantial
embedding latency and query augmentation can be detrimental to performance for
some queries. Also, previous methods have not been explored in multimodal
environments. To tackle these problems, we propose M-Solomon, a universal
multimodal embedder that can adaptively determine when to augment queries. Our
approach first divides the queries of the training datasets into two groups at
the dataset level. One includes queries that require augmentation and the other
includes queries that do not. Then, we introduces a synthesis process that
generates appropriate augmentations for queries that require them by leveraging
a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation.
Through this step, M-Solomon can conduct query augmentation only when necessary
by learning to generate synthetic augmentations with the prefix /augment for
queries that demand them and to generate the simple string /embed for others.
Experimental results showed that M-Solomon not only surpassed the baseline
without augmentation by a large margin but also outperformed the baseline that
always used augmentation, providing much faster embedding latency.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Zhuodi Cai, Ziyu Xu, Juan Pampin
Robotics
ๆฅ็ๆ่ฆ
We introduce a lightweight, real-time motion recognition system that enables
synergic human-machine performance through wearable IMU sensor data, MiniRocket
time-series classification, and responsive multimedia control. By mapping
dancer-specific movement to sound through somatic memory and association, we
propose an alternative approach to human-machine collaboration, one that
preserves the expressive depth of the performing body while leveraging machine
learning for attentive observation and responsiveness. We demonstrate that this
human-centered design reliably supports high accuracy classification (<50 ms
latency), offering a replicable framework to integrate dance-literate machines
into creative, educational, and live performance contexts.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yohan Lee, DongGyun Kang, SeHoon Park, Sa-Yoon Park, Kwangsoo Kim
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Chronic Kidney Disease (CKD) affects nearly 10\% of the global population and
often progresses to end-stage renal failure. Accurate prognosis prediction is
vital for timely interventions and resource optimization. We present a
transformer-based framework for predicting CKD progression using multi-modal
electronic health records (EHR) from the Seoul National University Hospital
OMOP Common Data Model. Our approach (\textbf{ProQ-BERT}) integrates
demographic, clinical, and laboratory data, employing quantization-based
tokenization for continuous lab values and attention mechanisms for
interpretability. The model was pretrained with masked language modeling and
fine-tuned for binary classification tasks predicting progression from stage 3a
to stage 5 across varying follow-up and assessment periods. Evaluated on a
cohort of 91,816 patients, our model consistently outperformed CEHR-BERT,
achieving ROC-AUC up to 0.995 and PR-AUC up to 0.989 for short-term prediction.
These results highlight the effectiveness of transformer architectures and
temporal design choices in clinical prognosis modeling, offering a promising
direction for personalized CKD care.
๐
2025-11-04
๐ ArXiv Preprint (q-bio.MN)
๐ฅ Hongyang Jiang, Yuezhu Wang, Ke Feng, Chaoyi Yin, Yi Chang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Biological networks are pivotal in deciphering the complexity and
functionality of biological systems. Causal inference, which focuses on
determining the directionality and strength of interactions between variables
rather than merely relying on correlations, is considered a logical approach
for inferring biological networks. Existing methods for causal structure
inference typically assume that causal relationships between variables can be
represented by directed acyclic graphs (DAGs). However, this assumption is at
odds with the reality of widespread feedback loops in biological systems,
making these methods unsuitable for direct use in biological network inference.
In this study, we propose a new framework named SCALD (Structural CAusal model
for Loop Diagram), which employs a nonlinear structure equation model and a
stable feedback loop conditional constraint through continuous optimization to
infer causal regulatory relationships under feedback loops. We observe that
SCALD outperforms state-of-the-art methods in inferring both transcriptional
regulatory networks and signaling transduction networks. SCALD has
irreplaceable advantages in identifying feedback regulation. Through
transcription factor (TF) perturbation data analysis, we further validate the
accuracy and sensitivity of SCALD. Additionally, SCALD facilitates the
discovery of previously unknown regulatory relationships, which we have
subsequently confirmed through ChIP-seq data analysis. Furthermore, by
utilizing SCALD, we infer the key driver genes that facilitate the
transformation from colon inflammation to cancer by examining the dynamic
changes within regulatory networks during the process.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aman Sharma, Paras Chopra
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We revisit test-time scaling for language model reasoning and ask a
fundamental question: at equal token budget and compute, is it better to run
multiple independent chains in parallel, or to run fewer chains that
iteratively refine through sequential steps? Through comprehensive evaluation
across 5 state-of-the-art open source models and 3 challenging reasoning
benchmarks, we find that sequential scaling where chains explicitly build upon
previous attempts consistently outperforms the dominant parallel
self-consistency paradigm in 95.6% of configurations with gains in accuracy
upto 46.7%. Further, we introduce inverse-entropy weighted voting, a novel
training-free method to further boost the accuracy of sequential scaling. By
weighing answers in proportion to the inverse entropy of their reasoning
chains, we increase our success rate over parallel majority and establish it as
the optimal test-time scaling strategy. Our findings fundamentally challenge
the parallel reasoning orthodoxy that has dominated test-time scaling since
Wang et al.'s self-consistency decoding (Wang et al., 2022), positioning
sequential refinement as the robust default for modern LLM reasoning and
necessitating a paradigm shift in how we approach inference-time optimization.
๐
2025-11-04
๐ ArXiv Preprint (cs.MA)
๐ฅ Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia
Machine Learning
ๆฅ็ๆ่ฆ
We study the problem of learning multi-task, multi-agent policies for
cooperative, temporal objectives, under centralized training, decentralized
execution. In this setting, using automata to represent tasks enables the
decomposition of complex tasks into simpler sub-tasks that can be assigned to
agents. However, existing approaches remain sample-inefficient and are limited
to the single-task case. In this work, we present Automata-Conditioned
Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for
learning task-conditioned, decentralized team policies. We identify the main
challenges to ACC-MARL's feasibility in practice, propose solutions, and prove
the correctness of our approach. We further show that the value functions of
learned policies can be used to assign tasks optimally at test time.
Experiments show emergent task-aware, multi-step coordination among agents,
e.g., pressing a button to unlock a door, holding the door, and
short-circuiting tasks.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhiwei Zhang, Xiaomin Li, Yudi Lin, Hui Liu, Ramraj Chandradevan et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) trained with reinforcement learning and
verifiable rewards have achieved strong results on complex reasoning tasks.
Recent work extends this paradigm to a multi-agent setting, where a
meta-thinking agent proposes plans and monitors progress while a reasoning
agent executes subtasks through sequential conversational turns. Despite
promising performance, we identify a critical limitation: lazy agent behavior,
in which one agent dominates while the other contributes little, undermining
collaboration and collapsing the setup to an ineffective single agent. In this
paper, we first provide a theoretical analysis showing why lazy behavior
naturally arises in multi-agent reasoning. We then introduce a stable and
efficient method for measuring causal influence, helping mitigate this issue.
Finally, as collaboration intensifies, the reasoning agent risks getting lost
in multi-turn interactions and trapped by previous noisy responses. To counter
this, we propose a verifiable reward mechanism that encourages deliberation by
allowing the reasoning agent to discard noisy outputs, consolidate
instructions, and restart its reasoning process when necessary. Extensive
experiments demonstrate that our framework alleviates lazy agent behavior and
unlocks the full potential of multi-agent framework for complex reasoning
tasks.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fengjuan Wang, Zhiyi Su, Xingzhu Hu, Cheng Wang, Mou Sun
Natural Language Processing
ๆฅ็ๆ่ฆ
Training large Mixture-of-Experts (MoE) models remains computationally
prohibitive due to their extreme compute and memory demands. Although
low-precision training promises to accelerate computation and reduce memory
footprint, existing implementations still rely on BF16-dominated dataflows with
frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode
much of FP8's theoretical efficiency. However, naively removing these casts by
keeping dataflows entirely in FP8 introduces double quantization error: tensors
quantized along different dimensions accumulate inconsistent scaling factors,
degrading numerical stability.
We propose FP8-Flow-MoE, an FP8 training recipe featuring a
quantization-consistent FP8-centric dataflow with a scaling-aware transpose and
fused FP8 operators that streamline computation and eliminate explicit cast
operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate
up to 21\% higher throughput and 16.5 GB lower memory usage per GPU compared to
BF16 and na\"ive FP8 baselines, while maintaining stable convergence. We
provide a plug-and-play FP8 recipe compatible with TransformerEngine and
Megatron-LM, which will be open-sourced soon.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kuan-Cheng Chen, Samuel Yen-Chi Chen, Chen-Yu Liu, Kin K. Leung
ๆฅ็ๆ่ฆ
The rapid growth of industrial Internet of Things (IIoT) systems has created
new challenges for anomaly detection in high-dimensional, multivariate
time-series, where privacy, scalability, and communication efficiency are
critical. Classical federated learning approaches mitigate privacy concerns by
enabling decentralized training, but they often struggle with highly non-linear
decision boundaries and imbalanced anomaly distributions. To address this gap,
we propose a Federated Quantum Kernel Learning (FQKL) framework that integrates
quantum feature maps with federated aggregation to enable distributed,
privacy-preserving anomaly detection across heterogeneous IoT networks. In our
design, quantum edge nodes locally compute compressed kernel statistics using
parameterized quantum circuits and share only these summaries with a central
server, which constructs a global Gram matrix and trains a decision function
(e.g., Fed-QSVM). Experimental results on synthetic IIoT benchmarks demonstrate
that FQKL achieves superior generalization in capturing complex temporal
correlations compared to classical federated baselines, while significantly
reducing communication overhead. This work highlights the promise of quantum
kernels in federated settings, advancing the path toward scalable, robust, and
quantum-enhanced intelligence for next-generation IoT infrastructures.
๐
2025-11-04
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Samjukta Sinha, Prabhat Das
ๆฅ็ๆ่ฆ
This study investigates the application of Random Forest Regression for
predicting mechanical properties of alloy steel-Elongation, Tensile Strength,
and Yield Strength-from material composition features including Iron (Fe),
Chromium (Cr), Nickel (Ni), Manganese (Mn), Silicon (Si), Copper (Cu), Carbon
(C), and deformation percentage during cold rolling. Utilizing a dataset
comprising these features, we trained and evaluated the Random Forest model,
achieving high predictive performance as evidenced by R2 scores and Mean
Squared Errors (MSE). The results demonstrate the model's efficacy in providing
accurate predictions, which is validated through various performance metrics
including residual plots and learning curves. The findings underscore the
potential of ensemble learning techniques in enhancing material property
predictions, with implications for industrial applications in material science.
๐
2025-11-04
๐ ArXiv Preprint (q-bio.GN)
๐ฅ Jaeyeon Lee, Hyun-Hwan Jeong, Zhandong Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
Diagnosing rare diseases often requires connecting variant-bearing genes to
evidence that is written as unstructured clinical prose, which the current
established pipelines still leave for clinicians to reconcile manually. To this
end, we introduce LA-MARRVEL, a knowledge-grounded and language-aware reranking
layer that operates on top of AI-MARRVEL: it supplies expert-engineered
context, queries a large language model multiple times, and aggregates the
resulting partial rankings with a ranked voting method to produce a stable,
explainable gene ranking. Evaluated on three real-world cohorts (BG, DDD, UDN),
LA-MARRVEL consistently improves Recall@K over AI-MARRVEL and established
phenotype-driven tools such as Exomiser and LIRICAL, with especially large
gains on cases where the first-stage ranker placed the causal gene lower. Each
ranked gene is accompanied by LLM-generated reasoning that integrates
phenotypic, inheritance, and variant-level evidence, thereby making the output
more interpretable and facilitating clinical review.
๐
2025-11-04
๐ ArXiv Preprint (cs.DS)
๐ฅ Tan D. Tran, Canh V. Pham
ๆฅ็ๆ่ฆ
This work studies the non-monotone DR-submodular Maximization over a ground
set of $n$ subject to a size constraint $k$. We propose two approximation
algorithms for solving this problem named FastDrSub and FastDrSub++. FastDrSub
offers an approximation ratio of $0.044$ with query complexity of $O(n
\log(k))$. The second one, FastDrSub++, improves upon it with a ratio of
$1/4-\epsilon$ within query complexity of $(n \log k)$ for an input parameter
$\epsilon >0$. Therefore, our proposed algorithms are the first constant-ratio
approximation algorithms for the problem with the low complexity of $O(n
\log(k))$.
Additionally, both algorithms are experimentally evaluated and compared
against existing state-of-the-art methods, demonstrating their effectiveness in
solving the Revenue Maximization problem with DR-submodular objective function.
The experimental results show that our proposed algorithms significantly
outperform existing approaches in terms of both query complexity and solution
quality.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jonathan Liu, Haoling Qiu, Jonathan Lasko, Damianos Karakos, Mahsa Yarmohammadi et al.
ๆฅ็ๆ่ฆ
Recent research has shown that hallucinations, omissions, and biases are
prevalent in everyday use-cases of LLMs. However, chatbots used in medical
contexts must provide consistent advice in situations where non-medical factors
are involved, such as when demographic information is present. In order to
understand the conditions under which medical chatbots fail to perform as
expected, we develop an infrastructure that 1) automatically generates queries
to probe LLMs and 2) evaluates answers to these queries using multiple
LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples
the space of patient demographics, histories, disorders, and writing styles to
create realistic questions that we subsequently use to prompt LLMs. In 2), our
evaluation pipeline provides hallucination and omission detection using
LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge
treatment category detectors. As a baseline study, we perform two case studies
on inter-LLM agreement and the impact of varying the answering and evaluation
LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's
Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs
yield statistically significant differences across writing styles, genders, and
races. We recommend that studies using LLM evaluation use multiple LLMs as
evaluators in order to avoid arriving at statistically significant but
non-generalizable results, particularly in the absence of ground-truth data. We
also suggest publishing inter-LLM agreement metrics for transparency. Our code
and dataset are available here:
https://github.com/BBN-E/medic-neurips-2025-demo.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Haotian Wang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) must resolve conflicts when
different modalities provide contradictory information, a process we term
modality following. Prior work measured this behavior only with coarse
dataset-level statistics, overlooking the influence of model's confidence in
unimodal reasoning. In this paper, we introduce a new framework that decomposes
modality following into two fundamental factors: relative reasoning uncertainty
(the case-specific confidence gap between unimodal predictions) and inherent
modality preference( a model's stable bias when uncertainties are balanced). To
validate this framework, we construct a controllable dataset that
systematically varies the reasoning difficulty of visual and textual inputs.
Using entropy as a fine-grained uncertainty metric, we uncover a universal law:
the probability of following a modality decreases monotonically as its relative
uncertainty increases. At the relative difficulty level where the model tends
to follow both modalities with comparable probability what we call the balance
point, a practical indicator of the model's inherent preference. Unlike
traditional macro-level ratios, this measure offers a more principled and less
confounded way to characterize modality bias, disentangling it from unimodal
capabilities and dataset artifacts. Further, by probing layer-wise predictions,
we reveal the internal mechanism of oscillation: in ambiguous regions near the
balance point, models vacillate between modalities across layers, explaining
externally observed indecision. Together, these findings establish relative
uncertainty and inherent preference as the two governing principles of modality
following, offering both a quantitative framework and mechanistic insight into
how MLLMs resolve conflicting information.
๐
2025-11-04
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Brennen A. Hill
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Traditional neural networks, while powerful, rely on biologically implausible
learning mechanisms such as global backpropagation. This paper introduces the
Structurally Adaptive Predictive Inference Network (SAPIN), a novel
computational model inspired by the principles of active inference and the
morphological plasticity observed in biological neural cultures. SAPIN operates
on a 2D grid where processing units, or cells, learn by minimizing local
prediction errors. The model features two primary, concurrent learning
mechanisms: a local, Hebbian-like synaptic plasticity rule based on the
temporal difference between a cell's actual activation and its learned
expectation, and a structural plasticity mechanism where cells physically
migrate across the grid to optimize their information-receptive fields. This
dual approach allows the network to learn both how to process information
(synaptic weights) and also where to position its computational resources
(network topology). We validated the SAPIN model on the classic Cart Pole
reinforcement learning benchmark. Our results demonstrate that the architecture
can successfully solve the CartPole task, achieving robust performance. The
network's intrinsic drive to minimize prediction error and maintain homeostasis
was sufficient to discover a stable balancing policy. We also found that while
continual learning led to instability, locking the network's parameters after
achieving success resulted in a stable policy. When evaluated for 100 episodes
post-locking (repeated over 100 successful agents), the locked networks
maintained an average 82% success rate.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Learning generalizable policies for robotic manipulation increasingly relies
on large-scale models that map language instructions to actions (L2A). However,
this one-way paradigm often produces policies that execute tasks without deeper
contextual understanding, limiting their ability to generalize or explain their
behavior. We argue that the complementary skill of mapping actions back to
language (A2L) is essential for developing more holistic grounding. An agent
capable of both acting and explaining its actions can form richer internal
representations and unlock new paradigms for self-supervised learning. We
introduce LACY (Language-Action Cycle), a unified framework that learns such
bidirectional mappings within a single vision-language model. LACY is jointly
trained on three synergistic tasks: generating parameterized actions from
language (L2A), explaining observed actions in language (A2L), and verifying
semantic consistency between two language descriptions (L2C). This enables a
self-improving cycle that autonomously generates and filters new training data
through an active augmentation strategy targeting low-confidence cases, thereby
improving the model without additional human labels. Experiments on
pick-and-place tasks in both simulation and the real world show that LACY
improves task success rates by 56.46% on average and yields more robust
language-action grounding for robotic manipulation. Project page:
https://vla2026.github.io/LACY/
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Keyu Zhao, Weiquan Lin, Qirui Zheng, Fengli Xu, Yong Li
Natural Language Processing
ๆฅ็ๆ่ฆ
Novel research ideas play a critical role in advancing scientific inquiries.
Recent advancements in Large Language Models (LLMs) have demonstrated their
potential to generate novel research ideas by leveraging large-scale scientific
literature. However, previous work in research ideation has primarily relied on
simplistic methods, such as keyword co-occurrence or semantic similarity. These
approaches focus on identifying statistical associations in the literature but
overlook the complex, contextual relationships between scientific concepts,
which are essential to effectively leverage knowledge embedded in human
literature. For instance, papers that simultaneously mention "keyword A" and
"keyword B" often present research ideas that integrate both concepts.
Additionally, some LLM-driven methods propose and refine research ideas using
the model's internal knowledge, but they fail to effectively utilize the
scientific concept network, limiting the grounding of ideas in established
research. To address these challenges, we propose the Deep Ideation framework
to address these challenges, integrating a scientific network that captures
keyword co-occurrence and contextual relationships, enriching LLM-driven
ideation. The framework introduces an explore-expand-evolve workflow to
iteratively refine research ideas, using an Idea Stack to track progress. A
critic engine, trained on real-world reviewer feedback, guides the process by
providing continuous feedback on the novelty and feasibility of ideas. Our
experiments show that our approach improves the quality of generated ideas by
10.67% compared to other methods, with ideas surpassing top conference
acceptance levels. Human evaluation highlights their practical value in
scientific research, and ablation studies confirm the effectiveness of each
component in the workflow. Code repo is available at
https://github.com/kyZhao-1/Deep-Ideation.
๐
2025-11-04
๐ ArXiv Preprint (cs.OS)
๐ฅ Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao et al.
ๆฅ็ๆ่ฆ
Agentic LLM applications interleave LLM generation requests with tool calls.
These tool calls break the continuity of the workflow by creating pauses
between LLM requests, bringing many challenges for the serving system,
especially under multi-turn scenarios. Each pause potentially causes KV cache
eviction and extra waiting time before entering the continuous batch for the
following LLM request. Since these pauses happen for each call, this problem
becomes increasingly severe as turn number grow for agentic programs. Previous
works either fail to incorporate information from the tool call, evicting KV
cache that leads to repetitive prefill or loading, or ignore the continuity of
a multi-turn program, creating waiting time between turns that increases
per-request latency.
We present Continuum, a serving system to optimize job completion time for
multi-turn agent workloads by combining tool-aware KV cache timeout with
program-level scheduling. By predicting tool call durations in agentic
workflows, Continuum selectively pins the KV cache in GPU memory with a
time-to-live value based on total turn number. When combined with program-level
first-come-first-serve, Continuum prevents scheduling bubbles, preserves
multi-turn continuity, and optimizes for throughput for complex agentic
workflows. By modeling the variability of tool call and agent program
continuity, Continuum outperforms state-of-the-art baselines. Our evaluation on
real-world agentic workloads (SWE-Bench and BFCL) with Llama-3.1 8B/70B models
shows that Continuum significantly improves the average job completion times,
and remains performant across different hardware setups and DRAM offloading
schemes. Preview code is available at:
https://github.com/Hanchenli/vllm-continuum
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Delin Ma, Menghui Zhou, Jun Qi, Yun Yang, Po Yang
Multimodal
ๆฅ็ๆ่ฆ
Alzheimer's disease (AD) is the most prevalent form of dementia, and its
early diagnosis is essential for slowing disease progression. Recent studies on
multimodal neuroimaging fusion using MRI and PET have achieved promising
results by integrating multi-scale complementary features. However, most
existing approaches primarily emphasize cross-modal complementarity while
overlooking the diagnostic importance of modality-specific features. In
addition, the inherent distributional differences between modalities often lead
to biased and noisy representations, degrading classification performance. To
address these challenges, we propose a Collaborative Attention and
Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The
proposed model introduces a learnable parameter representation (LPR) block to
compensate for missing modality information, followed by a shared encoder and
modality-independent encoders to preserve both shared and specific
representations. Furthermore, a consistency-guided mechanism is employed to
explicitly align the latent distributions across modalities. Experimental
results on the ADNI dataset demonstrate that our method achieves superior
diagnostic performance compared with existing fusion strategies.
๐
2025-11-04
๐ EMNLP 2025
๐ฅ Changjiang Jiang, Fengchang Yu, Haihua Chen, Wei Lu, Jin Zeng
Natural Language Processing
ๆฅ็ๆ่ฆ
Complex reasoning over tabular data is crucial in real-world data analysis,
yet large language models (LLMs) often underperform due to complex queries,
noisy data, and limited numerical capabilities. To address these issues, we
propose TabDSR, a framework consisting of: (1) a query decomposer that breaks
down complex questions, (2) a table sanitizer that cleans and filters noisy
tables, and (3) a program-of-thoughts (PoT)-based reasoner that generates
executable code to derive the final answer from the sanitized table. To ensure
unbiased evaluation and mitigate data leakage, we introduce a new dataset,
CalTab151, specifically designed for complex numerical reasoning over tables.
Experimental results demonstrate that TabDSR consistently outperforms existing
methods, achieving state-of-the-art (SOTA) performance with 8.79%, 6.08%, and
19.87% accuracy improvement on TAT-QA, TableBench, and TabDSR, respectively.
Moreover, our framework integrates seamlessly with mainstream LLMs, providing a
robust solution for complex tabular numerical reasoning. These findings
highlight the effectiveness of our framework in enhancing LLM performance for
complex tabular numerical reasoning. Data and code are available upon request.
๐
2025-11-04
๐ ArXiv Preprint (cs.MA)
๐ฅ Manonmani Sekar, Nasim Nezamoddini
Machine LearningRobotics
ๆฅ็ๆ่ฆ
One of the main challenges in managing traffic at multilane intersections is
ensuring smooth coordination between human-driven vehicles (HDVs) and connected
autonomous vehicles (CAVs). This paper presents a novel traffic signal control
framework that combines Graph Attention Networks (GAT) with Soft Actor-Critic
(SAC) reinforcement learning to address this challenge. GATs are used to model
the dynamic graph- structured nature of traffic flow to capture spatial and
temporal dependencies between lanes and signal phases. The proposed SAC is a
robust off-policy reinforcement learning algorithm that enables adaptive signal
control through entropy-optimized decision making. This design allows the
system to coordinate the signal timing and vehicle movement simultaneously with
objectives focused on minimizing travel time, enhancing performance, ensuring
safety, and improving fairness between HDVs and CAVs. The model is evaluated
using a SUMO-based simulation of a four-way intersection and incorporating
different traffic densities and CAV penetration rates. The experimental results
demonstrate the effectiveness of the GAT-SAC approach by achieving a 24.1%
reduction in average delay and up to 29.2% fewer traffic violations compared to
traditional methods. Additionally, the fairness ratio between HDVs and CAVs
improved to 1.59, indicating more equitable treatment across vehicle types.
These findings suggest that the GAT-SAC framework holds significant promise for
real-world deployment in mixed-autonomy traffic systems.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Hyemin Yu, Hong-Chuan Yang
Machine Learning
ๆฅ็ๆ่ฆ
Next-generation wireless communication systems must support ultra-reliable
low-latency communication (URLLC) service for mission-critical applications.
Meeting stringent URLLC requirements is challenging, especially for two-hop
cooperative communication. In this paper, we develop an adaptive transmission
design for a two-hop relaying communication system. Each hop transmission
adaptively configures its transmission parameters separately, including
numerology, mini-slot size, and modulation and coding scheme, for reliable
packet transmission within a strict latency constraint. We formulate the
hop-specific transceiver configuration as a Markov decision process (MDP) and
propose a dual-agent reinforcement learning-based cooperative latency-aware
transmission (DRL-CoLA) algorithm to learn latency-aware transmission policies
in a distributed manner. Simulation results verify that the proposed algorithm
achieves the near-optimal reliability while satisfying strict latency
requirements.
๐
2025-11-04
๐ 13 pages, IEEE Journal of Biomedical and Health In
๐ฅ Anders Austlid Taskรฉn, Thierry Judge, Erik Andreas Rye Berg, Jinyang Yu, Bjรธrnar Grenne et al.
Machine Learning
ๆฅ็ๆ่ฆ
Segmental longitudinal strain (SLS) of the left ventricle (LV) is an
important prognostic indicator for evaluating regional LV dysfunction, in
particular for diagnosing and managing myocardial ischemia. Current techniques
for strain estimation require significant manual intervention and expertise,
limiting their efficiency and making them too resource-intensive for monitoring
purposes. This study introduces the first automated pipeline, autoStrain, for
SLS estimation in transesophageal echocardiography (TEE) using deep learning
(DL) methods for motion estimation. We present a comparative analysis of two DL
approaches: TeeFlow, based on the RAFT optical flow model for dense
frame-to-frame predictions, and TeeTracker, based on the CoTracker point
trajectory model for sparse long-sequence predictions.
As ground truth motion data from real echocardiographic sequences are hardly
accessible, we took advantage of a unique simulation pipeline (SIMUS) to
generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with
ground truth myocardial motion to train and evaluate both models. Our
evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a
mean distance error in motion estimation of 0.65 mm on a synTEE test dataset.
Clinical validation on 16 patients further demonstrated that SLS estimation
with our autoStrain pipeline aligned with clinical references, achieving a mean
difference (95\% limits of agreement) of 1.09% (-8.90% to 11.09%).
Incorporation of simulated ischemia in the synTEE data improved the accuracy of
the models in quantifying abnormal deformation. Our findings indicate that
integrating AI-driven motion estimation with TEE can significantly enhance the
precision and efficiency of cardiac function assessment in clinical settings.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Weiwei Sun, Xuhui Zhou, Weihua Du, Xingyao Wang, Sean Welleck et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
While existing work focuses primarily on task success, we argue that
effective real-world agents require optimizing three dimensions: productivity
(task completion), proactivity (asking essential questions), and
personalization (adapting to diverse user preferences). We introduce UserVille,
an interactive environment with LLM-based user simulators enabling diverse,
configurable user preferences. Leveraging UserVille, we introduce PPP, a
multi-objective reinforcement learning approach that jointly optimizes all
three dimensions: Productivity, Proactivity, and Personalization. Experiments
on software engineering and deep research tasks show that agents trained with
PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6
on average), demonstrating the ability to ask strategic clarifying questions,
adapt to unseen user preferences, and improve task success through better
interaction. This work demonstrates that explicitly optimizing for
user-centered interaction is critical for building practical and effective AI
agents.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiajia Li, Keyi Zhu, Qianwen Zhang, Dong Chen, Qi Sun et al.
Computer Vision
ๆฅ็ๆ่ฆ
Strawberries are among the most economically significant fruits in the United
States, generating over $2 billion in annual farm-gate sales and accounting for
approximately 13% of the total fruit production value. Plant phenotyping plays
a vital role in selecting superior cultivars by characterizing plant traits
such as morphology, canopy structure, and growth dynamics. However, traditional
plant phenotyping methods are time-consuming, labor-intensive, and often
destructive. Recently, neural rendering techniques, notably Neural Radiance
Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful
frameworks for high-fidelity 3D reconstruction. By capturing a sequence of
multi-view images or videos around a target plant, these methods enable
non-destructive reconstruction of complex plant architectures. Despite their
promise, most current applications of 3DGS in agricultural domains reconstruct
the entire scene, including background elements, which introduces noise,
increases computational costs, and complicates downstream trait analysis. To
address this limitation, we propose a novel object-centric 3D reconstruction
framework incorporating a preprocessing pipeline that leverages the Segment
Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean
strawberry plant reconstructions. This approach produces more accurate
geometric representations while substantially reducing computational time. With
a background-free reconstruction, our algorithm can automatically estimate
important plant traits, such as plant height and canopy width, using DBSCAN
clustering and Principal Component Analysis (PCA). Experimental results show
that our method outperforms conventional pipelines in both accuracy and
efficiency, offering a scalable and non-destructive solution for strawberry
plant phenotyping.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jingbo Wang, Sendong Zhao, Haochun Wang, Yuzheng Fan, Lizhe Zhang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The emergence of multi-agent systems powered by large language models (LLMs)
has unlocked new frontiers in complex task-solving, enabling diverse agents to
integrate unique expertise, collaborate flexibly, and address challenges
unattainable for individual models. However, the full potential of such systems
is hindered by rigid agent scheduling and inefficient coordination strategies
that fail to adapt to evolving task requirements. In this paper, we propose
STRMAC, a state-aware routing framework designed for efficient collaboration in
multi-agent systems. Our method separately encodes interaction history and
agent knowledge to power the router, which adaptively selects the most suitable
single agent at each step for efficient and effective collaboration.
Furthermore, we introduce a self-evolving data generation approach that
accelerates the collection of high-quality execution paths for efficient system
training. Experiments on challenging collaborative reasoning benchmarks
demonstrate that our method achieves state-of-the-art performance, achieving up
to 23.8% improvement over baselines and reducing data collection overhead by up
to 90.1% compared to exhaustive search.
๐
2025-11-04
๐ ArXiv Preprint (cs.SE)
๐ฅ Shufan Wang, Xing Hu, Junkai Chen, Zhiyuan Pan, Xin Xia
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
With the widespread application of large language models (LLMs) in the field
of code intelligence, increasing attention has been paid to the reliability and
controllability of their outputs in code reasoning tasks. Confidence estimation
serves as an effective and convenient approach for evaluating these aspects.
This paper proposes a confidence analysis and enhancement framework for LLMs
tailored to code reasoning tasks. We conduct a comprehensive empirical study on
the confidence reliability of mainstream LLMs across different tasks, and
further evaluate the effectiveness of techniques such as prompt strategy
optimisation and mathematical calibration (e.g., Platt Scaling) in improving
confidence reliability. Our results show that DeepSeek-Reasoner achieves the
best performance across various tasks, outperforming other models by up to
$0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance
Score, respectively. The hybrid strategy combining the reassess prompt strategy
and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$
over the original performance in the aforementioned three metrics. These
results indicate that models with reasoning capabilities demonstrate superior
confidence reliability, and that the hybrid strategy is the most effective in
enhancing the confidence reliability of various models. Meanwhile, we elucidate
the impact of different task complexities, model scales, and strategies on
confidence performance, and highlight that the confidence of current LLMs in
complex reasoning tasks still has considerable room for improvement. This study
not only provides a research foundation and technical reference for the
application of confidence in LLM-assisted software engineering, but also points
the way for future optimisation and engineering deployment of confidence
mechanisms.
๐
2025-11-04
๐ ArXiv Preprint (cs.AR)
๐ฅ Liwei Ni, Jiaxi Zhang, Shenggen Zheng, Junfeng Liu, Xingyu Meng et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Boolean equivalence allows Boolean networks with identical functionality to
exhibit diverse graph structures. This gives more room for exploration in logic
optimization, while also posing a challenge for tasks involving consistency
between Boolean networks. To tackle this challenge, we introduce BoolSkeleton,
a novel Boolean network skeletonization method that improves the consistency
and reliability of design-specific evaluations. BoolSkeleton comprises two key
steps: preprocessing and reduction. In preprocessing, the Boolean network is
transformed into a defined Boolean dependency graph, where nodes are assigned
the functionality-related status. Next, the homogeneous and heterogeneous
patterns are defined for the node-level pattern reduction step. Heterogeneous
patterns are preserved to maintain critical functionality-related dependencies,
while homogeneous patterns can be reduced. Parameter K of the pattern further
constrains the fanin size of these patterns, enabling fine-tuned control over
the granularity of graph reduction. To validate BoolSkeleton's effectiveness,
we conducted four analysis/downstream tasks around the Boolean network:
compression analysis, classification, critical path analysis, and timing
prediction, demonstrating its robustness across diverse scenarios. Furthermore,
it improves above 55% in the average accuracy compared to the original Boolean
network for the timing prediction task. These experiments underscore the
potential of BoolSkeleton to enhance design consistency in logic synthesis.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yibo Zhao, Yang Zhao, Hongru Du, Hao Frank Yang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Decision-making models for individuals, particularly in high-stakes scenarios
like vaccine uptake, often diverge from population optimal predictions. This
gap arises from the uniqueness of the individual decision-making process,
shaped by numerical attributes (e.g., cost, time) and linguistic influences
(e.g., personal preferences and constraints). Developing upon Utility Theory
and leveraging the textual-reasoning capabilities of Large Language Models
(LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric
Reasoning framework (ATHENA) to address the optimal information integration.
ATHENA uniquely integrates two stages: First, it discovers robust, group-level
symbolic utility functions via LLM-augmented symbolic discovery; Second, it
implements individual-level semantic adaptation, creating personalized semantic
templates guided by the optimal utility to model personalized choices.
Validated on real-world travel mode and vaccine choice tasks, ATHENA
consistently outperforms utility-based, machine learning, and other LLM-based
models, lifting F1 score by at least 6.5% over the strongest cutting-edge
models. Further, ablation studies confirm that both stages of ATHENA are
critical and complementary, as removing either clearly degrades overall
predictive performance. By organically integrating symbolic utility modeling
and semantic adaptation, ATHENA provides a new scheme for modeling
human-centric decisions. The project page can be found at
https://yibozh.github.io/Athena.
๐
2025-11-04
๐ This paper was accepted by IEEE BIBM 2025 conferen
๐ฅ Jiawen Liu, Yuanbo Zeng, Jiaming Liang, Yizhen Yang, Yiheng Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate detection of retinal vessels plays a critical role in reflecting a
wide range of health status indicators in the clinical diagnosis of ocular
diseases. Recently, advances in deep learning have led to a surge in retinal
vessel segmentation methods, which have significantly contributed to the
quantitative analysis of vascular morphology. However, retinal vasculature
differs significantly from conventional segmentation targets in that it
consists of extremely thin and branching structures, whose global morphology
varies greatly across images. These characteristics continue to pose challenges
to segmentation precision and robustness. To address these issues, we propose
MM-UNet, a novel architecture tailored for efficient retinal vessel
segmentation. The model incorporates Morph Mamba Convolution layers, which
replace pointwise convolutions to enhance branching topological perception
through morph, state-aware feature sampling. Additionally, Reverse Selective
State Guidance modules integrate reverse guidance theory with state-space
modeling to improve geometric boundary awareness and decoding efficiency.
Extensive experiments conducted on two public retinal vessel segmentation
datasets demonstrate the superior performance of the proposed method in
segmentation accuracy. Compared to the existing approaches, MM-UNet achieves
F1-score gains of 1.64 $\%$ on DRIVE and 1.25 $\%$ on STARE, demonstrating its
effectiveness and advancement. The project code is public via
https://github.com/liujiawen-jpg/MM-UNet.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuzhuang Pian, Taiyu Wang, Shiqi Zhang, Rui Xu, Yonghong Liu
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Accurate air quality forecasts are vital for public health alerts, exposure
assessment, and emissions control. In practice, observational data are often
missing in varying proportions and patterns due to collection and transmission
issues. These incomplete spatiotemporal records impede reliable inference and
risk assessment and can lead to overconfident extrapolation. To address these
challenges, we propose an end to end framework, the channel gated learning unit
based spatiotemporal bayesian neural field (CGLUBNF). It uses Fourier features
with a graph attention encoder to capture multiscale spatial dependencies and
seasonal temporal dynamics. A channel gated learning unit, equipped with
learnable activations and gated residual connections, adaptively filters and
amplifies informative features. Bayesian inference jointly optimizes predictive
distributions and parameter uncertainty, producing point estimates and
calibrated prediction intervals. We conduct a systematic evaluation on two real
world datasets, covering four typical missing data patterns and comparing
against five state of the art baselines. CGLUBNF achieves superior prediction
accuracy and sharper confidence intervals. In addition, we further validate
robustness across multiple prediction horizons and analysis the contribution of
extraneous variables. This research lays a foundation for reliable deep
learning based spatio-temporal forecasting with incomplete observations in
emerging sensing paradigms, such as real world vehicle borne mobile monitoring.
๐
2025-11-04
๐ ArXiv Preprint (cs.LO)
๐ฅ Eric Vin, Kyle A. Miller, Inigo Incer, Sanjit A. Seshia, Daniel J. Fremont
Robotics
ๆฅ็ๆ่ฆ
Full verification of learning-enabled cyber-physical systems (CPS) has long
been intractable due to challenges including black-box components and complex
real-world environments. Existing tools either provide formal guarantees for
limited types of systems or test the system as a monolith, but no general
framework exists for compositional analysis of learning-enabled CPS using
varied verification techniques over complex real-world environments. This paper
introduces ScenicProver, a verification framework that aims to fill this gap.
Built upon the Scenic probabilistic programming language, the framework
supports: (1) compositional system description with clear component interfaces,
ranging from interpretable code to black boxes; (2) assume-guarantee contracts
over those components using an extension of Linear Temporal Logic containing
arbitrary Scenic expressions; (3) evidence generation through testing, formal
proofs via Lean 4 integration, and importing external assumptions; (4)
systematic combination of generated evidence using contract operators; and (5)
automatic generation of assurance cases tracking the provenance of system-level
guarantees. We demonstrate the framework's effectiveness through a case study
on an autonomous vehicle's automatic emergency braking system with sensor
fusion. By leveraging manufacturer guarantees for radar and laser sensors and
focusing testing efforts on uncertain conditions, our approach enables stronger
probabilistic guarantees than monolithic testing with the same computational
budget.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Alexander Htet Kyaw, Richa Gupta, Dhruv Shah, Anoop Sinha, Kory Mathewson et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Advances in 3D generative AI have enabled the creation of physical objects
from text prompts, but challenges remain in creating objects involving multiple
component types. We present a pipeline that integrates 3D generative AI with
vision-language models (VLMs) to enable the robotic assembly of multi-component
objects from natural language. Our method leverages VLMs for zero-shot,
multi-modal reasoning about geometry and functionality to decompose
AI-generated meshes into multi-component 3D models using predefined structural
and panel components. We demonstrate that a VLM is capable of determining which
mesh regions need panel components in addition to structural components, based
on object functionality. Evaluation across test objects shows that users
preferred the VLM-generated assignments 90.6% of the time, compared to 59.4%
for rule-based and 2.5% for random assignment. Lastly, the system allows users
to refine component assignments through conversational feedback, enabling
greater human control and agency in making physical objects with generative AI
and robotics.
๐
2025-11-04
๐ ArXiv Preprint (cs.GT)
๐ฅ Asrin Efe Yorulmaz, Tamer Baลar
ๆฅ็ๆ่ฆ
No-regret learning dynamics play a central role in game theory, enabling
decentralized convergence to equilibrium for concepts such as Coarse Correlated
Equilibrium (CCE) or Correlated Equilibrium (CE). In this work, we improve the
convergence rate to CCE in general-sum Markov games, reducing it from the
previously best-known rate of $\mathcal{O}(\log^5 T / T)$ to a sharper
$\mathcal{O}(\log T / T)$. This matches the best known convergence rate for CE
in terms of $T$, number of iterations, while also improving the dependence on
the action set size from polynomial to polylogarithmic-yielding exponential
gains in high-dimensional settings. Our approach builds on recent advances in
adaptive step-size techniques for no-regret algorithms in normal-form games,
and extends them to the Markovian setting via a stage-wise scheme that adjusts
learning rates based on real-time feedback. We frame policy updates as an
instance of Optimistic Follow-the-Regularized-Leader (OFTRL), customized for
value-iteration-based learning. The resulting self-play algorithm achieves, to
our knowledge, the fastest known convergence rate to CCE in Markov games.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yi Luo, Haochen Zhao, Xiao Liang, Yiwei Liu, Yuye Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Drug synergy prediction is a critical task in the development of effective
combination therapies for complex diseases, including cancer. Although existing
methods have shown promising results, they often operate as black-box
predictors that rely predominantly on statistical correlations between drug
characteristics and results. To address this limitation, we propose CausalDDS,
a novel framework that disentangles drug molecules into causal and spurious
substructures, utilizing the causal substructure representations for predicting
drug synergy. By focusing on causal sub-structures, CausalDDS effectively
mitigates the impact of redundant features introduced by spurious
substructures, enhancing the accuracy and interpretability of the model. In
addition, CausalDDS employs a conditional intervention mechanism, where
interventions are conditioned on paired molecular structures, and introduces a
novel optimization objective guided by the principles of sufficiency and
independence. Extensive experiments demonstrate that our method outperforms
baseline models, particularly in cold start and out-of-distribution settings.
Besides, CausalDDS effectively identifies key substructures underlying drug
synergy, providing clear insights into how drug combinations work at the
molecular level. These results underscore the potential of CausalDDS as a
practical tool for predicting drug synergy and facilitating drug discovery.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large-scale data has driven breakthroughs in robotics, from language models
to vision-language-action models in bimanual manipulation. However, humanoid
robotics lacks equally effective data collection frameworks. Existing humanoid
teleoperation systems either use decoupled control or depend on expensive
motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid
teleoperation and data collection system that preserves full whole-body control
while advancing scalability. Our system leverages PICO4U VR for obtaining
real-time whole-body human motions, with a custom 2-DoF robot neck (cost around
$250) for egocentric vision, enabling holistic human-to-humanoid control. We
demonstrate long-horizon dexterous and mobile humanoid skills and we can
collect 100 demonstrations in 15 minutes with an almost 100% success rate.
Building on this pipeline, we propose a hierarchical visuomotor policy
framework that autonomously controls the full humanoid body based on egocentric
vision. Our visuomotor policy successfully demonstrates whole-body dexterous
manipulation and dynamic kicking tasks. The entire system is fully reproducible
and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also
open-sourced at https://twist-data.github.io .
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dmitrii Pozdeev, Alexey Artemov, Ananta R. Bhattarai, Artem Sevastopolsky
Natural Language Processing
ๆฅ็ๆ่ฆ
We propose DenseMarks - a new learned representation for human heads,
enabling high-quality dense correspondences of human head images. For a 2D
image of a human head, a Vision Transformer network predicts a 3D embedding for
each pixel, which corresponds to a location in a 3D canonical unit cube. In
order to train our network, we collect a dataset of pairwise point matches,
estimated by a state-of-the-art point tracker over a collection of diverse
in-the-wild talking heads videos, and guide the mapping via a contrastive loss,
encouraging matched points to have close embeddings. We further employ
multi-task learning with face landmarks and segmentation constraints, as well
as imposing spatial continuity of embeddings through latent cube features,
which results in an interpretable and queryable canonical space. The
representation can be used for finding common semantic parts, face/head
tracking, and stereo reconstruction. Due to the strong supervision, our method
is robust to pose variations and covers the entire head, including hair.
Additionally, the canonical space bottleneck makes sure the obtained
representations are consistent across diverse poses and individuals. We
demonstrate state-of-the-art results in geometry-aware point matching and
monocular head tracking with 3D Morphable Models. The code and the model
checkpoint will be made available to the public.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Harshith Padigela, Shima Nofallah, Atchuth Naveen Chilaparasetti, Ryun Han, Andrew Walker et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Foundation models trained on large-scale pathology image corpora have
demonstrated strong transfer capabilities across diverse histopathology tasks.
Building on this progress, we introduce PLUTO-4, our next generation of
pathology foundation models that extend the Pathology-Universal Transformer
(PLUTO) to frontier scale. We share two complementary Vision Transformer
architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model
optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE
embeddings, and a frontier-scale PLUTO-4G model trained with a single patch
size to maximize representation capacity and stability. Both models are
pretrained using a self-supervised objective derived from DINOv2 on a large
multi-institutional corpus containing 551,164 WSIs from 137,144 patients across
over 50 institutions, spanning over 60 disease types and over 100 stains.
Comprehensive evaluation across public and internal benchmarks demonstrates
that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying
spatial and biological context, including patch-level classification,
segmentation, and slide-level diagnosis. The compact PLUTO-4S provides
high-throughput and robust performance for practical deployment, while PLUTO-4G
establishes new performance frontiers across multiple pathology benchmarks,
including an 11% improvement in dermatopathology diagnosis. These diverse
improvements underscore PLUTO-4's potential to transform real-world
applications as a backbone for translational research and diagnostic use cases.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nusrat Tasnim, Kutub Uddin, Khalid Mahmood Malik
Robotics
ๆฅ็ๆ่ฆ
The threats posed by AI-generated media, particularly deepfakes, are now
raising significant challenges for multimedia forensics, misinformation
detection, and biometric system resulting in erosion of public trust in the
legal system, significant increase in frauds, and social engineering attacks.
Although several forensic methods have been proposed, they suffer from three
critical gaps: (i) use of non-standardized benchmarks with GAN- or
diffusion-generated images, (ii) inconsistent training protocols (e.g.,
scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail
to capture generalization and explainability. These limitations hinder fair
comparison, obscure true robustness, and restrict deployment in
security-critical applications. This paper introduces a unified benchmarking
framework for systematic evaluation of forensic methods under controlled and
reproducible conditions. We benchmark ten SoTA forensic methods (scratch,
frozen, and fine-tuned) and seven publicly available datasets (GAN and
diffusion) to perform extensive and systematic evaluations. We evaluate
performance using multiple metrics, including accuracy, average precision,
ROC-AUC, error rate, and class-wise sensitivity. We also further analyze model
interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations
demonstrate substantial variability in generalization, with certain methods
exhibiting strong in-distribution performance but degraded cross-model
transferability. This study aims to guide the research community toward a
deeper understanding of the strengths and limitations of current forensic
approaches, and to inspire the development of more robust, generalizable, and
explainable solutions.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We propose MIRA, a new benchmark designed to evaluate models in scenarios
where generating intermediate visual images is essential for successful
reasoning. Unlike traditional CoT methods that rely solely on text, tasks in
MIRA require models to generate and utilize intermediate images - such as
sketches, structural diagrams, or path drawings - to guide their reasoning
process. This setup closely mirrors how humans solve complex problems through
"drawing to think". To solve this, MIRA focuses on tasks that are intrinsically
challenging and involve complex structures, spatial relationships, or reasoning
steps that are difficult to express through language alone. To ensure that our
evaluation data is of high-quality, we include 546 multimodal problems,
annotated with intermediate visual images and final answers. We also propose a
unified evaluation protocol for MIRA that spans three levels of evaluation
input: direct input with image and question only, text-only CoT input with
image and thinking prompts, and Visual-CoT input with both annotated image
clues and textual thinking prompts. To probe the upper bound of model capacity
on our benchmark, we also report pass@k and majority voting accuracies under
different k settings. Experimental results show that existing multimodal large
language models, including strongest private models as well as strong
open-weight models, perform poorly when relying solely on textual prompts.
However, when intermediate visual cues are provided, model performance improves
consistently, yielding an average relative gain of 33.7% across all models and
tasks. We also probe the upper bound by expanding the search space and
designing textual prompts aligned with Visual-CoT, but both yield only limited
improvements compared to our Visual-CoT setting. These results underscore the
critical role of imagined visual information in enabling successful reasoning
on MIRA.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao et al.
Multimodal
ๆฅ็ๆ่ฆ
Code has emerged as a precise and executable medium for reasoning and action
in the agent era. Yet, progress has largely focused on language-centric tasks
such as program synthesis and debugging, leaving visual-centric coding
underexplored. Inspired by how humans reason over sketches, we advocate SVG
code as a compact, interpretable, and executable visual representation. We
introduce VCode, a benchmark that reframes multimodal understanding as code
generation: given an image, a model must produce SVG that preserves symbolic
meaning for downstream reasoning. VCode covers three domains - general
commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric
perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel
evaluation protocol in which a policy model answers questions over rendered
SVGs; correct answers indicate faithful symbolic preservation. Empirically,
frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap
between language-centric and visual-centric coding. To close this gap, we
introduce VCoder, an agentic framework that augments VLMs along two axes: (i)
Thinking with Revision, which iteratively analyzes discrepancies and refines
SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply
structured cues such as objects, shapes, and text beyond the model's intrinsic
capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities
score well overall yet remain limited in professional knowledge and 3D
reasoning. VCoder delivers a 12.3-point overall gain over the top-performing
Claude-4-Opus. Human studies show that both humans and VLMs perform worse on
rendered SVGs, their consistency reveals the promise of symbolic visual
representation. The benchmark and code are available at
https://github.com/CSU-JPG/VCode.
๐
2025-11-04
๐ WWW
๐ฅ Antonio Oroz, Matthias Nieรner, Tobias Kirschstein
Robotics
ๆฅ็ๆ่ฆ
We present PercHead, a method for single-image 3D head reconstruction and
semantic 3D editing - two tasks that are inherently challenging due to severe
view occlusions, weak perceptual supervision, and the ambiguity of editing in
3D space. We develop a unified base model for reconstructing view-consistent 3D
heads from a single input image. The model employs a dual-branch encoder
followed by a ViT-based decoder that lifts 2D features into 3D space through
iterative cross-attention. Rendering is performed using Gaussian Splatting. At
the heart of our approach is a novel perceptual supervision strategy based on
DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric
and appearance fidelity. Our model achieves state-of-the-art performance in
novel-view synthesis and, furthermore, exhibits exceptional robustness to
extreme viewing angles compared to established baselines. Furthermore, this
base model can be seamlessly extended for semantic 3D editing by swapping the
encoder and finetuning the network. In this variant, we disentangle geometry
and style through two distinct input modalities: a segmentation map to control
geometry and either a text prompt or a reference image to specify appearance.
We highlight the intuitive and powerful 3D editing capabilities of our model
through a lightweight, interactive GUI, where users can effortlessly sculpt
geometry by drawing segmentation maps and stylize appearance via natural
language or image prompts.
Project Page: https://antoniooroz.github.io/PercHead Video:
https://www.youtube.com/watch?v=4hFybgTk4kE
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tyler Zhu, Tengda Han, Leonidas Guibas, Viorica Pฤtrฤucean, Maks Ovsjanikov
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The alignment of representations from different modalities has recently been
shown to provide insights on the structural similarities and downstream
capabilities of different encoders across diverse data types. While significant
progress has been made in aligning images with text, the temporal nature of
video data remains largely unexplored in this context. In this work, we conduct
the first comprehensive study of video-text representation alignment, probing
the capabilities of modern video and language encoders. Our findings reveal
several key insights. First, we demonstrate that cross-modal alignment highly
depends on the richness of both visual (static images vs. multi-frame videos)
and text (single caption vs. a collection) data provided at test time,
especially when using state-of-the-art video encoders. We propose parametric
test-time scaling laws that capture this behavior and show remarkable
predictive power against empirical observations. Secondly, we investigate the
correlation between semantic alignment and performance on both semantic and
non-semantic downstream tasks, providing initial evidence that strong alignment
against text encoders may be linked to general-purpose video representation and
understanding. Finally, we correlate temporal reasoning with cross-modal
alignment providing a challenging test-bed for vision and language models.
Overall, our work introduces video-text alignment as an informative zero-shot
way to probe the representation power of different encoders for spatio-temporal
data. Project page can be found at https://video-prh.github.io/
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vojtฤch Kลฏr, Adam Bajger, Adam Kukuฤka, Marek Hradil, Vรญt Musil et al.
MultimodalNatural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Convolutional neural networks (CNNs) underpin many modern computer vision
systems. With applications ranging from common to critical areas, a need to
explain and understand the model and its decisions (XAI) emerged. Prior works
suggest that in the top layers of CNNs, the individual channels can be
attributed to classifying human-understandable concepts. Concept relevance
propagation (CRP) methods can backtrack predictions to these channels and find
images that most activate these channels. However, current CRP workflows are
largely manual: experts must inspect activation images to name the discovered
concepts and must synthesize verbose explanations from relevance maps, limiting
the accessibility of the explanations and their scalability.
To address these issues, we introduce Large Language model EXplaIns COncept
Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a
multimodal large language model. Our approach automatically assigns descriptive
names to concept prototypes and generates natural-language explanations that
translate quantitative relevance distributions into intuitive narratives. To
ensure faithfulness, we craft prompts that teach the language model the
semantics of CRP through examples and enforce a separation between naming and
explanation tasks. The resulting text can be tailored to different audiences,
offering low-level technical descriptions for experts and high-level summaries
for non-technical stakeholders.
We qualitatively evaluate our method on various images from ImageNet on a
VGG16 model. Our findings suggest that integrating concept-based attribution
methods with large language models can significantly lower the barrier to
interpreting deep neural networks, paving the way for more transparent AI
systems.
๐
2025-11-04
๐ ArXiv Preprint (eess.SP)
๐ฅ Marios Impraimakis, Andrew W. Smyth
ๆฅ็ๆ่ฆ
The input-parameter-state estimation capabilities of a novel unscented Kalman
filter is examined herein on both linear and nonlinear systems. The unknown
input is estimated in two stages within each time step. Firstly, the predicted
dynamic states and the system parameters provide an estimation of the input.
Secondly, the corrected with measurements states and parameters provide a final
estimation. Importantly, it is demonstrated using the perturbation analysis
that, a system with at least a zero or a non-zero known input can potentially
be uniquely identified. This output-only methodology allows for a better
understanding of the system compared to classical output-only parameter
identification strategies, given that all the dynamic states, the parameters,
and the input are estimated jointly and in real-time.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhicheng Zhang, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Pengfei Wan et al.
ๆฅ็ๆ่ฆ
Understanding and predicting emotion from videos has gathered significant
attention in recent studies, driven by advancements in video large language
models (VideoLLMs). While advanced methods have made progress in video emotion
analysis, the intrinsic nature of emotions poses significant challenges.
Emotions are characterized by dynamic and cues-dependent properties, making it
difficult to understand complex and evolving emotional states with reasonable
rationale. To tackle these challenges, we propose a novel affective cues-guided
reasoning framework that unifies fundamental attribute perception, expression
analysis, and high-level emotional understanding in a stage-wise manner. At the
core of our approach is a family of video emotion foundation models (VidEmo),
specifically designed for emotion reasoning and instruction-following. These
models undergo a two-stage tuning process: first, curriculum emotion learning
for injecting emotion knowledge, followed by affective-tree reinforcement
learning for emotion reasoning. Moreover, we establish a foundational data
infrastructure and introduce a emotion-centric fine-grained dataset (Emo-CFG)
consisting of 2.1M diverse instruction-based samples. Emo-CFG includes
explainable emotional question-answering, fine-grained captions, and associated
rationales, providing essential resources for advancing emotion understanding
tasks. Experimental results demonstrate that our approach achieves competitive
performance, setting a new milestone across 15 face perception tasks.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chao Yuan, Zanwu Liu, Guiwei Zhang, Haoxuan Xu, Yujian Zhao et al.
ๆฅ็ๆ่ฆ
Visible-infrared person re-identification (VI-ReID) technique could associate
the pedestrian images across visible and infrared modalities in the practical
scenarios of background illumination changes. However, a substantial gap
inherently exists between these two modalities. Besides, existing methods
primarily rely on intermediate representations to align cross-modal features of
the same person. The intermediate feature representations are usually create by
generating intermediate images (kind of data enhancement), or fusing
intermediate features (more parameters, lack of interpretability), and they do
not make good use of the intermediate features. Thus, we propose a novel
VI-ReID framework via Modality-Transition Representation Learning (MTRL) with a
middle generated image as a transmitter from visible to infrared modals, which
are fully aligned with the original visible images and similar to the infrared
modality. After that, using a modality-transition contrastive loss and a
modality-query regularization loss for training, which could align the
cross-modal features more effectively. Notably, our proposed framework does not
need any additional parameters, which achieves the same inference speed to the
backbone while improving its performance on VI-ReID task. Extensive
experimental results illustrate that our model significantly and consistently
outperforms existing SOTAs on three typical VI-ReID datasets.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Marius Aasan, Martine Hjelkrem-Tan, Nico Catalano, Changkyu Choi, Adรญn Ramรญrez Rivera
Natural Language Processing
ๆฅ็ๆ่ฆ
Vision Transformers rely on fixed patch tokens that ignore the spatial and
semantic structure of images. In this work, we introduce an end-to-end
differentiable tokenizer that adapts to image content with pixel-level
granularity while remaining backward-compatible with existing architectures for
retrofitting pretrained models. Our method uses hierarchical model selection
with information criteria to provide competitive performance in both
image-level classification and dense-prediction tasks, and even supports
out-of-the-box raster-to-vector conversion.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang et al.
Multimodal
ๆฅ็ๆ่ฆ
Large multimodal models (LMMs) often suffer from severe inference
inefficiency due to the large number of visual tokens introduced by image
encoders. While recent token compression methods, such as pruning and merging,
have shown promise in reducing redundancy, their evaluation remains fragmented
and inconsistent. In this work, we present UniPruneBench, a unified and
extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench
provides standardized protocols across six ability dimensions and ten datasets,
covering ten representative compression algorithms and three families of LMMs
(LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates
system-level metrics such as runtime and prefilling latency to provide a
holistic view. Our experiments uncover several key findings: (1) random pruning
is a surprisingly strong baseline, (2) no single method consistently
outperforms others across scenarios, (3) pruning sensitivity varies
significantly across tasks, with OCR being most vulnerable, and (4) pruning
ratio is the dominant factor governing performance degradation. We believe
UniPruneBench will serve as a reliable foundation for future research on
efficient multimodal modeling.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Poulami Raha, Yeongnam Chae
Computer Vision
ๆฅ็ๆ่ฆ
Biometric technologies are widely adopted in security, legal, and financial
systems. Face recognition can authenticate a person based on the unique facial
features such as shape and texture. However, recent works have demonstrated the
vulnerability of Face Recognition Systems (FRS) towards presentation attacks.
Using spoofing (aka.,presentation attacks), a malicious actor can get
illegitimate access to secure systems. This paper proposes a novel light-weight
CNN framework to identify print/display, video and wrap attacks. The proposed
robust architecture provides seamless liveness detection ensuring faster
biometric authentication (1-2 seconds on CPU). Further, this also presents a
newly created 2D spoof attack dataset consisting of more than 500 videos
collected from 60 subjects. To validate the effectiveness of this architecture,
we provide a demonstration video depicting print/display, video and wrap attack
detection approaches. The demo can be viewed in the following link:
https://rak.box.com/s/m1uf31fn5amtjp4mkgf1huh4ykfeibaa
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Change detection (CD) is a fundamental task for monitoring and analyzing land
cover dynamics. While recent high performance models and high quality datasets
have significantly advanced the field, a critical limitation persists. Current
models typically acquire limited knowledge from single-type annotated data and
cannot concurrently leverage diverse binary change detection (BCD) and semantic
change detection (SCD) datasets. This constraint leads to poor generalization
and limited versatility. The recent advancements in Multimodal Large Language
Models (MLLMs) introduce new possibilities for a unified CD framework. We
leverage the language priors and unification capabilities of MLLMs to develop
UniChange, the first MLLM-based unified change detection model. UniChange
integrates generative language abilities with specialized CD functionalities.
Our model successfully unifies both BCD and SCD tasks through the introduction
of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange
utilizes text prompts to guide the identification of change categories,
eliminating the reliance on predefined classification heads. This design allows
UniChange to effectively acquire knowledge from multi-source datasets, even
when their class definitions conflict. Experiments on four public benchmarks
(WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance,
achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively,
surpassing all previous methods. The code is available at
https://github.com/Erxucomeon/UniChange.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jan Frederik Meier, Timo Lรผddecke
ๆฅ็ๆ่ฆ
Multi-animal tracking is crucial for understanding animal ecology and
behavior. However, it remains a challenging task due to variations in habitat,
motion patterns, and species appearance. Traditional approaches typically
require extensive model fine-tuning and heuristic design for each application
scenario. In this work, we explore the potential of recent vision foundation
models for zero-shot multi-animal tracking. By combining a Grounding Dino
object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully
designed heuristics, we develop a tracking framework that can be applied to new
datasets without any retraining or hyperparameter adaptation. Evaluations on
ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate
strong and consistent performance across diverse species and environments. The
code is available at https://github.com/ecker-lab/SAM2-Animal-Tracking.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Despite the remarkable success of text-to-image diffusion models, their
output of a single, flattened image remains a critical bottleneck for
professional applications requiring layer-wise control. Existing solutions
either rely on fine-tuning with large, inaccessible datasets or are
training-free yet limited to generating isolated foreground elements, failing
to produce a complete and coherent scene. To address this, we introduce the
Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a
novel framework for zero-shot, layer-wise image generation. Our core technique,
Noise Transplantation and Cultivation (NTC), extracts intermediate latent
representations from both foreground and composite generation processes,
transplanting them into the initial noise for subsequent layers. This ensures
semantic and structural coherence across foreground, background, and composite
layers, enabling consistent, multi-layered outputs without requiring
fine-tuning or auxiliary datasets. Extensive experiments show that our
training-free method achieves performance comparable to fine-tuned methods,
enhancing layer-wise consistency while maintaining high image quality and
fidelity. TAUE not only eliminates costly training and dataset requirements but
also unlocks novel downstream applications, such as complex compositional
editing, paving the way for more accessible and controllable generative
workflows.
๐
2025-11-04
๐ ArXiv Preprint (eess.IV)
๐ฅ Alix de Langlais, Benjamin Billot, Thรฉo Aguilar Vidal, Marc-Olivier Gauci, Hervรฉ Delingette
ๆฅ็ๆ่ฆ
Delineating anatomical regions is a key task in medical image analysis.
Manual segmentation achieves high accuracy but is labor-intensive and prone to
variability, thus prompting the development of automated approaches. Recently,
a breadth of foundation models has enabled automated segmentations across
diverse anatomies and imaging modalities, but these may not always meet the
clinical accuracy standards. While segmentation refinement strategies can
improve performance, current methods depend on heavy user interactions or
require fully supervised segmentations for training. Here, we present SCORE
(Segmentation COrrection from Regional Evaluations), a weakly supervised
framework that learns to refine mask predictions only using light feedback
during training. Specifically, instead of relying on dense training image
annotations, SCORE introduces a novel loss that leverages region-wise quality
scores and over/under-segmentation error labels. We demonstrate SCORE on
humerus CT scans, where it considerably improves initial predictions from
TotalSegmentator, and achieves performance on par with existing refinement
methods, while greatly reducing their supervision requirements and annotation
time. Our code is available at: https://gitlab.inria.fr/adelangl/SCORE.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jingyu Lu, Haonan Wang, Qixiang Zhang, Xiaomeng Li
Machine Learning
ๆฅ็ๆ่ฆ
Subject-agnostic brain decoding, which aims to reconstruct continuous visual
experiences from fMRI without subject-specific training, holds great potential
for clinical applications. However, this direction remains underexplored due to
challenges in cross-subject generalization and the complex nature of brain
signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a
novel hierarchical decoding framework that explicitly models the ventral-dorsal
architecture of the human visual system to learn multi-dimensional
representations. By disentangling and leveraging features from early visual
cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary
cognitive information essential for visual reconstruction. Furthermore, we
introduce a feature-level contrastive learning strategy to enhance the
extraction of subject-invariant semantic representations, thereby enhancing
subject-agnostic applicability to previously unseen subjects. Unlike
conventional pipelines that need more than 12 hours of per-subject data and
heavy computation, VCFlow sacrifices only 7\% accuracy on average yet generates
each reconstructed video in 10 seconds without any retraining, offering a fast
and clinically scalable solution. The source code will be released upon
acceptance of the paper.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Rashidunnabi, Kailash A. Hambarde, Vasco Lopes, Joao C. Neves, Hugo Proenca
Machine Learning
ๆฅ็ๆ่ฆ
Video-based person re-identification (ReID) in cross-view domains (for
example, aerial-ground surveillance) remains an open problem because of extreme
viewpoint shifts, scale disparities, and temporal inconsistencies. To address
these challenges, we propose MTF-CVReID, a parameter-efficient framework that
introduces seven complementary modules over a ViT-B/16 backbone. Specifically,
we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and
view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale
stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to
reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for
motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment
(IVFA) for perspective-invariant representation alignment; (6) Hierarchical
Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities;
and (7) Multi-View Identity Consistency Learning (MVICL) that enforces
cross-view identity coherence using a contrastive learning paradigm. Despite
adding only about 2 million parameters and 0.7 GFLOPs over the baseline,
MTF-CVReID maintains real-time efficiency (189 FPS) and achieves
state-of-the-art performance on the AG-VPReID benchmark across all altitude
levels, with strong cross-dataset generalization to G2A-VReID and MARS
datasets. These results show that carefully designed adapter-based modules can
substantially enhance cross-view robustness and temporal consistency without
compromising computational efficiency. The source code is available at
https://github.com/MdRashidunnabi/MTF-CVReID
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Akash Sharma, Chinmay Mhatre, Sankalp Gawali, Ruthvik Bokkasam, Brij Kishore et al.
ๆฅ็ๆ่ฆ
This report describes the UVH-26 dataset, the first public release by
AIM@IISc of a large-scale dataset of annotated traffic-camera images from
India. The dataset comprises 26,646 high-resolution (1080p) images sampled from
2800 Bengaluru's Safe-City CCTV cameras over a 4-week period, and subsequently
annotated through a crowdsourced hackathon involving 565 college students from
across India. In total, 1.8 million bounding boxes were labeled across 14
vehicle classes specific to India: Cycle, 2-Wheeler (Motorcycle), 3-Wheeler
(Auto-rickshaw), LCV (Light Commercial Vehicles), Van, Tempo-traveller,
Hatchback, Sedan, SUV, MUV, Mini-bus, Bus, Truck and Other. Of these, 283k-316k
consensus ground truth bounding boxes and labels were derived for distinct
objects in the 26k images using Majority Voting and STAPLE algorithms. Further,
we train multiple contemporary detectors, including YOLO11-S/X, RT-DETR-S/X,
and DAMO-YOLO-T/L using these datasets, and report accuracy based on mAP50,
mAP75 and mAP50:95. Models trained on UVH-26 achieve 8.4-31.5% improvements in
mAP50:95 over equivalent baseline models trained on COCO dataset, with
RT-DETR-X showing the best performance at 0.67 (mAP50:95) as compared to 0.40
for COCO-trained weights for common classes (Car, Bus, and Truck). This
demonstrates the benefits of domain-specific training data for Indian traffic
scenarios. The release package provides the 26k images with consensus
annotations based on Majority Voting (UVH-26-MV) and STAPLE (UVH-26-ST) and the
6 fine-tuned YOLO and DETR models on each of these datasets. By capturing the
heterogeneity of Indian urban mobility directly from operational traffic-camera
streams, UVH-26 addresses a critical gap in existing global benchmarks, and
offers a foundation for advancing detection, classification, and deployment of
intelligent transportation systems in emerging nations with complex traffic
conditions.
๐
2025-11-04
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Dan Bohus, Sean Andrist, Ann Paradiso, Nick Saw, Tim Schoonbeek et al.
Multimodal
ๆฅ็ๆ่ฆ
We introduce SigmaCollab, a dataset enabling research on physically situated
human-AI collaboration. The dataset consists of a set of 85 sessions in which
untrained participants were guided by a mixed-reality assistive AI agent in
performing procedural tasks in the physical world. SigmaCollab includes a set
of rich, multimodal data streams, such as the participant and system audio,
egocentric camera views from the head-mounted device, depth maps, head, hand
and gaze tracking information, as well as additional annotations performed
post-hoc. While the dataset is relatively small in size (~ 14 hours), its
application-driven and interactive nature brings to the fore novel research
challenges for human-AI collaboration, and provides more realistic testing
grounds for various AI models operating in this space. In future work, we plan
to use the dataset to construct a set of benchmarks for physically situated
collaboration in mixed-reality task assistive scenarios. SigmaCollab is
available at https://github.com/microsoft/SigmaCollab.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ali Farki, Elaheh Moradi, Deepika Koundal, Jussi Tohka
Machine Learning
ๆฅ็ๆ่ฆ
Predicting future brain state from a baseline magnetic resonance image (MRI)
is a central challenge in neuroimaging and has important implications for
studying neurodegenerative diseases such as Alzheimer's disease (AD). Most
existing approaches predict future cognitive scores or clinical outcomes, such
as conversion from mild cognitive impairment to dementia. Instead, here we
investigate longitudinal MRI image-to-image prediction that forecasts a
participant's entire brain MRI several years into the future, intrinsically
modeling complex, spatially distributed neurodegenerative patterns. We
implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR,
Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL).
Predicted follow-up MRIs are directly compared with the actual follow-up scans
using metrics that capture global similarity and local differences. The best
performing models achieve high-fidelity predictions, and all models generalize
well to an independent external dataset, demonstrating robust cross-cohort
performance. Our results indicate that deep learning can reliably predict
participant-specific brain MRI at the voxel level, offering new opportunities
for individualized prognosis.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jessica Plassmann, Nicolas Schuler, Georg von Freymann, Michael Schuth
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Shearography is a non-destructive testing method for detecting subsurface
defects, offering high sensitivity and full-field inspection capabilities.
However, its industrial adoption remains limited due to the need for expert
interpretation. To reduce reliance on labeled data and manual evaluation, this
study explores unsupervised learning methods for automated anomaly detection in
shearographic images. Three architectures are evaluated: a fully connected
autoencoder, a convolutional autoencoder, and a student-teacher feature
matching model. All models are trained solely on defect-free data. A controlled
dataset was developed using a custom specimen with reproducible defect
patterns, enabling systematic acquisition of shearographic measurements under
both ideal and realistic deformation conditions. Two training subsets were
defined: one containing only undistorted, defect-free samples, and one
additionally including globally deformed, yet defect-free, data. The latter
simulates practical inspection conditions by incorporating deformation-induced
fringe patterns that may obscure localized anomalies. The models are evaluated
in terms of binary classification and, for the student-teacher model, spatial
defect localization. Results show that the student-teacher approach achieves
superior classification robustness and enables precise localization. Compared
to the autoencoder-based models, it demonstrates improved separability of
feature representations, as visualized through t-SNE embeddings. Additionally,
a YOLOv8 model trained on labeled defect data serves as a reference to
benchmark localization quality. This study underscores the potential of
unsupervised deep learning for scalable, label-efficient shearographic
inspection in industrial environments.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jee Won Lee, Jongseong Brad Choi
Machine Learning
ๆฅ็ๆ่ฆ
Sparse-voxel rasterization is a fast, differentiable alternative for
optimization-based scene reconstruction, but it tends to underfit low-frequency
content, depends on brittle pruning heuristics, and can overgrow in ways that
inflate VRAM. We introduce LiteVoxel, a self-tuning training pipeline that
makes SV rasterization both steadier and lighter. Our loss is made
low-frequency aware via an inverse-Sobel reweighting with a mid-training
gamma-ramp, shifting gradient budget to flat regions only after geometry
stabilize. Adaptation replaces fixed thresholds with a depth-quantile pruning
logic on maximum blending weight, stabilized by EMA-hysteresis guards and
refines structure through ray-footprint-based, priority-driven subdivision
under an explicit growth budget. Ablations and full-system results across
Mip-NeRF 360 (6scenes) and Tanks & Temples (3scenes) datasets show mitigation
of errors in low-frequency regions and boundary instability while keeping
PSNR/SSIM, training time, and FPS comparable to a strong SVRaster pipeline.
Crucially, LiteVoxel reduces peak VRAM by ~40%-60% and preserves low-frequency
detail that prior setups miss, enabling more predictable, memory-efficient
training without sacrificing perceptual quality.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nicolas Schuler, Lea Dewald, Jรผrgen Graf
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Recent advancements in Deep Learning enable hardware-based cognitive systems,
that is, mechatronic systems in general and robotics in particular with
integrated Artificial Intelligence, to interact with dynamic and unstructured
environments. While the results are impressive, the application of such systems
to critical tasks like autonomous driving as well as service and care robotics
necessitate the evaluation of large amount of heterogeneous data. Automated
report generation for Mobile Robotics can play a crucial role in facilitating
the evaluation and acceptance of such systems in various domains. In this
paper, we propose a pipeline for generating automated reports in natural
language utilizing various multi-modal sensors that solely relies on local
models capable of being deployed on edge computing devices, thus preserving the
privacy of all actors involved and eliminating the need for external services.
In particular, we evaluate our implementation on a diverse dataset spanning
multiple domains including indoor, outdoor and urban environments, providing
quantitative as well as qualitative evaluation results. Various generated
example reports and other supplementary materials are available via a public
repository.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yaosen Chen, Wei Wang, Tianheng Zheng, Xuming Wen, Han Yang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Shot assembly is a crucial step in film production and video editing,
involving the sequencing and arrangement of shots to construct a narrative,
convey information, or evoke emotions. Traditionally, this process has been
manually executed by experienced editors. While current intelligent video
editing technologies can handle some automated video editing tasks, they often
fail to capture the creator's unique artistic expression in shot assembly. To
address this challenge, we propose an energy-based optimization method for
video shot assembly. Specifically, we first perform visual-semantic matching
between the script generated by a large language model and a video library to
obtain subsets of candidate shots aligned with the script semantics. Next, we
segment and label the shots from reference videos, extracting attributes such
as shot size, camera motion, and semantics. We then employ energy-based models
to learn from these attributes, scoring candidate shot sequences based on their
alignment with reference styles. Finally, we achieve shot assembly optimization
by combining multiple syntax rules, producing videos that align with the
assembly style of the reference videos. Our method not only automates the
arrangement and combination of independent shots according to specific logic,
narrative requirements, or artistic styles but also learns the assembly style
of reference videos, creating a coherent visual sequence or holistic visual
expression. With our system, even users with no prior video editing experience
can create visually compelling videos. Project page:
https://sobeymil.github.io/esa.com
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Robinson Umeike, Neil Getty, Yin Xiangyu, Yi Jiang
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The automation of workflows in advanced microscopy is a key goal where
foundation models like Language Models (LLMs) and Vision-Language Models (VLMs)
show great potential. However, adapting these general-purpose models for
specialized scientific tasks is critical, and the optimal domain adaptation
strategy is often unclear. To address this, we introduce PtychoBench, a new
multi-modal, multi-task benchmark for ptychographic analysis. Using this
benchmark, we systematically compare two specialization strategies: Supervised
Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies
on a visual artifact detection task with VLMs and a textual parameter
recommendation task with LLMs in a data-scarce regime. Our findings reveal that
the optimal specialization pathway is task-dependent. For the visual task, SFT
and ICL are highly complementary, with a fine-tuned model guided by
context-aware examples achieving the highest mean performance (Micro-F1 of
0.728). Conversely, for the textual task, ICL on a large base model is the
superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a
powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm
the superiority of context-aware prompting and identify a consistent contextual
interference phenomenon in fine-tuned models. These results, benchmarked
against strong baselines including GPT-4o and a DINOv3-based classifier, offer
key observations for AI in science: the optimal specialization path in our
benchmark is dependent on the task modality, offering a clear framework for
developing more effective science-based agentic systems.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Recent advances in multi-modal models have demonstrated strong performance in
tasks such as image generation and reasoning. However, applying these models to
the fire domain remains challenging due to the lack of publicly available
datasets with high-quality fire domain annotations. To address this gap, we
introduce DetectiumFire, a large-scale, multi-modal dataset comprising of 22.5k
high-resolution fire-related images and 2.5k real-world fire-related videos
covering a wide range of fire types, environments, and risk levels. The data
are annotated with both traditional computer vision labels (e.g., bounding
boxes) and detailed textual prompts describing the scene, enabling applications
such as synthetic data generation and fire risk reasoning. DetectiumFire offers
clear advantages over existing benchmarks in scale, diversity, and data
quality, significantly reducing redundancy and enhancing coverage of real-world
scenarios. We validate the utility of DetectiumFire across multiple tasks,
including object detection, diffusion-based image generation, and
vision-language reasoning. Our results highlight the potential of this dataset
to advance fire-related research and support the development of intelligent
safety systems. We release DetectiumFire to promote broader exploration of fire
understanding in the AI community. The dataset is available at
https://kaggle.com/datasets/38b79c344bdfc55d1eed3d22fbaa9c31fad45e27edbbe9e3c529d6e5c4f93890
๐
2025-11-04
๐ 20 pages, Submitted to IEEE TIM
๐ฅ Tao Liu, Kan Ren, Qian Chen
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
With the rapid growth of the low-altitude economy, UAVs have become crucial
for measurement and tracking in patrol systems. However, in GNSS-denied areas,
satellite-based localization methods are prone to failure. This paper presents
a cross-view UAV localization framework that performs map matching via object
detection, aimed at effectively addressing cross-temporal, cross-view,
heterogeneous aerial image matching. In typical pipelines, UAV visual
localization is formulated as an image-retrieval problem: features are
extracted to build a localization map, and the pose of a query image is
estimated by matching it to a reference database with known poses. Because
publicly available UAV localization datasets are limited, many approaches
recast localization as a classification task and rely on scene labels in these
datasets to ensure accuracy. Other methods seek to reduce cross-domain
differences using polar-coordinate reprojection, perspective transformations,
or generative adversarial networks; however, they can suffer from misalignment,
content loss, and limited realism. In contrast, we leverage modern object
detection to accurately extract salient instances from UAV and satellite
images, and integrate a graph neural network to reason about inter-image and
intra-image node relationships. Using a fine-grained, graph-based
node-similarity metric, our method achieves strong retrieval and localization
performance. Extensive experiments on public and real-world datasets show that
our approach handles heterogeneous appearance differences effectively and
generalizes well, making it applicable to scenarios with larger modality gaps,
such as infrared-visible image matching. Our dataset will be publicly available
at the following URL: https://github.com/liutao23/ODGNNLoc.git.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xilong Zhou, Jianchun Chen, Pramod Rao, Timo Teufel, Linjie Lyu et al.
Robotics
ๆฅ็ๆ่ฆ
We introduce OLATverse, a large-scale dataset comprising around 9M images of
765 real-world objects, captured from multiple viewpoints under a diverse set
of precisely controlled lighting conditions. While recent advances in
object-centric inverse rendering, novel view synthesis and relighting have
shown promising results, most techniques still heavily rely on the synthetic
datasets for training and small-scale real-world datasets for benchmarking,
which limits their realism and generalization. To address this gap, OLATverse
offers two key advantages over existing datasets: large-scale coverage of real
objects and high-fidelity appearance under precisely controlled illuminations.
Specifically, OLATverse contains 765 common and uncommon real-world objects,
spanning a wide range of material categories. Each object is captured using 35
DSLR cameras and 331 individually controlled light sources, enabling the
simulation of diverse illumination conditions. In addition, for each object, we
provide well-calibrated camera parameters, accurate object masks, photometric
surface normals, and diffuse albedo as auxiliary resources. We also construct
an extensive evaluation set, establishing the first comprehensive real-world
object-centric benchmark for inverse rendering and normal estimation. We
believe that OLATverse represents a pivotal step toward integrating the next
generation of inverse rendering and relighting methods with real-world data.
The full dataset, along with all post-processing workflows, will be publicly
released at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Taiga Yamane, Satoshi Suzuki, Ryo Masumura, Shotaro Tora
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-view action recognition aims to recognize human actions using multiple
camera views and deals with occlusion caused by obstacles or crowds. In this
task, cooperation among views, which generates a joint representation by
combining multiple views, is vital. Previous studies have explored promising
cooperation methods for improving performance. However, since their methods
focus only on the task setting of recognizing a single action from an entire
video, they are not applicable to the recently popular spatio-temporal action
recognition~(STAR) setting, in which each person's action is recognized
sequentially. To address this problem, this paper proposes a multi-view action
recognition method for the STAR setting, called MVAFormer. In MVAFormer, we
introduce a novel transformer-based cooperation module among views. In contrast
to previous studies, which utilize embedding vectors with lost spatial
information, our module utilizes the feature map for effective cooperation in
the STAR setting, which preserves the spatial information. Furthermore, in our
module, we divide the self-attention for the same and different views to model
the relationship between multiple views effectively. The results of experiments
using a newly collected dataset demonstrate that MVAFormer outperforms the
comparison baselines by approximately $4.4$ points on the F-measure.
๐
2025-11-04
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Chuhan Jiao, Zhiming Hu, Andreas Bulling
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Mobile eye tracking plays a vital role in capturing human visual attention
across both real-world and extended reality (XR) environments, making it an
essential tool for applications ranging from behavioural research to
human-computer interaction. However, missing values due to blinks, pupil
detection errors, or illumination changes pose significant challenges for
further gaze data analysis. To address this challenge, we introduce HAGI++ - a
multi-modal diffusion-based approach for gaze data imputation that, for the
first time, uses the integrated head orientation sensors to exploit the
inherent correlation between head and eye movements. HAGI++ employs a
transformer-based diffusion model to learn cross-modal dependencies between eye
and head representations and can be readily extended to incorporate additional
body movements. Extensive evaluations on the large-scale Nymeria, Ego-Exo4D,
and HOT3D datasets demonstrate that HAGI++ consistently outperforms
conventional interpolation methods and deep learning-based time-series
imputation baselines in gaze imputation. Furthermore, statistical analyses
confirm that HAGI++ produces gaze velocity distributions that closely match
actual human gaze behaviour, ensuring more realistic gaze imputations.
Moreover, by incorporating wrist motion captured from commercial wearable
devices, HAGI++ surpasses prior methods that rely on full-body motion capture
in the extreme case of 100% missing gaze data (pure gaze generation). Our
method paves the way for more complete and accurate eye gaze recordings in
real-world settings and has significant potential for enhancing gaze-based
analysis and interaction across various application domains.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Teerapong Panboonyuen
Machine Learning
ๆฅ็ๆ่ฆ
Satellite image inpainting is a crucial task in remote sensing, where
accurately restoring missing or occluded regions is essential for robust image
analysis. In this paper, we propose KAO, a novel framework that utilizes
Kernel-Adaptive Optimization within diffusion models for satellite image
inpainting. KAO is specifically designed to address the challenges posed by
very high-resolution (VHR) satellite datasets, such as DeepGlobe and the
Massachusetts Roads Dataset. Unlike existing methods that rely on
preconditioned models requiring extensive retraining or postconditioned models
with significant computational overhead, KAO introduces a Latent Space
Conditioning approach, optimizing a compact latent space to achieve efficient
and accurate inpainting. Furthermore, we incorporate Explicit Propagation into
the diffusion process, facilitating forward-backward fusion, which improves the
stability and precision of the method. Experimental results demonstrate that
KAO sets a new benchmark for VHR satellite image restoration, providing a
scalable, high-performance solution that balances the efficiency of
preconditioned models with the flexibility of postconditioned models.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nicolas Schuler, Lea Dewald, Nick Baldig, Jรผrgen Graf
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Video Understanding, Scene Interpretation and Commonsense Reasoning are
highly challenging tasks enabling the interpretation of visual information,
allowing agents to perceive, interact with and make rational decisions in its
environment. Large Language Models (LLMs) and Visual Language Models (VLMs)
have shown remarkable advancements in these areas in recent years, enabling
domain-specific applications as well as zero-shot open vocabulary tasks,
combining multiple domains. However, the required computational complexity
poses challenges for their application on edge devices and in the context of
Mobile Robotics, especially considering the trade-off between accuracy and
inference time. In this paper, we investigate the capabilities of
state-of-the-art VLMs for the task of Scene Interpretation and Action
Recognition, with special regard to small VLMs capable of being deployed to
edge devices in the context of Mobile Robotics. The proposed pipeline is
evaluated on a diverse dataset consisting of various real-world cityscape,
on-campus and indoor scenarios. The experimental evaluation discusses the
potential of these small models on edge devices, with particular emphasis on
challenges, weaknesses, inherent model biases and the application of the gained
information. Supplementary material is provided via the following repository:
https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
๐
2025-11-04
๐ 32 pages, 17 figures, published in Journal of Soun
๐ฅ Marios Impraimakis
ๆฅ็ๆ่ฆ
The capability of a novel Kullback-Leibler divergence method is examined
herein within the Kalman filter framework to select the input-parameter-state
estimation execution with the most plausible results. This identification
suffers from the uncertainty related to obtaining different results from
different initial parameter set guesses, and the examined approach uses the
information gained from the data in going from the prior to the posterior
distribution to address the issue. Firstly, the Kalman filter is performed for
a number of different initial parameter sets providing the system
input-parameter-state estimation. Secondly, the resulting posterior
distributions are compared simultaneously to the initial prior distributions
using the Kullback-Leibler divergence. Finally, the identification with the
least Kullback-Leibler divergence is selected as the one with the most
plausible results. Importantly, the method is shown to select the better
performed identification in linear, nonlinear, and limited information
applications, providing a powerful tool for system monitoring.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Garen Boyadjian, Cyrille Pierre, Johann Laconte, Riccardo Bertoglio
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Precise semantic segmentation of crops and weeds is necessary for
agricultural weeding robots. However, training deep learning models requires
large annotated datasets, which are costly to obtain in real fields. Synthetic
data can reduce this burden, but the gap between simulated and real images
remains a challenge. In this paper, we present a pipeline for procedural
generation of synthetic crop-weed images using Blender, producing annotated
datasets under diverse conditions of plant growth, weed density, lighting, and
camera angle. We benchmark several state-of-the-art segmentation models on
synthetic and real datasets and analyze their cross-domain generalization. Our
results show that training on synthetic images leads to a sim-to-real gap of
10%, surpassing previous state-of-the-art methods. Moreover, synthetic data
demonstrates good generalization properties, outperforming real datasets in
cross-domain scenarios. These findings highlight the potential of synthetic
agricultural datasets and support hybrid strategies for more efficient model
training.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Duo Xu, Hao Cheng, Xin Lin, Zhen Xie, Hao Wang
MultimodalNatural Language ProcessingComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Complex chart understanding tasks demand advanced visual recognition and
reasoning capabilities from multimodal large language models (MLLMs). However,
current research provides limited coverage of complex chart scenarios and
computation-intensive reasoning tasks prevalent in real-world applications.
This study proposes an automated multi-stage code-driven pipeline for
systematically generating visual reasoning datasets to address these
limitations. The pipeline integrates retrieval-augmented generation (RAG) to
retrieve professional chart templates and employs chain-of-thought (CoT)
strategies to generate reasoning codes that simulate real data distributions,
thereby driving chart rendering and question-related statistical computations.
Through model-based evaluation, the pipeline enhances chart diversity and data
quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and
multi-step dataset containing 38K charts and 142K Q&A pairs for training, along
with 2,871 high-quality evaluation samples for enabling practical performance
assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL)
experiments demonstrate that our dataset significantly improves reasoning
capabilities and cross-domain generalization performance, enabling smaller
models to achieve performance comparable to larger-scale models in complex
chart comprehension.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenyang Wei, Yang yang, Xixi Jia, Xiangchu Feng, Weiwei Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
We present IllumFlow, a novel framework that synergizes conditional Rectified
Flow (CRF) with Retinex theory for low-light image enhancement (LLIE). Our
model addresses low-light enhancement through separate optimization of
illumination and reflectance components, effectively handling both lighting
variations and noise. Specifically, we first decompose an input image into
reflectance and illumination components following Retinex theory. To model the
wide dynamic range of illumination variations in low-light images, we propose a
conditional rectified flow framework that represents illumination changes as a
continuous flow field. While complex noise primarily resides in the reflectance
component, we introduce a denoising network, enhanced by flow-derived data
augmentation, to remove reflectance noise and chromatic aberration while
preserving color fidelity. IllumFlow enables precise illumination adaptation
across lighting conditions while naturally supporting customizable brightness
enhancement. Extensive experiments on low-light enhancement and exposure
correction demonstrate superior quantitative and qualitative performance over
existing methods.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Arya Shah, Vaibhav Tripathi
Natural Language Processing
ๆฅ็ๆ่ฆ
Cats and humans differ in ocular anatomy. Most notably, Felis Catus (domestic
cats) have vertically elongated pupils linked to ambush predation; yet, how
such specializations manifest in downstream visual representations remains
incompletely understood. We present a unified, frozen-encoder benchmark that
quantifies feline-human cross-species representational alignment in the wild,
across convolutional networks, supervised Vision Transformers, windowed
transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel
Alignment (linear and RBF) and Representational Similarity Analysis, with
additional distributional and stability tests reported in the paper. Across
models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF
$\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$),
peaking at early blocks, indicating that token-level self-supervision induces
early-stage features that bridge species-specific statistics. Supervised ViTs
are competitive on CKA yet show weaker geometric correspondence than DINO
(e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at
block14), revealing depth-dependent divergences between similarity and
representational geometry. CNNs remain strong baselines but below plain ViTs on
alignment, and windowed transformers underperform plain ViTs, implicating
architectural inductive biases in cross-species alignment. Results indicate
that self-supervision coupled with ViT inductive biases yields representational
geometries that more closely align feline and human visual systems than widely
used CNNs and windowed Transformers, providing testable neuroscientific
hypotheses about where and how cross-species visual computations converge. We
release our code and dataset for reference and reproducibility.
๐
2025-11-04
๐ ArXiv Preprint (eess.IV)
๐ฅ Yalda Zafari, Hongyi Pan, Gorkem Durak, Ulas Bagci, Essam A. Rashed et al.
Computer Vision
ๆฅ็ๆ่ฆ
The development of clinically reliable artificial intelligence (AI) systems
for mammography is hindered by profound heterogeneity in data quality, metadata
standards, and population distributions across public datasets. This
heterogeneity introduces dataset-specific biases that severely compromise the
generalizability of the model, a fundamental barrier to clinical deployment. We
present MammoClean, a public framework for standardization and bias
quantification in mammography datasets. MammoClean standardizes case selection,
image processing (including laterality and intensity correction), and unifies
metadata into a consistent multi-view structure. We provide a comprehensive
review of breast anatomy, imaging characteristics, and public mammography
datasets to systematically identify key sources of bias. Applying MammoClean to
three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify
substantial distributional shifts in breast density and abnormality prevalence.
Critically, we demonstrate the direct impact of data corruption: AI models
trained on corrupted datasets exhibit significant performance degradation
compared to their curated counterparts. By using MammoClean to identify and
mitigate bias sources, researchers can construct unified multi-dataset training
corpora that enable development of robust models with superior cross-domain
generalization. MammoClean provides an essential, reproducible pipeline for
bias-aware AI development in mammography, facilitating fairer comparisons and
advancing the creation of safe, effective systems that perform equitably across
diverse patient populations and clinical settings. The open-source code is
publicly available from: https://github.com/Minds-R-Lab/MammoClean.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kuo-Liang Chung, Ting-Chung Tang
ๆฅ็ๆ่ฆ
Color consistency correction for color point clouds is a fundamental yet
important task in 3D rendering and compression applications. In the past, most
previous color correction methods aimed at correcting color for color images.
The purpose of this paper is to propose a grouping-based hybrid color
correction algorithm for color point clouds. Our algorithm begins by estimating
the overlapping rate between the aligned source and target point clouds, and
then adaptively partitions the target points into two groups, namely the close
proximity group Gcl and the moderate proximity group Gmod, or three groups,
namely Gcl, Gmod, and the distant proximity group Gdist, when the estimated
overlapping rate is low or high, respectively. To correct color for target
points in Gcl, a K-nearest neighbors based bilateral interpolation (KBI) method
is proposed. To correct color for target points in Gmod, a joint KBI and the
histogram equalization (JKHE) method is proposed. For target points in Gdist, a
histogram equalization (HE) method is proposed for color correction. Finally,
we discuss the grouping-effect free property and the ablation study in our
algorithm. The desired color consistency correction benefit of our algorithm
has been justified through 1086 testing color point cloud pairs against the
state-of-the-art methods. The C++ source code of our algorithm can be accessed
from the website: https://github.com/ivpml84079/Point-cloud-color-correction.
๐
2025-11-04
๐ Accepted for publication at IEEE International Con
๐ฅ Leon Schwarzer, Matthias Zeller, Daniel Casado Herraez, Simon Dierl, Michael Heidingsfeld et al.
Robotics
ๆฅ็ๆ่ฆ
Moving object segmentation is a crucial task for safe and reliable autonomous
mobile systems like self-driving cars, improving the reliability and robustness
of subsequent tasks like SLAM or path planning. While the segmentation of
camera or LiDAR data is widely researched and achieves great results, it often
introduces an increased latency by requiring the accumulation of temporal
sequences to gain the necessary temporal context. Radar sensors overcome this
problem with their ability to provide a direct measurement of a point's Doppler
velocity, which can be exploited for single-scan moving object segmentation.
However, radar point clouds are often sparse and noisy, making data annotation
for use in supervised learning very tedious, time-consuming, and
cost-intensive. To overcome this problem, we address the task of
self-supervised moving object segmentation of sparse and noisy radar point
clouds. We follow a two-step approach of contrastive self-supervised
representation learning with subsequent supervised fine-tuning using limited
amounts of annotated data. We propose a novel clustering-based contrastive loss
function with cluster refinement based on dynamic points removal to pretrain
the network to produce motion-aware representations of the radar data. Our
method improves label efficiency after fine-tuning, effectively boosting
state-of-the-art performance by self-supervised pretraining.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiahe Song, Chuang Wang, Bowen Jiang, Yinfan Wang, Hao Zheng et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large-scale chemical reaction datasets are crucial for AI research in
chemistry. However, existing chemical reaction data often exist as images
within papers, making them not machine-readable and unusable for training
machine learning models. In response to this challenge, we propose the
RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP).
Our framework reformulates the traditional coordinate prediction driven parsing
process into an image captioning problem, which Large Vision-Language Models
(LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as
Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector,
MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the
input image. This turns the downstream parsing into a natural-language
description problem. Extensive experiments show that the BIVP strategy
significantly improves structural extraction quality while simplifying model
design. We further construct the RxnCaption-11k dataset, an order of magnitude
larger than prior real-world literature benchmarks, with a balanced test subset
across four layout archetypes. Experiments demonstrate that RxnCaption-VL
achieves state-of-the-art performance on multiple metrics. We believe our
method, dataset, and models will advance structured information extraction from
chemical literature and catalyze broader AI applications in chemistry. We will
release data, models, and code on GitHub.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In human cognition, there exist numerous thought processes that are tacit and
beyond verbal expression, enabling us to understand and interact with the world
in multiple ways. However, contemporary Vision-Language Models (VLMs) remain
constrained to reasoning within the discrete and rigid space of linguistic
tokens, thereby bottlenecking the rich, high-dimensional nature of visual
perception. To bridge this gap, we propose CoCoVa (Chain of Continuous
Vision-Language Thought), a novel framework for vision-language model that
leverages continuous cross-modal reasoning for diverse vision-language tasks.
The core of CoCoVa is an iterative reasoning cycle, where a novel Latent
Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a
chain of latent thought vectors through cross-modal fusion. To focus this
process, a token selection mechanism dynamically identifies salient visual
regions, mimicking attentional focus. To ensure these latent thoughts remain
grounded, we train the model with a multi-task objective that combines
contrastive learning and diffusion-based reconstruction, enforcing alignment
between latent representations and both visual and textual modalities.
Evaluations show CoCoVa improves accuracy and token efficiency over strong
baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B
models on almost all benchmarks. When scaled to 7B LLM backbones, it remains
competitive with state-of-the-art models. Qualitative analysis validates that
learned latent space captures interpretable and structured reasoning patterns,
highlighting the potential of CoCoVa to bridge the representational gap between
discrete language processing and the continuous nature of visual understanding.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiankai Tang, Tao Zhang, Jia Li, Yiru Zhang, Mingyu Zhang et al.
ๆฅ็ๆ่ฆ
Portable physiological monitoring is essential for early detection and
management of cardiovascular disease, but current methods often require
specialized equipment that limits accessibility or impose impractical postures
that patients cannot maintain. Video-based photoplethysmography on smartphones
offers a convenient noninvasive alternative, yet it still faces reliability
challenges caused by motion artifacts, lighting variations, and single-view
constraints. Few studies have demonstrated reliable application to
cardiovascular patients, and no widely used open datasets exist for
cross-device accuracy. To address these limitations, we introduce the M3PD
dataset, the first publicly available dual-view mobile photoplethysmography
dataset, comprising synchronized facial and fingertip videos captured
simultaneously via front and rear smartphone cameras from 60 participants
(including 47 cardiovascular patients). Building on this dual-view setting, we
further propose F3Mamba, which fuses the facial and fingertip views through
Mamba-based temporal modeling. The model reduces heart-rate error by 21.9 to
30.2 percent over existing single-view baselines while improving robustness in
challenging real-world scenarios. Data and code:
https://github.com/Health-HCI-Group/F3Mamba.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kun Zou, Yongheng Xu, Jianxing Yu, Yan Pan, Jian Yin et al.
ๆฅ็ๆ่ฆ
Out-of-distribution (OOD) detection is paramount to ensuring the reliability
and robustness of learning models in real-world applications. Existing post-hoc
OOD detection methods detect OOD samples by leveraging their features and
logits information without retraining. However, they often overlook the
inherent correlation between features and logits, which is crucial for
effective OOD detection. To address this limitation, we propose Global-Aware
Feature Decoupling with Confidence Calibration (GAFD-CC). GAFD-CC aims to
refine decision boundaries and increase discriminative performance. Firstly, it
performs global-aware feature decoupling guided by classification weights. This
involves aligning features with the direction of global classification weights
to decouple them. From this, GAFD-CC extracts two types of critical
information: positively correlated features that promote in-distribution
(ID)/OOD boundary refinement and negatively correlated features that suppress
false positives and tighten these boundaries. Secondly, it adaptively fuses
these decoupled features with multi-scale logit-based confidence for
comprehensive and robust OOD detection. Extensive experiments on large-scale
benchmarks demonstrate GAFD-CC's competitive performance and strong
generalization ability compared to those of state-of-the-art methods.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Shaohan Li, Yunpeng Shi, Gilad Lerman
Computer Vision
ๆฅ็ๆ่ฆ
We introduce Cycle-Sync, a robust and global framework for estimating camera
poses (both rotations and locations). Our core innovation is a location solver
that adapts message-passing least squares (MPLS) -- originally developed for
group synchronization -- to camera location estimation. We modify MPLS to
emphasize cycle-consistent information, redefine cycle consistencies using
estimated distances from previous iterations, and incorporate a Welsch-type
robust loss. We establish the strongest known deterministic exact-recovery
guarantee for camera location estimation, showing that cycle consistency alone
-- without access to inter-camera distances -- suffices to achieve the lowest
sample complexity currently known. To further enhance robustness, we introduce
a plug-and-play outlier rejection module inspired by robust subspace recovery,
and we fully integrate cycle consistency into MPLS for rotation
synchronization. Our global approach avoids the need for bundle adjustment.
Experiments on synthetic and real datasets show that Cycle-Sync consistently
outperforms leading pose estimators, including full structure-from-motion
pipelines with bundle adjustment.
๐
2025-11-04
๐ 6 pages. This version includes minor lstlisting co
๐ฅ Taisuke Noguchi, Takuya Azumi
Machine LearningRoboticsComputer Vision
ๆฅ็ๆ่ฆ
The field of autonomous driving technology is rapidly advancing, with deep
learning being a key component. Particularly in the field of sensing, 3D point
cloud data collected by LiDAR is utilized to run deep neural network models for
3D object detection. However, these state-of-the-art models are complex,
leading to longer processing times and increased power consumption on edge
devices. The objective of this study is to address these issues by leveraging
Split Computing, a distributed machine learning inference method. Split
Computing aims to lessen the computational burden on edge devices, thereby
reducing processing time and power consumption. Furthermore, it minimizes the
risk of data breaches by only transmitting intermediate data from the deep
neural network model. Experimental results show that splitting after
voxelization reduces the inference time by 70.8% and the edge device execution
time by 90.0%. When splitting within the network, the inference time is reduced
by up to 57.1%, and the edge device execution time is reduced by up to 69.5%.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cuong Tuan Nguyen, Ngoc Tuan Nguyen, Triet Hoang Minh Dao, Huy Minh Nhat, Huy Truong Dinh
Machine Learning
ๆฅ็ๆ่ฆ
We propose a Graph Neural Network (GNN)-based approach for Handwritten
Mathematical Expression (HME) recognition by modeling HMEs as graphs, where
nodes represent symbols and edges capture spatial dependencies. A deep BLSTM
network is used for symbol segmentation, recognition, and spatial relation
classification, forming an initial primitive graph. A 2D-CFG parser then
generates all possible spatial relations, while the GNN-based link prediction
model refines the structure by removing unnecessary connections, ultimately
forming the Symbol Label Graph. Experimental results demonstrate the
effectiveness of our approach, showing promising performance in HME structure
recognition.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework
that enhances the reasoning capabilities of multimodal large language models
(MLLMs) by teaching them when and how to think. Existing approaches are limited
by outcome-only supervision, which rewards correct answers without ensuring
sound reasoning, and by uniform thinking strategies, which often lead to
overthinking on simple tasks and underthinking on complex ones. SAIL-RL
addresses these challenges with a dual reward system: the Thinking Reward,
which evaluates reasoning quality through factual grounding, logical coherence,
and answer consistency, and the Judging Reward, which adaptively determines
whether deep reasoning or direct answering is appropriate. Experiments on the
state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal
understanding benchmarks at both 4B and 8B scales, achieving competitive
performance against commercial closed-source models such as GPT-4o, and
substantially reduces hallucinations, establishing it as a principled framework
for building more reliable and adaptive MLLMs. The code will be available at
https://github.com/BytedanceDouyinContent/SAIL-RL.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Giorgos Sfikas, Konstantina Nikolaidou, Foteini Papadopoulou, George Retsinas, Anastasios L. Kesidis
Computer Vision
ๆฅ็ๆ่ฆ
Object pose estimation is a task that is of central importance in 3D Computer
Vision. Given a target image and a canonical pose, a single point estimate may
very often be sufficient; however, a probabilistic pose output is related to a
number of benefits when pose is not unambiguous due to sensor and projection
constraints or inherent object symmetries. With this paper, we explore the
usefulness of using the well-known Euler angles parameterisation as a basis for
a Normalizing Flows model for pose estimation. Isomorphic to spatial rotation,
3D pose has been parameterized in a number of ways, either in or out of the
context of parameter estimation. We explore the idea that Euler angles, despite
their shortcomings, may lead to useful models in a number of aspects, compared
to a model built on a more complex parameterisation.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yucheng Song, Yifan Ge, Junhao Li, Zhining Liao, Zhifang Liao
Natural Language Processing
ๆฅ็ๆ่ฆ
Medical Report Generation (MRG) is a key part of modern medical diagnostics,
as it automatically generates reports from radiological images to reduce
radiologists' burden. However, reliable MRG models for lesion description face
three main challenges: insufficient domain knowledge understanding, poor
text-visual entity embedding alignment, and spurious correlations from
cross-modal biases. Previous work only addresses single challenges, while this
paper tackles all three via a novel hierarchical task decomposition approach,
proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into
low-, mid-, and high-level tasks: 1) Low-level: align medical entity features
with spatial locations to enhance domain knowledge for visual encoders; 2)
Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling
(images) to boost cross-modal alignment via mutual guidance; 3) High-level: a
cross-modal causal intervention module (via front-door intervention) to reduce
confounders and improve interpretability. Extensive experiments confirm
HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA)
MRG methods. Code will be made public upon paper acceptance.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Li, Daiwei Lu, Jesse d'Almeida, Dilara Isik, Ehsan Khodapanah Aghdam et al.
Robotics
ๆฅ็ๆ่ฆ
Monocular depth estimation (MDE) is a critical task to guide autonomous
medical robots. However, obtaining absolute (metric) depth from an endoscopy
camera in surgical scenes is difficult, which limits supervised learning of
depth on real endoscopic images. Current image-level unsupervised domain
adaptation methods translate synthetic images with known depth maps into the
style of real endoscopic frames and train depth networks using these translated
images with their corresponding depth maps. However a domain gap often remains
between real and translated synthetic images. In this paper, we present a
latent feature alignment method to improve absolute depth estimation by
reducing this domain gap in the context of endoscopic videos of the central
airway. Our methods are agnostic to the image translation process and focus on
the depth estimation itself. Specifically, the depth network takes translated
synthetic and real endoscopic frames as input and learns latent
domain-invariant features via adversarial learning and directional feature
consistency. The evaluation is conducted on endoscopic videos of central airway
phantoms with manually aligned absolute depth maps. Compared to
state-of-the-art MDE methods, our approach achieves superior performance on
both absolute and relative depth metrics, and consistently improves results
across various backbones and pretrained weights. Our code is available at
https://github.com/MedICL-VU/MDE.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Delin Ma, Menghui Zhou, Jun Qi, Yun Yang, Po Yang
Multimodal
ๆฅ็ๆ่ฆ
Alzheimer's disease (AD) is the most prevalent form of dementia, and its
early diagnosis is essential for slowing disease progression. Recent studies on
multimodal neuroimaging fusion using MRI and PET have achieved promising
results by integrating multi-scale complementary features. However, most
existing approaches primarily emphasize cross-modal complementarity while
overlooking the diagnostic importance of modality-specific features. In
addition, the inherent distributional differences between modalities often lead
to biased and noisy representations, degrading classification performance. To
address these challenges, we propose a Collaborative Attention and
Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The
proposed model introduces a learnable parameter representation (LPR) block to
compensate for missing modality information, followed by a shared encoder and
modality-independent encoders to preserve both shared and specific
representations. Furthermore, a consistency-guided mechanism is employed to
explicitly align the latent distributions across modalities. Experimental
results on the ADNI dataset demonstrate that our method achieves superior
diagnostic performance compared with existing fusion strategies.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yiqin Zhao, Tian Guo
ๆฅ็ๆ่ฆ
Mobile sensing systems have long faced a fundamental trade-off between
sensing quality and efficiency due to constraints in computation, power, and
other limitations. Sparse sensing, which aims to acquire and process only a
subset of sensor data, has been a key strategy for maintaining performance
under such constraints. However, existing sparse sensing methods often suffer
from reduced accuracy, as missing information across space and time introduces
uncertainty into many sensing systems. In this work, we investigate whether
foundation models can change the landscape of mobile sparse sensing. Using
real-world mobile AR data, our evaluations demonstrate that foundation models
offer significant improvements in geometry-aware image warping, a central
technique for enabling accurate reuse of cross-frame information. Furthermore,
our study demonstrates the scalability of foundation model-based sparse sensing
and shows its leading performance in 3D scene reconstruction. Collectively, our
study reveals critical aspects of the promises and the open challenges of
integrating foundation models into mobile sparse sensing systems.
๐
2025-11-04
๐ ArXiv Preprint (physics.med-ph)
๐ฅ Abuobaida M. Khair, Wenjing Jiang, Yousuf Babiker M. Osman, Wenjun Xia, Xiaopeng Ma
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This study presents a hybrid deep learning framework, the Vision Transformer
with Residual Feature Network (VRF-Net), for recovering high-resolution system
matrices in Magnetic Particle Imaging (MPI). MPI resolution often suffers from
downsampling and coil sensitivity variations. VRF-Net addresses these
challenges by combining transformer-based global attention with residual
convolutional refinement, enabling recovery of both large-scale structures and
fine details. To reflect realistic MPI conditions, the system matrix is
degraded using a dual-stage downsampling strategy. Training employed
paired-image super-resolution on the public Open MPI dataset and a simulated
dataset incorporating variable coil sensitivity profiles. For system matrix
recovery on the Open MPI dataset, VRF-Net achieved nRMSE = 0.403, pSNR = 39.08
dB, and SSIM = 0.835 at 2x scaling, and maintained strong performance even at
challenging scale 8x (pSNR = 31.06 dB, SSIM = 0.717). For the simulated
dataset, VRF-Net achieved nRMSE = 4.44, pSNR = 28.52 dB, and SSIM = 0.771 at 2x
scaling, with stable performance at higher scales. On average, it reduced nRMSE
by 88.2%, increased pSNR by 44.7%, and improved SSIM by 34.3% over
interpolation and CNN-based methods. In image reconstruction of Open MPI
phantoms, VRF-Net further reduced reconstruction error to nRMSE = 1.79 at 2x
scaling, while preserving structural fidelity (pSNR = 41.58 dB, SSIM = 0.960),
outperforming existing methods. These findings demonstrate that VRF-Net enables
sharper, artifact-free system matrix recovery and robust image reconstruction
across multiple scales, offering a promising direction for future in vivo
applications.
๐
2025-11-04
๐ 13 pages, IEEE Journal of Biomedical and Health In
๐ฅ Anders Austlid Taskรฉn, Thierry Judge, Erik Andreas Rye Berg, Jinyang Yu, Bjรธrnar Grenne et al.
Machine Learning
ๆฅ็ๆ่ฆ
Segmental longitudinal strain (SLS) of the left ventricle (LV) is an
important prognostic indicator for evaluating regional LV dysfunction, in
particular for diagnosing and managing myocardial ischemia. Current techniques
for strain estimation require significant manual intervention and expertise,
limiting their efficiency and making them too resource-intensive for monitoring
purposes. This study introduces the first automated pipeline, autoStrain, for
SLS estimation in transesophageal echocardiography (TEE) using deep learning
(DL) methods for motion estimation. We present a comparative analysis of two DL
approaches: TeeFlow, based on the RAFT optical flow model for dense
frame-to-frame predictions, and TeeTracker, based on the CoTracker point
trajectory model for sparse long-sequence predictions.
As ground truth motion data from real echocardiographic sequences are hardly
accessible, we took advantage of a unique simulation pipeline (SIMUS) to
generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with
ground truth myocardial motion to train and evaluate both models. Our
evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a
mean distance error in motion estimation of 0.65 mm on a synTEE test dataset.
Clinical validation on 16 patients further demonstrated that SLS estimation
with our autoStrain pipeline aligned with clinical references, achieving a mean
difference (95\% limits of agreement) of 1.09% (-8.90% to 11.09%).
Incorporation of simulated ischemia in the synTEE data improved the accuracy of
the models in quantifying abnormal deformation. Our findings indicate that
integrating AI-driven motion estimation with TEE can significantly enhance the
precision and efficiency of cardiac function assessment in clinical settings.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiajia Li, Keyi Zhu, Qianwen Zhang, Dong Chen, Qi Sun et al.
Computer Vision
ๆฅ็ๆ่ฆ
Strawberries are among the most economically significant fruits in the United
States, generating over $2 billion in annual farm-gate sales and accounting for
approximately 13% of the total fruit production value. Plant phenotyping plays
a vital role in selecting superior cultivars by characterizing plant traits
such as morphology, canopy structure, and growth dynamics. However, traditional
plant phenotyping methods are time-consuming, labor-intensive, and often
destructive. Recently, neural rendering techniques, notably Neural Radiance
Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful
frameworks for high-fidelity 3D reconstruction. By capturing a sequence of
multi-view images or videos around a target plant, these methods enable
non-destructive reconstruction of complex plant architectures. Despite their
promise, most current applications of 3DGS in agricultural domains reconstruct
the entire scene, including background elements, which introduces noise,
increases computational costs, and complicates downstream trait analysis. To
address this limitation, we propose a novel object-centric 3D reconstruction
framework incorporating a preprocessing pipeline that leverages the Segment
Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean
strawberry plant reconstructions. This approach produces more accurate
geometric representations while substantially reducing computational time. With
a background-free reconstruction, our algorithm can automatically estimate
important plant traits, such as plant height and canopy width, using DBSCAN
clustering and Principal Component Analysis (PCA). Experimental results show
that our method outperforms conventional pipelines in both accuracy and
efficiency, offering a scalable and non-destructive solution for strawberry
plant phenotyping.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhengjie Zhang, Xiaoxie Mao, Qihao Guo, Shaoting Zhang, Qi Huang et al.
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Background: Alzheimer's disease (AD) diagnosis heavily relies on amyloid-beta
positron emission tomography (Abeta-PET), which is limited by high cost and
limited accessibility. This study explores whether Abeta-PET spatial patterns
can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We
collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566
participants. A language-enhanced generative model, driven by a large language
model (LLM) and multimodal information fusion, was developed to synthesize PET
images. Synthesized images were evaluated for image quality, diagnostic
consistency, and clinical applicability within a fully automated diagnostic
pipeline. Findings: The synthetic PET images closely resemble real PET scans in
both structural details (SSIM = 0.920 +/- 0.003) and regional patterns
(Pearson's r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show
high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic
PET, we developed a fully automatic AD diagnostic pipeline integrating PET
synthesis and classification. The synthetic PET-based model (AUC = 0.78)
outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while
combining synthetic PET and BBMs further improved performance (AUC = 0.79).
Ablation analysis supports the advantages of LLM integration and prompt
engineering. Interpretation: Our language-enhanced generative model synthesizes
realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial
pattern assessment and improving the diagnostic workflow for Alzheimer's
disease.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kevin Valencia, Thilina Balasooriya, Xihaier Luo, Shinjae Yoo, David Keetae Park
Multimodal
ๆฅ็ๆ่ฆ
Multimodal spatiotemporal learning on real-world experimental data is
constrained by two challenges: within-modality measurements are sparse,
irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of
available modalities varies across space and time, shrinking the usable record
unless models can adapt to arbitrary subsets at train and test time. We propose
OmniField, a continuity-aware framework that learns a continuous neural field
conditioned on available modalities and iteratively fuses cross-modal context.
A multimodal crosstalk block architecture paired with iterative cross-modal
refinement aligns signals prior to the decoder, enabling unified
reconstruction, interpolation, forecasting, and cross-modal prediction without
gridding or surrogate preprocessing. Extensive evaluations show that OmniField
consistently outperforms eight strong multimodal spatiotemporal baselines.
Under heavy simulated sensor noise, performance remains close to clean-input
levels, highlighting robustness to corrupted measurements.
๐
2025-11-04
๐ This paper was accepted by IEEE BIBM 2025 conferen
๐ฅ Jiawen Liu, Yuanbo Zeng, Jiaming Liang, Yizhen Yang, Yiheng Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate detection of retinal vessels plays a critical role in reflecting a
wide range of health status indicators in the clinical diagnosis of ocular
diseases. Recently, advances in deep learning have led to a surge in retinal
vessel segmentation methods, which have significantly contributed to the
quantitative analysis of vascular morphology. However, retinal vasculature
differs significantly from conventional segmentation targets in that it
consists of extremely thin and branching structures, whose global morphology
varies greatly across images. These characteristics continue to pose challenges
to segmentation precision and robustness. To address these issues, we propose
MM-UNet, a novel architecture tailored for efficient retinal vessel
segmentation. The model incorporates Morph Mamba Convolution layers, which
replace pointwise convolutions to enhance branching topological perception
through morph, state-aware feature sampling. Additionally, Reverse Selective
State Guidance modules integrate reverse guidance theory with state-space
modeling to improve geometric boundary awareness and decoding efficiency.
Extensive experiments conducted on two public retinal vessel segmentation
datasets demonstrate the superior performance of the proposed method in
segmentation accuracy. Compared to the existing approaches, MM-UNet achieves
F1-score gains of 1.64 $\%$ on DRIVE and 1.25 $\%$ on STARE, demonstrating its
effectiveness and advancement. The project code is public via
https://github.com/liujiawen-jpg/MM-UNet.
๐
2025-11-04
๐ ICCV 2025
๐ฅ Jinhwan Seo, Yoonki Cho, Junhyug Noh, Sung-eui Yoon
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
In this technical report, we introduce a framework to address Grounded Video
Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The
GVQA task demands robust multimodal models capable of complex reasoning over
video content, grounding the resulting answers visually, and tracking the
referenced objects temporally. To achieve this capability, our proposed
approach decomposes the GVQA task into a three-stage pipeline: (1) Video
Reasoning \& QA, (2) Spatio-temporal Grounding and (3) Tracking. Our key
contribution is the introduction of a trigger moment, derived from our proposed
CORTEX prompt, which pinpoints the single most visible frame of a target object
to serve as a robust anchor for grounding and tracking. To this end, we achieve
the HOTA score of 0.4968, which marks a significant improvement over the
previous year's winning score of 0.2704 on GVQA task.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mehdi Sefidgar Dilmaghani, Waseem Shariff, Cian Ryan, Joe Lemley, Peter Corcoran
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Understanding and mitigating flicker effects caused by rapid variations in
light intensity is critical for enhancing the performance of event cameras in
diverse environments. This paper introduces an innovative autonomous mechanism
for tuning the biases of event cameras, effectively addressing flicker across a
wide frequency range -25 Hz to 500 Hz. Unlike traditional methods that rely on
additional hardware or software for flicker filtering, our approach leverages
the event cameras inherent bias settings. Utilizing a simple Convolutional
Neural Networks -CNNs, the system identifies instances of flicker in a spatial
space and dynamically adjusts specific biases to minimize its impact. The
efficacy of this autobiasing system was robustly tested using a face detector
framework under both well-lit and low-light conditions, as well as across
various frequencies. The results demonstrated significant improvements:
enhanced YOLO confidence metrics for face detection, and an increased
percentage of frames capturing detected faces. Moreover, the average gradient,
which serves as an indicator of flicker presence through edge detection,
decreased by 38.2 percent in well-lit conditions and by 53.6 percent in
low-light conditions. These findings underscore the potential of our approach
to significantly improve the functionality of event cameras in a range of
adverse lighting scenarios.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhicheng Wang, Junbiao Pang
ๆฅ็ๆ่ฆ
Accurate quantification of pavement crack width plays a pivotal role in
assessing structural integrity and guiding maintenance interventions. However,
achieving precise crack width measurements presents significant challenges due
to: (1) the complex, non-uniform morphology of crack boundaries, which limits
the efficacy of conventional approaches, and (2) the demand for rapid
measurement capabilities from arbitrary pixel locations to facilitate
comprehensive pavement condition evaluation. To overcome these limitations,
this study introduces a cascaded framework integrating Principal Component
Analysis (PCA) and Robust PCA (RPCA) for efficient crack width extraction from
digital images. The proposed methodology comprises three sequential stages: (1)
initial crack segmentation using established detection algorithms to generate a
binary representation, (2) determination of the primary orientation axis for
quasi-parallel cracks through PCA, and (3) extraction of the Main Propagation
Axis (MPA) for irregular crack geometries using RPCA. Comprehensive evaluations
were conducted across three publicly available datasets, demonstrating that the
proposed approach achieves superior performance in both computational
efficiency and measurement accuracy compared to existing state-of-the-art
techniques.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Huahua Lin, Xiaohao Cai, Mark Nixon, James M. Mulqueeney, Thomas H. G. Ezard
Computer Vision
ๆฅ็ๆ่ฆ
Planktonic foraminifera, marine protists characterized by their intricate
chambered shells, serve as valuable indicators of past and present
environmental conditions. Understanding their chamber growth trajectory
provides crucial insights into organismal development and ecological adaptation
under changing environments. However, automated tracing of chamber growth from
imaging data remains largely unexplored, with existing approaches relying
heavily on manual segmentation of each chamber, which is time-consuming and
subjective. In this study, we propose an end-to-end pipeline that integrates
instance segmentation, a computer vision technique not extensively explored in
foraminifera, with a dedicated chamber ordering algorithm to automatically
reconstruct three-dimensional growth trajectories from high-resolution computed
tomography scans. We quantitatively and qualitatively evaluate multiple
instance segmentation methods, each optimized for distinct spatial features of
the chambers, and examine their downstream influence on growth-order
reconstruction accuracy. Experimental results on expert-annotated datasets
demonstrate that the proposed pipeline substantially reduces manual effort
while maintaining biologically meaningful accuracy. Although segmentation
models exhibit under-segmentation in smaller chambers due to reduced voxel
fidelity and subtle inter-chamber connectivity, the chamber-ordering algorithm
remains robust, achieving consistent reconstruction of developmental
trajectories even under partial segmentation. This work provides the first
fully automated and reproducible pipeline for digital foraminiferal growth
analysis, establishing a foundation for large-scale, data-driven ecological
studies.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have shown strong capabilities but
remain limited to fixed modality pairs and require costly fine-tuning with
large aligned datasets. Building fully omni-capable models that can integrate
text, images, audio, and video remains impractical and lacks robust reasoning
support. In this paper, we propose an Agent-Omni framework that coordinates
existing foundation models through a master-agent system, enabling flexible
multimodal reasoning without retraining. The master agent interprets user
intent, delegates subtasks to modality-specific agents, and integrates their
outputs into coherent responses. Extensive experiments across text, image,
audio, video, and omni benchmarks show that Agent-Omni consistently achieves
state-of-the-art performance, particularly on tasks requiring complex
cross-modal reasoning. Its agent-based design enables seamless integration of
specialized foundation models, ensuring adaptability to diverse inputs while
maintaining transparency and interpretability. In addition, the framework is
modular and easily extensible, allowing future improvements as stronger models
become available.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Sham Kakade, Surbhi Goel
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Knowledge distillation is an efficient strategy to use data generated by
large "teacher" language models to train smaller capable "student" models, but
selecting the optimal teacher for a specific student-task combination requires
expensive trial-and-error. We propose a lightweight score called GRACE to
quantify how effective a teacher will be for post-training a student model.
GRACE measures distributional properties of the student's gradients without
access to a verifier, teacher logits, teacher internals, or test data. From an
information-theoretic perspective, GRACE connects to leave-one-out stability of
gradient-based algorithms, which controls the generalization performance of the
distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86%
Spearman correlation) with the performance of the distilled LLaMA and OLMo
students. In particular, training a student using the GRACE-selected teacher
can improve the performance by up to 7.4% over naively using the
best-performing teacher. Further, GRACE can provide guidance on crucial design
choices in distillation, including (1) the best temperature to use when
generating from the teacher, (2) the best teacher to use given a size
constraint, and (3) the best teacher to use within a specific model family.
Altogether, our findings demonstrate that GRACE can efficiently and effectively
identify a strongly compatible teacher for a given student and provide
fine-grained guidance on how to perform distillation.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, Matthew R. Gormley
Natural Language Processing
ๆฅ็ๆ่ฆ
As model context lengths continue to grow, concerns about whether models
effectively use the full context length have persisted. While several carefully
designed long-context evaluations have recently been released, these
evaluations tend to rely on retrieval from one or more sections of the context,
which allows nearly all of the context tokens to be disregarded as noise. This
represents only one type of task that might be performed with long context. We
introduce Oolong, a benchmark of long-context reasoning tasks that require
analyzing individual chunks of text on an atomic level, and then aggregating
these analyses to answer distributional questions. Oolong is separated into two
task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can
easily ablate components of the reasoning problem; and Oolong-real, a
downstream setting which requires reasoning over real-world conversational
data. Oolong requires models to reason over large quantities of examples, to
perform both classification and counting in-context, and to reason over
temporal and user relations. Even frontier models struggle on Oolong, with
GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy
on both splits at 128K. We release the data and evaluation harness for Oolong
to enable further development of models that can reason over large quantities
of text.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Typical search agents concatenate the entire interaction history into the LLM
context, preserving information integrity but producing long, noisy contexts,
resulting in high computation and memory costs. In contrast, using only the
current turn avoids this overhead but discards essential information. This
trade-off limits the scalability of search agents. To address this challenge,
we propose MemSearcher, an agent workflow that iteratively maintains a compact
memory and combines the current turn with it. At each turn, MemSearcher fuses
the user's question with the memory to generate reasoning traces, perform
search actions, and update memory to retain only information essential for
solving the task. This design stabilizes context length across multi-turn
interactions, improving efficiency without sacrificing accuracy. To optimize
this workflow, we introduce multi-context GRPO, an end-to-end RL framework that
jointly optimize reasoning, search strategies, and memory management of
MemSearcher Agents. Specifically, multi-context GRPO samples groups of
trajectories under different contexts and propagates trajectory-level
advantages across all conversations within them. Trained on the same dataset as
Search-R1, MemSearcher achieves significant improvements over strong baselines
on seven public benchmarks: +11% on Qwen2.5-3B-Instruct and +12% on
Qwen2.5-7B-Instruct relative average gains. Notably, the 3B-based MemSearcher
even outperforms 7B-based baselines, demonstrating that striking a balance
between information integrity and efficiency yields both higher accuracy and
lower computational overhead. The code and models will be publicly available at
https://github.com/icip-cas/MemSearcher
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mayank Jobanputra, Nils Philipp Walter, Maitrey Mehta, Blerta Veseli, Evan Parker Kelly Chapple et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We present a systematic study of subtraction in large language models (LLMs).
While prior benchmarks emphasize addition and multiplication, subtraction has
received comparatively little attention despite being structurally distinct as
a non-commutative operation. We evaluate eight pretrained LLMs spanning four
families on addition and subtraction problems. Our experiments reveal that
subtraction accuracy lags behind addition by a wide margin. We find that the
errors for ($a-b$) are concentrated in cases where ($a
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao et al.
Multimodal
ๆฅ็ๆ่ฆ
Code has emerged as a precise and executable medium for reasoning and action
in the agent era. Yet, progress has largely focused on language-centric tasks
such as program synthesis and debugging, leaving visual-centric coding
underexplored. Inspired by how humans reason over sketches, we advocate SVG
code as a compact, interpretable, and executable visual representation. We
introduce VCode, a benchmark that reframes multimodal understanding as code
generation: given an image, a model must produce SVG that preserves symbolic
meaning for downstream reasoning. VCode covers three domains - general
commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric
perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel
evaluation protocol in which a policy model answers questions over rendered
SVGs; correct answers indicate faithful symbolic preservation. Empirically,
frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap
between language-centric and visual-centric coding. To close this gap, we
introduce VCoder, an agentic framework that augments VLMs along two axes: (i)
Thinking with Revision, which iteratively analyzes discrepancies and refines
SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply
structured cues such as objects, shapes, and text beyond the model's intrinsic
capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities
score well overall yet remain limited in professional knowledge and 3D
reasoning. VCoder delivers a 12.3-point overall gain over the top-performing
Claude-4-Opus. Human studies show that both humans and VLMs perform worse on
rendered SVGs, their consistency reveals the promise of symbolic visual
representation. The benchmark and code are available at
https://github.com/CSU-JPG/VCode.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hung-Ting Chen, Xiang Liu, Shauli Ravfogel, Eunsol Choi
Multimodal
ๆฅ็ๆ่ฆ
Most text retrievers generate \emph{one} query vector to retrieve relevant
documents. Yet, the conditional distribution of relevant documents for the
query may be multimodal, e.g., representing different interpretations of the
query. We first quantify the limitations of existing retrievers. All retrievers
we evaluate struggle more as the distance between target document embeddings
grows. To address this limitation, we develop a new retriever architecture,
\emph{A}utoregressive \emph{M}ulti-\emph{E}mbedding \emph{R}etriever (AMER).
Our model autoregressively generates multiple query vectors, and all the
predicted query vectors are used to retrieve documents from the corpus. We show
that on the synthetic vectorized data, the proposed method could capture
multiple target distributions perfectly, showing 4x better performance than
single embedding model. We also fine-tune our model on real-world multi-answer
retrieval datasets and evaluate in-domain. AMER presents 4 and 21\% relative
gains over single-embedding baselines on two datasets we evaluate on.
Furthermore, we consistently observe larger gains on the subset of dataset
where the embeddings of the target documents are less similar to each other. We
demonstrate the potential of using a multi-query vector retriever and open up a
new direction for future work.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) exhibit complementary strengths across domains
and come with varying inference costs, motivating the design of multi-agent LLM
systems where specialized models collaborate efficiently. Existing approaches
predominantly rely on decentralized frameworks, which invoke multiple LLMs for
every input and thus lead to substantial and uncontrolled inference costs. In
this work, we introduce a centralized multi-LLM framework, where a controller
LLM selectively coordinates a pool of expert models in a cost-efficient and
cost-controllable manner. We formulate this coordination problem as
reinforcement learning with dual objectives: maximizing task performance while
minimizing the overall inference cost. In addition, we expect the multi-agent
system to have adapted behavior with different budget conditions during
inference. To this end, we propose CoRL, a reinforcement learning framework
that optimizes the performance cost trade-off in a controllable multi-budget
setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a
single system to surpass the best expert LLM under high-budget settings, while
maintaining strong performance in more economical low-budget modes,
highlighting the effectiveness of centralized coordination for scalable and
cost-efficient multi-agent LLM systems.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Amit Misra, Syed Waqas Zamir, Wassim Hamidouche, Inbal Becker-Reshef, Juan Lavista Ferres
Natural Language Processing
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) is diffusing globally at unprecedented speed,
but adoption remains uneven. Frontier Large Language Models (LLMs) are known to
perform poorly on low-resource languages due to data scarcity. We hypothesize
that this performance deficit reduces the utility of AI, thereby slowing
adoption in Low-Resource Language Countries (LRLCs). To test this, we use a
weighted regression model to isolate the language effect from socioeconomic and
demographic factors, finding that LRLCs have a share of AI users that is
approximately 20% lower relative to their baseline. These results indicate that
linguistic accessibility is a significant, independent barrier to equitable AI
diffusion.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Current evaluations of Large Language Model (LLM) agents primarily emphasize
task completion, often overlooking resource efficiency and adaptability. This
neglects a crucial capability: agents' ability to devise and adjust
cost-optimal plans in response to changing environments. To bridge this gap, we
introduce CostBench, a scalable, cost-centric benchmark designed to evaluate
agents' economic reasoning and replanning abilities. Situated in the
travel-planning domain, CostBench comprises tasks solvable via multiple
sequences of atomic and composite tools with diverse, customizable costs. It
also supports four types of dynamic blocking events, such as tool failures and
cost changes, to simulate real-world unpredictability and necessitate agents to
adapt in real time. Evaluating leading open-sourced and proprietary models on
CostBench reveals a substantial gap in cost-aware planning: agents frequently
fail to identify cost-optimal solutions in static settings, with even GPT-5
achieving less than 75% exact match rate on the hardest tasks, and performance
further dropping by around 40% under dynamic conditions. By diagnosing these
weaknesses, CostBench lays the groundwork for developing future agents that are
both economically rational and robust.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Doreen Osmelak, Koel Dutta Chowdhury, Uliana Sentsova, Cristina Espaรฑa-Bonet, Josef van Genabith
Natural Language Processing
ๆฅ็ๆ่ฆ
Translators often enrich texts with background details that make implicit
cultural meanings explicit for new audiences. This phenomenon, known as
pragmatic explicitation, has been widely discussed in translation theory but
rarely modeled computationally. We introduce PragExTra, the first multilingual
corpus and detection framework for pragmatic explicitation. The corpus covers
eight language pairs from TED-Multi and Europarl and includes additions such as
entity descriptions, measurement conversions, and translator remarks. We
identify candidate explicitation cases through null alignments and refined
using active learning with human annotation. Our results show that entity and
system-level explicitations are most frequent, and that active learning
improves classifier accuracy by 7-8 percentage points, achieving up to 0.88
accuracy and 0.82 F1 across languages. PragExTra establishes pragmatic
explicitation as a measurable, cross-linguistic phenomenon and takes a step
towards building culturally aware machine translation. Keywords: translation,
multilingualism, explicitation
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Tim R. Davidson, Adam Fourney, Saleema Amershi, Robert West, Eric Horvitz et al.
ๆฅ็ๆ่ฆ
The trajectory of AI development suggests that we will increasingly rely on
agent-based systems composed of independently developed agents with different
information, privileges, and tools. The success of these systems will
critically depend on effective collaboration among these heterogeneous agents,
even under partial observability. Despite intense interest, few empirical
studies have evaluated such agent-agent collaboration at scale. We propose a
collaborative maze-solving benchmark that (i) isolates collaborative
capabilities, (ii) modulates problem complexity, (iii) enables scalable
automated grading, and (iv) imposes no output-format constraints, preserving
ecological plausibility. Using this framework, we evaluate 32 leading open- and
closed-source models in solo, homogeneous, and heterogeneous pairings. Our
results reveal a "collaboration gap": models that perform well solo often
degrade substantially when required to collaborate. Collaboration can break
down dramatically; for instance, small distilled models that solve mazes well
alone may fail almost completely in certain pairings. We find that starting
with the stronger agent often improves outcomes, motivating a "relay inference"
approach where the stronger agent leads before handing off to the weaker one,
closing much of the gap. Our findings argue for (1) collaboration-aware
evaluation, (2) training strategies developed to enhance collaborative
capabilities, and (3) interaction design that reliably elicits agents' latent
skills, guidance that applies to AI-AI and human-AI collaboration.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohammadsajad Alipour, Mohammad Mohammadi Amiri
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly prevalent across diverse
applications. However, their enormous size limits storage and processing
capabilities to a few well-resourced stakeholders. As a result, most
applications rely on pre-trained LLMs, fine-tuned for specific tasks. However,
even storing the fine-tuned versions of these models remains a significant
challenge due to the wide range of tasks they address. Recently, studies show
that fine-tuning these models primarily affects a small fraction of parameters,
highlighting the need for more efficient storage of fine-tuned models. This
paper focuses on efficient storage of parameter updates in pre-trained models
after fine-tuning. To address this challenge, we leverage the observation that
fine-tuning updates are both low-rank and sparse, which can be utilized for
storage efficiency. However, using only low-rank approximation or
sparsification may discard critical singular components that enhance model
expressivity. We first observe that given the same memory budget, sparsified
low-rank approximations with larger ranks outperform standard low-rank
approximations with smaller ranks. Building on this, we propose our method,
optimal singular damage, that selectively sparsifies low-rank approximated
updates by leveraging the interleaved importance of singular vectors, ensuring
that the most impactful components are retained. We demonstrate through
extensive experiments that our proposed methods lead to significant storage
efficiency and superior accuracy within the same memory budget compared to
employing the low-rank approximation or sparsification individually.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Renfei Dang, Peng Hu, Changjiang Gao, Shujian Huang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Previous studies show that introducing new knowledge during large language
models (LLMs) fine-tuning can lead to the generation of erroneous output when
tested on known information, thereby triggering factual hallucinations.
However, existing studies have not deeply investigated the specific
manifestations and underlying mechanisms of these hallucinations. Our work
addresses this gap by designing a controlled dataset Biography-Reasoning, and
conducting a fine-grained analysis across multiple knowledge types and two task
types, including knowledge question answering (QA) and knowledge reasoning
tasks. We find that when fine-tuned on a dataset in which a specific knowledge
type consists entirely of new knowledge, LLMs exhibit significantly increased
hallucination tendencies. This suggests that the high unfamiliarity of a
particular knowledge type, rather than the overall proportion of new knowledge,
is a stronger driver of hallucinations, and these tendencies can even affect
other knowledge types in QA tasks. To mitigate such factual hallucinations, we
propose KnownPatch, which patches a small number of known knowledge samples in
the later stages of training, effectively alleviating new-knowledge-induced
hallucinations. Through attention analysis, we find that learning new knowledge
reduces the model's attention to key entities in the question, thus causing
excessive focus on the surrounding context, which may increase the risk of
hallucination. Moreover, the attention pattern can propagate to similar
contexts, facilitating the spread of hallucinations to textually similar
questions. Our method effectively mitigates the disruption of new knowledge
learning to the model's attention on key entities, accompanied by improved
performance.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aakash Sen Sharma, Debdeep Sanyal, Vivek Srivastava, Shirish Karande, Murari Mandal
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The alignment of Large Language Models (LLMs) with human values is central to
their safe deployment, yet current practice produces static, brittle, and
costly-to-maintain models that fail to keep pace with evolving norms and
policies. This misalignment, which we term the Alignment-Reality Gap, poses a
growing challenge for reliable long-term use. Existing remedies are inadequate:
large-scale re-annotation is economically prohibitive, and standard unlearning
methods act as blunt instruments that erode utility rather than enable precise
policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict
Evaluation), a framework for principled unlearning that reconceives
re-alignment as a programmatic policy application problem. TRACE
programmatically triages existing preference data against a new policy,
identifies high-impact conflicts via a alignment impact score, and applies a
hybrid optimization that cleanly inverts, discards, or preserves preferences
while safeguarding model performance. Empirical results show that TRACE
achieves robust re-alignment across diverse model families (Qwen2.5-7B,
Gemma-2-9B, Llama-3.1-8B). On both synthetic benchmarks and the PKU-SafeRLHF
dataset under complex policy shift, TRACE enforces new principles without
degrading general capabilities. Our work establishes a scalable, dynamic, and
cost-effective paradigm for maintaining LLM alignment, providing a foundation
for sustainable and responsible AI deployment.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Change detection (CD) is a fundamental task for monitoring and analyzing land
cover dynamics. While recent high performance models and high quality datasets
have significantly advanced the field, a critical limitation persists. Current
models typically acquire limited knowledge from single-type annotated data and
cannot concurrently leverage diverse binary change detection (BCD) and semantic
change detection (SCD) datasets. This constraint leads to poor generalization
and limited versatility. The recent advancements in Multimodal Large Language
Models (MLLMs) introduce new possibilities for a unified CD framework. We
leverage the language priors and unification capabilities of MLLMs to develop
UniChange, the first MLLM-based unified change detection model. UniChange
integrates generative language abilities with specialized CD functionalities.
Our model successfully unifies both BCD and SCD tasks through the introduction
of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange
utilizes text prompts to guide the identification of change categories,
eliminating the reliance on predefined classification heads. This design allows
UniChange to effectively acquire knowledge from multi-source datasets, even
when their class definitions conflict. Experiments on four public benchmarks
(WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance,
achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively,
surpassing all previous methods. The code is available at
https://github.com/Erxucomeon/UniChange.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Ehsan Aghazadeh, Ahmad Ghasemi, Hedyeh Beyhaghi, Hossein Pishro-Nik
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are often queried multiple times at test time,
with predictions aggregated by majority vote. While effective, this
self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls
and can fail when the correct answer is rare. We introduce Confidence-Guided
Early Stopping (CGES), a Bayesian framework that forms posteriors over
candidate answers using scalar confidence signals derived from token
probabilities or reward models. CGES adaptively halts sampling once the
posterior mass of a candidate exceeds a threshold. We provide theoretical
guarantees for both perfectly calibrated confidences and realistic noisy
confidence signals. Across five reasoning benchmarks, CGES reduces the average
number of model calls by about 69 percent (for example, from 16.0 to 4.9) while
matching the accuracy of self-consistency within 0.06 percentage points.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Max Norris, Kobi Gal, Sahan Bulathwela
Natural Language Processing
ๆฅ็ๆ่ฆ
Modelling student knowledge is a key challenge when leveraging AI in
education, with major implications for personalised learning. The Knowledge
Tracing (KT) task aims to predict how students will respond to educational
questions in learning environments, based on their prior interactions. Existing
KT models typically use response correctness along with metadata like skill
tags and timestamps, often overlooking the question text, which is an important
source of pedagogical insight. This omission poses a lost opportunity while
limiting predictive performance. We propose Next Token Knowledge Tracing
(NTKT), a novel approach that reframes KT as a next-token prediction task using
pretrained Large Language Models (LLMs). NTKT represents both student histories
and question content as sequences of text, allowing LLMs to learn patterns in
both behaviour and language. Our series of experiments significantly improves
performance over state-of-the-art neural KT models and generalises much better
to cold-start questions and users. These findings highlight the importance of
question content in KT and demonstrate the benefits of leveraging pretrained
representations of LLMs to model student learning more effectively.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Angela Stamatie
Natural Language Processing
ๆฅ็ๆ่ฆ
The research explores error analysis in the performance of translating by
Machine Translation from English into Romanian, and it focuses on lexical
errors found in texts which include official information, provided by the World
Health Organization (WHO), the Gavi Organization, by the patient information
leaflet (the information about the active ingredients of the vaccines or the
medication, the indications, the dosage instructions, the storage instructions,
the side effects and warning, etc.). All of these texts are related to Covid-19
and have been translated by Google Translate, a multilingual Machine
Translation that was created by Google. In the last decades, Google has
actively worked to develop a more accurate and fluent automatic translation
system. This research, specifically focused on improving Google Translate, aims
to enhance the overall quality of Machine Translation by achieving better
lexical selection and by reducing errors. The investigation involves a
comprehensive analysis of 230 texts that have been translated from English into
Romanian.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kenza Khelkhal, Dihia Lanasri
ๆฅ็ๆ่ฆ
Hiring processes often involve the manual screening of hundreds of resumes
for each job, a task that is time and effort consuming, error-prone, and
subject to human bias. This paper presents Smart-Hiring, an end-to-end Natural
Language Processing (NLP) pipeline de- signed to automatically extract
structured information from unstructured resumes and to semantically match
candidates with job descriptions. The proposed system combines document
parsing, named-entity recognition, and contextual text embedding techniques to
capture skills, experience, and qualifications. Using advanced NLP technics,
Smart-Hiring encodes both resumes and job descriptions in a shared vector space
to compute similarity scores between candidates and job postings. The pipeline
is modular and explainable, allowing users to inspect extracted entities and
matching rationales. Experiments were conducted on a real-world dataset of
resumes and job descriptions spanning multiple professional domains,
demonstrating the robustness and feasibility of the proposed approach. The
system achieves competitive matching accuracy while preserving a high degree of
interpretability and transparency in its decision process. This work introduces
a scalable and practical NLP frame- work for recruitment analytics and outlines
promising directions for bias mitigation, fairness-aware modeling, and
large-scale deployment of data-driven hiring solutions.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Recent advances in multi-modal models have demonstrated strong performance in
tasks such as image generation and reasoning. However, applying these models to
the fire domain remains challenging due to the lack of publicly available
datasets with high-quality fire domain annotations. To address this gap, we
introduce DetectiumFire, a large-scale, multi-modal dataset comprising of 22.5k
high-resolution fire-related images and 2.5k real-world fire-related videos
covering a wide range of fire types, environments, and risk levels. The data
are annotated with both traditional computer vision labels (e.g., bounding
boxes) and detailed textual prompts describing the scene, enabling applications
such as synthetic data generation and fire risk reasoning. DetectiumFire offers
clear advantages over existing benchmarks in scale, diversity, and data
quality, significantly reducing redundancy and enhancing coverage of real-world
scenarios. We validate the utility of DetectiumFire across multiple tasks,
including object detection, diffusion-based image generation, and
vision-language reasoning. Our results highlight the potential of this dataset
to advance fire-related research and support the development of intelligent
safety systems. We release DetectiumFire to promote broader exploration of fire
understanding in the AI community. The dataset is available at
https://kaggle.com/datasets/38b79c344bdfc55d1eed3d22fbaa9c31fad45e27edbbe9e3c529d6e5c4f93890
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Giulia Iadisernia, Carolina Camassa
Natural Language Processing
ๆฅ็ๆ่ฆ
We evaluate whether persona-based prompting improves Large Language Model
(LLM) performance on macroeconomic forecasting tasks. Using 2,368
economics-related personas from the PersonaHub corpus, we prompt GPT-4o to
replicate the ECB Survey of Professional Forecasters across 50 quarterly rounds
(2013-2025). We compare the persona-prompted forecasts against the human
experts panel, across four target variables (HICP, core HICP, GDP growth,
unemployment) and four forecast horizons. We also compare the results against
100 baseline forecasts without persona descriptions to isolate its effect. We
report two main findings. Firstly, GPT-4o and human forecasters achieve
remarkably similar accuracy levels, with differences that are statistically
significant yet practically modest. Our out-of-sample evaluation on 2024-2025
data demonstrates that GPT-4o can maintain competitive forecasting performance
on unseen events, though with notable differences compared to the in-sample
period. Secondly, our ablation experiment reveals no measurable forecasting
advantage from persona descriptions, suggesting these prompt components can be
omitted to reduce computational costs without sacrificing accuracy. Our results
provide evidence that GPT-4o can achieve competitive forecasting accuracy even
on out-of-sample macroeconomic events, if provided with relevant context data,
while revealing that diverse prompts produce remarkably homogeneous forecasts
compared to human panels.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kentaro Ueda, Franรงois Portet, Hirohiko Suwa, Keiichi Yasumoto
ๆฅ็ๆ่ฆ
While LLMs excel at general tasks, they struggle in specialized domains like
finance, requiring diverse skills in domain knowledge, mathematical reasoning,
and multilingual processing. Merging domain-specific Continual Pre-training
(CPT) "experts" offers a practical alternative to costly and unstable
multi-skill training. However, unlike established Supervised Fine-Tuning (SFT)
model-based merging, CPT model merging remains largely unexplored. We address
this gap by creating financial LLMs from experts in finance, math, and
Japanese. We propose a three-stage evaluation focusing on knowledge recovery,
complementarity, and emergence, and assess three merging methods (Task
Arithmetic, TIES, and DARE-TIES) on a comprehensive financial benchmark curated
from 18 tasks across 8 established datasets. Results show that merging an
expert with its base model recovers general knowledge lost during CPT, while
merging experts improves performance and can yield emergent cross-domain
skills. Among the methods, Task Arithmetic performs strongly but is
hyperparameter-sensitive, whereas TIES is more robust. Our findings also
suggest that while model similarity correlates with merging success, emergent
skills depend on more complex factors. This work presents the first
foundational analysis of CPT model merging, establishing a principled framework
and providing clear guidance for building multi-skill LLMs from existing
assets.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aashray Reddy, Andrew Zagula, Nicholas Saban
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where
adversarial prompts elicit harmful outputs, yet most evaluations focus on
single-turn interactions while real-world attacks unfold through adaptive
multi-turn conversations. We present AutoAdv, a training-free framework for
automated multi-turn jailbreaking that achieves up to 95% attack success rate
on Llama-3.1-8B within six turns a 24 percent improvement over single turn
baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern
manager that learns from successful attacks to enhance future prompts, a
temperature manager that dynamically adjusts sampling parameters based on
failure modes, and a two-phase rewriting strategy that disguises harmful
requests then iteratively refines them. Extensive evaluation across commercial
and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent
vulnerabilities in current safety mechanisms, with multi-turn attacks
consistently outperforming single-turn approaches. These findings demonstrate
that alignment strategies optimized for single-turn interactions fail to
maintain robustness across extended conversations, highlighting an urgent need
for multi-turn-aware defenses.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohd Nauman, Sravan Gvm, Vijay Devane, Shyam Pawar, Viraj Thakur et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Current large language models excel at broad, general-purpose tasks, but
consistently underperform when exposed to highly specialized domains that
require deep cultural, linguistic, and subject-matter expertise. In particular,
traditional medical systems such as Ayurveda embody centuries of nuanced
textual and clinical knowledge that mainstream LLMs fail to accurately
interpret or apply. We introduce AyurParam-2.9B, a domain-specialized,
bilingual language model fine-tuned from Param-1-2.9B using an extensive,
expertly curated Ayurveda dataset spanning classical texts and clinical
guidance. AyurParam's dataset incorporates context-aware, reasoning, and
objective-style Q&A in both English and Hindi, with rigorous annotation
protocols for factual precision and instructional clarity. Benchmarked on
BhashaBench-Ayur, AyurParam not only surpasses all open-source
instruction-tuned models in its size class (1.5--3B parameters), but also
demonstrates competitive or superior performance compared to much larger
models. The results from AyurParam highlight the necessity for authentic domain
adaptation and high-quality supervision in delivering reliable, culturally
congruent AI for specialized medical knowledge.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yudong Li, Zhongliang Yang, Kejiang Chen, Wenxuan Wang, Tianxin Zhang et al.
Computer Vision
ๆฅ็ๆ่ฆ
In this work, we propose LiveSecBench, a dynamic and continuously updated
safety benchmark specifically for Chinese-language LLM application scenarios.
LiveSecBench evaluates models across six critical dimensions (Legality, Ethics,
Factuality, Privacy, Adversarial Robustness, and Reasoning Safety) rooted in
the Chinese legal and social frameworks. This benchmark maintains relevance
through a dynamic update schedule that incorporates new threat vectors, such as
the planned inclusion of Text-to-Image Generation Safety and Agentic Safety in
the next update. For now, LiveSecBench (v251030) has evaluated 18 LLMs,
providing a landscape of AI safety in the context of Chinese language. The
leaderboard is publicly accessible at https://livesecbench.intokentech.cn/.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In human cognition, there exist numerous thought processes that are tacit and
beyond verbal expression, enabling us to understand and interact with the world
in multiple ways. However, contemporary Vision-Language Models (VLMs) remain
constrained to reasoning within the discrete and rigid space of linguistic
tokens, thereby bottlenecking the rich, high-dimensional nature of visual
perception. To bridge this gap, we propose CoCoVa (Chain of Continuous
Vision-Language Thought), a novel framework for vision-language model that
leverages continuous cross-modal reasoning for diverse vision-language tasks.
The core of CoCoVa is an iterative reasoning cycle, where a novel Latent
Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a
chain of latent thought vectors through cross-modal fusion. To focus this
process, a token selection mechanism dynamically identifies salient visual
regions, mimicking attentional focus. To ensure these latent thoughts remain
grounded, we train the model with a multi-task objective that combines
contrastive learning and diffusion-based reconstruction, enforcing alignment
between latent representations and both visual and textual modalities.
Evaluations show CoCoVa improves accuracy and token efficiency over strong
baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B
models on almost all benchmarks. When scaled to 7B LLM backbones, it remains
competitive with state-of-the-art models. Qualitative analysis validates that
learned latent space captures interpretable and structured reasoning patterns,
highlighting the potential of CoCoVa to bridge the representational gap between
discrete language processing and the continuous nature of visual understanding.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wongyu Kim, Hochang Lee, Sanghak Lee, Yoonsung Kim, Jaehyun Park
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Query augmentation makes queries more meaningful by appending further
information to the queries to find relevant documents. Current studies have
proposed Large Language Model (LLM)-based embedders, which learn representation
for embedding and generation for query augmentation in a multi-task manner by
leveraging the generative capabilities of LLM. During inference, these jointly
trained embedders have conducted query augmentation followed by embedding,
showing effective results. However, augmenting every query leads to substantial
embedding latency and query augmentation can be detrimental to performance for
some queries. Also, previous methods have not been explored in multimodal
environments. To tackle these problems, we propose M-Solomon, a universal
multimodal embedder that can adaptively determine when to augment queries. Our
approach first divides the queries of the training datasets into two groups at
the dataset level. One includes queries that require augmentation and the other
includes queries that do not. Then, we introduces a synthesis process that
generates appropriate augmentations for queries that require them by leveraging
a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation.
Through this step, M-Solomon can conduct query augmentation only when necessary
by learning to generate synthetic augmentations with the prefix /augment for
queries that demand them and to generate the simple string /embed for others.
Experimental results showed that M-Solomon not only surpassed the baseline
without augmentation by a large margin but also outperformed the baseline that
always used augmentation, providing much faster embedding latency.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Liuhao Lin, Ke Li, Zihan Xu, Yuchen Shi, Yulei Qin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Current evaluation paradigms for large language models (LLMs) represent a
critical blind spot in AI research--relying on opaque numerical metrics that
conceal fundamental limitations in spatial reasoning while providing no
intuitive understanding of model capabilities. This deficiency creates a
dangerous disconnect between reported performance and practical abilities,
particularly for applications requiring physical world understanding. We
introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation
from abstract scores to directly observable visual outputs by requiring models
to generate drawings through dot matrices or executable code. This approach
makes spatial reasoning limitations immediately apparent even to non-experts,
bridging the fundamental gap between statistical performance and intuitive
assessment. LTD-Bench implements a comprehensive methodology with complementary
generation tasks (testing spatial imagination) and recognition tasks (assessing
spatial perception) across three progressively challenging difficulty levels,
methodically evaluating both directions of the critical language-spatial
mapping. Our extensive experiments with state-of-the-art models expose an
alarming capability gap: even LLMs achieving impressive results on traditional
benchmarks demonstrate profound deficiencies in establishing bidirectional
mappings between language and spatial concept--a fundamental limitation that
undermines their potential as genuine world models. Furthermore, LTD-Bench's
visual outputs enable powerful diagnostic analysis, offering a potential
approach to investigate model similarity.
๐
2025-11-04
๐ ArXiv Preprint (cs.MA)
๐ฅ Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia
Machine Learning
ๆฅ็ๆ่ฆ
We study the problem of learning multi-task, multi-agent policies for
cooperative, temporal objectives, under centralized training, decentralized
execution. In this setting, using automata to represent tasks enables the
decomposition of complex tasks into simpler sub-tasks that can be assigned to
agents. However, existing approaches remain sample-inefficient and are limited
to the single-task case. In this work, we present Automata-Conditioned
Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for
learning task-conditioned, decentralized team policies. We identify the main
challenges to ACC-MARL's feasibility in practice, propose solutions, and prove
the correctness of our approach. We further show that the value functions of
learned policies can be used to assign tasks optimally at test time.
Experiments show emergent task-aware, multi-step coordination among agents,
e.g., pressing a button to unlock a door, holding the door, and
short-circuiting tasks.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhiwei Zhang, Xiaomin Li, Yudi Lin, Hui Liu, Ramraj Chandradevan et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) trained with reinforcement learning and
verifiable rewards have achieved strong results on complex reasoning tasks.
Recent work extends this paradigm to a multi-agent setting, where a
meta-thinking agent proposes plans and monitors progress while a reasoning
agent executes subtasks through sequential conversational turns. Despite
promising performance, we identify a critical limitation: lazy agent behavior,
in which one agent dominates while the other contributes little, undermining
collaboration and collapsing the setup to an ineffective single agent. In this
paper, we first provide a theoretical analysis showing why lazy behavior
naturally arises in multi-agent reasoning. We then introduce a stable and
efficient method for measuring causal influence, helping mitigate this issue.
Finally, as collaboration intensifies, the reasoning agent risks getting lost
in multi-turn interactions and trapped by previous noisy responses. To counter
this, we propose a verifiable reward mechanism that encourages deliberation by
allowing the reasoning agent to discard noisy outputs, consolidate
instructions, and restart its reasoning process when necessary. Extensive
experiments demonstrate that our framework alleviates lazy agent behavior and
unlocks the full potential of multi-agent framework for complex reasoning
tasks.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cuong Tuan Nguyen, Ngoc Tuan Nguyen, Triet Hoang Minh Dao, Huy Minh Nhat, Huy Truong Dinh
Machine Learning
ๆฅ็ๆ่ฆ
We propose a Graph Neural Network (GNN)-based approach for Handwritten
Mathematical Expression (HME) recognition by modeling HMEs as graphs, where
nodes represent symbols and edges capture spatial dependencies. A deep BLSTM
network is used for symbol segmentation, recognition, and spatial relation
classification, forming an initial primitive graph. A 2D-CFG parser then
generates all possible spatial relations, while the GNN-based link prediction
model refines the structure by removing unnecessary connections, ultimately
forming the Symbol Label Graph. Experimental results demonstrate the
effectiveness of our approach, showing promising performance in HME structure
recognition.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin et al.
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework
that enhances the reasoning capabilities of multimodal large language models
(MLLMs) by teaching them when and how to think. Existing approaches are limited
by outcome-only supervision, which rewards correct answers without ensuring
sound reasoning, and by uniform thinking strategies, which often lead to
overthinking on simple tasks and underthinking on complex ones. SAIL-RL
addresses these challenges with a dual reward system: the Thinking Reward,
which evaluates reasoning quality through factual grounding, logical coherence,
and answer consistency, and the Judging Reward, which adaptively determines
whether deep reasoning or direct answering is appropriate. Experiments on the
state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal
understanding benchmarks at both 4B and 8B scales, achieving competitive
performance against commercial closed-source models such as GPT-4o, and
substantially reduces hallucinations, establishing it as a principled framework
for building more reliable and adaptive MLLMs. The code will be available at
https://github.com/BytedanceDouyinContent/SAIL-RL.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jonathan Liu, Haoling Qiu, Jonathan Lasko, Damianos Karakos, Mahsa Yarmohammadi et al.
ๆฅ็ๆ่ฆ
Recent research has shown that hallucinations, omissions, and biases are
prevalent in everyday use-cases of LLMs. However, chatbots used in medical
contexts must provide consistent advice in situations where non-medical factors
are involved, such as when demographic information is present. In order to
understand the conditions under which medical chatbots fail to perform as
expected, we develop an infrastructure that 1) automatically generates queries
to probe LLMs and 2) evaluates answers to these queries using multiple
LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples
the space of patient demographics, histories, disorders, and writing styles to
create realistic questions that we subsequently use to prompt LLMs. In 2), our
evaluation pipeline provides hallucination and omission detection using
LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge
treatment category detectors. As a baseline study, we perform two case studies
on inter-LLM agreement and the impact of varying the answering and evaluation
LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's
Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs
yield statistically significant differences across writing styles, genders, and
races. We recommend that studies using LLM evaluation use multiple LLMs as
evaluators in order to avoid arriving at statistically significant but
non-generalizable results, particularly in the absence of ground-truth data. We
also suggest publishing inter-LLM agreement metrics for transparency. Our code
and dataset are available here:
https://github.com/BBN-E/medic-neurips-2025-demo.
๐
2025-11-04
๐ ArXiv Preprint (Multimedia)
๐ฅ Jiawei Liu, Enis Berk รoban, Zarina Schevchenko, Hao Tang, Zhigang Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Standard training for Multi-modal Large Language Models (MLLMs) involves
concatenating non-textual information, like vision or audio, with a text
prompt. This approach may not encourage deep integration of modalities,
limiting the model's ability to leverage the core language model's reasoning
capabilities. This work examined the impact of interleaved instruction tuning
in an audio MLLM, where audio tokens are interleaved within the prompt. Using
the Listen, Think, and Understand (LTU) model as a testbed, we conduct an
experiment using the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our
newly created reasoning benchmark for audio-based semantic reasoning focusing
on synonym and hypernym recognition. Our findings show that while even
zero-shot interleaved prompting improves performance on our reasoning tasks, a
small amount of fine-tuning using interleaved training prompts improves the
results further, however, at the expense of the MLLM's audio labeling ability.
๐
2025-11-04
๐ EMNLP 2025
๐ฅ Kangyu Qiao, Shaolei Zhang, Yang Feng
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
With the growing computational demands of large language models (LLMs),
efficient inference has become increasingly critical for practical deployment.
Depth pruning has emerged as a promising approach for reducing the
computational costs of large language models by removing transformer layers.
However, existing methods typically rely on fixed block masks, which can lead
to suboptimal performance across different tasks and inputs. In this paper, we
propose IG-Pruning, a novel input-aware block-wise pruning method that
dynamically selects layer masks at inference time. Our approach consists of two
stages: (1) Discovering diverse mask candidates through semantic clustering and
L0 optimization, and (2) Implementing efficient dynamic pruning without the
need for extensive training. Experimental results demonstrate that our method
consistently outperforms state-of-the-art static depth pruning methods, making
it particularly suitable for resource-constrained deployment scenarios.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Weiwei Sun, Xuhui Zhou, Weihua Du, Xingyao Wang, Sean Welleck et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
While existing work focuses primarily on task success, we argue that
effective real-world agents require optimizing three dimensions: productivity
(task completion), proactivity (asking essential questions), and
personalization (adapting to diverse user preferences). We introduce UserVille,
an interactive environment with LLM-based user simulators enabling diverse,
configurable user preferences. Leveraging UserVille, we introduce PPP, a
multi-objective reinforcement learning approach that jointly optimizes all
three dimensions: Productivity, Proactivity, and Personalization. Experiments
on software engineering and deep research tasks show that agents trained with
PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6
on average), demonstrating the ability to ask strategic clarifying questions,
adapt to unseen user preferences, and improve task success through better
interaction. This work demonstrates that explicitly optimizing for
user-centered interaction is critical for building practical and effective AI
agents.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yibo Zhao, Yang Zhao, Hongru Du, Hao Frank Yang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Decision-making models for individuals, particularly in high-stakes scenarios
like vaccine uptake, often diverge from population optimal predictions. This
gap arises from the uniqueness of the individual decision-making process,
shaped by numerical attributes (e.g., cost, time) and linguistic influences
(e.g., personal preferences and constraints). Developing upon Utility Theory
and leveraging the textual-reasoning capabilities of Large Language Models
(LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric
Reasoning framework (ATHENA) to address the optimal information integration.
ATHENA uniquely integrates two stages: First, it discovers robust, group-level
symbolic utility functions via LLM-augmented symbolic discovery; Second, it
implements individual-level semantic adaptation, creating personalized semantic
templates guided by the optimal utility to model personalized choices.
Validated on real-world travel mode and vaccine choice tasks, ATHENA
consistently outperforms utility-based, machine learning, and other LLM-based
models, lifting F1 score by at least 6.5% over the strongest cutting-edge
models. Further, ablation studies confirm that both stages of ATHENA are
critical and complementary, as removing either clearly degrades overall
predictive performance. By organically integrating symbolic utility modeling
and semantic adaptation, ATHENA provides a new scheme for modeling
human-centric decisions. The project page can be found at
https://yibozh.github.io/Athena.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have shown strong capabilities but
remain limited to fixed modality pairs and require costly fine-tuning with
large aligned datasets. Building fully omni-capable models that can integrate
text, images, audio, and video remains impractical and lacks robust reasoning
support. In this paper, we propose an Agent-Omni framework that coordinates
existing foundation models through a master-agent system, enabling flexible
multimodal reasoning without retraining. The master agent interprets user
intent, delegates subtasks to modality-specific agents, and integrates their
outputs into coherent responses. Extensive experiments across text, image,
audio, video, and omni benchmarks show that Agent-Omni consistently achieves
state-of-the-art performance, particularly on tasks requiring complex
cross-modal reasoning. Its agent-based design enables seamless integration of
specialized foundation models, ensuring adaptability to diverse inputs while
maintaining transparency and interpretability. In addition, the framework is
modular and easily extensible, allowing future improvements as stronger models
become available. %We release an open-source implementation to support
continued research on scalable and reliable omni-modal reasoning.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Sham Kakade, Surbhi Goel
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Knowledge distillation is an efficient strategy to use data generated by
large "teacher" language models to train smaller capable "student" models, but
selecting the optimal teacher for a specific student-task combination requires
expensive trial-and-error. We propose a lightweight score called GRACE to
quantify how effective a teacher will be for post-training a student model.
GRACE measures distributional properties of the student's gradients without
access to a verifier, teacher logits, teacher internals, or test data. From an
information-theoretic perspective, GRACE connects to leave-one-out stability of
gradient-based algorithms, which controls the generalization performance of the
distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86%
Spearman correlation) with the performance of the distilled LLaMA and OLMo
students. In particular, training a student using the GRACE-selected teacher
can improve the performance by up to 7.4% over naively using the
best-performing teacher. Further, GRACE can provide guidance on crucial design
choices in distillation, including (1) the best temperature to use when
generating from the teacher, (2) the best teacher to use given a size
constraint, and (3) the best teacher to use within a specific model family.
Altogether, our findings demonstrate that GRACE can efficiently and effectively
identify a strongly compatible teacher for a given student and provide
fine-grained guidance on how to perform distillation.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large-scale data has driven breakthroughs in robotics, from language models
to vision-language-action models in bimanual manipulation. However, humanoid
robotics lacks equally effective data collection frameworks. Existing humanoid
teleoperation systems either use decoupled control or depend on expensive
motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid
teleoperation and data collection system that preserves full whole-body control
while advancing scalability. Our system leverages PICO4U VR for obtaining
real-time whole-body human motions, with a custom 2-DoF robot neck (cost around
$250) for egocentric vision, enabling holistic human-to-humanoid control. We
demonstrate long-horizon dexterous and mobile humanoid skills and we can
collect 100 demonstrations in 15 minutes with an almost 100% success rate.
Building on this pipeline, we propose a hierarchical visuomotor policy
framework that autonomously controls the full humanoid body based on egocentric
vision. Our visuomotor policy successfully demonstrates whole-body dexterous
manipulation and dynamic kicking tasks. The entire system is fully reproducible
and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also
open-sourced at https://twist-data.github.io .
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hakob Tamazyan, Ani Vanyan, Alvard Barseghyan, Anna Khosrovyan, Evan Shelhamer et al.
ๆฅ็ๆ่ฆ
The number and diversity of remote sensing satellites grows over time, while
the vast majority of labeled data comes from older satellites. As the
foundation models for Earth observation scale up, the cost of (re-)training to
support new satellites grows too, so the generalization capabilities of the
models towards new satellites become increasingly important. In this work we
introduce GeoCrossBench, an extension of the popular GeoBench benchmark with a
new evaluation protocol: it tests the in-distribution performance;
generalization to satellites with no band overlap; and generalization to
satellites with additional bands with respect to the training set. We also
develop a self-supervised extension of ChannelViT, ChiViT, to improve its
cross-satellite performance. First, we show that even the best foundation
models for remote sensing (DOFA, TerraFM) do not outperform general purpose
models like DINOv3 in the in-distribution setting. Second, when generalizing to
new satellites with no band overlap, all models suffer 2-4x drop in
performance, and ChiViT significantly outperforms the runner-up DINOv3. Third,
the performance of all tested models drops on average by 5-25\% when given
additional bands during test time. Finally, we show that fine-tuning just the
last linear layer of these models using oracle labels from all bands can get
relatively consistent performance across all satellites, highlighting that the
benchmark is far from being saturated. We publicly release the code and the
datasets to encourage the development of more future-proof remote sensing
models with stronger cross-satellite generalization.
๐
2025-11-04
๐ ArXiv Preprint (math.OC)
๐ฅ Dan Garber
Machine Learning
ๆฅ็ๆ่ฆ
We develop new accelerated first-order algorithms in the Frank-Wolfe (FW)
family for minimizing smooth convex functions over compact convex sets, with a
focus on two prominent constraint classes: (1) polytopes and (2) matrix domains
given by the spectrahedron and the unit nuclear-norm ball. A key technical
ingredient is a complementarity condition that captures solution sparsity --
face dimension for polytopes and rank for matrices. We present two algorithms:
(1) a purely linear optimization oracle (LOO) method for polytopes that has
optimal worst-case first-order (FO) oracle complexity and, aside of a finite
\emph{burn-in} phase and up to a logarithmic factor, has LOO complexity that
scales with $r/\sqrt{\epsilon}$, where $\epsilon$ is the target accuracy and
$r$ is the solution sparsity $r$ (independently of the ambient dimension), and
(2) a hybrid scheme that combines FW with a sparse projection oracle (e.g.,
low-rank SVDs for matrix domains with low-rank solutions), which also has
optimal FO oracle complexity, and after a finite burn-in phase, only requires
$O(1/\sqrt{\epsilon})$ sparse projections and LOO calls (independently of both
the ambient dimension and the rank of optimal solutions). Our results close a
gap on how to accelerate recent advancements in linearly-converging FW
algorithms for strongly convex optimization, without paying the price of the
dimension.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Mohamed Bouadi, Pratinav Seth, Aditya Tanna, Vinay Kumar Sankarapu
ๆฅ็ๆ่ฆ
Tabular data remain the predominant format for real-world applications. Yet,
developing effective neural models for tabular data remains challenging due to
heterogeneous feature types and complex interactions occurring at multiple
scales. Recent advances in tabular in-context learning (ICL), such as TabPFN
and TabICL, have achieved state-of-the-art performance comparable to
gradient-boosted trees (GBTs) without task-specific fine-tuning. However,
current architectures exhibit key limitations: (1) single-scale feature
processing that overlooks hierarchical dependencies, (2) dense attention with
quadratic scaling in table width, and (3) strictly sequential component
processing that prevents iterative representation refinement and
cross-component communication. To address these challenges, we introduce
Orion-MSP, a tabular ICL architecture featuring three key innovations: (1)
multi-scale processing to capture hierarchical feature interactions; (2)
block-sparse attention combining windowed, global, and random patterns for
scalable efficiency and long-range connectivity; and (3) a Perceiver-style
memory enabling safe bidirectional information flow across components. Across
diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance
while scaling effectively to high-dimensional tables, establishing a new
standard for efficient tabular in-context learning. The model is publicly
available at https://github.com/Lexsi-Labs/Orion-MSP .
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Morgan Allen, Paul Savala
Machine Learning
ๆฅ็ๆ่ฆ
In Major League Baseball, strategy and planning are major factors in
determining the outcome of a game. Previous studies have aided this by building
machine learning models for predicting the winning team of any given game. We
extend this work by training a comprehensive set of machine learning models
using a common dataset. In addition, we relate the win probabilities produced
by these models to win strength as measured by score differential. In doing so
we show that the most common machine learning models do indeed demonstrate a
relationship between predicted win probability and the strength of the win.
Finally, we analyze the results of using predicted win probabilities as a
decision making mechanism on run-line betting. We demonstrate positive returns
when utilizing appropriate betting strategies, and show that naive use of
machine learning models for betting lead to significant loses.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, Vinay Kumar Sankarapu
Machine Learning
ๆฅ็ๆ่ฆ
Tabular foundation models represent a growing paradigm in structured data
learning, extending the benefits of large-scale pretraining to tabular domains.
However, their adoption remains limited due to heterogeneous preprocessing
pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the
absence of standardized evaluation for deployment-oriented metrics such as
calibration and fairness. We present TabTune, a unified library that
standardizes the complete workflow for tabular foundation models through a
single interface. TabTune provides consistent access to seven state-of-the-art
models supporting multiple adaptation strategies, including zero-shot
inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient
fine-tuning (PEFT). The framework automates model-aware preprocessing, manages
architectural heterogeneity internally, and integrates evaluation modules for
performance, calibration, and fairness. Designed for extensibility and
reproducibility, TabTune enables consistent benchmarking of adaptation
strategies of tabular foundation models. The library is open source and
available at https://github.com/Lexsi-Labs/TabTune .
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nicolas Riccieri Gardin Assumpcao, Leandro Villas
ๆฅ็ๆ่ฆ
Federated Learning (FL) is a distributed training paradigm wherein
participants collaborate to build a global model while ensuring the privacy of
the involved data, which remains stored on participant devices. However,
proposals aiming to ensure such privacy also make it challenging to protect
against potential attackers seeking to compromise the training outcome. In this
context, we present Fast, Private, and Protected (FPP), a novel approach that
aims to safeguard federated training while enabling secure aggregation to
preserve data privacy. This is accomplished by evaluating rounds using
participants' assessments and enabling training recovery after an attack. FPP
also employs a reputation-based mechanism to mitigate the participation of
attackers. We created a dockerized environment to validate the performance of
FPP compared to other approaches in the literature (FedAvg, Power-of-Choice,
and aggregation via Trimmed Mean and Median). Our experiments demonstrate that
FPP achieves a rapid convergence rate and can converge even in the presence of
malicious participants performing model poisoning attacks.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mayank Jobanputra, Nils Philipp Walter, Maitrey Mehta, Blerta Veseli, Evan Parker Kelly Chapple et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We present a systematic study of subtraction in large language models (LLMs).
While prior benchmarks emphasize addition and multiplication, subtraction has
received comparatively little attention despite being structurally distinct as
a non-commutative operation. We evaluate eight pretrained LLMs spanning four
families on addition and subtraction problems. Our experiments reveal that
subtraction accuracy lags behind addition by a wide margin. We find that the
errors for ($a-b$) are concentrated in cases where ($a
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Andras Ferenczi, Sutapa Samanta, Dagen Wang, Todd Hodges
Machine Learning
ๆฅ็ๆ่ฆ
Federated learning (FL) is a widely used method for training machine learning
(ML) models in a scalable way while preserving privacy (i.e., without
centralizing raw data). Prior research shows that the risk of exposing
sensitive data increases cumulatively as the number of iterations where a
client's updates are included in the aggregated model increase. Attackers can
launch membership inference attacks (MIA; deciding whether a sample or client
participated), property inference attacks (PIA; inferring attributes of a
client's data), and model inversion attacks (MI; reconstructing inputs),
thereby inferring client-specific attributes and, in some cases, reconstructing
inputs. In this paper, we mitigate risk by substantially reducing per client
exposure using a quantum computing-inspired quadratic unconstrained binary
optimization (QUBO) formulation that selects a small subset of client updates
most relevant for each training round. In this work, we focus on two threat
vectors: (i) information leakage by clients during training and (ii)
adversaries who can query or obtain the global model. We assume a trusted
central server and do not model server compromise. This method also assumes
that the server has access to a validation/test set with global data
distribution. Experiments on the MNIST dataset with 300 clients in 20 rounds
showed a 95.2% per-round and 49% cumulative privacy exposure reduction, with
147 clients' updates never being used during training while maintaining in
general the full-aggregation accuracy or even better. The method proved to be
efficient at lower scale and more complex model as well. A CINIC-10
dataset-based experiment with 30 clients resulted in 82% per-round privacy
improvement and 33% cumulative privacy.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinghan Li, Haodong Wen, Kaifeng Lyu
ๆฅ็ๆ่ฆ
Despite the popularity of the Adam optimizer in practice, most theoretical
analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and
little is known about how the solutions found by Adam differ. In this paper, we
show that Adam implicitly reduces a unique form of sharpness measure shaped by
its adaptive updates, leading to qualitatively different solutions from SGD.
More specifically, when the training loss is small, Adam wanders around the
manifold of minimizers and takes semi-gradients to minimize this sharpness
measure in an adaptive manner, a behavior we rigorously characterize through a
continuous-time approximation using stochastic differential equations. We
further demonstrate how this behavior differs from that of SGD in a
well-studied setting: when training overparameterized models with label noise,
SGD has been shown to minimize the trace of the Hessian matrix, $\tr(\mH)$,
whereas we prove that Adam minimizes $\tr(\Diag(\mH)^{1/2})$ instead. In
solving sparse linear regression with diagonal linear networks, this
distinction enables Adam to achieve better sparsity and generalization than
SGD. Finally, our analysis framework extends beyond Adam to a broad class of
adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo,
and provides a unified perspective on how these adaptive optimizers reduce
sharpness, which we hope will offer insights for future optimizer design.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bum Chul Kwon, Ben Shapira, Moshiko Raboh, Shreyans Sethi, Shruti Murarka et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
The chemical space of drug-like molecules is vast, motivating the development
of generative models that must learn broad chemical distributions, enable
conditional generation by capturing structure-property representations, and
provide fast molecular generation. Meeting the objectives depends on modeling
choices, including the probabilistic modeling approach, the conditional
generative formulation, the architecture, and the molecular input
representation. To address the challenges, we present STAR-VAE
(Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder),
a scalable latent-variable framework with a Transformer encoder and an
autoregressive Transformer decoder. It is trained on 79 million drug-like
molecules from PubChem, using SELFIES to guarantee syntactic validity. The
latent-variable formulation enables conditional generation: a property
predictor supplies a conditioning signal that is applied consistently to the
latent prior, the inference network, and the decoder. Our contributions are:
(i) a Transformer-based latent-variable encoder-decoder model trained on
SELFIES representations; (ii) a principled conditional latent-variable
formulation for property-guided generation; and (iii) efficient finetuning with
low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation
with limited property and activity data. On the GuacaMol and MOSES benchmarks,
our approach matches or exceeds baselines, and latent-space analyses reveal
smooth, semantically structured representations that support both unconditional
exploration and property-aware generation. On the Tartarus benchmarks, the
conditional model shifts docking-score distributions toward stronger predicted
binding. These results suggest that a modernized, scale-appropriate VAE remains
competitive for molecular generation when paired with principled conditioning
and parameter-efficient finetuning.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Saeed Razavikia, Josรฉ Mairton Barros Da Silva Junior, Carlo Fischione
ๆฅ็ๆ่ฆ
Recently, the ChannelComp framework has proposed digital over-the-air
computation by designing digital modulations that enable the computation of
arbitrary functions. Unlike traditional analog over-the-air computation, which
is restricted to nomographic functions, ChannelComp enables a broader range of
computational tasks while maintaining compatibility with digital communication
systems. This framework is intended for applications that favor local
information processing over the mere acquisition of data. However, ChannelComp
is currently designed for scalar function computation, while numerous
data-centric applications necessitate vector-based computations, and it is
susceptible to channel fading. In this work, we introduce a generalization of
the ChannelComp framework, called VecComp, by integrating ChannelComp with
multiple-antenna technology. This generalization not only enables vector
function computation but also ensures scalability in the computational
complexity, which increases only linearly with the vector dimension. As such,
VecComp remains computationally efficient and robust against channel
impairments, making it suitable for high-dimensional, data-centric
applications. We establish a non-asymptotic upper bound on the mean squared
error of VecComp, affirming its computation efficiency under fading channel
conditions. Numerical experiments show the effectiveness of VecComp in
improving the computation of vector functions and fading compensation over
noisy and fading multiple-access channels.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xun Wang, Zhuoran Li, Yanshan Lin, Hai Zhong, Longbo Huang
Machine Learning
ๆฅ็ๆ่ฆ
Training a team of agents from scratch in multi-agent reinforcement learning
(MARL) is highly inefficient, much like asking beginners to play a symphony
together without first practicing solo. Existing methods, such as offline or
transferable MARL, can ease this burden, but they still rely on costly
multi-agent data, which often becomes the bottleneck. In contrast, solo
experiences are far easier to obtain in many important scenarios, e.g.,
collaborative coding, household cooperation, and search-and-rescue. To unlock
their potential, we propose Solo-to-Collaborative RL (SoCo), a framework that
transfers solo knowledge into cooperative learning. SoCo first pretrains a
shared solo policy from solo demonstrations, then adapts it for cooperation
during multi-agent training through a policy fusion mechanism that combines an
MoE-like gating selector and an action editor. Experiments across diverse
cooperative tasks show that SoCo significantly boosts the training efficiency
and performance of backbone algorithms. These results demonstrate that solo
demonstrations provide a scalable and effective complement to multi-agent data,
making cooperative learning more practical and broadly applicable.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lejs Deen Behric, Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Zeroth-order or derivative-free optimization (MeZO) is an attractive strategy
for finetuning large language models (LLMs) because it eliminates the memory
overhead of backpropagation. However, it converges slowly due to the inherent
curse of dimensionality when searching for descent directions in the
high-dimensional parameter space of billion-scale LLMs. We propose ConMeZO, a
novel zeroth-order optimizer that accelerates convergence by adaptive
directional sampling. Instead of drawing the direction uniformly at random,
ConMeZO restricts the sampling to a cone centered around a momentum estimate.
This concentrates the search in directions where the true gradient is more
likely to lie and thus reduces the effect of high dimensions. We prove that
ConMeZO achieves the same worst-case convergence rate as MeZO. Empirically,
when finetuning LLMs on natural language tasks, ConMeZO is up to 2X faster than
MeZO while retaining the low-memory footprint of zeroth-order methods.
๐
2025-11-04
๐ ArXiv Preprint (stat.ME)
๐ฅ Zebin Wang, Ziming Gan, Weijing Tang, Zongqi Xia, Tianrun Cai et al.
ๆฅ็ๆ่ฆ
Classical probabilistic graphical models face fundamental challenges in
modern data environments, which are characterized by high dimensionality,
source heterogeneity, and stringent data-sharing constraints. In this work, we
revisit the Ising model, a well-established member of the Markov Random Field
(MRF) family, and develop a distributed framework that enables scalable and
privacy-preserving representation learning from large-scale binary data with
inherent low-rank structure. Our approach optimizes a non-convex surrogate loss
function via bi-factored gradient descent, offering substantial computational
and communication advantages over conventional convex approaches. We evaluate
our algorithm on multi-institutional electronic health record (EHR) datasets
from 58,248 patients across the University of Pittsburgh Medical Center (UPMC)
and Mass General Brigham (MGB), demonstrating superior performance in global
representation learning and downstream clinical tasks, including relationship
detection, patient phenotyping, and patient clustering. These results highlight
a broader potential for statistical inference in federated, high-dimensional
settings while addressing the practical challenges of data complexity and
multi-institutional integration.
๐
2025-11-04
๐ ArXiv Preprint (cs.NI)
๐ฅ Farhad Rezazadeh, Hatim Chergui, Merouane Debbah, Houbing Song, Dusit Niyato et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We argue that sixth-generation (6G) intelligence is not fluent token
prediction but the capacity to imagine and choose -- to simulate future
scenarios, weigh trade-offs, and act with calibrated uncertainty. We reframe
open radio access network (O-RAN) near-real-time (Near-RT) control via
counterfactual dynamics and a world modeling (WM) paradigm that learns an
action-conditioned generative state space. This enables quantitative "what-if"
forecasting beyond large language models (LLMs) as the primary modeling
primitive. Actions such as physical resource blocks (PRBs) are treated as
first-class control inputs in a causal world model, and both aleatoric and
epistemic uncertainty are modeled for prediction and what-if analysis. An
agentic, model predictive control (MPC)-based cross-entropy method (CEM)
planner operates over short horizons, using prior-mean rollouts within
data-driven PRB bounds to maximize a deterministic reward. The model couples
multi-scale structured state-space mixtures (MS3M) with a compact stochastic
latent to form WM-MS3M, summarizing key performance indicators (KPIs) histories
and predicting next-step KPIs under hypothetical PRB sequences. On realistic
O-RAN traces, WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with
32% fewer parameters and similar latency, and achieves 35-80% lower root mean
squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster
inference, enabling rare-event simulation and offline policy screening.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ilies Chibane, Thomas George, Pierre Nodet, Vincent Lemaire
Machine Learning
ๆฅ็ๆ่ฆ
Mislabeled data is a pervasive issue that undermines the performance of
machine learning systems in real-world applications. An effective approach to
mitigate this problem is to detect mislabeled instances and subject them to
special treatment, such as filtering or relabeling. Automatic mislabeling
detection methods typically rely on training a base machine learning model and
then probing it for each instance to obtain a trust score that each provided
label is genuine or incorrect. The properties of this base model are thus of
paramount importance. In this paper, we investigate the impact of calibrating
this model. Our empirical results show that using calibration methods improves
the accuracy and robustness of mislabeled instance detection, providing a
practical and effective solution for industrial applications.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Adia Khalid, Alina Deriyeva, Benjamin Paassen
ๆฅ็ๆ่ฆ
Knowledge tracing (KT) models are a crucial basis for pedagogical
decision-making, namely which task to select next for a learner and when to
stop teaching a particular skill. Given the high stakes of pedagogical
decisions, KT models are typically required to be interpretable, in the sense
that they should implement an explicit model of human learning and provide
explicit estimates of learners' abilities. However, to our knowledge, no study
to date has investigated whether the interpretability of KT models actually
helps human teachers to make teaching decisions. We address this gap. First, we
perform a simulation study to show that, indeed, decisions based on
interpretable KT models achieve mastery faster compared to decisions based on a
non-interpretable model. Second, we repeat the study but ask $N=12$ human
teachers to make the teaching decisions based on the information provided by KT
models. As expected, teachers rate interpretable KT models higher in terms of
usability and trustworthiness. However, the number of tasks needed until
mastery hardly differs between KT models. This suggests that the relationship
between model interpretability and teacher decisions is not straightforward:
teachers do not solely rely on KT models to make decisions and further research
is needed to investigate how learners and teachers actually understand and use
KT models.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Deyao Chen, Franรงois Clรฉment, Carola Doerr, Nathan Kirk
ๆฅ็ๆ่ฆ
Kernel discrepancies are a powerful tool for analyzing worst-case errors in
quasi-Monte Carlo (QMC) methods. Building on recent advances in optimizing such
discrepancy measures, we extend the subset selection problem to the setting of
kernel discrepancies, selecting an m-element subset from a large population of
size $n \gg m$. We introduce a novel subset selection algorithm applicable to
general kernel discrepancies to efficiently generate low-discrepancy samples
from both the uniform distribution on the unit hypercube, the traditional
setting of classical QMC, and from more general distributions $F$ with known
density functions by employing the kernel Stein discrepancy. We also explore
the relationship between the classical $L_2$ star discrepancy and its
$L_\infty$ counterpart.
๐
2025-11-04
๐ NeurIPS
๐ฅ Georgios Tzannetos, Parameswaran Kamalaruban, Adish Singla
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Training agents to operate under strict constraints during deployment, such
as limited resource budgets or stringent safety requirements, presents
significant challenges, especially when these constraints render the task
complex. In this work, we propose a curriculum learning strategy that gradually
tightens constraints during training, enabling the agent to incrementally
master the deployment requirements. Inspired by self-paced learning techniques
in unconstrained reinforcement learning (RL), our approach facilitates a
smoother transition to challenging environments by initially training on
simplified versions of the constraints and progressively introducing the full
deployment conditions. We provide a theoretical analysis using an RL agent in a
binary-tree Markov Decision Process (MDP) to demonstrate that our curriculum
strategy can accelerate training relative to a baseline approach that imposes
the trajectory constraints from the outset. Moreover, we empirically validate
the effectiveness and generality of our method across both RL and large
language model (LLM) agents in diverse settings, including a binary-tree MDP, a
multi-task navigation domain, and a math reasoning task with two benchmarks.
These results highlight the potential of curriculum design in enhancing the
efficiency and performance of agents operating under complex trajectory
constraints during deployment. Moreover, when applied to LLMs, our strategy
enables compression of output chain-of-thought tokens, achieving a substantial
inference speedup on consumer hardware, demonstrating its effectiveness for
resource-constrained deployment.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohammadsajad Alipour, Mohammad Mohammadi Amiri
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly prevalent across diverse
applications. However, their enormous size limits storage and processing
capabilities to a few well-resourced stakeholders. As a result, most
applications rely on pre-trained LLMs, fine-tuned for specific tasks. However,
even storing the fine-tuned versions of these models remains a significant
challenge due to the wide range of tasks they address. Recently, studies show
that fine-tuning these models primarily affects a small fraction of parameters,
highlighting the need for more efficient storage of fine-tuned models. This
paper focuses on efficient storage of parameter updates in pre-trained models
after fine-tuning. To address this challenge, we leverage the observation that
fine-tuning updates are both low-rank and sparse, which can be utilized for
storage efficiency. However, using only low-rank approximation or
sparsification may discard critical singular components that enhance model
expressivity. We first observe that given the same memory budget, sparsified
low-rank approximations with larger ranks outperform standard low-rank
approximations with smaller ranks. Building on this, we propose our method,
optimal singular damage, that selectively sparsifies low-rank approximated
updates by leveraging the interleaved importance of singular vectors, ensuring
that the most impactful components are retained. We demonstrate through
extensive experiments that our proposed methods lead to significant storage
efficiency and superior accuracy within the same memory budget compared to
employing the low-rank approximation or sparsification individually.
๐
2025-11-04
๐ 29 pages, 14 figures. Invited paper, submitted to
๐ฅ Adam Umra, Aya M. Ahmed, Aydin Sezgin
Machine Learning
ๆฅ็ๆ่ฆ
This paper proposes a reinforcement learning (RL)-aided cognitive framework
for massive MIMO-based integrated sensing and communication (ISAC) systems
employing a uniform planar array (UPA). The focus is on enhancing radar sensing
performance in environments with unknown and dynamic disturbance
characteristics. A Wald-type detector is employed for robust target detection
under non-Gaussian clutter, while a SARSA-based RL algorithm enables adaptive
estimation of target positions without prior environmental knowledge. Based on
the RL-derived sensing information, a joint waveform optimization strategy is
formulated to balance radar sensing accuracy and downlink communication
throughput. The resulting design provides an adaptive trade-off between
detection performance and achievable sum rate through an analytically derived
closed-form solution. Monte Carlo simulations demonstrate that the proposed
cognitive ISAC framework achieves significantly improved detection probability
compared to orthogonal and non-learning adaptive baselines, while maintaining
competitive communication performance. These results underline the potential of
RL-assisted sensing for robust and spectrum-efficient ISAC in next-generation
wireless networks.
๐
2025-11-04
๐ NeurIPS
๐ฅ Giacomo Camposampiero, Pietro Barbiero, Michael Hersche, Roger Wattenhofer, Abbas Rahimi
ๆฅ็ๆ่ฆ
Compositional generalization-a key open challenge in modern machine
learning-requires models to predict unknown combinations of known concepts.
However, assessing compositional generalization remains a fundamental challenge
due to the lack of standardized evaluation protocols and the limitations of
current benchmarks, which often favor efficiency over rigor. At the same time,
general-purpose vision architectures lack the necessary inductive biases, and
existing approaches to endow them compromise scalability. As a remedy, this
paper introduces: 1) a rigorous evaluation framework that unifies and extends
previous approaches while reducing computational requirements from
combinatorial to constant; 2) an extensive and modern evaluation on the status
of compositional generalization in supervised vision backbones, training more
than 5000 models; 3) Attribute Invariant Networks, a class of models
establishing a new Pareto frontier in compositional generalization, achieving a
23.43% accuracy improvement over baselines while reducing parameter overhead
from 600% to 16% compared to fully disentangled counterparts.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Cooper Simpson, Stephen Becker, Alireza Doostan
ๆฅ็ๆ่ฆ
Focusing on implicit neural representations, we present a novel in situ
training protocol that employs limited memory buffers of full and sketched data
samples, where the sketched data are leveraged to prevent catastrophic
forgetting. The theoretical motivation for our use of sketching as a
regularizer is presented via a simple Johnson-Lindenstrauss-informed result.
While our methods may be of wider interest in the field of continual learning,
we specifically target in situ neural compression using implicit neural
representation-based hypernetworks. We evaluate our method on a variety of
complex simulation data in two and three dimensions, over long time horizons,
and across unstructured grids and non-Cartesian geometries. On these tasks, we
show strong reconstruction performance at high compression rates. Most
importantly, we demonstrate that sketching enables the presented in situ scheme
to approximately match the performance of the equivalent offline method.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lihan Xu, Yanjie Dong, Gang Wang, Runhao Zeng, Xiaoyi Fan et al.
ๆฅ็ๆ่ฆ
We investigate robust federated learning, where a group of workers
collaboratively train a shared model under the orchestration of a central
server in the presence of Byzantine adversaries capable of arbitrary and
potentially malicious behaviors. To simultaneously enhance communication
efficiency and robustness against such adversaries, we propose a
Byzantine-resilient Nesterov-Accelerated Federated Learning (Byrd-NAFL)
algorithm. Byrd-NAFL seamlessly integrates Nesterov's momentum into the
federated learning process alongside Byzantine-resilient aggregation rules to
achieve fast and safeguarding convergence against gradient corruption. We
establish a finite-time convergence guarantee for Byrd-NAFL under non-convex
and smooth loss functions with relaxed assumption on the aggregated gradients.
Extensive numerical experiments validate the effectiveness of Byrd-NAFL and
demonstrate the superiority over existing benchmarks in terms of convergence
speed, accuracy, and resilience to diverse Byzantine attack strategies.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) achieve remarkable reasoning capabilities
through transformer architectures with attention mechanisms. However,
transformers suffer from quadratic time and memory complexity in the attention
module (MHA) and require caching key-value states during inference, which
severely limits throughput and scalability. High inference throughput is
critical for agentic tasks, long-context reasoning, efficient deployment under
high request loads, and more efficient test-time compute scaling.
State Space Models (SSMs) such as Mamba offer a promising alternative with
linear inference complexity and a constant memory footprint via recurrent
computation with fixed-size hidden states. In this technical report we
introduce the Apriel-H1 family of hybrid LLMs that combine transformer
attention and SSM sequence mixers for efficient reasoning at 15B model size.
These models are obtained through incremental distillation from a pretrained
reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing
less critical attention layers with linear Mamba blocks.
We release multiple post-distillation variants of Apriel-H1-15B-Thinker with
different SSM-to-MHA ratios and analyse how reasoning performance degrades as
more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant
of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces,
achieving over 2x higher inference throughput when deployed in the
production-ready vLLM environment, with minimal degradation in reasoning
performance. This shows that distilled hybrid SSM-Transformer architectures can
deliver substantial efficiency gains over the pretrained transformer equivalent
without substantially compromising the reasoning quality.
๐
2025-11-04
๐ ArXiv Preprint (cs.DC)
๐ฅ Xiumei Deng, Zehui Xiong, Binbin Chen, Dong In Kim, Merouane Debbah et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) are proliferating rapidly at the edge,
delivering intelligent capabilities across diverse application scenarios.
However, their practical deployment in collaborative scenarios confronts
fundamental challenges: privacy vulnerabilities, communication overhead, and
computational bottlenecks. To address these, we propose Federated Attention
(FedAttn), which integrates the federated paradigm into the self-attention
mechanism, creating a new distributed LLM inference framework that
simultaneously achieves privacy protection, communication efficiency, and
computational efficiency. FedAttn enables participants to perform local
self-attention over their own token representations while periodically
exchanging and aggregating Key-Value (KV) matrices across multiple Transformer
blocks, collaboratively generating LLM responses without exposing private
prompts. Further, we identify a structural duality between contextual
representation refinement in FedAttn and parameter optimization in FL across
private data, local computation, and global aggregation. This key insight
provides a principled foundation for systematically porting federated
optimization techniques to collaborative LLM inference. Building on this
framework, we theoretically analyze how local self-attention computation within
participants and heterogeneous token relevance among participants shape error
propagation dynamics across Transformer blocks. Moreover, we characterize the
fundamental trade-off between response quality and communication/computation
efficiency, which is governed by the synchronization interval and the number of
participants. Experimental results validate our theoretical analysis, and
reveal significant optimization opportunities through sparse attention and
adaptive KV aggregation, highlighting FedAttn's potential to deliver
scalability and efficiency in real-world edge deployments.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tiziano Balaconi, Aldo Glielmo, Marco Taboga
Machine Learning
ๆฅ็ๆ่ฆ
We introduce GasRL, a simulator that couples a calibrated representation of
the natural gas market with a model of storage-operator policies trained with
deep reinforcement learning (RL). We use it to analyse how optimal stockpile
management affects equilibrium prices and the dynamics of demand and supply. We
test various RL algorithms and find that Soft Actor Critic (SAC) exhibits
superior performance in the GasRL environment: multiple objectives of storage
operators - including profitability, robust market clearing and price
stabilisation - are successfully achieved. Moreover, the equilibrium price
dynamics induced by SAC-derived optimal policies have characteristics, such as
volatility and seasonality, that closely match those of real-world prices.
Remarkably, this adherence to the historical distribution of prices is obtained
without explicitly calibrating the model to price data. We show how the
simulator can be used to assess the effects of EU-mandated minimum storage
thresholds. We find that such thresholds have a positive effect on market
resilience against unanticipated shifts in the distribution of supply shocks.
For example, with unusually large shocks, market disruptions are averted more
often if a threshold is in place.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ David Kattermann, Lothar Sebastian Krapp
ๆฅ็ๆ่ฆ
We study computable probably approximately correct (CPAC) learning, where
learners are required to be computable functions. It had been previously
observed that the Fundamental Theorem of Statistical Learning, which
characterizes PAC learnability by finiteness of the Vapnik-Chervonenkis
(VC-)dimension, no longer holds in this framework. Recent works recovered
analogs of the Fundamental Theorem in the computable setting, for instance by
introducing an effective VC-dimension. Guided by this, we investigate the
connection between CPAC learning and recursively enumerable representable (RER)
classes, whose members can be algorithmically listed. Our results show that the
effective VC-dimensions can take arbitrary values above the traditional one,
even for RER classes, which creates a whole family of (non-)examples for
various notions of CPAC learning. Yet the two dimensions coincide for classes
satisfying sufficiently strong notions of CPAC learning. We then observe that
CPAC learnability can also be characterized via containment of RER classes that
realize the same samples. Furthermore, it is shown that CPAC learnable classes
satisfying a unique identification property are necessarily RER. Finally, we
establish that agnostic learnability can be guaranteed for RER classes, by
considering the relaxed notion of nonuniform CPAC learning.
๐
2025-11-04
๐ ArXiv Preprint (math.NA)
๐ฅ Xinliang Liu, Tong Mao, Jinchao Xu
Machine Learning
ๆฅ็ๆ่ฆ
We present an estimation of the condition numbers of the \emph{mass} and
\emph{stiffness} matrices arising from shallow ReLU$^k$ neural networks defined
on the unit sphere~$\mathbb{S}^d$. In particular, when $\{\theta_j^*\}_{j=1}^n
\subset \mathbb{S}^d$ is \emph{antipodally quasi-uniform}, the condition number
is sharp. Indeed, in this case, we obtain sharp asymptotic estimates for the
full spectrum of eigenvalues and characterize the structure of the
corresponding eigenspaces, showing that the smallest eigenvalues are associated
with an eigenbasis of low-degree polynomials while the largest eigenvalues are
linked to high-degree polynomials. This spectral analysis establishes a precise
correspondence between the approximation power of the network and its numerical
stability.
๐
2025-11-04
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Roy Rinberg, Adam Karvonen, Alex Hoover, Daniel Reuter, Keri Warr
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
As large AI models become increasingly valuable assets, the risk of model
weight exfiltration from inference servers grows accordingly. An attacker
controlling an inference server may exfiltrate model weights by hiding them
within ordinary model outputs, a strategy known as steganography. This work
investigates how to verify model responses to defend against such attacks and,
more broadly, to detect anomalous or buggy behavior during inference. We
formalize model exfiltration as a security game, propose a verification
framework that can provably mitigate steganographic exfiltration, and specify
the trust assumptions associated with our scheme. To enable verification, we
characterize valid sources of non-determinism in large language model inference
and introduce two practical estimators for them. We evaluate our detection
framework on several open-weight models ranging from 3B to 30B parameters. On
MOE-Qwen-30B, our detector reduces exfiltratable information to <0.5% with
false-positive rate of 0.01%, corresponding to a >200x slowdown for
adversaries. Overall, this work further establishes a foundation for defending
against model weight exfiltration and demonstrates that strong protection can
be achieved with minimal additional cost to inference providers.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammed Al-Jaff, Giovanni Luca Marchetti, Michael C Welle, Jens Lundell, Mats G. Gustafsson et al.
Machine Learning
ๆฅ็ๆ่ฆ
Idempotent Generative Networks (IGNs) are deep generative models that also
function as local data manifold projectors, mapping arbitrary inputs back onto
the manifold. They are trained to act as identity operators on the data and as
idempotent operators off the data manifold. However, IGNs suffer from mode
collapse, mode dropping, and training instability due to their objectives,
which contain adversarial components and can cause the model to cover the data
manifold only partially -- an issue shared with generative adversarial
networks. We introduce Non-Adversarial Idempotent Generative Networks (NAIGNs)
to address these issues. Our loss function combines reconstruction with the
non-adversarial generative objective of Implicit Maximum Likelihood Estimation
(IMLE). This improves on IGN's ability to restore corrupted data and generate
new samples that closely match the data distribution. We moreover demonstrate
that NAIGNs implicitly learn the distance field to the data manifold, as well
as an energy-based model.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nadia Daoudi, Ivan Alfonso, Jordi Cabot
Machine Learning
ๆฅ็ๆ่ฆ
The development of smart systems (i.e., systems enhanced with AI components)
has thrived thanks to the rapid advancements in neural networks (NNs). A wide
range of libraries and frameworks have consequently emerged to support NN
design and implementation. The choice depends on factors such as available
functionalities, ease of use, documentation and community support. After
adopting a given NN framework, organizations might later choose to switch to
another if performance declines, requirements evolve, or new features are
introduced. Unfortunately, migrating NN implementations across libraries is
challenging due to the lack of migration approaches specifically tailored for
NNs. This leads to increased time and effort to modernize NNs, as manual
updates are necessary to avoid relying on outdated implementations and ensure
compatibility with new features. In this paper, we propose an approach to
automatically migrate neural network code across deep learning frameworks. Our
method makes use of a pivot NN model to create an abstraction of the NN prior
to migration. We validate our approach using two popular NN frameworks, namely
PyTorch and TensorFlow. We also discuss the challenges of migrating code
between the two frameworks and how they were approached in our method.
Experimental evaluation on five NNs shows that our approach successfully
migrates their code and produces NNs that are functionally equivalent to the
originals. Artefacts from our work are available online.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chitro Majumdar, Sergio Scandizzo, Ratanlal Mahanta, Avradip Mandal, Swarnendu Bhattacharjee
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce Omega^2, a Large Language Model-driven framework for corporate
credit scoring that combines structured financial data with advanced machine
learning to improve predictive reliability and interpretability. Our study
evaluates Omega^2 on a multi-agency dataset of 7,800 corporate credit ratings
drawn from Moody's, Standard & Poor's, Fitch, and Egan-Jones, each containing
detailed firm-level financial indicators such as leverage, profitability, and
liquidity ratios. The system integrates CatBoost, LightGBM, and XGBoost models
optimized through Bayesian search under temporal validation to ensure
forward-looking and reproducible results. Omega^2 achieved a mean test AUC
above 0.93 across agencies, confirming its ability to generalize across rating
systems and maintain temporal consistency. These results show that combining
language-based reasoning with quantitative learning creates a transparent and
institution-grade foundation for reliable corporate credit-risk assessment.
๐
2025-11-04
๐ ArXiv Preprint (cs.IT)
๐ฅ Mark Blรผmel, Andreas C. Schneider, Valentin Neuhaus, David A. Ehrlich, Marcel Graetz et al.
ๆฅ็ๆ่ฆ
Associative memory, traditionally modeled by Hopfield networks, enables the
retrieval of previously stored patterns from partial or noisy cues. Yet, the
local computational principles which are required to enable this function
remain incompletely understood. To formally characterize the local information
processing in such systems, we employ a recent extension of information theory
- Partial Information Decomposition (PID). PID decomposes the contribution of
different inputs to an output into unique information from each input,
redundant information across inputs, and synergistic information that emerges
from combining different inputs. Applying this framework to individual neurons
in classical Hopfield networks we find that below the memory capacity, the
information in a neuron's activity is characterized by high redundancy between
the external pattern input and the internal recurrent input, while synergy and
unique information are close to zero until the memory capacity is surpassed and
performance drops steeply. Inspired by this observation, we use redundancy as
an information-theoretic learning goal, which is directly optimized for each
neuron, dramatically increasing the network's memory capacity to 1.59, a more
than tenfold improvement over the 0.14 capacity of classical Hopfield networks
and even outperforming recent state-of-the-art implementations of Hopfield
networks. Ultimately, this work establishes redundancy maximization as a new
design principle for associative memories and opens pathways for new
associative memory models based on information-theoretic goals.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Despite the remarkable success of text-to-image diffusion models, their
output of a single, flattened image remains a critical bottleneck for
professional applications requiring layer-wise control. Existing solutions
either rely on fine-tuning with large, inaccessible datasets or are
training-free yet limited to generating isolated foreground elements, failing
to produce a complete and coherent scene. To address this, we introduce the
Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a
novel framework for zero-shot, layer-wise image generation. Our core technique,
Noise Transplantation and Cultivation (NTC), extracts intermediate latent
representations from both foreground and composite generation processes,
transplanting them into the initial noise for subsequent layers. This ensures
semantic and structural coherence across foreground, background, and composite
layers, enabling consistent, multi-layered outputs without requiring
fine-tuning or auxiliary datasets. Extensive experiments show that our
training-free method achieves performance comparable to fine-tuned methods,
enhancing layer-wise consistency while maintaining high image quality and
fidelity. TAUE not only eliminates costly training and dataset requirements but
also unlocks novel downstream applications, such as complex compositional
editing, paving the way for more accessible and controllable generative
workflows.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gilad Karpel, Ruida Zhou, Shoham Sabach, Mohammad Ghavamzadeh
MultimodalMachine Learning
ๆฅ็ๆ่ฆ
Proximal Policy Optimization (PPO) is widely regarded as one of the most
successful deep reinforcement learning algorithms, known for its robustness and
effectiveness across a range of problems.
The PPO objective encourages the importance ratio between the current and
behavior policies to move to the "right" direction -- starting from importance
sampling ratios equal to 1, increasing the ratios for actions with positive
advantages and decreasing those with negative advantages. A clipping function
is introduced to prevent over-optimization when updating the importance ratio
in these "right" direction regions. Many PPO variants have been proposed to
extend its success, most of which modify the objective's behavior by altering
the clipping in the "right" direction regions. However, due to randomness in
the rollouts and stochasticity of the policy optimization, we observe that the
ratios frequently move to the "wrong" direction during the PPO optimization.
This is a key factor hindering the improvement of PPO, but it has been largely
overlooked. To address this, we propose the Directional-Clamp PPO algorithm
(DClamp-PPO), which further penalizes the actions going to the strict "wrong"
direction regions, where the advantage is positive (negative) and importance
ratio falls below (above) $1 - \beta$ ($1+\beta$),
for a tunable parameter $\beta \in (0, 1)$. The penalty is by enforcing a
steeper loss slope, i.e., a clamp, in those regions. We demonstrate that
DClamp-PPO consistently outperforms PPO, as well as its variants, by focusing
on modifying the objective's behavior in the "right" direction, across various
MuJoCo environments, using different random seeds. The proposed method is
shown, both theoretically and empirically, to better avoid "wrong" direction
updates while keeping the importance ratio closer to 1.
๐
2025-11-04
๐ Submitted to IEEE ICC 2026
๐ฅ Anastasios T. Sotiropoulos, Stavros Tsimpoukis, Dimitrios Tyrovolas, Sotiris Ioannidis, George K. Karagiannidis et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The pursuit of immersive and structurally aware multimedia experiences has
intensified interest in sensing modalities that reconstruct objects beyond the
limits of visible light. Conventional optical pipelines degrade under occlusion
or low illumination, motivating the use of radio-frequency (RF) sensing, whose
electromagnetic waves penetrate materials and encode both geometric and
compositional information. Yet, uncontrolled multipath propagation restricts
reconstruction accuracy. Recent advances in Programmable Wireless Environments
(PWEs) mitigate this limitation by enabling software-defined manipulation of
propagation through Reconfigurable Intelligent Surfaces (RISs), thereby
providing controllable illumination diversity. Building on this capability,
this work introduces a PWE-driven RF framework for three-dimensional object
reconstruction using material-aware spherical primitives. The proposed approach
combines RIS-enabled field synthesis with a Detection Transformer (DETR) that
infers spatial and material parameters directly from extracted RF features.
Simulation results confirm the framework's ability to approximate object
geometries and classify material composition with an overall accuracy of
79.35%, marking an initial step toward programmable and physically grounded
RF-based 3D object composition visualization.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lukas Fehring, Marcel Wever, Maximilian Spliethรถver, Leona Hennig, Henning Wachsmuth et al.
Natural Language ProcessingRoboticsComputer VisionMachine Learning
ๆฅ็ๆ่ฆ
Hyperparameter optimization (HPO), for example, based on Bayesian
optimization (BO), supports users in designing models well-suited for a given
dataset. HPO has proven its effectiveness on several applications, ranging from
classical machine learning for tabular data to deep neural networks for
computer vision and transformers for natural language processing. However, HPO
still sometimes lacks acceptance by machine learning experts due to its
black-box nature and limited user control. Addressing this, first approaches
have been proposed to initialize BO methods with expert knowledge. However,
these approaches do not allow for online steering during the optimization
process. In this paper, we introduce a novel method that enables repeated
interventions to steer BO via user input, specifying expert knowledge and user
preferences at runtime of the HPO process in the form of prior distributions.
To this end, we generalize an existing method, $\pi$BO, preserving theoretical
guarantees. We also introduce a misleading prior detection scheme, which allows
protection against harmful user inputs. In our experimental evaluation, we
demonstrate that our method can effectively incorporate multiple priors,
leveraging informative priors, whereas misleading priors are reliably rejected
or overcome. Thereby, we achieve competitiveness to unperturbed BO.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji
Machine Learning
ๆฅ็ๆ่ฆ
Offline reinforcement learning (RL) suffers from extrapolation errors induced
by out-of-distribution (OOD) actions. To address this, offline RL algorithms
typically impose constraints on action selection, which can be systematically
categorized into density, support, and sample constraints. However, we show
that each category has inherent limitations: density and sample constraints
tend to be overly conservative in many scenarios, while the support constraint,
though least restrictive, faces challenges in accurately modeling the behavior
policy. To overcome these limitations, we propose a new neighborhood constraint
that restricts action selection in the Bellman target to the union of
neighborhoods of dataset actions. Theoretically, the constraint not only bounds
extrapolation errors and distribution shift under certain conditions, but also
approximates the support constraint without requiring behavior policy modeling.
Moreover, it retains substantial flexibility and enables pointwise conservatism
by adapting the neighborhood radius for each data point. In practice, we employ
data quality as the adaptation criterion and design an adaptive neighborhood
constraint. Building on an efficient bilevel optimization framework, we develop
a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning
(ANQ), to perform Q learning with target actions satisfying this constraint.
Empirically, ANQ achieves state-of-the-art performance on standard offline RL
benchmarks and exhibits strong robustness in scenarios with noisy or limited
data.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ali Farki, Elaheh Moradi, Deepika Koundal, Jussi Tohka
Machine Learning
ๆฅ็ๆ่ฆ
Predicting future brain state from a baseline magnetic resonance image (MRI)
is a central challenge in neuroimaging and has important implications for
studying neurodegenerative diseases such as Alzheimer's disease (AD). Most
existing approaches predict future cognitive scores or clinical outcomes, such
as conversion from mild cognitive impairment to dementia. Instead, here we
investigate longitudinal MRI image-to-image prediction that forecasts a
participant's entire brain MRI several years into the future, intrinsically
modeling complex, spatially distributed neurodegenerative patterns. We
implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR,
Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL).
Predicted follow-up MRIs are directly compared with the actual follow-up scans
using metrics that capture global similarity and local differences. The best
performing models achieve high-fidelity predictions, and all models generalize
well to an independent external dataset, demonstrating robust cross-cohort
performance. Our results indicate that deep learning can reliably predict
participant-specific brain MRI at the voxel level, offering new opportunities
for individualized prognosis.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mathieu Chevalley, Arash Mehrjou, Patrick Schwab
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate theoretical guarantees for the false-negative rate (FNR) --
the fraction of true causal edges whose orientation is not recovered, under
single-variable random interventions and an $\epsilon$-interventional
faithfulness assumption that accommodates latent confounding. For sparse
Erd\H{o}s--R\'enyi directed acyclic graphs, where the edge probability scales
as $p_e = \Theta(1/d)$, we show that the FNR concentrates around its mean at
rate $O(\frac{\log d}{\sqrt d})$, implying that large deviations above the
expected error become exponentially unlikely as dimensionality increases. This
concentration ensures that derived upper bounds hold with high probability in
large-scale settings. Extending the analysis to generalized Barab\'asi--Albert
graphs reveals an even stronger phenomenon: when the degree exponent satisfies
$\gamma > 3$, the deviation width scales as $O(d^{\beta - \frac{1}{2}})$ with
$\beta = 1/(\gamma - 1) < \frac{1}{2}$, and hence vanishes in the limit. This
demonstrates that realistic scale-free topologies intrinsically regularize
causal discovery, reducing variability in orientation error. These
finite-dimension results provide the first dimension-adaptive,
faithfulness-robust guarantees for causal structure recovery, and challenge the
intuition that high dimensionality and network heterogeneity necessarily hinder
accurate discovery. Our simulation results corroborate these theoretical
predictions, showing that the FNR indeed concentrates and often vanishes in
practice as dimensionality grows.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hortence Nana, Andreas Athanasopoulos, Christos Dimitrakakis
ๆฅ็ๆ่ฆ
We study a many-to-one matching problem, such as the college admission
problem, where each college can admit multiple students. Unlike classical
models, colleges evaluate sets of students through non-linear utility functions
that capture diversity between them. In this setting, we show that classical
stable matchings may fail to exist. To address this, we propose alternative
solution concepts based on Rawlsian fairness, aiming to maximize the minimum
utility across colleges. We design both deterministic and stochastic algorithms
that iteratively improve the outcome of the worst-off college, offering a
practical approach to fair allocation when stability cannot be guaranteed.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jorge Pellejero, Luis A. Hernรกndez Gรณmez, Luis Mendo Tomรกs, Zoraida Frias Barroso
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Agentic AI represents a new paradigm for automating complex systems by using
Large AI Models (LAMs) to provide human-level cognitive abilities with
multimodal perception, planning, memory, and reasoning capabilities. This will
lead to a new generation of AI systems that autonomously decompose goals,
retain context over time, learn continuously, operate across tools and
environments, and adapt dynamically. The complexity of 5G and upcoming 6G
networks renders manual optimization ineffective, pointing to Agentic AI as a
method for automating decisions in dynamic RAN environments. However, despite
its rapid advances, there is no established framework outlining the
foundational components and operational principles of Agentic AI systems nor a
universally accepted definition.
This paper contributes to ongoing research on Agentic AI in 5G and 6G
networks by outlining its core concepts and then proposing a practical use case
that applies Agentic principles to RAN optimization. We first introduce Agentic
AI, tracing its evolution from classical agents and discussing the progress
from workflows and simple AI agents to Agentic AI. Core design
patterns-reflection, planning, tool use, and multi-agent collaboration-are then
described to illustrate how intelligent behaviors are orchestrated. These
theorical concepts are grounded in the context of mobile networks, with a focus
on RAN management and optimization. A practical 5G RAN case study shows how
time-series analytics and LAM-driven agents collaborate for KPI-based
autonomous decision-making.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Munib Mesinovic, Max Buhlan, Tingting Zhu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Healthcare artificial intelligence systems routinely fail when deployed
across institutions, with documented performance drops and perpetuation of
discriminatory patterns embedded in historical data. This brittleness stems, in
part, from learning statistical associations rather than causal mechanisms.
Causal graph neural networks address this triple crisis of distribution shift,
discrimination, and inscrutability by combining graph-based representations of
biomedical data with causal inference principles to learn invariant mechanisms
rather than spurious correlations. This Review examines methodological
foundations spanning structural causal models, disentangled causal
representation learning, and techniques for interventional prediction and
counterfactual reasoning on graphs. We analyse applications demonstrating
clinical value across psychiatric diagnosis through brain network analysis,
cancer subtyping via multi-omics causal integration, continuous physiological
monitoring with mechanistic interpretation, and drug recommendation correcting
prescription bias. These advances establish foundations for patient-specific
Causal Digital Twins, enabling in silico clinical experimentation, with
integration of large language models for hypothesis generation and causal graph
neural networks for mechanistic validation. Substantial barriers remain,
including computational requirements precluding real-time deployment,
validation challenges demanding multi-modal evidence triangulation beyond
cross-validation, and risks of causal-washing where methods employ causal
terminology without rigorous evidentiary support. We propose tiered frameworks
distinguishing causally-inspired architectures from causally-validated
discoveries and identify critical research priorities making causal rather than
purely associational claims.
๐
2025-11-04
๐ will be submitted to Journal of Guidance, Control,
๐ฅ Marc Schneider, Walter Fichter
Robotics
ๆฅ็ๆ่ฆ
This paper presents a novel approach to many-vs-many missile guidance using
virtual targets (VTs) generated by a Normalizing Flows-based trajectory
predictor. Rather than assigning n interceptors directly to m physical targets
through conventional weapon target assignment algorithms, we propose a
centralized strategy that constructs n VT trajectories representing
probabilistic predictions of maneuvering target behavior. Each interceptor is
guided toward its assigned VT using Zero-Effort-Miss guidance during midcourse
flight, transitioning to Proportional Navigation guidance for terminal
interception. This approach treats many-vs-many engagements as
many-vs-distribution scenarios, exploiting numerical superiority (n > m) by
distributing interceptors across diverse trajectory hypotheses rather than
pursuing identical deterministic predictions. Monte Carlo simulations across
various target-interceptor configurations (1-6 targets, 1-8 interceptors)
demonstrate that the VT method matches or exceeds baseline straight-line
prediction performance by 0-4.1% when n = m, with improvements increasing to
5.8-14.4% when n > m. The results confirm that probabilistic VTs enable
effective exploitation of numerical superiority, significantly increasing
interception probability in many-vs-many scenarios.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen
Machine Learning
ๆฅ็ๆ่ฆ
The capacitated location-routing problems (CLRPs) are classical problems in
combinatorial optimization, which require simultaneously making location and
routing decisions. In CLRPs, the complex constraints and the intricate
relationships between various decisions make the problem challenging to solve.
With the emergence of deep reinforcement learning (DRL), it has been
extensively applied to address the vehicle routing problem and its variants,
while the research related to CLRPs still needs to be explored. In this paper,
we propose the DRL with heterogeneous query (DRLHQ) to solve CLRP and open CLRP
(OCLRP), respectively. We are the first to propose an end-to-end learning
approach for CLRPs, following the encoder-decoder structure. In particular, we
reformulate the CLRPs as a markov decision process tailored to various
decisions, a general modeling framework that can be adapted to other DRL-based
methods. To better handle the interdependency across location and routing
decisions, we also introduce a novel heterogeneous querying attention mechanism
designed to adapt dynamically to various decision-making stages. Experimental
results on both synthetic and benchmark datasets demonstrate superior solution
quality and better generalization performance of our proposed approach over
representative traditional and DRL-based baselines in solving both CLRP and
OCLRP.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ronald Katende
Robotics
ๆฅ็ๆ่ฆ
We propose a unified information-geometric framework that formalizes
understanding in learning as a trade-off between informativeness and geometric
simplicity. An encoder phi is evaluated by U(phi) = I(phi(X); Y) - beta *
C(phi), where C(phi) penalizes curvature and intrinsic dimensionality,
enforcing smooth, low-complexity manifolds. Under mild manifold and regularity
assumptions, we derive non-asymptotic bounds showing that generalization error
scales with intrinsic dimension while curvature controls approximation
stability, directly linking geometry to sample efficiency. To operationalize
this theory, we introduce the Variational Geometric Information Bottleneck
(V-GIB), a variational estimator that unifies mutual-information compression
and curvature regularization through tractable geometric proxies such as the
Hutchinson trace, Jacobian norms, and local PCA. Experiments across synthetic
manifolds, few-shot settings, and real-world datasets (Fashion-MNIST, CIFAR-10)
reveal a robust information-geometry Pareto frontier, stable estimators, and
substantial gains in interpretive efficiency. Fractional-data experiments on
CIFAR-10 confirm that curvature-aware encoders maintain predictive power under
data scarcity, validating the predicted efficiency-curvature law. Overall,
V-GIB provides a principled and measurable route to representations that are
geometrically coherent, data-efficient, and aligned with human-understandable
structure.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rajan Das Gupta, Md Kishor Morol, Nafiz Fahad, Md Tanzib Hosain, Sumaya Binte Zilani Choya et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As the global burden of Alzheimer's disease (AD) continues to grow, early and
accurate detection has become increasingly critical, especially in regions with
limited access to advanced diagnostic tools. We propose BRAINS (Biomedical
Retrieval-Augmented Intelligence for Neurodegeneration Screening) to address
this challenge. This novel system harnesses the powerful reasoning capabilities
of Large Language Models (LLMs) for Alzheimer's detection and monitoring.
BRAINS features a dual-module architecture: a cognitive diagnostic module and a
case-retrieval module. The Diagnostic Module utilizes LLMs fine-tuned on
cognitive and neuroimaging datasets -- including MMSE, CDR scores, and brain
volume metrics -- to perform structured assessments of Alzheimer's risk.
Meanwhile, the Case Retrieval Module encodes patient profiles into latent
representations and retrieves similar cases from a curated knowledge base.
These auxiliary cases are fused with the input profile via a Case Fusion Layer
to enhance contextual understanding. The combined representation is then
processed with clinical prompts for inference. Evaluations on real-world
datasets demonstrate BRAINS effectiveness in classifying disease severity and
identifying early signs of cognitive decline. This system not only shows strong
potential as an assistive tool for scalable, explainable, and early-stage
Alzheimer's disease detection, but also offers hope for future applications in
the field.
๐
2025-11-04
๐ ArXiv Preprint (cs.DS)
๐ฅ Weiming Feng, Xiongxin Yang, Yixiao Yu, Yiyao Zhang
ๆฅ็ๆ่ฆ
We study the problem of learning a $n$-variables $k$-CNF formula $\Phi$ from
its i.i.d. uniform random solutions, which is equivalent to learning a Boolean
Markov random field (MRF) with $k$-wise hard constraints. Revisiting Valiant's
algorithm (Commun. ACM'84), we show that it can exactly learn (1) $k$-CNFs with
bounded clause intersection size under Lov\'asz local lemma type conditions,
from $O(\log n)$ samples; and (2) random $k$-CNFs near the satisfiability
threshold, from $\widetilde{O}(n^{\exp(-\sqrt{k})})$ samples. These results
significantly improve the previous $O(n^k)$ sample complexity. We further
establish new information-theoretic lower bounds on sample complexity for both
exact and approximate learning from i.i.d. uniform random solutions.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mohammad Sadegh Eshaghi, Cosmin Anitescu, Navid Valizadeh, Yizheng Wang, Xiaoying Zhuang et al.
ๆฅ็ๆ่ฆ
Partial differential equations (PDEs) underpin quantitative descriptions
across the physical sciences and engineering, yet high-fidelity simulation
remains a major computational bottleneck for many-query, real-time, and design
tasks. Data-driven surrogates can be strikingly fast but are often unreliable
when applied outside their training distribution. Here we introduce Neural
Operator Warm Starts (NOWS), a hybrid strategy that harnesses learned solution
operators to accelerate classical iterative solvers by producing high-quality
initial guesses for Krylov methods such as conjugate gradient and GMRES. NOWS
leaves existing discretizations and solver infrastructures intact, integrating
seamlessly with finite-difference, finite-element, isogeometric analysis,
finite volume method, etc. Across our benchmarks, the learned initialization
consistently reduces iteration counts and end-to-end runtime, resulting in a
reduction of the computational time of up to 90 %, while preserving the
stability and convergence guarantees of the underlying numerical algorithms. By
combining the rapid inference of neural operators with the rigor of traditional
solvers, NOWS provides a practical and trustworthy approach to accelerate
high-fidelity PDE simulations.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xuan-Truong Quan, Xuan-Son Quan, Duc Do Minh, Vinh Nguyen Van
ๆฅ็ๆ่ฆ
Knowledge graph embedding (KGE) has become a fundamental technique for
representation learning on multi-relational data. Many seminal models, such as
TransE, operate in an unbounded Euclidean space, which presents inherent
limitations in modeling complex relations and can lead to inefficient training.
In this paper, we propose Spherical Knowledge Graph Embedding (SKGE), a model
that challenges this paradigm by constraining entity representations to a
compact manifold: a hypersphere. SKGE employs a learnable, non-linear
Spherization Layer to map entities onto the sphere and interprets relations as
a hybrid translate-then-project transformation. Through extensive experiments
on three benchmark datasets, FB15k-237, CoDEx-S, and CoDEx-M, we demonstrate
that SKGE consistently and significantly outperforms its strong Euclidean
counterpart, TransE, particularly on large-scale benchmarks such as FB15k-237
and CoDEx-M, demonstrating the efficacy of the spherical geometric prior. We
provide an in-depth analysis to reveal the sources of this advantage, showing
that this geometric constraint acts as a powerful regularizer, leading to
comprehensive performance gains across all relation types. More fundamentally,
we prove that the spherical geometry creates an "inherently hard negative
sampling" environment, naturally eliminating trivial negatives and forcing the
model to learn more robust and semantically coherent representations. Our
findings compellingly demonstrate that the choice of manifold is not merely an
implementation detail but a fundamental design principle, advocating for
geometric priors as a cornerstone for designing the next generation of powerful
and stable KGE models.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Thomas Sanchez, Pedro M. Gordaliza, Meritxell Bach Cuadra
Machine Learning
ๆฅ็ๆ่ฆ
Machine learning methods are increasingly applied in medical imaging, yet
many reported improvements lack statistical robustness: recent works have
highlighted that small but significant performance gains are highly likely to
be false positives. However, these analyses do not take
\emph{underspecification} into account -- the fact that models achieving
similar validation scores may behave differently on unseen data due to random
initialization or training dynamics. Here, we extend a recent statistical
framework modeling false outperformance claims to include underspecification as
an additional variance component. Our simulations demonstrate that even modest
seed variability ($\sim1\%$) substantially increases the evidence required to
support superiority claims. Our findings underscore the need for explicit
modeling of training variance when validating medical imaging systems.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Junghee Pyeon, Davide Cacciarelli, Kamran Paynabar
ๆฅ็ๆ่ฆ
Concept drift and label scarcity are two critical challenges limiting the
robustness of predictive models in dynamic industrial environments. Existing
drift detection methods often assume global shifts and rely on dense
supervision, making them ill-suited for regression tasks with local drifts and
limited labels. This paper proposes an adaptive sampling framework that
combines residual-based exploration and exploitation with EWMA monitoring to
efficiently detect local concept drift under labeling budget constraints.
Empirical results on synthetic benchmarks and a case study on electricity
market demonstrate superior performance in label efficiency and drift detection
accuracy.
๐
2025-11-04
๐ Accepted to AISec'25 co-located with the 32nd ACM
๐ฅ Virgile Dine, Teddy Furon, Charly Faure
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
We formulate the machine unlearning problem as a general constrained
optimization problem. It unifies the first-order methods from the approximate
machine unlearning literature. This paper then introduces the concept of
feasible updates as the model's parameter update directions that help with
unlearning while not degrading the utility of the initial model. Our design of
feasible updates is based on masking, \ie\ a careful selection of the model's
parameters worth updating. It also takes into account the estimation noise of
the gradients when processing each batch of data to offer a statistical
guarantee to derive locally feasible updates. The technique can be plugged in,
as an add-on, to any first-order approximate unlearning methods. Experiments
with computer vision classifiers validate this approach.
๐
2025-11-04
๐ ArXiv Preprint (math.CO)
๐ฅ Christoph Hertrich, Stefan Kober, Georg Loho
Machine Learning
ๆฅ็ๆ่ฆ
We prove that there exist uniform $(+,\times,/)$-circuits of size $O(n^3)$ to
compute the basis generating polynomial of regular matroids on $n$ elements. By
tropicalization, this implies that there exist uniform $(\max,+,-)$-circuits
and ReLU neural networks of the same size for weighted basis maximization of
regular matroids. As a consequence in linear programming theory, we obtain a
first example where taking the difference of two extended formulations can be
more efficient than the best known individual extended formulation of size
$O(n^6)$ by Aprile and Fiorini. Such differences have recently been introduced
as virtual extended formulations. The proof of our main result relies on a
fine-tuned version of Seymour's decomposition of regular matroids which allows
us to identify and maintain graphic substructures to which we can apply a local
version of the star-mesh transformation.
๐
2025-11-04
๐ ArXiv Preprint (eess.IV)
๐ฅ Yalda Zafari, Hongyi Pan, Gorkem Durak, Ulas Bagci, Essam A. Rashed et al.
Computer Vision
ๆฅ็ๆ่ฆ
The development of clinically reliable artificial intelligence (AI) systems
for mammography is hindered by profound heterogeneity in data quality, metadata
standards, and population distributions across public datasets. This
heterogeneity introduces dataset-specific biases that severely compromise the
generalizability of the model, a fundamental barrier to clinical deployment. We
present MammoClean, a public framework for standardization and bias
quantification in mammography datasets. MammoClean standardizes case selection,
image processing (including laterality and intensity correction), and unifies
metadata into a consistent multi-view structure. We provide a comprehensive
review of breast anatomy, imaging characteristics, and public mammography
datasets to systematically identify key sources of bias. Applying MammoClean to
three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify
substantial distributional shifts in breast density and abnormality prevalence.
Critically, we demonstrate the direct impact of data corruption: AI models
trained on corrupted datasets exhibit significant performance degradation
compared to their curated counterparts. By using MammoClean to identify and
mitigate bias sources, researchers can construct unified multi-dataset training
corpora that enable development of robust models with superior cross-domain
generalization. MammoClean provides an essential, reproducible pipeline for
bias-aware AI development in mammography, facilitating fairer comparisons and
advancing the creation of safe, effective systems that perform equitably across
diverse patient populations and clinical settings. The open-source code is
publicly available from: https://github.com/Minds-R-Lab/MammoClean.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gennaro Guidone, Luca Monegaglia, Elia Raimondi, Han Wang, Mattia Bianchi et al.
Robotics
ๆฅ็ๆ่ฆ
We present a novel decentralized algorithm for coverage control in unknown
spatial environments modeled by Gaussian Processes (GPs). To trade-off between
exploration and exploitation, each agent autonomously determines its trajectory
by minimizing a local cost function. Inspired by the GP-UCB (Upper Confidence
Bound for GPs) acquisition function, the proposed cost combines the expected
locational cost with a variance-based exploration term, guiding agents toward
regions that are both high in predicted density and model uncertainty. Compared
to previous work, our algorithm operates in a fully decentralized fashion,
relying only on local observations and communication with neighboring agents.
In particular, agents periodically update their inducing points using a greedy
selection strategy, enabling scalable online GP updates. We demonstrate the
effectiveness of our algorithm in simulation.
๐
2025-11-04
๐ Accepted for publication at IEEE International Con
๐ฅ Leon Schwarzer, Matthias Zeller, Daniel Casado Herraez, Simon Dierl, Michael Heidingsfeld et al.
Robotics
ๆฅ็ๆ่ฆ
Moving object segmentation is a crucial task for safe and reliable autonomous
mobile systems like self-driving cars, improving the reliability and robustness
of subsequent tasks like SLAM or path planning. While the segmentation of
camera or LiDAR data is widely researched and achieves great results, it often
introduces an increased latency by requiring the accumulation of temporal
sequences to gain the necessary temporal context. Radar sensors overcome this
problem with their ability to provide a direct measurement of a point's Doppler
velocity, which can be exploited for single-scan moving object segmentation.
However, radar point clouds are often sparse and noisy, making data annotation
for use in supervised learning very tedious, time-consuming, and
cost-intensive. To overcome this problem, we address the task of
self-supervised moving object segmentation of sparse and noisy radar point
clouds. We follow a two-step approach of contrastive self-supervised
representation learning with subsequent supervised fine-tuning using limited
amounts of annotated data. We propose a novel clustering-based contrastive loss
function with cluster refinement based on dynamic points removal to pretrain
the network to produce motion-aware representations of the radar data. Our
method improves label efficiency after fine-tuning, effectively boosting
state-of-the-art performance by self-supervised pretraining.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rohith Shinoj Kumar, Rushdeep Dinda, Aditya Tyagi, Annappa B., Naveen Kumar M. R
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Early detection of heart arrhythmia can prevent severe future complications
in cardiac patients. While manual diagnosis still remains the clinical
standard, it relies heavily on visual interpretation and is inherently
subjective. In recent years, deep learning has emerged as a powerful tool to
automate arrhythmia detection, offering improved accuracy, consistency, and
efficiency. Several variants of convolutional and recurrent neural network
architectures have been widely explored to capture spatial and temporal
patterns in physiological signals. However, despite these advancements, current
models often struggle to generalize well in real-world scenarios, especially
when dealing with small or noisy datasets, which are common challenges in
biomedical applications. In this paper, a novel CNN-H-Infinity-LSTM
architecture is proposed to identify arrhythmic heart signals from heart sound
recordings. This architecture introduces trainable parameters inspired by the
H-Infinity filter from control theory, enhancing robustness and generalization.
Extensive experimentation on the PhysioNet CinC Challenge 2016 dataset, a
public benchmark of heart audio recordings, demonstrates that the proposed
model achieves stable convergence and outperforms existing benchmarks, with a
test accuracy of 99.42% and an F1 score of 98.85%.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aashray Reddy, Andrew Zagula, Nicholas Saban
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where
adversarial prompts elicit harmful outputs, yet most evaluations focus on
single-turn interactions while real-world attacks unfold through adaptive
multi-turn conversations. We present AutoAdv, a training-free framework for
automated multi-turn jailbreaking that achieves up to 95% attack success rate
on Llama-3.1-8B within six turns a 24 percent improvement over single turn
baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern
manager that learns from successful attacks to enhance future prompts, a
temperature manager that dynamically adjusts sampling parameters based on
failure modes, and a two-phase rewriting strategy that disguises harmful
requests then iteratively refines them. Extensive evaluation across commercial
and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent
vulnerabilities in current safety mechanisms, with multi-turn attacks
consistently outperforming single-turn approaches. These findings demonstrate
that alignment strategies optimized for single-turn interactions fail to
maintain robustness across extended conversations, highlighting an urgent need
for multi-turn-aware defenses.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Jean-Baptiste Courbot, Hugo Gangloff, Bruno Colicchio
ๆฅ็ๆ่ฆ
This work addresses the problem of efficient sampling of Markov random fields
(MRF). The sampling of Potts or Ising MRF is most often based on Gibbs
sampling, and is thus computationally expensive. We consider in this work how
to circumvent this bottleneck through a link with Gaussian Markov Random
fields. The latter can be sampled in several cost-effective ways, and we
introduce a mapping from real-valued GMRF to discrete-valued MRF. The resulting
new class of MRF benefits from a few theoretical properties that validate the
new model. Numerical results show the drastic performance gain in terms of
computational efficiency, as we sample at least 35x faster than Gibbs sampling
using at least 37x less energy, all the while exhibiting empirical properties
close to classical MRFs.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Rohan Wandre, Yash Gajewar, Namrata Patel, Vivek Dhalkari
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) has emerged as the dominant paradigm for
grounding large language model outputs in verifiable evidence. However, as
modern AI agents transition from static knowledge bases to continuous
multimodal streams encompassing text, images, video, and audio, two critical
challenges arise: maintaining index freshness without prohibitive re-indexing
costs, and preserving cross-modal semantic consistency across heterogeneous
embedding spaces. We present LUMA-RAG, a lifelong multimodal agent architecture
featuring three key innovations: (i) a streaming, multi-tier memory system that
dynamically spills embeddings from a hot HNSW tier to a compressed IVFPQ tier
under strict memory budgets; (ii) a streaming CLAP->CLIP alignment bridge that
maintains cross-modal consistency through incremental orthogonal Procrustes
updates; and (iii) stability-aware retrieval telemetry providing Safe@k
guarantees by jointly bounding alignment drift and quantization error.
Experiments demonstrate robust text-to-image retrieval (Recall@10 = 0.94),
graceful performance degradation under product quantization offloading, and
provably stable audio-to-image rankings (Safe@1 = 1.0), establishing LUMA-RAG
as a practical framework for production multimodal RAG systems.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wongyu Kim, Hochang Lee, Sanghak Lee, Yoonsung Kim, Jaehyun Park
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Query augmentation makes queries more meaningful by appending further
information to the queries to find relevant documents. Current studies have
proposed Large Language Model (LLM)-based embedders, which learn representation
for embedding and generation for query augmentation in a multi-task manner by
leveraging the generative capabilities of LLM. During inference, these jointly
trained embedders have conducted query augmentation followed by embedding,
showing effective results. However, augmenting every query leads to substantial
embedding latency and query augmentation can be detrimental to performance for
some queries. Also, previous methods have not been explored in multimodal
environments. To tackle these problems, we propose M-Solomon, a universal
multimodal embedder that can adaptively determine when to augment queries. Our
approach first divides the queries of the training datasets into two groups at
the dataset level. One includes queries that require augmentation and the other
includes queries that do not. Then, we introduces a synthesis process that
generates appropriate augmentations for queries that require them by leveraging
a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation.
Through this step, M-Solomon can conduct query augmentation only when necessary
by learning to generate synthetic augmentations with the prefix /augment for
queries that demand them and to generate the simple string /embed for others.
Experimental results showed that M-Solomon not only surpassed the baseline
without augmentation by a large margin but also outperformed the baseline that
always used augmentation, providing much faster embedding latency.
๐
2025-11-04
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The widespread deployment of Large Language Models (LLMs) as public-facing
web services and APIs has made their security a core concern for the web
ecosystem. Jailbreak attacks, as one of the significant threats to LLMs, have
recently attracted extensive research. In this paper, we reveal a jailbreak
strategy which can effectively evade current defense strategies. It can extract
valuable information from failed or partially successful attack attempts and
contains self-evolution from attack interactions, resulting in sufficient
strategy diversity and adaptability. Inspired by continuous learning and
modular design principles, we propose ASTRA, a jailbreak framework that
autonomously discovers, retrieves, and evolves attack strategies to achieve
more efficient and adaptive attacks. To enable this autonomous evolution, we
design a closed-loop "attack-evaluate-distill-reuse" core mechanism that not
only generates attack prompts but also automatically distills and generalizes
reusable attack strategies from every interaction. To systematically accumulate
and apply this attack knowledge, we introduce a three-tier strategy library
that categorizes strategies into Effective, Promising, and Ineffective based on
their performance scores. The strategy library not only provides precise
guidance for attack generation but also possesses exceptional extensibility and
transferability. We conduct extensive experiments under a black-box setting,
and the results show that ASTRA achieves an average Attack Success Rate (ASR)
of 82.7%, significantly outperforming baselines.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qingyun Sun, Jiayi Luo, Haonan Yuan, Xingcheng Fu, Hao Peng et al.
Machine Learning
ๆฅ็ๆ่ฆ
Graph neural networks have shown remarkable success in exploiting the spatial
and temporal patterns on dynamic graphs. However, existing GNNs exhibit poor
generalization ability under distribution shifts, which is inevitable in
dynamic scenarios. As dynamic graph generation progresses amid evolving latent
non-stationary environments, it is imperative to explore their effects on
out-of-distribution (OOD) generalization. This paper proposes a novel Evolving
Graph Learning framework for OOD generalization (EvoOOD) by environment-aware
invariant pattern recognition. Specifically, we first design an environment
sequential variational auto-encoder to model environment evolution and infer
the underlying environment distribution. Then, we introduce a mechanism for
environment-aware invariant pattern recognition, tailored to address
environmental diversification through inferred distributions. Finally, we
conduct fine-grained causal interventions on individual nodes using a mixture
of instantiated environment samples. This approach helps to distinguish
spatio-temporal invariant patterns for OOD prediction, especially in
non-stationary environments. Experimental results demonstrate the superiority
of EvoGOOD on both real-world and synthetic dynamic datasets under distribution
shifts. To the best of our knowledge, it is the first attempt to study the
dynamic graph OOD generalization problem from the environment evolution
perspective.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Zhuodi Cai, Ziyu Xu, Juan Pampin
Robotics
ๆฅ็ๆ่ฆ
We introduce a lightweight, real-time motion recognition system that enables
synergic human-machine performance through wearable IMU sensor data, MiniRocket
time-series classification, and responsive multimedia control. By mapping
dancer-specific movement to sound through somatic memory and association, we
propose an alternative approach to human-machine collaboration, one that
preserves the expressive depth of the performing body while leveraging machine
learning for attentive observation and responsiveness. We demonstrate that this
human-centered design reliably supports high accuracy classification (<50 ms
latency), offering a replicable framework to integrate dance-literate machines
into creative, educational, and live performance contexts.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ David Nabergoj, Erik ล trumbelj
Machine Learning
ๆฅ็ๆ่ฆ
Preconditioning is a key component of MCMC algorithms that improves sampling
efficiency by facilitating exploration of geometrically complex target
distributions through an invertible map. While linear preconditioners are often
sufficient for moderately complex target distributions, recent work has
explored nonlinear preconditioning with invertible neural networks as
components of normalizing flows (NFs). However, empirical and theoretical
studies show that overparameterized NF preconditioners can degrade sampling
efficiency and fit quality. Moreover, existing NF-based approaches do not adapt
their architectures to the target distribution. Related work outside of MCMC
similarly finds that suitably parameterized NFs can achieve comparable or
superior performance with substantially less training time or data. We propose
a factorized preconditioning architecture that reduces NF complexity by
combining a linear component with a conditional NF, improving adaptability to
target geometry. The linear preconditioner is applied to dimensions that are
approximately Gaussian, as estimated from warmup samples, while the conditional
NF models more complex dimensions. Our method yields significantly better tail
samples on two complex synthetic distributions and consistently better
performance on a sparse logistic regression posterior across varying likelihood
and prior strengths. It also achieves higher effective sample sizes on
hierarchical Bayesian model posteriors with weak likelihoods and strong funnel
geometries. This approach is particularly relevant for hierarchical Bayesian
model analyses with limited data and could inform current theoretical and
software strides in neural MCMC design.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fidan Karimova, Tong Chen, Yu Yang, Shazia Sadiq
ๆฅ็ๆ่ฆ
Predicting crimes in urban environments is crucial for public safety, yet
existing prediction methods often struggle to align the knowledge across
diverse cities that vary dramatically in data availability of specific crime
types. We propose HYpernetwork-enhanced Spatial Temporal Learning (HYSTL), a
framework that can effectively train a unified, stronger crime predictor
without assuming identical crime types in different cities' records. In HYSTL,
instead of parameterising a dedicated predictor per crime type, a hypernetwork
is designed to dynamically generate parameters for the prediction function
conditioned on the crime type of interest. To bridge the semantic gap between
different crime types, a structured crime knowledge graph is built, where the
learned representations of crimes are used as the input to the hypernetwork to
facilitate parameter generation. As such, when making predictions for each
crime type, the predictor is additionally guided by its intricate association
with other relevant crime types. Extensive experiments are performed on two
cities with non-overlapping crime types, and the results demonstrate HYSTL
outperforms state-of-the-art baselines.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tianle Pu, Zijie Geng, Haoyang Liu, Shixuan Liu, Jie Wang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Mixed-Integer Linear Programming (MILP) is a fundamental and powerful
framework for modeling complex optimization problems across diverse domains.
Recently, learning-based methods have shown great promise in accelerating MILP
solvers by predicting high-quality solutions. However, most existing approaches
are developed and evaluated in single-domain settings, limiting their ability
to generalize to unseen problem distributions. This limitation poses a major
obstacle to building scalable and general-purpose learning-based solvers. To
address this challenge, we introduce RoME, a domain-Robust Mixture-of-Experts
framework for predicting MILP solutions across domains. RoME dynamically routes
problem instances to specialized experts based on learned task embeddings. The
model is trained using a two-level distributionally robust optimization
strategy: inter-domain to mitigate global shifts across domains, and
intra-domain to enhance local robustness by introducing perturbations on task
embeddings. We reveal that cross-domain training not only enhances the model's
generalization capability to unseen domains but also improves performance
within each individual domain by encouraging the model to capture more general
intrinsic combinatorial patterns. Specifically, a single RoME model trained on
three domains achieves an average improvement of 67.7% then evaluated on five
diverse domains. We further test the pretrained model on MIPLIB in a zero-shot
setting, demonstrating its ability to deliver measurable performance gains on
challenging real-world instances where existing learning-based approaches often
struggle to generalize.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jueye Zhang, Chao Yang, Youfang Lai, Kai-Wen Li, Wenting Yan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Head-and-neck cancer (HNC) planning is difficult because multiple critical
organs-at-risk (OARs) are close to complex targets. Intensity-modulated
carbon-ion therapy (IMCT) offers superior dose conformity and OAR sparing but
remains slow due to relative biological effectiveness (RBE) modeling, leading
to laborious, experience-based, and often suboptimal tuning of many
treatment-planning parameters (TPPs). Recent deep learning (DL) methods are
limited by data bias and plan feasibility, while reinforcement learning (RL)
struggles to efficiently explore the exponentially large TPP search space. We
propose a scalable multi-agent RL (MARL) framework for parallel tuning of 45
TPPs in IMCT. It uses a centralized-training decentralized-execution (CTDE)
QMIX backbone with Double DQN, Dueling DQN, and recurrent encoding (DRQN) for
stable learning in a high-dimensional, non-stationary environment. To enhance
efficiency, we (1) use compact historical DVH vectors as state inputs, (2)
apply a linear action-to-value transform mapping small discrete actions to
uniform parameter adjustments, and (3) design an absolute, clinically informed
piecewise reward aligned with plan scores. A synchronous multi-process worker
system interfaces with the PHOENIX TPS for parallel optimization and
accelerated data collection. On a head-and-neck dataset (10 training, 10
testing), the method tuned 45 parameters simultaneously and produced plans
comparable to or better than expert manual ones (relative plan score: RL
$85.93\pm7.85%$ vs Manual $85.02\pm6.92%$), with significant (p-value $<$ 0.05)
improvements for five OARs. The framework efficiently explores high-dimensional
TPP spaces and generates clinically competitive IMCT plans through direct TPS
interaction, notably improving OAR sparing.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aman Sharma, Paras Chopra
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We revisit test-time scaling for language model reasoning and ask a
fundamental question: at equal token budget and compute, is it better to run
multiple independent chains in parallel, or to run fewer chains that
iteratively refine through sequential steps? Through comprehensive evaluation
across 5 state-of-the-art open source models and 3 challenging reasoning
benchmarks, we find that sequential scaling where chains explicitly build upon
previous attempts consistently outperforms the dominant parallel
self-consistency paradigm in 95.6% of configurations with gains in accuracy
upto 46.7%. Further, we introduce inverse-entropy weighted voting, a novel
training-free method to further boost the accuracy of sequential scaling. By
weighing answers in proportion to the inverse entropy of their reasoning
chains, we increase our success rate over parallel majority and establish it as
the optimal test-time scaling strategy. Our findings fundamentally challenge
the parallel reasoning orthodoxy that has dominated test-time scaling since
Wang et al.'s self-consistency decoding (Wang et al., 2022), positioning
sequential refinement as the robust default for modern LLM reasoning and
necessitating a paradigm shift in how we approach inference-time optimization.
๐
2025-11-04
๐ ArXiv Preprint (cs.MA)
๐ฅ Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia
Machine Learning
ๆฅ็ๆ่ฆ
We study the problem of learning multi-task, multi-agent policies for
cooperative, temporal objectives, under centralized training, decentralized
execution. In this setting, using automata to represent tasks enables the
decomposition of complex tasks into simpler sub-tasks that can be assigned to
agents. However, existing approaches remain sample-inefficient and are limited
to the single-task case. In this work, we present Automata-Conditioned
Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for
learning task-conditioned, decentralized team policies. We identify the main
challenges to ACC-MARL's feasibility in practice, propose solutions, and prove
the correctness of our approach. We further show that the value functions of
learned policies can be used to assign tasks optimally at test time.
Experiments show emergent task-aware, multi-step coordination among agents,
e.g., pressing a button to unlock a door, holding the door, and
short-circuiting tasks.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fengjuan Wang, Zhiyi Su, Xingzhu Hu, Cheng Wang, Mou Sun
Natural Language Processing
ๆฅ็ๆ่ฆ
Training large Mixture-of-Experts (MoE) models remains computationally
prohibitive due to their extreme compute and memory demands. Although
low-precision training promises to accelerate computation and reduce memory
footprint, existing implementations still rely on BF16-dominated dataflows with
frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode
much of FP8's theoretical efficiency. However, naively removing these casts by
keeping dataflows entirely in FP8 introduces double quantization error: tensors
quantized along different dimensions accumulate inconsistent scaling factors,
degrading numerical stability.
We propose FP8-Flow-MoE, an FP8 training recipe featuring a
quantization-consistent FP8-centric dataflow with a scaling-aware transpose and
fused FP8 operators that streamline computation and eliminate explicit cast
operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate
up to 21\% higher throughput and 16.5 GB lower memory usage per GPU compared to
BF16 and na\"ive FP8 baselines, while maintaining stable convergence. We
provide a plug-and-play FP8 recipe compatible with TransformerEngine and
Megatron-LM, which will be open-sourced soon.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kuan-Cheng Chen, Samuel Yen-Chi Chen, Chen-Yu Liu, Kin K. Leung
ๆฅ็ๆ่ฆ
The rapid growth of industrial Internet of Things (IIoT) systems has created
new challenges for anomaly detection in high-dimensional, multivariate
time-series, where privacy, scalability, and communication efficiency are
critical. Classical federated learning approaches mitigate privacy concerns by
enabling decentralized training, but they often struggle with highly non-linear
decision boundaries and imbalanced anomaly distributions. To address this gap,
we propose a Federated Quantum Kernel Learning (FQKL) framework that integrates
quantum feature maps with federated aggregation to enable distributed,
privacy-preserving anomaly detection across heterogeneous IoT networks. In our
design, quantum edge nodes locally compute compressed kernel statistics using
parameterized quantum circuits and share only these summaries with a central
server, which constructs a global Gram matrix and trains a decision function
(e.g., Fed-QSVM). Experimental results on synthetic IIoT benchmarks demonstrate
that FQKL achieves superior generalization in capturing complex temporal
correlations compared to classical federated baselines, while significantly
reducing communication overhead. This work highlights the promise of quantum
kernels in federated settings, advancing the path toward scalable, robust, and
quantum-enhanced intelligence for next-generation IoT infrastructures.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziyi Wang, Lijian Jiang
Machine Learning
ๆฅ็ๆ่ฆ
Data assimilation (DA) has increasingly emerged as a critical tool for state
estimation
across a wide range of applications. It is signiffcantly challenging when the
governing equations of the underlying dynamics are unknown. To this end,
various machine learning approaches have been employed to construct a surrogate
state transition
model in a supervised learning framework, which relies on pre-computed
training
datasets. However, it is often infeasible to obtain noise-free ground-truth
state sequences in practice. To address this challenge, we propose a novel
method that integrates reinforcement learning with ensemble-based Bayesian
ffltering methods, enabling
the learning of surrogate state transition model for unknown dynamics
directly from noisy observations, without using true state trajectories.
Speciffcally, we treat the process for computing maximum likelihood estimation
of surrogate model parameters
as a sequential decision-making problem, which can be formulated as a
discretetime
Markov decision process (MDP). Under this formulation, learning the surrogate
transition model is equivalent to ffnding an optimal policy of the MDP, which
can be effectively addressed using reinforcement learning techniques. Once the
model is trained offfine, state estimation can be performed in the online stage
using ffltering methods based on the learned dynamics. The proposed framework
accommodates a wide range of observation scenarios, including nonlinear and
partially observed measurement
models. A few numerical examples demonstrate that the proposed method
achieves superior accuracy and robustness in high-dimensional settings.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Yuheng Zhao, Yu-Hu Yan, Kfir Yehuda Levy, Peng Zhao
Machine Learning
ๆฅ็ๆ่ฆ
Smoothness is known to be crucial for acceleration in offline optimization,
and for gradient-variation regret minimization in online learning.
Interestingly, these two problems are actually closely connected -- accelerated
optimization can be understood through the lens of gradient-variation online
learning. In this paper, we investigate online learning with H\"older smooth
functions, a general class encompassing both smooth and non-smooth (Lipschitz)
functions, and explore its implications for offline optimization. For
(strongly) convex online functions, we design the corresponding
gradient-variation online learning algorithm whose regret smoothly interpolates
between the optimal guarantees in smooth and non-smooth regimes. Notably, our
algorithms do not require prior knowledge of the H\"older smoothness parameter,
exhibiting strong adaptivity over existing methods. Through online-to-batch
conversion, this gradient-variation online adaptivity yields an optimal
universal method for stochastic convex optimization under H\"older smoothness.
However, achieving universality in offline strongly convex optimization is more
challenging. We address this by integrating online adaptivity with a
detection-based guess-and-check procedure, which, for the first time, yields a
universal offline method that achieves accelerated convergence in the smooth
regime while maintaining near-optimal convergence in the non-smooth one.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ayoub Ghriss
Machine Learning
ๆฅ็ๆ่ฆ
Probabilistic relaxations of graph cuts offer a differentiable alternative to
spectral clustering, enabling end-to-end and online learning without
eigendecompositions, yet prior work centered on RatioCut and lacked general
guarantees and principled gradients. We present a unified probabilistic
framework that covers a wide class of cuts, including Normalized Cut. Our
framework provides tight analytic upper bounds on expected discrete cuts via
integral representations and Gauss hypergeometric functions with closed-form
forward and backward. Together, these results deliver a rigorous, numerically
stable foundation for scalable, differentiable graph partitioning covering a
wide range of clustering and contrastive learning objectives.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Parsa Rangriz
Robotics
ๆฅ็ๆ่ฆ
This paper studies the high-dimensional scaling limits of online stochastic
gradient descent (SGD) for single-layer networks. Building on the seminal work
of Saad and Solla, which analyzed the deterministic (ballistic) scaling limits
of SGD corresponding to the gradient flow of the population loss, we focus on
the critical scaling regime of the step size. Below this critical scale, the
effective dynamics are governed by ballistic (ODE) limits, but at the critical
scale, new correction term appears that changes the phase diagram. In this
regime, near the fixed points, the corresponding diffusive (SDE) limits of the
effective dynamics reduces to an Ornstein-Uhlenbeck process under certain
conditions. These results highlight how the information exponent controls
sample complexity and illustrates the limitations of deterministic scaling
limit in capturing the stochastic fluctuations of high-dimensional learning
dynamics.
๐
2025-11-04
๐ ArXiv Preprint (cs.DC)
๐ฅ Xingqi Cui, Chieh-Jan Mike Liang, Jiarong Xing, Haoran Qiu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Serving large generative models such as LLMs and multi- modal transformers
requires balancing user-facing SLOs (e.g., time-to-first-token,
time-between-tokens) with provider goals of efficiency and cost reduction.
Existing solutions rely on static provisioning or model-level autoscaling, both
of which treat the model as a monolith. This coarse-grained resource management
leads to degraded performance or significant resource underutilization due to
poor adaptability to dynamic inference traffic that is common online.
The root cause of this inefficiency lies in the internal structure of
generative models: they are executed as graphs of interconnected operators.
Through detailed characterization and systematic analysis, we find that
operators are heterogeneous in their compute and memory footprints and exhibit
diverse sensitivity to workload and resource factors such as batch size,
sequence length, and traffic rate. This heterogeneity suggests that the
operator, rather than the entire model, is the right granularity for scaling
decisions.
We propose an operator-level autoscaling framework, which allocates resources
at finer (operator)-granularity, optimizing the scaling, batching, and
placement based on individual operator profiles. Evaluated on production-scale
traces, our approach preserves SLOs with up to 40% fewer GPUs and 35% less
energy, or under fixed resources achieves 1.6x higher throughput with 5% less
energy. These results show that the operator, rather than the model, is
fundamentally a more effective unit for scaling large generative workloads.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jonathan Liu, Haoling Qiu, Jonathan Lasko, Damianos Karakos, Mahsa Yarmohammadi et al.
ๆฅ็ๆ่ฆ
Recent research has shown that hallucinations, omissions, and biases are
prevalent in everyday use-cases of LLMs. However, chatbots used in medical
contexts must provide consistent advice in situations where non-medical factors
are involved, such as when demographic information is present. In order to
understand the conditions under which medical chatbots fail to perform as
expected, we develop an infrastructure that 1) automatically generates queries
to probe LLMs and 2) evaluates answers to these queries using multiple
LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples
the space of patient demographics, histories, disorders, and writing styles to
create realistic questions that we subsequently use to prompt LLMs. In 2), our
evaluation pipeline provides hallucination and omission detection using
LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge
treatment category detectors. As a baseline study, we perform two case studies
on inter-LLM agreement and the impact of varying the answering and evaluation
LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's
Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs
yield statistically significant differences across writing styles, genders, and
races. We recommend that studies using LLM evaluation use multiple LLMs as
evaluators in order to avoid arriving at statistically significant but
non-generalizable results, particularly in the absence of ground-truth data. We
also suggest publishing inter-LLM agreement metrics for transparency. Our code
and dataset are available here:
https://github.com/BBN-E/medic-neurips-2025-demo.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hikaru Homma, Jun Ohkubo
Machine Learning
ๆฅ็ๆ่ฆ
Initialization of neural network parameters, such as weights and biases, has
a crucial impact on learning performance; if chosen well, we can even avoid the
need for additional training with backpropagation. For example, algorithms
based on the ridgelet transform or the SWIM (sampling where it matters) concept
have been proposed for initialization. On the other hand, it is well-known that
neural networks tend to learn coarse information in the earlier layers. The
feature is called spectral bias. In this work, we investigate the effects of
utilizing information on the spectral bias in the initialization of neural
networks. Hence, we propose a framework that adjusts the scale factors in the
SWIM algorithm to capture low-frequency components in the early-stage hidden
layers and to represent high-frequency components in the late-stage hidden
layers. Numerical experiments on a one-dimensional regression task and the
MNIST classification task demonstrate that the proposed method outperforms the
conventional initialization algorithms. This work clarifies the importance of
intrinsic spectral properties in learning neural networks, and the finding
yields an effective parameter initialization strategy that enhances their
training performance.
๐
2025-11-04
๐ ArXiv Preprint (Neural and Evolutionary Computing)
๐ฅ Brennen A. Hill
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Traditional neural networks, while powerful, rely on biologically implausible
learning mechanisms such as global backpropagation. This paper introduces the
Structurally Adaptive Predictive Inference Network (SAPIN), a novel
computational model inspired by the principles of active inference and the
morphological plasticity observed in biological neural cultures. SAPIN operates
on a 2D grid where processing units, or cells, learn by minimizing local
prediction errors. The model features two primary, concurrent learning
mechanisms: a local, Hebbian-like synaptic plasticity rule based on the
temporal difference between a cell's actual activation and its learned
expectation, and a structural plasticity mechanism where cells physically
migrate across the grid to optimize their information-receptive fields. This
dual approach allows the network to learn both how to process information
(synaptic weights) and also where to position its computational resources
(network topology). We validated the SAPIN model on the classic Cart Pole
reinforcement learning benchmark. Our results demonstrate that the architecture
can successfully solve the CartPole task, achieving robust performance. The
network's intrinsic drive to minimize prediction error and maintain homeostasis
was sufficient to discover a stable balancing policy. We also found that while
continual learning led to instability, locking the network's parameters after
achieving success resulted in a stable policy. When evaluated for 100 episodes
post-locking (repeated over 100 successful agents), the locked networks
maintained an average 82% success rate.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal et al.
ๆฅ็ๆ่ฆ
An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures
where the feed-forward layer is replaced by a pool of experts and each token
only activates a small subset of them. During autoregressive generation, these
models often enter a memory-bound regime even for moderate batch sizes because
the average expert load grows more slowly than in an equivalent dense
feedforward layer. Consequently, MoE latency is governed by the number of
activated experts. We introduce a framework for dynamically re-routing
token-to-expert mapping to lower this number (and thus, the decode latency)
while preserving a comparable quality. Our best results use a batch-aware
routing that works by having tokens piggyback experts that have already been
loaded into memory due to being crucial to other tokens within the same batch.
Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with
a batch size of $16$. Without any statistically significant loss in accuracy,
our approach achieves latency reductions of $39\%$ and $15\%$ in the MoE layer
decode latency, respectively.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Fan Feng, Phillip Lippe, Sara Magliacane
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Agents that understand objects and their interactions can learn policies that
are more robust and transferable. However, most object-centric RL methods
factor state by individual objects while leaving interactions implicit. We
introduce the Factored Interactive Object-Centric World Model (FIOC-WM), a
unified framework that learns structured representations of both objects and
their interactions within a world model. FIOC-WM captures environment dynamics
with disentangled and modular representations of object interactions, improving
sample efficiency and generalization for policy learning. Concretely, FIOC-WM
first learns object-centric latents and an interaction structure directly from
pixels, leveraging pre-trained vision encoders. The learned world model then
decomposes tasks into composable interaction primitives, and a hierarchical
policy is trained on top: a high level selects the type and order of
interactions, while a low level executes them. On simulated robotic and
embodied-AI benchmarks, FIOC-WM improves policy-learning sample efficiency and
generalization over world-model baselines, indicating that explicit, modular
interaction learning is crucial for robust control.
๐
2025-11-04
๐ ArXiv Preprint (cs.MA)
๐ฅ Manonmani Sekar, Nasim Nezamoddini
Machine LearningRobotics
ๆฅ็ๆ่ฆ
One of the main challenges in managing traffic at multilane intersections is
ensuring smooth coordination between human-driven vehicles (HDVs) and connected
autonomous vehicles (CAVs). This paper presents a novel traffic signal control
framework that combines Graph Attention Networks (GAT) with Soft Actor-Critic
(SAC) reinforcement learning to address this challenge. GATs are used to model
the dynamic graph- structured nature of traffic flow to capture spatial and
temporal dependencies between lanes and signal phases. The proposed SAC is a
robust off-policy reinforcement learning algorithm that enables adaptive signal
control through entropy-optimized decision making. This design allows the
system to coordinate the signal timing and vehicle movement simultaneously with
objectives focused on minimizing travel time, enhancing performance, ensuring
safety, and improving fairness between HDVs and CAVs. The model is evaluated
using a SUMO-based simulation of a four-way intersection and incorporating
different traffic densities and CAV penetration rates. The experimental results
demonstrate the effectiveness of the GAT-SAC approach by achieving a 24.1%
reduction in average delay and up to 29.2% fewer traffic violations compared to
traditional methods. Additionally, the fairness ratio between HDVs and CAVs
improved to 1.59, indicating more equitable treatment across vehicle types.
These findings suggest that the GAT-SAC framework holds significant promise for
real-world deployment in mixed-autonomy traffic systems.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Weiwei Sun, Xuhui Zhou, Weihua Du, Xingyao Wang, Sean Welleck et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
While existing work focuses primarily on task success, we argue that
effective real-world agents require optimizing three dimensions: productivity
(task completion), proactivity (asking essential questions), and
personalization (adapting to diverse user preferences). We introduce UserVille,
an interactive environment with LLM-based user simulators enabling diverse,
configurable user preferences. Leveraging UserVille, we introduce PPP, a
multi-objective reinforcement learning approach that jointly optimizes all
three dimensions: Productivity, Proactivity, and Personalization. Experiments
on software engineering and deep research tasks show that agents trained with
PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6
on average), demonstrating the ability to ask strategic clarifying questions,
adapt to unseen user preferences, and improve task success through better
interaction. This work demonstrates that explicitly optimizing for
user-centered interaction is critical for building practical and effective AI
agents.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kevin Valencia, Thilina Balasooriya, Xihaier Luo, Shinjae Yoo, David Keetae Park
Multimodal
ๆฅ็ๆ่ฆ
Multimodal spatiotemporal learning on real-world experimental data is
constrained by two challenges: within-modality measurements are sparse,
irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of
available modalities varies across space and time, shrinking the usable record
unless models can adapt to arbitrary subsets at train and test time. We propose
OmniField, a continuity-aware framework that learns a continuous neural field
conditioned on available modalities and iteratively fuses cross-modal context.
A multimodal crosstalk block architecture paired with iterative cross-modal
refinement aligns signals prior to the decoder, enabling unified
reconstruction, interpolation, forecasting, and cross-modal prediction without
gridding or surrogate preprocessing. Extensive evaluations show that OmniField
consistently outperforms eight strong multimodal spatiotemporal baselines.
Under heavy simulated sensor noise, performance remains close to clean-input
levels, highlighting robustness to corrupted measurements.
๐
2025-11-04
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yibo Zhao, Yang Zhao, Hongru Du, Hao Frank Yang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Decision-making models for individuals, particularly in high-stakes scenarios
like vaccine uptake, often diverge from population optimal predictions. This
gap arises from the uniqueness of the individual decision-making process,
shaped by numerical attributes (e.g., cost, time) and linguistic influences
(e.g., personal preferences and constraints). Developing upon Utility Theory
and leveraging the textual-reasoning capabilities of Large Language Models
(LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric
Reasoning framework (ATHENA) to address the optimal information integration.
ATHENA uniquely integrates two stages: First, it discovers robust, group-level
symbolic utility functions via LLM-augmented symbolic discovery; Second, it
implements individual-level semantic adaptation, creating personalized semantic
templates guided by the optimal utility to model personalized choices.
Validated on real-world travel mode and vaccine choice tasks, ATHENA
consistently outperforms utility-based, machine learning, and other LLM-based
models, lifting F1 score by at least 6.5% over the strongest cutting-edge
models. Further, ablation studies confirm that both stages of ATHENA are
critical and complementary, as removing either clearly degrades overall
predictive performance. By organically integrating symbolic utility modeling
and semantic adaptation, ATHENA provides a new scheme for modeling
human-centric decisions. The project page can be found at
https://yibozh.github.io/Athena.
๐
2025-11-04
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Fuyi Wang, Zekai Chen, Mingyuan Fan, Jianying Zhou, Lei Pan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Graph neural networks (GNNs) are powerful tools for analyzing and learning
from graph-structured (GS) data, facilitating a wide range of services.
Deploying such services in privacy-critical cloud environments necessitates the
development of secure inference (SI) protocols that safeguard sensitive GS
data. However, existing SI solutions largely focus on convolutional models for
image and text data, leaving the challenge of securing GNNs and GS data
relatively underexplored. In this work, we design, implement, and evaluate
$\sysname$, a lightweight cryptographic scheme for graph-centric inference in
the cloud. By hybridizing additive and function secret sharings within secure
two-party computation (2PC), $\sysname$ is carefully designed based on a series
of novel 2PC interactive protocols that achieve $1.5\times \sim 1.7\times$
speedups for linear layers and $2\times \sim 15\times$ for non-linear layers
over state-of-the-art (SotA) solutions. A thorough theoretical analysis is
provided to prove $\sysname$'s correctness, security, and lightweight nature.
Extensive experiments across four datasets demonstrate $\sysname$'s superior
efficiency with $1.3\times \sim 4.7\times$ faster secure predictions while
maintaining accuracy comparable to plaintext graph property inference.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yuzhuang Pian, Taiyu Wang, Shiqi Zhang, Rui Xu, Yonghong Liu
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Accurate air quality forecasts are vital for public health alerts, exposure
assessment, and emissions control. In practice, observational data are often
missing in varying proportions and patterns due to collection and transmission
issues. These incomplete spatiotemporal records impede reliable inference and
risk assessment and can lead to overconfident extrapolation. To address these
challenges, we propose an end to end framework, the channel gated learning unit
based spatiotemporal bayesian neural field (CGLUBNF). It uses Fourier features
with a graph attention encoder to capture multiscale spatial dependencies and
seasonal temporal dynamics. A channel gated learning unit, equipped with
learnable activations and gated residual connections, adaptively filters and
amplifies informative features. Bayesian inference jointly optimizes predictive
distributions and parameter uncertainty, producing point estimates and
calibrated prediction intervals. We conduct a systematic evaluation on two real
world datasets, covering four typical missing data patterns and comparing
against five state of the art baselines. CGLUBNF achieves superior prediction
accuracy and sharper confidence intervals. In addition, we further validate
robustness across multiple prediction horizons and analysis the contribution of
extraneous variables. This research lays a foundation for reliable deep
learning based spatio-temporal forecasting with incomplete observations in
emerging sensing paradigms, such as real world vehicle borne mobile monitoring.
๐
2025-11-04
๐ ArXiv Preprint (cs.DC)
๐ฅ Octavian Alexandru Trifan, Karthik Sangaiah, Muhammad Awad, Muhammad Osama, Sumanth Gudaparthi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
As large language models (LLMs) continue to scale, their workloads
increasingly rely on distributed execution across multiple GPUs. However, the
conventional bulk synchronous parallel~(BSP) model used in such settings
introduces significant performance inefficiencies. To characterize these
bottlenecks, we introduce the ''Three Taxes'' (Bulk Synchronous, Inter-Kernel
Data Locality, and Kernel Launch Overhead) as an analytical framework. We
propose moving beyond the rigid BSP model to address key inefficiencies in
distributed GPU execution. By exploiting libraries like Iris for Triton, we
gain access to in-kernel communication primitives that enable the design of
novel fine-grained programming patterns, offering greater flexibility and
performance than traditional BSP-based approaches. These patterns
systematically eliminate the three taxes by creating direct, tile-level
producer-consumer pipelines and replacing global barriers with fine-grained
dataflow synchronization. Applying this methodology to critical kernels, from
the foundational All-Gather + general matrix multiplication operation to the
complex Flash Decode algorithm, we observe a 10-20% speedup in end-to-end
latency over BSP-based approaches, establishing a more programmable and
efficient paradigm for distributed LLM workloads.
๐
2025-11-04
๐ ArXiv Preprint (cs.LO)
๐ฅ Eric Vin, Kyle A. Miller, Inigo Incer, Sanjit A. Seshia, Daniel J. Fremont
Robotics
ๆฅ็ๆ่ฆ
Full verification of learning-enabled cyber-physical systems (CPS) has long
been intractable due to challenges including black-box components and complex
real-world environments. Existing tools either provide formal guarantees for
limited types of systems or test the system as a monolith, but no general
framework exists for compositional analysis of learning-enabled CPS using
varied verification techniques over complex real-world environments. This paper
introduces ScenicProver, a verification framework that aims to fill this gap.
Built upon the Scenic probabilistic programming language, the framework
supports: (1) compositional system description with clear component interfaces,
ranging from interpretable code to black boxes; (2) assume-guarantee contracts
over those components using an extension of Linear Temporal Logic containing
arbitrary Scenic expressions; (3) evidence generation through testing, formal
proofs via Lean 4 integration, and importing external assumptions; (4)
systematic combination of generated evidence using contract operators; and (5)
automatic generation of assurance cases tracking the provenance of system-level
guarantees. We demonstrate the framework's effectiveness through a case study
on an autonomous vehicle's automatic emergency braking system with sensor
fusion. By leveraging manufacturer guarantees for radar and laser sensors and
focusing testing efforts on uncertain conditions, our approach enables stronger
probabilistic guarantees than monolithic testing with the same computational
budget.
๐
2025-11-04
๐ ArXiv Preprint (cs.GT)
๐ฅ Asrin Efe Yorulmaz, Tamer Baลar
ๆฅ็ๆ่ฆ
No-regret learning dynamics play a central role in game theory, enabling
decentralized convergence to equilibrium for concepts such as Coarse Correlated
Equilibrium (CCE) or Correlated Equilibrium (CE). In this work, we improve the
convergence rate to CCE in general-sum Markov games, reducing it from the
previously best-known rate of $\mathcal{O}(\log^5 T / T)$ to a sharper
$\mathcal{O}(\log T / T)$. This matches the best known convergence rate for CE
in terms of $T$, number of iterations, while also improving the dependence on
the action set size from polynomial to polylogarithmic-yielding exponential
gains in high-dimensional settings. Our approach builds on recent advances in
adaptive step-size techniques for no-regret algorithms in normal-form games,
and extends them to the Markovian setting via a stage-wise scheme that adjusts
learning rates based on real-time feedback. We frame policy updates as an
instance of Optimistic Follow-the-Regularized-Leader (OFTRL), customized for
value-iteration-based learning. The resulting self-play algorithm achieves, to
our knowledge, the fastest known convergence rate to CCE in Markov games.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Bartลomiej Maลkus, Szymon Bobek, Grzegorz J. Nalepa
ๆฅ็ๆ่ฆ
Time series data is one of the most popular data modalities in critical
domains such as industry and medicine. The demand for algorithms that not only
exhibit high accuracy but also offer interpretability is crucial in such
fields, as decisions made there bear significant consequences. In this paper,
we present ProtoTSNet, a novel approach to interpretable classification of
multivariate time series data, through substantial enhancements to the
ProtoPNet architecture. Our method is tailored to overcome the unique
challenges of time series analysis, including capturing dynamic patterns and
handling varying feature significance. Central to our innovation is a modified
convolutional encoder utilizing group convolutions, pre-trainable as part of an
autoencoder and designed to preserve and quantify feature importance. We
evaluated our model on 30 multivariate time series datasets from the UEA
archive, comparing our approach with existing explainable methods as well as
non-explainable baselines. Through comprehensive evaluation and ablation
studies, we demonstrate that our approach achieves the best performance among
ante-hoc explainable methods while maintaining competitive performance with
non-explainable and post-hoc explainable approaches, providing interpretable
results accessible to domain experts.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Abdullah Almansour, Ozan Tonguz
Machine Learning
ๆฅ็ๆ่ฆ
Machine Learning (ML) models are extensively used in various applications due
to their significant advantages over traditional learning methods. However, the
developed ML models often underperform when deployed in the real world due to
the well-known distribution shift problem. This problem can lead to a
catastrophic outcomes when these decision-making systems have to operate in
high-risk applications. Many researchers have previously studied this problem
in ML, known as distribution shift problem, using statistical techniques (such
as Kullback-Leibler, Kolmogorov-Smirnov Test, Wasserstein distance, etc.) to
quantify the distribution shift. In this letter, we show that using
Characteristic Function (CF) as a frequency domain approach is a powerful
alternative for measuring the distribution shift in high-dimensional space and
for domain adaptation.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yi Luo, Haochen Zhao, Xiao Liang, Yiwei Liu, Yuye Zhang et al.
Machine Learning
ๆฅ็ๆ่ฆ
Drug synergy prediction is a critical task in the development of effective
combination therapies for complex diseases, including cancer. Although existing
methods have shown promising results, they often operate as black-box
predictors that rely predominantly on statistical correlations between drug
characteristics and results. To address this limitation, we propose CausalDDS,
a novel framework that disentangles drug molecules into causal and spurious
substructures, utilizing the causal substructure representations for predicting
drug synergy. By focusing on causal sub-structures, CausalDDS effectively
mitigates the impact of redundant features introduced by spurious
substructures, enhancing the accuracy and interpretability of the model. In
addition, CausalDDS employs a conditional intervention mechanism, where
interventions are conditioned on paired molecular structures, and introduces a
novel optimization objective guided by the principles of sufficiency and
independence. Extensive experiments demonstrate that our method outperforms
baseline models, particularly in cold start and out-of-distribution settings.
Besides, CausalDDS effectively identifies key substructures underlying drug
synergy, providing clear insights into how drug combinations work at the
molecular level. These results underscore the potential of CausalDDS as a
practical tool for predicting drug synergy and facilitating drug discovery.
๐
2025-11-04
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yasaman Torabi, Shahram Shirani, James P. Reilly
Machine Learning
ๆฅ็ๆ่ฆ
Early identification of abnormal physiological patterns is essential for the
timely detection of cardiac disease. This work introduces a hybrid
quantum-classical convolutional neural network (QCNN) designed to classify S3
and murmur abnormalities in heart sound signals. The approach transforms
one-dimensional phonocardiogram (PCG) signals into compact two-dimensional
images through a combination of wavelet feature extraction and adaptive
threshold compression methods. We compress the cardiac-sound patterns into an
8-pixel image so that only 8 qubits are needed for the quantum stage.
Preliminary results on the HLS-CMDS dataset demonstrate 93.33% classification
accuracy on the test set and 97.14% on the train set, suggesting that quantum
models can efficiently capture temporal-spectral correlations in biomedical
signals. To our knowledge, this is the first application of a QCNN algorithm
for bioacoustic signal processing. The proposed method represents an early step
toward quantum-enhanced diagnostic systems for resource-constrained healthcare
environments.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hung-Ting Chen, Xiang Liu, Shauli Ravfogel, Eunsol Choi
Multimodal
ๆฅ็ๆ่ฆ
Most text retrievers generate \emph{one} query vector to retrieve relevant
documents. Yet, the conditional distribution of relevant documents for the
query may be multimodal, e.g., representing different interpretations of the
query. We first quantify the limitations of existing retrievers. All retrievers
we evaluate struggle more as the distance between target document embeddings
grows. To address this limitation, we develop a new retriever architecture,
\emph{A}utoregressive \emph{M}ulti-\emph{E}mbedding \emph{R}etriever (AMER).
Our model autoregressively generates multiple query vectors, and all the
predicted query vectors are used to retrieve documents from the corpus. We show
that on the synthetic vectorized data, the proposed method could capture
multiple target distributions perfectly, showing 4x better performance than
single embedding model. We also fine-tune our model on real-world multi-answer
retrieval datasets and evaluate in-domain. AMER presents 4 and 21\% relative
gains over single-embedding baselines on two datasets we evaluate on.
Furthermore, we consistently observe larger gains on the subset of dataset
where the embeddings of the target documents are less similar to each other. We
demonstrate the potential of using a multi-query vector retriever and open up a
new direction for future work.
๐
2025-11-04
๐ ArXiv Preprint (cs.DB)
๐ฅ Daren Chao, Kaiwen Chen, Naiqing Guan, Nick Koudas
Robotics
ๆฅ็ๆ่ฆ
Unstructured data is pervasive, but analytical queries demand structured
representations, creating a significant extraction challenge. Existing methods
like RAG lack schema awareness and struggle with cross-document alignment,
leading to high error rates. We propose ReDD (Relational Deep Dive), a
framework that dynamically discovers query-specific schemas, populates
relational tables, and ensures error-aware extraction with provable guarantees.
ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD)
identifies minimal, joinable schemas tailored to each query, and (2) Tabular
Data Population (TDP) extracts and corrects data using lightweight classifiers
trained on LLM hidden states. A main contribution of ReDD is SCAPE, a
statistically calibrated method for error detection with coverage guarantees,
and SCAPE-HYB, a hybrid approach that optimizes the trade-off between accuracy
and human correction costs. Experiments across diverse datasets demonstrate
ReDD's effectiveness, reducing data extraction errors from up to 30% to below
1% while maintaining high schema completeness (100% recall) and precision.
ReDD's modular design enables fine-grained control over accuracy-cost
trade-offs, making it a robust solution for high-stakes analytical queries over
unstructured corpora.
๐
2025-11-04
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Tetiana Manzhos, Tetiana Ianevych, Olga Melnyk
ๆฅ็ๆ่ฆ
Recommender systems and information retrieval platforms rely on ranking
algorithms to present the most relevant items to users, thereby improving
engagement and satisfaction. Assessing the quality of these rankings requires
reliable evaluation metrics. Among them, Mean Average Precision at cutoff k
(MAP@k) is widely used, as it accounts for both the relevance of items and
their positions in the list.
In this paper, the expectation and variance of Average Precision at k (AP@k)
are derived since they can be used as biselines for MAP@k. Here, we covered two
widely used evaluation models: offline and online. The expectation establishes
the baseline, indicating the level of MAP@k that can be achieved by pure
chance. The variance complements this baseline by quantifying the extent of
random fluctuations, enabling a more reliable interpretation of observed
scores.
๐
2025-11-04
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Wongyu Kim, Hochang Lee, Sanghak Lee, Yoonsung Kim, Jaehyun Park
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Query augmentation makes queries more meaningful by appending further
information to the queries to find relevant documents. Current studies have
proposed Large Language Model (LLM)-based embedders, which learn representation
for embedding and generation for query augmentation in a multi-task manner by
leveraging the generative capabilities of LLM. During inference, these jointly
trained embedders have conducted query augmentation followed by embedding,
showing effective results. However, augmenting every query leads to substantial
embedding latency and query augmentation can be detrimental to performance for
some queries. Also, previous methods have not been explored in multimodal
environments. To tackle these problems, we propose M-Solomon, a universal
multimodal embedder that can adaptively determine when to augment queries. Our
approach first divides the queries of the training datasets into two groups at
the dataset level. One includes queries that require augmentation and the other
includes queries that do not. Then, we introduces a synthesis process that
generates appropriate augmentations for queries that require them by leveraging
a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation.
Through this step, M-Solomon can conduct query augmentation only when necessary
by learning to generate synthetic augmentations with the prefix /augment for
queries that demand them and to generate the simple string /embed for others.
Experimental results showed that M-Solomon not only surpassed the baseline
without augmentation by a large margin but also outperformed the baseline that
always used augmentation, providing much faster embedding latency.
๐
2025-11-04
๐ ArXiv Preprint (cs.DL)
๐ฅ Auwalu Abdullahi Umar, Muneer Ahmad, Dr M Sadik Batcha
ๆฅ็ๆ่ฆ
The significance of libraries in preserving and maintaining history and
traditional culture cannot be overlooked. It is from this purpose that
libraries are to envisage in their programmes cultural activities which must be
collected, documented and preserved for posterity. The usefulness of preserved
information lies in the fact that the generation to come will be able to
establish their identity. This will also assist them with a foundation to build
from. This study focus on the growth and development of Library and Culture
research in forms of publications reflected in Web of Science database, during
the span of 2010-2019. A total 890 publications were found and the highest 124
(13.93%) publications published in 2019.The analysis maps comprehensively the
parameters of total output, growth of output, authorship, institution wise and
country-level collaboration patterns, major contributors (individuals, top
publication sources, institutions, and countries). It exposed that the most
prolific author is Lo P secured first place by contributing 4 (0.45%)
publications, followed by Bressan V 3 (0.34%) publications in Library and
Culture literature. Journal of Academic Librarianship produced the highest
number of records 29 (3.26%) followed by Australian Library Journal having
contributed 21 (2.36%).It is identified the domination of Wuhan University;
School Information Management had contributed 6 (0.67%) of total research
output. Authors from USA published the highest number of publications with a
total of 244 (27.42%), followed by UK and Australia with 118 (13.26%) and 76
(8.54%) publications were produced respectively.
๐
2025-11-04
๐ ArXiv Preprint (cs.DL)
๐ฅ Muneer Ahmad, M Sadik Batcha
ๆฅ็ๆ่ฆ
The present study is undertaken to find out the publication trends on
Alopecia Areata Disease during 2010-2019 from the global perspective. The study
mainly focus on distribution of research output, top journals for publications,
most prolific authors, authorship pattern, and citations pattern on Alopecia
Areata Disease. The results indicate that highest growth rate of publications
occurred during the year 2019. Columbia University topped the scene among all
institutes. The maximum publications were more than four authored publications.
Christiano AM and Clynes R were found to be the most prolific authors. It is
also found that most of the prolific authors (by number of publications) do
appear in highly cited publications list. Alopecia Areata Disease researchers
mostly preferred using article publications to communicate their findings.
๐
2025-11-04
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yuhan Wang, Qing Xie, Zhifeng Bao, Mengzi Tang, Lin Li et al.
Robotics
ๆฅ็ๆ่ฆ
Knowledge Graphs (KGs), as structured knowledge bases that organize
relational information across diverse domains, provide a unified semantic
foundation for cross-domain recommendation (CDR). By integrating symbolic
knowledge with user-item interactions, KGs enrich semantic representations,
support reasoning, and enhance model interpretability. Despite this potential,
existing KG-based methods still face major challenges in CDR, particularly
under non-overlapping user scenarios. These challenges arise from: (C1)
sensitivity to KG sparsity and popularity bias, (C2) dependence on overlapping
users for domain alignment and (C3) lack of explicit disentanglement between
transferable and domain-specific knowledge, which limit effective and stable
knowledge transfer. To this end, we propose KGBridge, a knowledge-guided prompt
learning framework for cross-domain sequential recommendation under
non-overlapping user scenarios. KGBridge comprises two core components: a
KG-enhanced Prompt Encoder, which models relation-level semantics as soft
prompts to provide structured and dynamic priors for user sequence modeling
(addressing C1), and a Two-stage Training Paradigm, which combines cross-domain
pretraining and privacy-preserving fine-tuning to enable knowledge transfer
without user overlap (addressing C2). By combining relation-aware semantic
control with correspondence-driven disentanglement, KGBridge explicitly
separates and balances domain-shared and domain-specific semantics, thereby
maintaining complementarity and stabilizing adaptation during fine-tuning
(addressing C3). Extensive experiments on benchmark datasets demonstrate that
KGBridge consistently outperforms state-of-the-art baselines and remains robust
under varying KG sparsity, highlighting its effectiveness in mitigating
structural imbalance and semantic entanglement in KG-enhanced cross-domain
recommendation.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large-scale data has driven breakthroughs in robotics, from language models
to vision-language-action models in bimanual manipulation. However, humanoid
robotics lacks equally effective data collection frameworks. Existing humanoid
teleoperation systems either use decoupled control or depend on expensive
motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid
teleoperation and data collection system that preserves full whole-body control
while advancing scalability. Our system leverages PICO4U VR for obtaining
real-time whole-body human motions, with a custom 2-DoF robot neck (cost around
$250) for egocentric vision, enabling holistic human-to-humanoid control. We
demonstrate long-horizon dexterous and mobile humanoid skills and we can
collect 100 demonstrations in 15 minutes with an almost 100% success rate.
Building on this pipeline, we propose a hierarchical visuomotor policy
framework that autonomously controls the full humanoid body based on egocentric
vision. Our visuomotor policy successfully demonstrates whole-body dexterous
manipulation and dynamic kicking tasks. The entire system is fully reproducible
and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also
open-sourced at https://twist-data.github.io .
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Shichao Fan, Kun Wu, Zhengping Che, Xinhua Wang, Di Wu et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Recent progress in large-scale robotic datasets and vision-language models
(VLMs) has advanced research on vision-language-action (VLA) models. However,
existing VLA models still face two fundamental challenges: (i) producing
precise low-level actions from high-dimensional observations, (ii) bridging
domain gaps across heterogeneous data sources, including diverse robot
embodiments and human demonstrations. Existing methods often encode latent
variables from either visual dynamics or robotic actions to guide policy
learning, but they fail to fully exploit the complementary multi-modal
knowledge present in large-scale, heterogeneous datasets. In this work, we
present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable
VLA learning across diverse robots, tasks, and environments. XR-1 introduces
the \emph{Unified Vision-Motion Codes (UVMC)}, a discrete latent representation
learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and
robotic motion. UVMC addresses these challenges by (i) serving as an
intermediate representation between the observations and actions, and (ii)
aligning multimodal dynamic information from heterogeneous data sources to
capture complementary knowledge. To effectively exploit UVMC, we propose a
three-stage training paradigm: (i) self-supervised UVMC learning, (ii)
UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and
(iii) task-specific post-training. We validate XR-1 through extensive
real-world experiments with more than 14,000 rollouts on six different robot
embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently
outperforms state-of-the-art baselines such as $\pi_{0.5}$, $\pi_0$, RDT,
UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel
objects, background variations, distractors, and illumination changes. Our
project is at https://xr-1-vla.github.io/.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Seth Stewart, Joseph Pawelski, Steve Ward, Andrew J. Petruska
Robotics
ๆฅ็ๆ่ฆ
Extending the field of magnetic manipulation to conductive, non-magnetic
objects opens the door for a wide array of applications previously limited to
hard or soft magnetic materials. Of particular interest is the recycling of
space debris through the use of oscillating magnetic fields, which represent a
cache of raw materials in an environment particularly suited to the low forces
generated from inductive magnetic manipulation. Building upon previous work
that demonstrated 3D open-loop position control by leveraging the opposing
dipole moment created from induced eddy currents, this work demonstrates
closed-loop position control of a semi-buoyant aluminum sphere in lab tests,
and the efficacy of varying methods for force inversion is explored. The
closed-loop methods represent a critical first step towards wider applications
for 3-DOF position control of induced magnetic dipoles.
๐
2025-11-04
๐ will be submitted to Journal of Guidance, Control,
๐ฅ Marc Schneider, Walter Fichter
Robotics
ๆฅ็ๆ่ฆ
This paper presents a novel approach to many-vs-many missile guidance using
virtual targets (VTs) generated by a Normalizing Flows-based trajectory
predictor. Rather than assigning n interceptors directly to m physical targets
through conventional weapon target assignment algorithms, we propose a
centralized strategy that constructs n VT trajectories representing
probabilistic predictions of maneuvering target behavior. Each interceptor is
guided toward its assigned VT using Zero-Effort-Miss guidance during midcourse
flight, transitioning to Proportional Navigation guidance for terminal
interception. This approach treats many-vs-many engagements as
many-vs-distribution scenarios, exploiting numerical superiority (n > m) by
distributing interceptors across diverse trajectory hypotheses rather than
pursuing identical deterministic predictions. Monte Carlo simulations across
various target-interceptor configurations (1-6 targets, 1-8 interceptors)
demonstrate that the VT method matches or exceeds baseline straight-line
prediction performance by 0-4.1% when n = m, with improvements increasing to
5.8-14.4% when n > m. The results confirm that probabilistic VTs enable
effective exploitation of numerical superiority, significantly increasing
interception probability in many-vs-many scenarios.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nicolas Schuler, Lea Dewald, Jรผrgen Graf
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Recent advancements in Deep Learning enable hardware-based cognitive systems,
that is, mechatronic systems in general and robotics in particular with
integrated Artificial Intelligence, to interact with dynamic and unstructured
environments. While the results are impressive, the application of such systems
to critical tasks like autonomous driving as well as service and care robotics
necessitate the evaluation of large amount of heterogeneous data. Automated
report generation for Mobile Robotics can play a crucial role in facilitating
the evaluation and acceptance of such systems in various domains. In this
paper, we propose a pipeline for generating automated reports in natural
language utilizing various multi-modal sensors that solely relies on local
models capable of being deployed on edge computing devices, thus preserving the
privacy of all actors involved and eliminating the need for external services.
In particular, we evaluate our implementation on a diverse dataset spanning
multiple domains including indoor, outdoor and urban environments, providing
quantitative as well as qualitative evaluation results. Various generated
example reports and other supplementary materials are available via a public
repository.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Le Chen, Yi Zhao, Jan Schneider, Quankai Gao, Simon Guist et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Endowing robot hands with human-level dexterity has been a long-standing goal
in robotics. Bimanual robotic piano playing represents a particularly
challenging task: it is high-dimensional, contact-rich, and requires fast,
precise control. We present OmniPianist, the first agent capable of performing
nearly one thousand music pieces via scalable, human-demonstration-free
learning. Our approach is built on three core components. First, we introduce
an automatic fingering strategy based on Optimal Transport (OT), allowing the
agent to autonomously discover efficient piano-playing strategies from scratch
without demonstrations. Second, we conduct large-scale Reinforcement Learning
(RL) by training more than 2,000 agents, each specialized in distinct music
pieces, and aggregate their experience into a dataset named RP1M++, consisting
of over one million trajectories for robotic piano playing. Finally, we employ
a Flow Matching Transformer to leverage RP1M++ through large-scale imitation
learning, resulting in the OmniPianist agent capable of performing a wide range
of musical pieces. Extensive experiments and ablation studies highlight the
effectiveness and scalability of our approach, advancing dexterous robotic
piano playing at scale.
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nicolas Schuler, Lea Dewald, Nick Baldig, Jรผrgen Graf
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Video Understanding, Scene Interpretation and Commonsense Reasoning are
highly challenging tasks enabling the interpretation of visual information,
allowing agents to perceive, interact with and make rational decisions in its
environment. Large Language Models (LLMs) and Visual Language Models (VLMs)
have shown remarkable advancements in these areas in recent years, enabling
domain-specific applications as well as zero-shot open vocabulary tasks,
combining multiple domains. However, the required computational complexity
poses challenges for their application on edge devices and in the context of
Mobile Robotics, especially considering the trade-off between accuracy and
inference time. In this paper, we investigate the capabilities of
state-of-the-art VLMs for the task of Scene Interpretation and Action
Recognition, with special regard to small VLMs capable of being deployed to
edge devices in the context of Mobile Robotics. The proposed pipeline is
evaluated on a diverse dataset consisting of various real-world cityscape,
on-campus and indoor scenarios. The experimental evaluation discusses the
potential of these small models on edge devices, with particular emphasis on
challenges, weaknesses, inherent model biases and the application of the gained
information. Supplementary material is provided via the following repository:
https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
๐
2025-11-04
๐ ArXiv Preprint (Computer Vision)
๐ฅ Garen Boyadjian, Cyrille Pierre, Johann Laconte, Riccardo Bertoglio
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Precise semantic segmentation of crops and weeds is necessary for
agricultural weeding robots. However, training deep learning models requires
large annotated datasets, which are costly to obtain in real fields. Synthetic
data can reduce this burden, but the gap between simulated and real images
remains a challenge. In this paper, we present a pipeline for procedural
generation of synthetic crop-weed images using Blender, producing annotated
datasets under diverse conditions of plant growth, weed density, lighting, and
camera angle. We benchmark several state-of-the-art segmentation models on
synthetic and real datasets and analyze their cross-domain generalization. Our
results show that training on synthetic images leads to a sim-to-real gap of
10%, surpassing previous state-of-the-art methods. Moreover, synthetic data
demonstrates good generalization properties, outperforming real datasets in
cross-domain scenarios. These findings highlight the potential of synthetic
agricultural datasets and support hybrid strategies for more efficient model
training.
๐
2025-11-04
๐ Submitted to 2026 IFAC World Congress with the Jou
๐ฅ Lin Yang, Jinwoo Lee, Domenico Campolo, H. Jin Kim, Jeonghyun Byun
Robotics
ๆฅ็ๆ่ฆ
Aerial manipulation combines the maneuverability of multirotors with the
dexterity of robotic arms to perform complex tasks in cluttered spaces. Yet
planning safe, dynamically feasible trajectories remains difficult due to
whole-body collision avoidance and the conservativeness of common geometric
abstractions such as bounding boxes or ellipsoids. We present a whole-body
motion planning and safety-critical control framework for aerial manipulators
built on superquadrics (SQs). Using an SQ-plus-proxy representation, we model
both the vehicle and obstacles with differentiable, geometry-accurate surfaces.
Leveraging this representation, we introduce a maximum-clearance planner that
fuses Voronoi diagrams with an equilibrium-manifold formulation to generate
smooth, collision-aware trajectories. We further design a safety-critical
controller that jointly enforces thrust limits and collision avoidance via
high-order control barrier functions. In simulation, our approach outperforms
sampling-based planners in cluttered environments, producing faster, safer, and
smoother trajectories and exceeding ellipsoid-based baselines in geometric
fidelity. Actual experiments on a physical aerial-manipulation platform confirm
feasibility and robustness, demonstrating consistent performance across
simulation and hardware settings. The video can be found at
https://youtu.be/hQYKwrWf1Ak.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Shaohan Li, Yunpeng Shi, Gilad Lerman
Computer Vision
ๆฅ็ๆ่ฆ
We introduce Cycle-Sync, a robust and global framework for estimating camera
poses (both rotations and locations). Our core innovation is a location solver
that adapts message-passing least squares (MPLS) -- originally developed for
group synchronization -- to camera location estimation. We modify MPLS to
emphasize cycle-consistent information, redefine cycle consistencies using
estimated distances from previous iterations, and incorporate a Welsch-type
robust loss. We establish the strongest known deterministic exact-recovery
guarantee for camera location estimation, showing that cycle consistency alone
-- without access to inter-camera distances -- suffices to achieve the lowest
sample complexity currently known. To further enhance robustness, we introduce
a plug-and-play outlier rejection module inspired by robust subspace recovery,
and we fully integrate cycle consistency into MPLS for rotation
synchronization. Our global approach avoids the need for bundle adjustment.
Experiments on synthetic and real datasets show that Cycle-Sync consistently
outperforms leading pose estimators, including full structure-from-motion
pipelines with bundle adjustment.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Zifei Wu, Lijie Wang, Zhe Yang, Shijie Yang, Liang Wang et al.
ๆฅ็ๆ่ฆ
This paper presents the ZJUNlict team's work over the past year, covering
both hardware and software advancements. In the hardware domain, the
integration of an IMU into the v2023 robot was completed to enhance posture
accuracy and angular velocity planning. On the software side, key modules were
optimized, including the strategy and CUDA modules, with significant
improvements in decision making efficiency, ball pursuit prediction, and ball
possession prediction to adapt to high-tempo game dynamics.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Ruiyong Yuan, Jieji Ren, Zhanxuan Peng, Feifei Chen, Guoying Gu
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Suckers are significant for robots in picking, transferring, manipulation and
locomotion on diverse surfaces. However, most of the existing suckers lack
high-fidelity perceptual and tactile sensing, which impedes them from resolving
the fine-grained geometric features and interaction status of the target
surface. This limits their robust performance with irregular objects and in
complex, unstructured environments. Inspired by the adaptive structure and
high-performance sensory capabilities of cephalopod suckers, in this paper, we
propose a novel, intelligent sucker, named SuckTac, that integrates a
camera-based tactile sensor directly within its optimized structure to provide
high-density perception and robust suction. Specifically, through joint
structure design and optimization and based on a multi-material integrated
casting technique, a camera and light source are embedded into the sucker,
which enables in-situ, high-density perception of fine details like surface
shape, texture and roughness. To further enhance robustness and adaptability,
the sucker's mechanical design is also optimized by refining its profile,
adding a compliant lip, and incorporating surface microstructure. Extensive
experiments, including challenging tasks such as robotic cloth manipulation and
soft mobile robot inspection, demonstrate the superior performance and broad
applicability of the proposed system.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Learning generalizable policies for robotic manipulation increasingly relies
on large-scale models that map language instructions to actions (L2A). However,
this one-way paradigm often produces policies that execute tasks without deeper
contextual understanding, limiting their ability to generalize or explain their
behavior. We argue that the complementary skill of mapping actions back to
language (A2L) is essential for developing more holistic grounding. An agent
capable of both acting and explaining its actions can form richer internal
representations and unlock new paradigms for self-supervised learning. We
introduce LACY (Language-Action Cycle), a unified framework that learns such
bidirectional mappings within a single vision-language model. LACY is jointly
trained on three synergistic tasks: generating parameterized actions from
language (L2A), explaining observed actions in language (A2L), and verifying
semantic consistency between two language descriptions (L2C). This enables a
self-improving cycle that autonomously generates and filters new training data
through an active augmentation strategy targeting low-confidence cases, thereby
improving the model without additional human labels. Experiments on
pick-and-place tasks in both simulation and the real world show that LACY
improves task success rates by 56.46% on average and yields more robust
language-action grounding for robotic manipulation. Project page:
https://vla2026.github.io/LACY/
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Linxin Hou, Qirui Wu, Zhihang Qin, Neil Banerjee, Yongxin Guo et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper presents a quantitative comparison between centralised and
distributed multi-agent reinforcement learning (MARL) architectures for
controlling a soft robotic arm modelled as a Cosserat rod in simulation. Using
PyElastica and the OpenAI Gym interface, we train both a global Proximal Policy
Optimisation (PPO) controller and a Multi-Agent PPO (MAPPO) under identical
budgets. Both approaches are based on the arm having $n$ number of controlled
sections. The study systematically varies $n$ and evaluates the performance of
the arm to reach a fixed target in three scenarios: default baseline condition,
recovery from external disturbance, and adaptation to actuator failure.
Quantitative metrics used for the evaluation are mean action magnitude, mean
final distance, mean episode length, and success rate. The results show that
there are no significant benefits of the distributed policy when the number of
controlled sections $n\le4$. In very simple systems, when $n\le2$, the
centralised policy outperforms the distributed one. When $n$ increases to $4<
n\le 12$, the distributed policy shows a high sample efficiency. In these
systems, distributed policy promotes a stronger success rate, resilience, and
robustness under local observability and yields faster convergence given the
same sample size. However, centralised policies achieve much higher time
efficiency during training as it takes much less time to train the same size of
samples. These findings highlight the trade-offs between centralised and
distributed policy in reinforcement learning-based control for soft robotic
systems and provide actionable design guidance for future sim-to-real transfer
in soft rod-like manipulators.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Tian Hao, Tong Lu, Che Chan
Machine Learning
ๆฅ็ๆ่ฆ
Robotic assistance in minimally invasive surgery can greatly enhance surgical
precision and reduce surgeon fatigue. This paper presents a focused
investigation on the kinematic and ergonomic design principles for a
laparoscopic surgical robotic arm aimed at high-precision tasks. We propose a
7-degree-of-freedom (7-DOF) robotic arm system that incorporates a remote
center of motion (RCM) at the instrument insertion point and ergonomic
considerations to improve surgeon interaction. The design is implemented on a
general-purpose robotic platform, and a series of simulated surgical tasks were
performed to evaluate targeting accuracy, task efficiency, and surgeon comfort
compared to conventional manual laparoscopy. Experimental results demonstrate
that the optimized robotic design achieves significantly improved targeting
accuracy (error reduced by over 50%) and shorter task completion times, while
substantially lowering operator muscle strain and discomfort. These findings
validate the importance of kinematic optimization (such as added articulations
and tremor filtering) and human-centered ergonomic design in enhancing the
performance of robot-assisted surgery. The insights from this work can guide
the development of next-generation surgical robots that improve surgical
outcomes and ergonomics for the operating team.
๐
2025-11-04
๐ NeurIPS 2025
๐ฅ Alexander Htet Kyaw, Richa Gupta, Dhruv Shah, Anoop Sinha, Kory Mathewson et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Advances in 3D generative AI have enabled the creation of physical objects
from text prompts, but challenges remain in creating objects involving multiple
component types. We present a pipeline that integrates 3D generative AI with
vision-language models (VLMs) to enable the robotic assembly of multi-component
objects from natural language. Our method leverages VLMs for zero-shot,
multi-modal reasoning about geometry and functionality to decompose
AI-generated meshes into multi-component 3D models using predefined structural
and panel components. We demonstrate that a VLM is capable of determining which
mesh regions need panel components in addition to structural components, based
on object functionality. Evaluation across test objects shows that users
preferred the VLM-generated assignments 90.6% of the time, compared to 59.4%
for rule-based and 2.5% for random assignment. Lastly, the system allows users
to refine component assignments through conversational feedback, enabling
greater human control and agency in making physical objects with generative AI
and robotics.
๐
2025-11-04
๐ ArXiv Preprint (Robotics)
๐ฅ Tyler M. Paine, Anastasia Bizyaeva, Michael R. Benjamin
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Collaborating teams of robots show promise due in their ability to complete
missions more efficiently and with improved robustness, attributes that are
particularly useful for systems operating in marine environments. A key issue
is how to model, analyze, and design these multi-robot systems to realize the
full benefits of collaboration, a challenging task since the domain of
multi-robot autonomy encompasses both collective and individual behaviors. This
paper introduces a layered model of multi-robot autonomy that uses the
principle of census, or a weighted count of the inputs from neighbors, for
collective decision-making about teaming, coupled with multi-objective behavior
optimization for individual decision-making about actions. The census component
is expressed as a nonlinear opinion dynamics model and the multi-objective
behavior optimization is accomplished using interval programming. This model
can be reduced to recover foundational algorithms in distributed optimization
and control, while the full model enables new types of collective behaviors
that are useful in real-world scenarios. To illustrate these points, a new
method for distributed optimization of subgroup allocation is introduced where
robots use a gradient descent algorithm to minimize portions of the cost
functions that are locally known, while being influenced by the opinion states
from neighbors to account for the unobserved costs. With this method the group
can collectively use the information contained in the Hessian matrix of the
total global cost. The utility of this model is experimentally validated in
three categorically different experiments with fleets of autonomous surface
vehicles: an adaptive sampling scenario, a high value unit protection scenario,
and a competitive game of capture the flag.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Renos Zabounidis, Aditya Golatkar, Michael Kleinman, Alessandro Achille, Wei Xia et al.
Robotics
ๆฅ็ๆ่ฆ
We propose Re-FORC, an adaptive reward prediction method that, given a
context, enables prediction of the expected future rewards as a function of the
number of future thinking tokens. Re-FORC trains a lightweight adapter on
reasoning models, demonstrating improved prediction with longer reasoning and
larger models. Re-FORC enables: 1) early stopping of unpromising reasoning
chains, reducing compute by 26% while maintaining accuracy, 2) optimized model
and thinking length selection that achieves 4% higher accuracy at equal compute
and 55% less compute at equal accuracy compared to the largest model, 3)
adaptive test-time scaling, which increases accuracy by 11% in high compute
regime, and 7% in low compute regime. Re-FORC allows dynamic reasoning with
length control via cost-per-token thresholds while estimating computation time
upfront.
๐
2025-11-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xinyuan Song, Jiaye Teng, Ziye Ma
Machine Learning
ๆฅ็ๆ่ฆ
In this paper we study how the choice of loss functions of non-convex
optimization problems affects their robustness and optimization landscape,
through the study of noisy matrix sensing. In traditional regression tasks,
mean squared error (MSE) loss is a common choice, but it can be unreliable for
non-Gaussian or heavy-tailed noise. To address this issue, we adopt a robust
loss based on nonparametric regression, which uses a kernel-based estimate of
the residual density and maximizes the estimated log-likelihood. This robust
formulation coincides with the MSE loss under Gaussian errors but remains
stable under more general settings. We further examine how this robust loss
reshapes the optimization landscape by analyzing the upper-bound of restricted
isometry property (RIP) constants for spurious local minima to disappear.
Through theoretical and empirical analysis, we show that this new loss excels
at handling large noise and remains robust across diverse noise distributions.
This work offers initial insights into enhancing the robustness of machine
learning tasks through simply changing the loss, guided by an intuitive and
broadly applicable analytical framework.
๐
2025-11-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ziheng Geng, Jiachen Liu, Ran Cao, Lu Cheng, Dan M. Frangopol et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Flood insurance is an effective strategy for individuals to mitigate
disaster-related losses. However, participation rates among at-risk populations
in the United States remain strikingly low. This gap underscores the need to
understand and model the behavioral mechanisms underlying insurance decisions.
Large language models (LLMs) have recently exhibited human-like intelligence
across wide-ranging tasks, offering promising tools for simulating human
decision-making. This study constructs a benchmark dataset to capture insurance
purchase probabilities across factors. Using this dataset, the capacity of LLMs
is evaluated: while LLMs exhibit a qualitative understanding of factors, they
fall short in estimating quantitative probabilities. To address this
limitation, InsurAgent, an LLM-empowered agent comprising five modules
including perception, retrieval, reasoning, action, and memory, is proposed.
The retrieval module leverages retrieval-augmented generation (RAG) to ground
decisions in empirical survey data, achieving accurate estimation of marginal
and bivariate probabilities. The reasoning module leverages LLM common sense to
extrapolate beyond survey data, capturing contextual information that is
intractable for traditional models. The memory module supports the simulation
of temporal decision evolutions, illustrated through a roller coaster life
trajectory. Overall, InsurAgent provides a valuable tool for behavioral
modeling and policy analysis.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Joshua Ashkinaze, Hua Shen, Sai Avula, Eric Gilbert, Ceren Budak
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We introduce the Deep Value Benchmark (DVB), an evaluation framework that
directly tests whether large language models (LLMs) learn fundamental human
values or merely surface-level preferences. This distinction is critical for AI
alignment: Systems that capture deeper values are likely to generalize human
intentions robustly, while those that capture only superficial patterns in
preference data risk producing misaligned behavior. The DVB uses a novel
experimental design with controlled confounding between deep values (e.g.,
moral principles) and shallow features (e.g., superficial attributes). In the
training phase, we expose LLMs to human preference data with deliberately
correlated deep and shallow features -- for instance, where a user consistently
prefers (non-maleficence, formal language) options over (justice, informal
language) alternatives. The testing phase then breaks these correlations,
presenting choices between (justice, formal language) and (non-maleficence,
informal language) options. This design allows us to precisely measure a
model's Deep Value Generalization Rate (DVGR) -- the probability of
generalizing based on the underlying value rather than the shallow feature.
Across 9 different models, the average DVGR is just 0.30. All models generalize
deep values less than chance. Larger models have a (slightly) lower DVGR than
smaller models. We are releasing our dataset, which was subject to three
separate human validation experiments. DVB provides an interpretable measure of
a core feature of alignment.
๐
2025-11-03
๐ ArXiv Preprint (cs.SE)
๐ฅ Steven Cho, Stefano Ruberto, Valerio Terragni
Natural Language Processing
ๆฅ็ๆ่ฆ
Using large language models (LLMs) to perform natural language processing
(NLP) tasks has become increasingly pervasive in recent times. The versatile
nature of LLMs makes them applicable to a wide range of such tasks. While the
performance of recent LLMs is generally outstanding, several studies have shown
that they can often produce incorrect results. Automatically identifying these
faulty behaviors is extremely useful for improving the effectiveness of LLMs.
One obstacle to this is the limited availability of labeled datasets, which
necessitates an oracle to determine the correctness of LLM behaviors.
Metamorphic testing (MT) is a popular testing approach that alleviates this
oracle problem. At the core of MT are metamorphic relations (MRs), which define
relationships between the outputs of related inputs. MT can expose faulty
behaviors without the need for explicit oracles (e.g., labeled datasets). This
paper presents the most comprehensive study of MT for LLMs to date. We
conducted a literature review and collected 191 MRs for NLP tasks. We
implemented a representative subset (36 MRs) to conduct a series of experiments
with three popular LLMs, running approximately 560,000 metamorphic tests. The
results shed light on the capabilities and opportunities of MT for LLMs, as
well as its limitations.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Rodrigo Mendoza-Smith
ๆฅ็ๆ่ฆ
Shapley data valuation provides a principled, axiomatic framework for
assigning importance to individual datapoints, and has gained traction in
dataset curation, pruning, and pricing. However, it is a combinatorial measure
that requires evaluating marginal utility across all subsets of the data,
making it computationally infeasible at scale. We propose a geometric
alternative based on statistical leverage scores, which quantify each
datapoint's structural influence in the representation space by measuring how
much it extends the span of the dataset and contributes to the effective
dimensionality of the training problem. We show that our scores satisfy the
dummy, efficiency, and symmetry axioms of Shapley valuation and that extending
them to \emph{ridge leverage scores} yields strictly positive marginal gains
that connect naturally to classical A- and D-optimal design criteria. We
further show that training on a leverage-sampled subset produces a model whose
parameters and predictive risk are within $O(\varepsilon)$ of the full-data
optimum, thereby providing a rigorous link between data valuation and
downstream decision quality. Finally, we conduct an active learning experiment
in which we empirically demonstrate that ridge-leverage sampling outperforms
standard baselines without requiring access gradients or backward passes.
๐
2025-11-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Michel Ma, Takuma Seno, Kaushik Subramanian, Peter R. Wurman, Peter Stone et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
When designing reinforcement learning (RL) agents, a designer communicates
the desired agent behavior through the definition of reward functions -
numerical feedback given to the agent as reward or punishment for its actions.
However, mapping desired behaviors to reward functions can be a difficult
process, especially in complex environments such as autonomous racing. In this
paper, we demonstrate how current foundation models can effectively search over
a space of reward functions to produce desirable RL agents for the Gran Turismo
7 racing game, given only text-based instructions. Through a combination of
LLM-based reward generation, VLM preference-based evaluation, and human
feedback we demonstrate how our system can be used to produce racing agents
competitive with GT Sophy, a champion-level RL racing agent, as well as
generate novel behaviors, paving the way for practical automated reward design
in real world applications.
๐
2025-11-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kishansingh Rajput, Malachi Schram, Brian Sammuli, Sen Lin
Machine Learning
ๆฅ็ๆ่ฆ
Machine Learning (ML) is poised to play a pivotal role in the development and
operation of next-generation fusion devices. Fusion data shows non-stationary
behavior with distribution drifts, resulted by both experimental evolution and
machine wear-and-tear. ML models assume stationary distribution and fail to
maintain performance when encountered with such non-stationary data streams.
Online learning techniques have been leveraged in other domains, however it has
been largely unexplored for fusion applications. In this paper, we present an
application of online learning to continuously adapt to drifting data stream
for prediction of Toroidal Field (TF) coils deflection at the DIII-D fusion
facility. The results demonstrate that online learning is critical to maintain
ML model performance and reduces error by 80% compared to a static model.
Moreover, traditional online learning can suffer from short-term performance
degradation as ground truth is not available before making the predictions. As
such, we propose an uncertainty guided online ensemble method to further
improve the performance. The Deep Gaussian Process Approximation (DGPA)
technique is leveraged for calibrated uncertainty estimation and the
uncertainty values are then used to guide a meta-algorithm that produces
predictions based on an ensemble of learners trained on different horizon of
historical data. The DGPA also provides uncertainty estimation along with the
predictions for decision makers. The online ensemble and the proposed
uncertainty guided online ensemble reduces predictions error by about 6%, and
10% respectively over standard single model based online learning.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yue Yang, Fabian Necker, Christoph Leuze, Michelle Chen, Andrey Finegersh et al.
ๆฅ็ๆ่ฆ
Purpose: In this paper, we develop and clinically evaluate a depth-only,
markerless augmented reality (AR) registration pipeline on a head-mounted
display, and assess accuracy across small or low-curvature anatomies in
real-life operative settings. Methods: On HoloLens 2, we align Articulated HAnd
Tracking (AHAT) depth to Computed Tomography (CT)-derived skin meshes via (i)
depth-bias correction, (ii) brief human-in-the-loop initialization, (iii)
global and local registration. We validated the surface-tracing error metric by
comparing "skin-to-bone" relative distances to CT ground truth on leg and foot
models, using an AR-tracked tool. We then performed seven intraoperative target
trials (feet x2, ear x3, leg x2) during the initial stage of fibula free-flap
harvest and mandibular reconstruction surgery, and collected 500+ data per
trial. Results: Preclinical validation showed tight agreement between AR-traced
and CT distances (leg: median |Delta d| 0.78 mm, RMSE 0.97 mm; feet: 0.80 mm,
1.20 mm). Clinically, per-point error had a median of 3.9 mm. Median errors by
anatomy were 3.2 mm (feet), 4.3 mm (ear), and 5.3 mm (lower leg), with 5 mm
coverage 92-95%, 84-90%, and 72-86%, respectively. Feet vs. lower leg differed
significantly (Delta median ~1.1 mm; p < 0.001). Conclusion: A depth-only,
markerless AR pipeline on HMDs achieved ~3-4 mm median error across feet, ear,
and lower leg in live surgical settings without fiducials, approaching typical
clinical error thresholds for moderate-risk tasks. Human-guided initialization
plus global-to-local registration enabled accurate alignment on small or
low-curvature targets, improving the clinical readiness of markerless AR
guidance.
๐
2025-11-03
๐ ArXiv Preprint (eess.IV)
๐ฅ Ali Almuallem, Harshana Weligampola, Abhiram Gnanasambandam, Wei Xu, Dilshan Godaliyadda et al.
Machine Learning
ๆฅ็ๆ่ฆ
Opto-electronic neural networks integrate optical front-ends with electronic
back-ends to enable fast and energy-efficient vision. However, conventional
end-to-end optimization of both the optical and electronic modules is limited
by costly simulations and large parameter spaces. We introduce a two-stage
strategy for designing opto-electronic convolutional neural networks (CNNs):
first, train a standard electronic CNN, then realize the optical front-end
implemented as a metasurface array through direct kernel optimization of its
first convolutional layer. This approach reduces computational and memory
demands by hundreds of times and improves training stability compared to
end-to-end optimization. On monocular depth estimation, the proposed two-stage
design achieves twice the accuracy of end-to-end training under the same
training time and resource constraints.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Soham Joshi, Shwet Kamal Mishra, Viswanath Gopalakrishnan
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Creation of large-scale databases for Visual Question Answering tasks
pertaining to the text data in a scene (text-VQA) involves skilful human
annotation, which is tedious and challenging. With the advent of foundation
models that handle vision and language modalities, and with the maturity of OCR
systems, it is the need of the hour to establish an end-to-end pipeline that
can synthesize Question-Answer (QA) pairs based on scene-text from a given
image. We propose a pipeline for automated synthesis for text-VQA dataset that
can produce faithful QA pairs, and which scales up with the availability of
scene text data. Our proposed method harnesses the capabilities of multiple
models and algorithms involving OCR detection and recognition (text spotting),
region of interest (ROI) detection, caption generation, and question
generation. These components are streamlined into a cohesive pipeline to
automate the synthesis and validation of QA pairs. To the best of our
knowledge, this is the first pipeline proposed to automatically synthesize and
validate a large-scale text-VQA dataset comprising around 72K QA pairs based on
around 44K images.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tuan Truong, Guillermo Jimenez Perez, Pedro Osorio, Matthias Lenga
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
The detection of Protected Health Information (PHI) in medical imaging is
critical for safeguarding patient privacy and ensuring compliance with
regulatory frameworks. Traditional detection methodologies predominantly
utilize Optical Character Recognition (OCR) models in conjunction with named
entity recognition. However, recent advancements in Large Multimodal Model
(LMM) present new opportunities for enhanced text extraction and semantic
analysis. In this study, we systematically benchmark three prominent closed and
open-sourced LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing
two distinct pipeline configurations: one dedicated to text analysis alone and
another integrating both OCR and semantic analysis. Our results indicate that
LMM exhibits superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to
conventional models like EasyOCR. However, this improvement in OCR performance
does not consistently correlate with enhanced overall PHI detection accuracy.
The strongest performance gains are observed on test cases with complex imprint
patterns. In scenarios where text regions are well readable with sufficient
contrast, and strong LMMs are employed for text analysis after OCR, different
pipeline configurations yield similar results. Furthermore, we provide
empirically grounded recommendations for LMM selection tailored to specific
operational constraints and propose a deployment strategy that leverages
scalable and modular infrastructure.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Benjamin Walder, Daniel Toader, Robert Nuster, Gรผnther Paltauf, Peter Burgholzer et al.
ๆฅ็ๆ่ฆ
We address the problem of image reconstruction from incomplete measurements,
encompassing both upsampling and inpainting, within a learning-based framework.
Conventional supervised approaches require fully sampled ground truth data,
while self-supervised methods allow incomplete ground truth but typically rely
on random sampling that, in expectation, covers the entire image. In contrast,
we consider fixed, deterministic sampling patterns with inherently incomplete
coverage, even in expectation. To overcome this limitation, we exploit multiple
invariances of the underlying image distribution, which theoretically allows us
to achieve the same reconstruction performance as fully supervised approaches.
We validate our method on optical-resolution image upsampling in photoacoustic
microscopy (PAM), demonstrating competitive or superior results while requiring
substantially less ground truth data.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Saurabh Kaushik, Lalit Maurya, Elizabeth Tellman, ZhiJie Zhang
ๆฅ็ๆ่ฆ
Geo-Foundational Models (GFMs) enable fast and reliable extraction of
spatiotemporal information from satellite imagery, improving flood inundation
mapping by leveraging location and time embeddings. Despite their potential, it
remains unclear whether GFMs outperform traditional models like U-Net. A
systematic comparison across sensors and data availability scenarios is still
lacking, which is an essential step to guide end-users in model selection. To
address this, we evaluate three GFMs, Prithvi 2.0, Clay V1.5, DOFA, and UViT (a
Prithvi variant), against TransNorm, U-Net, and Attention U-Net using
PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance
among all GFMs, with only 2-5% variation between the best and worst models
across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and
Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In
leave-one-region-out cross-validation across five regions, Clay shows slightly
better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07),
0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA
(0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and
Sentinel-1, respectively. Across all 19 sites, leave-one-region-out
cross-validation reveals a 4% improvement by Clay compared to U-Net. Visual
inspection highlights Clay's superior ability to retain fine details. Few-shot
experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training
images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational
time, Clay is a better choice due to its smaller model size (26M parameters),
making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M).
Contrary to previous findings, our results suggest GFMs offer small to moderate
improvements in flood mapping accuracy at lower computational cost and labeling
effort compared to traditional U-Net.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin et al.
MultimodalNatural Language ProcessingRoboticsComputer Vision
ๆฅ็ๆ่ฆ
The frontier of visual reasoning is shifting toward models like OpenAI o3,
which can intelligently create and operate tools to transform images for
problem-solving, also known as thinking-\textit{with}-images in
chain-of-thought. Yet existing benchmarks fail to fully capture this advanced
capability. Even Visual Search, the most common benchmark for current
thinking-\textit{with}-images methods, tests only basic operations such as
localization and cropping, offering little insight into more complex, dynamic,
and tool-dependent reasoning. We introduce \textbf{TIR-Bench}, a comprehensive
benchmark for evaluating agentic thinking-with-images across 13 diverse tasks,
each requiring novel tool use for image processing and manipulation in
chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from
leading open-sourced and proprietary models to those with explicit tool-use
augmentation. Results show that TIR-Bench is universally challenging, and
strong performance requires genuine thinking-with-images capabilities. Finally,
we present a pilot study comparing direct versus agentic fine-tuning.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sagi Eppel, Alona Strugatski
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The ability to connect visual patterns with the processes that form them
represents one of the deepest forms of visual understanding. Textures of clouds
and waves, the growth of cities and forests, or the formation of materials and
landscapes are all examples of patterns emerging from underlying mechanisms. We
present the Scitextures dataset, a large-scale collection of textures and
visual patterns from all domains of science, tech, and art, along with the
models and code that generate these images. Covering over 1,200 different
models and 100,000 images of patterns and textures from physics, chemistry,
biology, sociology, technology, mathematics, and art, this dataset offers a way
to explore the connection between the visual patterns that shape our world and
the mechanisms that produce them. Created by an agentic AI pipeline that
autonomously collects and implements models in standardized form, we use
SciTextures to evaluate the ability of leading AI models to link visual
patterns to the models and code that generate them, and to identify different
patterns that emerged from the same process. We also test AIs ability to infer
and recreate the mechanisms behind visual patterns by providing a natural image
of a real-world pattern and asking the AI to identify, model, and code the
mechanism that formed the pattern, then run this code to generate a simulated
image that is compared to the real image. These benchmarks show that
vision-language models (VLMs) can understand and simulate the physical system
beyond a visual pattern. The dataset and code are available at:
https://zenodo.org/records/17485502
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tejas Sarnaik, Manan Shah, Ravi Hegde
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) has become a robust framework for
enhancing Large Language Models (LLMs) with external knowledge. Recent advances
in RAG have investigated graph based retrieval for intricate reasoning;
however, the influence of prompt design on enhancing the retrieval and
reasoning process is still considerably under-examined. In this paper, we
present a prompt-driven GraphRAG framework that underscores the significance of
prompt formulation in facilitating entity extraction, fact selection, and
passage reranking for multi-hop question answering. Our approach creates a
symbolic knowledge graph from text data by encoding entities and factual
relationships as structured facts triples. We use LLMs selectively during
online retrieval to perform semantic filtering and answer generation. We also
use entity-guided graph traversal through Personalized PageRank (PPR) to
support efficient, scalable retrieval based on the knowledge graph we built.
Our system gets state-of-the-art performance on HotpotQA and 2WikiMultiHopQA,
with F1 scores of 80.7% and 78.9%, and Recall@5 scores of 97.1% and 98.1%,
respectively. These results show that prompt design is an important part of
improving retrieval accuracy and response quality. This research lays the
groundwork for more efficient and comprehensible multi-hop question-answering
systems, highlighting the importance of prompt-aware graph reasoning.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Gabriel Nobis, Maximilian Springenberg, Arina Belova, Rembert Daems, Christoph Knochenhauer et al.
ๆฅ็ๆ่ฆ
We present Fractional Diffusion Bridge Models (FDBM), a novel generative
diffusion bridge framework driven by an approximation of the rich and
non-Markovian fractional Brownian motion (fBM). Real stochastic processes
exhibit a degree of memory effects (correlations in time), long-range
dependencies, roughness and anomalous diffusion phenomena that are not captured
in standard diffusion or bridge modeling due to the use of Brownian motion
(BM). As a remedy, leveraging a recent Markovian approximation of fBM (MA-fBM),
we construct FDBM that enable tractable inference while preserving the
non-Markovian nature of fBM. We prove the existence of a coupling-preserving
generative diffusion bridge and leverage it for future state prediction from
paired training data. We then extend our formulation to the Schr\"{o}dinger
bridge problem and derive a principled loss function to learn the unpaired data
translation. We evaluate FDBM on both tasks: predicting future protein
conformations from aligned data, and unpaired image translation. In both
settings, FDBM achieves superior performance compared to the Brownian
baselines, yielding lower root mean squared deviation (RMSD) of C$_\alpha$
atomic positions in protein structure prediction and lower Fr\'echet Inception
Distance (FID) in unpaired image translation.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhen Chen, Qing Xu, Jinlin Wu, Biao Yang, Yuhao Zhai et al.
Multimodal
ๆฅ็ๆ่ฆ
Foundation models in video generation are demonstrating remarkable
capabilities as potential world models for simulating the physical world.
However, their application in high-stakes domains like surgery, which demand
deep, specialized causal knowledge rather than general physical rules, remains
a critical unexplored gap. To systematically address this challenge, we present
SurgVeo, the first expert-curated benchmark for video generation model
evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel,
four-tiered framework tailored to assess model outputs from basic appearance to
complex surgical strategy. On the basis of the SurgVeo benchmark, we task the
advanced Veo-3 model with a zero-shot prediction task on surgical clips from
laparoscopic and neurosurgical procedures. A panel of four board-certified
surgeons evaluates the generated videos according to the SPP. Our results
reveal a distinct "plausibility gap": while Veo-3 achieves exceptional Visual
Perceptual Plausibility, it fails critically at higher levels of the SPP,
including Instrument Operation Plausibility, Environment Feedback Plausibility,
and Surgical Intent Plausibility. This work provides the first quantitative
evidence of the chasm between visually convincing mimicry and causal
understanding in surgical AI. Our findings from SurgVeo and the SPP establish a
crucial foundation and roadmap for developing future models capable of
navigating the complexities of specialized, real-world healthcare domains.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhe Liu, Jinghua Hou, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao et al.
Natural Language ProcessingRoboticsComputer Vision
ๆฅ็ๆ่ฆ
Although transformers have demonstrated remarkable capabilities across
various domains, their quadratic attention mechanisms introduce significant
computational overhead when processing long-sequence data. In this paper, we
present a unified autonomous driving model, UniLION, which efficiently handles
large-scale LiDAR point clouds, high-resolution multi-view images, and even
temporal sequences based on the linear group RNN operator (i.e., performs
linear RNN for grouped features). Remarkably, UniLION serves as a single
versatile architecture that can seamlessly support multiple specialized
variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal
temporal fusion configurations) without requiring explicit temporal or
multi-modal fusion modules. Moreover, UniLION consistently delivers competitive
and even state-of-the-art performance across a wide range of core tasks,
including 3D perception (e.g., 3D object detection, 3D object tracking, 3D
occupancy prediction, BEV map segmentation), prediction (e.g., motion
prediction), and planning (e.g., end-to-end planning). This unified paradigm
naturally simplifies the design of multi-modal and multi-task autonomous
driving systems while maintaining superior performance. Ultimately, we hope
UniLION offers a fresh perspective on the development of 3D foundation models
in autonomous driving. Code is available at
https://github.com/happinesslz/UniLION
๐
2025-11-03
๐ 21 pages, 19 figures, accepted by TPAMI
๐ฅ Yuxiao Yang, Xiao-Xiao Long, Zhiyang Dou, Cheng Lin, Yuan Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
In this work, we introduce \textbf{Wonder3D++}, a novel method for
efficiently generating high-fidelity textured meshes from single-view images.
Recent methods based on Score Distillation Sampling (SDS) have shown the
potential to recover 3D geometry from 2D diffusion priors, but they typically
suffer from time-consuming per-shape optimization and inconsistent geometry. In
contrast, certain works directly produce 3D information via fast network
inferences, but their results are often of low quality and lack geometric
details. To holistically improve the quality, consistency, and efficiency of
single-view reconstruction tasks, we propose a cross-domain diffusion model
that generates multi-view normal maps and the corresponding color images. To
ensure the consistency of generation, we employ a multi-view cross-domain
attention mechanism that facilitates information exchange across views and
modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that
drives high-quality surfaces from the multi-view 2D representations in only
about $3$ minute in a coarse-to-fine manner. Our extensive evaluations
demonstrate that our method achieves high-quality reconstruction results,
robust generalization, and good efficiency compared to prior works. Code
available at https://github.com/xxlong0/Wonder3D/tree/Wonder3D_Plus.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Kai Zhai, Ziyan Huang, Qiang Nie, Xiang Li, Bo Ouyang
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
2D-to-3D human pose lifting is a fundamental challenge for 3D human pose
estimation in monocular video, where graph convolutional networks (GCNs) and
attention mechanisms have proven to be inherently suitable for encoding the
spatial-temporal correlations of skeletal joints. However, depth ambiguity and
errors in 2D pose estimation lead to incoherence in the 3D trajectory. Previous
studies have attempted to restrict jitters in the time domain, for instance, by
constraining the differences between adjacent frames while neglecting the
global spatial-temporal correlations of skeletal joint motion. To tackle this
problem, we design HGFreNet, a novel GraphFormer architecture with hop-hybrid
feature aggregation and 3D trajectory consistency in the frequency domain.
Specifically, we propose a hop-hybrid graph attention (HGA) module and a
Transformer encoder to model global joint spatial-temporal correlations. The
HGA module groups all $k$-hop neighbors of a skeletal joint into a hybrid group
to enlarge the receptive field and applies the attention mechanism to discover
the latent correlations of these groups globally. We then exploit global
temporal correlations by constraining trajectory consistency in the frequency
domain. To provide 3D information for depth inference across frames and
maintain coherence over time, a preliminary network is applied to estimate the
3D pose. Extensive experiments were conducted on two standard benchmark
datasets: Human3.6M and MPI-INF-3DHP. The results demonstrate that the proposed
HGFreNet outperforms state-of-the-art (SOTA) methods in terms of positional
accuracy and temporal consistency.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Rong Li, Yuhao Dong, Tianshuai Hu, Ao Liang, Youquan Liu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Visual grounding in 3D is the key for embodied agents to localize
language-referred objects in open-world environments. However, existing
benchmarks are limited to indoor focus, single-platform constraints, and small
scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark
featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We
provide over 128,000 objects and 22,000 validated referring expressions across
diverse outdoor scenes -- 10x larger than existing datasets. We develop a
scalable annotation pipeline combining vision-language model prompting with
human verification to ensure high-quality spatial grounding. To support
cross-platform learning, we propose platform-aware normalization and
cross-modal alignment techniques, and establish benchmark protocols for
in-domain and cross-platform evaluations. Our findings reveal significant
performance gaps, highlighting the challenges and opportunities of
generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released
to advance future research in language-driven 3D embodied perception.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yefeng Wu, Yuchen Song, Ling Wu, Shan Wan, Yecheng Zhao
Natural Language ProcessingComputer Vision
ๆฅ็ๆ่ฆ
Pneumonia remains a leading cause of morbidity and mortality worldwide,
necessitating accurate and efficient automated detection systems. While recent
transformer-based detectors like RT-DETR have shown promise in object detection
tasks, their application to medical imaging, particularly pneumonia detection
in chest X-rays, remains underexplored. This paper presents CGF-DETR, an
enhanced real-time detection transformer specifically designed for pneumonia
detection. We introduce XFABlock in the backbone to improve multi-scale feature
extraction through convolutional attention mechanisms integrated with CSP
architecture. To achieve efficient feature aggregation, we propose SPGA module
that replaces standard multi-head attention with dynamic gating mechanisms and
single-head self-attention. Additionally, GCFC3 is designed for the neck to
enhance feature representation through multi-path convolution fusion while
maintaining real-time performance via structural re-parameterization. Extensive
experiments on the RSNA Pneumonia Detection dataset demonstrate that CGF-DETR
achieves 82.2% mAP@0.5, outperforming the baseline RT-DETR-l by 3.7% while
maintaining comparable inference speed at 48.1 FPS. Our ablation studies
confirm that each proposed module contributes meaningfully to the overall
performance improvement, with the complete model achieving 50.4% mAP@[0.5:0.95]
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tom Odem
ๆฅ็ๆ่ฆ
This research builds on work in anticipatory human-machine interaction, a
subfield of human-machine interaction where machines can facilitate
advantageous interactions by anticipating a user's future state. The aim of
this research is to further a machine's understanding of user knowledge, skill,
and behavior in pursuit of implicit coordination. A task explorer pipeline was
developed that uses clustering techniques, paired with factor analysis and
string edit distance, to automatically identify key global and local strategies
that are used to complete tasks. Global strategies identify generalized sets of
actions used to complete tasks, while local strategies identify sequences that
used those sets of actions in a similar composition. Additionally, meaningful
subtasks of various lengths are identified within the tasks. The task explorer
pipeline was able to automatically identify key strategies used to complete
tasks and encode user runs with hierarchical subtask structures. In addition, a
Task Explorer application was developed to easily review pipeline results. The
task explorer pipeline can be easily modified to any action-based time-series
data and the identified strategies and subtasks help to inform humans and
machines on user knowledge, skill, and behavior.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yi Zhang, Zheng Wang, Chen Zhen, Wenjie Ruan, Qing Guo et al.
Machine Learning
ๆฅ็ๆ่ฆ
Deep learning models are notoriously vulnerable to imperceptible
perturbations. Most existing research centers on adversarial robustness (AR),
which evaluates models under worst-case scenarios by examining the existence of
deterministic adversarial examples (AEs). In contrast, probabilistic robustness
(PR) adopts a statistical perspective, measuring the probability that
predictions remain correct under stochastic perturbations. While PR is widely
regarded as a practical complement to AR, dedicated training methods for
improving PR are still relatively underexplored, albeit with emerging progress.
Among the few PR-targeted training methods, we identify three limitations: i
non-comparable evaluation protocols; ii limited comparisons to strong AT
baselines despite anecdotal PR gains from AT; and iii no unified framework to
compare the generalization of these methods. Thus, we introduce PRBench, the
first benchmark dedicated to evaluating improvements in PR achieved by
different robustness training methods. PRBench empirically compares most common
AT and PR-targeted training methods using a comprehensive set of metrics,
including clean accuracy, PR and AR performance, training efficiency, and
generalization error (GE). We also provide theoretical analysis on the GE of PR
performance across different training methods. Main findings revealed by
PRBench include: AT methods are more versatile than PR-targeted training
methods in terms of improving both AR and PR performance across diverse
hyperparameter settings, while PR-targeted training methods consistently yield
lower GE and higher clean accuracy. A leaderboard comprising 222 trained models
across 7 datasets and 10 model architectures is publicly available at
https://tmpspace.github.io/PRBenchLeaderboard/.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao et al.
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models aim to understand natural language
instructions and visual observations and to execute corresponding actions as an
embodied agent. Recent work integrates future images into the
understanding-acting loop, yielding unified VLAs that jointly understand,
generate, and act -- reading text and images and producing future images and
actions. However, these models either rely on external experts for modality
unification or treat image generation and action prediction as separate
processes, limiting the benefits of direct synergy between these tasks. Our
core philosophy is to optimize generation and action jointly through a
synchronous denoising process, where the iterative refinement enables actions
to evolve from initialization, under constant and sufficient visual guidance.
We ground this philosophy in our proposed Unified Diffusion VLA and Joint
Discrete Denoising Diffusion Process (JD3P), which is a joint diffusion process
that integrates multiple modalities into a single denoising trajectory to serve
as the key mechanism enabling understanding, generation, and acting to be
intrinsically synergistic. Our model and theory are built on a unified
tokenized space of all modalities and a hybrid attention mechanism. We further
propose a two-stage training pipeline and several inference-time techniques
that optimize performance and efficiency. Our approach achieves
state-of-the-art performance on benchmarks such as CALVIN, LIBERO, and
SimplerEnv with 4$\times$ faster inference than autoregressive methods, and we
demonstrate its effectiveness through in-depth analysis and real-world
evaluations. Our project page is available at
https://irpn-eai.github.io/UD-VLA.github.io/.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Qiao, Matteo Poggi, Xing Wei, Pengchao Deng, Yanhui Zhou et al.
Machine Learning
ๆฅ็ๆ่ฆ
Under-display ToF imaging aims to achieve accurate depth sensing through a
ToF camera placed beneath a screen panel. However, transparent OLED (TOLED)
layers introduce severe degradations-such as signal attenuation, multi-path
interference (MPI), and temporal noise-that significantly compromise depth
quality. To alleviate this drawback, we propose Learnable Fractional
Reaction-Diffusion Dynamics (LFRD2), a hybrid framework that combines the
expressive power of neural networks with the interpretability of physical
modeling. Specifically, we implement a time-fractional reaction-diffusion
module that enables iterative depth refinement with dynamically generated
differential orders, capturing long-term dependencies. In addition, we
introduce an efficient continuous convolution operator via coefficient
prediction and repeated differentiation to further improve restoration quality.
Experiments on four benchmark datasets demonstrate the effectiveness of our
approach. The code is publicly available at https://github.com/wudiqx106/LFRD2.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuhang Kang, Ziyu Su, Tianyang Wang, Zaibo Li, Wei Chen et al.
ๆฅ็ๆ่ฆ
Compared to hematoxylin-eosin (H&E) staining, immunohistochemistry (IHC) not
only maintains the structural features of tissue samples, but also provides
high-resolution protein localization, which is essential for aiding in
pathology diagnosis. Despite its diagnostic value, IHC remains a costly and
labor-intensive technique. Its limited scalability and constraints in
multiplexing further hinder widespread adoption, especially in resource-limited
settings. Consequently, researchers are increasingly exploring computational
stain translation techniques to synthesize IHC-equivalent images from
H&E-stained slides, aiming to extract protein-level information more
efficiently and cost-effectively. However, most existing stain translation
techniques rely on a linearly weighted summation of multiple loss terms within
a single objective function, strategy that often overlooks the interdepedence
among these components-resulting in suboptimal image quality and an inability
to simultaneously preserve structural authenticity and color fidelity. To
address this limitation, we propose a novel network architecture that follows a
progressive structure, incorporating color and cell border generation logic,
which enables each visual aspect to be optimized in a stage-wise and decoupled
manner. To validate the effectiveness of our proposed network architecture, we
build upon the Adaptive Supervised PatchNCE (ASP) framework as our baseline. We
introduce additional loss functions based on 3,3'-diaminobenzidine (DAB)
chromogen concentration and image gradient, enhancing color fidelity and cell
boundary clarity in the generated IHC images. By reconstructing the generation
pipeline using our structure-color-cell boundary progressive mechanism,
experiments on HER2 and ER datasets demonstrated that the model significantly
improved visual quality and achieved finer structural details.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Ropeway Liu, Hangjie Yuan, Bo Dong, Jiazheng Xing, Jinwang Wang et al.
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Relighting is a crucial task with both practical demand and artistic value,
and recent diffusion models have shown strong potential by enabling rich and
controllable lighting effects. However, as they are typically optimized in
semantic latent space, where proximity does not guarantee physical correctness
in visual space, they often produce unrealistic results, such as overexposed
highlights, misaligned shadows, and incorrect occlusions. We address this with
UniLumos, a unified relighting framework for both images and videos that brings
RGB-space geometry feedback into a flow matching backbone. By supervising the
model with depth and normal maps extracted from its outputs, we explicitly
align lighting effects with the scene structure, enhancing physical
plausibility. Nevertheless, this feedback requires high-quality outputs for
supervision in visual space, making standard multi-step denoising
computationally expensive. To mitigate this, we employ path consistency
learning, allowing supervision to remain effective even under few-step training
regimes. To enable fine-grained relighting control and supervision, we design a
structured six-dimensional annotation protocol capturing core illumination
attributes. Building upon this, we propose LumosBench, a disentangled
attribute-level benchmark that evaluates lighting controllability via large
vision-language models, enabling automatic and interpretable assessment of
relighting precision across individual dimensions. Extensive experiments
demonstrate that UniLumos achieves state-of-the-art relighting quality with
significantly improved physical consistency, while delivering a 20x speedup for
both image and video relighting. Code is available at
https://github.com/alibaba-damo-academy/Lumos-Custom.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaogang Xu, Ruihang Chu, Jian Wang, Kun Zhou, Wenjie Shu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement Learning (RL) has recently been incorporated into diffusion
models, e.g., tasks such as text-to-image. However, directly applying existing
RL methods to diffusion-based image restoration models is suboptimal, as the
objective of restoration fundamentally differs from that of pure generation: it
places greater emphasis on fidelity. In this paper, we investigate how to
effectively integrate RL into diffusion-based restoration models. First,
through extensive experiments with various reward functions, we find that an
effective reward can be derived from an Image Quality Assessment (IQA) model,
instead of intuitive ground-truth-based supervision, which has already been
optimized during the Supervised Fine-Tuning (SFT) stage prior to RL. Moreover,
our strategy focuses on using RL for challenging samples that are significantly
distant from the ground truth, and our RL approach is innovatively implemented
using MLLM-based IQA models to align distributions with high-quality images
initially. As the samples approach the ground truth's distribution, RL is
adaptively combined with SFT for more fine-grained alignment. This dynamic
process is facilitated through an automatic weighting strategy that adjusts
based on the relative difficulty of the training samples. Our strategy is
plug-and-play that can be seamlessly applied to diffusion-based restoration
models, boosting its performance across various restoration tasks. Extensive
experiments across multiple benchmarks demonstrate the effectiveness of our
proposed RL framework.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma et al.
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in Multimodal Large Language Models (MLLMs) have
significantly improved 2D visual understanding, prompting interest in their
application to complex 3D reasoning tasks. However, it remains unclear whether
these models can effectively capture the detailed spatial information required
for robust real-world performance, especially cross-view consistency, a key
requirement for accurate 3D reasoning. Considering this issue, we introduce
Viewpoint Learning, a task designed to evaluate and improve the spatial
reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset,
consisting of 100K object-centric image pairs with diverse viewpoints and
corresponding question-answer pairs. Our approach employs a two-stage
fine-tuning strategy: first, foundational knowledge is injected to the baseline
MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in
significant improvements across multiple tasks; second, generalization is
enhanced through Reinforcement Learning using the Group Relative Policy
Optimization (GRPO) algorithm on a broader set of questions. Additionally, we
introduce a hybrid cold-start initialization method designed to simultaneously
learn viewpoint representations and maintain coherent reasoning thinking.
Experimental results show that our approach significantly activates the spatial
reasoning ability of MLLM, improving performance on both in-domain and
out-of-domain reasoning tasks. Our findings highlight the value of developing
foundational spatial skills in MLLMs, supporting future progress in robotics,
autonomous systems, and 3D scene understanding.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
In the retrieval domain, candidates' fusion from heterogeneous retrievers is
a long-standing challenge, particularly for complex, multi-modal data such as
videos. While typical fusion techniques are training-free, they rely solely on
rank or score signals, disregarding candidates' representations. This work
introduces Vote-in-Context (ViC), a generalized, training-free framework that
re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a
Vision-Language Model (VLM). The core insight is to serialize both content
evidence and retriever metadata directly within the VLM's prompt, allowing the
model to adaptively weigh retriever consensus against visual-linguistic
content. We demonstrate the generality of this framework by applying it to the
challenging domain of cross-modal video retrieval. To this end, we introduce
the S-Grid, a compact serialization map that represents each video as an image
grid, optionally paired with subtitles to enable list-wise reasoning over video
candidates. ViC is evaluated both as a single-list reranker, where it
dramatically improves the precision of individual retrievers, and as an
ensemble fuser, where it consistently outperforms strong baselines like
CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the
framework establishes new state-of-the-art zero-shot retrieval performance,
demonstrating its effectiveness in handling complex visual and temporal signals
alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1%
(t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive
gains of up to +40 Recall@1 over previous state-of-the-art baselines. We
present ViC as a simple, reproducible, and highly effective recipe for turning
modern VLMs into powerful zero-shot rerankers and fusers. Code and resources
are publicly available at: https://github.com/mohammad2012191/ViC
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Tomรกลก Krsiฤka, Tibor Kubรญk
ๆฅ็ๆ่ฆ
Progress in anatomical 3D shape classification is limited by the complexity
of mesh data and the lack of standardized benchmarks, highlighting the need for
robust learning methods and reproducible evaluation. We introduce two key steps
toward clinically and benchmark-ready anatomical shape classification via
self-supervised graph autoencoding. We propose Precomputed Structural Pooling
(PSPooling), a non-learnable mesh pooling operator designed for efficient and
structure-preserving graph coarsening in 3D anatomical shape analysis.
PSPooling precomputes node correspondence sets based on geometric proximity,
enabling parallelizable and reversible pooling and unpooling operations with
guaranteed support structure. This design avoids the sparsity and
reconstruction issues of selection-based methods and the sequential overhead of
edge contraction approaches, making it particularly suitable for
high-resolution medical meshes. To demonstrate its effectiveness, we integrate
PSPooling into a self-supervised graph autoencoder that learns anatomy-aware
representations from unlabeled surface meshes. We evaluate the downstream
benefits on MedShapeNet19, a new curated benchmark dataset we derive from
MedShapeNet, consisting of 19 anatomical classes with standardized training,
validation, and test splits. Experiments show that PSPooling significantly
improves reconstruction fidelity and classification accuracy in low-label
regimes, establishing a strong baseline for medical 3D shape learning. We hope
that MedShapeNet19 will serve as a widely adopted benchmark for anatomical
shape classification and further research in medical 3D shape analysis. Access
the complete codebase, model weights, and dataset information here:
https://github.com/TomasKrsicka/MedShapeNet19-PSPooling.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mahmut Selman Gokmen, Cody Bumgardner
Natural Language Processing
ๆฅ็ๆ่ฆ
Vision Foundation Models (VFMs) have advanced representation learning through
self-supervised methods. However, existing training pipelines are often
inflexible, domain-specific, or computationally expensive, which limits their
usability across different domains and resource settings. DINO-MX is a modular
and extensible training framework that combines the core principles of DINO,
DINOv2 and DINOv3 within a unified configuration-driven system. It supports a
variety of transformer-based architectures and is fully compatible with the
Hugging Face ecosystem. The framework includes multiple training strategies
such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation,
along with support for distributed training through both Distributed Data
Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to
work with both natural and specialized data types, including single- and
multi-channel images. Experimental results on diverse datasets show that
DINO-MX achieves competitive performance while significantly reducing
computational costs. Additionally, it offers interpretability tools and a
label-guided data augmentation method that improves attention-based
localization without the need for extra detection or segmentation heads.
DINO-MX provides a reproducible and scalable foundation for developing,
adapting, and benchmarking self-supervised vision models across a range of
research and real-world applications.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Agnar Martin Bjรธrnstad, Elias Stenhede, Arian Ranjbar
ๆฅ็ๆ่ฆ
Accurate tumor size measurement is a cornerstone of evaluating cancer
treatment response. The most widely adopted standard for this purpose is the
Response Evaluation Criteria in Solid Tumors (RECIST) v1.1, which relies on
measuring the longest tumor diameter in a single plane. However, volumetric
measurements have been shown to provide a more reliable assessment of treatment
effect. Their clinical adoption has been limited, though, due to the
labor-intensive nature of manual volumetric annotation. In this paper, we
present Lite ENSAM, a lightweight adaptation of the ENSAM architecture designed
for efficient volumetric tumor segmentation from CT scans annotated with RECIST
annotations. Lite ENSAM was submitted to the MICCAI FLARE 2025 Task 1:
Pan-cancer Segmentation in CT Scans, Subtask 2, where it achieved a Dice
Similarity Coefficient (DSC) of 60.7% and a Normalized Surface Dice (NSD) of
63.6% on the hidden test set, and an average total RAM time of 50.6 GBs and an
average inference time of 14.4 s on CPU on the public validation dataset.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Renjun Gao, Peiyan Zhong
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have shown remarkable capabilities
in cross-modal understanding and reasoning, offering new opportunities for
intelligent assistive systems, yet existing systems still struggle with
risk-aware planning, user personalization, and grounding language plans into
executable skills in cluttered homes. We introduce MARS - a Multi-Agent Robotic
System powered by MLLMs for assistive intelligence and designed for smart home
robots supporting people with disabilities. The system integrates four agents:
a visual perception agent for extracting semantic and spatial features from
environment images, a risk assessment agent for identifying and prioritizing
hazards, a planning agent for generating executable action sequences, and an
evaluation agent for iterative optimization. By combining multimodal perception
with hierarchical multi-agent decision-making, the framework enables adaptive,
risk-aware, and personalized assistance in dynamic indoor environments.
Experiments on multiple datasets demonstrate the superior overall performance
of the proposed system in risk-aware planning and coordinated multi-agent
execution compared with state-of-the-art multimodal models. The proposed
approach also highlights the potential of collaborative AI for practical
assistive scenarios and provides a generalizable methodology for deploying
MLLM-enabled multi-agent systems in real-world environments.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yizhu Chen, Chen Ju, Zhicheng Wang, Shuai Xiao, Xu Chen et al.
ๆฅ็ๆ่ฆ
The unification of understanding and generation within a single multi-modal
large model (MLLM) remains one significant challenge, largely due to the
dichotomy between continuous and discrete visual tokenizations. Continuous
tokenizer (CT) achieves strong performance by bridging multiple
independently-trained understanding modules and generation modules, but suffers
from complex multi-stage pipelines and substantial engineering overhead.
Conversely, discrete tokenizers (DT) offer a conceptually elegant idea by
quantizing each image into a primitive, but inevitably leading to information
loss and performance degradation. To resolve this tension, we question the
binary choice between CT and DT, inspired by the wave-particle duality of
light, and propose the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT).
We treat visual data as a flexible composition of image primitives derived from
quantized codebooks, with the crucial insight that the primitive number
assigned to each visual sample is adaptively determined according to its
complexity: simple instances use a few primitives, emulating discrete
tokenization, while complex instances use many, approximating continuous
tokenization. Two core components are designed: Diverse Quantitative
Primitives, which encourage primitives orthogonality to better populate
information space, and Dynamic Primitive Allocator, which assesses sample
complexity to determine the optimal set of primitives. Extensive experiments on
reconstruction, retrieval and classification show that CDD-VT achieves superior
performance over to specialized CT and DT, effectively getting strong result
within a concise and scalable MLLM.
๐
2025-11-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhicheng Wang, Chen Ju, Xu Chen, Shuai Xiao, Jinsong Lan et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Embedding models are a cornerstone of modern AI. Driven by Multimodal Large
Language Models (MLLMs), they have made great progress in architecture and data
curation, while the holistic paradigm is still limited to SSC, i.e., single
input, singular embedding, contrastive supervision, which collapses rich,
multifaceted inputs into monolithic embeddings and fails to fully exploit MLLM
capabilities. In this paper, we tailor one Parallel Decoupling Framework (PDF)
for multimodal embedding learning, by utilizing the proprietary steerability of
MLLMs, i.e., their ability to flexibly generate quite differentiated response
under explicit instructions. Concretely, PDF conditions a shared MLLM backbone
on distinct, learnable prefixes to roll out multiple parallel paths for one
input, then relies on these paths to obtain parallel embeddings. To promote
full parallel diversity, we employ Mutual Information Minimization (MIM) as an
explicit constraint, coupled with per-path contrastive supervision to maintain
semantic alignment. Such dual-objectives force PDF to yield robust semantic
coverage and a generalizable embedding space. Ultimately, the remarkable
embedding space are accessible at inference via one single forward pass,
incurring negligible computational overhead. We instantiate PDF on multiple
MLLM backbones and prove its effectiveness on MMEB benchmark. Significant gains
are consistently achieved across various resolutions and model sizes, e.g.,
boosting the VLM2Vec-LLaVA-1.6-LR model by a remarkable +8.9% (7B), while the
VLM2Vec-Qwen2VL models by +4.2% (2B) and +3.1% (7B). In terms of efficiency,
our 2B model surpasses its baseline by +2.6% using only half the computational
budget.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Md Sumon Ali, Muzammil Behzad
Machine LearningComputer Vision
ๆฅ็ๆ่ฆ
Compared to traditional methods, Deep Learning (DL) becomes a key technology
for computer vision tasks. Synthetic data generation is an interesting use case
for DL, especially in the field of medical imaging such as Magnetic Resonance
Imaging (MRI). The need for this task since the original MRI data is limited.
The generation of realistic medical images is completely difficult and
challenging. Generative Adversarial Networks (GANs) are useful for creating
synthetic medical images. In this paper, we propose a DL based methodology for
creating synthetic MRI data using the Deep Convolutional Generative Adversarial
Network (DC-GAN) to address the problem of limited data. We also employ a
Convolutional Neural Network (CNN) classifier to classify the brain tumor using
synthetic data and real MRI data. CNN is used to evaluate the quality and
utility of the synthetic images. The classification result demonstrates
comparable performance on real and synthetic images, which validates the
effectiveness of GAN-generated images for downstream tasks.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenqi Liang, Gan Sun, Yao He, Jiahua Dong, Suyan Dai et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action models (VLAs) are emerging as powerful tools for
learning generalizable visuomotor control policies. However, current VLAs are
mostly trained on large-scale image-text-action data and remain limited in two
key ways: (i) they struggle with pixel-level scene understanding, and (ii) they
rely heavily on textual prompts, which reduces their flexibility in real-world
settings. To address these challenges, we introduce PixelVLA, the first VLA
model designed to support both pixel-level reasoning and multimodal prompting
with text and visual inputs. Our approach is built on a new visuomotor
instruction tuning framework that integrates a multiscale pixel-aware encoder
with a visual prompting encoder. To train PixelVLA effectively, we further
propose a two-stage automated annotation pipeline that generates Pixel-160K, a
large-scale dataset with pixel-level annotations derived from existing robot
data. Experiments on three standard VLA benchmarks and two VLA model variants
show that PixelVLA improves manipulation success rates by 10.1%-17.8% over
OpenVLA, while requiring only 1.5% of its pretraining cost. These results
demonstrate that PixelVLA can be integrated into existing VLAs to enable more
accurate, efficient, and versatile robot control in complex environments. The
dataset and code will be released as open source.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Joseph Suh, Suhong Moon, Serina Chang
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used to simulate humans, with
applications ranging from survey prediction to decision-making. However, are
LLMs strictly necessary, or can smaller, domain-grounded models suffice? We
identify a large class of simulation problems in which individuals make choices
among discrete options, where a graph neural network (GNN) can match or surpass
strong LLM baselines despite being three orders of magnitude smaller. We
introduce Graph-basEd Models for human Simulation (GEMS), which casts discrete
choice simulation tasks as a link prediction problem on graphs, leveraging
relational knowledge while incorporating language representations only when
needed. Evaluations across three key settings on three simulation datasets show
that GEMS achieves comparable or better accuracy than LLMs, with far greater
efficiency, interpretability, and transparency, highlighting the promise of
graph-based modeling as a lightweight alternative to LLMs for human simulation.
Our code is available at https://github.com/schang-lab/gems.
๐
2025-11-03
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Ziheng Geng, Jiachen Liu, Ran Cao, Lu Cheng, Dan M. Frangopol et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Flood insurance is an effective strategy for individuals to mitigate
disaster-related losses. However, participation rates among at-risk populations
in the United States remain strikingly low. This gap underscores the need to
understand and model the behavioral mechanisms underlying insurance decisions.
Large language models (LLMs) have recently exhibited human-like intelligence
across wide-ranging tasks, offering promising tools for simulating human
decision-making. This study constructs a benchmark dataset to capture insurance
purchase probabilities across factors. Using this dataset, the capacity of LLMs
is evaluated: while LLMs exhibit a qualitative understanding of factors, they
fall short in estimating quantitative probabilities. To address this
limitation, InsurAgent, an LLM-empowered agent comprising five modules
including perception, retrieval, reasoning, action, and memory, is proposed.
The retrieval module leverages retrieval-augmented generation (RAG) to ground
decisions in empirical survey data, achieving accurate estimation of marginal
and bivariate probabilities. The reasoning module leverages LLM common sense to
extrapolate beyond survey data, capturing contextual information that is
intractable for traditional models. The memory module supports the simulation
of temporal decision evolutions, illustrated through a roller coaster life
trajectory. Overall, InsurAgent provides a valuable tool for behavioral
modeling and policy analysis.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Joshua Ashkinaze, Hua Shen, Sai Avula, Eric Gilbert, Ceren Budak
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We introduce the Deep Value Benchmark (DVB), an evaluation framework that
directly tests whether large language models (LLMs) learn fundamental human
values or merely surface-level preferences. This distinction is critical for AI
alignment: Systems that capture deeper values are likely to generalize human
intentions robustly, while those that capture only superficial patterns in
preference data risk producing misaligned behavior. The DVB uses a novel
experimental design with controlled confounding between deep values (e.g.,
moral principles) and shallow features (e.g., superficial attributes). In the
training phase, we expose LLMs to human preference data with deliberately
correlated deep and shallow features -- for instance, where a user consistently
prefers (non-maleficence, formal language) options over (justice, informal
language) alternatives. The testing phase then breaks these correlations,
presenting choices between (justice, formal language) and (non-maleficence,
informal language) options. This design allows us to precisely measure a
model's Deep Value Generalization Rate (DVGR) -- the probability of
generalizing based on the underlying value rather than the shallow feature.
Across 9 different models, the average DVGR is just 0.30. All models generalize
deep values less than chance. Larger models have a (slightly) lower DVGR than
smaller models. We are releasing our dataset, which was subject to three
separate human validation experiments. DVB provides an interpretable measure of
a core feature of alignment.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Stefan F. Schouten, Peter Bloem
Natural Language Processing
ๆฅ็ๆ่ฆ
Contrast-Consistent Search (CCS) is an unsupervised probing method able to
test whether large language models represent binary features, such as sentence
truth, in their internal activations. While CCS has shown promise, its two-term
objective has been only partially understood. In this work, we revisit CCS with
the aim of clarifying its mechanisms and extending its applicability. We argue
that what should be optimized for, is relative contrast consistency. Building
on this insight, we reformulate CCS as an eigenproblem, yielding closed-form
solutions with interpretable eigenvalues and natural extensions to multiple
variables. We evaluate these approaches across a range of datasets, finding
that they recover similar performance to CCS, while avoiding problems around
sensitivity to random initialization. Our results suggest that relativizing
contrast consistency not only improves our understanding of CCS but also opens
pathways for broader probing and mechanistic interpretability methods.
๐
2025-11-03
๐ ArXiv Preprint (physics.soc-ph)
๐ฅ Pablo Rosillo-Rodes, Laurent Hรฉbert-Dufresne, Peter Sheridan Dodds
ๆฅ็ๆ่ฆ
The growth dynamics of complex systems often exhibit statistical regularities
involving power-law relationships. For real finite complex systems formed by
countable tokens (animals, words) as instances of distinct types (species,
dictionary entries), an inverse power-law scaling $S \sim r^{-\alpha}$ between
type count $S$ and type rank $r$, widely known as Zipf's law, is widely
observed to varying degrees of fidelity. A secondary, summary relationship is
Heaps' law, which states that the number of types scales sublinearly with the
total number of observed tokens present in a growing system. Here, we propose
an idealized model of a growing system that (1) deterministically produces
arbitrary inverse power-law count rankings for types, and (2) allows us to
determine the exact asymptotics of the type-token relationship. Our argument
improves upon and remedies earlier work. We obtain a unified asymptotic
expression for all values of $\alpha$, which corrects the special cases of
$\alpha = 1$ and $\alpha \gg 1$. Our approach relies solely on the form of
count rankings, avoids unnecessary approximations, and does not involve any
stochastic mechanisms or sampling processes. We thereby demonstrate that a
general type-token relationship arises solely as a consequence of Zipf's law.
๐
2025-11-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Vivswan Shah, Randy Cogill, Hanwei Yue, Gopinath Chennupati, Rinat Khaziev
Natural Language Processing
ๆฅ็ๆ่ฆ
Fine-tuning LLMs for classification typically maps inputs directly to labels.
We ask whether attaching brief explanations to each label during fine-tuning
yields better models. We evaluate conversational response quality along three
axes: naturalness, comprehensiveness, and on-topic adherence, each rated on
5-point scales. Using ensemble-generated data from multiple LLMs, we fine-tune
a 7B-parameter model and test across six diverse conversational datasets.
Across 18 dataset, task settings, label-plus-explanation training outperforms
label-only baselines.
A central and unexpected result concerns random tokens. We replace
human-written explanations with text that is syntactically incoherent yet
vocabulary-aligned with the originals (e.g., shuffled or bag-of-words
variants). Despite lacking semantics, these pseudo-explanations still improve
accuracy over label-only training and often narrow much of the gap to true
explanations. The effect persists across datasets and training seeds,
indicating that gains arise less from meaning than from structure: the extra
token budget encourages richer intermediate computation and acts as a
regularizer that reduces over-confident shortcuts.
Internal analyses support this view: explanation-augmented models exhibit
higher activation entropy in intermediate layers alongside sharper predictive
mass at the output layer, consistent with increased deliberation before
decision. Overall, explanation-augmented fine-tuning, whether with genuine
rationales or carefully constructed random token sequences, improves accuracy
and reliability for LLM classification while clarifying how token-level
scaffolding shapes computation during inference.
๐
2025-11-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aditya Sridhar, Nish Sinnadurai, Sean Lie, Vithursan Thangarasa
ๆฅ็ๆ่ฆ
Speculative decoding accelerates LLMs by using a lightweight draft model to
generate tokens autoregressively before verifying them in parallel with a
larger target model. However, determining the optimal number of tokens to draft
remains a key challenge limiting the approach's effectiveness. Dynamic
speculative decoding aims to intelligently decide how many tokens to draft to
achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive
thresholds (e.g., token entropy), which are costly to set and generalize poorly
across models and domains. We propose TapOut, an online, training-free,
plug-and-play algorithm for dynamic speculation policy selection using
multi-armed bandits. Our approach employs a meta-algorithm that selects among
multiple parameter-free dynamic speculation strategies based on past reward and
exploration. We conduct extensive experiments across diverse model pairs and
datasets, showing that TapOut achieves competitive or superior speedups
compared to well-established dynamic speculation baselines without any
hyperparameter tuning.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Elias Lumer, Faheem Nizar, Anmol Gulati, Pradeep Honaganahalli Basavaraju, Vamse Kumar Subbiah
ๆฅ็ๆ่ฆ
Recent advances in LLM Multi-Agent Systems enable scalable orchestration of
sub-agents, each coordinating hundreds or thousands of tools or Model Context
Protocol (MCP) servers. However, existing retrieval methods typically match
queries against coarse agent-level descriptions before routing, which obscures
fine-grained tool functionality and often results in suboptimal agent
selection. We introduce Tool-to-Agent Retrieval, a unified framework that
embeds both tools and their parent agents in a shared vector space and connects
them through metadata relationships. By explicitly representing tool
capabilities and traversing metadata to the agent level, Tool-to-Agent
Retrieval enables granular tool-level or agent-level retrieval, ensuring that
agents and their underlying tools or MCP servers are equally represented
without the context dilution that arises from chunking many tools together.
Evaluating Tool-to-Agent Retrieval across eight embedding models, our approach
achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over
previous state-of-the-art agent retrievers on the LiveMCPBench benchmark.
๐
2025-11-03
๐ EMNLP 2025
๐ฅ Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi et al.
ๆฅ็ๆ่ฆ
Finding the right north-star metrics is highly critical for advancing the
mathematical reasoning capabilities of foundation models, especially given that
existing evaluations are either too easy or only focus on getting correct short
answers. To address these issues, we present IMO-Bench, a suite of advanced
reasoning benchmarks, vetted by a panel of top specialists and that
specifically targets the level of the International Mathematical Olympiad
(IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench
first tests models on 400 diverse Olympiad problems with verifiable short
answers. IMO-Proof Bench is the next-level evaluation for proof-writing
capabilities, which includes both basic and advanced IMO level problems as well
as detailed grading guidelines to facilitate automatic grading. These
benchmarks played a crucial role in our historic achievement of the gold-level
performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our
model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof
Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4%
respectively. We also showed that autograders built with Gemini reasoning
correlate well with human evaluations and construct IMO-GradingBench, with 1000
human gradings on proofs, to enable further progress in automatic evaluation of
long-form answers. We hope that IMO-Bench will help the community towards
advancing robust mathematical reasoning and release it at
https://imobench.github.io/.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Konrad Staniszewski, Adrian ลaลcucki
Natural Language Processing
ๆฅ็ๆ่ฆ
Serving large language models (LLMs) at scale necessitates efficient
key-value (KV) cache management. KV caches can be reused across conversation
turns via shared-prefix prompts that are common in iterative code editing and
chat. However, stale caches consume scarce GPU memory, require offloading, or
force recomputation. We present KVTC, a lightweight transform coder that
compresses KV caches for compact on-GPU and off-GPU storage. Drawing on
classical media compression, KVTC combines PCA-based feature decorrelation,
adaptive quantization, and entropy coding. It requires only a brief initial
calibration and leaves model parameters unchanged. By exploiting redundancies
in KV caches, KVTC achieves up to 20$\times$ compression while maintaining
reasoning and long-context accuracy, and 40$\times$ or higher for specific use
cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across
benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and
MATH-500. It consistently outperforms inference-time baselines such as token
eviction, quantization, and SVD-based methods, while achieving higher
compression ratios. These results support KVTC as a practical building block
for memory-efficient LLM serving with reusable KV caches.
๐
2025-11-03
๐ KDD 2025
๐ฅ Adewale Akinfaderin, Shreyas Subramanian, Akarsha Sehwag
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Length control in Large Language Models (LLMs) is a crucial but
under-addressed challenge, with applications ranging from voice interfaces
requiring concise responses to research summaries needing comprehensive
outputs. Current approaches to length control, including Regularized DPO,
Length-Instruction Fine Tuning, and tool-augmented methods, typically require
expensive model retraining or complex inference-time tooling. This paper
presents a prompt engineering methodology that enables precise length control
without model retraining. Our structure-guided approach implements deliberate
planning and word counting mechanisms within the prompt, encouraging the model
to carefully track and adhere to specified length constraints. Comprehensive
evaluations across six state-of-the-art LLMs demonstrate that our method
significantly improves length fidelity for several models compared to standard
prompting when applied to document summarization tasks, particularly for
shorter-to-medium length constraints. The proposed technique shows varying
benefits across different model architectures, with some models demonstrating
up to 37.6% improvement in length adherence. Quality evaluations further reveal
that our approach maintains or enhances overall output quality compared to
standard prompting techniques. Our approach provides an immediately deployable
solution for applications requiring precise length control, particularly
valuable for production environments where model retraining is impractical or
cost-prohibitive.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiayi Geng, Howard Chen, Ryan Liu, Manoel Horta Ribeiro, Robb Willer et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Language model (LM) assistants are increasingly used in applications such as
brainstorming and research. Improvements in memory and context size have
allowed these models to become more autonomous, which has also resulted in more
text accumulation in their context windows without explicit user intervention.
This comes with a latent risk: the belief profiles of models -- their
understanding of the world as manifested in their responses or actions -- may
silently change as context accumulates. This can lead to subtly inconsistent
user experiences, or shifts in behavior that deviate from the original
alignment of the models. In this paper, we explore how accumulating context by
engaging in interactions and processing text -- talking and reading -- can
change the beliefs of language models, as manifested in their responses and
behaviors. Our results reveal that models' belief profiles are highly
malleable: GPT-5 exhibits a 54.7% shift in its stated beliefs after 10 rounds
of discussion about moral dilemmas and queries about safety, while Grok 4 shows
a 27.2% shift on political issues after reading texts from the opposing
position. We also examine models' behavioral changes by designing tasks that
require tool use, where each tool selection corresponds to an implicit belief.
We find that these changes align with stated belief shifts, suggesting that
belief shifts will be reflected in actual behavior in agentic systems. Our
analysis exposes the hidden risk of belief shift as models undergo extended
sessions of talking or reading, rendering their opinions and actions
unreliable.
๐
2025-11-03
๐ ICML 2025
๐ฅ Vi Retault, Yohaรฏ-Eliel Berreby
Natural Language Processing
ๆฅ็ๆ่ฆ
When fine-tuning language models on new tasks, catastrophic forgetting --
performance degradation on previously-learned tasks -- is a ubiquitous problem.
While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA address this
through low-rank adapters, sparse adaptation offers an alternative that doesn't
impose rank constraints. We introduce Random Initialization of Gated Sparse
Adapters (RIGSA), which starts from randomly-initialized full-rank adapters,
gates them with a ReZero analog, and sparsifies them with iterative magnitude
pruning. We evaluate RIGSA on SmolLM2-1.7B-Instruct using a novel
vision-in-text task (Textual MNIST) and measure forgetting on PIQA, HellaSwag,
and GSM8k. SmolLM2-1.7B-Instruct initially performs around chance level on
Textual MNIST, and is capable of learning the task through RIGSA, 4-bit QLoRA
and random masking. In spite of having more trainable parameters than QLoRA,
the RIGSA configurations that we studied displayed less forgetting than QLoRA,
particularly on GSM8k, though it performs comparably to random masking.
๐
2025-11-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mian Wu, Gavin Zhang, Sewon Min, Sergey Levine, Aviral Kumar
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Open-ended generation tasks require outputs to satisfy diverse and often
implicit task-specific evaluation rubrics. The sheer number of relevant rubrics
leads to prohibitively high verification costs and incomplete assessments of a
response, making reinforcement learning (RL) post-training with rubric-based
rewards difficult to scale. This problem is exacerbated by the fact that often
the best way to combine these rubrics into one single reward is also highly
prompt-specific. We propose Reinforcement Learning with Adversarial Critic
(RLAC), a post-training approach that addresses these challenges via dynamic
rubric verification. Our approach employs a large language model (LLM) as a
critic that dynamically identifies only the most likely failure modes (e.g., a
factual error or unhandled edge case), which are then verified by an external
validator to optimize both generator and critic jointly. By training both the
generator and the critic, this game enhances the critic's error detection and
the generator's output quality while reducing required verifications. Our
experiments demonstrate that RLAC improves factual accuracy in text generation
and correctness in code generation, while also outperforming exhaustive
verification and reward model methods. We show that dynamic critics are more
effective than fixed critics, showcasing the potential of RLAC for scaling RL
post-training to free-form generation tasks.
๐
2025-11-03
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Soufiane Hayou
Machine Learning
ๆฅ็ๆ่ฆ
We provide the first proof of learning rate transfer with width in a linear
multi-layer perceptron (MLP) parametrized with $\mu$P, a neural network
parameterization designed to ``maximize'' feature learning in the
infinite-width limit. We show that under $\mu P$, the optimal learning rate
converges to a \emph{non-zero constant} as width goes to infinity, providing a
theoretical explanation to learning rate transfer. In contrast, we show that
this property fails to hold under alternative parametrizations such as Standard
Parametrization (SP) and Neural Tangent Parametrization (NTP). We provide
intuitive proofs and support the theoretical findings with extensive empirical
results.
๐
2025-11-03
๐ EMNLP 2025
๐ฅ Mahammad Nuriyev
ๆฅ็ๆ่ฆ
We present a multi-expert system for creating Non-Player Characters (NPCs)
capable of both natural dialogue and contextual action execution in interactive
environments. Using Qwen3 as the base model and Low-Rank Adaptation (LoRA)
adapters, we instantiate three specialists: tool calling, tool-response
interpretation, and direct dialogue. Our system comfortably meets the
computational efficiency requirements, delivering fast responses and
maintaining modest resource usage on L40S GPUs. In the Commonsense
Persona-Grounded Dialogue Challenge 2025, our method ranked second overall.
Code available at:
https://github.com/MahammadNuriyev62/CPDC-challenge-2025-solution/
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sekh Mainul Islam, Pepa Atanasova, Isabelle Augenstein
Natural Language Processing
ๆฅ็ๆ่ฆ
Natural Language Explanations (NLEs) describe how Large Language Models
(LLMs) make decisions, drawing on both external Context Knowledge (CK) and
Parametric Knowledge (PK) stored in model weights. Understanding their
interaction is key to assessing the grounding of NLEs, yet it remains
underexplored. Prior work has largely examined only single-step generation,
typically the final answer, and has modelled PK and CK interaction only as a
binary choice in a rank-1 subspace. This overlooks richer forms of interaction,
such as complementary or supportive knowledge. We propose a novel rank-2
projection subspace that disentangles PK and CK contributions more accurately
and use it for the first multi-step analysis of knowledge interactions across
longer NLE sequences. Experiments on four QA datasets and three open-weight
instruction-tuned LLMs show that diverse knowledge interactions are poorly
represented in a rank-1 subspace but are effectively captured in our rank-2
formulation. Our multi-step analysis reveals that hallucinated NLEs align
strongly with the PK direction, context-faithful ones balance PK and CK, and
Chain-of-Thought prompting for NLEs shifts generated NLEs toward CK by reducing
PK reliance. This work provides the first framework for systematic studies of
multi-step knowledge interactions in LLMs through a richer rank-2 subspace
disentanglement. Code and data:
https://github.com/copenlu/pk-ck-knowledge-disentanglement.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The character of the "AI assistant" persona generated by modern chatbot large
language models influences both surface-level behavior and apparent values,
beliefs, and ethics. These all affect interaction quality, perceived
intelligence, and alignment with both developer and user intentions. The
shaping of this persona, known as character training, is a critical component
of industry post-training, yet remains effectively unstudied in the academic
literature. We introduce the first open implementation of character training,
leveraging Constitutional AI and a new data pipeline using synthetic
introspective data to shape the assistant persona in a more effective and
controlled manner than alternatives such as constraining system prompts or
activation steering. Specifically, we fine-tune three popular open-weights
models using 11 example personas, such as humorous, deeply caring, or even
malevolent. To track the effects of our approach, we introduce a method which
analyzes revealed preferences, uncovering clear and holistic changes in
character. We find these changes are more robust to adversarial prompting than
the above two alternatives, while also leading to more coherent and realistic
generations. Finally, we demonstrate this fine-tuning has little to no effect
on general capabilities as measured by common benchmarks. We describe and
open-source our full post-training method, the implementation of which can be
found at https://github.com/maiush/OpenCharacterTraining.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chaoqun Liu, Mahani Aljunied, Guizhen Chen, Hou Pong Chan, Weiwen Xu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
We introduce SeaLLMs-Audio, the first large audio-language model (LALM)
tailored for multiple Southeast Asian (SEA) languages-Indonesian (id), Thai
(th), and Vietnamese (vi)-alongside English (en) and Chinese (zh). Trained on a
large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across
diverse audio-centric tasks, spanning fine-grained audio understanding and
voice-based interaction. Its key features include: 1) Multilingual: the model
primarily supports 5 languages, namely Indonesian, Thai, Vietnamese, English,
and Chinese; 2) Multimodal: the model accepts flexible input modalities,
including audio only, text only, as well as audio with text; 3) Multi-task: the
model supports a wide range of tasks, including audio analysis tasks such as
Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation,
Speech Emotion Recognition, Speech Question Answering, and Speech
Summarization. It also enables voice-based dialogue, including answering
factual, mathematical, and general knowledge queries. As a significant step
towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to
benefit both the regional research community and industry. To automate LALM
evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark
spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves
competitive performance compared with other LALMs on SEA languages.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Preslav Nakov, Zhuohan Xie
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly being applied to specialized,
high-stakes domains like engineering, which demands rigorous evaluation of
their complex reasoning capabilities. While current benchmarks assess language
understanding, factual recall, mathematics or code generation, none capture the
integrative reasoning central to engineering where scientific principles,
quantitative modeling and practical constraints must converge. To address this
gap, we introduce EngChain, a benchmark for verifiable multi-step engineering
problem-solving. EngChain contains 90 problems spanning three engineering
branches, organized into 9 domains and 20 distinct areas. The problems are
generated from symbolic templates with a high degree of randomization to ensure
diversity and eliminate the risk of contamination. With this benchmark, we move
beyond final answer accuracy with a two-stage evaluation: we first
quantitatively verify the numerical and semantic validity of each reasoning
step and then introduce LLM-As-A-Judge, an automated system to qualitatively
categorize the identified reasoning errors.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hung-Shin Lee, Chen-Chi Chang, Ching-Yuan Chen, Yun-Hsiang Hsu
Natural Language Processing
ๆฅ็ๆ่ฆ
This study proposes a cognitive benchmarking framework to evaluate how large
language models (LLMs) process and apply culturally specific knowledge. The
framework integrates Bloom's Taxonomy with Retrieval-Augmented Generation (RAG)
to assess model performance across six hierarchical cognitive domains:
Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating.
Using a curated Taiwanese Hakka digital cultural archive as the primary
testbed, the evaluation measures LLM-generated responses' semantic accuracy and
cultural relevance.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Riccardo Campi, Nicolรฒ Oreste Pinciroli Vago, Mathyas Giudici, Pablo Barrachina Rodriguez-Guisado, Marco Brambilla et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In this work, we investigate the use of Large Language Models (LLMs) within a
graph-based Retrieval Augmented Generation (RAG) architecture for Energy
Efficiency (EE) Question Answering. First, the system automatically extracts a
Knowledge Graph (KG) from guidance and regulatory documents in the energy
field. Then, the generated graph is navigated and reasoned upon to provide
users with accurate answers in multiple languages. We implement a human-based
validation using the RAGAs framework properties, a validation dataset
comprising 101 question-answer pairs, and domain experts. Results confirm the
potential of this architecture and identify its strengths and weaknesses.
Validation results show how the system correctly answers in about three out of
four of the cases (75.2 +- 2.7%), with higher results on questions related to
more general EE answers (up to 81.0 +- 4.1%), and featuring promising
multilingual abilities (4.4% accuracy loss due to translation).
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nikola Ljubeลกiฤ, Peter Rupnik, Ivan Porupski, Taja Kuzman Pungerลกek
ๆฅ็ๆ่ฆ
ParlaSpeech is a collection of spoken parliamentary corpora currently
spanning four Slavic languages - Croatian, Czech, Polish and Serbian - all
together 6 thousand hours in size. The corpora were built in an automatic
fashion from the ParlaMint transcripts and their corresponding metadata, which
were aligned to the speech recordings of each corresponding parliament. In this
release of the dataset, each of the corpora is significantly enriched with
various automatic annotation layers. The textual modality of all four corpora
has been enriched with linguistic annotations and sentiment predictions.
Similar to that, their spoken modality has been automatically enriched with
occurrences of filled pauses, the most frequent disfluency in typical speech.
Two out of the four languages have been additionally enriched with detailed
word- and grapheme-level alignments, and the automatic annotation of the
position of primary stress in multisyllabic words. With these enrichments, the
usefulness of the underlying corpora has been drastically increased for
downstream research across multiple disciplines, which we showcase through an
analysis of acoustic correlates of sentiment. All the corpora are made
available for download in JSONL and TextGrid formats, as well as for search
through a concordancer.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma et al.
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Recent advances in Multimodal Large Language Models (MLLMs) have
significantly improved 2D visual understanding, prompting interest in their
application to complex 3D reasoning tasks. However, it remains unclear whether
these models can effectively capture the detailed spatial information required
for robust real-world performance, especially cross-view consistency, a key
requirement for accurate 3D reasoning. Considering this issue, we introduce
Viewpoint Learning, a task designed to evaluate and improve the spatial
reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset,
consisting of 100K object-centric image pairs with diverse viewpoints and
corresponding question-answer pairs. Our approach employs a two-stage
fine-tuning strategy: first, foundational knowledge is injected to the baseline
MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in
significant improvements across multiple tasks; second, generalization is
enhanced through Reinforcement Learning using the Group Relative Policy
Optimization (GRPO) algorithm on a broader set of questions. Additionally, we
introduce a hybrid cold-start initialization method designed to simultaneously
learn viewpoint representations and maintain coherent reasoning thinking.
Experimental results show that our approach significantly activates the spatial
reasoning ability of MLLM, improving performance on both in-domain and
out-of-domain reasoning tasks. Our findings highlight the value of developing
foundational spatial skills in MLLMs, supporting future progress in robotics,
autonomous systems, and 3D scene understanding.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Francisco Portillo Lรณpez
Natural Language Processing
ๆฅ็ๆ่ฆ
Linguistic errors are not merely deviations from normative grammar; they
offer a unique window into the cognitive architecture of language and expose
the current limitations of artificial systems that seek to replicate them. This
project proposes an interdisciplinary study of linguistic errors produced by
native Spanish speakers, with the aim of analyzing how current large language
models (LLM) interpret, reproduce, or correct them. The research integrates
three core perspectives: theoretical linguistics, to classify and understand
the nature of the errors; neurolinguistics, to contextualize them within
real-time language processing in the brain; and natural language processing
(NLP), to evaluate their interpretation against linguistic errors. A
purpose-built corpus of authentic errors of native Spanish (+500) will serve as
the foundation for empirical analysis. These errors will be tested against AI
models such as GPT or Gemini to assess their interpretative accuracy and their
ability to generalize patterns of human linguistic behavior. The project
contributes not only to the understanding of Spanish as a native language but
also to the development of NLP systems that are more cognitively informed and
capable of engaging with the imperfect, variable, and often ambiguous nature of
real human language.
๐
2025-11-03
๐ EMNLP 2025
๐ฅ Wenjie Hua, Hoang H. Nguyen, Gangyan Ge
Natural Language Processing
ๆฅ็ๆ่ฆ
Bronze inscriptions from early China are fragmentary and difficult to date.
We introduce BIRD(Bronze Inscription Restoration and Dating), a fully encoded
dataset grounded in standard scholarly transcriptions and chronological labels.
We further propose an allograph-aware masked language modeling framework that
integrates domain- and task-adaptive pretraining with a Glyph Net (GN), which
links graphemes and allographs. Experiments show that GN improves restoration,
while glyph-biased sampling yields gains in dating.
๐
2025-11-03
๐ EMNLP 2025
๐ฅ Seungmin Shin, Dooyoung Kim, Youngjoong Ko
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Controllable Dialogue Generation (CDG) enables chatbots to generate responses
with desired attributes, and weighted decoding methods have achieved
significant success in the CDG task. However, using a fixed constant value to
manage the bias of attribute probabilities makes it challenging to find an
ideal control strength that satisfies both controllability and fluency. To
address this issue, we propose ECO decoding (Entropy-based COntrol), which
dynamically adjusts the control strength at each generation step according to
the model's entropy in both the language model and attribute classifier
probability distributions. Experiments on the DailyDialog and MultiWOZ datasets
demonstrate that ECO decoding consistently improves controllability while
maintaining fluency and grammaticality, outperforming prior decoding methods
across various models and settings. Furthermore, ECO decoding alleviates
probability interpolation issues in multi-attribute generation and consequently
demonstrates strong performance in both single and multi-attribute scenarios.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Luciana Ciringione, Emma Franchino, Simone Reigl, Isaia D'Onofrio, Anna Serbati et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Math anxiety poses significant challenges for university psychology students,
affecting their career choices and overall well-being. This study employs a
framework based on behavioural forma mentis networks (i.e. cognitive models
that map how individuals structure their associative knowledge and emotional
perceptions of concepts) to explore individual and group differences in the
perception and association of concepts related to math and anxiety. We
conducted 4 experiments involving psychology undergraduates from 2 samples (n1
= 70, n2 = 57) compared against GPT-simulated students (GPT-3.5: n2 = 300;
GPT-4o: n4 = 300). Experiments 1, 2, and 3 employ individual-level network
features to predict psychometric scores for math anxiety and its facets
(observational, social and evaluational) from the Math Anxiety Scale.
Experiment 4 focuses on group-level perceptions extracted from human students,
GPT-3.5 and GPT-4o's networks. Results indicate that, in students, positive
valence ratings and higher network degree for "anxiety", together with negative
ratings for "math", can predict higher total and evaluative math anxiety. In
contrast, these models do not work on GPT-based data because of differences in
simulated networks and psychometric scores compared to humans. These results
were also reconciled with differences found in the ways that high/low subgroups
of simulated and real students framed semantically and emotionally STEM
concepts. High math-anxiety students collectively framed "anxiety" in an
emotionally polarising way, absent in the negative perception of low
math-anxiety students. "Science" was rated positively, but contrasted against
the negative perception of "math". These findings underscore the importance of
understanding concept perception and associations in managing students' math
anxiety.
๐
2025-11-03
๐ ArXiv Preprint (cs.SE)
๐ฅ Murali Sridharan, Mikel Robredo, Leevi Rantala, Matteo Esposito, Valentina Lenarduzzi et al.
ๆฅ็ๆ่ฆ
Context. Detecting Self-Admitted Technical Debt (SATD) is crucial for
proactive software maintenance. Previous research has primarily targeted
detecting and prioritizing SATD, with little focus on the source code afflicted
with SATD. Our goal in this work is to connect the SATD comments with source
code constructs that surround them.
Method. We leverage the extensive SATD dataset PENTACET, containing code
comments from over 9000 Java Open Source Software (OSS) repositories. We
quantitatively infer where SATD most commonly occurs and which code
constructs/statements it most frequently affects.
Results and Conclusions. Our large-scale study links over 225,000 SATD
comments to their surrounding code, showing that SATD mainly arises in inline
code near definitions, conditionals, and exception handling, where developers
face uncertainty and trade-offs, revealing it as an intentional signal of
awareness during change rather than mere neglect.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Seokhoon Kang, Yejin Jeon, Seonjeong Hwang, Gary Geunbae Lee
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Multiple-choice cloze questions are commonly used to assess linguistic
proficiency and comprehension. However, generating high-quality distractors
remains challenging, as existing methods often lack adaptability and control
over difficulty levels, and the absence of difficulty-annotated datasets
further hinders progress. To address these issues, we propose a novel framework
for generating distractors with controllable difficulty by leveraging both data
augmentation and a multitask learning strategy. First, to create a
high-quality, difficulty-annotated dataset, we introduce a two-way distractor
generation process in order to produce diverse and plausible distractors. These
candidates are subsequently refined through filtering and then categorized by
difficulty using an ensemble QA system. Second, this newly created dataset is
leveraged to train a difficulty-controllable generation model via multitask
learning. The framework includes carefully designed auxiliary tasks that
enhance the model's semantic understanding of distractors and its ability to
estimate their difficulty. Experimental results demonstrate that our method
generates high-quality distractors across difficulty levels and substantially
outperforms GPT-4o in aligning distractor difficulty with human perception.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ayesha Afroza Mohsin, Mashrur Ahsan, Nafisa Maliyat, Shanta Maria, Syed Rifat Raiyan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Toxic language in Bengali remains prevalent, especially in online
environments, with few effective precautions against it. Although text
detoxification has seen progress in high-resource languages, Bengali remains
underexplored due to limited resources. In this paper, we propose a novel
pipeline for Bengali text detoxification that combines Pareto class-optimized
large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate
detoxified sentences. To support this effort, we construct BanglaNirTox, an
artificially generated parallel corpus of 68,041 toxic Bengali sentences with
class-wise toxicity labels, reasonings, and detoxified paraphrases, using
Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox
dataset is used to fine-tune language models to produce better detoxified
versions of Bengali sentences. Our findings show that Pareto-optimized LLMs
with CoT prompting significantly enhance the quality and consistency of Bengali
text detoxification.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Max Schaffelder, Albert Gatt
Natural Language Processing
ๆฅ็ๆ่ฆ
As synthetic data becomes widely used in language model development,
understanding its impact on model behavior is crucial. This paper investigates
the impact of the diversity of sources of synthetic data on fine-tuned large
language models. We focus on three key dimensions: distribution collapse,
adversarial robustness, and self-preference bias. Our findings reveal that
fine-tuning models on synthetic data from diverse sources can mitigate
distribution collapse, preserving the breadth of the output distribution and
the diversity of the output text. Furthermore, while both human and synthetic
fine-tuning data can remove safeguards, the latter preserves higher output
quality, thus making outputs potentially more usable and dangerous. Finally,
fine-tuning reduces self-preference bias, with human data being the most
effective, followed by multi-source synthetic data.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Neha Sharma, Navneet Agarwal, Kairit Sirts
Natural Language Processing
ๆฅ็ๆ่ฆ
Text-based automated Cognitive Distortion detection is a challenging task due
to its subjective nature, with low agreement scores observed even among expert
human annotators, leading to unreliable annotations. We explore the use of
Large Language Models (LLMs) as consistent and reliable annotators, and propose
that multiple independent LLM runs can reveal stable labeling patterns despite
the inherent subjectivity of the task. Furthermore, to fairly compare models
trained on datasets with different characteristics, we introduce a
dataset-agnostic evaluation framework using Cohen's kappa as an effect size
measure. This methodology allows for fair cross-dataset and cross-study
comparisons where traditional metrics like F1 score fall short. Our results
show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78),
resulting in improved test set performance for models trained on these
annotations compared to those trained on human-labeled data. Our findings
suggest that LLMs can offer a scalable and internally consistent alternative
for generating training data that supports strong downstream performance in
subjective NLP tasks.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lujie Niu, Lei Shen, Yi Jiang, Caixia Yuan, Xiaojie Wang et al.
Natural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
While long Chain-of-Thought (CoT) distillation effectively transfers
reasoning capability to smaller language models, the reasoning process often
remains redundant and computational budget uncontrollable, leading to
inefficient resource usage. To address this limitation, we propose
\textbf{Budget-Aware Reasoning Distillation (BARD)}, a novel framework that
simultaneously distills reasoning capability and enables fine-grained control
over the reasoning length. BARD uses the thinking budget as a user-specified
control signal, allowing the model to dynamically balance reasoning performance
and computational efficiency. To achieve this concept, BARD introduces a
two-phase training regimen. The first phase, Supervised Fine-Tuning (SFT) on
teacher-generated long CoT data compressed to various budget levels,
bootstrapping the model's understanding of budget constraints. The second phase
leverages Reinforcement Learning (RL) from a reward signal in consideration of
reasoning performance and budget fidelity simultaneously. Incorporating the
two-phase regimen is crucial to avoiding policy degradation and ensuring that
both objectives are optimized jointly. Extensive experiments demonstrate that
our method empowers an 8B student model to achieve strong performance on
challenging reasoning benchmarks (\textit{AIME24, AIME25, GPQA}) while
providing precise and adaptive control over its reasoning length across a wide
range of budgets.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sergio Torres Aguilar
Natural Language Processing
ๆฅ็ๆ่ฆ
Translating a morphology-rich, low-resource language like Latin poses
significant challenges. This paper introduces a reproducible draft-based
refinement pipeline that elevates open-source Large Language Models (LLMs) to a
performance level statistically comparable to top-tier proprietary systems. Our
method first uses a fine-tuned NLLB-1.3B model to generate a high-quality,
structurally faithful draft. A zero-shot LLM (Llama-3.3 or Qwen3) then polishes
this draft, a process that can be further enhanced by augmenting the context
with retrieved out-context examples (RAG). We demonstrate the robustness of
this approach on two distinct benchmarks: a standard in-domain test set
(Rosenthal, 2023) and a new, challenging out-of-domain (OOD) set of
12th-century Latin letters (2025). Our central finding is that this open-source
RAG system achieves performance statistically comparable to the GPT-5 baseline,
without any task-specific LLM fine-tuning. We release the pipeline, the
Chartres OOD set, and evaluation scripts and models to facilitate replicability
and further research.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Heng Zhou, Ao Yu, Yuchen Fan, Jianing Shi, Li Kang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Evaluating large language models (LLMs) on question answering often relies on
static benchmarks that reward memorization and understate the role of
retrieval, failing to capture the dynamic nature of world knowledge. We present
LiveSearchBench, an automated pipeline for constructing retrieval-dependent
benchmarks from recent knowledge updates. Our method computes deltas between
successive Wikidata snapshots, filters candidate triples for quality, and
synthesizes natural-language questions at three levels of reasoning difficulty,
each guaranteed to admit a unique, verifiable answer through SPARQL validation.
The pipeline is fully automated, scalable across time, and minimizes human
intervention, enabling continual regeneration of temporally grounded
benchmarks. Experiments show a pronounced performance drop when models confront
facts that post-date pretraining, with the gap most salient on multi-hop
queries. Retrieval augmented methods and larger, instruction-tuned models
provide partial gains but fail to close this recency gap. By design,
LiveSearchBench shifts evaluation from static memorization toward tasks that
require up-to-date retrieval and reasoning, offering a foundation for
systematic, long-term assessment of LLMs under evolving knowledge.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Muhammed Yusuf Kartal, Suha Kagan Kose, Korhan Sevinรง, Burak Aktas
Machine Learning
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) quality depends on many interacting
choices across retrieval, ranking, augmentation, prompting, and generation, so
optimizing modules in isolation is brittle. We introduce RAGSmith, a modular
framework that treats RAG design as an end-to-end architecture search over nine
technique families and 46{,}080 feasible pipeline configurations. A genetic
search optimizes a scalar objective that jointly aggregates retrieval metrics
(recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic
similarity). We evaluate on six Wikipedia-derived domains (Mathematics, Law,
Finance, Medicine, Defense Industry, Computer Science), each with 100 questions
spanning factual, interpretation, and long-answer types. RAGSmith finds
configurations that consistently outperform naive RAG baseline by +3.8\% on
average (range +1.2\% to +6.9\% across domains), with gains up to +12.5\% in
retrieval and +7.5\% in generation. The search typically explores $\approx
0.2\%$ of the space ($\sim 100$ candidates) and discovers a robust backbone --
vector retrieval plus post-generation reflection/revision -- augmented by
domain-dependent choices in expansion, reranking, augmentation, and prompt
reordering; passage compression is never selected. Improvement magnitude
correlates with question type, with larger gains on factual/long-answer mixes
than interpretation-heavy sets. These results provide practical, domain-aware
guidance for assembling effective RAG systems and demonstrate the utility of
evolutionary search for full-pipeline optimization.
๐
2025-11-03
๐ EMNLP 2025
๐ฅ Wessel Poelman, Thomas Bauwens, Miryam de Lhoneux
Natural Language Processing
ๆฅ็ๆ่ฆ
The extent to which individual language characteristics influence
tokenization and language modeling is an open question. Differences in
morphological systems have been suggested as both unimportant and crucial to
consider (Cotterell et al., 2018; Gerz et al., 2018a; Park et al., 2021, inter
alia). We argue this conflicting evidence is due to confounding factors in
experimental setups, making it hard to compare results and draw conclusions. We
identify confounding factors in analyses trying to answer the question of
whether, and how, morphology relates to language modeling. Next, we re-assess
three hypotheses by Arnett & Bergen (2025) for why modeling agglutinative
languages results in higher perplexities than fusional languages: they look at
morphological alignment of tokenization, tokenization efficiency, and dataset
size. We show that each conclusion includes confounding factors. Finally, we
introduce token bigram metrics as an intrinsic way to predict the difficulty of
causal language modeling, and find that they are gradient proxies for
morphological complexity that do not require expert annotation. Ultimately, we
outline necessities to reliably answer whether, and how, morphology relates to
language modeling.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ ฤฐbrahim Ethem Deveci, Duygu Ataman
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid rise of Large Language Models (LLMs) and Large Reasoning Models
(LRMs) has been accompanied by an equally rapid increase of benchmarks used to
assess them. However, due to both improved model competence resulting from
scaling and novel training advances as well as likely many of these datasets
being included in pre or post training data, results become saturated, driving
a continuous need for new and more challenging replacements. In this paper, we
discuss whether surpassing a benchmark truly demonstrates reasoning ability or
are we simply tracking numbers divorced from the capabilities we claim to
measure? We present an investigation focused on three model families, OpenAI,
Anthropic, and Google, and how their reasoning capabilities across different
benchmarks evolve over the years. We also analyze performance trends over the
years across different reasoning tasks and discuss the current situation of
benchmarking and remaining challenges. By offering a comprehensive overview of
benchmarks and reasoning tasks, our work aims to serve as a first reference to
ground future research in reasoning evaluation and model development.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aadi Palnitkar, Arjun Suresh, Rishi Rajesh, Puneet Puli
Natural Language Processing
ๆฅ็ๆ่ฆ
Increasingly, more and more people are turning to large language models
(LLMs) for healthcare advice and consultation, making it important to gauge the
efficacy and accuracy of the responses of LLMs to such queries. While there are
pre-existing medical benchmarks literature which seeks to accomplish this very
task, these benchmarks are almost universally in English, which has led to a
notable gap in existing literature pertaining to multilingual LLM evaluation.
Within this work, we seek to aid in addressing this gap with Cancer-Myth-Indic,
an Indic language benchmark built by translating a 500-item subset of
Cancer-Myth, sampled evenly across its original categories, into five
under-served but widely used languages from the subcontinent (500 per language;
2,500 translated items total). Native-speaker translators followed a style
guide for preserving implicit presuppositions in translation; items feature
false presuppositions relating to cancer. We evaluate several popular LLMs
under this presupposition stress.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sapir Harary, Eran Hirsch, Aviv Slobodkin, David Wan, Mohit Bansal et al.
Robotics
ๆฅ็ๆ่ฆ
Natural Language Inference (NLI) models have been used in various ways to
improve the factuality of LLM outputs. This is typically done by applying an
NLI model to judge whether the model output is entailed from the supposed
evidence, triggering some corrective actions, such as beam reranking at
inference time or RL rewards during training. While NLI models are trained to
detect factual inconsistencies over complete sentences, decisions in the common
autoregressive generation architecture are made for each evolving text prefix,
during decoding. Addressing this setting, we generalize the entailment
detection task to apply over arbitrary text prefixes, and suggest its utility
for improving generation faithfulness. Providing suitable evaluation and
training datasets for this task, we train MiniTruePrefixes, a novel specialized
model that better detects factual inconsistencies over text prefixes,
outperforming comparable baseline NLI models by 5-14 F1 points in prefix-level
entailment. We further demonstrate that integrating MiniTruePrefixes into a
controlled decoding framework substantially improves factual consistency in
abstractive summarization. When guided by MiniTruePrefixes,
LLaMA-3.2-3B-Instruct matches the faithfulness and runtime of the 8B model from
the same model family, while using only half the memory.
๐
2025-11-03
๐ EMNLP 2025
๐ฅ Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
ๆฅ็ๆ่ฆ
Recently, the demand for small and efficient reasoning models to support
real-world applications has driven the development of knowledge distillation
techniques that balance reasoning performance and inference speed. In this
paper, we further extend the DistilQwen model family, initialized from the Qwen
models, by introducing four model series specifically designed to meet
industrial requirements. The distilled model collection comprises: (1)
slow-thinking models, optimized for reasoning tasks that require high accuracy;
(2) two series of adaptive-thinking models, which dynamically adjust reasoning
strategies based on input tasks to maximize efficiency across diverse
scenarios; and (3) distilled reward models, which enable further reinforcement
learning of reasoning models using distilled knowledge. Comprehensive
evaluations across multiple benchmarks demonstrate both high inference
efficiency and strong reasoning performance for these models, as well as the
practical utility of distilled reward models. We further show that these models
support industry practitioners by providing scalable training and inference
functionalities on the Alibaba Cloud PAI (Platform for Artificial Intelligence)
platform.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters
to represent words or phrases creatively) requires a variety of skills such as
image recognition, cognitive skills, commonsense reasoning, multi-step
reasoning, image-based wordplay, etc., making this a challenging task for even
current Vision-Language Models. In this paper, we present
$\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse
benchmark of $1,333$ English Rebus Puzzles containing different artistic styles
and levels of difficulty, spread across 18 categories such as food, idioms,
sports, finance, entertainment, etc. We also propose $RebusDescProgICE$, a
model-agnostic framework which uses a combination of an unstructured
description and code-based, structured reasoning, along with better,
reasoning-based in-context example selection, improving the performance of
Vision-Language Models on
$\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by $2.1-4.1\%$ and
$20-30\%$ using closed-source and open-source models respectively compared to
Chain-of-Thought Reasoning.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiabao Ji, Min Li, Priyanshu Kumar, Shiyu Chang, Saloni Potdar
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) with integrated search tools show strong promise
in open-domain question answering (QA), yet they often struggle to produce
complete answer set to complex questions such as Which actor from the film Heat
won at least one Academy Award?, which requires (1) distinguishing between
multiple films sharing the same title and (2) reasoning across a large set of
actors to gather and integrate evidence. Existing QA benchmarks rarely evaluate
both challenges jointly. To address this, we introduce DeepAmbigQAGen, an
automatic data generation pipeline that constructs QA tasks grounded in text
corpora and linked knowledge graph, generating natural and verifiable questions
that systematically embed name ambiguity and multi-step reasoning. Based on
this, we build DeepAmbigQA, a dataset of 3,600 questions requiring multi-hop
reasoning and half of them explicit name ambiguity resolving. Experiments
reveal that, even state-of-the-art GPT-5 show incomplete answers, achieving
only 0.13 exact match on ambiguous questions and 0.21 on non-ambiguous
questions. These findings highlight the need for more robust QA systems aimed
at information gathering and answer completeness.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aman Ganapathy Manvattira, Yifei Xu, Ziyue Dang, Songwu Lu
ๆฅ็ๆ่ฆ
5G technology enables mobile Internet access for billions of users. Answering
expert-level questions about 5G specifications requires navigating thousands of
pages of cross-referenced standards that evolve across releases. Existing
retrieval-augmented generation (RAG) frameworks, including telecom-specific
approaches, rely on semantic similarity and cannot reliably resolve
cross-references or reason about specification evolution. We present DeepSpecs,
a RAG system enhanced by structural and temporal reasoning via three
metadata-rich databases: SpecDB (clause-aligned specification text), ChangeDB
(line-level version diffs), and TDocDB (standardization meeting documents).
DeepSpecs explicitly resolves cross-references by recursively retrieving
referenced clauses through metadata lookup, and traces specification evolution
by mining changes and linking them to Change Requests that document design
rationale. We curate two 5G QA datasets: 573 expert-annotated real-world
questions from practitioner forums and educational resources, and 350
evolution-focused questions derived from approved Change Requests. Across
multiple LLM backends, DeepSpecs outperforms base models and state-of-the-art
telecom RAG systems; ablations confirm that explicit cross-reference resolution
and evolution-aware retrieval substantially improve answer quality,
underscoring the value of modeling the structural and temporal properties of 5G
standards.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Saiyma Sittul Muna, Rezwan Islam Salvi, Mushfiqur Rahman Mushfique, Ajwad Abrar
Natural Language Processing
ๆฅ็ๆ่ฆ
In emergency situations, every second counts. The deployment of Large
Language Models (LLMs) in time-sensitive, low or zero-connectivity environments
remains limited. Current models are computationally intensive and unsuitable
for low-tier devices often used by first responders or civilians. A major
barrier to developing lightweight, domain-specific solutions is the lack of
high-quality datasets tailored to first aid and emergency response. To address
this gap, we introduce FirstAidQA, a synthetic dataset containing 5,500
high-quality question answer pairs that encompass a wide range of first aid and
emergency response scenarios. The dataset was generated using a Large Language
Model, ChatGPT-4o-mini, with prompt-based in-context learning, using texts from
the Vital First Aid Book (2019). We applied preprocessing steps such as text
cleaning, contextual chunking, and filtering, followed by human validation to
ensure accuracy, safety, and practical relevance of the QA pairs. FirstAidQA is
designed to support instruction-tuning and fine-tuning of LLMs and Small
Language Models (SLMs), enabling faster, more reliable, and offline-capable
systems for emergency settings. We publicly release the dataset to advance
research on safety-critical and resource-constrained AI applications in first
aid and emergency response. The dataset is available on Hugging Face at
https://huggingface.co/datasets/i-am-mushfiq/FirstAidQA.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qin Zhou, Zhexin Zhang, Zhi Li, Limin Sun
ๆฅ็ๆ่ฆ
With the rapid advancement of AI models, their deployment across diverse
tasks has become increasingly widespread. A notable emerging application is
leveraging AI models to assist in reviewing scientific papers. However, recent
reports have revealed that some papers contain hidden, injected prompts
designed to manipulate AI reviewers into providing overly favorable
evaluations. In this work, we present an early systematic investigation into
this emerging threat. We propose two classes of attacks: (1) static attack,
which employs a fixed injection prompt, and (2) iterative attack, which
optimizes the injection prompt against a simulated reviewer model to maximize
its effectiveness. Both attacks achieve striking performance, frequently
inducing full evaluation scores when targeting frontier AI reviewers.
Furthermore, we show that these attacks are robust across various settings. To
counter this threat, we explore a simple detection-based defense. While it
substantially reduces the attack success rate, we demonstrate that an adaptive
attacker can partially circumvent this defense. Our findings underscore the
need for greater attention and rigorous safeguards against prompt-injection
threats in AI-assisted peer review.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Speculative decoding (SD) has emerged as an effective technique to accelerate
large language model (LLM) inference without compromising output quality.
However, the achievable speedup largely depends on the effectiveness of the
drafting model. While model-based methods like EAGLE-2 are accurate but costly,
retrieval-enhanced methods like SAM-Decoding rely on heuristic switching
strategies that often trigger unnecessary retrievals. To address this, we
propose ReSpec (\textbf{Re}trieval-enhanced \textbf{Spe}culative Decoding), a
novel framework that transforms heuristic drafter switching into adaptive
decision-making. ReSpec features three core innovations: 1) An
\textbf{entropy-guided adaptive trigger} quantifies contextual predictability
to initiate retrieval only when uncertainty is low, avoiding costly low-quality
speculations. 2) A \textbf{feedback-driven candidate selection} leverages
historical feedback to organize multiple high-quality candidates for parallel
verification, maximizing retrieval utility. 3) A source-aware \textbf{relaxed
verification strategy} applies strict checks to model-generated drafts while
using a relaxed verification for retrieved drafts, achieving a better balance
between accuracy and efficiency. Extensive experiments on Spec-Bench
demonstrate that ReSpec achieves state-of-the-art acceleration,outperforming
EAGLE-2 and SAM-Decoding by over $33\%$ and $25\%$, respectively, while
maintaining output quality.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mo El-Haj, Paul Rayson
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper investigates the impact of domain specificity on abstractive
summarisation of Arabic financial texts using large language models (LLMs). We
introduce AraFinNews, the largest publicly available Arabic financial news
dataset to date, comprising 212,500 article-headline pairs spanning nearly a
decade of reporting from October 2015 to July 2025. Designed as the Arabic
equivalent of major English summarisation corpora such as CNN/DailyMail,
AraFinNews provides a robust benchmark for evaluating domain-specific language
understanding and generation in financial contexts. Using this resource, we
evaluate transformer-based models -- including mT5, AraT5, and the
domain-adapted FinAraT5 -- to examine how financial-domain pretraining
influences factual accuracy, numerical reliability, and stylistic alignment
with professional reporting. Experimental results show that domain-adapted
models generate more faithful and coherent summaries, particularly in handling
quantitative and entity-centric information. The findings highlight the
importance of domain-specific adaptation for improving factual consistency and
narrative fluency in Arabic financial summarisation. The dataset is freely
available for non-commercial research at
https://github.com/ArabicNLP-UK/AraFinNews.
๐
2025-11-03
๐ ArXiv Preprint (econ.GN)
๐ฅ Chaofeng Wu
Natural Language Processing
ๆฅ็ๆ่ฆ
We propose a framework that recasts scientific novelty not as a single
attribute of a paper, but as a reflection of its position within the evolving
intellectual landscape. We decompose this position into two orthogonal
dimensions: \textit{spatial novelty}, which measures a paper's intellectual
distinctiveness from its neighbors, and \textit{temporal novelty}, which
captures its engagement with a dynamic research frontier. To operationalize
these concepts, we leverage Large Language Models to develop semantic isolation
metrics that quantify a paper's location relative to the full-text literature.
Applying this framework to a large corpus of economics articles, we uncover a
fundamental trade-off: these two dimensions predict systematically different
outcomes. Temporal novelty primarily predicts citation counts, whereas spatial
novelty predicts disruptive impact. This distinction allows us to construct a
typology of semantic neighborhoods, identifying four archetypes associated with
distinct and predictable impact profiles. Our findings demonstrate that novelty
can be understood as a multidimensional construct whose different forms,
reflecting a paper's strategic location, have measurable and fundamentally
distinct consequences for scientific progress.
๐
2025-11-03
๐ ArXiv Preprint (Machine Learning)
๐ฅ Pavel Rumiantsev, Soumyasundar Pal, Yingxue Zhang, Mark Coates
Natural Language Processing
ๆฅ็ๆ่ฆ
The performance of Large Language Models (LLMs) and the associated dollar
costs of API calls can fluctuate over time, potentially invalidating
conclusions drawn in prior research. To address this, we propose a Fair
Evaluation protocol for Test-Time Compute (FEval-TTC), designed to ensure
consistent assessment of test-time compute (TTC) methods, regardless of such
fluctuations. FEval-TTC focuses on the evaluation of TTC methods that utilize
underlying Chains-of-Thought (CoT). It supports evaluations across multiple
LLMs on a diverse set of mathematical and commonsense reasoning datasets. The
few-shot prompting and answer extraction processes are standardized across
datasets, reducing both time and monetary overhead for researchers.
Furthermore, we provide a cost modelling procedure that estimates both the
token and dollar cost per query, facilitating equitable comparisons of
prevalent TTC methods. We open-source FEval-TTC for public use at
https://github.com/networkslab/feval_ttc .
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Guoxin Ma, Xiaoming Liu, Zhanhan Zhang, Chengzhengxu Li, Shengchao Liu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Detecting machine-generated text (MGT) has emerged as a critical challenge,
driven by the rapid advancement of large language models (LLMs) capable of
producing highly realistic, human-like content. However, the performance of
current approaches often degrades significantly under domain shift. To address
this challenge, we propose a novel framework designed to capture both
domain-specific and domain-general MGT patterns through a two-stage
Disentangled mixturE-of-ExpeRts (DEER) architecture. First, we introduce a
disentangled mixture-of-experts module, in which domain-specific experts learn
fine-grained, domain-local distinctions between human and machine-generated
text, while shared experts extract transferable, cross-domain features. Second,
to mitigate the practical limitation of unavailable domain labels during
inference, we design a reinforcement learning-based routing mechanism that
dynamically selects the appropriate experts for each input instance,
effectively bridging the train-inference gap caused by domain uncertainty.
Extensive experiments on five in-domain and five out-of-domain benchmark
datasets demonstrate that DEER consistently outperforms state-of-the-art
methods, achieving average F1-score improvements of 1.39% and 5.32% on
in-domain and out-of-domain datasets respectively, along with accuracy gains of
1.35% and 3.61% respectively. Ablation studies confirm the critical
contributions of both disentangled expert specialization and adaptive routing
to model performance.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ru Wang, Wei Huang, Qi Cao, Yusuke Iwasawa, Yutaka Matsuo et al.
Machine Learning
ๆฅ็ๆ่ฆ
Test-time reinforcement learning (TTRL) offers a label-free paradigm for
adapting models using only synthetic signals at inference, but its success
hinges on constructing reliable learning signals. Standard approaches such as
majority voting often collapse to spurious yet popular answers. We introduce
Self-Harmony, a framework built on a simple intuition: the correct answer
should remain stable across both an original question and its paraphrase.
Self-Harmony operationalizes this by employing a single model in two
complementary roles: a Solver to produce answers and a Reframer to rephrase the
input. Based on this, we further propose a pseudo-label method: instead of
majority voting, it aggregates answer frequencies across these original and
reframed views using the harmonic mean. This is a process that naturally
selects for solutions stable under reframing, thereby avoiding the common trap
of favoring view-dependent, spurious answers. Crucially, this requires no human
supervision or auxiliary models. Across diverse reasoning benchmarks,
Self-Harmony achieves state-of-the-art results at the label-free test-time
setting, ranking first in 28 of 30 settings across multiple methods. Beyond
accuracy, it demonstrates unprecedented robustness, with zero training failures
in all experiments, underscoring its stability and reliability.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lvhua Wu, Xuefeng Jiang, Sheng Sun, Tian Wen, Yuwei Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid spread of fake news threatens social stability and public trust,
rendering its detection an imperative research priority. Although large
language models (LLMs) excel at numerous natural language processing tasks with
their remarkable contextual understanding and extensive prior knowledge, the
time-bounded knowledge coverage and tendency for generating hallucination
content reduce their reliability when handling fast-evolving news streams.
Furthermore, models trained on existing static datasets also often lack the
generalization needed for emerging news topics. To address these challenges, we
propose ZoFia, a novel two-stage zero-shot fake news detection framework.
First, we introduce Hierarchical Salience to quantify the importance of
entities in the news content, and propose the SC-MMR algorithm to effectively
select an informative and diverse set of keywords that serve as queries for
retrieving up-to-date external evidence. Subsequently, a multi LLM interactive
system, in which each agent assumes a distinct role, performs multi-view
collaborative analysis and adversarial debate over the news text and its
related information, and finally produces an interpretable and robust judgment.
Comprehensive experiments on two public datasets demonstrate that ZoFia
obviously outperforms existing zero-shot baselines and most of few-shot
methods. Our codes will be open-sourced to facilitate related communities.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Muhammed Saeed, Muhammad Abdul-mageed, Shady Shehata
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are widely deployed for open-ended
communication, yet most bias evaluations still rely on English,
classification-style tasks. We introduce DebateBias-8K, a new multilingual,
debate-style benchmark designed to reveal how narrative bias appears in
realistic generative settings. Our dataset includes 8,400 structured debate
prompts spanning four sensitive domains: women's rights, socioeconomic
development, terrorism, and religion, across seven languages ranging from
high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin).
Using four flagship models (GPT-4o, Claude 3, DeepSeek, and LLaMA 3), we
generate and automatically classify over 100,000 responses. Results show that
all models reproduce entrenched stereotypes despite safety alignment: Arabs are
overwhelmingly linked to terrorism and religion (>=95%), Africans to
socioeconomic "backwardness" (up to <=77%), and Western groups are consistently
framed as modern or progressive. Biases grow sharply in lower-resource
languages, revealing that alignment trained primarily in English does not
generalize globally. Our findings highlight a persistent divide in multilingual
fairness: current alignment methods reduce explicit toxicity but fail to
prevent biased outputs in open-ended contexts. We release our DebateBias-8K
benchmark and analysis framework to support the next generation of multilingual
bias evaluation and safer, culturally inclusive model alignment.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Emaad Manzoor, Eva Ascarza, Oded Netzer
Natural Language Processing
ๆฅ็ๆ่ฆ
Salespeople frequently face the dynamic screening decision of whether to
persist in a conversation or abandon it to pursue the next lead. Yet, little is
known about how these decisions are made, whether they are efficient, or how to
improve them. We study these decisions in the context of high-volume outbound
sales where leads are ample, but time is scarce and failure is common. We
formalize the dynamic screening decision as an optimal stopping problem and
develop a generative language model-based sequential decision agent - a
stopping agent - that learns whether and when to quit conversations by
imitating a retrospectively-inferred optimal stopping policy. Our approach
handles high-dimensional textual states, scales to large language models, and
works with both open-source and proprietary language models. When applied to
calls from a large European telecommunications firm, our stopping agent reduces
the time spent on failed calls by 54% while preserving nearly all sales;
reallocating the time saved increases expected sales by up to 37%. Upon
examining the linguistic cues that drive salespeople's quitting decisions, we
find that they tend to overweight a few salient expressions of consumer
disinterest and mispredict call failure risk, suggesting cognitive bounds on
their ability to make real-time conversational decisions. Our findings
highlight the potential of artificial intelligence algorithms to correct
cognitively-bounded human decisions and improve salesforce efficiency.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Lingzhe Zhang, Yunpeng Zhai, Tong Jia, Chiming Duan, Minghua He et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) integrated with agent-based reasoning frameworks
have recently shown strong potential for autonomous decision-making and
system-level operations. One promising yet underexplored direction is
microservice remediation, where the goal is to automatically recover faulty
microservice systems. Existing approaches, however, still rely on human-crafted
prompts from Site Reliability Engineers (SREs), with LLMs merely converting
textual instructions into executable code. To advance research in this area, we
introduce MicroRemed, the first benchmark for evaluating LLMs in end-to-end
microservice remediation, where models must directly generate executable
Ansible playbooks from diagnosis reports to restore system functionality. We
further propose ThinkRemed, a multi-agent framework that emulates the
reflective and perceptive reasoning of SREs. Experimental results show that
MicroRemed presents substantial challenges to current LLMs, while ThinkRemed
improves end-to-end remediation performance through iterative reasoning and
system reflection. The benchmark is available at
https://github.com/LLM4AIOps/MicroRemed.
๐
2025-11-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Hai-Dang Kieu, Min Xu, Thanh Trung Huynh, Dung D. Le
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Recent advances in multimodal recommendation (MMR) have shown that
incorporating rich content sources such as images and text can lead to
significant gains representation quality. However, existing methods often rely
on coarse visual features and uncontrolled fusion, leading to redundant or
misaligned representations. As a result, visual encoders often fail to capture
salient, item-relevant semantics, limiting their contribution in multimodal
fusion. From an information-theoretic perspective, effective fusion should
balance the unique, shared, and redundant information across modalities,
preserving complementary cues while avoiding correlation bias. This paper
presents VLIF, a vision-language and information-theoretic fusion framework
that enhances multimodal recommendation through two key components. (i) A
VLM-based visual enrichment module generates fine-grained, title-guided
descriptions to transform product images into semantically aligned
representations. (ii) An information-aware fusion module, inspired by Partial
Information Decomposition (PID), disentangles redundant and synergistic signals
across modalities for controlled integration. Experiments on three Amazon
datasets demonstrate that VLIF consistently outperforms recent multimodal
baselines and substantially strengthens the contribution of visual features.
๐
2025-11-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Karol Radziszewski, Michaล Szpunar, Piotr Ociepka, Mateusz Buczyลski
ๆฅ็ๆ่ฆ
We present a scalable recommender system implementation based on RippleNet,
tailored for the media domain with a production deployment in Onet.pl, one of
Poland's largest online media platforms. Our solution addresses the cold-start
problem for newly published content by integrating content-based item
embeddings into the knowledge propagation mechanism of RippleNet, enabling
effective scoring of previously unseen items. The system architecture leverages
Amazon SageMaker for distributed training and inference, and Apache Airflow for
orchestrating data pipelines and model retraining workflows. To ensure
high-quality training data, we constructed a comprehensive golden dataset
consisting of user and item features and a separate interaction table, all
enabling flexible extensions and integration of new signals.
๐
2025-11-03
๐ ArXiv Preprint (cs.DB)
๐ฅ Xiangru Jian, Zhengyuan Dong, M. Tamer รzsu
ๆฅ็ๆ่ฆ
In recent years, querying semantic web data using SPARQL has remained
challenging, especially for non-expert users, due to the language's complex
syntax and the prerequisite of understanding intricate data structures. To
address these challenges, we propose InteracSPARQL, an interactive SPARQL query
generation and refinement system that leverages natural language explanations
(NLEs) to enhance user comprehension and facilitate iterative query refinement.
InteracSPARQL integrates LLMs with a rule-based approach to first produce
structured explanations directly from SPARQL abstract syntax trees (ASTs),
followed by LLM-based linguistic refinements. Users can interactively refine
queries through direct feedback or LLM-driven self-refinement, enabling the
correction of ambiguous or incorrect query components in real time. We evaluate
InteracSPARQL on standard benchmarks, demonstrating significant improvements in
query accuracy, explanation clarity, and overall user satisfaction compared to
baseline approaches. Our experiments further highlight the effectiveness of
combining rule-based methods with LLM-driven refinements to create more
accessible and robust SPARQL interfaces.
๐
2025-11-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Reza Esfandiarpoor, Max Zuo, Stephen H. Bach
ๆฅ็ๆ่ฆ
We introduce Trove, an easy-to-use open-source retrieval toolkit that
simplifies research experiments without sacrificing flexibility or speed. For
the first time, we introduce efficient data management features that load and
process (filter, select, transform, and combine) retrieval datasets on the fly,
with just a few lines of code. This gives users the flexibility to easily
experiment with different dataset configurations without the need to compute
and store multiple copies of large datasets. Trove is highly customizable: in
addition to many built-in options, it allows users to freely modify existing
components or replace them entirely with user-defined objects. It also provides
a low-code and unified pipeline for evaluation and hard negative mining, which
supports multi-node execution without any code changes. Trove's data management
features reduce memory consumption by a factor of 2.6. Moreover, Trove's
easy-to-use inference pipeline incurs no overhead, and inference times decrease
linearly with the number of available nodes. Most importantly, we demonstrate
how Trove simplifies retrieval experiments and allows for arbitrary
customizations, thus facilitating exploratory research.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Riccardo Campi, Nicolรฒ Oreste Pinciroli Vago, Mathyas Giudici, Pablo Barrachina Rodriguez-Guisado, Marco Brambilla et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
In this work, we investigate the use of Large Language Models (LLMs) within a
graph-based Retrieval Augmented Generation (RAG) architecture for Energy
Efficiency (EE) Question Answering. First, the system automatically extracts a
Knowledge Graph (KG) from guidance and regulatory documents in the energy
field. Then, the generated graph is navigated and reasoned upon to provide
users with accurate answers in multiple languages. We implement a human-based
validation using the RAGAs framework properties, a validation dataset
comprising 101 question-answer pairs, and domain experts. Results confirm the
potential of this architecture and identify its strengths and weaknesses.
Validation results show how the system correctly answers in about three out of
four of the cases (75.2 +- 2.7%), with higher results on questions related to
more general EE answers (up to 81.0 +- 4.1%), and featuring promising
multilingual abilities (4.4% accuracy loss due to translation).
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
In the retrieval domain, candidates' fusion from heterogeneous retrievers is
a long-standing challenge, particularly for complex, multi-modal data such as
videos. While typical fusion techniques are training-free, they rely solely on
rank or score signals, disregarding candidates' representations. This work
introduces Vote-in-Context (ViC), a generalized, training-free framework that
re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a
Vision-Language Model (VLM). The core insight is to serialize both content
evidence and retriever metadata directly within the VLM's prompt, allowing the
model to adaptively weigh retriever consensus against visual-linguistic
content. We demonstrate the generality of this framework by applying it to the
challenging domain of cross-modal video retrieval. To this end, we introduce
the S-Grid, a compact serialization map that represents each video as an image
grid, optionally paired with subtitles to enable list-wise reasoning over video
candidates. ViC is evaluated both as a single-list reranker, where it
dramatically improves the precision of individual retrievers, and as an
ensemble fuser, where it consistently outperforms strong baselines like
CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the
framework establishes new state-of-the-art zero-shot retrieval performance,
demonstrating its effectiveness in handling complex visual and temporal signals
alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1%
(t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive
gains of up to +40 Recall@1 over previous state-of-the-art baselines. We
present ViC as a simple, reproducible, and highly effective recipe for turning
modern VLMs into powerful zero-shot rerankers and fusers. Code and resources
are publicly available at: https://github.com/mohammad2012191/ViC
๐
2025-11-03
๐ ArXiv Preprint (cs.DL)
๐ฅ Muneer Ahmad, M Sadik Batcha, Wasim Rashid, Obaid Hafiz
ๆฅ็ๆ่ฆ
This paper examines and explores the web impact factor through a webometric
study of the present 12 University Websites of Jammu and Kashmir. Identifies
the domain systems of the websites; analyzes the number of web pages and link
pages, and calculates the External Link WIF or simple web impact factor (WIF)
and external web impact factor of all the University websites. Also reflects
that some university websites have higher number of web pages, but
correspondingly their link pages are very small in number and websites fall
behind in their simple and external link web impact factor. It found that the
Cluster University of Jammu ranked 1 (0.9018) in Internal Link WIF of Websites
in Jammu and Kashmir. Shri Mata Vaishno Devi University ranked 1 (0.7249) in
External Link Web Impact Factor.
๐
2025-11-03
๐ ArXiv Preprint (cs.DL)
๐ฅ M Sadik Batcha, Younis Rashid Dar, Muneer Ahmad
ๆฅ็ๆ่ฆ
This study aims to present a scientometric analysis of the journal titled
Cognition for a period of 20 years from 1999 to 2018. The present study was
conducted with an aim to provide a summary of research activity in current
journal and characterize its most aspects. The research coverage includes the
year wise distribution of articles, authors, institutions, countries and
citation analysis of the journal. The analysis showed that 2870 papers were
published in journal of Cognition from 1999 to 2018. The study identified top
20 prolific authors, institutions and countries of the journal. Researchers
from USA have been made the most percentage of contributions.
๐
2025-11-03
๐ WSDM
๐ฅ Xiaoyu Liu, Fuwei Zhang, Yiqing Wu, Xinyu Jia, Zenghua Xia et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Generative retrieval (GR) has gained significant attention as an effective
paradigm that integrates the capabilities of large language models (LLMs). It
generally consists of two stages: constructing discrete semantic identifiers
(IDs) for documents and retrieving documents by autoregressively generating ID
tokens. The core challenge in GR is how to construct document IDs (DocIDS) with
strong representational power. Good IDs should exhibit two key properties:
similar documents should have more similar IDs, and each document should
maintain a distinct and unique ID. However, most existing methods ignore native
category information, which is common and critical in E-commerce. Therefore, we
propose a novel ID learning method, CAtegory-Tree Integrated Document
IDentifier (CAT-ID$^2$), incorporating prior category information into the
semantic IDs. CAT-ID$^2$ includes three key modules: a Hierarchical Class
Constraint Loss to integrate category information layer by layer during
quantization, a Cluster Scale Constraint Loss for uniform ID token
distribution, and a Dispersion Loss to improve the distinction of reconstructed
documents. These components enable CAT-ID$^2$ to generate IDs that make similar
documents more alike while preserving the uniqueness of different documents'
representations. Extensive offline and online experiments confirm the
effectiveness of our method, with online A/B tests showing a 0.33% increase in
average orders per thousand users for ambiguous intent queries and 0.24% for
long-tail queries.
๐
2025-11-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zhengjun Huang, Zhoujin Tian, Qintian Guo, Fangyuan Zhang, Yingli Zhou et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Model (LLM) agents exhibit remarkable conversational and
reasoning capabilities but remain constrained by limited context windows and
the lack of persistent memory. Recent efforts address these limitations via
external memory architectures, often employing graph-based representations, yet
most adopt flat, entangled structures that intertwine semantics with topology,
leading to redundant representations, unstructured retrieval, and degraded
efficiency and accuracy. To resolve these issues, we propose LiCoMemory, an
end-to-end agentic memory framework for real-time updating and retrieval, which
introduces CogniGraph, a lightweight hierarchical graph that utilizes entities
and relations as semantic indexing layers, and employs temporal and
hierarchy-aware search with integrated reranking for adaptive and coherent
knowledge retrieval. Experiments on long-term dialogue benchmarks, LoCoMo and
LongMemEval, show that LiCoMemory not only outperforms established baselines in
temporal reasoning, multi-session consistency, and retrieval efficiency, but
also notably reduces update latency. Our official code and data are available
at https://github.com/EverM0re/LiCoMemory.
๐
2025-11-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xiaoyu Liu, Yiqing Wu, Ruidong Han, Fuzhen Zhuang, Xiang Li et al.
Machine Learning
ๆฅ็ๆ่ฆ
In industrial practice, Multi-domain Recommendation (MDR) plays a crucial
role. Shared-specific architectures are widely used in industrial solutions to
capture shared and unique attributes via shared and specific parameters.
However, with imbalanced data across different domains, these models face two
key issues: (1) Overwhelming: Dominant domain data skews model performance,
neglecting non-dominant domains. (2) Overfitting: Sparse data in non-dominant
domains leads to overfitting in specific parameters. To tackle these
challenges, we propose Soft-partitioned Semi-supervised Collaborative Transfer
Learning (SSCTL) for multi-domain recommendation. SSCTL generates dynamic
parameters to address the overwhelming issue, thus shifting focus towards
samples from non-dominant domains. To combat overfitting, it leverages
pseudo-labels with weights from dominant domain instances to enhance
non-dominant domain data. We conduct comprehensive experiments, both online and
offline, to validate the efficacy of our proposed method. Online tests yielded
significant improvements across various domains, with increases in GMV ranging
from 0.54% to 2.90% and enhancements in CTR ranging from 0.22% to 1.69%.
๐
2025-11-03
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Muhammed Yusuf Kartal, Suha Kagan Kose, Korhan Sevinรง, Burak Aktas
Machine Learning
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) quality depends on many interacting
choices across retrieval, ranking, augmentation, prompting, and generation, so
optimizing modules in isolation is brittle. We introduce RAGSmith, a modular
framework that treats RAG design as an end-to-end architecture search over nine
technique families and 46{,}080 feasible pipeline configurations. A genetic
search optimizes a scalar objective that jointly aggregates retrieval metrics
(recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic
similarity). We evaluate on six Wikipedia-derived domains (Mathematics, Law,
Finance, Medicine, Defense Industry, Computer Science), each with 100 questions
spanning factual, interpretation, and long-answer types. RAGSmith finds
configurations that consistently outperform naive RAG baseline by +3.8\% on
average (range +1.2\% to +6.9\% across domains), with gains up to +12.5\% in
retrieval and +7.5\% in generation. The search typically explores $\approx
0.2\%$ of the space ($\sim 100$ candidates) and discovers a robust backbone --
vector retrieval plus post-generation reflection/revision -- augmented by
domain-dependent choices in expansion, reranking, augmentation, and prompt
reordering; passage compression is never selected. Improvement magnitude
correlates with question type, with larger gains on factual/long-answer mixes
than interpretation-heavy sets. These results provide practical, domain-aware
guidance for assembling effective RAG systems and demonstrate the utility of
evolutionary search for full-pipeline optimization.
๐
2025-11-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Pavan Kumar Perepu
Machine Learning
ๆฅ็ๆ่ฆ
Mathematical expressions (MEs) have complex two-dimensional structures in
which symbols can be present at any nested depth like superscripts, subscripts,
above, below etc. As MEs are represented using LaTeX format, several text
retrieval methods based on string matching, vector space models etc., have also
been applied for ME retrieval problem in the literature. As these methods are
based on syntactic similarity, recently deep learning approaches based on
embedding have been used for semantic similarity. In our present work, we have
focused on the retrieval of mathematical expressions using deep learning
approaches. In our approach, semantic features are extracted from the MEs using
a deep recurrent neural network (DRNN) and these features have been used for
matching and retrieval. We have trained the network for a classification task
which determines the complexity of an ME. ME complexity has been quantified in
terms of its nested depth. Based on the nested depth, we have considered three
complexity classes of MEs: Simple, Medium and Complex. After training the
network, outputs just before the the final fully connected layer are extracted
for all the MEs. These outputs form the semantic features of MEs and are stored
in a database. For a given ME query, its semantic features are computed using
the trained DRNN and matched against the semantic feature database. Matching is
performed based on the standard euclidean distance and top 'k' nearest matches
are retrieved, where 'k' is a user-defined parameter. Our approach has been
illustrated on a database of 829 MEs.
๐
2025-11-03
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Minseok Kim, Hankook Lee, Hyungjoon Koo
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large language models (LLMs) are reshaping numerous facets of our daily
lives, leading widespread adoption as web-based services. Despite their
versatility, LLMs face notable challenges, such as generating hallucinated
content and lacking access to up-to-date information. Lately, to address such
limitations, Retrieval-Augmented Generation (RAG) has emerged as a promising
direction by generating responses grounded in external knowledge sources. A
typical RAG system consists of i) a retriever that probes a group of relevant
passages from a knowledge base and ii) a generator that formulates a response
based on the retrieved content. However, as with other AI systems, recent
studies demonstrate the vulnerability of RAG, such as knowledge corruption
attacks by injecting misleading information. In response, several defense
strategies have been proposed, including having LLMs inspect the retrieved
passages individually or fine-tuning robust retrievers. While effective, such
approaches often come with substantial computational costs.
In this work, we introduce RAGDefender, a resource-efficient defense
mechanism against knowledge corruption (i.e., by data poisoning) attacks in
practical RAG deployments. RAGDefender operates during the post-retrieval
phase, leveraging lightweight machine learning techniques to detect and filter
out adversarial content without requiring additional model training or
inference. Our empirical evaluations show that RAGDefender consistently
outperforms existing state-of-the-art defenses across multiple models and
adversarial scenarios: e.g., RAGDefender reduces the attack success rate (ASR)
against the Gemini model from 0.89 to as low as 0.02, compared to 0.69 for
RobustRAG and 0.24 for Discern-and-Answer when adversarial passages outnumber
legitimate ones by a factor of four (4x).
๐
2025-11-03
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jerry Huang, Siddarth Madala, Cheng Niu, Julia Hockenmaier, Tong Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Reranking algorithms have made progress in improving document retrieval
quality by efficiently aggregating relevance judgments generated by large
language models (LLMs). However, identifying relevant documents for queries
that require in-depth reasoning remains a major challenge. Reasoning-intensive
queries often exhibit multifaceted information needs and nuanced
interpretations, rendering document relevance inherently context dependent. To
address this, we propose contextual relevance, which we define as the
probability that a document is relevant to a given query, marginalized over the
distribution of different reranking contexts it may appear in (i.e., the set of
candidate documents it is ranked alongside and the order in which the documents
are presented to a reranking model). While prior works have studied methods to
mitigate the positional bias LLMs exhibit by accounting for the ordering of
documents, we empirically find that the compositions of these batches also
plays an important role in reranking performance. To efficiently estimate
contextual relevance, we propose TS-SetRank, a sampling-based,
uncertainty-aware reranking algorithm. Empirically, TS-SetRank improves nDCG@10
over retrieval and reranking baselines by 15-25% on BRIGHT and 6-21% on BEIR,
highlighting the importance of modeling relevance as context-dependent.
๐
2025-11-03
๐ ICRA 2026
๐ฅ Hersh Sanghvi, Spencer Folk, Vijay Kumar, Camillo Jose Taylor
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Controller performance in quadrotor trajectory tracking depends heavily on
parameter tuning, yet standard approaches often rely on fixed, manually tuned
parameters that sacrifice task-specific performance. We present
Trajectory-Aware Controller Optimization (TACO), a framework that adapts
controller parameters online based on the upcoming reference trajectory and
current quadrotor state. TACO employs a learned predictive model and a
lightweight optimization scheme to optimize controller gains in real time with
respect to a broad class of trajectories, and can also be used to adapt
trajectories to improve dynamic feasibility while respecting smoothness
constraints. To enable large-scale training, we also introduce a parallelized
quadrotor simulator supporting fast data collection on diverse trajectories.
Experiments on a variety of trajectory types show that TACO outperforms
conventional, static parameter tuning while operating orders of magnitude
faster than black-box optimization baselines, enabling practical real-time
deployment on a physical quadrotor. Furthermore, we show that adapting
trajectories using TACO significantly reduces the tracking error obtained by
the quadrotor.
๐
2025-11-03
๐ ICRA 2026
๐ฅ Parsa Hosseininejad, Kimia Khabiri, Shishir Gopinath, Soudabeh Mohammadhashemi, Karthik Dantu et al.
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents TurboMap, a GPU-accelerated and CPU-optimized local
mapping module for visual SLAM systems. We identify key performance bottlenecks
in the local mapping process for visual SLAM and address them through targeted
GPU and CPU optimizations. Specifically, we offload map point triangulation and
fusion to the GPU, accelerate redundant keyframe culling on the CPU, and
integrate a GPU-accelerated solver to speed up local bundle adjustment. Our
implementation is built on top of ORB-SLAM3 and leverages CUDA for GPU
programming. The experimental results show that TurboMap achieves an average
speedup of 1.3x in the EuRoC dataset and 1.6x in the TUM-VI dataset in the
local mapping module, on both desktop and embedded platforms, while maintaining
the accuracy of the original system.
๐
2025-11-03
๐ Under review, IEEE Letters
๐ฅ Rathin Chandra Shit
Machine Learning
ๆฅ็ๆ่ฆ
Catastrophic forgetting is one of the fundamental issues of continual
learning because neural networks forget the tasks learned previously when
trained on new tasks. The proposed framework is a new path-coordinated
framework of continual learning that unites the Neural Tangent Kernel (NTK)
theory of principled plasticity bounds, statistical validation by Wilson
confidence intervals, and evaluation of path quality by the use of multiple
metrics. Experimental evaluation shows an average accuracy of 66.7% at the cost
of 23.4% catastrophic forgetting on Split-CIFAR10, a huge improvement over the
baseline and competitive performance achieved, which is very close to
state-of-the-art results. Further, it is found out that NTK condition numbers
are predictive indicators of learning capacity limits, showing the existence of
a critical threshold at condition number $>10^{11}$. It is interesting to note
that the proposed strategy shows a tendency of lowering forgetting as the
sequence of tasks progresses (27% to 18%), which is a system stabilization. The
framework validates 80% of discovered paths with a rigorous statistical
guarantee and maintains 90-97% retention on intermediate tasks. The core
capacity limits of the continual learning environment are determined in the
analysis, and actionable insights to enhance the adaptive regularization are
offered.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Jace Aldrich, Odest Chadwicke Jenkins
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This paper presents a novel method for Model Predictive Path Integral (MPPI)
control that optimizes sample generation towards an optimal trajectory through
Stein Variational Gradient Descent (SVGD). MPPI is traditionally reliant on
randomly sampled trajectories, often by a Gaussian distribution. The result can
lead to sample deprivation, under-representing the space of possible
trajectories, and yield suboptimal results. Through introducing SVGD updates in
between MPPI environment steps, we present Stein-Optimized Path-Integral
Inference (SOPPI), an MPPI/SVGD algorithm that can dynamically update noise
distributions at runtime to shape a more optimal representation without an
excessive increase in computational requirements. We demonstrate the efficacy
of our method systems ranging from a Cart-Pole to a two-dimensional bipedal
walking task, indicating improved performance above standard MPPI across a
range of hyper-parameters and demonstrate feasibility at lower particle counts.
We discuss the applicability of this MPPI/SVGD method to higher
degree-of-freedom systems, as well as its potential to new developments in
state-of-the-art differentiable simulators.
๐
2025-11-03
๐ ICCV 2025
๐ฅ Sangyun Park, Jin Kim, Yuchen Cui, Matthew S. Brown
MultimodalNatural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) struggle to translate high-level instructions
into the precise spatial affordances required for robotic manipulation. While
visual Chain-of-Thought (CoT) methods exist, they are often computationally
intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance
Coordinate Extraction), a novel methodology that integrates a textual Chain of
Reasoning (CoR) into the affordance prediction process. We use this methodology
to create the TRACE dataset, a large-scale collection created via an autonomous
pipeline that pairs instructions with explicit textual rationales. By
fine-tuning a VLM on this data, our model learns to externalize its spatial
reasoning before acting. Our experiments show that our TRACE-tuned model
achieves state-of-the-art performance, reaching 48.1% accuracy on the primary
Where2Place (W2P) benchmark (a 9.6% relative improvement) and 55.0% on the more
challenging W2P(h) subset. Crucially, an ablation study demonstrates that
performance scales directly with the amount of reasoning data used, confirming
the CoR's effectiveness. Furthermore, analysis of the model's attention maps
reveals an interpretable reasoning process where focus shifts dynamically
across reasoning steps. This work shows that training VLMs to generate a
textual CoR is an effective and robust strategy for enhancing the precision,
reliability, and interpretability of VLM-based robot control. Our dataset and
code are available at https://github.com/jink-ucla/TRACE
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Javier Ballesteros-Jerez, Jesus Martรญnez-Gรณmez, Ismael Garcรญa-Varea, Luis Orozco-Barbosa, Manuel Castillo-Cara
Machine LearningRobotics
ๆฅ็ๆ่ฆ
We present a hybrid neural network model for inferring the position of mobile
robots using Channel State Information (CSI) data from a Massive MIMO system.
By leveraging an existing CSI dataset, our approach integrates a Convolutional
Neural Network (CNN) with a Multilayer Perceptron (MLP) to form a Hybrid Neural
Network (HyNN) that estimates 2D robot positions. CSI readings are converted
into synthetic images using the TINTO tool. The localisation solution is
integrated with a robotics simulator, and the Robot Operating System (ROS),
which facilitates its evaluation through heterogeneous test cases, and the
adoption of state estimators like Kalman filters. Our contributions illustrate
the potential of our HyNN model in achieving precise indoor localisation and
navigation for mobile robots in complex environments. The study follows, and
proposes, a generalisable procedure applicable beyond the specific use case
studied, making it adaptable to different scenarios and datasets.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Gabriel Nobis, Maximilian Springenberg, Arina Belova, Rembert Daems, Christoph Knochenhauer et al.
ๆฅ็ๆ่ฆ
We present Fractional Diffusion Bridge Models (FDBM), a novel generative
diffusion bridge framework driven by an approximation of the rich and
non-Markovian fractional Brownian motion (fBM). Real stochastic processes
exhibit a degree of memory effects (correlations in time), long-range
dependencies, roughness and anomalous diffusion phenomena that are not captured
in standard diffusion or bridge modeling due to the use of Brownian motion
(BM). As a remedy, leveraging a recent Markovian approximation of fBM (MA-fBM),
we construct FDBM that enable tractable inference while preserving the
non-Markovian nature of fBM. We prove the existence of a coupling-preserving
generative diffusion bridge and leverage it for future state prediction from
paired training data. We then extend our formulation to the Schr\"{o}dinger
bridge problem and derive a principled loss function to learn the unpaired data
translation. We evaluate FDBM on both tasks: predicting future protein
conformations from aligned data, and unpaired image translation. In both
settings, FDBM achieves superior performance compared to the Brownian
baselines, yielding lower root mean squared deviation (RMSD) of C$_\alpha$
atomic positions in protein structure prediction and lower Fr\'echet Inception
Distance (FID) in unpaired image translation.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Feng Chen, Zhuxiu Xu, Tianzhe Chu, Xunzhe Zhou, Li Sun et al.
MultimodalNatural Language ProcessingRoboticsMachine Learning
ๆฅ็ๆ่ฆ
Data scarcity remains a fundamental bottleneck for embodied intelligence.
Existing approaches use large language models (LLMs) to automate gripper-based
simulation generation, but they transfer poorly to dexterous manipulation,
which demands more specialized environment design. Meanwhile, dexterous
manipulation tasks are inherently more difficult due to their higher degrees of
freedom. Massively generating feasible and trainable dexterous hand tasks
remains an open challenge. To this end, we present GenDexHand, a generative
simulation pipeline that autonomously produces diverse robotic tasks and
environments for dexterous manipulation. GenDexHand introduces a closed-loop
refinement process that adjusts object placements and scales based on
vision-language model (VLM) feedback, substantially improving the average
quality of generated environments. Each task is further decomposed into
sub-tasks to enable sequential reinforcement learning, reducing training time
and increasing success rates. Our work provides a viable path toward scalable
training of diverse dexterous hand behaviors in embodied intelligence by
offering a simulation-based solution to synthetic data generation. Our website:
https://winniechen2002.github.io/GenDexHand/.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Alexander Schperberg, Yusuke Tanaka, Stefano Di Cairano, Dennis Hong
Machine LearningRobotics
ๆฅ็ๆ่ฆ
This article presents a Multi-Modal Bipedal Intelligent Urban Scout robot
(MOBIUS) capable of walking, crawling, climbing, and rolling. MOBIUS features
four limbs--two 6-DoF arms with two-finger grippers for manipulation and
climbing, and two 4-DoF legs for locomotion--enabling smooth transitions across
diverse terrains without reconfiguration. A hybrid control architecture
combines reinforcement learning-based locomotion with model-based predictive
and admittance control enhanced for safety by a Reference Governor toward
compliant contact interactions. A high-level MIQCP planner autonomously selects
locomotion modes to balance stability and energy efficiency. Hardware
experiments demonstrate robust gait transitions, dynamic climbing, and
full-body load support via pinch grasp. Overall, MOBIUS demonstrates the
importance of tight integration between morphology, high-level planning, and
control to enable mobile loco-manipulation and grasping, substantially
expanding its interaction capabilities, workspace, and traversability.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Liudi Yang, Yang Bai, Yuhao Wang, Ibrahim Alsarraj, Gitta Kutyniok et al.
Robotics
ๆฅ็ๆ่ฆ
Robotic grasping under uncertainty remains a fundamental challenge due to its
uncertain and contact-rich nature. Traditional rigid robotic hands, with
limited degrees of freedom and compliance, rely on complex model-based and
heavy feedback controllers to manage such interactions. Soft robots, by
contrast, exhibit embodied mechanical intelligence: their underactuated
structures and passive flexibility of their whole body, naturally accommodate
uncertain contacts and enable adaptive behaviors. To harness this capability,
we propose a lightweight actuation-space learning framework that infers
distributional control representations for whole-body soft robotic grasping,
directly from deterministic demonstrations using a flow matching model
(Rectified Flow),without requiring dense sensing or heavy control loops. Using
only 30 demonstrations (less than 8% of the reachable workspace), the learned
policy achieves a 97.5% grasp success rate across the whole workspace,
generalizes to grasped-object size variations of +-33%, and maintains stable
performance when the robot's dynamic response is directly adjusted by scaling
the execution time from 20% to 200%. These results demonstrate that
actuation-space learning, by leveraging its passive redundant DOFs and
flexibility, converts the body's mechanics into functional control intelligence
and substantially reduces the burden on central controllers for this
uncertain-rich task.
๐
2025-11-03
๐ NeurIPS 2025
๐ฅ Rong Li, Yuhao Dong, Tianshuai Hu, Ao Liang, Youquan Liu et al.
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Visual grounding in 3D is the key for embodied agents to localize
language-referred objects in open-world environments. However, existing
benchmarks are limited to indoor focus, single-platform constraints, and small
scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark
featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We
provide over 128,000 objects and 22,000 validated referring expressions across
diverse outdoor scenes -- 10x larger than existing datasets. We develop a
scalable annotation pipeline combining vision-language model prompting with
human verification to ensure high-quality spatial grounding. To support
cross-platform learning, we propose platform-aware normalization and
cross-modal alignment techniques, and establish benchmark protocols for
in-domain and cross-platform evaluations. Our findings reveal significant
performance gaps, highlighting the challenges and opportunities of
generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released
to advance future research in language-driven 3D embodied perception.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao et al.
MultimodalComputer Vision
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models aim to understand natural language
instructions and visual observations and to execute corresponding actions as an
embodied agent. Recent work integrates future images into the
understanding-acting loop, yielding unified VLAs that jointly understand,
generate, and act -- reading text and images and producing future images and
actions. However, these models either rely on external experts for modality
unification or treat image generation and action prediction as separate
processes, limiting the benefits of direct synergy between these tasks. Our
core philosophy is to optimize generation and action jointly through a
synchronous denoising process, where the iterative refinement enables actions
to evolve from initialization, under constant and sufficient visual guidance.
We ground this philosophy in our proposed Unified Diffusion VLA and Joint
Discrete Denoising Diffusion Process (JD3P), which is a joint diffusion process
that integrates multiple modalities into a single denoising trajectory to serve
as the key mechanism enabling understanding, generation, and acting to be
intrinsically synergistic. Our model and theory are built on a unified
tokenized space of all modalities and a hybrid attention mechanism. We further
propose a two-stage training pipeline and several inference-time techniques
that optimize performance and efficiency. Our approach achieves
state-of-the-art performance on benchmarks such as CALVIN, LIBERO, and
SimplerEnv with 4$\times$ faster inference than autoregressive methods, and we
demonstrate its effectiveness through in-depth analysis and real-world
evaluations. Our project page is available at
https://irpn-eai.github.io/UD-VLA.github.io/.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Renjun Gao, Peiyan Zhong
MultimodalNatural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have shown remarkable capabilities
in cross-modal understanding and reasoning, offering new opportunities for
intelligent assistive systems, yet existing systems still struggle with
risk-aware planning, user personalization, and grounding language plans into
executable skills in cluttered homes. We introduce MARS - a Multi-Agent Robotic
System powered by MLLMs for assistive intelligence and designed for smart home
robots supporting people with disabilities. The system integrates four agents:
a visual perception agent for extracting semantic and spatial features from
environment images, a risk assessment agent for identifying and prioritizing
hazards, a planning agent for generating executable action sequences, and an
evaluation agent for iterative optimization. By combining multimodal perception
with hierarchical multi-agent decision-making, the framework enables adaptive,
risk-aware, and personalized assistance in dynamic indoor environments.
Experiments on multiple datasets demonstrate the superior overall performance
of the proposed system in risk-aware planning and coordinated multi-agent
execution compared with state-of-the-art multimodal models. The proposed
approach also highlights the potential of collaborative AI for practical
assistive scenarios and provides a generalizable methodology for deploying
MLLM-enabled multi-agent systems in real-world environments.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wenqi Liang, Gan Sun, Yao He, Jiahua Dong, Suyan Dai et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action models (VLAs) are emerging as powerful tools for
learning generalizable visuomotor control policies. However, current VLAs are
mostly trained on large-scale image-text-action data and remain limited in two
key ways: (i) they struggle with pixel-level scene understanding, and (ii) they
rely heavily on textual prompts, which reduces their flexibility in real-world
settings. To address these challenges, we introduce PixelVLA, the first VLA
model designed to support both pixel-level reasoning and multimodal prompting
with text and visual inputs. Our approach is built on a new visuomotor
instruction tuning framework that integrates a multiscale pixel-aware encoder
with a visual prompting encoder. To train PixelVLA effectively, we further
propose a two-stage automated annotation pipeline that generates Pixel-160K, a
large-scale dataset with pixel-level annotations derived from existing robot
data. Experiments on three standard VLA benchmarks and two VLA model variants
show that PixelVLA improves manipulation success rates by 10.1%-17.8% over
OpenVLA, while requiring only 1.5% of its pretraining cost. These results
demonstrate that PixelVLA can be integrated into existing VLAs to enable more
accurate, efficient, and versatile robot control in complex environments. The
dataset and code will be released as open source.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Shipeng Lyu, Lijie Sheng, Fangyuan Wang, Wenyao Zhang, Weiwei Lin et al.
Robotics
ๆฅ็ๆ่ฆ
Humans naturally grasp objects with minimal level required force for
stability, whereas robots often rely on rigid, over-squeezing control. To
narrow this gap, we propose a human-inspired physics-conditioned tactile method
(Phy-Tac) for force-optimal stable grasping (FOSG) that unifies pose selection,
tactile prediction, and force regulation. A physics-based pose selector first
identifies feasible contact regions with optimal force distribution based on
surface geometry. Then, a physics-conditioned latent diffusion model (Phy-LDM)
predicts the tactile imprint under FOSG target. Last, a latent-space LQR
controller drives the gripper toward this tactile imprint with minimal
actuation, preventing unnecessary compression. Trained on a physics-conditioned
tactile dataset covering diverse objects and contact conditions, the proposed
Phy-LDM achieves superior tactile prediction accuracy, while the Phy-Tac
outperforms fixed-force and GraspNet-based baselines in grasp stability and
force efficiency. Experiments on classical robotic platforms demonstrate
force-efficient and adaptive manipulation that bridges the gap between robotic
and human grasping.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mengtan Zhang, Zizhan Guo, Hongbo Zhao, Yi Feng, Zuyi Xiong et al.
ๆฅ็ๆ่ฆ
Unsupervised learning of depth and ego-motion, two fundamental 3D perception
tasks, has made significant strides in recent years. However, most methods
treat ego-motion as an auxiliary task, either mixing all motion types or
excluding depth-independent rotational motions in supervision. Such designs
limit the incorporation of strong geometric constraints, reducing reliability
and robustness under diverse conditions. This study introduces a discriminative
treatment of motion components, leveraging the geometric regularities of their
respective rigid flows to benefit both depth and ego-motion estimation. Given
consecutive video frames, network outputs first align the optical axes and
imaging planes of the source and target cameras. Optical flows between frames
are transformed through these alignments, and deviations are quantified to
impose geometric constraints individually on each ego-motion component,
enabling more targeted refinement. These alignments further reformulate the
joint learning process into coaxial and coplanar forms, where depth and each
translation component can be mutually derived through closed-form geometric
relationships, introducing complementary constraints that improve depth
robustness. DiMoDE, a general depth and ego-motion joint learning framework
incorporating these designs, achieves state-of-the-art performance on multiple
public datasets and a newly collected diverse real-world dataset, particularly
under challenging conditions. Our source code will be publicly available at
mias.group/DiMoDE upon publication.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yufeng Jin, Niklas Funk, Vignesh Prasad, Zechu Li, Mathias Franzius et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Object pose estimation is a fundamental problem in robotics and computer
vision, yet it remains challenging due to partial observability, occlusions,
and object symmetries, which inevitably lead to pose ambiguity and multiple
hypotheses consistent with the same observation. While deterministic deep
networks achieve impressive performance under well-constrained conditions, they
are often overconfident and fail to capture the multi-modality of the
underlying pose distribution. To address these challenges, we propose a novel
probabilistic framework that leverages flow matching on the SE(3) manifold for
estimating 6D object pose distributions. Unlike existing methods that regress a
single deterministic output, our approach models the full pose distribution
with a sample-based estimate and enables reasoning about uncertainty in
ambiguous cases such as symmetric objects or severe occlusions. We achieve
state-of-the-art results on Real275, YCB-V, and LM-O, and demonstrate how our
sample-based pose estimates can be leveraged in downstream robotic manipulation
tasks such as active perception for disambiguating uncertain viewpoints or
guiding grasp synthesis in an uncertainty-aware manner.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Wei Huang, Jiaxin Li, Zang Wan, Huijun Di, Wei Liang et al.
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
Guiding an agent to a specific target in indoor environments based solely on
RGB inputs and a floor plan is a promising yet challenging problem. Although
existing methods have made significant progress, two challenges remain
unresolved. First, the modality gap between egocentric RGB observations and the
floor plan hinders the integration of visual and spatial information for both
local obstacle avoidance and global planning. Second, accurate localization is
critical for navigation performance, but remains challenging at deployment in
unseen environments due to the lack of explicit geometric alignment between RGB
inputs and floor plans. We propose a novel diffusion-based policy, denoted as
GlocDiff, which integrates global path planning from the floor plan with local
depth-aware features derived from RGB observations. The floor plan offers
explicit global guidance, while the depth features provide implicit geometric
cues, collectively enabling precise prediction of optimal navigation directions
and robust obstacle avoidance. Moreover, GlocDiff introduces noise perturbation
during training to enhance robustness against pose estimation errors, and we
find that combining this with a relatively stable VO module during inference
results in significantly improved navigation performance. Extensive experiments
on the FloNa benchmark demonstrate GlocDiff's efficiency and effectiveness in
achieving superior navigation performance, and the success of real-world
deployments also highlights its potential for widespread practical
applications.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Cankut Bora Tuncer, Marc Toussaint, Ozgur S. Oguz
Robotics
ๆฅ็ๆ่ฆ
In this work, we introduce MO-SeGMan, a Multi-Objective Sequential and Guided
Manipulation planner for highly constrained rearrangement problems. MO-SeGMan
generates object placement sequences that minimize both replanning per object
and robot travel distance while preserving critical dependency structures with
a lazy evaluation method. To address highly cluttered, non-monotone scenarios,
we propose a Selective Guided Forward Search (SGFS) that efficiently relocates
only critical obstacles and to feasible relocation points. Furthermore, we
adopt a refinement method for adaptive subgoal selection to eliminate
unnecessary pick-and-place actions, thereby improving overall solution quality.
Extensive evaluations on nine benchmark rearrangement tasks demonstrate that
MO-SeGMan generates feasible motion plans in all cases, consistently achieving
faster solution times and superior solution quality compared to the baselines.
These results highlight the robustness and scalability of the proposed
framework for complex rearrangement planning problems.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Sarthak Mishra, Rishabh Dev Yadav, Avirup Das, Saksham Gupta, Wei Pan et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The rapid progress of vision--language models (VLMs) has sparked growing
interest in robotic control, where natural language can express the operation
goals while visual feedback links perception to action. However, directly
deploying VLM-driven policies on aerial manipulators remains unsafe and
unreliable since the generated actions are often inconsistent,
hallucination-prone, and dynamically infeasible for flight. In this work, we
present AERMANI-VLM, the first framework to adapt pretrained VLMs for aerial
manipulation by separating high-level reasoning from low-level control, without
any task-specific fine-tuning. Our framework encodes natural language
instructions, task context, and safety constraints into a structured prompt
that guides the model to generate a step-by-step reasoning trace in natural
language. This reasoning output is used to select from a predefined library of
discrete, flight-safe skills, ensuring interpretable and temporally consistent
execution. By decoupling symbolic reasoning from physical action, AERMANI-VLM
mitigates hallucinated commands and prevents unsafe behavior, enabling robust
task completion. We validate the framework in both simulation and hardware on
diverse multi-step pick-and-place tasks, demonstrating strong generalization to
previously unseen commands, objects, and environments.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Elian Neppel, Shamistan Karimov, Ashutosh Mishra, Gustavo Hernan Diaz Huenupan, Hazal Gozbasi et al.
Robotics
ๆฅ็ๆ่ฆ
This paper presents the software architecture and deployment strategy behind
the MoonBot platform: a modular space robotic system composed of heterogeneous
components distributed across multiple computers, networks and ultimately
celestial bodies. We introduce a principled approach to distributed,
heterogeneous modularity, extending modular robotics beyond physical
reconfiguration to software, communication and orchestration. We detail the
architecture of our system that integrates component-based design, a
data-oriented communication model using ROS2 and Zenoh, and a deployment
orchestrator capable of managing complex multi-module assemblies. These
abstractions enable dynamic reconfiguration, decentralized control, and
seamless collaboration between numerous operators and modules. At the heart of
this system lies our open-source Motion Stack software, validated by months of
field deployment with self-assembling robots, inter-robot cooperation, and
remote operation. Our architecture tackles the significant hurdles of modular
robotics by significantly reducing integration and maintenance overhead, while
remaining scalable and robust. Although tested with space in mind, we propose
generalizable patterns for designing robotic systems that must scale across
time, hardware, teams and operational environments.
๐
2025-11-03
๐ IROS 2025
๐ฅ Paolo Rabino, Gabriele Tiboni, Tatiana Tommasi
Machine Learning
ๆฅ็ๆ่ฆ
Object-Centric Motion Generation (OCMG) is instrumental in advancing
automated manufacturing processes, particularly in domains requiring
high-precision expert robotic motions, such as spray painting and welding. To
realize effective automation, robust algorithms are essential for generating
extended, object-aware trajectories across intricate 3D geometries. However,
contemporary OCMG techniques are either based on ad-hoc heuristics or employ
learning-based pipelines that are still reliant on sensitive post-processing
steps to generate executable paths. We introduce FoldPath, a novel, end-to-end,
neural field based method for OCMG. Unlike prior deep learning approaches that
predict discrete sequences of end-effector waypoints, FoldPath learns the robot
motion as a continuous function, thus implicitly encoding smooth output paths.
This paradigm shift eliminates the need for brittle post-processing steps that
concatenate and order the predicted discrete waypoints. Particularly, our
approach demonstrates superior predictive performance compared to recently
proposed learning-based methods, and attains generalization capabilities even
in real industrial settings, where only a limited amount of 70 expert samples
are provided. We validate FoldPath through comprehensive experiments in a
realistic simulation environment and introduce new, rigorous metrics designed
to comprehensively evaluate long-horizon robotic paths, thus advancing the OCMG
task towards practical maturity.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Landson Guo, Andres M. Diaz Aguilar, William Talbot, Turcan Tuna, Marco Hutter et al.
Robotics
ๆฅ็ๆ่ฆ
Accurate point-wise velocity estimation in 3D is crucial for robot
interaction with non-rigid, dynamic agents, such as humans, enabling robust
performance in path planning, collision avoidance, and object manipulation in
dynamic environments. To this end, this paper proposes a novel RADAR, LiDAR,
and camera fusion pipeline for point-wise 3D velocity estimation named CaRLi-V.
This pipeline leverages raw RADAR measurements to create a novel RADAR
representation, the velocity cube, which densely represents radial velocities
within the RADAR's field-of-view. By combining the velocity cube for radial
velocity extraction, optical flow for tangential velocity estimation, and LiDAR
for point-wise range measurements through a closed-form solution, our approach
can produce 3D velocity estimates for a dense array of points. Developed as an
open-source ROS2 package, CaRLi-V has been field-tested against a custom
dataset and proven to produce low velocity error metrics relative to ground
truth, enabling point-wise velocity estimation for robotic applications.
๐
2025-11-03
๐ ICRA
๐ฅ Hitesh Kyatham, Arjun Suresh, Aadi Palnitkar, Yiannis Aloimonos
RoboticsComputer Vision
ๆฅ็ๆ่ฆ
The underwater domain presents a vast array of challenges for roboticists and
computer vision researchers alike, such as poor lighting conditions and high
dynamic range scenes. In these adverse conditions, traditional vision
techniques struggle to adapt and lead to suboptimal performance. Event-based
cameras present an attractive solution to this problem, mitigating the issues
of traditional cameras by tracking changes in the footage on a frame-by-frame
basis. In this paper, we introduce a pipeline which can be used to generate
realistic synthetic data of an event-based camera mounted to an AUV (Autonomous
Underwater Vehicle) in an underwater environment for training vision models. We
demonstrate the effectiveness of our pipeline using the task of rock detection
with poor visibility and suspended particulate matter, but the approach can be
generalized to other underwater tasks.
๐
2025-11-03
๐ IROS 2025
๐ฅ Kun Hu, Menggang Li, Zhiwen Jin, Chaoquan Tang, Eryi Hu et al.
MultimodalRobotics
ๆฅ็ๆ่ฆ
Simultaneous Localization and Mapping (SLAM) in large-scale, complex, and
GPS-denied underground coal mine environments presents significant challenges.
Sensors must contend with abnormal operating conditions: GPS unavailability
impedes scene reconstruction and absolute geographic referencing, uneven or
slippery terrain degrades wheel odometer accuracy, and long, feature-poor
tunnels reduce LiDAR effectiveness. To address these issues, we propose
CoalMine-LiDAR-IMU-UWB-Wheel-Odometry (CM-LIUW-Odometry), a multimodal SLAM
framework based on the Iterated Error-State Kalman Filter (IESKF). First,
LiDAR-inertial odometry is tightly fused with UWB absolute positioning
constraints to align the SLAM system with a global coordinate. Next, wheel
odometer is integrated through tight coupling, enhanced by nonholonomic
constraints (NHC) and vehicle lever arm compensation, to address performance
degradation in areas beyond UWB measurement range. Finally, an adaptive motion
mode switching mechanism dynamically adjusts the robot's motion mode based on
UWB measurement range and environmental degradation levels. Experimental
results validate that our method achieves superior accuracy and robustness in
real-world underground coal mine scenarios, outperforming state-of-the-art
approaches. We open source our code of this work on Github to benefit the
robotics community.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Luis Diener, Jens Kalkkuhl, Markus Enzweiler
ๆฅ็ๆ่ฆ
Automated parking requires accurate localization for quick and precise
maneuvering in tight spaces. While the longitudinal velocity can be measured
using wheel encoders, the estimation of the lateral velocity remains a key
challenge due to the absence of dedicated sensors in consumer-grade vehicles.
Existing approaches often rely on simplified vehicle models, such as the
zero-slip model, which assumes no lateral velocity at the rear axle. It is well
established that this assumption does not hold during low-speed driving and
researchers thus introduce additional heuristics to account for differences. In
this work, we analyze real-world data from parking scenarios and identify a
systematic deviation from the zero-slip assumption. We provide explanations for
the observed effects and then propose a lateral velocity model that better
captures the lateral dynamics of the vehicle during parking. The model improves
estimation accuracy, while relying on only two parameters, making it
well-suited for integration into consumer-grade applications.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Maartje H. M. Wermelink, Renate Sachse, Sebastian Kruppert, Thomas Speck, Falk J. Tauber
ๆฅ็ๆ่ฆ
The Venus flytrap (Dionaea muscipula) does not only serve as the textbook
model for a carnivorous plant, but also has long intrigued both botanists and
engineers with its rapidly closing leaf trap. The trap closure is triggered by
two consecutive touches of a potential prey, after which the lobes rapidly
switch from their concave open-state to their convex close-state and catch the
prey within 100-500 ms after being triggered. This transformation from concave
to convex is initiated by changes in turgor pressure and the release of stored
elastic energy from prestresses in the concave state, which accelerate this
movement, leading to inversion of the lobes bi-axial curvature. Possessing two
low-energy states, the leaves can be characterized as bistable systems. With
our research, we seek to deepen the understanding of Venus flytrap motion
mechanics and apply its principles to the design of an artificial bistable lobe
actuator. We identified geometrical characteristics, such as dimensional ratios
and the thickness gradient in the lobe, and transferred these to two 3D-printed
bistable actuator models. One actuator parallels the simulated geometry of a
Venus flytrap leaf, the other is a lobe model designed with CAD. Both models
display concave-convex bi-stability and snap close. These demonstrators are the
first step in the development of an artificial Venus flytrap that mimics the
mechanical behavior of the biological model and can be used as a soft fast
gripper.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Riddhi Das, Joscha Teichmann, Thomas Speck, Falk J. Tauber
Robotics
ๆฅ็ๆ่ฆ
Soft robotic systems have gained widespread attention due to their inherent
flexibility, adaptability, and safety, making them well-suited for varied
applications. Among bioinspired designs, earthworm locomotion has been
extensively studied for its efficient peristaltic motion, enabling movement in
confined and unstructured environments. Existing earthworm-inspired robots
primarily utilize pneumatic actuation due to its high force-to-weight ratio and
ease of implementation. However, these systems often rely on bulky,
power-intensive electronic control units, limiting their practicality. In this
work, we present an electronics-free, earthworm-inspired pneumatic robot
utilizing a modified Pneumatic Logic Gate (PLG) design. By integrating
preconfigured PLG units with bellow actuators, we achieved a plug-and-play
style modular system capable of peristaltic locomotion without external
electronic components. The proposed design reduces system complexity while
maintaining efficient actuation. We characterize the bellow actuators under
different operating conditions and evaluate the robots locomotion performance.
Our findings demonstrate that the modified PLG-based control system effectively
generates peristaltic wave propagation, achieving autonomous motion with
minimal deviation. This study serves as a proof of concept for the development
of electronics-free, peristaltic soft robots. The proposed system has potential
for applications in hazardous environments, where untethered, adaptable
locomotion is critical. Future work will focus on further optimizing the robot
design and exploring untethered operation using onboard compressed air sources.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Shun Yoshida, Qingchuan Song, Bastian E. Rapp, Thomas Speck, Falk J. Tauber
Robotics
ๆฅ็ๆ่ฆ
Despite their often perceived static and slow nature, some plants can move
faster than the blink of an eye. The rapid snap closure motion of the Venus
flytrap (Dionaea muscipula) has long captivated the interest of researchers and
engineers alike, serving as a model for plant-inspired soft machines and
robots. The translation of the fast snapping closure has inspired the
development of various artificial Venus flytrap (AVF) systems. However,
translating both the closing and reopening motion of D. muscipula into an
autonomous plant inspired soft machine has yet to be achieved. In this study,
we present an AVF that autonomously closes and reopens, utilizing novel
thermo-responsive UV-curable shape memory materials for soft robotic systems.
The life-sized thermo-responsive AVF exhibits closing and reopening motions
triggered in a naturally occurring temperature range. The doubly curved trap
lobes, built from shape memory polymers, close at 38{\deg}C, while reopening
initiates around 45{\deg}C, employing shape memory elastomer strips as
antagonistic actuators to facilitate lobe reopening. This work represents the
first demonstration of thermo-responsive closing and reopening in an AVF with
programmed sequential motion in response to increasing temperature. This
approach marks the next step toward autonomously bidirectional moving soft
machines/robots.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Ling Niu, Xiaoji Zheng, Han Wang, Chen Zheng, Ziyuan Yang et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
In recent years, vision-based end-to-end autonomous driving has emerged as a
new paradigm. However, popular end-to-end approaches typically rely on visual
feature extraction networks trained under label supervision. This limited
supervision framework restricts the generality and applicability of driving
models. In this paper, we propose a novel paradigm termed $E^{3}AD$, which
advocates for comparative learning between visual feature extraction networks
and the general EEG large model, in order to learn latent human driving
cognition for enhancing end-to-end planning. In this work, we collected a
cognitive dataset for the mentioned contrastive learning process. Subsequently,
we investigated the methods and potential mechanisms for enhancing end-to-end
planning with human driving cognition, using popular driving models as
baselines on publicly available autonomous driving datasets. Both open-loop and
closed-loop tests are conducted for a comprehensive evaluation of planning
performance. Experimental results demonstrate that the $E^{3}AD$ paradigm
significantly enhances the end-to-end planning performance of baseline models.
Ablation studies further validate the contribution of driving cognition and the
effectiveness of comparative learning process. To the best of our knowledge,
this is the first work to integrate human driving cognition for improving
end-to-end autonomous driving planning. It represents an initial attempt to
incorporate embodied cognitive data into end-to-end autonomous driving,
providing valuable insights for future brain-inspired autonomous driving
systems. Our code will be made available at Github
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Hongyin Zhang, Shuo Zhang, Junxi Jin, Qixin Zeng, Runze Li et al.
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Vision-Language-Action (VLA) models have recently emerged as powerful
general-purpose policies for robotic manipulation, benefiting from large-scale
multi-modal pre-training. However, they often fail to generalize reliably in
out-of-distribution deployments, where unavoidable disturbances such as
observation noise, sensor errors, or actuation perturbations become prevalent.
While recent Reinforcement Learning (RL)-based post-training provides a
practical means to adapt pre-trained VLA models, existing methods mainly
emphasize reward maximization and overlook robustness to environmental
uncertainty. In this work, we introduce RobustVLA, a lightweight online RL
post-training method designed to explicitly enhance the resilience of VLA
models. Through a systematic robustness analysis, we identify two key
regularizations: Jacobian regularization, which mitigates sensitivity to
observation noise, and smoothness regularization, which stabilizes policies
under action perturbations. Extensive experiments across diverse robotic
environments demonstrate that RobustVLA significantly outperforms prior
state-of-the-art methods in robustness and reliability. Our results highlight
the importance of principled robustness-aware RL post-training as a key step
toward improving the reliability and robustness of VLA models.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Jiawei Wang, Dingyou Wang, Jiaming Hu, Qixuan Zhang, Jingyi Yu et al.
Machine Learning
ๆฅ็ๆ่ฆ
A deep understanding of kinematic structures and movable components is
essential for enabling robots to manipulate objects and model their own
articulated forms. Such understanding is captured through articulated objects,
which are essential for tasks such as physical simulation, motion planning, and
policy learning. However, creating these models, particularly for objects with
high degrees of freedom (DoF), remains a significant challenge. Existing
methods typically rely on motion sequences or strong assumptions from
hand-curated datasets, which hinders scalability. In this paper, we introduce
Kinematify, an automated framework that synthesizes articulated objects
directly from arbitrary RGB images or textual descriptions. Our method
addresses two core challenges: (i) inferring kinematic topologies for high-DoF
objects and (ii) estimating joint parameters from static geometry. To achieve
this, we combine MCTS search for structural inference with geometry-driven
optimization for joint reasoning, producing physically consistent and
functionally valid descriptions. We evaluate Kinematify on diverse inputs from
both synthetic and real-world environments, demonstrating improvements in
registration and kinematic topology accuracy over prior work.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Bixuan Zhang, Fengqi Zhang, Haojie Chen, You Wang, Jie Hao et al.
Robotics
ๆฅ็ๆ่ฆ
This paper designs a new spherical robot structure capable of supporting
high-speed motion at up to 10 m/s. Building upon a single-pendulum-driven
spherical robot, the design incorporates a momentum wheel with an axis aligned
with the secondary pendulum, creating a novel spherical robot structure.
Practical experiments with the physical prototype have demonstrated that this
new spherical robot can achieve stable high-speed motion through simple
decoupled control, which was unattainable with the original structure. The
spherical robot designed for high-speed motion not only increases speed but
also significantly enhances obstacle-crossing performance and terrain
robustness.
๐
2025-11-03
๐ Accepted by IEEE Robio 2025
๐ฅ Yupu Lu, Shijie Lin, Hao Xu, Zeqing Zhang, Jia Pan
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Finding a control Lyapunov function (CLF) in a dynamical system with a
controller is an effective way to guarantee stability, which is a crucial issue
in safety-concerned applications. Recently, deep learning models representing
CLFs have been applied into a learner-verifier framework to identify
satisfiable candidates. However, the learner treats Lyapunov conditions as
complex constraints for optimisation, which is hard to achieve global
convergence. It is also too complicated to implement these Lyapunov conditions
for verification. To improve this framework, we treat Lyapunov conditions as
inductive biases and design a neural CLF and a CLF-based controller guided by
this knowledge. This design enables a stable optimisation process with limited
constraints, and allows end-to-end learning of both the CLF and the controller.
Our approach achieves a higher convergence rate and larger region of attraction
(ROA) in learning the CLF compared to existing methods among abundant
experiment cases. We also thoroughly reveal why the success rate decreases with
previous methods during learning.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Yiyao Ma, Kai Chen, Kexin Zheng, Qi Dou
Robotics
ๆฅ็ๆ่ฆ
Dexterous grasp generation is a fundamental challenge in robotics, requiring
both grasp stability and adaptability across diverse objects and tasks.
Analytical methods ensure stable grasps but are inefficient and lack task
adaptability, while generative approaches improve efficiency and task
integration but generalize poorly to unseen objects and tasks due to data
limitations. In this paper, we propose a transfer-based framework for dexterous
grasp generation, leveraging a conditional diffusion model to transfer
high-quality grasps from shape templates to novel objects within the same
category. Specifically, we reformulate the grasp transfer problem as the
generation of an object contact map, incorporating object shape similarity and
task specifications into the diffusion process. To handle complex shape
variations, we introduce a dual mapping mechanism, capturing intricate
geometric relationship between shape templates and novel objects. Beyond the
contact map, we derive two additional object-centric maps, the part map and
direction map, to encode finer contact details for more stable grasps. We then
develop a cascaded conditional diffusion model framework to jointly transfer
these three maps, ensuring their intra-consistency. Finally, we introduce a
robust grasp recovery mechanism, identifying reliable contact points and
optimizing grasp configurations efficiently. Extensive experiments demonstrate
the superiority of our proposed method. Our approach effectively balances grasp
quality, generation efficiency, and generalization performance across various
tasks. Project homepage: https://cmtdiffusion.github.io/
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Sehui Jeong, Magaly C. Aviles, Athena X. Naylor, Cynthia Sung, Allison M. Okamura
Robotics
ๆฅ็ๆ่ฆ
Soft robots employing compliant materials and deformable structures offer
great potential for wearable devices that are comfortable and safe for human
interaction. However, achieving both structural integrity and compliance for
comfort remains a significant challenge. In this study, we present a novel
fabrication and design method that combines the advantages of origami
structures with the material programmability and wearability of knitted
fabrics. We introduce a general design method that translates origami patterns
into knit designs by programming both stitch and material patterns. The method
creates folds in preferred directions while suppressing unintended buckling and
bending by selectively incorporating heat fusible yarn to create rigid panels
around compliant creases. We experimentally quantify folding moments and show
that stitch patterning enhances folding directionality while the heat fusible
yarn (1) keeps geometry consistent by reducing edge curl and (2) prevents
out-of-plane deformations by stiffening panels. We demonstrate the framework
through the successful reproduction of complex origami tessellations, including
Miura-ori, Yoshimura, and Kresling patterns, and present a wearable knitted
Kaleidocycle robot capable of locomotion. The combination of structural
reconfigurability, material programmability, and potential for manufacturing
scalability highlights knitted origami as a promising platform for
next-generation wearable robotics.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Yasamin Foroutani, Yasamin Mousavi-Motlagh, Aya Barzelay, Tsu-Chin Tsao
Robotics
ๆฅ็ๆ่ฆ
Achieving precise control of robotic tool paths is often challenged by
inherent system misalignments, unmodeled dynamics, and actuation inaccuracies.
This work introduces an Iterative Learning Control (ILC) strategy to enable
precise rotational insertion of a tool during robotic surgery, improving
penetration efficacy and safety compared to straight insertion tested in
subretinal injection. A 4 degree of freedom (DOF) robot manipulator is used,
where misalignment of the fourth joint complicates the simple application of
needle rotation, motivating an ILC approach that iteratively adjusts joint
commands based on positional feedback. The process begins with calibrating the
forward kinematics for the chosen surgical tool to achieve higher accuracy,
followed by successive ILC iterations guided by Optical Coherence Tomography
(OCT) volume scans to measure the error and refine control inputs. Experimental
results, tested on subretinal injection tasks on ex vivo pig eyes, show that
the optimized trajectory resulted in higher success rates in tissue penetration
and subretinal injection compared to straight insertion, demonstrating the
effectiveness of ILC in overcoming misalignment challenges. This approach
offers potential applications for other high precision robot tasks requiring
controlled insertions as well.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Junwen Zhang, Changyue Liu, Pengqi Fu, Xiang Guo, Ye Shi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Endowed with inherent dynamical properties that grant them remarkable
ruggedness and adaptability, spherical tensegrity robots stand as prototypical
examples of hybrid softrigid designs and excellent mobile platforms. However,
path planning for these robots in unknown environments presents a significant
challenge, requiring a delicate balance between efficient exploration and
robust planning. Traditional path planners, which treat the environment as a
geometric grid, often suffer from redundant searches and are prone to failure
in complex scenarios due to their lack of semantic understanding. To overcome
these limitations, we reframe path planning in unknown environments as a
semantic reasoning task. We introduce a Semantic Agent for Tensegrity robots
(SATPlanner) driven by a Large Language Model (LLM). SATPlanner leverages
high-level environmental comprehension to generate efficient and reliable
planning strategies.At the core of SATPlanner is an Adaptive Observation Window
mechanism, inspired by the "fast" and "slow" thinking paradigms of LLMs. This
mechanism dynamically adjusts the perceptual field of the agent: it narrows for
rapid traversal of open spaces and expands to reason about complex obstacle
configurations. This allows the agent to construct a semantic belief of the
environment, enabling the search space to grow only linearly with the path
length (O(L)) while maintaining path quality. We extensively evaluate
SATPlanner in 1,000 simulation trials, where it achieves a 100% success rate,
outperforming other real-time planning algorithms. Critically, SATPlanner
reduces the search space by 37.2% compared to the A* algorithm while achieving
comparable, near-optimal path lengths. Finally, the practical feasibility of
SATPlanner is validated on a physical spherical tensegrity robot prototype.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Yu-Ting Lai, Jacob Rosen, Yasamin Foroutani, Ji Ma, Wen-Cheng Wu et al.
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Despite the extensive demonstration of robotic systems for both cataract and
vitreoretinal procedures, existing technologies or mechanisms still possess
insufficient accuracy, precision, and degrees of freedom for instrument
manipulation or potentially automated tool exchange during surgical procedures.
A new robotic system that focuses on improving tooltip accuracy, tracking
performance, and smooth instrument exchange mechanism is therefore designed and
manufactured. Its tooltip accuracy, precision, and mechanical capability of
maintaining small incision through remote center of motion were externally
evaluated using an optical coherence tomography (OCT) system. Through robot
calibration and precise coordinate registration, the accuracy of tooltip
positioning was measured to be 0.053$\pm$0.031 mm, and the overall performance
was demonstrated on an OCT-guided automated cataract lens extraction procedure
with deep learning-based pre-operative anatomical modeling and real-time
supervision.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Chengmeng Li, Yaxin Peng
MultimodalMachine LearningRobotics
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models have significantly advanced robotic
learning, enabling training on large-scale, cross-embodiment data and
fine-tuning for specific robots. However, state-of-the-art autoregressive VLAs
struggle with multi-robot collaboration. We introduce embodiment transfer
learning, denoted as ET-VLA, a novel framework for efficient and effective
transfer of pre-trained VLAs to multi-robot. ET-VLA's core is Synthetic
Continued Pretraining (SCP), which uses synthetically generated data to warm up
the model for the new embodiment, bypassing the need for real human
demonstrations and reducing data collection costs. SCP enables the model to
learn correct actions and precise action token numbers. Following SCP, the
model is fine-tuned on target embodiment data. To further enhance the model
performance on multi-embodiment, we present the Embodied Graph-of-Thought
technique, a novel approach that formulates each sub-task as a node, that
allows the VLA model to distinguish the functionalities and roles of each
embodiment during task execution. Our work considers bimanual robots, a simple
version of multi-robot to verify our approaches. We validate the effectiveness
of our method on both simulation benchmarks and real robots covering three
different bimanual embodiments. In particular, our proposed ET-VLA \space can
outperform OpenVLA on six real-world tasks over 53.2%. We will open-source all
codes to support the community in advancing VLA models for robot learning.
๐
2025-11-03
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zahra Mehraban, Sebastien Glaser, Michael Milford, Ronald Schroeter
Machine LearningRobotics
ๆฅ็ๆ่ฆ
Domain adaptation is required for automated driving models to generalize well
across diverse road conditions. This paper explores a training method for
domain adaptation to adapt PilotNet, an end-to-end deep learning-based model,
for left-hand driving conditions using real-world Australian highway data. Four
training methods were evaluated: (1) a baseline model trained on U.S.
right-hand driving data, (2) a model trained on flipped U.S. data, (3) a model
pretrained on U.S. data and then fine-tuned on Australian highways, and (4) a
model pretrained on flipped U.S. data and then finetuned on Australian
highways. This setup examines whether incorporating flipped data enhances the
model adaptation by providing an initial left-hand driving alignment. The paper
compares model performance regarding steering prediction accuracy and
attention, using saliency-based analysis to measure attention shifts across
significant road regions. Results show that pretraining on flipped data alone
worsens prediction stability due to misaligned feature representations, but
significantly improves adaptation when followed by fine-tuning, leading to
lower prediction error and stronger focus on left-side cues. To validate this
approach across different architectures, the same experiments were done on
ResNet, which confirmed similar adaptation trends. These findings emphasize the
importance of preprocessing techniques, such as flipped-data pretraining,
followed by fine-tuning to improve model adaptation with minimal retraining
requirements.
๐
2025-11-03
๐ 10 pages, 8 figures. This work has been submitted
๐ฅ Muhua Zhang, Lei Ma, Ying Wu, Kai Shen, Deqing Huang et al.
ๆฅ็ๆ่ฆ
This paper addresses the Kidnapped Robot Problem (KRP), a core localization
challenge of relocalizing a robot in a known map without prior pose estimate
when localization loss or at SLAM initialization. For this purpose, a passive
2-D global relocalization framework is proposed. It estimates the global pose
efficiently and reliably from a single LiDAR scan and an occupancy grid map
while the robot remains stationary, thereby enhancing the long-term autonomy of
mobile robots. The proposed framework casts global relocalization as a
non-convex problem and solves it via the multi-hypothesis scheme with batched
multi-stage inference and early termination, balancing completeness and
efficiency. The Rapidly-exploring Random Tree (RRT), under traversability
constraints, asymptotically covers the reachable space to generate sparse,
uniformly distributed feasible positional hypotheses, fundamentally reducing
the sampling space. The hypotheses are preliminarily ordered by the proposed
Scan Mean Absolute Difference (SMAD), a coarse beam-error level metric that
facilitates the early termination by prioritizing high-likelihood candidates.
The SMAD computation is optimized for non-panoramic scans. And the
Translation-Affinity Scan-to-Map Alignment Metric (TAM) is proposed for
reliable orientation selection at hypothesized positions and accurate final
pose evaluation to mitigate degradation in conventional likelihood-field
metrics under translational uncertainty induced by sparse hypotheses, as well
as non-panoramic LiDAR scan and environmental changes. Real-world experiments
on a resource-constrained mobile robot with non-panoramic LiDAR scan
demonstrate that the proposed framework outperforms existing methods in both
global relocalization success rate and computational efficiency.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Max McCandless, Jonathan Hamid, Sammy Elmariah, Nathaniel Langer, Pierre E. Dupont
Robotics
ๆฅ็ๆ่ฆ
To move away from open-heart surgery towards safer transcatheter procedures,
there is a growing need for improved imaging techniques and robotic solutions
to enable simple, accurate tool navigation. Common imaging modalities, such as
fluoroscopy and ultrasound, have limitations that can be overcome using
cardioscopy, i.e., direct optical visualization inside the beating heart. We
present a cardioscope designed as a steerable balloon. As a balloon, it can be
collapsed to pass through the vasculature and subsequently inflated inside the
heart for visualization and tool delivery through an integrated working
channel. Through careful design of balloon wall thickness, a single input,
balloon inflation pressure, is used to independently control two outputs,
balloon diameter (corresponding to field of view diameter) and balloon bending
angle (enabling precise working channel positioning). This balloon technology
can be tuned to produce cardioscopes designed for a range of intracardiac
tasks. To illustrate this approach, a balloon design is presented for the
specific task of aortic leaflet laceration. Image-based closed-loop control of
bending angle is also demonstrated as a means of enabling stable orientation
control during tool insertion and removal.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Lijie Wang, Lianjie Guo, Ziyi Xu, Qianhao Wang, Fei Gao et al.
Robotics
ๆฅ็ๆ่ฆ
Reconstructing large-scale colored point clouds is an important task in
robotics, supporting perception, navigation, and scene understanding. Despite
advances in LiDAR inertial visual odometry (LIVO), its performance remains
highly sensitive to extrinsic calibration. Meanwhile, 3D vision foundation
models, such as VGGT, suffer from limited scalability in large environments and
inherently lack metric scale. To overcome these limitations, we propose
LiDAR-VGGT, a novel framework that tightly couples LiDAR inertial odometry with
the state-of-the-art VGGT model through a two-stage coarse- to-fine fusion
pipeline: First, a pre-fusion module with robust initialization refinement
efficiently estimates VGGT poses and point clouds with coarse metric scale
within each session. Then, a post-fusion module enhances cross-modal 3D
similarity transformation, using bounding-box-based regularization to reduce
scale distortions caused by inconsistent FOVs between LiDAR and camera sensors.
Extensive experiments across multiple datasets demonstrate that LiDAR-VGGT
achieves dense, globally consistent colored point clouds and outperforms both
VGGT-based methods and LIVO baselines. The implementation of our proposed novel
color point cloud evaluation toolkit will be released as open source.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Zihao He, Bo Ai, Tongzhou Mu, Yulin Liu, Weikang Wan et al.
Robotics
ๆฅ็ๆ่ฆ
Cross-embodiment learning seeks to build generalist robots that operate
across diverse morphologies, but differences in action spaces and kinematics
hinder data sharing and policy transfer. This raises a central question: Is
there any invariance that allows actions to transfer across embodiments? We
conjecture that environment dynamics are embodiment-invariant, and that world
models capturing these dynamics can provide a unified interface across
embodiments. To learn such a unified world model, the crucial step is to design
state and action representations that abstract away embodiment-specific details
while preserving control relevance. To this end, we represent different
embodiments (e.g., human hands and robot hands) as sets of 3D particles and
define actions as particle displacements, creating a shared representation for
heterogeneous data and control problems. A graph-based world model is then
trained on exploration data from diverse simulated robot hands and real human
hands, and integrated with model-based planning for deployment on novel
hardware. Experiments on rigid and deformable manipulation tasks reveal three
findings: (i) scaling to more training embodiments improves generalization to
unseen ones, (ii) co-training on both simulated and real data outperforms
training on either alone, and (iii) the learned models enable effective control
on robots with varied degrees of freedom. These results establish world models
as a promising interface for cross-embodiment dexterous manipulation.
๐
2025-11-03
๐ ArXiv Preprint (Robotics)
๐ฅ Dong Heon Han, Mayank Mehta, Runze Zuo, Zachary Wanger, Daniel Bruder
ๆฅ็ๆ่ฆ
This study presents an enhanced proprioceptive method for accurate shape
estimation of soft robots using only off-the-shelf sensors, ensuring
cost-effectiveness and easy applicability. By integrating inertial measurement
units (IMUs) with complementary bend sensors, IMU drift is mitigated, enabling
reliable long-term proprioception. A Kalman filter fuses segment tip
orientations from both sensors in a mutually compensatory manner, improving
shape estimation over single-sensor methods. A piecewise constant curvature
model estimates the tip location from the fused orientation data and
reconstructs the robot's deformation. Experiments under no loading, external
forces, and passive obstacle interactions during 45 minutes of continuous
operation showed a root mean square error of 16.96 mm (2.91% of total length),
a 56% reduction compared to IMU-only benchmarks. These results demonstrate that
our approach not only enables long-duration proprioception in soft robots but
also maintains high accuracy and robustness across these diverse conditions.
๐
2025-11-01
๐ 10 pages. This paper is accepted at IEEE BigData 2
๐ฅ Doyun Choi, Cheonwoo Lee, Jaemin Yoo
Machine Learning
ๆฅ็ๆ่ฆ
Contrastive learning (CL) has been widely used for enhancing the performance
of graph collaborative filtering (GCF) for personalized recommendation. Since
data augmentation plays a crucial role in the success of CL, previous works
have designed augmentation methods to remove noisy interactions between users
and items in order to generate effective augmented views. However, the
ambiguity in defining ''noisiness'' presents a persistent risk of losing core
information and generating unreliable data views, while increasing the overall
complexity of augmentation. In this paper, we propose Simple Collaborative
Augmentation for Recommendation (SCAR), a novel and intuitive augmentation
method designed to maximize the effectiveness of CL for GCF. Instead of
removing information, SCAR leverages collaborative signals extracted from
user-item interactions to generate pseudo-interactions, which are then either
added to or used to replace existing interactions. This results in more robust
representations while avoiding the pitfalls of overly complex augmentation
modules. We conduct experiments on four benchmark datasets and show that SCAR
outperforms previous CL-based GCF methods as well as other state-of-the-art
self-supervised learning approaches across key evaluation metrics. SCAR
exhibits strong robustness across different hyperparameter settings and is
particularly effective in sparse data scenarios.
๐
2025-10-31
๐ ArXiv Preprint (cs.DB)
๐ฅ Chunxiao Ye, Xiao Yan, Eric Lo
ๆฅ็ๆ่ฆ
The increasing prevalence of hybrid vector and relational data necessitates
efficient, general support for queries that combine high-dimensional vector
search with complex relational filtering. However, existing filtered search
solutions are fundamentally limited by specialized indices, which restrict
arbitrary filtering and hinder integration with general-purpose DBMSs. This
work introduces \textsc{Compass}, a unified framework that enables general
filtered search across vector and structured data without relying on new index
designs. Compass leverages established index structures -- such as HNSW and IVF
for vector attributes, and B+-trees for relational attributes -- implementing a
principled cooperative query execution strategy that coordinates candidate
generation and predicate evaluation across modalities. Uniquely, Compass
maintains generality by allowing arbitrary conjunctions, disjunctions, and
range predicates, while ensuring robustness even with highly-selective or
multi-attribute filters. Comprehensive empirical evaluations demonstrate that
Compass consistently outperforms NaviX, the only existing performant general
framework, across diverse hybrid query workloads. It also matches the query
throughput of specialized single-attribute indices in their favorite settings
with only a single attribute involved, all while maintaining full generality
and DBMS compatibility. Overall, Compass offers a practical and robust solution
for achieving truly general filtered search in vector database systems.
๐
2025-10-31
๐ ArXiv Preprint (Computer Vision)
๐ฅ John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin
MultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Recently, augmenting vision-language-action models (VLAs) with world-models
has shown promise in robotic policy learning. However, it remains challenging
to jointly predict next-state observations and action sequences because of the
inherent difference between the two modalities. To address this, we propose
DUal-STream diffusion (DUST), a world-model augmented VLA framework that
handles the modality conflict and enhances the performance of VLAs across
diverse tasks. Specifically, we propose a multimodal diffusion transformer
architecture that explicitly maintains separate modality streams while enabling
cross-modal knowledge sharing. In addition, we propose training techniques such
as independent noise perturbations for each modality and a decoupled flow
matching loss, which enables the model to learn the joint distribution in a
bidirectional manner while avoiding the need for a unified latent space.
Furthermore, based on the decoupled training framework, we introduce a sampling
method where we sample action and vision tokens asynchronously at different
rates, which shows improvement through inference-time scaling. Through
experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up
to 6% gains over a standard VLA baseline and implicit world-modeling methods,
with our inference-time scaling approach providing an additional 2-5% gain on
success rate. On real-world tasks with the Franka Research 3, DUST outperforms
baselines in success rate by 13%, confirming its effectiveness beyond
simulation. Lastly, we demonstrate the effectiveness of DUST in large-scale
pretraining with action-free videos from BridgeV2, where DUST leads to
significant gain when transferred to the RoboCasa benchmark.
๐
2025-10-31
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dayuan Fu, Yunze Wu, Xiaojie Cai, Lyumanshan Ye, Shijie Xia et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Large Language Model (LLM) agents have recently shown strong potential in
domains such as automated coding, deep research, and graphical user interface
manipulation. However, training them to succeed on long-horizon,
domain-specialized tasks remains challenging. Current methods primarily fall
into two categories. The first relies on dense human annotations through
behavior cloning, which is prohibitively expensive for long-horizon tasks that
can take days or months. The second depends on outcome-driven sampling, which
often collapses due to the rarity of valid positive trajectories on
domain-specialized tasks. We introduce Apollo, a sampling framework that
integrates asynchronous human guidance with action-level data filtering.
Instead of requiring annotators to shadow every step, Apollo allows them to
intervene only when the agent drifts from a promising trajectory, by providing
prior knowledge, strategic advice, etc. This lightweight design makes it
possible to sustain interactions for over 30 hours and produces valuable
trajectories at a lower cost. Apollo then applies supervision control to filter
out sub-optimal actions and prevent error propagation. Together, these
components enable reliable and effective data collection in long-horizon
environments. To demonstrate the effectiveness of Apollo, we evaluate it using
InnovatorBench. Our experiments show that when applied to train the GLM-4.5
model on InnovatorBench, Apollo achieves more than a 50% improvement over the
untrained baseline and a 28% improvement over a variant trained without human
interaction. These results highlight the critical role of human-in-the-loop
sampling and the robustness of Apollo's design in handling long-horizon,
domain-specialized tasks.
๐
2025-10-31
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Boyi Wei, Zora Che, Nathaniel Li, Udari Madhushani Sehwag, Jasper Gรถtting et al.
ๆฅ็ๆ่ฆ
Open-weight bio-foundation models present a dual-use dilemma. While holding
great promise for accelerating scientific research and drug development, they
could also enable bad actors to develop more deadly bioweapons. To mitigate the
risk posed by these models, current approaches focus on filtering biohazardous
data during pre-training. However, the effectiveness of such an approach
remains unclear, particularly against determined actors who might fine-tune
these models for malicious use. To address this gap, we propose \eval, a
framework to evaluate the robustness of procedures that are intended to reduce
the dual-use capabilities of bio-foundation models. \eval assesses models'
virus understanding through three lenses, including sequence modeling,
mutational effects prediction, and virulence prediction. Our results show that
current filtering practices may not be particularly effective: Excluded
knowledge can be rapidly recovered in some cases via fine-tuning, and exhibits
broader generalizability in sequence modeling. Furthermore, dual-use signals
may already reside in the pretrained representations, and can be elicited via
simple linear probing. These findings highlight the challenges of data
filtering as a standalone procedure, underscoring the need for further research
into robust safety and security strategies for open-weight bio-foundation
models.
๐
2025-10-31
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yunze Wu, Dayuan Fu, Weiye Si, Zhen Huang, Mohan Jiang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
AI agents could accelerate scientific discovery by automating hypothesis
formation, experiment design, coding, execution, and analysis, yet existing
benchmarks probe narrow skills in simplified settings. To address this gap, we
introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end
assessment of agents performing Large Language Model (LLM) research. It
comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss
Design, Reward Design, and Scaffold Construction, which require runnable
artifacts and assessment of correctness, performance, output quality, and
uncertainty. To support agent operation, we develop ResearchGym, a research
environment offering rich action spaces, distributed and long-horizon
execution, asynchronous monitoring, and snapshot saving. We also implement a
lightweight ReAct agent that couples explicit reasoning with executable
planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2.
Our experiments demonstrate that while frontier models show promise in
code-driven research tasks, they struggle with fragile algorithm-related tasks
and long-horizon decision making, such as impatience, poor resource management,
and overreliance on template-based reasoning. Furthermore, agents require over
11 hours to achieve their best performance on InnovatorBench, underscoring the
benchmark's difficulty and showing the potential of InnovatorBench to be the
next generation of code-based research benchmark.
๐
2025-10-31
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guozheng Zheng, Jian Guan, Mingjie Xie, Xuanjia Zhao, Congyi Fan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Cross-view geo-localization (CVGL) between drone and satellite imagery
remains challenging due to severe viewpoint gaps and the presence of hard
negatives, which are visually similar but geographically mismatched samples.
Existing mining or reweighting strategies often use static weighting, which is
sensitive to distribution shifts and prone to overemphasizing difficult samples
too early, leading to noisy gradients and unstable convergence. In this paper,
we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy.
At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates
relative difficulty and assigns fine-grained weights to negatives. At the batch
level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a
training-progress signal to attenuate noisy gradients during early optimization
and progressively enhance hard-negative mining as training matures. Experiments
on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness
and robustness of the proposed DPHR, achieving consistent improvements over
state-of-the-art methods.
๐
2025-10-31
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ilyass Moummad, Kawtar Zaher, Hervรฉ Goรซau, Alexis Joly
Multimodal
ๆฅ็ๆ่ฆ
Efficient large-scale retrieval requires representations that are both
compact and discriminative. Foundation models provide powerful visual and
multimodal embeddings, but nearest neighbor search in these high-dimensional
spaces is computationally expensive. Hashing offers an efficient alternative by
enabling fast Hamming distance search with binary codes, yet existing
approaches often rely on complex pipelines, multi-term objectives, designs
specialized for a single learning paradigm, and long training times. We
introduce CroVCA (Cross-View Code Alignment), a simple and unified principle
for learning binary codes that remain consistent across semantically aligned
views. A single binary cross-entropy loss enforces alignment, while coding-rate
maximization serves as an anti-collapse regularizer to promote balanced and
diverse codes. To implement this, we design HashCoder, a lightweight MLP
hashing network with a final batch normalization layer to enforce balanced
codes. HashCoder can be used as a probing head on frozen embeddings or to adapt
encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves
state-of-the-art results in just 5 training epochs. At 16 bits, it particularly
well-for instance, unsupervised hashing on COCO completes in under 2 minutes
and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These
results highlight CroVCA's efficiency, adaptability, and broad applicability.
๐
2025-10-31
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yehna Kim, Young-Eun Kim, Seong-Whan Lee
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Vision-Language Models (VLMs) have demonstrated impressive capabilities in
zero-shot action recognition by learning to associate video embeddings with
class embeddings. However, a significant challenge arises when relying solely
on action classes to provide semantic context, particularly due to the presence
of multi-semantic words, which can introduce ambiguity in understanding the
intended concepts of actions. To address this issue, we propose an innovative
approach that harnesses web-crawled descriptions, leveraging a large-language
model to extract relevant keywords. This method reduces the need for human
annotators and eliminates the laborious manual process of attribute data
creation. Additionally, we introduce a spatio-temporal interaction module
designed to focus on objects and action units, facilitating alignment between
description attributes and video content. In our zero-shot experiments, our
model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and
68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the
model's adaptability and effectiveness across various downstream tasks.
๐
2025-10-31
๐ ArXiv Preprint (Computer Vision)
๐ฅ Guozheng Zheng, Jian Guan, Mingjie Xie, Xuanjia Zhao, Congyi Fan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Cross-view geo-localization (CVGL) between drone and satellite imagery
remains challenging due to severe viewpoint gaps and the presence of hard
negatives, which are visually similar but geographically mismatched samples.
Existing mining or reweighting strategies often use static weighting, which is
sensitive to distribution shifts and prone to overemphasizing difficult samples
too early, leading to noisy gradients and unstable convergence. In this paper,
we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy.
At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates
relative difficulty and assigns fine-grained weights to negatives. At the batch
level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a
training-progress signal to attenuate noisy gradients during early optimization
and progressively enhance hard-negative mining as training matures. Experiments
on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness
and robustness of the proposed DPHR, achieving consistent improvements over
state-of-the-art methods.
๐
2025-10-31
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ilyass Moummad, Kawtar Zaher, Hervรฉ Goรซau, Alexis Joly
Multimodal
ๆฅ็ๆ่ฆ
Efficient large-scale retrieval requires representations that are both
compact and discriminative. Foundation models provide powerful visual and
multimodal embeddings, but nearest neighbor search in these high-dimensional
spaces is computationally expensive. Hashing offers an efficient alternative by
enabling fast Hamming distance search with binary codes, yet existing
approaches often rely on complex pipelines, multi-term objectives, designs
specialized for a single learning paradigm, and long training times. We
introduce CroVCA (Cross-View Code Alignment), a simple and unified principle
for learning binary codes that remain consistent across semantically aligned
views. A single binary cross-entropy loss enforces alignment, while coding-rate
maximization serves as an anti-collapse regularizer to promote balanced and
diverse codes. To implement this, we design HashCoder, a lightweight MLP
hashing network with a final batch normalization layer to enforce balanced
codes. HashCoder can be used as a probing head on frozen embeddings or to adapt
encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves
state-of-the-art results in just 5 training epochs. At 16 bits, it particularly
well-for instance, unsupervised hashing on COCO completes in under 2 minutes
and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These
results highlight CroVCA's efficiency, adaptability, and broad applicability.
๐
2025-10-31
๐ ArXiv Preprint (eess.SP)
๐ฅ Nelson Mattiรฉ, Arturo Sanchez-Azofeifa, Pablo Crespo-Peremarch, Juan-Ygnacio Lรณpez-Hernรกndez
ๆฅ็ๆ่ฆ
According to the Paris Climate Change Agreement, all nations are required to
submit reports on their greenhouse gas emissions and absorption every two years
by 2024. Consequently, forests play a crucial role in reducing carbon
emissions, which is essential for meeting these obligations. Recognizing the
significance of forest conservation in the global battle against climate
change, Article 5 of the Paris Agreement emphasizes the need for high-quality
forest data. This study focuses on enhancing methods for mapping aboveground
biomass in tropical dry forests. Tropical dry forests are considered one of the
least understood tropical forest environments; therefore, there is a need for
accurate approaches to estimate carbon pools. We employ a comparative analysis
of AGB estimates, utilizing different discrete and full-waveform laser scanning
datasets in conjunction with Ordinary Least Squares and Bayesian approaches
SVM. Airborne Laser Scanning, Unmanned Laser Scanning, and Space Laser Scanning
were used as independent variables for extracting forest metrics. Variable
selection, SVM regression tuning, and cross-validation via a machine-learning
approach were applied to account for overfitting and underfitting. The results
indicate that six key variables primarily related to tree height:
Elev\.minimum, Elev\.L3, lev\.MAD.mode, Elev\.mode, Elev\.MAD\.median, and
Elev\.skewness, are important for AGB estimation using ALSD and ULSD, while
Leaf Area Index, canopy coverage and height, terrain elevation, and
full-waveform signal energy emerged as the most vital variables. AGB values
estimated from ten permanent tropical dry forest plots in Costa Rica Guanacaste
province ranged from 26.02 Mg/ha to 175.43 Mg/ha. The SVM regressions
demonstrated a 17.89 error across all laser scanning systems, with SLSF W
exhibiting the lowest error 17.07 in estimating total biomass per plot.
๐
2025-10-31
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ilyass Moummad, Kawtar Zaher, Hervรฉ Goรซau, Alexis Joly
Multimodal
ๆฅ็ๆ่ฆ
Efficient large-scale retrieval requires representations that are both
compact and discriminative. Foundation models provide powerful visual and
multimodal embeddings, but nearest neighbor search in these high-dimensional
spaces is computationally expensive. Hashing offers an efficient alternative by
enabling fast Hamming distance search with binary codes, yet existing
approaches often rely on complex pipelines, multi-term objectives, designs
specialized for a single learning paradigm, and long training times. We
introduce CroVCA (Cross-View Code Alignment), a simple and unified principle
for learning binary codes that remain consistent across semantically aligned
views. A single binary cross-entropy loss enforces alignment, while coding-rate
maximization serves as an anti-collapse regularizer to promote balanced and
diverse codes. To implement this, we design HashCoder, a lightweight MLP
hashing network with a final batch normalization layer to enforce balanced
codes. HashCoder can be used as a probing head on frozen embeddings or to adapt
encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves
state-of-the-art results in just 5 training epochs. At 16 bits, it particularly
well-for instance, unsupervised hashing on COCO completes in under 2 minutes
and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These
results highlight CroVCA's efficiency, adaptability, and broad applicability.
๐
2025-10-30
๐ ArXiv Preprint (cond-mat.mtrl-sci)
๐ฅ Meghraj Prajapat, Alankar Alankar
Machine Learning
ๆฅ็ๆ่ฆ
Electron back-scatter diffraction (EBSD) has traditionally relied upon
methods such as the Hough transform and dictionary Indexing to interpret
diffraction patterns and extract crystallographic orientation. However, these
methods encounter significant limitations, particularly when operating at high
scanning speeds, where the exposure time per pattern is decreased beyond the
operating sensitivity of CCD camera. Hence the signal to noise ratio decreases
for the observed pattern which makes the pattern noisy, leading to reduced
indexing accuracy. This research work aims to develop generative machine
learning models for the post-processing or on-the-fly processing of Kikuchi
patterns which are capable of restoring noisy EBSD patterns obtained at high
scan speeds. These restored patterns can be used for the determination of
crystal orientations to provide reliable indexing results. We compare the
performance of such generative models in enhancing the quality of patterns
captured at short exposure times (high scan speeds). An interesting observation
is that the methodology is not data-hungry as typical machine learning methods.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Charts play an important role in visualization, reasoning, data analysis, and
the exchange of ideas among humans. However, existing vision-language models
(VLMs) still lack accurate perception of details and struggle to extract
fine-grained structures from charts. Such limitations in chart grounding also
hinder their ability to compare multiple charts and reason over them. In this
paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a
comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting
tabular data, localizing visualization elements, and recognizing various
attributes from charts of diverse types and complexities. We design a JSON
template to facilitate the calculation of evaluation metrics specifically
tailored for each grounding task. By incorporating a novel two-stage inference
workflow, the benchmark can further evaluate VLMs capability to align and
compare elements/attributes across two charts. Our analysis of evaluations on
several recent VLMs reveals new insights into their perception biases,
weaknesses, robustness, and hallucinations in chart understanding. These
findings highlight the fine-grained discrepancies among VLMs in chart
understanding tasks and point to specific skills that need to be strengthened
in current models.
๐
2025-10-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Teerapol Saengsukhiran, Peerawat Chomphooyod, Narabodee Rodjananant, Chompakorn Chaksangchaichot, Patawee Prakrankamanant et al.
Multimodal
ๆฅ็ๆ่ฆ
Multimodal retrieval systems are expected to operate in a semantic space,
agnostic to the language or cultural origin of the query. In practice, however,
retrieval outcomes systematically reflect perspectival biases: deviations
shaped by linguistic prevalence and cultural associations. We study two such
biases. First, prevalence bias refers to the tendency to favor entries from
prevalent languages over semantically faithful entries in image-to-text
retrieval. Second, association bias refers to the tendency to favor images
culturally associated with the query over semantically correct ones in
text-to-image retrieval. Results show that explicit alignment is a more
effective strategy for mitigating prevalence bias. However, association bias
remains a distinct and more challenging problem. These findings suggest that
achieving truly equitable multimodal systems requires targeted strategies
beyond simple data scaling and that bias arising from cultural association may
be treated as a more challenging problem than one arising from linguistic
prevalence.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yair Elboher, Yuval Pinter
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Diacritics restoration in Hebrew is a fundamental task for ensuring accurate
word pronunciation and disambiguating textual meaning. Despite the language's
high degree of ambiguity when unvocalized, recent machine learning approaches
have significantly advanced performance on this task.
In this work, we present DIVRIT, a novel system for Hebrew diacritization
that frames the task as a zero-shot classification problem. Our approach
operates at the word level, selecting the most appropriate diacritization
pattern for each undiacritized word from a dynamically generated candidate set,
conditioned on the surrounding textual context. A key innovation of DIVRIT is
its use of a Hebrew Visual Language Model, which processes undiacritized text
as an image, allowing diacritic information to be embedded directly within the
input's vector representation.
Through a comprehensive evaluation across various configurations, we
demonstrate that the system effectively performs diacritization without relying
on complex, explicit linguistic analysis. Notably, in an ``oracle'' setting
where the correct diacritized form is guaranteed to be among the provided
candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic
architectural enhancements and optimized training methodologies yield
significant improvements in the system's overall generalization capabilities.
These findings highlight the promising potential of visual representations for
accurate and automated Hebrew diacritization.
๐
2025-10-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Teerapol Saengsukhiran, Peerawat Chomphooyod, Narabodee Rodjananant, Chompakorn Chaksangchaichot, Patawee Prakrankamanant et al.
Multimodal
ๆฅ็ๆ่ฆ
Multimodal retrieval systems are expected to operate in a semantic space,
agnostic to the language or cultural origin of the query. In practice, however,
retrieval outcomes systematically reflect perspectival biases: deviations
shaped by linguistic prevalence and cultural associations. We study two such
biases. First, prevalence bias refers to the tendency to favor entries from
prevalent languages over semantically faithful entries in image-to-text
retrieval. Second, association bias refers to the tendency to favor images
culturally associated with the query over semantically correct ones in
text-to-image retrieval. Results show that explicit alignment is a more
effective strategy for mitigating prevalence bias. However, association bias
remains a distinct and more challenging problem. These findings suggest that
achieving truly equitable multimodal systems requires targeted strategies
beyond simple data scaling and that bias arising from cultural association may
be treated as a more challenging problem than one arising from linguistic
prevalence.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Fan Yang, Lingyao Li, Yaxin Hu, Michael Rodgers, Renkai Ma
ๆฅ็ๆ่ฆ
Robots with anthropomorphic features are increasingly shaping how humans
perceive and morally engage with them. Our research investigates how different
levels of anthropomorphism influence protective responses to robot abuse,
extending the Computers as Social Actors (CASA) and uncanny valley theories
into a moral domain. In an experiment, we invite 201 participants to view
videos depicting abuse toward a robot with low (Spider), moderate (Two-Foot),
or high (Humanoid) anthropomorphism. To provide a comprehensive analysis, we
triangulate three modalities: self-report surveys measuring emotions and
uncanniness, physiological data from automated facial expression analysis, and
qualitative reflections. Findings indicate that protective responses are not
linear. The moderately anthropomorphic Two-Foot robot, rated highest in
eeriness and "spine-tingling" sensations consistent with the uncanny valley,
elicited the strongest physiological anger expressions. Self-reported anger and
guilt are significantly higher for both the Two-Foot and Humanoid robots
compared to the Spider. Qualitative findings further reveal that as
anthropomorphism increases, moral reasoning shifts from technical assessments
of property damage to condemnation of the abuser's character, while governance
proposals expand from property law to calls for quasi-animal rights and broader
societal responsibility. These results suggest that the uncanny valley does not
dampen moral concern but paradoxically heightens protective impulses, offering
critical implications for robot design, policy, and future legal frameworks.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chuyan Chen, Chenyang Ma, Zhangxin Li, Yutong He, Yanjie Dong et al.
ๆฅ็ๆ่ฆ
Communication remains a central bottleneck in large-scale distributed machine
learning, and gradient sparsification has emerged as a promising strategy to
alleviate this challenge. However, existing gradient compressors face notable
limitations: Rand-$K$ discards structural information and performs poorly in
practice, while Top-$K$ preserves informative entries but loses the contraction
property and requires costly All-Gather operations. In this paper, we propose
ARC-Top-$K$, an {All-Reduce}-Compatible Top-$K$ compressor that aligns sparsity
patterns across nodes using a lightweight sketch of the gradient, enabling
index-free All-Reduce while preserving globally significant information.
ARC-Top-$K$ is provably contractive and, when combined with momentum error
feedback (EF21M), achieves linear speedup and sharper convergence rates than
the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches
the accuracy of Top-$K$ while reducing wall-clock training time by up to
60.7\%, offering an efficient and scalable solution that combines the
robustness of Rand-$K$ with the strong performance of Top-$K$.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lin Xu, Xinyun Yuan, Yuxuan Liang, Suwan Yin, Yuankai Wu
ๆฅ็ๆ่ฆ
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed
to advance research on flight delay prediction and support the development of
foundation models for tabular data. Existing datasets in this domain are
typically limited to flat tabular structures and fail to capture the
spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this
limitation by providing three aligned modalities: (i) a tabular dataset with
rich operational, meteorological, and airportlevel features for over 50 million
flights; (ii) a flight chain module that models delay propagation along
sequential flight legs, capturing upstream and downstream dependencies; and
(iii) a flight network graph that encodes shared aircraft, crew, and airport
resource connections, enabling cross-flight relational reasoning. The dataset
is carefully constructed with temporal splits, comprehensive features, and
strict leakage prevention to support realistic and reproducible machine
learning evaluation. Aeolus supports a broad range of tasks, including
regression, classification, temporal structure modeling, and graph learning,
serving as a unified benchmark across tabular, sequential, and graph
modalities. We release baseline experiments and preprocessing tools to
facilitate adoption. Aeolus fills a key gap for both domain-specific modeling
and general-purpose structured data research.Our source code and data can be
accessed at https://github.com/Flnny/Delay-data
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jingyuan Liu, Fatemeh Ghaffari, Xuchuang Wang, Xutong Liu, Mohammad Hajiesmaili et al.
Machine Learning
ๆฅ็ๆ่ฆ
Preference learning from pairwise feedback is a widely adopted framework in
applications such as reinforcement learning with human feedback and
recommendations. In many practical settings, however, user interactions are
limited or costly, making offline preference learning necessary. Moreover,
real-world preference learning often involves users with different preferences.
For example, annotators from different backgrounds may rank the same responses
differently. This setting presents two central challenges: (1) identifying
similarity across users to effectively aggregate data, especially under
scenarios where offline data is imbalanced across dimensions, and (2) handling
the imbalanced offline data where some preference dimensions are
underrepresented. To address these challenges, we study the Offline Clustering
of Preference Learning problem, where the learner has access to fixed datasets
from multiple users with potentially different preferences and aims to maximize
utility for a test user. To tackle the first challenge, we first propose
Off-C$^2$PL for the pure offline setting, where the learner relies solely on
offline data. Our theoretical analysis provides a suboptimality bound that
explicitly captures the tradeoff between sample noise and bias. To address the
second challenge of inbalanced data, we extend our framework to the setting
with active-data augmentation where the learner is allowed to select a limited
number of additional active-data for the test user based on the cluster
structure learned by Off-C$^2$PL. In this setting, our second algorithm,
A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative
dimensions of the test user's preference. We prove that these actively
collected samples contribute more effectively than offline ones. Finally, we
validate our theoretical results through simulations on synthetic and
real-world datasets.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In sequence modeling, the parametric memory of atomic facts has been
predominantly abstracted as a brute-force lookup of co-occurrences between
entities. We contrast this associative view against a geometric view of how
memory is stored. We begin by isolating a clean and analyzable instance of
Transformer reasoning that is incompatible with memory as strictly a storage of
the local co-occurrences specified during training. Instead, the model must
have somehow synthesized its own geometry of atomic facts, encoding global
relationships between all entities, including non-co-occurring ones. This in
turn has simplified a hard reasoning task involving an $\ell$-fold composition
into an easy-to-learn 1-step geometric task.
From this phenomenon, we extract fundamental aspects of neural embedding
geometries that are hard to explain. We argue that the rise of such a geometry,
despite optimizing over mere local associations, cannot be straightforwardly
attributed to typical architectural or optimizational pressures.
Counterintuitively, an elegant geometry is learned even when it is not more
succinct than a brute-force lookup of associations.
Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry
stems from a spectral bias that -- in contrast to prevailing theories -- indeed
arises naturally despite the lack of various pressures. This analysis also
points to practitioners a visible headroom to make Transformer memory more
strongly geometric. We hope the geometric view of parametric memory encourages
revisiting the default intuitions that guide researchers in areas like
knowledge acquisition, capacity, discovery and unlearning.
๐
2025-10-30
๐ ArXiv Preprint (cs.MA)
๐ฅ Ashwin Kumar, William Yeoh
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We introduce the General Incentives-based Framework for Fairness (GIFF), a
novel approach for fair multi-agent resource allocation that infers fair
decision-making from standard value functions. In resource-constrained
settings, agents optimizing for efficiency often create inequitable outcomes.
Our approach leverages the action-value (Q-)function to balance efficiency and
fairness without requiring additional training. Specifically, our method
computes a local fairness gain for each action and introduces a counterfactual
advantage correction term to discourage over-allocation to already well-off
agents. This approach is formalized within a centralized control setting, where
an arbitrator uses the GIFF-modified Q-values to solve an allocation problem.
Empirical evaluations across diverse domains, including dynamic ridesharing,
homelessness prevention, and a complex job allocation task-demonstrate that our
framework consistently outperforms strong baselines and can discover
far-sighted, equitable policies. The framework's effectiveness is supported by
a theoretical foundation; we prove its fairness surrogate is a principled lower
bound on the true fairness improvement and that its trade-off parameter offers
monotonic tuning. Our findings establish GIFF as a robust and principled
framework for leveraging standard reinforcement learning components to achieve
more equitable outcomes in complex multi-agent systems.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ J. de Curtรฒ, I. de Zarzร , Pablo Garcรญa, Jordi Cabot
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a comprehensive cross-platform evaluation of reasoning
capabilities in contemporary foundation models, establishing an
infrastructure-agnostic benchmark across three computational paradigms: HPC
supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and
university clusters (a node with eight H200 GPUs).
We evaluate 15 foundation models across 79 problems spanning eight academic
domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics,
Calculus, and Optimization) through three experimental phases: (1) Baseline
establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b,
Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing
methodology and reference performance; (2) Infrastructure validation: The
19-problem benchmark repeated on university cluster (seven models including
Falcon-Mamba state-space architecture) and Nebius AI Studio (nine
state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3
30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic
reproducibility; (3) Extended evaluation: Full 79-problem assessment on both
university cluster and Nebius platforms, probing generalization at scale across
architectural diversity.
The findings challenge conventional scaling assumptions, establish training
data quality as more critical than model size, and provide actionable
guidelines for model selection across educational, production, and research
contexts. The tri-infrastructure methodology and 79-problem benchmark enable
longitudinal tracking of reasoning capabilities as foundation models evolve.
๐
2025-10-30
๐ ArXiv Preprint (cs.DC)
๐ฅ Zixu Shen, Kexin Chu, Yifan Zhang, Dawei Xiang, Runxin Wu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The expansion of large language models is increasingly limited by the
constrained memory capacity of modern GPUs. To mitigate this,
Mixture-of-Experts (MoE) architectures activate only a small portion of
parameters during inference, significantly lowering both memory demand and
computational overhead. However, conventional MoE inference approaches, which
select active experts independently at each layer, often introduce considerable
latency because of frequent parameter transfers between host and GPU memory. In
addition, current cross-layer prediction strategies, which are typically based
on fixed steps, lack adaptability across different hardware platforms and
workloads, thereby reducing their robustness and effectiveness.
To address these challenges, we present ExpertFlow, a runtime system for MoE
inference that combines adaptive expert prefetching and cache-aware routing.
ExpertFlow continuously adjusts its prediction horizon for expert activation by
leveraging runtime statistics such as transfer bandwidth, parameter
dimensionality, and model feedback signals. Furthermore, it incorporates a
hybrid cross-layer prediction scheme that fuses pregating information with
intermediate computational states to anticipate future expert needs. By
adaptively refining prefetching decisions and aligning them with actual usage
behavior, ExpertFlow effectively decreases cache misses and removes latency
caused by expert swap-ins. Our evaluation demonstrates that ExpertFlow reduces
model stall time to less than 0.1% of the baseline, highlighting its capability
to optimize MoE inference under stringent memory constraints.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Muhammad Faraz Ul Abrar, Nicolรฒ Michelusi
Robotics
ๆฅ็ๆ่ฆ
Over-the-air (OTA) federated learning (FL) has been well recognized as a
scalable paradigm that exploits the waveform superposition of the wireless
multiple-access channel to aggregate model updates in a single use. Existing
OTA-FL designs largely enforce zero-bias model updates by either assuming
\emph{homogeneous} wireless conditions (equal path loss across devices) or
forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous}
wireless scenarios, however, such designs are constrained by the weakest device
and inflate the update variance. Moreover, prior analyses of biased OTA-FL
largely address convex objectives, while most modern AI models are highly
non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient
descent (SGD) for general smooth non-convex objectives under wireless
heterogeneity. We develop novel OTA-FL SGD updates that allow a structured,
time-invariant model bias while facilitating reduced variance updates. We
derive a finite-time stationarity bound (expected time average squared gradient
norm) that explicitly reveals a bias-variance trade-off. To optimize this
trade-off, we pose a non-convex joint OTA power-control design and develop an
efficient successive convex approximation (SCA) algorithm that requires only
statistical CSI at the base station. Experiments on a non-convex image
classification task validate the approach: the SCA-based design accelerates
convergence via an optimized bias and improves generalization over prior OTA-FL
baselines.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xinhan Zheng, Huyu Wu, Xueting Wang, Haiyun Jiang
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) exhibit a pronounced preference for
textual inputs when processing vision-language data, limiting their ability to
reason effectively from visual evidence. Unlike prior studies that attribute
this text bias to external factors such as data imbalance or instruction
tuning, we propose that the bias originates from the model's internal
architecture. Specifically, we hypothesize that visual key vectors (Visual
Keys) are out-of-distribution (OOD) relative to the text key space learned
during language-only pretraining. Consequently, these visual keys receive
systematically lower similarity scores during attention computation, leading to
their under-utilization in the context representation. To validate this
hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their
distributional structures using qualitative (t-SNE) and quantitative
(Jensen-Shannon divergence) methods. The results provide direct evidence that
visual and textual keys occupy markedly distinct subspaces within the attention
space. The inter-modal divergence is statistically significant, exceeding
intra-modal variation by several orders of magnitude. These findings reveal
that text bias arises from an intrinsic misalignment within the attention key
space rather than solely from external data factors.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma
ๆฅ็ๆ่ฆ
Machine unlearning (MU) aims to remove the influence of certain data points
from a trained model without costly retraining. Most practical MU algorithms
are only approximate and their performance can only be assessed empirically.
Care must therefore be taken to make empirical comparisons as representative as
possible. A common practice is to run the MU algorithm multiple times
independently starting from the same trained model. In this work, we
demonstrate that this practice can give highly non-representative results
because -- even for the same architecture and same dataset -- some MU methods
can be highly sensitive to the choice of random number seed used for model
training. We therefore recommend that empirical
comphttps://info.arxiv.org/help/prep#commentsarisons of MU algorithms should
also reflect the variability across different model training seeds.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Majed El Helou, Chiara Troiani, Benjamin Ryder, Jean Diaconu, Hervรฉ Muyal et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Authorizing Large Language Model driven agents to dynamically invoke tools
and access protected resources introduces significant risks, since current
methods for delegating authorization grant overly broad permissions and give
access to tools allowing agents to operate beyond the intended task scope. We
introduce and assess a delegated authorization model enabling authorization
servers to semantically inspect access requests to protected resources, and
issue access tokens constrained to the minimal set of scopes necessary for the
agents' assigned tasks. Given the unavailability of datasets centered on
delegated authorization flows, particularly including both semantically
appropriate and inappropriate scope requests for a given task, we introduce
ASTRA, a dataset and data generation pipeline for benchmarking semantic
matching between tasks and scopes. Our experiments show both the potential and
current limitations of model-based matching, particularly as the number of
scopes needed for task completion increases. Our results highlight the need for
further research into semantic matching techniques enabling intent-aware
authorization for multi-agent and tool-augmented applications, including
fine-grained control, such as Task-Based Access Control (TBAC).
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a
non-differentiable decoding process that requires laborious, hand-tuning of
hyperparameters like temperature and top-p. This paper introduces AutoDeco, a
novel architecture that enables truly "end-to-end" generation by learning to
control its own decoding strategy. We augment the standard transformer with
lightweight heads that, at each step, dynamically predict context-specific
temperature and top-p values alongside the next-token logits. This approach
transforms decoding into a parametric, token-level process, allowing the model
to self-regulate its sampling strategy within a single forward pass.
Through extensive experiments on eight benchmarks, we demonstrate that
AutoDeco not only significantly outperforms default decoding strategies but
also achieves performance comparable to an oracle-tuned baseline derived from
"hacking the test set"-a practical upper bound for any static method.
Crucially, we uncover an emergent capability for instruction-based decoding
control: the model learns to interpret natural language commands (e.g.,
"generate with low randomness") and adjusts its predicted temperature and top-p
on a token-by-token basis, opening a new paradigm for steerable and interactive
LLM decoding.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vaibhav Kurrey, Sivakalyan Pujari, Gagan Raj Gupta
Computer VisionMachine LearningRobotics
ๆฅ็ๆ่ฆ
We present a long-term deployment study of a machine vision-based anomaly
detection system for failure prediction in a steel rolling mill. The system
integrates industrial cameras to monitor equipment operation, alignment, and
hot bar motion in real time along the process line. Live video streams are
processed on a centralized video server using deep learning models, enabling
early prediction of equipment failures and process interruptions, thereby
reducing unplanned breakdown costs. Server-based inference minimizes the
computational load on industrial process control systems (PLCs), supporting
scalable deployment across production lines with minimal additional resources.
By jointly analyzing sensor data from data acquisition systems and visual
inputs, the system identifies the location and probable root causes of
failures, providing actionable insights for proactive maintenance. This
integrated approach enhances operational reliability, productivity, and
profitability in industrial manufacturing environments.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mingchen Tu, Zhiqiang Liu, Juan Li, Liangyurui Liu, Junjie Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated exceptional capabilities
across multiple domains by leveraging massive pre-training and curated
fine-tuning data. However, in data-sensitive fields such as healthcare, the
lack of high-quality, domain-specific training corpus hinders LLMs' adaptation
for specialized applications. Meanwhile, domain experts have distilled domain
wisdom into ontology rules, which formalize relationships among concepts and
ensure the integrity of knowledge management repositories. Viewing LLMs as
implicit repositories of human knowledge, we propose Evontree, a novel
framework that leverages a small set of high-quality ontology rules to
systematically extract, validate, and enhance domain knowledge within LLMs,
without requiring extensive external datasets. Specifically, Evontree extracts
domain ontology from raw models, detects inconsistencies using two core
ontology rules, and reinforces the refined knowledge via self-distilled
fine-tuning. Extensive experiments on medical QA benchmarks with
Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both
unmodified models and leading supervised baselines, achieving up to a 3.7%
improvement in accuracy. These results confirm the effectiveness, efficiency,
and robustness of our approach for low-resource domain adaptation of LLMs.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zewen Chi, Li Dong, Qingxiu Dong, Yaru Hao, Xun Wu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We envision a new era of AI, termed agentic organization, where agents solve
complex problems by working collaboratively and concurrently, enabling outcomes
beyond individual intelligence. To realize this vision, we introduce
asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large
language models, which organizes the internal thinking process into
concurrently executable structures. Specifically, we propose a thinking
protocol where an organizer dynamically assigns sub-queries to workers, merges
intermediate knowledge, and produces coherent solutions. More importantly, the
thinking structure in this protocol can be further optimized through
reinforcement learning. Experiments demonstrate that AsyncThink achieves 28%
lower inference latency compared to parallel thinking while improving accuracy
on mathematical reasoning. Moreover, AsyncThink generalizes its learned
asynchronous thinking capabilities, effectively tackling unseen tasks without
additional training.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaoyi He, Danggui Chen, Zhenshuo Zhang, Zimeng Bai
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a hierarchical path-planning and control framework that
combines a high-level Deep Q-Network (DQN) for discrete sub-goal selection with
a low-level Twin Delayed Deep Deterministic Policy Gradient (TD3) controller
for continuous actuation. The high-level module selects behaviors and
sub-goals; the low-level module executes smooth velocity commands. We design a
practical reward shaping scheme (direction, distance, obstacle avoidance,
action smoothness, collision penalty, time penalty, and progress), together
with a LiDAR-based safety gate that prevents unsafe motions. The system is
implemented in ROS + Gazebo (TurtleBot3) and evaluated with PathBench metrics,
including success rate, collision rate, path efficiency, and re-planning
efficiency, in dynamic and partially observable environments. Experiments show
improved success rate and sample efficiency over single-algorithm baselines
(DQN or TD3 alone) and rule-based planners, with better generalization to
unseen obstacle configurations and reduced abrupt control changes. Code and
evaluation scripts are available at the project repository.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lin Xu, Xinyun Yuan, Yuxuan Liang, Suwan Yin, Yuankai Wu
ๆฅ็ๆ่ฆ
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed
to advance research on flight delay prediction and support the development of
foundation models for tabular data. Existing datasets in this domain are
typically limited to flat tabular structures and fail to capture the
spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this
limitation by providing three aligned modalities: (i) a tabular dataset with
rich operational, meteorological, and airportlevel features for over 50 million
flights; (ii) a flight chain module that models delay propagation along
sequential flight legs, capturing upstream and downstream dependencies; and
(iii) a flight network graph that encodes shared aircraft, crew, and airport
resource connections, enabling cross-flight relational reasoning. The dataset
is carefully constructed with temporal splits, comprehensive features, and
strict leakage prevention to support realistic and reproducible machine
learning evaluation. Aeolus supports a broad range of tasks, including
regression, classification, temporal structure modeling, and graph learning,
serving as a unified benchmark across tabular, sequential, and graph
modalities. We release baseline experiments and preprocessing tools to
facilitate adoption. Aeolus fills a key gap for both domain-specific modeling
and general-purpose structured data research.Our source code and data can be
accessed at https://github.com/Flnny/Delay-data
๐
2025-10-30
๐ EMNLP 2025
๐ฅ Kentaro Ozeki, Risako Ando, Takanobu Morishita, Hirohiko Abe, Koji Mineshima et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Normative reasoning is a type of reasoning that involves normative or deontic
modality, such as obligation and permission. While large language models (LLMs)
have demonstrated remarkable performance across various reasoning tasks, their
ability to handle normative reasoning remains underexplored. In this paper, we
systematically evaluate LLMs' reasoning capabilities in the normative domain
from both logical and modal perspectives. Specifically, to assess how well LLMs
reason with normative modals, we make a comparison between their reasoning with
normative modals and their reasoning with epistemic modals, which share a
common formal structure. To this end, we introduce a new dataset covering a
wide range of formal patterns of reasoning in both normative and epistemic
domains, while also incorporating non-formal cognitive factors that influence
human reasoning. Our results indicate that, although LLMs generally adhere to
valid reasoning patterns, they exhibit notable inconsistencies in specific
types of normative reasoning and display cognitive biases similar to those
observed in psychological studies of human reasoning. These findings highlight
challenges in achieving logical consistency in LLMs' normative reasoning and
provide insights for enhancing their reliability. All data and code are
released publicly at https://github.com/kmineshima/NeuBAROCO.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Reda El Makroum, Sebastian Zwickl-Bernhard, Lukas Kranzl
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The electricity sector transition requires substantial increases in
residential demand response capacity, yet Home Energy Management Systems (HEMS)
adoption remains limited by user interaction barriers requiring translation of
everyday preferences into technical parameters. While large language models
have been applied to energy systems as code generators and parameter
extractors, no existing implementation deploys LLMs as autonomous coordinators
managing the complete workflow from natural language input to multi-appliance
scheduling. This paper presents an agentic AI HEMS where LLMs autonomously
coordinate multi-appliance scheduling from natural language requests to device
control, achieving optimal scheduling without example demonstrations. A
hierarchical architecture combining one orchestrator with three specialist
agents uses the ReAct pattern for iterative reasoning, enabling dynamic
coordination without hardcoded workflows while integrating Google Calendar for
context-aware deadline extraction. Evaluation across three open-source models
using real Austrian day-ahead electricity prices reveals substantial capability
differences. Llama-3.3-70B successfully coordinates all appliances across all
scenarios to match cost-optimal benchmarks computed via mixed-integer linear
programming, while other models achieve perfect single-appliance performance
but struggle to coordinate all appliances simultaneously. Progressive prompt
engineering experiments demonstrate that analytical query handling without
explicit guidance remains unreliable despite models' general reasoning
capabilities. We open-source the complete system including orchestration logic,
agent prompts, tools, and web interfaces to enable reproducibility, extension,
and future research.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anirban Ray, Vera Galinova, Florian Jug
Machine Learning
ๆฅ็ๆ่ฆ
Computational Super-Resolution (CSR) in fluorescence microscopy has, despite
being an ill-posed problem, a long history. At its very core, CSR is about
finding a prior that can be used to extrapolate frequencies in a micrograph
that have never been imaged by the image-generating microscope. It stands to
reason that, with the advent of better data-driven machine learning techniques,
stronger prior can be learned and hence CSR can lead to better results. Here,
we present ResMatching, a novel CSR method that uses guided conditional flow
matching to learn such improved data-priors. We evaluate ResMatching on 4
diverse biological structures from the BioSR dataset and compare its results
against 7 baselines. ResMatching consistently achieves competitive results,
demonstrating in all cases the best trade-off between data fidelity and
perceptual realism. We observe that CSR using ResMatching is particularly
effective in cases where a strong prior is hard to learn, e.g. when the given
low-resolution images contain a lot of noise. Additionally, we show that
ResMatching can be used to sample from an implicitly learned posterior
distribution and that this distribution is calibrated for all tested use-cases,
enabling our method to deliver a pixel-wise data-uncertainty term that can
guide future users to reject uncertain predictions.
๐
2025-10-30
๐ ArXiv Preprint (cs.MA)
๐ฅ Fulin Lin, Shaowen Chen, Ruishan Fang, Hongwei Wang, Tao Lin
Natural Language Processing
ๆฅ็ๆ่ฆ
While Multi-Agent Systems (MAS) excel at complex tasks, their growing
autonomy with operational complexity often leads to critical inefficiencies,
such as excessive token consumption and failures arising from misinformation.
Existing methods primarily focus on post-hoc failure attribution, lacking
proactive, real-time interventions to enhance robustness and efficiency. To
this end, we introduce SupervisorAgent, a lightweight and modular framework for
runtime, adaptive supervision that operates without altering the base agent's
architecture. Triggered by an LLM-free adaptive filter, SupervisorAgent
intervenes at critical junctures to proactively correct errors, guide
inefficient behaviors, and purify observations. On the challenging GAIA
benchmark, SupervisorAgent reduces the token consumption of the Smolagent
framework by an average of 29.45% without compromising its success rate.
Extensive experiments across five additional benchmarks (math reasoning, code
generation, and question answering) and various SoTA foundation models validate
the broad applicability and robustness of our approach. The code is available
at https://github.com/LINs-lab/SupervisorAgent.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kun Luo, Hongjin Qian, Zheng Liu, Ziyi Xia, Shitao Xiao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach
for enhancing agentic deep search. However, its application is often hindered
by low \textbf{Reward Density} in deep search scenarios, where agents expend
significant exploratory costs for infrequent and often null final rewards. In
this paper, we formalize this challenge as the \textbf{Reward Density
Optimization} problem, which aims to improve the reward obtained per unit of
exploration cost. This paper introduce \textbf{InfoFlow}, a systematic
framework that tackles this problem from three aspects. 1) \textbf{Subproblem
decomposition}: breaking down long-range tasks to assign process rewards,
thereby providing denser learning signals. 2) \textbf{Failure-guided hints}:
injecting corrective guidance into stalled trajectories to increase the
probability of successful outcomes. 3) \textbf{Dual-agent refinement}:
employing a dual-agent architecture to offload the cognitive burden of deep
exploration. A refiner agent synthesizes the search history, which effectively
compresses the researcher's perceived trajectory, thereby reducing exploration
cost and increasing the overall reward density. We evaluate InfoFlow on
multiple agentic search benchmarks, where it significantly outperforms strong
baselines, enabling lightweight LLMs to achieve performance comparable to
advanced proprietary LLMs.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, Andrea Pugnana
Machine Learning
ๆฅ็ๆ่ฆ
Developing trustworthy Machine Learning (ML) models requires their predicted
probabilities to be well-calibrated, meaning they should reflect true-class
frequencies. Among calibration notions in multiclass classification, strong
calibration is the most stringent, as it requires all predicted probabilities
to be simultaneously calibrated across all classes. However, existing
approaches to multiclass calibration lack a notion of distance among inputs,
which makes them vulnerable to proximity bias: predictions in sparse regions of
the feature space are systematically miscalibrated. This is especially relevant
in high-stakes settings, such as healthcare, where the sparse instances are
exactly those most at risk of biased treatment. In this work, we address this
main shortcoming by introducing a local perspective on multiclass calibration.
First, we formally define multiclass local calibration and establish its
relationship with strong calibration. Second, we theoretically analyze the
pitfalls of existing evaluation metrics when applied to multiclass local
calibration. Third, we propose a practical method for enhancing local
calibration in Neural Networks, which enforces alignment between predicted
probabilities and local estimates of class frequencies using the Jensen-Shannon
distance. Finally, we empirically validate our approach against existing
multiclass calibration techniques.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Prathamesh Kothavale, Sravani Boddepalli
Robotics
ๆฅ็ๆ่ฆ
Conventional robots possess a limited understanding of their kinematics and
are confined to preprogrammed tasks, hindering their ability to leverage tools
efficiently. Driven by the essential components of tool usage - grasping the
desired outcome, selecting the most suitable tool, determining optimal tool
orientation, and executing precise manipulations - we introduce a pioneering
framework. Our novel approach expands the capabilities of the robot's inverse
kinematics solver, empowering it to acquire a sequential repertoire of actions
using tools of varying lengths. By integrating a simulation-learned action
trajectory with the tool, we showcase the practicality of transferring acquired
skills from simulation to real-world scenarios through comprehensive
experimentation. Remarkably, our extended inverse kinematics solver
demonstrates an impressive error rate of less than 1 cm. Furthermore, our
trained policy achieves a mean error of 8 cm in simulation. Noteworthy, our
model achieves virtually indistinguishable performance when employing two
distinct tools of different lengths. This research provides an indication of
potential advances in the exploration of all four fundamental aspects of tool
usage, enabling robots to master the intricate art of tool manipulation across
diverse tasks.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jack FitzGerald, Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Jonnathan Castillo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present EdgeRunner 20B, a fine-tuned version of gpt-oss-20b optimized for
military tasks. EdgeRunner 20B was trained on 1.6M high-quality records curated
from military documentation and websites. We also present four new tests sets:
(a) combat arms, (b) combat medic, (c) cyber operations, and (d) mil-bench-5k
(general military knowledge). On these military test sets, EdgeRunner 20B
matches or exceeds GPT-5 task performance with 95%+ statistical significance,
except for the high reasoning setting on the combat medic test set and the low
reasoning setting on the mil-bench-5k test set. Versus gpt-oss-20b, there is no
statistically-significant regression on general-purpose benchmarks like ARC-C,
GPQA Diamond, GSM8k, IFEval, MMLU Pro, or TruthfulQA, except for GSM8k in the
low reasoning setting. We also present analyses on hyperparameter settings,
cost, and throughput. These findings show that small, locally-hosted models are
ideal solutions for data-sensitive operations such as in the military domain,
allowing for deployment in air-gapped edge devices.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Miranda Anna Christ, Adriรกn Csiszรกrik, Gergely Becsรณ, Dรกniel Varga
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper investigates the structure of linear operators introduced in
Hernandez et al. [2023] that decode specific relational facts in transformer
language models. We extend their single-relation findings to a collection of
relations and systematically chart their organization. We show that such
collections of relation decoders can be highly compressed by simple order-3
tensor networks without significant loss in decoding accuracy. To explain this
surprising redundancy, we develop a cross-evaluation protocol, in which we
apply each linear decoder operator to the subjects of every other relation. Our
results reveal that these linear maps do not encode distinct relations, but
extract recurring, coarse-grained semantic properties (e.g., country of capital
city and country of food are both in the country-of-X property). This
property-centric structure clarifies both the operators' compressibility and
highlights why they generalize only to new relations that are semantically
close. Our findings thus interpret linear relational decoding in transformer
language models as primarily property-based, rather than relation-specific.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Rishub Jain, Sophie Bridgers, Lili Janzer, Rory Greig, Tian Huey Teh et al.
ๆฅ็ๆ่ฆ
Human feedback is critical for aligning AI systems to human values. As AI
capabilities improve and AI is used to tackle more challenging tasks, verifying
quality and safety becomes increasingly challenging. This paper explores how we
can leverage AI to improve the quality of human oversight. We focus on an
important safety problem that is already challenging for humans:
fact-verification of AI outputs. We find that combining AI ratings and human
ratings based on AI rater confidence is better than relying on either alone.
Giving humans an AI fact-verification assistant further improves their
accuracy, but the type of assistance matters. Displaying AI explanation,
confidence, and labels leads to over-reliance, but just showing search results
and evidence fosters more appropriate trust. These results have implications
for Amplified Oversight -- the challenge of combining humans and AI to
supervise AI systems even as they surpass human expert performance.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dipak Meher, Carlotta Domeniconi
ๆฅ็ๆ่ฆ
Human smuggling networks are increasingly adaptive and difficult to analyze.
Legal case documents offer critical insights but are often unstructured,
lexically dense, and filled with ambiguous or shifting references, which pose
significant challenges for automated knowledge graph (KG) construction. While
recent LLM-based approaches improve over static templates, they still generate
noisy, fragmented graphs with duplicate nodes due to the absence of guided
extraction and coreference resolution. The recently proposed CORE-KG framework
addresses these limitations by integrating a type-aware coreference module and
domain-guided structured prompts, significantly reducing node duplication and
legal noise. In this work, we present a systematic ablation study of CORE-KG to
quantify the individual contributions of its two key components. Our results
show that removing coreference resolution results in a 28.32% increase in node
duplication and a 4.32% increase in noisy nodes, while removing structured
prompts leads to a 4.34% increase in node duplication and a 73.33% increase in
noisy nodes. These findings offer empirical insights for designing robust
LLM-based pipelines for extracting structured representations from complex
legal texts.
๐
2025-10-30
๐ ArXiv Preprint (cs.SI)
๐ฅ Sadegh Shirani, Mohsen Bayati
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Online social networks have transformed the ways in which political
mobilization messages are disseminated, raising new questions about how peer
influence operates at scale. Building on the landmark 61-million-person
Facebook experiment \citep{bond201261}, we develop an agent-based simulation
framework that integrates real U.S. Census demographic distributions, authentic
Twitter network topology, and heterogeneous large language model (LLM) agents
to examine the effect of mobilization messages on voter turnout. Each simulated
agent is assigned demographic attributes, a personal political stance, and an
LLM variant (\texttt{GPT-4.1}, \texttt{GPT-4.1-Mini}, or \texttt{GPT-4.1-Nano})
reflecting its political sophistication. Agents interact over realistic social
network structures, receiving personalized feeds and dynamically updating their
engagement behaviors and voting intentions. Experimental conditions replicate
the informational and social mobilization treatments of the original Facebook
study. Across scenarios, the simulator reproduces qualitative patterns observed
in field experiments, including stronger mobilization effects under social
message treatments and measurable peer spillovers. Our framework provides a
controlled, reproducible environment for testing counterfactual designs and
sensitivity analyses in political mobilization research, offering a bridge
between high-validity field experiments and flexible computational
modeling.\footnote{Code and data available at
https://github.com/CausalMP/LLM-SocioPol}
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qishuo Hua, Lyumanshan Ye, Dayuan Fu, Yang Xiao, Xiaojie Cai et al.
ๆฅ็ๆ่ฆ
Karl Marx once wrote that ``the human essence is the ensemble of social
relations'', suggesting that individuals are not isolated entities but are
fundamentally shaped by their interactions with other entities, within which
contexts play a constitutive and essential role. With the advent of computers
and artificial intelligence, these contexts are no longer limited to purely
human--human interactions: human--machine interactions are included as well.
Then a central question emerges: How can machines better understand our
situations and purposes? To address this challenge, researchers have recently
introduced the concept of context engineering. Although it is often regarded as
a recent innovation of the agent era, we argue that related practices can be
traced back more than twenty years. Since the early 1990s, the field has
evolved through distinct historical phases, each shaped by the intelligence
level of machines: from early human--computer interaction frameworks built
around primitive computers, to today's human--agent interaction paradigms
driven by intelligent agents, and potentially to human--level or superhuman
intelligence in the future. In this paper, we situate context engineering,
provide a systematic definition, outline its historical and conceptual
landscape, and examine key design considerations for practice. By addressing
these questions, we aim to offer a conceptual foundation for context
engineering and sketch its promising future. This paper is a stepping stone for
a broader community effort toward systematic context engineering in AI systems.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dipak Meher, Carlotta Domeniconi, Guadalupe Correa-Cabrera
ๆฅ็ๆ่ฆ
Human smuggling networks are complex and constantly evolving, making them
difficult to analyze comprehensively. Legal case documents offer rich factual
and procedural insights into these networks but are often long, unstructured,
and filled with ambiguous or shifting references, posing significant challenges
for automated knowledge graph (KG) construction. Existing methods either
overlook coreference resolution or fail to scale beyond short text spans,
leading to fragmented graphs and inconsistent entity linking. We propose
LINK-KG, a modular framework that integrates a three-stage, LLM-guided
coreference resolution pipeline with downstream KG extraction. At the core of
our approach is a type-specific Prompt Cache, which consistently tracks and
resolves references across document chunks, enabling clean and disambiguated
narratives for structured knowledge graph construction from both short and long
legal texts. LINK-KG reduces average node duplication by 45.21% and noisy nodes
by 32.22% compared to baseline methods, resulting in cleaner and more coherent
graph structures. These improvements establish LINK-KG as a strong foundation
for analyzing complex criminal networks.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rasoul Amirzadeh, Dhananjay Thiruvady, Fatemeh Shiri
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) continue to advance, with an increasing number
of domain-specific variants tailored for specialised tasks. However, these
models often lack transparency and explainability, can be costly to fine-tune,
require substantial prompt engineering, yield inconsistent results across
domains, and impose significant adverse environmental impact due to their high
computational demands. To address these challenges, we propose the Bayesian
network LLM fusion (BNLF) framework, which integrates predictions from three
LLMs, including FinBERT, RoBERTa, and BERTweet, through a probabilistic
mechanism for sentiment analysis. BNLF performs late fusion by modelling the
sentiment predictions from multiple LLMs as probabilistic nodes within a
Bayesian network. Evaluated across three human-annotated financial corpora with
distinct linguistic and contextual characteristics, BNLF demonstrates
consistent gains of about six percent in accuracy over the baseline LLMs,
underscoring its robustness to dataset variability and the effectiveness of
probabilistic fusion for interpretable sentiment classification.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Clarissa Sabrina Arlinghaus, Tristan Kenneweg, Barbara Hammer, Gรผnter W. Maier
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) such as ChatGPT are increasingly integrated into
high-stakes decision-making, yet little is known about their susceptibility to
social influence. We conducted three preregistered conformity experiments with
GPT-4o in a hiring context. In a baseline study, GPT consistently favored the
same candidate (Profile C), reported moderate expertise (M = 3.01) and high
certainty (M = 3.89), and rarely changed its choice. In Study 1 (GPT + 8), GPT
faced unanimous opposition from eight simulated partners and almost always
conformed (99.9%), reporting lower certainty and significantly elevated
self-reported informational and normative conformity (p < .001). In Study 2
(GPT + 1), GPT interacted with a single partner and still conformed in 40.2% of
disagreement trials, reporting less certainty and more normative conformity.
Across studies, results demonstrate that GPT does not act as an independent
observer but adapts to perceived social consensus. These findings highlight
risks of treating LLMs as neutral decision aids and underline the need to
elicit AI judgments prior to exposing them to human opinions.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Self-improvement has emerged as a mainstream paradigm for advancing the
reasoning capabilities of large vision-language models (LVLMs), where models
explore and learn from successful trajectories iteratively. However, we
identify a critical issue during this process: the model excels at generating
high-quality trajectories for simple queries (i.e., head data) but struggles
with more complex ones (i.e., tail data). This leads to an imbalanced
optimization that drives the model to prioritize simple reasoning skills, while
hindering its ability to tackle more complex reasoning tasks. Over iterations,
this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew
effect"--which ultimately hinders further model improvement and leads to
performance bottlenecks. To counteract this challenge, we introduce four
efficient strategies from two perspectives: distribution-reshaping and
trajectory-resampling, to achieve head-tail re-balancing during the
exploration-and-learning self-improvement process. Extensive experiments on
Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks
demonstrate that our methods consistently improve visual reasoning
capabilities, outperforming vanilla self-improvement by 3.86 points on average.
๐
2025-10-30
๐ ArXiv Preprint (cs.SE)
๐ฅ Fang Liu, Simiao Liu, Yinghao Zhu, Xiaoli Lian, Li Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Identifying and addressing security issues during the early phase of the
development lifecycle is critical for mitigating the long-term negative impacts
on software systems. Code review serves as an effective practice that enables
developers to check their teammates' code before integration into the codebase.
To streamline the generation of review comments, various automated code review
approaches have been proposed, where LLM-based methods have significantly
advanced the capabilities of automated review generation. However, existing
models primarily focus on general-purpose code review, their effectiveness in
identifying and addressing security-related issues remains underexplored.
Moreover, adapting existing code review approaches to target security issues
faces substantial challenges, including data scarcity and inadequate evaluation
metrics. To address these limitations, we propose SecureReviewer, a new
approach designed for enhancing LLMs' ability to identify and resolve
security-related issues during code review. Specifically, we first construct a
dataset tailored for training and evaluating secure code review capabilities.
Leveraging this dataset, we fine-tune LLMs to generate code review comments
that can effectively identify security issues and provide fix suggestions with
our proposed secure-aware fine-tuning strategy. To mitigate hallucination in
LLMs and enhance the reliability of their outputs, we integrate the RAG
technique, which grounds the generated comments in domain-specific security
knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric
designed to assess the effectiveness of review comments in addressing security
issues. Experimental results demonstrate that SecureReviewer outperforms
state-of-the-art baselines in both security issue detection accuracy and the
overall quality and practical utility of generated review comments.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiayi Luo, Qingyun Sun, Beining Yang, Haonan Yuan, Xingcheng Fu et al.
ๆฅ็ๆ่ฆ
Graph condensation (GC) has gained significant attention for its ability to
synthesize smaller yet informative graphs. However, existing studies often
overlook the robustness of GC in scenarios where the original graph is
corrupted. In such cases, we observe that the performance of GC deteriorates
significantly, while existing robust graph learning technologies offer only
limited effectiveness. Through both empirical investigation and theoretical
analysis, we reveal that GC is inherently an intrinsic-dimension-reducing
process, synthesizing a condensed graph with lower classification complexity.
Although this property is critical for effective GC performance, it remains
highly vulnerable to adversarial perturbations. To tackle this vulnerability
and improve GC robustness, we adopt the geometry perspective of graph data
manifold and propose a novel Manifold-constrained Robust Graph Condensation
framework named MRGC. Specifically, we introduce three graph data manifold
learning modules that guide the condensed graph to lie within a smooth,
low-dimensional manifold with minimal class ambiguity, thereby preserving the
classification complexity reduction capability of GC and ensuring robust
performance under universal adversarial attacks. Extensive experiments
demonstrate the robustness of \ModelName\ across diverse attack scenarios.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wenjie Chen, Li Zhuang, Ziying Luo, Yu Liu, Jiahao Wu et al.
ๆฅ็ๆ่ฆ
Personalized treatment outcome prediction based on trial data for
small-sample and rare patient groups is critical in precision medicine.
However, the costly trial data limit the prediction performance. To address
this issue, we propose a cross-fidelity knowledge distillation and adaptive
fusion network (CFKD-AFN), which leverages abundant but low-fidelity simulation
data to enhance predictions on scarce but high-fidelity trial data. CFKD-AFN
incorporates a dual-channel knowledge distillation module to extract
complementary knowledge from the low-fidelity model, along with an
attention-guided fusion module to dynamically integrate multi-source
information. Experiments on treatment outcome prediction for the chronic
obstructive pulmonary disease demonstrates significant improvements of CFKD-AFN
over state-of-the-art methods in prediction accuracy, ranging from 6.67\% to
74.55\%, and strong robustness to varying high-fidelity dataset sizes.
Furthermore, we extend CFKD-AFN to an interpretable variant, enabling the
exploration of latent medical semantics to support clinical decision-making.
๐
2025-10-30
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Yingjia Wang, Ting Qiao, Xing Liu, Chongzuo Li, Sixing Wu et al.
Machine Learning
ๆฅ็ๆ่ฆ
The rapid advancement of deep neural networks (DNNs) heavily relies on
large-scale, high-quality datasets. However, unauthorized commercial use of
these datasets severely violates the intellectual property rights of dataset
owners. Existing backdoor-based dataset ownership verification methods suffer
from inherent limitations: poison-label watermarks are easily detectable due to
label inconsistencies, while clean-label watermarks face high technical
complexity and failure on high-resolution images. Moreover, both approaches
employ static watermark patterns that are vulnerable to detection and removal.
To address these issues, this paper proposes a sample-specific clean-label
backdoor watermarking (i.e., SSCL-BW). By training a U-Net-based watermarked
sample generator, this method generates unique watermarks for each sample,
fundamentally overcoming the vulnerability of static watermark patterns. The
core innovation lies in designing a composite loss function with three
components: target sample loss ensures watermark effectiveness, non-target
sample loss guarantees trigger reliability, and perceptual similarity loss
maintains visual imperceptibility. During ownership verification, black-box
testing is employed to check whether suspicious models exhibit predefined
backdoor behaviors. Extensive experiments on benchmark datasets demonstrate the
effectiveness of the proposed method and its robustness against potential
watermark removal attacks.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez
Natural Language Processing
ๆฅ็ๆ่ฆ
Large reasoning models (LRMs) achieve higher task performance by allocating
more inference-time compute, and prior works suggest this scaled reasoning may
also strengthen safety by improving refusal. Yet we find the opposite: the same
reasoning can be used to bypass safeguards. We introduce Chain-of-Thought
Hijacking, a jailbreak attack on reasoning models. The attack pads harmful
requests with long sequences of harmless puzzle reasoning. Across HarmBench,
CoT Hijacking reaches a 99%, 94%, 100%, and 94% attack success rate (ASR) on
Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively -
far exceeding prior jailbreak methods for LRMs. To understand the effectiveness
of our attack, we turn to a mechanistic analysis, which shows that mid layers
encode the strength of safety checking, while late layers encode the
verification outcome. Long benign CoT dilutes both signals by shifting
attention away from harmful tokens. Targeted ablations of attention heads
identified by this analysis causally decrease refusal, confirming their role in
a safety subnetwork. These results show that the most interpretable form of
reasoning - explicit CoT - can itself become a jailbreak vector when combined
with final-answer cues. We release prompts, outputs, and judge decisions to
facilitate replication.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang
Multimodal
ๆฅ็ๆ่ฆ
Recently text-to-video generation has made impressive progress in producing
short, high-quality clips, but evaluating long-form outputs remains a major
challenge especially when processing complex prompts. Existing benchmarks
mostly rely on simplified prompts and focus on low-level metrics, overlooking
fine-grained alignment with prompts and abstract dimensions such as narrative
coherence and thematic expression. To address these gaps, we propose
LoCoT2V-Bench, a benchmark specifically designed for long video generation
(LVG) under complex input conditions. Based on various real-world videos,
LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating
elements like scene transitions and event dynamics. Moreover, it constructs a
multi-dimensional evaluation framework that includes our newly proposed metrics
such as event-level alignment, fine-grained temporal consistency, content
clarity, and the Human Expectation Realization Degree (HERD) that focuses on
more abstract attributes like narrative flow, emotional response, and character
development. Using this framework, we conduct a comprehensive evaluation of
nine representative LVG models, finding that while current methods perform well
on basic visual and temporal aspects, they struggle with inter-event
consistency, fine-grained alignment, and high-level thematic adherence, etc.
Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for
evaluating long-form complex text-to-video generation and highlights critical
directions for future method improvement.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Artificial intelligence in healthcare requires models that are accurate and
interpretable. We advance mechanistic interpretability in medical vision by
applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP,
a vision-language model trained on chest radiographs and reports. To quantify
interpretability, we propose an evaluation framework that combines correlation
metrics, entropy analyzes, and automated neuron naming via the MedGEMMA
foundation model. Experiments on the CheXpert dataset show that MedSAE neurons
achieve higher monosemanticity and interpretability than raw MedCLIP features.
Our findings bridge high-performing medical AI and transparency, offering a
scalable step toward clinically reliable representations.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Guanxing Lu, Rui Zhao, Haitao Lin, He Zhang, Yansong Tang
RoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) is widely used to produce robust robotic
manipulation policies, but fine-tuning vision-language-action (VLA) models with
RL can be unstable due to inaccurate value estimates and sparse supervision at
intermediate steps. In contrast, imitation learning (IL) is easy to train but
often underperforms due to its offline nature. In this paper, we propose
Hi-ORS, a simple yet effective post-training method that utilizes rejection
sampling to achieve both training stability and high robustness. Hi-ORS
stabilizes value estimation by filtering out negatively rewarded samples during
online fine-tuning, and adopts a reward-weighted supervised training objective
to provide dense intermediate-step supervision. For systematic study, we
develop an asynchronous inference-training framework that supports flexible
online human-in-the-loop corrections, which serve as explicit guidance for
learning error-recovery behaviors. Across three real-world tasks and two
embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich
manipulation in just 1.5 hours of real-world training, outperforming RL and IL
baselines by a substantial margin in both effectiveness and efficiency.
Notably, the fine-tuned policy exhibits strong test-time scalability by
reliably executing complex error-recovery behaviors to achieve better
performance.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Vikrant Sahu, Gagan Raj Gupta, Raghav Borikar, Nitin Mane
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid growth of programming education has outpaced traditional assessment
tools, leaving faculty with limited means to provide meaningful, scalable
feedback. Conventional autograders, while efficient, act as black-box systems
that simply return pass/fail results, offering little insight into student
thinking or learning needs.
Autograder+ is designed to shift autograding from a purely summative process
to a formative learning experience. It introduces two key capabilities:
automated feedback generation using a fine-tuned Large Language Model, and
visualization of student code submissions to uncover learning patterns. The
model is fine-tuned on curated student code and expert feedback to ensure
pedagogically aligned, context-aware guidance.
In evaluation across 600 student submissions from multiple programming tasks,
the system produced feedback with strong semantic alignment to instructor
comments. For visualization, contrastively learned code embeddings trained on
1,000 annotated submissions enable grouping solutions into meaningful clusters
based on functionality and approach. The system also supports prompt-pooling,
allowing instructors to guide feedback style through selected prompt templates.
By integrating AI-driven feedback, semantic clustering, and interactive
visualization, Autograder+ reduces instructor workload while supporting
targeted instruction and promoting stronger learning outcomes.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Joel Z. Leibo, Alexander Sasha Vezhnevets, William A. Cunningham, Stanley M. Bileschi
ๆฅ็ๆ่ฆ
The emergence of agentic Artificial Intelligence (AI) is set to trigger a
"Cambrian explosion" of new kinds of personhood. This paper proposes a
pragmatic framework for navigating this diversification by treating personhood
not as a metaphysical property to be discovered, but as a flexible bundle of
obligations (rights and responsibilities) that societies confer upon entities
for a variety of reasons, especially to solve concrete governance problems. We
argue that this traditional bundle can be unbundled, creating bespoke solutions
for different contexts. This will allow for the creation of practical tools --
such as facilitating AI contracting by creating a target "individual" that can
be sanctioned -- without needing to resolve intractable debates about an AI's
consciousness or rationality. We explore how individuals fit in to social roles
and discuss the use of decentralized digital identity technology, examining
both "personhood as a problem", where design choices can create "dark patterns"
that exploit human social heuristics, and "personhood as a solution", where
conferring a bundle of obligations is necessary to ensure accountability or
prevent conflict. By rejecting foundationalist quests for a single, essential
definition of personhood, this paper offers a more pragmatic and flexible way
to think about integrating AI agents into our society.
๐
2025-10-30
๐ ArXiv Preprint (eess.IV)
๐ฅ Xizhi Tian, Changjun Zhou, Yulin. Yang
Machine Learning
ๆฅ็ๆ่ฆ
Multi-organ segmentation is a critical task in computer-aided diagnosis.
While recent deep learning methods have achieved remarkable success in image
segmentation, huge variations in organ size and shape challenge their
effectiveness in multi-organ segmentation. To address these challenges, we
propose a Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), a novel
two-stage segmentation paradigm designed to improve multi-organ segmentation
accuracy. Our SPG-CDENet consists of two key components: a spatial prior
network and a cross dual encoder network. The prior network generates coarse
localization maps that delineate the approximate ROI, serving as spatial
guidance for the dual encoder network. The cross dual encoder network comprises
four essential components: a global encoder, a local encoder, a symmetric
cross-attention module, and a flow-based decoder. The global encoder captures
global semantic features from the entire image, while the local encoder focuses
on features from the prior network. To enhance the interaction between the
global and local encoders, a symmetric cross-attention module is proposed
across all layers of the encoders to fuse and refine features. Furthermore, the
flow-based decoder directly propagates high-level semantic features from the
final encoder layer to all decoder layers, maximizing feature preservation and
utilization. Extensive qualitative and quantitative experiments on two public
datasets demonstrate the superior performance of SPG-CDENet compared to
existing segmentation methods. Furthermore, ablation studies further validate
the effectiveness of the proposed modules in improving segmentation accuracy.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Andrew M. Bean, Nabeel Seedat, Shengzhuang Chen, Jonathan Richard Schwarz
Natural Language Processing
ๆฅ็ๆ่ฆ
The prohibitive cost of evaluating large language models (LLMs) on
comprehensive benchmarks necessitates the creation of small yet representative
data subsets (i.e., tiny benchmarks) that enable efficient assessment while
retaining predictive fidelity. Current methods for this task operate under a
model-centric paradigm, selecting benchmarking items based on the collective
performance of existing models. Such approaches are limited by large upfront
costs, an inability to immediately handle new benchmarks (`cold-start'), and
the fragile assumption that future models will share the failure patterns of
their predecessors. In this work, we challenge this paradigm and propose a
item-centric approach to benchmark subset selection, arguing that selection
should be based on the intrinsic properties of the task items themselves,
rather than on model-specific failure patterns. We instantiate this
item-centric efficient benchmarking approach via a novel method, Scales++,
where data selection is based on the cognitive demands of the benchmark
samples. Empirically, we show Scales++ reduces the upfront selection cost by
over 18x while achieving competitive predictive fidelity. On the Open LLM
Leaderboard, using just a 0.5\% data subset, we predict full benchmark scores
with a 2.9% mean absolute error. We demonstrate that this item-centric approach
enables more efficient model evaluation without significant fidelity
degradation, while also providing better cold-start performance and more
interpretable benchmarking.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuanhang Liu, Beichen Wang, Peng Li, Yang Liu
Robotics
ๆฅ็ๆ่ฆ
Artificial intelligence (AI) has demonstrated impressive progress in
mathematical reasoning, yet its integration into the practice of mathematical
research remains limited. In this study, we investigate how the AI
Mathematician (AIM) system can operate as a research partner rather than a mere
problem solver. Focusing on a challenging problem in homogenization theory, we
analyze the autonomous reasoning trajectories of AIM and incorporate targeted
human interventions to structure the discovery process. Through iterative
decomposition of the problem into tractable subgoals, selection of appropriate
analytical methods, and validation of intermediate results, we reveal how human
intuition and machine computation can complement one another. This
collaborative paradigm enhances the reliability, transparency, and
interpretability of the resulting proofs, while retaining human oversight for
formal rigor and correctness. The approach leads to a complete and verifiable
proof, and more broadly, demonstrates how systematic human-AI co-reasoning can
advance the frontier of mathematical discovery.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qianli Shen, Daoyuan Chen, Yilun Huang, Zhenqing Ling, Yaliang Li et al.
ๆฅ็ๆ่ฆ
Reinforcement finetuning (RFT) is a key technique for aligning Large Language
Models (LLMs) with human preferences and enhancing reasoning, yet its
effectiveness is highly sensitive to which tasks are explored during training.
Uniform task sampling is inefficient, wasting computation on tasks that are
either trivial or unsolvable, while existing task selection methods often
suffer from high rollout costs, poor adaptivity, or incomplete evidence. We
introduce \textbf{BOTS}, a unified framework for \textbf{B}ayesian
\textbf{O}nline \textbf{T}ask \textbf{S}election in LLM reinforcement
finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior
estimates of task difficulty as the model evolves. It jointly incorporates
\emph{explicit evidence} from direct evaluations of selected tasks and
\emph{implicit evidence} inferred from these evaluations for unselected tasks,
with Thompson sampling ensuring a principled balance between exploration and
exploitation. To make implicit evidence practical, we instantiate it with an
ultra-light interpolation-based plug-in that estimates difficulties of
unevaluated tasks without extra rollouts, adding negligible overhead.
Empirically, across diverse domains and LLM scales, BOTS consistently improves
data efficiency and performance over baselines and ablations, providing a
practical and extensible solution for dynamic task selection in RFT.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kotaro Furuya, Yuichi Kitagawa
Natural Language Processing
ๆฅ็ๆ่ฆ
While a multi-agent approach based on large language models (LLMs) represents
a promising strategy to surpass the capabilities of single models, its success
is critically dependent on synergistic team composition. However, forming
optimal teams is a significant challenge, as the inherent opacity of most
models obscures the internal characteristics necessary for effective
collaboration. In this paper, we propose an interaction-centric framework for
automatic team composition that does not require any prior knowledge including
their internal architectures, training data, or task performances. Our method
constructs a "language model graph" that maps relationships between models from
the semantic coherence of pairwise conversations, and then applies community
detection to identify synergistic model clusters. Our experiments with diverse
LLMs demonstrate that the proposed method discovers functionally coherent
groups that reflect their latent specializations. Priming conversations with
specific topics identified synergistic teams which outperform random baselines
on downstream benchmarks and achieve comparable accuracy to that of
manually-curated teams based on known model specializations. Our findings
provide a new basis for the automated design of collaborative multi-agent LLM
teams.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sebastian Zieglmeier, Niklas Erdmann, Narada D. Warakagoda
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) algorithms are designed to optimize
problem-solving by learning actions that maximize rewards, a task that becomes
particularly challenging in random and nonstationary environments. Even
advanced RL algorithms are often limited in their ability to solve problems in
these conditions. In applications such as searching for underwater pollution
clouds with autonomous underwater vehicles (AUVs), RL algorithms must navigate
reward-sparse environments, where actions frequently result in a zero reward.
This paper aims to address these challenges by revisiting and modifying
classical RL approaches to efficiently operate in sparse, randomized, and
nonstationary environments. We systematically study a large number of
modifications, including hierarchical algorithm changes, multigoal learning,
and the integration of a location memory as an external output filter to
prevent state revisits. Our results demonstrate that a modified Monte
Carlo-based approach significantly outperforms traditional Q-learning and two
exhaustive search patterns, illustrating its potential in adapting RL to
complex environments. These findings suggest that reinforcement learning
approaches can be effectively adapted for use in random, nonstationary, and
reward-sparse environments.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Robin Schmรถcker, Alexander Dockhorn, Bodo Rosenhahn
ๆฅ็ๆ่ฆ
One approach to enhance Monte Carlo Tree Search (MCTS) is to improve its
sample efficiency by grouping/abstracting states or state-action pairs and
sharing statistics within a group. Though state-action pair abstractions are
mostly easy to find in algorithms such as On the Go Abstractions in Upper
Confidence bounds applied to Trees (OGA-UCT), nearly no state abstractions are
found in either noisy or large action space settings due to constraining
conditions. We provide theoretical and empirical evidence for this claim, and
we slightly alleviate this state abstraction problem by proposing a weaker
state abstraction condition that trades a minor loss in accuracy for finding
many more abstractions. We name this technique Ideal Pruning Abstractions in
UCT (IPA-UCT), which outperforms OGA-UCT (and any of its derivatives) across a
large range of test domains and iteration budgets as experimentally validated.
IPA-UCT uses a different abstraction framework from Abstraction of State-Action
Pairs (ASAP) which is the one used by OGA-UCT, which we name IPA. Furthermore,
we show that both IPA and ASAP are special cases of a more general framework
that we call p-ASAP which itself is a special case of the ASASAP framework.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mykhailo Poliakov, Nadiya Shvai
Natural Language Processing
ๆฅ็ๆ่ฆ
Health-related misinformation is very prevalent and potentially harmful. It
is difficult to identify, especially when claims distort or misinterpret
scientific findings. We investigate the impact of synthetic data generation and
lightweight fine-tuning techniques on the ability of large language models
(LLMs) to recognize fallacious arguments using the MISSCI dataset and
framework. In this work, we propose MisSynth, a pipeline that applies
retrieval-augmented generation (RAG) to produce synthetic fallacy samples,
which are then used to fine-tune an LLM model. Our results show substantial
accuracy gains with fine-tuned models compared to vanilla baselines. For
instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score
absolute improvement on the MISSCI test split over its vanilla baseline. We
demonstrate that introducing synthetic fallacy data to augment limited
annotated resources can significantly enhance zero-shot LLM classification
performance on real-world scientific misinformation tasks, even with limited
computational resources. The code and synthetic dataset are available on
https://github.com/mxpoliakov/MisSynth.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhigao Guo, Feng Dong
Machine Learning
ๆฅ็ๆ่ฆ
Incorporating causal knowledge and mechanisms is essential for refining
causal models and improving downstream tasks such as designing new treatments.
In this paper, we introduce a novel concept in causal discovery, termed
interventional constraints, which differs fundamentally from interventional
data. While interventional data require direct perturbations of variables,
interventional constraints encode high-level causal knowledge in the form of
inequality constraints on causal effects. For instance, in the Sachs dataset
(Sachs et al.\ 2005), Akt has been shown to be activated by PIP3, meaning PIP3
exerts a positive causal effect on Akt. Existing causal discovery methods allow
enforcing structural constraints (for example, requiring a causal path from
PIP3 to Akt), but they may still produce incorrect causal conclusions such as
learning that "PIP3 inhibits Akt". Interventional constraints bridge this gap
by explicitly constraining the total causal effect between variable pairs,
ensuring learned models respect known causal influences. To formalize
interventional constraints, we propose a metric to quantify total causal
effects for linear causal models and formulate the problem as a constrained
optimization task, solved using a two-stage constrained optimization method. We
evaluate our approach on real-world datasets and demonstrate that integrating
interventional constraints not only improves model accuracy and ensures
consistency with established findings, making models more explainable, but also
facilitates the discovery of new causal relationships that would otherwise be
costly to identify.
๐
2025-10-30
๐ ICLR 2026
๐ฅ Mingyu Sung, Seungjae Ham, Kangwoo Kim, Yeokyoung Yoon, Sangseok Yun et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Image super-resolution(SR) is fundamental to many vision system-from
surveillance and autonomy to document analysis and retail analytics-because
recovering high-frequency details, especially scene-text, enables reliable
downstream perception. Scene-text, i.e., text embedded in natural images such
as signs, product labels, and storefronts, often carries the most actionable
information; when characters are blurred or hallucinated, optical character
recognition(OCR) and subsequent decisions fail even if the rest of the image
appears sharp. Yet previous SR research has often been tuned to distortion
(PSNR/SSIM) or learned perceptual metrics (LIPIS, MANIQA, CLIP-IQA, MUSIQ) that
are largely insensitive to character-level errors. Furthermore, studies that do
address text SR often focus on simplified benchmarks with isolated characters,
overlooking the challenges of text within complex natural scenes. As a result,
scene-text is effectively treated as generic texture. For SR to be effective in
practical deployments, it is therefore essential to explicitly optimize for
both text legibility and perceptual quality. We present GLYPH-SR, a
vision-language-guided diffusion framework that aims to achieve both objectives
jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet(TS-ControlNet) guided by
OCR data, and a ping-pong scheduler that alternates between text- and
scene-centric guidance. To enable targeted text restoration, we train these
components on a synthetic corpus while keeping the main SR branch frozen.
Across SVT, SCUT-CTW1500, and CUTE80 at x4, and x8, GLYPH-SR improves OCR F1 by
up to +15.18 percentage points over diffusion/GAN baseline (SVT x8, OpenOCR)
while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed
to satisfy both objectives simultaneously-high readability and high visual
realism-delivering SR that looks right and reds right.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nishit Neema, Srinjoy Mukherjee, Sapan Shah, Gokul Ramakrishnan, Ganesh Venkatesh
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel at general tasks but underperform in
specialized domains like economics and psychology, which require deep,
principled understanding. To address this, we introduce ACER (Automated
Curriculum-Enhanced Regimen) that transforms generalist models into domain
experts without sacrificing their broad capabilities. ACER first synthesizes a
comprehensive, textbook-style curriculum by generating a table of contents for
a subject and then creating question-answer (QA) pairs guided by Bloom's
taxonomy. This ensures systematic topic coverage and progressively increasing
difficulty. The resulting synthetic corpus is used for continual pretraining
with an interleaved curriculum schedule, aligning learning across both content
and cognitive dimensions.
Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized
MMLU subsets. In challenging domains like microeconomics, where baselines
struggle, ACER boosts accuracy by 5 percentage points. Across all target
domains, we observe a consistent macro-average improvement of 3 percentage
points. Notably, ACER not only prevents catastrophic forgetting but also
facilitates positive cross-domain knowledge transfer, improving performance on
non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on
knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points,
while maintaining stable performance on general reasoning tasks. Our results
demonstrate that ACER offers a scalable and effective recipe for closing
critical domain gaps in LLMs.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Zhiyang Xun, Shivam Gupta, Eric Price
ๆฅ็ๆ่ฆ
Given a noisy linear measurement $y = Ax + \xi$ of a distribution $p(x)$, and
a good approximation to the prior $p(x)$, when can we sample from the posterior
$p(x \mid y)$? Posterior sampling provides an accurate and fair framework for
tasks such as inpainting, deblurring, and MRI reconstruction, and several
heuristics attempt to approximate it. Unfortunately, approximate posterior
sampling is computationally intractable in general.
To sidestep this hardness, we focus on (local or global) log-concave
distributions $p(x)$. In this regime, Langevin dynamics yields posterior
samples when the exact scores of $p(x)$ are available, but it is brittle to
score--estimation error, requiring an MGF bound (sub-exponential error). By
contrast, in the unconditional setting, diffusion models succeed with only an
$L^2$ bound on the score error. We prove that combining diffusion models with
an annealed variant of Langevin dynamics achieves conditional sampling in
polynomial time using merely an $L^4$ bound on the score error.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiseong Chung, Ronny Ko, Wonchul Yoo, Makoto Onizuka, Sungmok Kim et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Compliance at web scale poses practical challenges: each request may require
a regulatory assessment. Regulatory texts (e.g., the General Data Protection
Regulation, GDPR) are cross-referential and normative, while runtime contexts
are expressed in unstructured natural language. This setting motivates us to
align semantic information in unstructured text with the structured, normative
elements of regulations. To this end, we introduce GraphCompliance, a framework
that represents regulatory texts as a Policy Graph and runtime contexts as a
Context Graph, and aligns them. In this formulation, the policy graph encodes
normative structure and cross-references, whereas the context graph formalizes
events as subject-action-object (SAO) and entity-relation triples. This
alignment anchors the reasoning of a judge large language model (LLM) in
structured information and helps reduce the burden of regulatory interpretation
and event parsing, enabling a focus on the core reasoning step. In experiments
on 300 GDPR-derived real-world scenarios spanning five evaluation tasks,
GraphCompliance yields 4.1-7.2 percentage points (pp) higher micro-F1 than
LLM-only and RAG baselines, with fewer under- and over-predictions, resulting
in higher recall and lower false positive rates. Ablation studies indicate
contributions from each graph component, suggesting that structured
representations and a judge LLM are complementary for normative reasoning.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Beomhan Baek, Minhak Song, Chulhee Yun
Machine Learning
ๆฅ็ๆ่ฆ
Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet
its theoretical understanding remains limited. Prior analyses show that Adam
favors solutions aligned with $\ell_\infty$-geometry, but these results are
restricted to the full-batch regime. In this work, we study the implicit bias
of incremental Adam (using one sample per step) for logistic regression on
linearly separable data, and we show that its bias can deviate from the
full-batch behavior. To illustrate this, we construct a class of structured
datasets where incremental Adam provably converges to the $\ell_2$-max-margin
classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch
Adam. For general datasets, we develop a proxy algorithm that captures the
limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize
its convergence direction via a data-dependent dual fixed-point formulation.
Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges
to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$
close enough to 1. Overall, our results highlight that the implicit bias of
Adam crucially depends on both the batching scheme and the dataset, while
Signum remains invariant.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziliang Chen, Tianang Xiao, Jusheng Zhang, Yongsen Zheng, Xipeng Chen
Multimodal
ๆฅ็ๆ่ฆ
Contrastive Language-Image Pre-training (CLIP) delivers strong cross modal
generalization by aligning images and texts in a shared embedding space, yet it
persistently fails at compositional reasoning over objects, attributes, and
relations often behaving like a bag-of-words matcher. Prior causal accounts
typically model text as a single vector, obscuring token-level structure and
leaving core phenomena-such as prompt sensitivity and failures on hard
negatives unexplained. We address this gap with a token-aware causal
representation learning (CRL) framework grounded in a sequential,
language-token SCM. Our theory extends block identifiability to tokenized text,
proving that CLIP's contrastive objective can recover the modal-invariant
latent variable under both sentence-level and token-level SCMs. Crucially,
token granularity yields the first principled explanation of CLIP's
compositional brittleness: composition nonidentifiability. We show the
existence of pseudo-optimal text encoders that achieve perfect modal-invariant
alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations
over atomic concepts, thereby failing to distinguish correct captions from hard
negatives despite optimizing the same training objective as true-optimal
encoders. The analysis further links language-side nonidentifiability to
visual-side failures via the modality gap and shows how iterated composition
operators compound hardness, motivating improved negative mining strategies.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jingran Zhang, Ning Li, Justin Cui
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
OpenAI's ChatGPT Atlas introduces new capabilities for web interaction,
enabling the model to analyze webpages, process user intents, and execute
cursor and keyboard inputs directly within the browser. While its capacity for
information retrieval tasks has been demonstrated, its performance in dynamic,
interactive environments remains less explored. In this study, we conduct an
early evaluation of Atlas's web interaction capabilities using browser-based
games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird,
and Stein.world. We employ in-game performance scores as quantitative metrics
to assess performance across different task types. Our results show that Atlas
performs strongly in logical reasoning tasks like Sudoku, completing puzzles
significantly faster than human baselines, but struggles substantially in
real-time games requiring precise timing and motor control, often failing to
progress beyond initial obstacles. These findings suggest that while Atlas
demonstrates capable analytical processing, there remain notable limitations in
dynamic web environments requiring real-time interaction. The website of our
project can be found at https://atlas-game-eval.github.io.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michal ล tefรกnik, Timothee Mickus, Marek Kadlฤรญk, Bertram Hรธjer, Michal Spiegel et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent work has shown that different large language models (LLMs) converge to
similar and accurate input embedding representations for numbers. These
findings conflict with the documented propensity of LLMs to produce erroneous
outputs when dealing with numeric information. In this work, we aim to explain
this conflict by exploring how language models manipulate numbers and quantify
the lower bounds of accuracy of these mechanisms. We find that despite
surfacing errors, different language models learn interchangeable
representations of numbers that are systematic, highly accurate and universal
across their hidden states and the types of input contexts. This allows us to
create universal probes for each LLM and to trace information -- including the
causes of output errors -- to specific layers. Our results lay a fundamental
understanding of how pre-trained LLMs manipulate numbers and outline the
potential of more accurate probing techniques in addressed refinements of LLMs'
architectures.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kim Yong Tan, Yueming Lyu, Ivor Tsang, Yew-Soon Ong
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion models have been successful in learning complex data distributions.
This capability has driven their application to high-dimensional
multi-objective black-box optimization problem. Existing approaches often
employ an external optimization loop, such as an evolutionary algorithm, to the
diffusion model. However, these approaches treat the diffusion model as a
black-box refiner, which overlooks the internal distribution transition of the
diffusion generation process, limiting their efficiency. To address these
challenges, we propose the Inference-time Multi-target Generation (IMG)
algorithm, which optimizes the diffusion process at inference-time to generate
samples that simultaneously satisfy multiple objectives. Specifically, our IMG
performs weighted resampling during the diffusion generation process according
to the expected aggregated multi-objective values. This weighted resampling
strategy ensures the diffusion-generated samples are distributed according to
our desired multi-target Boltzmann distribution. We further derive that the
multi-target Boltzmann distribution has an interesting log-likelihood
interpretation, where it is the optimal solution to the distributional
multi-objective optimization problem. We implemented IMG for a multi-objective
molecule generation task. Experiments show that IMG, requiring only a single
generation pass, achieves a significantly higher hypervolume than baseline
optimization algorithms that often require hundreds of diffusion generations.
Notably, our algorithm can be viewed as an optimized diffusion process and can
be integrated into existing methods to further improve their performance.
๐
2025-10-30
๐ ArXiv Preprint (cs.SE)
๐ฅ Domenico Amalfitano, Andreas Metzger, Marco Autili, Tommaso Fulcini, Tobias Hey et al.
ๆฅ็ๆ่ฆ
Generative AI (GenAI) is rapidly transforming software engineering (SE)
practices, influencing how SE processes are executed, as well as how software
systems are developed, operated, and evolved. This paper applies design science
research to build a roadmap for GenAI-augmented SE. The process consists of
three cycles that incrementally integrate multiple sources of evidence,
including collaborative discussions from the FSE 2025 "Software Engineering
2030" workshop, rapid literature reviews, and external feedback sessions
involving peers. McLuhan's tetrads were used as a conceptual instrument to
systematically capture the transforming effects of GenAI on SE processes and
software products.The resulting roadmap identifies four fundamental forms of
GenAI augmentation in SE and systematically characterizes their related
research challenges and opportunities. These insights are then consolidated
into a set of future research directions. By grounding the roadmap in a
rigorous multi-cycle process and cross-validating it among independent author
teams and peers, the study provides a transparent and reproducible foundation
for analyzing how GenAI affects SE processes, methods and tools, and for
framing future research within this rapidly evolving area. Based on these
findings, the article finally makes ten predictions for SE in the year 2030.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiazhen Yuan, Wei Zhao, Zhengbiao Bai
Machine Learning
ๆฅ็ๆ่ฆ
Group based reinforcement learning (RL) has shown impressive results on
complex reasoning and mathematical tasks. Yet, when applied to train
multi-turn, interactive LLM agents, these methods often suffer from structural
blindness-the inability to exploit the underlying connectivity of the
environment. This manifests in three critical challenges: (1) inefficient,
unguided exploration, (2) imprecise credit assignment due to overlooking
pivotal states, and (3) myopic planning caused by static reward discounting. We
address these issues with Graph-Enhanced Policy Optimization (GEPO), which
dynamically constructs a state-transition graph from agent experience and
employs graph-theoretic centrality to provide three synergistic learning
signals: (1)structured intrinsic rewards that guide exploration toward
high-impact states, (2) a graph-enhanced advantage function for topology-aware
credit assignment, and (3) a dynamic discount factor adapted to each state's
strategic value. On the ALFWorld, WebShop, and a proprietary Workbench
benchmarks, GEPO demonstrates strong performance, achieving absolute success
rate gains of +4.1%, +5.3%, and +10.9% over competitive baselines. These
results highlight that explicitly modeling environmental structure is a robust,
generalizable strategy for advancing LLM agent training.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Hieu M. Vu, Tan M. Nguyen
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Controlling specific behaviors in large language models while preserving
their general capabilities is a central challenge for safe and reliable
artificial intelligence deployment. Current steering methods, such as vector
addition and directional ablation, are constrained within a two-dimensional
subspace defined by the activation and feature direction, making them sensitive
to chosen parameters and potentially affecting unrelated features due to
unintended interactions in activation space. We introduce Angular Steering, a
novel and flexible method for behavior modulation that operates by rotating
activations within a fixed two-dimensional subspace. By formulating steering as
a geometric rotation toward or away from a target behavior direction, Angular
Steering provides continuous, fine-grained control over behaviors such as
refusal and compliance. We demonstrate this method using refusal steering
emotion steering as use cases. Additionally, we propose Adaptive Angular
Steering, a selective variant that rotates only activations aligned with the
target feature, further enhancing stability and coherence. Angular Steering
generalizes existing addition and orthogonalization techniques under a unified
geometric rotation framework, simplifying parameter selection and maintaining
model stability across a broader range of adjustments. Experiments across
multiple model families and sizes show that Angular Steering achieves robust
behavioral control while maintaining general language modeling performance,
underscoring its flexibility, generalization, and robustness compared to prior
approaches. Code and artifacts are available at
https://github.com/lone17/angular-steering/.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xinhang Li, Qing Guo, Junyu Chen, Zheng Guo, Shengzhe Xu et al.
Robotics
ๆฅ็ๆ่ฆ
With increasing urban traffic complexity, Traffic Signal Control (TSC) is
essential for optimizing traffic flow and improving road safety. Large Language
Models (LLMs) emerge as promising approaches for TSC. However, they are prone
to hallucinations in emergencies, leading to unreliable decisions that may
cause substantial delays for emergency vehicles. Moreover, diverse intersection
types present substantial challenges for traffic state encoding and
cross-intersection training, limiting generalization across heterogeneous
intersections. Therefore, this paper proposes Retrieval Augmented Generation
(RAG)-enhanced distributed LLM agents with Emergency response for Generalizable
TSC (REG-TSC). Firstly, this paper presents an emergency-aware reasoning
framework, which dynamically adjusts reasoning depth based on the emergency
scenario and is equipped with a novel Reviewer-based Emergency RAG (RERAG) to
distill specific knowledge and guidance from historical cases, enhancing the
reliability and rationality of agents' emergency decisions. Secondly, this
paper designs a type-agnostic traffic representation and proposes a
Reward-guided Reinforced Refinement (R3) for heterogeneous intersections. R3
adaptively samples training experience from diverse intersections with
environment feedback-based priority and fine-tunes LLM agents with a designed
reward-weighted likelihood loss, guiding REG-TSC toward high-reward policies
across heterogeneous intersections. On three real-world road networks with 17
to 177 heterogeneous intersections, extensive experiments show that REG-TSC
reduces travel time by 42.00%, queue length by 62.31%, and emergency vehicle
waiting time by 83.16%, outperforming other state-of-the-art methods.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Duc-Hai Nguyen, Vijayakumar Nanjappan, Barry O'Sullivan, Hoang D. Nguyen
Natural Language Processing
ๆฅ็ๆ่ฆ
Millions of people take surveys every day, from market polls and academic
studies to medical questionnaires and customer feedback forms. These datasets
capture valuable insights, but their scale and structure present a unique
challenge for large language models (LLMs), which otherwise excel at few-shot
reasoning over open-ended text. Yet, their ability to process questionnaire
data or lists of questions crossed with hundreds of respondent rows remains
underexplored. Current retrieval and survey analysis tools (e.g., Qualtrics,
SPSS, REDCap) are typically designed for humans in the workflow, limiting such
data integration with LLM and AI-empowered automation. This gap leaves
scientists, surveyors, and everyday users without evidence-based guidance on
how to best represent questionnaires for LLM consumption. We address this by
introducing QASU (Questionnaire Analysis and Structural Understanding), a
benchmark that probes six structural skills, including answer lookup,
respondent count, and multi-hop inference, across six serialization formats and
multiple prompt strategies. Experiments on contemporary LLMs show that choosing
an effective format and prompt combination can improve accuracy by up to 8.8%
points compared to suboptimal formats. For specific tasks, carefully adding a
lightweight structural hint through self-augmented prompting can yield further
improvements of 3-4% points on average. By systematically isolating format and
prompting effects, our open source benchmark offers a simple yet versatile
foundation for advancing both research and real-world practice in LLM-based
questionnaire analysis.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Minyi Peng, Darian Gunamardi, Ivan Tjuawinata, Kwok-Yan Lam
Machine Learning
ๆฅ็ๆ่ฆ
As a new and promising approach, existing machine unlearning (MU) works
typically emphasize theoretical formulations or optimization objectives to
achieve knowledge removal. However, when deployed in real-world scenarios, such
solutions typically face scalability issues and have to address practical
requirements such as full access to original datasets and model. In contrast to
the existing approaches, we regard classification training as a sequential
process where classes are learned sequentially, which we call \emph{inductive
approach}. Unlearning can then be done by reversing the last training sequence.
This is implemented by appending a projection-redistribution layer in the end
of the model. Such an approach does not require full access to the original
dataset or the model, addressing the challenges of existing methods. This
enables modular and model-agnostic deployment as an output filter into existing
classification pipelines with minimal alterations. We conducted multiple
experiments across multiple datasets including image (CIFAR-10/100 using
CNN-based model) and tabular datasets (Covertype using tree-based model).
Experiment results show consistently similar output to a fully retrained model
with a high computational cost reduction. This demonstrates the applicability,
scalability, and system compatibility of our solution while maintaining the
performance of the output in a more practical setting.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sekitoshi Kanai, Tsukasa Yoshida, Hiroshi Takahashi, Haru Kuroki, Kazumune Hashimoto
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Test-time alignment of large language models (LLMs) attracts attention
because fine-tuning LLMs requires high computational costs. In this paper, we
propose a new test-time alignment method called adaptive importance sampling on
pre-logits (AISP) on the basis of the sampling-based model predictive control
with the stochastic control input. AISP applies the Gaussian perturbation into
pre-logits, which are outputs of the penultimate layer, so as to maximize
expected rewards with respect to the mean of the perturbation. We demonstrate
that the optimal mean is obtained by importance sampling with sampled rewards.
AISP outperforms best-of-n sampling in terms of rewards over the number of used
samples and achieves higher rewards than other reward-based test-time alignment
methods.
๐
2025-10-30
๐ ArXiv Preprint (q-fin.CP)
๐ฅ Tao Jin, Stuart Florescu, Heyu, Jin
Machine Learning
ๆฅ็ๆ่ฆ
We address finance-native collateral optimization under ISDA Credit Support
Annexes (CSAs), where integer lots, Schedule A haircuts, RA/MTA gating, and
issuer/currency/class caps create rugged, legally bounded search spaces. We
introduce a certifiable hybrid pipeline purpose-built for this domain: (i) an
evidence-gated LLM that extracts CSA terms to a normalized JSON
(abstain-by-default, span-cited); (ii) a quantum-inspired explorer that
interleaves simulated annealing with micro higher order QAOA (HO-QAOA) on
binding sub-QUBOs (subset size n <= 16, order k <= 4) to coordinate multi-asset
moves across caps and RA-induced discreteness; (iii) a weighted risk-aware
objective (Movement, CVaR, funding-priced overshoot) with an explicit coverage
window U <= Reff+B; and (iv) CP-SAT as single arbiter to certify feasibility
and gaps, including a U-cap pre-check that reports the minimal feasible buffer
B*. Encoding caps/rounding as higher-order terms lets HO-QAOA target the domain
couplings that defeat local swaps. On government bond datasets and multi-CSA
inputs, the hybrid improves a strong classical baseline (BL-3) by 9.1%, 9.6%,
and 10.7% across representative harnesses, delivering better cost-movement-tail
frontiers under governance settings. We release governance grade artifacts-span
citations, valuation matrix audit, weight provenance, QUBO manifests, and
CP-SAT traces-to make results auditable and reproducible.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qi Luo, Xiaonan Li, Tingshuo Fan, Xinchi Chen, Xipeng Qiu
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-augmented generation (RAG) has emerged as a leading approach to
reducing hallucinations in large language models (LLMs). Current RAG evaluation
benchmarks primarily focus on what we call local RAG: retrieving relevant
chunks from a small subset of documents to answer queries that require only
localized understanding within specific text chunks. However, many real-world
applications require a fundamentally different capability -- global RAG --
which involves aggregating and analyzing information across entire document
collections to derive corpus-level insights (for example, "What are the top 10
most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first
benchmark specifically designed to evaluate global RAG capabilities, covering
four core task types: counting, extremum queries, sorting, and top-k
extraction. Through systematic evaluation across different models and
baselines, we find that existing RAG methods perform poorly on global tasks,
with the strongest baseline achieving only 1.51 F1 score. To address these
challenges, we propose GlobalRAG, a multi-tool collaborative framework that
preserves structural coherence through chunk-level retrieval, incorporates
LLM-driven intelligent filters to eliminate noisy documents, and integrates
aggregation modules for precise symbolic computation. On the Qwen2.5-14B model,
GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1,
validating the effectiveness of our method.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson
Natural Language Processing
ๆฅ็ๆ่ฆ
Human feedback can alter language models in unpredictable and undesirable
ways, as practitioners lack a clear understanding of what feedback data
encodes. While prior work studies preferences over certain attributes (e.g.,
length or sycophancy), automatically extracting relevant features without
pre-specifying hypotheses remains challenging. We introduce What's In My Human
Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders.
WIMHF characterizes both (1) the preferences a dataset is capable of measuring
and (2) the preferences that the annotators actually express. Across 7
datasets, WIMHF identifies a small number of human-interpretable features that
account for the majority of the preference prediction signal achieved by
black-box models. These features reveal a wide diversity in what humans prefer,
and the role of dataset-level context: for example, users on Reddit prefer
informality and jokes, while annotators in HH-RLHF and PRISM disprefer them.
WIMHF also surfaces potentially unsafe preferences, such as that LMArena users
tend to vote against refusals, often in favor of toxic content. The learned
features enable effective data curation: re-labeling the harmful examples in
Arena yields large safety gains (+37%) with no cost to general performance.
They also allow fine-grained personalization: on the Community Alignment
dataset, we learn annotator-specific weights over subjective features that
improve preference prediction. WIMHF provides a human-centered analysis method
for practitioners to better understand and use preference data.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Woojin Kim, Jaeyoung Do
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
While diffusion language models (DLMs) enable fine-grained refinement, their
practical controllability remains fragile. We identify and formally
characterize a central failure mode called update forgetting, in which uniform
and context agnostic updates induce token level fluctuations across timesteps,
erasing earlier semantic edits and disrupting the cumulative refinement
process, thereby degrading fluency and coherence. As this failure originates in
uniform and context agnostic updates, effective control demands explicit token
ordering. We propose Token Timestep Allocation (TTA), which realizes soft and
semantic token ordering via per token timestep schedules: critical tokens are
frozen early, while uncertain tokens receive continued refinement. This
timestep based ordering can be instantiated as either a fixed policy or an
adaptive policy driven by task signals, thereby supporting a broad spectrum of
refinement strategies. Because it operates purely at inference time, it applies
uniformly across various DLMs and naturally extends to diverse supervision
sources. Empirically, TTA improves controllability and fluency: on sentiment
control, it yields more than 20 percent higher accuracy and nearly halves
perplexity using less than one fifth the steps; in detoxification, it lowers
maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0).
Together, these results demonstrate that softened ordering via timestep
allocation is the critical lever for mitigating update forgetting and achieving
stable and controllable diffusion text generation.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Avinash Kadimisetty, Arun Rajagopalan, Vijendra SK
ๆฅ็ๆ่ฆ
Reducing preventable hospital readmissions is a national priority for payers,
providers, and policymakers seeking to improve health care and lower costs. The
rate of readmission is being used as a benchmark to determine the quality of
healthcare provided by the hospitals. In thisproject, we have used machine
learning techniques like Logistic Regression, Random Forest and Support Vector
Machines to analyze the health claims data and identify demographic and medical
factors that play a crucial role in predicting all-cause readmissions. As the
health claims data is high dimensional, we have used Principal Component
Analysis as a dimension reduction technique and used the results for building
regression models. We compared and evaluated these models based on the Area
Under Curve (AUC) metric. Random Forest model gave the highest performance
followed by Logistic Regression and Support Vector Machine models. These models
can be used to identify the crucial factors causing readmissions and help
identify patients to focus on to reduce the chances of readmission, ultimately
bringing down the cost and increasing the quality of healthcare provided to the
patients.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Jinho Choi, Hyesu Lim, Steffen Schneider, Jaegul Choo
Machine Learning
ๆฅ็ๆ่ฆ
Dataset bias, where data points are skewed to certain concepts, is ubiquitous
in machine learning datasets. Yet, systematically identifying these biases is
challenging without costly, fine-grained attribute annotations. We present
ConceptScope, a scalable and automated framework for analyzing visual datasets
by discovering and quantifying human-interpretable concepts using Sparse
Autoencoders trained on representations from vision foundation models.
ConceptScope categorizes concepts into target, context, and bias types based on
their semantic relevance and statistical correlation to class labels, enabling
class-level dataset characterization, bias identification, and robustness
evaluation through concept-based subgrouping. We validate that ConceptScope
captures a wide range of visual concepts, including objects, textures,
backgrounds, facial attributes, emotions, and actions, through comparisons with
annotated datasets. Furthermore, we show that concept activations produce
spatial attributions that align with semantically meaningful image regions.
ConceptScope reliably detects known biases (e.g., background bias in
Waterbirds) and uncovers previously unannotated ones (e.g, co-occurring objects
in ImageNet), offering a practical tool for dataset auditing and model
diagnostics.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunxiao Shi, Shuo Yang, Yixin Su, Rui Zhang, Min Xu
ๆฅ็ๆ่ฆ
Modern data-centric AI needs precise per-sample influence. Standard SGD-IE
approximates leave-one-out effects by summing per-epoch surrogates and ignores
cross-epoch compounding, which misranks critical examples. We propose
ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out
perturbation across training and updates an accumulative influence state at
each step. In smooth strongly convex settings it achieves geometric error
contraction and, in smooth non-convex regimes, it tightens error bounds; larger
mini-batches further reduce constants. Empirically, on Adult, 20 Newsgroups,
and MNIST under clean and corrupted data and both convex and non-convex
training, ACC-SGD-IE yields more accurate influence estimates, especially over
long epochs. For downstream data cleansing it more reliably flags noisy
samples, producing models trained on ACC-SGD-IE cleaned data that outperform
those cleaned with SGD-IE.
๐
2025-10-30
๐ ArXiv Preprint (Human-Computer Interaction)
๐ฅ Shifu Chen, Dazhen Deng, Zhihong Xu, Sijia Xu, Tai-Quan Peng et al.
ๆฅ็ๆ่ฆ
Social media platforms generate massive volumes of heterogeneous data,
capturing user behaviors, textual content, temporal dynamics, and network
structures. Analyzing such data is crucial for understanding phenomena such as
opinion dynamics, community formation, and information diffusion. However,
discovering insights from this complex landscape is exploratory, conceptually
challenging, and requires expertise in social media mining and visualization.
Existing automated approaches, though increasingly leveraging large language
models (LLMs), remain largely confined to structured tabular data and cannot
adequately address the heterogeneity of social media analysis. We present SIA
(Social Insight Agents), an LLM agent system that links heterogeneous
multi-modal data -- including raw inputs (e.g., text, network, and behavioral
data), intermediate outputs, mined analytical results, and visualization
artifacts -- through coordinated agent flows. Guided by a bottom-up taxonomy
that connects insight types with suitable mining and visualization techniques,
SIA enables agents to plan and execute coherent analysis strategies. To ensure
multi-modal integration, it incorporates a data coordinator that unifies
tabular, textual, and network data into a consistent flow. Its interactive
interface provides a transparent workflow where users can trace, validate, and
refine the agent's reasoning, supporting both adaptability and trustworthiness.
Through expert-centered case studies and quantitative evaluation, we show that
SIA effectively discovers diverse and meaningful insights from social media
while supporting human-agent collaboration in complex analytical tasks.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Renhao Li, Jianhong Tu, Yang Su, Hamid Alinejad-Rokny, Derek F. Wong et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reward models (RMs) play a critical role in aligning large language models
(LLMs) with human preferences. Yet in the domain of tool learning, the lack of
RMs specifically designed for function-calling tasks has limited progress
toward more capable agentic AI. We introduce ToolRM, a family of lightweight
generative RMs tailored for general tool-use scenarios. To build these models,
we propose a novel pipeline that constructs pairwise preference data using
rule-based scoring and multidimensional sampling. This yields
ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique
tasks that supports reinforcement learning with verifiable feedback. To
evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on
the agentic evaluation suite BFCL. Trained on our constructed data, models from
the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially
outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward
judgments. Beyond training objectives, ToolRM generalizes to broader critique
tasks, including Best-of-N sampling and self-correction. Experiments on
ACEBench highlight its effectiveness and efficiency, enabling inference-time
scaling and reducing output token usage by over 66%. We release data and model
checkpoints to facilitate future research.
๐
2025-10-30
๐ 6th ACM International Conference on AI in Finance,
๐ฅ Maarten P. Scholl, Mahmoud Mahfouz, Anisoara Calinescu, J. Doyne Farmer
Machine Learning
ๆฅ็ๆ่ฆ
While investment funds publicly disclose their objectives in broad terms,
their managers optimize for complex combinations of competing goals that go
beyond simple risk-return trade-offs. Traditional approaches attempt to model
this through multi-objective utility functions, but face fundamental challenges
in specification and parameterization. We propose a generative framework that
learns latent representations of fund manager strategies without requiring
explicit utility specification.
Our approach directly models the conditional probability of a fund's
portfolio weights, given stock characteristics, historical returns, previous
weights, and a latent variable representing the fund's strategy. Unlike methods
based on reinforcement learning or imitation learning, which require specified
rewards or labeled expert objectives, our GAN-based architecture learns
directly from the joint distribution of observed holdings and market data.
We validate our framework on a dataset of 1436 U.S. equity mutual funds. The
learned representations successfully capture known investment styles, such as
"growth" and "value," while also revealing implicit manager objectives. For
instance, we find that while many funds exhibit characteristics of
Markowitz-like optimization, they do so with heterogeneous realizations for
turnover, concentration, and latent factors.
To analyze and interpret the end-to-end model, we develop a series of tests
that explain the model, and we show that the benchmark's expert labeling are
contained in our model's encoding in a linear interpretable way.
Our framework provides a data-driven approach for characterizing investment
strategies for applications in market simulation, strategy attribution, and
regulatory oversight.
๐
2025-10-30
๐ This paper is currently under review for presentat
๐ฅ Emilio Mastriani, Alessandro Costa, Federico Incardona, Kevin Munari, Sebastiano Spinello
ๆฅ็ๆ่ฆ
In this study, we investigate the effectiveness of advanced feature
engineering and hybrid model architectures for anomaly detection in a
multivariate industrial time series, focusing on a steam turbine system. We
evaluate the impact of change point-derived statistical features,
clustering-based substructure representations, and hybrid learning strategies
on detection performance. Despite their theoretical appeal, these complex
approaches consistently underperformed compared to a simple Random Forest +
XGBoost ensemble trained on segmented data. The ensemble achieved an AUC-ROC of
0.976, F1-score of 0.41, and 100% early detection within the defined time
window. Our findings highlight that, in scenarios with highly imbalanced and
temporally uncertain data, model simplicity combined with optimized
segmentation can outperform more sophisticated architectures, offering greater
robustness, interpretability, and operational utility.
๐
2025-10-30
๐ EMNLP 2025
๐ฅ Hyuntae Park, Yeachan Kim, SangKeun Lee
Machine Learning
ๆฅ็ๆ่ฆ
Molecule and text representation learning has gained increasing interest due
to its potential for enhancing the understanding of chemical information.
However, existing models often struggle to capture subtle differences between
molecules and their descriptions, as they lack the ability to learn
fine-grained alignments between molecular substructures and chemical phrases.
To address this limitation, we introduce MolBridge, a novel molecule-text
learning framework based on substructure-aware alignments. Specifically, we
augment the original molecule-description pairs with additional alignment
signals derived from molecular substructures and chemical phrases. To
effectively learn from these enriched alignments, MolBridge employs
substructure-aware contrastive learning, coupled with a self-refinement
mechanism that filters out noisy alignment signals. Experimental results show
that MolBridge effectively captures fine-grained correspondences and
outperforms state-of-the-art baselines on a wide range of molecular benchmarks,
highlighting the significance of substructure-aware alignment in molecule-text
learning.
๐
2025-10-30
๐ ICCV 2025
๐ฅ Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large annotated datasets are essential for training robust Computer-Aided
Diagnosis (CAD) models for breast cancer detection or risk prediction. However,
acquiring such datasets with fine-detailed annotation is both costly and
time-consuming. Vision-Language Models (VLMs), such as CLIP, which are
pre-trained on large image-text pairs, offer a promising solution by enhancing
robustness and data efficiency in medical imaging tasks. This paper introduces
a novel Multi-View Mammography and Language Model for breast cancer
classification and risk prediction, trained on a dataset of paired mammogram
images and synthetic radiology reports. Our MV-MLM leverages multi-view
supervision to learn rich representations from extensive radiology data by
employing cross-modal self-supervision across image-text pairs. This includes
multiple views and the corresponding pseudo-radiology reports. We propose a
novel joint visual-textual learning strategy to enhance generalization and
accuracy performance over different data types and tasks to distinguish breast
tissues or cancer characteristics(calcification, mass) and utilize these
patterns to understand mammography images and predict cancer risk. We evaluated
our method on both private and publicly available datasets, demonstrating that
the proposed model achieves state-of-the-art performance in three
classification tasks: (1) malignancy classification, (2) subtype
classification, and (3) image-based cancer risk prediction. Furthermore, the
model exhibits strong data efficiency, outperforming existing fully supervised
or VLM baselines while trained on synthetic text reports and without the need
for actual radiology reports.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Annan Li, Chufan Wu, Zengle Ge, Yee Hin Chong, Zhinan Hou et al.
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Large language models (LLMs) are catalyzing the development of autonomous AI
research agents for scientific and engineering discovery. We present FM Agent,
a novel and general-purpose multi-agent framework that leverages a synergistic
combination of LLM-based reasoning and large-scale evolutionary search to
address complex real-world challenges. The core of FM Agent integrates several
key innovations: 1) a cold-start initialization phase incorporating expert
guidance, 2) a novel evolutionary sampling strategy for iterative optimization,
3) domain-specific evaluators that combine correctness, effectiveness, and
LLM-supervised feedback, and 4) a distributed, asynchronous execution
infrastructure built on Ray. Demonstrating broad applicability, our system has
been evaluated across diverse domains, including operations research, machine
learning, GPU kernel optimization, and classical mathematical problems. FM
Agent reaches state-of-the-art results autonomously, without human
interpretation or tuning -- 1976.3 on ALE-Bench (+5.2\%), 43.56\% on MLE-Bench
(+4.0pp), up to 20x speedups on KernelBench, and establishes new
state-of-the-art(SOTA) results on several classical mathematical problems.
Beyond academic benchmarks, FM Agent shows considerable promise for both
large-scale enterprise R\&D workflows and fundamental scientific research,
where it can accelerate innovation, automate complex discovery processes, and
deliver substantial engineering and scientific advances with broader societal
impact.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bo Pang, Deqian Kong, Silvio Savarese, Caiming Xiong, Yingbo Zhou
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) can elicit strong reasoning in large language
models (LLMs), yet most open efforts focus on math and code. We propose
Reasoning Curriculum, a simple two-stage curriculum that first elicits
reasoning skills in pretraining-aligned domains such as math, then adapts and
refines these skills across other domains via joint RL. Stage 1 performs a
brief cold start and then math-only RL with verifiable rewards to develop
reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and
consolidate these skills. The curriculum is minimal and backbone-agnostic,
requiring no specialized reward models beyond standard verifiability checks.
Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, reasoning
curriculum yields consistent gains. Ablations and a cognitive-skill analysis
indicate that both stages are necessary and that math-first elicitation
increases cognitive behaviors important for solving complex problems. Reasoning
Curriculum provides a compact, easy-to-adopt recipe for general reasoning.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Boqin Zhuang, Jiacheng Qiao, Mingqian Liu, Mingxing Yu, Ping Hong et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The inference cost of Large Language Models (LLMs) has become a critical
factor in determining their commercial viability and widespread adoption. This
paper introduces a quantitative ``economics of inference'' framework, treating
the LLM inference process as a compute-driven intelligent production activity.
We analyze its marginal cost, economies of scale, and quality of output under
various performance configurations. Based on empirical data from WiNEval-3.0,
we construct the first ``LLM Inference Production Frontier,'' revealing three
principles: diminishing marginal cost, diminishing returns to scale, and an
optimal cost-effectiveness zone. This paper not only provides an economic basis
for model deployment decisions but also lays an empirical foundation for the
future market-based pricing and optimization of AI inference resources.
๐
2025-10-30
๐ Pre-print prepared for journal submission
๐ฅ Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have advanced code generation at the function
level, yet their ability to produce correct class-level implementations in
authentic software projects remains poorly understood. This work introduces a
novel benchmark derived from open-source repositories, comprising real-world
classes divided into seen and unseen partitions to evaluate generalization
under practical conditions. The evaluation examines multiple LLMs under varied
input specifications, retrieval-augmented configurations, and documentation
completeness levels.
Results reveal a stark performance disparity: LLMs achieve 84% to 89%
correctness on established synthetic benchmarks but only 25% to 34% on
real-world class tasks, with negligible differences between familiar and novel
codebases. Comprehensive docstrings yield modest gains of 1% to 3% in
functional accuracy, though statistical significance is rare.
Retrieval-augmented generation proves most effective with partial
documentation, improving correctness by 4% to 7% by supplying concrete
implementation patterns absent from specifications. Error profiling identifies
AttributeError, TypeError, and AssertionError as dominant failure modes (84% of
cases), with synthetic tests overemphasizing assertion issues and real-world
scenarios highlighting type and attribute mismatches. Retrieval augmentation
reduces logical flaws but can introduce dependency conflicts.
The benchmark and analysis expose critical limitations in current LLM
capabilities for class-level engineering, offering actionable insights for
enhancing context modelling, documentation strategies, and retrieval
integration in production code assistance tools.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou et al.
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-based end-to-end (E2E) driving has garnered significant interest in
the research community due to its scalability and synergy with multimodal large
language models (MLLMs). However, current E2E driving benchmarks primarily
feature nominal scenarios, failing to adequately test the true potential of
these systems. Furthermore, existing open-loop evaluation metrics often fall
short in capturing the multi-modal nature of driving or effectively evaluating
performance in long-tail scenarios. To address these gaps, we introduce the
Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021
driving segments (approximately 12 hours), specifically curated for challenging
long-tail scenarios that that are rare in daily life with an occurring
frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the
high-level routing information, ego states, and 360-degree camera views from 8
surrounding cameras. To evaluate the E2E driving performance on these long-tail
situations, we propose a novel open-loop evaluation metric: Rater Feedback
Score (RFS). Unlike conventional metrics that measure the distance between
predicted way points and the logs, RFS measures how closely the predicted
trajectory matches rater-annotated trajectory preference labels. We have
released rater preference labels for all WOD-E2E validation set segments, while
the held out test set labels have been used for the 2025 WOD-E2E Challenge.
Through our work, we aim to foster state of the art research into
generalizable, robust, and safe end-to-end autonomous driving agents capable of
handling complex real-world situations.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, Angela Yao
Machine Learning
ๆฅ็ๆ่ฆ
Can Video-LLMs achieve consistent temporal understanding when videos capture
the same event from different viewpoints? To study this, we introduce
EgoExo-Con (Consistency), a benchmark of comprehensively synchronized
egocentric and exocentric video pairs with human-refined queries in natural
language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal
Verification and Temporal Grounding. It evaluates not only correctness but
consistency across viewpoints. Our analysis reveals two critical limitations of
existing Video-LLMs: (1) models often fail to maintain consistency, with
results far worse than their single-view performances. (2) When naively
finetuned with synchronized videos of both viewpoints, the models show improved
consistency but often underperform those trained on a single view. For
improvements, we propose View-GRPO, a novel reinforcement learning framework
that effectively strengthens view-specific temporal reasoning while encouraging
consistent comprehension across viewpoints. Our method demonstrates its
superiority over naive SFT and GRPO, especially for improving cross-view
consistency. All resources will be made publicly available.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaosen Wang, Zhijin Ge, Shaokang Wang
ๆฅ็ๆ่ฆ
Despite the notable advancements and versatility of multi-modal diffusion
models, such as text-to-image models, their susceptibility to adversarial
inputs remains underexplored. Contrary to expectations, our investigations
reveal that the alignment between textual and Image modalities in existing
diffusion models is inadequate. This misalignment presents significant risks,
especially in the generation of inappropriate or Not-Safe-For-Work (NSFW)
content. To this end, we propose a novel attack called Prompt-Restricted
Multi-modal Attack (PReMA) to manipulate the generated content by modifying the
input image in conjunction with any specified prompt, without altering the
prompt itself. PReMA is the first attack that manipulates model outputs by
solely creating adversarial images, distinguishing itself from prior methods
that primarily generate adversarial prompts to produce NSFW content.
Consequently, PReMA poses a novel threat to the integrity of multi-modal
diffusion models, particularly in image-editing applications that operate with
fixed prompts. Comprehensive evaluations conducted on image inpainting and
style transfer tasks across various models confirm the potent efficacy of
PReMA.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nick Masi, Randall Balestriero
Machine Learning
ๆฅ็ๆ่ฆ
The dominant paradigm in machine learning is to assess model performance
based on average loss across all samples in some test set. This amounts to
averaging performance geospatially across the Earth in weather and climate
settings, failing to account for the non-uniform distribution of human
development and geography. We introduce Stratified Assessments of Forecasts
over Earth (SAFE), a package for elucidating the stratified performance of a
set of predictions made over Earth. SAFE integrates various data domains to
stratify by different attributes associated with geospatial gridpoints:
territory (usually country), global subregion, income, and landcover (land or
water). This allows us to examine the performance of models for each individual
stratum of the different attributes (e.g., the accuracy in every individual
country). To demonstrate its importance, we utilize SAFE to benchmark a zoo of
state-of-the-art AI-based weather prediction models, finding that they all
exhibit disparities in forecasting skill across every attribute. We use this to
seed a benchmark of model forecast fairness through stratification at different
lead times for various climatic variables. By moving beyond globally-averaged
metrics, we for the first time ask: where do models perform best or worst, and
which models are most fair? To support further work in this direction, the SAFE
package is open source and available at https://github.com/N-Masi/safe
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large vision language models (VLMs) have advanced graphical user interface
(GUI) task automation but still lag behind humans. We hypothesize this gap
stems from missing core GUI knowledge, which existing training schemes (such as
supervised fine tuning and reinforcement learning) alone cannot fully address.
By analyzing common failure patterns in GUI task execution, we distill GUI
knowledge into three dimensions: (1) interface perception, knowledge about
recognizing widgets and system states; (2) interaction prediction, knowledge
about reasoning action state transitions; and (3) instruction understanding,
knowledge about planning, verifying, and assessing task completion progress. We
further introduce GUI Knowledge Bench, a benchmark with multiple choice and
yes/no questions across six platforms (Web, Android, MacOS, Windows, Linux,
IOS) and 292 applications. Our evaluation shows that current VLMs identify
widget functions but struggle with perceiving system states, predicting
actions, and verifying task completion. Experiments on real world GUI tasks
further validate the close link between GUI knowledge and task success. By
providing a structured framework for assessing GUI knowledge, our work supports
the selection of VLMs with greater potential prior to downstream training and
provides insights for building more capable GUI agents.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yuxin Li, Minghao Liu, Ruida Wang, Wenzhao Ji, Zhitao He et al.
ๆฅ็ๆ่ฆ
We present **Lean4PHYS**, a comprehensive reasoning framework for
college-level physics problems in Lean4. **Lean4PHYS** includes
*LeanPhysBench*, a college-level benchmark for formal physics reasoning in
Lean4, which contains 200 hand-crafted and peer-reviewed statements derived
from university textbooks and physics competition problems. To establish a
solid foundation for formal reasoning in physics, we also introduce *PhysLib*,
a community-driven repository containing fundamental unit systems and theorems
essential for formal physics reasoning. Based on the benchmark and Lean4
repository we composed in **Lean4PHYS**, we report baseline results using major
expert Math Lean4 provers and state-of-the-art closed-source models, with the
best performance of DeepSeek-Prover-V2-7B achieving only 16% and
Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that
our *PhysLib* can achieve an average improvement of 11.75% in model
performance. This demonstrates the challenging nature of our *LeanPhysBench*
and the effectiveness of *PhysLib*. To the best of our knowledge, this is the
first study to provide a physics benchmark in Lean4.
๐
2025-10-30
๐ 29 pages, 12 figures. Fazel Arasteh and Arian Hagh
๐ฅ Fazel Arasteh, Arian Haghparast, Manos Papagelis
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Traffic congestion in urban road networks leads to longer trip times and
higher emissions, especially during peak periods. While the Shortest Path First
(SPF) algorithm is optimal for a single vehicle in a static network, it
performs poorly in dynamic, multi-vehicle settings, often worsening congestion
by routing all vehicles along identical paths. We address dynamic vehicle
routing through a multi-agent reinforcement learning (MARL) framework for
coordinated, network-aware fleet navigation. We first propose Adaptive
Navigation (AN), a decentralized MARL model where each intersection agent
provides routing guidance based on (i) local traffic and (ii) neighborhood
state modeled using Graph Attention Networks (GAT). To improve scalability in
large networks, we further propose Hierarchical Hub-based Adaptive Navigation
(HHAN), an extension of AN that assigns agents only to key intersections
(hubs). Vehicles are routed hub-to-hub under agent control, while SPF handles
micro-routing within each hub region. For hub coordination, HHAN adopts
centralized training with decentralized execution (CTDE) under the Attentive
Q-Mixing (A-QMIX) framework, which aggregates asynchronous vehicle decisions
via attention. Hub agents use flow-aware state features that combine local
congestion and predictive dynamics for proactive routing. Experiments on
synthetic grids and real urban maps (Toronto, Manhattan) show that AN reduces
average travel time versus SPF and learning baselines, maintaining 100% routing
success. HHAN scales to networks with hundreds of intersections, achieving up
to 15.9% improvement under heavy traffic. These findings highlight the
potential of network-constrained MARL for scalable, coordinated, and
congestion-aware routing in intelligent transportation systems.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
There are two prevalent ways to constructing 3D scenes: procedural generation
and 2D lifting. Among them, panorama-based 2D lifting has emerged as a
promising technique, leveraging powerful 2D generative priors to produce
immersive, realistic, and diverse 3D environments. In this work, we advance
this technique to generate graphics-ready 3D scenes suitable for physically
based rendering (PBR), relighting, and simulation. Our key insight is to
repurpose 2D generative models for panoramic perception of geometry, textures,
and PBR materials. Unlike existing 2D lifting approaches that emphasize
appearance generation and ignore the perception of intrinsic properties, we
present OmniX, a versatile and unified framework. Based on a lightweight and
efficient cross-modal adapter structure, OmniX reuses 2D generative priors for
a broad range of panoramic vision tasks, including panoramic perception,
generation, and completion. Furthermore, we construct a large-scale synthetic
panorama dataset containing high-quality multimodal panoramas from diverse
indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness
of our model in panoramic visual perception and graphics-ready 3D scene
generation, opening new possibilities for immersive and physically realistic
virtual world generation.
๐
2025-10-30
๐ EMNLP 2025
๐ฅ Chao Feng, Zihao Wei, Andrew Owens
Natural Language Processing
ๆฅ็ๆ่ฆ
We learn visual features by captioning images with an image-conditioned
masked diffusion language model, a formulation we call masked diffusion
captioning (MDC). During training, text tokens in each image-caption pair are
masked at a randomly chosen ratio, and a decoder conditioned on visual features
is trained to reconstruct the original text. After training, the learned visual
features can be applied to downstream vision tasks. Unlike autoregressive
captioning, the strength of the visual learning signal in MDC does not depend
on each token's position in the sequence, reducing the need for auxiliary
objectives. Linear probing experiments across a variety of academic-scale
models and datasets show that the learned visual features are competitive with
those produced by autoregressive and contrastive approaches.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao et al.
Robotics
ๆฅ็ๆ่ฆ
Immersive applications call for synthesizing spatiotemporal 4D content from
casual videos without costly 3D supervision. Existing video-to-4D methods
typically rely on manually annotated camera poses, which are labor-intensive
and brittle for in-the-wild footage. Recent warp-then-inpaint approaches
mitigate the need for pose labels by warping input frames along a novel camera
trajectory and using an inpainting model to fill missing regions, thereby
depicting the 4D scene from diverse viewpoints. However, this
trajectory-to-trajectory formulation often entangles camera motion with scene
dynamics and complicates both modeling and inference. We introduce SEE4D, a
pose-free, trajectory-to-camera framework that replaces explicit trajectory
prediction with rendering to a bank of fixed virtual cameras, thereby
separating camera control from scene modeling. A view-conditional video
inpainting model is trained to learn a robust geometry prior by denoising
realistically synthesized warped images and to inpaint occluded or missing
regions across virtual viewpoints, eliminating the need for explicit 3D
annotations. Building on this inpainting core, we design a spatiotemporal
autoregressive inference pipeline that traverses virtual-camera splines and
extends videos with overlapping windows, enabling coherent generation at
bounded per-step complexity. We validate See4D on cross-view video generation
and sparse reconstruction benchmarks. Across quantitative metrics and
qualitative assessments, our method achieves superior generalization and
improved performance relative to pose- or trajectory-conditioned baselines,
advancing practical 4D world modeling from casual videos.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Philipp Lindenberger, Paul-Edouard Sarlin, Jan Hosang, Matteo Balice, Marc Pollefeys et al.
ๆฅ็ๆ่ฆ
Determining the precise geographic location of an image at a global scale
remains an unsolved challenge. Standard image retrieval techniques are
inefficient due to the sheer volume of images (>100M) and fail when coverage is
insufficient. Scalable solutions, however, involve a trade-off: global
classification typically yields coarse results (10+ kilometers), while
cross-view retrieval between ground and aerial imagery suffers from a domain
gap and has been primarily studied on smaller regions. This paper introduces a
hybrid approach that achieves fine-grained geo-localization across a large
geographic expanse the size of a continent. We leverage a proxy classification
task during training to learn rich feature representations that implicitly
encode precise location information. We combine these learned prototypes with
embeddings of aerial imagery to increase robustness to the sparsity of
ground-level data. This enables direct, fine-grained retrieval over areas
spanning multiple countries. Our extensive evaluation demonstrates that our
approach can localize within 200m more than 68\% of queries of a dataset
covering a large part of Europe. The code is publicly available at
https://scaling-geoloc.github.io.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Despite recent advances in 3D human motion generation (MoGen) on standard
benchmarks, existing models still face a fundamental bottleneck in their
generalization capability. In contrast, adjacent generative fields, most
notably video generation (ViGen), have demonstrated remarkable generalization
in modeling human behaviors, highlighting transferable insights that MoGen can
leverage. Motivated by this observation, we present a comprehensive framework
that systematically transfers knowledge from ViGen to MoGen across three key
pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a
large-scale dataset comprising 228,000 high-quality motion samples that
integrates high-fidelity optical MoCap data with semantically annotated motions
from web videos and synthesized samples generated by state-of-the-art ViGen
models. The dataset includes both text-motion pairs and text-video-motion
triplets, substantially expanding semantic diversity. Second, we propose
ViMoGen, a flow-matching-based diffusion transformer that unifies priors from
MoCap data and ViGen models through gated multimodal conditioning. To enhance
efficiency, we further develop ViMoGen-light, a distilled variant that
eliminates video generation dependencies while preserving strong
generalization. Finally, we present MBench, a hierarchical benchmark designed
for fine-grained evaluation across motion quality, prompt fidelity, and
generalization ability. Extensive experiments show that our framework
significantly outperforms existing approaches in both automatic and human
evaluations. The code, data, and benchmark will be made publicly available.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng Zheng, William Koch, Baiang Li, Felix Heide
Computer VisionMachine LearningRobotics
ๆฅ็ๆ่ฆ
Hierarchical structures of motion exist across research fields, including
computer vision, graphics, and robotics, where complex dynamics typically arise
from coordinated interactions among simpler motion components. Existing methods
to model such dynamics typically rely on manually-defined or heuristic
hierarchies with fixed motion primitives, limiting their generalizability
across different tasks. In this work, we propose a general hierarchical motion
modeling method that learns structured, interpretable motion relationships
directly from data. Our method represents observed motions using graph-based
hierarchies, explicitly decomposing global absolute motions into
parent-inherited patterns and local motion residuals. We formulate hierarchy
inference as a differentiable graph learning problem, where vertices represent
elemental motions and directed edges capture learned parent-child dependencies
through graph neural networks. We evaluate our hierarchical reconstruction
approach on three examples: 1D translational motion, 2D rotational motion, and
dynamic 3D scene deformation via Gaussian splatting. Experimental results show
that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases,
and produces more realistic and interpretable deformations compared to the
baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable,
data-driven hierarchical modeling paradigm, our method offers a formulation
applicable to a broad range of motion-centric tasks. Project Page:
https://light.princeton.edu/HEIR/
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Charts play an important role in visualization, reasoning, data analysis, and
the exchange of ideas among humans. However, existing vision-language models
(VLMs) still lack accurate perception of details and struggle to extract
fine-grained structures from charts. Such limitations in chart grounding also
hinder their ability to compare multiple charts and reason over them. In this
paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a
comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting
tabular data, localizing visualization elements, and recognizing various
attributes from charts of diverse types and complexities. We design a JSON
template to facilitate the calculation of evaluation metrics specifically
tailored for each grounding task. By incorporating a novel two-stage inference
workflow, the benchmark can further evaluate VLMs' capability to align and
compare elements/attributes across two charts. Our analysis of evaluations on
several recent VLMs reveals new insights into their perception biases,
weaknesses, robustness, and hallucinations in chart understanding. These
findings highlight the fine-grained discrepancies among VLMs in chart
understanding tasks and point to specific skills that need to be strengthened
in current models.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Valentyna Starodub, Mantas Lukoลกeviฤius
ๆฅ็ๆ่ฆ
Age-related macular degeneration (AMD) is one of the leading causes of
irreversible vision impairment in people over the age of 60. This research
focuses on semantic segmentation for AMD lesion detection in RGB fundus images,
a non-invasive and cost-effective imaging technique. The results of the ADAM
challenge - the most comprehensive AMD detection from RGB fundus images
research competition and open dataset to date - serve as a benchmark for our
evaluation. Taking the U-Net connectivity as a base of our framework, we
evaluate and compare several approaches to improve the segmentation model's
architecture and training pipeline, including pre-processing techniques,
encoder (backbone) deep network types of varying complexity, and specialized
loss functions to mitigate class imbalances on image and pixel levels. The main
outcome of this research is the final configuration of the AMD detection
framework, which outperforms all the prior ADAM challenge submissions on the
multi-class segmentation of different AMD lesion types in non-invasive RGB
fundus images. The source code used to conduct the experiments presented in
this paper is made freely available.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
This work introduces SteerVLM, a lightweight steering module designed to
guide Vision-Language Models (VLMs) towards outputs that better adhere to
desired instructions. Our approach learns from the latent embeddings of paired
prompts encoding target and converse behaviors to dynamically adjust
activations connecting the language modality with image context. This allows
for fine-grained, inference-time control over complex output semantics without
modifying model weights while preserving performance on off-target tasks. Our
steering module requires learning parameters equal to 0.14% of the original
VLM's size. Our steering module gains model control through dimension-wise
activation modulation and adaptive steering across layers without requiring
pre-extracted static vectors or manual tuning of intervention points.
Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a
multimodal dataset specifically created to facilitate the development and
evaluation of VLM steering techniques. Our method outperforms existing
intervention techniques on steering and hallucination mitigation benchmarks for
VLMs and proposes a robust solution for multimodal model control through
activation engineering.
๐
2025-10-30
๐ Accepted to ACMMM 2025
๐ฅ Shaokai Wu, Yapan Guo, Yanbiao Ji, Jing Tong, Yuxiang Lu et al.
Machine Learning
ๆฅ็ๆ่ฆ
CT reconstruction provides radiologists with images for diagnosis and
treatment, yet current deep learning methods are typically limited to specific
anatomies and datasets, hindering generalization ability to unseen anatomies
and lesions. To address this, we introduce the Multi-Organ medical image
REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies
with 15 lesion types. This dataset serves two key purposes: (1) enabling robust
training of deep learning models on extensive, heterogeneous data, and (2)
facilitating rigorous evaluation of model generalization for CT reconstruction.
We further establish a strong baseline solution that outperforms prior
approaches under these challenging conditions. Our results demonstrate that:
(1) a comprehensive dataset helps improve the generalization capability of
models, and (2) optimization-based methods offer enhanced robustness for unseen
anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our
project page https://more-med.github.io/
๐
2025-10-30
๐ ArXiv Preprint (eess.IV)
๐ฅ Paul F. R. Wilson, Mohamed Harmanani, Minh Nguyen Nhat To, Amoon Jamzad, Tarek Elghareb et al.
ๆฅ็ๆ่ฆ
Purpose: Medical foundation models (FMs) offer a path to build
high-performance diagnostic systems. However, their application to prostate
cancer (PCa) detection from micro-ultrasound ({\mu}US) remains untested in
clinical settings. We present ProstNFound+, an adaptation of FMs for PCa
detection from {\mu}US, along with its first prospective validation. Methods:
ProstNFound+ incorporates a medical FM, adapter tuning, and a custom prompt
encoder that embeds PCa-specific clinical biomarkers. The model generates a
cancer heatmap and a risk score for clinically significant PCa. Following
training on multi-center retrospective data, the model is prospectively
evaluated on data acquired five years later from a new clinical site. Model
predictions are benchmarked against standard clinical scoring protocols
(PRI-MUS and PI-RADS). Results: ProstNFound+ shows strong generalization to the
prospective data, with no performance degradation compared to retrospective
evaluation. It aligns closely with clinical scores and produces interpretable
heatmaps consistent with biopsy-confirmed lesions. Conclusion: The results
highlight its potential for clinical deployment, offering a scalable and
interpretable alternative to expert-driven protocols.
๐
2025-10-30
๐ Article written for Frontiers of Science Award, In
๐ฅ Bernhard Kerbl
ๆฅ็ๆ่ฆ
Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed
the landscape of 3D scene representations, inspiring an extensive body of
associated research. Follow-up work includes analyses and contributions that
enhance the efficiency, scalability, and real-world applicability of 3DGS. In
this summary, we present an overview of several key directions that have
emerged in the wake of 3DGS. We highlight advances enabling resource-efficient
training and rendering, the evolution toward dynamic (or four-dimensional,
4DGS) representations, and deeper exploration of the mathematical foundations
underlying its appearance modeling and rendering process. Furthermore, we
examine efforts to bring 3DGS to mobile and virtual reality platforms, its
extension to massive-scale environments, and recent progress toward
near-instant radiance field reconstruction via feed-forward or distributed
computation. Collectively, these developments illustrate how 3DGS has evolved
from a breakthrough representation into a versatile and foundational tool for
3D vision and graphics.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vaibhav Kurrey, Sivakalyan Pujari, Gagan Raj Gupta
Computer VisionMachine LearningRobotics
ๆฅ็ๆ่ฆ
We present a long-term deployment study of a machine vision-based anomaly
detection system for failure prediction in a steel rolling mill. The system
integrates industrial cameras to monitor equipment operation, alignment, and
hot bar motion in real time along the process line. Live video streams are
processed on a centralized video server using deep learning models, enabling
early prediction of equipment failures and process interruptions, thereby
reducing unplanned breakdown costs. Server-based inference minimizes the
computational load on industrial process control systems (PLCs), supporting
scalable deployment across production lines with minimal additional resources.
By jointly analyzing sensor data from data acquisition systems and visual
inputs, the system identifies the location and probable root causes of
failures, providing actionable insights for proactive maintenance. This
integrated approach enhances operational reliability, productivity, and
profitability in industrial manufacturing environments.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Courtney M. King, Daniel D. Leeds, Damian Lyons, George Kalaitzis
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
The presence of occlusions has provided substantial challenges to
typically-powerful object recognition algorithms. Additional sources of
information can be extremely valuable to reduce errors caused by occlusions.
Scene context is known to aid in object recognition in biological vision. In
this work, we attempt to add robustness into existing Region Proposal
Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks
through two distinct scene-based information fusion techniques. We present one
algorithm under each methodology: the first operates prior to prediction,
selecting a custom object network to use based on the identified background
scene, and the second operates after detection, fusing scene knowledge into
initial object scores output by the RPN. We demonstrate our algorithms on
challenging datasets featuring partial occlusions, which show overall
improvement in both recall and precision against baseline methods. In addition,
our experiments contrast multiple training methodologies for occlusion
handling, finding that training on a combination of both occluded and
unoccluded images demonstrates an improvement over the others. Our method is
interpretable and can easily be adapted to other datasets, offering many future
directions for research and practical applications.
๐
2025-10-30
๐ ArXiv Preprint (eess.IV)
๐ฅ Alya Almsouti, Ainur Khamitova, Darya Taratynova, Mohammad Yaqub
ๆฅ็ๆ่ฆ
Assessing the severity of artifacts in pediatric brain Magnetic Resonance
Imaging (MRI) is critical for diagnostic accuracy, especially in low-field
systems where the signal-to-noise ratio is reduced. Manual quality assessment
is time-consuming and subjective, motivating the need for robust automated
solutions. In this work, we propose BRIQA (Balanced Reweighting in Image
Quality Assessment), which addresses class imbalance in artifact severity
levels. BRIQA uses gradient-based loss reweighting to dynamically adjust
per-class contributions and employs a rotating batching scheme to ensure
consistent exposure to underrepresented classes. Through experiments, no single
architecture performs best across all artifact types, emphasizing the
importance of architectural diversity. The rotating batching configuration
improves performance across metrics by promoting balanced learning when
combined with cross-entropy loss. BRIQA improves average macro F1 score from
0.659 to 0.706, with notable gains in Noise (0.430), Zipper (0.098),
Positioning (0.097), Contrast (0.217), Motion (0.022), and Banding (0.012)
artifact severity classification. The code is available at
https://github.com/BioMedIA-MBZUAI/BRIQA.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Daniela Martin, Joseph Gallego
Computer VisionMachine LearningRobotics
ๆฅ็ๆ่ฆ
Accurate estimation of sea ice drift is critical for Arctic navigation,
climate research, and operational forecasting. While optical flow, a computer
vision technique for estimating pixel wise motion between consecutive images,
has advanced rapidly in computer vision, its applicability to geophysical
problems and to satellite SAR imagery remains underexplored. Classical optical
flow methods rely on mathematical models and strong assumptions about motion,
which limit their accuracy in complex scenarios. Recent deep learning based
approaches have substantially improved performance and are now the standard in
computer vision, motivating their application to sea ice drift estimation. We
present the first large scale benchmark of 48 deep learning optical flow models
on RADARSAT 2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and
Fl all metrics against GNSS tracked buoys. Several models achieve sub kilometer
accuracy (EPE 6 to 8 pixels, 300 to 400 m), a small error relative to the
spatial scales of sea ice motion and typical navigation requirements in the
Arctic. Our results demonstrate that the models are capable of capturing
consistent regional drift patterns and that recent deep learning based optical
flow methods, which have substantially improved motion estimation accuracy
compared to classical methods, can be effectively transferred to polar remote
sensing. Optical flow produces spatially continuous drift fields, providing
motion estimates for every image pixel rather than at sparse buoy locations,
offering new opportunities for navigation and climate modeling.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Ahmad Sarlak, Mahlagha Fazeli et al.
Computer VisionMultimodalRoboticsNatural Language Processing
ๆฅ็ๆ่ฆ
Autonomous Vehicles (AVs) are transforming the future of transportation
through advances in intelligent perception, decision-making, and control
systems. However, their success is tied to one core capability, reliable object
detection in complex and multimodal environments. While recent breakthroughs in
Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable
progress, the field still faces a critical challenge as knowledge remains
fragmented across multimodal perception, contextual reasoning, and cooperative
intelligence. This survey bridges that gap by delivering a forward-looking
analysis of object detection in AVs, emphasizing emerging paradigms such as
Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI
rather than re-examining outdated techniques. We begin by systematically
reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR,
and Radar) and their fusion strategies, highlighting not only their
capabilities and limitations in dynamic driving environments but also their
potential to integrate with recent advances in LLM/VLM-driven perception
frameworks. Next, we introduce a structured categorization of AV datasets that
moves beyond simple collections, positioning ego-vehicle, infrastructure-based,
and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a
cross-analysis of data structures and characteristics. Ultimately, we analyze
cutting-edge detection methodologies, ranging from 2D and 3D pipelines to
hybrid sensor fusion, with particular attention to emerging transformer-driven
approaches powered by Vision Transformers (ViTs), Large and Small Language
Models (SLMs), and VLMs. By synthesizing these perspectives, our survey
delivers a clear roadmap of current capabilities, open challenges, and future
opportunities.
๐
2025-10-30
๐ ArXiv Preprint (eess.IV)
๐ฅ Zhao Wang, Wei Dai, Thuy Thanh Dao, Steffen Bollmann, Hongfu Sun et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Accurate magnetic resonance imaging (MRI) segmentation is crucial for
clinical decision-making, but remains labor-intensive when performed manually.
Convolutional neural network (CNN)-based methods can be accurate and efficient,
but often generalize poorly to MRI's variable contrast, intensity
inhomogeneity, and protocols. Although the transformer-based Segment Anything
Model (SAM) has demonstrated remarkable generalizability in natural images,
existing adaptations often treat MRI as another imaging modality, overlooking
these modality-specific challenges. We present SAMRI, an MRI-specialized SAM
trained and validated on 1.1 million labeled MR slices spanning whole-body
organs and pathologies. We demonstrate that SAM can be effectively adapted to
MRI by simply fine-tuning its mask decoder using a two-stage strategy, reducing
training time by 94% and trainable parameters by 96% versus full-model
retraining. Across diverse MRI segmentation tasks, SAMRI achieves a mean Dice
of 0.87, delivering state-of-the-art accuracy across anatomical regions and
robust generalization on unseen structures, particularly small and clinically
important structures.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Bingcong Huo, Zhiming Wang
Computer Vision
ๆฅ็ๆ่ฆ
To address the challenges in UAV object detection, such as complex
backgrounds, severe occlusion, dense small objects, and varying lighting
conditions,this paper proposes PT-DETR based on RT-DETR, a novel detection
algorithm specifically designed for small objects in UAV imagery. In the
backbone network, we introduce the Partially-Aware Detail Focus (PADF) Module
to enhance feature extraction for small objects. Additionally,we design the
Median-Frequency Feature Fusion (MFFF) module,which effectively improves the
model's ability to capture small-object details and contextual information.
Furthermore,we incorporate Focaler-SIoU to strengthen the model's bounding box
matching capability and increase its sensitivity to small-object features,
thereby further enhancing detection accuracy and robustness. Compared with
RT-DETR, our PT-DETR achieves mAP improvements of 1.6% and 1.7% on the
VisDrone2019 dataset with lower computational complexity and fewer parameters,
demonstrating its robustness and feasibility for small-object detection tasks.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Christoffer Koo รhrstrรธm, Ronja Gรผldenring, Lazaros Nalpantidis
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
We propose tokenization of events and present a tokenizer, Spiking Patches,
specifically designed for event cameras. Given a stream of asynchronous and
spatially sparse events, our goal is to discover an event representation that
preserves these properties. Prior works have represented events as frames or as
voxels. However, while these representations yield high accuracy, both frames
and voxels are synchronous and decrease the spatial sparsity. Spiking Patches
gives the means to preserve the unique properties of event cameras and we show
in our experiments that this comes without sacrificing accuracy. We evaluate
our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and
object detection. Tokens from Spiking Patches yield inference times that are up
to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We
achieve this while matching their accuracy and even surpassing in some cases
with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for
object detection. Thus, tokenization constitutes a novel direction in
event-based vision and marks a step towards methods that preserve the
properties of event cameras.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shayan Nejadshamsi, Yuanyuan Zhang, Shadi Zaki, Brock Porth, Lysa Porth et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate and timely crop yield prediction is crucial for global food security
and modern agricultural management. Traditional methods often lack the
scalability and granularity required for precision farming. This paper
introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder
for Satellite Sensing), a deep learning model designed for high-resolution,
intra-field canola yield prediction. CYPRESS leverages a pre-trained,
large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for
a continuous regression task, transforming multi-temporal satellite imagery
into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from
the Canadian Prairies, CYPRESS demonstrates superior performance over existing
deep learning-based yield prediction models, highlighting the effectiveness of
fine-tuning foundation models for specialized agricultural applications. By
providing a continuous, high-resolution output, CYPRESS offers a more
actionable tool for precision agriculture than conventional classification or
county-level aggregation methods. This work validates a novel approach that
bridges the gap between large-scale Earth observation and on-farm
decision-making, offering a scalable solution for detailed agricultural
monitoring.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anirban Ray, Vera Galinova, Florian Jug
Machine Learning
ๆฅ็ๆ่ฆ
Computational Super-Resolution (CSR) in fluorescence microscopy has, despite
being an ill-posed problem, a long history. At its very core, CSR is about
finding a prior that can be used to extrapolate frequencies in a micrograph
that have never been imaged by the image-generating microscope. It stands to
reason that, with the advent of better data-driven machine learning techniques,
stronger prior can be learned and hence CSR can lead to better results. Here,
we present ResMatching, a novel CSR method that uses guided conditional flow
matching to learn such improved data-priors. We evaluate ResMatching on 4
diverse biological structures from the BioSR dataset and compare its results
against 7 baselines. ResMatching consistently achieves competitive results,
demonstrating in all cases the best trade-off between data fidelity and
perceptual realism. We observe that CSR using ResMatching is particularly
effective in cases where a strong prior is hard to learn, e.g. when the given
low-resolution images contain a lot of noise. Additionally, we show that
ResMatching can be used to sample from an implicitly learned posterior
distribution and that this distribution is calibrated for all tested use-cases,
enabling our method to deliver a pixel-wise data-uncertainty term that can
guide future users to reject uncertain predictions.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li et al.
Computer VisionMachine LearningMultimodalRobotics
ๆฅ็ๆ่ฆ
We introduce Emu3.5, a large-scale multimodal world model that natively
predicts the next state across vision and language. Emu3.5 is pre-trained
end-to-end with a unified next-token prediction objective on a corpus of
vision-language interleaved data containing over 10 trillion tokens, primarily
derived from sequential frames and transcripts of internet videos. The model
naturally accepts interleaved vision-language inputs and generates interleaved
vision-language outputs. Emu3.5 is further post-trained with large-scale
reinforcement learning to enhance multimodal reasoning and generation. To
improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA),
which converts token-by-token decoding into bidirectional parallel prediction,
accelerating per-image inference by about 20x without sacrificing performance.
Emu3.5 exhibits strong native multimodal capabilities, including long-horizon
vision-language generation, any-to-image (X2I) generation, and complex
text-rich image generation. It also exhibits generalizable world-modeling
abilities, enabling spatiotemporally consistent world exploration and
open-world embodied manipulation across diverse scenarios and tasks. For
comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image
(Nano Banana) on image generation and editing tasks and demonstrates superior
results on a suite of interleaved generation tasks. We open-source Emu3.5 at
https://github.com/baaivision/Emu3.5 to support community research.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinjin Li, Yulie Lu, Jinghan Cao, Yu Ma, Zhenglin Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Visual Question Answering (VQA) have demonstrated
impressive performance in natural image domains, with models like LLaVA
leveraging large language models (LLMs) for open-ended reasoning. However,
their generalization degrades significantly when transferred to out-of-domain
scenarios such as remote sensing, medical imaging, or math diagrams, due to
large distributional shifts and the lack of effective domain adaptation
mechanisms. Existing approaches typically rely on per-domain fine-tuning or
bespoke pipelines, which are costly, inflexible, and not scalable across
diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for
cross-domain adaptation that improves the generalization of VQA models while
requiring minimal changes to their core architecture. Our key idea is to
decouple visual and linguistic adaptation by introducing two lightweight
modules: a domain classifier to identify the input image type, and a dual
adapter mechanism comprising a Prompt Adapter for language modulation and a
Visual Adapter for vision feature adjustment. Both modules are dynamically
injected via a unified hook interface, requiring no retraining of the backbone
model. Experimental results across four domain-specific VQA benchmarks
demonstrate that our framework achieves consistent performance gains without
retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on
MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH
provides a scalable and extensible approach to multi-domain VQA, enabling
practical deployment across diverse application domains.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Manjunath Prasad Holenarasipura Rajiv, B. M. Vidyavathi
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
In real-world environments, AI systems often face unfamiliar scenarios
without labeled data, creating a major challenge for conventional scene
understanding models. The inability to generalize across unseen contexts limits
the deployment of vision-based applications in dynamic, unstructured settings.
This work introduces a Dynamic Context-Aware Scene Reasoning framework that
leverages Vision-Language Alignment to address zero-shot real-world scenarios.
The goal is to enable intelligent systems to infer and adapt to new
environments without prior task-specific training. The proposed approach
integrates pre-trained vision transformers and large language models to align
visual semantics with natural language descriptions, enhancing contextual
comprehension. A dynamic reasoning module refines predictions by combining
global scene cues and object-level interactions guided by linguistic priors.
Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and
Open Images demonstrate up to 18% improvement in scene understanding accuracy
over baseline models in complex and unseen environments. Results also show
robust performance in ambiguous or cluttered scenes due to the synergistic
fusion of vision and language. This framework offers a scalable and
interpretable approach for context-aware reasoning, advancing zero-shot
generalization in dynamic real-world settings.
๐
2025-10-30
๐ 6 pages, 2025 IEEE International Workshop on Metro
๐ฅ Wondimagegn Abebe Demissie, Stefano Roccella, Rudy Rossetto, Antonio Minnocci, Andrea Vannini et al.
Machine Learning
ๆฅ็ๆ่ฆ
Olive tree biovolume estimation is a key task in precision agriculture,
supporting yield prediction and resource management, especially in
Mediterranean regions severely impacted by climate-induced stress. This study
presents a comparative analysis of three deep learning models U-Net,
YOLOv11m-seg, and Mask RCNN for segmenting olive tree crowns and their shadows
in ultra-high resolution UAV imagery. The UAV dataset, acquired over
Vicopisano, Italy, includes manually annotated crown and shadow masks. Building
on these annotations, the methodology emphasizes spatial feature extraction and
robust segmentation; per-tree biovolume is then estimated by combining crown
projected area with shadow-derived height using solar geometry. In testing,
Mask R-CNN achieved the best overall accuracy (F1 = 0.86; mIoU = 0.72), while
YOLOv11m-seg provided the fastest throughput (0.12 second per image). The
estimated biovolumes spanned from approximately 4 to 24 cubic meters,
reflecting clear structural differences among trees. These results indicate
Mask R-CNN is preferable when biovolume accuracy is paramount, whereas
YOLOv11m-seg suits large-area deployments where speed is critical; U-Net
remains a lightweight, high-sensitivity option. The framework enables accurate,
scalable orchard monitoring and can be further strengthened with DEM or DSM
integration and field calibration for operational decision support.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wen Xie, Yanjun Zhu, Gijs Overgoor, Yakov Bart, Agata Lapedriza Garcia et al.
Multimodal
ๆฅ็ๆ่ฆ
Advertisers commonly need multiple versions of the same advertisement (ad) at
varying durations for a single campaign. The traditional approach involves
manually selecting and re-editing shots from longer video ads to create shorter
versions, which is labor-intensive and time-consuming. In this paper, we
introduce a framework for automated video ad clipping using video summarization
techniques. We are the first to frame video clipping as a shot selection
problem, tailored specifically for advertising. Unlike existing general video
summarization methods that primarily focus on visual content, our approach
emphasizes the critical role of audio in advertising. To achieve this, we
develop a two-stream audio-visual fusion model that predicts the importance of
video frames, where importance is defined as the likelihood of a frame being
selected in the firm-produced short ad. To address the lack of ad-specific
datasets, we present AdSum204, a novel dataset comprising 102 pairs of
30-second and 15-second ads from real advertising campaigns. Extensive
experiments demonstrate that our model outperforms state-of-the-art methods
across various metrics, including Average Precision, Area Under Curve,
Spearman, and Kendall.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hao Xie, Zixun Huang, Yushen Zuo, Yakun Ju, Frank H. F. Leung et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Spine segmentation, based on ultrasound volume projection imaging (VPI),
plays a vital role for intelligent scoliosis diagnosis in clinical
applications. However, this task faces several significant challenges. Firstly,
the global contextual knowledge of spines may not be well-learned if we neglect
the high spatial correlation of different bone features. Secondly, the spine
bones contain rich structural knowledge regarding their shapes and positions,
which deserves to be encoded into the segmentation process. To address these
challenges, we propose a novel scale-adaptive structure-aware network
(SA$^{2}$Net) for effective spine segmentation. First, we propose a
scale-adaptive complementary strategy to learn the cross-dimensional
long-distance correlation features for spinal images. Second, motivated by the
consistency between multi-head self-attention in Transformers and semantic
level affinity, we propose structure-affinity transformation to transform
semantic features with class-specific affinity and combine it with a
Transformer decoder for structure-aware reasoning. In addition, we adopt a
feature mixing loss aggregation method to enhance model training. This method
improves the robustness and accuracy of the segmentation process. The
experimental results demonstrate that our SA$^{2}$Net achieves superior
segmentation performance compared to other state-of-the-art methods. Moreover,
the adaptability of SA$^{2}$Net to various backbones enhances its potential as
a promising tool for advanced scoliosis diagnosis using intelligent spinal
image analysis. The code and experimental demo are available at
https://github.com/taetiseo09/SA2Net.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Vinรญcius Ferraria, Eurico Ruivo
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
The edge detection task is essential in image processing aiming to extract
relevant information from an image. One recurring problem in this task is the
weaknesses found in some detectors, such as the difficulty in detecting loose
edges and the lack of context to extract relevant information from specific
problems. To address these weaknesses and adapt the detector to the properties
of an image, an adaptable detector described by two-dimensional cellular
automaton and optimized by meta-heuristic combined with transfer learning
techniques was developed. This study aims to analyze the impact of expanding
the search space of the optimization phase and the robustness of the
adaptability of the detector in identifying edges of a set of natural images
and specialized subsets extracted from the same image set. The results obtained
prove that expanding the search space of the optimization phase was not
effective for the chosen image set. The study also analyzed the adaptability of
the model through a series of experiments and validation techniques and found
that, regardless of the validation, the model was able to adapt to the input
and the transfer learning techniques applied to the model showed no significant
improvements.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Self-improvement has emerged as a mainstream paradigm for advancing the
reasoning capabilities of large vision-language models (LVLMs), where models
explore and learn from successful trajectories iteratively. However, we
identify a critical issue during this process: the model excels at generating
high-quality trajectories for simple queries (i.e., head data) but struggles
with more complex ones (i.e., tail data). This leads to an imbalanced
optimization that drives the model to prioritize simple reasoning skills, while
hindering its ability to tackle more complex reasoning tasks. Over iterations,
this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew
effect"--which ultimately hinders further model improvement and leads to
performance bottlenecks. To counteract this challenge, we introduce four
efficient strategies from two perspectives: distribution-reshaping and
trajectory-resampling, to achieve head-tail re-balancing during the
exploration-and-learning self-improvement process. Extensive experiments on
Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks
demonstrate that our methods consistently improve visual reasoning
capabilities, outperforming vanilla self-improvement by 3.86 points on average.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang
Multimodal
ๆฅ็ๆ่ฆ
Object-context shortcuts remain a persistent challenge in vision-language
models, undermining zero-shot reliability when test-time scenes differ from
familiar training co-occurrences. We recast this issue as a causal inference
problem and ask: Would the prediction remain if the object appeared in a
different environment? To answer this at inference time, we estimate object and
background expectations within CLIP's representation space, and synthesize
counterfactual embeddings by recombining object features with diverse
alternative contexts sampled from external datasets, batch neighbors, or
text-derived descriptions. By estimating the Total Direct Effect and simulating
intervention, we further subtract background-only activation, preserving
beneficial object-context interactions while mitigating hallucinated scores.
Without retraining or prompt design, our method substantially improves both
worst-group and average accuracy on context-sensitive benchmarks, establishing
a new zero-shot state of the art. Beyond performance, our framework provides a
lightweight representation-level counterfactual approach, offering a practical
causal avenue for debiased and reliable multimodal reasoning.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuanting Fan, Jun Liu, Xiaochen Chen, Bin-Bin Gao, Jian Li et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Few-shot anomaly detection (FSAD) methods identify anomalous regions with few
known normal samples. Most existing methods rely on the generalization ability
of pre-trained vision-language models (VLMs) to recognize potentially anomalous
regions through feature similarity between text descriptions and images.
However, due to the lack of detailed textual descriptions, these methods can
only pre-define image-level descriptions to match each visual patch token to
identify potential anomalous regions, which leads to the semantic misalignment
between image descriptions and patch-level visual anomalies, achieving
sub-optimal localization performance. To address the above issues, we propose
the Multi-Level Fine-Grained Semantic Caption (MFSC) to provide multi-level and
fine-grained textual descriptions for existing anomaly detection datasets with
automatic construction pipeline. Based on the MFSC, we propose a novel
framework named FineGrainedAD to improve anomaly localization performance,
which consists of two components: Multi-Level Learnable Prompt (MLLP) and
Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics
into multi-level learnable prompts through automatic replacement and
concatenation mechanism, while MLSA designs region aggregation strategy and
multi-level alignment training to facilitate learnable prompts better align
with corresponding visual regions. Experiments demonstrate that the proposed
FineGrainedAD achieves superior overall performance in few-shot settings on
MVTec-AD and VisA datasets.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rhodri Guerrier, Adam W. Harley, Dima Damen
Computer Vision
ๆฅ็ๆ่ฆ
Recent advances in foundational 3D reconstruction models, such as DUSt3R and
MASt3R, have shown great potential in 2D and 3D correspondence in static
scenes. In this paper, we propose to adapt them for the task of point tracking
through 3D grounded correspondence. We first demonstrate that these models are
competitive point trackers when focusing on static points, present in current
point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We propose
to combine the reconstruction loss with training for dynamic correspondence
along with a visibility head, and fine-tuning MASt3R for point tracking using a
relatively small amount of synthetic data. Importantly, we only train and
evaluate on pairs of frames where one contains the query point, effectively
removing any temporal context. Using a mix of dynamic and static point
correspondences, we achieve competitive or superior point tracking results on
four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 $\delta_{avg}$ / 85.8\%
occlusion acc. for PointSt3R compared to 75.7 / 88.3\% for CoTracker2; and
significantly outperform CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs
82.8). We also present results on 3D point tracking along with several
ablations on training datasets and percentage of dynamic correspondences.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shihab Aaqil Ahamed, Udaya S. K. P. Miriya Thanthrige, Ranga Rodrigo, Muhammad Haris Khan
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Test-time prompt tuning (TPT) has emerged as a promising technique for
adapting large vision-language models (VLMs) to unseen tasks without relying on
labeled data. However, the lack of dispersion between textual features can hurt
calibration performance, which raises concerns about VLMs' reliability,
trustworthiness, and safety. Current TPT approaches primarily focus on
improving prompt calibration by either maximizing average textual feature
dispersion or enforcing orthogonality constraints to encourage angular
separation. However, these methods may not always have optimal angular
separation between class-wise textual features, which implies overlooking the
critical role of angular diversity. To address this, we propose A-TPT, a novel
TPT framework that introduces angular diversity to encourage uniformity in the
distribution of normalized textual features induced by corresponding learnable
prompts. This uniformity is achieved by maximizing the minimum pairwise angular
distance between features on the unit hypersphere. We show that our approach
consistently surpasses state-of-the-art TPT methods in reducing the aggregate
average calibration error while maintaining comparable accuracy through
extensive experiments with various backbones on different datasets. Notably,
our approach exhibits superior zero-shot calibration performance on natural
distribution shifts and generalizes well to medical datasets. We provide
extensive analyses, including theoretical aspects, to establish the grounding
of A-TPT. These results highlight the potency of promoting angular diversity to
achieve well-dispersed textual features, significantly improving VLM
calibration during test-time adaptation. Our code will be made publicly
available.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang
Multimodal
ๆฅ็ๆ่ฆ
Recently text-to-video generation has made impressive progress in producing
short, high-quality clips, but evaluating long-form outputs remains a major
challenge especially when processing complex prompts. Existing benchmarks
mostly rely on simplified prompts and focus on low-level metrics, overlooking
fine-grained alignment with prompts and abstract dimensions such as narrative
coherence and thematic expression. To address these gaps, we propose
LoCoT2V-Bench, a benchmark specifically designed for long video generation
(LVG) under complex input conditions. Based on various real-world videos,
LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating
elements like scene transitions and event dynamics. Moreover, it constructs a
multi-dimensional evaluation framework that includes our newly proposed metrics
such as event-level alignment, fine-grained temporal consistency, content
clarity, and the Human Expectation Realization Degree (HERD) that focuses on
more abstract attributes like narrative flow, emotional response, and character
development. Using this framework, we conduct a comprehensive evaluation of
nine representative LVG models, finding that while current methods perform well
on basic visual and temporal aspects, they struggle with inter-event
consistency, fine-grained alignment, and high-level thematic adherence, etc.
Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for
evaluating long-form complex text-to-video generation and highlights critical
directions for future method improvement.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Igor Abramov, Ilya Makarov
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Existing EEG-driven image reconstruction methods often overlook spatial
attention mechanisms, limiting fidelity and semantic coherence. To address
this, we propose a dual-conditioning framework that combines EEG embeddings
with spatial saliency maps to enhance image generation. Our approach leverages
the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes
Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals
with visual semantics, while a ControlNet branch conditions generation on
saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves
a significant improvement in the quality of low- and high-level image features
over existing approaches. Simultaneously, strongly aligning with human visual
attention. The results demonstrate that attentional priors resolve EEG
ambiguities, enabling high-fidelity reconstructions with applications in
medical diagnostics and neuroadaptive interfaces, advancing neural decoding
through efficient adaptation of pre-trained diffusion models.
๐
2025-10-30
๐ ArXiv Preprint (eess.IV)
๐ฅ Xizhi Tian, Changjun Zhou, Yulin. Yang
Machine Learning
ๆฅ็ๆ่ฆ
Multi-organ segmentation is a critical task in computer-aided diagnosis.
While recent deep learning methods have achieved remarkable success in image
segmentation, huge variations in organ size and shape challenge their
effectiveness in multi-organ segmentation. To address these challenges, we
propose a Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), a novel
two-stage segmentation paradigm designed to improve multi-organ segmentation
accuracy. Our SPG-CDENet consists of two key components: a spatial prior
network and a cross dual encoder network. The prior network generates coarse
localization maps that delineate the approximate ROI, serving as spatial
guidance for the dual encoder network. The cross dual encoder network comprises
four essential components: a global encoder, a local encoder, a symmetric
cross-attention module, and a flow-based decoder. The global encoder captures
global semantic features from the entire image, while the local encoder focuses
on features from the prior network. To enhance the interaction between the
global and local encoders, a symmetric cross-attention module is proposed
across all layers of the encoders to fuse and refine features. Furthermore, the
flow-based decoder directly propagates high-level semantic features from the
final encoder layer to all decoder layers, maximizing feature preservation and
utilization. Extensive qualitative and quantitative experiments on two public
datasets demonstrate the superior performance of SPG-CDENet compared to
existing segmentation methods. Furthermore, ablation studies further validate
the effectiveness of the proposed modules in improving segmentation accuracy.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kazuma Kano, Yuki Mori, Shin Katayama, Kenta Urano, Takuro Yonezawa et al.
ๆฅ็ๆ่ฆ
Worker location data is key to higher productivity in industrial sites.
Cameras are a promising tool for localization in logistics warehouses since
they also offer valuable environmental contexts such as package status.
However, identifying individuals with only visual data is often impractical.
Accordingly, several prior studies identified people in videos by comparing
their trajectories and wearable sensor measurements. While this approach has
advantages such as independence from appearance, the existing methods may break
down under real-world conditions. To overcome this challenge, we propose CorVS,
a novel data-driven person identification method based on correspondence
between visual tracking trajectories and sensor measurements. Firstly, our deep
learning model predicts correspondence probabilities and reliabilities for
every pair of a trajectory and sensor measurements. Secondly, our algorithm
matches the trajectories and sensor measurements over time using the predicted
probabilities and reliabilities. We developed a dataset with actual warehouse
operations and demonstrated the method's effectiveness for real-world
applications.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Mirko Usuelli, David Rapado-Rincon, Gert Kootstra, Matteo Matteucci
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Autonomous robots in orchards require real-time 3D scene understanding
despite repetitive row geometry, seasonal appearance changes, and wind-driven
foliage motion. We present AgriGS-SLAM, a Visual--LiDAR SLAM framework that
couples direct LiDAR odometry and loop closures with multi-camera 3D Gaussian
Splatting (3DGS) rendering. Batch rasterization across complementary viewpoints
recovers orchard structure under occlusions, while a unified gradient-driven
map lifecycle executed between keyframes preserves fine details and bounds
memory. Pose refinement is guided by a probabilistic LiDAR-based depth
consistency term, back-propagated through the camera projection to tighten
geometry-appearance coupling. We deploy the system on a field platform in apple
and pear orchards across dormancy, flowering, and harvesting, using a
standardized trajectory protocol that evaluates both training-view and
novel-view synthesis to reduce 3DGS overfitting in evaluation. Across seasons
and sites, AgriGS-SLAM delivers sharper, more stable reconstructions and
steadier trajectories than recent state-of-the-art 3DGS-SLAM baselines while
maintaining real-time performance on-tractor. While demonstrated in orchard
monitoring, the approach can be applied to other outdoor domains requiring
robust multimodal perception.
๐
2025-10-30
๐ ICLR 2026
๐ฅ Mingyu Sung, Seungjae Ham, Kangwoo Kim, Yeokyoung Yoon, Sangseok Yun et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Image super-resolution(SR) is fundamental to many vision system-from
surveillance and autonomy to document analysis and retail analytics-because
recovering high-frequency details, especially scene-text, enables reliable
downstream perception. Scene-text, i.e., text embedded in natural images such
as signs, product labels, and storefronts, often carries the most actionable
information; when characters are blurred or hallucinated, optical character
recognition(OCR) and subsequent decisions fail even if the rest of the image
appears sharp. Yet previous SR research has often been tuned to distortion
(PSNR/SSIM) or learned perceptual metrics (LIPIS, MANIQA, CLIP-IQA, MUSIQ) that
are largely insensitive to character-level errors. Furthermore, studies that do
address text SR often focus on simplified benchmarks with isolated characters,
overlooking the challenges of text within complex natural scenes. As a result,
scene-text is effectively treated as generic texture. For SR to be effective in
practical deployments, it is therefore essential to explicitly optimize for
both text legibility and perceptual quality. We present GLYPH-SR, a
vision-language-guided diffusion framework that aims to achieve both objectives
jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet(TS-ControlNet) guided by
OCR data, and a ping-pong scheduler that alternates between text- and
scene-centric guidance. To enable targeted text restoration, we train these
components on a synthetic corpus while keeping the main SR branch frozen.
Across SVT, SCUT-CTW1500, and CUTE80 at x4, and x8, GLYPH-SR improves OCR F1 by
up to +15.18 percentage points over diffusion/GAN baseline (SVT x8, OpenOCR)
while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed
to satisfy both objectives simultaneously-high readability and high visual
realism-delivering SR that looks right and reds right.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Junlai Qiu, Yunzhu Chen, Hao Zheng, Yawen Huang, Yuexiang Li
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged
and elderly people, which significantly impacts their daily lives and mental
health. To improve the efficiency of clinical screening and enable the early
detection of DR, a variety of automated DR diagnosis systems have been recently
established based on convolutional neural network (CNN) or vision Transformer
(ViT). However, due to the own shortages of CNN / ViT, the performance of
existing methods using single-type backbone has reached a bottleneck. One
potential way for the further improvements is integrating different kinds of
backbones, which can fully leverage the respective strengths of them
(\emph{i.e.,} the local feature extraction capability of CNN and the global
feature capturing ability of ViT). To this end, we propose a novel paradigm to
effectively fuse the features extracted by different backbones based on the
theory of evidence. Specifically, the proposed evidential fusion paradigm
transforms the features from different backbones into supporting evidences via
a set of deep evidential networks. With the supporting evidences, the
aggregated opinion can be accordingly formed, which can be used to adaptively
tune the fusion pattern between different backbones and accordingly boost the
performance of our hybrid model. We evaluated our method on two publicly
available DR grading datasets. The experimental results demonstrate that our
hybrid model not only improves the accuracy of DR grading, compared to the
state-of-the-art frameworks, but also provides the excellent interpretability
for feature fusion and decision-making.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jelizaveta Jankowska, Boลผena Kostek, Fernando Alonso-Fernandez, Prayag Tiwari
ๆฅ็ๆ่ฆ
The subject of this work is to check how different types of music affect
human emotions. While listening to music, a subjective survey and brain
activity measurements were carried out using an EEG helmet. The aim is to
demonstrate the impact of different music genres on emotions. The research
involved a diverse group of participants of different gender and musical
preferences. This had the effect of capturing a wide range of emotional
responses to music. After the experiment, a relationship analysis of the
respondents' questionnaires with EEG signals was performed. The analysis
revealed connections between emotions and observed brain activity.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Luting Wang, Yinghao Xiang, Hongliang Huang, Dongjun Li, Chen Gao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Agile Earth Observation Satellites (AEOSs) constellations offer unprecedented
flexibility for monitoring the Earth's surface, but their scheduling remains
challenging under large-scale scenarios, dynamic environments, and stringent
constraints. Existing methods often simplify these complexities, limiting their
real-world performance. We address this gap with a unified framework
integrating a standardized benchmark suite and a novel scheduling model. Our
benchmark suite, AEOS-Bench, contains $3,907$ finely tuned satellite assets and
$16,410$ scenarios. Each scenario features $1$ to $50$ satellites and $50$ to
$300$ imaging tasks. These scenarios are generated via a high-fidelity
simulation platform, ensuring realistic satellite behavior such as orbital
dynamics and resource constraints. Ground truth scheduling annotations are
provided for each scenario. To our knowledge, AEOS-Bench is the first
large-scale benchmark suite tailored for realistic constellation scheduling.
Building upon this benchmark, we introduce AEOS-Former, a Transformer-based
scheduling model that incorporates a constraint-aware attention mechanism. A
dedicated internal constraint module explicitly models the physical and
operational limits of each satellite. Through simulation-based iterative
learning, AEOS-Former adapts to diverse scenarios, offering a robust solution
for AEOS constellation scheduling. Experimental results demonstrate that
AEOS-Former outperforms baseline models in task completion and energy
efficiency, with ablation studies highlighting the contribution of each
component. Code and data are provided in
https://github.com/buaa-colalab/AEOSBench.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Jose Maria Buades Rubio, Josef Bigun
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
We focus on ocular biometrics, specifically the periocular region (the area
around the eye), which offers high discrimination and minimal acquisition
constraints. We evaluate three Convolutional Neural Network architectures of
varying depth and complexity to assess their effectiveness for periocular
recognition. The networks are trained on 1,907,572 ocular crops extracted from
the large-scale VGGFace2 database. This significantly contrasts with existing
works, which typically rely on small-scale periocular datasets for training
having only a few thousand images. Experiments are conducted with ocular images
from VGGFace2-Pose, a subset of VGGFace2 containing in-the-wild face images,
and the UFPR-Periocular database, which consists of selfies captured via mobile
devices with user guidance on the screen. Due to the uncontrolled conditions of
VGGFace2, the Equal Error Rates (EERs) obtained with ocular crops range from
9-15%, noticeably higher than the 3-6% EERs achieved using full-face images. In
contrast, UFPR-Periocular yields significantly better performance (EERs of
1-2%), thanks to higher image quality and more consistent acquisition
protocols. To the best of our knowledge, these are the lowest reported EERs on
the UFPR dataset to date.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lin Liu, Guanyi Yu, Ziying Song, Junqiao Li, Caiyan Jia et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Planning is a critical component of end-to-end autonomous driving. However,
prevailing imitation learning methods often suffer from mode collapse, failing
to produce diverse trajectory hypotheses. Meanwhile, existing generative
approaches struggle to incorporate crucial safety and physical constraints
directly into the generative process, necessitating an additional optimization
stage to refine their outputs. To address these limitations, we propose CATG, a
novel planning framework that leverages Constrained Flow Matching. Concretely,
CATG explicitly models the flow matching process, which inherently mitigates
mode collapse and allows for flexible guidance from various conditioning
signals. Our primary contribution is the novel imposition of explicit
constraints directly within the flow matching process, ensuring that the
generated trajectories adhere to vital safety and kinematic rules. Secondly,
CATG parameterizes driving aggressiveness as a control signal during
generation, enabling precise manipulation of trajectory style. Notably, on the
NavSim v2 challenge, CATG achieved 2nd place with an EPDMS score of 51.31 and
was honored with the Innovation Award.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Fernando Alonso-Fernandez, Kevin Hernandez Diaz, Jose M. Buades, Kiran Raja, Josef Bigun
ๆฅ็ๆ่ฆ
We study the complementarity of different CNNs for periocular verification at
different distances on the UBIPr database. We train three architectures of
increasing complexity (SqueezeNet, MobileNetv2, and ResNet50) on a large set of
eye crops from VGGFace2. We analyse performance with cosine and chi2 metrics,
compare different network initialisations, and apply score-level fusion via
logistic regression. In addition, we use LIME heatmaps and Jensen-Shannon
divergence to compare attention patterns of the CNNs. While ResNet50
consistently performs best individually, the fusion provides substantial gains,
especially when combining all three networks. Heatmaps show that networks
usually focus on distinct regions of a given image, which explains their
complementarity. Our method significantly outperforms previous works on UBIPr,
achieving a new state-of-the-art.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Lin Guo, Xiaoqing Luo, Wei Xie, Zhancheng Zhang, Hui Li et al.
ๆฅ็ๆ่ฆ
Existing infrared and visible image fusion methods often face the dilemma of
balancing modal information. Generative fusion methods reconstruct fused images
by learning from data distributions, but their generative capabilities remain
limited. Moreover, the lack of interpretability in modal information selection
further affects the reliability and consistency of fusion results in complex
scenarios. This manuscript revisits the essence of generative image fusion
under the inspiration of human cognitive laws and proposes a novel infrared and
visible image fusion method, termed HCLFuse. First, HCLFuse investigates the
quantification theory of information mapping in unsupervised fusion networks,
which leads to the design of a multi-scale mask-regulated variational
bottleneck encoder. This encoder applies posterior probability modeling and
information decomposition to extract accurate and concise low-level modal
information, thereby supporting the generation of high-fidelity structural
details. Furthermore, the probabilistic generative capability of the diffusion
model is integrated with physical laws, forming a time-varying physical
guidance mechanism that adaptively regulates the generation process at
different stages, thereby enhancing the ability of the model to perceive the
intrinsic structure of data and reducing dependence on data quality.
Experimental results show that the proposed method achieves state-of-the-art
fusion performance in qualitative and quantitative evaluations across multiple
datasets and significantly improves semantic segmentation metrics. This fully
demonstrates the advantages of this generative image fusion method, drawing
inspiration from human cognition, in enhancing structural consistency and
detail quality.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Modern vision-language models (VLMs) excel at many multimodal tasks, yet
their grasp of temporal information in video remains weak and, crucially,
under-evaluated. We probe this gap with a deceptively simple but revealing
challenge: judging the arrow of time (AoT)-whether a short clip is played
forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated
benchmark that tests whether VLMs can infer temporal direction in natural
videos using the same stimuli and behavioral baselines established for humans.
Our comprehensive evaluation of open-weight and proprietary, reasoning and
non-reasoning VLMs reveals that most models perform near chance, and even the
best lag far behind human accuracy on physically irreversible processes (e.g.,
free fall, diffusion/explosion) and causal manual actions (division/addition)
that humans recognize almost instantly. These results highlight a fundamental
gap in current multimodal systems: while they capture rich visual-semantic
correlations, they lack the inductive biases required for temporal continuity
and causal understanding. We release the code and data for AoT-PsyPhyBENCH to
encourage further progress in the physical and temporal reasoning capabilities
of VLMs.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hengrui Kang, Zhuangcheng Gu, Zhiyuan Zhao, Zichen Wen, Bin Wang et al.
ๆฅ็ๆ่ฆ
Document AI has advanced rapidly and is attracting increasing attention. Yet,
while most efforts have focused on document layout analysis (DLA), its
generative counterpart, document layout generation, remains underexplored. A
major obstacle lies in the scarcity of diverse layouts: academic papers with
Manhattan-style structures dominate existing studies, while open-world genres
such as newspapers and magazines remain severely underrepresented. To address
this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse
document layouts, covering six common document types and comprising
contemporary layouts collected from multiple sources. Moreover, since existing
methods struggle in complex domains and often fail to arrange long sequences
coherently, we introduce OmniLayout-LLM, a 0.5B model with designed two-stage
Coarse-to-Fine learning paradigm: 1) learning universal layout principles from
OmniLayout-1M with coarse category definitions, and 2) transferring the
knowledge to a specific domain with fine-grained annotations. Extensive
experiments demonstrate that our approach achieves strong performance on
multiple domains in M$^{6}$Doc dataset, substantially surpassing both existing
layout generation experts and several latest general-purpose LLMs. Our code,
models, and dataset will be publicly released.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Mehdi Khaleghi, Nastaran Khaleghi, Sobhan Sheykhivand, Sebelan Danishvar
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The sustainability of supply chain plays a key role in achieving optimal
performance in controlling the supply chain. The management of risks that occur
in a supply chain is a fundamental problem for the purpose of developing the
sustainability of the network and elevating the performance efficiency of the
supply chain. The correct classification of products is another essential
element in a sustainable supply chain. Acknowledging recent breakthroughs in
the context of deep networks, several architectural options have been deployed
to analyze supply chain datasets. A novel geometric deep network is used to
propose an ensemble deep network. The proposed Chebyshev ensemble geometric
network (Ch-EGN) is a hybrid convolutional and geometric deep learning. This
network is proposed to leverage the information dependencies in supply chain to
derive invisible states of samples in the database. The functionality of the
proposed deep network is assessed on the two different databases. The
SupplyGraph Dataset and DataCo are considered in this research. The prediction
of delivery status of DataCo supply chain is done for risk administration. The
product classification and edge classification are performed using the
SupplyGraph database to enhance the sustainability of the supply network. An
average accuracy of 98.95% is obtained for the ensemble network for risk
management. The average accuracy of 100% and 98.07% are obtained for
sustainable supply chain in terms of 5 product group classification and 4
product relation classification, respectively. The average accuracy of 92.37%
is attained for 25 company relation classification. The results confirm an
average improvement and efficiency of the proposed method compared to the
state-of-the-art approaches.
๐
2025-10-30
๐ SIGGRAPH
๐ฅ Li Wang, Yiyu Zhuang, Yanwen Wang, Xun Cao, Chuan Guo et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
3D human pose estimation from sketches has broad applications in computer
animation and film production. Unlike traditional human pose estimation, this
task presents unique challenges due to the abstract and disproportionate nature
of sketches. Previous sketch-to-pose methods, constrained by the lack of
large-scale sketch-3D pose annotations, primarily relied on optimization with
heuristic rules-an approach that is both time-consuming and limited in
generalizability. To address these challenges, we propose a novel approach
leveraging a "learn from synthesis" strategy. First, a diffusion model is
trained to synthesize sketch images from 2D poses projected from 3D human
poses, mimicking disproportionate human structures in sketches. This process
enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k
accurate sketch-3D pose annotation pairs across various sketch styles. Building
on this synthetic dataset, we introduce an end-to-end data-driven framework for
estimating human poses and shapes from diverse sketch styles. Our framework
combines existing 2D pose detectors and generative diffusion priors for sketch
feature extraction with a feed-forward neural network for efficient 2D pose
estimation. Multiple heuristic loss functions are incorporated to guarantee
geometric coherence between the derived 3D poses and the detected 2D poses
while preserving accurate self-contacts. Qualitative, quantitative, and
subjective evaluations collectively show that our model substantially surpasses
previous ones in both estimation accuracy and speed for sketch-to-pose tasks.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Jinho Choi, Hyesu Lim, Steffen Schneider, Jaegul Choo
Machine Learning
ๆฅ็ๆ่ฆ
Dataset bias, where data points are skewed to certain concepts, is ubiquitous
in machine learning datasets. Yet, systematically identifying these biases is
challenging without costly, fine-grained attribute annotations. We present
ConceptScope, a scalable and automated framework for analyzing visual datasets
by discovering and quantifying human-interpretable concepts using Sparse
Autoencoders trained on representations from vision foundation models.
ConceptScope categorizes concepts into target, context, and bias types based on
their semantic relevance and statistical correlation to class labels, enabling
class-level dataset characterization, bias identification, and robustness
evaluation through concept-based subgrouping. We validate that ConceptScope
captures a wide range of visual concepts, including objects, textures,
backgrounds, facial attributes, emotions, and actions, through comparisons with
annotated datasets. Furthermore, we show that concept activations produce
spatial attributions that align with semantically meaningful image regions.
ConceptScope reliably detects known biases (e.g., background bias in
Waterbirds) and uncovers previously unannotated ones (e.g, co-occurring objects
in ImageNet), offering a practical tool for dataset auditing and model
diagnostics.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wontae Choi, Jaelin Lee, Hyung Sup Yun, Byeungwoo Jeon, Il Yong Chun
Computer Vision
ๆฅ็ๆ่ฆ
Accurate estimation of motion information is crucial in diverse computational
imaging and computer vision applications. Researchers have investigated various
methods to extract motion information from a single blurred image, including
blur kernels and optical flow. However, existing motion representations are
often of low quality, i.e., coarse-grained and inaccurate. In this paper, we
propose the first high-resolution (HR) Motion Trajectory estimation framework
using Diffusion models (MoTDiff). Different from existing motion
representations, we aim to estimate an HR motion trajectory with high-quality
from a single motion-blurred image. The proposed MoTDiff consists of two key
components: 1) a new conditional diffusion framework that uses multi-scale
feature maps extracted from a single blurred image as a condition, and 2) a new
training method that can promote precise identification of a fine-grained
motion trajectory, consistent estimation of overall shape and position of a
motion path, and pixel connectivity along a motion trajectory. Our experiments
demonstrate that the proposed MoTDiff can outperform state-of-the-art methods
in both blind image deblurring and coded exposure photography applications.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Satoshi Kikuch, Masaya Kato, Tsuyoshi Tasaki
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Self-localization on a 3D map by using an inexpensive monocular camera is
required to realize autonomous driving. Self-localization based on a camera
often uses a convolutional neural network (CNN) that can extract local features
that are calculated by nearby pixels. However, when dynamic obstacles, such as
people, are present, CNN does not work well. This study proposes a new method
combining CNN with Vision Transformer, which excels at extracting global
features that show the relationship of patches on whole image. Experimental
results showed that, compared to the state-of-the-art method (SOTA), the
accuracy improvement rate in a CG dataset with dynamic obstacles is 1.5 times
higher than that without dynamic obstacles. Moreover, the self-localization
error of our method is 20.1% smaller than that of SOTA on public datasets.
Additionally, our robot using our method can localize itself with 7.51cm error
on average, which is more accurate than SOTA.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma et al.
ๆฅ็ๆ่ฆ
Wearable devices such as smart glasses are transforming the way people
interact with their surroundings, enabling users to seek information regarding
entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG)
plays a key role in supporting such questions, yet there is still no
comprehensive benchmark for this task, especially regarding wearables
scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG
benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse
set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn
conversations across 13 domains, including 6.2K egocentric images designed to
mimic captures from wearable devices. We carefully constructed the questions to
reflect real-world scenarios and challenges, including five types of
image-quality issues, six question types, varying entity popularity, differing
information dynamism, and different conversation turns. We design three tasks:
single-source augmentation, multi-source augmentation, and multi-turn
conversations -- each paired with an associated retrieval corpus and APIs for
both image-KG retrieval and webpage retrieval. Our evaluation shows that
straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM
single- and multi-turn QA, respectively, whereas state-of-the-art industry
solutions have similar quality (32%/45%), underscoring ample room for
improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K
participants and 5K submissions, with winning solutions improving baseline
performance by 28%, highlighting its early impact on advancing the field.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sudipto Das Sukanto, Diponker Roy, Fahim Shakil, Nirjhar Singha, Abdullah Asik et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Modes of transportation vary across countries depending on geographical
location and cultural context. In South Asian countries rickshaws are among the
most common means of local transport. Based on their mode of operation,
rickshaws in cities across Bangladesh can be broadly classified into non-auto
(pedal-powered) and auto-rickshaws (motorized). Monitoring the movement of
auto-rickshaws is necessary as traffic rules often restrict auto-rickshaws from
accessing certain routes. However, existing surveillance systems make it quite
difficult to monitor them due to their similarity to other vehicles, especially
non-auto rickshaws whereas manual video analysis is too time-consuming. This
paper presents a machine learning-based approach to automatically detect
auto-rickshaws in traffic images. In this system, we used real-time object
detection using the YOLOv8 model. For training purposes, we prepared a set of
1,730 annotated images that were captured under various traffic conditions. The
results show that our proposed model performs well in real-time auto-rickshaw
detection and offers an mAP50 of 83.447% and binary precision and recall values
above 78%, demonstrating its effectiveness in handling both dense and sparse
traffic scenarios. The dataset has been publicly released for further research.
๐
2025-10-30
๐ ICCV 2025
๐ฅ Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Large annotated datasets are essential for training robust Computer-Aided
Diagnosis (CAD) models for breast cancer detection or risk prediction. However,
acquiring such datasets with fine-detailed annotation is both costly and
time-consuming. Vision-Language Models (VLMs), such as CLIP, which are
pre-trained on large image-text pairs, offer a promising solution by enhancing
robustness and data efficiency in medical imaging tasks. This paper introduces
a novel Multi-View Mammography and Language Model for breast cancer
classification and risk prediction, trained on a dataset of paired mammogram
images and synthetic radiology reports. Our MV-MLM leverages multi-view
supervision to learn rich representations from extensive radiology data by
employing cross-modal self-supervision across image-text pairs. This includes
multiple views and the corresponding pseudo-radiology reports. We propose a
novel joint visual-textual learning strategy to enhance generalization and
accuracy performance over different data types and tasks to distinguish breast
tissues or cancer characteristics(calcification, mass) and utilize these
patterns to understand mammography images and predict cancer risk. We evaluated
our method on both private and publicly available datasets, demonstrating that
the proposed model achieves state-of-the-art performance in three
classification tasks: (1) malignancy classification, (2) subtype
classification, and (3) image-based cancer risk prediction. Furthermore, the
model exhibits strong data efficiency, outperforming existing fully supervised
or VLM baselines while trained on synthetic text reports and without the need
for actual radiology reports.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wei Shang, Wanying Zhang, Shuhang Gu, Pengfei Zhu, Qinghua Hu et al.
ๆฅ็ๆ่ฆ
Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution
of video frames, potentially at various scaling factors, which presents several
challenges regarding spatial detail reproduction, temporal consistency, and
computational complexity. In this paper, we propose a strong baseline BasicAVSR
for AVSR by integrating four key components: 1) adaptive multi-scale frequency
priors generated from image Laplacian pyramids, 2) a flow-guided propagation
unit to aggregate spatiotemporal information from adjacent frames, 3) a
second-order motion compensation unit for more accurate spatial alignment of
adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and
content-independent upsampling kernels. To meet diverse application demands, we
instantiate three propagation variants: (i) a unidirectional RNN unit for
strictly online inference, (ii) a unidirectional RNN unit empowered with a
limited lookahead that tolerates a small output delay, and (iii) a
bidirectional RNN unit designed for offline tasks where computational resources
are less constrained. Experimental results demonstrate the effectiveness and
adaptability of our model across these different scenarios. Through extensive
experiments, we show that BasicAVSR significantly outperforms existing methods
in terms of super-resolution quality, generalization ability, and inference
speed. Our work not only advances the state-of-the-art in AVSR but also extends
its core components to multiple frameworks for diverse scenarios. The code is
available at https://github.com/shangwei5/BasicAVSR.
๐
2025-10-30
๐ ArXiv Preprint (cs.GR)
๐ฅ Xin Hu, Pengfei Xu, Jin Zhou, Hongbo Fu, Hui Huang
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Structured layouts are preferable in many 2D visual contents (\eg, GUIs,
webpages) since the structural information allows convenient layout editing.
Computational frameworks can help create structured layouts but require heavy
labor input. Existing data-driven approaches are effective in automatically
generating fixed layouts but fail to produce layout structures. We present
StructLayoutFormer, a novel Transformer-based approach for conditional
structured layout generation. We use a structure serialization scheme to
represent structured layouts as sequences. To better control the structures of
generated layouts, we disentangle the structural information from the element
placements. Our approach is the first data-driven approach that achieves
conditional structured layout generation and produces realistic layout
structures explicitly. We compare our approach with existing data-driven layout
generation approaches by including post-processing for structure extraction.
Extensive experiments have shown that our approach exceeds these baselines in
conditional structured layout generation. We also demonstrate that our approach
is effective in extracting and transferring layout structures. The code is
publicly available at %\href{https://github.com/Teagrus/StructLayoutFormer}
{https://github.com/Teagrus/StructLayoutFormer}.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Lihe Ding, Shaocong Dong, Yaokun Li, Chenjian Gao, Xiao Chen et al.
ๆฅ็ๆ่ฆ
Part-based 3D generation holds great potential for various applications.
Previous part generators that represent parts using implicit vector-set tokens
often suffer from insufficient geometric details. Another line of work adopts
an explicit voxel representation but shares a global voxel grid among all
parts; this often causes small parts to occupy too few voxels, leading to
degraded quality. In this paper, we propose FullPart, a novel framework that
combines both implicit and explicit paradigms. It first derives the bounding
box layout through an implicit box vector-set diffusion process, a task that
implicit diffusion handles effectively since box tokens contain little
geometric detail. Then, it generates detailed parts, each within its own fixed
full-resolution voxel grid. Instead of sharing a global low-resolution space,
each part in our method - even small ones - is generated at full resolution,
enabling the synthesis of intricate details. We further introduce a
center-point encoding strategy to address the misalignment issue when
exchanging information between parts of different actual sizes, thereby
maintaining global coherence. Moreover, to tackle the scarcity of reliable part
data, we present PartVerse-XL, the largest human-annotated 3D part dataset to
date with 40K objects and 320K parts. Extensive experiments demonstrate that
FullPart achieves state-of-the-art results in 3D part generation. We will
release all code, data, and model to benefit future research in 3D part
generation.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ali Caglayan, Nevrez Imamoglu, Oguzhan Guclu, Ali Osman Serhatoglu, Ahmet Burak Can et al.
Machine Learning
ๆฅ็ๆ่ฆ
Attention models have recently emerged as a powerful approach, demonstrating
significant progress in various fields. Visualization techniques, such as class
activation mapping, provide visual insights into the reasoning of convolutional
neural networks (CNNs). Using network gradients, it is possible to identify
regions where the network pays attention during image recognition tasks.
Furthermore, these gradients can be combined with CNN features to localize more
generalizable, task-specific attentive (salient) regions within scenes.
However, explicit use of this gradient-based attention information integrated
directly into CNN representations for semantic object understanding remains
limited. Such integration is particularly beneficial for visual tasks like
simultaneous localization and mapping (SLAM), where CNN representations
enriched with spatially attentive object locations can enhance performance. In
this work, we propose utilizing task-specific network attention for RGB-D
indoor SLAM. Specifically, we integrate layer-wise attention information
derived from network gradients with CNN feature representations to improve
frame association performance. Experimental results indicate improved
performance compared to baseline methods, particularly for large environments.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou et al.
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-based end-to-end (E2E) driving has garnered significant interest in
the research community due to its scalability and synergy with multimodal large
language models (MLLMs). However, current E2E driving benchmarks primarily
feature nominal scenarios, failing to adequately test the true potential of
these systems. Furthermore, existing open-loop evaluation metrics often fall
short in capturing the multi-modal nature of driving or effectively evaluating
performance in long-tail scenarios. To address these gaps, we introduce the
Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021
driving segments (approximately 12 hours), specifically curated for challenging
long-tail scenarios that that are rare in daily life with an occurring
frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the
high-level routing information, ego states, and 360-degree camera views from 8
surrounding cameras. To evaluate the E2E driving performance on these long-tail
situations, we propose a novel open-loop evaluation metric: Rater Feedback
Score (RFS). Unlike conventional metrics that measure the distance between
predicted way points and the logs, RFS measures how closely the predicted
trajectory matches rater-annotated trajectory preference labels. We have
released rater preference labels for all WOD-E2E validation set segments, while
the held out test set labels have been used for the 2025 WOD-E2E Challenge.
Through our work, we aim to foster state of the art research into
generalizable, robust, and safe end-to-end autonomous driving agents capable of
handling complex real-world situations.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuxuan Li, Tao Wang, Xianben Yang
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Traditional novel view synthesis methods heavily rely on external camera pose
estimation tools such as COLMAP, which often introduce computational
bottlenecks and propagate errors. To address these challenges, we propose a
unified framework that jointly optimizes 3D Gaussian points and camera poses
without requiring pre-calibrated inputs. Our approach iteratively refines 3D
Gaussian parameters and updates camera poses through a novel co-optimization
strategy, ensuring simultaneous improvements in scene reconstruction fidelity
and pose accuracy. The key innovation lies in decoupling the joint optimization
into two interleaved phases: first, updating 3D Gaussian parameters via
differentiable rendering with fixed poses, and second, refining camera poses
using a customized 3D optical flow algorithm that incorporates geometric and
photometric constraints. This formulation progressively reduces projection
errors, particularly in challenging scenarios with large viewpoint variations
and sparse feature distributions, where traditional methods struggle. Extensive
evaluations on multiple datasets demonstrate that our approach significantly
outperforms existing COLMAP-free techniques in reconstruction quality, and also
surpasses the standard COLMAP-based baseline in general.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
As one of the earliest writing systems, Oracle Bone Script (OBS) preserves
the cultural and intellectual heritage of ancient civilizations. However,
current OBS research faces two major challenges: (1) the interpretation of OBS
involves a complex workflow comprising multiple serial and parallel sub-tasks,
and (2) the efficiency of OBS information organization and retrieval remains a
critical bottleneck, as scholars often spend substantial effort searching for,
compiling, and managing relevant resources. To address these challenges, we
present OracleAgent, the first agent system designed for the structured
management and retrieval of OBS-related information. OracleAgent seamlessly
integrates multiple OBS analysis tools, empowered by large language models
(LLMs), and can flexibly orchestrate these components. Additionally, we
construct a comprehensive domain-specific multimodal knowledge base for OBS,
which is built through a rigorous multi-year process of data collection,
cleaning, and expert annotation. The knowledge base comprises over 1.4M
single-character rubbing images and 80K interpretation texts. OracleAgent
leverages this resource through its multimodal tools to assist experts in
retrieval tasks of character, document, interpretation text, and rubbing image.
Extensive experiments demonstrate that OracleAgent achieves superior
performance across a range of multimodal reasoning and generation tasks,
surpassing leading mainstream multimodal large language models (MLLMs) (e.g.,
GPT-4o). Furthermore, our case study illustrates that OracleAgent can
effectively assist domain experts, significantly reducing the time cost of OBS
research. These results highlight OracleAgent as a significant step toward the
practical deployment of OBS-assisted research and automated interpretation
systems.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, Angela Yao
Machine Learning
ๆฅ็ๆ่ฆ
Can Video-LLMs achieve consistent temporal understanding when videos capture
the same event from different viewpoints? To study this, we introduce
EgoExo-Con (Consistency), a benchmark of comprehensively synchronized
egocentric and exocentric video pairs with human-refined queries in natural
language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal
Verification and Temporal Grounding. It evaluates not only correctness but
consistency across viewpoints. Our analysis reveals two critical limitations of
existing Video-LLMs: (1) models often fail to maintain consistency, with
results far worse than their single-view performances. (2) When naively
finetuned with synchronized videos of both viewpoints, the models show improved
consistency but often underperform those trained on a single view. For
improvements, we propose View-GRPO, a novel reinforcement learning framework
that effectively strengthens view-specific temporal reasoning while encouraging
consistent comprehension across viewpoints. Our method demonstrates its
superiority over naive SFT and GRPO, especially for improving cross-view
consistency. All resources will be made publicly available.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xiaosen Wang, Zhijin Ge, Shaokang Wang
ๆฅ็ๆ่ฆ
Despite the notable advancements and versatility of multi-modal diffusion
models, such as text-to-image models, their susceptibility to adversarial
inputs remains underexplored. Contrary to expectations, our investigations
reveal that the alignment between textual and Image modalities in existing
diffusion models is inadequate. This misalignment presents significant risks,
especially in the generation of inappropriate or Not-Safe-For-Work (NSFW)
content. To this end, we propose a novel attack called Prompt-Restricted
Multi-modal Attack (PReMA) to manipulate the generated content by modifying the
input image in conjunction with any specified prompt, without altering the
prompt itself. PReMA is the first attack that manipulates model outputs by
solely creating adversarial images, distinguishing itself from prior methods
that primarily generate adversarial prompts to produce NSFW content.
Consequently, PReMA poses a novel threat to the integrity of multi-modal
diffusion models, particularly in image-editing applications that operate with
fixed prompts. Comprehensive evaluations conducted on image inpainting and
style transfer tasks across various models confirm the potent efficacy of
PReMA.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Hoyeon Chang, Seungjin Kim, Yoonseok Choi
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
We propose a novel approach for dynamic negative prompting in diffusion
models that leverages Vision-Language Models (VLMs) to adaptively generate
negative prompts during the denoising process. Unlike traditional Negative
Prompting methods that use fixed negative prompts, our method generates
intermediate image predictions at specific denoising steps and queries a VLM to
produce contextually appropriate negative prompts. We evaluate our approach on
various benchmark datasets and demonstrate the trade-offs between negative
guidance strength and text-image alignment.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuyue Zhou, Jessica Knight, Shrimanti Ghosh, Banafshe Felfeliyan, Jacob L. Jaremko et al.
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
Elbow and wrist fractures are the most common fractures in pediatric
populations. Automatic segmentation of musculoskeletal structures in ultrasound
(US) can improve diagnostic accuracy and treatment planning. Fractures appear
as cortical defects but require expert interpretation. Deep learning (DL) can
provide real-time feedback and highlight key structures, helping lightly
trained users perform exams more confidently. However, pixel-wise expert
annotations for training remain time-consuming and costly. To address this
challenge, we propose FlexICL, a novel and flexible in-context learning (ICL)
framework for segmenting bony regions in US images. We apply it to an
intra-video segmentation setting, where experts annotate only a small subset of
frames, and the model segments unseen frames. We systematically investigate
various image concatenation techniques and training strategies for visual ICL
and introduce novel concatenation methods that significantly enhance model
performance with limited labeled data. By integrating multiple augmentation
strategies, FlexICL achieves robust segmentation performance across four wrist
and elbow US datasets while requiring only 5% of the training images. It
outperforms state-of-the-art visual ICL models like Painter, MAE-VQGAN, and
conventional segmentation models like U-Net and TransUNet by 1-27% Dice
coefficient on 1,252 US sweeps. These initial results highlight the potential
of FlexICL as an efficient and scalable solution for US image segmentation well
suited for medical imaging use cases where labeled data is scarce.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiali Cheng, Chirag Agarwal, Hadi Amiri
ๆฅ็ๆ่ฆ
Knowledge distillation (KD) is an effective method for model compression and
transferring knowledge between models. However, its effect on model's
robustness against spurious correlations that degrade performance on
out-of-distribution data remains underexplored. This study investigates the
effect of knowledge distillation on the transferability of ``debiasing''
capabilities from teacher models to student models on natural language
inference (NLI) and image classification tasks. Through extensive experiments,
we illustrate several key findings: (i) overall the debiasing capability of a
model is undermined post-KD; (ii) training a debiased model does not benefit
from injecting teacher knowledge; (iii) although the overall robustness of a
model may remain stable post-distillation, significant variations can occur
across different types of biases; and (iv) we pin-point the internal attention
pattern and circuit that causes the distinct behavior post-KD. Given the above
findings, we propose three effective solutions to improve the distillability of
debiasing methods: developing high quality data for augmentation, implementing
iterative knowledge distillation, and initializing student models with weights
obtained from teacher models. To the best of our knowledge, this is the first
study on the effect of KD on debiasing and its interenal mechanism at scale.
Our findings provide understandings on how KD works and how to design better
debiasing methods.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In sequence modeling, the parametric memory of atomic facts has been
predominantly abstracted as a brute-force lookup of co-occurrences between
entities. We contrast this associative view against a geometric view of how
memory is stored. We begin by isolating a clean and analyzable instance of
Transformer reasoning that is incompatible with memory as strictly a storage of
the local co-occurrences specified during training. Instead, the model must
have somehow synthesized its own geometry of atomic facts, encoding global
relationships between all entities, including non-co-occurring ones. This in
turn has simplified a hard reasoning task involving an $\ell$-fold composition
into an easy-to-learn 1-step geometric task.
From this phenomenon, we extract fundamental aspects of neural embedding
geometries that are hard to explain. We argue that the rise of such a geometry,
despite optimizing over mere local associations, cannot be straightforwardly
attributed to typical architectural or optimizational pressures.
Counterintuitively, an elegant geometry is learned even when it is not more
succinct than a brute-force lookup of associations.
Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry
stems from a spectral bias that -- in contrast to prevailing theories -- indeed
arises naturally despite the lack of various pressures. This analysis also
points to practitioners a visible headroom to make Transformer memory more
strongly geometric. We hope the geometric view of parametric memory encourages
revisiting the default intuitions that guide researchers in areas like
knowledge acquisition, capacity, discovery and unlearning.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ J. de Curtรฒ, I. de Zarzร , Pablo Garcรญa, Jordi Cabot
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a comprehensive cross-platform evaluation of reasoning
capabilities in contemporary foundation models, establishing an
infrastructure-agnostic benchmark across three computational paradigms: HPC
supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and
university clusters (a node with eight H200 GPUs).
We evaluate 15 foundation models across 79 problems spanning eight academic
domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics,
Calculus, and Optimization) through three experimental phases: (1) Baseline
establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b,
Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing
methodology and reference performance; (2) Infrastructure validation: The
19-problem benchmark repeated on university cluster (seven models including
Falcon-Mamba state-space architecture) and Nebius AI Studio (nine
state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3
30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic
reproducibility; (3) Extended evaluation: Full 79-problem assessment on both
university cluster and Nebius platforms, probing generalization at scale across
architectural diversity.
The findings challenge conventional scaling assumptions, establish training
data quality as more critical than model size, and provide actionable
guidelines for model selection across educational, production, and research
contexts. The tri-infrastructure methodology and 79-problem benchmark enable
longitudinal tracking of reasoning capabilities as foundation models evolve.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mehar Bhatia, Shravan Nayak, Gaurav Kamath, Marius Mosbach, Karolina Staลczak et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
As LLMs occupy an increasingly important role in society, they are more and
more confronted with questions that require them not only to draw on their
general knowledge but also to align with certain human value systems.
Therefore, studying the alignment of LLMs with human values has become a
crucial field of inquiry. Prior work, however, mostly focuses on evaluating the
alignment of fully trained models, overlooking the training dynamics by which
models learn to express human values. In this work, we investigate how and at
which stage value alignment arises during the course of a model's
post-training. Our analysis disentangles the effects of post-training
algorithms and datasets, measuring both the magnitude and time of value drifts
during training. Experimenting with Llama-3 and Qwen-3 models of different
sizes and popular supervised fine-tuning (SFT) and preference optimization
datasets and algorithms, we find that the SFT phase generally establishes a
model's values, and subsequent preference optimization rarely re-aligns these
values. Furthermore, using a synthetic preference dataset that enables
controlled manipulation of values, we find that different preference
optimization algorithms lead to different value alignment outcomes, even when
preference data is held constant. Our findings provide actionable insights into
how values are learned during post-training and help to inform data curation,
as well as the selection of models and algorithms for preference optimization
to improve model alignment to human values.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a
non-differentiable decoding process that requires laborious, hand-tuning of
hyperparameters like temperature and top-p. This paper introduces AutoDeco, a
novel architecture that enables truly "end-to-end" generation by learning to
control its own decoding strategy. We augment the standard transformer with
lightweight heads that, at each step, dynamically predict context-specific
temperature and top-p values alongside the next-token logits. This approach
transforms decoding into a parametric, token-level process, allowing the model
to self-regulate its sampling strategy within a single forward pass.
Through extensive experiments on eight benchmarks, we demonstrate that
AutoDeco not only significantly outperforms default decoding strategies but
also achieves performance comparable to an oracle-tuned baseline derived from
"hacking the test set"-a practical upper bound for any static method.
Crucially, we uncover an emergent capability for instruction-based decoding
control: the model learns to interpret natural language commands (e.g.,
"generate with low randomness") and adjusts its predicted temperature and top-p
on a token-by-token basis, opening a new paradigm for steerable and interactive
LLM decoding.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu et al.
ๆฅ็ๆ่ฆ
We introduce Kimi Linear, a hybrid linear attention architecture that, for
the first time, outperforms full attention under fair comparisons across
various scenarios -- including short-context, long-context, and reinforcement
learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an
expressive linear attention module that extends Gated DeltaNet with a
finer-grained gating mechanism, enabling more effective use of limited
finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware
efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR)
transition matrices, which substantially reduces computation compared to the
general DPLR formulation while remaining more consistent with the classical
delta rule.
We pretrain a Kimi Linear model with 3B activated parameters and 48B total
parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention
(MLA). Our experiments show that with an identical training recipe, Kimi Linear
outperforms full MLA with a sizeable margin across all evaluated tasks, while
reducing KV cache usage by up to 75% and achieving up to 6 times decoding
throughput for a 1M context. These results demonstrate that Kimi Linear can be
a drop-in replacement for full attention architectures with superior
performance and efficiency, including tasks with longer input and output
lengths.
To support further research, we open-source the KDA kernel and vLLM
implementations, and release the pre-trained and instruction-tuned model
checkpoints.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mingchen Tu, Zhiqiang Liu, Juan Li, Liangyurui Liu, Junjie Wang et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have demonstrated exceptional capabilities
across multiple domains by leveraging massive pre-training and curated
fine-tuning data. However, in data-sensitive fields such as healthcare, the
lack of high-quality, domain-specific training corpus hinders LLMs' adaptation
for specialized applications. Meanwhile, domain experts have distilled domain
wisdom into ontology rules, which formalize relationships among concepts and
ensure the integrity of knowledge management repositories. Viewing LLMs as
implicit repositories of human knowledge, we propose Evontree, a novel
framework that leverages a small set of high-quality ontology rules to
systematically extract, validate, and enhance domain knowledge within LLMs,
without requiring extensive external datasets. Specifically, Evontree extracts
domain ontology from raw models, detects inconsistencies using two core
ontology rules, and reinforces the refined knowledge via self-distilled
fine-tuning. Extensive experiments on medical QA benchmarks with
Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both
unmodified models and leading supervised baselines, achieving up to a 3.7%
improvement in accuracy. These results confirm the effectiveness, efficiency,
and robustness of our approach for low-resource domain adaptation of LLMs.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zewen Chi, Li Dong, Qingxiu Dong, Yaru Hao, Xun Wu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We envision a new era of AI, termed agentic organization, where agents solve
complex problems by working collaboratively and concurrently, enabling outcomes
beyond individual intelligence. To realize this vision, we introduce
asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large
language models, which organizes the internal thinking process into
concurrently executable structures. Specifically, we propose a thinking
protocol where an organizer dynamically assigns sub-queries to workers, merges
intermediate knowledge, and produces coherent solutions. More importantly, the
thinking structure in this protocol can be further optimized through
reinforcement learning. Experiments demonstrate that AsyncThink achieves 28%
lower inference latency compared to parallel thinking while improving accuracy
on mathematical reasoning. Moreover, AsyncThink generalizes its learned
asynchronous thinking capabilities, effectively tackling unseen tasks without
additional training.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Biao Zhang, Yong Cheng, Siamak Shakeri, Xinyi Wang, Min Ma et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent large language model (LLM) research has undergone an architectural
shift from encoder-decoder modeling to nowadays the dominant decoder-only
modeling. This rapid transition, however, comes without a rigorous comparative
analysis especially \textit{from the scaling perspective}, raising concerns
that the potential of encoder-decoder models may have been overlooked. To fill
this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent
recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison
between RedLLM, pretrained with prefix language modeling (LM), and DecLLM,
pretrained with causal LM, at different model scales, ranging from $\sim$150M
to $\sim$8B. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for
instruction tuning, our experiments show that RedLLM produces compelling
scaling properties and surprisingly strong performance. While DecLLM is overall
more compute-optimal during pretraining, RedLLM demonstrates comparable scaling
and context length extrapolation capabilities. After instruction tuning, RedLLM
achieves comparable and even better results on various downstream tasks while
enjoying substantially better inference efficiency. We hope our findings could
inspire more efforts on re-examining RedLLM, unlocking its potential for
developing powerful and efficient LLMs.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar
Natural Language Processing
ๆฅ็ๆ่ฆ
Multi-page visual documents such as manuals, brochures, presentations, and
posters convey key information through layout, colors, icons, and cross-slide
references. While large language models (LLMs) offer opportunities in document
understanding, current systems struggle with complex, multi-page visual
documents, particularly in fine-grained reasoning over elements and pages. We
introduce SlideAgent, a versatile agentic framework for understanding
multi-modal, multi-page, and multi-layout documents, especially slide decks.
SlideAgent employs specialized agents and decomposes reasoning into three
specialized levels-global, page, and element-to construct a structured,
query-agnostic representation that captures both overarching themes and
detailed visual or textual cues. During inference, SlideAgent selectively
activates specialized agents for multi-level reasoning and integrates their
outputs into coherent, context-aware answers. Extensive experiments show that
SlideAgent achieves significant improvement over both proprietary (+7.9
overall) and open-source models (+9.8 overall).
๐
2025-10-30
๐ EMNLP 2025
๐ฅ Kentaro Ozeki, Risako Ando, Takanobu Morishita, Hirohiko Abe, Koji Mineshima et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Normative reasoning is a type of reasoning that involves normative or deontic
modality, such as obligation and permission. While large language models (LLMs)
have demonstrated remarkable performance across various reasoning tasks, their
ability to handle normative reasoning remains underexplored. In this paper, we
systematically evaluate LLMs' reasoning capabilities in the normative domain
from both logical and modal perspectives. Specifically, to assess how well LLMs
reason with normative modals, we make a comparison between their reasoning with
normative modals and their reasoning with epistemic modals, which share a
common formal structure. To this end, we introduce a new dataset covering a
wide range of formal patterns of reasoning in both normative and epistemic
domains, while also incorporating non-formal cognitive factors that influence
human reasoning. Our results indicate that, although LLMs generally adhere to
valid reasoning patterns, they exhibit notable inconsistencies in specific
types of normative reasoning and display cognitive biases similar to those
observed in psychological studies of human reasoning. These findings highlight
challenges in achieving logical consistency in LLMs' normative reasoning and
provide insights for enhancing their reliability. All data and code are
released publicly at https://github.com/kmineshima/NeuBAROCO.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yinrong Hong, Zhiquan Tan, Kai Hu
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) face significant inference latency challenges
stemming from their autoregressive design and large size. To address this,
speculative decoding emerges as a solution, enabling the simultaneous
generation and validation of multiple tokens. While recent approaches like
EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures,
they often neglect the impact of crucial system variables such as GPU devices
and batch sizes.
Therefore, we introduce a new dynamic tree decoding approach called CAST that
takes into account inference costs, including factors such as GPU
configurations and batch sizes, to dynamically refine the tree structure.
Through comprehensive experimentation across six diverse tasks and utilizing
six distinct LLMs, our methodology demonstrates remarkable results, achieving
speeds up to 5.2 times faster than conventional decoding methods. Moreover, it
generally outperforms existing state-of-the-art techniques from 5% to 20%.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kun Luo, Hongjin Qian, Zheng Liu, Ziyi Xia, Shitao Xiao et al.
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach
for enhancing agentic deep search. However, its application is often hindered
by low \textbf{Reward Density} in deep search scenarios, where agents expend
significant exploratory costs for infrequent and often null final rewards. In
this paper, we formalize this challenge as the \textbf{Reward Density
Optimization} problem, which aims to improve the reward obtained per unit of
exploration cost. This paper introduce \textbf{InfoFlow}, a systematic
framework that tackles this problem from three aspects. 1) \textbf{Subproblem
decomposition}: breaking down long-range tasks to assign process rewards,
thereby providing denser learning signals. 2) \textbf{Failure-guided hints}:
injecting corrective guidance into stalled trajectories to increase the
probability of successful outcomes. 3) \textbf{Dual-agent refinement}:
employing a dual-agent architecture to offload the cognitive burden of deep
exploration. A refiner agent synthesizes the search history, which effectively
compresses the researcher's perceived trajectory, thereby reducing exploration
cost and increasing the overall reward density. We evaluate InfoFlow on
multiple agentic search benchmarks, where it significantly outperforms strong
baselines, enabling lightweight LLMs to achieve performance comparable to
advanced proprietary LLMs.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Miranda Anna Christ, Adriรกn Csiszรกrik, Gergely Becsรณ, Dรกniel Varga
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper investigates the structure of linear operators introduced in
Hernandez et al. [2023] that decode specific relational facts in transformer
language models. We extend their single-relation findings to a collection of
relations and systematically chart their organization. We show that such
collections of relation decoders can be highly compressed by simple order-3
tensor networks without significant loss in decoding accuracy. To explain this
surprising redundancy, we develop a cross-evaluation protocol, in which we
apply each linear decoder operator to the subjects of every other relation. Our
results reveal that these linear maps do not encode distinct relations, but
extract recurring, coarse-grained semantic properties (e.g., country of capital
city and country of food are both in the country-of-X property). This
property-centric structure clarifies both the operators' compressibility and
highlights why they generalize only to new relations that are semantically
close. Our findings thus interpret linear relational decoding in transformer
language models as primarily property-based, rather than relation-specific.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yair Elboher, Yuval Pinter
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Diacritics restoration in Hebrew is a fundamental task for ensuring accurate
word pronunciation and disambiguating textual meaning. Despite the language's
high degree of ambiguity when unvocalized, recent machine learning approaches
have significantly advanced performance on this task.
In this work, we present DIVRIT, a novel system for Hebrew diacritization
that frames the task as a zero-shot classification problem. Our approach
operates at the word level, selecting the most appropriate diacritization
pattern for each undiacritized word from a dynamically generated candidate set,
conditioned on the surrounding textual context. A key innovation of DIVRIT is
its use of a Hebrew Visual Language Model, which processes undiacritized text
as an image, allowing diacritic information to be embedded directly within the
input's vector representation.
Through a comprehensive evaluation across various configurations, we
demonstrate that the system effectively performs diacritization without relying
on complex, explicit linguistic analysis. Notably, in an ``oracle'' setting
where the correct diacritized form is guaranteed to be among the provided
candidates, DIVRIT achieves a high level of accuracy. Furthermore, strategic
architectural enhancements and optimized training methodologies yield
significant improvements in the system's overall generalization capabilities.
These findings highlight the promising potential of visual representations for
accurate and automated Hebrew diacritization.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dipak Meher, Carlotta Domeniconi
ๆฅ็ๆ่ฆ
Human smuggling networks are increasingly adaptive and difficult to analyze.
Legal case documents offer critical insights but are often unstructured,
lexically dense, and filled with ambiguous or shifting references, which pose
significant challenges for automated knowledge graph (KG) construction. While
recent LLM-based approaches improve over static templates, they still generate
noisy, fragmented graphs with duplicate nodes due to the absence of guided
extraction and coreference resolution. The recently proposed CORE-KG framework
addresses these limitations by integrating a type-aware coreference module and
domain-guided structured prompts, significantly reducing node duplication and
legal noise. In this work, we present a systematic ablation study of CORE-KG to
quantify the individual contributions of its two key components. Our results
show that removing coreference resolution results in a 28.32% increase in node
duplication and a 4.32% increase in noisy nodes, while removing structured
prompts leads to a 4.34% increase in node duplication and a 73.33% increase in
noisy nodes. These findings offer empirical insights for designing robust
LLM-based pipelines for extracting structured representations from complex
legal texts.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Adam E. Flanders, Yifan Peng, Luciano Prevedello, Robyn Ball, Errol Colak et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Purpose: The purpose of this study was to determine if an ensemble of
multiple LLM agents could be used collectively to provide a more reliable
assessment of a pixel-based AI triage tool than a single LLM.
Methods: 29,766 non-contrast CT head exams from fourteen hospitals were
processed by a commercial intracranial hemorrhage (ICH) AI detection tool.
Radiology reports were analyzed by an ensemble of eight open-source LLM models
and a HIPAA compliant internal version of GPT-4o using a single multi-shot
prompt that assessed for presence of ICH. 1,726 examples were manually
reviewed. Performance characteristics of the eight open-source models and
consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were
tested for rating the performance of the triage tool.
Results: The cohort consisted of 29,766 head CTs exam-report pairs. The
highest AUC performance was achieved with llama3.3:70b and GPT-4o (AUC= 0.78).
The average precision was highest for Llama3.3:70b and GPT-4o (AP=0.75 & 0.76).
Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), greater
precision (0.78), specificity (0.72), and MCC (0.57). Using MCC (95% CI) the
ideal combination of LLMs were: Full-9 Ensemble 0.571 (0.552-0.591), Top-3
Ensemble 0.558 (0.537-0.579), Consensus 0.556 (0.539-0.574), and GPT4o 0.522
(0.500-0.543). No statistically significant differences were observed between
Top-3, Full-9, and Consensus (p > 0.05).
Conclusion: An ensemble of medium to large sized open-source LLMs provides a
more consistent and reliable method to derive a ground truth retrospective
evaluation of a clinical AI triage tool over a single LLM alone.
๐
2025-10-30
๐ ArXiv Preprint (cs.DB)
๐ฅ Linzhuang Sun, Tianyu Guo, Hao Liang, Yuying Li, Qifeng Cai et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in Text-to-SQL have achieved strong results in static,
single-turn tasks, where models generate SQL queries from natural language
questions. However, these systems fall short in real-world interactive
scenarios, where user intents evolve and queries must be refined over multiple
turns. In applications such as finance and business analytics, users
iteratively adjust query constraints or dimensions based on intermediate
results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a
benchmark assessing model performance under evolving user interactions. Unlike
previous manually curated datasets, DySQL-Bench is built through an automated
two-stage pipeline of task synthesis and verification. Structured tree
representations derived from raw database tables guide LLM-based task
generation, followed by interaction-oriented filtering and expert validation.
Human evaluation confirms 100% correctness of the synthesized data. We further
propose a multi-turn evaluation framework simulating realistic interactions
among an LLM-simulated user, the model under test, and an executable database.
The model must adapt its reasoning and SQL generation as user intents change.
DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling
1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the
Pass@5 metric, underscoring the benchmark's difficulty. All code and data are
released at https://github.com/Aurora-slz/Real-World-SQL-Bench .
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Qishuo Hua, Lyumanshan Ye, Dayuan Fu, Yang Xiao, Xiaojie Cai et al.
ๆฅ็ๆ่ฆ
Karl Marx once wrote that ``the human essence is the ensemble of social
relations'', suggesting that individuals are not isolated entities but are
fundamentally shaped by their interactions with other entities, within which
contexts play a constitutive and essential role. With the advent of computers
and artificial intelligence, these contexts are no longer limited to purely
human--human interactions: human--machine interactions are included as well.
Then a central question emerges: How can machines better understand our
situations and purposes? To address this challenge, researchers have recently
introduced the concept of context engineering. Although it is often regarded as
a recent innovation of the agent era, we argue that related practices can be
traced back more than twenty years. Since the early 1990s, the field has
evolved through distinct historical phases, each shaped by the intelligence
level of machines: from early human--computer interaction frameworks built
around primitive computers, to today's human--agent interaction paradigms
driven by intelligent agents, and potentially to human--level or superhuman
intelligence in the future. In this paper, we situate context engineering,
provide a systematic definition, outline its historical and conceptual
landscape, and examine key design considerations for practice. By addressing
these questions, we aim to offer a conceptual foundation for context
engineering and sketch its promising future. This paper is a stepping stone for
a broader community effort toward systematic context engineering in AI systems.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rasoul Amirzadeh, Dhananjay Thiruvady, Fatemeh Shiri
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) continue to advance, with an increasing number
of domain-specific variants tailored for specialised tasks. However, these
models often lack transparency and explainability, can be costly to fine-tune,
require substantial prompt engineering, yield inconsistent results across
domains, and impose significant adverse environmental impact due to their high
computational demands. To address these challenges, we propose the Bayesian
network LLM fusion (BNLF) framework, which integrates predictions from three
LLMs, including FinBERT, RoBERTa, and BERTweet, through a probabilistic
mechanism for sentiment analysis. BNLF performs late fusion by modelling the
sentiment predictions from multiple LLMs as probabilistic nodes within a
Bayesian network. Evaluated across three human-annotated financial corpora with
distinct linguistic and contextual characteristics, BNLF demonstrates
consistent gains of about six percent in accuracy over the baseline LLMs,
underscoring its robustness to dataset variability and the effectiveness of
probabilistic fusion for interpretable sentiment classification.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Self-improvement has emerged as a mainstream paradigm for advancing the
reasoning capabilities of large vision-language models (LVLMs), where models
explore and learn from successful trajectories iteratively. However, we
identify a critical issue during this process: the model excels at generating
high-quality trajectories for simple queries (i.e., head data) but struggles
with more complex ones (i.e., tail data). This leads to an imbalanced
optimization that drives the model to prioritize simple reasoning skills, while
hindering its ability to tackle more complex reasoning tasks. Over iterations,
this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew
effect"--which ultimately hinders further model improvement and leads to
performance bottlenecks. To counteract this challenge, we introduce four
efficient strategies from two perspectives: distribution-reshaping and
trajectory-resampling, to achieve head-tail re-balancing during the
exploration-and-learning self-improvement process. Extensive experiments on
Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks
demonstrate that our methods consistently improve visual reasoning
capabilities, outperforming vanilla self-improvement by 3.86 points on average.
๐
2025-10-30
๐ ArXiv Preprint (cs.SE)
๐ฅ Fang Liu, Simiao Liu, Yinghao Zhu, Xiaoli Lian, Li Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Identifying and addressing security issues during the early phase of the
development lifecycle is critical for mitigating the long-term negative impacts
on software systems. Code review serves as an effective practice that enables
developers to check their teammates' code before integration into the codebase.
To streamline the generation of review comments, various automated code review
approaches have been proposed, where LLM-based methods have significantly
advanced the capabilities of automated review generation. However, existing
models primarily focus on general-purpose code review, their effectiveness in
identifying and addressing security-related issues remains underexplored.
Moreover, adapting existing code review approaches to target security issues
faces substantial challenges, including data scarcity and inadequate evaluation
metrics. To address these limitations, we propose SecureReviewer, a new
approach designed for enhancing LLMs' ability to identify and resolve
security-related issues during code review. Specifically, we first construct a
dataset tailored for training and evaluating secure code review capabilities.
Leveraging this dataset, we fine-tune LLMs to generate code review comments
that can effectively identify security issues and provide fix suggestions with
our proposed secure-aware fine-tuning strategy. To mitigate hallucination in
LLMs and enhance the reliability of their outputs, we integrate the RAG
technique, which grounds the generated comments in domain-specific security
knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric
designed to assess the effectiveness of review comments in addressing security
issues. Experimental results demonstrate that SecureReviewer outperforms
state-of-the-art baselines in both security issue detection accuracy and the
overall quality and practical utility of generated review comments.
๐
2025-10-30
๐ EMNLP 2025
๐ฅ Zeliang Zong, Kai Zhang, Zheyang Li, Wenming Tan, Ye Ren et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have demonstrated remarkable proficiency in
language comprehension and generation; however, their widespread adoption is
constrained by substantial bandwidth and computational demands. While pruning
and low-rank approximation have each demonstrated promising performance
individually, their synergy for LLMs remains underexplored. We introduce
\underline{S}ynergistic \underline{S}parse and \underline{L}ow-Rank
\underline{C}ompression (SSLC) methods for LLMs, which leverages the strengths
of both techniques: low-rank approximation compresses the model by retaining
its essential structure with minimal information loss, whereas sparse
optimization eliminates non-essential weights, preserving those crucial for
generalization. Based on theoretical analysis, we first formulate the low-rank
approximation and sparse optimization as a unified problem and solve it by
iterative optimization algorithm. Experiments on LLaMA and Qwen2.5 models
(7B-70B) show that SSLC, without any additional training steps, consistently
surpasses standalone methods, achieving state-of-the-arts results. Notably,
SSLC compresses Qwen2.5 by 50\% with no performance drop and achieves at least
1.63$\times$ speedup, offering a practical solution for efficient LLM
deployment.
๐
2025-10-30
๐ ArXiv Preprint (cs.SE)
๐ฅ Dong Huang, Mingzhe Du, Jie M. Zhang, Zheng Lin, Meng Luo et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Test oracle generation in non-regression testing is a longstanding challenge
in software engineering, where the goal is to produce oracles that can
accurately determine whether a function under test (FUT) behaves as intended
for a given input. In this paper, we introduce Nexus, a novel multi-agent
framework to address this challenge. Nexus generates test oracles by leveraging
a diverse set of specialized agents that synthesize test oracles through a
structured process of deliberation, validation, and iterative self-refinement.
During the deliberation phase, a panel of four specialist agents, each
embodying a distinct testing philosophy, collaboratively critiques and refines
an initial set of test oracles. Then, in the validation phase, Nexus generates
a plausible candidate implementation of the FUT and executes the proposed
oracles against it in a secure sandbox. For any oracle that fails this
execution-based check, Nexus activates an automated selfrefinement loop, using
the specific runtime error to debug and correct the oracle before
re-validation. Our extensive evaluation on seven diverse benchmarks
demonstrates that Nexus consistently and substantially outperforms
state-of-theart baselines. For instance, Nexus improves the test-level oracle
accuracy on the LiveCodeBench from 46.30% to 57.73% for GPT-4.1-Mini. The
improved accuracy also significantly enhances downstream tasks: the bug
detection rate of GPT4.1-Mini generated test oracles on HumanEval increases
from 90.91% to 95.45% for Nexus compared to baselines, and the success rate of
automated program repair improves from 35.23% to 69.32%.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Min Zhang, Hao Chen, Hao Chen, Wenqi Zhang, Didi Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
With the rapid development of large language models (LLMs), various LLM-based
works have been widely applied in educational fields. However, most existing
LLMs and their benchmarks focus primarily on the knowledge dimension, largely
neglecting the evaluation of cultivation capabilities that are essential for
real-world educational scenarios. Additionally, current benchmarks are often
limited to a single subject or question type, lacking sufficient diversity.
This issue is particularly prominent within the Chinese context. To address
this gap, we introduce OmniEduBench, a comprehensive Chinese educational
benchmark. OmniEduBench consists of 24.602K high-quality question-answer pairs.
The data is meticulously divided into two core dimensions: the knowledge
dimension and the cultivation dimension, which contain 18.121K and 6.481K
entries, respectively. Each dimension is further subdivided into 6 fine-grained
categories, covering a total of 61 different subjects (41 in the knowledge and
20 in the cultivation). Furthermore, the dataset features a rich variety of
question formats, including 11 common exam question types, providing a solid
foundation for comprehensively evaluating LLMs' capabilities in education.
Extensive experiments on 11 mainstream open-source and closed-source LLMs
reveal a clear performance gap. In the knowledge dimension, only Gemini-2.5 Pro
surpassed 60\% accuracy, while in the cultivation dimension, the
best-performing model, QWQ, still trailed human intelligence by nearly 30\%.
These results highlight the substantial room for improvement and underscore the
challenges of applying LLMs in education.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Stephen Wan, Wei Liu, Michael Strube
Natural Language Processing
ๆฅ็ๆ่ฆ
With the increasing use of generative Artificial Intelligence (AI) methods to
support science workflows, we are interested in the use of discourse-level
information to find supporting evidence for AI generated scientific claims. A
first step towards this objective is to examine the task of inferring discourse
structure in scientific writing.
In this work, we present a preliminary investigation of pretrained language
model (PLM) and Large Language Model (LLM) approaches for Discourse Relation
Classification (DRC), focusing on scientific publications, an under-studied
genre for this task. We examine how context can help with the DRC task, with
our experiments showing that context, as defined by discourse structure, is
generally helpful. We also present an analysis of which scientific discourse
relation types might benefit most from context.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kotaro Furuya, Yuichi Kitagawa
Natural Language Processing
ๆฅ็ๆ่ฆ
While a multi-agent approach based on large language models (LLMs) represents
a promising strategy to surpass the capabilities of single models, its success
is critically dependent on synergistic team composition. However, forming
optimal teams is a significant challenge, as the inherent opacity of most
models obscures the internal characteristics necessary for effective
collaboration. In this paper, we propose an interaction-centric framework for
automatic team composition that does not require any prior knowledge including
their internal architectures, training data, or task performances. Our method
constructs a "language model graph" that maps relationships between models from
the semantic coherence of pairwise conversations, and then applies community
detection to identify synergistic model clusters. Our experiments with diverse
LLMs demonstrate that the proposed method discovers functionally coherent
groups that reflect their latent specializations. Priming conversations with
specific topics identified synergistic teams which outperform random baselines
on downstream benchmarks and achieve comparable accuracy to that of
manually-curated teams based on known model specializations. Our findings
provide a new basis for the automated design of collaborative multi-agent LLM
teams.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mykhailo Poliakov, Nadiya Shvai
Natural Language Processing
ๆฅ็ๆ่ฆ
Health-related misinformation is very prevalent and potentially harmful. It
is difficult to identify, especially when claims distort or misinterpret
scientific findings. We investigate the impact of synthetic data generation and
lightweight fine-tuning techniques on the ability of large language models
(LLMs) to recognize fallacious arguments using the MISSCI dataset and
framework. In this work, we propose MisSynth, a pipeline that applies
retrieval-augmented generation (RAG) to produce synthetic fallacy samples,
which are then used to fine-tune an LLM model. Our results show substantial
accuracy gains with fine-tuned models compared to vanilla baselines. For
instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score
absolute improvement on the MISSCI test split over its vanilla baseline. We
demonstrate that introducing synthetic fallacy data to augment limited
annotated resources can significantly enhance zero-shot LLM classification
performance on real-world scientific misinformation tasks, even with limited
computational resources. The code and synthetic dataset are available on
https://github.com/mxpoliakov/MisSynth.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nishit Neema, Srinjoy Mukherjee, Sapan Shah, Gokul Ramakrishnan, Ganesh Venkatesh
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel at general tasks but underperform in
specialized domains like economics and psychology, which require deep,
principled understanding. To address this, we introduce ACER (Automated
Curriculum-Enhanced Regimen) that transforms generalist models into domain
experts without sacrificing their broad capabilities. ACER first synthesizes a
comprehensive, textbook-style curriculum by generating a table of contents for
a subject and then creating question-answer (QA) pairs guided by Bloom's
taxonomy. This ensures systematic topic coverage and progressively increasing
difficulty. The resulting synthetic corpus is used for continual pretraining
with an interleaved curriculum schedule, aligning learning across both content
and cognitive dimensions.
Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized
MMLU subsets. In challenging domains like microeconomics, where baselines
struggle, ACER boosts accuracy by 5 percentage points. Across all target
domains, we observe a consistent macro-average improvement of 3 percentage
points. Notably, ACER not only prevents catastrophic forgetting but also
facilitates positive cross-domain knowledge transfer, improving performance on
non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on
knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points,
while maintaining stable performance on general reasoning tasks. Our results
demonstrate that ACER offers a scalable and effective recipe for closing
critical domain gaps in LLMs.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fares Fawzi, Vinitra Swamy, Dominik Glandorf, Tanya Nazaretsky, Tanja Kรคser
Natural Language Processing
ๆฅ็ๆ่ฆ
Language models can be used to provide interactive, personalized student
feedback in educational settings. However, real-world deployment faces three
key challenges: privacy concerns, limited computational resources, and the need
for pedagogically valid responses. These constraints require small, open-source
models that can run locally and reliably ground their outputs in correct
information. We introduce SCRIBE, a framework for multi-hop, tool-augmented
reasoning designed to generate valid responses to student questions about
feedback reports. SCRIBE combines domain-specific tools with a self-reflective
inference pipeline that supports iterative reasoning, tool use, and error
recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA
fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned
GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models
achieve comparable or superior quality to much larger models in key dimensions
such as relevance and actionability, while being perceived on par with GPT-4o
and Llama-3.3 70B by students. These findings demonstrate the viability of
SCRIBE for low-resource, privacy-sensitive educational applications.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jingran Zhang, Ning Li, Justin Cui
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
OpenAI's ChatGPT Atlas introduces new capabilities for web interaction,
enabling the model to analyze webpages, process user intents, and execute
cursor and keyboard inputs directly within the browser. While its capacity for
information retrieval tasks has been demonstrated, its performance in dynamic,
interactive environments remains less explored. In this study, we conduct an
early evaluation of Atlas's web interaction capabilities using browser-based
games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird,
and Stein.world. We employ in-game performance scores as quantitative metrics
to assess performance across different task types. Our results show that Atlas
performs strongly in logical reasoning tasks like Sudoku, completing puzzles
significantly faster than human baselines, but struggles substantially in
real-time games requiring precise timing and motor control, often failing to
progress beyond initial obstacles. These findings suggest that while Atlas
demonstrates capable analytical processing, there remain notable limitations in
dynamic web environments requiring real-time interaction. The website of our
project can be found at https://atlas-game-eval.github.io.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michal ล tefรกnik, Timothee Mickus, Marek Kadlฤรญk, Bertram Hรธjer, Michal Spiegel et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent work has shown that different large language models (LLMs) converge to
similar and accurate input embedding representations for numbers. These
findings conflict with the documented propensity of LLMs to produce erroneous
outputs when dealing with numeric information. In this work, we aim to explain
this conflict by exploring how language models manipulate numbers and quantify
the lower bounds of accuracy of these mechanisms. We find that despite
surfacing errors, different language models learn interchangeable
representations of numbers that are systematic, highly accurate and universal
across their hidden states and the types of input contexts. This allows us to
create universal probes for each LLM and to trace information -- including the
causes of output errors -- to specific layers. Our results lay a fundamental
understanding of how pre-trained LLMs manipulate numbers and outline the
potential of more accurate probing techniques in addressed refinements of LLMs'
architectures.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kang Chen, Yaoning Wang, Kai Xiong, Zhuoka Feng, Wenhe Sun et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) commonly boost reasoning via
sample-evaluate-ensemble decoders, achieving label free gains without ground
truth. However, prevailing strategies score candidates using only external
outputs such as token probabilities, entropies, or self evaluations, and these
signals can be poorly calibrated after post training. We instead analyze
internal behavior based on neuron activations and uncover three findings: (1)
external signals are low dimensional projections of richer internal dynamics;
(2) correct responses activate substantially fewer unique neurons than
incorrect ones throughout generation; and (3) activations from correct
responses exhibit stronger cross sample agreement, whereas incorrect ones
diverge. Motivated by these observations, we propose Neuron Agreement Decoding
(NAD), an unsupervised best-of-N method that selects candidates using
activation sparsity and cross sample neuron agreement, operating solely on
internal signals and without requiring comparable textual outputs. NAD enables
early correctness prediction within the first 32 generated tokens and supports
aggressive early stopping. Across math and science benchmarks with verifiable
answers, NAD matches majority voting; on open ended coding benchmarks where
majority voting is inapplicable, NAD consistently outperforms Avg@64. By
pruning unpromising trajectories early, NAD reduces token usage by 99% with
minimal loss in generation quality, showing that internal signals provide
reliable, scalable, and efficient guidance for label free ensemble decoding.
๐
2025-10-30
๐ This work has been submitted to the IEEE for possi
๐ฅ Haohua Duan, Liyao Xiang, Xin Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Watermarking schemes for large language models (LLMs) have been proposed to
identify the source of the generated text, mitigating the potential threats
emerged from model theft. However, current watermarking solutions hardly
resolve the trust issue: the non-public watermark detection cannot prove itself
faithfully conducting the detection. We observe that it is attributed to the
secret key mostly used in the watermark detection -- it cannot be public, or
the adversary may launch removal attacks provided the key; nor can it be
private, or the watermarking detection is opaque to the public. To resolve the
dilemma, we propose PVMark, a plugin based on zero-knowledge proof (ZKP),
enabling the watermark detection process to be publicly verifiable by third
parties without disclosing any secret key. PVMark hinges upon the proof of
`correct execution' of watermark detection on which a set of ZKP constraints
are built, including mapping, random number generation, comparison, and
summation. We implement multiple variants of PVMark in Python, Rust and Circom,
covering combinations of three watermarking schemes, three hash functions, and
four ZKP protocols, to show our approach effectively works under a variety of
circumstances. By experimental results, PVMark efficiently enables public
verifiability on the state-of-the-art LLM watermarking schemes yet without
compromising the watermarking performance, promising to be deployed in
practice.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sukrit Sriratanawilai, Jhayahgrit Thongwat, Romrawin Chumpu, Patomporn Payoungkhamdee, Sarana Nutanong et al.
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-language models (VLMs) exhibit uneven performance across languages, a
problem that is often exacerbated when the model size is reduced. While
Knowledge distillation (KD) demonstrates promising results in transferring
knowledge from larger to smaller VLMs, applying KD in multilingualism is an
underexplored area. This paper presents a controlled empirical study of KD
behavior across five distillation approaches, isolating their effects on
cross-lingual representation consistency and downstream performance stability
under model compression. We study five distillation formulations across CLIP
and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual
QA. We find that some configurations preserve or even improve multilingual
retrieval robustness despite halving model size, but others fail to maintain
cross-task stability, exposing design-sensitive trade-offs that aggregate
accuracy alone does not reveal.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mรฉrilin Sousa Silva, Sina Ahmadi
Natural Language Processing
ๆฅ็ๆ่ฆ
Throughout language history, words are borrowed from one language to another
and gradually become integrated into the recipient's lexicon. Speakers can
often differentiate these loanwords from native vocabulary, particularly in
bilingual communities where a dominant language continuously imposes lexical
items on a minority language. This paper investigates whether pretrained
language models, including large language models, possess similar capabilities
for loanword identification. We evaluate multiple models across 10 languages.
Despite explicit instructions and contextual information, our results show that
models perform poorly in distinguishing loanwords from native ones. These
findings corroborate previous evidence that modern NLP systems exhibit a bias
toward loanwords rather than native equivalents. Our work has implications for
developing NLP tools for minority languages and supporting language
preservation in communities under lexical pressure from dominant languages.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Takuma Sato, Seiya Kawano, Koichiro Yoshino
Natural Language Processing
ๆฅ็ๆ่ฆ
The ability to accurately interpret implied meanings plays a crucial role in
human communication and language use, and language models are also expected to
possess this capability. This study demonstrates that providing language models
with pragmatic theories as prompts is an effective in-context learning approach
for tasks to understand implied meanings. Specifically, we propose an approach
in which an overview of pragmatic theories, such as Gricean pragmatics and
Relevance Theory, is presented as a prompt to the language model, guiding it
through a step-by-step reasoning process to derive a final interpretation.
Experimental results showed that, compared to the baseline, which prompts
intermediate reasoning without presenting pragmatic theories (0-shot
Chain-of-Thought), our methods enabled language models to achieve up to 9.6\%
higher scores on pragmatic reasoning tasks. Furthermore, we show that even
without explaining the details of pragmatic theories, merely mentioning their
names in the prompt leads to a certain performance improvement (around 1-3%) in
larger models compared to the baseline.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Modern vision-language models (VLMs) excel at many multimodal tasks, yet
their grasp of temporal information in video remains weak and, crucially,
under-evaluated. We probe this gap with a deceptively simple but revealing
challenge: judging the arrow of time (AoT)-whether a short clip is played
forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated
benchmark that tests whether VLMs can infer temporal direction in natural
videos using the same stimuli and behavioral baselines established for humans.
Our comprehensive evaluation of open-weight and proprietary, reasoning and
non-reasoning VLMs reveals that most models perform near chance, and even the
best lag far behind human accuracy on physically irreversible processes (e.g.,
free fall, diffusion/explosion) and causal manual actions (division/addition)
that humans recognize almost instantly. These results highlight a fundamental
gap in current multimodal systems: while they capture rich visual-semantic
correlations, they lack the inductive biases required for temporal continuity
and causal understanding. We release the code and data for AoT-PsyPhyBENCH to
encourage further progress in the physical and temporal reasoning capabilities
of VLMs.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qi Luo, Xiaonan Li, Tingshuo Fan, Xinchi Chen, Xipeng Qiu
Natural Language Processing
ๆฅ็ๆ่ฆ
Retrieval-augmented generation (RAG) has emerged as a leading approach to
reducing hallucinations in large language models (LLMs). Current RAG evaluation
benchmarks primarily focus on what we call local RAG: retrieving relevant
chunks from a small subset of documents to answer queries that require only
localized understanding within specific text chunks. However, many real-world
applications require a fundamentally different capability -- global RAG --
which involves aggregating and analyzing information across entire document
collections to derive corpus-level insights (for example, "What are the top 10
most cited papers in 2023?"). In this paper, we introduce GlobalQA -- the first
benchmark specifically designed to evaluate global RAG capabilities, covering
four core task types: counting, extremum queries, sorting, and top-k
extraction. Through systematic evaluation across different models and
baselines, we find that existing RAG methods perform poorly on global tasks,
with the strongest baseline achieving only 1.51 F1 score. To address these
challenges, we propose GlobalRAG, a multi-tool collaborative framework that
preserves structural coherence through chunk-level retrieval, incorporates
LLM-driven intelligent filters to eliminate noisy documents, and integrates
aggregation modules for precise symbolic computation. On the Qwen2.5-14B model,
GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1,
validating the effectiveness of our method.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson
Natural Language Processing
ๆฅ็ๆ่ฆ
Human feedback can alter language models in unpredictable and undesirable
ways, as practitioners lack a clear understanding of what feedback data
encodes. While prior work studies preferences over certain attributes (e.g.,
length or sycophancy), automatically extracting relevant features without
pre-specifying hypotheses remains challenging. We introduce What's In My Human
Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders.
WIMHF characterizes both (1) the preferences a dataset is capable of measuring
and (2) the preferences that the annotators actually express. Across 7
datasets, WIMHF identifies a small number of human-interpretable features that
account for the majority of the preference prediction signal achieved by
black-box models. These features reveal a wide diversity in what humans prefer,
and the role of dataset-level context: for example, users on Reddit prefer
informality and jokes, while annotators in HH-RLHF and PRISM disprefer them.
WIMHF also surfaces potentially unsafe preferences, such as that LMArena users
tend to vote against refusals, often in favor of toxic content. The learned
features enable effective data curation: re-labeling the harmful examples in
Arena yields large safety gains (+37%) with no cost to general performance.
They also allow fine-grained personalization: on the Community Alignment
dataset, we learn annotator-specific weights over subjective features that
improve preference prediction. WIMHF provides a human-centered analysis method
for practitioners to better understand and use preference data.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Woojin Kim, Jaeyoung Do
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
While diffusion language models (DLMs) enable fine-grained refinement, their
practical controllability remains fragile. We identify and formally
characterize a central failure mode called update forgetting, in which uniform
and context agnostic updates induce token level fluctuations across timesteps,
erasing earlier semantic edits and disrupting the cumulative refinement
process, thereby degrading fluency and coherence. As this failure originates in
uniform and context agnostic updates, effective control demands explicit token
ordering. We propose Token Timestep Allocation (TTA), which realizes soft and
semantic token ordering via per token timestep schedules: critical tokens are
frozen early, while uncertain tokens receive continued refinement. This
timestep based ordering can be instantiated as either a fixed policy or an
adaptive policy driven by task signals, thereby supporting a broad spectrum of
refinement strategies. Because it operates purely at inference time, it applies
uniformly across various DLMs and naturally extends to diverse supervision
sources. Empirically, TTA improves controllability and fluency: on sentiment
control, it yields more than 20 percent higher accuracy and nearly halves
perplexity using less than one fifth the steps; in detoxification, it lowers
maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0).
Together, these results demonstrate that softened ordering via timestep
allocation is the critical lever for mitigating update forgetting and achieving
stable and controllable diffusion text generation.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dongjun Jang, Youngchae Ahn, Hyopil Shin
Natural Language Processing
ๆฅ็ๆ่ฆ
Current LLM evaluations often rely on a single instruction template,
overlooking models' sensitivity to instruction style-a critical aspect for
real-world deployments. We present RCScore, a multi-dimensional framework
quantifying how instruction formulation affects model responses. By
systematically transforming benchmark problems into multiple instruction
styles, RCScore reveals performance variations undetected by conventional
metrics. Our experiments across ten LLMs on four reasoning benchmarks
demonstrate that instruction style can shift accuracy by up to 16.7% points. We
introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to
measure stylistic self-consistency, and establish its strong correlation with
task accuracy, suggesting consistency as a valuable proxy for model
reliability. Additional findings show that deterministic decoding produces more
stylistically stable outputs, and model scale correlates positively with
cross-style consistency. RCScore offers a principled approach to assess
instruction robustness.
๐
2025-10-30
๐ ArXiv Preprint (cs.SD)
๐ฅ Hitomi Jin Ling Tee, Chaoren Wang, Zijie Zhang, Zhizheng Wu
Natural Language Processing
ๆฅ็ๆ่ฆ
The evaluation of intelligibility for TTS has reached a bottleneck, as
existing assessments heavily rely on word-by-word accuracy metrics such as WER,
which fail to capture the complexity of real-world speech or reflect human
comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice
Question Answering, a novel subjective approach evaluating the accuracy of key
information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour
news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal
that low WER does not necessarily guarantee high key-information accuracy,
exposing a gap between traditional metrics and practical intelligibility.
SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text
normalization and phonetic accuracy. This work underscores the urgent need for
high-level, more life-like evaluation criteria now that many systems already
excel at WER yet may fall short on real-world intelligibility.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Allen Schmaltz
Natural Language Processing
ๆฅ็ๆ่ฆ
We introduce Similarity-Distance-Magnitude (SDM) language models (LMs), which
are sequence prediction models fine-tuned to maximize the proportion of
generations in the well-calibrated, high-probability region partitioned by a
final-layer SDM activation layer used for binary classification of
instruction-following. We demonstrate that existing pre-trained decoder-only
Transformer LMs can be readily converted into SDM LMs via supervised
fine-tuning, using the final-layer SDM activation layer during training to
estimate a change-of-base for a supervised next-token loss over a contrastive
input encoding scheme, with additional hard negative examples generated online
during training. This results in reduced abstentions (i.e., improved
statistical efficiency) compared to strong supervised baselines.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shikhar Tuli, James Seale Smith, Haris Jeelani, Chi-Heng Lin, Abhishek Patel et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have significantly advanced generative
applications in natural language processing (NLP). Recent trends in model
architectures revolve around efficient variants of transformers or
state-space/gated-recurrent models (SSMs, GRMs). However, prevailing
SSM/GRM-based methods often emulate only a single attention head, potentially
limiting their expressiveness. In this work, we propose MossNet, a novel
mixture-of-state-space-experts architecture that emulates a linear multi-head
attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation
not only in channel-mixing multi-layered perceptron (MLP) blocks but also in
the time-mixing SSM kernels to realize multiple "attention heads." Extensive
experiments on language modeling and downstream evaluations show that MossNet
outperforms both transformer- and SSM-based architectures of similar model size
and data budgets. Larger variants of MossNet, trained on trillions of tokens,
further confirm its scalability and superior performance. In addition,
real-device profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 GPU
demonstrate favorable runtime speed and resource usage compared to similarly
sized baselines. Our results suggest that MossNet is a compelling new direction
for efficient, high-performing recurrent LLM architectures.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Renhao Li, Jianhong Tu, Yang Su, Hamid Alinejad-Rokny, Derek F. Wong et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reward models (RMs) play a critical role in aligning large language models
(LLMs) with human preferences. Yet in the domain of tool learning, the lack of
RMs specifically designed for function-calling tasks has limited progress
toward more capable agentic AI. We introduce ToolRM, a family of lightweight
generative RMs tailored for general tool-use scenarios. To build these models,
we propose a novel pipeline that constructs pairwise preference data using
rule-based scoring and multidimensional sampling. This yields
ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique
tasks that supports reinforcement learning with verifiable feedback. To
evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on
the agentic evaluation suite BFCL. Trained on our constructed data, models from
the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially
outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward
judgments. Beyond training objectives, ToolRM generalizes to broader critique
tasks, including Best-of-N sampling and self-correction. Experiments on
ACEBench highlight its effectiveness and efficiency, enabling inference-time
scaling and reducing output token usage by over 66%. We release data and model
checkpoints to facilitate future research.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Bo Pang, Deqian Kong, Silvio Savarese, Caiming Xiong, Yingbo Zhou
Machine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) can elicit strong reasoning in large language
models (LLMs), yet most open efforts focus on math and code. We propose
Reasoning Curriculum, a simple two-stage curriculum that first elicits
reasoning skills in pretraining-aligned domains such as math, then adapts and
refines these skills across other domains via joint RL. Stage 1 performs a
brief cold start and then math-only RL with verifiable rewards to develop
reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and
consolidate these skills. The curriculum is minimal and backbone-agnostic,
requiring no specialized reward models beyond standard verifiability checks.
Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, reasoning
curriculum yields consistent gains. Ablations and a cognitive-skill analysis
indicate that both stages are necessary and that math-first elicitation
increases cognitive behaviors important for solving complex problems. Reasoning
Curriculum provides a compact, easy-to-adopt recipe for general reasoning.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nawar Turk, Sevag Kaspar, Leila Kosseim
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper investigates the relationship between Persuasion Techniques (PTs)
and Discourse Relations (DRs) by leveraging Large Language Models (LLMs) and
prompt engineering. Since no dataset annotated with both PTs and DRs exists, we
took the SemEval 2023 Task 3 dataset labelled with 19 PTs as a starting point
and developed LLM-based classifiers to label each instance of the dataset with
one of the 22 PDTB 3.0 level-2 DRs. In total, four LLMs were evaluated using 10
different prompts, resulting in 40 unique DR classifiers. Ensemble models using
different majority-pooling strategies were used to create 5 silver datasets of
instances labelled with both persuasion techniques and level-2 PDTB senses. The
silver dataset sizes vary from 1,281 instances to 204 instances, depending on
the majority pooling technique used. Statistical analysis of these silver
datasets shows that six discourse relations (namely Cause, Purpose, Contrast,
Cause+Belief, Concession, and Condition) play a crucial role in persuasive
texts, especially in the use of Loaded Language, Exaggeration/Minimisation,
Repetition and to cast Doubt. This insight can contribute to detecting online
propaganda and misinformation, as well as to our general understanding of
effective communication.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Feng Ju, Zeyu Qin, Rui Min, Zhitao He, Lingpeng Kong et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
While Test-Time Scaling (TTS) has proven effective in improving the reasoning
ability of large language models (LLMs), low diversity in model outputs often
becomes a bottleneck; this is partly caused by the common "one problem, one
solution" (1P1S) training practice, which provides a single canonical answer
and can push models toward a narrow set of reasoning paths. To address this, we
propose a "one problem, multiple solutions" (1PNS) training paradigm that
exposes the model to a variety of valid reasoning trajectories and thus
increases inference diversity. A core challenge for 1PNS is reliably measuring
semantic differences between multi-step chains of thought, so we introduce
Reasoning Path Divergence (RPD), a step-level metric that aligns and scores
Long Chain-of-Thought solutions to capture differences in intermediate
reasoning. Using RPD, we curate maximally diverse solution sets per problem and
fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields
more varied outputs and higher pass@k, with an average +2.80% gain in pass@16
over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that
1PNS further amplifies the effectiveness of TTS. Our code is available at
https://github.com/fengjujf/Reasoning-Path-Divergence .
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Taku Mikuriya, Tatsuya Ishigaki, Masayuki Kawarada, Shunya Minami, Tadashi Kadowaki et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have increasingly been applied to automatic
programming code generation. This task can be viewed as a language generation
task that bridges natural language, human knowledge, and programming logic.
However, it remains underexplored in domains that require interaction with
hardware devices, such as quantum programming, where human coders write Python
code that is executed on a quantum computer. To address this gap, we introduce
QCoder Benchmark, an evaluation framework that assesses LLMs on quantum
programming with feedback from simulated hardware devices. Our benchmark offers
two key features. First, it supports evaluation using a quantum simulator
environment beyond conventional Python execution, allowing feedback of
domain-specific metrics such as circuit depth, execution time, and error
classification, which can be used to guide better generation. Second, it
incorporates human-written code submissions collected from real programming
contests, enabling both quantitative comparisons and qualitative analyses of
LLM outputs against human-written codes. Our experiments reveal that even
advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting
the difficulty of the benchmark. In contrast, reasoning-based models such as o3
reach up to 78% accuracy, outperforming averaged success rates of human-written
codes (39.98%). We release the QCoder Benchmark dataset and public evaluation
API to support further research.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Jingyuan He, Jiongnan Liu, Vishan Vishesh Oberoi, Bolin Wu, Mahima Jagadeesh Patel et al.
ๆฅ็ๆ่ฆ
Recommender systems are among the most impactful AI applications, interacting
with billions of users every day, guiding them to relevant products, services,
or information tailored to their preferences. However, the research and
development of recommender systems are hindered by existing datasets that fail
to capture realistic user behaviors and inconsistent evaluation settings that
lead to ambiguous conclusions. This paper introduces the Open Recommendation
Benchmark for Reproducible Research with HIdden Tests (ORBIT), a unified
benchmark for consistent and realistic evaluation of recommendation models.
ORBIT offers a standardized evaluation framework of public datasets with
reproducible splits and transparent settings for its public leaderboard.
Additionally, ORBIT introduces a new webpage recommendation task, ClueWeb-Reco,
featuring web browsing sequences from 87 million public, high-quality webpages.
ClueWeb-Reco is a synthetic dataset derived from real, user-consented, and
privacy-guaranteed browsing data. It aligns with modern recommendation
scenarios and is reserved as the hidden test part of our leaderboard to
challenge recommendation models' generalization ability. ORBIT measures 12
representative recommendation models on its public benchmark and introduces a
prompted LLM baseline on the ClueWeb-Reco hidden test. Our benchmark results
reflect general improvements of recommender systems on the public datasets,
with variable individual performances. The results on the hidden test reveal
the limitations of existing approaches in large-scale webpage recommendation
and highlight the potential for improvements with LLM integrations. ORBIT
benchmark, leaderboard, and codebase are available at
https://www.open-reco-bench.ai.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiali Cheng, Chirag Agarwal, Hadi Amiri
ๆฅ็ๆ่ฆ
Knowledge distillation (KD) is an effective method for model compression and
transferring knowledge between models. However, its effect on model's
robustness against spurious correlations that degrade performance on
out-of-distribution data remains underexplored. This study investigates the
effect of knowledge distillation on the transferability of ``debiasing''
capabilities from teacher models to student models on natural language
inference (NLI) and image classification tasks. Through extensive experiments,
we illustrate several key findings: (i) overall the debiasing capability of a
model is undermined post-KD; (ii) training a debiased model does not benefit
from injecting teacher knowledge; (iii) although the overall robustness of a
model may remain stable post-distillation, significant variations can occur
across different types of biases; and (iv) we pin-point the internal attention
pattern and circuit that causes the distinct behavior post-KD. Given the above
findings, we propose three effective solutions to improve the distillability of
debiasing methods: developing high quality data for augmentation, implementing
iterative knowledge distillation, and initializing student models with weights
obtained from teacher models. To the best of our knowledge, this is the first
study on the effect of KD on debiasing and its interenal mechanism at scale.
Our findings provide understandings on how KD works and how to design better
debiasing methods.
๐
2025-10-30
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Kaiwen Zhou, Ahmed Elgohary, A S M Iftekhar, Amin Saied
ๆฅ็ๆ่ฆ
The ability of LLM agents to plan and invoke tools exposes them to new safety
risks, making a comprehensive red-teaming system crucial for discovering
vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic
red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic
two-step process that starts with an agent definition and generates diverse
seed test cases that cover various risk outcomes, tool-use trajectories, and
risk sources. Then, it iteratively constructs and refines model-based
adversarial attacks based on the execution trajectories of former attempts. To
optimize the red-teaming cost, we present a model distillation approach that
leverages structured forms of a teacher model's reasoning to train smaller
models that are equally effective. Across diverse evaluation agent settings,
our seed test case generation approach yields 2 -- 2.5x boost to the coverage
of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer
model improves attack success rate by 100%, surpassing the 671B Deepseek-R1
model. Our ablations and analyses validate the effectiveness of the iterative
framework, structured reasoning, and the generalization of our red-teamer
models.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Felipe Larios, Mariana Borras-Osorio, Yuqi Wu, Ana Gabriela Claros, David Toro-Tobon et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Importance Incidental thyroid findings (ITFs) are increasingly detected on
imaging performed for non-thyroid indications. Their prevalence, features, and
clinical consequences remain undefined. Objective To develop, validate, and
deploy a natural language processing (NLP) pipeline to identify ITFs in
radiology reports and assess their prevalence, features, and clinical outcomes.
Design, Setting, and Participants Retrospective cohort of adults without prior
thyroid disease undergoing thyroid-capturing imaging at Mayo Clinic sites from
July 1, 2017, to September 30, 2023. A transformer-based NLP pipeline
identified ITFs and extracted nodule characteristics from image reports from
multiple modalities and body regions. Main Outcomes and Measures Prevalence of
ITFs, downstream thyroid ultrasound, biopsy, thyroidectomy, and thyroid cancer
diagnosis. Logistic regression identified demographic and imaging-related
factors. Results Among 115,683 patients (mean age, 56.8 [SD 17.2] years; 52.9%
women), 9,077 (7.8%) had an ITF, of which 92.9% were nodules. ITFs were more
likely in women, older adults, those with higher BMI, and when imaging was
ordered by oncology or internal medicine. Compared with chest CT, ITFs were
more likely via neck CT, PET, and nuclear medicine scans. Nodule
characteristics were poorly documented, with size reported in 44% and other
features in fewer than 15% (e.g. calcifications). Compared with patients
without ITFs, those with ITFs had higher odds of thyroid nodule diagnosis,
biopsy, thyroidectomy and thyroid cancer diagnosis. Most cancers were
papillary, and larger when detected after ITFs vs no ITF. Conclusions ITFs were
common and strongly associated with cascades leading to the detection of small,
low-risk cancers. These findings underscore the role of ITFs in thyroid cancer
overdiagnosis and the need for standardized reporting and more selective
follow-up.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
There are two prevalent ways to constructing 3D scenes: procedural generation
and 2D lifting. Among them, panorama-based 2D lifting has emerged as a
promising technique, leveraging powerful 2D generative priors to produce
immersive, realistic, and diverse 3D environments. In this work, we advance
this technique to generate graphics-ready 3D scenes suitable for physically
based rendering (PBR), relighting, and simulation. Our key insight is to
repurpose 2D generative models for panoramic perception of geometry, textures,
and PBR materials. Unlike existing 2D lifting approaches that emphasize
appearance generation and ignore the perception of intrinsic properties, we
present OmniX, a versatile and unified framework. Based on a lightweight and
efficient cross-modal adapter structure, OmniX reuses 2D generative priors for
a broad range of panoramic vision tasks, including panoramic perception,
generation, and completion. Furthermore, we construct a large-scale synthetic
panorama dataset containing high-quality multimodal panoramas from diverse
indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness
of our model in panoramic visual perception and graphics-ready 3D scene
generation, opening new possibilities for immersive and physically realistic
virtual world generation.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Philipp Lindenberger, Paul-Edouard Sarlin, Jan Hosang, Matteo Balice, Marc Pollefeys et al.
ๆฅ็ๆ่ฆ
Determining the precise geographic location of an image at a global scale
remains an unsolved challenge. Standard image retrieval techniques are
inefficient due to the sheer volume of images (>100M) and fail when coverage is
insufficient. Scalable solutions, however, involve a trade-off: global
classification typically yields coarse results (10+ kilometers), while
cross-view retrieval between ground and aerial imagery suffers from a domain
gap and has been primarily studied on smaller regions. This paper introduces a
hybrid approach that achieves fine-grained geo-localization across a large
geographic expanse the size of a continent. We leverage a proxy classification
task during training to learn rich feature representations that implicitly
encode precise location information. We combine these learned prototypes with
embeddings of aerial imagery to increase robustness to the sparsity of
ground-level data. This enables direct, fine-grained retrieval over areas
spanning multiple countries. Our extensive evaluation demonstrates that our
approach can localize within 200m more than 68\% of queries of a dataset
covering a large part of Europe. The code is publicly available at
https://scaling-geoloc.github.io.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tao Tao, Maissam Barkeshli
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
We study the ability of Transformer models to learn sequences generated by
Permuted Congruential Generators (PCGs), a widely used family of pseudo-random
number generators (PRNGs). PCGs introduce substantial additional difficulty
over linear congruential generators (LCGs) by applying a series of bit-wise
shifts, XORs, rotations and truncations to the hidden state. We show that
Transformers can nevertheless successfully perform in-context prediction on
unseen sequences from diverse PCG variants, in tasks that are beyond published
classical attacks. In our experiments we scale moduli up to $2^{22}$ using up
to $50$ million model parameters and datasets with up to $5$ billion tokens.
Surprisingly, we find even when the output is truncated to a single bit, it can
be reliably predicted by the model. When multiple distinct PRNGs are presented
together during training, the model can jointly learn them, identifying
structures from different permutations. We demonstrate a scaling law with
modulus $m$: the number of in-context sequence elements required for
near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization
enters extended stagnation phases; in our experiments, learning moduli $m \geq
2^{20}$ requires incorporating training data from smaller moduli, demonstrating
a critical necessity for curriculum learning. Finally, we analyze embedding
layers and uncover a novel clustering phenomenon: the model spontaneously
groups the integer inputs into bitwise rotationally-invariant clusters,
revealing how representations can transfer from smaller to larger moduli.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Cheng Zheng, William Koch, Baiang Li, Felix Heide
Computer VisionMachine LearningRobotics
ๆฅ็ๆ่ฆ
Hierarchical structures of motion exist across research fields, including
computer vision, graphics, and robotics, where complex dynamics typically arise
from coordinated interactions among simpler motion components. Existing methods
to model such dynamics typically rely on manually-defined or heuristic
hierarchies with fixed motion primitives, limiting their generalizability
across different tasks. In this work, we propose a general hierarchical motion
modeling method that learns structured, interpretable motion relationships
directly from data. Our method represents observed motions using graph-based
hierarchies, explicitly decomposing global absolute motions into
parent-inherited patterns and local motion residuals. We formulate hierarchy
inference as a differentiable graph learning problem, where vertices represent
elemental motions and directed edges capture learned parent-child dependencies
through graph neural networks. We evaluate our hierarchical reconstruction
approach on three examples: 1D translational motion, 2D rotational motion, and
dynamic 3D scene deformation via Gaussian splatting. Experimental results show
that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases,
and produces more realistic and interpretable deformations compared to the
baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable,
data-driven hierarchical modeling paradigm, our method offers a formulation
applicable to a broad range of motion-centric tasks. Project Page:
https://light.princeton.edu/HEIR/
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Masahiro Kato
Machine Learning
ๆฅ็ๆ่ฆ
This note introduces a unified theory for causal inference that integrates
Riesz regression, covariate balancing, density-ratio estimation (DRE), targeted
maximum likelihood estimation (TMLE), and the matching estimator in average
treatment effect (ATE) estimation. In ATE estimation, the balancing weights and
the regression functions of the outcome play important roles, where the
balancing weights are referred to as the Riesz representer, bias-correction
term, and clever covariates, depending on the context. Riesz regression,
covariate balancing, DRE, and the matching estimator are methods for estimating
the balancing weights, where Riesz regression is essentially equivalent to DRE
in the ATE context, the matching estimator is a special case of DRE, and DRE is
in a dual relationship with covariate balancing. TMLE is a method for
constructing regression function estimators such that the leading bias term
becomes zero. Nearest Neighbor Matching is equivalent to Least Squares Density
Ratio Estimation and Riesz Regression.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Valentyna Starodub, Mantas Lukoลกeviฤius
ๆฅ็ๆ่ฆ
Age-related macular degeneration (AMD) is one of the leading causes of
irreversible vision impairment in people over the age of 60. This research
focuses on semantic segmentation for AMD lesion detection in RGB fundus images,
a non-invasive and cost-effective imaging technique. The results of the ADAM
challenge - the most comprehensive AMD detection from RGB fundus images
research competition and open dataset to date - serve as a benchmark for our
evaluation. Taking the U-Net connectivity as a base of our framework, we
evaluate and compare several approaches to improve the segmentation model's
architecture and training pipeline, including pre-processing techniques,
encoder (backbone) deep network types of varying complexity, and specialized
loss functions to mitigate class imbalances on image and pixel levels. The main
outcome of this research is the final configuration of the AMD detection
framework, which outperforms all the prior ADAM challenge submissions on the
multi-class segmentation of different AMD lesion types in non-invasive RGB
fundus images. The source code used to conduct the experiments presented in
this paper is made freely available.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Andreas Auer, Daniel Klotz, Sebastinan Bรถck, Sepp Hochreiter
ๆฅ็ๆ่ฆ
Recent research on time series foundation models has primarily focused on
forecasting, leaving it unclear how generalizable their learned representations
are. In this study, we examine whether frozen pre-trained forecasting models
can provide effective representations for classification. To this end, we
compare different representation extraction strategies and introduce two
model-agnostic embedding augmentations. Our experiments show that the best
forecasting models achieve classification accuracy that matches or even
surpasses that of state-of-the-art models pre-trained specifically for
classification. Moreover, we observe a positive correlation between forecasting
and classification performance. These findings challenge the assumption that
task-specific pre-training is necessary, and suggest that learning to forecast
may provide a powerful route toward constructing general-purpose time series
foundation models.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
This work introduces SteerVLM, a lightweight steering module designed to
guide Vision-Language Models (VLMs) towards outputs that better adhere to
desired instructions. Our approach learns from the latent embeddings of paired
prompts encoding target and converse behaviors to dynamically adjust
activations connecting the language modality with image context. This allows
for fine-grained, inference-time control over complex output semantics without
modifying model weights while preserving performance on off-target tasks. Our
steering module requires learning parameters equal to 0.14% of the original
VLM's size. Our steering module gains model control through dimension-wise
activation modulation and adaptive steering across layers without requiring
pre-extracted static vectors or manual tuning of intervention points.
Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a
multimodal dataset specifically created to facilitate the development and
evaluation of VLM steering techniques. Our method outperforms existing
intervention techniques on steering and hallucination mitigation benchmarks for
VLMs and proposes a robust solution for multimodal model control through
activation engineering.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In sequence modeling, the parametric memory of atomic facts has been
predominantly abstracted as a brute-force lookup of co-occurrences between
entities. We contrast this associative view against a geometric view of how
memory is stored. We begin by isolating a clean and analyzable instance of
Transformer reasoning that is incompatible with memory as strictly a storage of
the local co-occurrences specified during training. Instead, the model must
have somehow synthesized its own geometry of atomic facts, encoding global
relationships between all entities, including non-co-occurring ones. This in
turn has simplified a hard reasoning task involving an $\ell$-fold composition
into an easy-to-learn 1-step geometric task.
From this phenomenon, we extract fundamental aspects of neural embedding
geometries that are hard to explain. We argue that the rise of such a geometry,
despite optimizing over mere local associations, cannot be straightforwardly
attributed to typical architectural or optimizational pressures.
Counterintuitively, an elegant geometry is learned even when it is not more
succinct than a brute-force lookup of associations.
Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry
stems from a spectral bias that -- in contrast to prevailing theories -- indeed
arises naturally despite the lack of various pressures. This analysis also
points to practitioners a visible headroom to make Transformer memory more
strongly geometric. We hope the geometric view of parametric memory encourages
revisiting the default intuitions that guide researchers in areas like
knowledge acquisition, capacity, discovery and unlearning.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Masahiro Kato
Machine Learning
ๆฅ็ๆ่ฆ
The goal of policy learning is to train a policy function that recommends a
treatment given covariates to maximize population welfare. There are two major
approaches in policy learning: the empirical welfare maximization (EWM)
approach and the plug-in approach. The EWM approach is analogous to a
classification problem, where one first builds an estimator of the population
welfare, which is a functional of policy functions, and then trains a policy by
maximizing the estimated welfare. In contrast, the plug-in approach is based on
regression, where one first estimates the conditional average treatment effect
(CATE) and then recommends the treatment with the highest estimated outcome.
This study bridges the gap between the two approaches by showing that both are
based on essentially the same optimization problem. In particular, we prove an
exact equivalence between EWM and least squares over a reparameterization of
the policy class. As a consequence, the two approaches are interchangeable in
several respects and share the same theoretical guarantees under common
conditions. Leveraging this equivalence, we propose a novel regularization
method for policy learning. Our findings yield a convex and computationally
efficient training procedure that avoids the NP-hard combinatorial step
typically required in EWM.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Muhammad Faraz Ul Abrar, Nicolรฒ Michelusi
Robotics
ๆฅ็ๆ่ฆ
Over-the-air (OTA) federated learning (FL) has been well recognized as a
scalable paradigm that exploits the waveform superposition of the wireless
multiple-access channel to aggregate model updates in a single use. Existing
OTA-FL designs largely enforce zero-bias model updates by either assuming
\emph{homogeneous} wireless conditions (equal path loss across devices) or
forcing zero-bias updates to guarantee convergence. Under \emph{heterogeneous}
wireless scenarios, however, such designs are constrained by the weakest device
and inflate the update variance. Moreover, prior analyses of biased OTA-FL
largely address convex objectives, while most modern AI models are highly
non-convex. Motivated by these gaps, we study OTA-FL with stochastic gradient
descent (SGD) for general smooth non-convex objectives under wireless
heterogeneity. We develop novel OTA-FL SGD updates that allow a structured,
time-invariant model bias while facilitating reduced variance updates. We
derive a finite-time stationarity bound (expected time average squared gradient
norm) that explicitly reveals a bias-variance trade-off. To optimize this
trade-off, we pose a non-convex joint OTA power-control design and develop an
efficient successive convex approximation (SCA) algorithm that requires only
statistical CSI at the base station. Experiments on a non-convex image
classification task validate the approach: the SCA-based design accelerates
convergence via an optimized bias and improves generalization over prior OTA-FL
baselines.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tommaso d'Orsi, Gleb Novikov
ๆฅ็ๆ่ฆ
We present a simple perturbation mechanism for the release of $d$-dimensional
covariance matrices $\Sigma$ under pure differential privacy. For large
datasets with at least $n\geq d^2/\varepsilon$ elements, our mechanism recovers
the provably optimal Frobenius norm error guarantees of
\cite{nikolov2023private}, while simultaneously achieving best known error for
all other $p$-Schatten norms, with $p\in [1,\infty]$. Our error is
information-theoretically optimal for all $p\ge 2$, in particular, our
mechanism is the first purely private covariance estimator that achieves
optimal error in spectral norm.
For small datasets $n< d^2/\varepsilon$, we further show that by projecting
the output onto the nuclear norm ball of appropriate radius, our algorithm
achieves the optimal Frobenius norm error $O(\sqrt{d\;\text{Tr}(\Sigma) /n})$,
improving over the known bounds of $O(\sqrt{d/n})$ of \cite{nikolov2023private}
and ${O}\big(d^{3/4}\sqrt{\text{Tr}(\Sigma)/n}\big)$ of
\cite{dong2022differentially}.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Gabriel Asher, Devesh Shah, Amy A. Caudy, Luke Ferro, Lea Amar et al.
Machine Learning
ๆฅ็ๆ่ฆ
A vast majority of mass spectrometry data remains uncharacterized, leaving
much of its biological and chemical information untapped. Recent advances in
machine learning have begun to address this gap, particularly for tasks such as
spectral identification in tandem mass spectrometry data. Here, we present the
latest generation of LSM-MS2, a large-scale deep learning foundation model
trained on millions of spectra to learn a semantic chemical space. LSM-MS2
achieves state-of-the-art performance in spectral identification, improving on
existing methods by 30% in accuracy of identifying challenging isomeric
compounds, yielding 42% more correct identifications in complex biological
samples, and maintaining robustness under low-concentration conditions.
Furthermore, LSM-MS2 produces rich spectral embeddings that enable direct
biological interpretation from minimal downstream data, successfully
differentiating disease states and predicting clinical outcomes across diverse
translational applications.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma
ๆฅ็ๆ่ฆ
Machine unlearning (MU) aims to remove the influence of certain data points
from a trained model without costly retraining. Most practical MU algorithms
are only approximate and their performance can only be assessed empirically.
Care must therefore be taken to make empirical comparisons as representative as
possible. A common practice is to run the MU algorithm multiple times
independently starting from the same trained model. In this work, we
demonstrate that this practice can give highly non-representative results
because -- even for the same architecture and same dataset -- some MU methods
can be highly sensitive to the choice of random number seed used for model
training. We therefore recommend that empirical
comphttps://info.arxiv.org/help/prep#commentsarisons of MU algorithms should
also reflect the variability across different model training seeds.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Chuyan Chen, Chenyang Ma, Zhangxin Li, Yutong He, Yanjie Dong et al.
ๆฅ็ๆ่ฆ
Communication remains a central bottleneck in large-scale distributed machine
learning, and gradient sparsification has emerged as a promising strategy to
alleviate this challenge. However, existing gradient compressors face notable
limitations: Rand-$K$\ discards structural information and performs poorly in
practice, while Top-$K$\ preserves informative entries but loses the
contraction property and requires costly All-Gather operations. In this paper,
we propose ARC-Top-$K$, an {All-Reduce}-Compatible Top-$K$ compressor that
aligns sparsity patterns across nodes using a lightweight sketch of the
gradient, enabling index-free All-Reduce while preserving globally significant
information. ARC-Top-$K$\ is provably contractive and, when combined with
momentum error feedback (EF21M), achieves linear speedup and sharper
convergence rates than the original EF21M under standard assumptions.
Empirically, ARC-Top-$K$\ matches the accuracy of Top-$K$\ while reducing
wall-clock training time by up to 60.7\%, offering an efficient and scalable
solution that combines the robustness of Rand-$K$\ with the strong performance
of Top-$K$.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mehar Bhatia, Shravan Nayak, Gaurav Kamath, Marius Mosbach, Karolina Staลczak et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
As LLMs occupy an increasingly important role in society, they are more and
more confronted with questions that require them not only to draw on their
general knowledge but also to align with certain human value systems.
Therefore, studying the alignment of LLMs with human values has become a
crucial field of inquiry. Prior work, however, mostly focuses on evaluating the
alignment of fully trained models, overlooking the training dynamics by which
models learn to express human values. In this work, we investigate how and at
which stage value alignment arises during the course of a model's
post-training. Our analysis disentangles the effects of post-training
algorithms and datasets, measuring both the magnitude and time of value drifts
during training. Experimenting with Llama-3 and Qwen-3 models of different
sizes and popular supervised fine-tuning (SFT) and preference optimization
datasets and algorithms, we find that the SFT phase generally establishes a
model's values, and subsequent preference optimization rarely re-aligns these
values. Furthermore, using a synthetic preference dataset that enables
controlled manipulation of values, we find that different preference
optimization algorithms lead to different value alignment outcomes, even when
preference data is held constant. Our findings provide actionable insights into
how values are learned during post-training and help to inform data curation,
as well as the selection of models and algorithms for preference optimization
to improve model alignment to human values.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Giulia DeSalvo, Clara Mohri, Mehryar Mohri, Yutao Zhong
Machine Learning
ๆฅ็ๆ่ฆ
Learning to defer uncertain predictions to costly experts offers a powerful
strategy for improving the accuracy and efficiency of machine learning systems.
However, standard training procedures for deferral algorithms typically require
querying all experts for every training instance, an approach that becomes
prohibitively expensive when expert queries incur significant computational or
resource costs. This undermines the core goal of deferral: to limit unnecessary
expert usage. To overcome this challenge, we introduce the budgeted deferral
framework, which aims to train effective deferral algorithms while minimizing
expert query costs during training. We propose new algorithms for both
two-stage and single-stage multiple-expert deferral settings that selectively
query only a subset of experts per training example. While inspired by active
learning, our setting is fundamentally different: labels are already known, and
the core challenge is to decide which experts to query in order to balance cost
and predictive performance. We establish theoretical guarantees for both of our
algorithms, including generalization bounds and label complexity analyses.
Empirical results across several domains show that our algorithms substantially
reduce training costs without sacrificing prediction accuracy, demonstrating
the practical value of our budget-aware deferral algorithms.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nick Heilenkรถtter
Machine Learning
ๆฅ็ๆ่ฆ
Can regularization terms in the training of invertible neural networks lead
to known Bayesian point estimators in reconstruction? Invertible networks are
attractive for inverse problems due to their inherent stability and
interpretability. Recently, optimization strategies for invertible neural
networks that approximate either a reconstruction map or the forward operator
have been studied from a Bayesian perspective, but each has limitations. To
address this, we introduce and analyze two regularization terms for the network
training that, upon inversion of the network, recover properties of classical
Bayesian point estimators: while the first can be connected to the posterior
mean, the second resembles the MAP estimator. Our theoretical analysis
characterizes how each loss shapes both the learned forward operator and its
inverse reconstruction map. Numerical experiments support our findings and
demonstrate how these loss-term regularizers introduce data-dependence in a
stable and interpretable way.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Gerard T. Portela, Jason B. Gibbons, Sebastian Schneeweiss, Rishi J. Desai
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Observational studies developing causal machine learning (ML) models for the
prediction of individualized treatment effects (ITEs) seldom conduct empirical
evaluations to assess the conditional exchangeability assumption. We aimed to
evaluate the performance of these models under conditional exchangeability
violations and the utility of negative control outcomes (NCOs) as a diagnostic.
We conducted a simulation study to examine confounding bias in ITE estimates
generated by causal forest and X-learner models under varying conditions,
including the presence or absence of true heterogeneity. We simulated data to
reflect real-world scenarios with differing levels of confounding, sample size,
and NCO confounding structures. We then estimated and compared subgroup-level
treatment effects on the primary outcome and NCOs across settings with and
without unmeasured confounding. When conditional exchangeability was violated,
causal forest and X-learner models failed to recover true treatment effect
heterogeneity and, in some cases, falsely indicated heterogeneity when there
was none. NCOs successfully identified subgroups affected by unmeasured
confounding. Even when NCOs did not perfectly satisfy its ideal assumptions, it
remained informative, flagging potential bias in subgroup level estimates,
though not always pinpointing the subgroup with the largest confounding.
Violations of conditional exchangeability substantially limit the validity of
ITE estimates from causal ML models in routinely collected observational data.
NCOs serve a useful empirical diagnostic tool for detecting subgroup-specific
unmeasured confounding and should be incorporated into causal ML workflows to
support the credibility of individualized inference.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu et al.
ๆฅ็ๆ่ฆ
We introduce Kimi Linear, a hybrid linear attention architecture that, for
the first time, outperforms full attention under fair comparisons across
various scenarios -- including short-context, long-context, and reinforcement
learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an
expressive linear attention module that extends Gated DeltaNet with a
finer-grained gating mechanism, enabling more effective use of limited
finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware
efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR)
transition matrices, which substantially reduces computation compared to the
general DPLR formulation while remaining more consistent with the classical
delta rule.
We pretrain a Kimi Linear model with 3B activated parameters and 48B total
parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention
(MLA). Our experiments show that with an identical training recipe, Kimi Linear
outperforms full MLA with a sizeable margin across all evaluated tasks, while
reducing KV cache usage by up to 75% and achieving up to 6 times decoding
throughput for a 1M context. These results demonstrate that Kimi Linear can be
a drop-in replacement for full attention architectures with superior
performance and efficiency, including tasks with longer input and output
lengths.
To support further research, we open-source the KDA kernel and vLLM
implementations, and release the pre-trained and instruction-tuned model
checkpoints.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Amir Reza Mirzaei, Yuqiao Wen, Yanshuai Cao, Lili Mou
Natural Language Processing
ๆฅ็ๆ่ฆ
Low-Rank Adaptation (LoRA) has become a popular technique for
parameter-efficient fine-tuning of large language models (LLMs). In many
real-world scenarios, multiple adapters are loaded simultaneously to enable LLM
customization for personalized user experiences or to support a diverse range
of tasks. Although each adapter is lightweight in isolation, their aggregate
cost becomes substantial at scale. To address this, we propose LoRAQuant, a
mixed-precision post-training quantization method tailored to LoRA.
Specifically, LoRAQuant reparameterizes each adapter by singular value
decomposition (SVD) to concentrate the most important information into specific
rows and columns. This makes it possible to quantize the important components
to higher precision, while quantizing the rest to ultra-low bitwidth. We
conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B
models on mathematical reasoning, coding, and summarization tasks. Results show
that our LoRAQuant uses significantly lower bits than other quantization
methods, but achieves comparable or even higher performance.
๐
2025-10-30
๐ ArXiv Preprint (quant-ph)
๐ฅ Jun Dai, Michael Rizvi-Martel, Guillaume Rabusseau
Machine Learning
ๆฅ็ๆ่ฆ
Designing efficient quantum circuits is a central bottleneck to exploring the
potential of quantum computing, particularly for noisy intermediate-scale
quantum (NISQ) devices, where circuit efficiency and resilience to errors are
paramount. The search space of gate sequences grows combinatorially, and
handcrafted templates often waste scarce qubit and depth budgets. We introduce
\textsc{FlowQ-Net} (Flow-based Quantum design Network), a generative framework
for automated quantum circuit synthesis based on Generative Flow Networks
(GFlowNets). This framework learns a stochastic policy to construct circuits
sequentially, sampling them in proportion to a flexible, user-defined reward
function that can encode multiple design objectives such as performance, depth,
and gate count. This approach uniquely enables the generation of a diverse
ensemble of high-quality circuits, moving beyond single-solution optimization.
We demonstrate the efficacy of \textsc{FlowQ-Net} through an extensive set of
simulations. We apply our method to Variational Quantum Algorithm (VQA) ansatz
design for molecular ground state estimation, Max-Cut, and image
classification, key challenges in near-term quantum computing. Circuits
designed by \textsc{FlowQ-Net} achieve significant improvements, yielding
circuits that are 10$\times$-30$\times$ more compact in terms of parameters,
gates, and depth compared to commonly used unitary baselines, without
compromising accuracy. This trend holds even when subjected to error profiles
from real-world quantum devices. Our results underline the potential of
generative models as a general-purpose methodology for automated quantum
circuit design, offering a promising path towards more efficient quantum
algorithms and accelerating scientific discovery in the quantum domain.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Tommaso d'Orsi, Gleb Novikov
ๆฅ็ๆ่ฆ
We revisit the task of computing the span of the top $r$ singular vectors
$u_1, \ldots, u_r$ of a matrix under differential privacy. We show that a
simple and efficient algorithm -- based on singular value decomposition and
standard perturbation mechanisms -- returns a private rank-$r$ approximation
whose error depends only on the \emph{rank-$r$ coherence} of $u_1, \ldots, u_r$
and the spectral gap $\sigma_r - \sigma_{r+1}$. This resolves a question posed
by Hardt and Roth~\cite{hardt2013beyond}. Our estimator outperforms the state
of the art -- significantly so in some regimes. In particular, we show that in
the dense setting, it achieves the same guarantees for single-spike PCA in the
Wishart model as those attained by optimal non-private algorithms, whereas
prior private algorithms failed to do so.
In addition, we prove that (rank-$r$) coherence does not increase under
Gaussian perturbations. This implies that any estimator based on the Gaussian
mechanism -- including ours -- preserves the coherence of the input. We
conjecture that similar behavior holds for other structured models, including
planted problems in graphs.
We also explore applications of coherence to graph problems. In particular,
we present a differentially private algorithm for Max-Cut and other constraint
satisfaction problems under low coherence assumptions.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Ruimin He, Shaowei Lin
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
At the heart of reinforcement learning are actions -- decisions made in
response to observations of the environment. Actions are equally fundamental in
the modeling of stochastic processes, as they trigger discontinuous state
transitions and enable the flow of information through large, complex systems.
In this paper, we unify the perspectives of stochastic processes and
reinforcement learning through action-driven processes, and illustrate their
application to spiking neural networks. Leveraging ideas from
control-as-inference, we show that minimizing the Kullback-Leibler divergence
between a policy-driven true distribution and a reward-driven model
distribution for a suitably defined action-driven process is equivalent to
maximum entropy reinforcement learning.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Georgios Kamaras, Craig Innes, Subramanian Ramamoorthy
Robotics
ๆฅ็ๆ่ฆ
In robotics, likelihood-free inference (LFI) can provide the domain
distribution that adapts a learnt agent in a parametric set of deployment
conditions. LFI assumes an arbitrary support for sampling, which remains
constant as the initial generic prior is iteratively refined to more
descriptive posteriors. However, a potentially misspecified support can lead to
suboptimal, yet falsely certain, posteriors. To address this issue, we propose
three heuristic LFI variants: EDGE, MODE, and CENTRE. Each interprets the
posterior mode shift over inference steps in its own way and, when integrated
into an LFI step, adapts the support alongside posterior inference. We first
expose the support misspecification issue and evaluate our heuristics using
stochastic dynamical benchmarks. We then evaluate the impact of heuristic
support adaptation on parameter inference and policy learning for a dynamic
deformable linear object (DLO) manipulation task. Inference results in a finer
length and stiffness classification for a parametric set of DLOs. When the
resulting posteriors are used as domain distributions for sim-based policy
learning, they lead to more robust object-centric agent performance.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaoyi He, Danggui Chen, Zhenshuo Zhang, Zimeng Bai
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a hierarchical path-planning and control framework that
combines a high-level Deep Q-Network (DQN) for discrete sub-goal selection with
a low-level Twin Delayed Deep Deterministic Policy Gradient (TD3) controller
for continuous actuation. The high-level module selects behaviors and
sub-goals; the low-level module executes smooth velocity commands. We design a
practical reward shaping scheme (direction, distance, obstacle avoidance,
action smoothness, collision penalty, time penalty, and progress), together
with a LiDAR-based safety gate that prevents unsafe motions. The system is
implemented in ROS + Gazebo (TurtleBot3) and evaluated with PathBench metrics,
including success rate, collision rate, path efficiency, and re-planning
efficiency, in dynamic and partially observable environments. Experiments show
improved success rate and sample efficiency over single-algorithm baselines
(DQN or TD3 alone) and rule-based planners, with better generalization to
unseen obstacle configurations and reduced abrupt control changes. Code and
evaluation scripts are available at the project repository.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Katarina Petroviฤ, Lazar Atanackovic, Viggo Moro, Kacper Kapuลniak, ฤฐsmail ฤฐlkan Ceylan et al.
ๆฅ็ๆ่ฆ
Modeling the transport dynamics of natural processes from population-level
observations is a ubiquitous problem in the natural sciences. Such models rely
on key assumptions about the underlying process in order to enable faithful
learning of governing dynamics that mimic the actual system behavior. The de
facto assumption in current approaches relies on the principle of least action
that results in gradient field dynamics and leads to trajectories minimizing an
energy functional between two probability measures. However, many real-world
systems, such as cell cycles in single-cell RNA, are known to exhibit
non-gradient, periodic behavior, which fundamentally cannot be captured by
current state-of-the-art methods such as flow and bridge matching. In this
paper, we introduce Curly Flow Matching (Curly-FM), a novel approach that is
capable of learning non-gradient field dynamics by designing and solving a
Schr\"odinger bridge problem with a non-zero drift reference process -- in
stark contrast to typical zero-drift reference processes -- which is
constructed using inferred velocities in addition to population snapshot data.
We showcase Curly-FM by solving the trajectory inference problems for single
cells, computational fluid dynamics, and ocean currents with approximate
velocities. We demonstrate that Curly-FM can learn trajectories that better
match both the reference process and population marginals. Curly-FM expands
flow matching models beyond the modeling of populations and towards the
modeling of known periodic behavior in physical systems. Our code repository is
accessible at: https://github.com/kpetrovicc/curly-flow-matching.git
๐
2025-10-30
๐ VLDB
๐ฅ Emmanouil Sylligardos, John Paparrizos, Themis Palpanas, Pierre Senellart, Paul Boniol
ๆฅ็ๆ่ฆ
Anomaly detection is a fundamental task for time series analytics with
important implications for the downstream performance of many applications.
Despite increasing academic interest and the large number of methods proposed
in the literature, recent benchmarks and evaluation studies demonstrated that
no overall best anomaly detection methods exist when applied to very
heterogeneous time series datasets. Therefore, the only scalable and viable
solution to solve anomaly detection over very different time series collected
from diverse domains is to propose a model selection method that will select,
based on time series characteristics, the best anomaly detection methods to
run. Existing AutoML solutions are, unfortunately, not directly applicable to
time series anomaly detection, and no evaluation of time series-based
approaches for model selection exists. Towards that direction, this paper
studies the performance of time series classification methods used as model
selection for anomaly detection. In total, we evaluate 234 model configurations
derived from 16 base classifiers across more than 1980 time series, and we
propose the first extensive experimental evaluation of time series
classification as model selection for anomaly detection. Our results
demonstrate that model selection methods outperform every single anomaly
detection method while being in the same order of magnitude regarding execution
time. This evaluation is the first step to demonstrate the accuracy and
efficiency of time series classification algorithms for anomaly detection, and
represents a strong baseline that can then be used to guide the model selection
step in general AutoML pipelines. Preprint version of an article accepted at
the VLDB Journal.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Colin Doumont, Victor Picheny, Viacheslav Borovitskiy, Henry Moss
Machine Learning
ๆฅ็ๆ่ฆ
Bayesian Optimization (BO) has the potential to solve various combinatorial
tasks, ranging from materials science to neural architecture search. However,
BO requires specialized kernels to effectively model combinatorial domains.
Recent efforts have introduced several combinatorial kernels, but the
relationships among them are not well understood. To bridge this gap, we
develop a unifying framework based on heat kernels, which we derive in a
systematic way and express as simple closed-form expressions. Using this
framework, we prove that many successful combinatorial kernels are either
related or equivalent to heat kernels, and validate this theoretical claim in
our experiments. Moreover, our analysis confirms and extends the results
presented in Bounce: certain algorithms' performance decreases substantially
when the unknown optima of the function do not have a certain structure. In
contrast, heat kernels are not sensitive to the location of the optima. Lastly,
we show that a fast and simple pipeline, relying on heat kernels, is able to
achieve state-of-the-art results, matching or even outperforming certain slow
or complex algorithms.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Lin Xu, Xinyun Yuan, Yuxuan Liang, Suwan Yin, Yuankai Wu
ๆฅ็ๆ่ฆ
We introduce Aeolus, a large-scale Multi-modal Flight Delay Dataset designed
to advance research on flight delay prediction and support the development of
foundation models for tabular data. Existing datasets in this domain are
typically limited to flat tabular structures and fail to capture the
spatiotemporal dynamics inherent in delay propagation. Aeolus addresses this
limitation by providing three aligned modalities: (i) a tabular dataset with
rich operational, meteorological, and airportlevel features for over 50 million
flights; (ii) a flight chain module that models delay propagation along
sequential flight legs, capturing upstream and downstream dependencies; and
(iii) a flight network graph that encodes shared aircraft, crew, and airport
resource connections, enabling cross-flight relational reasoning. The dataset
is carefully constructed with temporal splits, comprehensive features, and
strict leakage prevention to support realistic and reproducible machine
learning evaluation. Aeolus supports a broad range of tasks, including
regression, classification, temporal structure modeling, and graph learning,
serving as a unified benchmark across tabular, sequential, and graph
modalities. We release baseline experiments and preprocessing tools to
facilitate adoption. Aeolus fills a key gap for both domain-specific modeling
and general-purpose structured data research.Our source code and data can be
accessed at https://github.com/Flnny/Delay-data
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Shayan Nejadshamsi, Yuanyuan Zhang, Shadi Zaki, Brock Porth, Lysa Porth et al.
Machine Learning
ๆฅ็ๆ่ฆ
Accurate and timely crop yield prediction is crucial for global food security
and modern agricultural management. Traditional methods often lack the
scalability and granularity required for precision farming. This paper
introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder
for Satellite Sensing), a deep learning model designed for high-resolution,
intra-field canola yield prediction. CYPRESS leverages a pre-trained,
large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for
a continuous regression task, transforming multi-temporal satellite imagery
into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from
the Canadian Prairies, CYPRESS demonstrates superior performance over existing
deep learning-based yield prediction models, highlighting the effectiveness of
fine-tuning foundation models for specialized agricultural applications. By
providing a continuous, high-resolution output, CYPRESS offers a more
actionable tool for precision agriculture than conventional classification or
county-level aggregation methods. This work validates a novel approach that
bridges the gap between large-scale Earth observation and on-farm
decision-making, offering a scalable solution for detailed agricultural
monitoring.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Maksim Maslov, Alexander Kugaevskikh, Matthew Ivanov
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper considers the problem of regression over distributions, which is
becoming increasingly important in machine learning. Existing approaches often
ignore the geometry of the probability space or are computationally expensive.
To overcome these limitations, a new method is proposed that combines the
parameterization of probability trajectories using a Bernstein basis and the
minimization of the Wasserstein distance between distributions. The key idea is
to model a conditional distribution as a smooth probability trajectory defined
by a weighted sum of Gaussian components whose parameters -- the mean and
covariance -- are functions of the input variable constructed using Bernstein
polynomials. The loss function is the averaged squared Wasserstein distance
between the predicted Gaussian distributions and the empirical data, which
takes into account the geometry of the distributions. An autodiff-based
optimization method is used to train the model. Experiments on synthetic
datasets that include complex trajectories demonstrated that the proposed
method provides competitive approximation quality in terms of the Wasserstein
distance, Energy Distance, and RMSE metrics, especially in cases of pronounced
nonlinearity. The model demonstrates trajectory smoothness that is better than
or comparable to alternatives and robustness to changes in data structure,
while maintaining high interpretability due to explicit parameterization via
control points. The developed approach represents a balanced solution that
combines geometric accuracy, computational practicality, and interpretability.
Prospects for further research include extending the method to non-Gaussian
distributions, applying entropy regularization to speed up computations, and
adapting the approach to working with high-dimensional data for approximating
surfaces and more complex structures.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Arne Thomsen, Tilman Trรถster, Franรงois Lanusse
Machine Learning
ๆฅ็ๆ่ฆ
Cosmological field-level inference requires differentiable forward models
that solve the challenging dynamics of gas and dark matter under hydrodynamics
and gravity. We propose a hybrid approach where gravitational forces are
computed using a differentiable particle-mesh solver, while the hydrodynamics
are parametrized by a neural network that maps local quantities to an effective
pressure field. We demonstrate that our method improves upon alternative
approaches, such as an Enthalpy Gradient Descent baseline, both at the field
and summary-statistic level. The approach is furthermore highly data efficient,
with a single reference simulation of cosmological structure formation being
sufficient to constrain the neural pressure model. This opens the door for
future applications where the model is fit directly to observational data,
rather than a training set of simulations.
๐
2025-10-30
๐ Five pages, four figures, to be presented at the A
๐ฅ Sebastian Basterrech, Shuo Shan, Debabrata Adhikari, Sankhya Mohanty
ๆฅ็ๆ่ฆ
In this study, we leverage a mixture model learning approach to identify
defects in laser-based Additive Manufacturing (AM) processes. By incorporating
physics based principles, we also ensure that the model is sensitive to
meaningful physical parameter variations. The empirical evaluation was
conducted by analyzing real-world data from two AM processes: Directed Energy
Deposition and Laser Powder Bed Fusion. In addition, we also studied the
performance of the developed framework over public datasets with different
alloy type and experimental parameter information. The results show the
potential of physics-guided mixture models to examine the underlying physical
behavior of an AM system.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yinrong Hong, Zhiquan Tan, Kai Hu
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) face significant inference latency challenges
stemming from their autoregressive design and large size. To address this,
speculative decoding emerges as a solution, enabling the simultaneous
generation and validation of multiple tokens. While recent approaches like
EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures,
they often neglect the impact of crucial system variables such as GPU devices
and batch sizes.
Therefore, we introduce a new dynamic tree decoding approach called CAST that
takes into account inference costs, including factors such as GPU
configurations and batch sizes, to dynamically refine the tree structure.
Through comprehensive experimentation across six diverse tasks and utilizing
six distinct LLMs, our methodology demonstrates remarkable results, achieving
speeds up to 5.2 times faster than conventional decoding methods. Moreover, it
generally outperforms existing state-of-the-art techniques from 5% to 20%.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, Andrea Pugnana
Machine Learning
ๆฅ็ๆ่ฆ
Developing trustworthy Machine Learning (ML) models requires their predicted
probabilities to be well-calibrated, meaning they should reflect true-class
frequencies. Among calibration notions in multiclass classification, strong
calibration is the most stringent, as it requires all predicted probabilities
to be simultaneously calibrated across all classes. However, existing
approaches to multiclass calibration lack a notion of distance among inputs,
which makes them vulnerable to proximity bias: predictions in sparse regions of
the feature space are systematically miscalibrated. This is especially relevant
in high-stakes settings, such as healthcare, where the sparse instances are
exactly those most at risk of biased treatment. In this work, we address this
main shortcoming by introducing a local perspective on multiclass calibration.
First, we formally define multiclass local calibration and establish its
relationship with strong calibration. Second, we theoretically analyze the
pitfalls of existing evaluation metrics when applied to multiclass local
calibration. Third, we propose a practical method for enhancing local
calibration in Neural Networks, which enforces alignment between predicted
probabilities and local estimates of class frequencies using the Jensen-Shannon
distance. Finally, we empirically validate our approach against existing
multiclass calibration techniques.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikita Tsoy, Nikola Konstantinov
ๆฅ็ๆ่ฆ
Shortcuts, spurious rules that perform well during training but fail to
generalize, present a major challenge to the reliability of deep networks
(Geirhos et al., 2020). However, the impact of shortcuts on feature
representations remains understudied, obstructing the design of principled
shortcut-mitigation methods. To overcome this limitation, we investigate the
layer-wise localization of shortcuts in deep models. Our novel experiment
design quantifies the layer-wise contribution to accuracy degradation caused by
a shortcut-inducing skew by counterfactual training on clean and skewed
datasets. We employ our design to study shortcuts on CIFAR-10, Waterbirds, and
CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find
that shortcut learning is not localized in specific layers but distributed
throughout the network. Different network parts play different roles in this
process: shallow layers predominantly encode spurious features, while deeper
layers predominantly forget core features that are predictive on clean data. We
also analyze the differences in localization and describe its principal axes of
variation. Finally, our analysis of layer-wise shortcut-mitigation strategies
suggests the hardness of designing general methods, supporting dataset- and
architecture-specific approaches instead.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jan Stenkamp, Nina Herrmann, Benjamin Karic, Stefan Oehmcke, Fabian Gieseke
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Deploying machine learning models on compute-constrained devices has become a
key building block of modern IoT applications. In this work, we present a
compression scheme for boosted decision trees, addressing the growing need for
lightweight machine learning models. Specifically, we provide techniques for
training compact boosted decision tree ensembles that exhibit a reduced memory
footprint by rewarding, among other things, the reuse of features and
thresholds during training. Our experimental evaluation shows that models
achieved the same performance with a compression ratio of 4-16x compared to
LightGBM models using an adapted training process and an alternative memory
layout. Once deployed, the corresponding IoT devices can operate independently
of constant communication or external energy supply, and, thus, autonomously,
requiring only minimal computing power and energy. This capability opens the
door to a wide range of IoT applications, including remote monitoring, edge
analytics, and real-time decision making in isolated or power-limited
environments.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Prathamesh Kothavale, Sravani Boddepalli
Robotics
ๆฅ็ๆ่ฆ
Conventional robots possess a limited understanding of their kinematics and
are confined to preprogrammed tasks, hindering their ability to leverage tools
efficiently. Driven by the essential components of tool usage - grasping the
desired outcome, selecting the most suitable tool, determining optimal tool
orientation, and executing precise manipulations - we introduce a pioneering
framework. Our novel approach expands the capabilities of the robot's inverse
kinematics solver, empowering it to acquire a sequential repertoire of actions
using tools of varying lengths. By integrating a simulation-learned action
trajectory with the tool, we showcase the practicality of transferring acquired
skills from simulation to real-world scenarios through comprehensive
experimentation. Remarkably, our extended inverse kinematics solver
demonstrates an impressive error rate of less than 1 cm. Furthermore, our
trained policy achieves a mean error of 8 cm in simulation. Noteworthy, our
model achieves virtually indistinguishable performance when employing two
distinct tools of different lengths. This research provides an indication of
potential advances in the exploration of all four fundamental aspects of tool
usage, enabling robots to master the intricate art of tool manipulation across
diverse tasks.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Miranda Anna Christ, Adriรกn Csiszรกrik, Gergely Becsรณ, Dรกniel Varga
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper investigates the structure of linear operators introduced in
Hernandez et al. [2023] that decode specific relational facts in transformer
language models. We extend their single-relation findings to a collection of
relations and systematically chart their organization. We show that such
collections of relation decoders can be highly compressed by simple order-3
tensor networks without significant loss in decoding accuracy. To explain this
surprising redundancy, we develop a cross-evaluation protocol, in which we
apply each linear decoder operator to the subjects of every other relation. Our
results reveal that these linear maps do not encode distinct relations, but
extract recurring, coarse-grained semantic properties (e.g., country of capital
city and country of food are both in the country-of-X property). This
property-centric structure clarifies both the operators' compressibility and
highlights why they generalize only to new relations that are semantically
close. Our findings thus interpret linear relational decoding in transformer
language models as primarily property-based, rather than relation-specific.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Aidan Furlong, Robert Salko, Xingang Zhao, Xu Wu
Machine Learning
ๆฅ็ๆ่ฆ
The use of ML in engineering has grown steadily to support a wide array of
applications. Among these methods, deep neural networks have been widely
adopted due to their performance and accessibility, but they require large,
high-quality datasets. Experimental data are often sparse, noisy, or
insufficient to build resilient data-driven models. Transfer learning, which
leverages relevant data-abundant source domains to assist learning in
data-scarce target domains, has shown efficacy. Parameter transfer, where
pretrained weights are reused, is common but degrades under large domain
shifts. Domain-adversarial neural networks (DANNs) help address this issue by
learning domain-invariant representations, thereby improving transfer under
greater domain shifts in a semi-supervised setting. However, DANNs can be
unstable during training and lack a native means for uncertainty
quantification. This study introduces a fully-supervised three-stage framework,
the staged Bayesian domain-adversarial neural network (staged B-DANN), that
combines parameter transfer and shared latent space adaptation. In Stage 1, a
deterministic feature extractor is trained on the source domain. This feature
extractor is then adversarially refined using a DANN in Stage 2. In Stage 3, a
Bayesian neural network is built on the adapted feature extractor for
fine-tuning on the target domain to handle conditional shifts and yield
calibrated uncertainty estimates. This staged B-DANN approach was first
validated on a synthetic benchmark, where it was shown to significantly
outperform standard transfer techniques. It was then applied to the task of
predicting critical heat flux in rectangular channels, leveraging data from
tube experiments as the source domain. The results of this study show that the
staged B-DANN method can improve predictive accuracy and generalization,
potentially assisting other domains in nuclear engineering.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Adrien Weihs, Andrea Bertozzi, Matthew Thorpe
ๆฅ็ๆ่ฆ
Higher-Order Hypergraph Learning (HOHL) was recently introduced as a
principled alternative to classical hypergraph regularization, enforcing
higher-order smoothness via powers of multiscale Laplacians induced by the
hypergraph structure. Prior work established the well- and ill-posedness of
HOHL through an asymptotic consistency analysis in geometric settings. We
extend this theoretical foundation by proving the consistency of a truncated
version of HOHL and deriving explicit convergence rates when HOHL is used as a
regularizer in fully supervised learning. We further demonstrate its strong
empirical performance in active learning and in datasets lacking an underlying
geometric structure, highlighting HOHL's versatility and robustness across
diverse learning settings.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ruilin Wang, Huixia Li, Yuexiao Ma, Xiawu Zheng, Fei Chao et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Inference latency stands as a critical bottleneck in the large-scale
deployment of Large Language Models (LLMs). Speculative decoding methods have
recently shown promise in accelerating inference without compromising the
output distribution. However, existing work typically relies on a dualistic
draft-verify framework and lacks rigorous theoretical grounding. In this paper,
we introduce a novel \emph{polybasic} speculative decoding framework,
underpinned by a comprehensive theoretical analysis. Specifically, we prove a
fundamental theorem that characterizes the optimal inference time for
multi-model speculative decoding systems, shedding light on how to extend
beyond the dualistic approach to a more general polybasic paradigm. Through our
theoretical investigation of multi-model token generation, we expose and
optimize the interplay between model capabilities, acceptance lengths, and
overall computational cost. Our framework supports both standalone
implementation and integration with existing speculative techniques, leading to
accelerated performance in practice. Experimental results across multiple model
families demonstrate that our approach yields speedup ratios ranging from
$3.31\times$ to $4.01\times$ for LLaMA2-Chat 7B, up to $3.87 \times$ for
LLaMA3-8B, up to $4.43 \times$ for Vicuna-7B and up to $3.85 \times$ for
Qwen2-7B -- all while preserving the original output distribution. We release
our theoretical proofs and implementation code to facilitate further
investigation into polybasic speculative decoding.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hsiu-Yuan Huang, Chenming Tang, Weijie Liu, Saiyong Yang, Yunfang Wu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such
as Group Relative Policy Optimization (GRPO), have achieved remarkable progress
in improving the reasoning capabilities of Large Reasoning Models (LRMs).
However, they exhibit limited exploration due to reliance on on-policy rollouts
where confined to the current policy's distribution, resulting in narrow
trajectory diversity. Recent approaches attempt to expand policy coverage by
incorporating trajectories generated from stronger expert models, yet this
reliance increases computational cost and such advaned models are often
inaccessible. To address these issues, we propose In-Context Steered Policy
Optimization (ICPO), a unified framework that leverages the inherent in-context
learning capability of LRMs to provide expert guidance using existing datasets.
ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing, which expands
exploration beyond the current policy distribution without requiring advanced
LRM trajectories. To further stabilize optimization, ICPO integrates Expert
Region Reject Sampling to filter unreliable off-policy trajectories and
Annealed Expert-Bonus Reward Shaping to balance early expert guidance with
later autonomous improvement. Results demonstrate that ICPO consistently
enhances reinforcement learning performance and training stability on
mathematical reasoning benchmarks, revealing a scalable and effective RLVR
paradigm for LRMs.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dipak Meher, Carlotta Domeniconi
ๆฅ็ๆ่ฆ
Human smuggling networks are increasingly adaptive and difficult to analyze.
Legal case documents offer critical insights but are often unstructured,
lexically dense, and filled with ambiguous or shifting references, which pose
significant challenges for automated knowledge graph (KG) construction. While
recent LLM-based approaches improve over static templates, they still generate
noisy, fragmented graphs with duplicate nodes due to the absence of guided
extraction and coreference resolution. The recently proposed CORE-KG framework
addresses these limitations by integrating a type-aware coreference module and
domain-guided structured prompts, significantly reducing node duplication and
legal noise. In this work, we present a systematic ablation study of CORE-KG to
quantify the individual contributions of its two key components. Our results
show that removing coreference resolution results in a 28.32% increase in node
duplication and a 4.32% increase in noisy nodes, while removing structured
prompts leads to a 4.34% increase in node duplication and a 73.33% increase in
noisy nodes. These findings offer empirical insights for designing robust
LLM-based pipelines for extracting structured representations from complex
legal texts.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Youssef Attia El Hili, Albert Thomas, Malik Tiomoko, Abdelhakim Benechehab, Corentin Lรฉger et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Model and hyperparameter selection are critical but challenging in machine
learning, typically requiring expert intuition or expensive automated search.
We investigate whether large language models (LLMs) can act as in-context
meta-learners for this task. By converting each dataset into interpretable
metadata, we prompt an LLM to recommend both model families and
hyperparameters. We study two prompting strategies: (1) a zero-shot mode
relying solely on pretrained knowledge, and (2) a meta-informed mode augmented
with examples of models and their performance on past tasks. Across synthetic
and real-world benchmarks, we show that LLMs can exploit dataset metadata to
recommend competitive models and hyperparameters without search, and that
improvements from meta-informed prompting demonstrate their capacity for
in-context meta-learning. These results highlight a promising new role for LLMs
as lightweight, general-purpose assistants for model selection and
hyperparameter optimization.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mustafa Fuad Rifet Ibrahim, Maurice Meijer, Alexander Schlaefer, Peer Stelldinger
Machine Learning
ๆฅ็ๆ่ฆ
Continuous electrocardiogram (ECG) monitoring via wearables offers
significant potential for early cardiovascular disease (CVD) detection.
However, deploying deep learning models for automated analysis in
resource-constrained environments faces reliability challenges due to
inevitable Out-of-Distribution (OOD) data. OOD inputs, such as unseen
pathologies or noisecorrupted signals, often cause erroneous, high-confidence
predictions by standard classifiers, compromising patient safety. Existing OOD
detection methods either neglect computational constraints or address noise and
unseen classes separately. This paper explores Unsupervised Anomaly Detection
(UAD) as an independent, upstream filtering mechanism to improve robustness. We
benchmark six UAD approaches, including Deep SVDD, reconstruction-based models,
Masked Anomaly Detection, normalizing flows, and diffusion models, optimized
via Neural Architecture Search (NAS) under strict resource constraints (at most
512k parameters). Evaluation on PTB-XL and BUT QDB datasets assessed detection
of OOD CVD classes and signals unsuitable for analysis due to noise. Results
show Deep SVDD consistently achieves the best trade-off between detection and
efficiency. In a realistic deployment simulation, integrating the optimized
Deep SVDD filter with a diagnostic classifier improved accuracy by up to 21
percentage points over a classifier-only baseline. This study demonstrates that
optimized UAD filters can safeguard automated ECG analysis, enabling safer,
more reliable continuous cardiovascular monitoring on wearables.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Erle Zhu, Dazhi Jiang, Yuan Wang, Xujun Li, Jiale Cheng et al.
Machine Learning
ๆฅ็ๆ่ฆ
Data selection is a critical aspect of Reinforcement Learning with Verifiable
Rewards (RLVR) for enhancing the reasoning capabilities of large language
models (LLMs). Current data selection methods are largely heuristic-based,
lacking theoretical guarantees and generalizability. This work proposes a
theoretically-grounded approach using influence functions to estimate the
contribution of each data point to the learning objective. To overcome the
prohibitive computational cost of policy rollouts required for online influence
estimation, we introduce an off-policy influence estimation method that
efficiently approximates data influence using pre-collected offline
trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we
employ sparse random projection to reduce dimensionality and improve storage
and computation efficiency. Leveraging these techniques, we develop
\textbf{C}urriculum \textbf{R}L with \textbf{O}ff-\textbf{P}olicy
\text{I}nfluence guidance (\textbf{CROPI}), a multi-stage RL framework that
iteratively selects the most influential data for the current policy.
Experiments on models up to 7B parameters demonstrate that CROPI significantly
accelerates training. On a 1.5B model, it achieves a 2.66x step-level
acceleration while using only 10\% of the data per stage compared to
full-dataset training. Our results highlight the substantial potential of
influence-based data selection for efficient RLVR.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wajdi Hammami, Soumaya Cherkaoui, Jean-Frederic Laprade, Ola Ahmad, Shengrui Wang
Machine Learning
ๆฅ็ๆ่ฆ
Anomaly detection in time-series data is a critical challenge with
significant implications for network security. Recent quantum machine learning
approaches, such as quantum kernel methods and variational quantum circuits,
have shown promise in capturing complex data distributions for anomaly
detection but remain constrained by limited qubit counts. We introduce in this
work a novel Quantum Gated Recurrent Unit (QGRU)-based Generative Adversarial
Network (GAN) employing Successive Data Injection (SuDaI) and a multi-metric
gating strategy for robust network anomaly detection. Our model uniquely
utilizes a quantum-enhanced generator that outputs parameters (mean and
log-variance) of a Gaussian distribution via reparameterization, combined with
a Wasserstein critic to stabilize adversarial training. Anomalies are
identified through a novel gating mechanism that initially flags potential
anomalies based on Gaussian uncertainty estimates and subsequently verifies
them using a composite of critic scores and reconstruction errors. Evaluated on
benchmark datasets, our method achieves a high time-series aware F1 score
(TaF1) of 89.43% demonstrating superior capability in detecting anomalies
accurately and promptly as compared to existing classical and quantum models.
Furthermore, the trained QGRU-WGAN was deployed on real IBM Quantum hardware,
where it retained high anomaly detection performance, confirming its robustness
and practical feasibility on current noisy intermediate-scale quantum (NISQ)
devices.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dipak Meher, Carlotta Domeniconi, Guadalupe Correa-Cabrera
ๆฅ็ๆ่ฆ
Human smuggling networks are complex and constantly evolving, making them
difficult to analyze comprehensively. Legal case documents offer rich factual
and procedural insights into these networks but are often long, unstructured,
and filled with ambiguous or shifting references, posing significant challenges
for automated knowledge graph (KG) construction. Existing methods either
overlook coreference resolution or fail to scale beyond short text spans,
leading to fragmented graphs and inconsistent entity linking. We propose
LINK-KG, a modular framework that integrates a three-stage, LLM-guided
coreference resolution pipeline with downstream KG extraction. At the core of
our approach is a type-specific Prompt Cache, which consistently tracks and
resolves references across document chunks, enabling clean and disambiguated
narratives for structured knowledge graph construction from both short and long
legal texts. LINK-KG reduces average node duplication by 45.21% and noisy nodes
by 32.22% compared to baseline methods, resulting in cleaner and more coherent
graph structures. These improvements establish LINK-KG as a strong foundation
for analyzing complex criminal networks.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Adapting large language models (LLMs) via reinforcement learning (RL) is
often bottlenecked by the generation stage, which can consume over 75\% of the
training time. Speculative decoding (SD) accelerates autoregressive generation
in serving systems, but its behavior under RL training remains largely
unexplored. We identify three critical gaps that hinder the naive integration
of SD into RL systems: diminishing speedups at large batch sizes, drafter
staleness under continual actor updates, and drafter-induced policy
degradation.
To address these gaps, we present ReSpec, a system that adapts SD to RL
through three complementary mechanisms: dynamically tuning SD configurations,
evolving the drafter via knowledge distillation, and weighting updates by
rollout rewards. On Qwen models (3B--14B), ReSpec achieves up to 4.5x speedup
while preserving reward convergence and training stability, providing a
practical solution for efficient RL-based LLM adaptation.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Self-improvement has emerged as a mainstream paradigm for advancing the
reasoning capabilities of large vision-language models (LVLMs), where models
explore and learn from successful trajectories iteratively. However, we
identify a critical issue during this process: the model excels at generating
high-quality trajectories for simple queries (i.e., head data) but struggles
with more complex ones (i.e., tail data). This leads to an imbalanced
optimization that drives the model to prioritize simple reasoning skills, while
hindering its ability to tackle more complex reasoning tasks. Over iterations,
this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew
effect"--which ultimately hinders further model improvement and leads to
performance bottlenecks. To counteract this challenge, we introduce four
efficient strategies from two perspectives: distribution-reshaping and
trajectory-resampling, to achieve head-tail re-balancing during the
exploration-and-learning self-improvement process. Extensive experiments on
Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks
demonstrate that our methods consistently improve visual reasoning
capabilities, outperforming vanilla self-improvement by 3.86 points on average.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang
Multimodal
ๆฅ็ๆ่ฆ
Object-context shortcuts remain a persistent challenge in vision-language
models, undermining zero-shot reliability when test-time scenes differ from
familiar training co-occurrences. We recast this issue as a causal inference
problem and ask: Would the prediction remain if the object appeared in a
different environment? To answer this at inference time, we estimate object and
background expectations within CLIP's representation space, and synthesize
counterfactual embeddings by recombining object features with diverse
alternative contexts sampled from external datasets, batch neighbors, or
text-derived descriptions. By estimating the Total Direct Effect and simulating
intervention, we further subtract background-only activation, preserving
beneficial object-context interactions while mitigating hallucinated scores.
Without retraining or prompt design, our method substantially improves both
worst-group and average accuracy on context-sensitive benchmarks, establishing
a new zero-shot state of the art. Beyond performance, our framework provides a
lightweight representation-level counterfactual approach, offering a practical
causal avenue for debiased and reliable multimodal reasoning.
๐
2025-10-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Danial Ebrat, Sepideh Ahmadian, Luis Rueda
Natural Language Processing
ๆฅ็ๆ่ฆ
Recommender systems often struggle with data sparsity and cold-start
scenarios, limiting their ability to provide accurate suggestions for new or
infrequent users. This paper presents a Graph Attention Network (GAT) based
Collaborative Filtering (CF) framework enhanced with Large Language Model (LLM)
driven context aware embeddings. Specifically, we generate concise textual user
profiles and unify item metadata (titles, genres, overviews) into rich textual
embeddings, injecting these as initial node features in a bipartite user item
graph. To further optimize ranking performance, we introduce a hybrid loss
function that combines Bayesian Personalized Ranking (BPR) with a cosine
similarity term and robust negative sampling, ensuring explicit negative
feedback is distinguished from unobserved data. Experiments on the MovieLens
100k and 1M datasets show consistent improvements over state-of-the-art
baselines in Precision, NDCG, and MAP while demonstrating robustness for users
with limited interaction history. Ablation studies confirm the critical role of
LLM-augmented embeddings and the cosine similarity term in capturing nuanced
semantic relationships. Our approach effectively mitigates sparsity and
cold-start limitations by integrating LLM-derived contextual understanding into
graph-based architectures. Future directions include balancing recommendation
accuracy with coverage and diversity, and introducing fairness-aware
constraints and interpretability features to enhance system performance
further.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jiayi Luo, Qingyun Sun, Beining Yang, Haonan Yuan, Xingcheng Fu et al.
ๆฅ็ๆ่ฆ
Graph condensation (GC) has gained significant attention for its ability to
synthesize smaller yet informative graphs. However, existing studies often
overlook the robustness of GC in scenarios where the original graph is
corrupted. In such cases, we observe that the performance of GC deteriorates
significantly, while existing robust graph learning technologies offer only
limited effectiveness. Through both empirical investigation and theoretical
analysis, we reveal that GC is inherently an intrinsic-dimension-reducing
process, synthesizing a condensed graph with lower classification complexity.
Although this property is critical for effective GC performance, it remains
highly vulnerable to adversarial perturbations. To tackle this vulnerability
and improve GC robustness, we adopt the geometry perspective of graph data
manifold and propose a novel Manifold-constrained Robust Graph Condensation
framework named MRGC. Specifically, we introduce three graph data manifold
learning modules that guide the condensed graph to lie within a smooth,
low-dimensional manifold with minimal class ambiguity, thereby preserving the
classification complexity reduction capability of GC and ensuring robust
performance under universal adversarial attacks. Extensive experiments
demonstrate the robustness of \ModelName\ across diverse attack scenarios.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wenjie Chen, Li Zhuang, Ziying Luo, Yu Liu, Jiahao Wu et al.
ๆฅ็ๆ่ฆ
Personalized treatment outcome prediction based on trial data for
small-sample and rare patient groups is critical in precision medicine.
However, the costly trial data limit the prediction performance. To address
this issue, we propose a cross-fidelity knowledge distillation and adaptive
fusion network (CFKD-AFN), which leverages abundant but low-fidelity simulation
data to enhance predictions on scarce but high-fidelity trial data. CFKD-AFN
incorporates a dual-channel knowledge distillation module to extract
complementary knowledge from the low-fidelity model, along with an
attention-guided fusion module to dynamically integrate multi-source
information. Experiments on treatment outcome prediction for the chronic
obstructive pulmonary disease demonstrates significant improvements of CFKD-AFN
over state-of-the-art methods in prediction accuracy, ranging from 6.67\% to
74.55\%, and strong robustness to varying high-fidelity dataset sizes.
Furthermore, we extend CFKD-AFN to an interpretable variant, enabling the
exploration of latent medical semantics to support clinical decision-making.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang et al.
Robotics
ๆฅ็ๆ่ฆ
Adapting pre-trained video generation models into controllable world models
via latent actions is a promising step towards creating generalist world
models. The dominant paradigm adopts a two-stage approach that trains latent
action model (LAM) and the world model separately, resulting in redundant
training and limiting their potential for co-adaptation. A conceptually simple
and appealing idea is to directly replace the forward dynamic model in LAM with
a powerful world model and training them jointly, but it is non-trivial and
prone to representational collapse. In this work, we propose CoLA-World, which
for the first time successfully realizes this synergistic paradigm, resolving
the core challenge in joint learning through a critical warm-up phase that
effectively aligns the representations of the from-scratch LAM with the
pre-trained world model. This unlocks a co-evolution cycle: the world model
acts as a knowledgeable tutor, providing gradients to shape a high-quality LAM,
while the LAM offers a more precise and adaptable control interface to the
world model. Empirically, CoLA-World matches or outperforms prior two-stage
methods in both video simulation quality and downstream visual planning,
establishing a robust and efficient new paradigm for the field.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Vikrant Sahu, Gagan Raj Gupta, Raghav Borikar, Nitin Mane
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid growth of programming education has outpaced traditional assessment
tools, leaving faculty with limited means to provide meaningful, scalable
feedback. Conventional autograders, while efficient, act as black-box systems
that simply return pass/fail results, offering little insight into student
thinking or learning needs.
Autograder+ is designed to shift autograding from a purely summative process
to a formative learning experience. It introduces two key capabilities:
automated feedback generation using a fine-tuned Large Language Model, and
visualization of student code submissions to uncover learning patterns. The
model is fine-tuned on curated student code and expert feedback to ensure
pedagogically aligned, context-aware guidance.
In evaluation across 600 student submissions from multiple programming tasks,
the system produced feedback with strong semantic alignment to instructor
comments. For visualization, contrastively learned code embeddings trained on
1,000 annotated submissions enable grouping solutions into meaningful clusters
based on functionality and approach. The system also supports prompt-pooling,
allowing instructors to guide feedback style through selected prompt templates.
By integrating AI-driven feedback, semantic clustering, and interactive
visualization, Autograder+ reduces instructor workload while supporting
targeted instruction and promoting stronger learning outcomes.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning (Statistics))
๐ฅ Joshua Rooijakkers, Leiv Rรธnneberg, Franรงois-Xavier Briol, Jeremias Knoblauch, Matias Altamirano
ๆฅ็ๆ่ฆ
Multi-output Gaussian process (MOGP) regression allows modelling dependencies
among multiple correlated response variables. Similarly to standard Gaussian
processes, MOGPs are sensitive to model misspecification and outliers, which
can distort predictions within individual outputs. This situation can be
further exacerbated by multiple anomalous response variables whose errors
propagate due to correlations between outputs. To handle this situation, we
extend and generalise the robust and conjugate Gaussian process (RCGP)
framework introduced by Altamirano et al. (2024). This results in the
multi-output RCGP (MO-RCGP): a provably robust MOGP that is conjugate, and
jointly captures correlations across outputs. We thoroughly evaluate our
approach through applications in finance and cancer research.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Fatemeh Bazikar, Hossein Moosaei, Atefeh Hemmati, Panos M. Pardalos
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Multi-task learning (MTL) enables simultaneous training across related tasks,
leveraging shared information to improve generalization, efficiency, and
robustness, especially in data-scarce or high-dimensional scenarios. While deep
learning dominates recent MTL research, Support Vector Machines (SVMs) and Twin
SVMs (TWSVMs) remain relevant due to their interpretability, theoretical rigor,
and effectiveness with small datasets.
This chapter surveys MTL approaches based on SVM and TWSVM, highlighting
shared representations, task regularization, and structural coupling
strategies. Special attention is given to emerging TWSVM extensions for
multi-task settings, which show promise but remain underexplored. We compare
these models in terms of theoretical properties, optimization strategies, and
empirical performance, and discuss applications in fields such as computer
vision, natural language processing, and bioinformatics.
Finally, we identify research gaps and outline future directions for building
scalable, interpretable, and reliable margin-based MTL frameworks. This work
provides a comprehensive resource for researchers and practitioners interested
in SVM- and TWSVM-based multi-task learning.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wenchang Duan, Yaoliang Yu, Jiwan He, Yi Shi
Machine Learning
ๆฅ็ๆ่ฆ
Recently, deep multi-agent reinforcement learning (MARL) has demonstrated
promising performance for solving challenging tasks, such as long-term
dependencies and non-Markovian environments. Its success is partly attributed
to conditioning policies on large fixed context length. However, such large
fixed context lengths may lead to limited exploration efficiency and redundant
information. In this paper, we propose a novel MARL framework to obtain
adaptive and effective contextual information. Specifically, we design a
central agent that dynamically optimizes context length via temporal gradient
analysis, enhancing exploration to facilitate convergence to global optima in
MARL. Furthermore, to enhance the adaptive optimization capability of the
context length, we present an efficient input representation for the central
agent, which effectively filters redundant information. By leveraging a
Fourier-based low-frequency truncation method, we extract global temporal
trends across decentralized agents, providing an effective and efficient
representation of the MARL environment. Extensive experiments demonstrate that
the proposed method achieves state-of-the-art (SOTA) performance on long-term
dependency tasks, including PettingZoo, MiniGrid, Google Research Football
(GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2).
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Andrew M. Bean, Nabeel Seedat, Shengzhuang Chen, Jonathan Richard Schwarz
Natural Language Processing
ๆฅ็ๆ่ฆ
The prohibitive cost of evaluating large language models (LLMs) on
comprehensive benchmarks necessitates the creation of small yet representative
data subsets (i.e., tiny benchmarks) that enable efficient assessment while
retaining predictive fidelity. Current methods for this task operate under a
model-centric paradigm, selecting benchmarking items based on the collective
performance of existing models. Such approaches are limited by large upfront
costs, an inability to immediately handle new benchmarks (`cold-start'), and
the fragile assumption that future models will share the failure patterns of
their predecessors. In this work, we challenge this paradigm and propose a
item-centric approach to benchmark subset selection, arguing that selection
should be based on the intrinsic properties of the task items themselves,
rather than on model-specific failure patterns. We instantiate this
item-centric efficient benchmarking approach via a novel method, Scales++,
where data selection is based on the cognitive demands of the benchmark
samples. Empirically, we show Scales++ reduces the upfront selection cost by
over 18x while achieving competitive predictive fidelity. On the Open LLM
Leaderboard, using just a 0.5\% data subset, we predict full benchmark scores
with a 2.9% mean absolute error. We demonstrate that this item-centric approach
enables more efficient model evaluation without significant fidelity
degradation, while also providing better cold-start performance and more
interpretable benchmarking.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ningning Tao, Fei Xie, Baoxiang Pan, Hongyu Wang, Han Huang et al.
ๆฅ็ๆ่ฆ
Sudden Stratospheric Warmings (SSWs) are key sources of subseasonal
predictability and major drivers of extreme winter weather. Yet, their accurate
and efficient forecast remains a persistent challenge for numerical weather
prediction (NWP) systems due to limitations in physical representation,
initialization, and the immense computational demands of ensemble forecasts.
While data-driven forecasting is rapidly evolving, its application to the
complex, three-dimensional dynamics of SSWs, particularly for probabilistic
forecast, remains underexplored. Here, we bridge this gap by developing a Flow
Matching-based generative AI model (FM-Cast) for efficient and skillful
probabilistic forecasting of the spatiotemporal evolution of stratospheric
circulation. Evaluated across 18 major SSW events (1998-2024), FM-Cast
skillfully forecasts the onset, intensity, and morphology of 10 events up to 20
days in advance, achieving ensemble accuracies above 50%. Its performance is
comparable to or exceeds leading NWP systems while requiring only two minutes
for a 50-member, 30-day forecast on a consumer GPU. Furthermore, leveraging
FM-Cast as a scientific tool, we demonstrate through idealized experiments that
SSW predictability is fundamentally linked to its underlying physical drivers,
distinguishing between events forced from the troposphere and those driven by
internal stratospheric dynamics. Our work thus establishes a computationally
efficient paradigm for probabilistic forecasting stratospheric anomalies and
showcases generative AI's potential to deepen the physical understanding of
atmosphere-climate dynamics.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kazuma Kano, Yuki Mori, Shin Katayama, Kenta Urano, Takuro Yonezawa et al.
ๆฅ็ๆ่ฆ
Worker location data is key to higher productivity in industrial sites.
Cameras are a promising tool for localization in logistics warehouses since
they also offer valuable environmental contexts such as package status.
However, identifying individuals with only visual data is often impractical.
Accordingly, several prior studies identified people in videos by comparing
their trajectories and wearable sensor measurements. While this approach has
advantages such as independence from appearance, the existing methods may break
down under real-world conditions. To overcome this challenge, we propose CorVS,
a novel data-driven person identification method based on correspondence
between visual tracking trajectories and sensor measurements. Firstly, our deep
learning model predicts correspondence probabilities and reliabilities for
every pair of a trajectory and sensor measurements. Secondly, our algorithm
matches the trajectories and sensor measurements over time using the predicted
probabilities and reliabilities. We developed a dataset with actual warehouse
operations and demonstrated the method's effectiveness for real-world
applications.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Albi Isufaj, Pablo Mollรก, Helmut Prendinger
Machine Learning
ๆฅ็ๆ่ฆ
Financial forecasting increasingly uses large neural network models, but
their opacity raises challenges for trust and regulatory compliance. We present
several approaches to explainable and reliable AI in finance. \emph{First}, we
describe how Time-LLM, a time series foundation model, uses a prompt to avoid a
wrong directional forecast. \emph{Second}, we show that combining foundation
models for time series forecasting with a reliability estimator can filter our
unreliable predictions. \emph{Third}, we argue for symbolic reasoning encoding
domain rules for transparent justification. These approaches shift emphasize
executing only forecasts that are both reliable and explainable. Experiments on
equity and cryptocurrency data show that the architecture reduces false
positives and supports selective execution. By integrating predictive
performance with reliability estimation and rule-based reasoning, our framework
advances transparent and auditable financial AI systems.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Furkan Pala, Islem Rekik
Machine Learning
ๆฅ็ๆ่ฆ
Federated learning (FL) has emerged as a key paradigm for collaborative model
training across multiple clients without sharing raw data, enabling
privacy-preserving applications in areas such as radiology and pathology.
However, works on collaborative training across clients with fundamentally
different neural architectures and non-identically distributed datasets remain
scarce. Existing FL frameworks face several limitations. Despite claiming to
support architectural heterogeneity, most recent FL methods only tolerate
variants within a single model family (e.g., shallower, deeper, or wider CNNs),
still presuming a shared global architecture and failing to accommodate
federations where clients deploy fundamentally different network types (e.g.,
CNNs, GNNs, MLPs). Moreover, existing approaches often address only statistical
heterogeneity while overlooking the domain-fracture problem, where each
client's data distribution differs markedly from that faced at testing time,
undermining model generalizability. When clients use different architectures,
have non-identically distributed data, and encounter distinct test domains,
current methods perform poorly. To address these challenges, we propose
UnifiedFL, a dynamic federated learning framework that represents heterogeneous
local networks as nodes and edges in a directed model graph optimized by a
shared graph neural network (GNN). UnifiedFL introduces (i) a common GNN to
parameterize all architectures, (ii) distance-driven clustering via Euclidean
distances between clients' parameters, and (iii) a two-tier aggregation policy
balancing convergence and diversity. Experiments on MedMNIST classification and
hippocampus segmentation benchmarks demonstrate UnifiedFL's superior
performance. Code and data: https://github.com/basiralab/UnifiedFL
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sebastian Zieglmeier, Niklas Erdmann, Narada D. Warakagoda
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) algorithms are designed to optimize
problem-solving by learning actions that maximize rewards, a task that becomes
particularly challenging in random and nonstationary environments. Even
advanced RL algorithms are often limited in their ability to solve problems in
these conditions. In applications such as searching for underwater pollution
clouds with autonomous underwater vehicles (AUVs), RL algorithms must navigate
reward-sparse environments, where actions frequently result in a zero reward.
This paper aims to address these challenges by revisiting and modifying
classical RL approaches to efficiently operate in sparse, randomized, and
nonstationary environments. We systematically study a large number of
modifications, including hierarchical algorithm changes, multigoal learning,
and the integration of a location memory as an external output filter to
prevent state revisits. Our results demonstrate that a modified Monte
Carlo-based approach significantly outperforms traditional Q-learning and two
exhaustive search patterns, illustrating its potential in adapting RL to
complex environments. These findings suggest that reinforcement learning
approaches can be effectively adapted for use in random, nonstationary, and
reward-sparse environments.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mykhailo Poliakov, Nadiya Shvai
Natural Language Processing
ๆฅ็ๆ่ฆ
Health-related misinformation is very prevalent and potentially harmful. It
is difficult to identify, especially when claims distort or misinterpret
scientific findings. We investigate the impact of synthetic data generation and
lightweight fine-tuning techniques on the ability of large language models
(LLMs) to recognize fallacious arguments using the MISSCI dataset and
framework. In this work, we propose MisSynth, a pipeline that applies
retrieval-augmented generation (RAG) to produce synthetic fallacy samples,
which are then used to fine-tune an LLM model. Our results show substantial
accuracy gains with fine-tuned models compared to vanilla baselines. For
instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score
absolute improvement on the MISSCI test split over its vanilla baseline. We
demonstrate that introducing synthetic fallacy data to augment limited
annotated resources can significantly enhance zero-shot LLM classification
performance on real-world scientific misinformation tasks, even with limited
computational resources. The code and synthetic dataset are available on
https://github.com/mxpoliakov/MisSynth.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zhigao Guo, Feng Dong
Machine Learning
ๆฅ็ๆ่ฆ
Incorporating causal knowledge and mechanisms is essential for refining
causal models and improving downstream tasks such as designing new treatments.
In this paper, we introduce a novel concept in causal discovery, termed
interventional constraints, which differs fundamentally from interventional
data. While interventional data require direct perturbations of variables,
interventional constraints encode high-level causal knowledge in the form of
inequality constraints on causal effects. For instance, in the Sachs dataset
(Sachs et al.\ 2005), Akt has been shown to be activated by PIP3, meaning PIP3
exerts a positive causal effect on Akt. Existing causal discovery methods allow
enforcing structural constraints (for example, requiring a causal path from
PIP3 to Akt), but they may still produce incorrect causal conclusions such as
learning that "PIP3 inhibits Akt". Interventional constraints bridge this gap
by explicitly constraining the total causal effect between variable pairs,
ensuring learned models respect known causal influences. To formalize
interventional constraints, we propose a metric to quantify total causal
effects for linear causal models and formulate the problem as a constrained
optimization task, solved using a two-stage constrained optimization method. We
evaluate our approach on real-world datasets and demonstrate that integrating
interventional constraints not only improves model accuracy and ensures
consistency with established findings, making models more explainable, but also
facilitates the discovery of new causal relationships that would otherwise be
costly to identify.
๐
2025-10-30
๐ ArXiv Preprint (eess.SP)
๐ฅ Shih-Kai Chou, Mengran Zhao, Cheng-Nan Hu, Kuang-Chung Chou, Carolina Fortuna et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Accurate Angle-of-arrival (AoA) estimation is essential for next-generation
wireless communication systems to enable reliable beamforming, high-precision
localization, and integrated sensing. Unfortunately, classical high-resolution
techniques require multi-element arrays and extensive snapshot collection,
while generic Machine Learning (ML) approaches often yield black-box models
that lack physical interpretability. To address these limitations, we propose a
Symbolic Regression (SR)-based ML framework. Namely, Symbolic Regression-based
Angle of Arrival and Beam Pattern Estimator (SABER), a constrained
symbolic-regression framework that automatically discovers closed-form beam
pattern and AoA models from path loss measurements with interpretability. SABER
achieves high accuracy while bridging the gap between opaque ML methods and
interpretable physics-driven estimators. First, we validate our approach in a
controlled free-space anechoic chamber, showing that both direct inversion of
the known $\cos^n$ beam and a low-order polynomial surrogate achieve sub-0.5
degree Mean Absolute Error (MAE). A purely unconstrained SR method can further
reduce the error of the predicted angles, but produces complex formulas that
lack physical insight. Then, we implement the same SR-learned inversions in a
real-world, Reconfigurable Intelligent Surface (RIS)-aided indoor testbed.
SABER and unconstrained SR models accurately recover the true AoA with
near-zero error. Finally, we benchmark SABER against the Cram\'er-Rao Lower
Bounds (CRLBs). Our results demonstrate that SABER is an interpretable and
accurate alternative to state-of-the-art and black-box ML-based methods for AoA
estimation.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ David Schmotz, Sahar Abdelnabi, Maksym Andriushchenko
ๆฅ็ๆ่ฆ
Enabling continual learning in LLMs remains a key unresolved research
challenge. In a recent announcement, a frontier LLM company made a step towards
this by introducing Agent Skills, a framework that equips agents with new
knowledge based on instructions stored in simple markdown files. Although Agent
Skills can be a very useful tool, we show that they are fundamentally insecure,
since they enable trivially simple prompt injections. We demonstrate how to
hide malicious instructions in long Agent Skill files and referenced scripts to
exfiltrate sensitive data, such as internal files or passwords. Importantly, we
show how to bypass system-level guardrails of a popular coding agent: a benign,
task-specific approval with the "Don't ask again" option can carry over to
closely related but harmful actions. Overall, we conclude that despite ongoing
research efforts and scaling model capabilities, frontier LLMs remain
vulnerable to very simple prompt injections in realistic scenarios. Our code is
available at https://github.com/aisa-group/promptinject-agent-skills.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Zhiyang Xun, Shivam Gupta, Eric Price
ๆฅ็ๆ่ฆ
Given a noisy linear measurement $y = Ax + \xi$ of a distribution $p(x)$, and
a good approximation to the prior $p(x)$, when can we sample from the posterior
$p(x \mid y)$? Posterior sampling provides an accurate and fair framework for
tasks such as inpainting, deblurring, and MRI reconstruction, and several
heuristics attempt to approximate it. Unfortunately, approximate posterior
sampling is computationally intractable in general.
To sidestep this hardness, we focus on (local or global) log-concave
distributions $p(x)$. In this regime, Langevin dynamics yields posterior
samples when the exact scores of $p(x)$ are available, but it is brittle to
score--estimation error, requiring an MGF bound (sub-exponential error). By
contrast, in the unconditional setting, diffusion models succeed with only an
$L^2$ bound on the score error. We prove that combining diffusion models with
an annealed variant of Langevin dynamics achieves conditional sampling in
polynomial time using merely an $L^4$ bound on the score error.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sascha Mรผcke
Machine Learning
ๆฅ็ๆ่ฆ
Training Support Vector Machines (SVMs) can be formulated as a QUBO problem,
enabling the use of quantum annealing for model optimization. In this work, we
study how the number of qubits - linked to the discretization level of dual
weights - affects predictive performance across datasets. We compare QUBO-based
SVM training to the classical LIBSVM solver and find that even low-precision
QUBO encodings (e.g., 1 bit per parameter) yield competitive, and sometimes
superior, accuracy. While increased bit-depth enables larger regularization
parameters, it does not always improve classification. Our findings suggest
that selecting the right support vectors may matter more than their precise
weighting. Although current hardware limits the size of solvable QUBOs, our
results highlight the potential of quantum annealing for efficient SVM training
as quantum devices scale.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Ruilin Tong, Haodong Lu, Yuhang Liu, Dong Gong
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
Continual learning (CL) aims to incrementally train a model on a sequence of
tasks while retaining performance on prior ones. However, storing and replaying
data is often infeasible due to privacy or security constraints and impractical
for arbitrary pre-trained models. Data-free CL seeks to update models without
access to previous data. Beyond regularization, we employ model inversion to
synthesize data from the trained model, enabling replay without storing
samples. Yet, model inversion in predictive models faces two challenges: (1)
generating inputs solely from compressed output labels causes drift between
synthetic and real data, and replaying such data can erode prior knowledge; (2)
inversion is computationally expensive since each step backpropagates through
the full model. These issues are amplified in large pre-trained models such as
CLIP. To improve efficiency, we propose Per-layer Model Inversion (PMI),
inspired by faster convergence in single-layer optimization. PMI provides
strong initialization for full-model inversion, substantially reducing
iterations. To mitigate feature shift, we model class-wise features via
Gaussian distributions and contrastive model, ensuring alignment between
synthetic and real features. Combining PMI and feature modeling, our approach
enables continual learning of new classes by generating pseudo-images from
semantic-aware projected features, achieving strong effectiveness and
compatibility across multiple CL settings.
๐
2025-10-30
๐ 37 pages, 4 figures, 86 references. Submitted to J
๐ฅ Laura Jiang, Reza Ryan, Qian Li, Nasim Ferdosian
Machine Learning
ๆฅ็ๆ่ฆ
Anomaly detection is a critical task in cybersecurity, where identifying
insider threats, access violations, and coordinated attacks is essential for
ensuring system resilience. Graph-based approaches have become increasingly
important for modeling entity interactions, yet most rely on homogeneous and
static structures, which limits their ability to capture the heterogeneity and
temporal evolution of real-world environments. Heterogeneous Graph Neural
Networks (HGNNs) have emerged as a promising paradigm for anomaly detection by
incorporating type-aware transformations and relation-sensitive aggregation,
enabling more expressive modeling of complex cyber data. However, current
research on HGNN-based anomaly detection remains fragmented, with diverse
modeling strategies, limited comparative evaluation, and an absence of
standardized benchmarks. To address this gap, we provide a comprehensive survey
of HGNN-based anomaly detection methods in cybersecurity. We introduce a
taxonomy that classifies approaches by anomaly type and graph dynamics, analyze
representative models, and map them to key cybersecurity applications. We also
review commonly used benchmark datasets and evaluation metrics, highlighting
their strengths and limitations. Finally, we identify key open challenges
related to modeling, data, and deployment, and outline promising directions for
future research. This survey aims to establish a structured foundation for
advancing HGNN-based anomaly detection toward scalable, interpretable, and
practically deployable solutions.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Beomhan Baek, Minhak Song, Chulhee Yun
Machine Learning
ๆฅ็ๆ่ฆ
Adam [Kingma and Ba, 2015] is the de facto optimizer in deep learning, yet
its theoretical understanding remains limited. Prior analyses show that Adam
favors solutions aligned with $\ell_\infty$-geometry, but these results are
restricted to the full-batch regime. In this work, we study the implicit bias
of incremental Adam (using one sample per step) for logistic regression on
linearly separable data, and we show that its bias can deviate from the
full-batch behavior. To illustrate this, we construct a class of structured
datasets where incremental Adam provably converges to the $\ell_2$-max-margin
classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch
Adam. For general datasets, we develop a proxy algorithm that captures the
limiting behavior of incremental Adam as $\beta_2 \to 1$ and we characterize
its convergence direction via a data-dependent dual fixed-point formulation.
Finally, we prove that, unlike Adam, Signum [Bernstein et al., 2018] converges
to the $\ell_\infty$-max-margin classifier for any batch size by taking $\beta$
close enough to 1. Overall, our results highlight that the implicit bias of
Adam crucially depends on both the batching scheme and the dataset, while
Signum remains invariant.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Ziliang Chen, Tianang Xiao, Jusheng Zhang, Yongsen Zheng, Xipeng Chen
Multimodal
ๆฅ็ๆ่ฆ
Contrastive Language-Image Pre-training (CLIP) delivers strong cross modal
generalization by aligning images and texts in a shared embedding space, yet it
persistently fails at compositional reasoning over objects, attributes, and
relations often behaving like a bag-of-words matcher. Prior causal accounts
typically model text as a single vector, obscuring token-level structure and
leaving core phenomena-such as prompt sensitivity and failures on hard
negatives unexplained. We address this gap with a token-aware causal
representation learning (CRL) framework grounded in a sequential,
language-token SCM. Our theory extends block identifiability to tokenized text,
proving that CLIP's contrastive objective can recover the modal-invariant
latent variable under both sentence-level and token-level SCMs. Crucially,
token granularity yields the first principled explanation of CLIP's
compositional brittleness: composition nonidentifiability. We show the
existence of pseudo-optimal text encoders that achieve perfect modal-invariant
alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations
over atomic concepts, thereby failing to distinguish correct captions from hard
negatives despite optimizing the same training objective as true-optimal
encoders. The analysis further links language-side nonidentifiability to
visual-side failures via the modality gap and shows how iterated composition
operators compound hardness, motivating improved negative mining strategies.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jingyuan Liu, Fatemeh Ghaffari, Xuchuang Wang, Mohammad Hajiesmaili, Carlee Joe-Wong
Machine Learning
ๆฅ็ๆ่ฆ
Preference learning from pairwise feedback is a widely adopted framework in
applications such as reinforcement learning with human feedback and
recommendations. In many practical settings, however, user interactions are
limited or costly, making offline preference learning necessary. Moreover,
real-world preference learning often involves users with different preferences.
For example, annotators from different backgrounds may rank the same responses
differently. This setting presents two central challenges: (1) identifying
similarity across users to effectively aggregate data, especially under
scenarios where offline data is imbalanced across dimensions, and (2) handling
the imbalanced offline data where some preference dimensions are
underrepresented. To address these challenges, we study the Offline Clustering
of Preference Learning problem, where the learner has access to fixed datasets
from multiple users with potentially different preferences and aims to maximize
utility for a test user. To tackle the first challenge, we first propose
Off-C$^2$PL for the pure offline setting, where the learner relies solely on
offline data. Our theoretical analysis provides a suboptimality bound that
explicitly captures the tradeoff between sample noise and bias. To address the
second challenge of inbalanced data, we extend our framework to the setting
with active-data augmentation where the learner is allowed to select a limited
number of additional active-data for the test user based on the cluster
structure learned by Off-C$^2$PL. In this setting, our second algorithm,
A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative
dimensions of the test user's preference. We prove that these actively
collected samples contribute more effectively than offline ones. Finally, we
validate our theoretical results through simulations on synthetic and
real-world datasets.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Michal ล tefรกnik, Timothee Mickus, Marek Kadlฤรญk, Bertram Hรธjer, Michal Spiegel et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent work has shown that different large language models (LLMs) converge to
similar and accurate input embedding representations for numbers. These
findings conflict with the documented propensity of LLMs to produce erroneous
outputs when dealing with numeric information. In this work, we aim to explain
this conflict by exploring how language models manipulate numbers and quantify
the lower bounds of accuracy of these mechanisms. We find that despite
surfacing errors, different language models learn interchangeable
representations of numbers that are systematic, highly accurate and universal
across their hidden states and the types of input contexts. This allows us to
create universal probes for each LLM and to trace information -- including the
causes of output errors -- to specific layers. Our results lay a fundamental
understanding of how pre-trained LLMs manipulate numbers and outline the
potential of more accurate probing techniques in addressed refinements of LLMs'
architectures.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Xia Jiang, Rong J. B. Zhu
ๆฅ็ๆ่ฆ
Multi-task learning in contextual bandits has attracted significant research
interest due to its potential to enhance decision-making across multiple
related tasks by leveraging shared structures and task-specific heterogeneity.
In this article, we propose a novel hierarchical Bayesian framework for
learning in various bandit instances. This framework captures both the
heterogeneity and the correlations among different bandit instances through a
hierarchical Bayesian model, enabling effective information sharing while
accommodating instance-specific variations. Unlike previous methods that
overlook the learning of the covariance structure across bandits, we introduce
an empirical Bayesian approach to estimate the covariance matrix of the prior
distribution.This enhances both the practicality and flexibility of learning
across multi-bandits. Building on this approach, we develop two efficient
algorithms: ebmTS (Empirical Bayesian Multi-Bandit Thompson Sampling) and
ebmUCB (Empirical Bayesian Multi-Bandit Upper Confidence Bound), both of which
incorporate the estimated prior into the decision-making process. We provide
the frequentist regret upper bounds for the proposed algorithms, thereby
filling a research gap in the field of multi-bandit problems. Extensive
experiments on both synthetic and real-world datasets demonstrate the superior
performance of our algorithms, particularly in complex environments. Our
methods achieve lower cumulative regret compared to existing techniques,
highlighting their effectiveness in balancing exploration and exploitation
across multi-bandits.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kim Yong Tan, Yueming Lyu, Ivor Tsang, Yew-Soon Ong
Machine Learning
ๆฅ็ๆ่ฆ
Diffusion models have been successful in learning complex data distributions.
This capability has driven their application to high-dimensional
multi-objective black-box optimization problem. Existing approaches often
employ an external optimization loop, such as an evolutionary algorithm, to the
diffusion model. However, these approaches treat the diffusion model as a
black-box refiner, which overlooks the internal distribution transition of the
diffusion generation process, limiting their efficiency. To address these
challenges, we propose the Inference-time Multi-target Generation (IMG)
algorithm, which optimizes the diffusion process at inference-time to generate
samples that simultaneously satisfy multiple objectives. Specifically, our IMG
performs weighted resampling during the diffusion generation process according
to the expected aggregated multi-objective values. This weighted resampling
strategy ensures the diffusion-generated samples are distributed according to
our desired multi-target Boltzmann distribution. We further derive that the
multi-target Boltzmann distribution has an interesting log-likelihood
interpretation, where it is the optimal solution to the distributional
multi-objective optimization problem. We implemented IMG for a multi-objective
molecule generation task. Experiments show that IMG, requiring only a single
generation pass, achieves a significantly higher hypervolume than baseline
optimization algorithms that often require hundreds of diffusion generations.
Notably, our algorithm can be viewed as an optimized diffusion process and can
be integrated into existing methods to further improve their performance.
๐
2025-10-30
๐ ArXiv Preprint (cs.SE)
๐ฅ Domenico Amalfitano, Andreas Metzger, Marco Autili, Tommaso Fulcini, Tobias Hey et al.
ๆฅ็ๆ่ฆ
Generative AI (GenAI) is rapidly transforming software engineering (SE)
practices, influencing how SE processes are executed, as well as how software
systems are developed, operated, and evolved. This paper applies design science
research to build a roadmap for GenAI-augmented SE. The process consists of
three cycles that incrementally integrate multiple sources of evidence,
including collaborative discussions from the FSE 2025 "Software Engineering
2030" workshop, rapid literature reviews, and external feedback sessions
involving peers. McLuhan's tetrads were used as a conceptual instrument to
systematically capture the transforming effects of GenAI on SE processes and
software products.The resulting roadmap identifies four fundamental forms of
GenAI augmentation in SE and systematically characterizes their related
research challenges and opportunities. These insights are then consolidated
into a set of future research directions. By grounding the roadmap in a
rigorous multi-cycle process and cross-validating it among independent author
teams and peers, the study provides a transparent and reproducible foundation
for analyzing how GenAI affects SE processes, methods and tools, and for
framing future research within this rapidly evolving area. Based on these
findings, the article finally makes ten predictions for SE in the year 2030.
๐
2025-10-30
๐ This work has been submitted to the IEEE for possi
๐ฅ Haohua Duan, Liyao Xiang, Xin Zhang
Natural Language Processing
ๆฅ็ๆ่ฆ
Watermarking schemes for large language models (LLMs) have been proposed to
identify the source of the generated text, mitigating the potential threats
emerged from model theft. However, current watermarking solutions hardly
resolve the trust issue: the non-public watermark detection cannot prove itself
faithfully conducting the detection. We observe that it is attributed to the
secret key mostly used in the watermark detection -- it cannot be public, or
the adversary may launch removal attacks provided the key; nor can it be
private, or the watermarking detection is opaque to the public. To resolve the
dilemma, we propose PVMark, a plugin based on zero-knowledge proof (ZKP),
enabling the watermark detection process to be publicly verifiable by third
parties without disclosing any secret key. PVMark hinges upon the proof of
`correct execution' of watermark detection on which a set of ZKP constraints
are built, including mapping, random number generation, comparison, and
summation. We implement multiple variants of PVMark in Python, Rust and Circom,
covering combinations of three watermarking schemes, three hash functions, and
four ZKP protocols, to show our approach effectively works under a variety of
circumstances. By experimental results, PVMark efficiently enables public
verifiability on the state-of-the-art LLM watermarking schemes yet without
compromising the watermarking performance, promising to be deployed in
practice.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Frederik Mรถbius Rygaard, Shen Zhu, Yinzhu Jin, Sรธren Hauberg, Tom Fletcher
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Interpolation in generative models allows for controlled generation, model
inspection, and more. Unfortunately, most generative models lack a principal
notion of interpolants without restrictive assumptions on either the model or
data dimension. In this paper, we develop a general interpolation scheme that
targets likely transition paths compatible with different metrics and
probability distributions. We consider interpolants analogous to a geodesic
constrained to a suitable data distribution and derive a novel algorithm for
computing these curves, which requires no additional training. Theoretically,
we show that our method locally can be considered as a geodesic under a
suitable Riemannian metric. We quantitatively show that our interpolation
scheme traverses higher density regions than baselines across a range of models
and datasets.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Hieu M. Vu, Tan M. Nguyen
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Controlling specific behaviors in large language models while preserving
their general capabilities is a central challenge for safe and reliable
artificial intelligence deployment. Current steering methods, such as vector
addition and directional ablation, are constrained within a two-dimensional
subspace defined by the activation and feature direction, making them sensitive
to chosen parameters and potentially affecting unrelated features due to
unintended interactions in activation space. We introduce Angular Steering, a
novel and flexible method for behavior modulation that operates by rotating
activations within a fixed two-dimensional subspace. By formulating steering as
a geometric rotation toward or away from a target behavior direction, Angular
Steering provides continuous, fine-grained control over behaviors such as
refusal and compliance. We demonstrate this method using refusal steering
emotion steering as use cases. Additionally, we propose Adaptive Angular
Steering, a selective variant that rotates only activations aligned with the
target feature, further enhancing stability and coherence. Angular Steering
generalizes existing addition and orthogonalization techniques under a unified
geometric rotation framework, simplifying parameter selection and maintaining
model stability across a broader range of adjustments. Experiments across
multiple model families and sizes show that Angular Steering achieves robust
behavioral control while maintaining general language modeling performance,
underscoring its flexibility, generalization, and robustness compared to prior
approaches. Code and artifacts are available at
https://github.com/lone17/angular-steering/.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Minyi Peng, Darian Gunamardi, Ivan Tjuawinata, Kwok-Yan Lam
Machine Learning
ๆฅ็ๆ่ฆ
As a new and promising approach, existing machine unlearning (MU) works
typically emphasize theoretical formulations or optimization objectives to
achieve knowledge removal. However, when deployed in real-world scenarios, such
solutions typically face scalability issues and have to address practical
requirements such as full access to original datasets and model. In contrast to
the existing approaches, we regard classification training as a sequential
process where classes are learned sequentially, which we call \emph{inductive
approach}. Unlearning can then be done by reversing the last training sequence.
This is implemented by appending a projection-redistribution layer in the end
of the model. Such an approach does not require full access to the original
dataset or the model, addressing the challenges of existing methods. This
enables modular and model-agnostic deployment as an output filter into existing
classification pipelines with minimal alterations. We conducted multiple
experiments across multiple datasets including image (CIFAR-10/100 using
CNN-based model) and tabular datasets (Covertype using tree-based model).
Experiment results show consistently similar output to a fully retrained model
with a high computational cost reduction. This demonstrates the applicability,
scalability, and system compatibility of our solution while maintaining the
performance of the output in a more practical setting.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Sekitoshi Kanai, Tsukasa Yoshida, Hiroshi Takahashi, Haru Kuroki, Kazumune Hashimoto
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Test-time alignment of large language models (LLMs) attracts attention
because fine-tuning LLMs requires high computational costs. In this paper, we
propose a new test-time alignment method called adaptive importance sampling on
pre-logits (AISP) on the basis of the sampling-based model predictive control
with the stochastic control input. AISP applies the Gaussian perturbation into
pre-logits, which are outputs of the penultimate layer, so as to maximize
expected rewards with respect to the mean of the perturbation. We demonstrate
that the optimal mean is obtained by importance sampling with sampled rewards.
AISP outperforms best-of-n sampling in terms of rewards over the number of used
samples and achieves higher rewards than other reward-based test-time alignment
methods.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Avinash Kadimisetty, Arun Rajagopalan, Vijendra SK
ๆฅ็ๆ่ฆ
Reducing preventable hospital readmissions is a national priority for payers,
providers, and policymakers seeking to improve health care and lower costs. The
rate of readmission is being used as a benchmark to determine the quality of
healthcare provided by the hospitals. In thisproject, we have used machine
learning techniques like Logistic Regression, Random Forest and Support Vector
Machines to analyze the health claims data and identify demographic and medical
factors that play a crucial role in predicting all-cause readmissions. As the
health claims data is high dimensional, we have used Principal Component
Analysis as a dimension reduction technique and used the results for building
regression models. We compared and evaluated these models based on the Area
Under Curve (AUC) metric. Random Forest model gave the highest performance
followed by Logistic Regression and Support Vector Machine models. These models
can be used to identify the crucial factors causing readmissions and help
identify patients to focus on to reduce the chances of readmission, ultimately
bringing down the cost and increasing the quality of healthcare provided to the
patients.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Yunxiao Shi, Shuo Yang, Yixin Su, Rui Zhang, Min Xu
ๆฅ็ๆ่ฆ
Modern data-centric AI needs precise per-sample influence. Standard SGD-IE
approximates leave-one-out effects by summing per-epoch surrogates and ignores
cross-epoch compounding, which misranks critical examples. We propose
ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out
perturbation across training and updates an accumulative influence state at
each step. In smooth strongly convex settings it achieves geometric error
contraction and, in smooth non-convex regimes, it tightens error bounds; larger
mini-batches further reduce constants. Empirically, on Adult, 20 Newsgroups,
and MNIST under clean and corrupted data and both convex and non-convex
training, ACC-SGD-IE yields more accurate influence estimates, especially over
long epochs. For downstream data cleansing it more reliably flags noisy
samples, producing models trained on ACC-SGD-IE cleaned data that outperform
those cleaned with SGD-IE.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Songxin Lei, Qiongyan Wang, Yanchen Zhu, Hanyu Yao, Sijie Ruan et al.
Machine Learning
ๆฅ็ๆ่ฆ
Public resource allocation involves the efficient distribution of resources,
including urban infrastructure, energy, and transportation, to effectively meet
societal demands. However, existing methods focus on optimizing the movement of
individual resources independently, without considering their capacity
constraints. To address this limitation, we propose a novel and more practical
problem: Collaborative Public Resource Allocation (CPRA), which explicitly
incorporates capacity constraints and spatio-temporal dynamics in real-world
scenarios. We propose a new framework called Game-Theoretic Spatio-Temporal
Reinforcement Learning (GSTRL) for solving CPRA. Our contributions are twofold:
1) We formulate the CPRA problem as a potential game and demonstrate that there
is no gap between the potential function and the optimal target, laying a solid
theoretical foundation for approximating the Nash equilibrium of this NP-hard
problem; and 2) Our designed GSTRL framework effectively captures the
spatio-temporal dynamics of the overall system. We evaluate GSTRL on two
real-world datasets, where experiments show its superior performance. Our
source codes are available in the supplementary materials.
๐
2025-10-30
๐ This paper is currently under review for presentat
๐ฅ Emilio Mastriani, Alessandro Costa, Federico Incardona, Kevin Munari, Sebastiano Spinello
ๆฅ็ๆ่ฆ
In this study, we investigate the effectiveness of advanced feature
engineering and hybrid model architectures for anomaly detection in a
multivariate industrial time series, focusing on a steam turbine system. We
evaluate the impact of change point-derived statistical features,
clustering-based substructure representations, and hybrid learning strategies
on detection performance. Despite their theoretical appeal, these complex
approaches consistently underperformed compared to a simple Random Forest +
XGBoost ensemble trained on segmented data. The ensemble achieved an AUC-ROC of
0.976, F1-score of 0.41, and 100% early detection within the defined time
window. Our findings highlight that, in scenarios with highly imbalanced and
temporally uncertain data, model simplicity combined with optimized
segmentation can outperform more sophisticated architectures, offering greater
robustness, interpretability, and operational utility.
๐
2025-10-30
๐ EMNLP 2025
๐ฅ Hyuntae Park, Yeachan Kim, SangKeun Lee
Machine Learning
ๆฅ็ๆ่ฆ
Molecule and text representation learning has gained increasing interest due
to its potential for enhancing the understanding of chemical information.
However, existing models often struggle to capture subtle differences between
molecules and their descriptions, as they lack the ability to learn
fine-grained alignments between molecular substructures and chemical phrases.
To address this limitation, we introduce MolBridge, a novel molecule-text
learning framework based on substructure-aware alignments. Specifically, we
augment the original molecule-description pairs with additional alignment
signals derived from molecular substructures and chemical phrases. To
effectively learn from these enriched alignments, MolBridge employs
substructure-aware contrastive learning, coupled with a self-refinement
mechanism that filters out noisy alignment signals. Experimental results show
that MolBridge effectively captures fine-grained correspondences and
outperforms state-of-the-art baselines on a wide range of molecular benchmarks,
highlighting the significance of substructure-aware alignment in molecule-text
learning.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kexing Liu
Machine Learning
ๆฅ็ๆ่ฆ
Human Activity Recognition (HAR) via Wi-Fi Channel State Information (CSI)
presents a privacy-preserving, contactless sensing approach suitable for smart
homes, healthcare monitoring, and mobile IoT systems. However, existing methods
often encounter computational inefficiency, high latency, and limited
feasibility within resource-constrained, embedded mobile edge environments.
This paper proposes STAR (Sensing Technology for Activity Recognition), an
edge-AI-optimized framework that integrates a lightweight neural architecture,
adaptive signal processing, and hardware-aware co-optimization to enable
real-time, energy-efficient HAR on low-power embedded devices. STAR
incorporates a streamlined Gated Recurrent Unit (GRU)-based recurrent neural
network, reducing model parameters by 33% compared to conventional LSTM models
while maintaining effective temporal modeling capability. A multi-stage
pre-processing pipeline combining median filtering, 8th-order Butterworth
low-pass filtering, and Empirical Mode Decomposition (EMD) is employed to
denoise CSI amplitude data and extract spatial-temporal features. For on-device
deployment, STAR is implemented on a Rockchip RV1126 processor equipped with an
embedded Neural Processing Unit (NPU), interfaced with an ESP32-S3-based CSI
acquisition module. Experimental results demonstrate a mean recognition
accuracy of 93.52% across seven activity classes and 99.11% for human presence
detection, utilizing a compact 97.6k-parameter model. INT8 quantized inference
achieves a processing speed of 33 MHz with just 8% CPU utilization, delivering
sixfold speed improvements over CPU-based execution. With sub-second response
latency and low power consumption, the system ensures real-time,
privacy-preserving HAR, offering a practical, scalable solution for mobile and
pervasive computing environments.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kexing Liu
Robotics
ๆฅ็ๆ่ฆ
WiFi Channel State Information (CSI)-based human activity recognition (HAR)
provides a privacy-preserving, device-free sensing solution for smart
environments. However, its deployment on edge devices is severely constrained
by domain shift, where recognition performance deteriorates under varying
environmental and hardware conditions. This study presents maxVSTAR (maximally
adaptive Vision-guided Sensing Technology for Activity Recognition), a
closed-loop, vision-guided model adaptation framework that autonomously
mitigates domain shift for edge-deployed CSI sensing systems. The proposed
system integrates a cross-modal teacher-student architecture, where a
high-accuracy YOLO-based vision model serves as a dynamic supervisory signal,
delivering real-time activity labels for the CSI data stream. These labels
enable autonomous, online fine-tuning of a lightweight CSI-based HAR model,
termed Sensing Technology for Activity Recognition (STAR), directly at the
edge. This closed-loop retraining mechanism allows STAR to continuously adapt
to environmental changes without manual intervention. Extensive experiments
demonstrate the effectiveness of maxVSTAR. When deployed on uncalibrated
hardware, the baseline STAR model's recognition accuracy declined from 93.52%
to 49.14%. Following a single vision-guided adaptation cycle, maxVSTAR restored
the accuracy to 81.51%. These results confirm the system's capacity for
dynamic, self-supervised model adaptation in privacy-conscious IoT
environments, establishing a scalable and practical paradigm for long-term
autonomous HAR using CSI sensing at the network edge.
๐
2025-10-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Martim Afonso, Nuno Saavedra, Bruno Lourenรงo, Alexandra Mendes, Joรฃo Ferreira
ๆฅ็ๆ่ฆ
Systematic reviews and mapping studies are critical for synthesizing
research, identifying gaps, and guiding future work, but they are often
labor-intensive and time-consuming. Existing tools provide partial support for
specific steps, leaving much of the process manual and error-prone. We present
ProfOlaf, a semi-automated tool designed to streamline systematic reviews while
maintaining methodological rigor. ProfOlaf supports iterative snowballing for
article collection with human-in-the-loop filtering and uses large language
models to assist in analyzing articles, extracting key topics, and answering
queries about the content of papers. By combining automation with guided manual
effort, ProfOlaf enhances the efficiency, quality, and reproducibility of
systematic reviews across research fields. A video describing and demonstrating
ProfOlaf is available at: https://youtu.be/4noUXfcmxsE
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Wen Xie, Yanjun Zhu, Gijs Overgoor, Yakov Bart, Agata Lapedriza Garcia et al.
Multimodal
ๆฅ็ๆ่ฆ
Advertisers commonly need multiple versions of the same advertisement (ad) at
varying durations for a single campaign. The traditional approach involves
manually selecting and re-editing shots from longer video ads to create shorter
versions, which is labor-intensive and time-consuming. In this paper, we
introduce a framework for automated video ad clipping using video summarization
techniques. We are the first to frame video clipping as a shot selection
problem, tailored specifically for advertising. Unlike existing general video
summarization methods that primarily focus on visual content, our approach
emphasizes the critical role of audio in advertising. To achieve this, we
develop a two-stream audio-visual fusion model that predicts the importance of
video frames, where importance is defined as the likelihood of a frame being
selected in the firm-produced short ad. To address the lack of ad-specific
datasets, we present AdSum204, a novel dataset comprising 102 pairs of
30-second and 15-second ads from real advertising campaigns. Extensive
experiments demonstrate that our model outperforms state-of-the-art methods
across various metrics, including Average Precision, Area Under Curve,
Spearman, and Kendall.
๐
2025-10-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Min Hou, Xin Liu, Le Wu, Chenyi He, Hao Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Cross-Domain Sequential Recommendation (CDSR) seeks to improve user
preference modeling by transferring knowledge from multiple domains. Despite
the progress made in CDSR, most existing methods rely on overlapping users or
items to establish cross-domain correlations-a requirement that rarely holds in
real-world settings. The advent of large language models (LLM) and
model-merging techniques appears to overcome this limitation by unifying
multi-domain data without explicit overlaps. Yet, our empirical study shows
that naively training an LLM on combined domains-or simply merging several
domain-specific LLMs-often degrades performance relative to a model trained
solely on the target domain. To address these challenges, we first
experimentally investigate the cause of suboptimal performance in LLM-based
cross-domain recommendation and model merging. Building on these insights, we
introduce WeaveRec, which cross-trains multiple LoRA modules with source and
target domain data in a weaving fashion, and fuses them via model merging.
WeaveRec can be extended to multi-source domain scenarios and notably does not
introduce additional inference-time cost in terms of latency or memory.
Furthermore, we provide a theoretical guarantee that WeaveRec can reduce the
upper bound of the expected error in the target domain. Extensive experiments
on single-source, multi-source, and cross-platform cross-domain recommendation
scenarios validate that WeaveRec effectively mitigates performance degradation
and consistently outperforms baseline approaches in real-world recommendation
tasks.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dipak Meher, Carlotta Domeniconi
ๆฅ็ๆ่ฆ
Human smuggling networks are increasingly adaptive and difficult to analyze.
Legal case documents offer critical insights but are often unstructured,
lexically dense, and filled with ambiguous or shifting references, which pose
significant challenges for automated knowledge graph (KG) construction. While
recent LLM-based approaches improve over static templates, they still generate
noisy, fragmented graphs with duplicate nodes due to the absence of guided
extraction and coreference resolution. The recently proposed CORE-KG framework
addresses these limitations by integrating a type-aware coreference module and
domain-guided structured prompts, significantly reducing node duplication and
legal noise. In this work, we present a systematic ablation study of CORE-KG to
quantify the individual contributions of its two key components. Our results
show that removing coreference resolution results in a 28.32% increase in node
duplication and a 4.32% increase in noisy nodes, while removing structured
prompts leads to a 4.34% increase in node duplication and a 73.33% increase in
noisy nodes. These findings offer empirical insights for designing robust
LLM-based pipelines for extracting structured representations from complex
legal texts.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Dipak Meher, Carlotta Domeniconi, Guadalupe Correa-Cabrera
ๆฅ็ๆ่ฆ
Human smuggling networks are complex and constantly evolving, making them
difficult to analyze comprehensively. Legal case documents offer rich factual
and procedural insights into these networks but are often long, unstructured,
and filled with ambiguous or shifting references, posing significant challenges
for automated knowledge graph (KG) construction. Existing methods either
overlook coreference resolution or fail to scale beyond short text spans,
leading to fragmented graphs and inconsistent entity linking. We propose
LINK-KG, a modular framework that integrates a three-stage, LLM-guided
coreference resolution pipeline with downstream KG extraction. At the core of
our approach is a type-specific Prompt Cache, which consistently tracks and
resolves references across document chunks, enabling clean and disambiguated
narratives for structured knowledge graph construction from both short and long
legal texts. LINK-KG reduces average node duplication by 45.21% and noisy nodes
by 32.22% compared to baseline methods, resulting in cleaner and more coherent
graph structures. These improvements establish LINK-KG as a strong foundation
for analyzing complex criminal networks.
๐
2025-10-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Danial Ebrat, Sepideh Ahmadian, Luis Rueda
Natural Language Processing
ๆฅ็ๆ่ฆ
Recommender systems often struggle with data sparsity and cold-start
scenarios, limiting their ability to provide accurate suggestions for new or
infrequent users. This paper presents a Graph Attention Network (GAT) based
Collaborative Filtering (CF) framework enhanced with Large Language Model (LLM)
driven context aware embeddings. Specifically, we generate concise textual user
profiles and unify item metadata (titles, genres, overviews) into rich textual
embeddings, injecting these as initial node features in a bipartite user item
graph. To further optimize ranking performance, we introduce a hybrid loss
function that combines Bayesian Personalized Ranking (BPR) with a cosine
similarity term and robust negative sampling, ensuring explicit negative
feedback is distinguished from unobserved data. Experiments on the MovieLens
100k and 1M datasets show consistent improvements over state-of-the-art
baselines in Precision, NDCG, and MAP while demonstrating robustness for users
with limited interaction history. Ablation studies confirm the critical role of
LLM-augmented embeddings and the cosine similarity term in capturing nuanced
semantic relationships. Our approach effectively mitigates sparsity and
cold-start limitations by integrating LLM-derived contextual understanding into
graph-based architectures. Future directions include balancing recommendation
accuracy with coverage and diversity, and introducing fairness-aware
constraints and interpretability features to enhance system performance
further.
๐
2025-10-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Ivan Razvorotnev, Marina Munkhoeva, Evgeny Frolov
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Sequential recommendation models must navigate sparse interaction data
popularity bias and conflicting objectives like accuracy versus diversity While
recent contrastive selfsupervised learning SSL methods offer improved accuracy
they come with tradeoffs large batch requirements reliance on handcrafted
augmentations and negative sampling that can reinforce popularity bias In this
paper we introduce BT-SR a novel noncontrastive SSL framework that integrates
the Barlow Twins redundancyreduction principle into a Transformerbased nextitem
recommender BTSR learns embeddings that align users with similar shortterm
behaviors while preserving longterm distinctionswithout requiring negative
sampling or artificial perturbations This structuresensitive alignment allows
BT-SR to more effectively recognize emerging user intent and mitigate the
influence of noisy historical context Our experiments on five public benchmarks
demonstrate that BTSR consistently improves nextitem prediction accuracy and
significantly enhances longtail item coverage and recommendation calibration
Crucially we show that a single hyperparameter can control the
accuracydiversity tradeoff enabling practitioners to adapt recommendations to
specific application needs
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Jiseong Chung, Ronny Ko, Wonchul Yoo, Makoto Onizuka, Sungmok Kim et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Compliance at web scale poses practical challenges: each request may require
a regulatory assessment. Regulatory texts (e.g., the General Data Protection
Regulation, GDPR) are cross-referential and normative, while runtime contexts
are expressed in unstructured natural language. This setting motivates us to
align semantic information in unstructured text with the structured, normative
elements of regulations. To this end, we introduce GraphCompliance, a framework
that represents regulatory texts as a Policy Graph and runtime contexts as a
Context Graph, and aligns them. In this formulation, the policy graph encodes
normative structure and cross-references, whereas the context graph formalizes
events as subject-action-object (SAO) and entity-relation triples. This
alignment anchors the reasoning of a judge large language model (LLM) in
structured information and helps reduce the burden of regulatory interpretation
and event parsing, enabling a focus on the core reasoning step. In experiments
on 300 GDPR-derived real-world scenarios spanning five evaluation tasks,
GraphCompliance yields 4.1-7.2 percentage points (pp) higher micro-F1 than
LLM-only and RAG baselines, with fewer under- and over-predictions, resulting
in higher recall and lower false positive rates. Ablation studies indicate
contributions from each graph component, suggesting that structured
representations and a judge LLM are complementary for normative reasoning.
๐
2025-10-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Haochen Chen, Qi Huang, Anan Wu, Wenhao Zhang, Jianliang Ye et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Automatic structure elucidation is essential for self-driving laboratories as
it enables the system to achieve truly autonomous. This capability closes the
experimental feedback loop, ensuring that machine learning models receive
reliable structure information for real-time decision-making and optimization.
Herein, we present DiSE, an end-to-end diffusion-based generative model that
integrates multiple spectroscopic modalities, including MS, 13C and 1H chemical
shifts, HSQC, and COSY, to achieve automated yet accurate structure elucidation
of organic compounds. By learning inherent correlations among spectra through
data-driven approaches, DiSE achieves superior accuracy, strong generalization
across chemically diverse datasets, and robustness to experimental data despite
being trained on calculated spectra. DiSE thus represents a significant advance
toward fully automated structure elucidation, with broad potential in natural
product research, drug discovery, and self-driving laboratories.
๐
2025-10-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yanran Tang, Ruihong Qiu, Xue Li, Zi Huang
Natural Language Processing
ๆฅ็ๆ่ฆ
Legal case retrieval (LCR) is a cornerstone of real-world legal decision
making, as it enables practitioners to identify precedents for a given query
case. Existing approaches mainly rely on traditional lexical models and
pretrained language models to encode the texts of legal cases. Yet there are
rich information in the relations among different legal entities as well as the
crucial reasoning process that uncovers how legal facts and legal issues can
lead to judicial decisions. Such relational reasoning process reflects the
distinctive characteristics of each case that can distinguish one from another,
mirroring the real-world judicial process. Naturally, incorporating such
information into the precise case embedding could further enhance the accuracy
of case retrieval. In this paper, a novel ReaKase-8B framework is proposed to
leverage extracted legal facts, legal issues, legal relation triplets and legal
reasoning for effective legal case retrieval. ReaKase-8B designs an in-context
legal case representation learning paradigm with a fine-tuned large language
model. Extensive experiments on two benchmark datasets from COLIEE 2022 and
COLIEE 2023 demonstrate that our knowledge and reasoning augmented embeddings
substantially improve retrieval performance over baseline models, highlighting
the potential of integrating legal reasoning into legal case retrieval systems.
The code has been released on https://github.com/yanran-tang/ReaKase-8B.
๐
2025-10-30
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zhaoqi Zhang, Haolei Pei, Jun Guo, Tianyu Wang, Yufei Feng et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
In recommendation systems, scaling up feature-interaction modules (e.g.,
Wukong, RankMixer) or user-behavior sequence modules (e.g., LONGER) has
achieved notable success. However, these efforts typically proceed on separate
tracks, which not only hinders bidirectional information exchange but also
prevents unified optimization and scaling. In this paper, we propose OneTrans,
a unified Transformer backbone that simultaneously performs user-behavior
sequence modeling and feature interaction. OneTrans employs a unified tokenizer
to convert both sequential and non-sequential attributes into a single token
sequence. The stacked OneTrans blocks share parameters across similar
sequential tokens while assigning token-specific parameters to non-sequential
tokens. Through causal attention and cross-request KV caching, OneTrans enables
precomputation and caching of intermediate representations, significantly
reducing computational costs during both training and inference. Experimental
results on industrial-scale datasets demonstrate that OneTrans scales
efficiently with increasing parameters, consistently outperforms strong
baselines, and yields a 5.68% lift in per-user GMV in online A/B tests.
๐
2025-10-30
๐ NeurIPS 2025
๐ฅ Jingyuan He, Jiongnan Liu, Vishan Vishesh Oberoi, Bolin Wu, Mahima Jagadeesh Patel et al.
ๆฅ็ๆ่ฆ
Recommender systems are among the most impactful AI applications, interacting
with billions of users every day, guiding them to relevant products, services,
or information tailored to their preferences. However, the research and
development of recommender systems are hindered by existing datasets that fail
to capture realistic user behaviors and inconsistent evaluation settings that
lead to ambiguous conclusions. This paper introduces the Open Recommendation
Benchmark for Reproducible Research with HIdden Tests (ORBIT), a unified
benchmark for consistent and realistic evaluation of recommendation models.
ORBIT offers a standardized evaluation framework of public datasets with
reproducible splits and transparent settings for its public leaderboard.
Additionally, ORBIT introduces a new webpage recommendation task, ClueWeb-Reco,
featuring web browsing sequences from 87 million public, high-quality webpages.
ClueWeb-Reco is a synthetic dataset derived from real, user-consented, and
privacy-guaranteed browsing data. It aligns with modern recommendation
scenarios and is reserved as the hidden test part of our leaderboard to
challenge recommendation models' generalization ability. ORBIT measures 12
representative recommendation models on its public benchmark and introduces a
prompted LLM baseline on the ClueWeb-Reco hidden test. Our benchmark results
reflect general improvements of recommender systems on the public datasets,
with variable individual performances. The results on the hidden test reveal
the limitations of existing approaches in large-scale webpage recommendation
and highlight the potential for improvements with LLM integrations. ORBIT
benchmark, leaderboard, and codebase are available at
https://www.open-reco-bench.ai.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Yunchao Ma, Yizhuang Zhou, Yunhuan Yang, Tiancai Wang, Haoqiang Fan
Robotics
ๆฅ็ๆ่ฆ
In this paper, we show how to run pi0-level multi-view VLA at 30Hz frame rate
and at most 480Hz trajectory frequency using a single consumer GPU. This
enables dynamic and real-time tasks that were previously believed to be
unattainable by large VLA models. To achieve it, we introduce a bag of
strategies to eliminate the overheads in model inference. The real-world
experiment shows that the pi0 policy with our strategy achieves a 100% success
rate in grasping a falling pen task. Based on the results, we further propose a
full streaming inference framework for real-time robot control of VLA. Code is
available at https://github.com/Dexmal/realtime-vla.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Qianyou Zhao, Yuliang Shen, Xuanran Zhai, Ce Hao, Duidi Wu et al.
Robotics
ๆฅ็ๆ่ฆ
In visuomotor policy learning, diffusion-based imitation learning has become
widely adopted for its ability to capture diverse behaviors. However,
approaches built on ordinary and stochastic denoising processes struggle to
jointly achieve fast sampling and strong multi-modality. To address these
challenges, we propose the Hybrid Consistency Policy (HCP). HCP runs a short
stochastic prefix up to an adaptive switch time, and then applies a one-step
consistency jump to produce the final action. To align this one-jump
generation, HCP performs time-varying consistency distillation that combines a
trajectory-consistency objective to keep neighboring predictions coherent and a
denoising-matching objective to improve local fidelity. In both simulation and
on a real robot, HCP with 25 SDE steps plus one jump approaches the 80-step
DDPM teacher in accuracy and mode coverage while significantly reducing
latency. These results show that multi-modality does not require slow
inference, and a switch time decouples mode retention from speed. It yields a
practical accuracy efficiency trade-off for robot policies.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Georgios Kamaras, Craig Innes, Subramanian Ramamoorthy
Robotics
ๆฅ็ๆ่ฆ
In robotics, likelihood-free inference (LFI) can provide the domain
distribution that adapts a learnt agent in a parametric set of deployment
conditions. LFI assumes an arbitrary support for sampling, which remains
constant as the initial generic prior is iteratively refined to more
descriptive posteriors. However, a potentially misspecified support can lead to
suboptimal, yet falsely certain, posteriors. To address this issue, we propose
three heuristic LFI variants: EDGE, MODE, and CENTRE. Each interprets the
posterior mode shift over inference steps in its own way and, when integrated
into an LFI step, adapts the support alongside posterior inference. We first
expose the support misspecification issue and evaluate our heuristics using
stochastic dynamical benchmarks. We then evaluate the impact of heuristic
support adaptation on parameter inference and policy learning for a dynamic
deformable linear object (DLO) manipulation task. Inference results in a finer
length and stiffness classification for a parametric set of DLOs. When the
resulting posteriors are used as domain distributions for sim-based policy
learning, they lead to more robust object-centric agent performance.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Xiaoyi He, Danggui Chen, Zhenshuo Zhang, Zimeng Bai
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This paper presents a hierarchical path-planning and control framework that
combines a high-level Deep Q-Network (DQN) for discrete sub-goal selection with
a low-level Twin Delayed Deep Deterministic Policy Gradient (TD3) controller
for continuous actuation. The high-level module selects behaviors and
sub-goals; the low-level module executes smooth velocity commands. We design a
practical reward shaping scheme (direction, distance, obstacle avoidance,
action smoothness, collision penalty, time penalty, and progress), together
with a LiDAR-based safety gate that prevents unsafe motions. The system is
implemented in ROS + Gazebo (TurtleBot3) and evaluated with PathBench metrics,
including success rate, collision rate, path efficiency, and re-planning
efficiency, in dynamic and partially observable environments. Experiments show
improved success rate and sample efficiency over single-algorithm baselines
(DQN or TD3 alone) and rule-based planners, with better generalization to
unseen obstacle configurations and reduced abrupt control changes. Code and
evaluation scripts are available at the project repository.
๐
2025-10-30
๐ IROS 2025
๐ฅ Dave van der Meer, Loรฏck P. Chovet, Gabriel M. Garcia, Abhishek Bera, Miguel A. Olivares-Mendez
Robotics
ๆฅ็ๆ่ฆ
The European Space Agency (ESA) and the European Space Resources Innovation
Centre (ESRIC) created the Space Resources Challenge to invite researchers and
companies to propose innovative solutions for Multi-Robot Systems (MRS) space
prospection. This paper proposes the Resilient Exploration And Lunar Mapping
System 2 (REALMS2), a MRS framework for planetary prospection and mapping.
Based on Robot Operating System version 2 (ROS 2) and enhanced with Visual
Simultaneous Localisation And Mapping (vSLAM) for map generation, REALMS2 uses
a mesh network for a robust ad hoc network. A single graphical user interface
(GUI) controls all the rovers, providing a simple overview of the robotic
mission. This system is designed for heterogeneous multi-robot exploratory
missions, tackling the challenges presented by extraterrestrial environments.
REALMS2 was used during the second field test of the ESA-ESRIC Challenge and
allowed to map around 60% of the area, using three homogeneous rovers while
handling communication delays and blackouts.
๐
2025-10-30
๐ 8 pages, 6 figures. Submitted to IEEE-RAS Internat
๐ฅ Spencer Teetaert, Sven Lilge, Jessica Burgner-Kahrs, Timothy D. Barfoot
ๆฅ็ๆ่ฆ
Stochastic state estimation methods for continuum robots (CRs) often struggle
to balance accuracy and computational efficiency. While several recent works
have explored sliding-window formulations for CRs, these methods are limited to
simplified, discrete-time approximations and do not provide stochastic
representations. In contrast, current stochastic filter methods must run at the
speed of measurements, limiting their full potential. Recent works in
continuous-time estimation techniques for CRs show a principled approach to
addressing this runtime constraint, but are currently restricted to offline
operation. In this work, we present a sliding-window filter (SWF) for
continuous-time state estimation of CRs that improves upon the accuracy of a
filter approach while enabling continuous-time methods to operate online, all
while running at faster-than-real-time speeds. This represents the first
stochastic SWF specifically designed for CRs, providing a promising direction
for future research in this area.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Christoffer Koo รhrstrรธm, Ronja Gรผldenring, Lazaros Nalpantidis
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
We propose tokenization of events and present a tokenizer, Spiking Patches,
specifically designed for event cameras. Given a stream of asynchronous and
spatially sparse events, our goal is to discover an event representation that
preserves these properties. Prior works have represented events as frames or as
voxels. However, while these representations yield high accuracy, both frames
and voxels are synchronous and decrease the spatial sparsity. Spiking Patches
gives the means to preserve the unique properties of event cameras and we show
in our experiments that this comes without sacrificing accuracy. We evaluate
our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and
object detection. Tokens from Spiking Patches yield inference times that are up
to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We
achieve this while matching their accuracy and even surpassing in some cases
with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for
object detection. Thus, tokenization constitutes a novel direction in
event-based vision and marks a step towards methods that preserve the
properties of event cameras.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Gang Li, Chunlei Zhai, Teng Wang, Shaun Li, Shangsong Jiang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Visual navigation algorithms for quadrotors often exhibit a large variation
in performance when transferred across different vehicle platforms and scene
geometries, which increases the cost and risk of field deployment. To support
systematic early-stage evaluation, we introduce FLYINGTRUST, a high-fidelity,
configurable benchmarking framework that measures how platform kinodynamics and
scenario structure jointly affect navigation robustness. FLYINGTRUST models
vehicle capability with two compact, physically interpretable indicators:
maximum thrust-to-weight ratio and axis-wise maximum angular acceleration. The
benchmark pairs a diverse scenario library with a heterogeneous set of real and
virtual platforms and prescribes a standardized evaluation protocol together
with a composite scoring method that balances scenario importance, platform
importance and performance stability. We use FLYINGTRUST to compare
representative optimization-based and learning-based navigation approaches
under identical conditions, performing repeated trials per platform-scenario
combination and reporting uncertainty-aware metrics. The results reveal
systematic patterns: navigation success depends predictably on platform
capability and scene geometry, and different algorithms exhibit distinct
preferences and failure modes across the evaluated conditions. These
observations highlight the practical necessity of incorporating both platform
capability and scenario structure into algorithm design, evaluation, and
selection, and they motivate future work on methods that remain robust across
diverse platforms and scenarios.
๐
2025-10-30
๐ ArXiv Preprint (physics.soc-ph)
๐ฅ Saleh Albeaik, Faisal Alsallum, Mohamad Alrished
ๆฅ็ๆ่ฆ
People tend to walk in groups, and interactions with those groups have a
significant impact on crowd behavior and pedestrian traffic dynamics. Social
norms can be seen as unwritten rules regulating people interactions in social
settings. This article studies people interactions with groups and the
emergence of group proxemics. Group zones, zone occupancy counts and people
clearance from the group are studied using naturalistic data. Analysis indicate
potential presence of three different zones in addition to the public zone.
People tend to remain in the public zone and only progressively get closer to
groups, and those closer approaches happen in a low frequency and for brief
periods of time.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Prathamesh Kothavale, Sravani Boddepalli
Robotics
ๆฅ็ๆ่ฆ
Conventional robots possess a limited understanding of their kinematics and
are confined to preprogrammed tasks, hindering their ability to leverage tools
efficiently. Driven by the essential components of tool usage - grasping the
desired outcome, selecting the most suitable tool, determining optimal tool
orientation, and executing precise manipulations - we introduce a pioneering
framework. Our novel approach expands the capabilities of the robot's inverse
kinematics solver, empowering it to acquire a sequential repertoire of actions
using tools of varying lengths. By integrating a simulation-learned action
trajectory with the tool, we showcase the practicality of transferring acquired
skills from simulation to real-world scenarios through comprehensive
experimentation. Remarkably, our extended inverse kinematics solver
demonstrates an impressive error rate of less than 1 cm. Furthermore, our
trained policy achieves a mean error of 8 cm in simulation. Noteworthy, our
model achieves virtually indistinguishable performance when employing two
distinct tools of different lengths. This research provides an indication of
potential advances in the exploration of all four fundamental aspects of tool
usage, enabling robots to master the intricate art of tool manipulation across
diverse tasks.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Huajie Tan, Cheng Chi, Xiansheng Chen, Yuheng Ji, Zhongxia Zhao et al.
RoboticsMultimodal
ๆฅ็ๆ่ฆ
The proliferation of collaborative robots across diverse tasks and
embodiments presents a central challenge: achieving lifelong adaptability,
scalable coordination, and robust scheduling in multi-agent systems. Existing
approaches, from vision-language-action (VLA) models to hierarchical
frameworks, fall short due to their reliance on limited or dividual-agent
memory. This fundamentally constrains their ability to learn over long
horizons, scale to heterogeneous teams, or recover from failures, highlighting
the need for a unified memory representation. To address these limitations, we
introduce RoboOS-NeXT, a unified memory-based framework for lifelong, scalable,
and robust multi-robot collaboration. At the core of RoboOS-NeXT is the novel
Spatio-Temporal-Embodiment Memory (STEM), which integrates spatial scene
geometry, temporal event history, and embodiment profiles into a shared
representation. This memory-centric design is integrated into a
brain-cerebellum framework, where a high-level brain model performs global
planning by retrieving and updating STEM, while low-level controllers execute
actions locally. This closed loop between cognition, memory, and execution
enables dynamic task allocation, fault-tolerant collaboration, and consistent
state synchronization. We conduct extensive experiments spanning complex
coordination tasks in restaurants, supermarkets, and households. Our results
demonstrate that RoboOS-NeXT achieves superior performance across heterogeneous
embodiments, validating its effectiveness in enabling lifelong, scalable, and
robust multi-robot collaboration. Project website:
https://flagopen.github.io/RoboOS/
๐
2025-10-30
๐ ArXiv Preprint (eess.SY)
๐ฅ David Leprich, Mario Rosenfelder, Markus Herrmann-Wicklmayr, Kathrin Flaรkamp, Peter Eberhard et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
This article proposes a modular optimal control framework for local
three-dimensional ellipsoidal obstacle avoidance, exemplarily applied to model
predictive path-following control. Static as well as moving obstacles are
considered. Central to the approach is a computationally efficient and
continuously differentiable condition for detecting collisions with ellipsoidal
obstacles. A novel two-stage optimization approach mitigates numerical issues
arising from the structure of the resulting optimal control problem. The
effectiveness of the approach is demonstrated through simulations and
real-world experiments with the Crazyflie quadrotor. This represents the first
hardware demonstration of an MPC controller of this kind for UAVs in a
three-dimensional task.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Guanxing Lu, Rui Zhao, Haitao Lin, He Zhang, Yansong Tang
RoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) is widely used to produce robust robotic
manipulation policies, but fine-tuning vision-language-action (VLA) models with
RL can be unstable due to inaccurate value estimates and sparse supervision at
intermediate steps. In contrast, imitation learning (IL) is easy to train but
often underperforms due to its offline nature. In this paper, we propose
Hi-ORS, a simple yet effective post-training method that utilizes rejection
sampling to achieve both training stability and high robustness. Hi-ORS
stabilizes value estimation by filtering out negatively rewarded samples during
online fine-tuning, and adopts a reward-weighted supervised training objective
to provide dense intermediate-step supervision. For systematic study, we
develop an asynchronous inference-training framework that supports flexible
online human-in-the-loop corrections, which serve as explicit guidance for
learning error-recovery behaviors. Across three real-world tasks and two
embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich
manipulation in just 1.5 hours of real-world training, outperforming RL and IL
baselines by a substantial margin in both effectiveness and efficiency.
Notably, the fine-tuned policy exhibits strong test-time scalability by
reliably executing complex error-recovery behaviors to achieve better
performance.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Kazuma Kano, Yuki Mori, Shin Katayama, Kenta Urano, Takuro Yonezawa et al.
ๆฅ็ๆ่ฆ
Worker location data is key to higher productivity in industrial sites.
Cameras are a promising tool for localization in logistics warehouses since
they also offer valuable environmental contexts such as package status.
However, identifying individuals with only visual data is often impractical.
Accordingly, several prior studies identified people in videos by comparing
their trajectories and wearable sensor measurements. While this approach has
advantages such as independence from appearance, the existing methods may break
down under real-world conditions. To overcome this challenge, we propose CorVS,
a novel data-driven person identification method based on correspondence
between visual tracking trajectories and sensor measurements. Firstly, our deep
learning model predicts correspondence probabilities and reliabilities for
every pair of a trajectory and sensor measurements. Secondly, our algorithm
matches the trajectories and sensor measurements over time using the predicted
probabilities and reliabilities. We developed a dataset with actual warehouse
operations and demonstrated the method's effectiveness for real-world
applications.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Ilya Kurinov, Miroslav Ivanov, Grzegorz Orzechowski, Aki Mikkola
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Forestry forwarders play a central role in mechanized timber harvesting by
picking up and moving logs from the felling site to a processing area or a
secondary transport vehicle. Forwarder operation is challenging and physically
and mentally exhausting for the operator who must control the machine in remote
areas for prolonged periods of time. Therefore, even partial automation of the
process may reduce stress on the operator. This study focuses on continuing
previous research efforts in application of reinforcement learning agents in
automating log handling process, extending the task from grasping which was
studied in previous research to full log loading operation. The resulting agent
will be capable to automate a full loading procedure from locating and
grappling to transporting and delivering the log to a forestry forwarder bed.
To train the agent, a trailer type forestry forwarder simulation model in
NVIDIA's Isaac Gym and a virtual environment for a typical log loading scenario
were developed. With reinforcement learning agents and a curriculum learning
approach, the trained agent may be a stepping stone towards application of
reinforcement learning agents in automation of the forestry forwarder. The
agent learnt grasping a log in a random position from grapple's random position
and transport it to the bed with 94% success rate of the best performing agent.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Tobias Lรถw, Cem Bilaloglu, Sylvain Calinon
Robotics
ๆฅ็ๆ่ฆ
Many tasks in human environments require collaborative behavior between
multiple kinematic chains, either to provide additional support for carrying
big and bulky objects or to enable the dexterity that is required for in-hand
manipulation. Since these complex systems often have a very high number of
degrees of freedom coordinating their movements is notoriously difficult to
model. In this article, we present the derivation of the theoretical
foundations for cooperative task spaces of multi-arm robotic systems based on
geometric primitives defined using conformal geometric algebra. Based on the
similarity transformations of these cooperative geometric primitives, we derive
an abstraction of complex robotic systems that enables representing these
systems in a way that directly corresponds to single-arm systems. By deriving
the associated analytic and geometric Jacobian matrices, we then show the
straightforward integration of our approach into classical control techniques
rooted in operational space control. We demonstrate this using bimanual
manipulators, humanoids and multi-fingered hands in optimal control experiments
for reaching desired geometric primitives and in teleoperation experiments
using differential kinematics control. We then discuss how the geometric
primitives naturally embed nullspace structures into the controllers that can
be exploited for introducing secondary control objectives. This work,
represents the theoretical foundations of this cooperative manipulation control
framework, and thus the experiments are presented in an abstract way, while
giving pointers towards potential future applications.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Mirko Usuelli, David Rapado-Rincon, Gert Kootstra, Matteo Matteucci
RoboticsMultimodal
ๆฅ็ๆ่ฆ
Autonomous robots in orchards require real-time 3D scene understanding
despite repetitive row geometry, seasonal appearance changes, and wind-driven
foliage motion. We present AgriGS-SLAM, a Visual--LiDAR SLAM framework that
couples direct LiDAR odometry and loop closures with multi-camera 3D Gaussian
Splatting (3DGS) rendering. Batch rasterization across complementary viewpoints
recovers orchard structure under occlusions, while a unified gradient-driven
map lifecycle executed between keyframes preserves fine details and bounds
memory. Pose refinement is guided by a probabilistic LiDAR-based depth
consistency term, back-propagated through the camera projection to tighten
geometry-appearance coupling. We deploy the system on a field platform in apple
and pear orchards across dormancy, flowering, and harvesting, using a
standardized trajectory protocol that evaluates both training-view and
novel-view synthesis to reduce 3DGS overfitting in evaluation. Across seasons
and sites, AgriGS-SLAM delivers sharper, more stable reconstructions and
steadier trajectories than recent state-of-the-art 3DGS-SLAM baselines while
maintaining real-time performance on-tractor. While demonstrated in orchard
monitoring, the approach can be applied to other outdoor domains requiring
robust multimodal perception.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Gangyang Li, Qing Shi, Youhao Hu, Jincheng Hu, Zhongyuan Wang et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Humanoids hold great potential for service, industrial, and rescue
applications, in which robots must sustain whole-body stability while
performing intense, contact-rich interactions with the environment. However,
enabling humanoids to generate human-like, adaptive responses under such
conditions remains a major challenge. To address this, we propose Thor, a
humanoid framework for human-level whole-body reactions in contact-rich
environments. Based on the robot's force analysis, we design a force-adaptive
torso-tilt (FAT2) reward function to encourage humanoids to exhibit human-like
responses during force-interaction tasks. To mitigate the high-dimensional
challenges of humanoid control, Thor introduces a reinforcement learning
architecture that decouples the upper body, waist, and lower body. Each
component shares global observations of the whole body and jointly updates its
parameters. Finally, we deploy Thor on the Unitree G1, and it substantially
outperforms baselines in force-interaction tasks. Specifically, the robot
achieves a peak pulling force of 167.7 N (approximately 48% of the G1's body
weight) when moving backward and 145.5 N when moving forward, representing
improvements of 68.9% and 74.7%, respectively, compared with the
best-performing baseline. Moreover, Thor is capable of pulling a loaded rack
(130 N) and opening a fire door with one hand (60 N). These results highlight
Thor's effectiveness in enhancing humanoid force-interaction capabilities.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Kyungmin Lee, Sibeen Kim, Minho Park, Hyunseung Kim, Dongyoon Hwang et al.
Robotics
ๆฅ็ๆ่ฆ
Motion imitation is a promising approach for humanoid locomotion, enabling
agents to acquire humanlike behaviors. Existing methods typically rely on
high-quality motion capture datasets such as AMASS, but these are scarce and
expensive, limiting scalability and diversity. Recent studies attempt to scale
data collection by converting large-scale internet videos, exemplified by
Humanoid-X. However, they often introduce physical artifacts such as floating,
penetration, and foot skating, which hinder stable imitation. In response, we
introduce PHUMA, a Physically-grounded HUMAnoid locomotion dataset that
leverages human video at scale, while addressing physical artifacts through
careful data curation and physics-constrained retargeting. PHUMA enforces joint
limits, ensures ground contact, and eliminates foot skating, producing motions
that are both large-scale and physically reliable. We evaluated PHUMA in two
sets of conditions: (i) imitation of unseen motion from self-recorded test
videos and (ii) path following with pelvis-only guidance. In both cases,
PHUMA-trained policies outperform Humanoid-X and AMASS, achieving significant
gains in imitating diverse motions. The code is available at
https://davian-robotics.github.io/PHUMA.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Satoshi Kikuch, Masaya Kato, Tsuyoshi Tasaki
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Self-localization on a 3D map by using an inexpensive monocular camera is
required to realize autonomous driving. Self-localization based on a camera
often uses a convolutional neural network (CNN) that can extract local features
that are calculated by nearby pixels. However, when dynamic obstacles, such as
people, are present, CNN does not work well. This study proposes a new method
combining CNN with Vision Transformer, which excels at extracting global
features that show the relationship of patches on whole image. Experimental
results showed that, compared to the state-of-the-art method (SOTA), the
accuracy improvement rate in a CG dataset with dynamic obstacles is 1.5 times
higher than that without dynamic obstacles. Moreover, the self-localization
error of our method is 20.1% smaller than that of SOTA on public datasets.
Additionally, our robot using our method can localize itself with 7.51cm error
on average, which is more accurate than SOTA.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Hahjin Lee, Young J. Kim
Machine Learning
ๆฅ็ๆ่ฆ
Trajectory planning for mobile robots in cluttered environments remains a
major challenge due to narrow passages, where conventional methods often fail
or generate suboptimal paths. To address this issue, we propose the adaptive
trajectory refinement algorithm, which consists of two main stages. First, to
ensure safety at the path-segment level, a segment-wise conservative collision
test is applied, where risk-prone trajectory path segments are recursively
subdivided until collision risks are eliminated. Second, to guarantee
pose-level safety, pose correction based on penetration direction and line
search is applied, ensuring that each pose in the trajectory is collision-free
and maximally clear from obstacles. Simulation results demonstrate that the
proposed method achieves up to 1.69x higher success rates and up to 3.79x
faster planning times than state-of-the-art approaches. Furthermore, real-world
experiments confirm that the robot can safely pass through narrow passages
while maintaining rapid planning performance.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Minseo Kwon, Young J. Kim
ๆฅ็ๆ่ฆ
Task and Motion Planning (TAMP) integrates high-level task planning with
low-level motion feasibility, but existing methods are costly in long-horizon
problems due to excessive motion sampling. While LLMs provide commonsense
priors, they lack 3D spatial reasoning and cannot ensure geometric or dynamic
feasibility. We propose a kinodynamic TAMP framework based on a hybrid state
tree that uniformly represents symbolic and numeric states during planning,
enabling task and motion decisions to be jointly decided. Kinodynamic
constraints embedded in the TAMP problem are verified by an off-the-shelf
motion planner and physics simulator, and a VLM guides exploring a TAMP
solution and backtracks the search based on visual rendering of the states.
Experiments on the simulated domains and in the real world show 32.14% -
1166.67% increased average success rates compared to traditional and LLM-based
TAMP planners and reduced planning time on complex problems, with ablations
further highlighting the benefits of VLM guidance.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Nestor O. Perez-Arancibia
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The term embodied intelligence (EI) conveys the notion that body morphology,
material properties, interaction with the environment, and control strategies
can be purposefully integrated into the process of robotic design to generate
intelligent behavior; in particular, locomotion and navigation. In this paper,
we discuss EI as a design principle for advanced microrobotics, with a
particular focus on co-design -- the simultaneous and interdependent
development of physical structure and behavioral function. To illustrate the
contrast between EI-inspired systems and traditional architectures that
decouple sensing, computation, and actuation, we present and discuss a
collection of robots developed by the author and his team at the Autonomous
Microrobotic Systems Laboratory (AMSL). These robots exhibit intelligent
behavior that emerges from their structural dynamics and the physical
interaction between their components and with the environment. Platforms such
as the Bee++, RoBeetle, SMALLBug, SMARTI, WaterStrider, VLEIBot+, and FRISSHBot
exemplify how feedback loops, decision logics, sensing mechanisms, and smart
actuation strategies can be embedded into the physical properties of the
robotic system itself. Along these lines, we contend that co-design is not only
a method for empirical optimization under constraints, but also an enabler of
EI, offering a scalable and robust alternative to classical control for
robotics at the mm-to-cm-scale.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ali Caglayan, Nevrez Imamoglu, Oguzhan Guclu, Ali Osman Serhatoglu, Ahmet Burak Can et al.
Machine Learning
ๆฅ็ๆ่ฆ
Attention models have recently emerged as a powerful approach, demonstrating
significant progress in various fields. Visualization techniques, such as class
activation mapping, provide visual insights into the reasoning of convolutional
neural networks (CNNs). Using network gradients, it is possible to identify
regions where the network pays attention during image recognition tasks.
Furthermore, these gradients can be combined with CNN features to localize more
generalizable, task-specific attentive (salient) regions within scenes.
However, explicit use of this gradient-based attention information integrated
directly into CNN representations for semantic object understanding remains
limited. Such integration is particularly beneficial for visual tasks like
simultaneous localization and mapping (SLAM), where CNN representations
enriched with spatially attentive object locations can enhance performance. In
this work, we propose utilizing task-specific network attention for RGB-D
indoor SLAM. Specifically, we integrate layer-wise attention information
derived from network gradients with CNN feature representations to improve
frame association performance. Experimental results indicate improved
performance compared to baseline methods, particularly for large environments.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Fan Yang, Lingyao Li, Yaxin Hu, Michael Rodgers, Renkai Ma
ๆฅ็ๆ่ฆ
Robots with anthropomorphic features are increasingly shaping how humans
perceive and morally engage with them. Our research investigates how different
levels of anthropomorphism influence protective responses to robot abuse,
extending the Computers as Social Actors (CASA) and uncanny valley theories
into a moral domain. In an experiment, we invite 201 participants to view
videos depicting abuse toward a robot with low (Spider), moderate (Two-Foot),
or high (Humanoid) anthropomorphism. To provide a comprehensive analysis, we
triangulate three modalities: self-report surveys measuring emotions and
uncanniness, physiological data from automated facial expression analysis, and
qualitative reflections. Findings indicate that protective responses are not
linear. The moderately anthropomorphic Two-Foot robot, rated highest in
eeriness and "spine-tingling" sensations consistent with the uncanny valley,
elicited the strongest physiological anger expressions. Self-reported anger and
guilt are significantly higher for both the Two-Foot and Humanoid robots
compared to the Spider. Qualitative findings further reveal that as
anthropomorphism increases, moral reasoning shifts from technical assessments
of property damage to condemnation of the abuser's character, while governance
proposals expand from property law to calls for quasi-animal rights and broader
societal responsibility. These results suggest that the uncanny valley does not
dampen moral concern but paradoxically heightens protective impulses, offering
critical implications for robot design, policy, and future legal frameworks.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Fan Yang, Renkai Ma, Yaxin Hu, Michael Rodgers, Lingyao Li
ๆฅ็ๆ่ฆ
Social robots like Moxie are designed to form strong emotional bonds with
children, but their abrupt discontinuation can cause significant struggles and
distress to children. When these services end, the resulting harm raises
complex questions of who bears responsibility when children's emotional bonds
are broken. Using the Moxie shutdown as a case study through a qualitative
survey of 72 U.S. participants, our findings show that the responsibility is
viewed as a shared duty across the robot company, parents, developers, and
government. However, these attributions varied by political ideology and
parental status of whether they have children. Participants' perceptions of
whether the robot service should continue are highly polarized; supporters
propose technical, financial, and governmental pathways for continuity, while
opponents cite business realities and risks of unhealthy emotional dependency.
Ultimately, this research contributes an empirically grounded shared
responsibility framework for safeguarding child-robot companionship by
detailing how accountability is distributed and contested, informing concrete
design and policy implications to mitigate the emotional harm of robot
discontinuation.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Chi Zhang, Mingrui Li, Wenzhe Tong, Xiaonan Huang
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Tensegrity robots combine rigid rods and elastic cables, offering high
resilience and deployability but posing major challenges for locomotion control
due to their underactuated and highly coupled dynamics. This paper introduces a
morphology-aware reinforcement learning framework that integrates a graph
neural network (GNN) into the Soft Actor-Critic (SAC) algorithm. By
representing the robot's physical topology as a graph, the proposed GNN-based
policy captures coupling among components, enabling faster and more stable
learning than conventional multilayer perceptron (MLP) policies. The method is
validated on a physical 3-bar tensegrity robot across three locomotion
primitives, including straight-line tracking and bidirectional turning. It
shows superior sample efficiency, robustness to noise and stiffness variations,
and improved trajectory accuracy. Notably, the learned policies transfer
directly from simulation to hardware without fine-tuning, achieving stable
real-world locomotion. These results demonstrate the advantages of
incorporating structural priors into reinforcement learning for tensegrity
robot control.
๐
2025-10-30
๐ ArXiv Preprint (Robotics)
๐ฅ Emily Steiner, Daniel van der Spuy, Futian Zhou, Afereti Pama, Minas Liarokapis et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
While autonomous racing performance in Time-Trial scenarios has seen
significant progress and development, autonomous wheel-to-wheel racing and
overtaking are still severely limited. These limitations are particularly
apparent in real-life driving scenarios where state-of-the-art algorithms
struggle to safely or reliably complete overtaking manoeuvres. This is
important, as reliable navigation around other vehicles is vital for safe
autonomous wheel-to-wheel racing. The F1Tenth Competition provides a useful
opportunity for developing wheel-to-wheel racing algorithms on a standardised
physical platform. The competition format makes it possible to evaluate
overtaking and wheel-to-wheel racing algorithms against the state-of-the-art.
This research presents a novel racing and overtaking agent capable of learning
to reliably navigate a track and overtake opponents in both simulation and
reality. The agent was deployed on an F1Tenth vehicle and competed against
opponents running varying competitive algorithms in the real world. The results
demonstrate that the agent's training against opponents enables deliberate
overtaking behaviours with an overtaking rate of 87% compared 56% for an agent
trained just to race.
๐
2025-10-30
๐ ArXiv Preprint (Computer Vision)
๐ฅ Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi et al.
Robotics
ๆฅ็ๆ่ฆ
Recent video generation models can produce high-fidelity, temporally coherent
videos, indicating that they may encode substantial world knowledge. Beyond
realistic synthesis, they also exhibit emerging behaviors indicative of visual
perception, modeling, and manipulation. Yet, an important question still
remains: Are video models ready to serve as zero-shot reasoners in challenging
visual reasoning scenarios? In this work, we conduct an empirical study to
comprehensively investigate this question, focusing on the leading and popular
Veo-3. We evaluate its reasoning behavior across 12 dimensions, including
spatial, geometric, physical, temporal, and embodied logic, systematically
characterizing both its strengths and failure modes. To standardize this study,
we curate the evaluation data into MME-CoF, a compact benchmark that enables
in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our
findings reveal that while current video models demonstrate promising reasoning
patterns on short-horizon spatial coherence, fine-grained grounding, and
locally consistent dynamics, they remain limited in long-horizon causal
reasoning, strict geometric constraints, and abstract logic. Overall, they are
not yet reliable as standalone zero-shot reasoners, but exhibit encouraging
signs as complementary visual engines alongside dedicated reasoning models.
Project page: https://video-cof.github.io
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hyunji Lee, Minseon Kim, Chinmay Singh, Matheus Pereira, Atharv Sonwane et al.
ๆฅ็ๆ่ฆ
As coding agents are increasingly deployed in large codebases, the need to
automatically design challenging, codebase-level evaluation is central. We
propose Gistify, a task where a coding LLM must create a single, minimal,
self-contained file that can reproduce a specific functionality of a codebase.
The coding LLM is given full access to a codebase along with a specific
entrypoint (e.g., a python command), and the generated file must replicate the
output of the same command ran under the full codebase, while containing only
the essential components necessary to execute the provided command. Success on
Gistify requires both structural understanding of the codebase, accurate
modeling of its execution flow as well as the ability to produce potentially
large code patches. Our findings show that current state-of-the-art models
struggle to reliably solve Gistify tasks, especially ones with long executions
traces.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often
suffers from instability due to the numerical mismatch between the training and
inference policies. While prior work has attempted to mitigate this issue
through algorithmic corrections or engineering alignments, we show that its
root cause lies in the floating point precision itself. The widely adopted
BF16, despite its large dynamic range, introduces large rounding errors that
breaks the consistency between training and inference. In this work, we
demonstrate that simply reverting to \textbf{FP16} effectively eliminates this
mismatch. The change is simple, fully supported by modern frameworks with only
a few lines of code change, and requires no modification to the model
architecture or learning algorithm. Our results suggest that using FP16
uniformly yields more stable optimization, faster convergence, and stronger
performance across diverse tasks, algorithms and frameworks. We hope these
findings motivate a broader reconsideration of precision trade-offs in RL
fine-tuning.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal et al.
ๆฅ็ๆ่ฆ
AIs have made rapid progress on research-oriented benchmarks of knowledge and
reasoning, but it remains unclear how these gains translate into economic value
and automation. To measure this, we introduce the Remote Labor Index (RLI), a
broadly multi-sector benchmark comprising real-world, economically valuable
projects designed to evaluate end-to-end agent performance in practical
settings. AI agents perform near the floor on RLI, with the highest-performing
agent achieving an automation rate of 2.5%. These results help ground
discussions of AI automation in empirical evidence, setting a common basis for
tracking AI impacts and enabling stakeholders to proactively navigate AI-driven
labor automation.
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Arnab Sen Sharma, Giordano Rogers, Natalie Shapira, David Bau
Natural Language Processing
ๆฅ็ๆ่ฆ
We investigate the mechanisms underlying a range of list-processing tasks in
LLMs, and we find that LLMs have learned to encode a compact, causal
representation of a general filtering operation that mirrors the generic
"filter" function of functional programming. Using causal mediation analysis on
a diverse set of list-processing tasks, we find that a small number of
attention heads, which we dub filter heads, encode a compact representation of
the filtering predicate in their query states at certain tokens. We demonstrate
that this predicate representation is general and portable: it can be extracted
and reapplied to execute the same filtering operation on different collections,
presented in different formats, languages, or even in tasks. However, we also
identify situations where transformer LMs can exploit a different strategy for
filtering: eagerly evaluating if an item satisfies the predicate and storing
this intermediate result as a flag directly in the item representations. Our
results reveal that transformer LMs can develop human-interpretable
implementations of abstract computational operations that generalize in ways
that are surprisingly similar to strategies used in traditional functional
programming patterns.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen
ๆฅ็ๆ่ฆ
A world model is an internal model that simulates how the world evolves.
Given past observations and actions, it predicts the future of both the
embodied agent and its environment. Accurate world models are essential for
enabling agents to think, plan, and reason effectively in complex, dynamic
settings. Despite rapid progress, current world models remain brittle and
degrade over long horizons. We argue that a central cause is representation
quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or
entangled latents make dynamics learning unnecessarily hard. We therefore ask
whether improving representation learning alone can substantially improve
world-model performance. In this work, we take a step toward building a truly
accurate world model by addressing a fundamental yet open problem: constructing
a model that can fully clone and overfit to a deterministic 3D world. We
propose Geometrically-Regularized World Models (GRWM), which enforces that
consecutive points along a natural sensory trajectory remain close in latent
representation space. This approach yields significantly improved latent
representations that align closely with the true topology of the environment.
GRWM is plug-and-play, requires only minimal architectural modification, scales
with trajectory length, and is compatible with diverse latent generative
backbones. Across deterministic 3D settings and long-horizon prediction tasks,
GRWM significantly increases rollout fidelity and stability. Analyses show that
its benefits stem from learning a latent manifold with superior geometric
structure. These findings support a clear takeaway: improving representation
learning is a direct and useful path to robust world models, delivering
reliable long-horizon predictions without enlarging the dynamics module.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Jungyeon Koh, Hyeonsu Lyu, Jonggyu Jang, Hyun Jong Yang
ๆฅ็ๆ่ฆ
How can we explain the influence of training data on black-box models?
Influence functions (IFs) offer a post-hoc solution by utilizing gradients and
Hessians. However, computing the Hessian for an entire dataset is
resource-intensive, necessitating a feasible alternative. A common approach
involves randomly sampling a small subset of the training data, but this method
often results in highly inconsistent IF estimates due to the high variance in
sample configurations. To address this, we propose two advanced sampling
techniques based on features and logits. These samplers select a small yet
representative subset of the entire dataset by considering the stochastic
distribution of features or logits, thereby enhancing the accuracy of IF
estimations. We validate our approach through class removal experiments, a
typical application of IFs, using the F1-score to measure how effectively the
model forgets the removed class while maintaining inference consistency on the
remaining classes. Our method reduces computation time by 30.1% and memory
usage by 42.2%, or improves the F1-score by 2.5% compared to the baseline.
๐
2025-10-30
๐ ArXiv Preprint (Machine Learning)
๐ฅ Marco Federici, Riccardo Del Chiaro, Boris van Breugel, Paul Whatmough, Markus Nagel
ๆฅ็ๆ่ฆ
Quantization is the key method for reducing inference latency, power and
memory footprint of generative AI models. However, accuracy often degrades
sharply when activations are quantized below eight bits. Recent work suggests
that invertible linear transformations (e.g. rotations) can aid quantization,
by reparameterizing feature channels and weights. In this paper, we propose
\textit{Sequence Transformation and Mixed Precision} (STaMP) quantization, a
novel strategy that applies linear transformations along the \textit{sequence}
dimension to exploit the strong local correlation in language and visual data.
By keeping a small number of tokens in each intermediate activation at higher
precision, we can maintain model accuracy at lower (average) activations
bit-widths. We evaluate STaMP on recent LVM and LLM architectures,
demonstrating that it significantly improves low bit width activation
quantization and complements established activation and weight quantization
methods including recent feature transformations.
๐
2025-10-30
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We present AMO-Bench, an Advanced Mathematical reasoning benchmark with
Olympiad level or even higher difficulty, comprising 50 human-crafted problems.
Existing benchmarks have widely leveraged high school math competitions for
evaluating mathematical reasoning capabilities of large language models (LLMs).
However, many existing math competitions are becoming less effective for
assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To
address this, AMO-Bench introduces more rigorous challenges by ensuring all 50
problems are (1) cross-validated by experts to meet at least the International
Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original
problems to prevent potential performance leakages from data memorization.
Moreover, each problem in AMO-Bench requires only a final answer rather than a
proof, enabling automatic and robust grading for evaluation. Experimental
results across 26 LLMs on AMO-Bench show that even the best-performing model
achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%.
Beyond these poor performances, our further analysis reveals a promising
scaling trend with increasing test-time compute on AMO-Bench. These results
highlight the significant room for improving the mathematical reasoning in
current LLMs. We release AMO-Bench to facilitate further research into
advancing the reasoning abilities of language models.
https://amo-bench.github.io/
๐
2025-10-30
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ William Overman, Mohsen Bayati
Robotics
ๆฅ็ๆ่ฆ
As increasingly capable agents are deployed, a central safety question is how
to retain meaningful human control without modifying the underlying system. We
study a minimal control interface where an agent chooses whether to act
autonomously (play) or defer (ask), while a human simultaneously chooses
whether to be permissive (trust) or to engage in oversight (oversee). If the
agent defers, the human's choice determines the outcome, potentially leading to
a corrective action or a system shutdown. We model this interaction as a
two-player Markov Game. Our analysis focuses on cases where this game qualifies
as a Markov Potential Game (MPG), a class of games where we can provide an
alignment guarantee: under a structural assumption on the human's value
function, any decision by the agent to act more autonomously that benefits
itself cannot harm the human's value. We also analyze extensions to this MPG
framework. Theoretically, this perspective provides conditions for a specific
form of intrinsic alignment. If the reward structures of the human-agent game
meet these conditions, we have a formal guarantee that the agent improving its
own outcome will not harm the human's. Practically, this model motivates a
transparent control layer with predictable incentives where the agent learns to
defer when risky and act when safe, while its pretrained policy and the
environment's reward structure remain untouched. Our gridworld simulation shows
that through independent learning, the agent and human discover their optimal
oversight roles. The agent learns to ask when uncertain and the human learns
when to oversee, leading to an emergent collaboration that avoids safety
violations introduced post-training. This demonstrates a practical method for
making misaligned models safer after deployment.
๐
2025-10-29
๐ NeurIPS 2025
๐ฅ Ali Rasekh, Erfan Bagheri Soula, Omid Daliran, Simon Gottschalk, Mohsen Fayyaz
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Despite significant advances in Multimodal Large Language Models (MLLMs),
understanding complex temporal dynamics in videos remains a major challenge.
Our experiments show that current Video Large Language Model (Video-LLM)
architectures have critical limitations in temporal understanding, struggling
with tasks that require detailed comprehension of action sequences and temporal
progression. In this work, we propose a Video-LLM architecture that introduces
stacked temporal attention modules directly within the vision encoder. This
design incorporates a temporal attention in vision encoder, enabling the model
to better capture the progression of actions and the relationships between
frames before passing visual tokens to the LLM. Our results show that this
approach significantly improves temporal reasoning and outperforms existing
models in video question answering tasks, specifically in action recognition.
We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to
+5.5%. By enhancing the vision encoder with temporal structure, we address a
critical gap in video understanding for Video-LLMs. Project page and code are
available at: https://alirasekh.github.io/STAVEQ2/.
๐
2025-10-29
๐ ArXiv Preprint (eess.IV)
๐ฅ Xinqi Li, Yi Zhang, Li-Ting Huang, Hsiao-Huang Chang, Thoralf Niendorf et al.
ๆฅ็ๆ่ฆ
Multiparametric mapping MRI has become a viable tool for myocardial tissue
characterization. However, misalignment between multiparametric maps makes
pixel-wise analysis challenging. To address this challenge, we developed a
generalizable physics-informed deep-learning model using test-time adaptation
to enable group image registration across contrast weighted images acquired
from multiple physical models (e.g., a T1 mapping model and T2 mapping model).
The physics-informed adaptation utilized the synthetic images from specific
physics model as registration reference, allows for transductive learning for
various tissue contrast. We validated the model in healthy volunteers with
various MRI sequences, demonstrating its improvement for multi-modal
registration with a wide range of image contrast variability.
๐
2025-10-29
๐ Submitted to Hydrology and Earth System Sciences
๐ฅ Bilal Hassan, Areg Karapetyan, Aaron Chung Hin Chow, Samer Madanat
Machine Learning
ๆฅ็ๆ่ฆ
Climate change and sea-level rise (SLR) pose escalating threats to coastal
cities, intensifying the need for efficient and accurate methods to predict
potential flood hazards. Traditional physics-based hydrodynamic simulators,
although precise, are computationally expensive and impractical for city-scale
coastal planning applications. Deep Learning (DL) techniques offer promising
alternatives, however, they are often constrained by challenges such as data
scarcity and high-dimensional output requirements. Leveraging a recently
proposed vision-based, low-resource DL framework, we develop a novel,
lightweight Convolutional Neural Network (CNN)-based model designed to predict
coastal flooding under variable SLR projections and shoreline adaptation
scenarios. Furthermore, we demonstrate the ability of the model to generalize
across diverse geographical contexts by utilizing datasets from two distinct
regions: Abu Dhabi and San Francisco. Our findings demonstrate that the
proposed model significantly outperforms state-of-the-art methods, reducing the
mean absolute error (MAE) in predicted flood depth maps on average by nearly
20%. These results highlight the potential of our approach to serve as a
scalable and practical tool for coastal flood management, empowering
decision-makers to develop effective mitigation strategies in response to the
growing impacts of climate change. Project Page: https://caspiannet.github.io/
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish et al.
Computer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Humans can naturally identify, reason about, and explain anomalies in their
environment. In computer vision, this long-standing challenge remains limited
to industrial defects or unrealistic, synthetically generated anomalies,
failing to capture the richness and unpredictability of real-world anomalies.
In this work, we introduce CAVE, the first benchmark of real-world visual
anomalies. CAVE supports three open-ended tasks: anomaly description,
explanation, and justification; with fine-grained annotations for visual
grounding and categorizing anomalies based on their visual manifestations,
their complexity, severity, and commonness. These annotations draw inspiration
from cognitive science research on how humans identify and resolve anomalies,
providing a comprehensive framework for evaluating Vision-Language Models
(VLMs) in detecting and understanding anomalies. We show that state-of-the-art
VLMs struggle with visual anomaly perception and commonsense reasoning, even
with advanced prompting strategies. By offering a realistic and cognitively
grounded benchmark, CAVE serves as a valuable resource for advancing research
in anomaly detection and commonsense reasoning in VLMs.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Bai Li, Achilleas Kourtellis, Rong Cao, Joseph Post, Brian Porter et al.
Robotics
ๆฅ็ๆ่ฆ
Rapid and reliable incident detection is critical for reducing crash-related
fatalities, injuries, and congestion. However, conventional methods, such as
closed-circuit television, dashcam footage, and sensor-based detection,
separate detection from verification, suffer from limited flexibility, and
require dense infrastructure or high penetration rates, restricting
adaptability and scalability to shifting incident hotspots. To overcome these
challenges, we developed DARTS, a drone-based, AI-powered real-time traffic
incident detection system. DARTS integrates drones' high mobility and aerial
perspective for adaptive surveillance, thermal imaging for better
low-visibility performance and privacy protection, and a lightweight deep
learning framework for real-time vehicle trajectory extraction and incident
detection. The system achieved 99% detection accuracy on a self-collected
dataset and supports simultaneous online visual verification, severity
assessment, and incident-induced congestion propagation monitoring via a
web-based interface. In a field test on Interstate 75 in Florida, DARTS
detected and verified a rear-end collision 12 minutes earlier than the local
transportation management center and monitored incident-induced congestion
propagation, suggesting potential to support faster emergency response and
enable proactive traffic control to reduce congestion and secondary crash risk.
Crucially, DARTS's flexible deployment architecture reduces dependence on
frequent physical patrols, indicating potential scalability and
cost-effectiveness for use in remote areas and resource-constrained settings.
This study presents a promising step toward a more flexible and integrated
real-time traffic incident detection system, with significant implications for
the operational efficiency and responsiveness of modern transportation
management.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xinhua Wang, Caibo Feng, Xiangjun Fu, Chunxiao Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
We propose an innovative enhancement to the Mamba framework by increasing the
Hausdorff dimension of its scanning pattern through a novel Hilbert Selective
Scan mechanism. This mechanism explores the feature space more effectively,
capturing intricate fine-scale details and improving overall coverage. As a
result, it mitigates information inconsistencies while refining spatial
locality to better capture subtle local interactions without sacrificing the
model's ability to handle long-range dependencies. Extensive experiments on
publicly available benchmarks demonstrate that our approach significantly
improves both the quantitative metrics and qualitative visual fidelity of
existing Mamba-based low-light image enhancement methods, all while reducing
computational resource consumption and shortening inference time. We believe
that this refined strategy not only advances the state-of-the-art in low-light
image enhancement but also holds promise for broader applications in fields
that leverage Mamba-based techniques.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Valentin Boussot, Cรฉdric Hรฉmon, Jean-Claude Nunes, Jean-Louis Dillenseger
ๆฅ็ๆ่ฆ
In this work, we address the TrackRAD2025 challenge of real-time tumor
tracking in cine-MRI sequences of the thoracic and abdominal regions under
strong data scarcity constraints. Two complementary strategies were explored:
(i) unsupervised registration with the IMPACT similarity metric and (ii)
foundation model-based segmentation leveraging SAM 2.1 and its recent variants
through prompt-based interaction. Due to the one-second runtime constraint, the
SAM-based method was ultimately selected. The final configuration used SAM2.1
b+ with mask-based prompts from the first annotated slice, fine-tuned solely on
the small labeled subset from TrackRAD2025. Training was configured to minimize
overfitting, using 1024x1024 patches (batch size 1), standard augmentations,
and a balanced Dice + IoU loss. A low uniform learning rate (0.0001) was
applied to all modules (prompt encoder, decoder, Hiera backbone) to preserve
generalization while adapting to annotator-specific styles. Training lasted 300
epochs (~12h on RTX A6000, 48GB). The same inference strategy was consistently
applied across all anatomical sites and MRI field strengths. Test-time
augmentation was considered but ultimately discarded due to negligible
performance gains. The final model was selected based on the highest Dice
Similarity Coefficient achieved on the validation set after fine-tuning. On the
hidden test set, the model reached a Dice score of 0.8794, ranking 6th overall
in the TrackRAD2025 challenge. These results highlight the strong potential of
foundation models for accurate and real-time tumor tracking in MRI-guided
radiotherapy.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Roman Beliy, Amit Zalcher, Jonathan Kogman, Navve Wasserman, Michal Irani
Natural Language Processing
ๆฅ็ๆ่ฆ
Reconstructing images seen by people from their fMRI brain recordings
provides a non-invasive window into the human brain. Despite recent progress
enabled by diffusion models, current methods often lack faithfulness to the
actual seen images. We present "Brain-IT", a brain-inspired approach that
addresses this challenge through a Brain Interaction Transformer (BIT),
allowing effective interactions between clusters of functionally-similar
brain-voxels. These functional-clusters are shared by all subjects, serving as
building blocks for integrating information both within and across brains. All
model components are shared by all clusters & subjects, allowing efficient
training with a limited amount of data. To guide the image reconstruction, BIT
predicts two complementary localized patch-level image features: (i)high-level
semantic features which steer the diffusion model toward the correct semantic
content of the image; and (ii)low-level structural features which help to
initialize the diffusion process with the correct coarse layout of the image.
BIT's design enables direct flow of information from brain-voxel clusters to
localized image features. Through these principles, our method achieves image
reconstructions from fMRI that faithfully reconstruct the seen images, and
surpass current SotA approaches both visually and by standard objective
metrics. Moreover, with only 1-hour of fMRI data from a new subject, we achieve
results comparable to current methods trained on full 40-hour recordings.
๐
2025-10-29
๐ NeurIPS 2025
๐ฅ Sung-Hoon Yoon, Minghan Li, Gaspard Beaudouin, Congcong Wen, Muhammad Rafay Azhar et al.
Computer VisionRobotics
ๆฅ็ๆ่ฆ
Rectified flow models have become a de facto standard in image generation due
to their stable sampling trajectories and high-fidelity outputs. Despite their
strong generative capabilities, they face critical limitations in image editing
tasks: inaccurate inversion processes for mapping real images back into the
latent space, and gradient entanglement issues during editing often result in
outputs that do not faithfully reflect the target prompt. Recent efforts have
attempted to directly map source and target distributions via ODE-based
approaches without inversion; however,these methods still yield suboptimal
editing quality. In this work, we propose a flow decomposition-and-aggregation
framework built upon an inversion-free formulation to address these
limitations. Specifically, we semantically decompose the target prompt into
multiple sub-prompts, compute an independent flow for each, and aggregate them
to form a unified editing trajectory. While we empirically observe that
decomposing the original flow enhances diversity in the target space,
generating semantically aligned outputs still requires consistent guidance
toward the full target prompt. To this end, we design a projection and
soft-aggregation mechanism for flow, inspired by gradient conflict resolution
in multi-task learning. This approach adaptively weights the sub-target
velocity fields, suppressing semantic redundancy while emphasizing distinct
directions, thereby preserving both diversity and consistency in the final
edited output. Experimental results demonstrate that our method outperforms
existing zero-shot editing approaches in terms of semantic fidelity and
attribute disentanglement. The code is available at
https://github.com/Harvard-AI-and-Robotics-Lab/SplitFlow.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nikola L. Kolev, Tommaso Rodani, Neil J. Curson, Taylor J. Z. Stock, Alberto Cazzaniga
RoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Scanning tunnelling microscopy (STM) enables atomic-resolution imaging and
atom manipulation, but its utility is often limited by tip degradation and slow
serial data acquisition. Fabrication adds another layer of complexity since the
tip is often subjected to large voltages, which may alter the shape of its
apex, requiring it to be conditioned. Here, we propose a machine learning (ML)
approach for image repair and super-resolution to alleviate both challenges.
Using a dataset of only 36 pristine experimental images of Si(001):H, we
demonstrate that a physics-informed synthetic data generation pipeline can be
used to train several state-of-the-art flow-matching and diffusion models.
Quantitative evaluation with metrics such as the CLIP Maximum Mean Discrepancy
(CMMD) score and structural similarity demonstrates that our models are able to
effectively restore images and offer a two- to fourfold reduction in image
acquisition time by accurately reconstructing images from sparsely sampled
data. Our framework has the potential to significantly increase STM
experimental throughput by offering a route to reducing the frequency of
tip-conditioning procedures and to enhancing frame rates in existing high-speed
STM systems.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Denniz Goren, Holger Caesar
ๆฅ็ๆ่ฆ
The vulnerability of cyclists, exacerbated by the rising popularity of faster
e-bikes, motivates adapting automotive perception technologies for bicycle
safety. We use our multi-sensor 'SenseBike' research platform to develop and
evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the
automotive-to-bicycle domain gap, we introduce the novel BikeScenes-lidarseg
Dataset, comprising 3021 consecutive LiDAR scans around the university campus
of the TU Delft, semantically annotated for 29 dynamic and static classes. By
evaluating model performance, we demonstrate that fine-tuning on our BikeScenes
dataset achieves a mean Intersection-over-Union (mIoU) of 63.6%, significantly
outperforming the 13.8% obtained with SemanticKITTI pre-training alone. This
result underscores the necessity and effectiveness of domain-specific training.
We highlight key challenges specific to bicycle-mounted, hardware-constrained
perception systems and contribute the BikeScenes dataset as a resource for
advancing research in cyclist-centric LiDAR segmentation.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Vicky Kalogeiton, David Picard
Machine Learning
ๆฅ็ๆ่ฆ
Current text-to-image generative models are trained on large uncurated
datasets to enable diverse generation capabilities. However, this does not
align well with user preferences. Recently, reward models have been
specifically designed to perform post-hoc selection of generated images and
align them to a reward, typically user preference. This discarding of
informative data together with the optimizing for a single reward tend to harm
diversity, semantic fidelity and efficiency. Instead of this post-processing,
we propose to condition the model on multiple reward models during training to
let the model learn user preferences directly. We show that this not only
dramatically improves the visual quality of the generated images but it also
significantly speeds up the training. Our proposed method, called MIRO,
achieves state-of-the-art performances on the GenEval compositional benchmark
and user-preference scores (PickAScore, ImageReward, HPSv2).
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi et al.
ๆฅ็ๆ่ฆ
Visual effects (VFX) are crucial to the expressive power of digital media,
yet their creation remains a major challenge for generative AI. Prevailing
methods often rely on the one-LoRA-per-effect paradigm, which is
resource-intensive and fundamentally incapable of generalizing to unseen
effects, thus limiting scalability and creation. To address this challenge, we
introduce VFXMaster, the first unified, reference-based framework for VFX video
generation. It recasts effect generation as an in-context learning task,
enabling it to reproduce diverse dynamic effects from a reference video onto
target content. In addition, it demonstrates remarkable generalization to
unseen effect categories. Specifically, we design an in-context conditioning
strategy that prompts the model with a reference example. An in-context
attention mask is designed to precisely decouple and inject the essential
effect attributes, allowing a single unified model to master the effect
imitation without information leakage. In addition, we propose an efficient
one-shot effect adaptation mechanism to boost generalization capability on
tough unseen effects from a single user-provided video rapidly. Extensive
experiments demonstrate that our method effectively imitates various categories
of effect information and exhibits outstanding generalization to out-of-domain
effects. To foster future research, we will release our code, models, and a
comprehensive dataset to the community.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, Minghua Liu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Articulated 3D objects are central to many applications in robotics, AR/VR,
and animation. Recent approaches to modeling such objects either rely on
optimization-based reconstruction pipelines that require dense-view supervision
or on feed-forward generative models that produce coarse geometric
approximations and often overlook surface texture. In contrast, open-world 3D
generation of static objects has achieved remarkable success, especially with
the advent of native 3D diffusion models such as Trellis. However, extending
these methods to articulated objects by training native 3D diffusion models
poses significant challenges. In this work, we present FreeArt3D, a
training-free framework for articulated 3D object generation. Instead of
training a new model on limited articulated data, FreeArt3D repurposes a
pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape
prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by
treating articulation as an additional generative dimension. Given a few images
captured in different articulation states, FreeArt3D jointly optimizes the
object's geometry, texture, and articulation parameters without requiring
task-specific training or access to large-scale articulated datasets. Our
method generates high-fidelity geometry and textures, accurately predicts
underlying kinematic structures, and generalizes well across diverse object
categories. Despite following a per-instance optimization paradigm, FreeArt3D
completes in minutes and significantly outperforms prior state-of-the-art
approaches in both quality and versatility.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo et al.
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
Humans possess spatial reasoning abilities that enable them to understand
spaces through multimodal observations, such as vision and sound. Large
multimodal reasoning models extend these abilities by learning to perceive and
reason, showing promising performance across diverse spatial tasks. However,
systematic reviews and publicly available benchmarks for these models remain
limited. In this survey, we provide a comprehensive review of multimodal
spatial reasoning tasks with large models, categorizing recent progress in
multimodal large language models (MLLMs) and introducing open benchmarks for
evaluation. We begin by outlining general spatial reasoning, focusing on
post-training techniques, explainability, and architecture. Beyond classical 2D
tasks, we examine spatial relationship reasoning, scene and layout
understanding, as well as visual question answering and grounding in 3D space.
We also review advances in embodied AI, including vision-language navigation
and action models. Additionally, we consider emerging modalities such as audio
and egocentric video, which contribute to novel spatial understanding through
new sensors. We believe this survey establishes a solid foundation and offers
insights into the growing field of multimodal spatial reasoning. Updated
information about this survey, codes and implementation of the open benchmarks
can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zhi-Kai Chen, Jun-Peng Jiang, Han-Jia Ye, De-Chuan Zhan
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Autoregressive (AR) image generation models are capable of producing
high-fidelity images but often suffer from slow inference due to their
inherently sequential, token-by-token decoding process. Speculative decoding,
which employs a lightweight draft model to approximate the output of a larger
AR model, has shown promise in accelerating text generation without
compromising quality. However, its application to image generation remains
largely underexplored. The challenges stem from a significantly larger sampling
space, which complicates the alignment between the draft and target model
outputs, coupled with the inadequate use of the two-dimensional spatial
structure inherent in images, thereby limiting the modeling of local
dependencies. To overcome these challenges, we introduce Hawk, a new approach
that harnesses the spatial structure of images to guide the speculative model
toward more accurate and efficient predictions. Experimental results on
multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR
models, while preserving both image fidelity and diversity.
๐
2025-10-29
๐ ArXiv Preprint (Machine Learning)
๐ฅ Arani Roy, Marco P. Apolinario, Shristi Das Biswas, Kaushik Roy
Machine Learning
ๆฅ็ๆ่ฆ
Training deep neural networks (DNNs) with backpropagation (BP) achieves
state-of-the-art accuracy but requires global error propagation and full
parameterization, leading to substantial memory and computational overhead.
Direct Feedback Alignment (DFA) enables local, parallelizable updates with
lower memory requirements but is limited by unstructured feedback and poor
scalability in deeper architectures, specially convolutional neural networks.
To address these limitations, we propose a structured local learning framework
that operates directly on low-rank manifolds defined by the Singular Value
Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed
form, with updates applied to the SVD components using a composite loss that
integrates cross-entropy, subspace alignment, and orthogonality regularization.
Feedback matrices are constructed to match the SVD structure, ensuring
consistent alignment between forward and feedback pathways. Our method reduces
the number of trainable parameters relative to the original DFA model, without
relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100,
and ImageNet show that our method achieves accuracy comparable to that of BP.
Ablation studies confirm the importance of each loss term in the low-rank
setting. These results establish local learning on low-rank manifolds as a
principled and scalable alternative to full-rank gradient-based training.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Pengtao Chen, Xianfang Zeng, Maosen Zhao, Mingzhu Shen, Peng Ye et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recently, instruction-based image editing (IIE) has received widespread
attention. In practice, IIE often modifies only specific regions of an image,
while the remaining areas largely remain unchanged. Although these two types of
regions differ significantly in generation difficulty and computational
redundancy, existing IIE models do not account for this distinction, instead
applying a uniform generation process across the entire image. This motivates
us to propose RegionE, an adaptive, region-aware generation framework that
accelerates IIE tasks without additional training. Specifically, the RegionE
framework consists of three main components: 1) Adaptive Region Partition. We
observed that the trajectory of unedited regions is straight, allowing for
multi-step denoised predictions to be inferred in a single step. Therefore, in
the early denoising stages, we partition the image into edited and unedited
regions based on the difference between the final estimated result and the
reference image. 2) Region-Aware Generation. After distinguishing the regions,
we replace multi-step denoising with one-step prediction for unedited areas.
For edited regions, the trajectory is curved, requiring local iterative
denoising. To improve the efficiency and quality of local iterative generation,
we propose the Region-Instruction KV Cache, which reduces computational cost
while incorporating global information. 3) Adaptive Velocity Decay Cache.
Observing that adjacent timesteps in edited regions exhibit strong velocity
similarity, we further propose an adaptive velocity decay cache to accelerate
the local denoising process. We applied RegionE to state-of-the-art IIE base
models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE
achieved acceleration factors of 2.57, 2.41, and 2.06. Evaluations by GPT-4o
confirmed that semantic and perceptual fidelity were well preserved.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Doan-Van-Anh Ly, Thi-Thu-Hien Pham, Thanh-Hai Le
Computer VisionNatural Language Processing
ๆฅ็ๆ่ฆ
Segmentation of liver structures in multi-phase contrast-enhanced computed
tomography (CECT) plays a crucial role in computer-aided diagnosis and
treatment planning for liver diseases, including tumor detection. In this
study, we investigate the performance of UNet-based architectures for liver
tumor segmentation, starting from the original UNet and extending to UNet3+
with various backbone networks. We evaluate ResNet, Transformer-based, and
State-space (Mamba) backbones, all initialized with pretrained weights.
Surprisingly, despite the advances in modern architecture, ResNet-based models
consistently outperform Transformer- and Mamba-based alternatives across
multiple evaluation metrics. To further improve segmentation quality, we
introduce attention mechanisms into the backbone and observe that incorporating
the Convolutional Block Attention Module (CBAM) yields the best performance.
ResNetUNet3+ with CBAM module not only produced the best overlap metrics with a
Dice score of 0.755 and IoU of 0.662, but also achieved the most precise
boundary delineation, evidenced by the lowest HD95 distance of 77.911. The
model's superiority was further cemented by its leading overall accuracy of
0.925 and specificity of 0.926, showcasing its robust capability in accurately
identifying both lesion and healthy tissue. To further enhance
interpretability, Grad-CAM visualizations were employed to highlight the
region's most influential predictions, providing insights into its
decision-making process. These findings demonstrate that classical ResNet
architecture, when combined with modern attention modules, remain highly
competitive for medical image segmentation tasks, offering a promising
direction for liver tumor detection in clinical practice.
๐
2025-10-29
๐ NeurIPS 2025
๐ฅ Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer, Bernt Schiele
Machine Learning
ๆฅ็ๆ่ฆ
Deep networks have shown remarkable performance across a wide range of tasks,
yet getting a global concept-level understanding of how they function remains a
key challenge. Many post-hoc concept-based approaches have been introduced to
understand their workings, yet they are not always faithful to the model.
Further, they make restrictive assumptions on the concepts a model learns, such
as class-specificity, small spatial extent, or alignment to human expectations.
In this work, we put emphasis on the faithfulness of such concept-based
explanations and propose a new model with model-inherent mechanistic
concept-explanations. Our concepts are shared across classes and, from any
layer, their contribution to the logit and their input-visualization can be
faithfully traced. We also leverage foundation models to propose a new
concept-consistency metric, C$^2$-Score, that can be used to evaluate
concept-based methods. We show that, compared to prior work, our concepts are
quantitatively more consistent and users find our concepts to be more
interpretable, all while retaining competitive ImageNet performance.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongjie Zhang, Gideon Billings, Stefan B. Williams
Natural Language Processing
ๆฅ็ๆ่ฆ
Underwater infrastructure requires frequent inspection and maintenance due to
harsh marine conditions. Current reliance on human divers or remotely operated
vehicles is limited by perceptual and operational challenges, especially around
complex structures or in turbid water. Enhancing the spatial awareness of
underwater vehicles is key to reducing piloting risks and enabling greater
autonomy. To address these challenges, we present SPADE: SParsity Adaptive
Depth Estimator, a monocular depth estimation pipeline that combines
pre-trained relative depth estimator with sparse depth priors to produce dense,
metric scale depth maps. Our two-stage approach first scales the relative depth
map with the sparse depth points, then refines the final metric prediction with
our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves
improved accuracy and generalisation over state-of-the-art baselines and runs
efficiently at over 15 FPS on embedded hardware, promising to support practical
underwater inspection and intervention. This work has been submitted to IEEE
Journal of Oceanic Engineering Special Issue of AUV 2026.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani et al.
ๆฅ็ๆ่ฆ
Audio Descriptions (ADs) convey essential on-screen information, allowing
visually impaired audiences to follow videos. To be effective, ADs must form a
coherent sequence that helps listeners to visualise the unfolding scene, rather
than describing isolated moments. However, most automatic methods generate each
AD independently, often resulting in repetitive, incoherent descriptions. To
address this, we propose a training-free method, CoherentAD, that first
generates multiple candidate descriptions for each AD time interval, and then
performs auto-regressive selection across the sequence to form a coherent and
informative narrative. To evaluate AD sequences holistically, we introduce a
sequence-level metric, StoryRecall, which measures how well the predicted ADs
convey the ground truth narrative, alongside repetition metrics that capture
the redundancy across consecutive AD outputs. Our method produces coherent AD
sequences with enhanced narrative understanding, outperforming prior approaches
that rely on independent generations.
๐
2025-10-29
๐ NeurIPS 2025
๐ฅ Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagiotis Filntisis, Yannis Avrithis et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The progress of composed image retrieval (CIR), a popular research direction
in image retrieval, where a combined visual and textual query is used, is held
back by the absence of high-quality training and evaluation data. We introduce
a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an
instance-level class definition. The goal is to retrieve images that contain
the same particular object as the visual query, presented under a variety of
modifications defined by textual queries. Its design and curation process keep
the dataset compact to facilitate future research, while maintaining its
challenge-comparable to retrieval among more than 40M random
distractors-through a semi-automated selection of hard negatives.
To overcome the challenge of obtaining clean, diverse, and suitable training
data, we leverage pre-trained vision-and-language models (VLMs) in a
training-free approach called BASIC. The method separately estimates
query-image-to-image and query-text-to-image similarities, performing late
fusion to upweight images that satisfy both queries, while down-weighting those
that exhibit high similarity with only one of the two. Each individual
similarity is further improved by a set of components that are simple and
intuitive. BASIC sets a new state of the art on i-CIR but also on existing CIR
datasets that follow a semantic-level class definition. Project page:
https://vrg.fel.cvut.cz/icir/.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ M Yashwanth, Sharannya Ghosh, Aditay Tripathi, Anirban Chakraborty
Natural Language Processing
ๆฅ็ๆ่ฆ
Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has
proven highly effective as a parameter-efficient fine-tuning technique for
adapting large models to downstream tasks with limited data. Its parameter
efficiency makes it particularly suitable for Federated Learning (FL), where
both communication and computation budgets are often constrained. However,
global prompt tuning struggles to generalize across heterogeneous clients,
while personalized tuning overfits to local data and lacks generalization. We
propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt
Tuning), a unified framework designed to achieve both generalization and
personalization in federated prompt tuning of ViTs. Within this framework, we
introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on
class-specific prompts maintained alongside a globally shared prompt. For each
input, CCMP adaptively combines class-specific prompts using weights derived
from global class prototypes and client class priors. This approach enables
per-sample prompt personalization without storing client-dependent trainable
parameters. The prompts are collaboratively optimized via traditional federated
averaging technique on the same. Comprehensive evaluations on CIFAR-100,
TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT
consistently surpasses the state-of-the-art baselines under diverse data
heterogeneity scenarios, establishing a strong foundation for efficient and
generalizable federated prompt tuning of Vision Transformers.
๐
2025-10-29
๐ 11 pages, 2 Figures, MICCAI AMAI 2025 workshop, to
๐ฅ Ayman Abaid, Gianpiero Guidone, Sara Alsubai, Foziyah Alquahtani, Talha Iqbal et al.
Machine Learning
ๆฅ็ๆ่ฆ
Coronary artery calcium (CAC) scoring plays a crucial role in the early
detection and risk stratification of coronary artery disease (CAD). In this
study, we focus on non-contrast coronary computed tomography angiography (CCTA)
scans, which are commonly used for early calcification detection in clinical
settings. To address the challenge of limited annotated data, we propose a
radiomics-based pipeline that leverages pseudo-labeling to generate training
labels, thereby eliminating the need for expert-defined segmentations.
Additionally, we explore the use of pretrained foundation models, specifically
CT-FM and RadImageNet, to extract image features, which are then used with
traditional classifiers. We compare the performance of these deep learning
features with that of radiomics features. Evaluation is conducted on a clinical
CCTA dataset comprising 182 patients, where individuals are classified into two
groups: zero versus non-zero calcium scores. We further investigate the impact
of training on non-contrast datasets versus combined contrast and non-contrast
datasets, with testing performed only on non contrast scans. Results show that
radiomics-based models significantly outperform CNN-derived embeddings from
foundation models (achieving 84% accuracy and p<0.05), despite the
unavailability of expert annotations.
๐
2025-10-29
๐ Accepted by IEEE Transactions on Image Processing
๐ฅ Zhigang Tu, Zhengbo Zhang, Jia Gong, Junsong Yuan, Bo Du
ๆฅ็ๆ่ฆ
Skeleton-based human action recognition aims to classify human skeletal
sequences, which are spatiotemporal representations of actions, into predefined
categories. To reduce the reliance on costly annotations of skeletal sequences
while maintaining competitive recognition accuracy, the task of 3D Action
Recognition with Limited Training Samples, also known as semi-supervised 3D
Action Recognition, has been proposed. In addition, active learning, which aims
to proactively select the most informative unlabeled samples for annotation,
has been explored in semi-supervised 3D Action Recognition for training sample
selection. Specifically, researchers adopt an encoder-decoder framework to
embed skeleton sequences into a latent space, where clustering information,
combined with a margin-based selection strategy using a multi-head mechanism,
is utilized to identify the most informative sequences in the unlabeled set for
annotation. However, the most representative skeleton sequences may not
necessarily be the most informative for the action recognizer, as the model may
have already acquired similar knowledge from previously seen skeleton samples.
To solve it, we reformulate Semi-supervised 3D action recognition via active
learning from a novel perspective by casting it as a Markov Decision Process
(MDP). Built upon the MDP framework and its training paradigm, we train an
informative sample selection model to intelligently guide the selection of
skeleton sequences for annotation. To enhance the representational capacity of
the factors in the state-action pairs within our method, we project them from
Euclidean space to hyperbolic space. Furthermore, we introduce a meta tuning
strategy to accelerate the deployment of our method in real-world scenarios.
Extensive experiments on three 3D action recognition benchmarks demonstrate the
effectiveness of our method.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yuhang Hu, Zhenyu Yang, Shihan Wang, Shengsheng Qian, Bin Wen et al.
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The rapid growth of streaming video applications demands multimodal models
with enhanced capabilities for temporal dynamics understanding and complex
reasoning. However, current Video Question Answering (VideoQA) datasets suffer
from two critical limitations: 1) Static annotation mechanisms fail to capture
the evolving nature of answers in temporal video streams, and 2) The absence of
explicit reasoning process annotations restricts model interpretability and
logical deduction capabilities. To address these challenges, We introduce
StreamingCoT, the first dataset explicitly designed for temporally evolving
reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our
framework first establishes a dynamic hierarchical annotation architecture that
generates per-second dense descriptions and constructs temporally-dependent
semantic segments through similarity fusion, paired with question-answer sets
constrained by temporal evolution patterns. We further propose an explicit
reasoning chain generation paradigm that extracts spatiotemporal objects via
keyframe semantic alignment, derives object state transition-based reasoning
paths using large language models, and ensures logical coherence through
human-verified validation. This dataset establishes a foundation for advancing
research in streaming video understanding, complex temporal reasoning, and
multimodal inference. Our StreamingCoT and its construction toolkit can be
accessed at https://github.com/Fleeting-hyh/StreamingCoT.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Runxi Huang, Mingxuan Yu, Mingyu Tsoi, Xiaomin Ouyang
RoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Real-time multimodal inference on resource-constrained edge devices is
essential for applications such as autonomous driving, human-computer
interaction, and mobile health. However, prior work often overlooks the tight
coupling between sensing dynamics and model execution, as well as the complex
inter-modality dependencies. In this paper, we propose MMEdge, an new on-device
multi-modal inference framework based on pipelined sensing and encoding.
Instead of waiting for complete sensor inputs, MMEdge decomposes the entire
inference process into a sequence of fine-grained sensing and encoding units,
allowing computation to proceed incrementally as data arrive. MMEdge also
introduces a lightweight but effective temporal aggregation module that
captures rich temporal dynamics across different pipelined units to maintain
accuracy performance. Such pipelined design also opens up opportunities for
fine-grained cross-modal optimization and early decision-making during
inference. To further enhance system performance under resource variability and
input data complexity, MMEdge incorporates an adaptive multimodal configuration
optimizer that dynamically selects optimal sensing and model configurations for
each modality under latency constraints, and a cross-modal speculative skipping
mechanism that bypasses future units of slower modalities when early
predictions reach sufficient confidence. We evaluate MMEdge using two public
multimodal datasets and deploy it on a real-world unmanned aerial vehicle
(UAV)-based multimodal testbed. The results show that MMEdge significantly
reduces end-to-end latency while maintaining high task accuracy across various
system and data dynamics.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Yushen Huang, Zhiming Wang
Computer Vision
ๆฅ็ๆ่ฆ
Few-shot object detection (FSOD) often suffers from base-class bias and
unstable calibration when only a few novel samples are available. We propose
Prototype-Driven Alignment (PDA), a lightweight, plug-in metric head for DeFRCN
that provides a prototype-based "second opinion" complementary to the linear
classifier. PDA maintains support-only prototypes in a learnable
identity-initialized projection space and optionally applies
prototype-conditioned RoI alignment to reduce geometric mismatch. During
fine-tuning, prototypes can be adapted via exponential moving average(EMA)
updates on labeled foreground RoIs-without introducing class-specific
parameters-and are frozen at inference to ensure strict protocol compliance.
PDA employs a best-of-K matching scheme to capture intra-class multi-modality
and temperature-scaled fusion to combine metric similarities with detector
logits. Experiments on VOC FSOD and GFSOD benchmarks show that PDA consistently
improves novel-class performance with minimal impact on base classes and
negligible computational overhead.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ HyoJung Han, Sweta Agrawal, Eleftheria Briakou
Natural Language Processing
ๆฅ็ๆ่ฆ
Cross-lingual alignment (CLA) aims to align multilingual representations,
enabling Large Language Models (LLMs) to seamlessly transfer knowledge across
languages. While intuitive, we hypothesize, this pursuit of representational
convergence can inadvertently cause "cultural erasure", the functional loss of
providing culturally-situated responses that should diverge based on the query
language. In this work, we systematically analyze this trade-off by introducing
a holistic evaluation framework, the transfer-localization plane, which
quantifies both desirable knowledge transfer and undesirable cultural erasure.
Using this framework, we re-evaluate recent CLA approaches and find that they
consistently improve factual transfer at the direct cost of cultural
localization across all six languages studied. Our investigation into the
internal representations of these models reveals a key insight: universal
factual transfer and culturally-specific knowledge are optimally steerable at
different model layers. Based on this finding, we propose Surgical Steering, a
novel inference-time method that disentangles these two objectives. By applying
targeted activation steering to distinct layers, our approach achieves a better
balance between the two competing dimensions, effectively overcoming the
limitations of current alignment techniques.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Feijie Wu, Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Current tool-use large language models (LLMs) are trained on static datasets,
enabling them to interact with external tools and perform multi-step,
tool-integrated reasoning, which produces tool-call trajectories. However,
these models imitate how a query is resolved in a generic tool-call routine,
thereby failing to explore possible solutions and demonstrating limited
performance in an evolved, dynamic tool-call environment. In this work, we
propose PORTool, a reinforcement learning (RL) method that encourages a
tool-use LLM to explore various trajectories yielding the correct answer.
Specifically, this method starts with generating multiple rollouts for a given
query, and some of them share the first few tool-call steps, thereby forming a
tree-like structure. Next, we assign rewards to each step, based on its ability
to produce a correct answer and make successful tool calls. A shared step
across different trajectories receives the same reward, while different steps
under the same fork receive different rewards. Finally, these step-wise rewards
are used to calculate fork-relative advantages, blended with
trajectory-relative advantages, to train the LLM for tool use. The experiments
utilize 17 tools to address user queries, covering both time-sensitive and
time-invariant topics. We conduct ablation studies to systematically justify
the necessity and the design robustness of step-wise rewards. Furthermore, we
compare the proposed PORTool with other training approaches and demonstrate
significant improvements in final accuracy and the number of tool-call steps.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish et al.
Computer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Humans can naturally identify, reason about, and explain anomalies in their
environment. In computer vision, this long-standing challenge remains limited
to industrial defects or unrealistic, synthetically generated anomalies,
failing to capture the richness and unpredictability of real-world anomalies.
In this work, we introduce CAVE, the first benchmark of real-world visual
anomalies. CAVE supports three open-ended tasks: anomaly description,
explanation, and justification; with fine-grained annotations for visual
grounding and categorizing anomalies based on their visual manifestations,
their complexity, severity, and commonness. These annotations draw inspiration
from cognitive science research on how humans identify and resolve anomalies,
providing a comprehensive framework for evaluating Vision-Language Models
(VLMs) in detecting and understanding anomalies. We show that state-of-the-art
VLMs struggle with visual anomaly perception and commonsense reasoning, even
with advanced prompting strategies. By offering a realistic and cognitively
grounded benchmark, CAVE serves as a valuable resource for advancing research
in anomaly detection and commonsense reasoning in VLMs.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) often struggle with problems that require
multi-step reasoning. For small-scale open-source models, Reinforcement
Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely
sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to
overfit long demonstrations through rigid token-by-token imitation. To address
this gap, we propose Supervised Reinforcement Learning (SRL), a framework that
reformulates problem solving as generating a sequence of logical "actions". SRL
trains the model to generate an internal reasoning monologue before committing
to each action. It provides smoother rewards based on the similarity between
the model's actions and expert actions extracted from the SFT dataset in a
step-wise manner. This supervision offers richer learning signals even when all
rollouts are incorrect, while encouraging flexible reasoning guided by expert
demonstrations. As a result, SRL enables small models to learn challenging
problems previously unlearnable by SFT or RLVR. Moreover, initializing training
with SRL before refining with RLVR yields the strongest overall performance.
Beyond reasoning benchmarks, SRL generalizes effectively to agentic software
engineering tasks, establishing it as a robust and versatile training framework
for reasoning-oriented LLMs.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dinghong Song, Yuan Feng, Yiwei Wang, Shangye Chen, Cyril Guyot et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are widely used in generative applications such
as chatting, code generation, and reasoning. However, many realworld workloads
such as classification, question answering, recommendation, and text embedding
rely solely on the prefill stage of inference, where the model encodes input
sequences without performing autoregressive decoding. In these prefill only
scenarios, the self-attention computation becomes the primary performance
bottleneck due to its quadratic complexity with respect to sequence length. In
this paper, we observe that semantically different sentences often produce
similar attention maps across layers and heads. Building on this insight, we
propose AttnCache, a framework that accelerates the prefill stage of LLM
inference by retrieving and reusing similar attention maps. Based on an
attention map memorization database, AttnCache employs efficient caching and
similarity search techniques to identify and reuse pre-cached attention maps
during inference, thereby reducing the computational overhead of
self-attention. Experimental results show that AttnCache achieves an average of
1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x
attention speedup on GPU, with negligible accuracy degradation.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li
ๆฅ็ๆ่ฆ
AI accelerators, customized to AI workloads, provide cost-effective and
high-performance solutions for training and inference. Trainium, an AI
accelerator recently developed by Amazon Web Services (AWS), provides an
attractive option for LLM training and inference through its heterogeneous
architecture. However, leveraging Trainium architecture for high performance
can be challenging because of its systolic array architecture and special
requirement on data layout. In this paper, we design high-performance matrix
multiplication (matmul), a critical compute kernel, for LLM inference on
Trainium. We introduce a series of techniques customized to Trainium based on
kernel fusion and novel caching strategies to reduce data movement across the
software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive
matrix transpose. Evaluating with nine datasets and four recent LLMs, we show
that our system largely outperforms the state-of-the-art matmul implemented by
AWS on Trainium: at the level of matmul kernel, it achieves an average 1.35x
speedup (up to 2.22x), which translates to an average 1.66x speedup (up to
2.49x) for end-to-end LLM inference.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sina Bagheri Nezhad, Yao Li, Ameeta Agrawal
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) often struggle with complex mathematical
reasoning, where prose-based generation leads to unverified and arithmetically
unsound solutions. Current prompting strategies like Chain of Thought still
operate within this unreliable medium, lacking a mechanism for deterministic
verification. To address these limitations, we introduce SymCode, a
neurosymbolic framework that reframes mathematical problem-solving as a task of
verifiable code generation using the SymPy library. We evaluate SymCode on
challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating
significant accuracy improvements of up to 13.6 percentage points over
baselines. Our analysis shows that SymCode is not only more token-efficient but
also fundamentally shifts model failures from opaque logical fallacies towards
transparent, programmatic errors. By grounding LLM reasoning in a deterministic
symbolic engine, SymCode represents a key step towards more accurate and
trustworthy AI in formal domains.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohsinul Kabir, Tasnim Ahmed, Md Mezbaur Rahman, Polydoros Giannouris, Sophia Ananiadou
Natural Language Processing
ๆฅ็ๆ่ฆ
Machine Translation (MT) is widely employed to address resource scarcity in
low-resource languages by generating synthetic data from high-resource
counterparts. While sentiment preservation in translation has long been
studied, a critical but underexplored factor is the role of cultural alignment
between source and target languages. In this paper, we hypothesize that
semantic labels are drifted or altered during MT due to cultural divergence.
Through a series of experiments across culturally sensitive and neutral
domains, we establish three key findings: (1) MT systems, including modern
Large Language Models (LLMs), induce label drift during translation,
particularly in culturally sensitive domains; (2) unlike earlier statistical MT
tools, LLMs encode cultural knowledge, and leveraging this knowledge can
amplify label drift; and (3) cultural similarity or dissimilarity between
source and target languages is a crucial determinant of label preservation. Our
findings highlight that neglecting cultural factors in MT not only undermines
label fidelity but also risks misinterpretation and cultural conflict in
downstream applications.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Negar Foroutan, Paul Teiletche, Ayush Kumar Tarun, Antoine Bosselut
Natural Language Processing
ๆฅ็ๆ่ฆ
The impact of different multilingual data mixtures in pretraining large
language models (LLMs) has been a topic of ongoing debate, often raising
concerns about potential trade-offs between language coverage and model
performance (i.e., the curse of multilinguality). In this work, we investigate
these assumptions by training 1.1B and 3B parameter LLMs on diverse
multilingual corpora, varying the number of languages from 25 to 400. Our study
challenges common beliefs surrounding multilingual training. First, we find
that combining English and multilingual data does not necessarily degrade the
in-language performance of either group, provided that languages have a
sufficient number of tokens included in the pretraining corpus. Second, we
observe that using English as a pivot language (i.e., a high-resource language
that serves as a catalyst for multilingual generalization) yields benefits
across language families, and contrary to expectations, selecting a pivot
language from within a specific family does not consistently improve
performance for languages within that family. Lastly, we do not observe a
significant "curse of multilinguality" as the number of training languages
increases in models at this scale. Our findings suggest that multilingual data,
when balanced appropriately, can enhance language model capabilities without
compromising performance, even in low-resource settings
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Andrรฉ V. Duarte, Xuying li, Bin Zeng, Arlindo L. Oliveira, Lei Li et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
If we cannot inspect the training data of a large language model (LLM), how
can we ever know what it has seen? We believe the most compelling evidence
arises when the model itself freely reproduces the target content. As such, we
propose RECAP, an agentic pipeline designed to elicit and verify memorized
training data from LLM outputs. At the heart of RECAP is a feedback-driven
loop, where an initial extraction attempt is evaluated by a secondary language
model, which compares the output against a reference passage and identifies
discrepancies. These are then translated into minimal correction hints, which
are fed back into the target model to guide subsequent generations. In
addition, to address alignment-induced refusals, RECAP includes a jailbreaking
module that detects and overcomes such barriers. We evaluate RECAP on
EchoTrace, a new benchmark spanning over 30 full books, and the results show
that RECAP leads to substantial gains over single-iteration approaches. For
instance, with GPT-4.1, the average ROUGE-L score for the copyrighted text
extraction improved from 0.38 to 0.47 - a nearly 24% increase.
๐
2025-10-29
๐ Accepted for publication in the Proceedings of the
๐ฅ Soufiane Essahli, Oussama Sarsar, Imane Fouad, Anas Motii, Ahmed Bentajer
Natural Language Processing
ๆฅ็ๆ่ฆ
Social platforms distribute information at unprecedented speed, which in turn
accelerates the spread of misinformation and threatens public discourse. We
present FakeZero, a fully client-side, cross-platform browser extension that
flags unreliable posts on Facebook and X (formerly Twitter) while the user
scrolls. All computation, DOM scraping, tokenisation, Transformer inference,
and UI rendering run locally through the Chromium messaging API, so no personal
data leaves the device.FakeZero employs a three-stage training curriculum:
baseline fine-tuning and domain-adaptive training enhanced with focal loss,
adversarial augmentation, and post-training quantisation. Evaluated on a
dataset of 239,000 posts, the DistilBERT-Quant model (67.6 MB) reaches 97.1%
macro-F1, 97.4% accuracy, and an AUROC of 0.996, with a median latency of
approximately 103 ms on a commodity laptop. A memory-efficient TinyBERT-Quant
variant retains 95.7% macro-F1 and 96.1% accuracy while shrinking the model to
14.7 MB and lowering latency to approximately 40 ms, showing that high-quality
fake-news detection is feasible under tight resource budgets with only modest
performance loss.By providing inline credibility cues, the extension can serve
as a valuable tool for policymakers seeking to curb the spread of
misinformation across social networks. With user consent, FakeZero also opens
the door for researchers to collect large-scale datasets of fake news in the
wild, enabling deeper analysis and the development of more robust detection
techniques.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Frederico Belcavello, Ely Matos, Arthur Lorenzi, Lisandra Bonoto, Lรญvia Ruiz et al.
ๆฅ็ๆ่ฆ
The use of LLM-based applications as a means to accelerate and/or substitute
human labor in the creation of language resources and dataset is a reality.
Nonetheless, despite the potential of such tools for linguistic research,
comprehensive evaluation of their performance and impact on the creation of
annotated datasets, especially under a perspectivized approach to NLP, is still
missing. This paper contributes to reduction of this gap by reporting on an
extensive evaluation of the (semi-)automatization of FrameNet-like semantic
annotation by the use of an LLM-based semantic role labeler. The methodology
employed compares annotation time, coverage and diversity in three experimental
settings: manual, automatic and semi-automatic annotation. Results show that
the hybrid, semi-automatic annotation setting leads to increased frame
diversity and similar annotation coverage, when compared to the human-only
setting, while the automatic setting performs considerably worse in all
metrics, except for annotation time.
๐
2025-10-29
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Eitรกn Sprejer, Fernando Avalos, Augusto Bernardi, Jose Pedro Brito de Azevedo Faustino, Jacob Haimes et al.
Machine Learning
ๆฅ็ๆ่ฆ
Aligning LLM-based judges with human preferences is a significant challenge,
as they are difficult to calibrate and often suffer from rubric sensitivity,
bias, and instability. Overcoming this challenge advances key applications,
such as creating reliable reward models for Reinforcement Learning from Human
Feedback (RLHF) and building effective routing systems that select the
best-suited model for a given user query. In this work, we propose a framework
for modeling diverse, persona-based preferences by learning to aggregate
outputs from multiple rubric-conditioned judges. We investigate the performance
of this approach against naive baselines and assess its robustness through case
studies on both human and LLM-judges biases. Our primary contributions include
a persona-based method for synthesizing preference labels at scale and two
distinct implementations of our aggregator: Generalized Additive Model (GAM)
and a Multi-Layer Perceptron (MLP).
๐
2025-10-29
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used as raters for evaluation
tasks. However, their reliability is often limited for subjective tasks, when
human judgments involve subtle reasoning beyond annotation labels. Thinking
traces, the reasoning behind a judgment, are highly informative but challenging
to collect and curate. We present a human-LLM collaborative framework to infer
thinking traces from label-only annotations. The proposed framework uses a
simple and effective rejection sampling method to reconstruct these traces at
scale. These inferred thinking traces are applied to two complementary tasks:
(1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation
guidelines for proprietary LLM raters. Across multiple datasets, our methods
lead to significantly improved LLM-human agreement. Additionally, the refined
annotation guidelines increase agreement among different LLM models. These
results suggest that LLMs can serve as practical proxies for otherwise
unrevealed human thinking traces, enabling label-only corpora to be extended
into thinking-trace-augmented resources that enhance the reliability of LLM
raters.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, รric de la Clergerie et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
We release Gaperon, a fully open suite of French-English-coding language
models designed to advance transparency and reproducibility in large-scale
model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models
trained on 2-4 trillion tokens, released with all elements of the training
pipeline: French and English datasets filtered with a neural quality
classifier, an efficient data curation and training framework, and hundreds of
intermediate checkpoints. Through this work, we study how data filtering and
contamination interact to shape both benchmark and generative performance. We
find that filtering for linguistic quality enhances text fluency and coherence
but yields subpar benchmark results, and that late deliberate contamination --
continuing training on data mixes that include test sets -- recovers
competitive scores while only reasonably harming generation quality. We discuss
how usual neural filtering can unintentionally amplify benchmark leakage. To
support further research, we also introduce harmless data poisoning during
pretraining, providing a realistic testbed for safety studies. By openly
releasing all models, datasets, code, and checkpoints, Gaperon establishes a
reproducible foundation for exploring the trade-offs between data curation,
evaluation, safety, and openness in multilingual language model development.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Sriram Balasubramaniam, Samyadeep Basu, Koustava Goswami, Ryan Rossi, Varun Manjunatha et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly used for long-document question
answering, where reliable attribution to sources is critical for trust.
Existing post-hoc attribution methods work well for extractive QA but struggle
in multi-hop, abstractive, and semi-extractive settings, where answers
synthesize information across passages. To address these challenges, we argue
that post-hoc attribution can be reframed as a reasoning problem, where answers
are decomposed into constituent units, each tied to specific context. We first
show that prompting models to generate such decompositions alongside
attributions improves performance. Building on this, we introduce DecompTune, a
post-training method that teaches models to produce answer decompositions as
intermediate reasoning steps. We curate a diverse dataset of complex QA tasks,
annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and
14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards.
Across extensive experiments and ablations, DecompTune substantially improves
attribution quality, outperforming prior methods and matching or exceeding
state-of-the-art frontier models.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Chumeng Liang, Jiaxuan You
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Diagrams play a central role in research papers for conveying ideas, yet they
are often notoriously complex and labor-intensive to create. Although diagrams
are presented as images, standard image generative models struggle to produce
clear diagrams with well-defined structure. We argue that a promising direction
is to generate demonstration diagrams directly in textual form as SVGs, which
can leverage recent advances in large language models (LLMs). However, due to
the complexity of components and the multimodal nature of diagrams,
sufficiently discriminative and explainable metrics for evaluating the quality
of LLM-generated diagrams remain lacking. In this paper, we propose
DiagramEval, a novel evaluation metric designed to assess demonstration
diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams
as graphs, treating text elements as nodes and their connections as directed
edges, and evaluates diagram quality using two new groups of metrics: node
alignment and path alignment. For the first time, we effectively evaluate
diagrams produced by state-of-the-art LLMs on recent research literature,
quantitatively demonstrating the validity of our metrics. Furthermore, we show
how the enhanced explainability of our proposed metrics offers valuable
insights into the characteristics of LLM-generated diagrams. Code:
https://github.com/ulab-uiuc/diagram-eval.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma et al.
ๆฅ็ๆ่ฆ
Current evaluations of agents remain centered around one-shot task
completion, failing to account for the inherently iterative and collaborative
nature of many real-world problems, where human goals are often underspecified
and evolve. We argue for a shift from building and assessing task completion
agents to developing collaborative agents, assessed not only by the quality of
their final outputs but by how well they engage with and enhance human effort
throughout the problem-solving process. To support this shift, we introduce
collaborative effort scaling, a framework that captures how an agent's utility
grows with increasing user involvement. Through case studies and simulated
evaluations, we show that state-of-the-art agents often underperform in
multi-turn, real-world scenarios, revealing a missing ingredient in agent
design: the ability to sustain engagement and scaffold user understanding.
Collaborative effort scaling offers a lens for diagnosing agent behavior and
guiding development toward more effective interactions.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Modern LLMs are trained to "think" primarily via explicit text generation,
such as chain-of-thought (CoT), which defers reasoning to post-training and
under-leverages pre-training data. We present and open-source Ouro, named after
the recursive Ouroboros, a family of pre-trained Looped Language Models
(LoopLM) that instead build reasoning into the pre-training phase through (i)
iterative computation in latent space, (ii) an entropy-regularized objective
for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and
2.6B models enjoy superior performance that match the results of up to 12B SOTA
LLMs across a wide range of benchmarks. Through controlled experiments, we show
this advantage stems not from increased knowledge capacity, but from superior
knowledge manipulation capabilities. We also show that LoopLM yields reasoning
traces more aligned with final outputs than explicit CoT. We hope our results
show the potential of LoopLM as a novel scaling direction in the reasoning era.
Our model could be found in: http://ouro-llm.github.io.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Aakriti Shah, Thai Le
Natural Language Processing
ๆฅ็ๆ่ฆ
Unlearning in large language models (LLMs) is crucial for managing sensitive
data and correcting misinformation, yet evaluating its effectiveness remains an
open problem. We investigate whether persuasive prompting can recall factual
knowledge from deliberately unlearned LLMs across models ranging from 2.7B to
13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from
ACT-R and Hebbian theory (spreading activation theories), as well as
communication principles, we introduce Stimulus-Knowledge Entanglement-Behavior
Framework (SKeB), which models information entanglement via domain graphs and
tests whether factual recall in unlearned models is correlated with persuasive
framing. We develop entanglement metrics to quantify knowledge activation
patterns and evaluate factuality, non-factuality, and hallucination in outputs.
Our results show persuasive prompts substantially enhance factual knowledge
recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness
inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B). SKeB
provides a foundation for assessing unlearning completeness, robustness, and
overall behavior in LLMs.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu et al.
ๆฅ็ๆ่ฆ
Real-world language agents must handle complex, multi-step workflows across
diverse Apps. For instance, an agent may manage emails by coordinating with
calendars and file systems, or monitor a production database to detect
anomalies and generate reports following an operating manual. However, existing
language agent benchmarks often focus on narrow domains or simplified tasks
that lack the diversity, realism, and long-horizon complexity required to
evaluate agents' real-world performance. To address this gap, we introduce the
Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering
diverse Apps and tools, realistic environment setup, and reliable
execution-based evaluation. Toolathlon spans 32 software applications and 604
tools, ranging from everyday platforms such as Google Calendar and Notion to
professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools
are based on a high-quality set of Model Context Protocol (MCP) servers that we
may have revised or implemented ourselves. Unlike prior works, which primarily
ensure functional realism but offer limited environment state diversity, we
provide realistic initial environment states from real software, such as Canvas
courses with dozens of students or real financial spreadsheets. This benchmark
includes 108 manually sourced or crafted tasks in total, requiring interacting
with multiple Apps over around 20 turns on average to complete. Each task is
strictly verifiable through dedicated evaluation scripts. Comprehensive
evaluation of SOTA models highlights their significant shortcomings: the
best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate
with 20.2 tool calling turns on average, while the top open-weights model
DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development
of more capable language agents for real-world, long-horizon task execution.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Saeed AlMarri, Kristof Juhasz, Mathieu Ravaut, Gautier Marti, Hamdan Al Ahbabi et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are increasingly explored as flexible
alternatives to classical machine learning models for classification tasks
through zero-shot prompting. However, their suitability for structured tabular
data remains underexplored, especially in high-stakes financial applications
such as financial risk assessment. This study conducts a systematic comparison
between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art
gradient-boosting model, on a real-world loan default prediction task. We
evaluate their predictive performance, analyze feature attributions using SHAP,
and assess the reliability of LLM-generated self-explanations. While LLMs are
able to identify key financial risk indicators, their feature importance
rankings diverge notably from LightGBM, and their self-explanations often fail
to align with empirical SHAP attributions. These findings highlight the
limitations of LLMs as standalone models for structured financial risk
prediction and raise concerns about the trustworthiness of their self-generated
explanations. Our results underscore the need for explainability audits,
baseline comparisons with interpretable models, and human-in-the-loop oversight
when deploying LLMs in risk-sensitive financial environments.
๐
2025-10-29
๐ ACL 2025
๐ฅ Junyu Luo, Bohan Wu, Xiao Luo, Zhiping Xiao, Yiqiao Jin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Post-training of Large Language Models (LLMs) is crucial for unlocking their
task generalization potential and domain-specific capabilities. However, the
current LLM post-training paradigm faces significant data challenges, including
the high costs of manual annotation and diminishing marginal returns on data
scales. Therefore, achieving data-efficient post-training has become a key
research question. In this paper, we present the first systematic survey of
data-efficient LLM post-training from a data-centric perspective. We propose a
taxonomy of data-efficient LLM post-training methods, covering data selection,
data quality enhancement, synthetic data generation, data distillation and
compression, and self-evolving data ecosystems. We summarize representative
approaches in each category and outline future research directions. By
examining the challenges in data-efficient LLM post-training, we highlight open
problems and propose potential research avenues. We hope our work inspires
further exploration into maximizing the potential of data utilization in
large-scale model training. Paper List:
https://github.com/luo-junyu/Awesome-Data-Efficient-LLM
๐
2025-10-29
๐ ArXiv Preprint (cs.SE)
๐ฅ Jiayi Kuang, Yinghui Li, Xin Zhang, Yangning Li, Di Yin et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language model-based agents show promise for software engineering, but
environment configuration remains a bottleneck due to heavy manual effort and
scarce large-scale, high-quality datasets. Existing benchmarks assess only
end-to-end build/test success, obscuring where and why agents succeed or fail.
We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench,
which provides process-level trajectory assessment of fine-grained agent
capabilities during environment setup-planning, perception-driven error
diagnosis, feedback-driven repair, and action to execute final environment
configuration. Our task instances are automatically constructed by injecting
realistic README errors and are validated in Docker for scalable, high-quality
evaluation. Enconda-bench combines process-level analysis with end-to-end
executability to enable capability assessments beyond aggregate success rates.
Evaluations across state-of-the-art LLMs and agent frameworks show that while
agents can localize errors, they struggle to translate feedback into effective
corrections, limiting end-to-end performance. To our knowledge, Enconda-bench
is the first framework to provide process-level internal capability assessment
for environment configuration, offering actionable insights for improving
software engineering agents.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian et al.
Natural Language ProcessingMachine LearningMultimodal
ๆฅ็ๆ่ฆ
Unified vision-language models (UVLMs) must perform both understanding and
generation within a single architecture, but these tasks rely on heterogeneous
data and supervision, making it difficult to balance them during reinforcement
learning (RL). We propose PairUni, a unified framework that reorganizes data
into understanding-generation (UG) pairs and aligns optimization accordingly.
We first use GPT-o3 to augment single-task data, generating captions for
understanding samples and question-answer (QA) pairs for generation samples,
forming aligned pairs from the same instance. Additionally, for each generation
sample, we retrieve a semantically related understanding example to form a
retrieved pair, linking different but related data points. These paired
structures expose cross-task semantic correspondences and support consistent
policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware
variant based on Group Relative Policy Optimization. It assigns a similarity
score to each pair to modulate the advantage, strengthening learning from
well-aligned examples and reducing task interference. We curate a high-quality
dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on
the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on
various UVLMs, outperforming strong UVLM RL baselines. Codes are available at
https://github.com/Haochen-Wang409/PairUni.
๐
2025-10-29
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe
ๆฅ็ๆ่ฆ
ZK-SenseLM is a secure and auditable wireless sensing framework that pairs a
large-model encoder for Wi-Fi channel state information (and optionally mmWave
radar or RFID) with a policy-grounded decision layer and end-to-end
zero-knowledge proofs of inference. The encoder uses masked spectral
pretraining with phase-consistency regularization, plus a light cross-modal
alignment that ties RF features to compact, human-interpretable policy tokens.
To reduce unsafe actions under distribution shift, we add a calibrated
selective-abstention head; the chosen risk-coverage operating point is
registered and bound into the proof. We implement a four-stage proving
pipeline: (C1) feature sanity and commitment, (C2) threshold and version
binding, (C3) time-window binding, and (C4) PLONK-style proofs that the
quantized network, given the committed window, produced the logged action and
confidence. Micro-batched proving amortizes cost across adjacent windows, and a
gateway option offloads proofs from low-power devices. The system integrates
with differentially private federated learning and on-device personalization
without weakening verifiability: model hashes and the registered threshold are
part of each public statement. Across activity, presence or intrusion,
respiratory proxy, and RF fingerprinting tasks, ZK-SenseLM improves macro-F1
and calibration, yields favorable coverage-risk curves under perturbations, and
rejects tamper and replay with compact proofs and fast verification.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tarun Kumar Chawdhury, Jon D. Duke
Natural Language Processing
ๆฅ็ๆ่ฆ
Electronic Health Records (EHR) store clinical documentation as base64
encoded attachments in FHIR DocumentReference resources, which makes semantic
question answering difficult. Traditional vector database methods often miss
nuanced clinical relationships. The Clinical Entity Augmented Retrieval (CLEAR)
method, introduced by Lopez et al. 2025, uses entity aware retrieval and
achieved improved performance with an F1 score of 0.90 versus 0.86 for
embedding based retrieval, while using over 70 percent fewer tokens. We
developed a Clinical Notes QA Evaluation Platform to validate CLEAR against
zero shot large context inference and traditional chunk based retrieval
augmented generation. The platform was tested on 12 clinical notes ranging from
10,000 to 65,000 tokens representing realistic EHR content. CLEAR achieved a
58.3 percent win rate, an average semantic similarity of 0.878, and used 78
percent fewer tokens than wide context processing. The largest performance
gains occurred on long notes, with a 75 percent win rate for documents
exceeding 65,000 tokens. These findings confirm that entity aware retrieval
improves both efficiency and accuracy in clinical natural language processing.
The evaluation framework provides a reusable and transparent benchmark for
assessing clinical question answering systems where semantic precision and
computational efficiency are critical.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Electronic Health Records (EHRs) contain rich yet complex information, and
their automated analysis is critical for clinical decision-making. Despite
recent advances of large language models (LLMs) in clinical workflows, their
ability to analyze EHRs remains limited due to narrow task coverage and lack of
EHR-oriented reasoning capabilities. This paper aims to bridge the gap,
specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning
instruction dataset, comprising 300k high-quality reasoning cases and 4M
non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a
thinking-graph-driven framework that enables to generate high-quality reasoning
data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced
LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage
training paradigm, including domain adaptation, reasoning enhancement, and
reinforcement learning, EHR-R1 systematically acquires domain knowledge and
diverse reasoning capabilities, enabling accurate and robust EHR analysis.
Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning
42 tasks, to comprehensively assess reasoning and prediction across EHR
scenarios. In experiments, we show that the resulting EHR-R1 consistently
outperforms state-of-the-art commercial and open-source LLMs (including
DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and
achieving a 10\% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins,
EHR-R1, and EHR-Bench have significantly advanced the development for more
reliable and clinically relevant EHR analysis.
๐
2025-10-29
๐ NeurIPS 2025
๐ฅ Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Modern language models (LMs) exhibit strong deductive reasoning capabilities,
yet standard evaluations emphasize correctness while overlooking a key aspect
of human-like reasoning: efficiency. In real-world reasoning scenarios, much of
the available information is irrelevant, and effective deductive inference
requires identifying and ignoring such distractions. We propose a framework for
assessing LM reasoning efficiency through the lens of logic programming,
introducing a simple method to align proofs written in natural language -- as
generated by an LM -- with shortest proofs found by executing the logic
program. Efficiency is quantified by measuring how well a model avoids
unnecessary inference. Empirically, we construct a dataset of math word
problems injected with various number of irrelevant axioms that vary in
semantic overlap with the goal theorem. We find that current LMs show marked
accuracy declines under such conditions -- even with minimal, domain-consistent
distractions -- and the proofs they generate frequently exhibit detours through
irrelevant inferences.
๐
2025-10-29
๐ EMNLP
๐ฅ Davide Romano, Jonathan Schwarz, Daniele Giofrรฉ
Natural Language Processing
ๆฅ็ๆ่ฆ
Test-time scaling (TTS) techniques can improve the performance of large
language models (LLMs) at the expense of additional computation and latency.
While TTS has proven effective in formal domains such as mathematics and
programming, its value in argumentative domains such as law remains
underexplored. We present an empirical study of verifier-based TTS methods for
legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7
reward models, we evaluate both outcome-level (Best-of-$N$) and process-level
(tree search) verification under realistic low-$N$ budgets. Our analysis
systematically investigates how verifier utility is affected by key properties
such as domain specialization, model size, and supervision type
(process-supervised PRMs vs. outcome-only ORMs), even when applied across
different roles.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohammad Aghajani Asl, Behrooz Minaei Bidgoli
Natural Language Processing
ๆฅ็ๆ่ฆ
The advent of Large Language Models (LLMs) has revolutionized Natural
Language Processing, yet their application in high-stakes, specialized domains
like religious question answering is hindered by challenges like hallucination
and unfaithfulness to authoritative sources. This issue is particularly
critical for the Persian-speaking Muslim community, where accuracy and
trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG)
systems, relying on simplistic single-pass pipelines, fall short on complex,
multi-hop queries requiring multi-step reasoning and evidence aggregation. To
address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful
Advanced Question Answering in the Persian Islamic domain. FARSIQA is built
upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative
Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting
process: it adaptively decomposes complex queries, assesses evidence
sufficiency, and enters an iterative loop to generate sub-queries,
progressively filling information gaps. Operating on a curated knowledge base
of over one million authoritative Islamic documents, FARSIQA demonstrates
superior performance. Rigorous evaluation on the challenging IslamicPCQA
benchmark shows state-of-the-art performance: the system achieves a remarkable
97.0% in Negative Rejection - a 40-point improvement over baselines - and a
high Answer Correctness score of 74.3%. Our work establishes a new standard for
Persian Islamic QA and validates that our iterative, adaptive architecture is
crucial for building faithful, reliable AI systems in sensitive domains.
๐
2025-10-29
๐ ICML 2025
๐ฅ Run Peng, Ziqiao Ma, Amy Pang, Sikai Li, Zhang Xi-Jia et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
While Large Language Model (LLM) agents are often approached from the angle
of action planning/generation to accomplish a goal (e.g., given by language
descriptions), their abilities to collaborate with each other to achieve a
joint goal are not well explored. To address this limitation, this paper
studies LLM agents in task collaboration, particularly under the condition of
information asymmetry, where agents have disparities in their knowledge and
skills and need to work together to complete a shared task. We extend Einstein
Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two
LLM agents must reason, communicate, and act to satisfy spatial and relational
constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier
framework in which LLM agents are equipped with various communication
strategies and verification signals from the environment. Empirical results
highlight the critical importance of aligned communication, especially when
agents possess both information-seeking and -providing capabilities.
Interestingly, agents without communication can still achieve high task
performance; however, further analysis reveals a lack of true rule
understanding and lower trust from human evaluators. Instead, by integrating an
environment-based verifier, we enhance agents' ability to comprehend task rules
and complete tasks, promoting both safer and more interpretable collaboration
in AI systems. https://github.com/Roihn/EinsteinPuzzles
๐
2025-10-29
๐ ArXiv Preprint (eess.AS)
๐ฅ Harm Lameris, Shree Harsha Bokkahalli Satish, Joakim Gustafson, รva Szรฉkely
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent advances in speech foundation models (SFMs) have enabled the direct
processing of spoken language from raw audio, bypassing intermediate textual
representations. This capability allows SFMs to be exposed to, and potentially
respond to, rich paralinguistic variations embedded in the input speech signal.
One under-explored dimension of paralinguistic variation is voice quality,
encompassing phonation types such as creaky and breathy voice. These phonation
types are known to influence how listeners infer affective state, stance and
social meaning in speech. Existing benchmarks for speech understanding largely
rely on multiple-choice question answering (MCQA) formats, which are prone to
failure and therefore unreliable in capturing the nuanced ways paralinguistic
features influence model behaviour. In this paper, we probe SFMs through
open-ended generation tasks and speech emotion recognition, evaluating whether
model behaviours are consistent across different phonation inputs. We introduce
a new parallel dataset featuring synthesized modifications to voice quality,
designed to evaluate SFM responses to creaky and breathy voice. Our work
provides the first examination of SFM sensitivity to these particular
non-lexical aspects of speech perception.
๐
2025-10-29
๐ ArXiv Preprint (Machine Learning)
๐ฅ Wenduan Xu
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
We present a hybrid quantum-classical recurrent neural network (QRNN)
architecture in which the entire recurrent core is realized as a parametrized
quantum circuit (PQC) controlled by a classical feedforward network. The hidden
state is the quantum state of an $n$-qubit PQC, residing in an exponentially
large Hilbert space $\mathbb{C}^{2^n}$. The PQC is unitary by construction,
making the hidden-state evolution norm-preserving without external constraints.
At each timestep, mid-circuit readouts are combined with the input embedding
and processed by the feedforward network, which provides explicit classical
nonlinearity. The outputs parametrize the PQC, which updates the hidden state
via unitary dynamics. The QRNN is compact and physically consistent, and it
unifies (i) unitary recurrence as a high-capacity memory, (ii) partial
observation via mid-circuit measurements, and (iii) nonlinear classical control
for input-conditioned parametrization. We evaluate the model in simulation with
up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory,
and language modeling, adopting projective measurements as a limiting case to
obtain mid-circuit readouts while maintaining a coherent recurrent quantum
memory. We further devise a soft attention mechanism over the mid-circuit
readouts in a sequence-to-sequence model and show its effectiveness for machine
translation. To our knowledge, this is the first model (RNN or otherwise)
grounded in quantum operations to achieve competitive performance against
strong classical baselines across a broad class of sequence-learning tasks.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) are exhibiting emergent human-like abilities and
are increasingly envisioned as the foundation for simulating an individual's
communication style, behavioral tendencies, and personality traits. However,
current evaluations of LLM-based persona simulation remain limited: most rely
on synthetic dialogues, lack systematic frameworks, and lack analysis of the
capability requirement. To address these limitations, we introduce TwinVoice, a
comprehensive benchmark for assessing persona simulation across diverse
real-world contexts. TwinVoice encompasses three dimensions: Social Persona
(public social interactions), Interpersonal Persona (private dialogues), and
Narrative Persona (role-based expression). It further decomposes the evaluation
of LLM performance into six fundamental capabilities, including opinion
consistency, memory recall, logical reasoning, lexical fidelity, persona tone,
and syntactic style. Experimental results reveal that while advanced models
achieve moderate accuracy in persona simulation, they still fall short of
capabilities such as syntactic style and memory recall. Consequently, the
average performance achieved by LLMs remains considerably below the human
baseline.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jun Wang, Fuming Lin, Yuyu Chen
Natural Language Processing
ๆฅ็ๆ่ฆ
This paper presents a pipeline integrating fine-tuned large language models
(LLMs) with named entity recognition (NER) for efficient domain-specific text
summarization and tagging. The authors address the challenge posed by rapidly
evolving sub-cultural languages and slang, which complicate automated
information extraction and law enforcement monitoring. By leveraging the LLaMA
Factory framework, the study fine-tunes LLMs on both generalpurpose and custom
domain-specific datasets, particularly in the political and security domains.
The models are evaluated using BLEU and ROUGE metrics, demonstrating that
instruction fine-tuning significantly enhances summarization and tagging
accuracy, especially for specialized corpora. Notably, the LLaMA3-8B-Instruct
model, despite its initial limitations in Chinese comprehension, outperforms
its Chinese-trained counterpart after domainspecific fine-tuning, suggesting
that underlying reasoning capabilities can transfer across languages. The
pipeline enables concise summaries and structured entity tagging, facilitating
rapid document categorization and distribution. This approach proves scalable
and adaptable for real-time applications, supporting efficient information
management and the ongoing need to capture emerging language trends. The
integration of LLMs and NER offers a robust solution for transforming
unstructured text into actionable insights, crucial for modern knowledge
management and security operations.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) excel as passive responders, but teaching them
to be proactive, goal-oriented partners, a critical capability in high-stakes
domains, remains a major challenge. Current paradigms either myopically
optimize single-turn attributes or rely on brittle, high-cost user simulators,
creating a persistent ``reality gap''. To bridge this gap, we introduce
\texttt{Learn-to-Ask}, a general, simulator-free framework for learning and
deploying proactive dialogue agents \textit{directly from offline expert data},
bypassing the need to model complex user dynamics. Our key insight is to
reframe the offline policy learning problem by leveraging the \textbf{observed
future} of each expert trajectory. This allows us to infer a dense,
turn-by-turn reward signal grounded in the expert's revealed strategy,
decomposing the intractable long-horizon problem into a series of supervised
learning tasks, and training a policy to output a structured \texttt{(action,
state_assessment)} tuple, governing both \textbf{what to ask} and, crucially,
\textbf{when to stop}. To ensure reward fidelity, our Automated Grader
Calibration pipeline systematically purges noise from the LLM-based reward
model with minimal human supervision. Empirically, we demonstrate the efficacy
of \texttt{Learn-to-Ask} in a real-world medical dataset, using LLMs of varying
sizes up to 32B. Our approach culminates in the successful deployment of LLMs
into a live, large-scale online AI service. In rigorous in-house evaluations,
our model was launched and achieved performance even superior to human experts,
proving our framework's ability to translate offline data into tangible,
real-world impact. We hope this work provides a practical and economically
viable blueprint for transforming passive LLMs into proactive, goal-oriented
LLM applications.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani et al.
ๆฅ็ๆ่ฆ
Audio Descriptions (ADs) convey essential on-screen information, allowing
visually impaired audiences to follow videos. To be effective, ADs must form a
coherent sequence that helps listeners to visualise the unfolding scene, rather
than describing isolated moments. However, most automatic methods generate each
AD independently, often resulting in repetitive, incoherent descriptions. To
address this, we propose a training-free method, CoherentAD, that first
generates multiple candidate descriptions for each AD time interval, and then
performs auto-regressive selection across the sequence to form a coherent and
informative narrative. To evaluate AD sequences holistically, we introduce a
sequence-level metric, StoryRecall, which measures how well the predicted ADs
convey the ground truth narrative, alongside repetition metrics that capture
the redundancy across consecutive AD outputs. Our method produces coherent AD
sequences with enhanced narrative understanding, outperforming prior approaches
that rely on independent generations.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Shakib Yazdani, Yasser Hamidullah, Cristina Espaรฑa-Bonet, Eleftherios Avramidis, Josef van Genabith
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
Automatic evaluation metrics are crucial for advancing sign language
translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are
only text-based, and it remains unclear to what extent text-based metrics can
reliably capture the quality of SLT outputs. To address this gap, we
investigate the limitations of text-based SLT evaluation metrics by analyzing
six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one
hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA
zero-shot direct assessment on the other hand. Specifically, we assess the
consistency and robustness of these metrics under three controlled conditions:
paraphrasing, hallucinations in model outputs, and variations in sentence
length. Our analysis highlights the limitations of lexical overlap metrics and
demonstrates that while LLM-based evaluators better capture semantic
equivalence often missed by conventional metrics, they can also exhibit bias
toward LLM-paraphrased translations. Moreover, although all metrics are able to
detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and
LLM-based evaluators are comparatively lenient toward subtle cases. This
motivates the need for multimodal evaluation frameworks that extend beyond
text-based metrics to enable a more holistic assessment of SLT outputs.
๐
2025-10-29
๐ Presented at the Annual Meeting of the American Po
๐ฅ Ali Sanaei, Ali Rajabzadeh
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) are increasingly utilized by researchers across
a wide range of domains, and qualitative social science is no exception;
however, this adoption faces persistent challenges, including interpretive
bias, low reliability, and weak auditability. We introduce a framework that
situates LLM usage along two dimensions, interpretive depth and autonomy,
thereby offering a straightforward way to classify LLM applications in
qualitative research and to derive practical design recommendations. We present
the state of the literature with respect to these two dimensions, based on all
published social science papers available on Web of Science that use LLMs as a
tool and not strictly as the subject of study. Rather than granting models
expansive freedom, our approach encourages researchers to decompose tasks into
manageable segments, much as they would when delegating work to capable
undergraduate research assistants. By maintaining low levels of autonomy and
selectively increasing interpretive depth only where warranted and under
supervision, one can plausibly reap the benefits of LLMs while preserving
transparency and reliability.
๐
2025-10-29
๐ EMNLP 2025
๐ฅ Auguste Poiroux, Antoine Bosselut, Viktor Kunฤak
Natural Language Processing
ๆฅ็ๆ่ฆ
Despite impressive results on curated benchmarks, the practical impact of
large language models (LLMs) on research-level neural theorem proving and proof
autoformalization is still limited. We introduce RLMEval, an evaluation suite
for these tasks, focusing on research-level mathematics from real-world Lean
formalization projects. RLMEval targets the evaluation of neural theorem
proving and proof autoformalization on challenging research-level theorems by
leveraging real Lean Blueprint formalization projects. Our evaluation of
state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean
projects, reveals a significant gap: progress on existing benchmarks does not
readily translate to these more realistic settings, with the best model
achieving only a 10.3 % pass rate. RLMEval provides a new, challenging
benchmark designed to guide and accelerate progress in automated reasoning for
formal mathematics.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Asutosh Hota, Jussi P. P. Jokinen
Natural Language Processing
ๆฅ็ๆ่ฆ
The rapid advancement of Large Language Models (LLMs) is positioning language
at the core of human-computer interaction (HCI). We argue that advancing HCI
requires attention to the linguistic foundations of interaction, particularly
implicature (meaning conveyed beyond explicit statements through shared
context) which is essential for human-AI (HAI) alignment. This study examines
LLMs' ability to infer user intent embedded in context-driven prompts and
whether understanding implicature improves response generation. Results show
that larger models approximate human interpretations more closely, while
smaller models struggle with implicature inference. Furthermore,
implicature-based prompts significantly enhance the perceived relevance and
quality of responses across models, with notable gains in smaller models.
Overall, 67.6% of participants preferred responses with implicature-embedded
prompts to literal ones, highlighting a clear preference for contextually
nuanced communication. Our work contributes to understanding how linguistic
theory can be used to address the alignment problem by making HAI interaction
more natural and contextually grounded.
๐
2025-10-29
๐ Accepted for presentation at the AI in Science Sum
๐ฅ Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Christina Lioma
ๆฅ็ๆ่ฆ
The development of Artificial Intelligence (AI), including AI in Science
(AIS), should be done following the principles of responsible AI. Progress in
responsible AI is often quantified through evaluation metrics, yet there has
been less work on assessing the robustness and reliability of the metrics
themselves. We reflect on prior work that examines the robustness of fairness
metrics for recommender systems as a type of AI application and summarise their
key takeaways into a set of non-exhaustive guidelines for developing reliable
metrics of responsible AI. Our guidelines apply to a broad spectrum of AI
applications, including AIS.
๐
2025-10-29
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Jamie Mahowald, Benjamin Charles Germain Lee
Multimodal
ๆฅ็ๆ่ฆ
Multimodal approaches have shown great promise for searching and navigating
digital collections held by libraries, archives, and museums. In this paper, we
introduce map-RAS: a retrieval-augmented search system for historic maps. In
addition to introducing our framework, we detail our publicly-hosted demo for
searching 101,233 map images held by the Library of Congress. With our system,
users can multimodally query the map collection via ColPali, summarize search
results using Llama 3.2, and upload their own collections to perform
inter-collection search. We articulate potential use cases for archivists,
curators, and end-users, as well as future work with our system in both machine
learning and the digital humanities. Our demo can be viewed at:
http://www.mapras.com.
๐
2025-10-29
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yi Xu, Moyu Zhang, Chaofan Fan, Jinxin Hu, Xiaochen Li et al.
Multimodal
ๆฅ็ๆ่ฆ
Industrial recommender systems rely on unique Item Identifiers (ItemIDs).
However, this method struggles with scalability and generalization in large,
dynamic datasets that have sparse long-tail data. Content-based Semantic IDs
(SIDs) address this by sharing knowledge through content quantization. However,
by ignoring dynamic behavioral properties, purely content-based SIDs have
limited expressive power. Existing methods attempt to incorporate behavioral
information but overlook a critical distinction: unlike relatively uniform
content features, user-item interactions are highly skewed and diverse,
creating a vast information gap in quality and quantity between popular and
long-tail items. This oversight leads to two critical limitations: (1) Noise
Corruption: Indiscriminate behavior-content alignment allows collaborative
noise from long-tail items to corrupt their content representations, leading to
the loss of critical multimodal information. (2)Signal Obscurity: The
equal-weighting scheme for SIDs fails to reflect the varying importance of
different behavioral signals, making it difficult for downstream tasks to
distinguish important SIDs from uninformative ones. To tackle these issues, we
propose a mixture-of-quantization framework, MMQ-v2, to adaptively Align,
Denoise, and Amplify multimodal information from content and behavior
modalities for semantic IDs learning. The semantic IDs generated by this
framework named ADA-SID. It introduces two innovations: an adaptive
behavior-content alignment that is aware of information richness to shield
representations from noise, and a dynamic behavioral router to amplify critical
signals by applying different weights to SIDs. Extensive experiments on public
and large-scale industrial datasets demonstrate ADA-SID's significant
superiority in both generative and discriminative recommendation tasks.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Mohammad Aghajani Asl, Behrooz Minaei Bidgoli
Natural Language Processing
ๆฅ็ๆ่ฆ
The advent of Large Language Models (LLMs) has revolutionized Natural
Language Processing, yet their application in high-stakes, specialized domains
like religious question answering is hindered by challenges like hallucination
and unfaithfulness to authoritative sources. This issue is particularly
critical for the Persian-speaking Muslim community, where accuracy and
trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG)
systems, relying on simplistic single-pass pipelines, fall short on complex,
multi-hop queries requiring multi-step reasoning and evidence aggregation. To
address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful
Advanced Question Answering in the Persian Islamic domain. FARSIQA is built
upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative
Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting
process: it adaptively decomposes complex queries, assesses evidence
sufficiency, and enters an iterative loop to generate sub-queries,
progressively filling information gaps. Operating on a curated knowledge base
of over one million authoritative Islamic documents, FARSIQA demonstrates
superior performance. Rigorous evaluation on the challenging IslamicPCQA
benchmark shows state-of-the-art performance: the system achieves a remarkable
97.0% in Negative Rejection - a 40-point improvement over baselines - and a
high Answer Correctness score of 74.3%. Our work establishes a new standard for
Persian Islamic QA and validates that our iterative, adaptive architecture is
crucial for building faithful, reliable AI systems in sensitive domains.
๐
2025-10-29
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, Fen Lin et al.
Machine Learning
ๆฅ็ๆ่ฆ
Query rewriting is a fundamental technique in information retrieval (IR). It
typically employs the retrieval result as relevance feedback to refine the
query and thereby addresses the vocabulary mismatch between user queries and
relevant documents. Traditional pseudo-relevance feedback (PRF) and its
vector-based extension (VPRF) improve retrieval performance by leveraging
top-retrieved documents as relevance feedback. However, they are constructed
based on two major hypotheses: the relevance assumption (top documents are
relevant) and the model assumption (rewriting methods need to be designed
specifically for particular model architectures). While recent large language
models (LLMs)-based generative relevance feedback (GRF) enables model-free
query reformulation, it either suffers from severe LLM hallucination or, again,
relies on the relevance assumption to guarantee the effectiveness of rewriting
quality. To overcome these limitations, we introduce an assumption-relaxed
framework: \textit{Generalized Pseudo Relevance Feedback} (GPRF), which
performs model-free, natural language rewriting based on retrieved documents,
not only eliminating the model assumption but also reducing dependence on the
relevance assumption. Specifically, we design a utility-oriented training
pipeline with reinforcement learning to ensure robustness against noisy
feedback. Extensive experiments across multiple benchmarks and retrievers
demonstrate that GPRF consistently outperforms strong baselines, establishing
it as an effective and generalizable framework for query rewriting.
๐
2025-10-29
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Thang-Long Nguyen-Ho, Minh-Khoi Pham, Hoang-Bao Le
Natural Language Processing
ๆฅ็ๆ่ฆ
This report details our methodology and results developed for the
Multilingual E-commerce Search Competition. The problem aims to recognize
relevance between user queries versus product items in a multilingual context
and improve recommendation performance on e-commerce platforms. Utilizing Large
Language Models (LLMs) and their capabilities in other tasks, our data-centric
method achieved the highest score compared to other solutions during the
competition. Final leaderboard is publised at
https://alibaba-international-cikm2025.github.io. The source code for our
project is published at https://github.com/nhtlongcs/e-commerce-product-search.
๐
2025-10-29
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yuqian Chai, Chaochao Wang, Weilei Wang
ๆฅ็ๆ่ฆ
Although AI drafting tools have gained prominence in patent writing, the
systematic evaluation of AI-generated patent content quality represents a
significant research gap. To address this gap, We propose to evaluate patents
using regulatory compliance, technical coherence, and figure-reference
consistency detection modules, and then generate improvement suggestions via an
integration module. The framework is validated on a comprehensive dataset
comprising 80 human-authored and 80 AI-generated patents from two patent
drafting tools. Evaluation is performed on 10,841 total sentences, 8,924
non-template sentences, and 554 patent figures for the three detection modules
respectively, achieving balanced accuracies of 99.74%, 82.12%, and 91.2%
against expert annotations. Additional analysis was conducted to examine defect
distributions across patent sections, technical domains, and authoring sources.
Section-based analysis indicates that figure-text consistency and technical
detail precision require particular attention. Mechanical Engineering and
Construction show more claim-specification inconsistencies due to complex
technical documentation requirements. AI-generated patents show a significant
gap compared to human-authored ones. While human-authored patents primarily
contain surface-level errors like typos, AI-generated patents exhibit more
structural defects in figure-text alignment and cross-references.
๐
2025-10-29
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Qiushi Pan, Hao Wang, Guoyuan An, Luankang Zhang, Wei Guo et al.
ๆฅ็ๆ่ฆ
In recommendation systems, how to effectively scale up recommendation models
has been an essential research topic. While significant progress has been made
in developing advanced and scalable architectures for sequential
recommendation(SR) models, there are still challenges due to items'
multi-faceted characteristics and dynamic item relevance in the user context.
To address these issues, we propose Fuxi-MME, a framework that integrates a
multi-embedding strategy with a Mixture-of-Experts (MoE) architecture.
Specifically, to efficiently capture diverse item characteristics in a
decoupled manner, we decompose the conventional single embedding matrix into
several lower-dimensional embedding matrices. Additionally, by substituting
relevant parameters in the Fuxi Block with an MoE layer, our model achieves
adaptive and specialized transformation of the enriched representations.
Empirical results on public datasets show that our proposed framework
outperforms several competitive baselines.
๐
2025-10-29
๐ ArXiv Preprint (cs.DL)
๐ฅ Muneer Ahmad, Undie Felicia Nkatv
ๆฅ็ๆ่ฆ
This study employs scientometric methods to assess the research output and
performance of the University of Ibadan from 2014 to 2023. By analyzing
publication trends, citation patterns, and collaboration networks, the research
aims to comprehensively evaluate the university's research productivity,
impact, and disciplinary focus. This article's endeavors are characterized by
innovation, interdisciplinary collaboration, and commitment to excellence,
making the University of Ibadan a significant hub for cutting-edge research in
Nigeria and beyond. The goal of the current study is to ascertain the influence
of the university's research output and publication patterns between 2014 and
2023. The study focuses on the departments at the University of Ibadan that
contribute the most, the best journals for publishing, the nations that
collaborate, the impact of citations both locally and globally, well-known
authors and their total production, and the research output broken down by
year. According to the university's ten-year publication data, 7159 papers with
an h-index of 75 were published between 2014 and 2023, garnering 218572
citations. Furthermore, the VOSviewer software mapping approach is used to
illustrate the stenographical mapping of data through graphs. The findings of
this study will contribute to understanding the university's research
strengths, weaknesses, and potential areas for improvement. Additionally, the
results will inform evidence-based decision-making for enhancing research
strategies and policies at the University of Ibadan.
๐
2025-10-29
๐ NeurIPS 2025
๐ฅ Yehjin Shin, Jeongwhan Choi, Seojin Kim, Noseong Park
ๆฅ็ๆ่ฆ
Recently, convolutional filters have been increasingly adopted in sequential
recommendation for their ability to capture local sequential patterns. However,
most of these models complement convolutional filters with self-attention. This
is because convolutional filters alone, generally fixed filters, struggle to
capture global interactions necessary for accurate recommendation. We propose
Time-Variant Convolutional Filters for Sequential Recommendation (TV-Rec), a
model inspired by graph signal processing, where time-variant graph filters
capture position-dependent temporal variations in user sequences. By replacing
both fixed kernels and self-attention with time-variant filters, TV-Rec
achieves higher expressive power and better captures complex interaction
patterns in user behavior. This design not only eliminates the need for
self-attention but also reduces computation while accelerating inference.
Extensive experiments on six public benchmarks show that TV-Rec outperforms
state-of-the-art baselines by an average of 7.49%.
๐
2025-10-29
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Zhijie Lin, Zhuofeng Li, Chenglei Dai, Wentian Bao, Shuai Lin et al.
Machine Learning
ๆฅ็ๆ่ฆ
In a multi-stage recommendation system, reranking plays a crucial role in
modeling intra-list correlations among items. A key challenge lies in exploring
optimal sequences within the combinatorial space of permutations. Recent
research follows a two-stage (generator-evaluator) paradigm, where a generator
produces multiple feasible sequences, and an evaluator selects the best one. In
practice, the generator is typically implemented as an autoregressive model.
However, these two-stage methods face two main challenges. First, the
separation of the generator and evaluator hinders end-to-end training. Second,
autoregressive generators suffer from inference efficiency. In this work, we
propose a Unified Generative Efficient Reranking Framework (GReF) to address
the two primary challenges. Specifically, we introduce Gen-Reranker, an
autoregressive generator featuring a bidirectional encoder and a dynamic
autoregressive decoder to generate causal reranking sequences. Subsequently, we
pre-train Gen-Reranker on the item exposure order for high-quality parameter
initialization. To eliminate the need for the evaluator while integrating
sequence-level evaluation during training for end-to-end optimization, we
propose post-training the model through Rerank-DPO. Moreover, for efficient
autoregressive inference, we introduce ordered multi-token prediction (OMTP),
which trains Gen-Reranker to simultaneously generate multiple future items
while preserving their order, ensuring practical deployment in real-time
recommender systems. Extensive offline experiments demonstrate that GReF
outperforms state-of-the-art reranking methods while achieving latency that is
nearly comparable to non-autoregressive models. Additionally, GReF has also
been deployed in a real-world video app Kuaishou with over 300 million daily
active users, significantly improving online recommendation quality.
๐
2025-10-29
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Hongjin Qian, Zheng Liu
Natural Language Processing
ๆฅ็ๆ่ฆ
AI search depends on linking large language models (LLMs) with vast external
knowledge sources. Yet web pages, PDF files, and other raw documents are not
inherently LLM-ready: they are long, noisy, and unstructured. Conventional
retrieval methods treat these documents as verbatim text and return raw
passages, leaving the burden of fragment assembly and contextual reasoning to
the LLM. This gap underscores the need for a new retrieval paradigm that
redefines how models interact with documents.
We introduce the Model-Document Protocol (MDP), a general framework that
formalizes how raw text is bridged to LLMs through consumable knowledge
representations. Rather than treating retrieval as passage fetching, MDP
defines multiple pathways that transform unstructured documents into
task-specific, LLM-ready inputs. These include agentic reasoning, which curates
raw evidence into coherent context; memory grounding, which accumulates
reusable notes to enrich reasoning; and structured leveraging, which encodes
documents into formal representations such as graphs or key-value caches. All
three pathways share the same goal: ensuring that what reaches the LLM is not
raw fragments but compact, structured knowledge directly consumable for
reasoning.
As an instantiation, we present MDP-Agent, which realizes the protocol
through an agentic process: constructing document-level gist memories for
global coverage, performing diffusion-based exploration with vertical
exploitation to uncover layered dependencies, and applying map-reduce style
synthesis to integrate large-scale evidence into compact yet sufficient
context. Experiments on information-seeking benchmarks demonstrate that
MDP-Agent outperforms baselines, validating both the soundness of the MDP
framework and the effectiveness of its agentic instantiation.
๐
2025-10-29
๐ ArXiv Preprint (Machine Learning)
๐ฅ Hyunsik Yoo, Ting-Wei Li, SeongKu Kang, Zhining Liu, Charlie Xu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
While large language models (LLMs) achieve strong performance in
recommendation, they face challenges in continual learning as users, items, and
user preferences evolve over time. Existing LoRA-based continual methods
primarily focus on preserving performance on previous tasks, but this overlooks
the unique nature of recommendation: the goal is not to predict past
preferences, and outdated preferences can even harm performance when current
interests shift significantly. To address this, we propose PESO (Proximally
rEgularized Single evolving lOra, a continual adaptation method for LoRA in
recommendation. PESO introduces a proximal regularizer that anchors the current
adapter to its most recent frozen state, enabling the model to flexibly balance
adaptation and preservation, and to better capture recent user behaviors.
Theoretically, we show that this proximal design provides data-aware,
direction-wise guidance in the LoRA subspace. Empirically, PESO consistently
outperforms existing LoRA-based continual learning methods.
๐
2025-10-29
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Zhipeng Bao, Qianwen Li
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Despite significant advancements in recent decades, autonomous vehicles (AVs)
continue to face challenges in navigating certain traffic scenarios where human
drivers excel. In such situations, AVs often become immobilized, disrupting
overall traffic flow. Current recovery solutions, such as remote intervention
(which is costly and inefficient) and manual takeover (which excludes
non-drivers and limits AV accessibility), are inadequate. This paper introduces
StuckSolver, a novel Large Language Model (LLM) driven recovery framework that
enables AVs to resolve immobilization scenarios through self-reasoning and/or
passenger-guided decision-making. StuckSolver is designed as a plug-in add-on
module that operates on top of the AV's existing perception-planning-control
stack, requiring no modification to its internal architecture. Instead, it
interfaces with standard sensor data streams to detect immobilization states,
interpret environmental context, and generate high-level recovery commands that
can be executed by the AV's native planner. We evaluate StuckSolver on the
Bench2Drive benchmark and in custom-designed uncertainty scenarios. Results
show that StuckSolver achieves near-state-of-the-art performance through
autonomous self-reasoning alone and exhibits further improvements when
passenger guidance is incorporated.
๐
2025-10-29
๐ 8 pages, 9 figures, submitted for review to IEEE R
๐ฅ Petr Stibinger, Tomas Baca, Daniela Doubravova, Jan Rusnak, Jaroslav Solc et al.
ๆฅ็ๆ่ฆ
We present a novel approach to localizing radioactive material by cooperating
Micro Aerial Vehicles (MAVs). Our approach utilizes a state-of-the-art
single-detector Compton camera as a highly sensitive, yet miniature detector of
ionizing radiation. The detector's exceptionally low weight (40 g) opens up new
possibilities of radiation detection by a team of cooperating agile MAVs. We
propose a new fundamental concept of fusing the Compton camera measurements to
estimate the position of the radiation source in real time even from extremely
sparse measurements. The data readout and processing are performed directly
onboard and the results are used in a dynamic feedback to drive the motion of
the vehicles. The MAVs are stabilized in a tightly cooperating swarm to
maximize the information gained by the Compton cameras, rapidly locate the
radiation source, and even track a moving radiation source.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Bai Li, Achilleas Kourtellis, Rong Cao, Joseph Post, Brian Porter et al.
Robotics
ๆฅ็ๆ่ฆ
Rapid and reliable incident detection is critical for reducing crash-related
fatalities, injuries, and congestion. However, conventional methods, such as
closed-circuit television, dashcam footage, and sensor-based detection,
separate detection from verification, suffer from limited flexibility, and
require dense infrastructure or high penetration rates, restricting
adaptability and scalability to shifting incident hotspots. To overcome these
challenges, we developed DARTS, a drone-based, AI-powered real-time traffic
incident detection system. DARTS integrates drones' high mobility and aerial
perspective for adaptive surveillance, thermal imaging for better
low-visibility performance and privacy protection, and a lightweight deep
learning framework for real-time vehicle trajectory extraction and incident
detection. The system achieved 99% detection accuracy on a self-collected
dataset and supports simultaneous online visual verification, severity
assessment, and incident-induced congestion propagation monitoring via a
web-based interface. In a field test on Interstate 75 in Florida, DARTS
detected and verified a rear-end collision 12 minutes earlier than the local
transportation management center and monitored incident-induced congestion
propagation, suggesting potential to support faster emergency response and
enable proactive traffic control to reduce congestion and secondary crash risk.
Crucially, DARTS's flexible deployment architecture reduces dependence on
frequent physical patrols, indicating potential scalability and
cost-effectiveness for use in remote areas and resource-constrained settings.
This study presents a promising step toward a more flexible and integrated
real-time traffic incident detection system, with significant implications for
the operational efficiency and responsiveness of modern transportation
management.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Francisco M. F. R. Gonรงalves, Ryan M. Bena, Nรฉstor O. Pรฉrez-Arancibia
Robotics
ๆฅ็ๆ่ฆ
Over the past few decades, continuous quaternion-based attitude control has
been proven highly effective for driving rotational systems that can be modeled
as rigid bodies, such as satellites and drones. However, methods rooted in this
approach do not enforce the existence of a unique closed-loop (CL) equilibrium
attitude-error quaternion (AEQ); and, for rotational errors about the
attitude-error Euler axis larger than {\pi}rad, their proportional-control
effect diminishes as the system state moves away from the stable equilibrium of
the CL rotational dynamics. In this paper, we introduce a new type of attitude
control law that more effectively leverages the attitude-error Euler axis-angle
information to guarantee a unique CL equilibrium AEQ and to provide greater
flexibility in the use of proportional-control efforts. Furthermore, using two
different control laws as examples-through the construction of a strict
Lyapunov function for the CL dynamics-we demonstrate that the resulting unique
equilibrium of the CL rotational system can be enforced to be uniformly
asymptotically stable. To assess and demonstrate the functionality and
performance of the proposed approach, we performed numerical simulations and
executed dozens of real-time tumble-recovery maneuvers using a small quadrotor.
These simulations and flight tests compellingly demonstrate that the proposed
axis-angle-based method achieves superior flight performance-compared with that
obtained using a high-performance quaternion-based controller-in terms of
stabilization time.
๐
2025-10-29
๐ This work has been submitted to the IEEE for possi
๐ฅ Luoyan Zhong, Heather Jin Hee Kim, Dylan P. Losey, Cara M. Nunez
Machine Learning
ๆฅ็ๆ่ฆ
Flexible tactile sensors are increasingly used in real-world applications
such as robotic grippers, prosthetic hands, wearable gloves, and assistive
devices, where they need to conform to curved and irregular surfaces. However,
most existing tactile sensors are calibrated only on flat substrates, and their
accuracy and consistency degrade once mounted on curved geometries. This
limitation restricts their reliability in practical use. To address this
challenge, we develop a calibration model for a widely used resistive tactile
sensor design that enables accurate force estimation on one-dimensional curved
surfaces. We then train a neural network (a multilayer perceptron) to predict
local curvature from baseline sensor outputs recorded under no applied load,
achieving an R2 score of 0.91. The proposed approach is validated on five daily
objects with varying curvatures under forces from 2 N to 8 N. Results show that
the curvature-aware calibration maintains consistent force accuracy across all
surfaces, while flat-surface calibration underestimates force as curvature
increases. Our results demonstrate that curvature-aware modeling improves the
accuracy, consistency, and reliability of flexible tactile sensors, enabling
dependable performance across real-world applications.
๐
2025-10-29
๐ ArXiv Preprint (Cryptography and Security)
๐ฅ Zeynep Yasemin Erdogan, Shishir Nagaraja, Chuadhry Mujeeb Ahmed, Ryan Shah
Machine Learning
ๆฅ็ๆ่ฆ
In this paper, we present a framework that uses acoustic side-channel
analysis (ASCA) to monitor and verify whether a robot correctly executes its
intended commands. We develop and evaluate a machine-learning-based workflow
verification system that uses acoustic emissions generated by robotic
movements. The system can determine whether real-time behavior is consistent
with expected commands. The evaluation takes into account movement speed,
direction, and microphone distance. The results show that individual robot
movements can be validated with over 80% accuracy under baseline conditions
using four different classifiers: Support Vector Machine (SVM), Deep Neural
Network (DNN), Recurrent Neural Network (RNN), and Convolutional Neural Network
(CNN). Additionally, workflows such as pick-and-place and packing could be
identified with similarly high confidence. Our findings demonstrate that
acoustic signals can support real-time, low-cost, passive verification in
sensitive robotic environments without requiring hardware modifications.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Gilbert Bahati, Ryan M. Bena, Meg Wilkinson, Pol Mestres, Ryan K. Cosner et al.
ๆฅ็ๆ่ฆ
Robotic systems navigating in real-world settings require a semantic
understanding of their environment to properly determine safe actions. This
work aims to develop the mathematical underpinnings of such a representation --
specifically, the goal is to develop safety filters that are risk-aware. To
this end, we take a two step approach: encoding an understanding of the
environment via Poisson's equation, and associated risk via Laplace guidance
fields. That is, we first solve a Dirichlet problem for Poisson's equation to
generate a safety function that encodes system safety as its 0-superlevel set.
We then separately solve a Dirichlet problem for Laplace's equation to
synthesize a safe \textit{guidance field} that encodes variable levels of
caution around obstacles -- by enforcing a tunable flux boundary condition. The
safety function and guidance fields are then combined to define a safety
constraint and used to synthesize a risk-aware safety filter which, given a
semantic understanding of an environment with associated risk levels of
environmental features, guarantees safety while prioritizing avoidance of
higher risk obstacles. We demonstrate this method in simulation and discuss how
\textit{a priori} understandings of obstacle risk can be directly incorporated
into the safety filter to generate safe behaviors that are risk-aware.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Denniz Goren, Holger Caesar
ๆฅ็ๆ่ฆ
The vulnerability of cyclists, exacerbated by the rising popularity of faster
e-bikes, motivates adapting automotive perception technologies for bicycle
safety. We use our multi-sensor 'SenseBike' research platform to develop and
evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the
automotive-to-bicycle domain gap, we introduce the novel BikeScenes-lidarseg
Dataset, comprising 3021 consecutive LiDAR scans around the university campus
of the TU Delft, semantically annotated for 29 dynamic and static classes. By
evaluating model performance, we demonstrate that fine-tuning on our BikeScenes
dataset achieves a mean Intersection-over-Union (mIoU) of 63.6%, significantly
outperforming the 13.8% obtained with SemanticKITTI pre-training alone. This
result underscores the necessity and effectiveness of domain-specific training.
We highlight key challenges specific to bicycle-mounted, hardware-constrained
perception systems and contribute the BikeScenes dataset as a resource for
advancing research in cyclist-centric LiDAR segmentation.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Kevin Qiu, Marek Cygan
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Automating the co-design of a robot's morphology and control is a
long-standing challenge due to the vast design space and the tight coupling
between body and behavior. We introduce Debate2Create (D2C), a framework in
which large language model (LLM) agents engage in a structured dialectical
debate to jointly optimize a robot's design and its reward function. In each
round, a design agent proposes targeted morphological modifications, and a
control agent devises a reward function tailored to exploit the new design. A
panel of pluralistic judges then evaluates the design-control pair in
simulation and provides feedback that guides the next round of debate. Through
iterative debates, the agents progressively refine their proposals, producing
increasingly effective robot designs. Notably, D2C yields diverse and
specialized morphologies despite no explicit diversity objective. On a
quadruped locomotion benchmark, D2C discovers designs that travel 73% farther
than the default, demonstrating that structured LLM-based debate can serve as a
powerful mechanism for emergent robot co-design. Our results suggest that
multi-agent debate, when coupled with physics-grounded feedback, is a promising
new paradigm for automated robot design.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Kush Hari, Ziyang Chen, Hansoul Kim, Ken Goldberg
Computer Vision
ๆฅ็ๆ่ฆ
Surgical suturing is a high-precision task that impacts patient healing and
scarring. Suturing skill varies widely between surgeons, highlighting the need
for robot assistance. Previous robot suturing works, such as STITCH 1.0 [1],
struggle to fully close wounds due to inaccurate needle tracking and poor
thread management. To address these challenges, we present STITCH 2.0, an
elevated augmented dexterity pipeline with seven improvements including:
improved EKF needle pose estimation, new thread untangling methods, and an
automated 3D suture alignment algorithm. Experimental results over 15 trials
find that STITCH 2.0 on average achieves 74.4% wound closure with 4.87 sutures
per trial, representing 66% more sutures in 38% less time compared to the
previous baseline. When two human interventions are allowed, STITCH 2.0
averages six sutures with 100% wound closure rate. Project website:
https://stitch-2.github.io/
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Bohan Wu, Paul de La Sayette, Li Fei-Fei, Roberto Martรญn-Martรญn
Robotics
ๆฅ็ๆ่ฆ
The ability to use random objects as tools in a generalizable manner is a
missing piece in robots' intelligence today to boost their versatility and
problem-solving capabilities. State-of-the-art robotic tool usage methods
focused on procedurally generating or crowd-sourcing datasets of tools for a
task to learn how to grasp and manipulate them for that task. However, these
methods assume that only one object is provided and that it is possible, with
the correct grasp, to perform the task; they are not capable of identifying,
grasping, and using the best object for a task when many are available,
especially when the optimal tool is absent. In this work, we propose GeT-USE, a
two-step procedure that learns to perform real-robot generalized tool usage by
learning first to extend the robot's embodiment in simulation and then
transferring the learned strategies to real-robot visuomotor policies. Our key
insight is that by exploring a robot's embodiment extensions (i.e., building
new end-effectors) in simulation, the robot can identify the general tool
geometries most beneficial for a task. This learned geometric knowledge can
then be distilled to perform generalized tool usage tasks by selecting and
using the best available real-world object as tool. On a real robot with 22
degrees of freedom (DOFs), GeT-USE outperforms state-of-the-art methods by
30-60% success rates across three vision-based bimanual mobile manipulation
tool-usage tasks.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Ciera McFarland, Margaret McGuinness
Robotics
ๆฅ็ๆ่ฆ
Soft, vine-inspired growing robots that move by eversion are highly mobile in
confined environments, but, when faced with gaps in the environment, they may
collapse under their own weight while navigating a desired path. In this work,
we present a comprehensive collapse model that can predict the collapse length
of steered robots in any shape using true shape information and tail tension.
We validate this model by collapsing several unsteered robots without true
shape information. The model accurately predicts the trends of those
experiments. We then attempt to collapse a robot steered with a single actuator
at different orientations. Our models accurately predict collapse when it
occurs. Finally, we demonstrate how this could be used in the field by having a
robot attempt a gap-crossing task with and without inflating its actuators. The
robot needs its actuators inflated to cross the gap without collapsing, which
our model supports. Our model has been specifically tested on straight and
series pouch motor-actuated robots made of non-stretchable material, but it
could be applied to other robot variations. This work enables us to model the
robot's collapse behavior in any open environment and understand the parameters
it needs to succeed in 3D navigation tasks.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Boshi An, Chenyu Yang, Robert Katzschmann
Multimodal
ๆฅ็ๆ่ฆ
We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for
dexterous human-robot collaboration with minimal language prompting. Our
approach adds (i) FiLM conditioning to visual backbones for task-aware
perception, (ii) an auxiliary intent head that predicts collaborator hand pose
and target cues, and (iii) action-space post-processing that predicts compact
deltas (position/rotation) and PCA-reduced finger joints before mapping to full
commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset
augmented with MediaPipe hand poses, we demonstrate that delta actions are
well-behaved and that four principal components explain ~96% of hand-joint
variance. Ablations identify action post-processing as the primary performance
driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is
detrimental. A real-time stack (~0.3 s latency on one RTX 4090) composes
"pick-up" and "pass" into a long-horizon behavior. We surface "trainer
overfitting" to specific demonstrators as the key limitation.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Ahmad Kokhahi, Mary Kurz
ๆฅ็ๆ่ฆ
Multi-Agent Path Finding (MAPF) has gained significant attention, with most
research focusing on minimizing collisions and travel time. This paper also
considers energy consumption in the path planning of automated guided vehicles
(AGVs). It addresses two main challenges: i) resolving collisions between AGVs
and ii) assigning tasks to AGVs. We propose a new collision avoidance strategy
that takes both energy use and travel time into account. For task assignment,
we present two multi-objective algorithms: Non-Dominated Sorting Genetic
Algorithm (NSGA) and Adaptive Large Neighborhood Search (ALNS). Comparative
evaluations show that these proposed methods perform better than existing
approaches in both collision avoidance and task assignment.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Weikang Wan, Fabio Ramos, Xuning Yang, Caelan Garrett
Natural Language ProcessingMachine LearningRobotics
ๆฅ็ๆ่ฆ
Long-horizon contact-rich bimanual manipulation presents a significant
challenge, requiring complex coordination involving a mixture of parallel
execution and sequential collaboration between arms. In this paper, we
introduce a hierarchical framework that frames this challenge as an integrated
skill planning & scheduling problem, going beyond purely sequential
decision-making to support simultaneous skill invocation. Our approach is built
upon a library of single-arm and bimanual primitive skills, each trained using
Reinforcement Learning (RL) in GPU-accelerated simulation. We then train a
Transformer-based planner on a dataset of skill compositions to act as a
high-level scheduler, simultaneously predicting the discrete schedule of skills
as well as their continuous parameters. We demonstrate that our method achieves
higher success rates on complex, contact-rich tasks than end-to-end RL
approaches and produces more efficient, coordinated behaviors than traditional
sequential-only planners.
๐
2025-10-29
๐ ArXiv Preprint (Machine Learning)
๐ฅ Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
The growing success of Vision-Language-Action (VLA) models stems from the
promise that pretrained Vision-Language Models (VLMs) can endow agents with
transferable world knowledge and vision-language (VL) grounding, laying a
foundation for action models with broader generalization. Yet when these VLMs
are adapted to the action modality, it remains unclear to what extent their
original VL representations and knowledge are preserved. In this work, we
conduct a systematic study of representation retention during VLA fine-tuning,
showing that naive action fine-tuning leads to degradation of visual
representations. To characterize and measure these effects, we probe VLA's
hidden representations and analyze attention maps, further, we design a set of
targeted tasks and methods that contrast VLA models with their counterpart
VLMs, isolating changes in VL capabilities induced by action fine-tuning. We
further evaluate a range of strategies for aligning visual representations and
introduce a simple yet effective method that mitigates degradation and yields
improved generalization to out-of-distribution (OOD) scenarios. Taken together,
our analysis clarifies the trade-off between action fine-tuning and the
degradation of VL representations and highlights practical approaches to
recover inherited VL capabilities. Code is publicly available:
https://blind-vla-paper.github.io
๐
2025-10-29
๐ ArXiv Preprint (eess.SY)
๐ฅ Siddhartha Upadhyay, Ratnangshu Das, Pushpak Jagtap
Robotics
ๆฅ็ๆ่ฆ
This paper presents a decentralized control framework that incorporates
social awareness into multi-agent systems with unknown dynamics to achieve
prescribed-time reach-avoid-stay tasks in dynamic environments. Each agent is
assigned a social awareness index that quantifies its level of cooperation or
self-interest, allowing heterogeneous social behaviors within the system.
Building on the spatiotemporal tube (STT) framework, we propose a real-time STT
framework that synthesizes tubes online for each agent while capturing its
social interactions with others. A closed-form, approximation-free control law
is derived to ensure that each agent remains within its evolving STT, thereby
avoiding dynamic obstacles while also preventing inter-agent collisions in a
socially aware manner, and reaching the target within a prescribed time. The
proposed approach provides formal guarantees on safety and timing, and is
computationally lightweight, model-free, and robust to unknown disturbances.
The effectiveness and scalability of the framework are validated through
simulation and hardware experiments on a 2D omnidirectional
๐
2025-10-29
๐ ICRA 2026
๐ฅ Muyang Yan, Miras Mengdibayev, Ardon Floros, Weihang Guo, Lydia E. Kavraki et al.
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
In task and motion planning, high-level task planning is done over an
abstraction of the world to enable efficient search in long-horizon robotics
problems. However, the feasibility of these task-level plans relies on the
downward refinability of the abstraction into continuous motion. When a
domain's refinability is poor, task-level plans that appear valid may
ultimately fail during motion planning, requiring replanning and resulting in
slower overall performance. Prior works mitigate this by encoding refinement
issues as constraints to prune infeasible task plans. However, these approaches
only add constraints upon refinement failure, expending significant search
effort on infeasible branches. We propose VIZ-COAST, a method of leveraging the
common-sense spatial reasoning of large pretrained Vision-Language Models to
identify issues with downward refinement a priori, bypassing the need to fix
these failures during planning. Experiments on two challenging TAMP domains
show that our approach is able to extract plausible constraints from images and
domain descriptions, drastically reducing planning times and, in some cases,
eliminating downward refinement failures altogether, generalizing to a diverse
range of instances from the broader domain.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Shengyao Zhang, Yiyuan Zhang, Chenrui Zhang, Yiming Li, Wenci Xin et al.
Robotics
ๆฅ็ๆ่ฆ
The stereotypical reaching motion of the octopus arm has drawn growing
attention for its efficient control of a highly deformable body. Previous
studies suggest that its characteristic bend propagation may share underlying
principles with the dynamics of a whip. This work investigates whether
whip-like passive dynamics in water can reproduce the kinematic features
observed in biological reaching and their similarities and differences.
Platform-based whipping tests were performed in water and air while
systematically varying material stiffness and driving speed. Image-based
quantification revealed that the Ecoflex Gel 2 arm driven at 150 rpm (motor
speed) reproduced curvature propagation similar to that observed in octopus
reaching. However, its bend-point velocity decreased monotonically rather than
exhibiting the biological bell-shaped profile, confirming that the octopus
reaching movement is not merely a passive whipping behavior. The absence of
propagation in air further highlights the critical role of the surrounding
medium in forming octopus-like reaching motion. This study provides a new
perspective for understand biological reaching movement, and offers a potential
platform for future hydrodynamic research.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Alexander B. Rambech, Ivar B. Saksvik, Vahid Hassani
ๆฅ็ๆ่ฆ
In this paper, we present a Newton-Euler formulation of the equations of
motion for underwater vehicles with an interntal moving mass actuator.
Furthermore, the moving mass dynamics are expressed as an extension to the
manoeuvring model for underwater vehicles, originally introduced by Fossen
(1991). The influence of the moving mass is described in body-frame and
included as states in both an additional kinematic equation and as part of the
coupled rigid-body kinetics of the underwater vehicle. The Coriolis-centripetal
effects are derived from Kirchhoff's equations and the hydrostatics are derived
using first principals. The proposed Newton-Euler model is validated through
simulation and compared with the traditional Hamiltonian internal moving mass
actuator formulation.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongjie Zhang, Gideon Billings, Stefan B. Williams
Natural Language Processing
ๆฅ็ๆ่ฆ
Underwater infrastructure requires frequent inspection and maintenance due to
harsh marine conditions. Current reliance on human divers or remotely operated
vehicles is limited by perceptual and operational challenges, especially around
complex structures or in turbid water. Enhancing the spatial awareness of
underwater vehicles is key to reducing piloting risks and enabling greater
autonomy. To address these challenges, we present SPADE: SParsity Adaptive
Depth Estimator, a monocular depth estimation pipeline that combines
pre-trained relative depth estimator with sparse depth priors to produce dense,
metric scale depth maps. Our two-stage approach first scales the relative depth
map with the sparse depth points, then refines the final metric prediction with
our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves
improved accuracy and generalisation over state-of-the-art baselines and runs
efficiently at over 15 FPS on embedded hardware, promising to support practical
underwater inspection and intervention. This work has been submitted to IEEE
Journal of Oceanic Engineering Special Issue of AUV 2026.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Chaz Cornwall, Jeremy P. Bos
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Formation control simplifies minimizing multi-robot cost functions by
encoding a cost function as a shape the robots maintain. However, by reducing
complex cost functions to formations, discrepancies arise between maintaining
the shape and minimizing the original cost function. For example, a Diamond or
Box formation shape is often used for protecting all members of the formation.
When more information about the surrounding environment becomes available, a
static shape often no longer minimizes the original protection cost. We propose
a formation planner to reduce mismatch between a formation and the cost
function while still leveraging efficient formation controllers. Our formation
planner is a two-step optimization problem that identifies desired relative
robot positions. We first solve a constrained problem to estimate non-linear
and non-differentiable costs with a weighted sum of surrogate cost functions.
We theoretically analyze this problem and identify situations where weights do
not need to be updated. The weighted, surrogate cost function is then minimized
using relative positions between robots. The desired relative positions are
realized using a non-cooperative formation controller derived from Lyapunov's
direct approach. We then demonstrate the efficacy of this approach for
military-like costs such as protection and obstacle avoidance. In simulations,
we show a formation planner can reduce a single cost by over 75%. When
minimizing a variety of cost functions simultaneously, using a formation
planner with adaptive weights can reduce the cost by 20-40%. Formation planning
provides better performance by minimizing a surrogate cost function that
closely approximates the original cost function instead of relying on a shape
abstraction.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Kei Ikemura, Yifei Dong, David Blanco-Mulero, Alberta Longhini, Li Chen et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Robotic manipulation of deformable and fragile objects presents significant
challenges, as excessive stress can lead to irreversible damage to the object.
While existing solutions rely on accurate object models or specialized sensors
and grippers, this adds complexity and often lacks generalization. To address
this problem, we present a vision-based reinforcement learning approach that
incorporates a stress-penalized reward to discourage damage to the object
explicitly. In addition, to bootstrap learning, we incorporate offline
demonstrations as well as a designed curriculum progressing from rigid proxies
to deformables. We evaluate the proposed method in both simulated and
real-world scenarios, showing that the policy learned in simulation can be
transferred to the real world in a zero-shot manner, performing tasks such as
picking up and pushing tofu. Our results show that the learned policies exhibit
a damage-aware, gentle manipulation behavior, demonstrating their effectiveness
by decreasing the stress applied to fragile objects by 36.5% while achieving
the task goals, compared to vanilla RL policies.
๐
2025-10-29
๐ Accepted to 2025 IEEE International Automated Vehi
๐ฅ Kumar Manas, Mert Keser, Alois Knoll
Robotics
ๆฅ็ๆ่ฆ
This survey provides an analysis of current methodologies integrating legal
and logical specifications into the perception, prediction, and planning
modules of automated driving systems. We systematically explore techniques
ranging from logic-based frameworks to computational legal reasoning
approaches, emphasizing their capability to ensure regulatory compliance and
interpretability in dynamic and uncertain driving environments. A central
finding is that significant challenges arise at the intersection of perceptual
reliability, legal compliance, and decision-making justifiability. To
systematically analyze these challenges, we introduce a taxonomy categorizing
existing approaches by their theoretical foundations, architectural
implementations, and validation strategies. We particularly focus on methods
that address perceptual uncertainty and incorporate explicit legal norms,
facilitating decisions that are both technically robust and legally defensible.
The review covers neural-symbolic integration methods for perception,
logic-driven rule representation, and norm-aware prediction strategies, all
contributing toward transparent and accountable autonomous vehicle operation.
We highlight critical open questions and practical trade-offs that must be
addressed, offering multidisciplinary insights from engineering, logic, and law
to guide future developments in legally compliant autonomous driving systems.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Bernhard Rameder, Hubert Gattringer, Andreas Mueller
Machine Learning
ๆฅ็ๆ่ฆ
In this paper a new method for geometric robot calibration is introduced,
which uses a calibration plate with precisely known distances between its
measuring points. The relative measurement between two points on the
calibration plate is used to determine predefined error parameters of the
system. In comparison to conventional measurement methods, like laser tracker
or motion capture systems, the calibration plate provides a more mechanically
robust and cheaper alternative, which is furthermore easier to transport due to
its small size. The calibration method, the plate design, the mathematical
description of the error system as well as the identification of the parameters
are described in detail. For identifying the error parameters, the least
squares method and a constrained optimization problem are used. The
functionality of this method was demonstrated in experiments that led to
promising results, correlated with one of a laser tracker calibration. The
modeling and identification of the error parameters is done for a gantry
machine, but is not restricted to that type of robot.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Jakob Ziegler, Bernhard Rameder, Hubert Gattringer, Andreas Mueller
ๆฅ็ๆ่ฆ
In this paper, an approach for gait assistance with a lower body exoskeleton
is described. Two concepts, transparency and motion assistance, are combined.
The transparent mode, where the system is following the user's free motion with
a minimum of perceived interaction forces, is realized by exploiting the gear
backlash of the actuation units. During walking a superimposed assistance mode
applies an additional torque guiding the legs to their estimated future
position. The concept of adaptive oscillators is utilized to learn the
quasi-periodic signals typical for locomotion. First experiments showed
promising results.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Zongxi Yu, Xiaolong Qian, Shaohua Gao, Qi Jiang, Yao Gao et al.
ๆฅ็ๆ่ฆ
Achieving high-fidelity, compact RGBD imaging presents a dual challenge:
conventional compact optics struggle with RGB sharpness across the entire
depth-of-field, while software-only Monocular Depth Estimation (MDE) is an
ill-posed problem reliant on unreliable semantic priors. While deep optics with
elements like DOEs can encode depth, they introduce trade-offs in fabrication
complexity and chromatic aberrations, compromising simplicity. To address this,
we first introduce a novel bio-inspired all-spherical monocentric lens, around
which we build the Bionic Monocentric Imaging (BMI) framework, a holistic
co-design. This optical design naturally encodes depth into its depth-varying
Point Spread Functions (PSFs) without requiring complex diffractive or freeform
elements. We establish a rigorous physically-based forward model to generate a
synthetic dataset by precisely simulating the optical degradation process. This
simulation pipeline is co-designed with a dual-head, multi-scale reconstruction
network that employs a shared encoder to jointly recover a high-fidelity
All-in-Focus (AiF) image and a precise depth map from a single coded capture.
Extensive experiments validate the state-of-the-art performance of the proposed
framework. In depth estimation, the method attains an Abs Rel of 0.026 and an
RMSE of 0.130, markedly outperforming leading software-only approaches and
other deep optics systems. For image restoration, the system achieves an SSIM
of 0.960 and a perceptual LPIPS score of 0.082, thereby confirming a superior
balance between image fidelity and depth accuracy. This study illustrates that
the integration of bio-inspired, fully spherical optics with a joint
reconstruction algorithm constitutes an effective strategy for addressing the
intrinsic challenges in high-performance compact RGBD imaging. Source code will
be publicly available at https://github.com/ZongxiYu-ZJU/BMI.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Yusuke Tsunoda, Seiya Yamamoto, Kazuki Ito, Runze Xiao, Keisuke Naniwa et al.
Robotics
ๆฅ็ๆ่ฆ
Multi-legged mobile robots possess high mobility performance in rough terrain
environments, stemming from their high postural stability, joint flexibility,
and the redundancy provided by multiple legs. In prior research on navigating
between different environments such as land and water, the primary strategy
employed involves switching to a controller that generates an appropriate gait
for the new environment upon entering it. However, designing appropriate gaits
for each complex and diverse environment and accurately determining controller
switching for each environment is challenging. Therefore, this research
develops a centipede-type mobile robot that navigates both aquatic and
terrestrial environments with a simple, unified control scheme, based on the
implicit-explicit control philosophy and by ingeniously designing the robot's
body structure. In this research, we developed the robot featuring flexible
joints and left and right legs on each body segment and focused on the leg
structure which has extensive contact with the environment. This paper
evaluates the locomotion performance on land and water using the three
developed leg structures, using the robot's leg slip rate and actuator energy
consumption as evaluation metrics. The experimental results confirmed the
existence of an appropriate leg structure capable of navigating both aquatic
and terrestrial environments under identical control.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Wang zhi, Yuyan Liu, Liu Liu, Li Zhang, Ruixuan Lu et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
Generating hand grasps with language instructions is a widely studied topic
that benefits from embodied AI and VR/AR applications. While transferring into
hand articulatied object interaction (HAOI), the hand grasps synthesis requires
not only object functionality but also long-term manipulation sequence along
the object deformation. This paper proposes a novel HAOI sequence generation
framework SynHLMA, to synthesize hand language manipulation for articulated
objects. Given a complete point cloud of an articulated object, we utilize a
discrete HAOI representation to model each hand object interaction frame. Along
with the natural language embeddings, the representations are trained by an
HAOI manipulation language model to align the grasping process with its
language description in a shared representation space. A joint-aware loss is
employed to ensure hand grasps follow the dynamic variations of articulated
object joints. In this way, our SynHLMA achieves three typical hand
manipulation tasks for articulated objects of HAOI generation, HAOI prediction
and HAOI interpolation. We evaluate SynHLMA on our built HAOI-lang dataset and
experimental results demonstrate the superior hand grasp sequence generation
performance comparing with state-of-the-art. We also show a robotics grasp
application that enables dexterous grasps execution from imitation learning
using the manipulation sequence provided by our SynHLMA. Our codes and datasets
will be made publicly available.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Klaus Zauner, Hubert Gattringer, Andreas Mueller
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Handling loosely placed objects with robotic manipulators is a difficult task
from the point of view of trajectory planning and control. This becomes even
more challenging when the object to be handled is a container filled with
liquid. This paper addresses the task of transporting a liquid-filled cup
placed on a tray along a prescribed path in shortest time. The objective is to
minimize swapping, thus avoiding spillage of the fluid. To this end, the
sloshing dynamics is incorporated into the dynamic model used within the
optimal control problem formulation. The optimization problem is solved using a
direct multiple shooting approach.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Hao Huang, Geeta Chandra Raju Bethala, Shuaihang Yuan, Congcong Wen, Anthony Tzes et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Whole-body humanoid motion represents a cornerstone challenge in robotics,
integrating balance, coordination, and adaptability to enable human-like
behaviors. However, existing methods typically require multiple training
samples per motion category, rendering the collection of high-quality human
motion datasets both labor-intensive and costly. To address this, we propose a
novel approach that trains effective humanoid motion policies using only a
single non-walking target motion sample alongside readily available walking
motions. The core idea lies in leveraging order-preserving optimal transport to
compute distances between walking and non-walking sequences, followed by
interpolation along geodesics to generate new intermediate pose skeletons,
which are then optimized for collision-free configurations and retargeted to
the humanoid before integration into a simulated environment for policy
training via reinforcement learning. Experimental evaluations on the CMU MoCap
dataset demonstrate that our method consistently outperforms baselines,
achieving superior performance across metrics. Code will be released upon
acceptance.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Jee Won Lee, Hansol Lim, Sooyeun Yang, Jongseong Brad Choi
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Vision-based control systems, such as image-based visual servoing (IBVS),
have been extensively explored for precise robot manipulation. A persistent
challenge, however, is maintaining robust target tracking under partial or full
occlusions. Classical methods like Lucas-Kanade (LK) offer lightweight tracking
but are fragile to occlusion and drift, while deep learning-based approaches
often require continuous visibility and intensive computation. To address these
gaps, we propose a hybrid visual tracking framework that bridges advanced
perception with real-time servo control. First, a fast global template matcher
constrains the pose search region; next, a deep-feature Lucas-Kanade module
operating on early VGG layers refines alignment to sub-pixel accuracy (<2px);
then, a lightweight residual regressor corrects local misalignments caused by
texture degradation or partial occlusion. When visual confidence falls below a
threshold, a GRU-based predictor seamlessly extrapolates pose updates from
recent motion history. Crucially, the pipeline's final outputs-translation,
rotation, and scale deltas-are packaged as direct control signals for 30Hz
image-based servo loops. Evaluated on handheld video sequences with up to 90%
occlusion, our system sustains under 2px tracking error, demonstrating the
robustness and low-latency precision essential for reliable real-world robot
vision applications.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Amith Khandakar, David Michelson, Shaikh Golam Rabbani, Fariya Bintay Shafi, Md. Faysal Ahamed et al.
Multimodal
ๆฅ็ๆ่ฆ
It's important to monitor road issues such as bumps and potholes to enhance
safety and improve road conditions. Smartphones are equipped with various
built-in sensors that offer a cost-effective and straightforward way to assess
road quality. However, progress in this area has been slow due to the lack of
high-quality, standardized datasets. This paper discusses a new dataset created
by a mobile app that collects sensor data from devices like GPS,
accelerometers, gyroscopes, magnetometers, gravity sensors, and orientation
sensors. This dataset is one of the few that integrates Geographic Information
System (GIS) data with weather information and video footage of road
conditions, providing a comprehensive understanding of road issues with
geographic context. The dataset allows for a clearer analysis of road
conditions by compiling essential data, including vehicle speed, acceleration,
rotation rates, and magnetic field intensity, along with the visual and spatial
context provided by GIS, weather, and video data. Its goal is to provide
funding for initiatives that enhance traffic management, infrastructure
development, road safety, and urban planning. Additionally, the dataset will be
publicly accessible to promote further research and innovation in smart
transportation systems.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Hongyu Song, Rishabh Dev Yadav, Cheng Guo, Wei Pan
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
Interpreting visual observations and natural language instructions for
complex task execution remains a key challenge in robotics and AI. Despite
recent advances, language-driven navigation is still difficult, particularly
for UAVs in small-scale 3D environments. Existing Vision-Language Navigation
(VLN) approaches are mostly designed for ground robots and struggle to
generalize to aerial tasks that require full 3D spatial reasoning. The
emergence of large Vision-Language Models (VLMs), such as GPT and Claude,
enables zero-shot semantic reasoning from visual and textual inputs. However,
these models lack spatial grounding and are not directly applicable to
navigation. To address these limitations, SoraNav is introduced, an adaptive
UAV navigation framework that integrates zero-shot VLM reasoning with
geometry-aware decision-making. Geometric priors are incorporated into image
annotations to constrain the VLM action space and improve decision quality. A
hybrid switching strategy leverages navigation history to alternate between VLM
reasoning and geometry-based exploration, mitigating dead-ends and redundant
revisits. A PX4-based hardware-software platform, comprising both a digital
twin and a physical micro-UAV, enables reproducible evaluation. Experimental
results show that in 2.5D scenarios, our method improves Success Rate (SR) by
25.7% and Success weighted by Path Length (SPL) by 17%. In 3D scenarios, it
improves SR by 29.5% and SPL by 18.5% relative to the baseline.
๐
2025-10-29
๐ NeurIPS 2025
๐ฅ Yuxiang Yan, Zhiyuan Zhou, Xin Gao, Guanghao Li, Shenglin Li et al.
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
Manipulation in cluttered environments is challenging due to spatial
dependencies among objects, where an improper manipulation order can cause
collisions or blocked access. Existing approaches often overlook these spatial
relationships, limiting their flexibility and scalability. To address these
limitations, we propose OrderMind, a unified spatial-aware manipulation
ordering framework that directly learns object manipulation priorities based on
spatial context. Our architecture integrates a spatial context encoder with a
temporal priority structuring module. We construct a spatial graph using
k-Nearest Neighbors to aggregate geometric information from the local layout
and encode both object-object and object-manipulator interactions to support
accurate manipulation ordering in real-time. To generate physically and
semantically plausible supervision signals, we introduce a spatial prior
labeling method that guides a vision-language model to produce reasonable
manipulation orders for distillation. We evaluate OrderMind on our Manipulation
Ordering Benchmark, comprising 163,222 samples of varying difficulty. Extensive
experiments in both simulation and real-world environments demonstrate that our
method significantly outperforms prior approaches in effectiveness and
efficiency, enabling robust manipulation in cluttered scenes.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Jiahong Chen, Jing Wang, Long Chen, Chuwei Cai, Jinghui Lu
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
Vision-language-action (VLA) models have significantly advanced robotic
manipulation by integrating vision-language models (VLMs), and action decoders
into a unified architecture. However, their deployment on resource-constrained
edge devices, such as mobile robots or embedded systems (e.g., Jetson Orin
Nano), remains challenging due to high computational demands, especially in
real-world scenarios where power, latency, and computational resources are
critical. To close this gap, we introduce Nano-scale Vision-Language Action
(NanoVLA), a family of lightweight VLA architectures that achieve high
performance with minimal resources. Our core innovations include: (1)
vision-language decoupling that moves conventional early vision and language
inputs fusion in VLM to late stage, achieving better performance while enabling
caching and reduce inference overhead and latency; (2) long-short action
chunking to ensure smooth, coherent multi-step planning without sacrificing
real-time responsiveness; (3) dynamic routing that adaptively assigns
lightweight or heavy backbones based on task complexity, further optimizing
inference efficiency. Experimental results on several benchmarks, as well as
real-world deployments, demonstrate that NanoVLA achieves up to 52x faster
inference on edge devices compared to previous state-of-the-art VLA models,
with 98% less parameters while maintaining or surpassing their task accuracy
and generalization. Ablation studies confirm that our decoupling strategy
preserves cross-task transferability, and the routing module enhances
cost-performance trade-offs, enabling practical, high-precision robotic
manipulation on resource-constrained hardware.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Guibin Sun, Jinhu Lรผ, Kexin Liu, Zhenqian Wang, Guanrong Chen
Robotics
ๆฅ็ๆ่ฆ
Swarms evolving from collective behaviors among multiple individuals are
commonly seen in nature, which enables biological systems to exhibit more
efficient and robust collaboration. Creating similar swarm intelligence in
engineered robots poses challenges to the design of collaborative algorithms
that can be programmed at large scales. The assignment-based method has played
an eminent role for a very long time in solving collaboration problems of robot
swarms. However, it faces fundamental limitations in terms of efficiency and
robustness due to its unscalability to swarm variants. This article presents a
tutorial review on recent advances in assignment-free collaboration of robot
swarms, focusing on the problem of shape formation. A key theoretical component
is the recently developed \emph{mean-shift exploration} strategy, which
improves the collaboration efficiency of large-scale swarms by dozens of times.
Further, the efficiency improvement is more significant as the swarm scale
increases. Finally, this article discusses three important applications of the
mean-shift exploration strategy, including precise shape formation, area
coverage formation, and maneuvering formation, as well as their corresponding
industrial scenarios in smart warehousing, area exploration, and cargo
transportation.
๐
2025-10-29
๐ ArXiv Preprint (Computer Vision)
๐ฅ Sai Likhith Karri, Ansh Saxena
Computer VisionMachine Learning
ๆฅ็ๆ่ฆ
This study examines the effectiveness of spatio-temporal modeling and the
integration of spatial attention mechanisms in deep learning models for
underwater object detection. Specifically, in the first phase, the performance
of temporal-enhanced YOLOv5 variant T-YOLOv5 is evaluated, in comparison with
the standard YOLOv5. For the second phase, an augmented version of T-YOLOv5 is
developed, through the addition of a Convolutional Block Attention Module
(CBAM). By examining the effectiveness of the already pre-existing YOLOv5 and
T-YOLOv5 models and of the newly developed T-YOLOv5 with CBAM. With CBAM, the
research highlights how temporal modeling improves detection accuracy in
dynamic marine environments, particularly under conditions of sudden movements,
partial occlusions, and gradual motion. The testing results showed that YOLOv5
achieved a mAP@50-95 of 0.563, while T-YOLOv5 and T-YOLOv5 with CBAM
outperformed with mAP@50-95 scores of 0.813 and 0.811, respectively,
highlighting their superior accuracy and generalization in detecting complex
objects. The findings demonstrate that T-YOLOv5 significantly enhances
detection reliability compared to the standard model, while T-YOLOv5 with CBAM
further improves performance in challenging scenarios, although there is a loss
of accuracy when it comes to simpler scenarios.
๐
2025-10-29
๐ The International Journal of Advanced Manufacturin
๐ฅ Sourabh Karmakar, Cameron J. Turner
ๆฅ็ๆ่ฆ
Accurate calibration of a Stewart platform is important for their precise and
efficient operation. However, the calibration of these platforms using forward
kinematics is a challenge for researchers because forward kinematics normally
generates multiple feasible and unfeasible solutions for any pose of the moving
platform. The complex kinematic relations among the six actuator paths
connecting the fixed base to the moving platform further compound the
difficulty in establishing a straightforward and efficient calibration method.
The authors developed a new forward kinematics-based calibration method using
Denavit-Hartenberg convention and used the Stewart platform Tiger 66.1
developed in their lab for experimenting with the photogrammetry-based
calibration strategies described in this paper. This system became operational
upon completion of construction, marking its inaugural use. The authors used
their calibration model for estimating the errors in the system and adopted
three compensation options or strategies as per Least Square method to improve
the accuracy of the system. These strategies leveraged a high-resolution
digital camera and off-the-shelf software to capture the poses of the moving
platform's center. This process is non-invasive and does not need any
additional equipment to be attached to the hexapod or any alteration of the
hexapod hardware. This photogrammetry-based calibration process involves
multiple high-resolution images from different angles to measure the position
and orientation of the platform center in the three-dimensional space. The
Target poses and Actual poses are then compared, and the error compensations
are estimated using the Least-Squared methods to calculate the Predicted poses.
Results from each of the three compensation approaches demonstrated noticeable
enhancements in platform pose accuracies, suggesting room for further
improvements.
๐
2025-10-29
๐ ArXiv Preprint (Robotics)
๐ฅ Hayato Idei, Tamon Miyake, Tetsuya Ogata, Yuichi Yamashita
RoboticsMachine LearningMultimodal
ๆฅ็ๆ่ฆ
The rapid aging of societies is intensifying demand for autonomous care
robots; however, most existing systems are task-specific and rely on
handcrafted preprocessing, limiting their ability to generalize across diverse
scenarios. A prevailing theory in cognitive neuroscience proposes that the
human brain operates through hierarchical predictive processing, which
underlies flexible cognition and behavior by integrating multimodal sensory
signals. Inspired by this principle, we introduce a hierarchical multimodal
recurrent neural network grounded in predictive processing under the
free-energy principle, capable of directly integrating over 30,000-dimensional
visuo-proprioceptive inputs without dimensionality reduction. The model was
able to learn two representative caregiving tasks, rigid-body repositioning and
flexible-towel wiping, without task-specific feature engineering. We
demonstrate three key properties: (i) self-organization of hierarchical latent
dynamics that regulate task transitions, capture variability in uncertainty,
and infer occluded states; (ii) robustness to degraded vision through
visuo-proprioceptive integration; and (iii) asymmetric interference in
multitask learning, where the more variable wiping task had little influence on
repositioning, whereas learning the repositioning task led to a modest
reduction in wiping performance, while the model maintained overall robustness.
Although the evaluation was limited to simulation, these results establish
predictive processing as a universal and scalable computational principle,
pointing toward robust, flexible, and autonomous caregiving robots while
offering theoretical insight into the human brain's ability to achieve flexible
adaptation in uncertain real-world environments.
๐
2025-10-28
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We present Tongyi DeepResearch, an agentic large language model, which is
specifically designed for long-horizon, deep information-seeking research
tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is
developed through an end-to-end training framework that combines agentic
mid-training and agentic post-training, enabling scalable reasoning and
information seeking across complex tasks. We design a highly scalable data
synthesis pipeline that is fully automatic, without relying on costly human
annotation, and empowers all training stages. By constructing customized
environments for each stage, our system enables stable and consistent
interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total
parameters, with only 3.3 billion activated per token, achieves
state-of-the-art performance across a range of agentic deep research
benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH,
WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We
open-source the model, framework, and complete solutions to empower the
community.
๐
2025-10-28
๐ To appear in IEEE BigData 2025
๐ฅ Zirui Cheng, Jikai Sun, Anjun Gao, Yueyang Quan, Zhuqing Liu et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large language models (LLMs) have transformed natural language processing
(NLP), enabling applications from content generation to decision support.
Retrieval-Augmented Generation (RAG) improves LLMs by incorporating external
knowledge but also introduces security risks, particularly from data poisoning,
where the attacker injects poisoned texts into the knowledge database to
manipulate system outputs. While various defenses have been proposed, they
often struggle against advanced attacks. To address this, we introduce RAGuard,
a detection framework designed to identify poisoned texts. RAGuard first
expands the retrieval scope to increase the proportion of clean texts, reducing
the likelihood of retrieving poisoned content. It then applies chunk-wise
perplexity filtering to detect abnormal variations and text similarity
filtering to flag highly similar texts. This non-parametric approach enhances
RAG security, and experiments on large-scale datasets demonstrate its
effectiveness in detecting and mitigating poisoning attacks, including strong
adaptive attacks.
๐
2025-10-28
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders et al.
Multimodal
ๆฅ็ๆ่ฆ
We introduce MiRAGE, an evaluation framework for retrieval-augmented
generation (RAG) from multimodal sources. As audiovisual media becomes a
prevalent source of information online, it is essential for RAG systems to
integrate information from these sources into generation. However, existing
evaluations for RAG are text-centric, limiting their applicability to
multimodal, reasoning intensive settings because they don't verify information
against sources. MiRAGE is a claim-centric approach to multimodal RAG
evaluation, consisting of InfoF1, evaluating factuality and information
coverage, and CiteF1, measuring citation support and completeness. We show that
MiRAGE, when applied by humans, strongly aligns with extrinsic quality
judgments. We additionally introduce automatic variants of MiRAGE and three
prominent TextRAG metrics -- ACLE, ARGUE, and RAGAS -- demonstrating the
limitations of text-centric work and laying the groundwork for automatic
evaluation. We release open-source implementations and outline how to assess
multimodal RAG.
๐
2025-10-28
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang et al.
Natural Language ProcessingRobotics
ๆฅ็ๆ่ฆ
We present Tongyi DeepResearch, an agentic large language model, which is
specifically designed for long-horizon, deep information-seeking research
tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is
developed through an end-to-end training framework that combines agentic
mid-training and agentic post-training, enabling scalable reasoning and
information seeking across complex tasks. We design a highly scalable data
synthesis pipeline that is fully automatic, without relying on costly human
annotation, and empowers all training stages. By constructing customized
environments for each stage, our system enables stable and consistent
interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total
parameters, with only 3.3 billion activated per token, achieves
state-of-the-art performance across a range of agentic deep research
benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH,
WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We
open-source the model, framework, and complete solutions to empower the
community.
๐
2025-10-28
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Jiawei Zhou, Lei Chen
Machine Learning
ๆฅ็ๆ่ฆ
As retrieval-augmented generation (RAG) becomes increasingly widespread, the
role of information retrieval (IR) is shifting from retrieving information for
human users to retrieving contextual knowledge for artificial intelligence (AI)
systems, where relevance becomes difficult to define or annotate beforehand. To
address this challenge, we propose R3, a Retrieval framework optimized for RAG
through trialand-feedback Reinforced contrastive learning. Unlike prior
approaches that rely on annotated or synthetic data for supervised fine-tuning,
R3 enables the retriever to dynamically explore and optimize relevance within
the RAG environment. During training, the retrieved results interact with the
environment to produce contrastive signals that automatically guide the
retriever's self-improvement. Extensive experiments across diverse tasks
demonstrate that R3 improves RAG performance by 5.2% over the original
retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving
comparable results to LLM-augmented retrieval and RAG systems built on
post-trained or instruction-tuned LLMs. It is both efficient and practical,
requiring only 4 GPUs and completing training within a single day.
๐
2025-10-28
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Durga Prasad Maram, Dhruvin Gandhi, Zonghai Yao, Gayathri Akkinapalli, Franck Dernoncourt et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Personalized text generation requires models not only to produce coherent
text but also to align with a target user's style, tone, and topical focus.
Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich
profiles with user and neighbor histories, but they stop at generation and
often yield outputs that drift in tone, topic, or style. We present PerFine, a
unified, training-free critique-refine framework that enhances personalization
through iterative, profile-grounded feedback. In each iteration, an LLM
generator produces a draft conditioned on the retrieved profile, and a critic
LLM - also conditioned on the same profile - provides structured feedback on
tone, vocabulary, sentence structure, and topicality. The generator then
revises, while a novel knockout strategy retains the stronger draft across
iterations. We further study additional inference-time strategies such as
Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp,
Goodreads, and Amazon datasets, PerFine consistently improves personalization
over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5
refinement iterations, and scalability with increasing critic size. These
results highlight that post-hoc, profile-aware feedback offers a powerful
paradigm for personalized LLM generation that is both training-free and
model-agnostic.
๐
2025-10-28
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xiaoyu Kong, Leheng Sheng, Junfei Tan, Yuxin Chen, Jiancan Wu et al.
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
The recent success of large language models (LLMs) has renewed interest in
whether recommender systems can achieve similar scaling benefits. Conventional
recommenders, dominated by massive embedding tables, tend to plateau as
embedding dimensions grow. In contrast, the emerging generative paradigm
replaces embeddings with compact Semantic ID (SID) sequences produced by
autoregressive Transformers. Yet most industrial deployments remain
proprietary, leaving two fundamental questions open: (1) Do the expected
scaling laws hold on public benchmarks? (2) What is the minimal post-training
recipe that enables competitive performance?
We present MiniOneRec, to the best of our knowledge, the first fully
open-source generative recommendation framework, which provides an end-to-end
workflow spanning SID construction, supervised fine-tuning, and
recommendation-oriented reinforcement learning. We generate SIDs via a Residual
Quantized VAE and post-train Qwen backbones ranging from 0.5B to 7B parameters
on the Amazon Review dataset. Our experiments reveal a consistent downward
trend in both training and evaluation losses with increasing model size,
validating the parameter efficiency of the generative approach. To further
enhance performance, we propose a lightweight yet effective post-training
pipeline that (1) enforces full-process SID alignment and (2) applies
reinforcement learning with constrained decoding and hybrid rewards. Together,
these techniques yield significant improvements in both ranking accuracy and
candidate diversity.
๐
2025-10-28
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yejin Kim, Shaghayegh Agah, Mayur Nankani, Neeraj Sharma, Feifei Peng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Most recommender systems treat timestamps as numeric or cyclical values,
overlooking real-world context such as holidays, events, and seasonal patterns.
We propose a scalable framework that uses large language models (LLMs) to
generate geo-temporal embeddings from only a timestamp and coarse location,
capturing holidays, seasonal trends, and local/global events. We then introduce
a geo-temporal embedding informativeness test as a lightweight diagnostic,
demonstrating on MovieLens, LastFM, and a production dataset that these
embeddings provide predictive signal consistent with the outcomes of full model
integrations. Geo-temporal embeddings are incorporated into sequential models
through (1) direct feature fusion with metadata embeddings or (2) an auxiliary
loss that enforces semantic and geo-temporal alignment. Our findings highlight
the need for adaptive or hybrid recommendation strategies, and we release a
context-enriched MovieLens dataset to support future research.
๐
2025-10-28
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Michail Dadopoulos, Anestis Ladas, Stratos Moschidis, Ioannis Negkakis
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Retrieval-Augmented Generation (RAG) struggles on long, structured financial
filings where relevant evidence is sparse and cross-referenced. This paper
presents a systematic investigation of advanced metadata-driven
Retrieval-Augmented Generation (RAG) techniques, proposing and evaluating a
novel, multi-stage RAG architecture that leverages LLM-generated metadata. We
introduce a sophisticated indexing pipeline to create contextually rich
document chunks and benchmark a spectrum of enhancements, including
pre-retrieval filtering, post-retrieval reranking, and enriched embeddings,
benchmarked on the FinanceBench dataset. Our results reveal that while a
powerful reranker is essential for precision, the most significant performance
gains come from embedding chunk metadata directly with text ("contextual
chunks"). Our proposed optimal architecture combines LLM-driven pre-retrieval
optimizations with these contextual embeddings to achieve superior performance.
Additionally, we present a custom metadata reranker that offers a compelling,
cost-effective alternative to commercial solutions, highlighting a practical
trade-off between peak performance and operational efficiency. This study
provides a blueprint for building robust, metadata-aware RAG systems for
financial document analysis.
๐
2025-10-28
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yutian Xiao, Meng Yuan, Fuzhen Zhuang, Wei Chen, Shukuan Wang et al.
ๆฅ็ๆ่ฆ
The pre-ranking stage plays a pivotal role in large-scale recommender systems
but faces an intrinsic trade-off between model expressiveness and computational
efficiency. Owing to the massive candidate pool and strict latency constraints,
industry systems often rely on lightweight two-tower architectures, which are
computationally efficient yet limited in estimation capability. As a result,
they struggle to capture the complex synergistic and suppressive relationships
among candidate items, which are essential for producing contextually coherent
and diverse recommendation lists. Moreover, this simplicity further amplifies
the Sample Selection Bias (SSB) problem, as coarse-grained models trained on
biased exposure data must generalize to a much larger candidate space with
distinct distributions.
To address these issues, we propose \textbf{DUET} (\textbf{DU}al Model
Co-Training for \textbf{E}ntire Space C\textbf{T}R Prediction), a set-wise
pre-ranking framework that achieves expressive modeling under tight
computational budgets. Instead of scoring items independently, DUET performs
set-level prediction over the entire candidate subset in a single forward pass,
enabling information-aware interactions among candidates while amortizing the
computational cost across the set. Moreover, a dual model co-training mechanism
extends supervision to unexposed items via mutual pseudo-label refinement,
effectively mitigating SSB. Validated through extensive offline experiments and
online A/B testing, DUET consistently outperforms state-of-the-art baselines
and achieves improvements across multiple core business metrics. At present,
DUET has been fully deployed in Kuaishou and Kuaishou Lite Apps, serving the
main traffic for hundreds of millions of users.
๐
2025-10-28
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Maruf Ahmed Mridul, Oshani Seneviratne
Natural Language Processing
ๆฅ็ๆ่ฆ
The transformation of unstructured legal contracts into standardized,
machine-readable formats is essential for automating financial workflows. The
Common Domain Model (CDM) provides a standardized framework for this purpose,
but converting complex legal documents like Credit Support Annexes (CSAs) into
CDM representations remains a significant challenge. In this paper, we present
an extension of the CDMizer framework, a template-driven solution that ensures
syntactic correctness and adherence to the CDM schema during contract-to-CDM
conversion. We apply this extended framework to a real-world task, comparing
its performance with a benchmark developed by the International Swaps and
Derivatives Association (ISDA) for CSA clause extraction. Our results show that
CDMizer, when integrated with a significantly smaller, open-source Large
Language Model (LLM), achieves competitive performance in terms of accuracy and
efficiency against larger, proprietary models. This work underscores the
potential of resource-efficient solutions to automate legal contract
transformation, offering a cost-effective and scalable approach that can meet
the needs of financial institutions with constrained resources or strict data
privacy requirements.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Matsive Ali, Blake Gassen, Sen Liu
Robotics
ๆฅ็ๆ่ฆ
This paper presents an integrated robotic fused deposition modeling additive
manufacturing system featuring closed-loop thermal control and intelligent
in-situ defect correction using a 6-degree of freedom robotic arm and an Oak-D
camera. The robot arm end effector was modified to mount an E3D hotend
thermally regulated by an IoT microcontroller, enabling precise temperature
control through real-time feedback. Filament extrusion system was synchronized
with robotic motion, coordinated via ROS2, ensuring consistent deposition along
complex trajectories. A vision system based on OpenCV detects layer-wise
defects position, commanding autonomous re-extrusion at identified sites.
Experimental validation demonstrated successful defect mitigation in printing
operations. The integrated system effectively addresses challenges real-time
quality assurance. Inverse kinematics were used for motion planning, while
homography transformations corrected camera perspectives for accurate defect
localization. The intelligent system successfully mitigated surface anomalies
without interrupting the print process. By combining real-time thermal
regulation, motion control, and intelligent defect detection & correction, this
architecture establishes a scalable and adaptive robotic additive manufacturing
framework suitable for aerospace, biomedical, and industrial applications.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Iancu Andrei, Marius Kloetzer, Cristian Mahulea, Catalin Dosoftei
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
In this paper, we propose a computationally efficient quadratic programming
(QP) approach for generating smooth, $C^1$ continuous paths for mobile robots
using piece-wise quadratic Bezier (PWB) curves. Our method explicitly
incorporates safety margins within a structured optimization framework,
balancing trajectory smoothness and robustness with manageable numerical
complexity suitable for real-time and embedded applications. Comparative
simulations demonstrate clear advantages over traditional piece-wise linear
(PWL) path planning methods, showing reduced trajectory deviations, enhanced
robustness, and improved overall path quality. These benefits are validated
through simulations using a Pure-Pursuit controller in representative
scenarios, highlighting the practical effectiveness and scalability of our
approach for safe navigation.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Anil Yildiz, Sarah M. Thornton, Carl Hildebrandt, Sreeja Roy-Singh, Mykel J. Kochenderfer
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
Assessing scenario coverage is crucial for evaluating the robustness of
autonomous agents, yet existing methods rely on expensive human annotations or
computationally intensive Large Vision-Language Models (LVLMs). These
approaches are impractical for large-scale deployment due to cost and
efficiency constraints. To address these shortcomings, we propose SCOUT
(Scenario Coverage Oversight and Understanding Tool), a lightweight surrogate
model designed to predict scenario coverage labels directly from an agent's
latent sensor representations. SCOUT is trained through a distillation process,
learning to approximate LVLM-generated coverage labels while eliminating the
need for continuous LVLM inference or human annotation. By leveraging
precomputed perception features, SCOUT avoids redundant computations and
enables fast, scalable scenario coverage estimation. We evaluate our method
across a large dataset of real-life autonomous navigation scenarios,
demonstrating that it maintains high accuracy while significantly reducing
computational cost. Our results show that SCOUT provides an effective and
practical alternative for large-scale coverage analysis. While its performance
depends on the quality of LVLM-generated training labels, SCOUT represents a
major step toward efficient scenario coverage oversight in autonomous systems.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Jun Wang, Ziyang Zhou, Ardalan Kahak, Suyi Li
Robotics
ๆฅ็ๆ่ฆ
Softening and onboarding computers and controllers is one of the final
frontiers in soft robotics towards their robustness and intelligence for
everyday use. In this regard, embodying soft and physical computing presents
exciting potential. Physical computing seeks to encode inputs into a mechanical
computing kernel and leverage the internal interactions among this kernel's
constituent elements to compute the output. Moreover, such input-to-output
evolution can be re-programmable. This perspective paper proposes a framework
for embodying physical computing into soft robots and discusses three unique
strategies in the literature: analog oscillators, physical reservoir computing,
and physical algorithmic computing. These embodied computers enable the soft
robot to perform complex behaviors that would otherwise require CMOS-based
electronics -- including coordinated locomotion with obstacle avoidance,
payload weight and orientation classification, and programmable operation based
on logical rules. This paper will detail the working principles of these
embodied physical computing methods, survey the current state-of-the-art, and
present a perspective for future development.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Caleb Escobedo, Nataliya Nechyporenko, Shreyas Kadekodi, Alessandro Roncone
Robotics
ๆฅ็ๆ่ฆ
Real-time control is an essential aspect of safe robot operation in the real
world with dynamic objects. We present a framework for the analysis of
object-aware controllers, methods for altering a robot's motion to anticipate
and avoid possible collisions. This framework is focused on three design
considerations: kinematics, motion profiles, and virtual constraints.
Additionally, the analysis in this work relies on verification of robot
behaviors using fundamental robot-obstacle experimental scenarios. To showcase
the effectiveness of our method we compare three representative object-aware
controllers. The comparison uses metrics originating from the design
considerations. From the analysis, we find that the design of object-aware
controllers often lacks kinematic considerations, continuity of control points,
and stability in movement profiles. We conclude that this framework can be used
in the future to design, compare, and benchmark obstacle avoidance methods.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Zishuo Wang, Joel Loo, David Hsu
Robotics
ๆฅ็ๆ่ฆ
While imitation learning (IL) enables effective visual navigation, IL
policies are prone to unpredictable failures in out-of-distribution (OOD)
scenarios. We advance the notion of failure-resilient policies, which not only
detect failures but also recover from them automatically. Failure recognition
that identifies the factors causing failure is key to informing recovery: e.g.
pinpointing image regions triggering failure detections can provide cues to
guide recovery. We present Fare, a framework to construct failure-resilient IL
policies, embedding OOD-detection and recognition in them without using
explicit failure data, and pairing them with recovery heuristics. Real-world
experiments show that Fare enables failure recovery across two different policy
architectures, enabling robust long-range navigation in complex environments.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Jiaxuan Zhang, Yuquan Leng, Yixuan Guo, Chenglong Fu
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
For amputees with powered transfemoral prosthetics, navigating obstacles or
complex terrain remains challenging. This study addresses this issue by using
an inertial sensor on the sound ankle to guide obstacle-crossing movements. A
genetic algorithm computes the optimal neural network structure to predict the
required angles of the thigh and knee joints. A gait progression prediction
algorithm determines the actuation angle index for the prosthetic knee motor,
ultimately defining the necessary thigh and knee angles and gait progression.
Results show that when the standard deviation of Gaussian noise added to the
thigh angle data is less than 1, the method can effectively eliminate noise
interference, achieving 100\% accuracy in gait phase estimation under 150 Hz,
with thigh angle prediction error being 8.71\% and knee angle prediction error
being 6.78\%. These findings demonstrate the method's ability to accurately
predict gait progression and joint angles, offering significant practical value
for obstacle negotiation in powered transfemoral prosthetics.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Li Li, Tobias Brinkmann, Till Temmen, Markus Eisenbarth, Jakob Andert
Natural Language Processing
ๆฅ็ๆ่ฆ
With the increasing integration of intelligent driving functions into
serial-produced vehicles, ensuring their functionality and robustness poses
greater challenges. Compared to traditional road testing, scenario-based
virtual testing offers significant advantages in terms of time and cost
efficiency, reproducibility, and exploration of edge cases. We propose a
Transformer-enhanced Conditional Variational Autoencoder (CVAE-T) model for
generating multi-agent traffic scenarios in roundabouts, which are
characterized by high vehicle dynamics and complex layouts, yet remain
relatively underexplored in current research. The results show that the
proposed model can accurately reconstruct original scenarios and generate
realistic, diverse synthetic scenarios. Besides, two Key-Performance-Indicators
(KPIs) are employed to evaluate the interactive behavior in the generated
scenarios. Analysis of the latent space reveals partial disentanglement, with
several latent dimensions exhibiting distinct and interpretable effects on
scenario attributes such as vehicle entry timing, exit timing, and velocity
profiles. The results demonstrate the model's capability to generate scenarios
for the validation of intelligent driving functions involving multi-agent
interactions, as well as to augment data for their development and iterative
improvement.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Nicolai Steinke, Daniel Goehring
ๆฅ็ๆ่ฆ
In this letter, we introduce GroundLoc, a LiDAR-only localization pipeline
designed to localize a mobile robot in large-scale outdoor environments using
prior maps. GroundLoc employs a Bird's-Eye View (BEV) image projection focusing
on the perceived ground area and utilizes the place recognition network R2D2,
or alternatively, the non-learning approach Scale-Invariant Feature Transform
(SIFT), to identify and select keypoints for BEV image map registration. Our
results demonstrate that GroundLoc outperforms state-of-the-art methods on the
SemanticKITTI and HeLiPR datasets across various sensors. In the multi-session
localization evaluation, GroundLoc reaches an Average Trajectory Error (ATE)
well below 50 cm on all Ouster OS2 128 sequences while meeting online runtime
requirements. The system supports various sensor models, as evidenced by
evaluations conducted with Velodyne HDL-64E, Ouster OS2 128, Aeva Aeries II,
and Livox Avia sensors. The prior maps are stored as 2D raster image maps,
which can be created from a single drive and require only 4 MB of storage per
square kilometer. The source code is available at
https://github.com/dcmlr/groundloc.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Jรธrgen Anker Olsen, Lars Rรธnhaug Pettersen, Kostas Alexis
Machine Learning
ๆฅ็ๆ่ฆ
This paper presents a curriculum-based reinforcement learning framework for
training precise and high-performance jumping policies for the robot `Olympus'.
Separate policies are developed for vertical and horizontal jumps, leveraging a
simple yet effective strategy. First, we densify the inherently sparse jumping
reward using the laws of projectile motion. Next, a reference state
initialization scheme is employed to accelerate the exploration of dynamic
jumping behaviors without reliance on reference trajectories. We also present a
walking policy that, when combined with the jumping policies, unlocks versatile
and dynamic locomotion capabilities. Comprehensive testing validates walking on
varied terrain surfaces and jumping performance that exceeds previous works,
effectively crossing the Sim2Real gap. Experimental validation demonstrates
horizontal jumps up to 1.25 m with centimeter accuracy and vertical jumps up to
1.0 m. Additionally, we show that with only minor modifications, the proposed
method can be used to learn omnidirectional jumping.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Hongxu Zhao, Guangyang Zeng, Yunling Shao, Tengfei Zhang, Junfeng Wu
ๆฅ็ๆ่ฆ
The calibration of extrinsic parameters and clock offsets between sensors for
high-accuracy performance in underwater SLAM systems remains insufficiently
explored. Existing methods for Doppler Velocity Log (DVL) calibration are
either constrained to specific sensor configurations or rely on oversimplified
assumptions, and none jointly estimate translational extrinsics and time
offsets. We propose a Unified Iterative Calibration (UIC) framework for general
DVL sensor setups, formulated as a Maximum A Posteriori (MAP) estimation with a
Gaussian Process (GP) motion prior for high-fidelity motion interpolation. UIC
alternates between efficient GP-based motion state updates and gradient-based
calibration variable updates, supported by a provably statistically consistent
sequential initialization scheme. The proposed UIC can be applied to IMU,
cameras and other modalities as co-sensors. We release an open-source
DVL-camera calibration toolbox. Beyond underwater applications, several aspects
of UIC-such as the integration of GP priors for MAP-based calibration and the
design of provably reliable initialization procedures-are broadly applicable to
other multi-sensor calibration problems. Finally, simulations and real-world
tests validate our approach.
๐
2025-10-28
๐ ICRA 2026
๐ฅ Vignesh Kottayam Viswanathan, Yifan Bai, Scott Fredriksson, Sumeet Satpute, Christoforos Kanellakis et al.
ๆฅ็ๆ่ฆ
In this work, we present a hierarchical framework designed to support robotic
inspection under environment uncertainty. By leveraging a known environment
model, existing methods plan and safely track inspection routes to visit points
of interest. However, discrepancies between the model and actual site
conditions, caused by either natural or human activities, can alter the surface
morphology or introduce path obstructions. To address this challenge, the
proposed framework divides the inspection task into: (a) generating the initial
global view-plan for region of interests based on a historical map and (b)
local view replanning to adapt to the current morphology of the inspection
scene. The proposed hierarchy preserves global coverage objectives while
enabling reactive adaptation to the local surface morphology. This enables the
local autonomy to remain robust against environment uncertainty and complete
the inspection tasks. We validate the approach through deployments in
real-world subterranean mines using quadrupedal robot.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Yuan Shen, Yuze Hong, Guangyang Zeng, Tengfei Zhang, Pui Yi Chui et al.
Computer Vision
ๆฅ็ๆ่ฆ
Accurate visual inertial simultaneous localization and mapping (VI SLAM) for
underwater robots remains a significant challenge due to frequent visual
degeneracy and insufficient inertial measurement unit (IMU) motion excitation.
In this paper, we present GeVI-SLAM, a gravity-enhanced stereo VI SLAM system
designed to address these issues. By leveraging the stereo camera's direct
depth estimation ability, we eliminate the need to estimate scale during IMU
initialization, enabling stable operation even under low acceleration dynamics.
With precise gravity initialization, we decouple the pitch and roll from the
pose estimation and solve a 4 degrees of freedom (DOF) Perspective-n-Point
(PnP) problem for pose tracking. This allows the use of a minimal 3-point
solver, which significantly reduces computational time to reject outliers
within a Random Sample Consensus framework. We further propose a
bias-eliminated 4-DOF PnP estimator with provable consistency, ensuring the
relative pose converges to the true value as the feature number increases. To
handle dynamic motion, we refine the full 6-DOF pose while jointly estimating
the IMU covariance, enabling adaptive weighting of the gravity prior. Extensive
experiments on simulated and real-world data demonstrate that GeVI-SLAM
achieves higher accuracy and greater stability compared to state-of-the-art
methods.
๐
2025-10-28
๐ Submitted to IEEE Robotics and Automation Letters
๐ฅ Malintha Fernando, Petter รgren, Silun Zhang
Robotics
ๆฅ็ๆ่ฆ
The Team Orienteering Problem (TOP) generalizes many real-world multi-robot
scheduling and routing tasks that occur in autonomous mobility, aerial
logistics, and surveillance applications. While many flavors of the TOP exist
for planning in multi-robot systems, they assume that all the robots cooperate
toward a single objective; thus, they do not extend to settings where the
robots compete in reward-scarce environments. We propose Stochastic
Prize-Collecting Games (SPCG) as an extension of the TOP to plan in the
presence of self-interested robots operating on a graph, under energy
constraints and stochastic transitions. A theoretical study on complete and
star graphs establishes that there is a unique pure Nash equilibrium in SPCGs
that coincides with the optimal routing solution of an equivalent TOP given a
rank-based conflict resolution rule. This work proposes two algorithms: Ordinal
Rank Search (ORS) to obtain the ''ordinal rank'' --one's effective rank in
temporarily-formed local neighborhoods during the games' stages, and Fictitious
Ordinal Response Learning (FORL) to obtain best-response policies against one's
senior-rank opponents. Empirical evaluations conducted on road networks and
synthetic graphs under both dynamic and stationary prize distributions show
that 1) the state-aliasing induced by OR-conditioning enables learning policies
that scale more efficiently to large team sizes than those trained with the
global index, and 2) Policies trained with FORL generalize better to imbalanced
prize distributions than those with other multi-agent training methods.
Finally, the learned policies in the SPCG achieved between 87% and 95%
optimality compared to an equivalent TOP solution obtained by mixed-integer
linear programming.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Haoying Li, Yifan Peng, Junfeng Wu
Machine Learning
ๆฅ็ๆ่ฆ
Reliable state estimation hinges on accurate specification of sensor noise
covariances, which weigh heterogeneous measurements. In practice, these
covariances are difficult to identify due to environmental variability,
front-end preprocessing, and other reasons. We address this by formulating
noise covariance estimation as a bilevel optimization that, from a Bayesian
perspective, factorizes the joint likelihood of so-called odometry and
supervisory measurements, thereby balancing information utilization with
computational efficiency. The factorization converts the nested Bayesian
dependency into a chain structure, enabling efficient parallel computation: at
the lower level, an invariant extended Kalman filter with state augmentation
estimates trajectories, while a derivative filter computes analytical gradients
in parallel for upper-level gradient updates. The upper level refines the
covariance to guide the lower-level estimation. Experiments on synthetic and
real-world datasets show that our method achieves higher efficiency over
existing baselines.
๐
2025-10-28
๐ ArXiv Preprint (Machine Learning)
๐ฅ Klemens Iten, Lenart Treven, Bhavya Sukhija, Florian Dรถrfler, Andreas Krause
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Reinforcement learning algorithms are typically designed for discrete-time
dynamics, even though the underlying real-world control systems are often
continuous in time. In this paper, we study the problem of continuous-time
reinforcement learning, where the unknown system dynamics are represented using
nonlinear ordinary differential equations (ODEs). We leverage probabilistic
models, such as Gaussian processes and Bayesian neural networks, to learn an
uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily
maximizes a weighted sum of the extrinsic reward and model epistemic
uncertainty. This yields a scalable and sample-efficient approach to
continuous-time model-based RL. We show that COMBRL achieves sublinear regret
in the reward-driven setting, and in the unsupervised RL setting (i.e., without
extrinsic rewards), we provide a sample complexity bound. In our experiments,
we evaluate COMBRL in both standard and unsupervised RL settings and
demonstrate that it scales better, is more sample-efficient than prior methods,
and outperforms baselines across several deep RL tasks.
๐
2025-10-28
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Korneel Van den Berghe, Stein Stroobants, Vijay Janapa Reddi, G. C. H. E. de Croon
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Neuromorphic computing systems are set to revolutionize energy-constrained
robotics by achieving orders-of-magnitude efficiency gains, while enabling
native temporal processing. Spiking Neural Networks (SNNs) represent a
promising algorithmic approach for these systems, yet their application to
complex control tasks faces two critical challenges: (1) the non-differentiable
nature of spiking neurons necessitates surrogate gradients with unclear
optimization properties, and (2) the stateful dynamics of SNNs require training
on sequences, which in reinforcement learning (RL) is hindered by limited
sequence lengths during early training, preventing the network from bridging
its warm-up period.
We address these challenges by systematically analyzing surrogate gradient
slope settings, showing that shallower slopes increase gradient magnitude in
deeper layers but reduce alignment with true gradients. In supervised learning,
we find no clear preference for fixed or scheduled slopes. The effect is much
more pronounced in RL settings, where shallower slopes or scheduled slopes lead
to a 2.1x improvement in both training and final deployed performance. Next, we
propose a novel training approach that leverages a privileged guiding policy to
bootstrap the learning process, while still exploiting online environment
interactions with the spiking policy. Combining our method with an adaptive
slope schedule for a real-world drone position control task, we achieve an
average return of 400 points, substantially outperforming prior techniques,
including Behavioral Cloning and TD3BC, which achieve at most --200 points
under the same conditions. This work advances both the theoretical
understanding of surrogate gradient learning in SNNs and practical training
methodologies for neuromorphic controllers demonstrated in real-world robotic
systems.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Jorge Vicente-Martinez, Edgar Ramirez-Laboreo
ๆฅ็ๆ่ฆ
This paper presents an optimal trajectory generation method for 3D overhead
cranes by leveraging differential flatness. This framework enables the direct
inclusion of complex physical and dynamic constraints, such as nonlinear
friction and collision avoidance for both payload and rope. Our approach allows
for aggressive movements by constraining payload swing only at the final point.
A comparative simulation study validates our approach, demonstrating that
neglecting dry friction leads to actuator saturation and collisions. The
results show that friction modeling is a fundamental requirement for fast and
safe crane trajectories.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Eunju Kwon, Seungwon Oh, In-Chang Baek, Yucheon Park, Gyungbo Kim et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
Contact-rich manipulation has become increasingly important in robot
learning. However, previous studies on robot learning datasets have focused on
rigid objects and underrepresented the diversity of pressure conditions for
real-world manipulation. To address this gap, we present a humanoid
visual-tactile-action dataset designed for manipulating deformable soft
objects. The dataset was collected via teleoperation using a humanoid robot
equipped with dexterous hands, capturing multi-modal interactions under varying
pressure conditions. This work also motivates future research on models with
advanced optimization strategies capable of effectively leveraging the
complexity and diversity of tactile signals.
๐
2025-10-28
๐ This work has been submitted to the IEEE for possi
๐ฅ Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen
Machine Learning
ๆฅ็ๆ่ฆ
This paper proposes a visual multi-object tracking method that jointly
employs stochastic and deterministic mechanisms to ensure identifier
consistency for unknown and time-varying target numbers under nonlinear
dynamics. A stochastic particle filter addresses nonlinear dynamics and
non-Gaussian noise, with support from particle swarm optimization (PSO) to
guide particles toward state distribution modes and mitigate divergence through
proposed fitness measures incorporating motion consistency, appearance
similarity, and social-interaction cues with neighboring targets. Deterministic
association further enforces identifier consistency via a proposed cost matrix
incorporating spatial consistency between particles and current detections,
detection confidences, and track penalties. Subsequently, a novel scheme is
proposed for the smooth updating of target states while preserving their
identities, particularly for weak tracks during interactions with other targets
and prolonged occlusions. Moreover, velocity regression over past states
provides trend-seed velocities, enhancing particle sampling and state updates.
The proposed tracker is designed to operate flexibly for both pre-recorded
videos and camera live streams, where future frames are unavailable.
Experimental results confirm superior performance compared to state-of-the-art
trackers. The source-code reference implementations of both the proposed method
and compared-trackers are provided on GitHub:
https://github.com/SDU-VelKoTek/GenTrack2
๐
2025-10-28
๐ This work has been submitted to the IEEE for possi
๐ฅ Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen
Machine Learning
ๆฅ็ๆ่ฆ
This paper introduces a novel multi-object tracking (MOT) method, dubbed
GenTrack, whose main contributions include: a hybrid tracking approach
employing both stochastic and deterministic manners to robustly handle unknown
and time-varying numbers of targets, particularly in maintaining target
identity (ID) consistency and managing nonlinear dynamics, leveraging particle
swarm optimization (PSO) with some proposed fitness measures to guide
stochastic particles toward their target distribution modes, enabling effective
tracking even with weak and noisy object detectors, integration of social
interactions among targets to enhance PSO-guided particles as well as improve
continuous updates of both strong (matched) and weak (unmatched) tracks,
thereby reducing ID switches and track loss, especially during occlusions, a
GenTrack-based redefined visual MOT baseline incorporating a comprehensive
state and observation model based on space consistency, appearance, detection
confidence, track penalties, and social scores for systematic and efficient
target updates, and the first-ever publicly available source-code reference
implementation with minimal dependencies, featuring three variants, including
GenTrack Basic, PSO, and PSO-Social, facilitating flexible reimplementation.
Experimental results have shown that GenTrack provides superior performance on
standard benchmarks and real-world scenarios compared to state-of-the-art
trackers, with integrated implementations of baselines for fair comparison.
Potential directions for future work are also discussed. The source-code
reference implementations of both the proposed method and compared-trackers are
provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Mingyu Jeong, Eunsung Kim, Sehun Park, Andrew Jaeyong Choi
Robotics
ๆฅ็ๆ่ฆ
We present NVSim, a framework that automatically constructs large-scale,
navigable indoor simulators from only common image sequences, overcoming the
cost and scalability limitations of traditional 3D scanning. Our approach
adapts 3D Gaussian Splatting to address visual artifacts on sparsely observed
floors a common issue in robotic traversal data. We introduce Floor-Aware
Gaussian Splatting to ensure a clean, navigable ground plane, and a novel
mesh-free traversability checking algorithm that constructs a topological graph
by directly analyzing rendered views. We demonstrate our system's ability to
generate valid, large-scale navigation graphs from real-world data. A video
demonstration is avilable at https://youtu.be/tTiIQt6nXC8
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Baozhe Zhang, Xinwei Chen, Qingcheng Chen, Chao Xu, Fei Gao et al.
Robotics
ๆฅ็ๆ่ฆ
CoNi-MPC provides an efficient framework for UAV control in air-ground
cooperative tasks by relying exclusively on relative states, eliminating the
need for global state estimation. However, its lack of environmental
information poses significant challenges for obstacle avoidance. To address
this issue, we propose a novel obstacle avoidance algorithm, Cooperative
Non-inertial frame-based Obstacle Avoidance (CoNi-OA), designed explicitly for
UAV-UGV cooperative scenarios without reliance on global state estimation or
obstacle prediction. CoNi-OA uniquely utilizes a single frame of raw LiDAR data
from the UAV to generate a modulation matrix, which directly adjusts the
quadrotor's velocity to achieve obstacle avoidance. This modulation-based
method enables real-time generation of collision-free trajectories within the
UGV's non-inertial frame, significantly reducing computational demands (less
than 5 ms per iteration) while maintaining safety in dynamic and unpredictable
environments. The key contributions of this work include: (1) a
modulation-based obstacle avoidance algorithm specifically tailored for UAV-UGV
cooperation in non-inertial frames without global states; (2) rapid, real-time
trajectory generation based solely on single-frame LiDAR data, removing the
need for obstacle modeling or prediction; and (3) adaptability to both static
and dynamic environments, thus extending applicability to featureless or
unknown scenarios.
๐
2025-10-28
๐ NeurIPS 2025
๐ฅ Jingyi Tian, Le Wang, Sanping Zhou, Sen Wang, Jiayi Li et al.
Robotics
ๆฅ็ๆ่ฆ
Learning generalizable robotic manipulation policies remains a key challenge
due to the scarcity of diverse real-world training data. While recent
approaches have attempted to mitigate this through self-supervised
representation learning, most either rely on 2D vision pretraining paradigms
such as masked image modeling, which primarily focus on static semantics or
scene geometry, or utilize large-scale video prediction models that emphasize
2D dynamics, thus failing to jointly learn the geometry, semantics, and
dynamics required for effective manipulation. In this paper, we present
DynaRend, a representation learning framework that learns 3D-aware and
dynamics-informed triplane features via masked reconstruction and future
prediction using differentiable volumetric rendering. By pretraining on
multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future
dynamics, and task semantics in a unified triplane representation. The learned
representations can be effectively transferred to downstream robotic
manipulation tasks via action value map prediction. We evaluate DynaRend on two
challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic
experiments, demonstrating substantial improvements in policy success rate,
generalization to environmental perturbations, and real-world applicability
across diverse manipulation tasks.
๐
2025-10-28
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Ziqi Ma, Sao Mai Nguyen, Philippe Xu
Natural Language ProcessingMachine Learning
ๆฅ็ๆ่ฆ
Emergent symbolic representations are critical for enabling developmental
learning agents to plan and generalize across tasks. In this work, we
investigate whether large language models (LLMs) can translate human natural
language instructions into the internal symbolic representations that emerge
during hierarchical reinforcement learning. We apply a structured evaluation
framework to measure the translation performance of commonly seen LLMs -- GPT,
Claude, Deepseek and Grok -- across different internal symbolic partitions
generated by a hierarchical reinforcement learning algorithm in the Ant Maze
and Ant Fall environments. Our findings reveal that although LLMs demonstrate
some ability to translate natural language into a symbolic representation of
the environment dynamics, their performance is highly sensitive to partition
granularity and task complexity. The results expose limitations in current LLMs
capacity for representation alignment, highlighting the need for further
research on robust alignment between language and internal agent
representations.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Ziqi Ma, Changda Tian, Yue Gao
Robotics
ๆฅ็ๆ่ฆ
In recent years, there has been growing interest in developing robots and
autonomous systems that can interact with human in a more natural and intuitive
way. One of the key challenges in achieving this goal is to enable these
systems to manipulate objects and tools in a manner that is similar to that of
humans. In this paper, we propose a novel approach for learning human-style
manipulation skills by using adversarial motion priors, which we name HMAMP.
The approach leverages adversarial networks to model the complex dynamics of
tool and object manipulation, as well as the aim of the manipulation task. The
discriminator is trained using a combination of real-world data and simulation
data executed by the agent, which is designed to train a policy that generates
realistic motion trajectories that match the statistical properties of human
motion. We evaluated HMAMP on one challenging manipulation task: hammering, and
the results indicate that HMAMP is capable of learning human-style manipulation
skills that outperform current baseline methods. Additionally, we demonstrate
that HMAMP has potential for real-world applications by performing real robot
arm hammering tasks. In general, HMAMP represents a significant step towards
developing robots and autonomous systems that can interact with humans in a
more natural and intuitive way, by learning to manipulate tools and objects in
a manner similar to how humans do.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Ev Zisselman, Mirco Mutti, Shelly Francis-Meretzki, Elisei Shafer, Aviv Tamar
Robotics
ๆฅ็ๆ่ฆ
Behavioral cloning is a simple yet effective technique for learning
sequential decision-making from demonstrations. Recently, it has gained
prominence as the core of foundation models for the physical world, where
achieving generalization requires countless demonstrations of a multitude of
tasks. Typically, a human expert with full information on the task demonstrates
a (nearly) optimal behavior. In this paper, we propose to hide some of the
task's information from the demonstrator. This ``blindfolded'' expert is
compelled to employ non-trivial exploration to solve the task. We show that
cloning the blindfolded expert generalizes better to unseen tasks than its
fully-informed counterpart. We conduct experiments of real-world robot peg
insertion tasks with (limited) human demonstrations, alongside videogames from
the Procgen benchmark. Additionally, we support our findings with theoretical
analysis, which confirms that the generalization error scales with
$\sqrt{I/m}$, where $I$ measures the amount of task information available to
the demonstrator, and $m$ is the number of demonstrated tasks. Both theory and
practice indicate that cloning blindfolded experts generalizes better with
fewer demonstrated tasks. Project page with videos and code:
https://sites.google.com/view/blindfoldedexperts/home
๐
2025-10-28
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Wentao Tan, Bowen Wang, Heng Zhi, Chenyu Liu, Zhe Li et al.
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
Multimodal large language models (MLLMs) have advanced vision-language
reasoning and are increasingly deployed in embodied agents. However,
significant limitations remain: MLLMs generalize poorly across digital-physical
spaces and embodiments; vision-language-action models (VLAs) produce low-level
actions yet lack robust high-level embodied reasoning; and most embodied large
language models (ELLMs) are constrained to digital-space with poor
generalization to the physical world. Thus, unified models that operate
seamlessly across digital and physical spaces while generalizing across
embodiments and tasks remain absent. We introduce the \textbf{Boundless Large
Model (BLM$_1$)}, a multimodal spatial foundation model that preserves
instruction following and reasoning, incorporates embodied knowledge, and
supports robust cross-embodiment control. BLM$_1$ integrates three key
capabilities -- \textit{cross-space transfer, cross-task learning, and
cross-embodiment generalization} -- via a two-stage training paradigm. Stage I
injects embodied knowledge into the MLLM through curated digital corpora while
maintaining language competence. Stage II trains a policy module through an
intent-bridging interface that extracts high-level semantics from the MLLM to
guide control, without fine-tuning the MLLM backbone. This process is supported
by a self-collected cross-embodiment demonstration suite spanning four robot
embodiments and six progressively challenging tasks. Evaluations across digital
and physical benchmarks show that a single BLM$_1$ instance outperforms four
model families -- MLLMs, ELLMs, VLAs, and GMLMs -- achieving
$\sim\!\textbf{6%}$ gains in digital tasks and $\sim\!\textbf{3%}$ in physical
tasks.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Haotian Zhou, Xiaole Wang, He Li, Fusheng Sun, Shengyu Guo et al.
Robotics
ๆฅ็ๆ่ฆ
Navigating to a designated goal using visual information is a fundamental
capability for intelligent robots. Most classical visual navigation methods are
restricted to single-goal, single-modality, and closed set goal settings. To
address the practical demands of multi-modal, open-vocabulary goal queries and
multi-goal visual navigation, we propose LagMemo, a navigation system that
leverages a language 3D Gaussian Splatting memory. During exploration, LagMemo
constructs a unified 3D language memory. With incoming task goals, the system
queries the memory, predicts candidate goal locations, and integrates a local
perception-based verification mechanism to dynamically match and validate goals
during navigation. For fair and rigorous evaluation, we curate GOAT-Core, a
high-quality core split distilled from GOAT-Bench tailored to multi-modal
open-vocabulary multi-goal visual navigation. Experimental results show that
LagMemo's memory module enables effective multi-modal open-vocabulary goal
localization, and that LagMemo outperforms state-of-the-art methods in
multi-goal visual navigation. Project page:
https://weekgoodday.github.io/lagmemo
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Wenbin Ding, Jun Chen, Mingjia Chen, Fei Xie, Qi Mao et al.
Natural Language ProcessingMultimodalRobotics
ๆฅ็ๆ่ฆ
The rapid advancement of Large Language Models (LLMs) has marked a
significant breakthrough in Artificial Intelligence (AI), ushering in a new era
of Human-centered Artificial Intelligence (HAI). HAI aims to better serve human
welfare and needs, thereby placing higher demands on the intelligence level of
robots, particularly in aspects such as natural language interaction, complex
task planning, and execution. Intelligent agents powered by LLMs have opened up
new pathways for realizing HAI. However, existing LLM-based embodied agents
often lack the ability to plan and execute complex natural language control
tasks online. This paper explores the implementation of intelligent robotic
manipulating agents based on Vision-Language Models (VLMs) in the physical
world. We propose a novel embodied agent framework for robots, which comprises
a human-robot voice interaction module, a vision-language agent module and an
action execution module. The vision-language agent itself includes a
vision-based task planner, a natural language instruction converter, and a task
performance feedback evaluator. Experimental results demonstrate that our agent
achieves a 28\% higher average task success rate in both simulated and real
environments compared to approaches relying solely on LLM+CLIP, significantly
improving the execution success rate of high-level natural language instruction
tasks.
๐
2025-10-28
๐ ArXiv Preprint (Robotics)
๐ฅ Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Jingde Chen et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
End-to-end autonomous driving maps raw sensor inputs directly into
ego-vehicle trajectories to avoid cascading errors from perception modules and
to leverage rich semantic cues. Existing frameworks largely rely on Imitation
Learning (IL), which can be limited by sub-optimal expert demonstrations and
covariate shift during deployment. On the other hand, Reinforcement Learning
(RL) has recently shown potential in scaling up with simulations, but is
typically confined to low-dimensional symbolic inputs (e.g. 3D objects and
maps), falling short of full end-to-end learning from raw sensor data. We
introduce ZTRS (Zero-Imitation End-to-End Autonomous Driving with Trajectory
Scoring), a framework that combines the strengths of both worlds: sensor inputs
without losing information and RL training for robust planning. To the best of
our knowledge, ZTRS is the first framework that eliminates IL entirely by only
learning from rewards while operating directly on high-dimensional sensor data.
ZTRS utilizes offline reinforcement learning with our proposed Exhaustive
Policy Optimization (EPO), a variant of policy gradient tailored for enumerable
actions and rewards. ZTRS demonstrates strong performance across three
benchmarks: Navtest (generic real-world open-loop planning), Navhard (open-loop
planning in challenging real-world and synthetic scenarios), and HUGSIM
(simulated closed-loop driving). Specifically, ZTRS achieves the
state-of-the-art result on Navhard and outperforms IL-based baselines on
HUGSIM. Code will be available at https://github.com/woxihuanjiangguo/ZTRS.
๐
2025-10-28
๐ NeurIPS 2025
๐ฅ Vedant Gupta, Haotian Fu, Calvin Luo, Yiding Jiang, George Konidaris
ๆฅ็ๆ่ฆ
We present DEPS, an end-to-end algorithm for discovering parameterized skills
from expert demonstrations. Our method learns parameterized skill policies
jointly with a meta-policy that selects the appropriate discrete skill and
continuous parameters at each timestep. Using a combination of temporal
variational inference and information-theoretic regularization methods, we
address the challenge of degeneracy common in latent variable models, ensuring
that the learned skills are temporally extended, semantically meaningful, and
adaptable. We empirically show that learning parameterized skills from
multitask expert demonstrations significantly improves generalization to unseen
tasks. Our method outperforms multitask as well as skill learning baselines on
both LIBERO and MetaWorld benchmarks. We also demonstrate that DEPS discovers
interpretable parameterized skills, such as an object grasping skill whose
continuous arguments define the grasp location.
๐
2025-10-27
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yi-Li Hsu, Katelyn X. Mei, Lucy Lu Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Medical multi-document summarization (MDS) is a complex task that requires
effectively managing cross-document relationships. This paper investigates
whether incorporating hierarchical structures in the inputs of MDS can improve
a model's ability to organize and contextualize information across documents
compared to traditional flat summarization methods. We investigate two ways of
incorporating hierarchical organization across three large language models
(LLMs), and conduct comprehensive evaluations of the resulting summaries using
automated metrics, model-based metrics, and domain expert evaluation of
preference, understandability, clarity, complexity, relevance, coverage,
factuality, and coherence. Our results show that human experts prefer
model-generated summaries over human-written summaries. Hierarchical approaches
generally preserve factuality, coverage, and coherence of information, while
also increasing human preference for summaries. Additionally, we examine
whether simulated judgments from GPT-4 align with human judgments, finding
higher agreement along more objective evaluation facets. Our findings
demonstrate that hierarchical structures can improve the clarity of medical
summaries generated by models while maintaining content coverage, providing a
practical way to improve human preference for generated summaries.
๐
2025-10-27
๐ EMNLP 2025
๐ฅ Tingyu Song, Yilun Zhao, Siyue Zhang, Chen Zhao, Arman Cohan
ๆฅ็ๆ่ฆ
Existing approaches typically rely on large-scale fine-tuning to adapt LLMs
for information reranking tasks, which is computationally expensive. In this
work, we demonstrate that modern LLMs can be effectively adapted using only
minimal, high-quality supervision. To enable this, we design
LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating
diverse, challenging, and realistic reranking examples. Using this synthetic
data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two
challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and
FollowIR for instruction-following retrieval. Our experiments demonstrate that
LIMRANK achieves competitive performance, while being trained on less than 5%
of the data typically used in prior work. Further ablation studies demonstrate
the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization
capabilities of LIMRANK across downstream tasks, including scientific
literature search and retrieval-augmented generation for knowledge-intensive
problem solving.
๐
2025-10-27
๐ ArXiv Preprint (Computer Vision)
๐ฅ Hongyi Wang, Zhengjie Zhu, Jiabo Ma, Fang Wang, Yue Shi et al.
Machine LearningMultimodal
ๆฅ็ๆ่ฆ
The rapid digitization of histopathology slides has opened up new
possibilities for computational tools in clinical and research workflows. Among
these, content-based slide retrieval stands out, enabling pathologists to
identify morphologically and semantically similar cases, thereby supporting
precise diagnoses, enhancing consistency across observers, and assisting
example-based education. However, effective retrieval of whole slide images
(WSIs) remains challenging due to their gigapixel scale and the difficulty of
capturing subtle semantic differences amid abundant irrelevant content. To
overcome these challenges, we present PathSearch, a retrieval framework that
unifies fine-grained attentive mosaic representations with global-wise slide
embeddings aligned through vision-language contrastive learning. Trained on a
corpus of 6,926 slide-report pairs, PathSearch captures both fine-grained
morphological cues and high-level semantic patterns to enable accurate and
flexible retrieval. The framework supports two key functionalities: (1)
mosaic-based image-to-image retrieval, ensuring accurate and efficient slide
research; and (2) multi-modal retrieval, where text queries can directly
retrieve relevant slides. PathSearch was rigorously evaluated on four public
pathology datasets and three in-house cohorts, covering tasks including
anatomical site retrieval, tumor subtyping, tumor vs. non-tumor discrimination,
and grading across diverse organs such as breast, lung, kidney, liver, and
stomach. External results show that PathSearch outperforms traditional
image-to-image retrieval frameworks. A multi-center reader study further
demonstrates that PathSearch improves diagnostic accuracy, boosts confidence,
and enhances inter-observer agreement among pathologists in real clinical
scenarios. These results establish PathSearch as a scalable and generalizable
retrieval solution for digital pathology.
๐
2025-10-27
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Yi-Li Hsu, Katelyn X. Mei, Lucy Lu Wang
Natural Language Processing
ๆฅ็ๆ่ฆ
Medical multi-document summarization (MDS) is a complex task that requires
effectively managing cross-document relationships. This paper investigates
whether incorporating hierarchical structures in the inputs of MDS can improve
a model's ability to organize and contextualize information across documents
compared to traditional flat summarization methods. We investigate two ways of
incorporating hierarchical organization across three large language models
(LLMs), and conduct comprehensive evaluations of the resulting summaries using
automated metrics, model-based metrics, and domain expert evaluation of
preference, understandability, clarity, complexity, relevance, coverage,
factuality, and coherence. Our results show that human experts prefer
model-generated summaries over human-written summaries. Hierarchical approaches
generally preserve factuality, coverage, and coherence of information, while
also increasing human preference for summaries. Additionally, we examine
whether simulated judgments from GPT-4 align with human judgments, finding
higher agreement along more objective evaluation facets. Our findings
demonstrate that hierarchical structures can improve the clarity of medical
summaries generated by models while maintaining content coverage, providing a
practical way to improve human preference for generated summaries.
๐
2025-10-27
๐ NeurIPS 2025
๐ฅ Xiaoyu Kong, Junguang Jiang, Bin Liu, Ziru Xu, Han Zhu et al.
RoboticsMachine Learning
ๆฅ็ๆ่ฆ
The core task of recommender systems is to learn user preferences from
historical user-item interactions. With the rapid development of large language
models (LLMs), recent research has explored leveraging the reasoning
capabilities of LLMs to enhance rating prediction tasks. However, existing
distillation-based methods suffer from limitations such as the teacher model's
insufficient recommendation capability, costly and static supervision, and
superficial transfer of reasoning ability. To address these issues, this paper
proposes RecZero, a reinforcement learning (RL)-based recommendation paradigm
that abandons the traditional multi-model and multi-stage distillation
approach. Instead, RecZero trains a single LLM through pure RL to autonomously
develop reasoning capabilities for rating prediction. RecZero consists of two
key components: (1) "Think-before-Recommendation" prompt construction, which
employs a structured reasoning template to guide the model in step-wise
analysis of user interests, item features, and user-item compatibility; and (2)
rule-based reward modeling, which adopts group relative policy optimization
(GRPO) to compute rewards for reasoning trajectories and optimize the LLM.
Additionally, the paper explores a hybrid paradigm, RecOne, which combines
supervised fine-tuning with RL, initializing the model with cold-start
reasoning samples and further optimizing it with RL. Experimental results
demonstrate that RecZero and RecOne significantly outperform existing baseline
methods on multiple benchmark datasets, validating the superiority of the RL
paradigm in achieving autonomous reasoning-enhanced recommender systems.
๐
2025-10-27
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yichao Jin, Yushuo Wang, Qishuai Zhong, Kent Chiu Jin-Chun, Kenneth Zhu Ke et al.
Computer VisionMultimodalNatural Language Processing
ๆฅ็ๆ่ฆ
Financial documents are essential sources of information for regulators,
auditors, and financial institutions, particularly for assessing the wealth and
compliance of Small and Medium-sized Businesses. However, SMB documents are
often difficult to parse. They are rarely born digital and instead are
distributed as scanned images that are none machine readable. The scans
themselves are low in resolution, affected by skew or rotation, and often
contain noisy backgrounds. These documents also tend to be heterogeneous,
mixing narratives, tables, figures, and multilingual content within the same
report. Such characteristics pose major challenges for automated information
extraction, especially when relying on end to end large Vision Language Models,
which are computationally expensive, sensitive to noise, and slow when applied
to files with hundreds of pages.
We propose a multistage pipeline that leverages traditional image processing
models and OCR extraction, together with compact VLMs for structured field
extraction of large-scale financial documents. Our approach begins with image
pre-processing, including segmentation, orientation detection, and size
normalization. Multilingual OCR is then applied to recover page-level text.
Upon analyzing the text information, pages are retrieved for coherent sections.
Finally, compact VLMs are operated within these narrowed-down scopes to extract
structured financial indicators.
Our approach is evaluated using an internal corpus of multi-lingual, scanned
financial documents. The results demonstrate that compact VLMs, together with a
multistage pipeline, achieves 8.8 times higher field level accuracy relative to
directly feeding the whole document into large VLMs, only at 0.7 percent of the
GPU cost and 92.6 percent less end-to-end service latency.
๐
2025-10-27
๐ ArXiv Preprint (Information Retrieval)
๐ฅ JaeEun Lim, Soomin Kim, Jaeyong Seo, Iori Ono, Qimu Ran et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Multilingual e-commerce search is challenging due to linguistic diversity and
the noise inherent in user-generated queries. This paper documents the solution
employed by our team (EAR-MP) for the CIKM 2025 AnalytiCup, which addresses two
core tasks: Query-Category (QC) relevance and Query-Item (QI) relevance. Our
approach first normalizes the multilingual dataset by translating all text into
English, then mitigates noise through extensive data cleaning and
normalization. For model training, we build on DeBERTa-v3-large and improve
performance with label smoothing, self-distillation, and dropout. In addition,
we introduce task-specific upgrades, including hierarchical token injection for
QC and a hybrid scoring mechanism for QI. Under constrained compute, our method
achieves competitive results, attaining an F1 score of 0.8796 on QC and 0.8744
on QI. These findings underscore the importance of systematic data
preprocessing and tailored training strategies for building robust,
resource-efficient multilingual relevance systems.
๐
2025-10-27
๐ EMNLP 2025
๐ฅ Anwesan Pal, Karen Hovsepian, Tinghao Guo, Mengnan Zhao, Somendra Tripathi et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Recent investigations into effective context lengths of modern flagship large
language models (LLMs) have revealed major limitations in effective question
answering (QA) and reasoning over long and complex contexts for even the
largest and most impressive cadre of models. While approaches like
retrieval-augmented generation (RAG) and chunk-based re-ranking attempt to
mitigate this issue, they are sensitive to chunking, embedding and retrieval
strategies and models, and furthermore, rely on extensive pre-processing,
knowledge acquisition and indexing steps. In this paper, we propose
Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy
that boosts LLM performance in long-context scenarios, without degrading and
altering the integrity and composition of retrieved documents. We validate our
hypothesis by augmenting two challenging and directly relevant
question-answering benchmarks -- NoLima and NovelQA -- and show that tagging
the context or even just adding tag definitions into QA prompts leads to
consistent performance gains over the baseline -- up to 17% for 32K token
contexts, and 2.9% in complex reasoning question-answering for multi-hop
queries requiring knowledge across a wide span of text. Additional details are
available at https://sites.google.com/view/tag-emnlp.
๐
2025-10-27
๐ ICDE 2026
๐ฅ Zhuoxuan Li, Jieyuan Pei, Tangwei Ye, Zhongyuan Lai, Zihan Liu et al.
Machine Learning
ๆฅ็ๆ่ฆ
Next Point-of-Interest (POI) recommendation is a critical task in modern
Location-Based Social Networks (LBSNs), aiming to model the complex
decision-making process of human mobility to provide personalized
recommendations for a user's next check-in location. Existing POI
recommendation models, predominantly based on Graph Neural Networks and
sequential models, have been extensively studied. However, these models face a
fundamental limitation: they struggle to simultaneously capture the inherent
hierarchical structure of spatial choices and the dynamics and irregular shifts
of user-specific temporal contexts. To overcome this limitation, we propose
GTR-Mamba, a novel framework for cross-manifold conditioning and routing.
GTR-Mamba leverages the distinct advantages of different mathematical spaces
for different tasks: it models the static, tree-like preference hierarchies in
hyperbolic geometry, while routing the dynamic sequence updates to a novel
Mamba layer in the computationally stable and efficient Euclidean tangent
space. This process is coordinated by a cross-manifold channel that fuses
spatio-temporal information to explicitly steer the State Space Model (SSM),
enabling flexible adaptation to contextual changes. Extensive experiments on
three real-world datasets demonstrate that GTR-Mamba consistently outperforms
state-of-the-art baseline models in next POI recommendation.
๐
2025-10-27
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Shihao Cai, Chongming Gao, Haoyan Liu, Wentao Shi, Jianshan Sun et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
The powerful reasoning and generative capabilities of large language models
(LLMs) have inspired researchers to apply them to reasoning-based
recommendation tasks, which require in-depth reasoning about user interests and
the generation of recommended items. However, previous reasoning-based
recommendation methods have typically performed inference within the language
space alone, without incorporating the actual item space. This has led to
over-interpreting user interests and deviating from real items. Towards this
research gap, we propose performing multiple rounds of grounding during
inference to help the LLM better understand the actual item space, which could
ensure that its reasoning remains aligned with real items. Furthermore, we
introduce a user agent that provides feedback during each grounding step,
enabling the LLM to better recognize and adapt to user interests. Comprehensive
experiments conducted on three Amazon review datasets demonstrate the
effectiveness of incorporating multiple groundings and feedback. These findings
underscore the critical importance of reasoning within the actual item space,
rather than being confined to the language space, for recommendation tasks.
๐
2025-10-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qi Liu, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie et al.
ๆฅ็ๆ่ฆ
Text embedding models serve as a fundamental component in real-world search
applications. By mapping queries and documents into a shared embedding space,
they deliver competitive retrieval performance with high efficiency. However,
their ranking fidelity remains limited compared to dedicated rerankers,
especially recent LLM-based listwise rerankers, which capture fine-grained
query-document and document-document interactions. In this paper, we propose a
simple yet effective unified framework E2Rank, means Efficient Embedding-based
Ranking (also means Embedding-to-Rank), which extends a single text embedding
model to perform both high-quality retrieval and listwise reranking through
continued training under a listwise ranking objective, thereby achieving strong
effectiveness with remarkable efficiency. By applying cosine similarity between
the query and document embeddings as a unified ranking function, the listwise
ranking prompt, which is constructed from the original query and its candidate
documents, serves as an enhanced query enriched with signals from the top-K
documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval
models. This design preserves the efficiency and representational quality of
the base embedding model while significantly improving its reranking
performance. Empirically, E2Rank achieves state-of-the-art results on the BEIR
reranking benchmark and demonstrates competitive performance on the
reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also
show that the ranking training process improves embedding performance on the
MTEB benchmark. Our findings indicate that a single embedding model can
effectively unify retrieval and reranking, offering both computational
efficiency and competitive ranking accuracy.
๐
2025-10-26
๐ RecSys 2025
๐ฅ James Meese, Kyle Herbertson
ๆฅ็ๆ่ฆ
Research in news recommendation systems (NRS) continues to explore the best
ways to integrate normative goals such as editorial objectives and public
service values into existing systems. Prior efforts have incorporated expert
input or audience feedback to quantify these values, laying the groundwork for
more civic-minded recommender systems. This paper contributes to that
trajectory, introducing a method for embedding civic values into NRS through
large-scale, structured audience evaluations. The proposed civic ground truth
approach aims to generate value-based labels through a nationally
representative survey that are generalisable across a wider news corpus, using
automated metadata enrichment.
๐
2025-10-26
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Yiwen Tang, Qiuyu Zhao, Zenghui Sun, Jinsong Lan, Xiaoyong Zhu et al.
Machine Learning
ๆฅ็ๆ่ฆ
In Taobao e-commerce visual search, user behavior analysis reveals a large
proportion of no-click requests, suggesting diverse and implicit user intents.
These intents are expressed in various forms and are difficult to mine and
discover, thereby leading to the limited adaptability and lag in platform
strategies. This greatly restricts users' ability to express diverse intents
and hinders the scalability of the visual search system. This mismatch between
user implicit intent expression and system response defines the User-SearchSys
Intent Discrepancy. To alleviate the issue, we propose a novel framework
REVISION. This framework integrates offline reasoning mining with online
decision-making and execution, enabling adaptive strategies to solve implicit
user demands. In the offline stage, we construct a periodic pipeline to mine
discrepancies from historical no-click requests. Leveraging large models, we
analyze implicit intent factors and infer optimal suggestions by jointly
reasoning over query and product metadata. These inferred suggestions serve as
actionable insights for refining platform strategies. In the online stage,
REVISION-R1-3B, trained on the curated offline data, performs holistic analysis
over query images and associated historical products to generate optimization
plans and adaptively schedule strategies across the search pipeline. Our
framework offers a streamlined paradigm for integrating large models with
traditional search systems, enabling end-to-end intelligent optimization across
information aggregation and user interaction. Experimental results demonstrate
that our approach improves the efficiency of implicit intent mining from
large-scale search logs and significantly reduces the no-click rate.
๐
2025-10-26
๐ ArXiv Preprint (Computational Linguistics (NLP))
๐ฅ Qi Liu, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie et al.
ๆฅ็ๆ่ฆ
Text embedding models serve as a fundamental component in real-world search
applications. By mapping queries and documents into a shared embedding space,
they deliver competitive retrieval performance with high efficiency. However,
their ranking fidelity remains limited compared to dedicated rerankers,
especially recent LLM-based listwise rerankers, which capture fine-grained
query-document and document-document interactions. In this paper, we propose a
simple yet effective unified framework $\text{E}^2\text{Rank}$, means Efficient
Embedding-based Ranking (also means Embedding-to-Rank), which extends a single
text embedding model to perform both high-quality retrieval and listwise
reranking through continued training under a listwise ranking objective,
thereby achieving strong effectiveness with remarkable efficiency. By applying
cosine similarity between the query and document embeddings as a unified
ranking function, the listwise ranking prompt, which is constructed from the
original query and its candidate documents, serves as an enhanced query
enriched with signals from the top-K documents, akin to pseudo-relevance
feedback (PRF) in traditional retrieval models. This design preserves the
efficiency and representational quality of the base embedding model while
significantly improving its reranking performance. Empirically,
$\textrm{E}^2\text{Rank}$ achieves state-of-the-art results on the BEIR
reranking benchmark and demonstrates competitive performance on the
reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also
show that the ranking training process improves embedding performance on the
MTEB benchmark. Our findings indicate that a single embedding model can
effectively unify retrieval and reranking, offering both computational
efficiency and competitive ranking accuracy.
๐
2025-10-26
๐ NeurIPS 2025
๐ฅ Jiali Cheng, Anjishnu Kumar, Roshan Lal, Rishi Rajasekaran, Hani Ramezani et al.
Machine Learning
ๆฅ็ๆ่ฆ
We observe that current state-of-the-art web-agents are unable to effectively
adapt to new environments without neural network fine-tuning, without which
they produce inefficient execution plans due to a lack of awareness of the
structure and dynamics of the new environment. To address this limitation, we
introduce ATLAS (Actor-Critic Task-completion with Look-ahead Action
Simulation), a memory-augmented agent that is able to make plans grounded in a
model of the environment by simulating the consequences of those actions in
cognitive space. Our agent starts by building a "cognitive map" by performing a
lightweight curiosity driven exploration of the environment. The planner
proposes candidate actions; the simulator predicts their consequences in
cognitive space; a critic analyzes the options to select the best roll-out and
update the original plan; and a browser executor performs the chosen action. On
the WebArena-Lite Benchmark, we achieve a 63% success rate compared to 53.9%
success rate for the previously published state-of-the-art. Unlike previous
systems, our modular architecture requires no website-specific LLM fine-tuning.
Ablations show sizable drops without the world-model, hierarchical planner, and
look-ahead-based replanner confirming their complementary roles within the
design of our system
๐
2025-10-26
๐ NeurIPS 2025
๐ฅ Shu Zhao, Tianyi Shen, Nilesh Ahuja, Omesh Tickoo, Vijaykrishnan Narayanan
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising
method to generate factual and up-to-date responses of Multimodal Large
Language Models (MLLMs) by incorporating non-parametric knowledge from external
knowledge bases. However, existing MRAG approaches suffer from static retrieval
strategies, inflexible modality selection, and suboptimal utilization of
retrieved information, leading to three critical challenges: determining when
to retrieve, what modality to incorporate, and how to utilize retrieved
information effectively. To address these challenges, we introduce Windsock, a
query-dependent module making decisions on retrieval necessity and modality
selection, effectively reducing computational overhead and improving response
quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction
Tuning, an adaptive training strategy that enhances MLLMs' ability to utilize
retrieved information while maintaining robustness against noise. Moreover, we
adopt a self-assessment approach leveraging knowledge within MLLMs to convert
question-answering datasets to MRAG training datasets. Extensive experiments
demonstrate that our proposed method significantly improves the generation
quality by 17.07% while reducing 8.95% retrieval times.
๐
2025-10-26
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Rikiya Takehi, Fernando Diaz, Tetsuya Sakai
ๆฅ็ๆ่ฆ
Users tend to remember failures of a search session more than its many
successes. This observation has led to work on search robustness, where systems
are penalized if they perform very poorly on some queries. However, this
principle of robustness has been overlooked within a single query. An ambiguous
or underspecified query (e.g., ``jaguar'') can have several user intents, where
popular intents often dominate the ranking, leaving users with minority intents
unsatisfied. Although the diversification literature has long recognized this
issue, existing metrics only model the average relevance across intents and
provide no robustness guarantees. More surprisingly, we show theoretically and
empirically that many well-known diversification algorithms are no more robust
than a naive, non-diversified algorithm. To address this critical gap, we
propose to frame diversification as a risk-minimization problem. We introduce
VRisk, which measures the expected risk faced by the least-served fraction of
intents in a query. Optimizing VRisk produces a robust ranking, reducing the
likelihood of poor user experiences. We then propose VRisker, a fast greedy
re-ranker with provable approximation guarantees. Finally, experiments on NTCIR
INTENT-2, TREC Web 2012, and MovieLens show the vulnerability of existing
methods. VRisker reduces worst-case intent failures by up to 33% with a minimal
2% drop in average performance.
๐
2025-10-26
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Xuan Lu, Haohang Huang, Rui Meng, Yaohui Jin, Wenjun Zeng et al.
Natural Language Processing
ๆฅ็ๆ่ฆ
Large Language Models (LLMs) have recently demonstrated strong capabilities
in tool use, yet progress in tool retrieval remains hindered by incomplete and
heterogeneous tool documentation. To address this challenge, we introduce
Tool-DE, a new benchmark and framework that systematically enriches tool
documentation with structured fields to enable more effective tool retrieval,
together with two dedicated models, Tool-Embed and Tool-Rank. We design a
scalable document expansion pipeline that leverages both open- and
closed-source LLMs to generate, validate, and refine enriched tool profiles at
low cost, producing large-scale corpora with 50k instances for embedding-based
retrievers and 200k for rerankers. On top of this data, we develop two models
specifically tailored for tool retrieval: Tool-Embed, a dense retriever, and
Tool-Rank, an LLM-based reranker. Extensive experiments on ToolRet and Tool-DE
demonstrate that document expansion substantially improves retrieval
performance, with Tool-Embed and Tool-Rank achieving new state-of-the-art
results on both benchmarks. We further analyze the contribution of individual
fields to retrieval effectiveness, as well as the broader impact of document
expansion on both training and evaluation. Overall, our findings highlight both
the promise and limitations of LLM-driven document expansion, positioning
Tool-DE, along with the proposed Tool-Embed and Tool-Rank, as a foundation for
future research in tool retrieval.
๐
2025-10-26
๐ ArXiv Preprint (Artificial Intelligence)
๐ฅ Yassir Lairgi, Ludovic Moncla, Khalid Benabdeslem, Rรฉmy Cazabet, Pierre Clรฉau
ๆฅ็ๆ่ฆ
In today's rapidly expanding data landscape, knowledge extraction from
unstructured text is vital for real-time analytics, temporal inference, and
dynamic memory frameworks. However, traditional static knowledge graph (KG)
construction often overlooks the dynamic and time-sensitive nature of
real-world data, limiting adaptability to continuous changes. Moreover, recent
zero- or few-shot approaches that avoid domain-specific fine-tuning or reliance
on prebuilt ontologies often suffer from instability across multiple runs, as
well as incomplete coverage of key facts. To address these challenges, we
introduce ATOM (AdapTive and OptiMized), a few-shot and scalable approach that
builds and continuously updates Temporal Knowledge Graphs (TKGs) from
unstructured texts. ATOM splits input documents into minimal, self-contained
"atomic" facts, improving extraction exhaustivity and stability. Then, it
constructs atomic TKGs from these facts while employing a dual-time modeling
that distinguishes when information is observed from when it is valid. The
resulting atomic TKGs are subsequently merged in parallel. Empirical
evaluations demonstrate that ATOM achieves ~18% higher exhaustivity, ~17%
better stability, and over 90% latency reduction compared to baseline methods,
demonstrating a strong scalability potential for dynamic TKG construction.
๐
2025-10-21
๐ ArXiv Preprint (Information Retrieval)
๐ฅ Hamed Jelodar, Mohammad Meymani, Samita Bai, Roozbeh Razavi-Far, Ali A. Ghorbani
Natural Language ProcessingMultimodal
ๆฅ็ๆ่ฆ
This paper introduces SBAN (Source code, Binary, Assembly, and Natural
Language Description), a large-scale, multi-dimensional dataset designed to
advance the pre-training and evaluation of large language models (LLMs) for
software code analysis. SBAN comprises more than 3 million samples, including
2.9 million benign and 672,000 malware respectively, each represented across
four complementary layers: binary code, assembly instructions, natural language
descriptions, and source code. This unique multimodal structure enables
research on cross-representation learning, semantic understanding of software,
and automated malware detection. Beyond security applications, SBAN supports
broader tasks such as code translation, code explanation, and other software
mining tasks involving heterogeneous data. It is particularly suited for
scalable training of deep models, including transformers and other LLM
architectures. By bridging low-level machine representations and high-level
human semantics, SBAN provides a robust foundation for building intelligent
systems that reason about code. We believe that this dataset opens new
opportunities for mining software behavior, improving security analytics, and
enhancing LLM capabilities in pre-training and fine-tuning tasks for software
code mining.